Staff Software Engineer, Google Compute Engine, Telemetry Insights

1 Month ago • 8-11 Years • Artificial Intelligence • $197,000 PA - $291,000 PA

Job Summary

Job Description

The Staff Software Engineer role focuses on building and owning the technical roadmap for Google Compute Engine (GCE) fleet observability and reliability. This involves applying AI/ML expertise to drive innovation and meet customer demands. Responsibilities include partnering with internal teams, defining business metrics, implementing processes and tools, establishing data best practices, and mentoring team members. The engineer will use data-driven insights and machine learning to deliver reliability for large-scale infrastructure. The role requires strong software development, testing, design, and architecture skills, along with significant experience in ML infrastructure and optimization.
Must have:
  • 8+ years software development experience
  • 5+ years testing and launching software
  • 5+ years ML design and infrastructure optimization
  • Expertise in AI/ML
  • Strong data analysis skills
Good to have:
  • Master's or PhD in a technical field
  • Technical leadership experience
  • Experience in a complex organization
  • Data management expertise
  • GPU reliability experience

Job Details


Minimum qualifications:

  • Bachelor's degree or equivalent practical experience.
  • 8 years of experience in software development, and with data structures/algorithms.
  • 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
  • 5 years of experience with one or more of the following: Speech/audio (e.g., technology duplicating and responding to the human voice), reinforcement learning (e.g., sequential decision making), Machine Learning (ML) infrastructure, or specialization in another ML field.
  • 5 years of experience leading ML design and optimizing ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine-tuning).

Preferred qualifications:

  • Master’s degree or PhD in Engineering, Computer Science, or a related technical field.
  • 3 years of experience in a technical leadership role leading project teams and setting technical direction.
  • 3 years of experience working in a complex, matrixed organization involving cross-functional or cross-business projects.
  • Experience in data management: data quality, data governance, data architecture, and data modeling.
  • Experience using data to identify opportunities, mitigate risks, and pursue the highest quality and reliability for GPUs.
  • Experience in delivering reliability of large-scale infrastructure using data-driven insights and machine learning.

About the job

Google's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. Our products need to handle information at massive scale, and extend well beyond web search. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google’s needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full-stack as we continue to push technology forward.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

The US base salary range for this full-time position is $197,000-$291,000 + bonus + equity + benefits. Our salary ranges are determined by role, level, and location. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific salary range for your preferred location during the hiring process.

Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits.

Responsibilities

  • Own and build the technical roadmap of Google Compute Engine (GCE) fleet observability and reliability based on analysis.
  • Act as a subject matter expert in AI/ML, driving innovation in GCE observability to meet the demands of the customer base.
  • Partner with internal customers, Site Reliability Engineers, product managers, and project managers to align priorities and manage staffing needs.
  • Define business metrics and Service Level Objectives, and implement processes and tools to maintain them. Establish and promote data best practices throughout GCE.
  • Coach, mentor, and support team members at all levels in their career development.

About The Company

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.
