Staff Software Engineer, Machine Learning Infrastructure

1 Month ago • 8-13 Years • Artificial Intelligence

Job Summary

Job Description

This Staff Software Engineer role within Google's Core ML organization centers on optimizing Google's machine learning resources. Key responsibilities include designing, implementing, and enhancing telemetry for monitoring TPU and GPU fleet efficiency. This involves identifying key performance indicators, creating dashboards, and collaborating with teams to improve resource utilization. The role also includes building reporting and analytical solutions, providing in-depth analysis, and leading junior engineers. The candidate will work on improving the efficiency of the ML fleet across all Google products, contributing to better resource allocation and operational efficiency.
Must have:
  • 8+ years software development experience
  • 5+ years ML algorithms & tools experience
  • 3+ years software design & architecture experience
  • Experience with TensorFlow
  • Strong data structures & algorithms knowledge
Good to have:
  • Kubernetes experience
  • GPU programming experience
  • Experience with LLMs
  • CPU/GPU architecture knowledge
  • Adaptability to new tools and frameworks

Job Details

Minimum qualifications:

  • Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
  • 8 years of experience with software development in one or more programming languages, and with data structures/algorithms.
  • 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
  • 5 years of experience with machine learning algorithms and tools (e.g., TensorFlow), artificial intelligence, deep learning or natural language processing.

Preferred qualifications:

  • Experience with Kubernetes, Google Kubernetes Engine, GPU Programming, TensorFlow, and Cloud.
  • Experience analyzing ML model performance, or working on LLM prompting, training, or development.
  • Experience with and knowledge of CPU/GPU architecture or hardware accelerators.
  • Ability to quickly adapt to new tools, frameworks, and languages.

About the job

Google Cloud's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google Cloud's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. You will anticipate our customer needs and be empowered to act like an owner, take action and innovate. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full-stack as we continue to push technology forward.

In this role, you will join a team that's part of Google's Core ML organization, focused on optimizing Google's Machine Learning resources. You will help develop monitoring tools and dashboards to track the performance and efficiency of TPUs and GPUs, which are used across all Google products. This data helps to improve resource allocation, identify areas for improvement, and drive efficiency gains across Google's products.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

Responsibilities

  • Design, implement and advance the telemetry capabilities needed for monitoring and evaluating the fleet-wide efficiency of ML resources (TPUs and GPUs). This includes identifying the right underlying signals, devising the right high-level metrics of interest, and creating common dashboards for highlighting fleet-wide performance and efficiency.
  • Identify opportunities to improve the efficiency of the ML fleet and build solutions and capabilities to improve ML fleet efficiency.
  • Build reporting and analytic solutions with key partners, and provide in-depth analysis of the metrics to improve the operation and utilization of ML resources.
  • Drive collaboration with various teams (across different PAs) as needed to accomplish the efficiency improvement goals.
  • Lead junior SWEs towards delivering project goals.
