Staff Software Engineer, Machine Learning Infrastructure

2 Days ago • 8-13 Years • Artificial Intelligence

Job Summary

Job Description

This Staff Software Engineer role within Google Cloud's Core ML organization focuses on optimizing Google's machine learning resources. Responsibilities include designing, implementing, and advancing telemetry capabilities for monitoring TPU and GPU fleet efficiency. This involves identifying key performance indicators, developing dashboards, and analyzing metrics to improve resource utilization and allocation across Google products. The role requires collaboration with various teams to achieve efficiency goals and mentoring junior software engineers. The ideal candidate will have extensive experience in software development, machine learning algorithms, and tools (e.g., TensorFlow), along with a strong understanding of software architecture and design.
Must have:
  • 8+ years software development experience
  • 5+ years ML algorithms & tools experience
  • 5+ years experience launching software products
  • Design, implement telemetry for ML resource monitoring
  • Identify and build solutions to improve ML fleet efficiency
  • Build reporting and analytics solutions
Good to have:
  • Experience with Kubernetes, Google Kubernetes Engine
  • GPU Programming, TensorFlow, and Cloud experience
  • Experience analyzing ML model performance or working on LLMs
  • Knowledge of CPU/GPU architecture or HW accelerators

Job Details

Minimum qualifications:

  • Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
  • 8 years of experience with software development in one or more programming languages, and with data structures/algorithms.
  • 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
  • 5 years of experience with machine learning algorithms and tools (e.g., TensorFlow), artificial intelligence, deep learning or natural language processing.

Preferred qualifications:

  • Experience with Kubernetes, Google Kubernetes Engine, GPU Programming, TensorFlow, and Cloud.
  • Experience analyzing ML model performance, or working on LLM prompting, training, or development.
  • Experience with and knowledge of CPU/GPU architecture or HW accelerators.
  • Ability to quickly adapt to new tools, frameworks, and languages.

About the job

Google Cloud's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google Cloud's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. You will anticipate our customers' needs and be empowered to act like an owner, take action and innovate. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full stack as we continue to push technology forward.

In this role, you will join a team that's part of Google's Core ML organization, focused on optimizing Google's Machine Learning resources. You will help develop monitoring tools and dashboards to track the performance and efficiency of TPUs and GPUs, which are used across all Google products. This data helps to improve resource allocation, identify areas for improvement, and drive efficiency gains across Google's products.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

Responsibilities

  • Design, implement and advance the telemetry capabilities needed for monitoring and evaluating the fleet-wide efficiency of ML resources (TPUs and GPUs). This includes identifying the right underlying signals, devising the right high-level metrics of interest, and creating common dashboards for highlighting fleet-wide performance and efficiency.
  • Identify opportunities to improve the efficiency of the ML fleet, and build the solutions and capabilities needed to realize them.
  • Build reporting and analytics solutions with key partners, and provide in-depth analysis of the metrics to improve the operation and utilization of ML resources.
  • Drive collaboration with various teams (across different product areas, or PAs) as needed to accomplish the efficiency improvement goals.
  • Lead junior SWEs towards delivering project goals.

About The Company

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.

Mountain View, California, United States (On-Site)

Bengaluru, Karnataka, India (On-Site)
