Staff Software Engineer, Machine Learning Infrastructure

1 Day ago • 8-13 Years • Artificial Intelligence

Job Summary

Job Description

This Staff Software Engineer role within Google Cloud's Core ML organization focuses on optimizing Google's machine learning resources. Responsibilities include designing, implementing, and advancing telemetry capabilities for monitoring TPU and GPU fleet efficiency. This involves identifying key performance indicators, developing dashboards, and analyzing metrics to improve resource utilization and allocation across Google products. The role requires collaboration with various teams to achieve efficiency goals and mentoring junior software engineers. The ideal candidate will have extensive experience in software development, machine learning algorithms, and tools (e.g., TensorFlow), along with a strong understanding of software architecture and design.
Must have:
  • 8+ years software development experience
  • 5+ years ML algorithms & tools experience
  • 5+ years experience launching software products
  • Design, implement telemetry for ML resource monitoring
  • Identify and build solutions to improve ML fleet efficiency
  • Build reporting and analytics solutions
Good to have:
  • Experience with Kubernetes, Google Kubernetes Engine
  • GPU Programming, TensorFlow, and Cloud experience
  • Experience analyzing ML model performance or working on LLMs
  • Knowledge of CPU/GPU architecture or HW accelerators

Job Details

Minimum qualifications:

  • Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
  • 8 years of experience with software development in one or more programming languages, and with data structures/algorithms.
  • 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
  • 5 years of experience with machine learning algorithms and tools (e.g., TensorFlow), artificial intelligence, deep learning or natural language processing.

Preferred qualifications:

  • Experience with Kubernetes, Google Kubernetes Engine, GPU Programming, TensorFlow, and Cloud.
  • Experience analyzing ML model performance, or working on LLM prompting, training, or development.
  • Experience with and knowledge of CPU/GPU architecture or HW accelerators.
  • Ability to quickly adapt to new tools, frameworks, and languages.

About the job

Google Cloud's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google Cloud's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. You will anticipate our customers' needs and be empowered to act like an owner, take action and innovate. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full stack as we continue to push technology forward.

In this role, you will join a team that's part of Google's Core ML organization, focused on optimizing Google's Machine Learning resources. You will help develop monitoring tools and dashboards to track the performance and efficiency of TPUs and GPUs, which are used across all Google products. This data helps to improve resource allocation, identify areas for improvement, and drive efficiency gains across Google's products.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

Responsibilities

  • Design, implement and advance the telemetry capabilities needed for monitoring and evaluating the fleet-wide efficiency of ML resources (TPUs and GPUs). This includes identifying the right underlying signals, devising the right high-level metrics of interest, and creating common dashboards for highlighting fleet-wide performance and efficiency.
  • Identify opportunities to improve the efficiency of the ML fleet, and build the solutions and capabilities needed to realize them.
  • Build reporting and analytics solutions with key partners, and provide in-depth analysis of the metrics to improve the operation and utilization of ML resources.
  • Drive collaboration with various teams (across different PAs) as needed to accomplish the efficiency improvement goals.
  • Lead junior SWEs towards delivering project goals.

About The Company

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.

Dublin, County Dublin, Ireland (On-Site)

New York, New York, United States (On-Site)

Waterloo, Ontario, Canada (On-Site)

Taipei City, Taiwan (On-Site)

San Francisco, California, United States (On-Site)

Saint-Ghislain, Wallonia, Belgium (On-Site)

Bengaluru, Karnataka, India (On-Site)

Austin, Texas, United States (On-Site)
