Staff Software Engineer, Machine Learning Infrastructure

1 Month ago • 8-13 Years • Artificial Intelligence

Job Summary

Job Description

This Staff Software Engineer role within Google Cloud's Core ML organization focuses on optimizing Google's machine learning resources. Responsibilities include designing, implementing, and advancing telemetry capabilities for monitoring TPU and GPU fleet efficiency. This involves identifying key performance indicators, developing dashboards, and analyzing metrics to improve resource utilization and allocation across Google products. The role requires collaboration with various teams to achieve efficiency goals and mentoring junior software engineers. The ideal candidate will have extensive experience in software development, machine learning algorithms, and tools (e.g., TensorFlow), along with a strong understanding of software architecture and design.
Must have:
  • 8+ years software development experience
  • 5+ years ML algorithms & tools experience
  • 5+ years experience launching software products
  • Design and implement telemetry for ML resource monitoring
  • Identify and build solutions to improve ML fleet efficiency
  • Build reporting and analytics solutions
Good to have:
  • Experience with Kubernetes, Google Kubernetes Engine
  • GPU Programming, TensorFlow, and Cloud experience
  • Experience analyzing ML model performance or working on LLMs
  • Knowledge of CPU/GPU architecture or HW accelerators

Job Details

Minimum qualifications:

  • Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
  • 8 years of experience with software development in one or more programming languages, and with data structures/algorithms.
  • 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
  • 5 years of experience with machine learning algorithms and tools (e.g., TensorFlow), artificial intelligence, deep learning or natural language processing.

Preferred qualifications:

  • Experience with Kubernetes, Google Kubernetes Engine, GPU Programming, TensorFlow, and Cloud.
  • Experience analyzing ML model performance, or working on LLM prompting, training, or development.
  • Experience with and knowledge of CPU/GPU architecture or hardware accelerators.
  • Ability to quickly adapt to new tools, frameworks, and languages.

About the job

Google Cloud's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google Cloud's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. You will anticipate our customer needs and be empowered to act like an owner, take action and innovate. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full-stack as we continue to push technology forward.

In this role, you will join a team that's part of Google's Core ML organization, focused on optimizing Google's Machine Learning resources. You will help develop monitoring tools and dashboards to track the performance and efficiency of TPUs and GPUs, which are used across all Google products. This data helps to improve resource allocation, identify areas for improvement, and drive efficiency gains across Google's products.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

Responsibilities

  • Design, implement and advance the telemetry capabilities needed for monitoring and evaluating the fleet-wide efficiency of ML resources (TPUs and GPUs). This includes identifying the right underlying signals, devising the right high-level metrics of interest, and creating common dashboards for highlighting fleet-wide performance and efficiency.
  • Identify opportunities to improve the efficiency of the ML fleet and build solutions and capabilities to improve ML fleet efficiency.
  • Build reporting and analytics solutions with key partners, and provide in-depth analysis of the metrics to improve the operation and utilization of ML resources.
  • Drive collaboration with various teams (across different product areas) as needed to accomplish the efficiency improvement goals.
  • Lead junior SWEs towards delivering project goals.
