Staff Software Engineer, Machine Learning Infrastructure

Google

Karnataka, India (On-site)

2 Months ago • 8-13 Years

Job Summary

Job Description

This Staff Software Engineer role within Google Cloud's Core ML organization focuses on optimizing Google's machine learning resources. Responsibilities include designing, implementing, and advancing telemetry capabilities for monitoring TPU and GPU fleet efficiency. This involves identifying key performance indicators, developing dashboards, and analyzing metrics to improve resource utilization and allocation across Google products. The role requires collaboration with various teams to achieve efficiency goals and mentoring junior software engineers. The ideal candidate will have extensive experience in software development, machine learning algorithms, and tools (e.g., TensorFlow), along with a strong understanding of software architecture and design.

Must have:

8+ years software development experience
5+ years ML algorithms & tools experience
5+ years experience launching software products
Design and implement telemetry for ML resource monitoring
Identify and build solutions to improve ML fleet efficiency
Build reporting and analytics solutions

Good to have:

Experience with Kubernetes, Google Kubernetes Engine
GPU Programming, TensorFlow, and Cloud experience
Experience analyzing ML model performance or working on LLMs
Knowledge of CPU/GPU architecture or HW accelerators

9 skills required

tensorflow

algorithms

kubernetes

deep-learning

data-structures

resource-allocation

resource-planning

networking

user-interface

Job Details

Minimum qualifications:

Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
8 years of experience with software development in one or more programming languages, and with data structures/algorithms.
5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
5 years of experience with machine learning algorithms and tools (e.g., TensorFlow), artificial intelligence, deep learning or natural language processing.

Preferred qualifications:

Experience with Kubernetes, Google Kubernetes Engine, GPU Programming, TensorFlow, and Cloud.
Experience analyzing ML model performance, or working on LLM prompting, training, or development.
Experience with and knowledge of CPU/GPU architecture or HW accelerators.
Ability to quickly adapt to new tools, frameworks, and languages.

About the job

Google Cloud's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google Cloud's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. You will anticipate our customers' needs and be empowered to act like an owner, take action and innovate. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full stack as we continue to push technology forward.

In this role, you will join a team that's part of Google's Core ML organization, focused on optimizing Google's Machine Learning resources. You will help develop monitoring tools and dashboards to track the performance and efficiency of TPUs and GPUs, which are used across all Google products. This data helps to improve resource allocation, identify areas for improvement, and drive efficiency gains across Google's products.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

Responsibilities

Design, implement and advance the telemetry capabilities needed for monitoring and evaluating the fleet-wide efficiency of ML resources (TPUs and GPUs). This includes identifying the right underlying signals, devising the right high-level metrics of interest, and creating common dashboards for highlighting fleet-wide performance and efficiency.
Identify opportunities to improve the efficiency of the ML fleet and build solutions and capabilities to improve ML fleet efficiency.
Build reporting and analytic solutions with key partners, and provide in-depth analysis of the metrics to improve the operation and utilization of ML resources.
Drive collaboration with various teams (across different PAs) as needed to accomplish the efficiency improvement goals.
Lead junior SWEs towards delivering project goals.
