Staff Software Engineer, Machine Learning Infrastructure

1 Week ago • 8-13 Years • Artificial Intelligence

Job Summary

Job Description

This Staff Software Engineer role within Google's Core ML organization centers on optimizing Google's machine learning resources. Key responsibilities include designing, implementing, and enhancing telemetry for monitoring TPU and GPU fleet efficiency. This involves identifying key performance indicators, creating dashboards, and collaborating with teams to improve resource utilization. The role also includes building reporting and analytical solutions, providing in-depth analysis, and leading junior engineers. The candidate will work on improving the efficiency of the ML fleet across all Google products, contributing to better resource allocation and operational efficiency.
Must have:
  • 8+ years software development experience
  • 5+ years ML algorithms & tools experience
  • 3+ years software design & architecture experience
  • Experience with TensorFlow
  • Strong data structures & algorithms knowledge
Good to have:
  • Kubernetes experience
  • GPU programming experience
  • Experience with LLMs
  • CPU/GPU architecture knowledge
  • Adaptability to new tools and frameworks

Job Details

Minimum qualifications:

  • Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
  • 8 years of experience with software development in one or more programming languages, and with data structures/algorithms.
  • 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
  • 5 years of experience with machine learning algorithms and tools (e.g., TensorFlow), artificial intelligence, deep learning or natural language processing.

Preferred qualifications:

  • Experience with Kubernetes, Google Kubernetes Engine, GPU Programming, TensorFlow, and Cloud.
  • Experience analyzing ML model performance, or with prompting, training, or developing LLMs.
  • Experience with and knowledge of CPU/GPU architecture or hardware accelerators.
  • Ability to quickly adapt to new tools, frameworks, and languages.

About the job

Google Cloud's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google Cloud's needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. You will anticipate our customer needs and be empowered to act like an owner, take action and innovate. We need our engineers to be versatile, display leadership qualities and be enthusiastic to take on new problems across the full-stack as we continue to push technology forward.

In this role, you will join a team that's part of Google's Core ML organization, focused on optimizing Google's Machine Learning resources. You will help develop monitoring tools and dashboards to track the performance and efficiency of TPUs and GPUs, which are used across all Google products. This data helps to improve resource allocation, identify areas for improvement, and drive efficiency gains across Google's products.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

Responsibilities

  • Design, implement and advance the telemetry capabilities needed for monitoring and evaluating the fleet-wide efficiency of ML resources (TPUs and GPUs). This includes identifying the right underlying signals, devising the right high-level metrics of interest, and creating common dashboards for highlighting fleet-wide performance and efficiency.
  • Identify opportunities to improve the efficiency of the ML fleet and build solutions and capabilities to improve ML fleet efficiency.
  • Build reporting and analytic solutions with key partners, and provide in-depth analysis of the metrics to improve the operation and utilization of ML resources.
  • Drive collaboration with various teams (across different PAs) as needed to accomplish the efficiency improvement goals.
  • Lead junior SWEs towards delivering project goals.

Similar Jobs

AI Fund - ML Engineer

AI Fund

San Francisco, California, United States (On-Site)
1 Week ago
Microsoft - Senior Applied Scientist

Microsoft

Bengaluru, Karnataka, India (On-Site)
3 Days ago
ByteDance - Tech Lead Machine Learning Engineer

ByteDance

Seattle, Washington, United States (On-Site)
1 Month ago
Google - Research Engineer, Vision Language Models

Google

Zürich, Zurich, Switzerland (On-Site)
2 Weeks ago
Google - Data and Analytics Consultant, Google Cloud

Google

Hyderabad, Telangana, India (On-Site)
1 Week ago
Google - Software Engineer, Compiler Frontend, Silicon

Google

Mountain View, California, United States (On-Site)
2 Days ago
Google - Open Career Opportunities, Autonomous (Self-Driving) Vehicle Jobs, Waymo

Google

Phoenix, Arizona, United States (On-Site)
5 Months ago
Microsoft - Member of Technical Staff – Machine Learning Engineer

Microsoft

New York, New York, United States (Hybrid)
1 Month ago
Google - Customer Engineer, Cloud AI, Google Cloud

Google

Seattle, Washington, United States (On-Site)
2 Days ago
Google - Senior Technical Program Manager, AI Risk Reporting Lead

Google

Seattle, Washington, United States (On-Site)
2 Days ago

Similar Skill Jobs

The Walt Disney Company - Lead Software Engineer, Machine Learning - Ad Platforms

The Walt Disney Company

Seattle, Washington, United States (On-Site)
5 Months ago
Canva - Senior Machine Learning Engineer - Photo AI

Canva

Prague, Czechia (Remote)
3 Months ago
Demandbase - Senior Data Scientist

Demandbase

San Francisco, California, United States (Hybrid)
8 Hours ago
Altagram Group - Data Science Internship/Workstudent

Altagram Group

Germany (On-Site)
1 Month ago
Google - Cloud Developer II, AI/ML, Professional Services

Google

Atlanta, Georgia, United States (On-Site)
2 Days ago
InfoStretch Corporation - AI Developer

InfoStretch Corporation

Lansing, Michigan, United States (On-Site)
1 Month ago
NVIDIA - Senior Software Engineer, AI Resiliency

NVIDIA

Santa Clara, California, United States (On-Site)
1 Month ago
ByteDance - Video Analysis and Quality Algorithm Intern 2023 Summer/Fall (PHD)

ByteDance

San Diego, California, United States (On-Site)
6 Months ago
Netflix - Research Scientist (L6) - Identity Algorithms

Netflix

Los Gatos, California, United States (On-Site)
6 Months ago
Razer - Solutions Architect

Razer

Singapore (On-Site)
7 Months ago

Jobs in Bengaluru, Karnataka, India

Google - Program Manager, Supply Chain Inventory Planning

Google

Bengaluru, Karnataka, India (On-Site)
1 Week ago
Tekion Corp - Senior Software Engineer

Tekion Corp

Bengaluru, Karnataka, India (On-Site)
1 Day ago
Siemens - Release Manager – Data Platform

Siemens

Pune, Maharashtra, India (On-Site)
1 Day ago
Maersk Careers - Second Cook (SF)

Maersk Careers

Mumbai, Maharashtra, India (On-Site)
336 Years ago
NVIDIA - Senior System Software Engineer, GPU Firmware

NVIDIA

Bengaluru, Karnataka, India (On-Site)
3 Months ago
Krafton india - Sr Product Manager

Krafton india

Bengaluru, Karnataka, India (On-Site)
3 Weeks ago
Truecaller - Senior Solution Consultant

Truecaller

Bengaluru, Karnataka, India (On-Site)
8 Hours ago
Aisera Jobs - Solutions Architect (Post Sales)

Aisera Jobs

Hyderabad, Telangana, India (On-Site)
1 Day ago
Snyk - Technical Success Manager

Snyk

New Delhi, Delhi, India (On-Site)
7 Hours ago
Hitachi - Windchill PLM Consultant

Hitachi

Chennai, Tamil Nadu, India (On-Site)
6 Months ago

Artificial Intelligence Jobs

ByteDance - Research Scientist in Foundation Model, Speech & Audio Graduates - 2024 Start (PhD)

ByteDance

Seattle, Washington, United States (On-Site)
6 Months ago
Google - Cloud Engineer II, AI/ML, Professional Services

Google

Mexico City, Mexico City, Mexico (On-Site)
2 Weeks ago
NVIDIA - Senior Software Engineer - Triton Tools

NVIDIA

California, United States (Remote)
3 Months ago
Krafton - [Global Strategy & BD Div.] Strategy Manager (AI Ethics) (4 to 7 years)

Krafton

Seoul, South Korea (On-Site)
4 Months ago
SparkCognition - Data Scientist

SparkCognition

Bengaluru, Karnataka, India (On-Site)
7 Months ago
Meta - Research Scientist Intern, Machine Perception for Input and Interaction (PhD)

Meta

Burlingame, California, United States (On-Site)
5 Months ago
Google - AI/ML Engineer, National Security, Public Sector

Google

Reston, Virginia, United States (On-Site)
2 Weeks ago
Microsoft - Senior Principal Researcher – Generative AI

Microsoft

Redmond, Washington, United States (On-Site)
2 Weeks ago
ByteDance - Senior Machine Learning Engineer

ByteDance

San Jose, California, United States (On-Site)
2 Weeks ago
Microsoft - Member of Technical Staff – Voice & Vision

Microsoft

London, England, United Kingdom (On-Site)
1 Week ago

About The Company

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.

Mountain View, California, United States (On-Site)

Bengaluru, Karnataka, India (On-Site)
