Software Engineer, Machine Learning Supercomputer Reliability

8 Years + • Research Development • $197,000 PA - $291,000 PA

Job Summary

Job Description

Google's software engineers develop next-generation technologies at massive scale. This role involves working on a project critical to Google’s needs, with opportunities to switch teams. The team's mission is to provide software for reliable scaling of accelerators for massive Machine Learning (ML) applications, focusing on distributed systems, ML, and networking technologies. The MSCA organization manages hardware, software, ML, and systems infrastructure for all Google services and Google Cloud, prioritizing security, efficiency, and reliability.
Must have:
  • Bachelor's degree or equivalent practical experience
  • 8 years of experience with general-purpose programming languages (e.g., Java, C/C++, Python)
Key responsibilities:
  • Design and maintain supercomputer software across different layers of the software stack
  • Provide technical leadership to formulate and drive software development plans
  • Identify commonalities between different supercomputer generations and accelerator types
  • Create well-abstracted and flexible software
  • Help identify dependencies in cross-functional teams and drive common execution
Good to have:
  • Experience with data structures, algorithms, and software design
  • Understanding of distributed systems concepts
  • Knowledge of common ML algorithms and how they map to software and hardware operations
  • Passion for building back-end software for high-performance computing and machine learning applications
Perks:
  • Bonus
  • Equity
  • Benefits

Job Details

About the job

Google's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. Our products need to handle information at massive scale, and extend well beyond web search. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google’s needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. We need our engineers to be versatile, display leadership qualities, and be enthusiastic about taking on new problems across the full stack as we continue to push technology forward.

The mission of our team is to provide easy-to-use, maintainable software for the reliable scale-out and scale-up of accelerators, specifically for massive-scale Machine Learning (ML) applications. This work requires a deep understanding of distributed systems, machine learning, networking technologies, and other aspects of building large, connected compute systems.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability in everything we do, from developing our latest TPUs to running a global network, while driving toward shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud’s Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

Responsibilities

  • Design and maintain supercomputer software across different layers of the software stack (e.g., network routing rules built into Tensor Processing Units (TPUs), control software running on specialized machines, distributed software running on Google’s internal and cloud infrastructure) to control, monitor, build, deploy, qualify, and service supercomputing systems.
  • Provide technical leadership to help formulate and drive software development plans.
  • Identify commonalities between different supercomputer generations and accelerator types, and create well-abstracted and flexible software.
  • Help identify dependencies in cross-functional teams and drive common execution with a particular focus on development velocity and quality.

About The Company

Sunnyvale, California, United States (On-Site)

Mountain View, California, United States (On-Site)

Mexico City, Mexico City, Mexico (On-Site)

Belo Horizonte, State Of Minas Gerais, Brazil (On-Site)
