Principal Supercomputing Software Engineer

2 Hours ago • 6 Years + • DevOps

About the job

Job Description

The Microsoft Azure AI/HPC team seeks a Principal Supercomputing Software Engineer to build and use cutting-edge tools for managing hyperscale cloud infrastructure. Responsibilities include analyzing system metrics, debugging HPC issues, developing solutions for operating supercomputers in the public cloud, collaborating with customers and vendors, and ensuring platform performance, scalability, and resilience. The role demands expertise in AI/HPC system management, high-speed networks, HPC storage, or cloud infrastructure management. The engineer will help establish best practices, drive architectural changes, and influence roadmaps for software and hardware components.
Must have:
  • 6+ years technical engineering experience
  • 5+ years experience operating AI/HPC systems
  • 3+ years specialized experience in AI/HPC system management, high-speed networks, HPC storage, or cloud infrastructure
  • Proficient in C, C++, C#, Java, JavaScript, or Python
  • Strong analytical and problem-solving skills
Good to have:
  • Master's or PhD in Computer Science
  • Experience running large-scale HPC systems in cloud environments
  • Experience troubleshooting machine learning workloads on GPU-based HPC systems
  • Expertise in cloud computing, virtualization, and container technologies
  • Familiarity with the HPC software stack
Perks:
  • Industry leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Networking opportunities

Overview

The Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team is looking for systems engineers, architects, and thought leaders to enable customers in deploying, monitoring, profiling, and debugging their applications on hyperscale cloud infrastructure. Azure is enabling the largest supercomputing deployments to tackle complex computational problems in the public cloud, as is evident from the various HPC products that have already made their mark on the Top500, MLPerf, and Graph500 rankings.

At this supercomputing scale, we need specialized tools and techniques to maintain the reliability, runtime performance, and health of the system and its running jobs while continuing to meet customers' Service Level Agreements (SLAs). Your job would be to build and use state-of-the-art tools and techniques, find operational gaps, and instrument features to achieve the smooth operation of cloud-native supercomputers. As a Principal Supercomputing Engineer, you would also establish best practices, drive architectural changes, and influence the roadmap of relevant software and hardware components. Your work will directly impact the business goals of a wide range of users and facilitate the next wave of growth and innovation in AI and HPC in the cloud.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Qualifications

Required Qualifications:

  • Bachelor's Degree in Computer Science or related technical or scientific field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
    • OR equivalent experience
  • 5+ years of experience in operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating Cloud Infrastructure
  • 3+ years of specialized experience with one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure

Other Requirements:

  • The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications: 

  • Master's Degree or PhD in Computer Science or related technical or scientific field
  • Operational experience running large-scale HPC systems or infrastructure situated in Cloud environments
  • Previous experience with running and troubleshooting machine learning workloads on GPU-based HPC systems
  • Expertise in Cloud Computing, Virtualization and Container Technologies
  • Familiarity with the HPC software stack

Software Engineering IC5 - The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. A different range applies to specific work locations within the San Francisco Bay area and New York City metropolitan area, where the base pay range for this role is USD $180,400 - $294,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

Microsoft will accept applications for the role until January 6, 2025.

#azurecorejobs

Responsibilities

  • Be part of a comprehensive systems management team focused on operational excellence and customer success
  • Analyze key system metrics and telemetry to proactively identify and debug HPC system issues, build appropriate tooling, help develop processes and ensure that solutions are responsive to emerging user needs
  • Partner with customers, vendors, and other teams within Azure to drive comprehensive solutions for operating world-class supercomputers in the public cloud environment
  • Ensure that the Azure platform is performant, scalable and resilient
  • Foster a test-driven engineering culture that reduces regressions and bugs in production and sets a higher bar for infrastructure quality

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
  • Industry leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Opportunities to network and connect
$137.6K - $294.0K/yr (Outscal est.)
$215.8K/yr avg.

About The Company

Microsoft is a tech giant that develops, licenses, and supports a range of software products, services, and devices.