Principal Software Engineer - Distributed Training System

1 Hour ago • 6+ Years • Research & Development

About the job

Job Description

The Principal Software Engineer will design and implement a distributed training system for trillion-parameter machine learning models. Responsibilities include optimizing training and inference on GPUs, implementing streaming training and publishing of these models, analyzing metrics to identify improvement opportunities, and developing scalable solutions. Collaboration with cross-functional teams is essential. The role requires expertise in high-performance C++, CUDA, Python, or C#, experience with machine learning and TensorFlow/PyTorch distributed training, and strong problem-solving and communication skills. The team works on various aspects of online advertising, impacting millions of users and advertisers.
Must have:
  • 6+ years software engineering experience
  • High-performance C++, CUDA, Python, or C# coding
  • Machine learning & TensorFlow/PyTorch experience
  • Distributed training system design & implementation
  • GPU utilization and optimization
  • Strong problem-solving and debugging skills
Good to have:
  • Ads, search, or content service domain knowledge
Perks:
  • Industry-leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Networking opportunities

Overview

The MAI Ads team in Microsoft APRD is responsible for providing the advertising industry with a state-of-the-art online advertising platform and service. Our team is at the core of this effort, working on research and development in the following areas: Selection (Recall), Relevance, User Response Prediction (Click Prediction and Conversion Prediction), Autobidding, Large Language Models, and Large-Scale Machine Learning & Serving Systems. We are a world-class R&D team of passionate and talented scientists and engineers who aspire to solve challenging problems and turn innovative ideas into high-quality products and services that help hundreds of millions of users and advertisers and directly impact our business.

Qualifications

• A Bachelor's, Master's, or PhD degree in CS/EE or a related area is required.
• 6+ years of industry experience in software engineering.
• Solid experience shipping high-performance code in C++, CUDA, Python, C#, or an equivalent language.
• Experience with machine learning and TensorFlow/PyTorch distributed training is preferred.
• Domain knowledge of ads, search, or content services is a plus.
• Quick learner with solid problem-solving and debugging skills.
• Good communication skills; fluent in English (both oral and written).


Responsibilities

• Design and implement a distributed training system for trillion-parameter machine learning models.
• Drive the team's efforts on utilization and optimization of training and inference on GPUs.
• Design and implement streaming training and publishing of trillion-parameter machine learning models.
• Analyze metrics and identify opportunities based on offline and online testing; develop and deliver robust, scalable solutions.
• Collaborate with cross-functional teams to deliver high-quality solutions.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
• Industry-leading healthcare
• Educational resources
• Discounts on products and services
• Savings and investments
• Maternity and paternity leave
• Generous time away
• Giving programs
• Opportunities to network and connect
About The Company

Microsoft is a tech giant that develops, licenses, and supports a range of software products, services, and devices.

