Principal Software Engineer - Distributed Training System

1 Month ago • 6 Years + • Research & Development

Job Summary

Job Description

The Principal Software Engineer will design and implement a distributed training system for trillion-parameter machine learning models. Responsibilities include optimizing training and inference on GPUs, implementing streaming training and publishing of these models, analyzing metrics to identify improvement opportunities, and developing scalable solutions. Collaboration with cross-functional teams is essential. The role requires expertise in high-performance C++, CUDA, Python, or C#, experience with machine learning and TensorFlow/PyTorch distributed training, and strong problem-solving and communication skills. The team works on various aspects of online advertising, impacting millions of users and advertisers.
Must have:
  • 6+ years software engineering experience
  • High-performance C++, CUDA, Python, or C# coding
  • Machine learning & TensorFlow/PyTorch experience
  • Distributed training system design & implementation
  • GPU utilization and optimization
  • Strong problem-solving and debugging skills
Good to have:
  • Ads, search, or content service domain knowledge
Perks:
  • Industry-leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Networking opportunities

Job Details

Overview

The MAI Ads team in Microsoft APRD is responsible for providing the advertising industry with a state-of-the-art online advertising platform and service. Our team is at the core of this effort, working on research and development in the following areas: Selection (recall), Relevance, User Response Prediction (Click Prediction and Conversion Prediction), Autobidding, Large Language Models, and Large-Scale Machine Learning and Serving Systems. The team is a world-class R&D group of passionate and talented scientists and engineers who aspire to solve challenging problems and turn innovative ideas into high-quality products and services that help hundreds of millions of users and advertisers and directly impact our business.

Qualifications

• A Bachelor's, Master's, or PhD degree in CS/EE or a related area is required.
• 6+ years of industry experience in software engineering.
• Solid experience shipping high-performance code in C++, CUDA, Python, C#, or an equivalent language.
• Experience with machine learning and TensorFlow/PyTorch distributed training is preferred.
• Domain knowledge of ads, search or content services is a plus.
• Quick learner with solid problem-solving and debugging skills.
• Good communication skills and fluency in English (both oral and written).


Responsibilities

• Design and implement a distributed training system for trillion-parameter machine learning models.
• Drive the team's efforts on GPU utilization and optimization for training and inference.
• Design and implement streaming training and publishing of trillion-parameter machine learning models.
• Analyze metrics from offline and online testing to identify opportunities, then develop and deliver robust, scalable solutions.
• Collaborate with cross-functional teams to deliver high-quality solutions.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
• Industry-leading healthcare
• Educational resources
• Discounts on products and services
• Savings and investments
• Maternity and paternity leave
• Generous time away
• Giving programs
• Opportunities to network and connect

About The Company

Microsoft is a tech giant that develops, licenses, and supports a range of software products, services, and devices.
