Director, Reliability Engineering

1 Hour ago • 8-12 Years • DevOps • Manufacturing

About the job

Job Description

The Director, Reliability Engineering leads a team responsible for the reliability of Microsoft's cloud infrastructure hardware. This involves overseeing architecture, design, manufacturing, and operations to ensure high quality and performance. Key responsibilities include leading strategic innovations, driving root cause analysis, optimizing reliability solutions, and collaborating with cross-functional teams. The role requires strong leadership, technical expertise in reliability engineering, and experience in cloud operations. The candidate will define and manage the integration of various aspects of the hardware lifecycle to optimize cloud infrastructure reliability.
Must have:
  • Doctorate, Master's, or Bachelor's degree in a relevant engineering field
  • 5+ years of management experience
  • 5+ (Doctorate), 7+ (Master's), or 8+ (Bachelor's) years of technical engineering experience
Good to have:
  • MBA in engineering management or operations
  • Experience leading system engineering teams
  • Knowledge of cloud fleet management and diagnostics
  • Experience with liquid cooling infrastructure
  • Experience developing design specifications
Perks:
  • Industry leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Networking opportunities

Overview

Microsoft Silicon, Cloud Hardware Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding Cloud Infrastructure and is responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for more than 200 Microsoft online businesses globally, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive and the Microsoft Azure platform, with our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide. We are looking for passionate, high-energy engineers to help achieve that mission.

As Microsoft's Cloud business continues to grow, the ability to deploy new offerings and HW infrastructure on time, in high volume, with high quality and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for Cloud infrastructure reliability and in improving the planning process, manufacturing, quality, delivery at scale, serviceability and sustainability. We are looking for a System Reliability Engineering Leader with a strong passion for customer-focused solutions, and with the insight and industry knowledge to envision and implement future technical solutions that will optimize the Cloud infrastructure and its reliability.

We are looking for an experienced System Reliability Director who will be responsible for driving reliability performance across architecture, design, component and material selection, manufacturing, and integration of datacenter hardware. The role ensures that all electrical, mechanical, thermal, environmental, transportation, and operational aspects, along with the telemetry, diagnostics, and SW/FW stack of the cloud solution, are optimized throughout the lifecycle of each cloud service. The candidate will interact with Engineering, Supply Chain, Sourcing, Manufacturing & Quality, Fleet Management, Datacenter Operations, and other internal and external stakeholders.

Qualifications

Required Qualifications

  • Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 5+ years technical engineering experience
    • OR Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 7+ years technical engineering experience
    • OR Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 8+ years technical engineering experience.
  • 5+ years of management experience, including resource planning, career development, and performance management.

Other Requirements

The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

 

Preferred Qualifications:

  • MBA in engineering management or operations.
  • Experience with cloud fleet management, telemetry, diagnostics, and troubleshooting of IT systems.
  • Experience with and knowledge of the server industry product development process.
  • Experience leading system engineering teams in both NPI and Sustaining lifecycles, and managing suppliers.
  • Experience developing design specifications and/or product requirement documents.
  • Experience with system reliability, manufacturing processes, and datacenter operations, including leading continuous improvements through automation.
  • Experience with liquid cooling infrastructure for IT racks.

Reliability Engineering M5 - The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $180,400 - $294,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:

Microsoft will accept applications for the role until January 18, 2025

 

 


Responsibilities

As a Director, Reliability Engineering, you will be responsible for the following:

  • Leading the Cloud System and Components Reliability Engineering organization, operating in a fast-paced environment and transforming ambiguity into clarity.
  • Leading strategic innovations and developing processes that integrate industry practices to ensure scalability and efficiency and to achieve high reliability and quality performance.
  • Leading by example and coaching to inspire team members to grow and develop in the field of System and Components Reliability Engineering.
  • Leading retrospectives and deep dives to drive root cause analysis and corrective actions that prevent future escapes.
  • Combining technical and process expertise with an in-depth understanding of cloud operations to optimize reliability solutions for future server and storage products.
  • Defining, facilitating, and managing the integration of architecture, design, manufacturing, operation, troubleshooting, and diagnostic methods to optimize cloud infrastructure reliability.
  • Participating in, and approving, mechanical, thermal, electrical, telemetry, and diagnostic design reviews to ensure system reliability requirements are properly implemented.
  • Driving System Reliability Readiness of new cloud platforms landing in Microsoft Datacenters.
  • Supporting Hardware Systems Group development, deployment, and sustaining teams from system concept to decommission, and working with cross-functional strategic teams on process optimizations and inter-related strategic initiatives.
  • Developing key metrics to evaluate the system reliability program’s performance and building implementation plans to confirm performance and compliance against program metrics and internal company requirements.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
  • Industry leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Opportunities to network and connect

$137.6K - $294.0K/yr (Outscal est.)
$215.8K/yr avg.
Redmond, Washington, United States

About The Company

Microsoft is a tech giant that develops, licenses, and supports a range of software products, services, and devices.
