Senior Site Reliability Engineer

8 Months ago • 6+ Years • $117,200 PA - $250,200 PA

DevOps

Job Description

Senior Site Reliability Engineer (SRE) for Office 365 at Microsoft. Responsibilities include ensuring service availability, proactively identifying and resolving incidents, implementing automation, and driving the adoption of AI/ML for predictive analytics. The role requires extensive experience in software engineering, cloud technologies (Azure preferred), and troubleshooting large-scale production issues. The ideal candidate will have experience with machine learning concepts, cloud platforms, and large language models. The role involves working with product engineering teams, contributing to code reviews, and participating in on-call rotations. The SRE will also be involved in developing and improving automation, analyzing operational metrics, and mentoring junior engineers.

Good To Have:

Experience with Microsoft Azure ML, Cognitive Services
Familiarity with Large Language Models and Generative AI
Experience building infrastructure using Microsoft Azure

Must Have:

6+ years experience in software engineering or related field
Experience troubleshooting large-scale cloud environments
Proficiency in Azure and cloud technologies
Strong programming skills (C++, C#, Node.js)
Experience with AI/ML concepts and tools

Perks:

Industry leading healthcare
Educational resources
Discounts on products and services
Savings and investments
Maternity and paternity leave
Generous time away
Giving programs
Networking opportunities

Overview

Senior Site Reliability Engineer (Office 365)

We’re looking for a Senior Site Reliability Engineer (SRE) with the right mix of systems engineering, software development, and online services experience, plus a passion for quality, to envision, design, and deliver Office 365 (O365) Enterprise Cloud service offerings.

Team Overview: Within the vast framework of M365 Office Engineering Direct (OED), our SRE team is instrumental to the success of Exchange Online. With the service spanning hundreds of components, our goal is clear: ensure unmatched service availability and continually elevate user satisfaction. 

What We Do & Our Impact: Our approach is layered and precise. By implementing proactive engineering solutions, we identify and tackle incidents head-on, ensuring limited disruptions. Monitoring, both comprehensive and nuanced, remains our cornerstone, adeptly capturing anomalies beyond the scope of conventional systems. As swift diagnostics steer our course, we channel our efforts towards automation, efficiently managing the incident lifecycle from detection to resolution. Additionally, with a commitment rooted in understanding our users, we meticulously prioritize and execute Design Change Requests, ensuring Exchange Online's evolution aligns with user expectations. 

The Future – Artificial Intelligence (AI) & Machine Learning (ML) in Focus: As we look to the horizon, the fusion of AI and ML with our SRE practices heralds a transformative era for Exchange Online. We are in the early stages of integrating predictive analytics to anticipate issues before they manifest, allowing us to stay a step ahead. Customized ML models are being developed to intelligently sift through vast data lakes, identifying patterns and correlations previously overlooked. Our journey with AI and ML is not just about enhancement; it's about redefining reliability, precision, and the user experience in the M365 suite.

Location: This is a U.S.-based position. Remote work is possible, but relocation is not provided for this role.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Qualifications

Required/Minimum Qualifications: 

6+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
- OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.

Other Qualifications:

Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:

Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

6+ years' experience troubleshooting, investigating, and fixing production issues in large-scale cloud and/or hosted environments.
4+ years' experience building infrastructure using Microsoft Azure technology.
5+ years' experience writing programs (in C++, C#, or Node.js) that leverage a major cloud service, including experience with algorithms, data structures, and software design.
Familiarity with core machine learning concepts, including infrastructure and open-source options (e.g., compute systems such as GPUs and FPGAs; AI/ML frameworks such as TensorFlow, MLflow, JAX, and PyTorch; tools such as Jupyter notebooks and VS Code).
Familiarity with the Microsoft Azure cloud and technologies such as Azure ML, Microsoft’s Cognitive Services, Azure OpenAI, or Azure Cognitive Search, or similar experience with another cloud platform.
Familiarity with using Large Language Models and Generative AI to solve real-world problems.

Site Reliability Engineering IC4 - The typical base pay range for this role across the U.S. is USD $117,200 - $229,200 per year. A different range applies to specific work locations within the San Francisco Bay Area and New York City metropolitan area, where the base pay range for this role is USD $153,600 - $250,200 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until January 24, 2025.

#M365CORE

Responsibilities

Technical Knowledge and Domain-Specific Expertise 

Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies; identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance. Drives the adoption of new solutions across engineering teams working with related products within an organization and provides guidance and coaching to others on relevant topics. 
Experience working with all service aspects of high-throughput, multi-tenant services; ability to understand and design workflows carefully, handle errors properly, and write clean, well-factored code with good tests and good maintainability.

Contributions to Development and Design

Engages with product engineering teams by driving code/design reviews, hosting regular meetings, and participating in on-call rotations and incident responses throughout product development and operations cycles; leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention.

Driving Operational Excellence 

Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale; reviews existing automation code and scripts to evaluate reusability, extendibility, and scalability within an organization. 
Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale. Contributes to the development of new tooling and/or predictive models to identify and test potential improvements in product development and/or operations and monitors the impact of changes on operations metrics (e.g., Time-to-X) within an organization. 
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting complex issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams, owners, and leadership to issues with major customer/business impact and escalates resolution of the highly complex, ambiguous, and impactful issues to include other engineering teams and/or subject matter experts as needed. Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings. 
Shares insights and best practices that can be applied to improve development and operations across related sets of systems, platforms, and/or products. Continues to develop their understanding of insights and best practices through interactions with more experienced Site Reliability Engineers (SREs) and members of product engineering teams. Mentors and coaches less experienced engineers to help them identify and propose relevant solutions. 
Serves as a point of contact and trusted advisor, interacting with customers and other external stakeholders as a spokesperson on customer confidence or escalation calls, and supports the incident management process, including quality control of Root Cause Analyses (RCAs).