Senior Site Reliability Engineer

34 Minutes ago • 6-10 Years • DevOps • Undisclosed

About the job

Job Description

Microsoft's Azure Cosmos DB team seeks a Senior Site Reliability Engineer to build and optimize solutions for analyzing massive telemetry data, performing automated root cause analysis, and implementing mitigations to maintain Service Level Objectives (SLOs). Responsibilities include collaborating with engineering teams on automation, working with customers to address supportability issues, communicating technical details to enterprise clients during service escalations, designing and implementing service telemetry changes, enhancing customer experience through proactive alerting, and providing operational insights to design and product teams. The role demands strong problem-solving skills, effective communication, experience with large-scale cloud services, and proficiency in Python/Java/C#.
Must have:
  • 6+ years experience in relevant field
  • Experience with large-scale cloud services
  • Proficiency in Python/Java/C#
  • Strong problem-solving skills
  • Effective communication skills
Good to have:
  • Understanding of Observability and MELT
  • Experience with Logic Apps and Jupyter Notebooks
  • Experience improving Service Reliability, Availability, and Performance
Perks:
  • Industry-leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Networking opportunities

Overview

Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky-is-the-limit thinking in a cloud-enabled world.

Microsoft’s Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence. The products in our portfolio include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.

 

Within Azure Data, the databases team builds and maintains Microsoft's operational database systems. We store and manage data in a structured way to enable a multitude of applications across various industries. We are on a journey to enable developer-friendly, mission-critical, AI-enabled operational databases across relational, non-relational, and Open Source Software (OSS) offerings.

 

We believe in making the day in the life of the On-Call Engineer boring while living up to the expectations of a massive cloud service with stringent Service Level Objectives (SLOs). We do this by thinking differently, stretching ourselves to go all the way to the root of the problem, keeping data front and center in all our decisions, and taking a systems approach to generating outcomes that far exceed expectations. Helping attain the aspirational SLOs through pragmatic innovation is what sets the SREs in Cosmos DB apart. If you share the same purpose, cause, and belief and have the passion to follow this pursuit, please read the rest of the job description to learn what we do. We would love to have you join us!

Azure Cosmos DB is Microsoft’s next-generation globally distributed, massively scalable, multi-model cloud database service. It is designed to enable developers to build planet-scale applications. Azure Cosmos DB is one of the fastest growing Azure services. Joining the Azure Cosmos DB team is a fantastic opportunity to work with incredibly talented engineers operating like a startup, to be at the forefront of building and shaping the Livesite Automation and AI Ops stack in Cosmos DB, and to lead the path for broader adoption across Microsoft Azure.

Cosmos DB is a database of choice across the spectrum, from the hobbyist developer to the largest Fortune 500 companies. The database provides the data backbone of many critical systems in Health Care, Retail, Telecommunications, IoT, etc., where Service Availability and Latency are paramount. Cosmos DB provides financially backed service level agreements (SLAs) around 99.99% Availability and < 10 ms Latency, and we take pride in holding ourselves to even more stringent Service Level Objectives (SLOs) that delight our customers. Beyond a resilient and fault-tolerant architecture, a key to attaining the SLOs is automating the root cause analysis and mitigation of issues, often proactively addressing them before any customer impact. This team prides itself on building systems where the vast majority of Livesite issues are automatically mitigated without the need for human intervention.

We are looking for a self-driven Senior Site Reliability Engineer (SRE) who likes taking a data-driven and systems-based approach to solving Service Reliability problems. You will be responsible for building and optimizing solutions that can analyze massive amounts of telemetry and other Service Health indicators in near real time and perform automated root cause analysis and the necessary mitigations to restore SLOs.

Our team focuses on diversity of all types of candidates for our roles and we strive to hire people with different experiences and perspectives into our team. To that end, we know that no candidate has every desired skill and experience, but all of us together make our team strong.

 

We do not just value differences or different perspectives. We seek them out and invite them in so we can tap into the collective power of everyone in the company. As a result, our customers are better served.

Qualifications

Required/minimum qualifications

  • 6+ years technical experience in software engineering, network engineering, or systems administration
    • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
    • OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.

Other Requirements

  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
    • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred/Additional Qualifications

  • Understanding of Observability and MELT implementation patterns for large-scale services.
  • Experience in Logic Apps and authoring Jupyter Notebooks, and experience in analyzing, troubleshooting, and automating root cause analysis and mitigation of incidents impacting large-scale distributed systems.
  • Ability to influence the product architecture and roadmap to make sure the customer-experienced supportability is always a key consideration when evolving the product.
  • Systematic problem-solving approach, coupled with effective communication skills and a sense of curiosity.
  • Ability to deal with the ambiguity associated with working in a fast-paced environment.
  • 5+ years of SRE or SWE (Software Engineer) experience running large-scale cloud services and 5+ years of hands-on experience in Python/Java/C#.
  • 3+ years of operational experience in improving Service Reliability, Availability, and Performance.

Site Reliability Engineering IC4 - The typical base pay range for this role across Canada is CAD $108,100 - CAD $199,700 per year.

Microsoft will accept applications for the role until November 21, 2024.

#azdat

#azuredata

#SRE

Responsibilities

  • Collaborating closely with engineering teams on building and enhancing tooling and automation solutions for faster resolution of issues impacting SLOs and averting incidents altogether when possible.
  • Collaborating with customers to understand their pain points around Supportability and SLO attainment and formulating strategies for addressing recurring issues in a sustainable way.
  • Communicating on a deeply technical level and being the single point of contact for interfacing with large enterprise customers, handling service escalations and driving the issues to resolution.
  • Designing and implementing changes to service telemetry for the automation to consume when the required telemetry is not already available.
  • Enhancing the customer-facing experience through proactive alerting based on utilization, trends, resource health, etc.
  • Analyzing data and providing operational insights into customer experience to Design and Product teams, so that we can design features with Supportability in mind.
  • Embody our and
Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.
  • Industry leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Opportunities to network and connect
About The Company

Microsoft is a tech giant that develops, licenses, and supports a range of software products, services, and devices.

Redmond, Washington, United States (On-Site)

Dublin, County Dublin, Ireland (On-Site)

Bengaluru, Karnataka, India (On-Site)

Hyderabad, Telangana, India (On-Site)

Busan, Busan, South Korea (On-Site)

Paris, Île-de-France, France (On-Site)

North Holland, Netherlands (On-Site)

Reston, Virginia, United States (On-Site)
