Software Engineering Manager, Site Reliability, Cloud Incident Response

1 Month ago • 8-11 Years • DevOps

About the job

Job Description

As a Software Engineering Manager for Site Reliability and Cloud Incident Response at Google, you'll play a vital role in ensuring the dependability of Google Cloud Platform (GCP) for our customers. You will lead a team dedicated to responding to and mitigating major incidents across GCP, working closely with product teams, customer-facing teams, and stakeholders. Your responsibilities will include participating in on-call rotations for critical incident response, collaborating to ensure high-quality customer outcomes, developing incident management training and processes, building systems and tools to support the team, and proactively identifying and mitigating potential risks within Cloud infrastructure.
Must have:
  • Bachelor's degree or equivalent practical experience
  • 8 years of experience with software development
  • 3 years of experience in a technical leadership role
  • 2 years of experience in people management
Good to have:
  • Master's degree or PhD in Computer Science
  • Experience working in a changing organization

Minimum qualifications:

  • Bachelor's degree or equivalent practical experience.
  • 8 years of experience with software development in one or more programming languages (e.g., Python, C, C++, Java, JavaScript).
  • 3 years of experience in a technical leadership role overseeing projects, with 2 years of experience in a people management or supervision/team leadership role.

Preferred qualifications:

  • Master's degree or PhD in Computer Science, or a related technical field.
  • Experience working in a changing organization.

About the job

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services, both our internally critical and our externally visible systems, have reliability and uptime appropriate to customers' needs and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on our systems' capacity and performance.

Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of diversity, intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

The team's mission is to create a dependable experience for GCP customers. In this role, you will respond to major incidents across all of GCP and help coordinate, mitigate, or resolve them. The Cloud Incident Response Team supports the responders, tooling, and outcomes for GCP Major Incidents. The team collaborates across GCP products, customer-facing teams, and a wide range of stakeholders.

Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google’s cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.

Responsibilities

  • Participate in on-call rotation supporting Critical Incident Response for GCP.
  • Focus on high-quality customer outcomes and continuous collaboration across GCP teams.
  • Create IMAG training and processes for the incident management life cycle, partnering with Cloud SRE UTLs and the Cloud Support leadership team.
  • Build systems and tooling to support the team and improve visibility, issue detection, and communications to customers, stakeholders, and customer-facing teams.
  • Define and escalate risks in Cloud, and reduce incident probabilities with strategic and tactical/pragmatic approaches as needed.

About The Company

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.
