Lead Architect – SRE & Observability

1 Month ago • 10-15 Years • DevOps

Job Summary

Job Description

Applied Materials is a global leader in materials engineering solutions for chip and display production, enabling technologies like AI and IoT. The Lead Architect – SRE & Observability will be a key leader in designing, scaling, and governing monitoring and observability platforms, while ensuring the reliability of infrastructure and application services. This role involves leading cross-functional initiatives, establishing technical standards, and driving automation, telemetry, and incident response maturity across the enterprise.
Must have:
  • Architect and lead end-to-end observability strategies (logs, metrics, traces) across on-premises, private, and public cloud environments.
  • Manage and mature enterprise observability solutions across complex architectures.
  • Define standards for telemetry data collection, correlation, and alerting for distributed systems.
  • Collaborate with application and infrastructure teams to ensure instrumentation coverage and SLO/SLI definition.
  • Lead the migration and consolidation of legacy monitoring platforms to modern observability stacks.
  • Enable proactive problem detection, root cause analysis, and capacity forecasting using analytics and AI/ML insights.
  • Define and implement SRE principles (SLIs/SLOs, error budgets, chaos testing, postmortems, etc.) across supported services.
  • Design and manage infrastructure automation, CI/CD pipelines, AI/ML solutions, runbooks, and self-healing systems.
  • Lead incident response coordination during major outages and drive post-incident analysis and systemic fixes.
  • Collaborate with DevOps, Cloud, and Security teams to enforce resiliency, observability, and reliability as core design principles.
  • Mentor junior SREs and CAMO engineers to grow technical and operational expertise.
Good to have:
  • Familiarity with AIOps, ITSM, and CAASM tools, and with configuration management databases (CMDBs).
  • Exposure to compliance and governance frameworks such as CIS and NIST.
  • Relevant certifications in observability, cloud platforms, SRE, or security domains.
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • Excellent communication and stakeholder engagement skills.
Perks:
  • Competitive and comprehensive total rewards program.
  • Employee Assistance Program.
  • Meditation and family support resources.
  • Travel insurance.
  • Programs and support for personal and professional growth.
  • Supportive work culture.
  • Free career development and mentoring programs.
  • Technical and professional courses.
  • Worldwide "Giving" program (employee donations matched by Applied Materials Foundation).
  • Happiness, health, and resiliency programs.

Job Details

Who We Are

Applied Materials is the global leader in materials engineering solutions used to produce virtually every new chip and advanced display in the world. We design, build, and service cutting-edge equipment that helps our customers manufacture display and semiconductor chips – the brains of devices we use every day. As the foundation of the global electronics industry, Applied enables the exciting technologies that literally connect our world – like AI and IoT. If you want to work beyond the cutting edge, continuously pushing the boundaries of science and engineering to make possible® the next generations of technology, join us to Make Possible® a Better Future.

What We Offer

At Applied, we prioritize the well-being of you and your family and encourage you to bring your best self to work. Your happiness, health, and resiliency are at the core of our benefits and wellness programs. Our robust total rewards package makes it easier to take care of your whole self and your whole family. We’re committed to providing programs and support that encourage personal and professional growth and care for you at work, at home, or wherever you may go. Learn more about our benefits.

You’ll also benefit from a supportive work culture that encourages you to learn, develop and grow your career as you take on challenges and drive innovative solutions for our customers. We empower our team to push the boundaries of what is possible—while learning every day in a supportive leading global company. Visit our Careers website to learn more about careers at Applied.

About the GIS CAMO & SRE Team

The Cybersecurity Asset Management and Observability (CAMO) and Site Reliability Engineering (SRE) teams within GIS are at the forefront of ensuring operational excellence, resilience, and visibility across hybrid cloud and datacenter infrastructures. The CAMO team is responsible for enterprise-wide observability, IT asset visibility, event correlation, compliance monitoring, and tooling strategy. The SRE team ensures uptime, availability, and automation across mission-critical services through innovative engineering and DevOps practices.

Role Summary:

As a Lead Architect – SRE & Observability, you will play a key leadership role in designing, scaling, and governing monitoring and observability platforms, while ensuring the reliability of infrastructure and application services. You will lead cross-functional initiatives, establish technical standards, and drive automation, telemetry, and incident response maturity across the enterprise.

Key Responsibilities:

Monitoring & Observability (CAMO Focus)

  • Architect and lead end-to-end observability strategies (logs, metrics, traces) across on-premises, private, and public cloud environments.
  • Manage and mature enterprise observability solutions across complex architectures.
  • Define standards for telemetry data collection, correlation, and alerting for distributed systems.
  • Collaborate with application and infrastructure teams to ensure instrumentation coverage and SLO/SLI definition.
  • Lead the migration and consolidation of legacy monitoring platforms to modern observability stacks.
  • Enable proactive problem detection, root cause analysis, and capacity forecasting using analytics and AI/ML insights.

Site Reliability Engineering (SRE Focus)

  • Define and implement SRE principles (SLIs/SLOs, error budgets, chaos testing, postmortems, etc.) across supported services.
  • Design and manage infrastructure automation, CI/CD pipelines, AI/ML solutions, runbooks, and self-healing systems.
  • Lead incident response coordination during major outages and drive post-incident analysis and systemic fixes.
  • Collaborate with DevOps, Cloud, and Security teams to enforce resiliency, observability, and reliability as core design principles.
  • Mentor junior SREs and CAMO engineers to grow technical and operational expertise.

Technical Skills:

  • Expertise in designing and implementing observability frameworks including logs, metrics, and traces across hybrid environments (on-premises, private cloud, public cloud).
  • Strong understanding of distributed systems, microservices architecture, and telemetry pipelines.
  • Proficiency in infrastructure automation and configuration management using tools like Terraform, Ansible, and scripting languages (Python, Shell, etc.).
  • Experience with CI/CD pipelines, incident response automation, and self-healing systems.
  • Familiarity with container orchestration platforms (e.g., Kubernetes) and virtualization technologies.

Functional Knowledge:

  • Experience in implementing cyber asset management and security observability principles.
  • Familiarity with AIOps, ITSM, and CAASM tools, and with configuration management databases (CMDBs).
  • Exposure to compliance and governance frameworks such as CIS and NIST for cyber resilience, observability, and alerting.
  • Relevant certifications in observability, cloud platforms, SRE, or security domains.

Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 10-15 years of experience in IT Operations, SRE, DevOps, or Monitoring Engineering roles.
  • Strong expertise in modern observability platforms and telemetry pipelines.
  • Experience with hybrid environments including virtualization, container orchestration, and cloud platforms.
  • Proven track record in automation, telemetry governance, and infrastructure as code.
  • Excellent incident management, communication, and stakeholder engagement skills.

Interpersonal Skills

  • Communicates difficult concepts clearly and negotiates with others to adopt a different point of view.

Additional Information

Time Type: Full time

Employee Type: Assignee / Regular

Travel: Yes, 10% of the Time

Relocation Eligible: Yes
