Site Reliability Engineer

1 Month ago • All levels • Devops

Job Summary

Job Description

We are seeking a highly skilled and adaptable Site Reliability Engineer to join our Cloud Engineering team. Your primary role will be to design and enhance our cloud infrastructure, emphasizing reliability, security, and scalability. You will apply software engineering principles to address operational challenges, ensuring the continuous stability and resilience of our systems. This position involves managing live production environments and contributing to engineering improvements such as automation. Responsibilities include architecting and managing cloud infrastructure for high availability, performance, and cost efficiency; leading reliability best practices through monitoring and alerting systems; driving automation and efficiency by reducing operational overhead; participating in incident response and root cause analysis; and collaborating with cross-functional teams to integrate cloud solutions and promote innovation. A proactive and adaptable mindset is crucial.
Must have:
  • Expertise in AWS, Azure, or GCP
  • Proficiency in Linux and Windows OS
  • Extensive experience with containerization (Docker) and orchestration (Kubernetes)
  • Proficiency in scripting (shell, Python)
  • Experience with configuration management (Ansible, Puppet)
  • Exposure to IaC tools (Terraform, CloudFormation)
  • Experience with monitoring tools (Prometheus, Grafana, ELK)
  • Hands-on OpenTelemetry implementation
  • Strong understanding of SLI, SLO, SLA
  • Excellent communication and interpersonal skills
Good to have:
  • Experience with DevOps toolchain (Git, Jenkins, Rundeck)
  • Experience with MySQL and Hadoop
  • Knowledge of cloud cost management
  • Understanding of cloud security best practices
  • Experience with disaster recovery plans
  • Familiarity with ITIL processes

Job Details

Job Summary

We are looking for a highly skilled and adaptable Site Reliability Engineer to become a key member of our Cloud Engineering team. In this crucial role, you will be instrumental in designing and refining our cloud infrastructure with a strong focus on reliability, security, and scalability. As an SRE, you'll apply software engineering principles to solve operational challenges, ensuring the overall operational resilience and continuous stability of our systems. This position requires a blend of managing live production environments and contributing to engineering efforts such as automation and system improvements.

Key Responsibilities:

  • Cloud Infrastructure Architecture and Management: Design, build, and maintain resilient cloud infrastructure solutions to support the development and deployment of scalable and reliable applications. This includes managing and optimizing cloud platforms for high availability, performance, and cost efficiency.
  • Enhancing Service Reliability: Lead reliability best practices by establishing and managing monitoring and alerting systems to proactively detect and respond to anomalies and performance issues. Utilize SLI, SLO, and SLA concepts to measure and improve reliability. Identify and resolve potential bottlenecks and areas for enhancement.
  • Driving Automation and Efficiency: Contribute to the automation, provisioning, and standardization of infrastructure resources and system configurations. Identify and implement automation for repetitive tasks to significantly reduce operational overhead. Develop Standard Operating Procedures (SOPs) and automate workflows using tools like Rundeck or Jenkins.
  • Incident Response and Resolution: Participate in and help resolve major incidents, conduct thorough root cause analyses, and implement permanent solutions. Effectively manage incidents within the production environment using a systematic problem-solving approach.
  • Collaboration and Innovation: Work closely with diverse stakeholders and cross-functional teams, including software engineers, to integrate cloud solutions, gather requirements, and execute Proofs of Concept (POCs). Foster strong collaboration and communication. Guide designs and processes with a focus on resilience and minimizing manual effort. Promote the adoption of common tooling and components, and implement software and tools to enhance resilience and automate operations. Be open to adopting new tools and approaches as needed.

Required Skills and Experience:

  • Cloud Platforms: Demonstrated expertise in at least one major cloud platform (AWS, Azure, or GCP).
  • Infrastructure Management: Proven proficiency in on-premises hosting and virtualization platforms (VMware, Hyper-V, or KVM). Solid understanding of storage internals (NAS, SAN, EFS, NFS) and protocols (FTP, SFTP, SMTP, NTP, DNS, DHCP). Experience with networking and firewall technologies. Strong hands-on experience with Linux internals and operating systems (RHEL, CentOS, Rocky Linux). Experience with Windows operating systems to support varied environments.
  • Containers & Orchestration: Extensive experience with containerization (Docker) and orchestration (Kubernetes) technologies.
  • Automation & IaC: Proficiency in scripting languages (shell and Python). Experience with configuration management tools (Ansible or Puppet). Must have exposure to Infrastructure as Code (IaC) tools (Terraform or CloudFormation).
  • Monitoring & Observability: Experience setting up and configuring monitoring tools (Prometheus, Grafana, or the ELK stack). Hands-on experience implementing OpenTelemetry for observability. Familiarity with monitoring and logging tools for cloud-based applications.
  • Service Reliability Concepts: A strong understanding of SLI, SLO, SLA, and error budgeting.
  • Soft Skills & Mindset: Excellent communication and interpersonal skills for effective teamwork. We value proactive individuals who are eager to learn and adapt in a dynamic environment. Must possess a pragmatic and adaptable mindset, with a willingness to step outside comfort zones and acquire new skills. Ability to consider the broader system impact of your work. Must be a change advocate for reliability initiatives.

Desired/Bonus Skills:

  • Experience with DevOps toolchain elements like Git, Jenkins, Rundeck, ArgoCD, or Crossplane.
  • Experience with database and data-platform management, particularly MySQL and Hadoop.
  • Knowledge of cloud cost management and optimization strategies.
  • Understanding of cloud security best practices, including data encryption, access controls, and identity management.
  • Experience implementing disaster recovery and business continuity plans.
  • Familiarity with ITIL (Information Technology Infrastructure Library) processes.

About The Company

We are Pragmatists. Today, there are 3 distinct types of companies: the Pretenders, the Fairytale Startups, and the Pragmatists. At our core, we embody the latter. We prefer to keep it real. Unlike the Pretenders, we want our core values to guide decision-making and show up in the way people think, feel, and act on a daily basis. Instead of being a Fairytale Startup, we want our people to think of us as their work home away from home (not a theme park) and to feel that they are making a huge impact. Our employees use their creativity and talent to invent new solutions, meet demands, and offer the most effective services and products.

Get notified when new jobs are added by high radius
