DevOps Engineer

1 Hour ago • All levels

Job Summary

Job Description

As a Site Reliability Engineer, you will be instrumental in designing and refining cloud infrastructure with a focus on reliability, security, and scalability. You will manage live production environments and contribute to automation and system improvements. Key responsibilities include designing cloud infrastructure, managing cloud platforms, enhancing service reliability, driving automation, and participating in incident response. Collaboration with cross-functional teams is essential. The role requires a proactive individual who is eager to learn and adapt.
Must have:
  • Expertise in AWS, Azure, or GCP.
  • Proficiency in on-premises hosting and virtualization platforms (VMware, Hyper-V, or KVM).
  • Experience with containerization (Docker) and orchestration (Kubernetes).
  • Proficiency in scripting languages (shell and Python).
  • Experience with configuration management tools (Ansible or Puppet).
  • Experience with Infrastructure as Code (IaC) tools (Terraform or CloudFormation).
  • Experience setting up and configuring monitoring tools (Prometheus, Grafana, or the ELK stack).
  • Understanding of SLI, SLO, SLA, and error budgeting.

Job Details

Job Summary:

We are looking for a highly skilled and adaptable Site Reliability Engineer to become a key member of our Cloud Engineering team. In this crucial role, you will be instrumental in designing and refining our cloud infrastructure with a strong focus on reliability, security, and scalability. As an SRE, you'll apply software engineering principles to solve operational challenges, ensuring the overall operational resilience and continuous stability of our systems. This position requires a blend of managing live production environments and contributing to engineering efforts such as automation and system improvements.

Key Responsibilities:

  • Cloud Infrastructure Architecture and Management: Design, build, and maintain resilient cloud infrastructure solutions to support the development and deployment of scalable and reliable applications. This includes managing and optimizing cloud platforms for high availability, performance, and cost efficiency.
  • Enhancing Service Reliability: Lead reliability best practices by establishing and managing monitoring and alerting systems to proactively detect and respond to anomalies and performance issues. Utilize SLI, SLO, and SLA concepts to measure and improve reliability. Identify and resolve potential bottlenecks and areas for enhancement.
  • Driving Automation and Efficiency: Contribute to the automation, provisioning, and standardization of infrastructure resources and system configurations. Identify and implement automation for repetitive tasks to significantly reduce operational overhead. Develop Standard Operating Procedures (SOPs) and automate workflows using tools like Rundeck or Jenkins.
  • Incident Response and Resolution: Participate in and help resolve major incidents, conduct thorough root cause analyses, and implement permanent solutions. Effectively manage incidents within the production environment using a systematic problem-solving approach.
  • Collaboration and Innovation: Work closely with diverse stakeholders and cross-functional teams, including software engineers, to integrate cloud solutions, gather requirements, and execute Proof of Concepts (POCs). Foster strong collaboration and communication. Guide designs and processes with a focus on resilience and minimizing manual effort. Promote the adoption of common tooling and components, and implement software and tools to enhance resilience and automate operations. Be open to adopting new tools and approaches as needed.

Required Skills and Experience:

  • Cloud Platforms: Demonstrated expertise in at least one major cloud platform (AWS, Azure, or GCP).
  • Infrastructure Management: Proven proficiency in on-premises hosting and virtualization platforms (VMware, Hyper-V, or KVM). Solid understanding of storage internals (NAS, SAN, EFS, NFS) and protocols (FTP, SFTP, SMTP, NTP, DNS, DHCP). Experience with networking and firewall technologies. Strong hands-on experience with Linux internals and operating systems (RHEL, CentOS, Rocky Linux). Experience with Windows operating systems to support varied environments.
  • Extensive experience with containerization (Docker) and orchestration (Kubernetes) technologies.
  • Automation & IaC: Proficiency in scripting languages (shell and Python). Experience with configuration management tools (Ansible or Puppet). Must have exposure to Infrastructure as Code (IaC) tools (Terraform or CloudFormation).
  • Monitoring & Observability: Experience setting up and configuring monitoring tools (Prometheus, Grafana, or the ELK stack). Hands-on experience implementing OpenTelemetry for observability. Familiarity with monitoring and logging tools for cloud-based applications.
  • Service Reliability Concepts: A strong understanding of SLI, SLO, SLA, and error budgeting.
  • Soft Skills & Mindset: Excellent communication and interpersonal skills for effective teamwork. We value proactive individuals who are eager to learn and adapt in a dynamic environment. Must possess a pragmatic and adaptable mindset, with a willingness to step outside comfort zones and acquire new skills. Ability to consider the broader system impact of your work. Must be a change advocate for reliability initiatives.

Similar Jobs

Looks like we're out of matches

Set up an alert and we'll send you similar jobs the moment they appear!

Similar Skill Jobs

Looks like we're out of matches

Set up an alert and we'll send you similar jobs the moment they appear!

Jobs in Hyderabad, Telangana, India

Looks like we're out of matches

Set up an alert and we'll send you similar jobs the moment they appear!

Similar Category Jobs

Looks like we're out of matches

Set up an alert and we'll send you similar jobs the moment they appear!

About The Company

We are Pragmatists. We prefer to keep it real. Our employees use their creativity and talent to invent new solutions, meet demands, and offer the most effective services/products. We want our employees to be a part of our incredible journey, even if it’s a little bumpy along the way (we’ll call those growing pains).

Hyderabad, Telangana, India (On-Site)

Hyderabad, Telangana, India (On-Site)

Hyderabad, Telangana, India (On-Site)

Hyderabad, Telangana, India (On-Site)

Hyderabad, Telangana, India (On-Site)

Hyderabad, Telangana, India (On-Site)

Hyderabad, Telangana, India (On-Site)

Houston, Texas, United States (On-Site)

Hyderabad, Telangana, India (On-Site)

Hyderabad, Telangana, India (On-Site)

View All Jobs

Get notified when new jobs are added by high radius

Level Up Your Career in Game Development!

Transform Your Passion into Profession with Our Comprehensive Courses for Aspiring Game Developers.

Job Common Plug