Site Reliability Engineer

Flexera Software

Job Summary

Flexera is seeking a Site Reliability Engineer to design, implement, and maintain reliable, scalable, and secure cloud infrastructure on AWS. This role involves developing and managing Infrastructure as Code using Terraform, collaborating on CI/CD pipelines, and optimizing networking configurations. The engineer will also support on-call rotations, design robust observability solutions with platforms like Datadog and Prometheus, and troubleshoot complex production issues. Continuous evaluation and integration of new tools are essential to improve system performance and operational efficiency. Candidates should have 2-4 years of experience in SRE, DevOps, or Cloud Infrastructure, with strong AWS and networking skills.

Must Have

  • Design, implement, and maintain cloud infrastructure on AWS.
  • Develop and manage Infrastructure as Code (IaC) using Terraform.
  • Collaborate on CI/CD pipelines for automated deployments.
  • Optimize networking configurations (VPCs, subnets, routing, load balancers, DNS, security groups).
  • Participate in on-call rotations for system availability.
  • Design and maintain observability solutions (Datadog, Prometheus, Grafana, Coralogix, New Relic).
  • Troubleshoot production issues and perform root cause analysis.
  • 2-4 years of experience as an SRE, DevOps Engineer, or Cloud Infrastructure Engineer.
  • Strong experience with AWS services (EC2, S3, RDS, ECS/EKS, CloudWatch, IAM, Route 53, ALB/NLB).
  • Proficiency in Terraform.
  • Solid understanding of networking fundamentals (TCP/IP, DNS, VPN, firewalls).
  • Experience with CI/CD tools (GitHub Actions, AWS CodePipeline).
  • Familiarity with containerization and orchestration (Docker, Kubernetes, ECS).
  • Proficiency in Linux system administration, shell scripting, and automation (Bash, Python, Go).
  • Hands-on experience with monitoring platforms (ELK, OpenTelemetry).
  • Strong analytical, troubleshooting, and problem-solving skills.
  • Understanding of security best practices and compliance in cloud environments.

Good to Have

  • Experience with infrastructure automation frameworks (Ansible, Packer).
  • Expertise in Kubernetes for container orchestration, deployment, and scaling.
  • Exposure to multi-cloud or hybrid cloud setups.
  • Experience in incident response and postmortem analysis.
  • Familiarity with cost optimization and performance tuning in AWS environments.

Job Description

Key Responsibilities

  • Design, implement, and maintain reliable, scalable, and secure cloud infrastructure on AWS.
  • Develop and manage Infrastructure as Code (IaC) using Terraform, ensuring consistent and repeatable infrastructure deployments.
  • Collaborate with development teams to design, implement, and maintain CI/CD pipelines, ensuring smooth, automated deployments.
  • Maintain and optimize networking configurations (VPCs, subnets, routing, load balancers, DNS, security groups).
  • Support and participate in on-call rotations, ensuring high availability of critical systems.
  • Design, implement, and maintain robust observability solutions covering metrics, logs, and traces on platforms such as Datadog, Prometheus, Grafana, Coralogix, or New Relic, enabling proactive monitoring, alerting, and deep system visibility.
  • Troubleshoot complex production issues across systems and applications, driving root cause analysis (RCA) and long-term resolutions.
  • Continuously evaluate and integrate new tools and technologies to improve system performance and operational efficiency.

Required Skills & Experience

  • 2-4 years of hands-on experience as an SRE, DevOps Engineer, or Cloud Infrastructure Engineer.
  • Strong experience with AWS (EC2, S3, RDS, ECS/EKS, CloudWatch, IAM, Route 53, ALB/NLB, etc.).
  • Proficiency in Terraform for infrastructure provisioning and management.
  • Solid understanding of networking fundamentals (TCP/IP, DNS, VPN, load balancing, firewalls, routing).
  • Experience with CI/CD tools (GitHub Actions, AWS CodePipeline, etc.).
  • Familiarity with containerization and orchestration (Docker, Kubernetes, or ECS).
  • Proficiency in Linux system administration, shell scripting, and automation (Bash, Python, or Go).
  • Hands-on experience with monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, Coralogix, New Relic, ELK, or OpenTelemetry).
  • Strong analytical, troubleshooting, and problem-solving skills.
  • Good understanding of security best practices and compliance in cloud environments.

