Site Reliability Engineer (SRE) – AWS + Docker
Synechron
Job Summary
We are seeking a seasoned Site Reliability Engineer (SRE) to enhance the reliability, scalability, and performance of cloud-native systems. This role involves managing AWS infrastructure, Dockerized workloads, CI/CD pipelines, and observability. The SRE will also be responsible for incident response, collaborating with development, QA, and security teams to ensure high availability and cost-efficient operations.
Job Description
About the Role
We’re hiring a seasoned Site Reliability Engineer (SRE) to drive reliability, scalability, and performance across cloud-native systems. You will own AWS infrastructure, containerized workloads (Docker), CI/CD, observability, and incident response, partnering closely with Dev, QA, and Security to ensure high availability and cost‑efficient operations.
Key Responsibilities
- Define and maintain SLOs/SLIs/SLAs, error budgets, and reliability roadmaps.
- Build and manage AWS infrastructure (EC2/ECS/EKS, ALB/NLB, ASG, VPC, IAM, Route 53, S3, CloudFront, RDS/ElastiCache).
- Containerize and operate services using Docker; orchestrate via ECS/EKS/Kubernetes.
- Implement and optimize CI/CD pipelines (GitHub Actions/Jenkins/Azure DevOps) with blue/green & canary deployments.
- Set up observability: metrics, logs, traces (CloudWatch, Prometheus/Grafana, ELK/OpenSearch, X-Ray).
- Automate operations using Infrastructure as Code (Terraform/CloudFormation) and scripting (Python/Bash).
- Establish incident management: on-call, runbooks, RCA, postmortems, and remediation plans.
- Drive performance tuning, capacity planning, and cost optimization in AWS.
- Implement security best practices: least privilege IAM, secrets management, SSL/TLS, vulnerability scanning.
- Partner with development teams to design for reliability (circuit breakers, retries, backoff, health checks).
Required Skills & Qualifications
- 7+ years in SRE/DevOps/Cloud Operations with production ownership.
- Strong hands-on experience with AWS (EC2, ECS/EKS, IAM, VPC, ALB/NLB, Route 53, S3, CloudWatch).
- Solid experience with Docker and container orchestration (EKS/Kubernetes or ECS).
- Proficiency in CI/CD and release engineering: pipelines, artifact management, approvals.
- Strong skills in monitoring/alerting and incident response (on-call rotations, playbooks, MTTR reduction).
- Experience with IaC (Terraform/CloudFormation) and automation (Python/Bash).
- Good understanding of networking (DNS, TCP/IP, TLS, security groups/NACLs) and Linux internals.
- Knowledge of resiliency patterns: autoscaling, graceful degradation, rate limiting, caching.
- Excellent communication, collaboration, and documentation skills.
SYNECHRON’S DIVERSITY & INCLUSION STATEMENT
Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and an affirmative action employer. Our Diversity, Equity, and Inclusion (DEI) initiative ‘Same Difference’ is committed to fostering an inclusive culture by promoting equality, diversity and an environment that is respectful to all. As a global company, we strongly believe that a diverse workforce helps build stronger, more successful businesses. We encourage applicants of all backgrounds, races, ethnicities, religions, ages, marital statuses, genders, sexual orientations, and disabilities to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.
All employment decisions at Synechron are based on business needs, job requirements and individual qualifications, without regard to the applicant’s gender, gender identity, sexual orientation, race, ethnicity, disabled or veteran status, or any other characteristic protected by law.
About Us
At Synechron, we believe in the power of digital to transform businesses for the better. Our global consulting firm combines creativity and innovative technology to deliver industry-leading digital solutions. Synechron’s progressive technologies and optimization strategies span end-to-end Artificial Intelligence, Consulting, Digital, Cloud & DevOps, Data, and Software Engineering, servicing an array of noteworthy financial services and technology firms. Through research and development initiatives in our FinLabs we develop solutions for modernization, from Artificial Intelligence and Blockchain to Data Science models, Digital Underwriting, mobile-first applications and more.
Over the last 20+ years, our company has been honored with multiple employer awards, recognizing our commitment to our talented teams. With top clients to boast about, Synechron has a global workforce of 14,500+ and 58 offices in 21 countries within key global markets.
For more information on the company, please visit our website or LinkedIn community.
Sustainability and Health Safety Commitment
At Synechron, we are committed to integrating sustainability into our business strategy, ensuring responsible growth while minimizing environmental impact. Employees play a key role in driving our sustainability initiatives, from reducing our carbon footprint to fostering ethical and sustainable business practices across global operations. All positions are required to adhere to our Sustainability and Health Safety standards, demonstrating a commitment to environmental stewardship, workplace safety, and sustainable practices.