Description
We're looking for a passionate and experienced System Reliability Engineer to play a key role in designing, implementing, and maintaining our evolving cloud-native platform. You'll be instrumental in shaping our reliability practices, automating operational tasks, and driving continuous improvement across our systems. This is an exciting time to join us as we embark on significant refactoring efforts and continue to leverage cutting-edge technologies.
What You'll Do:
- Design, build, and maintain highly available, scalable, and resilient systems on Google Cloud Platform (GCP).
- Proactively monitor system health, performance, and capacity, identifying and resolving issues before they impact users.
- Develop and implement automation for infrastructure provisioning, deployment, and operational tasks (e.g., CI/CD pipelines, disaster recovery).
- Collaborate with development teams to ensure new features are designed and implemented with reliability and operational excellence in mind.
- Manage and optimize our MongoDB Atlas instances, ensuring data integrity, performance, and security.
- Lead the effort to refactor our Redis services onto a more scalable and resilient Pub/Sub or Kafka-based architecture.
- Participate in on-call rotations and incident response, conducting thorough post-mortems and implementing preventative measures.
- Contribute to the development of best practices, runbooks, and documentation for system operations.
- Identify and implement opportunities for cost optimization without compromising reliability.
Requirements
- 5+ years of experience in a System Reliability Engineering, DevOps, or Site Reliability Engineering role.
- Strong hands-on experience with Google Cloud Platform (GCP) services (e.g., Compute Engine, Kubernetes Engine, Cloud SQL, Cloud Monitoring, Cloud Functions, Networking).
- Proven expertise in managing and optimizing MongoDB Atlas (or other cloud-hosted) databases.
- Solid experience with containerization technologies, particularly Docker and Kubernetes.
- Demonstrated experience with Infrastructure as Code (e.g., Terraform, Cloud Deployment Manager).
- Proficiency in scripting languages such as Python, Go, or Bash.
- Experience with message queuing systems such as Redis, RabbitMQ, or Kafka; direct experience with Kafka or Google Cloud Pub/Sub is a must.
- Familiarity with Prometheus, Grafana, or similar monitoring and alerting tools.
- Experience with service mesh technologies (e.g., Istio).
- Experience with CI/CD tools and practices.
- Strong understanding of network protocols, security best practices, and distributed systems.
- Excellent problem-solving skills, with a methodical approach to troubleshooting complex issues.
- Ability to communicate effectively with both technical and non-technical stakeholders.
- A proactive mindset, with a commitment to continuous learning and improvement.
Benefits
- Work remotely Monday - Friday, 40 hours a week (no weekends)
- Vacation: 10 business days a year
- Holidays: 5 National Holidays a year
- Company Holidays: 5 Company Holidays a year (Christmas Eve, Christmas Day, New Year's Eve, New Year's Day, Zipdev Day)
- Parental Leave
- Health Care Reimbursement
- Active Lifestyle Reimbursement
- Quarterly Home Office Reimbursement
- Payroll Deduction Purchase Plans
- Longevity Bonus
- Continuous Learning Bonus
- Access to Training and Professional Development Platforms
- Did we mention it's REMOTE?!!
One of our core values at Zipdev is "Be authentic," which is why we encourage you to answer the application form in your own words; we are interested in getting to know you, not a digital assistant.
Wondering how our remote environment or our payment method works? We've put together some helpful answers in the FAQs at the bottom of our career site. Take a look and let us know if you have any other questions!