Description
We're looking for a passionate and experienced System Reliability Engineer to play a key role in designing, implementing, and maintaining our evolving cloud-native platform. You'll be instrumental in shaping our reliability practices, automating operational tasks, and driving continuous improvement across our systems. This is an exciting time to join us as we embark on significant refactoring efforts and continue to leverage cutting-edge technologies.
What You'll Do:
- Design, build, and maintain highly available, scalable, and resilient systems on Google Cloud Platform (GCP).
- Proactively monitor system health, performance, and capacity, identifying and resolving issues before they impact users.
- Develop and implement automation for infrastructure provisioning, deployment, and operational tasks (e.g., CI/CD pipelines, disaster recovery).
- Collaborate with development teams to ensure new features are designed and implemented with reliability and operational excellence in mind.
- Manage and optimize our MongoDB Atlas instances, ensuring data integrity, performance, and security.
- Lead the effort to refactor our Redis services onto a more scalable and resilient Pub/Sub or Kafka-based architecture.
- Participate in on-call rotations and incident response, conducting thorough post-mortems and implementing preventative measures.
- Contribute to the development of best practices, runbooks, and documentation for system operations.
- Identify and implement opportunities for cost optimization without compromising reliability.
Requirements
- 5+ years of experience in a System Reliability Engineering, DevOps, or Site Reliability Engineering role.
- Strong hands-on experience with Google Cloud Platform (GCP) services (e.g., Compute Engine, Kubernetes Engine, Cloud SQL, Cloud Monitoring, Cloud Functions, Networking).
- Proven expertise in managing and optimizing MongoDB Atlas (or other cloud-hosted) databases.
- Solid experience with containerization technologies, particularly Docker and Kubernetes.
- Demonstrated experience with Infrastructure as Code (e.g., Terraform, Cloud Deployment Manager).
- Proficiency in scripting languages such as Python, Go, or Bash.
- Experience with message queuing systems such as Redis, RabbitMQ, or Kafka; direct experience with Kafka or Google Cloud Pub/Sub is a must.
- Familiarity with Prometheus, Grafana, or similar monitoring and alerting tools.
- Experience with service mesh technologies (e.g., Istio).
- Experience with CI/CD tools and practices.
- Strong understanding of network protocols, security best practices, and distributed systems.
- Excellent problem-solving skills, with a methodical approach to troubleshooting complex issues.
- Ability to communicate effectively with both technical and non-technical stakeholders.
- A proactive mindset, with a commitment to continuous learning and improvement.
Benefits
- Work remotely Monday - Friday, 40 hours a week (no weekends)
- Vacation: 10 business days a year
- Holidays: 5 National Holidays a year
- Company Holidays: 5 Company Holidays a year (Christmas Eve, Christmas Day, New Year's Eve, New Year's Day, Zipdev Day)
- Parental Leave
- Health Care Reimbursement
- Active Lifestyle Reimbursement
- Quarterly Home Office Reimbursement
- Payroll Deduction Purchase Plans
- Longevity Bonus
- Continuous Learning Bonus
- Access to Training and Professional Development Platforms
- Did we mention it's REMOTE?!!
One of our core values at Zipdev is "Be authentic," which is why we encourage you to answer the application form in your own words; we are interested in getting to know you, not a digital assistant.
Wondering how our remote environment or our payment method works? We've put together some helpful answers in the FAQs at the bottom of our career site. Take a look and let us know if you have any other questions!