We are seeking a highly skilled and experienced Senior DevOps Engineer with a primary focus on Monitoring and Observability to drive the continuous improvement of our security-focused SaaS platform. In this role, you will work alongside engineering, security, and operations teams to ensure that our systems are secure, scalable, and highly available. Your work will directly affect the performance, uptime, and security of our offerings for clients around the world.
Key Responsibilities:
Monitoring & Observability: Design, implement, and maintain sophisticated monitoring, alerting, and logging solutions to ensure the reliability, availability, and performance of our security-focused SaaS platform. Use tools such as Prometheus, Grafana, and Datadog to provide deep visibility into system health, security metrics, and application performance.
Incident Management: Respond to and mitigate incidents in real time, ensuring minimal impact on customers. Drive post-mortems and root cause analyses (RCAs) to improve monitoring and response processes.
System Reliability: Collaborate with cross-functional teams to define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for both security and performance metrics.
Automation & CI/CD Integration: Build automated monitoring and alerting pipelines that integrate seamlessly with CI/CD workflows to catch issues early in development, testing, and production environments.
Mentorship & Best Practices: Provide guidance and mentorship to junior DevOps engineers, helping them adopt best practices for monitoring, observability, and security.
Optimization & Continuous Improvement: Continuously evaluate and refine monitoring tools and practices to adapt to new threats, technologies, and regulatory requirements.
Required Qualifications:
5+ years of experience in DevOps, Site Reliability Engineering, or Infrastructure roles, ideally in cybersecurity or SaaS environments.
Strong experience with monitoring tools like Prometheus, Grafana, Datadog, ELK, Splunk, or similar observability solutions.
Expertise in Linux/Unix-based systems and cloud environments (AWS, GCP, Azure).
Proficiency in scripting languages such as Python, Bash, or Go to automate monitoring tasks and create custom solutions.
Deep understanding of security principles and experience integrating security monitoring into DevOps practices (e.g., SIEM systems, threat detection).
Experience with containerization (Docker) and orchestration (Kubernetes) to monitor containerized applications in production.
Familiarity with Infrastructure-as-Code (IaC) tools like Terraform, Ansible, or CloudFormation to automate infrastructure monitoring setup.
Solid problem-solving skills, a keen eye for detail, and a proactive approach to system monitoring and incident response.
Preferred Qualifications:
Experience in cybersecurity or with security monitoring solutions.
Experience with performance monitoring and Application Performance Monitoring (APM) tools.
Background in a software engineering discipline or security engineering.
Certifications in relevant fields, such as AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), Certified DevOps Engineer, or similar.