Senior Observability & Monitoring Engineer (SRE/DevOps)

1 Week ago • 7 Years +

DevOps

Job Description

We are seeking an experienced Observability & Monitoring Engineer (SRE/DevOps) to lead reliability, observability, and performance efforts for our most critical applications. This role bridges development, operations, and product, ensuring that our systems are robust and scalable and that they support strong business outcomes. The Senior Observability & Monitoring Engineer will design and optimize monitoring strategies, automate operational tasks, and serve as a technical mentor for reliability within the R&D organization.

##### Company Description

About CyberArk:

CyberArk (NASDAQ: CYBR) is the global leader in Identity Security. Centered on privileged access management, CyberArk provides the most comprehensive security offering for any identity – human or machine – across business applications, distributed workforces, hybrid cloud workloads, and throughout the DevOps lifecycle. The world’s leading organizations trust CyberArk to help secure their most critical assets. To learn more about CyberArk, visit our CyberArk blogs or follow us on X, LinkedIn, or Facebook.

##### Job Description

Key Responsibilities:

Architect, implement, and maintain advanced monitoring, logging, and alerting solutions using Datadog (mandatory), covering infrastructure, application, and business-level metrics.
Lead and optimize reliability, performance, and scalability efforts for PostgreSQL, Redis, SQS, K8s, and cloud-native environments.
Design, build, and maintain automations for operational tasks, deployments, and remediations (Infrastructure-as-Code, CI/CD, self-healing workflows).
Mentor engineers on reliability engineering best practices, monitoring usage, and troubleshooting methodologies.
Lead knowledge sharing by producing high-quality documentation, technical presentations, and internal training.
Perform capacity planning and performance tuning, and proactively address potential bottlenecks or scaling issues.
Stay current with SRE, DevOps, and cloud trends; evaluate and recommend new tools and approaches for continuous improvement.

#LI-Hybrid

#LI-CR1

##### Qualifications

7+ years of experience in SRE, DevOps, or production engineering roles supporting large-scale distributed systems.
Expertise architecting and operating monitoring, tracing, and alerting with Datadog (including custom metrics, dashboards, and advanced alerting techniques).
Experience with additional monitoring/observability platforms (e.g., Prometheus, Grafana, ELK stack).
Hands-on knowledge of PostgreSQL, Redis, SQS, and Kubernetes (deployment, troubleshooting, scaling, and performance optimization).
Advanced scripting/programming skills with Python, Bash, or another relevant language.
Track record of designing and implementing automated solutions (Infrastructure-as-Code, CI/CD pipelines, auto-remediation).
Strong communication skills, including technical writing, documentation, and presentation to diverse technical audiences.
Experience working closely with development, product, and architecture teams to embed reliability from the design phase.
Fluent technical English.

Preferred Qualifications:

Strong familiarity with SaaS, microservices architectures, and security best practices.
Cloud certifications (e.g., AWS Certified Solutions Architect, GCP Professional Cloud Engineer) are a plus.
Deep experience with chaos engineering, performance/load testing, and continuous improvement frameworks.
Demonstrated ability to mentor engineers, promote reliability culture, and foster knowledge sharing.
