Cloud Operations Engineer – Monitoring Lead
Extreme Networks
Job Summary
Extreme is seeking a highly skilled and experienced Cloud Operations Engineer – Monitoring Lead to join its growing Cloud Operations team. This critical role involves designing, implementing, and optimizing a comprehensive monitoring and alerting strategy across cloud infrastructure and applications. The lead will drive proactive issue identification, ensure system health, and contribute to operational excellence and reliability. Responsibilities include leading the design and improvement of monitoring frameworks for cloud infrastructure (AWS, Azure, GCP), applications, and services; defining KPIs, SLIs, and SLOs; evaluating and integrating monitoring tools; and developing automation scripts. The role also requires building dashboards, analyzing data for performance bottlenecks, collaborating with engineering teams, and providing 24/7 support for cloud services.
Must Have
- Lead monitoring and alerting strategy
- Define KPIs, SLIs, SLOs
- Evaluate and integrate monitoring tools
- Develop automation scripts
- Build dashboards and alerts
- Analyze monitoring data
- Collaborate with engineering teams
- Create documentation
- BS technical degree
- 8+ years in Cloud Ops/DevOps/SRE
- Expertise in AWS, Azure, or GCP
- Technical lead experience
- Working knowledge of Docker, Kubernetes
- Experience with Prometheus, Grafana, Datadog, Splunk
- Problem-solving and analytical skills
Good to Have
- Computer Science or Engineering background
- Working knowledge of Elasticsearch, PostgreSQL, Redis, Ignite, Kafka, RabbitMQ
- Comfortable working in distributed teams
Job Description
Responsibilities
- Lead the design, implementation, and continuous improvement of our end-to-end monitoring and alerting framework for cloud infrastructure (AWS, Azure, GCP), applications, and services.
- Define key performance indicators (KPIs), service level indicators (SLIs), and service level objectives (SLOs) for critical systems.
- Evaluate, select, and integrate monitoring tools (e.g., Prometheus, Grafana, Datadog, Splunk, CloudWatch, Azure Monitor, GCP Operations Suite) to meet evolving needs.
- Develop and implement automation scripts and tools (e.g., Python, Bash, PowerShell) to streamline monitoring deployment, configuration, and incident remediation.
- Build and maintain dashboards, alerts, and reports that provide actionable insights into system performance, health, and availability.
- Analyze monitoring data to identify performance bottlenecks, resource inefficiencies, and potential cost optimization opportunities.
- Collaborate with engineering teams to implement performance improvements and cost-saving measures.
- Create and maintain comprehensive documentation for monitoring systems, procedures, and best practices.
- Proactively identify areas for improvement in our cloud operations and monitoring capabilities.
- Provide 24/7 support for cloud services.
- Participate in cloud security and compliance implementation.
Ideal Qualifications
- BS-level technical degree required; Computer Science or Engineering background preferred.
- 8+ years of progressive experience in Cloud Operations, DevOps, or Site Reliability Engineering roles, with a strong focus on monitoring.
- Deep expertise with at least one major public cloud platform (AWS, Azure, or Google Cloud Platform).
- Proven experience as a technical lead or senior contributor in a monitoring-focused role.
- Working knowledge of container-based architecture and deployment (Docker, Kubernetes).
- Extensive experience with various monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, ELK Stack, vendor-specific monitoring solutions).
- Excellent problem-solving, analytical, and troubleshooting skills.
- Working knowledge of Elasticsearch, PostgreSQL, Redis, Ignite, Kafka, and RabbitMQ.
- Comfortable working within a distributed team located in multiple time zones.