Cloud Observability and Performance Engineer

1 Month ago • 5+ Years • DevOps • $150,000 PA - $185,000 PA

Job Summary

Job Description

Halcyon is seeking a Cloud Observability and Performance Engineer to join their Chaos Cloud Engineering team. This role involves designing and implementing observability, monitoring, and performance strategies for cloud-hosted microservices that manage endpoint security agents at scale. The engineer will be responsible for ensuring the reliability, visibility, and performance optimization of backend systems supporting millions of endpoints globally. Key duties include building end-to-end observability for distributed cloud services, developing metrics pipelines and dashboards, ensuring high performance and availability, collaborating with various teams for troubleshooting, defining SLOs/SLIs, and automating performance testing.
Must have:
  • 5+ years experience in observability, SRE, or cloud performance
  • Strong experience with monitoring/observability stacks
  • Proficiency in cloud platforms (AWS, GCP, Azure)
  • Experience with containerization/orchestration (Docker, Kubernetes)
  • Experience designing/implementing performance testing frameworks
  • Knowledge of distributed systems, microservices
  • Proficiency in Python, Scala, or similar
  • Familiarity with CI/CD, IaC, Git
  • Ability to participate in on-call rotation
Good to have:
  • Experience with endpoint security platforms
  • Familiarity with SIEM or security analytics
  • Background in networking performance
  • Experience with SLA/SLO-driven operations
  • Knowledge of Go
  • Integrating performance tests into CI/CD
  • Familiarity with k6, Locust, JMeter, Gatling
Perks:
  • Comprehensive healthcare (medical, dental, vision)
  • 401k plan with employer contribution
  • Short and long-term disability coverage
  • Basic life and AD&D insurance
  • Medical and dependent care FSA options
  • Flexible PTO policy
  • Parental leave
  • Generous equity offering

Job Details

What we do:
Halcyon is the industry’s first dedicated, adaptive security platform that combines multiple proprietary advanced prevention engines along with AI models focused specifically on stopping ransomware.

Who we are:
Halcyon was formed in 2021 by a team of cyber industry veterans after battling the scourge of ransomware (and advanced threats) for years at some of the largest global security vendors. Comprised of leaders from Cylance (now BlackBerry), Accuvant (now Optiv), FireEye and ISS X-Force (now IBM), Halcyon is focused on building products and solutions for mid-market and enterprise customers.

As a remote-native, completely distributed global team, we recognize great talent can exist anywhere. We invite you to apply to a job you’re interested in, and we’ll work out a plan to meet your needs.

The Role:

We are looking for a Cloud Observability and Performance Engineer to join our Chaos Cloud Engineering team. In this role, you will design and implement observability, monitoring, and performance strategies for cloud-hosted microservices that manage and orchestrate endpoint security agents at scale. This position is critical to ensuring the reliability, visibility, and performance optimization of our backend systems that power cloud-based security operations for millions of endpoints worldwide.

Responsibilities: 

  • Design, build, and maintain end-to-end observability for distributed cloud services (telemetry, logging, tracing, alerting). 
  • Develop and optimize metrics pipelines and dashboards (e.g., Prometheus, Grafana, OpenTelemetry, Datadog).    
  • Ensure high performance, availability, and scalability of agent management systems in production.    
  • Collaborate with development, SRE, and security teams to troubleshoot production issues using observability tooling.    
  • Define and implement SLOs, SLIs, and performance benchmarks for cloud components and services.    
  • Instrument code and services to expose business-relevant metrics and latency bottlenecks.    
  • Automate performance regression testing and anomaly detection.    
  • Support proactive incident detection and real-time monitoring strategies across multi-cloud environments.
  • Design, implement, and own a performance testing framework to validate system throughput, latency, and scalability under load.
  • Define baseline performance thresholds and use observability tooling to monitor and validate results.    
  • Provide root cause analysis and performance tuning recommendations.    

Skills and Qualifications:

  • 5+ years of professional work experience in observability, site reliability, or cloud performance roles.    
  • Strong experience with monitoring and observability stacks (e.g., Prometheus, Grafana, ELK, OpenTelemetry, Datadog, AWS CloudWatch).    
  • Proficiency in cloud platforms (e.g., AWS, GCP, Azure) and cloud-native services (e.g., ECS, EKS, Lambda).  
  • Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).  
  • Hands-on experience designing and implementing performance or load testing frameworks for distributed systems.
  • Ability to define and validate throughput baselines, latency thresholds, and system limits under real-world traffic scenarios. 
  • Solid knowledge of distributed systems, microservices, and performance debugging.    
  • Proficiency in Python, Scala, or other language(s) for tooling and automation.    
  • Familiarity with CI/CD pipelines, infrastructure as code (e.g., Terraform), and version control (Git).    
  • Ability to participate in an on-call rotation to support observability infrastructure and assist with incident investigations.

Bonus Skills and Qualifications:

  • Experience with endpoint security platforms or agent-based systems.    
  • Familiarity with SIEM, security analytics, or cloud threat detection pipelines.    
  • Background in networking performance, TLS handshake optimization, or load balancing.    
  • Experience with SLA/SLO-driven operational excellence in high-scale environments.    
  • Knowledge of additional languages, such as Go.
  • Experience integrating performance tests into CI/CD pipelines and visualizing results using tools like Grafana, Datadog, or similar.
  • Familiarity with tools such as k6, Locust, JMeter, Gatling, or custom-built performance testing solutions.

Why Join Us?

  • Work on cutting-edge cloud infrastructure supporting global security products.    
  • Be a core part of a team building resilient systems at internet scale.    
  • Competitive salary, flexible work culture, and continuous learning opportunities.

Benefits:

Halcyon offers the following benefits to eligible employees:

  • Comprehensive healthcare (medical, dental, and vision) with premiums paid in full for employees and dependents.
  • 401k plan with a generous employer contribution.
  • Short and long-term disability coverage, basic life and AD&D insurance plans.
  • Medical and dependent care FSA options.
  • Flexible PTO policy.
  • Parental leave.
  • Generous equity offering.

The Company reserves the right to modify or change these benefits programs at any time, with or without notice.

Base Salary Range: $150,000 - $185,000

Bonus Target: 10%

In accordance with applicable state and federal laws, the range provided is Halcyon’s reasonable estimate of the base compensation for this role. The actual amount may differ based on non-discriminatory factors such as experience, knowledge, skills, abilities, and location. Base pay is one part of the total package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and equity in the Company.

We understand it takes a diverse team of highly intelligent, passionate, curious, and creative people to develop the exceptional product we are building. Our dynamic team has incredible perspectives to share, just as we know you do, and we take great pride in being an equal opportunity employer.
