Senior Site Reliability Engineer

5+ Years • DevOps

Job Description

Join Lytx's dynamic SRE team to design and support cutting-edge IoT infrastructure, transitioning to the cloud with "Operations as Code" and "Infrastructure as Code." This role is pivotal in ensuring the availability, reliability, observability, and resilience of Lytx's services, both on-premises and in the cloud. You will craft transformative solutions, design robust cloud infrastructure, and focus on continuous improvement, engineering the future of business continuity.
Must Have:
  • Design, implement, and maintain scalable and reliable systems.
  • Lead incident response, diagnose root causes, and implement long-term solutions.
  • Develop and maintain comprehensive monitoring and alerting systems.
  • Automate repetitive tasks and processes to improve operational efficiency.
  • Continuously optimize system performance to meet service level objectives (SLOs).
  • Forecast future system requirements and plan capacity upgrades.
  • Collaborate closely with development teams and mentor junior SREs.
  • Create and maintain detailed documentation on system design and operational practices.
  • 5+ years of SRE experience within AWS environments at medium to large-scale organizations.
  • 5+ years of hands-on experience implementing and managing observability tools (Prometheus, New Relic, Grafana, or similar).
  • Advanced programming skills in Python, Groovy, and Bash.
  • Strong understanding of database technologies, including both SQL and NoSQL systems.
  • 3+ years of experience developing and managing infrastructure deployment pipelines using Git, Terraform, Helm, Jenkins/Jenkins X/ArgoCD, or similar tools.
  • Proven expertise in designing, evaluating, and supporting production environments in AWS.
  • Hands-on experience with Linux systems and various network protocols and technologies.
  • Extensive experience with Kubernetes and container/cloud-native technologies.
  • Significant experience in managing 24/7 on-call rotations, creating runbooks, and monitoring systems.
  • Ability to thrive under pressure in a technically challenging environment.

Why Lytx:

Join our dynamic and passionate team of driven, low-ego engineers who are at the forefront of designing and supporting cutting-edge IoT infrastructure. As we rapidly grow and transition to the cloud, we're diving into the exciting realms of "Operations as Code," "Infrastructure as Code," and innovative infrastructure automation.

Our Site Reliability Engineering (SRE) team is pivotal in ensuring the availability, reliability, observability, and resilience of Lytx's services, both on-premises and in the cloud. We're not just keeping the lights on; we're engineering the future of our business's continuity.

If you're energized by crafting transformative solutions and excel at designing robust, detailed cloud infrastructure with a focus on continuous improvement, this could be the perfect role for you!

Responsibilities:

  • System Design and Architecture: Design, implement, and maintain scalable and reliable systems, ensuring they can handle both current and future demands.
  • Incident Management: Lead incident response efforts, diagnose root causes, and implement long-term solutions to prevent recurrence. Ensure effective communication during outages.
  • Monitoring and Observability: Develop and maintain comprehensive monitoring and alerting systems to proactively identify and address issues before they impact users.
  • Automation and Efficiency: Automate repetitive tasks and processes to improve operational efficiency and reduce manual intervention.
  • Performance Tuning: Continuously optimize system performance, including fine-tuning applications, databases, and infrastructure to meet service level objectives (SLOs).
  • Capacity Planning: Forecast future system requirements based on growth trends and current usage, and plan capacity upgrades to ensure system reliability.
  • Collaboration and Mentoring: Work closely with development teams to integrate reliability into the software development lifecycle. Mentor junior SREs and share best practices.
  • Documentation and Knowledge Sharing: Create and maintain detailed documentation on system design, incident response procedures, and operational practices to ensure knowledge is preserved and accessible.

Requirements:

  • 5+ years of experience as an SRE within AWS environments at medium to large-scale organizations.
  • 5+ years of hands-on experience implementing and managing observability tools, such as Prometheus, New Relic, Grafana, or similar.
  • Advanced programming skills in Python, Groovy, and Bash.
  • Strong understanding of database technologies, including both SQL and NoSQL systems.
  • 3+ years of experience developing and managing infrastructure deployment pipelines using Git, Terraform, Helm, Jenkins/Jenkins X/ArgoCD, or similar tools.
  • Proven expertise in designing, evaluating, and supporting production environments in AWS, including VPCs, EKS, IAM, AMI, EC2, CloudWatch, CloudTrail, Control Tower, GuardDuty, MSK, S3, Glacier, Gateways, Direct Connect, Route 53, RDS, ALBs, Auto Scaling, and more.
  • Hands-on experience with Linux systems and protocols and technologies such as HTTP, REST, TCP/IP, SSL, DNS, SMTP, SSH, NTP, Load Balancing, SQL/NoSQL, Message Brokers, Nginx, Vault, etc.
  • Extensive experience with Kubernetes and various container and cloud-native technologies.
  • Significant experience in managing 24/7 on-call rotations, creating runbooks, establishing support procedures, and proactively monitoring systems across multiple geographic locations.
  • Ability to thrive under pressure and excel in a technically challenging environment.

Innovation Lives Here

You go all in no matter what you do, and so do we. At Lytx, we’re powered by cutting-edge technology and Happy People. You want your work to make a positive impact in the world, and that’s what we do. Join our diverse team of hungry, humble and capable people united to make a difference.

Together, we help save lives on our roadways.

Find out how good it feels to be a part of an inclusive, collaborative team. We’re committed to delivering an environment where everyone feels valued, included and supported to do their best work and share their voices.

Lytx, Inc. is proud to be an equal opportunity/affirmative action employer and maintains a drug-free workplace. We’re committed to attracting, retaining and maximizing the performance of a diverse and inclusive workforce. EOE/M/F/Disabled/Vet.

About Us

Lytx® is a leading provider of video telematics, analytics, safety and productivity solutions for commercial and public sector fleets. Our unrivaled Driver Safety Program, powered by our best-in-class DriveCam® Event Recorder, is proven to help save lives and reduce risk. We harness the power of video to help clients see what happened in the past, manage their operations more efficiently in the present and improve driver behavior to change the future. Our customizable services and programs span driver safety, risk detection, fleet tracking, compliance and fuel management. Using the world’s largest driving database of its kind, along with proprietary machine vision and artificial intelligence technology, we help protect and connect thousands of fleets and more than one million drivers worldwide. For more information, visit www.lytx.com, @lytx on X, LinkedIn, our Facebook page or YouTube channel.

Privacy Notice: Please see Lytx’s Global Human Resources Privacy Statement for more information related to Personal Information we process and store related to our applicants.
