Lead Site Reliability Engineer - Federal Team
Saviynt
Job Summary
Saviynt is seeking a Lead Site Reliability Engineer for its Federal Team. This role involves performing customer deployments, migrations, and upgrades in cloud environments, installing and configuring Saviynt products, and troubleshooting incidents. Responsibilities include managing and maintaining cloud infrastructure on AWS, Azure, or Google Cloud, automating manual tasks, developing and maintaining CI/CD pipelines, and troubleshooting cloud-related infrastructure issues. The engineer will also automate infrastructure setup using Infrastructure as Code (IaC), collaborate with development and operations teams, maintain compliance with security and quality standards, and create technical documentation. The role also requires designing and implementing novel solutions to automate cloud environment provisioning, developing automation scripts that streamline processes, reduce repetitive tasks, and eliminate human error, and configuring and deploying monitoring tools.
Must Have
- U.S. Citizenship
- 8+ years of experience in observability, SRE, or cloud platform roles
- 4+ years of hands-on cloud experience (AWS, Azure)
- 3+ years of experience in software development (Python, Node.js, Java)
- Advanced expertise in container orchestration (Kubernetes)
- Hands-on experience with observability tools (Prometheus, Grafana, etc.)
- Experience driving adoption of SLOs, SLIs, error budgets
- Strong experience with IaC (Terraform, Helm)
- Proven leadership in setting engineering standards
Good to Have
- Meet U.S.-persons-on-U.S.-soil requirements
- Undergo a full background investigation and screening
- Meet IAL3 identity-proofing requirements
Perks & Benefits
- Competitive total rewards package
- Learning opportunities and tremendous room to grow and advance
- Discretionary bonus plan
Job Description
WHAT YOU WILL BE DOING
- Perform customer deployments, migrations, and upgrades in the cloud environment.
- Install and configure Saviynt product(s) following the installation procedure and organizational guidelines.
- Troubleshoot and resolve incidents, collaborating with the development and IT teams to minimize downtime and maintain service quality.
- Manage and maintain cloud infrastructure on platforms such as AWS, Azure, or Google Cloud. Monitor cloud resources to ensure availability and scalability.
- Automate manual work performed before, during, and after deployments.
- Troubleshoot cloud-related infrastructure incidents and issues.
- Develop and maintain CI/CD pipelines to ensure reliable and efficient software delivery. Monitor and troubleshoot issues within the CI/CD pipelines.
- Automate infrastructure setup and maintenance using Infrastructure as Code (IaC) tools.
- Collaborate with development, operations, and QA teams to improve deployment processes.
- Maintain compliance with security and quality standards throughout the CI/CD pipeline.
- Create and maintain technical documentation for cloud infrastructure and related processes.
- Design and implement novel solutions to automate cloud-environment provisioning.
- Develop and implement automation scripts that run specific tasks on systems, streamline processes, reduce repetitive work, and eliminate human error.
- Configure and deploy monitoring tools.
WHAT YOU BRING
- U.S. Citizenship: Applicants must be United States citizens.
- 8+ years of professional experience in observability, SRE, or cloud platform roles, with demonstrated success in leading strategic initiatives and cross-team collaborations.
- 4+ years of hands-on cloud experience (AWS, Azure), with a deep understanding of cloud-native architectures and observability practices.
- Proven track record of designing and operating highly available and resilient systems in public cloud environments (especially AWS).
- 3+ years of experience in software development using Python, Node.js, or Java, with a strong focus on automation, CI/CD integration, and DevOps practices.
- Advanced expertise in container orchestration platforms (Kubernetes) and service mesh technologies.
- Hands-on experience implementing observability at scale using tools such as Prometheus, Grafana, OpenTelemetry, ELK/OpenSearch, Datadog, CloudWatch, or Azure Monitor.
- Demonstrated success in driving adoption of SLOs, SLIs, error budgets, and automated alerting frameworks across engineering teams.
- Strong experience with infrastructure as code (e.g., Terraform, Helm) and automated deployment pipelines.
- Proven leadership in setting engineering standards, mentoring team members, and driving initiatives that reduce MTTD/MTTR and improve operational excellence.
- Strong analytical skills, communication capabilities, and a strategic mindset to influence and guide technical direction across large-scale engineering teams.
- Must meet U.S.-persons-on-U.S.-soil requirements.
- Must undergo a full background investigation and screening.
- Must satisfy IAL3 requirements (identity proofing, including I-9 document verification, biometric collection, and mailing address confirmation).