Site Reliability Engineer - US Government

5+ Years • $180,000 PA - $440,000 PA
Devops

Job Description

xAI is seeking a highly skilled Senior Infrastructure Engineer for its US Government Team. The role involves designing, building, and operating secure, scalable infrastructure for critical government AI projects. Responsibilities include developing and managing training and inference clusters and reliable applications across bare metal, classified cloud, and hybrid cloud architectures. The engineer will leverage expertise in Kubernetes and GPU hardware to deliver robust, secure systems supporting large-scale AI workloads, while meeting stringent federal compliance. A passion for automation, observability, and system integrity in a high-security environment is essential.
Good To Have:
  • Deep familiarity with installing and using GPU hardware, including drivers and debugging.
  • Experience with high-traffic web/mobile application workloads, optimizing Kubernetes for large-scale deployments.
  • Familiarity with chaos engineering or capacity planning for system resilience.
  • Proficiency with tools such as Kyverno or ArgoCD, or with Go programming, for infrastructure automation.
  • Strong sense of ownership, curiosity, and enthusiasm for complex technical challenges.
  • Passion for problem-solving and proactive drive to deliver impactful results.
  • Certifications in security-related fields (e.g., CISSP) or experience in secure federal environments.
Must Have:
  • Develop and optimize software for infrastructure provisioning and management.
  • Enhance infrastructure reliability, performance, and cost-effectiveness for AI workloads.
  • Design tailored solutions meeting government-specific needs and compliance standards.
  • Implement robust observability, monitoring, and security practices.
  • Manage storage infrastructure using IaC tools like Pulumi, Terraform, or Ansible.
  • Drive system reliability via incident management, postmortems, and defining SLAs/SLOs.
  • Possess an active Top Secret (TS) security clearance.
  • 5+ years of experience as an Infrastructure Engineer or Site Reliability Engineer.
  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI.
  • Excellent communication and documentation skills.
Perks:
  • Equity
  • Comprehensive medical, vision, and dental coverage
  • Access to a 401(k) retirement plan
  • Short & long-term disability insurance
  • Life insurance
  • Various other discounts and perks

About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

We are seeking a highly skilled Senior Infrastructure Engineer to join our US Government Team, focused on designing, building, and operating secure, scalable infrastructure for critical government projects. In this role, you will develop and manage training and inference clusters, as well as highly reliable applications, across bare metal, classified cloud, and hybrid cloud architectures. You will leverage your expertise in Kubernetes and GPU hardware to deliver robust, secure systems that support large-scale AI workloads while meeting stringent federal compliance requirements. This role demands a passion for automation, observability, and ensuring system integrity in a fast-paced, high-security environment.

Responsibilities

  • Develop and optimize software to provision and manage xAI’s infrastructure across on-premise, virtual machine, and classified cloud environments, enabling efficient scaling for US government initiatives.
  • Enhance the reliability, performance, and cost-effectiveness of infrastructure to support large-scale AI and application workloads in secure, classified settings.
  • Collaborate with xAI engineers to understand workload requirements and design tailored solutions that meet government-specific needs and compliance standards.
  • Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems, adhering to federal protocols.
  • Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible, with a focus on secure data handling.
  • Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs, while maintaining security and compliance.
  • This is an in-person role based in Palo Alto, CA or Washington, DC, with up to 50% travel required.

Required Qualifications

  • Active Top Secret (TS) security clearance.
  • 5+ years of experience as an Infrastructure Engineer, Site Reliability Engineer, or similar role, with a focus on building and maintaining reliable, scalable systems, preferably in secure or government environments.
  • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.
  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.
  • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.
  • Excellent communication and documentation skills, with the ability to convey information concisely and accurately while handling sensitive material appropriately.

Preferred Qualifications

  • Deep familiarity with installing and using GPU hardware, including setting up drivers, debugging issues, and ensuring reliability.
  • Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments in classified or federal settings.
  • Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience in government projects.
  • Proficiency with tools such as Kyverno or ArgoCD, or with Go programming, for infrastructure automation.
  • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges in secure environments.
  • Passion for problem-solving and a proactive drive to deliver impactful results while adhering to security protocols.
  • Certifications in security-related fields (e.g., CISSP) or experience in secure federal environments.

Interview Process

After submitting your application, our team will review your CV and statement of exceptional work. If your application advances, you will be invited to a 15-minute phone interview to discuss basic qualifications. Successful candidates will proceed to the main process, which includes:

1. A technical deep-dive discussing your infrastructure and secure systems experience.

2. A hands-on challenge focused on designing or troubleshooting infrastructure for secure environments.

3. A meet-and-greet with the wider team.

Our goal is to complete the main interview process within one week.

Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer.
