Senior Site Reliability Engineer / Production Engineer

4 Months ago • 10-10 Years • DevOps

Job Summary

SambaNova is seeking a Senior Site Reliability Engineer with 10+ years of experience to design, deploy, and troubleshoot AI platforms and services. This role requires deep expertise in Linux systems, Kubernetes, and automation. The ideal candidate will be a strong problem solver with a passion for building reliable, scalable AI infrastructure.
Must have:
  • Linux Systems
  • Kubernetes Clusters
  • Production Engineering
  • Automation Expertise
Good to have:
  • Infrastructure as Code
  • Public Cloud Providers
  • Monitoring Systems
  • High Availability
Perks:
  • Competitive Salary
  • Equity Benefits

Job Details

About the job

The era of pervasive AI has arrived. In this era, organizations will use generative AI to unlock hidden value in their data, accelerate processes, reduce costs, drive efficiency and innovation to fundamentally transform their businesses and operations at scale.

SambaNova Suite™ is the first full-stack, generative AI platform, from chip to model, optimized for enterprise and government organizations. Powered by the intelligent SN40L chip, the SambaNova Suite is a fully integrated platform, delivered on-premises or in the cloud, combined with state-of-the-art open-source models that can be easily and securely fine-tuned using customer data for greater accuracy. Once adapted with customer data, customers retain model ownership in perpetuity, so they can turn generative AI into one of their most valuable assets.

Working at SambaNova

SambaNova’s mission is to be the number 1 platform for business AI. We are a full-stack provider of AI-specific chips, software, and models that come together to help every organization accelerate their AI journey.

This role presents a unique opportunity to shape the future of AI and the value it can unlock across every aspect of an organization’s business and operations, including building, securing, operating, and scaling the platform and infrastructure that enable us to deliver our groundbreaking capabilities to enterprise customers.

Job Description

As a site reliability engineer on the operations team, you will solve interesting challenges in a fast-paced environment by designing, deploying, and troubleshooting state-of-the-art AI platforms and services with great attention to reliability, security, scalability, operability, and performance. Working alongside engineering teams that are building cutting-edge technologies revolutionizing the AI landscape, you will leverage your experience across software, systems, infrastructure, and production operations to lead key initiatives that enable us to rapidly deliver reliable and scalable service for customers in a hybrid deployment pattern.

The ideal candidate for this highly visible and critical role will have the knowledge of a software engineer, the experience of a systems and infrastructure engineer, and a strong passion for troubleshooting and automation across bare-metal data center infrastructure and public cloud services.

This individual will be responsible for

  • Assume full-stack ownership for the successful delivery of our SambaNova services in a hybrid model, including, but not limited to, deployment, configuration, integrations, observability, and ongoing operations
  • Develop deep understanding of the end-to-end configurations, dependencies, customer requirements, and overall characteristics of the production services as the accountable owner for overall service operations
  • Administer systems and applications for multiple customer-facing production environments (hosted and on-premises), with a continued focus on improving efficiency, availability, and supportability through automation and well-defined runbooks
  • Partner and collaborate with product and engineering teams to recommend and implement improvements to the security, resilience, and operational readiness of our systems, with the flexibility to integrate into unique customer environments
  • Augment ongoing efforts to design and develop automation for deployments, updates and upgrades of the entire SambaNova software stack
  • Lead efforts to triage, debug, and fix issues related to networks, storage, operating systems, containers, and applications to drive proactive and reactive incident resolution and root cause analysis
  • Build the systems and tools for centralized command and control of distributed environments
  • Participate in on-call rotation responsibilities

Basic Qualifications

  • Bachelor's and/or Master's degree in CS or a related field
  • 10+ years of hands-on experience in SRE / production engineering roles with a focus on supporting, scaling, and ensuring the reliability of large-scale production services and infrastructure
  • Extensive experience in deploying, securing, managing, and operating Linux systems in globally distributed production environments
  • Good knowledge of containers with hands-on experience in deploying, managing, and troubleshooting Kubernetes clusters and components in private data centers as well as public cloud
  • Proficient with at least one modern programming language (Python preferred) and the willingness to learn new languages as required
  • A systematic problem-solving approach to troubleshooting and the desire to solve the root cause of common problems in 24x7 environments

Preferred Qualifications

  • Deep understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP, and experience troubleshooting network performance issues
  • Past experience deploying and managing systems and infrastructure in data centers, with the ability to debug and resolve recurring hardware issues
  • Experience delivering infrastructure as code with tools such as Ansible, Terraform, Git, Jenkins, Helm, and ArgoCD
  • Good working knowledge of build automation and continuous integration / delivery
  • Knowledge of virtualization and multiple hypervisor technologies
  • Experience with monitoring and logging systems such as Prometheus, Grafana, Nagios, and ELK, and the ability to identify new technologies as appropriate
  • Experience deploying applications and managing infrastructure in one or more public cloud providers (AWS, Azure, GCP) is highly desirable
  • Configuration and maintenance of web servers, load balancers, databases, storage systems and messaging systems
  • A passion to design for high availability and scale, with the discipline and desire for extensive automation
  • Strong communication skills with the ability and willingness to work with diverse teams and customers across multiple time zones

Preferred Qualifications

  • Experience working in a high-growth startup
  • A team player who demonstrates humility
  • Action-oriented with a focus on speed and results
  • Ability to thrive in a no-boundaries culture and make an impact on innovation

Benefits Summary For US-Based Full-Time Direct Employment Positions

(The Recruiter will provide benefit details for non-US-based roles)

SambaNova offers a competitive total rewards package, including base salary, equity, and benefits. We cover 95% of the premium for employee medical insurance and 77% of the premium for dependents, and we offer a Health Savings Account (HSA) with employer contribution. We also offer Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans, in addition to Flexible Spending Account (FSA) options such as Health Care, Limited Purpose, and Dependent Care. Our library of well-being benefits available to you and your dependents includes a full subscription to Headspace, Gympass+ membership with access to physical gyms, One Medical membership, counseling services with an Employee Assistance Program, and much more.

Submission Guidelines

Please note that in order to be considered an applicant for any position at SambaNova Systems, you must submit an application form for each position for which you believe you are qualified.

If you are a new, recent (within the last two years), or upcoming college graduate and are interested in opportunities with SambaNova Systems, please apply through our university job listings.

EEO Policy

SambaNova Systems is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to age (40 and over), color, disability, gender identity, genetic information, marital status, military or veteran status, national origin/ancestry, race, religion, creed, sex (including pregnancy, childbirth, breastfeeding), sexual orientation, or any other applicable status protected by federal, state, or local laws.