Site Reliability Engineer, HPC and LSF

1 Month ago • 10 Years + • DevOps • $184,000 PA - $287,500 PA

Job Summary

Job Description

As a Site Reliability Engineer (SRE) at NVIDIA, you will collaborate with various teams to enhance the infrastructure supporting the development of cutting-edge chips. Responsibilities include managing workload schedulers (LSF, SLURM) in a large-scale HPC environment, automating deployments and monitoring, developing solutions for complex resource management, troubleshooting issues, and defining standard methodologies. You'll work with EDA and software experts to build new infrastructure, focusing on scalability, reliability, and high performance. This role directly contributes to the quality and speed of next-generation chip development.
Must have:
  • Extensive LSF/SLURM experience
  • Proficient in CentOS/RHEL
  • Docker expertise
  • UNIX scripting proficiency
  • Strong problem-solving skills
  • Excellent communication & teamwork
Good to have:
  • HPC/EDA workload performance tuning
  • Ansible experience
  • Perl proficiency
  • Distributed system understanding
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.
 

As an SRE, you'll collaborate with various teams to improve our infrastructure environment within NVIDIA's Hardware Infrastructure team. You will enable our engineers to have the best environment on the planet to make the most innovative chips in the world. You will work with your team of EDA and software experts to build new infrastructure in an agile environment. You will continuously innovate and improve scalable, reliable, high-performance systems and tools to enable the next generation of chips!
 

What you’ll be doing:

  • Manage and support workload and resource schedulers in a large-scale HPC environment.

  • Automate Everything: Develop scripts to automate deployment, configuration management, and operational monitoring.

  • Develop solutions for complex computing resource management requirements.

  • Extract and leverage grid performance metrics for troubleshooting and performance optimization.

  • Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.

  • Develop, define and document standard methodologies to share with internal teams.

  • Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.

  • Directly contribute to the overall quality of our next-generation chips and improve their time to market.


What we need to see:

  • Extensive knowledge of job scheduler administration (e.g., IBM Spectrum LSF or SLURM).

  • Proficient in administering CentOS/RHEL Linux distributions.

  • In-depth understanding of container technologies like Docker.

  • Proficiency in UNIX scripting languages.

  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.

  • Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.

  • 10+ years of experience in a large, distributed Linux environment.

  • BS in Computer Science or a similar degree, or equivalent experience.


Ways to stand out from the crowd:

  • Experience analyzing and tuning performance for a variety of HPC or EDA workloads.

  • Solid understanding of cluster configuration management tools such as Ansible.

  • Proficiency in Perl for maintaining legacy automation scripts.

  • Deep understanding of distributed system principles.

The base salary range is 184,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

