Site Reliability Engineer, HPC and LSF

NVIDIA

North Carolina, United States (On-site)

Posted 5 months ago • 10+ years of experience • DevOps • $184,000 - $287,500 per year

Job Summary

Job Description

As a Site Reliability Engineer (SRE) at NVIDIA, you will collaborate with various teams to enhance the infrastructure supporting the development of cutting-edge chips. Responsibilities include managing workload schedulers (LSF, SLURM) in a large-scale HPC environment, automating deployments and monitoring, developing solutions for complex resource management, troubleshooting issues, and defining standard methodologies. You'll work with EDA and software experts to build new infrastructure, focusing on scalability, reliability, and high performance. This role directly contributes to the quality and speed of next-generation chip development.

Must have:

Extensive LSF/SLURM experience
Proficient in CentOS/RHEL
Docker expertise
UNIX scripting proficiency
Strong problem-solving skills
Excellent communication & teamwork

Good to have:

HPC/EDA workload performance tuning
Ansible experience
Perl proficiency
Distributed system understanding

Perks:

Equity
Benefits

11 skills required

unix

perl

docker

linux

ansible

communication

agile-development

innovation

problem-solving

performance-analysis

team-management

Job Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

As an SRE, you'll collaborate with various teams to improve our infrastructure environment within NVIDIA's Hardware Infrastructure team. You will enable our engineers to have the best environment on the planet to make the most innovative chips in the world. You will work with your team of EDA and software experts to build new infrastructure in an agile environment. You will continuously innovate and improve scalable, reliable, high-performance systems and tools to enable the next generation of chips!

What you’ll be doing:

Manage and support workload and resource schedulers in a large-scale HPC environment.
Automate Everything: Develop scripts to automate deployment, configuration management, and operational monitoring.
Develop solutions for complex computing resource management requirements.
Extract and leverage grid performance metrics for troubleshooting and performance optimization.
Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
Develop, define and document standard methodologies to share with internal teams.
Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.
Directly contribute to the overall quality of our next-generation chips and improve their time to market.

What we need to see:

Extensive knowledge of job scheduler administration (e.g., IBM Spectrum LSF or SLURM).
Proficient in administering CentOS/RHEL Linux distributions.
In-depth understanding of container technologies like Docker.
Proficiency in UNIX scripting languages.
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.
10+ years of experience in a large, distributed Linux environment.
BS in Computer Science, a similar degree, or equivalent experience.

Ways to stand out from the crowd:

Experience analyzing and tuning performance for a variety of HPC or EDA workloads.
Solid understanding of cluster configuration management tools such as Ansible.
Proficiency in Perl for maintaining legacy automation scripts.
Deep understanding of distributed system principles.

The base salary range is 184,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

NVIDIA

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
