Senior Site Reliability Engineer, HPC and LSF

NVIDIA

North Carolina, United States (On-site)

5 Months ago • 10 Years + • Devops • $184,000 PA - $287,500 PA

Job Summary

Job Description

As a Senior Site Reliability Engineer at NVIDIA, you will be responsible for designing and implementing high-performance compute clusters for silicon development. You will manage workload schedulers (like LSF), automate deployments, troubleshoot complex issues, and optimize grid performance. Collaboration with domain experts to improve chip development processes and contributing to time-to-market improvements are key aspects of this role. The ideal candidate possesses extensive HPC experience, strong scripting skills (Python, UNIX), and expertise in containerization (Docker).

Must have:

Extensive LSF/SLURM experience
Proficient in CentOS/RHEL
Docker expertise
UNIX scripting & Python
Problem-solving & analysis skills
Strong communication & teamwork

Good to have:

HPC/EDA workload performance tuning
Ansible experience
Perl proficiency
Distributed systems understanding

Perks:

Equity
Benefits

15 skills required

team-management

communication

problem-solving

performance-analysis

game-texts

networking

linux

unix

ansible

deep-learning

docker

python

perl

css

machine-learning

Job Details

NVIDIA is the leader in AI, machine learning and datacenter acceleration. NVIDIA is expanding that leadership into datacenter networking with Ethernet switches, NICs and DPUs. NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice, join our diverse team today!

As a member of the Hardware Infrastructure Farm team, you will provide leadership in the design and implementation of groundbreaking compute clusters that power all silicon development across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve engineers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you’ll be doing:

Manage and support workload and resource schedulers in a large-scale HPC environment.
Automate Everything: Develop automation scripts to automate deployment, configuration management, and operational monitoring.
Develop solutions for complex computing resource management requirements.
Extract and leverage grid performance metrics for troubleshooting and performance optimization.
Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
Develop, define and document standard methodologies to share with internal teams.
Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.
Directly contribute to the overall quality and improve time to market for our next generation chips.

What we need to see:

Extensive knowledge of job scheduler administration (e.g. IBM Spectrum LSF or SLURM).
Proficient in administering CentOS/RHEL Linux distributions.
In-depth understanding of container technologies like Docker.
Proficiency in UNIX scripting languages and Python.
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.
10+ years of experience in a large, distributed Linux environment.
BS in Computer Science, similar degree or equivalent experience.

Ways to stand out from the crowd:

Experience analyzing and tuning performance for a variety of HPC or EDA workloads.
Solid understanding of cluster configuration management tools such as Ansible.
Proficiency in Perl for maintaining legacy automation scripts.
Deep understanding of distributed system principles.

The base salary range is 184,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

NVIDIA

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.
