Senior Site Reliability Engineer, HPC and LSF

2 Days ago • 10 Years + • DevOps • $184,000 PA - $287,500 PA

Job Summary

Job Description

As a Senior Site Reliability Engineer at NVIDIA, you will be responsible for designing and implementing high-performance compute clusters for silicon development. You will manage workload schedulers (like LSF), automate deployments, troubleshoot complex issues, and optimize grid performance. Collaborating with domain experts to improve chip development processes and contributing to time-to-market improvements are key aspects of this role. The ideal candidate possesses extensive HPC experience, strong scripting skills (Python, UNIX), and expertise in containerization (Docker).
Must have:
  • Extensive LSF/SLURM experience
  • Proficient in CentOS/RHEL
  • Docker expertise
  • UNIX scripting & Python
  • Problem-solving & analysis skills
  • Strong communication & teamwork
Good to have:
  • HPC/EDA workload performance tuning
  • Ansible experience
  • Perl proficiency
  • Distributed systems understanding
Perks:
  • Equity
  • Benefits

Job Details

NVIDIA is the leader in AI, machine learning and datacenter acceleration. NVIDIA is expanding that leadership into datacenter networking with Ethernet switches, NICs and DPUs. NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice, join our diverse team today!
 

As a member of the Hardware Infrastructure Farm team, you will provide leadership in the design and implementation of groundbreaking compute clusters that power all silicon development across NVIDIA. We seek an expert to build and operate these clusters with high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve engineers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, and you will use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.


What you’ll be doing:

  • Manage and support workload and resource schedulers in a large-scale HPC environment.

  • Automate Everything: Develop scripts to automate deployment, configuration management, and operational monitoring.

  • Develop solutions for complex computing resource management requirements.

  • Extract and leverage grid performance metrics for troubleshooting and performance optimization.

  • Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.

  • Develop, define and document standard methodologies to share with internal teams.

  • Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.

  • Directly contribute to the overall quality of our next-generation chips and improve their time to market.


What we need to see:

  • Extensive knowledge of job scheduler administration (e.g. IBM Spectrum LSF or SLURM).

  • Proficiency in administering CentOS/RHEL Linux distributions.

  • In-depth understanding of container technologies like Docker.

  • Proficiency in UNIX scripting languages and Python.

  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.

  • Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.

  • 10+ years of experience in a large, distributed Linux environment.

  • BS in Computer Science, similar degree or equivalent experience.


Ways to stand out from the crowd:

  • Experience analyzing and tuning performance for a variety of HPC or EDA workloads.

  • Solid understanding of cluster configuration management tools such as Ansible.

  • Proficiency in Perl for maintaining legacy automation scripts.

  • Deep understanding of distributed system principles.

The base salary range is 184,000 USD - 287,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About The Company

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

