Senior Systems Engineer HPC - R-21841

1 Hour ago • 10 Years +

System Design

Job Description

This Senior Systems Engineer HPC role involves comprehensive system administration and maintenance of HPC clusters, including installation, configuration, updates, and troubleshooting. The position focuses on performance optimization, cluster and resource management using tools like Slurm and LSF, and managing high-speed networking and parallel file systems. Key responsibilities also include implementing security controls, automating deployments with DevOps tools, providing user support, and contributing to infrastructure planning and innovation, potentially exploring cloud-based HPC solutions.

Good To Have:

Proficiency in scripting languages (Python, Bash, R).
Familiarity with MPI libraries for parallel and distributed computing.
Knowledge of HPC in cloud environments (AWS, Azure, GCP HPC offerings).

Must Have:

Install, configure, and maintain HPC clusters (hardware, software, operating systems).
Perform regular updates/patching, manage user accounts, and troubleshoot hardware/software issues.
Monitor and analyze system/application performance, identify bottlenecks, and implement tuning solutions.
Manage and optimize job scheduling, resource allocation, and cluster operations using tools like Slurm, LSF, and Bright Cluster Manager.
Configure, manage, and tune Linux networking and high-speed HPC interconnects (InfiniBand, Ethernet).
Implement and maintain large-scale storage and parallel file systems (Lustre, Ceph, GPFS).
Implement security controls, ensure compliance, and manage authentication services (LDAP, Active Directory).
Automate deployments, application packaging (RPM/DEB), and system configurations using Ansible, Terraform, Jenkins, and Git.
Provide technical support, documentation, and training to researchers.
Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
Minimum 10 years systems experience, with at least 5 years in HPC.
Strong knowledge of Linux operating systems, internals, administration, and performance tuning.
Experience building and managing RPM and DEB packages.
Proficiency with job schedulers and resource managers like Slurm and LSF.
Strong understanding of Linux networking and HPC interconnects.
Knowledge of parallel file systems like Lustre, Ceph, or GPFS.
Working knowledge of Linux authentication and directory services.
Strong experience with DevOps and configuration management tools (Ansible, Terraform, Jenkins, Git).
Strong knowledge of Linux security, compliance standards, and data protection best practices.

Responsibilities:

System Administration & Maintenance: Install, configure, and maintain HPC clusters (hardware, software, operating systems), perform regular updates/patching, manage user accounts and permissions, and troubleshoot/resolve hardware or software issues.

Performance & Optimization: Monitor and analyze system and application performance, identify bottlenecks, implement tuning solutions, and profile workloads to improve efficiency.

Cluster & Resource Management: Manage and optimize job scheduling, resource allocation, and cluster operations using tools such as Slurm, LSF, Bright Cluster Manager / Base Command Manager, OpenHPC, and Warewulf.

Networking & Interconnects: Configure, manage, and tune Linux networking (TCP/IP, DNS, routing) and high-speed HPC interconnects (InfiniBand, Ethernet) to ensure low-latency, high-bandwidth communication.

Storage & Data Management: Implement and maintain large-scale storage and parallel file systems (Lustre, Ceph, GPFS), ensure data integrity, manage backups, and support disaster recovery.

Security & Authentication: Implement security controls, ensure compliance with policies, and manage authentication and directory services such as LDAP and Active Directory.

DevOps & Automation: Use configuration management and DevOps practices (Ansible, Terraform, Jenkins, Git) to automate deployments, application packaging (RPM/DEB), and system configurations.

User Support & Collaboration: Provide technical support, documentation, and training to researchers; collaborate with scientists, HPC architects, and engineers to align infrastructure with research needs.

Planning & Innovation: Contribute to the design and planning of HPC infrastructure upgrades, evaluate and recommend hardware/software solutions, and explore cloud-based HPC solutions where applicable.

Qualifications:

Bachelor’s degree in Computer Science, Engineering, or a related field (equivalent experience may substitute for degree).
Minimum of 10 years of systems experience, including at least 5 years working specifically with HPC.
Strong knowledge of Linux operating systems (e.g., Rocky Linux, Ubuntu) with a fundamental understanding of Linux internals, system administration, and performance tuning.
Experience building and managing RPM and DEB packages.
Experience with cluster management tools such as Bright Cluster Manager, OpenHPC stack, or Warewulf.
Proficiency with job schedulers and resource managers such as Slurm and LSF.
Strong understanding of Linux networking (e.g., TCP/IP, DNS, routing) and HPC interconnects (e.g., InfiniBand, Ethernet) including performance tuning.
Knowledge of parallel file systems such as Lustre, Ceph, or GPFS.
Working knowledge of Linux authentication and directory services such as LDAP and Active Directory.
Nice to have: proficiency in scripting languages (e.g., Python, Bash, R) and familiarity with MPI libraries for parallel and distributed computing.
Strong experience with DevOps and configuration management tools, including Ansible, Terraform, Jenkins, and Git.
Knowledge of HPC in cloud environments (e.g., AWS, Azure, GCP HPC offerings) is a plus.
Strong knowledge of Linux security, compliance standards, and data protection best practices.
Excellent communication, interpersonal, and problem-solving skills.

Company: Rackspace Technology

Location: India
