Senior HPC Engineer/Administrator (IMC - 001)
SageCor
Job Summary
SageCor Solutions is seeking a Senior HPC Engineer/Administrator to provide system administration and technical support for traditional and High-Performance Computing (HPC) systems in a research environment. Responsibilities include configuring and managing Linux and Windows operating systems, administering HPC clusters, implementing automation tools, and providing IT system support. The role requires expertise in troubleshooting, performance optimization, and supporting researchers, with a strong background in Linux, scripting and programming (C, Python), and a range of HPC technologies.
Must Have
- Active TS/SCI w/ Polygraph clearance
- Configure and manage Linux and Windows operating systems
- Administer, monitor, and maintain HPC systems
- Provide IT system support and problem resolution
- Implement and maintain automation tools
- Troubleshoot IT systems, server hardware, and applications
- Experience with Linux-based servers and HPC clusters
- Experience with job schedulers (Slurm, LSF, PBS)
- Configure and manage VPN clients and servers
- Scripting/programming in C and Python
- Knowledge of Ansible for automation
- Knowledge of Warewolf for provisioning
- Knowledge of distributed storage (Lustre, BeeGFS)
- Knowledge of containerization (Docker, Apptainer)
- Experience with monitoring tools (Grafana, Prometheus)
- Set up and execute HPC benchmarks
Good to Have
- Meets DoD 8140.01 or DoD 8570.01-M training and certification requirements
Job Description
Serving Maryland and the Greater Washington, D.C. area, SageCor Solutions (SageCor) is a growing company that brings complete engineering services and true full-lifecycle systems engineering services to areas requiring (or desiring) nationally recognized expertise in high-performance computing, large-scale data analytics, and cutting-edge information technologies.
Active TS/SCI w/ Polygraph required.
The Senior HPC Engineer/Administrator will be responsible for providing system administration and technical support for traditional and High-Performance Computing (HPC) systems in a research-driven environment.
Requirements:
- Configure and manage Linux and Windows (or other applicable) operating systems: install and load operating system software, troubleshoot and configure network components and maintain their integrity, and implement operating system enhancements to improve security, reliability, and performance
- Administer, monitor, and maintain HPC systems, including compute nodes, storage, networking, and software stacks
- Support IT systems, including day-to-day operations, monitoring, and problem resolution for client, server, storage, network, mobile, and other devices
- Implement and maintain automation tools for system provisioning, configuration management, and monitoring
- Provide support for implementation, troubleshooting and maintenance of IT systems
- Manage the daily activities of configuration and operation of IT systems
- Provide assistance to users in accessing and using IT systems
- Optimize system operations and resource utilization, and perform system capacity analysis and planning
- Apply in-depth experience in troubleshooting IT systems
- Analyze and resolve complex problems associated with server hardware, applications and software integration
- Contribute to performance benchmarking, system tuning, and capacity planning
- Support researchers by providing technical expertise and resolving IT-related roadblocks or issues
- Document system administration procedures and contribute to knowledge-sharing initiatives
Technical skills:
- Experience administering Linux-based servers and HPC clusters, including job schedulers (e.g., Slurm, LSF, PBS)
- Experience configuring and managing Virtual Private Network (VPN) clients and servers
- Scripting/programming skills (C and Python)
- Knowledge of:
  - System automation tools (e.g., Ansible)
  - System provisioning tools (e.g., Warewolf)
  - Distributed storage systems (e.g., Lustre, BeeGFS)
  - Containerization (e.g., Docker, Apptainer)
- Experience installing, maintaining, and using infrastructure and performance monitoring and optimization tools (e.g., Grafana, Prometheus)
- Experience setting up and executing benchmarks in an HPC environment and systematically analyzing their results
Qualifications:
- Active Top Secret/SCI clearance with polygraph
- Preferably meets DoD 8140.01 or DoD 8570.01-M training and certification requirements
Consistent with federal and state law where SageCor conducts business, SageCor Solutions provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability or veteran status, or any other protected class.