Data Center Operations Engineer
Cadence
Job Description
At Cadence, we hire and develop leaders and innovators who want to make an impact on the world of technology.
Job Summary
The Data Center Operations Engineer is responsible for supporting, maintaining, and deploying critical data center infrastructure with a strong focus on Linux-based systems, GPU server deployments, and InfiniBand networking. This role requires hands-on expertise in data center operations, cluster bring-up, hardware installation, and troubleshooting across compute, network, and GPU environments. The engineer will collaborate closely with global infrastructure, development, and operations teams to ensure reliable, secure, and scalable service delivery.
Key Responsibilities
- Provide hands-on operational support for all data center projects, deployments, and repair activities.
- Participate in an on-call rotation and provide on-site or remote support during maintenance windows and incidents.
- Troubleshoot and resolve operational issues related to Linux servers, GPU platforms, networking, and storage infrastructure.
- Support customer and internal deployments, ensuring timely and successful bring-up of GPU servers and clusters.
- Perform InfiniBand fabric bring-up, switch configuration, subnet management, and troubleshooting.
- Conduct daily health checks of Linux systems and infrastructure components, proactively identifying and mitigating risks.
- Install, configure, test, and maintain server hardware (rack and stack, labeling, HDDs, memory, CPUs, RAID batteries, NICs, etc.).
- Install, configure, and troubleshoot networking equipment including routers, switches, and terminal servers for out-of-band management.
- Review and validate equipment deployments against approved design documentation and standards.
- Support data center builds, refreshes, migrations, and expansions while adhering to quality and safety standards.
- Coordinate with vendors and onsite staff for hardware delivery, diagnostics, replacement, and warranty services.
- Utilize monitoring and alerting frameworks to identify issues, escalate appropriately, and ensure timely service restoration.
- Maintain accurate documentation of operational procedures, system configurations, and runbooks.
- Follow established incident management, escalation procedures, and service-level agreements (SLAs).
- Collaborate with global teams across time zones to support operational initiatives and continuous improvement efforts.
- Contribute to process improvement initiatives and ensure adherence to documented policies, processes, and procedures.
Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
- Strong hands-on experience in Linux environments, including system administration, troubleshooting, and performance validation.
- Proficiency with Linux command-line tools and shell scripting (Bash or equivalent).
- Experience with cluster bring-up, driver installation, and system-level configuration.
- Hands-on experience setting up and validating GPU servers in clustered environments.
- Experience with end-to-end GPU testing in InfiniBand-based clusters.
- Working knowledge of InfiniBand networking, including switch configuration and subnet management.
- Solid understanding of networking fundamentals, including the OSI model and TCP/IP protocol suite (IP, ARP, ICMP, TCP, UDP, SMTP, FTP, TFTP).
- Experience installing, configuring, and troubleshooting routers, switches, and terminal servers.
- Familiarity with fiber and copper cabling, including IP and SAN deployments.
- Experience managing incident tickets, maintaining acceptable ticket loads, and meeting SLAs.
- Strong organizational skills with meticulous attention to detail in data center environments.
- Ability to follow and enforce documented escalation procedures and operational policies.
- Strong verbal and written communication skills, with the ability to collaborate effectively with cross-functional and global teams.
Preferred Qualifications
- Experience supporting HPC, AI, or large-scale GPU environments.
- Exposure to data center monitoring.
- Experience documenting operational processes and maintaining technical runbooks.
- Familiarity with large-scale data center buildouts or refresh programs.
Physical Requirements
- Ability to perform the essential functions of the role, including lifting, moving, and installing equipment weighing 50 pounds or more, with or without reasonable accommodation.
- Ability to work in data center environments, including raised floors, equipment racks, and confined spaces.
- Willingness to work flexible hours, including nights, weekends, and on-call rotations as required.
Work Environment
- On-site data center environment with occasional remote coordination.
- Interaction with hardware vendors, service providers, and internal engineering teams.
- Fast-paced operational setting requiring attention to detail, adherence to safety standards, and rapid problem resolution.
We’re doing work that matters. Help us solve what others can’t.
Equal Employment Opportunity Policy:
Cadence is committed to equal employment opportunity throughout all levels of the organization.
We welcome your interest in the company and want to make sure our job site is accessible to all. If you experience difficulty using this site or wish to request a reasonable accommodation, please contact staffing@cadence.com.
Privacy Policy:
Job Applicant: If you are a job seeker creating a profile using our careers website, please see the privacy policy.
E-Verify: Cadence participates in the E-Verify program in certain U.S. locations as required by law.
Cadence plays a critical role in creating the technologies that modern life depends on. We are a global electronic design automation company, providing software, hardware, and intellectual property to design advanced semiconductor chips that enable our customers to create revolutionary products and experiences.
Thanks to the outstanding caliber of the team and the empowering culture that we have cultivated for over 25 years, Cadence continues to be recognized by Fortune Magazine as one of the 100 Best Companies to Work For. Our shared passion for solving the world’s toughest technical challenges, our dedication to pushing the limits of the industry, and our drive to do meaningful work differentiate the people of Cadence.
Cadence is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, basis of disability, or any other protected class.