Director, Data Center Operations - North America

Lambda

10+ Years | United States (Remote) | Full Time | 1 day ago

Job Summary

Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. The company's mission is to make compute as ubiquitous as electricity, providing access to artificial intelligence for everyone. Lambda is seeking a highly skilled and experienced Director of Data Center Operations to lead and support its North America data center operations. This role involves overseeing large-scale AI and high-performance computing (HPC) infrastructure, ensuring reliability, managing hardware, planning capacity, interfacing with providers, mentoring teams, and setting up new data centers to achieve world-class uptime and scalability for rapidly growing AI demands.

Must Have

Develop and execute North American data center operations strategy.
Drive continuous improvement across facility operations.
Lead multi-site operations team ensuring 24/7/365 reliability and SLA response.
Establish standardized procedures, metrics, and best practices.
Monitor operational KPIs: uptime, PUE, safety, and compliance.
Build, mentor, and scale high-performing operations teams.
Develop and manage operating budgets and capital expenditures.
Oversee strategic vendor partnerships with data center providers.
Ensure compliance with environmental, safety, and industry regulations.
Lead incident response and root cause analysis.
Act as primary contact for data center operations audits (SOC 2, ISO).
10+ years experience in data center operations, 7+ in leadership.
Proven experience supporting AI, HPC, or cloud infrastructure at scale.
Deep understanding of power, cooling, networking, capacity planning, DCIM, BMS.

Good to Have

Experience with GPU clusters.
Experience with AI infrastructure networking.
Experience with large-scale storage systems.
Familiarity with cloud-scale operational practices (AWS, Google, Microsoft).
Certifications like CDCDP, CDCP, PMP, or PE.

Perks & Benefits

Generous cash & equity compensation.
Health, dental, and vision coverage for you and your dependents.
Wellness and Commuter stipends for select roles.
401k Plan with 2% company match (USA employees).
Flexible Paid Time Off Plan.

Job Description

What You'll Do:

As Director of Data Center Operations for North America, you will lead and support large-scale AI and high-performance computing (HPC) infrastructure in all of Lambda’s North America data centers. You will oversee all aspects of data center operations, including reliability, hardware break/fix, capacity planning, provider interface, team mentorship, and new data center setup, ensuring world-class uptime, customer response, and scalability to meet rapidly growing AI infrastructure demands.

Key Responsibilities:

Strategic Leadership

Develop and execute the North American data center operations strategy aligned with AI infrastructure goals and organizational growth.
Drive continuous improvement across facility operations, emphasizing sustainability, efficiency, and resilience.
Partner with Engineering, Capacity Planning, and Infrastructure teams to forecast and support future AI and GPU-based compute requirements, and provide operational feedback on designs and system improvements.
Oversee expansion projects, retrofits, and site selection in collaboration with Data Center Infrastructure Engineering and HPC Architecture teams.

Operational Excellence

Lead a multi-site operations team ensuring 24/7/365 reliability, availability, and SLA response across all facilities.
Establish standardized procedures, metrics, and best practices for preventive maintenance, incident management, and service delivery.
Monitor operational KPIs including uptime, PUE, safety, and compliance with corporate and regulatory standards.
Implement automation and AI-driven monitoring solutions to optimize system performance and predictive maintenance.
Coordinate data center provider maintenance windows and communicate them to customers and impacted teams.

Team Leadership and Development

Build, mentor, and scale a high-performing team of operations managers, technicians, and engineers across multiple regions.
Routinely visit all sites to maintain standards, develop relationships, and identify areas of efficiency.
Foster a culture of safety, accountability, and continuous learning, driving data center operations teams to take on more responsibility and work up the stack.
Assist in the build-out of new data center whitespace and the deployment of AI infrastructure.

Financial and Vendor Management

Develop and manage operating budgets, capital expenditures, and cost-optimization initiatives.
Oversee strategic vendor partnerships with numerous data center providers for power, cooling, maintenance, and critical infrastructure components.

Risk and Compliance

Ensure compliance with environmental, safety, and industry regulations (e.g., NFPA, OSHA, ISO standards).
Lead incident response and root cause analysis to drive preventive improvements for incidents related to data center operations or infrastructure.
Act as primary point of contact for data center operations compliance audits such as SOC 2 and ISO.

Qualifications:

10+ years of experience in data center operations, with at least 7 years in a leadership role managing multi-site or hyperscale facilities.
Proven experience supporting AI, HPC, or cloud infrastructure at scale.
Deep understanding of power and cooling systems, networking, capacity planning, and facility automation tools (DCIM, BMS, etc.).
Strong track record of improving operational efficiency and managing relationships with data center providers.
Bachelor’s degree in Engineering, Computer Science, or a related field preferred; a Master’s degree is a plus.
Exceptional communication, cross-functional collaboration, and stakeholder management skills, with the ability to build relationships, consensus, and a positive team culture.
Willingness to travel (up to 50%) to data center sites across North America, including sites under construction.

Preferred Skills:

Experience with GPU clusters, AI infrastructure networking, and large-scale storage systems.
Familiarity with cloud-scale operational practices (e.g., AWS, Google, Microsoft data center standards).
Certifications such as CDCDP, CDCP, PMP, or PE are a plus.