Director, Data Center Operations - North America
Lambda
Job Summary
Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. The company's mission is to make compute as ubiquitous as electricity, providing access to artificial intelligence for everyone. Lambda is seeking a highly skilled and experienced Director of Data Center Operations to lead and support its North America data center operations. This role involves overseeing large-scale AI and high-performance computing (HPC) infrastructure, ensuring reliability, managing hardware, planning capacity, interfacing with providers, mentoring teams, and setting up new data centers to achieve world-class uptime and scalability for rapidly growing AI demands.
Must Have
- Develop and execute North American data center operations strategy.
- Drive continuous improvement across facility operations.
- Lead multi-site operations team ensuring 24/7/365 reliability and SLA response.
- Establish standardized procedures, metrics, and best practices.
- Monitor operational KPIs: uptime, PUE, safety, and compliance.
- Build, mentor, and scale high-performing operations teams.
- Develop and manage operating budgets and capital expenditures.
- Oversee strategic vendor partnerships with data center providers.
- Ensure compliance with environmental, safety, and industry regulations.
- Lead incident response and root cause analysis.
- Act as primary contact for data center operations audits (SOC 2, ISO).
- 10+ years of experience in data center operations, including 7+ in leadership.
- Proven experience supporting AI, HPC, or cloud infrastructure at scale.
- Deep understanding of power, cooling, networking, capacity planning, DCIM, BMS.
Good to Have
- Experience with GPU clusters.
- Experience with AI infrastructure networking.
- Experience with large-scale storage systems.
- Familiarity with cloud-scale operational practices (AWS, Google, Microsoft).
- Certifications like CDCDP, CDCP, PMP, or PE.
Perks & Benefits
- Generous cash & equity compensation.
- Health, dental, and vision coverage for you and your dependents.
- Wellness and Commuter stipends for select roles.
- 401(k) Plan with 2% company match (USA employees).
- Flexible Paid Time Off Plan.
Job Description
What You'll Do:
As Director of Data Center Operations for North America, you will lead and support large-scale AI and high-performance computing (HPC) infrastructure across all of Lambda’s North American data centers. You will oversee all aspects of data center operations, including reliability, hardware break/fix, capacity planning, provider interface, team mentorship, and new data center setup, ensuring world-class uptime, customer response, and scalability to meet rapidly growing AI infrastructure demands.
Key Responsibilities:
Strategic Leadership
- Develop and execute the North American data center operations strategy aligned with AI infrastructure goals and organizational growth.
- Drive continuous improvement across facility operations, emphasizing sustainability, efficiency, and resilience.
- Partner with Engineering, Capacity Planning, and Infrastructure teams to forecast and support future AI and GPU-based compute requirements, and provide operational feedback on designs and system improvements.
- Oversee expansion projects, retrofits, and site selection in collaboration with Data Center Infrastructure Engineering and HPC Architecture teams.
Operational Excellence
- Lead a multi-site operations team ensuring 24/7/365 reliability, availability, and SLA response across all facilities.
- Establish standardized procedures, metrics, and best practices for preventive maintenance, incident management, and service delivery.
- Monitor operational KPIs including uptime, PUE, safety, and compliance with corporate and regulatory standards.
- Implement automation and AI-driven monitoring solutions to optimize system performance and predictive maintenance.
- Coordinate data center provider maintenance windows and communicate them to customers and impacted teams.
Team Leadership and Development
- Build, mentor, and scale a high-performing team of operations managers, technicians, and engineers across multiple regions.
- Routinely visit all sites to maintain standards, develop relationships, and identify opportunities to improve efficiency.
- Foster a culture of safety, accountability, and continuous learning that drives the data center operations team to take on more responsibility and work up the stack.
- Assist in the build-out of new data center whitespace and the deployment of AI infrastructure.
Financial and Vendor Management
- Develop and manage operating budgets, capital expenditures, and cost-optimization initiatives.
- Oversee strategic vendor partnerships with numerous data center providers for power, cooling, maintenance, and critical infrastructure components.
Risk and Compliance
- Ensure compliance with environmental, safety, and industry regulations (e.g., NFPA, OSHA, ISO standards).
- Lead incident response and root cause analysis to drive preventive improvements for incidents related to data center operations or infrastructure.
- Act as the primary point of contact for data center operations compliance audits, such as SOC 2 and ISO.
Qualifications:
- 10+ years of experience in data center operations, with at least 7 years in a leadership role managing multi-site or hyperscale facilities.
- Proven experience supporting AI, HPC, or cloud infrastructure at scale.
- Deep understanding of power and cooling systems, networking, capacity planning, and facility automation tools (DCIM, BMS, etc.).
- Strong track record of improving operational efficiency and managing relationships with data center providers.
- Bachelor’s degree in Engineering, Computer Science, or a related field preferred; a Master’s degree is a bonus.
- Exceptional communication, cross-functional collaboration, and stakeholder management skills, with the ability to build relationships, consensus, and a positive team culture.
- Willingness to travel (up to 50%) to data center sites across North America, including sites under construction.
Preferred Skills:
- Experience with GPU clusters, AI infrastructure networking, and large-scale storage systems.
- Familiarity with cloud-scale operational practices (e.g., AWS, Google, Microsoft data center standards).
- Certifications such as CDCDP, CDCP, PMP, or PE are a plus.