Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda’s mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.
If you'd like to build the world's best deep learning cloud, join us.
Travel: 50%. Travel is required to various data center sites.
What You’ll Do:
We are seeking an accomplished Advanced Cooling Facilities Manager specializing in Direct Liquid Cooling (DLC) systems to lead the global strategy, implementation, and operational excellence of Lambda’s next-generation liquid cooling infrastructure. This role will define methodologies and standards for the deployment, optimization, and scaling of cooling systems that enable Lambda’s GPU Cloud to deliver industry-leading performance for AI and machine learning workloads.
With deep domain expertise in liquid cooling technologies and critical facilities management, you will drive the design and operation of complex cooling ecosystems, including Coolant Distribution Units (CDUs), hybrid loop architectures, and advanced heat-rejection systems, across colocation and owned data center environments. You will work cross-functionally with internal and external experts to establish best practices, evaluate emerging technologies, and ensure that Lambda’s cooling infrastructure scales reliably and efficiently to support extreme rack densities.
Key Responsibilities:
Liquid Cooling Systems Strategy & Management
- CDU Operations & Optimization: Define and oversee operational standards and lifecycle management for all CDU systems (L2L and L2A), including performance optimization, reliability engineering, and capacity expansion strategies. Utilize advanced analytics to identify trends and implement predictive maintenance practices.
- Technical Loop Governance: Lead the design and management of multi-stage cooling loops — from facility to rack level — ensuring precise control of temperature, pressure, and flow rate across variable load conditions. Establish system performance benchmarks and quality assurance protocols for coolant integrity and flow balancing.
- System Integration Leadership: Coordinate and validate integration of CDUs with facility water systems (FWS), heat exchangers, and mechanical infrastructure. Develop standardized control sequences and commissioning procedures across multiple OEM platforms.
- Performance Engineering & Monitoring: Architect the monitoring framework for coolant system telemetry (pressure, differential pressure, temperature, flow, and conductivity) and leverage analytics for continuous improvement in thermal performance, redundancy, and energy efficiency.
- Predictive & Preventive Maintenance: Design and institutionalize maintenance methodologies, including condition-based maintenance schedules, failure-mode analysis, and reliability improvement plans for pumps, heat exchangers, and filtration systems.
Infrastructure Planning & Scaling
- Capacity Planning & Design Leadership: Evaluate and forecast thermal capacity requirements for high-density GPU clusters, driving design and procurement of CDUs and loop systems to support rack densities exceeding 1 MW. Develop multi-year cooling capacity roadmaps aligned with corporate growth strategies.
- Engineering Collaboration: Partner with data center design and mechanical engineering teams to co-develop cooling topologies, redundancy strategies, and modular infrastructure designs optimized for scalability and efficiency.
- Vendor & Technology Strategy: Act as the primary technical authority for liquid cooling vendor engagement — influencing product roadmaps, negotiating technical specifications, and qualifying emerging solutions such as direct-to-chip and immersion cooling.
- Innovation & Continuous Improvement: Evaluate and pilot next-generation cooling technologies and automation platforms to reduce PUE, enhance reliability, and support sustainability objectives.
- Cost & Efficiency Optimization: Establish performance metrics for cooling energy efficiency, uptime, and total cost of ownership. Drive initiatives to reduce CapEx/OpEx through standardization, component reuse, and intelligent control strategies.
Operations & Reliability
- Mission-Critical Operations: Oversee global operation of liquid cooling infrastructure with near-zero downtime objectives. Define escalation protocols, lead root-cause analysis for thermal incidents, and ensure resilience through redundancy and proactive risk management.
- Incident Command & Response: Act as the senior technical lead for major cooling incidents, coordinating cross-functional response teams and developing long-term corrective action plans.
- Documentation & Knowledge Management: Establish robust documentation standards, including P&IDs, SOPs, commissioning reports, and change logs, to ensure operational continuity and technical traceability.
- Regulatory & Environmental Compliance: Ensure adherence to all applicable codes, environmental standards, and safety protocols. Champion safe handling practices for coolants and system fluids.
- Team Leadership & Development: Mentor and develop specialized liquid cooling technicians and engineers, building a culture of technical excellence, safety, and continuous improvement across all facilities.
Colocation & Multi-Site Management
- Global Coordination: Lead liquid cooling deployment and operational programs across colocation and owned facilities worldwide, ensuring alignment with Lambda’s technical standards and SLAs.
- Standardization & Governance: Define and enforce standardized cooling system configurations, control sequences, and operating parameters across all sites to ensure uniform performance and maintainability.
- Remote Monitoring & Analytics: Deploy and manage advanced remote monitoring and control systems (DCIM/BMS integrations) for multi-site visibility, predictive analytics, and fault detection.
- Scalability & Future Growth: Architect the global cooling expansion framework to support rapid scaling of Lambda’s GPU cloud services, integrating modular and prefabricated cooling components for deployment speed and flexibility.
Ideal Candidate Profile:
- Deep technical mastery of liquid cooling systems and their application in mission-critical environments.
- Proven track record of architecting, deploying, and operating cooling infrastructure supporting multi-MW high-density computing environments.
- Strategic thinker capable of aligning cooling design and operations with company-wide performance, reliability, and sustainability goals.
- Adept at leading multidisciplinary teams and influencing technical direction across mechanical, electrical, and network domains.
- Operates with minimal oversight and consistently delivers innovative solutions in complex, ambiguous environments.
- Strong communicator and collaborator with the ability to influence senior stakeholders, vendors, and partners.
- Committed to continuous learning, advancing sustainability, and driving operational excellence in next-generation data center design.
Required Qualifications:
Education & Certifications
- Bachelor’s degree in Mechanical, Electrical, or Thermal Engineering (Master’s preferred).
- Professional certifications such as DCCA, CompTIA Server+, or liquid cooling manufacturer certifications are strongly preferred.
Experience Requirements
- 10+ years of experience in data center or mission-critical facility operations.
- 7+ years managing advanced liquid cooling systems (CDUs, L2L/L2A loops, heat exchangers).
- 5+ years supporting GPU/AI infrastructure or high-density compute workloads (>300 W per rack).
- 3+ years managing technical teams in distributed, multi-site environments.
- Proven success leading system design reviews, technology evaluations, and vendor negotiations.
Technical Expertise
- Liquid Cooling Systems: Expert knowledge of CDU operation, coolant distribution, manifolds, and control systems.
- Thermal Management: Deep understanding of thermodynamics, heat transfer modeling, and system efficiency optimization.
- Critical Infrastructure: Comprehensive knowledge of UPS, emergency power, fire suppression, and mechanical systems integration.
- Monitoring & Controls: Advanced proficiency with DCIM/BMS systems and real-time telemetry analytics.
- Mechanical Systems: Expertise in pumps, chillers, cooling towers, and hybrid HVAC configurations.
Core Competencies
- Strategic and analytical mindset for resolving complex thermal and operational challenges.
- Exceptional project leadership and cross-functional coordination skills.
- Demonstrated financial acumen in CapEx/OpEx optimization and vendor negotiation.
- Strong communication and presentation abilities for executive and technical audiences.
- Decisive leadership under pressure with robust incident response capability.
- Passion for innovation, sustainability, and advancing high-efficiency data center cooling technologies.
Preferred Qualifications:
- Advanced Degree: Master’s in Mechanical or Thermal Engineering.
- AI/ML Infrastructure: Experience designing or supporting large-scale GPU clusters and AI cooling ecosystems.
- Industry Experience: Background in hyperscale, HPC, or advanced colocation environments.
- Automation: Experience with AI-driven control systems and thermal optimization algorithms.
- Sustainability: Demonstrated success implementing energy-efficient and water-conservation cooling strategies.
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
- Founded in 2012, ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
- We are experiencing extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and Commuter stipends for select roles
- 401k Plan with 2% company match (USA employees)
- Flexible Paid Time Off Plan that we all actually use
A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.