fal is pioneering the next generation of generative-media infrastructure. We’re pushing the boundaries of model inference performance to power seamless creative experiences at unprecedented scale. We’re looking for a Staff Technical Lead for Inference & ML Performance: someone who blends deep technical expertise with strategic vision to guide a team building and optimizing state-of-the-art inference systems. This role is intense yet deeply impactful. Apply if you’re ready to lead the future of inference performance at a fast-paced, high-growth company on the frontier of generative media.
You’ll shape the future of fal’s inference engine and ensure our generative models achieve best-in-class performance. Your work directly impacts our ability to rapidly deliver cutting-edge creative solutions to users, from individual creators to global brands.
| Day-to-day | What success looks like |
| --- | --- |
| Set technical direction. Guide your team (kernels, applied performance, ML compilers, distributed inference) to build high-performance inference solutions. | fal’s inference engine consistently leads industry benchmarks for throughput, latency, and efficiency. |
| Hands-on IC leadership. Personally contribute to critical inference performance enhancements and optimizations. | You regularly ship code that significantly improves model serving performance. |
| Collaborate closely with research & applied ML teams. Influence model inference strategies and deployment techniques. | Inference innovations move seamlessly and rapidly from research to production deployment. |
| Drive advanced performance optimizations. Implement model parallelism, kernel optimization, and compiler strategies. | Performance bottlenecks are quickly identified and eliminated, dramatically enhancing inference speed and scalability. |
| Mentor and scale your team. Coach and expand your team of performance-focused engineers. | Your team independently innovates, proactively solves complex performance challenges, and consistently levels up their skills. |
This is one of the highest-impact roles at one of the fastest-growing companies (revenue growing 40% MoM, 60x+ RR compared to last year, Series A/B/C raised within the last 12 months), with a world-changing vision: hyperscaling human creativity.
Sound like your calling? Share your proudest optimization breakthrough, open-source contribution, or performance milestone with us. Let's set new standards for inference performance, together.