Senior Lead Machine Learning Engineer, Agentic AI

Upwork

| Toronto, Ontario, Canada (Hybrid) | Full Time | 1 day ago

Job Summary

Upwork is seeking a Senior Lead Machine Learning Engineer to design, implement, and scale agentic intelligence. This role leads end-to-end development of AI agents and the platform that powers them, from LLM training and evaluation to runtime orchestration, safety, and developer APIs. It is a high-impact position spanning applied research and platform engineering, enabling reliable, safe, and high-performing agents for internal and external use.

Must Have

Design and implement multi-agent systems with robust guardrails and recovery strategies.
Develop protocol-aware agents and services that interoperate cleanly with developer tooling.
Own reliability at scale: deterministic execution, idempotency, timeouts/retries, and evaluation-driven iteration.
Lead data strategy and curation for agent tasks; drive SFT, DPO, RLHF/RLAIF, and safety tuning.
Stand up evaluation harnesses for functional, task, and longitudinal metrics.
Build policy-driven guardrails; partner with Legal/Security on data governance and privacy.
Architect low-latency inference, retrieval, and orchestration services with strong SLOs.
Ship production-grade services (APIs/SDKs, auth, rate limiting, observability).
Optimize cost/performance via quantization, distillation, model-routing, and autoscaling.
Provide technical leadership across research, product, and platform teams; mentor senior ICs.
Publish internal guidance and exemplar implementations; contribute to technical content.
Define and track KPIs for data/quality/throughput, and drive continuous improvement.
Senior-level experience in applied ML/ML systems, with experience building LLM-powered products.
Proven delivery of agentic workflows in production.
Hands-on mastery of LLM adaptation (prompting, tool/function calling), data curation, and safety/guardrails.
Strong software fundamentals (distributed systems, transactions, consistency, resiliency).
Experience building high-throughput microservices/APIs/SDKs.
Fluency with Python.
Experience with container orchestration, messaging/streaming, and observability stacks.
Experience designing eval suites for agents and closing the loop from evals to training to runtime policy.
Comfort with cost, latency, and reliability trade-offs; ability to use metrics to make crisp decisions.
Familiarity with agent frameworks and protocols (e.g., MCP; API/SDK design).
Track record of leading cross-functional initiatives and mentoring senior engineers.
Excellent written communication and bias for measurable results.

Good to Have

Proficiency in one of Go/Java/JavaScript.

Perks & Benefits

Competitive benefits (initially through a partner)
Access to Upwork's resources, culture, and growth opportunities

Job Description

We’re seeking a Senior Lead Machine Learning Engineer to architect, ship, and scale the next generation of agentic intelligence across Upwork. You will lead end‑to‑end development of AI agents and the platform that powers them—from LLM training and evaluation to runtime orchestration, safety, and developer APIs. This is a hands‑on, high‑impact role at the intersection of applied research and platform engineering, enabling internal teams and external developers to build reliable, safe, and high‑performing agents on Upwork.

Responsibilities

Build Agentic Intelligence. Design and implement multi‑agent systems (planning, tool‑use, memory, debate/critique, reflection) with robust guardrails and recovery strategies.
Develop protocol‑aware agents and services that interoperate cleanly with developer tooling (e.g., agent frameworks and protocols such as MCP).
Own reliability at scale: deterministic execution where needed, idempotency, timeouts/retries, and evaluation‑driven iteration on agent behavior.
Train, Align, and Evaluate LLMs for Agents. Lead data strategy and curation for agent tasks; drive SFT, DPO, RLHF/RLAIF, and safety tuning tailored to multi‑tool, multi‑step workflows.
Stand up evaluation harnesses for functional, task, and longitudinal metrics (success rate, time‑to‑completion, hallucination/escape rates, cost/latency).
Build policy‑driven guardrails; partner with Legal/Security on data governance and privacy.
Engineer Agentic Platform Backend Infrastructure. Architect low‑latency inference, retrieval, and orchestration services (streaming, event‑driven pipelines; scalable queues; caching; batching) with strong SLOs.
Ship production‑grade services (APIs/SDKs, auth, rate limiting, observability) that make agent features easy to integrate for internal and external developers.
Optimize cost/performance via quantization, distillation, model‑routing, and autoscaling; integrate evaluation signals directly into runtime and CI/CD.
Lead, Partner, and Uplevel the Ecosystem. Provide technical leadership across research, product, and platform teams; mentor senior ICs; influence roadmaps with clear metrics and trade‑offs.
Publish internal guidance and exemplar implementations; contribute to technical content, samples, and reference architectures for our agent platform.
Define and track KPIs for data/quality/throughput, and drive continuous improvement using experiment results and production telemetry.

What it takes to catch our eye

Senior‑level experience in applied ML/ML systems, with experience building LLM‑powered products; proven delivery of agentic workflows in production.
Hands‑on mastery of LLM adaptation (prompting, tool/function calling), data curation, and safety/guardrails.
Strong software fundamentals (distributed systems, transactions, consistency, resiliency) and experience building high‑throughput microservices/APIs/SDKs.
Fluency with Python; proficiency in one of Go/Java/JavaScript is a plus. Experience with container orchestration, messaging/streaming, and observability stacks.
Experience designing eval suites for agents (task/rubric‑based, offline/online) and closing the loop from evals to training to runtime policy.

Comfort with cost, latency, and reliability trade‑offs; you use metrics to make crisp decisions under ambiguity.
Familiarity with agent frameworks and protocols (e.g., MCP; API/SDK design for developer productivity).
Track record of leading cross‑functional initiatives and mentoring senior engineers; excellent written communication and bias for measurable results.

***

Come change how the world works.

Upwork is establishing its first international operational hub in Lisbon, Portugal. The new office is expected to be fully operational by Q4 2026.

This position will initially be employed through a partner to ensure a seamless hiring process while we establish the hub. Once the hub is established, there may be opportunities to transition to employment with Upwork depending on business needs and other requirements. While employed by the partner, you’ll work as part of Upwork’s team, with access to our resources, culture, and growth opportunities.

Our partner will offer competitive benefits. When Upwork’s hub is established, we will be excited to offer employment and benefits directly as business needs require.

Upwork is committed to building a diverse, inclusive, and equitable workforce. Employment decisions are made without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, disability, or any other status protected by applicable law.

To learn more about how Upwork processes and protects your personal information as part of the application process, please review our Global Job Applicant Privacy Notice.

7 Skills Required For This Role

Game Texts, CI/CD, Microservices, Python, JavaScript, Java, Machine Learning
