Senior Machine Learning Platform Engineer

30 Minutes ago • 5-7 Years
Research Development




About Tekion:

Positively disrupting an industry that has not seen innovation in over 50 years, Tekion has challenged the paradigm with the first and fastest cloud-native automotive platform, which includes the revolutionary Automotive Retail Cloud (ARC) for retailers, Automotive Enterprise Cloud (AEC) for manufacturers and other large automotive enterprises, and Automotive Partner Cloud (APC) for technology and industry partners. Tekion connects the entire spectrum of the automotive retail ecosystem through one seamless platform. The transformative platform uses cutting-edge technology, big data, machine learning, and AI to seamlessly bring together OEMs, retailers/dealers, and consumers. With its highly configurable integration and greater customer engagement capabilities, Tekion is enabling the best automotive retail experiences ever. Tekion employs close to 3,000 people across North America, Asia and Europe.

Why This Role Matters

This role powers Tekion’s AI‑native, end‑to‑end automotive platform by turning unified dealership data across DMS, CRM, Digital Retail, Service, and Payments into real‑time intelligence. You’ll operationalize a graph‑based contextual ecosystem so agents can retrieve the right context, enforce policy, and personalize experiences that drive measurable dealer outcomes. You’ll also build the resilient control layer - MCP and the LLM Gateway - that enables safe, cost‑efficient, multi‑provider LLM usage. Finally, you’ll define the standards for building, evaluating, deploying, and governing agentic systems so product teams can ship AI features quickly, safely, and at scale. In addition to enabling agentic systems powered by LLMs, this role also drives the platform for classical ML models that optimize dealership operations.

What Makes This Opportunity Unique

This role offers direct, measurable impact on dealer outcomes and consumer experiences across Tekion’s Automotive Retail Cloud and Automotive Enterprise Cloud, with end‑to‑end ownership of an LLM control plane and gateway that serve multi‑tenant workloads under SLAs and quality and cost guardrails. You’ll leverage a rich vertical dataset and domain graph spanning sales, service, parts, F&I, accounting, and consumer touchpoints to power context‑aware agents and retrieval‑augmented generation. You’ll also shape core levers - agent orchestration patterns, evaluation frameworks, and safety guardrails - so improvements in latency, reliability, evaluation quality, and safety translate into dealer KPIs like upsell, cycle time, CSAT, and service revenue. You’ll also maintain and enhance the platform to support classical supervised and unsupervised ML models.

Responsibilities

  • Build and run the LLM control plane/gateway: smart routing, rate limits/quotas, failover, and token/cost tracking.
  • Ship a unified API and SDKs (REST/gRPC) with normalized schemas, structured outputs, caching, and full observability (traces/logs/metrics).
  • Enforce safety and privacy by default: content filtering, prompt/response validation, and PII redaction.
  • Enable multi‑model, multi‑vendor LLM usage with automated canarying and versioning.
  • Own the agent runtime: tool registry, permissions, function calling, grounding, and retrieval.
  • Design orchestration patterns (sequential, planner‑executor, streaming) and manage agent state and long‑running workflows.
  • Enable platform components for training and scoring pipelines for classical ML (e.g., XGBoost/LightGBM/linear/trees) and deep models; standardize experiment tracking and packaging.
  • Create components to monitor model and data drift; retrain and tune models as needed to maintain accuracy and relevance.
  • Add human‑in‑the‑loop review and safe‑actioning before agents touch dealer systems.
  • Evolve the domain graph and entity resolution; build reliable data ingestion pipelines.
  • Serve real‑time context to agents (profiles, inventory, pricing, appointments, service history) with access controls and lineage.
  • Power retrieval with hybrid search (graph + vector + keyword) and smart cache/TTL to balance accuracy, latency, and cost.
  • Run continuous offline/online evaluations for quality, factuality, bias, and safety to keep the platform healthy.
  • Define SLOs for latency (p50/p95), uptime, and cost; enable cost visibility, autoscaling, and spend controls.
  • Maintain a model/agent registry, versioning, approvals, audit trails, and reproducibility; support compliance where needed.
  • Provide templates/CLIs, sandboxes, and docs so product teams can build and ship fast; mentor engineers and champion MLOps and AI safety best practices.
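The gateway responsibilities above (smart routing, failover, token/cost tracking) can be sketched in a few lines. This is an illustrative toy, not Tekion's actual stack: the provider names, per-1K-token prices, and the injected `call_provider` helper are all hypothetical placeholders.

```python
# Hypothetical per-provider policy: ordered by routing preference,
# with an assumed cost per 1K tokens for spend accounting.
PROVIDERS = [
    {"name": "provider_a", "cost_per_1k_tokens": 0.0030},
    {"name": "provider_b", "cost_per_1k_tokens": 0.0015},
]

class LLMGateway:
    """Toy control plane: route to the first healthy provider, fail over
    on error, and track per-tenant token spend."""

    def __init__(self, providers):
        self.providers = providers
        self.spend = {}          # tenant_id -> accumulated dollars
        self.request_log = []    # (tenant_id, provider, tokens, cost)

    def complete(self, tenant_id, prompt, call_provider):
        last_error = None
        for p in self.providers:                 # smart routing: policy order
            try:
                # call_provider is an injected adapter returning (text, tokens)
                text, tokens = call_provider(p["name"], prompt)
            except Exception as e:               # failover: try the next provider
                last_error = e
                continue
            cost = tokens / 1000 * p["cost_per_1k_tokens"]
            self.spend[tenant_id] = self.spend.get(tenant_id, 0.0) + cost
            self.request_log.append((tenant_id, p["name"], tokens, cost))
            return text
        raise RuntimeError(f"all providers failed: {last_error}")
```

A production gateway would add quotas, caching, and observability on top, but the routing/failover/accounting loop is the core shape.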

Desired Skills & Experience

  • 5 - 7 years building large‑scale data/ML or platform systems; strong software engineering fundamentals (abstracted API design, concurrency, distributed systems).
  • Production experience with Python plus one of Java/Scala/Go; microservices and API design.
  • MLOps at scale: pipelines (Airflow/Kubeflow), tracking/registry (MLflow), CI/CD for models, A/B testing, shadow/canary, and online feature computation (Spark/Flink/Kafka).
  • Cloud and containers: AWS (preferred), plus Docker/Kubernetes; performance, reliability, and cost engineering in multi‑tenant SaaS.
  • Practical ML knowledge (feature engineering, training, evaluation, drift detection); experience deploying models that power user‑facing workflows.
  • Built or operated an LLM gateway/control plane: provider adapters, routing/policies, caching, quota/rate‑limit, cost and token accounting.
  • Agentic systems: tool use/function calling, orchestration frameworks, human‑in‑the‑loop, safety/guardrails, and online evaluation/telemetry.
  • Graph and retrieval: knowledge graphs (e.g., Neo4j/Neptune/TigerGraph), GraphQL, vector search (e.g., pgvector/Qdrant/Milvus), hybrid retrieval patterns.
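The hybrid-retrieval experience listed above (graph + vector + keyword) is commonly implemented by fusing the ranked results of each retriever; reciprocal rank fusion (RRF) is one standard technique. A minimal sketch, assuming the three ranked lists are already produced by upstream retrievers:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked result lists (e.g. from graph, vector, and
    keyword retrievers) into one ranking.  Each document scores the sum
    of 1 / (k + rank) over the lists it appears in; k=60 is the constant
    commonly used in RRF implementations."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of multiple lists win, which is why RRF works well when the individual retrievers have very different score scales.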

Preferred Mindset

  • Platform‑as‑product: obsess over developer experience, paved roads, and clear SLAs.
  • Thinks in systems: observability, fallback, and access control are core, not afterthoughts.
  • Passionate about AI: enjoys enabling real-world LLM and agentic use cases.
  • Cost‑aware builder: you treat latency and dollars as first‑class metrics and design for graceful degradation.
  • Vendor‑agnostic thinker: choose the right model/provider per use case; build for portability and resilience.
  • Documentation and teaching: you make complex systems understandable; you uplevel teams.

Tekion is proud to be an Equal Employment Opportunity employer. We do not discriminate based upon race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, victim of violence or having a family member who is a victim of violence, the intersectionality of two or more protected categories, or other applicable legally protected characteristics.

For more information on our privacy practices, please refer to our Applicant Privacy Notice.
