Machine Learning Engineer (Data & Evaluation Infrastructure)

Nous Research

Job Summary

We are seeking a Machine Learning Engineer (MLE) to manage our post-training evaluation pipeline. The role involves building and scaling evaluations that assess model capabilities across diverse tasks, pinpointing failure modes, and driving model improvements. Key responsibilities include identifying tasks for evaluation, creating or curating test cases and measurement methods, and implementing evaluations through objective output verification, LLM judging, reward modeling, or human evaluation. You will also expand coverage, analyze failure cases in depth, identify remedies, and make the evals scalable and accessible internally, for example through light GUIs or Slurm scripts.

Must Have

  • Experience with evaluation frameworks
  • Experience with automated and human evaluation
  • Ability to build evaluation infrastructure from scratch
  • Ability to scale existing systems

Good to Have

  • History of OSS contributions

Job Description

We’re looking for an MLE to own our post-training evaluation pipeline. You’ll build and scale evals, in both depth and breadth, that measure model capabilities across diverse tasks, identify failure modes, and drive model improvements.

Responsibilities:

  • Identifying tasks for evaluation coverage
  • Creating, curating, or generating test cases and ways to measure these tasks
  • Implementing evaluation through objective output verification, LLM judge/reward modeling, human evaluation, or any tricks of the trade you may bring to the table
  • Adding coverage and diving deep into analyzing what’s really gone wrong in failure cases
  • Identifying ways to remedy failure cases
  • Developing ways to present the evals and make them scalable and accessible internally (e.g., light GUIs or scalable Slurm scripts for running the evals)

Qualifications:

  • Strong experience with evaluation frameworks
  • Experience with both automated and human evaluation methodologies
  • Ability to build evaluation infrastructure from scratch and scale existing systems

Preferred:

  • History of OSS contributions

2 Skills Required For This Role

Test Coverage, Machine Learning
