Data/Evaluations Engineer
Nous Research
Job Summary
Nous Research is seeking a Data/Evaluations Engineer to own its post-training evaluation pipeline. This role involves building and scaling evaluation methods to measure model capabilities across various tasks, identify failure modes, and drive model improvements. The engineer will be responsible for creating test cases, implementing diverse evaluation techniques, analyzing failures, and developing scalable internal tools for evaluation.
Must Have
- Identify tasks for evaluation coverage
- Create, curate, or generate test cases and measurement methods
- Implement evaluation via objective output verification, LLM judge/reward modeling, or human evaluation
- Add coverage and analyze failure cases deeply
- Identify ways to remedy failure cases
- Develop scalable and accessible internal evaluation tools (e.g., light GUIs, Slurm scripts)
- Strong experience with evaluation frameworks
- Experience with both automated and human evaluation methodologies
- Ability to build evaluation infrastructure from scratch and scale existing systems
Good to Have
- History of OSS contributions
Job Description
Data/Evaluations Engineer
We’re looking for a data/evaluations engineer to own our post-training evaluation pipeline. You’ll build and scale evals, in both depth and breadth, that measure model capabilities across diverse tasks, identify failure modes, and drive model improvements.
Responsibilities:
- Identifying tasks for evaluation coverage
- Creating, curating, or generating test cases and ways to measure these tasks
- Implementing evaluation through objective output verification, LLM judge/reward modeling, human evaluation, or any tricks of the trade you may bring to the table (a minimal sketch of the first two approaches follows this list)
- Adding coverage and diving deep into analyzing what’s really gone wrong in failure cases
- Identifying ways to remedy failure cases
- Developing ways to present the evals and make them scalable and accessible internally (e.g., light GUIs, scalable Slurm scripts for running the evals)
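For illustration only, and not a description of Nous Research's internal tooling: below is a minimal sketch of the two automated approaches named above, scoring each test case by objective output verification when a reference answer exists and falling back to an LLM judge otherwise. The `client` object, judge model name, and PASS/FAIL prompt format are assumptions (an OpenAI-compatible chat API is assumed).

```python
# Minimal eval-harness sketch (illustrative only, not Nous Research's actual tooling).
# Assumes an OpenAI-compatible chat client is passed in for the judge model.
import re
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    expected: str | None = None  # set when the answer can be verified objectively

def verify_objective(output: str, expected: str) -> bool:
    """Objective output verification: exact match after light normalization."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(output) == norm(expected)

def judge_with_llm(client, prompt: str, output: str, judge_model: str = "judge-model") -> bool:
    """LLM-judge fallback: ask a judge model for a PASS/FAIL verdict (prompt format is an assumption)."""
    verdict = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                "You are grading a model response.\n"
                f"Task: {prompt}\nResponse: {output}\n"
                "Reply with exactly PASS or FAIL."
            ),
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("PASS")

def run_eval(cases: list[TestCase], outputs: list[str], client) -> float:
    """Score each output objectively when possible, otherwise via the LLM judge; return pass rate."""
    passed = 0
    for case, out in zip(cases, outputs):
        if case.expected is not None:
            passed += verify_objective(out, case.expected)
        else:
            passed += judge_with_llm(client, case.prompt, out)
    return passed / len(cases)
```

In practice a harness like this grows per-task scorers, result caching, and aggregation/reporting, but the verify-first, judge-second split is the core pattern.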
Qualifications:
- Strong experience with evaluation frameworks
- Experience with both automated and human evaluation methodologies
- Ability to build evaluation infrastructure from scratch and scale existing systems
Preferred:
- History of OSS contributions