AI Engineer (AI Tests Generation Startup)

Research Development

Job Description

We’re a JetBrains-backed incubator startup building an AI-powered Chrome extension that turns manual web test journeys into clean, production-ready end-to-end test code. As a Senior AI Engineer, you will own experimentation, the evaluation pipeline, and metrics for our code generation agent. Your focus will be refining that pipeline, shipping core agent improvements, and building deterministic feedback loops. The role offers real ownership, quick feedback, and work on practical multi-agent LLM systems in a financially secure startup.

Good To Have:

Building LLM agents with tool-calling, planning, or multi-step reasoning.
Experience with code generation benchmarks like SWE-bench or HumanEval.
Experience with web UI automation frameworks like Selenium or Playwright.
Experience with static program analysis tasks, such as parsing code and working with ASTs.

Responsibilities:

Own the experimentation, evaluation pipeline, and metrics for our code generation agent.
Refine our evaluation pipeline, audit metrics, and rebuild the system for a tighter feedback loop.
Prototype, evaluate, and productionize features like agent planning, context reduction, and RAG.
Guide our team’s experts in code analysis and test execution to create systems that allow agents to generate better tests.

Must Have:

Engineering pragmatism to cut unnecessary complexity and find effective solutions.
A shipping mindset, breaking down ambiguous problems into small, iterative experiments.
True ownership, seeing work through from initial idea to production and measuring impact.
Built and shipped at least one significant LLM-powered feature from concept to production.
Designed and built evaluation pipelines for ML systems, defining metrics for user value.
Practical experience writing production-quality Python for data pipelines, backends, or internal tools.
At least one full year of experience in a startup or fast-moving team.
Tech stack: Python, LangGraph, LangFuse, and PydanticAI.

Perks:

Small team that ships fast and skips bureaucracy, offering real ownership and quick feedback.
Backed by JetBrains, financially secure, with no reliance on external VC funding.
Work on practical multi-agent LLM systems for web testing in a growing space.
Freedom to do your best work.

Your role

As part of our team, you’ll own the experimentation, evaluation pipeline, and metrics for our code generation agent. While our whole team contributes, you will be our expert on continuous evaluation and advanced LLM techniques.

Your focus in the first six months:

Refining our evaluation pipeline. You’ll audit our metrics, ensure they correlate with user value, and rebuild the system for a tighter feedback loop.
Shipping core agent improvements. You'll prototype, evaluate, and productionize features like agent planning, context reduction, and RAG.
Building deterministic feedback loops. You’ll guide our team’s experts in code analysis and test execution to create systems that allow agents to generate better tests.

Why you should join us

We’re a small team that ships fast and skips bureaucracy. You get real ownership and quick feedback from users.
We’re backed by JetBrains, financially secure, and have no reliance on external VC funding.
We work on practical multi-agent LLM systems for web testing. This is a growing space with real problems to crack, where you’ll have the freedom to do your best work.

Who we’re looking for

A Senior AI Engineer with hands-on experience in building, evaluating, and shipping LLM or ML systems. What matters most:

Engineering pragmatism. You cut unnecessary complexity to find the most effective solution – whether that’s a deterministic parser or a complex LLM agent.
A shipping mindset. You break down ambiguous problems into small, iterative experiments that deliver value to users, fast.
True ownership. You see your work through from the initial idea to production, constantly measuring its impact. "Done" means it's working for our users.

Required experience

Tech stack: Python, LangGraph, LangFuse, and PydanticAI (ongoing experiment).

You have built and shipped at least one significant LLM-powered feature, owning it from initial concept to production users.
You've designed and built evaluation pipelines for ML systems, defining metrics that measure real user value and using them to drive improvements.
Your engineering skills go beyond models and notebooks; you have practical experience writing production-quality Python for things like data pipelines, backends, or internal tools.
At least one full year of experience in a startup or fast-moving team.

Nice-to-haves

We’d be especially excited if you have experience with any of the following (side projects count!):

Building LLM agents with tool-calling, planning, or multi-step reasoning.
Code generation benchmarks like SWE-bench or HumanEval.
Web UI automation frameworks like Selenium or Playwright (web-based end-to-end testing, web parsing, etc.).
Static program analysis tasks, such as parsing code and working with ASTs.
