Job Description
We are building a benchmark dataset to evaluate AI models on professional document understanding and instruction following within the Education domain.
Tasks consist of complex, multi-step requests grounded in real-world workspace files (technical drawings, project specifications, engineering reports), web search, and code execution. Each task is paired with a clearly defined ground truth output and an objective evaluation rubric. You will be responsible for authoring tasks that test an AI's ability to interpret professional documentation, follow multi-step instructions, and produce precise, well-structured outputs.
We expect a minimum commitment of 15–20 hours per week.
Ideal candidates have 3+ years of hands-on experience in one or more of the following sub-domains:
- Curriculum & instructional design
- Academic research
- Teaching & training
