Job Description
- Design, implement, and maintain scalable, robust frameworks to enable large-scale evaluation of robot policies across offline open-loop testing and real-world hardware evaluations.
- Partner with researchers to design benchmark content that maximizes evaluation signal and stress-tests model capabilities.
- Build diagnostic and visualization tools that allow the team to easily root-cause policy failures and track performance regressions.
- Establish evaluation criteria for model releases and own the stability and benchmarking of models slated for critical demos.
- Find new ways to make real-world hardware evaluation faster, more reproducible, and less reliant on manual intervention.
