Rhoda AI

Research Engineer - Training Platform

Palo Alto
Posted Yesterday
Full-time

Job Description

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots — from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work — and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're looking for a Research Engineer to build and maintain the training platform that powers our model development — experiment orchestration, job management, observability, and the tooling that lets researchers move from idea to result as fast as possible.

What You'll Do

  • Build and maintain training orchestration systems for large-scale distributed model training across GPU clusters

  • Develop experiment management tooling: job configuration, tracking, reproducibility, and artifact management

  • Build observability infrastructure for training runs: loss curves, compute utilization, gradient statistics, and anomaly detection

  • Optimize and automate the research iteration loop from experiment launch to results analysis

  • Manage job scheduling and cluster utilization for efficient use of GPU compute

  • Build internal tooling and interfaces that help researchers move faster

  • Collaborate with training systems, data infrastructure, and research teams to support their platform needs

What We're Looking For

  • Strong software engineering skills with experience in MLOps or ML platform engineering

  • Familiarity with distributed training frameworks (PyTorch DDP, FSDP, DeepSpeed, Megatron, or similar)

  • Experience building experiment tracking, reproducibility, and artifact management systems

  • Comfortable managing and operating GPU cluster environments (Slurm, Kubernetes, or similar)

  • Strong reliability engineering instincts: monitoring, alerting, and failure recovery

Nice to Have (But Not Required)

  • Experience with training orchestration tools (Slurm, Ray, Kubernetes, or similar schedulers)

  • Familiarity with experiment tracking tools (Weights & Biases, MLflow, or custom solutions)

  • Experience supporting large model training pipelines (LLMs, VLMs, or video models)

  • Understanding of parallelism strategies and how they affect training efficiency and debugging

  • Experience with cloud-based training infrastructure (AWS, GCP, or Azure)

Why This Role

  • Your platform is the daily tool every researcher and engineer uses to train models

  • Improvements to training velocity and reliability compound across every experiment the team runs

  • High visibility with direct feedback from researchers and ML engineers

  • Build systems that scale from today's models to future frontier training runs

51-200 employees
Palo Alto, US