
Director, Reinforcement Learning & Agentic Post-Training

Paris · Posted Yesterday
Full-time · Remote

Job Description

About The AI Studio

The AI Studio's mission is to find the fastest possible path to an autonomous supply chain.

We build AI agents, learning systems, model training pipelines, evaluations, simulations, and decision-making systems for some of the hardest problems in global supply chain. The work spans LLMs, reinforcement learning, agentic workflows, software automation, optimization, and production engineering.

In short, we are having a lot of fun.

Your Mission

We are looking for a deeply technical Director of Reinforcement Learning & Agentic Post-Training to lead how Blue Yonder trains LLM-based agents to operate supply chain software.

This role sits at the center of our Model Training Factory, built with NVIDIA, where we develop specialized AI agents for the autonomous supply chain. These agents must reason over supply chain state, use tools, interact with Blue Yonder workflows, execute multi-step operational tasks, and improve through feedback, evaluation, and reinforcement learning.

Tool use is not a side feature here. Our agents must learn to work inside real enterprise software: querying state, proposing actions, invoking APIs, respecting constraints, handling exceptions, escalating uncertainty, and collaborating with human operators. The challenge is not simply making a model sound knowledgeable about supply chain. The challenge is training models that can reliably act.

We are looking for someone who has personally gone through the hard parts: post-training LLMs, designing tool-use environments, building reward models or verifiers, creating evaluations that catch real failures, shipping reinforced models into production, and leading strong machine learning engineers through that process.

This is not a pure research management role, and it is not a project management role. You should be comfortable setting strategy, writing and reviewing technical designs, mentoring senior engineers, challenging weak assumptions, and staying close enough to the work to know whether the system is actually learning.

What You'll Do

  • Lead the technical strategy for reinforcement learning, post-training, and tool-using LLM agents within the AI Studio.
  • Build and manage a team of machine learning engineers working on agent training, RL environments, reward modeling, evaluation, data generation, and training infrastructure.
  • Design environments where LLM agents learn to operate Blue Yonder software through APIs, tools, workflows, simulations, and human feedback.
  • Develop training and evaluation systems for multi-step supply chain workflows across planning, warehouse management, transportation, commerce, and network operations.
  • Define what "good" looks like for operational agents: correct tool use, constraint adherence, business outcome quality, latency, cost, robustness, escalation behavior, and human trust.
  • Build reward models, verifiers, preference pipelines, automated graders, and evaluation harnesses for agent behavior.
  • Create evaluation frameworks that measure real agent performance, including tool-call correctness, workflow completion, recovery from bad state, long-horizon reliability, and failure modes.
  • Partner with product, engineering, architecture, and domain experts to turn real supply chain workflows into trainable agent environments.
  • Guide model improvement across supervised fine-tuning, preference optimization, reinforcement learning from human or AI feedback, rejection sampling, synthetic data generation, and policy optimisation.
  • Make practical technical tradeoffs between model capability, inference cost, latency, reliability, product timelines, and operational safety.
  • Establish engineering standards for experiment tracking, reproducibility, observability, rollout safety, and production monitoring.
  • Document what works and what fails so the team compounds learning over time.
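
To make the grader and evaluation-harness items above concrete, here is a minimal sketch of an automated grader that scores one agent episode on tool-call correctness, constraint adherence, and workflow completion. Everything in it is an assumption made for illustration: `ToolCall`, `grade_episode`, the tool names `get_inventory` and `create_order`, and the `max_order_qty` constraint are hypothetical and are not Blue Yonder APIs.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation made by the agent (hypothetical schema)."""
    name: str
    args: dict = field(default_factory=dict)


def grade_episode(calls, expected_names, max_order_qty):
    """Score one episode on three illustrative criteria, each in [0, 1]."""
    actual_names = [c.name for c in calls]

    # Tool-call correctness: fraction of expected steps matched,
    # compared position by position against the reference sequence.
    matched = sum(1 for a, e in zip(actual_names, expected_names) if a == e)
    tool_score = matched / len(expected_names) if expected_names else 1.0

    # Constraint adherence: no hypothetical `create_order` call may
    # exceed the allowed quantity.
    violated = any(
        c.name == "create_order" and c.args.get("qty", 0) > max_order_qty
        for c in calls
    )
    constraint_score = 0.0 if violated else 1.0

    # Workflow completion: the episode must end on the final expected step.
    completed = (
        bool(actual_names)
        and bool(expected_names)
        and actual_names[-1] == expected_names[-1]
    )
    completion_score = 1.0 if completed else 0.0

    scores = {
        "tool_calls": tool_score,
        "constraints": constraint_score,
        "completion": completion_score,
    }
    scores["overall"] = sum(scores.values()) / 3
    return scores


# Example: the agent follows the right steps but orders too much.
episode = [
    ToolCall("get_inventory", {"sku": "A1"}),
    ToolCall("create_order", {"sku": "A1", "qty": 500}),
]
print(grade_episode(episode, ["get_inventory", "create_order"], max_order_qty=100))
```

Reporting per-criterion scores alongside the average keeps a constraint violation visible even when the agent calls the right tools in the right order.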

What We're Looking For

We want to talk if you:

  • Have led a team that shipped LLMs post-trained with reinforcement learning, SFT, DPO, RLHF/RLAIF, or similar methods into production.
  • Have led a team to train models to use tools, call APIs, interact with software environments, or complete multi-step tasks.
  • Have a strong machine learning engineering background and can credibly lead engineers because you have built systems like this yourself.
  • Have managed or technically led high-performing ML engineering teams focused on reinforcement learning.
  • Are highly proficient in Python and PyTorch.
  • Understand modern LLM post-training workflows, including supervised fine-tuning, preference data, reward modeling, policy optimisation, evaluation, and deployment.
  • Have hands-on experience with reinforcement learning methods such as reward shaping, PPO-style optimisation, GRPO, offline RL, policy evaluation, rejection sampling, or environment design.
  • Know how to evaluate open-ended agent behaviour beyond static benchmark scores.
  • Can reason about production constraints: latency, inference cost, safety, observability, rollback, and reliability.
  • Can balance frontier-oriented exploration with shipping production systems.
  • Are comfortable with ambiguity but intolerant of unsound technical thinking.
  • Care about engineering craft, reproducibility, and learning velocity.
  • Are curious about why systems work, not just whether a metric moved.
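
For readers less familiar with the methods named in the list above, the core of GRPO-style optimisation is a group-relative advantage: several responses are sampled for the same prompt, each is scored, and every reward is normalised against its own group. The sketch below assumes scalar rewards from some grader or verifier. The function name and the `eps` stabiliser are illustrative choices, not taken from any particular library.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and std of its own group.

    `rewards` holds one scalar per sampled response to the same prompt.
    `eps` avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled responses, two of which passed a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat their group's average get a positive advantage and the rest get a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.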

Bonus Points

  • Experience building simulated or sandboxed enterprise software environments for agent training.
  • Experience with NVIDIA Nemotron, NVIDIA NeMo, Megatron, vLLM, Ray, distributed training, or large-scale inference systems.
  • Experience with warehouse management, supply chain planning, transportation, merchandising, logistics, operations research, or enterprise workflow automation.
  • Experience designing agent safety systems, including permissioning, action validation, approval flows, uncertainty escalation, and audit trails.
  • Evidence of technical taste through papers, open-source contributions, internal platforms, side projects, or shipped systems that show deep curiosity about model behaviour.

What Makes This Role Different

Supply chains are full of hard AI problems: partial observability, long-horizon consequences, competing objectives, brittle constraints, noisy feedback, and decisions that matter in the real world.

We are not applying reinforcement learning to toy environments. We are training production LLM agents that operate supply chain software through tools, feedback, verification, and reinforcement. The work sits at the intersection of LLMs, agents, reinforcement learning, evaluation, simulation, optimisation, and production engineering.

If you want to build learning systems that leave the lab and operate in one of the world's most complex real-world domains, this is the role.

Our Values

If you want to know the heart of a company, take a look at its values. Ours unite us. They drive our success and the success of our customers.

Equal Opportunity

All qualified applicants will receive consideration for employment without regard to race, colour, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other legally protected characteristic.

