
AI Infra Engineer - Large Model Training Infrastructure (LLM/VLM/Agent RL)
Job Description
About the Team
We are dedicated to building the training infrastructure for ultra-large-scale language models, vision-language models, and frontier agentic models. Our mission is to provide a robust, scalable, and high-performance foundation for post-training, multimodal learning, and reinforcement learning at the hundred-billion-parameter scale and beyond. You will work on some of the most challenging problems in large-model training systems, from multimodal data efficiency to convergence optimization for next-generation foundation models.
What You'll Do
- Build and evolve unified training infrastructure for large models across post-training workflows, modalities, and training paradigms
- Design and optimize distributed training strategies for 100B- to 1T-parameter models, including data, tensor, pipeline, and expert parallelism (DP, TP, PP, EP), operator fusion, memory optimization, and cluster-level MFU (model FLOPs utilization) improvement
- Develop training and evaluation systems for Reasoning RL and Agent RL, including benchmarks, harnesses, convergence optimization, and rollout efficiency
- Enable multimodal training across image, text, audio, and video, and support emerging architectures such as MoE and Linear Attention with correctness and convergence validation
Minimum Qualifications:
- Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, Mathematics, or related fields
- 2+ years of experience in large-scale ML systems, training infrastructure, or performance optimization
- Strong programming skills in Python and C++
- Strong understanding of PyTorch and distributed training frameworks such as DeepSpeed, Megatron, and FSDP
- Experience with distributed training for ultra-large models and strong skills in debugging convergence issues and system bottlenecks
Preferred Qualifications:
- Experience with PPO, GRPO, or Agent RL
- Experience building large-model evaluation systems, agentic harnesses, or benchmarking infrastructure
- Familiarity with multimodal training, post-training systems, MoE, or Linear Attention
- Experience with training optimization for 100B+ parameter models