TikTok

AI Infra Engineer - Large Model Training Infrastructure (LLM/VLM/Agent RL)

San Jose, California, United States of America
Posted Today
Full-time, Hybrid

Job Description

About the Team

We are dedicated to building the training infrastructure for ultra-large-scale language models, vision-language models, and frontier agentic models. Our mission is to provide a robust, scalable, and high-performance foundation for post-training, multimodal learning, and reinforcement learning at the hundred-billion-parameter scale and beyond. You will work on some of the most challenging problems in large-model training systems, from multimodal data efficiency to convergence optimization for next-generation foundation models.

What You'll Do

  • Build and evolve unified training infrastructure for large models across post-training workflows, modalities, and training paradigms
  • Design and optimize distributed training strategies for 100B to 1T parameter models, including DP, TP, PP, EP, operator fusion, memory optimization, and cluster-level MFU improvement
  • Develop training and evaluation systems for Reasoning RL and Agent RL, including benchmarks, harnesses, convergence optimization, and rollout efficiency
  • Enable multimodal training across image, text, audio, and video, and support emerging architectures such as MoE and Linear Attention with correctness and convergence validation

Minimum Qualifications:

  • Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, Mathematics, or related fields
  • 2+ years of experience in large-scale ML systems, training infrastructure, or performance optimization
  • Strong programming skills in Python and C++
  • Strong understanding of PyTorch and distributed training frameworks such as DeepSpeed, Megatron, and FSDP
  • Experience with distributed training for ultra-large models and strong debugging skills in convergence and system bottlenecks

Preferred Qualifications:

  • Experience with PPO, GRPO, or Agent RL
  • Experience building large-model evaluation systems, agentic harnesses, or benchmarking infrastructure
  • Familiarity with multimodal training, post-training systems, MoE, or Linear Attention
  • Experience with training optimization for 100B+ parameter models
