
AI Infra Engineer - Large Model Inference Systems (Multimodal/LLM/VLM)
Job Description
About the Team
We are dedicated to building the inference infrastructure for ultra-large-scale language models, vision-language models, and frontier multimodal AI systems. Our mission is to provide a robust, scalable, and high-performance foundation for distributed serving, heterogeneous scheduling, and low-latency inference at massive scale. You will work on some of the most challenging problems in large-model online serving, spanning traffic orchestration, throughput and latency optimization, kernel efficiency, and production reliability for next-generation AI systems.
Responsibilities - What You'll Do
- Build and evolve next-generation inference systems for large-scale online traffic, including global scheduling across heterogeneous compute resources, high-concurrency load balancing, and efficient batch formation
- Optimize distributed inference for 200B+ parameter models and complex multimodal models through tensor parallelism (TP), expert parallelism (EP), data parallelism (DP), and related strategies to improve throughput and latency in production
- Develop high-performance kernels for frontier model architectures such as MoE, emerging attention mechanisms, and multimodal fusion layers using CUDA, Triton, and related tools
- Explore AI-driven infrastructure for inference systems, including AI agents for kernel optimization, performance tuning, consistency validation, deployment pipelines, and intelligent operations
Minimum Qualifications
- Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, Mathematics, or a related field
- 2+ years of experience in high-performance computing, distributed scheduling systems, or large-model inference engine development
- Familiarity with large-model architectures and strong system design skills for complex, high-concurrency environments
- Strong understanding of asynchronous scheduling, resource pooling, and load balancing in distributed microservice systems
- Strong engineering skills in performance optimization and production system development
Preferred Qualifications
- Deep understanding of inference frameworks such as vLLM and SGLang, with hands-on experience in customization and production optimization
- Familiarity with GPU microarchitecture and operator-level optimization using CUDA, Triton, CUTLASS, or related tools
- Experience with LLM inference optimization, such as post-training quantization (PTQ), quantization-aware training (QAT), KV cache optimization, or prefill-decode (PD) disaggregation
- Experience deploying and optimizing VLMs or multimodal models in production