Job Description
The Role
We're looking for engineers and scientists to design, optimize, and maintain the compute foundations that power large-scale language model training and inference. You will develop high-performance ML kernels, enable efficient low-precision arithmetic, and improve the distributed compute stack that makes training and serving large models possible.
Key Responsibilities
- Design and implement custom ML kernels (CUDA, CuTe, Triton) for core dLLM operations such as attention, matrix multiplication, gating, and normalization, optimized for modern GPU architectures.
- Design compute primitives that reduce memory-bandwidth bottlenecks and improve kernel efficiency.
- Contribute to infrastructure stability and scalability, ensuring reproducibility, consistency across precision formats, and high utilization of compute resources.
Qualifications
- BS/MS/PhD in Computer Science, Engineering, or a related field (or equivalent experience).
- Proficiency in CUDA, CuTe, Triton, or other GPU programming frameworks.
- Understanding of ML frameworks (PyTorch, TensorFlow) from a systems perspective.
- Background in performance optimization and profiling of ML systems.
- Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (XLA, TVM).
- Familiarity with distributed training techniques (data, model, and pipeline parallelism).
- Proficiency in Python and at least one systems programming language (C++/Rust/Go).
- Experience with containerization (Docker), orchestration (Kubernetes), and CI/CD pipelines.
Preferred Skills
- Experience building and maintaining large-scale language models with tens of billions of parameters or more.
- Experience with distributed systems and cloud computing platforms (AWS/GCP/Azure).
- Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, and Megatron-LM.
- Prior contributions to open-source deep learning infrastructure such as PyTorch, DeepSpeed, or XLA.