Job Description
- Design and maintain TPU supercomputer software across multiple stack layers, ranging from daemons on host machines to network routing rules embedded directly into the TPUs.
- Develop and manage control software on specialized machines and distributed infrastructure to support the operation of massive collections of networked hardware.
- Implement robust systems to monitor, deploy, qualify, and service supercomputing systems, ensuring they remain reliable and performant at scale.
- Engineer software for the reliable scale-out and scale-up of accelerators, tailored to the needs of massive-scale machine learning applications.
- Architect and build software to optimally interconnect TPUs, enabling efficient execution of the collective operations used in data-parallel training, such as ring all-reduce.
