Job Description
- Lead the design and implementation of solutions in specialized ML areas, optimize ML infrastructure, and guide the development of model optimization and data processing strategies.
- Design and implement AI/ML models to predict, detect, and mitigate hardware and software faults across a global fleet.
- Analyze petabytes of telemetry and performance data to uncover insights that improve the reliability of ML TPUs and traditional compute infrastructure.
- Build scalable automated systems that allow Google’s data center footprint to grow while maintaining industry-leading uptime.
- Partner with hardware designers and site reliability engineers (SREs) to integrate intelligent diagnostics into the core data center lifecycle.
