Technical Program Manager III, GPU Infrastructure Reliability, Google Cloud
Posted Today
Job Description
- Lead the end-to-end development, project planning, and delivery of next-gen AI Infra GPU products from concept to production.
- Lead software qualifications, release strategy, and test infrastructure management for AI hypercompute clusters.
- Manage escalations and critical incidents while proactively identifying and mitigating risks that could impact project success.
- Coordinate with TPMs in AI2 (e.g., ACI, Platforms, and CSCO) and ACI leadership on cross-functional initiatives related to AI Infra customer onboarding and production support.
- Participate in the development of core management software, monitoring, and diagnostic tooling for scalable Cloud ML solutions.