Google

AI Accelerator Reliability Uber Tech Lead

Sunnyvale, CA, USA
Posted Yesterday
hybrid

Job Description

  • Define, own, and drive the end-to-end reliability, availability, and serviceability (RAS) strategy for a novel, large-scale AI accelerator system.
  • Establish and enforce reliability engineering principles, standards, and best practices across all components of the system, including custom ASICs, trays, racks, power, cooling, and the full software stack (firmware, system software, runtime, and orchestration).
  • Lead and influence cross-functional teams, including Hardware Engineering, Silicon Design, Software Engineering, Supply Chain, Manufacturing, and Site Reliability Engineering (SRE), to ensure reliability is designed in and validated throughout the entire product lifecycle.
  • Drive the design and implementation of fault injection testing, stress testing, and DiRT-style exercises to validate system behavior under failure conditions.
  • Define and oversee the development of robust error handling, monitoring, telemetry, and diagnostic capabilities to enable rapid detection, root cause analysis, and recovery from failures.
