Engage with and improve the entire lifecycle of services, from inception and design, through capacity planning and launch reviews, to deployment, operation, and continual improvement.
Solve large, ambiguous problems and drive solutions across Site Reliability Engineering (SRE) and Development teams, where our reliability expertise lets us act as a team multiplier.
Balance the need for a reliable system, efficient incident response, and blameless postmortems with highly dynamic customer priorities.
Maintain services' long-term health by creating and monitoring service level objectives (SLOs) and by scaling systems and processes sustainably through mechanisms such as automation.
Mentor and train team members on design, coding, and reliability best practices, and grow knowledge of Gemini and Vertex AI within GDC SRE.

Site Reliability Engineering Manager, Google Distributed Cloud

Job Description