Back to jobs

System Debug Engineer Manager, Cloud AI Infrastructure
Kirkland, WA, USAPosted 2 weeks ago
onsite
Job Description
- Drive technical team performance across on-call activities and system management by delivering leadership, mentorship, and career development while collaborating with primary responders to address system issues.
- Debug platform hardware, silicon, and AI/ML workloads to drive root-cause resolution, develop permanent infrastructure improvements, and build tools for faster diagnosis through troubleshooting and reproduction.
- Collaborate cross-functionally with Product, Quality, and Engineering teams to ehance product outcomes, and engage with Site Reliability Engineering (SRE) teams to ensure high-quality production and reliability.
- Resolve customer challenges on AI/ML infrastructure through effective diagnosis, resolution, and the implementation of investigation tools to increase productivity for critical reported issues.
- Serve as a consultant and subject matter expert for internal stakeholders to resolve deployment and operational obstacles across AI infrastructure environments daily.