Technical Program Manager III, Capacity Management, Cloud
Kirkland, WA, USA
Posted 6 days ago
Onsite
Job Description
- Lead cross-functional programs related to ML Fleet capacity management, including the design, update, and maintenance of ML Fleet's cluster-level allocation plan of record.
- Drive the development, implementation, and ongoing maintenance of fleet-wide accelerator and auxiliary resource usage metrics, policies, and governance frameworks.
- Identify gaps and drive initiatives to improve existing tooling and processes, enhancing the efficiency, agility, and responsiveness of ML capacity allocation and management.
- Partner with key stakeholders, including ML Strategy and Allocation (MLSA), Product Area Resource Management teams (PARMs), capital engineering, supply teams, tooling engineering (e.g., OneFleet, Tpulse, GQM Dev), and system infrastructure SREs (e.g., Spatial Flex, PIE).
- Manage communications and escalations related to ML resource allocation, performance, and strategic shifts for product areas and partners.