Acting as the overall coordinator and primary point of contact for end-to-end GPUaaS operations, including data centre operations and operational reporting.
Leading daily GPUaaS and data centre operations covering hardware, environmental controls, networking, security, and supporting software platforms.
Managing operations teams, vendors, and consultants during both normal operations and emergency situations.
Coordinating with internal teams and external partners to implement GPUaaS enhancements and data centre initiatives.
Implementing, validating, and continuously improving operational plans to ensure platform stability across GPU hardware, software, and data centre infrastructure (e.g. power and cooling).
Leading incident response and resolution for GPUaaS environments, including root cause analysis (RCA) and timely communication to customers and stakeholders.
Presenting operational status, risks, and improvement plans to senior management and relevant stakeholders.
Ensuring incidents are addressed or escalated in accordance with criticality, impact, and SLA/SLO requirements.
Building and leading a high-performing operations team, fostering collaboration, innovation, and continuous improvement.
Setting clear goals, mentoring team members, and supporting professional development.
Leading security incident management and enforcing security and compliance best practices within the GPUaaS environment.
Monitoring industry security trends and implementing measures to protect customer data and platform integrity.
Participating in scheduled or on-call support outside standard working hours as required.
