Digital Incident & Service Management Expert
Job Description
At Globe, our goal is to create a wonderful world for our people, business, and nation. By uniting people of passion who believe they can make a difference, we are confident that we can achieve this goal.
Responsible for ensuring the reliability, performance, and stability of digital channels and APIs across web and mobile platforms. Leads day-to-day digital and API operations, including proactive monitoring, incident management, and service recovery. Partners closely with engineering, IT, and vendors to identify systemic issues, restore services within SLAs, and translate technical incidents into customer and business impact. Drives operational excellence through robust monitoring, analytics, governance, and continuous improvement of incident response and service management practices across multiple environments.
1. Digital Channel & API Operations Management
Monitor end-to-end health of digital channels (web and mobile app), ensuring uptime, stability, and performance against defined SLOs and SLAs.
Proactively track and analyze API performance metrics including success rate, latency, throughput, and error rates.
Identify systemic issues, degradation patterns, and recurring failure points impacting transactions and system reliability.
Coordinate with IT, platform teams, and external vendors to investigate incidents, validate root causes, and restore services within agreed timelines.
Quantify technical incidents in terms of operational and customer impact to support prioritization and remediation.
2. Incident & Recovery Operations
Serve as the operational incident lead during channel or API disruptions, driving coordination across L2/L3 support, engineering, infrastructure, and vendors.
Ensure proper incident triage, escalation, and resolution in accordance with incident management frameworks (e.g., ITIL).
Oversee execution of recovery actions, post-incident validation, and service normalization activities.
Ensure incidents are logged, tracked, and closed within SLA, with accurate classification for trend analysis and root cause analysis (RCA).
3. Technical Monitoring, Analytics & Reporting
Develop and maintain operational dashboards covering availability, latency, error rates, traffic, and incident trends using monitoring tools (e.g., Grafana, AWS).
Analyze operational data to detect early warning signals, capacity risks, or reliability gaps.
Produce weekly and monthly technical operations reports summarizing channel health, incident patterns, and stability risks.
Partner with analytics and engineering teams to correlate system performance with downstream customer and business impact.
Apply technical skills in GCP and AWS APIs.
4. Operational Excellence & Governance
Execute and continuously improve the Digital Operations Playbook, including monitoring standards, escalation paths, and incident response procedures.
Enforce operational governance through consistent RCA documentation, post-incident reviews, and corrective action tracking.
Coach and guide operations analysts on technical monitoring practices, incident handling, and service management standards.
Govern service transitions from project delivery teams to steady-state L2/L3 operations, ensuring readiness, documentation completeness, and monitoring coverage.
Key KPIs (Technical Operations–Led)
Channel Availability: ≥ 99.5% uptime for web and mobile app
Incident Resolution: ≥ 95% resolved within SLA
Mean Time to Detect (MTTD) & Restore (MTTR): Continuous improvement targets
Recurring Incident Reduction: Measured quarter-over-quarter
Top Deliverables (Action-Oriented, KPI-Driven)
Ensure Channel & API Reliability – Maintain stable, high-performing digital channels through proactive monitoring and controls, driving improvements in availability, MTTD, and overall service reliability.
Drive Fast Incident Detection & Recovery – Lead incident triage, escalation, and resolution to minimize customer impact, with clear accountability for MTTD and MTTR performance.
Prevent Recurrence & Improve Resilience – Convert incident learnings into prioritized fixes, monitoring enhancements, and process changes that reduce repeat incidents and improve long-term MTTR and NPS.
Provide Performance Visibility & Governance – Deliver regular, outcome-focused reporting on MTTD, MTTR, SLA attainment, and NPS impact to enable leadership oversight and data-driven decisions.
Hiring Requirements
Work Experience
At least two years of full-time work experience in web front-end and back-end configuration.
Skills in GCP, AWS, MongoDB, Clairevoyance, and CMS.
Experience in crafting, delivering, or reviewing learning programs is a plus.
Experience in operational readiness, service delivery, or technical enablement is a plus.
Level of Knowledge & Skills
4–6 years of experience in IT, digital operations, platform operations, or service management.
Strong understanding of incident, problem, and service management (ITIL preferred).
Hands-on or working knowledge of monitoring, logging, and observability tools (e.g., Grafana, AWS).
Experience working with APIs, digital platforms, and distributed systems.
Proven ability to coordinate across engineering, IT operations, business stakeholders, and vendors.
Familiarity with Globe systems and digital platforms is an advantage.
Equal Opportunity Employer
Globe’s hiring process promotes equal opportunity for all applicants. No form of discrimination is tolerated at any point in the employee lifecycle, including the hiring process (posting vacancies, selecting, and interviewing applicants).
Globe’s Diversity, Equity and Inclusion Policy Commitment can be accessed here
Make Your Passion Part of Your Profession. Attracting the best and brightest talent is pivotal to our success. If you are ready to share our purpose of Creating a Globe of Good, explore opportunities with us.