Expert, Site Reliability Engineering (Techcom Life)

Ha Noi, VN
Posted Today

Job Description

Key Accountabilities (1)

Participate in monitoring and handling system alerts/incidents/problems:
- Perform 24/7 monitoring and handle alerts for services across the entire IT infrastructure/applications/services. If difficulties arise, escalate to L3 for coordinated handling.
- Ensure that projects/specialized operations departments provide adequate alert/incident handling instructions for new services before go-live, and periodically review and update existing alert/incident handling instructions.
- Periodically review issues/vulnerabilities in IT infrastructure/applications/services within the scope of responsibility.
- Provide in-depth skills transfer on monitoring and handling alerts and critical IT service incidents.
- Participate in or lead the standardization and development of relevant processes and regulations to ensure effective monitoring and handling of alerts/incidents.
- Coordinate with relevant units to promptly restore services/systems, investigate root causes, and propose and implement solutions.
- Participate in implementing changes across the software development environment, including on-premises and cloud.
 
Participate in building and optimizing centralized monitoring tools:
- Develop and promulgate standards for centralized monitoring tools (Dynatrace, Grafana, Splunk...) and operate those tools.
- Integrate monitoring tools and support building monitoring dashboards for new IT infrastructure/applications/services.
- Ensure that projects/specialized operations departments provide adequate monitoring indicators/thresholds for new services before go-live.