
Customer Reliability Engineer (CRE)
Job Description
The Customer Reliability Engineer ensures the best possible customer experience by safeguarding service reliability from the customer's perspective and by making sure Incidents/Service Requests are resolved in the shortest possible timeframe. The CRE maintains overall service quality and health by closing the loop with PO & SRE through feedback on service issues and improvements. The CRE also improves service adoption and customer success by ensuring customers get the most from our Cloud Services.
Activities/Responsibilities:
- Manage Incidents/Service Requests within SLA, track SLA compliance, and take action in case of deviation.
- Escalate Incidents/Service Requests that cannot be resolved to SRE or Engineering (L3).
- Keep customers regularly informed of updates on their Incidents/Service Requests.
- Write Work Instructions (WI)/Incident Response Plans for L1 and automate WIs based on alerts.
- Write Technical Notes/Knowledge Articles for CRE and Customers (focused on service usage improvement).
- Maintain strong technical skills on solutions/services and be the Subject Matter Expert (SME) for selected solutions/services.
- Build and deliver technical webinars for Customers and other CREs; work with SD/PO to build them and with CSM/CRSM to plan them.
- Be Customer Champion for selected accounts: establish a privileged relationship, develop a deeper technical understanding of these accounts, and stay up to date on their plans regarding service usage.
- Deploy Customer-specific changes upon CAB approval.
- Implement and maintain SLIs, dashboards, and Customer-specific alerts to track performance/improvement plans.
- Follow up on Customer activity through dashboards (% of success, % of enrollment, % of conversion, etc.).
- Provide close support to CRSM when it comes to understanding customer use cases and Incidents/Service Requests.
- Scale up/down to meet customer business needs (at CRE level if possible, otherwise raise the need to SRE).
- Lead the RCA when no SRE is involved (if there is a service outage, the CRE will be involved and will naturally become the RCA leader).
- Participate in post-mortems and contribute to RCA.
- Translate the internal RCA into an external RCA and publish it in due time (according to the service/customer agreement).
- Review repeated incidents or known errors with PO/SRE.
- Raise product/service improvement requests to PO/SRE.
- Work on call to provide 24x7x365 support upon L1 escalation.
Requirements: Skills, Experience & Education:
- BS in Computer Science
- 5+ years of experience as a DevOps engineer or SRE in application services or support
- Willingness to work in shifts to offer our customers a 24/7 service
- Nice to have: experience managing Splunk App development, scripting, and log management solution design
- 5+ years of experience integrating data inputs into Splunk from other tools such as AWS, Datadog, GCP, private cloud, etc.
- Knowledge of ITIL and Service Delivery best practices; ITIL certification is a plus
- Knowledge of AWS/GCP cloud, monitoring tools, networking, infrastructure, Linux, and mobile applications
- Experience in direct cooperation with international customers
- Excellent interpersonal and communication skills
- Very good organizational and negotiation skills
- Ability to make decisions and take initiative
- English – Mandatory
- Spanish – Optional
- French – Optional