Job Description
Location: Guadalajara (Mexico)
What you'll do:
- Experience working with large-scale distributed systems, including scalability, disaster recovery, and fault tolerance.
- Expertise in Python scripting.
- Define, implement, and own SLIs, SLOs, and error budgets for critical microservices in collaboration with product and engineering teams.
- Use error budgets to influence release decisions, prioritize reliability work, and manage operational risk.
- Design and maintain observability platforms including metrics, logs, traces, and real-time telemetry.
- Track, manage, and reduce operational toil by converting repetitive operational work into Jira stories and epics with clear ownership and measurable outcomes.
- Design, implement, and validate resiliency mechanisms such as graceful degradation, redundancy, automated failover, and disaster recovery.
- Lead incident response, act as an escalation point for high-severity incidents, and drive blameless postmortems.
- Partner with scrum teams to improve reliability through release readiness reviews, production change validation, and testing strategies.
- Capture incident action items and reliability improvements in Jira, ensuring closure, accountability, and continuous improvement.
- Perform deep root cause analysis, debugging, and performance tuning across distributed systems.
- Provide technical leadership and mentoring to junior SREs and engineers.
- Promote shift-left reliability by embedding operability, monitoring, and failure testing early in the SDLC.
- Strong knowledge of CI/CD pipelines, Git, and AWS/Azure/GCP PaaS services.
- Demonstrated knowledge of configuration management and deployment automation tools.
- Strong experience with networking concepts and protocols (HTTP, HTTPS, Telnet, SSH, firewalls, VPN, routing, and load balancing).
- Strong experience with Linux.
- Experience with monitoring solutions such as Prometheus and Grafana, and products such as ELK or Splunk.
- Experience working with large-scale systems.
- Experience with container and orchestration technologies such as Docker and Kubernetes.
- Experience with a service mesh such as Istio is an added advantage.
- Experience with a CDN such as Akamai.
What you'll bring:
- Bachelor's Degree in Computer Science or related technical field.
- 4+ years of experience in SRE, software engineering, or production operations supporting large-scale eCommerce platforms.
- Hands-on experience with Java/J2EE-based distributed systems. React experience is a plus.
- Proven ability to design and operate systems using SLO-driven reliability models.
- Experience defining and measuring SLIs (availability, latency, error rates, throughput, saturation).
- Good understanding of NoSQL technologies and RDBMS, including the ability to write queries to retrieve data from a database.
- Experience deploying and operating services on cloud platforms (AWS, Azure, or Google Cloud).
- Expertise with observability, APM, and caching tools (Dynatrace, Splunk, ELK, Akamai, QuantumMetric/Tealeaf, etc.).
- Strong experience using Jira for backlog management, incident follow-ups, toil reduction tracking, and cross-team coordination.
- Ability to independently own services and drive reliability initiatives end-to-end.
- Strong communication skills and ability to influence engineering and product teams.
- Experience participating in an on-call rotation and handling critical and high-severity incidents.
Good to have:
- Candidates with application support experience can be considered.
- Experience with any monitoring tool, such as New Relic or Datadog, is acceptable.
- Candidates with 3 to 4 years of experience are acceptable; junior candidates with 3 years of experience can also be considered.
- Akamai experience is optional.
- Any cloud experience is acceptable.
