Location: Guadalajara (Mexico)

What you'll do:

Experience working with large-scale distributed systems, including scalability, disaster recovery, and fault tolerance.
Expertise in Python scripting.
Define, implement, and own SLIs, SLOs, and error budgets for critical microservices in collaboration with product and engineering teams.
Use error budgets to influence release decisions, prioritize reliability work, and manage operational risk.
Design and maintain observability platforms including metrics, logs, traces, and real-time telemetry.
Track, manage, and reduce operational toil by converting repetitive operational work into Jira stories and epics with clear ownership and measurable outcomes.
Design, implement, and validate resiliency mechanisms such as graceful degradation, redundancy, automated failover, and disaster recovery.
Lead incident response, act as an escalation point for high-severity incidents, and drive blameless postmortems.
Partner with scrum teams to improve reliability through release readiness reviews, production change validation, and testing strategies.
Capture incident action items and reliability improvements in Jira, ensuring closure, accountability, and continuous improvement.
Perform deep root cause analysis, debugging, and performance tuning across distributed systems.
Provide technical leadership and mentoring to junior SREs and engineers.
Promote shift-left reliability by embedding operability, monitoring, and failure testing early in the SDLC.
Strong knowledge of CI/CD pipelines, Git, and AWS/Azure/GCP as PaaS services.
Demonstrated knowledge of configuration management and deployment automation tools.
Strong experience with networking concepts and protocols (HTTP, HTTPS, Telnet, SSH, firewalls, VPN, routing, and load balancing).
Strong experience with Linux.
Experience with monitoring solutions such as Prometheus and Grafana, and products such as ELK and Splunk.
Experience working with large-scale systems.
Experience with containers and orchestration technologies such as Docker and Kubernetes.
Experience with a service mesh such as Istio would be an added advantage.
Experience with a CDN such as Akamai.

What you'll bring:

Bachelor's Degree in Computer Science or related technical field.
4+ years of experience in SRE, software engineering, or production operations supporting large-scale eCommerce platforms.
Hands-on experience with Java/J2EE-based distributed systems. React experience is a plus.
Proven ability to design and operate systems using SLO-driven reliability models.
Experience defining and measuring SLIs (availability, latency, error rates, throughput, saturation).
Good understanding of NoSQL technologies and RDBMS. Should be able to write queries to fetch results from a database.
Experience deploying and operating services on cloud platforms (AWS, Azure, or Google Cloud).
Expertise with observability, APM, and caching tools (Dynatrace, Splunk, ELK, Akamai, QuantumMetric/Tealeaf, etc.).
Strong experience using Jira for backlog management, incident follow-ups, toil reduction tracking, and cross-team coordination.
Ability to independently own services and drive reliability initiatives end-to-end.
Strong communication skills and ability to influence engineering and product teams.
Experience being on an on-call rotation and handling critical and high-severity incidents.

Good to have:

Candidates with application support experience can be considered.
Experience with any monitoring tool, such as New Relic or Datadog, is acceptable.
Candidates with 3 to 4 years of experience can be considered, including junior candidates with 3 years of experience.
Akamai experience is optional.
Any cloud experience is acceptable.

SRE + Dynatrace | Mexico

More jobs at Photon