Job Description
Job Location: Mexico City, Mexico
Calling all originals: At Levi Strauss & Co., you can be yourself — and be part of something bigger. We’re a company of people who like to forge our own path and leave the world better than we found it. Who believe that what makes us different makes us stronger. So add your voice. Make an impact. Find your fit — and your future.
We're seeking a curious and driven Site Reliability Engineer to join our Data & AI Platform Engineering team. In this role, you'll help keep our data and AI platforms running reliably, efficiently, and securely — platforms that power decisions across our global retail operations.
You'll work alongside experienced SREs and engineers to monitor production systems, respond to incidents, reduce operational toil, and build the automation that makes our infrastructure more resilient. This is an excellent opportunity to grow your SRE craft in a fast-paced, collaborative environment on Google Cloud Platform, with exposure to multi-cloud technologies and modern data engineering.
About the Job
Reliability & Incident Response
Monitor production systems using observability tooling — dashboards, alerts, and logs — to detect and triage issues before they impact end users
Participate in on-call rotations, respond to incidents following established runbooks, and escalate appropriately when needed
Contribute to blameless post-mortems, documenting root causes and follow-up action items to prevent recurrence
Help maintain and improve SLO dashboards and alerting thresholds to ensure platform health is visible and measurable
Toil Reduction & Automation
Identify repetitive manual tasks and build automation to eliminate them, reducing toil for yourself and the broader team
Write and maintain scripts, tooling, and CI/CD pipeline components that improve deployment reliability and operational efficiency
Support self-serve infrastructure initiatives that allow engineering teams to safely provision and manage their own resources
Platform Operations & Cloud Infrastructure
Operate and maintain workloads running on GCP — including GKE, Cloud Run, BigQuery, Pub/Sub, GCS, and Composer
Apply Infrastructure-as-Code practices (Terraform, Helm) to manage and version infrastructure changes consistently and safely
Maintain multi-cloud awareness across GCP and Azure, following team standards for consistency and security across environments
Adhere to data security and governance policies — IAM best practices, secrets management, encryption, and audit logging
Collaboration & Growth
Work closely with Data Engineering, AI Platform, and Software Engineering teams to ensure reliability is considered from design through deployment
Participate in reliability reviews, design discussions, and team ceremonies, contributing ideas and raising operational concerns early
Engage with AI and agentic platform workloads, gaining exposure to the operational patterns of LLM-based systems and data pipelines
Continuously develop your technical skills and SRE craft, supported by team knowledge-sharing, documentation, and hands-on experience
About You
Required Qualifications
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
6+ years of experience in Site Reliability Engineering, DevOps, or Platform/Infrastructure Engineering in production environments
Hands-on experience with GCP services — particularly GKE, Cloud Run, BigQuery, Pub/Sub, and GCS
Working proficiency with Infrastructure-as-Code tools such as Terraform or Helm
Familiarity with observability tooling — metrics, logging, tracing, and alerting (e.g., Cloud Monitoring, Datadog, or Prometheus/Grafana)
Understanding of SLO/SLI concepts and how they relate to production reliability and on-call operations
Exposure to data security fundamentals: IAM, encryption, secrets management, and network policies
Proficiency in at least one scripting or systems language (Python, Bash, or Go) for automation and operational tooling
Strong communication skills with the ability to clearly document incidents, runbooks, and technical processes
Technical Familiarity
Experience with container orchestration — Kubernetes or GKE — and the operational patterns around deploying and managing containerized workloads
Basic understanding of CI/CD pipelines and GitOps workflows (ArgoCD, GitHub Actions, or similar)
Comfort working with data platforms — familiarity with batch or streaming data pipelines is a plus
Awareness of multi-cloud concepts, particularly across GCP and Azure
Desirable Experience
Experience working in retail, e-commerce, or consumer goods environments
Familiarity with Google's SRE principles — error budgets, toil tracking, and production readiness reviews
Exposure to AI or ML platform operations, including monitoring model serving infrastructure
Experience with FinOps or cloud cost visibility tooling
Why Join Us?
If you're an engineer who is passionate about reliability, loves solving operational problems, and wants to grow your SRE craft at an iconic global brand, we'd love to hear from you.
