Graphite - Site Reliability Engineer (SRE)

Guadalajara, MEX
Posted 8 months ago
Full-time, hybrid

Job Description

Overview

We're looking for a passionate and hands-on Site Reliability Engineer (SRE) to join our team. This role is critical for ensuring the stability, performance, and scalability of our production services. You'll be the bridge between development and operations, with a strong focus on using code to manage infrastructure and eliminate toil.

Key Responsibilities

  • Monitoring and Alerting: Design, implement, and maintain robust monitoring and alerting systems (e.g., GCP Monitoring, Prometheus, Grafana), covering metrics, traces, and logs, to provide visibility into application performance and infrastructure health.
  • Infrastructure Management: Build, provision, and maintain our core infrastructure, with a strong emphasis on Cloud environments and Kubernetes clusters.
  • Automation and Tooling: Write and maintain scripts and automation workflows (e.g., in Python, Bash, or TypeScript with Pulumi) to streamline deployment, scaling, and operational tasks, embracing the philosophy of "automating everything."
  • Incident Response: Provide hands-on, real-time incident response and participate in an on-call rotation to quickly mitigate service disruptions and restore functionality.
  • Production Debugging: Debug and troubleshoot complex production problems in depth across the entire stack, from network issues to application code defects.
  • Process Improvement: Conduct blameless post-mortems for major incidents, implementing long-term solutions to prevent recurrence and continuously improve service reliability.

Qualifications

  • Proven experience as an SRE, DevOps Engineer, or in a similar role.
  • Expertise in managing and scaling Kubernetes in a production environment.
  • Strong proficiency in a scripting or programming language (e.g., Python, Go, Bash).
  • Deep understanding of monitoring, logging, and alerting best practices.
  • Solid experience with at least one major Cloud provider (AWS, GCP, or Azure).
  • Experience with Infrastructure as Code (IaC) tools like Terraform or Pulumi is a plus.

What You'll Bring

A proactive, data-driven approach to reliability and a passion for managing complex systems at scale.


Graphite - Site Reliability Engineer (SRE) at Rctsglobal Com | Renata