Maya

Senior Site Reliability Engineer

Posted Today

Job Description

NATURE OF WORK

  • Lead architectural design and implementation of fault-tolerant, self-healing infrastructure across cloud and hybrid environments
  • Drive organization-wide automation initiatives, eliminating manual operations through advanced IaC and CI/CD frameworks
  • Own technical program leadership for reliability initiatives spanning multiple teams and services
  • Manage OPEX and CAPEX budgets strategically, with accountability for cost optimization
  • Apply deep expertise in compliance frameworks (CIS, PCI-DSS, BSP) to architect compliant solutions
  • Establish and enforce cloud governance policies, account structures, and organizational standards across AWS/Azure/GCP environments

REQUIRED QUALIFICATIONS

  • Expert-level proficiency in Kubernetes (CRDs, Operators, multi-tenancy, advanced scheduling)
  • Advanced Terraform expertise (custom providers, module design, automated testing)
  • Deep Service Mesh knowledge (Istio traffic management, circuit breaking, rate limiting, mTLS)
  • Proven experience building Internal Developer Platforms (IDP) with self-service workflows
  • Advanced GitLab CI/CD and GitOps implementation (ArgoCD/FluxCD, multi-project pipelines)
  • Expert-level WAF, API Gateway (Kong, Apigee, AWS APIGW), and network security implementation
  • Strong software development skills in Go, Python, or Java, with the ability to review code for reliability impact
  • Experience leading technical programs and cross-functional reliability initiatives
  • Deep understanding of observability platforms (Dynatrace, Prometheus, OpenTelemetry) with custom integration experience
  • Proven track record architecting microservices with high-availability and resiliency patterns
  • Experience implementing AWS Organizations, Control Tower, Service Control Policies, and multi-account governance frameworks
  • Proficiency in cloud policy-as-code tools (AWS Config, OPA, Sentinel) and compliance automation
  • Knowledge of cloud security standards (CIS Benchmarks, AWS Well-Architected Framework, Azure/GCP best practices)
  • Advanced expertise in Dynatrace, Datadog, or Grafana for building enterprise observability solutions
  • Experience implementing SLO-based alerting, error budgets, and burn rate monitoring using Prometheus, Grafana, or commercial APM tools
  • Proficiency in distributed tracing (Jaeger, Zipkin, OpenTelemetry) and log aggregation (ELK, Loki)
  • Ability to design custom metrics, synthetic monitoring, and real user monitoring (RUM) strategies
