Senior Site Reliability Engineer at Maya

NATURE OF WORK

Lead architectural design and implementation of fault-tolerant, self-healing infrastructure across cloud and hybrid environments
Drive organization-wide automation initiatives, eliminating manual operations through advanced IaC and CI/CD frameworks
Own technical program leadership for reliability initiatives spanning multiple teams and services
Manage OPEX and CAPEX budgets strategically, with accountability for cost optimization
Apply deep expertise in compliance frameworks (CIS, PCI-DSS, BSP) to architect compliant solutions
Establish and enforce cloud governance policies, account structures, and organizational standards across AWS/Azure/GCP environments

REQUIRED QUALIFICATIONS

Expert-level proficiency in Kubernetes (CRDs, Operators, multi-tenancy, advanced scheduling)
Advanced Terraform expertise (custom providers, module design, automated testing)
Deep Service Mesh knowledge (Istio traffic management, circuit breaking, rate limiting, mTLS)
Proven experience building Internal Developer Platforms (IDP) with self-service workflows
Advanced GitLab CI/CD and GitOps implementation (ArgoCD/FluxCD, multi-project pipelines)
Expert-level implementation of WAF, API gateways (Kong, Apigee, AWS API Gateway), and network security
Strong software development skills in Go, Python, or Java, with the ability to review code for reliability impact
Experience leading technical programs and cross-functional reliability initiatives
Deep understanding of observability platforms (Dynatrace, Prometheus, OpenTelemetry) with custom integration experience
Proven track record architecting microservices with high-availability and resiliency patterns
Experience implementing AWS Organizations, Control Tower, Service Control Policies, and multi-account governance frameworks
Proficiency in cloud policy-as-code tools (AWS Config, OPA, Sentinel) and compliance automation
Knowledge of cloud security standards (CIS Benchmarks, AWS Well-Architected Framework, Azure/GCP best practices)
Advanced expertise in Dynatrace, Datadog, or Grafana for building enterprise observability solutions
Experience implementing SLO-based alerting, error budgets, and burn rate monitoring using Prometheus, Grafana, or commercial APM tools
Proficiency in distributed tracing (Jaeger, Zipkin, OpenTelemetry) and log aggregation (ELK, Loki)
Ability to design custom metrics, synthetic monitoring, and real user monitoring (RUM) strategies
