Middle/High-Middle DevOps / SRE Engineer at Aghanim

Aghanim helps mobile game studios build direct relationships with their players. We provide the infrastructure behind direct sales, taking care of payments, tax, and compliance while enabling teams to run LiveOps, segmentation, and player engagement at scale.
Today, more than 100 live games worldwide rely on Aghanim. We build products that become part of our customers’ core operations, which is why reliability, simplicity, and attention to detail matter in everything we do.

We’re a team of around 50 people across California, Lisbon, Belgrade, and remote locations. We move quickly, keep communication direct, and give people the freedom and responsibility to own their work. We care deeply about building useful products, maintaining high standards, and working with thoughtful, ambitious people.

We’re looking for a Middle/High-Middle DevOps / SRE Engineer to help run and improve our production platform on GCP and GKE, fronted by Cloudflare, with observability in Datadog and CI/CD in GitHub Actions.

You’ll work closely with Senior/Principal engineers, implementing reliability improvements, expanding monitoring coverage, and reducing operational toil, which is especially important in a high-load system with sudden traffic spikes.

Role Responsibilities

Platform Operations (GCP/GKE)

Operate and support production systems on GCP, primarily GKE and managed services.
Execute platform improvements and operational tasks delegated by Senior/Principal owners.

IaC & Delivery Enablement

Implement infrastructure changes via Terraform (and Terragrunt where used).
Maintain and evolve Helm charts and Kubernetes manifests.
Improve reliability of GitHub Actions / CI/CD workflows and deployment automation.

Observability & Monitoring (Datadog)

Build and maintain Datadog dashboards/monitors and keep alerting healthy.
Close monitoring gaps across critical components; reduce noisy alerts and improve signal quality.

Incident Response

Participate in incident response and operational support: triage, mitigation using runbooks, escalation, and follow-up fixes.
Contribute to postmortems with clear facts, timelines, and actionable remediation tasks.

Security Basics (DevSecOps)

Run/configure security tooling and monitoring, help triage findings, and implement fixes under guidance.
Support secure-by-default practices (secrets hygiene, access controls, baseline hardening).

Cost Awareness

Identify and implement cost optimizations (right-sizing, waste removal, efficiency improvements) without harming reliability.

Required Qualifications

Hands-on production experience with Kubernetes (ideally GKE) and basic cluster operations.
Working experience with Terraform and Helm in PR-based workflows.
Familiarity with GCP services used in SaaS operations (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Cloud Run, Memorystore).
Monitoring/alerting and troubleshooting skills (preferably Datadog).
Strong scripting/automation mindset to reduce manual work and prevent repetitive incidents.
Reliability awareness: understanding how changes affect availability/latency and how to operate under SLA constraints.

Preferred Qualifications

Cloudflare basics (WAF, DNS, edge concepts; Workers and CDN experience is a plus).
Experience writing/maintaining runbooks and participating in postmortems.
Exposure to SOC 2 / PCI-DSS requirements or willingness to learn.
Experience in high-load consumer products or game dev.

What Success Looks Like

Improved monitoring coverage and healthier alerting (less noise, faster detection).
Faster, safer deployments with fewer manual steps and fewer production regressions.
Incidents are triaged effectively and resolved within expected timelines.
Platform reliability improves through steady delivery of operational fixes and automation.
Costs trend in the right direction thanks to recurring optimizations and guardrails.

Why Join Us

Cloud-only, high-load environment with real engineering challenges (not “just keep the lights on”).
Small team with ownership, autonomy, and quick iteration.
Strong opportunity to grow into broader platform ownership and SRE leadership paths.
Direct impact on reliability, scalability, and developer velocity.

Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.
