Site Reliability Engineer (SRE)

Singapore, SG
Posted 2 weeks ago
Remote

Job Description

Job Summary

SGX is hiring Site Reliability Engineers who treat operations as a software problem. You'll keep production healthy, but more importantly you'll build the automation, tooling, and agentic workflows that make running our systems boring and predictable. This is an engineering role: if your instinct on a recurring issue is to write code that removes it, you'll fit in well.

We operate in a regulated capital-markets environment, so the bar for reliability, security, and operational rigour is high.

Job Responsibilities

  • Own production reliability (SLOs, capacity, incident response, postmortems) and turn every incident into a durable fix in code or automation.
  • Build the platform and tooling that make services easy to deploy, observe, and operate: CI/CD, infrastructure-as-code, observability stacks, runbooks-as-code.
  • Apply AI agentically across operations (triage, root-cause analysis, remediation, change review) and contribute to our internal agentic ecosystem.
  • Design and integrate the systems underneath our services: messaging (e.g. Kafka), orchestration (e.g. Kubernetes), and performance-sensitive infrastructure.
  • Partner with product engineers on release readiness, rollout strategy, and production hardening before things ship.
  • Continuously reduce toil: measure it, attack it with code, and raise the floor on what "easy to maintain" looks like.

Job Requirements

  • 5+ years in SRE, platform, or infrastructure engineering, with a clear track record of replacing manual work with code
  • Strong programming ability in at least one modern language (e.g. Go, Python, Kotlin, TypeScript, Rust); you write production code, not just glue scripts
  • AI-native ways of working: real experience orchestrating agents for ops workflows, not just using AI for autocomplete
  • Deep hands-on experience with Kubernetes, IaC (Terraform or equivalent), CI/CD, and modern observability (metrics, logs, traces)
  • Production experience on a major cloud: GCP preferred, AWS acceptable
  • Solid foundations in distributed systems and the failure modes that matter in production
  • Incident-response maturity: calm under pressure, sharp on root cause, disciplined about follow-through
  • Comfort in complex, regulated environments

Nice to Have

  • Familiarity with the FIX protocol or capital-markets domain
  • Experience building internal developer platforms or self-service tooling consumed by other engineers
