Site Reliability Engineer - Membership Platform Department (MPD)
Job Description
Business Overview
The Technology Platforms Division (TPD) drives the growth of the Rakuten Ecosystem by delivering innovative, high-quality technology platforms characterized by integrated control and strategic partnerships.
Within TPD, the Ecosystem Platform Supervisory Department (EPSD) develops scalable and reliable platforms that support the entire Rakuten Ecosystem globally, fostering a culture of ownership and data-driven decision-making.
Department Overview
The Membership Platform Department (MPD) is responsible for the development and maintenance of scalable platforms that provide membership services, profile management, and fraud prevention across services within the global Rakuten Ecosystem. We are focused on creating next-generation membership services to support our users across the world.
Position Details
Responsibilities
- Own the reliability, availability, and performance of assigned platform services through SLI/SLO definition and error budget management
- Design, build, and maintain CI/CD pipelines and deployment automation for production systems
- Respond to and lead resolution of production incidents; contribute to blameless postmortems and drive follow-up action items to closure
- Develop and improve observability tooling: metrics, dashboards, alerting, and distributed tracing
- Identify and automate repetitive operational tasks to reduce toil, using AI-assisted development tools such as Claude Code to accelerate delivery
- Collaborate with development teams on infrastructure design, capacity planning, and production readiness reviews
- Participate in an on-call rotation and actively improve the on-call experience over time
- Contribute to internal documentation, runbooks, and knowledge sharing
Mandatory Qualifications:
- 3 to 5 years of experience in site reliability engineering, DevOps, or systems/platform engineering
- Hands-on experience managing Kubernetes in production (cluster operations, workload design, resource management)
- Proficiency with Infrastructure as Code (Terraform, Pulumi, or equivalent)
- Experience building and maintaining CI/CD pipelines (GitLab CI/CD, Jenkins, Argo CD, Tekton, or similar)
- Practical experience with observability stacks: metrics (Prometheus/Grafana or Datadog), structured logging, and distributed tracing (OpenTelemetry)
- Solid understanding of Linux systems, networking fundamentals (TCP/IP, DNS, HTTP/gRPC), and container runtimes
- Experience operating and troubleshooting virtual and bare-metal machine infrastructure (provisioning, lifecycle management, performance tuning) in on-premises or cloud environments
- Experience writing automation and tooling in at least one language: Python, Go, or Bash
- Demonstrated ability to diagnose and resolve production incidents independently
- Comfortable using AI-assisted development tools (e.g., Claude Code, GitHub Copilot) to write, review, and debug code and infrastructure configurations
Desired Qualifications:
- Experience with cloud platforms (AWS, GCP, or Azure) and managed Kubernetes offerings (EKS, GKE, or AKS)
- Familiarity with GitOps workflows and progressive delivery patterns (canary releases, feature flags)
- Exposure to platform engineering concepts: internal developer platforms, self-service infrastructure, or developer portals
- Experience with security and compliance tooling in a DevSecOps context (SBOM, container image scanning, secrets management with Vault or cloud-native equivalents)
- Knowledge of cost optimization practices for cloud infrastructure
- Background in SRE principles (error budgets, toil reduction, capacity planning)
#engineer #infrastructureengineer #technologyplatformdiv #Python #Golang #Go