Site Reliability Engineer - Senior
Job Description
About the Role
- Gradion is expanding its SRE team for a client with a long-term managed services contract running through 2028. You will be part of a global, follow-the-sun SRE function, responsible for platform stability, cloud infrastructure, and 24/7 incident response across European and global client time zones.
- This role suits engineers who are technically solid, self-directed, and comfortable operating in a fast-moving, internationally distributed environment. You will go through a structured onboarding alongside commercetools' internal SRE team before taking on independent operational responsibility.
What You Will Do
- Own platform availability: monitor, triage, and resolve incidents within defined SLA windows
- Manage cloud infrastructure on AWS and/or GCP - provisioning, scaling, and day-to-day operations
- Maintain and improve CI/CD pipelines and GitOps workflows
- Operate observability systems: monitoring, logging, and alerting at production scale
- Participate in on-call rotation as part of the global follow-the-sun coverage model
- Configure, deploy, and manage AI tooling and MCP servers in production environments
- Contribute to infrastructure automation, scripting, and internal tooling
- Write clear post-incident reviews and contribute to the monthly operational report
- Collaborate closely with engineering teams across multiple time zones
What You Bring
- 4+ years in a DevOps / SRE / Platform Engineering role within an international team
- Solid Kubernetes knowledge - cluster operations, troubleshooting, and configuration
- Hands-on cloud experience with AWS and/or GCP
- Good understanding of networking fundamentals - DNS, load balancing, firewalls, VPC
- Scripting and automation skills (Python, Bash, or similar)
- Experience with CI/CD tools and GitOps-based delivery
- Working knowledge of monitoring and observability systems (Prometheus, ELK, or equivalent)
- Fluent English (C1 minimum) - daily communication with European stakeholders is a core requirement
- Self-directed and proactive - you ask the right questions and drive issues to resolution without waiting to be told
Nice to Have
- Experience configuring and managing MCP servers and AI tooling in production
- Exposure to AI enablement workflows or LLM infrastructure
- Background supporting eCommerce or SaaS platforms
- Familiarity with the Frontastic / commercetools Frontend ecosystem