Senior / Lead DevOps Engineer
Job Description
About the Role
- Gradion is expanding its SRE team for a client with a long-term managed services contract running through 2028. You will be part of a global, follow-the-sun SRE function, responsible for platform stability, cloud infrastructure, and 24/7 incident response across European and global client time zones.
- This role suits engineers who are technically solid, self-directed, and comfortable operating in a fast-moving, internationally distributed environment. You will go through a structured onboarding alongside an internal SRE team before taking on independent operational responsibility.
What You Will Do
- Own platform availability: monitor, triage, and resolve incidents within defined SLA windows
- Manage cloud infrastructure on AWS and/or GCP - provisioning, scaling, and day-to-day operations
- Maintain and improve CI/CD pipelines and GitOps workflows
- Operate observability systems: monitoring, logging, and alerting at production scale
- Participate in on-call rotation as part of the global follow-the-sun coverage model
- Configure, deploy, and manage AI tooling and MCP servers in production environments
- Contribute to infrastructure automation, scripting, and internal tooling
- Write clear post-incident reviews and contribute to the monthly operational report
- Collaborate closely with engineering teams across multiple time zones
What You Bring
- 4+ years in a DevOps / SRE / Platform Engineering role within an international team
- Solid Kubernetes knowledge - cluster operations, troubleshooting, and configuration
- Hands-on cloud experience with AWS and/or GCP
- Good understanding of networking fundamentals - DNS, load balancing, firewalls, VPC
- Scripting and automation skills (Python, Bash, or similar)
- Experience with CI/CD tools and GitOps-based delivery
- Working knowledge of monitoring and observability systems (Prometheus, ELK, or equivalent)
- Good English - daily communication with European stakeholders is a core requirement
- Self-directed and proactive - you ask the right questions and drive issues to resolution without waiting to be told
Nice to Have
- Experience configuring and managing MCP servers and AI tooling in production
- Exposure to AI enablement workflows or LLM infrastructure
- Background supporting eCommerce or SaaS platforms
- Familiarity with the Frontastic / commercetools Frontend ecosystem
🏆 Join Vietnam’s Best IT Company - Gradion Vietnam (formerly NFQ Vietnam) was recognized by ITViec for 8 consecutive years, including 2 successive years as the Winner. Work with some of the best minds in the industry and be part of a company that’s redefining how businesses scale through technology.
🌍 Career Growth & Leadership Development - Work closely with our leadership team, gain mentorship from experienced executives, and have direct exposure to high-level strategic decisions. Your growth is limitless: as long as you’re ready to step up, opportunities will always be there for you.
🚀 AI-First Engineering & Strategic Consulting - Our engineering culture integrates AI as a core driver of design, development, and optimization - not an add-on. As a forward-thinking consultancy, we go beyond traditional engineering, combining technical excellence with a strategic mindset to deliver transformative solutions for ambitious businesses.
💰 Competitive Compensation - We believe great talent deserves great rewards. Expect an attractive salary, performance-based bonuses, and a benefits package that reflects your impact. We value talent over salary budgets - exceptional contributions deserve exceptional rewards.
✨ And Many More Benefits to Explore! Most importantly, we offer a healthy work-life balance and an environment where you can thrive professionally and personally. Benefits include:
- Performance bonus of up to 2 months’ salary.
- Performance review twice a year, so your growth is recognized and rewarded.
- Premium healthcare for you, plus an annual health check.
- 15 days of annual leave.
- Full salary during probation.
- Hybrid working for real flexibility.
- Monthly Happy Hour and Community Tech activities.
- Work on global projects as part of an innovation team that shapes ideas for the hi-tech world.
- Diverse training programs to keep you growing.