What we need…
We're looking for a Site Reliability Engineer to join our SRE team, with strong expertise across Kubernetes, cloud infrastructure, and automation in both public cloud (AWS/Azure) and on-premise environments. You'll bring a passion for reducing toil through automation, observability, and thoughtful engineering.
The ideal candidate will have…
Proficiency with Kubernetes in production environments, including cluster provisioning, troubleshooting, and monitoring
Linux systems administration skills, including performance tuning and application troubleshooting
Comfort with scripting and automation using Python, Bash, and/or PowerShell
Experience deploying infrastructure as code with Terraform and/or Ansible
Experience with CI/CD pipelines, GitLab CI preferred
A mindset focused on problem-solving, blameless root cause analysis, and continuous improvement
Willingness to participate in on-call rotation and incident response
What’s in it for you…
COCC offers a collaborative environment, career growth, and all the benefits you’d expect from an award-winning employer, including:
Hybrid schedules and ample paid time off, giving you work/life balance and flexibility
Customized training and onboarding to support you in your first year at COCC
Robust employee development programs aligned with career pathing objectives
Cutting-edge training and educational resources from vendors like SANS, Pluralsight, and CBT Nuggets
Generous PTO offerings, benefits and competitive compensation
On-site fitness centers, wellness incentives, and lifestyle spending accounts
Tuition reimbursement
One-on-one career coaching
DEIB initiatives championing inclusion and encouraging you to bring your whole self to work
Financial planning assistance with certified professionals
Peer recognition programs
What you’ll do…
Manage and support Kubernetes clusters (on-premises via TKGI/VKS and cloud) across production, staging, and development environments, ensuring stability, scalability, and high availability
Diagnose and resolve complex issues across Kubernetes, container runtimes, workloads, operating systems, and supporting infrastructure
Deploy and manage cloud (AWS/Azure) and on-premise infrastructure using Terraform, Ansible, and Helm
Build, maintain, and troubleshoot GitLab CI/CD pipelines for application and infrastructure deployments
Implement and maintain observability using our stack: Alloy, Prometheus, Mimir, Loki, Grafana, Dynatrace, and NetBrain
Containerize applications and author Kubernetes deployment manifests
Plan, design, and execute automation solutions to reduce manual workload across the Infrastructure Department
Conduct blameless root cause analyses on incidents and outages, turning learnings into runbooks and preventive measures
Provide emergency response through on-call rotation, reacting to monitoring alerts and escalating when needed
Develop internal tools in Python, Go, and/or C# to improve automation, observability, and self-service capabilities for engineering teams
Administer and troubleshoot Windows Server environments as needed, complementing our primarily Linux-based infrastructure
What you’ll bring…
Bachelor's degree in Computer Science preferred, but relevant work experience and/or certifications will be considered.
3-5 years' experience in performance and infrastructure engineering
Experience with containerization, cloud technology and Kubernetes preferred
Motivation to change processes, innovate new products and make a difference on a collaborative team
This role is virtual, but local candidates who can come to the Rocky Hill office a few times per year are preferred. The salary range for this role is $122,400-$170,000.