COCC

Site Reliability Engineer - Req #593

Remote
Posted 1 month ago

Job Description

What we need…

We're looking for a Site Reliability Engineer to join our SRE team, with strong expertise across Kubernetes, cloud infrastructure, and automation in both public cloud (AWS/Azure) and on-premise environments. You'll bring a passion for reducing toil through automation, observability, and thoughtful engineering.

The ideal candidate will have...

- Proficiency with Kubernetes in production environments, including cluster provisioning, troubleshooting, and monitoring
- Linux systems administration skills, including performance tuning and application troubleshooting
- Comfort with scripting and automation using Python, Bash, and/or PowerShell
- Experience deploying infrastructure as code with Terraform and/or Ansible
- Experience with CI/CD pipelines, GitLab CI preferred
- A mindset focused on problem-solving, blameless root cause analysis, and continuous improvement
- Willingness to participate in on-call rotation and incident response

What’s in it for you…

COCC offers a collaborative environment, career growth, and all the benefits you’d expect from an award-winning employer, including:

- Hybrid schedules and ample paid time off allowing you work/life balance and flexibility
- Customized training and onboarding to support you in your first year at COCC
- Robust employee development programs aligned with career pathing objectives
- Cutting-edge training and educational resources from vendors like SANS, PluralSight and CBTNuggets
- Generous PTO offerings, benefits and competitive compensation
- On-site fitness centers, wellness incentives, and lifestyle spending accounts
- Tuition reimbursement
- One-on-one career coaching
- DEIB initiatives championing inclusion and encouraging you to bring your whole self to work
- Financial planning assistance with certified professionals
- Peer recognition programs

What you’ll do…

- Manage and support Kubernetes clusters (on-premises via TKGI/VKS and cloud) across production, staging, and development environments, ensuring stability, scalability, and high availability
- Diagnose and resolve complex issues across Kubernetes, container runtimes, workloads, operating systems, and supporting infrastructure
- Deploy and manage cloud (AWS/Azure) and on-premise infrastructure using Terraform, Ansible, and Helm
- Build, maintain, and troubleshoot GitLab CI/CD pipelines for application and infrastructure deployments
- Implement and maintain observability using our stack: Alloy, Prometheus, Mimir, Loki, Grafana, Dynatrace, and Netbrain
- Containerize applications and author Kubernetes deployment manifests
- Plan, design, and execute automation solutions to reduce manual workload across the Infrastructure Department
- Conduct blameless root cause analyses on incidents and outages, turning learnings into runbooks and preventive measures
- Provide emergency response through on-call rotation, reacting to monitoring alerts and escalating when needed
- Develop internal tools in Python, Go, and/or C# to improve automation, observability, and self-service capabilities for engineering teams
- Administer and troubleshoot Windows Server environments as needed, complementing our primarily Linux-based infrastructure

What you’ll bring…

- Bachelor's degree in Computer Science preferred, but relevant work experience and/or certifications will be considered
- 3-5 years' experience in performance and infrastructure engineering
- Experience with containerization, cloud technology and Kubernetes preferred
- Motivation to change processes, innovate new products and make a difference on a collaborative team

This role is virtual, but local candidates who can come to the Rocky Hill office a few times per year are preferred.

Salary range for this role is $122,400-$170,000.
