Sr. Technology Operations & Reliability Manager (100% Remote)
Remote | Posted 4 days ago
Job Description
Summary:
The Sr. Technology Operations & Reliability Manager is responsible for leading and managing the DevOps and 24/7 Site Reliability Engineering (SRE) teams to ensure the performance, reliability, and scalability of the ClearCaptions call platform. The role requires a strategic and hands-on approach to optimizing cloud infrastructure, implementing automation, and enforcing best practices to maintain high availability of enterprise applications and databases. The Sr. Technology Operations & Reliability Manager collaborates with stakeholders across departments to ensure a resilient, secure, and highly efficient technology environment.
This role requires practical experience supporting highly available telecommunications, VoIP, or real-time communications platforms. Responsibilities include understanding the operational impact of SIP signaling, media quality, carrier interconnects, routing, SBCs, and telecom-related incident response, while leading teams responsible for platform reliability, observability, escalation management, and continuous improvement.
What This Role Does:
Leads and manages the DevOps and 24/7 SRE teams responsible for CI/CD pipelines, cloud infrastructure, observability, and automation.
Develops and enforces best practices for site reliability, incident management, and infrastructure optimization.
Ensures high availability, performance, and scalability of the call platform while minimizing service disruptions.
Leads operational reliability for ClearCaptions’ real-time communications platform, including VoIP, PSTN-connected services, SIP-based call flows, carrier interconnects, routing, and media quality.
Partners with telecom engineering, carriers, vendors, SRE, Security, Product, and Engineering teams to improve resiliency, reduce incident frequency, and ensure rapid recovery from service-impacting events.
Oversees telecom-related incident response, escalation, RCA, and corrective/preventive action plans, with measurable improvements in availability, MTTR, call quality, and customer impact reduction.
Ensures operational readiness for telecom platform changes, including runbooks, monitoring, alerting, change validation, failover testing, and disaster recovery planning.
Drives observability and SLO development for call-platform services, including signaling, media, carrier connectivity, routing performance, and service health.
Establishes monitoring and alerting strategies using industry-leading observability tools to proactively detect and resolve issues.
Oversees infrastructure management, configuration, scaling, and deployment of enterprise server-based computing systems in AWS.
Implements and manages infrastructure as code (IaC) solutions to enhance automation and cloud efficiency.
Drives cloud-native architecture improvements to support business growth and operational excellence.
Ensures compliance with security policies, internal controls, and cloud governance frameworks.
Recruits, mentors, and develops DevOps and SRE team members, fostering a culture of continuous learning and innovation.
Collaborates with cross-functional teams, including Security, Product, and Engineering, to align technical strategy with business objectives.
Leads tactical responses to cybersecurity incidents and implements preventative security measures.
Develops strategic initiatives to improve operational efficiency, system resilience, and cost optimization.
Performs other duties as assigned.
What You Will Bring:
Bachelor’s degree in computer science, engineering, or a related field. Equivalent combinations of education and relevant experience will be considered.
A minimum of eight (8) years of experience in IT operations, DevOps, SRE, or telecommunications operations, including leadership responsibility for mission-critical cloud, VoIP, UCaaS, contact center, or real-time communications platforms.
A minimum of three (3) years of experience leading people or teams, including accountability for performance, development, and delivery of results aligned with organizational objectives.
Telecom, VoIP, or real-time communications platform experience strongly preferred; experience with SIP, SBCs, carrier interconnects, and VoIP/PSTN interoperability is highly desirable.
Working knowledge of SIP, RTP, SBCs, carrier interconnects, call routing, VoIP/PSTN interoperability, and telecom incident troubleshooting.
Experience leading operational response for mission-critical communications platforms, including major incidents, carrier/vendor escalations, RCAs, and service reliability improvement plans.
Familiarity with telecom observability practices, including dashboards, alerting, SLOs, runbooks, packet/call-flow analysis, and quality/performance baselining.
AWS Certified Solutions Architect – Associate certification required; higher-level AWS certifications preferred.
Expertise in cloud computing, Kubernetes, and infrastructure automation tools (Terraform, CloudFormation, or similar).
Experience managing 24/7 mission-critical applications and implementing SRE principles.
Proficiency in observability and monitoring tools such as Prometheus, Grafana, Datadog, or New Relic.
Hands-on experience with CI/CD tools such as Jenkins, GitHub Actions, or GitLab CI/CD.
Strong knowledge of networking, Linux administration, and troubleshooting complex system performance issues.
Experience optimizing enterprise databases and cloud-based storage solutions.
Demonstrated ability to lead through influence, including mentoring others, leading initiatives or workstreams, contributing to best practices, and driving cross-functional outcomes.
Strong analytical, decision-making, and problem-solving skills.
Excellent leadership, communication, and stakeholder management skills.
Ability to thrive in a fast-paced, high-growth environment while ensuring operational stability.
Ability to work collaboratively with colleagues and staff to create a high-quality, results-driven, team-oriented environment.
Demonstrated ability to use discretion, make sound decisions, and maintain confidentiality.
Willingness to work flexible hours, participate in on-call rotations when necessary, and travel up to 10%, which may include occasional overnight travel.
Proficiency in Microsoft Office Suite and modern communication tools for virtual teams (e.g., Microsoft Teams, Slack).
Physical Demands:
In accordance with the Americans with Disabilities Act (ADA) and applicable state and local laws, the Company will provide reasonable accommodations to qualified individuals with documented disabilities to enable them to perform the essential functions of the job, unless such accommodations would impose an undue hardship. Employees seeking accommodation should contact the People Department to initiate the interactive process.
Employees may experience the following physical demands for extended periods of time:
Sitting, standing, and walking (90-100%).
Keyboarding (95-100%).
Viewing computer monitor, tablet, and cell phone requiring close vision (70-90%).
May require occasional lifting and racking of equipment (up to 50 lbs.) in data center environments.
Work Environment:
100% Remote with Travel: Work environment is primarily indoors (home office, customer or vendor sites, or other business meeting venues); travel may involve exposure to varying weather and temperature conditions, as well as driving and traffic hazards. Travel is required, approximately 10%, and may include overnight and out-of-state trips.