Head of Platform Reliability and Observability
Boston, MA
Posted 1 week ago
onsite
Job Description
Job Id: 448
# of Openings: 1
Pay Range: $200,000 - $220,000 per year
Geode Capital Management, LLC is seeking a Head of Platform Reliability & Observability to lead the function responsible for the stability, resilience, performance, and operational transparency of our mission-critical platforms. This role owns the end-to-end reliability posture of production systems, spanning production support, incident management, infrastructure coordination, and observability strategy.
This is a senior leadership role with clear accountability for outcomes. You will lead and evolve the teams and practices that ensure issues are detected early, resolved quickly, and prevented from recurring. This role reports directly to the Chief Technology Officer and partners closely with engineering, infrastructure, and business stakeholders to continuously improve how our platforms operate at scale.
The ideal candidate brings a strong mix of technical depth, operational leadership, and people management, and is comfortable operating in a highly regulated, business-critical environment.
This is a hybrid role based in Boston, MA, with in-office days on Tuesdays, Wednesdays, and Thursdays and remote work available on Mondays and Fridays.
Responsibilities:
- Own the platform reliability and observability strategy across applications, data pipelines, and supporting infrastructure
- Lead and develop teams, both onshore and offshore, responsible for production support (L1/L2), incident response, infrastructure troubleshooting, and 24/7 monitoring
- Serve as the senior escalation point for high-severity production incidents, providing leadership, clarity, and calm during time-critical events
- Establish and enforce standards for incident management, root cause analysis, post-incident reviews, and corrective action tracking
- Partner with engineering to improve production readiness, release quality, and operational risk management
- Drive the evolution of observability practices, including metrics, logs, alerts, dashboards, and service health indicators
- Ensure monitoring and alerting are actionable, business-relevant, and continuously improving, reducing noise and manual effort
- Oversee Root Cause Analysis (RCA) and Post-Incident Reviews (PIRs), partnering with development teams to prevent recurring issues
- Analyze incident trends and operational data to identify systemic risks, recurring failure patterns, and automation opportunities
- Champion automation, resilience, and reliability improvements that reduce toil and improve platform stability over time
- Communicate reliability posture, risks, and improvements clearly to senior technology and business leadership
Skills You Bring:
- 15+ years of general experience and 10+ years of experience as a leader
- Proven experience leading production support, reliability engineering, or platform operations teams in a complex environment
- Strong background in change and incident management, escalation leadership, and operational risk ownership
- Broad technical understanding across applications, infrastructure, batch processing, data pipelines, and observability tooling
- Experience defining and implementing monitoring, alerting, and observability strategies at scale
- Ability to balance hands-on technical judgment with senior-level leadership and delegation
- Strong communication skills, especially during incidents and executive-level discussions
- Experience operating in regulated, high-availability, or financial services environments
- A mindset focused on continuous improvement, learning, and long-term reliability outcomes, not just short-term fixes
- Familiarity with working in both data center and cloud environments, as well as with SaaS vendors
- Familiarity with ITSM/ITIL practices and tools such as ServiceNow, Zendesk, Datadog, CloudWatch, and Grafana
Company Overview:
Founded in 2001, Geode is headquartered in Boston’s financial district, the center of one of the world’s most vibrant finance and technology hubs, and has approximately 200 employees.
Geode is an institutional asset manager providing core beta exposures across a range of equity and niche asset classes, with over $1.5 trillion in assets. With a robust infrastructure and experienced investment professionals, Geode offers the scale of a large asset management firm with the benefits of a smaller organization.
Our compensation philosophy is designed to attract, motivate, and retain top talent. We are committed to ensuring that compensation reflects the value our employees bring to Geode. Employees at all levels are eligible to receive a combination of base salary, variable compensation, and a comprehensive benefits package. Compensation decisions are informed by a range of factors including role, experience, education, and skillset.
Our benefits program is designed to support employees both professionally and personally, offering comprehensive health coverage, 401(k) matching, annual profit sharing, paid parental leave, and generous time off. We also provide tuition and certification reimbursement, student loan support, fitness reimbursement, a commuter subsidy, charitable donation matching, family care assistance including a backup care benefit, and adoption and surrogacy support. Hybrid work arrangements and a culture that encourages community engagement through volunteer opportunities and employee events further enhance the employee experience at Geode.
Geode is proud to be an equal opportunity employer and supports a diversified work environment. Learn more about Geode at www.geodecapital.com/careers.