
Disaster Recovery and Major Incident Response Manager
Job Description
How you move is why we’re here. ®
Now more than ever.
Get back to what you need and love to do.
The possibilities are endless...
Now more than ever, our guiding principles are helping us in our search for exceptional talent - candidates who align with our unique workplace culture and who want to maximize the abundant opportunities for growth and success.
If this describes you, then let’s talk!
HSS is consistently among the top-ranked hospitals for orthopedics and rheumatology by U.S. News & World Report. As a recipient of the Magnet Award for Nursing Excellence, HSS was the first hospital in New York City to receive the distinguished designation. Whether you are early in your career or an expert in your field, you will find HSS an innovative, supportive and inclusive environment.
Working with colleagues who love what they do and are deeply committed to our Mission, you too can be part of our transformation across the enterprise.
Emp Status
Regular Full time
Work Shift
Compensation Range
The base pay scale for this position is $128,500.00 - $196,375.00. In addition, this position will be eligible for additional benefits consistent with the role. The salary of the finalist selected for this role will be determined based on various factors, including but not limited to: scope of role, level of experience, education, accomplishments, internal equity, budget, and subject to Fair Market Value evaluation. The hiring range listed is a good faith determination of potential compensation at the time of this job advertisement and may be modified in the future.
What you will be doing
Key Responsibilities
Active and Backup Responder on Duty (AROD/BROD)
- Serve as Active or Backup Responder on Duty (AROD/BROD) on a scheduled rotation for major incidents, declared disasters, and extended outages.
- Function as the single point of command and escalation during DR and major incidents in coordination with the Executive on Duty (EOD).
- Assess incident severity and determine when to escalate to disaster recovery activation in collaboration with IT EOD and business leadership.
- Coordinate cross‑functional response efforts involving infrastructure, application teams, cybersecurity, vendors, and business operations.
- Lead real‑time incident coordination calls and ensure clear task assignment, escalation, and decision tracking.
- Oversee internal and external communications related to service outages, recovery progress, and restoration status.
- Ensure continuous 24/7 readiness by validating that playbooks, contact lists, tooling access, vendor support, and escalation paths are current, accessible, and executable at all times.
- Ensure effective shift handoffs, documentation continuity, and leadership coverage during prolonged or multi‑day incidents.
Disaster Recovery Execution & Oversight
- Own the activation and execution of disaster recovery plans and runbooks during declared events.
- Coordinate technical recovery activities across infrastructure, platform, and application teams.
- Ensure application recovery is validated by appropriate application and business owners prior to declaring service restoration.
- Maintain operational oversight for prolonged recovery efforts, including shift coverage, resource planning, and vendor engagement.
- Ensure recovery actions are executed in accordance with approved DR standards, policies, and tiering requirements.
DR Program Governance & Readiness
- Partner with the DR/BC Governance function to maintain enterprise DR readiness across all application tiers.
- Own the creation, maintenance, and continuous improvement of disaster recovery and major incident playbooks to ensure they are:
  - Present for all in‑scope applications
  - Technically accurate and executable
  - Reviewed and validated on a defined cadence
- Partner with IT Operations, Infrastructure, Applications, and Cybersecurity teams to validate technical accuracy and operational effectiveness of disaster recovery and major incident playbooks.
- Support application tiering decisions and ensure recovery strategies align to business impact and risk tolerance.
- Lead the planning, execution, and facilitation of disaster recovery testing, tabletop exercises, and simulations.
- Ensure exercise outcomes, identified gaps, and remediation actions are documented, tracked, and resolved within defined timeframes.
- Ensure DR processes align with internal policies, regulatory requirements, and audit expectations.
Post‑Incident Review & Continuous Improvement
- Lead the creation of Root Cause Analysis (RCA) documents and/or postmortem reviews following major incidents and disaster recovery events.
- Ensure lessons learned, control gaps, and process improvements are documented and assigned to accountable owners.
- Track remediation actions through completion and provide status updates to leadership and governance committees.
- Identify recurring incident patterns or recovery risks and recommend corrective actions.
- Develop and present actionable, data‑driven recommendations to IT and business leadership to improve disaster recovery readiness, response effectiveness, and operational resilience, including recovery strategy enhancements, tooling gaps, staffing models, and escalation processes.
Stakeholder & Executive Engagement
- Provide regular status updates and briefings to IT leadership, business partners, and governance committees.
- Escalate recovery risks, resource constraints, or unresolved issues to executive leadership as appropriate.
- Partner with business continuity leaders to ensure alignment between IT recovery and operational continuity procedures.
Required Qualifications
- Bachelor’s degree in Computer Science, Information Technology, Business Administration, or a related field
- 8+ years of experience in IT operations, major incident management, disaster recovery, or service management
- Strong demonstrated experience acting in an incident command or major incident leadership role.
- Strong understanding of:
  - Disaster recovery and business continuity concepts
  - Application tiering and dependency management
  - Infrastructure and application recovery strategies
- Proven ability to lead cross‑functional teams during high‑pressure incidents.
- Effective communication skills and executive presence.
- This role holds operational command responsibility during declared disasters and major IT incidents.
- After‑hours, weekend, and holiday availability is an essential function of the role.
- Success requires the ability to drive execution, accountability, and remediation through influence across multiple teams.
Preferred Qualifications
- Experience in healthcare, regulated, or audit‑driven environments.
- Familiarity with ITIL, ISO 22301, NIST, or similar frameworks
- Experience leading and supporting large, complex application portfolios or programs.
- Experience working with third‑party vendors during recovery events.
Success Looks Like
- Disaster recovery events are coordinated, decisive, and well‑documented.
- Recovery timelines improve due to clear command and role clarity.
- Audit findings related to DR execution, testing, or governance are reduced or eliminated.
- IT infrastructure, application, and business owners clearly understand their roles during incidents.
- DR transitions from a reactive effort to a repeatable, trusted operational capability.
Non-Discrimination Policy
Hospital for Special Surgery is committed to providing high quality care and skilled, compassionate, reliable service to our community in a safe and healing environment. Consistent with this commitment, Hospital for Special Surgery provides care, admits, and treats patients and provides all services without regard to age, race, color, creed, ethnicity, religion, national origin, culture, language, physical or mental disability, socioeconomic status, veteran or military status, marital status, sex, sexual orientation, gender identity or expression, or any other basis prohibited by federal, state, or local law or by accreditation standards.