ECS

Release/Incident Operations Engineer

5611 Columbia Pike-80
Posted 2 weeks ago
onsite

Job Description

Everforth ECS is seeking a Release/Incident Operations Engineer to work in the National Capital Region, covering the Pentagon, Falls Church, and Fairfax. Please note: this position is contingent upon contract award.


The War Data Platform (WDP) is a key initiative within the U.S. Department of War's (DoW) AI-First strategy introduced in early 2026. The WDP focuses on operational warfighting data and aims to accelerate the deployment of artificial intelligence (AI) on the battlefield. The WDP extends to Unclassified, Secret, and Top Secret environments, and supports collaboration between Combatant Commands, Joint Staff directorates, Senior Executive Service leaders, and operational analysts.


The Release/Incident Operations Engineer coordinates release operations and incident triage support for AI and machine learning model-serving pipelines across WDP Core Integration's full multi-enclave environment, ensuring deployment consistency and operational continuity in direct support of DoW missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership. This role is central to sustaining mission-ready AI model-serving performance across all classification levels through disciplined release governance, root-cause analysis, and proactive operational risk management.


• Coordinates release operations for artificial intelligence and machine learning model serving across War Data Platform (WDP) Core Integration environments supporting Department of War missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership.
• Directs change-window execution, rollback readiness activities, and deployment governance for model-runtime updates, serving endpoints, and pipeline modifications.
• Conducts incident triage support by analyzing telemetry, reviewing service health indicators, and initiating stabilization actions across Kubernetes clusters, VMware environments, GitLab Continuous Integration pipelines, Prometheus metrics, Grafana dashboards, and Elastic Stack observability tooling.
• Executes root-cause analysis activities for serving incidents by collecting operational evidence, reconstructing failure sequences, validating remediation steps, and documenting corrective actions aligned with mission assurance requirements.
• Maintains operational readiness for model serving by coordinating with Platform One, Cloud One, multi-national engineering teams, and cross-service mission partners to align release activities with enclave-specific constraints, cross-domain deployment architectures, and security requirements.
• Produces mission-critical deliverables including release plans, rollback packages, incident triage reports, root-cause analysis documentation, operational risk assessments, and service restoration summaries.
• Strengthens program value by advancing deployment consistency, reducing mission risk, and reinforcing operational continuity across all enclaves.
• Supports Tier-4 incident response actions to maintain service-level agreements and sustain mission performance for enterprise artificial intelligence model-serving capabilities.
• Performs other duties as assigned.
