Frederick National Laboratory for Cancer Research

AI Research Computing Infrastructure Engineer

Frederick, MD, US
Posted 3 months ago
onsite

Job Description

PROGRAM DESCRIPTION

The mission of Enterprise Information Technology (EIT) is to develop an enterprise-level, consolidated information technology infrastructure that provides exceptional IT capabilities to the Frederick National Labs for Cancer Research (NCI-Frederick/FNLCR) in support of basic, translational, and clinical cancer and AIDS research.

The IT Operations Group (ITOG) is a part of Enterprise Information Technology (EIT) within Leidos Biomedical Research, Inc. ITOG is responsible for computational servers, storage servers, virtual machine infrastructure, and the FNLCR network. ITOG focuses on implementing enterprise IT best practices in the areas of computational services, storage, backup, and archiving; batch and application support; server consolidation and virtualization; network infrastructure; unification of voice, teleconferencing, and video communication technologies; and improved infrastructure for collocation of dedicated servers.

KEY ROLES/RESPONSIBILITIES:

The Research Computing Infrastructure Engineer will design, build, and operate next-generation high-performance computing (HPC) environments that support container-based workflows and GPU-accelerated research computing. The position will play a key role in evaluating, implementing, and maintaining scalable and secure computing architectures for advanced data analysis, AI/ML model training, and simulation workloads. The engineer will collaborate closely with researchers, IT professionals, and external partners to translate scientific requirements into reliable, high-performance computing solutions.

Design and implement next-generation high-performance computing (HPC) environments that leverage container-driven workflows for GPU-accelerated research.
Build and maintain container orchestration systems for batch and distributed workloads.
Integrate containerized job workflows with existing HPC schedulers and storage systems.
Develop and maintain job templates for batch GPU training and multi-node distributed computing.
Automate deployment, configuration, and scaling through infrastructure-as-code and CI/CD practices.
Monitor, benchmark, and optimize system performance, reliability, and resource utilization.
Collaborate with researchers to containerize and optimize legacy workflows for scalable execution.
Lead evaluation of emerging tools (e.g., Prefect, Ray, Airflow, Dagster) for workflow orchestration and distributed computing.
Contribute to the development of tools and bridges between orchestration frameworks and traditional HPC environments.

BASIC QUALIFICATIONS

To be considered for this position, you must minimally meet the knowledge, skills, and abilities listed below:

Possession of Bachelor’s degree from an accredited college/university according to the Council for Higher Education Accreditation (CHEA) or four (4) years relevant experience in lieu of degree. Foreign degrees must be evaluated for U.S. equivalency.
In addition to the education requirement, a minimum of eight (8) years of related experience.
Strong Linux systems engineering and administration experience.
Hands-on experience with container orchestration tools such as Kubernetes, Nomad, Run:AI, etc.
Hands-on experience with scripting/programming skills (Python, Bash, or Go) for automation, monitoring, and job orchestration.
Experience with infrastructure-as-code / automation tooling (Terraform, Ansible, Packer, or equivalent).
Familiarity with system performance analysis, monitoring, and tuning.
Comfortable with small-team environments and taking end-to-end ownership of compute infrastructure.
Ability to obtain and maintain a security clearance.

PREFERRED QUALIFICATIONS

Candidates with these desired skills will be given preferential consideration:

Experience with multi-node distributed ML frameworks (PyTorch DDP, Ray, Horovod, TensorFlow, etc.).
Familiarity with pipeline orchestration tools (Prefect, Airflow, Dagster, Kubeflow).
Understanding of resource management and scheduling concepts (queues, allocations, GPU device plugins, gang scheduling, multi-node coordination).
Understanding of storage integration with high-performance clusters (POSIX + object storage, VAST or similar).
Familiarity with cloud GPU environments (AWS, GCP, Azure) and hybrid workflows.
Familiarity with workflow orchestration/pipeline tools (Argo, Kubeflow, Ray, MLFlow).
Good communication and documentation skills, the ability to make complex infrastructure understandable to researchers and other engineers.

EXPECTED COMPETENCIES:

Expertise in Kubernetes, Nomad, or equivalent container orchestration systems for large-scale computing.
Deep knowledge of Linux systems administration, performance tuning, and automation.
Ability to translate research computing needs into scalable, reliable infrastructure designs.
Commitment to documentation, reproducibility, and open science principles.
Collaborative mindset and willingness to mentor peers in containerization and HPC best practices.
