Principal Staff Software Developer – AI/ML Performance Validation & Systems Testing

Markham, Ontario, Canada

Posted 3 weeks ago

Full-time · Hybrid

Job Description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems.

Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary.

When you join AMD, you’ll discover the real differentiator is our culture.

We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

Join us as we shape the future of AI and beyond. Together, we advance your career.

About the Role

We are seeking a Principal Software Quality Engineer to serve as the senior technical leader for ROCm software validation across compute workloads and server-class systems.

In this individual-contributor leadership role, you will define how AMD proves ROCm is ready to ship, from unit and component testing, through full-stack workload validation, to multi-node system-level qualification on AMD Instinct™ GPU platforms. You will set the technical direction for validation strategy, build and evolve the test infrastructure that gates every ROCm release, and personally drive the hardest debugging, characterization, and qualification problems.

Your work directly determines the quality bar experienced by hyperscalers, OEMs, sovereign-AI customers, and the open-source community running ROCm in production.

What You Will Do

Own the end-to-end validation architecture for ROCm — unit, integration, framework, workload, performance, stress, stability, scale-out, and system-level test layers — across multiple GPU generations and server platforms.

Define release-qualification gates and exit criteria for ROCm software releases (functional coverage, performance regressions, stability hours, scale targets, RAS criteria) and drive the org to meet them.

Lead system-level testing for server nodes — multi-GPU topologies, PCIe/Infinity Fabric/xGMI, BMC/IPMI, thermal/power, firmware interactions, and multi-node fabric (Ethernet/InfiniBand/UALink) bring-up and validation.

Drive compute workload validation and characterization — LLM training and inference (PyTorch, vLLM, Triton, JAX), recommender systems, scientific HPC kernels, MLPerf-class benchmarks — establishing reproducible methodology, baselines, and regression tracking.

Architect the test infrastructure — distributed test runners, GitHub Actions / Jenkins / internal CI fleets, hardware lab orchestration, result data lakes, flaky-test detection, bisection automation, and self-service developer pre-submit pipelines.

Champion modern, agile quality engineering — shift-left testing, test pyramids, contract testing between layers, hermetic test environments, deterministic reproducers, and continuous validation in trunk.

Set the bar for GitHub-based quality workflows — PR gating policy, required checks, code-coverage standards, bug-bash and triage cadences, and disciplined issue management across ROCm/* repositories and partner upstream projects.

Lead complex escalation debug — partner with development, hardware, firmware, and customer-facing teams to root-cause the hardest multi-day, multi-node, multi-component failures and convert findings into durable test coverage.

Influence the roadmap — work with product management, silicon, platform, and software architecture to ensure validation readiness for next-generation Instinct GPUs and server platforms before tape-in milestones and silicon arrival.

Mentor and elevate Senior and Staff validation engineers, SDETs, and SQA leads; raise the technical bar through design review, code review, and written guidance.

Represent ROCm validation externally — strategic customer engagements, OEM qualification programs, and open-source community quality initiatives.

Minimum Qualifications

12+ years of professional software engineering experience with a strong validation, SDET, or quality-engineering focus, including 5+ years in a senior IC role (Staff/Principal/PMTS or equivalent) leading validation of complex systems software.

BS/MS/PhD in Computer Science, Computer Engineering, or related discipline (or equivalent demonstrated experience).

Expert-level Python for test automation and infrastructure; strong C++ for debugging and extending production code paths under test.

Deep, demonstrable validation experience in at least two of the following domains:

GPU compute software stacks (ROCm, CUDA, oneAPI, SYCL)

Deep-learning frameworks and inference engines (PyTorch, TensorFlow, JAX, Triton, vLLM)

HPC / parallel runtimes and communication libraries (MPI, RCCL/NCCL, UCX, Libfabric)
