Senior Failure Analysis Engineer - Test Development
Job Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
THE ROLE:
The Quality Engineering team is looking for an experienced Senior Failure Analysis Engineer - Test Development to create advanced test methods that surface elusive failures in GPU accelerator platforms. This role is centered on designing custom execution flows that go beyond standard validation, using stress-based scenarios, VPOD environments, AI/ML workloads, and adaptive test logic to make hard-to-capture issues observable and actionable. The engineer will expand failure analysis capability across lab, factory, and customer-return cases by building test content that improves repeatability, shortens debug cycles, and increases confidence in root cause findings. They will also help shape intelligent test systems that use internal engineering knowledge and live model inference to guide execution decisions in real time. Working across FA, validation, firmware, diagnostics, and data teams, this person will help convert unclear symptoms into testable conditions that accelerate resolution.
THE PERSON:
The ideal candidate is inventive, methodical, and technically versatile, with a strong instinct for designing experiments that reveal behavior hidden under normal test conditions. They are comfortable navigating hardware, firmware, software, and system-level interactions, and know how to choose the right levers (environment, timing, workload composition, instrumentation, or automation) to provoke meaningful behavior. They are effective in VPOD-based test environments, capable of using model-driven compute activity as part of system stimulation, and confident building AI-enabled workflows that draw from team-specific knowledge during execution. Just as importantly, they can turn messy observations into disciplined experiments, communicate clearly across teams, and document approaches in a way others can reuse.
KEY RESPONSIBILITIES:
Architect targeted test methods for hard-to-capture platform behaviors across GPU, server, and rack-scale environments.
Invent new workload patterns, sequencing approaches, and stress combinations that reveal conditions not covered by conventional diagnostics.
Build and maintain VPOD-based environments that support scalable experimentation, long-duration execution, and controlled reproduction studies.
Use inference and training activity as system stimuli to probe platform limits, timing sensitivities, and failure-prone operating regions.
Develop automation, scripting, and orchestration tools to launch workloads, monitor execution, collect logs, and analyze results at scale across Windows and Linux environments.
Interpret telemetry, logs, and observed signatures to refine experiments, isolate trigger conditions, and improve confidence in reproduced behavior.
Create AI-enabled execution flows that use internal FA knowledge and live inference to guide test branching, detect emerging patterns, and support faster triage decisions.
Partner closely with FA, validation, diagnostics, firmware, and manufacturing teams to translate vague symptoms or sporadic field issues into targeted and repeatable test content.
Document workload intent, test methods, reproduction conditions, and findings clearly so they can be reused across teams and incorporated into future FA workflows.
Drive continuous improvement of test development methods, workload libraries, and failure reproduction strategies to expand FA coverage and reduce time to root cause.
PREFERRED EXPERIENCE:
Proven track record of developing custom test methodologies for intermittent, low-occurrence, or otherwise difficult-to-observe failure modes.
Strong foundation in GPU and server platform behavior, including system stress interactions, concurrency effects, and stability characterization.
Demonstrated ability to build, run, and optimize VPOD environments and related infrastructure for large-scale FA or validation test execution.
Hands-on familiarity with inference and training environments, including their use as controllable system stressors in platform investigation.
Proficiency in Python, shell scripting, and automation development for workload launch, orchestration, telemetry capture, and post-run analysis.
Ability to interpret system data and debug artifacts to uncover meaningful signals and guide the next experimental step.
Familiarity with diagnostics, firmware interactions, drivers, and hardware/software boundaries that influence failure behavior under stress workloads.
Experience building AI-enabled test systems that incorporate internal engineering knowledge and support real-time inference during execution.
Strong communication, documentation, collaboration, and presentation skills, with the ability to explain complex reproduction strategies and findings across technical teams.