Job Description
About Us
Architect is a frontier AI lab for chip design. We build AI models and tools for on-demand custom ASICs at scale. Our goal is to co-design custom ASICs alongside evolving ML workloads and enable a new era of domain-specific chips that unlock capabilities impossible with current hardware paradigms. Born out of Stanford Research, Architect blends AI with silicon. Our founding team comes from Anthropic, Google DeepMind, Meta SuperIntelligence, xAI, Apple, and Intel.
We're looking for staff/principal-level compiler engineers with deep experience building code generation toolchains for custom AI accelerators. Ideal candidates have shipped production compilers at places like Apple, Google (XLA/TPU), Groq, Cerebras, Qualcomm, AMD, or similar.
What You'll Do
As a Member of the Technical Staff on the Compilers team at Architect, you'll own the compiler stack targeting our SIMD/VLIW NPU — from graph ingestion through code generation on production silicon. You'll work directly with the NPU architect to co-design the ISA, closing the loop between compiler needs and hardware decisions.
Own the compiler end-to-end: graph ingestion (ONNX, PyTorch) through IR optimization, AI-driven code generation, instruction scheduling, and register allocation for a SIMD/VLIW NPU.
Implement and own the memory management layer, for instance software-managed on-chip scratchpad memory, where the compiler handles data tiling, bank allocation, DMA scheduling, and double-buffering across SRAM banks.
Design and iterate on mid-end and backend optimization passes: operator fusion, loop transformations, vectorization, and software pipelining to close the gap between peak and achieved throughput.
Co-design the ISA and instruction encoding with the architect and silicon team. Feed real workload performance data back into architectural decisions.
Support quantization and mixed-precision lowering (32-bit single-precision FP or INT, along with lower-precision INT8/INT4, BF16, and FP16/FP8/FP4) with correct numerics end-to-end.
Benchmark compiler output against cycle-accurate models, RTL simulation, and FPGA prototypes. Own QoR tracking.
Grow into a compiler team lead as the team scales.
What We'd Like to See
Qualifications & Skills:
Degree: Bachelor's, Master's, or PhD in Computer Science, Computer Engineering, or a closely related field.
Experience: 5+ years building compilers or code generation toolchains for custom accelerators. This experience must specifically target ML/AI hardware; general-purpose compiler work (GCC/LLVM for CPUs) alone is not sufficient.
Domain Background: Hands-on experience with at least one of: Apple Neural Engine compiler, Google XLA / Edge TPU / TPU codegen, Groq TSP compiler (spatial scheduling, IR dialect design), Cerebras compiler stack, Qualcomm Hexagon NN / AI Engine, AMD AIE / Vitis AI, or an equivalent custom accelerator compiler.
Backend Mechanics: Strong grasp of instruction scheduling, register allocation, and software pipelining — especially for SIMD/VLIW or spatial architectures.
ML Optimizations: Experience with tiling strategies, loop nest optimization, and operator fusion for ML workloads (such as convolution, attention, element-wise ops, reductions, and transpositions).
SW-Managed Memory: Experience with scratchpad-style memory allocation, data layout, DMA orchestration, and multi-buffering.
Coding: Strong C++. Python proficiency. Familiarity with MLIR or LLVM infrastructure.
Leadership: Ability to lead and grow the compiler team over time.
Bonus:
HW/SW co-design experience: defining ISA features, instruction encodings, or hardware interfaces driven by compiler needs.
IR design for ML accelerators (custom dialects, MLIR-based flows, or graph-level IRs like XLA HLO).
ML framework experience (PyTorch, TensorFlow) and portable graph formats (ONNX).
Experience benchmarking and profiling compiler output on real hardware, FPGA, or cycle-accurate simulators.
Understanding of ML inference systems and workload-level optimizations: FlashAttention, RadixAttention, PagedAttention, continuous batching, speculative decoding, KV cache management, and prefill/decode scheduling.
Contributions to open-source ML compiler projects (TVM, MLIR, Triton, XLA).
Domain-specific expertise: Track record of bringing up energy-efficient, high-performance HW accelerators.
What We Offer
Competitive salary and a meaningful equity stake
Fast-paced startup with autonomy and visible impact
Cutting-edge challenges at the intersection of AI and silicon design
Direct ownership of the compiler stack as we scale