Machine Learning Engineer — Inference Optimization

Remote (world) · Posted 4 months ago

Full-time · Remote

About the Role

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.

This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

What You’ll Do

Optimize inference latency, throughput, and cost for large-scale ML models in production
Profile GPU/CPU inference pipelines and identify bottlenecks (memory, kernels, batching, IO)
Implement and tune techniques such as:
- Quantization and reduced-precision inference (fp16, bf16, int8, fp8)
- KV-cache optimization & reuse
- Speculative decoding, batching, and streaming
- Model pruning or architectural simplifications for inference
Collaborate with research engineers to productionize new model architectures
Build and maintain inference-serving systems (e.g. Triton Inference Server, custom runtimes, or bespoke stacks)
Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
Improve system reliability, observability, and cost efficiency under real workloads

What We’re Looking For

Strong experience in ML inference optimization or high-performance ML systems
Solid understanding of deep learning internals (attention, memory layout, compute graphs)
Hands-on experience with PyTorch (or similar) and model deployment
Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
Experience scaling inference for real users (not just research benchmarks)
Comfort with ownership and ambiguity in fast-moving startup environments

Nice to Have

Experience with LLM or long-context model inference
Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton Inference Server)
Experience optimizing across different hardware vendors
Open-source contributions in ML systems or inference tooling
Background in distributed systems or low-latency services

Why Join Us

Real ownership over performance-critical systems
Direct impact on product reliability and unit economics
Close collaboration with research, infra, and product
Competitive compensation and meaningful equity at the Series A stage
A team that cares about engineering quality, not hype

About Featherless AI
