Member of Technical Staff - Distributed Systems
Job Description
About Us
Gimlet is building the next generation of AI infrastructure: large-scale AI datacenters and the orchestration platform that coordinates them.
The future of AI will require vastly more compute than exists today. But as AI workloads become more complex and new hardware architectures emerge, simply deploying more GPUs isn't enough. The challenge is making increasingly diverse compute work together.
Gimlet's platform intelligently partitions and routes workloads across heterogeneous hardware, enabling step-function improvements in performance and efficiency. Customers deploy through production-grade APIs without needing to think about hardware selection, placement, or optimization.
We work with foundation labs, hyperscalers, and AI-native companies to power production workloads at massive scale and help define the infrastructure layer for the future of AI.
About the role
Gimlet Labs is seeking a Member of Technical Staff focused on distributed systems. In this role, you will build the core platform that schedules, routes, and operates AI workloads reliably at production scale. You will work on systems that coordinate execution across thousands of nodes, expose stable production APIs, and ensure workloads run predictably under real-world load and failure conditions.
This role is well-suited for engineers who enjoy building foundational infrastructure, understanding systems end-to-end, and operating at scale.
What you will work on
Design and build distributed systems that orchestrate and operate AI workloads at large scale
Develop scheduling, routing, and resource management components that coordinate execution across many nodes and services
Build production-grade APIs and control planes for deploying and managing workloads
Implement mechanisms for reliability, availability, and fault tolerance in distributed environments
Instrument systems for observability and debugging at scale
Work closely with compilers, runtimes, and hardware to ensure end-to-end system correctness and performance
You may be a good fit if you have
Strong software engineering fundamentals
Experience building or operating distributed systems in production environments
Comfort reasoning about concurrency, failure modes, and tradeoffs in large-scale systems
Strong candidates may also have
Experience with Kubernetes or Kubernetes-adjacent systems beyond basic usage
Experience designing service-oriented architectures using RPC or asynchronous messaging
Familiarity with scheduling, queues, or resource management systems
Experience building reliable APIs and running systems under high load
Software development experience in languages commonly used for systems development (e.g., Go, C++, Python)
What Makes Gimlet Different
At Gimlet, you will work on infrastructure problems that span the full stack of modern AI systems. Our team operates across datacenters, networking, distributed systems, compilers, runtimes, orchestration, and performance engineering to build the foundation for the next generation of AI infrastructure.
As an early member of the team, you will have significant ownership, work alongside highly technical engineers, and help shape both the systems we build and how we scale the company.
We value people who are excited to work across domains, take ownership of meaningful problems, and build technology that enables the next generation of AI.