
Staff Engineer, Engineering Productivity & AI Quality
Job Description
Harper is an AI-native commercial insurance company, based in San Francisco and built from scratch. Most knowledge work is judgment locked inside people's heads — the exceptions, the precedents, the decision traces no one ever wrote down. Converting that judgment into software is one of the largest human-to-computational transitions still in front of us, and we think the most honest place to prove it is the hardest one: commercial insurance, a trillion-dollar industry that is still, even now, more than 90% done by hand. We're not patching legacy workflows or adding a copilot to them. We're rebuilding the business so that AI does the work and people do the judgment that AI can't yet — and then teaching it that, too.
It's working: ~1,000 new customers a month and roughly 100x growth in the past year. That pace sets the culture. We're on-site in San Francisco, in the building together, working long days to high standards — because a rebuild this large doesn't happen part-time or by committee. Almost no one joins Harper because they're passionate about insurance. They join because they want to be on the frontier of the AI transition, doing the most consequential work of their career, in a company being built to define a category rather than join one. If that's the work you're looking for, insurance is just where you get to do it.
The role
Every great AI company ends up building the same invisible machine: the harnesses, tests, instructions, and review loops that let a small team ship with impossible leverage. At Harper that machine is existential. Our agents write code, serve customers, assemble submissions, and make decisions that move revenue — and AI-generated code volume has pulled the scaling problem forward. Even with a 20-person engineering team, our coding agents create the surface area, review burden, and architectural drift of a 100-person org. If the rails are strong, twenty engineers operate like a hundred; if they're weak, velocity turns into drag and the CTO becomes the rail — which doesn't scale. This is the founding seat for that machine. You'll turn the CTO's taste into systems — PR preflight, integration tests, architecture rules, agent instructions, eval gates, the feedback loops every engineer feels daily — across three sub-disciplines: Harness Engineering (the meta-harness over our frontier coding agents, OpenClaw, Hermes, and internal agents), Developer Experience (CI/CD gates, build caching, merge queues, dev/staging/CI parity, the internal platform, eval infrastructure), and AI Quality (eval suite design, golden datasets, LLM-as-judge graders, production trajectory monitoring, drift detection, anti-slop guardrails). The mission is simple: make the right way the easy way, and make Harper's engineering org compound with every ship.
What you'll own
CI/CD quality gates across Harper's most critical services — the minimum bar before code can merge.
Integration test harnesses anchored to real failure modes — every repeated operational failure becomes a regression test, a validation, or an architecture rule.
The agent harness substrate — sandbox lifecycle, tool routing, prompt/context layer, model-provider abstraction, multi-agent coordination.
Repo-level agent instructions and context hygiene — AGENTS.md per repo, canonical data-model docs, banned patterns. The information environment our coding agents read.
Automated PR preflight — service-impact summary, tests run, missing tests, model/migration changes, critical-path warnings. The robot that reviews every PR before a human does.
Architecture-rule enforcement — custom lints and structural tests that encode the CTO's taste mechanically. Once a rule is written down, it never gets argued in PR comments again.
Eval framework infrastructure — pre-merge eval gating, experiments against curated datasets, production trajectory monitoring, all wired together.
Engineering metrics that matter — rework rate, escaped defects, flaky-test count, deploy rollbacks, time-to-confident-ship, AI-generated PR quality. Anti-vanity, anti-LOC.
What we're looking for
8+ years building software (typically 8–12), including 3+ years at Senior+ scope at a high-growth company.
A track record of building developer-productivity, platform, CI/CD, build, test-infra, or internal tooling that other engineers actually adopted.
You write and review production code at a Staff level — this is not a process or PM role.
Production AI/ML systems experience (agent harness, eval frameworks, LLM-as-judge graders, prompt/context engineering), even if it's not your primary stack.
Strong opinions on maintainability, architecture, testability, and DX — backed by mechanical enforcement, not lectures. Excited by AI coding agents, skeptical enough to build the guardrails they need.
You can describe a specific lint rule, integration test, or eval-harness pattern you built that kept a class of bugs out of production for good.
You write code with AI daily, routinely running 3+ parallel agent sessions, and you'd rather create leverage for other engineers than own one product surface.
Strong written communication (RFCs, architecture-rule docs, lint-rule rationale, playbooks).
Bonus: eval-framework infrastructure (OSS or internal); developer platforms at an AI-native company; custom lint/structural-test authoring at scale; agent harnesses (sandboxing, isolation, execution environments); encoding a CTO's architectural taste into mechanical rules.
If "Engineering Productivity" sounds like dashboards and roadmaps, this isn't it. We measure ourselves on rework prevented and confident-ship time, not artifacts produced.
The reality
On-site in San Francisco, in person, long days, high standards. This is a founding seat with founder and CTO access and a mandate to encode taste into systems the whole org runs on — which is high-leverage and high-scrutiny in equal measure. A rebuild this large doesn't happen part-time or by committee. The right person reads the intensity as the reason to take the seat.
Logistics
Compensation (OTE): $253,000–$308,000 cash (base + target performance bonus), plus competitive equity.
Location: San Francisco, in-office. Based here or willing to relocate.
Benefits: Uber commuter benefits; breakfast, lunch, and dinner provided; snacks and coffee stocked; free gym membership; health, dental, and vision.
Process: Founder call (15 min) → CTO deep-dive (60 min, architecture-rule taste and eval-harness depth) → Super Day on-site → founder + CTO offer. No committee. Best offer, first.
To apply: If you want to be the engineer whose lint rules, test harnesses, and PR preflight checks let a 20-person team operate like a 100-person org, send your resume and a link to a developer platform, eval harness, or lint-rule system you built, and tell us about an architectural drift you stopped before it reached production.