Job Description
- Accelerate AI product development through reusable tooling and paved roads
- Provide end-to-end observability across AI systems (models, agents, pipelines, applications)
- Enable self-improving systems through telemetry-driven feedback loops
- Optimize cost, performance, and reliability of AI workloads
- Support both production AI systems and internal engineering agents
- Build and evolve AI platform tooling (e.g., developer workflows, benchmarking systems)
- Design developer-friendly APIs, SDKs, and interfaces
- Contribute to systems across the Model Development Lifecycle (experimentation, deployment, evaluation)
- Build and operate observability platforms and telemetry pipelines (logs, metrics, traces, events)
- Provide visibility into latency, token usage, cost, quality, drift, and reliability
- Define instrumentation standards, schemas, and conventions
- Implement distributed tracing using modern approaches (e.g., OpenTelemetry)
- Enable end-to-end debugging of AI and agent workflows (model calls, tool usage, retrieval, orchestration)
- Build benchmarking, regression detection, and performance analysis capabilities
- Support observability for both production systems and internal engineering agents
- Develop systems that turn telemetry into action (automated experimentation, regression detection, alerting)
- Build feedback loops that continuously improve model quality and system behavior
- Enable self-healing and self-optimizing workflows
- Build tooling for cost visibility, forecasting, and optimization
- Define SLOs, alerting, and performance tuning practices
- Improve reliability and scalability of AI infrastructure
- Own projects end-to-end (RFCs, architecture, implementation, rollout, production support)
- Partner with AI teams to drive adoption of platform tooling and standards
- Produce high-quality documentation and improve developer experience
What we’re looking for:
- Demonstrated experience building production software or platform systems
- Strong engineering fundamentals in distributed systems or backend platforms
- Experience or strong interest in observability and debugging complex systems
- Experience or strong interest in AI/ML systems, LLMs, or agent-based architectures
- Strong ownership mindset and ability to drive ambiguous problems to production
- Hands-on experience with modern agentic coding tools (e.g., Claude Code, Codex CLI, Cursor) and multi-model workflows
- Working knowledge of agent architecture internals (context engineering, tool loops, sub-agent orchestration)
We’d love to see:
- Experience with OpenTelemetry and modern observability ecosystems, including instrumentation, collectors, and exporters, and with tools such as Prometheus and Grafana as well as tracing and logging systems
- Experience designing and operating telemetry pipelines, including sampling, retention, cardinality, and cost tradeoffs, as well as integrating observability into CI/CD and developer workflows
- Familiarity with AI/agent frameworks, including instrumentation of LLM calls, tool usage, workflows, and evaluation signals (quality metrics, benchmarking, regression detection)
- Experience building cost monitoring, forecasting, and optimization systems for AI workloads
- Familiarity with cloud and infrastructure tooling (e.g., AWS, Azure, Kubernetes, Terraform)
- Experience with agentic infrastructure concepts such as MCP servers, hooks, skills, subagents, sandboxing, and persistent memory patterns
- Active engagement with the agentic engineering frontier, including emerging patterns (e.g., harness vs. model, review debt, feedback loops)
- Demonstrated agent-native development practices (iterating with agents using testing, verification, and feedback loops)
- Strong security awareness for autonomous systems, including sandboxing, prompt injection risks, credential exposure, and guardrails