Job Description
We specialize in proactive anomaly detection, providing advanced performance insights and best practice guidance. Our team collaborates with application developers to define meaningful SLOs, implement chaos engineering, and build diagnostic tools that mitigate architectural risks as our platforms scale.
In your day-to-day, you’ll develop frameworks for tracking reliability metrics, collaborate on system health reports, and build libraries that standardize alerting and incident response. You will also use failure injection and chaos testing to validate system performance under real-world stress. Our teams primarily build software using Python.
We’ll trust you to:
- Define and promote standards for observability, alerting, and incident response.
- Develop self-maintaining tools using statistical analysis, health metrics, and distributed tracing.
- Embed resiliency best practices into the full software development lifecycle.
- Lead initiatives to mitigate risks related to performance, capacity, and scale.
- Translate technical findings into actionable insights for engineers and stakeholders.
- Automate operational tasks to enhance the safety and scalability of our infrastructure.
You’ll need to have:
- Professional experience with Python or C++.
- Strong collaboration and communication skills.
- An understanding of distributed systems and system reliability.
- Familiarity with SLOs, SLIs, and SLAs.
- A degree in Computer Science, Engineering, or equivalent practical experience.
We’d love to see:
- Experience in an SRE, Reliability, or Production Engineering role.
- Deep knowledge of system health assessment and building effective alerting.
- Hands-on experience with monitoring tools (e.g., Grafana, Humio) and chaos engineering.
- Familiarity with leveraging Generative AI (e.g., GitHub Copilot, Gemini) to accelerate development.
- Experience with big data technologies like Apache Spark or Amazon S3.