Job Description
- Build tools and automation to improve how our distributed systems are operated and debugged
- Define and implement service level objectives (SLOs) that reflect real user impact
- Identify and continuously assess reliability risks across services, infrastructure, and workflows, helping teams prioritize work based on real impact
- Improve development and deployment workflows, driving more consistent and reliable paths to production
- Reduce time to recovery and triage effort by improving diagnostics, alerting, and system-level visibility
- Design and validate failure scenarios and resilience testing practices, ensuring systems behave predictably under stress
- Collaborate closely with software engineers and product teams to influence how systems are designed, built, and operated
- Work on systems operating at very high scale, with billions of messages processed daily
- Tackle complex distributed systems challenges involving latency, consistency, and failure handling
- Build tooling and frameworks used across multiple teams
- Have a direct impact on systems relied upon by the global financial industry
- 4+ years of experience in software engineering
- Proficiency in Python
- Experience working with distributed systems
- Strong understanding of system reliability, observability, and performance
- Familiarity with SLOs, SLIs, and SLAs, and how to relate system performance back to client impact
- Strong collaboration and communication skills
- A degree in Computer Science, Engineering, or equivalent practical experience
- Experience with monitoring tools such as Grafana or Humio, or with distributed tracing
- Familiarity with Kafka, Java, or large-scale data systems
- Experience with chaos engineering, failure injection, or resilience testing frameworks
- Exposure to capacity planning and scaling analysis
- Contributions to open source or involvement in SRE communities
- Experience with big data technologies such as Apache Spark or Amazon S3
We offer one of the most comprehensive and generous benefits plans available, along with a range of total rewards that may include merit increases, incentive compensation (exempt roles only), paid holidays, paid time off, medical, dental, vision, short- and long-term disability benefits, 401(k) + match, life insurance, and various wellness programs, among others. The Company does not provide benefits directly to contingent workers/contractors and interns.