Job Description
The Team
We’re the Site Reliability Engineering team within Bloomberg’s Application Middleware group. Our mission: ensure that Bloomberg’s core connectivity and messaging layers are resilient, scalable, and fully observable.
- Gateways: Secure, high-performance TCP/SSL entry points to our data centers
- HFN & NSTP: A global HTTP CDN and SOCKS5 proxy network delivering fast access from any geography
- Playlist Services: Dynamic path configuration systems optimizing user connectivity in real time
- PGM Relays: Infrastructure for reliable multicast data delivery
- Build production-grade software that powers Bloomberg’s global infrastructure
- Design and implement scalable, fault-tolerant systems with a focus on observability, performance, and automation
- Collaborate across engineering teams to introduce automated, self-service operational workflows
- Conduct deep systems analysis and root cause investigations for complex, distributed systems
- Propose and prototype innovative approaches to reliability and risk mitigation
- Contribute to design docs, runbooks, and post-incident reviews—clear communication is part of the job
- A degree in Computer Science, Engineering, Mathematics, or equivalent practical experience
- Strong software engineering skills in any high-level language (we mainly use Python and C++)
- A deep understanding of software system reliability and risk management, including how to identify potential points of failure and design mitigation strategies
- A good understanding of data structures, algorithms, and system design
- Experience navigating and improving large, distributed codebases
- An ability to identify system risks and engineer around points of failure
- Clear written and verbal communication, including technical documentation and incident analysis
- Systems Knowledge: A strong grasp of operating systems, fundamental networking protocols (TCP, UDP, multicast), or core database concepts as they apply to modern infrastructure.
- Cluster Management: Experience with deployments, staging, and configuration management. Direct experience with Argo, Kubernetes, or other pipeline management platforms is a significant advantage.
- Machine Management at Scale: Experience with capacity planning and automating the lifecycle of large machine fleets.
- System Observability and Monitoring: Deep understanding of SLIs/SLOs/SLAs, alerting, and building dashboards for complex systems.
- Reliability in Distributed Systems: Knowledge of fault tolerance and the unique challenges of network and node failure in distributed environments.
- Mentoring: Proven experience mentoring and growing junior engineers.