Job Description
Job Description Summary
As a Staff Software Engineer (Observability), you will be responsible for defining and implementing the observability strategy across PCS Digital Solutions Cloud Applications.
Roles and Responsibilities
In this role, you will:
- Define and evolve the observability vision and roadmap for PCS DS applications.
- Design and implement or integrate standardized observability frameworks (metrics, logs, traces, events, profiling).
- Collaborate with platform, SRE, and product teams to instrument services using OpenTelemetry and other modern observability tooling.
- Build and maintain dashboards, alerts, and SLOs that reflect both technical and business health indicators.
- Evaluate, integrate, and optimize observability agents (e.g., Prometheus, Fluent Bit, OTel).
- Design self-remediation solutions leveraging observability tooling.
- Implement best practices for using the GenAI tools of observability platforms.
- Lead or contribute to incident analysis and postmortem reviews, driving improvements in system resilience and observability coverage.
- Conduct Operational Readiness Reviews (ORRs) to validate monitoring, alerting, and rollback strategies before go-live.
- Ensure observability practices align with healthcare compliance standards (e.g., HIPAA, GDPR, HITRUST).
- Mentor engineers and promote a culture of observability-first development.
Required Qualifications
- Bachelor’s or master’s degree in computer science, engineering, or a related technical field.
- 10+ years of experience in software engineering, SRE, or platform engineering roles.
- 4+ years of experience contributing to observability solutions in cloud-native environments (Kubernetes, microservices, serverless).
- Deep expertise in observability pillars (metrics, logs, traces) and tools such as OpenTelemetry, Prometheus, Grafana, Datadog, and Dynatrace.
- Strong programming/scripting skills (e.g., Go, Python, Bash, Terraform).
- Experience with distributed tracing, SLO/SLI frameworks, and incident response workflows.
- Deep expertise in distributed systems, microservices, and cloud platforms (AWS, Azure, GCP).
- Experience with AI-powered anomaly detection, automated incident response, and cost optimization for observability at scale.
- Familiarity with SRE practices and chaos engineering.
- Excellent communication and collaboration skills.
Desired Characteristics
- Experience in healthcare or regulated industries.
- Knowledge of data privacy and compliance (HIPAA, HITRUST).
- Experience with cost optimization and telemetry data governance.
- Contributions to open-source observability projects.
Additional Information
Relocation Assistance Provided: No
