Staff Site Reliability Engineer

Mountain View, US

Posted 5 days ago

Hybrid

About EarnIn

As one of the pioneers of earned wage access, EarnIn is passionate about building products that deliver real-time financial flexibility for people living paycheck to paycheck. Our community members access their earnings as they earn them, with options to spend, save, and grow their money without mandatory fees, interest rates, or credit checks.

We’re fortunate to have an incredibly experienced leadership team, combined with world-class funding partners like A16Z, Matrix Partners, DST, Ribbit Capital, and a very healthy core business with a tremendous runway. We’re growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of our growth journey.

WHY this role exists

EarnIn’s products must deliver speed, reliability, resilience, and trust to community members who depend on them. As EarnIn grows, we cannot rely on heroics, tribal knowledge, manual investigation, or isolated SRE expertise. We must embed reliability practices that scale across product engineering teams, enhance customer experience, and enable rapid shipping without increasing operational risk.

This role exists to lead EarnIn’s next stage of reliability maturity: an AI-first operating model that uses AI to actively detect, investigate, respond to, learn from, and prevent production issues. As a Staff Site Reliability Engineer, you will guide technical direction for reliability across critical services, relying on AI-assisted workflows as key tools to reduce toil, speed incident response, improve production readiness, and enhance the operational quality of the engineering organization.

The base salary range for this full-time position is $252,000-$308,000, plus equity and benefits. Our salary ranges are determined by role, level, and location. This is a hybrid position in Mountain View (Headquarters) and will require in-office work 2 days a week.

HOW you will create impact

Act as a Staff-level technical leader: define standards, architect solutions, mentor engineers, influence cross-team efforts, and construct reusable systems and practices that multiply your impact.
Embed AI-first thinking into reliability practices, using AI to streamline alert triage, accelerate incident investigation, automate runbooks, retrieve operational knowledge, improve postmortem quality, track corrective actions, quantify reliability with scorecards, detect capacity risks, and analyze architectural risks.
Keep human ownership and engineering judgment at the center of operations. AI helps engineers by speeding context gathering, clarifying reasoning, and reducing repetition, but it does not replace accountability.
Collaborate with SRE, product engineering, infrastructure, security, and leadership teams to embed reliability, making it easy to adopt and impossible to ignore.

WHAT you will own

Reliability strategy and standards

Define and evolve reliability standards across critical services, including SLIs, SLOs, error budgets, production readiness, observability, incident response, and resilience patterns.
Establish a reliability operating model that clarifies service ownership, operational expectations, and decision-making around reliability tradeoffs for product engineering teams.
Use AI-assisted analysis to interpret reliability trends, detect weak operational signals, highlight capacity risks using pattern recognition, and generate actionable reliability scorecards for teams, while making clear where AI automates data gathering and insight generation.

AI-first incident response and operational workflows

Overhaul key stages of the incident lifecycle to achieve faster detection, sharper triage, richer context retrieval, clearer communication, and stronger follow-through.
Command high-severity incidents as Incident Commander and reinforce the systems, tools, and practices that simplify incident management.
Design and implement workflows in which AI assists with alert correlation, signal enrichment, root-cause exploration, runbook retrieval, postmortem drafting, and corrective-action tracking.
Ensure AI-assisted incident workflows remain reviewable, auditable, and safe by requiring human verification at all critical steps and maintaining clear operational ownership with humans accountable for final decisions.

On-call quality and toil reduction

Elevate on-call quality by silencing noisy alerts, automating repetitive investigations, and enabling responders to rapidly digest service context.
Build tools that gather context from systems like Datadog, CloudWatch, incident.io, Slack, runbooks, deployment history, and service metadata.
Transition teams from reactive paging to proactive reliability enhancement.

Architecture and resilience

Steer service designs for graceful degradation, failure isolation, robust capacity planning, and operational safety throughout EarnIn’s AWS environment.
Apply production data, incident learnings, and AI analysis to spot architectural risks before they recur.
Instruct engineering teams to embed reliability expectations into design reviews, launch protocols, and service evolution.

Mentorship and cross-org influence

Coach engineers in reliability practices, incident response, SLOs, observability, production debugging, and AI-assisted operational workflows.
Direct design reviews, incident reviews, and operational maturity discussions to improve engineering judgment across teams.
Produce documentation, tooling, and reusable patterns that unlock reliability knowledge and enable action.

WHAT you'll do

Set a reliability strategy with AI at the center. Define SLIs, SLOs, and error budgets across critical services. Use AI to surface trends, predict capacity risks, and auto-generate reliability scorecards so teams act on data.
Redesign the incident lifecycle around AI-assisted speed. Lead high-severity incident response as IC. Build AI-driven alert correlation and triage that reduces noise and accelerates root-cause identification. Drive adoption of AI-generated postmortems that surface systemic patterns and automatically track corrective actions through to completion.
Make on-call fundamentally better through automation. Build AI agents that draft runbook responses, pull relevant context from Datadog, incident.io, and Slack during pages, and recommend remediation steps, so on-call engineers spend less time deciding and searching.
Push AI-first operations into product engineering teams. Partner with product engineering to embed AI-assisted investigation, alerting, and production readiness into their workflows. Make AI tooling the default path for every team that owns a service, not an SRE-only capability.
Architect for resilience at scale. Guide service designs for graceful degradation, failure isolation, and capacity planning across EarnIn's AWS footprint (EKS, Kafka, DynamoDB, RDS, SQS). Use AI-driven analysis to identify architectural weak points before they become incidents.
Raise the bar through mentorship and standards. Coach engineers on reliability practices, run design and incident reviews, and build documentation and tooling that makes reliability knowledge accessible. Set the expectation that AI-assisted workflows are how EarnIn operates, not an experiment.

WHAT we're looking for

7+ years in SRE, Software Engineering, or Infrastructure Engineering with increasing scope and cross-org influence.
Demonstrated experience improving reliability and operational excellence at scale using clear KPIs such as MTTR, MTTD, alert quality, incident recurrence, SLO attainment, on-call health, or corrective-action completion.
Shipped experience applying AI/LLMs to engineering or operational workflows, such as alert triage, runbook automation, incident investigation, postmortem drafting, remediation recommendation, operational knowledge retrieval, or agentic operations tooling.
Significant expertise with SLIs, SLOs, error budgets, incident command, blameless postmortems, and recurrence prevention in large-scale distributed systems.
Strong software engineering ability in Python, Go, or similar languages. You build tools and automation, not just dashboards.
Deep observability experience with systems such as Datadog, CloudWatch, OpenTelemetry, or similar platforms, with a bias toward high-signal alerting designed for real human response.
Strong infrastructure-as-code and cloud infrastructure experience, including Terraform, Kubernetes, AWS, and safe, reversible deployment practices.
Practical experience using AI-assisted development tools such as Cursor, Claude Code, Copilot, ChatGPT, or similar tools to accelerate your own engineering work and model effective adoption for partner teams.
Experience in fintech, regulated environments, SOC 2, PCI, FinOps, or cost/performance tradeoffs in high-scale systems is a plus.

#LI-Hybrid

At EarnIn, we believe that the best way to build a financial system that works for everyday people is by hiring a team that represents our diverse community. Our team is diverse not only in background and experience but also in perspective. We celebrate our diversity and strive to create a culture of belonging. EarnIn does not unlawfully discriminate based on race, color, religion, sex (including pregnancy, childbirth, breastfeeding, or related medical conditions), gender identity, gender expression, national origin, ancestry, citizenship, age, physical or mental disability, legally protected medical condition, family care status, military or veteran status, marital status, registered domestic partner status, sexual orientation, genetic information, or any other basis protected by local, state, or federal laws. EarnIn is an E-Verify participant.

EarnIn does not accept unsolicited resumes from individual recruiters or third-party recruiting agencies in response to job postings. No fee will be paid to third parties who submit unsolicited candidates directly to our hiring managers or HR team.
