SRE, Remote US. Posted Apr 28, 2026.

Senior Site Reliability Engineer.

SLOs, postmortems, error budgets. You like the phrase "burn rate" and you know what to do about it. You want to work on systems where the on-call rotation actually has a feedback loop into the architecture.

About the role

You'll be the third SRE on a four-person reliability practice. We split our time across two anchor clients (one Series C SaaS, one regulated-cloud platform) and the open-source observability stack we maintain (status pages, SOC monitor, runner pool). Your week will be roughly half client work, a third on the toolkit and writing, and the rest in design reviews.

This is a real on-call role. We expect you to carry the primary pager about one week in four, with a paid backup rotation. We invest seriously in toil reduction; if a runbook gets used twice, it gets automated.

What you'll do

  • Define and own SLOs for client services. Negotiate them with product, instrument them in code, and burn budgets responsibly.
  • Run incidents end to end. Drive blameless postmortems with action items that actually land.
  • Build the observability stack: Prometheus, Grafana, Loki, OpenTelemetry, alert routing.
  • Improve the open-source SOC monitor and status-page tooling. Real users depend on it.
  • Pair with platform and software engineers to make services more operable before they hit production.

Who you are

  • 5+ years operating production systems with explicit SLOs and an on-call rotation.
  • Fluent in Prometheus, Grafana, alerting, log pipelines, and at least one tracing system.
  • Comfortable in Go or Python for instrumentation, ad-hoc tooling, and CI work.
  • You've written a postmortem you would still defend a year later.
  • US-based, eligible to work without sponsorship.

Bonus, not required

  • Experience with chaos engineering, fault injection, or game-day exercises.
  • OpenTelemetry contributions, exporters, or instrumentation libraries you've shipped.
  • Capacity planning and cost-aware right-sizing for cloud workloads.
  • You've operated a system at scale that surprised you. We want to hear that story.

Interview process

  1. Application, resume + GitHub + paragraph. ~10 minutes for you, 30 for us.
  2. Engineering chat, 60 min, paired on a production incident trace. No whiteboard.
  3. Take-home, paid, ~6 hours, on our public toolkit. You submit a PR.
  4. Team day, 4 hours: incident drill, design review, peer Q&A.
  5. Offer, within 48 hours of team day.

We pay for step 3 at $150/hr. If you turn down the offer, you keep the work and the payment.

Compensation & benefits

Salary band $185,000 to $225,000, plus 0.05 to 0.15% equity. We share comp ranges in the job ad because making you guess is an asshole move.

  • Platinum medical, dental, vision, 100% premium covered for you
  • 5 weeks PTO, 13 federal holidays, end-of-year shutdown
  • On-call premium paid on top of base for primary weeks
  • $2,500 home-office sign-on, $750/yr maintenance
  • $3,000/yr learning budget
  • 10% open-source time

Questions before applying?

Email jobs@caelicode.com.

A senior engineer answers within two business days. No SDR, no recruiter chain.