Senior Site Reliability Engineer.
SLOs, postmortems, error budgets. You like the phrase "burn rate" and you know what to do about it.
About the role
You will work with senior engineers to make production systems measurable, supportable, and boring in the best possible sense.
The role sits close to client operations, incident response, and the tooling that turns operational knowledge into repeatable practice.
What you'll do
- Design SLOs, alerts, dashboards, and incident workflows that production teams can actually use.
- Improve CI/CD and release safety with runbooks, rollback paths, and operational evidence.
- Lead postmortem follow-through and convert findings into code, policy, or process improvements.
- Work with platform and application teams to reduce paging noise and increase confidence.
Who you are
- You have operated production systems and participated in real incident response.
- You understand Prometheus, Grafana, logs, traces, alerting, and error-budget thinking.
- You can write infrastructure or application code well enough to fix the underlying issue.
- You communicate clearly under pressure and write useful operational notes.
Bonus, not required
- Experience with regulated or high-availability environments.
- Familiarity with OpenTelemetry.
- Experience coaching product teams on production ownership.
Interview process
- Application: resume, GitHub, and a short paragraph.
- Engineering chat: 60 minutes with a senior engineer. No whiteboard.
- Take-home: paid, scoped work on a real engineering problem.
- Team day: focused conversations around design, security, and collaboration.
- Offer: written clearly and discussed directly.
Compensation & benefits
Competitive senior-engineer compensation. The role is full-time, remote-first, and US-based. We discuss specifics early in the process so nobody is left guessing.
- Medical, dental, and vision benefits
- Flexible paid time away from work
- Home-office and learning support
- Time for writing, open engineering, and internal platform improvement