platform / internal tooling

The toolkit behind the team.

We run our own platform on the same patterns we ship to clients: GitOps governance, dual-engine secret scanning, SOC observability, automated incident response. Each tool is config-driven and battle-tested in production before it ever reaches a client environment.

caelicode/platform · live
tools: 6
uptime: 99.99%
deploy cadence: daily
tools / 01-06

Six tools. One operating model.

Every tool below is deployed via ArgoCD, tested in CI, monitored by the SOC stack, and documented with runbooks. When we ship these to client environments, the only change is the config values.

01 · status-page

Incident Automation

Synthetic checks, automated incident creation, status page updates, and structured postmortems.

test coverage: 216 tests
check interval: 30s
incident MTTR: < 7m p50

Synthetic monitoring

HTTP, TCP, DNS, and TLS checks from 3 regions. Configurable thresholds and alert escalation.
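A multi-region threshold could be evaluated roughly as follows. This is a minimal sketch, not the status-page implementation: the `CheckResult` type, the region labels, the two-region quorum, and the 2000 ms latency limit are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    region: str        # hypothetical region label, e.g. "us-east"
    ok: bool           # did the probe succeed (status, handshake, lookup)?
    latency_ms: float  # observed round-trip time


def should_alert(
    results: Iterable[CheckResult],
    min_failing_regions: int = 2,    # assumed quorum, not from the source
    max_latency_ms: float = 2000.0,  # assumed latency threshold
) -> bool:
    """Alert only when enough distinct regions see a failed or slow probe."""
    failing = {
        r.region
        for r in results
        if not r.ok or r.latency_ms > max_latency_ms
    }
    return len(failing) >= min_failing_regions
```

Requiring agreement between regions is one common way to keep a single flaky probe location from opening an incident.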

Incident lifecycle

Auto-create, auto-notify, auto-resolve. PagerDuty and Slack integrations. Timeline generated automatically.

Postmortems

Structured templates, action item tracking, SLO impact scoring. Published internally within 5 business days.

Status page

Public-facing component status, maintenance windows, subscriber notifications. Zero manual updates during incidents.

02 · secret-scanner

Secret Scanning

Dual-engine scanning (gitleaks + trufflehog) with incremental mode and CI-blocking gates.

engines: gitleaks + trufflehog
custom rules: 22+
mode: incremental, CI-blocking

Dual-engine approach

Two scanners with different detection strategies, so each catches what the other misses. Combined false-positive rate under 2%.
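Combining two engines means merging their findings without double-reporting the same secret. The sketch below assumes each engine's output has already been parsed into a hypothetical `Finding` record; the fingerprint scheme (path, line, and a hash of the match) is an illustration, not the scanner's actual format.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Finding:
    engine: str  # "gitleaks" or "trufflehog"
    path: str
    line: int
    secret: str  # raw match; only its hash is used as a key


def fingerprint(f: Finding) -> str:
    """Stable key for a finding that does not embed the secret itself."""
    digest = hashlib.sha256(f.secret.encode("utf-8")).hexdigest()
    return f"{f.path}:{f.line}:{digest}"


def merge_findings(*engine_results: Iterable[Finding]) -> Dict[str, Set[str]]:
    """Union findings from all engines; record which engines reported each."""
    merged: Dict[str, Set[str]] = {}
    for results in engine_results:
        for f in results:
            merged.setdefault(fingerprint(f), set()).add(f.engine)
    return merged
```

Keeping the set of reporting engines per fingerprint makes it easy to see which findings both scanners agree on and which only one of them caught.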

Incremental scanning

Scans only new commits rather than the full history on every run. Blocks a PR in under 8 seconds for typical diffs.
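Incremental mode comes down to handing each engine the PR's commit range instead of the whole history. The sketch below only builds the command lines. The helper name is hypothetical, and the flags (`--log-opts` for gitleaks, `--since-commit`, `--branch`, and `--fail` for trufflehog) reflect our understanding of gitleaks v8 and trufflehog v3, so verify them against the versions you have installed.

```python
from typing import List


def scan_commands(base_sha: str, head_ref: str, repo_path: str = ".") -> List[List[str]]:
    """Build scanner invocations limited to commits in base_sha..head_ref."""
    rev_range = f"{base_sha}..{head_ref}"
    gitleaks = [
        "gitleaks", "detect",
        "--source", repo_path,
        "--log-opts", rev_range,  # assumed flag: passed through to `git log`
    ]
    trufflehog = [
        "trufflehog", "git", f"file://{repo_path}",
        "--since-commit", base_sha,  # assumed flag: skip older history
        "--branch", head_ref,
        "--fail",                    # assumed flag: non-zero exit on findings
    ]
    return [gitleaks, trufflehog]
```

In CI, each list could be passed to `subprocess.run(cmd, check=False)` and any non-zero exit code treated as a blocking result.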

Custom rules

22+ org-specific patterns: internal tokens, service account keys, webhook URLs, config secrets.
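Org-specific rules are typically regular expressions keyed by a rule name. The two patterns below are invented for illustration (an `int_` token prefix and a `hooks.example.com` webhook host); they are not among the 22+ production rules.

```python
import re
from typing import List

# Hypothetical rule set: names and patterns are illustrative assumptions.
RULES = {
    "internal-token": re.compile(r"\bint_[A-Za-z0-9]{32}\b"),
    "webhook-url": re.compile(r"https://hooks\.example\.com/[A-Za-z0-9/_-]+"),
}


def scan_line(line: str) -> List[str]:
    """Return the names of every rule that matches the given line."""
    return [name for name, pattern in RULES.items() if pattern.search(line)]
```

Anchoring patterns on a distinctive prefix or host keeps custom rules cheap to run and limits false positives.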

Routing & SLAs

Findings routed to code owners. Critical: 4h SLA. High: 24h. Medium: 1 sprint. Low: best-effort.
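The SLA tiers translate directly into a due-date lookup. This sketch assumes a 14-day sprint, which the text does not specify, and models "best-effort" as having no deadline.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SPRINT = timedelta(days=14)  # assumption: sprint length is not stated above

SLA = {
    "critical": timedelta(hours=4),
    "high": timedelta(hours=24),
    "medium": SPRINT,
    "low": None,  # best-effort: no deadline
}


def due_by(severity: str, found_at: datetime) -> Optional[datetime]:
    """Return the remediation deadline for a finding, or None if best-effort."""
    window = SLA[severity.lower()]
    return None if window is None else found_at + window
```

For example, a critical finding raised at 12:00 UTC would be due at 16:00 UTC the same day.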

03 · org-governance

Org Governance

YAML-driven GitHub org policy with nightly drift detection and auto-remediation.

run cadence: nightly
drift detection: < 5s
policy format: declarative YAML

Declarative config

Repo settings, branch protections, team memberships, and webhook configs defined in YAML. PRs for all changes.

Drift detection

Nightly reconciliation against desired state. Drift reported as issues, auto-fixed when safe, flagged for review otherwise.
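Reconciliation amounts to diffing desired state against live state and sorting each difference into "safe to auto-fix" or "needs review". The sketch below assumes both states are flat dictionaries already loaded from YAML and from the GitHub API; the setting names and the `safe_keys` allowlist are hypothetical.

```python
from typing import Any, Dict, FrozenSet


def detect_drift(
    desired: Dict[str, Any],
    actual: Dict[str, Any],
    safe_keys: FrozenSet[str],  # hypothetical allowlist of auto-fixable settings
) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Compare desired and live settings; bucket each mismatch by fix policy."""
    drift: Dict[str, Dict[str, Dict[str, Any]]] = {"auto_fix": {}, "needs_review": {}}
    for key, want in desired.items():
        have = actual.get(key)  # a missing setting shows up as None
        if have != want:
            bucket = "auto_fix" if key in safe_keys else "needs_review"
            drift[bucket][key] = {"want": want, "have": have}
    return drift
```

An explicit allowlist keeps auto-remediation conservative: anything not listed is flagged for a human instead of being changed automatically.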

Audit trail

Every policy change is a merged PR. Full git history of who changed what, when, and why.

Multi-org support

Runs across our org and client orgs (using their personal access tokens). Same policy engine, different config repos.

04 · soc-stack

SOC Observability

Grafana + Loki + Alertmanager with 7 production alert rules and on-call routing.

alert rules: 7 active
log retention: 30d hot, 1yr cold
integrations: pagerduty, slack, email

Dashboards

Pre-built Grafana dashboards for infrastructure health, deploy frequency, error budgets, and cost tracking.

Alert quality

Every alert is actionable, owned, and rare. Alert fatigue is treated as a bug, not a feature.

Log aggregation

Loki-based, label-indexed, fast. Structured logs with trace correlation. 30-day hot, 1-year cold storage.

On-call routing

Time-based and severity-based routing. Escalation policies. Quiet hours respected unless P1.
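The quiet-hours rule can be expressed as a small predicate. The 22:00 to 07:00 window below is an assumption for illustration; the actual quiet hours are not stated here.

```python
from datetime import time

# Assumed quiet-hours window; it wraps past midnight.
QUIET_START = time(22, 0)
QUIET_END = time(7, 0)


def in_quiet_hours(now: time) -> bool:
    """True when `now` falls in a window that spans midnight."""
    return now >= QUIET_START or now < QUIET_END


def should_page(severity: str, now: time) -> bool:
    """P1 always pages; everything else waits until quiet hours end."""
    return severity == "P1" or not in_quiet_hours(now)
```

Because the window crosses midnight, the check uses `or` rather than `and`; a same-day window would need the opposite.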

05 · runner-infra

Runner Infrastructure

Self-hosted, autoscaling CI runners on spot capacity with cron dispatch.

scaling: 0 to N in < 30s
cost model: spot instances
dispatch: cron + event

Autoscaling

Scale-to-zero when idle, burst to capacity on push. No per-minute billing. Predictable monthly cost.

Security isolation

Ephemeral VMs, destroyed after each job. No shared state, no credential persistence, no lateral movement.

Cron dispatcher

Scheduled jobs (nightly builds, security scans, governance checks) managed as code, not UI clicks.

Audit trail

Full log of every job: who triggered it, what ran, and which artifacts it produced. Retained 90 days.

06 · notify-bus

Notification Fabric

Unified alerting across email, Slack, PagerDuty, SMS, and webhooks. One config, all channels.

channels: 5 (email, slack, pd, sms, webhook)
routing: severity + team
dedup: content-hash based

Multi-channel

One event, routed to the right channel based on severity and team ownership. No duplicate noise.

Deduplication

Content-hash dedup prevents alert storms. Grouped notifications for related events.
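Content-hash dedup can be sketched as hashing a canonical form of the event and suppressing repeats inside a time window. The `Deduplicator` class, the 300-second window, and the use of SHA-256 over sorted-key JSON are assumptions for illustration, not the notify-bus internals.

```python
import hashlib
import json
from typing import Any, Dict


class Deduplicator:
    def __init__(self, window_seconds: float = 300.0) -> None:  # assumed window
        self.window = window_seconds
        self._last_sent: Dict[str, float] = {}

    @staticmethod
    def content_hash(event: Dict[str, Any]) -> str:
        """Hash a canonical JSON form so key order does not change the hash."""
        canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def should_send(self, event: Dict[str, Any], now: float) -> bool:
        """Send only if an identical event was not sent within the window."""
        key = self.content_hash(event)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False
        self._last_sent[key] = now
        return True
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the logic easy to test. Volatile fields such as timestamps would need to be stripped from the event before hashing, or every event would look unique.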

Templates

Channel-specific formatting. Slack gets rich blocks, email gets structured HTML, PagerDuty gets severity metadata.

Dead-letter queue

Failed deliveries retried with backoff. Persistent failures escalate to fallback channel. Nothing silently dropped.
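Retry with backoff followed by escalation could look roughly like this. The function name, the attempt count, and the doubling delay are assumptions; `send` and `fallback` stand in for whatever channel clients are actually used.

```python
import time
from typing import Any, Callable


def deliver_with_fallback(
    send: Callable[[Any], None],
    fallback: Callable[[Any], None],
    payload: Any,
    max_attempts: int = 4,    # assumed retry budget
    base_delay: float = 0.5,  # assumed first delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the primary channel with exponential backoff, then escalate."""
    for attempt in range(max_attempts):
        try:
            send(payload)
            return "delivered"
        except Exception:
            # Wait before the next try; skip the wait after the final failure.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Primary channel exhausted: hand the payload to the fallback channel.
    fallback(payload)
    return "escalated"
```

Injecting `sleep` lets tests run instantly. A production version would likely add jitter to the delay and persist the payload before retrying, so a process crash cannot drop it.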

Want this platform in your environment?

Every tool above is deployable to client infrastructure. Same code, different config. We'll have it running in your AWS/Azure/GCP account within the first sprint.

Start a conversation
book intro · 30m · cal.com/caelicode