DevOps May 02, 2026 · 14 min read

GitOps beyond the basics: what we learned running ArgoCD across 14 clusters.

A two-year retrospective on multi-tenant ArgoCD: what worked, what we ripped out, and the policy gates that made our SOC 2 audit boring.

Above: a simplified diagram of the cluster topology; the full layout appears in the topology section below.

Why we wrote this down

Two years ago we agreed to operate a platform that, on paper, was already running ArgoCD. The previous team had read the docs, watched the conference talks, and reached the place every team that does GitOps reaches eventually: the pipeline works on Tuesday and pages someone on Thursday. Nobody was happy with it, and nobody could clearly articulate why.

This post is the writeup we wish we had read before signing the engagement. Most of it is opinionated. None of it is magic. If you're operating ArgoCD across more than three clusters and the diff between "syncing" and "synced" feels like a coin flip, you'll recognize most of these patterns.

The 14-cluster topology

The platform spans three regions, four environments, and two regulatory boundaries. The control plane is one ArgoCD installation per tier, not per cluster, not per environment. Each tier sees its child clusters as destinations and routes app projects through hub-and-spoke ApplicationSets. The full layout:

# tier-prod control plane
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-prod
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            tier: prod
            boundary: commercial
  template:
    spec:
      project: platform
      syncPolicy:
        automated: { prune: true, selfHeal: true }

Two things matter in that snippet. First, boundary is a label we treat as load-bearing: every OPA policy rule keys off it. Second, selfHeal: true is non-negotiable. Disabling self-heal makes drift "safer" in theory and makes 2am pages absolutely guaranteed in practice.
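For context, the labels that generator matches live on the ArgoCD cluster secret that registers each destination. A minimal sketch (the cluster name and server URL are placeholders, not our real topology):

```yaml
# Destination cluster registered with the tier-prod control plane.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-us-east-1
  labels:
    argocd.argoproj.io/secret-type: cluster  # marks this Secret as an ArgoCD cluster
    tier: prod                               # matched by the ApplicationSet generator
    boundary: commercial                     # the load-bearing policy label
type: Opaque
stringData:
  name: prod-us-east-1
  server: https://prod-us-east-1.example.internal:6443
```

Because the generator selects on these labels, registering a cluster with the wrong boundary label is itself a policy violation the compliance bundle can catch.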

Multi-tenancy without tears

The hardest part of running ArgoCD at this scale is not Argo. It's the org chart. Multi-tenancy means three things that are easy to confuse:

  1. Authorization: who can sync what.
  2. Visibility: who can see what.
  3. Isolation: what happens to your tenant when mine catches fire.

We separate these concerns explicitly. Authorization is RBAC at the AppProject level. Visibility is a per-tenant ArgoCD UI proxy that filters the API response by team. Isolation is hard limits on resource quotas, NetworkPolicies, and a separate ArgoCD instance per tier.
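As a sketch of the authorization piece, an AppProject that scopes a tenant to its own repos, namespaces, and sync permissions looks roughly like this (the project, team, and cluster names are hypothetical):

```yaml
# Hypothetical tenant project: team-payments may sync only its own repos
# into its own namespaces on the prod-tier cluster, and nothing else.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
spec:
  sourceRepos:
    - https://git.example.com/payments/*
  destinations:
    - server: https://prod-us-east-1.example.internal:6443
      namespace: payments-*
  roles:
    - name: deployer
      policies:
        # RBAC: the team's group may sync apps in this project, nothing more.
        - p, proj:team-payments:deployer, applications, sync, team-payments/*, allow
      groups:
        - payments-engineers
```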

"If your tenants share an ArgoCD instance and a regulatory boundary, you don't have multi-tenancy. You have shared blast radius."

Policy gates that earned their keep

We run three OPA bundles inline with every sync: a structural bundle (kinds, labels, image registries), a security bundle (no privileged, no hostPath, no :latest tags), and a compliance bundle (boundary, classification, FIPS-mode flag for the regulated tier). The compliance bundle alone caught 47 violations in the first six weeks. Most were genuine mistakes; a few were configuration that the platform team had labeled "acceptable risk" in 2022 and forgotten about.
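To make the shape of these bundles concrete, here is what a rule of the no-:latest-tags kind from the security bundle might look like in Rego. The package name and input shape are illustrative assumptions, not our actual bundle:

```rego
# Hypothetical security-bundle rule: deny any container image pinned to
# :latest, or carrying no tag at all (which defaults to :latest).
package security.images

deny[msg] {
    some i
    container := input.spec.template.spec.containers[i]
    endswith(container.image, ":latest")
    msg := sprintf("container %q uses a :latest tag", [container.name])
}

deny[msg] {
    some i
    container := input.spec.template.spec.containers[i]
    not contains(container.image, ":")
    msg := sprintf("container %q has no image tag", [container.name])
}
```

Keeping each bundle this small is what makes the gate's verdicts legible: a deny message names the bundle, the rule, and the offending container.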

Concrete numbers, last 90 days:

  • 1,827 sync attempts
  • 52 blocked at structural gate
  • 11 blocked at security gate
  • 3 blocked at compliance gate (all genuine, all caught a config drift before prod)
  • 0 production incidents traced to a missed gate

What we ripped out

Two patterns we inherited and threw away:

App-of-apps for everything. It's elegant on a slide and miserable in operation. We replaced it with ApplicationSets keyed off cluster labels: flatter, easier to reason about, and one tier of indirection instead of three.

Sync waves as a feature flag system. The previous team had been using argocd.argoproj.io/sync-wave annotations to stage rollouts. This works until it doesn't, and then it fails in ways that are hard to debug. Real feature flags now live in OpenFeature; sync waves are reserved for actual ordering dependencies.
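For the cases where sync waves are the right tool, the annotation is the standard ArgoCD one, and a schema-migration Job that must complete before its Deployment is the canonical shape (the resource names here are illustrative):

```yaml
# Wave 0: runs and must be healthy before anything in later waves syncs.
apiVersion: batch/v1
kind: Job
metadata:
  name: schema-migrate
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# Wave 1: syncs only after wave 0 resources report healthy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  annotations:
    argocd.argoproj.io/sync-wave: "1"
```

That is an actual ordering dependency; gating a feature's visibility is not, and that distinction is the whole rule.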

Making the audit boring

The thing nobody tells you about GitOps is that it's an auditor's dream, if you set it up that way. Every change to production has a Git commit, a signed tag, an OPA decision, and an ArgoCD sync record. Cross-reference those four logs and you have an evidence package most SOC 2 auditors will accept on the spot.

We wrote this up in a previous post: FedRAMP-credible engineering, without the binder.

Five takeaways

  1. One ArgoCD per tier, not per cluster. The blast radius of a bad upgrade is the entire tier; size the deployment to that, not to the clusters.
  2. self-heal is non-negotiable. If you can't tolerate self-heal, you have a different problem than GitOps.
  3. OPA inline, three bundles. Structural, security, compliance. Keep them separate; they fail for different reasons.
  4. Sync waves are not feature flags. Use the right tool for the right job.
  5. The audit is a query, not a binder. If your evidence package isn't queryable, it's not done.

If any of this resonates with what you're running, or if you'd like to push back on any of it, our inbox is open. We learn more from the pushback than from the agreement.

Want help with this in your environment?

Talk to D. Marin and the team.

30-minute call. No SDR. We'll tell you whether this is the right shape for what you're running.