
ArgoCD & GitOps Architecture

This document details the “App of Apps” GitOps architecture used in this cluster, focusing on the Sync Wave strategy, Diff Strategy, and Health Check Customizations that enable a fully self-managing cluster.

We use a hierarchical “App of Apps” pattern to manage the entire cluster state.

┌─────────────────┐
│ Root Application│ ← Only manual step (bootstrap)
│ root.yaml │
└────────┬────────┘
│ manages
┌────────────┼────────────────┐
▼ ▼ ▼
┌──────────┐ ┌───────────┐ ┌──────────────┐
│Standalone│ │Application│ │ AppProject │
│ Apps │ │ Sets │ │ Definitions │
└────┬─────┘ └─────┬─────┘ └──────────────┘
│ │ auto-discovers directories
┌───────┼───────┐ │
▼ ▼ ▼ ▼
cilium longhorn kyverno ┌──────────────────────────┐
(wave0) (wave1) (wave3) │ Generated Applications │
│ cert-manager, gpu-op, ... │
└──────────────────────────┘

The entry point is infrastructure/controllers/argocd/root.yaml. This application:

  1. Points to infrastructure/controllers/argocd/apps/
  2. Deploys the ApplicationSet definitions found there.
  3. Is the only thing applied manually (during bootstrap).
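For orientation, here is a minimal sketch of what such a root Application can look like. The repo URL, revision, and sync policy shown are illustrative assumptions, not the actual manifest:

```yaml
# Hypothetical sketch of root.yaml — repoURL and targetRevision are assumptions
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/homelab.git   # assumption
    targetRevision: main
    path: infrastructure/controllers/argocd/apps      # per the docs above
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```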

We use four ApplicationSets to categorize workloads:

  1. Infrastructure (infrastructure-appset.yaml): Core system components (Cert-Manager, GPU operators, Gateway, etc.).
  2. Database (database-appset.yaml): Database operators and instances via glob discovery (infrastructure/database/*/*). Uses selfHeal: false to preserve skip-reconcile annotations during DR.
  3. Monitoring (monitoring-appset.yaml): Observability stack (Prometheus, Grafana).
  4. My Apps (my-apps-appset.yaml): User workloads.
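To illustrate the glob-discovery pattern described above, a database-style ApplicationSet could look roughly like this. The repo URL, project name, and template fields are assumptions, not the actual database-appset.yaml:

```yaml
# Sketch of a glob-discovery ApplicationSet — field values are assumptions
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: database
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - git:
        repoURL: https://github.com/example/homelab.git  # assumption
        revision: main
        directories:
          - path: infrastructure/database/*/*            # glob discovery
  template:
    metadata:
      name: '{{.path.basename}}'
    spec:
      project: database                                  # assumption
      source:
        repoURL: https://github.com/example/homelab.git  # assumption
        targetRevision: main
        path: '{{.path.path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{.path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: false  # preserve skip-reconcile annotations during DR
```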

Some components need guaranteed ordering that ApplicationSets cannot provide (AppSets report “healthy” immediately on creation). These are deployed as standalone Application resources with explicit sync waves:

| App | Wave | Why standalone? |
|-----|------|-----------------|
| cilium | 0 | CNI must exist before any pod |
| argocd | 0 | Self-management |
| 1password-connect | 0 | Secret backend for all ExternalSecrets |
| external-secrets | 0 | CRDs needed by downstream apps |
| longhorn | 1 | Storage must exist before PVCs |
| snapshot-controller | 1 | VolumeSnapshot CRDs for backups |
| volsync | 1 | Backup/restore engine |
| pvc-plumber | 2 | Must be healthy before Kyverno calls its API |
| kyverno | 3 | Webhooks must register before app PVCs are created |
| opentelemetry-operator | 5 | Needs cert-manager (Wave 4) for webhook certificates |

To solve the “chicken-and-egg” problem of bootstrapping a cluster (e.g., needing storage for apps, but networking for storage), we use ArgoCD Sync Waves.

Wave 0 Wave 1 Wave 2 Wave 3 Wave 4 Wave 5 Wave 6
┌─────────┐ ┌───────────┐ ┌──────────┐ ┌─────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────┐
│ Cilium │ │ Longhorn │ │ PVC │ │ Kyverno │ │ Infra AppSet│ │ OTEL Operator│ │ My Apps │
│ ArgoCD │→│ Snapshot │→│ Plumber │→│ │→│ DB AppSet │→│ Mon. AppSet │→│ AppSet │
│ 1Pass │ │ VolSync │ │ │ │ │ │ │ │ │ │ │
│ ExtSec │ │ │ │ │ │ │ │ │ │ │ │ │
└─────────┘ └───────────┘ └──────────┘ └─────────┘ └─────────────┘ └─────────────┘ └──────────┘
Networking Persistence Backup gate Policies Core services Observability User apps
+ Secrets + Webhooks + Databases

| Wave | Phase | Components | Description |
|------|-------|------------|-------------|
| 0 | Foundation | cilium, argocd, 1password-connect, external-secrets, projects | Networking & Secrets. The absolute minimum required for other pods to start and pull credentials. |
| 1 | Storage | longhorn, snapshot-controller, volsync | Persistence. Depends on Wave 0 for pod-to-pod communication and secrets. |
| 2 | PVC Plumber | pvc-plumber | Backup checker. Must be running before Kyverno policies in Wave 3 call its API. |
| 3 | Kyverno | kyverno | Policy engine. Standalone Application (not in an AppSet) so webhooks register before any app PVCs are created. |
| 4 | Infrastructure | cert-manager, gpu-operator, gateway, etc. | Core services via the Infrastructure ApplicationSet (explicit path list). |
| 4 | Database | cloudnative-pg/*/* | Databases via the Database ApplicationSet (glob discovery). Uses selfHeal: false for DR. |
| 5 | OTEL + Monitoring | opentelemetry-operator, prometheus-stack, loki-stack | Observability. OTEL is standalone (needs cert-manager from Wave 4). |
| 6 | User | my-apps/*/* | Workloads via the My-Apps ApplicationSet (discovers my-apps/*/*). |

Each Application resource in infrastructure/controllers/argocd/apps/ is annotated with a sync wave:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cilium
  annotations:
    argocd.argoproj.io/sync-wave: "0"
```

ArgoCD processes these waves sequentially. Wave 1 will NOT start until Wave 0 is healthy.

Standard ArgoCD behavior is to mark a parent Application as “Healthy” as soon as the child Application resource is created, even if the child app is still syncing or degraded. This breaks the Sync Wave logic for App-of-Apps.

To fix this, we inject custom Lua health checks in infrastructure/controllers/argocd/values.yaml.

```yaml
resource.customizations.health.argoproj.io_Application: |
  hs = {}
  hs.status = "Progressing"
  hs.message = ""
  if obj.status ~= nil then
    if obj.status.health ~= nil then
      hs.status = obj.status.health.status
      if obj.status.health.message ~= nil then
        hs.message = obj.status.health.message
      end
    end
  end
  return hs
```

What this does:

  1. It overrides the health assessment of Application resources.
  2. It forces the parent (Root App) to report the actual status of the child Application.
  3. If cilium (Wave 0) is “Progressing”, the Root App sees it as “Progressing”.
  4. The Root App pauses processing Wave 1 until all Wave 0 apps report “Healthy”.

We also ship custom health checks for a few other resource types:

| Resource | Purpose |
|----------|---------|
| ClusterPolicy | Waits for Kyverno’s Ready condition before advancing past Wave 3 |
| ReplicationSource | Reports “Healthy” after first successful sync (prevents false “Progressing”) |
| ReplicationDestination | Reports “Healthy” when latestImage is available for restore |
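The ClusterPolicy check, for example, can be sketched as a values.yaml customization along these lines. This is hypothetical Lua; the actual check in values.yaml may differ:

```yaml
# Sketch — reports Healthy only once Kyverno marks the policy Ready
resource.customizations.health.kyverno.io_ClusterPolicy: |
  hs = {}
  hs.status = "Progressing"
  hs.message = "Waiting for policy to become ready"
  if obj.status ~= nil and obj.status.conditions ~= nil then
    for _, condition in ipairs(obj.status.conditions) do
      if condition.type == "Ready" and condition.status == "True" then
        hs.status = "Healthy"
        hs.message = "Policy is ready"
      end
    end
  end
  return hs
```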
The resulting boot sequence:

  1. Bootstrap: You apply root.yaml.
  2. Adoption: ArgoCD sees cilium defined in Git (Wave 0). It adopts the running Cilium instance.
  3. Expansion: ArgoCD deploys external-secrets (Wave 0).
  4. Wait: ArgoCD waits for Cilium and External Secrets to be green.
  5. Storage: ArgoCD deploys longhorn (Wave 1).
  6. Completion: The process continues until all waves are healthy.

This ensures a deterministic, reliable boot sequence every time.

This cluster uses Server-Side Diff paired with Server-Side Apply. These must be aligned — using one without the other causes silent sync failures.

Client-Side Diff (legacy):

Git manifest ──► ArgoCD compares locally against the live resource
                 │
                 String comparison (doesn't understand quantities,
                 defaults, field ownership)
                 │
                 "1000m" != "1"  ← FALSE DIFF

Server-Side Diff (modern):

Git manifest ──► K8s API dry-run apply returns the predicted result
                 │
                 Semantic comparison (understands quantities, defaults,
                 field ownership, schema types)
                 │
                 "1000m" == "1"  ← CORRECT

Client-Side Diff (legacy, DO NOT USE with SSA)


ArgoCD downloads the live resource from the cluster, then compares it against the Git manifest locally in the ArgoCD controller. It’s essentially doing diff manifest.yaml live-resource.yaml on its own.

Problem: ArgoCD doesn’t know what Kubernetes would actually do with the manifest. Kubernetes adds defaults, mutating webhooks modify fields, and SSA has field ownership rules. ArgoCD is guessing — and sometimes guesses wrong (thinks it’s “in-sync” when it’s not).

Server-Side Diff (modern, REQUIRED with SSA)


ArgoCD sends the Git manifest to the Kubernetes API as a dry-run server-side apply and gets back what the result would look like. Then it compares that against the live resource.

Why it’s better: Kubernetes itself tells ArgoCD “here’s what would change if you applied this” — accounting for defaults, field ownership, webhooks, everything. No guessing.

Even with server-side diff enabled, some cases still require ignoreDifferences:

Server-Side Diff handles:

  • Resource quantity normalization (1000m vs "1", 1Gi vs 1073741824)
  • .status fields (ArgoCD 3.0+ ignores all status by default)
  • Server-side defaulting (fields K8s adds during apply)

Still needs ignoreDifferences:

  • Mutation webhook fields (caBundle, skipBackgroundRequests, etc.)
  • StatefulSet volumeClaimTemplates apiVersion/kind stripping
  • CRD labels added by controllers
  • PVC immutable fields (dataSourceRef, volumeName, storage)
  • Controller-managed annotations

Why mutation webhooks are excluded: By default, server-side diff strips mutation webhook changes from the dry-run result. There is an IncludeMutationWebhook=true option, but ArgoCD maintainers recommend against it — it causes any webhook-added field to show as OutOfSync unless you also have it in Git.

“enabling that option means that any changes made by a mutating webhook will cause your app to be out of sync. That seems like generally undesirable behavior.” — Michael Crenshaw, ArgoCD maintainer (#19800)

The ConfigMap Sync Failure (Why SSA + SSD Must Be Paired)


Without Server-Side Diff, using Server-Side Apply + ApplyOutOfSyncOnly:

Git: configmap data = NEW content
Client-side diff: "managed fields metadata looks the same..." → IN SYNC (wrong!)
ApplyOutOfSyncOnly: "it's in-sync, skip it"
Result: configmap never applied, ArgoCD says "Synced" ✓ (LIE)

With Server-Side Diff:

Git: configmap data = NEW content
K8s API dry-run: "this would change .data.presets.ini" → OUT OF SYNC
Sync: applies the configmap
Result: configmap actually updated ✓

Enabled globally in infrastructure/controllers/argocd/values.yaml:

```yaml
configs:
  cm:
    resource.server-side-diff: "true"
```

Dealing with Operator Mutations (ignoreDifferences)


Many Kubernetes operators and controllers mutate resources after creation. This creates a loop: ArgoCD applies the Git state, the operator mutates it, ArgoCD detects the diff, re-applies, and the cycle repeats.

The preferred approach is to write the normalized value in Git so there’s no diff to fight about:

```yaml
# BAD — operator normalizes 1000m to "1", causing perpetual OutOfSync
resources:
  limits:
    cpu: 1000m  # ← ArgoCD sees "1000m" vs live "1" → diff!

# GOOD — matches what K8s/operator will normalize to
resources:
  limits:
    cpu: "1"    # ← ArgoCD sees "1" vs live "1" → no diff
```

This works for:

  • Resource quantities (1000m → "1", 1024Mi → 1Gi)
  • Kyverno policy defaults (skipBackgroundRequests: true, allowExistingViolations: true, method: GET)
  • Any field where the operator adds a default you can predict

When you can’t control the source (Helm charts, CRDs, controller mutations), use ignoreDifferences:

```yaml
# Per-Application or per-ApplicationSet
ignoreDifferences:
  # Kyverno injects caBundle into webhooks after creation
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    jqPathExpressions:
      - .webhooks[].clientConfig.caBundle
  # StatefulSet volumeClaimTemplates — K8s strips apiVersion/kind
  # (known ArgoCD bug #11143, unresolved)
  - group: apps
    kind: StatefulSet
    jqPathExpressions:
      - .spec.volumeClaimTemplates[].apiVersion
      - .spec.volumeClaimTemplates[].kind
  # CRD fields added by controllers
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    jqPathExpressions:
      - .metadata.labels
      - .spec.conversion
```

ArgoCD 3.0 expanded status ignoring from CRD-only to all resources (PR #22230). You no longer need .status in ignoreDifferences — it’s handled globally. We removed all .status entries from our configs as part of the 3.x cleanup.

| Scope | Where | Use for |
|-------|-------|---------|
| Global | values.yaml → resource.customizations.ignoreDifferences.* | CRDs, resource types that always need ignoring cluster-wide |
| Per-AppSet | template.spec.ignoreDifferences | HTTPRoute, ExternalSecret, PVC fields for all apps in that AppSet |
| Per-App | spec.ignoreDifferences | Operator-specific mutations (Kyverno webhooks, OTEL collector) |
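As a sketch of the global scope, the values.yaml entries could look roughly like this. The exact keys and values here are assumptions, not the cluster's actual config:

```yaml
# Sketch — global ignoreDifferences in values.yaml (values are assumptions)
configs:
  cm:
    # CRDs: ignore fields that controllers always touch
    resource.customizations.ignoreDifferences.apiextensions.k8s.io_CustomResourceDefinition: |
      jqPathExpressions:
        - .metadata.generation
        - .spec.conversion
    # All resources: ignore fields owned by core K8s controllers
    resource.customizations.ignoreDifferences.all: |
      managedFieldsManagers:
        - kube-controller-manager
        - kube-scheduler
```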
The full ignoreDifferences layout across scopes:

Global (values.yaml):
├── CRDs: .metadata.generation, .spec.conversion
├── OpenTelemetryCollector: .metadata.generation, .metadata.annotations
└── All resources: managedFieldsManagers (kube-controller-manager, kube-scheduler)
Kyverno App:
├── ClusterPolicy/ClusterCleanupPolicy/Policy: .metadata.generation
├── Webhook configs: .webhooks[].clientConfig.caBundle
└── CRDs: .metadata.generation, .metadata.labels, .spec.conversion
Infrastructure/My-Apps/Monitoring AppSets:
├── HTTPRoute: backendRefs group/kind/weight
├── ExternalSecret: .metadata.generation/finalizers, remoteRef defaults
└── PVC: dataSourceRef, dataSource, volumeName, storage
My-Apps AppSet (additional):
└── StatefulSet: imagePullPolicy, volumeClaimTemplates apiVersion/kind
Database AppSet:
├── CNPG Cluster: .metadata.generation
├── ExternalSecret: (same as above)
└── PVC: (same as above)
OTEL Operator App:
└── OpenTelemetryCollector: .metadata.generation, .metadata.annotations
Reconciliation timing, configured in values.yaml:

```yaml
configs:
  cm:
    # How often ArgoCD checks for drift (plus random jitter)
    timeout.reconciliation: "60s"
    timeout.reconciliation.jitter: "30s"
    # Hard reconciliation (full git re-fetch + cache invalidation)
    # Set to "0" (disabled) — use "Hard Refresh" button for manual re-fetch
    # WARNING: Setting this to "60s" makes EVERY reconcile a hard reconcile,
    # hammering the repo server and GitHub API
    timeout.hard.reconciliation: "0"
```

Controller and repo-server tuning:

```yaml
configs:
  params:
    # Parallel status processors (default 20, ~1 per 20 apps)
    controller.status.processors: "50"
    # Concurrent sync operations (default 10)
    controller.operation.processors: "25"
    # Limit concurrent manifest generations to prevent OOM
    reposerver.parallelism.limit: "5"
    # Increase timeout for large Helm charts (prometheus-stack)
    controller.repo.server.timeout.seconds: "300"
controller:
  env:
    # K8s API client throughput
    - name: ARGOCD_K8S_CLIENT_QPS
      value: "50"
    - name: ARGOCD_K8S_CLIENT_BURST
      value: "100"
    # Split large app trees across Redis keys
    - name: ARGOCD_APPLICATION_TREE_SHARD_SIZE
      value: "100"
```

Server-side diff adds ~5-10x overhead per reconciliation (dry-run API calls to the K8s API server). Mitigations in this cluster:

  • Reconciliation jitter (30s) prevents all 60+ apps from reconciling simultaneously
  • Hard reconciliation disabled ("0") — avoids redundant git re-fetches
  • Caching — dry-run results are cached; new API calls only trigger on refresh, new git revision, or app spec change
  • Status processors increased (50) to handle the higher per-app reconciliation time

Standard sync options for all ApplicationSets:

```yaml
syncOptions:
  - CreateNamespace=true
  - ServerSideApply=true          # Server-side apply for better conflict resolution
  - RespectIgnoreDifferences=true # Honor ignoreDifferences for PVC, HTTPRoute, etc.
  - Replace=false                 # Use patch, not full replace
```

DO NOT add these options:

  • ApplyOutOfSyncOnly=true — Even with ServerSideDiff, has known edge cases with key removal. Not worth the risk for a homelab-scale cluster.
  • IgnoreMissingTemplate=true — Can mask real template errors in ApplicationSets.

Known upstream issues that shape this configuration:

| Issue | Impact | Our Workaround |
|-------|--------|----------------|
| #11143 StatefulSet VCT stripping | K8s strips apiVersion/kind from volumeClaimTemplates | ignoreDifferences on StatefulSet |
| #18344 SSD performance | ~10x reconciliation overhead | Jitter + disabled hard reconciliation |
| #19800 IncludeMutationWebhook | Maintainers deny global toggle | Match canonical forms in Git instead |
| #22230 Status ignoring | 3.0+ ignores all .status | Removed redundant .status ignores |
| #24134 SSD not default | Must opt-in even in 3.3 | Explicit resource.server-side-diff: "true" |