ArgoCD & GitOps Architecture
This document details the "App of Apps" GitOps architecture used in this cluster, specifically focusing on the Sync Wave strategy and Health Check Customizations that enable a fully self-managing cluster.
🏗️ The "App of Apps" Pattern
We use a hierarchical "App of Apps" pattern to manage the entire cluster state.
graph TD;
RootApp[Root Application] -->|Manages| AppSets[ApplicationSets];
AppSets -->|Generates| Apps[Applications];
Apps -->|Deploys| Resources[Kubernetes Resources];
The Root Application
The entry point is infrastructure/controllers/argocd/root.yaml. This application:
1. Points to infrastructure/controllers/argocd/apps/
2. Deploys the ApplicationSet definitions found there.
3. Is the only thing applied manually (during bootstrap).
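As a rough illustration of the three points above, a root Application could look like the sketch below. Only the `path` comes from this document; the name, namespace, project, repository URL, and sync policy are assumptions and will differ in the real `root.yaml`.

```yaml
# Sketch of infrastructure/controllers/argocd/root.yaml.
# Everything marked "assumed" or "hypothetical" is illustrative only.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root                 # assumed name
  namespace: argocd          # assumed Argo CD namespace
spec:
  project: default           # assumed project
  source:
    repoURL: https://example.com/your-org/your-cluster.git  # hypothetical repo
    targetRevision: HEAD
    path: infrastructure/controllers/argocd/apps            # from this document
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:               # assumed: keep the root app self-syncing
      prune: true
      selfHeal: true
```

During bootstrap this would be the single manifest applied by hand, for example with `kubectl apply -f infrastructure/controllers/argocd/root.yaml`.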
ApplicationSets
We use three primary ApplicationSets to categorize workloads:
1. Infrastructure (infrastructure-appset.yaml): Core system components (Cilium, Longhorn, Cert-Manager).
2. Monitoring (monitoring-appset.yaml): Observability stack (Prometheus, Grafana).
3. My Apps (my-apps-appset.yaml): User workloads.
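For orientation, here is a minimal sketch of what a directory-discovering ApplicationSet such as `my-apps-appset.yaml` might contain. It uses the Git directory generator; the repository URL, project, and the choice to name each app and namespace after its directory are assumptions, not the actual file contents.

```yaml
# Sketch of a directory-discovering ApplicationSet (assumed layout).
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-apps
  namespace: argocd          # assumed Argo CD namespace
spec:
  generators:
    - git:
        repoURL: https://example.com/your-org/your-cluster.git  # hypothetical repo
        revision: HEAD
        directories:
          - path: my-apps/*/*   # one Application per matching directory
  template:
    metadata:
      name: '{{path.basename}}'          # assumed naming scheme
    spec:
      project: default                   # assumed project
      source:
        repoURL: https://example.com/your-org/your-cluster.git  # hypothetical repo
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'   # assumed: namespace named after the directory
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
```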
🌊 Sync Waves & Dependency Management
To solve the "chicken-and-egg" problem of bootstrapping a cluster (e.g., needing storage for apps, but networking for storage), we use ArgoCD Sync Waves.
The Wave Strategy
| Wave | Phase | Components | Description |
|---|---|---|---|
| 0 | Foundation | cilium, argocd, 1password-connect, external-secrets, projects | Networking & Secrets. The absolute minimum required for other pods to start and pull credentials. |
| 1 | Storage | longhorn, snapshot-controller, volsync | Persistence. Depends on Wave 0 for Pod-to-Pod communication and secrets. |
| 2 | PVC Plumber | pvc-plumber | Backup checker. Must be running before Kyverno policies in Wave 4 call its API. |
| 4 | Infrastructure | cert-manager, kyverno, gpu-operator, databases, gateway, etc. | Core Services via ApplicationSet (explicit path list). |
| 5 | Monitoring | prometheus-stack, loki-stack, tempo | Observability via ApplicationSet (discovers monitoring/*). |
| 6 | User | my-apps/*/* | Workloads via ApplicationSet (discovers my-apps/*/*). |
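The "explicit path list" variant used for Wave 4 can be sketched with a list generator. This assumes the wave annotation sits on the ApplicationSet resource itself; the element paths, repository URL, and project are hypothetical placeholders.

```yaml
# Sketch of an explicit-path ApplicationSet placed in Wave 4 (assumed layout).
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: infrastructure
  namespace: argocd                        # assumed Argo CD namespace
  annotations:
    argocd.argoproj.io/sync-wave: "4"      # wave from the table above
spec:
  generators:
    - list:
        elements:
          - name: cert-manager
            path: infrastructure/controllers/cert-manager   # hypothetical path
          - name: kyverno
            path: infrastructure/controllers/kyverno        # hypothetical path
  template:
    metadata:
      name: '{{name}}'
    spec:
      project: default                     # assumed project
      source:
        repoURL: https://example.com/your-org/your-cluster.git  # hypothetical repo
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{name}}'              # assumed: namespace named after the app
```

An explicit list trades automatic discovery for control: a new component is only deployed once its path is added to `elements`.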
How It Works
Each Application resource in infrastructure/controllers/argocd/apps/ is annotated with a sync wave:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cilium
  annotations:
    argocd.argoproj.io/sync-wave: "0"
ArgoCD processes these waves sequentially. Wave 1 will NOT start until Wave 0 is healthy.
🏥 Health Check Customizations
Standard ArgoCD behavior is to mark a parent Application as "Healthy" as soon as the child Application resource is created, even if the child app is still syncing or degraded. This breaks the Sync Wave logic for App-of-Apps.
To fix this, we inject a custom Lua health check in infrastructure/controllers/argocd/values.yaml.
The "Wait for Child" Script
resource.customizations.health.argoproj.io_Application: |
  hs = {}
  hs.status = "Progressing"
  hs.message = ""
  if obj.status ~= nil then
    if obj.status.health ~= nil then
      hs.status = obj.status.health.status
      if obj.status.health.message ~= nil then
        hs.message = obj.status.health.message
      end
    end
  end
  return hs
What this does:
1. It overrides the health assessment of Application resources.
2. It forces the parent (Root App) to report the actual status of the child Application.
3. If cilium (Wave 0) is "Progressing", the Root App sees it as "Progressing".
4. The Root App pauses processing Wave 1 until all Wave 0 apps report "Healthy".
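To make the fields concrete, the excerpt below shows the part of a child Application that the script reads. The values are invented for illustration; a real object carries many more fields.

```yaml
# Illustrative excerpt of a child Application as seen by the Lua check.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cilium
status:
  sync:
    status: Synced
  health:
    status: Progressing   # copied into hs.status, so the parent also shows Progressing
    # message is optional; when present it is copied into hs.message
```

If `status` or `status.health` is missing, as it is right after the resource is created, the script falls back to its default of "Progressing", which keeps the wave from advancing prematurely.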
🔄 Self-Management Loop
- Bootstrap: You apply root.yaml.
- Adoption: ArgoCD sees cilium defined in Git (Wave 0). It adopts the running Cilium instance.
- Expansion: ArgoCD deploys external-secrets (Wave 0).
- Wait: ArgoCD waits for Cilium and External Secrets to be green.
- Storage: ArgoCD deploys longhorn (Wave 1).
- Completion: The process continues until all waves are healthy.
This ensures a deterministic, reliable boot sequence every time.
Server-Side Diff vs Client-Side Diff
This cluster uses Server-Side Diff (resource.server-side-diff: "true" in argocd-cm) paired with Server-Side Apply (ServerSideApply=true in syncOptions). These must be aligned — using one without the other causes silent sync failures.
Client-Side Diff (legacy, DO NOT USE with SSA)
ArgoCD downloads the live resource from the cluster, then compares it against the Git manifest locally in the ArgoCD controller. It's essentially doing diff manifest.yaml live-resource.yaml on its own.
Problem: ArgoCD doesn't know what Kubernetes would actually do with the manifest. Kubernetes adds defaults, mutating webhooks modify fields, and SSA has field ownership rules. ArgoCD is guessing — and sometimes guesses wrong (thinks it's "in-sync" when it's not).
Server-Side Diff (modern, REQUIRED with SSA)
ArgoCD sends the Git manifest to the Kubernetes API as a dry-run server-side apply and gets back what the result would look like. Then it compares that against the live resource.
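Why it's better: Kubernetes itself tells ArgoCD "here's what would change if you applied this", accounting for defaults and field ownership. Changes made by mutating webhooks are only included in the comparison when the IncludeMutationWebhook=true compare option is also set. No guessing.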
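Besides the cluster-wide setting, upstream Argo CD also documents a per-Application switch via the `compare-options` annotation. The sketch below shows it on a hypothetical Application; the name is a placeholder, and whether this cluster needs `IncludeMutationWebhook` is an open assumption.

```yaml
# Per-Application opt-in to Server-Side Diff (application name is hypothetical).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  annotations:
    # ServerSideDiff=true switches this app to dry-run based diffing.
    # IncludeMutationWebhook=true additionally includes mutating-webhook
    # changes in the diff; drop it if that is not wanted.
    argocd.argoproj.io/compare-options: ServerSideDiff=true,IncludeMutationWebhook=true
```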
Why the Mismatch Breaks ConfigMaps
Without Server-Side Diff, using Server-Side Apply + ApplyOutOfSyncOnly:
Git: configmap data = NEW content
↓
Client-side diff: "managed fields metadata looks the same..." → IN SYNC (wrong!)
↓
ApplyOutOfSyncOnly: "it's in-sync, skip it"
↓
Result: configmap never applied, ArgoCD says "Synced" ✓ (LIE)
With Server-Side Diff:
Git: configmap data = NEW content
↓
K8s API dry-run: "this would change .data.presets.ini" → OUT OF SYNC
↓
Sync: applies the configmap
↓
Result: configmap actually updated ✓
Configuration
Server-Side Diff is enabled globally in infrastructure/controllers/argocd/values.yaml.
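As a hedged sketch of what a global setting can look like, assuming values.yaml feeds the community `argo-cd` Helm chart: upstream Argo CD documents the controller-wide switch as `controller.diff.server.side` in the `argocd-cmd-params-cm` ConfigMap, which that chart exposes under `configs.params`. This section names a different key (`resource.server-side-diff` in `argocd-cm`), so treat the snippet as something to verify against the actual file rather than a copy of it.

```yaml
# values.yaml sketch, assuming the argo-helm argo-cd chart.
configs:
  params:
    # Rendered into argocd-cmd-params-cm; upstream documents this key as the
    # controller-wide Server-Side Diff switch.
    controller.diff.server.side: "true"
```

Upstream notes that the application controller has to be restarted before a change to this parameter takes effect.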
Sync Options (CRITICAL — do not add ApplyOutOfSyncOnly)
Standard sync options for all ApplicationSets:
syncOptions:
  - CreateNamespace=true
  - ServerSideApply=true # Server-side apply for better conflict resolution
  - RespectIgnoreDifferences=true # Honor ignoreDifferences for PVC, HTTPRoute, etc.
  - Replace=false # Use patch, not full replace
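`RespectIgnoreDifferences=true` only has an effect when the Application also declares `ignoreDifferences`. A minimal sketch of such a pairing follows; the specific fields being ignored are assumptions chosen for illustration, not the rules this cluster actually uses.

```yaml
# Fragment of an Application spec (or ApplicationSet template spec).
spec:
  ignoreDifferences:
    - group: ""
      kind: PersistentVolumeClaim
      jsonPointers:
        - /spec/volumeName                      # assumed example field
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      jqPathExpressions:
        - .spec.rules[].backendRefs[].weight    # assumed example field
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
      - RespectIgnoreDifferences=true           # leave the ignored fields alone during sync too
```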
DO NOT add these options:
- ApplyOutOfSyncOnly=true: even with Server-Side Diff enabled, this option has known edge cases with key removal. It is not worth the risk for a homelab-scale cluster.
- IgnoreMissingTemplate=true: this option can mask real template errors in ApplicationSets.