Recover from a LAPI or Postgres outage

This guide describes the expected behavior and recovery actions for two component failures: the LAPI engine pod and the Postgres backend cluster.

Expected plugin behavior

The Traefik plugin is configured to fail open at the LAPI layer. When LAPI is unreachable, the plugin keeps enforcing the decisions in its existing cache for as long as the cache TTL allows, then passes all traffic through unblocked.

This trade-off keeps services available during component outages at the cost of a security gap that lasts until LAPI is reachable again.

Warning

Fail-open only holds once the plugin already has a cached decision stream. If Traefik (or the plugin) restarts while LAPI is unreachable, the startup stream fetch fails and the plugin fails closed: every non-trusted external client gets HTTP 403 on every site, while internal ClientTrustedIPs (private ranges) keep working. This is what turned a long-standing LAPI outage into a cluster-wide 403 on 2026-06-15 when Traefik restarted; see Scenario D. To confirm that the bouncer, not the app, is returning the 403, test the app from inside the cluster: a healthy app still answers (for example with a 302), while every external client IP gets 403 through Traefik.
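The bouncer-versus-app check from the warning can be sketched as a small shell helper. This is a minimal sketch, not part of the existing tooling: the function name classify_403 and its output labels are hypothetical, and it assumes you have already collected one HTTP status code from inside the cluster and one from a non-trusted external client.

```shell
# classify_403 IN_CLUSTER_STATUS EXTERNAL_STATUS   (hypothetical helper)
#
# Collect each status code with curl, for example:
#   curl -s -o /dev/null -w '%{http_code}' <url>
# once from a pod inside the cluster (against the app's Service) and once
# from an external, non-trusted client (through Traefik).
classify_403() {
  in_cluster="$1"
  external="$2"

  # External clients are not seeing 403, so this is not the fail-closed case.
  if [ "$external" != "403" ]; then
    echo "not-blocked"
    return 0
  fi

  # External 403 while the app answers 2xx/3xx in-cluster points at the bouncer.
  case "$in_cluster" in
    2??|3??) echo "bouncer" ;;
    *)       echo "unclear" ;;
  esac
}
```

For example, classify_403 302 403 prints bouncer, which matches the 2026-06-15 pattern described above.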

Scenario A: LAPI pod down

Symptoms:

kubectl get pods -n crowdsec shows crowdsec-lapi-... in CrashLoopBackOff or Error
The ArgoCD application health shows Degraded
Traefik plugin logs show LAPI unreachable warnings

What happens to user traffic:

Seconds 0 to 60: the plugin cache continues to enforce known decisions
From second 60 onward: plugin polls fail, cached decisions expire per TTL, and the plugin transitions to fail-open
Services remain reachable for all clients, including IPs listed in the CAPI feed; no protection is enforced during this window

Recovery:

Inspect the logs to identify the failure mode:

kubectl logs -n crowdsec deploy/crowdsec-lapi --tail=100

Common causes:

Postgres temporarily unreachable — usually recovers on its own
OOMKill — increase resources.limits.memory in engine.ts
Helm-upgrade conflict — see the Phase 1 plan’s “Helm Replace” section for the recovery procedure

Force a restart if needed:

kubectl rollout restart deploy/crowdsec-lapi -n crowdsec
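After forcing a restart it helps to wait for the rollout to finish before judging whether LAPI recovered. The following is a minimal sketch; the function name restart_and_wait and the default 120s timeout are assumptions, while kubectl rollout status with --timeout is a standard kubectl subcommand.

```shell
# restart_and_wait NAMESPACE DEPLOYMENT [TIMEOUT]   (hypothetical helper)
# Restarts the deployment, then blocks until the rollout completes or the
# timeout expires. The exit status is that of `kubectl rollout status`.
restart_and_wait() {
  ns="$1"
  deploy="$2"
  timeout="${3:-120s}"   # assumed default, tune to your pod start-up time

  kubectl rollout restart "deploy/$deploy" -n "$ns" || return 1
  kubectl rollout status "deploy/$deploy" -n "$ns" --timeout="$timeout"
}

# Usage:
#   restart_and_wait crowdsec crowdsec-lapi 180s
```

If the rollout times out, return to the log inspection step: a restart does not help while the underlying cause (for example an unreachable Postgres) persists.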

Scenario B: Postgres outage

Symptoms:

kubectl get cluster -n crowdsec crowdsec-db does not report “Cluster in healthy state”
One of the two pods in crowdsec-db-* is not Running

What happens to LAPI:

LAPI cannot persist new decisions, but its memory cache holds existing ones
Plugin continues polling and receives the in-memory decisions
New CAPI pulls cannot be persisted until Postgres recovers

What happens to user traffic:

Same as normal operation, as long as the LAPI pod itself stays running

Recovery:

CNPG has automatic failover. If the primary fails, the standby takes over within roughly 30 seconds.

Check cluster status:

kubectl describe cluster -n crowdsec crowdsec-db

Identify the primary pod:

kubectl get pods -n crowdsec -l role=primary

Inspect the failing pod’s logs:

kubectl logs -n crowdsec <pod-name>

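For a manual switchover in extreme cases, promote a healthy standby with the cnpg kubectl plugin. The plugin runs from your workstation, not inside the postgres container:

kubectl cnpg promote -n crowdsec crowdsec-db <standby-pod-name>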
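To confirm that a failover or switchover has settled, the cluster phase can be polled until it reports the healthy string quoted in the symptoms. This is a sketch under assumptions: the function names, attempt count, and delay are hypothetical, and it assumes the CNPG Cluster resource exposes the phase text in .status.phase.

```shell
# cnpg_is_healthy NAMESPACE CLUSTER   (hypothetical helper)
# Succeeds when the Cluster resource reports the healthy phase string.
cnpg_is_healthy() {
  phase=$(kubectl get cluster -n "$1" "$2" -o jsonpath='{.status.phase}')
  [ "$phase" = "Cluster in healthy state" ]
}

# wait_for_healthy NAMESPACE CLUSTER [ATTEMPTS] [DELAY_SECONDS]
# Polls until the cluster is healthy or the attempts run out.
wait_for_healthy() {
  attempts="${3:-12}"   # assumed default: 12 checks
  delay="${4:-10}"      # assumed default: 10 seconds apart
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if cnpg_is_healthy "$1" "$2"; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not healthy after $attempts checks"
  return 1
}

# Usage:
#   wait_for_healthy crowdsec crowdsec-db
```

With the defaults this waits up to about two minutes, comfortably longer than the roughly 30 seconds an automatic failover is expected to take.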

Scenario C: Both fail simultaneously

When a cluster node fails and both LAPI and the Postgres primary lived on that node:

Pods reschedule to other nodes (this can take 1 to 5 minutes)
During this window CrowdSec is fully offline: the plugin is fail-open and no decisions are enforced, provided Traefik itself did not restart (see the warning above)
Once pods are Running again, LAPI synchronizes with the database
The plugin polls LAPI and applies the new decisions
Full protection returns after one or two pull cycles

This behavior is accepted for Phase 2. For higher availability, a multi-replica LAPI setup is required — currently blocked by the Longhorn XFS minimum size limit conflicting with the chart’s default PVC sizing.

Scenario D: Postgres PVC full (root cause of the 2026-06-15 outage)

Symptoms:

kubectl get cluster -n crowdsec crowdsec-db reports Not enough disk space
The primary crowdsec-db-* pod CrashLoops with Detected low-disk space condition, avoid starting the instance
crowdsec-db-rw and crowdsec-service Endpoints are empty, LAPI CrashLoops, and — if the Traefik plugin also restarted — every external client gets 403 (see the warning above)

The CrowdSec database has no WAL archiving configured, so this is plain data growth outgrowing a too-small PVC, not a WAL runaway.

Recovery — PVC expansion is non-destructive, and storageClass longhorn has allowVolumeExpansion: true:

# 1) raise the desired size in the cluster spec
kubectl -n crowdsec patch cluster crowdsec-db --type merge -p '{"spec":{"storage":{"size":"5Gi"}}}'

# 2) a dead primary stops CNPG from propagating the resize — patch the PVCs directly
kubectl -n crowdsec patch pvc crowdsec-db-2 --type merge -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'
kubectl -n crowdsec patch pvc crowdsec-db-3 --type merge -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'

# 3) once Longhorn has grown the block device, delete the crashlooping primary so it
#    remounts: the filesystem resizes, the low-disk check passes, and Postgres starts
kubectl -n crowdsec delete pod <crashlooping-primary-pod>

Make the change durable by bumping postgres.storageSize in dp-infra/crowdsec/config.yaml, then run npm run build and push the result. The real prevention is a disk-usage alert on the CNPG PVCs.
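Until such an alert exists, a manual usage check can stand in for it. The sketch below parses df -P output and warns above a threshold. The function names and the 80 percent threshold are hypothetical, and the usage line assumes that df is available in the postgres container and that the data volume is mounted at /var/lib/postgresql/data; verify both in your image before relying on it.

```shell
# usage_pct: reads `df -P` output on stdin and prints the use percentage
# of the last filesystem line as a bare integer.
usage_pct() {
  awk 'NR > 1 { gsub("%", "", $5); pct = $5 } END { print pct }'
}

# check_usage THRESHOLD   (hypothetical helper)
# Reads `df -P` output on stdin. Returns 1 at or above THRESHOLD percent,
# 2 when no usage figure could be parsed, 0 otherwise.
check_usage() {
  threshold="$1"
  pct=$(usage_pct)
  if [ -z "$pct" ]; then
    echo "UNKNOWN: could not parse df output"
    return 2
  fi
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN: volume at ${pct}% (threshold ${threshold}%)"
    return 1
  fi
  echo "OK: volume at ${pct}%"
}

# Usage (mount path and container name are assumptions):
#   kubectl exec -n crowdsec <db-pod-name> -c postgres -- \
#     df -P /var/lib/postgresql/data | check_usage 80
```

This only works while the database pod is running, so it is a prevention check, not a diagnostic for a primary that is already crashlooping.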

Monitoring (currently manual)

Per user preference, no Alertmanager wiring is configured. Run these manual daily checks:

kubectl get application crowdsec-app-c83f79ee -n argocd
kubectl get pods -n crowdsec

Alternatively, watch the “Running CrowdSec” stat panel on the Grafana “CrowdSec Overview” dashboard — it must show a value greater than zero.
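The daily pod check can be reduced to a list of pods that need attention. This is a minimal sketch with a hypothetical function name, not_running; it assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...).

```shell
# not_running   (hypothetical helper)
# Reads `kubectl get pods --no-headers` output on stdin and prints the names
# of pods that are neither Completed nor Running with all containers ready.
not_running() {
  awk '
    $3 == "Completed" { next }          # finished jobs are fine
    {
      split($2, ready, "/")             # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}

# Usage:
#   kubectl get pods -n crowdsec --no-headers | not_running
```

Empty output means every pod in the namespace is Running and ready; any printed name is a starting point for the scenarios above.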