Recover from a LAPI or Postgres outage
This guide describes the expected behavior and recovery actions for two component failures: the LAPI engine pod and the Postgres backend cluster.
Expected plugin behavior
The Traefik plugin is configured fail-open at the LAPI layer. When LAPI is unreachable, the plugin keeps enforcing its cached decisions until the cache TTL expires, then passes all traffic through unblocked.
This trade-off keeps services available during component outages at the cost of a brief security gap.
Scenario A: LAPI pod down
Symptoms:
kubectl get pods -n crowdsec shows crowdsec-lapi-... in CrashLoopBackOff or Error
The ArgoCD application health shows Degraded
Traefik plugin logs show LAPI unreachable warnings
What happens to user traffic:
Seconds 0 to 60: the plugin cache continues to enforce known decisions
Seconds 60 and beyond: plugin polls fail, cache expires per TTL, plugin transitions to fail-open
Services remain reachable for all clients, including those in the CAPI feed — no protection during this window
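The timeline above reduces to a single comparison between the time since LAPI went down and the cache TTL. The helper below is a minimal sketch of that rule, not the plugin's actual logic. The function name plugin_state is hypothetical, and the 60-second default is an assumption taken from the timeline; pass your configured TTL as the second argument.

```shell
# Hypothetical helper: report the plugin's expected state N seconds after
# LAPI became unreachable. Assumes a 60-second cache TTL by default.
plugin_state() {
    elapsed="$1"
    ttl="${2:-60}"
    if [ "$elapsed" -lt "$ttl" ]; then
        # Cached decisions are still valid and still enforced.
        echo "cache-enforced"
    else
        # Cache has expired: all traffic passes through unblocked.
        echo "fail-open"
    fi
}
```

For example, plugin_state 30 reports that cached decisions are still enforced, while plugin_state 60 reports fail-open.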
Recovery:
Inspect the logs to identify the failure mode:
kubectl logs -n crowdsec deploy/crowdsec-lapi --tail=100
Common causes:
Postgres temporarily unreachable: usually recovers on its own
OOMKill: increase resources.limits.memory in engine.ts
Helm-upgrade conflict: see the Phase 1 plan’s “Helm Replace” section for the recovery procedure
Force a restart if needed:
kubectl rollout restart deploy/crowdsec-lapi -n crowdsec
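A restart only helps if the new pod actually becomes ready, so it can be useful to wait for the rollout to finish. The sketch below wraps the restart in a hypothetical function, restart_lapi. The 180-second timeout is an assumption, not a value from this guide; adjust it to how long the LAPI pod usually takes to start.

```shell
# Restart the LAPI deployment and block until the rollout completes.
# Returns non-zero if the restart fails or the rollout times out.
# The 180s timeout is an assumed value.
restart_lapi() {
    ns="${1:-crowdsec}"
    deploy="${2:-deploy/crowdsec-lapi}"
    kubectl rollout restart "$deploy" -n "$ns" || return 1
    kubectl rollout status "$deploy" -n "$ns" --timeout=180s
}
```

Calling restart_lapi with no arguments targets deploy/crowdsec-lapi in the crowdsec namespace.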
Scenario B: Postgres outage
Symptoms:
kubectl get cluster -n crowdsec crowdsec-db does not report “Cluster in healthy state”
One of the two pods in crowdsec-db-* is not Running
What happens to LAPI:
LAPI cannot persist new decisions, but its memory cache holds existing ones
Plugin continues polling and receives the in-memory decisions
New CAPI pulls cannot be persisted until Postgres recovers
What happens to user traffic:
Same as normal operation, as long as the LAPI pod itself stays running
Recovery:
CNPG has automatic failover. If the primary fails, the standby takes over within roughly 30 seconds.
Check cluster status:
kubectl describe cluster -n crowdsec crowdsec-db
Identify the primary pod:
kubectl get pods -n crowdsec -l role=primary
Inspect the failing pod’s logs:
kubectl logs -n crowdsec <pod-name>
For a manual switchover in extreme cases (this requires the cnpg kubectl plugin on your workstation):
kubectl cnpg promote crowdsec-db 2 -n crowdsec
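The status checks in this section can be condensed into one call that reads the phase and current primary from the CNPG Cluster resource. This is a sketch under assumptions: the function name db_status is hypothetical, and it relies on the status.phase and status.currentPrimary fields that CNPG publishes on the Cluster object.

```shell
# Print the CNPG cluster phase and current primary on one line.
# Returns zero only when the phase is "Cluster in healthy state".
db_status() {
    ns="${1:-crowdsec}"
    cluster="${2:-crowdsec-db}"
    phase="$(kubectl get cluster "$cluster" -n "$ns" -o jsonpath='{.status.phase}')"
    primary="$(kubectl get cluster "$cluster" -n "$ns" -o jsonpath='{.status.currentPrimary}')"
    echo "phase=$phase primary=$primary"
    [ "$phase" = "Cluster in healthy state" ]
}
```

Because the function returns non-zero for any other phase, it can be used in a condition such as: db_status || echo "database needs attention".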
Scenario C: Both fail simultaneously
When a cluster node fails and both LAPI and the Postgres primary lived on that node:
Pods reschedule to other nodes (this can take 1 to 5 minutes)
During this window CrowdSec is fully offline — plugin is fail-open, no decisions enforced
Once pods are Running again, LAPI synchronizes with the database
The plugin polls LAPI and applies the new decisions
Full protection returns after one or two pull cycles
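To know when the rescheduling window has closed, you can block until every pod in the namespace reports Ready. The sketch below uses kubectl wait. The function name wait_for_crowdsec is hypothetical, and the 300-second default is an assumption based on the upper end of the 1 to 5 minute estimate above. Note that --all also matches pods from completed Jobs, which never become Ready, so narrow the selection with a label selector if the namespace contains any.

```shell
# Block until all pods in the namespace are Ready, or the timeout elapses.
# The 300s default mirrors the upper rescheduling estimate (assumed value).
wait_for_crowdsec() {
    ns="${1:-crowdsec}"
    timeout="${2:-300s}"
    kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}
```

A non-zero exit status means at least one pod was still not Ready when the timeout expired.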
This behavior is accepted for Phase 2. For higher availability, a multi-replica LAPI setup is required — currently blocked by the Longhorn XFS minimum size limit conflicting with the chart’s default PVC sizing.
See also
Reference: configuration describes the engine helm values that control LAPI replica count and persistent volume settings.
Monitoring (currently manual)
Per user preference, no Alertmanager wiring is configured. Run these manual daily checks:
kubectl get application crowdsec-app-c83f79ee -n argocd
kubectl get pods -n crowdsec
Alternatively, watch the “Running CrowdSec” stat panel on the Grafana “CrowdSec Overview” dashboard — it must show a value greater than zero.
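The two kubectl checks above can be folded into a single pass/fail helper for the daily routine. This is a minimal sketch, not part of the deployed setup. The function name daily_check is hypothetical. It reads status.health.status from the ArgoCD Application and treats any pod whose STATUS column is neither Running nor Completed as unhealthy, which is a simplifying assumption (a Running pod may still be unready).

```shell
# Summarize ArgoCD health and pod status in one line.
# Returns zero only if the application is Healthy and no pod is in a
# status other than Running or Completed.
daily_check() {
    app="${1:-crowdsec-app-c83f79ee}"
    health="$(kubectl get application "$app" -n argocd -o jsonpath='{.status.health.status}')"
    # Column 3 of "kubectl get pods --no-headers" is STATUS.
    bad="$(kubectl get pods -n crowdsec --no-headers \
        | awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }')"
    echo "argocd_health=$health unhealthy_pods=$bad"
    [ "$health" = "Healthy" ] && [ "$bad" -eq 0 ]
}
```

Run daily_check once a day; a non-zero exit status points you to Scenario A, B, or C above.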