How-to · Troubleshooting

Recover from a LAPI or Postgres outage

This guide describes the expected behavior and recovery actions for two component failures: the LAPI engine pod and the Postgres backend cluster.

Expected plugin behavior

The Traefik plugin is configured fail-open at the LAPI layer. When LAPI is unreachable, the plugin keeps enforcing its cached decisions until the cache TTL expires, then passes all traffic through unblocked.

This trade-off keeps services available during component outages, at the cost of a security gap that lasts from cache expiry until LAPI is reachable again.

Scenario A: LAPI pod down

Symptoms:

  • kubectl get pods -n crowdsec shows crowdsec-lapi-... in CrashLoopBackOff or Error

  • The ArgoCD application health shows Degraded

  • Traefik plugin logs show LAPI unreachable warnings
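The first symptom can be checked mechanically with a small filter over the pod listing. This is a minimal sketch that assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...); the function name failing_lapi_pods is illustrative and not part of any tool.

```shell
# Print the name and status of every crowdsec-lapi pod that is in
# CrashLoopBackOff or Error. Reads `kubectl get pods --no-headers`
# output on stdin (columns: NAME READY STATUS RESTARTS AGE).
failing_lapi_pods() {
  awk '$1 ~ /^crowdsec-lapi-/ && ($3 == "CrashLoopBackOff" || $3 == "Error") { print $1, $3 }'
}
```

Usage: kubectl get pods -n crowdsec --no-headers | failing_lapi_pods. No output means no LAPI pod is in either failure state.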

What happens to user traffic:

  • Seconds 0 to 60: plugin polls fail, but the plugin cache continues to enforce known decisions

  • Seconds 60 and beyond: the cache expires per TTL and the plugin transitions to fail-open

  • From then on, services remain reachable for all clients, including those in the CAPI feed; there is no protection until LAPI recovers

Recovery:

Inspect the logs to identify the failure mode:

kubectl logs -n crowdsec deploy/crowdsec-lapi --tail=100

Common causes:

  • Postgres temporarily unreachable — usually recovers on its own

  • OOMKill — increase resources.limits.memory in engine.ts

  • Helm-upgrade conflict — see the Phase 1 plan’s “Helm Replace” section for the recovery procedure
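To tell an OOMKill apart from the other causes, the last termination reason of each container can be read from the pod status. The sketch below assumes LAPI pod names start with crowdsec-lapi- (as in the symptoms above); the function names are illustrative.

```shell
# List each crowdsec-lapi pod with the reason its containers last terminated.
# "OOMKilled" points at the memory limit; an empty reason means no container
# in that pod has been terminated yet.
lapi_termination_reasons() {
  kubectl get pods -n crowdsec \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    grep '^crowdsec-lapi-'
}

# Succeeds if any crowdsec-lapi pod was last terminated by the OOM killer.
lapi_was_oomkilled() {
  lapi_termination_reasons | grep -q 'OOMKilled'
}
```

Usage: lapi_was_oomkilled && echo "raise resources.limits.memory". If the reason column is empty, fall back to the log inspection above.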

Force a restart if needed:

kubectl rollout restart deploy/crowdsec-lapi -n crowdsec
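A restart is only useful if the new pod actually becomes ready, so it can help to wait for the rollout to finish. A minimal sketch; the 120s default timeout is an assumed value, not one taken from this guide, and the function name is illustrative.

```shell
# Restart the LAPI deployment and block until the rollout finishes.
# Optional first argument: timeout (assumed default 120s).
restart_lapi_and_wait() {
  kubectl rollout restart deploy/crowdsec-lapi -n crowdsec &&
    kubectl rollout status deploy/crowdsec-lapi -n crowdsec --timeout="${1:-120s}"
}
```

A non-zero exit status means the restart was rejected or the rollout did not complete in time; in that case, return to the log inspection step.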

Scenario B: Postgres outage

Symptoms:

  • kubectl get cluster -n crowdsec crowdsec-db does not report “Cluster in healthy state”

  • One of the two crowdsec-db-* pods is not Running
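Both symptoms can be read from the Cluster resource in one pass. This sketch assumes the CNPG Cluster status exposes phase, readyInstances and instances, which is how recent CNPG releases report it as far as I know; verify the field names against your installed version. The function name is illustrative.

```shell
# Succeeds only when the CNPG cluster reports the healthy phase and every
# instance is ready. Prints a one-line summary either way.
cnpg_cluster_healthy() {
  phase=$(kubectl get cluster -n crowdsec crowdsec-db -o jsonpath='{.status.phase}')
  ready=$(kubectl get cluster -n crowdsec crowdsec-db -o jsonpath='{.status.readyInstances}')
  total=$(kubectl get cluster -n crowdsec crowdsec-db -o jsonpath='{.status.instances}')
  echo "phase=\"$phase\" ready=${ready:-0}/${total:-?}"
  [ "$phase" = "Cluster in healthy state" ] && [ "${ready:-0}" = "$total" ]
}
```

Usage: cnpg_cluster_healthy || echo "Postgres needs attention".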

What happens to LAPI:

  • LAPI cannot persist new decisions, but its in-memory cache holds existing ones

  • Plugin continues polling and receives the in-memory decisions

  • New CAPI pulls cannot be persisted until Postgres recovers

What happens to user traffic:

  • Same as normal operation, as long as the LAPI pod itself stays running

Recovery:

CNPG has automatic failover. If the primary fails, the standby takes over within roughly 30 seconds.

Check cluster status:

kubectl describe cluster -n crowdsec crowdsec-db

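Identify the primary pod. Recent CNPG releases label it cnpg.io/instanceRole=primary; the older role=primary label is deprecated, so use it only if the selector below returns nothing:

kubectl get pods -n crowdsec -l cnpg.io/cluster=crowdsec-db,cnpg.io/instanceRole=primary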

Inspect the failing pod’s logs:

kubectl logs -n crowdsec <pod-name>

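For a manual switchover in extreme cases, promote the standby with the cnpg kubectl plugin. The plugin runs from your workstation; it is not a binary inside the Postgres container. This example assumes crowdsec-db-2 is the current standby:

kubectl cnpg promote -n crowdsec crowdsec-db crowdsec-db-2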
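After a promotion, it is worth confirming that the cluster actually reports the new primary before moving on. The sketch below polls status.currentPrimary on the Cluster resource; the function name, the attempt count and the POLL_INTERVAL variable are assumptions made for illustration.

```shell
# Poll the Cluster resource until status.currentPrimary matches the expected
# instance, or give up after the given number of attempts (default 12).
# POLL_INTERVAL (seconds between attempts) defaults to 5.
wait_for_primary() {
  expected=$1
  attempts=${2:-12}
  while [ "$attempts" -gt 0 ]; do
    current=$(kubectl get cluster -n crowdsec crowdsec-db -o jsonpath='{.status.currentPrimary}')
    if [ "$current" = "$expected" ]; then
      echo "primary is now $current"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "primary is still ${current:-unknown}" >&2
  return 1
}
```

Usage: wait_for_primary crowdsec-db-2. With the defaults this waits up to about one minute.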

Scenario C: Both fail simultaneously

When a cluster node fails and both LAPI and the Postgres primary lived on that node:

  1. Pods reschedule to other nodes (this can take 1 to 5 minutes)

  2. During this window CrowdSec is fully offline — plugin is fail-open, no decisions enforced

  3. Once pods are Running again, LAPI synchronizes with the database

  4. The plugin polls LAPI and applies the new decisions

  5. Full protection returns after one or two pull cycles
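Steps 1 to 3 can be watched from the command line instead of re-running kubectl get pods by hand. A minimal sketch: the 300s default mirrors the upper end of the 1 to 5 minute rescheduling window above, and the function name is illustrative. Note that kubectl wait may report an error if a pod is deleted while it is waiting, which can happen during rescheduling; simply re-run it in that case.

```shell
# Block until every pod in the crowdsec namespace is Ready again.
# Optional first argument: timeout (default 300s).
wait_for_crowdsec_pods() {
  kubectl wait pod --all -n crowdsec --for=condition=Ready --timeout="${1:-300s}"
}
```

Once this returns successfully, allow one or two plugin pull cycles before treating protection as fully restored.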

This behavior is accepted for Phase 2. For higher availability, a multi-replica LAPI setup is required — currently blocked by the Longhorn XFS minimum size limit conflicting with the chart’s default PVC sizing.

See also

Reference: configuration describes the engine Helm values that control LAPI replica count and persistent volume settings.

Monitoring (currently manual)

Per user preference, no Alertmanager wiring is configured. Run these manual daily checks:

kubectl get application crowdsec-app-c83f79ee -n argocd
kubectl get pods -n crowdsec

Alternatively, watch the “Running CrowdSec” stat panel on the Grafana “CrowdSec Overview” dashboard — it must show a value greater than zero.
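The two kubectl checks can also be wrapped into a single function that exits non-zero when something needs attention. This is a sketch under assumptions: it reads the ArgoCD health from status.health.status on the Application resource, and it flags any pod whose STATUS column is not Running, which would also flag Completed job pods if the namespace has any. The function name is illustrative.

```shell
# Daily manual check: ArgoCD application health plus pods that are not Running.
# Returns non-zero if anything needs attention.
daily_crowdsec_check() {
  status=0
  health=$(kubectl get application crowdsec-app-c83f79ee -n argocd -o jsonpath='{.status.health.status}')
  echo "argocd health: ${health:-unknown}"
  [ "$health" = "Healthy" ] || status=1

  not_running=$(kubectl get pods -n crowdsec --no-headers | awk '$3 != "Running" { print $1 " (" $3 ")" }')
  if [ -n "$not_running" ]; then
    echo "pods not Running:"
    echo "$not_running"
    status=1
  fi
  return "$status"
}
```

Usage: daily_crowdsec_check || echo "follow Scenario A, B or C above".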