# Longhorn Resilience Configuration
Understand why Longhorn manager pods require custom health probe configuration and how this prevents CrashLoopBackOff issues during cluster operations.
## Overview
The Longhorn storage system’s manager DaemonSet requires enhanced health probe configuration to maintain stability during Kubernetes cluster disruptions. This explanation covers the technical reasons behind this requirement and the design decisions in the custom configuration.
## The Problem: Readiness Probe Chicken-and-Egg
### How Longhorn Manager Works
Each Longhorn manager pod serves dual roles:
- Manager functions: Orchestrates storage operations, manages volumes, coordinates replicas
- Admission webhook: Validates and mutates Longhorn custom resources via Kubernetes admission control
The manager pod exposes an HTTPS webhook endpoint on port 9502 at /v1/healthz:
```
┌─────────────────────────┐
│  longhorn-manager pod   │
│                         │
│  ┌──────────────────┐   │
│  │ Manager Process  │   │
│  └──────────────────┘   │
│           ↓             │
│  ┌──────────────────┐   │
│  │ Webhook Server   │   │
│  │ Port: 9502       │   │
│  │ Path: /v1/healthz│   │
│  └──────────────────┘   │
└─────────────────────────┘
```
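This is the same endpoint the kubelet's `httpGet` probes hit, so it can be exercised by hand from a node. A sketch (the pod IP is one from the endpoint list below; `-k` skips certificate verification, as the kubelet does for HTTPS probes):

```bash
# Call the webhook health endpoint the way kubelet's httpGet probe does.
# Pod IP is illustrative; take a real one from `kubectl get pods -o wide`.
curl -sk --max-time 5 https://10.42.1.198:9502/v1/healthz
```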
All manager pods register their webhook endpoints with the longhorn-admission-webhook Service:
```
Service: longhorn-admission-webhook.longhorn-system.svc:9502
  ↓ endpoints
  ├─ 10.42.1.198:9502 (pod on node fsn1-yim)
  ├─ 10.42.2.148:9502 (pod on node fsn1-wej)
  ├─ 10.42.3.33:9502  (pod on node nbg1-qzp)
  └─ 10.42.4.123:9502 (pod on node fsn1-qdf)
```
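Whether a given pod actually made it into this endpoint list is easy to verify; only pods whose readiness probe passes appear:

```bash
# List the webhook Service's registered endpoints; NotReady pods are absent.
kubectl get endpoints longhorn-admission-webhook -n longhorn-system -o wide
```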
### The Circular Dependency
Default Longhorn configuration (from Helm chart):
```yaml
readinessProbe:
  httpGet:
    path: /v1/healthz
    port: 9502
    scheme: HTTPS
  failureThreshold: 3
  periodSeconds: 10
  timeoutSeconds: 1
```
Timeline of failure during network disruption:
```
Time     Event
────────────────────────────────────────────────────────
T+0s     Pod starts after K3s upgrade
T+0s     CNI initializing (Cilium network setup)
T+0s     Readiness probe begins checking webhook endpoint
T+0s     First probe fails (webhook server not yet listening)
T+10s    Second probe fails (webhook initializing, TLS certs loading)
T+20s    Third probe fails (network not fully initialized)
T+30s    Kubernetes marks pod "NotReady" (3 consecutive failures)
         ↓
         CRITICAL: Pod never added to Service endpoints
         ↓
T+60s    Manager checks if webhook service is accessible
         Tries: https://longhorn-admission-webhook.longhorn-system.svc:9502
         Result: Timeout (no healthy endpoints - this pod was never added!)
         ↓
T+120s   Manager crashes with fatal error
         "admission webhook service is not accessible after 2m0s"
         ↓
T+120s   Container restarts
         ↓
         Cycle repeats indefinitely (chicken-and-egg loop)
```
Why this is a chicken-and-egg problem:
1. Pod needs to be "Ready" to be added to Service endpoints
2. Pod checks if the Service is accessible before considering itself ready
3. Service has no endpoints because the pod isn't Ready yet
4. Pod can never become Ready because the Service isn't accessible
5. Loop continues forever
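A pod caught in this loop is recognizable by its climbing restart count and by the fatal webhook error (quoted in the timeline above) at the end of the previous container's logs. A quick check, with a placeholder pod name:

```bash
# Restart counts reveal the loop; the previous run's logs show the fatal error.
kubectl get pods -n longhorn-system -l app=longhorn-manager
kubectl logs -n longhorn-system <longhorn-manager-pod> --previous | tail -n 20
```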
### Why Only Some Nodes Fail
Network initialization timing varies by node:
Fast initialization (pod succeeds):
```
T+0s     Pod starts
T+5s     CNI fully initialized
T+8s     Webhook server listening
T+10s    First readiness probe succeeds ✅
T+10s    Pod marked "Ready", added to Service
```
Slow initialization (pod fails):
```
T+0s     Pod starts
T+15s    CNI still initializing
T+25s    Webhook server listening
T+30s    Three readiness probes failed ❌
T+30s    Pod marked "NotReady", never added to Service
T+60s    Manager can't reach Service (no endpoints)
T+120s   Fatal crash, infinite restart loop
```
Factors affecting initialization speed:
- Network interface configuration time
- DNS resolution delays
- CNI plugin initialization
- TLS certificate generation
- Node load/resource contention
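This variance shows up directly in probe events after an upgrade: slow nodes accumulate `Unhealthy` events while fast nodes show none. One way to compare:

```bash
# Compare readiness-probe failures across nodes right after an upgrade.
kubectl get events -n longhorn-system --sort-by=.lastTimestamp \
  --field-selector reason=Unhealthy
```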
## Kubernetes Health Probe Types
Kubernetes provides three types of health probes, each with different purposes:
### 1. Startup Probe
Purpose: Give slow-starting containers extra time to initialize
Behavior:
- Checked first, before readiness or liveness probes
- All other probes disabled until startup probe succeeds
- If startup probe doesn't succeed within configured time, pod is killed
Use case: Applications with variable initialization times (like Longhorn manager during network disruptions)
### 2. Liveness Probe
Purpose: Detect and recover from application deadlock or stuck states
Behavior:
- Checked periodically after startup probe succeeds
- If liveness probe fails, Kubernetes kills and recreates the pod
- Restarts pod to attempt recovery from stuck state
Use case: Applications that can enter unrecoverable states but stay “running” (exactly Longhorn’s CrashLoopBackOff scenario)
### 3. Readiness Probe
Purpose: Determine when pod is ready to receive traffic
Behavior:
- Checked periodically after startup probe succeeds
- If readiness probe fails, pod is removed from Service endpoints
- Pod stays running, just not receiving traffic
Use case: Gracefully handle temporary unavailability (load, queue backlog, etc.)
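The three probe types combine on a single container, with the startup probe gating the other two. A generic sketch (container name, port, and path are illustrative, not Longhorn's actual values):

```yaml
containers:
  - name: example-app          # illustrative container
    startupProbe:              # runs first; the other probes stay disabled until it passes
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30     # generous budget for slow starts
      periodSeconds: 10
    livenessProbe:             # after startup: restart the container if it gets stuck
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 30
    readinessProbe:            # after startup: gate Service traffic; never restarts
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
```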
## Custom Probe Configuration Design
### Configuration Applied
Strategic merge patch (`40-G-longhorn-manager-probes-patch.yaml.tpl`):
```yaml
containers:
  - name: longhorn-manager
    startupProbe:
      httpGet:
        path: /v1/healthz
        port: 9502
        scheme: HTTPS
      failureThreshold: 30   # 30 failures × 10s = 300s max
      periodSeconds: 10
      timeoutSeconds: 5
    livenessProbe:
      httpGet:
        path: /v1/healthz
        port: 9502
        scheme: HTTPS
      failureThreshold: 3    # 3 failures × 30s = 90s tolerance
      periodSeconds: 30
      timeoutSeconds: 5
    readinessProbe:
      httpGet:
        path: /v1/healthz
        port: 9502
        scheme: HTTPS
      failureThreshold: 3
      periodSeconds: 10
      timeoutSeconds: 5      # Increased from 1s
```
### Design Rationale
#### Startup Probe: 5-Minute Grace Period
Configuration:
- `failureThreshold: 30` - Allow 30 failures before giving up
- `periodSeconds: 10` - Check every 10 seconds
- `timeoutSeconds: 5` - 5-second timeout per check
- Total: 300 seconds (5 minutes) maximum startup time
Why 5 minutes:
- K3s upgrades can cause 30-60 second network disruptions
- CNI initialization varies: 5-45 seconds typical, 120s worst case observed
- TLS certificate generation: 2-10 seconds
- DNS propagation: 5-30 seconds
- Safety margin: 2× worst case = 4 minutes, rounded to 5 for comfort
Trade-off:
- ✅ Prevents false failures during legitimate slow starts
- ✅ Allows pods to succeed in challenging network conditions
- ⚠️ Delays detection of truly broken pods by up to 5 minutes
- ⚠️ Acceptable: DaemonSet has 4 other healthy replicas providing service
#### Liveness Probe: Automatic Recovery
Configuration:
- `failureThreshold: 3` - Allow 3 failures before restart
- `periodSeconds: 30` - Check every 30 seconds (less aggressive)
- `timeoutSeconds: 5` - 5-second timeout
- Total: 90 seconds of continuous failure before restart
Why add liveness probe:
- Without liveness: Pod can stay in CrashLoopBackOff indefinitely (11 days observed in incident)
- With liveness: Kubernetes automatically kills and recreates truly stuck pods
- 90-second tolerance: Prevents restarts during brief network hiccups
Why 30-second interval (vs 10s):
- Webhook endpoint is lightweight, not resource-intensive
- Reduces unnecessary health check load
- Still detects failures within 90 seconds (acceptable SLA for automated recovery)
Trade-off:
- ✅ Automatic recovery from stuck states (no manual intervention needed)
- ✅ Less aggressive than readiness (30s vs 10s interval)
- ⚠️ Up to 90 seconds before recovery initiated
- ⚠️ Acceptable: This is for catastrophic failures, not normal operation
#### Readiness Probe: Network Latency Tolerance
Configuration:
- `failureThreshold: 3` - Keep default (3 failures)
- `periodSeconds: 10` - Keep default (10 seconds)
- `timeoutSeconds: 5` - Increased from 1s to 5s
Why 5-second timeout:
- HTTPS health checks require TLS handshake: 50-200ms typical, 1s+ during load
- Network latency during upgrades: 100-500ms typical, 2s worst case
- DNS resolution: 50-500ms typical, 1s+ during CoreDNS load
- Safety margin: 5s handles 99.9th percentile scenarios
Why keep other settings:
- `failureThreshold: 3` maintains original behavior (30s window)
- `periodSeconds: 10` provides quick traffic routing decisions
Trade-off:
- ✅ Tolerates realistic network latency scenarios
- ✅ Prevents false-positive "NotReady" marking
- ⚠️ Takes up to ~30 seconds (3 failed probes at 10s intervals) to detect truly down endpoints
- ⚠️ Acceptable: Service has multiple endpoints, brief delays OK
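One way to sanity-check the 5-second budget is to measure actual probe latency, TLS handshake included, from a pod inside the cluster (the write-out variables are standard `curl`):

```bash
# Measure health-check latency broken down by DNS, TLS handshake, and total.
curl -sk -o /dev/null \
  -w 'dns: %{time_namelookup}s  tls: %{time_appconnect}s  total: %{time_total}s\n' \
  https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz
```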
### Probe Sequence During Startup
Successful startup with custom probes:
```
Time    Probe           Result   Pod State  Service Endpoints
─────────────────────────────────────────────────────────────────
T+0s    startupProbe    Fail     Starting   Not added (expected)
T+10s   startupProbe    Fail     Starting   Not added (expected)
T+20s   startupProbe    Fail     Starting   Not added (expected)
T+30s   startupProbe    Success  Starting   Not added yet
        ↓ startupProbe succeeds, enable readiness/liveness probes
T+30s   readinessProbe  Success  Ready      ✅ Added to Service!
        ↓ Pod now receives traffic
T+60s   livenessProbe   Success  Ready      In Service
T+90s   readinessProbe  Success  Ready      In Service
```
Compare to default probes (for reference):
```
Time    Probe           Result   Pod State  Service Endpoints
─────────────────────────────────────────────────────────────────
T+0s    readinessProbe  Fail     Starting   Not added
T+10s   readinessProbe  Fail     Starting   Not added
T+20s   readinessProbe  Fail     Starting   Not added
T+30s   ❌ Pod marked NotReady (3 failures)
T+60s   Manager checks Service accessibility
T+60s   ❌ Service has no endpoints (this pod not added)
T+120s  ❌ Fatal crash: "webhook service not accessible"
```
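Both sequences can be observed live during a rollout by watching the `READY` column: with the custom probes it flips from `0/1` to `1/1` once the startup probe passes:

```bash
# Watch manager pods transition to Ready as startup probes succeed.
kubectl get pods -n longhorn-system -l app=longhorn-manager -w
```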
## Implementation Approach
### Why Strategic Merge Patch
Challenge: The Longhorn Helm chart doesn't expose probe configuration in `values.yaml`.
Solution Options Evaluated:
- ❌ Fork Helm chart: High maintenance burden, must track upstream changes
- ❌ Manual kubectl patch: Not persistent, lost on cluster rebuild
- ✅ Strategic merge patch via extra-manifests: Infrastructure-as-code, automatically applied
Strategic merge patch (`40-G-longhorn-manager-probes-patch.yaml.tpl`):
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: longhorn-manager
  namespace: longhorn-system
spec:
  template:
    spec:
      containers:
        - name: longhorn-manager
          startupProbe: { ... }   # Added to existing container config
          livenessProbe: { ... }  # Added to existing container config
          readinessProbe: { ... } # Merged with existing readinessProbe
```
How it works:
1. OpenTofu applies extra-manifests after Helm chart deployment
2. Kubernetes strategically merges the patch with the existing DaemonSet
3. Probe configuration is added/updated without replacing the entire DaemonSet
4. Longhorn manager pods rolling-restart with the new probe configuration
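The merge can also be previewed by hand before OpenTofu applies it. A sketch, assuming the rendered patch has been saved locally as `probes-patch.yaml` (filename illustrative); `--dry-run=server` asks the API server to compute the merged DaemonSet without persisting it:

```bash
# Preview the strategic merge server-side without applying it.
kubectl patch daemonset longhorn-manager -n longhorn-system \
  --type strategic --patch-file probes-patch.yaml \
  --dry-run=server -o yaml | grep -B 1 -A 9 startupProbe
```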
### Why Extra-Manifests
Advantages:
- ✅ Version-controlled with infrastructure code
- ✅ Automatically applied on every `tofu apply`
- ✅ Persists across cluster rebuilds
- ✅ No custom Helm chart maintenance
- ✅ Easy to modify or remove
Disadvantages:
- ⚠️ Applied after Helm chart (brief window where default probes exist)
- ⚠️ Requires understanding of strategic merge semantics
- ⚠️ May conflict with future Helm chart changes
## Comparison: Default vs Custom
### Startup Scenario Outcomes
| Scenario | Network Init Time | Default Probes | Custom Probes |
|---|---|---|---|
| Fast node | 10 seconds | ✅ Success | ✅ Success |
| Normal node | 25 seconds | ✅ Success | ✅ Success |
| Slow node | 45 seconds | ❌ CrashLoopBackOff | ✅ Success |
| Very slow node | 90 seconds | ❌ CrashLoopBackOff | ✅ Success |
| Broken pod | 5+ minutes | ❌ CrashLoopBackOff forever | ❌ Killed after 5 min, recreated |
### Recovery from Stuck State
| Situation | Default Probes | Custom Probes |
|---|---|---|
| Pod stuck for 5 minutes | ⚠️ Stays running, unusable | ✅ Killed at 5 min (startup timeout) |
| Pod stuck for 11 days | ⚠️ Stays running forever | ✅ Killed at 90s (liveness failure) |
| Manual intervention | Required (delete pod) | Not needed (automatic recovery) |
### Network Latency Handling
| Latency | Default (1s timeout) | Custom (5s timeout) |
|---|---|---|
| 50ms | ✅ Success | ✅ Success |
| 500ms | ✅ Success | ✅ Success |
| 1.5s | ❌ Probe timeout | ✅ Success |
| 3s | ❌ Probe timeout | ✅ Success |
| 6s | ❌ Probe timeout | ❌ Probe timeout |
## Observability
### Monitoring Probe Health
Check current configuration:
```bash
kubectl get ds longhorn-manager -n longhorn-system -o yaml | \
  grep -E "(startupProbe|livenessProbe|readinessProbe)" -A 10
```
Monitor probe failures:
```bash
# Watch for Unhealthy events
kubectl get events -n longhorn-system \
  --field-selector involvedObject.kind=Pod,reason=Unhealthy -w

# Example output if a probe fails:
# Readiness probe failed: Get "https://10.42.3.33:9502/v1/healthz":
# context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Check restart counts (indicator of liveness probe failures):
```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```
### Metrics and Alerts
Prometheus metrics (from kube-state-metrics):
```promql
# Pods in CrashLoopBackOff
kube_pod_container_status_waiting_reason{
  namespace="longhorn-system",
  pod=~"longhorn-manager.*",
  reason="CrashLoopBackOff"
}

# Restart count trend
rate(kube_pod_container_status_restarts_total{
  namespace="longhorn-system",
  pod=~"longhorn-manager.*"
}[5m])
```
Recommended alerts:
```yaml
# Alert on CrashLoopBackOff
- alert: LonghornManagerCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{namespace="longhorn-system",pod=~"longhorn-manager.*",reason="CrashLoopBackOff"} > 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Longhorn manager pod {{ $labels.pod }} in CrashLoopBackOff"

# Alert on high restart rate
- alert: LonghornManagerHighRestarts
  expr: rate(kube_pod_container_status_restarts_total{namespace="longhorn-system",pod=~"longhorn-manager.*"}[1h]) > 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Longhorn manager pod {{ $labels.pod }} restarting frequently"
```
## Lessons Learned
### Design Principles
1. Health probes checking self-hosted services create circular dependencies
    - Probe failure → pod not Ready → not in Service → can't reach Service
    - Mitigation: extended startup grace period
2. Default Helm chart values may not suit all deployment scenarios
    - Charts optimize for standard environments
    - Real-world network timing varies significantly
    - Custom patches are sometimes necessary
3. Missing liveness probes mean no automatic recovery
    - Kubernetes assumes "running" containers are healthy
    - Need explicit liveness probe for self-healing
    - Trade-off: balance between aggressive recovery and stability
4. Probe timeouts must account for worst-case network conditions
    - HTTPS requires TLS handshake overhead
    - Network latency during upgrades is higher than normal
    - 1-second timeouts are insufficient for realistic scenarios
### Operational Insights
From the 11-day CrashLoopBackOff incident:
- ✅ Random timing means some nodes are always at risk during disruptions
- ✅ Manual intervention doesn't scale (pod deletion required monitoring)
- ✅ Aggressive probes create false failures more often than they detect real problems
- ✅ Missing probes prevent self-healing (no liveness probe = no auto-recovery)
## Technical References
Implementation Files:
- `kube-hetzner/extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl`
- `docs/plan/longhorn-manager-probes.md`