Troubleshoot Longhorn Manager Pod CrashLoopBackOff¶
Learn how to diagnose and resolve Longhorn manager pods stuck in CrashLoopBackOff, causing Longhorn nodes to appear “down” in the UI.
Problem Description¶
Symptom: Longhorn node shows status “down” in Longhorn UI (https://longhorn.ops.kup6s.net), and the corresponding longhorn-manager pod is stuck in CrashLoopBackOff with high restart counts.
Example:
kubectl get pods -n longhorn-system | grep longhorn-manager
NAME READY STATUS RESTARTS AGE
longhorn-manager-45l8m 1/2 CrashLoopBackOff 102 (60s ago) 11d
# ^^^^^ Manager container failing
Pod logs show:
level=fatal msg="Error starting webhooks: admission webhook service is not accessible
on cluster after 2m0s sec: timed out waiting for endpoint
https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz to be available"
Key differences from other issues:
❌ Node stuck after upgrade (covered in Nodes Stuck After K3S Upgrade): Node is cordoned but healthy
✅ This issue: Longhorn manager pod is crashing, node shows “down” in Longhorn UI
Root Cause: Readiness Probe Chicken-and-Egg Problem¶
The longhorn-manager pod crashes because it cannot reach its own admission webhook service, creating a circular dependency:
1. Pod starts after a K3s upgrade or node restart
2. Network initialization may be slow (CNI, DNS, service mesh starting up)
3. Readiness probe begins checking https://localhost:9502/v1/healthz every 10 seconds
4. Webhook is not ready within 30 seconds (default: 3 failures × 10s period)
5. Pod is marked “NotReady” by Kubernetes
6. Critical: the pod is never added to the longhorn-admission-webhook Service endpoints because it is NotReady
7. Pod crashes trying to verify the webhook service is accessible
8. Restart loop: the cycle repeats indefinitely (the pod can never reach the Service it should be part of)
This is a timing-sensitive issue: if the network initializes within 30 seconds, the pod succeeds; if not, it stays stuck in the restart loop.
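To see this chicken-and-egg state on a live cluster, compare the failing pod’s readiness with the webhook Service endpoints (a quick sketch; substitute your actual pod name):
# The crashing pod reports Ready=False, so its IP never appears in the
# endpoints of the Service it is waiting for.
kubectl get pod -n longhorn-system longhorn-manager-45l8m \
  -o jsonpath='{.status.podIP}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
kubectl get endpoints -n longhorn-system longhorn-admission-webhook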
Automatic Prevention: Custom Health Probes¶
Since 2025-11-09, the cluster includes custom health probe configuration that prevents this issue automatically.
How It Works¶
Custom probes added via extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl:
Enhanced Probes:
✅ startupProbe: 5-minute grace period for slow network initialization
✅ livenessProbe: Auto-restarts pods stuck in bad states (90s tolerance)
✅ readinessProbe: Enhanced timeout (5s instead of 1s) for network latency
Result: Longhorn manager pods can initialize properly even during network disruptions, and automatically recover if they get stuck.
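For reference, the following is a minimal, hand-applied sketch of equivalent probe settings, assuming the manager container is named longhorn-manager and serves /v1/healthz over HTTPS on port 9502 as described above. The authoritative configuration is the extra-manifest template; do not treat this patch as the exact file contents.
# Illustrative values only; the real cluster applies these via
# 40-G-longhorn-manager-probes-patch.yaml.tpl, not a manual patch.
cat <<'EOF' > /tmp/longhorn-manager-probes-patch.yaml
spec:
  template:
    spec:
      containers:
      - name: longhorn-manager            # strategic merge: matched by container name
        startupProbe:
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          periodSeconds: 10
          failureThreshold: 30             # up to 5 minutes for slow network init
        livenessProbe:
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          periodSeconds: 30
          failureThreshold: 3              # ~90s before an automatic restart
        readinessProbe:
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          periodSeconds: 10
          timeoutSeconds: 5                # tolerate HTTPS latency (default is 1s)
EOF
kubectl patch daemonset longhorn-manager -n longhorn-system \
  --type=strategic --patch-file /tmp/longhorn-manager-probes-patch.yaml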
Verify Custom Probes Are Configured¶
Check that the custom probe configuration is applied:
# Check for all three probe types
kubectl get ds longhorn-manager -n longhorn-system -o yaml | \
grep -E "(startupProbe|livenessProbe|readinessProbe)"
# Expected output - all three should be present:
# startupProbe:
# livenessProbe:
# readinessProbe:
Detailed probe configuration:
# View full probe configuration
kubectl get ds longhorn-manager -n longhorn-system -o yaml | grep -A 15 "startupProbe:"
Expected values:
startupProbe: failureThreshold: 30, periodSeconds: 10 (5-minute max startup)
livenessProbe: failureThreshold: 3, periodSeconds: 30 (90s tolerance)
readinessProbe: timeoutSeconds: 5 (enhanced from default 1s)
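To print the live values for all three probes (container index 0 is the manager container, matching the jsonpath check used later in this guide):
# Compare against the expected values listed above
kubectl get ds longhorn-manager -n longhorn-system \
  -o jsonpath='{.spec.template.spec.containers[0].startupProbe}{"\n"}{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'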
Quick Fix: Delete the Stuck Pod¶
If a pod is already stuck in CrashLoopBackOff, delete it to force recreation:
1. Identify the Stuck Pod¶
# Find pods in CrashLoopBackOff
kubectl get pods -n longhorn-system | grep -i crash
# Example output:
# longhorn-manager-45l8m 1/2 CrashLoopBackOff 102 (60s ago) 11d
2. Verify Other Manager Pods Are Healthy¶
Important: Ensure at least 3 other longhorn-manager pods are running before deleting:
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Expected: At least 3-4 pods showing "2/2 Running"
# NAME READY STATUS RESTARTS AGE
# longhorn-manager-8fjqn 2/2 Running 4 11d ← Healthy
# longhorn-manager-gr8ws 2/2 Running 2 11d ← Healthy
# longhorn-manager-n78bn 2/2 Running 2 11d ← Healthy
# longhorn-manager-45l8m 1/2 CrashLoopBackOff 102 11d ← Problem
3. Delete the Stuck Pod¶
# Replace with your actual pod name
kubectl delete pod longhorn-manager-45l8m -n longhorn-system
What happens:
DaemonSet automatically creates a new pod to replace the deleted one
New pod starts with fresh state (no accumulated failures)
If network is stable, new pod initializes successfully within seconds
4. Monitor the New Pod¶
# Watch pod recreation
kubectl get pods -n longhorn-system -w -l app=longhorn-manager
# Wait for new pod to reach "2/2 Running"
5. Verify Node Recovery in Longhorn UI¶
Open Longhorn UI: https://longhorn.ops.kup6s.net/#/node
Expected: Previously “down” node should now show status “Schedulable” and “Ready”.
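If you prefer the CLI, Longhorn’s node custom resources expose the same state (assuming the nodes.longhorn.io CRD that Longhorn installs; column names may vary slightly by Longhorn version):
# CLI alternative to the UI: shows readiness and schedulability per Longhorn node
kubectl get nodes.longhorn.io -n longhorn-system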
Manual Diagnosis (When Quick Fix Doesn’t Work)¶
If the new pod also enters CrashLoopBackOff, investigate deeper:
1. Check Pod Logs¶
# Replace with your pod name
POD_NAME="longhorn-manager-45l8m"
# View manager container logs
kubectl logs -n longhorn-system $POD_NAME -c longhorn-manager --tail=100
Look for:
admission webhook service is not accessible - Network/DNS issue
connection refused - Webhook server not starting
timeout - Network latency or firewall blocking
certificate errors - TLS certificate issues
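To scan a longer log window for these patterns in one pass:
# Filter the log for the error patterns listed above
kubectl logs -n longhorn-system $POD_NAME -c longhorn-manager --tail=500 2>&1 | \
  grep -iE "webhook|connection refused|timed out|certificate"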
2. Check Webhook Service¶
# Verify webhook service exists
kubectl get svc -n longhorn-system longhorn-admission-webhook
# Expected output:
# NAME TYPE CLUSTER-IP PORT(S)
# longhorn-admission-webhook ClusterIP 10.43.139.56 9502/TCP
Check service endpoints:
kubectl get endpoints -n longhorn-system longhorn-admission-webhook
# Expected: At least 3-4 endpoints (from healthy manager pods)
# ENDPOINTS
# 10.42.1.198:9502,10.42.2.148:9502,10.42.4.123:9502 + 1 more...
If no endpoints: All manager pods are failing (cluster-wide issue, not just one node)
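To confirm whether the failure is cluster-wide rather than limited to one node:
# List manager pods that are not fully ready; if every pod appears here,
# the problem is cluster-wide and deleting individual pods will not help
kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | awk '$2 != "2/2"'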
3. Check Network Connectivity¶
Test if the pod can reach the service:
# Execute network test from within the pod
# --no-check-certificate avoids failures caused by the webhook's internal certificate
kubectl exec -n longhorn-system $POD_NAME -c longhorn-manager -- \
  wget -O- --timeout=5 --no-check-certificate \
  https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz 2>&1
Possible results:
✅ Success: Network is fine, probe configuration issue
❌ Timeout: Network latency or firewall blocking
❌ Connection refused: No healthy webhook endpoints
❌ DNS error: CoreDNS or service discovery issue
4. Check Node Health¶
# Check if the node itself is healthy
kubectl get nodes | grep <node-name>
# Check node conditions
kubectl describe node <node-name> | grep -A 10 "Conditions:"
Look for:
Node status should be “Ready”
No “NetworkUnavailable” or “NotReady” conditions
No taints preventing pod scheduling
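Taints are easy to miss in the describe output; list them directly:
# Show any taints on the node (empty output means no taints)
kubectl get node <node-name> -o jsonpath='{.spec.taints}{"\n"}'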
5. Review Probe Configuration¶
Verify the custom probes are actually configured:
kubectl get ds longhorn-manager -n longhorn-system -o jsonpath='{.spec.template.spec.containers[0].startupProbe}'
If empty: Custom probe patch not applied (see Prevention section below)
Advanced Troubleshooting¶
Check CoreDNS Status¶
DNS resolution issues can prevent webhook connectivity:
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Test DNS resolution from problem pod
kubectl exec -n longhorn-system $POD_NAME -c longhorn-manager -- \
nslookup longhorn-admission-webhook.longhorn-system.svc
Check Network Policies¶
Verify no network policies are blocking pod-to-service communication:
# Check for network policies affecting longhorn-system
kubectl get networkpolicies -n longhorn-system
# If any exist, review their rules
kubectl describe networkpolicy <policy-name> -n longhorn-system
Check Cilium/CNI Status¶
If using Cilium CNI, verify it’s healthy:
# Check Cilium agent pods
kubectl get pods -n kube-system -l k8s-app=cilium
# Check Cilium connectivity
kubectl exec -n kube-system ds/cilium -- cilium status
When NOT to Delete Pods¶
Do NOT delete a longhorn-manager pod if:
❌ All manager pods are in CrashLoopBackOff (cluster-wide issue, deleting won’t help)
❌ Active volume operations are in progress (check Longhorn UI)
❌ Node is undergoing maintenance or upgrade
❌ Less than 3 healthy manager pods exist
Safe to delete when:
✅ Only one manager pod is failing
✅ At least 3 other manager pods are healthy (“2/2 Running”)
✅ No active volume operations
✅ Node status is “Ready” in Kubernetes
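A quick pre-flight sketch for the “safe to delete” checklist (the volume listing assumes Longhorn’s volumes.longhorn.io CRD; column values may vary by Longhorn version):
# Count fully ready manager pods (you want at least 3 besides the failing one)
kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | grep -c " 2/2 "
# Heuristic: list volumes that are not in a steady state (e.g. attaching,
# detaching, degraded); adjust the pattern for your Longhorn version
kubectl get volumes.longhorn.io -n longhorn-system --no-headers | grep -viE "detached|healthy"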
Prevention¶
Ensure Custom Probes Are Configured¶
The custom probe configuration should be automatically applied during cluster provisioning:
Check configuration exists:
ls -la kube-hetzner/extra-manifests/ | grep longhorn-manager-probes
# Expected: 40-G-longhorn-manager-probes-patch.yaml.tpl
If missing: Re-apply infrastructure manifests:
cd kube-hetzner
source .env
tofu apply
Monitor Longhorn Manager Health¶
Add to monitoring dashboards:
# Check restart counts regularly
kubectl get pods -n longhorn-system -l app=longhorn-manager \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Expected: Low restart counts (0-5 over lifetime)
# High counts (50+): Indicates recurring issues
Set up alerts (optional Prometheus alert):
- alert: LonghornManagerCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", pod=~"longhorn-manager.*"} > 0
  for: 10m
  labels:
    severity: critical
Best Practices¶
✅ Keep custom probe configuration applied (default since 2025-11-09)
✅ Monitor manager pod restart counts after K3s upgrades
✅ Verify all Longhorn nodes show “Ready” after cluster maintenance
✅ Test volume operations after resolving manager pod issues
Technical Details¶
Probe Configuration Files:
Patch Manifest: kube-hetzner/extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl
Plan Document: docs/plan/longhorn-manager-probes.md
DaemonSet: longhorn-system/longhorn-manager
Default vs Custom Probes:
| Probe Type | Default | Custom | Improvement |
|---|---|---|---|
| startupProbe | ❌ None | ✅ 5-minute grace | Handles slow network init |
| livenessProbe | ❌ None | ✅ 90s tolerance | Auto-recovers stuck pods |
| readinessProbe | ⚠️ 1s timeout | ✅ 5s timeout | Tolerates network latency |
Why Default Fails:
30-second total startup window (3 failures × 10s) too short for webhook initialization during network disruptions
No liveness probe means Kubernetes never attempts automatic recovery
1-second timeout insufficient for HTTPS health checks with network latency
Why Custom Works:
5-minute startup window accommodates slow CNI/DNS initialization
Liveness probe auto-restarts pods that get truly stuck
5-second timeout handles realistic network latency scenarios
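The windows quoted above follow directly from the threshold arithmetic:
# Probe window arithmetic (values from the custom configuration above)
echo "startupProbe window:  $(( 30 * 10 ))s"   # 30 failures x 10s period = 300s (5 minutes)
echo "livenessProbe window: $(( 3 * 30 ))s"    # 3 failures x 30s period = 90s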