How-To Guide

Troubleshoot Longhorn Manager Pod CrashLoopBackOff

Learn how to diagnose and resolve longhorn-manager pods stuck in CrashLoopBackOff, a failure mode that makes the affected Longhorn nodes appear “down” in the Longhorn UI.

Problem Description

Symptom: Longhorn node shows status “down” in Longhorn UI (https://longhorn.ops.kup6s.net), and the corresponding longhorn-manager pod is stuck in CrashLoopBackOff with high restart counts.

Example:

kubectl get pods -n longhorn-system | grep longhorn-manager
NAME                     READY   STATUS             RESTARTS        AGE
longhorn-manager-45l8m   1/2     CrashLoopBackOff   102 (60s ago)   11d
#                        ^^^^^ Manager container failing

Pod logs show:

level=fatal msg="Error starting webhooks: admission webhook service is not accessible
on cluster after 2m0s sec: timed out waiting for endpoint
https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz to be available"

Key differences from other issues:

  • Node stuck after upgrade (covered in Nodes Stuck After K3S Upgrade): Node is cordoned but healthy

  • This issue: Longhorn manager pod is crashing, node shows “down” in Longhorn UI

Root Cause: Readiness Probe Chicken-and-Egg Problem

The longhorn-manager pod crashes because it cannot reach its own admission webhook service, creating a circular dependency:

  1. Pod starts after K3s upgrade or node restart

  2. Network initialization may be slow (CNI, DNS, service mesh starting up)

  3. Readiness probe begins checking https://localhost:9502/v1/healthz every 10 seconds

  4. Webhook not ready within 30 seconds (default: 3 failures × 10s period)

  5. Pod marked “NotReady” by Kubernetes

  6. Critical: Pod is never added to longhorn-admission-webhook Service endpoints because it’s NotReady

  7. Pod crashes trying to verify the webhook service is accessible

  8. Restart loop: Cycle repeats indefinitely (pod can never reach Service it should be part of)

This is a timing-sensitive issue: if the network initializes within 30 seconds, the pod comes up normally. If not, it enters a restart loop it cannot escape on its own.
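The failing readiness checks are visible in the pod’s events, which is a quick way to confirm this failure mode (using the example pod name from above, substitute your own):

# Readiness probe failures show up as Warning events on the pod
kubectl describe pod longhorn-manager-45l8m -n longhorn-system | grep -iA 3 "readiness probe failed"

# Or list all Warning events in the namespace and filter for probes
kubectl get events -n longhorn-system --field-selector type=Warning | grep -i probe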

Automatic Prevention: Custom Health Probes

Since 2025-11-09, the cluster includes custom health probe configuration that prevents this issue automatically.

How It Works

Custom probes added via extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl:

Enhanced Probes:

  1. startupProbe: 5-minute grace period for slow network initialization

  2. livenessProbe: Auto-restarts pods stuck in bad states (90s tolerance)

  3. readinessProbe: Enhanced timeout (5s instead of 1s) for network latency

Result: Longhorn manager pods can initialize properly even during network disruptions, and automatically recover if they get stuck.
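For orientation, the patch is roughly equivalent to the strategic-merge patch sketched below. This is illustrative only (the authoritative configuration lives in the .tpl manifest above) and it assumes the probes hit the same /v1/healthz endpoint on port 9502 that the readiness probe already checks:

kubectl patch ds longhorn-manager -n longhorn-system --type=strategic -p '
spec:
  template:
    spec:
      containers:
      - name: longhorn-manager
        startupProbe:                 # 30 x 10s = up to 5 minutes to start
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          periodSeconds: 10
          failureThreshold: 30
        livenessProbe:                # 3 x 30s = 90s tolerance before restart
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          periodSeconds: 30
          failureThreshold: 3
        readinessProbe:               # 5s timeout instead of the 1s default
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          timeoutSeconds: 5
'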

Verify Custom Probes Are Configured

Check that the custom probe configuration is applied:

# Check for all three probe types
kubectl get ds longhorn-manager -n longhorn-system -o yaml | \
  grep -E "(startupProbe|livenessProbe|readinessProbe)"

# Expected output - all three should be present:
# startupProbe:
# livenessProbe:
# readinessProbe:

Detailed probe configuration:

# View full probe configuration
kubectl get ds longhorn-manager -n longhorn-system -o yaml | grep -A 15 "startupProbe:"

Expected values:

  • startupProbe: failureThreshold: 30, periodSeconds: 10 (5-minute max startup)

  • livenessProbe: failureThreshold: 3, periodSeconds: 30 (90s tolerance)

  • readinessProbe: timeoutSeconds: 5 (enhanced from default 1s)
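To check the exact values rather than just the presence of the probes, query them directly (as elsewhere in this guide, container index 0 is the manager container):

# Startup probe numbers (expected: 30 10)
kubectl get ds longhorn-manager -n longhorn-system \
  -o jsonpath='{.spec.template.spec.containers[0].startupProbe.failureThreshold} {.spec.template.spec.containers[0].startupProbe.periodSeconds}{"\n"}'

# Readiness probe timeout (expected: 5)
kubectl get ds longhorn-manager -n longhorn-system \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.timeoutSeconds}{"\n"}'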

Quick Fix: Delete the Stuck Pod

If a pod is already stuck in CrashLoopBackOff, delete it to force recreation:

1. Identify the Stuck Pod

# Find pods in CrashLoopBackOff
kubectl get pods -n longhorn-system | grep -i crash

# Example output:
# longhorn-manager-45l8m   1/2     CrashLoopBackOff   102 (60s ago)   11d

2. Verify Other Manager Pods Are Healthy

Important: Ensure at least 3 other longhorn-manager pods are running before deleting:

kubectl get pods -n longhorn-system -l app=longhorn-manager

# Expected: At least 3-4 pods showing "2/2 Running"
# NAME                     READY   STATUS    RESTARTS   AGE
# longhorn-manager-8fjqn   2/2     Running   4          11d  ← Healthy
# longhorn-manager-gr8ws   2/2     Running   2          11d  ← Healthy
# longhorn-manager-n78bn   2/2     Running   2          11d  ← Healthy
# longhorn-manager-45l8m   1/2     CrashLoopBackOff  102   11d  ← Problem

3. Delete the Stuck Pod

# Replace with your actual pod name
kubectl delete pod longhorn-manager-45l8m -n longhorn-system

What happens:

  • DaemonSet automatically creates a new pod to replace the deleted one

  • New pod starts with fresh state (no accumulated failures)

  • If network is stable, new pod initializes successfully within seconds

4. Monitor the New Pod

# Watch pod recreation
kubectl get pods -n longhorn-system -w -l app=longhorn-manager

# Wait for new pod to reach "2/2 Running"
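Instead of watching manually, kubectl wait can block until the manager pods report Ready (the 300s timeout is an arbitrary choice; adjust as needed):

# Returns once all manager pods are Ready, or fails after 5 minutes
kubectl wait pod -n longhorn-system -l app=longhorn-manager \
  --for=condition=Ready --timeout=300s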

5. Verify Node Recovery in Longhorn UI

Open Longhorn UI: https://longhorn.ops.kup6s.net/#/node

Expected: Previously “down” node should now show status “Schedulable” and “Ready”.
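The same check is possible from the CLI via Longhorn’s node CRD (printed columns vary slightly between Longhorn versions):

# The previously "down" node should report Ready and allow scheduling again
kubectl get nodes.longhorn.io -n longhorn-system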

Manual Diagnosis (When Quick Fix Doesn’t Work)

If the new pod also enters CrashLoopBackOff, investigate deeper:

1. Check Pod Logs

# Replace with your pod name
POD_NAME="longhorn-manager-45l8m"

# View manager container logs
kubectl logs -n longhorn-system $POD_NAME -c longhorn-manager --tail=100

Look for:

  • admission webhook service is not accessible - Network/DNS issue

  • connection refused - Webhook server not starting

  • timeout - Network latency or firewall blocking

  • certificate errors - TLS certificate issues
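Because a CrashLoopBackOff pod restarts constantly, the interesting messages are often in the previous container instance. A quick filter for the patterns above:

# Logs from the last crashed container, filtered for common failure signatures
kubectl logs -n longhorn-system $POD_NAME -c longhorn-manager --previous --tail=200 | \
  grep -iE "webhook|refused|timed out|certificate"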

2. Check Webhook Service

# Verify webhook service exists
kubectl get svc -n longhorn-system longhorn-admission-webhook

# Expected output:
# NAME                         TYPE        CLUSTER-IP     PORT(S)
# longhorn-admission-webhook   ClusterIP   10.43.139.56   9502/TCP

Check service endpoints:

kubectl get endpoints -n longhorn-system longhorn-admission-webhook

# Expected: At least 3-4 endpoints (from healthy manager pods)
# ENDPOINTS
# 10.42.1.198:9502,10.42.2.148:9502,10.42.4.123:9502 + 1 more...

If no endpoints: All manager pods are failing (cluster-wide issue, not just one node)
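It can also help to cross-check the endpoint list against the manager pods themselves; only pods reporting 2/2 Ready are added as endpoints:

# Pod IPs of the Ready manager pods should match the ENDPOINTS list above
kubectl get pods -n longhorn-system -l app=longhorn-manager -o wide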

3. Check Network Connectivity

Test if the pod can reach the service:

# Execute network test from within the pod
kubectl exec -n longhorn-system $POD_NAME -c longhorn-manager -- \
  wget -O- --timeout=5 https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz 2>&1

Possible results:

  • Success: Network is fine, probe configuration issue

  • Timeout: Network latency or firewall blocking

  • Connection refused: No healthy webhook endpoints

  • DNS error: CoreDNS or service discovery issue
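If wget is not available in the manager container, the same check can be run from a short-lived debug pod. The curlimages/curl image is just one convenient choice, and -k skips TLS verification so a self-signed webhook certificate does not mask the connectivity result:

# One-off debug pod that curls the webhook health endpoint and exits
kubectl run webhook-check --rm -it --restart=Never -n longhorn-system \
  --image=curlimages/curl --command -- \
  curl -vks --max-time 5 \
  https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz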

4. Check Node Health

# Check if the node itself is healthy
kubectl get nodes | grep <node-name>

# Check node conditions
kubectl describe node <node-name> | grep -A 10 "Conditions:"

Look for:

  • Node status should be “Ready”

  • No “NetworkUnavailable” or “NotReady” conditions

  • No taints preventing pod scheduling
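For the taint check specifically, the node spec can be dumped directly:

# Empty output means no taints on the node
kubectl get node <node-name> -o jsonpath='{.spec.taints}{"\n"}'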

5. Review Probe Configuration

Verify the custom probes are actually configured:

kubectl get ds longhorn-manager -n longhorn-system -o jsonpath='{.spec.template.spec.containers[0].startupProbe}'

If empty: Custom probe patch not applied (see Prevention section below)

Advanced Troubleshooting

Check CoreDNS Status

DNS resolution issues can prevent webhook connectivity:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test DNS resolution from problem pod
kubectl exec -n longhorn-system $POD_NAME -c longhorn-manager -- \
  nslookup longhorn-admission-webhook.longhorn-system.svc

Check Network Policies

Verify no network policies are blocking pod-to-service communication:

# Check for network policies affecting longhorn-system
kubectl get networkpolicies -n longhorn-system

# If any exist, review their rules
kubectl describe networkpolicy <policy-name> -n longhorn-system

Check Cilium/CNI Status

If using Cilium CNI, verify it’s healthy:

# Check Cilium agent pods
kubectl get pods -n kube-system -l k8s-app=cilium

# Check Cilium connectivity
kubectl exec -n kube-system ds/cilium -- cilium status

When NOT to Delete Pods

Do NOT delete a longhorn-manager pod if:

  • ❌ All manager pods are in CrashLoopBackOff (cluster-wide issue, deleting won’t help)

  • ❌ Active volume operations are in progress (check Longhorn UI)

  • ❌ Node is undergoing maintenance or upgrade

  • ❌ Less than 3 healthy manager pods exist

Safe to delete when:

  • ✅ Only one manager pod is failing

  • ✅ At least 3 other manager pods are healthy (“2/2 Running”)

  • ✅ No active volume operations

  • ✅ Node status is “Ready” in Kubernetes
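As a quick pre-check before deleting, count how many manager pods are fully Ready (this simply parses the READY and STATUS columns):

# Proceed only if this prints 3 or more
kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | \
  awk '$2 == "2/2" && $3 == "Running"' | wc -l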

Prevention

Ensure Custom Probes Are Configured

The custom probe configuration should be automatically applied during cluster provisioning:

Check configuration exists:

ls -la kube-hetzner/extra-manifests/ | grep longhorn-manager-probes

# Expected: 40-G-longhorn-manager-probes-patch.yaml.tpl

If missing: Re-apply infrastructure manifests:

cd kube-hetzner
source .env
tofu apply

Monitor Longhorn Manager Health

Add to monitoring dashboards:

# Check restart counts regularly
kubectl get pods -n longhorn-system -l app=longhorn-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Expected: Low restart counts (0-5 over lifetime)
# High counts (50+): Indicates recurring issues
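Sorting by restart count makes outliers easy to spot (container index 0 is the manager container, as in the command above):

# Pods with the highest restart counts are listed last
kubectl get pods -n longhorn-system -l app=longhorn-manager \
  --sort-by='{.status.containerStatuses[0].restartCount}'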

Set up alerts (optional Prometheus alert):

- alert: LonghornManagerCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",pod=~"longhorn-manager.*"} > 0
  for: 10m
  labels:
    severity: critical

Best Practices

  • ✅ Keep custom probe configuration applied (default since 2025-11-09)

  • ✅ Monitor manager pod restart counts after K3s upgrades

  • ✅ Verify all Longhorn nodes show “Ready” after cluster maintenance

  • ✅ Test volume operations after resolving manager pod issues

Technical Details

Probe Configuration Files:

  • Patch Manifest: kube-hetzner/extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl

  • Plan Document: docs/plan/longhorn-manager-probes.md

  • DaemonSet: longhorn-system/longhorn-manager

Default vs Custom Probes:

Probe Type     | Default       | Custom            | Improvement
startupProbe   | ❌ None       | ✅ 5-minute grace  | Handles slow network init
livenessProbe  | ❌ None       | ✅ 90s tolerance   | Auto-recovers stuck pods
readinessProbe | ⚠️ 1s timeout | ✅ 5s timeout      | Tolerates network latency

Why Default Fails:

  • The 30-second total startup window (3 failures × 10s) is too short for webhook initialization during network disruptions

  • With no liveness probe, Kubernetes never attempts automatic recovery

  • The 1-second timeout is insufficient for HTTPS health checks under network latency

Why Custom Works:

  • 5-minute startup window accommodates slow CNI/DNS initialization

  • Liveness probe auto-restarts pods that get truly stuck

  • 5-second timeout handles realistic network latency scenarios