How-To Guide

Troubleshoot Longhorn Manager Pod CrashLoopBackOff

Learn how to diagnose and resolve longhorn-manager pods stuck in CrashLoopBackOff, a failure mode that makes the affected Longhorn nodes appear “down” in the Longhorn UI.

Problem Description

Symptom: Longhorn node shows status “down” in Longhorn UI (https://longhorn.ops.kup6s.net), and the corresponding longhorn-manager pod is stuck in CrashLoopBackOff with high restart counts.

Example:

kubectl get pods -n longhorn-system | grep longhorn-manager
NAME                     READY   STATUS             RESTARTS        AGE
longhorn-manager-45l8m   1/2     CrashLoopBackOff   102 (60s ago)   11d
#                        ^^^^^ Manager container failing

Pod logs show:

level=fatal msg="Error starting webhooks: admission webhook service is not accessible
on cluster after 2m0s sec: timed out waiting for endpoint
https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz to be available"

Key differences from other issues:

  • Node stuck after upgrade (covered in Nodes Stuck After K3S Upgrade): Node is cordoned but healthy

  • This issue: Longhorn manager pod is crashing, node shows “down” in Longhorn UI

Root Cause: Readiness Probe Chicken-and-Egg Problem

The longhorn-manager pod crashes because it cannot reach its own admission webhook service, creating a circular dependency:

  1. Pod starts after K3s upgrade or node restart

  2. Network initialization may be slow (CNI, DNS, service mesh starting up)

  3. Readiness probe begins checking https://localhost:9502/v1/healthz every 10 seconds

  4. Webhook not ready within 30 seconds (default: 3 failures × 10s period)

  5. Pod marked “NotReady” by Kubernetes

  6. Critical: Pod is never added to longhorn-admission-webhook Service endpoints because it’s NotReady

  7. Pod crashes trying to verify the webhook service is accessible

  8. Restart loop: Cycle repeats indefinitely (pod can never reach Service it should be part of)

This is a timing-sensitive issue: if the network initializes within 30 seconds, the pod comes up normally. If not, it enters a restart loop it cannot escape on its own.
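The failing readiness checks are visible in the pod’s events, which is a quick way to confirm this failure mode (using the example pod name from above, substitute your own):

# Readiness probe failures show up as Warning events on the pod
kubectl describe pod longhorn-manager-45l8m -n longhorn-system | grep -iA 3 "readiness probe failed"

# Or list all Warning events in the namespace and filter for probes
kubectl get events -n longhorn-system --field-selector type=Warning | grep -i probe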

Automatic Prevention: Custom Health Probes

Since 2025-11-09, the cluster includes custom health probe configuration that prevents this issue automatically.

How It Works

Custom probes added via extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl:

Enhanced Probes:

  1. startupProbe: 5-minute grace period for slow network initialization

  2. livenessProbe: Auto-restarts pods stuck in bad states (90s tolerance)

  3. readinessProbe: Enhanced timeout (5s instead of 1s) for network latency

Result: Longhorn manager pods can initialize properly even during network disruptions, and automatically recover if they get stuck.
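For orientation, the patch is roughly equivalent to the strategic-merge patch sketched below. This is illustrative only (the authoritative configuration lives in the .tpl manifest above) and it assumes the probes hit the same /v1/healthz endpoint on port 9502 that the readiness probe already checks:

kubectl patch ds longhorn-manager -n longhorn-system --type=strategic -p '
spec:
  template:
    spec:
      containers:
      - name: longhorn-manager
        startupProbe:                 # 30 x 10s = up to 5 minutes to start
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          periodSeconds: 10
          failureThreshold: 30
        livenessProbe:                # 3 x 30s = 90s tolerance before restart
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          periodSeconds: 30
          failureThreshold: 3
        readinessProbe:               # 5s timeout instead of the 1s default
          httpGet: {path: /v1/healthz, port: 9502, scheme: HTTPS}
          timeoutSeconds: 5
'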

Verify Custom Probes Are Configured

Check that the custom probe configuration is applied:

# Check for all three probe types
kubectl get ds longhorn-manager -n longhorn-system -o yaml | \
  grep -E "(startupProbe|livenessProbe|readinessProbe)"

# Expected output - all three should be present:
# startupProbe:
# livenessProbe:
# readinessProbe:

Detailed probe configuration:

# View full probe configuration
kubectl get ds longhorn-manager -n longhorn-system -o yaml | grep -A 15 "startupProbe:"

Expected values:

  • startupProbe: failureThreshold: 30, periodSeconds: 10 (5-minute max startup)

  • livenessProbe: failureThreshold: 3, periodSeconds: 30 (90s tolerance)

  • readinessProbe: timeoutSeconds: 5 (enhanced from default 1s)
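To check the exact values rather than just the presence of the probes, query them directly (as elsewhere in this guide, container index 0 is the manager container):

# Startup probe numbers (expected: 30 10)
kubectl get ds longhorn-manager -n longhorn-system \
  -o jsonpath='{.spec.template.spec.containers[0].startupProbe.failureThreshold} {.spec.template.spec.containers[0].startupProbe.periodSeconds}{"\n"}'

# Readiness probe timeout (expected: 5)
kubectl get ds longhorn-manager -n longhorn-system \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.timeoutSeconds}{"\n"}'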

Quick Fix: Delete the Stuck Pod

If a pod is already stuck in CrashLoopBackOff, delete it to force recreation:

1. Identify the Stuck Pod

# Find pods in CrashLoopBackOff
kubectl get pods -n longhorn-system | grep -i crash

# Example output:
# longhorn-manager-45l8m   1/2     CrashLoopBackOff   102 (60s ago)   11d

2. Verify Other Manager Pods Are Healthy

Important: Ensure at least 3 other longhorn-manager pods are running before deleting:

kubectl get pods -n longhorn-system -l app=longhorn-manager

# Expected: At least 3-4 pods showing "2/2 Running"
# NAME                     READY   STATUS    RESTARTS   AGE
# longhorn-manager-8fjqn   2/2     Running   4          11d  ← Healthy
# longhorn-manager-gr8ws   2/2     Running   2          11d  ← Healthy
# longhorn-manager-n78bn   2/2     Running   2          11d  ← Healthy
# longhorn-manager-45l8m   1/2     CrashLoopBackOff  102   11d  ← Problem

3. Delete the Stuck Pod

# Replace with your actual pod name
kubectl delete pod longhorn-manager-45l8m -n longhorn-system

What happens:

  • DaemonSet automatically creates a new pod to replace the deleted one

  • New pod starts with fresh state (no accumulated failures)

  • If network is stable, new pod initializes successfully within seconds

4. Monitor the New Pod

# Watch pod recreation
kubectl get pods -n longhorn-system -w -l app=longhorn-manager

# Wait for new pod to reach "2/2 Running"
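Instead of watching manually, kubectl wait can block until the manager pods report Ready (the 300s timeout is an arbitrary choice; adjust as needed):

# Returns once all manager pods are Ready, or fails after 5 minutes
kubectl wait pod -n longhorn-system -l app=longhorn-manager \
  --for=condition=Ready --timeout=300s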

5. Verify Node Recovery in Longhorn UI

Open Longhorn UI: https://longhorn.ops.kup6s.net/#/node

Expected: Previously “down” node should now show status “Schedulable” and “Ready”.
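The same check is possible from the CLI via Longhorn’s node CRD (printed columns vary slightly between Longhorn versions):

# The previously "down" node should report Ready and allow scheduling again
kubectl get nodes.longhorn.io -n longhorn-system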

Manual Diagnosis (When Quick Fix Doesn’t Work)

If the new pod also enters CrashLoopBackOff, investigate deeper:

1. Check Pod Logs

# Replace with your pod name
POD_NAME="longhorn-manager-45l8m"

# View manager container logs
kubectl logs -n longhorn-system $POD_NAME -c longhorn-manager --tail=100

Look for:

  • admission webhook service is not accessible - Network/DNS issue

  • connection refused - Webhook server not starting

  • timeout - Network latency or firewall blocking

  • certificate errors - TLS certificate issues
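Because a CrashLoopBackOff pod restarts constantly, the interesting messages are often in the previous container instance. A quick filter for the patterns above:

# Logs from the last crashed container, filtered for common failure signatures
kubectl logs -n longhorn-system $POD_NAME -c longhorn-manager --previous --tail=200 | \
  grep -iE "webhook|refused|timed out|certificate"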

2. Check Webhook Service

# Verify webhook service exists
kubectl get svc -n longhorn-system longhorn-admission-webhook

# Expected output:
# NAME                         TYPE        CLUSTER-IP     PORT(S)
# longhorn-admission-webhook   ClusterIP   10.43.139.56   9502/TCP

Check service endpoints:

kubectl get endpoints -n longhorn-system longhorn-admission-webhook

# Expected: At least 3-4 endpoints (from healthy manager pods)
# ENDPOINTS
# 10.42.1.198:9502,10.42.2.148:9502,10.42.4.123:9502 + 1 more...

If no endpoints: All manager pods are failing (cluster-wide issue, not just one node)
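It can also help to cross-check the endpoint list against the manager pods themselves; only pods reporting 2/2 Ready are added as endpoints:

# Pod IPs of the Ready manager pods should match the ENDPOINTS list above
kubectl get pods -n longhorn-system -l app=longhorn-manager -o wide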

3. Check Network Connectivity

Test if the pod can reach the service:

# Execute network test from within the pod
kubectl exec -n longhorn-system $POD_NAME -c longhorn-manager -- \
  wget -O- --timeout=5 https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz 2>&1

Possible results:

  • Success: Network is fine, probe configuration issue

  • Timeout: Network latency or firewall blocking

  • Connection refused: No healthy webhook endpoints

  • DNS error: CoreDNS or service discovery issue
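If wget is not available in the manager container, the same check can be run from a short-lived debug pod. The curlimages/curl image is just one convenient choice, and -k skips TLS verification so a self-signed webhook certificate does not mask the connectivity result:

# One-off debug pod that curls the webhook health endpoint and exits
kubectl run webhook-check --rm -it --restart=Never -n longhorn-system \
  --image=curlimages/curl --command -- \
  curl -vks --max-time 5 \
  https://longhorn-admission-webhook.longhorn-system.svc:9502/v1/healthz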

4. Check Node Health

# Check if the node itself is healthy
kubectl get nodes | grep <node-name>

# Check node conditions
kubectl describe node <node-name> | grep -A 10 "Conditions:"

Look for:

  • Node status should be “Ready”

  • No “NetworkUnavailable” or “NotReady” conditions

  • No taints preventing pod scheduling
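For the taint check specifically, the node spec can be dumped directly:

# Empty output means no taints on the node
kubectl get node <node-name> -o jsonpath='{.spec.taints}{"\n"}'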

5. Review Probe Configuration

Verify the custom probes are actually configured:

kubectl get ds longhorn-manager -n longhorn-system -o jsonpath='{.spec.template.spec.containers[0].startupProbe}'

If empty: Custom probe patch not applied (see Prevention section below)

Advanced Troubleshooting

Check CoreDNS Status

DNS resolution issues can prevent webhook connectivity:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Test DNS resolution from problem pod
kubectl exec -n longhorn-system $POD_NAME -c longhorn-manager -- \
  nslookup longhorn-admission-webhook.longhorn-system.svc

Check Network Policies

Verify no network policies are blocking pod-to-service communication:

# Check for network policies affecting longhorn-system
kubectl get networkpolicies -n longhorn-system

# If any exist, review their rules
kubectl describe networkpolicy <policy-name> -n longhorn-system

Check Cilium/CNI Status

If using Cilium CNI, verify it’s healthy:

# Check Cilium agent pods
kubectl get pods -n kube-system -l k8s-app=cilium

# Check Cilium connectivity
kubectl exec -n kube-system ds/cilium -- cilium status

When NOT to Delete Pods

Do NOT delete a longhorn-manager pod if:

  • ❌ All manager pods are in CrashLoopBackOff (cluster-wide issue, deleting won’t help)

  • ❌ Active volume operations are in progress (check Longhorn UI)

  • ❌ Node is undergoing maintenance or upgrade

  • ❌ Less than 3 healthy manager pods exist

Safe to delete when:

  • ✅ Only one manager pod is failing

  • ✅ At least 3 other manager pods are healthy (“2/2 Running”)

  • ✅ No active volume operations

  • ✅ Node status is “Ready” in Kubernetes
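As a quick pre-check before deleting, count how many manager pods are fully Ready (this simply parses the READY and STATUS columns):

# Proceed only if this prints 3 or more
kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | \
  awk '$2 == "2/2" && $3 == "Running"' | wc -l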

Prevention

Ensure Custom Probes Are Configured

The custom probe configuration should be automatically applied during cluster provisioning:

Check configuration exists:

ls -la kube-hetzner/extra-manifests/ | grep longhorn-manager-probes

# Expected: 40-G-longhorn-manager-probes-patch.yaml.tpl

If missing: Re-apply infrastructure manifests:

cd kube-hetzner
source .env
tofu apply

Monitor Longhorn Manager Health

Add to monitoring dashboards:

# Check restart counts regularly
kubectl get pods -n longhorn-system -l app=longhorn-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Expected: Low restart counts (0-5 over lifetime)
# High counts (50+): Indicates recurring issues
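Sorting by restart count makes outliers easy to spot (container index 0 is the manager container, as in the command above):

# Pods with the highest restart counts are listed last
kubectl get pods -n longhorn-system -l app=longhorn-manager \
  --sort-by='{.status.containerStatuses[0].restartCount}'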

Set up alerts (optional Prometheus alert):

- alert: LonghornManagerCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",pod=~"longhorn-manager.*"} > 0
  for: 10m
  labels:
    severity: critical

Best Practices

  • ✅ Keep custom probe configuration applied (default since 2025-11-09)

  • ✅ Monitor manager pod restart counts after K3s upgrades

  • ✅ Verify all Longhorn nodes show “Ready” after cluster maintenance

  • ✅ Test volume operations after resolving manager pod issues

Technical Details

Probe Configuration Files:

  • Patch Manifest: kube-hetzner/extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl

  • Plan Document: docs/plan/longhorn-manager-probes.md

  • DaemonSet: longhorn-system/longhorn-manager

Default vs Custom Probes:

Probe Type     | Default       | Custom            | Improvement
startupProbe   | ❌ None       | ✅ 5-minute grace  | Handles slow network init
livenessProbe  | ❌ None       | ✅ 90s tolerance   | Auto-recovers stuck pods
readinessProbe | ⚠️ 1s timeout | ✅ 5s timeout      | Tolerates network latency

Why Default Fails:

  • The 30-second total startup window (3 failures × 10s) is too short for webhook initialization during network disruptions

  • With no liveness probe, Kubernetes never attempts automatic recovery

  • The 1-second timeout is insufficient for HTTPS health checks under network latency

Why Custom Works:

  • 5-minute startup window accommodates slow CNI/DNS initialization

  • Liveness probe auto-restarts pods that get truly stuck

  • 5-second timeout handles realistic network latency scenarios