Explanation

Longhorn Resilience Configuration

Understand why Longhorn manager pods require custom health probe configuration and how this prevents CrashLoopBackOff issues during cluster operations.

Overview

The Longhorn storage system’s manager DaemonSet requires enhanced health probe configuration to maintain stability during Kubernetes cluster disruptions. This explanation covers the technical reasons behind this requirement and the design decisions in the custom configuration.

The Problem: Readiness Probe Chicken-and-Egg

How Longhorn Manager Works

Each Longhorn manager pod serves dual roles:

  1. Manager Functions: Orchestrates storage operations, manages volumes, coordinates replicas

  2. Admission Webhook: Validates and mutates Longhorn custom resources via Kubernetes admission control

The manager pod exposes an HTTPS webhook endpoint on port 9502 at /v1/healthz:

┌─────────────────────────┐
│ longhorn-manager pod    │
│                         │
│  ┌──────────────────┐   │
│  │ Manager Process  │   │
│  └──────────────────┘   │
│         ↓               │
│  ┌──────────────────┐   │
│  │ Webhook Server   │   │
│  │ Port: 9502       │   │
│  │ Path: /v1/healthz│   │
│  └──────────────────┘   │
└─────────────────────────┘
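
For a quick manual check, the same endpoint can be queried from a throwaway pod. This is only an illustrative sketch: the pod IP is one of the example endpoints listed below, and curl's -k flag skips certificate verification on the assumption that the webhook certificate is not signed by a publicly trusted CA.

kubectl run webhook-check --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -ksS https://10.42.1.198:9502/v1/healthz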

All manager pods register their webhook endpoints with the longhorn-admission-webhook Service:

Service: longhorn-admission-webhook.longhorn-system.svc:9502
  ↓ endpoints
  ├─ 10.42.1.198:9502 (pod on node fsn1-yim)
  ├─ 10.42.2.148:9502 (pod on node fsn1-wej)
  ├─ 10.42.3.33:9502  (pod on node nbg1-qzp)
  └─ 10.42.4.123:9502 (pod on node fsn1-qdf)
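
Which manager pods are currently registered behind the Service can be checked with a standard endpoints query; a pod that is NotReady will be missing from this list, which is exactly the failure mode described next:

kubectl get endpoints longhorn-admission-webhook -n longhorn-system -o yaml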

The Circular Dependency

Default Longhorn configuration (from Helm chart):

readinessProbe:
  httpGet:
    path: /v1/healthz
    port: 9502
    scheme: HTTPS
  failureThreshold: 3
  periodSeconds: 10
  timeoutSeconds: 1
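
To confirm what the chart ships by default, one option is to render it locally and search for the probe block; the repository alias and chart name here are assumptions and may need adjusting to your setup:

helm template longhorn longhorn/longhorn --namespace longhorn-system | \
  grep -B 2 -A 8 "readinessProbe"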

Timeline of failure during network disruption:

Time  Event
────────────────────────────────────────────────────────
T+0s  Pod starts after K3s upgrade
T+0s  CNI initializing (Cilium network setup)
T+0s  Readiness probe begins checking webhook endpoint
T+0s  First probe fails (webhook server not yet listening)
T+10s Second probe fails (webhook initializing, TLS certs loading)
T+20s Third probe fails (network not fully initialized)
T+30s Kubernetes marks pod "NotReady" (3 consecutive failures)
      CRITICAL: Pod never added to Service endpoints
T+60s Manager checks if webhook service is accessible
      Tries: https://longhorn-admission-webhook.longhorn-system.svc:9502
      Result: Timeout (no healthy endpoints - this pod was never added!)
T+120s Manager crashes with fatal error
       "admission webhook service is not accessible after 2m0s"
T+120s Container restarts
       Cycle repeats indefinitely (chicken-and-egg loop)
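
When a pod is caught in this loop, the crash reason shows up in the previous container's logs and in the pod list. A sketch of how to confirm it (the pod name is a placeholder):

kubectl get pods -n longhorn-system -l app=longhorn-manager
kubectl logs -n longhorn-system <crashing-manager-pod> --previous | \
  grep "not accessible"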

Why this is a chicken-and-egg problem:

  1. Pod needs to be “Ready” to be added to Service endpoints

  2. Pod checks if the Service is accessible before considering itself ready

  3. Service has no endpoints because pod isn’t Ready yet

  4. Pod can never become Ready because Service isn’t accessible

  5. Loop continues forever

Why Only Some Nodes Fail

Network initialization timing varies by node:

Fast initialization (pod succeeds):

T+0s  Pod starts
T+5s  CNI fully initialized
T+8s  Webhook server listening
T+10s First readiness probe succeeds ✅
T+10s Pod marked "Ready", added to Service

Slow initialization (pod fails):

T+0s  Pod starts
T+15s CNI still initializing
T+25s Webhook server listening
T+30s Three readiness probes failed ❌
T+30s Pod marked "NotReady", never added to Service
T+60s Manager can't reach Service (no endpoints)
T+120s Fatal crash, infinite restart loop

Factors affecting initialization speed:

  • Network interface configuration time

  • DNS resolution delays

  • CNI plugin initialization

  • TLS certificate generation

  • Node load/resource contention
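
Because these factors are node-local, comparing CNI agent status across nodes can hint at which nodes are likely to start slowly. A rough check, assuming the cluster runs Cilium with its standard DaemonSet labels:

kubectl get pods -n kube-system -l k8s-app=cilium -o wide
kubectl get nodes -o wide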

Kubernetes Health Probe Types

Kubernetes provides three types of health probes, each with different purposes:

1. Startup Probe

Purpose: Give slow-starting containers extra time to initialize

Behavior:

  • Checked first, before readiness or liveness probes

  • All other probes disabled until startup probe succeeds

  • If startup probe doesn’t succeed within configured time, pod is killed

Use case: Applications with variable initialization times (like Longhorn manager during network disruptions)

2. Liveness Probe

Purpose: Detect and recover from application deadlock or stuck states

Behavior:

  • Checked periodically after startup probe succeeds

  • If liveness probe fails, Kubernetes kills and recreates the pod

  • Restarts pod to attempt recovery from stuck state

Use case: Applications that can enter unrecoverable states but stay “running” (exactly Longhorn’s CrashLoopBackOff scenario)

3. Readiness Probe

Purpose: Determine when pod is ready to receive traffic

Behavior:

  • Checked periodically after startup probe succeeds

  • If readiness probe fails, pod is removed from Service endpoints

  • Pod stays running, just not receiving traffic

Use case: Gracefully handle temporary unavailability (load, queue backlog, etc.)

Custom Probe Configuration Design

Configuration Applied

Strategic merge patch (40-G-longhorn-manager-probes-patch.yaml.tpl):

containers:
- name: longhorn-manager
  startupProbe:
    httpGet:
      path: /v1/healthz
      port: 9502
      scheme: HTTPS
    failureThreshold: 30      # 30 failures × 10s = 300s max
    periodSeconds: 10
    timeoutSeconds: 5

  livenessProbe:
    httpGet:
      path: /v1/healthz
      port: 9502
      scheme: HTTPS
    failureThreshold: 3       # 3 failures × 30s = 90s tolerance
    periodSeconds: 30
    timeoutSeconds: 5

  readinessProbe:
    httpGet:
      path: /v1/healthz
      port: 9502
      scheme: HTTPS
    failureThreshold: 3
    periodSeconds: 10
    timeoutSeconds: 5         # Increased from 1s

Design Rationale

Startup Probe: 5-Minute Grace Period

Configuration:

  • failureThreshold: 30 - Allow 30 failures before giving up

  • periodSeconds: 10 - Check every 10 seconds

  • timeoutSeconds: 5 - 5-second timeout per check

  • Total: 300 seconds (5 minutes) maximum startup time

Why 5 minutes:

  • K3s upgrades can cause 30-60 second network disruptions

  • CNI initialization varies: 5-45 seconds typical, 120s worst case observed

  • TLS certificate generation: 2-10 seconds

  • DNS propagation: 5-30 seconds

  • Safety margin: 2× worst case = 4 minutes, rounded to 5 for comfort

Trade-off:

  • ✅ Prevents false failures during legitimate slow starts

  • ✅ Allows pods to succeed in challenging network conditions

  • ⚠️ Delays detection of truly broken pods by up to 5 minutes

  • ⚠️ Acceptable: the DaemonSet’s other healthy replicas continue providing service

Liveness Probe: Automatic Recovery

Configuration:

  • failureThreshold: 3 - Allow 3 failures before restart

  • periodSeconds: 30 - Check every 30 seconds (less aggressive)

  • timeoutSeconds: 5 - 5-second timeout

  • Total: 90 seconds of continuous failure before restart

Why add liveness probe:

  • Without liveness: Pod can stay in CrashLoopBackOff indefinitely (11 days observed in incident)

  • With liveness: Kubernetes automatically kills and recreates truly stuck pods

  • 90-second tolerance: Prevents restarts during brief network hiccups

Why 30-second interval (vs 10s):

  • Webhook endpoint is lightweight, not resource-intensive

  • Reduces unnecessary health check load

  • Still detects failures within 90 seconds (acceptable SLA for automated recovery)

Trade-off:

  • ✅ Automatic recovery from stuck states (no manual intervention needed)

  • ✅ Less aggressive than readiness (30s vs 10s interval)

  • ⚠️ Up to 90 seconds before recovery initiated

  • ⚠️ Acceptable: This is for catastrophic failures, not normal operation

Readiness Probe: Network Latency Tolerance

Configuration:

  • failureThreshold: 3 - Keep default (3 failures)

  • periodSeconds: 10 - Keep default (10 seconds)

  • timeoutSeconds: 5 - Increased from 1s to 5s

Why 5-second timeout:

  • HTTPS health checks require TLS handshake: 50-200ms typical, 1s+ during load

  • Network latency during upgrades: 100-500ms typical, 2s worst case

  • DNS resolution: 50-500ms typical, 1s+ during CoreDNS load

  • Safety margin: 5s handles 99.9th percentile scenarios

Why keep other settings:

  • failureThreshold: 3 maintains original behavior (30s window)

  • periodSeconds: 10 provides quick traffic routing decisions

Trade-off:

  • ✅ Tolerates realistic network latency scenarios

  • ✅ Prevents false-positive “NotReady” marking

  • ⚠️ A hung endpoint now takes up to 5s per probe attempt to time out, so it can take roughly 30 seconds (3 failures × 10s period) to be marked NotReady

  • ⚠️ Acceptable: Service has multiple endpoints, brief delays OK

Probe Sequence During Startup

Successful startup with custom probes:

Time  Probe           Result    Pod State     Service Endpoints
─────────────────────────────────────────────────────────────────
T+0s  startupProbe    Fail      Starting      Not added (expected)
T+10s startupProbe    Fail      Starting      Not added (expected)
T+20s startupProbe    Fail      Starting      Not added (expected)
T+30s startupProbe    Success   Starting      Not added yet
      ↓ startupProbe succeeds, enable readiness/liveness probes
T+30s readinessProbe  Success   Ready         ✅ Added to Service!
      ↓ Pod now receives traffic
T+60s livenessProbe   Success   Ready         In Service
T+90s readinessProbe  Success   Ready         In Service

Compare to default probes (for reference):

Time  Probe           Result    Pod State     Service Endpoints
─────────────────────────────────────────────────────────────────
T+0s  readinessProbe  Fail      Starting      Not added
T+10s readinessProbe  Fail      Starting      Not added
T+20s readinessProbe  Fail      Starting      Not added
T+30s ❌ Pod marked NotReady (3 failures)
T+60s Manager checks Service accessibility
T+60s ❌ Service has no endpoints (this pod not added)
T+120s ❌ Fatal crash: "webhook service not accessible"

Implementation Approach

Why Strategic Merge Patch

Challenge: Longhorn Helm chart doesn’t expose probe configuration in values.yaml

Solution Options Evaluated:

  1. Fork Helm chart: High maintenance burden, must track upstream changes

  2. Manual kubectl patch: Not persistent, lost on cluster rebuild

  3. Strategic merge patch via extra-manifests: Infrastructure-as-code, automatically applied

Strategic merge patch (40-G-longhorn-manager-probes-patch.yaml.tpl):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: longhorn-manager
  namespace: longhorn-system
spec:
  template:
    spec:
      containers:
      - name: longhorn-manager
        startupProbe: { ... }    # Added to existing container config
        livenessProbe: { ... }   # Added to existing container config
        readinessProbe: { ... }  # Merged with existing readinessProbe

How it works:

  1. OpenTofu applies extra-manifests after Helm chart deployment

  2. Kubernetes strategically merges patch with existing DaemonSet

  3. Probe configuration added/updated without replacing entire DaemonSet

  4. Longhorn manager pods rolling restart with new probe configuration
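
For reference, the equivalent change could also be applied as a one-off strategic merge patch outside of OpenTofu. This is only a sketch of the manual step (the patch file name is a placeholder for the rendered template); in this setup the extra-manifests mechanism already applies it automatically:

kubectl patch daemonset longhorn-manager -n longhorn-system \
  --type strategic \
  --patch-file longhorn-manager-probes-patch.yaml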

Why Extra-Manifests

Advantages:

  • ✅ Version-controlled with infrastructure code

  • ✅ Automatically applied on every tofu apply

  • ✅ Persists across cluster rebuilds

  • ✅ No custom Helm chart maintenance

  • ✅ Easy to modify or remove

Disadvantages:

  • ⚠️ Applied after Helm chart (brief window where default probes exist)

  • ⚠️ Requires understanding of strategic merge semantics

  • ⚠️ May conflict with future Helm chart changes

Comparison: Default vs Custom

Startup Scenario Outcomes

Scenario         Network Init Time   Default Probes                Custom Probes
─────────────────────────────────────────────────────────────────────────────────
Fast node        10 seconds          ✅ Success                    ✅ Success
Normal node      25 seconds          ✅ Success                    ✅ Success
Slow node        45 seconds          ❌ CrashLoopBackOff           ✅ Success
Very slow node   90 seconds          ❌ CrashLoopBackOff           ✅ Success
Broken pod       5+ minutes          ❌ CrashLoopBackOff forever   ❌ Killed after 5 min, recreated

Recovery from Stuck State

Situation                    Default Probes                Custom Probes
─────────────────────────────────────────────────────────────────────────────────
Pod stuck for 5 minutes      ⚠️ Stays running, unusable    ✅ Killed at 5 min (startup timeout)
Pod stuck for 11 days        ⚠️ Stays running forever      ✅ Killed at 90s (liveness failure)
Manual intervention needed   ⚠️ Yes (delete pod by hand)   ✅ No (automatic recovery)

Network Latency Handling

Latency   Default (1s timeout)   Custom (5s timeout)
──────────────────────────────────────────────────────
50ms      ✅ Success             ✅ Success
500ms     ✅ Success             ✅ Success
1.5s      ❌ Probe timeout       ✅ Success
3s        ❌ Probe timeout       ✅ Success
6s        ❌ Probe timeout       ❌ Probe timeout

Observability

Monitoring Probe Health

Check current configuration:

kubectl get ds longhorn-manager -n longhorn-system -o yaml | \
  grep -E "(startupProbe|livenessProbe|readinessProbe)" -A 10

Monitor probe failures:

# Watch for Unhealthy events
kubectl get events -n longhorn-system \
  --field-selector involvedObject.kind=Pod,reason=Unhealthy -w

# Example output if probe fails:
# Readiness probe failed: Get "https://10.42.3.33:9502/v1/healthz":
# context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Check restart counts (indicator of liveness probe failures):

kubectl get pods -n longhorn-system -l app=longhorn-manager \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

Metrics and Alerts

Prometheus metrics (from kube-state-metrics):

# Pods in CrashLoopBackOff
kube_pod_container_status_waiting_reason{
  namespace="longhorn-system",
  pod=~"longhorn-manager.*",
  reason="CrashLoopBackOff"
}

# Restart count trend
rate(kube_pod_container_status_restarts_total{
  namespace="longhorn-system",
  pod=~"longhorn-manager.*"
}[5m])

Recommended alerts:

# Alert on CrashLoopBackOff
- alert: LonghornManagerCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{namespace="longhorn-system",pod=~"longhorn-manager.*",reason="CrashLoopBackOff"} > 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Longhorn manager pod {{ $labels.pod }} in CrashLoopBackOff"

# Alert on high restart rate
- alert: LonghornManagerHighRestarts
  expr: rate(kube_pod_container_status_restarts_total{namespace="longhorn-system",pod=~"longhorn-manager.*"}[1h]) > 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Longhorn manager pod {{ $labels.pod }} restarting frequently"

Lessons Learned

Design Principles

  1. Health probes checking self-hosted services create circular dependencies

    • Probe failure → Pod not Ready → Not in Service → Can’t reach Service

    • Mitigation: Extended startup grace period

  2. Default Helm chart values may not suit all deployment scenarios

    • Charts optimize for standard environments

    • Real-world network timing varies significantly

    • Custom patches sometimes necessary

  3. Missing liveness probes mean no automatic recovery

    • Kubernetes assumes “running” containers are healthy

    • Need explicit liveness probe for self-healing

    • Trade-off: Balance between aggressive recovery and stability

  4. Probe timeouts must account for worst-case network conditions

    • HTTPS requires TLS handshake overhead

    • Network latency during upgrades is higher than normal

    • 1-second timeouts insufficient for realistic scenarios

Operational Insights

From the 11-day CrashLoopBackOff incident:

  • Random timing means some nodes always at risk during disruptions

  • Manual intervention doesn’t scale (deleting stuck pods by hand requires someone to notice the failure first)

  • Aggressive probes create false failures more often than they detect real problems

  • Missing probes prevent self-healing (no liveness probe = no auto-recovery)

Technical References

Kubernetes Documentation:

Longhorn Documentation:

Implementation Files:

  • kube-hetzner/extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl

  • docs/plan/longhorn-manager-probes.md