Explanation

Longhorn Resilience Configuration

Understand why Longhorn manager pods require custom health probe configuration and how this prevents CrashLoopBackOff issues during cluster operations.

Overview

The Longhorn storage system’s manager DaemonSet requires enhanced health probe configuration to maintain stability during Kubernetes cluster disruptions. This explanation covers the technical reasons behind this requirement and the design decisions in the custom configuration.

The Problem: Readiness Probe Chicken-and-Egg

How Longhorn Manager Works

Each Longhorn manager pod serves dual roles:

  1. Manager Functions: Orchestrates storage operations, manages volumes, coordinates replicas

  2. Admission Webhook: Validates and mutates Longhorn custom resources via Kubernetes admission control

The manager pod exposes an HTTPS webhook endpoint on port 9502 at /v1/healthz:

┌─────────────────────────┐
│ longhorn-manager pod    │
│                         │
│  ┌──────────────────┐   │
│  │ Manager Process  │   │
│  └──────────────────┘   │
│         ↓               │
│  ┌──────────────────┐   │
│  │ Webhook Server   │   │
│  │ Port: 9502       │   │
│  │ Path: /v1/healthz│   │
│  └──────────────────┘   │
└─────────────────────────┘
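
For a quick manual check, the same endpoint can be queried from a throwaway pod. This is only an illustrative sketch: the pod IP is one of the example endpoints listed below, and curl's -k flag skips certificate verification on the assumption that the webhook certificate is not signed by a publicly trusted CA.

kubectl run webhook-check --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -ksS https://10.42.1.198:9502/v1/healthz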

All manager pods register their webhook endpoints with the longhorn-admission-webhook Service:

Service: longhorn-admission-webhook.longhorn-system.svc:9502
  ↓ endpoints
  ├─ 10.42.1.198:9502 (pod on node fsn1-yim)
  ├─ 10.42.2.148:9502 (pod on node fsn1-wej)
  ├─ 10.42.3.33:9502  (pod on node nbg1-qzp)
  └─ 10.42.4.123:9502 (pod on node fsn1-qdf)
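
Which manager pods are currently registered behind the Service can be checked with a standard endpoints query; a pod that is NotReady will be missing from this list, which is exactly the failure mode described next:

kubectl get endpoints longhorn-admission-webhook -n longhorn-system -o yaml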

The Circular Dependency

Default Longhorn configuration (from Helm chart):

readinessProbe:
  httpGet:
    path: /v1/healthz
    port: 9502
    scheme: HTTPS
  failureThreshold: 3
  periodSeconds: 10
  timeoutSeconds: 1
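
To confirm what the chart ships by default, one option is to render it locally and search for the probe block; the repository alias and chart name here are assumptions and may need adjusting to your setup:

helm template longhorn longhorn/longhorn --namespace longhorn-system | \
  grep -B 2 -A 8 "readinessProbe"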

Timeline of failure during network disruption:

Time  Event
────────────────────────────────────────────────────────
T+0s  Pod starts after K3s upgrade
T+0s  CNI initializing (Cilium network setup)
T+0s  Readiness probe begins checking webhook endpoint
T+0s  First probe fails (webhook server not yet listening)
T+10s Second probe fails (webhook initializing, TLS certs loading)
T+20s Third probe fails (network not fully initialized)
T+30s Kubernetes marks pod "NotReady" (3 consecutive failures)
      CRITICAL: Pod never added to Service endpoints
T+60s Manager checks if webhook service is accessible
      Tries: https://longhorn-admission-webhook.longhorn-system.svc:9502
      Result: Timeout (no healthy endpoints - this pod was never added!)
T+120s Manager crashes with fatal error
       "admission webhook service is not accessible after 2m0s"
T+120s Container restarts
       Cycle repeats indefinitely (chicken-and-egg loop)
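
When a pod is caught in this loop, the crash reason shows up in the previous container's logs and in the pod list. A sketch of how to confirm it (the pod name is a placeholder):

kubectl get pods -n longhorn-system -l app=longhorn-manager
kubectl logs -n longhorn-system <crashing-manager-pod> --previous | \
  grep "not accessible"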

Why this is a chicken-and-egg problem:

  1. Pod needs to be “Ready” to be added to Service endpoints

  2. Pod checks if the Service is accessible before considering itself ready

  3. Service has no endpoints because pod isn’t Ready yet

  4. Pod can never become Ready because Service isn’t accessible

  5. Loop continues forever

Why Only Some Nodes Fail

Network initialization timing varies by node:

Fast initialization (pod succeeds):

T+0s  Pod starts
T+5s  CNI fully initialized
T+8s  Webhook server listening
T+10s First readiness probe succeeds ✅
T+10s Pod marked "Ready", added to Service

Slow initialization (pod fails):

T+0s  Pod starts
T+15s CNI still initializing
T+25s Webhook server listening
T+30s Three readiness probes failed ❌
T+30s Pod marked "NotReady", never added to Service
T+60s Manager can't reach Service (no endpoints)
T+120s Fatal crash, infinite restart loop

Factors affecting initialization speed:

  • Network interface configuration time

  • DNS resolution delays

  • CNI plugin initialization

  • TLS certificate generation

  • Node load/resource contention
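
Because these factors are node-local, comparing CNI agent status across nodes can hint at which nodes are likely to start slowly. A rough check, assuming the cluster runs Cilium with its standard DaemonSet labels:

kubectl get pods -n kube-system -l k8s-app=cilium -o wide
kubectl get nodes -o wide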

Kubernetes Health Probe Types

Kubernetes provides three types of health probes, each with different purposes:

1. Startup Probe

Purpose: Give slow-starting containers extra time to initialize

Behavior:

  • Checked first, before readiness or liveness probes

  • All other probes disabled until startup probe succeeds

  • If startup probe doesn’t succeed within configured time, pod is killed

Use case: Applications with variable initialization times (like Longhorn manager during network disruptions)

2. Liveness Probe

Purpose: Detect and recover from application deadlock or stuck states

Behavior:

  • Checked periodically after startup probe succeeds

  • If liveness probe fails, Kubernetes kills and recreates the pod

  • Restarts pod to attempt recovery from stuck state

Use case: Applications that can enter unrecoverable states but stay “running” (exactly Longhorn’s CrashLoopBackOff scenario)

3. Readiness Probe

Purpose: Determine when pod is ready to receive traffic

Behavior:

  • Checked periodically after startup probe succeeds

  • If readiness probe fails, pod is removed from Service endpoints

  • Pod stays running, just not receiving traffic

Use case: Gracefully handle temporary unavailability (load, queue backlog, etc.)

Custom Probe Configuration Design

Configuration Applied

Strategic merge patch (40-G-longhorn-manager-probes-patch.yaml.tpl):

containers:
- name: longhorn-manager
  startupProbe:
    httpGet:
      path: /v1/healthz
      port: 9502
      scheme: HTTPS
    failureThreshold: 30      # 30 failures × 10s = 300s max
    periodSeconds: 10
    timeoutSeconds: 5

  livenessProbe:
    httpGet:
      path: /v1/healthz
      port: 9502
      scheme: HTTPS
    failureThreshold: 3       # 3 failures × 30s = 90s tolerance
    periodSeconds: 30
    timeoutSeconds: 5

  readinessProbe:
    httpGet:
      path: /v1/healthz
      port: 9502
      scheme: HTTPS
    failureThreshold: 3
    periodSeconds: 10
    timeoutSeconds: 5         # Increased from 1s

Design Rationale

Startup Probe: 5-Minute Grace Period

Configuration:

  • failureThreshold: 30 - Allow 30 failures before giving up

  • periodSeconds: 10 - Check every 10 seconds

  • timeoutSeconds: 5 - 5-second timeout per check

  • Total: 300 seconds (5 minutes) maximum startup time

Why 5 minutes:

  • K3s upgrades can cause 30-60 second network disruptions

  • CNI initialization varies: 5-45 seconds typical, 120s worst case observed

  • TLS certificate generation: 2-10 seconds

  • DNS propagation: 5-30 seconds

  • Safety margin: 2× worst case = 4 minutes, rounded to 5 for comfort

Trade-off:

  • ✅ Prevents false failures during legitimate slow starts

  • ✅ Allows pods to succeed in challenging network conditions

  • ⚠️ Delays detection of truly broken pods by up to 5 minutes

  • ⚠️ Acceptable: the DaemonSet’s other healthy replicas continue providing service

Liveness Probe: Automatic Recovery

Configuration:

  • failureThreshold: 3 - Allow 3 failures before restart

  • periodSeconds: 30 - Check every 30 seconds (less aggressive)

  • timeoutSeconds: 5 - 5-second timeout

  • Total: 90 seconds of continuous failure before restart

Why add liveness probe:

  • Without liveness: Pod can stay in CrashLoopBackOff indefinitely (11 days observed in incident)

  • With liveness: Kubernetes automatically kills and recreates truly stuck pods

  • 90-second tolerance: Prevents restarts during brief network hiccups

Why 30-second interval (vs 10s):

  • Webhook endpoint is lightweight, not resource-intensive

  • Reduces unnecessary health check load

  • Still detects failures within 90 seconds (acceptable SLA for automated recovery)

Trade-off:

  • ✅ Automatic recovery from stuck states (no manual intervention needed)

  • ✅ Less aggressive than readiness (30s vs 10s interval)

  • ⚠️ Up to 90 seconds before recovery initiated

  • ⚠️ Acceptable: This is for catastrophic failures, not normal operation

Readiness Probe: Network Latency Tolerance

Configuration:

  • failureThreshold: 3 - Keep default (3 failures)

  • periodSeconds: 10 - Keep default (10 seconds)

  • timeoutSeconds: 5 - Increased from 1s to 5s

Why 5-second timeout:

  • HTTPS health checks require TLS handshake: 50-200ms typical, 1s+ during load

  • Network latency during upgrades: 100-500ms typical, 2s worst case

  • DNS resolution: 50-500ms typical, 1s+ during CoreDNS load

  • Safety margin: 5s handles 99.9th percentile scenarios

Why keep other settings:

  • failureThreshold: 3 maintains original behavior (30s window)

  • periodSeconds: 10 provides quick traffic routing decisions

Trade-off:

  • ✅ Tolerates realistic network latency scenarios

  • ✅ Prevents false-positive “NotReady” marking

  • ⚠️ A hung endpoint now takes up to 5s per probe attempt to time out, so it can take roughly 30 seconds (3 failures × 10s period) to be marked NotReady

  • ⚠️ Acceptable: Service has multiple endpoints, brief delays OK

Probe Sequence During Startup

Successful startup with custom probes:

Time  Probe           Result    Pod State     Service Endpoints
─────────────────────────────────────────────────────────────────
T+0s  startupProbe    Fail      Starting      Not added (expected)
T+10s startupProbe    Fail      Starting      Not added (expected)
T+20s startupProbe    Fail      Starting      Not added (expected)
T+30s startupProbe    Success   Starting      Not added yet
      ↓ startupProbe succeeds, enable readiness/liveness probes
T+30s readinessProbe  Success   Ready         ✅ Added to Service!
      ↓ Pod now receives traffic
T+60s livenessProbe   Success   Ready         In Service
T+90s readinessProbe  Success   Ready         In Service

Compare to default probes (for reference):

Time  Probe           Result    Pod State     Service Endpoints
─────────────────────────────────────────────────────────────────
T+0s  readinessProbe  Fail      Starting      Not added
T+10s readinessProbe  Fail      Starting      Not added
T+20s readinessProbe  Fail      Starting      Not added
T+30s ❌ Pod marked NotReady (3 failures)
T+60s Manager checks Service accessibility
T+60s ❌ Service has no endpoints (this pod not added)
T+120s ❌ Fatal crash: "webhook service not accessible"

Implementation Approach

Why Strategic Merge Patch

Challenge: Longhorn Helm chart doesn’t expose probe configuration in values.yaml

Solution Options Evaluated:

  1. Fork Helm chart: High maintenance burden, must track upstream changes

  2. Manual kubectl patch: Not persistent, lost on cluster rebuild

  3. Strategic merge patch via extra-manifests: Infrastructure-as-code, automatically applied

Strategic merge patch (40-G-longhorn-manager-probes-patch.yaml.tpl):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: longhorn-manager
  namespace: longhorn-system
spec:
  template:
    spec:
      containers:
      - name: longhorn-manager
        startupProbe: { ... }    # Added to existing container config
        livenessProbe: { ... }   # Added to existing container config
        readinessProbe: { ... }  # Merged with existing readinessProbe

How it works:

  1. OpenTofu applies extra-manifests after Helm chart deployment

  2. Kubernetes strategically merges patch with existing DaemonSet

  3. Probe configuration added/updated without replacing entire DaemonSet

  4. Longhorn manager pods rolling restart with new probe configuration
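
For reference, the equivalent change could also be applied as a one-off strategic merge patch outside of OpenTofu. This is only a sketch of the manual step (the patch file name is a placeholder for the rendered template); in this setup the extra-manifests mechanism already applies it automatically:

kubectl patch daemonset longhorn-manager -n longhorn-system \
  --type strategic \
  --patch-file longhorn-manager-probes-patch.yaml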

Why Extra-Manifests

Advantages:

  • ✅ Version-controlled with infrastructure code

  • ✅ Automatically applied on every tofu apply

  • ✅ Persists across cluster rebuilds

  • ✅ No custom Helm chart maintenance

  • ✅ Easy to modify or remove

Disadvantages:

  • ⚠️ Applied after Helm chart (brief window where default probes exist)

  • ⚠️ Requires understanding of strategic merge semantics

  • ⚠️ May conflict with future Helm chart changes

Comparison: Default vs Custom

Startup Scenario Outcomes

Scenario         Network Init Time   Default Probes                Custom Probes
─────────────────────────────────────────────────────────────────────────────────
Fast node        10 seconds          ✅ Success                    ✅ Success
Normal node      25 seconds          ✅ Success                    ✅ Success
Slow node        45 seconds          ❌ CrashLoopBackOff           ✅ Success
Very slow node   90 seconds          ❌ CrashLoopBackOff           ✅ Success
Broken pod       5+ minutes          ❌ CrashLoopBackOff forever   ❌ Killed after 5 min, recreated

Recovery from Stuck State

Situation                    Default Probes                Custom Probes
─────────────────────────────────────────────────────────────────────────────────
Pod stuck for 5 minutes      ⚠️ Stays running, unusable    ✅ Killed at 5 min (startup timeout)
Pod stuck for 11 days        ⚠️ Stays running forever      ✅ Killed at 90s (liveness failure)
Manual intervention needed   ⚠️ Yes (delete pod by hand)   ✅ No (automatic recovery)

Network Latency Handling

Latency   Default (1s timeout)   Custom (5s timeout)
──────────────────────────────────────────────────────
50ms      ✅ Success             ✅ Success
500ms     ✅ Success             ✅ Success
1.5s      ❌ Probe timeout       ✅ Success
3s        ❌ Probe timeout       ✅ Success
6s        ❌ Probe timeout       ❌ Probe timeout

Observability

Monitoring Probe Health

Check current configuration:

kubectl get ds longhorn-manager -n longhorn-system -o yaml | \
  grep -E "(startupProbe|livenessProbe|readinessProbe)" -A 10

Monitor probe failures:

# Watch for Unhealthy events
kubectl get events -n longhorn-system \
  --field-selector involvedObject.kind=Pod,reason=Unhealthy -w

# Example output if probe fails:
# Readiness probe failed: Get "https://10.42.3.33:9502/v1/healthz":
# context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Check restart counts (indicator of liveness probe failures):

kubectl get pods -n longhorn-system -l app=longhorn-manager \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

Metrics and Alerts

Prometheus metrics (from kube-state-metrics):

# Pods in CrashLoopBackOff
kube_pod_container_status_waiting_reason{
  namespace="longhorn-system",
  pod=~"longhorn-manager.*",
  reason="CrashLoopBackOff"
}

# Restart count trend
rate(kube_pod_container_status_restarts_total{
  namespace="longhorn-system",
  pod=~"longhorn-manager.*"
}[5m])

Recommended alerts:

# Alert on CrashLoopBackOff
- alert: LonghornManagerCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{namespace="longhorn-system",pod=~"longhorn-manager.*",reason="CrashLoopBackOff"} > 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Longhorn manager pod {{ $labels.pod }} in CrashLoopBackOff"

# Alert on high restart rate
- alert: LonghornManagerHighRestarts
  expr: rate(kube_pod_container_status_restarts_total{namespace="longhorn-system",pod=~"longhorn-manager.*"}[1h]) > 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Longhorn manager pod {{ $labels.pod }} restarting frequently"

Lessons Learned

Design Principles

  1. Health probes checking self-hosted services create circular dependencies

    • Probe failure → Pod not Ready → Not in Service → Can’t reach Service

    • Mitigation: Extended startup grace period

  2. Default Helm chart values may not suit all deployment scenarios

    • Charts optimize for standard environments

    • Real-world network timing varies significantly

    • Custom patches sometimes necessary

  3. Missing liveness probes mean no automatic recovery

    • Kubernetes assumes “running” containers are healthy

    • Need explicit liveness probe for self-healing

    • Trade-off: Balance between aggressive recovery and stability

  4. Probe timeouts must account for worst-case network conditions

    • HTTPS requires TLS handshake overhead

    • Network latency during upgrades is higher than normal

    • 1-second timeouts insufficient for realistic scenarios

Operational Insights

From the 11-day CrashLoopBackOff incident:

  • Random timing means some nodes always at risk during disruptions

  • Manual intervention doesn’t scale (deleting stuck pods by hand requires someone to notice the failure first)

  • Aggressive probes create false failures more often than they detect real problems

  • Missing probes prevent self-healing (no liveness probe = no auto-recovery)

Technical References

Kubernetes Documentation:

Longhorn Documentation:

Implementation Files:

  • kube-hetzner/extra-manifests/40-G-longhorn-manager-probes-patch.yaml.tpl

  • docs/plan/longhorn-manager-probes.md