How-To Guide

Troubleshoot K3S Upgrade Failures

Learn how to diagnose and fix K3S upgrade failures caused by unhealthy pods preventing node drain.

Problem Description

Symptom: K3S system-upgrade-controller fails to upgrade nodes, leaving them in SchedulingDisabled (cordoned) state.

Common error messages:

level=error msg="upgrade failed: Job was active longer than specified deadline"

Node status:

kubectl get nodes
NAME                         STATUS                     ROLES    AGE   VERSION
kup6s-agent-cax31-fsn1-yim   Ready,SchedulingDisabled   <none>   1h    v1.31.13+k3s1
#                            ^^^^^^^^^^^^^^^^^^^^^^^^ Cordoned: no new pods will be scheduled here

Root Cause: Unhealthy Pods Blocking Drain

K3S upgrades follow this process:

  1. Cordon node (mark SchedulingDisabled)

  2. Drain node (evict all pods safely)

  3. Upgrade K3S binary

  4. Uncordon node

Failure occurs when pods cannot be evicted during the drain, typically because of:

  • PodDisruptionBudgets (PDBs) preventing disruption when pods are unhealthy

  • Pods stuck in non-ready state

  • Insufficient healthy replicas to meet PDB requirements

Result: Drain times out → upgrade job exceeds activeDeadlineSeconds → node left cordoned
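
The error shown above typically appears in the controller's own logs; the job pods in the same namespace carry the drain output (the deployment and namespace names below are the system-upgrade-controller defaults):

kubectl -n system-upgrade logs deploy/system-upgrade-controller --tail=50
kubectl -n system-upgrade get pods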

Common Scenario: Loki Pods + Missing S3 Bucket

This is the most common cause in KUP6S clusters:

What Happens

  1. Loki S3 bucket missing (bucket not created or creation failed)

  2. Loki backend pods fail with S3 errors:

    NoSuchBucket: The specified bucket does not exist.
    status code: 404
    
  3. Loki read pods fail (can’t form memberlist ring with unhealthy backends)

  4. PodDisruptionBudget blocks drain:

    • PDB loki-read requires maxUnavailable: 1

    • Currently 0/2 pods healthy → disruptionsAllowed: 0

    • Drain cannot proceed without violating PDB

  5. Upgrade job times out after 900s (15 minutes)

  6. Node left cordoned (SchedulingDisabled)
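
A quick way to confirm this scenario (the namespace and label selector below assume the standard Loki deployment in monitoring):

# PDB should report 0 allowed disruptions while the backend logs show the S3 error
kubectl get pdb loki-read -n monitoring
kubectl logs -n monitoring -l app.kubernetes.io/component=backend --tail=20 | grep -i bucket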

Why It Happened (Historical)

Before infrastructure updates: the Loki S3 bucket manifest (logs-loki-kup6s) was commented out in the kustomization and had to be applied manually because of CRD timing issues.

If forgotten: Loki deployed without bucket → pods unhealthy → upgrade failures

Current state: S3 buckets now deploy automatically with CRD wait conditions (resolved in recent infrastructure updates)

Diagnosis Steps

1. Check Node Status

kubectl get nodes

Cordoned nodes show Ready,SchedulingDisabled in the STATUS column; kubectl describe node also reports Unschedulable: true.
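
A quick filter if the cluster has many nodes:

# Cordoned nodes carry SchedulingDisabled in their STATUS column
kubectl get nodes | grep SchedulingDisabled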

2. Check Upgrade Job Status

kubectl get jobs -n system-upgrade

Look for jobs whose COMPLETIONS column is still 0/1; kubectl describe job on a failed one shows the DeadlineExceeded condition.
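
The upgrade plans live in the same namespace; describing a failed job (name taken from the previous command) confirms the deadline was hit:

kubectl get plans.upgrade.cattle.io -n system-upgrade
kubectl describe job JOB_NAME -n system-upgrade | grep -i deadline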

3. Identify Unhealthy Pods on Node

# Replace NODE_NAME with your cordoned node
NODE_NAME="kup6s-agent-cax31-fsn1-yim"

kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o wide | grep -v Running
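
A variation that also drops completed Job pods, which do not block a drain:

kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o wide | grep -vE 'Running|Completed'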

4. Check PodDisruptionBudgets

kubectl get pdb -A

Look for PDBs with ALLOWED DISRUPTIONS: 0 and check their target pods.

Example:

NAMESPACE    NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
monitoring   loki-read   N/A             1                 0                     1h
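
To see which pods a blocking PDB selects and its current healthy count, describe it (namespace and name taken from the table above):

kubectl describe pdb loki-read -n monitoring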

5. Investigate Pod Failures

For unhealthy pods found in step 3:

# Check pod status
kubectl describe pod POD_NAME -n NAMESPACE

# Check logs
kubectl logs POD_NAME -n NAMESPACE --tail=50

Common errors to look for:

  • NoSuchBucket - S3 bucket missing (Loki, etcd backups)

  • Connection refused - Service dependency unavailable

  • CrashLoopBackOff - Application failing to start
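
Recent events often point to the cause faster than logs:

# Newest events last; replace NAMESPACE with the namespace of the failing pod
kubectl get events -n NAMESPACE --sort-by=.lastTimestamp | tail -20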

Resolution: Fix Underlying Issue

Case 1: Missing S3 Bucket (Loki)

Check if bucket exists:

kubectl get buckets.s3.aws.upbound.io -A

If missing, re-apply infrastructure to create buckets:

cd kube-hetzner
bash scripts/apply-and-configure-longhorn.sh

Verify bucket created:

kubectl get buckets.s3.aws.upbound.io logs-loki-kup6s -n crossplane-system

Should show SYNCED=True, READY=True.

Wait for Loki to recover:

kubectl get pods -n monitoring -l app.kubernetes.io/component=read -w

Wait until READY shows 1/1 or 2/2.
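
If the pods stay in CrashLoopBackOff even after the bucket exists, deleting them forces a clean restart. The backend label selector below is an assumption based on the standard Loki chart labels; verify it matches your pods first:

# Pods are recreated by their controllers and should start cleanly once the bucket is reachable
kubectl delete pod -n monitoring -l app.kubernetes.io/component=backend
kubectl delete pod -n monitoring -l app.kubernetes.io/component=read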

Case 2: Missing S3 Bucket (etcd Backups)

Similar to Loki, but bucket name is backup-etcd-kup6s:

kubectl get buckets.s3.aws.upbound.io backup-etcd-kup6s -n crossplane-system

If missing, run the apply script (same as Case 1).

Case 3: Other Unhealthy Pods

Identify root cause from pod logs and events:

kubectl describe pod POD_NAME -n NAMESPACE
kubectl logs POD_NAME -n NAMESPACE --tail=100

Common fixes:

  • ConfigMap/Secret missing: Apply required resources

  • Service unavailable: Check dependent services are running

  • Resource limits: Raise memory limits if containers are OOMKilled, or requests if pods are evicted under node pressure

  • Image pull errors: Check image name and registry access
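
To surface the usual suspects across the whole cluster in one pass:

kubectl get pods -A | grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull|CreateContainerConfigError'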

Uncordon Node After Fix

Once underlying issue is resolved and pods are healthy:

# Replace NODE_NAME with your cordoned node
NODE_NAME="kup6s-agent-cax31-fsn1-yim"

kubectl uncordon $NODE_NAME

Verify:

kubectl get nodes

Node should no longer show SchedulingDisabled.
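
The system-upgrade-controller should retry the upgrade on its own; watch for a new job, and if none appears after a while, deleting the failed job usually prompts a fresh attempt:

kubectl get jobs -n system-upgrade -w

# Only if no new job appears
kubectl delete job FAILED_JOB_NAME -n system-upgrade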

Manual Drain (If Needed)

If upgrade is stuck and you need to force drain:

⚠️ Warning: Only do this if you understand the impact. Forced drains can disrupt services.

NODE_NAME="kup6s-agent-cax31-fsn1-yim"

# Force drain (ignores PDB)
kubectl drain $NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=0 \
  --disable-eviction

Then uncordon:

kubectl uncordon $NODE_NAME

Prevention

Automatic S3 Bucket Deployment

Current setup: S3 buckets for Loki and etcd backups now deploy automatically during cluster provisioning via:

  • 50-C-etcd-backup-bucket.yaml.tpl (DR region)

  • 50-D-loki-s3-bucket.yaml.tpl (production region)

CRD wait conditions in kustomization_user.tf ensure:

  • Crossplane S3 Provider CRDs are registered before bucket creation

  • No manual intervention required

Verify After Deployment

After cluster creation, verify buckets exist:

kubectl get buckets.s3.aws.upbound.io -A

Expected output:

NAME                SYNCED   READY   EXTERNAL-NAME       AGE
backup-etcd-kup6s   True     True    backup-etcd-kup6s   5m
logs-loki-kup6s     True     True    logs-loki-kup6s     5m

Monitor Pod Health

Regularly check for unhealthy pods:

kubectl get pods -A | grep -v Running

Fix issues before upgrades to avoid drain failures.

System Upgrade Controller Configuration

The K3S upgrade behavior is controlled by:

activeDeadlineSeconds: 900 (15 minutes)

  • Maximum time for upgrade job including drain

  • If exceeded, job fails and node stays cordoned

Drain options:

  • --ignore-daemonsets - Skips DaemonSet-managed pods (they cannot be evicted and would be recreated anyway)

  • --delete-emptydir-data - Allows evicting pods that use emptyDir volumes (that data is lost)

  • --force - Proceeds even for pods not managed by a controller (they will not be recreated elsewhere)

  • --skip-wait-for-delete-timeout 60 - Stops waiting for pods that have already been terminating for more than 60 seconds

If drain repeatedly times out, increase activeDeadlineSeconds or investigate why pods take so long to evict.
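
To inspect what is currently in effect, check the Plan resources and the controller configuration. The environment variable named below is the system-upgrade-controller default and may be set differently in your setup:

# Drain flags are defined on the Plan spec; the 900s deadline is commonly set via
# SYSTEM_UPGRADE_JOB_ACTIVE_DEADLINE_SECONDS in the controller's ConfigMap
kubectl get plans.upgrade.cattle.io -n system-upgrade -o yaml | grep -i -A 3 -E 'drain|deadline'
kubectl -n system-upgrade get configmap -o yaml | grep -i deadline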