Troubleshoot K3S Upgrade Failures¶
Learn how to diagnose and fix K3S upgrade failures caused by unhealthy pods preventing node drain.
Problem Description¶
Symptom: K3S system-upgrade-controller fails to upgrade nodes, leaving them in SchedulingDisabled (cordoned) state.
Common error messages:
level=error msg="upgrade failed: Job was active longer than specified deadline"
Node status:
kubectl get nodes
NAME                         STATUS                     ROLES    AGE   VERSION
kup6s-agent-cax31-fsn1-yim   Ready,SchedulingDisabled   <none>   1h    v1.31.13+k3s1
#                            ^^^^^^^^^^^^^^^^^^^^^^^^ node is cordoned
Root Cause: Unhealthy Pods Blocking Drain¶
K3S upgrades follow this process:
1. Cordon node (mark SchedulingDisabled)
2. Drain node (evict all pods safely)
3. Upgrade K3S binary
4. Uncordon node
Failure occurs when pods cannot be evicted during the drain due to:
PodDisruptionBudgets (PDBs) preventing disruption when pods are unhealthy
Pods stuck in non-ready state
Insufficient healthy replicas to meet PDB requirements
Result: Drain times out → upgrade job exceeds activeDeadlineSeconds → node left cordoned
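Before walking through the detailed steps below, you can confirm this chain on a stuck node by checking the cordon flag and the taint a cordon adds; a minimal sketch, using the node name from the examples in this guide:
# Confirm the node was left cordoned (spec.unschedulable=true) by the failed upgrade
NODE_NAME="kup6s-agent-cax31-fsn1-yim"
kubectl get node $NODE_NAME -o jsonpath='{.spec.unschedulable}'; echo
# A cordoned node also carries the node.kubernetes.io/unschedulable:NoSchedule taint
kubectl describe node $NODE_NAME | grep -A2 Taints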
Common Scenario: Loki Pods + Missing S3 Bucket¶
This is the most common cause in KUP6S clusters:
What Happens¶
1. Loki S3 bucket missing (bucket not created or creation failed)
2. Loki backend pods fail with S3 errors:
   NoSuchBucket: The specified bucket does not exist. status code: 404
3. Loki read pods fail (can’t form memberlist ring with unhealthy backends)
4. PodDisruptionBudget blocks drain:
   PDB loki-read requires maxUnavailable: 1
   Currently 0/2 pods healthy → disruptionsAllowed: 0
   Drain cannot proceed without violating the PDB
5. Upgrade job times out after 900s (15 minutes)
6. Node left cordoned (SchedulingDisabled)
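To confirm this scenario quickly, check the Loki backend pods for the S3 error directly; a sketch, assuming the backend pods carry the Loki Helm chart's app.kubernetes.io/component=backend label (the read pods use component=read, as used later in this guide):
# Backend pods and their recent S3 errors
kubectl get pods -n monitoring -l app.kubernetes.io/component=backend
kubectl logs -n monitoring -l app.kubernetes.io/component=backend --tail=50 | grep -i NoSuchBucket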
Why It Happened (Historical)¶
Before infrastructure updates: Loki S3 bucket (logs-loki-kup6s) was commented out in kustomization and required manual application due to CRD timing issues.
If forgotten: Loki deployed without bucket → pods unhealthy → upgrade failures
Current state: S3 buckets now deploy automatically with CRD wait conditions (resolved in recent infrastructure updates)
Diagnosis Steps¶
1. Check Node Status¶
kubectl get nodes
Look for nodes whose STATUS shows Ready,SchedulingDisabled (kubectl describe node additionally reports Unschedulable: true).
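To filter for cordoned nodes directly, a quick sketch:
# Show only cordoned nodes
kubectl get nodes | grep SchedulingDisabled
# Or confirm a single node's state
kubectl describe node NODE_NAME | grep Unschedulable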
2. Check Upgrade Job Status¶
kubectl get jobs -n system-upgrade
Look for jobs that never complete (COMPLETIONS stuck at 0/1) or that stayed active beyond the 900-second deadline.
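For more detail on a failed job, describe it and pull its pod's logs (the pod may already be gone if the job was cleaned up); a sketch, with JOB_NAME as a placeholder:
# Inspect the failed upgrade job and its events
kubectl describe job JOB_NAME -n system-upgrade
# Logs from the job's pod, if it still exists
kubectl logs job/JOB_NAME -n system-upgrade --tail=50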
3. Identify Unhealthy Pods on Node¶
# Replace NODE_NAME with your cordoned node
NODE_NAME="kup6s-agent-cax31-fsn1-yim"
kubectl get pods -A --field-selector spec.nodeName=$NODE_NAME -o wide | grep -v Running
4. Check PodDisruptionBudgets¶
kubectl get pdb -A
Look for PDBs with ALLOWED DISRUPTIONS: 0 and check their target pods.
Example:
NAMESPACE    NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
monitoring   loki-read   N/A             1                 0                     1h
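To see which pods a blocking PDB actually covers, read its selector and list the matching pods; a sketch for the loki-read example, assuming the PDB selects the read pods via matchLabels and the app.kubernetes.io/component=read label used elsewhere in this guide:
# Selector used by the blocking PDB
kubectl get pdb loki-read -n monitoring -o jsonpath='{.spec.selector.matchLabels}'; echo
# Pods it covers - check how many are actually Ready
kubectl get pods -n monitoring -l app.kubernetes.io/component=read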
5. Investigate Pod Failures¶
For unhealthy pods found in step 3:
# Check pod status
kubectl describe pod POD_NAME -n NAMESPACE
# Check logs
kubectl logs POD_NAME -n NAMESPACE --tail=50
Common errors to look for:
NoSuchBucket - S3 bucket missing (Loki, etcd backups)
Connection refused - Service dependency unavailable
CrashLoopBackOff - Application failing to start
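Pod events often state the failure reason more directly than logs; a sketch, using the same placeholders:
# Recent events for the failing pod, newest last
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD_NAME --sort-by=.lastTimestamp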
Resolution: Fix Underlying Issue¶
Case 1: Missing S3 Bucket (Loki)¶
Check if bucket exists:
kubectl get buckets.s3.aws.upbound.io -A
If missing, re-apply infrastructure to create buckets:
cd kube-hetzner
bash scripts/apply-and-configure-longhorn.sh
Verify bucket created:
kubectl get buckets.s3.aws.upbound.io logs-loki-kup6s -n crossplane-system
Should show SYNCED=True, READY=True.
Wait for Loki to recover:
kubectl get pods -n monitoring -l app.kubernetes.io/component=read -w
Wait until READY shows 1/1 or 2/2.
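Once the read pods are Ready again, the PDB should allow disruptions and the drain can proceed; a quick check:
# ALLOWED DISRUPTIONS should now be 1 or more
kubectl get pdb loki-read -n monitoring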
Case 2: Missing S3 Bucket (etcd Backups)¶
Similar to Loki, but bucket name is backup-etcd-kup6s:
kubectl get buckets.s3.aws.upbound.io backup-etcd-kup6s -n crossplane-system
If missing, run the apply script (same as Case 1).
Case 3: Other Unhealthy Pods¶
Identify root cause from pod logs and events:
kubectl describe pod POD_NAME -n NAMESPACE
kubectl logs POD_NAME -n NAMESPACE --tail=100
Common fixes:
ConfigMap/Secret missing: Apply required resources
Service unavailable: Check dependent services are running
Resource limits: Increase CPU/memory if evicted due to limits
Image pull errors: Check image name and registry access
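Whatever the underlying fix, pods often need a restart to pick up a newly created ConfigMap, Secret, or recovered dependency; a minimal sketch, assuming the workload is a Deployment (DEPLOYMENT_NAME is a placeholder):
# Restart the workload so new pods pick up the fixed configuration
kubectl rollout restart deployment DEPLOYMENT_NAME -n NAMESPACE
kubectl rollout status deployment DEPLOYMENT_NAME -n NAMESPACE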
Uncordon Node After Fix¶
Once the underlying issue is resolved and pods are healthy:
# Replace NODE_NAME with your cordoned node
NODE_NAME="kup6s-agent-cax31-fsn1-yim"
kubectl uncordon $NODE_NAME
Verify:
kubectl get nodes
Node should no longer show SchedulingDisabled.
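The system-upgrade-controller typically retries the upgrade on its own once the node is healthy and uncordoned; a quick way to watch it, assuming the default system-upgrade namespace:
# Watch the upgrade job being recreated and completing
kubectl get jobs -n system-upgrade -w
# Afterwards the node should report the new K3S version
kubectl get nodes -o wide
# If no new job appears, deleting the failed job is a common way to trigger a retry:
# kubectl delete job JOB_NAME -n system-upgrade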
Manual Drain (If Needed)¶
If the upgrade is stuck and you need to force a drain:
⚠️ Warning: Only do this if you understand the impact. Forced drains can disrupt services.
NODE_NAME="kup6s-agent-cax31-fsn1-yim"
# Force drain (--disable-eviction bypasses PDBs by deleting pods directly instead of evicting them)
kubectl drain $NODE_NAME \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=0 \
--disable-eviction
Then uncordon:
kubectl uncordon $NODE_NAME
Prevention¶
Automatic S3 Bucket Deployment¶
Current setup: S3 buckets for Loki and etcd backups now deploy automatically during cluster provisioning via:
50-C-etcd-backup-bucket.yaml.tpl (DR region)
50-D-loki-s3-bucket.yaml.tpl (production region)
CRD wait conditions in kustomization_user.tf ensure:
Crossplane S3 Provider CRDs are registered before bucket creation
No manual intervention required
Verify After Deployment¶
After cluster creation, verify buckets exist:
kubectl get buckets.s3.aws.upbound.io -A
Expected output:
NAME                SYNCED   READY   EXTERNAL-NAME       AGE
backup-etcd-kup6s   True     True    backup-etcd-kup6s   5m
logs-loki-kup6s     True     True    logs-loki-kup6s     5m
Monitor Pod Health¶
Regularly check for unhealthy pods:
kubectl get pods -A | grep -vE 'Running|Completed'
Fix issues before upgrades to avoid drain failures.
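A server-side alternative that avoids parsing output is filtering on pod phase; note that neither form catches pods that are Running but not Ready:
# Pods not in Running or Succeeded phase
kubectl get pods -A --field-selector status.phase!=Running,status.phase!=Succeeded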
System Upgrade Controller Configuration¶
The K3S upgrade behavior is controlled by:
activeDeadlineSeconds: 900 (15 minutes)
Maximum time for upgrade job including drain
If exceeded, job fails and node stays cordoned
Drain options:
--ignore-daemonsets - DaemonSet-managed pods are skipped (they are not evicted)
--delete-emptydir-data - Deletes data in emptyDir volumes
--force - Continues even for pods not managed by a controller
--skip-wait-for-delete-timeout 60 - Stops waiting for pods whose deletion has been pending for more than 60s
If drain repeatedly times out, increase activeDeadlineSeconds or investigate why pods take so long to evict.
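To see the deadline actually applied to the upgrade jobs in your cluster (where this value is configured depends on how the system-upgrade-controller was deployed), a quick sketch:
# Deadline and current state of each upgrade job
kubectl get jobs -n system-upgrade \
  -o custom-columns=NAME:.metadata.name,DEADLINE:.spec.activeDeadlineSeconds,ACTIVE:.status.active,FAILED:.status.failed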