# Troubleshoot Nodes Stuck After K3S Upgrade

Learn how to diagnose and resolve nodes that remain cordoned (`SchedulingDisabled`) after a K3S upgrade has completed successfully.
## Problem Description

**Symptom:** A node shows status `Ready,SchedulingDisabled` days or weeks after a K3S upgrade completed successfully.

**Example:**

```bash
kubectl get nodes

NAME                         STATUS                     ROLES    AGE   VERSION
kup6s-agent-cax31-fsn1-yim   Ready,SchedulingDisabled   <none>   11d   v1.31.13+k3s1
#                            ^^^^^^^^^^^^^^^^^^^^^^^^ Node is healthy but cannot schedule pods
```
**Key differences from upgrade failures:**

- ❌ **Upgrade failures** (covered in K3S Upgrade Failures): the upgrade job times out because unhealthy pods block the drain
- ✅ **This issue**: the upgrade completed successfully, but the node was never uncordoned
## Root Cause: System-Upgrade-Controller Bug

The K3S system-upgrade-controller has a known race condition:

1. The upgrade process cordons the node → drains pods → upgrades K3S → should uncordon
2. The controller pod restarts during or after the upgrade (normal operation)
3. The controller loses state and forgets which nodes need uncordoning
4. The node is stuck in the `SchedulingDisabled` state indefinitely

This is not a configuration error; it is a timing-sensitive bug in the upstream controller.
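To confirm this scenario on your cluster, check whether the controller pod restarted around the upgrade window. A quick check (the label selector below is the upstream controller's default and is an assumption about this deployment; adjust it if yours differs):

```bash
# Look for restarts of the system-upgrade-controller around the upgrade window
kubectl get pods -n system-upgrade \
  -l upgrade.cattle.io/controller=system-upgrade-controller -o wide
# A non-zero RESTARTS count, or an AGE younger than the upgrade, suggests lost state
```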
## Automatic Solution: Auto-Uncordon CronJob

Since 2025-11-09, the cluster includes an automated solution that runs every 5 minutes.
### How It Works

The auto-uncordon CronJob automatically detects and uncordons stuck nodes, guarded by the safety checks below (a simplified sketch of the logic follows the list).

**Safety checks (ALL must pass before uncordoning):**

- ✅ Node is cordoned (`spec.unschedulable == true`)
- ✅ Node status is `Ready`
- ✅ Node version matches the target version from the upgrade plan
- ✅ No active upgrade jobs exist for the node
- ✅ Upgrade plan shows `Complete` status

**Result:** Stuck nodes are automatically uncordoned within 5 minutes, with no manual intervention.
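The following is a minimal sketch of that decision logic for a single agent node, not the actual script shipped in the manifest (see Technical Details); the node name is illustrative and error handling is omitted:

```bash
#!/usr/bin/env bash
# Sketch of the auto-uncordon safety checks for one agent node.
# Illustrative only: the real script iterates over all cordoned nodes.
set -euo pipefail

NODE="kup6s-agent-cax31-fsn1-yim"   # illustrative node name

# 1. Node must be cordoned
[[ "$(kubectl get node "$NODE" -o jsonpath='{.spec.unschedulable}')" == "true" ]] || exit 0

# 2. Node must be Ready
[[ "$(kubectl get node "$NODE" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" == "True" ]] || exit 1

# 3. Node version must match the plan's target (normalize '+' vs '-')
TARGET="$(kubectl get plan k3s-agent -n system-upgrade -o jsonpath='{.status.latestVersion}')"
ACTUAL="$(kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr '+' '-')"
[[ "$ACTUAL" == "${TARGET/+/-}" ]] || exit 1

# 4. No upgrade jobs may still be active
[[ -z "$(kubectl get jobs -n system-upgrade \
         -o jsonpath='{.items[?(@.status.active==1)].metadata.name}')" ]] || exit 1

# 5. The upgrade plan must report Complete
[[ "$(kubectl get plan k3s-agent -n system-upgrade \
      -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}')" == "True" ]] || exit 1

# All checks passed: uncordon
kubectl uncordon "$NODE"
```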
### Verify Auto-Uncordon Is Working

Check recent CronJob runs:

```bash
# List recent jobs
kubectl get jobs -n kube-system -l app.kubernetes.io/name=auto-uncordon \
  --sort-by=.metadata.creationTimestamp | tail -5

# View logs from the most recent job
kubectl logs -n kube-system -l app.kubernetes.io/name=auto-uncordon --tail=100
```

Expected log output (when no stuck nodes are found):

```text
[2025-11-09 12:37:00 UTC] Starting stuck node detection...
[2025-11-09 12:37:04 UTC] Target versions - Agent: v1.31.13-k3s1, Server: v1.31.13-k3s1
[2025-11-09 12:37:07 UTC] Upgrade plans Complete status - Agent: True, Server: True
[2025-11-09 12:37:08 UTC] INFO: No cordoned nodes found. Nothing to do.
```
When a stuck node is detected and uncordoned:

```text
[2025-11-09 12:40:03 UTC] Found cordoned nodes: kup6s-agent-cax31-fsn1-yim
[2025-11-09 12:40:03 UTC] Checking node: kup6s-agent-cax31-fsn1-yim
[2025-11-09 12:40:05 UTC] ✅ Node is Ready
[2025-11-09 12:40:06 UTC] Node type: agent, target version: v1.31.13-k3s1
[2025-11-09 12:40:08 UTC] ✅ Version matches target: v1.31.13+k3s1
[2025-11-09 12:40:10 UTC] ✅ No active upgrade jobs
[2025-11-09 12:40:10 UTC] ✅ ALL CHECKS PASSED - Ready to uncordon
[2025-11-09 12:40:10 UTC] 🔧 Uncordoning node kup6s-agent-cax31-fsn1-yim
node/kup6s-agent-cax31-fsn1-yim uncordoned
[2025-11-09 12:40:11 UTC] ✅ Successfully uncordoned kup6s-agent-cax31-fsn1-yim
[2025-11-09 12:40:11 UTC] Complete. Uncordoned 1 node(s).
```
### Check CronJob Status

```bash
# Verify the CronJob is running
kubectl get cronjob auto-uncordon-stuck-nodes -n kube-system

# Expected output
NAME                        SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE
auto-uncordon-stuck-nodes   */5 * * * *   False     0        2m
```
## Manual Diagnosis (If Auto-Uncordon Fails)

If a node remains stuck after 10+ minutes, diagnose it manually:

### 1. Identify Cordoned Nodes

```bash
kubectl get nodes | grep SchedulingDisabled
```
### 2. Check Node Details

```bash
# Replace with your stuck node name
NODE_NAME="kup6s-agent-cax31-fsn1-yim"

# Check node version
kubectl get node $NODE_NAME -o jsonpath='{.status.nodeInfo.kubeletVersion}'

# Check whether the node is cordoned
kubectl get node $NODE_NAME -o jsonpath='{.spec.unschedulable}'
# Output: true (cordoned) or <empty> (schedulable)
```
### 3. Check Upgrade Plan Status

```bash
# Check the agent upgrade plan
kubectl get plan k3s-agent -n system-upgrade -o yaml | grep -A 5 "status:"

# Check the server upgrade plan
kubectl get plan k3s-server -n system-upgrade -o yaml | grep -A 5 "status:"
```

Look for:

- `latestVersion`: should match the node version
- A condition with `type: Complete` and `status: "True"`
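To pull just those two fields in one command (output values illustrative):

```bash
# Print latestVersion and the Complete condition status for the agent plan
kubectl get plan k3s-agent -n system-upgrade \
  -o jsonpath='{.status.latestVersion}{"  Complete="}{.status.conditions[?(@.type=="Complete")].status}{"\n"}'
# Expected output (illustrative): v1.31.13-k3s1  Complete=True
```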
### 4. Check for Active Upgrade Jobs

```bash
kubectl get jobs -n system-upgrade
```

**Expected:** no active jobs (all should be completed or failed).

**If active jobs exist:** wait for them to complete before manually uncordoning.
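If you prefer to block in the terminal rather than poll, `kubectl wait` can do the waiting; note that it exits non-zero if a job fails instead of completing, and the timeout here is an arbitrary example:

```bash
# Block until every upgrade job reaches the Complete condition (or time out)
kubectl wait --for=condition=complete job --all -n system-upgrade --timeout=15m
```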
### 5. Review Auto-Uncordon Logs

```bash
# Get the latest job logs
kubectl logs -n kube-system -l app.kubernetes.io/name=auto-uncordon --tail=100
```

Look for:

- ❌ **Version mismatch**: the node version doesn't match the plan's target version
- ❌ **Active jobs**: upgrade jobs are still running for the node
- ❌ **Plan incomplete**: the upgrade plan does not show `Complete: True`
- ✅ **All checks passed**: the node should have been uncordoned
## Manual Uncordon (If Needed)

If auto-uncordon hasn't fixed the node and all safety checks pass:

```bash
# Replace with your stuck node name
NODE_NAME="kup6s-agent-cax31-fsn1-yim"

# Verify the node is Ready
kubectl get node $NODE_NAME | grep Ready

# Verify there are no active upgrade jobs
kubectl get jobs -n system-upgrade \
  -o jsonpath='{.items[?(@.status.active==1)].metadata.name}' | grep . || echo "No active jobs"

# Uncordon manually
kubectl uncordon $NODE_NAME
```

Verify:

```bash
kubectl get node $NODE_NAME
# Should show "Ready" (not "Ready,SchedulingDisabled")
```
## Troubleshooting Auto-Uncordon CronJob

### CronJob Not Running

Check that the CronJob exists:

```bash
kubectl get cronjob -n kube-system | grep auto-uncordon
```

If missing, re-apply the infrastructure manifests:

```bash
cd kube-hetzner
bash scripts/apply-and-configure-longhorn.sh
```

Check for a suspended CronJob:

```bash
kubectl get cronjob auto-uncordon-stuck-nodes -n kube-system -o jsonpath='{.spec.suspend}'
# Output should be: false
```

If suspended, resume the CronJob:

```bash
kubectl patch cronjob auto-uncordon-stuck-nodes -n kube-system \
  -p '{"spec":{"suspend":false}}'
```
### CronJob Pods Failing

Check recent job failures:

```bash
kubectl get jobs -n kube-system -l app.kubernetes.io/name=auto-uncordon \
  --sort-by=.metadata.creationTimestamp | tail -10
```

Get failure logs:

```bash
# Get the pod name from the failed job
kubectl get pods -n kube-system -l app.kubernetes.io/name=auto-uncordon \
  --field-selector=status.phase=Failed

# View logs (replace POD_NAME)
kubectl logs -n kube-system POD_NAME
```

Common issues:

- **Image pull failures**: check cluster network and image availability
- **Permission errors**: verify the RBAC permissions exist (see the check below)
- **API server errors**: check overall cluster health
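For permission errors specifically, kubectl's impersonation support can test the ServiceAccount's access directly. The ServiceAccount name and namespace below follow the configuration listed under Technical Details:

```bash
# Each command should print "yes" if the RBAC bindings are intact
SA="system:serviceaccount:kube-system:auto-uncordon"
kubectl auth can-i patch nodes --as="$SA"
kubectl auth can-i list plans.upgrade.cattle.io --as="$SA"
kubectl auth can-i list jobs -n system-upgrade --as="$SA"
```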
### Re-enable Dry-Run Mode (For Testing)

If you suspect the CronJob is uncordoning nodes incorrectly:

```bash
# Enable dry-run mode (logs only, no uncordoning)
kubectl set env cronjob/auto-uncordon-stuck-nodes -n kube-system DRY_RUN=true

# Wait for the next scheduled run (within 5 minutes)
kubectl logs -n kube-system -l app.kubernetes.io/name=auto-uncordon --tail=100 -f

# Disable dry-run mode to resume automatic uncordoning
kubectl set env cronjob/auto-uncordon-stuck-nodes -n kube-system DRY_RUN=false
```
## When NOT to Uncordon

**Do NOT uncordon if:**

- ❌ The node version doesn't match the cluster version (upgrade still in progress)
- ❌ Active upgrade jobs exist in the `system-upgrade` namespace
- ❌ The node status is not `Ready`
- ❌ The node has been manually cordoned for maintenance (check annotations)

**How to safely cordon for maintenance (without auto-uncordon interfering):**

The auto-uncordon CronJob only uncordons nodes that:

- match the target K3S version,
- have no active upgrade jobs, and
- are in the `Ready` state.

So manually cordoning a node for maintenance is generally safe: the node won't be auto-uncordoned unless it meets ALL of the criteria, which it typically won't while maintenance is actually in progress (for example, while it is drained or rebooting and therefore not `Ready`).
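To make the maintenance intent visible to other operators, you can record it in an annotation when cordoning; the annotation key below is a hypothetical convention, not something the CronJob reads:

```bash
# Cordon for maintenance and record why (annotation key is a made-up convention)
kubectl cordon kup6s-agent-cax31-fsn1-yim
kubectl annotate node kup6s-agent-cax31-fsn1-yim \
  maintenance.example.com/reason="disk replacement"

# When maintenance is done, clear the annotation and uncordon
kubectl annotate node kup6s-agent-cax31-fsn1-yim maintenance.example.com/reason-
kubectl uncordon kup6s-agent-cax31-fsn1-yim
```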
## Prevention

The auto-uncordon CronJob prevents this issue automatically; no additional configuration is needed.

**Best practices:**

- ✅ Keep the auto-uncordon CronJob running (the default)
- ✅ Monitor CronJob logs after K3S upgrades
- ✅ Verify all nodes are schedulable after upgrades complete

Check that all nodes are schedulable:

```bash
# Should return empty (no cordoned nodes)
kubectl get nodes -o jsonpath='{.items[?(@.spec.unschedulable==true)].metadata.name}'
```
## Technical Details

**CronJob configuration:**

- Manifest: `kube-hetzner/extra-manifests/85-auto-uncordon-cronjob.yaml.tpl`
- Namespace: `kube-system`
- Schedule: `*/5 * * * *` (every 5 minutes)
- Image: `alpine/k8s:1.31.3` (Alpine Linux with kubectl and jq)
- ServiceAccount: `auto-uncordon` (minimal RBAC permissions)

**RBAC permissions:**

- `nodes`: get, list, patch (read node status, uncordon)
- `plans.upgrade.cattle.io`: get, list (read upgrade plan target versions)
- `jobs`: get, list (check for active upgrade jobs)

**Version normalization:** The script normalizes K3S version strings because the two sources format them differently:

- Upgrade plans use `v1.31.13-k3s1` (hyphen)
- The kubelet reports `v1.31.13+k3s1` (plus sign)

Both formats are valid and equivalent.
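In shell, normalizing the separator before comparing is a one-liner; a minimal example, not necessarily the exact approach the script takes:

```bash
# Map the kubelet's '+' separator onto the plan's '-' before comparing
normalize() { printf '%s\n' "$1" | tr '+' '-'; }

[ "$(normalize v1.31.13+k3s1)" = "$(normalize v1.31.13-k3s1)" ] && echo "versions match"
# Output: versions match
```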