How-To Guide
Apply Infrastructure Changes Safely¶
Goal: Deploy or update KUP6S cluster infrastructure using OpenTofu without causing downtime or data loss.
Time: ~10-30 minutes (depending on changes)
Danger
CRITICAL: NEVER run tofu apply directly for kube.tf changes!
You MUST use the apply-and-configure-longhorn.sh script instead. Running tofu apply manually will leave Longhorn storage in a misconfigured state with incorrect storage reservations.
Correct command:
bash scripts/apply-and-configure-longhorn.sh
See Step 6: Apply changes for details.
Prerequisites¶
Environment set up with credentials
Access to the kube-hetzner directory
Understanding of what changes you’re making
The safe workflow¶
graph LR
A[Source .env] --> B[tofu plan]
B --> C[Review plan carefully]
C --> D{Changes OK?}
D -->|No| E[Fix configuration]
E --> B
D -->|Yes| F[Run apply-and-configure-longhorn.sh]
F --> G[Monitor deployment]
G --> H[Verify cluster health]
H --> I[Verify Longhorn storage]
Step 1: Source environment variables¶
Always source your .env file first:
cd ~/kup6s/kube-hetzner
source .env
Verify:
env | grep TF_VAR_hcloud_token | head -c 50
You should see the variable name followed by the first characters of your token.
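If you would rather not echo any part of the token into your terminal history, the check can be reduced to "is the variable exported and non-empty". A minimal sketch; the helper name check_env_var is made up for this example and is not part of the repository:

```shell
# check_env_var NAME
# Succeeds if NAME is exported and non-empty, without printing its value.
# (Hypothetical helper for illustration; not part of the kup6s scripts.)
check_env_var() {
  # printenv only sees exported variables, which is what tofu reads
  if [ -n "$(printenv "$1")" ]; then
    echo "$1 is set"
  else
    echo "$1 is missing or not exported" >&2
    return 1
  fi
}
```

Usage: source .env && check_env_var TF_VAR_hcloud_token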
Step 2: Review what you’re changing¶
Check your git diff¶
git status
git diff kube.tf
Understand exactly what’s changing before applying.
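For a quick overview of everything that is pending, not just kube.tf, the machine-readable output of git status can be summarized. This is a sketch only; changed_paths is a hypothetical helper name, and it assumes the two-character status prefix of git status --porcelain:

```shell
# changed_paths: read `git status --porcelain` output on stdin and print
# one line per path, prefixed with "untracked" or "changed".
# (Hypothetical helper for illustration.)
changed_paths() {
  awk '
    /^\?\? /       { print "untracked " substr($0, 4); next }
    length($0) > 3 { print "changed " substr($0, 4) }'
}
```

Usage: git status --porcelain | changed_paths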
Common change types¶
Safe changes (no downtime):
Adding new agent nodes
Updating manifest content (extra-manifests/)
Changing resource limits
Adding environment variables
Requires caution:
Removing nodes (may cause pod rescheduling)
Changing control plane settings
Modifying network configuration
Updating critical components (Traefik, Longhorn)
Dangerous (potential downtime):
Changing cluster name
Modifying SSH keys on existing nodes
Changing network CIDR ranges
Removing control plane nodes
Step 3: Run tofu plan¶
tofu plan
What to look for¶
✅ Good signs:
Plan: 1 to add, 0 to change, 0 to destroy.
# Adding new resources only
Plan: 0 to add, 2 to change, 0 to destroy.
# Modifying existing resources (check what's changing)
⚠️ Warning signs:
Plan: 0 to add, 0 to change, 3 to destroy.
# Destroying resources - verify this is intentional!
# module.kube-hetzner.hcloud_server.control_plane[0] must be replaced
# Replacing control plane nodes - HIGH RISK!
Save plan output¶
For important changes, save the plan:
tofu plan -out=tfplan
tofu show tfplan > plan-review.txt
Review plan-review.txt carefully.
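The summary line is easy to check mechanically before you read the details. The sketch below extracts the destroy count from plan text; the function names are hypothetical, and it assumes uncolored output (for example from the -no-color flag). Keep in mind that a replaced resource counts as one add and one destroy.

```shell
# plan_destroy_count: read plan text on stdin and print the number of
# resources the plan would destroy (taken from the "Plan: ..." summary line).
# Prints nothing if no summary line is found, e.g. when there are no changes.
plan_destroy_count() {
  sed -n 's/^Plan: .* \([0-9][0-9]*\) to destroy\..*$/\1/p'
}

# plan_is_destructive: succeed (exit 0) if the plan destroys anything.
plan_is_destructive() {
  count=$(plan_destroy_count)
  [ -n "$count" ] && [ "$count" -gt 0 ]
}
```

Usage: tofu show -no-color tfplan | plan_is_destructive && echo "Plan destroys resources, review before applying"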
Step 4: Review specific change types¶
Adding nodes¶
# module.kube-hetzner.hcloud_server.agents[3] will be created
✅ Safe: Adding nodes is always safe.
Modifying manifests¶
# module.kube-hetzner.null_resource.kustomization must be replaced
✅ Safe: Manifest updates trigger kustomization re-application.
Changing existing servers¶
# module.kube-hetzner.hcloud_server.control_plane[0] will be updated in-place
~ labels = {
- "role" = "control-plane" -> "control-plane-primary"
}
✅ Usually safe: In-place updates don’t recreate the server.
Replacing servers¶
# module.kube-hetzner.hcloud_server.agents[0] must be replaced
-/+ resource "hcloud_server" "agents" {
~ server_type = "cax21" -> "cax31" # Forces replacement
}
⚠️ Caution: Server will be deleted and recreated. Pods will be rescheduled.
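In a long plan, replacements are easy to miss. One way to pull them out is to filter for the "must be replaced" comment line shown above. A minimal sketch, assuming uncolored plan text on stdin; list_replacements is a hypothetical helper name, and the control_plane match relies on the resource addresses used in this guide:

```shell
# list_replacements: read plan text on stdin and print the address of every
# resource marked "must be replaced", tagging control plane servers.
# (Hypothetical helper for illustration.)
list_replacements() {
  sed -n 's/^ *# \([^ ]*\) .*must be replaced$/\1/p' |
    while IFS= read -r addr; do
      case "$addr" in
        *control_plane*) echo "HIGH RISK  $addr" ;;
        *)               echo "replace    $addr" ;;
      esac
    done
}
```

Usage: tofu show -no-color tfplan | list_replacements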
Step 5: Understand resource actions¶
| Symbol | Meaning | Risk Level |
|---|---|---|
| + | Create | ✅ Low |
| ~ | Update in-place | ✅ Low |
| - | Destroy | ⚠️ Medium |
| -/+ | Replace (destroy then create) | ⚠️ High |
| <= | Read data | ✅ None |
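To see how many resources fall into each action type, you can count the comment line OpenTofu prints above every planned resource ("will be created", "will be updated in-place", "will be destroyed", "must be replaced"). The following is a sketch under the assumption that the plan text is uncolored; summarize_plan_actions is a hypothetical name:

```shell
# summarize_plan_actions: read plan text on stdin and print how many
# resources are created, updated in-place, destroyed, or replaced.
# (Hypothetical helper for illustration.)
summarize_plan_actions() {
  awk '
    /^ *# .* will be created$/          { create++ }
    /^ *# .* will be updated in-place$/ { update++ }
    /^ *# .* will be destroyed$/        { destroy++ }
    /^ *# .* must be replaced$/         { replace++ }
    END {
      printf "create=%d update=%d destroy=%d replace=%d\n",
             create, update, destroy, replace
    }'
}
```

Usage: tofu show -no-color tfplan | summarize_plan_actions. Any non-zero destroy or replace count deserves a second look before you apply.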
Step 6: Apply changes¶
MANDATORY: Use the apply-and-configure-longhorn.sh script¶
Danger
CRITICAL: You MUST use this script for all kube.tf changes. Direct tofu apply will misconfigure Longhorn!
cd ~/kup6s/kube-hetzner
bash scripts/apply-and-configure-longhorn.sh
What the script does:
Sources .env file (loads credentials)
Runs tofu apply -auto-approve
Waits for Longhorn to stabilize (30 seconds)
Configures all Longhorn nodes with correct 15GB fixed storage reservation
Shows storage configuration summary
Script output example:
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
Waiting for Longhorn to stabilize...
Configuring Longhorn storage reservations...
✓ Node kup6s-agent-arm-1-xyz: 15GB reserved
✓ Node kup6s-agent-arm-2-abc: 15GB reserved
✓ Node kup6s-agent-arm-3-def: 15GB reserved
Configuration complete!
Why this script is mandatory¶
Problem: Longhorn nodes default to percentage-based storage reservation (30% = 8-31GB wasted per node).
Solution: The script configures fixed 15GB reservation per node, maximizing usable storage while preserving adequate space for system + OCI images.
Without the script: Nodes will have incorrect storage reservation, reducing available Longhorn capacity by 50GB+ across the cluster.
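The arithmetic behind these numbers can be sketched as follows. This assumes that "wasted" means the difference between the percentage-based reservation and the fixed one; the function names and the 80GB and 160GB disk sizes are illustrative assumptions, so the results only approximate the range quoted above.

```shell
# percent_reserved_gb DISK_GB PERCENT
# GB reserved under a percentage-based rule (integer arithmetic).
percent_reserved_gb() {
  echo $(( $1 * $2 / 100 ))
}

# reclaimed_gb DISK_GB PERCENT FIXED_GB
# GB made available to Longhorn when the percentage-based reservation
# is replaced with a fixed one.
reclaimed_gb() {
  echo $(( $(percent_reserved_gb "$1" "$2") - $3 ))
}

reclaimed_gb 80 30 15    # hypothetical 80GB disk:  24GB reserved at 30%, 9GB reclaimed
reclaimed_gb 160 30 15   # hypothetical 160GB disk: 48GB reserved at 30%, 33GB reclaimed
```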
Advanced: Targeted apply (still use the script!)¶
If you need to apply specific resources first (risky changes), edit the script temporarily:
# Edit scripts/apply-and-configure-longhorn.sh line 15:
tofu apply -target='module.kube-hetzner.hcloud_server.agents[3]' -auto-approve  # quote the address so the shell does not expand [3]
# Run the script
bash scripts/apply-and-configure-longhorn.sh
# Then restore the script and run again for remaining resources
Warning
Targeted applies should only be used for risky changes. Always apply all changes eventually.
Step 7: Monitor the deployment¶
Watch for completion¶
OpenTofu will show progress:
module.kube-hetzner.hcloud_server.agents[3]: Creating...
module.kube-hetzner.hcloud_server.agents[3]: Still creating... [10s elapsed]
module.kube-hetzner.hcloud_server.agents[3]: Still creating... [20s elapsed]
module.kube-hetzner.hcloud_server.agents[3]: Creation complete after 25s
Monitor cluster during changes¶
In another terminal:
export KUBECONFIG=~/kup6s/kube-hetzner/kup6s_kubeconfig.yaml
watch kubectl get nodes
See nodes joining/leaving in real-time.
Step 8: Verify cluster health¶
Check all nodes are ready¶
kubectl get nodes
All should show STATUS: Ready.
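On a larger cluster it is quicker to list only the nodes that are not Ready. A minimal sketch that filters the output of kubectl get nodes --no-headers; not_ready_nodes is a hypothetical helper name:

```shell
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print the name and status of every node whose STATUS is not Ready.
# Cordoned nodes ("Ready,SchedulingDisabled") still count as Ready here.
not_ready_nodes() {
  awk '$2 != "Ready" && $2 !~ /^Ready,/ { print $1 " " $2 }'
}
```

Usage: kubectl get nodes --no-headers | not_ready_nodes. Empty output means every node reports Ready.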
Check system pods¶
kubectl get pods --all-namespaces | grep -v Running
Should show no pods in error states (except Completed jobs).
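The grep above still prints the header line and Completed jobs. As a sketch of a stricter filter, the helper below keeps only pods whose STATUS is neither Running nor Completed; failing_pods is a hypothetical name, and it assumes the column layout of kubectl get pods --all-namespaces --no-headers (namespace, name, ready, status):

```shell
# failing_pods: read `kubectl get pods --all-namespaces --no-headers` output
# on stdin and print namespace/name and status for every pod that is
# neither Running nor Completed.
failing_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

Usage: kubectl get pods --all-namespaces --no-headers | failing_pods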
Check critical components¶
# Traefik ingress
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
# Longhorn storage
kubectl get pods -n longhorn-system
# Monitoring stack
kubectl get pods -n monitoring
# ArgoCD
kubectl get pods -n argocd
All should be Running.
Verify applications still work¶
Test a few applications:
curl -I https://grafana.ops.kup6s.net
# Should return 200 OK or 302 redirect
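If you check several applications, the status test can be scripted. The sketch below reads response headers as printed by curl -sI and accepts 200 or 302, matching the expectation above; status_ok is a hypothetical helper name:

```shell
# status_ok: read HTTP response headers (as printed by `curl -sI`) on stdin
# and succeed if the status code on the first line is 200 or 302.
status_ok() {
  code=$(awk 'NR == 1 { print $2 }')
  case "$code" in
    200|302) echo "OK ($code)" ;;
    *)       echo "unexpected status: ${code:-none}" >&2; return 1 ;;
  esac
}
```

Usage: curl -sI https://grafana.ops.kup6s.net | status_ok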
Common scenarios¶
Scenario: Add 2 new agent nodes¶
1. Edit kube.tf:
agent_nodepools = [
{
name = "agent-arm-2"
server_type = "cax21"
location = "hel1"
count = 4 # Changed from 2
# ...
}
]
2. Plan and apply:
source .env
tofu plan
# Review: Should show 2 new servers being created
bash scripts/apply-and-configure-longhorn.sh
3. Wait for nodes:
kubectl get nodes -w
# Watch new nodes join and become Ready
Downtime: None ✅
Scenario: Update Grafana dashboard¶
1. Edit manifest:
vim extra-manifests/70-A-kube-prometheus-stack.yaml.tpl
# Make your changes
2. Plan and apply:
source .env
tofu plan
# Should show kustomization resource being replaced
bash scripts/apply-and-configure-longhorn.sh
3. Verify:
kubectl rollout status deployment -n monitoring kube-prometheus-stack-grafana
Downtime: Minimal (Grafana restarts) ⚠️
Scenario: Change Traefik version¶
1. Edit kube.tf:
traefik_image_tag = "v3.4.2" # Updated from v3.4.1
2. Plan:
tofu plan
# Shows traefik_image_tag change
3. Apply:
bash scripts/apply-and-configure-longhorn.sh
4. Monitor:
kubectl rollout status deployment -n kube-system traefik
Downtime: ~10-30 seconds during Traefik restart ⚠️
Emergency: Abort a deployment¶
If something goes wrong during tofu apply:
Press Ctrl+C¶
This stops OpenTofu. Resources created so far will remain.
Check what was created¶
tofu show
Roll back if needed¶
git checkout HEAD -- kube.tf
source .env
bash scripts/apply-and-configure-longhorn.sh
# This will remove partially-created resources
Best practices¶
✅ DO¶
Always run tofu plan first
Review the plan carefully
Save plan output for important changes
Apply during maintenance windows
Monitor cluster during and after
Test in staging first (if available)
Keep backups current
Document changes in git commits
❌ DON’T¶
Apply without planning
Skip reviewing the plan
Use -auto-approve in production manually
Make multiple unrelated changes at once
Apply during peak traffic hours
Modify production without testing
Forget to source .env
Leave untracked changes in git
Troubleshooting¶
“Error: Invalid credentials”¶
Forgot to source .env:
source .env
tofu plan
“Error: Resource already exists”¶
OpenTofu state is out of sync:
tofu refresh
tofu plan
“Changes don’t apply”¶
The kustomization may be cached. Delete the generated ConfigMap and apply again:
kubectl delete -n kube-system cm kustomize-generated
bash scripts/apply-and-configure-longhorn.sh
“Node won’t join cluster”¶
Check node logs:
# SSH to node
ssh root@NODE_IP
journalctl -u k3s -f
Apply hangs¶
Timeout issue. Safe to Ctrl+C and retry:
# Press Ctrl+C
tofu apply -refresh=false # Skip refresh, faster
Next steps¶
Backup and restore - Protect your cluster
Add node - Scale your cluster
Upgrade components - Update software versions