How-To Guide

Apply Infrastructure Changes Safely

Type: How-To (Task-oriented)

Goal: Deploy or update KUP6S cluster infrastructure using OpenTofu without causing downtime or data loss.

Time: ~10-30 minutes (depending on changes)

Danger

CRITICAL: NEVER run tofu apply directly for kube.tf changes!

You MUST use the apply-and-configure-longhorn.sh script instead. Running tofu apply manually will leave Longhorn storage in a misconfigured state with incorrect storage reservations.

Correct command:

bash scripts/apply-and-configure-longhorn.sh

See Step 6: Apply changes for details.

Prerequisites

Environment set up with credentials
Access to the kube-hetzner directory
Understanding of what changes you’re making

The safe workflow

graph LR
    A[Source .env] --> B[tofu plan]
    B --> C[Review plan carefully]
    C --> D{Changes OK?}
    D -->|No| E[Fix configuration]
    E --> B
    D -->|Yes| F[Run apply-and-configure-longhorn.sh]
    F --> G[Monitor deployment]
    G --> H[Verify cluster health]
    H --> I[Verify Longhorn storage]

Step 1: Source environment variables

Always source your .env file first:

cd ~/kup6s/kube-hetzner
source .env

Verify:

env | grep TF_VAR_hcloud_token | head -c 50

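You should see the variable name followed by the first characters of your token. The output is cut off after 50 bytes so the full token is not printed.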
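If you would rather not print any part of the token, the same check can be done by testing only whether the variable is set. This is a minimal sketch: `require_var` is a hypothetical helper, not part of the repository, and it assumes `.env` exports the variable.

```shell
# Hypothetical helper (not part of the kup6s repository).
# Succeeds if the named environment variable is exported and non-empty.
# Prints only the length of the value, never the value itself.
require_var() {
  name="$1"
  value=$(printenv "$name") || value=""
  if [ -z "$value" ]; then
    echo "missing: $name (did you source .env?)" >&2
    return 1
  fi
  echo "ok: $name is set (${#value} characters)"
}
```

Usage: `require_var TF_VAR_hcloud_token`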

Step 2: Review what you’re changing

Check your git diff

git status
git diff kube.tf

Understand exactly what’s changing before applying.

Common change types

Safe changes (no downtime):

Adding new agent nodes
Updating manifest content (extra-manifests/)
Changing resource limits
Adding environment variables

Requires caution:

Removing nodes (may cause pod rescheduling)
Changing control plane settings
Modifying network configuration
Updating critical components (Traefik, Longhorn)

Dangerous (potential downtime):

Changing cluster name
Modifying SSH keys on existing nodes
Changing network CIDR ranges
Removing control plane nodes

Step 3: Run tofu plan

tofu plan

What to look for

✅ Good signs:

Plan: 1 to add, 0 to change, 0 to destroy.
# Adding new resources only

Plan: 0 to add, 2 to change, 0 to destroy.
# Modifying existing resources (check what's changing)

⚠️ Warning signs:

Plan: 0 to add, 0 to change, 3 to destroy.
# Destroying resources - verify this is intentional!

# module.kube-hetzner.hcloud_server.control_plane[0] must be replaced
# Replacing control plane nodes - HIGH RISK!

Save plan output

For important changes, save the plan:

tofu plan -out=tfplan
tofu show tfplan > plan-review.txt

Review plan-review.txt carefully.
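Part of that review can be automated. The sketch below scans a saved plan text file for the warning signs described earlier: a non-zero destroy count or any resource marked "must be replaced". `check_plan` is a hypothetical helper, and it assumes the file contains no ANSI color codes (for example, a file produced with `tofu show -no-color tfplan`). It supplements reading the plan; it does not replace it.

```shell
# Hypothetical helper (not part of the kup6s repository).
# Scans a saved plan text file for destructive actions.
# Returns 0 when nothing is destroyed or replaced, 1 otherwise.
check_plan() {
  plan_file="$1"
  # Third number in the "Plan: X to add, Y to change, Z to destroy." line.
  destroy=$(sed -n 's/.*Plan: [0-9]* to add, [0-9]* to change, \([0-9]*\) to destroy.*/\1/p' "$plan_file" | head -n 1)
  # No summary line (for example "No changes.") counts as zero destroys.
  destroy=${destroy:-0}
  # Number of resources that will be deleted and recreated.
  replace=$(grep -c 'must be replaced' "$plan_file" || true)
  echo "destroy=$destroy replace=$replace"
  [ "$destroy" -eq 0 ] && [ "$replace" -eq 0 ]
}
```

Usage: `check_plan plan-review.txt || echo "Destructive changes found, review before applying"`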

Step 4: Review specific change types

Adding nodes

# module.kube-hetzner.hcloud_server.agents[3] will be created

✅ Safe: Adding nodes creates new servers only and leaves existing servers and running workloads untouched.

Modifying manifests

# module.kube-hetzner.null_resource.kustomization must be replaced

✅ Safe: Manifest updates trigger kustomization re-application.

Changing existing servers

# module.kube-hetzner.hcloud_server.control_plane[0] will be updated in-place
  ~ labels = {
      ~ "role" = "control-plane" -> "control-plane-primary"
    }

✅ Usually safe: In-place updates don’t recreate the server.

Replacing servers

# module.kube-hetzner.hcloud_server.agents[0] must be replaced
-/+ resource "hcloud_server" "agents" {
      ~ server_type = "cax21" -> "cax31"  # Forces replacement
    }

⚠️ Caution: Server will be deleted and recreated. Pods will be rescheduled.

Step 5: Understand resource actions

Symbol	Meaning	Risk Level
`+`	Create	✅ Low
`~`	Update in-place	✅ Low
`-`	Destroy	⚠️ Medium
`-/+`	Replace (destroy then create)	⚠️ High
`<=`	Read data	✅ None
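To see how many resources fall under each action, you can tally the header lines that tofu prints above every resource in the plan. The sketch below is one way to do that; `count_actions` is a hypothetical helper that reads plan text on stdin and assumes the input has no color codes.

```shell
# Hypothetical helper (not part of the kup6s repository).
# Tallies resource actions from plan text on stdin by matching the
# "# <address> will be ..." and "# <address> must be replaced" header lines.
count_actions() {
  awk '
    /^ *# .* will be created/          { create++ }
    /^ *# .* will be updated in-place/ { update++ }
    /^ *# .* will be destroyed/        { destroy++ }
    /^ *# .* must be replaced/         { replace++ }
    END {
      printf "create=%d update=%d destroy=%d replace=%d\n", create, update, destroy, replace
    }'
}
```

Usage: `tofu show -no-color tfplan | count_actions`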

Step 6: Apply changes

MANDATORY: Use the apply-and-configure-longhorn.sh script

Danger

CRITICAL: You MUST use this script for all kube.tf changes. Direct tofu apply will misconfigure Longhorn!

cd ~/kup6s/kube-hetzner
bash scripts/apply-and-configure-longhorn.sh

What the script does:

Sources .env file (loads credentials)
Runs tofu apply -auto-approve
Waits for Longhorn to stabilize (30 seconds)
Configures all Longhorn nodes with correct 15GB fixed storage reservation
Shows storage configuration summary

Script output example:

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
Waiting for Longhorn to stabilize...
Configuring Longhorn storage reservations...
✓ Node kup6s-agent-arm-1-xyz: 15GB reserved
✓ Node kup6s-agent-arm-2-abc: 15GB reserved
✓ Node kup6s-agent-arm-3-def: 15GB reserved
Configuration complete!

Why this script is mandatory

Problem: Longhorn nodes default to percentage-based storage reservation (30% = 8-31GB wasted per node).

Solution: The script configures fixed 15GB reservation per node, maximizing usable storage while preserving adequate space for system + OCI images.

Without the script: Nodes will have incorrect storage reservation, reducing available Longhorn capacity by 50GB+ across the cluster.
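The arithmetic behind these numbers can be worked through per node. The sketch below compares the default 30% reservation with the fixed 15GB reservation in whole gigabytes. `reservation_savings` is a hypothetical helper, and the disk sizes you pass in are assumptions for illustration; with nominal disk sizes the results differ slightly from the 8-31GB range quoted above.

```shell
# Illustrative arithmetic only (not part of the kup6s repository).
# Given a disk size in GB, prints the default 30% reservation, the fixed
# 15 GB reservation, and the capacity recovered by switching to the fixed value.
reservation_savings() {
  disk_gb="$1"
  percent_reserved=$(( disk_gb * 30 / 100 ))
  fixed_reserved=15
  recovered=$(( percent_reserved - fixed_reserved ))
  echo "disk=${disk_gb}GB percent=${percent_reserved}GB fixed=${fixed_reserved}GB recovered=${recovered}GB"
}
```

For example, `reservation_savings 80` prints `disk=80GB percent=24GB fixed=15GB recovered=9GB`, and `reservation_savings 160` prints `disk=160GB percent=48GB fixed=15GB recovered=33GB`.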

Advanced: Targeted apply (still use the script!)

If you need to apply specific resources first (risky changes), edit the script temporarily:

# Edit scripts/apply-and-configure-longhorn.sh line 15.
# Quote the resource address so the shell does not treat [3] as a glob pattern:
tofu apply -target='module.kube-hetzner.hcloud_server.agents[3]' -auto-approve

# Run the script
bash scripts/apply-and-configure-longhorn.sh

# Then restore the script and run again for remaining resources

Warning

Targeted applies should only be used for risky changes. Always apply all changes eventually.

Step 7: Monitor the deployment

Watch for completion

OpenTofu will show progress:

module.kube-hetzner.hcloud_server.agents[3]: Creating...
module.kube-hetzner.hcloud_server.agents[3]: Still creating... [10s elapsed]
module.kube-hetzner.hcloud_server.agents[3]: Still creating... [20s elapsed]
module.kube-hetzner.hcloud_server.agents[3]: Creation complete after 25s

Monitor cluster during changes

In another terminal:

export KUBECONFIG=~/kup6s/kube-hetzner/kup6s_kubeconfig.yaml
watch kubectl get nodes

See nodes joining/leaving in real-time.

Step 8: Verify cluster health

Check all nodes are ready

kubectl get nodes

All should show STATUS: Ready.
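On a larger cluster it is easier to list only the nodes that are not ready. The sketch below filters the node list for any STATUS other than exactly "Ready". `not_ready_nodes` is a hypothetical helper that expects the output of `kubectl get nodes --no-headers` on stdin. Note that a cordoned node, which reports "Ready,SchedulingDisabled", is also listed.

```shell
# Hypothetical helper (not part of the kup6s repository).
# Reads `kubectl get nodes --no-headers` output on stdin and prints the name
# of every node whose STATUS column is not exactly "Ready".
# Exits with status 1 if at least one such node is found, 0 otherwise.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1; bad = 1 } END { exit bad }'
}
```

Usage: `kubectl get nodes --no-headers | not_ready_nodes`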

Check system pods

kubectl get pods --all-namespaces | grep -v Running

Should show no pods in error states (except Completed jobs).
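To drop Completed jobs from that output as well, filter on the STATUS column instead of using grep. This is a sketch under one assumption: the input comes from `kubectl get pods --all-namespaces --no-headers`, where STATUS is the fourth column. `unhealthy_pods` is a hypothetical helper.

```shell
# Hypothetical helper (not part of the kup6s repository).
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin and
# prints "namespace/name status" for every pod that is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

Usage: `kubectl get pods --all-namespaces --no-headers | unhealthy_pods`. Empty output means every pod is either Running or Completed.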

Check critical components

# Traefik ingress
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik

# Longhorn storage
kubectl get pods -n longhorn-system

# Monitoring stack
kubectl get pods -n monitoring

# ArgoCD
kubectl get pods -n argocd

All should be Running.

Verify applications still work

Test a few applications:

curl -I https://grafana.ops.kup6s.net
# Should return 200 OK or 302 redirect
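When several applications need checking, the status code can be tested in a loop rather than read by eye. The sketch below treats 200 and 302 as healthy, matching the guidance above. Both `is_healthy_status` and `check_url` are hypothetical helpers, and the 10-second timeout is an arbitrary choice.

```shell
# Hypothetical helpers (not part of the kup6s repository).

# Succeeds for the status codes this guide treats as healthy: 200 and 302.
is_healthy_status() {
  case "$1" in
    200|302) return 0 ;;
    *)       return 1 ;;
  esac
}

# Sends a HEAD request and reports the HTTP status code for one URL.
# A connection failure or timeout is reported as status 000.
check_url() {
  url="$1"
  code=$(curl -s -o /dev/null -I -w '%{http_code}' --max-time 10 "$url") || code="000"
  if is_healthy_status "$code"; then
    echo "ok $code $url"
  else
    echo "FAIL $code $url" >&2
    return 1
  fi
}
```

Usage: `check_url https://grafana.ops.kup6s.net`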

Common scenarios

Scenario: Add 2 new agent nodes

1. Edit kube.tf:

agent_nodepools = [
  {
    name        = "agent-arm-2"
    server_type = "cax21"
    location    = "hel1"
    count       = 4  # Changed from 2
    # ...
  }
]

2. Plan and apply:

source .env
tofu plan
# Review: Should show 2 new servers being created
bash scripts/apply-and-configure-longhorn.sh

3. Wait for nodes:

kubectl get nodes -w
# Watch new nodes join and become Ready

Downtime: None ✅

Scenario: Update Grafana dashboard

1. Edit manifest:

vim extra-manifests/70-A-kube-prometheus-stack.yaml.tpl
# Make your changes

2. Plan and apply:

source .env
tofu plan
# Should show kustomization resource being replaced
bash scripts/apply-and-configure-longhorn.sh

3. Verify:

kubectl rollout status deployment -n monitoring kube-prometheus-stack-grafana

Downtime: Minimal (Grafana restarts) ⚠️

Scenario: Change Traefik version

1. Edit kube.tf:

traefik_image_tag = "v3.4.2"  # Updated from v3.4.1

2. Plan:

tofu plan
# Shows traefik_image_tag change

3. Apply:

bash scripts/apply-and-configure-longhorn.sh

4. Monitor:

kubectl rollout status deployment -n kube-system traefik

Downtime: ~10-30 seconds during Traefik restart ⚠️

Emergency: Abort a deployment

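If something goes wrong while tofu apply is running:

Press Ctrl+C

Press Ctrl+C once. OpenTofu finishes the operations already in flight and then exits; resources created so far will remain. Avoid pressing Ctrl+C a second time: that forces an immediate exit and can leave the state inconsistent.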

Check what was created

tofu show

Roll back if needed

git checkout HEAD -- kube.tf
source .env
tofu plan
# Review: should show the partially-created resources being removed
bash scripts/apply-and-configure-longhorn.sh

Best practices

✅ DO

Always run tofu plan first
Review the plan carefully
Save plan output for important changes
Apply during maintenance windows
Monitor cluster during and after
Test in staging first (if available)
Keep backups current
Document changes in git commits

❌ DON’T

Apply without planning
Skip reviewing the plan
Run tofu apply -auto-approve by hand in production
Make multiple unrelated changes at once
Apply during peak traffic hours
Modify production without testing
Forget to source .env
Leave uncommitted changes in git

Troubleshooting

“Error: Invalid credentials”

The credentials are not loaded, usually because .env was not sourced in this shell:

source .env
tofu plan

“Error: Resource already exists”

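OpenTofu state is out of sync with the real infrastructure. Refresh the state and review what changed before planning again (tofu refresh is deprecated in favor of the -refresh-only mode):

tofu apply -refresh-only
tofu plan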

“Changes don’t apply”

The kustomization may be cached. Delete the generated ConfigMap, then re-apply:

kubectl delete -n kube-system cm kustomize-generated
bash scripts/apply-and-configure-longhorn.sh

“Node won’t join cluster”

Check node logs:

# SSH to node
ssh root@NODE_IP
journalctl -u k3s -f

Apply hangs

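This is usually a timeout. It is safe to press Ctrl+C once and retry:

# Press Ctrl+C once and wait for OpenTofu to exit
tofu plan -refresh=false  # Skip refresh for a faster check of what remains
bash scripts/apply-and-configure-longhorn.sh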

Next steps

Backup and restore - Protect your cluster
Add node - Scale your cluster
Upgrade components - Update software versions