How-To Guide

Apply Infrastructure Changes Safely

Goal: Deploy or update KUP6S cluster infrastructure using OpenTofu without causing downtime or data loss.

Time: ~10-30 minutes (depending on changes)

Danger

CRITICAL: NEVER run tofu apply directly for kube.tf changes!

You MUST use the apply-and-configure-longhorn.sh script instead. Running tofu apply manually will leave Longhorn storage in a misconfigured state with incorrect storage reservations.

Correct command:

bash scripts/apply-and-configure-longhorn.sh

See Step 6: Apply changes for details.

Prerequisites

  • Environment set up with credentials

  • Access to the kube-hetzner directory

  • Understanding of what changes you’re making

The safe workflow

graph LR
    A[Source .env] --> B[tofu plan]
    B --> C[Review plan carefully]
    C --> D{Changes OK?}
    D -->|No| E[Fix configuration]
    E --> B
    D -->|Yes| F[Run apply-and-configure-longhorn.sh]
    F --> G[Monitor deployment]
    G --> H[Verify cluster health]
    H --> I[Verify Longhorn storage]

Step 1: Source environment variables

Always source your .env file first:

cd ~/kup6s/kube-hetzner
source .env

Verify:

env | grep TF_VAR_hcloud_token | head -c 50

You should see the variable name followed by the first characters of your token.
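Printing part of the token can leave it in terminal scrollback or logs. A minimal alternative, assuming only that the variable is named TF_VAR_hcloud_token as above, confirms that it is exported and reports its length without echoing the value. The helper name check_hcloud_token is hypothetical and not part of the repository.

```shell
# check_hcloud_token: hypothetical helper, not part of the kup6s repository.
# Succeeds if TF_VAR_hcloud_token is exported and non-empty.
# Runs in a subshell (note the parentheses) so the token copy never
# lingers in the calling shell, and it never prints the value itself.
check_hcloud_token() (
  token=$(printenv TF_VAR_hcloud_token) || token=""
  if [ -z "$token" ]; then
    echo "TF_VAR_hcloud_token is not exported; run 'source .env' first" >&2
    exit 1   # exits the subshell only, so the function returns 1
  fi
  echo "TF_VAR_hcloud_token is set (${#token} characters)"
)
```

Chaining it as `check_hcloud_token && tofu plan` starts the plan only when the credentials are loaded.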

Step 2: Review what you’re changing

Check your git diff

git status
git diff kube.tf

Understand exactly what’s changing before applying.

Common change types

Safe changes (no downtime):

  • Adding new agent nodes

  • Updating manifest content (extra-manifests/)

  • Changing resource limits

  • Adding environment variables

Requires caution:

  • Removing nodes (may cause pod rescheduling)

  • Changing control plane settings

  • Modifying network configuration

  • Updating critical components (Traefik, Longhorn)

Dangerous (potential downtime):

  • Changing cluster name

  • Modifying SSH keys on existing nodes

  • Changing network CIDR ranges

  • Removing control plane nodes

Step 3: Run tofu plan

tofu plan

What to look for

✅ Good signs:

Plan: 1 to add, 0 to change, 0 to destroy.
# Adding new resources only
Plan: 0 to add, 2 to change, 0 to destroy.
# Modifying existing resources (check what's changing)

⚠️ Warning signs:

Plan: 0 to add, 0 to change, 3 to destroy.
# Destroying resources - verify this is intentional!
# module.kube-hetzner.hcloud_server.control_plane[0] must be replaced
# Replacing control plane nodes - HIGH RISK!

Save plan output

For important changes, save the plan:

tofu plan -out=tfplan
tofu show -no-color tfplan > plan-review.txt  # -no-color keeps escape codes out of the file

Review plan-review.txt carefully.
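For a quick first check of a saved plan, the destroy count can be pulled from the summary line. This is a minimal sketch, assuming the plan text contains the standard "Plan: ... to destroy." line shown in the examples above. The helper name plan_destroy_count is hypothetical and not part of the repository.

```shell
# plan_destroy_count: hypothetical helper, not part of the kup6s repository.
# Reads plan text on stdin and prints the number from the
# "Plan: ... N to destroy." summary line (prints nothing if no such line).
plan_destroy_count() {
  sed -n 's/.*Plan:.*[^0-9]\([0-9][0-9]*\) to destroy\..*/\1/p' | head -n 1
}

# Demo on a summary line taken from the warning example above.
echo "Plan: 0 to add, 0 to change, 3 to destroy." | plan_destroy_count   # prints 3
```

Typical use would be `plan_destroy_count < plan-review.txt`. A non-zero result is a prompt to read the plan more closely, not a substitute for reading it.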

Step 4: Review specific change types

Adding nodes

# module.kube-hetzner.hcloud_server.agents[3] will be created

Safe: Adding nodes creates new servers and leaves existing ones untouched.

Modifying manifests

# module.kube-hetzner.null_resource.kustomization must be replaced

Safe: Manifest updates trigger kustomization re-application.

Changing existing servers

# module.kube-hetzner.hcloud_server.control_plane[0] will be updated in-place
  ~ labels = {
      ~ "role" = "control-plane" -> "control-plane-primary"
    }

Usually safe: In-place updates don’t recreate the server.

Replacing servers

# module.kube-hetzner.hcloud_server.agents[0] must be replaced
-/+ resource "hcloud_server" "agents" {
      ~ server_type = "cax21" -> "cax31"  # Forces replacement
    }

⚠️ Caution: Server will be deleted and recreated. Pods will be rescheduled.

Step 5: Understand resource actions

Symbol   Meaning                         Risk Level
+        Create                          ✅ Low
~        Update in-place                 ✅ Low
-        Destroy                         ⚠️ Medium
-/+      Replace (destroy then create)   ⚠️ High
<=       Read data                       ✅ None
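The medium- and high-risk actions can be pulled out of a long plan so they are reviewed first. The sketch below assumes the plan text uses the "# <address> must be replaced" and "# <address> will be destroyed" header lines shown in the examples in this guide. The helper name list_risky_actions is hypothetical.

```shell
# list_risky_actions: hypothetical helper, not part of the kup6s repository.
# Reads plan text on stdin and prints the header line (without the leading
# "# ") of every resource that will be destroyed or replaced.
list_risky_actions() {
  awk '/^[ \t]*# .+ (must be replaced|will be destroyed)/ {
    sub(/^[ \t]*# /, "")
    print
  }'
}
```

For example, `tofu show -no-color tfplan | list_risky_actions` lists only the destroy and replace headers. Empty output means the plan contains neither action.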

Step 6: Apply changes

MANDATORY: Use the apply-and-configure-longhorn.sh script

Danger

CRITICAL: You MUST use this script for all kube.tf changes. Direct tofu apply will misconfigure Longhorn!

cd ~/kup6s/kube-hetzner
bash scripts/apply-and-configure-longhorn.sh

What the script does:

  1. Sources .env file (loads credentials)

  2. Runs tofu apply -auto-approve

  3. Waits for Longhorn to stabilize (30 seconds)

  4. Configures all Longhorn nodes with correct 15GB fixed storage reservation

  5. Shows storage configuration summary

Script output example:

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
Waiting for Longhorn to stabilize...
Configuring Longhorn storage reservations...
✓ Node kup6s-agent-arm-1-xyz: 15GB reserved
✓ Node kup6s-agent-arm-2-abc: 15GB reserved
✓ Node kup6s-agent-arm-3-def: 15GB reserved
Configuration complete!

Why this script is mandatory

Problem: Longhorn nodes default to a percentage-based storage reservation (30% of the disk), which wastes 8-31GB per node.

Solution: The script configures a fixed 15GB reservation per node, maximizing usable storage while preserving adequate space for the system and OCI images.

Without the script: Nodes will have incorrect storage reservation, reducing available Longhorn capacity by 50GB+ across the cluster.
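The trade-off can be checked with integer arithmetic. The sketch below compares a 30% reservation with the fixed 15GB reservation. The 80GB and 160GB disk sizes are illustrative assumptions, not measured values from this cluster, so the results will not match the figures above exactly.

```shell
# Hypothetical helpers for illustration only.
# reserved_percent_gb DISK_GB: space reserved under a 30% rule.
reserved_percent_gb() { echo $(( $1 * 30 / 100 )); }

# extra_reserved_gb DISK_GB: how much more the 30% rule reserves
# than a fixed 15GB reservation on the same disk.
extra_reserved_gb() { echo $(( $1 * 30 / 100 - 15 )); }

# Assumed disk sizes; substitute the real sizes of your nodes.
for disk_gb in 80 160; do
  echo "${disk_gb}GB disk: 30% reserves $(reserved_percent_gb "$disk_gb")GB," \
       "$(extra_reserved_gb "$disk_gb")GB more than the fixed 15GB"
done
```

Because the percentage rule grows with disk size, larger nodes lose more usable capacity, which is why the fixed reservation matters most on the biggest disks.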

Advanced: Targeted apply (still use the script!)

If you need to apply specific resources first (risky changes), edit the script temporarily:

# Edit scripts/apply-and-configure-longhorn.sh line 15:
tofu apply -target='module.kube-hetzner.hcloud_server.agents[3]' -auto-approve  # Quoted so the shell does not expand [3]

# Run the script
bash scripts/apply-and-configure-longhorn.sh

# Then restore the script and run again for remaining resources

Warning

Use targeted applies only for risky changes, and always follow them with a full run of the script so that every pending change is applied.

Step 7: Monitor the deployment

Watch for completion

OpenTofu will show progress:

module.kube-hetzner.hcloud_server.agents[3]: Creating...
module.kube-hetzner.hcloud_server.agents[3]: Still creating... [10s elapsed]
module.kube-hetzner.hcloud_server.agents[3]: Still creating... [20s elapsed]
module.kube-hetzner.hcloud_server.agents[3]: Creation complete after 25s

Monitor cluster during changes

In another terminal:

export KUBECONFIG=~/kup6s/kube-hetzner/kup6s_kubeconfig.yaml
watch kubectl get nodes

See nodes joining/leaving in real-time.

Step 8: Verify cluster health

Check all nodes are ready

kubectl get nodes

All should show STATUS: Ready.
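On a larger cluster it is easier to print only the nodes that need attention. This is a minimal sketch that filters the output of `kubectl get nodes --no-headers`, assuming the default column order (NAME, STATUS, ...). The helper name not_ready_nodes is hypothetical.

```shell
# not_ready_nodes: hypothetical helper, not part of the kup6s repository.
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# name of every node whose STATUS column is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are therefore listed too.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every node reports Ready.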

Check system pods

kubectl get pods --all-namespaces | grep -v Running

Should show no pods in error states (except Completed jobs).
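The grep above still prints the header line and Completed jobs. A slightly stricter filter, sketched below, assumes the default column order of `kubectl get pods --all-namespaces --no-headers` (NAMESPACE, NAME, READY, STATUS, ...). The helper name unhealthy_pods is hypothetical.

```shell
# unhealthy_pods: hypothetical helper, not part of the kup6s repository.
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin
# and prints "namespace/name: STATUS" for every pod whose STATUS is
# neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

This checks the STATUS column only. A pod can be Running while its containers are not ready, so the READY column is still worth a glance.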

Check critical components

# Traefik ingress
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik

# Longhorn storage
kubectl get pods -n longhorn-system

# Monitoring stack
kubectl get pods -n monitoring

# ArgoCD
kubectl get pods -n argocd

All should be Running.

Verify applications still work

Test a few applications:

curl -I https://grafana.ops.kup6s.net
# Should return 200 OK or 302 redirect
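To script this check instead of reading the headers by eye, the status code can be captured and compared against the two codes this guide treats as healthy. This is a minimal sketch; the helper name http_status_ok is hypothetical.

```shell
# http_status_ok CODE: hypothetical helper, not part of the kup6s repository.
# Succeeds for the two status codes this guide treats as healthy.
http_status_ok() {
  case "$1" in
    200|302) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage (requires network access); -w '%{http_code}' makes curl print
# only the numeric status code:
#   code=$(curl -s -o /dev/null -I -w '%{http_code}' https://grafana.ops.kup6s.net)
#   http_status_ok "$code" || echo "unexpected status: $code"
```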

Common scenarios

Scenario: Add 2 new agent nodes

1. Edit kube.tf:

agent_nodepools = [
  {
    name        = "agent-arm-2"
    server_type = "cax21"
    location    = "hel1"
    count       = 4  # Changed from 2
    # ...
  }
]

2. Plan and apply:

source .env
tofu plan
# Review: Should show 2 new servers being created
bash scripts/apply-and-configure-longhorn.sh

3. Wait for nodes:

kubectl get nodes -w
# Watch new nodes join and become Ready

Downtime: None ✅

Scenario: Update Grafana dashboard

1. Edit manifest:

vim extra-manifests/70-A-kube-prometheus-stack.yaml.tpl
# Make your changes

2. Plan and apply:

source .env
tofu plan
# Should show kustomization resource being replaced
bash scripts/apply-and-configure-longhorn.sh

3. Verify:

kubectl rollout status deployment -n monitoring kube-prometheus-stack-grafana

Downtime: Minimal (Grafana restarts) ⚠️

Scenario: Change Traefik version

1. Edit kube.tf:

traefik_image_tag = "v3.4.2"  # Updated from v3.4.1

2. Plan:

tofu plan
# Shows traefik_image_tag change

3. Apply:

bash scripts/apply-and-configure-longhorn.sh

4. Monitor:

kubectl rollout status deployment -n kube-system traefik

Downtime: ~10-30 seconds during Traefik restart ⚠️

Emergency: Abort a deployment

If something goes wrong during tofu apply:

Press Ctrl+C

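This asks OpenTofu to stop gracefully once the operation in progress finishes. Resources created so far will remain. Avoid pressing Ctrl+C a second time: a forced exit can leave the state out of sync with the real resources.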

Check what was created

tofu show

Roll back if needed

git checkout HEAD -- kube.tf
source .env
tofu plan
# Review: the plan should remove the partially created resources
bash scripts/apply-and-configure-longhorn.sh

Best practices

✅ DO

  • Always run tofu plan first

  • Review the plan carefully

  • Save plan output for important changes

  • Apply during maintenance windows

  • Monitor cluster during and after

  • Test in staging first (if available)

  • Keep backups current

  • Document changes in git commits

❌ DON’T

  • Apply without planning

  • Skip reviewing the plan

  • Run tofu apply -auto-approve by hand in production

  • Make multiple unrelated changes at once

  • Apply during peak traffic hours

  • Modify production without testing

  • Forget to source .env

  • Leave uncommitted changes in the working tree

Troubleshooting

“Error: Invalid credentials”

Forgot to source .env:

source .env
tofu plan

“Error: Resource already exists”

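OpenTofu state is out of sync with the real infrastructure. Refresh the state, then plan again:

tofu apply -refresh-only  # Updates the state only; replaces the deprecated tofu refresh
tofu plan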

“Changes don’t apply”

Check whether the kustomization is cached:

kubectl delete -n kube-system cm kustomize-generated
bash scripts/apply-and-configure-longhorn.sh

“Node won’t join cluster”

Check node logs:

# SSH to node
ssh root@NODE_IP
journalctl -u k3s -f

Apply hangs

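This is usually a timeout. It is safe to press Ctrl+C once and retry:

# Press Ctrl+C
tofu apply -refresh=false  # Skip refresh, faster
bash scripts/apply-and-configure-longhorn.sh  # Reapply Longhorn storage reservations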

Next steps