How-To Guide

Upgrade Monitoring Components

Step-by-step guide for upgrading Prometheus, Loki, Thanos, Alloy, and other monitoring stack components.

Before You Begin

Prerequisites:

  • Access to the cluster with kubectl configured

  • Git access to dp-infra repository

  • Understanding of the current deployment

Safety Checklist:

  • [ ] Review release notes for breaking changes

  • [ ] Backup Grafana dashboards (see the backup sketch after this list)

  • [ ] Document current versions

  • [ ] Plan maintenance window (if needed)

  • [ ] Have rollback plan ready
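One way to back up dashboards is through the Grafana HTTP API. A minimal sketch, assuming a port-forward to the Grafana service used elsewhere in this guide and an admin API token exported as GRAFANA_TOKEN (both are assumptions):

# Back up every dashboard as JSON via the Grafana HTTP API.
# Assumes: port-forward to Grafana on localhost:3000, token in $GRAFANA_TOKEN.
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80 &
sleep 2
mkdir -p grafana-backup
for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "http://localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "http://localhost:3000/api/dashboards/uid/$uid" > "grafana-backup/$uid.json"
done
pkill -f "kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana"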

Current Version Tracking:

# Check deployed versions
kubectl get helmchart -n monitoring -o yaml | grep version:
kubectl get pods -n monitoring -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'

Upgrade kube-prometheus-stack (Prometheus, Grafana, Alertmanager)

Step 1: Check Current Version

cd dp-infra/monitoring
cat config.yaml | grep "prometheus:"

Example output:

versions:
  prometheus: "65.0.0"  # Current chart version

Step 2: Review Release Notes

Check Helm chart changelog:

# List available versions
helm search repo prometheus-community/kube-prometheus-stack --versions | head -10

# View release notes
open https://github.com/prometheus-community/helm-charts/releases

Critical checks:

  • CRD changes (CustomResourceDefinitions)

  • Breaking configuration changes

  • Prometheus version compatibility

  • Grafana version compatibility

  • Required Kubernetes version

Step 3: Update Configuration

Edit config.yaml:

versions:
  prometheus: "66.0.0"  # New version

Check for deprecated values:

# Compare old and new chart values
helm show values prometheus-community/kube-prometheus-stack --version 65.0.0 > /tmp/old-values.yaml
helm show values prometheus-community/kube-prometheus-stack --version 66.0.0 > /tmp/new-values.yaml
diff /tmp/old-values.yaml /tmp/new-values.yaml

Step 4: Update CRDs (if needed)

Check if CRDs changed:

helm show crds prometheus-community/kube-prometheus-stack --version 66.0.0 > /tmp/new-crds.yaml

Apply CRD updates:

# CRDs must be applied manually before Helm upgrade
kubectl apply --server-side --force-conflicts -f /tmp/new-crds.yaml

Important: Always apply CRDs before upgrading the chart.
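To confirm the new CRDs actually landed, one place to look is the version annotation that prometheus-operator stamps on its CRDs (the annotation name is specific to this operator; treat the check as a sketch):

# Show the operator version recorded on one of the CRDs
kubectl get crd prometheuses.monitoring.coreos.com \
  -o jsonpath="{.metadata.annotations['operator\.prometheus\.io/version']}"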

Step 5: Generate and Review Manifests

cd dp-infra/monitoring
npm run build

Review changes:

git diff manifests/monitoring.k8s.yaml

Look for:

  • Resource spec changes

  • New/removed resources

  • Storage changes (PVCs don’t auto-resize; see the expansion sketch after this list)
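If a chart bump raises a storage request, the PVC must be expanded separately, and only if the StorageClass has allowVolumeExpansion enabled. A minimal sketch with a hypothetical PVC name:

# Grow an existing PVC to match a new storage request (name is an example)
kubectl patch pvc -n monitoring prometheus-db-claim \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'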

Step 6: Deploy Upgrade

Option A: Via ArgoCD (Recommended)

# Commit changes
git add config.yaml manifests/
git commit -m "Upgrade kube-prometheus-stack to v66.0.0"
git push

# Sync in ArgoCD
argocd app sync monitoring

# Watch progress
argocd app wait monitoring --health

Option B: Direct kubectl

kubectl apply -f manifests/monitoring.k8s.yaml

Step 7: Verify Upgrade

# Check all pods running
kubectl get pods -n monitoring

# Check Prometheus version
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Visit: http://localhost:9090/status
# Check "Version" field

# Check Grafana version
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Visit: http://localhost:3000
# Check bottom-left corner

# Verify metrics still flowing
# Query: up{job="kubernetes-nodes"}
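The UI checks above can also be scripted. Assuming the Prometheus port-forward is still running:

# Expect a non-zero series count if node metrics are flowing
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up{job="kubernetes-nodes"}' | jq '.data.result | length'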

Rollback Procedure

If upgrade fails:

# Revert config.yaml
git revert HEAD

# Regenerate manifests
npm run build

# Apply old version
kubectl apply -f manifests/monitoring.k8s.yaml

# Or via ArgoCD
git push
argocd app sync monitoring
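ArgoCD can also roll back to a previous sync directly, without a new commit; the history ID below is a placeholder you read from your own deployment history:

# List prior syncs, then roll back to a known-good one
argocd app history monitoring
argocd app rollback monitoring <HISTORY_ID>

Note that ArgoCD refuses a rollback while automated sync is enabled, so disable auto-sync first if your app uses it.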

Upgrade Loki

Step 1: Check Current Version

cat config.yaml | grep "loki:"

Example:

versions:
  loki: "6.16.0"  # Current chart version

Step 2: Review Release Notes

# List available versions
helm search repo grafana/loki --versions | head -10

# View release notes
open https://github.com/grafana/loki/releases

Critical checks:

  • Storage schema changes

  • Breaking config changes

  • S3 compatibility

  • SimpleScalable mode changes

Step 3: Test Storage Schema Compatibility

Check current schema:

kubectl logs -n monitoring loki-backend-0 | grep schema

Example output:

level=info schema=v13 msg="using schema"

Verify that the new version still supports your current schema (v13 in this example).

Important: Schema migrations can require data reprocessing.
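For reference, the schema is pinned in Loki's schema_config block; it looks roughly like this (dates and stores are illustrative):

# Loki schema_config stanza (illustrative values)
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h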

Step 4: Update Configuration

# config.yaml
versions:
  loki: "6.17.0"  # New version

Step 5: Deploy Upgrade

# Generate manifests
npm run build

# Review changes
git diff manifests/monitoring.k8s.yaml

# Commit and deploy
git add config.yaml manifests/
git commit -m "Upgrade Loki to v6.17.0"
git push

# ArgoCD sync
argocd app sync monitoring

Step 6: Verify Upgrade

# Check Loki pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki

# Test log ingestion
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels

# Verify logs in Grafana
# Navigate to Explore → Loki → {namespace="monitoring"}
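A scripted alternative, assuming the port-forward above is still active:

# Expect at least one stream if recent logs are being ingested
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="monitoring"}' \
  --data-urlencode 'limit=5' | jq '.data.result | length'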

Upgrade Thanos

Thanos components (Query, Store, Compactor) use direct container images, not Helm charts.

Step 1: Check Current Version

kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query \
  -o jsonpath='{.items[0].spec.containers[0].image}'

Example output:

quay.io/thanos/thanos:v0.36.1

Step 2: Review Release Notes

# View Thanos releases
open https://github.com/thanos-io/thanos/releases

# Check for breaking changes in:
# - StoreAPI
# - Compaction
# - Query API

Step 3: Update Configuration

# config.yaml
versions:
  thanos: "v0.37.0"  # New version

Step 4: Deploy Upgrade

Rolling update (safe):

npm run build
git add config.yaml manifests/
git commit -m "Upgrade Thanos to v0.37.0"
git push
argocd app sync monitoring

StatefulSets update one pod at a time by default; Deployments roll replicas gradually according to their maxSurge/maxUnavailable settings.
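To watch each workload's rollout explicitly (the resource names are assumptions; adjust them to your manifests):

kubectl rollout status -n monitoring deployment/thanos-query
kubectl rollout status -n monitoring statefulset/thanos-store
kubectl rollout status -n monitoring statefulset/thanos-compactor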

Step 5: Verify Upgrade

# Check all Thanos components upgraded
kubectl get pods -n monitoring -o wide | grep thanos

# Check Thanos Query stores
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Visit: http://localhost:9090/stores

# Verify queries work
# Query: up{job="kubernetes-nodes"}

# Check compaction still running
kubectl logs -n monitoring thanos-compactor-0 --tail=50 | grep compact

Upgrade Alloy (successor to Grafana Agent)

Step 1: Check Current Version

cat config.yaml | grep "alloy:"

Example:

versions:
  alloy: "0.9.0"  # Current chart version

Step 2: Review Release Notes

helm search repo grafana/alloy --versions | head -10
open https://github.com/grafana/alloy/releases

Critical checks:

  • Configuration syntax changes

  • Loki API compatibility

  • Kubernetes API changes

Step 3: Update Configuration

# config.yaml
versions:
  alloy: "0.10.0"  # New version

Step 4: Deploy Upgrade

npm run build
git add config.yaml manifests/
git commit -m "Upgrade Alloy to v0.10.0"
git push
argocd app sync monitoring

DaemonSet rolling update: pods are replaced one node at a time by default (maxUnavailable: 1).
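To block until the rollout finishes (the DaemonSet name is an assumption):

kubectl rollout status -n monitoring daemonset/alloy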

Step 5: Verify Upgrade

# Check all Alloy pods updated
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy

# Check logs flowing
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=10

# Verify Loki receiving logs
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels

Upgrade Multiple Components Simultaneously

Not recommended, but possible for minor version bumps.

Approach

# config.yaml - update all at once
versions:
  prometheus: "66.0.0"
  loki: "6.17.0"
  alloy: "0.10.0"
  thanos: "v0.37.0"

Deploy

npm run build
git add config.yaml manifests/
git commit -m "Upgrade all monitoring components"
git push
argocd app sync monitoring

Monitor Closely

# Watch all pods
watch kubectl get pods -n monitoring

# Check for errors
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20

Rollback if any component fails:

git revert HEAD
git push
argocd app sync monitoring

Upgrade Testing Strategy

Test in Staging First

Best practice: Test upgrades in non-production environment.

Staging cluster steps:

  1. Deploy the same monitoring stack version as production (the sketch after this list shows one way to compare)

  2. Apply upgrade

  3. Run validation tests

  4. Monitor for 24-48 hours

  5. If stable, proceed to production
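One way to confirm staging matches production before testing, assuming kubectl contexts named staging and production (adjust to your kubeconfig):

# Diff the deployed images between the two clusters
diff \
  <(kubectl --context staging get pods -n monitoring \
      -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u) \
  <(kubectl --context production get pods -n monitoring \
      -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u)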

Smoke Tests After Upgrade

Checklist:

  • [ ] All pods running

  • [ ] Prometheus scraping targets

  • [ ] Grafana dashboards loading

  • [ ] Loki receiving logs

  • [ ] Thanos querying S3 data

  • [ ] Alerts still firing (test with dummy alert)

  • [ ] No error spikes in logs

Quick smoke test script:

#!/bin/bash
echo "=== Checking Pods ==="
# Anything whose STATUS is not Running (catches CrashLoopBackOff, Pending, etc.)
kubectl get pods -n monitoring --no-headers | grep -v Running || echo "All pods Running"

echo "=== Checking Prometheus Targets ==="
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'

echo "=== Checking Loki Labels ==="
kubectl port-forward -n monitoring svc/loki-gateway 3100:80 &
sleep 2
curl -s http://localhost:3100/loki/api/v1/labels | jq '.data | length'

echo "=== Checking Thanos Stores ==="
kubectl port-forward -n monitoring svc/thanos-query 9091:9090 &
sleep 2
curl -s http://localhost:9091/api/v1/stores | jq '.data | length'

# Kill only the port-forwards this script started
pkill -f "kubectl port-forward -n monitoring"

Common Upgrade Issues

Issue: CRD Version Mismatch

Symptom:

Error: unable to recognize "monitoring.k8s.yaml": no matches for kind "ServiceMonitor"

Solution:

# Apply CRDs manually
kubectl apply --server-side --force-conflicts -f \
  https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml

Issue: PVC Size Cannot Shrink

Symptom:

Error: Forbidden: field is immutable

Solution: PVCs can only grow, not shrink. To reduce the size (see the sketch after this list):

  1. Delete the StatefulSet (its PVCs are not deleted with it)

  2. Back up any needed data, then delete the PVC

  3. Lower the storage request in the configuration

  4. Redeploy the StatefulSet (a new, smaller PVC is created)

Warning: this destroys the data on the volume; restore from backup afterwards.
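A sketch of the procedure for a hypothetical Prometheus volume (the StatefulSet and PVC names are examples):

# 1-2. Delete the StatefulSet, then the PVC (the destructive step)
kubectl delete statefulset -n monitoring prometheus-example
kubectl delete pvc -n monitoring prometheus-example-db-0
# 3-4. Lower the storage request in config.yaml, regenerate, redeploy
npm run build
kubectl apply -f manifests/monitoring.k8s.yaml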

Issue: Breaking Configuration Change

Symptom:

Error: unknown field "oldFieldName" in ...

Solution:

  • Read chart CHANGELOG.md

  • Find migration path for renamed fields

  • Update charts/constructs/*.ts with new field names

  • Regenerate manifests

Issue: Image Pull Errors

Symptom:

Failed to pull image "quay.io/thanos/thanos:v0.99.0": not found

Solution:

  • Verify image tag exists: docker pull quay.io/thanos/thanos:v0.37.0

  • Check for typos in config.yaml

  • Wait if image just released (may not be available yet)


Upgrade Maintenance Windows

When to Schedule Maintenance

Low-risk upgrades (patch versions):

  • No maintenance window needed

  • Rolling updates typically avoid downtime

High-risk upgrades (major versions):

  • Schedule during low-traffic period

  • Announce to team

  • Have rollback plan ready

Maintenance Procedure

Before window:

  1. Announce maintenance

  2. Backup Grafana dashboards

  3. Document current versions

  4. Prepare rollback commands

During window:

  1. Apply upgrade

  2. Monitor closely

  3. Run smoke tests

  4. Verify no errors

After window:

  1. Monitor for 24 hours

  2. Check for error spikes

  3. Verify alerts still working

  4. Document lessons learned


Automation Considerations

Automated Upgrades (Renovate/Dependabot)

Not recommended for monitoring stack due to:

  • High blast radius if upgrade breaks

  • Need to verify metrics/logs continuity

  • Potential CRD conflicts

If automating (a config sketch follows this list):

  • Use staging cluster first

  • Require manual approval for production

  • Have comprehensive smoke tests

  • Set up rollback automation
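If you do wire this up with Renovate, a minimal config sketch that blocks automerge and requires dashboard approval for the monitoring directory might look like this (verify the key names against the current Renovate docs):

{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "packageRules": [
    {
      "description": "Monitoring stack: manual approval only",
      "matchFileNames": ["monitoring/**"],
      "automerge": false,
      "dependencyDashboardApproval": true
    }
  ]
}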

Version Pinning

Recommended approach:

# Pin to specific versions (not "latest")
versions:
  prometheus: "66.0.0"  # Not "latest"
  loki: "6.17.0"
  thanos: "v0.37.0"

Benefits:

  • Predictable deployments

  • Controlled upgrade timing

  • Easier troubleshooting


See Also