How-To Guide

Upgrade Monitoring Components

Step-by-step guide for upgrading Prometheus, Loki, Thanos, Alloy, and other monitoring stack components.

Before You Begin

Prerequisites:

  • Access to the cluster with kubectl configured

  • Git access to dp-infra repository

  • Understanding of the current deployment

Safety Checklist:

  • [ ] Review release notes for breaking changes

  • [ ] Backup Grafana dashboards (see the backup sketch after this list)

  • [ ] Document current versions

  • [ ] Plan maintenance window (if needed)

  • [ ] Have rollback plan ready
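One way to back up dashboards is through the Grafana HTTP API. A minimal sketch, assuming a port-forward to the Grafana service used elsewhere in this guide and an admin API token exported as GRAFANA_TOKEN (both are assumptions):

# Back up every dashboard as JSON via the Grafana HTTP API.
# Assumes: port-forward to Grafana on localhost:3000, token in $GRAFANA_TOKEN.
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80 &
sleep 2
mkdir -p grafana-backup
for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "http://localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "http://localhost:3000/api/dashboards/uid/$uid" > "grafana-backup/$uid.json"
done
pkill -f "kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana"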

Current Version Tracking:

# Check deployed versions
kubectl get helmchart -n monitoring -o yaml | grep version:
kubectl get pods -n monitoring -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'

Upgrade kube-prometheus-stack (Prometheus, Grafana, Alertmanager)

Step 1: Check Current Version

cd dp-infra/monitoring
cat config.yaml | grep "prometheus:"

Example output:

versions:
  prometheus: "65.0.0"  # Current chart version

Step 2: Review Release Notes

Check Helm chart changelog:

# List available versions
helm search repo prometheus-community/kube-prometheus-stack --versions | head -10

# View release notes
open https://github.com/prometheus-community/helm-charts/releases

Critical checks:

  • CRD changes (CustomResourceDefinitions)

  • Breaking configuration changes

  • Prometheus version compatibility

  • Grafana version compatibility

  • Required Kubernetes version

Step 3: Update Configuration

Edit config.yaml:

versions:
  prometheus: "66.0.0"  # New version

Check for deprecated values:

# Compare old and new chart values
helm show values prometheus-community/kube-prometheus-stack --version 65.0.0 > /tmp/old-values.yaml
helm show values prometheus-community/kube-prometheus-stack --version 66.0.0 > /tmp/new-values.yaml
diff /tmp/old-values.yaml /tmp/new-values.yaml

Step 4: Update CRDs (if needed)

Check if CRDs changed:

helm show crds prometheus-community/kube-prometheus-stack --version 66.0.0 > /tmp/new-crds.yaml

Apply CRD updates:

# CRDs must be applied manually before Helm upgrade
kubectl apply --server-side --force-conflicts -f /tmp/new-crds.yaml

Important: Always apply CRDs before upgrading the chart.
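To confirm the new CRDs actually landed, one place to look is the version annotation that prometheus-operator stamps on its CRDs (the annotation name is specific to this operator; treat the check as a sketch):

# Show the operator version recorded on one of the CRDs
kubectl get crd prometheuses.monitoring.coreos.com \
  -o jsonpath="{.metadata.annotations['operator\.prometheus\.io/version']}"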

Step 5: Generate and Review Manifests

cd dp-infra/monitoring
npm run build

Review changes:

git diff manifests/monitoring.k8s.yaml

Look for:

  • Resource spec changes

  • New/removed resources

  • Storage changes (PVCs don’t auto-resize; see the expansion sketch after this list)
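If a chart bump raises a storage request, the PVC must be expanded separately, and only if the StorageClass has allowVolumeExpansion enabled. A minimal sketch with a hypothetical PVC name:

# Grow an existing PVC to match a new storage request (name is an example)
kubectl patch pvc -n monitoring prometheus-db-claim \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'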

Step 6: Deploy Upgrade

Option A: Via ArgoCD (Recommended)

# Commit changes
git add config.yaml manifests/
git commit -m "Upgrade kube-prometheus-stack to v66.0.0"
git push

# Sync in ArgoCD
argocd app sync monitoring

# Watch progress
argocd app wait monitoring --health

Option B: Direct kubectl

kubectl apply -f manifests/monitoring.k8s.yaml

Step 7: Verify Upgrade

# Check all pods running
kubectl get pods -n monitoring

# Check Prometheus version
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Visit: http://localhost:9090/status
# Check "Version" field

# Check Grafana version
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Visit: http://localhost:3000
# Check bottom-left corner

# Verify metrics still flowing
# Query: up{job="kubernetes-nodes"}
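The UI checks above can also be scripted. Assuming the Prometheus port-forward is still running:

# Expect a non-zero series count if node metrics are flowing
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up{job="kubernetes-nodes"}' | jq '.data.result | length'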

Rollback Procedure

If upgrade fails:

# Revert config.yaml
git revert HEAD

# Regenerate manifests
npm run build

# Apply old version
kubectl apply -f manifests/monitoring.k8s.yaml

# Or via ArgoCD
git push
argocd app sync monitoring
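ArgoCD can also roll back to a previous sync directly, without a new commit; the history ID below is a placeholder you read from your own deployment history:

# List prior syncs, then roll back to a known-good one
argocd app history monitoring
argocd app rollback monitoring <HISTORY_ID>

Note that ArgoCD refuses a rollback while automated sync is enabled, so disable auto-sync first if your app uses it.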

Upgrade Loki

Step 1: Check Current Version

cat config.yaml | grep "loki:"

Example:

versions:
  loki: "6.16.0"  # Current chart version

Step 2: Review Release Notes

# List available versions
helm search repo grafana/loki --versions | head -10

# View release notes
open https://github.com/grafana/loki/releases

Critical checks:

  • Storage schema changes

  • Breaking config changes

  • S3 compatibility

  • SimpleScalable mode changes

Step 3: Test Storage Schema Compatibility

Check current schema:

kubectl logs -n monitoring loki-backend-0 | grep schema

Example output:

level=info schema=v13 msg="using schema"

Verify that the new version still supports your current schema (v13 in this example).

Important: Schema migrations can require data reprocessing.
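For reference, the schema is pinned in Loki's schema_config block; it looks roughly like this (dates and stores are illustrative):

# Loki schema_config stanza (illustrative values)
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h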

Step 4: Update Configuration

# config.yaml
versions:
  loki: "6.17.0"  # New version

Step 5: Deploy Upgrade

# Generate manifests
npm run build

# Review changes
git diff manifests/monitoring.k8s.yaml

# Commit and deploy
git add config.yaml manifests/
git commit -m "Upgrade Loki to v6.17.0"
git push

# ArgoCD sync
argocd app sync monitoring

Step 6: Verify Upgrade

# Check Loki pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki

# Test log ingestion
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels

# Verify logs in Grafana
# Navigate to Explore → Loki → {namespace="monitoring"}
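A scripted alternative, assuming the port-forward above is still active:

# Expect at least one stream if recent logs are being ingested
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="monitoring"}' \
  --data-urlencode 'limit=5' | jq '.data.result | length'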

Upgrade Thanos

Thanos components (Query, Store, Compactor) use direct container images, not Helm charts.

Step 1: Check Current Version

kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query \
  -o jsonpath='{.items[0].spec.containers[0].image}'

Example output:

quay.io/thanos/thanos:v0.36.1

Step 2: Review Release Notes

# View Thanos releases
open https://github.com/thanos-io/thanos/releases

# Check for breaking changes in:
# - StoreAPI
# - Compaction
# - Query API

Step 3: Update Configuration

# config.yaml
versions:
  thanos: "v0.37.0"  # New version

Step 4: Deploy Upgrade

Rolling update (safe):

npm run build
git add config.yaml manifests/
git commit -m "Upgrade Thanos to v0.37.0"
git push
argocd app sync monitoring

StatefulSets update one pod at a time by default; Deployments roll replicas gradually according to their maxSurge/maxUnavailable settings.
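To watch each workload's rollout explicitly (the resource names are assumptions; adjust them to your manifests):

kubectl rollout status -n monitoring deployment/thanos-query
kubectl rollout status -n monitoring statefulset/thanos-store
kubectl rollout status -n monitoring statefulset/thanos-compactor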

Step 5: Verify Upgrade

# Check all Thanos components upgraded
kubectl get pods -n monitoring -o wide | grep thanos

# Check Thanos Query stores
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Visit: http://localhost:9090/stores

# Verify queries work
# Query: up{job="kubernetes-nodes"}

# Check compaction still running
kubectl logs -n monitoring thanos-compactor-0 --tail=50 | grep compact

Upgrade Alloy (successor to Grafana Agent)

Step 1: Check Current Version

cat config.yaml | grep "alloy:"

Example:

versions:
  alloy: "0.9.0"  # Current chart version

Step 2: Review Release Notes

helm search repo grafana/alloy --versions | head -10
open https://github.com/grafana/alloy/releases

Critical checks:

  • Configuration syntax changes

  • Loki API compatibility

  • Kubernetes API changes

Step 3: Update Configuration

# config.yaml
versions:
  alloy: "0.10.0"  # New version

Step 4: Deploy Upgrade

npm run build
git add config.yaml manifests/
git commit -m "Upgrade Alloy to v0.10.0"
git push
argocd app sync monitoring

DaemonSet rolling update: pods are replaced one node at a time by default (maxUnavailable: 1).
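To block until the rollout finishes (the DaemonSet name is an assumption):

kubectl rollout status -n monitoring daemonset/alloy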

Step 5: Verify Upgrade

# Check all Alloy pods updated
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy

# Check logs flowing
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=10

# Verify Loki receiving logs
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels

Upgrade Multiple Components Simultaneously

Not recommended, but possible for minor version bumps.

Approach

# config.yaml - update all at once
versions:
  prometheus: "66.0.0"
  loki: "6.17.0"
  alloy: "0.10.0"
  thanos: "v0.37.0"

Deploy

npm run build
git add config.yaml manifests/
git commit -m "Upgrade all monitoring components"
git push
argocd app sync monitoring

Monitor Closely

# Watch all pods
watch kubectl get pods -n monitoring

# Check for errors
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20

Rollback if any component fails:

git revert HEAD
git push
argocd app sync monitoring

Upgrade Testing Strategy

Test in Staging First

Best practice: Test upgrades in non-production environment.

Staging cluster steps:

  1. Deploy the same monitoring stack version as production (the sketch after this list shows one way to compare)

  2. Apply upgrade

  3. Run validation tests

  4. Monitor for 24-48 hours

  5. If stable, proceed to production
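One way to confirm staging matches production before testing, assuming kubectl contexts named staging and production (adjust to your kubeconfig):

# Diff the deployed images between the two clusters
diff \
  <(kubectl --context staging get pods -n monitoring \
      -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u) \
  <(kubectl --context production get pods -n monitoring \
      -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u)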

Smoke Tests After Upgrade

Checklist:

  • [ ] All pods running

  • [ ] Prometheus scraping targets

  • [ ] Grafana dashboards loading

  • [ ] Loki receiving logs

  • [ ] Thanos querying S3 data

  • [ ] Alerts still firing (test with dummy alert)

  • [ ] No error spikes in logs

Quick smoke test script:

#!/bin/bash
echo "=== Checking Pods ==="
# Anything whose STATUS is not Running (catches CrashLoopBackOff, Pending, etc.)
kubectl get pods -n monitoring --no-headers | grep -v Running || echo "All pods Running"

echo "=== Checking Prometheus Targets ==="
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'

echo "=== Checking Loki Labels ==="
kubectl port-forward -n monitoring svc/loki-gateway 3100:80 &
sleep 2
curl -s http://localhost:3100/loki/api/v1/labels | jq '.data | length'

echo "=== Checking Thanos Stores ==="
kubectl port-forward -n monitoring svc/thanos-query 9091:9090 &
sleep 2
curl -s http://localhost:9091/api/v1/stores | jq '.data | length'

# Kill only the port-forwards this script started
pkill -f "kubectl port-forward -n monitoring"

Common Upgrade Issues

Issue: CRD Version Mismatch

Symptom:

Error: unable to recognize "monitoring.k8s.yaml": no matches for kind "ServiceMonitor"

Solution:

# Apply CRDs manually
kubectl apply --server-side --force-conflicts -f \
  https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml

Issue: PVC Size Cannot Shrink

Symptom:

Error: Forbidden: field is immutable

Solution: PVCs can only grow, not shrink. To reduce the size (see the sketch after this list):

  1. Delete the StatefulSet (its PVCs are not deleted with it)

  2. Back up any needed data, then delete the PVC

  3. Lower the storage request in the configuration

  4. Redeploy the StatefulSet (a new, smaller PVC is created)

Warning: this destroys the data on the volume; restore from backup afterwards.
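A sketch of the procedure for a hypothetical Prometheus volume (the StatefulSet and PVC names are examples):

# 1-2. Delete the StatefulSet, then the PVC (the destructive step)
kubectl delete statefulset -n monitoring prometheus-example
kubectl delete pvc -n monitoring prometheus-example-db-0
# 3-4. Lower the storage request in config.yaml, regenerate, redeploy
npm run build
kubectl apply -f manifests/monitoring.k8s.yaml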

Issue: Breaking Configuration Change

Symptom:

Error: unknown field "oldFieldName" in ...

Solution:

  • Read chart CHANGELOG.md

  • Find migration path for renamed fields

  • Update charts/constructs/*.ts with new field names

  • Regenerate manifests

Issue: Image Pull Errors

Symptom:

Failed to pull image "quay.io/thanos/thanos:v0.99.0": not found

Solution:

  • Verify image tag exists: docker pull quay.io/thanos/thanos:v0.37.0

  • Check for typos in config.yaml

  • Wait if image just released (may not be available yet)


Upgrade Maintenance Windows

When to Schedule Maintenance

Low-risk upgrades (patch versions):

  • No maintenance window needed

  • Rolling updates typically avoid downtime

High-risk upgrades (major versions):

  • Schedule during low-traffic period

  • Announce to team

  • Have rollback plan ready

Maintenance Procedure

Before window:

  1. Announce maintenance

  2. Backup Grafana dashboards

  3. Document current versions

  4. Prepare rollback commands

During window:

  1. Apply upgrade

  2. Monitor closely

  3. Run smoke tests

  4. Verify no errors

After window:

  1. Monitor for 24 hours

  2. Check for error spikes

  3. Verify alerts still working

  4. Document lessons learned


Automation Considerations

Automated Upgrades (Renovate/Dependabot)

Not recommended for monitoring stack due to:

  • High blast radius if upgrade breaks

  • Need to verify metrics/logs continuity

  • Potential CRD conflicts

If automating (a config sketch follows this list):

  • Use staging cluster first

  • Require manual approval for production

  • Have comprehensive smoke tests

  • Set up rollback automation
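If you do wire this up with Renovate, a minimal config sketch that blocks automerge and requires dashboard approval for the monitoring directory might look like this (verify the key names against the current Renovate docs):

{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "packageRules": [
    {
      "description": "Monitoring stack: manual approval only",
      "matchFileNames": ["monitoring/**"],
      "automerge": false,
      "dependencyDashboardApproval": true
    }
  ]
}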

Version Pinning

Recommended approach:

# Pin to specific versions (not "latest")
versions:
  prometheus: "66.0.0"  # Not "latest"
  loki: "6.17.0"
  thanos: "v0.37.0"

Benefits:

  • Predictable deployments

  • Controlled upgrade timing

  • Easier troubleshooting


See Also