Upgrade Monitoring Components¶
Step-by-step guide for upgrading Prometheus, Loki, Thanos, Alloy, and other monitoring stack components.
Before You Begin¶
Prerequisites:
Access to the cluster with kubectl configured
Git access to the dp-infra repository
Understanding of the current deployment
Safety Checklist:
[ ] Review release notes for breaking changes
[ ] Backup Grafana dashboards (see the sketch after this checklist)
[ ] Document current versions
[ ] Plan maintenance window (if needed)
[ ] Have rollback plan ready
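For the dashboard backup item above, a minimal sketch using the Grafana HTTP API (the service name follows the kube-prometheus-stack defaults used later in this guide; GRAFANA_TOKEN is a placeholder for a Grafana service account token you create yourself):
# Export every dashboard as JSON via the Grafana HTTP API
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80 &
sleep 2
mkdir -p grafana-backup
for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "http://localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "http://localhost:3000/api/dashboards/uid/$uid" > "grafana-backup/$uid.json"
done
pkill -f "port-forward"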
Current Version Tracking:
# Check deployed versions
kubectl get helmchart -n monitoring -o yaml | grep version:
kubectl get pods -n monitoring -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
Upgrade kube-prometheus-stack (Prometheus, Grafana, Alertmanager)¶
Step 1: Check Current Version¶
cd dp-infra/monitoring
cat config.yaml | grep "prometheus:"
Example output:
versions:
  prometheus: "65.0.0" # Current chart version
Step 2: Review Release Notes¶
Check Helm chart changelog:
# List available versions
helm search repo prometheus-community/kube-prometheus-stack --versions | head -10
# View release notes
open https://github.com/prometheus-community/helm-charts/releases
Critical checks:
CRD changes (CustomResourceDefinitions)
Breaking configuration changes
Prometheus version compatibility
Grafana version compatibility
Required Kubernetes version
Step 3: Update Configuration¶
Edit config.yaml:
versions:
  prometheus: "66.0.0" # New version
Check for deprecated values:
# Compare old and new chart values
helm show values prometheus-community/kube-prometheus-stack --version 65.0.0 > /tmp/old-values.yaml
helm show values prometheus-community/kube-prometheus-stack --version 66.0.0 > /tmp/new-values.yaml
diff /tmp/old-values.yaml /tmp/new-values.yaml
Step 4: Update CRDs (if needed)¶
Check if CRDs changed:
helm show crds prometheus-community/kube-prometheus-stack --version 66.0.0 > /tmp/new-crds.yaml
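To see what actually changed, also export the CRDs from the currently deployed chart version and diff them:
helm show crds prometheus-community/kube-prometheus-stack --version 65.0.0 > /tmp/old-crds.yaml
diff /tmp/old-crds.yaml /tmp/new-crds.yaml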
Apply CRD updates:
# CRDs must be applied manually before Helm upgrade
kubectl apply --server-side --force-conflicts -f /tmp/new-crds.yaml
Important: Always apply CRDs before upgrading the chart.
Step 5: Generate and Review Manifests¶
cd dp-infra/monitoring
npm run build
Review changes:
git diff manifests/monitoring.k8s.yaml
Look for:
Resource spec changes
New/removed resources
Storage changes (PVCs don’t auto-resize)
Step 6: Deploy Upgrade¶
Option A: Via ArgoCD (Recommended)
# Commit changes
git add config.yaml manifests/
git commit -m "Upgrade kube-prometheus-stack to v66.0.0"
git push
# Sync in ArgoCD
argocd app sync monitoring
# Watch progress
argocd app wait monitoring --health
Option B: Direct kubectl
kubectl apply -f manifests/monitoring.k8s.yaml
Step 7: Verify Upgrade¶
# Check all pods running
kubectl get pods -n monitoring
# Check Prometheus version
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Visit: http://localhost:9090/status
# Check "Version" field
# Check Grafana version
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Visit: http://localhost:3000
# Check bottom-left corner
# Verify metrics still flowing
# Query: up{job="kubernetes-nodes"}
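As a scriptable alternative to the UI, the Prometheus build-info endpoint reports the running version (assumes the port-forward above is still active):
curl -s http://localhost:9090/api/v1/status/buildinfo | jq -r '.data.version'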
Rollback Procedure¶
If upgrade fails:
# Revert config.yaml
git revert HEAD
# Regenerate manifests
npm run build
# Apply old version
kubectl apply -f manifests/monitoring.k8s.yaml
# Or via ArgoCD
git push
argocd app sync monitoring
Upgrade Loki¶
Step 1: Check Current Version¶
cat config.yaml | grep "loki:"
Example:
versions:
  loki: "6.16.0" # Current chart version
Step 2: Review Release Notes¶
# List available versions
helm search repo grafana/loki --versions | head -10
# View release notes
open https://github.com/grafana/loki/releases
Critical checks:
Storage schema changes
Breaking config changes
S3 compatibility
SimpleScalable mode changes
Step 3: Test Storage Schema Compatibility¶
Check current schema:
kubectl logs -n monitoring loki-backend-0 | grep schema
Example output:
level=info schema=v13 msg="using schema"
Verify new version supports schema v13 (or current schema).
Important: Schema migrations can require data reprocessing.
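You can also read the schema straight from the rendered Loki config (the ConfigMap name loki and data key config.yaml are assumptions based on the Loki chart defaults; adjust to your deployment):
kubectl get configmap loki -n monitoring -o jsonpath='{.data.config\.yaml}' | grep -A6 "schema_config"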
Step 4: Update Configuration¶
# config.yaml
versions:
  loki: "6.17.0" # New version
Step 5: Deploy Upgrade¶
# Generate manifests
npm run build
# Review changes
git diff manifests/monitoring.k8s.yaml
# Commit and deploy
git add config.yaml manifests/
git commit -m "Upgrade Loki to v6.17.0"
git push
# ArgoCD sync
argocd app sync monitoring
Step 6: Verify Upgrade¶
# Check Loki pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
# Test log ingestion
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels
# Verify logs in Grafana
# Navigate to Explore → Loki → {namespace="monitoring"}
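Optionally, prove the write path end to end by pushing a synthetic log line and querying it back (assumes the port-forward above is still running; in multi-tenant setups add an X-Scope-OrgID header):
# Push one test entry (timestamps are in nanoseconds)
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -d "{\"streams\":[{\"stream\":{\"job\":\"upgrade-smoke-test\"},\"values\":[[\"$(date +%s)000000000\",\"loki upgrade smoke test\"]]}]}"
# Query it back; expect at least one result
curl -s -G http://localhost:3100/loki/api/v1/query \
  --data-urlencode 'query={job="upgrade-smoke-test"}' | jq '.data.result | length'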
Upgrade Thanos¶
Thanos components (Query, Store, Compactor) use direct container images, not Helm charts.
Step 1: Check Current Version¶
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query -o jsonpath='{.items[0].spec.containers[0].image}'
Example output:
quay.io/thanos/thanos:v0.36.1
Step 2: Review Release Notes¶
# View Thanos releases
open https://github.com/thanos-io/thanos/releases
# Check for breaking changes in:
# - StoreAPI
# - Compaction
# - Query API
Step 3: Update Configuration¶
# config.yaml
versions:
  thanos: "v0.37.0" # New version
Step 4: Deploy Upgrade¶
Rolling update (safe):
npm run build
git add config.yaml manifests/
git commit -m "Upgrade Thanos to v0.37.0"
git push
argocd app sync monitoring
Kubernetes updates pods one at a time (StatefulSet/Deployment default behavior).
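To block until the rollout finishes, kubectl rollout status works against each workload (the workload names here are assumptions; match them to your manifests):
kubectl rollout status -n monitoring statefulset/thanos-store
kubectl rollout status -n monitoring deployment/thanos-query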
Step 5: Verify Upgrade¶
# Check all Thanos components upgraded
kubectl get pods -n monitoring -o wide | grep thanos
# Check Thanos Query stores
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Visit: http://localhost:9090/stores
# Verify queries work
# Query: up{job="kubernetes-nodes"}
# Check compaction still running
kubectl logs -n monitoring thanos-compactor-0 --tail=50 | grep compact
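To confirm every Thanos pod is on the new image, reuse the image-listing one-liner from the version-tracking section:
kubectl get pods -n monitoring -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' | grep thanos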
Upgrade Alloy (Grafana Agent)¶
Step 1: Check Current Version¶
cat config.yaml | grep "alloy:"
Example:
versions:
alloy: "0.9.0" # Current chart version
Step 2: Review Release Notes¶
helm search repo grafana/alloy --versions | head -10
open https://github.com/grafana/alloy/releases
Critical checks:
Configuration syntax changes
Loki API compatibility
Kubernetes API changes
Step 3: Update Configuration¶
# config.yaml
versions:
  alloy: "0.10.0" # New version
Step 4: Deploy Upgrade¶
npm run build
git add config.yaml manifests/
git commit -m "Upgrade Alloy to v0.10.0"
git push
argocd app sync monitoring
DaemonSet rolling update: One pod per node updates sequentially.
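To wait for the node-by-node rollout to complete (the DaemonSet name alloy is an assumption based on the chart default):
kubectl rollout status -n monitoring daemonset/alloy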
Step 5: Verify Upgrade¶
# Check all Alloy pods updated
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
# Check logs flowing
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=10
# Verify Loki receiving logs
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels
Upgrade Multiple Components Simultaneously¶
Not recommended, but possible for minor version bumps.
Approach¶
# config.yaml - update all at once
versions:
  prometheus: "66.0.0"
  loki: "6.17.0"
  alloy: "0.10.0"
  thanos: "v0.37.0"
Deploy¶
npm run build
git add config.yaml manifests/
git commit -m "Upgrade all monitoring components"
git push
argocd app sync monitoring
Monitor Closely¶
# Watch all pods
watch kubectl get pods -n monitoring
# Check for errors
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20
Rollback if any component fails:
git revert HEAD
git push
argocd app sync monitoring
Upgrade Testing Strategy¶
Test in Staging First¶
Best practice: Test upgrades in non-production environment.
Staging cluster steps:
Deploy same monitoring stack version as production
Apply upgrade
Run validation tests
Monitor for 24-48 hours
If stable, proceed to production
Smoke Tests After Upgrade¶
Checklist:
[ ] All pods running
[ ] Prometheus scraping targets
[ ] Grafana dashboards loading
[ ] Loki receiving logs
[ ] Thanos querying S3 data
[ ] Alerts still firing (test with a dummy alert; see the sketch below)
[ ] No error spikes in logs
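For the dummy-alert item, one option is to POST a synthetic alert directly to the Alertmanager v2 API (the service name follows the kube-prometheus-stack naming used elsewhere in this guide):
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
sleep 2
# Fire a synthetic alert; it should show up in the Alertmanager UI and route per your config
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"UpgradeSmokeTest","severity":"info"},"annotations":{"summary":"Post-upgrade test alert"}}]'
pkill -f "port-forward"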
Quick smoke test script:
#!/bin/bash
echo "=== Checking Pods ==="
kubectl get pods -n monitoring | grep -v Running
echo "=== Checking Prometheus Targets ==="
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 2
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'
echo "=== Checking Loki Labels ==="
kubectl port-forward -n monitoring svc/loki-gateway 3100:80 &
sleep 2
curl -s http://localhost:3100/loki/api/v1/labels | jq '.data | length'
echo "=== Checking Thanos Stores ==="
kubectl port-forward -n monitoring svc/thanos-query 9091:9090 &
sleep 2
curl -s http://localhost:9091/api/v1/stores | jq '.data | length'
pkill -f "port-forward"
Common Upgrade Issues¶
Issue: CRD Version Mismatch¶
Symptom:
Error: unable to recognize "monitoring.k8s.yaml": no matches for kind "ServiceMonitor"
Solution:
# Apply CRDs manually
kubectl apply --server-side --force-conflicts -f \
https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
Issue: PVC Size Cannot Shrink¶
Symptom:
Error: Forbidden: field is immutable
Solution: PVCs can only grow, not shrink. To reduce:
Delete StatefulSet (keep PVC)
Delete PVC
Recreate with smaller size
Redeploy StatefulSet
Warning: This causes data loss. Restore from backup if needed.
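A sketch of that sequence for a hypothetical loki-backend StatefulSet (resource and PVC names are assumptions; adapt them to your manifests):
kubectl delete statefulset loki-backend -n monitoring   # pods removed; PVCs survive by default
kubectl delete pvc data-loki-backend-0 -n monitoring    # this deletes the stored data!
# Set the smaller size in config.yaml, regenerate, and redeploy
npm run build
kubectl apply -f manifests/monitoring.k8s.yaml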
Issue: Breaking Configuration Change¶
Symptom:
Error: unknown field "oldFieldName" in ...
Solution:
Read chart CHANGELOG.md
Find migration path for renamed fields
Update charts/constructs/*.ts with the new field names
Regenerate manifests
Issue: Image Pull Errors¶
Symptom:
Failed to pull image "quay.io/thanos/thanos:v0.99.0": not found
Solution:
Verify the image tag exists: docker pull quay.io/thanos/thanos:v0.37.0
Check for typos in config.yaml
Wait if image just released (may not be available yet)
Upgrade Maintenance Windows¶
When to Schedule Maintenance¶
Low-risk upgrades (patch versions):
No maintenance window needed
Rolling updates cause no downtime
High-risk upgrades (major versions):
Schedule during low-traffic period
Announce to team
Have rollback plan ready
Maintenance Procedure¶
Before window:
Announce maintenance
Backup Grafana dashboards
Document current versions
Prepare rollback commands
During window:
Apply upgrade
Monitor closely
Run smoke tests
Verify no errors
After window:
Monitor for 24 hours
Check for error spikes
Verify alerts still working
Document lessons learned
Automation Considerations¶
Automated Upgrades (Renovate/Dependabot)¶
Not recommended for monitoring stack due to:
High blast radius if upgrade breaks
Need to verify metrics/logs continuity
Potential CRD conflicts
If automating:
Use staging cluster first
Require manual approval for production
Have comprehensive smoke tests
Set up rollback automation
Version Pinning¶
Recommended approach:
# Pin to specific versions (not "latest")
versions:
  prometheus: "66.0.0" # Not "latest"
  loki: "6.17.0"
  thanos: "v0.37.0"
Benefits:
Predictable deployments
Controlled upgrade timing
Easier troubleshooting
See Also¶
Configuration Reference - Version configuration
Troubleshooting - Common issues
Architecture Overview - Component relationships