Scale Monitoring Resources¶
Step-by-step guide for scaling CPU, memory, storage, and replicas in the monitoring stack.
Before You Begin¶
When to scale:
Pods frequently OOMKilled
CPU throttling >20%
Query latency increasing
Storage approaching capacity
Adding more cluster nodes
Scaling options:
Vertical scaling: Increase CPU/memory/storage
Horizontal scaling: Increase replica count
Storage expansion: Increase PVC sizes
S3 scaling: Automatic (unlimited)
Vertical Scaling (CPU & Memory)¶
Step 1: Identify Resource Bottleneck¶
Check current usage:
# Overall resource usage
kubectl top pods -n monitoring --sort-by=memory
kubectl top pods -n monitoring --sort-by=cpu
# Check for OOM kills
kubectl get events -n monitoring | grep OOM
# Check CPU throttling
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1
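The same check can be run from the command line against the Prometheus HTTP API while the port-forward above is active (a minimal sketch; jq is only used for readability):
# Run the throttling query via the Prometheus HTTP API
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1' \
  | jq '.data.result[] | {pod: .metric.pod, throttle_rate: .value[1]}'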
Identify which component needs scaling:
# Top memory consumers
kubectl top pods -n monitoring | sort -k3 -h | tail -5
# Top CPU consumers
kubectl top pods -n monitoring | sort -k2 -h | tail -5
Step 2: Update Resource Requests/Limits¶
Edit config.yaml:
Example: Scale Prometheus
resources:
prometheus:
requests:
cpu: 200m # Doubled from 100m
memory: 3Gi # Doubled from 1500Mi
limits:
cpu: 2000m # Doubled from 1000m
memory: 6Gi # Doubled from 3Gi
Example: Scale Thanos Query
resources:
thanosQuery:
requests:
cpu: 400m # Doubled from 200m
memory: 1Gi # Doubled from 512Mi
limits:
cpu: 2000m # Doubled from 1000m
memory: 2Gi # Doubled from 1Gi
Example: Scale Loki Write
resources:
loki:
write:
requests:
cpu: 200m # Doubled from 100m
memory: 512Mi # Doubled from 256Mi
limits:
cpu: 1000m # Doubled from 500m
memory: 1Gi # Doubled from 512Mi
Step 3: Generate and Deploy¶
cd dp-infra/monitoring
npm run build
# Review changes
git diff manifests/monitoring.k8s.yaml
# Commit and deploy
git add config.yaml manifests/
git commit -m "Scale Prometheus resources (200m CPU, 3Gi memory)"
git push
# ArgoCD sync
argocd app sync monitoring
Step 4: Verify Scaling¶
# Check pods restarted with new resources
kubectl get pods -n monitoring
# Verify new resource allocation
kubectl describe pod <pod-name> -n monitoring | grep -A 10 "Limits:"
# Monitor resource usage
kubectl top pods -n monitoring --sort-by=memory
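To see the exact values each container ended up with (instead of grepping describe output), a jsonpath query works as well — a minimal sketch, with <pod-name> as a placeholder:
# Print requests and limits per container
kubectl get pod <pod-name> -n monitoring \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\t"}{.resources.limits}{"\n"}{end}'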
Scaling Guidelines by Component¶
Prometheus¶
Baseline: 100m CPU, 1500Mi memory
Scale when:
Scraping >2000 targets
Query latency >2s
Memory usage >80%
Scaling factor: 2x for every 2000 additional targets
Example:
# For 4000 targets
resources:
prometheus:
requests:
cpu: 200m
memory: 3Gi
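To see how close Prometheus is to the 2000-target threshold, count its active scrape targets (a quick check; assumes a port-forward to Prometheus as in Step 1):
# Number of scrape targets currently known to Prometheus
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=count(up)' \
  | jq -r '.data.result[0].value[1]'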
Thanos Query¶
Baseline: 200m CPU, 512Mi memory
Scale when:
Handling >100 queries/sec
Federating >5 Prometheus instances
Query latency >1s
Scaling factor: 2x for every 100 qps increase
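Applying that factor, a cluster serving roughly 200 qps would double the baseline — a sketch using the same config.yaml schema as the Step 2 examples:
# For ~200 queries/sec
resources:
  thanosQuery:
    requests:
      cpu: 400m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi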
Loki Components¶
Baseline: 100m CPU, 256Mi memory each
Scale when:
Ingesting >100MB/day logs
Query latency >3s
Memory usage >80%
Scaling factor: 2x for every 100MB/day increase
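To measure the current ingestion rate before scaling, query Loki's distributor counter from Prometheus (metric name assumed from recent Loki releases; run it in the Prometheus UI or via the API):
# Approximate log ingestion in MB/day, extrapolated from the last hour
sum(rate(loki_distributor_bytes_received_total[1h])) * 86400 / 1e6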
Horizontal Scaling (Replicas)¶
Step 1: Determine Which Components Can Scale¶
Horizontally scalable:
✅ Prometheus (independent scraping)
✅ Thanos Query (stateless)
✅ Thanos Store (shared S3 state)
✅ Loki Write (shared S3 state)
✅ Loki Read (shared S3 state)
✅ Loki Backend (shared S3 state)
✅ Alertmanager (gossip clustering)
✅ Grafana (with shared database)
Single replica only:
❌ Thanos Compactor (requires single instance)
DaemonSets (auto-scale with nodes):
⚙️ Alloy (one per node)
⚙️ Node Exporter (one per node)
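To see the current replica counts of everything scalable before editing config.yaml:
# Current replica counts (DaemonSets simply follow node count)
kubectl get deployments,statefulsets -n monitoring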
Step 2: Update Replica Count¶
Edit config.yaml:
Example: Scale Prometheus to 3 replicas
replicas:
prometheus: 3 # Increased from 2
Example: Scale Thanos Query to 3 replicas
replicas:
thanosQuery: 3 # Increased from 2
Example: Scale Loki components
loki:
write:
replicas: 3 # Increased from 2
read:
replicas: 3 # Increased from 2
backend:
replicas: 3 # Increased from 2
Step 3: Deploy Scaling¶
cd dp-infra/monitoring
npm run build
git add config.yaml manifests/
git commit -m "Scale Thanos Query to 3 replicas for HA"
git push
argocd app sync monitoring
Step 4: Verify Scaling¶
# Check new replicas running
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query
# Verify load distribution (for StatefulSets)
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o wide
# Check Thanos Query sees all stores
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Visit: http://localhost:9090/stores
# Should show all Prometheus sidecars
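To confirm traffic is actually spread across the new replicas, check that each ready pod is registered behind the Service (service name as used above):
# Each ready replica should appear as an endpoint
kubectl get endpoints thanos-query -n monitoring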
Replica Count Guidelines¶
Prometheus¶
Minimum: 2 (for HA)
Recommended: 2-3
Scale to 3 when:
Need higher query throughput
Scraping >3000 targets
Want to tolerate 2 failures
Cost: 1 additional 3Gi PVC per replica
Thanos Query¶
Minimum: 2 (for HA)
Recommended: 2-3
Scale to 3 when:
Query load >100 qps
Need lower query latency
Grafana dashboard count >50
Cost: No additional storage (stateless)
Loki Write/Read/Backend¶
Minimum: 2 (for HA)
Recommended: 2-3
Scale to 3 when:
Log ingestion >200MB/day
Query latency increasing
Read replicas at >80% CPU
Cost: Additional 5Gi PVC per component per replica
Storage Scaling¶
PVC Expansion (Longhorn)¶
Step 1: Check Current Usage¶
# Check Prometheus PVC usage
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- df -h /prometheus
# Check all PVCs
kubectl get pvc -n monitoring
Step 2: Verify Storage Class Supports Expansion¶
kubectl get storageclass longhorn -o jsonpath='{.allowVolumeExpansion}'
# Should output: true
Step 3: Expand PVC¶
Option A: Via config.yaml (recommended)
Edit config.yaml:
storage:
prometheus: 6Gi # Increased from 3Gi
grafana: 10Gi # Increased from 5Gi
thanosStore: 20Gi # Increased from 10Gi
Deploy:
npm run build
git add config.yaml manifests/
git commit -m "Expand Prometheus storage to 6Gi"
git push
argocd app sync monitoring
Option B: Direct PVC edit (emergency)
kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring
Change:
spec:
resources:
requests:
storage: 6Gi # Increased from 3Gi
Step 4: Verify Expansion¶
# Check PVC size updated
kubectl get pvc -n monitoring | grep prometheus
# Check filesystem expanded
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- df -h /prometheus
Note: Expansion happens automatically. Pod does NOT need restart.
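If the expansion appears stuck, the PVC conditions show whether Longhorn is still resizing or a filesystem resize is pending — a quick check using the PVC name from Option B:
# Look for Resizing / FileSystemResizePending conditions
kubectl describe pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring | grep -A 5 Conditions
kubectl get events -n monitoring --field-selector involvedObject.kind=PersistentVolumeClaim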
Storage Scaling Examples¶
Prometheus Storage¶
Current: 3Gi (3 days retention)
Scale to: 6Gi (6 days retention OR more metrics)
When to scale:
PVC >80% full
Want longer retention
Scraping more targets
Update:
storage:
prometheus: 6Gi
# Also increase retention if desired
retention:
prometheus: 6d
Thanos Store Cache¶
Current: 10Gi per replica
Scale to: 20Gi per replica
When to scale:
Cache hit ratio <80%
Query latency increasing
Large number of blocks in S3
Update:
storage:
thanosStore: 20Gi
Thanos Compactor Workspace¶
Current: 20Gi
Scale to: 40Gi
When to scale:
Compaction taking >4 hours
Compactor logs show “out of space”
Processing large blocks
Update:
storage:
thanosCompactor: 40Gi
Loki PVCs¶
Current: 5Gi per component
Scale to: 10Gi per component
When to scale:
WAL PVC >80% full (Write component)
Cache PVC >80% full (Read component)
Index PVC >80% full (Backend component)
Update:
storage:
loki:
write: 10Gi
read: 10Gi
backend: 10Gi
S3 Storage¶
No manual scaling needed - S3 capacity is effectively unlimited.
Monitor costs:
# Check bucket sizes
aws s3 ls s3://metrics-thanos-kup6s/ --recursive --summarize | grep "Total Size"
aws s3 ls s3://logs-loki-kup6s/ --recursive --summarize | grep "Total Size"
Reduce S3 usage:
Decrease retention policies
Increase downsampling aggressiveness
Enable Thanos compaction earlier
Scaling for Cluster Growth¶
Adding Nodes to Cluster¶
When adding nodes, DaemonSets auto-scale:
Alloy (log collection)
Node Exporter (node metrics)
No action needed - Kubernetes schedules new pods automatically.
Verify:
# Check Alloy pods match node count
kubectl get nodes --no-headers | wc -l
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy --no-headers | wc -l
# The two counts should match
Scaling for More Workloads¶
Symptoms:
Prometheus scraping more targets
Loki ingesting more logs
More Grafana dashboard queries
Actions:
Monitor metrics:
prometheus_tsdb_head_series (should be <100k per 1000 pods)
loki_ingester_chunks_created_total (log ingestion rate)
Scale based on thresholds:
Prometheus: 2x CPU/memory for every 2x target increase
Loki: 2x replicas for every 2x log volume increase
Example scaling plan:
# Cluster doubles from 100 to 200 pods
# Prometheus: Double resources
resources:
prometheus:
requests:
cpu: 200m # Was 100m
memory: 3Gi # Was 1500Mi
# Loki: Add replicas
loki:
write:
replicas: 3 # Was 2
read:
replicas: 3 # Was 2
Resource Scaling Matrix¶
Quick Reference Table¶
| Component | Current | Light Load (50 pods) | Medium Load (200 pods) | Heavy Load (500+ pods) |
|---|---|---|---|---|
| Prometheus | | | | |
| Replicas | 2 | 2 | 2-3 | 3 (consider sharding) |
| CPU | 100m | 50m | 200m | 500m |
| Memory | 1500Mi | 1Gi | 3Gi | 6Gi |
| Storage | 3Gi | 2Gi | 6Gi | 10Gi |
| Thanos Query | | | | |
| Replicas | 2 | 2 | 2-3 | 3 |
| CPU | 200m | 100m | 400m | 1000m |
| Memory | 512Mi | 256Mi | 1Gi | 2Gi |
| Loki Write | | | | |
| Replicas | 2 | 2 | 2-3 | 3-4 |
| CPU | 100m | 50m | 200m | 500m |
| Memory | 256Mi | 128Mi | 512Mi | 1Gi |
Cost Analysis¶
Resource Scaling Costs¶
CPU/Memory: Included in node costs (no marginal cost until node capacity is exhausted)
Storage (Longhorn PVCs):
Included in node storage (no marginal cost)
Limited by total node disk capacity
Monitor usage:
kubectl get nodes -o yaml | grep -A 5 capacity
S3 Storage:
Metrics: ~€0.023/GB/month
Logs: ~€0.023/GB/month
Example: Doubling retention doubles S3 cost
Scaling cost estimate:
Current: 2.7 cores, 12.5Gi memory, 152GB S3
Cost: ~€3.50/month (S3 only)
Scaled 2x: 5.4 cores, 25Gi memory, 304GB S3
Cost: ~€7.00/month (S3 only)
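The per-bucket figure can be derived from the bucket size reported earlier — a rough sketch assuming the ~€0.023/GB/month rate and the metrics bucket name used above:
# Estimate monthly S3 cost for the metrics bucket (bytes -> GB x 0.023 EUR)
aws s3 ls s3://metrics-thanos-kup6s/ --recursive --summarize \
  | awk '/Total Size/ {printf "~%.2f EUR/month\n", $3 / 1e9 * 0.023}'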
Scaling Automation¶
Vertical Pod Autoscaler (VPA)¶
Not recommended for the monitoring stack because:
Resource changes trigger pod restarts
Restarts can disrupt metric collection
Capacity planning (as described above) gives more predictable results
If using VPA:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: prometheus-vpa
namespace: monitoring
spec:
targetRef:
apiVersion: apps/v1
kind: StatefulSet
name: prometheus-kube-prometheus-stack-prometheus
updatePolicy:
updateMode: "Off" # Recommend only, don't auto-apply
Horizontal Pod Autoscaler (HPA)¶
Supported for:
Thanos Query (based on CPU/memory)
Loki Read (based on query latency)
Example HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: thanos-query-hpa
namespace: monitoring
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: thanos-query
minReplicas: 2
maxReplicas: 5
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Considerations:
Test thoroughly before production
Monitor for flapping (frequent scale up/down)
Set appropriate min/max replica counts
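Once the HPA is deployed, watching its status over time is the simplest way to spot flapping:
# Current vs. desired replicas and CPU utilization; watch for rapid oscillation
kubectl get hpa thanos-query-hpa -n monitoring -w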
Rollback Scaling Changes¶
Rollback Resource Changes¶
If scaling causes issues:
# Revert config.yaml
git revert HEAD
# Regenerate
npm run build
# Deploy
git push
argocd app sync monitoring
Pods will restart with old resource specifications.
Rollback Replica Changes¶
Same process as resource changes:
git revert HEAD
npm run build
git push
argocd app sync monitoring
Kubernetes will:
Scale down extra replicas
Terminate excess pods gracefully
Rollback Storage Expansion¶
Cannot shrink PVCs directly. To reduce storage:
Delete the StatefulSet but keep the PVC:
kubectl delete sts <name> -n monitoring --cascade=orphan
Delete the PVC:
kubectl delete pvc <name> -n monitoring
Update config.yaml with the smaller size
Regenerate and deploy
Warning: This deletes the data on that PVC - restore from backup afterwards.
Monitoring Scaling Effectiveness¶
Metrics to Track¶
Resource utilization:
# CPU usage vs requests
sum(rate(container_cpu_usage_seconds_total{namespace="monitoring"}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{namespace="monitoring", resource="cpu"}) by (pod)
# Memory usage vs requests
sum(container_memory_working_set_bytes{namespace="monitoring"}) by (pod)
/
sum(kube_pod_container_resource_requests{namespace="monitoring", resource="memory"}) by (pod)
Query performance:
# Prometheus query duration
histogram_quantile(0.95, rate(prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query"}[5m]))
# Loki query duration
histogram_quantile(0.95, rate(loki_request_duration_seconds_bucket{route=~"api_.*"}[5m]))
Storage growth:
# Prometheus TSDB size
prometheus_tsdb_storage_blocks_bytes / 1e9
# PVC usage
kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"prometheus.*"}
/ kubelet_volume_stats_capacity_bytes
See Also¶
Resource Requirements - Component baselines
Resource Optimization - Sizing methodology
Troubleshooting - Common issues