How-To Guide

Scale Monitoring Resources

Step-by-step guide for scaling CPU, memory, storage, and replicas in the monitoring stack.

Before You Begin

When to scale:

  • Pods frequently OOMKilled

  • CPU throttling >20%

  • Query latency increasing

  • Storage approaching capacity

  • Adding more cluster nodes
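
A quick way to check the first signal (frequent OOM kills) is to inspect each container's last termination reason; one possible check:

# List containers whose last termination was an OOM kill
kubectl get pods -n monitoring -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oom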

Scaling options:

  1. Vertical scaling: Increase CPU/memory/storage

  2. Horizontal scaling: Increase replica count

  3. Storage expansion: Increase PVC sizes

  4. S3 scaling: Automatic (unlimited)


Vertical Scaling (CPU & Memory)

Step 1: Identify Resource Bottleneck

Check current usage:

# Overall resource usage
kubectl top pods -n monitoring --sort-by=memory
kubectl top pods -n monitoring --sort-by=cpu

# Check for OOM kills
kubectl get events -n monitoring | grep OOM

# Check CPU throttling
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1

Identify which component needs scaling:

# Top memory consumers
kubectl top pods -n monitoring | sort -k3 -h | tail -5

# Top CPU consumers
kubectl top pods -n monitoring | sort -k2 -h | tail -5

Step 2: Update Resource Requests/Limits

Edit config.yaml:

Example: Scale Prometheus

resources:
  prometheus:
    requests:
      cpu: 200m        # Doubled from 100m
      memory: 3Gi      # Doubled from 1500Mi
    limits:
      cpu: 2000m       # Doubled from 1000m
      memory: 6Gi      # Doubled from 3Gi

Example: Scale Thanos Query

resources:
  thanosQuery:
    requests:
      cpu: 400m        # Doubled from 200m
      memory: 1Gi      # Doubled from 512Mi
    limits:
      cpu: 2000m       # Doubled from 1000m
      memory: 2Gi      # Doubled from 1Gi

Example: Scale Loki Write

resources:
  loki:
    write:
      requests:
        cpu: 200m      # Doubled from 100m
        memory: 512Mi  # Doubled from 256Mi
      limits:
        cpu: 1000m     # Doubled from 500m
        memory: 1Gi    # Doubled from 512Mi

Step 3: Generate and Deploy

cd dp-infra/monitoring
npm run build

# Review changes
git diff manifests/monitoring.k8s.yaml

# Commit and deploy
git add config.yaml manifests/
git commit -m "Scale Prometheus resources (200m CPU, 3Gi memory)"
git push

# ArgoCD sync
argocd app sync monitoring
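
If you prefer to block until the rollout is healthy, the ArgoCD CLI can wait for it (assuming you are logged in with argocd):

# Wait for the sync to complete and the app to report healthy
argocd app wait monitoring --sync --health --timeout 300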

Step 4: Verify Scaling

# Check pods restarted with new resources
kubectl get pods -n monitoring

# Verify new resource allocation
kubectl describe pod <pod-name> -n monitoring | grep -A 10 "Limits:"

# Monitor resource usage
kubectl top pods -n monitoring --sort-by=memory
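
To confirm the change removed the bottleneck, re-run the throttling check from Step 1 once the pods have settled:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1
# The scaled pods should no longer appear in the result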

Scaling Guidelines by Component

Prometheus

Baseline: 100m CPU, 1500Mi memory

Scale when:

  • Scraping >2000 targets

  • Query latency >2s

  • Memory usage >80%

Scaling factor: 2x for every 2000 additional targets

Example:

# For 4000 targets
resources:
  prometheus:
    requests:
      cpu: 200m
      memory: 3Gi
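
To estimate how many targets Prometheus is currently scraping, a quick check against the Prometheus UI:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: count(up)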

Thanos Query

Baseline: 200m CPU, 512Mi memory

Scale when:

  • Handling >100 queries/sec

  • Federating >5 Prometheus instances

  • Query latency >1s

Scaling factor: 2x for every 100 qps increase
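
To gauge the current query rate, a sketch assuming Thanos Query's own HTTP metrics are scraped (the job matcher depends on your scrape config):

kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Query: sum(rate(http_requests_total{job=~".*thanos-query.*", handler=~"query|query_range"}[5m]))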

Loki Components

Baseline: 100m CPU, 256Mi memory each

Scale when:

  • Ingesting >100MB of logs per day

  • Query latency >3s

  • Memory usage >80%

Scaling factor: 2x for every 100MB/day increase
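
To measure the current ingestion rate, a sketch assuming the standard Loki distributor metrics are scraped:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query (approximate MB of logs ingested per day):
# sum(rate(loki_distributor_bytes_received_total[5m])) * 86400 / 1e6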


Horizontal Scaling (Replicas)

Step 1: Determine Which Components Can Scale

Horizontally scalable:

  • ✅ Prometheus (independent scraping)

  • ✅ Thanos Query (stateless)

  • ✅ Thanos Store (shared S3 state)

  • ✅ Loki Write (shared S3 state)

  • ✅ Loki Read (shared S3 state)

  • ✅ Loki Backend (shared S3 state)

  • ✅ Alertmanager (gossip clustering)

  • ✅ Grafana (with shared database)

Single replica only:

  • ❌ Thanos Compactor (requires single instance)

DaemonSets (auto-scale with nodes):

  • ⚙️ Alloy (one per node)

  • ⚙️ Node Exporter (one per node)
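
Before changing anything, check the current replica counts:

kubectl get deployments,statefulsets -n monitoring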

Step 2: Update Replica Count

Edit config.yaml:

Example: Scale Prometheus to 3 replicas

replicas:
  prometheus: 3  # Increased from 2

Example: Scale Thanos Query to 3 replicas

replicas:
  thanosQuery: 3  # Increased from 2

Example: Scale Loki components

loki:
  write:
    replicas: 3   # Increased from 2
  read:
    replicas: 3   # Increased from 2
  backend:
    replicas: 3   # Increased from 2

Step 3: Deploy Scaling

cd dp-infra/monitoring
npm run build

git add config.yaml manifests/
git commit -m "Scale Thanos Query to 3 replicas for HA"
git push

argocd app sync monitoring

Step 4: Verify Scaling

# Check new replicas running
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query

# Verify load distribution (for StatefulSets)
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o wide

# Check Thanos Query sees all stores
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Visit: http://localhost:9090/stores
# Should show all Prometheus sidecars

Replica Count Guidelines

Prometheus

Minimum: 2 (for HA)
Recommended: 2-3

Scale to 3 when:

  • Need higher query throughput

  • Scraping >3000 targets

  • Want to tolerate 2 failures

Cost: 1 additional 3Gi PVC per replica

Thanos Query

Minimum: 2 (for HA)
Recommended: 2-3

Scale to 3 when:

  • Query load >100 qps

  • Need lower query latency

  • Grafana dashboard count >50

Cost: No additional storage (stateless)

Loki Write/Read/Backend

Minimum: 2 (for HA)
Recommended: 2-3

Scale to 3 when:

  • Log ingestion >200MB/day

  • Query latency increasing

  • Read replicas at >80% CPU

Cost: Additional 5Gi PVC per component per replica


Storage Scaling

PVC Expansion (Longhorn)

Step 1: Check Current Usage

# Check Prometheus PVC usage
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- df -h /prometheus

# Check all PVCs
kubectl get pvc -n monitoring

Step 2: Verify Storage Class Supports Expansion

kubectl get storageclass longhorn -o jsonpath='{.allowVolumeExpansion}'
# Should output: true
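
If it does not return true, expansion can usually be enabled on the Longhorn StorageClass with a patch like this (confirm this is acceptable in your environment first):

kubectl patch storageclass longhorn -p '{"allowVolumeExpansion": true}'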

Step 3: Expand PVC

Option A: Via config.yaml (recommended)

Edit config.yaml:

storage:
  prometheus: 6Gi    # Increased from 3Gi
  grafana: 10Gi      # Increased from 5Gi
  thanosStore: 20Gi  # Increased from 10Gi

Deploy:

npm run build
git add config.yaml manifests/
git commit -m "Expand Prometheus storage to 6Gi"
git push
argocd app sync monitoring

Option B: Direct PVC edit (emergency)

kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring

Change:

spec:
  resources:
    requests:
      storage: 6Gi  # Increased from 3Gi
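
The same change can be applied non-interactively with a patch:

kubectl patch pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring \
  -p '{"spec":{"resources":{"requests":{"storage":"6Gi"}}}}'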

Step 4: Verify Expansion

# Check PVC size updated
kubectl get pvc -n monitoring | grep prometheus

# Check filesystem expanded
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- df -h /prometheus

Note: Expansion happens automatically; the pod does NOT need a restart.
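
If the new size does not show up right away, the PVC conditions show whether a filesystem resize is still pending:

kubectl describe pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring | grep -A 5 Conditions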

Storage Scaling Examples

Prometheus Storage

Current: 3Gi (3 days retention)
Scale to: 6Gi (6 days retention OR more metrics)

When to scale:

  • PVC >80% full

  • Want longer retention

  • Scraping more targets

Update:

storage:
  prometheus: 6Gi

# Also increase retention if desired
retention:
  prometheus: 6d

Thanos Store Cache

Current: 10Gi per replica
Scale to: 20Gi per replica

When to scale:

  • Cache hit ratio <80%

  • Query latency increasing

  • Large number of blocks in S3

Update:

storage:
  thanosStore: 20Gi
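
To check the cache hit ratio mentioned above, one proxy is the Thanos Store index-cache metrics (a sketch, assuming the default metric names):

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: sum(rate(thanos_store_index_cache_hits_total[5m])) / sum(rate(thanos_store_index_cache_requests_total[5m]))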

Thanos Compactor Workspace

Current: 20Gi
Scale to: 40Gi

When to scale:

  • Compaction taking >4 hours

  • Compactor logs show “out of space”

  • Processing large blocks

Update:

storage:
  thanosCompactor: 40Gi
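
To see whether the compactor is struggling, a quick check assuming the standard Thanos Compact metrics:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: thanos_compact_halted == 1
# A value of 1 means compaction has stopped and needs attention (often disk space)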

Loki PVCs

Current: 5Gi per component
Scale to: 10Gi per component

When to scale:

  • WAL PVC >80% full (Write component)

  • Cache PVC >80% full (Read component)

  • Index PVC >80% full (Backend component)

Update:

storage:
  loki:
    write: 10Gi
    read: 10Gi
    backend: 10Gi
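
To see which Loki PVCs are approaching the 80% mark (PVC names depend on your release, so adjust the regex):

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: kubelet_volume_stats_used_bytes{namespace="monitoring", persistentvolumeclaim=~".*loki.*"}
#          / kubelet_volume_stats_capacity_bytes > 0.8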

S3 Storage

No manual scaling needed - S3 is unlimited.

Monitor costs:

# Check bucket sizes
aws s3 ls s3://metrics-thanos-kup6s/ --recursive --summarize | grep "Total Size"
aws s3 ls s3://logs-loki-kup6s/ --recursive --summarize | grep "Total Size"

Reduce S3 usage:

  1. Decrease retention policies

  2. Increase downsampling aggressiveness

  3. Enable Thanos compaction earlier


Scaling for Cluster Growth

Adding Nodes to Cluster

When adding nodes, DaemonSets auto-scale:

  • Alloy (log collection)

  • Node Exporter (node metrics)

No action needed - Kubernetes schedules new pods automatically.

Verify:

# Check Alloy pods match node count
kubectl get nodes | wc -l
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy | wc -l
# Numbers should match

Scaling for More Workloads

Symptoms:

  • Prometheus scraping more targets

  • Loki ingesting more logs

  • More Grafana dashboard queries

Actions:

  1. Monitor metrics:

    • prometheus_tsdb_head_series (should be <100k per 1000 pods)

    • loki_ingester_chunks_created_total (log ingestion rate)

  2. Scale based on thresholds:

    • Prometheus: 2x CPU/memory for every 2x target increase

    • Loki: 2x replicas for every 2x log volume increase
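
The two metrics listed above can be checked with queries like these:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: prometheus_tsdb_head_series
# Query: sum(rate(loki_ingester_chunks_created_total[5m]))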

Example scaling plan:

# Cluster doubles from 100 to 200 pods

# Prometheus: Double resources
resources:
  prometheus:
    requests:
      cpu: 200m      # Was 100m
      memory: 3Gi    # Was 1500Mi

# Loki: Add replicas
loki:
  write:
    replicas: 3      # Was 2
  read:
    replicas: 3      # Was 2

Resource Scaling Matrix

Quick Reference Table

Component    | Resource | Current | Light Load (50 pods) | Medium Load (200 pods) | Heavy Load (500+ pods)
-------------|----------|---------|----------------------|------------------------|-----------------------
Prometheus   | Replicas | 2       | 2                    | 2-3                    | 3 (consider sharding)
Prometheus   | CPU      | 100m    | 50m                  | 200m                   | 500m
Prometheus   | Memory   | 1500Mi  | 1Gi                  | 3Gi                    | 6Gi
Prometheus   | Storage  | 3Gi     | 2Gi                  | 6Gi                    | 10Gi
Thanos Query | Replicas | 2       | 2                    | 2-3                    | 3
Thanos Query | CPU      | 200m    | 100m                 | 400m                   | 1000m
Thanos Query | Memory   | 512Mi   | 256Mi                | 1Gi                    | 2Gi
Loki Write   | Replicas | 2       | 2                    | 2-3                    | 3-4
Loki Write   | CPU      | 100m    | 50m                  | 200m                   | 500m
Loki Write   | Memory   | 256Mi   | 128Mi                | 512Mi                  | 1Gi


Cost Analysis

Resource Scaling Costs

CPU/Memory: Included in node costs (no marginal cost until node capacity is exhausted)

Storage (Longhorn PVCs):

  • Included in node storage (no marginal cost)

  • Limited by total node disk capacity

  • Monitor usage: kubectl get nodes -o yaml | grep -A 5 capacity

S3 Storage:

  • Metrics: ~€0.023/GB/month

  • Logs: ~€0.023/GB/month

  • Example: Doubling retention doubles S3 cost

Scaling cost estimate:

Current: 2.7 cores, 12.5Gi memory, 152GB S3
Cost: ~€3.50/month (S3 only)

Scaled 2x: 5.4 cores, 25Gi memory, 304GB S3
Cost: ~€7.00/month (S3 only)

Scaling Automation

Vertical Pod Autoscaler (VPA)

Not recommended for monitoring stack due to:

  • Restarts required for resource changes

  • Can disrupt metric collection

  • Better to scale based on capacity planning

If using VPA:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: prometheus-vpa
  namespace: monitoring
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: prometheus-kube-prometheus-stack-prometheus
  updatePolicy:
    updateMode: "Off"  # Recommend only, don't auto-apply
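
With updateMode set to Off, the VPA only publishes recommendations; read them back with:

kubectl describe vpa prometheus-vpa -n monitoring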

Horizontal Pod Autoscaler (HPA)

Supported for:

  • Thanos Query (based on CPU/memory)

  • Loki Read (based on query latency)

Example HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: thanos-query-hpa
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: thanos-query
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
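
After applying it, watch the scaling decisions for a while before relying on it:

kubectl get hpa thanos-query-hpa -n monitoring -w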

Considerations:

  • Test thoroughly before production

  • Monitor for flapping (frequent scale up/down)

  • Set appropriate min/max replica counts


Rollback Scaling Changes

Rollback Resource Changes

If scaling causes issues:

# Revert config.yaml
git revert HEAD

# Regenerate
npm run build

# Deploy
git push
argocd app sync monitoring

Pods will restart with old resource specifications.

Rollback Replica Changes

Same process as resource changes:

git revert HEAD
npm run build
git push
argocd app sync monitoring

Kubernetes will:

  • Scale down extra replicas

  • Terminate excess pods gracefully

Rollback Storage Expansion

Cannot shrink PVCs directly. To reduce:

  1. Delete StatefulSet (keep PVC): kubectl delete sts <name> -n monitoring --cascade=orphan

  2. Delete PVC: kubectl delete pvc <name> -n monitoring

  3. Update config.yaml with smaller size

  4. Regenerate and deploy

  5. Restore data from backup (deleting the PVC discards the existing data)


Monitoring Scaling Effectiveness

Metrics to Track

Resource utilization:

# CPU usage vs requests
sum(rate(container_cpu_usage_seconds_total{namespace="monitoring"}[5m])) by (pod)
  /
sum(kube_pod_container_resource_requests{namespace="monitoring", resource="cpu"}) by (pod)

# Memory usage vs requests
sum(container_memory_working_set_bytes{namespace="monitoring"}) by (pod)
  /
sum(kube_pod_container_resource_requests{namespace="monitoring", resource="memory"}) by (pod)

Query performance:

# Prometheus query duration
histogram_quantile(0.95, sum by (le) (rate(prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query"}[5m])))

# Loki query duration
histogram_quantile(0.95, sum by (le) (rate(loki_request_duration_seconds_bucket{route=~".*query.*"}[5m])))

Storage growth:

# Prometheus TSDB size
prometheus_tsdb_storage_blocks_bytes / 1e9

# PVC usage
kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"prometheus.*"}
  / kubelet_volume_stats_capacity_bytes

See Also