How-To Guide

Scale Monitoring Resources

Step-by-step guide for scaling CPU, memory, storage, and replicas in the monitoring stack.

Before You Begin

When to scale:

  • Pods frequently OOMKilled

  • CPU throttling >20%

  • Query latency increasing

  • Storage approaching capacity

  • Adding more cluster nodes
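
A quick way to check the first signal (frequent OOM kills) is to inspect each container's last termination reason; one possible check:

# List containers whose last termination was an OOM kill
kubectl get pods -n monitoring -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oom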

Scaling options:

  1. Vertical scaling: Increase CPU/memory/storage

  2. Horizontal scaling: Increase replica count

  3. Storage expansion: Increase PVC sizes

  4. S3 scaling: Automatic (unlimited)


Vertical Scaling (CPU & Memory)

Step 1: Identify Resource Bottleneck

Check current usage:

# Overall resource usage
kubectl top pods -n monitoring --sort-by=memory
kubectl top pods -n monitoring --sort-by=cpu

# Check for OOM kills
kubectl get events -n monitoring | grep OOM

# Check CPU throttling
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1

Identify which component needs scaling:

# Top memory consumers
kubectl top pods -n monitoring | sort -k3 -h | tail -5

# Top CPU consumers
kubectl top pods -n monitoring | sort -k2 -h | tail -5

Step 2: Update Resource Requests/Limits

Edit config.yaml:

Example: Scale Prometheus

resources:
  prometheus:
    requests:
      cpu: 200m        # Doubled from 100m
      memory: 3Gi      # Doubled from 1500Mi
    limits:
      cpu: 2000m       # Doubled from 1000m
      memory: 6Gi      # Doubled from 3Gi

Example: Scale Thanos Query

resources:
  thanosQuery:
    requests:
      cpu: 400m        # Doubled from 200m
      memory: 1Gi      # Doubled from 512Mi
    limits:
      cpu: 2000m       # Doubled from 1000m
      memory: 2Gi      # Doubled from 1Gi

Example: Scale Loki Write

resources:
  loki:
    write:
      requests:
        cpu: 200m      # Doubled from 100m
        memory: 512Mi  # Doubled from 256Mi
      limits:
        cpu: 1000m     # Doubled from 500m
        memory: 1Gi    # Doubled from 512Mi

Step 3: Generate and Deploy

cd dp-infra/monitoring
npm run build

# Review changes
git diff manifests/monitoring.k8s.yaml

# Commit and deploy
git add config.yaml manifests/
git commit -m "Scale Prometheus resources (200m CPU, 3Gi memory)"
git push

# ArgoCD sync
argocd app sync monitoring
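
If you prefer to block until the rollout is healthy, the ArgoCD CLI can wait for it (assuming you are logged in with argocd):

# Wait for the sync to complete and the app to report healthy
argocd app wait monitoring --sync --health --timeout 300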

Step 4: Verify Scaling

# Check pods restarted with new resources
kubectl get pods -n monitoring

# Verify new resource allocation
kubectl describe pod <pod-name> -n monitoring | grep -A 10 "Limits:"

# Monitor resource usage
kubectl top pods -n monitoring --sort-by=memory
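
To confirm the change removed the bottleneck, re-run the throttling check from Step 1 once the pods have settled:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1
# The scaled pods should no longer appear in the result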

Scaling Guidelines by Component

Prometheus

Baseline: 100m CPU, 1500Mi memory

Scale when:

  • Scraping >2000 targets

  • Query latency >2s

  • Memory usage >80%

Scaling factor: 2x for every 2000 additional targets

Example:

# For 4000 targets
resources:
  prometheus:
    requests:
      cpu: 200m
      memory: 3Gi
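
To estimate how many targets Prometheus is currently scraping, a quick check against the Prometheus UI:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: count(up)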

Thanos Query

Baseline: 200m CPU, 512Mi memory

Scale when:

  • Handling >100 queries/sec

  • Federating >5 Prometheus instances

  • Query latency >1s

Scaling factor: 2x for every 100 qps increase
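
To gauge the current query rate, a sketch assuming Thanos Query's own HTTP metrics are scraped (the job matcher depends on your scrape config):

kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Query: sum(rate(http_requests_total{job=~".*thanos-query.*", handler=~"query|query_range"}[5m]))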

Loki Components

Baseline: 100m CPU, 256Mi memory each

Scale when:

  • Ingesting >100MB of logs per day

  • Query latency >3s

  • Memory usage >80%

Scaling factor: 2x for every 100MB/day increase
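
To measure the current ingestion rate, a sketch assuming the standard Loki distributor metrics are scraped:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query (approximate MB of logs ingested per day):
# sum(rate(loki_distributor_bytes_received_total[5m])) * 86400 / 1e6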


Horizontal Scaling (Replicas)

Step 1: Determine Which Components Can Scale

Horizontally scalable:

  • ✅ Prometheus (independent scraping)

  • ✅ Thanos Query (stateless)

  • ✅ Thanos Store (shared S3 state)

  • ✅ Loki Write (shared S3 state)

  • ✅ Loki Read (shared S3 state)

  • ✅ Loki Backend (shared S3 state)

  • ✅ Alertmanager (gossip clustering)

  • ✅ Grafana (with shared database)

Single replica only:

  • ❌ Thanos Compactor (requires single instance)

DaemonSets (auto-scale with nodes):

  • ⚙️ Alloy (one per node)

  • ⚙️ Node Exporter (one per node)
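
Before changing anything, check the current replica counts:

kubectl get deployments,statefulsets -n monitoring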

Step 2: Update Replica Count

Edit config.yaml:

Example: Scale Prometheus to 3 replicas

replicas:
  prometheus: 3  # Increased from 2

Example: Scale Thanos Query to 3 replicas

replicas:
  thanosQuery: 3  # Increased from 2

Example: Scale Loki components

loki:
  write:
    replicas: 3   # Increased from 2
  read:
    replicas: 3   # Increased from 2
  backend:
    replicas: 3   # Increased from 2

Step 3: Deploy Scaling

cd dp-infra/monitoring
npm run build

git add config.yaml manifests/
git commit -m "Scale Thanos Query to 3 replicas for HA"
git push

argocd app sync monitoring

Step 4: Verify Scaling

# Check new replicas running
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query

# Verify load distribution (for StatefulSets)
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o wide

# Check Thanos Query sees all stores
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Visit: http://localhost:9090/stores
# Should show all Prometheus sidecars

Replica Count Guidelines

Prometheus

Minimum: 2 (for HA)
Recommended: 2-3

Scale to 3 when:

  • Need higher query throughput

  • Scraping >3000 targets

  • Want to tolerate 2 failures

Cost: 1 additional 3Gi PVC per replica

Thanos Query

Minimum: 2 (for HA)
Recommended: 2-3

Scale to 3 when:

  • Query load >100 qps

  • Need lower query latency

  • Grafana dashboard count >50

Cost: No additional storage (stateless)

Loki Write/Read/Backend

Minimum: 2 (for HA)
Recommended: 2-3

Scale to 3 when:

  • Log ingestion >200MB/day

  • Query latency increasing

  • Read replicas at >80% CPU

Cost: Additional 5Gi PVC per component per replica


Storage Scaling

PVC Expansion (Longhorn)

Step 1: Check Current Usage

# Check Prometheus PVC usage
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- df -h /prometheus

# Check all PVCs
kubectl get pvc -n monitoring

Step 2: Verify Storage Class Supports Expansion

kubectl get storageclass longhorn -o jsonpath='{.allowVolumeExpansion}'
# Should output: true
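
If it does not return true, expansion can usually be enabled on the Longhorn StorageClass with a patch like this (confirm this is acceptable in your environment first):

kubectl patch storageclass longhorn -p '{"allowVolumeExpansion": true}'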

Step 3: Expand PVC

Option A: Via config.yaml (recommended)

Edit config.yaml:

storage:
  prometheus: 6Gi    # Increased from 3Gi
  grafana: 10Gi      # Increased from 5Gi
  thanosStore: 20Gi  # Increased from 10Gi

Deploy:

npm run build
git add config.yaml manifests/
git commit -m "Expand Prometheus storage to 6Gi"
git push
argocd app sync monitoring

Option B: Direct PVC edit (emergency)

kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring

Change:

spec:
  resources:
    requests:
      storage: 6Gi  # Increased from 3Gi
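
The same change can be applied non-interactively with a patch:

kubectl patch pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring \
  -p '{"spec":{"resources":{"requests":{"storage":"6Gi"}}}}'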

Step 4: Verify Expansion

# Check PVC size updated
kubectl get pvc -n monitoring | grep prometheus

# Check filesystem expanded
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- df -h /prometheus

Note: Expansion happens automatically; the pod does NOT need a restart.
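
If the new size does not show up right away, the PVC conditions show whether a filesystem resize is still pending:

kubectl describe pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring | grep -A 5 Conditions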

Storage Scaling Examples

Prometheus Storage

Current: 3Gi (3 days retention)
Scale to: 6Gi (6 days retention OR more metrics)

When to scale:

  • PVC >80% full

  • Want longer retention

  • Scraping more targets

Update:

storage:
  prometheus: 6Gi

# Also increase retention if desired
retention:
  prometheus: 6d

Thanos Store Cache

Current: 10Gi per replica
Scale to: 20Gi per replica

When to scale:

  • Cache hit ratio <80%

  • Query latency increasing

  • Large number of blocks in S3

Update:

storage:
  thanosStore: 20Gi
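
To check the cache hit ratio mentioned above, one proxy is the Thanos Store index-cache metrics (a sketch, assuming the default metric names):

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: sum(rate(thanos_store_index_cache_hits_total[5m])) / sum(rate(thanos_store_index_cache_requests_total[5m]))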

Thanos Compactor Workspace

Current: 20Gi
Scale to: 40Gi

When to scale:

  • Compaction taking >4 hours

  • Compactor logs show “out of space”

  • Processing large blocks

Update:

storage:
  thanosCompactor: 40Gi
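
To see whether the compactor is struggling, a quick check assuming the standard Thanos Compact metrics:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: thanos_compact_halted == 1
# A value of 1 means compaction has stopped and needs attention (often disk space)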

Loki PVCs

Current: 5Gi per component
Scale to: 10Gi per component

When to scale:

  • WAL PVC >80% full (Write component)

  • Cache PVC >80% full (Read component)

  • Index PVC >80% full (Backend component)

Update:

storage:
  loki:
    write: 10Gi
    read: 10Gi
    backend: 10Gi
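
To see which Loki PVCs are approaching the 80% mark (PVC names depend on your release, so adjust the regex):

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: kubelet_volume_stats_used_bytes{namespace="monitoring", persistentvolumeclaim=~".*loki.*"}
#          / kubelet_volume_stats_capacity_bytes > 0.8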

S3 Storage

No manual scaling needed - S3 is unlimited.

Monitor costs:

# Check bucket sizes
aws s3 ls s3://metrics-thanos-kup6s/ --recursive --summarize | grep "Total Size"
aws s3 ls s3://logs-loki-kup6s/ --recursive --summarize | grep "Total Size"

Reduce S3 usage:

  1. Decrease retention policies

  2. Increase downsampling aggressiveness

  3. Enable Thanos compaction earlier


Scaling for Cluster Growth

Adding Nodes to Cluster

When adding nodes, DaemonSets auto-scale:

  • Alloy (log collection)

  • Node Exporter (node metrics)

No action needed - Kubernetes schedules new pods automatically.

Verify:

# Check Alloy pods match node count
kubectl get nodes | wc -l
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy | wc -l
# Numbers should match

Scaling for More Workloads

Symptoms:

  • Prometheus scraping more targets

  • Loki ingesting more logs

  • More Grafana dashboard queries

Actions:

  1. Monitor metrics:

    • prometheus_tsdb_head_series (should be <100k per 1000 pods)

    • loki_ingester_chunks_created_total (log ingestion rate)

  2. Scale based on thresholds:

    • Prometheus: 2x CPU/memory for every 2x target increase

    • Loki: 2x replicas for every 2x log volume increase
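
The two metrics listed above can be checked with queries like these:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: prometheus_tsdb_head_series
# Query: sum(rate(loki_ingester_chunks_created_total[5m]))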

Example scaling plan:

# Cluster doubles from 100 to 200 pods

# Prometheus: Double resources
resources:
  prometheus:
    requests:
      cpu: 200m      # Was 100m
      memory: 3Gi    # Was 1500Mi

# Loki: Add replicas
loki:
  write:
    replicas: 3      # Was 2
  read:
    replicas: 3      # Was 2

Resource Scaling Matrix

Quick Reference Table

Component    | Resource | Current | Light Load (50 pods) | Medium Load (200 pods) | Heavy Load (500+ pods)
-------------|----------|---------|----------------------|------------------------|-----------------------
Prometheus   | Replicas | 2       | 2                    | 2-3                    | 3 (consider sharding)
Prometheus   | CPU      | 100m    | 50m                  | 200m                   | 500m
Prometheus   | Memory   | 1500Mi  | 1Gi                  | 3Gi                    | 6Gi
Prometheus   | Storage  | 3Gi     | 2Gi                  | 6Gi                    | 10Gi
Thanos Query | Replicas | 2       | 2                    | 2-3                    | 3
Thanos Query | CPU      | 200m    | 100m                 | 400m                   | 1000m
Thanos Query | Memory   | 512Mi   | 256Mi                | 1Gi                    | 2Gi
Loki Write   | Replicas | 2       | 2                    | 2-3                    | 3-4
Loki Write   | CPU      | 100m    | 50m                  | 200m                   | 500m
Loki Write   | Memory   | 256Mi   | 128Mi                | 512Mi                  | 1Gi


Cost Analysis

Resource Scaling Costs

CPU/Memory: Included in node costs (no marginal cost until node capacity is exhausted)

Storage (Longhorn PVCs):

  • Included in node storage (no marginal cost)

  • Limited by total node disk capacity

  • Monitor usage: kubectl get nodes -o yaml | grep -A 5 capacity

S3 Storage:

  • Metrics: ~€0.023/GB/month

  • Logs: ~€0.023/GB/month

  • Example: Doubling retention doubles S3 cost

Scaling cost estimate:

Current: 2.7 cores, 12.5Gi memory, 152GB S3
Cost: ~€3.50/month (S3 only)

Scaled 2x: 5.4 cores, 25Gi memory, 304GB S3
Cost: ~€7.00/month (S3 only)

Scaling Automation

Vertical Pod Autoscaler (VPA)

Not recommended for monitoring stack due to:

  • Restarts required for resource changes

  • Can disrupt metric collection

  • Better to scale based on capacity planning

If using VPA:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: prometheus-vpa
  namespace: monitoring
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: prometheus-kube-prometheus-stack-prometheus
  updatePolicy:
    updateMode: "Off"  # Recommend only, don't auto-apply
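
With updateMode set to Off, the VPA only publishes recommendations; read them back with:

kubectl describe vpa prometheus-vpa -n monitoring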

Horizontal Pod Autoscaler (HPA)

Supported for:

  • Thanos Query (based on CPU/memory)

  • Loki Read (based on query latency)

Example HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: thanos-query-hpa
  namespace: monitoring
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: thanos-query
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
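
After applying it, watch the scaling decisions for a while before relying on it:

kubectl get hpa thanos-query-hpa -n monitoring -w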

Considerations:

  • Test thoroughly before production

  • Monitor for flapping (frequent scale up/down)

  • Set appropriate min/max replica counts


Rollback Scaling Changes

Rollback Resource Changes

If scaling causes issues:

# Revert config.yaml
git revert HEAD

# Regenerate
npm run build

# Deploy
git push
argocd app sync monitoring

Pods will restart with old resource specifications.

Rollback Replica Changes

Same process as resource changes:

git revert HEAD
npm run build
git push
argocd app sync monitoring

Kubernetes will:

  • Scale down extra replicas

  • Terminate excess pods gracefully

Rollback Storage Expansion

Cannot shrink PVCs directly. To reduce:

  1. Delete StatefulSet (keep PVC): kubectl delete sts <name> -n monitoring --cascade=orphan

  2. Delete PVC: kubectl delete pvc <name> -n monitoring

  3. Update config.yaml with smaller size

  4. Regenerate and deploy

  5. Restore data from backup (deleting the PVC discards the existing data)


Monitoring Scaling Effectiveness

Metrics to Track

Resource utilization:

# CPU usage vs requests
sum(rate(container_cpu_usage_seconds_total{namespace="monitoring"}[5m])) by (pod)
  /
sum(kube_pod_container_resource_requests{namespace="monitoring", resource="cpu"}) by (pod)

# Memory usage vs requests
sum(container_memory_working_set_bytes{namespace="monitoring"}) by (pod)
  /
sum(kube_pod_container_resource_requests{namespace="monitoring", resource="memory"}) by (pod)

Query performance:

# Prometheus query duration
histogram_quantile(0.95, sum by (le) (rate(prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query"}[5m])))

# Loki query duration
histogram_quantile(0.95, sum by (le) (rate(loki_request_duration_seconds_bucket{route=~".*query.*"}[5m])))

Storage growth:

# Prometheus TSDB size
prometheus_tsdb_storage_blocks_bytes / 1e9

# PVC usage
kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"prometheus.*"}
  / kubelet_volume_stats_capacity_bytes

See Also