Explanation

Resource Optimization and Sizing Methodology

Introduction

This document explains how we determined the resource requests and limits for the monitoring stack, the optimization work performed in October 2025, and the methodology for right-sizing components.

Optimization History (October 2025)

Initial State (Before Optimization)

Problems Identified:

  1. ArgoCD: No resource requests (BestEffort QoS class)

    • Risk: First to be evicted during memory pressure

    • Impact: GitOps deployments could fail unexpectedly

  2. Worker Nodes: Overcommitted at 126-148% CPU limits

    • Risk: CPU throttling affecting all workloads

    • Impact: Performance degradation cluster-wide

  3. Loki: Massively overprovisioned (80%+ waste)

    • Each component: 500m CPU / 1Gi memory requests

    • Actual usage: 15-40m CPU / 80-170Mi memory

    • Waste: 2.4 CPU cores, 4.5Gi memory unused

Optimization Actions

1. ArgoCD Resource Guarantees (Nov 1)

# Before: No requests/limits (BestEffort)
# After: Burstable QoS with appropriate limits
application-controller:
  resources:
    requests: { cpu: 250m, memory: 768Mi }
    limits: { cpu: 1000m, memory: 2Gi }

server:
  resources:
    requests: { cpu: 50m, memory: 128Mi }
    limits: { cpu: 500m, memory: 512Mi }

repo-server:
  resources:
    requests: { cpu: 50m, memory: 128Mi }
    limits: { cpu: 500m, memory: 512Mi }

redis:
  resources:
    requests: { cpu: 50m, memory: 64Mi }
    limits: { cpu: 200m, memory: 256Mi }
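
A quick way to confirm the change landed is to read the QoS class Kubernetes assigned to each pod. A minimal check, assuming ArgoCD runs in the argocd namespace:

# Expect Burstable for every ArgoCD pod once requests/limits are applied;
# BestEffort means a container still has no requests set
kubectl get pods -n argocd \
  -o custom-columns='POD:.metadata.name,QOS:.status.qosClass'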

2. Loki Right-Sizing (Oct 29)

# Before: Overprovisioned
write:
  resources:
    requests: { cpu: 500m, memory: 1Gi }

# After: Right-sized based on actual usage
write:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }

# Applied to: write, read, backend (all 3 components)
# Savings: 2.4 CPU cores, 4.5Gi memory

3. Prometheus Optimization (Oct 29 - Thanos Integration)

# Before: 7-day retention, 6Gi PVC
prometheus:
  retention: 7d
  storage:
    size: 6Gi
  resources:
    requests: { memory: 2500Mi }

# After: 3-day retention + Thanos, 3Gi PVC
prometheus:
  retention: 3d
  storage:
    size: 3Gi
  thanos:
    sidecar: enabled  # Offload to S3
  resources:
    requests: { memory: 1500Mi }  # 40% reduction

# Savings: 3Gi PVC × 2 replicas = 6Gi storage
# Savings: 1000Mi memory × 2 replicas = 2Gi memory
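
To confirm the smaller PVC holds, the on-disk TSDB size can be watched after the retention change. A sketch, assuming the Prometheus API is reachable via a port-forward (the prometheus-operated service name comes from the operator and may differ in other installs):

# Forward the Prometheus API locally
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &

# On-disk block size in MiB per replica; should stay well under the 3Gi PVC
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=prometheus_tsdb_storage_blocks_bytes / 1024 / 1024'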

Results

Node Resource Allocation (After Optimization):

Primary Worker Node (cax31-fsn1):
  CPU requests:  8.3 / 16 cores (52%)  ← was 85%
  CPU limits:    13.6 / 16 cores (85%) ← was 148%
  Memory requests: 12.8 / 28Gi (46%)   ← was 55%
  Memory limits:   22.4 / 28Gi (80%)   ← was 90%
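
These percentages mirror what the kubelet reports per node and can be reproduced directly with kubectl (substitute the actual node name):

# Requests and limits committed on a node, as a share of allocatable
kubectl describe node <node-name> | grep -A 8 'Allocated resources'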

Cluster-Wide Impact:

  • ✅ Overcommitment resolved (148% → 85%)

  • ✅ ArgoCD protected with resource guarantees

  • ✅ Loki optimized (80% resource reduction)

  • ✅ Prometheus storage reduced (6Gi → 3Gi)

  • ✅ Zero service disruptions during changes

Sizing Methodology

Step 1: Measure Actual Usage

Collect baseline metrics:

# CPU usage over 24h
kubectl top pods -n monitoring --sort-by=cpu

# Memory usage over 24h
kubectl top pods -n monitoring --sort-by=memory

# Query Prometheus for historical trends
rate(container_cpu_usage_seconds_total{namespace="monitoring"}[1h])
container_memory_working_set_bytes{namespace="monitoring"}
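
kubectl top only shows an instantaneous sample, while the sizing below is based on peaks, so it is worth pulling a 24-hour maximum from Prometheus instead. A sketch, assuming the same localhost:9090 port-forward as above and jq available:

# Peak CPU (cores) per pod over the last 24h
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=max_over_time(sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="monitoring"}[5m]))[24h:5m])' |
  jq -r '.data.result[] | "\(.metric.pod)  \(.value[1])"'

# Peak working-set memory (bytes) per pod over the last 24h
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=max_over_time(sum by (pod) (container_memory_working_set_bytes{namespace="monitoring"})[24h:5m])' |
  jq -r '.data.result[] | "\(.metric.pod)  \(.value[1])"'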

Example: Loki Write Component

Observed peak usage:
  CPU: 40m (0.04 cores)
  Memory: 170Mi

Requested (before optimization):
  CPU: 500m (12.5x over-provisioned!)
  Memory: 1Gi (6x over-provisioned!)

Step 2: Calculate Requests

Formula:

CPU request = peak_usage × 1.5 (50% headroom)
Memory request = peak_usage × 2.0 (100% headroom)

Why these multipliers?

  • CPU: Compressible resource, can throttle without crashes

  • Memory: Non-compressible, OOM kills if exceeded

Example: Loki Write

CPU request = 40m × 1.5 = 60m → round up to 100m (rounder number)
Memory request = 170Mi × 2.0 = 340Mi → round down to 256Mi (nearest power of two; still ~1.5× peak)

Step 3: Set Limits

Formula:

CPU limit = request × 4-5 (allow bursting)
Memory limit = request × 2 (prevent runaway)

Example: Loki Write

CPU limit = 100m × 5 = 500m
Memory limit = 256Mi × 2 = 512Mi

Final configuration:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }
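
For a one-off calculation, Steps 2 and 3 are easy to script. An illustrative sketch in shell (integer millicores and Mi; the final rounding remains a judgment call, as in the Loki example):

# Derive requests and limits from an observed peak
PEAK_CPU=40    # millicores
PEAK_MEM=170   # Mi

CPU_REQUEST=$(( PEAK_CPU * 3 / 2 ))  # peak x 1.5
MEM_REQUEST=$(( PEAK_MEM * 2 ))      # peak x 2.0
CPU_LIMIT=$(( CPU_REQUEST * 5 ))     # request x 5 (burst headroom)
MEM_LIMIT=$(( MEM_REQUEST * 2 ))     # request x 2 (cap runaway growth)

echo "requests: { cpu: ${CPU_REQUEST}m, memory: ${MEM_REQUEST}Mi }"
echo "limits:   { cpu: ${CPU_LIMIT}m, memory: ${MEM_LIMIT}Mi }"

# Prints 60m/340Mi requests and 300m/680Mi limits; the deployed config
# rounds the requests first (100m/256Mi) and derives the limits from
# those rounded values (500m/512Mi)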

Step 4: Validate

Deploy and monitor:

# Watch for OOM kills
kubectl get events -n monitoring | grep OOM

# Check for CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1

# Verify pods not being evicted
kubectl get pods -n monitoring -o wide
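
Event grepping can miss kills that have already aged out of the event log; the container status keeps the last termination reason, which is a more reliable signal. A small helper using only kubectl:

# Print any container whose last termination was an OOM kill
kubectl get pods -n monitoring \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep OOMKilled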

Iterate if needed:

  • If OOM kills: Increase memory request/limit

  • If high throttling (>20%): Increase CPU limit

  • If consistently under-utilized: Reduce requests

Current Resource Allocation

Metrics Stack

Prometheus (2 replicas):

resources:
  requests: { cpu: 100m, memory: 1500Mi }
  limits: { cpu: 1000m, memory: 3000Mi }

# Actual usage: 80-150m CPU, 1200-1800Mi memory
# Headroom: 50% CPU, 66% memory

Thanos Query (2 replicas):

resources:
  requests: { cpu: 200m, memory: 512Mi }
  limits: { cpu: 1000m, memory: 1Gi }

# Actual usage: 50-150m CPU, 300-400Mi memory
# Headroom: 33% CPU, 28% memory

Thanos Store (2 replicas):

resources:
  requests: { cpu: 200m, memory: 1Gi }
  limits: { cpu: 1000m, memory: 2Gi }

# Actual usage: 80-150m CPU, 600-800Mi memory
# Headroom: 33% CPU, 25% memory

Thanos Compactor (1 replica):

resources:
  requests: { cpu: 500m, memory: 2Gi }
  limits: { cpu: 2000m, memory: 4Gi }

# Actual usage: 200-400m CPU, 1-1.5Gi memory
# Headroom: 25% CPU, 33% memory

Logs Stack

Loki Write (2 replicas):

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }

# Actual usage: 20-40m CPU, 100-170Mi memory
# Headroom: 150% CPU, 50% memory

Loki Read (2 replicas):

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }

# Actual usage: 15-35m CPU, 80-150Mi memory
# Headroom: 185% CPU, 70% memory

Loki Backend (2 replicas):

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }

# Actual usage: 15-30m CPU, 90-160Mi memory
# Headroom: 233% CPU, 60% memory

Visualization & Alerting

Grafana (1 replica):

resources:
  requests: { cpu: 50m, memory: 512Mi }
  limits: { cpu: 500m, memory: 1024Mi }

# Actual usage: 10-30m CPU, 350-450Mi memory
# Headroom: 66% CPU, 13% memory

Alertmanager (2 replicas):

resources:
  requests: { cpu: 25m, memory: 100Mi }
  limits: { cpu: 100m, memory: 256Mi }

# Actual usage: 5-15m CPU, 60-80Mi memory
# Headroom: 66% CPU, 25% memory

Log Collection

Alloy (4 pods, DaemonSet):

resources:
  requests: { cpu: 50m, memory: 128Mi }
  limits: { cpu: 200m, memory: 256Mi }

# Actual usage: 15-30m CPU, 70-100Mi memory
# Headroom: 66% CPU, 28% memory
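
The figures above drift as charts are upgraded, so it helps to compare them against the live cluster from time to time. A one-liner that dumps the currently applied requests and limits:

# Requested and limited CPU/memory per pod in the monitoring namespace
kubectl get pods -n monitoring -o custom-columns='POD:.metadata.name,'\
'CPU_REQ:.spec.containers[*].resources.requests.cpu,'\
'MEM_REQ:.spec.containers[*].resources.requests.memory,'\
'CPU_LIM:.spec.containers[*].resources.limits.cpu,'\
'MEM_LIM:.spec.containers[*].resources.limits.memory'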

Resource Allocation Strategy

QoS Classes

Guaranteed (requests = limits):

  • Use only for: Databases, critical stateful workloads

  • Our stack: None (intentionally avoided)

  • Why not: Wastes cluster capacity

Burstable (requests < limits):

  • Use for: Most workloads (infrastructure, applications)

  • Our stack: All monitoring components (a quick check of assigned classes is sketched after this list)

  • Why: Balance protection with efficiency

BestEffort (no requests/limits):

  • Use for: Non-critical batch jobs, canaries

  • Our stack: None (all have guarantees)

  • Why not: First to be evicted
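
The class Kubernetes actually assigned is recorded in pod status, so the split described above is easy to audit. A quick tally for the monitoring namespace:

# Count pods per QoS class; everything here should land in Burstable
kubectl get pods -n monitoring -o custom-columns='QOS:.status.qosClass' \
  --no-headers | sort | uniq -c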

Request vs Limit Philosophy

Requests (guaranteed minimum):

  • Set based on typical usage + safety margin

  • Determines pod placement (scheduler considers requests)

  • Used for QoS classification

Limits (maximum allowed):

  • CPU: Allow bursting (3-5x requests)

  • Memory: Prevent runaway (2x requests)

  • Trigger throttling (CPU) or OOM kill (memory) if exceeded

Example reasoning:

prometheus:
  requests: { cpu: 100m, memory: 1500Mi }  # Typical usage
  limits: { cpu: 1000m, memory: 3000Mi }    # Burst capacity

# Rationale:
# - Runs fine at 100m CPU most of the time
# - Can burst to 1000m during query spikes
# - Uses ~1500Mi memory typically
# - Can grow to 3000Mi during compaction
# - Won't starve other pods (has guaranteed 100m/1500Mi)

Scaling Guidelines

When to Scale Vertically (Increase Resources)

Symptoms:

  • Frequent OOM kills

  • CPU throttling >20%

  • High latency during normal load

  • Pods consistently at resource limits

Actions:

# Increase requests and limits proportionally
# Before
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }

# After (2x scale)
resources:
  requests: { cpu: 200m, memory: 512Mi }
  limits: { cpu: 1000m, memory: 1024Mi }

When to Scale Horizontally (Increase Replicas)

Symptoms:

  • High load despite adequate resources

  • Query latency issues

  • Single replica at capacity

  • Need better HA

Actions:

# Increase replica count
# Before
replicas: 2

# After
replicas: 3

Components that scale horizontally:

  • ✅ Prometheus (independent scraping)

  • ✅ Thanos Query (stateless)

  • ✅ Thanos Store (shared S3 state)

  • ✅ Loki Write/Read (shared S3 state)

  • ⚠️ Thanos Compactor (single replica only)

Monitoring Resource Efficiency

Key Metrics

Resource utilization:

# CPU usage vs requests
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
  /
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)

# Memory usage vs requests
sum(container_memory_working_set_bytes) by (pod)
  /
sum(kube_pod_container_resource_requests{resource="memory"}) by (pod)

Throttling:

# High throttling indicates CPU limit too low
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.2

OOM kills:

# Check for recent OOM kills
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
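
The throttling expression is the one most worth automating, since sustained throttling is the usual symptom of a CPU limit set too low. Running it ad hoc, assuming the same localhost:9090 port-forward and jq as earlier:

# Containers throttled for more than 20% of the last five minutes
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.2' \
  | jq -r '.data.result[] | "\(.metric.pod)  \(.value[1])"'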

Target Utilization

Healthy ranges:

  • CPU requests: 30-70% utilized (allows bursting)

  • CPU limits: <80% utilized (avoid throttling)

  • Memory requests: 50-80% utilized (safety buffer)

  • Memory limits: <90% utilized (avoid OOM)

Red flags:

  • CPU requests >90% utilized: Too tight, increase requests

  • Memory requests >95% utilized: OOM risk, increase requests

  • CPU limits constantly hit: Throttling, increase limits

  • Memory limits reached: OOM kill imminent

Cost Optimization

Cluster-Level View

Total monitoring stack resources:

CPU requests:  ~2.7 cores (out of 40 total cluster cores = 6.75%)
Memory requests: ~12.5Gi (out of 140Gi total cluster memory = 9%)
Storage (PVCs): ~60Gi Longhorn (out of 400Gi available = 15%)
Storage (S3): ~10Gi (cost: ~$0.25/month)

Cost breakdown:

  • Compute: Included in node costs (no marginal cost)

  • Longhorn PVCs: Included in node storage (no marginal cost)

  • S3 storage: $0.023/GB/month × 10GB = $0.23/month

  • Total incremental cost: ~$0.25/month

Optimization Opportunities

Already Optimized:

  • ✅ Loki right-sized (80% reduction achieved)

  • ✅ Prometheus retention reduced (3d + Thanos)

  • ✅ ArgoCD resource guarantees added

  • ✅ Thanos downsampling enabled

Future Optimizations:

  • [ ] Add PVC resizing (reduce Thanos Store PVCs if cache not needed)

  • [ ] Implement recording rules (reduce query load)

  • [ ] Add query caching (reduce Thanos Store load)

  • [ ] Consider Alloy log sampling (reduce Loki write load)

Best Practices

DO

  • ✅ Measure first: Collect actual usage before sizing

  • ✅ Add headroom: Keep a 50-100% buffer for spikes

  • ✅ Use Burstable QoS: Balance efficiency with guarantees

  • ✅ Monitor continuously: Track utilization over time

  • ✅ Document decisions: Record why each value was chosen

  • ✅ Test in staging: Validate resource changes before production

DON’T

  • ❌ Guess: Don’t set resources without data

  • ❌ Copy defaults: Helm chart defaults are often 5-10x oversized

  • ❌ Set requests = limits: Guaranteed QoS wastes cluster capacity

  • ❌ Ignore metrics: Always monitor utilization after changes

  • ❌ Over-optimize: Some buffer is a healthy safety margin

  • ❌ Batch changes: Change one component at a time

References