Explanation

Resource Optimization and Sizing Methodology

Introduction

This document explains how we determined the resource requests and limits for the monitoring stack, the optimization work performed in October 2025, and the methodology for right-sizing components.

Optimization History (October 2025)

Initial State (Before Optimization)

Problems Identified:

  1. ArgoCD: No resource requests (BestEffort QoS class)

    • Risk: First to be evicted during memory pressure

    • Impact: GitOps deployments could fail unexpectedly

  2. Worker Nodes: Overcommitted at 126-148% CPU limits

    • Risk: CPU throttling affecting all workloads

    • Impact: Performance degradation cluster-wide

  3. Loki: Massively overprovisioned (80%+ waste)

    • Each component: 500m CPU / 1Gi memory requests

    • Actual usage: 15-40m CPU / 80-170Mi memory

    • Waste: 2.4 CPU cores, 4.5Gi memory unused

Optimization Actions

1. ArgoCD Resource Guarantees (Nov 1)

# Before: No requests/limits (BestEffort)
# After: Burstable QoS with appropriate limits
application-controller:
  resources:
    requests: { cpu: 250m, memory: 768Mi }
    limits: { cpu: 1000m, memory: 2Gi }

server:
  resources:
    requests: { cpu: 50m, memory: 128Mi }
    limits: { cpu: 500m, memory: 512Mi }

repo-server:
  resources:
    requests: { cpu: 50m, memory: 128Mi }
    limits: { cpu: 500m, memory: 512Mi }

redis:
  resources:
    requests: { cpu: 50m, memory: 64Mi }
    limits: { cpu: 200m, memory: 256Mi }
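
A quick way to confirm the change landed is to read the QoS class Kubernetes assigned to each pod. A minimal check, assuming ArgoCD runs in the argocd namespace:

# Expect Burstable for every ArgoCD pod once requests/limits are applied;
# BestEffort means a container still has no requests set
kubectl get pods -n argocd \
  -o custom-columns='POD:.metadata.name,QOS:.status.qosClass'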

2. Loki Right-Sizing (Oct 29)

# Before: Overprovisioned
write:
  resources:
    requests: { cpu: 500m, memory: 1Gi }

# After: Right-sized based on actual usage
write:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }

# Applied to: write, read, backend (all 3 components)
# Savings: 2.4 CPU cores, 4.5Gi memory

3. Prometheus Optimization (Oct 29 - Thanos Integration)

# Before: 7-day retention, 6Gi PVC
prometheus:
  retention: 7d
  storage:
    size: 6Gi
  resources:
    requests: { memory: 2500Mi }

# After: 3-day retention + Thanos, 3Gi PVC
prometheus:
  retention: 3d
  storage:
    size: 3Gi
  thanos:
    sidecar: enabled  # Offload to S3
  resources:
    requests: { memory: 1500Mi }  # 40% reduction

# Savings: 3Gi PVC × 2 replicas = 6Gi storage
# Savings: 1000Mi memory × 2 replicas = 2Gi memory
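
To confirm the smaller PVC holds, the on-disk TSDB size can be watched after the retention change. A sketch, assuming the Prometheus API is reachable via a port-forward (the prometheus-operated service name comes from the operator and may differ in other installs):

# Forward the Prometheus API locally
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &

# On-disk block size in MiB per replica; should stay well under the 3Gi PVC
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=prometheus_tsdb_storage_blocks_bytes / 1024 / 1024'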

Results

Node Resource Allocation (After Optimization):

Primary Worker Node (cax31-fsn1):
  CPU requests:  8.3 / 16 cores (52%)  ← was 85%
  CPU limits:    13.6 / 16 cores (85%) ← was 148%
  Memory requests: 12.8 / 28Gi (46%)   ← was 55%
  Memory limits:   22.4 / 28Gi (80%)   ← was 90%
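
These percentages mirror what the kubelet reports per node and can be reproduced directly with kubectl (substitute the actual node name):

# Requests and limits committed on a node, as a share of allocatable
kubectl describe node <node-name> | grep -A 8 'Allocated resources'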

Cluster-Wide Impact:

  • ✅ Overcommitment resolved (148% → 85%)

  • ✅ ArgoCD protected with resource guarantees

  • ✅ Loki optimized (80% resource reduction)

  • ✅ Prometheus storage reduced (6Gi → 3Gi)

  • ✅ Zero service disruptions during changes

Sizing Methodology

Step 1: Measure Actual Usage

Collect baseline metrics:

# CPU usage over 24h
kubectl top pods -n monitoring --sort-by=cpu

# Memory usage over 24h
kubectl top pods -n monitoring --sort-by=memory

# Query Prometheus for historical trends
rate(container_cpu_usage_seconds_total{namespace="monitoring"}[1h])
container_memory_working_set_bytes{namespace="monitoring"}
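
kubectl top only shows an instantaneous sample, while the sizing below is based on peaks, so it is worth pulling a 24-hour maximum from Prometheus instead. A sketch, assuming the same localhost:9090 port-forward as above and jq available:

# Peak CPU (cores) per pod over the last 24h
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=max_over_time(sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="monitoring"}[5m]))[24h:5m])' |
  jq -r '.data.result[] | "\(.metric.pod)  \(.value[1])"'

# Peak working-set memory (bytes) per pod over the last 24h
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=max_over_time(sum by (pod) (container_memory_working_set_bytes{namespace="monitoring"})[24h:5m])' |
  jq -r '.data.result[] | "\(.metric.pod)  \(.value[1])"'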

Example: Loki Write Component

Observed peak usage:
  CPU: 40m (0.04 cores)
  Memory: 170Mi

Requested (before optimization):
  CPU: 500m (12.5x over-provisioned!)
  Memory: 1Gi (6x over-provisioned!)

Step 2: Calculate Requests

Formula:

CPU request = peak_usage × 1.5 (50% headroom)
Memory request = peak_usage × 2.0 (100% headroom)

Why these multipliers?

  • CPU: Compressible resource, can throttle without crashes

  • Memory: Non-compressible, OOM kills if exceeded

Example: Loki Write

CPU request = 40m × 1.5 = 60m → round up to 100m (rounder number)
Memory request = 170Mi × 2.0 = 340Mi → round down to 256Mi (nearest power of two; still ~1.5× peak)

Step 3: Set Limits

Formula:

CPU limit = request × 4-5 (allow bursting)
Memory limit = request × 2 (prevent runaway)

Example: Loki Write

CPU limit = 100m × 5 = 500m
Memory limit = 256Mi × 2 = 512Mi

Final configuration:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }
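
For a one-off calculation, Steps 2 and 3 are easy to script. An illustrative sketch in shell (integer millicores and Mi; the final rounding remains a judgment call, as in the Loki example):

# Derive requests and limits from an observed peak
PEAK_CPU=40    # millicores
PEAK_MEM=170   # Mi

CPU_REQUEST=$(( PEAK_CPU * 3 / 2 ))  # peak x 1.5
MEM_REQUEST=$(( PEAK_MEM * 2 ))      # peak x 2.0
CPU_LIMIT=$(( CPU_REQUEST * 5 ))     # request x 5 (burst headroom)
MEM_LIMIT=$(( MEM_REQUEST * 2 ))     # request x 2 (cap runaway growth)

echo "requests: { cpu: ${CPU_REQUEST}m, memory: ${MEM_REQUEST}Mi }"
echo "limits:   { cpu: ${CPU_LIMIT}m, memory: ${MEM_LIMIT}Mi }"

# Prints 60m/340Mi requests and 300m/680Mi limits; the deployed config
# rounds the requests first (100m/256Mi) and derives the limits from
# those rounded values (500m/512Mi)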

Step 4: Validate

Deploy and monitor:

# Watch for OOM kills
kubectl get events -n monitoring | grep OOM

# Check for CPU throttling
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1

# Verify pods not being evicted
kubectl get pods -n monitoring -o wide
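
Event grepping can miss kills that have already aged out of the event log; the container status keeps the last termination reason, which is a more reliable signal. A small helper using only kubectl:

# Print any container whose last termination was an OOM kill
kubectl get pods -n monitoring \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep OOMKilled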

Iterate if needed:

  • If OOM kills: Increase memory request/limit

  • If high throttling (>20%): Increase CPU limit

  • If consistently under-utilized: Reduce requests

Current Resource Allocation

Metrics Stack

Prometheus (2 replicas):

resources:
  requests: { cpu: 100m, memory: 1500Mi }
  limits: { cpu: 1000m, memory: 3000Mi }

# Actual usage: 80-150m CPU, 1200-1800Mi memory
# Headroom: 50% CPU, 66% memory

Thanos Query (2 replicas):

resources:
  requests: { cpu: 200m, memory: 512Mi }
  limits: { cpu: 1000m, memory: 1Gi }

# Actual usage: 50-150m CPU, 300-400Mi memory
# Headroom: 33% CPU, 28% memory

Thanos Store (2 replicas):

resources:
  requests: { cpu: 200m, memory: 1Gi }
  limits: { cpu: 1000m, memory: 2Gi }

# Actual usage: 80-150m CPU, 600-800Mi memory
# Headroom: 33% CPU, 25% memory

Thanos Compactor (1 replica):

resources:
  requests: { cpu: 500m, memory: 2Gi }
  limits: { cpu: 2000m, memory: 4Gi }

# Actual usage: 200-400m CPU, 1-1.5Gi memory
# Headroom: 25% CPU, 33% memory

Logs Stack

Loki Write (2 replicas):

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }

# Actual usage: 20-40m CPU, 100-170Mi memory
# Headroom: 150% CPU, 50% memory

Loki Read (2 replicas):

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }

# Actual usage: 15-35m CPU, 80-150Mi memory
# Headroom: 185% CPU, 70% memory

Loki Backend (2 replicas):

resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }

# Actual usage: 15-30m CPU, 90-160Mi memory
# Headroom: 233% CPU, 60% memory

Visualization & Alerting

Grafana (1 replica):

resources:
  requests: { cpu: 50m, memory: 512Mi }
  limits: { cpu: 500m, memory: 1024Mi }

# Actual usage: 10-30m CPU, 350-450Mi memory
# Headroom: 66% CPU, 13% memory

Alertmanager (2 replicas):

resources:
  requests: { cpu: 25m, memory: 100Mi }
  limits: { cpu: 100m, memory: 256Mi }

# Actual usage: 5-15m CPU, 60-80Mi memory
# Headroom: 66% CPU, 25% memory

Log Collection

Alloy (4 pods, DaemonSet):

resources:
  requests: { cpu: 50m, memory: 128Mi }
  limits: { cpu: 200m, memory: 256Mi }

# Actual usage: 15-30m CPU, 70-100Mi memory
# Headroom: 66% CPU, 28% memory
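
The figures above drift as charts are upgraded, so it helps to compare them against the live cluster from time to time. A one-liner that dumps the currently applied requests and limits:

# Requested and limited CPU/memory per pod in the monitoring namespace
kubectl get pods -n monitoring -o custom-columns='POD:.metadata.name,'\
'CPU_REQ:.spec.containers[*].resources.requests.cpu,'\
'MEM_REQ:.spec.containers[*].resources.requests.memory,'\
'CPU_LIM:.spec.containers[*].resources.limits.cpu,'\
'MEM_LIM:.spec.containers[*].resources.limits.memory'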

Resource Allocation Strategy

QoS Classes

Guaranteed (requests = limits):

  • Use only for: Databases, critical stateful workloads

  • Our stack: None (intentionally avoided)

  • Why not: Wastes cluster capacity

Burstable (requests < limits):

  • Use for: Most workloads (infrastructure, applications)

  • Our stack: All monitoring components (a quick check of assigned classes is sketched after this list)

  • Why: Balance protection with efficiency

BestEffort (no requests/limits):

  • Use for: Non-critical batch jobs, canaries

  • Our stack: None (all have guarantees)

  • Why not: First to be evicted
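
The class Kubernetes actually assigned is recorded in pod status, so the split described above is easy to audit. A quick tally for the monitoring namespace:

# Count pods per QoS class; everything here should land in Burstable
kubectl get pods -n monitoring -o custom-columns='QOS:.status.qosClass' \
  --no-headers | sort | uniq -c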

Request vs Limit Philosophy

Requests (guaranteed minimum):

  • Set based on typical usage + safety margin

  • Determines pod placement (scheduler considers requests)

  • Used for QoS classification

Limits (maximum allowed):

  • CPU: Allow bursting (3-5x requests)

  • Memory: Prevent runaway (2x requests)

  • Trigger throttling (CPU) or OOM kill (memory) if exceeded

Example reasoning:

prometheus:
  requests: { cpu: 100m, memory: 1500Mi }  # Typical usage
  limits: { cpu: 1000m, memory: 3000Mi }    # Burst capacity

# Rationale:
# - Runs fine at 100m CPU most of the time
# - Can burst to 1000m during query spikes
# - Uses ~1500Mi memory typically
# - Can grow to 3000Mi during compaction
# - Won't starve other pods (has guaranteed 100m/1500Mi)

Scaling Guidelines

When to Scale Vertically (Increase Resources)

Symptoms:

  • Frequent OOM kills

  • CPU throttling >20%

  • High latency during normal load

  • Pods consistently at resource limits

Actions:

# Increase requests and limits proportionally
# Before
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }

# After (2x scale)
resources:
  requests: { cpu: 200m, memory: 512Mi }
  limits: { cpu: 1000m, memory: 1024Mi }

When to Scale Horizontally (Increase Replicas)

Symptoms:

  • High load despite adequate resources

  • Query latency issues

  • Single replica at capacity

  • Need better HA

Actions:

# Increase replica count
# Before
replicas: 2

# After
replicas: 3

Components that scale horizontally:

  • ✅ Prometheus (independent scraping)

  • ✅ Thanos Query (stateless)

  • ✅ Thanos Store (shared S3 state)

  • ✅ Loki Write/Read (shared S3 state)

  • ⚠️ Thanos Compactor (single replica only)

Monitoring Resource Efficiency

Key Metrics

Resource utilization:

# CPU usage vs requests
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
  /
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)

# Memory usage vs requests
sum(container_memory_working_set_bytes) by (pod)
  /
sum(kube_pod_container_resource_requests{resource="memory"}) by (pod)

Throttling:

# High throttling indicates CPU limit too low
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.2

OOM kills:

# Check for recent OOM kills
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
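
The throttling expression is the one most worth automating, since sustained throttling is the usual symptom of a CPU limit set too low. Running it ad hoc, assuming the same localhost:9090 port-forward and jq as earlier:

# Containers throttled for more than 20% of the last five minutes
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.2' \
  | jq -r '.data.result[] | "\(.metric.pod)  \(.value[1])"'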

Target Utilization

Healthy ranges:

  • CPU requests: 30-70% utilized (allows bursting)

  • CPU limits: <80% utilized (avoid throttling)

  • Memory requests: 50-80% utilized (safety buffer)

  • Memory limits: <90% utilized (avoid OOM)

Red flags:

  • CPU requests >90% utilized: Too tight, increase requests

  • Memory requests >95% utilized: OOM risk, increase requests

  • CPU limits constantly hit: Throttling, increase limits

  • Memory limits reached: OOM kill imminent

Cost Optimization

Cluster-Level View

Total monitoring stack resources:

CPU requests:  ~2.7 cores (out of 40 total cluster cores = 6.75%)
Memory requests: ~12.5Gi (out of 140Gi total cluster memory = 9%)
Storage (PVCs): ~60Gi Longhorn (out of 400Gi available = 15%)
Storage (S3): ~10Gi (cost: ~$0.25/month)

Cost breakdown:

  • Compute: Included in node costs (no marginal cost)

  • Longhorn PVCs: Included in node storage (no marginal cost)

  • S3 storage: $0.023/GB/month × 10GB = $0.23/month

  • Total incremental cost: ~$0.25/month

Optimization Opportunities

Already Optimized:

  • ✅ Loki right-sized (80% reduction achieved)

  • ✅ Prometheus retention reduced (3d + Thanos)

  • ✅ ArgoCD resource guarantees added

  • ✅ Thanos downsampling enabled

Future Optimizations:

  • [ ] Add PVC resizing (reduce Thanos Store PVCs if cache not needed)

  • [ ] Implement recording rules (reduce query load)

  • [ ] Add query caching (reduce Thanos Store load)

  • [ ] Consider Alloy log sampling (reduce Loki write load)

Best Practices

DO

  • ✅ Measure first: Collect actual usage before sizing

  • ✅ Add headroom: Keep a 50-100% buffer for spikes

  • ✅ Use Burstable QoS: Balance efficiency with guarantees

  • ✅ Monitor continuously: Track utilization over time

  • ✅ Document decisions: Record why each value was chosen

  • ✅ Test in staging: Validate resource changes before production

DON’T

  • ❌ Guess: Don’t set resources without data

  • ❌ Copy defaults: Helm chart defaults are often 5-10x oversized

  • ❌ Set requests = limits: Guaranteed QoS wastes cluster capacity

  • ❌ Ignore metrics: Always monitor utilization after changes

  • ❌ Over-optimize: Some buffer is a healthy safety margin

  • ❌ Batch changes: Change one component at a time

References