Resource Optimization and Sizing Methodology¶
Introduction¶
This document explains how we determined the resource requests and limits for the monitoring stack, summarizes the optimization work performed in October and November 2025, and describes the methodology for right-sizing components.
Optimization History (October-November 2025)¶
Initial State (Before Optimization)¶
Problems Identified:
ArgoCD: No resource requests (BestEffort QoS class)
Risk: First to be evicted during memory pressure
Impact: GitOps deployments could fail unexpectedly
Worker Nodes: Overcommitted at 126-148% CPU limits
Risk: CPU throttling affecting all workloads
Impact: Performance degradation cluster-wide
Loki: Massively overprovisioned (80%+ waste)
Each component: 500m CPU / 1Gi memory requests
Actual usage: 15-40m CPU / 80-170Mi memory
Waste: 2.4 CPU cores, 4.5Gi memory unused
Optimization Actions¶
1. ArgoCD Resource Guarantees (Nov 1)
# Before: No requests/limits (BestEffort)
# After: Burstable QoS with appropriate limits
application-controller:
  resources:
    requests: { cpu: 250m, memory: 768Mi }
    limits: { cpu: 1000m, memory: 2Gi }
server:
  resources:
    requests: { cpu: 50m, memory: 128Mi }
    limits: { cpu: 500m, memory: 512Mi }
repo-server:
  resources:
    requests: { cpu: 50m, memory: 128Mi }
    limits: { cpu: 500m, memory: 512Mi }
redis:
  resources:
    requests: { cpu: 50m, memory: 64Mi }
    limits: { cpu: 200m, memory: 256Mi }
2. Loki Right-Sizing (Oct 29)
# Before: Overprovisioned
write:
  resources:
    requests: { cpu: 500m, memory: 1Gi }
# After: Right-sized based on actual usage
write:
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }
# Applied to: write, read, backend (all 3 components)
# Savings: 2.4 CPU cores, 4.5Gi memory
3. Prometheus Optimization (Oct 29 - Thanos Integration)
# Before: 7-day retention, 6Gi PVC
prometheus:
  retention: 7d
  storage:
    size: 6Gi
  resources:
    requests: { memory: 2500Mi }
# After: 3-day retention + Thanos, 3Gi PVC
prometheus:
  retention: 3d
  storage:
    size: 3Gi
  thanos:
    sidecar: enabled # Offload to S3
  resources:
    requests: { memory: 1500Mi } # 40% reduction
# Savings: 3Gi PVC × 2 replicas = 6Gi storage
# Savings: 1000Mi memory × 2 replicas = 2Gi memory
Results¶
Node Resource Allocation (After Optimization):
Primary Worker Node (cax31-fsn1):
CPU requests: 8.3 / 16 cores (52%) ← was 85%
CPU limits: 13.6 / 16 cores (85%) ← was 148%
Memory requests: 12.8 / 28Gi (46%) ← was 55%
Memory limits: 22.4 / 28Gi (80%) ← was 90%
Cluster-Wide Impact:
✅ Overcommitment resolved (148% → 85%)
✅ ArgoCD protected with resource guarantees
✅ Loki optimized (80% resource reduction)
✅ Prometheus storage reduced (6Gi → 3Gi)
✅ Zero service disruptions during changes
Sizing Methodology¶
Step 1: Measure Actual Usage¶
Collect baseline metrics:
# Current CPU usage per pod (spot-check; sample repeatedly over 24h for a baseline)
kubectl top pods -n monitoring --sort-by=cpu
# Current memory usage per pod
kubectl top pods -n monitoring --sort-by=memory
# Query Prometheus for historical trends
rate(container_cpu_usage_seconds_total{namespace="monitoring"}[1h])
container_memory_working_set_bytes{namespace="monitoring"}
Example: Loki Write Component
Observed peak usage:
CPU: 40m (0.04 cores)
Memory: 170Mi
Requested (before optimization):
CPU: 500m (12.5x over-provisioned!)
Memory: 1Gi (6x over-provisioned!)
Step 2: Calculate Requests¶
Formula:
CPU request = peak_usage × 1.5 (50% headroom)
Memory request = peak_usage × 2.0 (100% headroom)
Why these multipliers?
CPU: Compressible; exceeding it causes throttling, not crashes, so a smaller margin is acceptable
Memory: Incompressible; running out causes OOM kills, so it needs a larger margin
Example: Loki Write
CPU request = 40m × 1.5 = 60m → round up to 100m
Memory request = 170Mi × 2.0 = 340Mi → round down to 256Mi (power of two; still ~50% headroom over the 170Mi peak)
Step 3: Set Limits¶
Formula:
CPU limit = request × 4-5 (allow bursting)
Memory limit = request × 2 (prevent runaway)
Example: Loki Write
CPU limit = 100m × 5 = 500m
Memory limit = 256Mi × 2 = 512Mi
Final configuration:
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 500m, memory: 512Mi }
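To keep the multipliers consistent across components, the same arithmetic can be scripted. A minimal Python sketch, where the rounding conventions (50m CPU steps, nearest power of two for memory) are assumptions inferred from the Loki example above rather than part of the stack:
import math

def nearest_power_of_two(mi: float) -> int:
    """Round a MiB value to the nearest power of two (256, 512, 1024, ...)."""
    lower = 2 ** math.floor(math.log2(mi))
    upper = lower * 2
    return lower if (mi - lower) <= (upper - mi) else upper

def size_component(peak_cpu_m: float, peak_mem_mi: float) -> dict:
    """Apply the Step 2/3 multipliers to observed peak usage."""
    cpu_request = math.ceil(peak_cpu_m * 1.5 / 50) * 50    # 50% headroom, rounded up to 50m steps
    mem_request = nearest_power_of_two(peak_mem_mi * 2.0)  # 100% headroom, rounded to a power of two
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi"},
        "limits":   {"cpu": f"{cpu_request * 5}m",          # allow ~5x burst
                     "memory": f"{mem_request * 2}Mi"},      # cap runaway at 2x
    }

# Loki write example: observed peaks of 40m CPU and 170Mi memory
print(size_component(40, 170))
# {'requests': {'cpu': '100m', 'memory': '256Mi'}, 'limits': {'cpu': '500m', 'memory': '512Mi'}}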
Step 4: Validate¶
Deploy and monitor:
# Watch for OOM kills
kubectl get events -n monitoring | grep OOM
# Check for CPU throttling (PromQL)
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
# Verify pods not being evicted
kubectl get pods -n monitoring -o wide
Iterate if needed:
If OOM kills: Increase memory request/limit
If high throttling (>20%): Increase CPU limit
If consistently under-utilized: Reduce requests
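These iteration rules can be captured in a small decision helper. The sketch below is illustrative only; the 30% under-utilization cut-off is an assumption borrowed from the target utilization ranges later in this document:
def next_action(oom_kills: int, throttle_ratio: float, cpu_request_utilization: float) -> str:
    """Map Step 4 observations to the adjustment described above.

    throttle_ratio: fraction of CPU time throttled (0.0-1.0)
    cpu_request_utilization: peak CPU usage / CPU request (0.0-1.0+)
    """
    if oom_kills > 0:
        return "increase memory request/limit"
    if throttle_ratio > 0.20:
        return "increase CPU limit"
    if cpu_request_utilization < 0.30:
        return "reduce requests (consistently under-utilized)"
    return "keep current sizing"

print(next_action(oom_kills=0, throttle_ratio=0.05, cpu_request_utilization=0.40))  # keep current sizing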
Current Resource Allocation¶
Metrics Stack¶
Prometheus (2 replicas):
resources:
  requests: { cpu: 100m, memory: 1500Mi }
  limits: { cpu: 1000m, memory: 3000Mi }
# Actual usage: 80-150m CPU, 1200-1800Mi memory
# Headroom: 50% CPU, 66% memory
Thanos Query (2 replicas):
resources:
  requests: { cpu: 200m, memory: 512Mi }
  limits: { cpu: 1000m, memory: 1Gi }
# Actual usage: 50-150m CPU, 300-400Mi memory
# Headroom: 33% CPU, 28% memory
Thanos Store (2 replicas):
resources:
  requests: { cpu: 200m, memory: 1Gi }
  limits: { cpu: 1000m, memory: 2Gi }
# Actual usage: 80-150m CPU, 600-800Mi memory
# Headroom: 33% CPU, 25% memory
Thanos Compactor (1 replica):
resources:
  requests: { cpu: 500m, memory: 2Gi }
  limits: { cpu: 2000m, memory: 4Gi }
# Actual usage: 200-400m CPU, 1-1.5Gi memory
# Headroom: 25% CPU, 33% memory
Logs Stack¶
Loki Write (2 replicas):
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }
# Actual usage: 20-40m CPU, 100-170Mi memory
# Headroom: 150% CPU, 50% memory
Loki Read (2 replicas):
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }
# Actual usage: 15-35m CPU, 80-150Mi memory
# Headroom: 185% CPU, 70% memory
Loki Backend (2 replicas):
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }
# Actual usage: 15-30m CPU, 90-160Mi memory
# Headroom: 233% CPU, 60% memory
Visualization & Alerting¶
Grafana (1 replica):
resources:
  requests: { cpu: 50m, memory: 512Mi }
  limits: { cpu: 500m, memory: 1024Mi }
# Actual usage: 10-30m CPU, 350-450Mi memory
# Headroom: 66% CPU, 13% memory
Alertmanager (2 replicas):
resources:
  requests: { cpu: 25m, memory: 100Mi }
  limits: { cpu: 100m, memory: 256Mi }
# Actual usage: 5-15m CPU, 60-80Mi memory
# Headroom: 66% CPU, 25% memory
Log Collection¶
Alloy (4 pods, DaemonSet):
resources:
  requests: { cpu: 50m, memory: 128Mi }
  limits: { cpu: 200m, memory: 256Mi }
# Actual usage: 15-30m CPU, 70-100Mi memory
# Headroom: 66% CPU, 28% memory
Resource Allocation Strategy¶
QoS Classes¶
Guaranteed (requests = limits):
Use only for: Databases, critical stateful workloads
Our stack: None (intentionally avoided)
Why not: Wastes cluster capacity
Burstable (requests < limits):
Use for: Most workloads (infrastructure, applications)
Our stack: All monitoring components
Why: Balance protection with efficiency
BestEffort (no requests/limits):
Use for: Non-critical batch jobs, canaries
Our stack: None (all have guarantees)
Why not: First to be evicted
Request vs Limit Philosophy¶
Requests (guaranteed minimum):
Set based on typical usage + safety margin
Determines pod placement (scheduler considers requests)
Used for QoS classification
Limits (maximum allowed):
CPU: Allow bursting (3-5x requests)
Memory: Prevent runaway (2x requests)
Trigger throttling (CPU) or OOM kill (memory) if exceeded
Example reasoning:
prometheus:
  requests: { cpu: 100m, memory: 1500Mi } # Typical usage
  limits: { cpu: 1000m, memory: 3000Mi } # Burst capacity
# Rationale:
# - Runs fine at 100m CPU most of the time
# - Can burst to 1000m during query spikes
# - Uses ~1500Mi memory typically
# - Can grow to 3000Mi during compaction
# - Won't starve other pods (has guaranteed 100m/1500Mi)
Scaling Guidelines¶
When to Scale Vertically (Increase Resources)¶
Symptoms:
Frequent OOM kills
CPU throttling >20%
High latency during normal load
Pods consistently at resource limits
Actions:
# Increase requests and limits proportionally
# Before
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits: { cpu: 500m, memory: 512Mi }
# After (2x scale)
resources:
  requests: { cpu: 200m, memory: 512Mi }
  limits: { cpu: 1000m, memory: 1024Mi }
When to Scale Horizontally (Increase Replicas)¶
Symptoms:
High load despite adequate resources
Query latency issues
Single replica at capacity
Need better HA
Actions:
# Increase replica count
# Before
replicas: 2
# After
replicas: 3
Components that scale horizontally:
✅ Prometheus (independent scraping)
✅ Thanos Query (stateless)
✅ Thanos Store (shared S3 state)
✅ Loki Write/Read (shared S3 state)
⚠️ Thanos Compactor (single replica only)
Monitoring Resource Efficiency¶
Key Metrics¶
Resource utilization:
# CPU usage vs requests
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod)
# Memory usage vs requests
sum(container_memory_working_set_bytes) by (pod)
/
sum(kube_pod_container_resource_requests{resource="memory"}) by (pod)
Throttling:
# High throttling indicates CPU limit too low
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.2
OOM kills:
# Check for recent OOM kills
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
Target Utilization¶
Healthy ranges:
CPU requests: 30-70% utilized (allows bursting)
CPU limits: <80% utilized (avoid throttling)
Memory requests: 50-80% utilized (safety buffer)
Memory limits: <90% utilized (avoid OOM)
Red flags:
CPU requests >90% utilized: Too tight, increase requests
Memory requests >95% utilized: OOM risk, increase requests
CPU limits constantly hit: Throttling, increase limits
Memory limits reached: OOM kill imminent
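As an illustration only, these healthy ranges can be wired into a simple check over the utilization ratios produced by the PromQL queries above; the function and argument names below are hypothetical, while the thresholds are the ones listed here:
def utilization_flags(cpu_req: float, mem_req: float, cpu_lim: float, mem_lim: float) -> list[str]:
    """Return red flags for utilization ratios (usage / requests or limits, as fractions)."""
    flags = []
    if not 0.30 <= cpu_req <= 0.70:
        flags.append(f"CPU vs requests at {cpu_req:.0%} (healthy: 30-70%)")
    if not 0.50 <= mem_req <= 0.80:
        flags.append(f"memory vs requests at {mem_req:.0%} (healthy: 50-80%)")
    if cpu_lim >= 0.80:
        flags.append(f"CPU vs limits at {cpu_lim:.0%} (throttling risk)")
    if mem_lim >= 0.90:
        flags.append(f"memory vs limits at {mem_lim:.0%} (OOM risk)")
    return flags

# Example: healthy CPU, but memory is pressing against its limit
print(utilization_flags(0.45, 0.75, 0.60, 0.93))
# ['memory vs limits at 93% (OOM risk)']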
Cost Optimization¶
Cluster-Level View¶
Total monitoring stack resources:
CPU requests: ~2.7 cores (out of 40 total cluster cores = 6.75%)
Memory requests: ~12.5Gi (out of 140Gi total cluster memory = 9%)
Storage (PVCs): ~60Gi Longhorn (out of 400Gi available = 15%)
Storage (S3): ~10Gi (cost: ~$0.25/month)
Cost breakdown:
Compute: Included in node costs (no marginal cost)
Longhorn PVCs: Included in node storage (no marginal cost)
S3 storage: $0.023/GB/month × 10GB = $0.23/month
Total incremental cost: ~$0.25/month
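For reference, the arithmetic behind the S3 line item, using the rate and volume quoted above:
s3_gb = 10
rate_usd_per_gb_month = 0.023
print(f"S3 storage: ${s3_gb * rate_usd_per_gb_month:.2f}/month")  # S3 storage: $0.23/month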
Optimization Opportunities¶
Already Optimized:
✅ Loki right-sized (80% reduction achieved)
✅ Prometheus retention reduced (3d + Thanos)
✅ ArgoCD resource guarantees added
✅ Thanos downsampling enabled
Future Optimizations:
[ ] Add PVC resizing (reduce Thanos Store PVCs if cache not needed)
[ ] Implement recording rules (reduce query load)
[ ] Add query caching (reduce Thanos Store load)
[ ] Consider Alloy log sampling (reduce Loki write load)
Best Practices¶
DO¶
✅ Measure first: Collect actual usage before sizing
✅ Add headroom: Keep a 50-100% buffer for spikes
✅ Use Burstable QoS: Balance efficiency with guarantees
✅ Monitor continuously: Track utilization over time
✅ Document decisions: Record why each resource value was chosen
✅ Test in staging: Validate resource changes before production
DON’T¶
❌ Guess: Don't set resources without data
❌ Copy defaults: Helm chart defaults are often 5-10x oversized
❌ Set requests = limits: Guaranteed QoS wastes cluster capacity
❌ Ignore metrics: Keep monitoring utilization after changes
❌ Over-optimize: Some buffer is a healthy safety margin
❌ Batch changes: Change one component at a time