Resource Requirements Reference¶
Complete resource specifications for all monitoring stack components.
Summary Tables¶
CPU Requirements¶
| Component | Replicas | CPU Request (each) | CPU Limit (each) | Total Request | Total Limit |
|---|---|---|---|---|---|
| Metrics Stack | | | | | |
| Prometheus | 2 | 100m | 1000m | 200m | 2000m |
| Thanos Query | 2 | 200m | 1000m | 400m | 2000m |
| Thanos Store | 2 | 200m | 1000m | 400m | 2000m |
| Thanos Compactor | 1 | 500m | 2000m | 500m | 2000m |
| Thanos Sidecar | 2 | 25m | 100m | 50m | 200m |
| Logs Stack | | | | | |
| Loki Write | 2 | 100m | 500m | 200m | 1000m |
| Loki Read | 2 | 100m | 500m | 200m | 1000m |
| Loki Backend | 2 | 100m | 500m | 200m | 1000m |
| Loki Gateway | 1 | 50m | 200m | 50m | 200m |
| Alloy (DaemonSet) | 4 nodes | 50m | 200m | 200m | 800m |
| Visualization & Alerting | | | | | |
| Grafana | 1 | 50m | 500m | 50m | 500m |
| Alertmanager | 2 | 25m | 100m | 50m | 200m |
| Operators & Exporters | | | | | |
| Prometheus Operator | 1 | 100m | 200m | 100m | 200m |
| kube-state-metrics | 1 | 10m | 100m | 10m | 100m |
| Node Exporter | 4 nodes | 10m | 200m | 40m | 800m |
| TOTAL | | | | ~2.65 cores | ~14.0 cores |
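The totals row is each per-pod value multiplied by its replica count, summed over all rows. A minimal Python sketch of that arithmetic, using the rows from the table above (the helper name `millicores` is illustrative and not part of any tool):

```python
# (replicas, CPU request, CPU limit) per component, copied from the table above.
CPU_ROWS = {
    "Prometheus":          (2, "100m", "1000m"),
    "Thanos Query":        (2, "200m", "1000m"),
    "Thanos Store":        (2, "200m", "1000m"),
    "Thanos Compactor":    (1, "500m", "2000m"),
    "Thanos Sidecar":      (2, "25m", "100m"),
    "Loki Write":          (2, "100m", "500m"),
    "Loki Read":           (2, "100m", "500m"),
    "Loki Backend":        (2, "100m", "500m"),
    "Loki Gateway":        (1, "50m", "200m"),
    "Alloy":               (4, "50m", "200m"),
    "Grafana":             (1, "50m", "500m"),
    "Alertmanager":        (2, "25m", "100m"),
    "Prometheus Operator": (1, "100m", "200m"),
    "kube-state-metrics":  (1, "10m", "100m"),
    "Node Exporter":       (4, "10m", "200m"),
}


def millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '250m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


total_request = sum(n * millicores(req) for n, req, _ in CPU_ROWS.values())
total_limit = sum(n * millicores(lim) for n, _, lim in CPU_ROWS.values())

print(f"requests: {total_request / 1000:.2f} cores")
print(f"limits:   {total_limit / 1000:.2f} cores")
```

Run as written, this reports 2.65 cores of requests and 14.00 cores of limits.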
Memory Requirements¶
| Component | Replicas | Memory Request (each) | Memory Limit (each) | Total Request | Total Limit |
|---|---|---|---|---|---|
| Metrics Stack | | | | | |
| Prometheus | 2 | 1500Mi | 3000Mi | 3000Mi | 6000Mi |
| Thanos Query | 2 | 512Mi | 1Gi | 1024Mi | 2Gi |
| Thanos Store | 2 | 1Gi | 2Gi | 2Gi | 4Gi |
| Thanos Compactor | 1 | 2Gi | 4Gi | 2Gi | 4Gi |
| Thanos Sidecar | 2 | 128Mi | 256Mi | 256Mi | 512Mi |
| Logs Stack | | | | | |
| Loki Write | 2 | 256Mi | 512Mi | 512Mi | 1Gi |
| Loki Read | 2 | 256Mi | 512Mi | 512Mi | 1Gi |
| Loki Backend | 2 | 256Mi | 512Mi | 512Mi | 1Gi |
| Loki Gateway | 1 | 128Mi | 256Mi | 128Mi | 256Mi |
| Alloy (DaemonSet) | 4 nodes | 128Mi | 256Mi | 512Mi | 1Gi |
| Visualization & Alerting | | | | | |
| Grafana | 1 | 512Mi | 1Gi | 512Mi | 1Gi |
| Alertmanager | 2 | 100Mi | 256Mi | 200Mi | 512Mi |
| Operators & Exporters | | | | | |
| Prometheus Operator | 1 | 128Mi | 256Mi | 128Mi | 256Mi |
| kube-state-metrics | 1 | 64Mi | 128Mi | 64Mi | 128Mi |
| Node Exporter | 4 nodes | 32Mi | 64Mi | 128Mi | 256Mi |
| TOTAL | | | | ~12.5 GB | ~24.5 GB |
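The table mixes `Mi` and `Gi` values and quotes its totals in GB. `Mi` and `Gi` are binary units (1Gi = 1024Mi) while GB is decimal, so it helps to normalise values before comparing rows. A sketch that handles only the two suffixes used on this page (the function names are illustrative):

```python
MIB = 1024 ** 2  # bytes in one mebibyte


def to_bytes(quantity: str) -> int:
    """Parse a Kubernetes memory quantity with an 'Mi' or 'Gi' suffix."""
    units = {"Mi": MIB, "Gi": 1024 * MIB}
    suffix = quantity[-2:]
    if suffix not in units:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(quantity[:-2]) * units[suffix]


def to_mib(quantity: str) -> float:
    """Mebibytes, the unit most rows in the table use."""
    return to_bytes(quantity) / MIB


def to_gb(quantity: str) -> float:
    """Decimal gigabytes (10**9 bytes), the unit of the totals row."""
    return to_bytes(quantity) / 1e9


print(to_mib("1Gi"))              # 1024.0
print(round(to_gb("3000Mi"), 2))  # 3.15
```

The second line shows why the choice of unit matters: a 3000Mi limit is about 2.93Gi but about 3.15GB.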
Storage Requirements¶
| Component | Type | Size (each) | Replicas | Total | Storage Class |
|---|---|---|---|---|---|
| Persistent Volumes | | | | | |
| Prometheus | PVC | 3Gi | 2 | 6Gi | longhorn |
| Thanos Store | PVC | 10Gi | 2 | 20Gi | longhorn |
| Thanos Compactor | PVC | 20Gi | 1 | 20Gi | longhorn |
| Loki Write | PVC | 5Gi | 2 | 10Gi | longhorn |
| Loki Read | PVC | 5Gi | 2 | 10Gi | longhorn |
| Loki Backend | PVC | 5Gi | 2 | 10Gi | longhorn |
| Grafana | PVC | 5Gi | 1 | 5Gi | longhorn |
| Alertmanager | PVC | 10Gi | 2 | 20Gi | hcloud-volumes |
| Subtotal PVCs | | | | 101Gi | |
| Object Storage (S3) | | | | | |
| Prometheus metrics | Bucket | ~150GB | - | ~150GB | Hetzner S3 (fsn1) |
| Loki logs | Bucket | ~2GB | - | ~2GB | Hetzner S3 (fsn1) |
| Subtotal S3 | | | | ~152GB | |
| TOTAL STORAGE | | | | ~253GB | |
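The PVC subtotal can be broken down by storage class, which matters later because only the `longhorn` volumes count against Longhorn capacity. A short sketch using the Persistent Volumes rows above:

```python
from collections import defaultdict

# (size in Gi, replicas, storage class), copied from the Persistent Volumes rows.
PVCS = {
    "Prometheus":       (3, 2, "longhorn"),
    "Thanos Store":     (10, 2, "longhorn"),
    "Thanos Compactor": (20, 1, "longhorn"),
    "Loki Write":       (5, 2, "longhorn"),
    "Loki Read":        (5, 2, "longhorn"),
    "Loki Backend":     (5, 2, "longhorn"),
    "Grafana":          (5, 1, "longhorn"),
    "Alertmanager":     (10, 2, "hcloud-volumes"),
}

by_class = defaultdict(int)
for size_gi, replicas, storage_class in PVCS.values():
    by_class[storage_class] += size_gi * replicas

pvc_total = sum(by_class.values())

print(dict(by_class))  # {'longhorn': 81, 'hcloud-volumes': 20}
print(pvc_total)       # 101
```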
Detailed Component Specifications¶
Prometheus¶
Deployment Type: StatefulSet
Replicas: 2 (for HA)
Resources:
resources:
  requests:
    cpu: 100m
    memory: 1500Mi
  limits:
    cpu: 1000m
    memory: 3000Mi
Storage:
storageSpec:
  volumeClaimTemplate:
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 3Gi
Rationale:
CPU: 100m baseline, bursts to 1000m during scraping/compaction
Memory: 1500Mi for TSDB, allows growth to 3000Mi during queries
Storage: 3Gi for 3 days retention (~500MB/day compressed)
QoS Class: Burstable
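The storage rationale above implies a certain amount of headroom on the Prometheus PVC. A small sketch of that arithmetic, taking the ~500MB/day figure at face value (it is an estimate, and the variable names are illustrative):

```python
DAILY_GROWTH_MB = 500   # compressed TSDB growth per day, from the rationale above
RETENTION_DAYS = 3
PVC_SIZE_GIB = 3

# Data held at steady state once retention starts deleting old blocks.
steady_state_mb = DAILY_GROWTH_MB * RETENTION_DAYS

# PVC size converted from GiB to decimal MB so both sides use the same unit.
pvc_size_mb = PVC_SIZE_GIB * 1024 ** 3 / 1e6

headroom = 1 - steady_state_mb / pvc_size_mb

print(steady_state_mb)    # 1500
print(f"{headroom:.0%}")  # 53%
```

Roughly half the volume stays free under these assumptions, which leaves room for the WAL and for ingestion spikes.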
Thanos Query¶
Deployment Type: Deployment
Replicas: 2 (for HA)
Resources:
resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
Storage: None (stateless)
Rationale:
CPU: 200m for query processing, bursts to 1000m for complex queries
Memory: 512Mi for query cache and result buffering
QoS Class: Burstable
Thanos Store¶
Deployment Type: StatefulSet
Replicas: 2 (for HA)
Resources:
resources:
  requests:
    cpu: 200m
    memory: 1Gi
  limits:
    cpu: 1000m
    memory: 2Gi
Storage:
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi
Rationale:
CPU: 200m for S3 reads and index serving
Memory: 1Gi for index cache (500MB) + chunk cache (500MB)
Storage: 10Gi for cached indexes
QoS Class: Burstable
Thanos Compactor¶
Deployment Type: StatefulSet
Replicas: 1 (single instance)
Resources:
resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi
Storage:
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 20Gi
Rationale:
CPU: 500m for compaction and downsampling, bursts to 2000m during heavy work
Memory: 2Gi for block processing, allows growth to 4Gi
Storage: 20Gi workspace for downloading/merging blocks
QoS Class: Burstable
Thanos Sidecar¶
Deployment Type: Container in Prometheus pods
Replicas: 2 (one per Prometheus)
Resources:
resources:
  requests:
    cpu: 25m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 256Mi
Storage: Shares Prometheus PVC
Rationale:
CPU: 25m minimal (background uploads)
Memory: 128Mi for block buffering
QoS Class: Burstable
Loki Write¶
Deployment Type: Deployment (SimpleScalable mode)
Replicas: 2 (for HA)
Resources:
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
Storage:
persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi
Rationale:
CPU: 100m for log ingestion, bursts to 500m during spikes
Memory: 256Mi for WAL buffering
Storage: 5Gi for Write-Ahead Log
QoS Class: Burstable
Loki Read¶
Deployment Type: Deployment (SimpleScalable mode)
Replicas: 2 (for HA)
Resources:
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
Storage:
persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi
Rationale:
CPU: 100m for query processing
Memory: 256Mi for query result caching
Storage: 5Gi for cache
QoS Class: Burstable
Loki Backend¶
Deployment Type: StatefulSet (SimpleScalable mode)
Replicas: 2 (for HA)
Resources:
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
Storage:
persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi
Rationale:
CPU: 100m for index/chunk operations
Memory: 256Mi for metadata
Storage: 5Gi for index metadata
QoS Class: Burstable
Loki Gateway¶
Deployment Type: Deployment
Replicas: 1 (not critical for HA)
Resources:
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
Storage: None (stateless proxy)
Rationale:
CPU: 50m for HTTP proxying
Memory: 128Mi minimal for nginx
QoS Class: Burstable
Alloy (Grafana Agent)¶
Deployment Type: DaemonSet
Replicas: 4 (one per node)
Resources:
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
Storage: None
Rationale:
CPU: 50m per node for log collection
Memory: 128Mi for log buffering
QoS Class: Burstable
Grafana¶
Deployment Type: Deployment
Replicas: 1 (UI component)
Resources:
resources:
  requests:
    cpu: 50m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 1Gi
Storage:
persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi
Rationale:
CPU: 50m for UI serving
Memory: 512Mi for dashboard rendering (increased from 256Mi to prevent OOM)
Storage: 5Gi for dashboards and SQLite DB
QoS Class: Burstable
Alertmanager¶
Deployment Type: StatefulSet
Replicas: 2 (for HA)
Resources:
resources:
  requests:
    cpu: 25m
    memory: 100Mi
  limits:
    cpu: 100m
    memory: 256Mi
Storage:
storage:
  volumeClaimTemplate:
    spec:
      storageClassName: hcloud-volumes
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi
Rationale:
CPU: 25m minimal (low alert volume)
Memory: 100Mi for alert state
Storage: 10Gi (over-provisioned, actual usage ~100MB)
QoS Class: Burstable
Prometheus Operator¶
Deployment Type: Deployment
Replicas: 1
Resources:
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
Storage: None
Rationale:
CPU: 100m for CRD reconciliation
Memory: 128Mi for operator logic
QoS Class: Burstable
kube-state-metrics¶
Deployment Type: Deployment
Replicas: 1
Resources:
resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 100m
    memory: 128Mi
Storage: None
Rationale:
CPU: 10m minimal (exposes metrics)
Memory: 64Mi for metric cache
QoS Class: Burstable
Node Exporter¶
Deployment Type: DaemonSet
Replicas: 4 (one per node)
Resources:
resources:
  requests:
    cpu: 10m
    memory: 32Mi
  limits:
    cpu: 200m
    memory: 64Mi
Storage: None
Rationale:
CPU: 10m per node (very lightweight)
Memory: 32Mi minimal
QoS Class: Burstable
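Every component above is listed as Burstable because each sets requests below its limits. Kubernetes derives the class from the pod's resource specs: Guaranteed when every container's CPU and memory requests equal its limits, BestEffort when no container sets any request or limit, and Burstable otherwise. A simplified sketch of that rule (the function is illustrative and compares quantities as plain strings, so `"1"` and `"1000m"` would count as different):

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resource specs.

    Each container is a dict with optional 'requests' and 'limits' dicts.
    """
    specs = [(c.get("requests", {}), c.get("limits", {})) for c in containers]

    # BestEffort: nothing is set on any container.
    if all(not req and not lim for req, lim in specs):
        return "BestEffort"

    # Guaranteed: CPU and memory limits are set everywhere and requests match.
    # An omitted request defaults to the limit, as it does in Kubernetes.
    guaranteed = all(
        lim.get(res) is not None and req.get(res, lim.get(res)) == lim.get(res)
        for req, lim in specs
        for res in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


node_exporter = {
    "requests": {"cpu": "10m", "memory": "32Mi"},
    "limits": {"cpu": "200m", "memory": "64Mi"},
}
pinned = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "100m", "memory": "128Mi"},
}

print(qos_class([node_exporter]))  # Burstable
print(qos_class([pinned]))         # Guaranteed
print(qos_class([{}]))             # BestEffort
```

Under memory pressure, Guaranteed pods are evicted last, so setting requests equal to limits on a critical component is a trade of burst capacity for eviction priority.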
Cluster Impact Analysis¶
Node Resource Allocation¶
Assumptions:
5 agent nodes (4 cax-series in monitoring, 1 cpx21 elsewhere)
Total cluster: 40 CPU cores, 140Gi memory
Monitoring Stack Usage:
CPU requests: 2.65 cores (6.6% of cluster)
CPU limits: 14.0 cores (35.0% of cluster) - acceptable because limits cap bursts and are not reserved
Memory requests: 12.5Gi (8.9% of cluster)
Memory limits: 24.5Gi (17.5% of cluster)
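Requests are what the scheduler reserves on a node, so they are the figures to compare against cluster capacity; limits only cap bursts. The request percentages above can be reproduced as follows, using the cluster totals from the assumptions (the helper `share` is illustrative):

```python
CLUSTER_CPU_CORES = 40
CLUSTER_MEMORY_GI = 140


def share(used: float, total: float) -> float:
    """Percentage of cluster capacity, rounded to one decimal place."""
    return round(100 * used / total, 1)


cpu_request_share = share(2.65, CLUSTER_CPU_CORES)
memory_request_share = share(12.5, CLUSTER_MEMORY_GI)

print(cpu_request_share)     # 6.6
print(memory_request_share)  # 8.9
```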
Node Distribution (typical):
Primary worker (cax31-fsn1): ~40% of stack
Secondary worker (cax31-nbg1): ~30% of stack
Tertiary workers (cax21 × 2): ~20% of stack combined
Control plane co-located (cpx21): ~10% of stack (DaemonSets only)
Storage Impact¶
Longhorn Capacity:
Total node storage: 640GB raw disk
Longhorn available: ~400GB (after system reservation)
Monitoring usage: 81Gi of Longhorn-backed PVCs (101Gi total minus 20Gi of Alertmanager volumes on hcloud-volumes) × 2 Longhorn replicas = 162Gi actual
Percentage: 40.5% of Longhorn capacity
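Longhorn stores each volume once per replica, so the raw disk consumed is the PVC total multiplied by the replica count. A sketch of the calculation above, treating Gi and GB as interchangeable the way the figures in this section do:

```python
LONGHORN_PVC_GI = 81         # PVCs on the longhorn storage class
LONGHORN_REPLICAS = 2        # replica count, from the "× 2 replicas" figure above
LONGHORN_AVAILABLE_GB = 400  # capacity after system reservation

actual_usage = LONGHORN_PVC_GI * LONGHORN_REPLICAS
remaining = LONGHORN_AVAILABLE_GB - actual_usage
fraction_used = actual_usage / LONGHORN_AVAILABLE_GB

print(actual_usage)            # 162
print(remaining)               # 238
print(f"{fraction_used:.1%}")  # 40.5%
```

The same three lines show the cost of any PVC change: every extra Gi requested consumes `LONGHORN_REPLICAS` Gi of raw capacity.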
Hetzner Volumes:
Alertmanager: 20Gi (10Gi × 2 replicas)
S3 Storage:
~152GB steady state
Cost: ~€3.50/month
Scaling Recommendations¶
Vertical Scaling (Increase Resources)¶
When to scale up:
CPU throttling >20%
OOMKilled events
Persistent high memory usage (>80% of requests)
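The three triggers above can be written as a simple check. This is a sketch only: `PodStats` and `scale_up_reasons` are hypothetical names, and the inputs are assumed to be averages over an observation window so that a single spike does not count as "persistent".

```python
from dataclasses import dataclass


@dataclass
class PodStats:
    throttled_ratio: float     # fraction of CPU periods throttled, 0.0 to 1.0
    oom_kills: int             # OOMKilled events in the observation window
    memory_used_mib: float     # average working set over the window
    memory_request_mib: float  # configured memory request


def scale_up_reasons(stats: PodStats) -> list[str]:
    """Return the scale-up triggers from the list above that currently apply."""
    reasons = []
    if stats.throttled_ratio > 0.20:
        reasons.append("cpu throttling above 20%")
    if stats.oom_kills > 0:
        reasons.append("OOMKilled events")
    if stats.memory_used_mib > 0.80 * stats.memory_request_mib:
        reasons.append("memory above 80% of request")
    return reasons


# Prometheus with a 1500Mi request, averaging 1300Mi of use.
print(scale_up_reasons(PodStats(0.05, 0, 1300, 1500)))
# ['memory above 80% of request']
```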
How to scale:
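# In config.yaml
# Raise limits alongside requests: a request may not exceed its limit.
resources:
  prometheus:
    requests:
      cpu: 200m        # Double from 100m
      memory: 3000Mi   # Double from 1500Mi (3Gi is 3072Mi, above the current 3000Mi limit)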
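Kubernetes rejects a container whose request exceeds its limit, so a change to requests is worth checking against the existing limits before it is applied. A sketch of such a pre-flight check for memory values (the helper names are illustrative, and only the `Mi` and `Gi` suffixes are handled):

```python
def mib(quantity: str) -> int:
    """Convert a memory quantity with an 'Mi' or 'Gi' suffix to MiB."""
    factor = {"Mi": 1, "Gi": 1024}[quantity[-2:]]
    return int(quantity[:-2]) * factor


def request_fits_limit(request: str, limit: str) -> bool:
    """True when the request does not exceed the limit."""
    return mib(request) <= mib(limit)


print(request_fits_limit("3000Mi", "3000Mi"))  # True
print(request_fits_limit("3Gi", "3000Mi"))     # False: 3Gi is 3072Mi
```

The second case is the easy one to miss: 3Gi looks like "double 1500Mi" but is 72Mi larger than Prometheus's 3000Mi limit.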
Horizontal Scaling (Increase Replicas)¶
Components that scale horizontally:
✅ Prometheus (2 → 3 replicas)
✅ Thanos Query (2 → 3 replicas)
✅ Thanos Store (2 → 3 replicas)
✅ Loki Write/Read/Backend (2 → 3 replicas)
⚠️ Thanos Compactor (must remain 1 replica; only one compactor may run against a given bucket)
⚠️ Grafana (can increase, but not necessary)
How to scale:
# In config.yaml
replicas:
  prometheus: 3    # Increased from 2
  thanosQuery: 3   # Increased from 2
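Each added replica brings its full per-pod requests, and a Prometheus replica also brings a Thanos sidecar and its own PVC. A sketch of what the change above costs, using the per-pod values from the component specifications (the dictionary keys mirror the `config.yaml` names, and `added_requests` is an illustrative helper):

```python
# Requests added by one more replica of each component.
PER_REPLICA = {
    # Prometheus pod = Prometheus container + Thanos sidecar container.
    "prometheus":  {"cpu_m": 100 + 25, "memory_mi": 1500 + 128, "pvc_gi": 3},
    "thanosQuery": {"cpu_m": 200, "memory_mi": 512, "pvc_gi": 0},
}


def added_requests(changes: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Sum the extra requests for {component: (old_replicas, new_replicas)}."""
    extra = {"cpu_m": 0, "memory_mi": 0, "pvc_gi": 0}
    for name, (old, new) in changes.items():
        for key, value in PER_REPLICA[name].items():
            extra[key] += (new - old) * value
    return extra


print(added_requests({"prometheus": (2, 3), "thanosQuery": (2, 3)}))
# {'cpu_m': 325, 'memory_mi': 2140, 'pvc_gi': 3}
```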
Storage Scaling¶
PVC Expansion:
# Edit PVC (if storage class supports expansion)
kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring
# Change storage: 3Gi → 6Gi
S3 Scaling: Automatic (no capacity to provision in advance; usage and cost grow with the data stored)
Minimum Requirements¶
For testing/development:
1 node, 4 CPU, 8Gi RAM
Reduce replicas to 1 for all components
Reduce Prometheus retention to 1d
Total: ~1 core, ~4Gi memory, ~30Gi storage
For small production (current configuration):
3+ nodes, 8 CPU, 16Gi RAM per node
2 replicas for critical components
3-day Prometheus retention
Total: ~2.7 cores, ~12.5Gi memory, ~101Gi storage
For large production (100+ nodes):
5+ dedicated monitoring nodes
3 replicas for all components
Consider Prometheus sharding
Total: ~10 cores, ~50Gi memory, ~500Gi storage
Resource Optimization Opportunities¶
Current optimizations (implemented October 2025):
✅ Loki right-sized (80% reduction from defaults)
✅ Prometheus retention reduced (7d → 3d with Thanos)
✅ Thanos downsampling enabled
Future optimizations:
Alertmanager PVC: 10Gi → 1Gi (90% smaller; actual usage is ~100MB)
Grafana PVC: 5Gi → 1Gi (80% smaller)
Node Exporter: Move to system namespace
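The two PVC reductions above can be quantified per component and in total. A short sketch using the sizes and replica counts from the storage table:

```python
# (current Gi, proposed Gi, replicas) for the two PVC changes listed above.
PROPOSALS = {
    "Alertmanager": (10, 1, 2),
    "Grafana": (5, 1, 1),
}

savings = {}
for name, (current, proposed, replicas) in PROPOSALS.items():
    savings[name] = (current - proposed) * replicas
    reduction = (current - proposed) / current
    print(f"{name}: frees {savings[name]}Gi ({reduction:.0%} smaller per volume)")

total_saved = sum(savings.values())
print(f"total: {total_saved}Gi")
```

This prints 18Gi for Alertmanager, 4Gi for Grafana, and 22Gi in total. Kubernetes does not shrink an existing PVC in place, so realising these savings means recreating the volumes at the smaller size.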
See Also¶
Resource Optimization - Sizing methodology
Architecture Overview - Component relationships
Configuration Reference - Adjusting resources