Resource Requirements Reference¶

Type: Reference (Information-oriented)

Related: Resource Optimization | Scale Resources

Complete resource specifications for all monitoring stack components.

Summary Tables¶

CPU Requirements¶

Component	Replicas	CPU Request (each)	CPU Limit (each)	Total Request	Total Limit
Metrics Stack
Prometheus	2	100m	1000m	200m	2000m
Thanos Query	2	200m	1000m	400m	2000m
Thanos Store	2	200m	1000m	400m	2000m
Thanos Compactor	1	500m	2000m	500m	2000m
Thanos Sidecar	2	25m	100m	50m	200m
Logs Stack
Loki Write	2	100m	500m	200m	1000m
Loki Read	2	100m	500m	200m	1000m
Loki Backend	2	100m	500m	200m	1000m
Loki Gateway	1	50m	200m	50m	200m
Alloy (DaemonSet)	4 nodes	50m	200m	200m	800m
Visualization & Alerting
Grafana	1	50m	500m	50m	500m
Alertmanager	2	25m	100m	50m	200m
Operators & Exporters
Prometheus Operator	1	100m	200m	100m	200m
kube-state-metrics	1	10m	100m	10m	100m
Node Exporter	4 nodes	10m	200m	40m	800m
TOTAL				~2.65 cores	~14.0 cores
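
The totals row can be recomputed from the per-component figures. The Python sketch below is illustrative only: it copies the replica counts, requests, and limits from the table above, and the `CPU_SPECS` name and tuple layout are assumptions made for this example, not part of the stack's configuration.

```python
# Per-component CPU figures copied from the table above:
# name -> (replicas, request in millicores, limit in millicores).
# CPU_SPECS is a hypothetical name used only for this illustration.
CPU_SPECS = {
    "Prometheus": (2, 100, 1000),
    "Thanos Query": (2, 200, 1000),
    "Thanos Store": (2, 200, 1000),
    "Thanos Compactor": (1, 500, 2000),
    "Thanos Sidecar": (2, 25, 100),
    "Loki Write": (2, 100, 500),
    "Loki Read": (2, 100, 500),
    "Loki Backend": (2, 100, 500),
    "Loki Gateway": (1, 50, 200),
    "Alloy": (4, 50, 200),
    "Grafana": (1, 50, 500),
    "Alertmanager": (2, 25, 100),
    "Prometheus Operator": (1, 100, 200),
    "kube-state-metrics": (1, 10, 100),
    "Node Exporter": (4, 10, 200),
}


def cpu_totals(specs):
    """Return (total request, total limit) in millicores across all replicas."""
    total_request = sum(replicas * request for replicas, request, _ in specs.values())
    total_limit = sum(replicas * limit for replicas, _, limit in specs.values())
    return total_request, total_limit


request_m, limit_m = cpu_totals(CPU_SPECS)
print(f"requests: {request_m / 1000:.2f} cores, limits: {limit_m / 1000:.2f} cores")
```

Rerunning a check like this after changing a replica count or limit keeps the summary row in step with the detailed specifications.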

Memory Requirements¶

Component	Replicas	Memory Request (each)	Memory Limit (each)	Total Request	Total Limit
Metrics Stack
Prometheus	2	1500Mi	3000Mi	3000Mi	6000Mi
Thanos Query	2	512Mi	1Gi	1024Mi	2Gi
Thanos Store	2	1Gi	2Gi	2Gi	4Gi
Thanos Compactor	1	2Gi	4Gi	2Gi	4Gi
Thanos Sidecar	2	128Mi	256Mi	256Mi	512Mi
Logs Stack
Loki Write	2	256Mi	512Mi	512Mi	1Gi
Loki Read	2	256Mi	512Mi	512Mi	1Gi
Loki Backend	2	256Mi	512Mi	512Mi	1Gi
Loki Gateway	1	128Mi	256Mi	128Mi	256Mi
Alloy (DaemonSet)	4 nodes	128Mi	256Mi	512Mi	1Gi
Visualization & Alerting
Grafana	1	512Mi	1Gi	512Mi	1Gi
Alertmanager	2	100Mi	256Mi	200Mi	512Mi
Operators & Exporters
Prometheus Operator	1	128Mi	256Mi	128Mi	256Mi
kube-state-metrics	1	64Mi	128Mi	64Mi	128Mi
Node Exporter	4 nodes	32Mi	64Mi	128Mi	256Mi
TOTAL				~11.3Gi	~22.7Gi
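
Because the table mixes `Mi` and `Gi` values, it helps to normalize everything to MiB before adding. The sketch below does this for the Metrics Stack group only. The `to_mib` helper and the `metrics_stack` list are hypothetical names for this example, and the parser handles only the two suffixes that appear in this table.

```python
# Multipliers for the two binary suffixes used in the table above, in MiB.
UNIT_MIB = {"Mi": 1, "Gi": 1024}


def to_mib(quantity):
    """Convert a quantity such as '512Mi' or '2Gi' to an integer number of MiB.

    Only the Mi and Gi suffixes are handled, which covers this table but
    not the full Kubernetes quantity syntax.
    """
    number, unit = quantity[:-2], quantity[-2:]
    if unit not in UNIT_MIB:
        raise ValueError(f"unsupported unit in {quantity!r}")
    return int(number) * UNIT_MIB[unit]


# (replicas, request, limit) rows copied from the Metrics Stack group above.
metrics_stack = [
    (2, "1500Mi", "3000Mi"),  # Prometheus
    (2, "512Mi", "1Gi"),      # Thanos Query
    (2, "1Gi", "2Gi"),        # Thanos Store
    (1, "2Gi", "4Gi"),        # Thanos Compactor
    (2, "128Mi", "256Mi"),    # Thanos Sidecar
]

request_mib = sum(n * to_mib(req) for n, req, _ in metrics_stack)
limit_mib = sum(n * to_mib(lim) for n, _, lim in metrics_stack)
print(f"requests: {request_mib / 1024:.2f}Gi, limits: {limit_mib / 1024:.2f}Gi")
```

The same loop extended to the remaining groups gives the totals row.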

Storage Requirements¶

Component	Type	Size (each)	Replicas	Total	Storage Class
Persistent Volumes
Prometheus	PVC	3Gi	2	6Gi	longhorn
Thanos Store	PVC	10Gi	2	20Gi	longhorn
Thanos Compactor	PVC	20Gi	1	20Gi	longhorn
Loki Write	PVC	5Gi	2	10Gi	longhorn
Loki Read	PVC	5Gi	2	10Gi	longhorn
Loki Backend	PVC	5Gi	2	10Gi	longhorn
Grafana	PVC	5Gi	1	5Gi	longhorn
Alertmanager	PVC	10Gi	2	20Gi	hcloud-volumes
Subtotal PVCs				101Gi
Object Storage (S3)
Prometheus metrics	Bucket	~150GB	-	~150GB	Hetzner S3 (fsn1)
Loki logs	Bucket	~2GB	-	~2GB	Hetzner S3 (fsn1)
Subtotal S3				~152GB
TOTAL STORAGE				~253GB

Detailed Component Specifications¶

Prometheus¶

Deployment Type: StatefulSet

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 100m
    memory: 1500Mi
  limits:
    cpu: 1000m
    memory: 3000Mi

Storage:

storageSpec:
  volumeClaimTemplate:
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 3Gi

Rationale:

CPU: 100m baseline, bursts to 1000m during scraping/compaction
Memory: 1500Mi for TSDB, allows growth to 3000Mi during queries
Storage: 3Gi for 3 days retention (~500MB/day compressed)
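
The storage figure can be sanity-checked with a short calculation. This is a rough sketch that assumes the ~500MB/day estimate holds and treats MB and MiB as equal. The `retention_usage` function is a hypothetical helper, not part of any Prometheus tooling.

```python
def retention_usage(daily_mb, retention_days, pvc_gib):
    """Return (expected usage in MiB, fraction of the PVC that usage fills).

    Treats the ~500MB/day figure as MiB for simplicity; this is an
    approximation, not a measured value.
    """
    used_mib = daily_mb * retention_days
    return used_mib, used_mib / (pvc_gib * 1024)


used, fraction = retention_usage(daily_mb=500, retention_days=3, pvc_gib=3)
print(f"{used} MiB expected, {fraction:.0%} of the PVC")
```

Under these assumptions, retained blocks fill roughly half of the volume, which leaves room for the write-ahead log and temporary compaction data.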

QoS Class: Burstable
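
Every component in this reference sets requests below limits, which is why each one lands in the Burstable class. The sketch below shows how Kubernetes derives the class from a pod's containers. It is a simplified model for illustration: `qos_class` is a hypothetical helper that compares quantities as plain strings and considers only the `cpu` and `memory` keys.

```python
def qos_class(containers):
    """Classify a pod from its containers' resources, following Kubernetes QoS rules.

    Each container is a dict with optional 'requests' and 'limits' dicts.
    Simplification: quantities are compared as strings, so '1Gi' and
    '1024Mi' would be treated as different values here.
    """
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        # Kubernetes defaults a missing request to the corresponding limit.
        requests = {**limits, **c.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"


# Values taken from the Prometheus and Thanos Sidecar specifications.
prometheus = {
    "requests": {"cpu": "100m", "memory": "1500Mi"},
    "limits": {"cpu": "1000m", "memory": "3000Mi"},
}
sidecar = {
    "requests": {"cpu": "25m", "memory": "128Mi"},
    "limits": {"cpu": "100m", "memory": "256Mi"},
}
print(qos_class([prometheus, sidecar]))
```

A pod is classed as Guaranteed only when every container sets CPU and memory requests equal to its limits.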

Thanos Query¶

Deployment Type: Deployment

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

Storage: None (stateless)

Rationale:

CPU: 200m for query processing, bursts to 1000m for complex queries
Memory: 512Mi for query cache and result buffering

QoS Class: Burstable

Thanos Store¶

Deployment Type: StatefulSet

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 200m
    memory: 1Gi
  limits:
    cpu: 1000m
    memory: 2Gi

Storage:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi

Rationale:

CPU: 200m for S3 reads and index serving
Memory: 1Gi for index cache (500MB) + chunk cache (500MB)
Storage: 10Gi for cached indexes

QoS Class: Burstable

Thanos Compactor¶

Deployment Type: StatefulSet

Replicas: 1 (single instance)

Resources:

resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi

Storage:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 20Gi

Rationale:

CPU: 500m for compaction and downsampling, bursts to 2000m during heavy work
Memory: 2Gi for block processing, allows growth to 4Gi
Storage: 20Gi workspace for downloading/merging blocks

QoS Class: Burstable

Thanos Sidecar¶

Deployment Type: Container in Prometheus pods

Replicas: 2 (one per Prometheus)

Resources:

resources:
  requests:
    cpu: 25m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 256Mi

Storage: Shares Prometheus PVC

Rationale:

CPU: 25m minimal (background uploads)
Memory: 128Mi for block buffering

QoS Class: Burstable

Loki Write¶

Deployment Type: Deployment (SimpleScalable mode)

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Storage:

persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi

Rationale:

CPU: 100m for log ingestion, bursts to 500m during spikes
Memory: 256Mi for WAL buffering
Storage: 5Gi for Write-Ahead Log

QoS Class: Burstable

Loki Read¶

Deployment Type: Deployment (SimpleScalable mode)

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Storage:

persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi

Rationale:

CPU: 100m for query processing
Memory: 256Mi for query result caching
Storage: 5Gi for cache

QoS Class: Burstable

Loki Backend¶

Deployment Type: StatefulSet (SimpleScalable mode)

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Storage:

persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi

Rationale:

CPU: 100m for index/chunk operations
Memory: 256Mi for metadata
Storage: 5Gi for index metadata

QoS Class: Burstable

Loki Gateway¶

Deployment Type: Deployment

Replicas: 1 (not critical for HA)

Resources:

resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

Storage: None (stateless proxy)

Rationale:

CPU: 50m for HTTP proxying
Memory: 128Mi minimal for nginx

QoS Class: Burstable

Alloy (Grafana Agent)¶

Deployment Type: DaemonSet

Replicas: 4 (one per node)

Resources:

resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

Storage: None

Rationale:

CPU: 50m per node for log collection
Memory: 128Mi for log buffering

QoS Class: Burstable

Grafana¶

Deployment Type: Deployment

Replicas: 1 (UI component)

Resources:

resources:
  requests:
    cpu: 50m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 1Gi

Storage:

persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi

Rationale:

CPU: 50m for UI serving
Memory: 512Mi for dashboard rendering (increased from 256Mi to prevent OOM)
Storage: 5Gi for dashboards and SQLite DB

QoS Class: Burstable

Alertmanager¶

Deployment Type: StatefulSet

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 25m
    memory: 100Mi
  limits:
    cpu: 100m
    memory: 256Mi

Storage:

storage:
  volumeClaimTemplate:
    spec:
      storageClassName: hcloud-volumes
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi

Rationale:

CPU: 25m minimal (low alert volume)
Memory: 100Mi for alert state
Storage: 10Gi (over-provisioned, actual usage ~100MB)

QoS Class: Burstable

Prometheus Operator¶

Deployment Type: Deployment

Replicas: 1

Resources:

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

Storage: None

Rationale:

CPU: 100m for CRD reconciliation
Memory: 128Mi for operator logic

QoS Class: Burstable

kube-state-metrics¶

Deployment Type: Deployment

Replicas: 1

Resources:

resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 100m
    memory: 128Mi

Storage: None

Rationale:

CPU: 10m minimal (exposes metrics)
Memory: 64Mi for metric cache

QoS Class: Burstable

Node Exporter¶

Deployment Type: DaemonSet

Replicas: 4 (one per node)

Resources:

resources:
  requests:
    cpu: 10m
    memory: 32Mi
  limits:
    cpu: 200m
    memory: 64Mi

Storage: None

Rationale:

CPU: 10m per node (very lightweight)
Memory: 32Mi minimal

QoS Class: Burstable

Cluster Impact Analysis¶

Node Resource Allocation¶

Assumptions:

5 agent nodes (4 cax-series in monitoring, 1 cpx21 elsewhere)
Total cluster: 40 CPU cores, 140Gi memory

Monitoring Stack Usage:

CPU requests: 2.65 cores (6.6% of cluster)
CPU limits: 14.0 cores (35.0% of cluster), acceptable because limits only cap bursts
Memory requests: 11.3Gi (8.1% of cluster)
Memory limits: 22.7Gi (16.2% of cluster)
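
Each percentage above is the stack total divided by cluster capacity. A minimal sketch, using the CPU request figure and the assumed 40-core cluster (`cluster_share` is a hypothetical helper name):

```python
def cluster_share(used, capacity):
    """Return usage as a percentage of cluster capacity, rounded to one decimal."""
    return round(used / capacity * 100, 1)


# CPU requests from the summary table against the assumed 40-core cluster.
print(cluster_share(used=2.65, capacity=40))
```

The same function applies to memory once both values are expressed in the same unit.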

Node Distribution (typical):

Primary worker (cax31-fsn1): ~40% of stack
Secondary worker (cax31-nbg1): ~30% of stack
Tertiary workers (cax21 × 2): ~20% of stack
Control plane co-located (cpx21): ~10% of stack (DaemonSets only)

Storage Impact¶

Longhorn Capacity:

Total node storage: 640GB raw disk
Longhorn available: ~400GB (after system reservation)
Monitoring usage: 81Gi of Longhorn PVCs (101Gi total minus 20Gi on hcloud-volumes) × 2 Longhorn volume replicas = ~162GB actual
Percentage: 40.5% of Longhorn capacity
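
The Longhorn figures can be reproduced as follows. This sketch assumes the factor of 2 is the Longhorn volume replica count and, like the list above, treats Gi and GB as interchangeable. `longhorn_usage` is a hypothetical helper name.

```python
def longhorn_usage(pvc_gib, replica_count, available_gb):
    """Return (replicated footprint, percentage of available Longhorn capacity).

    Mirrors the document's arithmetic, which treats Gi and GB as
    interchangeable for this estimate.
    """
    footprint = pvc_gib * replica_count
    return footprint, round(footprint / available_gb * 100, 1)


# 101Gi of PVCs in total, minus the 20Gi Alertmanager volumes on hcloud-volumes.
longhorn_pvcs = 101 - 20
print(longhorn_usage(longhorn_pvcs, replica_count=2, available_gb=400))
```

Raising the replica count to 3 under the same assumptions would put the footprint at 243GB, about 61% of the available capacity.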

Hetzner Volumes:

Alertmanager: 20Gi (10Gi × 2 replicas)

S3 Storage:

~152GB steady state
Cost: ~€3.50/month

Scaling Recommendations¶

Vertical Scaling (Increase Resources)¶

When to scale up:

CPU throttling >20%
OOMKilled events
Persistent high memory usage (>80% of requests)
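
The three triggers can be expressed as a single check. The sketch below is illustrative. It assumes the throttling percentage is the share of throttled CFS periods (for example from the cAdvisor counters `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`), and it evaluates one snapshot, so deciding whether the memory condition is persistent is left to the caller. `needs_scale_up` is a hypothetical helper.

```python
def needs_scale_up(throttled_periods, total_periods, oom_kills,
                   memory_used_mib, memory_request_mib):
    """Return the scale-up triggers from the list above that currently fire."""
    reasons = []
    # CPU throttling: share of CFS periods in which the container was throttled.
    if total_periods and throttled_periods / total_periods > 0.20:
        reasons.append("cpu throttling")
    # Any OOMKilled event counts as a trigger.
    if oom_kills > 0:
        reasons.append("oom killed")
    # Memory usage above 80% of the request.
    if memory_used_mib / memory_request_mib > 0.80:
        reasons.append("high memory")
    return reasons


print(needs_scale_up(throttled_periods=30, total_periods=100, oom_kills=0,
                     memory_used_mib=1000, memory_request_mib=1500))
```

An empty list means none of the listed conditions holds for that sample.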

How to scale:

# In config.yaml
resources:
  prometheus:
    requests:
      cpu: 200m  # Double from 100m
      memory: 3000Mi  # Double from 1500Mi; 3Gi (3072Mi) would exceed the 3000Mi limit
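
When raising a request, check it against the existing limit, because Kubernetes rejects a container whose request exceeds its limit. A minimal sketch of that check, using the Prometheus memory limit of 3000Mi (`fits_limit` is a hypothetical helper):

```python
def fits_limit(request_mib, limit_mib):
    """Return True when a request is valid against its limit (request <= limit)."""
    return request_mib <= limit_mib


PROMETHEUS_MEMORY_LIMIT_MIB = 3000

# Doubling 1500Mi gives 3000Mi, which still fits the limit.
print(fits_limit(2 * 1500, PROMETHEUS_MEMORY_LIMIT_MIB))
# Writing the doubled value as 3Gi means 3072Mi, which does not fit.
print(fits_limit(3 * 1024, PROMETHEUS_MEMORY_LIMIT_MIB))
```

If a request must go above the current limit, raise the limit in the same change.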

Horizontal Scaling (Increase Replicas)¶

Components that scale horizontally:

✅ Prometheus (2 → 3 replicas)
✅ Thanos Query (2 → 3 replicas)
✅ Thanos Store (2 → 3 replicas)
✅ Loki Write/Read/Backend (2 → 3 replicas)
⚠️ Thanos Compactor (must remain 1 replica)
⚠️ Grafana (can increase, but not necessary)

How to scale:

# In config.yaml
replicas:
  prometheus: 3  # Increased from 2
  thanosQuery: 3

Storage Scaling¶

PVC Expansion:

# Edit the PVC (works only if the storage class allows volume expansion)
kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring

# In the editor, change spec.resources.requests.storage from 3Gi to 6Gi

S3 Scaling: Automatic (object storage needs no capacity provisioning; cost grows with the amount of data stored)

Minimum Requirements¶

For testing/development:

1 node, 4 CPU, 8Gi RAM
Reduce replicas to 1 for all components
Reduce Prometheus retention to 1d
Total: ~1 core, ~4Gi memory, ~30Gi storage

For small production (current configuration):

3+ nodes, 8 CPU, 16Gi RAM per node
2 replicas for critical components
3-day Prometheus retention
Total: ~2.7 cores, ~11.3Gi memory, ~101Gi PVC storage

For large production (100+ nodes):

5+ dedicated monitoring nodes
3 replicas for all components
Consider Prometheus sharding
Total: ~10 cores, ~50Gi memory, ~500Gi storage

Resource Optimization Opportunities¶

Current optimizations (implemented October 2025):

✅ Loki right-sized (80% reduction from defaults)
✅ Prometheus retention reduced (7d → 3d with Thanos)
✅ Thanos downsampling enabled

Future optimizations:

Alertmanager PVC: 10Gi → 1Gi (frees 90% of the allocation; actual usage is ~100MB)
Grafana PVC: 5Gi → 1Gi (frees 80% of the allocation)
Node Exporter: Move to system namespace
