Resource Requirements Reference

Complete resource specifications for all monitoring stack components.

Summary Tables

CPU Requirements

| Component | Replicas | CPU Request (each) | CPU Limit (each) | Total Request | Total Limit |
|---|---|---|---|---|---|
| **Metrics Stack** | | | | | |
| Prometheus | 2 | 100m | 1000m | 200m | 2000m |
| Thanos Query | 2 | 200m | 1000m | 400m | 2000m |
| Thanos Store | 2 | 200m | 1000m | 400m | 2000m |
| Thanos Compactor | 1 | 500m | 2000m | 500m | 2000m |
| Thanos Sidecar | 2 | 25m | 100m | 50m | 200m |
| **Logs Stack** | | | | | |
| Loki Write | 2 | 100m | 500m | 200m | 1000m |
| Loki Read | 2 | 100m | 500m | 200m | 1000m |
| Loki Backend | 2 | 100m | 500m | 200m | 1000m |
| Loki Gateway | 1 | 50m | 200m | 50m | 200m |
| Alloy (DaemonSet) | 4 nodes | 50m | 200m | 200m | 800m |
| **Visualization & Alerting** | | | | | |
| Grafana | 1 | 50m | 500m | 50m | 500m |
| Alertmanager | 2 | 25m | 100m | 50m | 200m |
| **Operators & Exporters** | | | | | |
| Prometheus Operator | 1 | 100m | 200m | 100m | 200m |
| kube-state-metrics | 1 | 10m | 100m | 10m | 100m |
| Node Exporter | 4 nodes | 10m | 200m | 40m | 800m |
| **TOTAL** | | | | ~2.65 cores | ~14.0 cores |
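
The totals are just replica-weighted sums of the rows; a quick recomputation in millicores (values copied from the table, in row order):

```shell
# Total CPU requests/limits in millicores: replicas x per-pod value,
# summed over every component row in the table above.
req=$(( 2*100 + 2*200 + 2*200 + 1*500 + 2*25 \
      + 2*100 + 2*100 + 2*100 + 1*50 + 4*50 \
      + 1*50 + 2*25 + 1*100 + 1*10 + 4*10 ))
lim=$(( 2*1000 + 2*1000 + 2*1000 + 1*2000 + 2*100 \
      + 2*500 + 2*500 + 2*500 + 1*200 + 4*200 \
      + 1*500 + 2*100 + 1*200 + 1*100 + 4*200 ))
echo "requests=${req}m limits=${lim}m"   # requests=2650m limits=14000m
```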

Memory Requirements

| Component | Replicas | Memory Request (each) | Memory Limit (each) | Total Request | Total Limit |
|---|---|---|---|---|---|
| **Metrics Stack** | | | | | |
| Prometheus | 2 | 1500Mi | 3000Mi | 3000Mi | 6000Mi |
| Thanos Query | 2 | 512Mi | 1Gi | 1024Mi | 2Gi |
| Thanos Store | 2 | 1Gi | 2Gi | 2Gi | 4Gi |
| Thanos Compactor | 1 | 2Gi | 4Gi | 2Gi | 4Gi |
| Thanos Sidecar | 2 | 128Mi | 256Mi | 256Mi | 512Mi |
| **Logs Stack** | | | | | |
| Loki Write | 2 | 256Mi | 512Mi | 512Mi | 1Gi |
| Loki Read | 2 | 256Mi | 512Mi | 512Mi | 1Gi |
| Loki Backend | 2 | 256Mi | 512Mi | 512Mi | 1Gi |
| Loki Gateway | 1 | 128Mi | 256Mi | 128Mi | 256Mi |
| Alloy (DaemonSet) | 4 nodes | 128Mi | 256Mi | 512Mi | 1Gi |
| **Visualization & Alerting** | | | | | |
| Grafana | 1 | 512Mi | 1Gi | 512Mi | 1Gi |
| Alertmanager | 2 | 100Mi | 256Mi | 200Mi | 512Mi |
| **Operators & Exporters** | | | | | |
| Prometheus Operator | 1 | 128Mi | 256Mi | 128Mi | 256Mi |
| kube-state-metrics | 1 | 64Mi | 128Mi | 64Mi | 128Mi |
| Node Exporter | 4 nodes | 32Mi | 64Mi | 128Mi | 256Mi |
| **TOTAL** | | | | ~12.5 GB | ~24.5 GB |
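
The same replica-weighted sum works for memory, with Gi values converted at 1Gi = 1024Mi (the GB totals in the table are rounded):

```shell
# Total memory requests/limits in Mi: replicas x per-pod value,
# summed over every component row (Gi converted at 1Gi = 1024Mi).
req=$(( 2*1500 + 2*512 + 2*1024 + 1*2048 + 2*128 \
      + 2*256 + 2*256 + 2*256 + 1*128 + 4*128 \
      + 1*512 + 2*100 + 1*128 + 1*64 + 4*32 ))
lim=$(( 2*3000 + 2*1024 + 2*2048 + 1*4096 + 2*256 \
      + 2*512 + 2*512 + 2*512 + 1*256 + 4*256 \
      + 1*1024 + 2*256 + 1*256 + 1*128 + 4*64 ))
echo "requests=${req}Mi limits=${lim}Mi"   # requests=11584Mi limits=23280Mi
```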

Storage Requirements

| Component | Type | Size (each) | Replicas | Total | Storage Class |
|---|---|---|---|---|---|
| **Persistent Volumes** | | | | | |
| Prometheus | PVC | 3Gi | 2 | 6Gi | longhorn |
| Thanos Store | PVC | 10Gi | 2 | 20Gi | longhorn |
| Thanos Compactor | PVC | 20Gi | 1 | 20Gi | longhorn |
| Loki Write | PVC | 5Gi | 2 | 10Gi | longhorn |
| Loki Read | PVC | 5Gi | 2 | 10Gi | longhorn |
| Loki Backend | PVC | 5Gi | 2 | 10Gi | longhorn |
| Grafana | PVC | 5Gi | 1 | 5Gi | longhorn |
| Alertmanager | PVC | 10Gi | 2 | 20Gi | hcloud-volumes |
| **Subtotal PVCs** | | | | 101Gi | |
| **Object Storage (S3)** | | | | | |
| Prometheus metrics | Bucket | ~150GB | - | ~150GB | Hetzner S3 (fsn1) |
| Loki logs | Bucket | ~2GB | - | ~2GB | Hetzner S3 (fsn1) |
| **Subtotal S3** | | | | ~152GB | |
| **TOTAL STORAGE** | | | | ~253GB | |
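
The PVC subtotal can be recomputed directly from the table rows:

```shell
# PVC subtotal in Gi: per-component size x replica count,
# in the same row order as the table above.
pvc_gi=$(( 3*2 + 10*2 + 20*1 + 5*2 + 5*2 + 5*2 + 5*1 + 10*2 ))
echo "${pvc_gi}Gi"   # 101Gi
```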

Detailed Component Specifications

Prometheus

Deployment Type: StatefulSet

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 100m
    memory: 1500Mi
  limits:
    cpu: 1000m
    memory: 3000Mi

Storage:

storageSpec:
  volumeClaimTemplate:
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 3Gi

Rationale:

  • CPU: 100m baseline, bursts to 1000m during scraping/compaction

  • Memory: 1500Mi for TSDB, allows growth to 3000Mi during queries

  • Storage: 3Gi for 3-day retention (~500MB/day compressed)

QoS Class: Burstable
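
As a sanity check on the storage figure, a rough sizing sketch; the 2x headroom factor for the WAL and compaction scratch space is an assumption, not a measured value:

```shell
# Rough Prometheus PVC sizing: retention x compressed daily ingest,
# doubled for WAL and compaction scratch space (2x headroom is an assumption).
retention_days=3
daily_ingest_mb=500   # ~500MB/day compressed, per the rationale above
needed_mb=$(( retention_days * daily_ingest_mb * 2 ))
echo "${needed_mb}MB"   # 3000MB, in line with the 3Gi PVC request
```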


Thanos Query

Deployment Type: Deployment

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 200m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi

Storage: None (stateless)

Rationale:

  • CPU: 200m for query processing, bursts to 1000m for complex queries

  • Memory: 512Mi for query cache and result buffering

QoS Class: Burstable


Thanos Store

Deployment Type: StatefulSet

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 200m
    memory: 1Gi
  limits:
    cpu: 1000m
    memory: 2Gi

Storage:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi

Rationale:

  • CPU: 200m for S3 reads and index serving

  • Memory: 1Gi for index cache (500MB) + chunk cache (500MB)

  • Storage: 10Gi for cached indexes

QoS Class: Burstable


Thanos Compactor

Deployment Type: StatefulSet

Replicas: 1 (single instance)

Resources:

resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: 2000m
    memory: 4Gi

Storage:

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      storageClassName: longhorn
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 20Gi

Rationale:

  • CPU: 500m for compaction and downsampling, bursts to 2000m during heavy work

  • Memory: 2Gi for block processing, allows growth to 4Gi

  • Storage: 20Gi workspace for downloading/merging blocks

QoS Class: Burstable


Thanos Sidecar

Deployment Type: Container in Prometheus pods

Replicas: 2 (one per Prometheus)

Resources:

resources:
  requests:
    cpu: 25m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 256Mi

Storage: Shares Prometheus PVC

Rationale:

  • CPU: 25m minimal (background uploads)

  • Memory: 128Mi for block buffering

QoS Class: Burstable


Loki Write

Deployment Type: Deployment (SimpleScalable mode)

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Storage:

persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi

Rationale:

  • CPU: 100m for log ingestion, bursts to 500m during spikes

  • Memory: 256Mi for WAL buffering

  • Storage: 5Gi for Write-Ahead Log

QoS Class: Burstable


Loki Read

Deployment Type: Deployment (SimpleScalable mode)

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Storage:

persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi

Rationale:

  • CPU: 100m for query processing

  • Memory: 256Mi for query result caching

  • Storage: 5Gi for cache

QoS Class: Burstable


Loki Backend

Deployment Type: StatefulSet (SimpleScalable mode)

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Storage:

persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi

Rationale:

  • CPU: 100m for index/chunk operations

  • Memory: 256Mi for metadata

  • Storage: 5Gi for index metadata

QoS Class: Burstable


Loki Gateway

Deployment Type: Deployment

Replicas: 1 (not critical for HA)

Resources:

resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

Storage: None (stateless proxy)

Rationale:

  • CPU: 50m for HTTP proxying

  • Memory: 128Mi minimal for nginx

QoS Class: Burstable


Alloy (Grafana Agent)

Deployment Type: DaemonSet

Replicas: 4 (one per node)

Resources:

resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

Storage: None

Rationale:

  • CPU: 50m per node for log collection

  • Memory: 128Mi for log buffering

QoS Class: Burstable


Grafana

Deployment Type: Deployment

Replicas: 1 (UI component)

Resources:

resources:
  requests:
    cpu: 50m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 1Gi

Storage:

persistence:
  enabled: true
  storageClass: longhorn
  size: 5Gi

Rationale:

  • CPU: 50m for UI serving

  • Memory: 512Mi for dashboard rendering (increased from 256Mi to prevent OOM)

  • Storage: 5Gi for dashboards and SQLite DB

QoS Class: Burstable


Alertmanager

Deployment Type: StatefulSet

Replicas: 2 (for HA)

Resources:

resources:
  requests:
    cpu: 25m
    memory: 100Mi
  limits:
    cpu: 100m
    memory: 256Mi

Storage:

storage:
  volumeClaimTemplate:
    spec:
      storageClassName: hcloud-volumes
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi

Rationale:

  • CPU: 25m minimal (low alert volume)

  • Memory: 100Mi for alert state

  • Storage: 10Gi (over-provisioned, actual usage ~100MB)

QoS Class: Burstable


Prometheus Operator

Deployment Type: Deployment

Replicas: 1

Resources:

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

Storage: None

Rationale:

  • CPU: 100m for CRD reconciliation

  • Memory: 128Mi for operator logic

QoS Class: Burstable


kube-state-metrics

Deployment Type: Deployment

Replicas: 1

Resources:

resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 100m
    memory: 128Mi

Storage: None

Rationale:

  • CPU: 10m minimal (exposes metrics)

  • Memory: 64Mi for metric cache

QoS Class: Burstable


Node Exporter

Deployment Type: DaemonSet

Replicas: 4 (one per node)

Resources:

resources:
  requests:
    cpu: 10m
    memory: 32Mi
  limits:
    cpu: 200m
    memory: 64Mi

Storage: None

Rationale:

  • CPU: 10m per node (very lightweight)

  • Memory: 32Mi minimal

QoS Class: Burstable


Cluster Impact Analysis

Node Resource Allocation

Assumptions:

  • 5 agent nodes (4 cax-series in monitoring, 1 cpx21 elsewhere)

  • Total cluster: 40 CPU cores, 140Gi memory

Monitoring Stack Usage:

  • CPU requests: 2.65 cores (6.6% of cluster)

  • CPU limits: 14.0 cores (35% of cluster) - acceptable for bursting

  • Memory requests: 12.5Gi (8.9% of cluster)

  • Memory limits: 24.5Gi (17.5% of cluster)
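
The request percentages are straightforward ratios against the assumed 40-core / 140Gi cluster:

```shell
# Request share of the assumed cluster capacity (40 cores, 140Gi memory).
cpu_pct=$(awk 'BEGIN { printf "%.1f", 100 * 2.65 / 40 }')
mem_pct=$(awk 'BEGIN { printf "%.1f", 100 * 12.5 / 140 }')
echo "cpu=${cpu_pct}% mem=${mem_pct}%"   # cpu=6.6% mem=8.9%
```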

Node Distribution (typical):

  • Primary worker (cax31-fsn1): ~40% of stack

  • Secondary worker (cax31-nbg1): ~30% of stack

  • Tertiary workers (cax21 × 2): ~20% of stack

  • Control plane co-located (cpx21): ~10% of stack (DaemonSets only)

Storage Impact

Longhorn Capacity:

  • Total node storage: 640GB raw disk

  • Longhorn available: ~400GB (after system reservation)

  • Monitoring usage: 81Gi of Longhorn-backed PVCs (Alertmanager's 20Gi lives on hcloud-volumes) × 2 Longhorn replicas ≈ 162GB on disk

  • Percentage: 40.5% of Longhorn capacity

Hetzner Volumes:

  • Alertmanager: 20Gi (10Gi × 2 replicas)

S3 Storage:

  • ~152GB steady state

  • Cost: ~€3.50/month

Scaling Recommendations

Vertical Scaling (Increase Resources)

When to scale up:

  • CPU throttling >20%

  • OOMKilled events

  • Persistent high memory usage (>80% of requests)
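
The CPU-throttling threshold above can be checked in Prometheus itself; this query uses the standard cAdvisor metrics and returns pods throttled in more than 20% of CPU periods over a 5-minute window:

```promql
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="monitoring"}[5m]))
  /
sum by (pod) (rate(container_cpu_cfs_periods_total{namespace="monitoring"}[5m]))
  > 0.20
```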

How to scale:

# In config.yaml
resources:
  prometheus:
    requests:
      cpu: 200m  # Double from 100m
      memory: 3Gi  # Double from 1500Mi

Horizontal Scaling (Increase Replicas)

Components that scale horizontally:

  • ✅ Prometheus (2 → 3 replicas)

  • ✅ Thanos Query (2 → 3 replicas)

  • ✅ Thanos Store (2 → 3 replicas)

  • ✅ Loki Write/Read/Backend (2 → 3 replicas)

  • ⚠️ Thanos Compactor (must remain 1 replica)

  • ⚠️ Grafana (can increase, but not necessary)

How to scale:

# In config.yaml
replicas:
  prometheus: 3  # Increased from 2
  thanosQuery: 3

Storage Scaling

PVC Expansion:

# Edit PVC (if storage class supports expansion)
kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring

# Change storage: 3Gi → 6Gi
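
For a non-interactive change, a merge patch achieves the same expansion (same PVC name as above; this assumes the storage class allows online expansion, as longhorn does by default):

```shell
kubectl patch pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 \
  -n monitoring --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"6Gi"}}}}'
```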

S3 Scaling: Automatic (bucket capacity is effectively unlimited; only cost grows with usage)

Minimum Requirements

For testing/development:

  • 1 node, 4 CPU, 8Gi RAM

  • Reduce replicas to 1 for all components

  • Reduce Prometheus retention to 1d

  • Total: ~1 core, ~4Gi memory, ~30Gi storage

For small production (current configuration):

  • 3+ nodes, 8 CPU, 16Gi RAM per node

  • 2 replicas for critical components

  • 3-day Prometheus retention

  • Total: ~2.7 cores, ~12.5Gi memory, ~101Gi storage

For large production (100+ nodes):

  • 5+ dedicated monitoring nodes

  • 3 replicas for all components

  • Consider Prometheus sharding

  • Total: ~10 cores, ~50Gi memory, ~500Gi storage

Resource Optimization Opportunities

Current optimizations (implemented October 2025):

  • ✅ Loki right-sized (80% reduction from defaults)

  • ✅ Prometheus retention reduced (7d → 3d with Thanos)

  • ✅ Thanos downsampling enabled

Future optimizations:

  • Alertmanager PVC: 10Gi → 1Gi (~90% currently unused)

  • Grafana PVC: 5Gi → 1Gi (~80% currently unused)

  • Node Exporter: Move to system namespace

See Also