Configuration Reference

Complete reference for dp-infra/monitoring/config.yaml and .env overrides.

Configuration Files

config.yaml (Primary Configuration)

Main configuration file with all monitoring stack settings. Committed to git.

Location: dp-infra/monitoring/config.yaml

.env (Environment Overrides)

Optional overrides for sensitive data or environment-specific settings. NOT committed to git.

Location: dp-infra/monitoring/.env

Template: dp-infra/monitoring/.env.example

Configuration Schema

Top-Level Structure

interface MonitoringConfig {
  namespace: string;           // Kubernetes namespace
  versions: VersionConfig;     // Component versions
  domains: DomainConfig;       // Ingress domains
  s3: S3Config;                // Object storage
  smtp: SmtpConfig;            // Email alerting
  retention: RetentionConfig;  // Data retention policies
  storage: StorageConfig;      // PVC sizes
  replicas: ReplicaConfig;     // Replica counts
  resources: ResourceConfig;   // CPU/memory limits
}
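
The sub-config types named above are not shown on this page. As a rough sketch, some of them could look like the following, with shapes inferred from the field reference below. The field names come from this page, but the exact type definitions in dp-infra/monitoring are an assumption and may differ. StorageConfig, ReplicaConfig and ResourceConfig would follow the same pattern.

```typescript
// Shapes inferred from the field reference on this page (assumption:
// the real type definitions in dp-infra/monitoring may differ).
interface VersionConfig {
  prometheusStack: string; // kube-prometheus-stack Helm chart version
  loki: string;            // Loki Helm chart version
  alloy: string;           // Alloy Helm chart version
  thanos: string;          // Thanos image tag
}

interface DomainConfig {
  grafana: string;         // Grafana UI domain
}

interface S3Config {
  endpoint: string;           // full URL, with https://
  endpointNoProtocol: string; // same endpoint without the protocol
  region: string;
  buckets: { thanos: string; loki: string };
}

interface SmtpConfig {
  host: string;
  port: number;
  from: string;
  username: string;
  password: string;
  requireTls: boolean;
}

interface RetentionConfig {
  prometheus: string;      // duration string, e.g. "3d"
  prometheusS3Raw: number; // days
  prometheusS35m: number;  // days
  prometheusS31h: number;  // days
  loki: string;            // duration string, e.g. "744h"
}

// Example value that type-checks against RetentionConfig.
const exampleRetention: RetentionConfig = {
  prometheus: "3d",
  prometheusS3Raw: 30,
  prometheusS35m: 180,
  prometheusS31h: 730,
  loki: "744h",
};
```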

Field Reference

namespace

Kubernetes namespace for all monitoring components.

namespace: monitoring
  • Type: string

  • Default: monitoring

  • Override: Not recommended (the namespace is hardcoded in many places)


versions

Component versions for Helm charts and Docker images.

versions:
  prometheusStack: v69.2.0  # kube-prometheus-stack Helm chart
  loki: 6.23.0              # Loki Helm chart
  alloy: v1.6.1             # Alloy Helm chart
  thanos: v0.37.2           # Thanos Docker image tag
  • Type: VersionConfig

  • Fields:

    • prometheusStack: kube-prometheus-stack Helm chart version

    • loki: Loki Helm chart version

    • alloy: Grafana Alloy (successor to Grafana Agent) Helm chart version

    • thanos: Thanos image tag (quay.io/thanos/thanos)

  • Update Frequency: Monthly (check for security patches)


domains

Ingress domain names for external access.

domains:
  grafana: grafana.ops.kup6s.net
  • Type: DomainConfig

  • Fields:

    • grafana: Grafana UI domain (Traefik ingress with Let’s Encrypt)

  • DNS Required: A/AAAA records must point to cluster load balancer


s3

Hetzner Object Storage configuration for metrics and logs.

s3:
  endpoint: https://fsn1.your-objectstorage.com
  endpointNoProtocol: fsn1.your-objectstorage.com
  region: fsn1
  buckets:
    thanos: metrics-thanos-kup6s
    loki: logs-loki-kup6s
  • Type: S3Config

  • Fields:

    • endpoint: Full S3 endpoint URL (with https://)

    • endpointNoProtocol: Endpoint without protocol (for Loki config)

    • region: Hetzner region code (fsn1, nbg1, hel1)

    • buckets.thanos: Metrics storage bucket name

    • buckets.loki: Log storage bucket name

  • Bucket Naming: Must be globally unique (suffix with -kup6s)

  • Credentials: Managed via ExternalSecret (from crossplane-system/hetzner-s3-credentials)
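
Because endpoint and endpointNoProtocol carry the same host in two forms, they can drift apart when one is edited by hand. A minimal sketch of a consistency check follows. The helper names stripProtocol and s3EndpointsAgree are hypothetical and not part of the project.

```typescript
// Hypothetical helper: drop a leading "scheme://" from an endpoint URL.
function stripProtocol(endpointUrl: string): string {
  return endpointUrl.replace(/^[a-z][a-z0-9+.-]*:\/\//i, "");
}

// Hypothetical check: both endpoint fields must describe the same host.
function s3EndpointsAgree(s3: { endpoint: string; endpointNoProtocol: string }): boolean {
  return stripProtocol(s3.endpoint) === s3.endpointNoProtocol;
}

// Values copied from the s3 block above.
const s3Endpoints = {
  endpoint: "https://fsn1.your-objectstorage.com",
  endpointNoProtocol: "fsn1.your-objectstorage.com",
};

const endpointsConsistent = s3EndpointsAgree(s3Endpoints); // true
```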


smtp

Email server configuration for Alertmanager notifications.

smtp:
  host: smtp.example.com
  port: 587
  from: alerts@example.com
  username: smtp-user
  password: ${SMTP_PASSWORD}  # Override in .env
  requireTls: true
  • Type: SmtpConfig

  • Fields:

    • host: SMTP server hostname

    • port: SMTP port (587 for STARTTLS, 465 for implicit TLS)

    • from: Sender email address

    • username: SMTP authentication username

    • password: SMTP password (override in .env)

    • requireTls: Enforce TLS connection

  • Security: Never commit the password to git; set it through the .env override
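
The password field uses a ${SMTP_PASSWORD} placeholder. This page does not describe how the placeholder is substituted, so the following is only a sketch of one plausible approach. The function resolvePlaceholders is hypothetical, and the environment is passed in as a plain object instead of being read from the process.

```typescript
// Hypothetical helper: replace every ${VAR} placeholder with its value
// from the given environment map, failing loudly when a variable is unset.
function resolvePlaceholders(
  rawValue: string,
  env: Record<string, string | undefined>
): string {
  return rawValue.replace(/\$\{([A-Z0-9_]+)\}/g, (_match: string, varName: string) => {
    const resolved = env[varName];
    if (resolved === undefined) {
      throw new Error(`Missing environment variable: ${varName}`);
    }
    return resolved;
  });
}

// The placeholder from config.yaml, resolved against an example .env value.
const resolvedSmtpPassword = resolvePlaceholders("${SMTP_PASSWORD}", {
  SMTP_PASSWORD: "secret123",
});
```

Throwing on an unset variable is a deliberate choice in this sketch: a literal "${SMTP_PASSWORD}" reaching Alertmanager would otherwise fail much later and less visibly.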


retention

Data retention policies for metrics and logs.

retention:
  prometheus: 3d           # Prometheus local storage
  prometheusS3Raw: 30      # Thanos S3 raw data (days)
  prometheusS35m: 180      # Thanos S3 5-min downsampled (days)
  prometheusS31h: 730      # Thanos S3 1-hour downsampled (days)
  loki: 744h               # Loki retention (hours)
  • Type: RetentionConfig

  • Fields:

    • prometheus: Local Prometheus retention (duration string: 3d, 7d, etc.)

    • prometheusS3Raw: S3 raw metrics retention (days)

    • prometheusS35m: S3 downsampled (5-min) retention (days)

    • prometheusS31h: S3 downsampled (1-hour) retention (days)

    • loki: Loki log retention (duration string: 744h = 31 days)

  • Cost vs Retention: Longer retention → higher S3 costs

  • Recommendation: Keep the current values unless storage costs become an issue
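
The retention fields mix duration strings (3d, 744h) with plain day counts, which makes them awkward to compare. The sketch below normalizes a duration string to days. The helper durationToDays is hypothetical and handles only the h, d and w suffixes, not the full duration syntax that Prometheus or Loki accept.

```typescript
// Hypothetical helper: convert a duration string such as "744h" or "3d"
// into a number of days. Only the h, d and w suffixes are supported.
function durationToDays(duration: string): number {
  const match = /^(\d+)(h|d|w)$/.exec(duration);
  if (match === null) {
    throw new Error(`Unsupported duration: ${duration}`);
  }
  const amount = Number(match[1]);
  const unit = String(match[2]);
  const hoursPerUnit = unit === "h" ? 1 : unit === "d" ? 24 : 168;
  return (amount * hoursPerUnit) / 24;
}

// Retention tiers from the block above, all expressed in days.
const retentionInDays = {
  prometheusLocal: durationToDays("3d"), // 3
  s3Raw: 30,
  s3FiveMinute: 180,
  s3OneHour: 730,
  loki: durationToDays("744h"),          // 744 / 24 = 31
};
```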


storage

Persistent volume sizes for each component.

storage:
  prometheus: 3Gi
  grafana: 10Gi
  alertmanager: 1Gi
  lokiBackend: 10Gi
  lokiWrite: 10Gi
  thanosStore: 10Gi
  thanosCompactor: 20Gi
  • Type: StorageConfig

  • Fields: All PVC size requests (Gi = Gibibytes)

  • Storage Class: longhorn (2 replicas)

  • Resizing: PVCs can be expanded but NEVER shrunk

  • Monitoring: Check Longhorn UI for actual usage
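
Since PVCs can grow but never shrink, a change to a storage value is worth checking before it is applied. A minimal sketch of such a guard follows. The helpers parseGi and isAllowedResize are hypothetical and accept only whole-number Gi sizes like the ones used on this page.

```typescript
// Hypothetical helper: parse a size such as "10Gi" into a number of GiB.
function parseGi(size: string): number {
  const match = /^(\d+)Gi$/.exec(size);
  if (match === null) {
    throw new Error(`Unsupported size: ${size}`);
  }
  return Number(match[1]);
}

// Hypothetical guard: a resize is allowed only if the PVC stays the same or grows.
function isAllowedResize(currentSize: string, requestedSize: string): boolean {
  return parseGi(requestedSize) >= parseGi(currentSize);
}

const growPrometheus = isAllowedResize("3Gi", "10Gi");  // true: expansion
const shrinkGrafana = isAllowedResize("10Gi", "5Gi");   // false: shrinking
```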


replicas

High-availability replica counts.

replicas:
  prometheus: 2        # Query HA
  alertmanager: 3      # Quorum-based
  grafana: 1           # UI only
  lokiBackend: 1       # Index management
  lokiRead: 2          # Query HA
  lokiWrite: 2         # Ingestion HA
  thanosQuery: 2       # Query HA
  thanosStore: 2       # Query HA
  • Type: ReplicaConfig

  • HA Components (>= 2 replicas): prometheus, alertmanager, lokiRead, lokiWrite, thanosQuery, thanosStore

  • Single Instance (1 replica): grafana, lokiBackend, thanosCompactor (thanosCompactor has no entry in the replicas block above)

  • Auto-Scaled (not applicable): None currently
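
The HA and single-instance grouping above follows directly from the replica counts. As an illustration, the sketch below derives both lists from the configured values. The helper partitionByHa is hypothetical, and the 2-replica threshold is taken from the ">= 2 replicas" rule stated above.

```typescript
// Replica counts copied from the replicas block above.
const replicaCounts: Record<string, number> = {
  prometheus: 2,
  alertmanager: 3,
  grafana: 1,
  lokiBackend: 1,
  lokiRead: 2,
  lokiWrite: 2,
  thanosQuery: 2,
  thanosStore: 2,
};

// Hypothetical helper: split component names by whether they reach the HA threshold.
function partitionByHa(
  counts: Record<string, number>,
  minReplicas: number
): { ha: string[]; single: string[] } {
  const components = Object.keys(counts);
  return {
    ha: components.filter((component) => (counts[component] ?? 0) >= minReplicas),
    single: components.filter((component) => (counts[component] ?? 0) < minReplicas),
  };
}

const haSplit = partitionByHa(replicaCounts, 2);
// haSplit.ha     -> prometheus, alertmanager, lokiRead, lokiWrite, thanosQuery, thanosStore
// haSplit.single -> grafana, lokiBackend
```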


resources

CPU and memory requests/limits for all components.

resources:
  prometheus:
    requests:
      cpu: 100m
      memory: 1500Mi
    limits:
      cpu: 2000m
      memory: 3000Mi
  • Type: ResourceConfig

  • Components: prometheus, grafana, alertmanager, lokiBackend, lokiRead, lokiWrite, lokiGateway, alloy, thanosQuery, thanosStore, thanosCompactor

  • Format:

    • CPU: 100m = 0.1 cores, 1000m = 1 core

    • Memory: 256Mi = 256 MiB, 1Gi = 1 GiB

  • Sizing: Based on actual usage analysis (October 2025)

  • Update: See Resource Optimization
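
The CPU and memory formats above can be turned into plain numbers to sanity-check that requests do not exceed limits. The sketch below does that for the prometheus entry. All helper names are hypothetical. It accepts only the millicore form (100m) and Mi/Gi memory suffixes, which is a subset of what Kubernetes itself allows.

```typescript
// Hypothetical helper: "100m" -> 0.1 cores. Only the millicore form is handled.
function parseCpuCores(cpu: string): number {
  const match = /^(\d+)m$/.exec(cpu);
  if (match === null) {
    throw new Error(`Unsupported CPU value: ${cpu}`);
  }
  return Number(match[1]) / 1000;
}

// Hypothetical helper: "256Mi" -> 256, "1Gi" -> 1024 (both in MiB).
function parseMemoryMiB(memory: string): number {
  const match = /^(\d+)(Mi|Gi)$/.exec(memory);
  if (match === null) {
    throw new Error(`Unsupported memory value: ${memory}`);
  }
  const amount = Number(match[1]);
  return String(match[2]) === "Gi" ? amount * 1024 : amount;
}

interface ResourceSpec {
  requests: { cpu: string; memory: string };
  limits: { cpu: string; memory: string };
}

// Hypothetical check: requests must not exceed limits for CPU or memory.
function requestsWithinLimits(spec: ResourceSpec): boolean {
  return (
    parseCpuCores(spec.requests.cpu) <= parseCpuCores(spec.limits.cpu) &&
    parseMemoryMiB(spec.requests.memory) <= parseMemoryMiB(spec.limits.memory)
  );
}

// The prometheus entry from the resources block above.
const prometheusResources: ResourceSpec = {
  requests: { cpu: "100m", memory: "1500Mi" },
  limits: { cpu: "2000m", memory: "3000Mi" },
};

const prometheusSpecOk = requestsWithinLimits(prometheusResources); // true
```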

Environment Variable Overrides

Override any config.yaml value using environment variables in .env:

# .env (not committed to git)
SMTP_PASSWORD=secret123
HETZNER_S3_ACCESS_KEY=ABCDEF123456
HETZNER_S3_SECRET_KEY=secret789

Pattern: An UPPER_SNAKE_CASE environment variable overrides the corresponding camelCase config field.

Precedence: .env > config.yaml (a value set in .env wins over the same field in config.yaml)
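
The exact rule for mapping a config field to its variable name is not spelled out here. The sketch below assumes that path segments are converted from camelCase to UPPER_SNAKE_CASE and joined with underscores, so smtp.password becomes SMTP_PASSWORD. The helpers toUpperSnake, envNameFor and effectiveValue are hypothetical, and the real loader may work differently.

```typescript
// Hypothetical helper: "requireTls" -> "REQUIRE_TLS".
function toUpperSnake(segment: string): string {
  return segment.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toUpperCase();
}

// Assumed convention: join the converted path segments with underscores.
function envNameFor(path: string[]): string {
  return path.map((segment) => toUpperSnake(segment)).join("_");
}

// Precedence: a value present in .env wins over the config.yaml value.
function effectiveValue(
  path: string[],
  configValue: string,
  envOverrides: Record<string, string | undefined>
): string {
  const override = envOverrides[envNameFor(path)];
  return override !== undefined ? override : configValue;
}

// Example .env contents, passed in as a plain object.
const dotEnvValues: Record<string, string | undefined> = {
  SMTP_PASSWORD: "secret123",
};

const effectivePassword = effectiveValue(["smtp", "password"], "${SMTP_PASSWORD}", dotEnvValues);
const effectiveHost = effectiveValue(["smtp", "host"], "smtp.example.com", dotEnvValues);
// effectivePassword comes from .env; effectiveHost falls back to config.yaml.
```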

Validation

The config is validated at compile time by the TypeScript build:

cd dp-infra/monitoring
npm run compile  # ❌ Fails if config is invalid

Common Errors:

  • Type mismatch: replicas.prometheus: "2" (string) instead of replicas.prometheus: 2 (number)

  • Missing required field: a field required by the schema is absent from config.yaml

  • Invalid format: CPU "100" instead of "100m"
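
To illustrate the three error classes, the sketch below performs equivalent checks at runtime on an already-parsed config object. This is not the project's validation mechanism, which runs through npm run compile. The function names are hypothetical, and the list of required fields is taken from the MonitoringConfig interface on this page.

```typescript
// Top-level fields listed in the MonitoringConfig interface.
const REQUIRED_FIELDS = [
  "namespace", "versions", "domains", "s3", "smtp",
  "retention", "storage", "replicas", "resources",
];

// Hypothetical check: CPU values must use the millicore form, e.g. "100m".
function isValidCpuFormat(cpu: string): boolean {
  return /^\d+m$/.test(cpu);
}

// Hypothetical validator: reports missing fields and non-numeric replica counts.
function findConfigErrors(rawConfig: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    if (rawConfig[field] === undefined) {
      errors.push(`missing required field: ${field}`);
    }
  }
  const replicas = rawConfig["replicas"];
  if (typeof replicas === "object" && replicas !== null) {
    const counts = replicas as Record<string, unknown>;
    for (const component of Object.keys(counts)) {
      if (typeof counts[component] !== "number") {
        errors.push(`replicas.${component}: expected number`);
      }
    }
  }
  return errors;
}

// A deliberately broken config: string replica count, most fields missing.
const brokenConfigErrors = findConfigErrors({
  namespace: "monitoring",
  replicas: { prometheus: "2", grafana: 1 },
});
```

Here brokenConfigErrors contains one entry for each of the seven missing top-level fields plus one for replicas.prometheus.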

Example: Full config.yaml

namespace: monitoring

versions:
  prometheusStack: v69.2.0
  loki: 6.23.0
  alloy: v1.6.1
  thanos: v0.37.2

domains:
  grafana: grafana.ops.kup6s.net

s3:
  endpoint: https://fsn1.your-objectstorage.com
  endpointNoProtocol: fsn1.your-objectstorage.com
  region: fsn1
  buckets:
    thanos: metrics-thanos-kup6s
    loki: logs-loki-kup6s

smtp:
  host: smtp.example.com
  port: 587
  from: alerts@example.com
  username: smtp-user
  password: ${SMTP_PASSWORD}
  requireTls: true

retention:
  prometheus: 3d
  prometheusS3Raw: 30
  prometheusS35m: 180
  prometheusS31h: 730
  loki: 744h

storage:
  prometheus: 3Gi
  grafana: 10Gi
  alertmanager: 1Gi
  lokiBackend: 10Gi
  lokiWrite: 10Gi
  thanosStore: 10Gi
  thanosCompactor: 20Gi

replicas:
  prometheus: 2
  alertmanager: 3
  grafana: 1
  lokiBackend: 1
  lokiRead: 2
  lokiWrite: 2
  thanosQuery: 2
  thanosStore: 2

resources:
  prometheus:
    requests:
      cpu: 100m
      memory: 1500Mi
    limits:
      cpu: 2000m
      memory: 3000Mi
  # ... (11 components total)

Next Steps