
Configuration Reference¶

Type: Reference (Information-oriented)

Related: CDK8S Approach | Scale Resources

Complete reference for dp-infra/monitoring/config.yaml and .env overrides.

Configuration Files¶

config.yaml (Primary Configuration)¶

Main configuration file with all monitoring stack settings. Committed to git.

Location: dp-infra/monitoring/config.yaml

.env (Environment Overrides)¶

Optional overrides for sensitive data or environment-specific settings. NOT committed to git.

Location: dp-infra/monitoring/.env

Template: dp-infra/monitoring/.env.example

Configuration Schema¶

Top-Level Structure¶

interface MonitoringConfig {
  namespace: string;           // Kubernetes namespace
  versions: VersionConfig;     // Component versions
  domains: DomainConfig;       // Ingress domains
  s3: S3Config;                // Object storage
  smtp: SmtpConfig;            // Email alerting
  retention: RetentionConfig;  // Data retention policies
  storage: StorageConfig;      // PVC sizes
  replicas: ReplicaConfig;     // Replica counts
  resources: ResourceConfig;   // CPU/memory limits
}

Field Reference¶

namespace¶

Kubernetes namespace for all monitoring components.

namespace: monitoring

Type: string
Default: monitoring
Override: Not recommended (many hardcoded references)

versions¶

Component versions for Helm charts and Docker images.

versions:
  prometheusStack: 87.0.0   # kube-prometheus-stack Helm chart (operator v0.92.0)
  loki: 7.0.0               # Loki Helm chart (app 3.6.7)
  alloy: 1.10.0             # Alloy Helm chart (app v1.17.0)
  thanos: v0.41.0           # Thanos Docker image tag

Type: VersionConfig
Fields:
- prometheusStack: kube-prometheus-stack Helm chart version
- loki: Loki Helm chart version
- alloy: Grafana Alloy Helm chart version (Alloy is the successor to Grafana Agent)
- thanos: Thanos image tag (quay.io/thanos/thanos)
Pin every version explicitly. Never use latest for a chart version. An unattended chart upgrade once silently dropped the Thanos sidecar by changing the objectStorageConfig schema.
Update Frequency: Monthly (check for security patches)
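
The pinning rule above can be checked mechanically. The following is a minimal sketch, assuming the versions block has already been parsed into a plain object; `findUnpinnedVersions` and the `PINNED` pattern are hypothetical names for illustration, not part of the repository.

```typescript
// Hypothetical helper: flags version strings that are not pinned to an exact release.
type VersionConfig = {
  prometheusStack: string;
  loki: string;
  alloy: string;
  thanos: string;
};

// Accepts "87.0.0" or "v0.41.0"; rejects "latest", ranges, and partial versions.
const PINNED = /^v?\d+\.\d+\.\d+$/;

function findUnpinnedVersions(versions: VersionConfig): string[] {
  return Object.entries(versions)
    .filter(([, value]) => !PINNED.test(value))
    .map(([name]) => name);
}

const pinned: VersionConfig = {
  prometheusStack: "87.0.0",
  loki: "7.0.0",
  alloy: "1.10.0",
  thanos: "v0.41.0",
};

console.log(findUnpinnedVersions(pinned)); // no entries flagged
console.log(findUnpinnedVersions({ ...pinned, loki: "latest" })); // flags "loki"
```

A check like this could run before synthesis so that an accidental `latest` never reaches the cluster.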

domains¶

Ingress domain names for external access.

domains:
  grafana: grafana.ops.kup6s.net

Type: DomainConfig
Fields:
- grafana: Grafana UI domain (Traefik ingress with Let’s Encrypt)
DNS Required: A/AAAA records must point to the cluster load balancer

s3¶

Hetzner Object Storage configuration for metrics and logs.

s3:
  endpoint: https://fsn1.your-objectstorage.com
  endpointNoProtocol: fsn1.your-objectstorage.com
  region: fsn1
  buckets:
    thanos: metrics-thanos-kup6s
    loki: logs-loki-kup6s

Type: S3Config
Fields:
- endpoint: Full S3 endpoint URL (with https://)
- endpointNoProtocol: Endpoint without protocol (for Loki config)
- region: Hetzner region code (fsn1, nbg1, hel1)
- buckets.thanos: Metrics storage bucket name
- buckets.loki: Log storage bucket name
Bucket Naming: Must be globally unique (suffix with -kup6s)
Credentials: Managed via ExternalSecret (from crossplane-system/hetzner-s3-credentials)
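
Because `endpoint` and `endpointNoProtocol` carry the same host in two forms, they can drift apart when one is edited. A minimal sketch of a consistency check follows; `stripProtocol` and `endpointsAgree` are hypothetical helpers, and the real loader may derive one value from the other instead.

```typescript
// Hypothetical check: endpointNoProtocol must equal endpoint minus its scheme.
interface S3Endpoints {
  endpoint: string;
  endpointNoProtocol: string;
}

function stripProtocol(endpoint: string): string {
  // Remove a leading "http://" or "https://" and any trailing slashes.
  return endpoint.replace(/^https?:\/\//, "").replace(/\/+$/, "");
}

function endpointsAgree(s3: S3Endpoints): boolean {
  return stripProtocol(s3.endpoint) === s3.endpointNoProtocol;
}

const s3Endpoints: S3Endpoints = {
  endpoint: "https://fsn1.your-objectstorage.com",
  endpointNoProtocol: "fsn1.your-objectstorage.com",
};

console.log(endpointsAgree(s3Endpoints)); // true
```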

smtp¶

Email server configuration for Alertmanager notifications.

smtp:
  host: smtp.example.com
  port: 587
  from: alerts@example.com
  username: smtp-user
  password: ${SMTP_PASSWORD}  # Override in .env
  requireTls: true

Type: SmtpConfig
Fields:
- host: SMTP server hostname
- port: SMTP port (587 for STARTTLS, 465 for implicit TLS)
- from: Sender email address
- username: SMTP authentication username
- password: SMTP password (override in .env)
- requireTls: Enforce TLS connection
Security: Never commit the password to git; use the .env override
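
The `${SMTP_PASSWORD}` placeholder has to be replaced with a real value before the config is used. A minimal sketch of such a substitution is shown below. `resolvePlaceholders` is a hypothetical function, the environment is passed in as a plain object rather than read from the process, and the project's actual loader may behave differently.

```typescript
// Hypothetical resolver for ${NAME} placeholders; the real loader may work differently.
type Env = Record<string, string | undefined>;

function resolvePlaceholders(value: string, env: Env): string {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (_match: string, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      // Fail loudly rather than pass a literal placeholder on to Alertmanager.
      throw new Error(`Missing environment variable: ${name}`);
    }
    return resolved;
  });
}

// "example-only" is a dummy value for illustration.
console.log(resolvePlaceholders("${SMTP_PASSWORD}", { SMTP_PASSWORD: "example-only" }));
```

Throwing on a missing variable is a design choice in this sketch: a silent empty password would only surface later as failed alert emails.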

retention¶

Data retention policies for metrics and logs.

retention:
  prometheus: 3d           # Prometheus local storage
  prometheusS3Raw: 30      # Thanos S3 raw data (days)
  prometheusS35m: 180      # Thanos S3 5-min downsampled (days)
  prometheusS31h: 730      # Thanos S3 1-hour downsampled (days)
  loki: 744h               # Loki retention (hours)

Type: RetentionConfig
Fields:
- prometheus: Local Prometheus retention (duration string: 3d, 7d, etc.)
- prometheusS3Raw: S3 raw metrics retention (days)
- prometheusS35m: S3 downsampled (5-min) retention (days)
- prometheusS31h: S3 downsampled (1-hour) retention (days)
- loki: Loki log retention (duration string: 744h = 31 days)
Cost vs Retention: Longer retention → higher S3 costs
Recommendation: Keep the current values unless storage costs become an issue
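
The retention fields mix three notations: hour strings, day strings, and bare day counts. To compare them side by side, they can be normalised to days. This is a minimal sketch; `retentionDays` is a hypothetical helper that only understands the `h` and `d` suffixes used on this page.

```typescript
// Hypothetical helper: convert the mixed retention notations to days.
function retentionDays(value: string | number): number {
  if (typeof value === "number") return value; // bare numbers are already days
  const match = /^(\d+)([hd])$/.exec(value);
  if (!match) throw new Error(`Unsupported retention value: ${value}`);
  const amount = Number(match[1]);
  return match[2] === "h" ? amount / 24 : amount;
}

console.log(retentionDays("744h")); // 31
console.log(retentionDays("3d"));   // 3
console.log(retentionDays(730));    // 730
```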

storage¶

Persistent volume sizes for each component.

storage:
  prometheus: 3Gi
  grafana: 10Gi
  alertmanager: 1Gi
  lokiBackend: 10Gi
  lokiWrite: 10Gi
  thanosStore: 10Gi
  thanosCompactor: 20Gi

Type: StorageConfig
Fields: All PVC size requests (Gi = Gibibytes)
Storage Class: longhorn (2 replicas)
Resizing: PVCs can be expanded but NEVER shrunk
Monitoring: Check Longhorn UI for actual usage
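
Since PVCs can only grow, a proposed size change can be screened before it is applied. The sketch below assumes every size is a whole number of `Gi`, as in the values above; `parseGi` and `isAllowedResize` are hypothetical helpers.

```typescript
// Hypothetical guard against shrinking a PVC. Only handles whole "Gi" sizes.
function parseGi(size: string): number {
  const match = /^(\d+)Gi$/.exec(size);
  if (!match) throw new Error(`Expected a size like "10Gi", got: ${size}`);
  return Number(match[1]);
}

function isAllowedResize(current: string, proposed: string): boolean {
  // Equal or larger is fine; anything smaller would be a shrink.
  return parseGi(proposed) >= parseGi(current);
}

console.log(isAllowedResize("10Gi", "20Gi")); // true
console.log(isAllowedResize("10Gi", "5Gi"));  // false
```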

replicas¶

High-availability replica counts.

replicas:
  prometheus: 2        # Query HA
  alertmanager: 3      # Quorum-based
  grafana: 1           # UI only
  lokiBackend: 1       # Index management
  lokiRead: 2          # Query HA
  lokiWrite: 2         # Ingestion HA
  thanosQuery: 2       # Query HA
  thanosStore: 2       # Query HA

Type: ReplicaConfig
HA Components (>= 2 replicas): prometheus, alertmanager, lokiRead, lokiWrite, thanosQuery, thanosStore
Single Instance (1 replica): grafana, lokiBackend, thanosCompactor (thanosCompactor has no entry in the replicas block above)
Auto-Scaled (not applicable): None currently
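
The HA and single-instance grouping follows directly from the replica counts. A minimal sketch of that classification is shown here; `splitByAvailability` is a hypothetical helper, and it only sees components that have an entry in the replicas block.

```typescript
// Hypothetical helper: group components by the ">= 2 replicas" HA rule.
function splitByAvailability(
  replicas: Record<string, number>
): { ha: string[]; single: string[] } {
  const ha: string[] = [];
  const single: string[] = [];
  for (const [name, count] of Object.entries(replicas)) {
    (count >= 2 ? ha : single).push(name);
  }
  return { ha, single };
}

const groups = splitByAvailability({
  prometheus: 2,
  alertmanager: 3,
  grafana: 1,
  lokiBackend: 1,
  lokiRead: 2,
  lokiWrite: 2,
  thanosQuery: 2,
  thanosStore: 2,
});

console.log(groups.ha.join(", "));     // prometheus, alertmanager, lokiRead, lokiWrite, thanosQuery, thanosStore
console.log(groups.single.join(", ")); // grafana, lokiBackend
```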

resources¶

CPU and memory requests/limits for all components.

resources:
  prometheus:
    requests:
      cpu: 100m
      memory: 1500Mi
    limits:
      cpu: 2000m
      memory: 3000Mi

Type: ResourceConfig
Components: prometheus, grafana, alertmanager, lokiBackend, lokiRead, lokiWrite, lokiGateway, alloy, thanosQuery, thanosStore, thanosCompactor
Format:
- CPU: 100m = 0.1 cores, 1000m = 1 core
- Memory: 256Mi = 256 MiB, 1Gi = 1 GiB
Sizing: Based on actual usage analysis (October 2025)
Update: See Resource Optimization
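
A resources entry is only schedulable if each request is no larger than its limit. The sketch below checks that for the quantity formats listed above. The helper names are hypothetical, and the parsers cover only the `m`, `Mi`, and `Gi` notations used on this page, not the full Kubernetes quantity syntax.

```typescript
// Hypothetical validation helpers; names are illustrative, not from the repository.
interface ResourceSpec {
  requests: { cpu: string; memory: string };
  limits: { cpu: string; memory: string };
}

// "100m" -> 100 millicores; "2" -> 2000 millicores.
function cpuMillicores(cpu: string): number {
  const match = /^(\d+)(m?)$/.exec(cpu);
  if (!match) throw new Error(`Unsupported CPU quantity: ${cpu}`);
  return match[2] === "m" ? Number(match[1]) : Number(match[1]) * 1000;
}

// "1500Mi" -> 1500; "1Gi" -> 1024 (both expressed in MiB).
function memoryMiB(memory: string): number {
  const match = /^(\d+)(Mi|Gi)$/.exec(memory);
  if (!match) throw new Error(`Unsupported memory quantity: ${memory}`);
  return match[2] === "Gi" ? Number(match[1]) * 1024 : Number(match[1]);
}

function requestsFitLimits(spec: ResourceSpec): boolean {
  return (
    cpuMillicores(spec.requests.cpu) <= cpuMillicores(spec.limits.cpu) &&
    memoryMiB(spec.requests.memory) <= memoryMiB(spec.limits.memory)
  );
}

const prometheusResources: ResourceSpec = {
  requests: { cpu: "100m", memory: "1500Mi" },
  limits: { cpu: "2000m", memory: "3000Mi" },
};

console.log(requestsFitLimits(prometheusResources)); // true
```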

Environment Variable Overrides¶

Override any config.yaml value using environment variables in .env:

# .env (not committed to git)
SMTP_PASSWORD=secret123
HETZNER_S3_ACCESS_KEY=ABCDEF123456
HETZNER_S3_SECRET_KEY=secret789

Pattern: UPPER_SNAKE_CASE environment variable overrides camelCase config field.

Precedence: .env > config.yaml
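
The naming pattern and precedence can be sketched as follows. This assumes that a nested field's variable name is built by joining its path segments with underscores, which is inferred from `smtp.password` mapping to `SMTP_PASSWORD`; the page does not spell out the rule, and `toEnvName` and `resolveValue` are hypothetical names.

```typescript
// Hypothetical mapping from a camelCase config path to an UPPER_SNAKE_CASE variable.
// Assumption: path segments are joined with "_" (inferred from SMTP_PASSWORD).
function toEnvName(path: string[]): string {
  return path
    .map((segment) =>
      segment.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toUpperCase()
    )
    .join("_");
}

// Precedence: a value from .env wins over the value from config.yaml.
function resolveValue(
  path: string[],
  yamlValue: string,
  env: Record<string, string | undefined>
): string {
  return env[toEnvName(path)] ?? yamlValue;
}

console.log(toEnvName(["smtp", "password"]));   // SMTP_PASSWORD
console.log(toEnvName(["smtp", "requireTls"])); // SMTP_REQUIRE_TLS
console.log(resolveValue(["smtp", "host"], "smtp.example.com", {})); // smtp.example.com
```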

Validation¶

Config validation happens at compile time via TypeScript:

cd dp-infra/monitoring
npm run compile  # ❌ Fails if config is invalid

Common Errors:

- Type mismatch: replicas: "2" (string) instead of replicas: 2 (number)
- Missing required field: Forgot to add new field to config.yaml
- Invalid format: CPU "100" instead of "100m"
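
To illustrate the first error class, here is a minimal runtime check that reports replica counts of the wrong type. It is a sketch only: `validateReplicas` is a hypothetical function, and the project's own validation runs through `npm run compile` as described above.

```typescript
// Hypothetical runtime check for replica counts parsed from YAML.
function validateReplicas(raw: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [name, value] of Object.entries(raw)) {
    if (typeof value !== "number" || !Number.isInteger(value) || value < 1) {
      errors.push(
        `replicas.${name}: expected a positive integer, got ${JSON.stringify(value)}`
      );
    }
  }
  return errors;
}

// A quoted "2" in YAML parses as a string and is reported.
console.log(validateReplicas({ prometheus: "2", grafana: 1 }));
```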

Example: Full config.yaml¶

namespace: monitoring

versions:
  prometheusStack: 87.0.0
  loki: 7.0.0
  alloy: 1.10.0
  thanos: v0.41.0

domains:
  grafana: grafana.ops.kup6s.net

s3:
  endpoint: https://fsn1.your-objectstorage.com
  endpointNoProtocol: fsn1.your-objectstorage.com
  region: fsn1
  buckets:
    thanos: metrics-thanos-kup6s
    loki: logs-loki-kup6s

smtp:
  host: smtp.example.com
  port: 587
  from: alerts@example.com
  username: smtp-user
  password: ${SMTP_PASSWORD}
  requireTls: true

retention:
  prometheus: 3d
  prometheusS3Raw: 30
  prometheusS35m: 180
  prometheusS31h: 730
  loki: 744h

storage:
  prometheus: 3Gi
  grafana: 10Gi
  alertmanager: 1Gi
  lokiBackend: 10Gi
  lokiWrite: 10Gi
  thanosStore: 10Gi
  thanosCompactor: 20Gi

replicas:
  prometheus: 2
  alertmanager: 3
  grafana: 1
  lokiBackend: 1
  lokiRead: 2
  lokiWrite: 2
  thanosQuery: 2
  thanosStore: 2

resources:
  prometheus:
    requests:
      cpu: 100m
      memory: 1500Mi
    limits:
      cpu: 2000m
      memory: 3000Mi
  # ... (11 components total)

Next Steps¶

- Upgrade Components - Change versions
- Scale Resources - Adjust replicas/storage
- Resource Requirements - Full resource specifications