Configuration Reference
Complete reference for dp-infra/monitoring/config.yaml and .env overrides.
Configuration Files
config.yaml (Primary Configuration)
Main configuration file with all monitoring stack settings. Committed to git.
Location: dp-infra/monitoring/config.yaml
.env (Environment Overrides)
Optional overrides for sensitive data or environment-specific settings. NOT committed to git.
Location: dp-infra/monitoring/.env
Template: dp-infra/monitoring/.env.example
Configuration Schema
Top-Level Structure
interface MonitoringConfig {
  namespace: string; // Kubernetes namespace
  versions: VersionConfig; // Component versions
  domains: DomainConfig; // Ingress domains
  s3: S3Config; // Object storage
  smtp: SmtpConfig; // Email alerting
  retention: RetentionConfig; // Data retention policies
  storage: StorageConfig; // PVC sizes
  replicas: ReplicaConfig; // Replica counts
  resources: ResourceConfig; // CPU/memory limits
}
Field Reference
namespace
Kubernetes namespace for all monitoring components.
namespace: monitoring
Type: string
Default: monitoring
Override: Not recommended (many hardcoded references)
versions
Component versions for Helm charts and Docker images.
versions:
  prometheusStack: v69.2.0 # kube-prometheus-stack Helm chart
  loki: 6.23.0 # Loki Helm chart
  alloy: v1.6.1 # Alloy Helm chart
  thanos: v0.37.2 # Thanos Docker image tag
Type: VersionConfig
Fields:
prometheusStack: kube-prometheus-stack Helm chart version
loki: Loki Helm chart version
alloy: Alloy (Grafana Agent) Helm chart version
thanos: Thanos image tag (quay.io/thanos/thanos)
Update Frequency: Monthly (check for security patches)
domains
Ingress domain names for external access.
domains:
  grafana: grafana.ops.kup6s.net
Type: DomainConfig
Fields:
grafana: Grafana UI domain (Traefik ingress with Let’s Encrypt)
DNS Required: A/AAAA records must point to cluster load balancer
s3
Hetzner Object Storage configuration for metrics and logs.
s3:
  endpoint: https://fsn1.your-objectstorage.com
  endpointNoProtocol: fsn1.your-objectstorage.com
  region: fsn1
  buckets:
    thanos: metrics-thanos-kup6s
    loki: logs-loki-kup6s
Type: S3Config
Fields:
endpoint: Full S3 endpoint URL (with https://)
endpointNoProtocol: Endpoint without protocol (for Loki config)
region: Hetzner region code (fsn1, nbg1, hel1)
buckets.thanos: Metrics storage bucket name
buckets.loki: Log storage bucket name
Bucket Naming: Must be globally unique (suffix with -kup6s)
Credentials: Managed via ExternalSecret (from crossplane-system/hetzner-s3-credentials)
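Because endpoint and endpointNoProtocol carry the same host in two forms, they can drift apart when only one of them is edited. The following is a minimal sketch of a consistency check; the helpers stripProtocol and endpointsMatch are hypothetical and not part of the repository.

```typescript
// Hypothetical helper: derive the protocol-less form from the full endpoint URL.
// Also drops trailing slashes so "https://host/" and "host" compare equal.
function stripProtocol(endpoint: string): string {
  return endpoint.replace(/^https?:\/\//, "").replace(/\/+$/, "");
}

// Returns true when both endpoint fields describe the same host.
function endpointsMatch(s3: { endpoint: string; endpointNoProtocol: string }): boolean {
  return stripProtocol(s3.endpoint) === s3.endpointNoProtocol;
}

const s3Example = {
  endpoint: "https://fsn1.your-objectstorage.com",
  endpointNoProtocol: "fsn1.your-objectstorage.com",
};
console.log(endpointsMatch(s3Example)); // true
```

An alternative design would keep only endpoint in config.yaml and derive the second form in code, but that would change the schema documented here.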
smtp
Email server configuration for Alertmanager notifications.
smtp:
  host: smtp.example.com
  port: 587
  from: alerts@example.com
  username: smtp-user
  password: ${SMTP_PASSWORD} # Override in .env
  requireTls: true
Type: SmtpConfig
Fields:
host: SMTP server hostname
port: SMTP port (587 for STARTTLS, 465 for implicit TLS)
from: Sender email address
username: SMTP authentication username
password: SMTP password (override in .env)
requireTls: Enforce TLS connection
Security: Never commit the password to git; use the .env override
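This page does not show how the ${SMTP_PASSWORD} placeholder is resolved. As an illustration only, placeholder substitution could look like the sketch below; resolvePlaceholders and the Env type are hypothetical names, and the environment is passed in as a plain object rather than read from the process.

```typescript
type Env = Record<string, string | undefined>;

// Replace every ${NAME} placeholder with the matching variable from env.
// Throws on an unset variable so a missing secret fails loudly instead of
// passing the literal placeholder on to the SMTP server.
function resolvePlaceholders(value: string, env: Env): string {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (_match: string, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      throw new Error(`Environment variable ${name} is not set`);
    }
    return resolved;
  });
}

// Values without a placeholder pass through unchanged.
console.log(resolvePlaceholders("smtp-user", {})); // smtp-user
```

Failing on an unset variable is a design choice in this sketch: it turns a forgotten .env entry into an immediate error rather than a failed SMTP login later.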
retention
Data retention policies for metrics and logs.
retention:
  prometheus: 3d # Prometheus local storage
  prometheusS3Raw: 30 # Thanos S3 raw data (days)
  prometheusS35m: 180 # Thanos S3 5-min downsampled (days)
  prometheusS31h: 730 # Thanos S3 1-hour downsampled (days)
  loki: 744h # Loki retention (hours)
Type: RetentionConfig
Fields:
prometheus: Local Prometheus retention (duration string: 3d, 7d, etc.)
prometheusS3Raw: S3 raw metrics retention (days)
prometheusS35m: S3 downsampled (5-min) retention (days)
prometheusS31h: S3 downsampled (1-hour) retention (days)
loki: Loki log retention (duration string: 744h = 31 days)
Cost vs Retention: Longer retention → higher S3 costs
Recommendation: Keep the current values unless storage costs are an issue
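The retention fields mix three notations: a day-suffixed string, bare day counts, and an hour-suffixed string. To compare them side by side, the values can be normalized to days. This is a minimal sketch; retentionDays is a hypothetical helper that handles only the "h" and "d" suffixes used on this page.

```typescript
// Convert a retention value to days. Accepts a bare number (already days)
// or a duration string with an "h" or "d" suffix.
function retentionDays(value: number | string): number {
  if (typeof value === "number") return value;
  const match = /^(\d+)([hd])$/.exec(value);
  if (match === null) {
    throw new Error(`Unsupported retention value: ${value}`);
  }
  const amount = Number(match[1]);
  return match[2] === "h" ? amount / 24 : amount;
}

console.log(retentionDays("3d"));   // 3
console.log(retentionDays(730));    // 730
console.log(retentionDays("744h")); // 31
```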
storage
Persistent volume sizes for each component.
storage:
  prometheus: 3Gi
  grafana: 10Gi
  alertmanager: 1Gi
  lokiBackend: 10Gi
  lokiWrite: 10Gi
  thanosStore: 10Gi
  thanosCompactor: 20Gi
Type: StorageConfig
Fields: All PVC size requests (Gi = gibibytes)
Storage Class: longhorn (2 replicas)
Resizing: PVCs can be expanded but NEVER shrunk
Monitoring: Check Longhorn UI for actual usage
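Since PVCs can only grow, it can help to compare a proposed storage block against the current one before applying it. The sketch below assumes two hypothetical helpers, parseGi and findShrunk, and handles only the Gi suffix used on this page.

```typescript
// Parse a quantity such as "10Gi" into a number of gibibytes.
function parseGi(quantity: string): number {
  const match = /^(\d+)Gi$/.exec(quantity);
  if (match === null) {
    throw new Error(`Unsupported quantity: ${quantity}`);
  }
  return Number(match[1]);
}

// Return the components whose proposed size is smaller than the current one.
function findShrunk(
  current: Record<string, string>,
  proposed: Record<string, string>
): string[] {
  return Object.keys(proposed).filter(
    (name) => name in current && parseGi(proposed[name]) < parseGi(current[name])
  );
}

const currentSizes = { prometheus: "3Gi", grafana: "10Gi" };
const proposedSizes = { prometheus: "5Gi", grafana: "5Gi" };
console.log(findShrunk(currentSizes, proposedSizes)); // [ 'grafana' ]
```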
replicas
High-availability replica counts.
replicas:
  prometheus: 2 # Query HA
  alertmanager: 3 # Quorum-based
  grafana: 1 # UI only
  lokiBackend: 1 # Index management
  lokiRead: 2 # Query HA
  lokiWrite: 2 # Ingestion HA
  thanosQuery: 2 # Query HA
  thanosStore: 2 # Query HA
Type: ReplicaConfig
HA Components (>= 2 replicas): prometheus, alertmanager, lokiRead, lokiWrite, thanosQuery, thanosStore
Single Instance (1 replica): grafana, lokiBackend, thanosCompactor
Auto-Scaled (not applicable): None currently
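The HA and single-instance groups follow directly from the replica counts, so they can be derived rather than maintained by hand. A minimal sketch, using a hypothetical splitByAvailability function and the values from the replicas block above:

```typescript
type ReplicaCounts = Record<string, number>;

// Group component names by whether they run two or more replicas.
function splitByAvailability(replicas: ReplicaCounts): { ha: string[]; single: string[] } {
  const names = Object.keys(replicas);
  return {
    ha: names.filter((name) => replicas[name] >= 2),
    single: names.filter((name) => replicas[name] === 1),
  };
}

const replicaCounts: ReplicaCounts = {
  prometheus: 2, alertmanager: 3, grafana: 1, lokiBackend: 1,
  lokiRead: 2, lokiWrite: 2, thanosQuery: 2, thanosStore: 2,
};
const groups = splitByAvailability(replicaCounts);
console.log(groups.ha.join(", "));
// prometheus, alertmanager, lokiRead, lokiWrite, thanosQuery, thanosStore
console.log(groups.single.join(", "));
// grafana, lokiBackend
```

thanosCompactor is listed above as a single instance but has no entry in the replicas block, so it does not appear in the derived output.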
resources
CPU and memory requests/limits for all components.
resources:
  prometheus:
    requests:
      cpu: 100m
      memory: 1500Mi
    limits:
      cpu: 2000m
      memory: 3000Mi
Type: ResourceConfig
Components: prometheus, grafana, alertmanager, lokiBackend, lokiRead, lokiWrite, lokiGateway, alloy, thanosQuery, thanosStore, thanosCompactor
Format:
CPU: 100m = 0.1 cores, 1000m = 1 core
Memory: 256Mi = 256 MiB, 1Gi = 1 GiB
Sizing: Based on actual usage analysis (October 2025)
Update: See Resource Optimization
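The unit rules in the Format list can be expressed as two small parsers, which also makes it possible to check that requests do not exceed limits. This is a sketch under stated assumptions: parseCpuCores, parseMemoryMiB, and requestsFitLimits are hypothetical helpers, and only the m, Mi, and Gi suffixes are handled. A bare CPU number is read as whole cores, which is how Kubernetes interprets it.

```typescript
// "100m" -> 0.1 cores, "2" -> 2 cores.
function parseCpuCores(cpu: string): number {
  const match = /^(\d+)(m?)$/.exec(cpu);
  if (match === null) throw new Error(`Invalid CPU quantity: ${cpu}`);
  return match[2] === "m" ? Number(match[1]) / 1000 : Number(match[1]);
}

// "256Mi" -> 256, "1Gi" -> 1024 (result in MiB).
function parseMemoryMiB(memory: string): number {
  const match = /^(\d+)(Mi|Gi)$/.exec(memory);
  if (match === null) throw new Error(`Invalid memory quantity: ${memory}`);
  return match[2] === "Gi" ? Number(match[1]) * 1024 : Number(match[1]);
}

interface ResourcePair { cpu: string; memory: string; }

// True when both CPU and memory requests are at or below their limits.
function requestsFitLimits(spec: { requests: ResourcePair; limits: ResourcePair }): boolean {
  return (
    parseCpuCores(spec.requests.cpu) <= parseCpuCores(spec.limits.cpu) &&
    parseMemoryMiB(spec.requests.memory) <= parseMemoryMiB(spec.limits.memory)
  );
}

const prometheusResources = {
  requests: { cpu: "100m", memory: "1500Mi" },
  limits: { cpu: "2000m", memory: "3000Mi" },
};
console.log(parseCpuCores("100m"));                  // 0.1
console.log(parseMemoryMiB("1Gi"));                  // 1024
console.log(requestsFitLimits(prometheusResources)); // true
```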
Environment Variable Overrides
Override any config.yaml value using environment variables in .env:
# .env (not committed to git)
SMTP_PASSWORD=secret123
HETZNER_S3_ACCESS_KEY=ABCDEF123456
HETZNER_S3_SECRET_KEY=secret789
Pattern: UPPER_SNAKE_CASE environment variable overrides camelCase config field.
Precedence: .env > config.yaml
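The exact mapping from field path to variable name is not spelled out here. SMTP_PASSWORD suggests that path segments are joined with underscores, and the sketch below assumes that rule. Variables such as HETZNER_S3_ACCESS_KEY have no matching field in the config.yaml shown on this page, so the real mapping may differ. The names envNameFor and resolveValue, and the host mail.internal, are hypothetical.

```typescript
type EnvVars = Record<string, string | undefined>;

// Assumed naming rule: convert each camelCase path segment to
// UPPER_SNAKE_CASE and join the segments with "_".
function envNameFor(path: string[]): string {
  return path
    .map((segment) => segment.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toUpperCase())
    .join("_");
}

// Precedence: a value set in .env wins over the value from config.yaml.
function resolveValue(path: string[], yamlValue: string, env: EnvVars): string {
  const override = env[envNameFor(path)];
  return override !== undefined ? override : yamlValue;
}

console.log(envNameFor(["smtp", "password"])); // SMTP_PASSWORD
console.log(resolveValue(["smtp", "host"], "smtp.example.com", { SMTP_HOST: "mail.internal" }));
// mail.internal
console.log(resolveValue(["smtp", "host"], "smtp.example.com", {}));
// smtp.example.com
```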
Validation
Config validation happens at compile time via TypeScript:
cd dp-infra/monitoring
npm run compile # ❌ Fails if config is invalid
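How the YAML values are loaded and checked during the compile step is not shown on this page. For illustration, the three kinds of mistakes listed under Common Errors below (a wrong type, a missing field, a malformed CPU quantity) can also be detected with a small runtime check. This is a sketch, not the project's validation; getPath and validateConfig are hypothetical names, and only three representative fields are checked.

```typescript
// Walk a dotted path through untyped data; undefined if any step is missing.
function getPath(root: unknown, path: string): unknown {
  let current: unknown = root;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Collect readable problems instead of stopping at the first one.
function validateConfig(config: unknown): string[] {
  const errors: string[] = [];

  // Missing required field
  if (getPath(config, "namespace") === undefined) {
    errors.push("namespace: missing required field");
  }

  // Type mismatch: "2" (string) instead of 2 (number)
  const replicaCount = getPath(config, "replicas.prometheus");
  if (typeof replicaCount !== "number") {
    errors.push(`replicas.prometheus: expected number, got ${typeof replicaCount}`);
  }

  // Invalid format: "100" instead of "100m"
  const cpu = getPath(config, "resources.prometheus.requests.cpu");
  if (typeof cpu !== "string" || !/^\d+m$/.test(cpu)) {
    errors.push('resources.prometheus.requests.cpu: expected millicores such as "100m"');
  }

  return errors;
}

const brokenConfig = {
  replicas: { prometheus: "2" },
  resources: { prometheus: { requests: { cpu: "100" } } },
};
console.log(validateConfig(brokenConfig).length); // 3
```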
Common Errors:
Type mismatch: replicas: "2" (string) instead of replicas: 2 (number)
Missing required field: Forgot to add new field to config.yaml
Invalid format: CPU "100" instead of "100m"
Example: Full config.yaml
namespace: monitoring
versions:
  prometheusStack: v69.2.0
  loki: 6.23.0
  alloy: v1.6.1
  thanos: v0.37.2
domains:
  grafana: grafana.ops.kup6s.net
s3:
  endpoint: https://fsn1.your-objectstorage.com
  endpointNoProtocol: fsn1.your-objectstorage.com
  region: fsn1
  buckets:
    thanos: metrics-thanos-kup6s
    loki: logs-loki-kup6s
smtp:
  host: smtp.example.com
  port: 587
  from: alerts@example.com
  username: smtp-user
  password: ${SMTP_PASSWORD}
  requireTls: true
retention:
  prometheus: 3d
  prometheusS3Raw: 30
  prometheusS35m: 180
  prometheusS31h: 730
  loki: 744h
storage:
  prometheus: 3Gi
  grafana: 10Gi
  alertmanager: 1Gi
  lokiBackend: 10Gi
  lokiWrite: 10Gi
  thanosStore: 10Gi
  thanosCompactor: 20Gi
replicas:
  prometheus: 2
  alertmanager: 3
  grafana: 1
  lokiBackend: 1
  lokiRead: 2
  lokiWrite: 2
  thanosQuery: 2
  thanosStore: 2
resources:
  prometheus:
    requests:
      cpu: 100m
      memory: 1500Mi
    limits:
      cpu: 2000m
      memory: 3000Mi
  # ... (11 components total)
Next Steps
Upgrade Components - Change versions
Scale Resources - Adjust replicas/storage
Resource Requirements - Full resource specifications