Monitoring Stack
Complete observability platform for the kup6s.com Kubernetes cluster, providing metrics collection, log aggregation, alerting, and long-term data storage.
Overview
The monitoring stack consists of:
Prometheus - Metrics collection and short-term storage (3 days)
Thanos - Long-term metrics storage (2 years) with S3 backend
Grafana - Visualization and dashboards
Loki - Log aggregation with S3 storage (31 days retention)
Alloy - Log collection agent (DaemonSet)
Alertmanager - Alert routing and notification (SMTP/email)
Architecture
┌─────────────────────────────────────────────────────────────┐
│                    Grafana (Dashboards)                     │
│                https://grafana.ops.kup6s.net                │
└───────────────────┬───────────────────┬─────────────────────┘
                    │                   │
        ┌───────────▼───────────┐   ┌───▼────────────┐
        │     Thanos Query      │   │  Loki Gateway  │
        │   (Global Metrics)    │   │  (Log Query)   │
        └───────┬───────┬───────┘   └────────┬───────┘
                │       │                    │
     ┌──────────▼──┐ ┌──▼───────────┐ ┌──────▼────────┐
     │ Prometheus  │ │ Thanos Store │ │ Loki Backend  │
     │ (3d local)  │ │ (S3 history) │ │ (Read/Write)  │
     └─────────────┘ └──────────────┘ └───────────────┘
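To make the metrics path in the diagram concrete, the following sketch shows how a query's time range determines which backends hold the requested data: samples newer than three days are still in Prometheus, while older samples exist only in S3 behind Thanos Store. This is an illustration of the idea, not how Thanos Query is implemented. In practice Thanos Query fans out to its configured stores and deduplicates the results, and recent blocks may already exist in S3 as well. The function and type names are hypothetical.

```typescript
// Hypothetical names; this models the diagram, not Thanos internals.
type MetricsBackend = "prometheus" | "thanos-store";

const DAY_MS = 24 * 60 * 60 * 1000;
const PROMETHEUS_LOCAL_RETENTION_MS = 3 * DAY_MS; // "3d local" in the diagram

// Returns the backends that hold data for the range [startMs, endMs].
function backendsForRange(
  startMs: number,
  endMs: number,
  nowMs: number
): MetricsBackend[] {
  if (startMs > endMs) {
    throw new Error("startMs must not be after endMs");
  }
  const localHorizonMs = nowMs - PROMETHEUS_LOCAL_RETENTION_MS;
  const backends: MetricsBackend[] = [];
  // Any part of the range newer than the horizon is still in Prometheus.
  if (endMs >= localHorizonMs) {
    backends.push("prometheus");
  }
  // Any part of the range older than the horizon is only available from S3.
  if (startMs < localHorizonMs) {
    backends.push("thanos-store");
  }
  return backends;
}
```

A dashboard panel covering the last 7 days would therefore need both backends, which is why Grafana points at Thanos Query rather than at Prometheus directly.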
Key Features
Long-term Retention: Metrics stored for 2 years, logs for 31 days
High Availability: 2+ replicas for all query components
Cost Optimization: Hetzner S3 for storage, downsampling for old data
Type-Safe Configuration: CDK8S with TypeScript prevents errors
GitOps Deployment: ArgoCD for automated sync and self-healing
Resource Optimized: Right-sized requests/limits based on actual usage
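As a rough illustration of what type-safe configuration buys, the sketch below models the retention and replica figures from this page as a typed object. The compiler rejects a wrong type (for example a string where a number is expected), and a small validator covers value-level rules the type system cannot express. The interface and its field names are assumptions made for this example; the actual config.yaml schema is described in the Configuration reference.

```typescript
// Hypothetical schema: the real config.yaml keys may differ.
interface MonitoringConfig {
  prometheusRetentionDays: number; // short-term local storage
  thanosRetentionDays: number;     // long-term metrics in S3
  lokiRetentionDays: number;       // log retention in S3
  queryReplicas: number;           // replicas per query component
}

const exampleConfig: MonitoringConfig = {
  prometheusRetentionDays: 3,
  thanosRetentionDays: 730, // roughly 2 years
  lokiRetentionDays: 31,
  queryReplicas: 2,
};

function isPositiveInteger(n: number): boolean {
  return isFinite(n) && Math.floor(n) === n && n > 0;
}

// Returns a list of problems; an empty list means the config is coherent.
function validateConfig(cfg: MonitoringConfig): string[] {
  const problems: string[] = [];
  const numericKeys: (keyof MonitoringConfig)[] = [
    "prometheusRetentionDays",
    "thanosRetentionDays",
    "lokiRetentionDays",
    "queryReplicas",
  ];
  for (const key of numericKeys) {
    if (!isPositiveInteger(cfg[key])) {
      problems.push(key + " must be a positive integer");
    }
  }
  // Long-term storage should outlive the local Prometheus window.
  if (cfg.thanosRetentionDays <= cfg.prometheusRetentionDays) {
    problems.push("thanosRetentionDays must exceed prometheusRetentionDays");
  }
  // High availability on this page means 2+ replicas for query components.
  if (cfg.queryReplicas < 2) {
    problems.push("queryReplicas must be at least 2 for high availability");
  }
  return problems;
}
```

Running a validator like this before manifest generation would turn a misconfiguration into a build failure rather than a broken deployment.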
Quick Links
Explanation (Understanding)
Architecture Overview - System design and components
Prometheus & Thanos Integration - Metrics storage strategy
Loki Architecture - Log aggregation design
CDK8S Approach - Why TypeScript for manifests
Resource Optimization - Sizing rationale
Storage Architecture - S3, Longhorn, PVC strategy
How-To (Practical Guides)
Setup Grafana Password - Configure stable admin credentials with ESO
Upgrade Components - Update Prometheus/Loki/Thanos versions
Scale Resources - Adjust replicas/storage/memory
Debug Issues - Troubleshooting steps
Reference (Technical Details)
Configuration - config.yaml reference
Constructs API - TypeScript construct API
Helm Values - Generated Helm values
Resource Requirements - CPU/memory specifications
S3 Buckets - Bucket configuration
Troubleshooting - Common issues and solutions
Access
Grafana: https://grafana.ops.kup6s.net
Prometheus: Via Grafana → Explore → Prometheus datasource
Thanos Query: Via Grafana → Explore → Prometheus datasource (queries both local + S3)
Loki: Via Grafana → Explore → Loki datasource
Longhorn UI: https://longhorn.ops.kup6s.net (for PVC management)
Source Code
Manifests: dp-infra/monitoring/manifests/monitoring.k8s.yaml (generated)
CDK8S Source: dp-infra/monitoring/charts/ (TypeScript)
Configuration: dp-infra/monitoring/config.yaml
ArgoCD App: argoapps/dist/monitoring.k8s.yaml
Deployment Method
The monitoring stack is deployed via ArgoCD using GitOps:
CDK8S code in dp-infra/monitoring/charts/ generates manifests
Manifests committed to dp-infra/monitoring/manifests/
ArgoCD syncs from git repository automatically
Changes to config.yaml trigger rebuild and sync
See CDK8S Approach for rationale.
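The "code generates manifests" step can be pictured with a dependency-free stand-in. The sketch below deliberately uses none of the cdk8s library API. It only shows the idea of turning typed settings into a Kubernetes-shaped object and serializing it so the result can be committed to git. The ConfigMap name, the `monitoring` namespace, and the `LokiSettings` type are assumptions made for this example.

```typescript
// Hypothetical stand-in for the synth step; the real project uses CDK8S.
interface LokiSettings {
  namespace: string;
  retentionDays: number;
}

interface Manifest {
  apiVersion: string;
  kind: string;
  metadata: { name: string; namespace: string };
  data: { [key: string]: string };
}

// Turns typed settings into a ConfigMap-shaped object.
function synthLokiRetention(settings: LokiSettings): Manifest {
  return {
    apiVersion: "v1",
    kind: "ConfigMap",
    metadata: { name: "loki-retention", namespace: settings.namespace },
    // Retention expressed in hours, e.g. 31 days becomes "744h".
    data: { retention_period: `${settings.retentionDays * 24}h` },
  };
}

// JSON is also valid YAML, so this output could be written to a manifest file.
const rendered = JSON.stringify(
  synthLokiRetention({ namespace: "monitoring", retentionDays: 31 }),
  null,
  2
);
```

Because the generated output is committed, a change to the settings shows up as a reviewable diff in git before ArgoCD applies it.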
Support
For issues or questions:
Check Troubleshooting first
Review Debug Issues for common problems
Inspect Grafana dashboards for component health
Check ArgoCD UI for sync status: https://argocd.ops.kup6s.net