Monitoring Stack

Complete observability platform for the kup6s.com Kubernetes cluster, providing metrics collection, log aggregation, alerting, and long-term data storage.

Overview

The monitoring stack consists of:

Prometheus - Metrics collection and short-term storage (3 days)
Thanos - Long-term metrics storage (2 years) with S3 backend
Grafana - Visualization and dashboards
Loki - Log aggregation with S3 storage (31 days retention)
Alloy - Log collection agent (DaemonSet)
Alertmanager - Alert routing and notification (SMTP/email)

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Grafana (Dashboards)                     │
│                   https://grafana.ops.kup6s.net              │
└───────────────────┬───────────────────┬────────────────────┘
                    │                   │
        ┌───────────▼───────────┐   ┌───▼────────────┐
        │    Thanos Query       │   │  Loki Gateway  │
        │   (Global Metrics)    │   │   (Log Query)  │
        └───────┬───────┬───────┘   └────────┬───────┘
                │       │                     │
     ┌──────────▼──┐ ┌──▼──────────┐  ┌──────▼────────┐
     │ Prometheus  │ │ Thanos Store │  │ Loki Backend  │
     │  (3d local) │ │ (S3 history) │  │ (Read/Write)  │
     └─────────────┘ └──────────────┘  └───────────────┘
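
The metrics path in the diagram can be sketched as a lookup by sample age. This is a simplified illustration, not how the stack routes queries: in practice Thanos Query fans out to both Prometheus and Thanos Store and deduplicates the results. The function and constant names are hypothetical; only the 3-day and 2-year windows come from this page, and a year is approximated as 365 days.

```typescript
// Retention windows from the overview; names are illustrative assumptions.
const PROMETHEUS_RETENTION_DAYS = 3;
const THANOS_RETENTION_DAYS = 2 * 365; // "2 years", approximated

type MetricsSource = "prometheus" | "thanos-store" | "expired";

// Returns the backend that is expected to hold a sample of the given age.
// Simplification: real queries go through Thanos Query, which asks both.
function metricsSourceFor(ageDays: number): MetricsSource {
  if (ageDays < 0) {
    throw new RangeError("ageDays must be non-negative");
  }
  if (ageDays <= PROMETHEUS_RETENTION_DAYS) {
    return "prometheus";
  }
  if (ageDays <= THANOS_RETENTION_DAYS) {
    return "thanos-store";
  }
  return "expired";
}
```

For example, a sample from yesterday maps to local Prometheus storage, while one from last month maps to the S3-backed Thanos Store.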

Key Features

Long-term Retention: Metrics stored for 2 years, logs for 31 days
High Availability: 2+ replicas for all query components
Cost Optimization: Hetzner S3 for storage, downsampling for old data
Type-Safe Configuration: CDK8S with TypeScript prevents errors
GitOps Deployment: ArgoCD for automated sync and self-healing
Resource Optimized: Right-sized requests/limits based on actual usage
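
To illustrate what type-safe configuration can catch, here is a minimal sketch of a typed config with a validation step. The interface and field names are hypothetical and do not reflect the actual config.yaml schema (see the Configuration reference for that); the checked values are the retention and replica figures listed above.

```typescript
// Hypothetical config shape; the real schema lives in config.yaml.
interface MonitoringConfig {
  prometheusRetentionDays: number;
  thanosRetentionDays: number;
  lokiRetentionDays: number;
  queryReplicas: number;
}

// Returns a list of human-readable problems; an empty list means valid.
function validateConfig(cfg: MonitoringConfig): string[] {
  const errors: string[] = [];
  const retentions: Array<[string, number]> = [
    ["prometheusRetentionDays", cfg.prometheusRetentionDays],
    ["thanosRetentionDays", cfg.thanosRetentionDays],
    ["lokiRetentionDays", cfg.lokiRetentionDays],
  ];
  for (const [name, days] of retentions) {
    if (!Number.isInteger(days) || days <= 0) {
      errors.push(`${name} must be a positive integer`);
    }
  }
  // Long-term storage only makes sense if it outlives local storage.
  if (cfg.thanosRetentionDays <= cfg.prometheusRetentionDays) {
    errors.push("thanosRetentionDays must exceed prometheusRetentionDays");
  }
  // High availability as described above: at least 2 query replicas.
  if (!Number.isInteger(cfg.queryReplicas) || cfg.queryReplicas < 2) {
    errors.push("queryReplicas must be an integer of at least 2");
  }
  return errors;
}

// Values taken from this page.
const exampleConfig: MonitoringConfig = {
  prometheusRetentionDays: 3,
  thanosRetentionDays: 730,
  lokiRetentionDays: 31,
  queryReplicas: 2,
};
```

The compiler rejects a misspelled or missing field before any manifest is generated, and `validateConfig` covers constraints that types alone cannot express.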

Quick Links

Explanation (Understanding)

Architecture Overview - System design and components
Prometheus & Thanos Integration - Metrics storage strategy
Loki Architecture - Log aggregation design
CDK8S Approach - Why TypeScript for manifests
Resource Optimization - Sizing rationale
Storage Architecture - S3, Longhorn, PVC strategy

How-To (Practical Guides)

Setup Grafana Password - Configure stable admin credentials with ESO
Upgrade Components - Update Prometheus/Loki/Thanos versions
Scale Resources - Adjust replicas/storage/memory
Debug Issues - Troubleshooting steps

Reference (Technical Details)

Configuration - config.yaml reference
Constructs API - TypeScript construct API
Helm Values - Generated Helm values
Resource Requirements - CPU/memory specifications
S3 Buckets - Bucket configuration
Troubleshooting - Common issues and solutions

Access

Grafana: https://grafana.ops.kup6s.net
Prometheus: Via Grafana → Explore → Prometheus datasource
Thanos Query: Via Grafana → Explore → Prometheus datasource (queries both local + S3)
Loki: Via Grafana → Explore → Loki datasource
Longhorn UI: https://longhorn.ops.kup6s.net (for PVC management)

Source Code

Manifests: dp-infra/monitoring/manifests/monitoring.k8s.yaml (generated)
CDK8S Source: dp-infra/monitoring/charts/ (TypeScript)
Configuration: dp-infra/monitoring/config.yaml
ArgoCD App: argoapps/dist/monitoring.k8s.yaml

Deployment Method

The monitoring stack is deployed via ArgoCD using GitOps:

CDK8S code in dp-infra/monitoring/charts/ generates manifests
Generated manifests are committed to dp-infra/monitoring/manifests/
ArgoCD syncs from the Git repository automatically
Changes to config.yaml trigger rebuild and sync
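
The ArgoCD side of this flow can be sketched as a typed Application manifest with automated sync and self-healing enabled. This is an illustration under stated assumptions, not the contents of argoapps/dist/monitoring.k8s.yaml: the repository URL parameter, the `argocd` and `monitoring` namespaces, the `default` project, and the `HEAD` revision are all hypothetical. Only the manifests path comes from this page.

```typescript
// Subset of the Argo CD Application resource relevant to GitOps sync.
interface ArgoApplication {
  apiVersion: "argoproj.io/v1alpha1";
  kind: "Application";
  metadata: { name: string; namespace: string };
  spec: {
    project: string;
    source: { repoURL: string; path: string; targetRevision: string };
    destination: { server: string; namespace: string };
    syncPolicy: { automated: { prune: boolean; selfHeal: boolean } };
  };
}

// Builds an Application that points at the generated manifests directory.
// repoURL is supplied by the caller because the real URL is not given here.
function monitoringApplication(repoURL: string): ArgoApplication {
  return {
    apiVersion: "argoproj.io/v1alpha1",
    kind: "Application",
    metadata: { name: "monitoring", namespace: "argocd" }, // assumed names
    spec: {
      project: "default", // assumption
      source: {
        repoURL,
        path: "dp-infra/monitoring/manifests", // from the Source Code section
        targetRevision: "HEAD", // assumption
      },
      destination: {
        server: "https://kubernetes.default.svc", // in-cluster API server
        namespace: "monitoring", // assumption
      },
      // automated sync applies new commits; selfHeal reverts manual drift
      syncPolicy: { automated: { prune: true, selfHeal: true } },
    },
  };
}
```

With `selfHeal` enabled, manual changes made in the cluster are reverted to the state committed in Git, which matches the self-healing behaviour listed under Key Features.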

See CDK8S Approach for rationale.

Support

For issues or questions:

Check Troubleshooting first
Review Debug Issues for common problems
Inspect Grafana dashboards for component health
Check ArgoCD UI for sync status: https://argocd.ops.kup6s.net
