
Monitoring Stack

A complete observability platform for the kup6s.com Kubernetes cluster. It provides metrics collection, log aggregation, alerting, and long-term data storage.

Overview

The monitoring stack consists of:

  • Prometheus - Metrics collection and short-term storage (3 days)

  • Thanos - Long-term metrics storage (2 years) with S3 backend

  • Grafana - Visualization and dashboards

  • Loki - Log aggregation with S3 storage (31 days retention)

  • Alloy - Log collection agent (DaemonSet)

  • Alertmanager - Alert routing and notification (SMTP/email)
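The retention figures above can be kept in one typed place and rendered into the duration strings that each component expects. The following is a minimal sketch, not the project's actual configuration: the `RetentionDays` interface, its field names, and the helper functions are hypothetical, and "2 years" is approximated as 730 days.

```typescript
// Retention periods in days, taken from the component list above.
// The interface and field names are hypothetical, for illustration only.
interface RetentionDays {
  prometheusLocal: number;
  thanosLongTerm: number;
  lokiLogs: number;
}

const retentionDays: RetentionDays = {
  prometheusLocal: 3,
  thanosLongTerm: 2 * 365, // "2 years", approximated as 730 days (assumption)
  lokiLogs: 31,
};

// Reject values that cannot be a meaningful retention period.
function requireWholeDays(days: number): void {
  if (!Number.isInteger(days) || days <= 0) {
    throw new RangeError(`retention must be a positive whole number of days, got ${days}`);
  }
}

// Format a number of days as a day-based duration string, e.g. 3 -> "3d".
function asDayString(days: number): string {
  requireWholeDays(days);
  return `${days}d`;
}

// Format a number of days as an hour-based duration string, e.g. 31 -> "744h".
function asHourString(days: number): string {
  requireWholeDays(days);
  return `${days * 24}h`;
}

const renderedRetention = {
  prometheus: asDayString(retentionDays.prometheusLocal),
  thanos: asDayString(retentionDays.thanosLongTerm),
  loki: asHourString(retentionDays.lokiLogs),
};

console.log(renderedRetention);
```

Keeping the numbers in days and converting at the edge avoids hand-writing values such as 744h, where an arithmetic slip would silently change the retention window.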

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Grafana (Dashboards)                     │
│                   https://grafana.ops.kup6s.net              │
└───────────────────┬───────────────────┬────────────────────┘
                    │                   │
        ┌───────────▼───────────┐   ┌───▼────────────┐
        │    Thanos Query       │   │  Loki Gateway  │
        │   (Global Metrics)    │   │   (Log Query)  │
        └───────┬───────┬───────┘   └────────┬───────┘
                │       │                     │
     ┌──────────▼──┐ ┌──▼──────────┐  ┌──────▼────────┐
     │ Prometheus  │ │ Thanos Store │  │ Loki Backend  │
     │  (3d local) │ │ (S3 history) │  │ (Read/Write)  │
     └─────────────┘ └──────────────┘  └───────────────┘

Key Features

  • Long-term Retention: Metrics stored for 2 years, logs for 31 days

  • High Availability: 2+ replicas for all query components

  • Cost Optimization: Hetzner S3 for storage, downsampling for old data

  • Type-Safe Configuration: CDK8S with TypeScript catches many configuration errors at compile time, before manifests are generated

  • GitOps Deployment: ArgoCD for automated sync and self-healing

  • Resource Optimized: Right-sized requests/limits based on actual usage
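As an illustration of how typed configuration can back the availability and resource-sizing points above, the sketch below models a component and checks two rules: query components run at least 2 replicas, and requests never exceed limits. The `ComponentSpec` type, the `validateComponent` function, and all numeric values are hypothetical and are not taken from the project's CDK8S source.

```typescript
// Hypothetical roles; only "query" components carry the 2+ replica rule here.
type ComponentRole = "query" | "ingest" | "agent";

interface ResourceAmounts {
  cpuMillicores: number;
  memoryMiB: number;
}

interface ComponentSpec {
  name: string;
  role: ComponentRole;
  replicas: number;
  requests: ResourceAmounts;
  limits: ResourceAmounts;
}

// Return a list of human-readable problems; an empty list means the spec passes.
function validateComponent(spec: ComponentSpec): string[] {
  const problems: string[] = [];
  if (spec.role === "query" && spec.replicas < 2) {
    problems.push(`${spec.name}: query components need at least 2 replicas`);
  }
  if (spec.requests.cpuMillicores > spec.limits.cpuMillicores) {
    problems.push(`${spec.name}: CPU request exceeds CPU limit`);
  }
  if (spec.requests.memoryMiB > spec.limits.memoryMiB) {
    problems.push(`${spec.name}: memory request exceeds memory limit`);
  }
  return problems;
}

// Illustrative values only; real sizing should come from observed usage.
const thanosQuerySpec: ComponentSpec = {
  name: "thanos-query",
  role: "query",
  replicas: 2,
  requests: { cpuMillicores: 100, memoryMiB: 256 },
  limits: { cpuMillicores: 500, memoryMiB: 512 },
};

const underReplicatedSpec: ComponentSpec = { ...thanosQuerySpec, replicas: 1 };

console.log(validateComponent(thanosQuerySpec));
console.log(validateComponent(underReplicatedSpec));
```

The compiler rejects misspelled fields and wrong value types, while the runtime check covers rules that types alone cannot express, such as the replica minimum.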

Access

  • Grafana: https://grafana.ops.kup6s.net

  • Prometheus: Via Grafana → Explore → Prometheus datasource

  • Thanos Query: Via Grafana → Explore → Prometheus datasource (queries both local + S3)

  • Loki: Via Grafana → Explore → Loki datasource

  • Longhorn UI: https://longhorn.ops.kup6s.net (for PVC management)

Source Code

  • Manifests: dp-infra/monitoring/manifests/monitoring.k8s.yaml (generated)

  • CDK8S Source: dp-infra/monitoring/charts/ (TypeScript)

  • Configuration: dp-infra/monitoring/config.yaml

  • ArgoCD App: argoapps/dist/monitoring.k8s.yaml

Deployment Method

The monitoring stack is deployed via ArgoCD using GitOps:

  1. CDK8S code in dp-infra/monitoring/charts/ generates the manifests

  2. The generated manifests are committed to dp-infra/monitoring/manifests/

  3. ArgoCD syncs from the git repository automatically

  4. Changes to config.yaml trigger a rebuild and a sync

See CDK8S Approach for rationale.
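The shape of an ArgoCD Application that follows this flow can be sketched as a plain TypeScript object. This is an assumption-laden illustration, not the contents of argoapps/dist/monitoring.k8s.yaml: the repository URL, the `argocd` and `monitoring` namespaces, the `default` project, and the `HEAD` revision are placeholders, and the `monitoringApplication` helper is hypothetical. Only the manifests path and the automated, self-healing sync policy come from this page.

```typescript
// Minimal subset of the ArgoCD Application resource used in this sketch.
interface ArgoApplication {
  apiVersion: "argoproj.io/v1alpha1";
  kind: "Application";
  metadata: { name: string; namespace: string };
  spec: {
    project: string;
    source: { repoURL: string; path: string; targetRevision: string };
    destination: { server: string; namespace: string };
    syncPolicy: { automated: { selfHeal: boolean } };
  };
}

// Hypothetical helper: builds an Application pointing at the committed manifests.
function monitoringApplication(repoURL: string): ArgoApplication {
  return {
    apiVersion: "argoproj.io/v1alpha1",
    kind: "Application",
    metadata: { name: "monitoring", namespace: "argocd" }, // namespace is an assumption
    spec: {
      project: "default", // assumption
      source: {
        repoURL,
        path: "dp-infra/monitoring/manifests", // path named on this page
        targetRevision: "HEAD", // assumption
      },
      destination: {
        server: "https://kubernetes.default.svc", // in-cluster API server address
        namespace: "monitoring", // assumption
      },
      // Automated sync with self-healing, as described under Key Features.
      syncPolicy: { automated: { selfHeal: true } },
    },
  };
}

// Placeholder repository URL; substitute the real one.
const monitoringApp = monitoringApplication("https://git.example.com/org/repo.git");
console.log(JSON.stringify(monitoringApp, null, 2));
```

With `selfHeal` enabled, ArgoCD reverts manual changes made in the cluster back to the state stored in git, which is why edits should go through the CDK8S source rather than through kubectl.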

Support

For issues or questions:

  • Check Troubleshooting first

  • Review Debug Issues for common problems

  • Inspect Grafana dashboards for component health

  • Check ArgoCD UI for sync status: https://argocd.ops.kup6s.net