Explanation

Monitoring Stack Architecture Overview

Introduction

The kup6s monitoring stack provides comprehensive observability for the Kubernetes cluster, combining metrics, logs, and alerting into a unified system. This document explains the overall architecture, component relationships, and data flow patterns.

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Grafana UI                              │
│                  (Visualization & Dashboards)                   │
└─────────────┬───────────────────────────────┬───────────────────┘
              │                               │
              │ Metrics Query                 │ Logs Query
              ▼                               ▼
    ┌─────────────────┐              ┌──────────────────┐
    │  Thanos Query   │              │   Loki Gateway   │
    │  (Federation)   │              │   (HTTP Proxy)   │
    └────────┬────────┘              └────────┬─────────┘
             │                                │
             ├──────────┬─────────────────────┼────────────┐
             │          │                     │            │
             ▼          ▼                     ▼            ▼
    ┌────────────┐  ┌──────────┐      ┌──────────┐  ┌──────────┐
    │ Prometheus │  │  Thanos  │      │   Loki   │  │   Loki   │
    │  Sidecars  │  │  Store   │      │  Write   │  │   Read   │
    │   (gRPC)   │  │  (S3)    │      │          │  │          │
    └─────┬──────┘  └──────────┘      └─────┬────┘  └─────┬────┘
          │                                  │             │
          ▼                                  ▼             ▼
    ┌──────────┐                      ┌───────────────────────┐
    │Prometheus│                      │   Loki Backend        │
    │  (TSDB)  │                      │   (Index + Chunks)    │
    └─────┬────┘                      └───────────┬───────────┘
          │                                       │
          │ Scrape                                │ Push
          ▼                                       ▼
    ┌──────────────────┐              ┌──────────────────┐
    │  Service         │              │   Alloy          │
    │  Monitors        │              │   (DaemonSet)    │
    │  (Targets)       │              │   Log Collector  │
    └──────────────────┘              └──────────────────┘

Core Components

Metrics Pipeline

Prometheus (2 replicas, StatefulSet)

  • Role: Primary metrics collection and short-term storage

  • Collection Method: Pull-based (scraping targets every 30s)

  • Storage: 3Gi Longhorn PVC per replica (3-day retention)

  • High Availability: 2 replicas with identical configuration

  • Integration: Thanos sidecar uploads 2-hour blocks to S3
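The scrape targets themselves come from ServiceMonitor objects (the "Service Monitors" box in the diagram). As a hedged sketch in the repo's cdk8s style, a ServiceMonitor registering a target at the stack's 30s interval might look like this (the app label and port name are illustrative, not taken from this cluster):

import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'scrape-example');

// ServiceMonitor: tells the Prometheus operator which Services to scrape.
new ApiObject(chart, 'example-servicemonitor', {
  apiVersion: 'monitoring.coreos.com/v1',
  kind: 'ServiceMonitor',
  metadata: { name: 'example-app', namespace: 'monitoring' },
  spec: {
    selector: { matchLabels: { app: 'example-app' } }, // illustrative label
    endpoints: [{ port: 'metrics', interval: '30s' }], // the stack's 30s pull interval
  },
});

app.synth();

The Prometheus operator watches ServiceMonitor objects and rewrites the running Prometheus scrape configuration accordingly.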

Thanos Architecture

  • Sidecar: Runs alongside Prometheus, uploads blocks to S3, provides gRPC StoreAPI

  • Query (2 replicas): Unified query interface, federates queries across sidecars and stores

  • Store (2 replicas): Queries historical data from S3, caches indexes (10Gi PVC each)

  • Compactor (1 replica): Downsamples data, applies retention policies (20Gi PVC)

Why Thanos?

  • Long-term retention beyond Prometheus's 3 days (S3 storage is cheap; 730-day lifecycle here)

  • Global query view (queries both real-time Prometheus and historical S3)

  • Automatic downsampling (5m and 1h resolutions for long-term data)

  • Cost optimization (compress and downsample old metrics)

Logs Pipeline

Loki (SimpleScalable mode)

  • Write Path (2 replicas): Receives logs from Alloy, writes to S3

  • Read Path (2 replicas): Serves log queries, reads from S3

  • Backend (2 replicas): Handles both index and chunk storage

  • Storage: S3 for chunks and indexes, Longhorn PVCs for WAL/cache

  • Retention: 744h (31 days) in S3

Alloy (DaemonSet on all nodes)

  • Role: Log collection agent (Grafana Agent successor)

  • Collection Method: Kubernetes API (no privileged access needed)

  • Processing: JSON parsing, structured metadata extraction

  • Filtering: Per-node log collection via spec.nodeName selector

  • Labeling: Adds cluster, namespace, pod, container labels
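To make the collection mechanism concrete, the sketch below illustrates the same pattern in TypeScript: list the pods scheduled on the local node with a spec.nodeName field selector, then tail each container's logs through the Kubernetes API. This is not Alloy's own code (Alloy is configured in its own configuration language); cluster CA handling is omitted, and NODE_NAME is assumed to be injected via the downward API.

// Illustrative sketch of the collection pattern only (not Alloy's own code).
const API = 'https://kubernetes.default.svc';
const token = process.env.K8S_TOKEN ?? '';   // e.g. the mounted service account token
const node = process.env.NODE_NAME ?? '';    // injected via the downward API

async function k8sGet(path: string): Promise<Response> {
  return fetch(`${API}${path}`, { headers: { Authorization: `Bearer ${token}` } });
}

async function collectNodeLogs(): Promise<void> {
  // Per-node filtering: the same spec.nodeName selector the DaemonSet uses.
  const selector = new URLSearchParams({ fieldSelector: `spec.nodeName=${node}` });
  const res = await k8sGet(`/api/v1/pods?${selector}`);
  const { items } = (await res.json()) as any;
  for (const pod of items) {
    const ns = pod.metadata.namespace;
    const name = pod.metadata.name;
    for (const c of pod.spec.containers) {
      // Tail recent lines per container; Alloy additionally parses JSON and
      // attaches cluster/namespace/pod/container labels at this stage.
      const logs = await k8sGet(
        `/api/v1/namespaces/${ns}/pods/${name}/log?container=${c.name}&tailLines=100`,
      );
      console.log(`${ns}/${name}/${c.name}: ${(await logs.text()).length} bytes`);
    }
  }
}

collectNodeLogs().catch(console.error);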

Why SimpleScalable?

  • Balanced complexity vs scalability

  • Separate read/write paths (independent scaling)

  • S3-native (no need for object storage gateways)

  • Suitable for clusters up to 100 nodes

Visualization & Alerting

Grafana (1 replica, Deployment)

  • Datasources: Thanos Query (metrics), Loki Gateway (logs)

  • Dashboards: 25 pre-configured dashboards (K8s resources, networking, storage)

  • Storage: 5Gi Longhorn PVC for dashboards and settings

  • Authentication: Admin credentials stored in Kubernetes Secret

Alertmanager (2 replicas, StatefulSet)

  • Role: Alert routing and notification

  • Clustering: 2-peer gossip cluster for high availability

  • Notifications: Email via SMTP

  • Storage: 10Gi Hetzner Volumes PVC per replica

Storage Architecture

Longhorn PVCs (Primary persistent storage)

  • Prometheus: 3Gi × 2 replicas (2-replica Longhorn volumes)

  • Thanos Store: 10Gi × 2 replicas (index caching)

  • Thanos Compactor: 20Gi (compaction workspace)

  • Grafana: 5Gi (dashboards and config)

  • Loki components: Multiple PVCs for WAL and cache

S3 Buckets (Long-term object storage)

  • metrics-thanos-kup6s: Prometheus metrics (730-day retention)

  • logs-loki-kup6s: Loki log chunks and indexes (90-day retention)

  • Region: fsn1 (Falkenstein, same as cluster)

  • Lifecycle: Automated expiration via S3 lifecycle policies
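The lifecycle rules are provisioned declaratively alongside the buckets, but expressed imperatively the 90-day log expiration amounts to roughly the following sketch against the S3-compatible API (the Hetzner endpoint URL is an assumption):

import { S3Client, PutBucketLifecycleConfigurationCommand } from '@aws-sdk/client-s3';

async function applyLogRetention(): Promise<void> {
  // Endpoint URL is an assumption; credentials are read from the environment
  // (the same keys ESO replicates into the cluster).
  const s3 = new S3Client({
    region: 'fsn1',
    endpoint: 'https://fsn1.your-objectstorage.com',
    forcePathStyle: true,
  });
  await s3.send(new PutBucketLifecycleConfigurationCommand({
    Bucket: 'logs-loki-kup6s',
    LifecycleConfiguration: {
      Rules: [{
        ID: 'expire-logs',
        Status: 'Enabled',
        Filter: { Prefix: '' }, // apply to the whole bucket
        Expiration: { Days: 90 },
      }],
    },
  }));
}

applyLogRetention().catch(console.error);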

Data Flow Patterns

Metrics Flow

  1. Collection: Prometheus scrapes targets defined by ServiceMonitors every 30s

  2. Storage: Metrics stored in local TSDB (3-day retention)

  3. Upload: Thanos sidecar uploads 2-hour blocks to S3 every 2 hours

  4. Compaction: Thanos Compactor downsamples blocks (5m, 1h resolutions)

  5. Query: Thanos Query federates real-time (sidecars) + historical (store) data

  6. Visualization: Grafana queries Thanos Query endpoint
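Because Thanos Query implements the standard Prometheus HTTP API, the final two steps can be exercised directly. A minimal sketch, assuming a hypothetical in-cluster service URL:

// Hypothetical in-cluster address of the Thanos Query service.
const THANOS = 'http://thanos-query.monitoring.svc:9090';

async function instantQuery(promql: string): Promise<void> {
  // Thanos Query speaks the standard Prometheus HTTP API.
  const res = await fetch(`${THANOS}/api/v1/query?query=${encodeURIComponent(promql)}`);
  const body = (await res.json()) as any;
  // Results transparently merge real-time (sidecar) and historical (store) series.
  for (const r of body.data.result) {
    console.log(r.metric, r.value);
  }
}

// Example: scrape-target health across the whole federated view.
instantQuery('count by (job) (up == 1)').catch(console.error);

Grafana issues the same kind of request; the federation across sidecars and stores is invisible to the caller.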

Logs Flow

  1. Collection: Alloy reads pod logs via Kubernetes API

  2. Processing: JSON parsing, metadata extraction, labeling

  3. Ingestion: Logs pushed to Loki Write (via Gateway)

  4. Storage: Write path stores chunks and indexes in S3

  5. Query: Read path serves log queries from S3

  6. Visualization: Grafana queries Loki Gateway endpoint
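The logs path mirrors this: Grafana (or any client) sends LogQL to the Gateway over Loki's HTTP API. A minimal sketch, again with a hypothetical service URL:

// Hypothetical in-cluster address of the Loki Gateway.
const LOKI = 'http://loki-gateway.monitoring.svc';

async function queryLogs(logql: string, minutes: number): Promise<void> {
  // Loki's query_range API takes nanosecond Unix timestamps.
  const end = BigInt(Date.now()) * 1_000_000n;
  const start = end - BigInt(minutes) * 60_000_000_000n;
  const params = new URLSearchParams({
    query: logql,
    start: String(start),
    end: String(end),
    limit: '100',
  });
  const res = await fetch(`${LOKI}/loki/api/v1/query_range?${params}`);
  const body = (await res.json()) as any;
  for (const stream of body.data.result) {
    for (const [ts, line] of stream.values) console.log(ts, line);
  }
}

// The namespace label here was attached by Alloy at collection time.
queryLogs('{namespace="monitoring"} |= "error"', 15).catch(console.error);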

Alert Flow

  1. Evaluation: Prometheus evaluates alert rules every 30s

  2. Firing: Alerts sent to Alertmanager when conditions met

  3. Routing: Alertmanager routes by severity/namespace

  4. Notification: Emails sent via SMTP

  5. Silencing: Manual silences configured in Alertmanager UI
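Step 1 starts from PrometheusRule objects that the operator loads into Prometheus. A sketch in the repo's cdk8s ApiObject pattern (the rule itself is illustrative, not one of the stack's shipped rules); the severity label is what Alertmanager's routing in step 3 matches on:

import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'alert-example');

// PrometheusRule: loaded by the Prometheus operator and evaluated on the
// normal 30s cycle. The rule below is illustrative only.
new ApiObject(chart, 'example-rules', {
  apiVersion: 'monitoring.coreos.com/v1',
  kind: 'PrometheusRule',
  metadata: { name: 'example-alerts', namespace: 'monitoring' },
  spec: {
    groups: [{
      name: 'example.rules',
      rules: [{
        alert: 'TargetDown',
        expr: 'up == 0',
        for: '5m',
        labels: { severity: 'critical' }, // Alertmanager routes on labels like this
        annotations: { summary: 'Target {{ $labels.instance }} is down' },
      }],
    }],
  },
});

app.synth();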

Resource Allocation

CPU Allocation (Total: 2.4 cores requests)

  • Prometheus: 100m × 2 = 200m

  • Thanos Query: 200m × 2 = 400m

  • Thanos Store: 200m × 2 = 400m

  • Thanos Compactor: 500m × 1 = 500m

  • Loki Write: 100m × 2 = 200m

  • Loki Read: 100m × 2 = 200m

  • Loki Backend: 100m × 2 = 200m

  • Grafana: 50m × 1 = 50m

  • Alloy: 50m × 4 nodes = 200m

  • Alertmanager: 25m × 2 = 50m

Memory Allocation (Total: ~11 GB requests)

  • Prometheus: 1500Mi × 2 = 3000Mi (~3GB)

  • Thanos Query: 512Mi × 2 = 1024Mi (~1GB)

  • Thanos Store: 1Gi × 2 = 2048Mi (~2GB)

  • Thanos Compactor: 2Gi × 1 = 2048Mi (~2GB)

  • Loki Write: 256Mi × 2 = 512Mi

  • Loki Read: 256Mi × 2 = 512Mi

  • Loki Backend: 256Mi × 2 = 512Mi

  • Grafana: 512Mi × 1 = 512Mi

  • Alloy: 128Mi × 4 nodes = 512Mi

  • Alertmanager: 100Mi × 2 = 200Mi

Storage Allocation (Total: ~80Gi PVCs + S3)

  • Longhorn PVCs: ~60Gi total

    • Prometheus: 6Gi (3Gi × 2)

    • Thanos: 30Gi (10Gi × 2 store + 20Gi compactor)

    • Grafana: 5Gi

    • Loki: ~20Gi (multiple components)

  • Hetzner Volumes PVCs: 20Gi (Alertmanager, 10Gi × 2)

  • S3 Storage: Unlimited (pay-per-GB)

    • Metrics: ~50GB (compressed, downsampled)

    • Logs: ~20GB (31-day retention)

High Availability Design

Component HA Status

Fully HA (2+ replicas):

  • ✅ Prometheus (2 replicas, independent scraping)

  • ✅ Thanos Query (2 replicas, stateless)

  • ✅ Thanos Store (2 replicas, shared S3 state)

  • ✅ Loki Write (2 replicas, shared S3 state)

  • ✅ Loki Read (2 replicas, shared S3 state)

  • ✅ Loki Backend (2 replicas, shared S3 state)

  • ✅ Alertmanager (2 replicas, gossip cluster)

Single Replica (acceptable for role):

  • ⚠️ Thanos Compactor (1 replica, background job, restartable)

  • ⚠️ Grafana (1 replica, UI only, stateless config in PVC)

DaemonSet (node-level HA):

  • ✅ Alloy (4 nodes, each collects its node’s logs)

Failure Scenarios

Prometheus Pod Failure:

  • Impact: Metrics gap of roughly one minute (a couple of 30s scrape intervals) on the failed replica

  • Recovery: StatefulSet restarts pod automatically

  • Data: No loss (replica continues scraping)

Thanos Store Failure:

  • Impact: Historical queries slower (1 replica serves requests)

  • Recovery: StatefulSet restarts pod automatically

  • Data: No loss (S3 is source of truth)

Loki Write Failure:

  • Impact: Log ingestion continues via remaining replica

  • Recovery: Kubernetes restarts the pod automatically

  • Data: Minimal loss (Alloy retries failed pushes)

S3 Outage:

  • Impact: Historical metrics/logs unavailable

  • Recovery: Automatic when S3 recovers

  • Data: No loss for short outages (Prometheus retains 3 days locally; Loki buffers writes in its WAL)

Security Model

Authentication

  • Grafana: Username/password (stored in K8s Secret)

  • Prometheus: No authentication (cluster-internal only)

  • Alertmanager: No authentication (cluster-internal only)

Network Security

  • Ingress: Traefik with TLS termination (Let’s Encrypt)

  • Service Mesh: None (all communication within cluster)

  • Network Policies: Not implemented (future consideration)

Credential Management

  • S3 Credentials: External Secrets Operator (ESO) replicates from crossplane-system

  • SMTP Credentials: Stored in ConfigMap (consider moving to Secret)

  • Grafana Password: Auto-generated, stored in Secret

RBAC

  • Prometheus: ClusterRole for scraping metrics

  • Alloy: ClusterRole for reading pod logs

  • Thanos/Loki: No special permissions (S3 access via credentials)
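For reference, the permissions Alloy needs reduce to cluster-wide read access to pods and their logs. A sketch in the repo's cdk8s style (the role name is illustrative):

import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'rbac-example');

// Cluster-wide read access to pods and their logs, which is all the
// API-based log collection described above requires.
new ApiObject(chart, 'alloy-log-reader', {
  apiVersion: 'rbac.authorization.k8s.io/v1',
  kind: 'ClusterRole',
  metadata: { name: 'alloy-log-reader' }, // illustrative name
  rules: [
    { apiGroups: [''], resources: ['pods'], verbs: ['get', 'list', 'watch'] },
    { apiGroups: [''], resources: ['pods/log'], verbs: ['get'] },
  ],
});

app.synth();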

Deployment Architecture (CDK8S)

Repository Structure

dp-infra/monitoring/
├── charts/
│   ├── constructs/          # 11 TypeScript constructs
│   ├── types.ts             # Shared TypeScript interfaces
│   └── monitoring-chart.ts  # Main chart assembling constructs
├── manifests/
│   └── monitoring.k8s.yaml  # Generated manifests (committed)
├── config.yaml              # Configuration values
└── main.ts                  # Entry point, config loading

Construct Pattern

Each component is a TypeScript class (construct) that:

  1. Accepts typed configuration (MonitoringConfig)

  2. Generates Kubernetes resources (ApiObject)

  3. Uses consistent labeling and sync waves

  4. Documents prerequisites and behavior

Sync Waves (ArgoCD Ordering)

  • Wave 0: Namespace, PriorityClass, ProviderConfig

  • Wave 1: S3 Buckets, ExternalSecrets

  • Wave 2: HelmCharts (Prometheus, Loki)

  • Wave 3: Thanos components, Alloy
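A condensed sketch tying the construct pattern and sync waves together; the MonitoringConfig shape shown is a simplified stand-in for the real interface in charts/types.ts:

import { Construct } from 'constructs';
import { ApiObject } from 'cdk8s';

// Simplified stand-in for the real MonitoringConfig in charts/types.ts.
interface MonitoringConfig {
  namespace: string;
}

// Wave-0 resource: the namespace everything else lands in.
class MonitoringNamespace extends Construct {
  constructor(scope: Construct, id: string, config: MonitoringConfig) {
    super(scope, id);
    new ApiObject(this, 'namespace', {
      apiVersion: 'v1',
      kind: 'Namespace',
      metadata: {
        name: config.namespace,
        // ArgoCD applies lower waves first: namespace (0) before buckets (1),
        // Helm charts (2), and Thanos/Alloy (3).
        annotations: { 'argocd.argoproj.io/sync-wave': '0' },
        labels: { 'app.kubernetes.io/part-of': 'monitoring' }, // consistent labeling
      },
    });
  }
}

// monitoring-chart.ts would then assemble constructs like:
//   new MonitoringNamespace(this, 'namespace', config);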

Integration Points

External Dependencies

  • Crossplane: S3 bucket provisioning

  • External Secrets Operator: Secret replication

  • Traefik: Ingress and TLS termination

  • Longhorn: Persistent volume provisioning

  • cert-manager: TLS certificate management (via Traefik)

Service Discovery

  • Prometheus: Kubernetes service discovery (role: endpoints, pod, service)

  • Alertmanager: Gossip protocol for peer discovery

  • Thanos Query: DNS SRV records for sidecar discovery

API Integrations

  • Kubernetes API: Prometheus scraping, Alloy log collection

  • S3 API: Thanos/Loki object storage

  • SMTP: Alertmanager email notifications

Monitoring the Monitoring

Self-Monitoring Metrics

  • up{job="kube-prometheus-stack-prometheus"}: Prometheus health

  • prometheus_tsdb_head_series: Cardinality tracking

  • thanos_sidecar_shipper_uploads_total: S3 upload success

  • loki_ingester_chunks_flushed_total: Loki ingestion rate

Health Checks

  • Prometheus: /-/healthy endpoint

  • Thanos Query: /-/healthy endpoint

  • Loki: /ready endpoint

  • Grafana: /api/health endpoint
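These endpoints can be polled with plain HTTP; a small sketch (the in-cluster service URLs and ports are assumptions):

// The in-cluster service URLs/ports below are assumptions.
const endpoints: Record<string, string> = {
  prometheus: 'http://prometheus-operated.monitoring.svc:9090/-/healthy',
  thanosQuery: 'http://thanos-query.monitoring.svc:9090/-/healthy',
  loki: 'http://loki-gateway.monitoring.svc/ready',
  grafana: 'http://grafana.monitoring.svc:3000/api/health',
};

async function checkAll(): Promise<void> {
  for (const [name, url] of Object.entries(endpoints)) {
    try {
      const res = await fetch(url);
      console.log(`${name}: ${res.ok ? 'healthy' : `unhealthy (HTTP ${res.status})`}`);
    } catch (err) {
      console.log(`${name}: unreachable (${err})`);
    }
  }
}

checkAll();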

Alerting on Monitoring Issues

  • PrometheusDown: Prometheus not scraping

  • ThanosCompactorFailed: Compaction errors

  • LokiRequestErrors: Log ingestion failures

  • AlertmanagerDown: Alert routing broken

Performance Characteristics

Query Performance

  • Recent metrics (<3 days): ~100ms (Prometheus TSDB)

  • Historical metrics (>3 days): ~500ms (Thanos Store S3)

  • Recent logs (<1 hour): ~200ms (Loki memory cache)

  • Historical logs (>1 hour): ~1s (S3 reads)

Ingestion Rates

  • Metrics: ~100k samples/sec (4 nodes, ~1000 targets)

  • Logs: ~50MB/day uncompressed (~5MB/day compressed in S3)

  • Alerts: ~10 evaluations/sec

Storage Growth

  • Metrics (S3): ~500MB/day (compressed)

  • Logs (S3): ~5MB/day (compressed)

  • Longhorn PVCs: Stable after initial fill

Future Enhancements

Planned Improvements

  • [ ] Add distributed tracing (Tempo)

  • [ ] Implement network policies

  • [ ] Add multi-cluster federation

  • [ ] Migrate SMTP credentials to Secret

  • [ ] Add SLO/SLI dashboards

  • [ ] Implement log sampling for high-volume namespaces

Scalability Considerations

  • Prometheus: Consider sharding for >2000 targets

  • Loki: Migrate to microservices mode for >100 nodes

  • Thanos Store: Add replicas if S3 read latency increases

  • Alloy: Current design scales linearly with node count
