Monitoring Stack Architecture Overview¶
Introduction¶
The kup6s monitoring stack provides comprehensive observability for the Kubernetes cluster, combining metrics, logs, and alerting into a unified system. This document explains the overall architecture, component relationships, and data flow patterns.
System Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│                           Grafana UI                            │
│                  (Visualization & Dashboards)                   │
└─────────────┬───────────────────────────────┬───────────────────┘
              │                               │
              │ Metrics Query                 │ Logs Query
              ▼                               ▼
      ┌───────────────┐               ┌───────────────┐
      │ Thanos Query  │               │ Loki Gateway  │
      │ (Federation)  │               │ (HTTP Proxy)  │
      └───────┬───────┘               └───────┬───────┘
              │                               │
        ┌─────┴─────────┐                  ┌──┴───────────┐
        │               │                  │              │
        ▼               ▼                  ▼              ▼
  ┌────────────┐   ┌──────────┐      ┌──────────┐   ┌──────────┐
  │ Prometheus │   │  Thanos  │      │   Loki   │   │   Loki   │
  │  Sidecars  │   │  Store   │      │  Write   │   │   Read   │
  │   (gRPC)   │   │   (S3)   │      │          │   │          │
  └─────┬──────┘   └──────────┘      └─────┬────┘   └─────┬────┘
        │                                  │              │
        ▼                                  ▼              ▼
   ┌──────────┐                     ┌───────────────────────────┐
   │Prometheus│                     │       Loki Backend        │
   │  (TSDB)  │                     │     (Index + Chunks)      │
   └────┬─────┘                     └─────────────┬─────────────┘
        │                                         │
        │ Scrape                                  │ Push
        ▼                                         ▼
┌──────────────────┐                     ┌──────────────────┐
│     Service      │                     │      Alloy       │
│     Monitors     │                     │   (DaemonSet)    │
│    (Targets)     │                     │  Log Collector   │
└──────────────────┘                     └──────────────────┘
Core Components¶
Metrics Pipeline¶
Prometheus (2 replicas, StatefulSet)
Role: Primary metrics collection and short-term storage
Collection Method: Pull-based (scraping targets every 30s)
Storage: 3Gi Longhorn PVC per replica (3-day retention)
High Availability: 2 replicas with identical configuration
Integration: Thanos sidecar uploads 2-hour blocks to S3
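Scrape targets are declared as ServiceMonitor objects that the Prometheus Operator turns into scrape configuration. A minimal CDK8S-style sketch (the target name and selector labels are illustrative, not taken from the real configuration):

```typescript
import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'servicemonitor-example');

// Hypothetical ServiceMonitor: asks the Prometheus Operator to scrape
// every Service labelled app.kubernetes.io/name=node-exporter on its
// "metrics" port, at the stack-wide 30s interval described above.
new ApiObject(chart, 'node-exporter-servicemonitor', {
  apiVersion: 'monitoring.coreos.com/v1',
  kind: 'ServiceMonitor',
  metadata: { name: 'node-exporter', namespace: 'monitoring' },
  spec: {
    selector: { matchLabels: { 'app.kubernetes.io/name': 'node-exporter' } },
    endpoints: [{ port: 'metrics', interval: '30s' }],
  },
});

app.synth();
```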
Thanos Architecture
Sidecar: Runs alongside Prometheus, uploads blocks to S3, provides gRPC StoreAPI
Query (2 replicas): Unified query interface, federates queries across sidecars and stores
Store (2 replicas): Queries historical data from S3, caches indexes (10Gi PVC each)
Compactor (1 replica): Downsamples data, applies retention policies (20Gi PVC)
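The sidecar, Store, and Compactor all reach the bucket through a shared Thanos objstore configuration, typically mounted from a Secret. A hedged sketch of such a Secret (the endpoint and credential values are placeholders; only the bucket name comes from this document):

```typescript
import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'thanos-objstore-example');

// Thanos objstore.yml shared by sidecar, Store, and Compactor.
// Real credentials are injected by External Secrets in the actual setup;
// the endpoint below is a placeholder.
const objstoreConfig = [
  'type: S3',
  'config:',
  '  bucket: metrics-thanos-kup6s',
  '  endpoint: objectstorage.example.com',
  '  access_key: PLACEHOLDER',
  '  secret_key: PLACEHOLDER',
].join('\n');

new ApiObject(chart, 'thanos-objstore-secret', {
  apiVersion: 'v1',
  kind: 'Secret',
  metadata: { name: 'thanos-objstore', namespace: 'monitoring' },
  stringData: { 'objstore.yml': objstoreConfig },
});

app.synth();
```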
Why Thanos?
Long-term retention in cheap S3 object storage (730 days configured here)
Global query view (queries both real-time Prometheus and historical S3)
Automatic downsampling (5m and 1h resolutions for long-term data)
Cost optimization (compress and downsample old metrics)
Logs Pipeline¶
Loki (SimpleScalable mode)
Write Path (2 replicas): Receives logs from Alloy, writes to S3
Read Path (2 replicas): Serves log queries, reads from S3
Backend (2 replicas): Handles both index and chunk storage
Storage: S3 for chunks and indexes, Longhorn PVCs for WAL/cache
Retention: 744h (31 days) in S3
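In Helm terms, the mode and storage choices above reduce to a handful of chart values. A hedged sketch of what such a values object could look like (key names follow the upstream grafana/loki chart; apart from the replica counts, bucket name, and 744h retention stated above, the values are assumptions):

```typescript
// Hypothetical values for the grafana/loki Helm chart, mirroring the
// description above: SimpleScalable mode, 2+2+2 replicas, S3 backend,
// 31-day retention. The endpoint is a placeholder.
export const lokiValues = {
  deploymentMode: 'SimpleScalable',
  write: { replicas: 2 },
  read: { replicas: 2 },
  backend: { replicas: 2 },
  loki: {
    storage: {
      type: 's3',
      bucketNames: { chunks: 'logs-loki-kup6s', ruler: 'logs-loki-kup6s' },
      s3: { endpoint: 'objectstorage.example.com', region: 'fsn1' },
    },
    limits_config: { retention_period: '744h' },
  },
};
```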
Alloy (DaemonSet on all nodes)
Role: Log collection agent (Grafana Agent successor)
Collection Method: Kubernetes API (no privileged access needed)
Processing: JSON parsing, structured metadata extraction
Filtering: Per-node log collection via spec.nodeName selector
Labeling: Adds cluster, namespace, pod, container labels
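Per-node collection works by injecting each pod's node name via the downward API, so every Alloy instance only asks the Kubernetes API for pods scheduled on its own node. An illustrative fragment of the DaemonSet container (image tag and names are placeholders):

```typescript
// Fragment of the Alloy DaemonSet pod template (illustrative): the
// downward API injects the node name, and the Alloy configuration selects
// only pods whose spec.nodeName matches, so each agent collects just its
// own node's logs.
export const alloyContainer = {
  name: 'alloy',
  image: 'grafana/alloy:latest', // tag is pinned in the real manifests
  env: [
    {
      name: 'NODE_NAME',
      valueFrom: { fieldRef: { fieldPath: 'spec.nodeName' } },
    },
  ],
};
```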
Why SimpleScalable?
Balanced complexity vs scalability
Separate read/write paths (independent scaling)
S3-native (no need for object storage gateways)
Suitable for clusters up to 100 nodes
Visualization & Alerting¶
Grafana (1 replica, Deployment)
Datasources: Thanos Query (metrics), Loki Gateway (logs)
Dashboards: 25 pre-configured dashboards (K8s resources, networking, storage)
Storage: 5Gi Longhorn PVC for dashboards and settings
Authentication: Admin credentials stored in Kubernetes Secret
Alertmanager (2 replicas, StatefulSet)
Role: Alert routing and notification
Clustering: 2-peer gossip cluster for high availability
Notifications: Email via SMTP
Storage: 10Gi Hetzner Volumes PVC per replica
Storage Architecture¶
Longhorn PVCs (Primary persistent storage)
Prometheus: 3Gi × 2 replicas (2-replica Longhorn volumes)
Thanos Store: 10Gi × 2 replicas (index caching)
Thanos Compactor: 20Gi (compaction workspace)
Grafana: 5Gi (dashboards and config)
Loki components: Multiple PVCs for WAL and cache
S3 Buckets (Long-term object storage)
metrics-thanos-kup6s: Prometheus metrics (730-day retention)
logs-loki-kup6s: Loki log chunks and indexes (90-day retention)
Region: fsn1 (Falkenstein, same as cluster)
Lifecycle: Automated expiration via S3 lifecycle policies
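A hedged sketch of how such a bucket might be declared through Crossplane (the apiVersion, kind, and forProvider fields depend on the provider actually in use; only the bucket name, region, and sync wave come from this document):

```typescript
import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'bucket-example');

// Hypothetical Crossplane-managed bucket for Thanos metrics; the matching
// lifecycle policy (730-day expiration) would be declared alongside it.
new ApiObject(chart, 'metrics-bucket', {
  apiVersion: 's3.aws.upbound.io/v1beta1', // assumed provider API group
  kind: 'Bucket',
  metadata: {
    name: 'metrics-thanos-kup6s',
    annotations: { 'argocd.argoproj.io/sync-wave': '1' }, // buckets sync in wave 1
  },
  spec: {
    forProvider: { region: 'fsn1' },
    providerConfigRef: { name: 'default' }, // hypothetical ProviderConfig name
  },
});

app.synth();
```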
Data Flow Patterns¶
Metrics Flow¶
Collection: Prometheus scrapes targets (service monitors) every 30s
Storage: Metrics stored in local TSDB (3-day retention)
Upload: Thanos sidecar uploads 2-hour blocks to S3 every 2 hours
Compaction: Thanos Compactor downsamples blocks (5m, 1h resolutions)
Query: Thanos Query federates real-time (sidecars) + historical (store) data
Visualization: Grafana queries Thanos Query endpoint
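The compaction step above maps to a small set of Thanos Compactor flags; a sketch with illustrative retention values (only the 5m/1h resolutions are stated in this document):

```typescript
// Illustrative container args for the Thanos Compactor. The retention
// values per resolution are assumptions, not taken from the real config.
export const compactorArgs = [
  'compact',
  '--wait',                                          // run as a long-lived pod
  '--data-dir=/data',                                // 20Gi PVC compaction workspace
  '--objstore.config-file=/etc/thanos/objstore.yml', // shared S3 bucket config
  '--retention.resolution-raw=90d',                  // raw samples
  '--retention.resolution-5m=365d',                  // 5-minute downsampled blocks
  '--retention.resolution-1h=730d',                  // 1-hour downsampled blocks
];
```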
Logs Flow¶
Collection: Alloy reads pod logs via Kubernetes API
Processing: JSON parsing, metadata extraction, labeling
Ingestion: Logs pushed to Loki Write (via Gateway)
Storage: Write path stores chunks and indexes in S3
Query: Read path serves log queries from S3
Visualization: Grafana queries Loki Gateway endpoint
Alert Flow¶
Evaluation: Prometheus evaluates alert rules every 30s
Firing: Alerts sent to Alertmanager when conditions met
Routing: Alertmanager routes by severity/namespace
Notification: Emails sent via SMTP
Silencing: Manual silences configured in Alertmanager UI
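The routing step above corresponds to an Alertmanager routing tree roughly like the following sketch (receiver names and addresses are placeholders):

```typescript
// Hedged sketch of an Alertmanager config: group alerts by name and
// namespace, and route critical ones to a dedicated email receiver.
// SMTP globals (smarthost, from) are omitted here.
export const alertmanagerConfig = {
  route: {
    receiver: 'email-default',
    group_by: ['alertname', 'namespace'],
    routes: [
      { matchers: ['severity="critical"'], receiver: 'email-critical' },
    ],
  },
  receivers: [
    { name: 'email-default', email_configs: [{ to: 'ops@example.com' }] },
    { name: 'email-critical', email_configs: [{ to: 'oncall@example.com' }] },
  ],
};
```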
Resource Allocation¶
CPU Allocation (Total: ~2.4 cores requests)¶
Prometheus: 100m × 2 = 200m
Thanos Query: 200m × 2 = 400m
Thanos Store: 200m × 2 = 400m
Thanos Compactor: 500m = 500m
Loki Write: 100m × 2 = 200m
Loki Read: 100m × 2 = 200m
Loki Backend: 100m × 2 = 200m
Grafana: 50m = 50m
Alloy: 50m × 4 nodes = 200m
Alertmanager: 25m × 2 = 50m
Memory Allocation (Total: ~11 GB requests)¶
Prometheus: 1500Mi × 2 = 3000Mi (~3GB)
Thanos Query: 512Mi × 2 = 1024Mi (~1GB)
Thanos Store: 1Gi × 2 = 2048Mi (~2GB)
Thanos Compactor: 2Gi = 2048Mi (~2GB)
Loki Write: 256Mi × 2 = 512Mi
Loki Read: 256Mi × 2 = 512Mi
Loki Backend: 256Mi × 2 = 512Mi
Grafana: 512Mi = 512Mi
Alloy: 128Mi × 4 nodes = 512Mi
Alertmanager: 100Mi × 2 = 200Mi
Storage Allocation (Total: ~100 GB PVCs + S3)¶
Longhorn PVCs: ~60Gi total
Prometheus: 6Gi (3Gi × 2)
Thanos: 30Gi (10Gi × 2 store + 20Gi compactor)
Grafana: 5Gi
Loki: ~20Gi (multiple components)
S3 Storage: Unlimited (pay-per-GB)
Metrics: ~50GB (compressed, downsampled)
Logs: ~20GB (31-day retention)
High Availability Design¶
Component HA Status¶
Fully HA (2+ replicas):
✅ Prometheus (2 replicas, independent scraping)
✅ Thanos Query (2 replicas, stateless)
✅ Thanos Store (2 replicas, shared S3 state)
✅ Loki Write (2 replicas, shared S3 state)
✅ Loki Read (2 replicas, shared S3 state)
✅ Loki Backend (2 replicas, shared S3 state)
✅ Alertmanager (2 replicas, gossip cluster)
Single Replica (acceptable for role):
⚠️ Thanos Compactor (1 replica, background job, restartable)
⚠️ Grafana (1 replica, UI only, stateless config in PVC)
DaemonSet (node-level HA):
✅ Alloy (4 nodes, each collects its node’s logs)
Failure Scenarios¶
Prometheus Pod Failure:
Impact: Metrics gap of up to ~1 minute on the failed replica (a couple of 30s scrape intervals)
Recovery: StatefulSet restarts pod automatically
Data: No loss (replica continues scraping)
Thanos Store Failure:
Impact: Historical queries slower (1 replica serves requests)
Recovery: StatefulSet restarts pod automatically
Data: No loss (S3 is source of truth)
Loki Write Failure:
Impact: Log ingestion continues via remaining replica
Recovery: Deployment restarts pod automatically
Data: Minimal loss (Alloy retries failed pushes)
S3 Outage:
Impact: Historical metrics/logs unavailable
Recovery: Automatic when S3 recovers
Data: No loss (local caches serve recent data)
Security Model¶
Authentication¶
Grafana: Username/password (stored in K8s Secret)
Prometheus: No authentication (cluster-internal only)
Alertmanager: No authentication (cluster-internal only)
Network Security¶
Ingress: Traefik with TLS termination (Let’s Encrypt)
Service Mesh: None (all communication within cluster)
Network Policies: Not implemented (future consideration)
Credential Management¶
S3 Credentials: External Secrets Operator (ESO) replicates from crossplane-system
SMTP Credentials: Stored in ConfigMap (consider moving to Secret)
Grafana Password: Auto-generated, stored in Secret
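A hedged sketch of the ESO replication described above, pulling the Crossplane-created S3 credentials into the monitoring namespace (store, key, and secret names are hypothetical):

```typescript
import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'externalsecret-example');

// Hypothetical ExternalSecret: copy the S3 credentials that Crossplane
// created in crossplane-system into a local Secret that Thanos and Loki
// can mount.
new ApiObject(chart, 's3-credentials', {
  apiVersion: 'external-secrets.io/v1beta1',
  kind: 'ExternalSecret',
  metadata: { name: 's3-credentials', namespace: 'monitoring' },
  spec: {
    refreshInterval: '1h',
    secretStoreRef: { kind: 'ClusterSecretStore', name: 'crossplane-system' },
    target: { name: 's3-credentials' },
    dataFrom: [{ extract: { key: 'metrics-thanos-kup6s' } }],
  },
});

app.synth();
```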
RBAC¶
Prometheus: ClusterRole for scraping metrics
Alloy: ClusterRole for reading pod logs
Thanos/Loki: No special permissions (S3 access via credentials)
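A minimal sketch of the kind of ClusterRole Alloy needs to tail logs through the Kubernetes API; the real chart-managed role may grant more:

```typescript
import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'alloy-rbac-example');

// Alloy discovers pods and reads the pods/log subresource, nothing more.
new ApiObject(chart, 'alloy-logs-clusterrole', {
  apiVersion: 'rbac.authorization.k8s.io/v1',
  kind: 'ClusterRole',
  metadata: { name: 'alloy-logs' },
  rules: [
    { apiGroups: [''], resources: ['pods'], verbs: ['get', 'list', 'watch'] },
    { apiGroups: [''], resources: ['pods/log'], verbs: ['get'] },
  ],
});

app.synth();
```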
Deployment Architecture (CDK8S)¶
Repository Structure¶
dp-infra/monitoring/
├── charts/
│   ├── constructs/           # 11 TypeScript constructs
│   ├── types.ts              # Shared TypeScript interfaces
│   └── monitoring-chart.ts   # Main chart assembling constructs
├── manifests/
│   └── monitoring.k8s.yaml   # Generated manifests (committed)
├── config.yaml               # Configuration values
└── main.ts                   # Entry point, config loading
Construct Pattern¶
Each component is a TypeScript class (construct) that:
Accepts typed configuration (MonitoringConfig)
Generates Kubernetes resources (ApiObject)
Uses consistent labeling and sync waves
Documents prerequisites and behavior
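A condensed sketch of the pattern (the interface fields, resource, and values are illustrative; the real MonitoringConfig lives in charts/types.ts):

```typescript
import { ApiObject } from 'cdk8s';
import { Construct } from 'constructs';

// Hypothetical slice of the shared configuration interface.
interface MonitoringConfig {
  namespace: string;
  grafana: { storageSize: string };
}

// Illustrative construct: typed config in, labelled ApiObjects out.
export class GrafanaConstruct extends Construct {
  constructor(scope: Construct, id: string, config: MonitoringConfig) {
    super(scope, id);

    new ApiObject(this, 'pvc', {
      apiVersion: 'v1',
      kind: 'PersistentVolumeClaim',
      metadata: {
        name: 'grafana-storage',
        namespace: config.namespace,
        labels: { 'app.kubernetes.io/part-of': 'monitoring' },
      },
      spec: {
        accessModes: ['ReadWriteOnce'],
        storageClassName: 'longhorn',
        resources: { requests: { storage: config.grafana.storageSize } },
      },
    });
  }
}
```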
Sync Waves (ArgoCD Ordering)¶
Wave 0: Namespace, PriorityClass, ProviderConfig
Wave 1: S3 Buckets, ExternalSecrets
Wave 2: HelmCharts (Prometheus, Loki)
Wave 3: Thanos components, Alloy
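ArgoCD reads the wave from the argocd.argoproj.io/sync-wave annotation on each generated resource. A minimal sketch for a wave-0 resource (the namespace name is assumed):

```typescript
import { App, Chart, ApiObject } from 'cdk8s';

const app = new App();
const chart = new Chart(app, 'sync-wave-example');

// Wave 0: the namespace must exist before buckets/secrets (wave 1),
// Helm charts (wave 2), and Thanos/Alloy (wave 3) are synced.
new ApiObject(chart, 'monitoring-namespace', {
  apiVersion: 'v1',
  kind: 'Namespace',
  metadata: {
    name: 'monitoring',
    annotations: { 'argocd.argoproj.io/sync-wave': '0' },
  },
});

app.synth();
```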
Integration Points¶
External Dependencies¶
Crossplane: S3 bucket provisioning
External Secrets Operator: Secret replication
Traefik: Ingress and TLS termination
Longhorn: Persistent volume provisioning
cert-manager: TLS certificate management (via Traefik)
Service Discovery¶
Prometheus: Kubernetes service discovery (role: endpoints, pod, service)
Alertmanager: Gossip protocol for peer discovery
Thanos Query: DNS SRV records for sidecar discovery
API Integrations¶
Kubernetes API: Prometheus scraping, Alloy log collection
S3 API: Thanos/Loki object storage
SMTP: Alertmanager email notifications
Monitoring the Monitoring¶
Self-Monitoring Metrics¶
up{job="kube-prometheus-stack-prometheus"}: Prometheus health
prometheus_tsdb_head_series: Cardinality tracking
thanos_sidecar_shipper_uploads_total: S3 upload success
loki_ingester_chunks_flushed_total: Loki ingestion rate
Health Checks¶
Prometheus: /-/healthy endpoint
Thanos Query: /-/healthy endpoint
Loki: /ready endpoint
Grafana: /api/health endpoint
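These endpoints are wired into Kubernetes probes. An illustrative liveness probe for the Prometheus container (the port is the Prometheus default; timings are assumptions):

```typescript
// Sketch of a liveness probe hitting the /-/healthy endpoint listed
// above; the chart-managed probes and timings may differ.
export const prometheusLivenessProbe = {
  httpGet: { path: '/-/healthy', port: 9090 },
  initialDelaySeconds: 10,
  periodSeconds: 15,
  failureThreshold: 3,
};
```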
Alerting on Monitoring Issues¶
PrometheusDown: Prometheus not scraping
ThanosCompactorFailed: Compaction errors
LokiRequestErrors: Log ingestion failures
AlertmanagerDown: Alert routing broken
Performance Characteristics¶
Query Performance¶
Recent metrics (<3 days): ~100ms (Prometheus TSDB)
Historical metrics (>3 days): ~500ms (Thanos Store S3)
Recent logs (<1 hour): ~200ms (Loki memory cache)
Historical logs (>1 hour): ~1s (S3 reads)
Ingestion Rates¶
Metrics: ~100k samples/sec (8 nodes, ~1000 targets)
Logs: ~50MB/day uncompressed (~5MB/day compressed in S3)
Alerts: ~10 evaluations/sec
Storage Growth¶
Metrics (S3): ~500MB/day (compressed)
Logs (S3): ~5MB/day (compressed)
Longhorn PVCs: Stable after initial fill
Future Enhancements¶
Planned Improvements¶
[ ] Add distributed tracing (Tempo)
[ ] Implement network policies
[ ] Add multi-cluster federation
[ ] Migrate SMTP credentials to Secret
[ ] Add SLO/SLI dashboards
[ ] Implement log sampling for high-volume namespaces
Scalability Considerations¶
Prometheus: Consider sharding for >2000 targets
Loki: Migrate to microservices mode for >100 nodes
Thanos Store: Add replicas if S3 read latency increases
Alloy: Current design scales linearly with node count