Monitoring Architecture

This document explains the cluster-level monitoring and observability architecture, including metrics collection, log aggregation, and long-term storage strategies.

Overview

The kup6s.com cluster monitoring stack provides comprehensive observability across all infrastructure and application components:

  • Metrics: Prometheus + Thanos (3-day local, 2-year S3 with downsampling)

  • Logs: Loki + Alloy (31-day retention in S3)

  • Visualization: Grafana (dashboards, queries, alerts)

  • Alerting: Alertmanager (email routing via SMTP)

All monitoring components are deployed via ArgoCD from the dp-infra/monitoring/ repository using CDK8S TypeScript.

Metrics Collection

Prometheus

Primary metrics collection engine:

  • Deployment: 2 replicas for high availability

  • Local Retention: 3 days (reduced from 7 days to optimize memory/storage)

  • Storage: 3Gi Longhorn PVC per replica (reduced from 6Gi)

  • Resources: 100m/1500Mi requests, 500m/3000Mi limits

  • Target Discovery: Kubernetes service discovery (all namespaces)

What Prometheus Monitors:

  • Node metrics (CPU, memory, disk, network)

  • Pod metrics (resource usage, restarts, status)

  • Kubernetes API metrics (API server, scheduler, controller-manager)

  • Application metrics (via /metrics endpoints)

  • Custom metrics (ServiceMonitor and PodMonitor CRDs)

Thanos - Long-Term Metrics Storage

Architecture: Thanos extends Prometheus with long-term historical storage in S3 and global query capabilities.

Thanos Sidecar

Runs as a sidecar container alongside Prometheus:

  • Upload Blocks: Uploads 2-hour metric blocks to S3

  • Real-time Queries: Provides gRPC StoreAPI for recent data

  • Configuration: Defined in Prometheus CRD via thanos.objectStorageConfig
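
A rough sketch of a Prometheus custom resource matching the settings above (replica count, retention, storage, and the sidecar injection); the Secret name and key referencing the object storage config are assumptions:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2                        # high availability
  retention: 3d                      # local retention
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: longhorn
        resources:
          requests:
            storage: 3Gi
  resources:
    requests: { cpu: 100m, memory: 1500Mi }
    limits: { cpu: 500m, memory: 3000Mi }
  thanos:
    objectStorageConfig:             # injects the Thanos sidecar container
      name: thanos-objstore          # assumed Secret name
      key: objstore.yml              # assumed key holding the S3 config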

Thanos Query (2 replicas with anti-affinity)

Unified query interface for Prometheus + S3 data:

  • Federation: Queries across sidecars and store gateways

  • Deduplication: Merges metrics using replica labels

  • Grafana Integration: Default Prometheus datasource points to Thanos Query

  • Service: thanos-query.monitoring.svc.cluster.local:9090

  • Resources: 200m/512Mi requests per replica

Why Query Instead of Prometheus?

  • Transparent access to both recent (Prometheus) and historical (S3) data

  • Single query interface for all time ranges

  • No need to change dashboards or queries
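
The federation and deduplication behaviour described above is typically wired up with Thanos Query flags along these lines (a sketch; the endpoint service names are assumptions):

# Thanos Query container args (sketch)
args:
  - query
  - --http-address=0.0.0.0:9090
  - --query.replica-label=prometheus_replica                          # deduplicate across the 2 Prometheus replicas
  - --endpoint=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc        # sidecar StoreAPI (assumed service name)
  - --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc          # store gateway StoreAPI (assumed service name)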

Thanos Store (2 replicas, 10Gi PVC each)

Historical data gateway:

  • Purpose: Queries historical metrics from S3

  • Caching: Index cache (500MB) + chunk cache (500MB) for performance

  • StoreAPI: Provides gRPC interface for historical blocks

  • Resources: 200m/1Gi requests per replica
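
The cache sizes above correspond to Store Gateway flags roughly as follows (a sketch; mount paths are assumptions):

# Thanos Store Gateway container args (sketch)
args:
  - store
  - --data-dir=/var/thanos/store                       # backed by the 10Gi PVC
  - --index-cache-size=500MB                           # index cache
  - --chunk-pool-size=500MB                            # chunk cache
  - --objstore.config-file=/etc/thanos/objstore.yml    # shared S3 config (see Storage Configuration)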

Thanos Compactor (1 replica, 20Gi PVC)

Data lifecycle management:

  • Downsampling: Creates 5-minute and 1-hour resolution data

  • Compaction: Merges small blocks into larger ones

  • Retention Enforcement: Deletes data beyond retention periods

  • Schedule: Runs every few minutes

  • Resources: 500m/2Gi requests

Retention Strategy

Multi-tier retention balances cost, performance, and data availability:

| Resolution | Retention Period     | Storage          | Use Case                                |
|------------|----------------------|------------------|-----------------------------------------|
| Raw (15s)  | 3 days               | Prometheus local | Recent troubleshooting, live dashboards |
| Raw (15s)  | 30 days              | S3               | Recent historical analysis              |
| 5-minute   | 180 days (~6 months) | S3               | Medium-term trends                      |
| 1-hour     | 730 days (2 years)   | S3               | Long-term capacity planning             |

Rationale:

  • 3-day local: Fast queries for recent data, minimal storage/memory

  • 30-day raw: Detailed troubleshooting without excessive S3 costs

  • 6-month downsampled: Trends and patterns without raw data overhead

  • 2-year hourly: Long-term capacity planning and compliance
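
These retention tiers map directly onto Thanos Compactor flags, roughly as below (a sketch showing only the retention-related flags; the data path is an assumption):

# Thanos Compactor container args (sketch)
args:
  - compact
  - --wait                                   # run continuously rather than one-shot
  - --data-dir=/var/thanos/compactor         # backed by the 20Gi PVC (path assumed)
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d           # raw samples in S3: 30 days
  - --retention.resolution-5m=180d           # 5-minute downsamples: ~6 months
  - --retention.resolution-1h=730d           # 1-hour downsamples: 2 years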

Storage Configuration

S3 Bucket: metrics-thanos-kup6s

  • Provider: Hetzner Object Storage

  • Region: fsn1 (Falkenstein, same as cluster)

  • Provisioning: Crossplane-managed S3 bucket

  • Credentials: ESO (External Secrets Operator) replicates the S3 credentials from crossplane-system into the monitoring namespace
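
The sidecar, Store Gateway, and Compactor all read the same object storage configuration. A sketch of that objstore.yml for the bucket above; the endpoint format and credential placeholders are assumptions:

# objstore.yml consumed by Thanos components (sketch)
type: S3
config:
  bucket: metrics-thanos-kup6s
  endpoint: fsn1.your-objectstorage.com      # Hetzner Object Storage, fsn1 (assumed endpoint)
  region: fsn1
  access_key: "<from ESO-replicated Secret>"
  secret_key: "<from ESO-replicated Secret>"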

Prometheus Local Storage:

  • PVC Size: 3Gi per replica (Longhorn)

  • Storage Class: longhorn (2 replicas, best-effort locality)

  • Retention: 3 days (automatic deletion of older data)

Log Collection

Loki - Log Aggregation

Architecture: SimpleScalable deployment mode (3 components)

Components

  1. Loki Backend:

    • Handles compaction and deletion

    • Resources: 100m/256Mi requests

  2. Loki Read:

    • Handles query path

    • Resources: 100m/256Mi requests

  3. Loki Write:

    • Handles ingestion path

    • Resources: 100m/256Mi requests

Why SimpleScalable?

  • Suitable for medium-scale clusters (<100 nodes)

  • Simpler than microservices mode

  • Easier to operate and troubleshoot

  • Sufficient for current cluster size

Alloy - Log Collector

Deployment: DaemonSet (one pod per node)

  • Purpose: Collects logs from all pods on the node

  • Discovery: Automatic discovery via Kubernetes API

  • Shipping: Ships logs to Loki Write component

  • Resources: 100m/128Mi requests per pod

  • Log Sources:

    • Container stdout/stderr

    • Kubernetes events

    • Node system logs (optional)

Log Retention

Storage:

  • S3 Bucket: logs-loki-kup6s (Hetzner fsn1)

  • Retention: 744 hours (31 days)

  • Credentials: Same Hetzner S3 credentials as Thanos
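
A sketch of the matching Loki configuration in Helm-values form, assuming the grafana/loki chart's layout; the endpoint is an assumption:

# Loki Helm values (sketch)
deploymentMode: SimpleScalable
loki:
  storage:
    type: s3
    bucketNames:
      chunks: logs-loki-kup6s
    s3:
      endpoint: fsn1.your-objectstorage.com   # Hetzner Object Storage (assumed endpoint)
      region: fsn1
  limits_config:
    retention_period: 744h                    # 31 days
  compactor:
    retention_enabled: true                   # Backend enforces deletion after 744h
    delete_request_store: s3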

Why 31 Days?

  • Sufficient for troubleshooting recent issues

  • Balances storage costs with usefulness

  • Longer retention available via S3 lifecycle policies if needed

Querying Logs

Grafana Explore:

  1. Navigate to Grafana → Explore

  2. Select “Loki” datasource

  3. Use LogQL queries:

    {namespace="monitoring"} |= "error"
    {app="nginx"} | json | status >= 400
    

See the Loki Query Language (LogQL) documentation for syntax.

Visualization and Dashboards

Grafana

Central observability UI:

  • Datasources:

    • Prometheus → Thanos Query (metrics)

    • Loki (logs)

    • Optional: Tempo (traces, if deployed)

  • Resources: 50m/512Mi requests (increased to prevent OOM)

  • Storage: Dashboards stored as ConfigMaps (can be migrated to git-synced provisioning)

  • Access: https://grafana.ops.kup6s.net
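
Datasource provisioning for this setup could look roughly like the following sketch (Grafana provisioning format; the names are illustrative):

# Grafana datasource provisioning (sketch)
apiVersion: 1
datasources:
  - name: Prometheus                 # default metrics datasource, backed by Thanos Query
    type: prometheus
    url: http://thanos-query.monitoring.svc.cluster.local:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki-read.monitoring.svc.cluster.local:3100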

Pre-configured Dashboards:

  • Kubernetes cluster overview

  • Node metrics

  • Pod resource usage

  • Longhorn storage metrics

  • Custom application dashboards

Creating Dashboards:

  1. Create dashboard in Grafana UI

  2. Export JSON

  3. Store in dp-infra/monitoring/dashboards/ (optional)

  4. Commit to git for version control

Alerting

Alertmanager

Alert routing and notification:

Alert Flow:

Prometheus/Loki
    ├──> Alert Rules (fire when conditions met)
    └──> Alertmanager (receives firing alerts)
         ├──> Grouping (similar alerts grouped)
         ├──> Inhibition (suppress redundant alerts)
         ├──> Silencing (temporary muting)
         └──> Notification (email via SMTP)
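
With the Prometheus Operator CRDs used here (ServiceMonitor/PodMonitor), alert rules are typically defined as PrometheusRule resources; a minimal hypothetical example (metric name and thresholds are illustrative):

# Hypothetical PrometheusRule (sketch)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: monitoring
spec:
  groups:
    - name: myapp
      rules:
        - alert: MyAppHighErrorRate
          expr: rate(http_requests_total{app="myapp", status=~"5.."}[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "myapp 5xx rate above 5% for 10 minutes"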

Silenced Alerts:

  • CPUThrottlingHigh: Expected behavior with CPU limits

  • KubeMemoryOvercommit: Intentional overcommit strategy

Configuration: See AlertmanagerConstruct for routing rules and inhibitions.
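
As a rough illustration of what such routing could look like (SMTP host, addresses, and the null-route approach to muting are assumptions, not the actual AlertmanagerConstruct output):

# Alertmanager configuration (sketch)
route:
  receiver: email
  group_by: ["alertname", "namespace"]
  routes:
    - receiver: "null"               # permanently mute the alerts listed above
      matchers:
        - alertname =~ "CPUThrottlingHigh|KubeMemoryOvercommit"
receivers:
  - name: email
    email_configs:
      - to: ops@example.com          # assumed address
        from: alertmanager@example.com
        smarthost: smtp.example.com:587
  - name: "null"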

Key Monitoring Endpoints

User-Facing UIs:

  • Grafana: https://grafana.ops.kup6s.net (dashboards, queries)

  • Longhorn UI: https://longhorn.ops.kup6s.net (storage management)

  • ArgoCD: https://argocd.ops.kup6s.net (GitOps deployments)

Internal Services (accessible via port-forward):

  • Prometheus: prometheus.monitoring.svc:9090

  • Thanos Query: thanos-query.monitoring.svc:9090

  • Loki: loki-read.monitoring.svc:3100

  • Alertmanager: alertmanager.monitoring.svc:9093

Port-Forward Example:

kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Access at http://localhost:9090

Health Check Commands

# Check cluster nodes
kubectl get nodes

# Check pod status across all namespaces
kubectl get pods -A | grep -v Running

# Check monitoring stack
kubectl get pods -n monitoring

# Check Longhorn storage health
kubectl get nodes.longhorn.io -n longhorn-system

# Check PostgreSQL clusters
kubectl get clusters.postgresql.cnpg.io -A

# Check ArgoCD applications
kubectl get applications -n argocd

# Check External Secrets Operator
kubectl get pods -n external-secrets

# Check ExternalSecrets and SecretStores
kubectl get externalsecrets,secretstores,clustersecretstores -A

# Check Loki health
kubectl get pods -n monitoring | grep loki

# Check Thanos components
kubectl get pods -n monitoring -l 'app.kubernetes.io/name in (thanos-query,thanos-store,thanos-compactor)'

Verification Commands

Thanos Health Checks:

# Check all Thanos components running
kubectl get pods -n monitoring -l 'app.kubernetes.io/name in (thanos-query,thanos-store,thanos-compactor)'

# Check Thanos Query connected stores
kubectl exec -n monitoring deploy/thanos-query -- curl -s localhost:9090/api/v1/stores

# Check S3 blocks loaded by Store Gateway
kubectl exec -n monitoring thanos-store-0 -- ls /var/thanos/store/meta-syncer/

# Check Compactor logs for downsampling activity
kubectl logs -n monitoring thanos-compactor-0 --tail=50 | grep -E "(compact|downsample)"

Loki Health Checks:

# Check Loki components
kubectl get pods -n monitoring | grep loki

# Check Loki ready to receive logs
kubectl exec -n monitoring loki-write-0 -- wget -qO- http://localhost:3100/ready

# Check Alloy log collection
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=20

Integration with Applications

Exposing Custom Metrics

Applications can expose Prometheus metrics via:

  1. HTTP /metrics endpoint:

    // Example in Go (promhttp is github.com/prometheus/client_golang/prometheus/promhttp)
    http.Handle("/metrics", promhttp.Handler())
    
  2. ServiceMonitor CRD (automatic discovery):

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: myapp
      namespace: myapp
    spec:
      selector:
        matchLabels:
          app: myapp
      endpoints:
      - port: metrics
        interval: 30s
    
  3. PodMonitor CRD (for pods without services):

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: myapp-pods
    spec:
      selector:
        matchLabels:
          app: myapp
      podMetricsEndpoints:
      - port: metrics
    

Structured Logging

For optimal Loki querying, use structured logging (JSON):

{
  "timestamp": "2025-11-19T14:30:00Z",
  "level": "error",
  "service": "myapp",
  "message": "Database connection failed",
  "error": "connection timeout",
  "duration_ms": 5000
}

LogQL Query Example:

{namespace="myapp"} | json | level="error" | duration_ms > 3000

Deployment and Configuration

Management: All monitoring components managed via CDK8S in dp-infra/monitoring/

Configuration Files: config.yaml and the CDK8S TypeScript constructs in dp-infra/monitoring/

Workflow:

  1. Edit config.yaml or TypeScript constructs

  2. Build: npm run build

  3. Commit manifests to git: git add manifests/ && git commit && git push

  4. ArgoCD auto-syncs from git to cluster
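
For orientation, the ArgoCD Application backing this sync might look roughly like the sketch below (repoURL and path are assumptions):

# ArgoCD Application for the monitoring stack (sketch; repoURL and path are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/dp-infra.git   # assumed repository location
    targetRevision: main
    path: monitoring/manifests                  # assumed path to the generated manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true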

See Monitoring Deployment Documentation for detailed guides.

Troubleshooting

No Metrics for New Pods

Symptom: Pods not showing in Grafana

Check:

  1. ServiceMonitor/PodMonitor created?

    kubectl get servicemonitors,podmonitors -A
    
  2. Prometheus targets configured?

    kubectl port-forward -n monitoring svc/prometheus 9090:9090
    # Visit http://localhost:9090/targets
    
  3. Metrics endpoint accessible?

    kubectl port-forward -n myapp pod/myapp-xxx 8080:8080
    curl http://localhost:8080/metrics
    

Historical Metrics Missing

Symptom: Queries for >3 days ago return no data

Check:

  1. Thanos Store running?

    kubectl get pods -n monitoring | grep thanos-store
    
  2. S3 blocks uploaded?

    kubectl exec -n monitoring thanos-store-0 -- ls /var/thanos/store/meta-syncer/
    
  3. Thanos Query connected to Store?

    kubectl exec -n monitoring deploy/thanos-query -- curl -s localhost:9090/api/v1/stores
    

Logs Not Appearing

Symptom: No logs in Grafana Explore

Check:

  1. Alloy running on all nodes?

    kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
    
  2. Loki Write accepting logs?

    kubectl logs -n monitoring loki-write-0 --tail=50
    
  3. Loki Read serving queries?

    kubectl exec -n monitoring loki-read-0 -- wget -qO- http://localhost:3100/ready
    

Further Reading