Monitoring Architecture

This document explains the cluster-level monitoring and observability architecture, including metrics collection, log aggregation, and long-term storage strategies.

Overview

The kup6s.com cluster monitoring stack provides comprehensive observability across all infrastructure and application components:

  • Metrics: Prometheus + Thanos (3-day local, 2-year S3 with downsampling)

  • Logs: Loki + Alloy (31-day retention in S3)

  • Visualization: Grafana (dashboards, queries, alerts)

  • Alerting: Alertmanager (email routing via SMTP)

All monitoring components are deployed via ArgoCD from the dp-infra/monitoring/ repository using CDK8S TypeScript.

Metrics Collection

Prometheus

Primary metrics collection engine:

  • Deployment: 2 replicas for high availability

  • Local Retention: 3 days (reduced from 7 days to optimize memory/storage)

  • Storage: 3Gi Longhorn PVC per replica (reduced from 6Gi)

  • Resources: 100m/1500Mi requests, 500m/3000Mi limits

  • Target Discovery: Kubernetes service discovery (all namespaces)

What Prometheus Monitors:

  • Node metrics (CPU, memory, disk, network)

  • Pod metrics (resource usage, restarts, status)

  • Kubernetes API metrics (API server, scheduler, controller-manager)

  • Application metrics (via /metrics endpoints)

  • Custom metrics (ServiceMonitor and PodMonitor CRDs)

Thanos - Long-Term Metrics Storage

Architecture: Thanos extends Prometheus with long-term historical storage in S3 and global query capabilities.

Thanos Sidecar

Runs as a sidecar container alongside Prometheus:

  • Upload Blocks: Uploads 2-hour metric blocks to S3

  • Real-time Queries: Provides gRPC StoreAPI for recent data

  • Configuration: Defined in Prometheus CRD via thanos.objectStorageConfig
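
A rough sketch of a Prometheus custom resource matching the settings above (replica count, retention, storage, and the sidecar injection); the Secret name and key referencing the object storage config are assumptions:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2                        # high availability
  retention: 3d                      # local retention
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: longhorn
        resources:
          requests:
            storage: 3Gi
  resources:
    requests: { cpu: 100m, memory: 1500Mi }
    limits: { cpu: 500m, memory: 3000Mi }
  thanos:
    objectStorageConfig:             # injects the Thanos sidecar container
      name: thanos-objstore          # assumed Secret name
      key: objstore.yml              # assumed key holding the S3 config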

Thanos Query (2 replicas with anti-affinity)

Unified query interface for Prometheus + S3 data:

  • Federation: Queries across sidecars and store gateways

  • Deduplication: Merges metrics using replica labels

  • Grafana Integration: Default Prometheus datasource points to Thanos Query

  • Service: thanos-query.monitoring.svc.cluster.local:9090

  • Resources: 200m/512Mi requests per replica

Why Query Instead of Prometheus?

  • Transparent access to both recent (Prometheus) and historical (S3) data

  • Single query interface for all time ranges

  • No need to change dashboards or queries
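
The federation and deduplication behaviour described above is typically wired up with Thanos Query flags along these lines (a sketch; the endpoint service names are assumptions):

# Thanos Query container args (sketch)
args:
  - query
  - --http-address=0.0.0.0:9090
  - --query.replica-label=prometheus_replica                          # deduplicate across the 2 Prometheus replicas
  - --endpoint=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc        # sidecar StoreAPI (assumed service name)
  - --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc          # store gateway StoreAPI (assumed service name)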

Thanos Store (2 replicas, 10Gi PVC each)

Historical data gateway:

  • Purpose: Queries historical metrics from S3

  • Caching: Index cache (500MB) + chunk cache (500MB) for performance

  • StoreAPI: Provides gRPC interface for historical blocks

  • Resources: 200m/1Gi requests per replica
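
The cache sizes above correspond to Store Gateway flags roughly as follows (a sketch; mount paths are assumptions):

# Thanos Store Gateway container args (sketch)
args:
  - store
  - --data-dir=/var/thanos/store                       # backed by the 10Gi PVC
  - --index-cache-size=500MB                           # index cache
  - --chunk-pool-size=500MB                            # chunk cache
  - --objstore.config-file=/etc/thanos/objstore.yml    # shared S3 config (see Storage Configuration)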

Thanos Compactor (1 replica, 20Gi PVC)

Data lifecycle management:

  • Downsampling: Creates 5-minute and 1-hour resolution data

  • Compaction: Merges small blocks into larger ones

  • Retention Enforcement: Deletes data beyond retention periods

  • Schedule: Runs every few minutes

  • Resources: 500m/2Gi requests

Retention Strategy

Multi-tier retention balances cost, performance, and data availability:

| Resolution | Retention Period     | Storage          | Use Case                                |
|------------|----------------------|------------------|-----------------------------------------|
| Raw (15s)  | 3 days               | Prometheus local | Recent troubleshooting, live dashboards |
| Raw (15s)  | 30 days              | S3               | Recent historical analysis              |
| 5-minute   | 180 days (~6 months) | S3               | Medium-term trends                      |
| 1-hour     | 730 days (2 years)   | S3               | Long-term capacity planning             |

Rationale:

  • 3-day local: Fast queries for recent data, minimal storage/memory

  • 30-day raw: Detailed troubleshooting without excessive S3 costs

  • 6-month downsampled: Trends and patterns without raw data overhead

  • 2-year hourly: Long-term capacity planning and compliance
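
These retention tiers map directly onto Thanos Compactor flags, roughly as below (a sketch showing only the retention-related flags; the data path is an assumption):

# Thanos Compactor container args (sketch)
args:
  - compact
  - --wait                                   # run continuously rather than one-shot
  - --data-dir=/var/thanos/compactor         # backed by the 20Gi PVC (path assumed)
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d           # raw samples in S3: 30 days
  - --retention.resolution-5m=180d           # 5-minute downsamples: ~6 months
  - --retention.resolution-1h=730d           # 1-hour downsamples: 2 years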

Storage Configuration

S3 Bucket: metrics-thanos-kup6s

  • Provider: Hetzner Object Storage

  • Region: fsn1 (Falkenstein, same as cluster)

  • Provisioning: Crossplane-managed S3 bucket

  • Credentials: ESO (External Secrets Operator) replicates the S3 credentials from crossplane-system into the monitoring namespace
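
The sidecar, Store Gateway, and Compactor all read the same object storage configuration. A sketch of that objstore.yml for the bucket above; the endpoint format and credential placeholders are assumptions:

# objstore.yml consumed by Thanos components (sketch)
type: S3
config:
  bucket: metrics-thanos-kup6s
  endpoint: fsn1.your-objectstorage.com      # Hetzner Object Storage, fsn1 (assumed endpoint)
  region: fsn1
  access_key: "<from ESO-replicated Secret>"
  secret_key: "<from ESO-replicated Secret>"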

Prometheus Local Storage:

  • PVC Size: 3Gi per replica (Longhorn)

  • Storage Class: longhorn (2 replicas, best-effort locality)

  • Retention: 3 days (automatic deletion of older data)

Log Collection

Loki - Log Aggregation

Architecture: SimpleScalable deployment mode (3 components)

Components

  1. Loki Backend:

    • Handles compaction and deletion

    • Resources: 100m/256Mi requests

  2. Loki Read:

    • Handles query path

    • Resources: 100m/256Mi requests

  3. Loki Write:

    • Handles ingestion path

    • Resources: 100m/256Mi requests

Why SimpleScalable?

  • Suitable for medium-scale clusters (<100 nodes)

  • Simpler than microservices mode

  • Easier to operate and troubleshoot

  • Sufficient for current cluster size

Alloy - Log Collector

Deployment: DaemonSet (one pod per node)

  • Purpose: Collects logs from all pods on the node

  • Discovery: Automatic discovery via Kubernetes API

  • Shipping: Ships logs to Loki Write component

  • Resources: 100m/128Mi requests per pod

  • Log Sources:

    • Container stdout/stderr

    • Kubernetes events

    • Node system logs (optional)

Log Retention

Storage:

  • S3 Bucket: logs-loki-kup6s (Hetzner fsn1)

  • Retention: 744 hours (31 days)

  • Credentials: Same Hetzner S3 credentials as Thanos
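
A sketch of the matching Loki configuration in Helm-values form, assuming the grafana/loki chart's layout; the endpoint is an assumption:

# Loki Helm values (sketch)
deploymentMode: SimpleScalable
loki:
  storage:
    type: s3
    bucketNames:
      chunks: logs-loki-kup6s
    s3:
      endpoint: fsn1.your-objectstorage.com   # Hetzner Object Storage (assumed endpoint)
      region: fsn1
  limits_config:
    retention_period: 744h                    # 31 days
  compactor:
    retention_enabled: true                   # Backend enforces deletion after 744h
    delete_request_store: s3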

Why 31 Days?

  • Sufficient for troubleshooting recent issues

  • Balances storage costs with usefulness

  • Longer retention available via S3 lifecycle policies if needed

Querying Logs

Grafana Explore:

  1. Navigate to Grafana → Explore

  2. Select “Loki” datasource

  3. Use LogQL queries:

    {namespace="monitoring"} |= "error"
    {app="nginx"} | json | status >= 400
    

See the Loki Query Language (LogQL) documentation for syntax.

Visualization and Dashboards

Grafana

Central observability UI:

  • Datasources:

    • Prometheus → Thanos Query (metrics)

    • Loki (logs)

    • Optional: Tempo (traces, if deployed)

  • Resources: 50m/512Mi requests (increased to prevent OOM)

  • Storage: Dashboards stored as ConfigMaps (can be migrated to git-synced provisioning)

  • Access: https://grafana.ops.kup6s.net
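
Datasource provisioning for this setup could look roughly like the following sketch (Grafana provisioning format; the names are illustrative):

# Grafana datasource provisioning (sketch)
apiVersion: 1
datasources:
  - name: Prometheus                 # default metrics datasource, backed by Thanos Query
    type: prometheus
    url: http://thanos-query.monitoring.svc.cluster.local:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki-read.monitoring.svc.cluster.local:3100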

Pre-configured Dashboards:

  • Kubernetes cluster overview

  • Node metrics

  • Pod resource usage

  • Longhorn storage metrics

  • Custom application dashboards

Creating Dashboards:

  1. Create dashboard in Grafana UI

  2. Export JSON

  3. Store in dp-infra/monitoring/dashboards/ (optional)

  4. Commit to git for version control

Alerting

Alertmanager

Alert routing and notification:

Alert Flow:

Prometheus/Loki
    ├──> Alert Rules (fire when conditions met)
    └──> Alertmanager (receives firing alerts)
         ├──> Grouping (similar alerts grouped)
         ├──> Inhibition (suppress redundant alerts)
         ├──> Silencing (temporary muting)
         └──> Notification (email via SMTP)
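
With the Prometheus Operator CRDs used here (ServiceMonitor/PodMonitor), alert rules are typically defined as PrometheusRule resources; a minimal hypothetical example (metric name and thresholds are illustrative):

# Hypothetical PrometheusRule (sketch)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: monitoring
spec:
  groups:
    - name: myapp
      rules:
        - alert: MyAppHighErrorRate
          expr: rate(http_requests_total{app="myapp", status=~"5.."}[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "myapp 5xx rate above 5% for 10 minutes"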

Silenced Alerts:

  • CPUThrottlingHigh: Expected behavior with CPU limits

  • KubeMemoryOvercommit: Intentional overcommit strategy

Configuration: See AlertmanagerConstruct for routing rules and inhibitions.
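
As a rough illustration of what such routing could look like (SMTP host, addresses, and the null-route approach to muting are assumptions, not the actual AlertmanagerConstruct output):

# Alertmanager configuration (sketch)
route:
  receiver: email
  group_by: ["alertname", "namespace"]
  routes:
    - receiver: "null"               # permanently mute the alerts listed above
      matchers:
        - alertname =~ "CPUThrottlingHigh|KubeMemoryOvercommit"
receivers:
  - name: email
    email_configs:
      - to: ops@example.com          # assumed address
        from: alertmanager@example.com
        smarthost: smtp.example.com:587
  - name: "null"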

Key Monitoring Endpoints

User-Facing UIs:

  • Grafana: https://grafana.ops.kup6s.net (dashboards, queries)

  • Longhorn UI: https://longhorn.ops.kup6s.net (storage management)

  • ArgoCD: https://argocd.ops.kup6s.net (GitOps deployments)

Internal Services (accessible via port-forward):

  • Prometheus: prometheus.monitoring.svc:9090

  • Thanos Query: thanos-query.monitoring.svc:9090

  • Loki: loki-read.monitoring.svc:3100

  • Alertmanager: alertmanager.monitoring.svc:9093

Port-Forward Example:

kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Access at http://localhost:9090

Health Check Commands

# Check cluster nodes
kubectl get nodes

# Check pod status across all namespaces
kubectl get pods -A | grep -v Running

# Check monitoring stack
kubectl get pods -n monitoring

# Check Longhorn storage health
kubectl get nodes.longhorn.io -n longhorn-system

# Check PostgreSQL clusters
kubectl get clusters.postgresql.cnpg.io -A

# Check ArgoCD applications
kubectl get applications -n argocd

# Check External Secrets Operator
kubectl get pods -n external-secrets

# Check ExternalSecrets and SecretStores
kubectl get externalsecrets,secretstores,clustersecretstores -A

# Check Loki health
kubectl get pods -n monitoring | grep loki

# Check Thanos components
kubectl get pods -n monitoring -l 'app.kubernetes.io/name in (thanos-query,thanos-store,thanos-compactor)'

Verification Commands

Thanos Health Checks:

# Check all Thanos components running
kubectl get pods -n monitoring -l 'app.kubernetes.io/name in (thanos-query,thanos-store,thanos-compactor)'

# Check Thanos Query connected stores
kubectl exec -n monitoring deploy/thanos-query -- curl -s localhost:9090/api/v1/stores

# Check S3 blocks loaded by Store Gateway
kubectl exec -n monitoring thanos-store-0 -- ls /var/thanos/store/meta-syncer/

# Check Compactor logs for downsampling activity
kubectl logs -n monitoring thanos-compactor-0 --tail=50 | grep -E "(compact|downsample)"

Loki Health Checks:

# Check Loki components
kubectl get pods -n monitoring | grep loki

# Check Loki ready to receive logs
kubectl exec -n monitoring loki-write-0 -- wget -qO- http://localhost:3100/ready

# Check Alloy log collection
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=20

Integration with Applications

Exposing Custom Metrics

Applications can expose Prometheus metrics via:

  1. HTTP /metrics endpoint:

    // Example in Go (promhttp is github.com/prometheus/client_golang/prometheus/promhttp)
    http.Handle("/metrics", promhttp.Handler())
    
  2. ServiceMonitor CRD (automatic discovery):

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: myapp
      namespace: myapp
    spec:
      selector:
        matchLabels:
          app: myapp
      endpoints:
      - port: metrics
        interval: 30s
    
  3. PodMonitor CRD (for pods without services):

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: myapp-pods
    spec:
      selector:
        matchLabels:
          app: myapp
      podMetricsEndpoints:
      - port: metrics
    

Structured Logging

For optimal Loki querying, use structured logging (JSON):

{
  "timestamp": "2025-11-19T14:30:00Z",
  "level": "error",
  "service": "myapp",
  "message": "Database connection failed",
  "error": "connection timeout",
  "duration_ms": 5000
}

LogQL Query Example:

{namespace="myapp"} | json | level="error" | duration_ms > 3000

Deployment and Configuration

Management: All monitoring components managed via CDK8S in dp-infra/monitoring/

Configuration Files: config.yaml and the CDK8S TypeScript constructs in dp-infra/monitoring/

Workflow:

  1. Edit config.yaml or TypeScript constructs

  2. Build: npm run build

  3. Commit manifests to git: git add manifests/ && git commit && git push

  4. ArgoCD auto-syncs from git to cluster
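
For orientation, the ArgoCD Application backing this sync might look roughly like the sketch below (repoURL and path are assumptions):

# ArgoCD Application for the monitoring stack (sketch; repoURL and path are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/dp-infra.git   # assumed repository location
    targetRevision: main
    path: monitoring/manifests                  # assumed path to the generated manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true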

See Monitoring Deployment Documentation for detailed guides.

Troubleshooting

No Metrics for New Pods

Symptom: Pods not showing in Grafana

Check:

  1. ServiceMonitor/PodMonitor created?

    kubectl get servicemonitors,podmonitors -A
    
  2. Prometheus targets configured?

    kubectl port-forward -n monitoring svc/prometheus 9090:9090
    # Visit http://localhost:9090/targets
    
  3. Metrics endpoint accessible?

    kubectl port-forward -n myapp pod/myapp-xxx 8080:8080
    curl http://localhost:8080/metrics
    

Historical Metrics Missing

Symptom: Queries for >3 days ago return no data

Check:

  1. Thanos Store running?

    kubectl get pods -n monitoring | grep thanos-store
    
  2. S3 blocks uploaded?

    kubectl exec -n monitoring thanos-store-0 -- ls /var/thanos/store/meta-syncer/
    
  3. Thanos Query connected to Store?

    kubectl exec -n monitoring deploy/thanos-query -- curl -s localhost:9090/api/v1/stores
    

Logs Not Appearing

Symptom: No logs in Grafana Explore

Check:

  1. Alloy running on all nodes?

    kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
    
  2. Loki Write accepting logs?

    kubectl logs -n monitoring loki-write-0 --tail=50
    
  3. Loki Read serving queries?

    kubectl exec -n monitoring loki-read-0 -- wget -qO- http://localhost:3100/ready
    

Further Reading