# Monitoring Architecture
This document explains the cluster-level monitoring and observability architecture, including metrics collection, log aggregation, and long-term storage strategies.
## Overview

The kup6s.com cluster monitoring stack provides comprehensive observability across all infrastructure and application components:

- **Metrics**: Prometheus + Thanos (3 days local, 2 years in S3 with downsampling)
- **Logs**: Loki + Alloy (31-day retention in S3)
- **Visualization**: Grafana (dashboards, queries, alerts)
- **Alerting**: Alertmanager (email routing via SMTP)

All monitoring components are deployed via ArgoCD from the `dp-infra/monitoring/` repository using CDK8S TypeScript.
## Metrics Collection

### Prometheus
The primary metrics collection engine:

- **Deployment**: 2 replicas for high availability
- **Local Retention**: 3 days (reduced from 7 days to optimize memory and storage)
- **Storage**: 3Gi Longhorn PVC per replica (reduced from 6Gi)
- **Resources**: 100m CPU / 1500Mi memory requests; 500m / 3000Mi limits
- **Target Discovery**: Kubernetes service discovery across all namespaces
**What Prometheus Monitors:**

- Node metrics (CPU, memory, disk, network)
- Pod metrics (resource usage, restarts, status)
- Kubernetes API metrics (API server, scheduler, controller-manager)
- Application metrics (via `/metrics` endpoints)
- Custom metrics (ServiceMonitor and PodMonitor CRDs)
### Thanos - Long-Term Metrics Storage

**Architecture**: Thanos extends Prometheus with long-term historical storage in object storage and global query capabilities.

#### Thanos Sidecar

Runs as a sidecar container alongside each Prometheus replica:
- **Block Upload**: Uploads 2-hour metric blocks to S3
- **Real-time Queries**: Provides a gRPC StoreAPI for recent data
- **Configuration**: Defined in the Prometheus CRD via `thanos.objectStorageConfig`
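For orientation, a minimal sketch of how this is expressed in the Prometheus Operator CRD; the Secret name and key here are assumptions, and the actual values are generated by the CDK8S constructs:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  retention: 3d              # matches the 3-day local retention above
  thanos:
    objectStorageConfig:     # Secret containing the Thanos objstore.yml
      name: thanos-objstore  # assumed Secret name
      key: objstore.yml      # assumed key
```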
#### Thanos Query (2 replicas with anti-affinity)

The unified query interface for Prometheus and S3 data:

- **Federation**: Queries across sidecars and store gateways
- **Deduplication**: Merges duplicate series from the Prometheus replicas using replica labels
- **Grafana Integration**: The default Prometheus datasource points to Thanos Query
- **Service**: `thanos-query.monitoring.svc.cluster.local:9090`
- **Resources**: 200m/512Mi requests per replica
**Why Query Instead of Prometheus?**

- Transparent access to both recent (Prometheus) and historical (S3) data
- A single query interface for all time ranges
- No need to change dashboards or queries
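As a rough sketch of how the Query component could be wired up (flag values and DNS service names are assumptions; the real args live in ThanosQueryConstruct):

```yaml
# Thanos Query container args (illustrative)
args:
  - query
  - --http-address=0.0.0.0:9090
  - --query.replica-label=prometheus_replica  # label used to deduplicate HA replicas (assumed)
  - --endpoint=dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local  # sidecars
  - --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local         # store gateways
```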
#### Thanos Store (2 replicas, 10Gi PVC each)

The historical data gateway:

- **Purpose**: Queries historical metrics from S3
- **Caching**: Index cache (500MB) + chunk cache (500MB) for performance
- **StoreAPI**: Provides the gRPC interface for historical blocks
- **Resources**: 200m/1Gi requests per replica
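A hedged sketch of the corresponding container args (mount paths are assumptions; the actual configuration lives in ThanosStoreConstruct):

```yaml
# Thanos Store container args (illustrative)
args:
  - store
  - --data-dir=/var/thanos/store                     # backed by the 10Gi PVC
  - --index-cache-size=500MB
  - --chunk-pool-size=500MB
  - --objstore.config-file=/etc/thanos/objstore.yml  # assumed mount path for the S3 config
```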
#### Thanos Compactor (1 replica, 20Gi PVC)

Data lifecycle management:

- **Downsampling**: Creates 5-minute and 1-hour resolution data
- **Compaction**: Merges small blocks into larger ones
- **Retention Enforcement**: Deletes data beyond the retention periods
- **Schedule**: Runs every few minutes
- **Resources**: 500m/2Gi requests
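The retention tiers described in the next section map directly onto compactor flags; a sketch with values matching the table below (mount paths are assumptions):

```yaml
# Thanos Compactor container args (illustrative)
args:
  - compact
  - --wait                                           # run continuously instead of one-shot
  - --data-dir=/var/thanos/compactor                 # backed by the 20Gi PVC
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d
  - --retention.resolution-5m=180d
  - --retention.resolution-1h=730d
```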
### Retention Strategy

Multi-tier retention balances cost, performance, and data availability:

| Resolution | Retention Period | Storage | Use Case |
|---|---|---|---|
| Raw (15s) | 3 days | Prometheus local | Recent troubleshooting, live dashboards |
| Raw (15s) | 30 days | S3 | Recent historical analysis |
| 5-minute | 180 days (~6 months) | S3 | Medium-term trends |
| 1-hour | 730 days (2 years) | S3 | Long-term capacity planning |
**Rationale:**

- **3-day local**: Fast queries for recent data with minimal storage and memory
- **30-day raw**: Detailed troubleshooting without excessive S3 costs
- **6-month downsampled**: Trends and patterns without raw-data overhead
- **2-year hourly**: Long-term capacity planning and compliance
### Storage Configuration

**S3 Bucket**: `metrics-thanos-kup6s`

- **Provider**: Hetzner Object Storage
- **Region**: fsn1 (Falkenstein, same as the cluster)
- **Provisioning**: Crossplane-managed S3 bucket
- **Credentials**: ESO replicates them from the `crossplane-system` namespace to the `monitoring` namespace

**Prometheus Local Storage:**

- **PVC Size**: 3Gi per replica
- **Storage Class**: `longhorn` (2 storage replicas, best-effort locality)
- **Retention**: 3 days (older data is deleted automatically)
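The object storage configuration shared by the sidecar, store, and compactor follows the standard Thanos objstore format; a sketch, assuming the usual Hetzner endpoint naming (the credentials come from the ESO-synced Secret):

```yaml
# objstore.yml (illustrative)
type: S3
config:
  bucket: metrics-thanos-kup6s
  endpoint: fsn1.your-objectstorage.com   # assumed Hetzner Object Storage endpoint
  access_key: <from ESO-synced secret>
  secret_key: <from ESO-synced secret>
```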
## Log Collection

### Loki - Log Aggregation

**Architecture**: SimpleScalable deployment mode (3 components)

#### Components
**Loki Backend:**

- Handles compaction and deletion
- Resources: 100m/256Mi requests

**Loki Read:**

- Handles the query path
- Resources: 100m/256Mi requests

**Loki Write:**

- Handles the ingestion path
- Resources: 100m/256Mi requests
**Why SimpleScalable?**

- Suitable for medium-scale clusters (fewer than 100 nodes)
- Simpler than the microservices mode
- Easier to operate and troubleshoot
- Sufficient for the current cluster size
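If the stack uses the grafana/loki Helm chart (an assumption; the manifests are actually generated by LokiConstruct), the mode and per-component replica counts boil down to a few values:

```yaml
# Loki Helm values sketch (illustrative)
deploymentMode: SimpleScalable
backend:
  replicas: 1
read:
  replicas: 1
write:
  replicas: 1
```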
### Alloy - Log Collector

- **Deployment**: DaemonSet (one pod per node)
- **Purpose**: Collects logs from all pods on the node
- **Discovery**: Automatic discovery via the Kubernetes API
- **Shipping**: Ships logs to the Loki Write component
- **Resources**: 100m/128Mi requests per pod
**Log Sources:**

- Container stdout/stderr
- Kubernetes events
- Node system logs (optional)
### Log Retention

**Storage:**

- **S3 Bucket**: `logs-loki-kup6s` (Hetzner fsn1)
- **Retention**: 744 hours (31 days)
- **Credentials**: The same Hetzner S3 credentials as Thanos
**Why 31 Days?**

- Sufficient for troubleshooting recent issues
- Balances storage costs with usefulness
- Longer retention is available via S3 lifecycle policies if needed
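In Loki's own configuration, the 744-hour retention is typically expressed via the limits and compactor sections; a hedged sketch (key placement varies between Loki versions):

```yaml
# Loki config sketch (illustrative)
limits_config:
  retention_period: 744h    # 31 days
compactor:
  retention_enabled: true
  delete_request_store: s3  # required by recent Loki versions when retention is enabled
```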
### Querying Logs

**Grafana Explore:**

1. Navigate to Grafana → Explore
2. Select the "Loki" datasource
3. Use LogQL queries:

```logql
{namespace="monitoring"} |= "error"
{app="nginx"} | json | status >= 400
```

See the Loki Query Language (LogQL) documentation for syntax.
## Visualization and Dashboards

### Grafana

The central observability UI:

- **Datasources** (see the provisioning sketch below):
    - Prometheus → Thanos Query (metrics)
    - Loki (logs)
    - Optional: Tempo (traces, if deployed)
- **Resources**: 50m/512Mi requests (memory increased to prevent OOM kills)
- **Storage**: ConfigMaps for dashboards (can migrate to git-synced provisioning)
- **Access**: https://grafana.ops.kup6s.net
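A sketch of the corresponding datasource provisioning file, using the internal service addresses listed later in this document (whether Grafana is provisioned exactly this way is an assumption; see GrafanaConstruct):

```yaml
# Grafana datasource provisioning sketch (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://thanos-query.monitoring.svc.cluster.local:9090
    isDefault: true   # dashboards query Thanos Query, not Prometheus directly
  - name: Loki
    type: loki
    url: http://loki-read.monitoring.svc.cluster.local:3100
```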
**Pre-configured Dashboards:**

- Kubernetes cluster overview
- Node metrics
- Pod resource usage
- Longhorn storage metrics
- Custom application dashboards
**Creating Dashboards:**

1. Create the dashboard in the Grafana UI
2. Export the JSON
3. Store it in `dp-infra/monitoring/dashboards/` (optional)
4. Commit to git for version control
## Alerting

### Alertmanager

Alert routing and notification:

- **Resources**: 100m/256Mi requests
- **Configuration**: Managed via `dp-infra/monitoring/config.yaml`
- **Routing**: Email via SMTP
**Alert Flow:**

```text
Prometheus/Loki
  │
  ├──> Alert Rules (fire when conditions are met)
  │
  └──> Alertmanager (receives firing alerts)
         │
         ├──> Grouping (similar alerts grouped)
         ├──> Inhibition (suppress redundant alerts)
         ├──> Silencing (temporary muting)
         │
         └──> Notification (email via SMTP)
```
**Silenced Alerts:**

- `CPUThrottlingHigh`: Expected behavior with CPU limits
- `KubeMemoryOvercommit`: Intentional overcommit strategy

**Configuration**: See `AlertmanagerConstruct` for routing rules and inhibitions.
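For reference, a minimal sketch of an Alertmanager configuration with email routing; addresses and the SMTP host are placeholders, and the real routing rules live in AlertmanagerConstruct and config.yaml:

```yaml
# Alertmanager config sketch (illustrative)
route:
  receiver: email
  group_by: ["alertname", "namespace"]
receivers:
  - name: email
    email_configs:
      - to: ops@example.com              # placeholder recipient
        from: alertmanager@example.com   # placeholder sender
        smarthost: smtp.example.com:587  # placeholder SMTP relay
```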
## Key Monitoring Endpoints

**User-Facing UIs:**

- **Grafana**: https://grafana.ops.kup6s.net (dashboards, queries)
- **Longhorn UI**: https://longhorn.ops.kup6s.net (storage management)
- **ArgoCD**: https://argocd.ops.kup6s.net (GitOps deployments)

**Internal Services** (accessible via port-forward):

- **Prometheus**: `prometheus.monitoring.svc:9090`
- **Thanos Query**: `thanos-query.monitoring.svc:9090`
- **Loki**: `loki-read.monitoring.svc:3100`
- **Alertmanager**: `alertmanager.monitoring.svc:9093`
**Port-Forward Example:**

```bash
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Access at http://localhost:9090
```
## Health Check Commands

```bash
# Check cluster nodes
kubectl get nodes

# List pods that are not Running (all namespaces)
kubectl get pods -A | grep -v Running

# Check the monitoring stack
kubectl get pods -n monitoring

# Check Longhorn storage health
kubectl get nodes.longhorn.io -n longhorn-system

# Check PostgreSQL clusters
kubectl get clusters.postgresql.cnpg.io -A

# Check ArgoCD applications
kubectl get applications -n argocd

# Check External Secrets Operator
kubectl get pods -n external-secrets

# Check ExternalSecrets and SecretStores
kubectl get externalsecrets,secretstores,clustersecretstores -A

# Check Loki health
kubectl get pods -n monitoring | grep loki

# Check Thanos components
kubectl get pods -n monitoring -l 'app.kubernetes.io/name in (thanos-query,thanos-store,thanos-compactor)'
```
## Verification Commands

**Thanos Health Checks:**

```bash
# Check that all Thanos components are running
kubectl get pods -n monitoring -l 'app.kubernetes.io/name in (thanos-query,thanos-store,thanos-compactor)'

# Check the stores connected to Thanos Query
kubectl exec -n monitoring deploy/thanos-query -- curl -s localhost:9090/api/v1/stores

# Check the S3 blocks loaded by the Store Gateway
kubectl exec -n monitoring thanos-store-0 -- ls /var/thanos/store/meta-syncer/

# Check the Compactor logs for downsampling activity
kubectl logs -n monitoring thanos-compactor-0 --tail=50 | grep -E "(compact|downsample)"
```

**Loki Health Checks:**

```bash
# Check Loki components
kubectl get pods -n monitoring | grep loki

# Check that Loki is ready to receive logs
kubectl exec -n monitoring loki-write-0 -- wget -qO- http://localhost:3100/ready

# Check Alloy log collection
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=20
```
## Integration with Applications

### Exposing Custom Metrics

Applications can expose Prometheus metrics via:

**HTTP `/metrics` endpoint:**

```go
// Example in Go (promhttp from github.com/prometheus/client_golang)
http.Handle("/metrics", promhttp.Handler())
```
**ServiceMonitor CRD (automatic discovery):**

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
```
**PodMonitor CRD (for pods without Services):**

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pods
spec:
  selector:
    matchLabels:
      app: myapp
  podMetricsEndpoints:
    - port: metrics
```
### Structured Logging

For optimal Loki querying, use structured (JSON) logging:

```json
{
  "timestamp": "2025-11-19T14:30:00Z",
  "level": "error",
  "service": "myapp",
  "message": "Database connection failed",
  "error": "connection timeout",
  "duration_ms": 5000
}
```
**LogQL Query Example:**

```logql
{namespace="myapp"} | json | level="error" | duration_ms > 3000
```
## Deployment and Configuration

**Management**: All monitoring components are managed via CDK8S in `dp-infra/monitoring/`.

**Configuration Files:**

- Central config: `dp-infra/monitoring/config.yaml`
- TypeScript constructs: `dp-infra/monitoring/charts/`
    - `PrometheusConstruct` (with Thanos sidecar)
    - `ThanosQueryConstruct`
    - `ThanosStoreConstruct`
    - `ThanosCompactorConstruct`
    - `LokiConstruct`
    - `GrafanaConstruct`
    - `AlloyConstruct`
    - `AlertmanagerConstruct`
    - `S3BucketsConstruct` (metrics-thanos-kup6s, logs-loki-kup6s)
    - `S3SecretsConstruct` (ExternalSecret for S3 credentials)
- Generated manifests: `dp-infra/monitoring/manifests/monitoring.k8s.yaml`
**Workflow:**

1. Edit `config.yaml` or the TypeScript constructs
2. Build: `npm run build`
3. Commit manifests to git: `git add manifests/ && git commit && git push`
4. ArgoCD auto-syncs from git to the cluster

See the Monitoring Deployment Documentation for detailed guides.
## Troubleshooting

### No Metrics for New Pods

**Symptom**: Pods not showing up in Grafana.

**Check:**

1. ServiceMonitor/PodMonitor created?

```bash
kubectl get servicemonitors,podmonitors -A
```

2. Prometheus targets configured?

```bash
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Visit http://localhost:9090/targets
```

3. Metrics endpoint accessible?

```bash
kubectl port-forward -n myapp pod/myapp-xxx 8080:8080
curl http://localhost:8080/metrics
```
### Historical Metrics Missing

**Symptom**: Queries for data older than 3 days return nothing.

**Check:**

1. Thanos Store running?

```bash
kubectl get pods -n monitoring | grep thanos-store
```

2. S3 blocks uploaded?

```bash
kubectl exec -n monitoring thanos-store-0 -- ls /var/thanos/store/meta-syncer/
```

3. Thanos Query connected to the Store?

```bash
kubectl exec -n monitoring deploy/thanos-query -- curl -s localhost:9090/api/v1/stores
```
### Logs Not Appearing

**Symptom**: No logs in Grafana Explore.

**Check:**

1. Alloy running on all nodes?

```bash
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
```

2. Loki Write accepting logs?

```bash
kubectl logs -n monitoring loki-write-0 --tail=50
```

3. Loki Read serving queries?

```bash
kubectl exec -n monitoring loki-read-0 -- wget -qO- http://localhost:3100/ready
```
## Further Reading

- Monitoring Deployment How-To - deploy and configure the monitoring stack
- Configuration Reference - `config.yaml` options
- Prometheus Operator - ServiceMonitor and PodMonitor CRDs
- Thanos Documentation - Thanos architecture and components
- Loki Documentation - LogQL and deployment modes
- Grafana Documentation - dashboard creation and datasources