CDK8S Approach for Monitoring Stack¶
This document explains how the monitoring stack uses CDK8S with deployment-specific constructs and patterns. For general CDK8S concepts, architecture, and benefits, see CDK8S Infrastructure as Code.
Overview¶
The monitoring stack deployment consists of 12 TypeScript constructs that generate the complete observability platform:
Prometheus + Thanos (metrics collection and long-term storage)
Grafana (visualization)
Loki (log aggregation)
Alloy (metrics and logs collection agent)
Alertmanager (alert routing)
S3 buckets for Thanos and Loki data
Project Structure¶
dp-infra/monitoring/
├── charts/
│   ├── constructs/              # Individual construct files
│   │   ├── namespace.ts
│   │   ├── priority-class.ts
│   │   ├── crossplane-provider.ts
│   │   ├── thanos-s3-bucket.ts
│   │   ├── loki-s3-bucket.ts
│   │   ├── thanos-s3-credentials.ts
│   │   ├── prometheus-stack.ts
│   │   ├── loki.ts
│   │   ├── alloy.ts
│   │   ├── thanos-query.ts
│   │   ├── thanos-store.ts
│   │   └── thanos-compactor.ts
│   ├── types.ts                 # Shared TypeScript interfaces
│   └── monitoring-chart.ts      # Main chart (composes constructs)
├── config.yaml                  # Configuration values
├── main.ts                      # Entry point (loads config, synthesizes)
├── package.json                 # NPM dependencies
├── tsconfig.json                # TypeScript configuration
├── tests/                       # Jest unit tests
│   ├── constructs/
│   │   ├── prometheus-stack.test.ts
│   │   └── ...
│   └── monitoring-chart.test.ts
└── manifests/                   # Generated YAML (committed to git)
    └── monitoring.k8s.yaml
Monitoring Stack Constructs¶
1. Namespace and Infrastructure¶
NamespaceConstruct (namespace.ts)
Creates the monitoring namespace
Adds standard labels
PriorityClassConstruct (priority-class.ts)
PriorityClass for critical monitoring components
Ensures Prometheus/Alertmanager scheduling priority
2. Storage and S3¶
CrossplaneProviderConstruct (crossplane-provider.ts)
ProviderConfig for Hetzner S3 (references cluster-wide configuration)
ThanosS3BucketConstruct (thanos-s3-bucket.ts)
Creates the metrics-thanos-kup6s bucket
BucketVersioning (disabled)
BucketLifecycleConfiguration:
Raw data: 30 days
5min resolution: 180 days
1hour resolution: 730 days
LokiS3BucketConstruct (loki-s3-bucket.ts)
Creates the logs-loki-kup6s bucket
BucketVersioning (disabled)
BucketLifecycleConfiguration (744h retention)
ThanosS3CredentialsConstruct (thanos-s3-credentials.ts)
Key pattern: Replicates S3 credentials from crossplane-system to monitoring
Uses ExternalSecret with ClusterSecretStore
Enables secret sharing across namespaces without manually duplicating the credentials
3. Prometheus Stack (Helm Integration)¶
PrometheusStackConstruct (prometheus-stack.ts)
Key pattern: Wraps the kube-prometheus-stack Helm chart in CDK8S
Generates complex Helm values from TypeScript config
Configures Prometheus, Grafana, Alertmanager
Adds Thanos sidecar to Prometheus
Resource requests/limits from config
Example:
export class PrometheusStackConstruct extends Construct {
  constructor(scope: Construct, id: string, props: PrometheusStackProps) {
    super(scope, id);

    const helmChart = new HelmChart(this, 'prometheus-stack', {
      chart: 'kube-prometheus-stack',
      repo: 'https://prometheus-community.github.io/helm-charts',
      values: {
        prometheus: {
          prometheusSpec: {
            replicas: props.config.replicas.prometheus,
            retention: '3d', // Local storage (Thanos handles long-term)
            storageSpec: {
              volumeClaimTemplate: {
                spec: {
                  storageClassName: 'longhorn',
                  resources: {
                    requests: {
                      storage: props.config.storage.prometheus,
                    },
                  },
                },
              },
            },
            thanos: {
              // Thanos sidecar configuration
              objectStorageConfig: {
                name: 'thanos-objstore-config',
                key: 'objstore.yml',
              },
            },
            resources: {
              requests: {
                cpu: props.config.resources.prometheus.requests.cpu,
                memory: props.config.resources.prometheus.requests.memory,
              },
            },
          },
        },
      },
    });
  }
}
Benefits:
Type-safe Helm values generation
Complex chart wrapped in simple construct
Resource limits from centralized config
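One way to keep the values generation testable is to build the values object in a plain function and pass the result to the chart. The following is a minimal sketch, not the project's actual code: `PrometheusConfigSlice` and `buildPrometheusValues` are hypothetical names, and the config shape is trimmed to the fields used in the example above.

```typescript
// Hypothetical, trimmed-down slice of the config used by the Prometheus construct.
interface PrometheusConfigSlice {
  replicas: { prometheus: number };
  storage: { prometheus: string };
  resources: { prometheus: { requests: { cpu: string; memory: string } } };
}

// Builds the `prometheus.prometheusSpec` portion of the Helm values from config.
function buildPrometheusValues(config: PrometheusConfigSlice) {
  return {
    prometheus: {
      prometheusSpec: {
        replicas: config.replicas.prometheus,
        retention: '3d', // Short local retention; long-term data lives in Thanos
        storageSpec: {
          volumeClaimTemplate: {
            spec: {
              storageClassName: 'longhorn',
              resources: { requests: { storage: config.storage.prometheus } },
            },
          },
        },
        resources: { requests: { ...config.resources.prometheus.requests } },
      },
    },
  };
}

const prometheusValues = buildPrometheusValues({
  replicas: { prometheus: 2 },
  storage: { prometheus: '3Gi' },
  resources: { prometheus: { requests: { cpu: '100m', memory: '1500Mi' } } },
});
console.log(JSON.stringify(prometheusValues, null, 2));
```

Because the function has no CDK8S dependency, it can be unit-tested without synthesizing a chart.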
4. Loki and Alloy¶
LokiConstruct (loki.ts)
Wraps Loki Helm chart in CDK8S
Configures S3 storage backend
Sets resource requests/limits
Backend, Read, Write components with anti-affinity
AlloyConstruct (alloy.ts)
Grafana Alloy (agent for logs and metrics collection)
DaemonSet on all nodes
Sends logs to Loki, metrics to Prometheus
5. Thanos Components¶
ThanosQueryConstruct (thanos-query.ts)
Key pattern: Pure CDK8S (not Helm), full control
Deployment with 2 replicas
Anti-affinity (spread across nodes)
Service for Grafana datasource
Resource requests from config
Example:
export class ThanosQueryConstruct extends Construct {
  constructor(scope: Construct, id: string, props: ThanosQueryProps) {
    super(scope, id);

    const deployment = new kplus.Deployment(this, 'deployment', {
      metadata: {
        namespace: props.config.namespace,
        labels: this.generateLabels('thanos', 'query'),
      },
      replicas: 2,
      containers: [{
        name: 'thanos-query',
        image: `quay.io/thanos/thanos:${props.config.versions.thanos}`,
        args: [
          'query',
          '--grpc-address=0.0.0.0:10901',
          '--http-address=0.0.0.0:9090',
          '--store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local',
          '--store=dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local',
        ],
        portNumber: 9090,
        resources: {
          requests: {
            cpu: kplus.Cpu.millis(parseInt(props.config.resources.thanosQuery.requests.cpu)),
            memory: kplus.Size.mebibytes(parseInt(props.config.resources.thanosQuery.requests.memory)),
          },
        },
      }],
      podMetadata: {
        labels: this.generateLabels('thanos', 'query'),
      },
      affinity: {
        podAntiAffinity: {
          preferredDuringSchedulingIgnoredDuringExecution: [{
            weight: 100,
            podAffinityTerm: {
              labelSelector: {
                matchLabels: this.generateLabels('thanos', 'query'),
              },
              topologyKey: 'kubernetes.io/hostname',
            },
          }],
        },
      },
    });

    // Create Service
    new kplus.Service(this, 'service', {
      metadata: { namespace: props.config.namespace },
      selector: deployment,
      ports: [
        { port: 9090, targetPort: 9090, name: 'http' },
        { port: 10901, targetPort: 10901, name: 'grpc' },
      ],
    });
  }
}
Benefits:
Full control over Deployment spec
Type-safe anti-affinity configuration
No Helm templating complexity
Easy to test and modify
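The example calls `generateLabels('thanos', 'query')` in three places (Deployment metadata, pod metadata, and the anti-affinity selector), and those three label sets have to agree for the anti-affinity rule to match the pods. The helper itself is not shown in this document, so the following is only a guess at its shape: the `app.kubernetes.io/name` value is inferred from the anti-affinity snippet later on this page, and the other two labels are assumptions.

```typescript
type Labels = Record<string, string>;

// Hypothetical sketch of a label helper; the real implementation may differ.
function generateLabels(app: string, component: string): Labels {
  return {
    'app.kubernetes.io/name': `${app}-${component}`,
    'app.kubernetes.io/component': component, // Assumed label, not confirmed by the source
    'app.kubernetes.io/part-of': 'monitoring', // Assumed label, not confirmed by the source
  };
}

// The same call yields identical labels for pod metadata and the selector.
const podLabels = generateLabels('thanos', 'query');
const selectorLabels = generateLabels('thanos', 'query');
console.log(JSON.stringify(podLabels) === JSON.stringify(selectorLabels));
```

Deriving every label set from one function avoids the failure mode where a selector silently stops matching after a label is renamed in only one place.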
ThanosStoreConstruct (thanos-store.ts)
StatefulSet with 2 replicas
Persistent storage (10Gi each) for index/chunk cache
Anti-affinity across nodes
Queries historical metrics from S3
ThanosCompactorConstruct (thanos-compactor.ts)
StatefulSet with 1 replica
Persistent storage (20Gi) for compaction workspace
Downsamples and compacts S3 blocks
Enforces retention policies
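The compactor enforces retention through its `--retention.resolution-raw`, `--retention.resolution-5m`, and `--retention.resolution-1h` flags. The sketch below shows how those arguments could be derived from typed values. It assumes the compactor uses the same 30/180/730-day periods listed for the Thanos bucket above, which this document does not confirm; `RetentionDays` and `compactorRetentionArgs` are hypothetical names.

```typescript
// Retention per resolution, in days (hypothetical config shape).
interface RetentionDays {
  raw: number;
  fiveMinute: number;
  oneHour: number;
}

// Turns retention settings into Thanos compactor command-line arguments.
function compactorRetentionArgs(retention: RetentionDays): string[] {
  for (const [name, days] of Object.entries(retention)) {
    if (!Number.isInteger(days) || days < 0) {
      throw new RangeError(`retention.${name} must be a non-negative integer, got ${days}`);
    }
  }
  return [
    `--retention.resolution-raw=${retention.raw}d`,
    `--retention.resolution-5m=${retention.fiveMinute}d`,
    `--retention.resolution-1h=${retention.oneHour}d`,
  ];
}

// Assumed values, mirroring the lifecycle periods listed for the Thanos bucket.
const retentionArgs = compactorRetentionArgs({ raw: 30, fiveMinute: 180, oneHour: 730 });
console.log(retentionArgs.join(' '));
```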
Construct Hierarchy¶
MonitoringChart
├── NamespaceConstruct (Namespace)
├── PriorityClassConstruct (PriorityClass)
├── CrossplaneProviderConstruct (ProviderConfig)
├── ThanosS3BucketConstruct (Bucket, BucketVersioning, BucketLifecycle)
├── LokiS3BucketConstruct (Bucket, BucketVersioning, BucketLifecycle)
├── ThanosS3CredentialsConstruct (ExternalSecret)
├── PrometheusStackConstruct (HelmChart)
├── LokiConstruct (HelmChart)
├── AlloyConstruct (HelmChart)
├── ThanosQueryConstruct (Deployment, Service)
├── ThanosStoreConstruct (StatefulSet, Service, PVC)
└── ThanosCompactorConstruct (StatefulSet, Service, PVC)
Configuration Pattern¶
# config.yaml
namespace: monitoring
versions:
  prometheus: v2.55.1
  grafana: 11.4.0
  loki: 3.3.2
  thanos: v0.37.2
replicas:
  prometheus: 2
  alertmanager: 2
storage:
  prometheus: 3Gi
  thanosStore: 10Gi
  thanosCompactor: 20Gi
resources:
  prometheus:
    requests:
      cpu: "100m"
      memory: "1500Mi"
  thanosQuery:
    requests:
      cpu: "25m"
      memory: "128Mi"
Benefits:
All configuration in one file
Type-safe via the MonitoringConfig interface
Environment-specific configs possible
IDE autocomplete
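A TypeScript interface only checks code at compile time; values parsed from YAML still arrive untyped at runtime. The sketch below shows what a `MonitoringConfig` interface matching the file above might look like, paired with a small runtime guard. The field list is inferred from the YAML shown here, the guard checks only two fields for brevity, and `assertMonitoringConfig` is a hypothetical name.

```typescript
interface ResourceRequests {
  cpu: string;
  memory: string;
}

// Inferred from config.yaml above; the real interface in types.ts may have more fields.
interface MonitoringConfig {
  namespace: string;
  versions: { prometheus: string; grafana: string; loki: string; thanos: string };
  replicas: { prometheus: number; alertmanager: number };
  storage: { prometheus: string; thanosStore: string; thanosCompactor: string };
  resources: Record<string, { requests: ResourceRequests }>;
}

// Minimal runtime guard for an object parsed from YAML (checks two fields only).
function assertMonitoringConfig(raw: unknown): MonitoringConfig {
  const candidate = raw as Partial<MonitoringConfig> | null;
  if (!candidate || typeof candidate.namespace !== 'string') {
    throw new Error('config.namespace must be a string');
  }
  const replicas = candidate.replicas;
  if (!replicas || !Number.isInteger(replicas.prometheus) || replicas.prometheus < 1) {
    throw new Error('config.replicas.prometheus must be a positive integer');
  }
  return candidate as MonitoringConfig;
}

const monitoringConfig = assertMonitoringConfig({
  namespace: 'monitoring',
  versions: { prometheus: 'v2.55.1', grafana: '11.4.0', loki: '3.3.2', thanos: 'v0.37.2' },
  replicas: { prometheus: 2, alertmanager: 2 },
  storage: { prometheus: '3Gi', thanosStore: '10Gi', thanosCompactor: '20Gi' },
  resources: { thanosQuery: { requests: { cpu: '25m', memory: '128Mi' } } },
});
console.log(monitoringConfig.replicas.prometheus);
```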
Unique Monitoring Patterns¶
1. Mixed Helm + Pure CDK8S¶
Helm for complex charts:
kube-prometheus-stack (100+ resources)
Loki (complex distributed setup)
Alloy (DaemonSet with many features)
Pure CDK8S for simple components:
Thanos Query (just Deployment + Service)
Thanos Store (StatefulSet + Service)
Thanos Compactor (StatefulSet)
Benefit: Use Helm where it helps, CDK8S where we need control.
2. Secret Replication Pattern¶
Challenge: Thanos needs S3 credentials, but they live in the crossplane-system namespace.
Solution: An ExternalSecret backed by a ClusterSecretStore.
new ExternalSecret(this, 'thanos-s3-credentials', {
  metadata: {
    // The target Secret is created in the ExternalSecret's own namespace;
    // spec.target has no namespace field.
    namespace: 'monitoring',
  },
  spec: {
    refreshInterval: '1h',
    secretStoreRef: {
      name: 'crossplane-secret-store', // ClusterSecretStore
      kind: 'ClusterSecretStore',
    },
    target: {
      name: 'thanos-s3-credentials',
    },
    dataFrom: [{
      find: {
        name: {
          regexp: '^hetzner-s3-credentials$', // Source secret in crossplane-system
        },
      },
    }],
  },
});
Benefits:
No manually duplicated secrets
Single source of truth (crossplane-system)
Automatic updates (ESO syncs every 1h)
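For reference, the resource this pattern produces can be written out as a plain manifest object. This is a sketch under assumptions: the `external-secrets.io/v1beta1` API version depends on the installed External Secrets Operator release, and `thanosCredentialsExternalSecret` is a hypothetical helper name. The namespace sits in `metadata`, because an ExternalSecret writes its target Secret into its own namespace.

```typescript
// Builds the ExternalSecret manifest as a plain object (no CDK8S dependency).
function thanosCredentialsExternalSecret(namespace: string) {
  return {
    apiVersion: 'external-secrets.io/v1beta1', // Assumption: match your ESO version
    kind: 'ExternalSecret',
    metadata: { name: 'thanos-s3-credentials', namespace },
    spec: {
      refreshInterval: '1h',
      secretStoreRef: { name: 'crossplane-secret-store', kind: 'ClusterSecretStore' },
      target: { name: 'thanos-s3-credentials' },
      dataFrom: [{ find: { name: { regexp: '^hetzner-s3-credentials$' } } }],
    },
  };
}

const externalSecret = thanosCredentialsExternalSecret('monitoring');
console.log(JSON.stringify(externalSecret, null, 2));
```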
3. Anti-Affinity for High Availability¶
All critical components use anti-affinity:
affinity: {
  podAntiAffinity: {
    preferredDuringSchedulingIgnoredDuringExecution: [{
      weight: 100,
      podAffinityTerm: {
        labelSelector: {
          matchLabels: { 'app.kubernetes.io/name': 'thanos-query' },
        },
        topologyKey: 'kubernetes.io/hostname', // Spread across nodes
      },
    }],
  },
}
Applied to:
Thanos Query (2 replicas)
Thanos Store (2 replicas)
Prometheus (2 replicas via Helm values)
Benefit: Node failure doesn’t take down entire monitoring stack.
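Since the same anti-affinity block is applied to several components, it could be produced by one small function instead of being repeated. The following is a minimal sketch; `preferSpreadAcrossNodes` is a hypothetical name and not part of CDK8S. It returns the plain Kubernetes affinity structure shown above and enforces the 1 to 100 weight range that Kubernetes requires for preferred scheduling terms.

```typescript
// Builds a "preferred" pod anti-affinity rule that spreads matching pods across nodes.
function preferSpreadAcrossNodes(matchLabels: Record<string, string>, weight = 100) {
  if (!Number.isInteger(weight) || weight < 1 || weight > 100) {
    throw new RangeError(`weight must be an integer between 1 and 100, got ${weight}`);
  }
  return {
    podAntiAffinity: {
      preferredDuringSchedulingIgnoredDuringExecution: [{
        weight,
        podAffinityTerm: {
          labelSelector: { matchLabels },
          topologyKey: 'kubernetes.io/hostname', // One topology domain per node
        },
      }],
    },
  };
}

const queryAffinity = preferSpreadAcrossNodes({ 'app.kubernetes.io/name': 'thanos-query' });
console.log(JSON.stringify(queryAffinity, null, 2));
```

Note that a `preferred` rule is a scheduling hint: if only one node has capacity, both replicas can still land on it.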
4. Resource Guarantees from Config¶
All components have resource requests:
resources: {
  requests: {
    cpu: kplus.Cpu.millis(parseInt(config.resources.component.requests.cpu)),
    memory: kplus.Size.mebibytes(parseInt(config.resources.component.requests.memory)),
  },
  limits: {
    cpu: kplus.Cpu.millis(parseInt(config.resources.component.limits.cpu)),
    memory: kplus.Size.mebibytes(parseInt(config.resources.component.limits.memory)),
  },
}
Benefits:
QoS guarantees (Burstable class)
Predictable scheduling
Prevents resource starvation
All values centralized in config.yaml
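One caveat with the `parseInt` calls above: `parseInt` stops at the first non-digit and ignores the unit, so `"1Gi"` becomes `1` and would be read as 1 MiB, and `"1"` CPU would be read as 1 millicore. A unit-aware parser avoids that. This is a minimal sketch supporting only a subset of Kubernetes quantity suffixes; `parseCpuMillis` and `parseMemoryMiB` are hypothetical helper names.

```typescript
// Converts a CPU quantity such as "100m" or "0.5" into millicores.
function parseCpuMillis(quantity: string): number {
  const q = quantity.trim();
  const millis = /^(\d+)m$/.exec(q);
  if (millis) return Number(millis[1]);
  const cores = /^(\d+(?:\.\d+)?)$/.exec(q);
  if (cores) return Number(cores[1]) * 1000;
  throw new Error(`unsupported CPU quantity: ${quantity}`);
}

// Multipliers to convert binary-suffixed memory quantities into MiB.
const MIB_PER_UNIT: Record<string, number> = { Ki: 1 / 1024, Mi: 1, Gi: 1024, Ti: 1024 * 1024 };

// Converts a memory quantity such as "1500Mi" or "1Gi" into mebibytes.
function parseMemoryMiB(quantity: string): number {
  const match = /^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)$/.exec(quantity.trim());
  if (!match) throw new Error(`unsupported memory quantity: ${quantity}`);
  return Number(match[1]) * MIB_PER_UNIT[match[2]];
}

console.log(parseCpuMillis('100m'));   // 100
console.log(parseCpuMillis('2'));      // 2000
console.log(parseMemoryMiB('1500Mi')); // 1500
console.log(parseMemoryMiB('1Gi'));    // 1024
console.log(parseInt('1Gi', 10));      // 1 (the unit is silently dropped)
```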
Build Workflow¶
# 1. Edit configuration
vim config.yaml
# 2. Build manifests
npm run build
# Runs: compile → test → synth
# 3. Review changes
git diff manifests/monitoring.k8s.yaml
# 4. Commit and push
git add manifests/ charts/ config.yaml
git commit -m "Scale Prometheus to 3 replicas"
git push
# 5. ArgoCD automatically deploys
Testing Strategy¶
The monitoring stack is covered by Jest unit tests that assert on the synthesized manifests:
describe('ThanosQueryConstruct', () => {
  it('should create 2 replicas by default', () => {
    const chart = Testing.chart();
    const config = createTestConfig();
    new ThanosQueryConstruct(chart, 'test', { config });

    const manifests = synthesizeChart(chart);
    const deployment = findResource(manifests, 'Deployment');
    expect(deployment.spec.replicas).toBe(2);
  });

  it('should configure anti-affinity', () => {
    const chart = Testing.chart();
    const config = createTestConfig();
    new ThanosQueryConstruct(chart, 'test', { config });

    const manifests = synthesizeChart(chart);
    const deployment = findResource(manifests, 'Deployment');
    expect(deployment.spec.template.spec.affinity).toBeDefined();
    expect(deployment.spec.template.spec.affinity.podAntiAffinity).toBeDefined();
  });

  it('should set resource requests from config', () => {
    const chart = Testing.chart();
    const config = createTestConfig({
      resources: {
        thanosQuery: { requests: { cpu: '50m', memory: '256Mi' } }
      }
    });
    new ThanosQueryConstruct(chart, 'test', { config });

    const manifests = synthesizeChart(chart);
    const deployment = findResource(manifests, 'Deployment');
    expect(deployment.spec.template.spec.containers[0].resources.requests.cpu).toBe('50m');
    expect(deployment.spec.template.spec.containers[0].resources.requests.memory).toBe('256Mi');
  });
});
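The tests rely on helpers such as `findResource` that are not shown here. Assuming `synthesizeChart` returns an array of manifest objects (as CDK8S's `Testing.synth` does), a lookup helper could be sketched as follows. The signature, the optional `name` parameter, and the "exactly one match" rule are assumptions, not the project's actual implementation.

```typescript
// Minimal shape of a synthesized Kubernetes manifest.
interface Manifest {
  apiVersion: string;
  kind: string;
  metadata?: { name?: string };
  [key: string]: unknown;
}

// Returns the single manifest of the given kind (and optional name), or throws.
function findResource(manifests: Manifest[], kind: string, name?: string): Manifest {
  const matches = manifests.filter(
    (m) => m.kind === kind && (name === undefined || m.metadata?.name === name),
  );
  if (matches.length !== 1) {
    throw new Error(`expected exactly one ${kind}, found ${matches.length}`);
  }
  return matches[0];
}

// Usage with hand-written manifests standing in for synthesized output.
const sampleManifests: Manifest[] = [
  { apiVersion: 'apps/v1', kind: 'Deployment', metadata: { name: 'thanos-query' }, spec: { replicas: 2 } },
  { apiVersion: 'v1', kind: 'Service', metadata: { name: 'thanos-query' } },
];
console.log(findResource(sampleManifests, 'Deployment').metadata?.name);
```

Throwing on zero or multiple matches keeps a test from passing against the wrong resource when a construct starts emitting more than one object of the same kind.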
Troubleshooting¶
Thanos Query Can’t Reach Prometheus¶
Symptoms: Thanos Query shows “no stores available”
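Solution: Check DNS resolution for the Prometheus sidecar from inside a Thanos Query pod. Thanos Query runs as a Deployment, so its pods have generated name suffixes; substitute the actual pod name from kubectl get pods -n monitoring.
kubectl exec -n monitoring <thanos-query-pod> -- nslookup prometheus-operated.monitoring.svc.cluster.local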
Loki Pods Pending (Memory)¶
Symptoms: Loki pods stuck in Pending with “Insufficient memory”
Solution: Reduce memory requests in config.yaml
resources:
  loki:
    backend:
      requests:
        memory: "256Mi" # Reduced from 1Gi
S3 Credentials Not Found¶
Symptoms: Thanos sidecar errors with “s3: access denied”
Solution: Check ExternalSecret synced credentials
kubectl get externalsecret -n monitoring thanos-s3-credentials
kubectl get secret -n monitoring thanos-s3-credentials