CDK8S Approach for Monitoring Stack

This document explains how the monitoring stack uses CDK8S with deployment-specific constructs and patterns. For general CDK8S concepts, architecture, and benefits, see CDK8S Infrastructure as Code.

Overview

The monitoring stack deployment consists of 12 TypeScript constructs that generate the complete observability platform:

  • Prometheus + Thanos (metrics collection and long-term storage)

  • Grafana (visualization)

  • Loki (log aggregation)

  • Alloy (metrics and logs collection agent)

  • Alertmanager (alert routing)

  • S3 buckets for Thanos and Loki data

Project Structure

dp-infra/monitoring/
├── charts/
│   ├── constructs/              # Individual construct files
│   │   ├── namespace.ts
│   │   ├── priority-class.ts
│   │   ├── crossplane-provider.ts
│   │   ├── thanos-s3-bucket.ts
│   │   ├── loki-s3-bucket.ts
│   │   ├── thanos-s3-credentials.ts
│   │   ├── prometheus-stack.ts
│   │   ├── loki.ts
│   │   ├── alloy.ts
│   │   ├── thanos-query.ts
│   │   ├── thanos-store.ts
│   │   └── thanos-compactor.ts
│   ├── types.ts                 # Shared TypeScript interfaces
│   └── monitoring-chart.ts      # Main chart (composes constructs)
├── config.yaml                  # Configuration values
├── main.ts                      # Entry point (loads config, synthesizes)
├── package.json                 # NPM dependencies
├── tsconfig.json                # TypeScript configuration
├── tests/                       # Jest unit tests
│   ├── constructs/
│   │   ├── prometheus-stack.test.ts
│   │   └── ...
│   └── monitoring-chart.test.ts
└── manifests/                   # Generated YAML (committed to git)
    └── monitoring.k8s.yaml

Monitoring Stack Constructs

1. Namespace and Infrastructure

NamespaceConstruct (namespace.ts)

  • Creates monitoring namespace

  • Adds standard labels

PriorityClassConstruct (priority-class.ts)

  • PriorityClass for critical monitoring components

  • Ensures Prometheus/Alertmanager scheduling priority

2. Storage and S3

CrossplaneProviderConstruct (crossplane-provider.ts)

  • ProviderConfig for Hetzner S3 (references cluster-wide configuration)

ThanosS3BucketConstruct (thanos-s3-bucket.ts)

  • Creates metrics-thanos-kup6s bucket

  • BucketVersioning (disabled)

  • BucketLifecycleConfiguration:

    • Raw data: 30 days

    • 5min resolution: 180 days

    • 1hour resolution: 730 days

LokiS3BucketConstruct (loki-s3-bucket.ts)

  • Creates logs-loki-kup6s bucket

  • BucketVersioning (disabled)

  • BucketLifecycleConfiguration (744h retention, i.e. 31 days)

ThanosS3CredentialsConstruct (thanos-s3-credentials.ts)

  • Key pattern: Replicates S3 credentials from crossplane-system to monitoring

  • Uses ExternalSecret with ClusterSecretStore

  • Enables secret sharing across namespaces without manually duplicating the secret

3. Prometheus Stack (Helm Integration)

PrometheusStackConstruct (prometheus-stack.ts)

  • Key pattern: Wraps kube-prometheus-stack Helm chart in CDK8S

  • Generates complex Helm values from TypeScript config

  • Configures Prometheus, Grafana, Alertmanager

  • Adds Thanos sidecar to Prometheus

  • Resource requests/limits from config

Example:

export class PrometheusStackConstruct extends Construct {
  constructor(scope: Construct, id: string, props: PrometheusStackProps) {
    super(scope, id);

    const helmChart = new HelmChart(this, 'prometheus-stack', {
      chart: 'kube-prometheus-stack',
      repo: 'https://prometheus-community.github.io/helm-charts',
      values: {
        prometheus: {
          prometheusSpec: {
            replicas: props.config.replicas.prometheus,
            retention: '3d',  // Local storage (Thanos handles long-term)
            storageSpec: {
              volumeClaimTemplate: {
                spec: {
                  storageClassName: 'longhorn',
                  resources: {
                    requests: {
                      storage: props.config.storage.prometheus,
                    },
                  },
                },
              },
            },
            thanos: {
              // Thanos sidecar configuration
              objectStorageConfig: {
                name: 'thanos-objstore-config',
                key: 'objstore.yml',
              },
            },
            resources: {
              requests: {
                cpu: props.config.resources.prometheus.requests.cpu,
                memory: props.config.resources.prometheus.requests.memory,
              },
            },
          },
        },
      },
    });
  }
}

Benefits:

  • Type-safe Helm values generation

  • Complex chart wrapped in simple construct

  • Resource limits from centralized config
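
The Thanos sidecar above reads its bucket settings from the objstore.yml key of the thanos-objstore-config secret. How that secret is produced is not shown in this document, so the following is only a minimal sketch of rendering such a file for an S3 backend. The helper name renderObjstoreConfig, the option names, and the endpoint value are assumptions for illustration; the YAML keys (type, bucket, endpoint, access_key, secret_key, region) follow the Thanos S3 object storage format.

```typescript
// Hypothetical helper (not part of the documented constructs): renders the
// objstore.yml payload that the Thanos sidecar reads from its secret.
interface ObjstoreS3Options {
  bucket: string;
  endpoint: string; // assumption: S3-compatible endpoint host without scheme
  accessKey: string;
  secretKey: string;
  region?: string;
}

function renderObjstoreConfig(opts: ObjstoreS3Options): string {
  // JSON string literals are valid YAML double-quoted scalars, so
  // JSON.stringify is used to quote and escape every value.
  const lines = [
    'type: S3',
    'config:',
    `  bucket: ${JSON.stringify(opts.bucket)}`,
    `  endpoint: ${JSON.stringify(opts.endpoint)}`,
    `  access_key: ${JSON.stringify(opts.accessKey)}`,
    `  secret_key: ${JSON.stringify(opts.secretKey)}`,
  ];
  if (opts.region !== undefined) {
    lines.push(`  region: ${JSON.stringify(opts.region)}`);
  }
  return lines.join('\n') + '\n';
}

// Usage with the bucket name from this document and placeholder values
// for everything else.
const objstoreYaml = renderObjstoreConfig({
  bucket: 'metrics-thanos-kup6s',
  endpoint: 's3.example.invalid', // placeholder, not the real endpoint
  accessKey: 'PLACEHOLDER_ACCESS_KEY',
  secretKey: 'PLACEHOLDER_SECRET_KEY',
});
```

In a real deployment the credentials would come from the replicated secret rather than from literals in code.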

4. Loki and Alloy

LokiConstruct (loki.ts)

  • Wraps Loki Helm chart in CDK8S

  • Configures S3 storage backend

  • Sets resource requests/limits

  • Backend, Read, Write components with anti-affinity

AlloyConstruct (alloy.ts)

  • Grafana Alloy (agent for logs and metrics collection)

  • DaemonSet on all nodes

  • Sends logs to Loki, metrics to Prometheus

5. Thanos Components

ThanosQueryConstruct (thanos-query.ts)

  • Key pattern: Pure CDK8S (not Helm), full control

  • Deployment with 2 replicas

  • Anti-affinity (spread across nodes)

  • Service for Grafana datasource

  • Resource requests from config

Example:

export class ThanosQueryConstruct extends Construct {
  constructor(scope: Construct, id: string, props: ThanosQueryProps) {
    super(scope, id);

    const deployment = new kplus.Deployment(this, 'deployment', {
      metadata: {
        namespace: props.config.namespace,
        labels: this.generateLabels('thanos', 'query'),
      },
      replicas: 2,
      containers: [{
        name: 'thanos-query',
        image: `quay.io/thanos/thanos:${props.config.versions.thanos}`,
        args: [
          'query',
          '--grpc-address=0.0.0.0:10901',
          '--http-address=0.0.0.0:9090',
          '--store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local',
          '--store=dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local',
        ],
        portNumber: 9090,
        resources: {
          requests: {
            cpu: kplus.Cpu.millis(parseInt(props.config.resources.thanosQuery.requests.cpu)),
            memory: kplus.Size.mebibytes(parseInt(props.config.resources.thanosQuery.requests.memory)),
          },
        },
      }],
      podMetadata: {
        labels: this.generateLabels('thanos', 'query'),
      },
      affinity: {
        podAntiAffinity: {
          preferredDuringSchedulingIgnoredDuringExecution: [{
            weight: 100,
            podAffinityTerm: {
              labelSelector: {
                matchLabels: this.generateLabels('thanos', 'query'),
              },
              topologyKey: 'kubernetes.io/hostname',
            },
          }],
        },
      },
    });

    // Create Service
    new kplus.Service(this, 'service', {
      metadata: { namespace: props.config.namespace },
      selector: deployment,
      ports: [
        { port: 9090, targetPort: 9090, name: 'http' },
        { port: 10901, targetPort: 10901, name: 'grpc' },
      ],
    });
  }
}

Benefits:

  • Full control over Deployment spec

  • Type-safe anti-affinity configuration

  • No Helm templating complexity

  • Easy to test and modify
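
The example above calls this.generateLabels('thanos', 'query') without showing the helper. A minimal sketch of what such a helper might return is given below, written as a free function. The exact label set is an assumption; only the app.kubernetes.io/name value of thanos-query is taken from the anti-affinity snippet later in this document.

```typescript
type Labels = Record<string, string>;

// Hypothetical stand-in for the generateLabels helper used by the constructs.
// It derives the recommended Kubernetes labels from an app name and a
// component name so that Deployment, pod template, and selectors all agree.
function generateLabels(app: string, component: string): Labels {
  return {
    'app.kubernetes.io/name': `${app}-${component}`,
    'app.kubernetes.io/component': component,
    'app.kubernetes.io/part-of': 'monitoring', // assumption: stack-wide grouping label
  };
}

const queryLabels = generateLabels('thanos', 'query');
```

Generating labels in one place keeps the pod labels and the anti-affinity selector from drifting apart, which would otherwise silently disable the spreading rule.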

ThanosStoreConstruct (thanos-store.ts)

  • StatefulSet with 2 replicas

  • Persistent storage (10Gi each) for index/chunk cache

  • Anti-affinity across nodes

  • Queries historical metrics from S3

ThanosCompactorConstruct (thanos-compactor.ts)

  • StatefulSet with 1 replica

  • Persistent storage (20Gi) for compaction workspace

  • Downsamples and compacts S3 blocks

  • Enforces retention policies
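
Retention in the Thanos compactor is set per resolution through its --retention.resolution-raw, --retention.resolution-5m, and --retention.resolution-1h flags. The sketch below builds those arguments from the retention periods listed for the Thanos bucket earlier in this document. That the compactor uses exactly these values is an assumption, and the helper and interface names are hypothetical.

```typescript
// Retention per resolution, in days.
interface RetentionDays {
  raw: number;
  fiveMinutes: number;
  oneHour: number;
}

// Hypothetical helper: turns retention settings into Thanos compactor flags.
function compactorRetentionArgs(retention: RetentionDays): string[] {
  for (const [name, days] of Object.entries(retention)) {
    if (!Number.isInteger(days) || days < 0) {
      throw new RangeError(`retention.${name} must be a non-negative integer`);
    }
  }
  return [
    `--retention.resolution-raw=${retention.raw}d`,
    `--retention.resolution-5m=${retention.fiveMinutes}d`,
    `--retention.resolution-1h=${retention.oneHour}d`,
  ];
}

// Values taken from the lifecycle list in the Storage and S3 section.
const retentionArgs = compactorRetentionArgs({ raw: 30, fiveMinutes: 180, oneHour: 730 });
```

The resulting array could be appended to the container args of the compactor StatefulSet, next to the compact subcommand.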

Construct Hierarchy

MonitoringChart
├── NamespaceConstruct          (Namespace)
├── PriorityClassConstruct      (PriorityClass)
├── CrossplaneProviderConstruct (ProviderConfig)
├── ThanosS3BucketConstruct     (Bucket, BucketVersioning, BucketLifecycle)
├── LokiS3BucketConstruct       (Bucket, BucketVersioning, BucketLifecycle)
├── ThanosS3CredentialsConstruct (ExternalSecret)
├── PrometheusStackConstruct    (HelmChart)
├── LokiConstruct               (HelmChart)
├── AlloyConstruct              (HelmChart)
├── ThanosQueryConstruct        (Deployment, Service)
├── ThanosStoreConstruct        (StatefulSet, Service, PVC)
└── ThanosCompactorConstruct    (StatefulSet, Service, PVC)

Configuration Pattern

# config.yaml
namespace: monitoring
versions:
  prometheus: v2.55.1
  grafana: 11.4.0
  loki: 3.3.2
  thanos: v0.37.2
replicas:
  prometheus: 2
  alertmanager: 2
storage:
  prometheus: 3Gi
  thanosStore: 10Gi
  thanosCompactor: 20Gi
resources:
  prometheus:
    requests:
      cpu: "100m"
      memory: "1500Mi"
  thanosQuery:
    requests:
      cpu: "25m"
      memory: "128Mi"

Benefits:

  • All configuration in one file

  • Type-safe via MonitoringConfig interface

  • Environment-specific configs possible

  • IDE autocomplete
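
The MonitoringConfig interface itself is not reproduced in this document. As a rough sketch, an interface matching the config.yaml above could look like the following, together with a small runtime check, since a parsed YAML file is untyped until it has been validated. The field list mirrors the YAML shown here; the validateConfig function and its rules are assumptions.

```typescript
interface ResourceRequests {
  cpu: string;
  memory: string;
}

// Sketch of a config type mirroring the config.yaml shown above.
interface MonitoringConfig {
  namespace: string;
  versions: { prometheus: string; grafana: string; loki: string; thanos: string };
  replicas: { prometheus: number; alertmanager: number };
  storage: { prometheus: string; thanosStore: string; thanosCompactor: string };
  resources: Record<string, { requests: ResourceRequests }>;
}

// Hypothetical validator: checks a parsed YAML document before it is
// handed to the constructs, and collects all problems into one error.
function validateConfig(raw: unknown): MonitoringConfig {
  const cfg = (raw ?? {}) as Partial<MonitoringConfig>;
  const errors: string[] = [];

  if (typeof cfg.namespace !== 'string' || cfg.namespace === '') {
    errors.push('namespace must be a non-empty string');
  }
  for (const key of ['prometheus', 'alertmanager'] as const) {
    const count = cfg.replicas?.[key];
    if (typeof count !== 'number' || !Number.isInteger(count) || count < 1) {
      errors.push(`replicas.${key} must be a positive integer`);
    }
  }
  for (const [name, res] of Object.entries(cfg.resources ?? {})) {
    if (typeof res?.requests?.cpu !== 'string' || typeof res?.requests?.memory !== 'string') {
      errors.push(`resources.${name}.requests needs cpu and memory strings`);
    }
  }
  if (errors.length > 0) {
    throw new Error(`Invalid monitoring config: ${errors.join('; ')}`);
  }
  return cfg as MonitoringConfig;
}

const validated = validateConfig({
  namespace: 'monitoring',
  versions: { prometheus: 'v2.55.1', grafana: '11.4.0', loki: '3.3.2', thanos: 'v0.37.2' },
  replicas: { prometheus: 2, alertmanager: 2 },
  storage: { prometheus: '3Gi', thanosStore: '10Gi', thanosCompactor: '20Gi' },
  resources: { thanosQuery: { requests: { cpu: '25m', memory: '128Mi' } } },
});
```

Failing at load time with a list of problems is usually easier to debug than a synth that succeeds and produces a manifest with undefined values.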

Unique Monitoring Patterns

1. Mixed Helm + Pure CDK8S

Helm for complex charts:

  • kube-prometheus-stack (100+ resources)

  • Loki (complex distributed setup)

  • Alloy (DaemonSet with many features)

Pure CDK8S for simple components:

  • Thanos Query (just Deployment + Service)

  • Thanos Store (StatefulSet + Service)

  • Thanos Compactor (StatefulSet)

Benefit: Use Helm where it helps, CDK8S where we need control.

2. Secret Replication Pattern

Challenge: Thanos needs S3 credentials, but they live in the crossplane-system namespace.

Solution: ExternalSecret with ClusterSecretStore

new ExternalSecret(this, 'thanos-s3-credentials', {
  metadata: {
    namespace: 'monitoring',  // The synced Secret is created in the ExternalSecret's namespace
  },
  spec: {
    refreshInterval: '1h',
    secretStoreRef: {
      name: 'crossplane-secret-store',  // ClusterSecretStore
      kind: 'ClusterSecretStore',
    },
    target: {
      name: 'thanos-s3-credentials',  // Name of the Secret created in monitoring
    },
    dataFrom: [{
      find: {
        name: {
          regexp: '^hetzner-s3-credentials$',  // Source secret in crossplane-system
        },
      },
    }],
  },
});

Benefits:

  • No manually maintained copies of the secret

  • Single source of truth (crossplane-system)

  • Automatic updates (ESO syncs every 1h)

3. Anti-Affinity for High Availability

All critical components use anti-affinity:

affinity: {
  podAntiAffinity: {
    preferredDuringSchedulingIgnoredDuringExecution: [{
      weight: 100,
      podAffinityTerm: {
        labelSelector: {
          matchLabels: { 'app.kubernetes.io/name': 'thanos-query' },
        },
        topologyKey: 'kubernetes.io/hostname',  // Spread across nodes
      },
    }],
  },
}

Applied to:

  • Thanos Query (2 replicas)

  • Thanos Store (2 replicas)

  • Prometheus (2 replicas via Helm values)

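Benefit: Losing a single node doesn’t take down the entire monitoring stack. Because the rule is preferred rather than required, the scheduler may still co-locate replicas when no other node has capacity.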
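
Since the same anti-affinity block is repeated for several components, it can be produced by a small helper. The following is a minimal sketch that returns a plain object in the shape of the Kubernetes pod affinity spec; the function name spreadAcrossNodes and the interface name are hypothetical.

```typescript
// Shape of a soft (preferred) pod anti-affinity rule in the Kubernetes pod spec.
interface PreferredAntiAffinity {
  podAntiAffinity: {
    preferredDuringSchedulingIgnoredDuringExecution: Array<{
      weight: number;
      podAffinityTerm: {
        labelSelector: { matchLabels: Record<string, string> };
        topologyKey: string;
      };
    }>;
  };
}

// Hypothetical helper: asks the scheduler to avoid placing pods that carry
// the given labels on the same node. Kubernetes accepts weights from 1 to 100.
function spreadAcrossNodes(
  matchLabels: Record<string, string>,
  weight = 100,
): PreferredAntiAffinity {
  if (!Number.isInteger(weight) || weight < 1 || weight > 100) {
    throw new RangeError('weight must be an integer between 1 and 100');
  }
  return {
    podAntiAffinity: {
      preferredDuringSchedulingIgnoredDuringExecution: [{
        weight,
        podAffinityTerm: {
          labelSelector: { matchLabels: { ...matchLabels } },
          topologyKey: 'kubernetes.io/hostname',
        },
      }],
    },
  };
}

const queryAffinity = spreadAcrossNodes({ 'app.kubernetes.io/name': 'thanos-query' });
```

The labels passed in must match the labels on the pod template, otherwise the rule selects nothing and has no effect.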

4. Resource Guarantees from Config

All components take their resource requests and limits from config:

resources: {
  requests: {
    cpu: kplus.Cpu.millis(parseInt(config.resources.component.requests.cpu)),
    memory: kplus.Size.mebibytes(parseInt(config.resources.component.requests.memory)),
  },
  limits: {
    cpu: kplus.Cpu.millis(parseInt(config.resources.component.limits.cpu)),
    memory: kplus.Size.mebibytes(parseInt(config.resources.component.limits.memory)),
  },
}

Benefits:

  • QoS guarantees (Burstable class)

  • Predictable scheduling

  • Prevents resource starvation

  • All values centralized in config.yaml
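
One caveat with the parseInt calls above: parseInt reads only the leading digits and ignores the unit suffix, so "128Mi" yields 128 as intended, but "1Gi" would yield 1 and be treated as one mebibyte. A unit-aware parser avoids this. The sketch below is a minimal version that handles only the suffixes that appear in this document plus a few close relatives; the function names are hypothetical.

```typescript
// Parses a Kubernetes CPU quantity such as "25m", "2" or "0.5" into millicores.
function parseCpuMillis(quantity: string): number {
  const match = /^(\d+(?:\.\d+)?)(m?)$/.exec(quantity.trim());
  if (!match) {
    throw new Error(`Unsupported CPU quantity: ${quantity}`);
  }
  const value = Number(match[1]);
  // A trailing "m" means the value is already in millicores.
  return match[2] === 'm' ? Math.round(value) : Math.round(value * 1000);
}

// Binary suffixes expressed as a multiple of one mebibyte.
const MEBIBYTES_PER_UNIT: Record<string, number> = {
  Mi: 1,
  Gi: 1024,
  Ti: 1024 * 1024,
};

// Parses a memory quantity such as "1500Mi" or "1Gi" into mebibytes.
function parseMemoryMebibytes(quantity: string): number {
  const match = /^(\d+(?:\.\d+)?)(Mi|Gi|Ti)$/.exec(quantity.trim());
  if (!match) {
    throw new Error(`Unsupported memory quantity: ${quantity}`);
  }
  return Number(match[1]) * MEBIBYTES_PER_UNIT[match[2]];
}

const cpuMillis = parseCpuMillis('25m');
const memoryMebibytes = parseMemoryMebibytes('1Gi');
```

The results can be passed to kplus.Cpu.millis and kplus.Size.mebibytes in place of the parseInt expressions. Rejecting unknown suffixes keeps a typo in config.yaml from turning into a silently wrong request.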

Build Workflow

# 1. Edit configuration
vim config.yaml

# 2. Build manifests
npm run build
# Runs: compile → test → synth

# 3. Review changes
git diff manifests/monitoring.k8s.yaml

# 4. Commit and push
git add manifests/ charts/ config.yaml
git commit -m "Scale Prometheus to 3 replicas"
git push

# 5. ArgoCD automatically deploys

Testing Strategy

The monitoring stack is covered by Jest unit tests that synthesize each construct and assert on the generated manifests:

describe('ThanosQueryConstruct', () => {
  it('should create 2 replicas by default', () => {
    const chart = Testing.chart();
    const config = createTestConfig();

    new ThanosQueryConstruct(chart, 'test', { config });

    const manifests = synthesizeChart(chart);
    const deployment = findResource(manifests, 'Deployment');

    expect(deployment.spec.replicas).toBe(2);
  });

  it('should configure anti-affinity', () => {
    const chart = Testing.chart();
    const config = createTestConfig();

    new ThanosQueryConstruct(chart, 'test', { config });

    const manifests = synthesizeChart(chart);
    const deployment = findResource(manifests, 'Deployment');

    expect(deployment.spec.template.spec.affinity).toBeDefined();
    expect(deployment.spec.template.spec.affinity.podAntiAffinity).toBeDefined();
  });

  it('should set resource requests from config', () => {
    const chart = Testing.chart();
    const config = createTestConfig({
      resources: {
        thanosQuery: { requests: { cpu: '50m', memory: '256Mi' } }
      }
    });

    new ThanosQueryConstruct(chart, 'test', { config });

    const manifests = synthesizeChart(chart);
    const deployment = findResource(manifests, 'Deployment');

    expect(deployment.spec.template.spec.containers[0].resources.requests.cpu).toBe('50m');
    expect(deployment.spec.template.spec.containers[0].resources.requests.memory).toBe('256Mi');
  });
});
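
The tests above rely on the helpers createTestConfig and findResource, which are not shown in this document. A minimal sketch of how they might be written follows. The default values are copied from config.yaml, and the deep-merge behaviour is an assumption. synthesizeChart is left out because it presumably wraps the cdk8s Testing.synth call and needs the cdk8s package.

```typescript
type Manifest = Record<string, any>;

// Returns the first synthesized resource of the given kind, or fails loudly.
function findResource(manifests: Manifest[], kind: string): Manifest {
  const found = manifests.find((manifest) => manifest.kind === kind);
  if (!found) {
    throw new Error(`No ${kind} found in synthesized manifests`);
  }
  return found;
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Recursively merges overrides into base without mutating either argument.
function deepMerge(
  base: Record<string, unknown>,
  overrides: Record<string, unknown>,
): Record<string, unknown> {
  const result: Record<string, unknown> = { ...base };
  for (const [key, value] of Object.entries(overrides)) {
    const current = result[key];
    result[key] = isPlainObject(current) && isPlainObject(value)
      ? deepMerge(current, value)
      : value;
  }
  return result;
}

// Hypothetical test fixture: defaults mirror config.yaml, and tests
// override only the fields they care about.
function createTestConfig(overrides: Record<string, unknown> = {}): Manifest {
  const defaults = {
    namespace: 'monitoring',
    versions: { thanos: 'v0.37.2' },
    resources: {
      prometheus: { requests: { cpu: '100m', memory: '1500Mi' } },
      thanosQuery: { requests: { cpu: '25m', memory: '128Mi' } },
    },
  };
  return deepMerge(defaults, overrides);
}

const testConfig = createTestConfig({
  resources: { thanosQuery: { requests: { cpu: '50m', memory: '256Mi' } } },
});
```

A deep merge matters here: with a shallow spread, overriding resources.thanosQuery would drop resources.prometheus from the fixture.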

Troubleshooting

Thanos Query Can’t Reach Prometheus

Symptoms: Thanos Query shows “no stores available”

Solution: Check DNS resolution for Prometheus sidecar

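# Thanos Query runs as a Deployment, so its pods have no -0 ordinal.
# The Deployment name below is assumed; substitute the name shown by "kubectl get deploy -n monitoring".
kubectl exec -n monitoring deploy/thanos-query -- nslookup prometheus-operated.monitoring.svc.cluster.local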

Loki Pods Pending (Memory)

Symptoms: Loki pods stuck in Pending with “Insufficient memory”

Solution: Reduce memory requests in config.yaml

resources:
  loki:
    backend:
      requests:
        memory: "256Mi"  # Reduced from 1Gi

S3 Credentials Not Found

Symptoms: Thanos sidecar errors with “s3: access denied”

Solution: Check ExternalSecret synced credentials

kubectl get externalsecret -n monitoring thanos-s3-credentials
kubectl get secret -n monitoring thanos-s3-credentials