Explanation

Prometheus and Thanos Integration

Why Thanos?

The Prometheus Storage Problem

Prometheus excels at real-time metrics collection but has limitations for long-term storage:

Problems with Prometheus alone:

  • Limited retention: Local TSDB constrained by disk space

  • No global view: Multiple Prometheus instances can’t query each other

  • No backup: Losing a Prometheus pod means losing metrics

  • Expensive storage: Keeping full-resolution metrics for years is costly

Why not just increase Prometheus retention?

  • Storage costs scale linearly with retention

  • Query performance degrades with large local TSDB

  • Still no redundancy or cross-instance queries

  • No automatic downsampling for old data

Thanos Solution

Thanos extends Prometheus with:

  • Unlimited retention: S3 storage is cheap (~$0.023/GB/month)

  • Global queries: Federation across multiple Prometheus instances

  • Automatic downsampling: 5m and 1h resolutions for historical data

  • Backup & recovery: S3 provides durability (99.999999999%)

  • Cost optimization: Compress and downsample old metrics

Architecture Decision: Sidecar vs Remote-Write

Approach Comparison

Aspect              | Sidecar (Our Choice)            | Remote-Write
--------------------|---------------------------------|--------------------------------
Latency             | No query latency (gRPC)         | Query latency to Prometheus
Complexity          | Lower (runs alongside)          | Higher (separate receiver)
Query Performance   | Fast (direct TSDB access)       | Slower (must query via API)
Prometheus Impact   | Minimal (read-only)             | Higher (network I/O for writes)
Failure Mode        | Degrades gracefully             | Backpressure on Prometheus
Deduplication       | Built-in (replica labels)       | Requires careful configuration

Why We Chose Sidecar

Primary Reasons:

  1. Simplicity: Sidecar runs in same pod as Prometheus (no separate deployment)

  2. Performance: Direct TSDB access via gRPC (no API overhead)

  3. Graceful Degradation: Sidecar failure doesn’t affect Prometheus scraping

  4. Query Speed: Thanos Query can read directly from Prometheus TSDB

Trade-offs Accepted:

  • Slightly higher pod resource usage (sidecar container overhead)

  • Must restart Prometheus pod to update sidecar configuration

  • Sidecar coupled to Prometheus lifecycle

Component Roles

Thanos Sidecar

Primary Functions:

  1. Block Upload: Uploads 2-hour Prometheus blocks to S3

  2. gRPC StoreAPI: Serves real-time queries from Prometheus TSDB

  3. Metadata Management: Maintains block metadata in S3

Configuration:

thanos:
  objectStorageConfig:
    key: objstore.yml              # S3 configuration
    name: thanos-objstore-config   # Secret with S3 credentials
  version: 0.37.2                  # Thanos version
  resources:
    requests:
      cpu: 25m                     # Minimal CPU (background uploads)
      memory: 128Mi                # Buffer for block uploads

Upload Behavior:

  • Uploads every 2 hours (aligned with Prometheus block boundaries)

  • Only uploads immutable blocks (not the head block)

  • Adds external labels: prometheus_replica (for deduplication)

  • Retries failed uploads with exponential backoff
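
To confirm uploads are keeping pace, check the sidecar's own metrics; a minimal sketch, assuming the default sidecar HTTP port (10902) and the shipper metric names exposed by recent Thanos releases:

# Port-forward the sidecar's HTTP port (10902 by default)
kubectl port-forward -n monitoring pod/prometheus-kube-prometheus-stack-prometheus-0 10902:10902 &

# Upload counters from the shipper component; failures should stay at 0
curl -s http://localhost:10902/metrics | grep -E 'thanos_shipper_(uploads|upload_failures)_total'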

gRPC StoreAPI:

  • Listens on port 10901

  • Serves queries for time ranges in local TSDB

  • Supports streaming responses (large queries)

  • Provides min/max time metadata to Query

Thanos Query

Primary Functions:

  1. Query Federation: Unified query interface across multiple stores

  2. Deduplication: Removes duplicate samples from HA replicas

  3. Query Routing: Routes queries to appropriate stores (sidecar vs S3)

Configuration:

replicas: 2                     # HA for query interface
stores:
  - dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local  # Sidecars
  - dnssrv+_grpc._tcp.thanos-store-grpc.monitoring.svc.cluster.local    # Store Gateways
deduplication: true             # Enable replica deduplication
queryFrontend: false            # Not using query frontend (overhead not needed)

Query Flow:

  1. Client queries /api/v1/query?query=up on Thanos Query

  2. Query identifies time range and required stores

  3. Parallel gRPC fan-out to sidecars and store gateways

  4. Merge and deduplicate responses

  5. Return unified result to client
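
This flow can be exercised by hand; a sketch, assuming Thanos Query is reachable through the thanos-query Service on port 9090 and that this version exposes the /api/v1/stores endpoint:

# Forward the Thanos Query HTTP port locally
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &

# Prometheus-compatible query, answered via fan-out + merge across stores
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up' | jq '.data.result | length'

# Stores Query currently knows about (sidecars and store gateways),
# including the min/max time range each one advertises
curl -s 'http://localhost:9090/api/v1/stores' | jq .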

Deduplication Logic:

# Prometheus replica-0 sample: up{job="api", prometheus_replica="0"} = 1 @ 1699000000
# Prometheus replica-1 sample: up{job="api", prometheus_replica="1"} = 1 @ 1699000000
# Thanos Query result:        up{job="api"} = 1 @ 1699000000  (replica label removed)
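
The effect can be observed by toggling deduplication per request; a sketch, assuming the port-forward above and that this Thanos version accepts the dedup URL parameter:

# dedup=false: one series per replica, prometheus_replica label visible
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="api"}' --data-urlencode 'dedup=false' | jq '.data.result[].metric'

# dedup=true (default): replica label stripped, a single merged series
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="api"}' --data-urlencode 'dedup=true' | jq '.data.result[].metric'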

Thanos Store

Primary Functions:

  1. S3 Gateway: Provides gRPC StoreAPI for S3 block data

  2. Index Caching: Caches block indexes for faster queries

  3. Block Discovery: Syncs block metadata from S3 every 3 minutes

Configuration:

replicas: 2                     # HA for historical queries
persistentVolumeClaim:
  size: 10Gi                    # Index cache per replica
indexCache:
  type: in-memory
  config:
    max_size: 500MB             # Memory cache for block indexes
chunkCache:
  type: in-memory
  config:
    max_size: 500MB             # Memory cache for chunks

Storage Layout:

/var/thanos/store/
├── meta-syncer/           # Synced block metadata
│   ├── 01K8WC41VMNQF74ZC000CC72NY/
│   │   └── meta.json      # Block metadata (time range, labels)
│   └── 01K8WC7A1S8XE9AWK8SKD381H1/
│       └── meta.json
└── cache/                 # Disk cache for index files

Query Performance:

  • First query: 1-2s (fetch from S3, cache index)

  • Cached query: 100-500ms (index in memory/disk)

  • Chunk read: 200-800ms (depends on S3 latency)
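
How often queries are served from cache versus S3 can be estimated from the store gateway's cache counters; a sketch, assuming the in-memory index-cache metric names used by recent Thanos versions (check /metrics on the store pods for your build) and the thanos-query port-forward from earlier:

# Approximate index-cache hit ratio over the last 15 minutes, via Thanos Query
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(thanos_store_index_cache_hits_total[15m])) / sum(rate(thanos_store_index_cache_requests_total[15m]))' \
  | jq '.data.result[0].value[1]'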

Thanos Compactor

Primary Functions:

  1. Compaction: Merges small blocks into larger blocks

  2. Downsampling: Creates 5m and 1h resolution blocks

  3. Retention: Applies retention policies (deletes old blocks)

Configuration:

replicas: 1                     # Only one compactor should run
persistentVolumeClaim:
  size: 20Gi                    # Workspace for compaction
retentionRaw: 30d               # Keep raw data for 30 days
retention5m: 180d               # Keep 5-minute data for 180 days (6 months)
retention1h: 730d               # Keep 1-hour data for 730 days (2 years)
compactionConcurrency: 1        # Sequential compaction (safety)

Compaction Process:

Input:  [2h block] [2h block] [2h block] [2h block]  (8h total)
        └─────────────────────────┬──────────────────┘
Compact:                     [8h block]               (merged + deduplicated)
Downsample:               [8h @ 5m resolution]        (1/10 samples)
Downsample:               [8h @ 1h resolution]        (1/120 samples)

Block Lifecycle:

  1. Day 1-30: Raw 2h blocks (full resolution, 30s scrape interval)

  2. Day 31-180: 5m resolution blocks (downsampled, 1/10 samples)

  3. Day 181-730: 1h resolution blocks (downsampled, 1/120 samples)

  4. Day 731+: Deleted (expired by the Compactor's retention policy)
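
The lifecycle is visible directly in the bucket via the Thanos CLI; a minimal sketch, assuming a local thanos binary and the same objstore.yml the sidecar uses:

# One row per block: time range, resolution (raw/5m/1h), compaction level, external labels
thanos tools bucket inspect --objstore.config-file=objstore.yml

# Plain listing of block ULIDs currently in the bucket
thanos tools bucket ls --objstore.config-file=objstore.yml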

Data Flow

Upload Path (Prometheus → S3)

Prometheus TSDB
    │ Every 2 hours
Thanos Sidecar (Block Reader)
    │ gzip compress (~60% reduction)
S3 Bucket (metrics-thanos-kup6s)
    └── 01K8WC41VMNQF74ZC000CC72NY/    # Block ULID
        ├── meta.json                   # Metadata (120KB)
        ├── index                       # Series index (2MB compressed)
        └── chunks/                     # Sample data
            ├── 000001 (10MB)
            ├── 000002 (10MB)
            └── ...

Block Structure:

  • ULID: Universally Unique Lexicographically Sortable Identifier

  • meta.json: Time range, external labels, compaction level

  • index: Series labels → chunk references

  • chunks/: Actual sample data (compressed XOR delta encoding)
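
A block's metadata can be pulled straight from S3 for inspection; a sketch using the AWS CLI against the Hetzner endpoint from objstore.yml, with the bucket's credentials exported as AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY and the ULID below standing in for any real block:

# Print the key fields of one block's meta.json (time range, labels, compaction level)
aws s3 cp s3://metrics-thanos-kup6s/01K8WC41VMNQF74ZC000CC72NY/meta.json - \
  --endpoint-url https://fsn1.your-objectstorage.com \
  | jq '{ulid, minTime, maxTime, labels: .thanos.labels, level: .compaction.level}'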

Query Path (Grafana → Prometheus/S3)

Scenario 1: Recent metrics (< 3 days)

Grafana
    │ PromQL: rate(http_requests_total[5m])
Thanos Query (choose fastest store)
    │ gRPC: Query([now-3d, now])
Thanos Sidecar (reads Prometheus TSDB)
    │ Fast: 50-100ms
Prometheus TSDB (in-memory head + mmap'd blocks)

Scenario 2: Historical metrics (> 3 days)

Grafana
    │ PromQL: rate(http_requests_total[30d])
Thanos Query (fan-out to all stores)
    ├─────────────────┬─────────────────┐
    ▼                 ▼                 ▼
Sidecar (0-3d)   Store-0 (3d+)   Store-1 (3d+)
    │                 │                 │
    │                 │ Fetch from S3   │
    │                 ▼                 ▼
    └─────────────────┴─────────────────┘
                      │ Merge + Deduplicate
                  Thanos Query
                      │ 500ms-2s
                   Grafana

Query Optimization:

  • Time range splitting: Query routes to appropriate stores

  • Parallel fetch: Multiple stores queried simultaneously

  • Downsampling: Old queries use 5m/1h resolution (faster)

  • Caching: Store gateways cache frequently accessed blocks
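
Downsampled data can also be requested explicitly; a sketch, assuming this Thanos version supports the max_source_resolution parameter on range queries and the thanos-query port-forward from earlier (GNU date):

# 30-day range query answered from 1h-resolution blocks where they exist
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(http_requests_total[1h])' \
  --data-urlencode "start=$(date -d '-30 days' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=3600' \
  --data-urlencode 'max_source_resolution=1h' | jq '.status'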

S3 Storage Format

Bucket Structure

s3://metrics-thanos-kup6s/
├── 01K8WC41VMNQF74ZC000CC72NY/         # Raw 2h block (1d old)
│   ├── meta.json                       # resolution=0 (raw)
│   ├── index (2.1 MB)
│   └── chunks/ (25 MB total)
├── 01K8YT3G2HPQN8B9VWX5KZMR1F/         # Compacted 12h block (15d old)
│   ├── meta.json                       # resolution=0, compaction.level=2
│   ├── index (6.8 MB)
│   └── chunks/ (68 MB total)
├── 01K9A7M3PXCVBN8G4WQ2FRTY6J/         # 5m downsampled block (45d old)
│   ├── meta.json                       # resolution=300000 (5m)
│   ├── index (1.2 MB)
│   └── chunks/ (12 MB total)
└── 01K9S8R4QYDWCM9H5XT3GSUZ7K/         # 1h downsampled block (400d old)
    ├── meta.json                       # resolution=3600000 (1h)
    ├── index (400 KB)
    └── chunks/ (3 MB total)

Storage Efficiency

Compression Results (typical 2-hour block):

  • Uncompressed samples: ~50 MB

  • Prometheus TSDB (XOR encoding): ~10 MB (80% reduction)

  • S3 gzip compression: ~6 MB (additional 40% reduction)

  • 5m downsampling: ~1.2 MB (80% reduction from 6MB)

  • 1h downsampling: ~300 KB (75% reduction from 1.2MB)

Storage Cost Calculation (2-year retention):

Daily ingestion:  500 MB/day (uncompressed)
After compression: 100 MB/day (S3)
30 days raw:      30 × 100 MB = 3 GB
150 days @ 5m:    150 × 20 MB = 3 GB  (5x smaller than raw)
550 days @ 1h:    550 × 5 MB = 2.75 GB (20x smaller than raw)
Total:            ~9 GB for 2 years of metrics
Cost:             ~$0.21/month at $0.023/GB/month
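
Actual usage can be checked against this estimate with a recursive listing; a sketch using the AWS CLI with the same endpoint and credentials as objstore.yml:

# Total object count and size currently stored in the Thanos bucket
aws s3 ls s3://metrics-thanos-kup6s --recursive --summarize \
  --endpoint-url https://fsn1.your-objectstorage.com | tail -n 2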

Configuration Deep Dive

Thanos Sidecar Configuration

# In Prometheus spec
thanos:
  image: quay.io/thanos/thanos:v0.37.2
  version: 0.37.2
  objectStorageConfig:
    key: objstore.yml
    name: thanos-objstore-config
  resources:
    requests:
      cpu: 25m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 256Mi

objstore.yml (S3 configuration):

type: S3
config:
  bucket: "metrics-thanos-kup6s"
  endpoint: "fsn1.your-objectstorage.com"
  region: fsn1
  access_key: "${S3_ACCESS_KEY}"
  secret_key: "${S3_SECRET_KEY}"
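
The sidecar reads this file from the Secret referenced in objectStorageConfig; a minimal sketch of creating it, assuming objstore.yml sits in the current directory with real credentials substituted for the placeholders:

# Name and key must match objectStorageConfig (name: thanos-objstore-config, key: objstore.yml)
kubectl create secret generic thanos-objstore-config \
  --from-file=objstore.yml=objstore.yml \
  -n monitoring --dry-run=client -o yaml | kubectl apply -f -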

Thanos Query Configuration

stores:
  # Auto-discover Prometheus sidecars
  - dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
  # Connect to Thanos Store gateways
  - dnssrv+_grpc._tcp.thanos-store-grpc.monitoring.svc.cluster.local

replicaLabel:
  - prometheus_replica  # Label to deduplicate on

queryTimeout: 5m        # Max query duration
maxConcurrent: 20       # Max parallel queries

DNS SRV Discovery:

$ dig +short SRV _grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
0 50 10901 prometheus-kube-prometheus-stack-prometheus-0.prometheus-operated...
0 50 10901 prometheus-kube-prometheus-stack-prometheus-1.prometheus-operated...

Thanos Store Configuration

indexCacheSize: 500MB
chunkCacheSize: 500MB
syncInterval: 3m        # How often to check S3 for new blocks
bucketCacheSize: 1GB    # Metadata cache

Thanos Compactor Configuration

retentionResolutionRaw: 30d   # Delete raw blocks after 30 days
retentionResolution5m: 180d   # Delete 5m blocks after 180 days
retentionResolution1h: 730d   # Delete 1h blocks after 730 days

compactionConcurrency: 1      # Only compact one block group at a time
downsampleConcurrency: 1      # Only downsample one block at a time

consistencyDelay: 30m         # Wait 30m before compacting (safety)
compactionInterval: 5m        # Check for compaction work every 5m

High Availability Design

Prometheus HA (2 Replicas)

Configuration:

replicas: 2
externalLabels:
  prometheus_replica: $(POD_NAME)  # Unique per replica

Behavior:

  • Both replicas scrape identical targets

  • Each adds prometheus_replica label (replica-0, replica-1)

  • Thanos Query deduplicates based on this label

Failure Scenario:

  1. Replica-0 fails at 12:00:00

  2. Replica-1 continues scraping (no metrics gap)

  3. Thanos Query deduplicates: uses replica-1 data for 12:00:00+

  4. Replica-0 restarts at 12:02:30, resumes scraping

  5. For 12:02:30+, Query has both replicas (deduplicates)

Thanos Query HA (2 Replicas)

Why HA Query?

  • Query is stateless (no persistent state)

  • Load balancing via Kubernetes Service

  • Failover: If one replica fails, other serves requests

Load Distribution:

Grafana → Service (thanos-query:9090)
              ├─ query-pod-0  (50% traffic)
              └─ query-pod-1  (50% traffic)

Thanos Store HA (2 Replicas)

Why HA Store?

  • S3 reads can be slow (~500ms)

  • 2 replicas = 2x read throughput

  • Cache distribution: Different replicas cache different blocks

Failure Scenario:

  • Store-0 has cached blocks A, B, C

  • Store-1 has cached blocks D, E, F

  • If Store-0 fails, queries hit Store-1 (slower until cache warms)

Single Point of Failure: Compactor

Why only 1 replica?

  • Compaction must be serialized (avoid conflicts)

  • Compactor can be restarted without data loss

  • Only affects background operations (not queries)

Failure Impact:

  • Queries continue normally (Store serves existing blocks)

  • No new compaction/downsampling until restart

  • S3 gradually accumulates small blocks (eventually compacted)
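
Because the Compactor is a singleton, it is worth watching for it halting; a sketch, assuming the thanos_compact_halted and thanos_compact_iterations_total metrics exposed by the compactor, queried through Thanos Query (localhost:9090 via the earlier port-forward):

# 1 means the compactor hit an unrecoverable error (e.g. overlapping blocks) and stopped
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=max(thanos_compact_halted)' | jq '.data.result[0].value[1]'

# Recent compactor activity; should be non-zero over time
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(thanos_compact_iterations_total[1h])' | jq '.data.result'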

Performance Tuning

Query Optimization

Slow Query Symptoms:

  • Queries taking >5s

  • High memory usage in Query pods

  • S3 read errors (throttling)

Solutions:

  1. Increase cache size: More index/chunk cache = fewer S3 reads

  2. Add Store replicas: Distribute query load

  3. Use downsampled data: Query 1h data for month+ ranges

  4. Optimize PromQL: Use recording rules for expensive queries
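
For the recording-rule option, a rule can be added through the Prometheus Operator; a sketch, assuming kube-prometheus-stack's default rule discovery (the release label may need to match your Helm release name):

kubectl apply -n monitoring -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thanos-example-recording-rules
  labels:
    release: kube-prometheus-stack     # must match the operator's ruleSelector
spec:
  groups:
    - name: precomputed.rules
      interval: 1m
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
EOF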

Upload Optimization

Slow Upload Symptoms:

  • Sidecars behind on uploads (check the thanos_shipper_uploads_total counter)

  • High sidecar CPU/memory usage

  • S3 write errors

Solutions:

  1. Increase sidecar resources: More CPU for compression

  2. Check S3 bandwidth: Hetzner S3 has rate limits

  3. Reduce Prometheus retention: Smaller blocks = faster uploads

Compaction Optimization

Slow Compaction Symptoms:

  • Growing number of small blocks in S3

  • High compactor CPU/memory usage

  • Compactor errors/crashes

Solutions:

  1. Increase compactor PVC: More workspace for large compactions

  2. Reduce retention: Fewer blocks to manage

  3. Tune compaction concurrency: Balance speed vs resource usage

Troubleshooting

Common Issues

Issue: “Thanos Sidecar not uploading blocks”

Check logs:

kubectl logs -n monitoring prometheus-xxx-0 -c thanos-sidecar

Common causes:

  • S3 credentials expired/incorrect

  • Network connectivity to S3

  • Blocks not yet immutable (wait 2h from block creation)
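
A quick way to rule out credential and configuration problems is to check what the sidecar actually mounts and what the shipper logs; a sketch (the Secret name and key match objectStorageConfig above):

# Decode the objstore.yml the sidecar receives; verify bucket, endpoint, and keys
kubectl get secret -n monitoring thanos-objstore-config \
  -o jsonpath='{.data.objstore\.yml}' | base64 -d

# Recent shipper activity: upload attempts and failures
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar \
  --since=1h | grep -i upload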

Issue: “Queries returning incomplete data”

Check Store sync status:

kubectl logs -n monitoring thanos-store-0 | grep "meta-syncer"

Common causes:

  • Store not synced yet (wait 3m for sync interval)

  • S3 blocks corrupted (check meta.json)

  • Time range mismatch (check block min/max time)

Issue: “High query latency”

Check Thanos Query metrics:

rate(thanos_query_duration_seconds_sum[5m])

Common causes:

  • S3 read latency (check Hetzner S3 status)

  • Cache not warmed up (first query after restart)

  • Large time range without downsampling

Migration from Prometheus-Only

Before Migration

Prometheus (7d retention, 6Gi PVC)
    │ Query: /api/v1/query
Grafana → Prometheus:9090

After Migration

Prometheus (3d retention, 3Gi PVC)
    ├── Thanos Sidecar → S3 (unlimited retention)
Thanos Query (federates Sidecar + S3)
    │ Query: /api/v1/query (Prometheus-compatible)
Grafana → Thanos Query:9090

Benefits Realized:

  • 50% reduction in PVC usage (6Gi → 3Gi)

  • 2 years of metrics accessible (vs 7 days)

  • Historical queries working (>3 days old)

  • No disruption (Prometheus continued scraping)

References