Prometheus and Thanos Integration¶
Why Thanos?¶
The Prometheus Storage Problem¶
Prometheus excels at real-time metrics collection but has limitations for long-term storage:
Problems with Prometheus alone:
Limited retention: Local TSDB constrained by disk space
No global view: Multiple Prometheus instances can’t query each other
No backup: Losing a Prometheus pod means losing metrics
Expensive storage: Keeping full-resolution metrics for years is costly
Why not just increase Prometheus retention?
Storage costs scale linearly with retention
Query performance degrades with large local TSDB
Still no redundancy or cross-instance queries
No automatic downsampling for old data
Thanos Solution¶
Thanos extends Prometheus with:
✅ Unlimited retention: S3 storage is cheap (~$0.023/GB/month)
✅ Global queries: Federation across multiple Prometheus instances
✅ Automatic downsampling: 5m and 1h resolutions for historical data
✅ Backup & recovery: S3 provides durability (99.999999999%)
✅ Cost optimization: Compress and downsample old metrics
Architecture Decision: Sidecar vs Remote-Write¶
Approach Comparison¶
| Aspect | Sidecar (Our Choice) | Remote-Write |
|---|---|---|
| Latency | No query latency (gRPC) | Query latency to Prometheus |
| Complexity | Lower (runs alongside) | Higher (separate receiver) |
| Query Performance | Fast (direct TSDB access) | Slower (must query via API) |
| Prometheus Impact | Minimal (read-only) | Higher (network I/O for writes) |
| Failure Mode | Degrades gracefully | Backpressure on Prometheus |
| Deduplication | Built-in (replica labels) | Requires careful configuration |
Why We Chose Sidecar¶
Primary Reasons:
Simplicity: Sidecar runs in same pod as Prometheus (no separate deployment)
Performance: Direct TSDB access via gRPC (no API overhead)
Graceful Degradation: Sidecar failure doesn’t affect Prometheus scraping
Query Speed: Thanos Query can read directly from Prometheus TSDB
Trade-offs Accepted:
Slightly higher pod resource usage (sidecar container overhead)
Must restart Prometheus pod to update sidecar configuration
Sidecar coupled to Prometheus lifecycle
Component Roles¶
Thanos Sidecar¶
Primary Functions:
Block Upload: Uploads 2-hour Prometheus blocks to S3
gRPC StoreAPI: Serves real-time queries from Prometheus TSDB
Metadata Management: Maintains block metadata in S3
Configuration:
thanos:
  objectStorageConfig:
    key: objstore.yml               # S3 configuration
    name: thanos-objstore-config    # Secret with S3 credentials
  version: 0.37.2                   # Thanos version
  resources:
    requests:
      cpu: 25m                      # Minimal CPU (background uploads)
      memory: 128Mi                 # Buffer for block uploads
Upload Behavior:
Uploads every 2 hours (aligned with Prometheus block boundaries)
Only uploads immutable blocks (not the head block)
Adds external label prometheus_replica (for deduplication)
Retries failed uploads with exponential backoff
gRPC StoreAPI:
Listens on port 10901
Serves queries for time ranges in local TSDB
Supports streaming responses (large queries)
Provides min/max time metadata to Query
Thanos Query¶
Primary Functions:
Query Federation: Unified query interface across multiple stores
Deduplication: Removes duplicate samples from HA replicas
Query Routing: Routes queries to appropriate stores (sidecar vs S3)
Configuration:
replicas: 2               # HA for query interface
stores:
  - dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local    # Sidecars
  - dnssrv+_grpc._tcp.thanos-store-grpc.monitoring.svc.cluster.local      # Store Gateways
deduplication: true       # Enable replica deduplication
queryFrontend: false      # Not using query frontend (overhead not needed)
Query Flow:
Client queries /api/v1/query?query=up on Thanos Query
Query identifies the time range and the required stores
Parallel gRPC fan-out to sidecars and store gateways
Merge and deduplicate responses
Return unified result to client
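To make this flow concrete, here is a minimal sketch of exercising it from a workstation, assuming the thanos-query Service name used later on this page and a local port-forward; /api/v1/stores and the dedup parameter are Thanos extensions of the Prometheus HTTP API:

# Forward the Thanos Query HTTP port locally (Service name assumed: thanos-query)
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &

# List the stores (sidecars and store gateways) Query has discovered over gRPC
curl -s http://localhost:9090/api/v1/stores | jq .

# Run an instant query; dedup=true merges the per-replica series
curl -s 'http://localhost:9090/api/v1/query?query=up&dedup=true' | jq '.data.result | length'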
Deduplication Logic:
# Prometheus replica-0 sample: up{job="api", prometheus_replica="0"} = 1 @ 1699000000
# Prometheus replica-1 sample: up{job="api", prometheus_replica="1"} = 1 @ 1699000000
# Thanos Query result: up{job="api"} = 1 @ 1699000000 (replica label removed)
Thanos Store¶
Primary Functions:
S3 Gateway: Provides gRPC StoreAPI for S3 block data
Index Caching: Caches block indexes for faster queries
Block Discovery: Syncs block metadata from S3 every 3 minutes
Configuration:
replicas: 2               # HA for historical queries
persistentVolumeClaim:
  size: 10Gi              # Index cache per replica
indexCache:
  type: in-memory
  config:
    max_size: 500MB       # Memory cache for block indexes
chunkCache:
  type: in-memory
  config:
    max_size: 500MB       # Memory cache for chunks
Storage Layout:
/var/thanos/store/
├── meta-syncer/ # Synced block metadata
│ ├── 01K8WC41VMNQF74ZC000CC72NY/
│ │ └── meta.json # Block metadata (time range, labels)
│ └── 01K8WC7A1S8XE9AWK8SKD381H1/
│ └── meta.json
└── cache/ # Disk cache for index files
Query Performance:
First query: 1-2s (fetch from S3, cache index)
Cached query: 100-500ms (index in memory/disk)
Chunk read: 200-800ms (depends on S3 latency)
Thanos Compactor¶
Primary Functions:
Compaction: Merges small blocks into larger blocks
Downsampling: Creates 5m and 1h resolution blocks
Retention: Applies retention policies (deletes old blocks)
Configuration:
replicas: 1                   # Only one compactor should run
persistentVolumeClaim:
  size: 20Gi                  # Workspace for compaction
retentionRaw: 30d             # Keep raw data for 30 days
retention5m: 180d             # Keep 5-minute data for 180 days (6 months)
retention1h: 730d             # Keep 1-hour data for 730 days (2 years)
compactionConcurrency: 1      # Sequential compaction (safety)
Compaction Process:
Input:      [2h block] [2h block] [2h block] [2h block]   (8h total)
             └──────────────────────┬─────────────────┘
                                    │
Compact:               [8h block]   (merged + deduplicated)
                                    │
Downsample:            [8h @ 5m resolution]   (1/10 of raw samples)
                                    │
Downsample:            [8h @ 1h resolution]   (1/120 of raw samples)
Block Lifecycle:
Day 1-30: Raw 2h blocks (full resolution, 30s scrape interval)
Day 31-180: 5m resolution blocks (downsampled, 1/10 samples)
Day 181-730: 1h resolution blocks (downsampled, 1/120 samples)
Day 731+: Deleted (expired by S3 lifecycle policy)
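To steer a historical query onto a downsampled tier explicitly, Thanos Query accepts a max_source_resolution parameter on its Prometheus-compatible endpoints. A hedged example, reusing the assumed port-forward from above with placeholder timestamps:

# Serve this range from 1h-resolution blocks instead of raw data (timestamps are placeholders)
curl -s -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=up' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-03-01T00:00:00Z' \
  --data-urlencode 'step=1h' \
  --data-urlencode 'max_source_resolution=1h' | jq '.status'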
Data Flow¶
Upload Path (Prometheus → S3)¶
Prometheus TSDB
│
│ Every 2 hours
▼
Thanos Sidecar (Block Reader)
│
│ gzip compress (~60% reduction)
▼
S3 Bucket (metrics-thanos-kup6s)
│
└── 01K8WC41VMNQF74ZC000CC72NY/ # Block ULID
├── meta.json # Metadata (120KB)
├── index # Series index (2MB compressed)
└── chunks/ # Sample data
├── 000001 (10MB)
├── 000002 (10MB)
└── ...
Block Structure:
ULID: Universally Unique Lexicographically Sortable Identifier
meta.json: Time range, external labels, compaction level
index: Series labels → chunk references
chunks/: Actual sample data (compressed XOR delta encoding)
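To verify what actually landed in the bucket, the thanos binary ships bucket tooling; a sketch, assuming the objstore.yml from the configuration section further down is available locally:

# List every block ULID currently in the bucket
thanos tools bucket ls --objstore.config-file=objstore.yml

# Tabular overview: time range, labels, resolution and compaction level per block
thanos tools bucket inspect --objstore.config-file=objstore.yml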
Query Path (Grafana → Prometheus/S3)¶
Scenario 1: Recent metrics (< 3 days)
Grafana
│ PromQL: rate(http_requests_total[5m])
▼
Thanos Query (choose fastest store)
│
│ gRPC: Query([now-3d, now])
▼
Thanos Sidecar (reads Prometheus TSDB)
│
│ Fast: 50-100ms
▼
Prometheus TSDB (in-memory head + mmap'd blocks)
Scenario 2: Historical metrics (> 3 days)
Grafana
│ PromQL: rate(http_requests_total[30d])
▼
Thanos Query (fan-out to all stores)
│
├─────────────────┬─────────────────┐
▼ ▼ ▼
Sidecar (0-3d) Store-0 (3d+) Store-1 (3d+)
│ │ │
│ │ Fetch from S3 │
│ ▼ ▼
└─────────────────┴─────────────────┘
│
│ Merge + Deduplicate
▼
Thanos Query
│
│ 500ms-2s
▼
Grafana
Query Optimization:
Time range splitting: Query routes to appropriate stores
Parallel fetch: Multiple stores queried simultaneously
Downsampling: Old queries use 5m/1h resolution (faster)
Caching: Store gateways cache frequently accessed blocks
S3 Storage Format¶
Bucket Structure¶
s3://metrics-thanos-kup6s/
├── 01K8WC41VMNQF74ZC000CC72NY/     # Raw 2h block (1d old)
│   ├── meta.json                   # resolution=0 (raw)
│   ├── index (2.1 MB)
│   └── chunks/ (25 MB total)
├── 01K8YT3G2HPQN8B9VWX5KZMR1F/     # Compacted 8h block (15d old)
│   ├── meta.json                   # resolution=0, compaction.level=2
│   ├── index (6.8 MB)
│   └── chunks/ (68 MB total)
├── 01K9A7M3PXCVBN8G4WQ2FRTY6J/     # 5m downsampled block (45d old)
│   ├── meta.json                   # resolution=300000 (5m)
│   ├── index (1.2 MB)
│   └── chunks/ (12 MB total)
└── 01K9S8R4QYDWCM9H5XT3GSUZ7K/     # 1h downsampled block (400d old)
    ├── meta.json                   # resolution=3600000 (1h)
    ├── index (400 KB)
    └── chunks/ (3 MB total)
All blocks, including downsampled ones, live at the bucket root keyed by ULID; the resolution is recorded in each block's meta.json, not in the object path.
Storage Efficiency¶
Compression Results (typical 2-hour block):
Uncompressed samples: ~50 MB
Prometheus TSDB (XOR encoding): ~10 MB (80% reduction)
S3 gzip compression: ~6 MB (additional 40% reduction)
5m downsampling: ~1.2 MB (80% reduction from 6MB)
1h downsampling: ~300 KB (75% reduction from 1.2MB)
Storage Cost Calculation (2-year retention):
Daily ingestion: 500 MB/day (uncompressed)
After compression: 100 MB/day (S3)
30 days raw: 30 × 100 MB = 3 GB
150 days @ 5m: 150 × 20 MB = 3 GB (5x smaller than raw)
550 days @ 1h: 550 × 5 MB = 2.75 GB (20x smaller than raw)
Total: ~9 GB for 2 years of metrics
Cost: ~$0.21/month at $0.023/GB/month
Configuration Deep Dive¶
Thanos Sidecar Configuration¶
# In Prometheus spec
thanos:
  image: quay.io/thanos/thanos:v0.37.2
  version: 0.37.2
  objectStorageConfig:
    key: objstore.yml
    name: thanos-objstore-config
  resources:
    requests:
      cpu: 25m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 256Mi
objstore.yml (S3 configuration):
type: S3
config:
  bucket: "metrics-thanos-kup6s"
  endpoint: "fsn1.your-objectstorage.com"
  region: fsn1
  access_key: "${S3_ACCESS_KEY}"
  secret_key: "${S3_SECRET_KEY}"
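The sidecar reads this file from the Secret referenced in the Prometheus spec above. One way to create it (a sketch: the Secret name and key must match objectStorageConfig, and the local ./objstore.yml path is an assumption):

# Create the Secret that objectStorageConfig points at
kubectl create secret generic thanos-objstore-config \
  --namespace monitoring \
  --from-file=objstore.yml=./objstore.yml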
Thanos Query Configuration¶
stores:
  # Auto-discover Prometheus sidecars
  - dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
  # Connect to Thanos Store gateways
  - dnssrv+_grpc._tcp.thanos-store-grpc.monitoring.svc.cluster.local
replicaLabel:
  - prometheus_replica    # Label to deduplicate on
queryTimeout: 5m          # Max query duration
maxConcurrent: 20         # Max parallel queries
DNS SRV Discovery:
$ dig +short SRV _grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
0 50 10901 prometheus-kube-prometheus-stack-prometheus-0.prometheus-operated...
0 50 10901 prometheus-kube-prometheus-stack-prometheus-1.prometheus-operated...
Thanos Store Configuration¶
indexCacheSize: 500MB
chunkCacheSize: 500MB
syncInterval: 3m # How often to check S3 for new blocks
bucketCacheSize: 1GB # Metadata cache
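These values ultimately become flags on the store gateway binary; the invocation below is an approximation of how they are wired through, not a copy of the rendered manifest (paths are illustrative):

# Approximate store gateway invocation behind the settings above
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --index-cache-size=500MB \
  --chunk-pool-size=500MB \
  --sync-block-duration=3m \
  --grpc-address=0.0.0.0:10901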
Thanos Compactor Configuration¶
retentionResolutionRaw: 30d # Delete raw blocks after 30 days
retentionResolution5m: 180d # Delete 5m blocks after 180 days
retentionResolution1h: 730d # Delete 1h blocks after 730 days
compactionConcurrency: 1 # Only compact one block group at a time
downsampleConcurrency: 1 # Only downsample one block at a time
consistencyDelay: 30m # Wait 30m before compacting (safety)
compactionInterval: 5m # Check for compaction work every 5m
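Likewise for the compactor; an approximate sketch of the flags these settings map to (paths are illustrative):

# Approximate compactor invocation behind the settings above
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=730d \
  --compact.concurrency=1 \
  --downsample.concurrency=1 \
  --consistency-delay=30m \
  --wait --wait-interval=5m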
High Availability Design¶
Prometheus HA (2 Replicas)¶
Configuration:
replicas: 2
externalLabels:
  prometheus_replica: $(POD_NAME)   # Unique per replica
Behavior:
Both replicas scrape identical targets
Each adds a prometheus_replica label (replica-0, replica-1)
Thanos Query deduplicates based on this label
Failure Scenario:
Replica-0 fails at 12:00:00
Replica-1 continues scraping (no metrics gap)
Thanos Query deduplicates: uses replica-1 data for 12:00:00+
Replica-0 restarts at 12:02:30, resumes scraping
For 12:02:30+, Query has both replicas (deduplicates)
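The deduplication is easy to observe directly: with dedup=false Thanos Query returns one series per replica, with dedup=true a single merged series. Same assumed port-forward as earlier; up{job="api"} is the example series from the deduplication section:

# One series per prometheus_replica (raw view of both scrapers)
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="api"}' \
  --data-urlencode 'dedup=false' | jq '.data.result | length'

# Replicas merged into a single deduplicated series
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="api"}' \
  --data-urlencode 'dedup=true' | jq '.data.result | length'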
Thanos Query HA (2 Replicas)¶
Why HA Query?
Query is stateless (no persistent state)
Load balancing via Kubernetes Service
Failover: If one replica fails, other serves requests
Load Distribution:
Grafana → Service (thanos-query:9090)
│
├─ query-pod-0 (50% traffic)
└─ query-pod-1 (50% traffic)
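A quick check that both replicas are actually serving behind the Service (the Service name thanos-query is taken from the diagram above):

# Both query pods should show up as ready endpoints behind the Service
kubectl get endpoints thanos-query -n monitoring -o wide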
Thanos Store HA (2 Replicas)¶
Why HA Store?
S3 reads can be slow (~500ms)
2 replicas = 2x read throughput
Cache distribution: Different replicas cache different blocks
Failure Scenario:
Store-0 has cached blocks A, B, C
Store-1 has cached blocks D, E, F
If Store-0 fails, queries hit Store-1 (slower until cache warms)
Single Point of Failure: Compactor¶
Why only 1 replica?
Compaction must be serialized (avoid conflicts)
Compactor can be restarted without data loss
Only affects background operations (not queries)
Failure Impact:
Queries continue normally (Store serves existing blocks)
No new compaction/downsampling until restart
S3 gradually accumulates small blocks (eventually compacted)
Performance Tuning¶
Query Optimization¶
Slow Query Symptoms:
Queries taking >5s
High memory usage in Query pods
S3 read errors (throttling)
Solutions:
Increase cache size: More index/chunk cache = fewer S3 reads
Add Store replicas: Distribute query load
Use downsampled data: Query 1h data for month+ ranges
Optimize PromQL: Use recording rules for expensive queries
Upload Optimization¶
Slow Upload Symptoms:
Sidecars behind on uploads (check thanos_shipper_uploads_total)
High sidecar CPU/memory usage
S3 write errors
Solutions:
Increase sidecar resources: More CPU for compression
Check S3 bandwidth: Hetzner S3 has rate limits
Reduce Prometheus retention: Smaller blocks = faster uploads
Compaction Optimization¶
Slow Compaction Symptoms:
Growing number of small blocks in S3
Compactor high CPU/memory
Compactor errors/crashes
Solutions:
Increase compactor PVC: More workspace for large compactions
Reduce retention: Fewer blocks to manage
Tune compaction concurrency: Balance speed vs resource usage
Troubleshooting¶
Common Issues¶
Issue: “Thanos Sidecar not uploading blocks”
Check logs:
kubectl logs -n monitoring prometheus-xxx-0 -c thanos-sidecar
Common causes:
S3 credentials expired/incorrect
Network connectivity to S3
Blocks not yet immutable (wait 2h from block creation)
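Two quick checks covering the first two causes, sketched with the names used on this page (the pod name and metric names may differ in your deployment):

# 1. Confirm the objstore Secret exists and decodes to valid YAML
kubectl get secret -n monitoring thanos-objstore-config \
  -o jsonpath='{.data.objstore\.yml}' | base64 -d

# 2. Inspect the sidecar's own shipper metrics (sidecar HTTP port 10902 by default)
kubectl port-forward -n monitoring prometheus-kube-prometheus-stack-prometheus-0 10902:10902 &
curl -s http://localhost:10902/metrics | grep thanos_shipper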
Issue: “Queries returning incomplete data”
Check Store sync status:
kubectl logs -n monitoring thanos-store-0 | grep "meta-syncer"
Common causes:
Store not synced yet (wait 3m for sync interval)
S3 blocks corrupted (check meta.json)
Time range mismatch (check block min/max time)
Issue: “High query latency”
Check Thanos Query metrics:
rate(thanos_query_duration_seconds_sum[5m])
Common causes:
S3 read latency (check Hetzner S3 status)
Cache not warmed up (first query after restart)
Large time range without downsampling
Migration from Prometheus-Only¶
Before Migration¶
Prometheus (7d retention, 6Gi PVC)
│
│ Query: /api/v1/query
▼
Grafana → Prometheus:9090
After Migration¶
Prometheus (3d retention, 3Gi PVC)
├── Thanos Sidecar → S3 (unlimited retention)
│
Thanos Query (federates Sidecar + S3)
│
│ Query: /api/v1/query (Prometheus-compatible)
▼
Grafana → Thanos Query:9090
Benefits Realized:
50% reduction in PVC usage (6Gi → 3Gi)
2 years of metrics accessible (vs 7 days)
Historical queries working (>3 days old)
No disruption (Prometheus continued scraping)