Prometheus and Thanos Integration¶
Why Thanos?¶
The Prometheus Storage Problem¶
Prometheus excels at real-time metrics collection but has limitations for long-term storage:
Problems with Prometheus alone:
Limited retention: Local TSDB constrained by disk space
No global view: Multiple Prometheus instances can’t query each other
No backup: Losing a Prometheus pod means losing metrics
Expensive storage: Keeping full-resolution metrics for years is costly
Why not just increase Prometheus retention?
Storage costs scale linearly with retention
Query performance degrades with large local TSDB
Still no redundancy or cross-instance queries
No automatic downsampling for old data
Thanos Solution¶
Thanos extends Prometheus with:
✅ Unlimited retention: S3 storage is cheap (~$0.023/GB/month)
✅ Global queries: Federation across multiple Prometheus instances
✅ Automatic downsampling: 5m and 1h resolutions for historical data
✅ Backup & recovery: S3 provides durability (99.999999999%)
✅ Cost optimization: Compress and downsample old metrics
Architecture Decision: Sidecar vs Remote-Write¶
Approach Comparison¶
| Aspect | Sidecar (Our Choice) | Remote-Write |
|---|---|---|
| Latency | No query latency (gRPC) | Query latency to Prometheus |
| Complexity | Lower (runs alongside) | Higher (separate receiver) |
| Query Performance | Fast (direct TSDB access) | Slower (must query via API) |
| Prometheus Impact | Minimal (read-only) | Higher (network I/O for writes) |
| Failure Mode | Degrades gracefully | Backpressure on Prometheus |
| Deduplication | Built-in (replica labels) | Requires careful configuration |
Why We Chose Sidecar¶
Primary Reasons:
Simplicity: Sidecar runs in same pod as Prometheus (no separate deployment)
Performance: Direct TSDB access via gRPC (no API overhead)
Graceful Degradation: Sidecar failure doesn’t affect Prometheus scraping
Query Speed: Thanos Query can read directly from Prometheus TSDB
Trade-offs Accepted:
Slightly higher pod resource usage (sidecar container overhead)
Must restart Prometheus pod to update sidecar configuration
Sidecar coupled to Prometheus lifecycle
Component Roles¶
Thanos Sidecar¶
Primary Functions:
Block Upload: Uploads 2-hour Prometheus blocks to S3
gRPC StoreAPI: Serves real-time queries from Prometheus TSDB
Metadata Management: Maintains block metadata in S3
Configuration:
thanos:
  objectStorageConfig:
    key: objstore.yml               # S3 configuration
    name: thanos-objstore-config    # Secret with S3 credentials
  version: 0.37.2                   # Thanos version
  resources:
    requests:
      cpu: 25m                      # Minimal CPU (background uploads)
      memory: 128Mi                 # Buffer for block uploads
Upload Behavior:
Uploads every 2 hours (aligned with Prometheus block boundaries)
Only uploads immutable blocks (not the head block)
Adds external label prometheus_replica (for deduplication)
Retries failed uploads with exponential backoff
gRPC StoreAPI:
Listens on port 10901
Serves queries for time ranges in local TSDB
Supports streaming responses (large queries)
Provides min/max time metadata to Query
Thanos Query¶
Primary Functions:
Query Federation: Unified query interface across multiple stores
Deduplication: Removes duplicate samples from HA replicas
Query Routing: Routes queries to appropriate stores (sidecar vs S3)
Configuration:
replicas: 2               # HA for query interface
stores:
  - dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local    # Sidecars
  - dnssrv+_grpc._tcp.thanos-store-grpc.monitoring.svc.cluster.local      # Store Gateways
deduplication: true       # Enable replica deduplication
queryFrontend: false      # Not using query frontend (overhead not needed)
Query Flow:
Client queries /api/v1/query?query=up on Thanos Query
Query identifies the time range and the required stores
Parallel gRPC fan-out to sidecars and store gateways
Merge and deduplicate responses
Return unified result to client
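To make this flow concrete, here is a minimal sketch of exercising it from a workstation, assuming the thanos-query Service name used later on this page and a local port-forward; /api/v1/stores and the dedup parameter are Thanos extensions of the Prometheus HTTP API:

# Forward the Thanos Query HTTP port locally (Service name assumed: thanos-query)
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &

# List the stores (sidecars and store gateways) Query has discovered over gRPC
curl -s http://localhost:9090/api/v1/stores | jq .

# Run an instant query; dedup=true merges the per-replica series
curl -s 'http://localhost:9090/api/v1/query?query=up&dedup=true' | jq '.data.result | length'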
Deduplication Logic:
# Prometheus replica-0 sample: up{job="api", prometheus_replica="0"} = 1 @ 1699000000
# Prometheus replica-1 sample: up{job="api", prometheus_replica="1"} = 1 @ 1699000000
# Thanos Query result: up{job="api"} = 1 @ 1699000000 (replica label removed)
Thanos Store¶
Primary Functions:
S3 Gateway: Provides gRPC StoreAPI for S3 block data
Index Caching: Caches block indexes for faster queries
Block Discovery: Syncs block metadata from S3 every 3 minutes
Configuration:
replicas: 2               # HA for historical queries
persistentVolumeClaim:
  size: 10Gi              # Index cache per replica
indexCache:
  type: in-memory
  config:
    max_size: 500MB       # Memory cache for block indexes
chunkCache:
  type: in-memory
  config:
    max_size: 500MB       # Memory cache for chunks
Storage Layout:
/var/thanos/store/
├── meta-syncer/ # Synced block metadata
│ ├── 01K8WC41VMNQF74ZC000CC72NY/
│ │ └── meta.json # Block metadata (time range, labels)
│ └── 01K8WC7A1S8XE9AWK8SKD381H1/
│ └── meta.json
└── cache/ # Disk cache for index files
Query Performance:
First query: 1-2s (fetch from S3, cache index)
Cached query: 100-500ms (index in memory/disk)
Chunk read: 200-800ms (depends on S3 latency)
Thanos Compactor¶
Primary Functions:
Compaction: Merges small blocks into larger blocks
Downsampling: Creates 5m and 1h resolution blocks
Retention: Applies retention policies (deletes old blocks)
Configuration:
replicas: 1                   # Only one compactor should run
persistentVolumeClaim:
  size: 20Gi                  # Workspace for compaction
retentionRaw: 30d             # Keep raw data for 30 days
retention5m: 180d             # Keep 5-minute data for 180 days (6 months)
retention1h: 730d             # Keep 1-hour data for 730 days (2 years)
compactionConcurrency: 1      # Sequential compaction (safety)
Compaction Process:
Input:      [2h block] [2h block] [2h block] [2h block]   (8h total)
             └──────────────────────┬─────────────────┘
                                    │
Compact:               [8h block]   (merged + deduplicated)
                                    │
Downsample:            [8h @ 5m resolution]   (1/10 of raw samples)
                                    │
Downsample:            [8h @ 1h resolution]   (1/120 of raw samples)
Block Lifecycle:
Day 1-30: Raw 2h blocks (full resolution, 30s scrape interval)
Day 31-180: 5m resolution blocks (downsampled, 1/10 samples)
Day 181-730: 1h resolution blocks (downsampled, 1/120 samples)
Day 731+: Deleted (expired by S3 lifecycle policy)
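To steer a historical query onto a downsampled tier explicitly, Thanos Query accepts a max_source_resolution parameter on its Prometheus-compatible endpoints. A hedged example, reusing the assumed port-forward from above with placeholder timestamps:

# Serve this range from 1h-resolution blocks instead of raw data (timestamps are placeholders)
curl -s -G 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=up' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-03-01T00:00:00Z' \
  --data-urlencode 'step=1h' \
  --data-urlencode 'max_source_resolution=1h' | jq '.status'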
Data Flow¶
Upload Path (Prometheus → S3)¶
Prometheus TSDB
│
│ Every 2 hours
▼
Thanos Sidecar (Block Reader)
│
│ gzip compress (~60% reduction)
▼
S3 Bucket (metrics-thanos-kup6s)
│
└── 01K8WC41VMNQF74ZC000CC72NY/ # Block ULID
├── meta.json # Metadata (120KB)
├── index # Series index (2MB compressed)
└── chunks/ # Sample data
├── 000001 (10MB)
├── 000002 (10MB)
└── ...
Block Structure:
ULID: Universally Unique Lexicographically Sortable Identifier
meta.json: Time range, external labels, compaction level
index: Series labels → chunk references
chunks/: Actual sample data (compressed XOR delta encoding)
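To verify what actually landed in the bucket, the thanos binary ships bucket tooling; a sketch, assuming the objstore.yml from the configuration section further down is available locally:

# List every block ULID currently in the bucket
thanos tools bucket ls --objstore.config-file=objstore.yml

# Tabular overview: time range, labels, resolution and compaction level per block
thanos tools bucket inspect --objstore.config-file=objstore.yml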
Query Path (Grafana → Prometheus/S3)¶
Scenario 1: Recent metrics (< 3 days)
Grafana
│ PromQL: rate(http_requests_total[5m])
▼
Thanos Query (choose fastest store)
│
│ gRPC: Query([now-3d, now])
▼
Thanos Sidecar (reads Prometheus TSDB)
│
│ Fast: 50-100ms
▼
Prometheus TSDB (in-memory head + mmap'd blocks)
Scenario 2: Historical metrics (> 3 days)
Grafana
│ PromQL: rate(http_requests_total[30d])
▼
Thanos Query (fan-out to all stores)
│
├─────────────────┬─────────────────┐
▼ ▼ ▼
Sidecar (0-3d) Store-0 (3d+) Store-1 (3d+)
│ │ │
│ │ Fetch from S3 │
│ ▼ ▼
└─────────────────┴─────────────────┘
│
│ Merge + Deduplicate
▼
Thanos Query
│
│ 500ms-2s
▼
Grafana
Query Optimization:
Time range splitting: Query routes to appropriate stores
Parallel fetch: Multiple stores queried simultaneously
Downsampling: Old queries use 5m/1h resolution (faster)
Caching: Store gateways cache frequently accessed blocks
S3 Storage Format¶
Bucket Structure¶
s3://metrics-thanos-kup6s/
├── 01K8WC41VMNQF74ZC000CC72NY/     # Raw 2h block (1d old)
│   ├── meta.json                   # resolution=0 (raw)
│   ├── index (2.1 MB)
│   └── chunks/ (25 MB total)
├── 01K8YT3G2HPQN8B9VWX5KZMR1F/     # Compacted 8h block (15d old)
│   ├── meta.json                   # resolution=0, compaction.level=2
│   ├── index (6.8 MB)
│   └── chunks/ (68 MB total)
├── 01K9A7M3PXCVBN8G4WQ2FRTY6J/     # 5m downsampled block (45d old)
│   ├── meta.json                   # resolution=300000 (5m)
│   ├── index (1.2 MB)
│   └── chunks/ (12 MB total)
└── 01K9S8R4QYDWCM9H5XT3GSUZ7K/     # 1h downsampled block (400d old)
    ├── meta.json                   # resolution=3600000 (1h)
    ├── index (400 KB)
    └── chunks/ (3 MB total)
All blocks, including downsampled ones, live at the bucket root keyed by ULID; the resolution is recorded in each block's meta.json, not in the object path.
Storage Efficiency¶
Compression Results (typical 2-hour block):
Uncompressed samples: ~50 MB
Prometheus TSDB (XOR encoding): ~10 MB (80% reduction)
S3 gzip compression: ~6 MB (additional 40% reduction)
5m downsampling: ~1.2 MB (80% reduction from 6MB)
1h downsampling: ~300 KB (75% reduction from 1.2MB)
Storage Cost Calculation (2-year retention):
Daily ingestion: 500 MB/day (uncompressed)
After compression: 100 MB/day (S3)
30 days raw: 30 × 100 MB = 3 GB
150 days @ 5m: 150 × 20 MB = 3 GB (5x smaller than raw)
550 days @ 1h: 550 × 5 MB = 2.75 GB (20x smaller than raw)
Total: ~9 GB for 2 years of metrics
Cost: ~$0.21/month at $0.023/GB/month
Configuration Deep Dive¶
Thanos Sidecar Configuration¶
# In Prometheus spec
thanos:
  image: quay.io/thanos/thanos:v0.37.2
  version: 0.37.2
  objectStorageConfig:
    key: objstore.yml
    name: thanos-objstore-config
  resources:
    requests:
      cpu: 25m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 256Mi
objstore.yml (S3 configuration):
type: S3
config:
  bucket: "metrics-thanos-kup6s"
  endpoint: "fsn1.your-objectstorage.com"
  region: fsn1
  access_key: "${S3_ACCESS_KEY}"
  secret_key: "${S3_SECRET_KEY}"
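The sidecar reads this file from the Secret referenced in the Prometheus spec above. One way to create it (a sketch: the Secret name and key must match objectStorageConfig, and the local ./objstore.yml path is an assumption):

# Create the Secret that objectStorageConfig points at
kubectl create secret generic thanos-objstore-config \
  --namespace monitoring \
  --from-file=objstore.yml=./objstore.yml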
Thanos Query Configuration¶
stores:
  # Auto-discover Prometheus sidecars
  - dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
  # Connect to Thanos Store gateways
  - dnssrv+_grpc._tcp.thanos-store-grpc.monitoring.svc.cluster.local
replicaLabel:
  - prometheus_replica    # Label to deduplicate on
queryTimeout: 5m          # Max query duration
maxConcurrent: 20         # Max parallel queries
DNS SRV Discovery:
$ dig +short SRV _grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
0 50 10901 prometheus-kube-prometheus-stack-prometheus-0.prometheus-operated...
0 50 10901 prometheus-kube-prometheus-stack-prometheus-1.prometheus-operated...
Thanos Store Configuration¶
indexCacheSize: 500MB
chunkCacheSize: 500MB
syncInterval: 3m # How often to check S3 for new blocks
bucketCacheSize: 1GB # Metadata cache
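These values ultimately become flags on the store gateway binary; the invocation below is an approximation of how they are wired through, not a copy of the rendered manifest (paths are illustrative):

# Approximate store gateway invocation behind the settings above
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --index-cache-size=500MB \
  --chunk-pool-size=500MB \
  --sync-block-duration=3m \
  --grpc-address=0.0.0.0:10901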
Thanos Compactor Configuration¶
retentionResolutionRaw: 30d # Delete raw blocks after 30 days
retentionResolution5m: 180d # Delete 5m blocks after 180 days
retentionResolution1h: 730d # Delete 1h blocks after 730 days
compactionConcurrency: 1 # Only compact one block group at a time
downsampleConcurrency: 1 # Only downsample one block at a time
consistencyDelay: 30m # Wait 30m before compacting (safety)
compactionInterval: 5m # Check for compaction work every 5m
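Likewise for the compactor; an approximate sketch of the flags these settings map to (paths are illustrative):

# Approximate compactor invocation behind the settings above
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=730d \
  --compact.concurrency=1 \
  --downsample.concurrency=1 \
  --consistency-delay=30m \
  --wait --wait-interval=5m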
High Availability Design¶
Prometheus HA (2 Replicas)¶
Configuration:
replicas: 2
externalLabels:
  prometheus_replica: $(POD_NAME)   # Unique per replica
Behavior:
Both replicas scrape identical targets
Each adds a prometheus_replica label (replica-0, replica-1)
Thanos Query deduplicates based on this label
Failure Scenario:
Replica-0 fails at 12:00:00
Replica-1 continues scraping (no metrics gap)
Thanos Query deduplicates: uses replica-1 data for 12:00:00+
Replica-0 restarts at 12:02:30, resumes scraping
For 12:02:30+, Query has both replicas (deduplicates)
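The deduplication is easy to observe directly: with dedup=false Thanos Query returns one series per replica, with dedup=true a single merged series. Same assumed port-forward as earlier; up{job="api"} is the example series from the deduplication section:

# One series per prometheus_replica (raw view of both scrapers)
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="api"}' \
  --data-urlencode 'dedup=false' | jq '.data.result | length'

# Replicas merged into a single deduplicated series
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="api"}' \
  --data-urlencode 'dedup=true' | jq '.data.result | length'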
Thanos Query HA (2 Replicas)¶
Why HA Query?
Query is stateless (no persistent state)
Load balancing via Kubernetes Service
Failover: If one replica fails, other serves requests
Load Distribution:
Grafana → Service (thanos-query:9090)
│
├─ query-pod-0 (50% traffic)
└─ query-pod-1 (50% traffic)
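A quick check that both replicas are actually serving behind the Service (the Service name thanos-query is taken from the diagram above):

# Both query pods should show up as ready endpoints behind the Service
kubectl get endpoints thanos-query -n monitoring -o wide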
Thanos Store HA (2 Replicas)¶
Why HA Store?
S3 reads can be slow (~500ms)
2 replicas = 2x read throughput
Cache distribution: Different replicas cache different blocks
Failure Scenario:
Store-0 has cached blocks A, B, C
Store-1 has cached blocks D, E, F
If Store-0 fails, queries hit Store-1 (slower until cache warms)
Single Point of Failure: Compactor¶
Why only 1 replica?
Compaction must be serialized (avoid conflicts)
Compactor can be restarted without data loss
Only affects background operations (not queries)
Failure Impact:
Queries continue normally (Store serves existing blocks)
No new compaction/downsampling until restart
S3 gradually accumulates small blocks (eventually compacted)
Performance Tuning¶
Query Optimization¶
Slow Query Symptoms:
Queries taking >5s
High memory usage in Query pods
S3 read errors (throttling)
Solutions:
Increase cache size: More index/chunk cache = fewer S3 reads
Add Store replicas: Distribute query load
Use downsampled data: Query 1h data for month+ ranges
Optimize PromQL: Use recording rules for expensive queries
Upload Optimization¶
Slow Upload Symptoms:
Sidecars behind on uploads (check thanos_shipper_uploads_total)
High sidecar CPU/memory usage
S3 write errors
Solutions:
Increase sidecar resources: More CPU for compression
Check S3 bandwidth: Hetzner S3 has rate limits
Reduce Prometheus retention: Smaller blocks = faster uploads
Compaction Optimization¶
Slow Compaction Symptoms:
Growing number of small blocks in S3
Compactor high CPU/memory
Compactor errors/crashes
Solutions:
Increase compactor PVC: More workspace for large compactions
Reduce retention: Fewer blocks to manage
Tune compaction concurrency: Balance speed vs resource usage
Troubleshooting¶
Common Issues¶
Issue: “Thanos Sidecar not uploading blocks”
Check logs:
kubectl logs -n monitoring prometheus-xxx-0 -c thanos-sidecar
Common causes:
S3 credentials expired/incorrect
Network connectivity to S3
Blocks not yet immutable (wait 2h from block creation)
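Two quick checks covering the first two causes, sketched with the names used on this page (the pod name and metric names may differ in your deployment):

# 1. Confirm the objstore Secret exists and decodes to valid YAML
kubectl get secret -n monitoring thanos-objstore-config \
  -o jsonpath='{.data.objstore\.yml}' | base64 -d

# 2. Inspect the sidecar's own shipper metrics (sidecar HTTP port 10902 by default)
kubectl port-forward -n monitoring prometheus-kube-prometheus-stack-prometheus-0 10902:10902 &
curl -s http://localhost:10902/metrics | grep thanos_shipper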
Issue: “Queries returning incomplete data”
Check Store sync status:
kubectl logs -n monitoring thanos-store-0 | grep "meta-syncer"
Common causes:
Store not synced yet (wait 3m for sync interval)
S3 blocks corrupted (check meta.json)
Time range mismatch (check block min/max time)
Issue: “High query latency”
Check Thanos Query metrics:
rate(thanos_query_duration_seconds_sum[5m])
Common causes:
S3 read latency (check Hetzner S3 status)
Cache not warmed up (first query after restart)
Large time range without downsampling
Migration from Prometheus-Only¶
Before Migration¶
Prometheus (7d retention, 6Gi PVC)
│
│ Query: /api/v1/query
▼
Grafana → Prometheus:9090
After Migration¶
Prometheus (3d retention, 3Gi PVC)
├── Thanos Sidecar → S3 (unlimited retention)
│
Thanos Query (federates Sidecar + S3)
│
│ Query: /api/v1/query (Prometheus-compatible)
▼
Grafana → Thanos Query:9090
Benefits Realized:
50% reduction in PVC usage (6Gi → 3Gi)
2 years of metrics accessible (vs 7 days)
Historical queries working (>3 days old)
No disruption (Prometheus continued scraping)