Query Historical Metrics with Thanos¶
Goal: Query both recent and historical Prometheus metrics using Thanos for long-term trend analysis.
Time: ~20 minutes
Prerequisites¶
Access to Grafana UI
Basic understanding of PromQL (Prometheus Query Language)
Thanos components deployed (already enabled in cluster)
Quick reference¶
| Task | Where to Query | Notes |
|---|---|---|
| Recent metrics (< 3 days) | Grafana → Prometheus datasource | Served from Prometheus local storage |
| Historical metrics (> 3 days) | Grafana → Prometheus datasource | Served from S3 via Thanos Store |
| Query across all time ranges | Grafana → Prometheus datasource | Thanos Query federates both sources |
| Verify Thanos components | Terminal (kubectl, see Step 4) | Check Query, Store, Compactor |
| Check S3 blocks | Terminal (kubectl, see Step 4) | See loaded blocks |
Understanding Thanos Architecture¶
Thanos extends Prometheus with tiered long-term storage in S3:
Prometheus local storage: 3 days retention (fast queries)
S3 raw data: 30 days (full resolution)
S3 downsampled (5-min): 180 days (~6 months)
S3 downsampled (1-hour): 730 days (2 years)
The Grafana Prometheus datasource is configured to use Thanos Query, which automatically:
Routes recent queries to Prometheus sidecars
Routes historical queries to Thanos Store (S3)
Deduplicates metrics from replicas
You don’t need to change anything - all existing dashboards and queries work transparently!
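To see where these retention tiers come from, you can inspect the Compactor's flags. This is a quick sketch: it assumes the Compactor runs as the thanos-compactor-0 pod used later in this guide and passes flags via container args (adjust the jsonpath if your chart uses command instead), and the exact durations may differ in your values.
kubectl get pod -n monitoring thanos-compactor-0 \
  -o jsonpath='{.spec.containers[0].args[*]}' | tr ' ' '\n' | grep retention
# Expect flags along the lines of:
#   --retention.resolution-raw=30d
#   --retention.resolution-5m=180d
#   --retention.resolution-1h=730d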
Step 1: Access Grafana Explore¶
Open Grafana:
https://grafana.ops.kup6s.net
Login with admin credentials
Click Explore (compass icon) in left sidebar
Select Prometheus from data source dropdown (top)
Note: Even though it says “Prometheus”, you’re actually querying through Thanos Query!
Step 2: Query recent metrics (last 3 days)¶
Current CPU usage¶
rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m])
Result: CPU usage rate over last 5 minutes. Served from Prometheus local storage.
Memory usage right now¶
container_memory_working_set_bytes{namespace="hello-kup6s"}
Result: Current memory usage. Fast query from Prometheus.
Pod restarts in last hour¶
increase(kube_pod_container_status_restarts_total{namespace="hello-kup6s"}[1h])
Result: How many times pods restarted in last hour.
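If you prefer a terminal, the same instant queries can be sent to the Thanos Query HTTP API, which is Prometheus-compatible. A minimal sketch, using the service and port shown in Step 10:
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &
sleep 2  # give the port-forward a moment to establish
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=container_memory_working_set_bytes{namespace="hello-kup6s"}' \
  | python3 -m json.tool | head -40
# head -40 just trims the output; stop the port-forward when you are done.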
Step 3: Query historical metrics (older than 3 days)¶
CPU usage trend over 30 days¶
Click time picker (top right)
Select Last 30 days
Query:
avg(rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m]))
Result: 30-day CPU trend. Recent data from Prometheus, historical from S3!
Notice: You’re querying 30 days of data, but Prometheus only keeps 3 days locally. Thanos Store fetches the rest from S3 transparently.
Memory growth over 6 months¶
avg(container_memory_working_set_bytes{namespace="hello-kup6s"}) by (pod)
Set time range to Last 180 days (6 months).
Result: Long-term memory trend. Uses 5-minute downsampled data from S3.
Long-term capacity planning¶
max(
sum(container_memory_working_set_bytes{namespace!~"kube-system|monitoring"})
/
sum(kube_node_status_allocatable{resource="memory"})
) * 100
Set time range to Last 1 year.
Result: Cluster memory utilization percentage over a full year. Uses 1-hour downsampled data from S3.
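To confirm that long-range queries really can be served from downsampled blocks, you can hit the Thanos Query API directly and explicitly allow 5-minute resolution via the Thanos-specific max_source_resolution parameter. A sketch, assuming GNU date and the port-forward from Step 10:
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &
sleep 2  # give the port-forward a moment to establish
curl -sG http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=avg(container_memory_working_set_bytes{namespace="hello-kup6s"})' \
  --data-urlencode "start=$(date -d '180 days ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=6h' \
  --data-urlencode 'max_source_resolution=5m' \
  | python3 -m json.tool | head -40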
Step 4: Verify Thanos is serving your queries¶
Check Thanos Query stores¶
From terminal:
kubectl exec -n monitoring deploy/thanos-query -- \
curl -s localhost:9090/api/v1/stores | python3 -m json.tool
Look for:
"sidecar"entries: Prometheus replicas (real-time data)"store"entries: Thanos Store gateways (S3 historical data)
Each store shows:
minTime/maxTime: What time range it covers
labelSets: What external labels it advertises
Example output¶
{
  "sidecar": [
    {
      "name": "10.42.1.132:10901",
      "minTime": 1761699349000,        // 3 days ago
      "maxTime": 9223372036854775807,  // "now"
      "labelSets": [{"cluster": "kup6s", "replica": "prometheus-...-1"}]
    }
  ],
  "store": [
    {
      "name": "10.42.1.43:10901",
      "minTime": 1761717600082,  // start of oldest loaded block
      "maxTime": 1761760800000,  // end of newest loaded block
      "labelSets": [{"cluster": "kup6s"}]
    }
  ]
}
Interpretation:
Sidecar covers last 3 days up to “now” (real-time)
Store covers historical blocks uploaded to S3
Thanos Query federates both automatically!
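The min/max values are Unix timestamps in milliseconds. A small helper (a sketch, assuming python3 is available on your workstation) prints each store's coverage window in human-readable form:
kubectl exec -n monitoring deploy/thanos-query -- \
  curl -s localhost:9090/api/v1/stores | python3 -c '
import datetime, json, sys
payload = json.load(sys.stdin)
stores = payload.get("data", payload)  # handle responses with or without the "data" wrapper
for kind, entries in stores.items():
    for s in entries:
        lo = datetime.datetime.fromtimestamp(s["minTime"] / 1000)
        hi = "now" if s["maxTime"] >= 2**62 else datetime.datetime.fromtimestamp(s["maxTime"] / 1000)
        name = s["name"]
        print(f"{kind:8s} {name:24s} {lo} -> {hi}")
'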
Check S3 blocks loaded¶
kubectl exec -n monitoring thanos-store-0 -- \
ls -lh /var/thanos/store/meta-syncer/ | head -20
Result: Directories named with block ULIDs (unique IDs). Each is a 2-hour block of metrics uploaded from Prometheus.
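Each block directory also carries a meta.json recording the block's time range and downsampling resolution, which is an easy way to confirm that 5-minute and 1-hour blocks exist. A sketch, assuming a shell is available in the store container and the path matches the listing above:
kubectl exec -n monitoring thanos-store-0 -- \
  sh -c 'cat /var/thanos/store/meta-syncer/*/meta.json' | \
  grep -E '"resolution"|"minTime"|"maxTime"' | head -20
# thanos.downsample.resolution is in milliseconds: 0 = raw, 300000 = 5m, 3600000 = 1h.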
Step 5: Compare trends over time¶
Week-over-week comparison¶
Run these as two separate queries (A and B) in the same Explore panel:
avg(rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m]))
avg(rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m]) offset 1w)
Set time range to Last 7 days.
Result: Current week and last week CPU usage side-by-side.
Month-over-month growth¶
sum(container_memory_working_set_bytes{namespace="hello-kup6s"})
/
sum(container_memory_working_set_bytes{namespace="hello-kup6s"} offset 30d)
Result: Memory usage as a ratio of 30 days ago. >1 means growth, <1 means reduction.
Year-over-year seasonality¶
For clusters running >1 year:
avg_over_time(
sum(rate(container_cpu_usage_seconds_total{namespace!~"kube-system|monitoring"}[1h]))
[1d:1h]
)
Set time range to Last 2 years.
Result: Daily CPU average over 2 years. Uses 1-hour downsampled data from S3.
Step 6: Performance and cost optimization¶
Query performance by time range¶
| Time Range | Data Source | Resolution | Query Speed |
|---|---|---|---|
| Last 3 hours | Prometheus | Raw (15s) | Very fast |
| Last 3 days | Prometheus | Raw (15s) | Fast |
| 3-30 days | S3 (Thanos Store) | Raw (15s) | Medium |
| 30-180 days | S3 (Thanos Store) | 5-min downsampled | Fast |
| 180+ days | S3 (Thanos Store) | 1-hour downsampled | Very fast |
Tip: Downsampled data is faster to query because there are fewer data points!
Optimize long-term queries¶
Instead of querying raw data over 6 months:
# Slower - queries millions of raw samples
rate(container_cpu_usage_seconds_total[5m]) # Last 180 days
Use appropriate aggregation:
# Faster - uses downsampled data
avg_over_time(
rate(container_cpu_usage_seconds_total[5m])[1h:5m]
)
Result: 1-hour average rates instead of every 5-minute sample. Much faster!
Use recording rules for frequent queries¶
For dashboards querying the same expensive PromQL repeatedly, create a recording rule:
# In Prometheus rules
- record: namespace:container_cpu:avg_rate5m
  expr: |
    avg(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
Then query:
namespace:container_cpu:avg_rate5m{namespace="hello-kup6s"}
Benefit: Pre-computed metric, much faster queries!
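With kube-prometheus-stack, recording rules are typically delivered as a PrometheusRule resource. A minimal sketch (the resource name and the release label are assumptions; they must match your Prometheus ruleSelector):
kubectl apply -n monitoring -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hello-kup6s-recording-rules   # hypothetical name
  labels:
    release: kube-prometheus-stack    # must match the Prometheus ruleSelector
spec:
  groups:
    - name: hello-kup6s.rules
      rules:
        - record: namespace:container_cpu:avg_rate5m
          expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
EOF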
Step 7: Identify gaps and missing data¶
Check for data gaps¶
absent_over_time(
up{job="hello-kup6s"}[1h]
)
Result: Shows when the up metric was missing (scrape failures or downtime).
Verify metric exists in historical range¶
count_over_time(
container_memory_working_set_bytes{namespace="hello-kup6s"}[30d]
)
Result: How many samples exist in last 30 days. If zero, metric wasn’t collected yet.
Find when a metric first appeared¶
timestamp(
container_memory_working_set_bytes{namespace="hello-kup6s"}
)
Set time range to Last 2 years, graph it.
Result: Metric starts appearing when pods were first deployed.
Step 8: Common use cases¶
Capacity planning: Storage growth¶
predict_linear(
  sum(kubelet_volume_stats_used_bytes{namespace="hello-kup6s"})[30d:1h],
  86400 * 90  # 90 days in seconds
)
Result: Predicted storage usage in 90 days based on last 30 days trend.
Performance baseline: 95th percentile latency¶
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{namespace="hello-kup6s"}[5m])
)
Set time range to Last 90 days.
Result: 95th percentile latency over 3 months. Helps establish SLOs.
Cost analysis: Resource waste¶
sum by (pod) (
  kube_pod_container_resource_requests{namespace="hello-kup6s", resource="memory"}
)
-
sum by (pod) (
  avg_over_time(container_memory_working_set_bytes{namespace="hello-kup6s", container!=""}[1d])
)
Result: How much requested memory is unused. Helps right-size requests.
Incident investigation: What changed?¶
After an incident on Oct 15, 2025:
delta(
container_memory_working_set_bytes{namespace="hello-kup6s"}[1h]
)
Set time range to Oct 15 10:00 - Oct 15 16:00.
Result: Shows memory spikes during incident window.
SLO tracking: Uptime over 30 days¶
avg_over_time(up{job="hello-kup6s"}[30d]) * 100
Result: Percentage uptime over last 30 days. Compare against SLO (e.g., 99.9%).
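A quick way to turn that into an error-budget check: with a 99.9% SLO, the allowed downtime over 30 days is (1 - 0.999) * 30 * 24 * 60, about 43 minutes. The query below is a sketch that reports how much of that budget is spent; values above 1 mean the budget is exhausted.
(1 - avg_over_time(up{job="hello-kup6s"}[30d])) / (1 - 0.999)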
Step 9: Troubleshooting¶
“No data” for historical queries¶
Check if Thanos Store has the data:
kubectl exec -n monitoring thanos-store-0 -- \
ls -lh /var/thanos/store/meta-syncer/
If no directories, Thanos hasn’t loaded S3 blocks yet. Wait a few minutes.
Check Thanos Store logs:
kubectl logs -n monitoring thanos-store-0 --tail=30
Look for “loaded new block” messages.
Query timeout on long time ranges¶
Symptom: Grafana shows “Timeout” error when querying 1+ years.
Fix: Use downsampled queries or recording rules (see Step 6).
Alternative: Increase query timeout in Grafana datasource settings.
Gaps in historical data¶
Cause: Thanos sidecar was just enabled. Historical data before enablement doesn’t exist in S3.
Explanation: Thanos only uploads blocks going forward from when it was enabled. It doesn’t backfill old data.
Solution: Wait for time to pass. In 30 days, you’ll have 30 days of history!
“Store gateway not found” errors¶
Check Thanos Store pods:
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store
Both replicas should be listed as Running and Ready.
If not running, check events:
kubectl describe pod -n monitoring thanos-store-0
Metrics look different after 30+ days¶
Expected behavior! After 30 days, raw data is deleted and only 5-minute downsampled data remains.
Example:
Day 1-30: Raw 15s resolution
Day 31-180: 5-minute resolution (averaged)
Day 181+: 1-hour resolution (averaged)
Impact: Long-term queries are averages, not exact values. This is intentional for cost savings!
Step 10: Advanced: Directly query Thanos components¶
Query Thanos Query directly (bypass Grafana)¶
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
Then open browser: http://localhost:9090
Use case: Debugging Thanos federation, seeing which stores contribute to a query.
Query Prometheus directly (bypass Thanos)¶
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9091:9090
Then open browser: http://localhost:9091
Use case: Compare Prometheus local data vs Thanos federated data.
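With both port-forwards running, a quick side-by-side check (a sketch, assuming GNU date) is to send the same instant query to each endpoint at a timestamp older than the 3-day local retention; Prometheus should return an empty result while Thanos returns data from S3:
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up{job="hello-kup6s"}' \
  --data-urlencode "time=$(date -d '7 days ago' +%s)" | python3 -m json.tool
curl -sG http://localhost:9091/api/v1/query \
  --data-urlencode 'query=up{job="hello-kup6s"}' \
  --data-urlencode "time=$(date -d '7 days ago' +%s)" | python3 -m json.tool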
Check Thanos Compactor activity¶
kubectl logs -n monitoring thanos-compactor-0 --tail=50 | \
grep -E "(compact|downsample)"
Look for:
“start first pass of downsampling” (creating 5-min blocks)
“start second pass of downsampling” (creating 1-hour blocks)
“compaction iterations done”
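The Compactor also exposes its own metrics over HTTP (port 10902 by default); thanos_compact_halted staying at 0 is a quick health check. A sketch:
kubectl port-forward -n monitoring thanos-compactor-0 10902:10902 &
sleep 2  # give the port-forward a moment to establish
curl -s http://localhost:10902/metrics | grep '^thanos_compact_halted'
# 0 means the Compactor is healthy; 1 means it halted (check the logs above for the reason).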
Inspect S3 bucket directly¶
Via AWS CLI with Hetzner S3:
export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>
aws s3 ls s3://metrics-thanos-kup6s/ \
--endpoint-url https://fsn1.your-objectstorage.com
Result: Shows all blocks uploaded to S3, including downsampled blocks.
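To gauge how much the bucket actually holds (and what downsampling saves), the AWS CLI can summarize the total object count and size, using the same credentials and endpoint as above:
aws s3 ls s3://metrics-thanos-kup6s/ --recursive --summarize --human-readable \
  --endpoint-url https://fsn1.your-objectstorage.com | tail -3
# The last lines report "Total Objects" and "Total Size" for the whole bucket.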
Thanos retention summary¶
| Data Age | Resolution | Storage Location | Retention |
|---|---|---|---|
| 0-3 days | Raw (15s) | Prometheus local | 3 days |
| 0-30 days | Raw (15s) | S3 (Hetzner fsn1) | 30 days |
| 30-180 days | 5-minute avg | S3 (Hetzner fsn1) | 180 days |
| 180-730 days | 1-hour avg | S3 (Hetzner fsn1) | 730 days (2 years) |
| 730+ days | - | Deleted | - |
Cost optimization: Downsampling reduces storage costs by 99%+ for old data!
Example: 1 year of raw 15s data = ~2 TB. 1 year of 1-hour downsampled data = ~20 GB.
Next steps¶
Create alerts - Alert on metric thresholds
Access Grafana - Build dashboards
Query Loki logs - Correlate metrics with logs