Query Historical Metrics with Thanos¶
Goal: Query both recent and historical Prometheus metrics using Thanos for long-term trend analysis.
Time: ~20 minutes
Prerequisites¶
Access to Grafana UI
Basic understanding of PromQL (Prometheus Query Language)
Thanos components deployed (already enabled in cluster)
Quick reference¶
| Task | Where to Query | Notes |
|---|---|---|
| Recent metrics (< 3 days) | Grafana → Prometheus datasource | Served from Prometheus local storage |
| Historical metrics (> 3 days) | Grafana → Prometheus datasource | Served from S3 via Thanos Store |
| Query across all time ranges | Grafana → Prometheus datasource | Thanos Query federates both sources |
| Verify Thanos components | Terminal (kubectl, see Step 4) | Check Query, Store, Compactor |
| Check S3 blocks | Terminal (kubectl, see Step 4) | See loaded blocks |
Understanding Thanos Architecture¶
Thanos extends Prometheus with tiered long-term storage in S3:
Prometheus local storage: 3 days retention (fast queries)
S3 raw data: 30 days (full resolution)
S3 downsampled (5-min): 180 days (~6 months)
S3 downsampled (1-hour): 730 days (2 years)
The Grafana Prometheus datasource is configured to use Thanos Query, which automatically:
Routes recent queries to Prometheus sidecars
Routes historical queries to Thanos Store (S3)
Deduplicates metrics from replicas
You don’t need to change anything - all existing dashboards and queries work transparently!
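To see where these retention tiers come from, you can inspect the Compactor's flags. This is a quick sketch: it assumes the Compactor runs as the thanos-compactor-0 pod used later in this guide and passes flags via container args (adjust the jsonpath if your chart uses command instead), and the exact durations may differ in your values.
kubectl get pod -n monitoring thanos-compactor-0 \
  -o jsonpath='{.spec.containers[0].args[*]}' | tr ' ' '\n' | grep retention
# Expect flags along the lines of:
#   --retention.resolution-raw=30d
#   --retention.resolution-5m=180d
#   --retention.resolution-1h=730d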
Step 1: Access Grafana Explore¶
Open Grafana:
https://grafana.ops.kup6s.net
Login with admin credentials
Click Explore (compass icon) in left sidebar
Select Prometheus from data source dropdown (top)
Note: Even though it says “Prometheus”, you’re actually querying through Thanos Query!
Step 2: Query recent metrics (last 3 days)¶
Current CPU usage¶
rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m])
Result: CPU usage rate over last 5 minutes. Served from Prometheus local storage.
Memory usage right now¶
container_memory_working_set_bytes{namespace="hello-kup6s"}
Result: Current memory usage. Fast query from Prometheus.
Pod restarts in last hour¶
increase(kube_pod_container_status_restarts_total{namespace="hello-kup6s"}[1h])
Result: How many times pods restarted in last hour.
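If you prefer a terminal, the same instant queries can be sent to the Thanos Query HTTP API, which is Prometheus-compatible. A minimal sketch, using the service and port shown in Step 10:
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &
sleep 2  # give the port-forward a moment to establish
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=container_memory_working_set_bytes{namespace="hello-kup6s"}' \
  | python3 -m json.tool | head -40
# head -40 just trims the output; stop the port-forward when you are done.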
Step 3: Query historical metrics (older than 3 days)¶
CPU usage trend over 30 days¶
Click time picker (top right)
Select Last 30 days
Query:
avg(rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m]))
Result: 30-day CPU trend. Recent data from Prometheus, historical from S3!
Notice: You’re querying 30 days of data, but Prometheus only keeps 3 days locally. Thanos Store fetches the rest from S3 transparently.
Memory growth over 6 months¶
avg(container_memory_working_set_bytes{namespace="hello-kup6s"}) by (pod)
Set time range to Last 180 days (6 months).
Result: Long-term memory trend. Uses 5-minute downsampled data from S3.
Long-term capacity planning¶
max(
sum(container_memory_working_set_bytes{namespace!~"kube-system|monitoring"})
/
sum(kube_node_status_allocatable{resource="memory"})
) * 100
Set time range to Last 1 year.
Result: Cluster memory utilization percentage over a full year. Uses 1-hour downsampled data from S3.
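To confirm that long-range queries really can be served from downsampled blocks, you can hit the Thanos Query API directly and explicitly allow 5-minute resolution via the Thanos-specific max_source_resolution parameter. A sketch, assuming GNU date and the port-forward from Step 10:
kubectl port-forward -n monitoring svc/thanos-query 9090:9090 &
sleep 2  # give the port-forward a moment to establish
curl -sG http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=avg(container_memory_working_set_bytes{namespace="hello-kup6s"})' \
  --data-urlencode "start=$(date -d '180 days ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=6h' \
  --data-urlencode 'max_source_resolution=5m' \
  | python3 -m json.tool | head -40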
Step 4: Verify Thanos is serving your queries¶
Check Thanos Query stores¶
From terminal:
kubectl exec -n monitoring deploy/thanos-query -- \
curl -s localhost:9090/api/v1/stores | python3 -m json.tool
Look for:
"sidecar"entries: Prometheus replicas (real-time data)"store"entries: Thanos Store gateways (S3 historical data)
Each store shows:
minTime/maxTime: What time range it covers
labelSets: What external labels it advertises
Example output¶
{
  "sidecar": [
    {
      "name": "10.42.1.132:10901",
      "minTime": 1761699349000,        // 3 days ago
      "maxTime": 9223372036854775807,  // "now"
      "labelSets": [{"cluster": "kup6s", "replica": "prometheus-...-1"}]
    }
  ],
  "store": [
    {
      "name": "10.42.1.43:10901",
      "minTime": 1761717600082,  // start of oldest loaded block
      "maxTime": 1761760800000,  // end of newest loaded block
      "labelSets": [{"cluster": "kup6s"}]
    }
  ]
}
Interpretation:
Sidecar covers last 3 days up to “now” (real-time)
Store covers historical blocks uploaded to S3
Thanos Query federates both automatically!
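The min/max values are Unix timestamps in milliseconds. A small helper (a sketch, assuming python3 is available on your workstation) prints each store's coverage window in human-readable form:
kubectl exec -n monitoring deploy/thanos-query -- \
  curl -s localhost:9090/api/v1/stores | python3 -c '
import datetime, json, sys
payload = json.load(sys.stdin)
stores = payload.get("data", payload)  # handle responses with or without the "data" wrapper
for kind, entries in stores.items():
    for s in entries:
        lo = datetime.datetime.fromtimestamp(s["minTime"] / 1000)
        hi = "now" if s["maxTime"] >= 2**62 else datetime.datetime.fromtimestamp(s["maxTime"] / 1000)
        name = s["name"]
        print(f"{kind:8s} {name:24s} {lo} -> {hi}")
'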
Check S3 blocks loaded¶
kubectl exec -n monitoring thanos-store-0 -- \
ls -lh /var/thanos/store/meta-syncer/ | head -20
Result: Directories named with block ULIDs (unique IDs). Each is a 2-hour block of metrics uploaded from Prometheus.
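Each block directory also carries a meta.json recording the block's time range and downsampling resolution, which is an easy way to confirm that 5-minute and 1-hour blocks exist. A sketch, assuming a shell is available in the store container and the path matches the listing above:
kubectl exec -n monitoring thanos-store-0 -- \
  sh -c 'cat /var/thanos/store/meta-syncer/*/meta.json' | \
  grep -E '"resolution"|"minTime"|"maxTime"' | head -20
# thanos.downsample.resolution is in milliseconds: 0 = raw, 300000 = 5m, 3600000 = 1h.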
Step 5: Compare trends over time¶
Week-over-week comparison¶
Run these as two separate queries (A and B) in the same Explore panel:
avg(rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m]))
avg(rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m]) offset 1w)
Set time range to Last 7 days.
Result: Current week and last week CPU usage side-by-side.
Month-over-month growth¶
sum(container_memory_working_set_bytes{namespace="hello-kup6s"})
/
sum(container_memory_working_set_bytes{namespace="hello-kup6s"} offset 30d)
Result: Memory usage as a ratio of 30 days ago. >1 means growth, <1 means reduction.
Year-over-year seasonality¶
For clusters running >1 year:
avg_over_time(
sum(rate(container_cpu_usage_seconds_total{namespace!~"kube-system|monitoring"}[1h]))
[1d:1h]
)
Set time range to Last 2 years.
Result: Daily CPU average over 2 years. Uses 1-hour downsampled data from S3.
Step 6: Performance and cost optimization¶
Query performance by time range¶
| Time Range | Data Source | Resolution | Query Speed |
|---|---|---|---|
| Last 3 hours | Prometheus | Raw (15s) | Very fast |
| Last 3 days | Prometheus | Raw (15s) | Fast |
| 3-30 days | S3 (Thanos Store) | Raw (15s) | Medium |
| 30-180 days | S3 (Thanos Store) | 5-min downsampled | Fast |
| 180+ days | S3 (Thanos Store) | 1-hour downsampled | Very fast |
Tip: Downsampled data is faster to query because there are fewer data points!
Optimize long-term queries¶
Instead of querying raw data over 6 months:
# Slower - queries millions of raw samples
rate(container_cpu_usage_seconds_total[5m]) # Last 180 days
Use appropriate aggregation:
# Faster - uses downsampled data
avg_over_time(
rate(container_cpu_usage_seconds_total[5m])[1h:5m]
)
Result: 1-hour average rates instead of every 5-minute sample. Much faster!
Use recording rules for frequent queries¶
For dashboards querying the same expensive PromQL repeatedly, create a recording rule:
# In Prometheus rules
- record: namespace:container_cpu:avg_rate5m
  expr: |
    avg(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
Then query:
namespace:container_cpu:avg_rate5m{namespace="hello-kup6s"}
Benefit: Pre-computed metric, much faster queries!
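With kube-prometheus-stack, recording rules are typically delivered as a PrometheusRule resource. A minimal sketch (the resource name and the release label are assumptions; they must match your Prometheus ruleSelector):
kubectl apply -n monitoring -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hello-kup6s-recording-rules   # hypothetical name
  labels:
    release: kube-prometheus-stack    # must match the Prometheus ruleSelector
spec:
  groups:
    - name: hello-kup6s.rules
      rules:
        - record: namespace:container_cpu:avg_rate5m
          expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
EOF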
Step 7: Identify gaps and missing data¶
Check for data gaps¶
absent_over_time(
up{job="hello-kup6s"}[1h]
)
Result: Shows when the up metric was missing (scrape failures or downtime).
Verify metric exists in historical range¶
count_over_time(
container_memory_working_set_bytes{namespace="hello-kup6s"}[30d]
)
Result: How many samples exist in last 30 days. If zero, metric wasn’t collected yet.
Find when a metric first appeared¶
timestamp(
container_memory_working_set_bytes{namespace="hello-kup6s"}
)
Set time range to Last 2 years, graph it.
Result: Metric starts appearing when pods were first deployed.
Step 8: Common use cases¶
Capacity planning: Storage growth¶
predict_linear(
  sum(kubelet_volume_stats_used_bytes{namespace="hello-kup6s"})[30d:1h],
  86400 * 90  # 90 days in seconds
)
Result: Predicted storage usage in 90 days based on last 30 days trend.
Performance baseline: 95th percentile latency¶
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{namespace="hello-kup6s"}[5m])
)
Set time range to Last 90 days.
Result: 95th percentile latency over 3 months. Helps establish SLOs.
Cost analysis: Resource waste¶
sum by (pod) (
  kube_pod_container_resource_requests{namespace="hello-kup6s", resource="memory"}
)
-
sum by (pod) (
  avg_over_time(container_memory_working_set_bytes{namespace="hello-kup6s", container!=""}[1d])
)
Result: How much requested memory is unused. Helps right-size requests.
Incident investigation: What changed?¶
After an incident on Oct 15, 2025:
delta(
container_memory_working_set_bytes{namespace="hello-kup6s"}[1h]
)
Set time range to Oct 15 10:00 - Oct 15 16:00.
Result: Shows memory spikes during incident window.
SLO tracking: Uptime over 30 days¶
avg_over_time(up{job="hello-kup6s"}[30d]) * 100
Result: Percentage uptime over last 30 days. Compare against SLO (e.g., 99.9%).
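A quick way to turn that into an error-budget check: with a 99.9% SLO, the allowed downtime over 30 days is (1 - 0.999) * 30 * 24 * 60, about 43 minutes. The query below is a sketch that reports how much of that budget is spent; values above 1 mean the budget is exhausted.
(1 - avg_over_time(up{job="hello-kup6s"}[30d])) / (1 - 0.999)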
Step 9: Troubleshooting¶
“No data” for historical queries¶
Check if Thanos Store has the data:
kubectl exec -n monitoring thanos-store-0 -- \
ls -lh /var/thanos/store/meta-syncer/
If no directories, Thanos hasn’t loaded S3 blocks yet. Wait a few minutes.
Check Thanos Store logs:
kubectl logs -n monitoring thanos-store-0 --tail=30
Look for “loaded new block” messages.
Query timeout on long time ranges¶
Symptom: Grafana shows “Timeout” error when querying 1+ years.
Fix: Use downsampled queries or recording rules (see Step 6).
Alternative: Increase query timeout in Grafana datasource settings.
Gaps in historical data¶
Cause: Thanos sidecar was just enabled. Historical data before enablement doesn’t exist in S3.
Explanation: Thanos only uploads blocks going forward from when it was enabled. It doesn’t backfill old data.
Solution: Wait for time to pass. In 30 days, you’ll have 30 days of history!
“Store gateway not found” errors¶
Check Thanos Store pods:
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store
Both replicas should be listed as Running and Ready.
If not running, check events:
kubectl describe pod -n monitoring thanos-store-0
Metrics look different after 30+ days¶
Expected behavior! After 30 days, raw data is deleted and only 5-minute downsampled data remains.
Example:
Day 1-30: Raw 15s resolution
Day 31-180: 5-minute resolution (averaged)
Day 181+: 1-hour resolution (averaged)
Impact: Long-term queries are averages, not exact values. This is intentional for cost savings!
Step 10: Advanced: Directly query Thanos components¶
Query Thanos Query directly (bypass Grafana)¶
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
Then open browser: http://localhost:9090
Use case: Debugging Thanos federation, seeing which stores contribute to a query.
Query Prometheus directly (bypass Thanos)¶
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9091:9090
Then open browser: http://localhost:9091
Use case: Compare Prometheus local data vs Thanos federated data.
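With both port-forwards running, a quick side-by-side check (a sketch, assuming GNU date) is to send the same instant query to each endpoint at a timestamp older than the 3-day local retention; Prometheus should return an empty result while Thanos returns data from S3:
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up{job="hello-kup6s"}' \
  --data-urlencode "time=$(date -d '7 days ago' +%s)" | python3 -m json.tool
curl -sG http://localhost:9091/api/v1/query \
  --data-urlencode 'query=up{job="hello-kup6s"}' \
  --data-urlencode "time=$(date -d '7 days ago' +%s)" | python3 -m json.tool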
Check Thanos Compactor activity¶
kubectl logs -n monitoring thanos-compactor-0 --tail=50 | \
grep -E "(compact|downsample)"
Look for:
“start first pass of downsampling” (creating 5-min blocks)
“start second pass of downsampling” (creating 1-hour blocks)
“compaction iterations done”
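The Compactor also exposes its own metrics over HTTP (port 10902 by default); thanos_compact_halted staying at 0 is a quick health check. A sketch:
kubectl port-forward -n monitoring thanos-compactor-0 10902:10902 &
sleep 2  # give the port-forward a moment to establish
curl -s http://localhost:10902/metrics | grep '^thanos_compact_halted'
# 0 means the Compactor is healthy; 1 means it halted (check the logs above for the reason).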
Inspect S3 bucket directly¶
Via AWS CLI with Hetzner S3:
export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>
aws s3 ls s3://metrics-thanos-kup6s/ \
--endpoint-url https://fsn1.your-objectstorage.com
Result: Shows all blocks uploaded to S3, including downsampled blocks.
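To gauge how much the bucket actually holds (and what downsampling saves), the AWS CLI can summarize the total object count and size, using the same credentials and endpoint as above:
aws s3 ls s3://metrics-thanos-kup6s/ --recursive --summarize --human-readable \
  --endpoint-url https://fsn1.your-objectstorage.com | tail -3
# The last lines report "Total Objects" and "Total Size" for the whole bucket.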
Thanos retention summary¶
| Data Age | Resolution | Storage Location | Retention |
|---|---|---|---|
| 0-3 days | Raw (15s) | Prometheus local | 3 days |
| 0-30 days | Raw (15s) | S3 (Hetzner fsn1) | 30 days |
| 30-180 days | 5-minute avg | S3 (Hetzner fsn1) | 180 days |
| 180-730 days | 1-hour avg | S3 (Hetzner fsn1) | 730 days (2 years) |
| 730+ days | - | Deleted | - |
Cost optimization: Downsampling reduces storage costs by 99%+ for old data!
Example: 1 year of raw 15s data = ~2 TB. 1 year of 1-hour downsampled data = ~20 GB.
Next steps¶
Create alerts - Alert on metric thresholds
Access Grafana - Build dashboards
Query Loki logs - Correlate metrics with logs