How-To Guide

Query Historical Metrics with Thanos

Goal: Query both recent and historical Prometheus metrics using Thanos for long-term trend analysis.

Time: ~20 minutes

Prerequisites

  • Access to Grafana UI

  • Basic understanding of PromQL (Prometheus Query Language)

  • Thanos components deployed (already enabled in cluster)

Quick reference

| Task | Where to Query | Notes |
|---|---|---|
| Recent metrics (< 3 days) | Grafana → Prometheus datasource | Served from Prometheus local storage |
| Historical metrics (> 3 days) | Grafana → Prometheus datasource | Served from S3 via Thanos Store |
| Query across all time ranges | Grafana → Prometheus datasource | Thanos Query federates both sources |
| Verify Thanos components | kubectl get pods -n monitoring | Check Query, Store, Compactor |
| Check S3 blocks | kubectl exec thanos-store-0 -- ls /var/thanos/store/meta-syncer/ | See loaded blocks |

Understanding Thanos Architecture

Thanos extends Prometheus with tiered long-term storage in S3:

  • Prometheus local storage: 3 days retention (fast queries)

  • S3 raw data: 30 days (full resolution)

  • S3 downsampled (5-min): 180 days (~6 months)

  • S3 downsampled (1-hour): 730 days (2 years)
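
To confirm the tiers configured in your cluster, you can read the retention flags off the running Compactor. This is a hedged sketch: the flag names are standard Thanos Compactor flags, but the pod name and the configured values are assumptions based on this guide's other examples.

# Retention flags on the Compactor enforce the raw / 5m / 1h tiers listed above.
# Pod name as used elsewhere in this guide; your configured values may differ.
kubectl get pod -n monitoring thanos-compactor-0 -o yaml | \
  grep -E "retention\.resolution-(raw|5m|1h)"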

The Grafana Prometheus datasource is configured to use Thanos Query, which automatically:

  • Routes recent queries to Prometheus sidecars

  • Routes historical queries to Thanos Store (S3)

  • Deduplicates metrics from replicas

You don’t need to change anything - all existing dashboards and queries work transparently!

Step 1: Access Grafana Explore

  1. Open Grafana: https://grafana.ops.kup6s.net

  2. Login with admin credentials

  3. Click Explore (compass icon) in left sidebar

  4. Select Prometheus from data source dropdown (top)

Note: Even though it says “Prometheus”, you’re actually querying through Thanos Query!

Step 2: Query recent metrics (last 3 days)

Current CPU usage

rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m])

Result: CPU usage rate over last 5 minutes. Served from Prometheus local storage.

Memory usage right now

container_memory_working_set_bytes{namespace="hello-kup6s"}

Result: Current memory usage. Fast query from Prometheus.

Pod restarts in last hour

increase(kube_pod_container_status_restarts_total{namespace="hello-kup6s"}[1h])

Result: How many times pods restarted in last hour.

Step 3: Query historical metrics (older than 3 days)

CPU usage trend over 30 days

  1. Click time picker (top right)

  2. Select Last 30 days

  3. Query:

    avg(rate(container_cpu_usage_seconds_total{namespace="hello-kup6s"}[5m]))
    

Result: 30-day CPU trend. Recent data from Prometheus, historical from S3!

Notice: You’re querying 30 days of data, but Prometheus only keeps 3 days locally. Thanos Store fetches the rest from S3 transparently.

Memory growth over 6 months

avg(container_memory_working_set_bytes{namespace="hello-kup6s"}) by (pod)

Set time range to Last 180 days (6 months).

Result: Long-term memory trend. Uses 5-minute downsampled data from S3.

Long-term capacity planning

max(
  sum(container_memory_working_set_bytes{namespace!~"kube-system|monitoring"})
  /
  sum(kube_node_status_allocatable{resource="memory"})
) * 100

Set time range to Last 1 year.

Result: Cluster memory utilization percentage over a full year. Uses 1-hour downsampled data from S3.

Step 4: Verify Thanos is serving your queries

Check Thanos Query stores

From terminal:

kubectl exec -n monitoring deploy/thanos-query -- \
  curl -s localhost:9090/api/v1/stores | python3 -m json.tool

Look for:

  • "sidecar" entries: Prometheus replicas (real-time data)

  • "store" entries: Thanos Store gateways (S3 historical data)

Each store shows:

  • minTime / maxTime: What time range it covers

  • labelSets: What metrics it has

Example output

{
  "sidecar": [
    {
      "name": "10.42.1.132:10901",
      "minTime": 1761699349000,  // 3 days ago
      "maxTime": 9223372036854775807,  // "now"
      "labelSets": [{"cluster": "kup6s", "replica": "prometheus-...-1"}]
    }
  ],
  "store": [
    {
      "name": "10.42.1.43:10901",
      "minTime": 1761717600082,  // 12 hours ago
      "maxTime": 1761760800000,  // 7 hours ago
      "labelSets": [{"cluster": "kup6s"}]
    }
  ]
}

Interpretation:

  • Sidecar covers last 3 days up to “now” (real-time)

  • Store covers historical blocks uploaded to S3

  • Thanos Query federates both automatically!
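
If you want a more compact view than the raw JSON above, the same endpoint can be summarized locally. A sketch, assuming jq is installed on your workstation (timestamps are Unix milliseconds):

# One line per store: type, address, and the time range it covers.
# "(.data // .)" handles both the wrapped and unwrapped response shapes.
kubectl exec -n monitoring deploy/thanos-query -- \
  curl -s localhost:9090/api/v1/stores | \
  jq -r '(.data // .) | to_entries[] | .key as $type | .value[] |
         [$type, .name, .minTime, .maxTime] | @tsv'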

Check S3 blocks loaded

kubectl exec -n monitoring thanos-store-0 -- \
  ls -lh /var/thanos/store/meta-syncer/ | head -20

Result: Directories named with block ULIDs (unique IDs). Each corresponds to a TSDB block uploaded from Prometheus (initially covering 2 hours of metrics; the Compactor later merges them into larger blocks).

Step 5: Performance and cost optimization

Query performance by time range

| Time Range | Data Source | Resolution | Query Speed |
|---|---|---|---|
| Last 3 hours | Prometheus | Raw (15s) | Very fast |
| Last 3 days | Prometheus | Raw (15s) | Fast |
| 3-30 days | S3 (Thanos Store) | Raw (15s) | Medium |
| 30-180 days | S3 (Thanos Store) | 5-min downsampled | Fast |
| 180+ days | S3 (Thanos Store) | 1-hour downsampled | Very fast |

Tip: Downsampled data is faster to query because there are fewer data points!
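
If you want to see the effect of downsampling yourself, Thanos Query accepts a max_source_resolution parameter on its query API. A hedged sketch (the dates and step are illustrative, and parameter behavior may vary slightly across Thanos versions):

# Ask Thanos Query for a 1-year range while explicitly allowing 1-hour downsampled blocks.
kubectl exec -n monitoring deploy/thanos-query -- \
  curl -s localhost:9090/api/v1/query_range \
  --data-urlencode 'query=avg(container_memory_working_set_bytes{namespace="hello-kup6s"})' \
  --data-urlencode 'start=2024-11-01T00:00:00Z' \
  --data-urlencode 'end=2025-11-01T00:00:00Z' \
  --data-urlencode 'step=6h' \
  --data-urlencode 'max_source_resolution=1h'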

Optimize long-term queries

Instead of querying raw data over 6 months:

# Slower - queries millions of raw samples
rate(container_cpu_usage_seconds_total[5m])  # Last 180 days

Use appropriate aggregation:

# Faster - uses downsampled data
avg_over_time(
  rate(container_cpu_usage_seconds_total[5m])[1h:5m]
)

Result: 1-hour average rates instead of every 5-minute sample. Much faster!

Use recording rules for frequent queries

For dashboards querying the same expensive PromQL repeatedly, create a recording rule:

# In Prometheus rules
- record: namespace:container_cpu:avg_rate5m
  expr: |
    avg(rate(container_cpu_usage_seconds_total[5m])) by (namespace)

Then query:

namespace:container_cpu:avg_rate5m{namespace="hello-kup6s"}

Benefit: Pre-computed metric, much faster queries!
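
With kube-prometheus-stack, recording rules are typically delivered as a PrometheusRule resource rather than edited directly in Prometheus config. A hedged sketch; the release: kube-prometheus-stack label is an assumption about your ruleSelector, so check your Helm values:

# Creates the recording rule above as a PrometheusRule object.
# The "release" label must match your Prometheus ruleSelector (assumption).
cat <<'EOF' | kubectl apply -n monitoring -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: namespace-cpu-recording-rule
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: namespace-cpu.rules
      rules:
        - record: namespace:container_cpu:avg_rate5m
          expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
EOF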

Step 6: Identify gaps and missing data

Check for data gaps

absent_over_time(
  up{job="hello-kup6s"}[1h]
)

Result: Returns 1 at times when no up sample was seen in the preceding hour (scrape failures or downtime); returns nothing where data exists.

Verify metric exists in historical range

count_over_time(
  container_memory_working_set_bytes{namespace="hello-kup6s"}[30d]
)

Result: How many samples exist in last 30 days. If zero, metric wasn’t collected yet.

Find when a metric first appeared

timestamp(
  container_memory_working_set_bytes{namespace="hello-kup6s"}
)

Set time range to Last 2 years, graph it.

Result: Metric starts appearing when pods were first deployed.

Step 7: Common use cases

Capacity planning: Storage growth

predict_linear(
  sum(kubelet_volume_stats_used_bytes{namespace="hello-kup6s"})[30d:1h],
  86400 * 90  # 90 days in seconds
)

Result: Predicted storage usage in 90 days based on last 30 days trend.

Performance baseline: 95th percentile latency

histogram_quantile(0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{namespace="hello-kup6s"}[5m])
  )
)

Set time range to Last 90 days.

Result: 95th percentile latency over 3 months. Helps establish SLOs.

Cost analysis: Resource waste

sum by (pod) (
  kube_pod_container_resource_requests{namespace="hello-kup6s", resource="memory"}
)
-
sum by (pod) (
  avg_over_time(container_memory_working_set_bytes{namespace="hello-kup6s", container!=""}[1d])
)

Result: How much requested memory is unused. Helps right-size requests.

Incident investigation: What changed?

After an incident on Oct 15, 2025:

delta(
  container_memory_working_set_bytes{namespace="hello-kup6s"}[1h]
)

Set time range to Oct 15 10:00 - Oct 15 16:00.

Result: Shows memory spikes during incident window.

SLO tracking: Uptime over 30 days

avg_over_time(up{job="hello-kup6s"}[30d]) * 100

Result: Percentage uptime over last 30 days. Compare against SLO (e.g., 99.9%).
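
If you track this against a 99.9% availability SLO, a small follow-up expression (a sketch, reusing the same up-based signal) shows how much of the 30-day error budget is left:

# Fraction of the 30-day error budget remaining for a 99.9% SLO:
# 1 = untouched, 0 = exhausted, negative = SLO violated.
(avg_over_time(up{job="hello-kup6s"}[30d]) - 0.999) / (1 - 0.999)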

Step 8: Troubleshooting

“No data” for historical queries

Check if Thanos Store has the data:

kubectl exec -n monitoring thanos-store-0 -- \
  ls -lh /var/thanos/store/meta-syncer/

If no directories, Thanos hasn’t loaded S3 blocks yet. Wait a few minutes.

Check Thanos Store logs:

kubectl logs -n monitoring thanos-store-0 --tail=30

Look for “loaded new block” messages.
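
If blocks still never appear, check whether the store gateway can reach the S3 bucket at all. A hedged sketch (port 10902 is the default Thanos HTTP port; adjust if your deployment overrides it):

# Non-zero, growing counters here indicate failing object-storage operations
# (bad credentials, wrong endpoint, network policy, etc.).
kubectl exec -n monitoring thanos-store-0 -- \
  curl -s localhost:10902/metrics | grep thanos_objstore_bucket_operation_failures_total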

Query timeout on long time ranges

Symptom: Grafana shows “Timeout” error when querying 1+ years.

Fix: Use downsampled queries or recording rules (see Step 5).

Alternative: Increase query timeout in Grafana datasource settings.

Gaps in historical data

Cause: Thanos sidecar was just enabled. Historical data before enablement doesn’t exist in S3.

Explanation: Thanos only uploads blocks going forward from when it was enabled. It doesn’t backfill old data.

Solution: Wait for time to pass. In 30 days, you’ll have 30 days of history!

“Store gateway not found” errors

Check Thanos Store pods:

kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store

Both store replicas should show as Running and Ready.

If not running, check events:

kubectl describe pod -n monitoring thanos-store-0

Metrics look different after 30+ days

Expected behavior! After 30 days, raw data is deleted and only 5-minute downsampled data remains.

Example:

  • Day 1-30: Raw 15s resolution

  • Day 31-180: 5-minute resolution (averaged)

  • Day 181+: 1-hour resolution (averaged)

Impact: Long-term queries return downsampled aggregates rather than every raw sample. This is intentional for cost savings!

Step 9: Advanced: Directly query Thanos components

Query Thanos Query directly (bypass Grafana)

kubectl port-forward -n monitoring svc/thanos-query 9090:9090

Then open browser: http://localhost:9090

Use case: Debugging Thanos federation, seeing which stores contribute to a query.

Query Prometheus directly (bypass Thanos)

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9091:9090

Then open browser: http://localhost:9091

Use case: Compare Prometheus local data vs Thanos federated data.
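
With both port-forwards running, you can make the difference concrete by asking each endpoint for a sample from beyond Prometheus's 3-day retention. A sketch (GNU date syntax; on macOS use date -u -v-10d +%s):

# Prometheus should return an empty result; Thanos should return data served from S3.
TEN_DAYS_AGO=$(date -u -d '10 days ago' +%s)
echo "Prometheus:"; curl -s "http://localhost:9091/api/v1/query?query=up&time=${TEN_DAYS_AGO}"
echo "Thanos:";     curl -s "http://localhost:9090/api/v1/query?query=up&time=${TEN_DAYS_AGO}"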

Check Thanos Compactor activity

kubectl logs -n monitoring thanos-compactor-0 --tail=50 | \
  grep -E "(compact|downsample)"

Look for:

  • “start first pass of downsampling” (creating 5-min blocks)

  • “start second pass of downsampling” (creating 1-hour blocks)

  • “compaction iterations done”
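
You can also check whether the Compactor has halted, which stops both compaction and downsampling. A hedged sketch using its metrics endpoint (port 10902 is the Thanos default):

# thanos_compact_halted = 1 means the Compactor stopped due to an error
# (e.g. overlapping blocks) and needs manual attention.
kubectl exec -n monitoring thanos-compactor-0 -- \
  curl -s localhost:10902/metrics | grep thanos_compact_halted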

Inspect S3 bucket directly

Via AWS CLI with Hetzner S3:

export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>

aws s3 ls s3://metrics-thanos-kup6s/ \
  --endpoint-url https://fsn1.your-objectstorage.com

Result: Shows all blocks uploaded to S3, including downsampled blocks.
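
The thanos binary inside the store or compactor pods can also inspect the bucket directly, which is handy for confirming which blocks are raw vs downsampled. A hedged sketch; the objstore config path is an assumption, so check the pod's --objstore.config-file argument for the real one:

# Prints a table of blocks with their time range, resolution, and compaction level.
# The config file path below is an assumption; check the pod spec for the real one.
kubectl exec -n monitoring thanos-store-0 -- \
  thanos tools bucket inspect --objstore.config-file=/etc/thanos/objstore.yml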

Thanos retention summary

| Data Age | Resolution | Storage Location | Retention |
|---|---|---|---|
| 0-3 days | Raw (15s) | Prometheus local | 3 days |
| 0-30 days | Raw (15s) | S3 (Hetzner fsn1) | 30 days |
| 30-180 days | 5-minute avg | S3 (Hetzner fsn1) | 180 days |
| 180-730 days | 1-hour avg | S3 (Hetzner fsn1) | 730 days (2 years) |
| 730+ days | - | Deleted | - |

Cost optimization: Downsampling reduces storage costs by 99%+ for old data!

Example: 1 year of raw 15s data = ~2 TB. 1 year of 1-hour downsampled data = ~20 GB.
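
A rough back-of-envelope behind those figures, assuming roughly 500,000 active series and ~1.5 bytes per compressed sample (both assumptions; your cluster will differ):

# raw 15s:  500,000 series x (86,400/15) samples/day x 365 days x 1.5 B        ≈ 1.6 TB
# 1h tier:  500,000 series x 24 windows/day x 365 days x 5 aggregates x 1.5 B  ≈ 33 GB
# (each downsampled window stores 5 aggregates: count, sum, min, max, counter)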

Next steps