How-To Guide

Debug Monitoring Issues

A systematic approach to debugging problems in the monitoring stack.

Debugging Workflow

1. Identify the symptom
2. Gather evidence (logs, metrics, events)
3. Form hypothesis
4. Test hypothesis
5. Apply fix
6. Verify resolution
7. Document root cause

Quick Diagnostic Commands

Pod Status

# All monitoring pods
kubectl get pods -n monitoring

# Specific component
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus

# Pod details
kubectl describe pod <pod-name> -n monitoring

# Pod logs
kubectl logs -n monitoring <pod-name> --tail=100

# Previous container logs (after crash)
kubectl logs -n monitoring <pod-name> --previous

# Logs from specific container in pod
kubectl logs -n monitoring <pod-name> -c <container-name>

Resource Usage

# Current resource usage
kubectl top pods -n monitoring

# Sort by memory
kubectl top pods -n monitoring --sort-by=memory

# Sort by CPU
kubectl top pods -n monitoring --sort-by=cpu

# Node resources
kubectl top nodes

Events

# Recent events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20

# Warning events only
kubectl get events -n monitoring --field-selector type=Warning

# Events for specific pod
kubectl get events -n monitoring --field-selector involvedObject.name=<pod-name>

Storage

# PVC status
kubectl get pvc -n monitoring

# PVC usage (from pod)
kubectl exec -n monitoring <pod-name> -- df -h

# Longhorn volumes
kubectl get volumes -n longhorn-system | grep monitoring

Services and Endpoints

# Services
kubectl get svc -n monitoring

# Endpoints
kubectl get endpoints -n monitoring

# Service details
kubectl describe svc <service-name> -n monitoring

Debugging by Symptom

Symptom: Grafana Dashboards Show “No Data”

Step 1: Identify Datasource

Check which datasource (Prometheus/Thanos or Loki):

# Port-forward to Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# Open: http://localhost:3000/datasources
# Click on datasource showing error
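
If the Grafana admin credentials are at hand, the datasource list can also be pulled from Grafana's HTTP API instead of the UI. A minimal sketch, assuming the chart-default secret name kube-prometheus-stack-grafana and that jq is installed locally:

# Grafana admin password (adjust the secret name if your install customizes it)
GRAFANA_PW=$(kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d)

# List configured datasources (port-forward from above still running)
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/datasources | jq '.[] | {name, type, url}'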

Step 2: Test Datasource Connection

For Prometheus/Thanos:

# Test from Grafana pod
kubectl exec -n monitoring <grafana-pod> -- \
  wget -O- http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy

# Should return: "Thanos is Healthy"

For Loki:

kubectl exec -n monitoring <grafana-pod> -- \
  wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready

# Should return: "ready"

Step 3: Check Data Exists

Prometheus/Thanos:

kubectl port-forward -n monitoring svc/thanos-query 9090:9090

# Open: http://localhost:9090/graph
# Query: up{job="kubernetes-nodes"}
# Should return data
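
The same check works without a browser, since Thanos Query exposes the Prometheus-compatible HTTP query API. A sketch, assuming the port-forward above is still running and jq is available:

# Count series returned for "up"; zero means no data is reaching Thanos Query
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'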

Loki:

kubectl port-forward -n monitoring svc/loki-gateway 3100:80

# Test labels endpoint
curl http://localhost:3100/loki/api/v1/labels

# Should return list of labels

Step 4: Common Causes

1. Service DNS Resolution Failure

# From Grafana pod
kubectl exec -n monitoring <grafana-pod> -- \
  nslookup thanos-query.monitoring.svc.cluster.local

# If fails, check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

2. Service Not Running

kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query
# Should show running pods

3. Firewall/Network Policy

# Check if network policies exist
kubectl get networkpolicy -n monitoring

# Test direct connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
  curl http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy

Symptom: Prometheus Not Scraping Metrics

Step 1: Check Target Status

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Open: http://localhost:9090/targets
# Look for targets showing "DOWN"
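
The target list can also be inspected from the terminal through the Prometheus targets API. A sketch that prints only unhealthy targets, assuming jq is installed locally:

# Show job, scrape URL, and last error for every target that is not "up"
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, scrapeUrl, lastError}'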

Step 2: Check ServiceMonitor

# List all ServiceMonitors
kubectl get servicemonitor -A

# Check if specific ServiceMonitor exists
kubectl get servicemonitor <name> -n <namespace> -o yaml

# Check labels match Prometheus selector
kubectl get prometheus -n monitoring kube-prometheus-stack-prometheus -o yaml | grep serviceMonitorSelector -A 5

Step 3: Test Target Reachability

# From Prometheus pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -O- http://<target-service>.<namespace>.svc.cluster.local:<port>/metrics

# Should return Prometheus metrics

Step 4: Check Prometheus Logs

kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 | grep -i error

Common Causes

1. ServiceMonitor Label Mismatch

# ServiceMonitor needs this label
metadata:
  labels:
    release: kube-prometheus-stack  # Required!
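
A quick way to confirm the label is actually present on the ServiceMonitor (substitute your own name and namespace):

# Print the ServiceMonitor's labels; release: kube-prometheus-stack must appear
kubectl get servicemonitor <name> -n <namespace> -o jsonpath='{.metadata.labels}'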

2. Wrong Service Port Name

# ServiceMonitor references port name
spec:
  endpoints:
    - port: metrics  # Must match service port name
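
Compare that value against the port names actually defined on the target Service:

# List named ports on the Service; the ServiceMonitor's "port" must match one of these names
kubectl get svc <target-service> -n <namespace> \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}'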

3. TLS/Authentication Issues

# Check if target requires authentication
kubectl describe servicemonitor <name> -n <namespace>

Symptom: Loki Not Receiving Logs

Step 1: Check Alloy Pods

# All Alloy pods running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=50

Step 2: Verify Log Flow

Test Alloy → Loki connection:

# From Alloy pod
kubectl exec -n monitoring <alloy-pod> -- \
  wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready

# Should return: "ready"

Step 3: Check Loki Write Pods

# Loki Write running?
kubectl get pods -n monitoring -l app.kubernetes.io/component=write

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/component=write | grep -i error

Step 4: Verify Logs Arriving

kubectl port-forward -n monitoring svc/loki-gateway 3100:80

# Check labels (should show active labels)
curl http://localhost:3100/loki/api/v1/labels

# Query recent logs
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="monitoring"}' \
  --data-urlencode 'limit=10'
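
If labels and queries stay empty, pushing a synthetic log line directly to the gateway isolates whether ingestion itself works, independent of Alloy. A sketch against Loki's push API, assuming GNU date and that multi-tenancy is disabled (otherwise an X-Scope-OrgID header is required); the job label is arbitrary:

# Push one test line with the current timestamp in nanoseconds; an empty 204 response means it was accepted
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H 'Content-Type: application/json' \
  -d "{\"streams\":[{\"stream\":{\"job\":\"debug-test\"},\"values\":[[\"$(date +%s%N)\",\"manual test line\"]]}]}"

# Query it back
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="debug-test"}' \
  --data-urlencode 'limit=5'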

Common Causes

1. Alloy Configuration Error

# Check Alloy config
kubectl get configmap alloy-config -n monitoring -o yaml

# Verify Loki endpoint correct:
# url = "http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push"
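
To confirm the endpoint without reading the whole config, grep the rendered ConfigMap for the push URL:

kubectl get configmap alloy-config -n monitoring -o yaml | grep 'loki/api/v1/push'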

2. Loki Write Out of Memory

kubectl get events -n monitoring | grep OOM | grep loki-write

3. S3 Credentials Missing

kubectl get secret loki-s3-config -n monitoring
kubectl describe externalsecret loki-s3-credentials-es -n monitoring

Symptom: Thanos Query Not Showing Historical Data

Step 1: Check Thanos Query Stores

kubectl port-forward -n monitoring svc/thanos-query 9090:9090

# Open: http://localhost:9090/stores
# Should show:
# - Prometheus sidecars (sidecar)
# - Thanos Store gateways (store)

Expected output:

{
  "stores": [
    {"name": "prometheus-0", "lastCheck": "...", "type": "sidecar"},
    {"name": "prometheus-1", "lastCheck": "...", "type": "sidecar"},
    {"name": "thanos-store-0", "lastCheck": "...", "type": "store"},
    {"name": "thanos-store-1", "lastCheck": "...", "type": "store"}
  ]
}
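
The same information is available as JSON from Thanos Query's store API, which is easier to attach to a ticket:

# Machine-readable view of all registered store endpoints
curl -s http://localhost:9090/api/v1/stores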

Step 2: Check Thanos Store Pods

# Pods running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store

# Check logs
kubectl logs -n monitoring thanos-store-0 | grep -E "error|block"

Step 3: Verify S3 Blocks Exist

# From Thanos Store pod
kubectl exec -n monitoring thanos-store-0 -- \
  ls -lh /var/thanos/store/

# Should show downloaded block metadata

Check S3 directly (if AWS CLI available):

aws s3 ls s3://metrics-thanos-kup6s/ --recursive | head -20
# Should show block directories (ULIDs)

Step 4: Check Thanos Sidecar Uploads

# From Prometheus pod
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c thanos-sidecar | grep -i upload

# Should show successful uploads every 2 hours

Common Causes

1. Thanos Store Can’t Reach S3

kubectl logs -n monitoring thanos-store-0 | grep -i "s3\|error"
# Look for: "access denied", "connection refused"

2. S3 Credentials Invalid

kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d
# Verify access_key and secret_key

3. No Historical Data Yet

# Check cluster uptime
kubectl get nodes -o jsonpath='{.items[0].metadata.creationTimestamp}'

# Prometheus local retention is 3 days; older data comes only from S3,
# so a cluster younger than that has no historical data to show yet

Debugging Tools and Techniques

Interactive Debugging Pod

Launch debug pod:

kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash

Useful tools included:

  • curl - HTTP requests

  • wget - Download files

  • dig - DNS queries

  • nslookup - DNS resolution

  • ping - Network connectivity

  • traceroute - Network path

  • netstat - Network connections

  • tcpdump - Packet capture

Examples:

# Test Thanos Query
curl http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy

# Test Loki
curl http://loki-gateway.monitoring.svc.cluster.local/ready

# DNS resolution
dig thanos-query.monitoring.svc.cluster.local

# DNS SRV records
dig SRV _grpc._tcp.thanos-store.monitoring.svc.cluster.local

Port Forwarding for Local Access

Forward Prometheus:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Access: http://localhost:9090

Forward Grafana:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Access: http://localhost:3000

Forward Thanos Query:

kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Access: http://localhost:9090

Forward Loki:

kubectl port-forward -n monitoring svc/loki-gateway 3100:80
# Access: http://localhost:3100

Exec into Running Pods

# Prometheus
kubectl exec -it -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- sh

# Thanos Store
kubectl exec -it -n monitoring thanos-store-0 -- sh

# Loki Write
kubectl exec -it -n monitoring loki-write-0 -- sh

Useful commands inside pod:

# Check disk usage
df -h

# Check environment variables
env | grep -i s3

# Check process
ps aux

# Check open ports
netstat -tulpn

# Test outbound connectivity
wget -O- https://fsn1.your-objectstorage.com

Viewing Configuration

HelmChart resources:

kubectl get helmchart -n monitoring -o yaml

Extract Helm values:

kubectl get helmchart kube-prometheus-stack -n monitoring \
  -o jsonpath='{.spec.valuesContent}' > /tmp/prometheus-values.yaml

ConfigMaps:

# Alloy config
kubectl get configmap alloy-config -n monitoring -o yaml

# Prometheus rules
kubectl get configmap prometheus-kube-prometheus-stack-prometheus-rulefiles-0 -n monitoring -o yaml

Prometheus Debugging Queries

Check target health:

up{namespace="monitoring"}

Check scrape duration:

scrape_duration_seconds{namespace="monitoring"}

Check memory usage (MiB):

container_memory_working_set_bytes{namespace="monitoring"} / 1024 / 1024

Check CPU throttling:

rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1

Check Thanos sidecar uploads:

thanos_objstore_bucket_operations_total{operation="upload"}
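
Pair that with the corresponding failure counter to spot silently failing uploads (assumes the standard Thanos objstore metrics are scraped):

rate(thanos_objstore_bucket_operation_failures_total{operation="upload"}[1h]) > 0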

Check Loki ingestion rate:

rate(loki_distributor_lines_received_total[5m])

Common Issue Patterns

Pattern: Intermittent Failures

Symptoms:

  • Pods occasionally fail health checks

  • Queries sometimes timeout

  • Metrics sometimes missing

Debugging approach:

  1. Check for resource constraints:

    kubectl top pods -n monitoring
    # Look for pods near limit
    
  2. Check for node issues:

    kubectl get nodes
    kubectl describe node <node-name>
    # Look for: MemoryPressure, DiskPressure
    
  3. Check for network issues:

    kubectl get events -A | grep -i network
    
  4. Enable debug logging:

    # For Prometheus (increase log level)
    kubectl edit prometheus -n monitoring kube-prometheus-stack-prometheus
    # Under spec, add: logLevel: debug
    

Pattern: Cascading Failures

Symptoms:

  • One component fails, then others fail

  • Error messages reference downstream services

Debugging approach:

  1. Identify failure sequence:

    kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -50
    # Look for first failure
    
  2. Check dependencies:

    Grafana → Thanos Query → Prometheus / Thanos Store
    Grafana → Loki Gateway → Loki Write
    
  3. Fix root cause first, then check if others recover

Pattern: Gradual Degradation

Symptoms:

  • Performance slowly decreases over time

  • Queries getting slower

  • Memory usage increasing

Debugging approach:

  1. Check for resource leaks:

    # Memory growth over 24h (working set is a gauge, so use delta rather than rate)
    delta(container_memory_working_set_bytes{namespace="monitoring"}[24h])
    
  2. Check for growing datasets:

    # Prometheus TSDB size
    kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
      du -sh /prometheus/
    
    # PVC growth
    kubectl exec -n monitoring <pod> -- df -h
    
  3. Check for increasing cardinality (a per-metric breakdown is sketched after this list):

    # Prometheus series count
    prometheus_tsdb_head_series
    
  4. Apply fixes:

    • Increase resources

    • Add retention policies

    • Drop high-cardinality metrics

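For the cardinality check in step 3, Prometheus's TSDB status API breaks the head series count down per metric name. A minimal sketch, assuming the Prometheus port-forward from earlier and jq:

# Top metric names by series count in the TSDB head
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'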

Debugging Checklist

When stuck, work through this checklist (a scripted first pass is sketched after it):

Infrastructure Layer

  • [ ] All nodes healthy? (kubectl get nodes)

  • [ ] Sufficient cluster resources? (kubectl top nodes)

  • [ ] Longhorn healthy? (kubectl get pods -n longhorn-system)

  • [ ] Network connectivity? (test between pods)

Pod Layer

  • [ ] All pods running? (kubectl get pods -n monitoring)

  • [ ] No crash loops? (check restart count)

  • [ ] No OOM kills? (kubectl get events -n monitoring | grep OOM)

  • [ ] Resource limits not hit? (kubectl top pods -n monitoring)

Storage Layer

  • [ ] All PVCs bound? (kubectl get pvc -n monitoring)

  • [ ] PVCs not full? (check from pods)

  • [ ] S3 buckets exist? (kubectl get bucket -A)

  • [ ] S3 credentials valid? (check secrets)

Network Layer

  • [ ] Services exist? (kubectl get svc -n monitoring)

  • [ ] Endpoints populated? (kubectl get endpoints -n monitoring)

  • [ ] DNS resolving? (test from debug pod)

  • [ ] No network policies blocking? (kubectl get networkpolicy -n monitoring)

Application Layer

  • [ ] Config correct? (check ConfigMaps)

  • [ ] Secrets exist? (check ExternalSecrets)

  • [ ] Logs showing errors? (check pod logs)

  • [ ] Metrics collected? (query Prometheus)

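A first pass over this checklist can be scripted. A minimal sketch, assuming kubectl access to the cluster and that metrics-server is available for kubectl top:

#!/usr/bin/env bash
# Quick health pass over the infrastructure, pod, and storage layers

echo "== Nodes ==";                        kubectl get nodes
echo "== Node resources ==";               kubectl top nodes
echo "== Longhorn pods (not Running) =="
kubectl get pods -n longhorn-system | grep -v Running || true
echo "== Monitoring pods ==";              kubectl get pods -n monitoring
echo "== Warning events ==";               kubectl get events -n monitoring --field-selector type=Warning | tail -20
echo "== PVCs ==";                         kubectl get pvc -n monitoring
echo "== Pod resource usage ==";           kubectl top pods -n monitoring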

Escalation Path

When to Escalate

Escalate if:

  • Issue affects production workloads

  • Data loss risk

  • Unable to identify root cause after 1 hour

  • Need component expertise (Prometheus, Thanos, Loki)

Information to Gather Before Escalating

# Create debug bundle
DEBUG_DIR=/tmp/monitoring-debug-$(date +%Y%m%d-%H%M%S)
mkdir -p $DEBUG_DIR

# Pod status
kubectl get pods -n monitoring -o wide > $DEBUG_DIR/pods.txt

# Recent events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -100 > $DEBUG_DIR/events.txt

# Logs from failed pods
for pod in $(kubectl get pods -n monitoring --field-selector=status.phase!=Running -o name); do
  kubectl logs -n monitoring $pod --previous > $DEBUG_DIR/$(basename $pod)-previous.log 2>&1 || true
  kubectl logs -n monitoring $pod > $DEBUG_DIR/$(basename $pod).log 2>&1 || true
done

# Resource usage
kubectl top pods -n monitoring > $DEBUG_DIR/resource-usage.txt

# Describe failing pods
for pod in $(kubectl get pods -n monitoring --field-selector=status.phase!=Running -o name); do
  kubectl describe -n monitoring $pod > $DEBUG_DIR/$(basename $pod)-describe.txt
done

# Compress
tar czf $DEBUG_DIR.tar.gz -C /tmp $(basename $DEBUG_DIR)

Support Resources


See Also