Debug Monitoring Issues¶
Systematic approach to debugging monitoring stack problems.
Debugging Workflow¶
1. Identify the symptom
↓
2. Gather evidence (logs, metrics, events)
↓
3. Form hypothesis
↓
4. Test hypothesis
↓
5. Apply fix
↓
6. Verify resolution
↓
7. Document root cause
Quick Diagnostic Commands¶
Pod Status¶
# All monitoring pods
kubectl get pods -n monitoring
# Specific component
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
# Pod details
kubectl describe pod <pod-name> -n monitoring
# Pod logs
kubectl logs -n monitoring <pod-name> --tail=100
# Previous container logs (after crash)
kubectl logs -n monitoring <pod-name> --previous
# Logs from specific container in pod
kubectl logs -n monitoring <pod-name> -c <container-name>
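To spot crash-looping pods at a glance, sorting by restart count also helps (a small sketch; the [0] index assumes single-container pods):
# Pods ordered by restart count of their first container
kubectl get pods -n monitoring --sort-by='.status.containerStatuses[0].restartCount'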
Resource Usage¶
# Current resource usage
kubectl top pods -n monitoring
# Sort by memory
kubectl top pods -n monitoring --sort-by=memory
# Sort by CPU
kubectl top pods -n monitoring --sort-by=cpu
# Node resources
kubectl top nodes
Events¶
# Recent events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20
# Warning events only
kubectl get events -n monitoring --field-selector type=Warning
# Events for specific pod
kubectl get events -n monitoring --field-selector involvedObject.name=<pod-name>
Storage¶
# PVC status
kubectl get pvc -n monitoring
# PVC usage (from pod)
kubectl exec -n monitoring <pod-name> -- df -h
# Longhorn volumes
kubectl get volumes -n longhorn-system | grep monitoring
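To check filesystem usage across every pod at once rather than one at a time, a quick loop works (a rough sketch; pods whose images ship no shell or no df are simply skipped):
for pod in $(kubectl get pods -n monitoring -o name); do
  echo "== ${pod#pod/}"
  kubectl exec -n monitoring "${pod#pod/}" -- df -h 2>/dev/null || echo "   (df not available in this image)"
done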
Services and Endpoints¶
# Services
kubectl get svc -n monitoring
# Endpoints
kubectl get endpoints -n monitoring
# Service details
kubectl describe svc <service-name> -n monitoring
Debugging by Symptom¶
Symptom: Grafana Dashboards Show “No Data”¶
Step 1: Identify Datasource¶
Check which datasource (Prometheus/Thanos or Loki):
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Open: http://localhost:3000/datasources
# Click on datasource showing error
Step 2: Test Datasource Connection¶
For Prometheus/Thanos:
# Test from Grafana pod
kubectl exec -n monitoring <grafana-pod> -- \
wget -O- http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy
# Should return: "Thanos is Healthy"
For Loki:
kubectl exec -n monitoring <grafana-pod> -- \
wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready
# Should return: "ready"
Step 3: Check Data Exists¶
Prometheus/Thanos:
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Open: http://localhost:9090/graph
# Query: up{job="kubernetes-nodes"}
# Should return data
Loki:
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
# Test labels endpoint
curl http://localhost:3100/loki/api/v1/labels
# Should return list of labels
Step 4: Common Causes¶
1. Service DNS Resolution Failure
# From Grafana pod
kubectl exec -n monitoring <grafana-pod> -- \
nslookup thanos-query.monitoring.svc.cluster.local
# If fails, check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
2. Service Not Running
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query
# Should show running pods
3. Firewall/Network Policy
# Check if network policies exist
kubectl get networkpolicy -n monitoring
# Test direct connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
curl http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy
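If all three checks pass, it is also worth confirming what Grafana has actually provisioned, via its HTTP API through the Step 1 port-forward (a sketch; the admin secret name assumes the kube-prometheus-stack defaults, and jq is optional):
# Grafana admin password (secret name assumed from the chart defaults)
GRAFANA_PASSWORD=$(kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d)
# List provisioned datasources with their URLs
curl -s -u "admin:${GRAFANA_PASSWORD}" http://localhost:3000/api/datasources | jq '.[] | {name, type, url}'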
Symptom: Prometheus Not Scraping Metrics¶
Step 1: Check Target Status¶
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open: http://localhost:9090/targets
# Look for targets showing "DOWN"
Step 2: Check ServiceMonitor¶
# List all ServiceMonitors
kubectl get servicemonitor -A
# Check if specific ServiceMonitor exists
kubectl get servicemonitor <name> -n <namespace> -o yaml
# Check labels match Prometheus selector
kubectl get prometheus -n monitoring kube-prometheus-stack-prometheus -o yaml | grep serviceMonitorSelector -A 5
Step 3: Test Target Reachability¶
# From Prometheus pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
wget -O- http://<target-service>.<namespace>.svc.cluster.local:<port>/metrics
# Should return Prometheus metrics
Step 4: Check Prometheus Logs¶
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 | grep -i error
Common Causes¶
1. ServiceMonitor Label Mismatch
# ServiceMonitor needs this label
metadata:
  labels:
    release: kube-prometheus-stack  # Required!
2. Wrong Service Port Name
# ServiceMonitor references port name
spec:
  endpoints:
    - port: metrics  # Must match service port name
3. TLS/Authentication Issues
# Check if target requires authentication
kubectl describe servicemonitor <name> -n <namespace>
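Putting the first two causes together, a minimal ServiceMonitor that this Prometheus will pick up looks roughly like the sketch below (my-app and the port name are placeholders; adjust the release label to whatever the serviceMonitorSelector from Step 2 shows):
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                      # placeholder
  namespace: monitoring             # or the app's namespace, depending on serviceMonitorNamespaceSelector
  labels:
    release: kube-prometheus-stack  # must match the serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app                   # placeholder: labels on the target Service
  endpoints:
    - port: metrics                 # must match the Service port *name*, not the number
      interval: 30s
EOF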
Symptom: Loki Not Receiving Logs¶
Step 1: Check Alloy Pods¶
# All Alloy pods running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=50
Step 2: Verify Log Flow¶
Test Alloy → Loki connection:
# From Alloy pod
kubectl exec -n monitoring <alloy-pod> -- \
wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready
# Should return: "ready"
Step 3: Check Loki Write Pods¶
# Loki Write running?
kubectl get pods -n monitoring -l app.kubernetes.io/component=write
# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/component=write | grep -i error
Step 4: Verify Logs Arriving¶
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
# Check labels (should show active labels)
curl http://localhost:3100/loki/api/v1/labels
# Query recent logs
curl -G http://localhost:3100/loki/api/v1/query_range \
--data-urlencode 'query={namespace="monitoring"}' \
--data-urlencode 'limit=10'
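If the label list comes back empty, pushing a test line directly and querying it back separates "Loki cannot ingest" from "Alloy is not shipping" (a sketch against the push API; if multi-tenancy is enabled, an X-Scope-OrgID header is also required):
# Push one test line (timestamp in nanoseconds; %N needs GNU date)
NOW_NS=$(date +%s%N)
curl -s -H 'Content-Type: application/json' -X POST http://localhost:3100/loki/api/v1/push \
  --data "{\"streams\":[{\"stream\":{\"job\":\"debug-test\"},\"values\":[[\"${NOW_NS}\",\"loki push test\"]]}]}"
# Query it back
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="debug-test"}' \
  --data-urlencode 'limit=5'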
Common Causes¶
1. Alloy Configuration Error
# Check Alloy config
kubectl get configmap alloy-config -n monitoring -o yaml
# Verify Loki endpoint correct:
# url = "http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push"
2. Loki Write Out of Memory
kubectl get events -n monitoring | grep OOM | grep loki-write
3. S3 Credentials Missing
kubectl get secret loki-s3-config -n monitoring
kubectl describe externalsecret loki-s3-credentials-es -n monitoring
Symptom: Thanos Query Not Showing Historical Data¶
Step 1: Check Thanos Query Stores¶
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Open: http://localhost:9090/stores
# Should show:
# - Prometheus sidecars (sidecar)
# - Thanos Store gateways (store)
Expected output:
{
  "stores": [
    {"name": "prometheus-0", "lastCheck": "...", "type": "sidecar"},
    {"name": "prometheus-1", "lastCheck": "...", "type": "sidecar"},
    {"name": "thanos-store-0", "lastCheck": "...", "type": "store"},
    {"name": "thanos-store-1", "lastCheck": "...", "type": "store"}
  ]
}
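The same store list is available from the CLI via the Query API, which is handy for scripting (jq optional):
curl -s http://localhost:9090/api/v1/stores | jq .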
Step 2: Check Thanos Store Pods¶
# Pods running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store
# Check logs
kubectl logs -n monitoring thanos-store-0 | grep -E "error|block"
Step 3: Verify S3 Blocks Exist¶
# From Thanos Store pod
kubectl exec -n monitoring thanos-store-0 -- \
ls -lh /var/thanos/store/
# Should show downloaded block metadata
Check S3 directly (if AWS CLI available):
aws s3 ls s3://metrics-thanos-kup6s/ --recursive | head -20
# Should show block directories (ULIDs)
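Alternatively, Thanos can list blocks itself from inside the Store pod (a sketch; it assumes the objstore config from the thanos-objstore-config secret is mounted at /etc/thanos/objstore.yml, so check the pod spec if the path differs):
kubectl exec -n monitoring thanos-store-0 -- \
  thanos tools bucket ls --objstore.config-file=/etc/thanos/objstore.yml
# Should print one ULID per block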
Step 4: Check Thanos Sidecar Uploads¶
# From Prometheus pod
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
-c thanos-sidecar | grep -i upload
# Should show successful uploads every 2 hours
Common Causes¶
1. Thanos Store Can’t Reach S3
kubectl logs -n monitoring thanos-store-0 | grep -i "s3\|error"
# Look for: "access denied", "connection refused"
2. S3 Credentials Invalid
kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d
# Verify access_key and secret_key
3. No Historical Data Yet
# Check cluster uptime
kubectl get nodes -o jsonpath='{.items[0].metadata.creationTimestamp}'
# Historical data only available after >3 days
# (Prometheus local retention is 3 days)
Debugging Tools and Techniques¶
Interactive Debugging Pod¶
Launch debug pod:
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash
Useful tools included:
curl - HTTP requests
wget - Download files
dig - DNS queries
nslookup - DNS resolution
ping - Network connectivity
traceroute - Network path
netstat - Network connections
tcpdump - Packet capture
Examples:
# Test Thanos Query
curl http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy
# Test Loki
curl http://loki-gateway.monitoring.svc.cluster.local/ready
# DNS resolution
dig thanos-query.monitoring.svc.cluster.local
# DNS SRV records
dig SRV _grpc._tcp.thanos-store.monitoring.svc.cluster.local
Port Forwarding for Local Access¶
Forward Prometheus:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Access: http://localhost:9090
Forward Grafana:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Access: http://localhost:3000
Forward Thanos Query:
# Note: uses the same local port as the Prometheus forward above; pick another
# local port (e.g. 19090:9090) if both are needed at once
kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Access: http://localhost:9090
Forward Loki:
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
# Access: http://localhost:3100
Exec into Running Pods¶
# Prometheus
kubectl exec -it -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- sh
# Thanos Store
kubectl exec -it -n monitoring thanos-store-0 -- sh
# Loki Write
kubectl exec -it -n monitoring loki-write-0 -- sh
Useful commands inside pod:
# Check disk usage
df -h
# Check environment variables
env | grep -i s3
# Check process
ps aux
# Check open ports
netstat -tulpn
# Test outbound connectivity
wget -O- https://fsn1.your-objectstorage.com
Viewing Configuration¶
HelmChart resources:
kubectl get helmchart -n monitoring -o yaml
Extract Helm values:
kubectl get helmchart kube-prometheus-stack -n monitoring \
-o jsonpath='{.spec.valuesContent}' > /tmp/prometheus-values.yaml
ConfigMaps:
# Alloy config
kubectl get configmap alloy-config -n monitoring -o yaml
# Prometheus rules
kubectl get configmap prometheus-kube-prometheus-stack-prometheus-rulefiles-0 -n monitoring -o yaml
Prometheus Debugging Queries¶
Check target health:
up{namespace="monitoring"}
Check scrape duration:
scrape_duration_seconds{namespace="monitoring"}
Check memory usage:
container_memory_working_set_bytes{namespace="monitoring"} / 1024 / 1024
Check CPU throttling:
rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1
Check Thanos sidecar uploads:
thanos_objstore_bucket_operations_total{operation="upload"}
Check Loki ingestion rate:
rate(loki_distributor_lines_received_total[5m])
Common Issue Patterns¶
Pattern: Intermittent Failures¶
Symptoms:
Pods occasionally fail health checks
Queries sometimes timeout
Metrics sometimes missing
Debugging approach (a repeated health probe can also help catch flaps; see the sketch after this list):
Check for resource constraints:
# Look for pods near their limits
kubectl top pods -n monitoring
Check for node issues:
kubectl get nodes
# Look for: MemoryPressure, DiskPressure
kubectl describe node <node-name>
Check for network issues:
kubectl get events -A | grep -i network
Enable debug logging:
# For Prometheus (increase log level)
kubectl edit prometheus -n monitoring
# Add: logLevel: debug
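A simple probe loop, run from the netshoot debug pod described under "Interactive Debugging Pod" above, timestamps every failure so flaps can be correlated with events (a sketch; swap in whichever endpoint is misbehaving):
while true; do
  if ! curl -sf --max-time 5 http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy > /dev/null; then
    echo "$(date -u +%FT%TZ) thanos-query health check FAILED"
  fi
  sleep 10
done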
Pattern: Cascading Failures¶
Symptoms:
One component fails, then others fail
Error messages reference downstream services
Debugging approach:
Identify failure sequence:
# Look for the first failure
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -50
Check dependencies:
Grafana → Thanos Query → Prometheus / Thanos Store
   ↓
  Loki → Loki Write
Fix root cause first, then check if others recover
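A quick way to walk the chain is to check readiness component by component, from the storage side up (a sketch; the label selectors are the ones used elsewhere in this guide, plus an assumed app.kubernetes.io/name=grafana for Grafana):
for sel in app.kubernetes.io/name=prometheus \
           app.kubernetes.io/name=thanos-store \
           app.kubernetes.io/name=thanos-query \
           app.kubernetes.io/component=write \
           app.kubernetes.io/name=alloy \
           app.kubernetes.io/name=grafana; do
  echo "== $sel"
  kubectl get pods -n monitoring -l "$sel" --no-headers
done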
Pattern: Gradual Degradation¶
Symptoms:
Performance slowly decreases over time
Queries getting slower
Memory usage increasing
Debugging approach:
Check for resource leaks:
# Memory growth over 24h (use delta, not rate, since working-set is a gauge)
delta(container_memory_working_set_bytes{namespace="monitoring"}[24h])
Check for growing datasets:
# Prometheus TSDB size
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  du -sh /prometheus/
# PVC growth
kubectl exec -n monitoring <pod> -- df -h
Check for increasing cardinality:
# Prometheus series count
prometheus_tsdb_head_series
Apply fixes:
Increase resources
Add retention policies
Drop high-cardinality metrics
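To decide which metrics are worth dropping, a top-10 cardinality query against the query API shows the biggest offenders (assumes a port-forward to Prometheus or Thanos Query on 9090):
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))' | jq .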
Debugging Checklist¶
When stuck, work through this checklist:
Infrastructure Layer¶
[ ] All nodes healthy? (kubectl get nodes)
[ ] Sufficient cluster resources? (kubectl top nodes)
[ ] Longhorn healthy? (kubectl get pods -n longhorn-system)
[ ] Network connectivity? (test between pods)
Pod Layer¶
[ ] All pods running? (kubectl get pods -n monitoring)
[ ] No crash loops? (check restart count)
[ ] No OOM kills? (kubectl get events -n monitoring | grep OOM)
[ ] Resource limits not hit? (kubectl top pods -n monitoring)
Storage Layer¶
[ ] All PVCs bound? (kubectl get pvc -n monitoring)
[ ] PVCs not full? (check from pods)
[ ] S3 buckets exist? (kubectl get bucket -A)
[ ] S3 credentials valid? (check secrets)
Network Layer¶
[ ] Services exist? (kubectl get svc -n monitoring)
[ ] Endpoints populated? (kubectl get endpoints -n monitoring)
[ ] DNS resolving? (test from debug pod)
[ ] No network policies blocking? (kubectl get networkpolicy -n monitoring)
Application Layer¶
[ ] Config correct? (check ConfigMaps)
[ ] Secrets exist? (check ExternalSecrets)
[ ] Logs showing errors? (check pod logs)
[ ] Metrics collected? (query Prometheus)
Escalation Path¶
When to Escalate¶
Escalate if:
Issue affects production workloads
Data loss risk
Unable to identify root cause after 1 hour
Need component expertise (Prometheus, Thanos, Loki)
Information to Gather Before Escalating¶
# Create debug bundle
DEBUG_DIR=/tmp/monitoring-debug-$(date +%Y%m%d-%H%M%S)
mkdir -p $DEBUG_DIR
# Pod status
kubectl get pods -n monitoring -o wide > $DEBUG_DIR/pods.txt
# Recent events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -100 > $DEBUG_DIR/events.txt
# Logs from failed pods
for pod in $(kubectl get pods -n monitoring --field-selector=status.phase!=Running -o name); do
kubectl logs -n monitoring $pod --previous > $DEBUG_DIR/$(basename $pod)-previous.log 2>&1 || true
kubectl logs -n monitoring $pod > $DEBUG_DIR/$(basename $pod).log 2>&1 || true
done
# Resource usage
kubectl top pods -n monitoring > $DEBUG_DIR/resource-usage.txt
# Describe failing pods
for pod in $(kubectl get pods -n monitoring --field-selector=status.phase!=Running -o name); do
kubectl describe -n monitoring $pod > $DEBUG_DIR/$(basename $pod)-describe.txt
done
# Compress
tar czf ${DEBUG_DIR}.tar.gz -C /tmp $(basename $DEBUG_DIR)
Support Resources¶
See Also¶
Troubleshooting Reference - Common issues
Architecture Overview - System design
Resource Requirements - Expected usage