How-To Guide

Debug Monitoring Issues

A systematic approach to debugging problems in the monitoring stack.

Debugging Workflow

1. Identify the symptom
2. Gather evidence (logs, metrics, events)
3. Form hypothesis
4. Test hypothesis
5. Apply fix
6. Verify resolution
7. Document root cause

Quick Diagnostic Commands

Pod Status

# All monitoring pods
kubectl get pods -n monitoring

# Specific component
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus

# Pod details
kubectl describe pod <pod-name> -n monitoring

# Pod logs
kubectl logs -n monitoring <pod-name> --tail=100

# Previous container logs (after crash)
kubectl logs -n monitoring <pod-name> --previous

# Logs from specific container in pod
kubectl logs -n monitoring <pod-name> -c <container-name>

Resource Usage

# Current resource usage
kubectl top pods -n monitoring

# Sort by memory
kubectl top pods -n monitoring --sort-by=memory

# Sort by CPU
kubectl top pods -n monitoring --sort-by=cpu

# Node resources
kubectl top nodes

Events

# Recent events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20

# Warning events only
kubectl get events -n monitoring --field-selector type=Warning

# Events for specific pod
kubectl get events -n monitoring --field-selector involvedObject.name=<pod-name>

Storage

# PVC status
kubectl get pvc -n monitoring

# PVC usage (from pod)
kubectl exec -n monitoring <pod-name> -- df -h

# Longhorn volumes
kubectl get volumes -n longhorn-system | grep monitoring

Services and Endpoints

# Services
kubectl get svc -n monitoring

# Endpoints
kubectl get endpoints -n monitoring

# Service details
kubectl describe svc <service-name> -n monitoring

Debugging by Symptom

Symptom: Grafana Dashboards Show “No Data”

Step 1: Identify Datasource

Check which datasource (Prometheus/Thanos or Loki):

# Port-forward to Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# Open: http://localhost:3000/datasources
# Click on datasource showing error
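
If the Grafana admin credentials are at hand, the datasource list can also be pulled from Grafana's HTTP API instead of the UI. A minimal sketch, assuming the chart-default secret name kube-prometheus-stack-grafana and that jq is installed locally:

# Grafana admin password (adjust the secret name if your install customizes it)
GRAFANA_PW=$(kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d)

# List configured datasources (port-forward from above still running)
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/datasources | jq '.[] | {name, type, url}'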

Step 2: Test Datasource Connection

For Prometheus/Thanos:

# Test from Grafana pod
kubectl exec -n monitoring <grafana-pod> -- \
  wget -O- http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy

# Should return: "Thanos is Healthy"

For Loki:

kubectl exec -n monitoring <grafana-pod> -- \
  wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready

# Should return: "ready"

Step 3: Check Data Exists

Prometheus/Thanos:

kubectl port-forward -n monitoring svc/thanos-query 9090:9090

# Open: http://localhost:9090/graph
# Query: up{job="kubernetes-nodes"}
# Should return data
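
The same check works without a browser, since Thanos Query exposes the Prometheus-compatible HTTP query API. A sketch, assuming the port-forward above is still running and jq is available:

# Count series returned for "up"; zero means no data is reaching Thanos Query
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'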

Loki:

kubectl port-forward -n monitoring svc/loki-gateway 3100:80

# Test labels endpoint
curl http://localhost:3100/loki/api/v1/labels

# Should return list of labels

Step 4: Common Causes

1. Service DNS Resolution Failure

# From Grafana pod
kubectl exec -n monitoring <grafana-pod> -- \
  nslookup thanos-query.monitoring.svc.cluster.local

# If fails, check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

2. Service Not Running

kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-query
# Should show running pods

3. Firewall/Network Policy

# Check if network policies exist
kubectl get networkpolicy -n monitoring

# Test direct connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
  curl http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy

Symptom: Prometheus Not Scraping Metrics

Step 1: Check Target Status

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Open: http://localhost:9090/targets
# Look for targets showing "DOWN"
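
The target list can also be inspected from the terminal through the Prometheus targets API. A sketch that prints only unhealthy targets, assuming jq is installed locally:

# Show job, scrape URL, and last error for every target that is not "up"
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, scrapeUrl, lastError}'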

Step 2: Check ServiceMonitor

# List all ServiceMonitors
kubectl get servicemonitor -A

# Check if specific ServiceMonitor exists
kubectl get servicemonitor <name> -n <namespace> -o yaml

# Check labels match Prometheus selector
kubectl get prometheus -n monitoring kube-prometheus-stack-prometheus -o yaml | grep serviceMonitorSelector -A 5

Step 3: Test Target Reachability

# From Prometheus pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -O- http://<target-service>.<namespace>.svc.cluster.local:<port>/metrics

# Should return Prometheus metrics

Step 4: Check Prometheus Logs

kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 | grep -i error

Common Causes

1. ServiceMonitor Label Mismatch

# ServiceMonitor needs this label
metadata:
  labels:
    release: kube-prometheus-stack  # Required!
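
A quick way to confirm the label is actually present on the ServiceMonitor (substitute your own name and namespace):

# Print the ServiceMonitor's labels; release: kube-prometheus-stack must appear
kubectl get servicemonitor <name> -n <namespace> -o jsonpath='{.metadata.labels}'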

2. Wrong Service Port Name

# ServiceMonitor references port name
spec:
  endpoints:
    - port: metrics  # Must match service port name
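
Compare that value against the port names actually defined on the target Service:

# List named ports on the Service; the ServiceMonitor's "port" must match one of these names
kubectl get svc <target-service> -n <namespace> \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}'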

3. TLS/Authentication Issues

# Check if target requires authentication
kubectl describe servicemonitor <name> -n <namespace>

Symptom: Loki Not Receiving Logs

Step 1: Check Alloy Pods

# All Alloy pods running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy --tail=50

Step 2: Verify Log Flow

Test Alloy → Loki connection:

# From Alloy pod
kubectl exec -n monitoring <alloy-pod> -- \
  wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready

# Should return: "ready"

Step 3: Check Loki Write Pods

# Loki Write running?
kubectl get pods -n monitoring -l app.kubernetes.io/component=write

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/component=write | grep -i error

Step 4: Verify Logs Arriving

kubectl port-forward -n monitoring svc/loki-gateway 3100:80

# Check labels (should show active labels)
curl http://localhost:3100/loki/api/v1/labels

# Query recent logs
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="monitoring"}' \
  --data-urlencode 'limit=10'
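
If labels and queries stay empty, pushing a synthetic log line directly to the gateway isolates whether ingestion itself works, independent of Alloy. A sketch against Loki's push API, assuming GNU date and that multi-tenancy is disabled (otherwise an X-Scope-OrgID header is required); the job label is arbitrary:

# Push one test line with the current timestamp in nanoseconds; an empty 204 response means it was accepted
curl -s -X POST http://localhost:3100/loki/api/v1/push \
  -H 'Content-Type: application/json' \
  -d "{\"streams\":[{\"stream\":{\"job\":\"debug-test\"},\"values\":[[\"$(date +%s%N)\",\"manual test line\"]]}]}"

# Query it back
curl -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="debug-test"}' \
  --data-urlencode 'limit=5'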

Common Causes

1. Alloy Configuration Error

# Check Alloy config
kubectl get configmap alloy-config -n monitoring -o yaml

# Verify Loki endpoint correct:
# url = "http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push"
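
To confirm the endpoint without reading the whole config, grep the rendered ConfigMap for the push URL:

kubectl get configmap alloy-config -n monitoring -o yaml | grep 'loki/api/v1/push'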

2. Loki Write Out of Memory

kubectl get events -n monitoring | grep OOM | grep loki-write

3. S3 Credentials Missing

kubectl get secret loki-s3-config -n monitoring
kubectl describe externalsecret loki-s3-credentials-es -n monitoring

Symptom: Thanos Query Not Showing Historical Data

Step 1: Check Thanos Query Stores

kubectl port-forward -n monitoring svc/thanos-query 9090:9090

# Open: http://localhost:9090/stores
# Should show:
# - Prometheus sidecars (sidecar)
# - Thanos Store gateways (store)

Expected output:

{
  "stores": [
    {"name": "prometheus-0", "lastCheck": "...", "type": "sidecar"},
    {"name": "prometheus-1", "lastCheck": "...", "type": "sidecar"},
    {"name": "thanos-store-0", "lastCheck": "...", "type": "store"},
    {"name": "thanos-store-1", "lastCheck": "...", "type": "store"}
  ]
}
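
The same information is available as JSON from Thanos Query's store API, which is easier to attach to a ticket:

# Machine-readable view of all registered store endpoints
curl -s http://localhost:9090/api/v1/stores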

Step 2: Check Thanos Store Pods

# Pods running?
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store

# Check logs
kubectl logs -n monitoring thanos-store-0 | grep -E "error|block"

Step 3: Verify S3 Blocks Exist

# From Thanos Store pod
kubectl exec -n monitoring thanos-store-0 -- \
  ls -lh /var/thanos/store/

# Should show downloaded block metadata

Check S3 directly (if AWS CLI available):

aws s3 ls s3://metrics-thanos-kup6s/ --recursive | head -20
# Should show block directories (ULIDs)

Step 4: Check Thanos Sidecar Uploads

# From Prometheus pod
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c thanos-sidecar | grep -i upload

# Should show successful uploads every 2 hours

Common Causes

1. Thanos Store Can’t Reach S3

kubectl logs -n monitoring thanos-store-0 | grep -i "s3\|error"
# Look for: "access denied", "connection refused"

2. S3 Credentials Invalid

kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d
# Verify access_key and secret_key

3. No Historical Data Yet

# Check cluster uptime
kubectl get nodes -o jsonpath='{.items[0].metadata.creationTimestamp}'

# Prometheus local retention is 3 days; older data comes only from S3,
# so a cluster younger than that has no historical data to show yet

Debugging Tools and Techniques

Interactive Debugging Pod

Launch debug pod:

kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash

Useful tools included:

  • curl - HTTP requests

  • wget - Download files

  • dig - DNS queries

  • nslookup - DNS resolution

  • ping - Network connectivity

  • traceroute - Network path

  • netstat - Network connections

  • tcpdump - Packet capture

Examples:

# Test Thanos Query
curl http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy

# Test Loki
curl http://loki-gateway.monitoring.svc.cluster.local/ready

# DNS resolution
dig thanos-query.monitoring.svc.cluster.local

# DNS SRV records
dig SRV _grpc._tcp.thanos-store.monitoring.svc.cluster.local

Port Forwarding for Local Access

Forward Prometheus:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Access: http://localhost:9090

Forward Grafana:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Access: http://localhost:3000

Forward Thanos Query:

kubectl port-forward -n monitoring svc/thanos-query 9090:9090
# Access: http://localhost:9090

Forward Loki:

kubectl port-forward -n monitoring svc/loki-gateway 3100:80
# Access: http://localhost:3100

Exec into Running Pods

# Prometheus
kubectl exec -it -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- sh

# Thanos Store
kubectl exec -it -n monitoring thanos-store-0 -- sh

# Loki Write
kubectl exec -it -n monitoring loki-write-0 -- sh

Useful commands inside pod:

# Check disk usage
df -h

# Check environment variables
env | grep -i s3

# Check process
ps aux

# Check open ports
netstat -tulpn

# Test outbound connectivity
wget -O- https://fsn1.your-objectstorage.com

Viewing Configuration

HelmChart resources:

kubectl get helmchart -n monitoring -o yaml

Extract Helm values:

kubectl get helmchart kube-prometheus-stack -n monitoring \
  -o jsonpath='{.spec.valuesContent}' > /tmp/prometheus-values.yaml

ConfigMaps:

# Alloy config
kubectl get configmap alloy-config -n monitoring -o yaml

# Prometheus rules
kubectl get configmap prometheus-kube-prometheus-stack-prometheus-rulefiles-0 -n monitoring -o yaml

Prometheus Debugging Queries

Check target health:

up{namespace="monitoring"}

Check scrape duration:

scrape_duration_seconds{namespace="monitoring"}

Check memory usage (MiB):

container_memory_working_set_bytes{namespace="monitoring"} / 1024 / 1024

Check CPU throttling:

rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1

Check Thanos sidecar uploads:

thanos_objstore_bucket_operations_total{operation="upload"}
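
Pair that with the corresponding failure counter to spot silently failing uploads (assumes the standard Thanos objstore metrics are scraped):

rate(thanos_objstore_bucket_operation_failures_total{operation="upload"}[1h]) > 0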

Check Loki ingestion rate:

rate(loki_distributor_lines_received_total[5m])

Common Issue Patterns

Pattern: Intermittent Failures

Symptoms:

  • Pods occasionally fail health checks

  • Queries sometimes timeout

  • Metrics sometimes missing

Debugging approach:

  1. Check for resource constraints:

    kubectl top pods -n monitoring
    # Look for pods near limit
    
  2. Check for node issues:

    kubectl get nodes
    kubectl describe node <node-name>
    # Look for: MemoryPressure, DiskPressure
    
  3. Check for network issues:

    kubectl get events -A | grep -i network
    
  4. Enable debug logging:

    # For Prometheus (increase log level)
    kubectl edit prometheus -n monitoring kube-prometheus-stack-prometheus
    # Under spec, add: logLevel: debug
    

Pattern: Cascading Failures

Symptoms:

  • One component fails, then others fail

  • Error messages reference downstream services

Debugging approach:

  1. Identify failure sequence:

    kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -50
    # Look for first failure
    
  2. Check dependencies:

    Grafana → Thanos Query → Prometheus / Thanos Store
    Grafana → Loki Gateway → Loki Write
    
  3. Fix root cause first, then check if others recover

Pattern: Gradual Degradation

Symptoms:

  • Performance slowly decreases over time

  • Queries getting slower

  • Memory usage increasing

Debugging approach:

  1. Check for resource leaks:

    # Memory growth over 24h (working set is a gauge, so use delta rather than rate)
    delta(container_memory_working_set_bytes{namespace="monitoring"}[24h])
    
  2. Check for growing datasets:

    # Prometheus TSDB size
    kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
      du -sh /prometheus/
    
    # PVC growth
    kubectl exec -n monitoring <pod> -- df -h
    
  3. Check for increasing cardinality (a per-metric breakdown is sketched after this list):

    # Prometheus series count
    prometheus_tsdb_head_series
    
  4. Apply fixes:

    • Increase resources

    • Add retention policies

    • Drop high-cardinality metrics

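For the cardinality check in step 3, Prometheus's TSDB status API breaks the head series count down per metric name. A minimal sketch, assuming the Prometheus port-forward from earlier and jq:

# Top metric names by series count in the TSDB head
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'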

Debugging Checklist

When stuck, work through this checklist (a scripted first pass is sketched after it):

Infrastructure Layer

  • [ ] All nodes healthy? (kubectl get nodes)

  • [ ] Sufficient cluster resources? (kubectl top nodes)

  • [ ] Longhorn healthy? (kubectl get pods -n longhorn-system)

  • [ ] Network connectivity? (test between pods)

Pod Layer

  • [ ] All pods running? (kubectl get pods -n monitoring)

  • [ ] No crash loops? (check restart count)

  • [ ] No OOM kills? (kubectl get events -n monitoring | grep OOM)

  • [ ] Resource limits not hit? (kubectl top pods -n monitoring)

Storage Layer

  • [ ] All PVCs bound? (kubectl get pvc -n monitoring)

  • [ ] PVCs not full? (check from pods)

  • [ ] S3 buckets exist? (kubectl get bucket -A)

  • [ ] S3 credentials valid? (check secrets)

Network Layer

  • [ ] Services exist? (kubectl get svc -n monitoring)

  • [ ] Endpoints populated? (kubectl get endpoints -n monitoring)

  • [ ] DNS resolving? (test from debug pod)

  • [ ] No network policies blocking? (kubectl get networkpolicy -n monitoring)

Application Layer

  • [ ] Config correct? (check ConfigMaps)

  • [ ] Secrets exist? (check ExternalSecrets)

  • [ ] Logs showing errors? (check pod logs)

  • [ ] Metrics collected? (query Prometheus)

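A first pass over this checklist can be scripted. A minimal sketch, assuming kubectl access to the cluster and that metrics-server is available for kubectl top:

#!/usr/bin/env bash
# Quick health pass over the infrastructure, pod, and storage layers

echo "== Nodes ==";                        kubectl get nodes
echo "== Node resources ==";               kubectl top nodes
echo "== Longhorn pods (not Running) =="
kubectl get pods -n longhorn-system | grep -v Running || true
echo "== Monitoring pods ==";              kubectl get pods -n monitoring
echo "== Warning events ==";               kubectl get events -n monitoring --field-selector type=Warning | tail -20
echo "== PVCs ==";                         kubectl get pvc -n monitoring
echo "== Pod resource usage ==";           kubectl top pods -n monitoring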

Escalation Path

When to Escalate

Escalate if:

  • Issue affects production workloads

  • Data loss risk

  • Unable to identify root cause after 1 hour

  • Need component expertise (Prometheus, Thanos, Loki)

Information to Gather Before Escalating

# Create debug bundle
DEBUG_DIR=/tmp/monitoring-debug-$(date +%Y%m%d-%H%M%S)
mkdir -p $DEBUG_DIR

# Pod status
kubectl get pods -n monitoring -o wide > $DEBUG_DIR/pods.txt

# Recent events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -100 > $DEBUG_DIR/events.txt

# Logs from failed pods
for pod in $(kubectl get pods -n monitoring --field-selector=status.phase!=Running -o name); do
  kubectl logs -n monitoring $pod --previous > $DEBUG_DIR/$(basename $pod)-previous.log 2>&1 || true
  kubectl logs -n monitoring $pod > $DEBUG_DIR/$(basename $pod).log 2>&1 || true
done

# Resource usage
kubectl top pods -n monitoring > $DEBUG_DIR/resource-usage.txt

# Describe failing pods
for pod in $(kubectl get pods -n monitoring --field-selector=status.phase!=Running -o name); do
  kubectl describe -n monitoring $pod > $DEBUG_DIR/$(basename $pod)-describe.txt
done

# Compress
tar czf $DEBUG_DIR.tar.gz -C /tmp $(basename $DEBUG_DIR)

Support Resources


See Also