Troubleshooting Guide

Common issues and solutions for the monitoring stack.

Quick Diagnostic Commands

# Check all monitoring pods
kubectl get pods -n monitoring

# Check pod logs
kubectl logs -n monitoring <pod-name> --tail=100

# Check pod events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20

# Check resource usage
kubectl top pods -n monitoring

# Check PVC status
kubectl get pvc -n monitoring

# Check S3 buckets
kubectl get bucket -A | grep -E "thanos|loki"

# Check secrets
kubectl get externalsecret -n monitoring

Pod Issues

Pods Stuck in Pending

Symptom: Pods show Pending state for >2 minutes

Diagnosis:

kubectl describe pod <pod-name> -n monitoring

Common Causes & Solutions:

1. Insufficient Resources

Events:
  Warning  FailedScheduling  0/5 nodes are available: insufficient memory

Solution:

  • Check node capacity and current allocations: kubectl describe nodes (see the example below)

  • Reduce resource requests in config.yaml

  • Add more nodes to the cluster
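
A quick way to confirm the shortage is to compare each node's allocatable resources with what is already requested:

# Current usage per node (requires metrics-server)
kubectl top nodes

# Allocatable vs. already-requested resources per node
kubectl describe nodes | grep -A 7 "Allocated resources"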

2. PVC Not Bound

Events:
  Warning  FailedScheduling  pod has unbound immediate PersistentVolumeClaims

Solution:

# Check PVC status
kubectl get pvc -n monitoring

# If pending, check storage class
kubectl get storageclass

# Check Longhorn status
kubectl get pods -n longhorn-system

3. Node Selector/Affinity Issues

Events:
  Warning  FailedScheduling  0/5 nodes match pod topology spread constraints

Solution:

  • Ensure nodes have correct labels

  • Adjust anti-affinity rules if too strict

  • Check topology.kubernetes.io/zone labels on nodes (see the example below)
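
The relevant labels can be listed directly on the nodes, for example:

# Show the zone label for every node
kubectl get nodes -L topology.kubernetes.io/zone

# Full label set of a single node
kubectl get node <node-name> --show-labels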

Pods in CrashLoopBackOff

Symptom: Pods restart repeatedly

Diagnosis:

kubectl logs -n monitoring <pod-name> --previous
kubectl describe pod <pod-name> -n monitoring

Common Causes & Solutions:

1. OOMKilled (Out of Memory)

State:          Terminated
  Reason:       OOMKilled
  Exit Code:    137

Solution:

# Increase memory in config.yaml
resources:
  prometheus:
    requests:
      memory: 2Gi  # Increased from 1500Mi
    limits:
      memory: 4Gi  # Increased from 3Gi

2. Misconfigured S3 Credentials

Logs: error: failed to upload block: access denied

Solution:

# Verify secret exists
kubectl get secret thanos-objstore-config -n monitoring

# Check secret content
kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d

# Verify ESO replication
kubectl get externalsecret thanos-s3-credentials-es -n monitoring
kubectl describe externalsecret thanos-s3-credentials-es -n monitoring

3. Liveness/Readiness Probe Failures

Events:
  Warning  Unhealthy  Liveness probe failed: Get http://10.42.0.1:9090/-/healthy: dial tcp timeout

Solution:

  • Check if component is actually healthy: kubectl exec -n monitoring <pod> -- wget -O- http://localhost:9090/-/healthy

  • Increase probe timeouts if the component is slow to start (see the sketch below)

  • Check component logs for startup errors
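
If the component is just slow to start, relax the probe settings. A minimal sketch of the standard Kubernetes probe fields (where these are set depends on the Helm chart; the values are examples):

livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 30   # give the process time to start
  timeoutSeconds: 5         # tolerate slow responses
  failureThreshold: 6       # tolerate transient failures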

Pods in ImagePullBackOff

Symptom: Cannot pull container image

Diagnosis:

kubectl describe pod <pod-name> -n monitoring

Common Causes & Solutions:

1. Image Not Found

Events:
  Warning  Failed     Failed to pull image "quay.io/thanos/thanos:v0.99.0": not found

Solution:

  • Check image tag in config.yaml

  • Verify image exists: docker pull quay.io/thanos/thanos:v0.36.1

  • Check for typos in image name

2. Rate Limiting

Events:
  Warning  Failed     toomanyrequests: You have reached your pull rate limit

Solution:

  • Use a registry mirror

  • Add an image pull secret for authenticated access (example below)

  • Wait for the rate limit to reset
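
For authenticated pulls, create a docker-registry secret and reference it from the workload (the registry and credentials below are placeholders):

# Create the pull secret
kubectl create secret docker-registry regcred -n monitoring \
  --docker-server=docker.io \
  --docker-username=<username> \
  --docker-password=<password>

# Then reference it in the pod spec or chart values:
# imagePullSecrets:
#   - name: regcred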


Prometheus Issues

Prometheus Not Scraping Targets

Symptom: Targets show as “DOWN” in Prometheus UI

Diagnosis:

# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Check targets: http://localhost:9090/targets

Common Causes & Solutions:

1. ServiceMonitor Not Detected

# Check if ServiceMonitor exists
kubectl get servicemonitor -A

# Check Prometheus operator logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator

Solution:

# Ensure the ServiceMonitor carries the label the operator selects on
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: kube-prometheus-stack  # Required label
spec:
  selector:
    matchLabels:
      app: <target-app>
  endpoints:
    - port: metrics

2. Network Policy Blocking

# Check if target is reachable from Prometheus pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -O- http://<target-service>:<port>/metrics

Solution:

  • Add a network policy allowing Prometheus → target communication (see the sketch below)

  • Check service exists: kubectl get svc <target-service>
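
A minimal sketch of such a NetworkPolicy, applied in the target's namespace (the pod labels and port are assumptions and must match your workload and Prometheus pods):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: <target-namespace>
spec:
  podSelector:
    matchLabels:
      app: <target-app>                        # assumed label of the scraped pods
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus   # assumed Prometheus pod label
      ports:
        - port: <metrics-port>
          protocol: TCP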

3. TLS Certificate Issues

Logs: Get "https://target:8443/metrics": x509: certificate signed by unknown authority

Solution:

# In ServiceMonitor, skip TLS verification
spec:
  endpoints:
    - port: https
      scheme: https
      tlsConfig:
        insecureSkipVerify: true

Prometheus Disk Full

Symptom: Prometheus pod crash or slow queries

Diagnosis:

kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- df -h /prometheus

Common Causes & Solutions:

1. PVC Full

Filesystem      Size  Used Avail Use% Mounted on
/dev/longhorn   3.0G  2.9G   100M  97% /prometheus

Solution:

# Option 1: Expand PVC (if storage class supports it)
kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring
# Change: storage: 3Gi → 6Gi

# Option 2: Reduce retention
# In config.yaml:
retention: 2d  # Reduced from 3d

# Option 3: Delete old data manually (emergency)
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  rm -rf /prometheus/01HQOLDEST_BLOCK_ID/
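
Before trying Option 1, confirm the storage class actually allows expansion (longhorn is assumed as the class name here):

# Should print "true" if PVC expansion is supported
kubectl get storageclass longhorn -o jsonpath='{.allowVolumeExpansion}'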

2. WAL Growing Too Large

kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  du -sh /prometheus/wal

Solution:

  • Enable WAL compression (already enabled in config)

  • Restart Prometheus to compact WAL

  • Check for high-cardinality metrics causing excessive WAL growth (see the query below)
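
To find which metrics drive the cardinality, run a query like the following in the Prometheus UI:

# Top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))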

Prometheus High Memory Usage

Symptom: Prometheus pod using >80% of memory limit

Diagnosis:

kubectl top pod -n monitoring prometheus-kube-prometheus-stack-prometheus-0

Common Causes & Solutions:

1. Too Many Series

# Check series count in Prometheus UI
# Query: prometheus_tsdb_head_series

Solution:

  • Reduce scrape targets

  • Set a sample_limit so that scrapes of high-cardinality targets are rejected

  • Add relabel rules to drop unnecessary labels (see the sketch below)

  • Increase memory limits
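
Relabel rules live on the ServiceMonitor endpoint; a hedged sketch that drops an unneeded label and a high-cardinality metric family (the names are placeholders):

spec:
  endpoints:
    - port: metrics
      metricRelabelings:
        - action: labeldrop
          regex: pod_template_hash              # drop an unneeded label
        - action: drop
          sourceLabels: [__name__]
          regex: <noisy_metric_prefix>_.*       # drop a high-cardinality metric family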

2. Large Queries

# Check slow queries in logs
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 | grep "slow query"

Solution:

  • Limit query concurrency

  • Add query timeout

  • Optimize queries (use recording rules; see the sketch below)
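
Recording rules precompute expensive expressions; a minimal PrometheusRule sketch (the rule name and expression are examples only):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-recording-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the operator's ruleSelector
spec:
  groups:
    - name: example.rules
      rules:
        - record: namespace:container_cpu_usage_seconds:rate5m
          expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))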


Thanos Issues

Thanos Sidecar Not Uploading

Symptom: thanos_objstore_bucket_operations_total{operation="upload"} not increasing

Diagnosis:

kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar | grep -i upload

Common Causes & Solutions:

1. S3 Credentials Invalid

Logs: failed to upload block: access denied

Solution:

# Check secret
kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d

# Verify the credentials work (the Thanos image has no aws CLI, but ships the `thanos tools bucket` subcommands)
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar -- \
  thanos tools bucket ls --objstore.config-file=<path-to-mounted-objstore.yml>

2. No Blocks to Upload

Logs: no blocks to upload

Explanation: Normal if Prometheus just started. Blocks are created every 2 hours.

Verification:

# Check block creation
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  ls -lh /prometheus/ | grep "^d"

3. Network Issues

Logs: dial tcp: i/o timeout

Solution:

  • Check S3 endpoint reachable: kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar -- wget -O- https://fsn1.your-objectstorage.com

  • Check firewall/security groups

  • Verify DNS resolution (see below)
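
DNS resolution for the S3 endpoint can be checked from a throwaway pod inside the cluster:

kubectl run -it --rm dns-test --image=busybox --restart=Never -- \
  nslookup fsn1.your-objectstorage.com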

Thanos Query Not Showing Historical Data

Symptom: Queries only return last 3 days (Prometheus local data)

Diagnosis:

# Check Thanos Query logs
kubectl logs -n monitoring -l app.kubernetes.io/name=thanos-query | grep store

# Check stores connected
kubectl exec -n monitoring thanos-query-0 -- wget -qO- http://localhost:9090/api/v1/stores

Common Causes & Solutions:

1. Thanos Store Not Connected

{
  "status": "success",
  "data": {
    "stores": [
      {"name": "prometheus-0"}  // Only the sidecar is registered; no store
    ]
  }
}

Solution:

# Check Thanos Store pods running
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store

# Check service
kubectl get svc thanos-store -n monitoring

# Verify DNS SRV record
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
  dig SRV _grpc._tcp.thanos-store.monitoring.svc.cluster.local

2. Thanos Store Can’t Read S3

# Check Thanos Store logs
kubectl logs -n monitoring thanos-store-0 | grep -E "error|block"

Solution:

  • Verify S3 bucket exists: kubectl get bucket metrics-thanos-kup6s

  • Check Store has S3 credentials

  • Check blocks exist in S3

3. Time Range Issue

# Query returns no data for >3 days ago

Explanation: Historical data takes time to accumulate. Check:

  • Has cluster been running >3 days?

  • Have blocks been uploaded to S3?

  • Has the compactor run successfully? (see the checks below)
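
The last two points can be checked directly (the label selectors below are assumed to follow the same convention as thanos-query and thanos-store above):

# Compactor running and not reporting errors?
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-compact
kubectl logs -n monitoring -l app.kubernetes.io/name=thanos-compact --tail=50 | grep -iE "error|halt"

# In Prometheus/Thanos Query: thanos_compact_halted should be 0, and
# thanos_objstore_bucket_operations_total{operation="upload"} should be increasing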


Loki Issues

Loki Not Receiving Logs

Symptom: No logs visible in Grafana Explore

Diagnosis:

# Check Loki Write pods
kubectl get pods -n monitoring -l app.kubernetes.io/component=write

# Check Alloy pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy

# Test log ingestion
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels

Common Causes & Solutions:

1. Alloy Not Running

kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy

Solution:

# Check Alloy logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy

# Restart Alloy
kubectl rollout restart daemonset alloy -n monitoring

2. Alloy Can’t Reach Loki

Logs: failed to send batch: Post "http://loki-gateway/loki/api/v1/push": dial tcp: no route to host

Solution:

# Verify Loki Gateway service
kubectl get svc loki-gateway -n monitoring

# Test from Alloy pod
kubectl exec -n monitoring <alloy-pod> -- wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready

3. Loki Write Out of Memory

kubectl get events -n monitoring | grep OOM

Solution:

  • Increase Loki Write memory limits

  • Reduce max_chunk_age to flush more frequently

  • Add more Loki Write replicas

Loki Chunks Not Flushing to S3

Symptom: loki_boltdb_shipper_uploads_total not increasing

Diagnosis:

kubectl logs -n monitoring -l app.kubernetes.io/component=write | grep -i s3

Common Causes & Solutions:

1. S3 Credentials Missing or Invalid

Logs: failed to put object: SignatureDoesNotMatch

Solution:

# Check Loki S3 secret
kubectl get secret loki-s3-config -n monitoring -o yaml

# Verify ESO created secret
kubectl get externalsecret loki-s3-credentials-es -n monitoring
kubectl describe externalsecret loki-s3-credentials-es -n monitoring

2. WAL PVC Full

kubectl exec -n monitoring loki-write-0 -- df -h /var/loki

Solution:

# Expand the write-path PVC (find its exact name with: kubectl get pvc -n monitoring)
kubectl edit pvc <loki-write-pvc> -n monitoring

# Or force a flush by restarting the write path (a StatefulSet, matching the loki-write-0 pod above)
kubectl rollout restart statefulset loki-write -n monitoring

3. S3 Bucket Doesn’t Exist

kubectl get bucket logs-loki-kup6s -n crossplane-system

Solution:

  • Check the Crossplane bucket status (see below)

  • Verify ProviderConfig is correct
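
The bucket's conditions show whether Crossplane could reconcile it:

# READY and SYNCED should both be True
kubectl get bucket logs-loki-kup6s -n crossplane-system

# Inspect conditions and reconcile errors
kubectl describe bucket logs-loki-kup6s -n crossplane-system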

Loki Queries Timing Out

Symptom: Grafana shows “Loki: timeout exceeded”

Diagnosis:

# Check Loki Read pods
kubectl get pods -n monitoring -l app.kubernetes.io/component=read

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/component=read | grep -i timeout

Common Causes & Solutions:

1. Query Too Large

Logs: max query length exceeded

Solution:

  • Reduce time range in Grafana query

  • Add more specific label filters

  • Increase max_query_length in the Loki config (sketch below)
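
A hedged sketch of the relevant Loki limits (the values are examples; where they go in the Helm values depends on the chart version):

limits_config:
  max_query_length: 721h            # maximum time span a single query may cover
  split_queries_by_interval: 30m    # split long queries into smaller shards
  max_query_parallelism: 16         # bound concurrent shard execution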

2. Insufficient Read Replicas

kubectl top pods -n monitoring -l app.kubernetes.io/component=read

Solution:

  • Scale Read replicas: kubectl scale deployment loki-read -n monitoring --replicas=3

  • Increase resource limits

3. S3 Slow

# Check S3 request duration
kubectl port-forward -n monitoring svc/loki-read 3100:3100
curl http://localhost:3100/metrics | grep s3_request_duration

Solution:

  • Check network latency to S3

  • Increase query timeout

  • Add caching layer


Grafana Issues

Cannot Access Grafana UI

Symptom: https://grafana.ops.kup6s.net returns 404 or timeout

Diagnosis:

# Check Grafana pod
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana

# Check Ingress
kubectl get ingress -n monitoring

# Check cert-manager certificate
kubectl get certificate -n monitoring

Common Causes & Solutions:

1. Pod Not Running

kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana

Solution:

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

# Restart if needed
kubectl rollout restart deployment kube-prometheus-stack-grafana -n monitoring

2. Ingress Not Created

kubectl get ingress -n monitoring

Solution:

  • Verify Ingress enabled in Helm values

  • Check Traefik ingress controller running: kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik

3. TLS Certificate Failed

kubectl describe certificate grafana-tls -n monitoring

Solution:

  • Check cert-manager logs (see below)

  • Verify DNS A record points to cluster

  • Check Let’s Encrypt rate limits
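
To follow the issuance chain and the controller's view of it (assuming cert-manager runs in the cert-manager namespace):

# Certificate → CertificateRequest → Order/Challenge
kubectl get certificaterequest,order,challenge -n monitoring
kubectl describe challenge -n monitoring

# cert-manager controller logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=100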

Grafana Dashboards Not Loading

Symptom: Dashboards show “Error loading dashboard”

Diagnosis:

kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

Common Causes & Solutions:

1. Datasource Not Configured

Logs: datasource not found

Solution:

# Port-forward to Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# Check datasources: http://localhost:3000/datasources
# Verify "Thanos" and "Loki" datasources exist

2. Datasource Can’t Connect

Logs: dial tcp: i/o timeout

Solution:

# Test from Grafana pod
kubectl exec -n monitoring <grafana-pod> -- wget -O- http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy
kubectl exec -n monitoring <grafana-pod> -- wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready

3. PVC Full (Dashboard Storage)

kubectl exec -n monitoring <grafana-pod> -- df -h /var/lib/grafana

Solution:

  • Expand PVC

  • Delete old dashboard versions

  • Clean up orphaned snapshots


Resource Issues

CPU Throttling

Symptom: Slow queries, high latency

Diagnosis:

# Check CPU throttling metric
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1

Solution:

# Increase CPU limits in config.yaml
resources:
  prometheus:
    limits:
      cpu: 2000m  # Increased from 1000m

Memory Pressure

Symptom: OOMKilled events

Diagnosis:

kubectl get events -n monitoring | grep OOM
kubectl top pods -n monitoring --sort-by=memory

Solution:

  • Increase memory limits

  • Reduce retention

  • Add more replicas to distribute load

Storage Running Out

Symptom: PVCs approaching capacity

Diagnosis:

# Check PVC usage
kubectl exec -n monitoring <pod> -- df -h

# Check all PVCs (best effort: match each PVC to a pod by name prefix)
for pvc in $(kubectl get pvc -n monitoring -o name); do
  echo "=== $pvc ==="
  pod=$(kubectl get pod -n monitoring -o name | grep "$(echo "$pvc" | cut -d/ -f2 | sed 's/-pvc.*//')" | head -1)
  [ -n "$pod" ] && kubectl exec -n monitoring "$pod" -- df -h 2>/dev/null | grep -v "^Filesystem"
done

Solution:

  • Expand PVCs (if the storage class supports it; see the example below)

  • Reduce retention policies

  • Clean up old data
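
PVC expansion is a single patch of the storage request (the target size is an example; the storage class must have allowVolumeExpansion: true):

kubectl patch pvc <pvc-name> -n monitoring --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'

# Watch the resize progress
kubectl get pvc <pvc-name> -n monitoring -w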


Network Issues

DNS Resolution Failures

Symptom: Pods can’t resolve service names

Diagnosis:

kubectl exec -n monitoring <pod> -- nslookup thanos-query.monitoring.svc.cluster.local

Solution:

  • Check CoreDNS pods running: kubectl get pods -n kube-system -l k8s-app=kube-dns

  • Check CoreDNS logs: kubectl logs -n kube-system -l k8s-app=kube-dns

  • Restart CoreDNS: kubectl rollout restart deployment coredns -n kube-system

Service Unreachable

Symptom: Pods can’t connect to services

Diagnosis:

# Test service connectivity
kubectl exec -n monitoring <pod> -- wget -O- http://<service>:<port>/health

Solution:

  • Verify service exists: kubectl get svc <service> -n monitoring

  • Check endpoints: kubectl get endpoints <service> -n monitoring

  • Check network policies (if any)


See Also