Troubleshooting Guide

Common issues and solutions for the monitoring stack.

Quick Diagnostic Commands

# Check all monitoring pods
kubectl get pods -n monitoring

# Check pod logs
kubectl logs -n monitoring <pod-name> --tail=100

# Check pod events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20

# Check resource usage
kubectl top pods -n monitoring

# Check PVC status
kubectl get pvc -n monitoring

# Check S3 buckets
kubectl get bucket -A | grep -E "thanos|loki"

# Check secrets
kubectl get externalsecret -n monitoring

Pod Issues

Pods Stuck in Pending

Symptom: Pods show Pending state for >2 minutes

Diagnosis:

kubectl describe pod <pod-name> -n monitoring

Common Causes & Solutions:

1. Insufficient Resources

Events:
  Warning  FailedScheduling  0/5 nodes are available: insufficient memory

Solution:

  • Check node capacity and current allocations: kubectl describe nodes (see the example below)

  • Reduce resource requests in config.yaml

  • Add more nodes to the cluster
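
A quick way to confirm the shortage is to compare each node's allocatable resources with what is already requested:

# Current usage per node (requires metrics-server)
kubectl top nodes

# Allocatable vs. already-requested resources per node
kubectl describe nodes | grep -A 7 "Allocated resources"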

2. PVC Not Bound

Events:
  Warning  FailedScheduling  pod has unbound immediate PersistentVolumeClaims

Solution:

# Check PVC status
kubectl get pvc -n monitoring

# If pending, check storage class
kubectl get storageclass

# Check Longhorn status
kubectl get pods -n longhorn-system

3. Node Selector/Affinity Issues

Events:
  Warning  FailedScheduling  0/5 nodes match pod topology spread constraints

Solution:

  • Ensure nodes have correct labels

  • Adjust anti-affinity rules if too strict

  • Check topology.kubernetes.io/zone labels on nodes (see the example below)
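
The relevant labels can be listed directly on the nodes, for example:

# Show the zone label for every node
kubectl get nodes -L topology.kubernetes.io/zone

# Full label set of a single node
kubectl get node <node-name> --show-labels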

Pods in CrashLoopBackOff

Symptom: Pods restart repeatedly

Diagnosis:

kubectl logs -n monitoring <pod-name> --previous
kubectl describe pod <pod-name> -n monitoring

Common Causes & Solutions:

1. OOMKilled (Out of Memory)

State:          Terminated
  Reason:       OOMKilled
  Exit Code:    137

Solution:

# Increase memory in config.yaml
resources:
  prometheus:
    requests:
      memory: 2Gi  # Increased from 1500Mi
    limits:
      memory: 4Gi  # Increased from 3Gi

2. Misconfigured S3 Credentials

Logs: error: failed to upload block: access denied

Solution:

# Verify secret exists
kubectl get secret thanos-objstore-config -n monitoring

# Check secret content
kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d

# Verify ESO replication
kubectl get externalsecret thanos-s3-credentials-es -n monitoring
kubectl describe externalsecret thanos-s3-credentials-es -n monitoring

3. Liveness/Readiness Probe Failures

Events:
  Warning  Unhealthy  Liveness probe failed: Get http://10.42.0.1:9090/-/healthy: dial tcp timeout

Solution:

  • Check if component is actually healthy: kubectl exec -n monitoring <pod> -- wget -O- http://localhost:9090/-/healthy

  • Increase probe timeouts if the component is slow to start (see the sketch below)

  • Check component logs for startup errors
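
If the component is just slow to start, relax the probe settings. A minimal sketch of the standard Kubernetes probe fields (where these are set depends on the Helm chart; the values are examples):

livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 30   # give the process time to start
  timeoutSeconds: 5         # tolerate slow responses
  failureThreshold: 6       # tolerate transient failures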

Pods in ImagePullBackOff

Symptom: Cannot pull container image

Diagnosis:

kubectl describe pod <pod-name> -n monitoring

Common Causes & Solutions:

1. Image Not Found

Events:
  Warning  Failed     Failed to pull image "quay.io/thanos/thanos:v0.99.0": not found

Solution:

  • Check image tag in config.yaml

  • Verify image exists: docker pull quay.io/thanos/thanos:v0.36.1

  • Check for typos in image name

2. Rate Limiting

Events:
  Warning  Failed     toomanyrequests: You have reached your pull rate limit

Solution:

  • Use a registry mirror

  • Add an image pull secret for authenticated access (example below)

  • Wait for the rate limit to reset
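
For authenticated pulls, create a docker-registry secret and reference it from the workload (the registry and credentials below are placeholders):

# Create the pull secret
kubectl create secret docker-registry regcred -n monitoring \
  --docker-server=docker.io \
  --docker-username=<username> \
  --docker-password=<password>

# Then reference it in the pod spec or chart values:
# imagePullSecrets:
#   - name: regcred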


Prometheus Issues

Prometheus Not Scraping Targets

Symptom: Targets show as “DOWN” in Prometheus UI

Diagnosis:

# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Check targets: http://localhost:9090/targets

Common Causes & Solutions:

1. ServiceMonitor Not Detected

# Check if ServiceMonitor exists
kubectl get servicemonitor -A

# Check Prometheus operator logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator

Solution:

# Ensure the ServiceMonitor carries the label the operator selects on
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: kube-prometheus-stack  # Required label
spec:
  selector:
    matchLabels:
      app: <target-app>
  endpoints:
    - port: metrics

2. Network Policy Blocking

# Check if target is reachable from Prometheus pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  wget -O- http://<target-service>:<port>/metrics

Solution:

  • Add a network policy allowing Prometheus → target communication (see the sketch below)

  • Check service exists: kubectl get svc <target-service>
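
A minimal sketch of such a NetworkPolicy, applied in the target's namespace (the pod labels and port are assumptions and must match your workload and Prometheus pods):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: <target-namespace>
spec:
  podSelector:
    matchLabels:
      app: <target-app>                        # assumed label of the scraped pods
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus   # assumed Prometheus pod label
      ports:
        - port: <metrics-port>
          protocol: TCP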

3. TLS Certificate Issues

Logs: Get "https://target:8443/metrics": x509: certificate signed by unknown authority

Solution:

# In ServiceMonitor, skip TLS verification
spec:
  endpoints:
    - port: https
      scheme: https
      tlsConfig:
        insecureSkipVerify: true

Prometheus Disk Full

Symptom: Prometheus pod crash or slow queries

Diagnosis:

kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- df -h /prometheus

Common Causes & Solutions:

1. PVC Full

Filesystem      Size  Used Avail Use% Mounted on
/dev/longhorn   3.0G  2.9G   100M  97% /prometheus

Solution:

# Option 1: Expand PVC (if storage class supports it)
kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring
# Change: storage: 3Gi → 6Gi

# Option 2: Reduce retention
# In config.yaml:
retention: 2d  # Reduced from 3d

# Option 3: Delete old data manually (emergency)
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  rm -rf /prometheus/01HQOLDEST_BLOCK_ID/
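
Before trying Option 1, confirm the storage class actually allows expansion (longhorn is assumed as the class name here):

# Should print "true" if PVC expansion is supported
kubectl get storageclass longhorn -o jsonpath='{.allowVolumeExpansion}'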

2. WAL Growing Too Large

kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  du -sh /prometheus/wal

Solution:

  • Enable WAL compression (already enabled in config)

  • Restart Prometheus to compact WAL

  • Check for high-cardinality metrics causing excessive WAL growth (see the query below)
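
To find which metrics drive the cardinality, run a query like the following in the Prometheus UI:

# Top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))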

Prometheus High Memory Usage

Symptom: Prometheus pod using >80% of memory limit

Diagnosis:

kubectl top pod -n monitoring prometheus-kube-prometheus-stack-prometheus-0

Common Causes & Solutions:

1. Too Many Series

# Check series count in Prometheus UI
# Query: prometheus_tsdb_head_series

Solution:

  • Reduce scrape targets

  • Set a sample_limit so that scrapes of high-cardinality targets are rejected

  • Add relabel rules to drop unnecessary labels (see the sketch below)

  • Increase memory limits
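
Relabel rules live on the ServiceMonitor endpoint; a hedged sketch that drops an unneeded label and a high-cardinality metric family (the names are placeholders):

spec:
  endpoints:
    - port: metrics
      metricRelabelings:
        - action: labeldrop
          regex: pod_template_hash              # drop an unneeded label
        - action: drop
          sourceLabels: [__name__]
          regex: <noisy_metric_prefix>_.*       # drop a high-cardinality metric family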

2. Large Queries

# Check slow queries in logs
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 | grep "slow query"

Solution:

  • Limit query concurrency

  • Add query timeout

  • Optimize queries (use recording rules; see the sketch below)
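
Recording rules precompute expensive expressions; a minimal PrometheusRule sketch (the rule name and expression are examples only):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-recording-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the operator's ruleSelector
spec:
  groups:
    - name: example.rules
      rules:
        - record: namespace:container_cpu_usage_seconds:rate5m
          expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))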


Thanos Issues

Thanos Sidecar Not Uploading

Symptom: thanos_objstore_bucket_operations_total{operation="upload"} not increasing

Diagnosis:

kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar | grep -i upload

Common Causes & Solutions:

1. S3 Credentials Invalid

Logs: failed to upload block: access denied

Solution:

# Check secret
kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d

# Verify the credentials work (the Thanos image has no aws CLI, but ships the `thanos tools bucket` subcommands)
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar -- \
  thanos tools bucket ls --objstore.config-file=<path-to-mounted-objstore.yml>

2. No Blocks to Upload

Logs: no blocks to upload

Explanation: Normal if Prometheus just started. Blocks are created every 2 hours.

Verification:

# Check block creation
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  ls -lh /prometheus/ | grep "^d"

3. Network Issues

Logs: dial tcp: i/o timeout

Solution:

  • Check S3 endpoint reachable: kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar -- wget -O- https://fsn1.your-objectstorage.com

  • Check firewall/security groups

  • Verify DNS resolution (see below)
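
DNS resolution for the S3 endpoint can be checked from a throwaway pod inside the cluster:

kubectl run -it --rm dns-test --image=busybox --restart=Never -- \
  nslookup fsn1.your-objectstorage.com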

Thanos Query Not Showing Historical Data

Symptom: Queries only return last 3 days (Prometheus local data)

Diagnosis:

# Check Thanos Query logs
kubectl logs -n monitoring -l app.kubernetes.io/name=thanos-query | grep store

# Check stores connected
kubectl exec -n monitoring thanos-query-0 -- wget -qO- http://localhost:9090/api/v1/stores

Common Causes & Solutions:

1. Thanos Store Not Connected

{
  "status": "success",
  "data": {
    "stores": [
      {"name": "prometheus-0"}  // Only the sidecar is registered; no store
    ]
  }
}

Solution:

# Check Thanos Store pods running
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store

# Check service
kubectl get svc thanos-store -n monitoring

# Verify DNS SRV record
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
  dig SRV _grpc._tcp.thanos-store.monitoring.svc.cluster.local

2. Thanos Store Can’t Read S3

# Check Thanos Store logs
kubectl logs -n monitoring thanos-store-0 | grep -E "error|block"

Solution:

  • Verify S3 bucket exists: kubectl get bucket metrics-thanos-kup6s

  • Check Store has S3 credentials

  • Check blocks exist in S3

3. Time Range Issue

# Query returns no data for >3 days ago

Explanation: Historical data takes time to accumulate. Check:

  • Has cluster been running >3 days?

  • Have blocks been uploaded to S3?

  • Has the compactor run successfully? (see the checks below)
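
The last two points can be checked directly (the label selectors below are assumed to follow the same convention as thanos-query and thanos-store above):

# Compactor running and not reporting errors?
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-compact
kubectl logs -n monitoring -l app.kubernetes.io/name=thanos-compact --tail=50 | grep -iE "error|halt"

# In Prometheus/Thanos Query: thanos_compact_halted should be 0, and
# thanos_objstore_bucket_operations_total{operation="upload"} should be increasing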


Loki Issues

Loki Not Receiving Logs

Symptom: No logs visible in Grafana Explore

Diagnosis:

# Check Loki Write pods
kubectl get pods -n monitoring -l app.kubernetes.io/component=write

# Check Alloy pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy

# Test log ingestion
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels

Common Causes & Solutions:

1. Alloy Not Running

kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy

Solution:

# Check Alloy logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy

# Restart Alloy
kubectl rollout restart daemonset alloy -n monitoring

2. Alloy Can’t Reach Loki

Logs: failed to send batch: Post "http://loki-gateway/loki/api/v1/push": dial tcp: no route to host

Solution:

# Verify Loki Gateway service
kubectl get svc loki-gateway -n monitoring

# Test from Alloy pod
kubectl exec -n monitoring <alloy-pod> -- wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready

3. Loki Write Out of Memory

kubectl get events -n monitoring | grep OOM

Solution:

  • Increase Loki Write memory limits

  • Reduce max_chunk_age to flush more frequently

  • Add more Loki Write replicas

Loki Chunks Not Flushing to S3

Symptom: loki_boltdb_shipper_uploads_total not increasing

Diagnosis:

kubectl logs -n monitoring -l app.kubernetes.io/component=write | grep -i s3

Common Causes & Solutions:

1. S3 Credentials Missing or Invalid

Logs: failed to put object: SignatureDoesNotMatch

Solution:

# Check Loki S3 secret
kubectl get secret loki-s3-config -n monitoring -o yaml

# Verify ESO created secret
kubectl get externalsecret loki-s3-credentials-es -n monitoring
kubectl describe externalsecret loki-s3-credentials-es -n monitoring

2. WAL PVC Full

kubectl exec -n monitoring loki-write-0 -- df -h /var/loki

Solution:

# Expand the write-path PVC (find its exact name with: kubectl get pvc -n monitoring)
kubectl edit pvc <loki-write-pvc> -n monitoring

# Or force a flush by restarting the write path (a StatefulSet, matching the loki-write-0 pod above)
kubectl rollout restart statefulset loki-write -n monitoring

3. S3 Bucket Doesn’t Exist

kubectl get bucket logs-loki-kup6s -n crossplane-system

Solution:

  • Check the Crossplane bucket status (see below)

  • Verify ProviderConfig is correct
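
The bucket's conditions show whether Crossplane could reconcile it:

# READY and SYNCED should both be True
kubectl get bucket logs-loki-kup6s -n crossplane-system

# Inspect conditions and reconcile errors
kubectl describe bucket logs-loki-kup6s -n crossplane-system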

Loki Queries Timing Out

Symptom: Grafana shows “Loki: timeout exceeded”

Diagnosis:

# Check Loki Read pods
kubectl get pods -n monitoring -l app.kubernetes.io/component=read

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/component=read | grep -i timeout

Common Causes & Solutions:

1. Query Too Large

Logs: max query length exceeded

Solution:

  • Reduce time range in Grafana query

  • Add more specific label filters

  • Increase max_query_length in the Loki config (sketch below)
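
A hedged sketch of the relevant Loki limits (the values are examples; where they go in the Helm values depends on the chart version):

limits_config:
  max_query_length: 721h            # maximum time span a single query may cover
  split_queries_by_interval: 30m    # split long queries into smaller shards
  max_query_parallelism: 16         # bound concurrent shard execution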

2. Insufficient Read Replicas

kubectl top pods -n monitoring -l app.kubernetes.io/component=read

Solution:

  • Scale Read replicas: kubectl scale deployment loki-read -n monitoring --replicas=3

  • Increase resource limits

3. S3 Slow

# Check S3 request duration
kubectl port-forward -n monitoring svc/loki-read 3100:3100
curl http://localhost:3100/metrics | grep s3_request_duration

Solution:

  • Check network latency to S3

  • Increase query timeout

  • Add caching layer


Grafana Issues

Cannot Access Grafana UI

Symptom: https://grafana.ops.kup6s.net returns 404 or timeout

Diagnosis:

# Check Grafana pod
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana

# Check Ingress
kubectl get ingress -n monitoring

# Check cert-manager certificate
kubectl get certificate -n monitoring

Common Causes & Solutions:

1. Pod Not Running

kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana

Solution:

# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

# Restart if needed
kubectl rollout restart deployment kube-prometheus-stack-grafana -n monitoring

2. Ingress Not Created

kubectl get ingress -n monitoring

Solution:

  • Verify Ingress enabled in Helm values

  • Check Traefik ingress controller running: kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik

3. TLS Certificate Failed

kubectl describe certificate grafana-tls -n monitoring

Solution:

  • Check cert-manager logs (see below)

  • Verify DNS A record points to cluster

  • Check Let’s Encrypt rate limits
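
To follow the issuance chain and the controller's view of it (assuming cert-manager runs in the cert-manager namespace):

# Certificate → CertificateRequest → Order/Challenge
kubectl get certificaterequest,order,challenge -n monitoring
kubectl describe challenge -n monitoring

# cert-manager controller logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=100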

Grafana Dashboards Not Loading

Symptom: Dashboards show “Error loading dashboard”

Diagnosis:

kubectl logs -n monitoring -l app.kubernetes.io/name=grafana

Common Causes & Solutions:

1. Datasource Not Configured

Logs: datasource not found

Solution:

# Port-forward to Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# Check datasources: http://localhost:3000/datasources
# Verify "Thanos" and "Loki" datasources exist

2. Datasource Can’t Connect

Logs: dial tcp: i/o timeout

Solution:

# Test from Grafana pod
kubectl exec -n monitoring <grafana-pod> -- wget -O- http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy
kubectl exec -n monitoring <grafana-pod> -- wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready

3. PVC Full (Dashboard Storage)

kubectl exec -n monitoring <grafana-pod> -- df -h /var/lib/grafana

Solution:

  • Expand PVC

  • Delete old dashboard versions

  • Clean up orphaned snapshots


Resource Issues

CPU Throttling

Symptom: Slow queries, high latency

Diagnosis:

# Check CPU throttling metric
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1

Solution:

# Increase CPU limits in config.yaml
resources:
  prometheus:
    limits:
      cpu: 2000m  # Increased from 1000m

Memory Pressure

Symptom: OOMKilled events

Diagnosis:

kubectl get events -n monitoring | grep OOM
kubectl top pods -n monitoring --sort-by=memory

Solution:

  • Increase memory limits

  • Reduce retention

  • Add more replicas to distribute load

Storage Running Out

Symptom: PVCs approaching capacity

Diagnosis:

# Check PVC usage
kubectl exec -n monitoring <pod> -- df -h

# Check all PVCs (best effort: match each PVC to a pod by name prefix)
for pvc in $(kubectl get pvc -n monitoring -o name); do
  echo "=== $pvc ==="
  pod=$(kubectl get pod -n monitoring -o name | grep "$(echo "$pvc" | cut -d/ -f2 | sed 's/-pvc.*//')" | head -1)
  [ -n "$pod" ] && kubectl exec -n monitoring "$pod" -- df -h 2>/dev/null | grep -v "^Filesystem"
done

Solution:

  • Expand PVCs (if the storage class supports it; see the example below)

  • Reduce retention policies

  • Clean up old data
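
PVC expansion is a single patch of the storage request (the target size is an example; the storage class must have allowVolumeExpansion: true):

kubectl patch pvc <pvc-name> -n monitoring --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'

# Watch the resize progress
kubectl get pvc <pvc-name> -n monitoring -w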


Network Issues

DNS Resolution Failures

Symptom: Pods can’t resolve service names

Diagnosis:

kubectl exec -n monitoring <pod> -- nslookup thanos-query.monitoring.svc.cluster.local

Solution:

  • Check CoreDNS pods running: kubectl get pods -n kube-system -l k8s-app=kube-dns

  • Check CoreDNS logs: kubectl logs -n kube-system -l k8s-app=kube-dns

  • Restart CoreDNS: kubectl rollout restart deployment coredns -n kube-system

Service Unreachable

Symptom: Pods can’t connect to services

Diagnosis:

# Test service connectivity
kubectl exec -n monitoring <pod> -- wget -O- http://<service>:<port>/health

Solution:

  • Verify service exists: kubectl get svc <service> -n monitoring

  • Check endpoints: kubectl get endpoints <service> -n monitoring

  • Check network policies (if any)


See Also