Troubleshooting Guide¶
Common issues and solutions for the monitoring stack.
Quick Diagnostic Commands¶
# Check all monitoring pods
kubectl get pods -n monitoring
# Check pod logs
kubectl logs -n monitoring <pod-name> --tail=100
# Check pod events
kubectl get events -n monitoring --sort-by='.lastTimestamp' | tail -20
# Check resource usage
kubectl top pods -n monitoring
# Check PVC status
kubectl get pvc -n monitoring
# Check S3 buckets
kubectl get bucket -A | grep -E "thanos|loki"
# Check secrets
kubectl get externalsecret -n monitoring
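If you want all of these in one pass, a minimal wrapper like the following works (a sketch; it assumes the current kubectl context already points at the cluster and that metrics-server is available for kubectl top):
for cmd in \
  "kubectl get pods -n monitoring" \
  "kubectl get pvc -n monitoring" \
  "kubectl get externalsecret -n monitoring" \
  "kubectl top pods -n monitoring"; do
  echo "=== $cmd ==="   # print a header so the outputs stay readable
  $cmd
done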
Pod Issues¶
Pods Stuck in Pending¶
Symptom: Pods show Pending state for >2 minutes
Diagnosis:
kubectl describe pod <pod-name> -n monitoring
Common Causes & Solutions:
1. Insufficient Resources
Events:
Warning FailedScheduling pod has unbound immediate PersistentVolumeClaims
Warning FailedScheduling 0/5 nodes are available: insufficient memory
Solution:
Check node resources: kubectl describe nodes
Reduce resource requests in config.yaml
Add more nodes to cluster
2. PVC Not Bound
Events:
Warning FailedScheduling pod has unbound immediate PersistentVolumeClaims
Solution:
# Check PVC status
kubectl get pvc -n monitoring
# If pending, check storage class
kubectl get storageclass
# Check Longhorn status
kubectl get pods -n longhorn-system
3. Node Selector/Affinity Issues
Events:
Warning FailedScheduling 0/5 nodes match pod topology spread constraints
Solution:
Ensure nodes have correct labels
Adjust anti-affinity rules if too strict
Check topology.kubernetes.io/zone labels on nodes
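One way to check those zone labels (kubectl's -L flag prints a label as an extra column):
kubectl get nodes -L topology.kubernetes.io/zone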
Pods in CrashLoopBackOff¶
Symptom: Pods restart repeatedly
Diagnosis:
kubectl logs -n monitoring <pod-name> --previous
kubectl describe pod <pod-name> -n monitoring
Common Causes & Solutions:
1. OOMKilled (Out of Memory)
State: Terminated
Reason: OOMKilled
Exit Code: 137
Solution:
# Increase memory in config.yaml
resources:
  prometheus:
    requests:
      memory: 2Gi   # Increased from 1500Mi
    limits:
      memory: 4Gi   # Increased from 3Gi
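To confirm the restarts are really OOM kills rather than another failure mode, the last terminated state of each container can be read directly (a quick check; <pod-name> as above):
kubectl get pod <pod-name> -n monitoring -o jsonpath='{range .status.containerStatuses[*]}{.name}{" -> "}{.lastState.terminated.reason}{" (exit "}{.lastState.terminated.exitCode}{")"}{"\n"}{end}'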
2. Misconfigured S3 Credentials
Logs: error: failed to upload block: access denied
Solution:
# Verify secret exists
kubectl get secret thanos-objstore-config -n monitoring
# Check secret content
kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d
# Verify ESO replication
kubectl get externalsecret thanos-s3-credentials-es -n monitoring
kubectl describe externalsecret thanos-s3-credentials-es -n monitoring
3. Liveness/Readiness Probe Failures
Events:
Warning Unhealthy Liveness probe failed: Get http://10.42.0.1:9090/-/healthy: dial tcp timeout
Solution:
Check if component is actually healthy:
kubectl exec -n monitoring <pod> -- wget -O- http://localhost:9090/-/healthy
Increase probe timeouts if component is slow to start
Check component logs for startup errors
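To see the probe settings currently in effect before loosening them (note that for operator-managed pods the change belongs in the Helm values, not the pod spec):
# Dump liveness and readiness probe configuration for the pod's containers
kubectl get pod <pod-name> -n monitoring -o jsonpath='{.spec.containers[*].livenessProbe}'
kubectl get pod <pod-name> -n monitoring -o jsonpath='{.spec.containers[*].readinessProbe}'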
Pods in ImagePullBackOff¶
Symptom: Cannot pull container image
Diagnosis:
kubectl describe pod <pod-name> -n monitoring
Common Causes & Solutions:
1. Image Not Found
Events:
Warning Failed Failed to pull image "quay.io/thanos/thanos:v0.99.0": not found
Solution:
Check image tag in config.yaml
Verify image exists: docker pull quay.io/thanos/thanos:v0.36.1
Check for typos in image name
2. Rate Limiting
Events:
Warning Failed toomanyrequests: You have reached your pull rate limit
Solution:
Use registry mirror
Add image pull secret for authenticated access
Wait for rate limit to reset
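A sketch of adding an authenticated pull secret (the name regcred and the credentials are placeholders; the secret still has to be referenced via imagePullSecrets in the relevant Helm values or service account):
kubectl create secret docker-registry regcred -n monitoring \
  --docker-server=registry-1.docker.io \
  --docker-username=<docker-user> \
  --docker-password=<docker-access-token>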
Prometheus Issues¶
Prometheus Not Scraping Targets¶
Symptom: Targets show as “DOWN” in Prometheus UI
Diagnosis:
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Check targets: http://localhost:9090/targets
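With the port-forward still running, the same information is available from the API, which is easier to filter (assumes jq is installed locally):
# List unhealthy targets and the reason they are down
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | select(.health != "up") | "\(.scrapePool): \(.lastError)"'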
Common Causes & Solutions:
1. ServiceMonitor Not Detected
# Check if ServiceMonitor exists
kubectl get servicemonitor -A
# Check Prometheus operator logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus-operator
Solution:
# Ensure ServiceMonitor has correct labels
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: kube-prometheus-stack # Required label
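To confirm which labels and namespaces the operator's Prometheus actually selects on, inspect the Prometheus custom resource directly:
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.serviceMonitorSelector}'
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.serviceMonitorNamespaceSelector}'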
2. Network Policy Blocking
# Check if target is reachable from Prometheus pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
wget -O- http://<target-service>:<port>/metrics
Solution:
Add network policy allowing Prometheus → target communication
Check service exists:
kubectl get svc <target-service>
3. TLS Certificate Issues
Logs: Get "https://target:8443/metrics": x509: certificate signed by unknown authority
Solution:
# In ServiceMonitor, skip TLS verification
spec:
  endpoints:
  - port: https
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
Prometheus Disk Full¶
Symptom: Prometheus pod crash or slow queries
Diagnosis:
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- df -h /prometheus
Common Causes & Solutions:
1. PVC Full
Filesystem Size Used Avail Use% Mounted on
/dev/longhorn 3.0G 2.9G 100M 97% /prometheus
Solution:
# Option 1: Expand PVC (if storage class supports it)
kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-0 -n monitoring
# Change: storage: 3Gi → 6Gi
# Option 2: Reduce retention
# In config.yaml:
retention: 2d # Reduced from 3d
# Option 3: Delete old data manually (emergency)
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
rm -rf /prometheus/01HQOLDEST_BLOCK_ID/
2. WAL Growing Too Large
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
du -sh /prometheus/wal
Solution:
Enable WAL compression (already enabled in config)
Restart Prometheus to compact WAL
Check for high cardinality metrics causing excessive WAL growth
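To identify those high-cardinality metrics, Prometheus exposes head-block statistics on its TSDB status endpoint (assumes jq locally; reuse the port-forward from the scraping section):
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# In another shell: the metric names with the most series in the head block
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'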
Prometheus High Memory Usage¶
Symptom: Prometheus pod using >80% of memory limit
Diagnosis:
kubectl top pod -n monitoring prometheus-kube-prometheus-stack-prometheus-0
Common Causes & Solutions:
1. Too Many Series
# Check series count in Prometheus UI
# Query: prometheus_tsdb_head_series
Solution:
Reduce scrape targets
Set sample_limit to drop high-cardinality targets
Add relabel rules to drop unnecessary labels
Increase memory limits
2. Large Queries
# Check slow queries in logs
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 | grep "slow query"
Solution:
Limit query concurrency
Add query timeout
Optimize queries (use recording rules)
Thanos Issues¶
Thanos Sidecar Not Uploading¶
Symptom: thanos_objstore_bucket_operations_total{operation="upload"} not increasing
Diagnosis:
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar | grep -i upload
Common Causes & Solutions:
1. S3 Credentials Invalid
Logs: failed to upload block: access denied
Solution:
# Check secret
kubectl get secret thanos-objstore-config -n monitoring -o jsonpath='{.data.objstore\.yml}' | base64 -d
# Verify credentials work (the sidecar image ships only the thanos binary, so use a throwaway AWS CLI pod)
kubectl run -it --rm s3-test -n monitoring --image=amazon/aws-cli --restart=Never \
  --env=AWS_ACCESS_KEY_ID=<access-key> --env=AWS_SECRET_ACCESS_KEY=<secret-key> -- \
  s3 ls s3://metrics-thanos-kup6s --endpoint-url https://fsn1.your-objectstorage.com
2. No Blocks to Upload
Logs: no blocks to upload
Explanation: Normal if Prometheus just started. Blocks are created every 2 hours.
Verification:
# Check block creation
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
ls -lh /prometheus/ | grep "^d"
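The sidecar's own metrics also show whether uploads are happening (10902 is the default Thanos HTTP port; adjust if it was changed):
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar -- \
  wget -qO- http://localhost:10902/metrics | grep -E 'thanos_shipper_uploads_total|thanos_shipper_upload_failures_total'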
3. Network Issues
Logs: dial tcp: i/o timeout
Solution:
Check S3 endpoint reachable:
Check S3 endpoint reachable: kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c thanos-sidecar -- wget -O- https://fsn1.your-objectstorage.com
Check firewall/security groups
Verify DNS resolution
Thanos Query Not Showing Historical Data¶
Symptom: Queries only return last 3 days (Prometheus local data)
Diagnosis:
# Check Thanos Query logs
kubectl logs -n monitoring -l app.kubernetes.io/name=thanos-query | grep store
# Check stores connected
kubectl exec -n monitoring thanos-query-0 -- wget -qO- http://localhost:9090/api/v1/stores
Common Causes & Solutions:
1. Thanos Store Not Connected
{
  "status": "success",
  "data": {
    "stores": [
      {"name": "prometheus-0"}   // Only sidecar, no store
    ]
  }
}
Solution:
# Check Thanos Store pods running
kubectl get pods -n monitoring -l app.kubernetes.io/name=thanos-store
# Check service
kubectl get svc thanos-store -n monitoring
# Verify DNS SRV record
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
dig SRV _grpc._tcp.thanos-store.monitoring.svc.cluster.local
2. Thanos Store Can’t Read S3
# Check Thanos Store logs
kubectl logs -n monitoring thanos-store-0 | grep -E "error|block"
Solution:
Verify S3 bucket exists: kubectl get bucket metrics-thanos-kup6s
Check Store has S3 credentials
Check blocks exist in S3
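To see what the Store gateway has actually synced from the bucket, check its metrics (again assuming the default 10902 HTTP port):
kubectl exec -n monitoring thanos-store-0 -- \
  wget -qO- http://localhost:10902/metrics | grep -E 'thanos_blocks_meta_synced|thanos_objstore_bucket_operation_failures_total'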
3. Time Range Issue
# Query returns no data for >3 days ago
Explanation: Historical data takes time to accumulate. Check:
Has cluster been running >3 days?
Have blocks been uploaded to S3?
Has compactor run successfully?
Loki Issues¶
Loki Not Receiving Logs¶
Symptom: No logs visible in Grafana Explore
Diagnosis:
# Check Loki Write pods
kubectl get pods -n monitoring -l app.kubernetes.io/component=write
# Check Alloy pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
# Test log ingestion
kubectl port-forward -n monitoring svc/loki-gateway 3100:80
curl http://localhost:3100/loki/api/v1/labels
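With the port-forward above, an end-to-end check is to push a test line and read it back (a sketch; it assumes auth_enabled: false so no tenant header is needed, and a GNU date for the nanosecond timestamp):
# Push one log line through the gateway
curl -s -H "Content-Type: application/json" -X POST http://localhost:3100/loki/api/v1/push \
  -d '{"streams":[{"stream":{"job":"troubleshoot-test"},"values":[["'"$(date +%s%N)"'","hello from the troubleshooting guide"]]}]}'
# Read it back (query_range defaults to the last hour)
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="troubleshoot-test"}' | jq '.data.result'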
Common Causes & Solutions:
1. Alloy Not Running
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy
Solution:
# Check Alloy logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alloy
# Restart Alloy
kubectl rollout restart daemonset alloy -n monitoring
2. Alloy Can’t Reach Loki
Logs: failed to send batch: Post "http://loki-gateway/loki/api/v1/push": dial tcp: no route to host
Solution:
# Verify Loki Gateway service
kubectl get svc loki-gateway -n monitoring
# Test from Alloy pod
kubectl exec -n monitoring <alloy-pod> -- wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready
3. Loki Write Out of Memory
kubectl get events -n monitoring | grep OOM
Solution:
Increase Loki Write memory limits
Reduce max_chunk_age to flush more frequently
Add more Loki Write replicas
Loki Chunks Not Flushing to S3¶
Symptom: loki_boltdb_shipper_uploads_total not increasing
Diagnosis:
kubectl logs -n monitoring -l app.kubernetes.io/component=write | grep -i s3
Common Causes & Solutions:
1. S3 Credentials Missing
Logs: failed to put object: SignatureDoesNotMatch
Solution:
# Check Loki S3 secret
kubectl get secret loki-s3-config -n monitoring -o yaml
# Verify ESO created secret
kubectl get externalsecret loki-s3-credentials-es -n monitoring
kubectl describe externalsecret loki-s3-credentials-es -n monitoring
2. WAL PVC Full
kubectl exec -n monitoring loki-write-0 -- df -h /var/loki
Solution:
# Expand PVC
kubectl edit pvc loki-write -n monitoring
# Or force flush
kubectl rollout restart deployment loki-write -n monitoring
3. S3 Bucket Doesn’t Exist
kubectl get bucket logs-loki-kup6s -n crossplane-system
Solution:
Check Crossplane bucket status
Verify ProviderConfig is correct
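A quick way to check both, using the bucket name from the diagnosis above (Crossplane sets Ready/Synced conditions and emits events on sync failures):
kubectl get bucket logs-loki-kup6s -n crossplane-system -o yaml | grep -A 12 "conditions:"
kubectl describe bucket logs-loki-kup6s -n crossplane-system | tail -20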
Loki Queries Timing Out¶
Symptom: Grafana shows “Loki: timeout exceeded”
Diagnosis:
# Check Loki Read pods
kubectl get pods -n monitoring -l app.kubernetes.io/component=read
# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/component=read | grep -i timeout
Common Causes & Solutions:
1. Query Too Large
Logs: max query length exceeded
Solution:
Reduce time range in Grafana query
Add more specific label filters
Increase max_query_length in Loki config
2. Insufficient Read Replicas
kubectl top pods -n monitoring -l app.kubernetes.io/component=read
Solution:
Scale Read replicas: kubectl scale deployment loki-read -n monitoring --replicas=3
Increase resource limits
3. S3 Slow
# Check S3 request duration
kubectl port-forward -n monitoring svc/loki-read 3100:3100
curl http://localhost:3100/metrics | grep s3_request_duration
Solution:
Check network latency to S3
Increase query timeout
Add caching layer
Grafana Issues¶
Cannot Access Grafana UI¶
Symptom: https://grafana.ops.kup6s.net returns 404 or timeout
Diagnosis:
# Check Grafana pod
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
# Check Ingress
kubectl get ingress -n monitoring
# Check cert-manager certificate
kubectl get certificate -n monitoring
Common Causes & Solutions:
1. Pod Not Running
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
Solution:
# Check logs
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
# Restart if needed
kubectl rollout restart deployment kube-prometheus-stack-grafana -n monitoring
2. Ingress Not Created
kubectl get ingress -n monitoring
Solution:
Verify Ingress enabled in Helm values
Check Traefik ingress controller running:
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
3. TLS Certificate Failed
kubectl describe certificate grafana-tls -n monitoring
Solution:
Check cert-manager logs
Verify DNS A record points to cluster
Check Let’s Encrypt rate limits
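To follow the certificate through cert-manager (this assumes cert-manager runs in the cert-manager namespace with the standard chart labels):
# CertificateRequests and ACME orders/challenges show where issuance is stuck
kubectl get certificaterequest -n monitoring
kubectl get orders.acme.cert-manager.io,challenges.acme.cert-manager.io -A
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=100 | grep -i grafana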
Grafana Dashboards Not Loading¶
Symptom: Dashboards show “Error loading dashboard”
Diagnosis:
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
Common Causes & Solutions:
1. Datasource Not Configured
Logs: datasource not found
Solution:
# Port-forward to Grafana
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Check datasources: http://localhost:3000/datasources
# Verify "Thanos" and "Loki" datasources exist
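The provisioned datasources can also be listed via Grafana's HTTP API (a sketch; it assumes the chart's default admin user and admin secret name, and jq installed locally):
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# In another shell:
GRAFANA_PASSWORD=$(kubectl get secret kube-prometheus-stack-grafana -n monitoring -o jsonpath='{.data.admin-password}' | base64 -d)
curl -s -u admin:"$GRAFANA_PASSWORD" http://localhost:3000/api/datasources | jq -r '.[] | "\(.name): \(.type) -> \(.url)"'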
2. Datasource Can’t Connect
Logs: dial tcp: i/o timeout
Solution:
# Test from Grafana pod
kubectl exec -n monitoring <grafana-pod> -- wget -O- http://thanos-query.monitoring.svc.cluster.local:9090/-/healthy
kubectl exec -n monitoring <grafana-pod> -- wget -O- http://loki-gateway.monitoring.svc.cluster.local/ready
3. PVC Full (Dashboard Storage)
kubectl exec -n monitoring <grafana-pod> -- df -h /var/lib/grafana
Solution:
Expand PVC
Delete old dashboard versions
Clean up orphaned snapshots
Resource Issues¶
CPU Throttling¶
Symptom: Slow queries, high latency
Diagnosis:
# Check CPU throttling metric
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Query: rate(container_cpu_cfs_throttled_seconds_total{namespace="monitoring"}[5m]) > 0.1
Solution:
# Increase CPU limits in config.yaml
resources:
  prometheus:
    limits:
      cpu: 2000m   # Increased from 1000m
Memory Pressure¶
Symptom: OOMKilled events
Diagnosis:
kubectl get events -n monitoring | grep OOM
kubectl top pods -n monitoring --sort-by=memory
Solution:
Increase memory limits
Reduce retention
Add more replicas to distribute load
Storage Running Out¶
Symptom: PVCs approaching capacity
Diagnosis:
# Check PVC usage
kubectl exec -n monitoring <pod> -- df -h
# Check all PVCs
for pvc in $(kubectl get pvc -n monitoring -o name); do
echo "=== $pvc ==="
kubectl exec -n monitoring $(kubectl get pod -n monitoring -o name | grep $(echo $pvc | cut -d/ -f2 | sed 's/-pvc.*//' ) | head -1) -- df -h 2>/dev/null | grep -v "^Filesystem"
done
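Since the loop above depends on PVC and pod names lining up, an alternative that avoids exec entirely is to query the kubelet volume metrics (scraped by default with kube-prometheus-stack; assumes the Prometheus port-forward from earlier and jq):
# PVCs in the monitoring namespace above 80% usage
curl -s -G http://localhost:9090/api/v1/query --data-urlencode \
  'query=kubelet_volume_stats_used_bytes{namespace="monitoring"} / kubelet_volume_stats_capacity_bytes{namespace="monitoring"} > 0.8' \
  | jq -r '.data.result[].metric.persistentvolumeclaim'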
Solution:
Expand PVCs (if storage class supports it)
Reduce retention policies
Clean up old data
Network Issues¶
DNS Resolution Failures¶
Symptom: Pods can’t resolve service names
Diagnosis:
kubectl exec -n monitoring <pod> -- nslookup thanos-query.monitoring.svc.cluster.local
Solution:
Check CoreDNS pods running: kubectl get pods -n kube-system -l k8s-app=kube-dns
Check CoreDNS logs: kubectl logs -n kube-system -l k8s-app=kube-dns
Restart CoreDNS: kubectl rollout restart deployment coredns -n kube-system
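If CoreDNS itself looks healthy, confirm resolution from inside the cluster with a throwaway pod (same netshoot pattern as in the Thanos section; the pod name debug is arbitrary):
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
  dig +short thanos-query.monitoring.svc.cluster.local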
Service Unreachable¶
Symptom: Pods can’t connect to services
Diagnosis:
# Test service connectivity
kubectl exec -n monitoring <pod> -- wget -O- http://<service>:<port>/health
Solution:
Verify service exists: kubectl get svc <service> -n monitoring
Check endpoints: kubectl get endpoints <service> -n monitoring
Check network policies (if any)
See Also¶
Architecture Overview - System design
Resource Requirements - Expected resource usage
S3 Buckets - Bucket troubleshooting
How-To: Debug Issues - Step-by-step debugging