Troubleshooting¶
Overview¶
Common issues, symptoms, diagnosis steps, and solutions for GitLab BDA.
General troubleshooting workflow:
1. Identify symptom - What's broken? (UI down, slow, errors)
2. Check status - kubectl get pods (are pods running?)
3. Review logs - kubectl logs (what errors appear?)
4. Check resources - kubectl top (out of memory/CPU?)
5. Fix root cause - Apply solution
6. Verify - Test that the issue is resolved
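As a quick first pass, steps 2-4 can be run together; a minimal sketch (adjust the namespace if your deployment differs):
# Quick triage for the gitlabbda namespace
kubectl get pods -n gitlabbda
kubectl get events -n gitlabbda --sort-by='.lastTimestamp' | tail -n 20
kubectl top pods -n gitlabbda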
GitLab UI Issues¶
GitLab UI Completely Down (502 Bad Gateway)¶
Symptom: https://gitlab.staging.bluedynamics.eu returns 502 error
Diagnosis:
# Check webservice pods
kubectl get pods -l app=webservice -n gitlabbda
# Check pod logs
kubectl logs -l app=webservice -n gitlabbda --tail=50
Common causes:
1. Webservice pods CrashLoopBackOff
# Check events
kubectl describe pod gitlab-webservice-xxx -n gitlabbda
# Common error: PostgreSQL connection failed
# Error in logs: "PG::ConnectionBad: could not connect to server"
Solution: Check PostgreSQL
# Verify PostgreSQL running
kubectl get pods -l cnpg.io/cluster=gitlab-postgres -n gitlabbda
# Check pooler
kubectl get pods -l cnpg.io/poolerName=gitlab-postgres-pooler -n gitlabbda
# Test connection
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c "SELECT 1"
2. Redis connection failed
# Check Redis
kubectl get pods -l app.kubernetes.io/name=redis -n gitlabbda
kubectl logs redis-0 -n gitlabbda
# Test connection
kubectl exec -it redis-0 -n gitlabbda -- redis-cli ping
3. S3 credentials invalid
# Check ExternalSecret sync status
kubectl get externalsecret gitlab-s3-credentials -n gitlabbda
# If SYNCED=False, check secret in source namespace
kubectl get secret hetzner-s3 -n application-secrets
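If the ExternalSecret is not synced, its status conditions and events usually name the cause (missing source key, SecretStore authentication failure, etc.):
# Inspect sync conditions and events
kubectl describe externalsecret gitlab-s3-credentials -n gitlabbda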
GitLab UI Slow (High Latency)¶
Symptom: Pages take >5 seconds to load
Diagnosis:
# Check resource usage
kubectl top pods -n gitlabbda
# Check webservice logs for slow queries
kubectl logs -l app=webservice -n gitlabbda | grep -i "slow\|timeout"
Common causes:
1. High CPU usage
kubectl top pods -l app=webservice -n gitlabbda
# Output: webservice-xxx 950m (95% of 1000m limit)
Solution: Scale webservice
# Temporary: Increase replicas
kubectl scale deploy gitlab-webservice -n gitlabbda --replicas=3
# Permanent: Update config.yaml
replicas:
webservice: 3
# Rebuild and apply
2. PostgreSQL slow queries
# Check database connections
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c "SELECT count(*) FROM pg_stat_activity"
# Check slow queries in Loki
{namespace="gitlabbda", pod=~"gitlab-postgres-.*"} |= "duration" | regexp `duration: (?P<duration>[0-9.]+) ms` | duration > 1000
Solution: Increase pooler connections or database resources
# In database.ts
postgresql:
parameters:
max_connections: "400" # Was 200
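Both fixes above end with updating the source and rebuilding. A sketch of that flow, assuming the manifests are regenerated from the TypeScript/config sources and synced through ArgoCD (the build commands are assumptions, not confirmed by this runbook):
# Regenerate manifests from source (assumed cdk8s-style build)
npm run build   # or: npx cdk8s synth
# Commit so GitOps picks up the change
git add . && git commit -m "Scale webservice / raise max_connections" && git push
# Trigger or verify the sync
argocd app sync gitlab-bda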
Login Failures¶
Symptom: Users can’t log in (password rejected)
Diagnosis:
# Check GitLab logs
kubectl logs -l app=webservice -n gitlabbda | grep -i "authentication\|login"
# Check if root password is correct
kubectl get secret gitlab-initial-root-password -n gitlabbda -o jsonpath='{.data.password}' | base64 -d
Common causes:
1. LDAP/OAuth misconfigured (if enabled)
Check GitLab Admin → Settings → Sign-in restrictions
Verify OAuth application credentials (Harbor integration)
2. Database lock on users table
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c "SELECT * FROM pg_locks WHERE NOT granted AND relation::regclass::text = 'users'"
Solution: Restart webservice pods to clear locks
kubectl rollout restart deploy/gitlab-webservice -n gitlabbda
Git Operations Issues¶
Git Clone Fails (Connection Refused)¶
Symptom: git clone https://gitlab.../repo.git fails
Diagnosis:
# Test HTTP git endpoint
curl -I https://gitlab.staging.bluedynamics.eu/user/repo.git/info/refs?service=git-upload-pack
# Check workhorse (proxies git HTTP)
kubectl get pods -l app=workhorse -n gitlabbda
kubectl logs -l app=workhorse -n gitlabbda --tail=50
Common causes:
1. Repository not found
Verify repository exists in GitLab UI
Check user has access (Reporter role minimum)
2. Gitaly down
kubectl get pods -l app=gitaly -n gitlabbda
kubectl logs gitlab-gitaly-0 -n gitlabbda
Solution: Restart Gitaly if crashed
kubectl delete pod gitlab-gitaly-0 -n gitlabbda
# StatefulSet recreates pod
Git Push Hangs or Times Out¶
Symptom: git push hangs for minutes, then times out
Diagnosis:
# Check Gitaly CPU (large push = high CPU)
kubectl top pod gitlab-gitaly-0 -n gitlabbda
# Check Gitaly logs for errors
kubectl logs gitlab-gitaly-0 -n gitlabbda | tail -n 100
Common causes:
1. Large push (>1GB)
Gitaly processing large pack file (high CPU, normal)
Wait for completion (may take 5-10 minutes)
2. Hetzner Volume full
# Check Gitaly PVC usage
kubectl exec -it gitlab-gitaly-0 -n gitlabbda -- df -h /home/git/repositories
Solution: Resize Hetzner Volume
# Via Hetzner Cloud Console
# Volumes → gitaly-data → Resize → 30Gi
# Or via Helm chart update
gitaly:
persistence:
size: 30Gi
SSH Git Access Not Working¶
Symptom: git clone git@gitlab.../repo.git fails with “Connection refused”
Diagnosis:
# Check SSH routing (Traefik)
kubectl get ingressroutetcp gitlab-ssh -n gitlabbda
# Check GitLab Shell pods
kubectl get pods -l app=gitlab-shell -n gitlabbda
kubectl logs -l app=gitlab-shell -n gitlabbda
Common causes:
1. SSH key not uploaded to GitLab
Add SSH key in GitLab UI (User Settings → SSH Keys)
2. Traefik SSH entrypoint not configured
Check cluster Traefik configuration (cluster infrastructure)
Solution: Verify SSH connectivity
# Test SSH (should show GitLab welcome)
ssh -T git@gitlab.staging.bluedynamics.eu
# Expected output:
# Welcome to GitLab, @username!
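If the welcome banner does not appear, checking raw TCP reachability and the verbose SSH handshake narrows down whether Traefik is routing port 22 at all (assuming SSH is exposed on the standard port):
# Is port 22 reachable through the load balancer?
nc -vz gitlab.staging.bluedynamics.eu 22
# Verbose handshake shows where the connection stops
ssh -Tv git@gitlab.staging.bluedynamics.eu 2>&1 | grep -iE "connect|refused|denied"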
CI/CD Issues¶
Pipelines Stuck in “Pending”¶
Symptom: CI jobs stay in “pending” state, never run
Diagnosis:
# Check GitLab Runner pods
kubectl get pods -l app=gitlab-runner -n gitlabbda
# Check runner logs
kubectl logs -l app=gitlab-runner -n gitlabbda --tail=50
Common causes:
1. No runners available
# GitLab UI: Admin → Runners
# Verify runners are registered and active
Solution: Check runner registration token
# In GitLab UI: Settings → CI/CD → Runners
# Verify registration token matches ExternalSecret
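To compare the token GitLab expects with the one the runner mounts, decode the runner secret; the secret and key names below are assumptions, so check the ExternalSecret definition for the actual ones:
# Decode the registration token the runner is using (names assumed)
kubectl get secret gitlab-runner-secret -n gitlabbda \
  -o jsonpath='{.data.runner-registration-token}' | base64 -d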
2. Sidekiq queue backed up
# Check Sidekiq pod
kubectl get pods -l app=sidekiq -n gitlabbda
kubectl logs -l app=sidekiq -n gitlabbda | grep -i "pipeline\|job"
Solution: Scale Sidekiq
kubectl scale deploy gitlab-sidekiq -n gitlabbda --replicas=2
CI Jobs Failing with “docker: command not found”¶
Symptom: Jobs using Docker commands fail
Diagnosis: Check GitLab Runner configuration
Common causes:
1. Runner not configured for Docker-in-Docker
Runners need privileged: true or a docker.sock mount
Solution: Verify runner config (future: configure DinD)
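A quick way to confirm whether privileged mode is enabled is to read the rendered runner config inside the pod; the deployment name and config path are assumptions based on the default GitLab Runner image layout:
# Look for privileged = true in the executor section
kubectl exec -it deploy/gitlab-runner -n gitlabbda -- \
  cat /home/gitlab-runner/.gitlab-runner/config.toml | grep -i -A2 privileged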
Artifacts Upload Failing¶
Symptom: Job succeeds but artifacts not saved
Diagnosis:
# Check S3 connectivity
kubectl logs -l app=webservice -n gitlabbda | grep -i "s3\|artifact"
# Check S3 credentials
kubectl get secret gitlab-s3-credentials -n gitlabbda -o yaml
Common causes:
1. S3 bucket doesn’t exist
kubectl get bucket artifacts-gitlabbda-kup6s -n crossplane-system
# Should show READY=True, SYNCED=True
Solution: Verify Crossplane bucket provisioning
# Check Crossplane provider logs
kubectl logs -n crossplane-system -l pkg.crossplane.io/provider=provider-aws-s3
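The Bucket's conditions and events usually state why provisioning failed (bad credentials, unreachable endpoint, name conflict):
# Inspect Crossplane Bucket conditions and events
kubectl describe bucket artifacts-gitlabbda-kup6s -n crossplane-system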
Harbor Registry Issues¶
Docker Pull Fails (Image Not Found)¶
Symptom: docker pull registry.../project/image:tag returns 404
Diagnosis:
# Check if image exists in Harbor UI
# https://registry.staging.bluedynamics.eu
# Check Harbor Registry logs
kubectl logs -l app.kubernetes.io/name=harbor-registry -n gitlabbda
Common causes:
1. Image not pushed - Verify image exists in project
2. S3 connectivity issue
# Check S3 credentials
kubectl get secret harbor-s3-credentials -n gitlabbda -o yaml
# Test S3 connectivity from pod
kubectl exec -it deploy/harbor-registry -n gitlabbda -- wget -O- https://fsn1.your-objectstorage.com
Harbor UI Not Loading¶
Symptom: https://registry.staging.bluedynamics.eu returns 502 or timeout
Diagnosis:
# Check Harbor pods
kubectl get pods -l app.kubernetes.io/part-of=harbor -n gitlabbda
# Check ingress
kubectl get ingress harbor-ingress -n gitlabbda
Common causes:
1. Harbor Core pod crashed
kubectl logs -l app.kubernetes.io/name=harbor-core -n gitlabbda --previous
# Check for PostgreSQL connection errors
Solution: Verify PostgreSQL harbor database exists
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -l | grep harbor
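If the database is missing, it can be created manually; a sketch only, assuming the Harbor role and ownership are managed elsewhere (CNPG-managed databases may expect a declarative fix instead):
# Create the harbor database if it does not exist (ownership assumed to be handled separately)
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- \
  psql -U postgres -c "CREATE DATABASE harbor"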
Database Issues¶
PostgreSQL Primary Down¶
Symptom: All database connections fail
Diagnosis:
# Check CNPG cluster status
kubectl get cluster gitlab-postgres -n gitlabbda
# Check pod status
kubectl get pods -l cnpg.io/cluster=gitlab-postgres -n gitlabbda
Common causes:
1. Primary pod crashed
kubectl describe pod gitlab-postgres-1 -n gitlabbda
# Look for: OOMKilled, Evicted, CrashLoopBackOff
Solution: CNPG auto-failover to standby
# Monitor failover
kubectl get cluster gitlab-postgres -n gitlabbda -w
# Expected: Standby promoted to primary within 30 seconds
2. Both instances down
Critical failure, requires manual intervention
Check cluster events:
kubectl get events -n gitlabbda --sort-by='.lastTimestamp'
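In either case it helps to confirm which instance currently holds the primary role; the cnpg kubectl plugin, if installed, gives a fuller status view:
# Show primary/replica roles per pod
kubectl get pods -l cnpg.io/cluster=gitlab-postgres -n gitlabbda -L cnpg.io/instanceRole
# Richer status (requires the kubectl-cnpg plugin)
kubectl cnpg status gitlab-postgres -n gitlabbda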
High Replication Lag¶
Symptom: Standby falls behind primary (>1 minute lag)
Diagnosis:
# Check replication lag
kubectl get cluster gitlab-postgres -n gitlabbda -o jsonpath='{.status.instances[*].lag}'
# Check standby logs
kubectl logs gitlab-postgres-2 -n gitlabbda | grep -i "replication\|lag"
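The primary's pg_stat_replication view reports lag directly and is a useful cross-check (assuming gitlab-postgres-1 is currently the primary):
# Replication state and lag as seen from the primary
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- \
  psql -U postgres -c "SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication"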
Common causes:
1. High write volume
Primary under heavy load (many inserts/updates)
Standby can’t keep up
Solution: Increase standby resources
# In database.ts
resources:
requests:
cpu: 200m # Was 100m
memory: 512Mi # Was 256Mi
2. Network issues
Check pod-to-pod network latency
Connection Pool Exhausted¶
Symptom: Applications can’t connect (max connections reached)
Diagnosis:
# Check active connections
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c "SELECT count(*) FROM pg_stat_activity"
# Check max connections
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -c "SHOW max_connections"
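A breakdown by connection state shows whether the pool is full of idle clients or genuinely busy sessions:
# Count connections per state (active, idle, idle in transaction, ...)
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- \
  psql -U postgres -d gitlab -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC"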
Solution: Increase max_connections or pooler size
# In database.ts
postgresql:
parameters:
max_connections: "400" # Was 200
# Or increase pooler size
pgbouncer:
parameters:
default_pool_size: "50" # Was 25
Storage Issues¶
Longhorn PVC Stuck in Pending¶
Symptom: PVC remains Pending, pod can’t start
Diagnosis:
kubectl describe pvc redis-data -n gitlabbda
# Look for: Events showing "no nodes available"
Common causes:
1. Insufficient cluster storage
# Check Longhorn node storage
kubectl get nodes.longhorn.io -n longhorn-system
# Check available storage per node
Solution: Add cluster storage or increase node disk size
2. Storage class not found
kubectl get storageclass | grep longhorn
Hetzner Volume Resize Failed¶
Symptom: Gitaly PVC still shows old size after resize
Diagnosis:
kubectl describe pvc gitaly-data -n gitlabbda
Solution: Hetzner Volumes require a pod restart after resize
# 1. Resize via Hetzner Console
# 2. Delete Gitaly pod
kubectl delete pod gitlab-gitaly-0 -n gitlabbda
# 3. StatefulSet recreates pod with new size
S3 Bucket Access Denied¶
Symptom: Applications can’t read/write S3 buckets
Diagnosis:
# Check Crossplane Bucket status
kubectl get bucket artifacts-gitlabbda-kup6s -n crossplane-system
# Check S3 credentials in application-secrets
kubectl get secret hetzner-s3 -n application-secrets -o yaml
Solution: Verify S3 credentials are correct
# Test S3 access from pod
kubectl run -it s3-test --image=amazon/aws-cli --rm --env="AWS_ACCESS_KEY_ID=xxx" --env="AWS_SECRET_ACCESS_KEY=yyy" -- \
s3 ls s3://artifacts-gitlabbda-kup6s --endpoint-url=https://fsn1.your-objectstorage.com
Performance Issues¶
High Memory Usage (OOMKills)¶
Symptom: Pods restarting with “OOMKilled” reason
Diagnosis:
# Check pod events
kubectl get events -n gitlabbda --sort-by='.lastTimestamp' | grep OOM
# Check memory usage
kubectl top pods -n gitlabbda
Solution: Increase memory limits
# In GitLab Helm chart
webservice:
resources:
limits:
memory: 3Gi # Was 2.5Gi
High CPU Usage¶
Symptom: Pods at 100% CPU, slow responses
Diagnosis:
kubectl top pods -n gitlabbda
kubectl top nodes
Common causes:
1. CPU limits too low
Pods throttled at limit
Solution: Increase CPU limits
webservice:
resources:
limits:
cpu: 1500m # Was 1000m
2. High traffic - Scale horizontally
kubectl scale deploy gitlab-webservice -n gitlabbda --replicas=4
ArgoCD Sync Issues¶
Application OutOfSync¶
Symptom: ArgoCD shows “OutOfSync” status
Diagnosis:
kubectl describe application gitlab-bda -n argocd
# Check sync status
kubectl get application gitlab-bda -n argocd -o jsonpath='{.status.sync.status}'
Common causes:
1. Manual changes in cluster
Someone ran kubectl apply directly (bypassing GitOps)
Solution: Revert manual changes or update git
# Option 1: Revert (sync from git)
argocd app sync gitlab-bda
# Option 2: Accept drift (update git to match cluster)
kubectl get <resource> -n gitlabbda -o yaml > manifests/resource.yaml
# Commit to git
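To see exactly what drifted before choosing between the two options, the ArgoCD CLI can show the live-vs-git diff:
# Show the live-versus-git diff for the application
argocd app diff gitlab-bda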
2. Resource validation failed
# Check application conditions
kubectl get application gitlab-bda -n argocd -o jsonpath='{.status.conditions}'
Sync Wave Issues (Resources Not Ready)¶
Symptom: Later waves fail because earlier waves not ready
Diagnosis:
# Check resources by sync wave (waves are set via the argocd.argoproj.io/sync-wave annotation, not a label)
kubectl get all -n gitlabbda -o custom-columns='KIND:.kind,NAME:.metadata.name,WAVE:.metadata.annotations.argocd\.argoproj\.io/sync-wave'
Solution: Ensure sync waves are correct
Wave 0: Namespace
Wave 1: RBAC, S3 Buckets, S3 Credentials
Wave 2: App Secrets, Redis, ObjectStore
Wave 3: PostgreSQL, GitLab, Harbor
Wave 5: Ingresses
Emergency Procedures¶
Complete GitLab Outage (All Pods Down)¶
1. Assess damage
kubectl get pods -n gitlabbda
kubectl get events -n gitlabbda --sort-by='.lastTimestamp' | tail -n 50
2. Check dependencies first
# PostgreSQL
kubectl get cluster gitlab-postgres -n gitlabbda
# Redis
kubectl get pods -l app.kubernetes.io/name=redis -n gitlabbda
# S3 buckets
kubectl get buckets -n crossplane-system | grep gitlabbda
3. Restart in order (if dependencies OK)
# Redis (Wave 2)
kubectl delete pod redis-0 -n gitlabbda
# Wait for Redis ready
kubectl wait --for=condition=ready pod redis-0 -n gitlabbda --timeout=300s
# GitLab (Wave 3)
kubectl rollout restart deploy -n gitlabbda
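After the restarts, verify that the rollouts complete and the UI responds (endpoint taken from this environment):
# Wait for the webservice rollout to finish
kubectl rollout status deploy/gitlab-webservice -n gitlabbda --timeout=600s
# Smoke-test the UI
curl -sk -o /dev/null -w "%{http_code}\n" https://gitlab.staging.bluedynamics.eu/users/sign_in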
Restore from Backup¶
GitLab backup restore:
# 1. Stop GitLab
kubectl scale deploy --all -n gitlabbda --replicas=0
# 2. Restore from S3
kubectl exec -it deploy/gitlab-toolbox -n gitlabbda -- bash
# BACKUP is the backup ID without the _gitlab_backup.tar suffix
gitlab-backup restore BACKUP=<timestamp>
# On chart-based installs the toolbox may provide backup-utility instead,
# e.g. backup-utility --restore -t <timestamp>
# 3. Start GitLab (or re-sync via ArgoCD to restore the declared replica counts)
kubectl scale deploy --all -n gitlabbda --replicas=1
PostgreSQL PITR restore:
# Via CNPG Cluster spec
spec:
bootstrap:
recovery:
source: gitlab-postgres
recoveryTarget:
targetTime: "2025-10-27 10:00:00"
Summary¶
Common issue categories:
Pods not running - Check status, logs, resources
Connectivity failures - Check networking, ingress, services
Database issues - Check CNPG status, connections, replication
Storage issues - Check PVC status, Longhorn, S3 credentials
Performance - Check resource usage, scale horizontally
First steps always:
kubectl get pods -n gitlabbda
kubectl logs <pod> -n gitlabbda
kubectl describe <resource> -n gitlabbda
Check Grafana dashboards
Query Loki logs
For detailed commands, see kubectl Commands Reference.