Troubleshooting


Overview

Common issues, symptoms, diagnosis steps, and solutions for GitLab BDA.

General troubleshooting workflow:

  1. Identify symptom - What’s broken? (UI down, slow, errors)

  2. Check status - kubectl get pods (are pods running?)

  3. Review logs - kubectl logs (what errors appear?)

  4. Check resources - kubectl top (out of memory/CPU?)

  5. Fix root cause - Apply solution

  6. Verify - Test that issue is resolved
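
In practice the first steps collapse into a handful of commands (namespace assumed to be gitlabbda throughout this guide):

# 1-2. Symptom and pod status
kubectl get pods -n gitlabbda
kubectl get events -n gitlabbda --sort-by='.lastTimestamp' | tail -n 20

# 3. Logs from a suspect pod
kubectl logs <pod-name> -n gitlabbda --tail=100

# 4. Resource pressure
kubectl top pods -n gitlabbda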


GitLab UI Issues

GitLab UI Completely Down (502 Bad Gateway)

Symptom: https://gitlab.staging.bluedynamics.eu returns 502 error

Diagnosis:

# Check webservice pods
kubectl get pods -l app=webservice -n gitlabbda

# Check pod logs
kubectl logs -l app=webservice -n gitlabbda --tail=50

Common causes:

1. Webservice pods CrashLoopBackOff

# Check events
kubectl describe pod gitlab-webservice-xxx -n gitlabbda

# Common error: PostgreSQL connection failed
# Error in logs: "PG::ConnectionBad: could not connect to server"

Solution: Check PostgreSQL

# Verify PostgreSQL running
kubectl get pods -l cnpg.io/cluster=gitlab-postgres -n gitlabbda

# Check pooler
kubectl get pods -l cnpg.io/poolerName=gitlab-postgres-pooler -n gitlabbda

# Test connection
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c "SELECT 1"

2. Redis connection failed

# Check Redis
kubectl get pods -l app.kubernetes.io/name=redis -n gitlabbda
kubectl logs redis-0 -n gitlabbda

# Test connection
kubectl exec -it redis-0 -n gitlabbda -- redis-cli ping

3. S3 credentials invalid

# Check ExternalSecret sync status
kubectl get externalsecret gitlab-s3-credentials -n gitlabbda

# If SYNCED=False, check secret in source namespace
kubectl get secret hetzner-s3 -n application-secrets
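
After fixing the source secret, a re-sync can be forced by touching an annotation on the ExternalSecret (a common External Secrets Operator pattern; any annotation change triggers reconciliation):

# Force the ExternalSecret to reconcile now
kubectl annotate externalsecret gitlab-s3-credentials -n gitlabbda force-sync=$(date +%s) --overwrite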

GitLab UI Slow (High Latency)

Symptom: Pages take >5 seconds to load

Diagnosis:

# Check resource usage
kubectl top pods -n gitlabbda

# Check webservice logs for slow queries
kubectl logs -l app=webservice -n gitlabbda | grep -i "slow\|timeout"

Common causes:

1. High CPU usage

kubectl top pods -l app=webservice -n gitlabbda
# Output: webservice-xxx  950m (95% of 1000m limit)

Solution: Scale webservice

# Temporary: Increase replicas
kubectl scale deploy gitlab-webservice -n gitlabbda --replicas=3

# Permanent: Update config.yaml
replicas:
  webservice: 3
# Rebuild and apply

2. PostgreSQL slow queries

# Check database connections
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c "SELECT count(*) FROM pg_stat_activity"

# Check slow queries in Loki
{namespace="gitlabbda", pod=~"gitlab-postgres-.*"} |= "duration" | regexp `duration: (?P<duration>[0-9.]+) ms` | duration > 1000

Solution: Increase pooler connections or database resources

# In database.ts
postgresql:
  parameters:
    max_connections: "400"  # Was 200
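
Slow queries can also be inspected directly in PostgreSQL, assuming the pg_stat_statements extension is enabled (column names below are PostgreSQL 13+):

kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c \
  "SELECT calls, round(mean_exec_time) AS mean_ms, left(query, 80) AS query FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"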

Login Failures

Symptom: Users can’t log in (password rejected)

Diagnosis:

# Check GitLab logs
kubectl logs -l app=webservice -n gitlabbda | grep -i "authentication\|login"

# Check if root password is correct
kubectl get secret gitlab-initial-root-password -n gitlabbda -o jsonpath='{.data.password}' | base64 -d

Common causes:

1. LDAP/OAuth misconfigured (if enabled)

  • Check GitLab Admin → Settings → Sign-in restrictions

  • Verify OAuth application credentials (Harbor integration)

2. Database lock on users table

kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c "SELECT * FROM pg_locks WHERE NOT granted AND relation::regclass::text = 'users'"

Solution: Restart webservice pods to clear locks

kubectl rollout restart deploy/gitlab-webservice -n gitlabbda
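
If locks persist after the restart, the blocking session can be identified and terminated directly (pg_blocking_pids and pg_terminate_backend are standard PostgreSQL; terminate with care):

# Show which sessions block others
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c \
  "SELECT pid, pg_blocking_pids(pid) AS blocked_by, left(query, 60) AS query FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0"

# Terminate a blocking session by pid
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -c "SELECT pg_terminate_backend(<blocking-pid>)"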

Git Operations Issues

Git Clone Fails (Connection Refused)

Symptom: git clone https://gitlab.../repo.git fails

Diagnosis:

# Test HTTP git endpoint
curl -I https://gitlab.staging.bluedynamics.eu/user/repo.git/info/refs?service=git-upload-pack

# Check workhorse (proxies git HTTP; runs as a sidecar container in the webservice pods)
kubectl get pods -l app=webservice -n gitlabbda
kubectl logs -l app=webservice -c gitlab-workhorse -n gitlabbda --tail=50

Common causes:

1. Repository not found

  • Verify repository exists in GitLab UI

  • Check user has access (Reporter role minimum)

2. Gitaly down

kubectl get pods -l app=gitaly -n gitlabbda
kubectl logs gitlab-gitaly-0 -n gitlabbda

Solution: Restart Gitaly if crashed

kubectl delete pod gitlab-gitaly-0 -n gitlabbda
# StatefulSet recreates pod

Git Push Hangs or Times Out

Symptom: git push hangs for minutes, then times out

Diagnosis:

# Check Gitaly CPU (large push = high CPU)
kubectl top pod gitlab-gitaly-0 -n gitlabbda

# Check Gitaly logs for errors
kubectl logs gitlab-gitaly-0 -n gitlabbda | tail -n 100

Common causes:

1. Large push (>1GB)

  • Gitaly processing large pack file (high CPU, normal)

  • Wait for completion (may take 5-10 minutes)

2. Hetzner Volume full

# Check Gitaly PVC usage
kubectl exec -it gitlab-gitaly-0 -n gitlabbda -- df -h /home/git/repositories

Solution: Resize Hetzner Volume

# Via Hetzner Cloud Console
# Volumes → gitaly-data → Resize → 30Gi

# Or via Helm chart update
gitaly:
  persistence:
    size: 30Gi
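
If the StorageClass backing the Gitaly volume has allowVolumeExpansion enabled, the PVC can also be grown in place (PVC name as used in the resize section further down; otherwise resize via the Hetzner Console as above):

kubectl patch pvc gitaly-data -n gitlabbda --type merge -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'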

SSH Git Access Not Working

Symptom: git clone git@gitlab.../repo.git fails with “Connection refused”

Diagnosis:

# Check SSH routing (Traefik)
kubectl get ingressroutetcp gitlab-ssh -n gitlabbda

# Check GitLab Shell pods
kubectl get pods -l app=gitlab-shell -n gitlabbda
kubectl logs -l app=gitlab-shell -n gitlabbda

Common causes:

1. SSH key not uploaded to GitLab

  • Add SSH key in GitLab UI (User Settings → SSH Keys)

2. Traefik SSH entrypoint not configured

  • Check cluster Traefik configuration (cluster infrastructure)

Solution: Verify SSH connectivity

# Test SSH (should show GitLab welcome)
ssh -T git@gitlab.staging.bluedynamics.eu

# Expected output:
# Welcome to GitLab, @username!
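
If the test fails, verbose SSH output shows whether the TCP connection, host key exchange, or key authentication is the step that breaks:

ssh -Tvv git@gitlab.staging.bluedynamics.eu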

CI/CD Issues

Pipelines Stuck in “Pending”

Symptom: CI jobs stay in “pending” state, never run

Diagnosis:

# Check GitLab Runner pods
kubectl get pods -l app=gitlab-runner -n gitlabbda

# Check runner logs
kubectl logs -l app=gitlab-runner -n gitlabbda --tail=50

Common causes:

1. No runners available

# GitLab UI: Admin → Runners
# Verify runners are registered and active

Solution: Check runner registration token

# In GitLab UI: Settings → CI/CD → Runners
# Verify registration token matches ExternalSecret
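
Registration can also be checked from inside the runner pod with the gitlab-runner CLI (deployment name gitlab-runner is an assumption based on the label above):

# Checks that each registered runner can authenticate against GitLab
kubectl exec -it deploy/gitlab-runner -n gitlabbda -- gitlab-runner verify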

2. Sidekiq queue backed up

# Check Sidekiq pod
kubectl get pods -l app=sidekiq -n gitlabbda
kubectl logs -l app=sidekiq -n gitlabbda | grep -i "pipeline\|job"

Solution: Scale Sidekiq

kubectl scale deploy gitlab-sidekiq -n gitlabbda --replicas=2

CI Jobs Failing with “docker: command not found”

Symptom: Jobs using Docker commands fail

Diagnosis: Check GitLab Runner configuration

Common causes:

1. Runner not configured for Docker-in-Docker

  • Runners need privileged: true or docker.sock mount

Solution: Verify runner config (future: configure DinD)
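
A minimal sketch of what Docker-in-Docker needs, assuming the runner is deployed with the official gitlab-runner Helm chart; the values and job definition below are illustrative, not the current configuration:

# gitlab-runner Helm values (sketch): allow privileged build pods
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        privileged = true

# .gitlab-ci.yml job (sketch): run docker commands against the dind service
build-image:
  image: docker:27
  services:
    - docker:27-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - docker build -t registry.staging.bluedynamics.eu/project/image:tag .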

Artifacts Upload Failing

Symptom: Job succeeds but artifacts not saved

Diagnosis:

# Check S3 connectivity
kubectl logs -l app=webservice -n gitlabbda | grep -i "s3\|artifact"

# Check S3 credentials
kubectl get secret gitlab-s3-credentials -n gitlabbda -o yaml

Common causes:

1. S3 bucket doesn’t exist

kubectl get bucket artifacts-gitlabbda-kup6s -n crossplane-system
# Should show READY=True, SYNCED=True

Solution: Verify Crossplane bucket provisioning

# Check Crossplane provider logs
kubectl logs -n crossplane-system -l pkg.crossplane.io/provider=provider-aws-s3

Harbor Registry Issues

Docker Push Fails (Unauthorized)

Symptom: docker push registry.../image:tag returns 401 Unauthorized

Diagnosis:

# Test docker login
docker login registry.staging.bluedynamics.eu -u testuser -p <gitlab-token>

# Check Harbor Core logs
kubectl logs -l app.kubernetes.io/name=harbor-core -n gitlabbda

Common causes:

1. Invalid GitLab personal access token

  • Generate new token in GitLab (Scopes: read_registry, write_registry)

2. Harbor OAuth not configured

kubectl get secret harbor-secrets -n gitlabbda -o jsonpath='{.data.gitlab-oauth-client-id}' | base64 -d

Solution: Verify OAuth credentials in application-secrets

Docker Pull Fails (Image Not Found)

Symptom: docker pull registry.../project/image:tag returns 404

Diagnosis:

# Check if image exists in Harbor UI
# https://registry.staging.bluedynamics.eu

# Check Harbor Registry logs
kubectl logs -l app.kubernetes.io/name=harbor-registry -n gitlabbda

Common causes:

1. Image not pushed - Verify image exists in project

2. S3 connectivity issue

# Check S3 credentials
kubectl get secret harbor-s3-credentials -n gitlabbda -o yaml

# Test S3 connectivity from pod
kubectl exec -it deploy/harbor-registry -n gitlabbda -- wget -O- https://fsn1.your-objectstorage.com

Harbor UI Not Loading

Symptom: https://registry.staging.bluedynamics.eu returns 502 or timeout

Diagnosis:

# Check Harbor pods
kubectl get pods -l app.kubernetes.io/part-of=harbor -n gitlabbda

# Check ingress
kubectl get ingress harbor-ingress -n gitlabbda

Common causes:

1. Harbor Core pod crashed

kubectl logs -l app.kubernetes.io/name=harbor-core -n gitlabbda --previous
# Check for PostgreSQL connection errors

Solution: Verify PostgreSQL harbor database exists

kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -l | grep harbor

Database Issues

PostgreSQL Primary Down

Symptom: All database connections fail

Diagnosis:

# Check CNPG cluster status
kubectl get cluster gitlab-postgres -n gitlabbda

# Check pod status
kubectl get pods -l cnpg.io/cluster=gitlab-postgres -n gitlabbda

Common causes:

1. Primary pod crashed

kubectl describe pod gitlab-postgres-1 -n gitlabbda
# Look for: OOMKilled, Evicted, CrashLoopBackOff

Solution: CNPG auto-failover to standby

# Monitor failover
kubectl get cluster gitlab-postgres -n gitlabbda -w

# Expected: Standby promoted to primary within 30 seconds
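
If automatic failover does not happen and the cnpg kubectl plugin is installed, status and a manual promotion look like this (instance name is an example):

# Cluster overview: current primary, instance status, replication
kubectl cnpg status gitlab-postgres -n gitlabbda

# Manually promote a standby instance
kubectl cnpg promote gitlab-postgres gitlab-postgres-2 -n gitlabbda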

2. Both instances down

  • Critical failure, requires manual intervention

  • Check cluster events: kubectl get events -n gitlabbda --sort-by='.lastTimestamp'

High Replication Lag

Symptom: Standby falls behind primary (>1 minute lag)

Diagnosis:

# Check replication lag from the primary (pg_stat_replication)
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -c "SELECT application_name, state, write_lag, flush_lag, replay_lag FROM pg_stat_replication"

# Check standby logs
kubectl logs gitlab-postgres-2 -n gitlabbda | grep -i "replication\|lag"

Common causes:

1. High write volume

  • Primary under heavy load (many inserts/updates)

  • Standby can’t keep up

Solution: Increase standby resources

# In database.ts
resources:
  requests:
    cpu: 200m  # Was 100m
    memory: 512Mi  # Was 256Mi

2. Network issues

  • Check pod-to-pod network latency

Connection Pool Exhausted

Symptom: Applications can’t connect (max connections reached)

Diagnosis:

# Check active connections
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab -c "SELECT count(*) FROM pg_stat_activity"

# Check max connections
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -c "SHOW max_connections"

Solution: Increase max_connections or pooler size

# In database.ts
postgresql:
  parameters:
    max_connections: "400"  # Was 200

# Or increase pooler size
pgbouncer:
  parameters:
    default_pool_size: "50"  # Was 25
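
Before raising limits it is worth checking what actually holds the connections; grouping pg_stat_activity by client usually points at the culprit:

kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -c \
  "SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2, 3 ORDER BY 4 DESC"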

Storage Issues

Longhorn PVC Stuck in Pending

Symptom: PVC remains Pending, pod can’t start

Diagnosis:

kubectl describe pvc redis-data -n gitlabbda
# Look for: Events showing "no nodes available"

Common causes:

1. Insufficient cluster storage

# Check Longhorn node storage
kubectl get nodes.longhorn.io -n longhorn-system

# Available storage per node: Longhorn UI → Node tab, or .status.diskStatus on each nodes.longhorn.io object

Solution: Add cluster storage or increase node disk size

2. Storage class not found

kubectl get storageclass | grep longhorn

Hetzner Volume Resize Failed

Symptom: Gitaly PVC still shows old size after resize

Diagnosis:

kubectl describe pvc gitaly-data -n gitlabbda

Solution: Hetzner Volumes require pod restart after resize

# 1. Resize via Hetzner Console
# 2. Delete Gitaly pod
kubectl delete pod gitlab-gitaly-0 -n gitlabbda
# 3. StatefulSet recreates pod with new size

S3 Bucket Access Denied

Symptom: Applications can’t read/write S3 buckets

Diagnosis:

# Check Crossplane Bucket status
kubectl get bucket artifacts-gitlabbda-kup6s -n crossplane-system

# Check S3 credentials in application-secrets
kubectl get secret hetzner-s3 -n application-secrets -o yaml

Solution: Verify S3 credentials are correct

# Test S3 access from pod
kubectl run -it s3-test --image=amazon/aws-cli --rm --env="AWS_ACCESS_KEY_ID=xxx" --env="AWS_SECRET_ACCESS_KEY=yyy" -- \
  s3 ls s3://artifacts-gitlabbda-kup6s --endpoint-url=https://fsn1.your-objectstorage.com

Performance Issues

High Memory Usage (OOMKills)

Symptom: Pods restarting with “OOMKilled” reason

Diagnosis:

# Check pod events
kubectl get events -n gitlabbda --sort-by='.lastTimestamp' | grep OOM

# Check memory usage
kubectl top pods -n gitlabbda

Solution: Increase memory limits

# In GitLab Helm chart
webservice:
  resources:
    limits:
      memory: 3Gi  # Was 2.5Gi

High CPU Usage

Symptom: Pods at 100% CPU, slow responses

Diagnosis:

kubectl top pods -n gitlabbda
kubectl top nodes

Common causes:

1. CPU limits too low

  • Pods throttled at limit

Solution: Increase CPU limits

webservice:
  resources:
    limits:
      cpu: 1500m  # Was 1000m

2. High traffic - Scale horizontally

kubectl scale deploy gitlab-webservice -n gitlabbda --replicas=4
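
If traffic spikes are routine, a HorizontalPodAutoscaler removes the manual step (a sketch; deployment name taken from the scale command above, thresholds are examples):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-webservice
  namespace: gitlabbda
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-webservice
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75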

ArgoCD Sync Issues

Application OutOfSync

Symptom: ArgoCD shows “OutOfSync” status

Diagnosis:

kubectl describe application gitlab-bda -n argocd

# Check sync status
kubectl get application gitlab-bda -n argocd -o jsonpath='{.status.sync.status}'

Common causes:

1. Manual changes in cluster

  • Someone ran kubectl apply directly (bypasses GitOps)

Solution: Revert manual changes or update git

# Option 1: Revert (sync from git)
argocd app sync gitlab-bda

# Option 2: Accept drift (update git to match cluster)
kubectl get <resource> -n gitlabbda -o yaml > manifests/resource.yaml
# Commit to git

2. Resource validation failed

# Check application conditions
kubectl get application gitlab-bda -n argocd -o jsonpath='{.status.conditions}'

Sync Wave Issues (Resources Not Ready)

Symptom: Resources in later waves fail because earlier waves are not ready yet

Diagnosis:

# Check sync-wave assignments (waves are annotations, not labels)
kubectl get all -n gitlabbda -o custom-columns='NAME:.metadata.name,WAVE:.metadata.annotations.argocd\.argoproj\.io/sync-wave'

Solution: Ensure sync waves are correct

  • Wave 0: Namespace

  • Wave 1: RBAC, S3 Buckets, S3 Credentials

  • Wave 2: App Secrets, Redis, ObjectStore

  • Wave 3: PostgreSQL, GitLab, Harbor

  • Wave 5: Ingresses
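
Waves are assigned with the argocd.argoproj.io/sync-wave annotation on each manifest; a fragment showing only the relevant metadata for an ingress in the last wave:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitlab-webservice
  namespace: gitlabbda
  annotations:
    argocd.argoproj.io/sync-wave: "5"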


Emergency Procedures

Complete GitLab Outage (All Pods Down)

1. Assess damage

kubectl get pods -n gitlabbda
kubectl get events -n gitlabbda --sort-by='.lastTimestamp' | tail -n 50

2. Check dependencies first

# PostgreSQL
kubectl get cluster gitlab-postgres -n gitlabbda

# Redis
kubectl get pods -l app.kubernetes.io/name=redis -n gitlabbda

# S3 buckets
kubectl get buckets -n crossplane-system | grep gitlabbda

3. Restart in order (if dependencies OK)

# Redis (Wave 2)
kubectl delete pod redis-0 -n gitlabbda

# Wait for Redis ready
kubectl wait --for=condition=ready pod redis-0 -n gitlabbda --timeout=300s

# GitLab (Wave 3)
kubectl rollout restart deploy -n gitlabbda

Restore from Backup

GitLab backup restore:

# 1. Stop components that write to the database (keep the toolbox running)
kubectl scale deploy -l app=webservice -n gitlabbda --replicas=0
kubectl scale deploy -l app=sidekiq -n gitlabbda --replicas=0

# 2. Restore from S3 (the cloud-native chart toolbox ships backup-utility)
kubectl exec -it deploy/gitlab-toolbox -n gitlabbda -- backup-utility --restore -t <timestamp>

# 3. Start GitLab again (restore the replica counts from config.yaml)
kubectl scale deploy -l app=webservice -n gitlabbda --replicas=1
kubectl scale deploy -l app=sidekiq -n gitlabbda --replicas=1

PostgreSQL PITR restore:

# Via CNPG Cluster spec
spec:
  bootstrap:
    recovery:
      source: gitlab-postgres
      recoveryTarget:
        targetTime: "2025-10-27 10:00:00"
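
CNPG restores into a new Cluster object rather than in place; a fuller sketch of a point-in-time recovery manifest (cluster name, storage size, bucket path, secret name, and key names below are assumptions to adapt):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: gitlab-postgres-restore
  namespace: gitlabbda
spec:
  instances: 2
  storage:
    size: 20Gi
  bootstrap:
    recovery:
      source: gitlab-postgres
      recoveryTarget:
        targetTime: "2025-10-27 10:00:00+00"
  externalClusters:
    - name: gitlab-postgres
      barmanObjectStore:
        destinationPath: s3://<backup-bucket>/
        endpointURL: https://fsn1.your-objectstorage.com
        s3Credentials:
          accessKeyId:
            name: gitlab-s3-credentials
            key: access-key-id
          secretAccessKey:
            name: gitlab-s3-credentials
            key: secret-access-key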

Summary

Common issue categories:

  1. Pods not running - Check status, logs, resources

  2. Connectivity failures - Check networking, ingress, services

  3. Database issues - Check CNPG status, connections, replication

  4. Storage issues - Check PVC status, Longhorn, S3 credentials

  5. Performance - Check resource usage, scale horizontally

First steps always:

  1. kubectl get pods -n gitlabbda

  2. kubectl logs <pod> -n gitlabbda

  3. kubectl describe <resource> -n gitlabbda

  4. Check Grafana dashboards

  5. Query Loki logs

For detailed commands, see kubectl Commands Reference.