Explanation

Monitoring & Observability


Overview

GitLab BDA integrates with the KUP6S cluster monitoring stack (Prometheus, Grafana, Loki). This document explains the monitoring strategy, WHY we monitor specific metrics, and HOW to use observability tools for troubleshooting.

Monitoring stack (cluster infrastructure):

  • Prometheus - Metrics collection and storage

  • Grafana - Dashboards and visualization

  • Loki - Log aggregation and querying

  • Alloy - Log shipping (replaces Promtail)


Monitoring Strategy

What We Monitor

Application metrics (GitLab, Harbor):

  • Availability - Are services responding? (HTTP 200 vs 500)

  • Performance - Response time, throughput

  • Errors - 4xx/5xx error rates

  • Saturation - Resource usage (CPU, memory, disk)

Infrastructure metrics (PostgreSQL, Redis, Storage):

  • Database - Connections, queries/sec, replication lag

  • Cache - Hit ratio, evictions, memory usage

  • Storage - Disk usage, IOPS, throughput

Business metrics (GitLab-specific):

  • CI/CD - Pipeline duration, job success rate

  • Git - Push/pull frequency, repository size

  • Users - Active users, session duration

Why This Strategy?

Four Golden Signals (Google SRE):

  1. Latency - How long requests take (HTTP response time)

  2. Traffic - How many requests (req/sec)

  3. Errors - How many failed requests (error rate)

  4. Saturation - How full the service is (CPU/memory usage)

RED Method (for request-driven services):

  • Rate - Requests per second

  • Errors - Error rate

  • Duration - Latency (p50, p95, p99)

USE Method (for resources):

  • Utilization - % used (CPU at 80%)

  • Saturation - Queue depth (pending requests)

  • Errors - Error count
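
As a quick illustration, the RED signals above map to PromQL queries against the GitLab webservice metrics listed later in this document. This is a sketch: the job and status label values are assumptions and should be checked against the actual scrape configuration.

# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total{job="gitlab-webservice"}[5m]))

# Errors: share of responses that are 5xx
sum(rate(http_requests_total{job="gitlab-webservice", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="gitlab-webservice"}[5m]))

# Duration: p95 latency from the request duration histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="gitlab-webservice"}[5m])) by (le))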


Metrics Collection

Prometheus ServiceMonitors

GitLab BDA exposes metrics via ServiceMonitor CRDs (Prometheus Operator):

1. PostgreSQL (CNPG):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gitlab-postgres
spec:
  selector:
    matchLabels:
      cnpg.io/cluster: gitlab-postgres
  endpoints:
    - port: metrics
      interval: 30s

Metrics exposed:

  • pg_up - PostgreSQL instance up/down

  • pg_stat_database_* - Database stats (queries, transactions, deadlocks)

  • pg_stat_replication_* - Replication lag, sync state

  • pg_stat_bgwriter_* - Background writer stats

  • pg_locks_count - Lock contention
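
A few PromQL spot checks against these exporter metrics (a sketch; exact series and label names should be verified in Grafana Explore, since the CNPG exporter may prefix or label them differently):

# Is every PostgreSQL instance up? (0 = down)
pg_up{namespace="gitlabbda"}

# Committed transactions per second, per database
rate(pg_stat_database_xact_commit{namespace="gitlabbda"}[5m])

# Deadlocks over the last hour (should stay at 0)
increase(pg_stat_database_deadlocks{namespace="gitlabbda"}[1h])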

2. GitLab (built-in exporters):

# GitLab Webservice exposes metrics at :8083/metrics
# GitLab Sidekiq exposes metrics at :3807/metrics
# GitLab Gitaly exposes metrics at :9236/metrics

Metrics exposed:

  • http_requests_total - HTTP request count (by status code)

  • http_request_duration_seconds - Request latency histogram

  • sidekiq_jobs_* - Background job stats (queued, processing, failed)

  • gitaly_* - Git operation metrics (clone, push, diff duration)
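
To confirm an exporter is actually serving metrics, a port-forward plus curl is usually enough. This is a sketch: the Service name gitlab-webservice-default and the exposed port are assumptions, so check kubectl get svc -n gitlabbda for the real values.

# Forward the webservice metrics port to localhost
kubectl port-forward svc/gitlab-webservice-default 8083:8083 -n gitlabbda &

# Sample the exposed metrics
curl -s http://localhost:8083/metrics | grep http_requests_total | head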

3. Harbor (future):

# Harbor Core exposes Prometheus metrics at :8080/metrics
# Harbor Registry exposes at :5001/metrics

Custom Metrics

GitLab Runner (CI/CD):

  • gitlab_runner_jobs - Running jobs count

  • gitlab_runner_job_duration_seconds - Job duration

Harbor (registry):

  • harbor_project_count - Number of projects

  • harbor_repo_count - Number of repositories

  • harbor_artifact_count - Number of images


Log Aggregation

Loki Log Collection

How it works:

All pods → stdout/stderr → Alloy (DaemonSet) → Loki → S3 (logs-loki-kup6s)

Log labels (automatic):

  • namespace - Kubernetes namespace (gitlabbda)

  • pod - Pod name (gitlab-webservice-xxx)

  • container - Container name (webservice)

  • app - App label (gitlab, harbor)

Log Queries (LogQL)

Find errors in GitLab:

{namespace="gitlabbda", app="gitlab"} |= "error"

PostgreSQL slow queries:

{namespace="gitlabbda", pod=~"gitlab-postgres-.*"} |= "duration"
| regexp `duration: (?P<duration>[0-9.]+) ms`
| duration > 1000

CI/CD job failures:

{namespace="gitlabbda", app="gitlab", container="sidekiq"} |= "JobFailed"

For a complete LogQL guide, see the Main Cluster Docs: Query Loki Logs.


Grafana Dashboards

Pre-built Dashboards (Cluster)

KUP6S cluster provides:

  • Kubernetes / Compute Resources / Namespace (Pods) - Pod CPU/memory

  • Kubernetes / Networking / Namespace (Pods) - Pod network traffic

  • CNPG PostgreSQL - Database metrics, replication lag

  • Longhorn - Storage usage, IOPS

Access: https://grafana.ops.kup6s.net (cluster-wide Grafana)

Custom Dashboards (GitLab BDA)

Future: Create GitLab-specific dashboards

Dashboard 1: GitLab Application Overview

  • HTTP requests/sec - webservice traffic

  • Response time (p95) - latency

  • Error rate - 5xx errors

  • Active users - logged-in users

  • CI/CD jobs - queued, running, failed

Dashboard 2: GitLab Infrastructure

  • PostgreSQL - Connections, queries/sec, replication lag

  • Redis - Memory usage, hit ratio, evictions

  • Gitaly - Git operations/sec, storage usage

  • S3 - Upload/download throughput

Dashboard 3: Harbor Registry

  • Image pulls/pushes - Registry traffic

  • Storage usage - S3 bucket size

  • Scan results - Vulnerabilities by severity

  • Replication - Replication job status

Creating dashboards:

# Export dashboard JSON from Grafana
# Store in: dp-infra/gitlabbda/monitoring/dashboards/
# Import via Grafana UI or ConfigMap
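
If the cluster Grafana runs the kube-prometheus-stack dashboard sidecar (an assumption to verify), exported JSON can be shipped as a labelled ConfigMap instead of importing by hand; a minimal sketch with hypothetical names:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gitlab-overview-dashboard     # hypothetical name
  namespace: gitlabbda
  labels:
    grafana_dashboard: "1"            # label the Grafana sidecar watches for (default sidecar config)
data:
  # paste the exported dashboard JSON as the value below
  gitlab-overview.json: |
    {"title": "GitLab Application Overview", "panels": []}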

Alerting Strategy

Alert Priorities

Critical (P1) - Immediate action required:

  • GitLab/Harbor down (HTTP 5xx for >5 min)

  • PostgreSQL down (no primary instance)

  • Disk >90% full (data loss risk)

High (P2) - Action required within 1 hour:

  • PostgreSQL replication lag >1 minute

  • CI/CD queue depth >50 jobs for >10 min

  • High error rate (>5% 5xx errors)

Medium (P3) - Action required within 24 hours:

  • Slow queries (>1 second for >10 min)

  • High memory usage (>85% for >30 min)

  • Certificate expiring in <7 days

Low (P4) - Informational:

  • Disk >80% full

  • High CPU usage (>80% for >1 hour)

Alert Routing

Current (cluster-level):

  • Prometheus → Alertmanager → Email (SMTP)

Future (GitLab BDA-specific):

# Alertmanager route (nested under the top-level route: block of the Alertmanager config)
routes:
  - match:
      namespace: gitlabbda
    receiver: gitlab-admins
    group_by: [alertname, severity]
    group_wait: 30s
    group_interval: 5m

receivers:
  - name: gitlab-admins
    email_configs:
      - to: admin@example.com

Example Alerts

GitLab Down:

- alert: GitLabDown
  expr: up{job="gitlab-webservice"} == 0
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: GitLab webservice is down
    description: No gitlab-webservice pods responding for 5 minutes

PostgreSQL Replication Lag:

- alert: PostgresReplicationLag
  expr: pg_replication_lag_seconds > 60
  for: 5m
  labels: {severity: high}
  annotations:
    summary: PostgreSQL replication lag high
    description: Standby is {{ $value }}s behind primary

High CI Job Queue:

- alert: CIJobQueueHigh
  expr: sidekiq_queue_size{queue="pipeline"} > 50
  for: 10m
  labels: {severity: high}
  annotations:
    summary: CI/CD job queue high
    description: "{{ $value }} jobs queued (slow processing)"
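
On a Prometheus Operator cluster, rules like these are normally deployed as a PrometheusRule resource rather than raw rule files. A sketch wrapping the first alert (the labels the operator's ruleSelector expects are cluster-specific and should be checked):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gitlabbda-alerts
  namespace: gitlabbda
spec:
  groups:
    - name: gitlabbda.rules
      rules:
        - alert: GitLabDown
          expr: up{job="gitlab-webservice"} == 0
          for: 5m
          labels: {severity: critical}
          annotations:
            summary: GitLab webservice is down
            description: No gitlab-webservice pods responding for 5 minutes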

Troubleshooting Workflows

Performance Degradation

Symptom: GitLab slow, users report timeouts

Investigation:

  1. Check application metrics (Grafana)

    • High response time? → Check webservice pods

    • High error rate? → Check application logs

  2. Check infrastructure

    • PostgreSQL connections maxed? → Scale pooler

    • Redis memory full? → Check eviction rate

    • Gitaly CPU high? → Large git operations

  3. Check resource usage (kubectl top)

    kubectl top pods -n gitlabbda
    kubectl top nodes
    
  4. Check logs (Loki)

    {namespace="gitlabbda"} |= "error" | logfmt
    

For complete troubleshooting steps, see the Troubleshooting Reference.

CI/CD Pipeline Failures

Symptom: Pipelines stuck in “pending” or failing

Investigation:

  1. Check Sidekiq queue (metrics)

    • sidekiq_queue_size high? → Sidekiq overloaded

  2. Check GitLab Runner (logs)

    kubectl logs -l app=gitlab-runner -n gitlabbda
    
  3. Check job logs (GitLab UI)

    • Network errors? → Check external connectivity

    • Out of memory? → Increase runner memory
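
To see whether Sidekiq itself is the bottleneck, the queue metrics can be checked directly in Prometheus. A sketch; the exact series names (e.g. sidekiq_jobs_failed_total) are assumptions based on the sidekiq_jobs_* family listed under Metrics Collection:

# All Sidekiq queues and their current depth
sidekiq_queue_size{namespace="gitlabbda"}

# Failure rate of background jobs over the last 15 minutes (assumed series name)
rate(sidekiq_jobs_failed_total{namespace="gitlabbda"}[15m])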

Database Performance

Symptom: Slow queries, high latency

Investigation:

  1. Check PostgreSQL metrics (Grafana)

    • pg_stat_activity - Active connections

    • pg_stat_database_tup_fetched - Rows read

  2. Check slow queries (Loki)

    {pod=~"gitlab-postgres-.*"} |= "duration" | logfmt | duration > 1000
    
  3. Check locks (psql)

    kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab
    SELECT * FROM pg_locks WHERE NOT granted;
    
  4. Check replication lag (CNPG)

    kubectl get cluster gitlab-postgres -n gitlabbda -o yaml | grep lag
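
When locks alone do not explain the slowness, pg_stat_activity (from the same psql session as step 3) shows what is currently running and for how long. This uses only built-in catalog views:

-- Longest-running active statements first
SELECT pid,
       now() - query_start AS runtime,
       state,
       wait_event_type,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC
LIMIT 10;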
    

Observability Best Practices

For Operators

  1. Regular dashboard review - Check Grafana daily for anomalies

  2. Set up alerts - Configure email notifications for critical issues

  3. Log retention - Loki retains 31 days (sufficient for incident investigation)

  4. Metrics retention - Prometheus retains 15 days (sufficient for short-term trends)

  5. Long-term trends - Export metrics to external system (future: Thanos)

For Developers

  1. Structured logging - Use JSON logs for easier parsing

  2. Log levels - ERROR for actionable issues, WARN for potential issues, INFO for significant events

  3. Context in logs - Include user_id, project_id, job_id for traceability

  4. Metrics in code - Instrument custom metrics (e.g., custom_feature_usage_total)

  5. Distributed tracing (future) - Jaeger/Tempo for request tracing across services
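
As an example of why structured logs with context fields pay off, LogQL can filter directly on parsed fields instead of grepping raw text (the field names severity and project_id here are illustrative, not a guaranteed GitLab log schema):

{namespace="gitlabbda", app="gitlab"} | json | severity="ERROR" | project_id="42"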

Monitoring Anti-Patterns

Avoid:

  • Too many alerts - Alert fatigue (high noise-to-signal ratio)

  • Vague alert descriptions - “Service down” (which service? which instance?)

  • No runbooks - Alerts without remediation steps

  • Logging everything - Verbose logs (high cost, low signal)

  • No baseline - Alerts without understanding normal behavior

Instead:

  • Actionable alerts - Only alert on issues requiring human intervention

  • Clear descriptions - “gitlab-webservice-1 pod down for 5 minutes”

  • Linked runbooks - Alert annotations link to troubleshooting docs

  • Structured logs - Log important events, filter in Loki

  • Establish baselines - Know normal CPU, memory, latency for each service


Metrics Reference

Key Metrics to Watch

GitLab Webservice:

  • http_request_duration_seconds_bucket - Latency (p50, p95, p99)

  • http_requests_total - Request count (by status code)

  • ruby_gc_duration_seconds_total - Ruby garbage collection (high = memory pressure)

PostgreSQL:

  • pg_stat_database_xact_commit - Transactions/sec

  • pg_stat_database_tup_fetched - Rows read/sec

  • pg_replication_lag_seconds - Standby lag

  • pg_settings_max_connections vs pg_stat_activity_count - Connection usage

Redis:

  • redis_memory_used_bytes - Memory usage

  • redis_keyspace_hits_total and redis_keyspace_misses_total - Cache hit ratio (hits / (hits + misses))

  • redis_evicted_keys_total - Evictions (high = memory pressure)

Gitaly:

  • gitaly_service_client_requests_total - Git operations/sec

  • gitaly_repository_size_bytes - Repository storage usage

  • gitaly_spawn_timeout_count - Git timeouts (high = overload)
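
Several of these raw counters are most useful as derived PromQL expressions; two sketches (verify exact series names in Grafana Explore):

# Redis cache hit ratio (1.0 = every read served from cache)
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))

# Git operations per second handled by Gitaly
sum(rate(gitaly_service_client_requests_total[5m]))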


Summary

Monitoring architecture:

  • Metrics - Prometheus (30s scrape), 15d retention

  • Logs - Loki (all pods), 31d retention

  • Dashboards - Grafana (cluster-wide)

  • Alerts - Alertmanager → Email

Key observability tools:

  • Grafana - Pre-built dashboards for PostgreSQL, Longhorn

  • Loki - LogQL queries for troubleshooting

  • Prometheus - PromQL for custom metrics

  • kubectl - Resource usage (top), logs (logs)

Monitoring philosophy:

  • Four Golden Signals - Latency, traffic, errors, saturation

  • Actionable alerts - Only alert on issues requiring action

  • Defense in depth - Multiple monitoring layers (app, infrastructure, business)

Next steps:

  • Create custom Grafana dashboards (GitLab, Harbor)

  • Configure GitLab BDA-specific alerts

  • Set up alert routing (Slack, PagerDuty)

  • Deploy Trivy scanner for vulnerability monitoring

For detailed instructions: