Monitoring & Observability¶
Overview¶
GitLab BDA integrates with the KUP6S cluster monitoring stack (Prometheus, Grafana, Loki). This document explains the monitoring strategy, WHY we monitor specific metrics, and HOW to use observability tools for troubleshooting.
Monitoring stack (cluster infrastructure):
Prometheus - Metrics collection and storage
Grafana - Dashboards and visualization
Loki - Log aggregation and querying
Alloy - Log shipping (replaces Promtail)
Monitoring Strategy¶
What We Monitor¶
Application metrics (GitLab, Harbor):
Availability - Are services responding? (HTTP 200 vs 500)
Performance - Response time, throughput
Errors - 4xx/5xx error rates
Saturation - Resource usage (CPU, memory, disk)
Infrastructure metrics (PostgreSQL, Redis, Storage):
Database - Connections, queries/sec, replication lag
Cache - Hit ratio, evictions, memory usage
Storage - Disk usage, IOPS, throughput
Business metrics (GitLab-specific):
CI/CD - Pipeline duration, job success rate
Git - Push/pull frequency, repository size
Users - Active users, session duration
Why This Strategy?¶
Four Golden Signals (Google SRE):
Latency - How long requests take (HTTP response time)
Traffic - How many requests (req/sec)
Errors - How many failed requests (error rate)
Saturation - How full the service is (CPU/memory usage)
RED Method (for request-driven services):
Rate - Requests per second
Errors - Error rate
Duration - Latency (p50, p95, p99)
USE Method (for resources):
Utilization - % used (CPU at 80%)
Saturation - Queue depth (pending requests)
Errors - Error count
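To make these signals concrete, here is a rough PromQL sketch of the RED signals for the webservice. The metric names come from the GitLab exporters listed below; the exact label names (container, status) depend on the deployment and are assumptions:

# Rate - requests per second
sum(rate(http_requests_total{namespace="gitlabbda", container="webservice"}[5m]))

# Errors - share of 5xx responses
sum(rate(http_requests_total{namespace="gitlabbda", container="webservice", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{namespace="gitlabbda", container="webservice"}[5m]))

# Duration - p95 latency from the request-duration histogram
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{namespace="gitlabbda", container="webservice"}[5m])))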
Metrics Collection¶
Prometheus ServiceMonitors¶
GitLab BDA exposes metrics via ServiceMonitor CRDs (Prometheus Operator):
1. PostgreSQL (CNPG):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gitlab-postgres
spec:
  selector:
    matchLabels:
      cnpg.io/cluster: gitlab-postgres
  endpoints:
    - port: metrics
      interval: 30s
Metrics exposed:
pg_up - PostgreSQL instance up/down
pg_stat_database_* - Database stats (queries, transactions, deadlocks)
pg_stat_replication_* - Replication lag, sync state
pg_stat_bgwriter_* - Background writer stats
pg_locks_count - Lock contention
2. GitLab (built-in exporters):
# GitLab Webservice exposes metrics at :8083/metrics
# GitLab Sidekiq exposes metrics at :3807/metrics
# GitLab Gitaly exposes metrics at :9236/metrics
Metrics exposed:
http_requests_total - HTTP request count (by status code)
http_request_duration_seconds - Request latency histogram
sidekiq_jobs_* - Background job stats (queued, processing, failed)
gitaly_* - Git operation metrics (clone, push, diff duration)
3. Harbor (future):
# Harbor Core exposes Prometheus metrics at :8080/metrics
# Harbor Registry exposes at :5001/metrics
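To verify that Prometheus is actually discovering these targets, a quick check from kubectl (the service name gitlab-webservice-default is the usual Helm chart default and may differ in this deployment):

# List the ServiceMonitors the Prometheus Operator will discover in the namespace
kubectl get servicemonitors -n gitlabbda

# Spot-check one exporter: port-forward in a separate terminal, then fetch its metrics endpoint
kubectl port-forward -n gitlabbda svc/gitlab-webservice-default 8083:8083
curl -s http://localhost:8083/metrics | head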
Custom Metrics¶
GitLab Runner (CI/CD):
gitlab_runner_jobs - Running jobs count
gitlab_runner_job_duration_seconds - Job duration
Harbor (registry):
harbor_project_count - Number of projects
harbor_repo_count - Number of repositories
harbor_artifact_count - Number of images
Log Aggregation¶
Loki Log Collection¶
How it works:
All pods → stdout/stderr → Alloy (DaemonSet) → Loki → S3 (logs-loki-kup6s)
Log labels (automatic):
namespace - Kubernetes namespace (gitlabbda)
pod - Pod name (gitlab-webservice-xxx)
container - Container name (webservice)
app - App label (gitlab, harbor)
Log Queries (LogQL)¶
Find errors in GitLab:
{namespace="gitlabbda", app="gitlab"} |= "error"
PostgreSQL slow queries:
{namespace="gitlabbda", pod=~"gitlab-postgres-.*"} |= "duration"
| regexp `duration: (?P<duration>[0-9.]+) ms`
| duration > 1000
CI/CD job failures:
{namespace="gitlabbda", app="gitlab", container="sidekiq"} |= "JobFailed"
For a complete LogQL guide, see Main Cluster Docs: Query Loki Logs.
Grafana Dashboards¶
Pre-built Dashboards (Cluster)¶
KUP6S cluster provides:
Kubernetes / Compute Resources / Namespace (Pods) - Pod CPU/memory
Kubernetes / Networking / Namespace (Pods) - Pod network traffic
CNPG PostgreSQL - Database metrics, replication lag
Longhorn - Storage usage, IOPS
Access: https://grafana.ops.kup6s.net (cluster-wide Grafana)
Custom Dashboards (GitLab BDA)¶
Future: Create GitLab-specific dashboards
Dashboard 1: GitLab Application Overview
HTTP requests/sec - webservice traffic
Response time (p95) - latency
Error rate - 5xx errors
Active users - logged-in users
CI/CD jobs - queued, running, failed
Dashboard 2: GitLab Infrastructure
PostgreSQL - Connections, queries/sec, replication lag
Redis - Memory usage, hit ratio, evictions
Gitaly - Git operations/sec, storage usage
S3 - Upload/download throughput
Dashboard 3: Harbor Registry
Image pulls/pushes - Registry traffic
Storage usage - S3 bucket size
Scan results - Vulnerabilities by severity
Replication - Replication job status
Creating dashboards:
# Export dashboard JSON from Grafana
# Store in: dp-infra/gitlabbda/monitoring/dashboards/
# Import via Grafana UI or ConfigMap
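A possible ConfigMap layout for the import step, assuming the cluster Grafana runs the common dashboard sidecar that watches for a grafana_dashboard label (the label name and value depend on the cluster's Grafana configuration and are assumptions):

apiVersion: v1
kind: ConfigMap
metadata:
  name: gitlab-application-overview
  namespace: gitlabbda
  labels:
    # Label watched by the Grafana dashboard sidecar (assumption; check the cluster Grafana values)
    grafana_dashboard: "1"
data:
  # Paste the dashboard JSON exported from the Grafana UI here
  gitlab-application-overview.json: |
    {"title": "GitLab Application Overview", "panels": []}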
Alerting Strategy¶
Alert Priorities¶
Critical (P1) - Immediate action required:
GitLab/Harbor down (HTTP 5xx for >5 min)
PostgreSQL down (no primary instance)
Disk >90% full (data loss risk)
High (P2) - Action required within 1 hour:
PostgreSQL replication lag >1 minute
CI/CD queue depth >50 jobs for >10 min
High error rate (>5% 5xx errors)
Medium (P3) - Action required within 24 hours:
Slow queries (>1 second for >10 min)
High memory usage (>85% for >30 min)
Certificate expiring in <7 days
Low (P4) - Informational:
Disk >80% full
High CPU usage (>80% for >1 hour)
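As a worked example for the disk thresholds above, a PromQL expression over kubelet volume stats (assumes kubelet volume metrics are scraped cluster-wide, which is standard for kube-prometheus setups):

# PVC usage ratio per volume; > 0.9 maps to the P1 threshold, > 0.8 to P4
max by (persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes{namespace="gitlabbda"}
    / kubelet_volume_stats_capacity_bytes{namespace="gitlabbda"}
) > 0.8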
Alert Routing¶
Current (cluster-level):
Prometheus → Alertmanager → Email (SMTP)
Future (GitLab BDA-specific):
# Alertmanager route
routes:
  - match:
      namespace: gitlabbda
    receiver: gitlab-admins
    group_by: [alertname, severity]
    group_wait: 30s
    group_interval: 5m

receivers:
  - name: gitlab-admins
    email_configs:
      - to: admin@example.com
Example Alerts¶
GitLab Down:
- alert: GitLabDown
  expr: up{job="gitlab-webservice"} == 0
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: GitLab webservice is down
    description: No gitlab-webservice pods responding for 5 minutes
PostgreSQL Replication Lag:
- alert: PostgresReplicationLag
  expr: pg_replication_lag_seconds > 60
  for: 5m
  labels: {severity: high}
  annotations:
    summary: PostgreSQL replication lag high
    description: "Standby is {{ $value }}s behind primary"
High CI Job Queue:
- alert: CIJobQueueHigh
  expr: sidekiq_queue_size{queue="pipeline"} > 50
  for: 10m
  labels: {severity: high}
  annotations:
    summary: CI/CD job queue high
    description: "{{ $value }} jobs queued (slow processing)"
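With the Prometheus Operator, rules like these would ship as a PrometheusRule resource. A minimal sketch; the release label used for rule discovery is an assumption and must match the cluster Prometheus ruleSelector:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gitlabbda-alerts
  namespace: gitlabbda
  labels:
    release: kube-prometheus-stack   # assumed ruleSelector label; verify against cluster config
spec:
  groups:
    - name: gitlabbda.rules
      rules:
        - alert: GitLabDown
          expr: up{job="gitlab-webservice"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: GitLab webservice is down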
Troubleshooting Workflows¶
Performance Degradation¶
Symptom: GitLab slow, users report timeouts
Investigation:
Check application metrics (Grafana)
High response time? → Check webservice pods
High error rate? → Check application logs
Check infrastructure
PostgreSQL connections maxed? → Scale pooler
Redis memory full? → Check eviction rate
Gitaly CPU high? → Large git operations
Check resource usage (kubectl top)
kubectl top pods -n gitlabbda
kubectl top nodes
Check logs (Loki)
{namespace="gitlabbda"} |= "error" | logfmt
For complete troubleshooting, see Troubleshooting Reference.
CI/CD Pipeline Failures¶
Symptom: Pipelines stuck in “pending” or failing
Investigation:
Check Sidekiq queue (metrics)
sidekiq_queue_size high? → Sidekiq overloaded
Check GitLab Runner (logs)
kubectl logs -l app=gitlab-runner -n gitlabbda
Check job logs (GitLab UI)
Network errors? → Check external connectivity
Out of memory? → Increase runner memory
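Quick PromQL checks for this workflow (sidekiq_queue_size is the metric used in the CIJobQueueHigh alert above; the failed-jobs counter name follows the sidekiq_jobs_* family and may differ by GitLab version):

# Queue depth per Sidekiq queue
sum by (queue) (sidekiq_queue_size{namespace="gitlabbda"})

# Failed-job rate over the last 15 minutes (metric name is an assumption within the sidekiq_jobs_* family)
sum(rate(sidekiq_jobs_failed_total{namespace="gitlabbda"}[15m]))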
Database Performance¶
Symptom: Slow queries, high latency
Investigation:
Check PostgreSQL metrics (Grafana)
pg_stat_activity - Active connections
pg_stat_database_tup_fetched - Rows read
Check slow queries (Loki)
{pod=~"gitlab-postgres-.*"} |= "duration" | logfmt | duration > 1000Check locks (psql)
kubectl exec -it gitlab-postgres-1 -n gitlabbda -- psql -U postgres -d gitlab
SELECT * FROM pg_locks WHERE NOT granted;
Check replication lag (CNPG)
kubectl get cluster gitlab-postgres -n gitlabbda -o yaml | grep lag
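If the pg_stat_statements extension is enabled (not guaranteed on this cluster), the slowest queries can also be pulled directly from psql. A sketch, using PostgreSQL 13+ column names:

-- Top 10 queries by total execution time
SELECT round(total_exec_time::numeric, 1) AS total_ms,
       calls,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;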
Observability Best Practices¶
For Operators¶
Regular dashboard review - Check Grafana daily for anomalies
Set up alerts - Configure email notifications for critical issues
Log retention - Loki retains 31 days (sufficient for incident investigation)
Metrics retention - Prometheus retains 15 days (sufficient for short-term trends)
Long-term trends - Export metrics to external system (future: Thanos)
For Developers¶
Structured logging - Use JSON logs for easier parsing
Log levels - ERROR for actionable issues, WARN for potential issues, INFO for significant events
Context in logs - Include user_id, project_id, job_id for traceability
Metrics in code - Instrument custom metrics (e.g., custom_feature_usage_total)
Distributed tracing (future) - Jaeger/Tempo for request tracing across services
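For a custom counter like the one named above, the corresponding PromQL would look roughly like this (both the metric and its feature label are illustrative, not existing GitLab metrics):

sum by (feature) (rate(custom_feature_usage_total{namespace="gitlabbda"}[5m]))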
Monitoring Anti-Patterns¶
Avoid:
❌ Too many alerts - Alert fatigue (high noise-to-signal ratio)
❌ Vague alert descriptions - “Service down” (which service? which instance?)
❌ No runbooks - Alerts without remediation steps
❌ Logging everything - Verbose logs (high cost, low signal)
❌ No baseline - Alerts without understanding normal behavior
Instead:
✅ Actionable alerts - Only alert on issues requiring human intervention
✅ Clear descriptions - “gitlab-webservice-1 pod down for 5 minutes”
✅ Linked runbooks - Alert annotations link to troubleshooting docs
✅ Structured logs - Log important events, filter in Loki
✅ Establish baselines - Know normal CPU, memory, latency for each service
Metrics Reference¶
Key Metrics to Watch¶
GitLab Webservice:
http_request_duration_seconds_bucket - Latency (p50, p95, p99)
http_requests_total - Request count (by status code)
ruby_gc_duration_seconds_total - Ruby garbage collection (high = memory pressure)
PostgreSQL:
pg_stat_database_xact_commit - Transactions/sec
pg_stat_database_tup_fetched - Rows read/sec
pg_replication_lag_seconds - Standby lag
pg_settings_max_connections vs pg_stat_activity_count - Connection usage
Redis:
redis_memory_used_bytes - Memory usage
redis_keyspace_hits_total / redis_keyspace_misses_total - Cache hit ratio (worked ratio query after this list)
redis_evicted_keys_total - Evictions (high = memory pressure)
Gitaly:
gitaly_service_client_requests_total - Git operations/sec
gitaly_repository_size_bytes - Repository storage usage
gitaly_spawn_timeout_count - Git timeouts (high = overload)
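Two of these are most useful as derived ratios. A hedged PromQL sketch (label availability depends on the exporters, so the namespace selectors may need adjusting):

# Redis cache hit ratio
sum(rate(redis_keyspace_hits_total[5m]))
  / (sum(rate(redis_keyspace_hits_total[5m])) + sum(rate(redis_keyspace_misses_total[5m])))

# PostgreSQL connection usage as a fraction of max_connections
sum(pg_stat_activity_count{namespace="gitlabbda"})
  / max(pg_settings_max_connections{namespace="gitlabbda"})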
Summary¶
Monitoring architecture:
Metrics - Prometheus (30s scrape), 15d retention
Logs - Loki (all pods), 31d retention
Dashboards - Grafana (cluster-wide)
Alerts - Alertmanager → Email
Key observability tools:
Grafana - Pre-built dashboards for PostgreSQL, Longhorn
Loki - LogQL queries for troubleshooting
Prometheus - PromQL for custom metrics
kubectl - Resource usage (kubectl top), logs (kubectl logs)
Monitoring philosophy:
Four Golden Signals - Latency, traffic, errors, saturation
Actionable alerts - Only alert on issues requiring action
Defense in depth - Multiple monitoring layers (app, infrastructure, business)
Next steps:
Create custom Grafana dashboards (GitLab, Harbor)
Configure GitLab BDA-specific alerts
Set up alert routing (Slack, PagerDuty)
Deploy Trivy scanner for vulnerability monitoring
For detailed instructions:
Troubleshooting Reference - Common issues and solutions
kubectl Commands Reference - Monitoring commands
Main Cluster Docs: Monitoring Basics - Tutorial