Loki Architecture and SimpleScalable Mode¶
Introduction¶
Loki is a log aggregation system inspired by Prometheus, designed to be cost-effective and easy to operate. This document explains Loki’s architecture, our deployment mode choice (SimpleScalable), and how components interact to provide log storage and querying.
Why Loki?¶
The Log Storage Problem¶
Traditional logging solutions (ELK, Splunk) have significant challenges:
Problems with traditional systems:
Expensive indexing: Full-text indexing consumes massive storage
Complex operations: Multiple moving parts (Elasticsearch cluster, Kibana, Logstash)
Resource intensive: High CPU/memory for indexing and querying
Difficult scaling: Shard management complexity
Loki’s Approach¶
Core Philosophy: “Like Prometheus, but for logs”
Key principles:
✅ Index labels, not content: Only index metadata (namespace, pod, etc.)
✅ Chunk-based storage: Compress and store log lines as chunks
✅ S3-native: Built for object storage from day one
✅ Label-based queries: Use labels to locate chunks, then grep content
✅ Simple operations: Fewer components, less complexity
Trade-off Accepted:
❌ No full-text indexing: Can’t quickly find all logs containing “error”
✅ Fast label-based filtering: Quickly find all logs matching {namespace="monitoring", pod=~"prom.*"}
✅ Grep log content: Once chunks are loaded, their content can be searched
Deployment Modes Overview¶
Loki offers three deployment modes:
| Mode | Components | Suitable For | Complexity |
|---|---|---|---|
| Monolithic | 1 (all-in-one) | Dev/testing, <50GB/day | Low |
| SimpleScalable | 3 (read/write/backend) | Production, 50-200GB/day | Medium |
| Microservices | 10+ (separate components) | Large scale, >200GB/day | High |
Why SimpleScalable?¶
Our Requirements:
Log volume: ~50 MB/day uncompressed (~5 MB/day after compression)
Node count: 8 nodes (4 agents)
High availability: Yes (2 replicas per component)
Operational complexity: Prefer simplicity over extreme scalability
SimpleScalable Benefits:
✅ Clean separation: Read, write, and backend paths independent
✅ Scalable: Each path scales independently
✅ HA-ready: 2+ replicas per component
✅ S3-native: No need for an object storage gateway
✅ Simple operations: 3 components instead of 10+
When to consider Microservices mode:
Log volume >200GB/day
Need per-component scaling (e.g., scale queriers independently)
Have dedicated operations team for Loki
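For reference, selecting this mode via Helm looks roughly like the values sketch below; the deploymentMode key and section names assume a recent grafana/loki chart and may differ between chart versions, so treat it as illustrative rather than copy-paste:
# Illustrative values.yaml excerpt (chart key names assumed)
deploymentMode: SimpleScalable
write:
  replicas: 2
read:
  replicas: 2
backend:
  replicas: 2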
Component Architecture¶
Write Path (Log Ingestion)¶
Alloy (DaemonSet)
│ HTTP POST /loki/api/v1/push
▼
Loki Gateway (nginx reverse proxy)
│ Route to write path
▼
Loki Write (2 replicas)
│
├─ Parse & validate log lines
├─ Compress into chunks
├─ Write to S3 (chunks + index)
└─ WAL (Write-Ahead Log) to PVC
Loki Write Responsibilities:
Ingestion: Accept logs via HTTP API
Validation: Check labels, reject invalid logs
Chunking: Group log lines into compressed chunks
S3 Upload: Write chunks and index entries
WAL: Persist to disk before acknowledging
Configuration:
write:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi  # WAL storage
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
Write Process:
1. Alloy sends batch: POST /loki/api/v1/push
{
"streams": [
{
"stream": {"namespace": "monitoring", "pod": "prometheus-0"},
"values": [
["1699000000000000000", "level=info msg=\"Starting Prometheus\""]
]
}
]
}
2. Write validates:
- Labels are valid (no reserved prefixes)
- Timestamp is within acceptable range (±1h)
- Log line size <256KB
3. Write appends to chunk:
- Group by stream labels
- Add to current chunk
- Compress (gzip) when chunk reaches 1.5MB
4. Write persists:
- Write to WAL (local PVC)
- Upload chunk to S3
- Write index entry to S3
- Return 204 No Content to Alloy
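The same push request can be replayed by hand to verify the write path end to end. The gateway service address below is an assumption based on our monitoring namespace:
# Push one test log line through the gateway (service DNS name assumed)
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://loki-gateway.monitoring.svc/loki/api/v1/push \
  -H "Content-Type: application/json" \
  --data '{
    "streams": [{
      "stream": {"namespace": "monitoring", "pod": "manual-test"},
      "values": [["'"$(date +%s%N)"'", "level=info msg=\"manual push test\""]]
    }]
  }'
# Expect 204 on success, matching step 4 above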
Read Path (Log Querying)¶
Grafana
│ LogQL: {namespace="monitoring"} |= "error"
▼
Loki Gateway
│ Route to read path
▼
Loki Read (2 replicas)
│
├─ Parse LogQL query
├─ Query index (find chunks)
├─ Fetch chunks from S3/Backend
├─ Decompress & filter
└─ Return matching lines
Loki Read Responsibilities:
Query parsing: Parse LogQL syntax
Index querying: Find relevant chunks in index
Chunk fetching: Retrieve chunks from S3 or Backend cache
Filtering: Apply line filters and label matchers
Aggregation: Perform log aggregations (rate, count_over_time, etc.)
Configuration:
read:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi  # Query cache
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
Query Process:
1. Grafana sends query: GET /loki/api/v1/query_range
?query={namespace="monitoring"}|="error"
&start=1699000000000000000
&end=1699003600000000000
&limit=1000
2. Read parses LogQL:
- Label matchers: {namespace="monitoring"}
- Line filter: |= "error"
- Time range: 1h
3. Read queries index:
- Find chunks matching {namespace="monitoring"}
- Filter by time range: [start, end]
- Result: List of chunk IDs in S3
4. Read fetches chunks:
- Check local cache (PVC)
- If miss, fetch from S3
- Decompress (gzip)
5. Read filters:
- Apply line filter |= "error"
- Return matching lines to Grafana
6. Grafana renders:
- Display logs in Explore
- Highlight search terms
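The same range query can be issued directly against the gateway with curl, which is handy when ruling Grafana out as the source of a problem (service name again assumed):
# Count result streams for the example query and time range above
curl -s -G http://loki-gateway.monitoring.svc/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="monitoring"} |= "error"' \
  --data-urlencode 'start=1699000000000000000' \
  --data-urlencode 'end=1699003600000000000' \
  --data-urlencode 'limit=1000' \
  | jq '.data.result | length'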
Backend Path (Storage & Indexing)¶
Loki Backend (2 replicas)
│
├─ Compact small chunks → larger chunks
├─ Build/update index files
├─ Maintain WAL (Write-Ahead Log)
└─ Serve as cache for Read path
Loki Backend Responsibilities:
Index maintenance: Build and compact index files
Chunk compaction: Merge small chunks into larger ones
WAL management: Handle Write-Ahead Log
Cache serving: Act as cache tier for Read path
Configuration:
backend:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi  # Index + chunk cache
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
Compaction Process:
Input: [1MB chunk] [1MB chunk] [1MB chunk] [1MB chunk]
└────────────────────┬────────────────────┘
Compact: [3.5MB chunk] (merged, recompressed)
Benefit: Fewer S3 objects, faster queries
Gateway (nginx)¶
Loki Gateway Responsibilities:
Request routing: Route to write, read, or backend based on path
Load balancing: Distribute across replicas
Authentication: (Optional, not currently enabled)
Routing Rules:
/loki/api/v1/push → write path (ingestion)
/loki/api/v1/query → read path (instant queries)
/loki/api/v1/query_range → read path (range queries)
/loki/api/v1/labels → read path (metadata)
/ready → all paths (health checks)
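Conceptually, the gateway's generated nginx configuration boils down to a handful of location blocks like the sketch below. The real chart-rendered config is longer, and the upstream service names and ports here are assumptions:
# Simplified sketch only, not the chart-generated config
server {
  listen 8080;

  # Ingestion goes to the write path
  location = /loki/api/v1/push {
    proxy_pass http://loki-write.monitoring.svc:3100$request_uri;
  }

  # Queries and metadata go to the read path
  location ~ "^/loki/api/v1/(query|query_range|labels|label|series|tail)" {
    proxy_pass http://loki-read.monitoring.svc:3100$request_uri;
  }

  location = /ready {
    return 200 "ready";
  }
}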
Data Model¶
Labels (Index Dimension)¶
Stream labels uniquely identify a log stream:
{
cluster="kup6s",
namespace="monitoring",
pod="prometheus-kube-prometheus-stack-prometheus-0",
container="prometheus"
}
Label characteristics:
Low cardinality (finite set of values)
Indexed in Loki
Fast to query
Used for chunk selection
Bad labels (high cardinality):
❌ request_id (millions of unique values)
❌ timestamp (always changing)
❌ user_email (unbounded)
Good labels (low cardinality):
✅ namespace (dozens of values)
✅ pod (hundreds of values)
✅ app (tens of values)
Structured Metadata (Not Indexed)¶
Structured metadata attached to log lines but not indexed:
{
trace_id="3f2a1b5c",
request_id="xyz789",
user_id="user123"
}
Use cases:
High-cardinality data that doesn’t need indexing
Correlation with traces (trace_id)
Session tracking (request_id)
How it works:
Stored with log line in chunk
Not indexed (doesn’t affect query performance)
Returned in query results
Filterable after chunk fetching
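Because structured metadata travels with the log line, it can still be used in a label filter after the stream selector. Assuming structured metadata is enabled on our schema (TSDB schema v13 or later), a trace lookup looks like:
{namespace="monitoring"} | trace_id="3f2a1b5c"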
Log Lines (Content)¶
Log line structure:
timestamp: 1699000000000000000 (nanoseconds)
line: "level=info component=tsdb msg=\"Compacting WAL\" duration=1.2s"
Content characteristics:
Not indexed
Compressed in chunks
Searchable via line filters (|=, !=, |~, !~)
Storage Layout¶
S3 Bucket Structure¶
s3://logs-loki-kup6s/
├── fake/ # Tenant ID (single-tenant = "fake")
│ ├── chunks/
│ │ ├── 1699000000000000000/ # Time-based sharding
│ │ │ ├── 01K8WC41VM... # Chunk ID (ULID)
│ │ │ ├── 01K8WC42XN...
│ │ │ └── ...
│ │ └── 1699003600000000000/
│ │ └── ...
│ └── index/
│ ├── index_19701/ # Period number (daily)
│ │ ├── 01K8WC41VM... # Index file (compressed)
│ │ └── ...
│ └── index_19702/
│ └── ...
└── wal/ # Write-Ahead Log (temporary)
└── ...
Chunk Format¶
Chunk structure:
Chunk: 01K8WC41VMNQF74ZC000CC72NY
├── Metadata (128 bytes)
│ ├── Stream labels hash
│ ├── Min timestamp
│ ├── Max timestamp
│ └── Compression type (gzip)
├── Entries (compressed)
│ ├── [ts1, "log line 1"]
│ ├── [ts2, "log line 2"]
│ └── ...
└── Checksum (CRC32)
Compression:
Algorithm: gzip level 6 (balance between speed and compression)
Typical compression ratio: 10:1 (10MB logs → 1MB chunk)
Chunk target size: 1.5MB compressed
Index Format¶
Index entry:
{
"labels": {
"cluster": "kup6s",
"namespace": "monitoring",
"pod": "prometheus-0"
},
"chunks": [
{
"id": "01K8WC41VMNQF74ZC000CC72NY",
"from": 1699000000000000000,
"through": 1699003600000000000
}
]
}
Index sharding:
Daily indexes: New index file every 24h
Compaction: Merge old index files weekly
Size: ~1MB per day of logs (highly compressed)
Query Language (LogQL)¶
Label Matching¶
Select streams by labels:
{namespace="monitoring", pod=~"prometheus.*"}
Operators:
=: Exact match
!=: Not equal
=~: Regex match
!~: Regex not match
Line Filtering¶
Filter log lines:
{namespace="monitoring"} |= "error" # Contains "error"
{namespace="monitoring"} != "debug" # Doesn't contain "debug"
{namespace="monitoring"} |~ "err.*timeout" # Regex match
Log Aggregation¶
Count logs over time:
rate({namespace="monitoring"}[5m]) # Logs per second
count_over_time({namespace="monitoring"}[1h]) # Total logs in 1h
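Aggregations can also be grouped by label, just like PromQL; for example, an error rate per namespace:
sum by (namespace) (rate({cluster="kup6s"} |= "error" [5m]))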
Log Parsing¶
Extract fields:
{namespace="monitoring"}
| json
| level="error"
| line_format "{{.timestamp}} - {{.msg}}"
JSON parsing:
Input: {"level":"error", "msg":"Connection failed", "duration":"1.2s"}
After: level="error" msg="Connection failed" duration="1.2s"
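Most of our own components (Prometheus, Loki itself) emit logfmt rather than JSON, so the equivalent pipeline uses the logfmt parser. The extracted field names (ts, msg) are examples and depend on what the application actually logs:
{namespace="monitoring"}
  | logfmt
  | level="error"
  | line_format "{{.ts}} {{.msg}}"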
Performance Characteristics¶
Write Performance¶
Ingestion rate:
Target: 10,000 log lines/sec per write replica
Actual: ~500 lines/sec (sufficient for 8-node cluster)
Bottleneck: S3 write bandwidth (not CPU/memory)
Write latency:
P50: 50ms (cached in WAL)
P99: 200ms (includes S3 write)
WAL sync: Every 1s
Query Performance¶
Query types:
Fast: {namespace="monitoring"} # 50-100ms (index lookup)
Medium: {namespace="monitoring"} |= "error" # 200-500ms (chunk read)
Slow: {namespace="monitoring"} | json | level="error" # 1-2s (parsing)
Factors affecting performance:
Time range: Longer range = more chunks
Label selectivity: Specific labels = fewer chunks
Line filters: Simple filters (|=) faster than regex (|~)
Parsing: JSON parsing adds overhead
Storage Performance¶
Compression:
Text logs: 10:1 ratio (10MB → 1MB)
JSON logs: 8:1 ratio (more structure overhead)
Already compressed: 2:1 ratio (can’t compress much)
Chunk lifecycle:
1. Logs ingested → WAL (local PVC)
2. Chunk filled (1.5MB) → S3 upload
3. WAL cleared → disk space recovered
4. Index updated → chunk queryable
High Availability¶
Write Path HA¶
2 write replicas:
Both receive logs from Alloy
Alloy retries on failure (exponential backoff)
S3 provides durability (no data loss if replica fails)
Failure scenario:
1. Write-0 receives log batch at 12:00:00
2. Write-0 fails before uploading to S3
3. Alloy retries → sends to Write-1
4. Write-1 uploads successfully
5. No logs lost (retry succeeded)
Read Path HA¶
2 read replicas:
Kubernetes Service load balances
Stateless (no session affinity needed)
Failover automatic
Failure scenario:
1. Grafana queries Read-0
2. Read-0 fetches chunks from S3
3. Read-0 fails mid-query
4. Grafana retries → queries Read-1
5. Query completes (slight delay)
Backend Path HA¶
2 backend replicas:
Coordinate compaction via locking
Only one replica compacts a given chunk
WAL duplicated across replicas
Failure scenario:
1. Backend-0 compacting chunks A, B, C
2. Backend-0 fails mid-compaction
3. Backend-1 detects lock expiry
4. Backend-1 restarts compaction of A, B, C
5. Chunks eventually compacted (delayed)
Retention & Lifecycle¶
Table Management¶
Loki retention:
limits_config:
  retention_period: 744h  # 31 days
How it works:
Compactor checks chunk age
Deletes chunks older than 31 days
Updates index to remove references
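In Loki's own configuration this corresponds to enabling retention on the compactor alongside the retention_period limit. A sketch, with key names that may shift slightly between Loki versions:
compactor:
  retention_enabled: true
  delete_request_store: s3          # required alongside retention in recent Loki versions
  working_directory: /var/loki/compactor
limits_config:
  retention_period: 744h            # 31 days, as above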
S3 Lifecycle Policy¶
Additional safety:
BucketLifecycleConfiguration:
  rule:
    - id: delete-old-logs
      status: Enabled
      expiration:
        days: 90  # Delete after 90 days (failsafe)
Why 90 days if Loki deletes at 31 days?
Safety margin for compaction delays
Orphaned chunks (compaction bugs)
Index inconsistencies
Storage Growth Calculation¶
Daily log volume: 50 MB/day uncompressed
Compression ratio: 10:1
Daily S3 writes: 5 MB/day compressed
31-day retention: 31 × 5 MB = 155 MB total
S3 cost: ~$0.004/month
Troubleshooting¶
Common Issues¶
Issue: “Logs not appearing in Grafana”
Check write path:
kubectl logs -n monitoring loki-write-0 | grep "push"
Common causes:
Alloy not sending logs (check DaemonSet)
Wrong labels (check Alloy configuration)
Timestamp out of range (check Alloy clock sync)
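A quick way to confirm that ingestion is working at all is to ask the gateway which label values it knows about; the service name and port mapping below are assumptions:
# Port-forward the gateway, then list the namespaces Loki has seen
kubectl -n monitoring port-forward svc/loki-gateway 3100:80 &
curl -s 'http://localhost:3100/loki/api/v1/label/namespace/values' | jq .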
Issue: “Slow log queries”
Check metrics:
rate(loki_request_duration_seconds_sum{route=~"loki_api_v1_query.*"}[5m])
Common causes:
Large time range (reduce range)
Too many chunks (add more specific labels)
S3 read latency (check Hetzner S3 status)
Issue: “High memory usage”
Check component:
kubectl top pods -n monitoring | grep loki
Common causes:
Large query results (add limit parameter)
Too many streams (reduce label cardinality)
Compaction overhead (increase backend resources)
Best Practices¶
Label Design¶
Do:
✅ Use low-cardinality labels (namespace, app, pod)
✅ Keep labels finite (bounded set of values)
✅ Use structured metadata for high-cardinality data
Don’t:
❌ Add request IDs as labels (use structured metadata)
❌ Add user IDs as labels (millions of values)
❌ Add dynamic values as labels (timestamp, random IDs)
Query Optimization¶
Fast queries:
{namespace="monitoring", app="prometheus"} # Specific labels
{namespace="monitoring"} |= "error" # Simple filter
{namespace="monitoring"} | json | level="error" # Parsed filter
Slow queries:
{namespace=~".*"} # Too broad
{namespace="monitoring"} |~ "err.*|warn.*|fail.*" # Complex regex
{namespace="monitoring"} | json | component=~".*api.*" # Regex on parsed field
Resource Planning¶
Write path sizing:
1 replica per 10,000 lines/sec
256Mi memory per replica
10Gi PVC for WAL
Read path sizing:
1 replica per 100 concurrent queries
256Mi memory per replica
10Gi PVC for cache
Backend path sizing:
2 replicas minimum (HA)
256Mi memory per replica
10Gi PVC for compaction workspace
Comparison: Monolithic vs SimpleScalable¶
| Aspect | Monolithic | SimpleScalable (Our Choice) |
|---|---|---|
| Components | 1 (all-in-one) | 3 (read/write/backend) |
| Scaling | Vertical only | Horizontal per path |
| HA | Single replica | 2+ replicas per path |
| Resource Usage | Lower (1 pod) | Higher (6+ pods) |
| Complexity | Simpler | Moderate |
| Suitable For | <50GB/day | 50-200GB/day |
| Upgrade Impact | Downtime (single pod) | Rolling (zero downtime) |
Why we chose SimpleScalable:
Future-proof (can scale as log volume grows)
High availability (2 replicas per path)
Independent scaling (scale read without write)
Zero-downtime upgrades (rolling updates)