Explanation

Loki Architecture and SimpleScalable Mode

Introduction

Loki is a log aggregation system inspired by Prometheus, designed to be cost-effective and easy to operate. This document explains Loki’s architecture, our deployment mode choice (SimpleScalable), and how components interact to provide log storage and querying.

Why Loki?

The Log Storage Problem

Traditional logging solutions (ELK, Splunk) come with significant challenges:

  • Expensive indexing: Full-text indexing consumes massive storage

  • Complex operations: Multiple moving parts (Elasticsearch cluster, Kibana, Logstash)

  • Resource intensive: High CPU/memory for indexing and querying

  • Difficult scaling: Shard management complexity

Loki’s Approach

Core Philosophy: “Like Prometheus, but for logs”

Key principles:

  • Index labels, not content: Only index metadata (namespace, pod, etc.)

  • Chunk-based storage: Compress and store log lines as chunks

  • S3-native: Built for object storage from day one

  • Label-based queries: Use labels to locate chunks, then grep content

  • Simple operations: Fewer components, less complexity

Trade-off Accepted:

  • ❌ No full-text indexing: Can’t quickly find all logs containing “error”

  • ✅ Fast label-based filtering: Quickly find all logs from {namespace="monitoring", pod=~"prom.*"}

  • ✅ Grep log content: Once chunks loaded, can search content

Deployment Modes Overview

Loki offers three deployment modes:

| Mode           | Components                | Suitable For             | Complexity |
|----------------|---------------------------|--------------------------|------------|
| Monolithic     | 1 (all-in-one)            | Dev/testing, <50GB/day   | Low        |
| SimpleScalable | 3 (read/write/backend)    | Production, 50-200GB/day | Medium     |
| Microservices  | 10+ (separate components) | Large scale, >200GB/day  | High       |
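
The mode is selected with a single value in the grafana/loki Helm chart (which the configuration snippets later in this document appear to come from). A minimal sketch, assuming chart 6.x key names; verify against the chart version in use:

# Hedged sketch: choosing SimpleScalable in the grafana/loki Helm chart values.
# Key names assume chart 6.x and may differ in other chart versions.
deploymentMode: SimpleScalable

write:
  replicas: 2
read:
  replicas: 2
backend:
  replicas: 2

# Monolithic mode would instead be:
# deploymentMode: SingleBinary
# singleBinary:
#   replicas: 1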

Why SimpleScalable?

Our Requirements:

  • Log volume: ~50 MB/day uncompressed (≈5 MB/day compressed)

  • Node count: 8 nodes (4 agents)

  • High availability: Yes (2 replicas per component)

  • Operational complexity: Prefer simplicity over extreme scalability

SimpleScalable Benefits:

  • ✅ Clean separation: Read, write, and backend paths independent

  • ✅ Scalable: Each path scales independently

  • ✅ HA-ready: 2+ replicas per component

  • ✅ S3-native: No need for an object storage gateway

  • ✅ Simple operations: 3 components instead of 10+

When to consider Microservices mode:

  • Log volume >200GB/day

  • Need per-component scaling (e.g., scale queriers independently)

  • Have dedicated operations team for Loki

Component Architecture

Write Path (Log Ingestion)

Alloy (DaemonSet)
    │ HTTP POST /loki/api/v1/push
Loki Gateway (nginx reverse proxy)
    │ Route to write path
Loki Write (2 replicas)
    ├─ Parse & validate log lines
    ├─ Compress into chunks
    ├─ Write to S3 (chunks + index)
    └─ WAL (Write-Ahead Log) to PVC

Loki Write Responsibilities:

  1. Ingestion: Accept logs via HTTP API

  2. Validation: Check labels, reject invalid logs

  3. Chunking: Group log lines into compressed chunks

  4. S3 Upload: Write chunks and index entries

  5. WAL: Persist to disk before acknowledging

Configuration:

write:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi         # WAL storage
  resources:
    requests:
      cpu: 100m
      memory: 256Mi

Write Process:

1. Alloy sends batch: POST /loki/api/v1/push
   {
     "streams": [
       {
         "stream": {"namespace": "monitoring", "pod": "prometheus-0"},
         "values": [
           ["1699000000000000000", "level=info msg=\"Starting Prometheus\""]
         ]
       }
     ]
   }

2. Write validates:
   - Labels are valid (no reserved prefixes)
   - Timestamp is within acceptable range (±1h)
   - Log line size <256KB

3. Write appends to chunk:
   - Group by stream labels
   - Add to current chunk
   - Compress (gzip) when chunk reaches 1.5MB

4. Write persists:
   - Write to WAL (local PVC)
   - Upload chunk to S3
   - Write index entry to S3
   - Return 204 No Content to Alloy
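
The validation thresholds in step 2 map onto Loki's limits_config. A hedged sketch with illustrative values (not necessarily our exact settings):

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 1h   # drop entries with timestamps older than 1h
  creation_grace_period: 10m       # tolerate slightly future timestamps
  max_line_size: 256KB             # reject individual log lines above this size
  max_line_size_truncate: false    # reject rather than silently truncate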

Read Path (Log Querying)

Grafana
    │ LogQL: {namespace="monitoring"} |= "error"
Loki Gateway
    │ Route to read path
Loki Read (2 replicas)
    ├─ Parse LogQL query
    ├─ Query index (find chunks)
    ├─ Fetch chunks from S3/Backend
    ├─ Decompress & filter
    └─ Return matching lines

Loki Read Responsibilities:

  1. Query parsing: Parse LogQL syntax

  2. Index querying: Find relevant chunks in index

  3. Chunk fetching: Retrieve chunks from S3 or Backend cache

  4. Filtering: Apply line filters and label matchers

  5. Aggregation: Perform log aggregations (rate, count_over_time, etc.)

Configuration:

read:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi         # Query cache
  resources:
    requests:
      cpu: 100m
      memory: 256Mi

Query Process:

1. Grafana sends query: GET /loki/api/v1/query_range
   ?query={namespace="monitoring"}|="error"
   &start=1699000000000000000
   &end=1699003600000000000
   &limit=1000

2. Read parses LogQL:
   - Label matchers: {namespace="monitoring"}
   - Line filter: |= "error"
   - Time range: 1h

3. Read queries index:
   - Find chunks matching {namespace="monitoring"}
   - Filter by time range: [start, end]
   - Result: List of chunk IDs in S3

4. Read fetches chunks:
   - Check local cache (PVC)
   - If miss, fetch from S3
   - Decompress (gzip)

5. Read filters:
   - Apply line filter |= "error"
   - Return matching lines to Grafana

6. Grafana renders:
   - Display logs in Explore
   - Highlight search terms
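
Repeated or overlapping range queries can skip most of this work when result caching is enabled. A minimal sketch using Loki's embedded results cache; option names follow recent Loki releases, so verify against the deployed version:

query_range:
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100   # in-memory cache per read pod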

Backend Path (Storage & Indexing)

Loki Backend (2 replicas)
    ├─ Compact small chunks → larger chunks
    ├─ Build/update index files
    ├─ Maintain WAL (Write-Ahead Log)
    └─ Serve as cache for Read path

Loki Backend Responsibilities:

  1. Index maintenance: Build and compact index files

  2. Chunk compaction: Merge small chunks into larger ones

  3. WAL management: Handle Write-Ahead Log

  4. Cache serving: Act as cache tier for Read path

Configuration:

backend:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi         # Index + chunk cache
  resources:
    requests:
      cpu: 100m
      memory: 256Mi

Compaction Process:

Input:  [1MB chunk] [1MB chunk] [1MB chunk] [1MB chunk]
        └────────────────────┬────────────────────┘
Compact:            [3.5MB chunk]                   (merged, recompressed)
Benefit: Fewer S3 objects, faster queries
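
Compaction behaviour is controlled by the compactor block. A hedged sketch with illustrative values; the working directory normally sits on the backend PVC:

compactor:
  working_directory: /var/loki/compactor   # scratch space for merge work
  compaction_interval: 10m                 # how often the compactor looks for work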

Gateway (nginx)

Loki Gateway Responsibilities:

  1. Request routing: Route to write, read, or backend based on path

  2. Load balancing: Distribute across replicas

  3. Authentication: (Optional, not currently enabled)

Routing Rules:

/loki/api/v1/push         → write path (ingestion)
/loki/api/v1/query        → read path (instant queries)
/loki/api/v1/query_range  → read path (range queries)
/loki/api/v1/labels       → read path (metadata)
/ready                    → all paths (health checks)

Data Model

Labels (Index Dimension)

Stream labels uniquely identify a log stream:

{
  cluster="kup6s",
  namespace="monitoring",
  pod="prometheus-kube-prometheus-stack-prometheus-0",
  container="prometheus"
}

Label characteristics:

  • Low cardinality (finite set of values)

  • Indexed in Loki

  • Fast to query

  • Used for chunk selection

Bad labels (high cardinality):

  • ❌ request_id (millions of unique values)

  • ❌ timestamp (always changing)

  • ❌ user_email (unbounded)

Good labels (low cardinality):

  • ✅ namespace (dozens of values)

  • ✅ pod (hundreds of values)

  • ✅ app (tens of values)

Structured Metadata (Not Indexed)

Structured metadata attached to log lines but not indexed:

{
  trace_id="3f2a1b5c",
  request_id="xyz789",
  user_id="user123"
}

Use cases:

  • High-cardinality data that doesn’t need indexing

  • Correlation with traces (trace_id)

  • Session tracking (request_id)

How it works:

  • Stored with log line in chunk

  • Not indexed (doesn’t affect query performance)

  • Returned in query results

  • Filterable after chunk fetching
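
Structured metadata must be enabled explicitly and requires a TSDB/v13 (or newer) schema. A hedged limits_config sketch with illustrative caps; exact option names depend on the Loki version in use:

limits_config:
  allow_structured_metadata: true
  max_structured_metadata_size: 64KB          # per-entry size cap (illustrative)
  max_structured_metadata_entries_count: 128  # per-entry key cap (illustrative)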

Log Lines (Content)

Log line structure:

timestamp: 1699000000000000000 (nanoseconds)
line:      "level=info component=tsdb msg=\"Compacting WAL\" duration=1.2s"

Content characteristics:

  • Not indexed

  • Compressed in chunks

  • Searchable via line filters (|=, !=, |~, !~)

Storage Layout

S3 Bucket Structure

s3://logs-loki-kup6s/
├── fake/                        # Tenant ID (single-tenant = "fake")
│   ├── chunks/
│   │   ├── 1699000000000000000/ # Time-based sharding
│   │   │   ├── 01K8WC41VM...   # Chunk ID (ULID)
│   │   │   ├── 01K8WC42XN...
│   │   │   └── ...
│   │   └── 1699003600000000000/
│   │       └── ...
│   └── index/
│       ├── index_19701/         # Period number (daily)
│       │   ├── 01K8WC41VM...   # Index file (compressed)
│       │   └── ...
│       └── index_19702/
│           └── ...
└── wal/                         # Write-Ahead Log (temporary)
    └── ...
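
For reference, the bucket above is wired into Loki through the chart's storage values. A minimal sketch in which only the bucket name is real; the endpoint, region, and credential handling are placeholders:

loki:
  storage:
    type: s3
    bucketNames:
      chunks: logs-loki-kup6s
      ruler: logs-loki-kup6s
      admin: logs-loki-kup6s
    s3:
      endpoint: <s3-endpoint>        # placeholder
      region: <region>               # placeholder
      accessKeyId: <from-secret>     # typically injected from a Kubernetes Secret
      secretAccessKey: <from-secret>
      s3ForcePathStyle: true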

Chunk Format

Chunk structure:

Chunk: 01K8WC41VMNQF74ZC000CC72NY
├── Metadata (128 bytes)
│   ├── Stream labels hash
│   ├── Min timestamp
│   ├── Max timestamp
│   └── Compression type (gzip)
├── Entries (compressed)
│   ├── [ts1, "log line 1"]
│   ├── [ts2, "log line 2"]
│   └── ...
└── Checksum (CRC32)

Compression:

  • Algorithm: gzip level 6 (balance between speed and compression)

  • Typical compression ratio: 10:1 (10MB logs → 1MB chunk)

  • Chunk target size: 1.5MB compressed
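
These sizing and compression targets correspond to ingester settings. A hedged sketch with illustrative values:

ingester:
  chunk_encoding: gzip          # compression algorithm for chunk data
  chunk_target_size: 1572864    # ~1.5MB compressed target before flushing
  max_chunk_age: 2h             # flush even if the size target is never reached
  chunk_idle_period: 30m        # flush streams that stop receiving logs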

Index Format

Index entry:

{
  "labels": {
    "cluster": "kup6s",
    "namespace": "monitoring",
    "pod": "prometheus-0"
  },
  "chunks": [
    {
      "id": "01K8WC41VMNQF74ZC000CC72NY",
      "from": 1699000000000000000,
      "through": 1699003600000000000
    }
  ]
}

Index sharding:

  • Daily indexes: New index file every 24h

  • Compaction: Merge old index files weekly

  • Size: ~1MB per day of logs (highly compressed)
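
The daily index files correspond to a 24h index period in schema_config. A hedged sketch; the start date and schema version are illustrative:

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h     # one index table per day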

Query Language (LogQL)

Label Matching

Select streams by labels:

{namespace="monitoring", pod=~"prometheus.*"}

Operators:

  • =: Exact match

  • !=: Not equal

  • =~: Regex match

  • !~: Regex not match

Line Filtering

Filter log lines:

{namespace="monitoring"} |= "error"         # Contains "error"
{namespace="monitoring"} != "debug"         # Doesn't contain "debug"
{namespace="monitoring"} |~ "err.*timeout"  # Regex match

Log Aggregation

Count logs over time:

rate({namespace="monitoring"}[5m])           # Logs per second
count_over_time({namespace="monitoring"}[1h]) # Total logs in 1h

Log Parsing

Extract fields:

{namespace="monitoring"}
  | json
  | level="error"
  | line_format "{{.timestamp}} - {{.msg}}"

JSON parsing:

Input:  {"level":"error", "msg":"Connection failed", "duration":"1.2s"}
After:  level="error" msg="Connection failed" duration="1.2s"

Performance Characteristics

Write Performance

Ingestion rate:

  • Target: 10,000 log lines/sec per write replica

  • Actual: ~500 lines/sec (sufficient for an 8-node cluster)

  • Bottleneck: S3 write bandwidth (not CPU/memory)

Write latency:

  • P50: 50ms (cached in WAL)

  • P99: 200ms (includes S3 write)

  • WAL sync: Every 1s

Query Performance

Query types:

Fast:   {namespace="monitoring"}                      # 50-100ms (index lookup)
Medium: {namespace="monitoring"} |= "error"           # 200-500ms (chunk read)
Slow:   {namespace="monitoring"} | json | level="error" # 1-2s (parsing)

Factors affecting performance:

  • Time range: Longer range = more chunks

  • Label selectivity: Specific labels = fewer chunks

  • Line filters: Simple filters (|=) faster than regex (|~)

  • Parsing: JSON parsing adds overhead

Storage Performance

Compression:

  • Text logs: 10:1 ratio (10MB → 1MB)

  • JSON logs: 8:1 ratio (more structure overhead)

  • Already compressed: 2:1 ratio (can’t compress much)

Chunk lifecycle:

1. Logs ingested → WAL (local PVC)
2. Chunk filled (1.5MB) → S3 upload
3. WAL cleared → disk space recovered
4. Index updated → chunk queryable
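
Steps 1 and 3 are governed by the write path's WAL settings. A hedged sketch with illustrative values and paths:

ingester:
  wal:
    enabled: true
    dir: /var/loki/wal            # lives on the write pod's PVC
    flush_on_shutdown: true       # flush chunks to S3 before the pod exits
    replay_memory_ceiling: 512MB  # cap memory used when replaying the WAL after a crash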

High Availability

Write Path HA

2 write replicas:

  • Both receive logs from Alloy

  • Alloy retries on failure (exponential backoff)

  • S3 provides durability (no data loss if replica fails)

Failure scenario:

1. Write-0 receives log batch at 12:00:00
2. Write-0 fails before uploading to S3
3. Alloy retries → sends to Write-1
4. Write-1 uploads successfully
5. No logs lost (retry succeeded)

Read Path HA

2 read replicas:

  • Kubernetes Service load balances

  • Stateless (no session affinity needed)

  • Failover automatic

Failure scenario:

1. Grafana queries Read-0
2. Read-0 fetches chunks from S3
3. Read-0 fails mid-query
4. Grafana retries → queries Read-1
5. Query completes (slight delay)

Backend Path HA

2 backend replicas:

  • Coordinate compaction via locking

  • Only one replica compacts a given chunk

  • WAL duplicated across replicas

Failure scenario:

1. Backend-0 compacting chunks A, B, C
2. Backend-0 fails mid-compaction
3. Backend-1 detects lock expiry
4. Backend-1 restarts compaction of A, B, C
5. Chunks eventually compacted (delayed)

Retention & Lifecycle

Table Management

Loki retention:

limits_config:
  retention_period: 744h  # 31 days

How it works:

  • Compactor checks chunk age

  • Deletes chunks older than 31 days

  • Updates index to remove references
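
Retention is enforced by the compactor, so retention_period only takes effect when the compactor's retention switch is on. A hedged sketch, assuming a recent Loki release:

compactor:
  retention_enabled: true
  retention_delete_delay: 2h    # grace period before chunks are actually deleted
  delete_request_store: s3      # required when retention/deletes are enabled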

S3 Lifecycle Policy

Additional safety:

BucketLifecycleConfiguration:
  rule:
    - id: delete-old-logs
      status: Enabled
      expiration:
        days: 90  # Delete after 90 days (failsafe)

Why 90 days if Loki deletes at 31 days?

  • Safety margin for compaction delays

  • Orphaned chunks (compaction bugs)

  • Index inconsistencies

Storage Growth Calculation

Daily log volume:  50 MB/day uncompressed
Compression ratio: 10:1
Daily S3 writes:   5 MB/day compressed
31-day retention:  31 × 5 MB = 155 MB total
S3 cost:           ~$0.004/month

Troubleshooting

Common Issues

Issue: “Logs not appearing in Grafana”

Check write path:

kubectl logs -n monitoring loki-write-0 | grep "push"

Common causes:

  • Alloy not sending logs (check DaemonSet)

  • Wrong labels (check Alloy configuration)

  • Timestamp out of range (check Alloy clock sync)

Issue: “Slow log queries”

Check metrics:

rate(loki_request_duration_seconds_sum{route=~"loki_api_v1_query.*"}[5m])

Common causes:

  • Large time range (reduce range)

  • Too many chunks (add more specific labels)

  • S3 read latency (check Hetzner S3 status)

Issue: “High memory usage”

Check component:

kubectl top pods -n monitoring | grep loki

Common causes:

  • Large query results (add limit parameter)

  • Too many streams (reduce label cardinality)

  • Compaction overhead (increase backend resources)

Best Practices

Label Design

Do:

  • ✅ Use low-cardinality labels (namespace, app, pod)

  • ✅ Keep labels finite (bounded set of values)

  • ✅ Use structured metadata for high-cardinality data

Don’t:

  • ❌ Add request IDs as labels (use structured metadata)

  • ❌ Add user IDs as labels (millions of values)

  • ❌ Add dynamic values as labels (timestamp, random IDs)
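
Loki's limits_config can also act as a guard rail that catches cardinality mistakes before they bloat the index. A hedged sketch with illustrative values:

limits_config:
  max_label_names_per_series: 15      # reject streams carrying too many labels
  max_label_value_length: 2048        # reject absurdly long label values
  max_global_streams_per_user: 5000   # cap total active streams per tenant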

Query Optimization

Fast queries:

{namespace="monitoring", app="prometheus"}              # Specific labels
{namespace="monitoring"} |= "error"                     # Simple filter
{namespace="monitoring"} | json | level="error"         # Parsed filter

Slow queries:

{namespace=~".*"}                                       # Too broad
{namespace="monitoring"} |~ "err.*|warn.*|fail.*"       # Complex regex
{namespace="monitoring"} | json | component=~".*api.*"  # Regex on parsed field

Resource Planning

Write path sizing:

  • 1 replica per 10,000 lines/sec

  • 256Mi memory per replica

  • 10Gi PVC for WAL

Read path sizing:

  • 1 replica per 100 concurrent queries

  • 256Mi memory per replica

  • 10Gi PVC for cache

Backend path sizing:

  • 2 replicas minimum (HA)

  • 256Mi memory per replica

  • 10Gi PVC for compaction workspace

Comparison: Monolithic vs SimpleScalable

| Aspect         | Monolithic            | SimpleScalable (Our Choice) |
|----------------|-----------------------|-----------------------------|
| Components     | 1 (all-in-one)        | 3 (read/write/backend)      |
| Scaling        | Vertical only         | Horizontal per path         |
| HA             | Single replica        | 2+ replicas per path        |
| Resource Usage | Lower (1 pod)         | Higher (6+ pods)            |
| Complexity     | Simpler               | Moderate                    |
| Suitable For   | <50GB/day             | 50-200GB/day                |
| Upgrade Impact | Downtime (single pod) | Rolling (zero downtime)     |

Why we chose SimpleScalable:

  • Future-proof (can scale as log volume grows)

  • High availability (2 replicas per path)

  • Independent scaling (scale read without write)

  • Zero-downtime upgrades (rolling updates)

References