Loki Architecture and SimpleScalable Mode¶
Introduction¶
Loki is a log aggregation system inspired by Prometheus, designed to be cost-effective and easy to operate. This document explains Loki’s architecture, our deployment mode choice (SimpleScalable), and how components interact to provide log storage and querying.
Why Loki?¶
The Log Storage Problem¶
Traditional logging solutions (ELK, Splunk) have significant challenges:
Problems with traditional systems:
Expensive indexing: Full-text indexing consumes massive storage
Complex operations: Multiple moving parts (Elasticsearch cluster, Kibana, Logstash)
Resource intensive: High CPU/memory for indexing and querying
Difficult scaling: Shard management complexity
Loki’s Approach¶
Core Philosophy: “Like Prometheus, but for logs”
Key principles:
✅ Index labels, not content: Only index metadata (namespace, pod, etc.)
✅ Chunk-based storage: Compress and store log lines as chunks
✅ S3-native: Built for object storage from day one
✅ Label-based queries: Use labels to locate chunks, then grep content
✅ Simple operations: Fewer components, less complexity
Trade-off Accepted:
❌ No full-text indexing: Can’t quickly find all logs containing “error”
✅ Fast label-based filtering: Quickly find all logs matching {namespace="monitoring", pod=~"prom.*"}
✅ Grep log content: Once chunks are loaded, their content can be searched
Deployment Modes Overview¶
Loki offers three deployment modes:
| Mode | Components | Suitable For | Complexity |
|---|---|---|---|
| Monolithic | 1 (all-in-one) | Dev/testing, <50GB/day | Low |
| SimpleScalable | 3 (read/write/backend) | Production, 50-200GB/day | Medium |
| Microservices | 10+ (separate components) | Large scale, >200GB/day | High |
Why SimpleScalable?¶
Our Requirements:
Log volume: ~50 MB/day uncompressed (~5 MB/day after compression)
Node count: 8 nodes (4 agents)
High availability: Yes (2 replicas per component)
Operational complexity: Prefer simplicity over extreme scalability
SimpleScalable Benefits:
✅ Clean separation: Read, write, and backend paths independent
✅ Scalable: Each path scales independently
✅ HA-ready: 2+ replicas per component
✅ S3-native: No need for an object storage gateway
✅ Simple operations: 3 components instead of 10+
When to consider Microservices mode:
Log volume >200GB/day
Need per-component scaling (e.g., scale queriers independently)
Have dedicated operations team for Loki
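For reference, selecting this mode via Helm looks roughly like the values sketch below; the deploymentMode key and section names assume a recent grafana/loki chart and may differ between chart versions, so treat it as illustrative rather than copy-paste:
# Illustrative values.yaml excerpt (chart key names assumed)
deploymentMode: SimpleScalable
write:
  replicas: 2
read:
  replicas: 2
backend:
  replicas: 2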
Component Architecture¶
Write Path (Log Ingestion)¶
Alloy (DaemonSet)
│ HTTP POST /loki/api/v1/push
▼
Loki Gateway (nginx reverse proxy)
│ Route to write path
▼
Loki Write (2 replicas)
│
├─ Parse & validate log lines
├─ Compress into chunks
├─ Write to S3 (chunks + index)
└─ WAL (Write-Ahead Log) to PVC
Loki Write Responsibilities:
Ingestion: Accept logs via HTTP API
Validation: Check labels, reject invalid logs
Chunking: Group log lines into compressed chunks
S3 Upload: Write chunks and index entries
WAL: Persist to disk before acknowledging
Configuration:
write:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi  # WAL storage
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
Write Process:
1. Alloy sends batch: POST /loki/api/v1/push
{
"streams": [
{
"stream": {"namespace": "monitoring", "pod": "prometheus-0"},
"values": [
["1699000000000000000", "level=info msg=\"Starting Prometheus\""]
]
}
]
}
2. Write validates:
- Labels are valid (no reserved prefixes)
- Timestamp is within acceptable range (±1h)
- Log line size <256KB
3. Write appends to chunk:
- Group by stream labels
- Add to current chunk
- Compress (gzip) when chunk reaches 1.5MB
4. Write persists:
- Write to WAL (local PVC)
- Upload chunk to S3
- Write index entry to S3
- Return 204 No Content to Alloy
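The same push request can be replayed by hand to verify the write path end to end. The gateway service address below is an assumption based on our monitoring namespace:
# Push one test log line through the gateway (service DNS name assumed)
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://loki-gateway.monitoring.svc/loki/api/v1/push \
  -H "Content-Type: application/json" \
  --data '{
    "streams": [{
      "stream": {"namespace": "monitoring", "pod": "manual-test"},
      "values": [["'"$(date +%s%N)"'", "level=info msg=\"manual push test\""]]
    }]
  }'
# Expect 204 on success, matching step 4 above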
Read Path (Log Querying)¶
Grafana
│ LogQL: {namespace="monitoring"} |= "error"
▼
Loki Gateway
│ Route to read path
▼
Loki Read (2 replicas)
│
├─ Parse LogQL query
├─ Query index (find chunks)
├─ Fetch chunks from S3/Backend
├─ Decompress & filter
└─ Return matching lines
Loki Read Responsibilities:
Query parsing: Parse LogQL syntax
Index querying: Find relevant chunks in index
Chunk fetching: Retrieve chunks from S3 or Backend cache
Filtering: Apply line filters and label matchers
Aggregation: Perform log aggregations (rate, count_over_time, etc.)
Configuration:
read:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi  # Query cache
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
Query Process:
1. Grafana sends query: GET /loki/api/v1/query_range
?query={namespace="monitoring"}|="error"
&start=1699000000000000000
&end=1699003600000000000
&limit=1000
2. Read parses LogQL:
- Label matchers: {namespace="monitoring"}
- Line filter: |= "error"
- Time range: 1h
3. Read queries index:
- Find chunks matching {namespace="monitoring"}
- Filter by time range: [start, end]
- Result: List of chunk IDs in S3
4. Read fetches chunks:
- Check local cache (PVC)
- If miss, fetch from S3
- Decompress (gzip)
5. Read filters:
- Apply line filter |= "error"
- Return matching lines to Grafana
6. Grafana renders:
- Display logs in Explore
- Highlight search terms
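The same range query can be issued directly against the gateway with curl, which is handy when ruling Grafana out as the source of a problem (service name again assumed):
# Count result streams for the example query and time range above
curl -s -G http://loki-gateway.monitoring.svc/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="monitoring"} |= "error"' \
  --data-urlencode 'start=1699000000000000000' \
  --data-urlencode 'end=1699003600000000000' \
  --data-urlencode 'limit=1000' \
  | jq '.data.result | length'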
Backend Path (Storage & Indexing)¶
Loki Backend (2 replicas)
│
├─ Compact small chunks → larger chunks
├─ Build/update index files
├─ Maintain WAL (Write-Ahead Log)
└─ Serve as cache for Read path
Loki Backend Responsibilities:
Index maintenance: Build and compact index files
Chunk compaction: Merge small chunks into larger ones
WAL management: Handle Write-Ahead Log
Cache serving: Act as cache tier for Read path
Configuration:
backend:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi  # Index + chunk cache
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
Compaction Process:
Input: [1MB chunk] [1MB chunk] [1MB chunk] [1MB chunk]
└────────────────────┬────────────────────┘
Compact: [3.5MB chunk] (merged, recompressed)
Benefit: Fewer S3 objects, faster queries
Gateway (nginx)¶
Loki Gateway Responsibilities:
Request routing: Route to write, read, or backend based on path
Load balancing: Distribute across replicas
Authentication: (Optional, not currently enabled)
Routing Rules:
/loki/api/v1/push → write path (ingestion)
/loki/api/v1/query → read path (instant queries)
/loki/api/v1/query_range → read path (range queries)
/loki/api/v1/labels → read path (metadata)
/ready → all paths (health checks)
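Conceptually, the gateway's generated nginx configuration boils down to a handful of location blocks like the sketch below. The real chart-rendered config is longer, and the upstream service names and ports here are assumptions:
# Simplified sketch only, not the chart-generated config
server {
  listen 8080;

  # Ingestion goes to the write path
  location = /loki/api/v1/push {
    proxy_pass http://loki-write.monitoring.svc:3100$request_uri;
  }

  # Queries and metadata go to the read path
  location ~ "^/loki/api/v1/(query|query_range|labels|label|series|tail)" {
    proxy_pass http://loki-read.monitoring.svc:3100$request_uri;
  }

  location = /ready {
    return 200 "ready";
  }
}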
Data Model¶
Labels (Index Dimension)¶
Stream labels uniquely identify a log stream:
{
cluster="kup6s",
namespace="monitoring",
pod="prometheus-kube-prometheus-stack-prometheus-0",
container="prometheus"
}
Label characteristics:
Low cardinality (finite set of values)
Indexed in Loki
Fast to query
Used for chunk selection
Bad labels (high cardinality):
❌ request_id (millions of unique values)
❌ timestamp (always changing)
❌ user_email (unbounded)
Good labels (low cardinality):
✅ namespace (dozens of values)
✅ pod (hundreds of values)
✅ app (tens of values)
Structured Metadata (Not Indexed)¶
Structured metadata attached to log lines but not indexed:
{
trace_id="3f2a1b5c",
request_id="xyz789",
user_id="user123"
}
Use cases:
High-cardinality data that doesn’t need indexing
Correlation with traces (trace_id)
Session tracking (request_id)
How it works:
Stored with log line in chunk
Not indexed (doesn’t affect query performance)
Returned in query results
Filterable after chunk fetching
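Because structured metadata travels with the log line, it can still be used in a label filter after the stream selector. Assuming structured metadata is enabled on our schema (TSDB schema v13 or later), a trace lookup looks like:
{namespace="monitoring"} | trace_id="3f2a1b5c"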
Log Lines (Content)¶
Log line structure:
timestamp: 1699000000000000000 (nanoseconds)
line: "level=info component=tsdb msg=\"Compacting WAL\" duration=1.2s"
Content characteristics:
Not indexed
Compressed in chunks
Searchable via line filters (|=, !=, |~, !~)
Storage Layout¶
S3 Bucket Structure¶
s3://logs-loki-kup6s/
├── fake/ # Tenant ID (single-tenant = "fake")
│ ├── chunks/
│ │ ├── 1699000000000000000/ # Time-based sharding
│ │ │ ├── 01K8WC41VM... # Chunk ID (ULID)
│ │ │ ├── 01K8WC42XN...
│ │ │ └── ...
│ │ └── 1699003600000000000/
│ │ └── ...
│ └── index/
│ ├── index_19701/ # Period number (daily)
│ │ ├── 01K8WC41VM... # Index file (compressed)
│ │ └── ...
│ └── index_19702/
│ └── ...
└── wal/ # Write-Ahead Log (temporary)
└── ...
Chunk Format¶
Chunk structure:
Chunk: 01K8WC41VMNQF74ZC000CC72NY
├── Metadata (128 bytes)
│ ├── Stream labels hash
│ ├── Min timestamp
│ ├── Max timestamp
│ └── Compression type (gzip)
├── Entries (compressed)
│ ├── [ts1, "log line 1"]
│ ├── [ts2, "log line 2"]
│ └── ...
└── Checksum (CRC32)
Compression:
Algorithm: gzip level 6 (balance between speed and compression)
Typical compression ratio: 10:1 (10MB logs → 1MB chunk)
Chunk target size: 1.5MB compressed
Index Format¶
Index entry:
{
"labels": {
"cluster": "kup6s",
"namespace": "monitoring",
"pod": "prometheus-0"
},
"chunks": [
{
"id": "01K8WC41VMNQF74ZC000CC72NY",
"from": 1699000000000000000,
"through": 1699003600000000000
}
]
}
Index sharding:
Daily indexes: New index file every 24h
Compaction: Merge old index files weekly
Size: ~1MB per day of logs (highly compressed)
Query Language (LogQL)¶
Label Matching¶
Select streams by labels:
{namespace="monitoring", pod=~"prometheus.*"}
Operators:
=: Exact match
!=: Not equal
=~: Regex match
!~: Regex not match
Line Filtering¶
Filter log lines:
{namespace="monitoring"} |= "error" # Contains "error"
{namespace="monitoring"} != "debug" # Doesn't contain "debug"
{namespace="monitoring"} |~ "err.*timeout" # Regex match
Log Aggregation¶
Count logs over time:
rate({namespace="monitoring"}[5m]) # Logs per second
count_over_time({namespace="monitoring"}[1h]) # Total logs in 1h
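Aggregations can also be grouped by label, just like PromQL; for example, an error rate per namespace:
sum by (namespace) (rate({cluster="kup6s"} |= "error" [5m]))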
Log Parsing¶
Extract fields:
{namespace="monitoring"}
| json
| level="error"
| line_format "{{.timestamp}} - {{.msg}}"
JSON parsing:
Input: {"level":"error", "msg":"Connection failed", "duration":"1.2s"}
After: level="error" msg="Connection failed" duration="1.2s"
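Most of our own components (Prometheus, Loki itself) emit logfmt rather than JSON, so the equivalent pipeline uses the logfmt parser. The extracted field names (ts, msg) are examples and depend on what the application actually logs:
{namespace="monitoring"}
  | logfmt
  | level="error"
  | line_format "{{.ts}} {{.msg}}"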
Performance Characteristics¶
Write Performance¶
Ingestion rate:
Target: 10,000 log lines/sec per write replica
Actual: ~500 lines/sec (sufficient for 8-node cluster)
Bottleneck: S3 write bandwidth (not CPU/memory)
Write latency:
P50: 50ms (cached in WAL)
P99: 200ms (includes S3 write)
WAL sync: Every 1s
Query Performance¶
Query types:
Fast: {namespace="monitoring"} # 50-100ms (index lookup)
Medium: {namespace="monitoring"} |= "error" # 200-500ms (chunk read)
Slow: {namespace="monitoring"} | json | level="error" # 1-2s (parsing)
Factors affecting performance:
Time range: Longer range = more chunks
Label selectivity: Specific labels = fewer chunks
Line filters: Simple filters (|=) faster than regex (|~)
Parsing: JSON parsing adds overhead
Storage Performance¶
Compression:
Text logs: 10:1 ratio (10MB → 1MB)
JSON logs: 8:1 ratio (more structure overhead)
Already compressed: 2:1 ratio (can’t compress much)
Chunk lifecycle:
1. Logs ingested → WAL (local PVC)
2. Chunk filled (1.5MB) → S3 upload
3. WAL cleared → disk space recovered
4. Index updated → chunk queryable
High Availability¶
Write Path HA¶
2 write replicas:
Both receive logs from Alloy
Alloy retries on failure (exponential backoff)
S3 provides durability (no data loss if replica fails)
Failure scenario:
1. Write-0 receives log batch at 12:00:00
2. Write-0 fails before uploading to S3
3. Alloy retries → sends to Write-1
4. Write-1 uploads successfully
5. No logs lost (retry succeeded)
Read Path HA¶
2 read replicas:
Kubernetes Service load balances
Stateless (no session affinity needed)
Failover automatic
Failure scenario:
1. Grafana queries Read-0
2. Read-0 fetches chunks from S3
3. Read-0 fails mid-query
4. Grafana retries → queries Read-1
5. Query completes (slight delay)
Backend Path HA¶
2 backend replicas:
Coordinate compaction via locking
Only one replica compacts a given chunk
WAL duplicated across replicas
Failure scenario:
1. Backend-0 compacting chunks A, B, C
2. Backend-0 fails mid-compaction
3. Backend-1 detects lock expiry
4. Backend-1 restarts compaction of A, B, C
5. Chunks eventually compacted (delayed)
Retention & Lifecycle¶
Table Management¶
Loki retention:
limits_config:
  retention_period: 744h  # 31 days
How it works:
Compactor checks chunk age
Deletes chunks older than 31 days
Updates index to remove references
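In Loki's own configuration this corresponds to enabling retention on the compactor alongside the retention_period limit. A sketch, with key names that may shift slightly between Loki versions:
compactor:
  retention_enabled: true
  delete_request_store: s3          # required alongside retention in recent Loki versions
  working_directory: /var/loki/compactor
limits_config:
  retention_period: 744h            # 31 days, as above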
S3 Lifecycle Policy¶
Additional safety:
BucketLifecycleConfiguration:
  rule:
    - id: delete-old-logs
      status: Enabled
      expiration:
        days: 90  # Delete after 90 days (failsafe)
Why 90 days if Loki deletes at 31 days?
Safety margin for compaction delays
Orphaned chunks (compaction bugs)
Index inconsistencies
Storage Growth Calculation¶
Daily log volume: 50 MB/day uncompressed
Compression ratio: 10:1
Daily S3 writes: 5 MB/day compressed
31-day retention: 31 × 5 MB = 155 MB total
S3 cost: ~$0.004/month
Troubleshooting¶
Common Issues¶
Issue: “Logs not appearing in Grafana”
Check write path:
kubectl logs -n monitoring loki-write-0 | grep "push"
Common causes:
Alloy not sending logs (check DaemonSet)
Wrong labels (check Alloy configuration)
Timestamp out of range (check Alloy clock sync)
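A quick way to confirm that ingestion is working at all is to ask the gateway which label values it knows about; the service name and port mapping below are assumptions:
# Port-forward the gateway, then list the namespaces Loki has seen
kubectl -n monitoring port-forward svc/loki-gateway 3100:80 &
curl -s 'http://localhost:3100/loki/api/v1/label/namespace/values' | jq .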
Issue: “Slow log queries”
Check metrics:
rate(loki_request_duration_seconds_sum{route=~"loki_api_v1_query.*"}[5m])
Common causes:
Large time range (reduce range)
Too many chunks (add more specific labels)
S3 read latency (check Hetzner S3 status)
Issue: “High memory usage”
Check component:
kubectl top pods -n monitoring | grep loki
Common causes:
Large query results (add limit parameter)
Too many streams (reduce label cardinality)
Compaction overhead (increase backend resources)
Best Practices¶
Label Design¶
Do:
✅ Use low-cardinality labels (namespace, app, pod)
✅ Keep labels finite (bounded set of values)
✅ Use structured metadata for high-cardinality data
Don’t:
❌ Add request IDs as labels (use structured metadata)
❌ Add user IDs as labels (millions of values)
❌ Add dynamic values as labels (timestamp, random IDs)
Query Optimization¶
Fast queries:
{namespace="monitoring", app="prometheus"} # Specific labels
{namespace="monitoring"} |= "error" # Simple filter
{namespace="monitoring"} | json | level="error" # Parsed filter
Slow queries:
{namespace=~".*"} # Too broad
{namespace="monitoring"} |~ "err.*|warn.*|fail.*" # Complex regex
{namespace="monitoring"} | json | component=~".*api.*" # Regex on parsed field
Resource Planning¶
Write path sizing:
1 replica per 10,000 lines/sec
256Mi memory per replica
10Gi PVC for WAL
Read path sizing:
1 replica per 100 concurrent queries
256Mi memory per replica
10Gi PVC for cache
Backend path sizing:
2 replicas minimum (HA)
256Mi memory per replica
10Gi PVC for compaction workspace
Comparison: Monolithic vs SimpleScalable¶
| Aspect | Monolithic | SimpleScalable (Our Choice) |
|---|---|---|
| Components | 1 (all-in-one) | 3 (read/write/backend) |
| Scaling | Vertical only | Horizontal per path |
| HA | Single replica | 2+ replicas per path |
| Resource Usage | Lower (1 pod) | Higher (6+ pods) |
| Complexity | Simpler | Moderate |
| Suitable For | <50GB/day | 50-200GB/day |
| Upgrade Impact | Downtime (single pod) | Rolling (zero downtime) |
Why we chose SimpleScalable:
Future-proof (can scale as log volume grows)
High availability (2 replicas per path)
Independent scaling (scale read without write)
Zero-downtime upgrades (rolling updates)