Architecture Overview

This document explains the architectural decisions behind the Nextcloud deployment on kup6s.

Design Principles

1. Shared Constructs for Consistency

All Nextcloud instances share the same CDK8S constructs from packages/nextcloud-shared/. This ensures:

  • Consistency: Same architecture across all instances

  • Maintainability: Fix once, apply everywhere

  • Testability: Validate construct logic in one place

  • Version control: Upgrade all instances by updating the shared package
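As an illustration of how the shared constructs are parameterized, a per-instance configuration might look like the sketch below. All key names and values here are hypothetical; the actual schema lives in packages/nextcloud-shared/.

```yaml
# Hypothetical per-instance config consumed by the shared constructs
# in packages/nextcloud-shared/ -- key names are illustrative only.
instance: nextcloudaffenstall
domain: affenstall.cloud
replicas: 1            # storage-constrained single-replica design
postgres:
  instances: 2         # primary + standby via CNPG
redis:
  storage: 5Gi
s3:
  endpoint: fsn1.your-objectstorage.com
  credentialsSecret: nextcloud-s3-credentials
```

Keeping instance differences confined to a small config file like this is what makes "fix once, apply everywhere" work in practice.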

2. S3 Primary Storage

User files are stored in Hetzner S3 Object Storage, not in Kubernetes volumes:

Why S3?

  • Scalability: Unlimited storage without managing volumes

  • Cost: Object storage is cheaper than block storage for large files

  • Durability: Hetzner provides 99.999999999% (11 nines) durability

  • Backup: Built-in versioning and lifecycle policies

  • Migration: Easy to migrate between clusters (just move S3 credentials)

Trade-offs:

  • Slightly higher latency for file access vs. local storage

  • Dependency on external service (Hetzner S3)

  • Network egress costs for file transfers

3. CloudNativePG for PostgreSQL

The CloudNativePG (CNPG) operator manages the PostgreSQL clusters:

Why CNPG?

  • High Availability: Automatic failover with replica promotion

  • Automated Backups: Continuous archiving to S3 with Barman Cloud Plugin

  • Point-in-Time Recovery: Restore to any point within the backup retention window

  • Connection Pooling: PgBouncer built-in for efficient connection management

  • Observability: Prometheus metrics and comprehensive status reporting

Configuration:

  • 2 PostgreSQL replicas (1 primary + 1 standby)

  • Synchronous replication for data safety

  • Automated backups every 6 hours to S3

  • 30-day backup retention
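The configuration above maps roughly onto CNPG resources like the following sketch. Resource names are hypothetical, and the exact synchronous-replication and Barman Cloud Plugin wiring depends on the CNPG version in use:

```yaml
# Sketch: CNPG Cluster matching the settings listed above.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: nextcloud-postgres        # hypothetical name
spec:
  instances: 2                    # 1 primary + 1 standby
  postgresql:
    synchronous:                  # synchronous replication for data safety
      method: any
      number: 1
  storage:
    size: 10Gi
    storageClass: longhorn
---
# Sketch: backups every 6 hours (CNPG uses 6-field cron with seconds).
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: nextcloud-postgres-backup
spec:
  schedule: "0 0 */6 * * *"
  cluster:
    name: nextcloud-postgres
```

Note the six-field schedule: CNPG's ScheduledBackup cron includes a seconds field, unlike standard Kubernetes CronJobs.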

4. Single Replica Design (affenstall)

The nextcloudaffenstall instance runs with replicas: 1 due to storage constraints:

Why 1 Replica?

  • Nextcloud requires RWX (ReadWriteMany) storage for multiple replicas

  • Longhorn and Hetzner Cloud Volumes only support RWO (ReadWriteOnce)

  • SMB CSI (RWX) is available but too slow for Nextcloud's config/apps directory

Implications:

  • No horizontal scaling for Nextcloud pod

  • Brief downtime during pod restarts (~30 seconds)

  • Vertical scaling still possible (increase CPU/memory)

Future Options:

  • Migrate to SMB CSI for config/apps (requires performance testing)

  • Use a separate StatefulSet per pod with local RWO volumes

  • Implement active-passive setup with manual failover
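Because the config/apps volume is RWO, the workload must never have two pods mounting it at once, including during rollouts. A minimal sketch of the relevant Deployment fields (values illustrative):

```yaml
# Sketch: with an RWO volume, the Recreate strategy terminates the
# old pod before starting the new one, so the volume is never
# claimed by two pods -- this is also the source of the ~30s
# restart downtime mentioned above.
spec:
  replicas: 1
  strategy:
    type: Recreate
```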

5. ArgoCD GitOps Deployment

All Nextcloud instances are deployed via ArgoCD with automated sync:

Sync Waves ensure proper ordering:

Wave 0: Namespace creation
Wave 1: S3 buckets (Crossplane) + credentials (ESO)
Wave 2: PostgreSQL + Redis (data layer)
Wave 3: PgBouncer pooler
Wave 4: Nextcloud + Collabora + Whiteboard
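Ordering is expressed per resource through ArgoCD's sync-wave annotation. A sketch for wave 0 (the resource shown is illustrative):

```yaml
# Sketch: the sync-wave annotation controls ordering; lower waves
# are synced (and must be healthy) before higher waves start.
apiVersion: v1
kind: Namespace
metadata:
  name: nextcloud
  annotations:
    argocd.argoproj.io/sync-wave: "0"   # wave 0: created first
```

Every resource in the app carries one of these annotations; ArgoCD syncs wave N+1 only after wave N is healthy, which is what guarantees the data layer exists before Nextcloud starts.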

Benefits:

  • Declarative configuration (Git as source of truth)

  • Automated sync on Git push

  • Rollback via Git revert

  • Audit trail of all changes

Component Architecture

Nextcloud Hub

Image: nextcloud:31.0.13 (official Docker image)

Responsibilities:

  • WebDAV file protocol

  • Web UI for file management

  • User authentication and session management

  • Background jobs (cron)

  • File metadata in PostgreSQL

  • File content in S3

Environment Variables:

  • Database connection via POSTGRES_* env vars

  • Redis connection via REDIS_* env vars

  • S3 config via OBJECTSTORE_S3_* env vars (injected from secret)
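A sketch of the container env wiring. The variable names follow the official nextcloud image's documented environment variables; the secret key name and service names are assumptions:

```yaml
# Sketch: env section of the Nextcloud container spec.
env:
  - name: OBJECTSTORE_S3_BUCKET
    valueFrom:
      secretKeyRef:
        name: nextcloud-s3-credentials
        key: bucket              # key name within the secret is assumed
  - name: POSTGRES_HOST
    value: postgres-pooler       # PgBouncer service in front of postgres-rw
  - name: REDIS_HOST
    value: redis
```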

Collabora Online

Image: collabora/code:25.04.8.2.1

Purpose: Browser-based LibreOffice for document editing

Integration:

  • Nextcloud calls Collabora via WOPI protocol

  • Documents opened in browser iframe

  • Real-time collaborative editing

  • Supports: .docx, .xlsx, .pptx, .odt, .ods, .odp

Configuration:

  • Hostname: collabora.{domain}

  • Nextcloud domain whitelisted in Collabora config

  • 2 replicas for load distribution
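The domain whitelist is passed to the collabora/code image through environment variables. `aliasgroup1` and `extra_params` are the image's documented variables; the values below are illustrative:

```yaml
# Sketch: Collabora container env. aliasgroup1 whitelists the
# Nextcloud origin; TLS is terminated at Traefik, so Collabora
# itself runs plain HTTP behind the proxy.
env:
  - name: aliasgroup1
    value: "https://affenstall\\.cloud:443"
  - name: extra_params
    value: "--o:ssl.enable=false --o:ssl.termination=true"
```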

Whiteboard

Image: ghcr.io/nextcloud-releases/whiteboard:release

Purpose: Collaborative digital whiteboard

Integration:

  • Nextcloud app communicates via shared secret (JWT)

  • Whiteboard runs on /whiteboard path

  • Real-time drawing synchronization

PostgreSQL (CloudNativePG)

Configuration:

  • 2 replicas (primary + standby)

  • 10Gi PVC per replica (Longhorn)

  • PgBouncer connection pooling

  • Automated backups to S3

Database Schema:

  • Nextcloud uses ~150 tables

  • File metadata, user data, shares, comments

  • File content NOT in database (in S3)

Redis

Configuration:

  • 1 replica (StatefulSet)

  • 5Gi PVC (Longhorn)

  • Used for caching and session storage

  • No persistence required (cache only)

Purpose:

  • File locking coordination

  • Session storage

  • Transactional file locking

  • Cache for expensive operations

Network Architecture

Ingress Routes

affenstall.cloud → Traefik → nextcloud:8080
collabora.affenstall.cloud → Traefik → collabora:9980
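A sketch of the Traefik route for the main domain, assuming the Traefik CRDs are in use (resource and secret names are assumptions):

```yaml
# Sketch: Traefik IngressRoute terminating TLS and forwarding
# plain HTTP to the in-cluster nextcloud Service on 8080.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: nextcloud
spec:
  entryPoints: [websecure]
  routes:
    - match: Host(`affenstall.cloud`)
      kind: Rule
      services:
        - name: nextcloud
          port: 8080
  tls:
    secretName: affenstall-cloud-tls   # issued by cert-manager
```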

TLS Termination:

  • cert-manager provisions Let’s Encrypt certificates

  • Traefik handles TLS termination

  • Backend connections are HTTP (in-cluster)

Service Mesh

All communication within namespace via Kubernetes Services:

nextcloud → postgres-pooler:5432 → postgres-rw:5432
nextcloud → redis:6379
nextcloud → collabora:9980 (internal, for WOPI)

Data Flow

File Upload

User Browser → Traefik (TLS) → Nextcloud Pod → Hetzner S3 (content)
                                     ↓
                               PostgreSQL (metadata)

File Download

User Browser ← Traefik (TLS) ← Nextcloud Pod ← Hetzner S3 (content)
                                     ↓
                               PostgreSQL (check permissions)

Document Editing (Collabora)

User Browser ↔ Traefik (TLS) ↔ Nextcloud Pod ↔ Collabora Pod
                                     ↓              ↓
                              PostgreSQL (metadata) Hetzner S3 (file)

Security Model

Secrets Management

All secrets stored in Infisical and injected via External Secrets Operator (ESO):

  • nextcloud-s3-credentials: S3 access key/secret

  • nextcloud-postgres-app: PostgreSQL credentials (created by CNPG)

  • nextcloud: Admin username/password
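A sketch of how ESO materializes one of these secrets from Infisical. The store name and remote key paths are assumptions; only the target secret name comes from the list above:

```yaml
# Sketch: ESO pulls S3 credentials from Infisical into a
# Kubernetes Secret consumed by the Nextcloud Deployment.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: nextcloud-s3-credentials
spec:
  secretStoreRef:
    name: infisical                    # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: nextcloud-s3-credentials
  data:
    - secretKey: accessKey
      remoteRef:
        key: NEXTCLOUD_S3_ACCESS_KEY   # path in Infisical, assumed
    - secretKey: secretKey
      remoteRef:
        key: NEXTCLOUD_S3_SECRET_KEY   # path in Infisical, assumed
```

With this pattern no credential ever lands in Git; rotating a key in Infisical propagates to the cluster on the next ESO refresh.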

TLS Encryption

  • External traffic: TLS 1.2+ via Traefik + cert-manager

  • Internal traffic: Plain HTTP (within Kubernetes network)

  • S3 traffic: HTTPS to fsn1.your-objectstorage.com

RBAC

  • Nextcloud pod runs as www-data (UID 33)

  • PostgreSQL runs as postgres (UID 70)

  • No privileged containers required

Scaling Considerations

Vertical Scaling (Current)

Increase resources in config.yaml:

resources:
  nextcloud:
    requests:
      cpu: 200m → 500m
      memory: 512Mi → 1Gi
    limits:
      cpu: 1000m → 2000m
      memory: 2Gi → 4Gi

Horizontal Scaling (Future)

Requires RWX storage solution:

  1. Option A: SMB CSI

    • Mount Hetzner Storage Box via SMB

    • Performance testing required

    • May be too slow for config/apps

  2. Option B: Separate StatefulSets

    • Each pod gets own RWO volume

    • Manual load balancing

    • Complex upgrade procedure

  3. Option C: S3FS FUSE Mount

    • Mount S3 as filesystem

    • High latency, not recommended

    • Reliability concerns

Monitoring and Observability

Metrics

Nextcloud exposes Prometheus metrics at /ocs/v2.php/apps/serverinfo/api/v1/info:

  • Active users

  • Storage usage

  • Database size

  • PHP-FPM stats

  • Share counts

Logs

Logs are emitted as structured JSON:

  • Application logs: /var/www/html/data/nextcloud.log

  • Apache access logs: stdout (collected by Loki)

  • Cron job logs: Separate container logs

Health Checks

  • Liveness probe: /status.php (restarts unhealthy pods)

  • Readiness probe: /status.php (removes from service rotation)

  • PostgreSQL: CNPG operator monitors and auto-fails over
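The Nextcloud probes above can be sketched as follows (port and thresholds are illustrative; only the /status.php path comes from the list above):

```yaml
# Sketch: both probes hit Nextcloud's built-in status endpoint.
# Liveness failures restart the pod; readiness failures only
# remove it from Service endpoints.
livenessProbe:
  httpGet:
    path: /status.php
    port: 8080
  initialDelaySeconds: 60   # allow first-boot installation/upgrade steps
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /status.php
    port: 8080
  periodSeconds: 5
```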