KiwiStack

№ A · 01 / GitOps & repo topology

From git to running customer, in one reconciliation loop.

The whole platform is a function of one platform repo, three tier templates and a state repo per customer, all version-pinned, all reconciled by a per-cluster Argo. Nothing is configured by hand on a server. If git is wrong, the cluster is wrong; if git is right, the cluster eventually agrees.


№ G1 / Repository topology

One platform repo,
three tier templates,
one state repo per customer.

Everything lives in a single GitHub organisation (KiwiStack). The platform repo holds shared charts, Terraform modules and the customer registry. Three repository templates encode the per-tier shape. A customer is one repo, instantiated from the template that matches their tier. If they churn, they keep that repo.

od-platform

Operator-owned platform repo

Private · single source of truth for the MSP

  • core-base, mesh-base: versioned Helm charts (semver-tagged) holding the shared service definitions
  • Argo CD bootstrap manifests + AppProject definitions (used by Ansible at cluster install)
  • Ansible roles for cluster bootstrap and the offline signing rituals
  • Terraform modules per tier: contabo provider for VPS / VDS, OVH provider for DNS
  • Customer registry: YAML metadata listing every customer (slug, tier, jurisdiction, pinned base versions)
  • Operational runbooks and ADRs

template-core · template-mesh · template-fleet

Three GitHub repository templates, one per tier

Private · marked as GH 'Template repository'

  • template-core: minimal customer state structure for the shared k8s cluster
  • template-mesh: declares core-base@vX as a Helm dependency, adds Mesh-only services
  • template-fleet: declares core-base@vX + mesh-base@vY, adds Fleet-only services
  • Each template ships a self-contained skeleton: values/, overrides/, secrets/.sops.yaml, fleet-config/, README pinning the upstream base versions

cust-<slug>

One state repo per customer, instantiated from the matching template

Private · in the KiwiStack org

  • Created via gh repo create KiwiStack/cust-acme --template KiwiStack/template-<tier>
  • values/: Helm values per service (nubus.yaml, opendesk.yaml, fleet.yaml, smallstep.yaml, ...)
  • overrides/: Kustomize patches for anything customer-specific that isn't in values
  • secrets/: SOPS+age encrypted, decrypted at sync time inside the cluster
  • fleet-config/: fleetctl GitOps tree (Mesh / Fleet only)
  • Pinned to a specific tag of its source template; Renovate proposes bumps when tags advance

Why this shape

A single mega-repo would put every customer's secrets and overrides in one place: access control becomes harder, and churn-portability becomes a git-history filtering problem. Three tier templates plus per-customer state repos make the answer to "where does customer X live" a single GitHub URL, and the answer to "can I keep my stuff if I leave" is "yes, here's the URL".


№ G2 / Tier inheritance

Higher tiers
depend on lower bases.

The Core baseline lives once, as a Helm chart in the platform repo. Mesh and Fleet templates declare it as a dependency, pinned to a specific version. A bug fix in the baseline reaches every tier the same way: PR, merge, version bump, dependency-update PRs in customer state repos.

Template · Dependencies · Adds on top

  • template-core · core-base@vX (direct) · Core-tier per-tenant values: Nubus, OpenDesk apps, mail, files, real-time collab
  • template-mesh · core-base@vX (Helm dep, pinned) · smallstep intermediate CA, Fleet Server, WireGuard endpoint, Headscale
  • template-fleet · core-base@vX + mesh-base@vY (Helm deps, both pinned) · FreeRADIUS, OEM bootstrap endpoint, extended observability, audit-report generator

Bug-fix propagation

From one customer's incident
to every customer's baseline.

A bug surfaces in one customer's slice. The fix lands first as an override in their state repo (immediate), then is promoted upstream into the shared base if it's generic. Other customers see the fix on the next dependency bump.

01

Patch the customer

Bug observed in cust-acme. Fix lands as an override in cust-acme/values/<service>.yaml and is pushed. Argo reconciles within minutes, and Acme is patched.

02

Promote upstream

If the fix is generic and not Acme-specific, open a PR against core-base (or mesh-base) in od-platform. Code review, CI on the chart, merge.

03

Bump the base

Cut a new tag of core-base (vX → vX+1). kiwistack-self picks it up first, since its state repos pin to HEAD, not to a tag.

04

Roll out + clean up

Once observed-good on kiwistack-self, customers' Renovate sees the new tag and proposes the dependency bump in their state repo. Merge in waves. Once a customer is on the new base version, the override in their state repo is removed.
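The rollout in step 04 amounts to comparing each customer's pinned base version with the newly cut tag. A minimal sketch, assuming tags are plain vMAJOR.MINOR.PATCH strings and that the pins have already been read from the customer registry. The function names and the slug gamma are hypothetical, not part of the platform:

```python
def parse_tag(tag: str) -> tuple[int, int, int]:
    """Turn a 'v1.4.2'-style tag into a tuple that compares numerically."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def customers_behind(pins: dict[str, str], new_tag: str) -> list[str]:
    """Return the slugs whose pinned base version is older than new_tag."""
    target = parse_tag(new_tag)
    return sorted(slug for slug, tag in pins.items() if parse_tag(tag) < target)


# Hypothetical core-base pins, keyed by customer slug.
pins = {"acme": "v1.4.2", "beta-corp": "v1.4.3", "gamma": "v1.3.9"}
print(customers_behind(pins, "v1.4.3"))  # ['acme', 'gamma']
```

Customers on this list are the ones that should expect a dependency-bump PR. Once a slug drops off it, the temporary override in that customer's state repo is a candidate for removal.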


№ G3 / Inside a Core cluster

One shared k8s.
Many tenants in it.

The Core tier runs on one mutualized upstream Kubernetes cluster on Contabo. Some components are cluster-wide and shared by every tenant; others are reinstalled per tenant from each customer's state repo. Isolation between tenants is what justifies the price point.

Cluster-shared (in core-base, deployed once per Core cluster)

  • Argo CD: one per cluster, bootstrapped at install
  • ingress-nginx: multi-tenant via Host header
  • cert-manager: Let's Encrypt for service hostnames
  • smallstep CA: shared sub-CA, namespace-scoped provisioners per tenant
  • Prometheus + Grafana: multi-tenant via labels
  • Restic backup operator: per-tenant snapshot policies
  • SOPS age decryption: Argo helm-secrets plugin

Per-tenant (instantiated by Argo from each customer's state repo)

  • Nubus IAM: OpenLDAP + Heimdal KDC + Keycloak + UDM + portal
  • Open-Xchange App Suite: mail, calendar, contacts, tasks
  • Nextcloud: file sync
  • OpenProject: projects & tasks
  • XWiki: knowledge base
  • CryptPad / Impress: notes
  • Postfix mail relay
  • Per-tenant databases: postgres, mariadb, redis (one statefulset per tenant)
  • MinIO: per-tenant bucket
  • Vaultwarden: per-tenant
  • Synapse: Matrix homeserver (VDS-pinned via nodeSelector)
  • Jitsi videobridge + Prosody (VDS-pinned)
  • Collabora Online (VDS-pinned)
  • Live-collab websocket gateway (VDS-pinned)
  • Intelligence: Onyx workspace + Mistral chat + Voxtral transcripts + Skynet live captions (only when intelligence: true, see ADR-12)

Namespace naming

Slug + service.
Predictable.

The customer slug is a short, lowercase, hyphenated identifier (RFC 1123, same shape as DNS labels). It's chosen at onboarding, never changes, and is used everywhere: namespaces, ingresses, smallstep provisioner names, S3 buckets, certificate subjects.

Cluster

Core (mutualized k8s)

Pattern

<slug>-<service>

Examples

  • acme-opendesk
  • acme-pki
  • beta-corp-opendesk
  • beta-corp-pki

One namespace per service per tenant. The slug prefix is what NetworkPolicy and RBAC scope on.

Cluster

Mesh / Fleet (dedicated k3s)

Pattern

<service>

Examples

  • opendesk
  • devices
  • pki
  • network
  • argocd

The customer slug is the cluster identity. The namespace boundary is purely service-level, since there is no second tenant to fence off.
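The two naming patterns can be expressed as one small helper. This is a sketch only: the function names are hypothetical, and the regular expression is the standard RFC 1123 label shape (lowercase alphanumerics and hyphens, alphanumeric at both ends, at most 63 characters), which is also the constraint Kubernetes places on namespace names.

```python
import re

# RFC 1123 label: 1-63 chars, lowercase alphanumerics or '-',
# starting and ending with an alphanumeric character.
_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_slug(slug: str) -> bool:
    """Check that a customer slug is a valid RFC 1123 label."""
    return _LABEL.fullmatch(slug) is not None


def namespace_for(slug: str, service: str, shared_cluster: bool) -> str:
    """Build the namespace name for a service.

    shared_cluster=True  -> Core pattern:         <slug>-<service>
    shared_cluster=False -> Mesh / Fleet pattern: <service>
    """
    if not is_valid_slug(slug):
        raise ValueError(f"invalid customer slug: {slug!r}")
    name = f"{slug}-{service}" if shared_cluster else service
    if _LABEL.fullmatch(name) is None:
        raise ValueError(f"invalid namespace name: {name!r}")
    return name


print(namespace_for("beta-corp", "pki", shared_cluster=True))   # beta-corp-pki
print(namespace_for("acme", "opendesk", shared_cluster=False))  # opendesk
```

Validating the combined name as well as the slug catches the case where a long slug plus a service suffix would exceed the 63-character limit.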

Tenant isolation

Seven primitives.

Each layer below is declared in the per-customer state repo, reconciled by Argo, and verifiable independently in CI. On Mesh / Fleet clusters the same primitives apply but the bar is lower, since there is no second tenant to defend against. They still get applied because the same charts drive both cluster shapes.

NetworkPolicy

Cluster default-deny on namespace ingress. Each tenant's namespace declares its own allow-list (own services only, plus the shared ingress controller).

ResourceQuota

Per-namespace CPU, memory, storage and pod-count caps. Sized from the customer's tier and headcount. A noisy customer cannot starve their neighbours.

LimitRange

Default and max requests/limits per pod. Prevents accidental missing-limits from blowing up scheduling.

RBAC

Argo CD AppProject restricts each tenant's Application to its own namespaces. Operators have cluster-admin via OIDC against the Nubus Keycloak (audited).

PodSecurity

Restricted profile enforced at namespace level. No privileged pods, no hostPath, no hostNetwork, except in explicitly excepted system namespaces.

Storage

Per-tenant StorageClass binds to per-tenant Longhorn (or Rook-Ceph) replication groups. Volumes are namespace-scoped; backup policies are per-namespace.

Encryption at rest

Cluster-level etcd encryption. PVC-level encryption via the storage layer. Per-tenant KMS keys for backup encryption.
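To make the NetworkPolicy primitive concrete, here is a sketch that builds a per-namespace allow-list as a plain Python dict. The tenant label key (kiwistack.io/tenant), the policy name and the ingress controller namespace (ingress-nginx) are assumptions for illustration; the source does not state how namespaces are labelled. The kubernetes.io/metadata.name label is the one Kubernetes sets automatically on every namespace.

```python
import json

TENANT_LABEL = "kiwistack.io/tenant"   # assumed label key, not from the source
INGRESS_NAMESPACE = "ingress-nginx"    # assumed namespace of the shared ingress controller


def tenant_network_policy(namespace: str, slug: str) -> dict:
    """Allow ingress only from the tenant's own namespaces and the shared ingress.

    Because the policy selects every pod in the namespace and lists Ingress in
    policyTypes, any traffic not matched by a rule below is denied.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "tenant-allow-list", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector = all pods in this namespace
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        # Namespaces carrying this tenant's label.
                        {"namespaceSelector": {"matchLabels": {TENANT_LABEL: slug}}},
                        # The shared ingress controller's namespace.
                        {
                            "namespaceSelector": {
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": INGRESS_NAMESPACE
                                }
                            }
                        },
                    ]
                }
            ],
        },
    }


print(json.dumps(tenant_network_policy("acme-opendesk", "acme"), indent=2))
```

In practice a manifest like this would live as YAML in the chart or the state repo; generating it in code is only a way to show the shape and to make it checkable in CI.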


№ G4 / Inside a Mesh / Fleet cluster

One k3s
per customer.

Each Mesh and Fleet customer gets a dedicated k3s cluster on their own Contabo nodes (or eventually on-prem). Provisioning takes three steps: Terraform spins up the infrastructure, Ansible installs k3s and Argo, and Argo reconciles the customer's state repo. The path is the same on both tiers, with Fleet just adding more services on top.

01

Terraform apply

terraform apply against the customer's module (contabo provider): N VPS + 1 VDS, OVH DNS records, minimal firewall. State stored in the operator's S3 backend, locked per customer.

02

Ansible bootstrap

Playbook against the new instances: cloud-init → k3s install (HA control plane on three nodes for Fleet, smaller for Mesh) → Argo CD install → seed Argo with one Application pointing at cust-<slug>.

03

Argo reconciles

Argo clones cust-<slug>, resolves the core-base + mesh-base Helm dependencies at the pinned versions, applies. Within ~10 minutes the slice is up.

The contabo provider covers VPS / VDS lifecycle. Contabo has no AWS-style VPC primitives, so private networking between nodes is built on the WireGuard / Headscale overlay already in the stack.
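The operator-driven part of the three steps (Terraform, then Ansible) can be sketched as an ordered command plan. The directory and inventory paths below are hypothetical, since the source does not describe the layout of od-platform; bootstrap.yml is the playbook name used in the onboarding flow. The function only builds the commands and does not run them.

```python
def provision_commands(slug: str) -> list[list[str]]:
    """Return the operator commands for a new Mesh / Fleet cluster, in order."""
    module_dir = f"terraform/customers/{slug}"  # hypothetical path
    inventory = f"inventories/{slug}.ini"       # hypothetical path
    return [
        # Step 01: infrastructure (VPS / VDS, DNS records, firewall).
        ["terraform", f"-chdir={module_dir}", "init"],
        ["terraform", f"-chdir={module_dir}", "apply"],
        # Step 02: k3s + Argo CD, seeded with one Application for cust-<slug>.
        ["ansible-playbook", "-i", inventory, "bootstrap.yml"],
        # Step 03 needs no command: Argo reconciles the state repo on its own.
    ]


for cmd in provision_commands("acme"):
    print(" ".join(cmd))
```

An operator script could pass each entry to subprocess.run(cmd, check=True) so that a failed terraform apply stops the run before Ansible starts.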

Mesh adds (on top of core-base)

  • smallstep intermediate CA: per customer, signed by the offline MSP root
  • Fleet Server: endpoint management API
  • WireGuard endpoint: persistent laptop tunnel
  • Headscale: admin / site-to-site mesh

Fleet adds (on top of core-base + mesh-base)

  • FreeRADIUS: Wi-Fi 802.1X EAP-TLS at the customer office
  • OEM bootstrap endpoint: serial-bound first-boot claim service
  • Extended observability: Prometheus + Loki + Grafana with longer retention
  • Audit-report generator: cron-based quarterly compliance reports

№ G5 / Argo topology

One Argo
per cluster.

Each cluster runs its own Argo. No central control plane. The Core cluster's Argo serves all Core tenants via AppProject scoping; each Mesh / Fleet cluster's Argo serves that one customer. The customer registry in od-platform is operator metadata only. It drives Terraform module instantiation and repo scaffolding, not runtime reconciliation.

Layer

Role

root-app

Bootstrap Application installed by Ansible at cluster install. Installs Argo CD itself + cluster-shared addons (cert-manager, ingress-nginx, monitoring, AppProject definitions). One per cluster.

per-customer-app

App-of-apps that fans out to one Application per service. Core cluster: N of these (one per Core tenant). Mesh / Fleet cluster: exactly one.

service-app

The leaves: one Helm release per service in the customer's namespace(s). Sync waves order them (Nubus first, then apps, then ingress, then Fleet).

Sync policy: automatic, with self-heal and prune. Sync waves enforce dependency order (Nubus → apps → ingress → Fleet). Manual approval is required for production cluster bootstrap and root-CA-signed material.

A central Argo would have created a runtime dependency on KiwiStack infrastructure (a non-starter for on-prem Fleet deployments) and a single credential vault holding cluster-admin for every customer.
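A service-app leaf with this sync policy might look like the following, built here as a Python dict for illustration. The field names follow the Argo CD Application resource (argoproj.io/v1alpha1) and its sync-wave annotation. The wave numbers, the repo URL form, the apps/<service> path and the convention that the AppProject is named after the slug are all assumptions, not taken from the source.

```python
# Assumed wave numbers; only the ordering (Nubus, apps, ingress, Fleet) is from the text.
SYNC_WAVES = {"nubus": 0, "apps": 1, "ingress": 2, "fleet": 3}


def service_application(slug: str, service: str, wave_group: str,
                        namespace: str, revision: str = "HEAD") -> dict:
    """Build one leaf Application for a service in a customer's state repo."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {
            "name": f"{slug}-{service}",
            "namespace": "argocd",
            # Annotation values must be strings; lower waves sync first.
            "annotations": {
                "argocd.argoproj.io/sync-wave": str(SYNC_WAVES[wave_group])
            },
        },
        "spec": {
            "project": slug,  # assumed: one AppProject per tenant, named after the slug
            "source": {
                "repoURL": f"https://github.com/KiwiStack/cust-{slug}.git",  # assumed URL form
                "path": f"apps/{service}",                                   # hypothetical path
                "targetRevision": revision,
            },
            "destination": {
                "server": "https://kubernetes.default.svc",  # the local cluster
                "namespace": namespace,
            },
            # "automatic, with self-heal and prune"
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


app = service_application("acme", "nextcloud", "apps", "acme-opendesk")
print(app["metadata"]["annotations"])  # {'argocd.argoproj.io/sync-wave': '1'}
```

The destination always points at the in-cluster API server, which reflects the one-Argo-per-cluster choice: no Application ever needs credentials for a remote cluster.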


№ G6 / Internal canary

KiwiStack
runs itself.

Operating its own infrastructure as a real customer at every tier means template HEAD always lands here first. External customer state repos pin to semver tags. Once a change has run cleanly on kiwistack-self for an observation window, the new tag rolls out to external customers via Renovate-proposed dependency bumps.

cust-kiwistack-core

A namespace in the Core cluster. Pins template-core at HEAD.

cust-kiwistack-mesh

Dedicated Mesh k3s on the operator's own Contabo nodes. Pins template-mesh at HEAD.

cust-kiwistack-fleet

Dedicated Fleet k3s with the full device-management stack. Pins template-fleet at HEAD.

Building a synthetic test cluster would be duplicate effort and would never exercise the same code paths as a real customer. Running KiwiStack on its own platform applies dogfooding pressure on every tier and means no external customer ever sees a configuration that hasn't run in production for at least the observation window.


№ G7 / Secrets

Encrypted in git.
Decrypted in cluster.

Secrets live in the state repo as .sops.yaml files, encrypted with age. The decryption key is held by Argo CD inside the cluster (a Kubernetes secret seeded once at bootstrap from the operator's keypair). Each cluster has its own age key. No cross-cluster secret sharing. Nothing in plaintext is ever committed. Rotation = re-encrypt + commit.

In one paragraph

Argo CD is configured with the helm-secrets plugin (KSOPS or the SOPS-secret-operator are alternatives), which intercepts manifests on sync, decrypts the SOPS-wrapped values, and applies the resulting Secret to the cluster. The plaintext never leaves the cluster, never sits on disk in CI, never lands in a log line. Each customer's state repo encrypts to the operator's age public key plus (optionally) a per-customer recovery key the customer keeps in their own escrow.


№ G8 / Onboarding flow

From signed contract
to running,
in seven steps.

01

Create state repo

gh repo create KiwiStack/cust-acme --template KiwiStack/template-<tier> --private. Skeleton populated with tier-specific values + secrets/.sops.yaml referencing the operator age key.

02

Append to customer registry

Add customers/acme.yaml in od-platform: { slug: acme, tier: mesh, jurisdiction: BE, intelligence: true, base_versions: { core-base: v1.4.2, mesh-base: v0.7.1, intelligence-base: v0.3.0 } }. The intelligence-base entry is only resolved when intelligence: true; flipping it later (or back to false) is a one-line PR that Argo reconciles end-to-end.

03

Sign their PKI intermediate

Offline ritual: YubiKey out of the safe, sign their CSR with the MSP root, drop the signed cert into cust-acme/secrets/intermediate.crt.sops.yaml.

04

Provision the cluster (Mesh / Fleet only)

terraform apply against the customer module → ansible-playbook bootstrap.yml. Result: a k3s cluster with Argo running, seeded with one Application pointing at cust-acme.

05

Argo reconciles

Within ~10 minutes the slice is up. Core customers skip the previous step, because the existing Core Argo's customer-registry-app picks up acme.yaml on its next sync and creates the per-customer Application directly.

06

DNS records

Terraform with the OVH provider creates wildcard *.acme.od.kiwistack.io records pointing at the cluster's ingress IP. cert-manager issues TLS automatically.

07

Founding admin onboarded

First Nubus admin created via UDM REST API in a PostSync hook. They get an email with a one-time token to set their password.

Total operator time from "signed contract" to "running customer with first user account": ~30 minutes for Mesh / Fleet, ~10 minutes for Core. The offline-CA signing is the only step that isn't automated end-to-end (and shouldn't be).
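The registry entry from step 02 carries one conditional rule: intelligence-base is only resolved when intelligence is true. A small sketch of that rule, using the acme entry from the text as a Python dict; the function name resolved_bases is hypothetical:

```python
def resolved_bases(entry: dict) -> dict:
    """Return the base charts that should actually be resolved for a customer.

    intelligence-base is dropped unless the entry sets intelligence: true.
    """
    bases = dict(entry["base_versions"])  # copy, so the entry is not mutated
    if not entry.get("intelligence", False):
        bases.pop("intelligence-base", None)
    return bases


acme = {
    "slug": "acme",
    "tier": "mesh",
    "jurisdiction": "BE",
    "intelligence": True,
    "base_versions": {
        "core-base": "v1.4.2",
        "mesh-base": "v0.7.1",
        "intelligence-base": "v0.3.0",
    },
}

print(sorted(resolved_bases(acme)))
# ['core-base', 'intelligence-base', 'mesh-base']

# Flipping the flag is the "one-line PR" described in step 02.
print(sorted(resolved_bases({**acme, "intelligence": False})))
# ['core-base', 'mesh-base']
```

Keeping the pinned intelligence-base version in the entry even while the flag is false means that re-enabling it later changes a single line.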


№ G9 / Churn flow

Leaving
is also a feature.

The same git topology that makes onboarding mechanical also makes churn clean. There's nothing to "extract" from a black box, since the customer's data is in named files, the customer's manifests are in a known repo, the customer's PKI is signed by a chain they can keep, and on Mesh / Fleet their Argo is already in their cluster.

Revoke at root

step ca revoke <intermediate-serial> against the offline root. Every cert the customer's intermediate ever issued dies within the day.

Drain and tear down

Mesh / Fleet: terraform destroy. Argo dies with the cluster. Core: remove the customer's per-customer-app from the Core Argo, and their namespaces are deleted. The state repo is archived in either case.

Backup egress

Final backup snapshot exported to a customer-supplied destination (S3, USB, SFTP). Standard formats: EML, maildir, vCard, iCal, original files, SQL dumps.

Hand-over option

Mesh / Fleet only: the customer takes the cluster, the state repo and the intermediate CA's private key, with the intermediate certificate re-signed under a fresh root they own. Argo is already in their cluster, so it keeps reconciling without KiwiStack.