KiwiStack

№ A · 04 / Decisions log

What we chose,
why we chose it.

ADR-style log of architectural decisions whose rationale isn't already implicit in the rest of the architecture pages. Core=k8s vs Mesh=k3s, the PKI hierarchy, OEM-direct shipping, per-role keys and the BYOC alternative, all of those are documented where they live. What's collected here is the negative space: what we chose not to do, and why.


ADR · 01

Use Nubus alone (no separate Samba AD, no full UCS)

Locked

Context

Customers used to Microsoft / AD environments expect Group Policy and a 'real' domain controller. Initial drafts considered running Samba AD on a Debian LXC alongside Nubus to provide that experience, or alternatively deploying full Univention Corporate Server (UCS Core Edition) instead of Nubus.

Decision

Run Nubus by itself. Do not deploy Samba AD as a separate service. Do not deploy full UCS.

Why

Nubus's LDAP backend already carries POSIX, Kerberos and Samba schema on every account: uidNumber, gidNumber, homeDirectory, loginShell, krb5PrincipalName, sambaSamAccount are all populated out of the box. Adding Samba AD would duplicate the directory and create a sync problem we'd have to solve forever. Adding full UCS would replace Nubus with something heavier, with overlapping features and our own Kubernetes deployment of Nubus already covering the IAM use cases.

What was rejected

  • Samba AD on a Debian LXC alongside Nubus: duplicate directory, sync hazards, more services to operate.
  • Full UCS Core Edition deployment: heavier, replaces Nubus rather than complements it, no clear win on customer-visible features.

ADR · 02

Fleet (FleetDM), not Landscape (or AWX)

Locked

Context

Endpoint management is required for the Fleet tier laptops and useful for Mesh. Two real candidates were Canonical Landscape (the obvious 'official' Ubuntu choice) and FleetDM (open-source, k8s-native, cross-platform). Ansible AWX was also considered as a sovereign / DIY alternative.

Decision

FleetDM, deployed in the customer's k3s cluster via the official Helm chart, configured GitOps-style via fleetctl.

Why

FleetDM is k8s-native (drops into the same cluster as everything else), GitOps-native (fleetctl apply -f reads our state repo), cross-platform across Win, macOS, Linux, iOS and Android (important once we expanded the BYOL story), and crucially has no per-machine subscription. It runs without paying anyone per managed device.

What was rejected

  • Canonical Landscape: even self-hosted, requires an Ubuntu Pro entitlement on every managed machine. The economics don't work at our SMB scale, and the platform-lock-in to Ubuntu Pro tokens is a sovereignty drag.
  • Ansible AWX: capable, fully sovereign, but the endpoint UX is more 'glue your own tooling' than 'managed fleet'. We didn't want to assemble inventory, patching, compliance and remote-script flows ourselves when Fleet ships it.

ADR · 03

Ubuntu Pro is an optional upsell, not bundled in any default tier

Locked

Context

Ubuntu Pro provides Extended Security Maintenance (10-year CVE backports), Livepatch (rebootless kernel patching), FIPS 140-validated crypto, and Canonical-signed CIS / DISA-STIG / FedRAMP hardening profiles. It costs around €25 / desktop / year. The temptation was to bundle it into the Fleet tier as a value-add.

Decision

Ubuntu Pro is not bundled into Core, Mesh, or Fleet by default. It is sold as a per-machine SKU when a customer specifically requires ESM, Livepatch, FIPS or 10-year retention.

Why

Default tier hardens with OpenSCAP plus community Ansible CIS roles, which gets you ~90% of Ubuntu Pro's hardening value for free. ESM and Livepatch are nice for 24/7 servers; on laptops their value is limited (laptops reboot anyway, the LTS 5-year cycle is already long). Bundling Ubuntu Pro pushes Fleet pricing above the comparable Microsoft 365 SKUs and breaks the headline cheaper-than-Microsoft positioning the brand stands on.

What was rejected

  • Bundle Ubuntu Pro into Fleet by default: would have made Fleet more 'enterprise-grade' on paper, but would have raised the price above our Microsoft comparison and would have been irrelevant to most customers.
  • Make Ubuntu Pro mandatory across all tiers: even worse on the price story.

ADR · 04

No adsys / GPO model. Fleet profiles + scripts only

Locked

Context

Linux desktop policy management traditionally uses adsys, the Canonical-developed daemon that applies AD Group Policies to Ubuntu. It's free, mature, and fits AD-managed environments. Initial drafts considered using it in tandem with Fleet, especially for customers familiar with the AD/GPO mental model.

Decision

Do not use adsys. All endpoint policy is delivered through Fleet's profile + script model.

Why

Fleet already manages our endpoints. adsys would be a parallel policy plane just for the Linux subset of the fleet. Fleet's model is cross-platform: the same workflow drives macOS profiles, Windows MDM commands and Linux scripts, so we get one consistent management surface instead of one tool per OS. The 'AD GPO' mental model is also less and less relevant for the kind of customer we serve.

What was rejected

  • adsys for AD-style GPO inheritance on Linux: free in Ubuntu, mature, but a second policy plane we don't need and a Linux-only feature in a cross-platform fleet.

ADR · 05

YubiKey 5 Series (PIV-capable), not 5 Bio (FIDO-only); 5 FIPS as the premium SKU

Locked · reverses an earlier draft choice

Context

An earlier draft of the authorization model proposed the YubiKey 5 Bio FIDO Edition, with one shared key per customer and multiple fingerprints enrolled. A hardware audit raised two problems: (1) the 5 Bio FIDO Edition is FIDO2-only and does NOT support PIV, so it cannot hold the X.509 cert chain our smallstep-issued enrollment certs require for cert-based device identity; (2) it costs around €95, similar to the FIPS variant, while the standard 5C NFC runs around €55–70.

Decision

Default hardware: YubiKey 5 Series (5C NFC) with full PIV + FIDO2 + OATH + OpenPGP. Premium SKU: YubiKey 5 FIPS Series for compliance-driven customers (eIDAS, US-equivalent). Distribution: per-role keys, one YubiKey per admin. Authentication factors: PIN + touch.

Why

Our use cases (device enrollment, mTLS to APIs, SSH cert issuance) require PIV cert signing. FIDO2 alone is fine for browser-based portal sign-in but insufficient for the cert-issuance / signing flows we've built around smallstep. Per-role keys give cryptographic role separation: each key's PIV cert subject names the admin and their role, so the server can read who actually signed instead of relying on a claimed-role header. Two distinct cert signatures from two different per-role keys is a real cryptographic two-person rule for Critical operations, with no email-link / broker workaround needed.

What was rejected

  • YubiKey 5 Bio FIDO Edition: fingerprint sensor is elegant but FIDO-only is too narrow for our cert flows, and the single-key-multi-fingerprint model gave only 'operational' role separation (the hardware doesn't tell the server which fingerprint signed).
  • YubiKey 5 Bio Multi-protocol Edition: newer, similarly priced to FIPS, adds nothing the 5 Series doesn't already have except the fingerprint sensor (which we don't need given PIN protection is the standard model in PIV smartcards).

ADR · 06

Chromium app-mode for OpenDesk launchers; Element Desktop for Chat, Nextcloud Sync alongside Files

Locked

Context

Mesh and Fleet managed laptops need a Microsoft-Office-equivalent dock experience for the OpenDesk suite: separate icons in the launcher, not twelve tabs in one Firefox window. The temptation was to build a per-customer 'KiwiStack Office' app via Tauri or Electron with branded UI, deep-link handling and OS keychain integration. Alternatively, install Thunderbird and other native protocol clients across the board for parity with the legacy Microsoft Office model. Or just point users at the web demo and call it done.

Decision

Per-app Chromium windows via chromium --app=<URL> --profile-directory=Work for the OX App Suite (one launcher covering Mail / Calendar / Contacts / Tasks via internal tabs, Outlook-style), Files (web UI), Meet, Meshjects, Wiki and Notes, totalling six Chromium wrappers. Element Desktop (upstream native binary) for Chat. Nextcloud Sync (upstream native client) alongside the Files Chromium wrapper for the filesystem layer. All seven dock launchers plus Nextcloud Sync pushed by Fleet to Mesh and Fleet devices on first agent sync, same policy file, same operational path on both tiers.

Why

Zero new code to build, sign or maintain. Chromium ships in the base image already and brings sandboxing, mTLS, FIDO2, WebRTC, codecs and autoupdate for free. The Kerberos / Keycloak SSO already documented in A6 covers app-mode windows via SPNEGO Negotiate without a single change. The two native exceptions are upstream binaries with materially better UX than wrapping their web frontends. Element Desktop has system tray, native notifications and persistent presence; Nextcloud Sync gives true filesystem integration. Treating Mesh and Fleet identically (one operational path) keeps the support surface narrow.

What was rejected

  • Per-customer Tauri / Electron build of a 'KiwiStack Office' launcher: meaningful build, sign and update pipeline per customer, per-customer URL bake-in, for negligible UX gain over the Chromium app-mode approach.
  • Pure Electron wrappers like Nativefier: abandoned upstream, would put us on a dead maintenance path.
  • A single bundled launcher app showing all OpenDesk apps as tiles: loses the Microsoft-Office one-icon-per-app expectation users have, adds a click for every action.
  • Four separate launchers for the OX App Suite modules (Mail / Calendar / Contacts / Tasks): they all open the same SPA on different deep-links, which is confusing for users and loses the cross-module flow (drag a mail to create a task, mail to a calendar event). One launcher with internal tabs matches both the Outlook model and how /office card #1 already groups them.
  • Thunderbird as the default Mail / Calendar / Contacts / Tasks client: diverges from the OX web UX customers see in the demo and on personal devices, multiplies the support surface (per-OS configuration, per-OS bug reports), and adds a parallel auth path against IMAP / CalDAV / CardDAV that doesn't ride the Keycloak session.
  • Pre-installing the apps in the OEM image on Fleet while Fleet-pushing on Mesh: two operational paths for the same artefact, more divergence to track. One path on both tiers, period.

ADR · 07

Three repo templates per tier; shared base referenced via Helm dependency

Locked

Context

Initial drafts had a single platform repo + per-customer state repos with no per-tier shape. As the difference between Core (one shared cluster) and Mesh / Fleet (one cluster per customer) clarified, the question became how to organise the operator-facing repository surface so a new customer at any tier could be set up the same way, and so a fix in one tier's baseline could reach customers on other tiers.

Decision

Three GitHub repository templates (template-core, template-mesh, template-fleet) in a dedicated KiwiStack org. The Core baseline lives once in the platform repo as a versioned Helm chart (core-base); mesh and fleet templates declare it as a Helm dependency, pinned per template to a specific version. Customer state repos are instantiated from the matching template and pin to a tag of that template.

Why

Lower long-run maintenance for a solo operator, since bug fixes propagate via 'bump the dep' PRs rather than 3× manual file copies. Per-tier pinning preserves migration windows: a tier can hold an older version while another moves forward. Tier migration (Core↔Mesh) is dominated by data migration in either model, so the manifest-shape difference is small. Matches the standard Helm-chart-dependency pattern in the wider Kubernetes ecosystem, so Renovate / dependency-update tooling already understands it.

What was rejected

  • Self-contained tier templates with sync tooling (Renovate / GH Actions auto-PRing changes from core-template into mesh / fleet templates): higher legibility per repo but accepts 3× PR cost and long-term drift between supposedly-identical baselines.
  • Self-contained tier templates with no sync tooling: simplest at first, highest drift cost over time. 'Core baseline as deployed in a Mesh cluster' silently diverges from 'Core baseline as deployed in a Core cluster' over months.
  • One mega-repo with every customer's state: access control becomes harder, churn-portability becomes a git-history filtering problem, secrets blast radius is everyone's secrets.

ADR · 08

One Argo per cluster, no central control plane

Locked

Context

With dedicated k3s clusters per Mesh and Fleet customer, the question of where the Argo controller lives became a real architectural choice. A central Argo on a KiwiStack-operated control plane reconciling into all customer clusters via stored credentials was the obvious 'fewer moving parts' option.

Decision

Each cluster runs its own Argo. The Core cluster has one Argo serving all Core tenants via AppMeshject scoping. Each Mesh / Fleet customer's k3s has its own Argo, bootstrapped during cluster provisioning by the same Ansible play that installs k3s. The customer registry in od-platform is operator metadata only. It drives Terraform module instantiation and repo scaffolding, not runtime reconciliation.

Why

On-prem Fleet deployments rule out central Argo, because there's no realistic inbound API access from KiwiStack IPs to a customer's cluster API server. A central Argo would also require holding cluster-admin credentials for every customer cluster, which would be the single highest-value secret in the platform to defend. Per-cluster Argo is churn-portable: when a customer takes their cluster on the hand-over option, Argo is already there and keeps reconciling without KiwiStack. Blast radius of an Argo failure is one customer, not all of them.

What was rejected

  • Central Argo across all clusters reconciling via stored credentials: single pane, simpler new-customer addition, but central blast radius + central credential vault + breaks on-prem deployments.
  • Central Argo with per-customer fallback (dual reconciliation loops, central primary, local takes over if central is unreachable): most resilient on paper, also the most complex; opaque conflict semantics when both controllers are healthy.
  • No controller, push-based (GitHub Actions runs kubectl apply / helm upgrade): simplest mental model, but loses drift detection. In-cluster mutations from debugging or webhook flaps survive until the next push; for a 'git is the source of truth' model that's a contradiction.

ADR · 09

Terraform with the contabo provider for cluster infrastructure

Locked

Context

Meshvisioning a new k3s cluster per Mesh / Fleet customer requires reproducible infrastructure-as-code. The Contabo commitment is locked across the rest of the architecture (jurisdiction, VPS / VDS split). The question was which IaC tool, how mature the Contabo support is, and whether to consider cluster-native abstractions like Crossplane or Cluster API.

Decision

Terraform (or OpenTofu, equivalent) with the community contabo/contabo provider for VPS / VDS lifecycle. Per-customer Terraform module under od-platform/terraform/<customer>. State stored in the operator's S3 backend, locked per customer. Ansible plays on top do the k3s install, Argo bootstrap and any post-provisioning configuration. DNS via the OVH Terraform provider in the same module.

Why

Reproducible per-customer cluster provisioning is the only way '~30 minutes from signed contract to running customer' stays true at scale. The community Contabo provider covers VPS and VDS lifecycle adequately (create / destroy / resize / image / SSH key / snapshot). Networking on Contabo is thin (no VPC primitives, no security-group-style rules), but that aligns with the WireGuard / Headscale overlay model already in use, so the gap is not a blocker. Diff visibility before apply (terraform plan) matters for production infra in a way that pure Ansible's imperative model doesn't give.

What was rejected

  • Hetzner Core with the hcloud provider: more mature provider, comparable EU jurisdiction, but reverses the existing Contabo commitment without strong reason.
  • Crossplane with the Contabo provider: cluster-native IaC abstraction, but adds a control-plane Kubernetes cluster solely to manage other Kubernetes clusters. Operational surface multiplied for no clear win at this scale.
  • Cluster API (CAPI) with a Contabo infrastructure provider: similar concern as Crossplane, plus CAPI's Contabo support is essentially absent today.
  • Pure Ansible for everything (no Terraform): possible, but Ansible is imperative; re-running it doesn't reliably converge to a target state the way terraform plan / apply does.

ADR · 10

KiwiStack operates its own infrastructure as the canary

Locked

Context

Template changes (core-base, mesh-base, the tier templates themselves) need to be validated before rolling out to external customers. The naïve options were a synthetic test cluster (built solely for testing) or branch-based promotion (a dev branch reconciled into a lab environment, main reconciled into customer clusters).

Decision

KiwiStack operates its own infrastructure as a real customer at every tier: cust-kiwistack-core, cust-kiwistack-mesh, cust-kiwistack-fleet. These state repos pin template references to HEAD (or a next branch); external customer state repos pin to semver tags. Template changes land on kiwistack-self first; once observed-good for a defined window, the change is tagged and external customers' Renovate proposes the dependency bump.

Why

A synthetic test cluster never exercises the same code paths as a real customer: internal mail flows, calendar data, real device enrollments, real PKI cert rotation, real Fleet agents under current OS updates. Running KiwiStack on its own platform applies dogfooding pressure on every tier and means external customers never see a configuration that hasn't run in production for at least the observation window. It also exercises the actual upgrade mechanism (tag bump → Renovate PR → merge) that external customers will follow.

What was rejected

  • Synthetic test cluster only: cheaper to run but doesn't catch real-world failure modes (mail under load, PKI rotation, Fleet agent compatibility with current OS updates).
  • Branch promotion (dev → main on every repo): doesn't exercise the actual customer-facing upgrade mechanism, so green dev → main says nothing about whether the promotion mechanism itself works.
  • No canary, ship to all customers simultaneously: regressions hit every customer at once. Untenable.

ADR · 11

Meeting transcription via Voxtral + Skynet

Locked

Context

The Intelligence add-on promises 'meeting transcription and search.' Initial copy named OpenAI Whisper as the transcription engine and described a manual flow ('drop a recording in'). Two problems: (1) Whisper is OpenAI-branded, since even with self-hosted MIT weights, it reads against the Mistral-led EU-sovereign narrative the rest of the suite carries; (2) the visual placeholder promised an automatic flow with summary + action-item extraction that the manual drop-in did not deliver. A separate option of switching the videoconference engine from Jitsi to Nextcloud Talk (single-vendor, stt_whisper2 + call_summary_bot) was also considered.

Decision

Voxtral (Mistral La Plateforme, EU/France) produces the high-fidelity post-call transcript with native speaker diarisation. Mistral La Plateforme's chat model generates the summary and action-items list. Live captions during the call run on the tenant cluster via Jitsi Skynet (Apache-2.0, upstream), whose summary module is pluggable to OpenAI-compatible endpoints, and Mistral La Plateforme qualifies. The whole pipeline enables and disables per-tenant via the existing Intelligence Helm values flag, with no new GitOps layer.

Why

Voxtral is the engine the customer sees: it brand-aligns with Onyx and Mistral La Plateforme already named on /office/intelligence, and natively diarises with word-level timestamps, exactly what the visual placeholder promises. Skynet is upstream Jitsi, production-used at meet.jit.si, and its OpenAI-compatible summary module talks to Mistral without modification. The faster-whisper library inside Skynet's streaming-whisper module (running open-weights Whisper on the tenant cluster) handles live captions; this is invisible to marketing copy because audio never leaves the tenant and the model file never talks to OpenAI infrastructure. Per-tenant on/off is just an Intelligence Helm flag in the customer's cust-<slug> state repo, identical to the pattern used for Ubuntu Pro and paid Collabora.

What was rejected

  • Forking Skynet to swap faster-whisper for Voxtral on the live-caption path: pyproject.toml pins faster-whisper directly, no ASR_BACKEND_URL envvar, would create permanent rebase debt for an implementation detail nobody sees.
  • Manual drop-in only (the original copy): does not deliver the summary + action-items the visual placeholder promises, and is a worse UX than what hosted-SaaS competitors already give buyers.
  • Switching the videoconference engine from Jitsi to Nextcloud Talk: Talk's stt_whisper2 + call_summary_bot is single-vendor and architecturally cleaner, but unwinding /office/videoconference is a much larger change than this decision warrants.

ADR · 12

Intelligence as a one-flag toggle with shared identity and pinned charts

Locked

Context

ADR-11 settled the engine: Onyx workspace, Mistral chat, Voxtral transcripts, Skynet captions. The wiring contract on top of that was still open. How operators flip Intelligence on for one customer (without flipping it on for everyone). How it pins to OpenDesk versions so upgrades behave like the rest of the platform. How Onyx authenticates against the customer's existing identity. Which integrations work without per-customer configuration. The signup form had already drafted a single boolean for the add-on; the operator-facing schema needed to match.

Decision

One customer-facing toggle: intelligence: true | false in od-platform/customers/<slug>.yaml. When true, Argo renders a new intelligence-base Helm chart from od-platform into the customer's cluster. template-core declares intelligence-base@vX as a pinned Helm dependency, in the same shape as core-base. The chart bundles Onyx + Mistral + Voxtral + Skynet plus six pre-configured connectors out of the box: Mail (IMAP), Files (S3 against the tenant's MinIO bucket), Wiki (XWiki API), Meetings (Voxtral transcripts dropped into the tenant S3), Projects (OpenProject API), Calendar (CalDAV). SSO is OIDC against the per-tenant Nubus Keycloak, same realm as every other OpenDesk app. Onyx's UI is themed via Helm values onto KiwiStack tokens (kiwi-dark primary, bg-cream background, Quicksand display, Nunito body), and the Nubus portal tile uses the Intelligence iconSvg exported from src/data/products.ts on this site. YAML key, Helm chart, portal tile and operator docs all say 'intelligence', never 'onyx'. The vendor name appears only in the engine-attribution line on /office/intelligence. The full component-by-component breakdown (eleven components, six connectors, the umbrella + five overlay subcharts, the Voxtral pipeline, the sovereignty boundary) lives on /architecture/intelligence.

Why

One flag keeps the operator surface narrow. 'Did you turn it on or not?' is the only question, and upgrades roll forward as Renovate-driven base-version bumps, exactly the way the Core baseline already moves. Sharing the Nubus Keycloak realm means an Intelligence user is the same person as a Mail or Files user on day one, with no second account or MFA flow. The six chosen connectors are the ones with concrete paths in Onyx's existing model (S3 for Files via Nextcloud's MinIO, IMAP for Mail, web-API for the others); they cover the 'all integrations done out of the box' promise without inventing new connector code in v1. CryptPad Notes is intentionally excluded (see /office/notes caveat: E2EE means no AI can read it). Matrix Chat is deferred to v2 because no upstream Onyx Matrix connector exists today. Generic naming reduces engine lock-in: a future swap to LibreChat, Open WebUI or another workspace becomes a chart-internal change, with no rename across customer YAMLs, the operator portal, the docs or the marketing surface.

What was rejected

  • Per-feature flags (one boolean per connector and capability): maximum operator control, maximum config surface, most likely place for vendor specifics to leak into customer YAMLs. Defeats the one-flag goal that drove the request.
  • Onyx-branded across the stack (tile says 'Onyx', YAML key is onyx: true, chart is onyx-base): tight branding fit with the engine, but bakes the vendor name into every operator-facing artefact. Future engine swap becomes a rename ritual across od-platform, every customer state repo and every doc page.
  • Separate identity store for Intelligence (Onyx-native auth or its own Keycloak): isolates the AI workspace's auth blast radius, at the cost of a second password / MFA flow per user and a duplicate identity directory to sync. The point of running Nubus once per tenant is that no app outside it carries a parallel user store.
  • Ship Onyx with no out-of-the-box connectors and let customers wire it up: cheaper to launch, but breaks the 'all integrations done out of the box' promise the offer makes, and turns Intelligence into a manual configuration project for every customer.

Decisions intentionally not on this page

The decisions are documented; they're just not duplicated here: Core-tier on full upstream k8s vs Mesh/Fleet on k3s lives on /architecture; the offline-root + per-customer-intermediate PKI hierarchy lives there too; OEM-direct shipping for Fleet and cross-OS BYOL for Mesh live on /architecture/devices; the per-role keys distribution model, the two-person rule for Critical operations, and the BYOC alternative live on /architecture/authorization; repo topology and namespace conventions live on /architecture/gitops; the per-app desktop-launcher mechanics, the Kerberos / SSO trace, and the two native exceptions live on /architecture/desktop-apps. We don't repeat them here.