№ A · 02 / Platform strategy

Gluing many OSS parts, under one contract.

KiwiStack does, in the end, the same job as OpenDesk: it takes a dozen independent open-source components and makes them behave like one product, per customer, repeatably. Without a single rule for how a component is integrated, every new part invents its own deployment path and the system sprawls. This page is that rule. It is the operating model the rest of the architecture assumes, and the contract every component, Onyx included, must meet.

№ P1 / The Component Integration Contract

Seven rules,
not negotiable.

Every OSS or custom component (the OpenDesk apps, Onyx today, whatever comes next) is integrated the same way. The rules are lifted from how OpenDesk already composes fifteen components without forking any of them: a hierarchical helmfile with ordered stages, one version source of truth, overlay-not-fork customization, one seed for all credentials, bootstrap jobs as a first-class stage.

| № | Rule | What it means |
|---|---|---|
| a | One helmfile release, ordered | Every component is a first-class release of the single tenant helmfile, ordered by a deployStage label. No hand-run helm install, no separate -spike namespace, no manifest applied by hand. Adding a component is adding a release block, not inventing a deployment path. |
| b | Lives in the customer namespace | The component runs in cust-<slug> with every other component that customer owns. No cross-namespace data plane. The only cross-namespace calls allowed are to a declared shared service (see P4). A component that needs the tenant's Postgres reaches it in its own namespace, not over a namespace boundary. |
| c | Versions pinned in one place | Chart version and image tag are pinned in a single source-of-truth file per environment and bumped together in one commit. Modelled on OpenDesk's charts.yaml + images.yaml. The deployed version is never 'whatever HEAD is' and never :latest in dogfood or stable. |
| d | Customize by overlay, never fork config | Upstream chart values are never edited in place. Customization is a declared overlay, applied last, highest precedence (OpenDesk's customization.release.<app> pattern). The code repo is forked only when the source itself must change (onyx-kiwistack), and that fork is treated as a versioned, released artifact. |
| e | Credentials derived from one seed | Every credential is derived from the single MASTER_PASSWORD seed, component-scoped, consumed in the same namespace. Never hand-copied between namespaces, never stored. A from-scratch rebuild yields identical credentials with zero stored state (OpenDesk's derivePassword convention, the intelligence-base.derivePw helper). |
| f | Every side effect is an idempotent hook | Codify or it is not done. Any required side effect (a DB role, a bucket, an OIDC client, a schema seed, a consumer restart) is a pre or post hook that self-heals from scratch and is safe to re-run. A manual step taken to make something work is not finished until it is a hook in the same change. |
| g | Schema and seed are a contract | The chart's seed declares the minimum component schema version it needs. A check fails loudly when the running image does not satisfy it. This is the rule that would have caught the llm_model_flow miss before a user did. |
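The first three rules (one ordered release, the customer namespace, pinned versions) are mechanical enough to lint. The following is a minimal sketch, not the platform's actual tooling: the `Release` shape, the `lint_release` helper, the slug pattern and the example version strings are all hypothetical, and it only covers customer-owned components, not shared kiwi-* services.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Release:
    """Hypothetical, simplified view of one rendered helmfile release."""
    name: str
    namespace: str
    chart_version: str
    image_tag: str
    labels: dict = field(default_factory=dict)


# cust-<slug>, with the slug restricted to DNS-label characters (an assumption).
CUSTOMER_NS = re.compile(r"^cust-[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def lint_release(release: Release, environment: str) -> list:
    """Return contract violations for a customer-owned component; empty means compliant."""
    problems = []
    # One helmfile release, ordered: the release must carry a deployStage label.
    if "deployStage" not in release.labels:
        problems.append("missing deployStage label")
    # Lives in the customer namespace.
    if not CUSTOMER_NS.match(release.namespace):
        problems.append(f"namespace {release.namespace!r} is not cust-<slug>")
    # Versions pinned: nothing floats in dogfood or stable.
    if environment in ("dogfood", "stable"):
        if not release.chart_version:
            problems.append("chart version is not pinned")
        if release.image_tag in ("", "latest"):
            problems.append("image tag is floating")
    return problems


# Placeholder values shaped like the spike and the target; versions are made up.
spike = Release("onyx-spike", "kiwistack-intel-spike", "", "latest")
target = Release("onyx", "cust-kiwistack", "0.1.0", "v1.0.0", {"deployStage": "050"})
print(lint_release(spike, "dogfood"))
print(lint_release(target, "dogfood"))
```

A check like this would most naturally run before apply, so a non-compliant release block fails in review rather than in the cluster.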

The one that matters most

Rule f (every side effect is an idempotent hook) is the rule the whole platform was missing. Most of the Intelligence firefighting was the same shape: a manual step (a Postgres role, a Keycloak client, a pod restart, a seed row) made something work, was not codified, and a from-scratch recreate orphaned it. "Codify or it is not done" means the plan for any change must include the step that turns the fix into an idempotent hook, and the task is not complete until that step is done.
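What "idempotent, self-healing, safe to re-run" means for a seed row can be shown in a few lines. This is an illustrative sketch only: it uses an in-memory SQLite database as a stand-in for the tenant's Postgres, and the `provider_seed` table, its columns and the endpoint value are hypothetical.

```python
import sqlite3


def ensure_seed(conn: sqlite3.Connection) -> None:
    """Hook body: converge to the desired state, however often it runs."""
    # Create-if-missing, so a from-scratch database and an existing one both work.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS provider_seed ("
        "name TEXT PRIMARY KEY, endpoint TEXT NOT NULL)"
    )
    # Replace instead of plain insert: a re-run repairs drift rather than
    # failing on a duplicate key.
    conn.execute(
        "INSERT OR REPLACE INTO provider_seed (name, endpoint) VALUES (?, ?)",
        ("gateway", "http://llm-gateway.example"),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
ensure_seed(conn)  # from scratch
ensure_seed(conn)  # re-run: no error, no duplicate row
rows = conn.execute("SELECT name, endpoint FROM provider_seed").fetchall()
print(rows)
```

The property that matters is that the function describes the end state rather than the step taken to reach it, which is what lets a recreate reproduce it without anyone remembering the manual fix.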

№ P2 / Onyx, the worked example

The spike
against the contract.

Onyx works today, but it was integrated as a spike: a hand-run release in its own namespace, reusing the tenant's data over a namespace boundary, with secrets duplicated and pods restarted by a special hook. Every recurring incident traced back to that shape. The contract is what the spike becomes.

| Aspect | Spike (today) | Contract (target) |
|---|---|---|
| Deployment | Hand-run helm install onyx-spike, outside both helmfiles | A release block in the tenant helmfile, deployStage ordered |
| Namespace | Own kiwistack-intel-spike ns, reusing cust-kiwistack data over the namespace boundary | Runs in cust-<slug> next to the data services it uses |
| Secrets | spike-onyx-* written into two namespaces by a provisioning Job, kept in sync by hand | Derived in-namespace from the seed, one copy, no mirror |
| Stale creds | A restart-consumers hook had to roll pods after every rotation | Same-namespace Secret, normal Helm checksum-restart, no special hook |
| Ingress | A separate spike-ingress.yaml applied by hand, chart ingress disabled | One ingress owned by the release |
| Schema drift | Seed assumed a table the image did not have, found at runtime by a user | Seed declares the schema version, a check fails the deploy first |

None of this needs a migration. There is no real customer data yet, so the contract-compliant Onyx is a from-scratch rebuild of the kiwistack tenant, not in-place surgery. That is Phase 3.
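The "derived in-namespace from the seed" target is what makes a from-scratch rebuild cheap: the same inputs always yield the same credential, so nothing has to be mirrored or stored. The sketch below illustrates that property with HMAC-SHA256 as a stand-in. It is not OpenDesk's derivePassword algorithm or the intelligence-base.derivePw helper, and the `derive_credential` name, the scope format and the example purpose string are assumptions.

```python
import base64
import hashlib
import hmac


def derive_credential(master_password: str, tenant: str, component: str,
                      purpose: str, length: int = 32) -> str:
    """Deterministically derive a component-scoped credential from the seed."""
    # The scope string keeps credentials distinct per tenant, component and use.
    scope = f"{tenant}/{component}/{purpose}".encode()
    digest = hmac.new(master_password.encode(), scope, hashlib.sha256).digest()
    # URL-safe base64 keeps the result usable in connection strings.
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")[:length]


# Placeholder seed value; the real MASTER_PASSWORD never appears in code.
first = derive_credential("example-seed", "kiwistack", "onyx", "postgres")
rebuilt = derive_credential("example-seed", "kiwistack", "onyx", "postgres")
other = derive_credential("example-seed", "kiwistack", "broker", "postgres")
print(first == rebuilt, first == other)
```

Because the credential is a pure function of the seed and the scope, a rebuilt tenant arrives at the same value in its own namespace, which is why no secret mirror is needed.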

№ P3 / Repository strategy

Template is logic.
Customer is state.

The full repo topology is on the GitOps page. The one rule that matters for this operating model: platform logic lives in od-platform and the tier templates; a cust-<slug> repo holds customer state only (generated env, overlay values, encrypted secrets), never platform logic. Customizations reach a customer by a version bump of the template pin, not by editing the customer repo.
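The overlay values a customer repo holds only take effect through layering: platform defaults first, the customer overlay last, highest precedence. A minimal sketch of that precedence follows, assuming the common Helm-style behaviour of merging maps recursively and replacing everything else. The `merge_values` helper and the example keys are hypothetical.

```python
from copy import deepcopy


def _merge_into(target: dict, overlay: dict) -> None:
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Maps merge key by key.
            _merge_into(target[key], value)
        else:
            # Scalars and lists are replaced wholesale by the later layer.
            target[key] = deepcopy(value)


def merge_values(*layers: dict) -> dict:
    """Merge value layers left to right; later layers win. Inputs are not mutated."""
    result: dict = {}
    for layer in layers:
        _merge_into(result, layer)
    return result


upstream = {"ingress": {"enabled": True, "host": "onyx.example"}, "replicas": 1}
customer_overlay = {"ingress": {"host": "ai.customer.example"}}
merged = merge_values(upstream, customer_overlay)
print(merged)
```

The upstream values stay untouched on disk; the customer repo carries only the small overlay, which is what keeps it state rather than logic.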

The split to fix

Today cust-kiwistack mixes both: it carries platform logic (bootstrap scripts, the OpenDesk submodule wiring, the intelligence helmfile) and customer state in one repo. That is why a change to "how we deploy" and a change to "this customer's config" are indistinguishable in history. Phase 2 extracts template-core so the customer repo is only what is unique to that customer.

№ P4 / Namespaces and shared services

One namespace
per customer.

A customer is one namespace holding all of that customer's components. Anything genuinely shared across customers (the ingress controller, the LiteLLM gateway, web search, a future mail-egress gateway) lives in its own cluster-wide kiwi-* namespace and is multi-tenant by per-customer key, not by per-customer install. There is no third option: a component is either the customer's, in their namespace, or shared, in a kiwi-* namespace.

| Namespace | Holds | Scope |
|---|---|---|
| cust-<slug> | Everything that customer owns: OpenDesk apps, Onyx, the broker, future components | One per customer. The NetworkPolicy and RBAC boundary. |
| kiwi-<service> | Cluster-shared gateways: ingress-nginx, kiwi-llm-gateway (LiteLLM), kiwi-search-gateway (SearXNG), a future kiwi-mail-gateway (TEM egress), cert-manager | Cluster-wide, multi-tenant via per-customer virtual keys. Never holds customer data. |
| argo-cd / kube-system | Platform control plane and system addons | Operator-only. Out of every customer's reach by policy. |

Tier mapping

Same contract,
different blast radius.

The contract does not change by tier. What changes is what enforces the boundary: on Core it is the NetworkPolicy, on Mesh and Fleet it is the dedicated cluster. The same charts drive both, so the policies are applied either way.

Core

One shared cluster, many customers

NetworkPolicy is the tenant boundary. Default-deny is mandatory before a second Core customer.

Mesh / Fleet

One dedicated cluster per customer

The cluster is the hard boundary. Same contract, same policies, lower stakes (no second tenant).

№ P5 / Network isolation

Default-deny,
then allow on purpose.

Today there is no NetworkPolicy anywhere: any pod can reach any tenant's database. That is acceptable for exactly one customer and not one more. Every cust-<slug> namespace gets a default-deny policy, codified in the chart so it is not optional.

A customer namespace allows

→ traffic from its own pods
→ ingress from the shared ingress-nginx
→ egress to cluster DNS
→ egress to the specific shared kiwi-* services it uses
→ external egress only where a component requires it

A shared kiwi-* namespace allows

→ ingress only from cust-* namespaces, on the service port
→ nothing from another shared namespace by default
→ its own required external egress (a model API, an SMTP relay)

This is the one item that is a hard gate: a second Core customer is not onboarded until default-deny is in place. It is listed as a primitive on the GitOps page already; this makes it mandatory in the chart rather than aspirational in a doc.
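The customer-namespace allow-list above maps onto two NetworkPolicy objects: a default-deny and a baseline allow. The sketch below builds them as plain dicts and prints JSON, which kubectl accepts as well as YAML. It rests on assumptions: the ingress controller runs in a namespace named ingress-nginx, cluster DNS runs in kube-system, namespaces carry the standard kubernetes.io/metadata.name label, and the policy and function names are made up. External egress is left out because it is per component.

```python
import json


def _namespace(name: str) -> dict:
    # Selects a namespace by its automatic name label.
    return {"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": name}}}


def default_deny(namespace: str) -> dict:
    """Select every pod and allow nothing in either direction."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_baseline(namespace: str, shared_namespaces: list) -> dict:
    """Allow own pods, the shared ingress, cluster DNS and the named shared services."""
    dns_ports = [{"protocol": "UDP", "port": 53}, {"protocol": "TCP", "port": 53}]
    egress = [
        {"to": [{"podSelector": {}}]},  # pods in the same namespace
        {"to": [_namespace("kube-system")], "ports": dns_ports},
    ]
    # An empty "to" list would match every destination, so add the rule only
    # when there is at least one shared service to allow.
    if shared_namespaces:
        egress.append({"to": [_namespace(name) for name in shared_namespaces]})
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-baseline", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": [{"podSelector": {}}, _namespace("ingress-nginx")]}],
            "egress": egress,
        },
    }


for policy in (default_deny("cust-kiwistack"),
               allow_baseline("cust-kiwistack", ["kiwi-llm-gateway"])):
    print(json.dumps(policy, indent=2))
```

Rendering these from the chart rather than applying them by hand is what makes the gate enforceable: a customer namespace cannot exist without its default-deny.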

№ P6 / Release model

dev,
dogfood, stable.

The point is to iterate without cutting a version every change, then pin a version once it works. Three lanes, keyed by helmfile environment. You work in dev. You prove it on the dogfood tenant. Only what dogfood validated reaches customers, by a version pin. Rolling back is re-pinning the previous version, which is safe precisely because everything is idempotent configuration-as-code (CaC) and credentials are derived, so there is no stored state to strand.

| Lane | Where | Pinning | Use |
|---|---|---|---|
| dev | The operator workstation, against the dogfood tenant | Local chart path, images :latest, pullPolicy Always, no version bump | Iterate freely: Onyx colours, a new connector, a custom element. Break things. The only rule: the moment it works it must be captured as idempotent CaC. |
| dogfood | The kiwistack tenant (KiwiStack runs itself) | A cut chart version + image pinned by digest | Prove end-to-end on a real customer-shaped slice before anyone else sees it. This is the gate. |
| stable | Every other customer | Only dogfood-validated, pinned versions | Roll out = bump the template version pin. Roll back = re-pin the previous one. Safe because everything is idempotent CaC, credentials are derived, and there is no stored-state dependence. |

The rule that makes roll-back free

Whenever a change is made (a new connector, a customized element, a colour, an init script that pulls a credential from another service), the plan for that change carries a mandatory final step: capture it as idempotent configuration-as-code. Not a note, a step. Until that step is done the change does not exist as far as dogfood and stable are concerned. This is rule f applied to the everyday workflow, and it is why a rollback is just re-pinning a version, never archaeology.
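The three lanes differ only in what they accept as a pin, which can be written down as a check. This is a sketch under stated assumptions: the `lane_problems` helper, the way dogfood validation is recorded (a set of chart-version and image pairs), and the example registry name and versions are all hypothetical.

```python
import re

# An image reference pinned by digest, e.g. name@sha256:<64 hex chars>.
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def floating(image: str) -> bool:
    """True when the reference has no tag or digest, or uses :latest."""
    if "@" in image:
        return False
    last_segment = image.rsplit("/", 1)[-1]  # ignore a registry port such as host:5000
    return ":" not in last_segment or last_segment.endswith(":latest")


def lane_problems(lane: str, chart_version: str, image: str,
                  validated_in_dogfood: set) -> list:
    """Return the reasons a pin is not acceptable for the given lane."""
    if lane == "dev":
        return []  # dev floats by design
    problems = []
    if not chart_version:
        problems.append("chart version must be cut")
    if floating(image):
        problems.append("image must not float")
    if lane == "dogfood" and not DIGEST_REF.match(image):
        problems.append("dogfood images are pinned by digest")
    if lane == "stable" and (chart_version, image) not in validated_in_dogfood:
        problems.append("not validated on dogfood")
    return problems


pinned = "registry.example/onyx@sha256:" + "a" * 64  # placeholder digest
print(lane_problems("dev", "", "onyx:latest", set()))
print(lane_problems("dogfood", "0.3.0", "onyx:v1.0.0", set()))
print(lane_problems("stable", "0.3.0", pinned, {("0.3.0", pinned)}))
```

Under this model a rollback is just another call with the previous pair: if that pair was validated on dogfood once, it is still acceptable for stable.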

№ P7 / Upstream upgrades

Bump on purpose,
never by accident.

Upstream moves: OpenDesk releases, Onyx releases, the gateways release. An upgrade is a deliberate, pinned, dogfood-gated act, never a silent drift. The two failure modes already seen (a submodule patch reverting unnoticed, a chart seed assuming a schema the image did not have) are closed by the patch-guard and the schema contract.
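The schema contract named here reduces to one comparison that fails before the deploy proceeds. A minimal sketch, assuming schema versions are dotted integers that can be ordered, which may not match how a given component actually labels its migrations. The function and exception names and the example versions are hypothetical.

```python
class SchemaContractError(RuntimeError):
    """Raised when the running image cannot satisfy the seed's declared schema."""


def parse_version(text: str) -> tuple:
    # "1.10.0" -> (1, 10, 0), so comparison is numeric rather than lexical.
    return tuple(int(part) for part in text.strip().split("."))


def check_schema_contract(required_min: str, image_schema: str) -> None:
    """Fail loudly if the image's schema is older than the seed requires."""
    if parse_version(image_schema) < parse_version(required_min):
        raise SchemaContractError(
            f"seed needs schema >= {required_min}, image provides {image_schema}"
        )


check_schema_contract("1.4.0", "1.10.0")  # passes: 1.10.0 is newer than 1.4.0
try:
    check_schema_contract("1.4.0", "1.3.9")
except SchemaContractError as error:
    print(f"deploy blocked: {error}")
```

Run as a pre hook, a check of this shape turns a runtime surprise into a failed deploy.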

OpenDesk (submodule)

Pinned by tag (the env-prod-1.11.4 directory is the pin). Local changes only via patches/ + overlay, never by editing the submodule tree. A pre-apply patch-guard asserts every patch is applied and fails loudly if not. Bumping is a documented runbook: bump pin, re-prep, patch-check, dogfood, stable.

Onyx (code fork)

onyx-kiwistack tracks upstream as a remote and rebases on a cadence. The chart seed declares the Onyx schema version it needs (rule P1g). The image is pinned by release tag in dogfood and stable, never :latest.

Shared gateways

LiteLLM, SearXNG, future TEM gateway: pinned upstream chart versions, one shared release per cluster, multi-tenant by per-customer virtual key, not by per-customer install.
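The pre-apply patch-guard described for the OpenDesk submodule can be approximated with git itself: `git apply --reverse --check` exits successfully only when a patch could be cleanly un-applied, which means its changes are present in the tree. The sketch below assumes patches are `*.patch` files in one directory and that the check runs from the submodule root; the function names are hypothetical and the `run` parameter exists only so the logic can be exercised without a repository.

```python
import subprocess
from pathlib import Path


def patch_is_applied(tree: Path, patch: Path, run=subprocess.run) -> bool:
    """True if the patch's changes are already present in the tree."""
    result = run(
        ["git", "apply", "--reverse", "--check", str(patch.resolve())],
        cwd=tree,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def guard(tree: Path, patches_dir: Path, run=subprocess.run) -> None:
    """Fail loudly, naming every patch that is not applied."""
    missing = [
        patch.name
        for patch in sorted(patches_dir.glob("*.patch"))
        if not patch_is_applied(tree, patch, run)
    ]
    if missing:
        raise SystemExit("patch-guard: not applied: " + ", ".join(missing))
```

Called before every apply, a guard like this makes a silently reverted patch stop the run instead of reaching the tenant.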

№ P8 / Roadmap

No big bang.
Sequenced, gated.

This page is Phase 0. Nothing below it is started until the operating model itself is agreed. Each phase is independently useful and gated on the one before. Because there is no real data yet, the heavy phase is a clean rebuild, not a migration.

Phase 0 (this page)

The operating model is written down and agreed. Tiered CLAUDE.md encodes the non-negotiables so future iterations follow them. No infrastructure changes.

Phase 1 — cheap guardrails

Submodule patch-guard. Correct the stale credentials-hazard note. Pin dogfood images by digest. Add the schema-and-seed contract check. All low-risk, high-leverage.

Phase 2 — repo split

Extract template-core from the current cust-kiwistack. The customer repo becomes state only: generated env, overlay values, encrypted secrets. Platform logic moves to od-platform and the template.

Phase 3 — collapse Onyx

Rebuild the kiwistack tenant from scratch with Onyx as a contract-compliant release inside cust-kiwistack. Retire the spike namespace, the secret mirror, the restart hook, the split ingress. Move LiteLLM and SearXNG into shared kiwi-* namespaces.

Phase 4 — isolation + release lanes

Default-deny NetworkPolicies everywhere. Formalize dev / dogfood / stable. Publish the chart as a versioned OCI artifact instead of a local path.

Phase 5 — one loop

Argo CD app-of-apps unifies the remaining release sources. From git to running customer in one reconciliation loop, the end state the GitOps page already describes.

The non-negotiables

Codify or it is not done. One contract for every component. The customer namespace is the boundary. Working code is canonical, the docs follow it. Secrets are derived, never committed, verified by hash. dev floats, dogfood and stable are pinned, nothing reaches stable that dogfood has not run. These live in the CLAUDE.md files at the workspace root, in od-platform, and in the customer repo, so every future iteration starts from them instead of rediscovering them.
