KiwiStack

№ 005 / Architecture · Observability

Promises that are
also signals.

The pricing FAQ promises a status page and a backup that lands. Observability is the layer that turns those copy lines into something a customer can verify without emailing us. It runs cluster-wide, not per-customer: one Uptime Kuma instance watches every `*.<customer-domain>` we host, every Restic CronJob pushes a heartbeat to it, and every alert goes through one channel the founder reads in real time.


№ O.1 / Layers

Four layers,
one namespace.

The status page ships first. Metrics and full log aggregation are documented now so the deploy shape is decided before they land, not invented mid-incident. Backup health and alert routing are baked in from day one.

Status page

Scoped, ships first

Customer-visible uptime + per-app health

Tool · Uptime Kuma

Where · Cluster-wide `monitoring` namespace · single replica · status.kiwistack.io

  • Each `*.<customer-domain>` ingress: HTTPS 200 with valid certificate chain
  • Each app's health endpoint: portal `/health`, OX `/appsuite/api/system/version`, Nextcloud `/status.php`, Synapse `/_matrix/client/versions`, Jitsi `/about/health`, OpenProject `/health_checks/default`, XWiki `/xwiki/bin/Main/WebHome`
  • SMTP banner check on the customer's MX target (port 25 from a public probe)
  • DNS A-record sanity for every subdomain we publish
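
As a rough illustration, the per-app checks above could be expanded into concrete monitor targets along these lines. The health paths come from the list above; the subdomain labels passed in (for example `cloud` for Nextcloud) and the helper name `monitor_urls` are assumptions made for the example, not the platform's actual naming.

```python
# Health paths as listed above, keyed by app name.
HEALTH_PATHS = {
    "portal": "/health",
    "ox": "/appsuite/api/system/version",
    "nextcloud": "/status.php",
    "synapse": "/_matrix/client/versions",
    "jitsi": "/about/health",
    "openproject": "/health_checks/default",
    "xwiki": "/xwiki/bin/Main/WebHome",
}


def monitor_urls(customer_domain, subdomains):
    """Return {app: health URL} for each app the customer runs.

    `subdomains` maps an app name to its subdomain label. The labels
    are hypothetical; only the paths are taken from the document.
    """
    urls = {}
    for app, label in subdomains.items():
        urls[app] = f"https://{label}.{customer_domain}{HEALTH_PATHS[app]}"
    return urls


if __name__ == "__main__":
    example = monitor_urls("example.org", {"portal": "portal", "nextcloud": "cloud"})
    for app, url in example.items():
        print(app, url)
```

Each resulting URL would then be registered as one HTTPS monitor expecting a 200 response and a valid certificate chain.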

Metrics

Documented, deferred

Per-pod CPU/memory, Helm chart KPIs, ingress request rates, error budgets

Tool · Prometheus + Loki + Grafana

Where · Same `monitoring` namespace · scrape configs templated from the customer registry

  • kube-state-metrics for pod/deployment health
  • ingress-nginx exporter for per-host latency + 5xx rates
  • Postgres + MariaDB + Redis exporters per customer namespace
  • Loki tail on every customer's `cust-<slug>` namespace, retention 30 days
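
Since this layer is deferred, the following is only a sketch of what "scrape configs templated from the customer registry" might look like: one Prometheus pod-discovery job per `cust-<slug>` namespace. The function name, the job naming scheme, and the idea that the registry boils down to a list of slugs are all assumptions.

```python
import json


def scrape_configs(customer_slugs):
    """Build one Prometheus scrape job per customer namespace.

    Each job uses Kubernetes service discovery (role: pod) restricted
    to the customer's `cust-<slug>` namespace. Job names are hypothetical.
    """
    configs = []
    for slug in sorted(customer_slugs):
        namespace = f"cust-{slug}"
        configs.append(
            {
                "job_name": f"{namespace}-pods",
                "kubernetes_sd_configs": [
                    {"role": "pod", "namespaces": {"names": [namespace]}}
                ],
            }
        )
    return configs


if __name__ == "__main__":
    # Printed as JSON for readability; the slugs are placeholders.
    print(json.dumps({"scrape_configs": scrape_configs(["acme", "globex"])}, indent=2))
```

Generating the jobs from the registry keeps the scrape list in step with provisioning, so a new customer does not need a hand-edited Prometheus config.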

Backup health

Wired with the Restic CronJob

Verify nightly snapshots actually land in the off-site repo

Tool · Restic + Uptime Kuma heartbeat

Where · Push notification from the CronJob to Uptime Kuma's push URL

  • Heartbeat fires only on successful `restic backup` exit
  • Missed heartbeat for two consecutive nights triggers an alert
  • Weekly `restic check --read-data-subset` runs in a separate CronJob
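
The "heartbeat only on success" rule can be sketched as follows. This is an illustration in Python rather than the contents of `backup/restic-cronjob.yaml`, which may well do the same thing in shell. The function name and the injectable `run` and `opener` parameters are assumptions added so the logic can be exercised without a real backup; the `?status=up&msg=OK` query string follows the shape of Uptime Kuma's push monitor URLs as I understand them.

```python
import subprocess
import urllib.request


def backup_then_heartbeat(backup_cmd, push_url,
                          run=subprocess.run,
                          opener=urllib.request.urlopen):
    """Run the backup and ping the push URL only if it exited with 0.

    `backup_cmd` is an argument list such as ["restic", "backup", "/data"].
    `push_url` is the monitor's push URL (host and token are placeholders).
    Returns True when the heartbeat was sent and acknowledged with HTTP 200.
    """
    result = run(backup_cmd)
    if result.returncode != 0:
        # Stay silent on failure: the missing heartbeat is the signal.
        return False
    with opener(f"{push_url}?status=up&msg=OK", timeout=10) as response:
        return response.status == 200
```

Because a failed run sends nothing at all, the monitor only has to detect silence, which is what the two-consecutive-nights rule above relies on.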

Alert routing

Sized for one operator today

Notify the right person fast when something breaks

Tool · Uptime Kuma notification channels

Where · Email + Slack + (later) the customer's `admin_email` from the signup YAML

  • Default channel: hello@kiwistack.io for the founder
  • Customer-specific channel: drawn from od-platform/customers/<slug>.yaml `contact.admin_email`
  • Severity gating: reachability dips of about 1 minute do not page; outages longer than 5 minutes do
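
A minimal sketch of the severity gate, assuming the rule reduces to a single threshold: anything lasting 5 minutes or less stays quiet, anything longer pages. How outages between 1 and 5 minutes should be treated is not spelled out above, so lumping them in with the non-paging case is an assumption, as are the names used here.

```python
PAGE_AFTER_SECONDS = 5 * 60  # more than 5 minutes of downtime pages


def should_page(down_seconds):
    """Return True once an outage has lasted longer than the threshold."""
    return down_seconds > PAGE_AFTER_SECONDS


if __name__ == "__main__":
    for seconds in (60, 300, 301):
        print(seconds, should_page(seconds))
```

In Uptime Kuma itself the same effect would presumably come from the monitor's check interval and retry settings rather than from custom code.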

№ O.2 / Where it lives

Single source of
truth, per layer.

The status page is platform-level: it watches every customer, so it lives in `od-platform`, not in any `cust-<slug>`. Per-customer hooks (the URLs to monitor, the Restic heartbeat config) live in `template-core` and propagate at provisioning time.

od-platform

Operator-owned, holds platform-level monitoring config (status page is one cluster-wide deploy, not per-customer)

  • observability/uptime-kuma/values.yaml — Helm values for the deploy
  • observability/uptime-kuma/ingress.yaml — TLS via cert-manager, host status.kiwistack.io
  • observability/uptime-kuma/monitors/ — JSON exports of each customer's monitor set (one file per customer slug)
  • observability/prometheus/ — when metrics land
  • runbooks/uptime-kuma-rollout.md — deploy procedure
  • runbooks/alert-routing.md — channel setup, escalation logic

template-core

Per-customer hooks the platform reads to add the customer to Uptime Kuma

  • observability/monitors.yaml.tmpl — list of URLs the platform should monitor for this customer
  • backup/restic-cronjob.yaml — heartbeat-push to Uptime Kuma at the end of the CronJob

cust-<slug>

Customer state inherits the template; usually no per-customer overrides for monitoring

  • (generated) observability/monitors.yaml — frozen at instantiation with the customer's actual URLs

№ O.3 / Promises this closes

Copy claims,
made verifiable.

| Where on the site | Promise | Closed by |
| --- | --- | --- |
| `/pricing` FAQ: "What's your SLA when something breaks?" | 1-business-day response on Core, status page at status.kiwistack.io | Uptime Kuma deploy (Phase 8.D) |
| `/pricing` Support & SLA band | "Status page at status.kiwistack.io" | Uptime Kuma deploy |
| `/pricing` Core bullet: "Backup with off-site EU copy" | Customer can verify their backups are running | Restic heartbeat → Uptime Kuma signal visible on the status page |

Until the Uptime Kuma deploy lands, the FAQ references to status.kiwistack.io are forward-looking. The architecture above is what `status.kiwistack.io` will resolve to once `runbooks/uptime-kuma-rollout.md` runs against the cluster.