Roadmap

Astromesh OS is built capability-by-capability through phases, not as a single big-bang release. Each phase delivers a working appliance and is guarded by an explicit exit gate.

The governing rule, inherited from the build spec, is simple and absolute:

You do not advance to phase N+1 until phase N boots and passes its checks.

Each gate below is a real, automated check — not a milestone you declare done. CI (GitHub Actions) is the authoritative judge of whether a gate is met.
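The sequencing rule can be modelled in a few lines. This is only an illustrative sketch: the phase labels mirror the roadmap, but the function name and the shape of the gate results are assumptions, not part of the Astromesh build spec or its CI.

```python
from typing import Mapping, Sequence

# Phase labels in roadmap order (post-4 work is incremental and not modelled here).
PHASES: Sequence[str] = ("0", "1", "2", "3", "4")


def phases_cleared(gate_results: Mapping[str, bool]) -> int:
    """Count consecutive phases, starting at phase 0, whose gate passed.

    A missing result counts as a failure, and a passing gate after a
    failing one does not count: phase N+1 cannot be reached before N.
    """
    cleared = 0
    for phase in PHASES:
        if not gate_results.get(phase, False):
            break
        cleared += 1
    return cleared
```

With results such as `{"0": True, "1": True, "2": False, "3": True}`, only phases 0 and 1 count as cleared; the passing Phase 3 check is ignored until Phase 2 is green.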

| Phase | Delivers | Exit gate |
|---|---|---|
| 0: Unit validation | Astromesh-core as a systemd service on a standard mkosi/Debian image, booting in QEMU/KVM | 200 on /v1/health and an agent query answered via the API against a frontier provider; astromeshctl doctor green |
| 1: Minimal + boot-to-agent | Custom target, aggressive trimming, the size gate; push the OCI artifact via ORAS | Core image ≤ 500 MB and boots as a cloud image; the build fails if it exceeds the ceiling |
| 2: Immutability + updates | dm-verity over root, read-only root, A/B with systemd-sysupdate + automatic rollback | Update and rollback proven: a new boot that fails health rolls back to the previous slot |
| 3: Security | TPM/sealed secrets, remote attestation, no-shell + break-glass, SELinux enforcing, Secure Boot; fail-closed kernel guarantees (tool sandbox + egress) | A secret is accessible only under an intact boot; the agent does not start without sandbox and egress guaranteed |
| 4: Agent-native + fleet | Declarative machine-config join (static peers → Maia gossip), causal eBPF telemetry, OTel export, optional GPU sysext | A node joins the mesh from machine-config alone, with no SSH |
| post-4 | sched_ext scheduler + GPU broker, agent-aware memory/OOM, CRIU snapshot/restore | Incremental: each capability lands behind its own check |

Phase 0: Unit validation

Prove the runtime works as a systemd service on a plain Debian/mkosi image that boots in QEMU/KVM. This phase is intentionally not minimal or immutable; that work begins in Phase 1. The gate is a live health check plus one real agent query, with astromeshctl doctor green.

Phase 1: Minimal + boot-to-agent

Strip the image down to a custom boot-to-agent target, enforce the 500 MB ceiling as a build-failing gate, and publish the OCI artifact over ORAS. The dominant size term is the runtime’s Python closure, so trimming focuses there.
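A build-failing size gate can be as small as the sketch below. It is a hypothetical illustration rather than the project's actual check: the function name is invented, and reading "500 MB" as decimal megabytes (rather than MiB) is an assumption.

```python
import sys

# Assumption: "500 MB" is read as 500 * 10^6 bytes; use 500 * 1024**2 for MiB.
CEILING_BYTES = 500 * 1000 * 1000


def size_gate(size_bytes: int, ceiling_bytes: int = CEILING_BYTES) -> int:
    """Return a process exit code: 0 if the image fits, 1 if it is too large."""
    if size_bytes <= ceiling_bytes:
        return 0
    over = size_bytes - ceiling_bytes
    print(
        f"image exceeds the {ceiling_bytes}-byte ceiling by {over} bytes",
        file=sys.stderr,
    )
    return 1
```

A CI step could pass the result to `sys.exit`, for example `sys.exit(size_gate(os.path.getsize(image_path)))`, so that an oversized image fails the build rather than producing a warning.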

Phase 2: Immutability + updates

Make the root read-only and verified with dm-verity, and ship updates as atomic A/B image swaps via systemd-sysupdate, with automatic rollback: a freshly updated slot that boots but fails its health check is never blessed and the system returns to the last known-good slot. See Architecture for the mechanism.
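The rollback behaviour can be pictured as a small state machine. The sketch below is a simplified model loosely based on boot counting; the slot names, the retry count, and every identifier are assumptions for illustration, and the real mechanism is the one described in Architecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotState:
    good: str                    # last known-good slot, "A" or "B"
    trial: Optional[str] = None  # freshly updated slot, not yet blessed
    tries_left: int = 0          # boot attempts remaining for the trial slot


def stage_update(state: SlotState, tries: int = 3) -> None:
    """Write the update to the inactive slot and mark it as a trial."""
    state.trial = "B" if state.good == "A" else "A"
    state.tries_left = tries


def next_boot_slot(state: SlotState) -> str:
    """Pick the slot to boot, spending one try if a trial is pending."""
    if state.trial is not None and state.tries_left > 0:
        state.tries_left -= 1
        return state.trial
    state.trial = None  # tries exhausted (or no trial): fall back
    return state.good


def record_health(state: SlotState, booted: str, healthy: bool) -> None:
    """Bless the trial slot only after it booted and passed its health check."""
    if booted == state.trial and healthy:
        state.good, state.trial, state.tries_left = booted, None, 0
```

In this model an unhealthy trial slot is simply never promoted: once its tries run out, `next_boot_slot` returns the previous known-good slot again.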

Phase 3: Security

Add TPM-sealed secrets, lock down the appliance with no interactive shell (plus a break-glass path), run SELinux in enforcing mode, and enable Secure Boot. This phase also introduces the fail-closed kernel guarantees: the agent will not start unless its tool sandbox and egress governance are in place.
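"Fail-closed" means that a missing or unknown answer is treated the same as a negative one. A minimal sketch of that start-up decision follows; the guarantee names and the function are hypothetical, chosen only to illustrate the principle.

```python
from typing import Mapping

# Hypothetical names for the two guarantees the agent depends on.
REQUIRED_GUARANTEES = ("tool_sandbox", "egress_governance")


def may_start_agent(status: Mapping[str, object]) -> bool:
    """Allow start-up only if every required guarantee is explicitly True.

    Missing keys, None, and truthy non-boolean values all deny start-up,
    so a probe that crashed or returned nothing cannot open the gate.
    """
    return all(status.get(name) is True for name in REQUIRED_GUARANTEES)
```

The design choice is the `is True` comparison: a fail-open check would start the agent whenever no problem was reported, whereas this one starts it only when both guarantees were positively confirmed.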

Phase 4: Agent-native + fleet

Turn the appliance into a fleet member that configures itself declaratively. A node joins the mesh purely from its machine-config (static peers first, migrating to Maia gossip as it matures), with mesh mTLS/IPsec and causal eBPF egress telemetry exported via OpenTelemetry. The gate is a no-SSH join.

Post-4

Further kernel-level work: cgroup memory governance, CRIU-based snapshot/restore, and a sched_ext scheduler with a GPU broker. These are implemented behind their own checks. Two acceptance gates are deferred for environmental reasons rather than missing code: sched_ext ships as a guarded, fail-closed, default-off loader, but Debian trixie’s 6.12 kernel is built without CONFIG_SCHED_CLASS_EXT (backports 7.0 has it), so its gate waits on moving the kernel baseline; the GPU broker waits on GPU-equipped VMs and ships behind a sysext-gpu extension.
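A guarded, default-off loader can be approximated by checking the kernel build configuration before doing anything else. The sketch below is an assumption about how such a guard might look, not the shipped loader: the function names are invented, and it only inspects config text (typically found in a file such as /boot/config-<release>), leaving the actual scheduler loading out of scope.

```python
def kernel_option_enabled(config_text: str, option: str) -> bool:
    """Return True only if the kernel config contains the line `<option>=y`.

    A line such as `# CONFIG_FOO is not set`, or no mention at all,
    counts as disabled.
    """
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


def should_load_sched_ext(config_text: str, opted_in: bool = False) -> bool:
    """Default-off and fail-closed: require both an explicit opt-in and
    kernel support before the loader is allowed to proceed."""
    return opted_in and kernel_option_enabled(config_text, "CONFIG_SCHED_CLASS_EXT")
```

On a kernel whose config reads `# CONFIG_SCHED_CLASS_EXT is not set`, the guard returns False even with the opt-in set, which matches the deferred gate described above.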

The phases above culminate in a set of kernel-level capabilities that are meant to be the OS’s defensible core — the reason an agent appliance is worth building rather than just running a service on a stock VM:

  • eBPF/XDP egress governance — governing and shaping what an agent can talk to at the network layer.
  • Causal cost attribution — attributing cost and resource use back to the agent (and action) that caused it.
  • A scheduler tuned for agent workloads.

The framing matters: these exist for trust, attribution, and density, not “it runs faster.” They are sequenced into the phases deliberately (egress and sandbox in Phase 3; causal telemetry in Phase 4; scheduler and friends post-4) rather than exposed as a pile of loose tunables.