Gossip Protocol
Astromesh nodes in mesh mode use a push-pull gossip protocol to share cluster state, detect failures, and maintain a consistent view of the network without centralized coordination.
Algorithm
Astromesh uses push-pull gossip: when two nodes exchange state, both sides send their current state and merge the received state. This ensures bidirectional convergence: each exchange brings both nodes closer to a consistent view.
Gossip Exchange Sequence
        Node A                              Node B
          │                                   │
          │  1. Select random peer            │
          │────────── push state ────────────▶│
          │                                   │
          │                                   │  2. Merge received state
          │                                   │     with local state
          │                                   │
          │◀───────── pull state ─────────────│  3. Send own state back
          │                                   │
          │  4. Merge received state          │
          │     with local state              │
          │                                   │
          Both nodes now have the union of both states

Each gossip round:
- The node selects up to fanout random peers from its membership list
- For each selected peer, it initiates a push-pull exchange
- Each side merges the state it receives, keeping the most recent information for each node (highest heartbeat counter wins)
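The round described above can be sketched in a few lines of Python. This is a minimal in-process illustration, not Astromesh's implementation: `NodeEntry`, `merge`, `select_peers`, and `push_pull` are hypothetical names, and the network transport is replaced by plain dictionary copies.

```python
import random
from dataclasses import dataclass


@dataclass
class NodeEntry:
    """One row of a membership table (hypothetical shape, for illustration only)."""
    node_id: str
    heartbeat: int


def merge(local: dict[str, NodeEntry], received: dict[str, NodeEntry]) -> None:
    """Merge received entries into local; the highest heartbeat counter wins."""
    for node_id, entry in received.items():
        current = local.get(node_id)
        if current is None or entry.heartbeat > current.heartbeat:
            local[node_id] = NodeEntry(entry.node_id, entry.heartbeat)


def select_peers(members: list[str], self_id: str, fanout: int) -> list[str]:
    """Pick up to `fanout` distinct random peers, never the node itself."""
    candidates = [m for m in members if m != self_id]
    return random.sample(candidates, min(fanout, len(candidates)))


def push_pull(a: dict[str, NodeEntry], b: dict[str, NodeEntry]) -> None:
    """Simulate one exchange: A pushes, B merges and replies, A merges."""
    pushed = dict(a)   # step 1: snapshot of A's state sent to B
    merge(b, pushed)   # step 2: B merges the received state
    pulled = dict(b)   # step 3: B sends its state back
    merge(a, pulled)   # step 4: A merges the received state
```

After `push_pull(a, b)` both tables hold the union of the two inputs, with the higher heartbeat kept for any node that appeared in both.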
Timing Parameters
| Parameter | Default | Description |
|---|---|---|
| gossip_interval | 2s | How often each node initiates a gossip round |
| gossip_fanout | 3 | Number of random peers contacted per gossip round |
| heartbeat_interval | 5s | How often each node increments its own heartbeat counter |
Heartbeats
Each node maintains a monotonically increasing heartbeat counter. The counter is incremented every heartbeat_interval (5 seconds by default) and included in gossip exchanges.
A node’s liveness is determined by how recently its heartbeat was observed to change:
    heartbeat counter:  101   102   103   104   105   106
                         │     │     │     │     │     │
    time:                0s    5s    10s   15s   20s   25s

Other nodes track the last time they observed a new heartbeat value for each peer. If the observed heartbeat stops advancing, the peer is suspected and eventually declared dead.
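One way to track "the last time a new heartbeat value was observed" is to store, per peer, the highest counter seen together with a local timestamp. The sketch below assumes this approach. `HeartbeatTracker` and its methods are hypothetical names, and the injectable `clock` parameter exists only to make the example testable.

```python
import time


class HeartbeatTracker:
    """Remembers, per peer, the highest heartbeat seen and when it last advanced."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._seen: dict[str, tuple[int, float]] = {}

    def observe(self, node_id: str, heartbeat: int) -> bool:
        """Record a heartbeat learned via gossip; return True if it advanced."""
        previous = self._seen.get(node_id)
        if previous is not None and heartbeat <= previous[0]:
            return False  # stale or repeated value: keep the old timestamp
        self._seen[node_id] = (heartbeat, self._clock())
        return True

    def seconds_since_change(self, node_id: str) -> float:
        """Elapsed local time since this peer's heartbeat last advanced."""
        _, changed_at = self._seen[node_id]
        return self._clock() - changed_at
```

The timestamp comes from the observer's own monotonic clock, so nodes do not need synchronized wall clocks. Only the counter travels over the network.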
Failure Detection
Nodes are classified into three states based on how long it has been since their heartbeat was last seen to change:
                     Last heartbeat seen
                              │
                ┌─────────────┼─────────────┐
                │             │             │
              < 15s        15-30s         > 30s
                │             │             │
                ▼             ▼             ▼
            ┌───────┐    ┌─────────┐    ┌──────┐
            │ Alive │    │ Suspect │    │ Dead │
            └───────┘    └─────────┘    └──────┘

| State | Threshold | Description |
|---|---|---|
| Alive | < 15 seconds since last heartbeat change | Node is healthy and responsive |
| Suspect | 15 to 30 seconds since last heartbeat change | Node may be down. It still receives traffic, but at lower priority |
| Dead | > 30 seconds since last heartbeat change | Node is considered failed. Removed from routing, agents reassigned |
Configurable Thresholds
| Parameter | Default | Description |
|---|---|---|
| suspect_threshold | 15s | Time without heartbeat update before marking suspect |
| dead_threshold | 30s | Time without heartbeat update before marking dead |
| cleanup_threshold | 120s | Time after death before removing node from state entirely |
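The thresholds map directly onto a small classification function. The following is a sketch under stated assumptions, not the Astromesh source: `Thresholds`, `classify`, and `should_remove` are hypothetical names, the boundary handling follows the state table (exactly 15s and exactly 30s both count as suspect), and cleanup is assumed to be measured from the moment a node crosses the dead threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    suspect: float = 15.0   # suspect_threshold, in seconds
    dead: float = 30.0      # dead_threshold, in seconds
    cleanup: float = 120.0  # cleanup_threshold, seconds after death


def classify(elapsed: float, t: Thresholds = Thresholds()) -> str:
    """Map seconds since the last heartbeat change to a state name."""
    if elapsed < t.suspect:
        return "alive"
    if elapsed <= t.dead:
        return "suspect"
    return "dead"


def should_remove(elapsed: float, t: Thresholds = Thresholds()) -> bool:
    """True once a node has been dead for longer than the cleanup threshold.

    Assumption: the cleanup timer starts when the node crosses dead_threshold.
    """
    return elapsed > t.dead + t.cleanup
```

With the defaults and a 5-second heartbeat interval, a node becomes suspect after its counter has gone three intervals without advancing, and dead after six.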
State Convergence
Gossip protocols provide eventual consistency: all nodes converge to the same state given enough gossip rounds without further changes.
Convergence speed depends on cluster size, gossip interval, and fanout:
| Cluster Size | Expected Rounds to Converge | Time at 2s Interval |
|---|---|---|
| 3 nodes | 2-3 rounds | 4-6 seconds |
| 10 nodes | 4-5 rounds | 8-10 seconds |
| 50 nodes | 6-8 rounds | 12-16 seconds |
The push-pull approach converges roughly twice as fast as push-only gossip because each exchange synchronizes both participants.
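To get a feel for how cluster size and fanout affect convergence, a toy simulation can help. The sketch below is an idealized model and makes several assumptions that do not come from Astromesh: it tracks a single update starting at one node, every node gossips once per round in lockstep, and no messages are lost. `rounds_to_converge` is a hypothetical helper. Treat its results as a rough lower bound rather than a reproduction of the table.

```python
import random


def rounds_to_converge(n: int, fanout: int = 3, seed: int = 0,
                       max_rounds: int = 100) -> int:
    """Count rounds until one node's update reaches all n nodes via push-pull."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the new information
    for round_no in range(1, max_rounds + 1):
        snapshot = set(informed)  # what was known when the round began
        for node in range(n):
            others = [p for p in range(n) if p != node]
            for peer in rng.sample(others, min(fanout, len(others))):
                # Push-pull: if either side knew the update, both know it afterwards.
                if node in snapshot or peer in snapshot:
                    informed.update((node, peer))
        if len(informed) == n:
            return round_no
    raise RuntimeError("did not converge within max_rounds")
```

Calling `rounds_to_converge(50)` with different `seed` and `fanout` values shows how the round count changes. Multiplying the result by the gossip interval gives an approximate convergence time.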
Gossip State Payload
Each gossip message contains the sender’s full membership table, with one entry per known node. Each entry carries these fields:
| Field | Type | Description |
|---|---|---|
| node_id | string | Unique identifier for the node |
| address | string | Host and port (e.g., 10.0.1.10:8000) |
| heartbeat | integer | Monotonic heartbeat counter |
| state | string | Node state as perceived by sender: alive, suspect, dead |
| agents | list[string] | Agent names loaded on this node |
| meta | object | Arbitrary metadata (role, version, capabilities) |
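The fields above translate naturally into a small record type. The sketch below is illustrative only: `MemberEntry`, `encode`, and `decode` are hypothetical names, and JSON is assumed as the wire format purely for the example, since the source does not state how Astromesh serializes gossip messages.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MemberEntry:
    node_id: str
    address: str                    # host:port, e.g. "10.0.1.10:8000"
    heartbeat: int
    state: str = "alive"            # alive, suspect, or dead (sender's view)
    agents: list[str] = field(default_factory=list)
    meta: dict = field(default_factory=dict)


def encode(entries: list[MemberEntry]) -> str:
    """Serialize a membership table to JSON (the wire format is an assumption)."""
    return json.dumps([asdict(e) for e in entries])


def decode(payload: str) -> list[MemberEntry]:
    """Rebuild entries from JSON produced by encode()."""
    return [MemberEntry(**item) for item in json.loads(payload)]
```

Because every message carries the full table, payload size grows linearly with the number of known nodes.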
Configuration
Gossip parameters are set in runtime.yaml under the mesh section:

    mesh:
      enabled: true
      node_name: "node-alpha"
      bind_port: 8001
      seeds:
        - "10.0.1.10:8001"
        - "10.0.1.11:8001"
      gossip:
        interval: 2s
        fanout: 3
      heartbeat:
        interval: 5s
      failure_detection:
        suspect_threshold: 15s
        dead_threshold: 30s
        cleanup_threshold: 120s
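When tuning these values, it helps to check that they are mutually consistent, for example that a node cannot be marked suspect before it has had a chance to send its next heartbeat. The sketch below assumes the `mesh` section has already been loaded into a Python dictionary. `parse_duration` and `validate` are hypothetical helpers, not part of Astromesh. Only the `s` suffix appears in the configuration above; accepting `ms` and `m` is an assumption.

```python
import re

# Only "s" appears in the documented config; "ms" and "m" are assumed extras.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_duration(text: str) -> float:
    """Convert strings such as '2s' or '120s' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def validate(mesh: dict) -> list[str]:
    """Return human-readable problems; an empty list means the values look sane."""
    heartbeat = parse_duration(mesh["heartbeat"]["interval"])
    fd = mesh["failure_detection"]
    suspect = parse_duration(fd["suspect_threshold"])
    dead = parse_duration(fd["dead_threshold"])
    problems = []
    if suspect <= heartbeat:
        problems.append("suspect_threshold should exceed heartbeat.interval")
    if dead <= suspect:
        problems.append("dead_threshold should exceed suspect_threshold")
    if mesh["gossip"]["fanout"] < 1:
        problems.append("gossip.fanout must be at least 1")
    return problems
```

With the default values shown above, `validate` returns an empty list.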