
Gossip Protocol

Astromesh nodes in mesh mode use a push-pull gossip protocol to share cluster state, detect failures, and maintain a consistent view of the network without centralized coordination.

In a push-pull exchange, both sides send their current state and merge the state they receive. Convergence is therefore bidirectional: each exchange brings both nodes closer to a consistent view.

Node A                               Node B
  │                                   │
  │  1. Select random peer            │
  │──────── push state ──────────────▶│
  │                                   │
  │                                   │  2. Merge received state
  │                                   │     with local state
  │                                   │
  │◀──────── pull state ──────────────│  3. Send own state back
  │                                   │
  │  4. Merge received state          │
  │     with local state              │
  │                                   │
       Both nodes now have the union
       of both states
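
The exchange in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not Astromesh's actual code: `Entry`, `merge`, and `push_pull` are hypothetical names, and the membership table is modeled as a plain dictionary keyed by node ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # Hypothetical, trimmed-down membership entry (illustration only).
    node_id: str
    heartbeat: int


def merge(local: dict, received: dict) -> dict:
    """Return the union of both tables, keeping the higher heartbeat per node."""
    merged = dict(local)
    for node_id, entry in received.items():
        current = merged.get(node_id)
        if current is None or entry.heartbeat > current.heartbeat:
            merged[node_id] = entry
    return merged


def push_pull(a_state: dict, b_state: dict) -> tuple:
    """Simulate one exchange between A and B; return both updated tables."""
    # Steps 1-2: A pushes its state, B merges it with its own.
    b_after = merge(b_state, a_state)
    # Steps 3-4: B sends its state back, A merges it with its own.
    a_after = merge(a_state, b_after)
    return a_after, b_after
```

After one exchange, both tables hold the same entries, and for any node known to both sides the entry with the higher heartbeat counter survives.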

Each gossip round:

  1. Node selects up to fanout random peers from its membership list
  2. For each selected peer, initiate a push-pull exchange
  3. Merge the received state, keeping the most recent information for each node (highest heartbeat counter wins)

| Parameter | Default | Description |
|---|---|---|
| gossip_interval | 2s | How often each node initiates a gossip round |
| gossip_fanout | 3 | Number of random peers contacted per gossip round |
| heartbeat_interval | 5s | How often each node increments its own heartbeat counter |
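
A gossip round as described above could be sketched as follows. `select_peers` and `gossip_round` are hypothetical names, and `exchange` is an assumed callback standing in for whatever transport performs the push-pull with a peer.

```python
import random
from typing import Callable, List, Optional


def select_peers(self_id: str, members: List[str], fanout: int = 3,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Pick up to `fanout` distinct random peers, never the node itself."""
    rng = rng or random.Random()
    candidates = [m for m in members if m != self_id]
    # "Up to fanout": small clusters may have fewer peers than the fanout.
    return rng.sample(candidates, min(fanout, len(candidates)))


def gossip_round(self_id: str, members: List[str],
                 exchange: Callable[[str], None], fanout: int = 3,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Run one round: select peers, then start an exchange with each one."""
    peers = select_peers(self_id, members, fanout, rng)
    for peer in peers:
        exchange(peer)  # assumed to push local state and merge the reply
    return peers
```

In a running node, something like `gossip_round` would be called once per `gossip_interval`.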

Each node maintains a monotonically increasing heartbeat counter. The counter is incremented every heartbeat_interval (5 seconds by default) and included in gossip exchanges.

A node’s liveness is determined by how recently its heartbeat was observed to change:

heartbeat counter:  101   102   103   104   105   106
                     │     │     │     │     │     │
time:                0s    5s    10s   15s   20s   25s

Other nodes track the last time they observed a new heartbeat value for each peer. If the observed heartbeat stops advancing, the peer is suspected and eventually declared dead.
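
One way to track "last time a new heartbeat value was observed" is sketched below. `HeartbeatTracker` and `PeerRecord` are hypothetical names. The sketch timestamps observations with the observer's local monotonic clock, so no clock synchronization between nodes is assumed.

```python
import time
from dataclasses import dataclass


@dataclass
class PeerRecord:
    heartbeat: int      # highest heartbeat counter seen for this peer
    last_change: float  # local clock reading when that counter was first seen


class HeartbeatTracker:
    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the sketch can be tested
        self._peers = {}

    def observe(self, node_id: str, heartbeat: int) -> bool:
        """Record a heartbeat seen in gossip; return True if it advanced."""
        record = self._peers.get(node_id)
        if record is None or heartbeat > record.heartbeat:
            self._peers[node_id] = PeerRecord(heartbeat, self._clock())
            return True
        return False  # stale or repeated value: the timestamp is not refreshed

    def seconds_since_change(self, node_id: str) -> float:
        """Elapsed local time since this peer's heartbeat last advanced."""
        return self._clock() - self._peers[node_id].last_change
```

Seeing the same counter value again through gossip does not reset the timer. Only a strictly higher value counts as evidence that the peer is still running.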

Nodes are classified into three states based on how long it has been since their heartbeat was last seen to change:

             Last heartbeat seen
                      │
    ┌─────────────────┼─────────────────┐
    │ < 15s           │ 15-30s          │ > 30s
    ▼                 ▼                 ▼
┌───────┐        ┌─────────┐        ┌──────┐
│ Alive │        │ Suspect │        │ Dead │
└───────┘        └─────────┘        └──────┘

| State | Threshold | Description |
|---|---|---|
| Alive | < 15 seconds since last heartbeat change | Node is healthy and responsive |
| Suspect | 15 to 30 seconds since last heartbeat change | Node may be down. Still routed to but with lower priority |
| Dead | > 30 seconds since last heartbeat change | Node is considered failed. Removed from routing, agents reassigned |

| Parameter | Default | Description |
|---|---|---|
| suspect_threshold | 15s | Time without heartbeat update before marking suspect |
| dead_threshold | 30s | Time without heartbeat update before marking dead |
| cleanup_threshold | 120s | Time after death before removing node from state entirely |
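
The thresholds map directly onto a small classification function. The sketch below uses hypothetical function names. The boundary handling follows the state table (exactly 15s is suspect, exactly 30s is still suspect), while the boundary for cleanup is an assumption, since the source does not say whether it is inclusive.

```python
def classify(seconds_since_change: float,
             suspect_threshold: float = 15.0,
             dead_threshold: float = 30.0) -> str:
    """Map time since the last observed heartbeat change to a node state."""
    if seconds_since_change > dead_threshold:
        return "dead"
    if seconds_since_change >= suspect_threshold:
        return "suspect"
    return "alive"


def should_remove(seconds_since_death: float,
                  cleanup_threshold: float = 120.0) -> bool:
    """Decide whether a dead node can be dropped from state entirely.

    Assumption: the threshold is inclusive (>=).
    """
    return seconds_since_death >= cleanup_threshold
```

With the default 5s heartbeat interval, a node becomes suspect after roughly three missed heartbeats and dead after roughly six.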

Gossip protocols provide eventual consistency — all nodes converge to the same state given enough gossip rounds without further changes.

Convergence speed depends on cluster size, gossip interval, and fanout:

| Cluster Size | Expected Rounds to Converge | Time at 2s Interval |
|---|---|---|
| 3 nodes | 2 to 3 rounds | 4 to 6 seconds |
| 10 nodes | 4 to 5 rounds | 8 to 10 seconds |
| 50 nodes | 6 to 8 rounds | 12 to 16 seconds |

The push-pull approach converges roughly twice as fast as push-only gossip because each exchange synchronizes both participants.
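
To get a feel for how cluster size and fanout affect convergence, a toy simulation can help. The sketch below is an idealized model under several assumptions: every node already knows every address, exchanges are instantaneous and never fail, and nodes act one after another within a round. `rounds_to_converge` is a hypothetical helper, and its output is an optimistic estimate, not a reproduction of the figures in the table.

```python
import random


def rounds_to_converge(n: int, fanout: int = 3, seed: int = 0,
                       max_rounds: int = 100) -> int:
    """Count rounds until every node knows about every other node."""
    rng = random.Random(seed)
    everyone = set(range(n))
    # Each node starts out knowing only about itself.
    known = [{i} for i in range(n)]
    for round_no in range(1, max_rounds + 1):
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for j in rng.sample(others, min(fanout, len(others))):
                # Push-pull: both participants end up with the union.
                union = known[i] | known[j]
                known[i] = set(union)
                known[j] = set(union)
        if all(k == everyone for k in known):
            return round_no
    raise RuntimeError("no convergence within max_rounds")
```

Multiplying the returned round count by `gossip_interval` gives an approximate lower bound on convergence time in this model.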

Each gossip message contains the sender’s full membership table:

| Field | Type | Description |
|---|---|---|
| node_id | string | Unique identifier for the node |
| address | string | Host and port (e.g., 10.0.1.10:8000) |
| heartbeat | integer | Monotonic heartbeat counter |
| state | string | Node state as perceived by sender: alive, suspect, dead |
| agents | list[string] | Agent names loaded on this node |
| meta | object | Arbitrary metadata (role, version, capabilities) |
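
The fields above could be modeled as a dataclass, one instance per membership table entry. This is a sketch only: `MemberEntry`, `encode_message`, and `decode_message` are hypothetical names, and JSON is an assumed wire format, since the source does not specify the encoding.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class MemberEntry:
    node_id: str                  # unique identifier for the node
    address: str                  # host and port, e.g. "10.0.1.10:8000"
    heartbeat: int                # monotonic heartbeat counter
    state: str = "alive"          # "alive", "suspect", or "dead" (sender's view)
    agents: List[str] = field(default_factory=list)
    meta: Dict[str, object] = field(default_factory=dict)


def encode_message(entries: List[MemberEntry]) -> str:
    """Serialize a full membership table (JSON is an assumption)."""
    return json.dumps([asdict(entry) for entry in entries])


def decode_message(payload: str) -> List[MemberEntry]:
    """Rebuild membership entries from a received payload."""
    return [MemberEntry(**item) for item in json.loads(payload)]
```

Because every message carries the full table, message size grows linearly with the number of known nodes.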

Gossip parameters are set in runtime.yaml under the mesh section:

mesh:
  enabled: true
  node_name: "node-alpha"
  bind_port: 8001
  seeds:
    - "10.0.1.10:8001"
    - "10.0.1.11:8001"
  gossip:
    interval: 2s
    fanout: 3
  heartbeat:
    interval: 5s
  failure_detection:
    suspect_threshold: 15s
    dead_threshold: 30s
    cleanup_threshold: 120s
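
Duration values such as `2s` need to be converted to numbers before use. The sketch below shows one way to do that and to sanity-check the failure detection thresholds. `parse_duration` and `check_thresholds` are hypothetical helpers, not part of Astromesh. Only the `s` suffix appears in the configuration above, so support for `ms` and `m` is an assumption, as is the requirement that `suspect_threshold` be smaller than `dead_threshold`.

```python
import re

# Seconds per unit. Only "s" is confirmed by the example config.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_duration(text: str) -> float:
    """Convert a string such as "2s" into a number of seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def check_thresholds(failure_detection: dict) -> None:
    """Raise ValueError unless suspect_threshold < dead_threshold."""
    suspect = parse_duration(failure_detection["suspect_threshold"])
    dead = parse_duration(failure_detection["dead_threshold"])
    if not suspect < dead:
        raise ValueError("suspect_threshold must be below dead_threshold")
```

The dictionary passed to `check_thresholds` would be the `failure_detection` mapping after `runtime.yaml` has been loaded by a YAML parser of your choice.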