Scheduling & Routing
In mesh mode, multiple Astromesh nodes form a cluster. The scheduling and routing layer determines which node handles each incoming request based on agent placement, current load, and leader election.
Agent Placement
Each node loads and serves the agents defined in its local config/agents/ directory. The cluster’s gossip protocol propagates this information so every node knows which agents are available on which peers.
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  node-alpha  │  │  node-beta   │  │  node-gamma  │
│              │  │              │  │              │
│  assistant   │  │  researcher  │  │  assistant   │
│  support     │  │              │  │              │
└──────────────┘  └──────────────┘  └──────────────┘

When a request arrives for a specific agent, the receiving node checks:
- Local first: If this node has the agent loaded, handle locally
- Remote routing: If this node does not have the agent, forward to a peer that does
If multiple nodes serve the same agent, the request is routed using the least-connections algorithm.
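The placement lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Astromesh's actual implementation: `PlacementTable`, `advertise`, and `resolve` are hypothetical names, and the shape of the gossip payload (a node id plus a list of agent names) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PlacementTable:
    """Maps each agent name to the nodes that advertise it (hypothetical)."""

    local_node: str
    nodes_by_agent: dict[str, set[str]] = field(default_factory=dict)

    def advertise(self, node_id: str, agents: list[str]) -> None:
        """Apply the latest agent list a node reported over gossip."""
        # Drop stale entries for this node before applying its new report.
        for nodes in self.nodes_by_agent.values():
            nodes.discard(node_id)
        for agent in agents:
            self.nodes_by_agent.setdefault(agent, set()).add(node_id)

    def resolve(self, agent: str) -> tuple[str, list[str]]:
        """Return ("local", [self]), ("remote", peers), or ("not_found", [])."""
        nodes = self.nodes_by_agent.get(agent, set())
        if self.local_node in nodes:
            # Local first: this node has the agent loaded.
            return "local", [self.local_node]
        if nodes:
            # Remote routing: candidate peers for least-connections selection.
            return "remote", sorted(nodes)
        return "not_found", []


table = PlacementTable(local_node="node-beta")
table.advertise("node-alpha", ["assistant", "support"])
table.advertise("node-beta", ["researcher"])
table.advertise("node-gamma", ["assistant"])

print(table.resolve("researcher"))  # served locally by node-beta
print(table.resolve("assistant"))   # candidates: node-alpha and node-gamma
print(table.resolve("translator"))  # no node in the cluster serves it
```

Returning the full candidate list for the remote case keeps placement separate from load balancing: choosing among the candidates is left to the routing strategy.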
Least-Connections Routing
When multiple nodes can serve a given agent, the router selects the node with the fewest active (in-flight) requests.
Algorithm
Incoming request for agent "assistant"
        │
        ▼
Find all nodes serving "assistant"
┌──────────────────────────────────┐
│ node-alpha: 12 active requests   │
│ node-gamma:  3 active requests   │
└──────────────────────────────────┘
        │
        ▼
Select node with fewest active: node-gamma
        │
        ▼
Forward request to node-gamma

Tie-breaking: When multiple nodes have equal active request counts, the node with the lower avg_latency_ms is preferred.
Health awareness: Nodes in the suspect state are deprioritized: each is treated as if it had 100 additional virtual active requests (the routing.suspect_penalty setting). Nodes in the dead state are excluded entirely.
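The selection rules above (fewest active requests, latency tie-break, suspect penalty, dead exclusion) fit in a single sort key. The sketch below is illustrative only: `NodeStats` and `select_node` are hypothetical names, the state strings are assumed, and applying the tie-break to the penalized count rather than the raw count is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

SUSPECT_PENALTY = 100  # documented default of routing.suspect_penalty


@dataclass
class NodeStats:
    """Per-node view as seen through gossip (hypothetical structure)."""

    node_id: str
    active_requests: int
    avg_latency_ms: float
    state: str = "alive"  # assumed values: "alive", "suspect", "dead"


def select_node(candidates: list[NodeStats]) -> Optional[NodeStats]:
    """Pick the candidate with the fewest effective active requests."""
    # Dead nodes are excluded entirely.
    eligible = [n for n in candidates if n.state != "dead"]
    if not eligible:
        return None

    def sort_key(node: NodeStats) -> tuple[int, float]:
        # Suspect nodes carry a virtual load penalty.
        penalty = SUSPECT_PENALTY if node.state == "suspect" else 0
        # Ties on the request count fall through to the lower latency.
        return (node.active_requests + penalty, node.avg_latency_ms)

    return min(eligible, key=sort_key)


nodes = [
    NodeStats("node-alpha", active_requests=12, avg_latency_ms=40.0),
    NodeStats("node-gamma", active_requests=3, avg_latency_ms=55.0),
]
print(select_node(nodes).node_id)  # node-gamma
```

With the default penalty, a suspect node is still chosen when every healthy candidate has more than 100 additional requests in flight, so the penalty deprioritizes rather than excludes.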
Active Request Tracking
Each node reports its active request count in gossip metadata. The count is updated in real time:
- Incremented when a request begins execution
- Decremented when a response is sent or the request times out
Gossip propagation means remote counts may be slightly stale (up to one gossip interval behind). This is acceptable because perfect accuracy is not required for effective load balancing.
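One way to keep such a counter accurate is to tie the increment and decrement to a context manager, so the count drops even when a request fails or times out. This is a sketch under assumptions: `ActiveRequestCounter` and the `active_requests` metadata key are hypothetical, and a thread lock is used here for simplicity even though the real runtime may use a different concurrency model.

```python
import threading
from contextlib import contextmanager


class ActiveRequestCounter:
    """Tracks in-flight requests on one node (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0

    @property
    def active(self) -> int:
        with self._lock:
            return self._active

    @contextmanager
    def track(self):
        """Count a request for as long as the with-block is running."""
        with self._lock:
            self._active += 1  # request begins execution
        try:
            yield
        finally:
            # Runs on success, error, and timeout alike.
            with self._lock:
                self._active -= 1

    def gossip_metadata(self) -> dict:
        """Snapshot that would be attached to the next gossip message."""
        return {"active_requests": self.active}


counter = ActiveRequestCounter()
with counter.track():
    during = counter.gossip_metadata()

try:
    with counter.track():
        raise TimeoutError("request timed out")
except TimeoutError:
    pass

after = counter.gossip_metadata()
print(during, after)
```

Peers only see the value captured in the most recent gossip message, which is why their view can lag the live counter by up to one gossip interval.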
Leader Election
The cluster elects a single leader node using the bully algorithm. The leader is responsible for:
- Cluster-wide health monitoring
- Coordinating agent rebalancing when nodes join or leave
- Serving as the tiebreaker for scheduling conflicts
Bully Algorithm
The node with the highest node_id (lexicographic comparison) becomes the leader.
Nodes in cluster:
  node-alpha (id: "alpha")
  node-beta  (id: "beta")
  node-gamma (id: "gamma")

Leader: node-gamma (highest id)
Election Process
Trigger event (node join, leader death, startup)
        │
        ▼
Node sends ELECTION message to all nodes with higher IDs
        │
        ├──── No response within timeout
        │       → This node becomes leader
        │       → Sends COORDINATOR message to all
        │
        └──── Higher node responds with OK
                → Higher node takes over election
                → Wait for COORDINATOR message
Re-Election Triggers
| Trigger | Description |
|---|---|
| Leader death | The leader has sent no heartbeat for longer than dead_threshold. Any node that detects this initiates an election |
| New node joins | If the new node has a higher node_id than the current leader, an election is triggered |
| Network partition heals | When previously partitioned nodes reconnect, they compare leaders and re-elect if needed |
| Startup | Each node initiates an election when it first joins the cluster |
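The election process and the join trigger can be simulated without any networking. The sketch below is a simplified model, not the cluster's real message handling: `run_election` and `join_triggers_election` are hypothetical names, the set of reachable nodes stands in for "responded with OK before the timeout", and the hand-off goes to one higher node at a time instead of all of them in parallel.

```python
def run_election(initiator: str, reachable: set[str]) -> str:
    """Return the node_id that becomes leader when `initiator` starts an election."""
    current = initiator
    while True:
        # ELECTION is sent to every reachable node with a higher id.
        higher = sorted(n for n in reachable if n > current)
        if not higher:
            # No OK arrived before the timeout: this node declares itself
            # leader and would broadcast COORDINATOR to all nodes.
            return current
        # A higher node answered OK and takes over the election.
        current = higher[0]


def join_triggers_election(new_node_id: str, leader_id: str) -> bool:
    """A joining node forces an election only if it outranks the leader."""
    return new_node_id > leader_id


cluster = {"alpha", "beta", "gamma"}
print(run_election("alpha", cluster))             # gamma
print(run_election("alpha", cluster - {"gamma"}))  # beta, after gamma dies
print(join_triggers_election("omega", "gamma"))   # True
print(join_triggers_election("delta", "gamma"))   # False
```

Python's string comparison is lexicographic by code point, which matches the node_id ordering used in the example cluster ("gamma" > "beta" > "alpha").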
Routing Decision Flow
Complete flow for an incoming API request in mesh mode:
Client request: POST /v1/agents/assistant/run
            │
            ▼
    ┌────────────────┐
    │ Receiving Node │
    └───────┬────────┘
            │
  ┌─────────▼───────────┐
  │ Agent loaded        │
  │ locally?            │
  └─────────┬───────────┘
      │             │
  Yes │             │ No
      │             │
      ▼             ▼
┌────────────┐  ┌──────────────────┐
│ Check local│  │ Find peers with  │
│ load vs    │  │ this agent       │
│ remote     │  └────────┬─────────┘
└─────┬──────┘           │
      │                  ▼
      │         ┌───────────────────┐
      │         │ Any peers found?  │
      │         └────────┬──────────┘
      │              │           │
      │          Yes │           │ No
      │              │           │
      │              ▼           ▼
      │       ┌─────────────┐  Return 404
      │       │ Least-conn  │  "Agent not
      │       │ select peer │  found in
      │       └──────┬──────┘  cluster"
      │              │
      ▼              ▼
Execute locally   Forward to
or on best node   selected peer
Future Scheduling Strategies
The following strategies are planned for future releases:
| Strategy | Description | Status |
|---|---|---|
| Affinity | Route requests from the same session to the same node for memory locality | Planned |
| Latency-based | Route to the node with the lowest observed round-trip time from the client | Planned |
| Cost-based | Factor in per-node compute cost (cloud instance pricing) for routing decisions | Planned |
| Capability-aware | Route based on hardware capabilities (GPU availability, memory capacity) | Planned |
Configuration
Scheduling parameters are set in runtime.yaml:
mesh:
  enabled: true
  routing:
    strategy: least_connections
    local_preference: true
    suspect_penalty: 100
  election:
    algorithm: bully
    timeout: 5s

| Parameter | Default | Description |
|---|---|---|
| routing.strategy | least_connections | Routing algorithm for multi-node agents |
| routing.local_preference | true | Prefer local execution when the agent is loaded on this node |
| routing.suspect_penalty | 100 | Virtual active requests added to suspect nodes |
| election.algorithm | bully | Leader election algorithm |
| election.timeout | 5s | Timeout for election responses before self-declaring leader |
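To show how the defaults in the table combine with a partial configuration, here is a small Python sketch. It assumes the mesh section has already been parsed into a dictionary; `SchedulingConfig`, `parse_seconds`, and `load_scheduling_config` are hypothetical names, and only the plain seconds form of the timeout ("5s") is handled because that is the only form the table shows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulingConfig:
    """Defaults mirror the parameter table (hypothetical container)."""

    strategy: str = "least_connections"
    local_preference: bool = True
    suspect_penalty: int = 100
    election_algorithm: str = "bully"
    election_timeout_s: float = 5.0


def parse_seconds(value: str) -> float:
    """Parse a duration written as "<number>s", such as "5s"."""
    if value.endswith("s") and not value.endswith("ms"):
        return float(value[:-1])
    raise ValueError(f"unsupported duration: {value!r}")


def load_scheduling_config(mesh: dict) -> SchedulingConfig:
    """Fill in documented defaults for any parameter the mesh section omits."""
    routing = mesh.get("routing", {})
    election = mesh.get("election", {})
    defaults = SchedulingConfig()
    timeout = election.get("timeout")
    return SchedulingConfig(
        strategy=routing.get("strategy", defaults.strategy),
        local_preference=routing.get("local_preference", defaults.local_preference),
        suspect_penalty=routing.get("suspect_penalty", defaults.suspect_penalty),
        election_algorithm=election.get("algorithm", defaults.election_algorithm),
        election_timeout_s=(
            parse_seconds(timeout) if timeout is not None else defaults.election_timeout_s
        ),
    )


config = load_scheduling_config(
    {"enabled": True, "routing": {"suspect_penalty": 50}, "election": {"timeout": "2s"}}
)
print(config)
```

Parameters left out of the dictionary keep their documented defaults, so the example above changes only the suspect penalty and the election timeout.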