# Four-Layer Design
Astromesh is organized into four distinct layers, each with a clear boundary and responsibility. This page walks through every layer in detail — the modules it contains, the interfaces it exposes, and how it connects to adjacent layers.
For a high-level overview, see Architecture Overview. For a request-level trace, see Agent Execution Pipeline.
## Layer 1: API Layer

Module: `astromesh/api/`
The API Layer is the entry point for all external communication. It accepts HTTP and WebSocket connections, validates input, and forwards requests to the Runtime Engine. It contains no business logic — its only job is protocol translation.
### REST API

Module: `astromesh/api/main.py` + `astromesh/api/routes/`
The REST API is built with FastAPI and organized into route groups. Each route module exposes a router (FastAPI APIRouter) and a set_runtime(runtime) function that is called during bootstrap to inject the AgentRuntime instance.
| Route Group | Prefix | Key Endpoints | Description |
|---|---|---|---|
| Agents | /v1/agents | GET /, GET /{name}, POST /{name}/run | List, inspect, and execute agents |
| Memory | /v1/memory | GET /{agent}/history/{session}, DELETE /{agent}/history/{session}, GET /{agent}/semantic | Query and manage conversation history and semantic memory |
| Tools | /v1/tools | GET /, POST /execute | List registered tools and execute them directly |
| RAG | /v1/rag | POST /ingest, POST /query | Ingest documents and query knowledge bases |
| Health | /v1/health | GET / | Health check and version information |
The main entry point (main.py) creates the FastAPI application, runs AgentRuntime.bootstrap() on startup, and calls set_runtime() on each route module so they can access the runtime without global state.
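The `set_runtime()` injection pattern can be sketched in miniature as follows. This is a simplified illustration, not the actual route module: the function names beyond `set_runtime` are invented, and the real modules attach handlers to a FastAPI `APIRouter`.

```python
# Sketch of the set_runtime() injection pattern used by route modules.
# A module-level slot is filled once at bootstrap, so handlers can reach
# the runtime without a global registry or circular imports.
from typing import Any, Optional

_runtime: Optional[Any] = None  # filled by set_runtime() during bootstrap


def set_runtime(runtime: Any) -> None:
    """Called once by the main entry point after AgentRuntime.bootstrap()."""
    global _runtime
    _runtime = runtime


def run_agent(name: str, query: str, session_id: str) -> Any:
    """What a POST /v1/agents/{name}/run handler would delegate to.

    (run_agent is a hypothetical name for illustration.)
    """
    if _runtime is None:
        raise RuntimeError("runtime not injected yet; call set_runtime() first")
    return _runtime.run(name, query, session_id)
```

The point of the pattern is that the route module never constructs the runtime itself; it only receives a fully bootstrapped instance.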
### WebSocket Streaming

Module: `astromesh/api/ws.py`
The WebSocket endpoint at /v1/ws/agent/{agent_name} provides real-time token streaming for agent responses. It uses a ConnectionManager that:
- Tracks active WebSocket connections per agent
- Accepts JSON messages with `query` and `session_id` fields
- Streams partial responses as tokens are generated by the LLM provider
- Handles disconnection and cleanup
Example client interaction:
```
// Client sends:
{"query": "Explain quantum computing", "session_id": "session-42"}

// Server streams back token-by-token:
{"token": "Quantum", "done": false}
{"token": " computing", "done": false}
{"token": " uses", "done": false}
// ... more tokens ...
{"token": "", "done": true, "response": "Quantum computing uses..."}
```

### Channel Adapters
Module: `astromesh/channels/`
Channel adapters sit above the API Layer, bridging external messaging platforms to the Agent Runtime. They are configured in config/channels.yaml and each adapter handles:
- Webhook verification — Platform-specific challenge/response (e.g., Meta’s hub.verify_token)
- Message parsing — Extracting text from platform-specific webhook payloads
- Signature validation — HMAC verification of incoming webhooks
- Response formatting — Converting agent output to platform-specific API calls
- Background execution — Running agent execution in background tasks so webhooks respond within platform timeouts (e.g., Meta requires a response within 5 seconds)
Each channel maps to a default_agent that handles all conversations on that channel.
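As an illustration of the signature-validation step, here is a minimal HMAC check for a Meta-style `X-Hub-Signature-256` header. The header name and `sha256=` prefix are platform-specific assumptions; other platforms use different schemes.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Check a Meta-style 'sha256=<hexdigest>' webhook signature.

    `body` must be the raw request bytes, not the parsed JSON, since
    re-serialization can change the byte sequence and break the digest.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest does a constant-time comparison to avoid timing leaks
    return hmac.compare_digest(expected, header)
```

A real adapter would reject the webhook with an HTTP 403 when this returns `False`.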
## Layer 2: Runtime Engine

Module: `astromesh/runtime/engine.py`
The Runtime Engine is the heart of Astromesh — the control plane that turns declarative YAML configuration into running agent instances. It has two primary responsibilities: bootstrapping and execution.
### AgentRuntime Bootstrap

When the application starts, `AgentRuntime.bootstrap()` performs the following sequence:
```
config/agents/*.agent.yaml
        │
        ▼
┌─────────────────────────────────┐
│ 1. Scan config/agents/ dir      │
│ 2. Parse each .agent.yaml       │
│ 3. Validate against schema      │
│ 4. For each agent definition:   │
│    ├── Build ModelRouter        │──► Wire primary + fallback providers
│    ├── Build MemoryManager      │──► Configure backends + strategies
│    ├── Build ToolRegistry       │──► Register tools (internal, MCP, webhook, RAG)
│    ├── Select Pattern           │──► Instantiate orchestration pattern
│    ├── Build PromptEngine       │──► Register Jinja2 templates
│    ├── Build GuardrailsEngine   │──► Configure input/output rules
│    └── Create Agent instance    │──► Fully wired, ready to execute
└─────────────────────────────────┘
        │
        ▼
agents: Dict[str, Agent]   ← Lookup table by agent name
```

The bootstrap process reads all YAML files, validates them, and assembles fully wired Agent objects. Each agent gets its own set of dependencies (router, memory, tools, etc.) based on its configuration.
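In pseudocode terms, the per-agent wiring loop boils down to turning parsed definitions into a lookup table. The sketch below uses a stand-in `Agent` dataclass and invented field names; the real engine wires in the router, memory, tools, pattern, prompts, and guardrails at this point.

```python
# Simplified sketch of the bootstrap wiring loop. Field names and the
# Agent dataclass are stand-ins for the fully wired components.
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    model: str
    tools: list = field(default_factory=list)


def build_agents(definitions: list) -> dict:
    """Turn parsed .agent.yaml dicts into a name -> Agent lookup table."""
    agents = {}
    for spec in definitions:
        if "name" not in spec:
            raise ValueError("agent definition missing 'name'")
        # One Agent per definition, each with its own dependency set
        agents[spec["name"]] = Agent(
            name=spec["name"],
            model=spec.get("model", "default"),
            tools=spec.get("tools", []),
        )
    return agents
```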
### Agent Execution

When a request arrives for a specific agent, `AgentRuntime.run(agent_name, query, session_id)` looks up the agent by name and delegates to `Agent.run()`, which executes the full pipeline (covered in detail in Agent Execution Pipeline).
Each Agent instance holds references to:
| Component | Class | Purpose |
|---|---|---|
| Model Router | ModelRouter | Select and call LLM providers with fallback |
| Memory Manager | MemoryManager | Build context from conversation, semantic, and episodic memory |
| Tool Registry | ToolRegistry | Discover, validate, and execute tools |
| Orchestration Pattern | OrchestrationPattern subclass | Control the reasoning loop (ReAct, PlanAndExecute, etc.) |
| Prompt Engine | PromptEngine | Render Jinja2 system prompts with variable injection |
| Guardrails Engine | GuardrailsEngine | Apply input/output safety rules |
## Layer 3: Core Services

Core Services are the domain-specific building blocks that agents use during execution. Each service has a well-defined interface and delegates to Layer 4 for concrete implementations.
### Model Router

Module: `astromesh/core/model_router.py`
The Model Router manages multi-provider LLM inference. Given a completion request, it ranks available providers using the configured routing strategy, checks circuit breaker state, and tries providers in order until one succeeds.
```
              ┌─────────────┐
Request ────► │ ModelRouter │
              │             │
              │ 1. Rank     │──► Strategy-based ordering
              │ 2. Try      │──► Circuit breaker check
              │ 3. Fallback │──► Next provider on failure
              └──────┬──────┘
                     │
        ┌────────────┼────────────┐
        ▼            ▼            ▼
    ┌───────┐   ┌────────┐   ┌───────┐
    │Ollama │   │ OpenAI │   │ vLLM  │   ...
    └───────┘   └────────┘   └───────┘
```

#### Routing Strategies
| Strategy | Behavior | Best For |
|---|---|---|
| `cost_optimized` | Cheapest provider first, based on `estimated_cost()` | Budget-sensitive workloads |
| `latency_optimized` | Fastest provider first, using exponential moving average of response times | Real-time applications |
| `quality_first` | Highest quality score first (configured per provider) | Tasks requiring maximum accuracy |
| `round_robin` | Rotate across providers evenly | Load distribution |
| `capability_match` | Filter by required capabilities (tool calling, vision), then apply secondary strategy | Tasks needing specific features |
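For instance, `cost_optimized` ranking amounts to sorting candidates by their cost estimate. In this sketch, `FakeProvider` is a stand-in for the real provider classes, which expose `estimated_cost(tokens)`:

```python
# Sketch of cost_optimized ranking. FakeProvider stands in for real
# providers; only estimated_cost(tokens) matters to the strategy.
class FakeProvider:
    def __init__(self, name: str, cost_per_1k: float):
        self.name = name
        self.cost_per_1k = cost_per_1k

    def estimated_cost(self, tokens: int) -> float:
        return self.cost_per_1k * tokens / 1000


def rank_cost_optimized(providers, tokens: int):
    """Order providers cheapest-first for a request of `tokens` tokens."""
    return sorted(providers, key=lambda p: p.estimated_cost(tokens))
```

The other strategies differ only in the sort key (or, for `capability_match`, in a filter step applied before sorting).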
#### Circuit Breaker

The circuit breaker protects the system from repeatedly calling failing providers:
- Closed (normal): Requests flow through. Failures are counted.
- Open (tripped): After 3 consecutive failures, the circuit opens. All requests to that provider are immediately skipped for 60 seconds.
- Half-open (testing): After the cooldown, one request is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens.
When a provider’s circuit is open, the router automatically tries the next provider in the ranked list. If all providers are tripped, the request fails with a clear error.
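The three-state behavior above can be sketched as a small class. This is illustrative only, though the defaults mirror the 3-failure / 60-second values described above:

```python
import time


class CircuitBreaker:
    """Minimal closed/open/half-open breaker: 3 failures trip it for 60 s."""

    def __init__(self, max_failures: int = 3, cooldown: float = 60.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit tripped

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: requests flow through
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: allow one probe request
        return False  # open: skip this provider

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()  # trip (or re-trip) the circuit
```

The router would consult `allow_request()` before each provider call and report the outcome via `record_success()` / `record_failure()`.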
### Memory Manager

Module: `astromesh/core/memory.py`
The Memory Manager handles three distinct types of memory, each with pluggable backends. It provides two key operations: build_context() (assemble memory for prompt construction) and persist_turn() (store conversation turns after execution).
```
        ┌──────────────────┐
        │  MemoryManager   │
        │                  │
        │ build_context()  │──► Assemble context from all memory types
        │ persist_turn()   │──► Store conversation turns
        └──────┬───────────┘
               │
     ┌─────────┼──────────────┐
     ▼         ▼              ▼
┌──────────────┐ ┌───────────┐ ┌───────────┐
│Conversational│ │ Semantic  │ │ Episodic  │
│              │ │           │ │           │
│Redis/PG/     │ │pgvector/  │ │PostgreSQL │
│SQLite        │ │Chroma/    │ │           │
│              │ │Qdrant/    │ │           │
│              │ │FAISS      │ │           │
└──────────────┘ └───────────┘ └───────────┘
```

#### Memory Types
| Type | Purpose | Backend Options |
|---|---|---|
| Conversational | Chat history — stores user and assistant messages for multi-turn conversations | Redis, PostgreSQL, SQLite |
| Semantic | Vector embeddings — stores and retrieves information by semantic similarity | pgvector, ChromaDB, Qdrant, FAISS |
| Episodic | Event logs — records significant events, tool calls, and outcomes for long-term learning | PostgreSQL |
#### Memory Strategies

Memory strategies control how conversational history is managed when it grows beyond what fits in the LLM’s context window:
| Strategy | Description | Use Case |
|---|---|---|
| `sliding_window` | Keep the last N turns, discard older ones | Simple chatbots with short context needs |
| `summary` | Compress older turns into summaries using the LLM, keep recent turns verbatim | Long-running conversations that need historical awareness |
| `token_budget` | Fit as many recent turns as possible within a configured token limit | Maximizing context usage without exceeding model limits |
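The `token_budget` strategy, for example, reduces to walking the history backwards and keeping turns until the budget is spent. This sketch uses naive whitespace token counting; a real implementation would use the target model's tokenizer:

```python
def fit_token_budget(turns: list, budget: int) -> list:
    """Keep the most recent turns whose combined token count fits `budget`.

    Token counting here is naive whitespace splitting, purely for
    illustration; real counts come from the model's tokenizer.
    """
    kept = []
    used = 0
    for turn in reversed(turns):  # walk newest-first
        cost = len(turn.split())
        if used + cost > budget:
            break  # adding this turn would blow the budget
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order for the prompt
    return kept
```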
### Tool Registry

Module: `astromesh/core/tools.py`
The Tool Registry is the central authority for all tools an agent can use. It handles tool discovery, registration, permission checking, rate limiting, and schema generation for LLM function calling.
#### Tool Types

| Type | Source | Description |
|---|---|---|
| `internal` | Python functions | Registered directly in code. Fastest execution path. |
| `mcp` | MCP servers | Discovered from external MCP servers via stdio, SSE, or HTTP transport. Tools are fetched at startup and cached. |
| `webhook` | HTTP endpoints | External REST APIs called via HTTP. Supports custom headers, authentication, and payload mapping. |
| `rag` | RAG pipeline | Exposes a RAG pipeline as a callable tool, allowing agents to search knowledge bases during reasoning. |
#### Features

- Schema generation — Automatically generates JSON Schema for each tool, compatible with the function calling format expected by LLM providers (OpenAI-style `tools` parameter).
- Permission filtering — Each agent definition specifies which tools it has access to. The registry filters available tools per agent at execution time.
- Rate limiting — Tools can have per-agent rate limits to prevent runaway tool calls during orchestration loops.
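Schema generation from a Python function signature can be sketched with `inspect`. This minimal version handles only a few primitive types; the actual registry is assumed to be richer (nested types, docstring parsing for per-parameter descriptions, and so on):

```python
import inspect

# Map a few Python annotations to JSON Schema type names
_TYPE_MAP = {int: "integer", float: "number", str: "string", bool: "boolean"}


def tool_schema(fn) -> dict:
    """Build an OpenAI-style function-calling schema from a signature."""
    props = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        props[name] = {"type": _TYPE_MAP.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value => required
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": (fn.__doc__ or "").strip(),
            "parameters": {
                "type": "object",
                "properties": props,
                "required": required,
            },
        },
    }
```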
### Prompt Engine

Module: `astromesh/core/prompt_engine.py`
The Prompt Engine renders Jinja2 templates into system prompts for LLM calls. Key features:
- SilentUndefined — Missing template variables render as empty strings instead of raising errors. This allows prompts to gracefully handle optional context (e.g., no semantic memory results).
- Template registration — Templates are registered during agent bootstrap from the `prompts.system` field in agent YAML.
- Variable injection — At render time, the engine receives variables from the Memory Manager (conversation history, semantic results, episodic events) and injects them into the template.
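The SilentUndefined behavior can be reproduced with a small Jinja2 `Undefined` subclass. This is a common Jinja2 pattern, not necessarily the exact class in `prompt_engine.py`:

```python
from jinja2 import Environment, Undefined


class SilentUndefined(Undefined):
    """Render missing variables (even attribute lookups on them) as ''."""

    def _fail_with_undefined_error(self, *args, **kwargs):
        return ""  # swallow the error instead of raising UndefinedError


env = Environment(undefined=SilentUndefined)
template = env.from_string("Hello {{ user.name }}!{{ greeting }}")
print(template.render())  # missing variables collapse to empty strings
```

With the default `Undefined`, `{{ user.name }}` would raise an `UndefinedError` when `user` is absent; here it simply renders as nothing, which is what lets prompts omit optional context blocks.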
Example system prompt template:
```jinja2
You are {{ agent_name }}, a {{ description }}.

{% if conversation_history %}
Previous conversation:
{% for msg in conversation_history %}
{{ msg.role }}: {{ msg.content }}
{% endfor %}
{% endif %}

{% if semantic_context %}
Relevant knowledge:
{{ semantic_context }}
{% endif %}
```

### Guardrails Engine
Module: `astromesh/core/guardrails.py`
The Guardrails Engine applies safety checks on both input (before the agent processes a query) and output (before the response is returned to the caller).
| Guardrail | Direction | Description |
|---|---|---|
| `pii_detection` | Input & Output | Detects and redacts emails, phone numbers, SSNs, and credit card numbers using regex patterns |
| `topic_filter` | Input | Blocks messages matching forbidden topics defined in configuration |
| `max_length` | Input | Enforces maximum character limits on incoming queries |
| `cost_limit` | Output | Enforces token-per-turn limits to prevent runaway LLM usage |
| `content_filter` | Output | Blocks responses containing forbidden keywords or patterns |
Guardrails run in the order they are defined in the agent’s YAML configuration. Each guardrail can either pass (allow the message through), redact (modify the message and continue), or block (reject the message with an error).
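The pass/redact/block flow can be sketched as a simple chain. The rule functions and exception name here are invented for illustration; the actual engine presumably uses configured rule objects:

```python
import re


class GuardrailBlocked(Exception):
    """Raised when a rule rejects the message outright."""


def redact_emails(text: str) -> str:
    """pii_detection-style rule: redact anything that looks like an email."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", text)


def max_length(limit: int):
    """Build a max_length-style rule that blocks over-long queries."""
    def rule(text: str) -> str:
        if len(text) > limit:
            raise GuardrailBlocked(f"query exceeds {limit} characters")
        return text
    return rule


def apply_guardrails(text: str, rules) -> str:
    """Run rules in configured order; each may pass, redact, or block."""
    for rule in rules:
        text = rule(text)  # pass returns text unchanged; redact modifies it
    return text            # a block raises GuardrailBlocked instead
```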
## Layer 4: Infrastructure

Infrastructure contains all concrete implementations of the interfaces defined in Layer 3. These are the adapters, drivers, and backends that do the actual work.
### LLM Providers

Module: `astromesh/providers/`
All providers implement ProviderProtocol, a runtime_checkable Python Protocol defined in astromesh/providers/base.py.
#### ProviderProtocol Methods

| Method | Return Type | Description |
|---|---|---|
| `complete(messages, **kwargs)` | `CompletionResult` | Synchronous (non-streaming) completion |
| `stream(messages, **kwargs)` | `AsyncIterator[StreamChunk]` | Streaming completion, yielding tokens |
| `health_check()` | `bool` | Check if the provider is reachable |
| `supports_tools()` | `bool` | Whether the provider supports function calling |
| `supports_vision()` | `bool` | Whether the provider supports image inputs |
| `estimated_cost(tokens)` | `float` | Estimated cost for a given token count |
| `avg_latency_ms` | `float` | Exponential moving average of response latency |
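A `runtime_checkable` Protocol with this shape might look like the following abbreviated sketch. The member names come from the table above, but the signatures are simplified and `DummyProvider` is invented for illustration:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ProviderProtocol(Protocol):
    """Structural interface: any class with these members conforms."""

    avg_latency_ms: float

    def complete(self, messages, **kwargs): ...
    def health_check(self) -> bool: ...
    def supports_tools(self) -> bool: ...
    def estimated_cost(self, tokens: int) -> float: ...


class DummyProvider:
    """Conforms structurally, without inheriting from ProviderProtocol."""

    avg_latency_ms = 0.0

    def complete(self, messages, **kwargs):
        return {"content": "ok"}

    def health_check(self) -> bool:
        return True

    def supports_tools(self) -> bool:
        return False

    def estimated_cost(self, tokens: int) -> float:
        return 0.0
```

Because the protocol is `runtime_checkable`, the router can verify `isinstance(provider, ProviderProtocol)` at bootstrap without requiring providers to share a base class.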
#### Provider Implementations

| Provider | Backend | Endpoint Style | Notes |
|---|---|---|---|
| `OllamaProvider` | Ollama | `/api/chat` | Native Ollama API format |
| `OpenAICompatProvider` | OpenAI API | `/v1/chat/completions` | Works with any OpenAI-compatible API |
| `VLLMProvider` | vLLM | OpenAI-compatible | High-throughput serving with PagedAttention |
| `LlamaCppProvider` | llama.cpp | OpenAI-compatible | CPU/GPU inference with GGUF models |
| `HFTGIProvider` | HuggingFace TGI | OpenAI-compatible | Text Generation Inference server |
| `ONNXProvider` | ONNX Runtime | Local inference | In-process inference, no network calls |
Each provider tracks its own latency history and cost estimates, which the Model Router uses for routing decisions.
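The exponential moving average behind `avg_latency_ms` is a one-line update. In this sketch, the smoothing factor `alpha = 0.2` is an arbitrary assumption:

```python
def update_ema(current_avg, sample_ms: float, alpha: float = 0.2) -> float:
    """Blend a new latency sample into the running average.

    alpha controls how quickly the average reacts to new samples;
    0.2 is an illustrative choice, not the library's actual value.
    """
    if current_avg is None:  # first sample seeds the average
        return sample_ms
    return alpha * sample_ms + (1 - alpha) * current_avg
```

The router's `latency_optimized` strategy would then sort providers by this running average.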
### Orchestration Patterns

Module: `astromesh/orchestration/`
Orchestration patterns control how agents reason and use tools. Each pattern implements the OrchestrationPattern abstract base class with an execute() method that runs the reasoning loop.
| Pattern | Module | Description | Typical Use Case |
|---|---|---|---|
| `ReAct` | `patterns.py` | Think, Act, Observe loop. The LLM decides when to call tools and when to produce a final answer. Iterates up to `max_iterations`. | General-purpose agents that need tools |
| `PlanAndExecute` | `patterns.py` | First, the LLM creates a plan (list of steps). Then, each step is executed sequentially. | Multi-step tasks with clear decomposition |
| `ParallelFanOut` | `patterns.py` | Send the same query to multiple sub-models simultaneously, merge results. | Ensemble approaches, multi-perspective analysis |
| `Pipeline` | `patterns.py` | Chain multiple processing steps sequentially, where each step’s output feeds the next. | Data transformation, multi-stage processing |
| `Supervisor` | `supervisor.py` | A supervisor agent delegates sub-tasks to worker agents and combines their results. | Complex tasks requiring specialized sub-agents |
| `Swarm` | `swarm.py` | Agents hand off conversations to each other based on context. No central coordinator. | Customer service routing, multi-domain conversations |
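As an illustration of the ReAct-style loop, here is a stripped-down version with a stubbed model call. Everything in this sketch is invented; the real `execute()` also handles prompt rendering, memory, and streaming:

```python
def react_loop(llm_step, tools: dict, query: str, max_iterations: int = 5):
    """Think/Act/Observe: call tools until the LLM emits a final answer.

    `llm_step(query, observations)` stands in for the model call; it
    returns either ("final", answer) or ("tool", name, args).
    """
    observations = []
    for _ in range(max_iterations):
        decision = llm_step(query, observations)  # "Think"
        if decision[0] == "final":
            return decision[1]
        _, name, args = decision          # "Act": the model picked a tool
        result = tools[name](**args)      # "Observe": run it...
        observations.append((name, result))  # ...and feed the result back
    raise RuntimeError("max_iterations reached without a final answer")
```

The `max_iterations` cap mirrors the bound mentioned in the table: without it, a model that never emits a final answer would loop forever.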
### RAG Pipeline

Module: `astromesh/rag/`
The RAG (Retrieval-Augmented Generation) pipeline connects document ingestion to knowledge-augmented agent responses.
Ingestion Flow: Documents → Chunking → Embedding → Vector Store

Query Flow: Query → Embedding → Vector Search → Reranking → Top-K results

Each stage is pluggable:
| Stage | Options | Module |
|---|---|---|
| Chunking | Fixed-size, Recursive (split on separators), Sentence-aware, Semantic (embedding-based boundaries) | astromesh/rag/chunking/ |
| Embeddings | HuggingFace Inference API, SentenceTransformers (local), Ollama | astromesh/rag/embeddings/ |
| Vector Store | pgvector, ChromaDB, Qdrant, FAISS | astromesh/rag/stores/ |
| Reranking | Cross-encoder (local), Cohere Rerank API | astromesh/rag/reranking/ |
RAG pipelines are configured in config/rag/*.rag.yaml with apiVersion: astromesh/v1, kind: RAGPipeline.
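For example, the fixed-size chunking stage reduces to slicing with a sliding overlap. This is a character-based sketch; the other chunkers listed above split on separators, sentence boundaries, or embedding-based boundaries instead:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size chunks, overlapping so that content
    near a boundary appears in two chunks and stays retrievable."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last chunk already covers the tail
    return chunks
```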
### MCP Integration

Module: `astromesh/mcp/`
Astromesh supports the Model Context Protocol (MCP) in both directions:
#### MCP Client (`astromesh/mcp/client.py`)

Connects to external MCP servers to discover and invoke remote tools. Supports three transport mechanisms:
| Transport | Protocol | Use Case |
|---|---|---|
| stdio | stdin/stdout JSON-RPC | Local MCP servers, CLI tools |
| SSE | Server-Sent Events over HTTP | Remote MCP servers with streaming |
| HTTP | Standard HTTP JSON-RPC | Remote MCP servers, simple request/response |
At startup, the MCP client connects to configured servers, fetches their tool manifests, and registers discovered tools in the Tool Registry. Tools are then available to agents just like internal tools.
#### MCP Server (`astromesh/mcp/server.py`)

Exposes Astromesh agents as MCP tools via a JSON-RPC endpoint at `/mcp`. This allows other MCP-compatible systems to call your Astromesh agents as if they were tools.
### ML Model Registry

Module: `astromesh/ml/`
The ML Model Registry manages non-LLM machine learning models for tasks like classification, embedding generation, and custom inference:
- Registry (`model_registry.py`) — Register, version, load, and serve ML models
- Serving (`serving/`) — ONNX Runtime and PyTorch model servers for inference
- Training (`training/`) — Classifier and embedding fine-tuning pipelines
### Observability

Module: `astromesh/observability/`
The observability stack provides three pillars of monitoring:
```
Agent Execution ──► TelemetryManager  ──► OpenTelemetry Collector ──► Jaeger/Zipkin
        │
        ├─────────► MetricsCollector  ──► Prometheus ──► Grafana
        │
        └─────────► CostTracker       ──► Usage records + budget alerts
```

| Component | Module | Description |
|---|---|---|
| TelemetryManager | telemetry.py | Distributed tracing with OpenTelemetry. Creates spans for each pipeline step (guardrails, memory, routing, tool calls). Falls back to _NoOpSpan when OpenTelemetry is not installed. |
| MetricsCollector | metrics.py | Prometheus metrics: request counts (by agent, status), latency histograms (by agent, provider), active agents gauge. |
| CostTracker | cost_tracker.py | Per-provider usage records tracking token counts and estimated costs. Supports budget enforcement (alert or block when budget is exceeded) and grouped cost reports. |
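Budget enforcement in the cost tracker can be sketched as follows. The class shape, method names, and pricing inputs are assumptions for illustration:

```python
class CostTracker:
    """Accumulate per-provider spend and flag when a budget is exceeded."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.usage = {}  # provider name -> accumulated cost in USD

    def record(self, provider: str, tokens: int, cost_per_1k: float) -> None:
        cost = cost_per_1k * tokens / 1000
        self.usage[provider] = self.usage.get(provider, 0.0) + cost

    def total(self) -> float:
        return sum(self.usage.values())

    def over_budget(self) -> bool:
        """True once spend has crossed the configured budget; callers can
        then alert or block further LLM calls."""
        return self.total() > self.budget_usd
```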
## Layer Interaction Summary

The following rules govern how layers interact:
- Layer 1 (API) calls only Layer 2 (Runtime). API routes call `runtime.run()` and nothing else.
- Layer 2 (Runtime) calls only Layer 3 (Core Services). The Agent orchestrates calls to ModelRouter, MemoryManager, ToolRegistry, PromptEngine, and GuardrailsEngine.
- Layer 3 (Core Services) calls only Layer 4 (Infrastructure). The ModelRouter calls providers. The MemoryManager calls memory backends. The ToolRegistry calls tool implementations.
- Layer 4 (Infrastructure) calls only external systems (LLM endpoints, databases, vector stores, APIs).
No layer ever skips a level. The API Layer never directly calls a provider. The Runtime Engine never directly accesses a database. This strict layering makes the system testable (each layer can be tested in isolation with mocked dependencies) and maintainable (changes to infrastructure never ripple up to the API).