# Provider Configuration
The provider configuration defines which LLM backends are available to Astromesh, how to connect to them, and how the model router selects between them. All providers are declared in a single file and shared across all agents.
## File Location

Provider configuration lives at `config/providers.yaml` (development) or `/etc/astromesh/providers.yaml` (production).
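Which of the two paths gets loaded typically depends on the deployment environment. A minimal sketch of that selection logic, assuming a hypothetical `ASTROMESH_ENV` environment variable (the actual variable name is not documented here):

```python
import os

# Hypothetical sketch: choose the providers.yaml path by environment.
# ASTROMESH_ENV is an assumed variable name, not part of the documented config.
DEV_PATH = "config/providers.yaml"
PROD_PATH = "/etc/astromesh/providers.yaml"

def providers_config_path(env=None):
    """Return the providers.yaml path for the given environment."""
    env = env or os.environ.get("ASTROMESH_ENV", "development")
    return PROD_PATH if env == "production" else DEV_PATH
```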
```yaml
apiVersion: astromesh/v1
kind: ProviderConfig
metadata:
  name: default-providers
```

## Full Example

Below is a complete `providers.yaml` with all six provider types configured:
```yaml
apiVersion: astromesh/v1
kind: ProviderConfig
metadata:
  name: default-providers
spec:
  providers:
    # --- Ollama (local inference) ---
    ollama:
      type: ollama
      endpoint: "http://ollama:11434"
      models:
        - "llama3.1:8b"
        - "llama3.1:70b"
        - "codellama:34b"
        - "nomic-embed-text"
      health_check_interval: 30

    # --- OpenAI-compatible API ---
    openai:
      type: openai_compat
      endpoint: "https://api.openai.com/v1"
      api_key_env: OPENAI_API_KEY
      models:
        - "gpt-4o"
        - "gpt-4o-mini"

    # --- vLLM (high-throughput serving) ---
    vllm:
      type: vllm
      endpoint: "http://vllm:8000"
      models:
        - "mistralai/Mistral-7B-Instruct-v0.3"
      health_check_interval: 30

    # --- llama.cpp server ---
    llamacpp:
      type: llamacpp
      endpoint: "http://llamacpp:8080"
      models:
        - "local-model"

    # --- HuggingFace Text Generation Inference ---
    hf_tgi:
      type: hf_tgi
      endpoint: "http://tgi:80"
      models:
        - "BAAI/bge-small-en-v1.5"

    # --- ONNX Runtime (local) ---
    onnx:
      type: onnx
      models:
        - "model.onnx"

  routing:
    default_strategy: cost_optimized
    fallback_enabled: true
    circuit_breaker:
      failure_threshold: 3
      recovery_timeout: 60
```

## Provider Types
### Ollama

Ollama provides local LLM inference with simple model management. It is the recommended provider for development and single-node deployments.
Setup:
```sh
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start the Ollama server
ollama serve

# Pull models
ollama pull llama3.1:8b
ollama pull nomic-embed-text
```

Configuration:

```yaml
ollama:
  type: ollama
  endpoint: "http://localhost:11434"
  models:
    - "llama3.1:8b"
    - "nomic-embed-text"
  health_check_interval: 30
```

The `endpoint` is the Ollama HTTP API. When running in Docker, use the service name (e.g., `http://ollama:11434`). The `models` list declares which models this provider serves; it does not automatically pull them.
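Because the `models` list does not pull anything, it is worth checking that every declared model is actually present on the Ollama host (Ollama lists local models via `GET /api/tags`). A minimal sketch of that comparison, as a pure function:

```python
# Sketch: verify that every model declared in providers.yaml is actually
# present on the Ollama host. The "available" list would come from Ollama's
# GET /api/tags endpoint; here it is passed in directly.

def missing_models(declared, available):
    """Return declared models that are not available locally."""
    have = set(available)
    return [m for m in declared if m not in have]

# Example: "nomic-embed-text" was declared but never pulled.
missing = missing_models(
    ["llama3.1:8b", "nomic-embed-text"],
    ["llama3.1:8b"],
)
# missing == ["nomic-embed-text"], so run: ollama pull nomic-embed-text
```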
### OpenAI-Compatible

Any API that implements the OpenAI chat completions interface. Works with OpenAI, Azure OpenAI, Anthropic (via proxy), Together AI, Groq, and other compatible services.
Setup:
```sh
# Set your API key as an environment variable
export OPENAI_API_KEY="sk-..."
```

Configuration:

```yaml
openai:
  type: openai_compat
  endpoint: "https://api.openai.com/v1"
  api_key_env: OPENAI_API_KEY
  models:
    - "gpt-4o"
    - "gpt-4o-mini"
```

The `api_key_env` field is the name of the environment variable, not the key itself. The runtime reads the key from `os.environ["OPENAI_API_KEY"]` at startup. For Azure OpenAI, point the endpoint to your Azure deployment URL.
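The indirection through `api_key_env` keeps secrets out of the YAML file. A sketch of how a runtime might resolve it (the actual Astromesh implementation is not shown in this doc):

```python
import os

# Sketch: the config stores the *name* of an environment variable;
# the key itself is read from the environment at startup.
def resolve_api_key(provider_cfg):
    """Read the API key named by api_key_env, if the provider declares one."""
    var_name = provider_cfg.get("api_key_env")
    if var_name is None:
        return None  # local providers (ollama, vllm, ...) need no key
    key = os.environ.get(var_name)
    if key is None:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key
```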
### vLLM

vLLM is a high-throughput LLM serving engine with continuous batching. Best for production workloads that need to serve many concurrent requests.
Setup:
```sh
# Run vLLM with Docker (requires NVIDIA GPU)
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3
```

Configuration:

```yaml
vllm:
  type: vllm
  endpoint: "http://vllm:8000"
  models:
    - "mistralai/Mistral-7B-Instruct-v0.3"
  health_check_interval: 30
```

vLLM exposes an OpenAI-compatible API, but Astromesh uses the dedicated `vllm` provider type for optimized health checking and capability detection. GPU access is required.
### llama.cpp

llama.cpp provides lightweight CPU and GPU inference for GGUF-format models. Good for edge deployments and environments without dedicated GPU infrastructure.
Setup:
```sh
# Build and run the llama.cpp server
./llama-server -m /models/llama-3.1-8b.gguf --host 0.0.0.0 --port 8080
```

Configuration:

```yaml
llamacpp:
  type: llamacpp
  endpoint: "http://llamacpp:8080"
  models:
    - "local-model"
```

The model name in the `models` list is a logical identifier; the actual model file is specified when starting the llama.cpp server.
### HuggingFace TGI

HuggingFace Text Generation Inference (TGI) provides GPU-optimized transformer inference with features like flash attention and quantization.
Setup:
```sh
# Run TGI with Docker (requires NVIDIA GPU)
docker run --gpus all \
  -p 80:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id BAAI/bge-small-en-v1.5
```

Configuration:

```yaml
hf_tgi:
  type: hf_tgi
  endpoint: "http://tgi:80"
  models:
    - "BAAI/bge-small-en-v1.5"
```

TGI is particularly useful for embedding models and specialized transformer architectures that benefit from HuggingFace's optimized inference stack.
### ONNX Runtime

ONNX Runtime runs optimized ONNX-format models locally without a network endpoint. Suited for scenarios where you need deterministic latency without network hops.
Configuration:
```yaml
onnx:
  type: onnx
  models:
    - "model.onnx"
```

No endpoint is needed; the model file is loaded directly by the runtime. The `models` list contains paths to `.onnx` files relative to the config directory.
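Since the `models` entries are relative paths, they must be joined with the directory that holds `providers.yaml` before the runtime can load them. A small sketch of that resolution (the real loader is not shown in this doc):

```python
from pathlib import Path

# Sketch: resolve an onnx "models" entry against the directory
# containing providers.yaml.
def resolve_model_path(config_dir, model_entry):
    """Join a relative models entry with the config directory."""
    return Path(config_dir) / model_entry
```

For example, with the production config, `resolve_model_path("/etc/astromesh", "model.onnx")` yields `/etc/astromesh/model.onnx`.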
## Provider Types Table

| Type | Description | Endpoint Format |
|---|---|---|
| `ollama` | Ollama local inference server | `http://host:11434` |
| `openai_compat` | Any OpenAI-compatible API (OpenAI, Azure, Together, Groq, etc.) | `https://api.example.com/v1` |
| `vllm` | vLLM high-throughput serving engine | `http://host:8000` |
| `llamacpp` | llama.cpp server for GGUF models | `http://host:8080` |
| `hf_tgi` | HuggingFace Text Generation Inference | `http://host:80` |
| `onnx` | ONNX Runtime local inference | No endpoint needed |
## Routing

The `routing` section controls how the model router selects providers when an agent makes an inference request.
### Strategies

| Strategy | Value | When to Use |
|---|---|---|
| Cost Optimized | cost_optimized | Default. Prefers the cheapest available provider. Good for development and cost-sensitive workloads. |
| Latency Optimized | latency_optimized | Prefers the provider with the lowest response time. Good for real-time applications and chat interfaces. |
| Quality First | quality_first | Prefers the highest-capability model available. Good for complex reasoning tasks where accuracy matters most. |
| Round Robin | round_robin | Distributes requests evenly across all healthy providers. Good for load balancing across multiple identical deployments. |
| Capability Match | capability_match | Selects the provider based on request requirements (e.g., vision models for image inputs). Good for multi-modal agents. |
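As an illustration of the default strategy, cost-optimized routing picks the cheapest healthy provider that serves the requested model. A sketch under assumed data shapes (the real router's provider records and pricing source are not shown in this doc):

```python
# Sketch of cost_optimized selection. Each provider record here is a plain
# dict with assumed fields: name, healthy, models, cost_per_1k_tokens.
def pick_provider(providers, model):
    """Return the name of the cheapest healthy provider serving `model`."""
    candidates = [
        p for p in providers
        if p["healthy"] and model in p["models"]
    ]
    if not candidates:
        raise LookupError(f"no healthy provider serves {model}")
    return min(candidates, key=lambda p: p["cost_per_1k_tokens"])["name"]

providers = [
    {"name": "openai", "healthy": True,
     "models": ["gpt-4o"], "cost_per_1k_tokens": 5.0},
    {"name": "ollama", "healthy": True,
     "models": ["llama3.1:8b"], "cost_per_1k_tokens": 0.0},
]
```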
The `default_strategy` applies to all agents unless overridden in the agent's `spec.model.routing.strategy` field.
```yaml
routing:
  default_strategy: cost_optimized
```

### Fallback

When `fallback_enabled` is true, the model router automatically tries the next available provider if the primary fails. Agents can also define an explicit fallback model in their YAML.
```yaml
routing:
  fallback_enabled: true
```

### Circuit Breaker

The circuit breaker protects the system from repeatedly calling a failing provider. It tracks consecutive failures per provider and temporarily removes unhealthy providers from the routing pool.
```yaml
routing:
  circuit_breaker:
    failure_threshold: 3   # Open the circuit after 3 consecutive failures
    recovery_timeout: 60   # Wait 60 seconds before trying the provider again
```

How it works:

1. Each provider starts in the closed state (healthy, accepting requests).
2. When a request to a provider fails, the failure counter increments.
3. After `failure_threshold` consecutive failures (default: 3), the circuit opens and the provider is removed from the routing pool.
4. After `recovery_timeout` seconds (default: 60), the circuit enters a half-open state: the next request is sent to the provider as a test.
5. If the test request succeeds, the circuit closes and the provider is returned to the pool. If it fails, the circuit remains open for another recovery period.
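The state machine above can be sketched in a few lines. This is a minimal illustration of the described behavior, not Astromesh's actual implementation; the injectable `clock` parameter is an assumption added to make the sketch testable:

```python
import time

# Minimal circuit breaker sketch:
#   closed    -> open      after failure_threshold consecutive failures
#   open      -> half-open after recovery_timeout seconds
#   half-open -> closed    on success, back to open on failure
class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=60,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def allow_request(self):
        # Half-open lets one test request through.
        return self.state() in ("closed", "half-open")

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```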
## How Agents Reference Providers

Agents reference providers by the `type` value in their `spec.model.primary.provider` field. The model router looks up the matching provider in `providers.yaml`:

```yaml
# In providers.yaml
spec:
  providers:
    ollama:
      type: ollama
      endpoint: "http://ollama:11434"
      models:
        - "llama3.1:8b"
```

```yaml
# In an agent YAML
spec:
  model:
    primary:
      provider: ollama               # Matches the provider type above
      model: "llama3.1:8b"           # Must be in the provider's models list
      endpoint: "http://ollama:11434"
```

The agent's `endpoint` field can override the provider-level endpoint if needed (e.g., when an agent connects to a different Ollama instance).
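The two cross-references above (provider key must exist, model must be in that provider's `models` list) can be checked before deployment. A sketch of such a validator, assuming the specs are already parsed into dicts:

```python
# Sketch: validate an agent's model reference against providers.yaml.
# Inputs are assumed to be parsed YAML (plain dicts).
def validate_agent_model(agent_spec, providers):
    """Raise ValueError if the agent references an unknown provider or model."""
    primary = agent_spec["model"]["primary"]
    provider = providers.get(primary["provider"])
    if provider is None:
        raise ValueError(f"unknown provider: {primary['provider']}")
    if primary["model"] not in provider["models"]:
        raise ValueError(
            f"model {primary['model']} is not declared by "
            f"provider {primary['provider']}"
        )
```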