Local Models
How model routing works
NEXUS routes each micro-task to one of two model bands based on the task’s complexity:
| Band | Default Model | Tasks | Min VRAM |
|---|---|---|---|
| Supervisor | qwen2.5-coder:1.5b | Commit messages, boilerplate, test scaffolds | ~1.2 GB |
| Logic | llama3.2:3b | Lint fixes, code refactors | ~2.0 GB |
These defaults fit in 4 GB of VRAM with no spillover to system RAM. If you have more VRAM, switch to larger models for better output quality.
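The routing table above can be pictured as a lookup from task type to band. The following is a minimal Python sketch, not NEXUS's actual router: the task keys (`commit_msg`, `lint_fix`, and so on) are hypothetical names chosen for illustration, and the real router may weigh complexity differently.

```python
# Hypothetical task keys; the real NEXUS identifiers may differ.
TASK_BANDS = {
    "commit_msg": "supervisor",
    "boilerplate": "supervisor",
    "test_scaffold": "supervisor",
    "lint_fix": "logic",
    "refactor": "logic",
}

# Built-in default models, taken from the table above.
DEFAULT_MODELS = {
    "supervisor": "qwen2.5-coder:1.5b",
    "logic": "llama3.2:3b",
}


def route(task: str) -> tuple[str, str]:
    """Return (band, default model) for a task key."""
    try:
        band = TASK_BANDS[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
    return band, DEFAULT_MODELS[band]
```

For example, `route("commit_msg")` yields the supervisor band and its default model, while `route("refactor")` yields the logic band.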
Configuring models
Option 1: .env file (recommended)
cp .env.example .env
# Edit the file:
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"
OLLAMA_HOST_URL="http://localhost:11434"
Option 2: MCP server config
{
"mcpServers": {
"nexus-ollama": {
"command": "node",
"args": ["~/.config/nexus/tools/mcp/server.mjs"],
"env": {
"NEXUS_SUPERVISOR_MODEL": "qwen2.5-coder:7b",
"NEXUS_LOGIC_MODEL": "llama3.1:8b"
}
}
}
}
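Some MCP clients launch the server without a shell, in which case the `~` in `args` is passed to `node` literally instead of being expanded. Whether that applies depends on your client, so treat it as an assumption to check. If it does, a small Python sketch like this prints the same config with an absolute path; the function name `build_mcp_config` is illustrative only.

```python
import json
import os


def build_mcp_config(supervisor: str, logic: str,
                     server_path: str = "~/.config/nexus/tools/mcp/server.mjs") -> dict:
    """Build the MCP server entry with '~' expanded to the user's home directory."""
    return {
        "mcpServers": {
            "nexus-ollama": {
                "command": "node",
                "args": [os.path.expanduser(server_path)],
                "env": {
                    "NEXUS_SUPERVISOR_MODEL": supervisor,
                    "NEXUS_LOGIC_MODEL": logic,
                },
            }
        }
    }


config = build_mcp_config("qwen2.5-coder:7b", "llama3.1:8b")
print(json.dumps(config, indent=2))
```

Paste the printed JSON into your client's MCP configuration in place of the hand-written block.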
Option 3: Per-task overrides
Override a single task without changing the whole band:
NEXUS_MODEL_COMMIT_MSG="qwen2.5-coder:3b"
NEXUS_MODEL_LOGIC_REFACTOR="qwen2.5:7b"
Priority: Per-task env var → Band-level env var → Built-in default
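The priority order can be sketched as a three-step fallback. This Python version is an illustration under stated assumptions rather than the server's actual code: `resolve_model` is a hypothetical helper, and the caller is assumed to know which per-task variable name applies.

```python
import os

# Built-in defaults from the routing table.
DEFAULTS = {"supervisor": "qwen2.5-coder:1.5b", "logic": "llama3.2:3b"}


def resolve_model(band, task_var=None, env=None):
    """Resolve a model name: per-task var, then band var, then built-in default.

    band     -- "supervisor" or "logic"
    task_var -- per-task variable name, e.g. "NEXUS_MODEL_COMMIT_MSG"
    env      -- mapping to read from (defaults to os.environ)
    """
    env = os.environ if env is None else env
    band_var = f"NEXUS_{band.upper()}_MODEL"
    if task_var and env.get(task_var):
        return env[task_var]      # 1. per-task override
    if env.get(band_var):
        return env[band_var]      # 2. band-level setting
    return DEFAULTS[band]         # 3. built-in default


example_env = {
    "NEXUS_SUPERVISOR_MODEL": "qwen2.5-coder:7b",
    "NEXUS_MODEL_COMMIT_MSG": "qwen2.5-coder:3b",
}
print(resolve_model("supervisor", "NEXUS_MODEL_COMMIT_MSG", example_env))  # qwen2.5-coder:3b
print(resolve_model("supervisor", env=example_env))                        # qwen2.5-coder:7b
print(resolve_model("logic", env=example_env))                             # llama3.2:3b
```

Empty values are treated as unset here, which is a design choice of this sketch.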
Hardware presets
RTX 3050 Mobile / 4GB VRAM (default)
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:1.5b"
NEXUS_LOGIC_MODEL="llama3.2:3b"
| Model | VRAM | Speed |
|---|---|---|
| qwen2.5-coder:1.5b | ~1.2 GB | ~122 t/s |
| llama3.2:3b | ~2.0 GB | ~73 t/s |
RTX 3060 / RTX 4060 / 8GB VRAM
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"
| Model | VRAM | Speed |
|---|---|---|
| qwen2.5-coder:7b | ~4.7 GB | ~45 t/s |
| llama3.1:8b | ~4.9 GB | ~40 t/s |
MacBook Air / Pro M3 Base — 8GB Unified Memory
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:3b"
NEXUS_LOGIC_MODEL="llama3.2:3b"
Apple Silicon shares memory between CPU and GPU. With 8GB total, stick to smaller models.
MacBook Pro M3 Pro — 18GB / 36GB Unified Memory
# 18GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"
# 36GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="qwen2.5:14b"
Apple Silicon has lower memory bandwidth than discrete GPUs (~150 GB/s on M3 Pro). Expect ~40–60% of the t/s you’d see on an equivalent NVIDIA card, but with larger model capacity.
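A common back-of-the-envelope model (not something NEXUS computes) treats token generation as memory-bound: each generated token reads roughly the full set of weights once, so bandwidth divided by model size gives a rough ceiling on t/s. The Python sketch below applies that to the M3 Pro figure above and the ~4.9 GB size listed for llama3.1:8b. Real throughput lands below this ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# M3 Pro (~150 GB/s) running llama3.1:8b (~4.9 GB)
m3_pro_ceiling = max_tokens_per_second(150, 4.9)
print(f"{m3_pro_ceiling:.1f} t/s ceiling")  # 30.6 t/s ceiling
```

The same function can be used to compare any two machines: plug in each one's memory bandwidth and the size of the model you plan to run.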
RTX 4090 / 3090 — 24GB VRAM
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"
| Model | VRAM | Speed (4090) |
|---|---|---|
| qwen2.5-coder:14b | ~9.0 GB | ~70 t/s |
| qwen2.5:32b | ~19.8 GB | ~30 t/s |
RTX 5090 — 32GB VRAM
The RTX 5090 has 1.79 TB/s memory bandwidth. At that bandwidth, the logic band on this card runs faster than the supervisor band on most other setups.
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"
| Model | Speed |
|---|---|
| qwen2.5-coder:14b | ~100 t/s |
| qwen2.5:32b | ~61 t/s |
MacBook Pro M3 Max — 48–96GB Unified Memory
# 48GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"
# 96GB (workstation territory)
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:32b"
NEXUS_LOGIC_MODEL="llama3.3:70b"
Dual GPU — 2× RTX 3090/4090 (48GB total)
Ollama can split a model across multiple GPUs; use CUDA_VISIBLE_DEVICES to control which GPUs it sees. With 48GB combined VRAM, 70B models fit.
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:32b"
NEXUS_LOGIC_MODEL="llama3.3:70b"
VRAM rules of thumb
- Q4_K_M VRAM ≈ parameters (B) × 0.57 + 0.5 GB overhead + KV cache
- KV cache at 8k context adds ~1–2 GB for 7–8B models, ~3–4 GB for 32B models
- If a model spills to RAM, expect 5–20× slower inference. Always fit the model fully in VRAM.
- Apple Silicon shares memory between the OS and the GPU, so budget about 4 GB less than the total for models.
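The Q4_K_M rule of thumb above is simple arithmetic, so it is easy to turn into a quick check. This is a Python sketch with hypothetical helper names (`estimate_vram_gb`, `fits`); the KV cache figure is whatever you pick from the ranges in the list.

```python
def estimate_vram_gb(params_b: float, kv_cache_gb: float = 0.0) -> float:
    """Q4_K_M estimate: parameters (B) x 0.57 + 0.5 GB overhead + KV cache."""
    return params_b * 0.57 + 0.5 + kv_cache_gb


def fits(params_b: float, vram_gb: float, kv_cache_gb: float = 0.0) -> bool:
    """True if the estimate fits entirely in the given VRAM."""
    return estimate_vram_gb(params_b, kv_cache_gb) <= vram_gb


# A 32B model with ~3 GB of KV cache on a 24 GB card
print(round(estimate_vram_gb(32, kv_cache_gb=3), 2))  # 21.74
print(fits(32, 24, kv_cache_gb=3))                    # True
```

On Apple Silicon, pass the total memory minus about 4 GB as `vram_gb` to account for the OS.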
Quick reference
| Hardware | VRAM | Supervisor | Logic | Speed |
|---|---|---|---|---|
| RTX 3050 Mobile | 4 GB | qwen2.5-coder:1.5b | llama3.2:3b | 70–120 t/s |
| M3 Base | 8 GB UM | qwen2.5-coder:3b | llama3.2:3b | 45–50 t/s |
| RTX 3060 / 4060 | 8 GB | qwen2.5-coder:7b | llama3.1:8b | 40–60 t/s |
| M3 Pro 18GB | 18 GB UM | qwen2.5-coder:7b | llama3.1:8b | 25–30 t/s |
| RTX 4060 Ti 16GB | 16 GB | qwen2.5-coder:7b | qwen2.5:14b | 35–55 t/s |
| RTX 3090 | 24 GB | qwen2.5-coder:14b | qwen2.5:32b | 20–50 t/s |
| RTX 4090 | 24 GB | qwen2.5-coder:14b | qwen2.5:32b | 30–70 t/s |
| M3 Max 48GB | 48 GB UM | qwen2.5-coder:14b | qwen2.5:32b | 18–35 t/s |
| RTX 5090 | 32 GB | qwen2.5-coder:14b | qwen2.5:32b | 61–100 t/s |
| 2× RTX 3090/4090 | 48 GB | qwen2.5-coder:32b | llama3.3:70b | 15–60 t/s |
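For discrete GPUs, the table reduces to a lookup by VRAM tier. The Python sketch below encodes only the NVIDIA rows. The `pick_preset` helper is hypothetical, and unified-memory Macs are left out because their usable budget differs.

```python
# (minimum VRAM in GB, supervisor model, logic model), from the discrete-GPU rows above.
PRESETS = [
    (4, "qwen2.5-coder:1.5b", "llama3.2:3b"),
    (8, "qwen2.5-coder:7b", "llama3.1:8b"),
    (16, "qwen2.5-coder:7b", "qwen2.5:14b"),
    (24, "qwen2.5-coder:14b", "qwen2.5:32b"),
    (48, "qwen2.5-coder:32b", "llama3.3:70b"),
]


def pick_preset(vram_gb: float) -> tuple[str, str]:
    """Return (supervisor, logic) for the largest tier that fits in vram_gb."""
    chosen = None
    for min_vram, supervisor, logic in PRESETS:
        if vram_gb >= min_vram:
            chosen = (supervisor, logic)
    if chosen is None:
        raise ValueError("less than 4 GB of VRAM: no preset listed")
    return chosen
```

A 32 GB card falls into the 24 GB tier, which matches the RTX 5090 row.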
Verifying your setup
# 1. Pull your chosen models
ollama pull <supervisor-model>
ollama pull <logic-model>
# 2. Test the MCP server health
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}
{"jsonrpc":"2.0","method":"notifications/initialized"}
{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"ollama_health","arguments":{}}}' | \
node ~/.config/nexus/tools/mcp/server.mjs 2>/dev/null | tail -1 | python3 -m json.tool
The health check lists your available models. If a model isn’t pulled, generate calls fail with an Ollama error rather than a CIRCUIT_BREAKER error.
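To see which configured models are still missing before running any task, you can also ask Ollama directly through its `/api/tags` endpoint, which lists locally pulled models. The Python sketch below bypasses the MCP server entirely. The helper names are hypothetical, and the host defaults to the `OLLAMA_HOST_URL` value shown earlier.

```python
import json
import urllib.request


def missing_models(tags_response: dict, wanted: list[str]) -> list[str]:
    """Return the wanted models that are absent from an /api/tags response."""
    available = {m["name"] for m in tags_response.get("models", [])}
    return [name for name in wanted if name not in available]


def fetch_tags(host: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally pulled models from a running Ollama instance."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With Ollama running, `missing_models(fetch_tags(), ["qwen2.5-coder:1.5b", "llama3.2:3b"])` returns the models you still need to `ollama pull`. An empty list means both bands are ready.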