Local Models

How model routing works

NEXUS routes each micro-task to one of two model bands, based on the task’s complexity:

Band	Default Model	Tasks	Min VRAM
Supervisor	`qwen2.5-coder:1.5b`	Commit messages, boilerplate, test scaffolds	~1.2 GB
Logic	`llama3.2:3b`	Lint fixes, code refactors	~2.0 GB

Together these defaults need about 3.2 GB, so both fit in 4 GB of VRAM without spilling to system RAM. If you have more VRAM, switch to larger models for better output quality.

Configuring models

Option 1: `.env` file (recommended)

cp .env.example .env
# Then set these values in .env:
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"
OLLAMA_HOST_URL="http://localhost:11434"

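Option 2: MCP server config

Set the models in the `env` block of the server entry. Many MCP clients launch the command without a shell, so `~` in `args` may not be expanded; if the server fails to start, replace it with an absolute path.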

{
  "mcpServers": {
    "nexus-ollama": {
      "command": "node",
      "args": ["~/.config/nexus/tools/mcp/server.mjs"],
      "env": {
        "NEXUS_SUPERVISOR_MODEL": "qwen2.5-coder:7b",
        "NEXUS_LOGIC_MODEL": "llama3.1:8b"
      }
    }
  }
}
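The config above can also be generated rather than written by hand, which makes it easy to expand `~` up front. The following is a minimal sketch, not part of NEXUS: `build_mcp_config` is a hypothetical helper, and the default server path is simply the one shown in the example above.

```python
import json
import os


def build_mcp_config(supervisor, logic,
                     server_path="~/.config/nexus/tools/mcp/server.mjs"):
    """Return an MCP server entry with `~` in the server path expanded."""
    # expanduser leaves paths without a leading "~" unchanged.
    expanded = os.path.expanduser(server_path)
    return {
        "mcpServers": {
            "nexus-ollama": {
                "command": "node",
                "args": [expanded],
                "env": {
                    "NEXUS_SUPERVISOR_MODEL": supervisor,
                    "NEXUS_LOGIC_MODEL": logic,
                },
            }
        }
    }


# Print a config equivalent to the example above, ready to paste.
print(json.dumps(build_mcp_config("qwen2.5-coder:7b", "llama3.1:8b"), indent=2))
```

Where the resulting JSON goes depends on your MCP client; the sketch only produces the text.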

Option 3: Per-task overrides

Override a single task without changing the whole band:

NEXUS_MODEL_COMMIT_MSG="qwen2.5-coder:3b"
NEXUS_MODEL_LOGIC_REFACTOR="qwen2.5:7b"

Priority (highest first): per-task env var → band-level env var → built-in default
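The lookup order can be sketched as a simple fallback chain. This is an illustration, not the NEXUS source: `resolve_model` and the `TASK_BANDS` mapping are hypothetical, and the task IDs are inferred from the variable names shown above.

```python
import os

# Built-in defaults, taken from the band table at the top of this page.
BAND_DEFAULTS = {"SUPERVISOR": "qwen2.5-coder:1.5b", "LOGIC": "llama3.2:3b"}

# Hypothetical task-to-band mapping; the real task IDs may differ.
TASK_BANDS = {"COMMIT_MSG": "SUPERVISOR", "LOGIC_REFACTOR": "LOGIC"}


def resolve_model(task, env=None):
    """Pick a model: per-task override, then band-level setting, then default."""
    env = os.environ if env is None else env
    band = TASK_BANDS[task]
    return (
        env.get(f"NEXUS_MODEL_{task}")       # 1. per-task env var
        or env.get(f"NEXUS_{band}_MODEL")    # 2. band-level env var
        or BAND_DEFAULTS[band]               # 3. built-in default
    )
```

With only `NEXUS_SUPERVISOR_MODEL` set, `resolve_model("COMMIT_MSG")` returns the band-level value; adding `NEXUS_MODEL_COMMIT_MSG` overrides it for that one task and leaves other supervisor tasks untouched.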

Hardware presets

RTX 3050 Mobile / 4GB VRAM (default)

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:1.5b"
NEXUS_LOGIC_MODEL="llama3.2:3b"

Model	VRAM	Speed
qwen2.5-coder:1.5b	~1.2 GB	~122 t/s
llama3.2:3b	~2.0 GB	~73 t/s

RTX 3060 / RTX 4060 / 8GB VRAM

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"

Model	VRAM	Speed
qwen2.5-coder:7b	~4.7 GB	~45 t/s
llama3.1:8b	~4.9 GB	~40 t/s

MacBook Air / Pro M3 Base — 8GB Unified Memory

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:3b"
NEXUS_LOGIC_MODEL="llama3.2:3b"

Apple Silicon shares memory between CPU and GPU. With 8GB total, stick to smaller models.

MacBook Pro M3 Pro — 18GB / 36GB Unified Memory

# 18GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"

# 36GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="qwen2.5:14b"

The M3 Pro has lower memory bandwidth (~150 GB/s) than most discrete GPUs. Expect roughly 40–60% of the t/s you’d see on a comparable NVIDIA card, in exchange for room to load larger models.
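To see why bandwidth matters, a common back-of-envelope heuristic (an assumption here, not something NEXUS computes) treats token generation as memory-bound: each generated token reads roughly all of the model weights once, so bandwidth divided by model size gives a rough ceiling on t/s. A minimal sketch, with `decode_ceiling_tps` as a hypothetical helper:

```python
def decode_ceiling_tps(bandwidth_gb_s, model_size_gb):
    """Rough upper bound on tokens/s for a memory-bound model.

    Assumes every generated token streams the full weights once and
    ignores KV cache reads, compute limits, and runtime overhead.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# M3 Pro (~150 GB/s) running qwen2.5-coder:7b (~4.7 GB)
print(round(decode_ceiling_tps(150, 4.7), 1))  # → 31.9
```

That ceiling of about 32 t/s is in line with the 25–30 t/s listed for the M3 Pro in the quick reference. Real throughput lands below the ceiling, so treat it as a sanity check rather than a prediction.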

RTX 4090 / 3090 — 24GB VRAM

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"

Model	VRAM	Speed (4090)
qwen2.5-coder:14b	~9.0 GB	~70 t/s
qwen2.5:32b	~19.8 GB	~30 t/s

RTX 5090 — 32GB VRAM

The RTX 5090 has 1.79 TB/s of memory bandwidth. At that bandwidth, its logic band (~61 t/s) runs faster than the supervisor band on many other setups.

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"

Model	Speed
qwen2.5-coder:14b	~100 t/s
qwen2.5:32b	~61 t/s

MacBook Pro M3 Max — 48–96GB Unified Memory

# 48GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"

# 96GB (workstation territory)
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:32b"
NEXUS_LOGIC_MODEL="llama3.3:70b"

Dual GPU — 2× RTX 3090/4090 (48GB total)

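Ollama can spread a model’s layers across multiple NVIDIA GPUs; set CUDA_VISIBLE_DEVICES to control which GPUs it uses. With 48 GB of combined VRAM, a 70B model at Q4_K_M fits (~40 GB by the rule of thumb below, before KV cache).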

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:32b"
NEXUS_LOGIC_MODEL="llama3.3:70b"

VRAM rules of thumb

Q4_K_M VRAM (GB) ≈ parameters (billions) × 0.57 + 0.5 GB overhead + KV cache
KV cache at 8k context adds ~1–2 GB for 7–8B models and ~3–4 GB for 32B models.
If a model spills to system RAM, expect inference to be 5–20× slower, so choose models that fit entirely in VRAM.
Apple Silicon shares memory between the OS and the GPU, so budget about 4 GB less than the total when sizing models.
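These rules translate directly into a small estimator. The sketch below applies the formula as written; the function names and the `reserved_gb` parameter (for the Apple Silicon headroom) are illustrative, and the KV cache figure is whatever you pick from the ranges above.

```python
def estimate_vram_gb(params_b, kv_cache_gb=0.0):
    """Q4_K_M estimate: parameters (billions) x 0.57 + 0.5 GB overhead + KV cache."""
    return params_b * 0.57 + 0.5 + kv_cache_gb


def fits(params_b, memory_gb, kv_cache_gb=0.0, reserved_gb=0.0):
    """True if the estimate fits in memory_gb after setting aside reserved_gb."""
    return estimate_vram_gb(params_b, kv_cache_gb) <= memory_gb - reserved_gb


# 8B model with a 1.5 GB KV cache on an 8 GB card
print(round(estimate_vram_gb(8, kv_cache_gb=1.5), 2))  # → 6.56
# Same model on an 8 GB Apple Silicon machine, reserving 4 GB for the OS
print(fits(8, 8, kv_cache_gb=1.5, reserved_gb=4))      # → False
```

The estimate covers one model at a time. If you want both bands resident together, add their estimates before comparing against available memory.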

Quick reference

Hardware	VRAM (UM = unified memory)	Supervisor	Logic	Speed
RTX 3050 Mobile	4 GB	qwen2.5-coder:1.5b	llama3.2:3b	70–120 t/s
M3 Base	8 GB UM	qwen2.5-coder:3b	llama3.2:3b	45–50 t/s
RTX 3060 / 4060	8 GB	qwen2.5-coder:7b	llama3.1:8b	40–60 t/s
M3 Pro 18GB	18 GB UM	qwen2.5-coder:7b	llama3.1:8b	25–30 t/s
RTX 4060 Ti 16GB	16 GB	qwen2.5-coder:7b	qwen2.5:14b	35–55 t/s
RTX 3090	24 GB	qwen2.5-coder:14b	qwen2.5:32b	20–50 t/s
RTX 4090	24 GB	qwen2.5-coder:14b	qwen2.5:32b	30–70 t/s
M3 Max 48GB	48 GB UM	qwen2.5-coder:14b	qwen2.5:32b	18–35 t/s
RTX 5090	32 GB	qwen2.5-coder:14b	qwen2.5:32b	61–100 t/s
2× RTX 3090/4090	48 GB	qwen2.5-coder:32b	llama3.3:70b	15–60 t/s

Verifying your setup

# 1. Pull your chosen models
ollama pull <supervisor-model>
ollama pull <logic-model>

# 2. Test the MCP server health
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}
{"jsonrpc":"2.0","method":"notifications/initialized"}
{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"ollama_health","arguments":{}}}' | \
  node ~/.config/nexus/tools/mcp/server.mjs 2>/dev/null | tail -1 | python3 -m json.tool

The health check lists your available models. If a model isn’t pulled, generate calls will fail with an Ollama error (not a CIRCUIT_BREAKER).
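You can also check for unpulled models directly against Ollama, without going through the MCP server. The sketch below uses Ollama’s `/api/tags` endpoint, which lists locally available models; `fetch_tags` and `missing_models` are hypothetical helper names, and the comparison is an exact match on the model name including its tag.

```python
import json
import urllib.request


def fetch_tags(host):
    """Fetch the locally available models from Ollama's /api/tags endpoint."""
    url = f"{host.rstrip('/')}/api/tags"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def missing_models(required, tags):
    """Return the required model names that are absent from a /api/tags response."""
    available = {m["name"] for m in tags.get("models", [])}
    return [name for name in required if name not in available]
```

Against a running Ollama instance, `missing_models(["qwen2.5-coder:1.5b", "llama3.2:3b"], fetch_tags("http://localhost:11434"))` returns the models that still need an `ollama pull`; an empty list means both bands are ready.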