NEXUS v0.1.1

Local Models

How model routing works

NEXUS routes each micro-task to one of two model bands based on the task’s complexity:

| Band | Default Model | Tasks | Min VRAM |
| --- | --- | --- | --- |
| Supervisor | qwen2.5-coder:1.5b | Commit messages, boilerplate, test scaffolds | ~1.2 GB |
| Logic | llama3.2:3b | Lint fixes, code refactors | ~2.0 GB |

Together these defaults need about 3.2 GB, so they fit in 4 GB of VRAM with no spillover to system RAM. If you have more VRAM, upgrade to larger models for better output quality.


Configuring models

Option 1: .env file

cp .env.example .env
# Edit the file:
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"
OLLAMA_HOST_URL="http://localhost:11434"

Option 2: MCP server config

{
  "mcpServers": {
    "nexus-ollama": {
      "command": "node",
      "args": ["~/.config/nexus/tools/mcp/server.mjs"],
      "env": {
        "NEXUS_SUPERVISOR_MODEL": "qwen2.5-coder:7b",
        "NEXUS_LOGIC_MODEL": "llama3.1:8b"
      }
    }
  }
}

Note: many MCP clients start the server without a shell, so ~ may not be expanded. If the server fails to start, replace ~ with the absolute path to your home directory.

Option 3: Per-task overrides

Override a single task without changing the whole band:

NEXUS_MODEL_COMMIT_MSG="qwen2.5-coder:3b"
NEXUS_MODEL_LOGIC_REFACTOR="qwen2.5:7b"

Priority: Per-task env var → Band-level env var → Built-in default
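
The priority order above can be sketched in a few lines of Python. This is only an illustration of the lookup chain, not the code NEXUS runs: the function name `resolve_model`, the `BAND_DEFAULTS` table, and the `BOILERPLATE` task suffix are hypothetical, and the way a task maps to its `NEXUS_MODEL_*` variable is assumed from the two examples shown.

```python
import os

# Built-in defaults, taken from the band table in this document.
BAND_DEFAULTS = {
    "SUPERVISOR": "qwen2.5-coder:1.5b",
    "LOGIC": "llama3.2:3b",
}


def resolve_model(task, band, env=None):
    """Return the model for a task: per-task var, then band var, then default.

    `task` is the suffix of a per-task variable (e.g. "COMMIT_MSG" for
    NEXUS_MODEL_COMMIT_MSG); `band` is "SUPERVISOR" or "LOGIC".
    """
    env = os.environ if env is None else env
    per_task = env.get(f"NEXUS_MODEL_{task}")
    if per_task:
        return per_task
    band_level = env.get(f"NEXUS_{band}_MODEL")
    if band_level:
        return band_level
    return BAND_DEFAULTS[band]


example_env = {
    "NEXUS_SUPERVISOR_MODEL": "qwen2.5-coder:7b",
    "NEXUS_MODEL_COMMIT_MSG": "qwen2.5-coder:3b",
}
print(resolve_model("COMMIT_MSG", "SUPERVISOR", example_env))   # qwen2.5-coder:3b
print(resolve_model("BOILERPLATE", "SUPERVISOR", example_env))  # qwen2.5-coder:7b
print(resolve_model("LOGIC_REFACTOR", "LOGIC", example_env))    # llama3.2:3b
```

The commit-message task picks up its own override, the other supervisor task falls back to the band-level variable, and the logic task falls through to the built-in default because nothing is set for it.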


Hardware presets

RTX 3050 Mobile / 4GB VRAM (default)

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:1.5b"
NEXUS_LOGIC_MODEL="llama3.2:3b"

| Model | VRAM | Speed |
| --- | --- | --- |
| qwen2.5-coder:1.5b | ~1.2 GB | ~122 t/s |
| llama3.2:3b | ~2.0 GB | ~73 t/s |

RTX 3060 / RTX 4060 / 8GB VRAM

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"

| Model | VRAM | Speed |
| --- | --- | --- |
| qwen2.5-coder:7b | ~4.7 GB | ~45 t/s |
| llama3.1:8b | ~4.9 GB | ~40 t/s |

MacBook Air / Pro M3 Base — 8GB Unified Memory

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:3b"
NEXUS_LOGIC_MODEL="llama3.2:3b"

Apple Silicon shares memory between CPU and GPU. With 8GB total, stick to smaller models.

MacBook Pro M3 Pro — 18GB / 36GB Unified Memory

# 18GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="llama3.1:8b"

# 36GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:7b"
NEXUS_LOGIC_MODEL="qwen2.5:14b"

Apple Silicon has lower memory bandwidth than discrete GPUs (~150 GB/s on M3 Pro). Expect ~40–60% of the t/s of a comparable NVIDIA card, in exchange for room to load larger models.
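
To see why bandwidth matters, a common back-of-the-envelope bound is useful: generating each token reads roughly all of the model's weights once, so tokens per second cannot exceed memory bandwidth divided by model size. The sketch below applies that bound using figures from this document. The function name is hypothetical, and the result is a ceiling rather than a prediction; measured speeds are lower.

```python
def decode_speed_ceiling(bandwidth_gb_s, model_size_gb):
    """Rough upper bound on tokens/second for memory-bound decoding.

    Each generated token reads roughly all model weights once, so speed
    cannot exceed bandwidth divided by model size. Real throughput is lower.
    """
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures from this document; model sizes from its preset tables.
print(round(decode_speed_ceiling(150, 4.7)))    # M3 Pro, qwen2.5-coder:7b -> 32
print(round(decode_speed_ceiling(1790, 19.8)))  # RTX 5090, qwen2.5:32b -> 90
```

Both ceilings sit above the speeds listed in this document for the same hardware (25–30 t/s and ~61 t/s), which is what an upper bound should do.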

RTX 4090 / 3090 — 24GB VRAM

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"

| Model | VRAM | Speed (4090) |
| --- | --- | --- |
| qwen2.5-coder:14b | ~9.0 GB | ~70 t/s |
| qwen2.5:32b | ~19.8 GB | ~30 t/s |

RTX 5090 — 32GB VRAM

The RTX 5090 has 1.79 TB/s memory bandwidth. At these speeds, the logic band is faster than most setups’ supervisor band.

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"

| Model | Speed |
| --- | --- |
| qwen2.5-coder:14b | ~100 t/s |
| qwen2.5:32b | ~61 t/s |

MacBook Pro M3 Max — 48–96GB Unified Memory

# 48GB
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:14b"
NEXUS_LOGIC_MODEL="qwen2.5:32b"

# 96GB (workstation territory)
NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:32b"
NEXUS_LOGIC_MODEL="llama3.3:70b"

Dual GPU — 2× RTX 3090/4090 (48GB total)

Ollama spreads a model across all visible NVIDIA GPUs when it does not fit on one; set CUDA_VISIBLE_DEVICES to control which GPUs it can use. With 48 GB of combined VRAM, 70B models fit.

NEXUS_SUPERVISOR_MODEL="qwen2.5-coder:32b"
NEXUS_LOGIC_MODEL="llama3.3:70b"

VRAM rules of thumb

  • Q4_K_M VRAM ≈ parameters (B) × 0.57 GB + 0.5 GB overhead + KV cache
  • KV cache at 8k context adds ~1–2 GB for 7–8B models, ~3–4 GB for 32B models
  • If a model spills to RAM, expect 5–20× slower inference. Always fit the model fully in VRAM.
  • Apple Silicon shares memory between OS and GPU — budget ~4 GB less than total for model headroom
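
The rules above translate directly into a small estimator. The following is a minimal sketch that assumes the Q4_K_M formula and the ~4 GB Apple Silicon reserve exactly as stated in the list; the function names are hypothetical, and the KV cache values are midpoints of the ranges given for 8k context.

```python
def estimate_vram_gb(params_b, kv_cache_gb=0.0):
    """Q4_K_M estimate: parameters (billions) x 0.57 + 0.5 GB overhead + KV cache."""
    return params_b * 0.57 + 0.5 + kv_cache_gb


def fits(params_b, memory_gb, kv_cache_gb=0.0, unified=False):
    """True if the estimate fits; on unified memory, reserve ~4 GB for the OS."""
    budget = memory_gb - 4.0 if unified else memory_gb
    return estimate_vram_gb(params_b, kv_cache_gb) <= budget


print(round(estimate_vram_gb(8, kv_cache_gb=1.5), 2))  # 8B at 8k context -> 6.56
print(fits(32, 24, kv_cache_gb=3.5))                   # 32B on a 24 GB card -> True
print(fits(8, 8, kv_cache_gb=1.5, unified=True))       # 8B on an 8 GB Mac -> False
```

The results line up with the presets: a 32B logic model is recommended for 24 GB cards, while the 8 GB MacBook preset stays with 3B models.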

Quick reference

| Hardware | VRAM (UM = unified memory) | Supervisor | Logic | Speed |
| --- | --- | --- | --- | --- |
| RTX 3050 Mobile | 4 GB | qwen2.5-coder:1.5b | llama3.2:3b | 70–120 t/s |
| M3 Base | 8 GB UM | qwen2.5-coder:3b | llama3.2:3b | 45–50 t/s |
| RTX 3060 / 4060 | 8 GB | qwen2.5-coder:7b | llama3.1:8b | 40–60 t/s |
| M3 Pro 18GB | 18 GB UM | qwen2.5-coder:7b | llama3.1:8b | 25–30 t/s |
| RTX 4060 Ti 16GB | 16 GB | qwen2.5-coder:7b | qwen2.5:14b | 35–55 t/s |
| RTX 3090 | 24 GB | qwen2.5-coder:14b | qwen2.5:32b | 20–50 t/s |
| RTX 4090 | 24 GB | qwen2.5-coder:14b | qwen2.5:32b | 30–70 t/s |
| M3 Max 48GB | 48 GB UM | qwen2.5-coder:14b | qwen2.5:32b | 18–35 t/s |
| RTX 5090 | 32 GB | qwen2.5-coder:14b | qwen2.5:32b | 61–100 t/s |
| 2× RTX 3090/4090 | 48 GB | qwen2.5-coder:32b | llama3.3:70b | 15–60 t/s |

Verifying your setup

# 1. Pull your chosen models
ollama pull <supervisor-model>
ollama pull <logic-model>

# 2. Test the MCP server health
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}
{"jsonrpc":"2.0","method":"notifications/initialized"}
{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"ollama_health","arguments":{}}}' | \
  node ~/.config/nexus/tools/mcp/server.mjs 2>/dev/null | tail -1 | python3 -m json.tool

The health check lists your available models. If a model isn’t pulled, generate calls will fail with an Ollama error (not a CIRCUIT_BREAKER).
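
If you want to confirm that both configured models are pulled before running anything, you can also ask Ollama directly. The sketch below is an assumption-laden alternative to the MCP health tool, not part of NEXUS: it uses Ollama's `GET /api/tags` endpoint, reads the `OLLAMA_HOST_URL` variable from the configuration section, and the helper names `missing_models` and `fetch_tags` are hypothetical. Names must match exactly, including the tag.

```python
import json
import os
import urllib.request


def missing_models(tags_response, wanted):
    """Return the wanted model names absent from an Ollama /api/tags response."""
    pulled = {m["name"] for m in tags_response.get("models", [])}
    return [name for name in wanted if name not in pulled]


def fetch_tags(host=None):
    """Fetch the list of local models from Ollama's GET /api/tags endpoint."""
    host = host or os.environ.get("OLLAMA_HOST_URL", "http://localhost:11434")
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline example with a hand-written response in the /api/tags shape.
sample = {"models": [{"name": "qwen2.5-coder:1.5b"}, {"name": "llama3.2:3b"}]}
print(missing_models(sample, ["qwen2.5-coder:1.5b", "llama3.1:8b"]))  # ['llama3.1:8b']
```

Against a running Ollama instance, `missing_models(fetch_tags(), [...])` returns an empty list when every model you configured is available locally.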