Playbook/Stage 05

Harden

Make it bulletproof

Local & Edge AI

Running models on-device for privacy, speed, or cost. When it makes sense and when it doesn't.

Builder's checkSometimes the best API call is the one you never make. Running models locally turns privacy from a liability into a selling point, and for some products that's the whole pitch. This is a positioning decision as much as a technical one. If "your data never leaves your device" is something your users would pay for, the engineering tradeoffs are worth it. If it's not, don't carry the complexity for a benefit nobody asked for.

Cloud APIs are convenient, but local inference offers unique advantages: data never leaves your device, no network latency, no per-token fees after hardware, works offline, and no dependence on a provider's pricing or strategic decisions.

The practical stack

Ollama is the easiest way to start. Install it, run ollama run llama3.1, and you have a local model with an OpenAI-compatible API. Point your existing code at localhost:11434 instead of the cloud, and most things just work.

For production serving, vLLM handles batching, caching, and concurrent requests efficiently. For maximum performance on consumer hardware, llama.cpp squeezes the most out of available resources.

Model selection for local

ModelSizesStrengths
Llama 3.18B, 70B, 405BBest overall open model
Qwen 2.57B–72BMultilingual, strong at code
Mistral/Mixtral7B, 8x7BFast, efficient
Phi-33.8B, 14BTiny but surprisingly capable

Quantization: fitting big models on small hardware

Quantization reduces the precision of model weights, dramatically cutting memory requirements with minimal quality loss:

  • Q8: ~99% quality, half the memory of full precision
  • Q5: ~97% quality, about 60% of full precision memory
  • Q4: ~95% quality, half of full precision memory. This is the sweet spot for most local deployments.

A 7B model at Q4 fits in 4–6 GB of VRAM. That runs on an M1 MacBook or an RTX 3060. A 70B model at Q4 needs 40+ GB, which means an M2 Ultra or two high-end GPUs.

Hybrid architecture

The most practical approach: use local models for simple, high-volume tasks (classification, summarization, embedding) and cloud models for complex reasoning where quality matters most. Since most local servers expose OpenAI-compatible APIs, switching between local and cloud is often just changing the base URL and model name.

When local makes sense

  • Privacy requirements. Healthcare, finance, legal. Data that cannot leave the premises.
  • High volume. At 10M+ tokens per day, local breaks even on hardware costs within a year.
  • Offline use. Field workers, aircraft, anywhere without reliable internet.
  • Predictable costs. No surprise API bills. Hardware is a fixed cost.