Local & Edge AI
Running models on-device for privacy, speed, or cost. When it makes sense and when it doesn't.
Cloud APIs are convenient, but local inference offers unique advantages: data never leaves your device, no network latency, no per-token fees after hardware, works offline, and no dependence on a provider's pricing or strategic decisions.
The practical stack
Ollama is the easiest way to start. Install it, run ollama run llama3.1, and you have a local model with an OpenAI-compatible API. Point your existing code at localhost:11434 instead of the cloud, and most things just work.
For production serving, vLLM handles batching, caching, and concurrent requests efficiently. For maximum performance on consumer hardware, llama.cpp squeezes the most out of available resources.
Model selection for local
| Model | Sizes | Strengths |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | Best overall open model |
| Qwen 2.5 | 7B–72B | Multilingual, strong at code |
| Mistral/Mixtral | 7B, 8x7B | Fast, efficient |
| Phi-3 | 3.8B, 14B | Tiny but surprisingly capable |
Quantization: fitting big models on small hardware
Quantization reduces the precision of model weights, dramatically cutting memory requirements with minimal quality loss:
- Q8: ~99% quality, half the memory of full precision
- Q5: ~97% quality, about 60% of full precision memory
- Q4: ~95% quality, half of full precision memory. This is the sweet spot for most local deployments.
A 7B model at Q4 fits in 4–6 GB of VRAM. That runs on an M1 MacBook or an RTX 3060. A 70B model at Q4 needs 40+ GB, which means an M2 Ultra or two high-end GPUs.
Hybrid architecture
The most practical approach: use local models for simple, high-volume tasks (classification, summarization, embedding) and cloud models for complex reasoning where quality matters most. Since most local servers expose OpenAI-compatible APIs, switching between local and cloud is often just changing the base URL and model name.
When local makes sense
- Privacy requirements. Healthcare, finance, legal. Data that cannot leave the premises.
- High volume. At 10M+ tokens per day, local breaks even on hardware costs within a year.
- Offline use. Field workers, aircraft, anywhere without reliable internet.
- Predictable costs. No surprise API bills. Hardware is a fixed cost.