Fine-Tuning & Optimization
Training models on your data, caching, latency reduction — only worth it when you have the usage to justify it.
What Harden is about
Your product works, people are using it, and now you need it to work better, faster, cheaper, and safer. This stage is about production-grade concerns — the stuff that separates a side project from a real product. None of this matters until someone is depending on what you built.
- Fine-Tuning & Optimization — training models on your data, caching, latency reduction. Only worth it when you have the usage to justify it.
- Security — prompt injection, data leaks, and the attacks that are unique to AI products. What to harden before it bites you.
- Local & Edge AI — running models on-device for privacy, speed, or cost. When it makes sense and when it doesn't.
- Connecting Your AI — integrations, webhooks, and making your AI product talk to the rest of the world.
The most important thing here: resist the urge to harden things prematurely. Optimization without users is just procrastination with better metrics. Come here when your users' experience demands it.
Fine-tuning means taking a base model and training it on your specific data so it performs better on your specific task. It is not the first solution. It is usually the third or fourth.
The decision framework
Before you fine-tune, ask what the actual problem is:
- Is the issue knowledge? The model doesn't know your data. Use RAG, not fine-tuning. RAG is faster to implement, easier to update, and provides citations.
- Is the issue behavior? The model doesn't follow your style, tone, or format consistently. Try better prompting first. If prompts are getting too long and expensive, or behavior is still inconsistent after serious prompt work, then fine-tune.
- Is the issue capability? The model simply can't do what you need. Try a better base model before you try fine-tuning a weaker one.
What fine-tuning actually looks like
You don't need to understand the math. You need to understand the process:
- Prepare data. 100 high-quality examples of ideal input-output pairs. Quality matters more than quantity. Use real production data where possible, cleaned and anonymized.
- Choose your method. LoRA and QLoRA are parameter-efficient approaches that update only a small subset of the model's weights. They require a fraction of the compute of full fine-tuning and work on consumer hardware.
- Train. Most hosted providers (OpenAI, Anthropic) make this a simple API call. Self-hosted gives you more control but requires ML infrastructure.
- Evaluate. Run your eval set (from Knowing If It Works) against the fine-tuned model. Compare to the base model. Check for regression on general capabilities.
- Deploy. Serve via API or self-hosted. Monitor performance in production.
Prompt caching: the optimization most builders miss
Before you fine-tune to reduce prompt length, know that provider-level prompt caching can slash costs dramatically:
- Anthropic: Mark static content for caching and get a 90% discount on cached tokens on subsequent calls. Your 5,000-token system prompt gets processed once, then reused.
- OpenAI: Automatic caching for identical prefixes over 1,024 tokens. 50% discount, no code changes needed.
Structure your prompts with static content first (system prompt, examples) and variable content last (user query). This maximizes cache hits. For a customer support bot making 10,000 calls a day with a 2,000-token system prompt, caching alone can cut costs from $50/day to $7/day.
Response caching
If many users ask similar questions, cache the responses. Exact-match caching is simple: hash the input, store the output. Semantic caching is more powerful: embed the query, find similar cached queries, return the cached response if similarity is above a threshold. A 50% cache hit rate on top of prompt caching can reduce your effective API costs by 90%+.