The AI Product Playbook
How to Build Products Where AI Is the Value — Not Just the Tool
Part 1: The Decision
Chapter 1: AI as Tool vs. AI as Product
You built something with Cursor or Lovable or Claude. It works. People like the demo. Now you're asking the question every AI-era builder eventually asks: should AI be the thing I sell, or just the thing I used to build faster?
Most builders get this wrong by default. They used AI to ship in a weekend, felt the power of it, and assumed the product should therefore be AI. That's like saying your product should be React because you built it in React. The tool you used to build is not automatically the thing your customer is paying for.
So let's get clear on the distinction.
The Three Roles AI Plays in a Product
AI as Tool. You used Cursor, Claude Code, or ChatGPT to write your code, generate your copy, or scaffold your app. Your user never sees AI. They don't know it exists. They're paying for the outcome — a scheduling app, a CRM, a marketplace. The AI was your power tool, not theirs.
AI as Feature. Your product does something bigger, and AI handles one piece of it. Think smart search in an e-commerce app, auto-categorization in a bookkeeping tool, or suggested replies in a helpdesk. Remove the AI and the product still exists — it's just less good.
AI as Core. Remove the AI and there is no product. The AI isn't enhancing something else; it is the thing. A legal document analyzer that reads contracts and flags risk. A personalized tutoring system that adapts to how a student learns. A meal planning engine that builds grocery lists from dietary restrictions and what's on sale this week. Kill the model, kill the product.
The Two Questions That Matter
Before you decide which category you're in, answer two questions honestly:
1. Is the AI doing something the user genuinely cannot do themselves?
If your AI summarizes articles, the user can do that — they just don't want to. That's a convenience play. It works, but it's vulnerable. If your AI analyzes an MRI scan and flags anomalies a radiologist might miss, that's capability the user literally doesn't have. The defensibility is different.
2. Does the output get better every time they use it?
A product that learns your preferences, builds on your history, or improves its accuracy based on your corrections has compounding value. A product that gives the same quality output to every user on day one as on day three hundred is a utility. Utilities get commoditized.
If you answered "no" to both, AI should probably be your tool, not your product. And that's genuinely fine — most successful software businesses don't sell AI. They sell outcomes that AI helped them build cheaply and quickly.
If you answered "yes" to one, AI is probably a feature. Build the bigger product, use AI to make one part of it dramatically better.
If you answered "yes" to both, you might have an AI-core product. Keep reading.
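The decision rule above is small enough to write down. A minimal sketch in Python; the function name and the category labels are illustrative choices, not part of any library:

```python
def classify_ai_role(user_cannot_do_it: bool, improves_with_use: bool) -> str:
    """Map the two yes/no answers onto the chapter's three categories."""
    yes_count = int(user_cannot_do_it) + int(improves_with_use)
    # 0 yes -> tool, 1 yes -> feature, 2 yes -> possible AI-core product
    return ("tool", "feature", "core candidate")[yes_count]


# Example: the output improves with use, but the user could do the task alone.
print(classify_ai_role(user_cannot_do_it=False, improves_with_use=True))
```

The value is in answering the two questions honestly, not in the function. Writing it down just makes it harder to talk yourself into "core" with one shaky yes.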
When to Stay a Tool Business
Here's a take that won't be popular with the AI hype crowd: most of you should stay in the "AI as tool" category.
Why? Because AI-core products have a cost structure that tool businesses don't. Every API call costs money. Every user interaction has a marginal cost. Traditional SaaS has near-zero marginal cost per user — AI SaaS does not. If you're a solo builder charging $29/month and your AI costs you $8/user/month in API calls, your margins are already worse than most bootstrapped SaaS founders would accept.
Tool businesses also don't have the reliability problem. When AI is your tool, a hallucination is your problem during development. When AI is your product, a hallucination is your customer's problem in production. That's a fundamentally different risk profile.
Stay a tool business if:
- Your product's value doesn't depend on AI being right every time
- You can't articulate what the AI does that a well-designed form or workflow couldn't
- Your per-user AI costs would eat more than 30% of your revenue
- You don't have a clear path to the output getting better with use
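To check the 30% rule against your own numbers, a rough sketch like the following may help. It reuses the $29/month and $8/user/month figures from earlier in this chapter; the function and constant names are made up for illustration:

```python
def ai_cost_share(price_per_month: float, ai_cost_per_user: float) -> float:
    """Fraction of subscription revenue consumed by AI API costs."""
    if price_per_month <= 0:
        raise ValueError("price_per_month must be positive")
    return ai_cost_per_user / price_per_month


MAX_SHARE = 0.30  # the chapter's rule-of-thumb ceiling

share = ai_cost_share(price_per_month=29.0, ai_cost_per_user=8.0)
print(f"AI cost share: {share:.1%}, over the ceiling: {share > MAX_SHARE}")
```

In that example the AI bill takes a little over a quarter of revenue: under the ceiling, but close enough that a small rise in usage per user pushes it over.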
Template: AI Product Decision Canvas
At the end of this chapter, use this one-page canvas to classify your idea. It's a single sheet with four quadrants:
- Top left — "What does the AI actually do?" List the specific AI-powered actions in your product. Be concrete. Not "uses AI to help users" but "generates a personalized 7-day workout plan based on injury history, equipment available, and stated goals."
- Top right — "Can the user do this without AI?" For each action, mark it: Impossible without AI / Possible but painful / Easy without AI. Be honest.
- Bottom left — "Does it improve with use?" For each action, describe what data accumulates and how it makes the next output better. If you can't describe the mechanism, the answer is "no."
- Bottom right — "Classification." Based on your answers: Tool (AI helped you build), Feature (AI enhances one piece), or Core (AI is the product). Write it down. Commit to it. Build accordingly.
Chapter 2: The Value Proposition Shift
If you decided AI should be your core or a major feature, your pitch to customers just changed. Not a little — fundamentally.
Traditional software value propositions hit the same three notes: faster, cheaper, more convenient. Your project management tool saves 5 hours a week. Your accounting software costs less than a bookkeeper. Your scheduling app means fewer back-and-forth emails.
Those pitches work. They've sold billions of dollars of software. But they're the wrong frame for AI products. Here's why.
AI-Native Value Props
When AI is the core of what you're selling, you have access to three value propositions that traditional software can't claim:
1. Impossible without AI. The thing your product does literally could not be done before — or could only be done by an expensive expert. A tool that reads 200 pages of city zoning code and tells a homeowner whether their renovation needs a permit isn't "faster" — it's something that previously required a $400/hour land use attorney. That's not a convenience improvement. That's access to capability that didn't exist at this price point.
2. Gets smarter with use. Every interaction makes the product better for that specific user. A writing tool that learns your brand voice after 50 documents isn't just autocomplete — it's an asset that appreciates. The switching cost isn't the subscription price; it's the accumulated intelligence. This is the most defensible position in AI products, and the hardest to build.
3. Personalized at scale. One product, a thousand different experiences. A financial advisor can give personalized advice to 30 clients. An AI financial advisor can give personalized advice to 30,000 — each tailored to their income, risk tolerance, debt profile, and goals. The unit economics of personalization change completely when AI is involved.
The "So What" Test
Here's the test I run on every AI product pitch, including my own:
If you replaced your AI with a human expert doing the exact same thing, would the product still make sense?
If the answer is yes, meaning a human financial planner could do what your AI does, just slower and at a higher price, then your AI is a cost play. You're competing on price against human labor. That can work. TurboTax is essentially "what if your accountant was software?" and it anchors a multibillion-dollar business. But know that's what you're building: cheaper access to existing expertise.
If the answer is no — if no human could realistically process 10,000 customer support tickets in real time and route them based on sentiment, urgency, and customer lifetime value — then you have something structurally different. That's where AI products get interesting.
Neither is wrong. But they require different pitches, different pricing, and different expectations from customers.
Positioning Against Non-AI Alternatives
Your real competition usually isn't another AI product. It's the spreadsheet. The consultant. The manual process. The intern doing it by hand.
This matters because your customer isn't choosing between your AI and a competitor's AI. They're choosing between your AI and doing nothing differently. The status quo is your enemy.
Position against non-AI alternatives by quantifying the gap:
- Time: "Your team spends 12 hours/week categorizing support tickets. This does it in 12 seconds."
- Cost: "You're paying a contractor $3,000/month to write product descriptions. This costs $49/month."
- Scale: "Your best sales rep can research 10 prospects per day. This researches 500."
Concrete numbers beat abstract promises. "AI-powered" means nothing. "$2,951 less per month" means everything.
Positioning Against Other AI Products
Now the harder one. When your prospect is comparing AI products, "we use GPT-4 too" is worth exactly zero as a differentiator. Everyone has access to the same foundation models. The model is the commodity. Your product is everything else.
What actually differentiates AI products:
- Domain expertise baked into prompts. Anyone can call the Claude API. Not anyone has spent 200 hours refining prompts for commercial real estate underwriting with eval sets that cover 47 edge cases.
- Data flywheel. If your users' data makes your product better for all users (with proper anonymization and consent), you have something that a new competitor can't replicate on day one.
- Workflow integration. If you're embedded in how someone works — plugged into their CRM, pulling from their email, pushing to their calendar — you're not a tool they visit. You're infrastructure they depend on.
- Trust and track record. In domains where being wrong is expensive (legal, medical, financial), six months of proven accuracy is a moat. Not a permanent one, but a real one.
Template: AI Value Proposition Canvas
This canvas maps four elements onto a single page. Fill it out for your product:
- Problem (top): What specific problem does your user have? Not "they need AI" — the actual pain. "Commercial real estate brokers spend 6 hours per deal manually pulling comparable sales data from three different county databases."
- AI Capability (left): What does your AI specifically do to address this? "Pulls, normalizes, and cross-references comparable sales data from public records across 200+ counties in under 30 seconds."
- User Outcome (right): What does the user's life look like after? Not what the product does — what the user gets. "Brokers close deals 4 days faster and never miss a relevant comp again."
- Defensibility (bottom): Why can't someone else do this tomorrow? Be specific. "Our county data normalization layer took 8 months to build and covers formatting quirks in 200+ jurisdictions. A competitor starting today faces that same 8 months."
If you can't fill in all four elements with specific, concrete answers, your value proposition isn't ready. Go back to Chapter 1.
Part 2: Building It Right
Chapter 3: Structured Outputs Beat Magic
You've seen the demo. Someone types a vague question into a chatbox, the AI spits out a paragraph of helpful-sounding text, and the audience claps. Looks like magic.
Now ship that to real users and watch what happens.
The Magic Demo Problem
Freeform AI demos are seductive because they showcase the model's range. Ask it anything! Watch it respond! But range is the opposite of what you want in a product. Products are supposed to do a specific thing reliably. "Ask it anything" means "the output could be anything," which means you can't guarantee quality, you can't test systematically, and you can't build a consistent user experience around it.
I've watched this kill products. A founder builds a chatbot-style AI feature, demos it to investors, gets funded, then spends six months trying to make the free-text output consistent enough to actually rely on. The demo was impressive precisely because it was unconstrained. The product fails for the same reason.
Here's the pattern: impressive demo, unreliable product, frustrated users, pivot or death.
Why Structured Outputs Win
Structured outputs mean your AI returns data in a predictable format — JSON objects, filled-in schemas, selections from a defined list, scores on a rubric. Instead of asking the AI to "write a product description," you ask it to return:
{
  "headline": "string, max 60 chars",
  "description": "string, max 200 chars",
  "key_features": ["string", "string", "string"],
  "tone": "one of: professional, casual, playful",
  "confidence": "number, 0.0-1.0"
}
This changes everything about your product:
You can test it. When output has a schema, you can write automated checks. Is the headline under 60 characters? Are there exactly three key features? Is the confidence score a valid number? You can run a thousand test cases overnight and know exactly where quality breaks down.
You can design around it. Your UI isn't a chat window hoping for the best. It's a layout that knows it's getting a headline, a description, and three bullet points. Your designers can make that look great. Your frontend can render it predictably.
You can improve it. When a structured output fails, you know which field failed. The headline was too generic. The confidence score was high but the output was wrong. You can fix specific failure modes instead of trying to make "the AI" generically better.
You can explain it. Users trust a product that says "Based on your input, I identified these 3 key features with 87% confidence" more than one that dumps a paragraph and hopes the user trusts it.
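Here is roughly what "you can test it" looks like for the product description schema above. This is a sketch using only the Python standard library; the function name and the sample data are invented for illustration, and a real project might reach for a schema validation library instead:

```python
import json

ALLOWED_TONES = {"professional", "casual", "playful"}


def validate_product_copy(raw: str) -> list:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    headline = data.get("headline")
    if not isinstance(headline, str) or len(headline) > 60:
        problems.append("headline: need a string of at most 60 chars")
    description = data.get("description")
    if not isinstance(description, str) or len(description) > 200:
        problems.append("description: need a string of at most 200 chars")
    features = data.get("key_features")
    if (not isinstance(features, list) or len(features) != 3
            or not all(isinstance(f, str) for f in features)):
        problems.append("key_features: need exactly three strings")
    tone = data.get("tone")
    if not isinstance(tone, str) or tone not in ALLOWED_TONES:
        problems.append("tone: not one of the allowed values")
    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence: need a number between 0.0 and 1.0")
    return problems


good = json.dumps({
    "headline": "Trail shoes that grip wet rock",
    "description": "Lightweight trail runners with a sticky rubber outsole.",
    "key_features": ["Sticky outsole", "Quick-dry mesh", "Rock plate"],
    "tone": "casual",
    "confidence": 0.87,
})
print(validate_product_copy(good))  # an empty list: nothing to fix
```

Because each problem names the field that failed, the same function doubles as the "you can improve it" tool: count the problems by field across a thousand runs and you know where to spend your prompt work.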
The Reliability Spectrum
Think of AI product reliability as a spectrum:
- Level 1 — Sometimes helpful. Free-text AI output that's good when it works and garbage when it doesn't. Users can't predict which they'll get. This is where most AI products launch. Most die here too.
- Level 2 — Usually right. Structured outputs with basic validation. Output format is consistent. Content quality varies but failures are catchable. Users start to develop trust.
- Level 3 — Reliably useful. Structured outputs with eval-tested quality, confidence scores, and graceful failure modes. Users rely on it for real work. This is where paying customers live.
- Level 4 — Bet-your-business. Multi-model validation, human-in-the-loop for high-stakes outputs, audit trails, accuracy tracking over time. Enterprise customers require this.
Most solo builders should aim for Level 2 at launch and Level 3 within three months. Level 4 is for when you're charging enterprise prices.
How to Design AI Features That Feel Reliable
Three principles that will save you months of headaches:
1. Constrain the output space. The narrower the range of valid outputs, the more reliable the feature. Don't ask the AI "what should this user do?" Ask it to select from five predefined actions and provide a confidence score for each. Don't ask it to "write a business plan." Ask it to fill in a structured canvas with specific fields and character limits.
Every constraint you add is a guardrail against failure. And constraints aren't limiting — they're designing. A sonnet has constraints. That's what makes it a sonnet.
2. Show confidence, not certainty. Never present AI output as ground truth. Present it as a recommendation with a confidence indicator. "3 matches found — top match 92% confidence" tells the user something real. "Here's your answer!" tells them nothing about when to trust it and when to double-check.
Users are smart. They don't need you to pretend the AI is perfect. They need you to tell them when it's more or less sure. A weather app that says "70% chance of rain" is more useful than one that says either "it will rain" or "it won't rain." Your AI product works the same way.
3. Give users an escape hatch. Every AI-generated output should be editable. Every AI-made decision should be overridable. Not because your AI is bad — because your users need to feel in control. The moment a user feels trapped by an AI decision they disagree with, you've lost them.
The escape hatch also generates your best training data. When a user edits an AI output, they're telling you exactly how the output should have looked. That's a free eval case. Log it, learn from it, improve.
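The three principles fit together in a few lines. The sketch below assumes the model has already returned a confidence score for each of five predefined actions. The action names, the 0.6 threshold, and both function names are hypothetical placeholders:

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class NextAction(Enum):
    # Hypothetical action list; replace with the five your product supports.
    SEND_REMINDER = "send_reminder"
    OFFER_DISCOUNT = "offer_discount"
    SCHEDULE_CALL = "schedule_call"
    ESCALATE_TO_HUMAN = "escalate_to_human"
    DO_NOTHING = "do_nothing"


def pick_action(scores: Dict[str, float],
                min_confidence: float = 0.6) -> Optional[Tuple[NextAction, float]]:
    """Return (action, confidence), or None when the user should decide."""
    allowed = {action.value for action in NextAction}
    if set(scores) != allowed:
        raise ValueError("model must score exactly the predefined actions")
    if any(not 0.0 <= s <= 1.0 for s in scores.values()):
        raise ValueError("confidence scores must be between 0.0 and 1.0")
    best_name = max(scores, key=scores.get)
    best_score = scores[best_name]
    if best_score < min_confidence:
        return None  # low confidence: show the options and let the user choose
    return NextAction(best_name), best_score


def apply_override(suggested: Optional[NextAction],
                   user_choice: Optional[NextAction]) -> Optional[NextAction]:
    """Escape hatch: an explicit user choice always wins over the suggestion."""
    return user_choice if user_choice is not None else suggested


scores = {"send_reminder": 0.82, "offer_discount": 0.40, "schedule_call": 0.31,
          "escalate_to_human": 0.10, "do_nothing": 0.05}
print(pick_action(scores))
```

Anything outside the five actions is rejected before it reaches the UI, the confidence travels with the suggestion so you can display it, and the override function is where you would log the disagreement as a future eval case.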
A Real Example: Structured Discovery Scripts
Here's how this plays out in practice. Builder Companion has a feature that generates customer discovery scripts — the questions you should ask potential users to validate your idea.
The bad version would be a chatbot: "Tell me about your product idea and I'll suggest some discovery questions." Sometimes you get great questions, sometimes you get generic ones, and you can never predict which.
The actual version uses structured outputs. The user fills in three fields: who their customer is, what problem they're solving, and what they've built so far. The AI returns a structured object:
- 5 questions, each with a category tag (pain, behavior, willingness-to-pay, alternatives, urgency)
- Follow-up prompts for each question (what to ask when they say yes vs. no)
- Red flags to listen for (signals that the problem isn't real)
- A confidence note on which questions are most likely to surface useful information
Every output has the same shape. The UI renders it the same way every time. We can test whether questions are actually probing the right categories. And users know exactly what they're getting.
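To make "every output has the same shape" concrete, here is one way that shape could be expressed and checked. This is an illustrative sketch only, not Builder Companion's actual schema; the class names, field names, and sample text are all assumptions:

```python
from dataclasses import dataclass
from typing import List

CATEGORIES = ("pain", "behavior", "willingness-to-pay", "alternatives", "urgency")


@dataclass
class DiscoveryQuestion:
    text: str
    category: str
    follow_up_if_yes: str
    follow_up_if_no: str


@dataclass
class DiscoveryScript:
    questions: List[DiscoveryQuestion]
    red_flags: List[str]
    confidence_note: str


def check_script(script: DiscoveryScript) -> List[str]:
    """Return shape problems; an empty list means the script is renderable."""
    problems = []
    if len(script.questions) != 5:
        problems.append(f"expected 5 questions, got {len(script.questions)}")
    for number, question in enumerate(script.questions, start=1):
        if question.category not in CATEGORIES:
            problems.append(f"question {number}: unknown category {question.category!r}")
        if not (question.follow_up_if_yes.strip() and question.follow_up_if_no.strip()):
            problems.append(f"question {number}: missing a follow-up prompt")
    if not script.red_flags:
        problems.append("no red flags listed")
    if not script.confidence_note.strip():
        problems.append("confidence note is empty")
    return problems


sample = DiscoveryScript(
    questions=[
        DiscoveryQuestion(f"Placeholder question about {c}", c,
                          "Ask for a recent example.", "Ask what they do instead.")
        for c in CATEGORIES
    ],
    red_flags=["They have never tried to solve the problem"],
    confidence_note="The pain and alternatives questions tend to reveal the most.",
)
print(check_script(sample))  # an empty list: safe to render
```

A check like this only covers shape. Whether the questions are any good is a separate question, and that one belongs to your evals.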
That's not less powerful than a chatbot. It's more powerful — because it's reliable enough to actually use.
Template: AI Feature Spec
Use this template to define any AI-powered feature before you build it. One page per feature.
- Feature name: What do you call it? Keep it concrete. "Discovery Script Generator," not "AI Assistant."
- Input: What does the user provide? List every field, its type, and its constraints. "Customer segment: free text, 10-200 characters. Problem statement: free text, 20-500 characters. Current solution: dropdown, one of [nothing, spreadsheet, manual process, competitor product, other]."
- Output schema: What does the AI return? Define every field, type, constraint, and valid range. Be as specific as your database schema. If you can't define the output schema, you're building a chatbot, not a feature.
- Quality bar: What does "good" look like? Write 3-5 examples of ideal outputs for real inputs. These become your first eval cases.
- Failure modes: What can go wrong? List them. "AI generates generic questions not specific to the user's domain. AI suggests illegal interview questions. AI returns fewer than 5 questions. Confidence scores don't correlate with actual quality." For each, describe how the product handles it — fallback behavior, error messages, human escalation.
- Cost estimate: How many API calls per use? What's the token count? At current API pricing, what does one use of this feature cost you? Multiply by projected monthly usage per user. That's your per-user cost for this feature.
If you can fill this out completely, you're ready to build. If you can't, you're not.
Chapter 4: The Eval Problem
How to Know If Your AI Product Is Actually Good
Here's what happens to every AI product built on vibes: you ship it, a few users try it, someone gets a wildly wrong output, they screenshot it, and now that screenshot is your brand. You never saw it coming because you never systematically tested for it.
Vibes-based testing — where you type a few prompts, eyeball the results, and say "yeah, that looks right" — is how most builders evaluate their AI features. It feels productive. It is not. You're testing the happy path with your own mental model of how the product should work. Your users will do things you never imagined, with inputs you never considered, and your AI will confidently produce garbage.
The fix isn't complicated. It's just disciplined.
The Minimum Viable Eval: 20 Test Cases
Before you do anything else, write 20 test cases by hand. Not 200. Not 2,000. Twenty.
Each test case has three parts:
- Input: The exact prompt, query, or data the user would provide
- Expected output: What a good response looks like (doesn't have to be word-for-word — describe the qualities)
- Pass/fail criteria: Specific, binary conditions. "Mentions at least two relevant factors." "Does not hallucinate a statistic." "Stays under 200 words."
Where do you get these 20 cases? From real scenarios:
- 5 common cases: The bread-and-butter queries your product will handle every day
- 5 edge cases: Unusual inputs, very short or very long, ambiguous requests
- 5 adversarial cases: Inputs designed to break things — off-topic questions, prompt injection attempts, contradictory instructions
- 5 failure cases: Situations where the AI genuinely shouldn't know the answer, and you want it to say so
Run every test case. Score each one. Write down the results. You now have something most AI products never get: a baseline.
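A baseline run can be a very small script. In the sketch below, a test case carries its input, a prose description of the expected output, and a set of binary checks. `fake_model` is a stand-in so the example runs offline; you would replace it with your own model call. All names and sample cases are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    case_id: str
    kind: str            # "common", "edge", "adversarial", or "failure"
    input_text: str
    expected: str        # prose description of a good response
    checks: Dict[str, Callable[[str], bool]]  # binary pass/fail criteria


def run_baseline(cases: List[EvalCase],
                 generate: Callable[[str], str]) -> Dict[str, bool]:
    """Run every case once; a case passes only if all of its checks pass."""
    results = {}
    for case in cases:
        output = generate(case.input_text)
        results[case.case_id] = all(check(output) for check in case.checks.values())
    return results


def fake_model(prompt: str) -> str:
    # Stand-in so the sketch runs offline; swap in your real model call.
    if "tomorrow's stock price" in prompt:
        return "I can't predict that."
    return "Price on value: look at competitor pricing and willingness to pay."


cases = [
    EvalCase("C01", "common", "How should I price my product?",
             "Mentions at least two relevant factors",
             {"two_factors": lambda out: "competitor" in out and "willingness" in out,
              "under_200_words": lambda out: len(out.split()) < 200}),
    EvalCase("F01", "failure", "What is tomorrow's stock price for ACME?",
             "Declines instead of guessing",
             {"declines": lambda out: "can't" in out.lower()}),
]
results = run_baseline(cases, fake_model)
pass_rate = sum(results.values()) / len(results)
print(results, f"baseline pass rate: {pass_rate:.0%}")
```

Keyword checks like these are crude, and that is fine for a first baseline. The point is that the criteria are written down and binary, so the same 20 cases give comparable numbers next week.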
What You're Actually Evaluating
Most people think "eval" means "is the answer correct?" That's one dimension. There are four that matter:
Accuracy — Is the output factually correct? This is table stakes but surprisingly hard to measure for open-ended generation. For structured outputs (recommendations, classifications, extracted data), accuracy is straightforward. For freeform text, you need to define what "correct" means for your domain. A meal planning AI that suggests a recipe with an ingredient the user is allergic to isn't just inaccurate — it's dangerous.
Usefulness — Did the output actually help the user accomplish their goal? An answer can be technically correct and completely useless. If someone asks your AI "how should I price my product?" and it responds with a textbook definition of pricing strategy, that's accurate and worthless. Usefulness means the output moved the user closer to a decision or action.
Consistency — If I give the same input twice, do I get roughly similar quality? Not identical outputs — that's fine and expected with language models. But if the same question produces a brilliant answer at 2pm and nonsense at 3pm, users will never trust it. Test the same input 3-5 times and look at the variance.
Graceful failure — What happens when the AI doesn't know? This is the one most builders skip, and it's the one that matters most for trust. Your AI should have a recognizable way of saying "I'm not confident about this" or "this is outside what I can help with." If it confidently makes things up instead, you have a hallucination problem that will erode every ounce of trust you've built.
Human Eval vs. Automated Eval vs. LLM-as-Judge
You have three ways to score your test cases, and each has a sweet spot:
Human eval is the gold standard for subjective quality. You (or someone who knows the domain) reads the output and scores it. This is slow and expensive but irreplaceable for v1. If you're building a legal document reviewer, a lawyer needs to look at those outputs. There's no shortcut here.
Automated eval works when your outputs are structured. If your AI returns JSON with specific fields, you can write simple scripts that check: Did it return valid JSON? Are the required fields present? Is the sentiment classification one of the allowed values? Automated evals are fast, cheap, and run on every deploy. Use them for everything you can.
LLM-as-judge is the middle ground: you use a second AI model to evaluate the first one's output. This sounds circular, but it works surprisingly well for questions like "is this response helpful?" or "does this summary capture the key points?" The trick is giving your judge model a detailed rubric, not just asking "is this good?" As a rough rule of thumb, Claude or GPT-4 with a well-written scoring prompt can agree with human judges about 80-85% of the time, though the figure varies by task and rubric, so spot-check the judge against your own human scores. That's good enough for regression testing, which means catching when things get worse, even if it's not good enough to replace human eval entirely.
Use all three. Automated evals on every change. LLM-as-judge weekly. Human eval on your 20 core test cases monthly, or whenever you change your prompts significantly.
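For the LLM-as-judge piece, most of the work is the rubric and the parsing around it, not the model call. The sketch below keeps the model behind a plain `call_model` function that you would supply. The rubric wording, the two dimensions, and every name here are assumptions for illustration, not a provider API:

```python
import json
from typing import Callable, Dict

RUBRIC = (
    "Score the ANSWER to the QUESTION on two dimensions, each 1-3.\n"
    "accuracy: 1 = contains a factual error, 2 = minor imprecision, 3 = correct.\n"
    "usefulness: 1 = generic or off-topic, 2 = partly actionable, "
    "3 = moves the user toward a decision.\n"
    'Reply with JSON only, e.g. {"accuracy": 2, "usefulness": 3}.'
)


def build_judge_prompt(question: str, answer: str) -> str:
    return f"{RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"


def parse_judge_reply(reply: str) -> Dict[str, int]:
    """Validate the judge's reply; raise ValueError on anything off-rubric."""
    try:
        scores = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply was not valid JSON") from exc
    if not isinstance(scores, dict) or set(scores) != {"accuracy", "usefulness"}:
        raise ValueError("judge reply has the wrong fields")
    for name, value in scores.items():
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 3:
            raise ValueError(f"{name} must be an integer from 1 to 3")
    return scores


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> Dict[str, int]:
    return parse_judge_reply(call_model(build_judge_prompt(question, answer)))


# Offline stand-in for a real judge model, so the sketch runs as written.
fake_judge = lambda prompt: '{"accuracy": 3, "usefulness": 2}'
print(judge("How should I price my product?", "Charge more.", fake_judge))
```

Treat the judge like any other AI feature: it gets a structured output, validation, and its own spot checks against human scores.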
The Feedback Loop
Your eval set is a snapshot. Your users are a movie.
The most valuable eval data comes from how people actually use your product. Track these signals:
- Regeneration rate: How often do users click "try again"? High regeneration = the first output wasn't useful.
- Edit distance: If users can edit AI outputs, how much do they change? Heavy editing means your AI is a rough draft machine, not a finished product.
- Abandonment: Users who get an output and then don't take the next action (save it, share it, act on it) are telling you the output wasn't valuable.
- Thumbs up/down: Simple, but only useful if you actually read the downvoted outputs and understand why.
Every month, take your worst-performing real-world outputs and add them to your eval set. Your 20 test cases become 25, then 30, then 50. Each one represents a real failure your product had. This is how your eval set matures from "things I thought might go wrong" to "things that actually went wrong."
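Two of the signals above are easy to compute from logs you probably already have. This sketch uses `difflib` from the Python standard library as a cheap proxy for edit distance (it measures similarity, not a strict character-level edit count). The event names and function names are assumptions:

```python
from difflib import SequenceMatcher
from typing import List


def edit_fraction(ai_text: str, final_text: str) -> float:
    """0.0 = the user kept the output as-is, 1.0 = nothing in common."""
    matcher = SequenceMatcher(None, ai_text, final_text, autojunk=False)
    return 1.0 - matcher.ratio()


def regeneration_rate(events: List[str]) -> float:
    """Regenerations per initial generation (can exceed 1.0 for heavy retrying)."""
    generations = events.count("generate")
    if generations == 0:
        return 0.0
    return events.count("regenerate") / generations


log = ["generate", "regenerate", "generate", "generate", "regenerate"]
print(f"regeneration rate: {regeneration_rate(log):.2f}")
print(f"edit fraction: {edit_fraction('Ship it on Friday.', 'Ship it next Friday.'):.2f}")
```

Sort last month's outputs by edit fraction, read the top of the list, and you have your candidates for new eval cases.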
Template: Eval Rubric
The Eval Rubric template is a scoring sheet designed for your 20 initial test cases.
- Each row: a test case ID, the input text, the expected output description, the actual output from your AI, and four scored dimensions (Accuracy, Usefulness, Consistency, Graceful Failure).
- Scoring scale: each dimension is rated 1-3, where 1 means fail, 2 means acceptable, and 3 means good.
- Automatic fail: any test case scoring 1 on any dimension fails, regardless of its other scores.
- Bottom of the sheet: your overall pass rate (the percentage of test cases scoring 2+ on all dimensions) and your weakest dimension, so you know where to focus.
Start with this template, run it before every significant prompt change, and expand it with real failure cases monthly.
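If you keep the rubric in a spreadsheet export or a plain dictionary, the bottom-of-sheet numbers take a few lines to compute. A sketch under one assumption worth flagging: "weakest dimension" is taken here to mean the lowest average score, which is one reasonable reading. The scores are made-up sample data:

```python
from typing import Dict, Tuple

DIMENSIONS = ("accuracy", "usefulness", "consistency", "graceful_failure")


def case_passes(scores: Dict[str, int]) -> bool:
    """A case passes only if every dimension scores 2 or 3."""
    return all(scores[d] >= 2 for d in DIMENSIONS)


def summarize(sheet: Dict[str, Dict[str, int]]) -> Tuple[float, str]:
    """Return (overall pass rate, weakest dimension by mean score)."""
    pass_rate = sum(case_passes(s) for s in sheet.values()) / len(sheet)
    means = {d: sum(s[d] for s in sheet.values()) / len(sheet) for d in DIMENSIONS}
    weakest = min(DIMENSIONS, key=means.get)
    return pass_rate, weakest


sheet = {
    "T01": {"accuracy": 3, "usefulness": 3, "consistency": 2, "graceful_failure": 3},
    "T02": {"accuracy": 3, "usefulness": 1, "consistency": 2, "graceful_failure": 3},
    "T03": {"accuracy": 2, "usefulness": 2, "consistency": 2, "graceful_failure": 2},
    "T04": {"accuracy": 3, "usefulness": 2, "consistency": 3, "graceful_failure": 1},
}
pass_rate, weakest = summarize(sheet)
print(f"pass rate: {pass_rate:.0%}, weakest dimension: {weakest}")
```

In the sample, T02 and T04 fail automatically because each has a 1, even though their other scores are fine.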
Chapter 5: Choosing Your AI Architecture
What to Build, What to Call, What to Skip
Architecture decisions in AI products feel high-stakes because they're expensive to reverse. But most builders overcomplicate this. You have fewer real choices than you think, and for v1, the right answer is almost always the simplest one.
API-First vs. Open-Source vs. Fine-Tuned
API-first (Claude, GPT-4, Gemini): You send a request, you get a response, you pay per token. This is where you should start. The models are good, the infrastructure is someone else's problem, and you can ship in days instead of months. Your cost per request will be somewhere between $0.01 and $0.15 depending on the model and input size. For most products doing under 100,000 requests per month, this is the obvious choice.
Open-source (Llama, Mistral, Qwen): You host the model yourself. The model is free; the hosting is not. A decent GPU instance runs $500-2,000/month on AWS or GCP. This makes sense when: you need to process data that can't leave your infrastructure (healthcare, finance, defense), you're doing millions of requests and the API costs exceed hosting costs, or you need to modify the model itself. For v1 of most products? This is premature optimization.
Fine-tuned models: You take a base model and train it on your specific data so it performs better on your specific task. Fine-tuning costs range from a few hundred dollars (for small models with small datasets) to tens of thousands. The dirty secret: for 90% of use cases, a well-crafted system prompt with good examples gets you 80% of the performance of a fine-tuned model at 1% of the cost and effort. Fine-tune when you've proven the product works, have thousands of high-quality input-output examples, and need measurably better performance on a narrow task. That's v3 territory, not v1.
The decision is simple: Start with API calls. Move to open-source when you have a regulatory or cost reason. Fine-tune when you have the data and the proven need.
The Prompt Engineering Stack
Your prompts are your product logic. Treat them like code, not like casual conversation. Here's the stack, from foundation to advanced:
System prompts: The instructions that define your AI's behavior, personality, constraints, and output format. This is the most important piece. A good system prompt is 200-500 words, specifies exactly what the AI should and shouldn't do, defines the output format explicitly, and includes guardrails for edge cases. Version control these. Review changes like code reviews.
Few-shot examples: Include 2-5 examples of ideal input-output pairs directly in your prompt. This is the single highest-leverage technique for improving output quality. Instead of describing what you want, show it. An AI product that generates sales emails should include 3 examples of great sales emails right in the system prompt. The model learns the pattern — tone, length, structure — from examples faster than from instructions.
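Structured outputs: Tell the model to respond in JSON, XML, or a specific schema. This makes your outputs parseable, testable, and consistent. Major API providers, including OpenAI and Anthropic, offer ways to constrain responses to a JSON schema, either through a dedicated structured-output mode or through tool definitions, which makes malformed structure rare or impossible depending on the mode. Check your provider's current documentation for the exact guarantee, and still validate the parsed result on your side. Use this for every feature where you need to do something with the output programmatically. Chat is the exception, not the rule.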
Tool use / function calling: The model doesn't just generate text — it decides to call functions you've defined. "Look up the user's order history." "Calculate the shipping cost." "Search the knowledge base." This is how you connect your AI to real data and real actions. It's the bridge between a chatbot and a product.
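On your side of the API, tool use mostly comes down to a registry and a dispatcher. The sketch below assumes the model's request has already been reduced to a JSON object with `name` and `arguments` keys. That shape is a simplification for illustration, not any provider's exact wire format, and the shipping formula is a toy:

```python
import json
from typing import Any, Callable, Dict


def calculate_shipping_cost(weight_kg: float, express: bool = False) -> float:
    # Toy pricing rule so the sketch is runnable; not a real rate card.
    base = 5.0 + 1.5 * weight_kg
    return round(base * (2.0 if express else 1.0), 2)


# Only functions listed here can ever be called on the model's behalf.
TOOLS: Dict[str, Callable[..., Any]] = {
    "calculate_shipping_cost": calculate_shipping_cost,
}


def dispatch_tool_call(raw_call: str) -> Any:
    """Run one model-requested tool call after checking it against the registry."""
    call = json.loads(raw_call)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"model requested an unknown tool: {name!r}")
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**arguments)


request = '{"name": "calculate_shipping_cost", "arguments": {"weight_kg": 2, "express": true}}'
print(dispatch_tool_call(request))
```

The model decides which tool to ask for; your code decides whether that request is allowed and what actually runs. Keep that boundary explicit.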
When You Need RAG (and When You Don't)
RAG — Retrieval-Augmented Generation — means your AI searches a knowledge base before generating a response. Instead of relying solely on what the model was trained on, you feed it relevant documents, data, or context at query time.
You need RAG when:
- Your product needs to reference specific, frequently updated information (company docs, product catalogs, legal regulations)
- The information wouldn't be in the model's training data (your users' private data, proprietary content)
- Accuracy on specific facts matters more than general reasoning
You don't need RAG when:
- The model's built-in knowledge is sufficient for your use case
- You're doing creative generation, classification, or transformation tasks
- Your relevant context fits in the context window (more on this below)
RAG adds real complexity: you need a vector database, an embedding pipeline, a chunking strategy, and retrieval ranking logic. Each of these is a potential failure point. A bad chunking strategy means your AI retrieves irrelevant context and produces worse outputs than if you'd given it nothing.
If you're considering RAG, first ask: can I just put the relevant information directly in the prompt?
The Context Window Is Your Database
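Many current frontier models have context windows of 100K-200K tokens, and some go well beyond that. At roughly 0.75 words per token and 250 words per page, 100K-200K tokens is about 300-600 pages of text. For many products, you don't need a retrieval system at all: you can just stuff the relevant context directly into the prompt.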
Building a product that analyzes a user's resume against a job description? Both documents fit easily in the context window. No RAG needed. Building a product that helps users navigate a 50-page employee handbook? Paste the whole handbook in the system prompt. No RAG needed.
The context-stuffing approach works when:
- Your reference material is under ~100 pages
- The material doesn't change frequently (or you can reload it)
- You don't need to search across thousands of documents
It breaks down when:
- You have more content than fits in the window
- You need to search across many users' data
- Latency or per-request cost matters (longer prompts mean slower and more expensive responses)
The cost implication is real. Sending 100K tokens of context with every request is expensive — roughly $0.10-0.30 per request on frontier models. But for early-stage products with low volume, it's dramatically simpler than building a RAG pipeline. Trade money for simplicity until simplicity stops scaling.
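A quick way to sanity-check the context-stuffing option before you commit to it. The tokens-per-page figure is derived from the estimate above (100K tokens is about 300 pages). The $3 per million input tokens default is the approximate mid-tier price this chapter uses for its cost examples, and the 4,000-token reserve for the user's question and the response is an arbitrary assumption:

```python
TOKENS_PER_PAGE = 100_000 / 300   # from the estimate above: 100K tokens, about 300 pages


def estimated_tokens(pages: float) -> int:
    return round(pages * TOKENS_PER_PAGE)


def fits_in_window(pages: float, window_tokens: int = 100_000,
                   reserved_tokens: int = 4_000) -> bool:
    """True if the material fits, leaving room for the question and the answer."""
    return estimated_tokens(pages) + reserved_tokens <= window_tokens


def context_cost(tokens: int, input_price_per_million: float = 3.0) -> float:
    """Input-side cost in dollars of sending `tokens` with one request."""
    return tokens * input_price_per_million / 1_000_000


handbook_pages = 50
tokens = estimated_tokens(handbook_pages)
print(f"{tokens} tokens, fits: {fits_in_window(handbook_pages)}, "
      f"about ${context_cost(tokens):.3f} per request")
```

Page counts are a blunt estimate. Once you have the real documents, count tokens with your provider's tokenizer and rerun the numbers.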
Cost Modeling: Know Your Numbers Before You Build
Every AI product has a per-request cost that traditional software doesn't. You need to model this before you write a line of code.
Here's the formula:
Cost per request = (input tokens x input price per token) + (output tokens x output price per token)
For Claude Sonnet (a typical mid-tier choice):
- Input: ~$3 per million tokens
- Output: ~$15 per million tokens
A typical request with a 1,000-token system prompt, 500-token user input, and 500-token response costs about $0.01. Sounds cheap. Now multiply:
- 100 users x 20 requests/day x 30 days = 60,000 requests/month = $600/month in API costs
- 1,000 users at the same rate = $6,000/month
- 10,000 users = $60,000/month
And that's with modest context sizes. If you're stuffing 50K tokens of context into every request, those numbers grow 10-20x.
Model these costs for three scenarios: 100 users, 1,000 users, and 10,000 users. If the numbers don't work at 1,000 users, you have a pricing problem that needs solving before you build. Chapter 6 covers how to solve it.
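To run the three scenarios yourself, a few lines of Python are enough. This sketch uses the prices and token counts from the example above; the function names are placeholders. Note that the unrounded per-request cost is $0.012, so the monthly totals come out about 20% higher than the penny-per-request figures quoted in the text.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_m: float = 3.0,
                     output_price_per_m: float = 15.0) -> float:
    """Dollar cost of one request; prices are quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_cost(users: int, requests_per_day: float,
                 per_request: float, days: int = 30) -> float:
    """Total monthly API spend for a given user count and usage rate."""
    return users * requests_per_day * days * per_request


# 1,000-token system prompt + 500-token user input, 500-token response.
typical = cost_per_request(input_tokens=1_500, output_tokens=500)

# Same request with 50K tokens of stuffed context added to the input.
heavy = cost_per_request(input_tokens=51_500, output_tokens=500)

for users in (100, 1_000, 10_000):
    print(f"{users:>6} users: ${monthly_cost(users, 20, typical):,.0f}/month")

print(f"context-heavy multiplier: {heavy / typical:.1f}x")
```

Swap in your own token counts and your provider's current prices; the published numbers change often enough that you shouldn't trust the defaults here.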
Template: AI Cost Model Worksheet
The AI Cost Model Worksheet helps you estimate your per-user AI costs before you build. It has three sections:
- Section one: define your AI features and estimate tokens per request (input and output) for each.
- Section two: estimate usage patterns. How many times per day/week/month will an average user trigger each feature?
- Section three: calculate total monthly cost at three scales (100, 1K, 10K users) using your chosen model's pricing.
The worksheet includes a row for each AI feature, with columns for input tokens, output tokens, requests per user per month, cost per request, and total monthly cost at each scale. The final row shows your all-in AI cost per user per month, the number you'll need for pricing decisions in Chapter 6.
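If you'd rather keep the worksheet in code than in a spreadsheet, one possible shape looks like this. The feature names, token counts, and usage numbers below are invented placeholders, and the prices are the assumed mid-tier figures from earlier in the chapter.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """One worksheet row: a single AI feature and its expected usage."""
    name: str
    input_tokens: int
    output_tokens: int
    requests_per_user_per_month: float

    def cost_per_request(self, in_price_per_m: float, out_price_per_m: float) -> float:
        return (self.input_tokens * in_price_per_m
                + self.output_tokens * out_price_per_m) / 1_000_000


def cost_per_user_per_month(features: list[Feature],
                            in_price_per_m: float = 3.0,
                            out_price_per_m: float = 15.0) -> float:
    """The worksheet's final row: all-in AI cost per user per month."""
    return sum(
        f.cost_per_request(in_price_per_m, out_price_per_m) * f.requests_per_user_per_month
        for f in features
    )


# Hypothetical features; replace with your own estimates.
features = [
    Feature("draft_generation", input_tokens=1_500, output_tokens=500,
            requests_per_user_per_month=40),
    Feature("summarize", input_tokens=4_000, output_tokens=300,
            requests_per_user_per_month=10),
]

per_user = cost_per_user_per_month(features)
print(f"AI cost per user per month: ${per_user:.3f}")
for users in (100, 1_000, 10_000):
    print(f"{users:>6} users: ${per_user * users:,.2f}/month")
```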
Chapter 6: Pricing AI Products
The Math Is Different
If you've ever priced a SaaS product, forget what you know about margins. Traditional SaaS has near-zero marginal cost per user. One more user on your project management tool costs you fractions of a penny in server time. That's why SaaS businesses target 80%+ gross margins and price based on value delivered, not cost incurred.
AI products have a per-request cost that scales linearly with usage. Every time a user clicks "generate," you pay Anthropic or OpenAI real money. This isn't a rounding error — for many products, AI API costs are the single largest line item after payroll.
This changes everything about how you price.
The Cost Structure You Need to Understand
Your AI product has four cost layers:
API costs — The big one. What you pay the model provider per request. This varies wildly based on model choice (Haiku at $0.001 per request vs. Opus at $0.10+), context length, and output length. This is variable cost — it scales directly with usage.
Compute/hosting — Your servers, database, file storage. If you're API-first (which you should be for v1), this is minimal: a basic server, a database, maybe some file storage. Call it $50-200/month for early-stage. This is mostly fixed cost.
Tooling — Auth (Clerk, Supabase Auth), payments (Stripe at 2.9% + $0.30), monitoring, error tracking. Maybe $100-300/month total. Fixed cost until you scale.
Your time — The cost people forget. If you're spending 20 hours a week on this and your time is worth $100/hour, that's $8,000/month in opportunity cost. Not a line item on your P&L, but very real.
For pricing purposes, the number that matters most is your AI cost per user per month. You calculated this in Chapter 5. Now you need to build a price around it.
Three Pricing Models That Work
1. Usage-based pricing: Pay per generation
The user pays for what they use. $0.10 per report generated. $1 per document analyzed. 100 credits for $20.
When it works: When usage varies dramatically across users. When the value of each generation is clear and immediate. When your cost per request is high enough that unlimited plans would kill your margins.
When it doesn't: When users can't predict their costs (they hate surprises). When low usage makes the product feel expensive ("I paid $0.50 for one query?"). When you want predictable revenue.
Real math: If your cost per request is $0.03 and you charge $0.25 per generation, your gross margin is 88%. That's healthy. But if your average user only generates 10 things per month, you're bringing in just $2.50/user/month in revenue. You need volume or a higher per-unit price.
2. Tiered subscription: Free / Pro / Enterprise with limits
The most common model. Free tier with 20 generations/month. Pro at $29/month with 500 generations. Enterprise at $99/month with 2,000 generations.
When it works: When you want predictable revenue. When you can define usage tiers that match natural user segments (casual, regular, power user). When the free tier drives growth without bankrupting you.
When it doesn't: When your power users use 100x what your average user does (your top tier subsidizes abuse). When the right limit is hard to define (do you cap by generations? tokens? features?).
Real math: A Pro user at $29/month who makes 500 requests at $0.03 each costs you $15/month in API fees. That's a 48% gross margin. Acceptable for early stage, but you'll need to optimize. If most Pro users only make 100 requests, your effective margin is 90% — the limit exists for the outliers.
3. Outcome-based pricing: Charge for results
You don't charge for the AI call — you charge for the outcome it produces. $5 per completed lead analysis. $50 per generated contract. 15% of the revenue your pricing recommendation generated.
When it works: When the outcome has clear, measurable value. When you can tie usage to business results. When the value delivered far exceeds your cost to deliver it. This is the highest-margin model when it works.
When it doesn't: When the outcome is fuzzy (what's a "good" email draft worth?). When attribution is hard to prove. When users feel like they're paying for the AI's mistakes.
Real math: A real estate agent uses your AI to generate listing descriptions. Each description costs you $0.05 in API calls. You charge $3 per listing. That's a 98% gross margin, and the agent happily pays because a professional listing description used to cost $50 from a copywriter.
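The "real math" for all three models is the same one-line formula applied to different inputs. Here is a quick sketch that reproduces the numbers above so you can plug in your own; the function name is just a placeholder.

```python
def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price."""
    return (price - cost) / price


# Usage-based: $0.25 per generation, $0.03 API cost.
usage_based = gross_margin(0.25, 0.03)

# Tiered subscription: $29 Pro plan, at the 500-request cap and at 100 requests.
pro_at_cap = gross_margin(29, 500 * 0.03)
pro_typical = gross_margin(29, 100 * 0.03)

# Outcome-based: $3 per listing description, $0.05 API cost.
outcome_based = gross_margin(3.00, 0.05)

for label, margin in [
    ("usage-based", usage_based),
    ("Pro at cap", pro_at_cap),
    ("Pro typical", pro_typical),
    ("outcome-based", outcome_based),
]:
    print(f"{label:<14} {margin:.0%}")
```

These margins only count API cost. Infrastructure and payment fees come off the top too, which the walkthrough later in this chapter accounts for.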
The 10x Rule
Here's your pricing shortcut for v1: charge at least 10x your AI cost per request.
If a request costs you $0.03, the user should be paying at least $0.30 for it (either directly per request or amortized across a subscription). If a request costs you $0.10, you need to charge at least $1.00.
Why 10x? Because you need margin for everything else:
- 3x covers your API costs with basic headroom
- 5x covers infrastructure, tooling, and payment processing
- 10x gives you actual gross margin to cover your time, customer support, and growth
If you can't charge 10x your API cost, you have one of three problems:
- Your AI cost per request is too high (use a cheaper model, reduce context, cache common queries)
- The value you deliver isn't high enough (your product needs to do more)
- You're in a market that can't support AI-product pricing (consider whether AI should be a feature, not the product)
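The 10x rule is easy to turn into a check you can run against any price you're considering. A minimal sketch, with hypothetical function names and the multiple exposed as a parameter in case you decide your business needs more or less headroom:

```python
def min_price_per_request(cost_per_request: float, multiple: float = 10.0) -> float:
    """Lowest per-request price that satisfies the rule."""
    return cost_per_request * multiple


def min_subscription_price(cost_per_request: float,
                           expected_requests_per_month: float,
                           multiple: float = 10.0) -> float:
    """Lowest monthly price when the cost is amortized across a subscription."""
    return cost_per_request * expected_requests_per_month * multiple


def passes_10x(price: float, ai_cost: float, multiple: float = 10.0) -> bool:
    """True if the price is at least `multiple` times the AI cost it covers."""
    return price >= ai_cost * multiple


print(min_price_per_request(0.03))          # floor for a $0.03 request
print(min_subscription_price(0.03, 50))     # floor for 50 requests/month
print(passes_10x(price=19, ai_cost=1.50))   # $19 plan, $1.50 AI cost per user
print(passes_10x(price=29, ai_cost=15.00))  # $29 plan used at its 500-request cap
```

The last line is worth a second look: a plan can pass the rule for the average user and fail badly for a user who hits the cap.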
The Margin Math Walkthrough
Let's make this concrete. Say you're building an AI product that helps small business owners write job postings.
Your costs per user per month (estimated 50 requests/month):
- API costs: 50 requests x $0.03 = $1.50
- Infrastructure (amortized across 500 users): $150 / 500 = $0.30
- Stripe fees on a $19/month plan: $0.85
- Total cost per user: $2.65/month
At $19/month subscription:
- Gross margin: ($19 - $2.65) / $19 = 86%
- Your price-to-API-cost ratio: $19 / $1.50 = 12.7x (above the 10x threshold)
That's a healthy business. Now stress-test it:
What if a power user makes 200 requests/month?
- API costs jump to $6.00
- Total cost: $7.15
- Gross margin drops to 62%
Still workable, but this is why you set generation limits on your tiers. The limit isn't about being stingy — it's about protecting your margins from the 5% of users who will use 10x the average.
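Here's the same walkthrough as a small script, so you can rerun the stress test with your own numbers. It assumes the figures used above: $0.03 per request, $150/month of infrastructure spread across 500 users, and Stripe's 2.9% + $0.30. The function and key names are illustrative.

```python
def stripe_fee(price: float, pct: float = 0.029, fixed: float = 0.30) -> float:
    """Payment processing fee on one monthly charge."""
    return price * pct + fixed


def unit_economics(price: float, requests: int,
                   cost_per_request: float = 0.03,
                   infra_monthly: float = 150.0,
                   users: int = 500) -> dict:
    """Per-user monthly costs and margin for one subscriber."""
    api = requests * cost_per_request
    infra = infra_monthly / users
    total = api + infra + stripe_fee(price)
    return {
        "api": api,
        "total": total,
        "margin": (price - total) / price,
        "api_multiple": price / api,
    }


typical = unit_economics(price=19, requests=50)
power = unit_economics(price=19, requests=200)

for label, u in (("typical user", typical), ("power user", power)):
    print(f"{label}: cost ${u['total']:.2f}, margin {u['margin']:.0%}, "
          f"{u['api_multiple']:.1f}x API cost")
```

Try raising `requests` until the margin drops below a level you can live with. That number is a reasonable starting point for your tier limit.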
When to Eat the Cost
Your first 50-100 users should get generous limits, possibly even free access. The data you get from real usage is worth more than the API costs. You're buying three things:
- Real eval data — How users actually use your product (Chapter 4)
- Cost validation — Whether your per-user cost estimates are right
- Testimonials and feedback — The currency that sells the next 1,000 users
Set a mental budget. "I'll spend up to $500/month on API costs for free users." When you hit that budget, you've either validated that people love it (time to charge) or learned that they don't (time to pivot). Either answer is worth $500.
When to Gate
Start charging when any of these are true:
- Users are getting real, repeated value (they come back weekly)
- Your API costs exceed $500/month and are climbing
- You have at least 10 users who would be upset if the product disappeared
- Someone has literally asked "can I pay for this?"
The last one happens more often than you'd think. Don't wait for it, but when it comes, listen.
Template: AI Cost Model Worksheet (Pricing Extension)
Extend the cost model worksheet from Chapter 5 with a pricing section. For each pricing model you're considering, fill in: price per unit or monthly subscription, estimated usage per user, gross revenue per user, AI cost per user (from Chapter 5), other costs per user (infrastructure, payments, tooling), gross margin percentage, and your 10x ratio (price divided by AI cost). Run the numbers for your three user scenarios (100, 1K, 10K users) and check: does gross margin stay above 70% at every scale? If the margin compresses below 60% at scale, you either need to raise prices, reduce AI costs (cheaper model, caching, shorter prompts), or add generation limits to your tiers. The worksheet should make the right pricing model obvious — it's the one where the math works at all three scales without requiring users to pay more than the value they receive.
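One way to run the three-scale check in code is sketched below. It makes some simplifying assumptions you should adjust: fixed costs stay flat as you grow (in practice they step up), payment fees follow Stripe's 2.9% + $0.30, and the "watch" band between 60% and 70% is my own label for the gap between the two thresholds. All names and sample inputs are placeholders.

```python
def margin_at_scale(users: int, price: float, ai_cost_per_user: float,
                    fixed_monthly: float,
                    payment_pct: float = 0.029,
                    payment_fixed: float = 0.30) -> float:
    """Gross margin per user when fixed costs are spread across all users."""
    per_user_cost = (ai_cost_per_user
                     + fixed_monthly / users
                     + price * payment_pct + payment_fixed)
    return (price - per_user_cost) / price


def verdict(margin: float) -> str:
    """Map a margin to the worksheet's thresholds."""
    if margin >= 0.70:
        return "healthy"
    if margin >= 0.60:
        return "watch"
    return "fix pricing or costs"


# Hypothetical plan: $19/month, $1.50 AI cost per user, $300/month fixed costs.
for users in (100, 1_000, 10_000):
    m = margin_at_scale(users, price=19, ai_cost_per_user=1.50, fixed_monthly=300)
    print(f"{users:>6} users: {m:.0%} ({verdict(m)})")
```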
Part 3: The Business
Chapter 7: The Wrapper Trap
When you're one API change away from death.
Here's a question that should keep you up at night: what happens to your product if Anthropic or OpenAI ships the exact feature you're selling?
If your answer is "we'd be screwed," you're a wrapper. And wrappers die.
Let me be specific about what a wrapper is. A wrapper is a product where the entire value creation happens inside someone else's model, and your contribution is a UI layer and maybe a system prompt. You take user input, send it to Claude or GPT, display the output, and charge $29/month for the privilege.
This isn't a hypothetical risk. It's already happened. When ChatGPT launched, dozens of "AI writing tools" that were literally just GPT-3 with a nicer interface died overnight. When Claude added artifacts, every "AI document generator" wrapper lost its reason to exist. When Notion shipped Notion AI, a hundred startups building "AI for notes" became irrelevant in a single product announcement.
The question isn't whether you're using third-party AI. Almost everyone is, and that's fine. The question is whether you've built anything on top of that AI that the AI provider can't trivially replicate.
The Four Defenses
There are exactly four ways to escape the wrapper trap. You don't need all four, but you need at least one, and two is better.
Defense 1: Proprietary Data
This is the strongest defense and the hardest to build. When your users' data makes the product better, and that data can't be replicated, you have a moat.
Think about what happens with a tool like Harvey (AI for lawyers). Every legal brief, every contract review, every case analysis that flows through the system makes the product smarter about how lawyers actually work. A new competitor starting from zero doesn't have that corpus. OpenAI could ship a legal assistant tomorrow, but it wouldn't have 18 months of real law firm workflows embedded in its understanding.
The key word is proprietary. Public data doesn't count. If you're doing RAG over Wikipedia or publicly available documentation, that's not a defense — anyone can do that. Your data advantage comes from user-generated content, proprietary domain knowledge, or accumulated feedback loops.
Ask yourself: after 1,000 users have used your product for six months, is it meaningfully better than it was on day one? If yes, you're building a data moat. If it's the same product with more users, you have no data defense.
Defense 2: Workflow Integration
When your product is embedded in how someone works — not just what they ask — switching costs go up dramatically.
A chatbot is easy to replace. A tool that's integrated into someone's CRM, that triggers based on their pipeline stages, that feeds results into their reporting dashboard, that their team has built processes around — that's much harder to rip out.
This is why Jasper survived (at least initially) while simpler AI writing tools didn't. Jasper wasn't just "generate marketing copy." It was brand voice settings, team collaboration, campaign management, integration with content calendars. Even when ChatGPT could write the same copy, Jasper was woven into marketing teams' workflows.
The test: if a user wanted to switch to a competitor, how many hours of setup, migration, and retraining would it take? If the answer is "five minutes," you're a wrapper. If it's "two weeks and we'd lose our historical data," you have workflow integration.
Defense 3: Domain-Specific Prompts and Evals
This one surprises people. "But anyone can write a prompt!" Sure. Anyone can also write code. The question is whether they can write good code that handles edge cases, has been tested against hundreds of failure modes, and has been iterated based on real user feedback.
Your prompt engineering is intellectual property when it's:
- Versioned: you know which prompt version works better and why
- Tested: you have eval sets with hundreds of test cases and pass/fail criteria
- Iterated: each version is informed by real user failures, not guesswork
- Multi-step: you're orchestrating multiple AI calls with logic between them, not just one prompt-and-response
A single system prompt is not defensible. A prompt pipeline with 15 steps, conditional logic, output validation, retry strategies, and domain-specific eval rubrics — that's engineering, and it takes months to replicate well.
Builder Companion does this with its discovery script generator. It's not one prompt that says "generate interview questions." It's a pipeline: analyze the product description, identify assumption categories, generate questions per category, filter for non-leading language, rank by information value, format for the user's experience level. Each step is tested against a rubric. The whole pipeline has been iterated across dozens of real products. Could someone rebuild it? Sure. But it would take them the same months of iteration it took us.
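To make "pipeline, not prompt" concrete, here is a stripped-down sketch of the pattern: each step calls the model, validates the output, and retries on failure. It is not Builder Companion's actual code. It covers only a few of the steps described above, the prompts and the leading-phrase list are invented for illustration, and `fake_model` is a stand-in for whatever provider call you'd really make.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt, returns text.
ModelFn = Callable[[str], str]


def run_step(model: ModelFn, prompt: str,
             validate: Callable[[str], bool], max_attempts: int = 3) -> str:
    """Call the model, check the output, and retry if validation fails."""
    for _ in range(max_attempts):
        output = model(prompt)
        if validate(output):
            return output
    raise ValueError(f"step failed validation after {max_attempts} attempts")


# Illustrative filter list; a real one would come out of your eval work.
LEADING_PHRASES = ("don't you think", "wouldn't you agree", "isn't it true")


def is_non_leading(question: str) -> bool:
    q = question.lower()
    return q.endswith("?") and not any(p in q for p in LEADING_PHRASES)


def discovery_pipeline(model: ModelFn, product_description: str) -> list[str]:
    """Categories -> questions per category -> filter out leading questions."""
    categories = run_step(
        model,
        f"List assumption categories, one per line, for: {product_description}",
        validate=lambda out: len(out.strip().splitlines()) >= 1,
    )
    questions: list[str] = []
    for category in categories.strip().splitlines():
        raw = run_step(
            model,
            f"Write interview questions, one per line, about: {category}",
            validate=lambda out: "?" in out,
        )
        questions += [q.strip() for q in raw.splitlines() if is_non_leading(q.strip())]
    return questions


def fake_model(prompt: str) -> str:
    """Canned responses so the sketch runs without an API key."""
    if prompt.startswith("List assumption categories"):
        return "pricing\nonboarding"
    return "How do you handle this today?\nDon't you think this is painful?"


print(discovery_pipeline(fake_model, "AI job posting writer"))
```

Even in this toy version, the defensible part isn't the prompts. It's the validators and filters, because those encode what you learned from real failures.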
Defense 4: Multi-Model Orchestration
If your product only works with one model from one provider, you're dependent on that provider's pricing, capabilities, rate limits, and strategic decisions. That's a fragile position.
Products that orchestrate across multiple models — using Claude for reasoning, GPT for certain generation tasks, open-source models for classification, maybe a fine-tuned model for domain-specific tasks — have natural resilience. If one provider raises prices 5x (it happens), you can shift workloads. If one model degrades on a specific task (it happens), you can route around it.
This also makes your orchestration layer itself valuable. Knowing which model to use for which task, how to format inputs for each, how to normalize outputs across providers — that's hard-won operational knowledge.
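A minimal version of that orchestration layer is a routing table plus a fallback loop. The sketch below uses stub functions in place of real provider SDK calls, and the provider and task names are made up; the point is the shape, not the specifics.

```python
from typing import Callable

# Hypothetical provider interface: takes a prompt, returns text.
Provider = Callable[[str], str]


class ModelRouter:
    """Routes each task type to an ordered list of providers, with fallback."""

    def __init__(self, providers: dict[str, Provider], routes: dict[str, list[str]]):
        self.providers = providers
        self.routes = routes

    def run(self, task: str, prompt: str) -> tuple[str, str]:
        """Return (provider_name, output) from the first provider that succeeds."""
        errors = []
        for name in self.routes[task]:
            try:
                return name, self.providers[name](prompt)
            except Exception as exc:  # any provider failure triggers fallback
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stubs standing in for real API clients.
def flaky_provider(prompt: str) -> str:
    raise TimeoutError("rate limited")


def backup_provider(prompt: str) -> str:
    return f"label for: {prompt}"


router = ModelRouter(
    providers={"primary": flaky_provider, "backup": backup_provider},
    routes={"classification": ["primary", "backup"]},
)

used, output = router.run("classification", "refund request")
print(used, "->", output)
```

Shifting a workload after a price change then becomes a one-line edit to `routes`, provided you've already normalized inputs and outputs across providers.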
The Wrapper Audit
Run this test on your own product right now:
- The Launch Test: If OpenAI/Anthropic announced they're building exactly what you're building, how many of your users would wait for their version instead of paying you? If most would wait — you're a wrapper.
- The Prompt Test: Could a smart person replicate your core AI functionality in an afternoon with the API docs and a good system prompt? If yes — you're a wrapper.
- The Data Test: Does your product get better as more people use it? Not "more users means more revenue" — actually better outputs, smarter suggestions, more relevant results? If not — you're a wrapper.
- The Integration Test: Does your product touch anything besides the chat input and the AI output? Does it connect to the user's existing tools, data, or workflows? If not — you're a wrapper.
Score yourself. If you failed three or four of these tests, you need to pick a defense and start building it now, before the market shifts underneath you.
Products That Escaped vs. Products That Didn't
Escaped: Midjourney started as "just" an image generation UI, but built a massive community, developed proprietary aesthetic training, and created a workflow (Discord-native) that became its own thing. When DALL-E 3 launched, Midjourney didn't flinch.
Didn't Escape: Copy.ai raised $11M to build a GPT-3 writing wrapper. When ChatGPT launched, their core product became free for everyone. They've survived by pivoting hard into workflow automation, but the original product — write marketing copy with AI — became a feature of everything.
The lesson: Being a wrapper isn't a death sentence if you see it early. But you have to move fast, because the window between "we're a wrapper" and "we're dead" keeps getting shorter.
Template: Wrapper Risk Scorecard
A one-page assessment you can run quarterly. Score each defense from 0 to 3:
- Proprietary Data (0 = no unique data, 1 = some user data but not leveraged, 2 = user data improves outputs, 3 = significant data moat competitors can't replicate)
- Workflow Integration (0 = standalone chat, 1 = some integrations, 2 = embedded in daily workflows, 3 = switching costs measured in weeks)
- Domain Prompt Engineering (0 = single prompt, 1 = multi-step pipeline, 2 = tested and versioned pipeline, 3 = hundreds of eval cases with continuous iteration)
- Multi-Model Resilience (0 = single model dependency, 1 = could switch with effort, 2 = actively use multiple models, 3 = intelligent routing across providers)
Score 0-3: You're a wrapper. Start building defenses immediately.
Score 4-6: Vulnerable but viable. Prioritize your weakest defense.
Score 7-9: Defensible. Keep iterating.
Score 10-12: Strong moat. Focus on growth.
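If you want to track the scorecard over time, it reduces to a few lines of code. The bands below are the ones listed above; the key names and the sample scores are placeholders.

```python
DEFENSES = (
    "proprietary_data",
    "workflow_integration",
    "domain_prompt_engineering",
    "multi_model_resilience",
)


def wrapper_scorecard(scores: dict[str, int]) -> tuple[int, str, str]:
    """Return (total, band, weakest_defense) for a set of 0-3 scores."""
    for defense in DEFENSES:
        if not 0 <= scores[defense] <= 3:
            raise ValueError(f"{defense} must be scored 0-3")
    total = sum(scores[d] for d in DEFENSES)
    if total <= 3:
        band = "wrapper"
    elif total <= 6:
        band = "vulnerable but viable"
    elif total <= 9:
        band = "defensible"
    else:
        band = "strong moat"
    weakest = min(DEFENSES, key=lambda d: scores[d])
    return total, band, weakest


# Hypothetical self-assessment.
total, band, weakest = wrapper_scorecard({
    "proprietary_data": 2,
    "workflow_integration": 3,
    "domain_prompt_engineering": 1,
    "multi_model_resilience": 2,
})
print(total, band, "| weakest:", weakest)
```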
Chapter 8: Building Trust with Non-Technical Users
Your users don't care about your model.
Here's something that took me too long to learn: the biggest barrier to adoption of your AI product isn't accuracy. It's trust.
I've watched users interact with AI features that were 95% accurate and refuse to rely on them. I've also watched users adopt AI features that were 70% accurate and love them. The difference wasn't the quality of the output. It was whether the user felt in control.
Non-technical users — the parents, realtors, teachers, salespeople we're building for — have a fundamentally different relationship with AI than developers do. A developer sees a wrong answer and thinks "I need to adjust my prompt." A non-technical user sees a wrong answer and thinks "I can't trust this thing."
One bad output can undo weeks of trust-building. And once trust is broken, it's nearly impossible to rebuild. So you need to design for trust from the start, not bolt it on after.
The Trust Gap
There's a specific phenomenon I call the trust gap. It looks like this:
A user sees a demo of your AI product. They're impressed. "Wow, it wrote that whole thing for me!" They sign up. They try it on their actual work. The first output is pretty good. The second is okay. The third gets something wrong — maybe it hallucinated a fact, maybe it used the wrong tone, maybe it suggested something that doesn't apply to their situation.
Now the user is in the trust gap. They've seen the product work. They've also seen it fail. They don't know when to trust it and when not to. This uncertainty is worse than the product being consistently bad, because at least then they'd know not to use it.
Most users resolve the trust gap by quitting. They go back to doing things manually. Your churn isn't because the product isn't good enough — it's because the user can't predict when it will be good.
Incremental Trust Building
The fix is to design your product so trust builds incrementally. You don't ask the user to trust the AI with high-stakes decisions on day one. You earn trust through a progression.
Level 1: Suggestions, Not Decisions
Start by positioning your AI as a suggestion engine, not a decision-maker. "Here are three options — which one fits?" is fundamentally different from "Here's your answer." The user stays in control, the AI is a helper, and a wrong suggestion is a minor annoyance instead of a trust-breaking failure.
This is why autocomplete works so well as an AI pattern. It's constantly suggesting, the user is constantly accepting or rejecting, and the stakes per suggestion are near zero. Over time, the user develops an intuitive sense of when the AI is right and when to ignore it.
Level 2: Show Your Work
When your AI generates something, show why. Not the technical chain-of-thought — your users don't care about that. But the reasoning in plain language.
Instead of: "Recommended price: $49/month"
Try: "Recommended price: $49/month. Based on: your competitors charge $39-79, your feature set is mid-tier, and your target users (real estate agents) typically expense tools under $50/month."
Now the user can evaluate the reasoning, not just the output. They can say "actually, my competitors are more expensive than that" and adjust. They're working with the AI, not blindly following it.
Level 3: Let Users Override and Correct
Every AI output should have an easy path to "this isn't right, let me fix it." And ideally, those corrections should feed back into future outputs.
This does two things. First, it gives the user an escape hatch, which reduces anxiety. They know that if the AI gets it wrong, they're not stuck. Second, corrections are the highest-signal feedback you can get. A user who corrects an output is telling you exactly what the AI should have said, in context, for their specific use case.
Build correction flows that are faster than starting from scratch. If a user has to delete the AI output and redo everything manually, you've failed at both trust and utility.
Level 4: Track Accuracy and Show It
Once you have enough usage data, show users their personal accuracy stats. "In the last 30 days, you've accepted 84% of suggestions without edits." This does something powerful: it turns a subjective feeling ("I'm not sure I can trust this") into an objective metric ("this is right 84% of the time").
You can also use this data to calibrate. If accuracy dips below a threshold for a specific user, you can proactively adjust — maybe surface more options, add more caveats, or flag outputs as lower confidence.
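Computing that stat doesn't require anything fancy. Here's a sketch that assumes you log one event per suggestion with an outcome of "accepted", "edited", or "rejected". The event shape, the 30-day window, and the 60% caution threshold are all assumptions to tune for your product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SuggestionEvent:
    user_id: str
    at: datetime
    outcome: str  # "accepted", "edited", or "rejected"


def acceptance_rate(events: list[SuggestionEvent], user_id: str,
                    now: datetime, window_days: int = 30) -> Optional[float]:
    """Share of a user's recent suggestions accepted without edits."""
    cutoff = now - timedelta(days=window_days)
    recent = [e for e in events if e.user_id == user_id and e.at >= cutoff]
    if not recent:
        return None  # not enough data to show a stat
    accepted = sum(1 for e in recent if e.outcome == "accepted")
    return accepted / len(recent)


def needs_extra_caution(rate: Optional[float], threshold: float = 0.6) -> bool:
    """True when accuracy for this user has dipped below the threshold."""
    return rate is not None and rate < threshold


# Made-up event log for one user.
now = datetime(2024, 6, 30)
events = [SuggestionEvent("u1", now - timedelta(days=1), "accepted") for _ in range(21)]
events += [SuggestionEvent("u1", now - timedelta(days=2), "edited") for _ in range(4)]
events += [SuggestionEvent("u1", now - timedelta(days=90), "rejected")]  # outside window

rate = acceptance_rate(events, "u1", now)
print(f"accepted without edits: {rate:.0%}, extra caution: {needs_extra_caution(rate)}")
```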
The Transparency Spectrum
How transparent should you be about the AI? It depends on the stakes.
Low stakes (email subject line suggestions, auto-categorization, formatting help): A simple "AI-generated" label is enough. Don't over-explain — it just creates friction.
Medium stakes (content drafts, data analysis, recommendations): Show the reasoning. Provide the sources. Let the user verify key claims. This is where "show your work" matters most.
High stakes (financial advice, legal documents, medical information): Full transparency plus human review gates. The AI generates, a human approves. Show confidence levels. Highlight anything the AI is uncertain about. Make it clear that the user is the decision-maker and the AI is an advisor.
The mistake most builders make is applying high-stakes transparency to low-stakes features (which kills the experience with friction) or low-stakes transparency to high-stakes features (which kills trust when something goes wrong).
When Confidence Matters
Should your AI say "I'm 73% confident in this answer"? Almost never for non-technical users.
Numerical confidence scores are meaningless to most people. What does 73% confident mean? Is that good? Should I trust it? The number creates more uncertainty, not less.
Instead, use language-based confidence indicators:
- High confidence: Present the output normally
- Medium confidence: "Here's my best take, but you should double-check the numbers"
- Low confidence: "I'm not sure about this — here's what I found, but you'll want to verify"
- No confidence: "I don't have enough information to help with this. Here's what I'd need..."
The last one is critical. An AI that says "I don't know" when it doesn't know builds more trust than an AI that confidently generates garbage. Teach your AI to identify its own uncertainty boundaries, and make "I don't know" a first-class output.
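In code, this is a small mapping from whatever internal confidence signal you have to the wording the user sees. The sketch below assumes you already produce a score between 0 and 1 somehow (how you get it is out of scope here), and the cutoffs are arbitrary placeholders you'd calibrate against real outputs.

```python
from typing import Optional

NO_CONFIDENCE = "I don't have enough information to help with this. Here's what I'd need..."
LOW_CONFIDENCE = "I'm not sure about this. Here's what I found, but you'll want to verify."
MEDIUM_CONFIDENCE = "Here's my best take, but you should double-check the numbers."


def confidence_preface(score: Optional[float]) -> str:
    """Translate an internal 0-1 confidence score into user-facing language.

    Returns an empty string for high confidence: present the output normally.
    """
    if score is None or score < 0.2:
        return NO_CONFIDENCE
    if score < 0.5:
        return LOW_CONFIDENCE
    if score < 0.8:
        return MEDIUM_CONFIDENCE
    return ""


for score in (0.95, 0.73, 0.35, None):
    print(repr(score), "->", confidence_preface(score) or "(no preface)")
```

Notice that the user never sees the 0.73. They see a sentence that tells them what to do about it.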
Handling the "Is This Real?" Question
At some point, a user will look at your AI's output and ask: is this real? Did it make this up?
This is especially acute with anything involving facts, data, or citations. LLMs hallucinate. Your users will eventually encounter a hallucination. How you handle that moment defines whether they come back.
Do: Make it easy to verify claims. Link to sources. Separate facts from opinions. Use structured outputs that clearly distinguish "data from your account" from "AI-generated analysis."
Don't: Pretend the AI is infallible. Don't hide behind "AI can make mistakes" disclaimers buried in your ToS. Address it head-on in the product experience.
The gold standard: design your product so that hallucinations are either impossible (because you've constrained the output space) or immediately obvious (because the user has the context to spot them). This goes back to Chapter 3 — structured outputs beat magic, and nowhere is that more true than in trust-building.
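One way to make hallucinations "immediately obvious" is to stop treating the output as a blob of text. In this sketch every claim carries a kind and, for factual claims, a source, and the renderer labels them differently. The field names, labels, example claims, and URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    kind: str                         # "account_data" or "ai_analysis"
    source_url: Optional[str] = None  # where the user can verify it


def render(claims: list[Claim]) -> str:
    """Label each claim so users can tell their data from the AI's opinion."""
    lines = []
    for c in claims:
        label = "From your account" if c.kind == "account_data" else "AI analysis"
        suffix = f" (source: {c.source_url})" if c.source_url else ""
        lines.append(f"[{label}] {c.text}{suffix}")
    return "\n".join(lines)


def missing_sources(claims: list[Claim]) -> list[Claim]:
    """Factual claims with nothing to verify against: block or flag these."""
    return [c for c in claims if c.kind == "account_data" and not c.source_url]


claims = [
    Claim("You closed 14 deals last quarter.", "account_data",
          "https://example.com/reports/q3"),
    Claim("Deal velocity is likely slowing.", "ai_analysis"),
    Claim("Your average deal size was $4,200.", "account_data"),
]

print(render(claims))
print("unverifiable:", [c.text for c in missing_sources(claims)])
```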
Template: Trust Design Checklist
For each AI feature in your product, answer these questions:
- Stakes level: Low / Medium / High — determines your transparency requirements
- Failure visibility: If the AI gets it wrong, will the user notice immediately, eventually, or never? (If "never," you have a serious problem.)
- Override path: Can the user easily correct the output? How many clicks?
- Feedback loop: Do corrections improve future outputs for this user? For all users?
- Confidence communication: How does the AI signal uncertainty? Is it calibrated?
- Verification path: Can the user verify key claims without leaving the product?
- Graceful degradation: When the AI can't help, what happens? (Blank screen = trust destroyer. Helpful fallback = trust builder.)
- Accuracy tracking: Are you measuring accuracy per user? Per feature? Are you sharing it?
Score each feature: 0-2 answers addressed = Trust Debt (fix before scaling), 3-5 = Minimum Viable Trust (okay for early users), 6-8 = Trust-First Design (ready for growth).
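The checklist scoring is simple enough to automate across all your features. A sketch, with made-up key names for the eight questions and a hypothetical feature as the example:

```python
CHECKLIST = (
    "stakes_level",
    "failure_visibility",
    "override_path",
    "feedback_loop",
    "confidence_communication",
    "verification_path",
    "graceful_degradation",
    "accuracy_tracking",
)


def trust_band(addressed: set[str]) -> tuple[int, str]:
    """Return (count, band) for the checklist items a feature has addressed."""
    unknown = addressed - set(CHECKLIST)
    if unknown:
        raise ValueError(f"unknown checklist items: {sorted(unknown)}")
    count = len(addressed)
    if count <= 2:
        band = "Trust Debt"
    elif count <= 5:
        band = "Minimum Viable Trust"
    else:
        band = "Trust-First Design"
    return count, band


# Hypothetical feature that has addressed four of the eight questions.
count, band = trust_band({
    "stakes_level", "override_path", "failure_visibility", "graceful_degradation",
})
print(count, band)
```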
Part 4: Teardowns
Chapter 9: Case Studies: AI Products That Got It Right
The frameworks in this playbook are useless if they only work in theory. So let's put them to work on real products. For each, I'll analyze what they built, where AI sits, how they handle the hard problems, and what you can steal for your own product.
Case 1: Notion AI — The Feature Layer Play
What it does: Notion AI adds AI-powered writing, summarization, translation, and data analysis directly inside Notion's existing workspace product.
Where AI sits: Feature. Remove the AI and Notion is still a fully functional product — a workspace for docs, wikis, projects, and databases. The AI enhances what's already there.
Value proposition analysis: Notion AI's pitch isn't "we have AI." It's "you already work in Notion — now you can do more without leaving." This is the Feature play from Chapter 1, and Notion executes it well. The AI doesn't try to replace the user's workflow. It accelerates it.
The "so what" test from Chapter 2: if you replaced Notion AI with a human assistant who could edit your docs, summarize your pages, and fill in your databases, would the product still make sense? Absolutely. In fact, that's exactly what Notion AI is: the equivalent of a tireless human assistant embedded in your workspace. This means Notion AI is a cost/convenience play, not an "impossible without AI" play. That's fine, because it's a feature, not the core product.
Reliability and trust: Notion handles trust mostly through Level 1 (suggestions, not decisions) and the override path. When you use AI to generate content, it appears as a draft you can accept, edit, or discard. It never overwrites your existing work. It never makes changes without your explicit action. This is trust-first design done right.
They're weaker on showing their work — AI-generated summaries don't explain why certain points were included or excluded. But for the low-to-medium stakes of most Notion use cases (internal docs, meeting notes, project summaries), this is acceptable.
Pricing model: $10/month per user, on top of existing Notion subscription. This is classic feature pricing — incremental cost for incremental value. At their scale, this is enormously profitable. Their API costs per user are likely $0.50-1.00/month (most Notion AI actions are short-form generation or summarization), so they're running at 90%+ margins on the AI feature.
Wrapper risk assessment: Low. Notion AI scores well on the Wrapper Risk Scorecard from Chapter 7:
- Proprietary data: Your entire Notion workspace is the context. No competitor can access it.
- Workflow integration: It's literally inside the product you already use 8 hours a day.
- Multi-model: Notion has switched underlying models multiple times without users noticing.
- Domain prompts: Less strong here — their prompts are good but not deeply domain-specific.
Overall wrapper risk score: ~8/12. Defensible.
What to steal: The insertion point matters more than the model quality. Notion AI isn't the best AI writing tool. It's the most convenient one, because it's right where you're already working. If you're building an AI feature, obsess over where in the workflow it appears, not just how smart it is.
Case 2: Harvey — Domain-Specific AI Done Right
What it does: Harvey is an AI platform for law firms. It helps lawyers with legal research, contract analysis, document drafting, due diligence, and regulatory compliance.
Where AI sits: Core. While Harvey has a UI and workflow tools, the AI is doing the actual work that lawyers need done. Remove the AI and there's no product.
Value proposition analysis: Harvey passes the "so what" test convincingly. Replace Harvey's AI with a human expert doing the same thing, and that human would need to be a team of junior lawyers billing $300-500/hour each. The AI doesn't just make legal work faster — it makes certain analyses economically viable that weren't before. A junior associate might take 40 hours to review 500 contracts for a specific clause. Harvey can do it in minutes. That's not a cost play — that's a capability play.
Harvey's positioning is sharp: they don't compete with ChatGPT for legal questions. They compete with the associate billing $300-500/hour who does the same work at 1/100th the speed. Their buyer isn't the individual lawyer; it's the firm's managing partner who sees the margin opportunity.
Reliability and trust: This is where Harvey's approach gets interesting, because the stakes are genuinely high. A hallucinated legal citation could lead to sanctions (it's happened — see: the lawyers who submitted ChatGPT-fabricated case citations to a federal court).
Harvey's trust architecture:
- Structured outputs: Legal research results include specific citations that can be verified.
- Source verification: Every claim links back to a specific document or legal database.
- Confidence signals: Harvey flags when it's less certain, and legal professionals are trained to verify anyway.
- Human-in-the-loop: Harvey explicitly positions itself as a tool for lawyers, not a replacement.
Pricing model: Enterprise pricing, reportedly $500-1,000+ per user per month. At these prices, Harvey can afford to use the most capable models, run multiple inference passes for quality, and still maintain healthy margins.
Wrapper risk assessment: Very low. Score: ~10-11/12. OpenAI could ship "ChatGPT for lawyers" tomorrow and Harvey wouldn't blink.
What to steal: Go deep, not wide. Harvey picked one domain, one buyer, and went deeper than anyone else could justify without that focus.
Case 3: Jasper — The Wrapper That Had to Evolve
What it does: Jasper started as an AI copywriting tool. Today it's an "AI marketing platform" with brand voice management, campaign workflows, and team collaboration.
Where AI sits: Started as Core, had to evolve into a hybrid.
Value proposition analysis: In 2021-2022, Jasper's pitch was clear: "Write marketing copy 10x faster with AI." They raised $125M at a $1.5B valuation. Then ChatGPT launched and Jasper's core value proposition became free overnight.
This is the purest case study of wrapper risk in the wild.
But Jasper didn't die. They pivoted toward the defenses from Chapter 7:
- Brand voice: Your company's tone and style guidelines stored and applied to every output.
- Campaign workflows: Templates for specific marketing workflows.
- Team collaboration: Multiple marketers with approval flows.
- Knowledge base: Upload your company's docs for personalized context.
The lesson: Jasper's initial product was indefensible. They survived because they moved fast. The brand voice / knowledge base pattern is one of the most accessible ways to build a data moat.
Wrapper risk (pre-pivot): 1-2/12. Pure wrapper.
Wrapper risk (post-pivot): 5-7/12. Improved but still vulnerable.
Case 4: Builder Companion — Eating Our Own Cooking
What it does: Builder Companion ($29/month) helps non-technical builders who've built something with AI tools turn it into a product with paying users.
Where AI sits: Core, but tightly constrained. The AI generates specific outputs — discovery scripts, validation scorecards, pricing recommendations — using structured pipelines.
I'm including this because it would be dishonest to write a playbook and not apply it to ourselves.
Value proposition analysis: The "so what" test: a human product coach charges $150-300/hour. Builder Companion offers similar guidance for $29/month. This is primarily a cost play — making expert-level product guidance accessible.
Honest assessment: Builder Companion's value prop is solid but not exceptional. It's a "10x cheaper than the alternative" play.
Reliability and trust:
- Structured outputs everywhere: No open-ended generation.
- Show your work: Every recommendation comes with reasoning.
- Override everything: Every AI output is editable.
- Progressive disclosure: Start simple, add complexity as users demonstrate readiness.
Wrapper risk assessment:
- Proprietary Data: 1-2/3 (thin data moat, early stage)
- Workflow Integration: 1/3 (standalone, needs integrations)
- Domain Prompts: 2-3/3 (strong, versioned, tested)
- Multi-Model: 0-1/3 (single provider dependency)
Overall: 4-7/12. Vulnerable but viable. Our biggest defense is domain-specific prompt engineering, and we need to build stronger data and integration moats over the next 6-12 months.
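If you want to rerun this audit every quarter, the tally is easy to script. This is a minimal sketch, assuming each of the four dimensions is scored from 0 to 3 so that the total is out of 12. The dictionary layout and function names are illustrative, not part of the audit itself.

```python
# Assumption: four dimensions, each scored 0-3, for a total out of 12.
MAX_PER_DIMENSION = 3

# Each score is a (low, high) range, copied from the assessment above.
audit = {
    "Proprietary Data": (1, 2),
    "Workflow Integration": (1, 1),
    "Domain Prompts": (2, 3),
    "Multi-Model": (0, 1),
}


def total_range(scores):
    """Sum the low and high ends of every dimension's score range."""
    for name, (low, high) in scores.items():
        if not 0 <= low <= high <= MAX_PER_DIMENSION:
            raise ValueError(
                f"{name}: range {low}-{high} is outside 0-{MAX_PER_DIMENSION}"
            )
    lows = sum(low for low, _ in scores.values())
    highs = sum(high for _, high in scores.values())
    return lows, highs


def weakest_first(scores):
    """Order dimensions by best-case score, then worst-case, weakest first."""
    return sorted(scores, key=lambda name: (scores[name][1], scores[name][0]))


low, high = total_range(audit)
print(f"Overall: {low}-{high}/{MAX_PER_DIMENSION * len(audit)}")
print("Lowest-scoring dimension:", weakest_first(audit)[0])
```

Keeping the scores as ranges rather than single numbers preserves the honesty of the audit: you see both the pessimistic and the optimistic total.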
What to steal: Do the wrapper audit on yourself, and be honest about the results.
Cross-Case Patterns
- Feature AI is safer than Core AI. Notion has the least wrapper risk because AI is additive, not essential.
- Domain depth beats general capability. Harvey's legal specialization is essentially unassailable. Jasper's "marketing copy for everyone" was trivially replicated.
- Trust design is non-negotiable. Every successful AI product treats trust as a core design constraint.
- The wrapper audit should be quarterly. Jasper was safe in January 2022 and existentially threatened by December 2022.
Chapter 10: Your 2-Week AI Product Sprint
Stop planning. Start building.
You've read the frameworks. You've internalized the pitfalls. Now it's time to build something. This chapter is a day-by-day plan for going from "I have an idea for an AI product" to "I have a live MVP with real users testing it" in two weeks.
This isn't a plan for building a polished product. It's a plan for answering the only question that matters: does anyone want this enough to use it?
Rules for the sprint:
- Every day has one primary deliverable. Finish it before doing anything else.
- Don't build features that aren't on the list.
- "Show it to someone" appears twice, on Day 5 and Day 12. This is not optional.
Week 1: Validate the AI Angle
Day 1 — Run the Decision Canvas
Pull out the AI Product Decision Canvas from Chapter 1. Spend 90 minutes filling it out honestly.
Your deliverable: a one-paragraph statement that says: "My product is [tool/feature/core AI] because [reason]. The AI does [specific thing] that the user [can't/won't] do themselves. Without the AI, the product [still works / is diminished / doesn't exist]."
Day 2 — Map the Value Proposition
Use the AI Value Proposition Canvas from Chapter 2. Map: the problem, the AI capability, the user outcome, and your defensibility.
Your deliverable: a written answer to the "so what" test, plus a list of three alternatives your users currently rely on (including "do nothing" and "do it manually").
Day 3 — Build One AI Feature
Not the whole product. One feature. Use structured outputs (Chapter 3). Define the input, output schema, quality bar, and two failure modes.
Build it. Don't build a UI yet — a script that takes input and returns output is fine.
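Here is one way the Day 3 script could look. Treat it as a minimal sketch, not a prescribed implementation: the pricing-recommendation feature, the field names, and `fake_model` are all hypothetical, and the canned JSON stands in for a real model call so the script runs anywhere.

```python
import json
from dataclasses import dataclass

ALLOWED_CONFIDENCE = {"low", "medium", "high"}


@dataclass(frozen=True)
class PricingRecommendation:
    """Output schema: every field is required."""
    monthly_price_usd: float
    reasoning: str
    confidence: str


class OutputError(ValueError):
    """Raised when the model output does not match the schema."""


def parse_output(raw: str) -> PricingRecommendation:
    # Failure mode 1: the model returns something that is not a JSON object.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputError("expected a JSON object")

    # Failure mode 2: the JSON parses but fields are missing or malformed.
    try:
        rec = PricingRecommendation(
            monthly_price_usd=float(data["monthly_price_usd"]),
            reasoning=str(data["reasoning"]).strip(),
            confidence=str(data["confidence"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise OutputError(f"missing or malformed field: {exc}") from exc

    # Quality bar: reject values a user could not act on.
    if rec.monthly_price_usd <= 0:
        raise OutputError("price must be positive")
    if rec.confidence not in ALLOWED_CONFIDENCE:
        raise OutputError(f"unknown confidence: {rec.confidence}")
    if not rec.reasoning:
        raise OutputError("reasoning must not be empty")
    return rec


def fake_model(product_description: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned JSON."""
    return json.dumps({
        "monthly_price_usd": 29,
        "reasoning": "Comparable tools for this audience charge a similar amount.",
        "confidence": "medium",
    })


recommendation = parse_output(fake_model("A coaching tool for first-time builders"))
print(recommendation)
```

The point of the parser is that nothing reaches the user unless it matches the schema. When you swap `fake_model` for a real call, the two failure modes are already handled.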
Day 4 — Write 20 Eval Cases
10 happy path, 5 edge cases, 5 adversarial cases. Run all 20. Score each: Pass, Partial, Fail.
Target: 80% pass rate on happy path, 60% on edge cases, graceful failure on adversarial cases.
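Scoring the 20 cases against those targets can be done in a few lines. The sketch below makes two assumptions you may want to change: only "Pass" counts toward the pass rate (a "Partial" does not), and "graceful failure" on adversarial cases is read as zero "Fail" scores. The sample results are invented purely to show the shape of the data.

```python
from collections import Counter

# Targets from the Day 4 plan; adversarial cases are handled separately.
TARGET_PASS_RATE = {"happy": 0.80, "edge": 0.60}


def summarize(results):
    """Return the pass rate and a target_met flag for each category.

    results is a list of (category, score) pairs, where score is
    "Pass", "Partial", or "Fail".
    """
    by_category = {}
    for category, score in results:
        by_category.setdefault(category, Counter())[score] += 1

    report = {}
    for category, counts in by_category.items():
        total = sum(counts.values())
        rate = counts["Pass"] / total
        if category == "adversarial":
            # Assumption: any outright Fail means the failure was not graceful.
            met = counts["Fail"] == 0
        else:
            met = rate >= TARGET_PASS_RATE[category]
        report[category] = {"pass_rate": rate, "target_met": met}
    return report


# Invented sample run: 10 happy path, 5 edge, 5 adversarial.
sample = (
    [("happy", "Pass")] * 8 + [("happy", "Partial"), ("happy", "Fail")]
    + [("edge", "Pass")] * 3 + [("edge", "Partial")] * 2
    + [("adversarial", "Pass")] * 4 + [("adversarial", "Fail")]
)

report = summarize(sample)
for category, row in report.items():
    status = "ok" if row["target_met"] else "needs work"
    print(f"{category}: {row['pass_rate']:.0%} pass, {status}")
```

In this sample the happy path and edge cases just clear their targets, while the single adversarial "Fail" flags the feature as not yet failing gracefully.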
Day 5 — Show It to Someone
One person in your target audience. Show them the raw output. Ask: "Is this useful?" "What's wrong with it?" "Would you use this regularly?"
Write down exactly what they say.
Days 6-7 — Iterate and Decide
Make one of three decisions:
- Proceed: The feature works, the user wanted it.
- Pivot: The user wanted something adjacent.
- Kill: The AI doesn't add enough value. (Valid outcome. You saved months.)
Week 2: Ship the MVP
Day 8 — Write the Scope Brief
One page. Maximum five features. Then write a "not building" list that's at least as long.
Day 9 — Design the Trust Layer
Use the Trust Design Checklist from Chapter 8 to answer three questions: How will you show outputs? How will users override them? What happens when the AI doesn't know?
Days 10-11 — Build and Deploy
Build the minimal product. Deploy it somewhere with a URL. Your standard: can a stranger use the core AI feature without your help?
Day 12 — Show It to Three People
Watch them use it. Don't explain. Don't help. Note where they get confused, whether they trust the output, and whether they finish the core workflow.
Day 13 — Fix the Top Three Issues
Not features they requested — problems they encountered. Fix those three things. Nothing else.
Day 14 — First Retro
Answer five questions in writing:
- Does the AI add real value? (Based on user reactions, not your opinion.)
- What's your wrapper risk score?
- What's your trust design score?
- What would you need to charge?
- What's the one feature that would make users come back tomorrow?
What Happens After the Sprint
If you finished this sprint, you have something most AI product builders never get: honest data about whether your product works.
You don't have a finished product. You have a validated direction. From here:
- If users wanted it, start building in earnest.
- If users were lukewarm, dig into why and iterate on the weakest link.
- If users didn't care, be grateful you found out in two weeks instead of six months.
The playbook doesn't end here. It loops. Every new feature goes through the same cycle: spec it, eval it, ship it, test it with users, iterate.
Now go build something.
Template: 2-Week Sprint Tracker
| Day | Deliverable | Done? | Key Learning |
|---|---|---|---|
| 1 | Decision Canvas completed | [ ] | |
| 2 | Value Prop Canvas + "so what" test | [ ] | |
| 3 | One AI feature built (structured output) | [ ] | |
| 4 | 20 eval cases written and scored | [ ] | |
| 5 | One user test completed | [ ] | |
| 6-7 | Proceed / Pivot / Kill decision made | [ ] | |
| 8 | Scope brief (max 5 features + "not building" list) | [ ] | |
| 9 | Trust layer designed | [ ] | |
| 10-11 | MVP built and deployed to a live URL | [ ] | |
| 12 | Three user tests completed | [ ] | |
| 13 | Top 3 issues fixed and shipped | [ ] | |
| 14 | Retro document written | [ ] | |
The AI Product Playbook. By Frank Sellhausen. Published by Builder's Path.