Thinking

Sunk Cost Fallacy

Behavioral Economics ▼

Decision, Psychology

Don't Throw Good Money After Bad

We continue investing in something just because we've already put time, money, or effort into it, even when the rational move is to stop.

The story

You're two hours into a terrible movie and think, "I've already spent two hours, might as well finish." That's paying extra misery to justify time already lost.

The trap

It sounds like: "We can't cancel the project; we've spent nine months on it." or "I'll keep the subscription -- maybe I'll use it next year."

Checklist

Ask: "If I were starting fresh today, would I choose this?"
Compare future upside to remaining time, money, and energy.
Set explicit stop conditions before you start.
Treat stopping as a win: you saved future cost.

Use when

You're debating whether to keep a struggling project, investment, product, relationship, or subscription purely because of past effort.

The drill

Name one thing you're only doing because of past effort. What is the smallest step you could take this week to exit or reduce your commitment?

Occam's Razor

William of Ockham ▼

Decision, Strategy

The Simplest Explanation Is Usually Correct

Between competing hypotheses that explain the data equally well, the simplest one -- with the fewest moving parts -- is usually the best starting point.

The story

Your Wi-Fi dies and your first theory is a global cyberwar. Then you realize the router is unplugged. The universe rarely needs a conspiracy when a loose cable will do.

The trap

We jump to exotic explanations when something breaks instead of checking the boring, obvious things first.

Checklist

List all explanations that fit the evidence.
Cross out the ones that require extra assumptions or miracles.
Start by testing the simplest explanation first.
Only add complexity if simple explanations fail.

Use when

You're troubleshooting bugs, outages, or confusing behavior -- or evaluating wild theories about why something happened.

The drill

Take a current problem and write down the most boring explanation you can think of. Test that before anything clever.

Build-Measure-Learn

Eric Ries ▼

Innovation, Product

The Lean Startup Loop

Instead of betting everything on a big launch, you build the smallest thing that can test a hypothesis, measure what happens, and learn whether to pivot or persevere.

The story

You could spend a year perfecting your app... or you could launch a janky landing page this week to see if anyone even wants what you're building.

The trap

Founders fall in love with building and treat learning as a side effect. They over-engineer v1, measure vanity metrics, and learn nothing useful.

Checklist

Start by asking: "What do we need to learn?"
Decide what to measure that will prove or disprove your hypothesis.
Build the smallest thing that can generate that measurement.
Run the loop quickly: build, measure, learn, adjust.

Use when

You're building something new and feel the urge to polish endlessly before anyone sees it.

The drill

Write down your current product hypothesis in one sentence. Now write the smallest experiment you could run in the next 7 days to test it.

Circle of Competence

Charlie Munger ▼

Decision, Strategy

Play Where You Understand the Game

Your circle of competence is the set of domains where you genuinely understand what's going on. You don't have to be an expert at everything -- just know where you are and aren't competent.

The story

A brilliant investor in consumer brands decides to dabble in biotech "for diversification." Spoiler: the molecules took his money.

The trap

Outside your circle, everything looks randomly good or bad. You're just guessing -- but with a dangerous illusion of understanding.

Checklist

Write down the domains where you have real experience and results.
Be honest about where you're merely opinionated, not competent.
Say "no" quickly to opportunities outside your circle.
Deliberately expand the circle by learning and doing, not by pretending.

Use when

You're tempted by a shiny opportunity in a field you don't really understand (but your ego says you'll figure it out on the fly).

The drill

List three domains where you'd confidently bet your own money -- and three where you absolutely shouldn't.

The Eisenhower Matrix

Dwight D. Eisenhower ▼

Decision, Productivity

Urgent vs. Important

Not all tasks are equal. Some are urgent and important (do now), some are important but not urgent (schedule), some are urgent but not important (delegate), and some are neither (delete).

The story

Your inbox is full, Slack is screaming, and yet your big strategic project hasn't moved in weeks. Congratulations -- you've been living in Quadrants 1 and 3.

The trap

We confuse urgency with importance. We feel productive answering pings while our real goals quietly starve in the background.

Checklist

List today's tasks in no particular order.
Label each as urgent/not urgent and important/not important.
Do: urgent + important. Schedule: not urgent + important.
Delegate: urgent + not important. Delete: not urgent + not important.
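
The checklist boils down to a two-flag lookup. A minimal sketch, assuming you have already labeled each task; the task names are hypothetical:

```python
def eisenhower_action(urgent: bool, important: bool) -> str:
    """Map a task's urgent/important labels to one of the four actions."""
    if urgent and important:
        return "do"
    if important:
        return "schedule"
    if urgent:
        return "delegate"
    return "delete"

# Hypothetical task list: (name, urgent, important)
tasks = [
    ("Fix production outage", True, True),
    ("Draft strategy memo", False, True),
    ("Answer routine Slack ping", True, False),
    ("Reorganize old bookmarks", False, False),
]
plan = {name: eisenhower_action(u, i) for name, u, i in tasks}
```

The labeling is the hard part; once the two flags are honest, the action follows mechanically.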

Use when

You feel busy all day but can't point to anything meaningful you actually accomplished.

The drill

Take your current to-do list and ruthlessly delete at least one item that is neither urgent nor important.

Day 1 Philosophy

Jeff Bezos ▼

Strategy, Product

It's Always Day 1

Day 1 means you're still focused on customers, moving fast, and making decisions. Day 2 is stasis, then irrelevance, then death.

The story

Amazon could have become a slow, bureaucratic giant. Instead, Bezos made 'Day 1' a company-wide mantra -- treat every day like you're still scrappy and customer-obsessed.

The trap

It sounds like: 'We're too big to move fast' or 'We need more process' or 'Let's optimize for efficiency over customer experience.'

Checklist

Ask: 'What would a startup do here?'
Prioritize customer experience over internal efficiency.
Make decisions quickly with 70% of the information.
Resist proxies -- process should serve customers, not replace judgment.

Use when

You're scaling and feel the pull toward bureaucracy, slow decisions, or optimizing for internal metrics instead of customer value.

The drill

Name one process or meeting that exists more for internal comfort than customer value. What's the smallest step to eliminate it?

First Principles Thinking

Elon Musk ▼

Innovation, Decision

Break Down to Fundamentals

Instead of copying what exists or reasoning by analogy, strip away assumptions to find the core physics, math, or immutable laws -- then rebuild from there.

The story

Musk didn't ask 'How do we make cheaper rockets?' He asked 'What does a rocket actually need?' Then he built SpaceX from first principles and cut costs by 10x.

The trap

We default to 'best practices' and 'industry standards' without questioning whether they're actually necessary or optimal.

Checklist

List everything you assume is true about the problem.
Question each assumption: 'Is this actually necessary?'
Identify the fundamental constraints (physics, math, laws).
Rebuild the solution from those fundamentals.

Use when

You're stuck optimizing within existing constraints, or everyone says 'that's just how it works' and you suspect there's a better way.

The drill

Pick a current problem. Write down three assumptions everyone accepts. Now question: what if those assumptions are wrong?

Do Things That Don't Scale

Paul Graham ▼

Innovation, Product

Manual Work Before Automation

Startups try to automate before they understand. Manual work teaches you what customers actually want. Once you know, then you automate.

The story

Stripe manually onboarded every customer in the early days. They learned exactly what customers needed, then built the perfect product.

The trap

We try to scale before we understand. We build systems for problems we don't fully understand yet.

Checklist

Don't build a system until you've done it manually 100 times.
Talk to every user personally.
Write personal responses, not templates.
Only automate once you understand the problem deeply.

Use when

You're building something new and feel the urge to automate everything before you understand what customers actually need.

The drill

What's one thing you're trying to automate? Could you do it manually for 10 customers first?

Default Alive vs Default Dead

Paul Graham ▼

Strategy, Decision

Will You Survive Without More Funding?

Most startups are default dead: they burn cash faster than they make it and hope for a miracle. Know whether you're default alive or default dead.

The story

You have 6 months of runway and you're burning $50k/month. You need to raise $2M or you're dead. That's default dead.

The trap

We assume we'll raise more money. We don't calculate if we can survive without it. Most startups are default dead and don't realize it.

Checklist

Calculate your runway in months: cash in the bank / monthly net burn.
Calculate path to profitability: revenue growth vs costs.
If you can't get profitable before running out of cash, you're default dead.
Either find product-market fit fast or cut costs dramatically.
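
A rough way to run the checklist's math, assuming constant monthly costs and steady compounding revenue growth (both simplifications; the function and parameter names are illustrative):

```python
def runway_months(cash: float, monthly_burn: float) -> float:
    """Months of cash left at the current net burn."""
    return cash / monthly_burn

def is_default_alive(cash, revenue, costs, monthly_growth, max_months=120):
    """Simulate month by month with no new funding.

    Assumes constant costs and compounding monthly revenue growth.
    """
    for _ in range(max_months):
        if revenue >= costs:
            return True   # reached profitability before the cash ran out
        cash -= costs - revenue
        if cash < 0:
            return False  # ran out of money first
        revenue *= 1 + monthly_growth
    return False

# $300k in the bank at $50k/month net burn is the 6-month runway from the story.
months_left = runway_months(300_000, 50_000)
```

With $10k of monthly revenue against $60k of costs, 5% monthly growth leaves this hypothetical company default dead; the same company growing 50% a month would make it.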

Use when

You're running a startup and want to know if you're actually viable or just hoping for a miracle.

The drill

Calculate: if you never raise another dollar, will you survive? If no, you're default dead. What's the fastest path to default alive?

Map Model Capabilities

AI Feasibility ▼

Systematically evaluate which AI capabilities your product needs and assess technical feasibility.

When to use

When evaluating if AI is the right solution for your product problem
Before committing to build vs. buy decisions
When scoping an AI feature for the first time

Steps

List all the tasks your product needs AI to perform (e.g., classify images, generate text, predict churn)
For each task, research state-of-the-art (SOTA) capabilities: accuracy, latency, data requirements, cost
Map your product requirements against SOTA: Is current tech good enough? What's the gap?
Prioritize tasks by feasibility + user value. Identify quick wins and long-term bets.

Tips

Add a column for 'Required by Launch' vs. 'Nice to Have'—prevents scope creep
Update this map every 6 months; AI capabilities improve rapidly

Data Availability Assessment

AI Feasibility ▼

Evaluate whether you have sufficient quality data to train or fine-tune AI models effectively.

When to use

Before committing to custom model development
When deciding between pre-trained models vs. fine-tuning
If stakeholders assume AI will work without examining data reality

Steps

Quantify data volume: Count labeled examples per category. Most supervised tasks need 1,000+ examples minimum, 10,000+ for production quality
Assess data quality: Check for label accuracy (>95% correct?), class balance (no category <5% of total), representative coverage of edge cases
Evaluate data accessibility: Where does data live? Can you legally use it for ML? What's the pipeline to access and update it?
Identify data gaps: What scenarios are missing? What would it cost to collect/label the missing data?
Create data roadmap: Can you launch with existing data? When will you have sufficient data for v2 improvements?
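
The volume and balance checks in the first two steps are easy to script. A minimal sketch, assuming your labels fit in a plain list; the thresholds are the rules of thumb quoted above and the dataset is hypothetical:

```python
from collections import Counter

def audit_labels(labels, min_per_class=1000, min_share=0.05):
    """Flag classes that miss the volume or balance thresholds."""
    counts = Counter(labels)
    total = sum(counts.values())
    report = {}
    for cls, n in counts.items():
        issues = []
        if n < min_per_class:
            issues.append("too few examples")
        if n / total < min_share:
            issues.append("under-represented")
        report[cls] = {"count": n, "share": n / total, "issues": issues}
    return report

labels = ["spam"] * 400 + ["ham"] * 9600  # hypothetical dataset
report = audit_labels(labels)
```

Here "spam" fails both checks (400 examples, 4% of the total) while "ham" passes. Label accuracy and edge-case coverage still need a manual review; a count cannot tell you whether the labels are right.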

Tips

Start with data audit before pitching AI features—60% of AI projects fail due to data issues
Budget 30-50% of AI development time for data collection and labeling, not just model work

AI Technical Debt Calculator

AI Feasibility ▼

Estimate the long-term maintenance costs of AI systems beyond initial development.

When to use

When creating business cases for AI investments
Before choosing between simple rules vs. ML solutions
When stakeholders focus only on development costs, ignoring operations

Steps

Calculate model maintenance: Retraining frequency (monthly? quarterly?) × engineer hours per retrain × hourly cost. Add monitoring/on-call costs
Estimate infrastructure costs: Inference compute (API calls × cost per call), training compute, data storage and pipelines
Factor in data pipeline maintenance: Label quality audits, dataset versioning, feature engineering updates, data validation systems
Account for model updates: As AI capabilities improve, you'll need to evaluate and integrate new models every 6-12 months
Compare total 3-year cost: AI solution vs. non-AI alternatives. Include development + operations + opportunity cost
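
The steps above can be rolled into one back-of-the-envelope function. This is a sketch under stated assumptions: every figure below is hypothetical, and monitoring, model-upgrade, and opportunity costs are left out for brevity:

```python
def three_year_ai_cost(dev_cost, retrains_per_year, hours_per_retrain,
                       hourly_rate, monthly_infra, monthly_data_ops):
    """Development cost plus three years of retraining, infra, and data upkeep."""
    yearly_retraining = retrains_per_year * hours_per_retrain * hourly_rate
    yearly_ops = yearly_retraining + 12 * (monthly_infra + monthly_data_ops)
    return dev_cost + 3 * yearly_ops

ai_total = three_year_ai_cost(
    dev_cost=200_000, retrains_per_year=4, hours_per_retrain=40,
    hourly_rate=100, monthly_infra=5_000, monthly_data_ops=3_000,
)
rules_total = 250_000  # hypothetical 3-year cost of a rules-based alternative
ai_premium = ai_total - rules_total
```

In this made-up case the AI route costs $536,000 over three years, so it has to deliver more than $286,000 of extra value to beat the simpler option.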

Tips

Rule of thumb: AI operational costs are 3-5× the initial development cost over 3 years
For low-stakes features, simple heuristics often beat ML when total cost of ownership is considered

Latency Budget Planning

AI Feasibility ▼

Define acceptable response times for AI features and architect systems to meet latency requirements.

When to use

When designing user-facing AI features
Before selecting model architectures or inference infrastructure
If users complain that AI features feel slow

Steps

Define user expectations: Real-time (<100ms)? Interactive (<1s)? Asynchronous (>5s okay)? Base on user research and competitive benchmarks
Break down latency sources: Network roundtrip + model inference + post-processing + database queries. Measure each component
Set component budgets: Allocate total budget across pipeline. Example: 800ms total = 200ms network + 400ms inference + 200ms other
Optimize critical path: Can you use smaller models? Batch predictions? Cache results? Move compute closer to users?
Establish degradation strategy: If model is slow, show partial results, streaming responses, or fallback to faster (less accurate) model

Tips

Aim for <1 second for most user-facing AI features—users perceive longer waits as broken
Test latency at p95 and p99, not just average—tail latency kills UX for real users
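
Both the component budget and the tail-latency tip can be checked with a few lines. A minimal sketch using the 800ms example budget from the steps; the measurements are hypothetical, and the percentile uses the simple nearest-rank definition:

```python
import math

def over_budget(measured_ms: dict, budget_ms: dict) -> dict:
    """Return components whose measured latency exceeds budget, with the overage in ms."""
    return {name: measured_ms[name] - limit
            for name, limit in budget_ms.items()
            if measured_ms[name] > limit}

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

budget = {"network": 200, "inference": 400, "other": 200}    # 800 ms total
measured = {"network": 150, "inference": 520, "other": 180}  # hypothetical readings
overages = over_budget(measured, budget)

latencies = [120, 130, 135, 140, 150, 155, 160, 170, 900, 2400]  # hypothetical, in ms
median_ms = percentile(latencies, 50)
p95_ms = percentile(latencies, 95)
```

In this sample the median is a comfortable 150ms while p95 is 2,400ms: most requests feel instant, but one user in twenty waits more than two seconds.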

Edge Case Scenario Mapping

AI Feasibility ▼

Systematically identify and prioritize edge cases where AI will fail, then design mitigation strategies.

When to use

After initial feasibility testing shows promise
Before launching AI features to production
When designing AI user experience and error handling

Steps

Brainstorm failure modes: Run team workshop listing scenarios where AI might fail (rare inputs, ambiguous cases, adversarial examples, distribution shifts)
Collect real edge cases: Review support tickets, user feedback, and competitive failures. Test your prototype with extreme inputs
Quantify frequency and impact: Estimate % of users affected × severity of bad outcome. Create 2×2 matrix of frequency/impact
Prioritize mitigation: High-frequency or high-impact cases need solutions before launch. Low/low can ship with monitoring
Design fallback strategies: Human review, confidence thresholds, fallback to simpler methods, explicit 'AI can't help here' messages
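
The frequency/impact matrix can be kept as a small script next to your test cases. A sketch, assuming you score severity from 1 to 5; the cutoffs and example cases are illustrative, not values from the text:

```python
def triage(cases, freq_cutoff=0.01, severity_cutoff=3):
    """Split edge cases into 'mitigate before launch' and 'ship with monitoring'.

    freq is the estimated share of users affected; severity is a 1-5 score.
    Both cutoffs are assumptions to tune for your product.
    """
    plan = {}
    for name, freq, severity in cases:
        high = freq >= freq_cutoff or severity >= severity_cutoff
        plan[name] = "mitigate before launch" if high else "ship with monitoring"
    return plan

cases = [
    ("blurry photo upload", 0.08, 2),      # frequent, mild
    ("adversarial prompt", 0.001, 5),      # rare, severe
    ("rare dialect phrasing", 0.002, 1),   # rare, mild
]
triage_plan = triage(cases)
```

Only the low-frequency, low-impact case ships with monitoring alone; anything that is either common or severe gets a mitigation before launch.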

Tips

Plan for 5-20% of inputs to hit edge cases in production—AI is never 100% accurate
Show your edge case matrix to legal/trust & safety teams early—some failures have regulatory or PR risk

Multi-Model Strategy Design

AI Feasibility ▼

Plan when and how to combine multiple AI models to solve complex product problems.

When to use

When a single model can't meet all product requirements
When building AI products with multiple capabilities (e.g., search + summarization + recommendations)
Before scaling initial AI prototypes into full product suites

Steps

Map capabilities to models: Break your product into distinct AI tasks (classification, generation, ranking, etc.). Assign best-fit model type to each
Design model orchestration: Sequential (Model A → Model B)? Parallel (A + B → combine results)? Conditional (if A confident, skip B)?
Manage dependencies: What happens if Model A fails? Does Model B still work? Build fallback chains and circuit breakers
Optimize for cost and latency: Can you run cheaper/faster models first, then escalate to expensive models only when needed?
Version and deploy independently: Each model should have its own versioning, monitoring, and rollback capability
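
Conditional orchestration with a fallback chain might look like the sketch below. The three "models" are stand-in functions that return an answer plus a confidence score, and the 0.8 threshold is an assumption, not a recommendation:

```python
from typing import Callable, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (answer, confidence)

def cascade(query: str, cheap: Model, expensive: Model,
            threshold: float = 0.8, fallback: str = "needs human review") -> str:
    """Try the cheap model first, escalate when it is unsure,
    and return a fixed fallback if every model fails."""
    for model in (cheap, expensive):
        try:
            answer, confidence = model(query)
        except Exception:
            continue  # this model is down; try the next one
        if confidence >= threshold or model is expensive:
            return answer
    return fallback

# Stand-in models for illustration only.
def cheap_model(q):
    return ("short answer", 0.9 if len(q) < 20 else 0.4)

def expensive_model(q):
    return ("detailed answer", 0.95)

def broken_model(q):
    raise TimeoutError("model unavailable")
```

A short query stops at the cheap model, a long one escalates, and a query that hits two broken models gets the explicit fallback instead of an error.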

Tips

Start with single model, add models only when user needs clearly justify complexity
Use smaller, specialized models over one large model when possible—lower cost, faster, easier to debug

AI Unit Economics Model

Business Model & Pricing ▼

Calculate the true cost per user or per action for AI features to ensure sustainable economics.

When to use

Before launching AI products with usage-based costs
When setting pricing for AI-powered features
If AI costs are growing faster than revenue

Steps

Calculate cost per prediction: (Inference costs + model hosting + data pipeline costs) / number of predictions. Track separately for different models/features
Estimate average usage per user: Based on product analytics or beta testing, how many AI actions does typical user take per month?
Model cost at scale: User base × actions per user × cost per action. Project at 10×, 100×, 1000× current scale
Determine unit economics target: For SaaS, aim for LTV:CAC of 3:1. For freemium, AI costs should be <30% of revenue per paying user
Identify optimization levers: Can you cache results? Batch requests? Use cheaper models for simple queries? Set usage caps?
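
The first three steps are plain arithmetic. A minimal sketch with hypothetical monthly figures; it assumes cost per action stays flat as you grow, which real volume discounts and caching would change:

```python
def cost_per_prediction(inference, hosting, pipeline, predictions):
    """Total monthly AI cost divided by monthly prediction volume."""
    return (inference + hosting + pipeline) / predictions

def monthly_cost_at_scale(users, actions_per_user, cost_per_action):
    """Projected monthly AI spend for a given user base."""
    return users * actions_per_user * cost_per_action

unit_cost = cost_per_prediction(inference=3_000, hosting=1_500, pipeline=500,
                                predictions=1_000_000)

# Project from a hypothetical 5,000 users taking 40 AI actions a month each.
projection = {scale: monthly_cost_at_scale(5_000 * scale, 40, unit_cost)
              for scale in (1, 10, 100, 1000)}
```

At half a cent per prediction, today's bill is about $1,000 a month and the 1000× scenario is about $1M a month. That last number is the one to compare against projected revenue.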

Tips

OpenAI/Anthropic costs drop 50-90% yearly—don't over-optimize current pricing, but do monitor costs weekly
Set usage limits for free tiers to prevent runaway costs—Notion AI limits free users to 20 actions

Define AI Value Proposition

Value Proposition ▼

Articulate the specific value AI delivers to users, beyond what non-AI solutions can provide.

When to use

When pitching AI features to stakeholders or users
Before writing product specs or user stories
When differentiating your AI product from competitors

Steps

Identify the user job: What task are users trying to accomplish? What's the current pain?
Define the AI advantage: What can AI do that rule-based systems or manual processes can't?
Quantify the benefit: Time saved? Better accuracy? Personalization? New capabilities?
Test the value prop: Share with 5-10 target users. Do they get excited? Do they see the benefit?

Tips

Focus on outcomes, not technology: "Get instant answers" not "Powered by GPT-4"
Avoid "AI for AI's sake"—if a simpler solution works, use it

AI Feature Pricing Strategy

Business Model & Pricing ▼

Determine how to monetize AI capabilities: bundled, add-on, usage-based, or premium tier.

When to use

Before launching AI features to customers
When deciding whether AI justifies price increases
If competitors are undercutting your AI pricing

Steps

Assess AI value perception: Does AI unlock new use cases or just improve existing workflows? New capabilities justify premium pricing
Benchmark competitive pricing: Survey 5-10 competitors. Are they charging for AI separately or bundling? What's the price premium?
Model pricing options: bundled free, add-on flat fee, usage-based, or higher tier only. Calculate revenue and adoption for each
Test willingness to pay: Run pricing surveys or A/B tests with beta users. What % would pay $X for AI features?
Choose initial strategy: Start conservative (bundle free or low add-on), then raise prices as value is proven. Price increases are harder to land than cuts, so present the starting price as introductory

Tips

Usage-based pricing aligns incentives but adds billing complexity—only use if users heavily value usage flexibility
Avoid 'AI tax' perception—if AI just makes existing features slightly better, don't charge separately

Freemium AI Strategy

Business Model & Pricing ▼

Design free vs. paid AI feature splits that drive conversion while controlling costs.

When to use

When adding AI to existing freemium products
If free tier AI costs are unsustainable
When optimizing free-to-paid conversion rates

Steps

Define free tier AI budget: Calculate sustainable cost per free user (e.g., $0.10-0.50/month). Convert to action limits (e.g., 20 AI queries/month)
Identify conversion-driving features: Which AI capabilities are 'need to have' for power users? Gate those behind paywall after taste
Design progression path: Free tier = 'try it' (10-50 actions). Paid tier = 'use it daily' (unlimited or high cap like 500/month)
Implement soft limits: Don't hard-block at limit. Show 'X uses left this month' warnings. Offer a one-time top-up or let users wait until the next monthly reset
Monitor conversion metrics: What % of free users hit limits? What % convert within 7 days of hitting limit? Adjust limits to optimize revenue
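
Turning a cost budget into an action limit, and the limit into soft warnings, can be sketched as follows. The function names, the one-cent cost per action, and the warn-at-5 threshold are all assumptions for illustration:

```python
import math

def monthly_action_limit(budget_per_user: float, cost_per_action: float) -> int:
    """Convert a per-user monthly cost budget into a whole number of free actions."""
    # The small tolerance absorbs floating-point rounding in the division.
    return math.floor(budget_per_user / cost_per_action + 1e-9)

def usage_notice(used: int, limit: int, warn_at: int = 5) -> str:
    """Soft-limit messaging: warn as the cap approaches rather than failing silently."""
    remaining = limit - used
    if remaining <= 0:
        return "Limit reached: upgrade or wait for the monthly reset"
    if remaining <= warn_at:
        return f"{remaining} uses left this month"
    return ""

limit = monthly_action_limit(budget_per_user=0.20, cost_per_action=0.01)
```

A $0.20 budget at one cent per action gives 20 free actions a month, which matches the example limit in the first step.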

Tips

Monthly resets create urgency—users convert when they need AI now, not when they accumulate limits over time
Make free tier generous enough for authentic trial—less than 10 AI actions feels like a demo, not a product

Usage-Based vs. Seat-Based Pricing

Business Model & Pricing ▼

Choose the right pricing model for AI products by evaluating usage patterns and customer preferences.

When to use

When designing pricing for new AI products
If customers complain about current pricing model
When usage varies widely across customer segments

Steps

Analyze usage distribution: Plot AI actions per user. If variance is low (most users similar), seat-based works. If high variance (10× difference), usage-based fits better
Assess customer preference: Enterprise prefers predictable costs (seat-based). Startups prefer pay-as-you-grow (usage-based). Survey target customers
Model revenue scenarios: Calculate ARR under each model at different growth stages. Which maximizes revenue at 100, 1000, 10000 customers?
Consider operational complexity: Usage-based requires real-time metering, billing reconciliation, and overage management. Seat-based is simpler
Test hybrid approaches: Base seat price + usage overages (Anthropic model). Or tiered usage buckets (Notion AI: $10 for 200 actions)
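
One way to put a number on "high variance" is the ratio between heavy and light users. The sketch below compares the 90th and 10th percentiles of monthly actions per user; treating a 10× spread as the cutoff follows the first step, while the helper name and sample data are hypothetical:

```python
import statistics

def suggest_pricing_model(actions_per_user, spread_cutoff=10.0):
    """Return a suggested model and the p90/p10 usage ratio."""
    deciles = statistics.quantiles(actions_per_user, n=10)
    p10, p90 = deciles[0], deciles[-1]
    spread = p90 / max(p10, 1)  # guard against division by zero
    model = "usage-based" if spread >= spread_cutoff else "seat-based"
    return model, spread

even_usage = [40, 45, 50, 55, 60, 48, 52, 47, 53, 58]
skewed_usage = [2, 3, 5, 8, 10, 20, 50, 200, 800, 1500]
even_model, _ = suggest_pricing_model(even_usage)
skewed_model, _ = suggest_pricing_model(skewed_usage)
```

Treat the output as a starting point for the customer survey in the second step, not a verdict.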

Tips

Default to seat-based for B2B SaaS—procurement prefers predictable budgets, and sales cycles are faster
Use usage-based for API products or when AI is core value prop and usage varies 10×+ across customers

Run a Model Feasibility Spike

AI Feasibility ▼

Test if your AI idea is technically possible by building a quick prototype in 1-2 weeks.

When to use

When stakeholders doubt whether AI can solve your problem
Before committing to a multi-month AI development roadmap
When you need to choose between multiple AI approaches

Steps

Define success criteria - Write down the minimum bar: "If the model can X with Y% accuracy, it's feasible."
Timebox the spike - Allocate 1-2 weeks maximum. Set a deadline for demo.
Use shortcuts - Pre-trained models, small datasets, manual labeling, cloud notebooks.
Build and evaluate - Train/fine-tune model. Test against success criteria.
Make go/no-go decision - If you hit the bar, green-light the project. If not, pivot or kill feature.

Tips

Set up tracking from day 1 of the spike—you'll want metrics to show stakeholders
Don't polish UX or code quality—this is a throwaway prototype

Enterprise AI Packaging

Business Model & Pricing ▼

Design AI product tiers and packaging that align with enterprise buying processes and budgets.

When to use

When selling AI products to companies with 1000+ employees
If enterprise deals stall due to pricing or packaging concerns
When building multi-year roadmap for enterprise features

Steps

Create enterprise tier: Include SSO, audit logs, data residency, SLAs, dedicated support, custom contracts. Price 3-5× higher than self-serve tiers
Offer volume discounts: Tiered pricing based on seats/usage. 100-499 users = 10% off, 500-999 = 20% off, 1000+ = custom pricing
Bundle services: Professional services for implementation, training, custom model fine-tuning. Charge separately or include in annual contracts
Design annual commit incentives: Offer 15-25% discount for annual prepay vs. monthly. Reduces churn and improves cash flow
Build custom pricing tools: Sales team needs calculator to quickly quote multi-year, multi-product, multi-region deals. Automate approval workflows
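
A first version of the quoting calculator can be tiny. This sketch assumes a 500-seat order lands in the 20% tier, that volume and prepay discounts stack multiplicatively, and that 1000+ seats are routed to a human; all three are assumptions to confirm with your sales team:

```python
def volume_discount(seats: int) -> float:
    """Discount tiers from the steps above; 1000+ seats go to custom pricing."""
    if seats >= 1000:
        raise ValueError("1000+ seats: route to custom pricing")
    if seats >= 500:
        return 0.20
    if seats >= 100:
        return 0.10
    return 0.0

def annual_quote(seats, monthly_price_per_seat, prepay_discount=0.0):
    """Yearly contract value after volume and annual-prepay discounts."""
    list_price = seats * monthly_price_per_seat * 12
    return list_price * (1 - volume_discount(seats)) * (1 - prepay_discount)

# Hypothetical deal: 200 seats at $50/seat/month with a 15% annual prepay discount.
quote = annual_quote(200, 50, prepay_discount=0.15)
```

That deal lists at $120,000 a year and quotes at about $91,800 after both discounts.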

Tips

Enterprise sales cycles are 6-12 months—ensure trial/POC pricing covers your costs but removes friction
Security and compliance are table stakes, not upsells—include in base enterprise tier or risk disqualification

AI ROI Projection Model

Business Model & Pricing ▼

Build data-driven ROI models that help customers justify AI product investments to their executives.

When to use

When selling high-cost AI products to enterprise
If sales team struggles to justify AI pricing
When creating case studies and marketing materials

Steps

Identify cost savings: Time saved per user × hourly cost × number of users × weeks per year. Example: 5 hours/week × $50/hour × 100 users × 52 weeks = $1.3M/year
Quantify revenue impact: Increased conversion, faster sales cycles, better retention. Tie AI features to revenue metrics with A/B test data
Build ROI calculator: Create spreadsheet or web tool where prospects input their metrics (team size, salaries, current processes). Auto-calculate payback period
Validate with case studies: Get 3-5 customers to share actual ROI achieved. Use median results as the expected case for prospects, not the best outcome
Present tiered scenarios: Conservative (10th percentile outcomes), expected (median), optimistic (90th percentile). Let buyers choose their assumptions
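
The core of the ROI calculator fits in two functions. In this sketch, payback is defined as the month when cumulative savings net of the subscription cover the one-time implementation cost, which is one reasonable definition among several; the price and implementation figures are hypothetical:

```python
def annual_time_savings(hours_per_week, hourly_cost, users, weeks_per_year=52):
    """Yearly value of time saved across all users."""
    return hours_per_week * hourly_cost * users * weeks_per_year

def payback_months(annual_benefit, annual_price, implementation_cost):
    """Months until net monthly benefit repays the implementation cost."""
    monthly_net = (annual_benefit - annual_price) / 12
    if monthly_net <= 0:
        return float("inf")  # the product never pays for itself
    return implementation_cost / monthly_net

savings = annual_time_savings(hours_per_week=5, hourly_cost=50, users=100)
payback = payback_months(savings, annual_price=400_000, implementation_cost=150_000)
```

With 5 hours a week saved for 100 users at $50/hour, savings come to $1.3M a year, and this hypothetical deal pays back in 2 months. Rerun it with the buyer's own conservative inputs before putting the number on a slide.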

Tips

Aim for <6 month payback period for SMB, <12 months for enterprise—longer periods face budget scrutiny
Include implementation costs in ROI model—honest projections build trust and set realistic expectations

Build vs. Buy vs. API Decision

AI Feasibility ▼

Systematically evaluate whether to build models in-house, buy commercial solutions, or use API services.

When to use

After validating technical feasibility of your AI feature
When stakeholders ask about cost and timeline for AI development
Before assembling your AI product team

Steps

Map your requirements: accuracy needs, customization level, data sensitivity, scale, budget, timeline
Evaluate APIs: Test 2-3 providers. Check if they meet accuracy bar, pricing, and latency requirements
Evaluate buy options: Commercial ML platforms. Consider vendor lock-in, customization limits
Evaluate build: Estimate team size, timeline, infrastructure costs. Do you have ML expertise?
Create decision matrix: Score each option on cost, time-to-market, quality, control, and scalability
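
The decision matrix in the last step is a weighted sum. A sketch with made-up weights and 1-5 scores; replace both with your own before trusting the ranking:

```python
# Illustrative weights (sum to 1.0); higher scores are better on every criterion.
WEIGHTS = {"cost": 0.25, "time_to_market": 0.25, "quality": 0.20,
           "control": 0.15, "scalability": 0.15}

def weighted_score(scores: dict) -> float:
    """Weighted sum of 1-5 scores across the five criteria."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

options = {
    "api":   {"cost": 4, "time_to_market": 5, "quality": 4, "control": 2, "scalability": 4},
    "buy":   {"cost": 3, "time_to_market": 4, "quality": 3, "control": 3, "scalability": 3},
    "build": {"cost": 2, "time_to_market": 1, "quality": 4, "control": 5, "scalability": 4},
}
ranking = sorted(options, key=lambda name: weighted_score(options[name]), reverse=True)
```

With these inputs the API route ranks first, consistent with the tip below, but shifting weight toward control would push build up the list. Agree on the weights before anyone sees the scores.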

Tips

Most teams should start with APIs—fastest path to validation
Build only when APIs can't meet requirements or when AI is your core differentiator

AI Cost Containment Tactics

Business Model & Pricing ▼

Implement strategies to reduce AI infrastructure and API costs without sacrificing user experience.

When to use

When AI costs are growing faster than revenue
Before raising prices or cutting features due to costs
When optimizing for profitability after growth phase

Steps

Implement caching: Cache frequent queries/prompts. GitHub Copilot caches common code completions, reducing API calls 40%
Use tiered models: Route simple queries to cheaper models (GPT-3.5), complex to expensive (GPT-4). Classification model decides routing
Optimize prompts: Shorter prompts = lower costs. Test if you can achieve same quality with 50% fewer tokens. Use prompt compression techniques
Batch requests: Combine multiple API calls into single batch request where latency allows. Reduces overhead costs
Set usage quotas: Implement per-user rate limits to prevent abuse and runaway costs. Alert users before hitting limits
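
Caching and quotas are the two tactics that need the least infrastructure. A minimal in-memory sketch, where `call_model` is a stand-in for a paid API call and the normalization rule (lowercase, collapsed whitespace) is an assumption about which prompts are safe to treat as identical:

```python
from collections import defaultdict
from functools import lru_cache

api_calls = 0  # counts calls that actually reach the stand-in model

def call_model(prompt: str) -> str:
    """Stand-in for a paid API call; replace with a real client."""
    global api_calls
    api_calls += 1
    return f"answer to: {prompt}"

@lru_cache(maxsize=1024)
def cached_answer(normalized_prompt: str) -> str:
    return call_model(normalized_prompt)

def answer(prompt: str) -> str:
    # Normalizing case and whitespace lets near-identical prompts share a cache entry.
    return cached_answer(" ".join(prompt.lower().split()))

class Quota:
    """Per-user request cap for one billing period."""
    def __init__(self, limit: int):
        self.limit = limit
        self.used = defaultdict(int)

    def allow(self, user_id: str) -> bool:
        if self.used[user_id] >= self.limit:
            return False
        self.used[user_id] += 1
        return True

answer("What is churn?")
answer("  what is   CHURN? ")  # served from cache after normalization
```

After both calls, `api_calls` is 1. A production setup would likely need a shared cache and persistent counters, but the logic stays the same.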

Tips

Audit your top 10% of users—often 5-10% of users drive 50%+ of costs. Target optimizations or pricing to them
Monitor cost per active user weekly—catch problems early before they become existential

AI Feature Prioritization

Roadmap & Prioritization ▼

Systematically prioritize which AI features to build first based on value, feasibility, and strategic fit.

When to use

When planning quarterly or annual AI roadmaps
If stakeholders disagree on which AI features to build
When you have more AI ideas than engineering capacity

Steps

Score user value: Rate each feature 1-10 on user impact. Base on user interviews, surveys, and revenue potential. Weight by user segment size
Assess technical feasibility: Rate 1-10 based on data availability, model maturity, engineering complexity. Get ML team input
Evaluate strategic alignment: Does this AI feature support core product strategy? Build competitive moat? Enable platform vision?
Estimate effort: T-shirt size (S/M/L/XL) for development time. Include data prep, model training, integration, and testing
Calculate priority score: (Value × Strategic Fit) / Effort. Feasibility acts as a filter—don't build infeasible ideas regardless of value
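
To use the formula, T-shirt sizes need numbers and strategic fit needs a score. The sketch below assumes a doubling effort scale (S=1, M=2, L=4, XL=8), a 1-10 strategic fit rating, and a feasibility floor of 4; none of these come from the steps, and the features are hypothetical:

```python
EFFORT_POINTS = {"S": 1, "M": 2, "L": 4, "XL": 8}  # assumed mapping

def priority(value, strategic_fit, effort, feasibility, min_feasibility=4):
    """(Value x Strategic Fit) / Effort, with feasibility as a pass/fail filter."""
    if feasibility < min_feasibility:
        return None  # filtered out regardless of value
    return value * strategic_fit / EFFORT_POINTS[effort]

features = {
    "smart search":   dict(value=8, strategic_fit=7, effort="M", feasibility=8),
    "auto-summaries": dict(value=6, strategic_fit=5, effort="S", feasibility=9),
    "full autopilot": dict(value=10, strategic_fit=9, effort="XL", feasibility=2),
}
scores = {name: priority(**f) for name, f in features.items()}
```

The small auto-summaries feature edges out smart search (30 vs. 28), and the most exciting idea is filtered out entirely because it isn't feasible yet.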

Tips

Build 'quick wins' first (high value, low effort) to build momentum and credibility for AI program
Avoid 'AI for AI's sake'—if a non-AI solution scores higher on value/effort, build that instead

AI Feature Sequencing

Roadmap & Prioritization ▼

Plan the optimal order to release AI features based on dependencies, learning, and user adoption.

When to use

When building multi-feature AI product roadmaps
If early AI features failed to gain traction
When planning phased rollouts over 6-12 months

Steps

Map feature dependencies: Which features require data from others? Which share models or infrastructure? Build dependency graph
Identify learning milestones: Which features teach you about user behavior, model performance, or data quality that inform later features?
Plan adoption curve: Start with features that drive frequent engagement (daily use). Delay features that need behavior change until users are habituated
Balance quick wins and strategic bets: Alternate between fast-shipping incremental features and longer-term platform investments
Design version gates: V1 = prove value with simple approach. V2 = improve quality with better models. V3 = scale with platform features
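
The dependency graph from the first step can be turned into a valid release order with a topological sort. A sketch using Python's standard `graphlib` module (Python 3.9+); the feature names and their dependencies are hypothetical:

```python
from graphlib import TopologicalSorter

# Each feature maps to the set of features it depends on.
dependencies = {
    "usage logging": set(),
    "search ranking": {"usage logging"},
    "recommendations": {"usage logging", "search ranking"},
    "weekly digest": {"recommendations"},
}

# static_order() yields every feature after all of its dependencies.
release_order = list(TopologicalSorter(dependencies).static_order())
```

If two features depend on each other, `static_order()` raises `graphlib.CycleError`, which is a useful early signal that the roadmap needs untangling.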

Tips

Ship user-facing AI value in first 60 days—builds credibility and user excitement for future features
Don't boil the ocean—better to ship 3 excellent AI features than 10 mediocre ones

Crawl-Walk-Run AI Roadmap

Roadmap & Prioritization ▼

Structure AI product evolution in three phases: simple MVP, improved accuracy, and scaled platform.

When to use

When planning multi-year AI product strategy
If stakeholders push for perfect AI before any launch
When communicating AI maturity stages to executives

Steps

Crawl (Months 1-3): Ship simplest AI that provides value. Use pre-trained models, limit scope, manual fallbacks. Goal: prove users want this
Walk (Months 4-9): Improve accuracy and coverage. Fine-tune models, expand training data, reduce edge cases. Goal: daily use by core users
Run (Months 10-18): Scale and automate. Custom models, real-time retraining, platform features. Goal: product differentiator at scale
Define success metrics for each phase: Crawl = engagement. Walk = quality scores. Run = competitive moat metrics
Communicate trade-offs: Crawl is fast but imperfect. Walk is better but not ready for all use cases. Run is mature but requires investment

Tips

Don't skip Crawl—90% of AI learnings come from real users, not internal testing
Plan on 12-18 months to reach a mature Run phase; AI platforms require sustained investment to build moats

Minimum Viable AI Feature

Roadmap & Prioritization ▼

Define the smallest AI feature that delivers real user value and validates core hypotheses.

When to use

When starting new AI product initiatives
If AI projects are taking too long to ship
When stakeholders want to add too many capabilities before launch

Steps

Identify core user job: What's the single most important task AI helps with? Cut everything else for V1
Define minimum quality bar: What accuracy/latency is 'good enough' to be useful? Don't aim for perfection—aim for better than status quo
Limit initial scope: Constrain to single use case, user segment, or content type. Example: AI summaries for docs only, not all content
Use existing tools: Pre-trained models, third-party APIs, manual fallbacks. Build custom solutions only after validating demand
Set learning goals: What do you need to learn from V1 to inform V2? Design experiments to answer key questions

Tips

Ship MVAI in 4-6 weeks—if it takes longer, scope is too big
Perfect is the enemy of shipped—60% accuracy that users love beats 95% accuracy that never launches

AI Experiment Framework

Roadmap & Prioritization ▼

Design and run controlled experiments to validate AI product hypotheses before full development.

When to use

When testing new AI feature ideas with uncertain value
Before committing to expensive AI development
If stakeholders need proof that AI will drive metrics

Steps

Define hypothesis: Clear, testable statement. Example: 'AI-generated summaries will increase doc engagement by 20%'
Choose experiment type: A/B test (AI vs. control), Wizard of Oz (humans simulate AI), prototype (limited real AI), or survey (measure willingness to use)
Set success criteria: What metrics move? By how much? What's the minimum effect size to justify building?
Design minimal experiment: Use power analysis to find the smallest sample size and shortest duration that can detect your minimum effect size at your chosen significance level
Analyze and decide: If hypothesis validated, green-light feature. If invalidated, pivot or kill. If inconclusive, run follow-up experiment

Tips

Wizard of Oz experiments (humans pretending to be AI) are faster than building real AI—use for early validation
Run experiments on 5-10% of users initially—limits risk if AI performs poorly
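
The power analysis step can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch, not a full experiment design: the 10% baseline engagement rate is a made-up assumption, and the function name `sample_size_per_arm` is hypothetical. The 20% relative lift mirrors the example hypothesis in the steps.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Assumed baseline: 10% of users engage with docs today; target: 20% relative lift.
n = sample_size_per_arm(baseline_rate=0.10, relative_lift=0.20)
print(f"Users needed per arm: {n}")
```

Smaller minimum effect sizes push the required sample up quickly, which is why the success criteria should be fixed before the experiment is sized.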

AI Feature Kill Criteria

Roadmap & Prioritization ▼

Establish clear conditions for when to shut down or deprioritize AI features that aren't working.

When to use

Before launching new AI features
When AI features have low adoption despite investment
If engineering resources are spread too thin across AI initiatives

Steps

Set adoption thresholds: Define minimum active users or usage frequency. Example: If <10% of users try feature after 3 months, kill it
Define quality floors: Minimum acceptable accuracy, latency, or user satisfaction scores. If model can't hit bar after 2 improvement cycles, kill
Establish cost ceilings: Maximum cost per user or cost as % of revenue. If unit economics don't improve to target within 6 months, kill
Monitor competitive position: If competitors ship superior AI faster, evaluate whether to kill and copy or double down on differentiation
Create kill decision process: Who decides? How often do you review? What's the communication plan to users and stakeholders?

Tips

Review AI features quarterly—technology and user needs evolve fast, yesterday's good idea may be today's distraction
Celebrate kills as much as launches—killing bad features is good product management
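
Kill criteria are easier to enforce when they are written down as checks before launch. The sketch below encodes the example thresholds from the steps; the dictionary keys and the sample feature values are hypothetical.

```python
def kill_reasons(feature):
    """Return the kill criteria a feature currently trips (empty list = keep)."""
    reasons = []
    if feature["months_live"] >= 3 and feature["trial_rate"] < 0.10:
        reasons.append("adoption below 10% after 3 months")
    if feature["improvement_cycles"] >= 2 and feature["accuracy"] < feature["accuracy_floor"]:
        reasons.append("quality floor missed after 2 improvement cycles")
    if feature["months_live"] >= 6 and feature["cost_per_user"] > feature["cost_ceiling"]:
        reasons.append("unit economics above ceiling after 6 months")
    return reasons

# Illustrative feature snapshot for a quarterly review.
summaries = {
    "months_live": 4, "trial_rate": 0.07,
    "improvement_cycles": 1, "accuracy": 0.81, "accuracy_floor": 0.85,
    "cost_per_user": 0.40, "cost_ceiling": 0.50,
}
print(kill_reasons(summaries))
```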

AI Tech Debt Prioritization

Roadmap & Prioritization ▼

Systematically prioritize AI technical debt against new features to maintain sustainable development velocity.

When to use

When AI features are slowing down due to accumulated tech debt
If model performance is degrading or infrastructure is brittle
When planning roadmap balance between new features and improvements

Steps

Categorize AI tech debt: Model debt (outdated models, drift), data debt (stale datasets, pipeline brittleness), infra debt (scaling issues, monitoring gaps), code debt (ML code quality)
Assess impact: How does each debt item affect user experience, development velocity, costs, or risk? Rate 1-10 on each dimension
Estimate effort: Size each debt item (S/M/L/XL). Get ML team input on complexity and dependencies
Calculate debt ROI: (Impact on velocity + risk reduction) / Effort. Prioritize highest ROI debt first
Allocate capacity: Dedicate 20-30% of AI engineering capacity to tech debt each quarter. Don't let it slip to 0% or accumulate to 100%

Tips

Address model drift and data quality debt immediately—these directly impact users and compound over time
Trade-off rule: If new feature will create significant debt, either fix existing debt first or simplify feature scope
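
The debt ROI step can be sketched as a small ranking script. The backlog items, their 1-10 ratings, and the numeric weights assigned to the S/M/L/XL sizes are illustrative assumptions; the playbook does not say how to convert sizes to numbers.

```python
# Hypothetical effort weights for T-shirt sizes (assumption: effort doubles per size).
EFFORT_POINTS = {"S": 1, "M": 2, "L": 4, "XL": 8}

def debt_roi(velocity_impact, risk_reduction, effort_size):
    """(Impact on velocity + risk reduction) / effort, using 1-10 ratings."""
    return (velocity_impact + risk_reduction) / EFFORT_POINTS[effort_size]

# Illustrative backlog: (name, velocity impact, risk reduction, effort size).
backlog = [
    ("Retrain drifting ranking model", 6, 9, "M"),
    ("Add pipeline monitoring", 7, 8, "S"),
    ("Refactor feature store", 8, 5, "XL"),
]

ranked = sorted(backlog, key=lambda item: debt_roi(*item[1:]), reverse=True)
for name, velocity, risk, size in ranked:
    print(f"{name}: ROI = {debt_roi(velocity, risk, size):.2f}")
```

With these numbers the small monitoring task ranks first and the XL refactor last, even though the refactor has the highest velocity impact.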

Model Refresh Cadence

Roadmap & Prioritization ▼

Plan regular cycles to evaluate and upgrade AI models as technology improves and data grows.

When to use

When setting up AI product development processes
If models are getting stale but team has no refresh plan
When planning long-term AI platform investments

Steps

Set evaluation cadence: Review new model releases quarterly. For fast-moving areas (LLMs), monthly. Track benchmarks and release notes
Define upgrade triggers: Automatic upgrade if new model improves accuracy >10%, reduces latency >30%, or cuts costs >50% with no quality loss
Plan testing windows: Allocate 1-2 weeks per quarter for ML team to test new models against production data and metrics
Manage version transitions: Run A/B tests (old model vs. new) before full rollout. Keep rollback plan for 2 weeks post-deployment
Schedule major refreshes: Every 6-12 months, revisit model architecture fundamentally. Is there a better approach than current solution?

Tips

Don't chase every model release—upgrade only when clear user benefit or cost savings justify the work
Document model lineage—track which model version was used when for debugging and compliance
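
The upgrade triggers can be written as an explicit check so the decision is repeatable. This sketch assumes the percentages are relative changes against the current model and reads "no quality loss" as "accuracy does not drop"; the metric key names are hypothetical.

```python
def upgrade_triggers(current, candidate):
    """Return which upgrade triggers the candidate model satisfies.

    Each argument is a dict with 'accuracy' (0-1), 'latency_ms', and
    'cost_per_1k' keys (hypothetical metric names).
    """
    accuracy_gain = (candidate["accuracy"] - current["accuracy"]) / current["accuracy"]
    latency_cut = (current["latency_ms"] - candidate["latency_ms"]) / current["latency_ms"]
    cost_cut = (current["cost_per_1k"] - candidate["cost_per_1k"]) / current["cost_per_1k"]
    no_quality_loss = candidate["accuracy"] >= current["accuracy"]

    triggers = []
    if accuracy_gain > 0.10:
        triggers.append("accuracy")
    if latency_cut > 0.30 and no_quality_loss:
        triggers.append("latency")
    if cost_cut > 0.50 and no_quality_loss:
        triggers.append("cost")
    return triggers

current = {"accuracy": 0.80, "latency_ms": 400, "cost_per_1k": 2.0}
candidate = {"accuracy": 0.82, "latency_ms": 250, "cost_per_1k": 0.9}
print(upgrade_triggers(current, candidate))
```

An empty list means the release can be skipped, which matches the tip about not chasing every model release.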

AI Platform vs. Feature Decision

Roadmap & Prioritization ▼

Decide when to invest in reusable AI infrastructure vs. building point solutions for specific features.

When to use

After shipping 2-3 successful AI features
When engineering velocity on AI features is slowing
If considering building in-house ML platform capabilities

Steps

Identify pattern repetition: Are you solving similar AI problems 3+ times? Similar data pipelines? Similar model patterns? Repetition justifies platform
Calculate platform ROI: (time saved per feature × number of future features) ÷ cost to build platform, with savings and cost in the same units. ROI > 3× justifies investment
Assess team maturity: Platform work requires senior ML/infra engineers. Do you have the talent? Can you hire or train?
Evaluate build vs. buy: Can you use external platforms (SageMaker, Vertex AI, Hugging Face) instead of building? Usually cheaper and faster
Phase platform investment: Don't build the whole platform upfront. Start with highest-pain areas (e.g., model deployment, monitoring) and expand

Tips

Default to features until you have 5+ AI use cases in production—premature platform work is waste
Platform work takes 2-3× longer than estimated—only invest when truly needed for scale
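
A quick worked calculation shows how sensitive the platform decision is to estimation error. The sketch computes ROI as total savings divided by build cost, both in engineering weeks; all input numbers are invented for illustration.

```python
def platform_roi(build_cost_weeks, weeks_saved_per_feature, future_features):
    """Engineering weeks saved across future features per week spent on the platform."""
    return (weeks_saved_per_feature * future_features) / build_cost_weeks

# Optimistic estimate: 20 weeks to build, 6 weeks saved on each of 12 features.
estimated = platform_roi(build_cost_weeks=20, weeks_saved_per_feature=6, future_features=12)

# Same plan if the build takes 2.5x longer than estimated.
padded = platform_roi(build_cost_weeks=20 * 2.5, weeks_saved_per_feature=6, future_features=12)

print(f"Estimated ROI: {estimated:.1f}x (invest: {estimated > 3})")
print(f"Padded ROI:    {padded:.2f}x (invest: {padded > 3})")
```

The optimistic plan clears the 3× bar, but the padded plan does not, so it is worth running the numbers with a pessimistic build estimate before committing.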

Write a Clear Problem Statement

Problem Discovery ▼

Frame the user problem you're solving before jumping to AI solutions.

When to use

At the very start of any AI initiative—before technical feasibility
When stakeholders ask 'Can we use AI for X?'
When your team is solution-focused instead of problem-focused

Steps

Identify the user: Who specifically experiences this problem? Be specific (e.g., 'sales reps at mid-market SaaS companies' not 'users')
Describe the problem: What job are they trying to do? What's blocking them? What workarounds exist today?
Quantify the pain: How often does this happen? How much time/money does it cost? How many users affected?
Articulate why now: Why hasn't this been solved yet? What's changed that makes solving it possible or urgent now?
Write the one-sentence problem statement: '[User] struggles to [job/goal] because [obstacle], which causes [negative outcome]'

Tips

If you can't write the problem statement without mentioning AI, you're solution-shopping—start over
Test your problem statement with 3-5 potential users. If they don't immediately relate, it's too vague

Conduct Customer Discovery Interviews

Problem Discovery ▼

Run interviews that uncover real problems, not what users think you want to hear.

When to use

Before building any AI feature—validate the problem exists
When usage data doesn't explain why users behave a certain way
When stakeholders disagree about what problem to solve

Steps

Recruit the right users: Talk to people who have the problem NOW, not people who might someday. Aim for 8-12 interviews
Ask about past behavior, not future intent: 'Tell me about the last time you tried to do X' beats 'Would you use a feature that does X?'
Dig into workarounds: 'How do you handle this today?' reveals pain severity. Complex workarounds = high pain worth solving
Follow the 5 Whys: When they mention a problem, ask 'Why is that a problem?' 5 times to get to root cause
Listen for emotion and specifics: 'That's so frustrating' or detailed stories signal real pain. Vague answers signal low priority

Tips

Never pitch your solution during discovery—you're there to learn, not to sell
Record and transcribe interviews. Patterns across 5+ interviews are more reliable than your memory

Apply Jobs to Be Done Framework

Problem Discovery ▼

Understand what users are really hiring your product to do.

When to use

When users request features but you suspect they have a deeper need
When trying to understand why users choose your product over alternatives
Before designing AI features—understand the job first

Steps

Identify the job: What progress is the user trying to make? What outcome do they want? (Not tasks, but end states)
Map the job timeline: When do they realize they have this job? What triggers them to look for a solution? What happens after?
Identify forces at play: Push forces (problems with current solution), pull forces (attraction to new solution), anxiety (fear new solution won't work), and habits (comfort with current solution)
Find underserved needs: Which parts of the job are poorly served today? Where do users overcompensate or accept trade-offs?
Frame AI solutions around the job: How can AI help users make progress faster, cheaper, or with less risk?

Tips

The job is rarely what users say it is—'I need a faster horse' means 'I need to get somewhere faster'
Look for jobs with high anxiety or strong habits—these are hard to solve but create switching costs once you do

Validate Problem Severity

Problem Discovery ▼

Confirm the problem is painful enough that users will actually use your solution.

When to use

After identifying a potential problem but before building anything
When stakeholders claim 'everyone has this problem' but you have no evidence
Before prioritizing which of several problems to solve first

Steps

Measure frequency: How often do users encounter this problem? Daily = high severity. Monthly = low. One-off = don't solve
Assess impact: What's the cost when this problem occurs? Time lost? Money lost? Emotional toll? Quantify it
Check current solutions: What do users do today? If workarounds are cheap/easy, your solution needs to be 10× better to win
Test willingness to change: Ask 'If I could solve this perfectly, would you switch from your current solution?' Hesitation = low severity
Validate with multiple signals: Users complain about it unprompted, users pay for bad workarounds, and users abandon tasks because it's too hard

Tips

If users say it's a problem but won't schedule a follow-up call, it's not painful enough
Problems you discover in interviews > problems users report in surveys. Actions > words

Size the Opportunity

Problem Discovery ▼

Estimate if solving this problem is big enough to justify AI investment.

When to use

After validating problem severity, before building business case
When choosing between multiple validated problems to solve
When executives ask 'How big is this opportunity?'

Steps

Count affected users: How many people/companies have this problem? Use customer data, market research, or proxy metrics
Estimate willingness to pay: Survey 20+ users. Ask 'What would you pay to solve this?' Use median, not mean (outliers skew)
Calculate TAM (total addressable market): Affected users × willingness to pay × purchase frequency. This is your ceiling
Estimate SAM (serviceable addressable market): Of TAM, how much can you realistically reach with your sales/distribution? Usually 10-30% of TAM
Project SOM (serviceable obtainable market): What % of SAM can you capture in 3 years? Realistic first-mover = 5-15%, fast follower = 2-8%

Tips

TAM > $100M justifies significant AI investment. TAM < $10M rarely justifies custom ML—use APIs instead
Don't confuse market size with your opportunity. $1B TAM × 1% share = $10M business, not $1B business
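
The sizing steps chain together as simple arithmetic. In this sketch every input is invented for illustration: the user count, the survey responses (shortened to seven for readability; the steps call for 20+), and the reach and capture percentages.

```python
from statistics import median

def size_opportunity(affected_users, wtp_responses, purchases_per_year,
                     reachable_share, capture_share):
    """Return (TAM, SAM, SOM) in annual currency units."""
    willingness_to_pay = median(wtp_responses)  # median resists outliers
    tam = affected_users * willingness_to_pay * purchases_per_year
    sam = tam * reachable_share
    som = sam * capture_share
    return tam, sam, som

# Illustrative monthly willingness-to-pay answers in dollars; note the 500 outlier.
responses = [20, 25, 30, 30, 35, 40, 500]
tam, sam, som = size_opportunity(
    affected_users=200_000, wtp_responses=responses, purchases_per_year=12,
    reachable_share=0.20, capture_share=0.10,
)
print(f"TAM ${tam:,.0f}  SAM ${sam:,.0f}  SOM ${som:,.0f}")
```

Using the mean here would let the single 500 response inflate willingness to pay, and TAM with it, by more than three times.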

Run a Design Sprint

Problem Discovery ▼

Quickly prototype and validate AI solutions with users in 5 days.

When to use

After validating the problem, before committing to full development
When stakeholders want proof the solution will work
When choosing between multiple AI approaches

Steps

Monday - Map the problem: Define the long-term goal, map the user journey, pick a target moment to focus the sprint
Tuesday - Sketch solutions: Each person sketches how AI could solve the problem. Vote on strongest ideas. No coding yet
Wednesday - Decide: Critique sketches, vote on one solution to prototype. Storyboard the user experience step-by-step
Thursday - Prototype: Build a realistic fake (Wizard of Oz). Use mockups + humans behind the scenes to simulate AI. No real ML
Friday - Test with 5 users: Watch them use the prototype. Look for confusion, delight, and whether they'd use it. Decide: build it or pivot

Tips

Don't build real AI during the sprint—use fake data or humans pretending to be AI. You're testing UX, not models
5 user tests reveal 85% of usability issues. More tests = diminishing returns

AI Risk Categories Overview

Primers ▼

Understand the complete landscape of risks unique to AI products and when each type matters most.

When to use

When starting your first AI product initiative
Before creating a risk management plan for AI features
When onboarding stakeholders or executives to AI product development

Steps

Model Risks: Performance degradation, bias, drift, adversarial attacks. Critical for accuracy-dependent features.
Data Risks: Quality issues, privacy violations, poisoning. Critical when handling sensitive or regulated data.
User Safety & Trust: Harmful outputs, misaligned expectations, transparency gaps. Critical for consumer-facing AI.
Ethical Considerations: Fairness, discrimination, unintended consequences. Critical for high-stakes decisions.
Legal & Compliance: Regulatory requirements, IP issues, liability. Critical in regulated industries.
Operational Risks: Deployment failures, scaling issues, cost overruns. Critical at high scale or tight margins.

Tips

Start with User Safety & Trust for consumer products; Legal & Compliance for enterprise
Revisit this map quarterly—new AI risks emerge as your product matures

Risk Assessment Framework

Primers ▼

Systematically evaluate and prioritize AI risks using likelihood, impact, and detection difficulty.

When to use

When planning a new AI feature or product launch
After identifying multiple risks and needing to prioritize mitigation efforts
When justifying risk management investments to leadership

Steps

List all identified risks: Use the AI Risk Categories Overview as your checklist
Score each risk: Likelihood (1-5), Impact (1-5), Detection Difficulty (1-5). Multiply for total score.
Prioritize by score (maximum 125): >75 = critical (address before launch), 50-75 = high (address within 30 days), 25-49 = medium (monitor), <25 = low (document only)
Create mitigation plan: For each critical/high risk, define prevention, detection, and response tactics
Assign owners: Every risk needs a DRI (Directly Responsible Individual)

Tips

Re-assess risks monthly in first 90 days post-launch—real user behavior reveals hidden risks
Include diverse stakeholders in scoring—PMs, engineers, legal, support teams see different risks
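
The scoring and prioritization steps can be sketched as one function. The band edges follow the thresholds in the steps, with the assumption that a score of exactly 50 counts as high and exactly 25 as medium; the function name is hypothetical.

```python
def risk_priority(likelihood, impact, detection_difficulty):
    """Multiply three 1-5 ratings and map the score (1-125) to a priority band."""
    for rating in (likelihood, impact, detection_difficulty):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be between 1 and 5")
    score = likelihood * impact * detection_difficulty
    if score > 75:
        band = "critical"
    elif score >= 50:
        band = "high"
    elif score >= 25:
        band = "medium"
    else:
        band = "low"
    return score, band

# Illustrative risks: (name, likelihood, impact, detection difficulty).
risks = [
    ("Silent model drift", 4, 4, 5),
    ("PII leak in outputs", 2, 5, 3),
    ("Latency spike at peak", 3, 2, 1),
]
for name, *ratings in risks:
    print(name, risk_priority(*ratings))
```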

Detect and Prevent Overfitting

Model Risks ▼

Ensure your model generalizes to real-world data instead of just memorizing training examples.

When to use

When your model shows great training metrics but poor real-world performance
Before committing to a model for production deployment
When stakeholders question why AI performance doesn't match development claims

Steps

Split data properly: 70% train, 15% validation, 15% test. Never let test data touch training.
Compare train vs. validation metrics: If train accuracy is 95% but validation is 75%, you're overfitting
Apply regularization: Use dropout, L1/L2 regularization, early stopping. Start with dropout=0.2-0.5.
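Increase training data: More diverse examples help. A rough baseline for simple models is about 10 examples per model parameter; large neural networks rarely meet this and lean on pre-training and regularization instead.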
Validate on production-like data: Test on data sampled from actual user scenarios, not just held-out training data

Tips

Red flag: a gap of more than 10 percentage points between training and validation metrics usually signals overfitting
For small datasets (<10K examples), use k-fold cross-validation instead of single split
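
The split and the train-versus-validation comparison can be sketched with the standard library alone. The helper names are hypothetical, and the 10-point gap threshold is the rule of thumb from the tips rather than a universal constant.

```python
import random

def split_dataset(examples, seed=0):
    """Shuffle once, then cut into 70% train, 15% validation, 15% test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train_end = len(shuffled) * 70 // 100
    val_end = len(shuffled) * 85 // 100
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

def is_overfitting(train_accuracy, val_accuracy, max_gap=0.10):
    """Flag a train/validation gap larger than max_gap (10 percentage points)."""
    return (train_accuracy - val_accuracy) > max_gap

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
print(is_overfitting(train_accuracy=0.95, val_accuracy=0.75))
```

Splitting once up front, before any training or tuning, is what keeps the test set from touching training.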

Detect Model Drift

Model Risks ▼

Monitor when real-world data patterns change, causing your model's performance to degrade.

When to use

When setting up production monitoring for AI features
If users report AI quality declining over time
Every 30-90 days post-launch as routine health check

Steps

Track input distribution: Monitor feature distributions weekly. Use histograms, summary stats, KL divergence from baseline.
Track prediction distribution: Are outputs shifting? E.g., if your classifier suddenly predicts 80% class A vs. historical 50%, investigate.
Monitor model metrics: Track accuracy, precision, recall on live data (requires ground truth labels)
Set drift thresholds: If KL divergence >0.1 or accuracy drops >5%, trigger alert
Create retraining playbook: Define when to retrain (monthly default), who approves, how to A/B test new model

Tips

Use shadow mode for new models—run in parallel with production model for 1-2 weeks before switching
Seasonal businesses: Expect drift. Retrain models before peak seasons (holiday retail, tax season, etc.)
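
Input drift monitoring with KL divergence can be sketched as follows. This is a minimal version under stated assumptions: the bin edges and sample data are made up, empty bins are handled with add-one smoothing, and the divergence is computed as KL(current || baseline), which is one of two possible directions.

```python
import math

def histogram(values, bin_edges):
    """Count values into bins [edge_i, edge_i+1) and return smoothed proportions."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) + len(counts)  # add-one smoothing avoids log(0)
    return [(c + 1) / total for c in counts]

def kl_divergence(current, baseline):
    """KL(current || baseline) in nats for two discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(current, baseline))

edges = [0, 25, 50, 75, 100]
baseline_values = [i % 100 for i in range(1000)]      # spread evenly over 0-99
current_values = [50 + (i % 50) for i in range(1000)]  # shifted into 50-99

drift = kl_divergence(histogram(current_values, edges), histogram(baseline_values, edges))
print(f"KL divergence: {drift:.3f}, alert: {drift > 0.1}")
```

The same bin edges must be reused for every comparison; recomputing them per batch would hide the shift the check is meant to catch.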

Defend Against Adversarial Attacks

Model Risks ▼

Protect your model from malicious inputs designed to cause incorrect predictions or harmful outputs.

When to use

Before launching AI features with financial impact (fraud detection, lending, pricing)
For user-generated content moderation systems
When AI controls access to resources or benefits

Steps

Threat model your feature: Who benefits from gaming the system? What would they try? (e.g., spam filter evasion, face recognition spoofing)
Test adversarial robustness: Use libraries like CleverHans, Foolbox. Generate adversarial examples for your model.
Implement defenses: Input validation, adversarial training (retrain on adversarial examples), ensemble models
Add detection layer: Monitor for suspicious input patterns (e.g., small perturbations, repeated similar inputs)
Build human review workflow: Flag high-stakes decisions or suspicious patterns for manual review

Tips

Start with input sanitization—often cheaper and more effective than complex adversarial training
For image/audio models, check for small pixel/noise perturbations that flip predictions

Model Explainability Framework

Model Risks ▼

Make AI decisions understandable to users, auditors, and internal teams for trust and compliance.

When to use

When building AI for regulated industries (finance, healthcare, hiring)
If users need to understand why AI made specific recommendations
Before launching AI features that impact high-stakes user decisions

Steps

Define your audience: End users need simple explanations; regulators need full audit trails; ML engineers need feature importance.
Choose explanation method: SHAP/LIME for feature importance, attention visualization for transformers, decision trees for simple rules
Build explanation UI: Show top 3-5 factors influencing each prediction. Use plain language, not technical jargon.
Document model cards: For each model, document training data, intended use, limitations, performance metrics
Test explanations: Show to 10 target users. Do they understand? Do they trust the AI more?

Tips

Start with global explanations (how the model works overall) before per-prediction explanations
For black-box models, consider building a simpler interpretable 'proxy model' for explanations

Bias Detection in Models

Model Risks ▼

Systematically test for unfair outcomes across demographic groups and use cases.

When to use

Before launching AI that affects people's opportunities (hiring, lending, housing)
When AI serves diverse user populations
As part of regular model audits (quarterly minimum for high-stakes AI)

Steps

Identify protected attributes: Age, gender, race, disability status, etc. Check applicable laws (GDPR, ECOA, FHA).
Measure performance by group: Calculate accuracy, false positive rate, false negative rate for each demographic
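Apply fairness metrics: Demographic parity (equal positive-prediction rates across groups), equalized odds (equal true positive and false positive rates across groups), individual fairness (similar individuals get similar predictions)
Set fairness thresholds: E.g., false positive rates must be within 5 percentage points of each other across all groups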
Document disparities: If bias detected, decide: retrain with balanced data, adjust decision thresholds, add human review

Tips

Even if you don't collect demographic data, test on diverse synthetic or proxy datasets
Involve domain experts and affected communities in defining what 'fair' means for your use case
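
Measuring performance by group and checking it against a fairness threshold can be sketched like this. The group names and counts are synthetic, and the 5-point threshold is the example value from the steps.

```python
from collections import defaultdict

def false_positive_rates(records):
    """records: iterable of (group, actual_label, predicted_label) with 0/1 labels."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 0:  # only true negatives can become false positives
            negatives[group] += 1
            if predicted == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / negatives[g] for g in negatives}

def fairness_gap(rates):
    """Largest difference in rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Synthetic audit data: 100 true negatives per group plus some positives.
records = (
    [("A", 0, 1)] * 10 + [("A", 0, 0)] * 90 +
    [("B", 0, 1)] * 18 + [("B", 0, 0)] * 82 +
    [("A", 1, 1)] * 50
)
rates = false_positive_rates(records)
gap = fairness_gap(rates)
print(rates, f"gap = {gap:.2f}", "FAIL" if gap > 0.05 else "PASS")
```

The same pattern extends to false negative rates by filtering on `actual == 1` instead.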

Model Performance Degradation

Model Risks ▼

Plan for and monitor how model performance changes over time in production.

When to use

Before launching AI features in production
When setting up monitoring and alerting systems
If users report declining AI quality

Steps

Baseline your metrics: Record accuracy, precision, recall at launch. This is your reference point.
Set up monitoring: Track model metrics daily/weekly. Use tools like MLflow, Weights & Biases.
Define degradation thresholds: If accuracy drops >5%, trigger alert. If >10%, pause feature.
Create response playbook: Who gets alerted? How fast do you retrain? What's the communication plan?
Schedule regular retraining: Monthly or quarterly, depending on data freshness needs

Tips

Monitor input data distribution too—shifts in user behavior often cause model drift
Keep a "champion/challenger" system—always have a backup model ready

Handle Model Uncertainty

Model Risks ▼

Quantify and communicate when your model is uncertain about predictions to prevent overconfidence.

When to use

When AI predictions have variable confidence levels
For high-stakes decisions where wrong predictions are costly
When users need to understand AI reliability before acting

Steps

Calibrate confidence scores: Use temperature scaling or Platt scaling. Test: Do 90% confidence predictions succeed 90% of the time?
Define uncertainty thresholds: <50% confidence = reject, 50-80% = human review, >80% = auto-approve
Surface uncertainty to users: Show confidence scores, use language like 'high/medium/low confidence', explain implications
Build fallback workflows: When model is uncertain, route to human review, simpler heuristic, or ask user for more input
Monitor uncertainty patterns: Are certain user segments or scenarios consistently high-uncertainty? Investigate why.

Tips

For neural networks, use dropout at inference time (Monte Carlo dropout) to estimate uncertainty
Never auto-execute high-stakes actions when confidence is below your calibrated threshold
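
The calibration test and the threshold routing from the steps can be sketched together. The bin count and sample predictions are made up; the routing bands are the ones listed above, with 80% exactly treated as human review.

```python
def calibration_table(predictions, n_bins=5):
    """predictions: list of (confidence in [0, 1], was_correct bool).

    Returns rows of (bin_low, bin_high, mean_confidence, accuracy, count) for
    non-empty bins; a calibrated model has mean_confidence close to accuracy.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in predictions:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    rows = []
    for i, items in enumerate(bins):
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            rows.append((i / n_bins, (i + 1) / n_bins, mean_conf, accuracy, len(items)))
    return rows

def route(confidence):
    """Map a calibrated confidence to the action bands from the steps."""
    if confidence < 0.50:
        return "reject"
    if confidence <= 0.80:
        return "human_review"
    return "auto_approve"

# Synthetic results: well calibrated at 0.90, overconfident at 0.65.
predictions = ([(0.90, True)] * 9 + [(0.90, False)] * 1 +
               [(0.65, True)] * 3 + [(0.65, False)] * 7)
for low, high, mean_conf, accuracy, count in calibration_table(predictions):
    print(f"{low:.1f}-{high:.1f}: confidence {mean_conf:.2f}, accuracy {accuracy:.2f}, n={count}")
```

Routing on raw scores only makes sense once the table shows confidence and accuracy roughly agreeing in each bin.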

Model Ensemble Strategies

Model Risks ▼

Combine multiple models to improve reliability, reduce bias, and provide fallback options.

When to use

When single-model accuracy isn't meeting requirements
To reduce risk of model failures in production
When different models excel at different edge cases

Steps

Choose ensemble approach: Voting (majority rule), averaging (mean confidence), stacking (meta-model learns from base models)
Select diverse models: Different architectures (e.g., tree-based + neural net), different training data subsets, different hyperparameters
Define aggregation rules: For classification, use majority voting or weighted voting. For regression, use weighted average.
Test performance vs. cost: Measure accuracy gain vs. latency and compute cost. Aim for >5% accuracy improvement to justify.
Implement fallback logic: If models disagree significantly, route to human review or use most conservative prediction

Tips

Start with 3-5 models—diminishing returns beyond that for most applications
For latency-sensitive apps, run models in parallel rather than sequentially
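
Majority voting with a disagreement fallback can be sketched in a few lines. The 60% agreement floor is an assumed value, not one given in the playbook, and the labels are illustrative.

```python
from collections import Counter

def ensemble_vote(predictions, min_agreement=0.6):
    """Majority vote over model predictions; escalate when agreement is weak.

    Returns (label, agreement), with label None when the winning share is
    below min_agreement and the case should go to human review.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(predictions)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement

print(ensemble_vote(["spam", "spam", "ham"]))
print(ensemble_vote(["spam", "ham", "unsure"]))
```

A weighted vote would replace the raw counts with per-model weights, for example each model's validation accuracy.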

Data Quality Validation

Data Risks ▼

Systematically check training and production data for errors, inconsistencies, and quality issues.

When to use

Before training any ML model
When setting up data pipelines for production AI
If model performance unexpectedly degrades

Steps

Define quality checks: Completeness (missing values <5%?), accuracy (spot-check samples), consistency (format/range validation), timeliness (data freshness)
Automate validation: Use tools like Great Expectations, Pandera. Run checks on every data batch before training/inference.
Set quality thresholds: Define minimum acceptable quality. E.g., >95% complete records, <1% invalid formats.
Monitor data drift: Track feature distributions over time. Alert if statistical properties shift significantly.
Create data rejection policy: Automatically reject batches below quality thresholds. Never train on bad data.

Tips

Add schema validation as first line of defense—catches 80% of data quality issues
Keep examples of 'bad data' in a test suite to prevent regression
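
The threshold and rejection steps can be sketched as a batch gate. This is a hand-rolled illustration rather than the Great Expectations or Pandera API; the field names and the age range check are hypothetical.

```python
def validate_batch(records, required_fields, is_valid,
                   min_complete=0.95, max_invalid=0.01):
    """Return (accepted, completeness, invalid_rate) for one batch of dict records."""
    if not records:
        return False, 0.0, 0.0
    complete = [
        r for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    ]
    completeness = len(complete) / len(records)
    invalid_rate = sum(1 for r in complete if not is_valid(r)) / len(records)
    accepted = completeness >= min_complete and invalid_rate < max_invalid
    return accepted, completeness, invalid_rate

def age_in_range(record):
    """Hypothetical format/range check for one record."""
    return 0 < record["age"] < 120

good_batch = ([{"user_id": i, "age": 30} for i in range(97)] +
              [{"user_id": i, "age": None} for i in range(97, 100)])
bad_batch = ([{"user_id": i, "age": 30} for i in range(90)] +
             [{"user_id": i, "age": None} for i in range(90, 100)])

print(validate_batch(good_batch, ["user_id", "age"], age_in_range))
print(validate_batch(bad_batch, ["user_id", "age"], age_in_range))
```

Running the gate before both training and inference means a rejected batch never reaches the model.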

Data Privacy & Compliance

Data Risks ▼

Ensure your AI systems handle user data in compliance with GDPR, CCPA, and other privacy regulations.

When to use

Before collecting any user data for ML training
When launching AI features in new geographic markets
After privacy regulations change or during audits

Steps

Map data flows: Document what data you collect, where it's stored, who accesses it, how long you keep it
Get proper consent: Users must opt-in to data collection for ML training. Separate from general product usage consent.
Implement data minimization: Only collect data necessary for model training. Aggregate or anonymize when possible.
Enable data deletion: Support 'right to be forgotten' (GDPR). Document how you remove user data from training sets and models.
Audit regularly: Quarterly review of data practices. Test that deletion workflows actually work.

Tips

For EU users, you need explicit consent and must explain ML model usage in privacy policy
Consider differential privacy techniques if working with sensitive data (medical, financial)

Training Data Contamination

Data Risks ▼

Prevent and detect when training data contains errors, biases, or malicious examples that corrupt your model.

When to use

Before starting model training, especially with user-generated or scraped data
When model behavior is unexpected or problematic
After discovering anomalies in training data sources

Steps

Audit data sources: Where does training data come from? How was it collected? What's the sampling methodology?
Check for label quality: Measure inter-annotator agreement (Kappa score >0.7 is good). Review disputed labels.
Detect outliers: Use statistical methods to find anomalous examples. Manually review top 1% most unusual data points.
Test for distribution bias: Compare training data demographics/scenarios to real user population. Fill gaps.
Version training datasets: Use data versioning (DVC, Pachyderm). Track exactly what data trained each model version.

Tips

For crowd-sourced labels, require 3+ labelers per example and take majority vote
Spot-check 100 random training examples yourself—fastest way to catch systemic issues

Data Poisoning Defense

Data Risks ▼

Protect your training pipeline from malicious actors injecting harmful examples to corrupt your model.

When to use

When training on user-generated content or external data sources
For content moderation or fraud detection systems
If your AI influences high-value decisions or resource allocation

Steps

Identify attack vectors: Can users submit training data? Can attackers access your data pipeline? What's the threat model?
Implement data validation: Sanitize inputs, check for suspicious patterns (duplicates, extremes, coordinated submissions)
Use trusted data sources: Prefer curated datasets over unfiltered web scraping. Verify data provenance.
Apply outlier detection: Use statistical methods or anomaly detection models to flag suspicious training examples
Monitor model behavior: Test trained models on known-good validation sets. Alert if performance drops unexpectedly.

Tips

For user-contributed training data, require minimum account age/reputation before accepting submissions
Keep a 'clean' holdout dataset that never touches user-generated data for validation

Data Pipeline Failure Response

Data Risks ▼

Plan for and recover from data pipeline outages that break model training or inference.

When to use

When building production ML data pipelines
After experiencing a data pipeline incident
Before launch of AI features with real-time data dependencies

Steps

Map pipeline dependencies: Document data sources, transformations, storage, and downstream consumers. Identify single points of failure.
Build monitoring: Alert on pipeline failures (job failures, data quality drops, missing data, latency spikes)
Create fallback data: Cache recent data for inference. If live data fails, fall back to cached version for 24-48 hours.
Define recovery procedures: Document step-by-step recovery (restart jobs, backfill data, validate outputs, notify stakeholders)
Practice incident response: Run fire drills quarterly. Simulate pipeline failures and test recovery procedures.

Tips

Set up dual alerting: page on-call engineer AND send non-urgent alert to PM
For critical pipelines, implement automated rollback to last known good state
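
The fallback-data step can be sketched as a thin wrapper around a live fetch. The class and parameter names are hypothetical, and the 24-hour default is the lower end of the window mentioned in the steps.

```python
import time

class CachedSource:
    """Serve live data when possible, else cached data up to a staleness limit."""

    def __init__(self, fetch_live, max_staleness_s=24 * 3600, clock=time.time):
        self._fetch_live = fetch_live
        self._max_staleness_s = max_staleness_s
        self._clock = clock
        self._cached = None
        self._cached_at = None

    def get(self):
        """Return (value, origin) where origin is 'live' or 'cache'."""
        try:
            value = self._fetch_live()
        except Exception:
            if self._cached_at is None:
                raise  # nothing cached yet, surface the failure
            if self._clock() - self._cached_at > self._max_staleness_s:
                raise  # cache is too old to trust
            return self._cached, "cache"
        self._cached, self._cached_at = value, self._clock()
        return value, "live"
```

Returning the origin alongside the value lets monitoring count how often inference is running on cached data, which is itself a signal worth alerting on.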

Labeling Quality Assurance

Data Risks ▼

Ensure high-quality, consistent labels for supervised learning through systematic QA processes.

When to use

When setting up a data labeling operation (in-house or vendor)
If model performance is below expectations despite good architecture
Before scaling up labeling efforts

Steps

Create labeling guidelines: Write clear, detailed instructions with examples. Include edge cases and ambiguous scenarios.
Train labelers: Require all labelers to complete training set. Must score >90% agreement with gold standard.
Measure inter-annotator agreement: Have 10-20% of data labeled by multiple people. Calculate Cohen's Kappa (two annotators) or Fleiss' Kappa (three or more). Target >0.7.
Implement review process: Subject matter experts review 5-10% of labels. Provide feedback to labelers.
Track labeler performance: Monitor agreement rates per labeler. Provide additional training or remove low-performing labelers.

Tips

For subjective tasks, accept that perfect agreement is impossible. Kappa of 0.6-0.7 may be acceptable.
Use active learning: have model flag most uncertain examples for human review first
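
Cohen's Kappa for two annotators is short enough to compute by hand, which helps when explaining the number to a labeling vendor. The label lists below are synthetic.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Synthetic overlap set: the annotators agree on 90 of 100 items.
annotator_a = ["pos"] * 45 + ["neg"] * 45 + ["pos"] * 5 + ["neg"] * 5
annotator_b = ["pos"] * 45 + ["neg"] * 45 + ["neg"] * 5 + ["pos"] * 5
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

Raw agreement here is 90%, but half of that would be expected by chance with balanced labels, which is why kappa comes out lower than the raw figure.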

PII Detection & Redaction

Data Risks ▼

Automatically detect and remove personally identifiable information from training data and model outputs.

When to use

When working with user-generated content or communication data
Before sharing data with labeling vendors or third parties
When building AI features that process sensitive information

Steps

Define PII scope: Names, emails, phone numbers, addresses, SSN, credit cards, medical records, etc. Check applicable regulations.
Implement detection: Use regex patterns, named entity recognition (NER) models, or managed services such as Amazon Macie or the Google Cloud DLP API
Apply redaction strategy: Replace with tokens ([NAME], [EMAIL]) or synthetic data. Don't just delete—preserve context.
Validate effectiveness: Manually review sample of redacted data. Run PII detection on model outputs periodically.
Document exceptions: Some use cases require PII. Document why, how it's protected, and retention policies.

Tips

For text generation models, add PII detection as post-processing step before showing outputs to users
Test with creative PII formats—attackers use l33tspeak, Unicode, and other tricks to evade detection
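
Token-based redaction with regex patterns can be sketched as follows. The three patterns are deliberately simple illustrations for US-style formats; they will miss names, addresses, and obfuscated variants, which is where NER models or managed services come in.

```python
import re

# Illustrative patterns only; order matters, so SSNs are replaced before phones.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)")),
]

def redact(text):
    """Replace each matched PII span with a placeholder token, keeping context."""
    for token, pattern in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

sample = "Email jane.doe@example.com or call 415-555-0123. SSN 123-45-6789."
print(redact(sample))
```

Keeping the placeholder tokens, rather than deleting the match, preserves sentence structure for downstream training.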

Synthetic Data Generation

Data Risks ▼

Create artificial training data to augment real data, protect privacy, or handle rare scenarios.

When to use

When you lack sufficient real training data
To protect user privacy while maintaining data utility
To oversample rare but important scenarios (fraud, safety incidents)

Steps

Choose generation method: Rule-based (for structured data), GANs (for images), language models (for text), data augmentation (transforms)
Validate realism: Statistical tests comparing synthetic vs. real data distributions. Use domain experts to review samples.
Measure utility: Train one model on real data and one on synthetic data, then evaluate both on the same real held-out set. A performance drop of more than 10% suggests the synthetic data isn't good enough.
Check for privacy leaks: Ensure the generator has not memorized real examples and that synthetic records do not reproduce them
Document limitations: Synthetic data may not capture all real-world complexity. Test models on real held-out data.

Tips

Start with data augmentation (rotations, crops, paraphrasing)—simpler and lower risk than full synthesis
For regulated industries, validate that synthetic data satisfies same compliance requirements as real data
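
The utility and privacy checks from the Steps reduce to two small calculations. This sketch assumes the 10% drop is relative to the real-data score and that records are hashable tuples; both helper names and all sample values are hypothetical. Exact-match checking is only the weakest leak test, since near-duplicates need a distance measure.

```python
def relative_drop(real_score, synthetic_score):
    """Fractional performance lost when training on synthetic instead of real data."""
    if real_score <= 0:
        raise ValueError("real_score must be positive")
    return (real_score - synthetic_score) / real_score

def exact_leaks(real_rows, synthetic_rows):
    """Synthetic rows that are verbatim copies of real rows."""
    real_set = set(real_rows)
    return [row for row in synthetic_rows if row in real_set]

# Both scores are assumed to come from the same real held-out test set.
drop = relative_drop(real_score=0.90, synthetic_score=0.78)
too_lossy = drop > 0.10  # threshold taken from the Steps

real = [("alice", 34, "NYC"), ("bob", 51, "LA")]
synthetic = [("carol", 29, "SF"), ("bob", 51, "LA")]
leaks = exact_leaks(real, synthetic)
```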

Harmful Output Prevention

User Safety & Trust ▼

Block AI from generating dangerous, offensive, or harmful content through multi-layered safety systems.

When to use

Before launching any generative AI feature (text, image, code)
When AI outputs are user-facing or influence user decisions
Required for consumer applications, especially those accessible to minors

Steps

Define harm taxonomy: Violence, hate speech, sexual content, self-harm, illegal activity, misinformation. Prioritize by severity and likelihood.
Implement input filters: Block prompts requesting harmful content. Use keyword lists + classifier models.
Apply output filters: Scan all generated content before showing to users. Use content moderation APIs + custom classifiers.
Set confidence thresholds: >0.9 = block automatically, 0.7-0.9 = human review, <0.7 = allow with monitoring
Build escalation workflow: Repeated violations trigger account review. Store blocked attempts for analysis.

Tips

Layer multiple filters—no single filter is perfect. Aim for 99.5%+ harmful content blocked.
Red-team your system monthly: try to generate harmful content and update filters based on findings
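
Layering filters and applying the confidence thresholds can be sketched as one routing function. This assumes each layer returns a harm score in [0, 1] and that layers are combined by taking the maximum; the blocklist term and `stub_classifier` are placeholders, not a real moderation model.

```python
BLOCKLIST = {"forbidden-term"}  # placeholder keyword list

def keyword_score(text):
    """Layer 1: crude keyword match, returns 1.0 on a hit."""
    lowered = text.lower()
    return 1.0 if any(term in lowered for term in BLOCKLIST) else 0.0

def route(text, classifier):
    """Combine layers by taking the highest harm score, then apply thresholds."""
    score = max(keyword_score(text), classifier(text))
    if score > 0.9:
        return "block"
    if score >= 0.7:
        return "human_review"
    return "allow_with_monitoring"

def stub_classifier(text):
    """Stand-in for a real moderation model; returns a fixed score."""
    return 0.75 if "borderline" in text else 0.05

decisions = [
    route("hello there", stub_classifier),
    route("a borderline request", stub_classifier),
    route("contains FORBIDDEN-TERM", stub_classifier),
]
```

Taking the maximum means any single layer can escalate a request, which matches the idea that no single filter is trusted on its own.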

Set User Expectations for AI

User Safety & Trust ▼

Clearly communicate what AI can and cannot do to prevent misunderstanding and misuse.

When to use

During onboarding for new AI features
When users first interact with AI capabilities
After incidents caused by user misunderstanding of AI limitations

Steps

Document capabilities and limitations: What tasks does AI excel at? Where does it fail? What shouldn't users try?
Communicate in-product: Show capability descriptions on first use. Use disclaimers for high-stakes use cases.
Provide examples: Show what good inputs/outputs look like. Show what AI cannot do.
Set accuracy expectations: 'This AI is 85% accurate on X task' or 'Always verify AI outputs for [use case]'
Update based on usage: Monitor support tickets and user errors. Refine messaging to address common misconceptions.

Tips

For safety-critical domains (medical, legal, financial), require explicit acknowledgment of limitations before use
Test messaging with target users—what's clear to you may confuse them

AI Transparency Communication

User Safety & Trust ▼

Disclose when AI is involved in decisions and how it influences user experiences.

When to use

When AI influences recommendations, rankings, or decisions users care about
In regulated industries requiring algorithmic transparency
When building trust is critical to product adoption

Steps

Identify AI touchpoints: Where does AI influence user experience? Recommendations, search results, content moderation, pricing?
Decide disclosure level: Passive (AI badge), Active (explanation on demand), Proactive (always-visible explanation)
Write clear disclosures: Use plain language. 'AI suggests these results based on your history' not 'ML algorithm ranks outputs'
Provide controls: Let users adjust AI behavior (opt out, tune personalization, see alternatives)
Document for auditors: Maintain detailed technical documentation for regulators, even if users see simplified version

Tips

For high-stakes decisions (lending, hiring), proactive disclosure may be legally required
Test transparency features with users—too much detail overwhelms, too little erodes trust

Human-in-the-Loop Review

User Safety & Trust ▼

Design workflows where humans review and approve high-stakes AI decisions before execution.

When to use

For AI decisions with significant user impact (financial, legal, safety)
When model confidence is low or decision is ambiguous
As a safety net while AI is maturing

Steps

Define review triggers: Low confidence (<80%), high stakes (>$100 transaction), sensitive content, user flags
Design review interface: Show AI's recommendation + confidence + key evidence. Make approve/reject/edit easy.
Set SLAs: How fast must reviews complete? Who gets escalated? What happens if no review within SLA?
Measure effectiveness: Track overturn rate (how often humans disagree with AI). If >20%, AI needs improvement.
Close feedback loop: Feed human decisions back to training data. AI should learn from corrections.

Tips

Start with 100% human review, gradually decrease as AI improves and you build confidence
Monitor reviewer fatigue—accuracy drops after ~2 hours. Rotate reviewers or add breaks.
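
The review triggers and the overturn-rate check from the Steps can be written as two small functions. This is a sketch using the thresholds quoted above (80% confidence, $100, 20% overturn rate); the function names and the sample decisions are hypothetical.

```python
def needs_review(confidence, amount_usd, sensitive=False, user_flagged=False):
    """True if any review trigger from the Steps fires."""
    return (
        confidence < 0.80
        or amount_usd > 100
        or sensitive
        or user_flagged
    )

def overturn_rate(decisions):
    """Share of reviewed cases where the human decision differed from the AI's.

    decisions: list of (ai_decision, human_decision) pairs.
    """
    if not decisions:
        return 0.0
    overturned = sum(ai != human for ai, human in decisions)
    return overturned / len(decisions)

reviewed = [
    ("approve", "approve"),
    ("approve", "reject"),
    ("reject", "reject"),
    ("approve", "approve"),
]
rate = overturn_rate(reviewed)      # 1 of 4 overturned
model_needs_work = rate > 0.20
```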

Design Fallback Mechanisms

User Safety & Trust ▼

Build graceful degradation when AI fails so users can still accomplish their goals.

When to use

For any AI feature in production
When AI is part of critical user workflows
During AI system outages or performance degradation

Steps

Identify failure modes: Model errors, low confidence, service outages, timeouts, unexpected inputs
Design fallback tiers: Tier 1 (simpler model), Tier 2 (rule-based system), Tier 3 (manual process), Tier 4 (graceful failure message)
Set degradation thresholds: If model latency >2s, fall back to cached results. If accuracy <70%, fall back to rules.
Implement seamlessly: Users shouldn't notice transition. Fallback should feel like normal feature operation.
Monitor fallback usage: Track how often each tier activates. High fallback rate indicates systemic AI issues.

Tips

For recommendation systems, always have a 'popular items' fallback—simple and always works
Test fallbacks in production regularly—fire drills ensure they work when needed
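
The fallback tiers can be sketched as an ordered chain that returns the first tier able to answer. This assumes a handler signals failure by raising or returning None; the three handlers below are simulated stand-ins, not real services.

```python
def with_fallbacks(request, tiers):
    """Try each (name, handler) tier in order; return the first success.

    A handler signals failure by raising an exception or returning None.
    """
    for name, handler in tiers:
        try:
            result = handler(request)
        except Exception:
            continue  # in production, log the failure before moving on
        if result is not None:
            return name, result
    return "failure_message", "We can't show recommendations right now."

def primary_model(request):
    raise TimeoutError("model took longer than 2s")  # simulated outage

def rules(request):
    return None  # simulated: no rule matches this request

def popular_items(request):
    return ["item-1", "item-2", "item-3"]

tier_used, items = with_fallbacks(
    {"user_id": 42},
    [("primary_model", primary_model), ("rules", rules), ("popular_items", popular_items)],
)
```

Returning the tier name alongside the result makes it straightforward to count how often each tier activates.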

AI Error Communication

User Safety & Trust ▼

Craft helpful, honest error messages when AI fails or produces low-quality outputs.

When to use

When designing error states for AI features
After users report confusion about AI failures
When AI cannot fulfill user requests

Steps

Categorize error types: 'AI not confident enough', 'Input unclear', 'Request outside AI capabilities', 'Temporary service issue'
Write specific messages: Not 'Error occurred', but 'AI couldn't understand your request. Try rephrasing or adding more details.'
Suggest next steps: Tell users what to do. 'Try again', 'Rephrase your question', 'Contact support for help'
Provide alternatives: If AI can't help, show manual workflow or human assistance option
Learn from errors: Log error types and user context. Use to improve model and error handling.

Tips

Never blame users—even if input is bad, frame as AI limitation: 'AI works best with X type of input'
For generative AI, distinguish 'couldn't generate' vs. 'generated but content was filtered'

Collect User Feedback on AI

User Safety & Trust ▼

Systematically gather user feedback on AI quality to identify issues and drive improvements.

When to use

For any user-facing AI feature in production
When diagnosing AI quality issues
To prioritize model improvement efforts

Steps

Add lightweight feedback: Thumbs up/down on AI outputs. Takes <1 second, high response rate.
Segment by confidence: Always ask for feedback on low-confidence predictions. Sample 5-10% of high-confidence ones.
Add optional details: Let users explain why they downvoted (optional text field or predefined reasons)
Close the loop: Show users 'Thanks for feedback' + what will happen. Notify them when issue is fixed.
Analyze patterns: Weekly review of negative feedback. Identify common failure modes. Prioritize by frequency × severity.

Tips

Aim for >5% feedback rate. If lower, reduce friction (fewer clicks, better placement)
Tag feedback with model version so you can measure if improvements actually help
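
Confidence-based sampling and the frequency × severity ranking can be sketched as follows. The 0.6 low-confidence cutoff, the 1-5 severity scale, and the failure-mode names are assumptions for illustration; the 5% sample rate comes from the Steps.

```python
import random

def should_ask_feedback(confidence, rng, low_conf=0.6, sample_rate=0.05):
    """Always ask on low-confidence outputs; sample a share of the rest."""
    if confidence < low_conf:
        return True
    return rng.random() < sample_rate

def prioritize(failure_modes):
    """Sort failure modes by frequency x severity, highest first.

    failure_modes: dict of name -> (weekly_count, severity on a 1-5 scale).
    """
    return sorted(
        failure_modes,
        key=lambda name: failure_modes[name][0] * failure_modes[name][1],
        reverse=True,
    )

rng = random.Random(0)  # seeded so the sketch is repeatable
asked = sum(should_ask_feedback(0.95, rng) for _ in range(10_000))

ranked = prioritize({
    "wrong_language": (40, 2),     # score 80
    "hallucinated_fact": (15, 5),  # score 75
    "slow_response": (60, 1),      # score 60
})
```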

Content Moderation at Scale

User Safety & Trust ▼

Build systems to detect and remove harmful user-generated content using AI + human review.

When to use

For platforms with user-generated content
When required by platform policies (App Store, regulatory requirements)
After discovering problematic content in your product

Steps

Define policy: What content is prohibited? Violence, hate speech, spam, misinformation, etc. Write clear guidelines.
Implement automated detection: Use content moderation APIs (Amazon Rekognition, Google Cloud Vision, OpenAI Moderation) + custom models
Set action thresholds: >0.9 = auto-remove, 0.7-0.9 = human review, <0.7 = allow with monitoring
Build review queue: Surface flagged content to human moderators. Prioritize by severity and volume.
Handle appeals: Let users appeal removals. Review by senior moderators. Update policies based on patterns.

Tips

Start with pre-moderation (review before publishing) for high-risk platforms. Shift to post-moderation as systems mature.
Provide mental health support for human moderators—exposure to harmful content causes trauma
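
The action thresholds and the severity-ordered review queue can be combined in one small class. This sketch uses Python's `heapq`; the severity ranking per category and the content IDs are assumptions, and a real queue would also weigh volume and age.

```python
import heapq

# Assumed severity ranking; adjust to your own policy.
SEVERITY = {"violence": 3, "hate_speech": 3, "misinformation": 2, "spam": 1}

class ReviewQueue:
    """Human review queue that serves the most severe, highest-scoring items first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order stable

    def submit(self, content_id, category, score):
        """Apply the action thresholds; enqueue only the human-review band."""
        if score > 0.9:
            return "auto_remove"
        if score < 0.7:
            return "allow_with_monitoring"
        # heapq is a min-heap, so negate values to pop the largest first.
        heapq.heappush(
            self._heap, (-SEVERITY[category], -score, self._counter, content_id)
        )
        self._counter += 1
        return "human_review"

    def next_item(self):
        """Return the next content ID to review, or None if the queue is empty."""
        return heapq.heappop(self._heap)[-1] if self._heap else None

queue = ReviewQueue()
queue.submit("post-1", "spam", 0.85)
queue.submit("post-2", "hate_speech", 0.75)
removed = queue.submit("post-3", "spam", 0.95)  # auto-removed, never queued
first = queue.next_item()
```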

Build User Trust in AI

User Safety & Trust ▼

Systematically increase user confidence in AI through transparency, consistency, and demonstrated reliability.

When to use

When launching new AI features to skeptical users
If adoption metrics show users avoiding AI features
After AI errors or incidents damage trust

Steps

Start small: Launch AI for low-stakes tasks first. Let users build confidence before expanding to critical workflows.
Show your work: Explain how AI works, what data it uses, how accurate it is. Transparency builds credibility.
Be honest about limitations: Don't oversell. Tell users what AI can't do. Honesty prevents disappointment.
Deliver consistent quality: Users trust reliable systems. Monitor and maintain >95% success rate for core use cases.
Give users control: Let them disable AI, adjust settings, override decisions. Control increases comfort.

Tips

Measure trust explicitly: survey users quarterly on AI confidence and reliability perceptions
Celebrate wins: when AI helps users succeed, acknowledge it. Positive associations build trust.

Fairness Auditing Process

Ethical Considerations ▼

Conduct regular audits to measure and improve fairness across demographic groups and use cases.

When to use

Quarterly for high-stakes AI systems (hiring, lending, criminal justice)
Before major model updates or feature launches
When required by regulations or ethical AI commitments

Steps

Define fairness criteria: Demographic parity, equalized odds, individual fairness, or other domain-specific measures
Collect representative test data: Include diverse demographics and edge cases. Aim for 500+ examples per protected group.
Measure disparities: Calculate performance metrics (accuracy, FPR, FNR) for each demographic. Document gaps larger than 5 percentage points.
Investigate root causes: Is bias in training data, model architecture, or post-processing? Use feature importance analysis.
Implement mitigations: Rebalance training data, adjust decision thresholds per group, add fairness constraints, or redesign feature.

Tips

Involve external auditors or diverse internal stakeholders—insider bias blinds you to issues
Document audit results even if no bias found—shows due diligence to regulators and stakeholders
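
Measuring disparities comes down to computing the same error rates per group and comparing them. This is a minimal sketch for binary labels, reading the 5% gap as percentage points (an assumption); group names and counts are synthetic.

```python
from collections import defaultdict

def rates_by_group(records):
    """Compute false positive and false negative rates per group.

    records: iterable of (group, y_true, y_pred) with binary 0/1 labels.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        if y_true == 1:
            c["pos"] += 1
            c["fn"] += int(y_pred == 0)
        else:
            c["neg"] += 1
            c["fp"] += int(y_pred == 1)
    return {
        g: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for g, c in counts.items()
    }

def gap(rates, metric):
    """Largest difference in a metric across groups, ignoring undefined values."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values)

records = (
    [("A", 0, 0)] * 90 + [("A", 0, 1)] * 10 + [("A", 1, 1)] * 80 + [("A", 1, 0)] * 20
    + [("B", 0, 0)] * 80 + [("B", 0, 1)] * 20 + [("B", 1, 1)] * 85 + [("B", 1, 0)] * 15
)
rates = rates_by_group(records)
fpr_gap = gap(rates, "fpr")
flagged = fpr_gap > 0.05
```

Here group B's false positive rate is twice group A's, so the gap would be documented even though overall accuracy looks similar.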

Algorithmic Discrimination Prevention

Ethical Considerations ▼

Proactively design AI systems to prevent unfair treatment based on protected characteristics.

When to use

During initial AI feature design and requirements
When AI influences decisions affecting people's opportunities
Before expanding AI to new markets or demographics

Steps

Conduct pre-deployment risk assessment: Could this AI system discriminate? Against which groups? What's the potential harm?
Remove or mitigate problematic features: Avoid using protected attributes such as race or gender directly. Check for proxy features that correlate with them (zip code, name, address).
Ensure training data diversity: Balanced representation of protected groups. Oversample underrepresented groups if needed.
Apply fairness constraints during training: Use fairness-aware algorithms (e.g., Fairlearn library) that optimize for both accuracy and fairness
Test extensively pre-launch: Run fairness audits before release. Require sign-off from ethics/legal teams for high-stakes AI.

Tips

Include diverse voices in design—people from affected communities spot issues you miss
Document your fairness approach—shows good faith effort if challenged legally

Unintended Consequences Assessment

Ethical Considerations ▼

Identify and plan for negative second-order effects of your AI system before they cause harm.

When to use

During AI product strategy and planning phases
Before launching AI features with broad societal impact
When expanding AI systems to new domains or scales

Steps

Map intended effects: What is AI designed to accomplish? Who benefits? How?
Brainstorm unintended effects: Who might be harmed? Could AI be misused? What behaviors might it incentivize? Could it be gamed?
Assess likelihood and severity: For each unintended consequence, rate probability and potential harm
Design mitigations: Rate limiting, access controls, monitoring for misuse, user education, or design changes
Monitor post-launch: Track metrics related to potential harms. Adjust mitigations based on observed behavior.

Tips

Use 'pre-mortem' technique: imagine AI caused major harm. Work backwards to identify how it happened.
Include diverse perspectives—different stakeholders see different risks

Stakeholder Impact Mapping

Ethical Considerations ▼

Systematically identify everyone affected by your AI system and understand how it impacts them.

When to use

Early in AI product planning, before committing to approach
When making major changes to existing AI systems
If stakeholders raise concerns about AI impacts

Steps

Identify all stakeholders: End users, indirect users, employees, communities, competitors, regulators, society
Map impacts for each group: How does AI affect them? Benefits? Harms? Changes to work/life?
Prioritize by impact: Which groups experience the largest effects? Which effects are irreversible?
Engage stakeholders: Interview representatives from high-impact groups. Understand their concerns and priorities.
Incorporate feedback: Adjust AI design, policies, or safeguards based on stakeholder input. Document tradeoffs.

Tips

Don't forget indirect stakeholders—job displacement, ecosystem effects, societal norms
For high-impact systems, consider establishing ongoing stakeholder advisory boards

Value Alignment Testing

Ethical Considerations ▼

Verify that AI system behaviors align with stated organizational values and ethical principles.

When to use

Before launching AI systems with significant autonomy
When AI makes decisions that reflect organizational values
As part of regular ethics audits

Steps

Articulate core values: What principles should guide AI behavior? Fairness, transparency, safety, respect, autonomy?
Translate to testable scenarios: Create specific situations where values might conflict. 'Should AI prioritize accuracy or fairness?'
Test AI behavior: Run scenarios through your AI system. Does it behave according to values? Where does it diverge?
Identify misalignments: Document cases where AI behavior conflicts with values. Understand root cause.
Adjust and retest: Modify training objectives, reward functions, constraints, or post-processing to improve alignment.

Tips

Values often conflict (privacy vs. personalization, safety vs. autonomy). Define priority hierarchy.
Test edge cases where tradeoffs are hardest—that's where value alignment matters most

Responsible AI Principles

Ethical Considerations ▼

Establish and operationalize a set of ethical principles to guide AI development and deployment.

When to use

When starting an AI program or establishing AI governance
Before making major AI product decisions with ethical dimensions
When communicating AI approach to stakeholders or public

Steps

Define principles: Common ones include fairness, accountability, transparency, safety, privacy, human control. Adapt to your context.
Write clear definitions: What does each principle mean specifically? Include examples and counterexamples.
Create decision checklists: For each principle, list questions to ask during design/development. 'Does this AI treat all users fairly?'
Assign accountability: Who reviews AI products for principle adherence? Who has authority to block launches?
Integrate into processes: Add ethics review to design reviews, launch checklists, and post-launch monitoring.

Tips

Don't just copy Google/Microsoft principles—customize to your industry, users, and risks
Principles without enforcement are PR. Build real gatekeeping mechanisms.

Ethical AI Decision Framework

Ethical Considerations ▼

Use a structured process to evaluate and resolve ethical dilemmas in AI product development.

When to use

When facing difficult tradeoffs between competing values
If team members disagree about ethics of an AI feature
Before launching controversial or high-stakes AI capabilities

Steps

Frame the dilemma: What are the competing values or interests? Who benefits? Who is harmed?
Gather perspectives: Consult diverse stakeholders. What do affected groups think? What do experts recommend?
Evaluate options: List possible approaches. For each, assess alignment with values, feasibility, risks, precedent set.
Make decision: Choose option with best balance of benefits, harms, and value alignment. Document rationale.
Plan monitoring: How will you know if decision was right? What metrics or signals indicate success or failure?

Tips

Use thought experiments: 'If this decision became public, could we defend it?' 'Would we want competitors to make the same choice?'
Sometimes best answer is 'don't build it'—not every AI application is worth the ethical costs

Impact Assessment for Stakeholders

Ethical Considerations ▼

Conduct thorough impact assessments to understand social, economic, and ethical effects of AI systems.

When to use

Before launching AI systems with significant societal impact
When regulations require an impact assessment (e.g., the EU AI Act)
For major updates to existing high-stakes AI systems

Steps

Define scope: What AI system? What deployment context? What time horizon? Geographic scope?
Assess impacts by category: Human rights, safety, fairness, economic, environmental, social cohesion
Quantify where possible: How many people affected? What magnitude of impact? What probability?
Identify mitigation measures: For each significant risk, document prevention and response strategies
Publish and update: Share assessment with stakeholders. Update post-launch based on observed impacts.

Tips

Use established frameworks: Canada's Algorithmic Impact Assessment (AIA), the UK ICO's DPIA guidance, or EU AI Act requirements provide templates
Impact assessments should be living documents—update quarterly as you learn from real-world deployment

AI Regulation Landscape

Legal & Compliance ▼

Navigate the evolving landscape of AI-specific regulations across different jurisdictions.

When to use

When planning AI product strategy and roadmap
Before launching AI features in new geographic markets
Quarterly as regulatory landscape evolves rapidly

Steps

Map applicable regulations: EU AI Act (high-risk AI systems), US sector-specific rules, China AI rules, GDPR (automated decisions)
Classify your AI system: Under the EU AI Act's tiers, prohibited (e.g., social scoring), high-risk (credit, employment, law enforcement), limited-risk (chatbots), or minimal-risk (spam filters)
Identify compliance requirements: High-risk may require: conformity assessments, risk management, data governance, transparency, human oversight
Assess compliance gaps: What requirements don't you meet today? What's the timeline to comply?
Build compliance roadmap: Prioritize by regulation enforcement date and business impact. Assign owners.

Tips

Don't wait for final regulations—start building compliance capabilities now (documentation, testing, governance)
Work with legal counsel familiar with AI regulations—this is specialized and rapidly evolving

GDPR Compliance for AI

Legal & Compliance ▼

Ensure AI systems comply with GDPR requirements for automated decision-making and data protection.

When to use

When processing data of EU residents
Before launching AI features that make automated decisions
During GDPR compliance audits

Steps

Assess Article 22 applicability: Does the AI make decisions solely by automated means, without meaningful human involvement? Do those decisions have legal or similarly significant effects?
Obtain proper consent: If consent is your lawful basis for using personal data in AI training, it must be an affirmative opt-in. Pre-checked boxes are not valid.
Provide meaningful information: Tell users about AI logic, significance, and consequences in privacy policy
Enable human intervention: Allow users to contest AI decisions and request human review (Article 22(3))
Support data subject rights: Provide explanations of automated decisions, honor the right to erasure ("right to be forgotten", which may extend to training data), and support the right to data portability

Tips

Conduct Data Protection Impact Assessment (DPIA) for high-risk AI—required by GDPR Article 35
Work with Data Protection Officer (DPO) throughout AI development, not just at launch

Intellectual Property Considerations

Legal & Compliance ▼

Navigate IP issues around training data, model ownership, and AI-generated outputs.

When to use

When sourcing training data from third-party sources
Before using pre-trained models or APIs commercially
When AI generates content that might infringe copyrights

Steps

Audit training data sources: Do you have rights to use this data for ML training? Check terms of service, licenses.
Review model licenses: If using pre-trained models (GPT, LLaMA, Stable Diffusion), check license terms. Commercial use allowed?
Assess output liability: If AI generates content similar to copyrighted works, who's liable? Implement detection for problematic outputs.
Protect your IP: Document novel ML architectures. Consider patents for truly innovative techniques (high bar).
Establish usage policies: Define acceptable use of your AI. Prohibit generating content that infringes IP.

Tips

For generative AI, add content filters that block outputs too similar to known copyrighted works
IP law for AI is unsettled—work with specialized IP counsel, don't rely on general advice

Liability & Insurance

Legal & Compliance ▼

Understand and manage legal liability for AI system failures, errors, and harms.

When to use

When launching AI products with potential for significant user harm
Before deploying AI in regulated industries (healthcare, finance, automotive)
When structuring contracts with AI vendors or customers

Steps

Identify liability scenarios: What could go wrong? AI error causes financial loss, physical harm, discrimination, privacy breach?
Assess liability exposure: Who could sue? What are potential damages? What's the probability?
Review liability limitations: Do your Terms of Service limit liability? Are limitations enforceable in relevant jurisdictions?
Obtain insurance coverage: Professional liability, cyber liability, product liability. Ensure AI is explicitly covered.
Implement risk controls: The measures in this deck reduce likelihood of incidents and show reasonable care if sued.

Tips

Many insurance policies exclude AI-related claims by default—get explicit AI coverage
For B2B AI, negotiate liability caps in contracts. Unlimited liability for AI is too risky.

Audit Trail Requirements

Legal & Compliance ▼

Implement comprehensive logging and audit trails for AI systems to support compliance and investigations.

When to use

For regulated AI systems (finance, healthcare, government)
When AI makes decisions that could be legally challenged
As required by regulations or audit standards (EU AI Act, SOC 2, ISO 27001)

Steps

Define logging scope: Model inputs, outputs, decisions, confidence scores, user interactions, model versions, data versions
Set retention policies: How long to keep logs? GDPR requires deletion upon request, but some regulations require multi-year retention.
Implement secure storage: Logs contain sensitive data. Encrypt at rest, restrict access, maintain immutability.
Enable traceability: Link each prediction to model version, training data version, user, timestamp. Must be able to reproduce.
Build audit reports: Create dashboards and reports for regulators, auditors, internal reviews. Test that you can answer common questions.

Tips

For high-stakes AI, log enough detail to fully reproduce any decision even years later
Balance retention needs with privacy—minimize PII in logs, anonymize where possible
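
Traceability and immutability can be approximated with a hash-chained log, where each entry's hash covers the previous one. This is a sketch of one possible approach, not a compliance-grade design: the record fields and values are hypothetical, and a chain like this is tamper-evident rather than tamper-proof.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def _digest(prev_hash, record):
    """Hash the previous entry's hash together with this record's content."""
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_record(log, record):
    """Append a prediction record linked to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev_hash": prev_hash,
                "hash": _digest(prev_hash, record)})

def verify(log):
    """Recompute every hash; any edit to an earlier entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_record(log, {"ts": "2024-01-01T00:00:00Z", "user": "u-1", "model_version": "v3",
                    "data_version": "d7", "output": "approve", "confidence": 0.93})
append_record(log, {"ts": "2024-01-01T00:05:00Z", "user": "u-2", "model_version": "v3",
                    "data_version": "d7", "output": "reject", "confidence": 0.61})
ok_before = verify(log)
log[0]["record"]["output"] = "reject"  # simulated tampering
ok_after = verify(log)
```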

Responsible AI Documentation

Legal & Compliance ▼

Create and maintain comprehensive documentation of AI systems for transparency, compliance, and knowledge sharing.

When to use

Throughout AI development lifecycle
When required by regulations (model cards, data sheets)
Before launching AI systems to production

Steps

Create model cards: Document intended use, training data, performance metrics, limitations, fairness analysis, ethical considerations
Create data sheets: Document dataset origin, collection method, preprocessing, demographics, known biases, intended uses
Document system architecture: Data flows, model architecture, dependencies, infrastructure, update procedures
Write user-facing documentation: What AI does, how to use it, limitations, how to get help, how to provide feedback
Maintain living docs: Update documentation with each model version, system change, or new findings. Version control all docs.

Tips

Use templates: Google Model Cards, Microsoft datasheets, or EU AI Act technical documentation templates
Make documentation searchable and accessible—it's useless if people can't find it

Deployment Failure Prevention

Operational Risks ▼

Minimize risk of failed deployments through testing, staging, and gradual rollouts.

When to use

Before deploying any AI model to production
When updating existing production AI systems
After experiencing deployment incidents

Steps

Test in staging: Deploy to production-like environment first. Validate model performance, latency, error rates.
Implement canary deployments: Roll out to 5% of traffic first. Monitor metrics for 24-48 hours before full rollout.
Define rollback criteria: If error rate >1% or latency >2x baseline, auto-rollback. Have one-click rollback mechanism.
Pre-deployment checklist: Verify dependencies, data schema, API compatibility, monitoring/alerting, documentation
Plan deployment windows: Deploy during low-traffic periods. Have team available to monitor and respond to issues.

Tips

Always deploy new models alongside old ones (shadow mode) for 24 hours before switching traffic
Keep last 3 model versions deployable—enables quick rollback if issues emerge days after deployment
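
The rollback criteria from the Steps can be encoded as a single check that a canary monitor runs on each metrics sample. A minimal sketch, using the 1% error rate and 2x latency thresholds quoted above; the function name and sample numbers are hypothetical.

```python
def should_rollback(error_rate, latency_ms, baseline_latency_ms,
                    max_error_rate=0.01, max_latency_factor=2.0):
    """Return the list of rollback criteria the canary currently violates."""
    reasons = []
    if error_rate > max_error_rate:
        reasons.append("error_rate")
    if latency_ms > max_latency_factor * baseline_latency_ms:
        reasons.append("latency")
    return reasons

healthy = should_rollback(error_rate=0.004, latency_ms=180, baseline_latency_ms=120)
failing = should_rollback(error_rate=0.02, latency_ms=300, baseline_latency_ms=120)
```

An empty list means the canary can continue; returning the reasons rather than a bare boolean gives the rollback alert something specific to report.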

Scaling AI Systems

Operational Risks ▼

Plan for and manage challenges that emerge when scaling AI from prototype to high-volume production.

When to use

When traffic is expected to grow 10x or more
Before major product launches or marketing campaigns
When experiencing performance degradation under load

Steps

Benchmark capacity: Measure current throughput (requests/second), latency, and resource usage. Identify bottlenecks.
Project future load: Estimate peak traffic based on growth plans. Add 50% buffer for unexpected spikes.
Optimize performance: Model quantization, batching, caching, GPU optimization. Measure latency/cost tradeoffs.
Plan infrastructure: Auto-scaling policies, load balancing, multi-region deployment. Test failover scenarios.
Load test extensively: Simulate projected peak traffic, then 2x that peak. Measure behavior under sustained load and traffic spikes.

Tips

Model inference cost often scales linearly with traffic—factor this into unit economics early
For generative AI, implement rate limiting per user to prevent abuse and control costs
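
The load projection in the Steps is simple arithmetic once throughput per replica is benchmarked. This worked sketch assumes throughput scales linearly with replica count, which rarely holds exactly; all numbers are hypothetical.

```python
import math

def required_replicas(current_peak_rps, growth_factor, per_replica_rps, buffer=0.5):
    """Replicas needed for projected peak traffic plus a safety buffer."""
    projected_peak = current_peak_rps * growth_factor
    target_rps = projected_peak * (1 + buffer)  # 50% buffer from the Steps
    return math.ceil(target_rps / per_replica_rps), target_rps

# 200 req/s today, 10x growth expected, 120 req/s benchmarked per replica.
replicas, target_rps = required_replicas(
    current_peak_rps=200, growth_factor=10, per_replica_rps=120
)
```

With these inputs the target is 3,000 requests per second, which works out to 25 replicas.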

Cost Management for AI

Operational Risks ▼

Monitor and optimize AI infrastructure costs to maintain healthy unit economics.

When to use

Before committing to AI features with significant compute costs
When monthly AI infrastructure costs exceed budget
During planning cycles and budget allocation

Steps

Calculate unit economics: Cost per prediction, cost per user, cost per month. Track over time.
Set cost budgets: Define acceptable costs for your business model. Alert if approaching limits.
Optimize inference: Smaller models, quantization, batching, caching, edge deployment. Measure accuracy vs. cost tradeoffs.
Optimize training: Spot instances, lower-precision training, smaller datasets, fewer experiments. Use MLOps tools to track experiment costs.
Monitor continuously: Daily/weekly cost dashboards by team, project, model. Identify cost spikes immediately.

Tips

For API-based AI, renegotiate pricing after hitting volume thresholds—vendors offer discounts at scale
Consider model distillation: train smaller, cheaper models that mimic larger expensive models
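
Unit economics and budget alerts from the Steps can be sketched in a few lines. The 80% alert level and all dollar figures are assumptions for illustration.

```python
def unit_economics(monthly_cost_usd, monthly_predictions, monthly_active_users):
    """Cost per prediction and per user for one month."""
    return {
        "cost_per_prediction": monthly_cost_usd / monthly_predictions,
        "cost_per_user": monthly_cost_usd / monthly_active_users,
    }

def budget_status(spend_to_date, monthly_budget, alert_at=0.8):
    """Flag spend that is approaching or over the monthly budget."""
    used = spend_to_date / monthly_budget
    if used >= 1.0:
        return "over_budget"
    if used >= alert_at:
        return "approaching_limit"
    return "ok"

econ = unit_economics(
    monthly_cost_usd=12_000, monthly_predictions=3_000_000, monthly_active_users=40_000
)
status = budget_status(spend_to_date=10_200, monthly_budget=12_000)
```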

Vendor Lock-In Mitigation

Operational Risks ▼

Reduce dependency on single AI vendors to maintain flexibility and negotiating leverage.

When to use

When evaluating AI vendor relationships
Before committing to proprietary AI platforms or APIs
If current vendor relationship becomes problematic

Steps

Assess lock-in risk: How hard to switch vendors? Proprietary APIs? Custom integrations? Data in vendor-specific formats?
Design for portability: Use abstraction layers. Build interfaces that work with multiple providers. Avoid vendor-specific features initially.
Maintain multi-vendor capability: Test alternative providers quarterly. Keep POC integrations working.
Diversify strategically: Use different vendors for different use cases. Prevents single point of failure.
Negotiate protections: Include data portability, API stability, and exit assistance terms in contracts.

Tips

For LLM APIs, use libraries like LangChain or LlamaIndex that support multiple providers
Build vendor switching into roadmap every 12-18 months—forces you to maintain portability
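
An abstraction layer can be as small as one interface that application code depends on, with one adapter per provider. This sketch uses `typing.Protocol`; `VendorAClient` and `VendorBClient` are hypothetical adapters that return canned strings where a real one would call the vendor's SDK.

```python
from typing import Protocol

class TextGenerator(Protocol):
    """The only interface application code is allowed to depend on."""

    def generate(self, prompt: str) -> str: ...

class VendorAClient:
    """Hypothetical adapter; a real one would wrap vendor A's SDK call here."""

    def generate(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"

class VendorBClient:
    """Hypothetical adapter for a second provider."""

    def generate(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"

PROVIDERS = {"a": VendorAClient, "b": VendorBClient}

def make_generator(provider: str) -> TextGenerator:
    """Select the provider from configuration instead of hard-coding it."""
    return PROVIDERS[provider]()

def summarize(text: str, llm: TextGenerator) -> str:
    """Application code sees only the interface, never a vendor type."""
    return llm.generate(f"Summarize: {text}")

out_a = summarize("quarterly report", make_generator("a"))
out_b = summarize("quarterly report", make_generator("b"))
```

Switching vendors then means changing one configuration value and, at most, writing one new adapter.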

Technical Debt in AI Systems

Operational Risks ▼

Identify and manage ML-specific technical debt that accumulates faster than traditional software.

When to use

During sprint planning and roadmap reviews
When velocity slows or bugs increase
Quarterly as part of technical health reviews

Steps

Audit ML-specific debt: Glue code, pipeline jungles, experimental codepaths, multiple versions of truth, undeclared dependencies
Quantify impact: How much does debt slow development? Increase bugs? Raise costs?
Prioritize by pain: Which debt causes most problems? Which is easiest to fix? Focus on high-impact, low-effort first.
Allocate capacity: Reserve 20-30% of engineering time for debt reduction. Track and celebrate progress.
Prevent accumulation: Code review standards, refactoring sprints, deprecation policies, monitoring for code smells

Tips

ML debt compounds faster than traditional software—it blocks experimentation and slows innovation
Create a dedicated ML platform team to manage shared infrastructure and prevent debt at the system level

Set Up Risk Monitoring Dashboard

Operational Risks ▼

Create centralized visibility into AI system health, risks, and incidents across all dimensions.

When to use

When launching first AI features to production
If you lack visibility into AI system health
After incidents reveal monitoring gaps

Steps

Define key risk indicators: Model performance, data quality, cost, latency, error rates, user feedback, fairness metrics
Set thresholds and alerts: Green (healthy), yellow (investigate), red (immediate action). Define escalation procedures.
Build dashboard: Centralized view of all AI systems. Accessible to PMs, engineers, execs. Real-time + historical trends.
Automate data collection: Instrument production systems to emit metrics. Aggregate from multiple sources (logs, databases, APIs).
Review cadence: Daily check by on-call. Weekly review with team. Monthly review with leadership.

Tips

Start simple: track 5-10 most critical metrics. Expand over time as you learn what matters.
Include leading indicators (input data quality) not just lagging indicators (model accuracy)
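
The green/yellow/red scheme can be expressed as a threshold table plus one classifier function. The metric names and threshold values below are assumptions; the `higher_is_worse` flag handles metrics such as accuracy where a lower value is the problem.

```python
def status(value, yellow, red, higher_is_worse=True):
    """Classify a metric as green, yellow, or red against two thresholds."""
    if not higher_is_worse:
        # Flip signs so the same comparisons work for "lower is worse" metrics.
        value, yellow, red = -value, -yellow, -red
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"

# Hypothetical thresholds; tune per system.
THRESHOLDS = {
    "error_rate": {"yellow": 0.005, "red": 0.01},
    "p95_latency_ms": {"yellow": 1500, "red": 2000},
    "accuracy": {"yellow": 0.90, "red": 0.85, "higher_is_worse": False},
}

def evaluate(metrics):
    """Return a status per metric for the dashboard."""
    return {name: status(value, **THRESHOLDS[name]) for name, value in metrics.items()}

board = evaluate({"error_rate": 0.002, "p95_latency_ms": 1700, "accuracy": 0.84})
```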

Incident Response for AI

Operational Risks ▼

Establish playbooks for responding to AI system failures, quality issues, or safety incidents.

When to use

Before launching AI features to production
After experiencing your first AI incident
When updating incident response procedures

Steps

Define incident types: Model failure, data corruption, harmful output, fairness violation, privacy breach, cost spike, outage
Set severity levels: P0 (user safety, major outage), P1 (significant degradation), P2 (minor issues). Define response SLAs.
Create response playbooks: For each incident type, document detection, initial response, investigation, mitigation, communication
Assign roles: Incident commander, communications lead, technical lead. Train team on roles and procedures.
Conduct post-mortems: After incidents, document timeline, root cause, action items. Share learnings broadly.

Tips

For AI safety incidents, communicate proactively to users even before full resolution—transparency builds trust
Practice incident response with fire drills quarterly—muscle memory matters during real incidents

Prevent Underfitting

Model Risks ▼

Ensure your model is complex enough to capture important patterns and deliver useful predictions.

When to use

When baseline models show poor performance on training and validation data
Before giving up on an AI approach due to low accuracy
When stakeholders question if AI adds value over simple rules

Steps

Diagnose underfitting: If both training and validation accuracy are low (e.g., both ~65% for binary classification), you're underfitting
Increase model complexity: Add more layers, more parameters, more features. Start with 2-3x current capacity.
Improve features: Add more informative input features. Feature engineering often matters more than model architecture.
Train longer: Increase epochs/iterations. Ensure model has converged (training loss plateaus).
Try different architectures: If linear model underperforms, try decision trees. If simple NN underperforms, try deeper networks.
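
The diagnosis step can be expressed as a small helper. This is a rough sketch: `diagnose_fit`, the 85% target, and the 5-point gap are assumptions for illustration, and the overfitting branch is included only for contrast.

```python
def diagnose_fit(train_acc: float, val_acc: float,
                 target_acc: float = 0.85, max_gap: float = 0.05) -> str:
    """Rough fit diagnosis from training and validation accuracy.

    target_acc and max_gap are illustrative defaults, not fixed rules.
    """
    gap = train_acc - val_acc
    if train_acc < target_acc and gap <= max_gap:
        return "underfitting"   # both scores are low and close together
    if gap > max_gap:
        return "overfitting"    # training score is far above validation
    return "ok"


print(diagnose_fit(0.65, 0.65))
print(diagnose_fit(0.99, 0.80))
```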

Tips

Check training loss first: if it is not decreasing, fix optimization or data problems before treating the issue as underfitting
For structured data, gradient boosting (XGBoost, LightGBM) often fixes underfitting better than neural networks

User Control Over AI

User Safety & Trust ▼

Give users meaningful control over AI behavior, personalization, and decision-making.

When to use

When AI personalizes experiences or makes recommendations
If users express concerns about AI control or autonomy
To build trust and meet transparency requirements

Steps

Identify control points: What aspects of AI can users adjust? Personalization level, data usage, automation degree, feature on/off
Design controls: Simple toggles for most users, advanced settings for power users. Provide clear explanations of each control.
Set sensible defaults: Most users won't change settings. Default to safe, balanced options.
Make controls discoverable: Surface key controls in main settings. Don't bury in deep menus.
Provide override mechanisms: Let users undo AI actions, manually adjust results, revert to non-AI experience.

Tips

Test controls with non-technical users—what's obvious to you may be confusing to them
For sensitive use cases, default to 'AI off' and make users opt in to automation

Write AI Product Specs

Requirements & Specs ▼

Create comprehensive product requirements documents tailored for AI features with probabilistic behaviors.

When to use

When scoping a new AI feature before development begins
When communicating requirements to ML engineers and designers
Before estimating timelines or resources for AI projects

Steps

Define the user problem: What task does AI solve? What's the current painful alternative? Include 3-5 specific user scenarios.
Specify success criteria: Model performance thresholds (accuracy, precision, recall), latency limits, cost constraints, user satisfaction targets.
Document failure modes: What happens when model is wrong? When it's unsure? When it's slow? Define graceful degradation paths.
List edge cases explicitly: Enumerate at least 10 scenarios where AI might fail. How should system behave for each?
Define data requirements: Training data volume, labeling needs, refresh frequency, privacy constraints, retention policies.
Map dependencies: APIs, infrastructure, monitoring tools, human-in-the-loop processes, fallback systems.

Tips

Include example inputs and expected outputs for 5 typical cases and 5 edge cases
Specify what's in scope for MVP vs. future iterations—prevents scope creep

Define AI Success Metrics

Requirements & Specs ▼

Establish clear, measurable criteria for what "good enough" means for your AI feature.

When to use

Before starting AI development or model training
When aligning stakeholders on AI launch criteria
When evaluating if your AI feature is ready to ship

Steps

Define user-facing metrics: Task completion rate, user satisfaction, time saved
Define model metrics: Accuracy, precision, recall, F1 score (based on your use case)
Define system metrics: Latency, cost per prediction, uptime
Set minimum bars: What's the minimum acceptable level for each metric to ship?
Weight by importance: Rank metrics by priority and assign weights that sum to 100% (e.g., accuracy 40%, latency 30%, cost 20%, other metrics 10%)

Tips

Always include latency—a slow model frustrates users even if accurate
Get ML engineers to validate that metrics are achievable

Write AI Acceptance Criteria

Requirements & Specs ▼

Define testable conditions that AI features must meet before marking stories complete or shipping to users.

When to use

When writing user stories for AI features during sprint planning
Before QA begins testing AI functionality
When determining if an AI feature is ready for launch

Steps

Functional criteria: Define what the feature does. Example: 'Given user query, system returns relevant results in <1s'
Performance criteria: Set minimum bars. Example: 'Accuracy >85% on validation set, precision >90% for top 3 results'
Edge case handling: Test boundaries. Example: 'When confidence <70%, show "Not sure" message instead of prediction'
UX criteria: User experience standards. Example: 'Loading indicator appears within 100ms, shows model confidence level'
Monitoring criteria: Observability requirements. Example: 'Log all predictions with confidence scores, latency, and user feedback'
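
Acceptance criteria like the edge-case example above translate directly into automated tests. The sketch below assumes a hypothetical `render_prediction` function and the 70% threshold from the example; the test functions follow pytest naming but also run as plain function calls.

```python
def render_prediction(label: str, confidence: float, threshold: float = 0.70) -> str:
    """Return the text shown to the user for one prediction."""
    if confidence < threshold:
        return "Not sure"
    return label


def test_low_confidence_shows_not_sure():
    # Given a prediction, when confidence is below 70%, then show "Not sure".
    assert render_prediction("spam", 0.55) == "Not sure"


def test_confident_prediction_is_shown():
    # Given a prediction, when confidence is at or above 70%, then show the label.
    assert render_prediction("spam", 0.92) == "spam"


test_low_confidence_shows_not_sure()
test_confident_prediction_is_shown()
```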

Tips

Use 'Given/When/Then' format for clarity: 'Given ambiguous input, when model confidence <70%, then show 3 options instead of 1'
Include negative test cases: What should NOT happen (e.g., 'System never returns offensive content')

Document Edge Cases & Failure Modes

Requirements & Specs ▼

Systematically identify and specify how AI systems should behave when encountering unusual inputs or model failures.

When to use

During AI product spec writing, before development starts
When designing error handling and fallback strategies
After discovering edge cases in testing or production

Steps

Brainstorm input edge cases: Empty inputs, extremely long inputs, non-English text, special characters, adversarial inputs, ambiguous requests
Identify model failure modes: Low confidence predictions, contradictory outputs, hallucinations, timeout/latency spikes, model unavailable
Define system behaviors: For each edge case, specify exact system response—show error message? Fallback to rules? Route to human?
Document user communication: What does user see? Example: 'I'm not confident about this answer' vs. hiding uncertainty
Prioritize edge cases: Mark which must be handled at launch (P0) vs. can be addressed later (P1, P2)

Tips

Aim to document 20-30 edge cases minimum—real AI systems encounter dozens of failure modes
Test your edge case handling with red teaming before launch

Write User Stories for AI Features

Requirements & Specs ▼

Craft user stories that capture AI-specific requirements, uncertainty, and iterative learning needs.

When to use

During sprint planning for AI development
When breaking down large AI epics into deliverable increments
When communicating AI requirements to cross-functional teams

Steps

Start with user value: 'As a [user], I want [AI capability] so that [benefit]'. Focus on outcome, not technology.
Add AI-specific details: Include model type, accuracy target, latency requirement, data source, fallback behavior
Split into layers: Story 1: MVP with simple model. Story 2: Improve accuracy. Story 3: Add personalization. Build incrementally.
Include training stories: 'As an ML engineer, I need labeled data to train the classification model' counts as a story
Add monitoring stories: 'As a PM, I want to see model accuracy in production to know when to retrain'

Tips

Use this format: 'As a [user], I want [AI feature] with [performance level] so that [outcome]'
Always pair feature stories with monitoring/evaluation stories in the same sprint

Specify Model Constraints & Requirements

Requirements & Specs ▼

Define technical constraints and non-functional requirements that limit model selection and architecture choices.

When to use

Before ML engineers begin model selection or architecture design
When negotiating tradeoffs between accuracy, latency, and cost
When evaluating whether to use pre-trained vs. custom models

Steps

Latency constraints: Define max acceptable response time. Example: 'P95 latency <500ms' or 'Batch processing <1 hour'
Cost constraints: Set budget per prediction or monthly inference spend. Example: '$0.001 per prediction max' or '$5K/month inference budget'
Data constraints: Privacy requirements, data location restrictions, retention limits. Example: 'No PII can leave EU data centers'
Infrastructure constraints: On-premise vs. cloud, GPU availability, scaling requirements. Example: 'Must run on CPU-only instances'
Model size constraints: Deployment target limits. Example: 'Model must fit in 100MB for mobile deployment'
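
A constraint such as 'P95 latency <500ms' is only useful if it can be checked. The sketch below uses a nearest-rank percentile; the helper names and the sample latencies are made up for illustration.

```python
def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer form of ceil(pct * n / 100)
    return ordered[rank - 1]


def meets_latency_constraint(latencies_ms, limit_ms=500, pct=95) -> bool:
    """True if the chosen percentile is strictly below the limit."""
    return percentile(latencies_ms, pct) < limit_ms


# Illustrative sample: most requests are fast, one in ten is slow.
latencies = [120] * 90 + [900] * 10
mean_ms = sum(latencies) / len(latencies)
p95_ms = percentile(latencies, 95)
print(mean_ms, p95_ms, meets_latency_constraint(latencies))
```

In this sample the mean looks healthy while the P95 breaks the constraint, which is why the constraint is stated as a percentile rather than an average.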

Tips

Document 'must-have' vs. 'nice-to-have' constraints—helps ML engineers make tradeoff decisions
Re-evaluate constraints quarterly—technology improves, costs drop, requirements change

Create Model Evaluation Rubric

Requirements & Specs ▼

Build a standardized scorecard for comparing model candidates and making go/no-go decisions.

When to use

When evaluating multiple model approaches or vendors
Before final model selection for production deployment
When comparing fine-tuned models against baselines

Steps

List evaluation dimensions: Accuracy, latency, cost, maintainability, explainability, fairness, ease of deployment
Define scoring criteria: For each dimension, create 1-5 scale. Example: Accuracy: 1=<70%, 2=70-80%, 3=80-85%, 4=85-90%, 5=>90%
Assign weights: Total should equal 100%. Example: Accuracy 35%, Latency 25%, Cost 20%, Maintainability 15%, Explainability 5%
Evaluate candidates: Score each model on every dimension. Calculate weighted total score.
Set minimum bars: Define deal-breakers. Example: 'Any score <3 on Accuracy is automatic rejection regardless of other scores'
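
The scoring steps above can be sketched as follows. The weights and the accuracy scale are taken from the examples in the steps; the function names, the handling of boundary values (for example, exactly 90% scores 4), and the candidate scores are assumptions for illustration.

```python
from typing import Dict, Optional

# Weights from the example above; they must sum to 100%.
WEIGHTS = {"accuracy": 0.35, "latency": 0.25, "cost": 0.20,
           "maintainability": 0.15, "explainability": 0.05}


def accuracy_score(accuracy: float) -> int:
    """Map accuracy (0-1) to the 1-5 scale from the example above."""
    if accuracy < 0.70:
        return 1
    if accuracy < 0.80:
        return 2
    if accuracy < 0.85:
        return 3
    if accuracy <= 0.90:
        return 4
    return 5


def weighted_total(scores: Dict[str, int]) -> Optional[float]:
    """Weighted 1-5 total, or None if the candidate hits the deal-breaker."""
    if scores["accuracy"] < 3:
        return None   # automatic rejection regardless of other scores
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


# Hypothetical candidate: 87% accuracy plus hand-assigned scores elsewhere.
candidate = {"accuracy": accuracy_score(0.87), "latency": 3, "cost": 5,
             "maintainability": 4, "explainability": 2}
print(weighted_total(candidate))
```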

Tips

Include non-technical stakeholders in weighting exercise—reveals business priorities
Document evaluation in decision log for future reference when explaining model choices

Define Human-in-the-Loop Requirements

Requirements & Specs ▼

Specify when and how humans should review, override, or augment AI decisions.

When to use

For high-stakes AI decisions (hiring, lending, medical, legal)
When model accuracy alone is insufficient for user trust
When designing content moderation or fraud detection systems

Steps

Identify human intervention triggers: When does AI route to human? Low confidence (<70%)? Specific content types? Random sampling?
Define review workflows: Who reviews? What information do they see? What actions can they take? What's the SLA?
Specify override rules: Can humans override AI? Is override logged? Does it retrain the model?
Design feedback loops: How do human decisions improve the model? Label correction? Active learning prioritization?
Plan for scale: What happens when review volume exceeds capacity? Which cases get priority?
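
The intervention triggers and override tracking can be sketched like this. The function names, the content types that always go to review, and the 5% spot-check rate are assumptions for the example; only the 70% confidence cutoff comes from the steps above.

```python
import random


def needs_human_review(confidence, content_type, rng,
                       min_confidence=0.70,
                       always_review=frozenset({"medical", "legal"}),
                       sample_rate=0.05):
    """Return True if a prediction should be routed to a human reviewer."""
    if confidence < min_confidence:      # low-confidence trigger
        return True
    if content_type in always_review:    # content-type trigger
        return True
    return rng.random() < sample_rate    # random spot check


def override_rate(decisions):
    """Share of reviewed cases where the human changed the AI decision.

    decisions: list of (ai_label, human_label) pairs.
    """
    if not decisions:
        return 0.0
    overridden = sum(ai != human for ai, human in decisions)
    return overridden / len(decisions)


rng = random.Random(42)  # seeded only to make the example repeatable
print(needs_human_review(0.55, "general", rng))
print(override_rate([("approve", "approve"), ("approve", "reject"),
                     ("reject", "reject"), ("approve", "approve")]))
```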

Tips

Start with 100% human review at launch, then gradually decrease as model improves and you build trust
Track human override rates: if humans override more than 20% of AI decisions, your model needs improvement

Plan Data Collection Strategy

Data Strategy ▼

Design a systematic approach to gathering, labeling, and maintaining high-quality training data.

When to use

Before starting AI development when you lack sufficient data
When planning to improve model performance through more data
When designing data pipelines for continuous learning

Steps

Quantify data needs: Calculate required examples per class/scenario. Start with 1K minimum, 10K target, 100K for production scale.
Identify data sources: Internal logs, user-generated content, purchased datasets, web scraping, partnerships, synthetic generation
Plan collection timeline: Map data acquisition to development phases. Example: 'MVP needs 5K labeled examples by Month 2'
Design labeling workflow: Who labels? Internal team, contractors, crowdsourcing? What's the quality bar? How much does it cost?
Build validation process: How do you verify label quality? Inter-rater agreement? Expert review? Automated checks?
Set refresh cadence: How often do you collect new data? Daily, weekly, monthly? What triggers data updates?

Tips

Budget $0.10-$5 per label depending on complexity—data labeling often costs more than development
Prioritize data diversity over volume—1K diverse examples beats 10K similar ones

Establish Data Labeling Pipeline

Data Strategy ▼

Build efficient, quality-controlled workflows for annotating training data at scale.

When to use

When you have raw data but need labeled examples for supervised learning
When scaling from prototype to production-quality models
When managing ongoing labeling for model improvements

Steps

Choose labeling approach: In-house experts (high quality, slow, expensive), contractors (medium quality, faster, moderate cost), crowdsourcing (variable quality, fastest, cheap)
Design labeling interface: Simple, clear instructions with examples. Include 'unsure' option. Show previous labels for context.
Implement quality controls: Gold standard test sets (10-20% of labels), measure inter-rater agreement (aim for >80%), require 2-3 labelers per example for disagreement detection
Set up labeling workflow: Task assignment, review queue, dispute resolution process, label correction mechanism
Track metrics: Labels per hour, cost per label, label quality score, labeler agreement rates
Iterate on guidelines: Update labeling instructions weekly based on common errors and edge cases
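
The quality-control step can be made concrete with a simple agreement measure. The sketch below computes raw pairwise agreement, which is an assumption about how "inter-rater agreement" is measured here; it does not correct for chance agreement the way statistics such as Cohen's kappa do. Function names and sample labels are illustrative.

```python
from itertools import combinations


def pairwise_agreement(labels_per_example):
    """Fraction of labeler pairs that agree, pooled over all examples."""
    agree = 0
    total = 0
    for labels in labels_per_example:
        for a, b in combinations(labels, 2):
            agree += a == b
            total += 1
    return agree / total if total else 0.0


def disputed(labels_per_example):
    """Indices of examples where the labelers did not all agree."""
    return [i for i, labels in enumerate(labels_per_example)
            if len(set(labels)) > 1]


batch = [["cat", "cat", "cat"], ["cat", "dog", "cat"], ["dog", "dog"]]
print(pairwise_agreement(batch))
print(disputed(batch))
```

Disputed examples are natural inputs for the dispute resolution queue and for the weekly guideline updates.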

Tips

Start with small batch (100 examples), measure quality, adjust process before scaling to thousands
Pay labelers fairly—quality correlates with compensation and training

Design Active Learning Workflow

Data Strategy ▼

Implement smart sampling strategies that prioritize labeling the most valuable training examples.

When to use

When you have large amounts of unlabeled data but limited labeling budget
When trying to improve model performance efficiently
When deploying models that learn from production data

Steps

Set up uncertainty sampling: Deploy model, capture predictions with confidence scores. Queue low-confidence examples (<70%) for human review.
Implement diversity sampling: Don't just label uncertain examples—also sample to cover edge cases and rare scenarios. Use clustering.
Create review interface: Show model prediction + confidence, allow labeler to correct or confirm, capture reasoning for corrections
Feed labels back: Retrain model weekly or monthly with new labels. Measure if accuracy improves.
Balance exploration vs. exploitation: 80% uncertain examples (exploitation), 20% random samples (exploration for coverage)
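
A minimal sketch of the 80/20 split described above, assuming the pool is a list of `(example_id, confidence)` pairs. The function name and parameters are hypothetical, and the clustering-based diversity sampling from the second step is not covered here.

```python
import random


def select_for_labeling(pool, budget, rng,
                        uncertain_share=0.8, max_confidence=0.70):
    """Pick example ids to label: mostly uncertain, the rest at random.

    pool: list of (example_id, confidence) pairs from the deployed model.
    """
    n_uncertain = round(budget * uncertain_share)
    # Least-confident first, restricted to predictions below the threshold.
    uncertain = sorted((p for p in pool if p[1] < max_confidence),
                       key=lambda p: p[1])
    chosen = [ex_id for ex_id, _ in uncertain[:n_uncertain]]
    # Fill the rest of the budget with random picks from what is left.
    chosen_ids = set(chosen)
    remaining = [ex_id for ex_id, _ in pool if ex_id not in chosen_ids]
    n_random = min(budget - len(chosen), len(remaining))
    chosen += rng.sample(remaining, n_random)
    return chosen


pool = [(i, i / 100) for i in range(100)]   # synthetic confidences 0.00-0.99
picked = select_for_labeling(pool, budget=10, rng=random.Random(0))
print(picked)
```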

Tips

Start active learning after you have 1K baseline labels—need initial model for uncertainty estimates
Track label efficiency: Are you getting accuracy gains per 100 labels? If not, switch sampling strategy

Implement Data Versioning

Data Strategy ▼

Track and manage different versions of training datasets for reproducibility and model comparison.

When to use

When you start training models and need to track which data produced which results
When managing multiple model experiments in parallel
When debugging model performance regressions

Steps

Choose versioning tool: DVC (Data Version Control), LakeFS, Pachyderm, or simple S3 buckets with timestamps
Define versioning strategy: Version on data changes (new labels), schema changes (new features), or time-based (monthly snapshots)
Tag datasets: Use semantic versioning (v1.0, v1.1) or timestamps (2025-01-15). Link each model to its training data version.
Document dataset changes: Changelog for each version: what changed, why, how many examples added/removed/modified
Set up access controls: Who can create new versions? Who can modify existing ones? Ensure test/validation sets never leak.
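
Whichever tool you choose, the tagging and changelog steps come down to recording an identifier, a content fingerprint, and a note for each dataset version. The sketch below is a tool-agnostic illustration using only the standard library; it is not how DVC or LakeFS work internally, and the function names and record layout are assumptions.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Stable SHA-256 over JSON-serializable records, independent of order."""
    digests = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()


def manifest_entry(version, records, note):
    """One changelog entry linking a version tag to the exact data content."""
    return {
        "version": version,
        "num_examples": len(records),
        "sha256": dataset_fingerprint(records),
        "changelog": note,
    }


v1 = [{"text": "great product", "label": "positive"},
      {"text": "arrived broken", "label": "negative"}]
v2 = v1 + [{"text": "ok I guess", "label": "neutral"}]
print(manifest_entry("v1.0", v1, "initial labeled set"))
print(manifest_entry("v1.1", v2, "added 1 neutral example"))
```

Storing the fingerprint next to each trained model makes it possible to confirm later that a rollback uses exactly the same data.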

Tips

Pin production models to specific data versions—makes rollbacks and debugging much easier
Store data samples in version control (100 examples) so teammates can inspect without downloading full dataset

Generate Synthetic Training Data

Data Strategy ▼

Create artificial training examples to augment real data, especially for rare cases or privacy-sensitive scenarios.

When to use

When you lack sufficient real examples for certain categories
When dealing with rare events (fraud, medical conditions, edge cases)
When privacy regulations limit access to real user data

Steps

Choose generation method: Rule-based (templates with variations), generative models (GANs, VAEs), LLMs (for text), data augmentation (transforms)
Start with augmentation: For images/text, apply transforms to real data—rotate, crop, paraphrase. Easiest way to 10x your dataset.
Validate realism: Can humans distinguish synthetic from real examples? If yes, synthetic data is too artificial.
Test model performance: Train on real data only, then real + synthetic. Does synthetic data improve validation accuracy? If not, discard it.
Balance synthetic vs. real: Keep real data as majority (70-90%), use synthetic as supplement (10-30%) for rare cases
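
The real-to-synthetic ratio implies a cap on how many synthetic examples to add for a given amount of real data. The helper below is a small worked calculation; its name and signature are made up for this example.

```python
def max_synthetic_examples(real_count: int, max_synthetic_pct: int = 30) -> int:
    """Largest synthetic count that keeps synthetic data at or below the share.

    Solves s / (real_count + s) <= pct / 100 for s using integer arithmetic.
    """
    if not 0 <= max_synthetic_pct < 100:
        raise ValueError("max_synthetic_pct must be in [0, 100)")
    return real_count * max_synthetic_pct // (100 - max_synthetic_pct)


print(max_synthetic_examples(700))        # cap at a 30% synthetic share
print(max_synthetic_examples(900, 10))    # cap at a 10% synthetic share
```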

Tips

Synthetic data works best for augmenting rare classes—don't use it to replace real data collection
For LLM-generated data, use diverse prompts and validate that examples are factually correct

Implement Data Privacy Controls

Data Strategy ▼

Build safeguards to protect user privacy throughout data collection, training, and inference.

When to use

When handling PII (personally identifiable information) or sensitive data
Before launching in regulated industries (healthcare, finance, education)
When users express privacy concerns about AI features

Steps

Classify data sensitivity: Public, internal, confidential, PII, PHI. Apply appropriate controls to each tier.
Implement data minimization: Collect only data necessary for model training. Avoid collecting PII when possible.
Anonymize training data: Remove names, emails, IDs. Use tokenization, pseudonymization, or differential privacy techniques.
Set retention limits: Define how long you keep training data. Delete after 1-2 years unless needed for compliance.
Control access: Role-based access to training data. Log all data access. Require data handling training for team members.
Plan for deletion: Users can request data deletion (GDPR, CCPA). Have process to remove user data from training sets.
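
Two of the techniques named above, pseudonymization and removal of direct identifiers, can be sketched with the standard library. This is an illustration only: the regular expression catches common email formats rather than all of them, the key shown is a placeholder, and pseudonymized data can still count as personal data, so treat this as one layer of protection.

```python
import hashlib
import hmac
import re

# Simplified pattern; it will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Keyed hash: the same user maps to the same token under the same key."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def redact_emails(text: str) -> str:
    """Replace email addresses in free text with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


key = b"example-key-store-real-keys-in-a-secret-manager"  # placeholder
print(pseudonymize("user-12345", key))
print(redact_emails("Contact jane.doe@example.com today"))
```

Keeping the key separate from the training data means tokens cannot be recomputed from known user IDs by anyone who only has dataset access.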

Tips

Use secure enclaves or federated learning for ultra-sensitive data—model trains without centralizing raw data
Document all privacy measures in your AI product specs—legal and compliance teams need this

Training Data Quality Assurance

Data Strategy ▼

Systematically detect and fix data quality issues that degrade model performance.

When to use

Before training models on new datasets
When model performance is worse than expected
When setting up ongoing data quality monitoring

Steps

Check label accuracy: Sample 200 random examples, manually verify labels. Aim for >95% correct. If lower, retrain labelers or fix guidelines.
Detect label noise: Find examples where multiple labelers disagree. Review and correct. High disagreement indicates unclear guidelines.
Assess class balance: Count examples per category. If any class is <5% of total, collect more examples or use class weighting.
Find duplicates: Use hashing or fuzzy matching to detect near-duplicate examples. Remove to prevent train/test leakage.
Validate feature quality: Check for missing values, outliers, incorrect data types. Implement feature validation pipeline.
Test representative coverage: Does training data cover all scenarios users will encounter in production? Identify gaps.
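
Two of the checks above, class balance and duplicate detection, are easy to automate. The sketch below uses the 5% class share from the steps and exact hashing of whitespace- and case-normalized text; the function names are illustrative, and true near-duplicate detection would need fuzzy matching on top of this.

```python
import hashlib
from collections import Counter


def rare_classes(labels, min_share=0.05):
    """Classes whose share of the dataset is below min_share."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(c for c, n in counts.items() if n / total < min_share)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(text.lower().split())


def duplicate_groups(texts):
    """Groups of indices whose normalized text is identical."""
    seen = {}
    for i, text in enumerate(texts):
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        seen.setdefault(key, []).append(i)
    return [idx for idx in seen.values() if len(idx) > 1]


labels = ["ok"] * 96 + ["fraud"] * 4
print(rare_classes(labels))
print(duplicate_groups(["Hello  world", "hello world", "goodbye"]))
```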

Tips

Automate quality checks—run on every new data batch before adding to training set
Track data quality metrics over time—catch degradation early

Design Data Refresh Strategy

Data Strategy ▼

Plan how and when to update training data to keep models accurate as the world changes.

When to use

When deploying models that will run for months or years
When user behavior or content patterns evolve over time
When setting up MLOps processes for production systems

Steps

Assess data freshness needs: How fast does your domain change? E-commerce trends change weekly, medical knowledge changes yearly.
Set refresh cadence: Daily (real-time personalization), weekly (content moderation), monthly (fraud detection), quarterly (general features)
Define refresh triggers: Time-based (every 30 days), performance-based (accuracy drops 5%), event-based (product launch, seasonality)
Design collection pipeline: Automated data pulls from production, scheduled labeling workflows, incremental dataset updates
Test before deployment: Always validate new data quality before retraining models. Check for distribution shifts or anomalies.
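
The three trigger types can be combined into one check. In this sketch, 'accuracy drops 5%' is interpreted as a drop of 5 percentage points from a recorded baseline, which is an assumption; the function name and parameters are also hypothetical.

```python
from datetime import date


def refresh_reasons(last_refresh: date, today: date,
                    baseline_accuracy: float, current_accuracy: float,
                    max_age_days: int = 30, max_drop: float = 0.05,
                    event: bool = False):
    """Return the list of triggers that currently call for a data refresh."""
    reasons = []
    if (today - last_refresh).days >= max_age_days:
        reasons.append("time")
    if baseline_accuracy - current_accuracy >= max_drop:
        reasons.append("performance")
    if event:   # for example a product launch or seasonal change
        reasons.append("event")
    return reasons


print(refresh_reasons(date(2025, 1, 1), date(2025, 1, 15), 0.90, 0.84))
print(refresh_reasons(date(2025, 1, 1), date(2025, 2, 5), 0.90, 0.89))
```

Returning the reasons rather than a single boolean makes it easier to log why a refresh was started.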

Tips

Start with monthly refreshes, then adjust based on monitoring—over-refreshing wastes resources
Keep historical data—you may need to retrain on older distributions if new data is poisoned

Plan Model Development Sprint

Model Development ▼

Structure two-week sprints that balance model experimentation with product progress.

When to use

When starting AI development with ML engineering teams
When adapting agile processes for machine learning work
When stakeholders need visibility into AI development progress

Steps

Set sprint goal: Focus on outcome, not model type. Example: 'Achieve 85% accuracy on validation set' not 'Try neural network'
Allocate experiment budget: Reserve 60% sprint capacity for model experiments, 20% for data work, 20% for infrastructure/tooling
Plan experiments: List 3-5 experiments to try. Example: 'Test XGBoost, fine-tune BERT, try ensemble'. Prioritize by expected impact.
Define success criteria: What metrics determine if an experiment worked? Be specific: 'Accuracy >85% AND latency <500ms'
Schedule demo: End each sprint with model performance demo—show metrics, example predictions, learned insights

Tips

Don't commit to specific models—commit to achieving performance targets. ML is iterative.
Track 'negative results' as progress—knowing what doesn't work has value

Track Model Experiments

Model Development ▼

Log and compare model experiments to identify what works and maintain reproducibility.

When to use

As soon as you start training models—don't wait until you have many experiments
When comparing multiple approaches or hyperparameter configurations
When you need to reproduce results or explain model choices to stakeholders

Steps

Choose experiment tracking tool: MLflow, Weights & Biases, Neptune.ai, or simple spreadsheet for small projects
Log experiment metadata: Model type, hyperparameters, training data version, features used, training duration, cost
Track key metrics: Training accuracy, validation accuracy, test accuracy, precision, recall, F1, latency, model size
Document insights: What worked? What failed? What surprised you? Store in experiment notes or shared doc.
Compare experiments: Sort by validation accuracy. Identify best performers. Look for patterns—what do top models have in common?

Tips

Log experiments automatically in training scripts—manual logging leads to gaps
Name experiments descriptively: 'bert-base-lr-1e-5-batch-32' not 'experiment_17'

Establish Model Baselines

Model Development ▼

Create simple benchmark models to measure if sophisticated ML approaches actually add value.

When to use

At the start of every AI project, before building complex models
When justifying investment in ML vs. simpler approaches
When evaluating if model improvements are meaningful

Steps

Create majority class baseline: Always predict the most common category. Example: If 80% of emails are not spam, baseline accuracy is 80%.
Build rule-based baseline: Use domain knowledge to create if-then rules. Example: Flag transaction as fraud if amount >$1,000 and the account is new.
Try simple ML baseline: Logistic regression or decision tree with basic features. Takes hours to implement, not weeks.
Measure baseline performance: Track same metrics you'll use for production model. Document baseline results.
Set improvement target: Production model must beat baseline by meaningful margin. Example: '>10 percentage points better accuracy'
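
The majority class baseline and the improvement target can be sketched in a few lines. The helper names are made up for this example; the 80/20 label split and the 10-point margin mirror the examples in the steps.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _example: most_common


def accuracy(predict, examples, labels):
    """Share of examples where the prediction matches the label."""
    correct = sum(predict(x) == y for x, y in zip(examples, labels))
    return correct / len(labels)


def beats_baseline(model_acc, baseline_acc, min_margin=0.10):
    """True if the model beats the baseline by more than min_margin."""
    return model_acc - baseline_acc > min_margin


labels = ["not_spam"] * 80 + ["spam"] * 20
emails = [None] * len(labels)            # inputs are ignored by this baseline
baseline = majority_class_baseline(labels)
baseline_acc = accuracy(baseline, emails, labels)
print(baseline_acc, beats_baseline(0.93, baseline_acc))
```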

Tips

Many projects discover that simple baselines are 'good enough' and cancel complex ML work—that's a win
Always compare new models to baseline, not just to previous model version

Evaluate Model Performance

Model Development ▼

Assess model quality across multiple dimensions beyond simple accuracy scores.

When to use

After training models but before deployment decisions
When comparing model candidates for production
When debugging why production performance differs from development

Steps

Test on held-out data: Evaluate on data the model has never seen. Never use test set during training or hyperparameter tuning.
Measure comprehensive metrics: Accuracy, precision, recall, F1, AUC-ROC. Choose primary metric based on business impact (false positives vs. false negatives).
Analyze per-class performance: Confusion matrix reveals which categories model struggles with. Weaker performance may be acceptable on rare classes.
Test on edge cases: Create separate test set of difficult examples. Example: Ambiguous queries, adversarial inputs, edge case scenarios.
Measure latency and cost: Time each prediction. Calculate cost per 1,000 predictions. Ensure within budget.
Review error cases: Manually inspect 50 wrong predictions. Categorize errors—helps prioritize improvements.
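
To show why accuracy alone can mislead, the sketch below computes accuracy, precision, recall, and F1 for a binary task from scratch. The function name and the sample labels are illustrative; in practice a library such as scikit-learn provides equivalent metrics.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Imbalanced sample: 3 positives out of 10, and the model finds only one.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy looks acceptable while recall shows that most positives are missed, which is the kind of gap the per-class analysis is meant to surface.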

Tips

For production decisions, p95 and p99 latency matter more than average latency
Test demographic fairness—measure model performance across user segments (gender, age, geography)

Run Model Iteration Loops

Model Development ▼

Systematically improve model performance through structured iteration cycles.

When to use

When initial model meets baseline but not production requirements
When you have time/budget for multiple improvement cycles
When deciding where to invest effort for maximum gain

Steps

Analyze failure modes: Review model errors. Group into categories—data quality issues, missing features, model limitations, edge cases.
Prioritize improvements: Estimate impact and effort for each fix. Focus on high-impact, low-effort wins first.
Run targeted experiments: Try one major change per iteration. Example: Add new feature, collect more data for weak class, try different architecture.
Measure impact: Compare new model to previous best. Did accuracy improve? By how much? On which categories?
Iterate or ship: If model meets launch criteria, ship it. If not, run another cycle. Timebox iterations—diminishing returns after 3-4 cycles.

Tips

Track marginal improvement per iteration—if gaining <2% accuracy per cycle, diminishing returns suggest moving to production
Balance model quality with time-to-market: perfect is the enemy of shipped

Optimize Model Performance

Model Development ▼

Improve model speed and reduce costs without sacrificing accuracy.

When to use

When model accuracy is good but latency or cost too high
Before scaling to millions of predictions per day
When infrastructure costs are eating into product margins

Steps

Profile bottlenecks: Measure where time is spent—data loading, preprocessing, model inference, post-processing. Optimize the slowest part first.
Optimize inference: Use smaller model variants (DistilBERT vs. BERT), quantization (FP16 or INT8), batching, caching frequent predictions.
Reduce model size: Prune unnecessary weights, knowledge distillation (train small model to mimic large one), feature selection.
Optimize deployment: Use faster hardware (GPUs for large models), serverless for variable load, edge deployment to reduce network latency.
Measure tradeoffs: Track accuracy, latency, cost after each optimization. Ensure accuracy doesn't drop >2-3 percentage points.

Tips

Reducing precision from FP32 to FP16 often gives about a 2x speedup on hardware with native FP16 support, with <1% accuracy loss, so try it first
Cache predictions for repeated inputs—many applications have high overlap in queries

Implement Model Versioning

Model Development ▼

Track, compare, and manage different model versions across environments.

When to use

When deploying models to production for the first time
When managing multiple model versions in parallel
When you need to roll back to previous model versions

Steps

Choose versioning scheme: Semantic versioning (v1.0, v1.1, v2.0) or timestamp-based (2025-01-15-1530). Be consistent.
Tag model artifacts: Version model weights, preprocessing code, feature definitions, inference code. Package together.
Link to training data: Record which data version trained each model. Enables reproduction and debugging.
Track deployment: Which version is in production? Staging? Development? Use model registry (MLflow, SageMaker).
Set retention policy: Keep last 3-5 production models for quick rollback. Archive older models unless needed for compliance.

Tips

Store model metadata: Training date, performance metrics, owner, intended use. Makes it easy to compare versions.
Automate version bumping—manual versioning leads to errors and confusion

Design AI UX Patterns

UX & Product Design ▼

Apply proven UX patterns that help users understand and trust AI-powered features.

When to use

When designing interfaces for AI features
When users express confusion or mistrust of AI outputs
Before conducting usability testing of AI products

Steps

Show confidence levels: When model is uncertain (<70% confidence), communicate this to users. Example: 'I'm not sure, here are 3 options.'
Provide explanations: Show why AI made a decision. Example: 'Recommended because you viewed similar products.' Keep simple, not technical.
Enable feedback: Add thumbs up/down, 'Was this helpful?', or report buttons. Collect user corrections to improve model.
Offer alternatives: For key decisions, show top 3 predictions instead of only #1. Lets users choose if top pick is wrong.
Make AI status visible: Show when AI is thinking (loading), when it's done, when it failed. Don't hide AI delays.

Tips

Test AI explanations with users—what makes sense to you may confuse them
Balance transparency with simplicity—too much detail overwhelms, too little erodes trust

Design Loading & Latency States

UX & Product Design ▼

Create UX patterns that keep users engaged while AI processes requests.

When to use

When AI latency is unavoidably >1 second
When designing async AI features (report generation, video processing)
When users complain that AI features feel slow or unresponsive

Steps

Categorize by latency: Instant (<100ms), responsive (<1s), deliberate (1-5s), background (>5s). Each needs different UX.
Show immediate feedback: Display loading indicator within 100ms of user action. Proves system is working.
Use progressive disclosure: For long tasks, show interim results. Example: 'Found 20 results... still searching...' then final count.
Set expectations: Tell users how long to expect. 'This usually takes 30 seconds.' Uncertainty is worse than slow.
Make waiting engaging: Show fun loading messages, progress bars, skeleton screens. Distract from wait time.
Enable async patterns: For >10s tasks, let users do other things. Notify when done via email, notification, or dashboard.
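
The latency categories in the first step can be sketched as a small lookup. This is a minimal illustration, assuming latency is measured in milliseconds. The names `latency_tier`, `UX_PATTERN`, and `ux_for` are hypothetical, the pattern descriptions paraphrase the steps above, and treating exactly 1s and exactly 5s as "deliberate" is just one reasonable reading of the ranges.

```python
def latency_tier(latency_ms: float) -> str:
    """Map an expected latency (milliseconds) to the tier names used above."""
    if latency_ms < 100:
        return "instant"
    if latency_ms < 1000:
        return "responsive"
    if latency_ms <= 5000:
        return "deliberate"
    return "background"


# Hypothetical tier-to-pattern mapping; adjust to your own product.
UX_PATTERN = {
    "instant": "no extra indicator (assumption: the result is already visible)",
    "responsive": "lightweight loading indicator",
    "deliberate": "progress bar or skeleton screen with an expected duration",
    "background": "interim results, then async notification when done",
}


def ux_for(latency_ms: float) -> str:
    """Return the suggested UX pattern for a given expected latency."""
    return UX_PATTERN[latency_tier(latency_ms)]
```

For instance, `ux_for(30_000)` falls into the background tier, which matches the advice to let users do other things during long tasks.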

Tips

Perceived latency matters more than actual latency—good UX makes 3s feel like 1s
Test loading states with intentionally delayed responses—reveals UX bugs

Design AI Error States

UX & Product Design ▼

Create clear, actionable error messages when AI features fail or produce low-confidence outputs.

When to use

When designing AI features that can fail or return uncertain results
When users report confusion about AI errors
When model confidence varies significantly across inputs

Steps

Categorize error types: Model failure (crashed), low confidence (<70%), ambiguous input, rate limiting, inappropriate request
Write user-friendly messages: Avoid technical jargon. Example: 'I couldn't understand your request' not 'Model returned null'
Provide next steps: Tell users what to do. 'Try rephrasing your question' or 'Here's a human expert who can help.'
Offer fallbacks: When AI fails, route to rules-based system, human expert, or simpler alternative.
Log error details: Capture input, model version, confidence, latency for debugging. Don't show to users but track for engineering.

Tips

Never say 'AI error' or 'Model failed'—users don't care about implementation, they want solutions
Test error states as thoroughly as success states—errors happen 5-20% of the time in production

Implement Confidence Score Display

UX & Product Design ▼

Communicate model uncertainty to users in intuitive, non-technical ways.

When to use

For high-stakes AI decisions (medical, financial, legal)
When model accuracy varies significantly across inputs
When you want users to verify AI outputs before acting

Steps

Choose confidence threshold: Low (<70%), medium (70-85%), high (>85%). Adjust based on domain and user testing.
Design visual indicators: Stars (★★★★★), bars (▮▮▮▯▯), labels ('High confidence', 'Low confidence'), colors (green/yellow/red)
Provide context: Explain what confidence means. 'High confidence: I'm very sure' vs. 'Low confidence: Please double-check this.'
Adjust behavior by confidence: High confidence = show single answer. Low confidence = show multiple options or route to human.
Test comprehension: Ask users what different confidence levels mean. Iterate until 80%+ interpret correctly.
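
The thresholds and the behavior rule above can be sketched as follows. This is a hedged example, assuming the model returns a score between 0 and 1. The names `confidence_band`, `LABELS`, and `present` are hypothetical, and showing up to three options for medium confidence is an assumption, since the steps only specify the high and low cases.

```python
from typing import Dict, List


def confidence_band(score: float) -> str:
    """Bucket a 0-1 confidence score: low (<70%), medium (70-85%), high (>85%)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < 0.70:
        return "low"
    if score <= 0.85:
        return "medium"
    return "high"


# User-facing wording; the medium label is an assumption.
LABELS = {
    "low": "Low confidence: please double-check this.",
    "medium": "Medium confidence",
    "high": "High confidence: I'm very sure.",
}


def present(score: float, ranked_options: List[str]) -> Dict[str, object]:
    """High confidence shows one answer; otherwise show up to three options."""
    band = confidence_band(score)
    shown = ranked_options[:1] if band == "high" else ranked_options[:3]
    return {"band": band, "label": LABELS[band], "options": list(shown)}
```

Keeping the thresholds in one function makes them easy to adjust after user testing, as the first step recommends.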

Tips

Avoid raw percentages—'87% confident' means different things to different users
Consider hiding confidence for consumer products but showing it for professional/enterprise tools

Design Progressive Disclosure

UX & Product Design ▼

Structure AI interfaces to show simple results first with option to drill into details.

When to use

When AI produces complex outputs with multiple components
When users have varying expertise levels and information needs
When you want to reduce cognitive load while preserving access to details

Steps

Identify information layers: Core result (always shown), supporting details (click to expand), advanced info (settings/preferences)
Design default view: Show only essential information. Example: Search shows top result + 'See 10 more' vs. all 50 results.
Add expansion points: 'Show more', 'Details', 'Why this recommendation', 'Advanced options'. Make discoverable but not intrusive.
Preserve context: When user expands details, keep core result visible. Don't navigate away or replace entire screen.
Remember preferences: If user always expands details, make that their default. Learn from behavior.

Tips

80% of users need only surface-level info—optimize for them, not power users
Test with novices and experts—both should find the experience intuitive

Design AI Explanation Interfaces

UX & Product Design ▼

Create interfaces that help users understand why AI made specific decisions.

When to use

For high-stakes decisions requiring user trust (loans, hiring, medical)
When regulatory requirements mandate explainability (GDPR, financial services)
When users frequently question or override AI recommendations

Steps

Choose explanation method: Feature importance ('Price and location drove this score'), example-based ('Similar to properties you viewed'), counterfactual ('If price were $50K less, recommendation would change')
Match explanation to audience: Non-technical users need simple language, experts can handle technical details. Test comprehension.
Show top factors only: Display 3-5 most important factors, not all 50 features. 'Income, credit score, and employment history were most important.'
Make explanations actionable: If user can change outcome, tell them how. 'Improve credit score by 50 points to qualify.'
Validate accuracy: Ensure explanations reflect actual model logic. Use LIME, SHAP, or other XAI tools. Test edge cases.
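
The "show top factors only" step can be sketched like this. It assumes you already have per-feature contribution scores from an attribution tool such as SHAP or LIME; no such library is called here, and the names `top_factors` and `explain` and the sample numbers in the usage note are hypothetical.

```python
from typing import Dict, List


def top_factors(contributions: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k feature names with the largest absolute contribution."""
    ranked = sorted(
        contributions,
        key=lambda name: abs(contributions[name]),
        reverse=True,
    )
    return ranked[:k]


def explain(contributions: Dict[str, float], k: int = 3) -> str:
    """Turn the top factors into one plain-language sentence."""
    names = top_factors(contributions, k)
    if not names:
        return "No explanation available."
    if len(names) == 1:
        listed, verb = names[0], "was"
    else:
        listed, verb = ", ".join(names[:-1]) + " and " + names[-1], "were"
    return f"{listed[0].upper()}{listed[1:]} {verb} most important."
```

With contributions of `{"income": 0.42, "credit score": -0.35, "employment history": 0.20, "zip code": 0.01}`, `explain` names income, credit score, and employment history and drops the negligible zip code factor.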

Tips

Oversimplified explanations can misrepresent the model; balance accuracy with understandability
Let users drill down: Show simple explanation by default, offer 'Technical details' for experts

Design Feedback Collection Mechanisms

UX & Product Design ▼

Build interfaces that capture user feedback on AI outputs to enable continuous improvement.

When to use

For all AI features in production—feedback drives improvement
When implementing active learning or human-in-the-loop systems
When model performance needs ongoing monitoring and tuning

Steps

Choose feedback types: Implicit (clicks, time on page, conversions), explicit (thumbs up/down, ratings, corrections), detailed (text feedback, report issue)
Design for low friction: One-click feedback is used 10x more than forms. 'Was this helpful? Yes/No' beats 'Rate 1-5 stars with comment'
Capture corrections: Let users fix wrong predictions. 'This is actually spam' or 'Correct category: Electronics'. Enables retraining.
Close feedback loop: Show users that feedback matters. 'Thanks, we'll improve based on your input' or 'Your feedback improved results for everyone.'
Instrument everything: Log feedback with prediction details (input, output, confidence, model version). Enables analysis.

Tips

Aim for 5-10% feedback rate minimum—below 2% means your mechanism is too hard to use
Incentivize feedback for cold-start: 'Rate 5 results to unlock personalization' works well

Design Onboarding for AI Features

UX & Product Design ▼

Educate users about AI capabilities, limitations, and how to get best results.

When to use

When launching new AI features to existing user base
When AI behavior differs from user expectations
When users don't know AI features exist or how to use them

Steps

Set expectations: Tell users what AI can and cannot do. 'I can summarize documents up to 50 pages' sets clear boundaries.
Show examples: Demonstrate with real use cases. 'Try asking: Summarize this contract' or show sample outputs.
Teach best practices: Help users craft effective inputs. 'Be specific: instead of "cars", try "red sedans under $30K".'
Progressive disclosure: Don't dump all features at once. Introduce advanced features after user masters basics.
Offer contextual help: Provide tips in-app at point of use. Tooltip on search box: 'I understand natural language questions.'

Tips

Test onboarding with users who have never seen your product—reveals hidden assumptions
Track feature discovery and usage—if <50% of users find AI feature, your onboarding failed

Design AI Testing Strategy

Testing & Validation ▼

Create comprehensive test plans that cover model performance, system behavior, and user experience.

When to use

Before AI feature development begins—testing is not an afterthought
When planning QA resources and timelines for AI projects
When deciding what testing is required before launch

Steps

Unit tests: Test data pipelines, feature engineering, pre/post-processing logic. These should be deterministic and fast.
Model tests: Evaluate accuracy on test set, measure fairness across demographics, test edge cases, validate confidence calibration
Integration tests: Test full system—user input to model prediction to UI display. Include latency, error handling, fallbacks.
User acceptance tests: Real users test with realistic tasks. Measure task success rate, user satisfaction, confusion points.
Production validation: Shadow mode, canary deployment, A/B test. Measure real-world performance before full rollout.

Tips

Allocate 30-40% of development timeline to testing—AI testing takes longer than traditional software
Create a regression test suite: as you fix issues, add them to the automated tests to prevent recurrence

Implement A/B Testing for AI

Testing & Validation ▼

Design experiments to measure real-world impact of AI models and features.

When to use

When comparing model versions before rolling out to all users
When measuring business impact of AI features
When deciding between different AI approaches or UX designs

Steps

Define hypothesis: Be specific. 'New model will increase click-through rate by >5%' not 'New model is better'
Choose success metrics: Primary (e.g., task success rate) and secondary (e.g., time on page, user satisfaction). Align with business goals.
Design experiment: Random user assignment (50/50 split), minimum sample size (determine it with a power analysis; typically 10K+ users), duration (run 1-2 weeks minimum)
Monitor for issues: Check for errors, performance degradation, user complaints. Have kill switch ready if experiment causes problems.
Analyze results: Compare metrics with statistical significance tests. Look for segment differences (e.g., works for US but not EU users).
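
The sample-size and significance steps can be sketched with the standard library alone. This is a minimal sketch, assuming a binary metric such as click-through, a two-sided test, and the usual normal approximation for comparing two proportions. The function names and the defaults (alpha of 0.05, power of 0.80) are illustrative assumptions, not values from the steps above.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control: float, p_treatment: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group to detect p_control vs p_treatment."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Under these assumptions, detecting a lift from a 10% to a 10.5% click-through rate needs a little under 58,000 users per group, which shows why small effects require large samples.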

Tips

Run A/A tests first (same model in both groups)—validates your experiment infrastructure
Don't stop experiments early even if winning—need full sample size for valid results

Run Shadow Mode Testing

Testing & Validation ▼

Deploy new models in production without showing outputs to users to validate real-world performance safely.

When to use

Before launching new models to users for the first time
When testing major model changes or rewrites
When you want to measure production performance without user risk

Steps

Set up shadow deployment: Deploy new model alongside production model. Route same inputs to both. Show only production model output to users.
Log shadow predictions: Capture new model outputs, confidence scores, latency, errors. Store for analysis.
Compare to production: Measure agreement rate between models. Analyze disagreements—is new model fixing bugs or introducing new errors?
Monitor performance: Track shadow model accuracy, latency, error rates, cost. Ensure meets production requirements.
Validate at scale: Run shadow mode for 1-2 weeks with full production traffic volume. Reveals issues that don't appear in testing.
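
The "compare to production" step can be sketched as below. It assumes the predictions of both models were logged in the same request order. The function name `agreement_report` is hypothetical.

```python
from typing import List, Sequence, Tuple


def agreement_report(production: Sequence[str],
                     shadow: Sequence[str]) -> Tuple[float, List[int]]:
    """Return the agreement rate and the indices where the two models disagree."""
    if len(production) != len(shadow):
        raise ValueError("both logs must cover the same requests")
    if not production:
        raise ValueError("no predictions to compare")
    disagreements = [
        index
        for index, (prod_out, shadow_out) in enumerate(zip(production, shadow))
        if prod_out != shadow_out
    ]
    rate = 1 - len(disagreements) / len(production)
    return rate, disagreements
```

The returned indices are the cases worth reviewing by hand to decide whether the new model is fixing bugs or introducing new errors.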

Tips

Shadow mode is expensive (2x compute) but invaluable for risk reduction—worth it for critical features
Set success criteria before shadow mode—know what metrics determine go/no-go for promotion

Conduct AI Red Teaming

Testing & Validation ▼

Simulate adversarial attacks and edge case scenarios to find AI vulnerabilities before users do.

When to use

Before launching consumer-facing AI features, especially conversational AI
For high-stakes applications (content moderation, security, financial decisions)
When testing robustness of safety guardrails

Steps

Recruit red team: Mix of security experts, domain experts, and creative thinkers. External teams find more issues than internal.
Define attack scenarios: Prompt injection, jailbreaking, bias exploitation, misinformation generation, adversarial inputs, edge case enumeration
Run attack sprints: Give red team 3-5 days to find vulnerabilities. Document all successful attacks with reproduction steps.
Triage findings: Severity scoring (critical/high/medium/low). Must-fix before launch vs. acceptable risk vs. post-launch improvement.
Implement mitigations: Add input filters, output filters, safety layers, fallback behaviors. Re-test to verify fixes work.

Tips

Budget $10-50K for external red teaming—finding issues pre-launch is 100x cheaper than post-launch PR disasters
Run red teaming quarterly for live products—new attack techniques emerge constantly

Execute User Acceptance Testing

Testing & Validation ▼

Validate that AI features meet user needs through structured testing with real users.

When to use

After AI features are functionally complete but before launch
When validating that AI solves the intended user problem
When gathering evidence for launch decision

Steps

Recruit representative users: 10-20 users matching target demographic. Include skeptics and early adopters. Compensate appropriately.
Design test scenarios: Create 5-10 realistic tasks users would do with AI feature. Example: 'Find red sedans under $30K in your area'
Measure task success: Can users complete tasks? How long does it take? How many attempts? What's user satisfaction score?
Capture qualitative feedback: What confused users? What delighted them? What would they change? Where did AI fail their expectations?
Test edge cases: Give users ambiguous, difficult, or unusual inputs. How does system handle? Do users understand error messages?

Tips

Test with users who have NOT seen the product before—your internal team is blind to usability issues
Video record sessions—watching users struggle reveals insights that surveys miss

Test Model Fairness

Testing & Validation ▼

Measure and validate that AI models perform equitably across different user groups.

When to use

Before launching AI that impacts people (hiring, lending, content recommendations)
When building AI for diverse user populations
When regulatory or ethical standards require fairness audits

Steps

Identify protected groups: Demographics (age, gender, race), geography, socioeconomic status, language. Base on domain and regulations.
Measure performance by group: Calculate accuracy, precision, recall, false positive/negative rates for each group. Look for disparities.
Define fairness criteria: Demographic parity (equal rates of positive outcomes across groups)? Equalized odds (equal true-positive and false-positive rates across groups)? Choose the standard appropriate to the domain.
Quantify disparities: If accuracy for Group A is 90% but Group B is 75%, that's a 15-point gap. Set an acceptable threshold (e.g., a gap under 5 percentage points).
Mitigate bias: Collect more training data for underperforming groups, use fairness constraints during training, post-process predictions to equalize outcomes
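
The per-group measurement and gap calculation can be sketched as follows. This example assumes binary labels (1 = positive) and records of the form `(group, y_true, y_pred)`. The function names and the record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, int]  # (group, true label, predicted label)


def accuracy_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Share of correct predictions within each group."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def false_positive_rate_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Share of true negatives predicted positive; groups without negatives are skipped."""
    false_pos, negatives = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 0:
            negatives[group] += 1
            false_pos[group] += int(y_pred == 1)
    return {group: false_pos[group] / negatives[group] for group in negatives}


def max_gap_points(metric_by_group: Dict[str, float]) -> float:
    """Largest difference between any two groups, in percentage points."""
    values = list(metric_by_group.values())
    return (max(values) - min(values)) * 100
```

Comparing `max_gap_points` for each metric against your chosen threshold gives a simple pass/fail signal for a fairness review.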

Tips

Fairness is multi-dimensional and contextual—no single metric captures all concerns
Document fairness analysis in launch review—shows stakeholders you took responsibility

Create Automated Test Suites

Testing & Validation ▼

Build automated tests for AI systems that run continuously to catch regressions and issues.

When to use

After initial AI launch when entering maintenance mode
When iterating on models frequently
When you need to ensure new model versions don't break existing functionality

Steps

Build golden test sets: Curate 100-500 examples with known correct outputs. Cover typical cases and edge cases. Version control this dataset.
Automate accuracy tests: Run new models against golden test set. Flag if accuracy drops >3% from previous version.
Test system integration: Automate end-to-end tests—API calls, response format, latency, error handling. Run on every deploy.
Monitor data quality: Automate validation of input data—schema checks, range checks, null detection, distribution monitoring.
Run regression tests: When fixing bugs, add failing cases to automated suite. Prevents reintroduction of same bugs.
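
The golden-set accuracy check can be sketched as below. It assumes the model is exposed as a plain function from input text to label, and it reads the ">3%" rule as a drop of more than 3 percentage points. The names `golden_accuracy` and `regression_detected` are hypothetical.

```python
from typing import Callable, List, Tuple


def golden_accuracy(model: Callable[[str], str],
                    golden_set: List[Tuple[str, str]]) -> float:
    """Fraction of golden examples where the model returns the expected output."""
    correct = sum(1 for text, expected in golden_set if model(text) == expected)
    return correct / len(golden_set)


def regression_detected(previous_accuracy: float, new_accuracy: float,
                        max_drop_points: float = 3.0) -> bool:
    """True if accuracy fell by more than max_drop_points percentage points."""
    return (previous_accuracy - new_accuracy) * 100 > max_drop_points
```

In a CI pipeline, a `True` result from `regression_detected` would be the signal to block the deployment.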

Tips

Run automated tests on every code change AND weekly even without changes—catches data drift
Integrate with CI/CD pipeline—block deployments that fail critical tests

Plan Phased Rollout

Launch & Monitoring ▼

Deploy AI features incrementally to manage risk and learn from early users before full launch.

When to use

For all AI features—phased rollouts are best practice, not optional
When launching to large user bases where issues could affect millions
When uncertainty about production performance remains after testing

Steps

Define rollout phases: 1% (internal + beta), 5% (early adopters), 25% (broader test), 100% (full launch). Adjust percentages based on user base size.
Set phase duration: Run each phase 3-7 days minimum. Longer for complex features or when monitoring slow metrics (e.g., retention).
Define promotion criteria: What metrics must be met to move to next phase? Example: '95% task success, <2s latency p95, <0.1% error rate, NPS >40'
Plan rollback triggers: What causes immediate rollback? Example: 'Error rate >1%, latency >5s p95, user complaints spike >5x baseline'
Communicate timeline: Tell stakeholders and users the rollout plan. Manage expectations—'rolling out over 2 weeks' prevents 'why don't I have it?' questions.
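
Percentage-based phases and the promotion and rollback criteria can be sketched as follows. This is one common approach, assuming users are assigned to a stable bucket by hashing their ID so that raising the percentage only adds users. The function names and metric keys are hypothetical, and the thresholds are copied from the examples in the steps above.

```python
import hashlib
from typing import Dict


def rollout_bucket(user_id: str, feature: str) -> int:
    """Deterministically map a user to a bucket from 0 to 99 for this feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """A user sees the feature if their bucket falls below the current percentage."""
    return rollout_bucket(user_id, feature) < rollout_percent


def next_action(metrics: Dict[str, float]) -> str:
    """Apply rollback triggers first, then promotion criteria; otherwise hold."""
    if (metrics["error_rate"] > 0.01
            or metrics["latency_p95_s"] > 5.0
            or metrics["complaints_vs_baseline"] > 5.0):
        return "rollback"
    if (metrics["task_success"] >= 0.95
            and metrics["latency_p95_s"] < 2.0
            and metrics["error_rate"] < 0.001
            and metrics["nps"] > 40):
        return "promote"
    return "hold"
```

Because the bucket depends only on the feature name and user ID, a user enabled at 5% stays enabled at 25% and 100%.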

Tips

Use feature flags for instant rollback without redeployment—essential for risk management
Bias initial phases toward power users or opt-in beta testers—they provide better feedback

Set Up Model Monitoring

Launch & Monitoring ▼

Instrument production AI systems to track model performance, data drift, and system health.

When to use

Before launching AI features to production—monitoring is not optional
When models are live but you lack visibility into production performance
When setting up MLOps processes

Steps

Track model metrics: Log predictions, confidence scores, latency for every request. Calculate accuracy, precision, recall daily from user feedback.
Monitor input distribution: Track feature distributions over time. Alert if input data shifts significantly from training distribution.
Set up alerts: Define thresholds for key metrics. Example: 'Alert if accuracy drops >5%, latency p95 >1s, error rate >1%'
Create dashboards: Visualize metrics for PM, engineers, executives. Show trends over time, comparison to baselines, breakdown by user segments.
Log errors: Capture all failures—model errors, timeouts, invalid inputs. Review weekly to identify patterns.
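
One way to implement the input-distribution check is the Population Stability Index (PSI), sketched below for a single numeric feature. PSI is a technique swapped in for illustration, since the steps above do not name a specific drift metric. The bin edges are supplied by the caller, and the 0.2 alert threshold is a commonly quoted rule of thumb that is treated here as an assumption to tune.

```python
import bisect
import math
from typing import List, Sequence


def bin_proportions(values: Sequence[float], edges: Sequence[float],
                    eps: float = 1e-6) -> List[float]:
    """Share of values per bin; out-of-range values fall into the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        index = bisect.bisect_right(edges, value) - 1
        index = min(max(index, 0), len(counts) - 1)
        counts[index] += 1
    # eps keeps empty bins from producing log(0) or division by zero.
    return [max(count / len(values), eps) for count in counts]


def population_stability_index(training: Sequence[float], live: Sequence[float],
                               edges: Sequence[float]) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = bin_proportions(training, edges)
    actual = bin_proportions(live, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def drift_alert(training: Sequence[float], live: Sequence[float],
                edges: Sequence[float], threshold: float = 0.2) -> bool:
    """True when the live distribution has shifted past the threshold."""
    return population_stability_index(training, live, edges) > threshold
```

Identical distributions give a PSI of 0, and the value grows as the live data moves away from the training data.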

Tips

Monitor business metrics too, not just model metrics—user satisfaction and revenue matter more than accuracy
Use existing tools (Datadog, Grafana, CloudWatch) plus ML-specific tools (Arize, Fiddler, WhyLabs)

Build Monitoring Dashboards

Launch & Monitoring ▼

Create visual dashboards that surface AI system health and performance for different stakeholders.

When to use

After instrumenting monitoring—raw logs are useless without visualization
When stakeholders ask 'How is the AI performing?' and you don't have an answer
When managing multiple AI features or models in production

Steps

Design for audience: PM dashboard (user metrics, business impact), engineering dashboard (system health, latency, errors), executive dashboard (high-level KPIs)
Include key metrics: Model accuracy, user satisfaction, task success rate, latency (p50/p95/p99), error rate, cost per prediction, usage volume
Show trends: Current value vs. yesterday, last week, last month. Spot degradation early. Annotate with model version deploys.
Add drill-down: Click on metric to see breakdown by user segment, geography, device, time of day. Reveals where issues are concentrated.
Make actionable: Every dashboard should answer 'What should I do?' Include alerts, thresholds, comparison to targets.

Tips

Start simple—one dashboard with 6-8 key metrics beats ten dashboards nobody looks at
Review dashboards weekly in team meetings—makes monitoring a habit, not an afterthought

Design Incident Response Plan

Launch & Monitoring ▼

Define procedures for detecting, triaging, and resolving AI system failures in production.

When to use

Before launching AI to production—hope for best, plan for worst
After experiencing AI incidents without clear response procedures
When onboarding on-call engineers for AI systems

Steps

Define incident types: Model performance drop, latency spike, error rate spike, cost overrun, harmful outputs, data pipeline failure
Set severity levels: P0 (user-facing complete failure), P1 (degraded performance), P2 (minor issue), P3 (monitoring alert, no user impact)
Create runbooks: Step-by-step guides for common incidents. Example: if accuracy drops >10%, (1) check recent data, (2) compare to the baseline model, (3) roll back if needed.
Assign on-call: Who responds to incidents? Rotation schedule? Escalation path if on-call can't resolve?
Define communication: Who gets notified? Users? Stakeholders? Executives? What's the message template?
Post-incident review: After major incidents, conduct blameless post-mortem. Document learnings, prevent recurrence.

Tips

Practice incident response with fire drills—uncovers gaps in procedures
Have rollback plan ready—ability to quickly revert to previous model is crucial

Implement Feedback Collection

Launch & Monitoring ▼

Deploy mechanisms to gather user feedback on AI outputs for continuous improvement.

When to use

At launch—feedback collection is core feature, not add-on
When model accuracy is good but you want to make it great
When implementing active learning or continuous training

Steps

Implement explicit feedback: Thumbs up/down, star ratings, 'Report issue' buttons. Make one-click easy.
Track implicit feedback: Click-through rate, time on page, task completion, return usage. Often more reliable than explicit feedback.
Collect corrections: Let users fix wrong predictions. 'This is actually X' or 'Correct answer: Y'. Generates training data.
Sample strategically: Don't ask for feedback on every interaction—causes fatigue. Sample 10-20% of users randomly plus 100% of uncertain predictions.
Close feedback loop: Show users their feedback improved the system. 'Thanks to feedback like yours, accuracy improved 5%'

Tips

Aim for 5-10% feedback rate—if lower, your UI friction is too high
Incentivize feedback sparingly—intrinsic motivation (helping improve product) beats extrinsic rewards

Measure AI Feature Adoption

Launch & Monitoring ▼

Track metrics that reveal whether users discover, try, and consistently use AI features.

When to use

After AI feature launch to measure product-market fit
When feature usage is lower than expected
When deciding whether to invest more in AI features or pivot

Steps

Track awareness: What % of users know AI feature exists? Survey or measure if users saw onboarding/announcement.
Measure trial: What % of aware users tried feature at least once? Track first use within 7 days of awareness.
Calculate activation: What % of trialists had successful first experience? Define success: task completed, positive feedback, no errors.
Monitor retention: What % of activated users return? Track D1, D7, D30 retention. AI features need habit formation.
Identify power users: Who uses AI feature daily? What % of total usage do they represent? Learn from them.
Diagnose drop-off: Where do users churn? Never try? Try once and abandon? Fixes differ for each stage.
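
The funnel stages above reduce to a few ratios. The sketch below assumes you already have user counts for each stage. The function names and the sample numbers in the usage note are hypothetical.

```python
from typing import Dict


def adoption_funnel(total_users: int, aware: int, tried: int,
                    activated: int, retained_d30: int) -> Dict[str, float]:
    """Stage-to-stage conversion rates, each relative to the previous stage."""
    def rate(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "awareness": rate(aware, total_users),
        "trial": rate(tried, aware),
        "activation": rate(activated, tried),
        "d30_retention": rate(retained_d30, activated),
    }


def biggest_drop_off(funnel: Dict[str, float]) -> str:
    """Name of the stage with the lowest conversion rate."""
    return min(funnel, key=funnel.get)
```

With 10,000 users, 6,000 aware, 1,500 who tried, 1,200 activated, and 600 retained at day 30, the weakest stage is trial (25%), which points at discovery and onboarding rather than model quality.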

Tips

Benchmark against non-AI features—is adoption good or bad in context?
Segment by user type—enterprise users and consumers have different adoption curves

Analyze AI Usage Patterns

Launch & Monitoring ▼

Study how users interact with AI features to identify improvements and optimization opportunities.

When to use

After AI feature has been live for 2-4 weeks with meaningful usage data
When planning next iteration or improvement cycle
When usage metrics are flat and you need ideas for growth

Steps

Segment users by behavior: Power users, casual users, one-time users. Analyze each segment separately.
Identify common queries: What are most frequent inputs? Are there patterns? Can you optimize for common cases?
Find failure patterns: When does AI fail? Which input types? Which user segments? Prioritize fixing most common failures.
Measure feature combinations: Do users combine AI with other features? What workflows emerge? Can you streamline?
Analyze temporal patterns: Time of day, day of week, seasonality. Usage spikes reveal unmet needs or opportunities.

Tips

Talk to 10 power users—they've figured out creative uses you never imagined
Look for 'workarounds'—users finding ways around AI limitations signal improvement opportunities

Plan Model Retraining

Optimization & Iteration ▼

Establish cadence and triggers for updating models with fresh data to maintain performance.

When to use

After initial model deployment—retraining is not optional for production AI
When model performance degrades over time
When setting up MLOps processes for long-term maintenance

Steps

Determine retraining cadence: Daily (high-churn domains like news), weekly (e-commerce, social), monthly (stable domains like document classification), quarterly (slow-changing domains)
Set performance triggers: Retrain if accuracy drops >5%, error rate increases >2x, or user feedback negative >20%
Plan data collection: Ensure sufficient new labeled data between retraining cycles. Budget for labeling.
Automate pipeline: Scheduled retraining jobs, automated evaluation, deployment if metrics improve, rollback if metrics worsen
Version and track: Record training date, data version, performance metrics for each retrained model
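
The performance triggers can be sketched as a single check. It reads "accuracy drops >5%" as a drop of more than 5 percentage points from a recorded baseline, which is an assumption, and the function and parameter names are hypothetical.

```python
from typing import List


def retrain_reasons(baseline_accuracy: float, current_accuracy: float,
                    baseline_error_rate: float, current_error_rate: float,
                    negative_feedback_share: float) -> List[str]:
    """Return the triggers that fired; an empty list means no retraining needed."""
    reasons = []
    if (baseline_accuracy - current_accuracy) * 100 > 5:
        reasons.append("accuracy drop")
    if baseline_error_rate > 0 and current_error_rate / baseline_error_rate > 2:
        reasons.append("error rate increase")
    if negative_feedback_share > 0.20:
        reasons.append("negative feedback")
    return reasons
```

Returning the list of reasons rather than a bare boolean makes it easier to record why each retraining run was started.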

Tips

Start with monthly retraining, adjust based on monitoring—over-retraining wastes resources
Always validate retrained models before deployment—sometimes new data is worse than old

Optimize Model Costs

Optimization & Iteration ▼

Reduce inference and training costs while maintaining model quality and user experience.

When to use

When AI costs are higher than budgeted or eating into margins
When scaling to millions of predictions per day
When stakeholders question AI ROI due to cost concerns

Steps

Measure current costs: Break down by training compute, inference compute, data storage, labeling. Identify biggest expense.
Optimize inference: Use smaller models, reduced precision or quantization (e.g., FP32 to FP16 or INT8), batching, caching of common predictions, and cheaper hardware
Reduce training costs: Use transfer learning (fine-tune instead of training from scratch), reduce experiment volume, use spot instances
Optimize data costs: Compress datasets, delete old versions, use cheaper storage tiers, reduce labeling through active learning
Right-size infrastructure: Use autoscaling, serverless for variable load, reserved instances for predictable load
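
Caching common predictions can be sketched with the standard library's `functools.lru_cache`. This assumes identical normalized queries should return identical outputs. `expensive_model` is a hypothetical stand-in for a real inference call, and the normalization rule (lowercase, collapsed whitespace) is an assumption.

```python
from functools import lru_cache


def expensive_model(query: str) -> str:
    """Hypothetical stand-in for a costly inference call."""
    return f"answer for: {query}"


@lru_cache(maxsize=10_000)
def cached_predict(normalized_query: str) -> str:
    """Only cache misses reach the model; repeats are served from memory."""
    return expensive_model(normalized_query)


def predict(query: str) -> str:
    """Normalize case and whitespace so near-identical queries share a cache entry."""
    normalized = " ".join(query.lower().split())
    return cached_predict(normalized)


def cache_hit_rate() -> float:
    """Share of requests served from the cache so far."""
    info = cached_predict.cache_info()
    requests = info.hits + info.misses
    return info.hits / requests if requests else 0.0
```

Tracking `cache_hit_rate()` shows how much inference spend the cache is actually avoiding. An in-process cache like this is per-worker, so a shared cache may be needed at scale.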

Tips

Caching can reduce costs 50-80% for applications with repeated queries—implement early
Profile costs weekly—gradual creep is harder to fix than sudden spikes

Iterate on AI Features

Optimization & Iteration ▼

Systematically improve AI features based on user feedback, usage data, and performance metrics.

When to use

After initial launch and 2-4 weeks of production data collection
When planning roadmap for next quarter of AI development
When feature adoption or satisfaction is below targets

Steps

Gather improvement ideas: User feedback, support tickets, usage analysis, error logs, competitive analysis, team brainstorms
Categorize improvements: Model accuracy, UX enhancements, edge case handling, performance/latency, new capabilities, cost reduction
Estimate impact: For each improvement, estimate user impact (low/medium/high) and confidence (how sure are you it will work?)
Estimate effort: T-shirt sizing (S/M/L) or story points. Include data collection, training, testing, deployment.
Prioritize by ROI: High impact + low effort = do first. Low impact + high effort = deprioritize. Build roadmap with quick wins and strategic bets.

Tips

Reserve 20% capacity for small improvements and bug fixes, 80% for planned features
Ship improvements incrementally—don't wait for perfect, ship better

Tune Model Performance

Optimization & Iteration ▼

Systematically adjust model hyperparameters and architecture to improve accuracy and efficiency.

When to use

When model performance is close but not quite meeting targets
After collecting more training data but before retraining
When you have time/budget for systematic optimization

Steps

Identify tunable parameters: Learning rate, batch size, model architecture, regularization, dropout, optimizer choice
Start with learning rate: Often the most impactful hyperparameter. Try values such as 1e-5, 5e-5, 1e-4, 5e-4, 1e-3 and keep the one with the best validation performance.
Use automated search: Grid search (exhaustive but slow), random search (faster), Bayesian optimization (most efficient). Tools: Optuna, Ray Tune.
Set search budget: Define max experiments (e.g., 50) or max time (e.g., 3 days). Tuning has diminishing returns.
Validate improvements: Test tuned model on held-out test set. Ensure improvements are real, not overfitting.
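
The learning-rate sweep with a search budget can be sketched as below. `toy_validation_loss` is a hypothetical stand-in for a real train-and-validate run, with its minimum deliberately placed at 1e-4. In practice a tool such as Optuna or Ray Tune would replace this loop.

```python
import math
from typing import Callable, Iterable, Optional, Tuple


def pick_learning_rate(candidates: Iterable[float],
                       validation_loss: Callable[[float], float],
                       max_experiments: int = 50) -> Tuple[Optional[float], float]:
    """Try candidates in order, stop at the budget, return the best (lr, loss)."""
    best_lr, best_loss = None, math.inf
    for count, lr in enumerate(candidates):
        if count >= max_experiments:
            break  # search budget exhausted
        loss = validation_loss(lr)
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr, best_loss


def toy_validation_loss(lr: float) -> float:
    """Hypothetical loss curve; a real one would train and score on validation data."""
    return (math.log10(lr) + 4) ** 2


CANDIDATES = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3]
```

The winning value should still be confirmed on the held-out test set, since the loss here is measured on validation data only.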

Tips

Tune on validation set, evaluate on test set—using test set for tuning leads to overoptimistic results
Document tuning process—future engineers will thank you

Evaluate Feature Sunset

Optimization & Iteration ▼

Decide when to deprecate or retire underperforming AI features to focus resources on higher-impact work.

When to use

When AI feature has low adoption after 3-6 months in production
When maintenance costs exceed value delivered
When conducting annual portfolio reviews or roadmap planning

Steps

Evaluate usage: What % of users actively use feature? Is trend increasing or declining? Compare to other features.
Measure value: Does feature drive revenue, retention, satisfaction? Quantify business impact. If negligible, candidate for sunset.
Calculate costs: Engineer time for maintenance, retraining, monitoring, support tickets, infrastructure costs. Is ROI positive?
Consider alternatives: Can feature be simplified (remove AI, use rules)? Merged with another feature? Repositioned?
Plan sunset: Announce deprecation timeline (3-6 months notice), offer alternatives, support migration, monitor impact

Tips

Sunsets are normal—teams that never kill features accumulate technical debt and lose focus
Survey users before sunset—sometimes low usage hides high value for specific segments

AI Development Lifecycle Overview

Primers ▼

Understand the end-to-end process of taking AI features from concept to production.

When to use

When planning your first AI feature
When onboarding new team members to AI product development
When explaining AI development to stakeholders

Steps

Discovery: Define problem, validate AI is right solution, assess data availability, estimate feasibility
Data preparation: Collect data, label examples, clean and validate, version and store securely
Model development: Establish baseline, train models, evaluate performance, iterate until meeting criteria
Integration & testing: Build product integration, test end-to-end, conduct UAT, run red teaming
Deployment: Phased rollout, monitoring setup, incident response prep, feedback collection
Maintenance: Monitor performance, retrain models, iterate on features, optimize costs, handle drift

Tips

Expect 50% of time on data, 30% on modeling, 20% on deployment—adjust estimates accordingly
Build feedback loops from day 1—they enable continuous improvement

MLOps Basics Primer

Primers ▼

Learn the fundamentals of ML operations—deploying, monitoring, and maintaining AI systems in production.

When to use

When transitioning from AI development to production operations
When setting up infrastructure for production AI systems
When hiring MLOps engineers or defining their role

Steps

Version control: Track code (Git), data (DVC), models (MLflow). Everything must be versioned for reproducibility.
Automation: CI/CD pipelines for model training, testing, deployment. Automate retraining and evaluation.
Monitoring: Track model performance, data drift, system health. Alert on degradation. Dashboard for visibility.
Infrastructure: Scalable compute for training and inference, model serving platforms, data pipelines, experiment tracking
Governance: Model documentation, approval processes, audit logs, rollback capabilities, security controls

Tips

Start simple—don't build Google-scale MLOps for your first feature. Grow infrastructure as needed.
Treat models like code—they need testing, versioning, code review, deployment pipelines

Common AI Metrics Primer

Primers ▼

Understand key metrics for evaluating AI models and when to use each one.

When to use

When defining success criteria for AI features
When interpreting model performance reports from ML engineers
When comparing different model approaches

Steps

Accuracy: % of predictions correct. Good for balanced datasets. Misleading when classes are imbalanced (e.g., with 95% negative examples, a model that always predicts negative scores 95% accuracy while catching no positives).
Precision: Of positive predictions, % actually positive. High precision = few false alarms. Important when false positives are costly (e.g., spam filtering).
Recall: Of actual positives, % correctly identified. High recall = few missed positives. Important when false negatives are costly (e.g., fraud detection).
F1 Score: Harmonic mean of precision and recall. Use when you need to balance both and classes are imbalanced.
Latency: Time from input to output. Report p50 (median), p95 (95th percentile), and p99 (99th percentile). User experience depends on tail latency.
Cost per prediction: Infrastructure spend divided by prediction volume. Critical for unit economics at scale.
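
The definitions above can be checked with a short worked example. This is a sketch using made-up data; the helper names (`classification_metrics`, `percentile`), the nearest-rank percentile method, and the convention of reporting 0.0 for an undefined ratio are assumptions made for illustration. It reproduces the imbalanced-class trap: a model that always predicts "negative" on a 95% negative dataset.

```python
from dataclasses import dataclass


@dataclass
class ClassificationMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Convention chosen here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return ClassificationMetrics(accuracy, precision, recall, f1)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# 95 negatives, 5 positives, and a model that always predicts negative.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
m = classification_metrics(y_true, always_negative)
print(m.accuracy, m.recall)  # 0.95 0.0

# Made-up latencies in milliseconds: most requests are fast, a few are very slow.
latencies_ms = [120] * 90 + [400] * 8 + [2000] * 2
print(percentile(latencies_ms, 50))  # 120
print(percentile(latencies_ms, 95))  # 400
print(percentile(latencies_ms, 99))  # 2000

# Cost per prediction: infrastructure spend divided by prediction volume.
cost_per_prediction = 3000.0 / 1_500_000  # $3,000 over 1.5M predictions
print(cost_per_prediction)  # 0.002
```

The model looks strong on accuracy (95%) yet finds none of the positives, and the median latency (120 ms) hides the 2-second requests that show up at p99.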

Tips

Always measure multiple metrics—accuracy alone hides problems
Ask ML engineers to explain metrics in business terms: 'Precision = how often recommendations are relevant'

Design Multi-Agent Workflows

AI Architecture ▼

Build AI systems with multiple specialized agents working together to handle complex tasks that single models can't solve.

When to use

When a single AI model hits capability limits (e.g., can't handle research + analysis + writing in one call)
When you need specialized expertise at different workflow stages (e.g., code review needs syntax checker + security scanner + style guide enforcer)
When tasks require coordination between different AI capabilities (e.g., extract data, verify accuracy, format output, send notification)

Steps

Map the end-to-end workflow: Break down the user's goal into discrete sub-tasks. For example, 'Generate market research report' might be: (1) Search for sources, (2) Extract key data, (3) Synthesize findings, (4) Draft report, (5) Fact-check citations.
Define agent roles and boundaries: Create one agent per distinct capability. Name them by function (Researcher, Analyst, Writer, Validator). Each agent gets a clear scope: inputs it receives, outputs it produces, and decision authority.
Design handoff protocols: Specify how agents pass information. Use structured formats (JSON schemas, typed objects). Define what happens if an agent fails: retry with same agent, escalate to different agent, or fall back to human.
Establish orchestration logic: Decide on control flow—sequential (Agent A → Agent B → Agent C), parallel (Agents A/B/C run simultaneously), or conditional (Agent A decides which of B/C/D runs next). Use state machines or workflow engines.
Build evaluation per agent: Each agent needs its own success metrics. Don't just measure the final output—track where in the chain quality degrades. Log agent decisions for debugging.
Plan for failure modes: Agents can produce invalid outputs, get stuck in infinite loops, or issue conflicting instructions. Set timeouts, output validation, and maximum retry limits. Always have a human-in-the-loop escape hatch.
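
The steps above can be sketched as a small sequential orchestrator. This is a minimal illustration under stated assumptions: the agents are plain stub functions rather than real model calls, and every name here (`Handoff`, `Agent`, `NeedsHuman`, `run_pipeline`) is hypothetical. It shows a structured handoff object, per-agent output validation, a retry limit, a per-agent log, and escalation to a human. Timeouts and parallel or conditional control flow are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Handoff:
    """Structured payload passed from agent to agent."""
    task: str
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # (agent, attempt, status) entries


class NeedsHuman(Exception):
    """Raised when an agent exhausts its retries and a person must step in."""


@dataclass
class Agent:
    name: str
    run: Callable[[Handoff], dict]  # returns new fields to merge into data
    required_keys: tuple            # output validation: keys it must produce


def run_pipeline(agents, handoff, max_retries=2):
    """Run agents in order, validating each output and retrying on failure."""
    for agent in agents:
        for attempt in range(1, max_retries + 2):  # first try plus retries
            try:
                output = agent.run(handoff)
                missing = [k for k in agent.required_keys if k not in output]
                if missing:
                    raise ValueError(f"missing keys: {missing}")
            except Exception as exc:
                handoff.log.append((agent.name, attempt, f"failed: {exc}"))
                continue
            handoff.data.update(output)
            handoff.log.append((agent.name, attempt, "ok"))
            break
        else:
            # No attempt succeeded: stop and escalate instead of looping forever.
            raise NeedsHuman(f"{agent.name} failed after {max_retries + 1} attempts")
    return handoff


# Stub agents standing in for real model calls.
def researcher(h):
    return {"sources": ["source-a", "source-b"]}


_calls = {"analyst": 0}


def flaky_analyst(h):
    _calls["analyst"] += 1
    if _calls["analyst"] == 1:
        return {}  # invalid output on the first attempt
    return {"findings": f"{len(h.data['sources'])} sources reviewed"}


def writer(h):
    return {"report": f"Report on {h.task}: {h.data['findings']}"}


agents = [
    Agent("Researcher", researcher, ("sources",)),
    Agent("Analyst", flaky_analyst, ("findings",)),
    Agent("Writer", writer, ("report",)),
]
result = run_pipeline(agents, Handoff(task="market research"))
print(result.data["report"])  # Report on market research: 2 sources reviewed
for entry in result.log:
    print(entry)
```

The log records the Analyst failing once and succeeding on retry, which is the kind of per-agent trace that shows where in the chain quality degrades.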

Tips

Start with 2-3 agents max, not 10. Add complexity only when single-agent approaches fail. Most problems don't need sophisticated orchestration.
Agents aren't microservices. Don't create an agent for every tiny function. Each agent should represent a meaningful capability that users or engineers would recognize as distinct.
