Thinking

Mental models, frameworks, and decision-making tools I use to build and advise. These show up in the playbook's friction blocks, in the strategy calls, and in how I evaluate my own work. The toolkit behind the thinking.

Mental Models

9 decision-making frameworks I keep coming back to.

Sunk Cost Fallacy
Behavioral Economics
Decision, Psychology

Don't Throw Good Money After Bad

We continue investing in something just because we've already put time, money, or effort into it, even when the rational move is to stop.

You're two hours into a terrible movie and think, "I've already spent two hours, might as well finish." That's paying extra misery to justify time already lost.

It sounds like: "We can't cancel the project; we've spent nine months on it." or "I'll keep the subscription -- maybe I'll use it next year."

  • Ask: "If I were starting fresh today, would I choose this?"
  • Compare future upside to remaining time, money, and energy.
  • Set explicit stop conditions before you start.
  • Treat stopping as a win: you saved future cost.

You're debating whether to keep a struggling project, investment, product, relationship, or subscription purely because of past effort.

Name one thing you're only doing because of past effort. What is the smallest step you could take this week to exit or reduce your commitment?

Occam's Razor
William of Ockham
Decision, Strategy

The Simplest Explanation Is Usually Correct

Between competing hypotheses that explain the data equally well, the simplest one -- with the fewest moving parts -- is usually the best starting point.

Your Wi-Fi dies and your first theory is a global cyberwar. Then you realize the router is unplugged. The universe rarely needs a conspiracy when a loose cable will do.

We jump to exotic explanations when something breaks instead of checking the boring, obvious things first.

  • List all explanations that fit the evidence.
  • Cross out the ones that require extra assumptions or miracles.
  • Test the simplest explanation first.
  • Only add complexity if simple explanations fail.

You're troubleshooting bugs, outages, or confusing behavior -- or evaluating wild theories about why something happened.

Take a current problem and write down the most boring explanation you can think of. Test that before anything clever.

Build-Measure-Learn
Eric Ries
Innovation, Product

The Lean Startup Loop

Instead of betting everything on a big launch, you build the smallest thing that can test a hypothesis, measure what happens, and learn whether to pivot or persevere.

You could spend a year perfecting your app... or you could launch a janky landing page this week to see if anyone even wants what you're building.

Founders fall in love with building and treat learning as a side effect. They over-engineer v1, measure vanity metrics, and learn nothing useful.

  • Start by asking: "What do we need to learn?"
  • Decide what to measure that will prove or disprove your hypothesis.
  • Build the smallest thing that can generate that measurement.
  • Run the loop quickly: build, measure, learn, adjust.

You're building something new and feel the urge to polish endlessly before anyone sees it.

Write down your current product hypothesis in one sentence. Now write the smallest experiment you could run in the next 7 days to test it.

Circle of Competence
Charlie Munger
Decision, Strategy

Play Where You Understand the Game

Your circle of competence is the set of domains where you genuinely understand what's going on. You don't have to be an expert at everything -- just know where you are and aren't competent.

A brilliant investor in consumer brands decides to dabble in biotech "for diversification." Spoiler: the molecules took his money.

Outside your circle, everything looks randomly good or bad. You're just guessing -- but with a dangerous illusion of understanding.

  • Write down the domains where you have real experience and results.
  • Be honest about where you're merely opinionated, not competent.
  • Say "no" quickly to opportunities outside your circle.
  • Deliberately expand the circle by learning and doing, not by pretending.

You're tempted by a shiny opportunity in a field you don't really understand (but your ego says you'll figure it out on the fly).

List three domains where you'd confidently bet your own money -- and three where you absolutely shouldn't.

The Eisenhower Matrix
Dwight D. Eisenhower
Decision, Productivity

Urgent vs. Important

Not all tasks are equal. Some are urgent and important (do now), some are important but not urgent (schedule), some are urgent but not important (delegate), and some are neither (delete).

Your inbox is full, Slack is screaming, and yet your big strategic project hasn't moved in weeks. Congratulations -- you've been living in Quadrants 1 and 3.

We confuse urgency with importance. We feel productive answering pings while our real goals quietly starve in the background.

  • List today's tasks in no particular order.
  • Label each as urgent/not urgent and important/not important.
  • Do: urgent + important. Schedule: not urgent + important.
  • Delegate: urgent + not important. Delete: not urgent + not important.

You feel busy all day but can't point to anything meaningful you actually accomplished.

Take your current to-do list and ruthlessly delete at least one item that is neither urgent nor important.

Day 1 Philosophy
Jeff Bezos
Strategy, Product

It's Always Day 1

Day 1 means you're still focused on customers, moving fast, and making decisions. Day 2 is stasis, then irrelevance, then death.

Amazon could have become a slow, bureaucratic giant. Instead, Bezos made 'Day 1' a company-wide mantra -- treat every day like you're still scrappy and customer-obsessed.

It sounds like: 'We're too big to move fast' or 'We need more process' or 'Let's optimize for efficiency over customer experience.'

  • Ask: 'What would a startup do here?'
  • Prioritize customer experience over internal efficiency.
  • Make decisions quickly with 70% of the information.
  • Resist proxies -- process should serve customers, not replace judgment.

You're scaling and feel the pull toward bureaucracy, slow decisions, or optimizing for internal metrics instead of customer value.

Name one process or meeting that exists more for internal comfort than customer value. What's the smallest step to eliminate it?

First Principles Thinking
Elon Musk
Innovation, Decision

Break Down to Fundamentals

Instead of copying what exists or reasoning by analogy, strip away assumptions to find the core physics, math, or immutable laws -- then rebuild from there.

Musk didn't ask 'How do we make cheaper rockets?' He asked 'What does a rocket actually need?' Then he built SpaceX from first principles and cut costs by 10x.

We default to 'best practices' and 'industry standards' without questioning whether they're actually necessary or optimal.

  • List everything you assume is true about the problem.
  • Question each assumption: 'Is this actually necessary?'
  • Identify the fundamental constraints (physics, math, laws).
  • Rebuild the solution from those fundamentals.

You're stuck optimizing within existing constraints, or everyone says 'that's just how it works' and you suspect there's a better way.

Pick a current problem. Write down three assumptions everyone accepts. Now question: what if those assumptions are wrong?

Do Things That Don't Scale
Paul Graham
Innovation, Product

Manual Work Before Automation

Startups try to automate before they understand. Manual work teaches you what customers actually want. Once you know, then you automate.

Stripe manually onboarded every customer in the early days. They learned exactly what customers needed, then built the perfect product.

We try to scale before we understand. We build systems for problems we don't fully understand yet.

  • Don't build a system until you've done it manually 100 times.
  • Talk to every user personally.
  • Write personal responses, not templates.
  • Only automate once you understand the problem deeply.

You're building something new and feel the urge to automate everything before you understand what customers actually need.

What's one thing you're trying to automate? Could you do it manually for 10 customers first?

Default Alive vs Default Dead
Paul Graham
Strategy, Decision

Will You Survive Without More Funding?

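Most startups are default dead: on current revenue growth and expenses, they will run out of cash before they reach profitability, and they are hoping for a miracle. Default alive is the opposite: you reach profitability on the money you already have. Know which one you are.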

You have 6 months of runway and you're burning $50k/month. You need to raise $2M or you're dead. That's default dead.

We assume we'll raise more money. We don't calculate if we can survive without it. Most startups are default dead and don't realize it.

  • Calculate your runway in months: cash on hand / monthly net burn.
  • Calculate path to profitability: revenue growth vs costs.
  • If you can't get profitable before running out of cash, you're default dead.
  • Either find product-market fit fast or cut costs dramatically.

You're running a startup and want to know if you're actually viable or just hoping for a miracle.

Calculate: if you never raise another dollar, will you survive? If no, you're default dead. What's the fastest path to default alive?
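The runway and default-alive checks above can be sketched in a few lines of Python. This is a minimal illustration, assuming constant monthly expenses and a constant compounding revenue growth rate; the function names and the 120-month cap are hypothetical choices, not part of the original framework.

```python
def runway_months(cash: float, monthly_burn: float) -> float:
    """Months of runway at a constant net burn (cash on hand / monthly net burn)."""
    if monthly_burn <= 0:
        return float("inf")  # not burning cash at all
    return cash / monthly_burn


def is_default_alive(cash: float, monthly_revenue: float, monthly_expenses: float,
                     monthly_growth: float, max_months: int = 120) -> bool:
    """Simulate month by month, with no new funding.

    Returns True if revenue covers expenses before the cash runs out.
    Assumes constant expenses and a constant compounding growth rate.
    """
    for _ in range(max_months):
        if monthly_revenue >= monthly_expenses:
            return True  # profitable on existing cash
        cash -= monthly_expenses - monthly_revenue
        if cash < 0:
            return False  # ran out of money first
        monthly_revenue *= 1 + monthly_growth
    return False
```

With the example above, `runway_months(300_000, 50_000)` gives 6.0: a company burning $50k/month with six months of runway has about $300k in the bank. Whether it is default alive then depends almost entirely on the growth rate passed to `is_default_alive`.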

Strategy Cards

135 tactical frameworks for positioning, feasibility, and planning. From the AI PM Cards deck.

Map Model Capabilities
AI Feasibility

Systematically evaluate which AI capabilities your product needs and assess technical feasibility.

  • When evaluating if AI is the right solution for your product problem
  • Before committing to build vs. buy decisions
  • When scoping an AI feature for the first time
  • List all the tasks your product needs AI to perform (e.g., classify images, generate text, predict churn)
  • For each task, research state-of-the-art (SOTA) capabilities: accuracy, latency, data requirements, cost
  • Map your product requirements against SOTA: Is current tech good enough? What's the gap?
  • Prioritize tasks by feasibility + user value. Identify quick wins and long-term bets.
  • Add a column for 'Required by Launch' vs. 'Nice to Have'—prevents scope creep
  • Update this map every 6 months; AI capabilities improve rapidly
Data Availability Assessment
AI Feasibility

Evaluate whether you have sufficient quality data to train or fine-tune AI models effectively.

  • Before committing to custom model development
  • When deciding between pre-trained models vs. fine-tuning
  • If stakeholders assume AI will work without examining data reality
  • Quantify data volume: Count labeled examples per category. Most supervised tasks need 1,000+ examples minimum, 10,000+ for production quality
  • Assess data quality: Check for label accuracy (>95% correct?), class balance (no category <5% of total), representative coverage of edge cases
  • Evaluate data accessibility: Where does data live? Can you legally use it for ML? What's the pipeline to access and update it?
  • Identify data gaps: What scenarios are missing? What would it cost to collect/label the missing data?
  • Create data roadmap: Can you launch with existing data? When will you have sufficient data for v2 improvements?
  • Start with data audit before pitching AI features—60% of AI projects fail due to data issues
  • Budget 30-50% of AI development time for data collection and labeling, not just model work
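A rough sketch of the volume and balance checks from this card, in Python. The thresholds default to the card's rules of thumb (1,000 examples per category, no category under 5% of the total); the function name and report shape are assumptions made for illustration.

```python
from collections import Counter


def audit_labels(labels, min_per_class=1000, min_share=0.05):
    """Flag categories that fall short on volume or class balance."""
    counts = Counter(labels)
    total = sum(counts.values())
    report = {}
    for category, n in counts.items():
        issues = []
        if n < min_per_class:
            issues.append("too few examples")
        if n / total < min_share:
            issues.append("under-represented")
        report[category] = {"count": n, "share": n / total, "issues": issues}
    return report
```

Running this on the label column of an existing dataset gives a quick first pass before anyone pitches a custom model. It says nothing about label accuracy or edge-case coverage, which still need manual review.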
AI Technical Debt Calculator
AI Feasibility

Estimate the long-term maintenance costs of AI systems beyond initial development.

  • When creating business cases for AI investments
  • Before choosing between simple rules vs. ML solutions
  • When stakeholders focus only on development costs, ignoring operations
  • Calculate model maintenance: Retraining frequency (monthly? quarterly?) × engineer hours per retrain × loaded hourly cost. Add monitoring/on-call costs
  • Estimate infrastructure costs: Inference compute (API calls × cost per call), training compute, data storage and pipelines
  • Factor in data pipeline maintenance: Label quality audits, dataset versioning, feature engineering updates, data validation systems
  • Account for model updates: As AI capabilities improve, you'll need to evaluate and integrate new models every 6-12 months
  • Compare total 3-year cost: AI solution vs. non-AI alternatives. Include development + operations + opportunity cost
  • Rule of thumb: AI operational costs are 3-5× the initial development cost over 3 years
  • For low-stakes features, simple heuristics often beat ML when total cost of ownership is considered
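The cost components above can be combined into a simple three-year total. The sketch below is one possible way to do the arithmetic in Python; the parameter names are hypothetical, and it assumes flat monthly infrastructure and monitoring costs with no growth in usage.

```python
def three_year_cost(dev_cost, retrains_per_year, hours_per_retrain, hourly_rate,
                    monthly_infra, monthly_monitoring, years=3):
    """Total cost of ownership: development plus maintenance plus operations."""
    maintenance = retrains_per_year * hours_per_retrain * hourly_rate * years
    operations = (monthly_infra + monthly_monitoring) * 12 * years
    return {
        "development": dev_cost,
        "maintenance": maintenance,
        "operations": operations,
        "total": dev_cost + maintenance + operations,
    }
```

Computing the same breakdown for the non-AI alternative (often with near-zero maintenance) makes the comparison in the last step concrete.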
Latency Budget Planning
AI Feasibility

Define acceptable response times for AI features and architect systems to meet latency requirements.

  • When designing user-facing AI features
  • Before selecting model architectures or inference infrastructure
  • If users complain that AI features feel slow
  • Define user expectations: Real-time (<100ms)? Interactive (<1s)? Asynchronous (>5s okay)? Base on user research and competitive benchmarks
  • Break down latency sources: Network roundtrip + model inference + post-processing + database queries. Measure each component
  • Set component budgets: Allocate total budget across pipeline. Example: 800ms total = 200ms network + 400ms inference + 200ms other
  • Optimize critical path: Can you use smaller models? Batch predictions? Cache results? Move compute closer to users?
  • Establish degradation strategy: If model is slow, show partial results, streaming responses, or fallback to faster (less accurate) model
  • Aim for <1 second for most user-facing AI features—users perceive longer waits as broken
  • Test latency at p95 and p99, not just average—tail latency kills UX for real users
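As an illustration of component budgets and tail latency, here is a small Python sketch. It uses the 800ms example split from above; the helper names are hypothetical, and the percentile uses the simple nearest-rank method rather than any particular monitoring tool's definition.

```python
import math


def over_budget(measured_ms, budget_ms):
    """Return each component that exceeds its budget, with the overage in ms."""
    return {
        name: measured_ms[name] - limit
        for name, limit in budget_ms.items()
        if measured_ms.get(name, 0) > limit
    }


def percentile(samples, p):
    """Nearest-rank percentile (p between 0 and 100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Example budget from the card: 800ms total.
budget = {"network": 200, "inference": 400, "other": 200}
```

A set of samples where 98 requests take 100ms and 2 take 2,000ms has an average of 138ms and a p95 of 100ms, but a p99 of 2,000ms, which is why the tail percentiles are worth tracking separately.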
Edge Case Scenario Mapping
AI Feasibility

Systematically identify and prioritize edge cases where AI will fail, then design mitigation strategies.

  • After initial feasibility testing shows promise
  • Before launching AI features to production
  • When designing AI user experience and error handling
  • Brainstorm failure modes: Run team workshop listing scenarios where AI might fail (rare inputs, ambiguous cases, adversarial examples, distribution shifts)
  • Collect real edge cases: Review support tickets, user feedback, and competitive failures. Test your prototype with extreme inputs
  • Quantify frequency and impact: Estimate % of users affected × severity of bad outcome. Create 2×2 matrix of frequency/impact
  • Prioritize mitigation: High-frequency or high-impact cases need solutions before launch. Low/low can ship with monitoring
  • Design fallback strategies: Human review, confidence thresholds, fallback to simpler methods, explicit 'AI can't help here' messages
  • Plan for 5-20% of inputs to hit edge cases in production—AI is never 100% accurate
  • Show your edge case matrix to legal/trust & safety teams early—some failures have regulatory or PR risk
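The frequency/impact prioritization rule above can be expressed as a small Python function. The thresholds (1% of users, impact 3 on a 1-5 scale) and the example cases are assumptions chosen for illustration; a real team would set its own.

```python
def triage(cases, freq_threshold=0.01, impact_threshold=3):
    """Classify edge cases by the rule: high frequency OR high impact needs a fix.

    `cases` maps a name to (share of users affected, impact score from 1 to 5).
    """
    result = {}
    for name, (frequency, impact) in cases.items():
        needs_fix = frequency >= freq_threshold or impact >= impact_threshold
        result[name] = "fix before launch" if needs_fix else "ship with monitoring"
    return result


example_cases = {
    "blurry photo": (0.08, 2),        # common, mild
    "medical advice": (0.001, 5),     # rare, severe
    "emoji-only input": (0.002, 1),   # rare, mild
}
```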
Multi-Model Strategy Design
AI Feasibility

Plan when and how to combine multiple AI models to solve complex product problems.

  • When a single model can't meet all product requirements
  • When building AI products with multiple capabilities (e.g., search + summarization + recommendations)
  • Before scaling initial AI prototypes into full product suites
  • Map capabilities to models: Break your product into distinct AI tasks (classification, generation, ranking, etc.). Assign best-fit model type to each
  • Design model orchestration: Sequential (Model A → Model B)? Parallel (A + B → combine results)? Conditional (if A confident, skip B)?
  • Manage dependencies: What happens if Model A fails? Does Model B still work? Build fallback chains and circuit breakers
  • Optimize for cost and latency: Can you run cheaper/faster models first, then escalate to expensive models only when needed?
  • Version and deploy independently: Each model should have its own versioning, monitoring, and rollback capability
  • Start with single model, add models only when user needs clearly justify complexity
  • Use smaller, specialized models over one large model when possible—lower cost, faster, easier to debug
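The conditional orchestration pattern (run the cheap model first, escalate only when needed) might look like the following Python sketch. The `Model` signature, the 0.8 confidence threshold, and the function name are all assumptions; real model clients would replace the callables.

```python
from typing import Callable, Tuple

# A model here is any callable returning (answer, confidence between 0 and 1).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, cheap: Model, expensive: Model,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model; fall back to the expensive one if it is unsure or fails."""
    try:
        answer, confidence = cheap(query)
        if confidence >= min_confidence:
            return answer, "cheap"
    except Exception:
        pass  # cheap model unavailable: fall through to the fallback
    answer, _ = expensive(query)
    return answer, "expensive"
```

Returning which model answered makes it easy to log the escalation rate, which feeds directly into the cost and latency trade-off described above.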
AI Unit Economics Model
Business Model & Pricing

Calculate the true cost per user or per action for AI features to ensure sustainable economics.

  • Before launching AI products with usage-based costs
  • When setting pricing for AI-powered features
  • If AI costs are growing faster than revenue
  • Calculate cost per prediction: (Inference costs + model hosting + data pipeline costs) / number of predictions. Track separately for different models/features
  • Estimate average usage per user: Based on product analytics or beta testing, how many AI actions does typical user take per month?
  • Model cost at scale: User base × actions per user × cost per action. Project at 10×, 100×, 1000× current scale
  • Determine unit economics target: For SaaS, aim for LTV:CAC of 3:1. For freemium, AI costs should be <30% of revenue per paying user
  • Identify optimization levers: Can you cache results? Batch requests? Use cheaper models for simple queries? Set usage caps?
  • OpenAI/Anthropic costs drop 50-90% yearly—don't over-optimize current pricing, but do monitor costs weekly
  • Set usage limits for free tiers to prevent runaway costs—Notion AI limits free users to 20 actions
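The unit-economics arithmetic above is simple enough to sketch directly. The Python below assumes monthly cost figures and a constant cost per action as usage grows; the function names are hypothetical.

```python
def cost_per_prediction(inference, hosting, pipeline, predictions):
    """(Inference + hosting + data pipeline costs) / number of predictions."""
    return (inference + hosting + pipeline) / predictions


def monthly_ai_cost(users, actions_per_user, cost_per_action):
    """User base x actions per user x cost per action."""
    return users * actions_per_user * cost_per_action


def project_scale(users, actions_per_user, cost_per_action,
                  multipliers=(1, 10, 100, 1000)):
    """Monthly AI cost at the current user base and at 10x, 100x, 1000x."""
    return {m: monthly_ai_cost(users * m, actions_per_user, cost_per_action)
            for m in multipliers}
```

In practice cost per action rarely stays constant at 1000× scale (volume discounts, caching, cheaper models), so the projection is best read as an upper bound under today's pricing.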
Define AI Value Proposition
Value Proposition

Articulate the specific value AI delivers to users, beyond what non-AI solutions can provide.

  • When pitching AI features to stakeholders or users
  • Before writing product specs or user stories
  • When differentiating your AI product from competitors
  • Identify the user job: What task are users trying to accomplish? What's the current pain?
  • Define the AI advantage: What can AI do that rule-based systems or manual processes can't?
  • Quantify the benefit: Time saved? Better accuracy? Personalization? New capabilities?
  • Test the value prop: Share with 5-10 target users. Do they get excited? Do they see the benefit?
  • Focus on outcomes, not technology: "Get instant answers" not "Powered by GPT-4"
  • Avoid "AI for AI's sake"—if a simpler solution works, use it
AI Feature Pricing Strategy
Business Model & Pricing

Determine how to monetize AI capabilities: bundled, add-on, usage-based, or premium tier.

  • Before launching AI features to customers
  • When deciding whether AI justifies price increases
  • If competitors are undercutting your AI pricing
  • Assess AI value perception: Does AI unlock new use cases or just improve existing workflows? New capabilities justify premium pricing
  • Benchmark competitive pricing: Survey 5-10 competitors. Are they charging for AI separately or bundling? What's the price premium?
  • Model pricing options: bundled free, add-on flat fee, usage-based, or higher tier only. Calculate revenue and adoption for each
  • Test willingness to pay: Run pricing surveys or A/B tests with beta users. What % would pay $X for AI features?
  • Choose initial strategy: Start conservative (bundle free or low add-on), then raise prices as value is proven. Price increases are harder to land than decreases, so label early pricing as introductory
  • Usage-based pricing aligns incentives but adds billing complexity—only use if users heavily value usage flexibility
  • Avoid 'AI tax' perception—if AI just makes existing features slightly better, don't charge separately
Freemium AI Strategy
Business Model & Pricing

Design free vs. paid AI feature splits that drive conversion while controlling costs.

  • When adding AI to existing freemium products
  • If free tier AI costs are unsustainable
  • When optimizing free-to-paid conversion rates
  • Define free tier AI budget: Calculate sustainable cost per free user (e.g., $0.10-0.50/month). Convert to action limits (e.g., 20 AI queries/month)
  • Identify conversion-driving features: Which AI capabilities are 'need to have' for power users? Gate those behind the paywall after a free taste
  • Design progression path: Free tier = 'try it' (10-50 actions). Paid tier = 'use it daily' (unlimited or high cap like 500/month)
  • Implement soft limits: Don't hard-block at limit. Show 'X uses left this month' warnings. Offer a one-time upgrade or let users wait until next month
  • Monitor conversion metrics: What % of free users hit limits? What % convert within 7 days of hitting limit? Adjust limits to optimize revenue
  • Monthly resets create urgency: users convert when they need AI now, which is less likely if unused actions roll over and accumulate
  • Make free tier generous enough for authentic trial—less than 10 AI actions feels like a demo, not a product
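Converting a free-tier cost budget into an action limit, plus the soft-limit warning, can be sketched as follows in Python. The function names and message wording are hypothetical; the example figures come from the ranges given above.

```python
import math


def free_tier_action_limit(budget_per_user: float, cost_per_action: float) -> int:
    """How many AI actions a monthly per-user budget can pay for."""
    # round() guards against float noise (e.g. 19.9999999) before flooring
    return math.floor(round(budget_per_user / cost_per_action, 6))


def usage_message(used: int, limit: int) -> str:
    """Soft-limit messaging: warn with the remaining count instead of hard-blocking."""
    remaining = max(limit - used, 0)
    if remaining == 0:
        return "Monthly AI limit reached. Upgrade or wait for next month's reset."
    return f"{remaining} AI uses left this month"
```

For instance, a $0.10/month budget at an assumed $0.005 per action works out to a 20-action limit, which matches the example limit in the first step.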
Usage-Based vs. Seat-Based Pricing
Business Model & Pricing

Choose the right pricing model for AI products by evaluating usage patterns and customer preferences.

  • When designing pricing for new AI products
  • If customers complain about current pricing model
  • When usage varies widely across customer segments
  • Analyze usage distribution: Plot AI actions per user. If variance is low (most users similar), seat-based works. If high variance (10× difference), usage-based fits better
  • Assess customer preference: Enterprise prefers predictable costs (seat-based). Startups prefer pay-as-you-grow (usage-based). Survey target customers
  • Model revenue scenarios: Calculate ARR under each model at different growth stages. Which maximizes revenue at 100, 1000, 10000 customers?
  • Consider operational complexity: Usage-based requires real-time metering, billing reconciliation, and overage management. Seat-based is simpler
  • Test hybrid approaches: Base seat price + usage overages (Anthropic model). Or tiered usage buckets (Notion AI: $10 for 200 actions)
  • Default to seat-based for B2B SaaS—procurement prefers predictable budgets, and sales cycles are faster
  • Use usage-based for API products or when AI is core value prop and usage varies 10×+ across customers
Run a Model Feasibility Spike
AI Feasibility

Test if your AI idea is technically possible by building a quick prototype in 1-2 weeks.

  • When stakeholders doubt whether AI can solve your problem
  • Before committing to a multi-month AI development roadmap
  • When you need to choose between multiple AI approaches
  • Define success criteria - Write down the minimum bar: "If the model can X with Y% accuracy, it's feasible."
  • Timebox the spike - Allocate 1-2 weeks maximum. Set a deadline for demo.
  • Use shortcuts - Pre-trained models, small datasets, manual labeling, cloud notebooks.
  • Build and evaluate - Train/fine-tune model. Test against success criteria.
  • Make go/no-go decision - If you hit the bar, green-light the project. If not, pivot or kill feature.
  • Set up tracking from day 1 of the spike—you'll want metrics to show stakeholders
  • Don't polish UX or code quality—this is a throwaway prototype
Enterprise AI Packaging
Business Model & Pricing

Design AI product tiers and packaging that align with enterprise buying processes and budgets.

  • When selling AI products to companies with 1000+ employees
  • If enterprise deals stall due to pricing or packaging concerns
  • When building multi-year roadmap for enterprise features
  • Create enterprise tier: Include SSO, audit logs, data residency, SLAs, dedicated support, custom contracts. Price 3-5× higher than self-serve tiers
  • Offer volume discounts: Tiered pricing based on seats/usage. 100-499 users = 10% off, 500-999 = 20% off, 1,000+ = custom pricing
  • Bundle services: Professional services for implementation, training, custom model fine-tuning. Charge separately or include in annual contracts
  • Design annual commit incentives: Offer 15-25% discount for annual prepay vs. monthly. Reduces churn and improves cash flow
  • Build custom pricing tools: Sales team needs calculator to quickly quote multi-year, multi-product, multi-region deals. Automate approval workflows
  • Enterprise sales cycles are 6-12 months—ensure trial/POC pricing covers your costs but removes friction
  • Security and compliance are table stakes, not upsells—include in base enterprise tier or risk disqualification
AI ROI Projection Model
Business Model & Pricing

Build data-driven ROI models that help customers justify AI product investments to their executives.

  • When selling high-cost AI products to enterprise
  • If sales team struggles to justify AI pricing
  • When creating case studies and marketing materials
  • Identify cost savings: Time saved per user × hourly cost × number of users × weeks per year. Example: 5 hours/week × $50/hour × 100 users × 52 weeks = $1.3M/year
  • Quantify revenue impact: Increased conversion, faster sales cycles, better retention. Tie AI features to revenue metrics with A/B test data
  • Build ROI calculator: Create spreadsheet or web tool where prospects input their metrics (team size, salaries, current processes). Auto-calculate payback period
  • Validate with case studies: Get 3-5 customers to share actual ROI achieved. Use median results as conservative estimates for prospects
  • Present tiered scenarios: Conservative (10th percentile outcomes), expected (median), optimistic (90th percentile). Let buyers choose their assumptions
  • Aim for <6 month payback period for SMB, <12 months for enterprise—longer periods face budget scrutiny
  • Include implementation costs in ROI model—honest projections build trust and set realistic expectations
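A minimal Python sketch of the savings and payback calculation an ROI calculator would perform. It assumes savings accrue evenly through the year and that the buyer pays the annual price plus implementation up front; both function names are hypothetical.

```python
def annual_savings(hours_saved_per_week, hourly_cost, users, weeks_per_year=52):
    """Time saved per user x hourly cost x number of users, annualized."""
    return hours_saved_per_week * hourly_cost * users * weeks_per_year


def payback_months(annual_price, implementation_cost, yearly_savings):
    """Months until cumulative savings cover the first-year price plus implementation."""
    return (annual_price + implementation_cost) / (yearly_savings / 12)
```

`annual_savings(5, 50, 100)` reproduces the card's example: 5 hours/week × $50/hour × 100 users over 52 weeks is $1,300,000 a year. Running `payback_months` with conservative, median, and optimistic savings gives the three tiered scenarios.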
Build vs. Buy vs. API Decision
AI Feasibility

Systematically evaluate whether to build models in-house, buy commercial solutions, or use API services.

  • After validating technical feasibility of your AI feature
  • When stakeholders ask about cost and timeline for AI development
  • Before assembling your AI product team
  • Map your requirements: accuracy needs, customization level, data sensitivity, scale, budget, timeline
  • Evaluate APIs: Test 2-3 providers. Check if they meet accuracy bar, pricing, and latency requirements
  • Evaluate buy options: Commercial ML platforms. Consider vendor lock-in, customization limits
  • Evaluate build: Estimate team size, timeline, infrastructure costs. Do you have ML expertise?
  • Create decision matrix: Score each option on cost, time-to-market, quality, control, and scalability
  • Most teams should start with APIs—fastest path to validation
  • Build only when APIs can't meet requirements or when AI is your core differentiator
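The decision matrix in the last step is a weighted sum. Here is a small Python sketch; the weights and the 1-5 scores are invented purely for illustration and would come from your own evaluation.

```python
def score_options(scores, weights):
    """Rank options by weighted score, best first."""
    totals = {
        option: sum(weights[criterion] * ratings[criterion] for criterion in weights)
        for option, ratings in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights and 1-5 ratings (higher is better on every criterion).
weights = {"cost": 3, "time_to_market": 3, "quality": 2, "control": 1, "scalability": 1}
scores = {
    "api":   {"cost": 4, "time_to_market": 5, "quality": 3, "control": 2, "scalability": 4},
    "buy":   {"cost": 3, "time_to_market": 4, "quality": 3, "control": 3, "scalability": 3},
    "build": {"cost": 2, "time_to_market": 1, "quality": 5, "control": 5, "scalability": 4},
}
```

With these made-up numbers the API option ranks first, but shifting weight toward control and quality (for example, when AI is the core differentiator) can flip the order toward building.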
AI Cost Containment Tactics
Business Model & Pricing

Implement strategies to reduce AI infrastructure and API costs without sacrificing user experience.

  • When AI costs are growing faster than revenue
  • Before raising prices or cutting features due to costs
  • When optimizing for profitability after growth phase
  • Implement caching: Cache frequent queries/prompts. GitHub Copilot caches common code completions, reducing API calls 40%
  • Use tiered models: Route simple queries to cheaper models (GPT-3.5), complex to expensive (GPT-4). Classification model decides routing
  • Optimize prompts: Shorter prompts = lower costs. Test if you can achieve same quality with 50% fewer tokens. Use prompt compression techniques
  • Batch requests: Combine multiple API calls into single batch request where latency allows. Reduces overhead costs
  • Set usage quotas: Implement per-user rate limits to prevent abuse and runaway costs. Alert users before hitting limits
  • Audit your top 10% of users—often 5-10% of users drive 50%+ of costs. Target optimizations or pricing to them
  • Monitor cost per active user weekly—catch problems early before they become existential
AI Feature Prioritization
Roadmap & Prioritization

Systematically prioritize which AI features to build first based on value, feasibility, and strategic fit.

  • When planning quarterly or annual AI roadmaps
  • If stakeholders disagree on which AI features to build
  • When you have more AI ideas than engineering capacity
  • Score user value: Rate each feature 1-10 on user impact. Base on user interviews, surveys, and revenue potential. Weight by user segment size
  • Assess technical feasibility: Rate 1-10 based on data availability, model maturity, engineering complexity. Get ML team input
  • Evaluate strategic alignment: Does this AI feature support core product strategy? Build competitive moat? Enable platform vision?
  • Estimate effort: T-shirt size (S/M/L/XL) for development time. Include data prep, model training, integration, and testing
  • Calculate priority score: (Value × Strategic Fit) / Effort. Feasibility acts as a filter—don't build infeasible ideas regardless of value
  • Build 'quick wins' first (high value, low effort) to build momentum and credibility for AI program
  • Avoid 'AI for AI's sake'—if a non-AI solution scores higher on value/effort, build that instead
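The priority formula above needs a numeric effort value, so the sketch below maps T-shirt sizes to points. That mapping and the feasibility cutoff of 4 are assumptions, not part of the card.

```python
# Assumed mapping from T-shirt size to effort points.
EFFORT_POINTS = {"S": 1, "M": 2, "L": 4, "XL": 8}


def priority(value, strategic_fit, effort, feasibility, min_feasibility=4):
    """(Value x Strategic Fit) / Effort, with feasibility acting as a filter.

    Returns None for features below the feasibility cutoff, whatever their value.
    """
    if feasibility < min_feasibility:
        return None
    return value * strategic_fit / EFFORT_POINTS[effort]
```

A feature rated 8 on value and 7 on strategic fit at size M scores 28.0, while a perfect 10/10 idea with feasibility 2 is filtered out entirely.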
AI Feature Sequencing
Roadmap & Prioritization

Plan the optimal order to release AI features based on dependencies, learning, and user adoption.

  • When building multi-feature AI product roadmaps
  • If early AI features failed to gain traction
  • When planning phased rollouts over 6-12 months
  • Map feature dependencies: Which features require data from others? Which share models or infrastructure? Build dependency graph
  • Identify learning milestones: Which features teach you about user behavior, model performance, or data quality that inform later features?
  • Plan adoption curve: Start with features that drive frequent engagement (daily use). Delay features that need behavior change until users are habituated
  • Balance quick wins and strategic bets: Alternate between fast-shipping incremental features and longer-term platform investments
  • Design version gates: V1 = prove value with simple approach. V2 = improve quality with better models. V3 = scale with platform features
  • Ship user-facing AI value in first 60 days—builds credibility and user excitement for future features
  • Don't boil the ocean—better to ship 3 excellent AI features than 10 mediocre ones
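The dependency graph from the first step can be turned into a release order with a topological sort. This sketch uses Python's standard-library `graphlib` (Python 3.9+); the feature names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def release_order(dependencies):
    """Return features in an order where every prerequisite ships first.

    `dependencies` maps each feature to the set of features it depends on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        raise ValueError(f"circular feature dependency: {err.args[1]}") from err


# Hypothetical roadmap: summaries need search data; recommendations need both.
features = {
    "search": set(),
    "summaries": {"search"},
    "recommendations": {"search", "summaries"},
}
```

The sort only handles hard dependencies. Adoption and learning considerations from the later steps still have to be applied by hand when several orders are valid.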
Crawl-Walk-Run AI Roadmap
Roadmap & Prioritization

Structure AI product evolution in three phases: simple MVP, improved accuracy, and scaled platform.

  • When planning multi-year AI product strategy
  • If stakeholders push for perfect AI before any launch
  • When communicating AI maturity stages to executives
  • Crawl (Months 1-3): Ship simplest AI that provides value. Use pre-trained models, limit scope, manual fallbacks. Goal: prove users want this
  • Walk (Months 4-9): Improve accuracy and coverage. Fine-tune models, expand training data, reduce edge cases. Goal: daily use by core users
  • Run (Months 10-18): Scale and automate. Custom models, real-time retraining, platform features. Goal: product differentiator at scale
  • Define success metrics for each phase: Crawl = engagement. Walk = quality scores. Run = competitive moat metrics
  • Communicate trade-offs: Crawl is fast but imperfect. Walk is better but not ready for all use cases. Run is mature but requires investment
  • Don't skip Crawl—90% of AI learnings come from real users, not internal testing
  • Plan 12-18 months minimum to get through the Run phase; AI platforms require sustained investment to build moats
Minimum Viable AI Feature
Roadmap & Prioritization

Define the smallest AI feature that delivers real user value and validates core hypotheses.

  • When starting new AI product initiatives
  • If AI projects are taking too long to ship
  • When stakeholders want to add too many capabilities before launch
  • Identify core user job: What's the single most important task AI helps with? Cut everything else for V1
  • Define minimum quality bar: What accuracy/latency is 'good enough' to be useful? Don't aim for perfection—aim for better than status quo
  • Limit initial scope: Constrain to single use case, user segment, or content type. Example: AI summaries for docs only, not all content
  • Use existing tools: Pre-trained models, third-party APIs, manual fallbacks. Build custom solutions only after validating demand
  • Set learning goals: What do you need to learn from V1 to inform V2? Design experiments to answer key questions
  • Ship MVAI in 4-6 weeks—if it takes longer, scope is too big
  • Perfect is the enemy of shipped—60% accuracy that users love beats 95% accuracy that never launches
AI Experiment Framework
Roadmap & Prioritization

Design and run controlled experiments to validate AI product hypotheses before full development.

  • When testing new AI feature ideas with uncertain value
  • Before committing to expensive AI development
  • If stakeholders need proof that AI will drive metrics
  • Define hypothesis: Clear, testable statement. Example: 'AI-generated summaries will increase doc engagement by 20%'
  • Choose experiment type: A/B test (AI vs. control), Wizard of Oz (humans simulate AI), prototype (limited real AI), or survey (measure willingness to use)
  • Set success criteria: What metrics move? By how much? What's the minimum effect size to justify building?
  • Design minimal experiment: Smallest sample size and shortest duration to reach statistical significance. Use power analysis
  • Analyze and decide: If hypothesis validated, green-light feature. If invalidated, pivot or kill. If inconclusive, run follow-up experiment
  • Wizard of Oz experiments (humans pretending to be AI) are faster than building real AI—use for early validation
  • Run experiments on 5-10% of users initially—limits risk if AI performs poorly
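
The power-analysis step can be sketched in a few lines. This is a minimal sketch, assuming a two-sided two-proportion z-test with equal-sized arms. The function name, the 10% baseline rate, and the default alpha and power values are illustrative assumptions, not figures from this card.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed in EACH arm of an A/B test.

    Uses the normal-approximation formula for comparing two proportions.
    All defaults are illustrative assumptions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothesis from the card: a 20% relative lift, on an assumed 10% baseline.
print(sample_size_per_arm(baseline_rate=0.10, relative_lift=0.20))
```

Halving the minimum detectable lift roughly quadruples the required sample, which is why the minimum effect size is worth fixing before the experiment starts.
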
AI Feature Kill Criteria
Roadmap & Prioritization

Establish clear conditions for when to shut down or deprioritize AI features that aren't working.

  • Before launching new AI features
  • When AI features have low adoption despite investment
  • If engineering resources are spread too thin across AI initiatives
  • Set adoption thresholds: Define minimum active users or usage frequency. Example: If <10% of users try feature after 3 months, kill it
  • Define quality floors: Minimum acceptable accuracy, latency, or user satisfaction scores. If model can't hit bar after 2 improvement cycles, kill
  • Establish cost ceilings: Maximum cost per user or cost as % of revenue. If unit economics don't improve to target within 6 months, kill
  • Monitor competitive position: If competitors ship superior AI faster, evaluate whether to kill and copy or double down on differentiation
  • Create kill decision process: Who decides? How often do you review? What's the communication plan to users and stakeholders?
  • Review AI features quarterly—technology and user needs evolve fast, yesterday's good idea may be today's distraction
  • Celebrate kills as much as launches—killing bad features is good product management
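
Kill criteria are easier to enforce when they are written down as checks. A minimal sketch, using the example thresholds from this card; the function and parameter names are hypothetical.

```python
def kill_reasons(adoption_rate, months_live, failed_improvement_cycles,
                 months_missing_cost_target):
    """Return the kill criteria a feature currently trips (empty list = keep)."""
    reasons = []
    if months_live >= 3 and adoption_rate < 0.10:
        reasons.append("adoption under 10% after 3 months")
    if failed_improvement_cycles >= 2:
        reasons.append("quality floor missed after 2 improvement cycles")
    if months_missing_cost_target >= 6:
        reasons.append("unit economics off target for 6 months")
    return reasons


# A feature live for 4 months that only 4% of users have tried.
print(kill_reasons(0.04, 4, 0, 0))
```

The output is an input to the quarterly review, not a replacement for it: a human still makes the kill decision.
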
AI Tech Debt Prioritization
Roadmap & Prioritization

Systematically prioritize AI technical debt against new features to maintain sustainable development velocity.

  • When AI features are slowing down due to accumulated tech debt
  • If model performance is degrading or infrastructure is brittle
  • When planning roadmap balance between new features and improvements
  • Categorize AI tech debt: Model debt (outdated models, drift), data debt (stale datasets, pipeline brittleness), infra debt (scaling issues, monitoring gaps), code debt (ML code quality)
  • Assess impact: How does each debt item affect user experience, development velocity, costs, or risk? Rate 1-10 on each dimension
  • Estimate effort: Size each debt item (S/M/L/XL). Get ML team input on complexity and dependencies
  • Calculate debt ROI: (Impact on velocity + risk reduction) / Effort. Prioritize highest ROI debt first
  • Allocate capacity: Dedicate 20-30% of AI engineering capacity to tech debt each quarter. Don't let it slip to 0% or accumulate to 100%
  • Address model drift and data quality debt immediately—these directly impact users and compound over time
  • Trade-off rule: If new feature will create significant debt, either fix existing debt first or simplify feature scope
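
The debt ROI formula can be sketched as a small ranking helper. The mapping from T-shirt sizes to effort points is an assumption made for illustration (the card does not define one), and the item names are made up.

```python
from dataclasses import dataclass

# Assumed conversion from T-shirt size to relative effort points.
EFFORT_POINTS = {"S": 1, "M": 2, "L": 4, "XL": 8}


@dataclass
class DebtItem:
    name: str
    velocity_impact: int  # rated 1-10
    risk_reduction: int   # rated 1-10
    effort: str           # "S", "M", "L", or "XL"

    @property
    def roi(self):
        # (Impact on velocity + risk reduction) / Effort
        return (self.velocity_impact + self.risk_reduction) / EFFORT_POINTS[self.effort]


def prioritize(items):
    """Highest-ROI debt first."""
    return sorted(items, key=lambda item: item.roi, reverse=True)


backlog = [
    DebtItem("retrain drifting model", 8, 6, "L"),
    DebtItem("add pipeline monitoring", 5, 4, "S"),
    DebtItem("rewrite feature store", 9, 9, "XL"),
]
for item in prioritize(backlog):
    print(f"{item.name}: {item.roi:.2f}")
```

Small, cheap fixes tend to float to the top with this scoring, which is usually the point.
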
Model Refresh Cadence
Roadmap & Prioritization

Plan regular cycles to evaluate and upgrade AI models as technology improves and data grows.

  • When setting up AI product development processes
  • If models are getting stale but team has no refresh plan
  • When planning long-term AI platform investments
  • Set evaluation cadence: Review new model releases quarterly. For fast-moving areas (LLMs), monthly. Track benchmarks and release notes
  • Define upgrade triggers: Automatic upgrade if new model improves accuracy >10%, reduces latency >30%, or cuts costs >50% with no quality loss
  • Plan testing windows: Allocate 1-2 weeks per quarter for ML team to test new models against production data and metrics
  • Manage version transitions: Run A/B tests (old model vs. new) before full rollout. Keep rollback plan for 2 weeks post-deployment
  • Schedule major refreshes: Every 6-12 months, revisit model architecture fundamentally. Is there a better approach than current solution?
  • Don't chase every model release—upgrade only when clear user benefit or cost savings justify the work
  • Document model lineage—track which model version was used when for debugging and compliance
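
The upgrade triggers can be written as an explicit check. A sketch under two assumptions: the percentages are read as relative changes, and "no quality loss" means accuracy does not drop. The dictionary keys are hypothetical field names.

```python
def should_upgrade(current, candidate):
    """Compare two models described by dicts with the keys
    'accuracy', 'latency_ms', and 'cost_per_1k' (hypothetical names)."""
    accuracy_gain = (candidate["accuracy"] - current["accuracy"]) / current["accuracy"]
    latency_cut = (current["latency_ms"] - candidate["latency_ms"]) / current["latency_ms"]
    cost_cut = (current["cost_per_1k"] - candidate["cost_per_1k"]) / current["cost_per_1k"]
    no_quality_loss = candidate["accuracy"] >= current["accuracy"]

    if accuracy_gain > 0.10:
        return True
    # Speed and cost wins only count when quality does not regress.
    return no_quality_loss and (latency_cut > 0.30 or cost_cut > 0.50)


production = {"accuracy": 0.80, "latency_ms": 400, "cost_per_1k": 2.0}
cheaper = {"accuracy": 0.80, "latency_ms": 400, "cost_per_1k": 0.8}
print(should_upgrade(production, cheaper))
```

A passing check is a reason to schedule the A/B test, not to skip it.
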
AI Platform vs. Feature Decision
Roadmap & Prioritization

Decide when to invest in reusable AI infrastructure vs. building point solutions for specific features.

  • After shipping 2-3 successful AI features
  • When engineering velocity on AI features is slowing
  • If considering building in-house ML platform capabilities
  • Identify pattern repetition: Are you solving similar AI problems 3+ times? Similar data pipelines? Similar model patterns? Repetition justifies platform
  • Calculate platform ROI: (time saved per feature × number of future features) ÷ cost to build platform, all measured in the same unit (e.g., engineer-weeks). ROI > 3× justifies investment
  • Assess team maturity: Platform work requires senior ML/infra engineers. Do you have the talent? Can you hire or train?
  • Evaluate build vs. buy: Can you use external platforms (SageMaker, Vertex AI, Hugging Face) instead of building? Usually cheaper and faster
  • Phase platform investment: Don't build the whole platform upfront. Start with highest-pain areas (e.g., model deployment, monitoring) and expand
  • Default to features until you have 5+ AI use cases in production—premature platform work is waste
  • Platform work takes 2-3× longer than estimated—only invest when truly needed for scale
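
A quick way to run the platform ROI check, reading ROI as total savings divided by build cost. This is a sketch with hypothetical function names; it assumes cost and savings are both expressed in engineer-weeks.

```python
def platform_roi(build_cost_weeks, weeks_saved_per_feature, future_features):
    """Total engineering time saved divided by the cost to build the platform."""
    if build_cost_weeks <= 0:
        raise ValueError("build cost must be positive")
    return (weeks_saved_per_feature * future_features) / build_cost_weeks


def justifies_platform(roi, threshold=3.0):
    return roi > threshold


# Assumed inputs: 20 weeks to build, 8 weeks saved on each of 10 future features.
roi = platform_roi(build_cost_weeks=20, weeks_saved_per_feature=8, future_features=10)
print(roi, justifies_platform(roi))
```

Since platform work tends to run over its estimate, it is worth rerunning the check with the build cost doubled or tripled.
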
Write a Clear Problem Statement
Problem Discovery

Frame the user problem you're solving before jumping to AI solutions.

  • At the very start of any AI initiative—before technical feasibility
  • When stakeholders ask 'Can we use AI for X?'
  • When your team is solution-focused instead of problem-focused
  • Identify the user: Who specifically experiences this problem? Be specific (e.g., 'sales reps at mid-market SaaS companies' not 'users')
  • Describe the problem: What job are they trying to do? What's blocking them? What workarounds exist today?
  • Quantify the pain: How often does this happen? How much time/money does it cost? How many users affected?
  • Articulate why now: Why hasn't this been solved yet? What's changed that makes solving it possible or urgent now?
  • Write the one-sentence problem statement: '[User] struggles to [job/goal] because [obstacle], which causes [negative outcome]'
  • If you can't write the problem statement without mentioning AI, you're solution-shopping—start over
  • Test your problem statement with 3-5 potential users. If they don't immediately relate, it's too vague
Conduct Customer Discovery Interviews
Problem Discovery

Run interviews that uncover real problems, not what users think you want to hear.

  • Before building any AI feature—validate the problem exists
  • When usage data doesn't explain why users behave a certain way
  • When stakeholders disagree about what problem to solve
  • Recruit the right users: Talk to people who have the problem NOW, not people who might someday. Aim for 8-12 interviews
  • Ask about past behavior, not future intent: 'Tell me about the last time you tried to do X' beats 'Would you use a feature that does X?'
  • Dig into workarounds: 'How do you handle this today?' reveals pain severity. Complex workarounds = high pain worth solving
  • Follow the 5 Whys: When they mention a problem, ask 'Why is that a problem?' 5 times to get to root cause
  • Listen for emotion and specifics: 'That's so frustrating' or detailed stories signal real pain. Vague answers signal low priority
  • Never pitch your solution during discovery—you're there to learn, not to sell
  • Record and transcribe interviews. Patterns across 5+ interviews are more reliable than your memory
Apply Jobs to Be Done Framework
Problem Discovery

Understand what users are really hiring your product to do.

  • When users request features but you suspect they have a deeper need
  • When trying to understand why users choose your product over alternatives
  • Before designing AI features—understand the job first
  • Identify the job: What progress is the user trying to make? What outcome do they want? (Not tasks, but end states)
  • Map the job timeline: When do they realize they have this job? What triggers them to look for a solution? What happens after?
  • Identify forces at play: push forces (problems with current solution), pull forces (attraction to new solution), anxiety (fear new solution won't work), and habits (comfort with current solution)
  • Find underserved needs: Which parts of the job are poorly served today? Where do users overcompensate or accept trade-offs?
  • Frame AI solutions around the job: How can AI help users make progress faster, cheaper, or with less risk?
  • The job is rarely what users say it is—'I need a faster horse' means 'I need to get somewhere faster'
  • Look for jobs with high anxiety or strong habits—these are hard to solve but create switching costs once you do
Validate Problem Severity
Problem Discovery

Confirm the problem is painful enough that users will actually use your solution.

  • After identifying a potential problem but before building anything
  • When stakeholders claim 'everyone has this problem' but you have no evidence
  • Before prioritizing which of several problems to solve first
  • Measure frequency: How often do users encounter this problem? Daily = high severity. Monthly = low. One-off = don't solve
  • Assess impact: What's the cost when this problem occurs? Time lost? Money lost? Emotional toll? Quantify it
  • Check current solutions: What do users do today? If workarounds are cheap/easy, your solution needs to be 10× better to win
  • Test willingness to change: Ask 'If I could solve this perfectly, would you switch from your current solution?' Hesitation = low severity
  • Validate with multiple signals: users complain about it unprompted, users pay for bad workarounds, and users abandon tasks because it's too hard
  • If users say it's a problem but won't schedule a follow-up call, it's not painful enough
  • Problems you discover in interviews > problems users report in surveys. Actions > words
Size the Opportunity
Problem Discovery

Estimate if solving this problem is big enough to justify AI investment.

  • After validating problem severity, before building business case
  • When choosing between multiple validated problems to solve
  • When executives ask 'How big is this opportunity?'
  • Count affected users: How many people/companies have this problem? Use customer data, market research, or proxy metrics
  • Estimate willingness to pay: Survey 20+ users. Ask 'What would you pay to solve this?' Use median, not mean (outliers skew)
  • Calculate TAM: Affected users × willingness to pay × purchase frequency. This is your ceiling
  • Estimate SAM (serviceable addressable market): Of TAM, how many can you realistically reach with your sales/distribution? Usually 10-30% of TAM
  • Project SOM (serviceable obtainable market): What % of SAM can you capture in 3 years? Realistic first-mover = 5-15%, fast follower = 2-8%
  • TAM > $100M justifies significant AI investment. TAM < $10M rarely justifies custom ML—use APIs instead
  • Don't confuse market size with your opportunity. $1B TAM × 1% share = $10M business, not $1B business
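
The sizing arithmetic, worked through as a sketch. Every number in the example call is invented for illustration, and the function name is hypothetical.

```python
from statistics import median


def size_opportunity(affected_users, wtp_responses, purchases_per_year,
                     sam_share, som_share):
    """Annual TAM, SAM, and SOM from survey answers and reach assumptions."""
    willingness_to_pay = median(wtp_responses)  # median resists outliers
    tam = affected_users * willingness_to_pay * purchases_per_year
    sam = tam * sam_share
    som = sam * som_share
    return {"tam": tam, "sam": sam, "som": som}


# Assumed inputs: 50,000 affected users, monthly purchase, one outlier answer.
sizes = size_opportunity(
    affected_users=50_000,
    wtp_responses=[20, 25, 30, 40, 500],
    purchases_per_year=12,
    sam_share=0.20,
    som_share=0.10,
)
print(sizes)
```

With the mean instead of the median, the single 500 answer would roughly quadruple every figure.
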
Run a Design Sprint
Problem Discovery

Quickly prototype and validate AI solutions with users in 5 days.

  • After validating the problem, before committing to full development
  • When stakeholders want proof the solution will work
  • When choosing between multiple AI approaches
  • Monday - Map the problem: Define the long-term goal, map the user journey, pick a target moment to focus the sprint
  • Tuesday - Sketch solutions: Each person sketches how AI could solve the problem. Vote on strongest ideas. No coding yet
  • Wednesday - Decide: Critique sketches, vote on one solution to prototype. Storyboard the user experience step-by-step
  • Thursday - Prototype: Build a realistic fake (Wizard of Oz). Use mockups + humans behind the scenes to simulate AI. No real ML
  • Friday - Test with 5 users: Watch them use the prototype. Look for confusion, delight, and whether they'd use it. Decide: build it or pivot
  • Don't build real AI during the sprint—use fake data or humans pretending to be AI. You're testing UX, not models
  • 5 user tests reveal 85% of usability issues. More tests = diminishing returns

Risk Cards

135 frameworks for identifying what can go wrong and catching it early.

AI Risk Categories Overview
Primers

Understand the complete landscape of risks unique to AI products and when each type matters most.

  • When starting your first AI product initiative
  • Before creating a risk management plan for AI features
  • When onboarding stakeholders or executives to AI product development
  • Model Risks: Performance degradation, bias, drift, adversarial attacks. Critical for accuracy-dependent features.
  • Data Risks: Quality issues, privacy violations, poisoning. Critical when handling sensitive or regulated data.
  • User Safety & Trust: Harmful outputs, misaligned expectations, transparency gaps. Critical for consumer-facing AI.
  • Ethical Considerations: Fairness, discrimination, unintended consequences. Critical for high-stakes decisions.
  • Legal & Compliance: Regulatory requirements, IP issues, liability. Critical in regulated industries.
  • Operational Risks: Deployment failures, scaling issues, cost overruns. Critical at high scale or tight margins.
  • Start with User Safety & Trust for consumer products; Legal & Compliance for enterprise
  • Revisit this map quarterly—new AI risks emerge as your product matures
Risk Assessment Framework
Primers

Systematically evaluate and prioritize AI risks using likelihood, impact, and detection difficulty.

  • When planning a new AI feature or product launch
  • After identifying multiple risks and needing to prioritize mitigation efforts
  • When justifying risk management investments to leadership
  • List all identified risks: Use the AI Risk Categories Overview as your checklist
  • Score each risk: Likelihood (1-5), Impact (1-5), Detection Difficulty (1-5). Multiply for total score.
  • Prioritize by score (maximum 125): >75 = critical (address before launch), 50-75 = high (address within 30 days), 25-49 = medium (monitor), <25 = low (document only)
  • Create mitigation plan: For each critical/high risk, define prevention, detection, and response tactics
  • Assign owners: Every risk needs a DRI (Directly Responsible Individual)
  • Re-assess risks monthly in first 90 days post-launch—real user behavior reveals hidden risks
  • Include diverse stakeholders in scoring—PMs, engineers, legal, support teams see different risks
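
The scoring and prioritization steps can be sketched as follows. The bands follow the card. Treating a score of exactly 50 as high and exactly 25 as medium is an assumption, and the function names are hypothetical.

```python
def risk_score(likelihood, impact, detection_difficulty):
    """Multiply three 1-5 ratings into a score between 1 and 125."""
    for rating in (likelihood, impact, detection_difficulty):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
    return likelihood * impact * detection_difficulty


def priority(score):
    if score > 75:
        return "critical"  # address before launch
    if score >= 50:
        return "high"      # address within 30 days
    if score >= 25:
        return "medium"    # monitor
    return "low"           # document only


# A likely, severe, hard-to-detect risk.
score = risk_score(likelihood=4, impact=5, detection_difficulty=4)
print(score, priority(score))
```

Because the score is a product, a risk has to rate high on all three dimensions to reach the critical band.
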
Detect and Prevent Overfitting
Model Risks

Ensure your model generalizes to real-world data instead of just memorizing training examples.

  • When your model shows great training metrics but poor real-world performance
  • Before committing to a model for production deployment
  • When stakeholders question why AI performance doesn't match development claims
  • Split data properly: 70% train, 15% validation, 15% test. Never let test data touch training.
  • Compare train vs. validation metrics: If train accuracy is 95% but validation is 75%, you're overfitting
  • Apply regularization: Use dropout, L1/L2 regularization, early stopping. Start with dropout=0.2-0.5.
  • Increase training data: More diverse examples help. Aim for 10x examples per model parameter as baseline.
  • Validate on production-like data: Test on data sampled from actual user scenarios, not just held-out training data
  • Red flag: a gap of more than 10 percentage points between training and validation metrics usually means overfitting
  • For small datasets (<10K examples), use k-fold cross-validation instead of single split
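
The split and the train-versus-validation comparison can be sketched with the standard library alone. The function names and the fixed seed are illustrative assumptions.

```python
import random


def split_dataset(examples, seed=0):
    """Shuffle once, then cut into 70% train, 15% validation, 15% test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 70 // 100
    val_end = n * 85 // 100
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


def is_overfitting(train_metric, val_metric, max_gap=0.10):
    """Flag a train/validation gap larger than max_gap (metrics as fractions)."""
    return (train_metric - val_metric) > max_gap


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
print(is_overfitting(train_metric=0.95, val_metric=0.75))
```

The three slices never overlap, so the test set stays untouched as long as it is only used for the final evaluation.
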
Detect Model Drift
Model Risks

Monitor when real-world data patterns change, causing your model's performance to degrade.

  • When setting up production monitoring for AI features
  • If users report AI quality declining over time
  • Every 30-90 days post-launch as routine health check
  • Track input distribution: Monitor feature distributions weekly. Use histograms, summary stats, KL divergence from baseline.
  • Track prediction distribution: Are outputs shifting? E.g., if your classifier suddenly predicts 80% class A vs. historical 50%, investigate.
  • Monitor model metrics: Track accuracy, precision, recall on live data (requires ground truth labels)
  • Set drift thresholds: If KL divergence >0.1 or accuracy drops >5%, trigger alert
  • Create retraining playbook: Define when to retrain (monthly default), who approves, how to A/B test new model
  • Use shadow mode for new models—run in parallel with production model for 1-2 weeks before switching
  • Seasonal businesses: Expect drift. Retrain models before peak seasons (holiday retail, tax season, etc.)
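
A minimal sketch of the KL divergence check on one binned feature. It assumes both histograms use the same bins and measures divergence of current data from the baseline. The smoothing constant and function names are assumptions.

```python
import math


def kl_divergence(current_counts, baseline_counts, smoothing=1e-6):
    """KL(current || baseline) over matching histogram bins, in nats.

    A small smoothing constant keeps empty bins from producing log(0).
    """
    if len(current_counts) != len(baseline_counts):
        raise ValueError("histograms must use the same bins")
    p = [c + smoothing for c in current_counts]
    q = [c + smoothing for c in baseline_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )


def drift_alert(current_counts, baseline_counts, threshold=0.1):
    return kl_divergence(current_counts, baseline_counts) > threshold


# Baseline was a 50/50 split between two bins; this week it is 80/20.
print(drift_alert(current_counts=[80, 20], baseline_counts=[50, 50]))
```

The 0.1 threshold depends on how the feature is binned, so it is worth calibrating against a few weeks of known-stable data first.
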
Defend Against Adversarial Attacks
Model Risks

Protect your model from malicious inputs designed to cause incorrect predictions or harmful outputs.

  • Before launching AI features with financial impact (fraud detection, lending, pricing)
  • For user-generated content moderation systems
  • When AI controls access to resources or benefits
  • Threat model your feature: Who benefits from gaming the system? What would they try? (e.g., spam filter evasion, face recognition spoofing)
  • Test adversarial robustness: Use libraries like CleverHans, Foolbox. Generate adversarial examples for your model.
  • Implement defenses: Input validation, adversarial training (retrain on adversarial examples), ensemble models
  • Add detection layer: Monitor for suspicious input patterns (e.g., small perturbations, repeated similar inputs)
  • Build human review workflow: Flag high-stakes decisions or suspicious patterns for manual review
  • Start with input sanitization—often cheaper and more effective than complex adversarial training
  • For image/audio models, check for small pixel/noise perturbations that flip predictions
Model Explainability Framework
Model Risks

Make AI decisions understandable to users, auditors, and internal teams for trust and compliance.

  • When building AI for regulated industries (finance, healthcare, hiring)
  • If users need to understand why AI made specific recommendations
  • Before launching AI features that impact high-stakes user decisions
  • Define your audience: End users need simple explanations; regulators need full audit trails; ML engineers need feature importance.
  • Choose explanation method: SHAP/LIME for feature importance, attention visualization for transformers, decision trees for simple rules
  • Build explanation UI: Show top 3-5 factors influencing each prediction. Use plain language, not technical jargon.
  • Document model cards: For each model, document training data, intended use, limitations, performance metrics
  • Test explanations: Show to 10 target users. Do they understand? Do they trust the AI more?
  • Start with global explanations (how the model works overall) before per-prediction explanations
  • For black-box models, consider building a simpler interpretable 'proxy model' for explanations
Bias Detection in Models
Model Risks

Systematically test for unfair outcomes across demographic groups and use cases.

  • Before launching AI that affects people's opportunities (hiring, lending, housing)
  • When AI serves diverse user populations
  • As part of regular model audits (quarterly minimum for high-stakes AI)
  • Identify protected attributes: Age, gender, race, disability status, etc. Check applicable laws (GDPR, ECOA, FHA).
  • Measure performance by group: Calculate accuracy, false positive rate, false negative rate for each demographic
  • Apply fairness metrics: Demographic parity (equal outcomes), equalized odds (equal error rates), individual fairness
  • Set fairness thresholds: E.g., false positive rate must be within 5% across all groups
  • Document disparities: If bias detected, decide: retrain with balanced data, adjust decision thresholds, add human review
  • Even if you don't collect demographic data, test on diverse synthetic or proxy datasets
  • Involve domain experts and affected communities in defining what 'fair' means for your use case
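
Measuring performance by group can start as simply as this. A sketch that computes the false positive rate per group and applies the 5-point threshold from the card; the record layout and function names are assumptions.

```python
from collections import defaultdict


def false_positive_rates(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels.

    Returns {group: false positives / actual negatives}.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 0:
            negatives[group] += 1
            if y_pred == 1:
                false_positives[group] += 1
    return {group: false_positives[group] / count for group, count in negatives.items()}


def fpr_within_threshold(rates, max_gap=0.05):
    """True when the best and worst groups differ by at most max_gap."""
    return max(rates.values()) - min(rates.values()) <= max_gap


records = (
    [("A", 0, 1)] * 1 + [("A", 0, 0)] * 9 +
    [("B", 0, 1)] * 3 + [("B", 0, 0)] * 7
)
rates = false_positive_rates(records)
print(rates, fpr_within_threshold(rates))
```

The same pattern extends to false negative rates by counting over records where y_true is 1.
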
Model Performance Degradation
Model Risks

Plan for and monitor how model performance changes over time in production.

  • Before launching AI features in production
  • When setting up monitoring and alerting systems
  • If users report declining AI quality
  • Baseline your metrics: Record accuracy, precision, recall at launch. This is your reference point.
  • Set up monitoring: Track model metrics daily/weekly. Use tools like MLflow, Weights & Biases.
  • Define degradation thresholds: If accuracy drops >5%, trigger alert. If >10%, pause feature.
  • Create response playbook: Who gets alerted? How fast do you retrain? What's the communication plan?
  • Schedule regular retraining: Monthly or quarterly, depending on data freshness needs
  • Monitor input data distribution too—shifts in user behavior often cause model drift
  • Keep a "champion/challenger" system—always have a backup model ready
Handle Model Uncertainty
Model Risks

Quantify and communicate when your model is uncertain about predictions to prevent overconfidence.

  • When AI predictions have variable confidence levels
  • For high-stakes decisions where wrong predictions are costly
  • When users need to understand AI reliability before acting
  • Calibrate confidence scores: Use temperature scaling or Platt scaling. Test: Do 90% confidence predictions succeed 90% of the time?
  • Define uncertainty thresholds: <50% confidence = reject, 50-80% = human review, >80% = auto-approve
  • Surface uncertainty to users: Show confidence scores, use language like 'high/medium/low confidence', explain implications
  • Build fallback workflows: When model is uncertain, route to human review, simpler heuristic, or ask user for more input
  • Monitor uncertainty patterns: Are certain user segments or scenarios consistently high-uncertainty? Investigate why.
  • For neural networks, use dropout at inference time (Monte Carlo dropout) to estimate uncertainty
  • Never auto-execute high-stakes actions when confidence is below your calibrated threshold
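
The threshold bands and the calibration test can be sketched together. The bands follow the card; sending exactly 50% and exactly 80% to human review is an assumption, as are the function names.

```python
def route_prediction(confidence):
    """Map a calibrated confidence score (0-1) to an action."""
    if confidence < 0.50:
        return "reject"
    if confidence <= 0.80:
        return "human_review"
    return "auto_approve"


def observed_accuracy(predictions, low, high):
    """Share of correct predictions whose confidence falls in [low, high).

    predictions: iterable of (confidence, was_correct) pairs.
    Returns None when no prediction falls in the band.
    """
    outcomes = [correct for confidence, correct in predictions if low <= confidence < high]
    return sum(outcomes) / len(outcomes) if outcomes else None


history = [(0.92, True), (0.95, True), (0.91, False), (0.60, True)]
print(route_prediction(0.72))
print(observed_accuracy(history, 0.90, 1.01))
```

If the observed accuracy in the 90%+ band sits well below 0.9, the scores need recalibration before the auto-approve band can be trusted.
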
Model Ensemble Strategies
Model Risks

Combine multiple models to improve reliability, reduce bias, and provide fallback options.

  • When single-model accuracy isn't meeting requirements
  • To reduce risk of model failures in production
  • When different models excel at different edge cases
  • Choose ensemble approach: Voting (majority rule), averaging (mean confidence), stacking (meta-model learns from base models)
  • Select diverse models: Different architectures (e.g., tree-based + neural net), different training data subsets, different hyperparameters
  • Define aggregation rules: For classification, use majority voting or weighted voting. For regression, use weighted average.
  • Test performance vs. cost: Measure accuracy gain vs. latency and compute cost. Aim for >5% accuracy improvement to justify.
  • Implement fallback logic: If models disagree significantly, route to human review or use most conservative prediction
  • Start with 3-5 models—diminishing returns beyond that for most applications
  • For latency-sensitive apps, run models in parallel rather than sequentially
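
Aggregation and fallback logic, sketched for the two simplest ensemble types. The 60% agreement floor and the function names are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions, min_agreement=0.6):
    """Return (label, agreement); fall back to human review on weak agreement."""
    label, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    if agreement < min_agreement:
        return "human_review", agreement
    return label, agreement


def weighted_average(values, weights):
    """Weighted mean of regression outputs, one weight per model."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


print(majority_vote(["spam", "spam", "ham"]))
print(majority_vote(["spam", "ham", "unsure"]))
print(weighted_average([10.0, 20.0], weights=[1, 3]))
```

Weights would normally come from each model's validation performance; here they are arbitrary.
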
Data Quality Validation
Data Risks

Systematically check training and production data for errors, inconsistencies, and quality issues.

  • Before training any ML model
  • When setting up data pipelines for production AI
  • If model performance unexpectedly degrades
  • Define quality checks: Completeness (missing values <5%?), accuracy (spot-check samples), consistency (format/range validation), timeliness (data freshness)
  • Automate validation: Use tools like Great Expectations, Pandera. Run checks on every data batch before training/inference.
  • Set quality thresholds: Define minimum acceptable quality. E.g., >95% complete records, <1% invalid formats.
  • Monitor data drift: Track feature distributions over time. Alert if statistical properties shift significantly.
  • Create data rejection policy: Automatically reject batches below quality thresholds. Never train on bad data.
  • Add schema validation as first line of defense—catches 80% of data quality issues
  • Keep examples of 'bad data' in a test suite to prevent regression
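
A batch rejection policy can start as a plain completeness check before reaching for a dedicated tool. A sketch assuming records arrive as dictionaries; the function name and field names are hypothetical.

```python
def validate_batch(rows, required_fields, min_complete_rate=0.95):
    """Return (accepted, complete_rate) for a batch of dict records.

    A record is complete when every required field is present and non-empty.
    """
    if not rows:
        return False, 0.0  # an empty batch is treated as a failure
    complete = sum(
        1 for row in rows
        if all(row.get(field) not in (None, "") for field in required_fields)
    )
    complete_rate = complete / len(rows)
    return complete_rate > min_complete_rate, complete_rate


batch = [{"id": i, "text": "ok"} for i in range(90)] + [{"id": i, "text": ""} for i in range(10)]
print(validate_batch(batch, required_fields=["id", "text"]))
```

Rejected batches are worth saving: they become the "bad data" examples for the regression test suite.
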
Data Privacy & Compliance
Data Risks

Ensure your AI systems handle user data in compliance with GDPR, CCPA, and other privacy regulations.

  • Before collecting any user data for ML training
  • When launching AI features in new geographic markets
  • After privacy regulations change or during audits
  • Map data flows: Document what data you collect, where it's stored, who accesses it, how long you keep it
  • Get proper consent: Users must opt in to data collection for ML training. Keep this separate from general product usage consent.
  • Implement data minimization: Only collect data necessary for model training. Aggregate or anonymize when possible.
  • Enable data deletion: Support 'right to be forgotten' (GDPR). Document how you remove user data from training sets and models.
  • Audit regularly: Quarterly review of data practices. Test that deletion workflows actually work.
  • For EU users, you need explicit consent and must explain ML model usage in privacy policy
  • Consider differential privacy techniques if working with sensitive data (medical, financial)
Training Data Contamination
Data Risks

Prevent and detect when training data contains errors, biases, or malicious examples that corrupt your model.

  • Before starting model training, especially with user-generated or scraped data
  • When model behavior is unexpected or problematic
  • After discovering anomalies in training data sources
  • Audit data sources: Where does training data come from? How was it collected? What's the sampling methodology?
  • Check for label quality: Measure inter-annotator agreement (Kappa score >0.7 is good). Review disputed labels.
  • Detect outliers: Use statistical methods to find anomalous examples. Manually review top 1% most unusual data points.
  • Test for distribution bias: Compare training data demographics/scenarios to real user population. Fill gaps.
  • Version training datasets: Use data versioning (DVC, Pachyderm). Track exactly what data trained each model version.
  • For crowd-sourced labels, require 3+ labelers per example and take majority vote
  • Spot-check 100 random training examples yourself—fastest way to catch systemic issues
Data Poisoning Defense
Data Risks

Protect your training pipeline from malicious actors injecting harmful examples to corrupt your model.

  • When training on user-generated content or external data sources
  • For content moderation or fraud detection systems
  • If your AI influences high-value decisions or resource allocation
  • Identify attack vectors: Can users submit training data? Can attackers access your data pipeline? What's the threat model?
  • Implement data validation: Sanitize inputs, check for suspicious patterns (duplicates, extremes, coordinated submissions)
  • Use trusted data sources: Prefer curated datasets over unfiltered web scraping. Verify data provenance.
  • Apply outlier detection: Use statistical methods or anomaly detection models to flag suspicious training examples
  • Monitor model behavior: Test trained models on known-good validation sets. Alert if performance drops unexpectedly.
  • For user-contributed training data, require minimum account age/reputation before accepting submissions
  • Keep a 'clean' holdout dataset that never touches user-generated data for validation
Data Pipeline Failure Response
Data Risks

Plan for and recover from data pipeline outages that break model training or inference.

  • When building production ML data pipelines
  • After experiencing a data pipeline incident
  • Before launch of AI features with real-time data dependencies
  • Map pipeline dependencies: Document data sources, transformations, storage, and downstream consumers. Identify single points of failure.
  • Build monitoring: Alert on pipeline failures (job failures, data quality drops, missing data, latency spikes)
  • Create fallback data: Cache recent data for inference. If live data fails, fall back to cached version for 24-48 hours.
  • Define recovery procedures: Document step-by-step recovery (restart jobs, backfill data, validate outputs, notify stakeholders)
  • Practice incident response: Run fire drills quarterly. Simulate pipeline failures and test recovery procedures.
  • Set up dual alerting: page on-call engineer AND send non-urgent alert to PM
  • For critical pipelines, implement automated rollback to last known good state
Labeling Quality Assurance
Data Risks

Ensure high-quality, consistent labels for supervised learning through systematic QA processes.

  • When setting up a data labeling operation (in-house or vendor)
  • If model performance is below expectations despite good architecture
  • Before scaling up labeling efforts
  • Create labeling guidelines: Write clear, detailed instructions with examples. Include edge cases and ambiguous scenarios.
  • Train labelers: Require all labelers to complete training set. Must score >90% agreement with gold standard.
  • Measure inter-annotator agreement: Have 10-20% of data labeled by multiple people. Calculate Cohen's Kappa or Fleiss' Kappa. Target >0.7.
  • Implement review process: Subject matter experts review 5-10% of labels. Provide feedback to labelers.
  • Track labeler performance: Monitor agreement rates per labeler. Provide additional training or remove low-performing labelers.
  • For subjective tasks, accept that perfect agreement is impossible. Kappa of 0.6-0.7 may be acceptable.
  • Use active learning: have model flag most uncertain examples for human review first
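
Cohen's Kappa for two labelers can be computed directly from its definition: observed agreement corrected for the agreement expected by chance. A minimal sketch with a hypothetical function name; Fleiss' Kappa (three or more labelers) is not covered here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers on the same items, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both labelers pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)


labeler_1 = ["y"] * 5 + ["n"] * 5
labeler_2 = ["y"] * 4 + ["n"] * 6
print(round(cohens_kappa(labeler_1, labeler_2), 2))
```

Raw agreement in this example is 90%, but Kappa is lower because half of that agreement would be expected by chance.
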
PII Detection & Redaction
Data Risks

Automatically detect and remove personally identifiable information from training data and model outputs.

  • When working with user-generated content or communication data
  • Before sharing data with labeling vendors or third parties
  • When building AI features that process sensitive information
  • Define PII scope: Names, emails, phone numbers, addresses, SSN, credit cards, medical records, etc. Check applicable regulations.
  • Implement detection: Use regex patterns, named entity recognition (NER) models, or services like Amazon Macie or the Google Cloud DLP API
  • Apply redaction strategy: Replace with tokens ([NAME], [EMAIL]) or synthetic data. Don't just delete—preserve context.
  • Validate effectiveness: Manually review sample of redacted data. Run PII detection on model outputs periodically.
  • Document exceptions: Some use cases require PII. Document why, how it's protected, and retention policies.
  • For text generation models, add PII detection as post-processing step before showing outputs to users
  • Test with creative PII formats—attackers use l33tspeak, Unicode, and other tricks to evade detection
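
Token-based redaction with regex patterns, as a first layer. This is a sketch only: the three patterns are simplified assumptions covering common US-style formats, and they will miss names, addresses, and obfuscated variants that need NER or a dedicated service.

```python
import re

# Simplified, illustrative patterns; real detection needs broader coverage.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]


def redact(text):
    """Replace each match with a placeholder token so context is preserved."""
    for token, pattern in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text


print(redact("Email jane.doe@example.com or call 555-123-4567."))
```

Replacing with tokens rather than deleting keeps sentence structure intact, so the redacted text remains usable as training data.
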
Synthetic Data Generation
Data Risks

Create artificial training data to augment real data, protect privacy, or handle rare scenarios.

  • When you lack sufficient real training data
  • To protect user privacy while maintaining data utility
  • To oversample rare but important scenarios (fraud, safety incidents)
  • Choose generation method: Rule-based (for structured data), GANs (for images), language models (for text), data augmentation (transforms)
  • Validate realism: Statistical tests comparing synthetic vs. real data distributions. Use domain experts to review samples.
  • Measure utility: Train one model on real data and one on synthetic data, then evaluate both on the same real held-out set. A performance drop >10% means the synthetic data isn't good enough.
  • Check for privacy leaks: Ensure synthetic data doesn't accidentally memorize and reproduce real examples
  • Document limitations: Synthetic data may not capture all real-world complexity. Test models on real held-out data.
  • Start with data augmentation (rotations, crops, paraphrasing)—simpler and lower risk than full synthesis
  • For regulated industries, validate that synthetic data satisfies same compliance requirements as real data
Harmful Output Prevention
User Safety & Trust

Block AI from generating dangerous, offensive, or harmful content through multi-layered safety systems.

  • Before launching any generative AI feature (text, image, code)
  • When AI outputs are user-facing or influence user decisions
  • Required for consumer applications, especially those accessible to minors
  • Define harm taxonomy: Violence, hate speech, sexual content, self-harm, illegal activity, misinformation. Prioritize by severity and likelihood.
  • Implement input filters: Block prompts requesting harmful content. Use keyword lists + classifier models.
  • Apply output filters: Scan all generated content before showing to users. Use content moderation APIs + custom classifiers.
  • Set confidence thresholds: >0.9 = block automatically, 0.7-0.9 = human review, <0.7 = allow with monitoring
  • Build escalation workflow: Repeated violations trigger account review. Store blocked attempts for analysis.
  • Layer multiple filters—no single filter is perfect. Aim for 99.5%+ harmful content blocked.
  • Red-team your system monthly: try to generate harmful content and update filters based on findings
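The confidence thresholds above map directly to a routing function. This is a minimal sketch; the function name and action labels are assumptions, and the score is taken to be a classifier's probability that the content is harmful.

```python
def route_by_harm_score(score, block_above=0.9, review_from=0.7):
    """Map a harm-classifier score in [0, 1] to an action.

    > block_above            -> block automatically
    review_from..block_above -> send to human review
    < review_from            -> allow, but keep monitoring
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > block_above:
        return "block"
    if score >= review_from:
        return "human_review"
    return "allow_with_monitoring"


for s in (0.95, 0.80, 0.30):
    print(s, route_by_harm_score(s))
```

Keeping the thresholds as parameters makes it easier to tighten them after a red-team round without touching the routing logic.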
Set User Expectations for AI
User Safety & Trust

Clearly communicate what AI can and cannot do to prevent misunderstanding and misuse.

  • During onboarding for new AI features
  • When users first interact with AI capabilities
  • After incidents caused by user misunderstanding of AI limitations
  • Document capabilities and limitations: What tasks does AI excel at? Where does it fail? What shouldn't users try?
  • Communicate in-product: Show capability descriptions on first use. Use disclaimers for high-stakes use cases.
  • Provide examples: Show what good inputs/outputs look like. Show what AI cannot do.
  • Set accuracy expectations: 'This AI is 85% accurate on X task' or 'Always verify AI outputs for [use case]'
  • Update based on usage: Monitor support tickets and user errors. Refine messaging to address common misconceptions.
  • For safety-critical domains (medical, legal, financial), require explicit acknowledgment of limitations before use
  • Test messaging with target users—what's clear to you may confuse them
AI Transparency Communication
User Safety & Trust

Disclose when AI is involved in decisions and how it influences user experiences.

  • When AI influences recommendations, rankings, or decisions users care about
  • In regulated industries requiring algorithmic transparency
  • When building trust is critical to product adoption
  • Identify AI touchpoints: Where does AI influence user experience? Recommendations, search results, content moderation, pricing?
  • Decide disclosure level: Passive (AI badge), Active (explanation on demand), Proactive (always-visible explanation)
  • Write clear disclosures: Use plain language. 'AI suggests these results based on your history' not 'ML algorithm ranks outputs'
  • Provide controls: Let users adjust AI behavior (opt out, tune personalization, see alternatives)
  • Document for auditors: Maintain detailed technical documentation for regulators, even if users see simplified version
  • For high-stakes decisions (lending, hiring), proactive disclosure may be legally required
  • Test transparency features with users—too much detail overwhelms, too little erodes trust
Human-in-the-Loop Review
User Safety & Trust

Design workflows where humans review and approve high-stakes AI decisions before execution.

  • For AI decisions with significant user impact (financial, legal, safety)
  • When model confidence is low or decision is ambiguous
  • As a safety net while AI is maturing
  • Define review triggers: Low confidence (<80%), high stakes (>$100 transaction), sensitive content, user flags
  • Design review interface: Show AI's recommendation + confidence + key evidence. Make approve/reject/edit easy.
  • Set SLAs: How fast must reviews complete? Who gets escalated? What happens if no review within SLA?
  • Measure effectiveness: Track overturn rate (how often humans disagree with AI). If >20%, AI needs improvement.
  • Close feedback loop: Feed human decisions back to training data. AI should learn from corrections.
  • Start with 100% human review, gradually decrease as AI improves and you build confidence
  • Monitor reviewer fatigue—accuracy drops after ~2 hours. Rotate reviewers or add breaks.
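A small sketch of the review triggers and the overturn-rate metric described above. The `Decision` fields and function names are hypothetical; the defaults mirror the example thresholds in the list (confidence below 80%, transactions above $100).

```python
from dataclasses import dataclass


@dataclass
class Decision:
    confidence: float        # model confidence in [0, 1]
    amount_usd: float        # transaction value, if any
    sensitive: bool = False  # flagged as sensitive content
    user_flagged: bool = False


def needs_human_review(d, min_confidence=0.80, max_auto_amount=100.0):
    """True if any review trigger fires for this decision."""
    return (
        d.confidence < min_confidence
        or d.amount_usd > max_auto_amount
        or d.sensitive
        or d.user_flagged
    )


def overturn_rate(reviews):
    """Share of reviewed cases where the human disagreed with the AI.

    reviews: list of (ai_decision, human_decision) pairs.
    """
    if not reviews:
        return 0.0
    return sum(ai != human for ai, human in reviews) / len(reviews)


reviews = [("approve", "approve"), ("approve", "reject"),
           ("reject", "reject"), ("approve", "approve")]
rate = overturn_rate(reviews)
print(f"overturn rate: {rate:.0%}, above 20% threshold: {rate > 0.20}")
```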
Design Fallback Mechanisms
User Safety & Trust

Build graceful degradation when AI fails so users can still accomplish their goals.

  • For any AI feature in production
  • When AI is part of critical user workflows
  • During AI system outages or performance degradation
  • Identify failure modes: Model errors, low confidence, service outages, timeouts, unexpected inputs
  • Design fallback tiers: Tier 1 (simpler model), Tier 2 (rule-based system), Tier 3 (manual process), Tier 4 (graceful failure message)
  • Set degradation thresholds: If model latency >2s, fall back to cached results. If accuracy <70%, fall back to rules.
  • Implement seamlessly: Users shouldn't notice transition. Fallback should feel like normal feature operation.
  • Monitor fallback usage: Track how often each tier activates. High fallback rate indicates systemic AI issues.
  • For recommendation systems, always have a 'popular items' fallback—simple and always works
  • Test fallbacks in production regularly—fire drills ensure they work when needed
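One way to sketch the fallback tiers above is an ordered list of handlers that is walked until one succeeds, with a counter recording which tier served each request. The handler names below are hypothetical stand-ins; in practice each would wrap a real model, rule set, or cache.

```python
from collections import Counter

FAILURE_MESSAGE = "We couldn't complete this request right now. Please try again later."


def run_with_fallbacks(tiers, request, usage=None):
    """Try each (name, handler) tier in order and return the first result.

    A tier counts as failed if it raises or returns None. `usage` is a
    Counter that records which tier served the request, for monitoring.
    """
    usage = usage if usage is not None else Counter()
    for name, handler in tiers:
        try:
            result = handler(request)
        except Exception:
            continue  # this tier failed; fall through to the next one
        if result is not None:
            usage[name] += 1
            return name, result
    usage["failure_message"] += 1
    return "failure_message", FAILURE_MESSAGE


# Hypothetical handlers for a recommendation feature.
def primary_model(request):
    raise TimeoutError("model took too long")


def popular_items(request):
    return ["item-1", "item-2", "item-3"]


usage = Counter()
tiers = [("primary_model", primary_model), ("popular_items", popular_items)]
print(run_with_fallbacks(tiers, {"user": "u1"}, usage))
print(dict(usage))
```

A high count on the lower tiers is the signal the monitoring bullet describes: the primary model is failing more often than it should.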
AI Error Communication
User Safety & Trust

Craft helpful, honest error messages when AI fails or produces low-quality outputs.

  • When designing error states for AI features
  • After users report confusion about AI failures
  • When AI cannot fulfill user requests
  • Categorize error types: 'AI not confident enough', 'Input unclear', 'Request outside AI capabilities', 'Temporary service issue'
  • Write specific messages: Not 'Error occurred', but 'AI couldn't understand your request. Try rephrasing or adding more details.'
  • Suggest next steps: Tell users what to do. 'Try again', 'Rephrase your question', 'Contact support for help'
  • Provide alternatives: If AI can't help, show manual workflow or human assistance option
  • Learn from errors: Log error types and user context. Use to improve model and error handling.
  • Never blame users—even if input is bad, frame as AI limitation: 'AI works best with X type of input'
  • For generative AI, distinguish 'couldn't generate' vs. 'generated but content was filtered'
Collect User Feedback on AI
User Safety & Trust

Systematically gather user feedback on AI quality to identify issues and drive improvements.

  • For any user-facing AI feature in production
  • When diagnosing AI quality issues
  • To prioritize model improvement efforts
  • Add lightweight feedback: Thumbs up/down on AI outputs. Takes <1 second, high response rate.
  • Segment by confidence: Always ask for feedback on low-confidence predictions. Sample 5-10% of high-confidence ones.
  • Add optional details: Let users explain why they downvoted (optional text field or predefined reasons)
  • Close the loop: Show users 'Thanks for feedback' + what will happen. Notify them when issue is fixed.
  • Analyze patterns: Weekly review of negative feedback. Identify common failure modes. Prioritize by frequency × severity.
  • Aim for >5% feedback rate. If lower, reduce friction (fewer clicks, better placement)
  • Tag feedback with model version so you can measure if improvements actually help
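The "frequency × severity" prioritization above is simple arithmetic, sketched here under the assumption that each negative feedback item has already been tagged with a failure mode and that severity weights are assigned by the team. All names and weights are illustrative.

```python
from collections import Counter


def prioritize_failure_modes(negative_feedback, severity):
    """Rank failure modes by frequency x severity, highest first.

    negative_feedback: list of failure-mode tags, one per downvote.
    severity: dict mapping failure mode to a weight (default 1).
    """
    counts = Counter(negative_feedback)
    scored = [(mode, n * severity.get(mode, 1)) for mode, n in counts.items()]
    # Sort by score descending, then by name for a stable order on ties.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


feedback = ["hallucination"] * 3 + ["slow"] * 10 + ["offensive"] * 2
weights = {"hallucination": 3, "slow": 1, "offensive": 5}
for mode, score in prioritize_failure_modes(feedback, weights):
    print(mode, score)
```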
Content Moderation at Scale
User Safety & Trust

Build systems to detect and remove harmful user-generated content using AI + human review.

  • For platforms with user-generated content
  • When required by platform policies (App Store, regulatory requirements)
  • After discovering problematic content in your product
  • Define policy: What content is prohibited? Violence, hate speech, spam, misinformation, etc. Write clear guidelines.
  • Implement automated detection: Use content moderation APIs (Amazon Rekognition, Google Cloud Vision, OpenAI Moderation) + custom models
  • Set action thresholds: >0.9 = auto-remove, 0.7-0.9 = human review, <0.7 = allow with monitoring
  • Build review queue: Surface flagged content to human moderators. Prioritize by severity and volume.
  • Handle appeals: Let users appeal removals. Review by senior moderators. Update policies based on patterns.
  • Start with pre-moderation (review before publishing) for high-risk platforms. Shift to post-moderation as systems mature.
  • Provide mental health support for human moderators—exposure to harmful content causes trauma
Build User Trust in AI
User Safety & Trust

Systematically increase user confidence in AI through transparency, consistency, and demonstrated reliability.

  • When launching new AI features to skeptical users
  • If adoption metrics show users avoiding AI features
  • After AI errors or incidents damage trust
  • Start small: Launch AI for low-stakes tasks first. Let users build confidence before expanding to critical workflows.
  • Show your work: Explain how AI works, what data it uses, how accurate it is. Transparency builds credibility.
  • Be honest about limitations: Don't oversell. Tell users what AI can't do. Honesty prevents disappointment.
  • Deliver consistent quality: Users trust reliable systems. Monitor and maintain >95% success rate for core use cases.
  • Give users control: Let them disable AI, adjust settings, override decisions. Control increases comfort.
  • Measure trust explicitly: survey users quarterly on AI confidence and reliability perceptions
  • Celebrate wins: when AI helps users succeed, acknowledge it. Positive associations build trust.
Fairness Auditing Process
Ethical Considerations

Conduct regular audits to measure and improve fairness across demographic groups and use cases.

  • Quarterly for high-stakes AI systems (hiring, lending, criminal justice)
  • Before major model updates or feature launches
  • When required by regulations or ethical AI commitments
  • Define fairness criteria: Demographic parity, equalized odds, individual fairness, or other domain-specific measures
  • Collect representative test data: Include diverse demographics and edge cases. Aim for 500+ examples per protected group.
  • Measure disparities: Calculate performance metrics (accuracy, FPR, FNR) for each demographic. Document gaps >5%.
  • Investigate root causes: Is bias in training data, model architecture, or post-processing? Use feature importance analysis.
  • Implement mitigations: Rebalance training data, adjust decision thresholds per group, add fairness constraints, or redesign feature.
  • Involve external auditors or diverse internal stakeholders—insider bias blinds you to issues
  • Document audit results even if no bias found—shows due diligence to regulators and stakeholders
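To make the "measure disparities" step above concrete, here is a minimal sketch that computes false positive and false negative rates per group and checks the largest gap against the 5% threshold. It assumes binary labels (0/1) and records shaped as `(group, y_true, y_pred)`; both are assumptions for illustration.

```python
def group_rates(records):
    """Compute FPR and FNR per group from (group, y_true, y_pred) records."""
    stats = {}
    for group, y_true, y_pred in records:
        s = stats.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        # "t"/"f" = prediction correct or not; "p"/"n" = predicted class.
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        s[key] += 1
    rates = {}
    for group, s in stats.items():
        negatives = s["fp"] + s["tn"]
        positives = s["tp"] + s["fn"]
        rates[group] = {
            "fpr": s["fp"] / negatives if negatives else None,
            "fnr": s["fn"] / positives if positives else None,
        }
    return rates


def max_gap(rates, metric, threshold=0.05):
    """Largest between-group gap for a metric, and whether it exceeds threshold."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    if not values:
        return 0.0, False
    gap = max(values) - min(values)
    return gap, gap > threshold


# Hypothetical audit data: group B gets more false positives than group A.
records = (
    [("A", 0, 0)] * 9 + [("A", 0, 1)] * 1 + [("A", 1, 1)] * 10
    + [("B", 0, 0)] * 7 + [("B", 0, 1)] * 3 + [("B", 1, 1)] * 10
)
rates = group_rates(records)
gap, flagged = max_gap(rates, "fpr")
print(rates)
print(f"FPR gap: {gap:.2f}, document as disparity: {flagged}")
```

The same helper works for accuracy or any other per-group metric; FPR and FNR are shown because they are the ones that equalized odds compares.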
Algorithmic Discrimination Prevention
Ethical Considerations

Proactively design AI systems to prevent unfair treatment based on protected characteristics.

  • During initial AI feature design and requirements
  • When AI influences decisions affecting people's opportunities
  • Before expanding AI to new markets or demographics
  • Conduct pre-deployment risk assessment: Could this AI system discriminate? Against which groups? What's the potential harm?
  • Remove or mitigate problematic features: Avoid using protected attributes such as race or gender directly. Check for proxy features (zip code, name, address).
  • Ensure training data diversity: Balanced representation of protected groups. Oversample underrepresented groups if needed.
  • Apply fairness constraints during training: Use fairness-aware algorithms (e.g., Fairlearn library) that optimize for both accuracy and fairness
  • Test extensively pre-launch: Run fairness audits before release. Require sign-off from ethics/legal teams for high-stakes AI.
  • Include diverse voices in design—people from affected communities spot issues you miss
  • Document your fairness approach—shows good faith effort if challenged legally
Unintended Consequences Assessment
Ethical Considerations

Identify and plan for negative second-order effects of your AI system before they cause harm.

  • During AI product strategy and planning phases
  • Before launching AI features with broad societal impact
  • When expanding AI systems to new domains or scales
  • Map intended effects: What is AI designed to accomplish? Who benefits? How?
  • Brainstorm unintended effects: Who might be harmed? Could AI be misused? What behaviors might it incentivize? Could it be gamed?
  • Assess likelihood and severity: For each unintended consequence, rate probability and potential harm
  • Design mitigations: Rate limiting, access controls, monitoring for misuse, user education, or design changes
  • Monitor post-launch: Track metrics related to potential harms. Adjust mitigations based on observed behavior.
  • Use 'pre-mortem' technique: imagine AI caused major harm. Work backwards to identify how it happened.
  • Include diverse perspectives—different stakeholders see different risks
Stakeholder Impact Mapping
Ethical Considerations

Systematically identify everyone affected by your AI system and understand how it impacts them.

  • Early in AI product planning, before committing to approach
  • When making major changes to existing AI systems
  • If stakeholders raise concerns about AI impacts
  • Identify all stakeholders: End users, indirect users, employees, communities, competitors, regulators, society
  • Map impacts for each group: How does AI affect them? Benefits? Harms? Changes to work/life?
  • Prioritize by impact: Which groups experience the largest effects? Which effects are irreversible?
  • Engage stakeholders: Interview representatives from high-impact groups. Understand their concerns and priorities.
  • Incorporate feedback: Adjust AI design, policies, or safeguards based on stakeholder input. Document tradeoffs.
  • Don't forget indirect stakeholders—job displacement, ecosystem effects, societal norms
  • For high-impact systems, consider establishing ongoing stakeholder advisory boards
Value Alignment Testing
Ethical Considerations

Verify that AI system behaviors align with stated organizational values and ethical principles.

  • Before launching AI systems with significant autonomy
  • When AI makes decisions that reflect organizational values
  • As part of regular ethics audits
  • Articulate core values: What principles should guide AI behavior? Fairness, transparency, safety, respect, autonomy?
  • Translate to testable scenarios: Create specific situations where values might conflict. 'Should AI prioritize accuracy or fairness?'
  • Test AI behavior: Run scenarios through your AI system. Does it behave according to values? Where does it diverge?
  • Identify misalignments: Document cases where AI behavior conflicts with values. Understand root cause.
  • Adjust and retest: Modify training objectives, reward functions, constraints, or post-processing to improve alignment.
  • Values often conflict (privacy vs. personalization, safety vs. autonomy). Define priority hierarchy.
  • Test edge cases where tradeoffs are hardest—that's where value alignment matters most
Responsible AI Principles
Ethical Considerations

Establish and operationalize a set of ethical principles to guide AI development and deployment.

  • When starting an AI program or establishing AI governance
  • Before making major AI product decisions with ethical dimensions
  • When communicating AI approach to stakeholders or public
  • Define principles: Common ones include fairness, accountability, transparency, safety, privacy, human control. Adapt to your context.
  • Write clear definitions: What does each principle mean specifically? Include examples and counterexamples.
  • Create decision checklists: For each principle, list questions to ask during design/development. 'Does this AI treat all users fairly?'
  • Assign accountability: Who reviews AI products for principle adherence? Who has authority to block launches?
  • Integrate into processes: Add ethics review to design reviews, launch checklists, and post-launch monitoring.
  • Don't just copy Google/Microsoft principles—customize to your industry, users, and risks
  • Principles without enforcement are PR. Build real gatekeeping mechanisms.
Ethical AI Decision Framework
Ethical Considerations

Use a structured process to evaluate and resolve ethical dilemmas in AI product development.

  • When facing difficult tradeoffs between competing values
  • If team members disagree about ethics of an AI feature
  • Before launching controversial or high-stakes AI capabilities
  • Frame the dilemma: What are the competing values or interests? Who benefits? Who is harmed?
  • Gather perspectives: Consult diverse stakeholders. What do affected groups think? What do experts recommend?
  • Evaluate options: List possible approaches. For each, assess alignment with values, feasibility, risks, precedent set.
  • Make decision: Choose option with best balance of benefits, harms, and value alignment. Document rationale.
  • Plan monitoring: How will you know if decision was right? What metrics or signals indicate success or failure?
  • Use thought experiments: 'If this decision became public, could we defend it?' 'Would we want competitors to make the same choice?'
  • Sometimes best answer is 'don't build it'—not every AI application is worth the ethical costs
Impact Assessment for Stakeholders
Ethical Considerations

Conduct thorough impact assessments to understand social, economic, and ethical effects of AI systems.

  • Before launching AI systems with significant societal impact
  • When regulations require an impact assessment (for example, the EU AI Act)
  • For major updates to existing high-stakes AI systems
  • Define scope: What AI system? What deployment context? What time horizon? Geographic scope?
  • Assess impacts by category: Human rights, safety, fairness, economic, environmental, social cohesion
  • Quantify where possible: How many people affected? What magnitude of impact? What probability?
  • Identify mitigation measures: For each significant risk, document prevention and response strategies
  • Publish and update: Share assessment with stakeholders. Update post-launch based on observed impacts.
  • Use established frameworks: Canada's Algorithmic Impact Assessment (AIA), the UK ICO's DPIA guidance, or EU AI Act requirements provide templates
  • Impact assessments should be living documents—update quarterly as you learn from real-world deployment
AI Regulation Landscape
Legal & Compliance

Navigate the evolving landscape of AI-specific regulations across different jurisdictions.

  • When planning AI product strategy and roadmap
  • Before launching AI features in new geographic markets
  • Quarterly as regulatory landscape evolves rapidly
  • Map applicable regulations: EU AI Act (high-risk AI systems), US sector-specific rules, China AI rules, GDPR (automated decisions)
  • Classify your AI system: Unacceptable-risk (prohibited practices such as social scoring), high-risk (credit, employment, law enforcement), limited-risk (chatbots), minimal-risk (spam filters)
  • Identify compliance requirements: High-risk may require: conformity assessments, risk management, data governance, transparency, human oversight
  • Assess compliance gaps: What requirements don't you meet today? What's the timeline to comply?
  • Build compliance roadmap: Prioritize by regulation enforcement date and business impact. Assign owners.
  • Don't wait for final regulations—start building compliance capabilities now (documentation, testing, governance)
  • Work with legal counsel familiar with AI regulations—this is specialized and rapidly evolving
GDPR Compliance for AI
Legal & Compliance

Ensure AI systems comply with GDPR requirements for automated decision-making and data protection.

  • When processing data of EU residents
  • Before launching AI features that make automated decisions
  • During GDPR compliance audits
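  • Assess Article 22 applicability: Does the AI make decisions based solely on automated processing, without human involvement? Do those decisions have legal or similarly significant effects?
  • Establish a lawful basis: Consent is one of several lawful bases for using personal data in AI training. If you rely on it, get explicit opt-in consent. Can't use pre-checked boxes.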
  • Provide meaningful information: Tell users about AI logic, significance, and consequences in privacy policy
  • Enable human intervention: Allow users to contest AI decisions and request human review (Article 22(3))
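  • Support data subject rights: Implement access to meaningful information about the logic involved (often called the 'right to explanation'), the right to erasure ('right to be forgotten', including removal from training data), and the right to data portability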
  • Conduct Data Protection Impact Assessment (DPIA) for high-risk AI—required by GDPR Article 35
  • Work with Data Protection Officer (DPO) throughout AI development, not just at launch
Intellectual Property Considerations
Legal & Compliance

Navigate IP issues around training data, model ownership, and AI-generated outputs.

  • When sourcing training data from third-party sources
  • Before using pre-trained models or APIs commercially
  • When AI generates content that might infringe copyrights
  • Audit training data sources: Do you have rights to use this data for ML training? Check terms of service, licenses.
  • Review model licenses: If using pre-trained models (GPT, LLaMA, Stable Diffusion), check license terms. Commercial use allowed?
  • Assess output liability: If AI generates content similar to copyrighted works, who's liable? Implement detection for problematic outputs.
  • Protect your IP: Document novel ML architectures. Consider patents for truly innovative techniques (high bar).
  • Establish usage policies: Define acceptable use of your AI. Prohibit generating content that infringes IP.
  • For generative AI, add content filters that block outputs too similar to known copyrighted works
  • IP law for AI is unsettled—work with specialized IP counsel, don't rely on general advice
Liability & Insurance
Legal & Compliance

Understand and manage legal liability for AI system failures, errors, and harms.

  • When launching AI products with potential for significant user harm
  • Before deploying AI in regulated industries (healthcare, finance, automotive)
  • When structuring contracts with AI vendors or customers
  • Identify liability scenarios: What could go wrong? AI error causes financial loss, physical harm, discrimination, privacy breach?
  • Assess liability exposure: Who could sue? What are potential damages? What's the probability?
  • Review liability limitations: Do your Terms of Service limit liability? Are limitations enforceable in relevant jurisdictions?
  • Obtain insurance coverage: Professional liability, cyber liability, product liability. Ensure AI is explicitly covered.
  • Implement risk controls: The measures in this deck reduce likelihood of incidents and show reasonable care if sued.
  • Many insurance policies exclude AI-related claims by default—get explicit AI coverage
  • For B2B AI, negotiate liability caps in contracts. Unlimited liability for AI is too risky.
Audit Trail Requirements
Legal & Compliance

Implement comprehensive logging and audit trails for AI systems to support compliance and investigations.

  • For regulated AI systems (finance, healthcare, government)
  • When AI makes decisions that could be legally challenged
  • As required by regulations and audit standards (EU AI Act, SOC 2, ISO 27001)
  • Define logging scope: Model inputs, outputs, decisions, confidence scores, user interactions, model versions, data versions
  • Set retention policies: How long to keep logs? GDPR lets users request deletion of their personal data, but some regulations require multi-year retention.
  • Implement secure storage: Logs contain sensitive data. Encrypt at rest, restrict access, maintain immutability.
  • Enable traceability: Link each prediction to model version, training data version, user, timestamp. Must be able to reproduce.
  • Build audit reports: Create dashboards and reports for regulators, auditors, internal reviews. Test that you can answer common questions.
  • For high-stakes AI, log enough detail to fully reproduce any decision even years later
  • Balance retention needs with privacy—minimize PII in logs, anonymize where possible
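The traceability and immutability points above can be illustrated with a hash-chained, append-only log: each entry stores the hash of the previous one, so any later edit breaks the chain. This is a sketch using only the standard library; the record fields are assumptions, and a production system would also need access control and durable storage.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def _digest(record, prev_hash):
    """SHA-256 over a canonical JSON encoding of the record and previous hash."""
    body = json.dumps({"record": record, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(log, record):
    """Append a record, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({
        "record": record,
        "prev_hash": prev_hash,
        "hash": _digest(record, prev_hash),
    })


def verify(log):
    """Return True if no entry has been altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["record"], entry["prev_hash"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_record(audit_log, {
    "timestamp": "2024-01-15T10:30:00Z",  # hypothetical values throughout
    "model_version": "credit-v3.2",
    "data_version": "train-2023-12",
    "user_id": "u-123",                    # prefer a pseudonymous ID over raw PII
    "input_ref": "req-789",
    "output": "deny",
    "confidence": 0.91,
})
print("log intact:", verify(audit_log))
```

Storing a reference to the input (`input_ref`) rather than the input itself is one way to keep PII out of the log while still being able to look up the full request.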
Responsible AI Documentation
Legal & Compliance

Create and maintain comprehensive documentation of AI systems for transparency, compliance, and knowledge sharing.

  • Throughout AI development lifecycle
  • When required by regulations (model cards, data sheets)
  • Before launching AI systems to production
  • Create model cards: Document intended use, training data, performance metrics, limitations, fairness analysis, ethical considerations
  • Create data sheets: Document dataset origin, collection method, preprocessing, demographics, known biases, intended uses
  • Document system architecture: Data flows, model architecture, dependencies, infrastructure, update procedures
  • Write user-facing documentation: What AI does, how to use it, limitations, how to get help, how to provide feedback
  • Maintain living docs: Update documentation with each model version, system change, or new findings. Version control all docs.
  • Use templates: Google Model Cards, Microsoft datasheets, or EU AI Act technical documentation templates
  • Make documentation searchable and accessible—it's useless if people can't find it
Deployment Failure Prevention
Operational Risks

Minimize risk of failed deployments through testing, staging, and gradual rollouts.

  • Before deploying any AI model to production
  • When updating existing production AI systems
  • After experiencing deployment incidents
  • Test in staging: Deploy to production-like environment first. Validate model performance, latency, error rates.
  • Implement canary deployments: Roll out to 5% of traffic first. Monitor metrics for 24-48 hours before full rollout.
  • Define rollback criteria: If error rate >1% or latency >2x baseline, auto-rollback. Have one-click rollback mechanism.
  • Pre-deployment checklist: Verify dependencies, data schema, API compatibility, monitoring/alerting, documentation
  • Plan deployment windows: Deploy during low-traffic periods. Have team available to monitor and respond to issues.
  • Always deploy new models alongside old ones (shadow mode) for 24 hours before switching traffic
  • Keep last 3 model versions deployable—enables quick rollback if issues emerge days after deployment
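The rollback criteria above (error rate >1% or latency >2x baseline) can be expressed as a single check run against canary metrics. A minimal sketch; the function and parameter names are assumptions.

```python
def should_rollback(error_rate, latency_ms, baseline_latency_ms,
                    max_error_rate=0.01, max_latency_ratio=2.0):
    """True if canary metrics breach either rollback criterion."""
    too_many_errors = error_rate > max_error_rate
    too_slow = latency_ms > max_latency_ratio * baseline_latency_ms
    return too_many_errors or too_slow


# Hypothetical canary readings against a 120 ms baseline.
print(should_rollback(error_rate=0.004, latency_ms=150, baseline_latency_ms=120))
print(should_rollback(error_rate=0.004, latency_ms=260, baseline_latency_ms=120))
```

Wiring this check into the deploy pipeline is what turns the criteria into an auto-rollback rather than a judgment call made under pressure.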
Scaling AI Systems
Operational Risks

Plan for and manage challenges that emerge when scaling AI from prototype to high-volume production.

  • When traffic is expected to grow 10x or more
  • Before major product launches or marketing campaigns
  • When experiencing performance degradation under load
  • Benchmark capacity: Measure current throughput (requests/second), latency, and resource usage. Identify bottlenecks.
  • Project future load: Estimate peak traffic based on growth plans. Add 50% buffer for unexpected spikes.
  • Optimize performance: Model quantization, batching, caching, GPU optimization. Measure latency/cost tradeoffs.
  • Plan infrastructure: Auto-scaling policies, load balancing, multi-region deployment. Test failover scenarios.
  • Load test extensively: Simulate projected peak traffic and 2x that peak. Measure behavior under sustained load and traffic spikes.
  • Model inference cost often scales linearly with traffic—factor this into unit economics early
  • For generative AI, implement rate limiting per user to prevent abuse and control costs
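The load projection above (estimated peak plus a 50% buffer) is easy to turn into an instance count once per-instance throughput has been benchmarked. A back-of-the-envelope sketch; the numbers and function names are illustrative assumptions.

```python
import math


def required_capacity_rps(projected_peak_rps, buffer=0.5):
    """Projected peak requests/second plus a safety buffer (50% by default)."""
    return projected_peak_rps * (1 + buffer)


def instances_needed(projected_peak_rps, per_instance_rps, buffer=0.5):
    """Whole instances needed to serve the buffered peak."""
    return math.ceil(required_capacity_rps(projected_peak_rps, buffer) / per_instance_rps)


# Hypothetical: 1,000 req/s projected peak, 200 req/s benchmarked per instance.
print(required_capacity_rps(1000))
print(instances_needed(1000, 200))
```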
Cost Management for AI
Operational Risks

Monitor and optimize AI infrastructure costs to maintain healthy unit economics.

  • Before committing to AI features with significant compute costs
  • When monthly AI infrastructure costs exceed budget
  • During planning cycles and budget allocation
  • Calculate unit economics: Cost per prediction, cost per user, cost per month. Track over time.
  • Set cost budgets: Define acceptable costs for your business model. Alert if approaching limits.
  • Optimize inference: Smaller models, quantization, batching, caching, edge deployment. Measure accuracy vs. cost tradeoffs.
  • Optimize training: Spot instances, lower-precision training, smaller datasets, fewer experiments. Use MLOps tools to track experiment costs.
  • Monitor continuously: Daily/weekly cost dashboards by team, project, model. Identify cost spikes immediately.
  • For API-based AI, renegotiate pricing after hitting volume thresholds—vendors offer discounts at scale
  • Consider model distillation: train smaller, cheaper models that mimic larger expensive models
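For the unit-economics and budget steps above, a minimal sketch of the two calculations. The 80% alert fraction is an assumption chosen for illustration, as are the function names and sample figures.

```python
def cost_per_prediction(monthly_cost_usd, monthly_predictions):
    """Average infrastructure cost of one prediction."""
    if monthly_predictions <= 0:
        raise ValueError("monthly_predictions must be positive")
    return monthly_cost_usd / monthly_predictions


def budget_status(spend_to_date_usd, monthly_budget_usd, alert_fraction=0.8):
    """Classify spend against budget; alert_fraction is an assumed default."""
    used = spend_to_date_usd / monthly_budget_usd
    if used >= 1.0:
        return "at_or_over_budget"
    if used >= alert_fraction:
        return "approaching_limit"
    return "ok"


# Hypothetical month: $12,000 spend on 4 million predictions, $14,000 budget.
print(f"${cost_per_prediction(12_000, 4_000_000):.4f} per prediction")
print(budget_status(12_000, 14_000))
```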
Vendor Lock-In Mitigation
Operational Risks

Reduce dependency on single AI vendors to maintain flexibility and negotiating leverage.

  • When evaluating AI vendor relationships
  • Before committing to proprietary AI platforms or APIs
  • If current vendor relationship becomes problematic
  • Assess lock-in risk: How hard to switch vendors? Proprietary APIs? Custom integrations? Data in vendor-specific formats?
  • Design for portability: Use abstraction layers. Build interfaces that work with multiple providers. Avoid vendor-specific features initially.
  • Maintain multi-vendor capability: Test alternative providers quarterly. Keep POC integrations working.
  • Diversify strategically: Use different vendors for different use cases. Prevents single point of failure.
  • Negotiate protections: Include data portability, API stability, and exit assistance terms in contracts.
  • For LLM APIs, use libraries like LangChain or LlamaIndex that support multiple providers
  • Build vendor switching into roadmap every 12-18 months—forces you to maintain portability
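The "design for portability" step above usually comes down to one small interface that application code depends on, with a thin adapter per vendor behind it. The sketch below uses stand-in classes rather than any real SDK; `VendorAClient`, `VendorBClient`, and `summarize` are hypothetical names.

```python
from typing import Protocol


class TextProvider(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class VendorAClient:
    """Stand-in for an adapter around one vendor's SDK (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt}"


class VendorBClient:
    """Stand-in for an adapter around a second vendor's SDK (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt}"


PROVIDERS = {"vendor_a": VendorAClient, "vendor_b": VendorBClient}


def get_provider(name: str) -> TextProvider:
    """Select a provider by config value, so switching is a config change."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown provider: {name}") from None


def summarize(provider: TextProvider, text: str) -> str:
    """Application code: knows the interface, not the vendor."""
    return provider.complete(f"Summarize: {text}")


print(summarize(get_provider("vendor_a"), "quarterly report"))
print(summarize(get_provider("vendor_b"), "quarterly report"))
```

Running the same call against a second adapter on a schedule is one cheap way to keep the quarterly alternative-provider test honest.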
Technical Debt in AI Systems
Operational Risks

Identify and manage ML-specific technical debt that accumulates faster than traditional software.

  • During sprint planning and roadmap reviews
  • When velocity slows or bugs increase
  • Quarterly as part of technical health reviews
  • Audit ML-specific debt: Glue code, pipeline jungles, experimental codepaths, multiple versions of truth, undeclared dependencies
  • Quantify impact: How much does debt slow development? Increase bugs? Raise costs?
  • Prioritize by pain: Which debt causes most problems? Which is easiest to fix? Focus on high-impact, low-effort first.
  • Allocate capacity: Reserve 20-30% of engineering time for debt reduction. Track and celebrate progress.
  • Prevent accumulation: Code review standards, refactoring sprints, deprecation policies, monitoring for code smells
  • ML debt compounds faster than traditional software—it blocks experimentation and slows innovation
  • Create 'ML platform team' role to manage shared infrastructure and prevent debt at system level
Set Up Risk Monitoring Dashboard
Operational Risks

Create centralized visibility into AI system health, risks, and incidents across all dimensions.

  • When launching first AI features to production
  • If you lack visibility into AI system health
  • After incidents reveal monitoring gaps
  • Define key risk indicators: Model performance, data quality, cost, latency, error rates, user feedback, fairness metrics
  • Set thresholds and alerts: Green (healthy), yellow (investigate), red (immediate action). Define escalation procedures.
  • Build dashboard: Centralized view of all AI systems. Accessible to PMs, engineers, execs. Real-time + historical trends.
  • Automate data collection: Instrument production systems to emit metrics. Aggregate from multiple sources (logs, databases, APIs).
  • Review cadence: Daily check by on-call. Weekly review with team. Monthly review with leadership.
  • Start simple: track 5-10 most critical metrics. Expand over time as you learn what matters.
  • Include leading indicators (input data quality) not just lagging indicators (model accuracy)
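The green/yellow/red thresholds above can be encoded in a few lines. The sketch below assumes illustrative metric names and cut-offs (`p95_latency_ms`, `accuracy`, and their values are made up for the example); replace them with your own indicators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    yellow: float  # at or beyond this value: investigate
    red: float     # at or beyond this value: immediate action
    higher_is_worse: bool = True


def status(value: float, t: Threshold) -> str:
    # Flip the sign for metrics where lower is worse (e.g. accuracy),
    # so a single comparison handles both directions.
    sign = 1 if t.higher_is_worse else -1
    if sign * value >= sign * t.red:
        return "red"
    if sign * value >= sign * t.yellow:
        return "yellow"
    return "green"


# Hypothetical thresholds for two indicators.
THRESHOLDS = {
    "p95_latency_ms": Threshold(yellow=500, red=1000),
    "accuracy": Threshold(yellow=0.85, red=0.80, higher_is_worse=False),
}
```

For example, `status(0.83, THRESHOLDS["accuracy"])` evaluates to `"yellow"`, which under the scheme above means investigate rather than escalate.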
Incident Response for AI
Operational Risks

Establish playbooks for responding to AI system failures, quality issues, or safety incidents.

  • Before launching AI features to production
  • After experiencing your first AI incident
  • When updating incident response procedures
  • Define incident types: Model failure, data corruption, harmful output, fairness violation, privacy breach, cost spike, outage
  • Set severity levels: P0 (user safety, major outage), P1 (significant degradation), P2 (minor issues). Define response SLAs.
  • Create response playbooks: For each incident type, document detection, initial response, investigation, mitigation, communication
  • Assign roles: Incident commander, communications lead, technical lead. Train team on roles and procedures.
  • Conduct post-mortems: After incidents, document timeline, root cause, action items. Share learnings broadly.
  • For AI safety incidents, communicate proactively to users even before full resolution—transparency builds trust
  • Practice incident response with fire drills quarterly—muscle memory matters during real incidents
Prevent Underfitting
Model Risks

Ensure your model is complex enough to capture important patterns and deliver useful predictions.

  • When baseline models show poor performance on training and validation data
  • Before giving up on an AI approach due to low accuracy
  • When stakeholders question if AI adds value over simple rules
  • Diagnose underfitting: If both training and validation accuracy are low (e.g., both ~65% for binary classification), you're underfitting
  • Increase model complexity: Add more layers, more parameters, more features. Start with 2-3x current capacity.
  • Improve features: Add more informative input features. Feature engineering often matters more than model architecture.
  • Train longer: Increase epochs/iterations. Ensure model has converged (training loss plateaus).
  • Try different architectures: If linear model underperforms, try decision trees. If simple NN underperforms, try deeper networks.
  • Check training loss first: if it's not decreasing, you have an optimization or data problem to fix before treating the issue as underfitting
  • For structured data, gradient boosting (XGBoost, LightGBM) often fixes underfitting better than neural networks
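The diagnosis step can be written down as a simple rule of thumb. This is a rough sketch: the `target` and `max_gap` defaults are assumptions chosen for illustration, not standard values, and `diagnose_fit` is a hypothetical helper.

```python
def diagnose_fit(train_acc: float, val_acc: float,
                 target: float = 0.85, max_gap: float = 0.05) -> str:
    """Classify a training run from its train/validation accuracy."""
    gap = train_acc - val_acc
    # Both scores low and close together: the model cannot even fit
    # the training data, which is the underfitting signature.
    if train_acc < target and val_acc < target and gap <= max_gap:
        return "underfitting"
    # Training score well above validation: the opposite problem.
    if gap > max_gap:
        return "overfitting"
    return "ok"
```

With both accuracies at 65%, as in the example above, the function returns `"underfitting"`.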
User Control Over AI
User Safety & Trust

Give users meaningful control over AI behavior, personalization, and decision-making.

  • When AI personalizes experiences or makes recommendations
  • If users express concerns about AI control or autonomy
  • To build trust and meet transparency requirements
  • Identify control points: What aspects of AI can users adjust? Personalization level, data usage, automation degree, feature on/off
  • Design controls: Simple toggles for most users, advanced settings for power users. Provide clear explanations of each control.
  • Set sensible defaults: Most users won't change settings. Default to safe, balanced options.
  • Make controls discoverable: Surface key controls in main settings. Don't bury in deep menus.
  • Provide override mechanisms: Let users undo AI actions, manually adjust results, revert to non-AI experience.
  • Test controls with non-technical users—what's obvious to you may be confusing to them
  • For sensitive use cases, default to 'AI off' and make users opt in to automation

Execution Cards

135 tactical how-to frameworks for shipping and operating.

Write AI Product Specs
Requirements & Specs

Create comprehensive product requirements documents tailored for AI features with probabilistic behaviors.

  • When scoping a new AI feature before development begins
  • When communicating requirements to ML engineers and designers
  • Before estimating timelines or resources for AI projects
  • Define the user problem: What task does AI solve? What's the current painful alternative? Include 3-5 specific user scenarios.
  • Specify success criteria: Model performance thresholds (accuracy, precision, recall), latency limits, cost constraints, user satisfaction targets.
  • Document failure modes: What happens when model is wrong? When it's unsure? When it's slow? Define graceful degradation paths.
  • List edge cases explicitly: Enumerate at least 10 scenarios where AI might fail. How should system behave for each?
  • Define data requirements: Training data volume, labeling needs, refresh frequency, privacy constraints, retention policies.
  • Map dependencies: APIs, infrastructure, monitoring tools, human-in-the-loop processes, fallback systems.
  • Include example inputs and expected outputs for 5 typical cases and 5 edge cases
  • Specify what's in scope for MVP vs. future iterations—prevents scope creep
Define AI Success Metrics
Requirements & Specs

Establish clear, measurable criteria for what "good enough" means for your AI feature.

  • Before starting AI development or model training
  • When aligning stakeholders on AI launch criteria
  • When evaluating if your AI feature is ready to ship
  • Define user-facing metrics: Task completion rate, user satisfaction, time saved
  • Define model metrics: Accuracy, precision, recall, F1 score (based on your use case)
  • Define system metrics: Latency, cost per prediction, uptime
  • Set minimum bars: What's the minimum acceptable level for each metric to ship?
  • Weight by importance: Rank metrics by priority, with weights that sum to 100% (e.g., accuracy 40%, latency 30%, cost 20%, uptime 10%)
  • Always include latency—a slow model frustrates users even if accurate
  • Get ML engineers to validate that metrics are achievable
Write AI Acceptance Criteria
Requirements & Specs

Define testable conditions that AI features must meet before marking stories complete or shipping to users.

  • When writing user stories for AI features during sprint planning
  • Before QA begins testing AI functionality
  • When determining if an AI feature is ready for launch
  • Functional criteria: Define what the feature does. Example: 'Given user query, system returns relevant results in <1s'
  • Performance criteria: Set minimum bars. Example: 'Accuracy >85% on validation set, precision >90% for top 3 results'
  • Edge case handling: Test boundaries. Example: 'When confidence <70%, show "Not sure" message instead of prediction'
  • UX criteria: User experience standards. Example: 'Loading indicator appears within 100ms, shows model confidence level'
  • Monitoring criteria: Observability requirements. Example: 'Log all predictions with confidence scores, latency, and user feedback'
  • Use 'Given/When/Then' format for clarity: 'Given ambiguous input, when model confidence <70%, then show 3 options instead of 1'
  • Include negative test cases: What should NOT happen (e.g., 'System never returns offensive content')
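Acceptance criteria like these translate directly into automated tests. The sketch below assumes a hypothetical presentation function `render` and uses the 70% confidence floor from the edge-case example; the test functions follow pytest naming but run as plain Python.

```python
CONFIDENCE_FLOOR = 0.70  # threshold taken from the example criterion


def render(prediction: str, confidence: float) -> str:
    """Hypothetical function under test: what the user actually sees."""
    if confidence < CONFIDENCE_FLOOR:
        return "Not sure"
    return prediction


def test_low_confidence_shows_not_sure():
    # Given a prediction, when confidence is below 70%,
    # then the user sees "Not sure" instead of the prediction.
    assert render("spam", 0.55) == "Not sure"


def test_confident_prediction_is_shown():
    # Given a prediction, when confidence is at or above 70%,
    # then the prediction is shown unchanged.
    assert render("spam", 0.92) == "spam"
```

Writing the criterion as a test forces a decision on the boundary: here a confidence of exactly 70% shows the prediction.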
Document Edge Cases & Failure Modes
Requirements & Specs

Systematically identify and specify how AI systems should behave when encountering unusual inputs or model failures.

  • During AI product spec writing, before development starts
  • When designing error handling and fallback strategies
  • After discovering edge cases in testing or production
  • Brainstorm input edge cases: Empty inputs, extremely long inputs, non-English text, special characters, adversarial inputs, ambiguous requests
  • Identify model failure modes: Low confidence predictions, contradictory outputs, hallucinations, timeout/latency spikes, model unavailable
  • Define system behaviors: For each edge case, specify exact system response—show error message? Fallback to rules? Route to human?
  • Document user communication: What does user see? Example: 'I'm not confident about this answer' vs. hiding uncertainty
  • Prioritize edge cases: Mark which must be handled at launch (P0) vs. can be addressed later (P1, P2)
  • Aim to document 20-30 edge cases minimum—real AI systems encounter dozens of failure modes
  • Test your edge case handling with red teaming before launch
Write User Stories for AI Features
Requirements & Specs

Craft user stories that capture AI-specific requirements, uncertainty, and iterative learning needs.

  • During sprint planning for AI development
  • When breaking down large AI epics into deliverable increments
  • When communicating AI requirements to cross-functional teams
  • Start with user value: 'As a [user], I want [AI capability] so that [benefit]'. Focus on outcome, not technology.
  • Add AI-specific details: Include model type, accuracy target, latency requirement, data source, fallback behavior
  • Split into layers: Story 1: MVP with simple model. Story 2: Improve accuracy. Story 3: Add personalization. Build incrementally.
  • Include training stories: 'As an ML engineer, I need labeled data to train the classification model' counts as a story
  • Add monitoring stories: 'As a PM, I want to see model accuracy in production to know when to retrain'
  • Use this format: 'As a [user], I want [AI feature] with [performance level] so that [outcome]'
  • Always pair feature stories with monitoring/evaluation stories in the same sprint
Specify Model Constraints & Requirements
Requirements & Specs

Define technical constraints and non-functional requirements that limit model selection and architecture choices.

  • Before ML engineers begin model selection or architecture design
  • When negotiating tradeoffs between accuracy, latency, and cost
  • When evaluating whether to use pre-trained vs. custom models
  • Latency constraints: Define max acceptable response time. Example: 'P95 latency <500ms' or 'Batch processing <1 hour'
  • Cost constraints: Set budget per prediction or monthly inference spend. Example: '$0.001 per prediction max' or '$5K/month inference budget'
  • Data constraints: Privacy requirements, data location restrictions, retention limits. Example: 'No PII can leave EU data centers'
  • Infrastructure constraints: On-premise vs. cloud, GPU availability, scaling requirements. Example: 'Must run on CPU-only instances'
  • Model size constraints: Deployment target limits. Example: 'Model must fit in 100MB for mobile deployment'
  • Document 'must-have' vs. 'nice-to-have' constraints—helps ML engineers make tradeoff decisions
  • Re-evaluate constraints quarterly—technology improves, costs drop, requirements change
Create Model Evaluation Rubric
Requirements & Specs

Build a standardized scorecard for comparing model candidates and making go/no-go decisions.

  • When evaluating multiple model approaches or vendors
  • Before final model selection for production deployment
  • When comparing fine-tuned models against baselines
  • List evaluation dimensions: Accuracy, latency, cost, maintainability, explainability, fairness, ease of deployment
  • Define scoring criteria: For each dimension, create 1-5 scale. Example: Accuracy: 1=<70%, 2=70-80%, 3=80-85%, 4=85-90%, 5=>90%
  • Assign weights: Total should equal 100%. Example: Accuracy 35%, Latency 25%, Cost 20%, Maintainability 15%, Explainability 5%
  • Evaluate candidates: Score each model on every dimension. Calculate weighted total score.
  • Set minimum bars: Define deal-breakers. Example: 'Any score <3 on Accuracy is automatic rejection regardless of other scores'
  • Include non-technical stakeholders in weighting exercise—reveals business priorities
  • Document evaluation in decision log for future reference when explaining model choices
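The scorecard arithmetic is easy to get wrong by hand, so here is a small sketch of it. The weights and the accuracy deal-breaker come from the examples above; the function name `rubric_score` and the candidate scores in the usage note are hypothetical.

```python
from typing import Dict, Optional

# Weights from the example above; they must total 100%.
WEIGHTS = {
    "accuracy": 0.35,
    "latency": 0.25,
    "cost": 0.20,
    "maintainability": 0.15,
    "explainability": 0.05,
}
# Deal-breakers: a score below the floor rejects the candidate outright.
MIN_BARS = {"accuracy": 3}


def rubric_score(scores: Dict[str, int]) -> Optional[float]:
    """Return the weighted 1-5 score, or None if a deal-breaker is hit."""
    for dimension, floor in MIN_BARS.items():
        if scores[dimension] < floor:
            return None
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
```

A candidate scoring 4, 3, 5, 3, 2 across the five dimensions gets 1.4 + 0.75 + 1.0 + 0.45 + 0.1, roughly 3.7; the same candidate with an accuracy score of 2 is rejected regardless of the rest.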
Define Human-in-the-Loop Requirements
Requirements & Specs

Specify when and how humans should review, override, or augment AI decisions.

  • For high-stakes AI decisions (hiring, lending, medical, legal)
  • When model accuracy alone is insufficient for user trust
  • When designing content moderation or fraud detection systems
  • Identify human intervention triggers: When does AI route to human? Low confidence (<70%)? Specific content types? Random sampling?
  • Define review workflows: Who reviews? What information do they see? What actions can they take? What's the SLA?
  • Specify override rules: Can humans override AI? Is override logged? Does it retrain the model?
  • Design feedback loops: How do human decisions improve the model? Label correction? Active learning prioritization?
  • Plan for scale: What happens when review volume exceeds capacity? Which cases get priority?
  • Start with 100% human review at launch, then gradually decrease as model improves and you build trust
  • Track human-AI agreement rates: if humans override more than 20% of AI decisions, your model needs improvement
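A minimal sketch of the two mechanics above: routing low-confidence cases to a reviewer and tracking how often reviewers disagree with the model. The 70% trigger comes from the example; `route` and `override_rate` are hypothetical helpers, and the review log is assumed to be a list of (AI label, human label) pairs.

```python
from typing import List, Tuple


def route(confidence: float, threshold: float = 0.70) -> str:
    # Low-confidence predictions go to a person; the rest are automated.
    return "human_review" if confidence < threshold else "auto"


def override_rate(reviews: List[Tuple[str, str]]) -> float:
    """Share of reviewed cases where the human changed the AI's label."""
    if not reviews:
        return 0.0
    overridden = sum(1 for ai_label, human_label in reviews
                     if ai_label != human_label)
    return overridden / len(reviews)
```

If `override_rate` stays above 0.20 over a meaningful sample, that is the signal described above that the model needs work before review volume is reduced.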
Plan Data Collection Strategy
Data Strategy

Design systematic approach to gathering, labeling, and maintaining high-quality training data.

  • Before starting AI development when you lack sufficient data
  • When planning to improve model performance through more data
  • When designing data pipelines for continuous learning
  • Quantify data needs: Calculate required examples per class/scenario. Start with 1K minimum, 10K target, 100K for production scale.
  • Identify data sources: Internal logs, user-generated content, purchased datasets, web scraping, partnerships, synthetic generation
  • Plan collection timeline: Map data acquisition to development phases. Example: 'MVP needs 5K labeled examples by Month 2'
  • Design labeling workflow: Who labels? Internal team, contractors, crowdsourcing? What's the quality bar? How much does it cost?
  • Build validation process: How do you verify label quality? Inter-rater agreement? Expert review? Automated checks?
  • Set refresh cadence: How often do you collect new data? Daily, weekly, monthly? What triggers data updates?
  • Budget $0.10-$5 per label depending on complexity—data labeling often costs more than development
  • Prioritize data diversity over volume—1K diverse examples beats 10K similar ones
Establish Data Labeling Pipeline
Data Strategy

Build efficient, quality-controlled workflows for annotating training data at scale.

  • When you have raw data but need labeled examples for supervised learning
  • When scaling from prototype to production-quality models
  • When managing ongoing labeling for model improvements
  • Choose labeling approach: In-house experts (high quality, slow, expensive), contractors (medium quality, faster, moderate cost), crowdsourcing (variable quality, fastest, cheap)
  • Design labeling interface: Simple, clear instructions with examples. Include 'unsure' option. Show previous labels for context.
  • Implement quality controls: Gold standard test sets (10-20% of labels), measure inter-rater agreement (aim for >80%), require 2-3 labelers per example for disagreement detection
  • Set up labeling workflow: Task assignment, review queue, dispute resolution process, label correction mechanism
  • Track metrics: Labels per hour, cost per label, label quality score, labeler agreement rates
  • Iterate on guidelines: Update labeling instructions weekly based on common errors and edge cases
  • Start with small batch (100 examples), measure quality, adjust process before scaling to thousands
  • Pay labelers fairly—quality correlates with compensation and training
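For the quality-control step, one simple agreement measure is the share of examples on which every labeler gave the same label. This is only one possible definition (chance-corrected statistics such as Cohen's kappa are stricter), and the helper names are hypothetical.

```python
from typing import List


def agreement_rate(items: List[List[str]]) -> float:
    """Fraction of examples where all labelers chose the same label."""
    unanimous = sum(1 for labels in items if len(set(labels)) == 1)
    return unanimous / len(items)


def disputed(items: List[List[str]]) -> List[int]:
    # Indices of examples with any disagreement; these go to the
    # dispute-resolution queue.
    return [i for i, labels in enumerate(items) if len(set(labels)) > 1]
```

Reading the disputed examples is usually more informative than the rate itself: they show which parts of the guidelines are unclear.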
Design Active Learning Workflow
Data Strategy

Implement smart sampling strategies that prioritize labeling the most valuable training examples.

  • When you have large amounts of unlabeled data but limited labeling budget
  • When trying to improve model performance efficiently
  • When deploying models that learn from production data
  • Set up uncertainty sampling: Deploy model, capture predictions with confidence scores. Queue low-confidence examples (<70%) for human review.
  • Implement diversity sampling: Don't just label uncertain examples—also sample to cover edge cases and rare scenarios. Use clustering.
  • Create review interface: Show model prediction + confidence, allow labeler to correct or confirm, capture reasoning for corrections
  • Feed labels back: Retrain model weekly or monthly with new labels. Measure if accuracy improves.
  • Balance exploration vs. exploitation: 80% uncertain examples (exploitation), 20% random samples (exploration for coverage)
  • Start active learning after you have 1K baseline labels—need initial model for uncertainty estimates
  • Track label efficiency: Are you getting accuracy gains per 100 labels? If not, switch sampling strategy
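The 80/20 split between uncertain and random examples can be sketched as below. It assumes the pool is a list of (example id, model confidence) pairs and ranks by lowest confidence rather than applying a fixed 70% cut-off; `select_for_labeling` is a hypothetical name.

```python
import random
from typing import List, Tuple


def select_for_labeling(pool: List[Tuple[int, float]], budget: int,
                        uncertain_share: float = 0.8,
                        seed: int = 0) -> List[int]:
    """Pick example ids to label: mostly uncertain, the rest random."""
    n_uncertain = round(budget * uncertain_share)
    by_confidence = sorted(pool, key=lambda item: item[1])
    # Uncertainty sampling: take the least confident predictions first.
    uncertain = [ex_id for ex_id, _ in by_confidence[:n_uncertain]]
    # Random sampling from everything else keeps coverage of cases the
    # model is confidently (and possibly wrongly) handling.
    remaining = [ex_id for ex_id, _ in by_confidence[n_uncertain:]]
    n_random = min(budget - len(uncertain), len(remaining))
    rng = random.Random(seed)
    return uncertain + rng.sample(remaining, n_random)
```

The fixed seed makes a labeling batch reproducible, which helps when comparing sampling strategies on label efficiency.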
Implement Data Versioning
Data Strategy

Track and manage different versions of training datasets for reproducibility and model comparison.

  • When you start training models and need to track which data produced which results
  • When managing multiple model experiments in parallel
  • When debugging model performance regressions
  • Choose versioning tool: DVC (Data Version Control), LakeFS, Pachyderm, or simple S3 buckets with timestamps
  • Define versioning strategy: Version on data changes (new labels), schema changes (new features), or time-based (monthly snapshots)
  • Tag datasets: Use semantic versioning (v1.0, v1.1) or timestamps (2025-01-15). Link each model to its training data version.
  • Document dataset changes: Changelog for each version: what changed, why, how many examples added/removed/modified
  • Set up access controls: Who can create new versions? Who can modify existing ones? Ensure test/validation sets never leak.
  • Pin production models to specific data versions—makes rollbacks and debugging much easier
  • Store data samples in version control (100 examples) so teammates can inspect without downloading full dataset
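If a dedicated tool is more than you need at first, a content hash gives you a dataset version that changes whenever the data does. The sketch below assumes the dataset fits in memory as JSON-serializable records; `dataset_fingerprint` and `register_model` are hypothetical helpers, not part of DVC or any other tool.

```python
import hashlib
import json
from typing import Dict, List


def dataset_fingerprint(records: List[dict]) -> str:
    # Serialize deterministically so identical data always hashes the same.
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def register_model(registry: Dict[str, str], model_name: str,
                   records: List[dict]) -> str:
    """Pin a model to the exact data version it was trained on."""
    registry[model_name] = dataset_fingerprint(records)
    return registry[model_name]
```

Because the fingerprint depends only on content, relabeling a single example yields a new version, which is what makes regressions traceable to data changes.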
Generate Synthetic Training Data
Data Strategy

Create artificial training examples to augment real data, especially for rare cases or privacy-sensitive scenarios.

  • When you lack sufficient real examples for certain categories
  • When dealing with rare events (fraud, medical conditions, edge cases)
  • When privacy regulations limit access to real user data
  • Choose generation method: Rule-based (templates with variations), generative models (GANs, VAEs), LLMs (for text), data augmentation (transforms)
  • Start with augmentation: For images/text, apply transforms to real data—rotate, crop, paraphrase. Easiest way to 10x your dataset.
  • Validate realism: Can humans distinguish synthetic from real examples? If yes, synthetic data is too artificial.
  • Test model performance: Train on real data only, then real + synthetic. Does synthetic data improve validation accuracy? If not, discard it.
  • Balance synthetic vs. real: Keep real data as majority (70-90%), use synthetic as supplement (10-30%) for rare cases
  • Synthetic data works best for augmenting rare classes—don't use it to replace real data collection
  • For LLM-generated data, use diverse prompts and validate that examples are factually correct
Implement Data Privacy Controls
Data Strategy

Build safeguards to protect user privacy throughout data collection, training, and inference.

  • When handling PII (personally identifiable information) or sensitive data
  • Before launching in regulated industries (healthcare, finance, education)
  • When users express privacy concerns about AI features
  • Classify data sensitivity: Public, internal, confidential, PII, PHI. Apply appropriate controls to each tier.
  • Implement data minimization: Collect only data necessary for model training. Avoid collecting PII when possible.
  • Anonymize training data: Remove names, emails, IDs. Use tokenization, pseudonymization, or differential privacy techniques.
  • Set retention limits: Define how long you keep training data. Delete after 1-2 years unless needed for compliance.
  • Control access: Role-based access to training data. Log all data access. Require data handling training for team members.
  • Plan for deletion: Users can request data deletion (GDPR, CCPA). Have process to remove user data from training sets.
  • Use secure enclaves or federated learning for ultra-sensitive data—model trains without centralizing raw data
  • Document all privacy measures in your AI product specs—legal and compliance teams need this
Training Data Quality Assurance
Data Strategy

Systematically detect and fix data quality issues that degrade model performance.

  • Before training models on new datasets
  • When model performance is worse than expected
  • When setting up ongoing data quality monitoring
  • Check label accuracy: Sample 200 random examples, manually verify labels. Aim for >95% correct. If lower, retrain labelers or fix guidelines.
  • Detect label noise: Find examples where multiple labelers disagree. Review and correct. High disagreement indicates unclear guidelines.
  • Assess class balance: Count examples per category. If any class is <5% of total, collect more examples or use class weighting.
  • Find duplicates: Use hashing to detect exact duplicates and fuzzy matching to detect near-duplicates. Remove them to prevent train/test leakage.
  • Validate feature quality: Check for missing values, outliers, incorrect data types. Implement feature validation pipeline.
  • Test representative coverage: Does training data cover all scenarios users will encounter in production? Identify gaps.
  • Automate quality checks—run on every new data batch before adding to training set
  • Track data quality metrics over time—catch degradation early
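Two of these checks, class balance and duplicate detection, are easy to automate with the standard library. The sketch below uses the 5% class threshold from above and catches only duplicates that match after lowercasing and whitespace normalization (a normalization rule assumed here for illustration); true near-duplicate detection needs fuzzy matching.

```python
import hashlib
from collections import Counter
from typing import List, Tuple


def rare_classes(labels: List[str], min_share: float = 0.05) -> List[str]:
    """Classes making up less than min_share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(c for c, n in counts.items() if n / total < min_share)


def duplicate_pairs(texts: List[str]) -> List[Tuple[int, int]]:
    """(first_index, duplicate_index) pairs after simple normalization."""
    seen = {}
    pairs = []
    for i, text in enumerate(texts):
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            pairs.append((seen[key], i))
        else:
            seen[key] = i
    return pairs
```

Both functions are cheap enough to run on every new batch before it is merged into the training set.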
Design Data Refresh Strategy
Data Strategy

Plan how and when to update training data to keep models accurate as the world changes.

  • When deploying models that will run for months or years
  • When user behavior or content patterns evolve over time
  • When setting up MLOps processes for production systems
  • Assess data freshness needs: How fast does your domain change? E-commerce trends change weekly, medical knowledge changes yearly.
  • Set refresh cadence: Daily (real-time personalization), weekly (content moderation), monthly (fraud detection), quarterly (general features)
  • Define refresh triggers: Time-based (every 30 days), performance-based (accuracy drops 5%), event-based (product launch, seasonality)
  • Design collection pipeline: Automated data pulls from production, scheduled labeling workflows, incremental dataset updates
  • Test before deployment: Always validate new data quality before retraining models. Check for distribution shifts or anomalies.
  • Start with monthly refreshes, then adjust based on monitoring—over-refreshing wastes resources
  • Keep historical data—you may need to retrain on older distributions if new data is poisoned
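The time-based and performance-based triggers can be combined into one check. This sketch reads "accuracy drops 5%" as a drop of 5 percentage points, which is an assumption, and `should_refresh` is a hypothetical helper; event-based triggers are left out because they depend on your calendar.

```python
from datetime import date
from typing import List


def should_refresh(last_refresh: date, today: date,
                   baseline_acc: float, current_acc: float,
                   max_age_days: int = 30,
                   max_drop: float = 0.05) -> List[str]:
    """Return the list of triggers that fired (empty means no refresh)."""
    reasons = []
    if (today - last_refresh).days >= max_age_days:
        reasons.append("time")
    if baseline_acc - current_acc >= max_drop:
        reasons.append("performance")
    return reasons
```

Returning the reasons rather than a bare boolean makes it easier to see later whether refreshes are driven by the calendar or by real degradation.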
Plan Model Development Sprint
Model Development

Structure two-week sprints that balance model experimentation with product progress.

  • When starting AI development with ML engineering teams
  • When adapting agile processes for machine learning work
  • When stakeholders need visibility into AI development progress
  • Set sprint goal: Focus on outcome, not model type. Example: 'Achieve 85% accuracy on validation set' not 'Try neural network'
  • Allocate experiment budget: Reserve 60% sprint capacity for model experiments, 20% for data work, 20% for infrastructure/tooling
  • Plan experiments: List 3-5 experiments to try. Example: 'Test XGBoost, fine-tune BERT, try ensemble'. Prioritize by expected impact.
  • Define success criteria: What metrics determine if an experiment worked? Be specific: 'Accuracy >85% AND latency <500ms'
  • Schedule demo: End each sprint with model performance demo—show metrics, example predictions, learned insights
  • Don't commit to specific models—commit to achieving performance targets. ML is iterative.
  • Track 'negative results' as progress—knowing what doesn't work has value
Track Model Experiments
Model Development

Log and compare model experiments to identify what works and maintain reproducibility.

  • As soon as you start training models—don't wait until you have many experiments
  • When comparing multiple approaches or hyperparameter configurations
  • When you need to reproduce results or explain model choices to stakeholders
  • Choose experiment tracking tool: MLflow, Weights & Biases, Neptune.ai, or simple spreadsheet for small projects
  • Log experiment metadata: Model type, hyperparameters, training data version, features used, training duration, cost
  • Track key metrics: Training accuracy, validation accuracy, test accuracy, precision, recall, F1, latency, model size
  • Document insights: What worked? What failed? What surprised you? Store in experiment notes or shared doc.
  • Compare experiments: Sort by validation accuracy. Identify best performers. Look for patterns—what do top models have in common?
  • Log experiments automatically in training scripts—manual logging leads to gaps
  • Name experiments descriptively: 'bert-base-lr-1e-5-batch-32' not 'experiment_17'
Establish Model Baselines
Model Development

Create simple benchmark models to measure if sophisticated ML approaches actually add value.

  • At the start of every AI project, before building complex models
  • When justifying investment in ML vs. simpler approaches
  • When evaluating if model improvements are meaningful
  • Create majority class baseline: Always predict the most common category. Example: If 80% of emails are not spam, baseline accuracy is 80%.
  • Build rule-based baseline: Use domain knowledge to create if-then rules. Example: Flag transaction as fraud if amount >$1,000 + new account.
  • Try simple ML baseline: Logistic regression or decision tree with basic features. Takes hours to implement, not weeks.
  • Measure baseline performance: Track same metrics you'll use for production model. Document baseline results.
  • Set improvement target: Production model must beat baseline by meaningful margin. Example: '>10 percentage points better accuracy'
  • Many projects discover that simple baselines are 'good enough' and cancel complex ML work—that's a win
  • Always compare new models to baseline, not just to previous model version
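The majority-class baseline takes only a few lines, which is part of why it is worth computing first. The sketch below uses the 10-percentage-point margin from the example target; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def majority_baseline(train_labels: List[str],
                      eval_labels: List[str]) -> Tuple[str, float]:
    """Always predict the most common training label; return its accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in eval_labels if y == majority)
    return majority, correct / len(eval_labels)


def beats_baseline(model_acc: float, baseline_acc: float,
                   margin: float = 0.10) -> bool:
    # The model must clear the baseline by a meaningful margin,
    # not just edge past it.
    return model_acc - baseline_acc > margin
```

On a dataset that is 80% "not spam", the baseline scores 80% accuracy, so a model at 85% has not cleared the margin while one at 92% has.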
Evaluate Model Performance
Model Development

Assess model quality across multiple dimensions beyond simple accuracy scores.

  • After training models but before deployment decisions
  • When comparing model candidates for production
  • When debugging why production performance differs from development
  • Test on held-out data: Evaluate on data the model has never seen. Never use test set during training or hyperparameter tuning.
  • Measure comprehensive metrics: Accuracy, precision, recall, F1, AUC-ROC. Choose primary metric based on business impact (false positives vs. false negatives).
  • Analyze per-class performance: A confusion matrix reveals which categories the model struggles with. Weak performance may be acceptable on rare classes.
  • Test on edge cases: Create separate test set of difficult examples. Example: Ambiguous queries, adversarial inputs, edge case scenarios.
  • Measure latency and cost: Time each prediction. Calculate cost per 1,000 predictions. Ensure within budget.
  • Review error cases: Manually inspect 50 wrong predictions. Categorize errors—helps prioritize improvements.
  • For production decisions, p95 and p99 values (e.g., latency) matter more than averages
  • Test demographic fairness—measure model performance across user segments (gender, age, geography)
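The core metrics above follow directly from confusion-matrix counts, and tail latency from a sorted list of timings. This is a minimal sketch using the standard formulas and a nearest-rank percentile (one of several percentile conventions); the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def percentile(values: Sequence[float], pct: int) -> float:
    # Nearest-rank method: the smallest value such that at least
    # pct percent of observations are less than or equal to it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With 8 true positives, 2 false positives, and 8 false negatives, precision is 0.8 but recall is only 0.5, a gap that a single accuracy number would hide.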
Run Model Iteration Loops
Model Development

Systematically improve model performance through structured iteration cycles.

  • When initial model meets baseline but not production requirements
  • When you have time/budget for multiple improvement cycles
  • When deciding where to invest effort for maximum gain
  • Analyze failure modes: Review model errors. Group into categories—data quality issues, missing features, model limitations, edge cases.
  • Prioritize improvements: Estimate impact and effort for each fix. Focus on high-impact, low-effort wins first.
  • Run targeted experiments: Try one major change per iteration. Example: Add new feature, collect more data for weak class, try different architecture.
  • Measure impact: Compare new model to previous best. Did accuracy improve? By how much? On which categories?
  • Iterate or ship: If model meets launch criteria, ship it. If not, run another cycle. Timebox iterations—diminishing returns after 3-4 cycles.
  • Track marginal improvement per iteration: if you gain <2% accuracy per cycle, returns are diminishing; consider moving to production
  • Balance model quality with time-to-market: perfect is the enemy of shipped
Optimize Model Performance
Model Development

Improve model speed and reduce costs without sacrificing accuracy.

  • When model accuracy is good but latency or cost too high
  • Before scaling to millions of predictions per day
  • When infrastructure costs are eating into product margins
  • Profile bottlenecks: Measure where time is spent—data loading, preprocessing, model inference, post-processing. Optimize the slowest part first.
  • Optimize inference: Use smaller model variants (DistilBERT vs. BERT), quantization (FP16 or INT8), batching, caching frequent predictions.
  • Reduce model size: Prune unnecessary weights, knowledge distillation (train small model to mimic large one), feature selection.
  • Optimize deployment: Use faster hardware (GPUs for large models), serverless for variable load, edge deployment to reduce network latency.
  • Measure tradeoffs: Track accuracy, latency, cost after each optimization. Ensure accuracy doesn't drop >2-3 percentage points.
  • Quantization (FP32 to FP16) often gives 2x speedup with <1% accuracy loss—always try first
  • Cache predictions for repeated inputs—many applications have high overlap in queries
Implement Model Versioning
Model Development

Track, compare, and manage different model versions across environments.

  • When deploying models to production for the first time
  • When managing multiple model versions in parallel
  • When you need to roll back to previous model versions
  • Choose versioning scheme: Semantic versioning (v1.0, v1.1, v2.0) or timestamp-based (2025-01-15-1530). Be consistent.
  • Tag model artifacts: Version model weights, preprocessing code, feature definitions, inference code. Package together.
  • Link to training data: Record which data version trained each model. Enables reproduction and debugging.
  • Track deployment: Which version is in production? Staging? Development? Use model registry (MLflow, SageMaker).
  • Set retention policy: Keep last 3-5 production models for quick rollback. Archive older models unless needed for compliance.
  • Store model metadata: Training date, performance metrics, owner, intended use. Makes it easy to compare versions.
  • Automate version bumping—manual versioning leads to errors and confusion
Design AI UX Patterns
UX & Product Design

Apply proven UX patterns that help users understand and trust AI-powered features.

  • When designing interfaces for AI features
  • When users express confusion or mistrust of AI outputs
  • Before conducting usability testing of AI products
  • Show confidence levels: When model is uncertain (<70% confidence), communicate this to users. Example: 'I'm not sure, here are 3 options.'
  • Provide explanations: Show why AI made a decision. Example: 'Recommended because you viewed similar products.' Keep simple, not technical.
  • Enable feedback: Add thumbs up/down, 'Was this helpful?', or report buttons. Collect user corrections to improve model.
  • Offer alternatives: For key decisions, show top 3 predictions instead of only #1. Lets users choose if top pick is wrong.
  • Make AI status visible: Show when AI is thinking (loading), when it's done, when it failed. Don't hide AI delays.
  • Test AI explanations with users—what makes sense to you may confuse them
  • Balance transparency with simplicity—too much detail overwhelms, too little erodes trust
Design Loading & Latency States
UX & Product Design

Create UX patterns that keep users engaged while AI processes requests.

  • When AI latency is unavoidably >1 second
  • When designing async AI features (report generation, video processing)
  • When users complain that AI features feel slow or unresponsive
  • Categorize by latency: Instant (<100ms), responsive (<1s), deliberate (1-5s), background (>5s). Each needs different UX.
  • Show immediate feedback: Display loading indicator within 100ms of user action. Proves system is working.
  • Use progressive disclosure: For long tasks, show interim results. Example: 'Found 20 results... still searching...' then final count.
  • Set expectations: Tell users how long to expect. 'This usually takes 30 seconds.' Uncertainty is worse than slow.
  • Make waiting engaging: Show fun loading messages, progress bars, skeleton screens. Distract from wait time.
  • Enable async patterns: For >10s tasks, let users do other things. Notify when done via email, notification, or dashboard.
  • Perceived latency matters more than actual latency—good UX makes 3s feel like 1s
  • Test loading states with intentionally delayed responses—reveals UX bugs
Design AI Error States
UX & Product Design

Create clear, actionable error messages when AI features fail or produce low-confidence outputs.

  • When designing AI features that can fail or return uncertain results
  • When users report confusion about AI errors
  • When model confidence varies significantly across inputs
  • Categorize error types: Model failure (crashed), low confidence (<70%), ambiguous input, rate limiting, inappropriate request
  • Write user-friendly messages: Avoid technical jargon. Example: 'I couldn't understand your request' not 'Model returned null'
  • Provide next steps: Tell users what to do. 'Try rephrasing your question' or 'Here's a human expert who can help.'
  • Offer fallbacks: When AI fails, route to rules-based system, human expert, or simpler alternative.
  • Log error details: Capture input, model version, confidence, latency for debugging. Don't show to users but track for engineering.
  • Never say 'AI error' or 'Model failed'—users don't care about implementation, they want solutions
  • Test error states as thoroughly as success states—errors happen 5-20% of the time in production
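One way to keep error categories and user-facing copy in a single place is sketched below. The categories follow the list in this card; the message wording beyond the card's own examples, the result shape, and the names `ErrorType`, `classify`, and `user_facing` are assumptions for illustration.

```python
from enum import Enum


class ErrorType(Enum):
    MODEL_FAILURE = "model_failure"
    LOW_CONFIDENCE = "low_confidence"
    AMBIGUOUS_INPUT = "ambiguous_input"
    RATE_LIMITED = "rate_limited"
    INAPPROPRIATE_REQUEST = "inappropriate_request"


# (message, next step) shown to users; no implementation jargon.
RESPONSES = {
    ErrorType.MODEL_FAILURE: ("Something went wrong on our side.", "Please try again in a moment."),
    ErrorType.LOW_CONFIDENCE: ("I'm not sure about this one.", "Here are a few options to choose from."),
    ErrorType.AMBIGUOUS_INPUT: ("I couldn't understand your request.", "Try rephrasing your question."),
    ErrorType.RATE_LIMITED: ("You've reached the limit for now.", "Please wait a minute and try again."),
    ErrorType.INAPPROPRIATE_REQUEST: ("I can't help with that request.", "Try asking about something else."),
}


def classify(result, low_confidence=0.70):
    """Assumed result shape: None if the call crashed, else {'output', 'confidence'}."""
    if result is None or result.get("output") is None:
        return ErrorType.MODEL_FAILURE
    if result["confidence"] < low_confidence:
        return ErrorType.LOW_CONFIDENCE
    return None  # no error state to show


def user_facing(error: ErrorType) -> dict:
    message, next_step = RESPONSES[error]
    return {"message": message, "next_step": next_step}


print(user_facing(classify({"output": "cat", "confidence": 0.4})))
```

Engineering details (input, model version, latency) would go to logs rather than into these messages.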
Implement Confidence Score Display
UX & Product Design

Communicate model uncertainty to users in intuitive, non-technical ways.

  • For high-stakes AI decisions (medical, financial, legal)
  • When model accuracy varies significantly across inputs
  • When you want users to verify AI outputs before acting
  • Choose confidence threshold: Low (<70%), medium (70-85%), high (>85%). Adjust based on domain and user testing.
  • Design visual indicators: Stars (★★★★★), bars (▮▮▮▯▯), labels ('High confidence', 'Low confidence'), colors (green/yellow/red)
  • Provide context: Explain what confidence means. 'High confidence: I'm very sure' vs. 'Low confidence: Please double-check this.'
  • Adjust behavior by confidence: High confidence = show single answer. Low confidence = show multiple options or route to human.
  • Test comprehension: Ask users what different confidence levels mean. Iterate until 80%+ interpret correctly.
  • Avoid raw percentages—'87% confident' means different things to different users
  • Consider hiding confidence for consumer products but showing it for professional/enterprise tools
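A small sketch of turning raw scores into labels and behavior, using the thresholds from this card. Showing three options at low confidence and a single answer otherwise is an assumption (the card does not specify medium-confidence behavior), and `confidence_tier` and `present` are hypothetical names.

```python
LABELS = {
    "low": "Low confidence: please double-check this.",
    "medium": "Medium confidence",
    "high": "High confidence: I'm very sure.",
}


def confidence_tier(score: float, low: float = 0.70, high: float = 0.85) -> str:
    """Bucket a score in [0, 1] into low (<70%), medium (70-85%), high (>85%)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score < low:
        return "low"
    if score <= high:
        return "medium"
    return "high"


def present(predictions):
    """predictions: list of (answer, score), best first. Returns what the UI should show."""
    tier = confidence_tier(predictions[0][1])
    shown = 3 if tier == "low" else 1  # low confidence: offer alternatives
    return {"label": LABELS[tier], "answers": [a for a, _ in predictions[:shown]]}


print(present([("Electronics", 0.55), ("Toys", 0.25), ("Home", 0.15), ("Other", 0.05)]))
```

The user sees the label, never the raw percentage.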
Design Progressive Disclosure
UX & Product Design

Structure AI interfaces to show simple results first with option to drill into details.

  • When AI produces complex outputs with multiple components
  • When users have varying expertise levels and information needs
  • When you want to reduce cognitive load while preserving access to details
  • Identify information layers: Core result (always shown), supporting details (click to expand), advanced info (settings/preferences)
  • Design default view: Show only essential information. Example: Search shows top result + 'See 10 more' vs. all 50 results.
  • Add expansion points: 'Show more', 'Details', 'Why this recommendation', 'Advanced options'. Make discoverable but not intrusive.
  • Preserve context: When user expands details, keep core result visible. Don't navigate away or replace entire screen.
  • Remember preferences: If user always expands details, make that their default. Learn from behavior.
  • 80% of users need only surface-level info—optimize for them, not power users
  • Test with novices and experts—both should find the experience intuitive
Design AI Explanation Interfaces
UX & Product Design

Create interfaces that help users understand why AI made specific decisions.

  • For high-stakes decisions requiring user trust (loans, hiring, medical)
  • When regulatory requirements mandate explainability (GDPR, financial services)
  • When users frequently question or override AI recommendations
  • Choose explanation method: Feature importance ('Price and location drove this score'), example-based ('Similar to properties you viewed'), counterfactual ('If price were $50K less, recommendation would change')
  • Match explanation to audience: Non-technical users need simple language, experts can handle technical details. Test comprehension.
  • Show top factors only: Display 3-5 most important factors, not all 50 features. 'Income, credit score, and employment history were most important.'
  • Make explanations actionable: If user can change outcome, tell them how. 'Improve credit score by 50 points to qualify.'
  • Validate accuracy: Ensure explanations reflect actual model logic. Use LIME, SHAP, or other XAI tools. Test edge cases.
  • Oversimplified explanations can misrepresent what the model actually did; balance accuracy with understandability
  • Let users drill down: Show simple explanation by default, offer 'Technical details' for experts
Design Feedback Collection Mechanisms
UX & Product Design

Build interfaces that capture user feedback on AI outputs to enable continuous improvement.

  • For all AI features in production—feedback drives improvement
  • When implementing active learning or human-in-the-loop systems
  • When model performance needs ongoing monitoring and tuning
  • Choose feedback types: Implicit (clicks, time on page, conversions), explicit (thumbs up/down, ratings, corrections), detailed (text feedback, report issue)
  • Design for low friction: One-click feedback is used 10x more than forms. 'Was this helpful? Yes/No' beats 'Rate 1-5 stars with comment'
  • Capture corrections: Let users fix wrong predictions. 'This is actually spam' or 'Correct category: Electronics'. Enables retraining.
  • Close feedback loop: Show users that feedback matters. 'Thanks, we'll improve based on your input' or 'Your feedback improved results for everyone.'
  • Instrument everything: Log feedback with prediction details (input, output, confidence, model version). Enables analysis.
  • Aim for 5-10% feedback rate minimum—below 2% means your mechanism is too hard to use
  • Incentivize feedback for cold-start: 'Rate 5 results to unlock personalization' works well
Design Onboarding for AI Features
UX & Product Design

Educate users about AI capabilities, limitations, and how to get best results.

  • When launching new AI features to existing user base
  • When AI behavior differs from user expectations
  • When users don't know AI features exist or how to use them
  • Set expectations: Tell users what AI can and cannot do. 'I can summarize documents up to 50 pages' sets clear boundaries.
  • Show examples: Demonstrate with real use cases. 'Try asking: Summarize this contract' or show sample outputs.
  • Teach best practices: Help users craft effective inputs. 'Be specific: Instead of 'cars', try 'red sedans under $30K''
  • Progressive disclosure: Don't dump all features at once. Introduce advanced features after user masters basics.
  • Offer contextual help: Provide tips in-app at point of use. Tooltip on search box: 'I understand natural language questions.'
  • Test onboarding with users who have never seen your product—reveals hidden assumptions
  • Track feature discovery and usage—if <50% of users find AI feature, your onboarding failed
Design AI Testing Strategy
Testing & Validation

Create comprehensive test plans that cover model performance, system behavior, and user experience.

  • Before AI feature development begins—testing is not an afterthought
  • When planning QA resources and timelines for AI projects
  • When deciding what testing is required before launch
  • Unit tests: Test data pipelines, feature engineering, pre/post-processing logic. These should be deterministic and fast.
  • Model tests: Evaluate accuracy on test set, measure fairness across demographics, test edge cases, validate confidence calibration
  • Integration tests: Test full system—user input to model prediction to UI display. Include latency, error handling, fallbacks.
  • User acceptance tests: Real users test with realistic tasks. Measure task success rate, user satisfaction, confusion points.
  • Production validation: Shadow mode, canary deployment, A/B test. Measure real-world performance before full rollout.
  • Allocate 30-40% of development timeline to testing—AI testing takes longer than traditional software
  • Create regression test suite: as you fix issues, add them to automated tests to prevent recurrence
Implement A/B Testing for AI
Testing & Validation

Design experiments to measure real-world impact of AI models and features.

  • When comparing model versions before rolling out to all users
  • When measuring business impact of AI features
  • When deciding between different AI approaches or UX designs
  • Define hypothesis: Be specific. 'New model will increase click-through rate by >5%' not 'New model is better'
  • Choose success metrics: Primary (e.g., task success rate) and secondary (e.g., time on page, user satisfaction). Align with business goals.
  • Design experiment: Random user assignment (50/50 split), minimum sample size (run a power analysis; typically need 10K+ users), duration (run 1-2 weeks minimum)
  • Monitor for issues: Check for errors, performance degradation, user complaints. Have kill switch ready if experiment causes problems.
  • Analyze results: Compare metrics with statistical significance tests. Look for segment differences (e.g., works for US but not EU users).
  • Run A/A tests first (same model in both groups)—validates your experiment infrastructure
  • Don't stop experiments early even if winning; stopping at the first significant result inflates false positives, so wait for the full planned sample size
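A rough sketch of the power analysis behind the minimum sample size, using the standard normal-approximation formula for comparing two proportions. The 10% baseline rate, 5% relative lift, alpha of 0.05, and power of 0.8 are illustrative assumptions, and `sample_size_per_group` is a hypothetical helper built on the standard library only.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_base: float, relative_lift: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in EACH arm to detect the lift with a two-sided z-test."""
    p_new = p_base * (1 + relative_lift)
    p_bar = (p_base + p_new) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    ) ** 2
    return math.ceil(numerator / (p_new - p_base) ** 2)


# Hypothesis from the card: click-through rate rises by at least 5% (relative).
n = sample_size_per_group(p_base=0.10, relative_lift=0.05)
print(f"users per group: {n}")
```

Small lifts on low baseline rates need tens of thousands of users per arm, which is why the sample size should be fixed before the experiment starts.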
Run Shadow Mode Testing
Testing & Validation

Deploy new models in production without showing outputs to users to validate real-world performance safely.

  • Before launching new models to users for the first time
  • When testing major model changes or rewrites
  • When you want to measure production performance without user risk
  • Set up shadow deployment: Deploy new model alongside production model. Route same inputs to both. Show only production model output to users.
  • Log shadow predictions: Capture new model outputs, confidence scores, latency, errors. Store for analysis.
  • Compare to production: Measure agreement rate between models. Analyze disagreements—is new model fixing bugs or introducing new errors?
  • Monitor performance: Track shadow model accuracy, latency, error rates, cost. Ensure meets production requirements.
  • Validate at scale: Run shadow mode for 1-2 weeks with full production traffic volume. Reveals issues that don't appear in testing.
  • Shadow mode is expensive (2x compute) but invaluable for risk reduction—worth it for critical features
  • Set success criteria before shadow mode—know what metrics determine go/no-go for promotion
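Comparing shadow output to production can be sketched as below. The record shape (keys `prod`, `shadow`, optional `label`) and the names `ShadowReport` and `compare_shadow` are assumptions; real logs would also carry latency and confidence.

```python
from dataclasses import dataclass


@dataclass
class ShadowReport:
    agreement_rate: float    # share of requests where both models agree
    shadow_fixes: int        # disagreements where only the shadow model was right
    shadow_regressions: int  # disagreements where only production was right


def compare_shadow(records) -> ShadowReport:
    """records: list of dicts with 'prod', 'shadow', and optionally a ground-truth 'label'."""
    if not records:
        raise ValueError("no records to compare")
    agree = sum(r["prod"] == r["shadow"] for r in records)
    fixes = regressions = 0
    for r in records:
        label = r.get("label")
        if label is None or r["prod"] == r["shadow"]:
            continue  # only labeled disagreements tell us who was right
        if r["shadow"] == label:
            fixes += 1
        elif r["prod"] == label:
            regressions += 1
    return ShadowReport(agree / len(records), fixes, regressions)


logs = [
    {"prod": "spam", "shadow": "spam", "label": "spam"},
    {"prod": "ham", "shadow": "spam", "label": "spam"},
    {"prod": "ham", "shadow": "spam", "label": "ham"},
    {"prod": "ham", "shadow": "ham"},
]
report = compare_shadow(logs)
print(report)
```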
Conduct AI Red Teaming
Testing & Validation

Simulate adversarial attacks and edge case scenarios to find AI vulnerabilities before users do.

  • Before launching consumer-facing AI features, especially conversational AI
  • For high-stakes applications (content moderation, security, financial decisions)
  • When testing robustness of safety guardrails
  • Recruit red team: Mix of security experts, domain experts, and creative thinkers. External teams find more issues than internal.
  • Define attack scenarios: Prompt injection, jailbreaking, bias exploitation, misinformation generation, adversarial inputs, edge case enumeration
  • Run attack sprints: Give red team 3-5 days to find vulnerabilities. Document all successful attacks with reproduction steps.
  • Triage findings: Severity scoring (critical/high/medium/low). Must-fix before launch vs. acceptable risk vs. post-launch improvement.
  • Implement mitigations: Add input filters, output filters, safety layers, fallback behaviors. Re-test to verify fixes work.
  • Budget $10-50K for external red teaming—finding issues pre-launch is 100x cheaper than post-launch PR disasters
  • Run red teaming quarterly for live products—new attack techniques emerge constantly
Execute User Acceptance Testing
Testing & Validation

Validate that AI features meet user needs through structured testing with real users.

  • After AI features are functionally complete but before launch
  • When validating that AI solves the intended user problem
  • When gathering evidence for launch decision
  • Recruit representative users: 10-20 users matching target demographic. Include skeptics and early adopters. Compensate appropriately.
  • Design test scenarios: Create 5-10 realistic tasks users would do with AI feature. Example: 'Find red sedans under $30K in your area'
  • Measure task success: Can users complete tasks? How long does it take? How many attempts? What's user satisfaction score?
  • Capture qualitative feedback: What confused users? What delighted them? What would they change? Where did AI fail their expectations?
  • Test edge cases: Give users ambiguous, difficult, or unusual inputs. How does system handle? Do users understand error messages?
  • Test with users who have NOT seen the product before—your internal team is blind to usability issues
  • Video record sessions—watching users struggle reveals insights that surveys miss
Test Model Fairness
Testing & Validation

Measure and validate that AI models perform equitably across different user groups.

  • Before launching AI that impacts people (hiring, lending, content recommendations)
  • When building AI for diverse user populations
  • When regulatory or ethical standards require fairness audits
  • Identify protected groups: Demographics (age, gender, race), geography, socioeconomic status, language. Base on domain and regulations.
  • Measure performance by group: Calculate accuracy, precision, recall, false positive/negative rates for each group. Look for disparities.
  • Define fairness criteria: Demographic parity (equal positive-prediction rates across groups)? Equalized odds (equal true-positive and false-positive rates across groups)? Choose standard appropriate to domain.
  • Quantify disparities: If accuracy for Group A is 90% but Group B is 75%, that's a 15-point gap. Set acceptable threshold (e.g., <5-point gap).
  • Mitigate bias: Collect more training data for underperforming groups, use fairness constraints during training, post-process predictions to equalize outcomes
  • Fairness is multi-dimensional and contextual—no single metric captures all concerns
  • Document fairness analysis in launch review—shows stakeholders you took responsibility
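A minimal sketch of measuring performance by group and quantifying the gap in percentage points. The row format `(group, y_true, y_pred)` and the function names are assumptions; a real audit would also compute precision, recall, and error rates per group.

```python
from collections import defaultdict


def accuracy_by_group(rows):
    """rows: iterable of (group, y_true, y_pred). Returns {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in rows:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}


def max_gap_points(acc_by_group) -> float:
    """Largest accuracy difference between any two groups, in percentage points."""
    return (max(acc_by_group.values()) - min(acc_by_group.values())) * 100


rows = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 1), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),
]
acc = accuracy_by_group(rows)
gap = max_gap_points(acc)
print(acc, f"gap = {gap:.1f} points", "FAIL" if gap > 5 else "OK")
```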
Create Automated Test Suites
Testing & Validation

Build automated tests for AI systems that run continuously to catch regressions and issues.

  • After initial AI launch when entering maintenance mode
  • When iterating on models frequently
  • When you need to ensure new model versions don't break existing functionality
  • Build golden test sets: Curate 100-500 examples with known correct outputs. Cover typical cases and edge cases. Version control this dataset.
  • Automate accuracy tests: Run new models against golden test set. Flag if accuracy drops >3 percentage points from previous version.
  • Test system integration: Automate end-to-end tests—API calls, response format, latency, error handling. Run on every deploy.
  • Monitor data quality: Automate validation of input data—schema checks, range checks, null detection, distribution monitoring.
  • Run regression tests: When fixing bugs, add failing cases to automated suite. Prevents reintroduction of same bugs.
  • Run automated tests on every code change AND weekly even without changes—catches data drift
  • Integrate with CI/CD pipeline—block deployments that fail critical tests
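The golden-set check can be sketched as a function a CI job would call. It reads the 3% threshold as percentage points, which is an assumption; the toy `predict` function and the names `golden_set_accuracy` and `passes_regression_gate` are hypothetical.

```python
def golden_set_accuracy(predict, golden) -> float:
    """golden: list of (input, expected_output) pairs kept under version control."""
    correct = sum(predict(x) == expected for x, expected in golden)
    return correct / len(golden)


def passes_regression_gate(new_acc: float, prev_acc: float,
                           max_drop_points: float = 3.0) -> bool:
    """False means the deploy should be blocked."""
    drop_points = (prev_acc - new_acc) * 100
    return drop_points <= max_drop_points


# Toy stand-in for a model; a real suite would load the candidate model here.
def predict(x: int) -> str:
    return "odd" if x % 2 else "even"


golden = [(1, "odd"), (2, "even"), (3, "odd"), (4, "even")]
new_acc = golden_set_accuracy(predict, golden)
print(new_acc, passes_regression_gate(new_acc, prev_acc=0.98))
```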
Plan Phased Rollout
Launch & Monitoring

Deploy AI features incrementally to manage risk and learn from early users before full launch.

  • For all AI features—phased rollouts are best practice, not optional
  • When launching to large user bases where issues could affect millions
  • When uncertainty about production performance remains after testing
  • Define rollout phases: 1% (internal + beta), 5% (early adopters), 25% (broader test), 100% (full launch). Adjust percentages based on user base size.
  • Set phase duration: Run each phase 3-7 days minimum. Longer for complex features or when monitoring slow metrics (e.g., retention).
  • Define promotion criteria: What metrics must be met to move to next phase? Example: '95% task success, p95 latency <2s, <0.1% error rate, NPS >40'
  • Plan rollback triggers: What causes immediate rollback? Example: 'Error rate >1%, p95 latency >5s, user complaints spike >5x baseline'
  • Communicate timeline: Tell stakeholders and users the rollout plan. Manage expectations—'rolling out over 2 weeks' prevents 'why don't I have it?' questions.
  • Use feature flags for instant rollback without redeployment—essential for risk management
  • Bias initial phases toward power users or opt-in beta testers—they provide better feedback
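A sketch of percentage-based rollout behind a feature flag, plus the rollback triggers from the example above. Hash-based bucketing is one common approach, not something this card prescribes; `in_rollout` and `should_roll_back` are hypothetical names.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically place a user in the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percent * 100


def should_roll_back(error_rate: float, p95_latency_s: float,
                     complaints: int, baseline_complaints: int) -> bool:
    """Triggers from the card: error rate >1%, p95 latency >5s, complaints >5x baseline."""
    return (
        error_rate > 0.01
        or p95_latency_s > 5.0
        or complaints > 5 * baseline_complaints
    )


users = [f"user-{i}" for i in range(2000)]
for percent in (1, 5, 25, 100):
    enabled = sum(in_rollout(u, "ai-summary", percent) for u in users)
    print(f"{percent}% phase: {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the user and feature, anyone enabled at 5% stays enabled at 25%, and setting the percentage to 0 acts as an instant rollback without redeployment.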
Set Up Model Monitoring
Launch & Monitoring

Instrument production AI systems to track model performance, data drift, and system health.

  • Before launching AI features to production—monitoring is not optional
  • When models are live but you lack visibility into production performance
  • When setting up MLOps processes
  • Track model metrics: Log predictions, confidence scores, latency for every request. Calculate accuracy, precision, recall daily from user feedback.
  • Monitor input distribution: Track feature distributions over time. Alert if input data shifts significantly from training distribution.
  • Set up alerts: Define thresholds for key metrics. Example: 'Alert if accuracy drops >5%, latency p95 >1s, error rate >1%'
  • Create dashboards: Visualize metrics for PM, engineers, executives. Show trends over time, comparison to baselines, breakdown by user segments.
  • Log errors: Capture all failures—model errors, timeouts, invalid inputs. Review weekly to identify patterns.
  • Monitor business metrics too, not just model metrics—user satisfaction and revenue matter more than accuracy
  • Use existing tools (Datadog, Grafana, CloudWatch) plus ML-specific tools (Arize, Fiddler, WhyLabs)
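Input drift on a single numeric feature can be sketched with the Population Stability Index (PSI). PSI is one common choice, not one this card names; the equal-width binning, the smoothing constant, and the 0.2 alert level (a widely used rule of thumb) are assumptions.

```python
import math


def psi(expected, actual, bins: int = 10) -> float:
    """Population Stability Index between training values and production values."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        eps = 1e-6  # smoothing so empty bins do not produce log(0)
        return [(c + eps) / (len(values) + eps * bins) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(1000)]     # feature values seen in training
production = [v + 3.0 for v in training]      # same shape, shifted upward
score = psi(training, production)
print(f"PSI = {score:.2f}", "ALERT" if score > 0.2 else "ok")
```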
Build Monitoring Dashboards
Launch & Monitoring

Create visual dashboards that surface AI system health and performance for different stakeholders.

  • After instrumenting monitoring—raw logs are useless without visualization
  • When stakeholders ask 'How is the AI performing?' and you don't have an answer
  • When managing multiple AI features or models in production
  • Design for audience: PM dashboard (user metrics, business impact), engineering dashboard (system health, latency, errors), executive dashboard (high-level KPIs)
  • Include key metrics: Model accuracy, user satisfaction, task success rate, latency (p50/p95/p99), error rate, cost per prediction, usage volume
  • Show trends: Current value vs. yesterday, last week, last month. Spot degradation early. Annotate with model version deploys.
  • Add drill-down: Click on metric to see breakdown by user segment, geography, device, time of day. Reveals where issues are concentrated.
  • Make actionable: Every dashboard should answer 'What should I do?' Include alerts, thresholds, comparison to targets.
  • Start simple—one dashboard with 6-8 key metrics beats ten dashboards nobody looks at
  • Review dashboards weekly in team meetings—makes monitoring a habit, not an afterthought
Design Incident Response Plan
Launch & Monitoring

Define procedures for detecting, triaging, and resolving AI system failures in production.

  • Before launching AI to production—hope for best, plan for worst
  • After experiencing AI incidents without clear response procedures
  • When onboarding on-call engineers for AI systems
  • Define incident types: Model performance drop, latency spike, error rate spike, cost overrun, harmful outputs, data pipeline failure
  • Set severity levels: P0 (user-facing complete failure), P1 (degraded performance), P2 (minor issue), P3 (monitoring alert, no user impact)
  • Create runbooks: Step-by-step guides for common incidents. Example: if accuracy drops >10%, (1) check recent data, (2) compare to baseline model, (3) roll back if needed.
  • Assign on-call: Who responds to incidents? Rotation schedule? Escalation path if on-call can't resolve?
  • Define communication: Who gets notified? Users? Stakeholders? Executives? What's the message template?
  • Post-incident review: After major incidents, conduct blameless post-mortem. Document learnings, prevent recurrence.
  • Practice incident response with fire drills—uncovers gaps in procedures
  • Have rollback plan ready—ability to quickly revert to previous model is crucial
Implement Feedback Collection
Launch & Monitoring

Deploy mechanisms to gather user feedback on AI outputs for continuous improvement.

  • At launch—feedback collection is core feature, not add-on
  • When model accuracy is good but you want to make it great
  • When implementing active learning or continuous training
  • Implement explicit feedback: Thumbs up/down, star ratings, 'Report issue' buttons. Make one-click easy.
  • Track implicit feedback: Click-through rate, time on page, task completion, return usage. Often more reliable than explicit feedback.
  • Collect corrections: Let users fix wrong predictions. 'This is actually X' or 'Correct answer: Y'. Generates training data.
  • Sample strategically: Don't ask for feedback on every interaction—causes fatigue. Sample 10-20% of users randomly plus 100% of uncertain predictions.
  • Close feedback loop: Show users their feedback improved the system. 'Thanks to feedback like yours, accuracy improved 5%'
  • Aim for 5-10% feedback rate—if lower, your UI friction is too high
  • Incentivize feedback sparingly—intrinsic motivation (helping improve product) beats extrinsic rewards
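The sampling rule above can be sketched in a few lines. The 15% rate sits inside the card's 10-20% range, and treating predictions below 70% confidence as uncertain is an assumption borrowed from the thresholds used elsewhere in this playbook; `should_ask_feedback` is a hypothetical name.

```python
import random


def should_ask_feedback(confidence: float, sample_rate: float = 0.15,
                        uncertain_below: float = 0.70, rng=random) -> bool:
    """Always ask on uncertain predictions; otherwise ask a random sample."""
    if confidence < uncertain_below:
        return True
    return rng.random() < sample_rate


rng = random.Random(0)  # seeded only to make the demo repeatable
asked = sum(should_ask_feedback(0.95, rng=rng) for _ in range(10_000))
print(f"prompted on {asked / 10_000:.1%} of confident predictions")
```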
Measure AI Feature Adoption
Launch & Monitoring

Track metrics that reveal whether users discover, try, and consistently use AI features.

  • After AI feature launch to measure product-market fit
  • When feature usage is lower than expected
  • When deciding whether to invest more in AI features or pivot
  • Track awareness: What % of users know AI feature exists? Survey or measure if users saw onboarding/announcement.
  • Measure trial: What % of aware users tried feature at least once? Track first use within 7 days of awareness.
  • Calculate activation: What % of trialists had successful first experience? Define success: task completed, positive feedback, no errors.
  • Monitor retention: What % of activated users return? Track D1, D7, D30 retention. AI features need habit formation.
  • Identify power users: Who uses AI feature daily? What % of total usage do they represent? Learn from them.
  • Diagnose drop-off: Where do users churn? Never try? Try once and abandon? Fixes differ for each stage.
  • Benchmark against non-AI features—is adoption good or bad in context?
  • Segment by user type—enterprise users and consumers have different adoption curves
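The awareness, trial, activation, and retention steps form a funnel, sketched below. The stage names and the counts are made up for illustration; `funnel_rates` and `biggest_drop_off` are hypothetical helpers.

```python
STAGES = ["total_users", "aware", "tried", "activated", "retained_d30"]


def funnel_rates(counts: dict) -> dict:
    """Conversion from each stage to the next, keyed by the later stage."""
    rates = {}
    for prev, cur in zip(STAGES, STAGES[1:]):
        rates[cur] = counts[cur] / counts[prev] if counts[prev] else 0.0
    return rates


def biggest_drop_off(counts: dict) -> str:
    """Stage with the lowest conversion from the stage before it."""
    rates = funnel_rates(counts)
    return min(rates, key=rates.get)


counts = {"total_users": 1000, "aware": 600, "tried": 300,
          "activated": 240, "retained_d30": 60}
rates = funnel_rates(counts)
print(rates, "-> fix first:", biggest_drop_off(counts))
```

Here most activated users do not come back, so the fix is habit formation rather than more announcements.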
Analyze AI Usage Patterns
Launch & Monitoring

Study how users interact with AI features to identify improvements and optimization opportunities.

  • After AI feature has been live for 2-4 weeks with meaningful usage data
  • When planning next iteration or improvement cycle
  • When usage metrics are flat and you need ideas for growth
  • Segment users by behavior: Power users, casual users, one-time users. Analyze each segment separately.
  • Identify common queries: What are most frequent inputs? Are there patterns? Can you optimize for common cases?
  • Find failure patterns: When does AI fail? Which input types? Which user segments? Prioritize fixing most common failures.
  • Measure feature combinations: Do users combine AI with other features? What workflows emerge? Can you streamline?
  • Analyze temporal patterns: Time of day, day of week, seasonality. Usage spikes reveal unmet needs or opportunities.
  • Talk to 10 power users—they've figured out creative uses you never imagined
  • Look for 'workarounds'—users finding ways around AI limitations signal improvement opportunities
Plan Model Retraining
Optimization & Iteration

Establish cadence and triggers for updating models with fresh data to maintain performance.

  • After initial model deployment—retraining is not optional for production AI
  • When model performance degrades over time
  • When setting up MLOps processes for long-term maintenance
  • Determine retraining cadence: Daily (high-churn domains like news), weekly (e-commerce, social), monthly (stable domains like document classification), quarterly (slow-changing domains)
  • Set performance triggers: Retrain if accuracy drops >5%, error rate more than doubles, or negative user feedback exceeds 20%
  • Plan data collection: Ensure sufficient new labeled data between retraining cycles. Budget for labeling.
  • Automate pipeline: Scheduled retraining jobs, automated evaluation, deployment if metrics improve, rollback if metrics worsen
  • Version and track: Record training date, data version, performance metrics for each retrained model
  • Start with monthly retraining, adjust based on monitoring—over-retraining wastes resources
  • Always validate retrained models before deployment—sometimes new data is worse than old
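The performance triggers above can be sketched as a check that a scheduled job runs against monitoring data. It reads the accuracy drop as percentage points relative to the baseline at deployment, which is an assumption, and `retrain_reasons` is a hypothetical name.

```python
def retrain_reasons(baseline_acc: float, current_acc: float,
                    baseline_error_rate: float, current_error_rate: float,
                    negative_feedback_rate: float) -> list:
    """Return the triggers that fired; an empty list means no retraining needed."""
    reasons = []
    if baseline_acc - current_acc > 0.05:
        reasons.append("accuracy dropped more than 5 points")
    if baseline_error_rate > 0 and current_error_rate / baseline_error_rate > 2:
        reasons.append("error rate more than doubled")
    if negative_feedback_rate > 0.20:
        reasons.append("negative feedback above 20%")
    return reasons


print(retrain_reasons(0.92, 0.85, 0.010, 0.015, 0.10))
```

A retrained model would still go through validation before it replaces the current one.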
Optimize Model Costs
Optimization & Iteration

Reduce inference and training costs while maintaining model quality and user experience.

  • When AI costs are higher than budgeted or eating into margins
  • When scaling to millions of predictions per day
  • When stakeholders question AI ROI due to cost concerns
  • Measure current costs: Break down by training compute, inference compute, data storage, labeling. Identify biggest expense.
  • Optimize inference: Use smaller models, quantization (FP32 to FP16), batching, caching of common predictions, and cheaper hardware
  • Reduce training costs: Use transfer learning (fine-tune instead of training from scratch), reduce experiment volume, use spot instances
  • Optimize data costs: Compress datasets, delete old versions, use cheaper storage tiers, reduce labeling through active learning
  • Right-size infrastructure: Use autoscaling, serverless for variable load, reserved instances for predictable load
  • Caching can reduce costs 50-80% for applications with repeated queries—implement early
  • Profile costs weekly—gradual creep is harder to fix than sudden spikes
Iterate on AI Features
Optimization & Iteration

Systematically improve AI features based on user feedback, usage data, and performance metrics.

  • After initial launch and 2-4 weeks of production data collection
  • When planning roadmap for next quarter of AI development
  • When feature adoption or satisfaction is below targets
  • Gather improvement ideas: User feedback, support tickets, usage analysis, error logs, competitive analysis, team brainstorms
  • Categorize improvements: Model accuracy, UX enhancements, edge case handling, performance/latency, new capabilities, cost reduction
  • Estimate impact: For each improvement, estimate user impact (low/medium/high) and confidence (how sure are you it will work?)
  • Estimate effort: T-shirt sizing (S/M/L) or story points. Include data collection, training, testing, deployment.
  • Prioritize by ROI: High impact + low effort = do first. Low impact + high effort = deprioritize. Build roadmap with quick wins and strategic bets.
  • Reserve 20% capacity for small improvements and bug fixes, 80% for planned features
  • Ship improvements incrementally—don't wait for perfect, ship better
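Prioritizing by ROI can be sketched as a simple score: impact times confidence, divided by effort. The numeric weights for low/medium/high and S/M/L are arbitrary assumptions, and `roi_score` and `prioritize` are hypothetical names.

```python
IMPACT = {"low": 1, "medium": 2, "high": 3}
EFFORT = {"S": 1, "M": 2, "L": 3}


def roi_score(impact: str, confidence: float, effort: str) -> float:
    """Higher is better: big, likely wins that are cheap to build."""
    return IMPACT[impact] * confidence / EFFORT[effort]


def prioritize(ideas):
    """ideas: list of dicts with name, impact, confidence (0-1), effort."""
    return sorted(
        ideas,
        key=lambda i: roi_score(i["impact"], i["confidence"], i["effort"]),
        reverse=True,
    )


ideas = [
    {"name": "new capability", "impact": "high", "confidence": 0.5, "effort": "L"},
    {"name": "fix top failure pattern", "impact": "high", "confidence": 0.8, "effort": "S"},
    {"name": "tweak loading copy", "impact": "low", "confidence": 0.9, "effort": "S"},
]
ranked = [i["name"] for i in prioritize(ideas)]
print(ranked)
```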
Tune Model Performance
Optimization & Iteration

Systematically adjust model hyperparameters and architecture to improve accuracy and efficiency.

  • When model performance is close but not quite meeting targets
  • After collecting more training data but before retraining
  • When you have time/budget for systematic optimization
  • Identify tunable parameters: Learning rate, batch size, model architecture, regularization, dropout, optimizer choice
  • Start with learning rate: Often the most impactful hyperparameter. Try values: 1e-5, 5e-5, 1e-4, 5e-4, 1e-3. Pick the one with the best validation score.
  • Use automated search: Grid search (exhaustive but slow), random search (faster), Bayesian optimization (most efficient). Tools: Optuna, Ray Tune.
  • Set search budget: Define max experiments (e.g., 50) or max time (e.g., 3 days). Tuning has diminishing returns.
  • Validate improvements: Test tuned model on held-out test set. Ensure improvements are real, not overfitting.
  • Tune on validation set, evaluate on test set—using test set for tuning leads to overoptimistic results
  • Document tuning process—future engineers will thank you
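Random search under a fixed budget can be sketched as below. The objective is a toy function standing in for "train and return the validation score"; sampling the learning rate log-uniformly between 1e-5 and 1e-3 matches the range in this card, and `random_search` is a hypothetical name. Tools such as Optuna or Ray Tune would replace this loop in practice.

```python
import math
import random


def random_search(objective, budget: int, rng: random.Random):
    """Try `budget` learning rates; return (best_lr, best_validation_score)."""
    best = None
    for _ in range(budget):
        lr = 10 ** rng.uniform(-5, -3)  # log-uniform between 1e-5 and 1e-3
        score = objective(lr)
        if best is None or score > best[1]:
            best = (lr, score)
    return best


def toy_validation_score(lr: float) -> float:
    """Toy objective that peaks at lr = 1e-4; a real one would train a model."""
    return -(math.log10(lr) + 4) ** 2


best_lr, best_score = random_search(toy_validation_score, budget=50,
                                    rng=random.Random(0))
print(f"best lr ~ {best_lr:.2e}, validation score {best_score:.4f}")
```

The score used here must come from the validation set; the held-out test set is touched once, at the end.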
Evaluate Feature Sunset
Optimization & Iteration

Decide when to deprecate or retire underperforming AI features to focus resources on higher-impact work.

  • When AI feature has low adoption after 3-6 months in production
  • When maintenance costs exceed value delivered
  • When conducting annual portfolio reviews or roadmap planning
  • Evaluate usage: What % of users actively use feature? Is trend increasing or declining? Compare to other features.
  • Measure value: Does feature drive revenue, retention, satisfaction? Quantify business impact. If negligible, candidate for sunset.
  • Calculate costs: Engineer time for maintenance, retraining, monitoring, support tickets, infrastructure costs. Is ROI positive?
  • Consider alternatives: Can feature be simplified (remove AI, use rules)? Merged with another feature? Repositioned?
  • Plan sunset: Announce deprecation timeline (3-6 months notice), offer alternatives, support migration, monitor impact
  • Sunsets are normal—teams that never kill features accumulate technical debt and lose focus
  • Survey users before sunset—sometimes low usage hides high value for specific segments
AI Development Lifecycle Overview
Primers

Understand the end-to-end process of taking AI features from concept to production.

  • When planning your first AI feature
  • When onboarding new team members to AI product development
  • When explaining AI development to stakeholders
  • Discovery: Define problem, validate AI is right solution, assess data availability, estimate feasibility
  • Data preparation: Collect data, label examples, clean and validate, version and store securely
  • Model development: Establish baseline, train models, evaluate performance, iterate until meeting criteria
  • Integration & testing: Build product integration, test end-to-end, conduct UAT, run red teaming
  • Deployment: Phased rollout, monitoring setup, incident response prep, feedback collection
  • Maintenance: Monitor performance, retrain models, iterate on features, optimize costs, handle drift
  • As a rough rule of thumb, expect about 50% of time on data, 30% on modeling, and 20% on deployment; adjust the split to your project
  • Build feedback loops from day 1—they enable continuous improvement
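
To make the 50/30/20 split concrete, here is a small sketch that divides a total estimate across the three buckets. The function name and the 12-week figure are illustrative assumptions; the shares are the rule of thumb from the list above and can be overridden.

```python
# Rule-of-thumb shares from the list above; override them for your project.
PHASE_SHARES = {"data": 0.50, "modeling": 0.30, "deployment": 0.20}


def split_estimate(total_weeks: float, shares: dict = PHASE_SHARES) -> dict:
    """Split a total time estimate across phases according to their shares."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("phase shares must sum to 1.0")
    return {phase: round(total_weeks * share, 1) for phase, share in shares.items()}


print(split_estimate(12))
```

If the data share looks too large for your plan, that is usually a sign the plan underestimates collection, labeling, and cleaning.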
MLOps Basics Primer
Primers

Learn the fundamentals of ML operations—deploying, monitoring, and maintaining AI systems in production.

  • When transitioning from AI development to production operations
  • When setting up infrastructure for production AI systems
  • When hiring MLOps engineers or defining their role
  • Version control: Track code (Git), data (DVC), models (MLflow). Everything must be versioned for reproducibility.
  • Automation: CI/CD pipelines for model training, testing, deployment. Automate retraining and evaluation.
  • Monitoring: Track model performance, data drift, system health. Alert on degradation. Dashboard for visibility.
  • Infrastructure: Scalable compute for training and inference, model serving platforms, data pipelines, experiment tracking
  • Governance: Model documentation, approval processes, audit logs, rollback capabilities, security controls
  • Start simple—don't build Google-scale MLOps for your first feature. Grow infrastructure as needed.
  • Treat models like code—they need testing, versioning, code review, deployment pipelines
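
As one way to picture the monitoring bullet, the sketch below flags data drift by comparing the mean of a live feature against a training baseline, measured in baseline standard deviations. This is a deliberately simple stand-in, not a full drift detector: the function names, the 0.5 threshold, and the sample values are assumptions, and production systems typically use richer statistical tests.

```python
from statistics import mean, stdev


def drift_score(baseline: list, live: list) -> float:
    """Shift of the live mean from the baseline mean, in baseline std devs."""
    base_std = stdev(baseline)
    shift = abs(mean(live) - mean(baseline))
    if base_std == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / base_std


def check_drift(baseline: list, live: list, threshold: float = 0.5) -> dict:
    """Return the score and whether it crosses the (assumed) alert threshold."""
    score = drift_score(baseline, live)
    return {"score": score, "alert": score > threshold}


training_values = [10, 12, 11, 13, 9, 11, 10, 12]
print(check_drift(training_values, [11, 10, 12, 11]))  # similar distribution
print(check_drift(training_values, [15, 16, 14, 15]))  # shifted upward
```

A check like this fits the "start simple" advice: it can run on a schedule and feed an alert long before a dedicated monitoring platform is in place.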
Common AI Metrics Primer
Primers

Understand key metrics for evaluating AI models and when to use each one.

  • When defining success criteria for AI features
  • When interpreting model performance reports from ML engineers
  • When comparing different model approaches
  • Accuracy: % of predictions correct. Good for balanced datasets. Misleading when classes are imbalanced (e.g., with 95% negative examples, a model that always predicts negative scores 95% accuracy while catching no positives).
  • Precision: Of positive predictions, % actually positive. High precision = few false alarms. Important when false positives are costly (e.g., spam filtering).
  • Recall: Of actual positives, % correctly identified. High recall = few missed positives. Important when false negatives are costly (e.g., fraud detection).
  • F1 Score: Harmonic mean of precision and recall. Use when you need to balance both and classes are imbalanced.
  • Latency: Time from input to output. Track p50 (median), p95 (95th percentile), and p99 (99th percentile). User experience depends on tail latency.
  • Cost per prediction: Infrastructure spend divided by prediction volume. Critical for unit economics at scale.
  • Always measure multiple metrics—accuracy alone hides problems
  • Ask ML engineers to explain metrics in business terms: 'Precision = how often recommendations are relevant'
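
The definitions above can be checked with a short worked example. This sketch computes accuracy, precision, recall, and F1 from binary labels, plus a nearest-rank latency percentile. The helper names and the sample data are illustrative, and nearest-rank is only one of several percentile conventions.

```python
import math


def classification_metrics(y_true: list, y_pred: list) -> dict:
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
print(classification_metrics(y_true, always_negative))

latencies_ms = list(range(1, 101))  # made-up latencies: 1 ms to 100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95),
      percentile(latencies_ms, 99))
```

The always-negative model is the case the accuracy bullet warns about: its accuracy looks strong while its recall shows it finds none of the positives.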
Design Multi-Agent Workflows
AI Architecture

Build AI systems with multiple specialized agents working together to handle complex tasks that single models can't solve.

  • When a single AI model hits capability limits (e.g., can't handle research + analysis + writing in one call)
  • When you need specialized expertise at different workflow stages (e.g., code review needs syntax checker + security scanner + style guide enforcer)
  • When tasks require coordination between different AI capabilities (e.g., extract data, verify accuracy, format output, send notification)
  • Map the end-to-end workflow: Break down the user's goal into discrete sub-tasks. For example, 'Generate market research report' might be: (1) Search for sources, (2) Extract key data, (3) Synthesize findings, (4) Draft report, (5) Fact-check citations.
  • Define agent roles and boundaries: Create one agent per distinct capability. Name them by function (Researcher, Analyst, Writer, Validator). Each agent gets a clear scope: inputs it receives, outputs it produces, and decision authority.
  • Design handoff protocols: Specify how agents pass information. Use structured formats (JSON schemas, typed objects). Define what happens if an agent fails: retry with same agent, escalate to different agent, or fall back to human.
  • Establish orchestration logic: Decide on control flow—sequential (Agent A → Agent B → Agent C), parallel (Agents A/B/C run simultaneously), or conditional (Agent A decides which of B/C/D runs next). Use state machines or workflow engines.
  • Build evaluation per agent: Each agent needs its own success metrics. Don't just measure the final output—track where in the chain quality degrades. Log agent decisions for debugging.
  • Plan for failure modes: Agents can produce invalid outputs, get stuck in infinite loops, or issue conflicting instructions. Set timeouts, output validation, and maximum retry limits. Always have a human-in-the-loop escape hatch.
  • Start with 2-3 agents max, not 10. Add complexity only when single-agent approaches fail. Most problems don't need sophisticated orchestration.
  • Agents aren't microservices. Don't create an agent for every tiny function. Each agent should represent a meaningful capability that users or engineers would recognize as distinct.
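
The handoff, orchestration, and failure-handling steps above can be sketched as a small sequential pipeline. Everything here is a hypothetical illustration: the `Agent` class, `run_pipeline`, `EscalateToHuman`, and the stub agents are made-up names, the agents are plain functions standing in for model calls, and timeouts are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    """One capability with a clear scope: dict in, dict out."""
    name: str
    run: Callable[[dict], dict]
    required_keys: tuple  # keys the output must contain to count as valid


class EscalateToHuman(Exception):
    """Raised when an agent exhausts its retries (the escape hatch)."""


def run_pipeline(agents: list, payload: dict, max_retries: int = 2):
    """Run agents in order, validating each output and retrying on failure."""
    log = []  # per-agent record, to see where in the chain quality degrades
    for agent in agents:
        for attempt in range(1, max_retries + 2):
            try:
                output = agent.run(payload)
                missing = [k for k in agent.required_keys if k not in output]
                if missing:
                    raise ValueError(f"missing keys: {missing}")
            except Exception as exc:
                log.append((agent.name, attempt, f"failed: {exc}"))
                continue
            log.append((agent.name, attempt, "ok"))
            payload = {**payload, **output}  # structured handoff to next agent
            break
        else:
            # No attempt succeeded: stop the chain and hand off to a person.
            raise EscalateToHuman(f"{agent.name} failed {max_retries + 1} times")
    return payload, log


# Stub agents standing in for real model calls.
researcher = Agent("researcher", lambda p: {"sources": ["source A", "source B"]},
                   ("sources",))
writer = Agent("writer",
               lambda p: {"report": f"Report on {p['topic']} using {len(p['sources'])} sources"},
               ("report",))

result, log = run_pipeline([researcher, writer], {"topic": "market size"})
print(result["report"])
print(log)
```

Two agents and a plain loop are enough to exercise validation, retries, and escalation; a state machine or workflow engine only becomes worthwhile once the control flow needs parallel or conditional branches.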