
Building AI Is Not Building Software

Your instincts from building software will mislead you when you build with AI. The system is probabilistic, not deterministic. That changes everything about how you ship, test, and talk to users.

I spent years building data products before I started working on AI. The transition was humbling in ways I didn't expect. Not because AI is more technically complex. It is, but complexity is manageable. The problem is that the mental models I'd built from software don't transfer. Some of them are actively dangerous.

In software, you define what the system should do and it does it. A button saves a record. A form validates input. An API returns data. Every time. If it doesn't, that's a bug, and you fix it. The system is deterministic. Your job is to decide what it should do, then verify that it does.

In AI, you define an objective, and the system approximates it. It produces outputs that are right most of the time but wrong some of the time, in ways you can't fully predict, for reasons you can't always explain.

That isn't a bug. That's the product.

If you're building with AI and you haven't internalized that distinction, every decision downstream will be slightly wrong. How you test. How you ship. How you set user expectations. How you decide what "done" means. All of it.

"Does it work?" is the wrong question

When you build software and someone asks "does it work?" the answer is yes or no. The login works or it doesn't. The checkout flow completes or it errors. You can test it, reproduce the issue, and fix it.

When you build with AI and someone asks "does it work?" the honest answer is: "It works about 88% of the time." And that answer doesn't fit in any box most people have.

I talked to a founder who built an AI tool that summarized customer support tickets. He showed it to a potential customer. The first three summaries were perfect. The fourth one hallucinated a detail that wasn't in the original ticket. The prospect said "so it doesn't work" and the meeting was over.

Three out of four is 75% accuracy. Whether that's good or bad depends entirely on what happens when it's wrong. If a support agent reads the summary, catches the error, and fixes it in ten seconds, 75% might be fine. It still saves time on the other three. If the summary goes directly into a customer-facing response with no human review, 75% is a disaster.

The number alone tells you nothing. The context tells you everything.
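Here is a rough back-of-the-envelope version of that argument in Python. Only the 75% figure and the ten-second fix come from the story above. The 60 seconds saved by a good summary and the 30 minutes of cleanup after an unreviewed error are invented numbers, picked purely for illustration.

```python
def net_seconds_saved_per_ticket(accuracy, seconds_saved_when_right, seconds_lost_when_wrong):
    """Expected agent time saved per ticket, in seconds. Negative means time lost."""
    error_rate = 1.0 - accuracy
    return accuracy * seconds_saved_when_right - error_rate * seconds_lost_when_wrong


ACCURACY = 0.75  # three good summaries out of four, as in the demo

# Context 1: an agent reads every summary and fixes a bad one in ten seconds.
# ASSUMPTION: a correct summary saves 60 seconds of reading.
reviewed = net_seconds_saved_per_ticket(ACCURACY, 60, 10)

# Context 2: the summary goes straight into a customer-facing reply.
# ASSUMPTION: each wrong summary costs 30 minutes of cleanup (hypothetical).
unreviewed = net_seconds_saved_per_ticket(ACCURACY, 60, 30 * 60)

print(f"with human review:    {reviewed:+.1f} s per ticket")
print(f"without human review: {unreviewed:+.1f} s per ticket")
```

Same model, same accuracy. Under these assumed costs the sign of the result flips depending on what happens after an error.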

The question isn't "does it work?" The question is "what happens when it doesn't?" If you can't answer that, you're not ready to ship.

This was the hardest mental shift for me. In software, my job was to eliminate errors. In AI, my job was to design for the errors that would definitely happen. Fallback flows. Confidence thresholds. Human-in-the-loop checkpoints. The error handling isn't an edge case. It's the core product design decision.
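As a minimal sketch of what "designing for the error" can look like in code, the routing below sends each output down one of three paths based on a confidence score. The names (ModelOutput, route) and both thresholds are hypothetical, and the sketch assumes your model exposes a score between 0 and 1 that is reasonably well calibrated, which you would need to verify for your own system.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    text: str
    confidence: float  # ASSUMPTION: a score in [0, 1] supplied by your model


# Hypothetical thresholds; tune them against reviewed examples from your own data.
AUTO_ACCEPT_AT = 0.90
HUMAN_REVIEW_AT = 0.60


def route(output: ModelOutput) -> str:
    """Decide what happens to an output before any user sees it."""
    if output.confidence >= AUTO_ACCEPT_AT:
        return "auto_accept"    # show the result directly
    if output.confidence >= HUMAN_REVIEW_AT:
        return "human_review"   # human-in-the-loop checkpoint
    return "fallback"           # e.g. show the original ticket instead of a summary


print(route(ModelOutput("Customer reports a billing error.", 0.95)))
print(route(ModelOutput("Customer wants a refund.", 0.72)))
print(route(ModelOutput("Customer is asking about... something.", 0.31)))
```

The point is not these particular numbers. It is that the two lower paths exist at all and were designed on purpose.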

Your MVP will lie to you

In software, an MVP that works in development will, for the most part, work in production. The behavior is the same. The code runs the same way. What you tested is largely what users get.

In AI, the MVP will perform significantly better in your testing than it does in the real world. Every time. Because the data you tested with was clean. The inputs you tried were the ones you expected. The edge cases you thought of are the ones you handled. And the real world has edge cases you can't think of because they come from users doing things that make no sense to you but make perfect sense to them.

I watched a team build a content classification model that hit 94% accuracy on their test set. In production, within two months, it was at 78%. Not because the model broke. Because the content changed. New categories appeared that didn't exist in the training data. Users submitted inputs in formats nobody anticipated. The world moved, and the model stayed where it was.

This isn't a failure of testing. It's the nature of AI systems. They're trained on a snapshot of reality, and reality doesn't hold still. A traditional application does the same thing tomorrow that it did today. A model that's neglected degrades. Not might. Will.
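One simple way to see the world moving is to compare the mix of categories the model was trained on with the mix in a recent sample of production traffic. The sketch below is one possible check, not the method that team used. It assumes you have a small, human-labeled sample of live inputs, and the function names and example categories are made up.

```python
from collections import Counter


def category_shares(labels):
    """Map each category to its share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def drift_report(training_labels, live_labels):
    """Compare the training label mix with a labeled sample of live traffic."""
    train = category_shares(training_labels)
    live = category_shares(live_labels)
    unseen = sorted(set(live) - set(train))
    # Total variation distance: 0.0 means identical mixes, 1.0 means no overlap.
    all_categories = set(train) | set(live)
    distance = 0.5 * sum(
        abs(train.get(c, 0.0) - live.get(c, 0.0)) for c in all_categories
    )
    return {
        "unseen_categories": unseen,
        "unseen_share": sum(live[c] for c in unseen),
        "total_variation": distance,
    }


# Hypothetical data: a category the model has never seen shows up in production.
training = ["news"] * 50 + ["sports"] * 50
live = ["news"] * 40 + ["sports"] * 40 + ["podcast"] * 20
report = drift_report(training, live)
print(report)
```

A nonzero share of unseen categories is inputs the model cannot classify correctly, no matter how good its test-set score was.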

So when your MVP looks great in testing, the right response isn't excitement. It's the question: what's my plan for when this stops being true?

Shipping is not the finish line

In software, shipping is mostly the finish line. You deploy, you monitor for bugs, you move on to the next feature. The shipped thing stays shipped. It works the same way next month as it does today, barring infrastructure changes.

In AI, shipping is the starting line. The model is now running against live data that will drift from training data. User behavior will shift. Source systems will change their schemas. The inputs will evolve in ways the model has never seen. Everything that made your test set clean and predictable will slowly become less true.

I talked to a builder who shipped an AI feature and moved on to the next project. Three months later, users started complaining that the suggestions were "off." He checked the system metrics. Uptime was 99.9%. Latency was fine. Throughput was stable. Everything looked healthy. But the model was confidently returning results that were increasingly wrong, and nothing in his monitoring was designed to catch that.

He was monitoring system health. Nobody was monitoring model health. Those are different things, and the gap between them is where AI products quietly fail.

In software, the thing you shipped is the thing users get. In AI, the thing you shipped is the thing users got today. Tomorrow it might be something different, and you won't know unless you're watching.

This doesn't mean you need an elaborate MLOps pipeline from day one. It does mean you need answers to basic questions before you launch: How will I know if the output quality is degrading? Who checks? How often? What's the trigger for retraining or adjusting? If your plan is "I'll notice when users complain," your plan is to let the product break and hope someone tells you.
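A lightweight answer to those questions can be a few dozen lines. The sketch below keeps a rolling window of human spot-check results and raises a flag when accuracy falls too far below a baseline. The class name, the window size, the minimum sample count, and the allowed drop are all assumptions for illustration. Who does the checking, and how often, is still a decision you have to make.

```python
from collections import deque


class QualityMonitor:
    """Rolling accuracy over recent human spot checks, with a simple trigger."""

    def __init__(self, baseline_accuracy, max_drop=0.05, window=100, min_samples=20):
        self.baseline_accuracy = baseline_accuracy  # e.g. accuracy measured at launch
        self.max_drop = max_drop                    # hypothetical tolerance
        self.min_samples = min_samples              # don't alert on a handful of checks
        self.results = deque(maxlen=window)         # oldest checks fall off automatically

    def record(self, was_correct):
        """Store the verdict of one human spot check."""
        self.results.append(bool(was_correct))

    def rolling_accuracy(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        """True when recent accuracy is more than max_drop below the baseline."""
        if len(self.results) < self.min_samples:
            return False
        return self.rolling_accuracy() < self.baseline_accuracy - self.max_drop


# Hypothetical week: 20 outputs checked by hand, 15 judged correct.
monitor = QualityMonitor(baseline_accuracy=0.94)
for i in range(20):
    monitor.record(i < 15)

print(monitor.rolling_accuracy(), monitor.needs_attention())
```

This would sit next to your uptime and latency dashboards, not replace them. It measures the thing those dashboards cannot see.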

Users don't trust what they can't see

This one surprised me. I assumed users would be impressed by AI doing something automatically. Some are. But a lot of users, especially the ones making important decisions, don't want AI to tell them what to do. They want AI to show them what's happening so they can decide for themselves.

I talked to a founder who built a beautiful AI analytics dashboard. Smart summarization. Automated insights. The AI would look at the data and tell you what to do. After talking to seven potential users, he learned something deflating: they didn't trust AI-generated recommendations. Not because the recommendations were wrong. Because they couldn't see the reasoning. They wanted to understand why the AI said what it said, not just what it said.

His entire product was built on an assumption: "people want AI to tell them what to do." Seven conversations revealed the assumption was backwards. These users wanted to see the data and the reasoning first, then make the call themselves.

This is a pattern I've seen repeatedly. Builders are excited about the AI's ability to make decisions. Users are skeptical of decisions they can't verify. The gap between those two perspectives is where a lot of AI products go to die.

The fix isn't less AI. It's more transparency. Show the reasoning. Show the confidence level. Show what data the answer is based on. Let the user override. Make the AI a collaborator, not an oracle. Most users will accept "the AI thinks X, and here's why" far more readily than "the AI says do X."
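In practice this can start with the shape of the data you return. The sketch below is one hypothetical payload: every recommendation travels with its reasoning, a confidence value, the data it was based on, and room for the user to override it. The field names and the example content are invented, and where the reasoning and confidence come from depends entirely on your model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Suggestion:
    recommendation: str
    reasoning: str                                     # why the AI thinks so
    confidence: float                                  # ASSUMPTION: a score in [0, 1]
    sources: List[str] = field(default_factory=list)   # data the answer is based on
    user_override: Optional[str] = None                # set when the user disagrees

    def final_decision(self) -> str:
        """The user's choice wins over the model's recommendation."""
        if self.user_override is not None:
            return self.user_override
        return self.recommendation

    def render(self) -> str:
        """Plain-text view: 'the AI thinks X, and here's why'."""
        based_on = ", ".join(self.sources) or "no sources recorded"
        return "\n".join([
            f"The AI thinks: {self.recommendation}",
            f"Why: {self.reasoning}",
            f"Confidence: {self.confidence:.0%}",
            f"Based on: {based_on}",
        ])


suggestion = Suggestion(
    recommendation="Pause the spring campaign",
    reasoning="Cost per signup doubled over the last two weeks",
    confidence=0.72,
    sources=["ad spend report", "signup events"],
)
print(suggestion.render())

suggestion.user_override = "Keep the campaign running one more week"
print(suggestion.final_decision())
```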

What I got wrong and what I'd tell a builder

When I moved from building data products to building AI products, I carried over assumptions that cost me months. I assumed deterministic testing patterns would work. They didn't. I assumed shipping meant done. It didn't. I assumed users would trust the outputs if the outputs were accurate. They didn't, because accuracy without explainability is just a black box asking for faith.

If you're a builder shipping your first AI feature, here's what I wish someone had told me:

Design for the wrong answer. Before you optimize the prompt or fine-tune the model, decide what happens when the AI is wrong. The error path is your most important product decision.
Set expectations in the UI. "AI-generated" labels, confidence indicators, "verify this" prompts. These aren't disclaimers. They're trust architecture. Users who know the system is approximate will tolerate errors. Users who expect perfection will leave after the first one.
Build monitoring before you need it. Track output quality, not just system uptime. Even something simple: sample 20 outputs a week and spot-check them yourself. If you catch drift early, it's a tweak. If you catch it after users notice, it's a crisis.
Stop saying "accuracy." It's almost never the right metric on its own. Ask: accurate at what? For whom? What's the cost of a false positive vs. a false negative? An AI that's 95% accurate but wrong in catastrophic ways is worse than one that's 80% accurate but fails safely.
Treat the AI like a junior employee, not a finished product. It needs supervision. It needs guardrails. It gets better with feedback. It will confidently do the wrong thing if you let it. Ship it with training wheels on, and take them off slowly.
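The point about accuracy is easy to check with arithmetic. In the sketch below, every count and cost is hypothetical. Model A is right 95% of the time but most of its errors are the expensive kind. Model B is right only 80% of the time but almost all of its errors are cheap.

```python
def total_error_cost(false_positives, false_negatives, cost_per_fp, cost_per_fn):
    """Total cost of errors, given a separate price for each kind of mistake."""
    return false_positives * cost_per_fp + false_negatives * cost_per_fn


OUTPUTS = 1000
COST_PER_FP = 5     # ASSUMPTION: a false positive is a minor annoyance
COST_PER_FN = 500   # ASSUMPTION: a false negative is a serious miss

# Model A: 50 errors in 1000 outputs (95% accurate), mostly false negatives.
model_a = {"false_positives": 10, "false_negatives": 40}
# Model B: 200 errors in 1000 outputs (80% accurate), almost all false positives.
model_b = {"false_positives": 195, "false_negatives": 5}

for name, errors in (("A", model_a), ("B", model_b)):
    accuracy = (OUTPUTS - sum(errors.values())) / OUTPUTS
    cost = total_error_cost(
        errors["false_positives"], errors["false_negatives"], COST_PER_FP, COST_PER_FN
    )
    print(f"Model {name}: accuracy {accuracy:.0%}, error cost {cost}")
```

With these made-up prices, the 95% model costs 20,050 per thousand outputs and the 80% model costs 3,475. Change the prices and the ranking can flip, which is exactly why accuracy on its own cannot settle the question.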

The skill set from building software is related. It's not the same. The sooner you stop treating AI as "software but smarter" and start treating it as a fundamentally different kind of system, the sooner you'll build something users actually trust.

Software product management is about building the right thing. AI product management is about building something that's right enough, and knowing what "enough" means for your specific users.

Related: The Build Trap Got Cheaper (why AI makes it easier to build the wrong thing faster) and Ten Conversations and One Decision (how to find out what "right enough" means before you build).

Builder's Path is a public lab from Sellhausen AI Systems focused on AI-native building, validation, and product judgment.

Built by Frank Sellhausen