What We Learned Shipping 9 AI Products to Production

Last March, our travel AI sent a user to a resort that doesn't exist.

They'd asked for a road trip itinerary. The AI confidently included "Sunoutdoors Resort" as a lunch stop. Clean formatting. Reasonable drive time. It looked perfect.

The resort was made up. If they'd followed the route, they'd have driven 50 miles to an empty parking lot.

No errors. No warning. Just a fake place, presented as fact. That's when I realized we had no idea what our agents were actually doing in production.

Wrong Answers That Look Right

The resort incident wasn't unique. We kept finding variations of the same problem.

A voice AI was transcribing meetings. When audio was unclear, instead of flagging it, the system invented notes. Made up action items. Fabricated decisions. A user followed up on an action item that was never assigned. That was a fun support ticket.

A customer service agent was looking up hotel prices. Three similar rooms, three different prices. The agent returned the same price for all of them. We only caught it because a user complained they were charged more than quoted.

A knowledge assistant had the right answer in its context and still got it wrong. The user asked "Who founded [company]?" The retrieved chunks contained the answer. The model gave a different name. Retrieval working doesn't mean the answer is right.

The pattern: AI doesn't fail loudly. It fails with confidence. There's no stack trace for "wrong but plausible."

Execution That Silently Breaks

Tool calling looks simple until you ship it.

We had both Gmail and Outlook integrations.

User said "send email." Agent picked Gmail. User's actual email client was Outlook. Message never sent. User blamed us. Rightly so.

A scheduling request:

"Find me 30-minute slots with @Sharbel." The agent should have called "Find Slot," then "Create Meeting." Instead, it skipped straight to "Create Meeting." Booked a meeting without checking if the slot was free.

This one still bothers me:

A customer service agent told a frustrated user "I'll let the manager know about this escalation." The user calmed down. Felt heard. The escalation never happened. The agent said the words. It didn't take action. We found out a week later when the customer came back, angrier.

Small Changes, Big Failures

We removed one word from a prompt. The agent's entire tone shifted. Became curt. Users noticed before we did.

We switched from GPT-3.5 to a newer model. Same prompt. Output started including triple quotes in the HTML. Broke our parser. No error from the model—it thought it was being helpful.

Same prompt on Gemini 2.5 Pro versus Flash gives different results. Not slightly different. Different enough to fail our tests. Prompts aren't portable.
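
One way to pin that down is a cross-model regression test: run the same prompt against every model you ship on and assert the same output contract. A sketch, where call_model is a hypothetical wrapper and the contract check stands in for whatever your real parser enforces:

```python
# Sketch of a cross-model prompt regression test. `call_model` is a
# hypothetical wrapper around your SDKs; the contract check is a stand-in
# for the real parser production depends on.

MODELS = ["gemini-2.5-pro", "gemini-2.5-flash"]

def output_meets_contract(text: str) -> bool:
    # The same rules production relies on: no stray triple quotes,
    # tags balanced. Swap in your actual HTML parser here.
    return '"""' not in text and text.count("<") == text.count(">")

def test_prompt_is_portable(call_model, prompt: str):
    # The same prompt and the same assertions, run against every model.
    for model in MODELS:
        output = call_model(model=model, prompt=prompt)
        assert output_meets_contract(output), f"{model} broke the output format"
```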

Why Debugging Takes Forever

Traditional debugging: error in logs → find the line → fix it.

AI debugging:

"The answer seemed off" → Logs show 200 OK → Run it again, different answer → Which step failed? All look fine → Retrieval? Reasoning? Tool selection? Context? → Hours later, still guessing

There's no breakpoint for "confident but wrong."

What Actually Moved the Needle

After nine products, I can point to three shifts that mattered.

First: trace decisions, not just outputs.

We needed to see why the model picked Gmail over Outlook. Why it skipped the "Find Slot" step. Why it ignored the user's change of mind.

Output logs tell you what happened. They don't tell you why.

Once we could see the full reasoning chain — retrieval results, tool selection logic, response generation — debugging went from hours to minutes.
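
The mechanics matter less than the habit: record what the agent saw next to what it chose, at every step. A rough sketch of the idea (the helper, step names, and data are illustrative, not our actual tracing code):

```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class DecisionTrace:
    """Records each decision an agent makes, not just its final output.
    Illustrative sketch, not a specific tracing SDK."""
    request_id: str
    steps: list = field(default_factory=list)

    def record(self, step: str, decision: str, evidence: dict) -> None:
        # Keep what the agent saw next to what it chose.
        self.steps.append({
            "ts": time.time(),
            "step": step,          # e.g. "retrieval", "tool_selection", "generation"
            "decision": decision,  # what the agent chose
            "evidence": evidence,  # the inputs that drove the choice
        })

    def dump(self) -> str:
        return json.dumps({"request_id": self.request_id, "steps": self.steps}, indent=2)

# Usage: record every step as the agent runs.
trace = DecisionTrace(request_id="req-123")
trace.record(
    "tool_selection",
    decision="gmail.send_email",
    evidence={"candidates": ["gmail.send_email", "outlook.send_email"],
              "user_default_client": "outlook"},  # the mismatch is now visible
)
print(trace.dump())
```

With the evidence stored beside the decision, "why Gmail and not Outlook?" has an answer in the trace instead of a guess.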

But we were still debugging. Still reactive. Still finding issues because users reported them.

Second: evaluate outputs, not just execution.

"Did it run?" isn't the question. "Was it right?" is.

The knowledge assistant returned a wrong founder name even though retrieval worked perfectly. 200 OK. All steps completed. Wrong answer.

We started evaluating outputs against expected behavior. Not just "did the API return?" but "did the response match the retrieved context?" Not just "did it call the tool?" but "did it call the right tool with the right parameters?"

The fake resort would have been caught. The wrong hotel price would have been caught. Evaluation is the difference between "it ran" and "it worked."
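
The checks themselves can start embarrassingly simple. A minimal sketch, with naive claim matching standing in for a real evaluator (an entailment model or LLM judge), and made-up names and data:

```python
import re

def extract_claims(answer: str) -> list[str]:
    """Naive claim extraction: proper nouns and numbers. A stand-in for a
    real entailment check or LLM judge."""
    return re.findall(r"\b[A-Z][a-zA-Z]+\b|\b\d[\d,.]*\b", answer)

def answer_is_grounded(answer: str, retrieved_chunks: list[str]) -> bool:
    """Output-level check: every specific thing the answer asserts should
    appear somewhere in the retrieved context."""
    context = " ".join(retrieved_chunks).lower()
    return all(claim.lower() in context for claim in extract_claims(answer))

def tool_call_is_correct(name: str, args: dict, expected: str, required_params: set[str]) -> bool:
    """Check the right tool was called with the required parameters,
    not just that some tool was called."""
    return name == expected and required_params <= set(args)

# The founder case: the right name was in the retrieved chunks,
# the model answered with a different one. (Names here are invented.)
chunks = ["Acme Corp was founded by Dana Reyes in 2011."]
assert answer_is_grounded("Acme Corp was founded by Dana Reyes.", chunks)
assert not answer_is_grounded("Acme Corp was founded by John Smith.", chunks)
```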

Third: simulate full conversations before shipping.

This was the big one.

We started running agents through realistic conversation flows. Multi-turn. Users who change their minds. Users who add context mid-conversation. Users who contradict themselves. Edge cases that only show up when you simulate the full journey.

The scheduling agent that skipped steps? Simulation caught it. The escalation that never happened? Simulation caught it. The Gmail/Outlook confusion? Simulation caught it.

You can't unit test your way to reliable agents. You have to simulate how users actually behave.
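
In practice these live as ordinary tests around a simulated conversation. A sketch of the pattern; the agent harness, its send() and tool_calls interface, and the tool names are hypothetical:

```python
# Pytest-style sketch. `agent` stands for whatever your agent harness exposes:
# send() returns the reply text, tool_calls is a log of the calls it made.

def test_scheduling_checks_availability_before_booking(agent):
    agent.send("Find me 30-minute slots with Sharbel next week.")
    agent.send("Actually, make it 45 minutes.")   # mid-conversation change of mind
    agent.send("Book the earliest one.")

    tools = [call.name for call in agent.tool_calls]
    assert "find_slot" in tools, "never checked availability"
    if "create_meeting" in tools:
        assert tools.index("find_slot") < tools.index("create_meeting"), \
            "booked a meeting without checking for a free slot first"

def test_escalation_actually_happens(agent):
    reply = agent.send("This is the third booking that broke. I want a manager.")
    if "escalat" in reply.lower() or "manager" in reply.lower():
        # Saying the words isn't enough; the escalation tool has to fire.
        assert "escalate_to_manager" in [call.name for call in agent.tool_calls], \
            "promised an escalation but never triggered one"
```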

The Reality

Every product we shipped had bugs we found in production. That's the nature of AI. But there's a gap between "found it in logs" and "found it from users." That gap is everything.

Tracing tells you what happened. Evaluation tells you if it was right. Simulation tells you before users do.

We built Netra to close that gap.