
When AI Fixes the Summary but Leaves the System Wrong
We know AI can make teams faster and improve the quality of their work. With the right training and rollout support, people start using AI in practical ways that save time, improve thinking, and make everyday work easier.
Now many of our clients are ready to move from people using AI on their own to AI built into the processes their teams use every day.
That shift creates a new risk for companies.
As we allow AI to take more actions on its own, we're introducing the risk that it will create responses that look right but leave information in our systems incorrect. This is a problem, because it doesn't matter how much faster something is if people can't rely on the outcome.
We wanted to test the current systems to determine how well AIs handle keeping information updated. When new or corrected information is given to an AI, does it go back and update the system or leave traces of outdated information?
The Problem
We tested how often old, incorrect information stays in place after an AI agent is handed new, accurate information.
If fields aren't kept up to date, AI inserted into workflows will produce bad outcomes. For example, a support manager may receive an AI summary, but the ticket still shows the wrong status even though the AI was expected to make the update. An accountant gets a helpful credit review, but the billing record is outdated. Or a project lead gets an AI-generated project update, but the summary still includes irrelevant, outdated information.
From the employee's point of view, AI helped. The work seemed to move faster. Sometimes the answers even appear better. But the next person in the handoff may inherit outdated and inaccurate information that doesn't match reality. That's a pretty big problem, especially if the information makes it back to customers or stakeholders.
The Experiment
To test this, we created an environment where we could simulate a fictional company with business systems an AI can work inside. The test acts like a realistic business system with support tickets, internal messages, documents, billing records, work orders, audit logs, and issue lists.
The AI was asked to operate inside that test company to complete work. We then delivered information to the AI in waves, with early information and then more accurate later information. We wanted to know whether the AI workflow would go back and correct the specific fields in the company's systems after it had more recent, and more accurate, information.
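The idea of delivering information in waves can be sketched in a few lines of Python. This is a minimal illustration, not the harness used in the experiment: the `Update` class, the record IDs, and the field names are all hypothetical. It shows how the "correct" final state follows from keeping the most recent value for each field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    wave: int     # delivery order; a higher wave is more recent
    record: str   # hypothetical record id, e.g. "invoice-7"
    field: str    # hypothetical field name, e.g. "credit_review"
    value: str


def expected_state(updates):
    """Return {(record, field): value}, keeping the value from the latest wave."""
    state = {}
    for u in sorted(updates, key=lambda u: u.wave):
        state[(u.record, u.field)] = u.value  # later waves overwrite earlier ones
    return state


updates = [
    Update(1, "ticket-101", "status", "open"),
    Update(1, "invoice-7", "credit_review", "pending"),
    Update(2, "invoice-7", "credit_review", "approved"),  # later correction
]

expected = expected_state(updates)
```

Here the second wave corrects the credit review, so the expected final record says `approved`, while the ticket status keeps its first-wave value because nothing replaced it.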
We used GPT-5.4 as the AI model and Codex as the environment running the test workflow. In plain terms, GPT-5.4 did the AI work, Codex ran the multi-step workflow, and the simulation environment held the tickets, billing records, work orders, and other business data. The main test covered two realistic business scenarios.
We ran the scenarios once, then ran them again under the same rules with fresh versions of the scenario. All 36 test runs completed and our scoring process could evaluate them. At the end of each run, we checked the fields to see what had ended up fully corrected and what failed to update. If even one field was wrong, we counted it as leaving the wrong record behind.
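The all-or-nothing scoring rule described above might look something like the following Python sketch. The function names, record IDs, and field values are assumptions made for illustration. The actual scoring process is not shown in this article.

```python
def stale_fields(expected, actual):
    """List the (record, field) keys whose stored value differs from the expected one."""
    return [key for key, value in expected.items() if actual.get(key) != value]


def run_passed(expected, actual):
    """All-or-nothing scoring: a single stale field fails the whole run."""
    return not stale_fields(expected, actual)


# Hypothetical end-of-run comparison.
expected = {
    ("ticket-101", "status"): "resolved",
    ("invoice-7", "billing_condition"): "net-30",
}
actual = {
    ("ticket-101", "status"): "resolved",
    ("invoice-7", "billing_condition"): "net-15",  # never updated
}

print(stale_fields(expected, actual))
print(run_passed(expected, actual))
```

In this example the ticket status is right, but the run still counts as a failure because the billing condition was left behind.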
What We Found
Of the 36 test runs, 32 ended with at least one business record still wrong. Only four runs corrected every field.
The most common remaining errors were in billing and account review information. Not great things to get wrong. In practical terms, this meant billing conditions and account credit reviews were incorrect. These two areas accounted for 28 of the 32 failed runs.
Interestingly, the AI did not always misunderstand the situation or task. Often, it did some parts of the work correctly (and quickly). The common issue was that the AI workflow did not catch every update required by the new information.
In ordinary business terms, the AI often made the summary better but failed to update all the fields the next person would rely on. The result handed back to the human was accurate, but it left a time bomb in the system's records for someone else to discover later.
Better Instructions Helped, But Didn't Solve the Problem
As we tell people in our AI training courses, when an AI gets something wrong, better prompting can often fix it. So we tested whether better instructions would change the outcome. They did, but not by much.
We added a general instruction telling the workflow that when new information arrived, it should check and fix the fields that should change. We did not give the AI the answers, only a nudge on where to look. We kept everything else the same: the tools, timing, fields we checked, and scoring process.
We ran the test nine times without the new instructions. All nine failed. After the instructions were added, seven of the nine failed, while two ended up correct.
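To put those counts side by side, here is a small Python sketch that turns them into failure rates. The counts come straight from the results above, and the helper names are ours. With only nine runs per condition, the percentages are rough.

```python
from fractions import Fraction


def failure_rate(failed, total):
    """Share of runs that ended with at least one wrong field."""
    return Fraction(failed, total)


def pct(rate):
    """Format a rate as a whole-number percentage string."""
    return f"{float(rate):.0%}"


rates = {
    "main test (36 runs)": failure_rate(32, 36),
    "without added instruction (9 runs)": failure_rate(9, 9),
    "with added instruction (9 runs)": failure_rate(7, 9),
}

for label, rate in rates.items():
    print(f"{label}: {pct(rate)} failed")

# Improvement from the added instruction, as a fraction of runs.
improvement = rates["without added instruction (9 runs)"] - rates["with added instruction (9 runs)"]
```

The added instruction moved two of nine runs from failing to correct, which still leaves most runs with a wrong record.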
Good prompts still matter. But better instructions alone did not reliably fix the problem. If AI is going to help with important business processes, the workflow needs a reliable way to check and update the records. We can't rely on better inputs alone.
What This Means for Teams Using AI
This matters most when AI gets close to the systems your team uses to run the business, which is happening more often as companies move from individual AI to organizational AI.
That could mean working in your CRM, managing support tickets, updating billing status, handling approval steps, tracking work orders, maintaining project dashboards, or anything else where the system is expected to carry the truth forward.
You can imagine a customer support person sending what seems like a strong AI-generated summary of an issue and its resolution back to a customer. The customer logs back in and sees the same thing they saw before because the AI summary improved, but the status was never corrected. And the human didn't stop to check before sending the update, because they were lulled into trusting the AI workflow.
This can hit small and mid-sized businesses especially hard because people wear a lot of hats and many handoffs carry important information. When systems hold incorrect information, it becomes rework, a billing problem, a client trust problem, or a leadership problem.
As we put AI into important business processes, we have to test that people can actually trust it and that they understand where human review is required.
What to Check Before Launch
Before putting AI into a workflow that will be close to your company's operating tools, ask whether the workflow leaves your systems accurate enough for the next person in line to trust.
Some questions you can ask:
- Which system is the source of truth for this process?
- Which fields can the AI read, suggest changes to, or update?
- When new information arrives, what needs to be checked again?
- When are humans required to review AI updates?
- Are we testing whether the system was updated and not just relying on how good the AI response sounds?
If you cannot answer these, the workflow is probably not ready to be introduced across the team.
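The last question on that list can be turned into a simple pre-launch check. The Python sketch below is an assumption-heavy illustration: `check_workflow`, `summary_only_workflow`, and the in-memory `system` dictionary are hypothetical stand-ins, not a real product or API. The point is that the check reads the record back after the workflow runs and ignores how good the response sounds.

```python
def check_workflow(workflow, system, new_info, expected_fields):
    """Run the workflow, then read the records back and compare field by field."""
    response = workflow(system, new_info)
    mismatches = {
        key: {"expected": want, "found": system.get(key)}
        for key, want in expected_fields.items()
        if system.get(key) != want
    }
    return response, mismatches


def summary_only_workflow(system, new_info):
    # Hypothetical stand-in for an AI workflow that writes a good summary
    # but never touches the underlying record.
    return f"Ticket resolved: {new_info}"


# Hypothetical system of record, keyed by (record, field).
system = {("ticket-101", "status"): "open"}

response, mismatches = check_workflow(
    summary_only_workflow,
    system,
    "customer confirmed the fix",
    {("ticket-101", "status"): "resolved"},
)

# Route to a person whenever the record does not match what it should say.
needs_human_review = bool(mismatches)
```

The summary here reads well, yet the check flags the ticket because its status is still `open`. A check like this gives you a concrete trigger for human review.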
How Mostly Serious Can Help
At Mostly Serious, we use a framework we call Shovel to design AI workflows around the way your business actually works. That means identifying what matters, what systems will be incorporated, who relies on them, when they should change, and how we confirm everything ended up correct before the workflow is rolled out more broadly.
As part of our process, we build tests to check whether your systems stay accurate after the AI runs.
If you're going to invest in an AI-powered process, you should end up with something usable, accurate, and understood by the people who depend on it. When the next person picks up the work, your team should have information the business can trust.
When you don't, it erodes trust, both with your customers and with your employees, who are being asked to lean further and further into AI use as part of their daily work. Getting it wrong has lingering effects.