What Happens When AI Meets the People It Was Never Trained On

[Image: An illustrated robot stands on a grassy hillside at dawn, looking out over a town in the valley below, representing an AI model facing the real-world conditions it must operate in.]

Titan, Amazon's internal large language model and foundation for future developer and policy tools, performed well in controlled testing. The development team had trained and evaluated it against clean, carefully written prompts drawn mostly from US-centric inputs. In a lab, that looked promising. In a global workforce, it was the wrong target.

Amazon's employees do not prompt like researchers. They write fragments, paste ticket language, use regional shorthand, and ask contradictory questions without realizing it. In high-trust enterprise applications, where Titan would interpret policy, guide engineers, and support workflow automation, a confident wrong answer is more dangerous than a visible failure. Enterprise AI trust is not built by answering every question. It is built by knowing when not to answer. The evaluation process had no way to measure that.

The instinct was to improve the model

When AI systems underperform, the default move is to tune the model. That instinct is usually right, but it assumes the test set reflects the user. Here, it did not. The evaluation benchmarks were clean. The workforce was not. Tuning against a clean test set would have optimized Titan for a user population that did not exist.

Hubert saw the flaw. The question was not "how does Titan perform on ideal inputs?" but "how does Titan behave when real people show up with real problems?" Answering that required building an evaluation tool from scratch.

200 personas turned messy communication into a structured test

Hubert led a small team (a solutions architect and an engineer) to build the data set the evaluation was missing. They scraped non-restricted internal sources, including support tickets, recorded town halls stretching back to the early 2000s, wikis, and public Slack groups, to map how Amazon employees actually communicated across roles and regions. They supplemented that with synthetic data generated to cover plausible email and support conversations the scraped sources did not capture. The result was a global picture of how real people typed, asked, complained, and requested things at work.

[Image: An illustrated robot sits at a desk in front of a wall of human persona profiles, representing an AI persona testing framework.] Caption: Two hundred personas turned real employee communication patterns into a repeatable way to test how Titan handled ambiguity, context, and regional language.

From that research, Hubert built a library of 200 personas and sub-personas. Each was a behavioral test case. Prompts were written in each persona's actual style: vague, indirect, contradictory, non-linear. Some buried the real ask. Some required the model to surface uncertainty rather than guess. That last requirement was the point. A model that fills in gaps with confidence, rather than admitting it does not know, is a liability in enterprise use.
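A persona-based behavioral test case of the kind described above might be represented roughly as follows. This is a minimal sketch under stated assumptions, not the team's actual tooling: the `PersonaCase` fields, the `expects_uncertainty` flag, the hedge phrase list, the example prompts, and the stub model are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical phrases treated as signs that a response surfaced uncertainty.
HEDGE_MARKERS = ("i'm not sure", "i don't know", "could you clarify", "need more context")


@dataclass(frozen=True)
class PersonaCase:
    """One behavioral test case: a prompt written in a persona's own style."""
    persona: str               # illustrative label, e.g. a role plus a writing habit
    prompt: str                # vague, indirect, or contradictory by design
    expects_uncertainty: bool  # True if the right behavior is to ask or admit a gap


def surfaces_uncertainty(response: str) -> bool:
    """Crude keyword heuristic; a real evaluation would likely need human or model grading."""
    text = response.lower()
    return any(marker in text for marker in HEDGE_MARKERS)


def passes(case: PersonaCase, model: Callable[[str], str]) -> bool:
    """Pass when the model hedges exactly when the case calls for it.

    This checks abstention behavior only, not whether a confident answer is correct.
    """
    return surfaces_uncertainty(model(case.prompt)) == case.expects_uncertainty


def overconfident_stub(prompt: str) -> str:
    """Stand-in for a model that always answers confidently."""
    return "The policy allows it. Proceed."


# Invented example cases, not drawn from the real persona library.
cases = [
    PersonaCase("warehouse lead, fragments", "badge thing broken again?? same as last time", True),
    PersonaCase("engineer, direct ask", "What is the default retry count in the deploy config?", False),
]
results = [passes(case, overconfident_stub) for case in cases]
```

With the always-confident stub, the first case fails because the prompt lacks the context needed to answer, while the second passes. Encoding the expected behavior in the case itself is one way to make "knowing when not to answer" measurable.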

[Image: An illustrated robot looks into a mirror that reflects a human face, representing AI testing based on real user personas.] Caption: The persona library reframed model evaluation around the human behind the prompt: their role, region, habits, uncertainty, and shorthand.

Failure patterns clustered

Running Titan through the persona library produced a map, not a complaint. Hallucinations clustered around prompts with missing context. Retrieval failures concentrated in messy or emotionally loaded inputs, where the model drifted from source material. And Titan treated ambiguity as a formatting problem instead of a reasoning problem.

Findings went to the product, UX, and science teams by persona type, failure mode, and likely cause, with specific fixes attached to each. The library did not replace technical evaluation. It told the technical teams where to look.
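Grouping results by persona type and failure mode, as described above, can be sketched with a simple tally. The records below are made-up placeholders rather than Titan's evaluation data, and the labels and function names are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical records: (persona_type, failure_mode), with None meaning the case passed.
records = [
    ("missing-context", "hallucination"),
    ("missing-context", "hallucination"),
    ("emotionally-loaded", "retrieval-drift"),
    ("missing-context", None),
    ("contradictory", "ambiguity-mishandled"),
    ("emotionally-loaded", "retrieval-drift"),
    ("emotionally-loaded", None),
]


def failure_map(rows):
    """Count failures per (persona_type, failure_mode), ignoring passing cases."""
    return Counter((persona, mode) for persona, mode in rows if mode is not None)


def failure_rate_by_persona(rows):
    """Share of cases that failed, per persona type."""
    totals = Counter(persona for persona, _ in rows)
    failed = Counter(persona for persona, mode in rows if mode is not None)
    return {persona: failed[persona] / totals[persona] for persona in totals}


clusters = failure_map(records)
rates = failure_rate_by_persona(records)

# Print the largest clusters first so teams see where to look.
for (persona, mode), count in clusters.most_common():
    print(f"{persona:20s} {mode:22s} {count}")
```

A table like this does not explain why a failure happens. It only shows where failures concentrate, which matches the library's stated role of telling the technical teams where to look.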

Results

The hallucination rate dropped from 27% to 15% after tuning against the persona set
Retrieval adherence improved from 52% to 70%
The 200-persona library became Titan's primary behavioral evaluation tool, shaping safety reviews and demo preparation across customer types
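To put the two reported figures in both absolute and relative terms, the arithmetic can be worked as below. Only the four percentages come from the results above; the `change` helper is a hypothetical name introduced for this sketch.

```python
def change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after - before
    relative = (after - before) / before * 100
    return absolute, relative


hallucination = change(27, 15)  # -12 points, about -44.4% relative
adherence = change(52, 70)      # +18 points, about +34.6% relative

print(f"Hallucinations: {hallucination[0]:+.0f} points ({hallucination[1]:+.1f}% relative)")
print(f"Retrieval adherence: {adherence[0]:+.0f} points ({adherence[1]:+.1f}% relative)")
```

In other words, the hallucination rate fell by 12 percentage points, a reduction of roughly 44% from its starting level, and retrieval adherence rose by 18 points, roughly 35% above its starting level.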

[Image: An illustrated yellow robot stands in a doorway, looking out toward a forest path, representing an AI model moving from controlled testing into real-world use.] Caption: By the end of the evaluation work, Titan was no longer being tested against ideal prompts. It was ready to face the messier world it had been built to serve.

Why it matters

Most enterprise AI deployments fail quietly. The model works in the demo and struggles with the workforce. That gap is rarely a capability problem. It is a testing problem: the evaluation set reflects the developer's assumptions, not the user's habits. Any organization building AI for a global, mixed-experience workforce faces the same flaw. The fix starts with a harder test set, not a smarter model. That approach is portable to every enterprise deployment where the user population is messier than the training data assumed, which is most of them.