AI MODEL TRAINING ยท LARGE LANGUAGE MODELS
What Happens When AI Meets the People It Was Never Trained On
Titan, Amazon's internal large language model and foundation for future developer and policy tools, performed well in controlled testing. The development team had trained and evaluated it against clean, carefully written prompts drawn mostly from US-centric inputs. In a lab, that looked promising. In a global workforce, it was the wrong target.
Amazon's employees do not prompt like researchers. They write fragments, paste ticket language, use regional shorthand, ask contradictory questions without realizing it. In high-trust enterprise applications, where Titan would interpret policy, guide engineers, and support workflow automation, a confident wrong answer is more dangerous than a visible failure. Enterprise AI trust is not built by answering every question. It is built by knowing when not to. The evaluation process had no way to measure that.
The instinct was to improve the model
When AI systems underperform, the default move is to tune the model. That instinct is usually right, but it assumes the test set reflects the user. Here, it did not. The evaluation benchmarks were clean. The workforce was not. Tuning against a clean test set would have optimized Titan for a user population that did not exist.
Hubert saw the flaw. The question was not "how does Titan perform on ideal inputs?" but "how does Titan behave when real people show up with real problems?" Answering that required building an evaluation tool from scratch.
200 personas turned messy communication into a structured test
Hubert led a small team, a solutions architect and an engineer, to build the data set the evaluation was missing. They scraped non-restricted internal sources: support tickets, recorded town halls stretching back to the early 2000s, wikis, and public Slack groups, mapping how Amazon employees actually communicated in writing across roles and regions. They supplemented that with synthetic data generated to cover plausible email and support conversations the scraped sources did not capture. The result was a global picture of how real people typed, asked, complained, and requested things at work.
From that research, Hubert built a library of 200 personas and sub-personas. Each was a behavioral test case. Prompts were written in each persona's actual style: vague, indirect, contradictory, non-linear. Some buried the real ask. Some required the model to surface uncertainty rather than guess. That last requirement was the point. A model that fills in gaps with confidence, rather than admitting it does not know, is a liability in enterprise use.
Failure patterns clustered
Running Titan through the persona library produced a map, not a complaint. Hallucinations clustered around prompts with missing context. Retrieval failures concentrated in messy or emotionally loaded inputs, where the model drifted from source material. Ambiguity was handled as a formatting problem instead of a reasoning problem.
Findings went to the product, UX, and science teams by persona type, failure mode, and likely cause, with specific fixes attached to each. The library did not replace technical evaluation. It told the technical teams where to look.
Results
- Hallucinations dropped from 27% to 15% after tuning against the persona set
- Retrieval adherence improved from 52% to 70%
- The 200-persona library became Titan's primary behavioral evaluation tool, shaping safety reviews and demo preparation across customer types
Why it matters
Most enterprise AI deployments fail quietly. The model works in the demo and struggles with the workforce. That gap is rarely a capability problem. It is a testing problem: the evaluation set reflects the developer's assumptions, not the user's habits. Any organization building AI for a global, mixed-experience workforce faces the same flaw. The fix is to make the test set harder, not the model smarter. That is portable to every enterprise deployment where the user population is messier than the training data assumed, which is most of them.