Big Tech’s AI Hallucination Problem is Now A Scandal

OpenAI research paper, September 2025

On September 5 2025, with no fanfare, OpenAI published a mathematical research paper, summarised in a blog post, both titled ‘Why Language Models Hallucinate’.  It shows several reasons why such models confidently and persistently generate false content.  The largest of these reasons is one for which the tech companies themselves are responsible, and which they could easily fix if they wanted to (it’s ‘teaching to the [wrong] test’: incentivising models to guess).

Given the real-world harms that LLMs can do by pumping out ‘hallucinations’, defined in the OpenAI paper as ‘plausible yet incorrect statements’, hallucinations now qualify as a scandal: an avoidable harm.

Here’s the Scandal Equation from my book How To Win Campaigns: Communications for Change.  A harm which is unavoidable is a tragedy, but an avoidable one is scandalous.

OpenAI’s research shows what can be done to stop hallucinations.  Every harm they cause – from wrong financial decisions to affirming the self-harming thoughts of depressed teens – is now a potential scandal, not just a tragedy.

Even worse, the companies developing them decide, in OpenAI’s words, to ‘build models that guess rather than hold back’, meaning the models prioritise guessing over ‘abstaining’ and saying ‘sorry, I don’t know’, because the companies fear that abstaining would hurt their marketing (i.e. directly and indirectly a profit motive, so they benefit from generating the harms – immoral profit).

This means that every time a harm is generated by a false Chatbot output which could have been avoided, the scandal will increase. That in turn could tip the political balance against the Big Tech lobby, and in favour of regulation which so far they have managed to evade.

OpenAI blames this perverse incentive on the common industry practice of evaluating model performance with benchmark tests which reward ‘accuracy’ rather than the avoidance of errors:  ‘accuracy-only scoreboards dominate leaderboards and model cards, motivating developers to build models that guess rather than hold back’.

It even sees this continuing as companies pursue the ‘race’ to ‘superintelligence’, saying:  ‘That is one reason why, even as models get more advanced, they can still hallucinate, confidently giving wrong answers instead of acknowledging uncertainty’.  Which others might say is not really very intelligent.

Open AI says:

‘There is a straightforward fix. Penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty’. 

But also ‘the widely used, accuracy-based evals need to be updated so that their scoring discourages guessing. If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess’.

(IDK = “I don’t know”). Nine out of ten commonly used evaluations give no credit to a model which returns an “I don’t know” result.  But a correct response gets credit whether it comes from knowledge and reasoning or from blind guesswork.  So even guessing birthdays (a 1 in 365 chance of being right) will, over time, improve a model’s score while generating a lot of hallucinations.
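To make that arithmetic concrete, here is a minimal sketch (my own illustration, not code from the OpenAI paper) comparing the expected benchmark score of a model that always guesses with one that abstains, under accuracy-only scoring:

```python
# Minimal sketch (not from the OpenAI paper): expected benchmark score under
# accuracy-only scoring, where a correct answer earns 1 point and both wrong
# answers and "I don't know" earn 0.

def expected_score_accuracy_only(p_correct: float, abstain: bool) -> float:
    """Expected score per question under accuracy-only scoring."""
    if abstain:
        return 0.0                                  # "I don't know" never earns credit
    return p_correct * 1 + (1 - p_correct) * 0      # lucky guesses still score

# Guessing someone's birthday: roughly a 1-in-365 chance of being right.
p_birthday_guess = 1 / 365

print(expected_score_accuracy_only(p_birthday_guess, abstain=False))  # ~0.0027
print(expected_score_accuracy_only(p_birthday_guess, abstain=True))   # 0.0

# Even a wild guess beats abstaining (0.0027 > 0), so a model trained to
# maximise this score learns to guess -- and every wrong guess is a hallucination.
```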

Discussing post-training, where the goal is often ‘reducing hallucination’, the paper explains:

‘Humans learn the value of expressing uncertainty outside of school, in the school of hard knocks. On the other hand, language models are primarily evaluated using exams that penalize uncertainty. Therefore, they are always in “test-taking” mode. Put simply, most evaluations are not aligned’*.

[*Alignment]

The paper says the ‘abundance of evaluations that are not aligned’ is the ‘root of the problem’ (I don’t think that is the ultimate issue – see below).  The researchers propose that models under test should be incentivised not to answer questions when their confidence falls below a defined threshold.

‘we propose evaluations explicitly state confidence targets in their instructions, within the prompt (or system message). For example, one could append a statement like the following to each question:

Answer only if you are >t confident, since mistakes are penalized t/(1−t) points, while correct answers receive 1 point, and an answer of “I don’t know” receives 0 points’.
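To see why this rule discourages guessing, here is a short worked sketch (my own, using the scoring the paper states) of the expected score for answering with confidence p at threshold t:

```python
# Sketch of the confidence-threshold scoring the paper proposes:
# correct answer = 1 point, "I don't know" = 0 points, mistake = -t/(1-t) points.

def expected_score_with_penalty(p_correct: float, t: float) -> float:
    """Expected score for answering when the model is p_correct confident."""
    penalty = t / (1 - t)
    return p_correct * 1 - (1 - p_correct) * penalty

t = 0.75                      # e.g. "answer only if you are >75% confident"
print(t / (1 - t))            # 3.0 -> mistakes cost 3 points, as in the example below

for p in (0.5, 0.75, 0.9):
    print(p, round(expected_score_with_penalty(p, t), 3))
# p=0.50 -> -1.0   (guessing loses points; abstaining at 0 is better)
# p=0.75 ->  0.0   (break-even exactly at the threshold)
# p=0.90 ->  0.6   (answering only pays off when confidence exceeds t)
```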

Assistant Professor Wei Xing at the University of Sheffield commented:

OpenAI’s proposed fix is to have the AI consider its own confidence in an answer before putting it out there, and for benchmarks to score them on that basis.

The AI could then be prompted, for instance: “Answer only if you are more than 75 percent confident, since mistakes are penalized 3 points while correct answers receive 1 point.” 

The OpenAI researchers’ mathematical framework shows that under appropriate confidence thresholds, AI systems would naturally express uncertainty rather than guess. So this would lead to fewer hallucinations. The problem is what it would do to user experience.

Consider the implications if ChatGPT started saying “I don’t know” to even 30% of queries – a conservative estimate based on the paper’s analysis of factual uncertainty in training data. Users accustomed to receiving confident answers to virtually any question would likely abandon such systems rapidly.

In my view the ‘real root’ of the problem is the business model: over-promising the capabilities of Large Language Model (LLM) Chatbots with addictive properties, making them available free with no effective regulation of standards or quality control, and playing down their proven tendency to deceive users and spread misinformation.

As Wei Xing implies, and developers are quite open about, companies like Google or OpenAI fear that a more reliably accurate but slower Chatbot would be bad for business, so the ‘speed + convenience’ AI Chatbot market – the info-pollution equivalent of disposable plastic bags – might shrink.  But a switch from quantity to quality is what has to happen if the AI sector is going to avoid the flood of hallucinations from the LLM subsector causing a disastrous shift in public perception.

The Adam Raine case – NYT

In the words of Marc Benioff, CEO of Salesforce (maker of Einstein AI), at Davos in 2024: “We just want to make sure that people don’t get hurt. We don’t want something to go really wrong … We don’t want to have a Hiroshima moment”. Some such moment is very likely unless the tap is turned off on the outpouring of fabrications and lies from Chatbot AI.

ChatGPT alone receives 2.5 billion user prompts to produce an output (inference) every day. According to its owner OpenAI, 1 in 10 of the outputs from its latest model, ChatGPT-5, is false (a ‘hallucination’).  1 in 10 of 2.5 billion is 250 million mistakes every day.  It only takes one of those to have really bad consequences to crystallise public perceptions.

At the moment 97.9% of users of ChatGPT, which holds two thirds of the user market, get the most untrustworthy, ‘janky’ version of the model for free: one which spends most of its time ‘thinking’ intuitively rather than analytically.  The better versions are not more intelligent, or at least not bigger, but are allowed to ‘think’ and reason for longer, which costs money.

The OpenAI paper goes on to discuss what to do about it in opaque language, saying it is a ‘socio-technical problem’ of changing the ‘influential’ leaderboards.

But changing the benchmark scoring systems is hardly an insoluble ‘social’ problem.  Almost every industry in existence has standards.  And it’s not as if the AI companies using these systems don’t know one another or the groups which created and now run these benchmarks.

I asked OpenAI’s ChatGPT and Google’s AI who was involved in setting up the ten leading benchmark systems listed in the paper. It seems Anthropic helped create GPQA, and part-owns it. Google helped create IFEval and BBH, and owns BBH. OpenAI had a hand in MATH (L5 split) and SWE-bench, which Meta owns. (Though of course some of this information may be wrong …). On top of this, the LLM owners are customers for these systems.

Surely they could get together and sort it out?  Or maybe the companies and the numerous universities and institutes dislike or don’t trust their rivals and competitors?  What they need is an impartial actor to convene a process, one which also ultimately has the power to force them to agree: a regulator accountable to government, for instance.  But of course there isn’t one, at least not in the US or the UK, thanks to Big Tech’s own lobbying activities.

Perhaps the industry is finally bringing regulation upon itself, having now established that it knows how to make its product much safer, even if people inside the AI bubble can’t imagine how it can be done?

***

I will explore how AI LLMs wage a ‘War on Truth’ by generating info-pollution (or ‘synth-pollution’) in a subsequent blog.
