{"id":3492,"date":"2025-10-25T12:28:13","date_gmt":"2025-10-25T11:28:13","guid":{"rendered":"https:\/\/threeworlds.campaignstrategy.org\/?p=3492"},"modified":"2025-10-25T12:28:13","modified_gmt":"2025-10-25T11:28:13","slug":"big-techs-ai-hallucination-problem-is-now-a-scandal","status":"publish","type":"post","link":"https:\/\/threeworlds.campaignstrategy.org\/?p=3492","title":{"rendered":"Big Tech\u2019s AI Hallucination Problem is Now A Scandal"},"content":{"rendered":"<h3><\/h3>\n<h3><a href=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Screenshot-Open-AI-hallucinations-paper-.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3500\" src=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Screenshot-Open-AI-hallucinations-paper-.png\" alt=\"\" width=\"1000\" height=\"940\" srcset=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Screenshot-Open-AI-hallucinations-paper-.png 1000w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Screenshot-Open-AI-hallucinations-paper--300x282.png 300w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Screenshot-Open-AI-hallucinations-paper--768x722.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><\/h3>\n<p><em>Open AI research paper September 2025<\/em><\/p>\n<h3 style=\"font-weight: 400;\">On September 5 2025, with no fanfare, Open AI published a mathematical research <a href=\"https:\/\/arxiv.org\/pdf\/2509.04664\">paper <\/a>summarised in a <a href=\"https:\/\/openai.com\/index\/why-language-models-hallucinate\/\">blog<\/a>, both titled \u2018Why Language Models Hallucinate\u2019. \u00a0\u00a0This shows several reasons why such models confidently persistently generate false content. 
The largest of these reasons is one for which tech companies themselves are responsible, and which they could easily fix if they wanted to (<a href=\"https:\/\/openai.com\/index\/why-language-models-hallucinate\/\">it\u2019s \u2018teaching to the [wrong] test\u2019<\/a>, incentivising models to guess).<\/h3>\n<p style=\"font-weight: 400;\">Given the real-world harms that LLMs can do by pumping out \u2018hallucinations\u2019, defined in the Open AI paper as \u2018plausible yet incorrect statements\u2019, hallucinations qualify as a scandal: an <em>avoidable<\/em> harm.<\/p>\n<p style=\"font-weight: 400;\">Here\u2019s the <a href=\"https:\/\/www.campaignstrategy.org\/advanced_2.html\">Scandal Equation<\/a> from my book <em><a href=\"https:\/\/www.amazon.co.uk\/How-Win-Campaigns-Chris-Rose\/dp\/1849711143\">How To Win Campaigns: Communications for Change<\/a><\/em>.\u00a0 A harm which is unavoidable is a tragedy, but an avoidable one is scandalous.<\/p>\n<p><a href=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/scandal-equation-from-HTWC.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3493\" src=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/scandal-equation-from-HTWC.png\" alt=\"\" width=\"952\" height=\"1262\" srcset=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/scandal-equation-from-HTWC.png 952w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/scandal-equation-from-HTWC-226x300.png 226w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/scandal-equation-from-HTWC-772x1024.png 772w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/scandal-equation-from-HTWC-768x1018.png 768w\" sizes=\"auto, (max-width: 952px) 100vw, 952px\" \/><\/a><\/p>\n<p><em>Open AI&#8217;s research shows what can be done to stop hallucinations. \u00a0Every harm they cause &#8211; from wrong financial decisions to affirming the self-harming thoughts of depressed teens &#8211; is now a potential scandal, not just a tragedy.<\/em><\/p>\n<p style=\"font-weight: 400;\">Even worse, companies developing them <a href=\"https:\/\/openai.com\/index\/why-language-models-hallucinate\/\">decide<\/a>, in Open AI\u2019s words, to \u2018build models that guess rather than hold back\u2019, meaning the models prioritise guessing over \u2018abstaining\u2019 and saying \u2018sorry, I don\u2019t know\u2019, because they fear that would affect their marketing (i.e. directly and indirectly a profit motive, so they benefit from generating the harms \u2013 immoral profit).<\/p>\n<p style=\"font-weight: 400;\">This means that every time a harm which could have been avoided is generated by a false Chatbot output, the scandal will increase. 
That in turn could tip the political balance against the Big Tech lobby, and in favour of regulation which so far it has managed to evade.<\/p>\n<p style=\"font-weight: 400;\">Open AI blames this perverse incentive on the common industry practice of using benchmark tests to evaluate model performance which reward \u2018accuracy\u2019 rather than the absence of errors:\u00a0 \u2018accuracy-only scoreboards dominate leaderboards and model cards, motivating developers to build models that guess rather than hold back\u2019.<\/p>\n<p style=\"font-weight: 400;\">It even sees this continuing as companies pursue the \u2018race\u2019 to \u2018superintelligence\u2019, saying: \u00a0\u2018That is one reason why, even as models get more advanced, they can still hallucinate, confidently giving wrong answers instead of acknowledging uncertainty\u2019.\u00a0 Which others might say is not really very intelligent.<\/p>\n<p style=\"font-weight: 400;\">Open AI <a href=\"https:\/\/openai.com\/index\/why-language-models-hallucinate\/\">says<\/a>:<\/p>\n<p style=\"font-weight: 400;\"><em>\u2018There is a straightforward fix. Penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty\u2019.\u00a0 <\/em><\/p>\n<p style=\"font-weight: 400;\">But also \u2018the widely used, accuracy-based evals need to be updated so that their scoring discourages guessing. If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess\u2019.<\/p>\n<p><a href=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/model-evals-from-Open-AI-paper.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3495\" src=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/model-evals-from-Open-AI-paper.png\" alt=\"\" width=\"1000\" height=\"566\" srcset=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/model-evals-from-Open-AI-paper.png 1000w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/model-evals-from-Open-AI-paper-300x170.png 300w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/model-evals-from-Open-AI-paper-768x435.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><\/p>\n<p>(IDK = &#8220;I Don&#8217;t Know&#8221;.) <i>Nine out of ten commonly used evaluations do not give any credit to models which return an \u201cI don\u2019t know\u201d result.\u00a0 But if the response is correct it gets a credit, whether it comes from knowledge and reasoning or from blind guesswork.\u00a0 So even guessing birthdays (a 1 in 365 chance of being right) will over time improve the model\u2019s score but generate a lot of hallucinations \u2013 see the worked sketch below.<\/i><\/p>\n<p style=\"font-weight: 400;\">Discussing post-training, where the goal is often \u2018reducing hallucination\u2019, the <a href=\"https:\/\/arxiv.org\/pdf\/2509.04664\">paper explains<\/a>:<\/p>\n<p style=\"font-weight: 400;\"><em>\u2018Humans learn the value of expressing uncertainty outside of school, in the school of hard knocks. On the other hand, language models are primarily evaluated using exams that penalize uncertainty. Therefore, they are always in \u201ctest-taking\u201d mode. 
Put simply, most evaluations are not aligned\u2019*.<\/em><\/p>\n<p style=\"font-weight: 400;\">[<em>*<a href=\"https:\/\/spectrum.ieee.org\/the-alignment-problem-openai\">Alignment<\/a><\/em>]<\/p>\n<p style=\"font-weight: 400;\">It says the \u2018abundance of evaluations that are not aligned\u2019 is the \u2018root of the problem\u2019 (I don\u2019t think that\u2019s the ultimate issue \u2013 see below).\u00a0 The researchers propose that models under test should be incentivised not to answer questions where their confidence falls below a defined threshold.<\/p>\n<p style=\"font-weight: 400;\"><em>\u2018we propose evaluations explicitly state confidence targets in their instructions, within the prompt (or system message). For example, one could append a statement like the following to each question:<\/em><\/p>\n<p style=\"font-weight: 400;\"><em>Answer only if you are &gt;t confident, since mistakes are penalized t\/(1-t) points, while correct answers receive 1 point, and an answer of \u201cI don\u2019t know\u201d receives 0 points\u2019.<\/em><\/p>\n<p style=\"font-weight: 400;\">Assistant Professor Wei Xing at the University of Sheffield <a href=\"https:\/\/www.sciencealert.com\/openai-has-a-fix-for-hallucinations-but-you-really-wont-like-it\">commented<\/a>:<\/p>\n<p style=\"font-weight: 400;\"><em>OpenAI&#8217;s proposed fix is to have the AI consider its own confidence in an answer before putting it out there, and for benchmarks to score them on that basis.<\/em><\/p>\n<p style=\"font-weight: 400;\"><em>The AI could then be prompted, for instance: &#8220;Answer only if you are more than 75 percent confident, since mistakes are penalized 3 points while correct answers receive 1 point.&#8221;<\/em><\/p>\n<p style=\"font-weight: 400;\"><em>The OpenAI researchers&#8217; mathematical framework shows that under appropriate confidence thresholds, AI systems would naturally express uncertainty rather than guess. So this would lead to fewer hallucinations. The problem is what it would do to user experience.<\/em><\/p>\n<p style=\"font-weight: 400;\"><em>Consider the implications if ChatGPT started saying &#8220;I don&#8217;t know&#8221; to even 30% of queries \u2013 a conservative estimate based on the paper&#8217;s analysis of factual uncertainty in training data. Users accustomed to receiving confident answers to virtually any question would likely abandon such systems rapidly.<\/em><\/p>
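\n<p style=\"font-weight: 400;\">To make the arithmetic behind the two scoring schemes concrete, here is a rough worked sketch (my own illustration, not code from the Open AI paper; the threshold and probabilities are simply assumed for the example). Under accuracy-only scoring a blind guess always beats \u2018I don\u2019t know\u2019, because any non-zero chance of being right scores more than zero. Under the proposed rule a wrong answer costs t\/(1-t) points, so guessing only pays when the model\u2019s confidence is above the threshold t: with t = 0.75 the penalty is 0.75\/0.25 = 3 points, the figure in Wei Xing\u2019s example.<\/p>\n<pre>
# Rough illustration (my sketch, not from the Open AI paper): expected scores
# for 'guess' versus 'I don't know' under the two scoring schemes discussed above.

def accuracy_only(p):
    # Current benchmark style: 1 point if right, 0 if wrong, 0 for 'I don't know'.
    expected_guess = p * 1 + (1 - p) * 0
    expected_abstain = 0
    return expected_guess, expected_abstain

def confidence_threshold(p, t):
    # Proposed style: 1 point if right, minus t/(1-t) if wrong, 0 for 'I don't know'.
    penalty = t / (1 - t)  # t = 0.75 gives a 3-point penalty for a mistake
    expected_guess = p * 1 - (1 - p) * penalty
    expected_abstain = 0
    return expected_guess, expected_abstain

# A birthday-style blind guess: 1 chance in 365 of being right.
print(accuracy_only(1 / 365))               # (~0.003, 0): guessing beats abstaining
print(confidence_threshold(1 / 365, 0.75))  # (~-2.99, 0): abstaining now scores higher
print(confidence_threshold(0.90, 0.75))     # (~0.60, 0): a genuinely confident answer still pays
<\/pre>\n<p style=\"font-weight: 400;\">The algebra makes the same point in miniature: the expected score of a guess, p - (1-p) &times; t\/(1-t), is positive only when p &gt; t, so a model scored this way has no incentive to answer below the stated confidence threshold.<\/p>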
<\/em><\/p>\n<p style=\"font-weight: 400;\">In my view the \u2018real root\u2019 of the problem is the business model built on over-promising capabilities of LLM Large Language Model Chatbots, with addictive properties, and making them available free with no effective regulation of standards or quality control, while playing down their proven tendency to deceive users and spread misinformation.<\/p>\n<p style=\"font-weight: 400;\">As Wei Xing implies, and developers are quite open about, companies like Google or Open AI fear that a more reliably accurate but slower Chatbot will be bad for business, so the \u2018speed + convenience\u2019 AI Chatbot market \u2013 the info-pollution equivalent of disposable plastic bags &#8211; might shrink.\u00a0 But a switch from quantity to quality is what has to happen if the AI sector is going to avoid the flood of hallucinations from the LLM subsector causing a disastrous shift in public perception.<\/p>\n<p><a href=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Adam-Raine-NYT.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3497\" src=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Adam-Raine-NYT.png\" alt=\"\" width=\"1000\" height=\"516\" srcset=\"https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Adam-Raine-NYT.png 1000w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Adam-Raine-NYT-300x155.png 300w, https:\/\/threeworlds.campaignstrategy.org\/wp-content\/uploads\/2025\/10\/Adam-Raine-NYT-768x396.png 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><\/p>\n<p style=\"font-weight: 400;\"><em>The Adam Raine case &#8211; <a href=\"https:\/\/www.nytimes.com\/2025\/08\/26\/technology\/chatgpt-openai-suicide.html\">NYT<\/a><\/em><\/p>\n<p style=\"font-weight: 400;\">In the words of Marc Benioff CEO of EinsteinAI at Davos in 2024, \u201cWe just want to make sure that people don\u2019t get hurt. We don\u2019t want something to go really wrong &#8230; We don\u2019t want to have a Hiroshima moment\u201d.\u00a0Some such moment is very likely unless the tap is turned off on the outpouring of fabrications and lies from Chatbot AI.<\/p>\n<p style=\"font-weight: 400;\">ChatGPT alone receives 2.5 billion user prompts to produce an output (inference) every day. According to its owner OpenAI, 1 in 10 of those of the outputs from its latest model ChatGPT-5 is false (a \u2018hallucination\u2019).\u00a0 1 in 10 of 2.5 billion is 250 million mistakes every day. 
It only needs one with really bad consequences to crystallise public perceptions.<\/p>\n<p style=\"font-weight: 400;\">At the moment 97.9% of users of ChatGPT, which holds two thirds of the user market, get the most untrustworthy \u2018janky\u2019 version of the model for free, which spends most of its time thinking intuitively rather than analytically.\u00a0 The better versions are not more intelligent, or at least not bigger, but are allowed to \u2018think\u2019 and reason for longer, which costs money.<\/p>\n<p style=\"font-weight: 400;\">The Open AI paper goes on to discuss what to do about it in opaque language, saying it\u2019s a \u2018socio-technical problem\u2019 of changing the \u2018influential\u2019 leaderboards.<\/p>\n<p style=\"font-weight: 400;\">But changing the benchmark scoring systems is hardly an insoluble \u2018social\u2019 problem.\u00a0 Almost every industry in existence has standards.\u00a0 And it\u2019s not as if the AI companies using these systems don\u2019t know one another or the groups which created and now run these benchmarks.<\/p>\n<p style=\"font-weight: 400;\">I asked Open AI\u2019s ChatGPT and Google\u2019s AI who was involved in setting up the ten leading benchmark systems listed in the paper. It seems Anthropic helped create GPQA. Google helped create IFEval and BBH. Open AI had a hand in MATH (L5 split) and SWE-Bench. Anthropic part owns GPQA. Google owns BBH.\u00a0 Meta owns SWE-bench. (Though of course some of this information may be wrong &#8230;). On top of this, the LLM owners are customers for these systems.<\/p>\n<p style=\"font-weight: 400;\">Surely they could get together and sort it out? \u00a0Or maybe the companies and the numerous universities and institutes dislike or don\u2019t trust their rivals and competitors? \u00a0What they need is impartial actors to convene a process \u2013 actors who also ultimately have the power to force them to agree. \u00a0<em>Regulators<\/em> accountable to government, for instance. \u00a0But of course there isn\u2019t one, at least not in the US and the UK, thanks to Big Tech\u2019s own lobbying activities.<\/p>\n<p style=\"font-weight: 400;\">Perhaps the industry is finally bringing regulation upon itself, having now established that it knows how to make its product much safer, even if people inside the AI bubble can\u2019t imagine how it can be done?<\/p>\n<p style=\"font-weight: 400;\">***<\/p>\n<p style=\"font-weight: 400;\">I will explore AI LLMs\u2019 \u2018War on Truth\u2019 through the generation of info- or \u2018synth-\u2019 pollution in a subsequent blog.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Open AI research paper September 2025 On September 5 2025, with no fanfare, Open AI published a mathematical research paper summarised in a blog, both titled \u2018Why Language Models Hallucinate\u2019. 
\u00a0\u00a0This shows several reasons why such models confidently persistently generate &hellip; <a href=\"https:\/\/threeworlds.campaignstrategy.org\/?p=3492\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3492","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=\/wp\/v2\/posts\/3492","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3492"}],"version-history":[{"count":6,"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=\/wp\/v2\/posts\/3492\/revisions"}],"predecessor-version":[{"id":3503,"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=\/wp\/v2\/posts\/3492\/revisions\/3503"}],"wp:attachment":[{"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3492"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3492"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/threeworlds.campaignstrategy.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3492"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}