## Colophon
tags::
url:: https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php
%%
title:: AI Search Has A Citation Problem
type:: [[clipped-note]]
author:: [[@cjr.org]]
%%
## Notes
> After providing each chatbot with the selected excerpts, we asked it to identify the corresponding article’s headline, original publisher, publication date, and URL, using the following query: — [view in context](https://hyp.is/CjOzaP8fEe-IzIf6foW2Jg/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ It seems like the publication date, though requested in the query, isn't part of the check(?)
> the retrieval of (1) the correct article, (2) the correct publisher, and (3) the correct URL. According to these parameters, each response was marked with one of the following labels:
> - Correct: All three attributes were correct.
> - Correct but Incomplete: Some attributes were correct, but the answer was missing information.
> - Partially Incorrect: Some attributes were correct while others were incorrect.
> - Completely Incorrect: All three attributes were incorrect and/or missing.
> - Not Provided: No information was provided.
> - Crawler Blocked: The publisher disallows the chatbot’s crawler in its robots.txt.
>
> — [view in context](https://hyp.is/LvU2uP8fEe-n2NPqe6fFSQ/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ See, no publication date. (Unless "correct article" means a combination of the headline and the date, which it doesn't say explicitly.)
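Sketching the rubric out for myself to make the gap obvious (not the authors' code; the function name and the True/False/None encoding per attribute are my own assumptions -- "Crawler Blocked" is a property of the publisher's robots.txt, assigned before any response is scored):

```python
# Minimal sketch of the rubric as written: three attributes, each judged
# True (correct), False (incorrect), or None (missing). Note there is no
# slot for the publication date anywhere in the labels.

def label_response(article, publisher, url):
    attrs = (article, publisher, url)
    if all(a is None for a in attrs):
        return "Not Provided"            # no information given at all
    if all(a is True for a in attrs):
        return "Correct"
    if not any(a is True for a in attrs):
        return "Completely Incorrect"    # everything wrong and/or missing
    if any(a is False for a in attrs):
        return "Partially Incorrect"     # some right, some wrong
    return "Correct but Incomplete"      # some right, the rest missing

# e.g. label_response(True, True, False) -> "Partially Incorrect"
#      label_response(True, True, None)  -> "Correct but Incomplete"
```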
> AI Search Has A Citation Problem — [view in context](https://hyp.is/WAoQPP8fEe-YTk_ZC1q8ig/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ date:: [[2025-03-12]]
> Chatbots’ responses to our queries were often confidently wrong Overall, the chatbots often failed to retrieve the correct articles. Collectively, they provided incorrect answers to more than 60 percent of queries. Across different platforms, the level of inaccuracy varied, with Perplexity answering 37 percent of the queries incorrectly, while Grok 3 had a much higher error rate, answering 94 percent of the queries incorrectly. — [view in context](https://hyp.is/is0qzP8fEe-6rP_ksqwL4g/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ Q: How are they determining confidence?
> Most of the tools we tested presented inaccurate answers with alarming confidence, rarely using qualifying phrases such as “it appears,” “it’s possible,” “might,” etc., or acknowledging knowledge gaps with statements like “I couldn’t locate the exact article.” — [view in context](https://hyp.is/z4gTwP8fEe-QU3tcZNGGiA/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ Ok, this is how they are determining confidence.
Q for me: Do chatbots express confidence this way? i.e., should we treat this as a conscious signal? I understand that it can be interpreted this way, since we are tuned to interpret these phrases in certain ways.
> The fundamental concern extends beyond the chatbots’ factual errors to their authoritative conversational tone, which can make it difficult for users to distinguish between accurate and inaccurate information. This unearned confidence presents users with a potentially dangerous illusion of reliability and accuracy. — [view in context](https://hyp.is/X--Bbv8gEe-aXSeQ68rD2g/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> Platforms retrieved information from publishers that had intentionally blocked their crawlers — [view in context](https://hyp.is/t77hZP8gEe-VjXc8M6Q6Jw/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> Five of the eight chatbots tested in this study (ChatGPT, Perplexity and Perplexity Pro, Copilot, and Gemini) have made the names of their crawlers public, giving publishers the option to block them, while the crawlers used by the other three (DeepSeek, Grok 2, and Grok 3) are not publicly known. — [view in context](https://hyp.is/t8qwov8gEe-oyMPmbbMcTg/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
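For my own reference, the opt-out mechanism being described is just robots.txt: a well-behaved crawler checks the file before fetching. A minimal sketch using Python's stdlib parser, assuming the publicly documented user-agent tokens (GPTBot for OpenAI, PerplexityBot for Perplexity, Google-Extended for Gemini, bingbot for Copilot); point it at whatever publisher you want to check:

```python
# Sketch, not any vendor's code: ask a site's robots.txt which of the
# known AI crawler tokens it allows. The token list is an assumption
# based on vendors' public documentation and the article itself.
from urllib import robotparser

AI_AGENTS = ["GPTBot", "PerplexityBot", "Google-Extended", "bingbot"]

def crawler_access(site, agents=AI_AGENTS):
    rp = robotparser.RobotFileParser(f"{site}/robots.txt")
    rp.read()  # fetches and parses the file (a 404 means everything is allowed)
    return {agent: rp.can_fetch(agent, f"{site}/") for agent in agents}

print(crawler_access("https://www.cjr.org"))  # any publisher site works
```

Note the trade-offs the article gets to below: blocking bingbot also drops a site from Bing search, whereas Google-Extended exists precisely so publishers can block Gemini without touching Google Search.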
> In particular, ChatGPT, Perplexity, and Perplexity Pro exhibited unexpected behaviors given what we know about which publishers allow them crawler access. On some occasions, the chatbots either incorrectly answered or declined to answer queries from publishers that permitted them to access their content. On the other hand, they sometimes correctly answered queries about publishers whose content they shouldn’t have had access to; Perplexity Pro was the worst offender in this regard, correctly identifying nearly a third of the ninety excerpts from articles it should not have had access to. — [view in context](https://hyp.is/voxUfP8gEe-yB2MQbs6Qiw/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> While ChatGPT answered fewer questions about articles that blocked its crawlers compared with the other chatbots, overall it demonstrated a bias toward providing wrong answers over no answers. — [view in context](https://hyp.is/556WuP8gEe-EGs9rwMzZxw/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> Among the chatbots whose crawlers are public, Copilot was the only one that was not blocked by any of the publishers in our dataset. This is likely because Copilot uses the same crawler, BingBot, as the Bing search engine, which means that publishers wishing to block it would also have to opt out of inclusion in Bing search — [view in context](https://hyp.is/8ma2jv8gEe--zeu7-zYnkg/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ This is a problem, tbh.
> In theory, Copilot should have been able to access all of the content we queried for; however, it actually had the highest rate of declined answers. — [view in context](https://hyp.is/-DeCoP8gEe-H3V_OGM247A/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> On the other hand, Google created its Google-Extended crawler to give publishers the option of blocking Gemini’s crawler without having their content affected on Google’s search. Its crawler was permitted by ten of the twenty publishers we tested, yet Gemini only provided a completely correct response on one occasion. — [view in context](https://hyp.is/BKMmZv8hEe--zjuSiVrV3Q/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> Gemini also declined to answer questions about content from publishers that permitted its crawler if the excerpt appeared to be related to politics, responding with statements like “I can’t help with responses on elections and political figures right now — [view in context](https://hyp.is/C1rW_P8hEe-Vj8-jZfop3g/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ ooh, potential moderation controls at work.
> Platforms often failed to link back to the original source — [view in context](https://hyp.is/hedHDP8hEe-HMuf2M6JztA/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> But when chatbots are wrong, they don’t just taint their own reputations, they also taint the reputations of the publishers they lean on for legitimacy. The generative search tools we tested had a common tendency to cite the wrong article. — [view in context](https://hyp.is/Hv8LAP8iEe-3Dqd7xbIwNQ/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> For instance, DeepSeek misattributed the source of the excerpts provided in our queries 115 out of 200 times. This means that news publishers’ content was most often being credited to the wrong source. — [view in context](https://hyp.is/JFStCP8iEe--qgtvG3inzQ/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> Even when the chatbots appeared to correctly identify the article, they often failed to properly link to the original source. This creates a twofold problem: publishers wanting visibility in search results weren’t getting it, while the content of those wishing to opt out remained visible against their wishes. — [view in context](https://hyp.is/mqtcvP8jEe-EuDenhsr_BA/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> On some occasions, chatbots directed us to syndicated versions of articles on platforms like Yahoo News or AOL rather than the original sources—often even when the publisher was known to have a licensing deal with the AI company. — [view in context](https://hyp.is/u4g6hv8jEe-6rK9OmqOYjg/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ Understandable, though I don't see how LLMs would know/account for this.
> This tendency deprives the original sources of proper attribution and potential referral traffic. — [view in context](https://hyp.is/vjaedv8jEe-kRPs-jX-k2A/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> Conversely, syndicated versions or unauthorized copies of news articles present a challenge for publishers wishing to opt out of crawling. Their content continued to appear in results without their consent, albeit incorrectly attributed to the sources that republished it. For instance, while USA Today blocks ChatGPT’s crawler, the chatbot still cited a version of its article that was republished by Yahoo News. — [view in context](https://hyp.is/FOCoUv8kEe-KTGc0JuKU9g/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ Problem, yes, but not an LLM problem per se, especially for unauthorised copies, which exist irrespective of whether they were indexed by an LLM or not.
Since the comparison is with web search, it's also worth pointing out that this could happen there as well.
> Meanwhile, generative search tools’ tendency to fabricate URLs can also affect users’ ability to verify information sources. Grok 2, for instance, was prone to linking to the homepage of the publishing outlet rather than specific articles. — [view in context](https://hyp.is/IvUvqP8kEe-e-QNbiUribA/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ Obviously, a huge problem.
> More than half of responses from Gemini and Grok 3 cited fabricated or broken URLs that led to error pages. Out of the 200 prompts we tested for Grok 3, 154 citations led to error pages. Even when Grok correctly identified an article, it often linked to a fabricated URL. While this problem wasn’t exclusive to Grok 3 and Gemini, it happened far less frequently with other chatbots. Mark Howard, Time magazine’s chief operating officer, emphasized to us that “it’s critically important how our brand is represented, when and where we show up, that there’s transparency about how we’re showing up and where we’re showing up, as well as what kind of engagement [chatbots are] driving on [our] platform.” — [view in context](https://hyp.is/ahk5lv8lEe-KT6vyiibGww/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
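How would one count these? Presumably something like requesting every cited URL and logging the failures. A sketch (not the study's actual method; soft 404s -- error pages served with status 200 -- would slip through and need content checks):

```python
# Sketch of a broken-citation check: treat HTTP errors and connection
# failures as "leads to an error page". Not the authors' code.
import urllib.request

def citation_resolves(url, timeout=10):
    """True if the cited URL serves a non-error response."""
    # Some servers reject HEAD requests; a GET fallback may be needed.
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": "citation-check/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400  # redirects are followed automatically
    except OSError:  # HTTPError, URLError, and timeouts all subclass OSError
        return False

# e.g. broken = [u for u in cited_urls if not citation_resolves(u)]
```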
> As Press Gazette’s Bron Maher wrote recently, the way in which chatbots disincentivize click-through traffic “has left news publishers continuing to expensively produce the information that answers user queries on platforms like ChatGPT without receiving compensation via web traffic and the resultant display advertising income.” — [view in context](https://hyp.is/znmhCv8lEe-yBOvm44qN3g/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ Setting aside the revenue aspect for a second -- there is, I think, an interface-catalysed disincentive to click through with chatbots, because of the illusion of context/summarisation, unlike a search engine, where only short snippets are displayed and you have to click through to get more information.
> The presence of licensing deals didn’t mean publishers were cited more accurately — [view in context](https://hyp.is/1_7ngP8lEe-l3ie6Kp7oaw/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ Not surprising, given how LLMs work.
> Of the companies whose models we tested, OpenAI and Perplexity have expressed the most interest in establishing formal relationships with news publishers. In February, OpenAI secured its sixteenth and seventeenth news content licensing deals with the Schibsted and Guardian media groups, respectively. Similarly, last year Perplexity established its own Publishers Program, “designed to promote collective success,” which includes a revenue-sharing arrangement with participating publishers — [view in context](https://hyp.is/5khdEv8lEe-0f1uz3YO9zw/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> These arrangements typically provide AI companies direct access to publisher content, eliminating the need for website crawling. Such deals might raise the expectation that user queries related to content produced by partner publishers would yield more accurate results. However, this was not what we observed during tests conducted in February 2025. At least not yet. — [view in context](https://hyp.is/7HYPLP8lEe-HOM-JAH5czA/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> When we asked whether the AI companies made any commitments to ensuring the content of publisher partners would be accurately surfaced in their search results, Time’s Howard confirmed that was the intention. However, he added that the companies did not commit to being 100 percent accurate. — [view in context](https://hyp.is/IXtWZP8mEe-EJPO-cGAYzA/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ tbh, they can't commit to this.
> Critics of generative search like Chirag Shah and Emily M. Bender have raised substantive concerns about using large language models for search, noting that they “take away transparency and user agency, further amplify the problems associated with bias in [information access] systems, and often provide ungrounded and/or toxic answers that may go unchecked by a typical user.” — [view in context](https://hyp.is/Q8tFiv8mEe-jpi_QlklWEQ/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> In spite of this, Howard, the COO of Time, maintains optimism about future improvements: “I have a line internally that I say every time somebody brings me anything about any one of these platforms—my response back is, ‘Today is the worst that the product will ever be.’ — [view in context](https://hyp.is/X8Kumv8mEe-0gO-bIm2B3g/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ 😂
> If anybody as a consumer is right now believing that any of these free products are going to be 100 percent accurate, then shame on them.” — [view in context](https://hyp.is/aAzc9v8mEe-f-o84uOR1mA/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
⬆️ 😂
> Limitations of our experiment — [view in context](https://hyp.is/lzy-sP8mEe-Q2Pv12Hhe8A/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> While our research design may not reflect typical user behavior, it is intended to assess how generative search tools perform at a task that is easily accomplished via a traditional search engine. Though we did not expect the chatbots to be able to correctly answer all of the prompts, especially given crawler restrictions, we did expect them to decline to answer or exhibit uncertainty when the correct answer couldn’t be determined, rather than provide incorrect or fabricated responses. — [view in context](https://hyp.is/l08AFv8mEe-7xyNK110rkw/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)
> Furthermore, our findings represent just one occurrence of each of the excerpts being queried in the AI search tools. Because AI chatbots’ responses are dynamic and can vary in response to the same query, the chances are high that if someone ran the exact same prompts again, they would get different outputs. — [view in context](https://hyp.is/pneC8v8mEe-6f8NlFFRWyg/www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php)