Encyclopedia Britannica and Merriam-Webster sued OpenAI in March 2026 for copying nearly 100,000 articles to train ChatGPT. The stakes go beyond one company: if the plaintiffs win, small AI players could be priced or litigated out of the market while giants with existing licensing deals tighten their grip.
A win for reference publishers would tilt the board toward big AI and licensed content
Encyclopedia Britannica and its subsidiary Merriam-Webster filed suit against OpenAI in the Southern District of New York on March 13, 2026. As TechCrunch reported on March 16, the complaint alleges massive copyright infringement: nearly 100,000 online articles scraped and used to train OpenAI’s LLMs without permission. The plaintiffs claim ChatGPT generates full or partial verbatim reproductions of their content and that OpenAI’s RAG workflow uses their articles without authorization. Britannica also alleges trademark violations when ChatGPT attributes fabricated statements to them. The suit joins a wave of publisher cases against OpenAI, including The New York Times, Ziff Davis, and more than a dozen newspapers.
Courthouse News Service reported that the 44-page complaint includes four counts of copyright infringement and one of trademark dilution. Britannica states it approached OpenAI about licensing in November 2024 and was rebuffed while OpenAI cut deals with other publishers. An OpenAI spokesperson said the company’s models are trained on publicly available data and grounded in fair use. The outcome will shape whether training on scraped reference and dictionary content remains defensible or becomes a licensing-only game.
Small AI startups cannot match big-tech licensing budgets
If courts consistently require licensing for training data, the cost structure favors incumbents. News Corp’s five-year deal with OpenAI was reported at more than $250 million; Reddit’s agreement with Google at about $60 million per year. The Associated Press, Financial Times, Axel Springer, Vox Media, and The Atlantic have undisclosed licensing agreements with OpenAI. Startups and smaller AI labs do not have those budgets. According to industry analysis from Encypher and Digital Content Next, the market is shifting toward standardized licensing frameworks; by mid-2026, licensing is expected to be the norm for quality content. That trajectory rewards firms that can pay at scale and pushes smaller players toward riskier scraping or narrower data sets. A court ruling that scraped reference content cannot be used without a license would formalize that split: big AI with licenses, small AI with less data or more legal risk.
Precedent is moving against unfettered training use
In Thomson Reuters v. ROSS Intelligence, Judge Stephanos Bibas held that using copyrighted legal materials to train AI does not qualify as fair use. Lexology and Skadden have noted the ruling undercuts the argument that tech companies can freely use copyrighted works for AI training and supports content owners’ right to be paid. TechCrunch reported that Anthropic reached a $1.5 billion settlement with writers after a court found that its downloading of millions of pirated books for training was not protected by fair use. Perplexity is already facing suits from Britannica and Merriam-Webster (September 2025), The New York Times (December 2025), and others over scraping and copying. Ziff Davis, which owns Mashable, CNET, and IGN, sued OpenAI; a December 2025 ruling confirmed that copyright claims based on AI-generated outputs are viable. The industry is moving toward licensing as the default for quality content, with intermediaries and pricing frameworks emerging. If the dictionary and reference suits succeed, the precedent will apply to any AI startup that trains on scraped web content without licenses.
What This Actually Means
A legal standard that requires paid licensing for training data does not treat all AI companies equally. Giants with deep pockets and existing publisher relationships can secure deals and pass costs to users; startups and open-source projects get squeezed. Regulatory and litigation risk will fall hardest on firms that cannot afford large content deals. The Britannica and Merriam-Webster case is a precedent-setting moment: win or lose, it will influence how every AI startup that scrapes the web is treated under copyright.
Who are Encyclopedia Britannica and Merriam-Webster in this case?
Encyclopedia Britannica dates to the late 18th century and owns Merriam-Webster. According to Courthouse News Service, they retain copyright in nearly 100,000 online articles and digital reference works. They allege that human researchers, writers, and editors produced this content and that OpenAI free-rides on it without compensation. The complaint frames the dispute in terms of quality and trust: they argue ChatGPT starves web publishers of revenue and that hallucinations attributed to them jeopardize the public’s access to trustworthy information. The suit against OpenAI is their second major action, following the September 2025 case against Perplexity. In both suits they seek to establish that reference and dictionary content cannot be scraped or used for training without a license.
What does the lawsuit say about scraping and AI training?
The complaint alleges that OpenAI copied the publishers’ online articles at scale to train large language models and that ChatGPT sometimes outputs verbatim or near-verbatim reproductions of their content. Britannica and Merriam-Webster claim this cannibalizes traffic to their sites and harms their reputation when ChatGPT attributes hallucinations to them. The suit targets both training use and output use. Britannica also alleges that OpenAI uses their articles in ChatGPT’s retrieval-augmented generation (RAG) workflow, which scans the web and databases when answering queries. If courts side with the publishers, the implication is that scraping reference and dictionary content for training without a license is infringement, which would force AI startups to license content, limit training data, or face litigation.
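The legal distinction the complaint draws, between copying content into a model at training time and fetching content at answer time, maps onto how a RAG pipeline works: source documents are retrieved and inlined into the prompt for each query rather than baked into the model’s weights. The sketch below is purely illustrative, not OpenAI’s actual system; the toy corpus, source names, and keyword-overlap retriever are all hypothetical stand-ins.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, corpus, k=2):
    """Rank documents by naive keyword overlap with the query (a stand-in
    for the web/database search a real RAG system would perform)."""
    q_terms = tokenize(query)
    return sorted(corpus, key=lambda d: -len(q_terms & tokenize(d["text"])))[:k]

def build_prompt(query, docs):
    """Inline the retrieved passages into the model prompt with attribution.
    This retrieval-time copying is what the output-use claims target."""
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical two-document corpus.
corpus = [
    {"source": "britannica.com",
     "text": "The printing press spread literacy across Europe."},
    {"source": "merriam-webster.com",
     "text": "lexicography is the editing or making of a dictionary"},
]

query = "What is lexicography?"
prompt = build_prompt(query, retrieve(query, corpus, k=1))
```

Because the publishers’ text ends up verbatim inside the prompt (and often the answer), the RAG claims do not depend on proving anything about what the model memorized during training.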
How could a plaintiff win affect small AI players?
Small AI players typically rely on publicly available or scraped data because they cannot afford the kind of licensing deals OpenAI has signed with major publishers. A ruling that training on scraped reference content is infringement would leave them with three options: pay for licenses (often prohibitive), restrict training to fully licensed or synthetic data (narrowing model quality), or continue scraping and face lawsuits. Meanwhile, large players with existing deals would see their licensed content become a moat. The result could be a more concentrated market where only well-funded incumbents can afford to train on the best reference and news content.
What has happened in similar cases against AI companies?
Encyclopedia Britannica and Merriam-Webster sued Perplexity in September 2025, alleging the AI search company scraped their websites and reproduced dictionary definitions verbatim. Bloomberg Law and The Verge reported that the publishers accused Perplexity of cannibalizing traffic and using a bot to crawl their research sites. The New York Times sued Perplexity in December 2025 for copyright infringement. Reuters and Business Insider have reported that Perplexity faces multiple publisher suits and has been accused of stealth crawling and fabricating content while attributing it to news sources. In March 2026, the Big Five book publishers and academic publishers sued Anna’s Archive, a shadow library allegedly supplying pirated content to AI developers. The pattern is consistent: content owners are targeting both large and small AI firms. A win for Britannica and Merriam-Webster against OpenAI would strengthen the same arguments in every other case. The precedent would apply to any startup that trains on scraped reference or news content.
Sources
TechCrunch, Courthouse News Service, Bloomberg Law, The Verge, Reuters, Business Insider, Lexology, Skadden, Encypher, Digital Content Next