Encyclopedia Britannica and Merriam-Webster sued OpenAI in March 2026 for copying nearly 100,000 articles to train ChatGPT. The stakes go beyond one company: if the plaintiffs win, small AI players could be priced or litigated out of the market while giants with existing licensing deals tighten their grip.
A win for reference publishers would tilt the board toward big AI and licensed content
Encyclopedia Britannica and its subsidiary Merriam-Webster filed suit against OpenAI in the Southern District of New York on March 13, 2026. As TechCrunch reported on March 16, the complaint alleges massive copyright infringement: nearly 100,000 online articles scraped and used to train OpenAI’s LLMs without permission. The plaintiffs claim ChatGPT generates full or partial verbatim reproductions of their content and that OpenAI’s RAG workflow uses their articles without authorization. Britannica also alleges trademark violations when ChatGPT attributes fabricated statements to them. The suit joins a wave of publisher cases against OpenAI, including The New York Times, Ziff Davis, and more than a dozen newspapers.
Courthouse News Service reported that the 44-page complaint includes four counts of copyright infringement and one of trademark dilution. Britannica states it approached OpenAI about licensing in November 2024 and was rebuffed while OpenAI cut deals with other publishers. An OpenAI spokesperson said the company’s models are trained on publicly available data and grounded in fair use. The outcome will shape whether training on scraped reference and dictionary content remains defensible or becomes a licensing-only game.
Small AI startups cannot match big-tech licensing budgets
If courts consistently require licensing for training data, the cost structure favors incumbents. News Corp’s five-year deal with OpenAI was reported at more than $250 million; Reddit’s agreement with Google at about $60 million per year. The Associated Press, Financial Times, Axel Springer, Vox Media, and The Atlantic have undisclosed licensing agreements with OpenAI. Startups and smaller AI labs do not have those budgets. According to industry analysis from Encypher and Digital Content Next, the market is shifting toward standardized licensing frameworks; by mid-2026, licensing is expected to be the norm for quality content. That trajectory rewards firms that can pay at scale and pushes smaller players toward riskier scraping or narrower data sets. A court ruling that scraped reference content cannot be used without a license would formalize that split: big AI with licenses, small AI with less data or more legal risk.
Precedent is moving against unfettered training use
In Thomson Reuters v. ROSS Intelligence, Judge Stephanos Bibas held that using copyrighted legal materials to train AI does not qualify as fair use. Lexology and Skadden have noted the ruling undercuts the argument that tech companies can freely use copyrighted works for AI training and supports content owners’ right to be paid. TechCrunch reported that Anthropic reached a $1.5 billion settlement with writers after a court found that its downloading of millions of pirated books for training was not protected by fair use. Perplexity is already facing suits from Britannica and Merriam-Webster (September 2025), The New York Times (December 2025), and others over scraping and copying. Ziff Davis, which owns Mashable, CNET, and IGN, sued OpenAI; a December 2025 ruling confirmed that copyright claims based on AI-generated outputs are viable. The industry is moving toward licensing as the default for quality content, with intermediaries and pricing frameworks emerging. If the dictionary and reference suits succeed, the precedent will apply to any AI startup that trains on scraped web content without licenses.
What This Actually Means
A legal standard that requires paid licensing for training data does not treat all AI companies equally. Giants with deep pockets and existing publisher relationships can secure deals and pass costs to users; startups and open-source projects get squeezed. Regulatory and litigation risk will fall hardest on firms that cannot afford large content deals. The Britannica and Merriam-Webster case is a precedent-setting moment: win or lose, it will influence how every AI startup that scrapes the web is treated under copyright.
Who are Encyclopedia Britannica and Merriam-Webster in this case?
Encyclopedia Britannica dates to the late 18th century and owns Merriam-Webster. According to Courthouse News Service, they retain copyright in nearly 100,000 online articles and digital reference works. They allege that human researchers, writers, and editors produced this content and that OpenAI free-rides on it without compensation. The complaint frames the dispute in terms of quality and trust: they argue ChatGPT starves web publishers of revenue and that hallucinations attributed to them jeopardize the public’s access to trustworthy information. The suit against OpenAI is their second major action, following the September 2025 case against Perplexity. In both suits they seek to establish that reference and dictionary content cannot be scraped or used for training without a license.
What does the lawsuit say about scraping and AI training?
The complaint alleges that OpenAI copied the publishers’ online articles at scale to train large language models and that ChatGPT sometimes outputs verbatim or near-verbatim reproductions of their content. Britannica and Merriam-Webster claim this cannibalizes traffic to their sites and harms their reputation when ChatGPT attributes hallucinations to them. The suit targets both training use and output use. Britannica also alleges that OpenAI uses their articles in ChatGPT’s retrieval-augmented generation (RAG) workflow, which scans the web and databases when answering queries. If courts side with the publishers, the implication is that scraping reference and dictionary content for training without a license is infringement, which would force AI startups to license content, limit training data, or face litigation.
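The legal distinction the complaint draws, between copying content into a model at training time and fetching content at answer time, maps onto how a RAG pipeline works: source documents are retrieved and inlined into the prompt for each query rather than baked into the model’s weights. The sketch below is purely illustrative, not OpenAI’s actual system; the toy corpus, source names, and keyword-overlap retriever are all hypothetical stand-ins.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, corpus, k=2):
    """Rank documents by naive keyword overlap with the query (a stand-in
    for the web/database search a real RAG system would perform)."""
    q_terms = tokenize(query)
    return sorted(corpus, key=lambda d: -len(q_terms & tokenize(d["text"])))[:k]

def build_prompt(query, docs):
    """Inline the retrieved passages into the model prompt with attribution.
    This retrieval-time copying is what the output-use claims target."""
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical two-document corpus.
corpus = [
    {"source": "britannica.com",
     "text": "The printing press spread literacy across Europe."},
    {"source": "merriam-webster.com",
     "text": "lexicography is the editing or making of a dictionary"},
]

query = "What is lexicography?"
prompt = build_prompt(query, retrieve(query, corpus, k=1))
```

Because the publishers’ text ends up verbatim inside the prompt (and often the answer), the RAG claims do not depend on proving anything about what the model memorized during training.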
How could a plaintiff win affect small AI players?
Small AI players typically rely on publicly available or scraped data because they cannot afford the kind of licensing deals OpenAI has signed with major publishers. A ruling that training on scraped reference content is infringement would leave them with three options: pay for licenses (often prohibitive), restrict training to fully licensed or synthetic data (narrowing model quality), or continue scraping and face lawsuits. Meanwhile, large players with existing deals would see their licensed content become a moat. The result could be a more concentrated market where only well-funded incumbents can afford to train on the best reference and news content.
What has happened in similar cases against AI companies?
Encyclopedia Britannica and Merriam-Webster sued Perplexity in September 2025, alleging the AI search company scraped their websites and reproduced dictionary definitions verbatim. Bloomberg Law and The Verge reported that the publishers accused Perplexity of cannibalizing traffic and using a bot to crawl their research sites. The New York Times sued Perplexity in December 2025 for copyright infringement. Reuters and Business Insider have reported that Perplexity faces multiple publisher suits and has been accused of stealth crawling and fabricating content while attributing it to news sources. In March 2026, the Big Five book publishers and academic publishers sued Anna’s Archive, a shadow library allegedly supplying pirated content to AI developers. The pattern is consistent: content owners are targeting both large and small AI firms. A win for Britannica and Merriam-Webster against OpenAI would strengthen the same arguments in every other case. The precedent would apply to any startup that trains on scraped reference or news content.
Sources
TechCrunch, Courthouse News Service, Bloomberg Law, The Verge, Reuters, Business Insider, Lexology, Skadden, Encypher, Digital Content Next