What this lawsuit over dictionary data means for every AI startup scraping the web


Disclaimer: Perspectives here reflect AI-POV and AI-assisted analysis, not any specific human author. To report issues: report@theaipov.news

By Tech Desk | March 17, 2026 | 6 min read | AI-Assisted | Source: techcrunch.com

Encyclopedia Britannica and Merriam-Webster sued OpenAI in March 2026, alleging that it copied nearly 100,000 articles to train ChatGPT. The stakes go beyond one company: if the plaintiffs win, small AI players could be priced out of the market while giants with existing licensing deals tighten their grip.

A win for reference publishers would tilt the board toward big AI and licensed content

Encyclopedia Britannica and its subsidiary Merriam-Webster filed suit against OpenAI in the Southern District of New York on March 13, 2026. As TechCrunch reported on March 16, the complaint alleges massive copyright infringement: nearly 100,000 online articles scraped and used to train OpenAI’s LLMs without permission. The plaintiffs claim that ChatGPT generates full or partial verbatim reproductions of their content and that OpenAI’s retrieval-augmented generation (RAG) workflow uses their articles without authorization. Britannica also alleges trademark violations when ChatGPT attributes fabricated statements to the publishers. The suit joins a wave of publisher cases against OpenAI, including those brought by The New York Times, Ziff Davis, and more than a dozen newspapers.

Courthouse News Service reported that the 44-page complaint includes four counts of copyright infringement and one of trademark dilution. Britannica states it approached OpenAI about licensing in November 2024 and was rebuffed while OpenAI cut deals with other publishers. An OpenAI spokesperson said the company’s models are trained on publicly available data and grounded in fair use. The outcome will shape whether training on scraped reference and dictionary content remains defensible or becomes a licensing-only game.

Small AI startups cannot match big-tech licensing budgets

If courts consistently require licensing for training data, the cost structure favors incumbents. News Corp’s five-year deal with OpenAI was reported at more than $250 million; Reddit’s agreement with Google at about $60 million per year. The Associated Press, Financial Times, Axel Springer, Vox Media, and The Atlantic have licensing agreements with OpenAI on undisclosed terms. Startups and smaller AI labs do not have those budgets. According to industry analysis from Encypher and Digital Content Next, the market is shifting toward standardized licensing frameworks; by mid-2026, licensing is expected to be the norm for quality content. That trajectory rewards firms that can pay at scale and pushes smaller players toward riskier scraping or narrower data sets. A court ruling that scraped reference content cannot be used without a license would formalize that split: big AI with licenses, small AI with less data or more legal risk.

Precedent is moving against unfettered training use

In Thomson Reuters v. ROSS Intelligence, Judge Stephanos Bibas held that using copyrighted legal materials to train AI does not qualify as fair use. Lexology and Skadden have noted that the ruling undercuts the argument that tech companies can freely use copyrighted works for AI training and supports content owners’ right to be paid. TechCrunch reported that Anthropic reached a $1.5 billion settlement with writers after a court found that the company had illegally downloaded millions of books. Perplexity is already facing suits from Britannica and Merriam-Webster (September 2025), The New York Times (December 2025), and others over scraping and copying. Ziff Davis, which owns Mashable, CNET, and IGN, sued OpenAI; a December 2025 ruling confirmed that copyright claims based on AI-generated outputs are viable. The industry is moving toward licensing as the default for quality content, with intermediaries and pricing frameworks emerging. If the dictionary and reference suits succeed, the precedent could be invoked against any AI startup that trains on scraped web content without licenses.

What This Actually Means

A legal standard that requires paid licensing for training data does not treat all AI companies equally. Giants with deep pockets and existing publisher relationships can secure deals and pass costs to users; startups and open-source projects get squeezed. Regulatory and litigation risk will fall hardest on firms that cannot afford large content deals. The Britannica and Merriam-Webster case is a precedent-setting moment: whichever side wins, it will influence how every AI startup that scrapes the web is treated under copyright.

Who are Encyclopedia Britannica and Merriam-Webster in this case?

Encyclopedia Britannica dates to the late 18th century and owns Merriam-Webster. According to Courthouse News Service, the two publishers hold copyright in nearly 100,000 online articles and digital reference works. They allege that human researchers, writers, and editors produced this content and that OpenAI free-rides on it without compensation. The complaint frames the dispute as one of quality and trust: the publishers argue that ChatGPT starves web publishers of revenue and that hallucinations attributed to them jeopardize the public’s access to trustworthy information. The suit against OpenAI is their second major action, following the September 2025 case against Perplexity. In both suits they seek to establish that reference and dictionary content cannot be scraped or used for training without a license.

What does the lawsuit say about scraping and AI training?

The complaint alleges that OpenAI copied the publishers’ online articles at scale to train large language models and that ChatGPT sometimes outputs verbatim or near-verbatim reproductions of their content. Britannica and Merriam-Webster claim this cannibalizes traffic to their sites and harms their reputation when ChatGPT attributes hallucinations to them. The suit targets both training use and output use. Britannica also alleges that OpenAI uses its articles in ChatGPT’s RAG (retrieval-augmented generation) workflow, which searches the web and databases when answering queries. If courts side with the publishers, the implication is that scraping reference and dictionary content for training without a license is infringement, which would force AI startups to license content, limit their training data, or face litigation.
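
For readers unfamiliar with RAG, the following is a minimal sketch of the general pattern, assuming a toy in-memory corpus and simple word-overlap ranking. It is not OpenAI’s implementation; the names Document, retrieve, and build_prompt, the example sources, and the sample texts are all hypothetical. Production systems typically rely on search indexes or embeddings rather than word overlap. The point is only to show why the output-side claim differs from the training-side claim: the retrieved passage is copied into the prompt at query time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Document:
    source: str  # hypothetical origin label, e.g. a site name
    text: str    # passage text available for retrieval


def tokenize(text: str) -> set[str]:
    # Lowercase, split on whitespace, and strip basic punctuation.
    return {w.strip(".,;:?!()\"'").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: list[Document], top_k: int = 1) -> list[Document]:
    # Rank passages by how many query words they share (toy scoring).
    terms = tokenize(query)
    ranked = sorted(corpus, key=lambda d: len(terms & tokenize(d.text)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, corpus: list[Document]) -> str:
    # The retrieved passage is inserted verbatim ahead of the question.
    passages = retrieve(query, corpus)
    context = "\n".join(f"[{d.source}] {d.text}" for d in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical corpus and query, invented for illustration only.
CORPUS = [
    Document("reference-site.example",
             "A dictionary is a reference work that lists words and their meanings."),
    Document("cooking-blog.example",
             "Simmer the tomato sauce for twenty minutes."),
]
QUERY = "What is a dictionary?"

if __name__ == "__main__":
    print(build_prompt(QUERY, CORPUS))
```

In this sketch the reference passage reaches the model as part of the prompt rather than through training, which is the kind of use the complaint challenges separately from the scraping allegation.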

How could a win for the plaintiffs affect small AI players?

Small AI players typically rely on publicly available or scraped data because they cannot afford the kind of licensing deals OpenAI has signed with major publishers. A ruling that training on scraped reference content is infringement would leave them with three options: pay for licenses (often prohibitively expensive), restrict training to fully licensed or synthetic data (limiting model quality), or continue scraping and face lawsuits. Meanwhile, large players with existing deals would see their licensed content become a moat. The result could be a more concentrated market where only well-funded incumbents can afford to train on the best reference and news content.

What has happened in similar cases against AI companies?

Encyclopedia Britannica and Merriam-Webster sued Perplexity in September 2025, alleging the AI search company scraped their websites and reproduced dictionary definitions verbatim. Bloomberg Law and The Verge reported that the publishers accused Perplexity of cannibalizing traffic and using a bot to crawl their research sites. The New York Times sued Perplexity in December 2025 for copyright infringement. Reuters and Business Insider have reported that Perplexity faces multiple publisher suits and has been accused of stealth crawling and fabricating content while attributing it to news sources. In March 2026, the Big Five book publishers and academic publishers sued Anna’s Archive, a shadow library allegedly supplying pirated content to AI developers. The pattern is consistent: content owners are targeting both large and small AI firms. A win for Britannica and Merriam-Webster against OpenAI would strengthen the same arguments in every other case, and the precedent could be invoked against any startup that trains on scraped reference or news content.

Sources

TechCrunch, Courthouse News Service, Encypher, Lexology, The Verge
