Reddit’s Standoff with Perplexity AI: Unpacking the ‘Industrial-Scale’ Scraping Lawsuit and its Wider Implications for AI Training Data

Reddit is taking a firm stance against unauthorized data scraping, suing Perplexity AI and its infrastructure providers for allegedly harvesting user comments at an ‘industrial scale’ without licensing, signaling a pivotal moment for AI training data ethics and copyright law.

In a move that could redefine the boundaries of data usage in the artificial intelligence landscape, social media giant Reddit has filed a federal lawsuit against Perplexity AI and three other entities. The core allegation is the “industrial-scale, unlawful” scraping of millions of Reddit user comments for commercial gain, without proper licensing or authorization.

This isn’t just about one AI company; it’s a broader challenge to the entire ecosystem supporting large-scale data acquisition for AI training. The lawsuit aims to disrupt the often-opaque practices of data scrapers that feed the burgeoning AI industry, raising critical questions about intellectual property, user rights, and fair competition.

The Defendants: An AI Chatbot and Its Data Infrastructure

The lawsuit, filed in a New York federal court, targets several key players:

Perplexity AI: A San Francisco-based company known for its AI chatbot and “answer engine,” which competes with established players like Google and ChatGPT.
Oxylabs UAB: A Lithuanian company specializing in data scraping.
AWMProxy: Described by Reddit as a “former Russian botnet,” a web domain allegedly involved in proxy services for scraping.
SerpApi: A Texas-based startup that lists Perplexity as a customer on its website, providing access to search engine results.

This multi-pronged attack emphasizes Reddit’s argument that the scraping isn’t merely an individual act, but a concerted effort by an “unlawful economy” designed to exploit valuable user-generated content.

Why Reddit’s Content is a Prime Target for AI Training

Reddit boasts over 100 million daily users and represents “one of the largest and most dynamic collections of human conversation ever created,” as stated by Ben Lee, Reddit’s chief legal officer. This vast repository of diverse human language, nuanced discussions, and cultural context is incredibly valuable for training AI models to understand and generate human-like text. Websites like Reddit and Wikipedia, along with digitized books and news articles, serve as crucial “deep troves of written materials” essential for teaching AI assistants the patterns of human language.

The lawsuit alleges that these companies are circumventing Reddit’s own anti-scraping measures. Furthermore, they are accused of “circumventing Google’s controls and scraping Reddit content directly from Google’s search engine results” when direct access to Reddit is blocked. Lee highlighted that scrapers “mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search,” with Perplexity allegedly a willing buyer of this “stolen data.”

Legal Battlefronts: Copyright, Competition, and Enrichment

Reddit’s lawsuit accuses the defendants of:

Unfair competition: Exploiting Reddit’s content without contributing to its creation or maintenance.
Unjust enrichment: Profiting from Reddit’s valuable data without providing compensation.
Violation of U.S. copyright laws: Unauthorized reproduction and use of copyrighted material.

This marks Reddit’s second such lawsuit, following a similar filing against another major AI company, Anthropic, in June. That case, which alleges Anthropic ignored appeals to cease using Reddit’s content, was moved to federal court and has a hearing scheduled for January. These lawsuits illustrate a growing trend of content creators and platforms seeking to assert their rights and control over their data in the face of widespread AI training practices.

The Defendants’ Rebuttals and the ‘Public Data’ Debate

The accused companies have begun to issue their initial responses:

Perplexity AI: Stated it had not yet received the lawsuit but vowed to “always fight vigorously for users’ rights to freely and fairly access public knowledge.” They emphasized a “principled and responsible” approach to providing factual answers with accurate AI, asserting they “will not tolerate threats against openness and the public interest.”
SerpApi: Through its customer success director, Ryan Schafer, expressed strong disagreement with Reddit’s allegations and an intention to “vigorously defend ourselves in court.”
Oxylabs: Conveyed being “shocked and disappointed” and stated it “will not hesitate to defend itself.” Denas Grybauskas, Oxylabs’ chief governance and strategy officer, articulated a key counter-argument: “no company should claim ownership of public data that does not belong to them.” He suggested that Reddit’s action might be “just an attempt to sell the same public data at an inflated price.”

This debate around the “ownership of public data” is central to many AI copyright cases. While scraping publicly available online data is a common practice for businesses and researchers, Reddit’s argument is that the defendants are behaving like “would-be bank robbers” who, unable to access the vault, break into the armored truck instead – implying a deliberate circumvention of protective measures and a commercial intent that goes beyond general research.

Reddit’s Licensing Model and the Future of AI Data

Significantly, Reddit has already established a precedent for monetizing its data for AI training. The platform has previously entered into licensing agreements with major AI players, including Google and OpenAI, allowing them to legally train their AI systems on the vast public commentary of Reddit’s user base. These licensing deals were crucial in helping the 20-year-old platform raise capital ahead of its Wall Street debut as a publicly traded company last year, as reported by the Associated Press. Another report by the Associated Press further detailed the licensing arrangement with OpenAI.

This litigation against Perplexity AI and its alleged scraping partners underscores Reddit’s commitment to its licensing model. It sends a clear message to the AI industry: while public data may be accessible, its commercial use for AI training without proper agreements could lead to significant legal challenges. The outcome of this case, much like the ongoing Anthropic lawsuit, could set crucial precedents for how AI companies acquire and utilize the vast amounts of online data needed to fuel their technological advancements.