In January 2026, the internet celebrated the 25th anniversary of Wikipedia, the original workhorse behind third-grade papers on giant squids and fourth-grade social studies projects. Much as students today debate the academic integrity of enlisting generative AI for their work, it was once difficult to resist citing Wikipedia articles as a source. After all, right at our fingertips were over 7 million articles in plainly written English, summarizing even the most complex, niche, and fraught subjects.
What we may not have fully understood at the time is that Wikipedia embodied the foundational philosophy that gave us the internet. In the 1980s, programmer Richard M. Stallman founded the free software movement to combat rising commercialization. Key examples still in use today include the Linux operating system (which hosts most websites), Apache HTTP Server, WordPress, and Mozilla Firefox. The Hill cites estimates that 96 percent of all software still benefits from open source code. Those of us raised with the internet now largely take for granted that this concerted push to share code, collaborate, and invite community feedback produced higher-quality software.
The early web was often framed in almost utopian terms. In his 1996 “A Declaration of the Independence of Cyberspace,” Electronic Frontier Foundation co-founder John Perry Barlow proclaimed cyberspace a borderless “new home of Mind” that would develop its own rules and social contracts, free from state control. Wikipedia is one of the most enduring experiments to actually build such a space: a digital commons where knowledge is collectively produced, governed, and shared. Under a Creative Commons Attribution-ShareAlike 4.0 license, contributors allow their work to be freely shared and adapted in support of broad public access, including commercial reuse and derivative works, provided that proper credit is given.
Since English Wikipedia was launched in January 2001 by Jimmy Wales and Larry Sanger, it has become one of the most widely used global reference sources. By the Wikimedia Foundation’s own statistics, the Wikimedia projects together average more than 16 edits per second, carried out by a global community of volunteer editors, administrators, and their bots. As a steward of free and accessible knowledge, Wikipedia has contributed significantly to the advancement of artificial intelligence. According to The Washington Post and the Wikimedia Foundation, content from Wikimedia platforms has been one of the largest sources of training data for large language models (LLMs).
The open internet of the past two decades has been instrumental in advancing the AI era. Yet despite promises of broad public benefit, a growing backlash against AI training practices recasts data not as an inexhaustible pool of cumulative knowledge but as a finite resource that must be protected from commercial exploitation, actively maintained, and responsibly governed. In that light, Wikipedia may still have much to teach us.
A commons functions on shared management and mutually agreed governance structures to ensure equitable access to a finite resource. Once boundaries are established for managing and archiving information, rules are put in place to prevent the pollution or outright vandalism of articles with false information. Repeat offenders who spread rumors, lies, or other distortions have their IP addresses and usernames banned from future edits. For this system to run effectively, a minimum number of people have to be involved, reading articles and contributing to them so that the site remains a timely and accurate source of information.
The goodwill and mysterious motivations of Wikipedia volunteers aside, this group is emblematic of a belief that accumulated knowledge can foster innovation in the public interest. As Ryan Safner writes in “Institutional entrepreneurship, Wikipedia, and the opportunity of the Commons,” innovation does not come from orphan ideas. Free encyclopedias challenge the popular image of the lone inventor creating new expressive works in isolation, and instead embrace a process in which we absorb, imitate, and finally build upon existing knowledge to innovate.
AI should, in theory, be another great example of how digital commons lead to innovation. Yet the questionable training of AI models on copyrighted works, scraped from both legal and illegal sources, is cultivating an online attitude that is outright hostile toward LLMs. Creative Commons works are a standard source for data mining, but so too are copyrighted books and newspaper articles, often used without payment, the author’s consent, or any attribution. In response, dozens of copyright infringement lawsuits have emerged over the past two years, arguing that even if original texts are sufficiently “transformed” into tokens for training, chatbots can still reproduce the underlying works when prompted.
This protest movement is not limited to authors and artists defending their copyright. Platforms like Reddit and Twitter (now X) have restricted access to their application programming interfaces to prevent AI “crawlers” from freely scraping their content, effectively locking their digital doors. Internet users themselves are resorting to a kind of digital guerrilla warfare, embedding adversarial changes in web content to “poison” generative models. For instance, researchers at the University of Chicago developed Nightshade, a free tool that introduces pixel-level changes that cause models to misinterpret images. If enough of these altered images are used in training, the models can fail to generate images as intended.
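To make the mechanism concrete, the toy Python sketch below applies a small, bounded, random perturbation to an image array. It is only an illustration of the “small pixel changes, same-looking image” idea; Nightshade’s actual method optimizes its perturbations so that models learn the wrong associations, and the numbers here are arbitrary.

```python
# Toy illustration of a bounded pixel-level perturbation.
# NOT Nightshade's algorithm: the noise here is random, not optimized.
import numpy as np

def perturb(image: np.ndarray, budget: int = 4, seed: int = 0) -> np.ndarray:
    """Shift each pixel channel by at most +/- `budget` intensity levels."""
    rng = np.random.default_rng(seed)
    noise = rng.integers(-budget, budget + 1, size=image.shape)
    return np.clip(image.astype(int) + noise, 0, 255).astype(np.uint8)

original = np.full((64, 64, 3), 128, dtype=np.uint8)  # a flat gray test image
poisoned = perturb(original)

# The altered image differs from the original by at most `budget` levels per
# channel, well below what a human viewer would notice.
print(int(np.abs(poisoned.astype(int) - original.astype(int)).max()))  # <= 4
```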
Despite assurances that powerful AI models will benefit everyone, the backlash reflects a growing sentiment: this collectively produced knowledge, made accessible through ad revenue or donations, has subsidized a few companies that fail to meaningfully contribute back to the ecosystem, which not only distorts our understanding of copyright but also threatens to deplete these shared resources.
One of the more recent lawsuits, filed by Encyclopedia Britannica, alleges that OpenAI misused its reference materials to train its artificial intelligence models. Notably, Britannica states in its complaint that ChatGPT trained on its articles and “cannibalized” Britannica’s web traffic with AI-generated summaries of its content. Copyright issues aside, helpful AI summaries can bring about a tragedy of the digital commons. McMahon et al. call this a “death spiral,” in which fewer people clicking on links leads to reduced ad revenue and, in the case of Wikipedia, fewer volunteers and contributors of high-quality content. The decline of online foot traffic has clear implications for journalists, researchers, and artists who depend on sales or subscriptions for their livelihoods. It also presents a significant obstacle to AI development, which should rely on diverse and representative human-generated inputs, even as some propose using synthetic data to fill the gap.
In the age of data mining, knowledge commons are no longer immune to overuse, and companies have access to only a small sliver of the data that makes up the online universe. As it stands, some studies suggest we are nearing language data exhaustion between 2030 and 2040, and visual data exhaustion between 2040 and 2060. The New York Times has reported that access to representative datasets capable of training more inclusive and powerful models is dwindling as public opposition to AI training grows. Websites once considered low-hanging fruit for AI training data are becoming increasingly restricted through robots.txt policies and terms of service. One study from the Data Provenance Initiative suggests that this new wave of restrictions fails to distinguish between crawlers used for AI training and those used by archival and nonprofit organizations for academic research or for organizing knowledge commons.
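To illustrate that failure to distinguish, the short Python sketch below parses a hypothetical blanket robots.txt policy. The crawler names and the policy itself are illustrative assumptions rather than the rules of any particular site, but the result shows how a blanket rule cannot tell commercial AI training traffic apart from archival or academic crawling.

```python
# A blanket robots.txt policy blocks every crawler it does not name,
# treating AI training bots and archival bots identically.
from urllib.robotparser import RobotFileParser

BLANKET_POLICY = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(BLANKET_POLICY)

# Hypothetical mix of crawlers: an AI training bot, an archival bot,
# and a research bot. All three receive the same answer.
for agent in ("GPTBot", "ia_archiver", "HypotheticalResearchBot"):
    allowed = parser.can_fetch(agent, "https://example.org/article")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```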
Over the past two decades, Wikipedia has reflected the internet’s complicated relationship with machine learning tools. During its rapid growth in the 2000s, editors began relying on bots to assist with menial tasks like copy editing, monitoring for vandalism, fixing links, and connecting related articles to each other. These bots, approved by the community and supervised by humans, accounted for up to 10 percent of Wikipedia’s activity as of 2019 and played a critical role in “cleaning” the commons. Some also began generating short technical articles based on census data.
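As a rough illustration of the kind of menial task such bots automate, the Python sketch below queries the public MediaWiki recent-changes API and flags edits that removed a large amount of text. The threshold and the flagging logic are deliberately crude simplifications; real community bots, especially anti-vandalism patrollers, are far more sophisticated and operate under community approval.

```python
# Minimal sketch of a vandalism-watching task: list recent edits that
# removed a suspiciously large amount of text. Threshold is arbitrary.
import requests

API = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "list": "recentchanges",
    "rcprop": "title|user|sizes",   # include old/new page sizes per edit
    "rclimit": 50,
    "format": "json",
}

changes = requests.get(API, params=params, timeout=10).json()["query"]["recentchanges"]

for change in changes:
    removed = change.get("oldlen", 0) - change.get("newlen", 0)
    if removed > 2000:  # crude proxy for "large deletion worth a human look"
        print(f"{change['user']} removed {removed} bytes from {change['title']}")
```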
When ChatGPT entered the picture, contributors began experimenting with LLMs to generate articles. However, early cases in which these articles included hallucinated sources, biases, and misinformation led to an “AI cleanup” initiative. By 2025, the community had established a process for nominating and removing articles suspected of being entirely AI-generated. That brings us to today: as of March 2026, Wikipedia has restricted the use of AI chatbots to copyediting, having determined that LLM-generated content violates its core content policies.
A closer look at Wikipedia’s evolution teaches us a lot about the importance of maintaining and securing an accessible knowledge commons. The online encyclopedia has not only been a useful resource for the public; it has underpinned search engines and now a new frontier with LLMs. Its community of editors and their pet bots can likewise chart a path for preserving human insight on an internet flooded with ever more AI noise.
Outside the courtroom, new governance frameworks and funding mechanisms can meet the challenge of properly incentivizing quality content creation and public support for training strong AI models. On its birthday, Wikipedia signed deals with Meta, Amazon, Microsoft, and others, charging the companies to use its content for AI training. Cloudflare, a content delivery network that serves almost a quarter of the internet, has taken a different approach, handing the ability to charge AI crawlers to content creators themselves through its pay-per-crawl feature. This exploration raises broader questions: whether charging AI companies can sufficiently preserve publicly available content, what more innovative approaches might better protect content creators, and, most fundamentally, whether it is possible to win the AI race without the public on your side.
Karolina Mackiewicz is the Director of Policy and Engagement at Portulans Institute, where she focuses on responsible data and tech governance.



