Wikipedia Pays Price of AI Boom: Company Struggling From Rising Costs Due to Bots Scraping Its Articles
Popular online encyclopedia Wikipedia is reportedly paying a major price
for the AI boom. The encyclopedia giant is struggling with a rise in
costs due to bots scraping content that is used for training AI models.
This is not only a financial burden; it also strains the platform’s bandwidth.
On Tuesday, the nonprofit that hosts Wikipedia warned that automated
requests for its content keep growing exponentially. The surge disrupts
the website and forces the encyclopedia to add capacity, which in turn
raises its data center bills.
The infrastructure is built to withstand spikes in human traffic
during major events, but the traffic generated by scraper bots is
unpredictable and brings rising costs and higher risk.
The Foundation said that the bandwidth used for downloading
content grew 50%. That traffic is not coming from human readers but
from automated programs, which keep downloading licensed images
to feed to their AI models.
Another serious issue is that
bots gather large amounts of data from less popular articles
on Wikipedia. A closer look showed that nearly 65% of the most
expensive traffic comes from bots. That is disproportionate, given that
bots account for only about 35% of overall pageviews.
To address the issue, the Wikimedia Foundation says it is rolling out a Responsible Use of Infrastructure plan that identifies the network strain from AI scraper bots as unsustainable.
Wikipedia hopes to gather feedback from the community on how best to tackle the problem, including how to identify traffic from these scraper bots and filter it out. Options include requiring authentication for high-volume scrapers and API use.
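As a rough illustration of what requiring authentication from high-volume scrapers could look like, here is a minimal sketch. It is not the Foundation's actual design: the `classify_clients` function, the `ANON_LIMIT` threshold, and the shape of the request log are all hypothetical, chosen only to show the idea of letting light anonymous traffic through while asking heavy clients to identify themselves.

```python
from collections import Counter

# Hypothetical threshold: requests per time window allowed without credentials.
ANON_LIMIT = 100


def classify_clients(requests, anon_limit=ANON_LIMIT):
    """Decide, per client, whether to allow traffic or demand authentication.

    `requests` is an iterable of (client_id, has_token) pairs observed in one
    time window. The field names and the policy itself are assumptions.
    """
    counts = Counter()
    authenticated = set()
    for client_id, has_token in requests:
        counts[client_id] += 1
        if has_token:
            authenticated.add(client_id)

    decisions = {}
    for client_id, n in counts.items():
        # Light users and identified heavy users pass; anonymous heavy users do not.
        if n <= anon_limit or client_id in authenticated:
            decisions[client_id] = "allow"
        else:
            decisions[client_id] = "require_auth"
    return decisions
```

Under this policy an ordinary reader with a handful of requests is unaffected, an anonymous client sending hundreds of requests in the window is asked to authenticate, and an identified partner sending the same volume is allowed.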
Wikipedia
recognizes the threat: its content is free of charge, but
the infrastructure is not. It has to act now to restore a healthier
balance.
Reddit faced something similar in 2023. Software giant
Microsoft, for instance, did not tell Reddit that it was scraping content
and using it for AI features. Reddit’s CEO openly condemned the
practice, and the company blocked Microsoft from scraping its pages.
Reddit further
decided to take action by charging third-party developers for access
to its API. This led to a developer revolt, sudden blackouts across
the platform, and even the shutdown of some leading third-party
clients.
