Because of artificial intelligence, the web is closing itself off more and more

by time news

2024-09-13 04:30:06

AI data flows (illustration).

The arrival of ChatGPT in November 2022 continues to have consequences, both fantasized and real. Among the latter, a group of independent researchers, the Data Provenance Initiative, has just identified one that is quite unexpected: the drying up of the resources from which generative artificial intelligence systems, the category popularized by the tool of the American company OpenAI, can draw. More precisely, in a preprint submitted to a conference in July, this group measures the extent to which a large number of sites, among the most visited in the world (The New York Times, HuffPost, Guardian…), now forbid automatic data-collection tools, known in English as crawlers, from accessing their information. Yet it is from this data that the corpora used to train artificial intelligences such as ChatGPT, Gemini, Copilot, Le Chat, Llama and Claude are built. The larger the corpora, the better the results, even if their “quality” also matters.


To arrive at this observation of the web closing off, the researchers studied three corpora widely used for AI development, C4, RefinedWeb and Dolma, which contain billions of “tokens” (lexical units: syllables or even words) drawn from tens of millions of websites (media, forums, encyclopedias, online retailers, personal or university sites, social networks, etc.). They also collected two types of information about these sites in order to know what they do or do not authorize: their general terms of use and a file called “robots.txt”, which robot-crawlers are supposed to “read” to determine whether they have the right to collect data (though this ban can also be ignored).
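Concretely, a well-behaved crawler fetches a site’s robots.txt and checks whether its user agent is allowed before collecting pages. The short sketch below, using Python’s standard urllib.robotparser module, illustrates this check against an example robots.txt; the “GPTBot” user agent and the directives shown are illustrative assumptions, not rules taken from any particular publisher mentioned in the study.

    # Minimal sketch of a robots.txt check, assuming an illustrative file
    # that blocks an AI crawler ("GPTBot") while allowing everything else.
    from urllib.robotparser import RobotFileParser

    EXAMPLE_ROBOTS_TXT = """\
    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Allow: /
    """.splitlines()

    parser = RobotFileParser()
    parser.parse(EXAMPLE_ROBOTS_TXT)

    # A compliant crawler skips any URL for which can_fetch() returns False;
    # as the article notes, nothing technically forces it to comply.
    print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
    print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True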

Blacklist

The first observation is that the restrictions introduced in robots.txt have “exploded” since 2023. Almost 30% of the largest sites now use them, compared with barely 2% previously. In terms of data volume, the researchers estimate that more than 30% of the characters from the 3,950 largest sites in the C4 and RefinedWeb corpora are affected by restrictions.

Not all crawlers are in the same boat: 25.9% of C4 tokens are blocked for OpenAI’s robots, compared with only 13.3% for Anthropic’s and 4.1% for Meta’s. Recently, several publishers have announced that they are blocking one of the latest robots to arrive on the market, Apple’s.

The researchers also noted that an American non-profit organization, Common Crawl, is on the blacklist of many sites. It is true that its data is used to build C4, RefinedWeb, FineWeb, Dolma and others. But the bans also apply to the crawlers of the Internet Archive, a non-commercial web “archiving” service.
