2024-08-03 03:42:53
Several studies indicate that when AI models are trained on data that was itself generated by AI, their answers end up being useless.
If artificial intelligence (AI) models are repeatedly trained on data that was itself generated by AI, they begin to produce increasingly incoherent content, a problem highlighted by several scientific studies.
The models underlying artificial intelligence tools such as ChatGPT, which can generate all kinds of content from a simple question asked in everyday language, need to be trained on an astronomical amount of data. That data is continually scraped from the web, which now contains more and more images and text created by AI. This "autophagy," in which AI feeds on AI, leads to the collapse of the models, which produce responses that are less and less original and relevant, and ultimately meaningless, according to an article published in late July in the scientific journal Nature.
A phenomenon comparable to mad cow disease
In fact, when this type of data, called "synthetic data" because it is generated by machines, is used, the pool from which artificial intelligence models draw to produce their answers loses richness. It is like photocopying a scanned image and then printing it: with each successive copy, the result loses quality, to the point of becoming unrecognizable.
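This copy-of-a-copy effect can be illustrated with a toy simulation (an illustrative sketch only, not the methodology of the studies cited): here the "model" is just a Gaussian fitted to data, and each new generation is trained exclusively on samples drawn from its predecessor. With small samples and many generations, the spread (diversity) of the data collapses.

```python
import random
import statistics

def fit_and_resample(samples, n):
    """Fit a Gaussian 'model' (mean, std) to the samples, then draw
    n fresh 'synthetic' samples from that fitted model."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return [random.gauss(mu, sigma) for _ in range(n)]

def collapse_demo(generations=500, n=20, seed=42):
    """Train generation k only on generation k-1's synthetic output,
    tracking how the spread (std) of the data drifts over time.
    Small n and many generations make the drift easy to see."""
    random.seed(seed)
    # Generation 0: 'real' data from a standard normal distribution.
    data = [random.gauss(0.0, 1.0) for _ in range(n)]
    stds = [statistics.stdev(data)]
    for _ in range(generations):
        data = fit_and_resample(data, n)
        stds.append(statistics.stdev(data))
    return stds

stds = collapse_demo()
print(f"spread at generation 0:   {stds[0]:.4f}")
print(f"spread at generation 500: {stds[-1]:.4f}")
```

In miniature, this mirrors the loss of diversity the studies describe: each refit discards a little of the original variation, and feeding the output back as training data compounds the loss. Real generative models are vastly more complex, but the feedback loop is the same.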
Researchers from the American universities Rice and Stanford reached a similar conclusion by studying the image-generating AI models Midjourney, DALL-E and Stable Diffusion. They showed that the generated images become increasingly riddled with inconsistent elements as AI-generated data is added to the models, and they compare the phenomenon to mad cow disease. That epidemic, which emerged in England, originated in the practice of feeding cattle with meal made from the unused parts of animal carcasses, including carcasses of contaminated animals.
“Synthetic data”
However, companies in the artificial intelligence sector often use this "synthetic data" to build their systems because of its ease of access, abundance and low cost compared with human-created data. "Uncontaminated, high-quality, machine-readable sources of human data are becoming increasingly scarce," Jathan Sadowski, a researcher specializing in new technologies at Monash University in Australia, told AFP.
"Left unchecked over many generations, a catastrophic scenario" would see model collapse syndrome "poison the data quality and diversity of the entire Internet," warned Richard Baraniuk, one of the authors of the Rice University study, in a press release. Just as the mad cow crisis devastated the cattle industry in the 1990s, an Internet flooded with AI-generated content and "mad" models could threaten the future of the booming, multibillion-dollar AI industry, the scientists say. "The real question for researchers and companies building AI systems is: at what point does the use of synthetic data become too much?" Sadowski concluded.
“This part of the Internet is dirty”
But for other professionals, the problem is overstated and far from inevitable. Anthropic and Hugging Face, two standout companies in the field of artificial intelligence, confirmed to AFP that they use AI-generated data. The Nature article offers an interesting theoretical perspective, but an unrealistic one, according to Anton Lozhkov, machine learning engineer at Hugging Face. "Training (models) on many successive rounds of synthetic data simply doesn't happen in reality," he assured.
Anton Lozhkov admits, however, that AI experts are, like everyone else, disappointed with the state of the web. "This part of the Internet is dirty," he said, adding that his company has made great efforts to clean the collected data, sometimes deleting up to 90% of it.
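The kind of aggressive cleaning Lozhkov describes can be sketched with a minimal exact-deduplication filter (a hypothetical illustration, not Hugging Face's actual pipeline, which also involves fuzzy near-duplicate detection, quality classifiers and more):

```python
import hashlib

def dedup(docs):
    """Drop exact duplicates by hashing lightly normalized text.
    A toy stand-in for large-scale web-data cleaning."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize whitespace and case so trivial variants collide.
        key = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["Hello world", "hello world  ", "AI-generated spam", "Hello world"]
clean = dedup(corpus)
print(f"kept {len(clean)} of {len(corpus)} documents")  # kept 2 of 4 documents
```

Even this crude filter removes half of the toy corpus; on real web crawls dominated by mirrored and machine-generated pages, discard rates as high as the 90% figure cited above become plausible.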