Study: AI "inbreeding" may cause model collapse for tools like ChatGPT, Microsoft Copilot
It's like Game of Thrones, but for artificial intelligence large language models.
What you need to know
- AI tools like ChatGPT and Microsoft Copilot are driving a ton of hype throughout the tech world.
- Generative AI systems rely on training data, typically stolen from human internet content creators, to train their models.
- However, as the industrialized firehose of AI-generated content floods the internet, researchers are worried about how AI models may be affected by ingesting their own regurgitated output.
- Now, a comprehensive study published in Nature seems to suggest that AI "inbreeding" fears may indeed be founded.
What do AI models, European royal families, and George R. R. Martin have in common? Well, it could be a troubling infatuation with inbreeding.
AI models and tools are currently the big hotness in tech, with every company from Google to Microsoft to Meta getting deeply involved in the shift. Large language models (LLMs) and generative AI tools like ChatGPT, Microsoft Copilot, and Google Gemini are upending our relationship with computing. Or at least, they will, in theory — apparently.
Right now, AI tools are so server-intensive and expensive to run that even AI frontrunner OpenAI is apparently on a collision course with bankruptcy without further funding rounds. Even huge tech companies like Google and Microsoft are struggling to figure out how to actually monetize this technology, as the masses don't yet see the point in paying for many of the tools currently on offer. There's also a school of thought that AI models may have already peaked, and are destined to only get dumber.
"Model collapse" is a largely theoretical concept that predicts that as increasing amounts of content on the web becomes AI-generated, that AI will begin essentially "inbreeding" on AI generated training data, as high-quality human-made data becomes increasingly scarce. There's already been instances of this occurring in parts of the net where localized data is scarce, owing to content being created in less populated languages. We now have some more comprehensive studies into the phenomenon, with this new paper published in Nature.
"We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear," the abstract reads. "We refer to this effect as 'model collapse' and show that it can occur in [Large Language Models] as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs)."
In very simple terms, you could think of "model collapse" as running along a similar entropic trajectory to JPEG compression. As memes and JPEGs get saved, posted, saved, and posted repeatedly across the internet, artifacts and errors in the data begin to appear and are then replicated. The paper argues that "indiscriminate" use of online training data could result in similar degradation in LLMs, as companies scrape the open web to train their machines.
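The core mechanism can be demonstrated with a toy simulation (this is an illustrative sketch, not the paper's actual experimental setup): fit a simple Gaussian "model" to data, sample a new training set from that model, refit, and repeat. Because each generation re-estimates the distribution from a finite sample of its predecessor's output, the tails progressively disappear and the variance collapses toward zero — the statistical analogue of the JPEG artifacts described above.

```python
import random

def fit_gaussian(data):
    """Estimate mean and standard deviation from a finite sample."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, var ** 0.5

def simulate_collapse(generations=200, sample_size=20, seed=42):
    """Train each 'generation' only on samples drawn from the previous one."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # the original "human-made" data distribution
    stds = [std]
    for _ in range(generations):
        # Sample from the current model, then refit on those samples alone.
        sample = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean, std = fit_gaussian(sample)
        stds.append(std)
    return stds

stds = simulate_collapse()
print(f"spread after 0 generations:   {stds[0]:.4f}")
print(f"spread after 200 generations: {stds[-1]:.4f}")  # far smaller than 1.0
```

Each refit slightly underestimates the true spread and loses the rare tail values that never made it into the sample, so the errors compound generation after generation — which is roughly what "irreversible defects ... in which tails of the original content distribution disappear" means in practice.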
"We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models," the paper continues. "We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet."
Tech companies don't care about 'healthy' AI
The mad dash to capitalize on this supposed generational computing shift, backed by a bulldozer of hype and speculation, has been embarrassing to watch. While LLMs and generative AI are evidently far more substantive than Big Tech's previous faddy trends like the blockchain and the metaverse, Google, Microsoft, and others have been stumbling over themselves even more carelessly than usual. Google pushed its AI search summaries out to the masses with reckless abandon, resulting in hilarious answers that encouraged users to eat rocks. Microsoft's "Recall" feature for its Copilot+ PC launch was an unmitigated disaster, showcasing a complete lack of taste, tact, and vision for what AI's relationship with consumers should even be.
Microsoft and Google have both taken a torch to their climate pledges too, as the AI-fuelled frenzy sends data center power and water consumption skyrocketing. Microsoft also laid off its team dedicated to ethics in AI — we all know how those pesky ethics can get in the way of short-term profits.
Every action these companies take in the name of AI screams of greed and reckless irresponsibility. I don't believe for a second that any of them would take warnings of "model collapse" seriously, since that'll be a problem for a future fiscal year to solve.
RELATED: Microsoft AI chief says content on the web is "free" to be stolen
Microsoft and Google are aggressively pursuing ways to rob content creators of all shapes and sizes of much-needed income, by stealing content and putting it directly into search results. Making content creation financially unviable for all but the biggest corporate entities is only going to further degrade the quality of information on the web and exacerbate any potential "model collapse," while also centralizing information around a powerful few. But hey, maybe that's partially the point.
I can't foresee Microsoft and Google taking any of this seriously, though. Nor do I expect any recompense for the content being stolen wholesale to power these systems. What I do foresee, though, is a pretty dark future for the internet.
Jez Corden is a Managing Editor at Windows Central, focusing primarily on all things Xbox and gaming. Jez is known for breaking exclusive news and analysis as it relates to the Microsoft ecosystem, all while being powered by tea. Follow him on Twitter (X) and Threads, and listen to his XB2 Podcast, all about, you guessed it, Xbox!