Google's going to scrape the entire public Internet to train its AI tools and there's nothing we can do about it

Google Bard announcement blog — (Image credit: Future)

What you need to know

Google's latest privacy policy came into effect on July 1 and it's going to be a little controversial.
The owner of the largest search engine on the planet is now going to use all that scraping to train its AI models and we're basically going to have to live with it.
The use of data to train AI models has already provided its own drama, notably from large sources like Reddit.

Google's latest privacy policy update isn't necessarily surprising, but it does also set off some alarm bells. Particularly for those who already have their doubts over the AI revolution.

As highlighted by Gizmodo, the latest statement on the search giant's privacy policy contains a key update relating to AI:

“For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”

The most recent policy prior to this only made mention of "language models" and, specifically, Google Translate. The latest update makes it clear that Google is going to feed anything public on the Internet into its AI tools like Bard.

Is this surprising? Not at all. Google is the gatekeeper to the Internet, especially for publishers like us and our parent company. Playing the game of getting your content to rank well in Google is exhausting, but also critical. And now all of that content is going to be fed into Google AI. All of it.

It's certainly going to stoke the flames of debate. Recently we've seen issues on Reddit with regard to access to its API, the losers of which were basically the users of Reddit. Twitter's owner, Elon Musk, has also been vocal about scraping, claiming the recent disaster on the platform with rate limits is a response to it (even if that might not be 100% true).

BingGPT brings the Bing Chat experience to the desktop — AI is the future but it's going to get messy. (Image credit: Windows Central)

This move is only going to further stoke the debate, and the backlash, over the training of AI tools. OpenAI has already had its fair share of criticism over the data used to train the GPT model, the same one that powers Microsoft's Bing Chat. Microsoft also has a search engine, but its reach pales in comparison to that of Google Search.

The legality will also come into question. We're in uncharted, murky waters with all this stuff. The EU already has issues with Google Bard, and quite how this will align with the territory's GDPR rules will be interesting to find out. As long as it's technically not illegal, maybe Google is just going to do what Google does. Which is whatever it wants.

AI models need to be trained somehow. But Google's latest policy doesn't seem to indicate the company is willing to compensate any of the creators of that content. Everyone needs their stuff to be surfaced in Google, and it does feel kind of like Google is abusing that to its own ends.

Buckle up, it's going to be a bumpy ride.

Richard Devine is the Managing Editor at Windows Central, where he combines a deep love for the open-source community with expert-level technical coverage. Whether he’s hunting for the next big project on GitHub, fine-tuning a WSL workflow, or breaking down the latest meta in Call of Duty, Forza, and The Division 2, Richard focuses on making complex tech accessible to every kind of user. If it’s happening in the world of Windows or PC gaming, he’s probably already knee-deep in the code (or the lobbies). Follow him on X and Mastodon.