2026-02-24

While OpenAI CEO Sam Altman has openly admitted that it's virtually impossible to develop advanced AI models like ChatGPT without copyrighted content, he argues that copyright law doesn't categorically prohibit AI firms from training on it, leaning on the fair use doctrine as a legal defense.

More recently, following backlash from critics in a Hacker News thread, Microsoft was forced to delete a blog post it had published in November 2024 that seemingly encouraged developers to use pirated Harry Potter books to train AI models.

“The Harry Potter series, written by J.K. Rowling, is a globally beloved collection of seven books that follow the journey of a young wizard, Harry Potter, and his friends as they battle the dark forces led by the evil Voldemort,” wrote Pooja Kamath, a Microsoft Senior Product Manager, in the now-deleted blog post (still accessible via the Internet Archive).

In the blog post, the tech giant used the seven Harry Potter books to showcase a new feature that helps users add AI to apps via Azure. The goal was to use a well-known dataset to build engaging, relatable examples that would resonate with a wide audience.

The Microsoft blog post linked to a Kaggle dataset containing all seven Harry Potter books, which had been mistakenly marked as “public domain.” According to Ars Technica, the dataset was downloaded only about 10,000 times, a relatively low figure considering that the blog post had been available online for over a year.

The dataset was deleted late last week after the outlet reached out to Shubham Maindola, a data scientist in India with no known affiliations to Microsoft. “The dataset was marked as Public Domain by mistake,” Maindola told Ars Technica. “There was no intention to misrepresent the licensing status of the works.”

Developing generative AI is no easy feat. Top AI research labs, such as OpenAI, are quickly burning through substantial funds to maintain the hype amid rising concerns among investors about returns on their investments. The ChatGPT maker is reportedly on course to lose $14 billion in 2026 and could face bankruptcy by the middle of next year.

Money aside, AI models rely heavily on information from the internet for training. However, reports suggest that Google, OpenAI, and Anthropic are running short of high-quality training data, which is slowing advances in AI development.

AI model training has always been a complex issue, largely because there are no clear laws preventing tech companies from using copyrighted material in the process. Many firms lean on the concept of fair use as a legal shield, arguing that their practices fall within its protections.

