Microsoft quietly deleted a blog promoting training AI on pirated Harry Potter books — amid backlash over copyright concerns
Microsoft deleted a 2024 post that encouraged developers to use text files copied from Harry Potter books to train AI.
While OpenAI CEO Sam Altman has openly admitted that it's virtually impossible to develop advanced AI models like ChatGPT without copyrighted content, he argues that copyright law doesn't categorically prohibit AI firms from using such content for training, leaning on the fair use doctrine as justification.
More recently, following backlash from critics in a Hacker News thread, Microsoft was forced to delete a blog post it had published in November 2024, which seemingly encouraged developers to use pirated Harry Potter books to train AI models.
“The Harry Potter series, written by J.K. Rowling, is a globally beloved collection of seven books that follow the journey of a young wizard, Harry Potter, and his friends as they battle the dark forces led by the evil Voldemort,” wrote Pooja Kamath, a Microsoft Senior Product Manager, in the now-deleted blog post (which is still accessible via the Internet Archive).
In the blog post, the tech giant leveraged the beloved collection of seven Harry Potter books to show its new feature, which would help users add AI to apps via Azure. The goal behind this was to use a well-known dataset to show engaging and relatable examples of its new feature that would resonate with a wide audience.
The Microsoft blog post linked to a Kaggle dataset that included all seven Harry Potter books and was mistakenly marked as “public domain.” According to Ars Technica, the dataset was downloaded only about 10,000 times, a low figure considering that the blog post had been available online for over a year.
The dataset was deleted late last week after the outlet reached out to Shubham Maindola, a data scientist in India with no known affiliations to Microsoft. “The dataset was marked as Public Domain by mistake,” Maindola told Ars Technica. “There was no intention to misrepresent the licensing status of the works.”
Developing generative AI is no easy feat. Top AI research labs, such as OpenAI, are quickly burning through substantial funds to maintain the hype amid rising concerns among investors about returns on their investments. The ChatGPT maker is reportedly on course to lose $14 billion in 2026 and could go bankrupt by the middle of next year.
The money aside, AI models heavily rely on information from the internet for training. However, reports suggest that Google, OpenAI, and Anthropic are suffering from a lack of high-quality data for model training, slowing down the advances in AI development.
Do you think AI model training constitutes copyright infringement?
AI model training has always been a complex issue, largely because there are no clear laws preventing tech companies from using copyrighted material in the process. Many firms lean on the concept of fair use as a legal shield, arguing that their practices fall within its protections.

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.
