Was Sora AI trained using YouTube and gaming content? OpenAI might need a minute to check with the team.

OpenAI officially launched the AI video generation model Sora on December 10, 2024. (Image credit: Getty Images | CFOTO)

OpenAI recently shipped its text-to-video AI model, Sora, to general availability as part of its "12 days of shipmas" extravaganza. The model was first released in preview in February this year. The ChatGPT maker indicated that the tool is limited to ChatGPT Pro and Plus users, and gave no indication of whether it will ship to free users in the foreseeable future.

While the tool is admittedly impressive and in a class of its own, the AI firm highlighted critical performance issues with its video generation process, including unrealistic physics and a struggle with complex actions over long durations. This is despite the tool being powered by OpenAI's more capable Sora Turbo AI model.

Copyrighted content is AI's bread and butter

OpenAI and, by extension, Microsoft are no strangers to copyright infringement issues. The companies have been slapped with several lawsuits over the issue. OpenAI CEO Sam Altman admitted that developing tools like ChatGPT without copyrighted content is impossible. The executive argued that copyright law doesn't categorically forbid using copyrighted content to train AI models.

While speaking to TechCrunch, Joshua Weigensberg, an IP attorney at Pryor Cashman, indicated:

“Companies that are training on unlicensed footage from video game playthroughs are running many risks. Training a generative AI model generally involves copying the training data. If that data is video playthroughs of games, it’s overwhelmingly likely that copyrighted materials are being included in the training set.”

Microsoft and OpenAI have contested the copyright infringement cases, citing fair use while referring to their models' creations as transformative rather than plagiarized work.

Popular YouTuber and tech reviewer Marques Brownlee raised critical concerns about Sora when it recently launched, questioning the source of its training material. Brownlee had early access to the tool, which allowed him to assess its capabilities. In the process, the YouTuber asked the AI tool to generate a video of a tech reviewer talking about a smartphone.

The AI-generated video caught the reviewer's attention, especially the plant on the desk in the video. He indicated that the plant featured in the clip looked suspiciously similar to the one in dozens of his videos.

“Are my videos in that source material? Is this exact plant part of the source material? Is it just a coincidence? I don’t know.”

Marques Brownlee, Tech Reviewer

While the AI-generated video isn't definitive proof that Sora lifted some of its inspiration from Brownlee's videos, it raises eyebrows and might be worth watching.

Former OpenAI CTO Mira Murati was previously asked if Sora is trained using YouTube, Instagram, and Facebook content but couldn't provide a straight answer other than indicating that the model is trained on publicly available data alongside licensed data from stock media, including Shutterstock.

The AI firm didn't respond to TechCrunch's request for comment on its findings, other than saying it would "check with the team."

Kevin Okemwa
Contributor