Was Sora AI trained using YouTube and gaming content? OpenAI might need a minute to check with the team.

OpenAI officially launched the AI video generation model Sora, on December 10, 2024. (Image credit: Getty Images | CFOTO)

OpenAI recently shipped its text-to-video Sora AI model to general availability as part of its 12 days of shipmas extravaganza. The model was first released in preview in February of this year. The ChatGPT maker indicated that the tool is limited to ChatGPT Pro and Plus users, and gave no indication of whether it will ship to free users in the foreseeable future.

While the tool is admittedly impressive and in a class of its own, the AI firm highlighted critical limitations in its video generation process, including a tendency to generate unrealistic physics and a struggle with complex actions over long durations. This is despite the tool being backed by OpenAI's powerful and more capable Sora Turbo AI model.

The ChatGPT maker has seemingly remained mum about the model's training sources. However, a report by TechCrunch suggests Sora may have been trained on game content. When OpenAI debuted Sora in February, it was apparent that the AI model was trained using Minecraft videos.

As it now seems, Minecraft isn't the only video game in Sora AI's training chest. Super Mario Bros., Call of Duty, Counter-Strike, and a '90s-era Teenage Mutant Ninja Turtles game seem to be in the fold, too. OpenAI has publicly shared several Sora-generated clips that bear an uncanny resemblance to the video games listed above.

Interestingly, Sora's training material appears to go beyond video games. Twitch streams could also be part of the material used to train the model. In a screenshot shared by TechCrunch, the model seems to have a good grasp of what a Twitch stream looks like, suggesting that it might have been trained using the platform's content. Perhaps more interestingly, the AI model generates videos resembling popular Twitch streamers, including Raúl Álvarez Genes (Auronplay).

TechCrunch notes that the model is aggressively filtered to prevent copyright infringement issues. As such, a direct prompt asking the model to generate a clip featuring a trademarked character will be rejected outright, meaning you'll have to get creative with your prompt engineering skills.

Copyrighted content is AI's bread and butter

OpenAI and, by extension, Microsoft are no strangers to copyright infringement issues. The companies have been slapped with several lawsuits over the issue. OpenAI CEO Sam Altman admitted that developing tools like ChatGPT without copyrighted content is impossible. The executive argued that copyright law doesn't categorically forbid using copyrighted content to train AI models.

Speaking to TechCrunch, Joshua Weigensberg, an IP attorney at Pryor Cashman, said:

“Companies that are training on unlicensed footage from video game playthroughs are running many risks. Training a generative AI model generally involves copying the training data. If that data is video playthroughs of games, it’s overwhelmingly likely that copyrighted materials are being included in the training set.”

Microsoft and OpenAI have contested the copyright infringement cases, citing fair use while referring to their models' creations as transformative rather than plagiarized work.

Popular YouTuber and tech reviewer Marques Brownlee raised critical concerns about Sora when it recently launched, questioning the source of its training material. Brownlee had early access to the tool, which allowed him to assess its capabilities. In the process, the YouTuber asked the AI tool to generate a video of a tech reviewer talking about a smartphone.

The AI-generated video caught the reviewer's attention, especially the plant on the desk in the video. He indicated that the plant featured in the clip looked suspiciously similar to the one in dozens of his videos.

“Are my videos in that source material? Is this exact plant part of the source material? Is it just a coincidence? I don’t know.”
Marques Brownlee, Tech Reviewer

While the AI-generated video isn't definitive proof that Sora drew on Brownlee's videos, it raises eyebrows and is worth watching.

Former OpenAI CTO Mira Murati was previously asked whether Sora was trained using YouTube, Instagram, and Facebook content but couldn't provide a straight answer, saying only that the model is trained on publicly available data alongside licensed data from stock media providers, including Shutterstock.

The AI firm's only response to TechCrunch's request for comment on its findings was that it would "check with the team."

Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.