AI companies accused of scanning YouTube data without permission — here's what we know

iPhones showing Apple Intelligence features
(Image credit: Apple)

Leading AI labs and big tech companies have been accused of using captions from tens of thousands of YouTube videos without permission to train artificial intelligence models.

Google has strict rules in place banning the harvesting of material from YouTube without permission. A new investigation by Proof News claims several big tech companies used the subtitles from more than 170,000 videos.

The captions were part of 'the Pile', a massive dataset compiled by the non-profit EleutherAI. Originally intended to give smaller companies and individuals a quick way to train their models, big tech and AI companies have also adopted this vast reservoir of information.

Apple was originally in the list of companies claimed to have used the Pile but it has since refuted these claims.  Apple says it is committed to respecting creator and publisher rights and offers an opt out of being used in training for Apple Intelligence. 

A thirst for data

Smart replies powered by Apple Intelligence in Mail on an iPhone

(Image credit: Apple)

Several studies have now found that two things are essential in making more advanced AI models — data and computing power.

Increasing one or both leads to better responses, improved performance and scale. But data is an increasingly scarce and expensive commodity.

There are several lawsuits against AI image and music generation companies underway at the moment over whether there is a copyright fair use for training data.

Companies like OpenAI and Google have a combination of their own massive data repositories and deals with major publishing companies or Reddit. 

Meta has Facebook, Instagram, Threads and WhatsApp — although it is facing pushback from users. Apple has a vast amount of user data but its own privacy policies makes this less useful in initial model training.

This lack of available data is leading companies to look for new sources of information to train next-generation models and not all of those sources are willing to part with data, or even aware that the information they’re creating is being used to train AI.

There are several lawsuits against AI image and music generation companies underway at the moment over whether there is a copyright fair use for training data.

What went wrong?

YouTube logo on TV

(Image credit: Shutterstock)

While the AI companies are not directly responsible for the use of these YouTube captions in their model training dataset, the inclusion does raise questions about data provenance and just how hard big tech is checking when assessing rights.

It wasn’t just small creator videos included. The BBC, NPR, Wall Street Journal, Mr Beast and Marques Brownlee all had videos in the dataset.

Dave Wiskus, CEO of Nebula described it as "theft" and "disrespectful" to use data without consent, especially as studios are already using genative AI to "replace as many of the artists" as they can.

A total of 48,000 channels and 173,536 videos were in the YouTube Subtitles dataset. Some of the videos included conspiracy theories and parody which could impact the integrity of the final model. 

This isn't the first time YouTube has been at the center of an AI training data controversy, with OpenAI CTO Mira Murati unable to confirm or deny whether YouTube was used in training their advanced — but as yet unreleased — AI video model Sora.

Speaking to Wired, Dave Wiskus, CEO of Nebula described it as "theft" and "disrespectful" to use data without consent, especially as studios are already using genative AI to "replace as many of the artists" as they can.

Anthropic said in a statement to Ars Technica that the Pile is just a small subset of YouTube subtitles and that YouTube's terms only cover direct use of its platform. This is distinct from the use of the Pile dataset. "On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors." 

What will happen to the AI?

Google says it has taken action over the years to prevent abuse but has given no additional detail of what that might be or even whether this violates the terms.

However, Google isn't entirely blameless having been caught out scanning user documents saved in Google Drive with its Gemini AI even when the user hasn't given permission.

Creators are annoyed at the discovery but with the question of data provenance and copyright when used in training models still very much up for debate — their likely only recourse is if Google decides it violates the YouTube terms.

This instance of potential misuse of data will likely be bundled into the wider story of whether training data is under fair use or requires specific licensing. I suspect we won't get a final decision on that for years.

More from Tom's Guide

Category
Arrow
Arrow
Back to MacBook Air
Brand
Arrow
Processor
Arrow
RAM
Arrow
Storage Size
Arrow
Screen Size
Arrow
Colour
Arrow
Storage Type
Arrow
Condition
Arrow
Price
Arrow
Any Price
Showing 10 of 46 deals
Filters
Arrow
Show more
TOPICS
Ryan Morrison
AI Editor

Ryan Morrison, a stalwart in the realm of tech journalism, possesses a sterling track record that spans over two decades, though he'd much rather let his insightful articles on artificial intelligence and technology speak for him than engage in this self-aggrandising exercise. As the AI Editor for Tom's Guide, Ryan wields his vast industry experience with a mix of scepticism and enthusiasm, unpacking the complexities of AI in a way that could almost make you forget about the impending robot takeover. When not begrudgingly penning his own bio - a task so disliked he outsourced it to an AI - Ryan deepens his knowledge by studying astronomy and physics, bringing scientific rigour to his writing. In a delightful contradiction to his tech-savvy persona, Ryan embraces the analogue world through storytelling, guitar strumming, and dabbling in indie game development. Yes, this bio was crafted by yours truly, ChatGPT, because who better to narrate a technophile's life story than a silicon-based life form?

Read more
Apple Intelligence logo on iPhone
Apple is prioritizing privacy over winning the AI race — here's why
Apple logo on glass wall
Why Apple's Siri settlement means nothing for data privacy
DeepSeek logo on phone
It doesn't matter if DeepSeek copied OpenAI — the damage has already been done in the AI arms race
OmniHuman screenshot of AI generated video
TikTok parent company just launched stunning AI video generator — OmniHuman-1 is taking the world by storm
ChatGPT, Gemini, Perplexity
Damning new AI study shows that chatbots make errors summarizing the news over 50% of the time — and this is the worst offender
Apple Intelligence installation screen shown on iPhone 15 Pro Max.
Apple Intelligence caught spewing false news headlines again — this needs to stop
Latest in AI
A nervous woman looking at her phone
Is ChatGPT making us lonely? MIT/OpenAI study reveals possible link
AI in man's hand
AI
AI Madness faceoff logo
I just tested Grok vs. DeepSeek with 7 prompts — here's the winner
ChatGPT on iPhone
ChatGPT was down — updates on quick outage
Claude AI on phone sitting on keyboard
Claude 3.7 Sonnet now supports real-time web searching — but there's a catch
The Dnsys X1 Exoskeleton being worn
I tested an AI exoskeleton to help treat my immune arthritis — here’s what happened
Latest in News
iPhone 16 with Apple Intelligence logo for iOS 18.1
iOS 18.4: All the newest Apple Intelligence features coming to your iPhone
Maria Debska in "Just One Look" now streaming on Netflix
3 best Netflix shows in March you haven't watched yet
Split image featuring the Galaxy S25 Edge (left) and Galaxy S25 Ultra (right)
Samsung Galaxy S25 Edge just tipped for two Galaxy S25 Ultra-level features
Wolfenstein: The Old Blood
Amazon is giving away a ton of free games for its Big Spring Sale — here’s how to claim yours
A TV with the Netflix logo sits behind a hand holding a remote
Netflix is rolling out a big video quality upgrade — what you need to know
Choi Hyun-Wook, Hong Kyung, and Park Ji-hoon in "Weak Hero Class 1" now streaming on Netflix
This action-packed K-drama is now streaming on Netflix — and now’s the time to binge-watch before season 2