Tech companies are turning to controversial tactics to feed their data-hungry artificial intelligence models, vacuuming up books, websites, photos, and social media posts often unbeknownst to the creators, WIRED reported.
AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.
One investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.
The dataset, called YouTube Subtitles, contains video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did The Late Show with Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live.
9to5Mac reported that, according to the new report, a number of tech giants, including Apple, trained AI models on YouTube videos without the consent of the creators.
They did this by using subtitle files downloaded by a third party from more than 170,000 videos. Creators affected include tech reviewer Marques Brownlee (MKBHD), MrBeast, PewDiePie, Stephen Colbert, John Oliver, and Jimmy Kimmel.
The subtitle files are effectively transcripts of the video content.
The downloads were reportedly performed by a nonprofit called EleutherAI, which says it helps developers train AI models. While the aim appears to have been to provide training materials to small developers and academics, the dataset has also been used by several tech giants, including Apple.
According to 9to5Mac, it is important to emphasize that Apple did not download the data itself; the downloads were instead performed by EleutherAI. It is this organization that appears to have broken YouTube’s terms and conditions.
The Verge reported that, as part of its investigation, Proof News also released an interactive lookup tool. You can use its search feature to see if your content, or your favorite YouTuber’s, appears in the dataset.
The subtitles dataset is part of a larger collection of material from the nonprofit EleutherAI called The Pile, an open-source collection that also contains datasets of books, Wikipedia articles, and more. Last year, an analysis of one dataset called Books3 revealed which authors’ work had been used to train AI systems, and the dataset has been cited in lawsuits by authors against the companies that used it to train AI.
In my opinion, scraping content creators’ works, even if it’s only the audio part of a YouTube video, should be illegal. Work made by humans should not be fed to AI systems without the creator’s consent.