AI Data 'Scraping' Allegations Dog Apple, Anthropic, and Nvidia
Training AI models depends on feeding them data, so much of it that companies are bending or breaking copyright rules to get it. A new investigation implicates more big names in the practice.
An April kerfuffle brought one of the less savory aspects of the AI revolution to light, when market leader OpenAI faced accusations that it had grabbed words from millions of user-uploaded YouTube videos to train generative AI models. The fact that it was a deliberate act, with OpenAI team members reportedly discussing how the scraping might violate YouTube's copyright rules, landed the company in the spotlight for all the wrong reasons. Worse still, YouTube owner Google was itself said to have tried the same trick.
Now it looks like OpenAI and Google aren't alone in turning to YouTube content for AI training material: buzzy brand Anthropic, AI chipmaker Nvidia, and even privacy-centric Apple stand accused of data scraping. There is a mitigating factor at play, since the data was acquired from a third-party company, but the episode shows just how complicated the business of AI training really is.
Wired and Proof News, a site specializing in data-driven reporting and analysis, found that leading AI companies including Apple, Anthropic, and Nvidia have been using a training dataset called "YouTube Subtitles," which contains the text of more than 170,000 YouTube videos from a wide range of sources--including top-ranked influencers like MrBeast and even outlets like the Wall Street Journal and the New York Times (which is currently suing OpenAI for scraping its news archive). The data apparently came from a research outfit called EleutherAI, which gathered the information from numerous online sources for academic purposes. But EleutherAI's data was also publicly released, and therein lies the problem.
While it's no shock to see AI brands like Anthropic and Nvidia involved in this development, since companies in the industry are gaining a reputation for playing fast and loose with copyright rules, Apple's inclusion is a surprise. The company has long touted itself as a staunch defender of user privacy, and we know it has paid for permitted access to AI training data before. Even more pointedly, when Apple recently revealed Apple Intelligence, its much-anticipated push into AI, it took pains to distinguish itself from its rivals by spelling out exactly how it prevents user data from being repurposed to train AIs--even by its chatbot partner, OpenAI.
For Apple to have trained its AIs on data that was likely gathered in violation of YouTube's rules is problematic. News site Quartz, noting that its own videos were used in the training data, quotes leading tech influencer Marques Brownlee, who said, "Apple technically avoids 'fault'" because they weren't "the ones scraping." Still, Brownlee thinks it's going to be "an evolving problem."
There are a couple of big lessons here for your company, even if you're only beginning to toy with the business benefits AI technology can bring to a small team.
First, don't blindly trust the output you generate with AI tools: if you release a product made with AI that identifiably violates someone else's intellectual property rights, then even though it's not really "your fault," the situation could quickly become legally complex. Second, if you're looking at training AIs for your own purposes, double-check the sources of all your data, lest you incorporate information from questionable origins.