An investigation by Proof News has revealed that major tech companies, including Apple and NVIDIA, used a dataset of YouTube video transcripts to train their AI models without the creators' permission.
The dataset, created by EleutherAI, contains transcripts of more than 173,000 YouTube videos, drawn from popular channels such as Marques Brownlee and MrBeast as well as news publishers such as The New York Times and the BBC. It highlights an uncomfortable truth: AI technology is being built on data taken from creators without consent or compensation.
The dataset contains only transcripts, not the videos or images themselves. Even so, YouTube remains a valuable source of AI training material because of the abundance of audio, video, and image data available on the platform. Companies like Apple and OpenAI have been criticized for their lack of transparency about the sources of their training data.
YouTube’s CEO has stated that using data from the platform without consent would violate its terms of service. A lookup tool created by Proof News lets users check whether subtitles from their own YouTube videos or favorite channels are included in the dataset.