The enormous datasets used to train the latest generation of these AI systems, like those behind ChatGPT and Stable Diffusion, are likely to contain billions of images scraped from the internet, millions of pirated ebooks, the entire proceedings of 16 years of the European parliament and the whole of English-language Wikipedia.
But the industry’s voracious appetite for big data is starting to cause problems, as regulators and courts around the world crack down on researchers hoovering up content without consent or notice. In response, AI labs are fighting to keep their datasets secret, or even daring regulators to push the issue.
Hern, A., & Milmo, D. (2023, April 10). ‘I didn’t give permission’: Do AI’s backers care about data law breaches? The Guardian. https://www.theguardian.com/technology/2023/apr/10/i-didnt-give-permission-do-ais-backers-care-about-data-law-breaches