The updated policy specifies that “publicly available information” is used to train Google’s AI products but doesn’t say how (or if) the company will prevent copyrighted materials from being included in that data pool. Many publicly accessible websites have policies in place that ban data collection or web scraping for the purpose of training large language models and other AI toolsets. It will also be interesting to see how this approach plays out under the various regulations around the world, such as GDPR, that protect people against their data being misused without their express permission.
A combination of these laws and increased market competition has made the makers of popular generative AI systems like OpenAI’s GPT-4 extremely cagey about where they got their training data and whether it includes social media posts or copyrighted works by human artists and authors.
Whether the fair use doctrine extends to this kind of application is currently a legal gray area.
Read more:
Weatherbed, J. (2023, July 5). Google confirms it’s training Bard on scraped web data, too. The Verge. https://www.theverge.com/2023/7/5/23784257/google-ai-bard-privacy-policy-train-web-scraping