CommonCrawl

CCBot [Common Crawl Bot]

Published by smh767 on December 15, 2023

To prevent Common Crawl from crawling your website, include the following in your robots.txt:

User-agent: CCBot
Disallow: /

https://commoncrawl.org/ccbot
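One way to sanity-check the rule before publishing it is to run it through a robots.txt parser. The sketch below uses Python's standard-library urllib.robotparser; the helper name is_allowed, the example.com URL, and the second user agent "SomeOtherBot" are illustrative assumptions and do not come from the Common Crawl documentation.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above, written one per line.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the text locally
    and makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and "SomeOtherBot" are placeholders, not real targets.
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/page")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")

print("CCBot allowed:", ccbot_allowed)
print("Other crawler allowed:", other_allowed)
```

Under these assumptions the rule blocks only the crawler named in the User-agent line; other crawlers are unaffected unless they are given their own entries. A robots.txt rule is a request that a crawler chooses to honor, so this check confirms what the file says, not what any given crawler will do.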

Common Crawl

Published by smh767 on December 15, 2023

The Common Crawl corpus contains petabytes of data, regularly collected since 2008. https://commoncrawl.org/

Registry of Open Data on AWS

Published by smh767 on December 15, 2023

This registry exists to help people discover and share datasets that are available via AWS resources. https://registry.opendata.aws/

The New York Times Updates Terms of Service to Prevent AI Scraping Its Content – AdWeek

Published by smh767 on August 18, 2023

Among the early use cases of AI within newsrooms appears to be fighting AI itself. The New York Times updated…

Digital Shred

Privacy Literacy Toolkit
