Common Crawl

URI: https://gptkb.org/entity/Common_Crawl

GPTKB entity

Predicate	Object
gptkbp:instanceOf	gptkb:nonprofit_organization
gptkbp:archiveFrequency	monthly
gptkbp:category	non-profit, open data, web archiving, internet organization
gptkbp:dataLanguage	multilingual
gptkbp:datasetAvailability	publicly available
gptkbp:format	gptkb:WAT gptkb:WARC WET
gptkbp:founded	2007
gptkbp:founder	gptkb:Gil_Elbaz
gptkbp:headquartersLocation	gptkb:United_States
gptkbp:includes	web pages, metadata, HTTP headers, crawled URLs, outlinks, text content
gptkbp:license	Common Crawl Terms of Use
gptkbp:mission	to crawl the web and freely provide its archives and datasets to the public
gptkbp:notableProject	gptkb:Common_Crawl_Corpus
gptkbp:notableUser	gptkb:Google gptkb:OpenAI gptkb:Allen_Institute_for_AI gptkb:EleutherAI gptkb:Meta_AI
gptkbp:numberOfVolumes	petabytes of data
gptkbp:partner	gptkb:Internet_Archive gptkb:Amazon_Web_Services gptkb:Archive-It gptkb:Data_Commons gptkb:Web_Data_Commons
gptkbp:storage	gptkb:Amazon_S3
gptkbp:supportedBy	grants, donations
gptkbp:twitter	@CommonCrawl
gptkbp:type	web crawl data
gptkbp:usedBy	gptkb:researchers, developers, companies
gptkbp:usedFor	gptkb:machine_learning, data analysis, natural language processing, web mining, search engine development
gptkbp:website	https://commoncrawl.org/
gptkbp:bfsParent	gptkb:WARC
gptkbp:bfsLayer	5
http://www.w3.org/2000/01/rdf-schema#label	Common Crawl
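
The rows above are tab-separated predicate/object pairs, so they can be loaded into a simple mapping. The following is a minimal sketch, assuming the page is available as plain text with one row per line. The function names parse_entity_rows and entity_refs are hypothetical and not part of GPTKB. Object values are kept as raw strings because multi-word values cannot be split reliably on whitespace.

```python
def parse_entity_rows(text):
    """Map each predicate to its raw object string from tab-separated rows."""
    facts = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip titles, blank lines, and other non-row text
        predicate, obj = line.split("\t", 1)
        facts[predicate.strip()] = obj.strip()
    return facts


def entity_refs(obj):
    """Return the tokens of an object string that carry the gptkb: prefix."""
    return [tok.rstrip(",") for tok in obj.split() if tok.startswith("gptkb:")]


# A few rows copied from the table above (assumed to be tab-separated).
sample = (
    "Predicate\tObject\n"
    "gptkbp:founded\t2007\n"
    "gptkbp:format\tgptkb:WAT gptkb:WARC WET\n"
    "gptkbp:storage\tgptkb:Amazon_S3\n"
)

facts = parse_entity_rows(sample)
facts.pop("Predicate", None)  # drop the header row

print(facts["gptkbp:founded"])
print(entity_refs(facts["gptkbp:format"]))
```

Only prefixed tokens are treated as entity references; unprefixed values such as WET stay in the raw object string.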