Common Crawl Corpus

URI: https://gptkb.org/entity/Common_Crawl_Corpus

GPTKB entity

Predicate	Object
gptkbp:instanceOf	gptkb:web_corpus
gptkbp:access	gptkb:HTTP; gptkb:AWS_S3
gptkbp:availableOn	2011
gptkbp:category	gptkb:archives; big data; open data
gptkbp:citation	Common Crawl. (n.d.). Common Crawl. https://commoncrawl.org/
gptkbp:contains	gptkb:text; metadata; outlinks; WARC files; raw HTML; web page data
gptkbp:covers	global; multiple domains
gptkbp:dataAccessibility	free; open access
gptkbp:dataCollected	web crawling
gptkbp:dataCollectionTool	crawler
gptkbp:dataSource	public web
gptkbp:dataUpdate	monthly crawl
gptkbp:format	gptkb:WAT; gptkb:WARC; WET
gptkbp:frequency	monthly
gptkbp:language	multilingual
gptkbp:license	gptkb:Creative_Commons_Attribution-ShareAlike_4.0_International
gptkbp:maintainedBy	gptkb:Common_Crawl_Foundation
gptkbp:notableUser	gptkb:Google; gptkb:OpenAI; gptkb:Allen_Institute_for_AI; gptkb:Meta_AI
gptkbp:numberOfVolumes	billions of web pages
gptkbp:size	petabytes
gptkbp:type	semi-structured data; unstructured data; structured metadata
gptkbp:usedFor	gptkb:machine_learning; data analysis; natural language processing; web mining; search engine research
gptkbp:website	https://commoncrawl.org/
gptkbp:bfsParent	gptkb:Common_Crawl
gptkbp:bfsLayer	6
http://www.w3.org/2000/01/rdf-schema#label	Common Crawl Corpus
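
The access and format entries above (HTTP and AWS S3; WARC, WAT and WET) can be tied together with a small sketch. It takes a corpus-relative file path, guesses the format from the file suffix, and builds the two access URLs. The endpoint names `https://data.commoncrawl.org/` and `s3://commoncrawl/`, the suffix conventions, and the example path are assumptions that do not appear on this page; `CorpusFile` and `describe` are hypothetical helper names. Check the project's own documentation before relying on any of them.

```python
from dataclasses import dataclass

# Assumed endpoints (not stated on this page).
HTTP_BASE = "https://data.commoncrawl.org/"
S3_BASE = "s3://commoncrawl/"

# Assumed suffix conventions for the three listed formats.
SUFFIX_TO_FORMAT = {
    ".warc.wat.gz": "WAT",   # metadata extracted from WARC records
    ".warc.wet.gz": "WET",   # extracted plain text
    ".warc.gz": "WARC",      # raw crawl records
}


@dataclass(frozen=True)
class CorpusFile:
    """Hypothetical record describing one corpus file."""
    path: str
    format: str
    http_url: str
    s3_url: str


def describe(path: str) -> CorpusFile:
    """Map a corpus-relative path to its format and both access URLs."""
    rel = path.lstrip("/")
    for suffix, fmt in SUFFIX_TO_FORMAT.items():
        if rel.endswith(suffix):
            return CorpusFile(rel, fmt, HTTP_BASE + rel, S3_BASE + rel)
    raise ValueError(f"unrecognised corpus file suffix: {path!r}")


if __name__ == "__main__":
    # Illustrative path only; it does not refer to a real file.
    info = describe("crawl-data/EXAMPLE-CRAWL/segments/0/wet/example.warc.wet.gz")
    print(info.format)
    print(info.http_url)
    print(info.s3_url)
```

The sketch performs no network access; it only shows how one relative path can serve both access methods listed in the table.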