Common Crawl Corpus

GPTKB entity

Statements (47)
Predicate Object
gptkbp:instanceOf web corpus
gptkbp:access gptkb:HTTP, gptkb:AWS_S3
gptkbp:availableOn 2011
gptkbp:category gptkb:archives, big data, open data
gptkbp:citation Common Crawl. (n.d.). Common Crawl. https://commoncrawl.org/
gptkbp:contains gptkb:text, metadata, outlinks, WARC files, raw HTML, web page data
gptkbp:covers global, multiple domains
gptkbp:dataAccessibility free, open access
gptkbp:dataCollected web crawling
gptkbp:dataCollectionTool crawler
gptkbp:dataSource public web
gptkbp:dataUpdate monthly crawl
gptkbp:format gptkb:WAT, gptkb:WARC, WET
gptkbp:frequency monthly
https://www.w3.org/2000/01/rdf-schema#label Common Crawl Corpus
gptkbp:language multilingual
gptkbp:license gptkb:Creative_Commons_Attribution-ShareAlike_4.0_International
gptkbp:maintainedBy gptkb:Common_Crawl_Foundation
gptkbp:notableUser gptkb:Google, gptkb:OpenAI, gptkb:Allen_Institute_for_AI, gptkb:Meta_AI
gptkbp:numberOfVolumes billions of web pages
gptkbp:size petabytes
gptkbp:type semi-structured data, unstructured data, structured metadata
gptkbp:usedFor gptkb:machine_learning, data analysis, natural language processing, web mining, search engine research
gptkbp:website https://commoncrawl.org/
gptkbp:bfsParent gptkb:Common_Crawl
gptkbp:bfsLayer 6
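
As a rough illustration of the WARC format listed among the statements above, the following minimal sketch parses a single hand-written, uncompressed WARC record held in memory. The function name `parse_warc_record`, the sample record, and the example URL are illustrative assumptions and are not taken from Common Crawl itself. Real crawl files are typically gzip-compressed and hold many records, so a dedicated WARC library is likely a better fit in practice.

```python
from typing import Dict, Tuple


def parse_warc_record(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Parse one uncompressed WARC record into (version, headers, payload).

    Hypothetical helper for illustration only; it handles a single record
    and does not validate every rule of the WARC specification.
    """
    # The header block is separated from the content block by an empty line.
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/content separator found")

    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # first line names the format version, e.g. "WARC/1.0"
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record")

    # Remaining header lines are "Name: value" pairs; split on the first colon
    # so that values containing colons (such as URLs) stay intact.
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length gives the payload size in bytes.
    length = int(headers["Content-Length"])
    payload = rest[:length]
    return version, headers, payload


# Hand-written sample record (an assumption for this sketch, not real crawl data).
SAMPLE_PAYLOAD = b"Hello, web archive!"
SAMPLE = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: conversion\r\n",
    b"WARC-Target-URI: https://example.com/\r\n",
    b"Content-Type: text/plain\r\n",
    b"Content-Length: " + str(len(SAMPLE_PAYLOAD)).encode("ascii") + b"\r\n",
    b"\r\n",                 # empty line ends the header block
    SAMPLE_PAYLOAD,
    b"\r\n\r\n",             # two CRLFs terminate the record
])

if __name__ == "__main__":
    version, headers, payload = parse_warc_record(SAMPLE)
    print(version, headers["WARC-Target-URI"], payload.decode("utf-8"))
```

The sketch reads the payload by `Content-Length` rather than searching for the record terminator, since the payload itself may contain blank lines.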