gptkbp:instanceOf
|
web corpus
|
gptkbp:access
|
gptkb:HTTP
gptkb:AWS_S3
|
gptkbp:availableOn
|
2011
|
gptkbp:category
|
gptkb:archives
big data
open data
|
gptkbp:citation
|
Common Crawl. (n.d.). Common Crawl. https://commoncrawl.org/
|
gptkbp:contains
|
gptkb:text
metadata
outlinks
WARC files
raw HTML
web page data
|
gptkbp:covers
|
global
multiple domains
|
gptkbp:dataAccessibility
|
free
open access
|
gptkbp:dataCollected
|
web crawling
|
gptkbp:dataCollectionTool
|
crawler
|
gptkbp:dataSource
|
public web
|
gptkbp:dataUpdate
|
monthly crawl
|
gptkbp:format
|
gptkb:WAT
gptkb:WARC
WET
|
gptkbp:frequency
|
monthly
|
https://www.w3.org/2000/01/rdf-schema#label
|
Common Crawl Corpus
|
gptkbp:language
|
multilingual
|
gptkbp:license
|
Common Crawl Terms of Use
|
gptkbp:maintainedBy
|
gptkb:Common_Crawl_Foundation
|
gptkbp:notableUser
|
gptkb:Google
gptkb:OpenAI
gptkb:Allen_Institute_for_AI
gptkb:Meta_AI
|
gptkbp:numberOfVolumes
|
billions of web pages
|
gptkbp:size
|
petabytes
|
gptkbp:type
|
semi-structured data
unstructured data
structured metadata
|
gptkbp:usedFor
|
gptkb:machine_learning
data analysis
natural language processing
web mining
search engine research
|
gptkbp:website
|
https://commoncrawl.org/
|
gptkbp:bfsParent
|
gptkb:Common_Crawl
|
gptkbp:bfsLayer
|
6
|
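The WARC format listed under gptkbp:format stores each capture as a record: a version line, named header fields, a blank line, and then a payload whose size is given by Content-Length. The following is a minimal sketch of reading one such record using only the Python standard library. The sample record is hand-written for illustration (it is not real crawl data), and the function name read_warc_record is hypothetical. It assumes an uncompressed record; files distributed by Common Crawl are gzip-compressed, so a real reader would need to decompress first.

```python
import io

# Hypothetical, hand-written sample record for illustration; not real crawl data.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, world!"
    b"\r\n\r\n"
)


def read_warc_record(stream):
    """Read one uncompressed WARC record; return (version, headers, payload)."""
    version = stream.readline().strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)

    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line (or end of input) ends the header block
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length gives the payload size in bytes.
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # skip the two CRLFs that terminate the record
    return version, headers, payload


version, headers, payload = read_warc_record(io.BytesIO(SAMPLE_RECORD))
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], payload)
```

For real archives, an established WARC-reading library is likely a better choice than hand-rolled parsing; the sketch only illustrates the record layout.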