gptkbp:instanceOf
|
gptkb:nonprofit_organization
|
gptkbp:archiveFrequency
|
monthly
|
gptkbp:category
|
non-profit
open data
web archiving
internet organization
|
gptkbp:dataLanguage
|
multilingual
|
gptkbp:datasetAvailability
|
public domain
|
gptkbp:format
|
gptkb:WAT
gptkb:WARC
WET
|
gptkbp:founded
|
2007
|
gptkbp:founder
|
gptkb:Gil_Elbaz
|
gptkbp:headquartersLocation
|
gptkb:United_States
|
https://www.w3.org/2000/01/rdf-schema#label
|
Common Crawl
|
gptkbp:includes
|
web pages
metadata
HTTP headers
crawled URLs
outlinks
text content
|
gptkbp:license
|
gptkb:Creative_Commons_CC0
|
gptkbp:mission
|
to crawl the web and freely provide its archives and datasets to the public
|
gptkbp:notableProject
|
gptkb:Common_Crawl_Corpus
|
gptkbp:notableUser
|
gptkb:Google
gptkb:OpenAI
gptkb:Allen_Institute_for_AI
gptkb:EleutherAI
gptkb:Meta_AI
|
gptkbp:numberOfVolumes
|
petabytes of data
|
gptkbp:partner
|
gptkb:Internet_Archive
gptkb:Amazon_Web_Services
gptkb:Archive-It
gptkb:Data_Commons
gptkb:Web_Data_Commons
|
gptkbp:storage
|
gptkb:Amazon_S3
|
gptkbp:supportedBy
|
grants
donations
|
gptkbp:twitter
|
@CommonCrawl
|
gptkbp:type
|
web crawl data
|
gptkbp:usedBy
|
gptkb:researchers
developers
companies
|
gptkbp:usedFor
|
gptkb:machine_learning
data analysis
natural language processing
web mining
search engine development
|
gptkbp:website
|
https://commoncrawl.org/
|