gptkbp:instanceOf
|
gptkb:dataset
|
gptkbp:availableOn
|
https://huggingface.co/datasets/c4
https://www.tensorflow.org/datasets/community_catalog/huggingface/c4
|
gptkbp:canBeFilteredBy
|
heuristics to remove low-quality content
|
gptkbp:citation
|
gptkb:Colin_Raffel_et_al.,_'Exploring_the_Limits_of_Transfer_Learning_with_a_Unified_Text-to-Text_Transformer',_JMLR,_2020
|
gptkbp:contains
|
web pages
|
gptkbp:createdBy
|
gptkb:Google_Research
|
gptkbp:excludes
|
low-quality content
non-English pages
pages containing placeholder text or code
|
gptkbp:format
|
gptkb:JSON
gptkb:TFRecord
|
gptkbp:fullName
|
gptkb:Colossal_Clean_Crawled_Corpus
|
https://www.w3.org/2000/01/rdf-schema#label
|
C4 dataset
|
gptkbp:language
|
English
|
gptkbp:license
|
Apache 2.0
|
gptkbp:notableFor
|
large-scale web text corpus
|
gptkbp:numberOfArticles
|
~365 million documents
|
gptkbp:releaseYear
|
2019
|
gptkbp:size
|
~750 GB
|
gptkbp:source
|
gptkb:Common_Crawl
|
gptkbp:usedFor
|
pretraining language models
|
gptkbp:usedIn
|
gptkb:T5
gptkb:UL2
gptkb:Flan-T5
gptkb:PaLM
|
gptkbp:bfsParent
|
gptkb:Flan-T5
gptkb:The_Pile
|
gptkbp:bfsLayer
|
7
|
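
The heuristic filtering noted under gptkbp:canBeFilteredBy can be sketched roughly as follows. This is a minimal illustration in the spirit of the cleaning rules described in the cited T5 paper (keep lines that end in terminal punctuation, drop pages with placeholder text or curly brackets, drop very short pages), not the actual C4 pipeline. The function name `clean_page` and both threshold values are assumptions chosen for the example.

```python
import re
from typing import Optional

# Illustrative thresholds (assumed values, not taken from this entry).
MIN_WORDS_PER_LINE = 3
MIN_SENTENCES_PER_PAGE = 3
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(text: str) -> Optional[str]:
    """Return the cleaned page text, or None if the page should be dropped."""
    # Drop pages that look like placeholder text or source code.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None

    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that end like a sentence and are long enough.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        kept.append(line)

    cleaned = "\n".join(kept)
    # Rough sentence count: each run of ., ! or ? counts as one sentence end.
    if len(re.findall(r"[.!?]+", cleaned)) < MIN_SENTENCES_PER_PAGE:
        return None
    return cleaned
```

Applied to a page with navigation fragments such as "Home" or "Menu" mixed into body text, this sketch keeps only the sentence-like lines and discards the page entirely if too little text survives.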