C4 (Colossal Clean Crawled Corpus)
GPTKB entity
Statements (26)
| Predicate | Object |
|---|---|
| gptkbp:instanceOf | gptkb:text |
| gptkbp:availableOn | gptkb:Hugging_Face_Datasets<br>gptkb:TensorFlow_Datasets |
| gptkbp:citation | gptkb:Colin_Raffel_et_al.,_'Exploring_the_Limits_of_Transfer_Learning_with_a_Unified_Text-to-Text_Transformer',_JMLR,_2020 |
| gptkbp:contains | web pages |
| gptkbp:createdBy | gptkb:Google_Research |
| gptkbp:excludes | gptkb:Wikipedia<br>discussion forums<br>news sites |
| gptkbp:introducedIn | 2020 |
| gptkbp:language | English |
| gptkbp:license | Apache 2.0 |
| gptkbp:nativeToken | ~365 billion |
| gptkbp:numberOfArticles | ~15 million |
| gptkbp:preprocessing | cleaned of boilerplate and non-English text<br>deduplicated<br>filtered for quality |
| gptkbp:size | ~750GB |
| gptkbp:source | gptkb:Common_Crawl |
| gptkbp:usedFor | natural language processing<br>pretraining language models |
| gptkbp:usedIn | gptkb:T5<br>gptkb:UL2 |
| gptkbp:bfsParent | gptkb:Exploring_the_Limits_of_Transfer_Learning_with_a_Unified_Text-to-Text_Transformer |
| gptkbp:bfsLayer | 7 |
| https://www.w3.org/2000/01/rdf-schema#label | C4 (Colossal Clean Crawled Corpus) |
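
The gptkbp:preprocessing values above (boilerplate cleaning, deduplication, quality filtering) can be illustrated with a small Python sketch. This is not the actual C4 pipeline: the function names `clean_page` and `clean_corpus`, the two thresholds, and the exact-match deduplication by hash are all assumptions chosen for illustration, and language detection for the non-English filter is omitted.

```python
import hashlib

# Lines must end in one of these to be treated as prose (assumed heuristic).
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3   # assumed threshold for this sketch
MIN_LINES_PER_PAGE = 2   # assumed threshold for this sketch


def clean_page(text):
    """Keep lines that look like prose; return None if too little survives."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Very short lines are usually menus, buttons, or other boilerplate.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        # Lines without terminal punctuation are unlikely to be sentences.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        kept.append(line)
    # Quality filter: drop pages with too little remaining text.
    if len(kept) < MIN_LINES_PER_PAGE:
        return None
    return "\n".join(kept)


def clean_corpus(pages):
    """Clean each page, then drop exact duplicates of already-seen pages."""
    seen = set()
    result = []
    for page in pages:
        cleaned = clean_page(page)
        if cleaned is None:
            continue
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        result.append(cleaned)
    return result


pages = [
    "Home | About | Contact\nThe corpus is built from web pages.\nIt is filtered for quality.",
    "The corpus is built from web pages.\nIt is filtered for quality.\nLogin",
    "Click here",
]
cleaned = clean_corpus(pages)
print(len(cleaned))
```

In this toy input the first two pages reduce to the same two sentences once navigation text is stripped, so only one copy is kept, and the third page is dropped because no line survives cleaning.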