Colossal Clean Crawled Corpus (C4)
GPTKB entity
Statements (33)
Predicate | Object |
---|---|
gptkbp:instanceOf | gptkb:dataset |
gptkbp:abbreviation | C4 |
gptkbp:availableOn | gptkb:TensorFlow_Datasets |
gptkbp:canBeFilteredBy | toxicity, duplicates, non-English content |
gptkbp:contains | web pages |
gptkbp:createdBy | gptkb:Google_Research |
gptkbp:firstReleased | 2019 |
gptkbp:format | gptkb:JSON, plain text |
https://www.w3.org/2000/01/rdf-schema#label | Colossal Clean Crawled Corpus (C4) |
gptkbp:language | English |
gptkbp:license | Apache 2.0 |
gptkbp:numberOfArticles | ~365 million |
gptkbp:relatedTo | gptkb:OpenWebText, gptkb:Common_Crawl, gptkb:The_Pile |
gptkbp:size | ~750GB |
gptkbp:source | gptkb:Common_Crawl |
gptkbp:usedFor | text analysis, language model pretraining |
gptkbp:usedIn | gptkb:T5, gptkb:UL2, gptkb:Flan-T5 |
gptkbp:bfsParent | gptkb:T5-XXL_language_model, gptkb:T5-11B, gptkb:T5-3B, gptkb:T5-Base, gptkb:T5-Large, gptkb:T5-Small, gptkb:T5_(Text-To-Text_Transfer_Transformer) |
gptkbp:bfsLayer | 7 |
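
The gptkbp:canBeFilteredBy statement lists duplicates as one filter criterion, and gptkbp:format lists JSON. A minimal sketch of what exact-duplicate filtering could look like is given here, assuming each record is one JSON object per line with a `text` field. The `url` field, the function names, and the sample records are hypothetical and chosen for illustration only; this is not the pipeline that was used to build C4.

```python
import hashlib
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON-lines input, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def drop_exact_duplicates(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only the first record seen for each distinct normalized text."""
    seen = set()
    for record in records:
        # Collapse whitespace and lowercase so trivially different copies collide.
        normalized = " ".join(record["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record


# Hypothetical sample records (assumed layout: one JSON object per line).
sample = [
    '{"url": "https://example.com/a", "text": "Hello  world."}',
    '{"url": "https://example.com/b", "text": "hello world."}',
    '{"url": "https://example.com/c", "text": "A different page."}',
]

kept = list(drop_exact_duplicates(iter_records(sample)))
print([r["url"] for r in kept])
```

Storing a hash instead of the full text keeps memory use bounded per record, which matters at the scale given in gptkbp:size. Near-duplicate detection, toxicity filtering, and language filtering would each need a separate step that this sketch does not cover.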