Colossal Clean Crawled Corpus
GPTKB entity
Statements (23)
Predicate | Object |
---|---|
gptkbp:instanceOf | gptkb:text |
gptkbp:abbreviation | C4 |
gptkbp:access | publicly available |
gptkbp:canBeFilteredBy | gptkb:language, duplication, toxicity |
gptkbp:contains | web pages |
gptkbp:createdBy | gptkb:Allen_Institute_for_AI, gptkb:Google_Research |
gptkbp:firstReleased | 2019 |
gptkbp:format | gptkb:JSONL |
https://www.w3.org/2000/01/rdf-schema#label | Colossal Clean Crawled Corpus |
gptkbp:language | English |
gptkbp:license | varies (depends on Common Crawl) |
gptkbp:numberOfArticles | ~365 million |
gptkbp:size | ~750GB |
gptkbp:source | gptkb:Common_Crawl |
gptkbp:usedFor | language model training |
gptkbp:usedIn | gptkb:T5, gptkb:UL2, gptkb:PaLM |
gptkbp:bfsParent | gptkb:T5 |
gptkbp:bfsLayer | 6 |
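
The statements above list JSONL as the format and duplication as one possible filter. As a minimal sketch of what that could look like in practice, the Python below parses JSONL lines and drops exact duplicates by hashing the whitespace-normalized text. The field names `text` and `url`, the helper names, and the sample records are assumptions made for illustration and are not taken from this entry; exact-hash matching is a simple stand-in and is not claimed to be the method used to build the corpus.

```python
import hashlib
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL input: one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def drop_exact_duplicates(records: Iterable[dict]) -> Iterator[dict]:
    """Yield the first record seen for each distinct normalized text."""
    seen = set()
    for record in records:
        # Assumed field name: "text". Collapse runs of whitespace so that
        # trivially different copies hash to the same value.
        normalized = " ".join(record["text"].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record


# Made-up sample lines standing in for a real JSONL shard.
sample = [
    '{"url": "https://example.com/a", "text": "Hello  world."}',
    '{"url": "https://example.com/b", "text": "Hello world."}',
    '{"url": "https://example.com/c", "text": "Something else."}',
]

unique = list(drop_exact_duplicates(iter_records(sample)))
```

Both helpers are generators, so a shard can be read line by line from an open file handle instead of being loaded into memory. At the scale listed above, the in-memory `seen` set would itself become large, so a real pipeline would likely need a more compact or distributed structure.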