GPTKB
C4 (Colossal Clean Crawled Corpus)
URI:
https://gptkb.org/entity/C4_(Colossal_Clean_Crawled_Corpus)
GPTKB entity
Statements (27)
Predicate
Object
gptkbp:instanceOf
gptkb:text
gptkbp:availableOn
gptkb:Hugging_Face_Datasets
gptkb:TensorFlow_Datasets
gptkbp:citation
gptkb:Colin_Raffel_et_al.,_'Exploring_the_Limits_of_Transfer_Learning_with_a_Unified_Text-to-Text_Transformer',_JMLR,_2020
gptkbp:contains
web pages
gptkbp:createdBy
gptkb:Google_Research
gptkbp:excludes
gptkb:Wikipedia
discussion forums
news sites
https://www.w3.org/2000/01/rdf-schema#label
C4 (Colossal Clean Crawled Corpus)
gptkbp:introducedIn
2020
gptkbp:language
English
gptkbp:license
Apache 2.0
gptkbp:nativeToken
~156 billion
gptkbp:numberOfArticles
~365 million
gptkbp:preprocessing
cleaned of boilerplate and non-English text
deduplicated
filtered for quality
gptkbp:size
~750GB
gptkbp:source
gptkb:Common_Crawl
gptkbp:usedFor
natural language processing
pretraining language models
gptkbp:usedIn
gptkb:T5
gptkb:UL2
gptkbp:bfsParent
gptkb:T5:_Exploring_the_Limits_of_Transfer_Learning_with_a_Unified_Text-to-Text_Transformer
gptkb:Exploring_the_Limits_of_Transfer_Learning_with_a_Unified_Text-to-Text_Transformer
gptkbp:bfsLayer
7
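The gptkbp:preprocessing statements above (boilerplate removal, deduplication, quality filtering) can be illustrated with a toy sketch. This is not the actual C4 pipeline: the function names (clean_page, deduplicate, build_corpus) and both thresholds are hypothetical and chosen only for illustration, deduplication here is exact-match on whole pages, and language identification is left out entirely.

```python
import re

# Assumed, illustrative values; not taken from the C4 release.
TERMINAL_PUNCT = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 5
MIN_LINES_PER_PAGE = 2


def clean_page(text):
    """Keep lines that look like prose; return None if too little survives."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Menus and other boilerplate rarely end in terminal punctuation.
        if not line.endswith(TERMINAL_PUNCT):
            continue
        # Very short lines are treated as low quality.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        kept.append(line)
    if len(kept) < MIN_LINES_PER_PAGE:
        return None
    return "\n".join(kept)


def deduplicate(pages):
    """Drop pages whose case- and whitespace-normalized text was already seen."""
    seen = set()
    unique = []
    for page in pages:
        key = re.sub(r"\s+", " ", page.lower()).strip()
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


def build_corpus(raw_pages):
    """Clean every page, discard rejected pages, then deduplicate the rest."""
    cleaned = [clean_page(p) for p in raw_pages]
    return deduplicate([c for c in cleaned if c is not None])
```

Calling build_corpus on a list of raw page strings returns the surviving cleaned pages in their original order. A production pipeline would typically deduplicate at a finer granularity than whole pages and add a language filter, both of which this sketch omits.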