Multilingual C4

GPTKB entity

Statements (19)
Predicate Object
gptkbp:instanceOf gptkb:dataset
gptkbp:access gptkb:Hugging_Face_Datasets
gptkbp:basedOn gptkb:Common_Crawl
gptkbp:contains web-crawled text
gptkbp:contains cleaned text
gptkbp:contains deduplicated text
gptkbp:creator gptkb:Allen_Institute_for_AI
https://www.w3.org/2000/01/rdf-schema#label Multilingual C4
gptkbp:language multiple languages
gptkbp:license varies (depends on Common Crawl)
gptkbp:relatedTo gptkb:C4_dataset
gptkbp:releaseYear 2020
gptkbp:size hundreds of gigabytes
gptkbp:usedFor natural language processing
gptkbp:usedFor language model training
gptkbp:usedIn T5 model
gptkbp:usedIn mT5 model
gptkbp:bfsParent gptkb:mC4
gptkbp:bfsLayer 8