The Pile: An 800GB Dataset of Diverse Text for Language Modeling
GPTKB entity
Statements (33)
Predicate | Object |
---|---|
gptkbp:instanceOf | gptkb:dataset |
gptkbp:alsoKnownAs | gptkb:The_Pile |
gptkbp:citation | Gao et al., 2020, arXiv:2101.00027 |
gptkbp:contains | diverse text sources |
gptkbp:createdBy | gptkb:EleutherAI |
https://www.w3.org/2000/01/rdf-schema#label | The Pile: An 800GB Dataset of Diverse Text for Language Modeling |
gptkbp:includes | gptkb:OpenSubtitles, gptkb:GitHub, gptkb:PubMed_Central, gptkb:Wikipedia, gptkb:Stack_Exchange, gptkb:arXiv, gptkb:DM_Mathematics, gptkb:Enron_Emails, gptkb:EuroParl, gptkb:FreeLaw, gptkb:HackerNews, gptkb:NIH_ExPorter, gptkb:OpenWebText2, gptkb:Pile-CC, gptkb:YouTubeSubtitles, gptkb:PhilPapers, Books3, USPTO Backgrounds |
gptkbp:language | English |
gptkbp:license | gptkb:MIT_License |
gptkbp:purpose | language modeling |
gptkbp:releaseDate | 2020 |
gptkbp:size | 800GB |
gptkbp:url | https://pile.eleuther.ai/ |
gptkbp:usedFor | training large language models |
gptkbp:bfsParent | gptkb:EleutherAI |
gptkbp:bfsLayer | 7 |