The Pile: An 800GB Dataset of Diverse Text for Language Modeling
GPTKB entity
Statements (33)
| Predicate | Object |
|---|---|
| gptkbp:instanceOf | gptkb:dataset |
| gptkbp:alsoKnownAs | gptkb:The_Pile |
| gptkbp:citation | Gao et al., 2020, arXiv:2101.00027 |
| gptkbp:contains | diverse text sources |
| gptkbp:createdBy | gptkb:EleutherAI |
| gptkbp:includes | gptkb:OpenSubtitles, gptkb:GitHub, gptkb:PubMed_Central, gptkb:Wikipedia, gptkb:Stack_Exchange, gptkb:arXiv, gptkb:DM_Mathematics, gptkb:Enron_Emails, gptkb:EuroParl, gptkb:FreeLaw, gptkb:HackerNews, gptkb:NIH_ExPorter, gptkb:OpenWebText2, gptkb:Pile-CC, gptkb:YouTubeSubtitles, gptkb:PhilPapers, Books3, USPTO Backgrounds |
| gptkbp:language | English |
| gptkbp:license | gptkb:MIT_License |
| gptkbp:purpose | language modeling |
| gptkbp:releaseDate | 2020 |
| gptkbp:size | 800GB |
| gptkbp:url | https://pile.eleuther.ai/ |
| gptkbp:usedFor | training large language models |
| gptkbp:bfsParent | gptkb:EleutherAI |
| gptkbp:bfsLayer | 7 |
| https://www.w3.org/2000/01/rdf-schema#label | The Pile: An 800GB Dataset of Diverse Text for Language Modeling |