gptkbp:instanceOf
|
gptkb:dataset
|
gptkbp:citation
|
gptkb:Gao_et_al.,_2020
|
gptkbp:contains
|
gptkb:OpenSubtitles
gptkb:Wikipedia_(English)
gptkb:PubMed_Central
gptkb:Wikipedia
gptkb:Stack_Exchange
gptkb:arXiv
gptkb:BookCorpus2
gptkb:DM_Mathematics
gptkb:Enron_Emails
gptkb:EuroParl
gptkb:FreeLaw
gptkb:Gutenberg_(PG-19)
gptkb:HackerNews
gptkb:NIH_ExPorter
gptkb:OpenWebText2
gptkb:Pile-CC
gptkb:Ubuntu_IRC
gptkb:YouTubeSubtitles
gptkb:PhilPapers
books
web pages
academic papers
GitHub code
|
gptkbp:createdBy
|
gptkb:EleutherAI
|
gptkbp:format
|
gptkb:JSONL
|
https://www.w3.org/2000/01/rdf-schema#label
|
The Pile
|
gptkbp:language
|
English
|
gptkbp:license
|
varies by component dataset
|
gptkbp:notableFor
|
open access
large scale
diversity of sources
|
gptkbp:numberOfArticles
|
~22 million
|
gptkbp:relatedTo
|
gptkb:OpenWebText
gptkb:C4_dataset
gptkb:LAION-400M
gptkb:RedPajama_dataset
|
gptkbp:releaseDate
|
2020
|
gptkbp:size
|
825 GiB
|
gptkbp:url
|
https://pile.eleuther.ai/
|
gptkbp:usedBy
|
gptkb:Pythia
gptkb:GPT-J
gptkb:GPT-Neo
gptkb:GPT-NeoX
gptkb:Open_LLMs
|
gptkbp:usedFor
|
language model training
|
gptkbp:bfsParent
|
gptkb:GPT-J
gptkb:GPT-Neo
gptkb:GPT-NeoX
|
gptkbp:bfsLayer
|
6
|
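The gptkbp:format value above says the dataset is distributed as JSONL, meaning one JSON object per line. The sketch below shows one way such a file could be streamed and tallied per source. The field names `text`, `meta`, and `pile_set_name` are assumptions made for illustration and are not stated in this record, and the three sample records are invented placeholders, not real dataset content.

```python
import io
import json
from collections import Counter


def count_by_source(lines, source_key="pile_set_name"):
    """Count JSONL records per source label.

    `lines` is any iterable of strings, one JSON object per line.
    `source_key` is a hypothetical metadata field name (an assumption).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        meta = record.get("meta", {})
        counts[meta.get(source_key, "unknown")] += 1
    return counts


# Invented sample records standing in for a real file handle.
sample = io.StringIO(
    '{"text": "first document", "meta": {"pile_set_name": "ArXiv"}}\n'
    '{"text": "second document", "meta": {"pile_set_name": "FreeLaw"}}\n'
    '{"text": "third document", "meta": {"pile_set_name": "ArXiv"}}\n'
)

counts = count_by_source(sample)
for name, n in sorted(counts.items()):
    print(name, n)
```

Reading line by line keeps memory use flat regardless of file size, which matters for a corpus of this scale. For a real file, an open file handle can be passed in place of the `io.StringIO` sample.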