Switch Transformers: Scaling to Trillion Parameter Models
GPTKB entity
Statements (22)
| Predicate | Object |
|---|---|
| gptkbp:instanceOf | gptkb:academic_journal |
| gptkbp:arXivID | 2101.03961 |
| gptkbp:author | gptkb:Noam_Shazeer, gptkb:Barret_Zoph, gptkb:William_Fedus |
| gptkbp:citation | 1000+ |
| gptkbp:demonstrates | scaling language models to over a trillion parameters |
| gptkbp:evaluatesOn | language modeling tasks |
| gptkbp:focusesOn | Switch Transformer architecture |
| gptkbp:foundIn | Switch Transformer outperforms dense models at similar computational cost |
| gptkbp:improves | computational efficiency, training speed, model quality |
| gptkbp:proposedBy | Mixture-of-Experts (MoE) model |
| gptkbp:publicationYear | 2021 |
| gptkbp:publishedBy | gptkb:Google_Research |
| gptkbp:url | https://arxiv.org/abs/2101.03961 |
| gptkbp:uses | sparse activation, routing network |
| gptkbp:bfsParent | gptkb:Google_Brain_(former) |
| gptkbp:bfsLayer | 7 |
| https://www.w3.org/2000/01/rdf-schema#label | Switch Transformers: Scaling to Trillion Parameter Models |
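The `gptkbp:uses` statements (sparse activation, routing network) refer to the paper's core mechanism: a router sends each token to a single expert (top-1 routing), so only a small fraction of the model's parameters is active per token. Below is a minimal, illustrative NumPy sketch of such a switch-routing layer; the function name `switch_route`, the capacity handling, and the toy experts are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def switch_route(tokens, gate_w, experts, capacity_factor=1.25):
    """Sketch of top-1 (switch) routing over a set of expert networks.

    tokens:  (num_tokens, d_model) token representations
    gate_w:  (d_model, num_experts) router weights
    experts: list of callables, each mapping (n, d_model) -> (n, d_model)
    """
    num_tokens, _ = tokens.shape
    num_experts = len(experts)

    # Router: softmax over expert logits, then pick the single best expert.
    logits = tokens @ gate_w                          # (num_tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    expert_idx = probs.argmax(axis=-1)                # top-1 expert per token
    gate = probs[np.arange(num_tokens), expert_idx]   # router probability of chosen expert

    # Expert capacity: each expert processes at most this many tokens.
    capacity = int(capacity_factor * num_tokens / num_experts)

    out = np.zeros_like(tokens)
    for e in range(num_experts):
        chosen = np.where(expert_idx == e)[0][:capacity]  # overflow tokens are dropped
        if chosen.size:
            # Sparse activation: only the selected expert runs on these tokens,
            # and its output is scaled by the router probability.
            out[chosen] = gate[chosen, None] * experts[e](tokens[chosen])
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, num_experts, num_tokens = 16, 4, 32
    # Toy experts: random ReLU projections standing in for expert FFNs.
    experts = [
        (lambda w: (lambda x: np.maximum(x @ w, 0.0)))(rng.normal(size=(d_model, d_model)))
        for _ in range(num_experts)
    ]
    tokens = rng.normal(size=(num_tokens, d_model))
    gate_w = rng.normal(size=(d_model, num_experts))
    print(switch_route(tokens, gate_w, experts).shape)  # (32, 16)
```

Because each token activates only one expert, adding experts grows the parameter count while keeping the per-token compute roughly constant, which is how the `gptkbp:demonstrates` statement (scaling to over a trillion parameters) is achieved at similar computational cost to a dense model.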