gptkbp:instanceOf
|
machine learning model architecture
|
gptkbp:advantage
|
improves computational efficiency
enables model sparsity
enables specialization of sub-models
|
gptkbp:application
|
computer vision
natural language processing
speech recognition
|
gptkbp:category
|
neural network architecture
ensemble method
deep learning technique
|
gptkbp:challenge
|
load balancing
training instability
routing complexity
|
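A common mitigation for the load-balancing challenge listed above is an auxiliary loss that pushes the router toward uniform expert utilization, as in the Switch Transformers paper (Fedus et al., 2021). The sketch below is illustrative only: it assumes top-1 routing, and the function name and signature are hypothetical, not from any specific library.

```python
# Sketch of an auxiliary load-balancing loss in the style of
# Switch Transformers (Fedus et al., 2021); assumes top-1 routing.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """router_logits: (n_tokens, n_experts) raw gating scores."""
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)        # router probabilities
    assignments = probs.argmax(dim=-1)              # top-1 expert per token
    # f[i]: fraction of tokens dispatched to expert i
    f = torch.bincount(assignments, minlength=n_experts).float()
    f = f / router_logits.shape[0]
    # p[i]: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    # Minimized when both dispatch counts and probabilities are uniform.
    return n_experts * torch.sum(f * p)

# Typical usage (illustrative): total = task_loss + 0.01 * load_balancing_loss(logits)
```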
gptkbp:component
|
expert networks
gating network
|
gptkbp:expertNetworkFunction
|
processes the inputs routed to it by the gating network
|
gptkbp:field
|
gptkb:artificial_intelligence
gptkb:machine_learning
|
gptkbp:gatingNetworkFunction
|
selects which experts to activate for each input
|
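The two components above (expert networks plus a gating network) and the gating function described here can be shown in a minimal PyTorch sketch. This is an illustrative top-k routing layer under assumed names (MoELayer, d_model, n_experts, k), not the implementation of any specific system.

```python
# Minimal sketch of an MoE layer: a gating network scores experts per
# input and only the top-k experts are activated (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        # Expert networks: independent feed-forward sub-models.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # Gating network: maps each input to one score per expert.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        scores = self.gate(x)                              # (batch, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)           # renormalize over top-k
        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; real systems dispatch sparsely.
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                         # (batch, k) bool
            if mask.any():
                rows = mask.any(dim=-1)                    # inputs using expert e
                w = (weights * mask.float()).sum(dim=-1, keepdim=True)
                out[rows] += w[rows] * expert(x[rows])
        return out
```

Production systems such as GShard and Switch Transformer dispatch tokens sparsely across devices rather than looping over experts densely as above; the loop is kept here only to make the routing logic explicit.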
https://www.w3.org/2000/01/rdf-schema#label
|
Mixture of Experts (MoE)
|
gptkbp:introduced
|
gptkb:Robert_A._Jacobs
|
gptkbp:introducedIn
|
1991
|
gptkbp:notableFor
|
gptkb:Switch_Transformer
gptkb:GShard
Pathways
|
gptkbp:purpose
|
divide complex tasks among specialized models
|
gptkbp:relatedPaper
|
gptkb:GShard:_Scaling_Giant_Models_with_Conditional_Computation_and_Automatic_Sharding_(Lepikhin_et_al.,_2020)
gptkb:Hierarchical_Mixtures_of_Experts_(Jordan_&_Jacobs,_1994)
gptkb:Switch_Transformers:_Scaling_to_Trillion_Parameter_Models_with_Simple_and_Efficient_Sparsity_(Fedus_et_al.,_2021)
|
gptkbp:relatedTo
|
ensemble learning
conditional computation
sparse neural networks
|
gptkbp:size
|
enables very large total parameter counts while keeping computation per example roughly constant
|
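An illustrative calculation (hypothetical numbers, not from the source): a layer with 64 experts and top-2 routing stores 64 experts' worth of parameters but runs only 2 of them per token, so roughly 3% of the expert parameters are active per example; total capacity scales with the number of experts while per-example compute stays nearly constant.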
gptkbp:trainer
|
backpropagation
|
gptkbp:usedIn
|
gptkb:GPT-4
gptkb:DeepSpeed-MoE
gptkb:GLaM
|
gptkbp:bfsParent
|
gptkb:Noam_Shazeer
|
gptkbp:bfsLayer
|
6
|