Vision Transformer (ViT): An Image is Worth 16x16 Words

GPTKB entity

Predicate	Object
gptkbp:instanceOf	gptkb:academic_journal
gptkbp:arXivID	2010.11929
gptkbp:author	gptkb:Alexey_Dosovitskiy gptkb:Jakob_Uszkoreit gptkb:Alexander_Kolesnikov gptkb:Dirk_Weissenborn gptkb:Georg_Heigold gptkb:Lucas_Beyer gptkb:Matthias_Minderer gptkb:Mostafa_Dehghani gptkb:Neil_Houlsby gptkb:Sylvain_Gelly gptkb:Thomas_Unterthiner gptkb:Xiaohua_Zhai
gptkbp:citation	high
gptkbp:contribution	shows transformers can outperform CNNs with sufficient data; splits images into 16x16 patches; treats image patches as tokens; applies transformer architecture to image classification
gptkbp:field	computer vision; deep learning; transformer models
gptkbp:influenced	subsequent vision transformer research
gptkbp:proposedBy	gptkb:Vision_Transformer_(ViT)
gptkbp:publicationYear	2021
gptkbp:publishedIn	gptkb:International_Conference_on_Learning_Representations_(ICLR)
gptkbp:shortName	gptkb:Vision_Transformer_(ViT)
gptkbp:title	gptkb:An_Image_is_Worth_16x16_Words:_Transformers_for_Image_Recognition_at_Scale
gptkbp:url	https://arxiv.org/abs/2010.11929
gptkbp:bfsParent	gptkb:Google_Brain_(former)
gptkbp:bfsLayer	7
http://www.w3.org/2000/01/rdf-schema#label	Vision Transformer (ViT): An Image is Worth 16x16 Words
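
The contribution entry above says the model splits an image into 16x16 patches and treats each patch as a token. The following is a minimal sketch of that patching step only, not the authors' implementation. It assumes the image is a nested list of shape height x width x channels; the function name `patchify` and the 32x32 sample image are hypothetical and chosen for illustration. In the paper, the flattened patches are then linearly projected before entering the transformer, and that step is omitted here.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns a list of tokens in row-major patch order; each token is a flat
    list of patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows first, then pixels in the row, then channels.
            token = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


# Hypothetical 32x32 RGB image whose pixels record their own (row, col) position.
image = [[[r, c, 0] for c in range(32)] for r in range(32)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 4 768
```

A 32x32 image yields (32 / 16) * (32 / 16) = 4 tokens, each holding 16 * 16 * 3 = 768 values. By the same arithmetic, a 224x224 RGB image would yield 14 * 14 = 196 tokens of the same length.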