gptkbp:instanceOf
|
Vision-and-Language Model
|
gptkbp:architecture
|
gptkb:Transformer
|
gptkbp:author
|
Bokyung Son
Ildoo Kim
Wonjae Kim
|
gptkbp:citation
|
1000+
|
gptkbp:designedFor
|
Vision-and-Language Pretraining
|
gptkbp:developedBy
|
gptkb:NAVER_AI_Lab
|
https://www.w3.org/2000/01/rdf-schema#label
|
ViLT
|
gptkbp:input
|
gptkb:image
gptkb:text
|
gptkbp:introducedIn
|
2021
|
gptkbp:language
|
English
|
gptkbp:hasTask
|
gptkb:Visual_Question_Answering
Image Captioning
Image-Text Retrieval
Visual Reasoning
|
gptkbp:notableFor
|
Efficient vision-and-language model
No convolutional or region-based visual backbone
|
gptkbp:notablePublication
|
https://arxiv.org/abs/2102.03334
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
|
gptkbp:openSource
|
Yes
|
gptkbp:pretrainingDataset
|
gptkb:COCO
gptkb:Visual_Genome
SBU Captions
Conceptual Captions
|
gptkbp:repository
|
https://github.com/dandelin/ViLT
|
gptkbp:uses
|
Multimodal Transformer Encoder
|
gptkbp:bfsParent
|
gptkb:BLIP
gptkb:Vision-Language_Pretraining_research_community
|
gptkbp:bfsLayer
|
7
|
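
The repository above (https://github.com/dandelin/ViLT) is the reference implementation, and ViLT checkpoints from the same author are also published on the Hugging Face Hub. Below is a minimal, illustrative sketch of querying the VQA-finetuned checkpoint through the Hugging Face transformers library; the checkpoint name "dandelin/vilt-b32-finetuned-vqa" is the published weight, while the image path and question are placeholder assumptions, not part of this entry.

# Minimal ViLT VQA sketch using the Hugging Face port of the dandelin/ViLT
# checkpoints. Assumes `transformers`, `torch`, and `Pillow` are installed
# and that "example.jpg" exists locally (illustrative placeholder).
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("example.jpg")                # image input (placeholder path)
question = "How many people are in the photo?"   # text input (placeholder question)

# ViLT feeds raw image patches and word tokens jointly into a single
# multimodal Transformer encoder, with no convolutional or region-based
# visual backbone, which is the property noted under gptkbp:notableFor.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)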