ViLT

GPTKB entity

Statements (31)
Predicate Object
gptkbp:instanceOf Vision-and-Language Model
gptkbp:architecture gptkb:Transformer
gptkbp:author Bokyung Son
Ildoo Kim
Wonjae Kim
gptkbp:citation 1000+
gptkbp:designedFor Vision-and-Language Pretraining
gptkbp:developedBy gptkb:NAVER_AI_Lab
https://www.w3.org/2000/01/rdf-schema#label ViLT
gptkbp:input gptkb:image
gptkb:text
gptkbp:introducedIn 2021
gptkbp:language English
gptkbp:hasTask gptkb:Visual_Question_Answering
Image Captioning
Image-Text Retrieval
Visual Reasoning
gptkbp:notableFor Efficient vision-and-language model
No convolutional or region-based visual backbone
gptkbp:notablePublication https://arxiv.org/abs/2102.03334
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
gptkbp:openSource Yes
gptkbp:pretrainingDataset gptkb:COCO
gptkb:Flickr30k
gptkb:VQA
NLVR2
gptkbp:repository https://github.com/dandelin/ViLT
gptkbp:uses Multimodal Transformer Encoder
gptkbp:bfsParent gptkb:BLIP
gptkb:Vision-Language_Pretraining_research_community
gptkbp:bfsLayer 7
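The statements above capture ViLT's defining design choice: image patches are embedded with a single linear projection (no convolutional or region-based visual backbone) and processed jointly with text tokens by one multimodal Transformer encoder. A minimal NumPy sketch of that idea follows; all dimensions, weights, and token ids are toy placeholders, not the model's actual parameters (real ViLT uses a 768-dim ViT-style encoder with learned positional and modality embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

d = 32                                   # toy hidden size (ViLT uses 768)
patch = 8                                # toy patch side length
img = rng.standard_normal((32, 32, 3))   # toy "image"
text_ids = np.array([5, 17, 3, 9])       # toy token ids

# 1. Patchify the image and embed each flattened patch with a single
#    linear projection -- no convolutional or region-based backbone.
patches = img.reshape(4, patch, 4, patch, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(16, patch * patch * 3)           # 16 patches
W_img = rng.standard_normal((patch * patch * 3, d)) * 0.02
img_tok = patches @ W_img                                  # (16, d)

# 2. Embed text tokens with a lookup table (toy vocabulary of 100).
emb = rng.standard_normal((100, d)) * 0.02
txt_tok = emb[text_ids]                                    # (4, d)

# 3. Add modality-type embeddings and concatenate both modalities
#    into one token sequence for a single shared encoder.
mod = rng.standard_normal((2, d)) * 0.02
seq = np.concatenate([txt_tok + mod[0], img_tok + mod[1]]) # (20, d)

# 4. One self-attention layer: every token attends across both
#    modalities at once, the core of the multimodal encoder.
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
q, k, v = seq @ Wq, seq @ Wk, seq @ Wv
att = np.exp(q @ k.T / np.sqrt(d))
att /= att.sum(axis=1, keepdims=True)
out = att @ v
print(out.shape)   # (20, 32): joint text + image-patch representation
```

Because the visual side is only a linear patch embedding, the per-image preprocessing cost is far lower than for region-feature pipelines, which is what the `notableFor` statements refer to.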