Statements (30)
| Predicate | Object |
|---|---|
| gptkbp:instanceOf | gptkb:Vision-and-Language_Model |
| gptkbp:architecture | gptkb:Transformer |
| gptkbp:author | Wonjae Kim, Bokyung Son, Ildoo Kim |
| gptkbp:citation | 1000+ |
| gptkbp:designedFor | Vision-and-Language Pretraining |
| gptkbp:developedBy | gptkb:NAVER_AI_Lab |
| gptkbp:input | gptkb:image, gptkb:text |
| gptkbp:introducedIn | 2021 |
| gptkbp:language | English |
| gptkbp:hasTask | gptkb:Visual_Question_Answering, Image Captioning, Image-Text Retrieval, Visual Reasoning |
| gptkbp:notableFor | Efficient vision-and-language model, No convolutional or region-based visual backbone |
| gptkbp:notablePublication | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, https://arxiv.org/abs/2102.03334 |
| gptkbp:openSource | Yes |
| gptkbp:pretrainingDataset | gptkb:COCO, gptkb:Visual_Genome, gptkb:SBU_Captions, gptkb:Conceptual_Captions |
| gptkbp:repository | https://github.com/dandelin/ViLT |
| gptkbp:uses | Multimodal Transformer Encoder |
| gptkbp:bfsParent | gptkb:BLIP |
| gptkbp:bfsLayer | 7 |
| https://www.w3.org/2000/01/rdf-schema#label | ViLT |
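
Since the statements list image and text as inputs, Visual Question Answering as a task, and an open-source repository, a minimal usage sketch follows. It assumes the Hugging Face Transformers port of ViLT (`ViltProcessor`, `ViltForQuestionAnswering`, and the `dandelin/vilt-b32-finetuned-vqa` checkpoint), which is not referenced in the statements above; the sample image URL is likewise illustrative.

```python
# Minimal sketch: ViLT takes a raw image plus a text query and runs both
# through a single multimodal transformer encoder, with no convolutional
# or region-based visual backbone. Assumes the Hugging Face port of ViLT.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Illustrative COCO image; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# The processor patchifies the image and tokenizes the text; the model
# consumes the combined sequence in one forward pass.
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
answer_id = outputs.logits.argmax(-1).item()
print(model.config.id2label[answer_id])
```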