A single ViLBERT Multi-Task model can perform 8 different vision-and-language tasks learned from 12 datasets!
Datasets: VQA v2, GQA, Visual Genome QA, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GuessWhat, COCO Retrieval, Flickr30k Retrieval, SNLI-VE, NLVR2.
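To give a sense of how one model can serve this many tasks, below is a minimal toy sketch of the underlying idea: a single shared trunk feeding one lightweight output head per task. The class, task names, and dimensions are illustrative placeholders only, not the actual ViLBERT Multi-Task API; see the repository code for the real implementation.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Toy analogue of the multi-task setup: one shared trunk, one head per task.
    All names and dimensions here are hypothetical placeholders."""

    def __init__(self, hidden_dim=768, task_output_dims=None):
        super().__init__()
        # Hypothetical task set; the real model covers 8 tasks across 12 datasets.
        task_output_dims = task_output_dims or {"vqa": 3129, "retrieval": 1, "refer": 4}
        # Shared encoder standing in for the ViLBERT trunk (placeholder).
        self.trunk = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # One small output head per task, so all tasks share the same backbone.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, dim) for task, dim in task_output_dims.items()}
        )

    def forward(self, features, task):
        shared = self.trunk(features)      # features pass through the shared trunk
        return self.heads[task](shared)    # then through the task-specific head

model = MultiTaskModel()
features = torch.randn(2, 768)                 # stand-in for fused vision+language features
vqa_logits = model(features, task="vqa")       # answer-classification head
refer_out = model(features, task="refer")      # referring-expression head
```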
More details about the ViLBERT Multi-Task paper can be found here, along with the code for model training, the code for model inference, and the demo interface.
Browsers currently supported by the demo: Google Chrome and Mozilla Firefox.