ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training

Abstract

Aligning the visual and language spaces requires training deep neural networks from scratch on giant multimodal datasets; CLIP trains both an image and a text encoder, while LiT manages to train only the latter by taking advantage of a pretrained vision network. In this paper, we show that sparse relative representations are sufficient to align text and images without training any network. Our method relies on readily available single-domain encoders (trained with or without supervision) and a comparatively modest number of image-text pairs. ASIF redefines what constitutes a multimodal model by explicitly disentangling memory from processing: here, the model is defined by the embedded pairs of all the entries in the multimodal dataset, together with the parameters of the two encoders. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.
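To make the recipe concrete, here is a minimal sketch of the idea described in the abstract: a query image and candidate captions are each re-expressed as sparse similarities to the embedded image-text pairs of the multimodal dataset, and compared in that shared space. The random embeddings, dimensions, and the top-k sparsification value below are illustrative stand-ins; in practice the embeddings would come from the two frozen unimodal encoders.

```python
# Hedged sketch of zero-shot classification via sparse relative representations (NumPy only).
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def relative_rep(queries, anchors, k=8):
    """Represent each query as its sparsified cosine similarities
    to the anchor embeddings of the same modality."""
    sims = l2_normalize(queries) @ l2_normalize(anchors).T   # (n_queries, n_anchors)
    if k < sims.shape[1]:
        # Keep only the k largest similarities per row ("sparse" relative representation).
        thresh = np.partition(sims, -k, axis=1)[:, [-k]]
        sims = np.where(sims >= thresh, sims, 0.0)
    return l2_normalize(sims)

rng = np.random.default_rng(0)
n_pairs, d_img, d_txt = 1000, 512, 384           # the multimodal "memory" of image-text pairs
anchor_img = rng.normal(size=(n_pairs, d_img))   # frozen image-encoder embeddings of the pairs
anchor_txt = rng.normal(size=(n_pairs, d_txt))   # frozen text-encoder embeddings of the pairs

query_img   = rng.normal(size=(1, d_img))        # embedding of the image to classify
class_texts = rng.normal(size=(10, d_txt))       # embeddings of candidate captions/labels

# Compare the query image and every caption in the shared
# relative-representation space defined by the paired anchors.
z_img = relative_rep(query_img, anchor_img)      # (1, n_pairs)
z_txt = relative_rep(class_texts, anchor_txt)    # (10, n_pairs)
scores = (z_img @ z_txt.T).ravel()
print("predicted class:", int(scores.argmax()))
```

No parameter is updated at any point: the "model" is just the two frozen encoders plus the embedded pairs, so adding or removing pairs changes its behavior without retraining.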

Publication
arXiv
Valentino Maiorca
Computer Science student @ Sapienza | NLP Engineer @ Babelscape

Excited by Natural Language Understanding, Representation Learning, and Reasoning.

Luca Moschella
Ph.D. student in Computer Science, NVIDIA Research Intern

I’m excited by many Computer Science fields and, in particular, by Artificial Intelligence.