{"author":[{"first_name":"Antonio","full_name":"Norelli, Antonio","last_name":"Norelli"},{"last_name":"Fumero","first_name":"Marco","full_name":"Fumero, Marco"},{"last_name":"Maiorca","full_name":"Maiorca, Valentino","first_name":"Valentino"},{"first_name":"Luca","full_name":"Moschella, Luca","last_name":"Moschella"},{"last_name":"Rodolà","full_name":"Rodolà, Emanuele","first_name":"Emanuele"},{"last_name":"Locatello","id":"26cfd52f-2483-11ee-8040-88983bcc06d4","first_name":"Francesco","full_name":"Locatello, Francesco","orcid":"0000-0002-4850-0683"}],"doi":"10.48550/arXiv.2210.01738","main_file_link":[{"url":"https://doi.org/10.48550/arXiv.2210.01738","open_access":"1"}],"_id":"14216","title":"ASIF: Coupled data turns unimodal models to multimodal without training","publication_status":"submitted","publication":"arXiv","oa_version":"Preprint","article_number":"2210.01738","abstract":[{"lang":"eng","text":"CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multi-modal models, raising important questions on their data efficiency and on the role of retrieval in machine learning."}],"date_updated":"2024-02-12T09:57:14Z","type":"preprint","language":[{"iso":"eng"}],"year":"2022","status":"public","day":"04","external_id":{"arxiv":["2210.01738"]},"user_id":"2DF688A6-F248-11E8-B48F-1D18A9856A87","date_published":"2022-10-04T00:00:00Z","oa":1,"article_processing_charge":"No","department":[{"_id":"FrLo"}],"citation":{"ieee":"A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, and F. Locatello, “ASIF: Coupled data turns unimodal models to multimodal without training,” arXiv. .","ama":"Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. arXiv. doi:10.48550/arXiv.2210.01738","mla":"Norelli, Antonio, et al. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” ArXiv, 2210.01738, doi:10.48550/arXiv.2210.01738.","short":"A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà, F. Locatello, ArXiv (n.d.).","apa":"Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., & Locatello, F. (n.d.). ASIF: Coupled data turns unimodal models to multimodal without training. arXiv. https://doi.org/10.48550/arXiv.2210.01738","ista":"Norelli A, Fumero M, Maiorca V, Moschella L, Rodolà E, Locatello F. ASIF: Coupled data turns unimodal models to multimodal without training. arXiv, 2210.01738.","chicago":"Norelli, Antonio, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, and Francesco Locatello. “ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training.” ArXiv, n.d. https://doi.org/10.48550/arXiv.2210.01738."},"date_created":"2023-08-22T14:22:04Z","month":"10"}