Abstract

Multimodal data is pervasive in applications such as e-commerce product listings, social media posts, and short videos. However, existing algorithms for such data still largely focus on uni-modal representation learning, via vision-language alignment and cross-modal retrieval. In this workshop, we aim to introduce a new retrieval problem in which both queries and documents are multimodal. With the rise of vision-language modeling, large language models (LLMs), retrieval-augmented generation (RAG), and multimodal LLMs, we see many new opportunities for multimodal representation and retrieval tasks. This event will be a comprehensive half-day workshop on the subject of multimodal representation and retrieval. The agenda includes keynote speeches, oral presentations, and an interactive panel discussion.

Call for Papers


Our objective with this workshop is to capture the interest of researchers in this emerging challenge. We anticipate that the workshop will serve as a catalyst for establishing a dedicated community around this topic. By highlighting the novelty and significance of the problem, we aim to attract researchers who are eager to explore and contribute to this field. We invite original research and industrial application papers on multimodal data representation and retrieval.

Submission Guidelines

Submissions of short papers must be in English, in PDF format, and at most 4 pages in length (including figures, tables, proofs, appendices, acknowledgments, and all other content except references), with unrestricted space for references, in the current ACM two-column conference format. Suitable LaTeX, Word, and Overleaf templates are available from the ACM website (use the "sigconf" proceedings template for LaTeX and the Interim Template for Word). ACM CCS concepts and keywords are required for review.


For LaTeX, the following should be used:

\documentclass[sigconf,natbib=true,anonymous=true]{acmart}
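
For context, here is a minimal skeleton (illustrative only, not an official template) showing how these class options fit together with the required CCS concepts and keywords in an anonymous acmart submission. The title, author, and concept IDs are placeholders; the real CCSXML block should be generated with the ACM CCS tool and pasted in.

% Minimal illustrative skeleton for an anonymous short-paper submission.
% Placeholders throughout; generate the real CCSXML at the ACM CCS tool.
\documentclass[sigconf,natbib=true,anonymous=true]{acmart}

\begin{document}

\title{Your Paper Title}

\author{Your Name} % concealed from reviewers by anonymous=true
\affiliation{%
  \institution{Your Institution}
  \country{Your Country}}

\begin{abstract}
A one-paragraph summary of the contribution.
\end{abstract}

% CCS concepts and keywords are required for review.
\begin{CCSXML}
<ccs2012>
  <concept>
    <concept_id>00000000.0000000</concept_id> % placeholder ID
    <concept_desc>Information systems~Information retrieval</concept_desc>
    <concept_significance>500</concept_significance>
  </concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[500]{Information systems~Information retrieval}
\keywords{multimodal representation, multimodal retrieval}

\maketitle

\section{Introduction}
...

\bibliographystyle{ACM-Reference-Format}
\bibliography{references}

\end{document}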

Submissions must be anonymous and should be submitted electronically via EasyChair:


https://easychair.org/conferences/?conf=mrr2024

Important dates for submissions to MRR 2024

Topics include but are not limited to

Accepted Papers

Kang Zhao, Xinyu Zhao, Zhipeng Jin, Yi Yang, Xuewu Jiao, Wen Tao, Yafei Li, Cong Han, Shuanglong Li, and Lin Liu. Image Captioning for Baidu Ad Image Generation with Multi-Stage Refinements.

Jing Zhu, Xiang Song, Vassilis Ioannidis, Danai Koutra, and Christos Faloutsos. Improving Feature Representation through Graph-Centric Finetuning.

Wenliang Zhong, Wenyi Wu, Qi Li, Rob Barton, Boxin Du, Karim Bouyarmane, Shioulin Sam, Ismail Tutar, and Junzhou Huang. Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment.

Kevin Dela Rosa. Video Enriched Retrieval Augmented Generation Using Aligned Video Captions.

Mingwei Tang, Meng Liu, Hong Li, Junjie Yang, Chenglin Wei, Boyang Li, Dai Li, Rengan Xu, Yifan Xu, Zehua Zhang, Xiangyu Wang, Linfeng Liu, Yuelei Xie, Chengye Liu, Labib Fawaz, Li Li, Hongnan Wang, Bill Zhu, and Sri Reddy. Async Learned User Embeddings for Ads Delivery Optimization.

Sarthak Srivastava and Kathy Wu. Vision-Language Understanding in Hyperbolic Space.

Program


Time                 | Activity                           | Host
9:00 AM - 9:05 AM    | Opening Remarks                    | Doug Gray
9:05 AM - 9:35 AM    | Keynote Address by Hamed Zamani    | Doug Gray
9:35 AM - 10:35 AM   | Oral Presentations                 | Xinliang Zhu
10:35 AM - 10:45 AM  | Coffee Break                       | -
10:45 AM - 11:15 AM  | Keynote Address by Dinesh Manocha  | Arnab Dhua
11:15 AM - 11:45 AM  | Panel Discussion                   | Arnab Dhua
11:45 AM - 11:50 AM  | Closing Remarks                    | Xinliang Zhu
11:50 AM - 12:15 PM  | Networking                         | -

Speakers

Panel

Organizers