X-VILA: Cross-Modality Alignment for Large Language Model

Hanrong Ye1,2, De-An Huang1, Yao Lu1, Zhiding Yu1, Wei Ping1, Andrew Tao1,
Jan Kautz1, Song Han1,3, Dan Xu2, Pavlo Molchanov1, Hongxu Yin1
1NVIDIA  2HKUST  3MIT
Work done during an internship at NVIDIA.

Abstract

We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, which exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.


1 Introduction

Large language models (LLMs) provide an emerging foundation for enhancing various deep learning tasks beyond the realm of natural language processing. As an example, the research community has been quickly extending the fast progress of LLMs[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] toward the computer vision (CV) domain[14, 15, 16, 17, 18, 19, 20, 21, 22]. The introduction of LLMs to CV tasks enables vision models to perform many zero/few-shot and in-context learning tasks that are "promptable" through user questions, potentially empowering reasoning capabilities for the first time. Despite remarkable progress, cross-modality alignment is still a challenging task, as the joint training stage for cross-modality learning requires carefully designed feedback signals[23, 24] to guide the connected foundation models[15, 14, 18], backed by cross-modality datasets at scale[25, 26, 27]. Hence, the majority of existing studies revolve around a solitary input modality linked to LLMs, with the output being solely text. For example, Flamingo[15], LLaVA[14], and VILA[28] delve into image input, while Video-LLaMA[29] and LITA[30] specifically concentrate on video input. Exploring the integration of various modalities into a cohesive framework is a crucial yet relatively unexplored research challenge[31, 32, 33] in the domain of multi-modality LLMs, yet one observed to be practical in the proprietary GPT-4o[21].

This study focuses on the development of a systematic approach to integrate multiple modalities, such as video, image, and audio, into an LLM at both the input and output stages. Our objective is to facilitate cross-modal conversations in an any-to-any modality (or "X-to-X") LLM, allowing for generation in different modalities. To accomplish this ambitious objective, we present a two-phase alignment mechanism: (i) Textual alignment. We align the input and output representations of different modalities to the textual embedding space of the LLM[32]. Specifically, regarding the input of the LLM, we use a unified embedding space that allows for the sharing of features extracted from encoders across diverse modalities. As for the output of the LLM, we employ fine-tunable modality-specific diffusion models to convert the generated outputs of the LLM into content that aligns with the respective modalities. (ii) Visual alignment. We observe that textual alignment alone fails to preserve visual features adequately in vision-to-vision generation tasks, such as image-to-video and video-to-image generation. This limitation can be attributed to the loss of information during the projection process from visual encoders to the LLM, as well as the LLM's tendency to prioritize common concepts over specific visual details. To address this issue, we propose a new module named Visual Embedding Highway (VEH). The VEH module facilitates the direct guidance of visual decoders by enabling visual features to bypass the LLM. By incorporating VEH, we observe a notable enhancement in the correspondence of visual content between the input and output stages of our framework.

On the other hand, the scarcity of cross-modality instruction-following data poses a significant challenge in the development of any-to-any modality (or "X-to-X") LLMs. This limitation severely restricts progress in creating LLMs that can seamlessly handle multiple modalities at both the input and output ends. Existing datasets provide limited data, mostly in the form of X-to-text or text-to-X. Therefore, we curate a new X-to-X dataset based on multi-modality data from WebVid[34] and ActivityNet Captions[35] to facilitate cross-modality interactions between text, audio, image, and video. Overall, we synthesize more than 1.5M multi-modality conversations, with each conversation containing at least one cross-modality question-and-answer pair.

To achieve the cross-modality input-output alignment of LLMs in our X-to-X LLM, we design three major training phases: (i) a data-effective alignment phase that aligns the multi-modality encoders with the LLM inputs and the multi-modality decoders with the LLM outputs; (ii) an interleaved multi-modality pre-training phase with interleaved instruction data across modalities for enhanced in-context learning performance; and (iii) an X-to-X cross-modality instruction tuning phase that includes a two-step alignment process: textual alignment and visual alignment. Through our approach to multi-modality alignment, we build a powerful X-to-X multi-modality LLM with the ability to comprehend and generate multi-modality content. We term our new model "X-VILA" for cross-modality understanding, reasoning, and generation in the domains of Video, Image, Language, and Audio. For instance, as shown in Figure 1 and Figure 8, X-VILA demonstrates its capacity to recognize the subjects in the image, which results from our vision-language alignment training. It can then retrieve its knowledge and make logical deductions to answer the user's questions about the content in the image. Last but not least, it can generate aligned multi-modality output that matches the given context.

In summary, this work makes contributions in three aspects:

  • A new family of any-to-any modality chat LLM that is able to conduct multi-modality conversations by understanding signals from different modalities and generating content in various formats, including video, audio, image, and text.

  • A novel 2-step alignment mechanism that effectively aligns both semantic and visual details between the input and output spaces. This mechanism ensures a comprehensive and accurate correspondence between the input and output of our X-to-X LLM.

  • The creation of a new X-to-X multi-modality instruction tuning dataset that is proven effective for cross-modality alignment. This dataset serves as a valuable resource for future research in the realm of multi-modality foundation models.

2 Methodology


2.1 X-VILA Architecture

We consider four common modalities in this work: text, image, video, and audio. The tenet of X-VILA is an alignment-based architecture that augments an LLM with the ability to "see/hear/read" multi-modality inputs and "draw/speak/write" multi-modality outputs, as shown in Figure 2.

Modality-specific encoders. We adopt modality-specific encoders to handle inputs from different modalities. This strategy harvests the pre-trained understanding ability of the domain-expert encoders and has been proven successful in many vision-language models[15, 18, 14]. To better align embeddings of different modalities, we use ImageBind encoders[36], which unify features from different modalities, including image, video, and audio, into one feature space. Technically, for each modality $m \in \{\text{`text', `image', `video', `audio'}\}$, we denote the encoder as $\text{Enc}_m$. For the text modality, the encoder is a text tokenizer[37], while for the other modalities they are ImageBind transformers[36]. We then use modality-specific trainable linear layers, denoted $\mathbf{P}^{\text{in}}_m$, to project the $\text{Enc}_m$ output into embedding sequences $\mathbf{S}$ in the textual embedding space of the following LLM. We can formulate this process as:

$$\mathbf{S}^{\text{in}} = \{\mathbf{P}^{\text{in}}_{m}(\text{Enc}_{m}(\mathbf{X}_{m}))\}, \qquad (1)$$

where $\mathbf{X}_m$ is the input from modality $m \in \{\text{`text', `image', `video', `audio'}\}$.
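As a concrete illustration of Eq. (1), below is a minimal PyTorch sketch of the trainable input projection layers; the encoder output widths, the LLM hidden size, and the class name are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class InputProjection(nn.Module):
    """Per-modality linear layers P^in_m that map encoder features into the
    LLM's textual embedding space, as in Eq. (1). Encoder widths and the LLM
    hidden size below are illustrative placeholders, not the paper's values."""

    def __init__(self, enc_dims=None, llm_dim=4096):
        super().__init__()
        enc_dims = enc_dims or {"image": 1024, "video": 1024, "audio": 1024}
        self.proj = nn.ModuleDict({m: nn.Linear(d, llm_dim) for m, d in enc_dims.items()})

    def forward(self, enc_features):
        # enc_features: dict of modality name -> [num_tokens, enc_dim] tensors,
        # i.e. the outputs Enc_m(X_m) of the frozen ImageBind-style encoders.
        return {m: self.proj[m](feat) for m, feat in enc_features.items()}

# Usage with dummy encoder outputs standing in for Enc_m(X_m):
proj = InputProjection()
dummy = {"image": torch.randn(256, 1024), "audio": torch.randn(8, 1024)}
s_in = proj(dummy)  # embedding sequences S^in in the LLM input space
print({m: tuple(t.shape) for m, t in s_in.items()})
```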

Large language model (LLM). The LLM serves as the "brain" of our framework. It processes information from the textual embedding space and predicts language outputs accordingly. We adopt Vicuna-7B-1.5[8, 6], which demonstrates state-of-the-art language understanding and generation ability. For ease of exposition, we slightly abuse notation and write the autoregressive generation of the output embedding sequence $\mathbf{S}^{\text{out}}$ by the LLM as:

$$\mathbf{S}^{\text{out}} = \mathbf{LLM}(\mathbf{S}^{\text{in}}). \qquad (2)$$

Modality-specific decoders. For generating multi-modality outputs other than text, we adopt the "modality-specific generation tokens" designed by[32]. In addition to common text tokens, there are three types of modality-specific generation tokens: image generation tokens $\{\texttt{[IMG}i\texttt{]}, i \in [1, N_{img}]\}$, video generation tokens $\{\texttt{[VID}i\texttt{]}, i \in [1, N_{vid}]\}$, and audio generation tokens $\{\texttt{[AUD}i\texttt{]}, i \in [1, N_{aud}]\}$. $N_{img}$, $N_{vid}$, and $N_{aud}$ are the numbers of generation tokens for image, video, and audio, respectively. These modality-specific generation tokens are added to the vocabulary of the LLM. The LLM is trained to predict when to generate these tokens, which are then translated into image, video, or audio content by a set of modality-specific decoders (i.e., generation models). Technically, we extract the subset of the output embedding sequence $\mathbf{S}^{\text{out}}$ corresponding to the generation tokens of modality $m$. We name this subset the generation embedding sequence $\mathbf{S}^{\text{gen}}_m$. We use modality-specific transformer layers, denoted as output projection layers $\mathbf{P}^{\text{out}}_m$, to project $\mathbf{S}^{\text{gen}}_m$ to the feature space of the original pre-trained text encoder of the modality-specific decoder. As the resulting embedding is used to control the modality-specific decoder via cross-attention, we call it the "textual controller embedding" $\mathbf{E}^{\text{text}}_m$. Thus we have:

$$\mathbf{E}^{\text{text}}_{m} = \mathbf{P}^{\text{out}}_{m}(\mathbf{S}^{\text{gen}}_{m}). \qquad (3)$$

[32] freezes the decoder models and only supervises $\mathbf{E}^{\text{text}}_m$ to be similar to the output of the original text encoders of the diffusion models. This largely limits the synergy between the generation models and the rest of the framework, as the learning target is essentially to mimic the pre-trained text encoder of the diffusion models. Differently, we include the modality-specific decoder models in fine-tuning to better align them with the LLM and the other parts of the unified generative framework. The training details are discussed in a later section. Specifically, to achieve better multi-modality generation ability, we employ state-of-the-art generation models trained on large-scale data as modality-specific decoders: VideoCrafter2[38] for video generation, Stable Diffusion 1.5[39] for image generation, and AudioLDM[40] for audio generation.
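To make the generation-token mechanism and Eq. (3) concrete, the following sketch shows how such tokens could be declared and how the corresponding hidden states might be projected to a controller embedding; the token counts, projector depth, and the 768-dimensional controller width (chosen to match a CLIP-style text encoder) are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Token counts N_img, N_vid, N_aud are illustrative; the paper does not state them here.
N_IMG, N_VID, N_AUD = 4, 24, 8
GEN_TOKENS = ([f"[IMG{i}]" for i in range(1, N_IMG + 1)]
              + [f"[VID{i}]" for i in range(1, N_VID + 1)]
              + [f"[AUD{i}]" for i in range(1, N_AUD + 1)])
# In practice these strings are added to the LLM vocabulary and the embedding
# table is resized (e.g. tokenizer.add_tokens(GEN_TOKENS) followed by
# llm.resize_token_embeddings(len(tokenizer)) in a HuggingFace-style stack).

class OutputProjection(nn.Module):
    """Transformer layers P^out_m that map the generation-token hidden states
    S^gen_m to the textual controller embedding E^text_m of Eq. (3)."""
    def __init__(self, llm_dim=4096, ctrl_dim=768, num_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(llm_dim, ctrl_dim)

    def forward(self, s_gen):                  # s_gen: [B, N_m, llm_dim]
        return self.out(self.encoder(s_gen))   # E^text_m: [B, N_m, ctrl_dim]

# Usage: gather the LLM hidden states at the [IMG*] positions and project them.
proj_img = OutputProjection()
s_gen_img = torch.randn(1, N_IMG, 4096)        # stand-in for S^gen_image
print(proj_img(s_gen_img).shape)               # torch.Size([1, 4, 768])
```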


Visual embedding highway. The weakness of the previously introduced text-space-based alignment is the inadequate visual features available at the output end, as can be seen in the examples in Figure 4. Intuitively, this stems from the one-to-many correspondence between the text and visual semantic spaces, e.g., "a dress" may relate to images varying in colors and styles.

To address this issue, we propose a visual embedding highway that bridges the visual encoders and decoders, built to alleviate the information loss when projecting high-dimensional visual content to the textual embedding space. Specifically, we obtain the layer-wise feature maps from the ImageBind visual encoder and add up these features as the visual highway embedding $\mathbf{E}^{\text{vis}}$. $\mathbf{E}^{\text{vis}}$ has shape $H \times W \times C$, where $H$ and $W$ are the height and width of the feature maps and $C$ is the embedding dimension. To control the decoder using $\mathbf{E}^{\text{vis}}$, we design a light-weight visual controller (VisCtrl) module based on the philosophy of[41, 42] to process $\mathbf{E}^{\text{vis}}$. The controller module comprises 4 stages, where each stage consists of two residual convolutional blocks. These blocks have cascading spatial dimensions that match the resolution settings in the U-Net encoder[39] of the image/video decoders. In each stage, there is an additional convolutional block initialized with zero weights. This block generates the output control signals for the stage, which are therefore zero at the start of training. These control signals are added to different stages of the U-Net, as shown in Figure 3. Inspired by[43], we employ a conditioning rate $\alpha \in [0, 1]$ to regulate the proportion of steps conditioned by visual features. Therefore, the noise prediction at each reverse step $t$ in the visual decoders can be written as:

$$\epsilon^{p} = \begin{cases} \text{U-Net}_{m}\big(z(t), \text{VisCtrl}_{m}(\mathbf{E}^{\text{vis}}), \mathbf{E}^{\text{text}}_{m}\big) & \text{if } t < T \times \alpha \\ \text{U-Net}_{m}\big(z(t), \text{Null}, \mathbf{E}^{\text{text}}_{m}\big) & \text{if } t \geq T \times \alpha \end{cases}, \quad m \in \{\text{`image', `video'}\}. \qquad (4)$$

where $\epsilon^{p}$ is the predicted noise given the input latent $z(t)$, $T$ is the number of diffusion steps, $\text{U-Net}_m$ is the U-Net of the diffusion decoder for modality $m$, and $\text{VisCtrl}_m$ is the visual control module for modality $m$. "Null" means no VEH feature is passed to the U-Net at the corresponding timestep. During the instruction tuning process on X-to-X datasets, both the U-Net and the controller modules are fine-tuned together. This ensures a better synergy between the decoders and the LLM.
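For intuition, here is a minimal sketch of a VisCtrl-style controller in PyTorch: four stages of two residual convolutional blocks each, with cascading resolution, where every stage emits a zero-initialized control signal for the matching U-Net stage. The channel widths, input feature dimension, and module names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    """Zero-initialized 1x1 conv: its output (the control signal) starts at
    zero, so VEH does not perturb the pre-trained U-Net early in training."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ResBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1), nn.SiLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1))
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=stride)
    def forward(self, x):
        return self.body(x) + self.skip(x)

class VisCtrl(nn.Module):
    """Sketch of the visual controller: 4 stages of two residual conv blocks,
    each stage followed by a zero-initialized conv producing the control
    signal that is added to the corresponding U-Net stage."""
    def __init__(self, c_in=1024, widths=(320, 640, 1280, 1280)):
        super().__init__()
        self.stages, self.ctrl = nn.ModuleList(), nn.ModuleList()
        prev = c_in
        for i, w in enumerate(widths):
            stride = 1 if i == 0 else 2           # halve resolution per stage
            self.stages.append(nn.Sequential(ResBlock(prev, w, stride), ResBlock(w, w)))
            self.ctrl.append(zero_conv(w))
            prev = w

    def forward(self, e_vis):                     # e_vis: [B, C, H, W] highway embedding
        signals, x = [], e_vis
        for stage, ctrl in zip(self.stages, self.ctrl):
            x = stage(x)
            signals.append(ctrl(x))               # injected into the matching U-Net stage
        return signals

veh = VisCtrl()
print([tuple(s.shape) for s in veh(torch.randn(1, 1024, 16, 16))])
```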

The experimental results introduced in later sections show that the proposed visual embedding highway significantly increases the consistency between the generation results and the visual context of our multi-modality unified generation model.

2.2 X-VILA Training

The training process of X-VILA is divided into three phases: (i) encoder-LLM-decoder alignment training, (ii) interleaved data pre-training, and (iii) X-to-X cross-modality instruction fine-tuning. We describe the details of X-VILA training in Appendix A due to the space limit.

3 Experiments

3.1 Datasets and Evaluation

Setup. In this work, we utilize different datasets for different training phases. For the first encoder-LLM-decoder alignment training phase, X-text pairs from LLaVA-pretrain[14], CC3M[44], WebVid[34], AudioCaps[45], and WavCaps[46] are utilized. During the interleaved data pre-training phase, we construct an interleaved multi-modality corpus from MMC4[25] and ActivityNet Captions[35].


In the final X-to-X cross-modality instruction tuning, we create a new X-to-X dataset from WebVid[34] and ActivityNet Captions. We synthesize conversation samples of 6 types based on the modalities at the input and output ends: video-to-image, video-to-video, image-to-video, video-to-audio, audio-to-video, and image+audio-to-video. Statistically, from ActivityNet Captions we build 10,009 image-to-video, 10,009 video-to-image, and 10,009 video-to-video conversations. For WebVid, we randomly select 500K training samples and build 499,915 image-to-video, 499,915 video-to-image, 499,915 video-to-video, 32,874 audio-to-video, 32,874 video-to-audio, and 32,874 image+audio-to-video conversations. Each conversation contains one or more cross-modality Q&A pairs. Some conversation examples are shown in Figure 9. We blend our X-to-X dataset with SFT datasets from LLaVA[14], VideoChat[47], NextGPT-instructions[32], and Alpaca[7].
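For illustration, one synthesized image-to-video conversation sample might be stored as follows; the field names, prompt wording, and file paths are hypothetical and do not reflect the dataset's exact schema.

```python
# Hypothetical record layout for one image-to-video conversation sample.
sample = {
    "source": "WebVid",
    "task": "image-to-video",
    "conversation": [
        {"role": "user",
         "content": "<image>\nPlease generate a video that matches this scene."},
        {"role": "assistant",
         # [VID1]...[VIDn] are the video generation tokens the LLM learns to emit;
         # they are later projected to the video decoder's controller embedding.
         "content": "Sure, here is a matching video. "
                    + "".join(f"[VID{i}]" for i in range(1, 25))},
    ],
    "image_path": "clips/000123/frame_0.jpg",   # hypothetical paths
    "video_path": "clips/000123/clip.mp4",
}
print(sample["task"], len(sample["conversation"]))
```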


Evaluation. To benchmark the X-to-X alignment ability of different models, we randomly sample validation subsets of WebVid and ActivityNet Captions to build cross-modality conversations for evaluation. Specifically, for ActivityNet Captions we generate 100 video-to-image, 100 video-to-video, and 100 image-to-video conversations. For WebVid, we curate 100 video-to-image, 100 image-to-video, 100 video-to-video, 62 audio-to-video, 62 video-to-audio, and 62 image+audio-to-video conversations for evaluation. To evaluate the similarity between ground-truth annotations and model predictions across different modalities, we introduce a metric called the "X-to-X Alignment Score" (X2A Score). To compute this score, we employ the ImageBind transformer[36] to extract embedding vectors from the audio, video, and image predictions as well as the corresponding ground truths. We then calculate the cosine similarity between these vectors, expressed as a percentage ranging from 0 to 100. Finally, we average the scores across all validation samples to obtain the X2A score for each type of data.
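A minimal sketch of the X2A computation, assuming the ImageBind embeddings for the predictions and the ground truths have already been extracted:

```python
import torch
import torch.nn.functional as F

def x2a_score(pred_embs, gt_embs):
    """X2A score sketch: cosine similarity between ImageBind embeddings of the
    generated output and the ground truth, scaled to a percentage and averaged
    over validation samples. pred_embs, gt_embs: [N, D] precomputed embeddings."""
    sims = F.cosine_similarity(pred_embs, gt_embs, dim=-1)   # per-sample similarity
    return (sims * 100.0).mean().item()

# Usage with dummy embeddings standing in for ImageBind features:
pred = F.normalize(torch.randn(100, 1024), dim=-1)
gt = F.normalize(torch.randn(100, 1024), dim=-1)
print(f"X2A score: {x2a_score(pred, gt):.2f}")
```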

Baseline methods.We conduct a comparison between our model and Next-GPT[32], a recently introduced instruction-following LLM designed for multi-modality understanding and generation. Their method is restricted to textual alignment exclusively.

Table 1: X2A scores (%) on the ActivityNet Captions X-to-X benchmark.

| Method | VID2IMG | VID2VID | IMG2VID |
|---|---|---|---|
| Next-GPT[32] | 27.85 | 10.47 | 13.08 |
| X-VILA w/ X2X text | 36.09 | 46.18 | 45.93 |
| X-VILA w/ X2X text + VEH (img) | 44.06 | 46.68 | 45.94 |
| X-VILA w/ X2X text + VEH (img+vid) | 43.95 | 49.76 | 48.81 |

Table 2: X2A scores (%) on the WebVid X-to-X benchmark.

| Method | VID2IMG | IMGAUD2VID | VID2AUD | IMG2VID | VID2VID | AUD2VID |
|---|---|---|---|---|---|---|
| Next-GPT[32] | 15.31 | 44.63 | 8.17 | 38.23 | 31.81 | 37.13 |
| X-VILA w/ X2X text | 53.82 | 49.54 | 22.79 | 42.94 | 44.42 | 42.23 |
| X-VILA w/ X2X text + VEH (img) | 67.40 | 48.64 | 23.53 | 42.66 | 43.04 | 42.04 |
| X-VILA w/ X2X text + VEH (img+vid) | 67.94 | 59.71 | 23.87 | 57.01 | 57.39 | 49.44 |


3.2 Quantitative Analysis and Ablation Study

Effectiveness of Visual Embedding Highway. We compute the aforementioned X2A scores of different models on the X-to-X alignment benchmarks of both the ActivityNet Captions and WebVid datasets, and present the results in Table 1 and Table 2. Specifically, we study the X2A scores of Next-GPT and different versions of our X-VILA model. We investigate the performance of our model under different scenarios: (i) utilizing only textual alignment, (ii) incorporating visual alignment through the proposed visual embedding highway (VEH) on the image decoder, and (iii) extending VEH to both the image and video decoders. Our findings indicate that even by utilizing textual alignment alone with our carefully curated X-to-X datasets, our model demonstrates a substantial performance advantage over Next-GPT. Moreover, as we progressively introduce the visual embedding highway to the image and video decoders, we observe consistent and significant improvements in visual understanding and generation tasks. In summary, our X-VILA demonstrates significantly stronger cross-modality understanding, reasoning, and generation ability on all types of conversation data. These results suggest the effectiveness of our X-to-X alignment strategy and the proposed visual embedding highway design. Notably, both Next-GPT and X-VILA are based on the ImageBind model, making it fair to use ImageBind scores for both models.

Table 3: Zero-shot results on multi-modality VQA benchmarks.

| Method | VQAv2 | VizWiz | MMMU-val |
|---|---|---|---|
| BLIP-2 13B[49] | 65.0 | 19.6 | - |
| InstructBLIP 13B[24] | - | 33.4 | - |
| Qwen-VL-Chat 7B[20] | 78.2 | 38.9 | 35.9 |
| LLaVA 1.5 7B[50] | 78.5 | 50.0 | 36.4 |
| X-VILA 7B | 72.9 | 50.9 | 33.9 |

Influence of conditioning rates. We present the X2A scores for varying conditioning rates $\alpha$ (Equation 4) in VEH (image), as depicted in Figure 7. Our observations indicate that increasing $\alpha$, i.e., exposing more reverse steps to VEH features during image sampling, leads to improved multi-modality alignment. This outcome aligns with our intuitive expectations.
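The gating behind this ablation is simple; the following sketch reproduces the timestep condition of Equation 4, assuming a schedule of T discrete reverse steps.

```python
import torch

def veh_condition_mask(timesteps, alpha):
    """Eq. (4) gating: True for reverse steps that receive VEH control signals
    (t < T * alpha), False for steps conditioned on the textual controller
    embedding alone."""
    T = timesteps.max().item() + 1
    return timesteps < T * alpha

# Usage: with T = 50 reverse steps and alpha = 0.5, the low-noise half of the
# trajectory (t = 24, ..., 0) is conditioned on VEH features.
steps = torch.arange(49, -1, -1)              # t = 49, 48, ..., 0
mask = veh_condition_mask(steps, alpha=0.5)
print(int(mask.sum()), "of", len(steps), "steps use VEH conditioning")
```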

Extra multi-modality benchmarks. To further evaluate the multi-modality understanding capabilities of X-VILA, we perform zero-shot experiments on several multi-modality VQA benchmarks, including VQAv2[51], VizWiz[52], and MMMU-val[53]. The results in Table 3 indicate that X-VILA is competitive with leading domain-expert VLMs while possessing the X-to-X capability.

3.3 Qualitative Analysis and Ablation Study

Qualitative X-to-X alignment measurement. We provide a qualitative comparison with the state-of-the-art any-to-any LLMs, namely Next-GPT[32], CoDi[31], and GPT-4o[48], on visual cross-modality alignment tasks in Figure 6. We assess their performance by supplying an image to the models and prompting "Please generate a video (or an image in the case of GPT-4o, which cannot generate video) similar to the semantics in the input." X-VILA demonstrates significant improvements in visual correspondence over previous methods, thanks to the integration of the Visual Embedding Highway (VEH) into the output diffusion models.

Emergent X-to-X ability. During our experiments, we observe highly promising emergent abilities displayed by X-VILA following its training on our X-to-X datasets. As depicted in Figure 5, we have identified two key capabilities that have surfaced:
(i) Long-context cross-modality generation. X-VILA exhibits an impressive capacity for comprehending and combining diverse concepts from multiple iterations of input. Consequently, it produces natural and coherent output, as suggested by the users.
(ii) Unseen cross-modality ability. Remarkably, X-VILA showcases the ability to perform image-to-audio and audio-to-image tasks without any explicit training on similar data. This newfound competence emerges organically through the model's exposure to our comprehensive X-to-X dataset. These remarkable emergent abilities underscore the efficacy of our meticulously curated X-to-X dataset. Not only does it enable the model to excel in the specified data types, as suggested in Section 3.2, but it also facilitates generalization across a wide range of multi-modality interactions between users and the model.

More insights on varying design choices for decoder alignment. We next present our findings when aligning the LLM output end to the modality-specific decoders. We study different ways to bridge the LLM output and the diffusion models: (i) "Text-Aligned Decoding": the LLM generates a text description of the expected image/video/audio prediction and then feeds the text description into pre-trained image/video/audio decoders. (ii) "Text-Embed-Aligned Decoding": the LLM generates modality-specific generation tokens, and we then use the corresponding high-dimensional textual embeddings to control the modality-specific decoders (as described in Section 2.1). (iii) "Text-Embed-Aligned Decoding with VEH": building upon method (ii), we introduce the Visual Embedding Highway (VEH) to align the visual features between encoders and decoders. We conduct experiments on video-to-image and image-to-video cross-modality alignment tasks, and show the results on the right side of Figure 7.

The findings suggest that conveying specific details such as visual style, object appearance, and precise human actions from the input to the output is challenging for Text-Aligned Decoding. This difficulty arises from the low-dimensional nature of pure text descriptions, which limits the amount of information they can carry. On the other hand, Text-Embed-Aligned Decoding offers a significantly greater "bandwidth" in the textual embedding space between the LLM and the modality-specific decoders. Consequently, Text-Embed-Aligned Decoding is capable of generating more consistent outcomes. Nevertheless, Text-Embed-Aligned Decoding alone still falls short of capturing visual details, as a substantial amount of visual information is lost during the projection from the encoders to the LLM. This is where our Visual Embedding Highway demonstrates its value and aids X-VILA in attaining notably enhanced visual consistency.

Conversation examples. To thoroughly investigate the performance of our any-to-any modality LLM, we conducted extensive testing on X-VILA, examining many use cases. We present conversation examples of X-VILA across varying tasks in Figure 1 and Figure 8. It can be observed that X-VILA provides users with a comprehensive set of multi-modality responses, leveraging the encoders for perception, the LLM for understanding and reasoning, and the decoders for multi-modality content generation. As shown in Figure 14, X-VILA not only exhibits its understanding of the visual input, including the scene and objects, but also predicts the actions of the person depicted in the image. This capability is a result of training on our extensive X-to-X dataset. Based on the visual input, it generates outputs visually consistent with the input, e.g., the snow mountain and red ski suit are correctly presented in the generated output.


4 Related Work

The era of Large Language Models (LLMs) arguably started with the introduction of transformers[54] and a series of works that scaled them. In particular, OpenAI introduced the Generative Pre-trained Transformer (GPT) models[55, 56], from GPT-2 (1.5B parameters) to GPT-4[21] (reportedly 1.76T), and showed that parameter scaling, together with more high-quality data, can generate coherent and contextually relevant text across various domains. BERT[1] introduced a paradigm of bidirectional text processing, enabling stronger context understanding and boosting question answering. T5[2] converted language problems into a text-to-text format, advancing translation and summarization. Transformer-XL[3] demonstrated the capability of extending the context window, allowing for a better understanding of longer text. The application era of LLMs was kickstarted by ChatGPT[4], which showcased the unprecedented ability of LLM chatbots.

Current Vision-Language Models (VLMs) benefited from the development of ViT[57], which offers a unified way for vision models to communicate with other transformers from different modalities. Rapid progress has been shown in three streams[58]: (i) textually prompted models that accept image and text as input (CLIP[59], Frozen[60], BLIP[18], PaLI[17], LLaVA[14], VILA[28], MiniGPT-4[22]); (ii) visually prompted models (CLIPSeg[61], SAM[62]); and (iii) multi-modal input-output models (Painter[63], ImageBind[36], PaLM-E[64], Video-ChatGPT[65], RegionGPT[66], mPLUG-Owl[67], PandaGPT[68], CoDi[31], NextGPT[32], Unified-IO[33, 69]). Among the first stream, Frozen[60] demonstrated that a VLM can be constructed by linearly projecting ViT features into an LLM and tuning only the ViT on image-text captioning data, and was the first to discover the few-shot capabilities of VLMs without instruction tuning. Flamingo[15] used cross-attention for vision-language binding and, for the first time, demonstrated surpassing state-of-the-art fine-tuned models on multiple tasks. PaLI[17] created a universal model that can perform vision and language tasks separately; it scaled ViT to 4B parameters and demonstrated the importance of adding language-only data to the pre-training stage. Overall, VLMs follow the pipeline of taking a pre-trained LLM, adding a pre-trained vision encoder, learning feature alignment at scale via a projector or cross-attention, followed by instruction tuning (InstructBLIP[24], FLAN[23]). In close relation to our research, Next-GPT introduces an LLM that possesses the capability to comprehend multi-modality inputs and generate corresponding multi-modality outputs through textual alignment, yet it cannot effectively handle visual details present in the input.

5 Conclusion

This paper presents X-VILA, an any-to-any modality LLM that is able to understand, reason over, and generate multi-modality content. This ability is achieved through any-to-any modality alignment, for which we curate a dataset for any-to-any modality instruction tuning. We further identify a significant drawback in the previous textual alignment method that leads to the loss of crucial visual details. Accordingly, we propose an innovative visual alignment mechanism that incorporates a visual embedding highway module, which helps preserve essential visual details from the input. The experimental results, both quantitative and qualitative, indicate the effectiveness of our data and methodology. X-VILA's performance can be further enhanced across various VLM benchmarks in future work.

References

  • [1]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.BERT: pre-training of deep bidirectional transformers for language understanding.In NAACL-HLT 2019, pages 4171–4186. Association for Computational Linguistics, 2019.
  • [2]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • [3]Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov.Transformer-xl: Attentive language models beyond a fixed-length context.arXiv preprint arXiv:1901.02860, 2019.
  • [4]OpenAI.ChatGPT: Optimizing language models for dialogue.https://openai.com/blog/chatgpt, 2023.Accessed: 2023.
  • [5]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023.
  • [6]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al.Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023.
  • [7]Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto.Stanford alpaca: An instruction-following llama model.https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • [8]Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
  • [9]Siddharth Karamcheti, Laurel Orr, Jason Bolton, Tianyi Zhang, Karan Goel, Avanika Narayan, Rishi Bommasani, Deepak Narayanan, Tatsunori Hashimoto, Dan Jurafsky, et al.Mistral–a journey towards reproducible language model training, 2021.
  • [10]Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay.The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only.arXiv preprint arXiv:2306.01116, 2023.
  • [11]Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al.Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311, 2022.
  • [12]Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al.Yi: Open foundation models by 01.AI.arXiv preprint arXiv:2403.04652, 2024.
  • [13]Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al.Qwen technical report.Technical report, Alibaba Group, 2023.
  • [14]Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee.Visual instruction tuning.Advances in neural information processing systems, 36, 2024.
  • [15]Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al.Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • [16]Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al.Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023.
  • [17]Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al.Pali-x: On scaling up a multilingual vision and language model.arXiv preprint arXiv:2305.18565, 2023.
  • [18]Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv preprint arXiv:2301.12597, 2023.
  • [19]Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar.Fuyu-8B: A multimodal architecture for AI agents.https://www.adept.ai/blog/fuyu-8b, 2023.
  • [20]Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023.
  • [21]OpenAI.GPT-4 technical report.Technical report, OpenAI, 2023.https://arxiv.org/abs/2303.08774.
  • [22]Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023.
  • [23]Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le.Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021.
  • [24]Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C.H. Hoi.Instructblip: Towards general-purpose vision-language models with instruction tuning.ArXiv, abs/2305.06500, 2023.
  • [25]Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi.Multimodal c4: An open, billion-scale corpus of images interleaved with text.arXiv preprint arXiv:2304.06939, 2023.
  • [26]Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim.Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022.
  • [27]Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al.Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [28]Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han.Vila: On pre-training for visual language models.CVPR, 2024.
  • [29]Hang Zhang, Xin Li, and Lidong Bing.Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858, 2023.
  • [30]De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz.Lita: Language instructed temporal-localization assistant.arXiv preprint arXiv:2403.19046, 2024.
  • [31]Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal.Any-to-any generation via composable diffusion.In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [32]Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua.Next-gpt: Any-to-any multimodal llm.arXiv preprint arXiv:2309.05519, 2023.
  • [33]Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi.Unified-io: A unified model for vision, language, and multi-modal tasks.In ICLR, 2022.
  • [34]Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman.Frozen in time: A joint video and image encoder for end-to-end retrieval.In IEEE International Conference on Computer Vision, 2021.
  • [35]Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles.Dense-captioning events in videos.In International Conference on Computer Vision (ICCV), 2017.
  • [36]Rohit Girdhar, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra.Imagebind: One embedding space to bind them all.arXiv: Computer Vision and Pattern Recognition, 2023.
  • [37]Taku Kudo and John Richardson.Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.arXiv preprint arXiv:1808.06226, 2018.
  • [38]Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan.Videocrafter2: Overcoming data limitations for high-quality video diffusion models.arXiv preprint arXiv:2401.09047, 2024.
  • [39]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [40]Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley.Audioldm: Text-to-audio generation with latent diffusion models.arXiv preprint arXiv:2301.12503, 2023.
  • [41]Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie.T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models.arXiv preprint arXiv:2302.08453, 2023.
  • [42]Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.Adding conditional control to text-to-image diffusion models.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [43]Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, and Song Han.Fastcomposer: Tuning-free multi-subject image generation with localized attention.arXiv preprint arXiv:2305.10431, 2023.
  • [44]Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut.Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • [45]Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim.Audiocaps: Generating captions for audios in the wild.In NAACL-HLT, 2019.
  • [46]Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, and Wenwu Wang.Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research.arXiv preprint arXiv:2303.17395, 2023.
  • [47]KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao.Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023.
  • [48]OpenAI.GPT-4o.https://www.openai.com/chatgpt, 2024.
  • [49]Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  • [50]Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee.Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023.
  • [51]Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.Making the v in vqa matter: Elevating the role of image understanding in visual question answering.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • [52]Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham.Vizwiz grand challenge: Answering visual questions from blind people.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  • [53]Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen.Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.In CVPR, 2024.
  • [54]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.Attention is all you need.Advances in neural information processing systems, 30, 2017.
  • [55]Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019.
  • [56]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.Language models are few-shot learners.In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
  • [57]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.An image is worth 16x16 words: Transformers for image recognition at scale.arXiv: Computer Vision and Pattern Recognition, 2021.
  • [58]Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan.Foundational models defining a new era in vision: A survey and outlook.arXiv preprint arXiv:2307.13721, 2023.
  • [59]Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.Learning transferable visual models from natural language supervision.arXiv: Computer Vision and Pattern Recognition, 2021.
  • [60]Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill.Multimodal few-shot learning with frozen language models.arXiv: Computer Vision and Pattern Recognition, 2021.
  • [61]Timo Lüddecke.Image segmentation using text and image prompts.arXiv: Computer Vision and Pattern Recognition, 2021.
  • [62]Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al.Segment anything.arXiv preprint arXiv:2304.02643, 2023.
  • [63]Xinlong Wang, Wen Wang, and Tiejun Huang.Images speak in images: A generalist painter for in-context visual learning.arXiv: Computer Vision and Pattern Recognition, 2022.
  • [64]Danny Driess, Fei Xia, Mehdi S.M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence.Palm-e: An embodied multimodal language model.arXiv: Computer Vision and Pattern Recognition, 2023.
  • [65]Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan.Video-ChatGPT: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023.
  • [66]Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu.Regiongpt: Towards region understanding vision language model.CVPR, 2024.
  • [67]Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al.mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178, 2023.
  • [68]Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai.Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023.
  • [69]Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi.Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action.arXiv preprint arXiv:2312.17172, 2023.
  • [70]Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt.Openflamingo, March 2023.
  • [71]Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021.

X-VILA: Cross-Modality Alignment for
Large Language Model

Supplemental Material

Appendix A X-VILA Training

The training process of X-VILA is divided into three phases, namely (i) encoder-LLM-Decoder alignment training, (ii) interleaved data pre-training, and (iii) X-to-X cross-modality instruction fine-tuning.

A.1 Encoder-LLM-decoder alignment training phase.

As the first step, we align the output of the modality-specific encoders and the input of the modality-specific decoders to the textual embedding space of the LLM, as detailed in[32]. To achieve this goal, we only train the input projection layers, output projection layers, and the vocabulary embedding layer of the LLM, while keeping all other parameters frozen. We use a corpus of "X"-text pairs to train the model, where "X" is one of the video, image, or audio modalities.

For this stage, we design two primary tasks to train the projection layers: X-to-text generation and text-to-X generation. (a) X-to-text generation includes video, image, and audio captioning tasks. The model is supervised to generate text based on the multi-modality inputs. During this process, the input projection layers are trained to align the output embeddings of the modality-specific encoders with the textual embedding space of the pre-trained LLM. (b) Text-to-X generation aims at aligning the output textual embedding space of the LLM with the input end of the modality-specific decoders. We use video, image, and audio generation tasks to train the model, where only the output projection layers are optimized. As previously mentioned, the training objective here is pure textual alignment: minimizing the feature distance between the textual controller embedding $\mathbf{E}^{\text{text}}_m$ generated by the output projection layers and the embedding generated by the original pre-trained text encoder of the diffusion model. This training strategy ensures that $\mathbf{E}^{\text{text}}_m$ shares a distribution similar to that of the pre-trained text encoder in the diffusion model. After training, $\mathbf{E}^{\text{text}}_m$ replaces the diffusion text encoder feature to control the U-Nets of the modality-specific decoders via cross-attention.
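A minimal sketch of the text-to-X objective, under the assumption that the feature distance is an L2 (MSE) loss; the paper only specifies that a feature distance between the two embeddings is minimized.

```python
import torch
import torch.nn.functional as F

def textual_alignment_loss(e_text_pred, e_text_frozen):
    """Push the controller embedding E^text_m produced by the output projection
    layers toward the caption embedding from the diffusion model's frozen,
    pre-trained text encoder. MSE is an assumed choice of feature distance."""
    return F.mse_loss(e_text_pred, e_text_frozen)

# Usage with dummy tensors of shape [batch, num_gen_tokens, ctrl_dim]:
pred = torch.randn(2, 4, 768, requires_grad=True)     # from output projection layers
with torch.no_grad():
    target = torch.randn(2, 4, 768)                   # frozen text-encoder output
loss = textual_alignment_loss(pred, target)
loss.backward()
print(loss.item())
```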

A.2 Interleaved data pre-training phase.

Interleaved data training has been proven to be an effective strategy for vision-language models, alleviating the catastrophic forgetting issue after training on only visual-text pairs and providing long-context understanding ability[28, 70]. Therefore, we introduce a dedicated phase for pre-training X-VILA using a multi-modality interleaved corpus. In addition to interleaved image-text pairs as in MMC4[25], we further construct a new dataset from ActivityNet Captions[35]. The main idea is to exploit the nature of video, which contains a sequential flow of text (e.g., captions), audio, short video, and image. This enables us to place the images/videos and texts in an interleaved manner and use the corpus to pre-train X-VILA.

Specifically, we construct interleaved multi-modality data sequences from each target video clip as:

$$\underbrace{\texttt{\{<img. 1>, <aud. 1>, <vid. 1>, <txt 1>\}}}_{\text{sampled from video chunk }1},\ \dots,\ \underbrace{\texttt{\{<img. n>, <aud. n>, <vid. n>, <txt n>\}}}_{\text{sampled from video chunk }n},$$

where the video chunks are sampled from an entire video clip, which offers a natural source of interleaved cross-modality data. Once constructed, the modalities are sampled during training to align varying targets for gradient computation and network projector alignment. In this work, we observe that even sampling and $n = 3$ are sufficient for the task, namely constructing cross-modality tasks for the beginning, middle, and end of video clips. During this stage, we jointly train the input and output projection layers, and use LoRA[71] on the LLM for fine-tuning.
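As an illustration of this construction, the sketch below assembles one interleaved sequence from a captioned clip with n = 3 evenly sampled chunks; the record fields, the toy captions, and the overlap rule for assigning captions to chunks are illustrative assumptions.

```python
def build_interleaved_sequence(clip, n_chunks=3):
    """clip: dict with 'duration' (seconds) and 'captions' as a list of
    (start, end, text) tuples, as in ActivityNet Captions-style annotations."""
    seq = []
    chunk_len = clip["duration"] / n_chunks
    for i in range(n_chunks):
        start, end = i * chunk_len, (i + 1) * chunk_len
        # Captions whose span overlaps this chunk become the text element.
        text = " ".join(t for (s, e, t) in clip["captions"] if s < end and e > start)
        seq.extend([
            {"type": "image", "time": start},           # a frame from the chunk
            {"type": "audio", "span": (start, end)},     # the chunk's audio track
            {"type": "video", "span": (start, end)},     # the short video segment
            {"type": "text",  "content": text},          # the chunk's caption(s)
        ])
    return seq

example = {"duration": 90.0,
           "captions": [(0, 30, "A man waxes his skis."),
                        (30, 75, "He skis down a slope."),
                        (75, 90, "He waves at the camera.")]}
print(len(build_interleaved_sequence(example)))          # 12 interleaved elements
```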

A.3 X-to-X cross-modality instruction tuning phase.

After the previous two phases, we have textually aligned the different components of X-VILA in a unified framework. However, the model is still not ready to understand and generate multi-modality content in a proper manner. To achieve this goal, we curate a comprehensive "X-to-X dataset" for cross-modality generation instruction tuning. As video captioning datasets are inherently multi-modal and provide an abundant corpus in video, audio, image, and text forms, we build our X-to-X dataset on two video captioning datasets: WebVid[34] and ActivityNet Captions[35]. Our X-to-X dataset features six different types of cross-modality generative conversations, namely video-to-image, video-to-video, image-to-video, video-to-audio, audio-to-video, and image+audio-to-video. We show examples of the different conversation types in Figure 9. Each conversation contains one or more rounds of cross-modality conversation. More details about the X-to-X dataset are described in the experiment section.

We further divide the X-to-X cross-modality instruction tuning phase into two distinct steps, each based on different alignment methods: textual alignment and visual alignment.

(a) To achieve textual alignment, we first project the multi-modality inputs into the textual embedding space of the LLM. The LLM then generates textual embeddings that are subsequently converted into the corresponding modality's content. We follow a process similar to phases (i) and (ii): for image, video, or audio outputs, we generate embeddings using the text encoders of the corresponding diffusion models, and then optimize the distance between these embeddings and the $\mathbf{E}^{\text{text}}_m$ generated by our model. During this step, we keep all the decoder weights frozen and train the input projection layers, output projection layers, and vocabulary embedding layer, as well as the LoRA parameters of the LLM. For training data, we blend our X-to-X dataset with common SFT datasets used by other VLM models[14, 32] (more details in the experiment section).

(b) As mentioned earlier, relying solely on textual alignment is inherently insufficient to retain the visual details of the input when generating visual outputs. To address this issue, we design a novel visual alignment method using the visual embedding highway (VEH) module introduced in Section 2.1, which is utilized for the image and video decoders when there is a visual modality in the input. During training, we update the parameters of the visual decoders and the visual controller module, while keeping all other network parameters fixed, including the input and output projection layers and the LLM. In this way, the model's ability to conduct tasks in other modalities is not influenced by the visual alignment process.
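The parameter selection in the two steps can be summarized with a small helper; the parameter-name keywords below are assumptions about how the modules might be named, not the actual code.

```python
import torch.nn as nn

def set_trainable(model, step):
    """Sketch of the per-step parameter selection: the textual-alignment step
    trains the input/output projections, vocabulary embedding, and LoRA
    adapters (decoders frozen); the visual-alignment step trains only the
    visual decoders and the VEH controller. Keyword names are illustrative."""
    keywords = {
        "textual": ("input_proj", "output_proj", "embed_tokens", "lora_"),
        "visual": ("decoder_unet", "vis_ctrl"),
    }[step]
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in keywords)
    return [n for n, p in model.named_parameters() if p.requires_grad]

# Usage on a toy module with matching names:
toy = nn.ModuleDict({"input_proj": nn.Linear(8, 8),
                     "vis_ctrl": nn.Linear(8, 8),
                     "decoder_unet": nn.Linear(8, 8)})
print(set_trainable(toy, "textual"))   # only input_proj.* stay trainable
print(set_trainable(toy, "visual"))    # decoder_unet.* and vis_ctrl.*
```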


Appendix B More Qualitative Results

B.1 Examples of our X-to-X dataset.

To provide an intuitive understanding of the six types of conversations in our curated X-to-X dataset, we visualize conversation samples of the dataset in Figure 9. The design of the dataset focuses on building any-to-any modality connections through various conversation templates.

B.2 Visual comparison with CoDi on cross-modality alignment.

To further examine the visual alignment advantage of X-VILA, we compare it with the state-of-the-art any-to-any model CoDi[31] in Figure 10. We observe that CoDi fails to capture the real semantics and details of the input. Notably, CoDi is unable to perform X-to-X chatting, unlike X-VILA, which is specifically designed for omni-modality chatting while being able to produce superior visually aligned generation results.

B.3 Human-model interaction demonstration.

To conduct a comprehensive assessment of our any-to-any modality LLM's performance, we undertake more testing on X-VILA, meticulously examining different use cases. We present a collection of human-model conversation examples in Figures 11, 12, 13, and 14, showcasing the versatility of X-VILA across diverse tasks. These results demonstrate the effectiveness of X-VILA in addressing the needs of users by offering comprehensive and generative multi-modality capabilities.


Appendix C More Implementation Details

As introduced in Section A, X-VILA training is separated into three phases. (i) In the initial phase, referred to as encoder-LLM-decoder alignment training, the model undergoes 20,000 iterations using an Adam optimizer. The base learning rate is set to $4\times10^{-4}$, and a learning rate warm-up strategy is employed. The batch size for this phase is set to 200. (ii) During the second phase, known as interleaved data pre-training, a batch size of 192 is utilized. The base learning rate is set to $1\times10^{-4}$, and the training is conducted for 10,000 iterations. (iii) The final phase, called cross-modality instruction tuning, involves separate training for textual and visual alignment. For textual alignment, a batch size of 192 is maintained, and the model is trained for 30,000 iterations using a base learning rate of $1\times10^{-4}$. For visual alignment, both the Visual Embedding Highway (VEH) and the modality-specific decoders are trained for 20,000 iterations. The batch size for this step is set to 64, and the learning rate is adjusted to $1\times10^{-6}$. In terms of data amount, our training pipeline is highly efficient compared to many previous vision-language models[15, 20, 17]. We utilize a total of 4 NVIDIA A100 80GB server nodes in the training process.
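For reference, the reported schedule can be collected into a single configuration sketch; only values stated above are included, and anything not stated (e.g., optimizer betas or warm-up length) is deliberately omitted.

```python
# Training schedule restated from the text above; omitted fields are unspecified.
TRAINING_PHASES = {
    "encoder_llm_decoder_alignment": {
        "iterations": 20_000, "optimizer": "Adam", "base_lr": 4e-4,
        "lr_warmup": True, "batch_size": 200},
    "interleaved_pretraining": {
        "iterations": 10_000, "base_lr": 1e-4, "batch_size": 192},
    "x2x_instruction_tuning": {
        "textual_alignment": {"iterations": 30_000, "base_lr": 1e-4, "batch_size": 192},
        "visual_alignment": {"iterations": 20_000, "base_lr": 1e-6, "batch_size": 64},
    },
}
print(TRAINING_PHASES["x2x_instruction_tuning"]["visual_alignment"])
```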

