VideoGPT+ : Integrating Image and Video Encoders for Enhanced Video Understanding (2024)

Muhammad Maaz1, Hanoona Rasheed1, Salman Khan1,2, Fahad Shahbaz Khan1,3
1Mohamed bin Zayed University of AI, UAE
2Australian National University, Australia   3Linköping University, Sweden

Abstract

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering. Further, we develop a 112K video-instruction set using a novel semi-automatic annotation pipeline, which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark, with 4,354 question-answer pairs, evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.

1 Introduction

Existing methods for video understanding often rely solely on either image encoders or video encoders [Maaz2023VideoChatGPT; jin2023chatunivi; st-llm]. Most works focus on image encoders, which encode multiple frames and either fuse the information or concatenate the embeddings before passing them to the LLM. When fusing the information, spatial or temporal pooling is typically used [Maaz2023VideoChatGPT]. Spatial pooling has shown minimal effectiveness in capturing video information, whereas temporal pooling retains some spatial information but lacks explicit temporal context. On the other hand, concatenating embeddings without pooling [jin2023chatunivi; st-llm; zhang2024llavanextvideo] can rapidly increase computational complexity due to the extended context length required by the LLM, limiting the number of frames that can be processed. While this approach provides better spatial representation, the overall context is still limited to a few frames. The limited context results in a poor understanding of the video, especially if a uniform sampling strategy is employed, as it only captures small segments of the video, missing important temporal dynamics.


In order to address these challenges, we propose VideoGPT+, which effectively combines the merits of both image and video encoders (see Fig. 2). By leveraging an image encoder for rich spatial details and a video encoder for global temporal context, our model achieves improved video understanding. To model fine-grained temporal dynamics in VideoGPT+, we use a segment-wise sampling strategy. Unlike the uniform sampling used in existing video LMMs [Maaz2023VideoChatGPT], which may miss important temporal dynamics, our approach divides the video into smaller segments and applies segment-wise sampling. This ensures that the model captures representative information from different segments of the video, enabling a more comprehensive understanding.

To facilitate the integration of image and video features, VideoGPT+ introduces a visual adapter module that combines their complementary benefits. This module performs projection and pooling operations, mapping both image and video features to a common space while reducing computational complexity. By aligning the features in this manner, the model can effectively utilize the combined spatial and temporal information for improved video understanding.

We demonstrate the effectiveness of VideoGPT+ across multiple video-conversation benchmarks, including VCGBench [Maaz2023VideoChatGPT], MVBench [li2023mvbench], and Zero-shot question-answering [Maaz2023VideoChatGPT], where it outperforms previous SoTA approaches (see Fig. 1). Further, we develop VCG+ 112K using a novel semi-automatic annotation pipeline (see Fig. 3), which provides dense video captions along with spatial understanding and reasoning-based question-answer (QA) pairs, further enhancing the model's performance. We also propose VCGBench-Diverse, extending VCGBench [Maaz2023VideoChatGPT] by including videos from 18 different domains to extensively evaluate video-based conversation models in diverse settings (see Fig. 4).

Our work has three main contributions:

  • We present VideoGPT+, the first video-conversation model that benefits from a dual-encoding scheme based on both image and video features. These complementary sets of features offer rich spatiotemporal details for improved video understanding (Sec. 3).

  • Addressing the limitations of the existing VideoInstruct100K dataset [Maaz2023VideoChatGPT], we develop VCG+ 112K with a novel semi-automatic annotation pipeline, offering dense video captions along with spatial understanding and reasoning-based QA pairs, further improving the model performance (Sec. 4).

  • Recognizing the lack of diverse benchmarks for the video-conversation task, we propose VCGBench-Diverse, which provides 4,354 human-annotated QA pairs across 18 video categories to extensively evaluate the performance of video-conversation models (Sec. 5).

2 Related Works

Building on advances in language models, LLMs offer a flexible interface for various multimodal applications. Early efforts in image-based conversation models, such as BLIP-2 [li2023blip], MiniGPT-4 [zhu2023minigpt] and LLaVA [liu2023llava; liu2023improvedllava], project image features into the language space through a learnable module and perform instruction tuning for visual conversation capabilities. Other efforts extend these models to visual grounding tasks [kosmos-2; hanoona2023GLaMM; you2023ferret], exploring the potential of LLMs in complex vision tasks.

Video Conversation Models: Initial works like Video-ChatGPT [Maaz2023VideoChatGPT] and Video-LLaMA [damonlpsg2023videollama] extend image-based models to the video domain by introducing components to encode temporal features, where frame-level visual features are fed to the LLM. However, this is computationally expensive and quickly fills the LLM's context window. To address this issue, Video-ChatGPT [Maaz2023VideoChatGPT] employs spatial and temporal pooling. LLaMA-VID [llamavid] proposes representing a single image with two tokens, context and content. IG-VLM [kim2024image] treats a video as a grid of images, while LITA [huang2024lita] employs slow-fast token pooling to reduce the number of visual features. Chat-UniVi [jin2023chatunivi] uses clustering in both spatial and temporal dimensions to merge tokens, and VideoChat [2023videochat] uses Q-Former [li2023blip] to learn a fixed number of queries by cross-attending to the visual features. MobileVLM [chu2023mobilevlm; chu2024mobilevlm] utilizes a lightweight CNN to reduce the spatial dimensions. Other notable methods include [bt_adapter; video-llava; munasinghe2023PGVideoLLaVA; song2023moviechat; huang2023vtimellm].

Alternatively, methods such as VideoChat2 [li2023mvbench] use pretrained video encoders. Although video encoders provide temporal context, they are limited by computational constraints, operating on limited frames at lower resolutions, which restricts both temporal context and spatial understanding. Our VideoGPT+ model addresses these issues by using segment-wise sampling and effectively combining the merits of image and video encoders to capture rich spatial and temporal details (see Fig. 2).

Video Instruction Tuning Datasets: VideoChat [2023videochat] builds a video-instruction tuning dataset consisting of 7K instructions using videos from WebVid-10M [bain2021frozen]. Video-ChatGPT [Maaz2023VideoChatGPT] introduces a semi-automatic annotation pipeline to generate VideoInstruct100K using videos from ActivityNet [caba2015activitynet]. VideoChat2 [li2023mvbench] combines multiple existing image and video datasets to develop a 1.9M joint image-video instruction tuning dataset. In our experiments, we use VideoInstruct100K and a subset of the dataset from VideoChat2. Additionally, addressing the limitations of the VideoInstruct100K dataset [Maaz2023VideoChatGPT], we develop VCG+ 112K through a novel semi-automatic annotation pipeline, which provides dense video captions along with 112K QA pairs targeting reasoning, spatial and temporal understanding, further improving the model's understanding of video content (see Fig. 3).

Video Conversation Benchmarks: Video-ChatGPT [Maaz2023VideoChatGPT] introduces VCGBench and Zero-shot QA benchmarks, where VCGBench includes 500 videos with 3,000 QA pairs, evaluated using GPT-3.5 across various metrics. Despite its comprehensive evaluation, it only contains videos from the ActivityNet dataset. The Zero-shot evaluation covers MSVD-QA [msvd], MSR-VTT-QA [msvd], TGIF-QA [TGIF], and ActivityNet-QA [caba2015activitynet]. MVBench [li2023mvbench] consists of 4K QA pairs evaluating 20 temporal tasks, though it mostly includes short videos averaging 5-40 seconds. Considering the limitations of existing benchmarks, which often lack focus on generalization and diversity, we propose VCGBench-Diverse, featuring 4,354 QA pairs from 877 videos across 18 domains (see Fig. 4).

3 Method


For effective video understanding, combining detailed spatial information with explicit temporal context is crucial. To achieve this, we propose VideoGPT+, which features a dual encoder design that leverages the complementary strengths of an image encoder and a video encoder.

Overall Architecture: The overall architecture consists of (i) segment-wise sampling, (ii) a dual visual encoder, (iii) vision-language adapters that project vision features to the language domain and (iv) a large language model. Frames selected through a segment-wise sampling strategy are encoded through a dual encoder consisting of an image and a video encoder. Both sets of features are projected to the language space using vision-language (V-L) adapters, and the resulting tokens are pooled through adaptive token pooling and concatenated before being fed to the LLM (see Fig. 2).

Segment-wise Sampling: To extract fine-grained temporal cues, we use a segment-wise frame sampling strategy. Given an input video $\mathbf{V} \in \mathbb{R}^{T \times H \times W \times C}$, we divide it into $K$ segments, where each segment consists of $n = \frac{T}{K}$ frames. Thus, the video can be represented as $\mathbf{V} = [\mathbf{V}_k]_{k=1}^{K}$. Each segment $\mathbf{V}_k \in \mathbb{R}^{n \times H \times W \times C}$ can be described as a sequence of frames, $\mathbf{V}_k = [\mathbf{X}_{k,j}]_{j=1}^{n}$. The video segments are downsampled to a lower resolution of $n \times h \times w \times c$ for video encoding.

Compared to uniform sampling, segment-wise sampling better aligns with our dual encoder design. Video encoders often face computational constraints, limiting them to processing only sparse frames. Uniform sampling increases the self-attention computation complexity as it requires attending to features of all frames. Additionally, video encoders are typically trained with sparse frames, and providing more frames can hinder their ability to accurately capture temporal information. In contrast, the segment-wise sampling strategy divides the video into smaller, manageable segments, enabling the video encoder to efficiently capture rich temporal cues within each segment.
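
To make the sampling concrete, the following is a minimal sketch (not the released implementation) of segment-wise frame selection: the video's frame indices are split into K contiguous segments and n frames are picked evenly within each segment, so every part of the video is represented.

```python
import numpy as np

def segment_wise_sample(num_video_frames: int, num_segments: int, frames_per_segment: int) -> list[int]:
    """Split the video into `num_segments` contiguous segments and pick
    `frames_per_segment` evenly spaced frame indices inside each segment."""
    # Boundaries of the K contiguous segments over [0, num_video_frames).
    boundaries = np.linspace(0, num_video_frames, num_segments + 1, dtype=int)
    indices = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        # Evenly spaced positions within this segment (clipped for very short segments).
        within = np.linspace(start, max(start, end - 1), frames_per_segment, dtype=int)
        indices.extend(within.tolist())
    return indices

# Example: a 320-frame video, K=4 segments, n=4 frames per segment -> 16 frames total.
print(segment_wise_sample(320, num_segments=4, frames_per_segment=4))
```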

Dual Vision Encoder: Our design leverages the complementary strengths of an image encoder that captures detailed spatial features and a video encoder that provides explicit temporal context. The image encoder $g$ processes $T$ frames, $g(\mathbf{X}) \in \mathbb{R}^{T \times H_g \times W_g \times D_g}$, producing local features that provide frame-level context. Meanwhile, the video encoder $h$ operates on low-resolution video segments $\mathbf{V}_k$, yielding global features that provide segment-wise context, $h(\mathbf{V}_k) \in \mathbb{R}^{n \times h_h \times w_h \times D_h}$.

The primary goal of VideoGPT+ is to leverage the capabilities of a pre-trained LLM alongside visual modalities from both a pre-trained image encoder and a pre-trained video encoder. Specifically, we utilize the pre-trained CLIP ViT-L/14 model ($336 \times 336$) [clip] as the image encoder, and InternVideo-v2 ($224 \times 224$) [wang2024internvideo2] as the video encoder. These models are selected for their robust performance and their ability to complement each other in capturing both spatial and temporal information. Both encoders are pre-trained on large-scale datasets in a multimodal setting using contrastive loss, facilitating their integration within our architecture.

Visual Adapter: The output embeddings from the second-last layer of both the image and video encoders are passed through separate V-L projection layers, $W_g$ and $W_h$, respectively. These multi-layer perceptrons (MLPs) project the visual features into the language space. The projection layers are trainable, while the visual encoders remain frozen, preserving their rich, pre-trained representations. The projected embeddings are reshaped back into their grid forms and subjected to $2 \times 2$ adaptive token pooling, which operates on the spatial dimensions of the local and global features. This pooling reduces the token length by a factor of 4, thereby allowing a larger visual context to fit within the same LLM context window. The pooled embeddings from the local features form $\mathbf{E}^{img} \in \mathbb{R}^{T \times h_g \times w_g \times D_t}$, while the pooled embeddings from the global features of each segment form $\mathbf{E}^{vid} \in \mathbb{R}^{n \times h_h \times w_h \times D_t}$.
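
As a rough sketch of the adapter described above (the hidden sizes below are illustrative assumptions, not the released configuration), the projection is an MLP followed by 2x2 adaptive average pooling over the spatial grid:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLAdapter(nn.Module):
    """Projects frozen encoder features to the LLM embedding space, then
    halves each spatial side with 2x2 adaptive average pooling (4x fewer tokens)."""
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_frames, grid_h, grid_w, vis_dim) from the image or video encoder.
        t, h, w, _ = feats.shape
        x = self.proj(feats)                             # (t, h, w, llm_dim)
        x = x.permute(0, 3, 1, 2)                        # (t, llm_dim, h, w)
        x = F.adaptive_avg_pool2d(x, (h // 2, w // 2))   # 2x2 pooling over the spatial grid
        return x.flatten(2).transpose(1, 2)              # (t, (h//2)*(w//2), llm_dim) token sequence

# Example: CLIP ViT-L/14 at 336px gives a 24x24 grid of 1024-dim features per frame.
adapter = VLAdapter(vis_dim=1024, llm_dim=3072)  # 3072 is an assumed LLM hidden size
tokens = adapter(torch.randn(16, 24, 24, 1024))
print(tokens.shape)  # torch.Size([16, 144, 3072])
```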

Large Language Model: We obtain the final representation by concatenating the embeddings $\mathbf{E}^{img}$ with the $K$ segment-wise embeddings $\mathbf{E}^{vid}$, such that we have a detailed spatial representation across all segments followed by their global temporal context. We then concatenate the text embeddings $\mathbf{E}^{text} \in \mathbb{R}^{L \times D_t}$ of the user text query with the visual embeddings,

$\mathbf{E} = [\mathbf{E}^{img}, \mathbf{E}_1^{vid}, \ldots, \mathbf{E}_K^{vid}, \mathbf{E}^{text}]$.   (1)

This integration ensures that the LLM receives a sequence of embeddings that includes detailed spatial features from the image encoder and comprehensive temporal context from the video encoder, allowing for robust video understanding. The LLM is fine-tuned using LoRA [hu2021lora] in an auto-regressive manner with a next-token prediction loss. Refer to Fig. 2 for a detailed illustration.
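
A minimal sketch of the input assembly in Eq. (1), assuming the image tokens, per-segment video tokens, and text tokens have already been produced by the adapters and the LLM's embedding layer (all shapes below are illustrative):

```python
import torch

def build_llm_input(img_tokens: torch.Tensor,
                    vid_tokens_per_segment: list[torch.Tensor],
                    text_tokens: torch.Tensor) -> torch.Tensor:
    """Ordering from Eq. (1): all image tokens first, then the K
    segment-wise video tokens, then the user query tokens."""
    return torch.cat([img_tokens, *vid_tokens_per_segment, text_tokens], dim=0)

# Illustrative sizes: 16 frames x 144 image tokens, 4 segments x 256 video tokens, 32 text tokens.
E_img = torch.randn(16 * 144, 3072)
E_vid = [torch.randn(256, 3072) for _ in range(4)]
E_text = torch.randn(32, 3072)
print(build_llm_input(E_img, E_vid, E_text).shape)  # torch.Size([3360, 3072])
```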

4 Dataset


Video-ChatGPT [Maaz2023VideoChatGPT] introduces the VideoInstruct100K dataset, which employs a semi-automatic annotation pipeline to generate 75K instruction-tuning QA pairs. To address the limitations of this annotation process, we present the VCG+ 112K dataset, developed through an improved annotation pipeline. Our approach improves the accuracy and quality of instruction-tuning pairs through better keyframe extraction, SoTA large multimodal models (LMMs) for detailed descriptions, and a refined instruction generation strategy.

Keyframe Extraction: VideoInstruct100K uses a fixed number of video keyframes, regardless of video length or dynamics, to generate frame-level dense captions. This often results in both insufficient and redundant information. We address this by first extracting scenes from videos [PySceneDetect] and then selecting one keyframe per scene. Consequently, we obtain detailed information for videos with rich content and reduce redundancy for videos with less content. This provides better visual context by extracting more stable keyframes, thus offering a more accurate video representation.
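
As a hedged illustration of this step (the exact detector and thresholds are not specified here), scene boundaries can be obtained with the PySceneDetect library and one frame taken from the middle of each detected scene:

```python
from scenedetect import detect, ContentDetector  # pip install scenedetect[opencv]

def keyframe_indices(video_path: str) -> list[int]:
    """Detect scene cuts and return one representative frame index per scene
    (here the middle frame of each scene; an illustrative choice)."""
    scenes = detect(video_path, ContentDetector())  # list of (start, end) FrameTimecodes
    return [(start.get_frames() + end.get_frames()) // 2 for start, end in scenes]

print(keyframe_indices("example_video.mp4"))  # "example_video.mp4" is a placeholder path
```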

Frame-Level Descriptions: After extracting keyframes, we use a SoTA image LMM, LLaVA-v1.6 [liu2024llavanext], to generate dense descriptions for each keyframe. These descriptions encompass comprehensive visual details, including spatial attributes, scene context, and object characteristics, which are often absent in concise ground-truth captions. While ground-truth captions are precise, they lack the granularity to capture intricate visual and spatial information. To address this, we augment the ground-truth captions with detailed but noisy information from the frame-level descriptions, thus enhancing the quality and accuracy of the subsequent video descriptions.

Detailed Video Descriptions: VideoInstruct100K [Maaz2023VideoChatGPT] prompts GPT-3.5 directly with frame-level descriptions and concise ground-truth captions to generate QA pairs, imposing a significant cognitive load on the model to verify frame-level descriptions against the ground truth. We improve this process by first creating a coherent and detailed video description. We prompt GPT-4 to integrate the detailed frame-level descriptions with the ground-truth captions by comparing the information and removing any inconsistencies. The resulting detailed descriptions include a timeline of events, actions, object attributes, and scene settings, providing a thorough representation of the video content. This structured input simplifies the task for the LLM, thereby enhancing the quality of the generated QA pairs.

Improved Instruction Tuning Data: Using the ground-truth captions and detailed video descriptions, we generate two types of high-quality QA pairs using GPT-3.5: descriptive and concise. For descriptive instruction pairs, we focus on three categories: (i) dense captioning, which provides descriptions of the video covering the entire sequence of events and visual details; (ii) detailed temporal information, which addresses the sequence of events and their dependency to learn temporal relationships; and (iii) generic question answering, which involves in-depth questions about different actions, their consequences, and other detailed aspects of the video. For concise instruction pairs, we target (i) spatial reasoning, focusing on understanding and describing spatial details such as scene settings, number of objects, attire, and locations; (ii) reasoning of events, covering the causal relationships between events; and (iii) short temporal questions, addressing specific moments or sequences, such as what happened at the beginning or end.
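
A simplified sketch of the two-stage generation flow described above (the actual prompts are given in Figs. 8-9; the prompt text below is a placeholder, not the one used to build VCG+ 112K):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()

def detailed_video_description(gt_caption: str, frame_captions: list[str]) -> str:
    """Stage 1: merge the concise ground-truth caption with noisy frame-level
    captions into one consistent, detailed video description (GPT-4)."""
    prompt = (
        "Combine the ground-truth caption and the per-keyframe captions into a single "
        "detailed video description. Discard any frame-level detail that conflicts with "
        f"the ground truth.\n\nGround truth: {gt_caption}\n\nKeyframes:\n" + "\n".join(frame_captions)
    )
    resp = client.chat.completions.create(model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def qa_pairs(description: str) -> str:
    """Stage 2: turn the detailed description into descriptive and concise QA pairs (GPT-3.5)."""
    prompt = ("From the following video description, write one dense-captioning QA pair and "
              f"one short spatial-reasoning QA pair as JSON.\n\n{description}")
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```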

5 Proposed Benchmark


Recognizing the limited diversity in existing video conversation benchmarks, we introduce VCGBench-Diverse to comprehensively evaluate the generalization ability of video LMMs. While VCGBench [Maaz2023VideoChatGPT] provides an extensive evaluation protocol, it is limited to videos from the ActivityNet-200 [caba2015activitynet] dataset. Our benchmark comprises a total of 877 videos spanning 18 broad video categories and 4,354 QA pairs, ensuring a robust evaluation framework. The detailed breakdown of VCGBench-Diverse is illustrated in Fig. 4, showcasing the distribution of videos across content domains, video capturing methods, and reasoning complexities.

We collect videos from 18 distinct domains, including lifestyle, how-to, science and technology, news, travel, entertainment, film, sports, comedy, activism, gaming, education, surveillance, pets, cooking, music, automobile, and traffic (see Fig. 4). These categories encompass a broad spectrum of real-world scenarios, ensuring that models are evaluated on a diverse set of challenges. In addition to content diversity, VCGBench-Diverse includes a variety of video capture methods, which ensures a comprehensive assessment of robustness to different filming techniques, camera movements, quality levels and lighting conditions. The benchmark covers five video capture methods: static and controlled settings, dynamic and unpredictable settings, fixed camera perspectives, professional and high-quality videos, and uncontrolled and variable quality. Further, the benchmark evaluates models across six reasoning complexities: sequential understanding, complex action and predictive reasoning, contextual and world knowledge reasoning, causal reasoning, narrative and emotional reasoning, and analytical and critical reasoning, all of which are crucial for understanding diverse video content.

The videos in VCGBench-Diverse are sourced from HDVILA [xue2022hdvila], MPII [andriluka14cvpr], YouCook2 [zhou2018towards], UCF Crime [ucfcrime], and SUTD-TrafficQA [xu2021sutd]. The video durations range from 29 to 471 seconds, with an average of 217 seconds. Human annotators are tasked with writing detailed descriptions based on their understanding of both the audio and visual elements of the videos. This comprehensive annotation process involves a set of annotators who are each provided with an initial set of ten videos. These annotations undergo a meta-review stage where feedback is provided and necessary corrections are made to meet the required standards. Following this, annotators receive additional batches, with random samples selected for quality checks by the meta-reviewer. The final human annotations are used to generate QA pairs using GPT-3.5, based on the prompts detailed in Fig. 10.

Following VCGBench [Maaz2023VideoChatGPT], the evaluation is computed over five different aspects: (i) correctness of information, (ii) detail orientation, (iii) contextual understanding, (iv) temporal understanding and (v) consistency. Additionally, VCGBench-Diverse provides a breakdown of performance across three key aspects: (i) dense video captioning, which assesses the ability to generate detailed and accurate descriptions of the video content, (ii) spatial understanding, which evaluates the capability to understand and describe the spatial relationships and settings within the video, and (iii) reasoning, which tests the adeptness in inferring and explaining causal relationships and actions within the video.

6 Experiments

We perform a quantitative evaluation of VideoGPT+ on four standard benchmarks: (i) VCGBench [Maaz2023VideoChatGPT], (ii) VCGBench-Diverse, (iii) MVBench [li2023mvbench] and (iv) Zero-shot QA.

Implementation Details: We use CLIP ViT-L/14 [clip] as our image encoder and the InternVideo-v2 [wang2024internvideo2] stage-2 1B model as our video encoder, in conjunction with a Phi-3-Mini-3.8B [phi3mini4k] LLM with a 4K context window. The image encoder operates at $336 \times 336$ resolution, while the video encoder operates at $224 \times 224$ resolution. Our training consists of two pretraining stages and one instruction-tuning stage. In the pretraining stages, we train with only the image encoder and only the video encoder, respectively, on the CC-595K dataset [liu2023improved], with only the visual adapters being learned while the rest of the model is kept frozen. During the instruction-tuning stage, we use LoRA [hu2022lora] with $r=64$ for the LLM, while the visual adapters are fully trained and the vision encoders are kept frozen. The learning rate is set to 1e-3 during pretraining and 2e-4 during instruction tuning.
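
A rough sketch of the LoRA setup described above, using Hugging Face PEFT (the target modules and other arguments are illustrative assumptions, not the released training configuration):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base LLM (Phi-3-Mini in the paper; any causal LM works for this sketch).
llm = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_config = LoraConfig(
    r=64,                                    # rank stated in the paper
    lora_alpha=128,                          # assumed; not specified in this section
    target_modules=["qkv_proj", "o_proj"],   # assumed attention projections
    lora_dropout=0.05,                       # assumed
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # only LoRA weights (plus the external V-L adapters) are trained
```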

For experiments on VCGBench, VCGBench-Diverse and Zero-shot QA, we sample 16 frames per video, while for MVBench, which consists of relatively shorter videos, we sample 8 frames. We keep the same sampling strategy during inference. For VCGBench and VCGBench-Diverse, the model is trained on VideoInstruct100K [Maaz2023VideoChatGPT], VCG+ 112K, conversation and caption data from VideoChat [2023videochat] and the VQA dataset from WebVid [bain2021frozen], which combine to approximately 260K single-turn conversations. For MVBench, the model is trained on Kinetics-710 [kay2017kinetics], Something-Something-v2 [goyal2017something], conversations from VideoChat [2023videochat], CLEVRER [yi2019clevrer], the VQA dataset from WebVid [bain2021frozen] and NExT-QA [xiao2021next], which combine to approximately 330K single-turn conversations. We run all trainings for one epoch. Following previous approaches [Maaz2023VideoChatGPT; jin2023chatunivi; st-llm], we employ GPT-3.5-Turbo-0613 for VCGBench and Zero-shot QA evaluation. However, for our proposed VCGBench-Diverse, we employ the latest GPT-3.5-Turbo-0125 for evaluation.

Table 1: VCGBench results.

Method | CI | DO | CU | TU | CO | Avg.
Video-ChatGPT [Maaz2023VideoChatGPT] | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | 2.38
BT-Adapter [bt_adapter] | 2.68 | 2.69 | 3.27 | 2.34 | 2.46 | 2.69
VTimeLLM [huang2023vtimellm] | 2.78 | 3.10 | 3.40 | 2.49 | 2.47 | 2.85
Chat-UniVi [jin2023chatunivi] | 2.89 | 2.91 | 3.46 | 2.89 | 2.81 | 2.99
LLaMA-VID [llamavid] | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 | 2.89
Video-LLaVA [video-llava] | 2.84 | 2.86 | 3.44 | 2.46 | 2.57 | 2.81
VideoChat2 [li2023mvbench] | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | 2.98
IG-VLM [kim2024image] | 3.11 | 2.78 | 3.51 | 2.44 | 3.29 | 3.03
VideoGPT+ (ours) | 3.27 | 3.18 | 3.74 | 2.83 | 3.39 | 3.28

VCGBench: The benchmark consists of approximately 3,000 QA pairs generated from 500 human-annotated videos from ActivityNet [caba2015activitynet]. The benchmark evaluates responses on five different aspects: (i) Correctness of Information (CI), which assesses the correctness of the response to ensure it aligns with the video content; (ii) Detail Orientation (DO), which evaluates the depth of the response; (iii) Contextual Understanding (CU), which assesses whether the response aligns with the overall context of the video; (iv) Temporal Understanding (TU), which assesses the model's ability to identify temporal sequences accurately; and (v) Consistency (CO), which evaluates the consistency of the model's responses to similar questions. Table 1 compares our model with previous SoTA approaches. VideoGPT+ achieves an average score of 3.28, surpassing the previous best method by a margin of 0.25 (5%).
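
For reference, a simplified sketch of the GPT-assisted scoring used by these benchmarks (the actual judge prompts follow Video-ChatGPT [Maaz2023VideoChatGPT]; the prompt below is a placeholder, and the model name matches the one stated in the implementation details):

```python
import json
from openai import OpenAI

client = OpenAI()

def judge_response(question: str, reference: str, prediction: str,
                   aspect: str = "Correctness of Information") -> dict:
    """Ask the GPT judge to rate a predicted answer against the reference answer."""
    prompt = (
        f"You are evaluating a video QA answer for {aspect}.\n"
        f"Question: {question}\nReference answer: {reference}\nPredicted answer: {prediction}\n"
        'Return JSON of the form {"score": <1-5>, "reason": "<one sentence>"}.'
    )
    resp = client.chat.completions.create(model="gpt-3.5-turbo-0613",
                                          messages=[{"role": "user", "content": prompt}])
    return json.loads(resp.choices[0].message.content)
```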

VCGBench-Diverse: We provide a quantitative comparison of VideoGPT+ against previous SoTA approaches on VCGBench-Diverse, which contains 4,354 QA pairs from 877 videos. Following [Maaz2023VideoChatGPT], we evaluate Correctness of Information (CI), Detail Orientation (DO), Contextual Understanding (CU), Temporal Understanding (TU), and Consistency (CO). Additionally, we provide results for dense captioning, spatial understanding, and visual reasoning abilities. The results are presented in Table 2. VideoGPT+ achieves an average score of 2.47, surpassing all previous methods. Further, VideoGPT+ achieves scores of 1.38, 2.80, and 3.63 on dense captioning, spatial understanding, and visual reasoning, respectively. Notably, VideoGPT+ achieves improvements in spatial and temporal understanding, surpassing the previous best models by 0.37 (7.4%) and 0.23 (4.6%), respectively. This is attributed to the dual encoder architecture, where the high-resolution image encoder enhances spatial understanding and the video encoder improves temporal accuracy.

Table 2: VCGBench-Diverse results.

Method | CI | DO | CU | TU | CO | Avg. | Caption | Spatial | Reasoning
Video-ChatGPT (ACL 2024) [Maaz2023VideoChatGPT] | 2.07 | 2.42 | 2.46 | 1.39 | 2.06 | 2.08 | 0.89 | 2.25 | 3.60
BT-Adapter (CVPR 2024) [bt_adapter] | 2.20 | 2.62 | 2.59 | 1.29 | 2.27 | 2.19 | 1.03 | 2.35 | 3.62
VTimeLLM (CVPR 2024) [huang2023vtimellm] | 2.16 | 2.41 | 2.48 | 1.46 | 2.35 | 2.17 | 1.13 | 2.29 | 3.45
Chat-UniVi (CVPR 2024) [jin2023chatunivi] | 2.29 | 2.56 | 2.66 | 1.56 | 2.36 | 2.29 | 1.33 | 2.36 | 3.59
VideoChat2 (CVPR 2024) [li2023mvbench] | 2.13 | 2.42 | 2.51 | 1.66 | 2.27 | 2.20 | 1.26 | 2.43 | 3.13
VideoGPT+ (ours) | 2.46 | 2.73 | 2.81 | 1.78 | 2.59 | 2.47 | 1.38 | 2.80 | 3.63

Table 3: MVBench results across 20 temporal understanding tasks.

Model | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI | Avg.
Random | 25.0 | 25.0 | 33.3 | 25.0 | 25.0 | 33.3 | 25.0 | 33.3 | 25.0 | 25.0 | 25.0 | 33.3 | 25.0 | 33.3 | 33.3 | 25.0 | 33.3 | 25.0 | 20.0 | 30.9 | 27.3
GPT-4V [2023GPT4VisionSC] | 55.5 | 63.5 | 72.0 | 46.5 | 73.5 | 18.5 | 59.0 | 29.5 | 12.0 | 40.5 | 83.5 | 39.0 | 12.0 | 22.5 | 45.0 | 47.5 | 52.0 | 31.0 | 59.0 | 11.0 | 43.5
Otter-V [li2023otter] | 23.0 | 23.0 | 27.5 | 27.0 | 29.5 | 53.0 | 28.0 | 33.0 | 24.5 | 23.5 | 27.5 | 26.0 | 28.5 | 18.0 | 38.5 | 22.0 | 22.0 | 23.5 | 19.0 | 19.5 | 26.8
mPLUG-Owl-V [ye2023mplug] | 22.0 | 28.0 | 34.0 | 29.0 | 29.0 | 40.5 | 27.0 | 31.5 | 27.0 | 23.0 | 29.0 | 31.5 | 27.0 | 40.0 | 44.0 | 24.0 | 31.0 | 26.0 | 20.5 | 29.5 | 29.7
Video-ChatGPT [Maaz2023VideoChatGPT] | 23.5 | 26.0 | 62.0 | 22.5 | 26.5 | 54.0 | 28.0 | 40.0 | 23.0 | 20.0 | 31.0 | 30.5 | 25.5 | 39.5 | 48.5 | 29.0 | 33.0 | 29.5 | 26.0 | 35.5 | 32.7
Video-LLaMA [damonlpsg2023videollama] | 27.5 | 25.5 | 51.0 | 29.0 | 39.0 | 48.0 | 40.5 | 38.0 | 22.5 | 22.5 | 43.0 | 34.0 | 22.5 | 32.5 | 45.5 | 32.5 | 40.0 | 30.0 | 21.0 | 37.0 | 34.1
VideoChat [2023videochat] | 33.5 | 26.5 | 56.0 | 33.5 | 40.5 | 53.0 | 40.5 | 30.0 | 25.5 | 27.0 | 48.5 | 35.0 | 20.5 | 42.5 | 46.0 | 26.5 | 41.0 | 23.5 | 23.5 | 36.0 | 35.5
VideoChat2 [li2023mvbench] | 66.0 | 47.5 | 83.5 | 49.5 | 60.0 | 58.0 | 71.5 | 42.5 | 23.0 | 23.0 | 88.5 | 39.0 | 42.0 | 58.5 | 44.0 | 49.0 | 36.5 | 35.0 | 40.5 | 65.5 | 51.1
VideoGPT+ (ours) | 69.0 | 60.0 | 83.0 | 48.5 | 66.5 | 85.5 | 75.5 | 36.0 | 44.0 | 34.0 | 89.5 | 39.5 | 71.0 | 90.5 | 45.0 | 53.0 | 50.0 | 29.5 | 44.0 | 60.0 | 58.7

MVBench: We evaluate VideoGPT+ on MVBench [li2023mvbench], which provides 4,000 QA pairs from 11 video datasets covering a broad spectrum of scenes, ranging from first-person to third-person and from indoor to outdoor environments. The tasks are categorized into 20 fine-grained temporal understanding tasks. The results presented in Table 3 compare VideoGPT+ with previous methods, indicating an overall improvement of 7.6% compared to the previous best, VideoChat2. Specifically, VideoGPT+ achieves SoTA results in 14 out of 20 tasks and comes second in 4 out of 20 tasks, obtaining an average score of 58.7% across the 20 tasks. Additionally, VideoGPT+ shows significant improvements in Action Prediction (AP) (+12.5%), Object Existence (OE) (+27.5%), Moving Direction (MD) (+17%), Moving Count (MC) (+29%) and Moving Attributes (MA) (+32%), indicating the rich spatial information and temporal context captured by our model.

Zero-shot Question-Answering: We provide a quantitative comparison of our method on the zero-shot QA task across four open-ended QA datasets: MSVD-QA [msvd], MSRVTT-QA [msvd], TGIF-QA [TGIF], and ActivityNet-QA [caba2015activitynet]. The results presented in Table 4 show that VideoGPT+ achieves superior performance compared to previous methods, indicating its ability to adapt effectively to unseen videos and generate accurate, contextually relevant responses in challenging settings.

Table 4: Zero-shot question-answering results (Accuracy / Score).

Model | MSVD-QA | MSRVTT-QA | TGIF-QA | ActivityNet-QA
FrozenBiLM [yang2022frozenbilm] | 32.2 / - | 16.8 / - | 41.0 / - | 24.7 / -
VideoChat [2023videochat] | 56.3 / 2.8 | 45.0 / 2.5 | 34.4 / 2.3 | 26.5 / 2.2
LLaMA Adapter [llama_adapter] | 54.9 / 3.1 | 43.8 / 2.7 | - / - | 34.2 / 2.7
Video-LLaMA [damonlpsg2023videollama] | 51.6 / 2.5 | 29.6 / 1.8 | - / - | 12.4 / 1.1
Video-ChatGPT [Maaz2023VideoChatGPT] | 64.9 / 3.3 | 49.3 / 2.8 | 51.4 / 3.0 | 35.2 / 2.8
Chat-UniVi [jin2023chatunivi] | 65.0 / 3.6 | 54.6 / 3.1 | 60.3 / 3.4 | 45.8 / 3.2
LLaMA-VID [llamavid] | 70.0 / 3.7 | 58.9 / 3.3 | - / - | 47.5 / 3.3
Video-LLaVA [video-llava] | 70.7 / 3.9 | 59.2 / 3.5 | 70.0 / 4.0 | 45.3 / 3.3
VideoChat2 [li2023mvbench] | 70.0 / 3.9 | 54.1 / 3.3 | - / - | 49.1 / 3.3
VideoGPT+ (ours) | 72.4 / 3.9 | 60.6 / 3.6 | 74.6 / 4.1 | 50.6 / 3.6

Table 5: Ablations on VCGBench covering the vision encoder type, image pooling, video pooling, and the VCG+ 112K dataset.

Metric | Encoder: Image | Encoder: Video | Encoder: Dual | Image Pooling: CNN | Image Pooling: 4x4 | Image Pooling: 2x2 | Video Pooling: Time | Video Pooling: Space | w/o VCG+ 112K | w/ VCG+ 112K
Correctness (CI) | 3.14 | 3.22 | 3.27 | 3.24 | 3.24 | 3.27 | 3.21 | 3.27 | 3.20 | 3.27
Detail (DO) | 3.09 | 3.10 | 3.18 | 3.13 | 3.18 | 3.18 | 3.13 | 3.18 | 3.08 | 3.18
Context (CU) | 3.68 | 3.70 | 3.74 | 3.70 | 3.73 | 3.74 | 3.70 | 3.74 | 3.66 | 3.74
Temporal (TU) | 2.69 | 2.70 | 2.83 | 2.74 | 2.73 | 2.83 | 2.72 | 2.83 | 2.66 | 2.83
Consistency (CO) | 3.26 | 3.31 | 3.39 | 3.41 | 3.39 | 3.39 | 3.36 | 3.39 | 3.28 | 3.39
Average | 3.17 | 3.20 | 3.28 | 3.25 | 3.25 | 3.28 | 3.23 | 3.28 | 3.17 | 3.28

Vision Encoder Type: We ablate the dual visual encoder design of VideoGPT+ on VCGBench, with results presented in Table 5. We conduct three experiments: using only the image encoder, only the video encoder, and both encoders. The image encoder alone achieves a score of 3.17, while the video encoder alone achieves a better score of 3.20, indicating the benefits of video-based pretraining. The dual encoder design, combining both spatial and temporal information, achieves the highest score of 3.28, demonstrating enhanced performance on video-conversation tasks.

Pooling Strategy: We ablate different pooling strategies for the image and video encoders in Table 5. The image encoder outputs a $24 \times 24$ feature map from a $336 \times 336$ input. We compare two downsampling methods: a learnable lightweight CNN (LDPv2 from [chu2024mobilevlm]) and a non-learnable adaptive average pooling with a $2 \times 2$ kernel. Results indicate that adaptive pooling performs better than the CNN. A $4 \times 4$ adaptive pooling was also tested but showed inferior performance.

Similarly, we ablate the pooling choice for the video encoder, which takes an input of size $T \times 224 \times 224 \times C$ and outputs a feature map of size $T \times 16 \times 16 \times d$. We compare two pooling strategies: time pooling across the temporal dimension, which reduces the feature map to $1 \times 16 \times 16 \times d$, and space pooling across the spatial dimensions with a $2 \times 2$ kernel. Table 5 shows that space pooling effectively preserves temporal information and yields better results.
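
The two video-pooling variants compared above can be summarized in a few lines of PyTorch; this is an illustrative sketch of the operations (with an assumed feature dimension), not the released code:

```python
import torch
import torch.nn.functional as F

feats = torch.randn(8, 16, 16, 1024)  # (T, 16, 16, d) video-encoder features for one segment

# Time pooling: average over the temporal dimension -> (1, 16, 16, d), losing per-frame detail.
time_pooled = feats.mean(dim=0, keepdim=True)

# Space pooling: 2x2 average pooling over the spatial grid -> (T, 8, 8, d), keeping all T frames.
space_pooled = F.avg_pool2d(feats.permute(0, 3, 1, 2), kernel_size=2).permute(0, 2, 3, 1)

print(time_pooled.shape, space_pooled.shape)  # torch.Size([1, 16, 16, 1024]) torch.Size([8, 8, 8, 1024])
```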

Table 6: VCGBench results with different LLMs.

LLM | CI | DO | CU | TU | CO | Avg.
Phi-3-Mini-3.8B | 3.27 | 3.18 | 3.74 | 2.83 | 3.39 | 3.28
Vicuna-7B | 3.22 | 3.14 | 3.69 | 2.65 | 3.46 | 3.23
Vicuna-13B | 3.30 | 3.20 | 3.75 | 2.77 | 3.48 | 3.30
LLaMA-3-8B | 3.29 | 3.21 | 3.73 | 2.86 | 3.38 | 3.29

VCG+ 112K: To demonstrate the effectiveness of VCG+ 112K, we train VideoGPT+ with and without it. As shown in Table 5, VCG+ 112K improves performance, particularly in detail orientation (DO) and temporal understanding (TU). This improvement can be attributed to our novel semi-automatic annotation pipeline and the enhanced instruction tuning data, which focuses on generating both detailed and concise instruction pairs. Refer to Fig. 3 for a qualitative visualization of the data.

LLM Type: We train VideoGPT+ with different LLMs, including Vicuna 7B and 13B [vicuna2023] and LLaMA-3 8B [llama3], and show the results in Table 6. We observe slight improvements in VCGBench scores when training with stronger LLMs such as Vicuna 13B and LLaMA-3 8B.

7 Conclusion

In this work, we introduce VideoGPT+, a novel video conversation model that leverages the complementary benefits of image and video encoders to achieve enhanced video understanding. VideoGPT+ demonstrates better performance across multiple video benchmarks, owing to its dual-encoder design, lightweight visual adapters that map image and video features to a common space, and a segment-wise sampling strategy that retains fine-grained temporal information. We also develop VCG+ 112K, a 112K video-instruction set built with a resource-efficient semi-automatic annotation pipeline that delivers further gains. Lastly, we propose VCGBench-Diverse, a diverse benchmark covering 18 video categories, to comprehensively evaluate video LMMs. Despite the reported improvements, video LMMs still face challenges in precise action localization, understanding very long videos, and navigating long paths; these are areas where major improvements can unlock new applications.


8 Qualitative Results

We provide a qualitative comparison of our VideoGPT+ with the previous state-of-the-art approach, VideoChat2 [li2023mvbench], in Fig. 5. The example shows an advertisement video for sunscreen, in which multiple scene changes are present. The video starts with a close-up view of the sunscreen, followed by a woman applying sunscreen on her hand, then applying sunscreen near a beach. The woman is then seen applying sunscreen on her arms, and finally, the video shows the key ingredients of the sunscreen and ends with the cover of the sunscreen.

As shown in Fig. 5, our VideoGPT+ correctly identifies the events present in the video and provides a detailed and accurate description. On the other hand, VideoChat2 struggles to accurately capture all the events. Further, our model generates an advertisement post highlighting one of the unique features of the sunscreen shown in the video, namely that it functions as both sunscreen and moisturizer. Lastly, our VideoGPT+ correctly identifies the SPF value and brand name of the sunscreen, while VideoChat2 struggles to correctly identify the brand name. We present further comparisons in Fig. 7.

9 Additional Implementation Details

In this section, we provide additional implementation details regarding our training setup and compute requirements. All of our experiments are conducted using 8×A100 40GB GPUs. Training for the VCGBench experiments takes around 12 hours to complete, while training for the MVBench experiments finishes in around 10 hours. We use the model trained for the VCGBench task to evaluate on VCGBench-Diverse and the zero-shot question-answering benchmarks. All of our training and evaluation code, pretrained models and datasets will be publicly released.

10 Additional Ablations

Table 7: Ablation of the feature concatenation strategy on VCGBench.

Feature Concatenation | CI | DO | CU | TU | CO | Avg.
Interleaved | 3.25 | 3.17 | 3.72 | 2.78 | 3.39 | 3.26
Sequential | 3.27 | 3.18 | 3.74 | 2.83 | 3.39 | 3.28

Feature concatenation strategy: We conduct an ablation study to determine the optimal order in which image and video features should be input to the LLM. Specifically, we perform two experiments. In the first experiment, image and video features are extracted for each video segment and concatenated in an interleaved manner before being sent to the LLM; that is, the video is divided into segments of equal size, and the image and video features of each segment are concatenated and input to the LLM. In the second experiment, we first place all the image features followed by all the video features. The results, shown in Table 7, indicate that the sequential design, where the image features are placed first followed by the video features, yields better performance. This can be justified by the fact that we use different visual adapters for image and video features, so interleaving the features from both modalities can create a larger distribution shift, hindering the learning process.
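
For clarity, a tiny sketch of the two orderings compared in Table 7 (illustrative tensor sizes only):

```python
import torch

def interleaved(img_segments, vid_segments):
    # [img_1, vid_1, img_2, vid_2, ...]: per-segment image and video tokens are interleaved.
    pairs = [torch.cat([i, v], dim=0) for i, v in zip(img_segments, vid_segments)]
    return torch.cat(pairs, dim=0)

def sequential(img_segments, vid_segments):
    # [img_1, ..., img_K, vid_1, ..., vid_K]: all image tokens first, then all video tokens (the better variant).
    return torch.cat(img_segments + vid_segments, dim=0)

img_segments = [torch.randn(576, 3072) for _ in range(4)]  # image tokens per segment (illustrative sizes)
vid_segments = [torch.randn(256, 3072) for _ in range(4)]  # video tokens per segment
assert interleaved(img_segments, vid_segments).shape == sequential(img_segments, vid_segments).shape
```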


11 GPT Prompts

In this section, we provide the GPT prompts used for the following tasks: (i) Dense video description generation for VCG+ 112K, (ii) Question-answer generation for VCG+ 112K and (iii) Question-answer generation for VCGBench-Diverse.

Dense Video Description Generation for VCG+ 112K: To generate dense video captions, we provide GPT-4 with a concise ground-truth caption of the video and detailed frame-level captions of the keyframes generated by LLaVA-v1.6 [liu2024llavanext]. GPT-4 is then prompted to combine this information into a detailed caption for the entire video. As illustrated in Fig. 8, the prompt includes clear instructions to eliminate any conflicting information, ensuring an accurate and detailed caption.

Question-Answer Generation for VCG+ 112K: After generating detailed video descriptions using GPT-4, we use GPT-3.5 to create question-answer pairs for instruction tuning. Fig. 9 shows the prompt used to generate a detailed summary question-answer pair from the ground-truth caption and the dense description of the video.

Question-Answer Generation for VCGBench-Diverse: We provide prompts used to generate comprehensive question-answer pairs for VCGBench-Diverse. As illustrated in Fig.10, the questions are generated in three categories: temporal, spatial, and reasoning. Similar prompts are used to generate consistency and summary questions, offering an extensive evaluation protocol for VCGBench-Diverse.

References

  • [1] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Association for Computational Linguistics, 2024.
  • [2] Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [3] Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. ST-LLM: Large language models are effective temporal learners. arXiv preprint arXiv:2404.00308, 2024.
  • [4] Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. LLaVA-NeXT: A strong zero-shot video understanding model, April 2024.
  • [5] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
  • [6] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
  • [7] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [8] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023.
  • [9] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In International Conference on Learning Representations, 2024.
  • [10] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023.
  • [11] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [12] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. ArXiv, abs/2306, 2023.
  • [13] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [14] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
  • [15] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  • [16] Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023.
  • [17] Wonkyun Kim, Changin Choi, Wonseok Lee, and Wonjong Rhee. An image grid can be worth a video: Zero-shot video question answering using a VLM. arXiv preprint arXiv:2403.18406, 2024.
  • [18] De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. LITA: Language instructed temporal-localization assistant. arXiv preprint arXiv:2403.19046, 2024.
  • [19] Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
  • [20] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
  • [21] Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, and Ge Li. One for all: Video conversation is feasible without video instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [22] Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, and Fahad Khan. PG-Video-LLaVA: Pixel grounding large video-language models. arXiv preprint arXiv:2311.13435, 2023.
  • [23] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. MovieChat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [24] Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. VTimeLLM: Empower LLM to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [25] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  • [26] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015.
  • [27] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM International Conference on Multimedia, 2017.
  • [28] Yunseok Jang, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Video question answering with spatio-temporal reasoning. International Journal of Computer Vision, 2019.
  • [29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
  • [30] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. InternVideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
  • [31] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • [32] Brandon Castellano. PySceneDetect: Automated video scene detection. https://github.com/Breakthrough/PySceneDetect, 2022.
  • [33] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024.
  • [34] Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • [35] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2014.
  • [36] Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.
  • [37] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6479–6488, 2018.
  • [38] Li Xu, He Huang, and Jun Liu. SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9878–9888, 2021.
  • [39] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
  • [40] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [41] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Machine Learning, 2022.
  • [42] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [43] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017.
  • [44] Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. CLEVRER: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019.
  • [45] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786, 2021.
  • [46] OpenAI. GPT-4V(ision) system card. https://api.semanticscholar.org/CorpusID:263218031, 2023.
  • [47] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
  • [48] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  • [49] Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models. In Advances in Neural Information Processing Systems, 2022.
  • [50] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. In International Conference on Learning Representations, 2024.
  • [51] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna, 2023.
  • [52] Meta AI. Llama 3. https://llama.meta.com/llama3, 2024.