M2Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation

The Hong Kong University of Science and Technology;
Nanjing University; Sun Yat-sen University; Shanghai AI Laboratory; Peking University

Figure 1. Advanced capabilities of our proposed M2Chat in interleaved multimodal chat, multi-round text- and image-to-image generation, and text-to-image generation.

Abstract

While current LLM chatbots like GPT-4V bridge the gap between human instructions and visual representations to enable text-image generation, they still lack efficient alignment methods for high-fidelity performance on multiple downstream tasks. In this paper, we propose M2Chat, a novel unified multimodal LLM framework for generating interleaved text-image conversations across various scenarios. Specifically, we propose an M3Adapter that efficiently integrates granular low-level visual information and high-level semantic features from multi-modality prompts. On top of the well-aligned fused features, the M3Adapter employs a learnable gating strategy to adaptively balance model creativity and consistency across various tasks. Moreover, to further enhance the effectiveness of the M3Adapter while preserving coherent semantic context comprehension, we introduce a two-stage M3FT fine-tuning strategy that optimizes disjoint groups of parameters for image-text alignment and visual-instruction following, respectively. Extensive experiments demonstrate that M2Chat surpasses state-of-the-art counterparts across diverse benchmarks, showcasing its prowess in interleaved generation, storytelling, and multimodal dialogue systems.
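To make the learnable gating strategy concrete, below is a minimal PyTorch sketch of a channel-wise gate that blends high-level semantic features with low-level visual features. The class name `GatedFusion`, the interpolation form, and all dimensions are illustrative assumptions, not the paper's released M3Adapter code.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Channel-wise learnable gate blending two feature streams.

    Hypothetical sketch under assumed shapes; not the authors' implementation.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Learnable pre-sigmoid gate logits, one per channel; zeros give
        # an initial 50/50 blend after the sigmoid.
        self.gate_logits = nn.Parameter(torch.zeros(dim))

    def forward(self, semantic: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logits)  # (dim,), broadcast over batch/sequence
        return g * semantic + (1.0 - g) * visual

# Usage: fuse semantic and low-level visual features of matching shape.
fusion = GatedFusion(dim=1024)
semantic = torch.randn(2, 77, 1024)  # e.g., high-level VLM embeddings
visual = torch.randn(2, 77, 1024)    # e.g., granular low-level visual features
fused = fusion(semantic, visual)     # (2, 77, 1024)
```

Because the gate is learned per channel, training can push some channels toward the semantic stream (favoring instruction consistency) and others toward the visual stream (favoring visual fidelity), which is one simple way to realize the creativity-consistency trade-off described above.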

Our Proposed Method

Illustration of M2Chat, which features a generation pipeline that processes both image and text inputs, harnessing the capabilities of LLaMA-AdapterV2 and SDXL to craft high-fidelity image-text pairs. Our system excels in three key areas: text-to-image (T2I) generation, storytelling, and multimodal dialogue. During image generation, a forward pass through the VLM yields hidden embeddings, which are then used to train the M3Adapter, a module distinguished by its minimal number of trainable parameters.
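A hedged sketch of this stage of the pipeline is shown below: frozen VLM hidden embeddings are projected into a conditioning space for the diffusion decoder, and only the adapter parameters are trained. The class `M3AdapterSketch`, its dimensions, and the token-selection step are hypothetical stand-ins, not the actual M2Chat implementation.

```python
import torch
import torch.nn as nn

class M3AdapterSketch(nn.Module):
    """Projects frozen VLM hidden embeddings into a conditioning space
    for a diffusion decoder (e.g., SDXL cross-attention inputs).

    Hypothetical sketch; module names and sizes are assumptions.
    """
    def __init__(self, vlm_dim: int = 4096, cond_dim: int = 2048, n_tokens: int = 77):
        super().__init__()
        self.proj = nn.Linear(vlm_dim, cond_dim)
        self.norm = nn.LayerNorm(cond_dim)
        self.n_tokens = n_tokens

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, vlm_dim) from a VLM forward pass.
        cond = self.norm(self.proj(hidden))
        # Keep a fixed number of trailing tokens as decoder conditioning.
        return cond[:, -self.n_tokens:, :]

adapter = M3AdapterSketch()
hidden = torch.randn(1, 256, 4096)  # stand-in for LLaMA-AdapterV2 hidden states
cond = adapter(hidden)              # (1, 77, 2048) conditioning tokens

# Only the adapter would be trained; the VLM and SDXL stay frozen.
trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
print(f"adapter trainable params: {trainable:,}")
```

Keeping the adapter this small is what makes the two-stage M3FT strategy practical: each stage updates only a disjoint, lightweight parameter group while the large pretrained VLM and diffusion backbones remain untouched.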