Adapting Multi-modal Large Language Model to Concept Drift From Pre-training Onwards

[ICLR 2025]
Australian Artificial Intelligence Institute (AAII),
University of Technology Sydney (UTS), Australia

News

  • [2026.04] 🔥 Our OpenMMlo dataset has reached 2k+ total downloads on Hugging Face! Thanks to the community for its support!

Abstract

Multi-modal Large Language Models (MLLMs) frequently face challenges from concept drift when dealing with real-world streaming data, whose distributions change unpredictably. This drift mainly comprises gradual drift due to long-tailed data and sudden drift from Out-Of-Distribution (OOD) data, both of which have increasingly drawn the attention of the research community. While these issues have been extensively studied in the individual domains of vision and language, their impacts on MLLMs under concept drift remain largely underexplored.

In this paper, we reveal the vulnerability of Vision-Language (VL) models to significant biases arising from gradual drift and sudden drift, particularly during pre-training. To address these challenges, we propose a unified framework that extends concept drift theory to the multi-modal domain, enhancing the adaptability of VL models to unpredictable distribution changes.

Additionally, we propose a T-distribution-based drift adapter that effectively mitigates the bias induced by gradual drift and helps the model distinguish sudden distribution changes through explicit distribution modeling. Extensive experiments demonstrate that our method enhances the efficiency and accuracy of image-text alignment in the pre-training of VL models, particularly under concept drift.
Concept Drift Framework Visualization
The inspiration for this research stems from the television series "House M.D."

Findings

Pre-training: As shown above, pre-training VL models on imbalanced datasets severely degrades intra-class compactness and inter-class separability compared to balanced ones. This long-tailed drift not only disrupts image-text alignment in tail categories but also negatively impacts head categories. Furthermore, the model struggles to distinguish out-of-distribution (OOD) instances, biasing the underlying feature space and fundamentally disturbing pre-training.

Fine-tuning: Visualizing the classifier's feature space shows that tail drift severely compresses the representation of minority categories, allowing head categories to completely dominate. As a result, unexpected OOD samples become visually entangled with these dominant head categories, drastically reducing the model’s OOD separability in non-stationary fine-tuning scenarios.

Image-Text Alignment in Pre-training
(a) Image-Text Alignment in Pre-training
Feature Space Allocation in Fine-tuning
(b) Feature Space Allocation in Fine-tuning

Key Contributions

  • Uncovering Pre-training Bias: We are the first to reveal the previously unexplored impacts of concept drift on multi-modal large language models, particularly on image-text alignment during pre-training and feature space allocation during fine-tuning.
  • Unified Concept Drift Framework: We introduce and extend concept drift theory to multi-modal data, and propose a novel T-distributed spherical adapter that performs tail adaptation and OOD detection simultaneously.
  • Superior Generalization & Robustness: On downstream tasks (long-tailed classification, OOD detection), our method demonstrates statistically significant improvements over competitive foundation models (e.g., CLIP).
  • OpenMMlo Dataset: We construct OpenMMlo, a large-scale multi-modal long-tailed open-world dataset of ~740,000 imbalanced image-caption pairs covering real-world scenarios.

Methodology

Multi-modal Concept Drift Theory

Assume that there are $N$ data streams corresponding to $N$ modalities, giving a set of multi-modal streams $\{S_{1}, \ldots, S_{N}\}$. Concept drift occurs at timestamp $t+1$ if, for some modality $i$, $$P_{t}(S_{i}) \neq P_{t+1}(S_{i}),$$ where $P_{t}$ denotes the distribution of stream $S_{i}$ at time $t$. This formulation provides a holistic perspective, linking gradual tail drift and sudden OOD drift in one unified definition.
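To make the definition concrete, the sketch below monitors each modality's feature stream with a two-sample Kolmogorov-Smirnov test and flags drift whenever any modality's distribution changes between adjacent windows. This is a minimal illustration of the definition above, not the paper's detection procedure; the window size and significance level are arbitrary choices.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(streams, window=500, alpha=0.01):
    """Flag concept drift across N modality streams.

    streams: list of 1-D score arrays, one per modality (e.g., projections
    of image/text embeddings). Drift is declared at the window boundary if
    ANY modality's distribution shifts, mirroring P_t(S_i) != P_{t+1}(S_i)
    for some modality i.
    """
    drifted = []
    for i, s in enumerate(streams):
        ref, cur = s[:window], s[window:2 * window]  # samples of P_t vs. P_{t+1}
        stat, p_value = ks_2samp(ref, cur)
        if p_value < alpha:  # reject "same distribution" for modality i
            drifted.append((i, stat))
    return drifted  # non-empty list => concept drift at this boundary

# Toy usage: modality 0 stays stationary, modality 1 undergoes a mean shift.
rng = np.random.default_rng(0)
streams = [rng.normal(0, 1, 1000),
           np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)])]
print(detect_drift(streams))  # reports drift on modality 1
```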

To address the drifting problem, we develop a T-distributed adapter to replace the conventional hyperspherical metric. The probability density of the proposed Thp distribution is:

$$p(x^{(i)})=N_{X}(d)^{-1} \frac{2}{\kappa(1-\mu^{T}x^{(i)})},\quad x^{(i)}\sim \text{Thp}(\mu)$$

The desirable heavy-tailed property of the proposed Thp prevents the compression of tail categories and mitigates the crowding of the feature space caused by long-tailed drift. This aligns visual and textual features correctly, avoiding the minority collapse exhibited by models guided by a traditional contrastive loss.
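For intuition, here is a minimal NumPy sketch of the density above, evaluated up to the normalization constant $N_{X}(d)$ (which is omitted); the concentration value `kappa` and the feature dimension are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def thp_score(x, mu, kappa=16.0, eps=1e-6):
    """Unnormalized Thp density: 2 / (kappa * (1 - mu^T x)).

    x, mu: unit-norm feature and class prototype on the hypersphere.
    The score peaks when x aligns with mu and decays polynomially as
    alignment drops -- far slower than an exponential hyperspherical
    kernel -- which is what relieves the crowding of tail categories.
    eps guards against division by zero at x == mu.
    """
    cos_sim = np.clip(x @ mu, -1.0, 1.0)
    return 2.0 / (kappa * (1.0 - cos_sim) + eps)

# Toy usage: compare an aligned feature with a random one.
rng = np.random.default_rng(0)
mu = rng.normal(size=512); mu /= np.linalg.norm(mu)
x_far = rng.normal(size=512); x_far /= np.linalg.norm(x_far)
print(thp_score(mu, mu))     # very large: feature sits on the prototype
print(thp_score(x_far, mu))  # small: feature far from the prototype
```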

Main Results

Long-Tailed ImageNet-LT Classification (Training from Scratch vs Fine-Tuning):

| Method | Backbone | Many-shot | Medium-shot | Few-shot | Overall |
| --- | --- | --- | --- | --- | --- |
| *Training from scratch* | | | | | |
| cRT | ResNet-50 | 61.8 | 46.2 | 27.3 | 49.6 |
| RIDE | ResNet-50 | 68.2 | 53.8 | 36.0 | 56.9 |
| PaCo | ResNet-50 | 68.2 | 58.7 | 41.0 | 60.0 |
| ViT | ViT-B/16 | 50.5 | 23.5 | 6.9 | 31.6 |
| MAE | ViT-B/16 | 74.7 | 48.2 | 19.4 | 54.5 |
| DeiT | ViT-B/16 | 70.4 | 40.9 | 12.8 | 48.4 |
| LiVT | ViT-B/16 | 73.6 | 56.4 | 41.0 | 60.9 |
| LiVT† | ViT-B/16 | 76.4 | 59.7 | 42.7 | 63.8 |
| Ours | ViT-B/16 | 76.4 | 66.2 | 48.9 | 67.6 |
| Ours† (Pre-trained) | ViT-B/16 | **77.2** | **68.6** | **51.3** | **69.4** |
| *Only fine-tuning* | | | | | |
| CLIP+ZS | Pre-trained ViT-B/16 | 82.0 | 70.4 | 69.6 | 70.5 |
| CLIP+LP | Pre-trained ViT-B/16 | **87.3** | 65.1 | 19.0 | 67.4 |
| CLIP+FT | Pre-trained ViT-B/16 | 83.0 | 65.0 | 39.9 | 68.5 |
| VL-LTR | Pre-trained ViT-B/16 | 84.5 | 74.6 | 59.3 | 77.2 |
| LIFT | Pre-trained ViT-B/16 | 80.2 | 76.1 | 71.5 | 77.0 |
| Ours | Pre-trained ViT-B/16 | 79.5 | 76.5 | 74.1 | 77.2 |
| Ours† (Fine-tuning) | Pre-trained ViT-B/16 | 79.9 | **76.8** | **74.5** | **77.6** |

* Performance comparison on ImageNet-LT (accuracy, %). **Bold** indicates the top results in each learning setting.

Out-Of-Distribution Detection:

| Method | SVHN FPR ↓ | SVHN AUROC ↑ | LSUN FPR ↓ | LSUN AUROC ↑ | iSUN FPR ↓ | iSUN AUROC ↑ | Texture FPR ↓ | Texture AUROC ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProxyAnchor | 87.2 | 82.4 | 37.2 | 91.7 | 70.0 | 85.0 | 65.6 | 85.0 |
| CE + SimCLR | 24.8 | 94.5 | 56.4 | 89.0 | 66.5 | 83.8 | 63.7 | 82.0 |
| CSI | 44.5 | 92.7 | 75.6 | 83.8 | 76.6 | 85.0 | 61.6 | 86.5 |
| SSD+ | 31.2 | 94.2 | 79.4 | 85.2 | 80.9 | 84.1 | 66.6 | 86.2 |
| KNN+ | 39.2 | 92.8 | 49.0 | 89.3 | 75.0 | 82.7 | 57.2 | 88.4 |
| CIDER | 12.6 | 97.8 | 30.2 | 92.8 | 46.0 | 88.9 | 35.6 | 92.3 |
| Ours | 8.3 | 98.7 | 20.3 | 97.5 | 32.5 | 95.2 | 45.1 | 96.3 |
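For reference, the two metrics in the table can be computed from per-sample scores as below. This is a generic evaluation snippet, not the paper's code; it assumes the standard conventions that higher scores mean more in-distribution and that FPR is measured at 95% TPR.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ood_metrics(scores_id, scores_ood):
    """Return (FPR@95%TPR, AUROC), both in %, with ID as the positive class."""
    labels = np.concatenate([np.ones_like(scores_id), np.zeros_like(scores_ood)])
    scores = np.concatenate([scores_id, scores_ood])
    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]  # FPR at the first point with TPR >= 95%
    return fpr95 * 100, auroc * 100

# Toy usage with synthetic, partially separable scores.
rng = np.random.default_rng(0)
print(ood_metrics(rng.normal(2, 1, 1000), rng.normal(0, 1, 1000)))
```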

OpenMMlo Dataset

To study the impact of the long-tailed open world on multi-modal large language models (MLLMs), we construct OpenMMlo (Open Multi-Modal Long-tailed dataset) by extending the open-source datasets ImageNet-LT, iNaturalist 2018, and Places-LT. ImageNet-LT has 1,000 classes and contains 115.8K samples, with a maximum of 1,280 and a minimum of 5 samples per category; it also includes 18K images for OOD detection. Places-LT has 184.5K samples from 365 classes, with per-class counts ranging from 4,980 down to 5. iNaturalist 2018 is a large-scale species dataset collected in the natural world, with 437.5K samples across 8,142 classes. We use InstructBLIP to generate the caption for each image, with the prompt "What does this picture describe? Please describe in detail its size, location, color, and its relationship to the surroundings."
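A minimal captioning sketch with the Hugging Face transformers InstructBLIP interface is shown below; the checkpoint name and generation settings are our assumptions, as the text above does not pin them down.

```python
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Assumed checkpoint: any InstructBLIP variant with this interface works.
ckpt = "Salesforce/instructblip-vicuna-7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained(ckpt)
model = InstructBlipForConditionalGeneration.from_pretrained(ckpt).to(device)

# The captioning prompt quoted above.
prompt = ("What does this picture describe? Please describe in detail its size, "
          "location, color, and its relationship to the surroundings.")

image = Image.open("example.jpg").convert("RGB")  # e.g., one ImageNet-LT image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=256)
caption = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(caption)
```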

Sample in Training Set

Panda in training set

Caption: The image depicts a [mask], also known as a [mask], sitting on a branch of a tree. The [mask] is holding a leaf in its mouth, which suggests that it might be eating or chewing on the plant. This behavior is typical of [mask]s, as they primarily feed on bamboo shoots, leaves, fruits, and insects. In the wild, [mask]s are found in the mountainous regions of southern and southwestern China, Myanmar, and India.

Annotation: lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens

Sample in Test Set

Man in test set

Caption: The picture depicts a young man sitting on a bench, holding a [mask] in his hand. This suggests that he is either playing the [mask] or contemplating playing it. The [mask] is a musical instrument that is commonly associated with blues and folk music, and it can be used to create melodic and rhythmic sounds. The presence of the [mask] in the image adds a musical element to the scene.

Annotation: harmonica, mouth organ, harp, mouth harp

Sample in Open Set

Package in open set

Caption: The main object in the picture is an open suitcase, which is a type of luggage. It is red in color and appears to be medium-sized. The suitcase is located on the floor of a room. The suitcase is partially filled with clothing items, including shirts, pants, and socks. It appears that the suitcase is still in the process of being packed or unpacked, as some items are visible on top of the suitcase while others are spilling out of it. The suitcase is open, allowing easy access to the clothing items inside. Overall, the picture provides a glimpse into the process of preparing for a trip or organizing one's belongings.

BibTeX

@inproceedings{yang2025adapting,
  title={Adapting Multi-modal Large Language Model to Concept Drift From Pre-training Onwards},
  author={Xiaoyu Yang and Jie Lu and En Yu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=b20VK2GnSs}
}