Walking the Tightrope: Autonomous Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning

[NeurIPS'25]
Australian Artificial Intelligence Institute (AAII)
University of Technology Sydney (UTS)

Abstract

This paper uncovers a critical yet overlooked phenomenon in multi-modal large language models (MLLMs), especially for chest diagnosis: detrimental concept drift within chain-of-thought (CoT) reasoning during non-stationary reinforcement fine-tuning (RFT), where reasoning token distributions evolve unpredictably, thereby introducing significant biases into final predictions. To address this, we are pioneers in establishing the theoretical bridge between concept drift theory and RFT processes by formalizing CoT's autoregressive token streams as non-stationary distributions undergoing arbitrary temporal shifts. Leveraging this framework, we propose a novel autonomous counterfactual-aware RFT that systematically decouples beneficial distribution adaptation from harmful concept drift through concept-graph-empowered LLM experts that generate counterfactual reasoning trajectories. Our solution, Counterfactual Preference Optimization (CPO), enables autonomous and stable RFT in non-stationary environments, particularly within the medical domain, through custom-tuning with counterfactual-aware preference alignment. Extensive experiments demonstrate the superior robustness, generalization, and coordination of our method within RFT. In addition, we contribute CXR-CounterFact (CCF), a large-scale dataset comprising 320,416 meticulously curated counterfactual reasoning trajectories derived from MIMIC-CXR.

Findings

Concept Drift Behind Thinking

[Figure: a single "red" token in the reasoning chain is perturbed (swapped with a low- or high-probability alternative), and per-category accuracy is measured after the disturbance.]

The tokens "lung opacity" and "opacity" have similar generation probabilities and nearly identical semantics, so replacing one with the other should barely change the disease prediction. In practice, however, swapping "opacity" for "lung opacity" substantially alters the final diagnosis, revealing detrimental concept drift behind the model's thinking.

The MLLM $\pi$ autoregressively generates the token at position $j$ of the reasoning chain, conditioned on the visual input $v$, the textual prompt $l$, and the partial token sequence $t_{(<j)}$ of the CoT trajectory:

$$t_j \sim \pi(\cdot\,|\,v,l,t_{(<j)})$$

Definition:

The MLLM's autoregressive reasoning trajectory manifests as a thinking stream $S_{0,i} = \{s_0, \ldots, s_i\}$, where each cognitive state $s_j = (t_{(<j)},\, z_j)$ encapsulates the tokens generated so far, $t_{(<j)}$, and the latent predicted distribution $z_j$ over the final results given $t_{(<j)}$. At position $i$, $S_{0,i}$ therefore follows a joint distribution $P_i(t, z)$ over tokens and latent predictions, and the concept drift behind the thinking can be formalized as:

$$\exists\, i : P_i(t,z) \neq P_{i+1}(t,z)$$
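The drift criterion above can be probed with a small sketch: given the sequence of latent predicted distributions $z_j$ along a CoT trajectory, flag positions where successive distributions diverge. The function names and the KL threshold below are illustrative assumptions, not the paper's detection method.

```python
# Illustrative sketch: flag concept drift along a CoT trajectory by
# comparing successive latent predicted distributions z_i and z_{i+1}.
# `detect_drift` and the KL threshold are hypothetical, not from the paper.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete prediction distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def detect_drift(latent_dists, threshold=0.1):
    """Return positions i where P_i(t, z) != P_{i+1}(t, z) is suggested
    by a large KL gap between z_i and z_{i+1}."""
    return [i for i in range(len(latent_dists) - 1)
            if kl_divergence(latent_dists[i], latent_dists[i + 1]) > threshold]
```

In this picture, a stable reasoning step leaves the predicted disease distribution nearly unchanged, while a drifting step (such as the "opacity" perturbation above) shifts it sharply.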

Contributions

CPO Framework

Our main contributions are summarized as follows:

  • Theoretical Framework: We establish a novel theoretical framework that formalizes autoregressive token generation in MLLMs through the lens of concept drift theory, enabling systematic identification and causal analysis of detrimental reasoning divergence during non-stationary reinforced custom-tuning.
  • Counterfactual Preference Optimization (CPO): We propose CPO, which synergizes structured domain-specific knowledge with systematic counterfactual intervention, driving the MLLMs with preference-aligned reinforcement learning. By embedding learnable concept graphs as the expert and generating adversarially-constrained reasoning trajectories, our approach achieves substantial decoupling between beneficial distribution adaptation and detrimental concept drift.
  • Empirical Validation: We conduct comprehensive empirical validation across various clinical benchmarks for chest radiograph, including disease classification, diagnostic report generation and zero-shot generalization. The superior results demonstrate statistically significant improvements in robustness, generalization, and accuracy under non-stationary custom-tuning.
  • CXR-CounterFact (CCF) Dataset: As a pioneering contribution to the community, we introduce CCF, a large-scale dataset comprising 320,416 meticulously curated counterfactual reasoning trajectories derived from MIMIC-CXR.
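The exact CPO objective is defined in the paper; as a rough illustration of counterfactual-aware preference alignment, a DPO-style pairwise loss that prefers the factual reasoning chain over its counterfactual variant might look like the following sketch (the function name and the value of beta are assumptions):

```python
# Hypothetical sketch of a DPO-style preference loss in which the
# "rejected" trajectory is a counterfactual reasoning chain; this
# illustrates preference alignment, not the paper's exact CPO objective.
import math

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where the margin compares the policy's
    log-probabilities of the factual (chosen) and counterfactual (rejected)
    chains against a frozen reference policy."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With equal log-probabilities the loss is log 2; it shrinks as the policy learns to separate factual from counterfactual trajectories relative to the reference.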

Results

CPO achieves state-of-the-art performance across multiple clinical benchmarks:

Method        Venue       Con.   PE     Pna.   Pnx.   Ede.   Avg.
CTrans        CVPR'23     44.0   61.3   45.1   31.5   65.5   49.5
CheXRelNet    MICCAI'22   47.0   47.0   47.0   36.0   49.0   45.2
BioViL        ECCV'22     56.0   63.0   60.2   42.5   67.5   57.8
BioViL-T      CVPR'23     61.1   67.0   61.9   42.6   68.5   60.2
Med-ST        ICML'24     60.6   67.4   58.5   65.0   54.2   61.1
TempA-VLP     WACV'25     65.2   59.4   73.4   43.1   77.1   63.6
CoCa-CXR      arXiv'25    70.4   69.6   61.4   72.8   71.8   69.2
SFT           -           54.9   71.7   70.0   95.9   76.5   73.8
DPO           -           63.2   72.4   76.7   93.5   76.3   76.4
CPO (Ours)    -           77.7   72.7   87.4   95.8   75.3   81.8

* Evaluation results of multi-label chest disease classification on MS-CXR-T, reported as top-1 accuracy (%). Con. = consolidation, PE = pleural effusion, Pna. = pneumonia, Pnx. = pneumothorax, Ede. = edema. SFT denotes supervised fine-tuning, DPO denotes direct preference optimization with random negative samples, and CPO denotes our counterfactual preference optimization method.

CXR-CounterFact (CCF) Dataset

We are the first to introduce counterfactual causality into the reinforced custom-tuning of MLLMs. Aware of the scarcity of counterfactual CoT data for downstream tasks, especially in the highly specialized medical field, we aim for the model to acclimate to concept drift on its own, acquiring abundant knowledge as data accumulates without exhibiting bias.

In this context, a more realistic training dataset for multi-modal large language models is required to validate their potential under non-stationary reinforced custom-tuning. Recognizing the demand for higher-quality multi-modal data with CoT, we develop the CXR-CounterFact Dataset (CCF), extending MIMIC-CXR with counterfactual chains of thought. This novel dataset introduces 320,416 meticulously curated counterfactual pairs spanning 14 thoracic pathologies, establishing a pioneering large-scale benchmark for causal interpretation in clinical chest X-ray analysis.

CXR-CounterFact Dataset Samples

The figure showcases samples used for training and validation in CXR-CounterFact. We use an LLM to generate the related caption for each image with the following prompt:

"This is a radiology chest DR examination report of a patient: <Report>.

This is a diagram of the relationship between lung diseases and their radiographic manifestations:
<Concept Graph>

Please generate a counterfactual radiology text showing <disease> based on the relationship and above context, with the same formatting."
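
The prompt above can be assembled programmatically. In the sketch below, the template text is taken verbatim from this page, but `build_prompt` and the "disease -> manifestation" edge-list serialization of the concept graph are assumptions; the authors' exact format may differ.

```python
# Sketch: filling the counterfactual-generation prompt from this page.
# `build_prompt` and the edge-list serialization of the concept graph
# are hypothetical; the authors' exact plumbing may differ.
PROMPT_TEMPLATE = (
    "This is a radiology chest DR examination report of a patient: {report}.\n\n"
    "This is a diagram of the relationship between lung diseases and their "
    "radiographic manifestations:\n{concept_graph}\n\n"
    "Please generate a counterfactual radiology text showing {disease} based "
    "on the relationship and above context, with the same formatting."
)

def build_prompt(report, concept_graph_edges, disease):
    """Render the prompt for one report and one target counterfactual disease."""
    graph = "\n".join(f"{d} -> {m}" for d, m in concept_graph_edges)
    return PROMPT_TEMPLATE.format(report=report, concept_graph=graph,
                                  disease=disease)
```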

BibTeX

@inproceedings{yang2025walking,
  title={Walking the Tightrope: Autonomous Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning},
  author={Yang, Xiaoyu and Lu, Jie and Yu, En},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}