Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments

[ICML 2026]
Xiaoyu Yang, En Yu, Wei Duan, Jie Lu
Australian Artificial Intelligence Institute (AAII)
University of Technology Sydney (UTS)

Abstract

This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): in non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of the source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs, to facilitate research on reasoning alignment under drift.

The classical Chinese parable "Two Children Arguing about the Sun" (两小儿辩日, from the Liezi) vividly illustrates multi-source drift: different source models, like the two children, often produce conflicting yet seemingly reasonable logical trajectories.

Findings

During our initial attempts to align target models using the collective reasoning of multiple state-of-the-art MLLMs, we observed a counter-intuitive phenomenon: simply aggregating their chain-of-thought (CoT) trajectories often degraded the target model's robustness, leading to severe hallucinations and semantic inconsistencies. By systematically tracing the generation processes, we discovered that these errors did not stem from individual source inadequacies, but rather from the fundamentally conflicting reasoning paradigms across different models. This critical observation led us to realize that the multi-source alignment process is not a static knowledge transfer, but a dynamic and non-stationary challenge characterized by inter-model concept drift.

Conceptual illustration of multi-stream drift.

Quantitative measurement of concept drift among source MLLMs.

Distilled Student

Based on the provided PA chest X-ray, the following key findings support the diagnoses of atelectasis and pleural effusion:

1. Atelectasis:

  • Volume Loss in the Left Lower Lung Field: The left lower lung appears significantly darker than the right, indicating volume loss.
    # Inconsistency: atelectasis is denser (whiter), not darker.
  • ...

2. Pleural Effusion:

  • Bilateral Pleural Opacities: Both lungs show areas of increased opacity, but they are more pronounced on the left.
    # Overgeneralization: bilateral opacities may have multiple causes and cannot be considered the primary cause.
  • Possible Pleural Line: There may be a pleural line visible along the midline, indicating pleural fluid accumulation.
    # Inconsistency: pleural line along the midline usually indicates pneumothorax, not pleural effusion.
  • No Evidence of Pulmonary Infiltrates: The lung fields appear relatively clear, with no signs of pulmonary infiltrates, consolidation, or ground-glass opacities.
    # Overconfident: infiltrates may be masked.
  • ...

Example of drift-biased target model.

Main Observations

  1. While some models, such as Qwen-VL-Max, adhere to high-precision, concise reasoning distributions, others like GPT-4o favor high-recall, expansive elaboration. This suggests that the "true" reasoning manifold lies within the consensus of these divergent streams, rather than in any single trajectory.
  2. The target model trained simply to mimic these drifting streams does not automatically synthesize their strengths; instead, it internalizes the union of their biases, resulting in hallucinations and semantic inconsistencies.

How can we autonomously turn drift into constraint, thereby achieving robust reasoning alignment in non-stationary environments?

Methodology

Workflow of Autonomous Preference Optimization

The main contributions of our methods.

(a) Supervised Bootstrapping with Consensus Synthesis. In the initial phase, the target model undergoes Supervised Bootstrapping to establish broad capability coverage by assimilating the collective knowledge of the source MLLMs. However, as shown in the inference block, this naive integration inevitably inherits non-stationary inter-model drift, resulting in hallucinations and semantic ambiguities.

(b) Constraint-Aware Optimization for Robust Reasoning Alignment. To mitigate the inherited drift, we propose a Constraint-Aware protocol. The model first employs in-context extraction to synthesize self-consistent consensus trajectories as preferred thinking. Crucially, rather than simply discarding the conflicting source outputs, APO repurposes them as negative constraints within a Plackett-Luce preference formulation, explicitly suppressing the probability of generating drifting patterns.

(c) Evolution of Distributions. The distributional dynamics of our alignment process. Initially, bootstrapping projects the target model into the union of source distributions (Yellow). Subsequently, APO refines this space by treating the drifting regions (Red) as decision boundaries to be avoided, effectively carving out a robust consensus manifold (Green) for reliable reasoning.
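
The data flow in (b) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `PreferenceExample` record, the `build_preference_example` helper, and the toy majority-vote synthesizer are all hypothetical names introduced here. In particular, the synthesizer is only a stand-in for the in-context extraction step, which in APO is performed by the model itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PreferenceExample:
    """One multi-negative preference record (hypothetical schema)."""
    image_id: str
    prompt: str
    chosen: str          # synthesized consensus trajectory (preferred)
    rejected: List[str]  # source trajectories reused as negative constraints


def toy_majority_synthesizer(prompt: str, trajectories: List[str]) -> str:
    """Stand-in for in-context extraction: keep claims made by most sources."""
    counts = Counter()
    for trajectory in trajectories:
        # Count each claim at most once per source.
        claims = {c.strip().lower() for c in trajectory.split(".") if c.strip()}
        counts.update(claims)
    threshold = len(trajectories) / 2
    kept = sorted(claim for claim, n in counts.items() if n > threshold)
    return ". ".join(kept) + "." if kept else ""


def build_preference_example(
    image_id: str,
    prompt: str,
    source_outputs: Dict[str, str],
    synthesize: Callable[[str, List[str]], str],
) -> PreferenceExample:
    """Pair one consensus trajectory with all source trajectories as negatives."""
    trajectories = list(source_outputs.values())
    chosen = synthesize(prompt, trajectories)
    return PreferenceExample(image_id, prompt, chosen, trajectories)


# Toy source outputs (invented for illustration, not taken from CXR-MAX).
sources = {
    "model_a": "Left basilar opacity. Blunted left costophrenic angle.",
    "model_b": "Left basilar opacity. Mediastinal shift to the right.",
    "model_c": "Blunted left costophrenic angle. Left basilar opacity.",
}
example = build_preference_example(
    "img-001", "Explain the findings.", sources, toy_majority_synthesizer
)
print(example.chosen)
print(len(example.rejected), "negative trajectories")
```

The point of the sketch is that no source output is thrown away: the claim made by only one model is dropped from the preferred trajectory, while the full conflicting trajectories are retained as negatives.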

Definition: Multi-Stream Reasoning Drift

Consider $N$ CoT streams corresponding to $N$ source models. Let the collective state at reasoning step $j$ be denoted by $\mathcal{S}_{j} = (s^{1}_{j}, \ldots, s^{N}_{j})$, where $s^{u}_{j}$ represents the state of the $u$-th source model. We define the reasoning alignment process as experiencing concept drift if the joint distribution of the collective states evolves non-stationarily across steps. That is, for any two distinct reasoning steps $j$ and $j+\Delta$, the joint probability distributions differ:

$$ P_{j}(\mathcal{S}) \neq P_{j+\Delta}(\mathcal{S}) $$
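
One way to make this definition operational is to compare empirical distributions of collective states at two reasoning steps. The sketch below is illustrative only. It assumes that states have been discretized into categorical labels (the source does not specify a state representation), and it uses the Jensen-Shannon divergence as a stand-in drift score. The toy samples and function names are hypothetical.

```python
import math
from collections import Counter


def empirical_dist(samples):
    """Empirical distribution over observed collective states."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {state: c / total for state, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, in [0, 1]) of two discrete distributions."""
    support = set(p) | set(q)
    mix = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in support}

    def kl(a, b):
        # Terms with zero probability in `a` contribute nothing.
        return sum(a[s] * math.log2(a[s] / b[s]) for s in a if a[s] > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


# Toy samples of the collective state S_j = (s^1_j, s^2_j) for N = 2 sources.
step_j = [("left", "left"), ("left", "left"), ("left", "right"), ("left", "left")]
step_j_delta = [("left", "right"), ("right", "right"),
                ("left", "right"), ("right", "right")]

drift = js_divergence(empirical_dist(step_j), empirical_dist(step_j_delta))
no_drift = js_divergence(empirical_dist(step_j), empirical_dist(step_j))
print(f"drift score: {drift:.3f}, self-comparison: {no_drift:.3f}")
```

A score of zero corresponds to $P_{j}(\mathcal{S}) = P_{j+\Delta}(\mathcal{S})$ on the observed samples; any positive score indicates that the two empirical distributions differ.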

Autonomous Preference Optimization (APO)

We recast multi-source reasoning integration as a constraint satisfaction problem. The target model first undergoes Supervised Bootstrapping to establish broad capability coverage. To mitigate inherited drift, APO then employs in-context extraction to synthesize self-consistent consensus trajectories. Crucially, rather than discarding conflicting source outputs, APO repurposes them as negative constraints within a Plackett-Luce preference formulation:

$$ P(t^{+} \succ \mathcal{T} | v,l) = \frac{\exp (r(v,l,t^{+}))}{\exp (r(v,l,t^{+})) + \sum_{u=1}^{N} \exp (r(v,l,\tau^{u}))} $$

Where:

  • $t^{+}$: The synthesized self-consistent consensus trajectory (positive constraint).
  • $\mathcal{T}$: The set of divergent, drifting reasoning trajectories from source models.
  • $\tau^{u}$: The $u$-th negative trajectory within $\mathcal{T}$ ($u \in \{1, \ldots, N\}$).
  • $v, l$: The input visual image and text prompt, respectively.
  • $r(\cdot)$: The implicit reward function indicating the model's preference score.

Maximizing this probability raises the likelihood of the consensus trajectory while suppressing the likelihood of generating drifting patterns.
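
The preference probability above is a softmax over one positive and $N$ negative rewards, so its negative log can be computed directly. The following is a minimal sketch under a stated assumption: rewards are passed in as plain floats, because this page does not specify how the implicit reward $r(\cdot)$ is parameterized. The function name is hypothetical.

```python
import math


def multi_negative_pl_loss(r_pos, r_negs):
    """Return -log P(t+ > T | v, l) for one positive and N negative rewards."""
    scores = [r_pos] + list(r_negs)
    m = max(scores)  # subtract the max before exponentiating for stability
    log_partition = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_partition - r_pos


# The consensus trajectory scores above the drifting ones: small loss.
loss_good = multi_negative_pl_loss(2.0, [0.0, -1.0, 0.5])
# The consensus trajectory scores below a drifting one: large loss.
loss_bad = multi_negative_pl_loss(-1.0, [0.0, 2.0, 0.5])
# All rewards equal: the probability is 1 / (N + 1), so the loss is log(N + 1).
loss_tied = multi_negative_pl_loss(0.0, [0.0, 0.0, 0.0])

print(f"good: {loss_good:.3f}  bad: {loss_bad:.3f}  tied: {loss_tied:.3f}")
```

Because every negative appears in the denominator, raising the reward of any single drifting trajectory increases the loss, which is what lets each conflicting source output act as a separate constraint.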

Contributions

  • Theoretical Framework: We establish a novel framework that recasts multi-source reasoning integration as a constraint satisfaction problem in non-stationary environments. We demonstrate how conflicting reasoning trajectories can be transformed from disruptive noise into actionable negative constraints.
  • Autonomous Preference Optimization (APO): We propose APO, a self-supervised alignment strategy that eliminates the need for ground-truth labels. APO autonomously constructs preference pairs to guide robust reasoning alignment.
  • Empirical Validation: We conduct extensive evaluations across diverse benchmarks. APO achieves superior robustness and generalization while utilizing only 10% of the data typically required by standard alignment methods.
  • CXR-MAX Dataset: To facilitate future research on alignment under drift, we release CXR-MAX, a large-scale benchmark comprising over 170k reasoning trajectories with fine-grained alignment annotations.

Results

Beyond comparing with standard domain-specific baselines, we benchmark our 7B-parameter target model against the proprietary source MLLMs themselves, such as GPT-5 and Claude Sonnet-4, which operate at far larger parameter scales. As shown in the table below, despite this disparity in model size, our approach achieves the highest average accuracy (0.78) across all diseases, surpassing every source MLLM. This counter-intuitive result provides empirical evidence that our constraint-aware optimization enables the compact target model to synthesize a consensus manifold that integrates the diverse strengths of the source ensemble, allowing it to stand on the shoulders of giants.

MLLMs          Con.   PE     Pna.   Pnx.   Ede.   Avg.
Source MLLMs with Huge Parameters
GPT-5          0.75   0.68   0.89   0.90   0.52   0.75
Gemini-2.5     0.28   0.61   0.40   0.94   0.42   0.53
Sonnet-4       0.89   0.69   0.48   0.15   0.15   0.47
Qwen-VL-Max    0.54   0.65   0.40   0.95   0.64   0.64
Grok-4         0.43   0.41   0.36   0.97   0.61   0.56
GLM-4.5V       0.59   0.67   0.52   0.96   0.72   0.69
Moonshot       0.13   0.46   0.77   0.88   0.19   0.48
Target Model (7B MLLM)
Ours           0.84   0.67   0.78   0.96   0.65   0.78
Con.: consolidation; PE: pleural effusion; Pna.: pneumonia; Pnx.: pneumothorax; Ede.: edema; Avg.: average.

Classification results on MS-CXR-T for the source MLLMs and our target model, measured by top-1 accuracy. The best-performing models are highlighted in red, with the second-best in blue. The source MLLMs compared are Claude Sonnet-4, Gemini-2.5, GLM-4.5V, GPT-5, Qwen-VL-Max, Grok-4, and Moonshot-v1.
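
As a quick sanity check on the reported numbers, the averages can be recomputed from the per-disease accuracies. The sketch below copies the values from the results table. The variable names are ours, and the column order (Con., PE, Pna., Pnx., Ede.) follows the table header.

```python
# (per-disease accuracies, reported average), copied from the results table.
reported = {
    "GPT-5":       ([0.75, 0.68, 0.89, 0.90, 0.52], 0.75),
    "Gemini-2.5":  ([0.28, 0.61, 0.40, 0.94, 0.42], 0.53),
    "Sonnet-4":    ([0.89, 0.69, 0.48, 0.15, 0.15], 0.47),
    "Qwen-VL-Max": ([0.54, 0.65, 0.40, 0.95, 0.64], 0.64),
    "Grok-4":      ([0.43, 0.41, 0.36, 0.97, 0.61], 0.56),
    "GLM-4.5V":    ([0.59, 0.67, 0.52, 0.96, 0.72], 0.69),
    "Moonshot":    ([0.13, 0.46, 0.77, 0.88, 0.19], 0.48),
    "Ours":        ([0.84, 0.67, 0.78, 0.96, 0.65], 0.78),
}
diseases = ["Con.", "PE", "Pna.", "Pnx.", "Ede."]

# Recompute each average from the rounded per-disease values.
recomputed = {name: sum(s) / len(s) for name, (s, _) in reported.items()}
max_gap = max(abs(recomputed[n] - reported[n][1]) for n in reported)

# Which model has the highest reported average, and who wins each column?
best_avg = max(reported, key=lambda n: reported[n][1])
best_per_disease = {
    d: max(reported, key=lambda n: reported[n][0][i])
    for i, d in enumerate(diseases)
}

print("best average:", best_avg)
print("largest gap between recomputed and reported average:", round(max_gap, 3))
print("best per disease:", best_per_disease)
```

The recomputed averages agree with the reported ones to within rounding of the per-disease values. The per-column winners are all source models, so the target model's lead in average accuracy comes from consistency across diseases rather than from topping any single column.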

CXR-MAX Dataset

To evaluate reasoning alignment in non-stationary environments, a dataset exhibiting high-variance inter-model drift is essential. Existing benchmarks typically rely on single-source annotations or static consensus, failing to capture the dynamic conflicts inherent in multi-stream reasoning.

Addressing this gap, we introduce CXR-MAX (Multi-source Alignment for X-rays), a large-scale benchmark designed to facilitate the study of autonomous preference optimization. CXR-MAX extends the MIMIC-CXR dataset by aggregating reasoning trajectories from seven distinct, publicly available MLLMs. CXR-MAX provides 170,982 distillation instances of reasoning trajectories covering 14 thoracic pathologies.
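
To make the structure of the benchmark concrete, the sketch below shows one possible in-memory representation of a CXR-MAX instance. The schema is hypothetical: the field names and the example prompt are ours, and the released format may differ. The arithmetic at the top only observes that 170,982 is divisible by seven, which would be consistent with one trajectory per source model per study, although the per-model breakdown is not stated here.

```python
from dataclasses import dataclass, field
from typing import Dict

# Figures stated on this page.
NUM_TRAJECTORIES = 170_982
NUM_SOURCE_MODELS = 7
NUM_PATHOLOGIES = 14

# 170,982 splits evenly across seven models (an observation, not a stated fact).
per_model, remainder = divmod(NUM_TRAJECTORIES, NUM_SOURCE_MODELS)


@dataclass
class CxrMaxInstance:
    """Hypothetical record: one study with trajectories from several MLLMs."""
    study_id: str
    prompt: str
    report: str  # ground-truth radiology report text
    trajectories: Dict[str, str] = field(default_factory=dict)  # model -> CoT


instance = CxrMaxInstance(
    study_id="example-study",  # placeholder identifier
    prompt="Explain the basis for the diagnoses.",  # hypothetical prompt text
    report="There is unchanged moderate left lower lobe atelectasis "
           "and bilateral pleural effusions.",
)
instance.trajectories["Qwen-VL-Max"] = (
    "The chest X-ray (DR image) shows several findings consistent with "
    "atelectasis and pleural effusion, particularly on the left side."
)

print(per_model, remainder)
print(sorted(instance.trajectories))
```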

Example Instance from CXR-MAX

Chest X-Ray

Radiology Report (Ground Truth)

Findings: There is unchanged moderate left lower lobe atelectasis and bilateral pleural effusions. Linear atelectases in the left mid lung field are again seen, essentially unchanged. Lesion in the right seventh rib is unchanged. Pleurx catheter is seen, unchanged in position at the base of the left lung. There is no evidence of pneumothorax. Pacer is seen with leads appropriately placed within the right and left atria. Mediastinal silhouette is within normal limits with a calcified aorta.

Claude Sonnet-4

Looking at this chest X-ray, I can identify several findings that support the diagnoses of atelectasis and pleural effusion:

Evidence of Atelectasis:

  • Volume loss in the right hemithorax - The right side appears smaller compared to the left
  • Mediastinal shift - The mediastinum appears shifted toward the right side
  • Elevated right hemidiaphragm - The right diaphragm appears higher than the left
  • Increased opacity in the right lung - Areas of collapsed lung tissue appear more dense/white

Evidence of Pleural Effusion:

  • Blunting of the right costophrenic angle - The normally sharp angle is obscured
  • Increased density at the right lung base - There appears to be fluid accumulation in the lower right pleural space
  • Meniscus sign - The fluid appears to have the characteristic curved upper border

The combination of these findings provides clear radiographic evidence supporting both diagnoses...

GPT-5-nano

Key findings that support atelectasis and pleural effusion:

  • Left subclavian pacemaker/ICD in place (device visible in the left chest).
  • Basilar opacities at the left lower lung suggesting subsegmental/plate-like atelectasis with some volume loss.
  • Blunting of the left costophrenic angle with posterior layering of fluid consistent with a small left pleural effusion.

Right lung and mediastinal contours are otherwise relatively unremarkable.
In short: left basilar atelectasis with a small left pleural effusion. Correlate clinically...

GLM-4.5V

To determine the basis for diagnosing Atelectasis and Pleural Effusion in the chest X-ray, we analyze key imaging features:

1. Pleural Effusion:
Pleural effusion is identified by blunting of the costophrenic angles... In this image, the right costophrenic angle appears blurred and loses its typical sharp contour...

2. Atelectasis:
Atelectasis refers to partial lung collapse, which shows as increased opacity (whiteness) in the affected lung regions... In this X-ray, there are areas of increased density in the lower lung fields (particularly on the right side)...

In summary, the diagnostic clues are the blunted right costophrenic angle and increased opacification in the lower lung zones.

Gemini-2.5-flash-lite

Based on the chest DR image, here are the findings that support the diagnoses of Atelectasis and Pleural Effusion:

1. Pleural Effusion:

  • Right Costophrenic Angle Blunting: The right costophrenic angle is completely blunted and obscured by fluid.
  • Meniscus Sign: There is a dense, homogeneous opacity at the right lung base with a characteristic curvilinear upper border...
  • Obscured Diaphragm: The right hemidiaphragm is completely obscured by the effusion.

2. Atelectasis:

  • Mediastinal Shift: The mediastinum is shifted towards the right...
  • Reduced Right Lung Volume: The overall right lung field appears smaller and more collapsed...
  • Increased Opacity/Consolidation: The remaining lung parenchyma in the right hemithorax appears denser...

In summary, the image demonstrates a large right-sided pleural effusion... accompanied by signs of volume loss in the right hemithorax...

Moonshot-v1

As an AI, I'm not capable of providing medical diagnoses or interpreting medical images with the same level of expertise as a trained medical professional. However, I can provide general information...

Atelectasis is a condition where one or both lungs partially or completely collapse. On an X-ray, this might appear as an area of increased opacity...

Pleural effusion refers to the buildup of excess fluid in the pleural cavity... This can cause the lung to partially collapse and may appear on an X-ray as a layering density at the lung base or costophrenic angle...

Qwen-VL-Max

The chest X-ray (DR image) shows several findings consistent with atelectasis and pleural effusion, particularly on the left side. Here's the basis for each diagnosis:

1. Pleural Effusion (Left Side)

  • Blunting of the left costophrenic angle: The sharp angle between the diaphragm and rib cage on the left is obscured or rounded...
  • Homogeneous opacity in the lower left hemithorax: There is increased density in the left lower lung field...
  • Meniscus sign: A concave upper border of the fluid collection...
  • Elevation of the left hemidiaphragm: Fluid can push the diaphragm upward.

2. Atelectasis (Left Lower Lobe Likely)

  • Volume loss in the left lower lung: The left hemithorax appears compressed, and there is a shift of mediastinal structures slightly toward the left...
  • Loss of normal lung markings in the affected area: The lung tissue appears denser and consolidated...

Summary:
Pleural effusion: Left-sided, based on blunted costophrenic angle, homogeneous opacity, and diaphragmatic elevation.
Atelectasis: Likely left lower lobe, secondary to compressive atelectasis from the adjacent pleural effusion.

Grok-4

I can explain the basis for the diagnosis of atelectasis and pleural effusion based on an analysis of the chest X-ray (CXR) you provided. I'll keep this concise, objective, and focused on the visible findings...

1. Basis for Pleural Effusion:

  • Increased opacity (whitish haziness) at the right lung base, obscuring the normal sharp outline of the right hemidiaphragm...
  • Blunting or filling-in of the right costophrenic angle...
  • A subtle "meniscus sign" (curved upper border of the opacity) at the right base...

2. Basis for Atelectasis:

  • Loss of volume and increased density in the right lower lung zone, making it appear smaller and whiter compared to the left lung...
  • Crowding or absence of normal lung markings (vascular structures) in the affected area...
  • Subtle elevation or silhouetting of the right hemidiaphragm, suggesting collapse of the adjacent lung tissue.

BibTeX

@inproceedings{yang2026turning,
    title={Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments},
    author={Yang, Xiaoyu and Yu, En and Duan, Wei and Lu, Jie},
    booktitle={Forty-third International Conference on Machine Learning},
    year={2026},
    url={https://openreview.net/forum?id=jgebUtw1lA}
}