cs.CV [Total: 263]
cs.CL [Total: 109]
cs.MA [Total: 1]
cs.AI [Total: 29]
eess.AS [Total: 2]
cs.IR [Total: 1]
cs.GR [Total: 4]
cs.HC [Total: 1]
eess.SP [Total: 2]
q-bio.QM [Total: 1]
cs.DL [Total: 1]
cs.SD [Total: 4]
astro-ph.IM [Total: 1]
cs.NE [Total: 1]
cs.RO [Total: 9]
cs.SE [Total: 1]
q-bio.NC [Total: 1]
cs.CR [Total: 3]
eess.IV [Total: 4]
cs.LG [Total: 28]

cs.CV

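[1] Pathological Truth Bias in Vision-Language Models cs.CV

Yash Thube

TL;DR: MATS is a multimodal behavioral audit that measures whether vision-language models reject visually contradicted statements. It exposes spatial-consistency failures in generative models and shows that contrastive models are more robust.

Details

Motivation: Standard VLM benchmarks can hide systematic failures that reduce trust in real-world use.

Result: Generative VLMs (e.g., LLaVA 1.5) score poorly on SCS and have a high IAR, while contrastive models (e.g., CLIP) are more robust. Activation patching localizes the components where the failures arise.

Insight: Failures in generative models are linked to mid-to-late cross-attention layers, whereas failures in contrastive models concentrate in the pooled projection components.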

Abstract: Vision Language Models (VLMs) are improving quickly, but standard benchmarks can hide systematic failures that reduce real-world trust. We introduce MATS (Multimodal Audit for Truthful Spatialization), a compact behavioral audit that measures whether models reject visually contradicted statements, and two metrics: Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Instruction-tuned generative VLMs (LLaVA 1.5, QwenVLchat) exhibit very low SCS and high IAR, while contrastive encoders (CLIP, SigLIP) are far more robust. Activation patching causally localizes failure loci (mid-to-late cross-attention for generative models, pooled projection components for contrastive models) and suggests concrete repair paths.

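[2] Scale and Rotation Estimation of Similarity-Transformed Images via Cross-Correlation Maximization Based on Auxiliary Function Method cs.CV

Shinji Yamashita, Yuma Kinoshita, Hitoshi Kiya

TL;DR: This paper proposes an efficient algorithm that jointly estimates scale and rotation with sub-pixel precision by maximizing cross-correlation with the auxiliary function method.

Details

Motivation: Traditional phase-correlation techniques handle translational shifts well but cannot cope with the scale and rotation changes caused by camera zoom or rotation.

Result: Experiments show lower mean errors in scale and rotation estimation than conventional Fourier-transform-based techniques that rely on discrete cross-correlation.

Insight: The method offers an efficient and precise solution for image alignment, particularly in scenarios that involve scale and rotation changes.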

Abstract: This paper introduces a highly efficient algorithm capable of jointly estimating scale and rotation between two images with sub-pixel precision. Image alignment serves as a critical process for spatially registering images captured from different viewpoints, and finds extensive use in domains such as medical imaging and computer vision. Traditional phase-correlation techniques are effective in determining translational shifts; however, they are inadequate when addressing scale and rotation changes, which often arise due to camera zooming or rotational movements. In this paper, we propose a novel algorithm that integrates scale and rotation estimation based on the Fourier transform in log-polar coordinates with a cross-correlation maximization strategy, leveraging the auxiliary function method. By incorporating sub-pixel-level cross-correlation our method enables precise estimation of both scale and rotation. Experimental results demonstrate that the proposed method achieves lower mean estimation errors for scale and rotation than conventional Fourier transform-based techniques that rely on discrete cross-correlation.
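For readers unfamiliar with the phase-correlation baseline mentioned in the abstract, the following minimal sketch may help. It is not the authors' method: it works on a 1D signal, recovers only an integer circular shift, and uses a naive DFT so that it needs nothing beyond the standard library. The function names `dft` and `phase_correlation_shift` are illustrative assumptions. In log-polar Fourier methods, scale and rotation appear as shifts along the two axes, so the same peak-finding idea applies there; the paper's contribution is refining that peak to sub-pixel precision.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for tiny signals."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def phase_correlation_shift(a, b):
    """Return the circular shift s such that b[t] == a[(t - s) % n]."""
    fa, fb = dft(a), dft(b)
    cross = []
    for u, v in zip(fa, fb):
        c = v * u.conjugate()          # cross-power spectrum term
        mag = abs(c)
        # Keep only the phase; drop bins with (near) zero energy.
        cross.append(c / mag if mag > 1e-12 else 0)
    corr = dft(cross, inverse=True)    # ideally a delta located at the shift
    return max(range(len(corr)), key=lambda t: corr[t].real)

signal = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0, 0.0]
shifted = signal[-3:] + signal[:-3]    # circular shift by 3 samples
estimated_shift = phase_correlation_shift(signal, shifted)
print(estimated_shift)
```

Because the peak is taken over discrete indices, the estimate is limited to whole samples, which is the limitation the paper's continuous cross-correlation maximization is designed to remove.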

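[3] Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization cs.CV

Xu Jia

TL;DR: This paper proposes a reinforcement learning framework that augments Group Relative Policy Optimization (GRPO) with a curriculum to cope with sparse, noisy rewards in object detection for autonomous driving, improving detection accuracy and robustness.

Details

Motivation: Multimodal large language models (MLLMs) excel at vision-language reasoning but struggle with structured perception tasks that require precise localization and robustness. The author addresses this by combining curriculum learning with reinforcement learning.

Result: On autonomous driving benchmarks, the method substantially improves detection accuracy and robustness. Ablation studies confirm the importance of reward design, KL regularization, and curriculum pacing.

Insight: Reinforcement learning combined with structured data curricula is a scalable and robust approach for complex multimodal detection tasks.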

Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language reasoning but often struggle with structured perception tasks requiring precise localization and robustness. We propose a reinforcement learning framework that augments Group Relative Policy Optimization (GRPO) with curriculum-based data scheduling and difficulty-aware filtering. This approach stabilizes optimization under sparse, noisy rewards and enables progressive adaptation to complex samples. Evaluations on autonomous driving benchmarks demonstrate substantial improvements in detection accuracy and robustness. Ablation studies confirm the importance of reward design, KL regularization, and curriculum pacing for convergence stability and generalization. Our findings highlight reinforcement-driven optimization with structured data curricula as a scalable path toward robust and interpretable multimodal detection.
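As a rough illustration of the GRPO ingredient named in the abstract, the sketch below computes group-relative advantages in the commonly used form: each sampled response's reward is standardized against the other responses drawn for the same prompt. The `keep_prompt` filter is a hypothetical stand-in for "difficulty-aware filtering"; its thresholds and its use of the mean reward are assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def keep_prompt(rewards, low=0.1, high=0.9):
    """Hypothetical difficulty filter: skip prompts that are trivially easy or hard.

    When every response gets nearly the same reward, the advantages are all
    close to zero and the prompt contributes almost no learning signal.
    """
    return low <= mean(rewards) <= high

group = [1.0, 0.0, 0.0, 1.0]           # e.g. detection rewards for 4 samples
advantages = group_relative_advantages(group)
print(advantages)
print(keep_prompt(group), keep_prompt([1.0, 1.0, 1.0, 1.0]))
```

A curriculum could then order the retained prompts from high to low mean reward, although how the paper actually paces its curriculum is not specified in the abstract.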

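[4] Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation cs.CV

Ha-Hieu Pham, Minh Le, Han Huynh, Nguyen Quoc Khanh Le, Huy-Hieu Pham

TL;DR: The paper proposes Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints to improve both the accuracy and the topological validity of semi-supervised histopathology segmentation.

Details

Motivation: Segmenting pathology images requires dense annotations, which are costly and scarce. Existing methods rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks.

Result: On GlaS and CRAG, TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision.

Insight: Global topological constraints compensate for the weaknesses of pixel-level consistency in semi-supervised segmentation, improving both segmentation quality and topological plausibility.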

Abstract: Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision. Code is available at https://github.com/hieuphamha19/TGC.
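One of the quantities the abstract aligns is the component count. The sketch below shows, under simplifying assumptions, how a mismatch in component count can flag a fragmented mask: it counts 4-connected foreground regions directly on binary masks rather than on the prediction graphs the paper builds, and the absolute-difference penalty is a hypothetical illustration rather than the TGC loss.

```python
from collections import deque

def count_components(mask):
    """Count 4-connected foreground regions in a binary mask (list of rows)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:               # breadth-first flood fill
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

reference = [[0, 0, 0, 0, 0],
             [0, 1, 1, 1, 0],
             [0, 1, 1, 1, 0],
             [0, 0, 0, 0, 0]]
prediction = [[0, 0, 0, 0, 0],          # same gland, split down the middle
              [0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 0, 0, 0]]

# Hypothetical penalty: absolute difference in component counts.
penalty = abs(count_components(prediction) - count_components(reference))
print(penalty)
```

A pixel-level loss would charge this prediction for only two wrong pixels, whereas the count mismatch signals that one object has been split in two.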

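[5] Sequential Token Merging: Revisiting Hidden States cs.CV

Yan Wen, Peng Ye, Lin Zhang, Baopu Li, Jiakang Yuan

TL;DR: The paper proposes Sequential Token Merging (STM). Building on an analysis of Limited Directional Sequential Dependence (LDSD) in Vision Mambas (ViMs), it refines how hidden states aggregate information, improving efficiency with minimal accuracy loss.

Details

Motivation: Vision Mambas achieve sub-quadratic complexity, but their efficiency is still constrained by quadratic token scaling with image resolution. Existing token-reduction methods overlook the information flow mechanism in ViM hidden states.

Result: Accuracy drops by only 1.0% for ViM-Ti at 20% token reduction and by only 1.4% for ViM-S at 40% reduction, yielding state-of-the-art efficiency.

Insight: Mamba's selective scan enables gradual information aggregation in hidden states, which offers new insight into state-space model dynamics.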

Abstract: Vision Mambas (ViMs) achieve remarkable success with sub-quadratic complexity, but their efficiency remains constrained by quadratic token scaling with image resolution. While existing methods address token redundancy, they overlook ViMs’ intrinsic Limited Directional Sequential Dependence (LDSD) - a critical information flow mechanism revealed in our analysis. We further identify Mamba’s selective scan enables gradual information aggregation in hidden states. Based on these insights, we propose Sequential Token Merging (STM), featuring: 1) Bidirectional nearest neighbor merging to preserve sequential dependencies through symmetric spatial aggregation, and 2) Hidden states protection to stabilize the hidden states around the class token. STM strategically leverages Mamba’s layer-wise loss convergence to convert temporal forgetfulness into stability. Experiments demonstrate STM’s superiority: 1.0% accuracy drop for ViM-Ti at 20% token reduction, and only 1.4% degradation for ViM-S at 40% reduction. Our method achieves state-of-the-art efficiency with minimal complexity, while providing new insights into state-space model dynamics. Codes will be released soon.

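[6] Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects cs.CV

Le Zhang, Ao Li, Qibin Hou, Ce Zhu, Yonina C. Eldar

TL;DR: This paper is a comprehensive survey of deep-learning-based super-resolution covering single-image, video, stereo, and light-field methods. It reviews more than 150 single-image methods alone and discusses future research directions.

Details

Motivation: Super-resolution has attracted wide attention in computer vision, but existing surveys mostly focus on specific sub-domains and lack a comprehensive view. This paper aims to fill that gap with a broad technical analysis and outlook.

Result: The paper summarizes the strengths and weaknesses of existing techniques, analyzes methodologies, datasets, evaluation protocols, and complexity, and identifies potential research directions and open problems.

Insight: Super-resolution still faces challenges such as robustness in complex scenes and the trade-off between computational efficiency and practical deployment. Future work should focus on multimodal fusion and lightweight design.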

Abstract: Super-resolution (SR) has garnered significant attention within the computer vision community, driven by advances in deep learning (DL) techniques and the growing demand for high-quality visual applications. With the expansion of this field, numerous surveys have emerged. Most existing surveys focus on specific domains, lacking a comprehensive overview of this field. Here, we present an in-depth review of diverse SR methods, encompassing single image super-resolution (SISR), video super-resolution (VSR), stereo super-resolution (SSR), and light field super-resolution (LFSR). We extensively cover over 150 SISR methods, nearly 70 VSR approaches, and approximately 30 techniques for SSR and LFSR. We analyze methodologies, datasets, evaluation protocols, empirical results, and complexity. In addition, we conducted a taxonomy based on each backbone structure according to the diverse purposes. We also explore valuable yet under-studied open issues in the field. We believe that this work will serve as a valuable resource and offer guidance to researchers in this domain. To facilitate access to related work, we created a dedicated repository available at https://github.com/AVC2-UESTC/Holistic-Super-Resolution-Review.

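[7] Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment cs.CV | cs.AI | cs.LG

Abhiroop Chatterjee, Susmita Ghosh

TL;DR: The paper proposes an efficient multimodal alignment method that learns hyperspectral images (HSI) together with curated text prompts. It updates only 0.07% of the parameters while achieving state-of-the-art performance on hyperspectral scene understanding.

Details

Motivation: The high-dimensional 3D voxel structure of HSI makes cross-modal alignment challenging. Vision and language models perform well on natural images and text, but cross-modal alignment in the hyperspectral domain remains underexplored. The paper addresses this by optimizing a vision-language model (VLM).

Result: On Indian Pines (IP), OA and Kappa improve by 0.92 and 1.60; on Pavia University (PU), they improve by 0.69 and 0.90. The parameter set is nearly 50 times smaller than that of DCTN and 90 times smaller than that of SS-TMNet.

Insight: Curated text prompts combined with contrastive learning enable efficient multimodal alignment with very few parameter updates, giving a lightweight yet high-performing approach to hyperspectral image understanding.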

Abstract: As data requirements continue to grow, efficient learning increasingly depends on the curation and distillation of high-value data rather than brute-force scaling of model sizes. In the case of a hyperspectral image (HSI), the challenge is amplified by the high-dimensional 3D voxel structure, where each spatial location is associated with hundreds of contiguous spectral channels. While vision and language models have been optimized effectively for natural image or text tasks, their cross-modal alignment in the hyperspectral domain remains an open and underexplored problem. In this article, we make an attempt to optimize a Vision-Language Model (VLM) for hyperspectral scene understanding by exploiting a CLIP-style contrastive training framework. Our framework maps voxel-level embeddings from a vision backbone onto the latent space of a frozen large embedding model (LEM), where a trainable probe aligns vision features with the model’s textual token representations. The two modalities are aligned via a contrastive loss restricted to a curated set of hard (closest wrong classes) and semi-hard (random distractors) negatives, along with positive pairs. To further enhance alignment, descriptive prompts that encode class semantics are introduced and act as structured anchors for the HSI embeddings. It is seen that the proposed method updates only 0.07 percent of the total parameters, yet yields state-of-the-art performance. For example, on Indian Pines (IP) the model produces better results over unimodal and multimodal baselines by +0.92 Overall Accuracy (OA) and +1.60 Kappa ($\kappa$), while on Pavia University (PU) data it provides gains of +0.69 OA and +0.90 $\kappa$. Moreover, this is achieved with the set of parameters, nearly 50$\times$ smaller than DCTN and 90$\times$ smaller than SS-TMNet.
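The contrastive objective described in the abstract can be pictured with a small InfoNCE-style sketch. It assumes the negatives have already been selected (hard ones being the closest wrong classes, semi-hard ones random distractors, as the abstract states) and simply scores one anchor against one positive and that restricted negative set. The function names, the temperature of 0.07, and the toy 2D vectors are assumptions for illustration, not values from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """Cross-entropy of the positive against a restricted set of negatives."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    m = max(logits)                       # log-sum-exp with max subtraction
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

hsi_embedding = [1.0, 0.0]                # vision-side embedding (toy values)
prompt_true = [0.9, 0.1]                  # text embedding of the correct class
random_distractors = [[0.0, 1.0], [-1.0, 0.0]]
hard_negative = [0.8, 0.3]                # a wrong class close to the anchor

easy_loss = contrastive_loss(hsi_embedding, prompt_true, random_distractors)
hard_loss = contrastive_loss(hsi_embedding, prompt_true,
                             random_distractors + [hard_negative])
print(easy_loss, hard_loss)
```

Adding the hard negative raises the loss noticeably, which is why restricting the loss to such negatives concentrates the training signal on the most confusable classes.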

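[8] IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism cs.CV | cs.AI | cs.LG

Adithya Giri

TL;DR: IBiT introduces inductive biases through learned masks so that Vision Transformers perform better on small datasets while retaining the explainability of Transformers.

Details

Motivation: Transformers perform well in computer vision but lack the inductive biases of convolutional neural networks. This paper introduces those biases to improve Transformer performance on small datasets.

Result: IBiT outperforms standard Transformers on small datasets while retaining explainability.

Insight: Explicitly introducing inductive biases compensates for the weakness of Transformers on small datasets while preserving their strengths.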

Abstract: In recent years, Transformer-based architectures have become the dominant method for Computer Vision applications. While Transformers are explainable and scale well with dataset size, they lack the inductive biases of Convolutional Neural Networks. While these biases may be learned on large datasets, we show that introducing these inductive biases through learned masks allows Vision Transformers to learn on much smaller datasets without Knowledge Distillation. These Transformers, which we call Inductively Biased Image Transformers (IBiT), are significantly more accurate on small datasets, while retaining the explainability of Transformers.

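[9] LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning cs.CV | cs.AI | cs.LG

Zezhong Fan, Xiaohan Li, Luyi Ma, Kai Zhao, Liang Peng

TL;DR: LayoutAgent combines vision-language reasoning with compositional diffusion to generate spatial layouts for multi-object scenes, addressing the lack of spatial reasoning in conventional diffusion models.

Details

Motivation: Diffusion models generate high-quality images but lack explicit spatial reasoning, which leads to unrealistic object layouts. Traditional spatial planning methods emphasize geometric consistency but cannot capture the semantic richness of visual scenes. A method is needed that both generates high-quality images and plans layouts that respect semantic relations and physical plausibility.

Result: Experiments show that LayoutAgent outperforms existing layout generation models in layout coherence, spatial realism, and aesthetic alignment.

Insight: The novelty of LayoutAgent lies in combining vision-language reasoning with compositional diffusion, which unifies semantic richness with spatial consistency and offers a new approach to multi-object scene generation.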

Abstract: Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency, but they struggle to capture semantic richness in visual scenes. To bridge this gap, in this paper, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images with target objects in them, our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. Then we leverage compositional diffusion, a method traditionally used in robotics, to synthesize bounding boxes that respect object relations encoded in the scene graph for spatial layouts. In the end, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout guided by designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism and aesthetic alignment.

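[10] CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models cs.CV | cs.AI

Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li

TL;DR: CompareBench is a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs). It contains 1000 QA pairs across four tasks: quantity, temporal, geometric, and spatial. Even advanced models show significant weaknesses.

Details

Motivation: Visual comparison reasoning is a fundamental skill for vision-language models, but it has lacked systematic evaluation. CompareBench fills this gap and aims to advance more reliable multimodal reasoning.

Result: All models perform poorly on temporal ordering and spatial relations and often make mistakes on basic quantity and geometric comparisons, which indicates a systematic blind spot in visual comparison reasoning.

Insight: Visual comparison reasoning needs more research, especially in modeling temporal dynamics and spatial relations. CompareBench provides a diagnostic tool for future improvements.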

Abstract: We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.

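[11] MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning cs.CV | cs.AI

Yapeng Mi, Hengli Li, Yanpeng Zhao, Chenxi Li, Huimin Wu

TL;DR: MILR is a test-time method that reasons jointly over image and text in a unified latent vector space, substantially improving multimodal image generation.

Details

Motivation: Existing reasoning-based image generation methods either restrict reasoning to a single modality or rely on high-quality reasoning data for fine-tuning, which limits their performance and applicability.

Result: MILR achieves state-of-the-art results on GenEval, T2I-CompBench, and WISE, improving over the baseline by 80% on WISE.

Insight: Joint reasoning in the unified latent space is the key advantage. MILR also shows strong temporal and cultural reasoning ability.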

Abstract: Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR’s non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.

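[12] VideoScore2: Think before You Score in Generative Video Evaluation cs.CV | cs.AI | cs.CL

Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang

TL;DR: VideoScore2 is a multi-dimensional, interpretable, and human-aligned framework for evaluating generated video. It assesses visual quality, text-to-video alignment, and physical consistency, and supports its scores with detailed chain-of-thought rationales.

Details

Motivation: Existing video evaluators mostly output a single opaque score, lack interpretability, or provide only coarse analysis, so they cannot assess video quality comprehensively.

Result: VideoScore2 reaches 44.35 (+5.94) accuracy on VideoScore-Bench-v2 and an average of 50.37 (+4.32) across four out-of-domain benchmarks.

Insight: The fine-grained evaluation dimensions of VideoScore2 support effective reward modeling for controllable video generation and help bridge evaluation and generation.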

Abstract: Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/

Sahar Dastani, Ali Bahri, Gustavo Adolfo Vargas Hakim, Moslem Yazdanpanah, Mehrdad Noori

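TL;DR: TRUST is a novel test-time adaptation method that exploits the architectural properties of SSMs. It uses diverse traversal permutations to generate multiple views of the input and uses pseudo-labels to guide parameter updates, which significantly improves robustness.

Details

Motivation: State Space Models (SSMs) perform well on vision tasks, but their generalization degrades significantly under distribution shift. TRUST aims to improve robustness through test-time adaptation that exploits the structure of SSMs.

Result: On seven benchmarks, TRUST consistently improves robustness and outperforms existing TTA methods.

Insight: TRUST demonstrates the potential of SSMs for test-time adaptation: structural properties such as traversal permutations can be used to adapt the model to distribution shift.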

Abstract: State Space Models (SSMs) have emerged as efficient alternatives to Vision Transformers (ViTs), with VMamba standing out as a pioneering architecture designed for vision tasks. However, their generalization performance degrades significantly under distribution shifts. To address this limitation, we propose TRUST (Test-Time Refinement using Uncertainty-Guided SSM Traverses), a novel test-time adaptation (TTA) method that leverages diverse traversal permutations to generate multiple causal perspectives of the input image. Model predictions serve as pseudo-labels to guide updates of the Mamba-specific parameters, and the adapted weights are averaged to integrate the learned information across traversal scans. Altogether, TRUST is the first approach that explicitly leverages the unique architectural properties of SSMs for adaptation. Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods.

Jaeik Kim, Woojin Kim, Woohyeon Park, Jaeyoung Do

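TL;DR: MMPB is the first extensive benchmark for evaluating the personalization ability of vision-language models (VLMs). It contains 10k image-query pairs and 111 personalizable concepts. Existing VLMs perform poorly on personalization, particularly in dialogue consistency and visual adaptation.

Details

Motivation: Personalization is essential in user-facing AI systems, but the ability of VLMs to adapt to individual users remains underexplored. MMPB fills this gap.

Result: Experiments show that most VLMs, including closed-source models, struggle with personalization, especially in maintaining consistency over dialogue and handling user preferences.

Insight: MMPB provides a scalable benchmark and points the way toward truly personalized multimodal AI.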

Abstract: Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI. Project Page: aidaslab.github.io/MMPB

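[15] Seeing Isn’t Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN cs.CV | cs.AI

Roie Kazoom, Alon Goldberg, Hodaya Cohen, Ofer Hadar

TL;DR: The paper proposes a conditional GAN framework for synthesizing context-aware adversarial patches that hit a chosen target class while remaining visually realistic and effective in black-box settings.

Details

Motivation: Existing adversarial patch attacks mostly rely on unrealistic white-box assumptions, use untargeted objectives, or produce visually conspicuous patches, all of which limit real-world applicability.

Result: The method achieves state-of-the-art performance across a range of convolutional networks and vision transformers, with attack success rate (ASR) and target-class success (TCS) both above 99%.

Insight: The method balances visual realism, targeted control, and black-box applicability, setting a new benchmark for adversarial robustness research.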

Abstract: Adversarial patch attacks pose a severe threat to deep neural networks, yet most existing approaches rely on unrealistic white-box assumptions, untargeted objectives, or produce visually conspicuous patches that limit real-world applicability. In this work, we introduce a novel framework for fully controllable adversarial patch generation, where the attacker can freely choose both the input image $x$ and the target class $y_{\text{target}}$, thereby dictating the exact misclassification outcome. Our method combines a generative U-Net design with Grad-CAM-guided patch placement, enabling semantic-aware localization that maximizes attack effectiveness while preserving visual realism. Extensive experiments across convolutional networks (DenseNet-121, ResNet-50) and vision transformers (ViT-B/16, Swin-B/16, among others) demonstrate that our approach achieves state-of-the-art performance across all settings, with attack success rates (ASR) and target-class success (TCS) consistently exceeding 99%. Importantly, we show that our method not only outperforms prior white-box attacks and untargeted baselines, but also surpasses existing non-realistic approaches that produce detectable artifacts. By simultaneously ensuring realism, targeted control, and black-box applicability (the three most challenging dimensions of patch-based attacks), our framework establishes a new benchmark for adversarial robustness research, bridging the gap between theoretical attack strength and practical stealthiness.

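[16] Learning Temporal Saliency for Time Series Forecasting with Cross-Scale Attention cs.CV | cs.LG

Ibrahim Delibasoglu, Fredrik Heintz

TL;DR: This paper proposes CrossScaleNet, an architecture built on cross-scale attention for time series forecasting and temporal saliency detection that improves both explainability and predictive performance.

Details

Motivation: The lack of explainability in time series forecasting limits model transparency and decision support. Traditional temporal saliency detection methods are computationally expensive and perform poorly, so an efficient and transparent solution is needed.

Result: The model's robustness is validated on synthetic datasets and public benchmarks. On real-world datasets it outperforms most Transformer-based models, offering better explainability without sacrificing predictive accuracy.

Insight: Existing explainable models often underperform on standard benchmarks. CrossScaleNet addresses this gap and balances performance with explainability.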

Abstract: Explainability in time series forecasting is essential for improving model transparency and supporting informed decision-making. In this work, we present CrossScaleNet, an innovative architecture that combines a patch-based cross-attention mechanism with multi-scale processing to achieve both high performance and enhanced temporal explainability. By embedding attention mechanisms into the training process, our model provides intrinsic explainability for temporal saliency, making its decision-making process more transparent. Traditional post-hoc methods for temporal saliency detection are computationally expensive, particularly when compared to feature importance detection. While ablation techniques may suffice for datasets with fewer features, identifying temporal saliency poses greater challenges due to its complexity. We validate CrossScaleNet on synthetic datasets with known saliency ground truth and on established public benchmarks, demonstrating the robustness of our method in identifying temporal saliency. Experiments on real-world datasets for the forecasting task show that our approach consistently outperforms most transformer-based models, offering better explainability without sacrificing predictive accuracy. Our evaluations demonstrate superior performance in both temporal saliency detection and forecasting accuracy. Moreover, we highlight that existing models claiming explainability often fail to maintain strong performance on standard benchmarks. CrossScaleNet addresses this gap, offering a balanced approach that captures temporal saliency effectively while delivering state-of-the-art forecasting performance across datasets of varying complexity.

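[17] Multimodal Slice Interaction Network Enhanced by Transfer Learning for Precise Segmentation of Internal Gross Tumor Volume in Lung Cancer PET/CT Imaging cs.CV | cs.AI

Yi Luo, Yike Guo, Hamed Hooshangnejad, Rui Zhang, Xue Feng

TL;DR: The paper proposes a transfer-learning-based multimodal interactive perception network with a slice interaction module (SIM) for precise segmentation of the internal gross tumor volume (IGTV) of lung cancer in PET/CT imaging, substantially improving segmentation performance.

Details

Motivation: Lung cancer is the leading cause of cancer-related deaths worldwide. Accurate IGTV segmentation in PET/CT imaging is crucial for radiation therapy, but it is hindered by scarce annotated data and weak PET signal at tumor boundaries.

Result: On the private IGTV dataset, the method reaches a Dice score of 0.609, well above the 0.385 of the conventional baseline.

Insight: Combining transfer learning with multimodal interaction substantially improves IGTV segmentation accuracy and provides a more reliable tool for clinical radiation therapy planning.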

Abstract: Lung cancer remains the leading cause of cancer-related deaths globally. Accurate delineation of internal gross tumor volume (IGTV) in PET/CT imaging is pivotal for optimal radiation therapy in mobile tumors such as lung cancer to account for tumor motion, yet is hindered by the limited availability of annotated IGTV datasets and attenuated PET signal intensity at tumor boundaries. In this study, we present a transfer learning-based methodology utilizing a multimodal interactive perception network with MAMBA, pre-trained on extensive gross tumor volume (GTV) datasets and subsequently fine-tuned on a private IGTV cohort. This cohort constitutes the PET/CT subset of the Lung-cancer Unified Cross-modal Imaging Dataset (LUCID). To further address the challenge of weak PET intensities in IGTV peripheral slices, we introduce a slice interaction module (SIM) within a 2.5D segmentation framework to effectively model inter-slice relationships. Our proposed module integrates channel and spatial attention branches with depthwise convolutions, enabling more robust learning of slice-to-slice dependencies and thereby improving overall segmentation performance. A comprehensive experimental evaluation demonstrates that our approach achieves a Dice of 0.609 on the private IGTV dataset, substantially surpassing the conventional baseline score of 0.385. This work highlights the potential of transfer learning, coupled with advanced multimodal techniques and a SIM to enhance the reliability and clinical relevance of IGTV segmentation for lung cancer radiation therapy planning.
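To put the reported Dice values in context, here is a minimal sketch of the standard Dice coefficient on flattened binary masks, $\text{Dice} = 2|P \cap T| / (|P| + |T|)$. The helper name `dice_score`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions; the paper's exact evaluation protocol (per-slice versus per-volume averaging, for instance) is not given in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient for two flattened binary masks of equal length."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [1, 1, 0, 0, 1]      # predicted tumor voxels (toy example)
target = [1, 0, 0, 1, 1]    # annotated tumor voxels
score = dice_score(pred, target)
print(round(score, 3))
```

Here two of the three predicted voxels overlap the three annotated ones, giving 2·2/(3+3) ≈ 0.667, which is slightly above the 0.609 reported for the proposed method.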

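[18] ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models cs.CV

Yixuan Hu, Yuxuan Xue, Simon Klenk, Daniel Cremers, Gerard Pons-Moll

TL;DR: ControlEvents is a diffusion-based generative method for synthesizing high-quality event camera data. By leveraging the prior of Stable Diffusion, it significantly reduces the cost of producing labeled data and proves effective on several tasks.

Details

Motivation: Event cameras have drawn attention for their high temporal resolution and high dynamic range, but collecting large-scale labeled data is costly and difficult. This paper addresses the problem by generating synthetic data.

Result: The synthesized data improves model performance on visual recognition, 2D skeleton estimation, and 3D pose estimation. The method can also generate events from text labels not seen during training.

Insight: The diffusion prior of foundation models lowers the cost and difficulty of generating event camera data, and diverse control signals make the generation flexible.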

Abstract: In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.

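[19] Learning KAN-based Implicit Neural Representations for Deformable Image Registration cs.CV

Nikita Drozdov, Marat Zinovev, Dmitry Sorokin

TL;DR: The paper proposes KAN-IDIR and RandKAN-IDIR, the first integration of Kolmogorov-Arnold Networks (KANs) with implicit neural representations (INRs) for deformable image registration. A randomized basis sampling strategy lowers computational cost while maintaining registration quality, and the models show strong accuracy and learning stability across several datasets.

Details

Motivation: Learning-based deformable registration methods need large training datasets and struggle to match the precision of classical iterative methods. Implicit neural representations (INRs) are promising but fall short in computational efficiency and learning stability. The paper aims to address these issues by combining KANs with INRs.

Result: KAN-IDIR and RandKAN-IDIR achieve the highest registration accuracy among INR-based methods across all evaluated modalities and anatomies, with low computational overhead and strong learning stability. RandKAN-IDIR slightly outperforms the model with learnable basis function indices.

Insight: Randomized basis sampling not only reduces cost but also slightly improves performance, which suggests that balancing learning stability and computational efficiency may become a key direction for future INR research.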

Abstract: Deformable image registration (DIR) is a cornerstone of medical image analysis, enabling spatial alignment for tasks like comparative studies and multi-modal fusion. While learning-based methods (e.g., CNNs, transformers) offer fast inference, they often require large training datasets and struggle to match the precision of classical iterative approaches on some organ types and imaging modalities. Implicit neural representations (INRs) have emerged as a promising alternative, parameterizing deformations as continuous mappings from coordinates to displacement vectors. However, this comes at the cost of requiring instance-specific optimization, making computational efficiency and seed-dependent learning stability critical factors for these methods. In this work, we propose KAN-IDIR and RandKAN-IDIR, the first integration of Kolmogorov-Arnold Networks (KANs) into deformable image registration with implicit neural representations (INRs). Our proposed randomized basis sampling strategy reduces the required number of basis functions in KAN while maintaining registration quality, thereby significantly lowering computational costs. We evaluated our approach on three diverse datasets (lung CT, brain MRI, cardiac MRI) and compared it with competing instance-specific learning-based approaches, dataset-trained deep learning models, and classical registration approaches. KAN-IDIR and RandKAN-IDIR achieved the highest accuracy among INR-based methods across all evaluated modalities and anatomies, with minimal computational overhead and superior learning stability across multiple random seeds. Additionally, we discovered that our RandKAN-IDIR model with randomized basis sampling slightly outperforms the model with learnable basis function indices, while eliminating its additional training-time complexity.

[20] Convolutional Set Transformer cs.CV | cs.AI | cs.LGPDF

Federico Chinello, Giacomo Boracchi

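TL;DR: The paper proposes the Convolutional Set Transformer (CST), a neural architecture that operates directly on 3D image tensors, removing the need to first extract features with a CNN and enabling synergy between feature extraction and contextual modeling.

Details

Motivation: Existing set-input networks such as Deep Sets and Set Transformer accept only vector inputs and cannot directly handle 3D image tensors, so they need a separate CNN for feature extraction. CST addresses this by processing image sets directly and achieving higher performance.

Result: Experiments show that CST outperforms existing methods on set classification and set anomaly detection, and supports model explainability.

Insight: CST's design avoids the staged processing of conventional pipelines and shows the benefit of end-to-end modeling; its compatibility with explainability methods improves model transparency.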

Abstract: We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics - such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).

[21] TY-RIST: Tactical YOLO Tricks for Real-time Infrared Small Target Detection cs.CV | cs.AI | cs.LGPDF

Abdulkarim Atrash, Omar Moured, Yufan Chen, Jiaming Zhang, Seyda Ertekin

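TL;DR: TY-RIST is an optimized YOLOv12n architecture for real-time infrared small target detection. Improvements to the backbone, detection head, attention modules, and pruning strategy significantly raise detection performance and computational efficiency.

Details

Motivation: Infrared small target detection is critical for defense and surveillance, but it faces challenges of scarce target features, cluttered environments, missed detections, and high computational cost.

Result: Across four benchmarks, mAP at 0.5 IoU improves by 7.9%, precision by 3%, and recall by 10.2%, while reaching real-time performance of up to 123 FPS.

Insight: The pruning strategy marginally improves accuracy while substantially reducing computational cost, indicating that model optimization can balance speed and performance.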

Abstract: Infrared small target detection (IRSTD) is critical for defense and surveillance but remains challenging due to (1) target loss from minimal features, (2) false alarms in cluttered environments, (3) missed detections from low saliency, and (4) high computational costs. To address these issues, we propose TY-RIST, an optimized YOLOv12n architecture that integrates (1) a stride-aware backbone with fine-grained receptive fields, (2) a high-resolution detection head, (3) cascaded coordinate attention blocks, and (4) a branch pruning strategy that reduces computational cost by about 25.5% while marginally improving accuracy and enabling real-time inference. We also incorporate the Normalized Gaussian Wasserstein Distance (NWD) to enhance regression stability. Extensive experiments on four benchmarks and across 20 different models demonstrate state-of-the-art performance, improving mAP at 0.5 IoU by +7.9%, Precision by +3%, and Recall by +10.2%, while achieving up to 123 FPS on a single GPU. Cross-dataset validation on a fifth dataset further confirms strong generalization capability. Additional results and resources are available at https://www.github.com/moured/TY-RIST
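The Normalized Gaussian Wasserstein Distance mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly published formulation, in which each box (cx, cy, w, h) is modeled as a 2D Gaussian and the similarity is exp(-W2 / C). The function name `nwd`, the constant `c`, and the example boxes are illustrative assumptions, not values taken from TY-RIST.

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Gaussian Wasserstein Distance between two boxes.

    Boxes are (cx, cy, w, h). Each box is treated as a 2D Gaussian with
    mean (cx, cy) and covariance diag(w^2 / 4, h^2 / 4). `c` is a
    dataset-dependent normalizer; 12.8 is an assumed placeholder.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two Gaussians.
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

if __name__ == "__main__":
    gt = (10.0, 10.0, 4.0, 4.0)
    shifted = (12.0, 10.0, 4.0, 4.0)  # same size, 2-pixel centre offset
    print(nwd(gt, gt))       # identical boxes
    print(nwd(gt, shifted))  # decays smoothly with the offset
```

Because the score depends on centre and size differences rather than overlap area, it stays informative for very small boxes that barely overlap, which is the usual motivation for using it with tiny targets.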

Chenghan Yang, Peng Zhou, Dong-Sheng Zhang, Yueyun Wang, Hong-Bin Shen

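TL;DR: FishAI 2.0 combines multimodal few-shot learning with image generation into an intelligent marine fish recognition framework, addressing the poor performance of traditional methods under data scarcity and validating its effectiveness and practicality experimentally.

Details

Motivation: Traditional marine biological image recognition performs poorly under data scarcity, especially for rare species. To address this, FishAI 2.0 integrates multimodal learning with image generation.

Result: Top-1 accuracy reaches 91.67%, 87.58%, and 85.42% at the family, genus, and species levels, respectively, significantly outperforming baseline models.

Insight: Multimodal learning and data augmentation can effectively mitigate data scarcity in few-shot learning, providing a scalable technical solution for ecological monitoring.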

Abstract: Traditional marine biological image recognition faces challenges of incomplete datasets and unsatisfactory model accuracy, particularly under few-shot conditions for rare species, where data scarcity significantly hampers performance. To address these issues, this study proposes an intelligent marine fish recognition framework, FishAI 2.0, integrating multimodal few-shot deep learning techniques with image generation for data augmentation. First, a hierarchical marine fish benchmark dataset, which provides a comprehensive data foundation for subsequent model training, is used to train the FishAI 2.0 model. To address the data scarcity of rare classes, the large language model DeepSeek was employed to generate high-quality textual descriptions, which are input into Stable Diffusion 2 for image augmentation through a hierarchical diffusion strategy that extracts latent encoding to construct a multimodal feature space. The enhanced visual-textual datasets were then fed into a Contrastive Language-Image Pre-Training (CLIP) based model, enabling robust few-shot image recognition. Experimental results demonstrate that FishAI 2.0 achieves a Top-1 accuracy of 91.67 percent and Top-5 accuracy of 97.97 percent at the family level, outperforming baseline CLIP and ViT models by a substantial margin for the minority classes with fewer than 10 training samples. To better reflect real-world use, FishAI 2.0 was also evaluated at the genus and species levels, where it achieves Top-1 accuracies of 87.58 percent and 85.42 percent, respectively, demonstrating practical utility. In summary, FishAI 2.0 improves the efficiency and accuracy of marine fish identification and provides a scalable technical solution for marine ecological monitoring and conservation, highlighting its scientific value and practical applicability.

[23] Brain Tumor Classification from MRI Scans via Transfer Learning and Enhanced Feature Representation cs.CVPDF

Ahta-Shamul Hoque Emran, Hafija Akter, Abdullah Al Shiam, Abu Saleh Musa Miah, Anichur Rahman

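TL;DR: The paper proposes a transfer-learning-based deep learning framework for automatic brain tumor detection from MRI scans and introduces a new Dense-Dropout module to strengthen feature learning. The authors also contribute a new brain tumor MRI dataset (MMCBT).

Details

Motivation: Timely detection of brain tumors is critical for patient treatment, but existing datasets are not sufficiently reliable. This study aims to provide an efficient, automated detection framework and fill the data gap.

Result: The method improves brain tumor classification performance through enhanced feature representation and data balancing.

Insight: Transfer learning combined with the new Dense-Dropout module can improve medical image classification. High-quality, balanced datasets are critical to model performance.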

Abstract: Brain tumors are abnormal cell growths in the central nervous system (CNS), and their timely detection is critical for improving patient outcomes. This paper proposes an automatic and efficient deep-learning framework for brain tumor detection from magnetic resonance imaging (MRI) scans. The framework employs a pre-trained ResNet50 model for feature extraction, followed by Global Average Pooling (GAP) and linear projection to obtain compact, high-level image representations. These features are then processed by a novel Dense-Dropout sequence, a core contribution of this work, which enhances non-linear feature learning, reduces overfitting, and improves robustness through diverse feature transformations. Another major contribution is the creation of the Mymensingh Medical College Brain Tumor (MMCBT) dataset, designed to address the lack of reliable brain tumor MRI resources. The dataset comprises MRI scans from 209 subjects (ages 9 to 65), including 3671 tumor and 13273 non-tumor images, all clinically verified under expert supervision. To overcome class imbalance, the tumor class was augmented, resulting in a balanced dataset well-suited for deep learning research.
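The class imbalance described in the abstract can be quantified with a short Python sketch. The image counts come from the abstract; the helper name `balance_plan` and the assumption that "balanced" means a 1:1 split are illustrative, since the abstract does not state the final class sizes.

```python
def balance_plan(minority, majority):
    """Return (imbalance ratio, extra minority samples needed for a 1:1 split)."""
    return majority / minority, majority - minority

# Counts reported for the MMCBT dataset: 3671 tumor, 13273 non-tumor images.
ratio, extra = balance_plan(minority=3671, majority=13273)

if __name__ == "__main__":
    print(f"non-tumor : tumor = {ratio:.2f} : 1")
    print(f"augmented tumor images needed for a 1:1 split: {extra}")
```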

[24] ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View cs.CVPDF

Wenbin Teng, Gonglin Chen, Haiwei Chen, Yajie Zhao

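TL;DR: ARSS is a framework based on a decoder-only autoregressive model that generates novel views from a single image, addressing the multi-view inconsistency of diffusion models.

Details

Motivation: Diffusion models excel in generation quality, but in world modeling tasks such as novel view generation from sparse inputs, their non-causal generation leads to inconsistencies across views. AR models generate causally and are therefore better suited to such tasks.

Result: On public datasets, ARSS performs comparably to or better than state-of-the-art diffusion-based view synthesis methods.

Insight: Causal autoregressive models have potential advantages in view synthesis, especially where cross-view consistency is required, and architectural adjustments can improve generation quality.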

Abstract: Despite their exceptional generative quality, diffusion models have limited applicability to world modeling tasks, such as novel view generation from sparse inputs. This limitation arises because diffusion models generate outputs in a non-causal manner, often leading to distortions or inconsistencies across views, and making it difficult to incrementally adapt accumulated knowledge to new queries. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce ARSS, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ a video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then, to enhance generation quality while preserving the autoregressive structure, we propose an autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Extensive qualitative and quantitative experiments on public datasets demonstrate that our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models. Our code will be released upon paper acceptance.
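The idea of permuting spatial token order while keeping temporal order can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say whether one permutation is shared across frames, so the sketch samples an independent permutation per frame, and the function name and toy token ids are hypothetical.

```python
import random

def permute_spatial_keep_temporal(frames, seed=None):
    """Shuffle token order inside each frame; keep frame order fixed.

    `frames` is a list of frames, each a list of token ids in raster order.
    Returns the flattened sequence and the permutation sampled for each
    frame, so spatial positions can still be supplied to a model.
    """
    rng = random.Random(seed)
    sequence, orders = [], []
    for frame in frames:              # temporal order is untouched
        order = list(range(len(frame)))
        rng.shuffle(order)            # random spatial order (assumed per frame)
        orders.append(order)
        sequence.extend(frame[i] for i in order)
    return sequence, orders

if __name__ == "__main__":
    frames = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
    seq, orders = permute_spatial_keep_temporal(frames, seed=0)
    print(seq)
```

Every token of frame t still precedes every token of frame t+1 in the output, so a causal model reading the sequence never sees a later frame before an earlier one.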

[25] Disentangling Static and Dynamic Information for Reducing Static Bias in Action Recognition cs.CVPDF

Masato Kobayashi, Ning Ding, Toru Tamaki

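TL;DR: The paper proposes a method that reduces static bias in action recognition by separating static scene information from dynamic temporal information, combining a statistical independence loss with a scene prediction loss; experiments demonstrate its effectiveness.

Details

Motivation: Action recognition models rely excessively on static cues rather than dynamic human motion (static bias), which leads to poor performance in real-world applications and zero-shot action recognition, so this bias needs to be reduced.

Result: Experiments show that the method effectively reduces static bias and confirm the importance of the scene prediction loss.

Insight: Separating static and dynamic information is key in action recognition, and the scene prediction loss helps prevent the model from over-relying on static cues.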

Abstract: Action recognition models rely excessively on static cues rather than dynamic human motion, which is known as static bias. This bias leads to poor performance in real-world applications and zero-shot action recognition. In this paper, we propose a method to reduce static bias by separating temporal dynamic information from static scene information. Our approach uses a statistical independence loss between biased and unbiased streams, combined with a scene prediction loss. Our experiments demonstrate that this method effectively reduces static bias and confirm the importance of scene prediction loss.

[26] Desensitizing for Improving Corruption Robustness in Point Cloud Classification through Adversarial Training cs.CVPDF

Zhiqiang Tian, Weigang Li, Chunhua Deng, Junwei Hu, Yongqiang Wang

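TL;DR: The paper proposes Desensitized Adversarial Training (DesenAT), which improves model robustness in point cloud classification through feature desensitization and a self-distillation framework.

Details

Motivation: Point cloud data is inevitably corrupted due to scene complexity, sensor inaccuracy, and processing errors. DNNs trained with conventional methods over-rely on point cloud features and perform poorly on corrupted data. The study quantifies this reliance and investigates whether reducing it improves robustness.

Result: Experiments on ModelNet-C and PointCloud-C show that DesenAT significantly improves robustness without degrading performance on clean datasets.

Insight: Reducing a DNN's over-reliance on specific features effectively improves its performance on corrupted data, and the self-distillation framework is an effective way to balance robustness and accuracy.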

Abstract: Due to scene complexity, sensor inaccuracies, and processing imprecision, point cloud corruption is inevitable. Over-reliance on input features is the root cause of DNN vulnerabilities. It remains unclear whether this issue exists in 3D tasks involving point clouds and whether reducing dependence on these features can enhance the model's robustness to corrupted point clouds. This study attempts to answer these questions. Specifically, we quantified the sensitivity of the DNN to point cloud features using Shapley values and found that models trained using traditional methods exhibited high sensitivity values for certain features. Furthermore, under an equal pruning ratio, prioritizing the pruning of highly sensitive features causes more severe damage to model performance than random pruning. We propose 'Desensitized Adversarial Training' (DesenAT), which generates adversarial samples using feature desensitization and conducts training within a self-distillation framework, aiming to alleviate the DNN's over-reliance on point cloud features by smoothing sensitivity. First, data points with high contribution components are eliminated, and spatial transformation is used to simulate corruption scenes, generate adversarial samples, and conduct adversarial training on the model. Next, to compensate for information loss in adversarial samples, we use the self-distillation method to transfer knowledge from clean samples to adversarial samples, and perform adversarial training in a distillation manner. Extensive experiments on ModelNet-C and PointCloud-C show that the proposed method can effectively improve the robustness of the model without reducing performance on clean datasets. The code is publicly available at https://github.com/JerkyT/DesenAT.

[27] Geometry-Aware Losses for Structure-Preserving Text-to-Sign Language Generation cs.CV | cs.CLPDF

Zetian Wu, Tianshuo Zhou, Stefan Lee, Liang Huang

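TL;DR: The paper proposes a geometric-constraint-based method for generating sign language video from text, targeting the neglect of human anatomical constraints in conventional methods and significantly improving motion naturalness and structural plausibility.

Details

Motivation: Conventional text-to-sign video generation methods ignore the anatomical constraints and coordination patterns of human skeletal motion, producing rigid or biomechanically implausible poses. The authors therefore propose a method that explicitly models relationships among skeletal joints via geometric constraints.

Result: The method narrows the performance gap between the previous best method and the ground-truth oracle by 56.51%, and reduces discrepancies in bone length and movement variance by 18.76% and 5.48%, respectively, significantly improving anatomical realism and motion naturalness.

Insight: Explicitly modeling geometric constraints effectively addresses anatomical implausibility in human motion generation; for complex articulated motion such as finger movements in particular, geometry-aware losses significantly improve generation quality.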

Abstract: Sign language translation from text to video plays a crucial role in enabling effective communication for Deaf and hard-of-hearing individuals. A major challenge lies in generating accurate and natural body poses and movements that faithfully convey intended meanings. Prior methods often neglect the anatomical constraints and coordination patterns of human skeletal motion, resulting in rigid or biomechanically implausible outputs. To address this, we propose a novel approach that explicitly models the relationships among skeletal joints (including shoulders, arms, and hands) by incorporating geometric constraints on joint positions, bone lengths, and movement dynamics. During training, we introduce a parent-relative reweighting mechanism to enhance finger flexibility and reduce motion stiffness. Additionally, bone-pose losses and bone-length constraints enforce anatomically consistent structures. Our method narrows the performance gap between the previous best and the ground-truth oracle by 56.51%, and further reduces discrepancies in bone length and movement variance by 18.76% and 5.48%, respectively, demonstrating significant gains in anatomical realism and motion naturalness.
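A bone-length constraint of the kind the abstract mentions can be sketched as follows. This is a minimal illustration, assuming a bone is the segment from a joint to its parent and that the penalty is a mean squared difference against reference lengths; the paper's actual loss, skeleton layout, and weighting are not given in the abstract, and all names below are hypothetical.

```python
import math

def bone_lengths(joints, parents):
    """Length of each bone, defined as the joint-to-parent distance.

    joints: list of (x, y, z); parents: parent index per joint, -1 for the root.
    """
    return [math.dist(joints[j], joints[p])
            for j, p in enumerate(parents) if p >= 0]

def bone_length_loss(pred_joints, ref_joints, parents):
    """Mean squared difference between predicted and reference bone lengths."""
    pred = bone_lengths(pred_joints, parents)
    ref = bone_lengths(ref_joints, parents)
    return sum((a - b) ** 2 for a, b in zip(pred, ref)) / len(ref)

if __name__ == "__main__":
    parents = [-1, 0, 1]                      # shoulder -> elbow -> wrist
    ref = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
    rotated = [(0, 0, 0), (0, 1, 0), (0, 2, 0)]    # same bones, new pose
    stretched = [(0, 0, 0), (2, 0, 0), (3, 0, 0)]  # upper arm twice as long
    print(bone_length_loss(rotated, ref, parents))
    print(bone_length_loss(stretched, ref, parents))
```

The rotated arm incurs no penalty because its bone lengths are unchanged, while the stretched arm is penalized, which is the sense in which such a term constrains structure without constraining pose.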

[28] Planning with Unified Multimodal Models cs.CVPDF

Yihao Sun, Zhilong Zhang, Yang Yu, Pierre-Luc Bacon

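TL;DR: The paper proposes Uni-Plan, a planning framework built on unified multimodal models (UMMs) that raises success rates on decision-making tasks by combining language and visual reasoning.

Details

Motivation: Existing decision-making methods rely mainly on language-based reasoning, which limits their capability and accuracy, whereas unified models with multimodal inputs and outputs have greater potential.

Result: On long-horizon planning tasks, Uni-Plan substantially outperforms VLM-based methods, requires no expert demonstrations, and scales better with data.

Insight: Generated visual content can better support complex decision-making tasks, and future work may further develop the reasoning capabilities of UMMs.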

Abstract: With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.

[29] Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy cs.CVPDF

Xiafeng Man, Zhipeng Wei, Jingjing Chen

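TL;DR: The paper proposes a differential-privacy-based method for detecting copyright infringement in text-to-image diffusion models, introducing the DPM framework and the CIDD dataset to address the lack of robustness and theoretical grounding in existing methods.

Details

Motivation: Large vision models such as Stable Diffusion can memorize and reproduce copyrighted content without authorization; existing detection methods lack robustness and theoretical support, so a new approach is needed.

Result: DPM reliably detects infringing content without access to the original training data or text prompts, offering an interpretable solution.

Insight: Differential privacy provides a theoretical foundation for copyright infringement detection, and DPM offers a practical tool for protecting intellectual property in the era of generative AI.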

Abstract: The widespread deployment of large vision models such as Stable Diffusion raises significant legal and ethical concerns, as these models can memorize and reproduce copyrighted content without authorization. Existing detection approaches often lack robustness and fail to provide rigorous theoretical underpinnings. To address these gaps, we formalize the concept of copyright infringement and its detection from the perspective of Differential Privacy (DP), and introduce the conditional sensitivity metric, a concept analogous to sensitivity in DP, that quantifies the deviation in a diffusion model’s output caused by the inclusion or exclusion of a specific training data point. To operationalize this metric, we propose D-Plus-Minus (DPM), a novel post-hoc detection framework that identifies copyright infringement in text-to-image diffusion models. Specifically, DPM simulates inclusion and exclusion processes by fine-tuning models in two opposing directions: learning or unlearning. In addition, to disentangle concept-specific influence from the global parameter shifts induced by fine-tuning, DPM computes confidence scores over orthogonal prompt distributions using statistical metrics. Moreover, to facilitate standardized benchmarking, we also construct the Copyright Infringement Detection Dataset (CIDD), a comprehensive resource for evaluating detection across diverse categories. Our results demonstrate that DPM reliably detects infringing content without requiring access to the original training dataset or text prompts, offering an interpretable and practical solution for safeguarding intellectual property in the era of generative AI.

Tomohiro Tanaka, Narumasa Tsutsumida

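TL;DR: The paper proposes a sensor-flexible flood monitoring method based on a pre-trained multimodal transformer that works with SAR data, multispectral data, or both, enabling rapid and accurate flood extent mapping.

Details

Motivation: Floods are frequent natural disasters; existing monitoring methods are limited by the weather dependence of single sensors and the computational cost of multi-sensor fusion, so a flexible and efficient solution is needed.

Result: In the optimal sensor-fusion scenario, the F1 score is 0.896 and mIoU is 0.886; F1 is 0.893 with multispectral input alone and 0.718 with SAR input alone.

Insight: Multimodal pre-training improves robustness and flexibility, making the model more practical in real-world disaster scenarios.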

Abstract: Floods are increasingly frequent natural disasters causing extensive human and economic damage, highlighting the critical need for rapid and accurate flood inundation mapping. While remote sensing technologies have advanced flood monitoring capabilities, operational challenges persist: single-sensor approaches face weather-dependent data availability and limited revisit periods, while multi-sensor fusion methods require substantial computational resources and large-scale labeled datasets. To address these limitations, this study introduces a novel sensor-flexible flood detection methodology by fine-tuning Presto, a lightweight (~0.4M parameters) multi-modal pre-trained transformer that processes both Synthetic Aperture Radar (SAR) and multispectral (MS) data at the pixel level. Our approach uniquely enables flood mapping using SAR-only, MS-only, or combined SAR+MS inputs through a single model architecture, addressing the critical operational need for rapid response with whatever sensor data becomes available first during disasters. We evaluated our method on the Sen1Floods11 dataset against the large-scale Prithvi-100M baseline (~100M parameters) across three realistic data availability scenarios. The proposed model achieved superior performance with an F1 score of 0.896 and mIoU of 0.886 in the optimal sensor-fusion scenario, outperforming the established baseline. Crucially, the model demonstrated robustness by maintaining effective performance in MS-only scenarios (F1: 0.893) and functional capabilities in challenging SAR-only conditions (F1: 0.718), confirming the advantage of multi-modal pre-training for operational flood mapping. Our parameter-efficient, sensor-flexible approach offers an accessible and robust solution for real-world disaster scenarios requiring immediate flood extent assessment regardless of sensor availability constraints.
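For readers less familiar with the two metrics reported above, the following Python sketch computes F1 and mIoU from pixel-level binary labels. It assumes mIoU is the unweighted mean of the flood and non-flood IoU values; the abstract does not specify the averaging, and the function name and toy labels are hypothetical.

```python
def f1_and_miou(pred, truth):
    """F1 for the flood class and mean IoU over flood / non-flood pixels.

    pred, truth: equal-length sequences of 0 (non-flood) or 1 (flood).
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    iou_flood = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    iou_other = tn / (tn + fp + fn) if (tn + fp + fn) else 0.0
    return f1, (iou_flood + iou_other) / 2

if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0, 0, 0]
    truth = [1, 0, 0, 0, 1, 1, 0, 0]
    f1, miou = f1_and_miou(pred, truth)
    print(f"F1 = {f1:.3f}, mIoU = {miou:.3f}")
```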

[31] GeLoc3r: Enhancing Relative Camera Pose Regression with Geometric Consistency Regularization cs.CV | cs.AIPDF

Jingxing Li, Yongjae Lee, Deliang Fan

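TL;DR: GeLoc3r is a regression approach that enhances relative camera pose estimation through geometric consistency regularization, significantly improving accuracy without increasing inference time.

Details

Motivation: The existing ReLoc3R excels in speed and regression accuracy, but geometric inconsistencies in its internal representations keep it from reaching the precision ceiling of correspondence-based methods such as MASt3R.

Result: GeLoc3r outperforms ReLoc3R on multiple benchmarks, for example with a 16% relative improvement in AUC@5° on CO3Dv2.

Insight: By injecting geometric knowledge during training rather than enforcing geometric computation at inference, GeLoc3r combines regression speed with geometric understanding.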

Abstract: The prior method ReLoc3R achieves breakthrough performance with fast 25ms inference and state-of-the-art regression accuracy, yet our analysis reveals subtle geometric inconsistencies in its internal representations that prevent reaching the precision ceiling of correspondence-based methods like MASt3R (which require 300ms per pair). In this work, we present GeLoc3r, a novel approach to relative camera pose estimation that enhances pose regression methods through Geometric Consistency Regularization (GCR). GeLoc3r overcomes the speed-accuracy dilemma by training regression networks to produce geometrically consistent poses without inference-time geometric computation. During training, GeLoc3r leverages ground-truth depth to generate dense 3D-2D correspondences, weights them using a FusionTransformer that learns correspondence importance, and computes geometrically consistent poses via weighted RANSAC. This creates a consistency loss that transfers geometric knowledge into the regression network. Unlike the FAR method, which requires both regression and geometric solving at inference, GeLoc3r only uses the enhanced regression head at test time, maintaining ReLoc3R’s fast speed and approaching MASt3R’s high accuracy. On challenging benchmarks, GeLoc3r consistently outperforms ReLoc3R, achieving significant improvements including 40.45% vs. 34.85% AUC@5° on the CO3Dv2 dataset (16% relative improvement), 68.66% vs. 66.70% AUC@5° on RealEstate10K, and 50.45% vs. 49.60% on MegaDepth1500. By teaching geometric consistency during training rather than enforcing it at inference, GeLoc3r represents a paradigm shift in how neural networks learn camera geometry, achieving both the speed of regression and the geometric understanding of correspondence methods.
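The "16% relative improvement" quoted for CO3Dv2 is a relative gain, not a gain in percentage points. The short Python sketch below reproduces the arithmetic from the AUC@5° figures in the abstract; the helper name and dictionary layout are illustrative.

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# (GeLoc3r, ReLoc3R) AUC@5 degrees values as reported in the abstract.
results = {
    "CO3Dv2": (40.45, 34.85),
    "RealEstate10K": (68.66, 66.70),
    "MegaDepth1500": (50.45, 49.60),
}

if __name__ == "__main__":
    for name, (new, old) in results.items():
        print(f"{name}: +{new - old:.2f} points, "
              f"{relative_improvement(new, old):.1f}% relative")
```

On CO3Dv2 the absolute gain is 5.60 points, which corresponds to roughly a 16% relative gain; the gains on the other two benchmarks are considerably smaller in both absolute and relative terms.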

Ye-eun Kim, Suhyeon Lim, Andrew J. Choi

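TL;DR: The paper proposes a multimodal ensemble vision transformer (MMeViT) for recognizing rehabilitation actions of stroke patients, combining multimodal data from IMU sensors and an RGB-D camera to fill a gap in existing HAR technology for stroke patients.

Details

Motivation: With demand for stroke rehabilitation therapy rising and medical resources in short supply, remote monitoring systems are an effective way to reduce the burden on medical staff. Existing HAR technology mostly targets healthy individuals and is poorly suited to the actions of stroke patients.

Result: The action data of stroke patients is found to cluster less well; the model can learn features of hard-to-cluster data, opening the possibility of extending it to rehabilitation assessment feedback.

Insight: Deep learning models can learn from the action features of stroke patients and may in future be used for more complex tasks such as rehabilitation assessment, not just simple action classification.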

Abstract: Rehabilitation therapy for stroke patients faces a supply shortage despite increasing demand. To address this issue, remote monitoring systems that reduce the burden on medical staff are emerging as a viable alternative. A key component of these remote monitoring systems is Human Action Recognition (HAR) technology, which classifies actions. However, existing HAR studies have primarily focused on non-disabled individuals, making them unsuitable for recognizing the actions of stroke patients. HAR research for stroke has largely concentrated on classifying relatively simple actions using machine learning rather than deep learning. In this study, we designed a system to monitor the actions of stroke patients, focusing on domiciliary upper limb Activities of Daily Living (ADL). Our system utilizes IMU (Inertial Measurement Unit) sensors and an RGB-D camera, which are the most common modalities in HAR. We directly collected a dataset through this system, investigated appropriate preprocessing, and proposed a deep learning model suitable for processing multimodal data. We analyzed the collected dataset and found that the action data of stroke patients is less clustered than that of non-disabled individuals. At the same time, we found that the proposed model learns similar tendencies for each label even in data whose features are difficult to cluster. This study suggests that a deep learning model that has learned the action features of stroke patients could be extended in future research beyond simple action recognition to feedback, such as assessment, that contributes to domiciliary rehabilitation. The code presented in this study is available at https://github.com/ye-Kim/MMeViT.

[33] Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis cs.CVPDF

Ruilang Wang, Shuotong Xu, Bowen Liu, Runlin Huang, Donglong Chen

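TL;DR: The paper proposes Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. Using vision-language models to distinguish diagnostically relevant regions from background, it reduces redundant masking, improves semantic alignment and cross-task generalizability, and outperforms existing methods on classification and detection.

Details

Motivation: Annotated medical imaging data is scarce, and existing self-supervised masked image modeling methods rely on random high-ratio masking, which is inefficient and semantically poorly aligned. A more efficient, controllable method is needed.

Result: Across medical imaging modalities including brain MRI, chest CT, and lung X-ray, classification accuracy improves by up to 3.1 percentage points, and BoxAP and MaskAP for detection improve by up to 1.3 and 1.1, respectively.

Insight: Controllable text-driven masking can significantly improve self-supervised learning, benefiting semantic alignment and model robustness in medical image analysis.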

Abstract: The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40% vs. 70%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.

[34] FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection cs.CV | cs.LGPDF

Ben Liang, Yuan Liu, Bingwen Qiu, Yihong Wang, Xiubao Sui

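TL;DR: FMC-DETR is a frequency-decoupled multi-domain coordination framework for tiny object detection in aerial imagery; its WeKat backbone and MDFC module balance global context modeling with fine-grained detail.

Details

Motivation: Tiny object detection in aerial images is challenging because of limited visual cues and the difficulty of modeling global context in complex scenes; delayed contextual fusion and inadequate non-linear modeling in existing methods create a performance bottleneck.

Result: On the VisDrone dataset, AP and AP50 improve by 6.5% and 8.2% over the baseline, with fewer parameters.

Insight: Frequency decoupling and multi-domain coordination are key to improving tiny object detection, and balancing global low-frequency information with local high-frequency detail is effective.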

Abstract: Aerial-view object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based search and rescue. Detecting tiny objects in high-resolution aerial imagery presents a long-standing challenge due to their limited visual cues and the difficulty of modeling global context in complex scenes. Existing methods are often hampered by delayed contextual fusion and inadequate non-linear modeling, failing to effectively use global information to refine shallow features and thus encountering a performance bottleneck. To address these challenges, we propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection. First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception in shallow features while preserving fine-grained details, and employs Kolmogorov-Arnold networks to achieve adaptive non-linear modeling of multi-scale dependencies. Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction. Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to balance detail preservation and global enhancement. Extensive experiments on benchmark aerial-view datasets demonstrate that FMC-DETR achieves state-of-the-art performance with fewer parameters. On the challenging VisDrone dataset, our model achieves improvements of 6.5% AP and 8.2% AP50 over the baseline, highlighting its effectiveness in tiny object detection. The code can be accessed at https://github.com/bloomingvision/FMC-DETR.

[35] Follow-Your-Preference: Towards Preference-Aligned Image Inpainting cs.CVPDF

Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, Yue Ma

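TL;DR: The paper studies preference-aligned image inpainting, using direct preference optimization and public reward models to build preference training datasets; experiments show that an ensemble of reward models mitigates bias and improves performance.

Details

Motivation: Preference alignment is under-explored in image inpainting; the authors revisit the problem with basic methods and aim to provide a simple, solid baseline.

Result: The aligned models significantly outperform prior methods on standard metrics, GPT-4 assessments, and human evaluations.

Insight: Reward models, despite their biases, can still be used for preference alignment; ensembling multiple models is an effective strategy for improving robustness and generalization.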

Abstract: This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, make them prone to reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier. Our code is open-sourced at: https://github.com/shenytzzz/Follow-Your-Preference.
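One way to picture how an ensemble of reward models could produce preference pairs is the Python sketch below. It is an illustration under stated assumptions, not the authors' pipeline: scores from each reward model are z-normalized across candidates, averaged, and the best and worst candidates become the chosen and rejected samples. The function names and toy scores are hypothetical.

```python
from statistics import mean, pstdev

def zscore(xs):
    """Standardize one reward model's scores so scales are comparable."""
    mu, sd = mean(xs), pstdev(xs)
    return [0.0 for _ in xs] if sd == 0 else [(x - mu) / sd for x in xs]

def ensemble_preference_pair(scores_by_model):
    """Pick (chosen, rejected) candidate indices from averaged z-scores.

    scores_by_model: one list of per-candidate scores for each reward model.
    """
    normalized = [zscore(s) for s in scores_by_model]
    n = len(scores_by_model[0])
    combined = [mean([m[i] for m in normalized]) for i in range(n)]
    chosen = max(range(n), key=combined.__getitem__)
    rejected = min(range(n), key=combined.__getitem__)
    return chosen, rejected

if __name__ == "__main__":
    # Three candidates scored by three models; the third model is an
    # outlier that strongly favours candidate 0.
    scores = [[0.2, 0.9, 0.5], [10, 30, 20], [0.99, 0.40, 0.45]]
    print(ensemble_preference_pair(scores))
    print(ensemble_preference_pair([scores[2]]))
```

In the toy data the outlier model alone would prefer candidate 0, while the ensemble prefers candidate 1, which mirrors the abstract's point that combining reward models dampens individual biases.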

[36] CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP cs.CV | cs.AIPDF

Na Min An, Inha Kang, Minhyun Lee, Hyunjung Shim

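TL;DR: CoPatch is a zero-shot referring image segmentation method that leverages untapped spatial knowledge in CLIP, significantly improving spatial grounding.

Details

Motivation: Foundational vision-language models such as CLIP excel at aligning images and text but struggle to understand spatial relationships, which hurts performance on referring image segmentation.

Result: On RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut, CoPatch improves mIoU by 2 to 7 points without additional training.

Insight: Recovering and leveraging the spatial knowledge latent in vision-language models is crucial for improving zero-shot RIS.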

Abstract: Spatial grounding is crucial for referring image segmentation (RIS), where the goal of the task is to localize an object described by language. Current foundational vision-language models (VLMs), such as CLIP, excel at aligning images and text but struggle with understanding spatial relationships. Within the language stream, most existing methods focus on the primary noun phrase when extracting local text features, undermining contextual tokens. Within the vision stream, CLIP generates similar features for images with different spatial layouts, resulting in limited sensitivity to spatial structure. To address these limitations, we propose CoPatch, a zero-shot RIS framework that leverages internal model components to enhance spatial representations in both text and image modalities. For language, CoPatch constructs hybrid text features by incorporating context tokens carrying spatial cues. For vision, it extracts patch-level image features using our novel path discovered from intermediate layers, where spatial structure is better preserved. These enhanced features are fused into a clustered image-text similarity map, CoMap, enabling precise mask selection. As a result, CoPatch significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+2–7 mIoU) without requiring any additional training. Our findings underscore the importance of recovering and leveraging the untapped spatial knowledge inherently embedded in VLMs, thereby paving the way for opportunities in zero-shot RIS.

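[37] Deep Learning for Oral Health: Benchmarking ViT, DeiT, BEiT, ConvNeXt, and Swin Transformer cs.CV | 68U10: Image processing

Ajo Babu George, Sadhvik Bathini, Niranjana S R

TL;DR: This study systematically evaluates and compares five state-of-the-art transformer-based architectures (ViT, DeiT, ConvNeXt, Swin Transformer, and BEiT) on multi-class dental disease classification, with particular attention to the practical challenge of data imbalance. ConvNeXt performs best with 81.06% validation accuracy, followed by BEiT and Swin Transformer. ViT and DeiT struggle on caries-related classes.

Details

Motivation: Existing literature often overlooks practical issues such as data imbalance; this study aims to fill that gap and inform model selection for clinical dental diagnostic tools.

Result: ConvNeXt performs best (81.06% validation accuracy), followed by BEiT and Swin Transformer (80.00% and 79.73%); ViT and DeiT are weaker on caries-related classes.

Insight: ConvNeXt, Swin Transformer, and BEiT classify dental diseases reliably and are promising for clinical imaging diagnosis. Data imbalance significantly affects model performance and deserves focused attention in future work.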

Abstract: Objective: The aim of this study was to systematically evaluate and compare the performance of five state-of-the-art transformer-based architectures - Vision Transformer (ViT), Data-efficient Image Transformer (DeiT), ConvNeXt, Swin Transformer, and Bidirectional Encoder Representation from Image Transformers (BEiT) - for multi-class dental disease classification. The study specifically focused on addressing real-world challenges such as data imbalance, which is often overlooked in existing literature. Study Design: The Oral Diseases dataset was used to train and validate the selected models. Performance metrics, including validation accuracy, precision, recall, and F1-score, were measured, with special emphasis on how well each architecture managed imbalanced classes. Results: ConvNeXt achieved the highest validation accuracy at 81.06%, followed by BEiT at 80.00% and Swin Transformer at 79.73%, all demonstrating strong F1-scores. ViT and DeiT achieved accuracies of 79.37% and 78.79%, respectively, but both struggled particularly with Caries-related classes. Conclusions: ConvNeXt, Swin Transformer, and BEiT showed reliable diagnostic performance, making them promising candidates for clinical application in dental imaging. These findings provide guidance for model selection in future AI-driven oral disease diagnostic tools and highlight the importance of addressing data imbalance in real-world scenarios.
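The abstract's emphasis on precision, recall, and F1 under class imbalance can be illustrated with a short plain-Python sketch. It computes per-class scores and their macro average, and shows how overall accuracy can look acceptable while a minority class is missed entirely. The labels, predictions, and function names are made-up assumptions for illustration, not data or code from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_prf(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} computed one-vs-rest."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_prf(y_true, y_pred, classes)
    return sum(f1 for _, _, f1 in scores.values()) / len(classes)


if __name__ == "__main__":
    # Hypothetical imbalanced toy set: the classifier never predicts the rare class.
    y_true = ["healthy"] * 8 + ["caries"] * 2
    y_pred = ["healthy"] * 10
    print("accuracy:", accuracy(y_true, y_pred))
    print("macro F1:", macro_f1(y_true, y_pred, ["healthy", "caries"]))
```

In this toy case accuracy stays high because the majority class dominates, while the macro F1 drops sharply, which is the kind of gap that per-class reporting is meant to expose.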

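[38] HTMA-Net: Towards Multiplication-Avoiding Neural Networks via Hadamard Transform and In-Memory Computing cs.CV | cs.AI

Emadeldeen Hamdan, Ahmet Enis Cetin

TL;DR: HTMA-Net combines the Hadamard transform with multiplication-avoiding SRAM-based in-memory computing to reduce multiplications in deep neural networks while maintaining accuracy, targeting edge devices.

Details

Motivation: Reducing multiplications is key to deploying deep neural networks efficiently on edge devices; HTMA-Net aims to lower computational complexity and parameter count.

Result: On ResNet models such as ResNet-18, HTMA-Net eliminates up to 52% of multiplications while keeping accuracy comparable.

Insight: Combining structured Hadamard transform layers with multiplication-avoiding in-memory computing is an effective route to efficient deep learning architectures.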

Abstract: Reducing the cost of multiplications is critical for efficient deep neural network deployment, especially in energy-constrained edge devices. In this work, we introduce HTMA-Net, a novel framework that integrates the Hadamard Transform (HT) with multiplication-avoiding (MA) SRAM-based in-memory computing to reduce arithmetic complexity while maintaining accuracy. Unlike prior methods that only target multiplications in convolutional layers or focus solely on in-memory acceleration, HTMA-Net selectively replaces intermediate convolutions with Hybrid Hadamard-based transform layers whose internal convolutions are implemented via multiplication-avoiding in-memory operations. We evaluate HTMA-Net on ResNet-18 using CIFAR-10, CIFAR-100, and Tiny ImageNet, and provide a detailed comparison against regular, MF-only, and HT-only variants. Results show that HTMA-Net eliminates up to 52% of multiplications compared to baseline ResNet-18, ResNet-20, and ResNet-50 models, while achieving comparable accuracy in evaluation and significantly reducing computational complexity and the number of parameters. Our results demonstrate that combining structured Hadamard transform layers with SRAM-based in-memory computing multiplication-avoiding operators is a promising path towards efficient deep learning architectures.
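One reason the Hadamard transform suits multiplication-avoiding designs is that its matrix entries are all +1 or -1, so the transform needs only additions and subtractions. The sketch below is the standard fast Walsh-Hadamard butterfly in plain Python. It is not the HTMA-Net layer or its in-memory implementation, and the function name `fwht` is an assumption.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length sequence.

    Uses only additions and subtractions: O(n log n) of them, no multiplications
    on the data.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    out = list(values)
    half = 1
    while half < n:
        # Combine blocks of size `half` pairwise (butterfly step).
        for start in range(0, n, 2 * half):
            for j in range(start, start + half):
                a, b = out[j], out[j + half]
                out[j], out[j + half] = a + b, a - b
        half *= 2
    return out


if __name__ == "__main__":
    x = [1, 0, 1, 0]
    print(fwht(x))        # transform of x
    print(fwht(fwht(x)))  # applying it twice returns len(x) * x
```

Applying the unnormalized transform twice returns the input scaled by its length, which gives a quick self-check for the butterfly.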

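[39] Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM cs.CV

Junxiao Xue, Quan Deng, Xuecheng Wu, Kelu Yao, Xinyi Yin

TL;DR: The paper introduces ChangeIMTI, a new large-scale interactive multi-task dataset, and ChangeVG, a vision-guided vision-language model with dual-granularity awareness, for change understanding in bi-temporal remote sensing images.

Details

Motivation: Existing datasets lack deep understanding and interactivity for remote sensing change understanding tasks (change captioning, classification, counting, and localization), which limits model performance.

Result: On change captioning, the method outperforms the strongest baseline by 1.39 points on the S*m metric.

Insight: Combining dual-granularity features with high-level semantics is crucial for remote sensing change understanding, and large-scale multi-task datasets help improve model performance.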

Abstract: Remote sensing change understanding (RSCU) is essential for analyzing remote sensing images and understanding how human activities affect the environment. However, existing datasets lack deep understanding and interactions in the diverse change captioning, counting, and localization tasks. To tackle these gaps, we construct ChangeIMTI, a new large-scale interactive multi-task instruction dataset that encompasses four complementary tasks including change captioning, binary change classification, change counting, and change localization. Building upon this new dataset, we further design a novel vision-guided vision-language model (ChangeVG) with dual-granularity awareness for bi-temporal remote sensing images (i.e., two remote sensing images of the same area at different times). The introduced vision-guided module is a dual-branch architecture that synergistically combines fine-grained spatial feature extraction with high-level semantic summarization. These enriched representations further serve as the auxiliary prompts to guide large vision-language models (VLMs) (e.g., Qwen2.5-VL-7B) during instruction tuning, thereby facilitating the hierarchical cross-modal learning. We extensively conduct experiments across four tasks to demonstrate the superiority of our approach. Remarkably, on the change captioning task, our method outperforms the strongest method Semantic-CC by 1.39 points on the comprehensive S*m metric, which integrates the semantic similarity and descriptive accuracy to provide an overall evaluation of change caption. Moreover, we also perform a series of ablation studies to examine the critical components of our method.

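[40] Stochastic Interpolants via Conditional Dependent Coupling cs.CV

Chenrui Ma, Xi Xiao, Tianyang Wang, Xiao Wang, Yanning Shen

TL;DR: The paper proposes a unified multistage generative framework based on Conditional Dependent Coupling that addresses the trade-off between computational cost and fidelity in existing image generation models.

Details

Motivation: Existing image generation models (VAE-based or pixel-space) are limited by information loss, computational cost, and stage-wise optimization. The paper aims for a unified generative framework that is both high-fidelity and efficient.

Result: Experiments show that the method achieves both high fidelity and efficiency across multiple resolutions.

Insight: A single unified Diffusion Transformer with a multistage interpolant strategy can substantially reduce computational cost without sacrificing fidelity.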

Abstract: Existing image generation models face critical challenges regarding the trade-off between computation and fidelity. Specifically, models relying on a pretrained Variational Autoencoder (VAE) suffer from information loss, limited detail, and the inability to support end-to-end training. In contrast, models operating directly in the pixel space incur prohibitive computational cost. Although cascade models can mitigate computational cost, stage-wise separation prevents effective end-to-end optimization, hampers knowledge sharing, and often results in inaccurate distribution learning within each stage. To address these challenges, we introduce a unified multistage generative framework based on our proposed Conditional Dependent Coupling strategy. It decomposes the generative process into interpolant trajectories at multiple stages, ensuring accurate distribution learning while enabling end-to-end optimization. Importantly, the entire process is modeled as a single unified Diffusion Transformer, eliminating the need for disjoint modules and also enabling knowledge sharing. Extensive experiments demonstrate that our method achieves both high fidelity and efficiency across multiple resolutions.

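[41] Benchmarking DINOv3 for Multi-Task Stroke Analysis on Non-Contrast CT cs.CV

Donghao Zhang, Yimin Chen, Kauê TN Duarte, Taha Aslan, Mohamed AlShamrani

TL;DR: The paper uses DINOv3, a state-of-the-art self-supervised vision transformer, to address the low contrast and low signal-to-noise ratio of non-contrast CT (NCCT) in stroke analysis, and establishes benchmarks for multi-task stroke analysis.

Details

Motivation: NCCT is essential for rapid stroke diagnosis, but its low contrast and signal-to-noise ratio limit analysis quality. The study explores the potential of DINOv3 for multi-task stroke analysis.

Result: DINOv3 performs strongly on multi-task stroke analysis, providing solid benchmarks for future work and showing the potential of self-supervised models in medical imaging.

Insight: Self-supervised vision transformers such as DINOv3 can significantly improve stroke analysis on NCCT, but performance still needs task-specific optimization. The low signal-to-noise ratio of medical images warrants further study.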

Abstract: Non-contrast computed tomography (NCCT) is essential for rapid stroke diagnosis but is limited by low image contrast and signal to noise ratio. We address this challenge by leveraging DINOv3, a state-of-the-art self-supervised vision transformer, to generate powerful feature representations for a comprehensive set of stroke analysis tasks. Our evaluation encompasses infarct and hemorrhage segmentation, anomaly classification (normal vs. stroke and normal vs. infarct vs. hemorrhage), hemorrhage subtype classification (EDH, SDH, SAH, IPH, IVH), and dichotomized ASPECTS classification (<=6 vs. >6) on multiple public and private datasets. This study establishes strong benchmarks for these tasks and demonstrates the potential of advanced self-supervised models to improve automated stroke diagnosis from NCCT, providing a clear analysis of both the advantages and current constraints of the approach. The code is available at https://github.com/Zzz0251/DINOv3-stroke.

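[42] Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents cs.CV

Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo

TL;DR: Earth-Agent is a new agentic framework that unifies RGB and multispectral Earth observation data and enables cross-modal, multi-step reasoning on complex tasks through a multimodal tool ecosystem. It also introduces Earth-Bench, a benchmark for comprehensive evaluation of agents.

Details

Motivation: Existing MLLMs are limited on multimodal Earth observation tasks: they cannot carry out complex multi-step reasoning or use domain-specific tools. Agent-based methods are promising, but current work remains confined to RGB perception and shallow reasoning.

Result: Experiments across different LLM backbones, together with comparisons against general agent frameworks and MLLMs, show that Earth-Agent performs well and demonstrate its potential for EO.

Insight: Earth-Agent sets a new paradigm for Earth observation analysis and advances the application of LLMs in scientific domains.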

Abstract: Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation. Our code and dataset will be publicly released.

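[43] Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction cs.CV

Bolin Chen, Ru-Ling Liao, Yan Ye, Jie Chen, Shanzhi Yin

TL;DR: Sparse2Dense uses sparse 3D keypoints as transmitted symbols to achieve ultra-low-bitrate human video compression and precise vertex prediction, supporting dynamic motion modeling, appearance synthesis, and geometric consistency.

Details

Motivation: In bandwidth-constrained multimedia applications, achieving ultra-low-bitrate human video compression and accurate vertex prediction at the same time is a key challenge that requires reconciling dynamic motion modeling, appearance synthesis, and geometric consistency.

Result: Experiments show that Sparse2Dense achieves competitive compression performance against traditional and generative video codecs while enabling precise vertex prediction, suiting applications such as real-time motion analysis and virtual human animation.

Insight: Sparse keypoints are a compact, efficient representation of human motion; combining them with generative models and geometry prediction can improve multimedia transmission efficiency in bandwidth-constrained settings.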

Abstract: For bandwidth-constrained multimedia applications, simultaneously achieving ultra-low bitrate human video compression and accurate vertex prediction remains a critical challenge, as it demands the harmonization of dynamic motion modeling, detailed appearance synthesis, and geometric consistency. To address this challenge, we propose Sparse2Dense, a keypoint-driven generative framework that leverages extremely sparse 3D keypoints as compact transmitted symbols to enable ultra-low bitrate human video compression and precise human vertex prediction. The key innovation is the multi-task learning-based and keypoint-aware deep generative model, which could encode complex human motion via compact 3D keypoints and leverage these sparse keypoints to estimate dense motion for video synthesis with temporal coherence and realistic textures. Additionally, a vertex predictor is integrated to learn human vertex geometry through joint optimization with video generation, ensuring alignment between visual content and geometric structure. Extensive experiments demonstrate that the proposed Sparse2Dense framework achieves competitive compression performance for human video over traditional/generative video codecs, whilst enabling precise human vertex prediction for downstream geometry applications. As such, Sparse2Dense is expected to facilitate bandwidth-efficient human-centric media transmission, such as real-time motion analysis, virtual human animation, and immersive entertainment.

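[44] TRAX: TRacking Axles for Accurate Axle Count Estimation cs.CV | cs.AI

Avinash Rai, Sandeep Jana, Vishal Vijay

TL;DR: The paper presents an end-to-end, video-based system for accurate vehicle axle counting that addresses the limitations of axle counting in dense environments, together with a novel TRAX algorithm that improves accuracy for long and occluded vehicles.

Details

Motivation: Accurate axle counting is essential for traffic management, toll collection, and infrastructure development. Existing methods perform poorly in dense environments and with long or occluded vehicles.

Result: The system is robust on real-world traffic videos, significantly reducing false positives and improving axle-count accuracy for long vehicles.

Insight: Machine vision can replace legacy roadside infrastructure and offers a scalable solution for intelligent transportation systems, particularly in complex environments.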

Abstract: Accurate counting of vehicle axles is essential for traffic control, toll collection, and infrastructure development. We present an end-to-end, video-based pipeline for axle counting that tackles limitations of previous works in dense environments. Our system leverages a combination of YOLO-OBB to detect and categorize vehicles, and YOLO to detect tires. Detected tires are intelligently associated to their respective parent vehicles, enabling accurate axle prediction even in complex scenarios. However, there are a few challenges in detection when it comes to scenarios with longer and occluded vehicles. We mitigate vehicular occlusions and partial detections for longer vehicles by proposing a novel TRAX (Tire and Axle Tracking) Algorithm to successfully track axle-related features between frames. Our method stands out by significantly reducing false positives and improving the accuracy of axle-counting for long vehicles, demonstrating strong robustness in real-world traffic videos. This work represents a significant step toward scalable, AI-driven axle counting systems, paving the way for machine vision to replace legacy roadside infrastructure.

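[45] Patch Rebirth: Toward Fast and Transferable Model Inversion of Vision Transformers cs.CV | cs.AI

Seongsoo Heo, Dong-Wan Choi

TL;DR: The paper proposes Patch Rebirth Inversion (PRI), a method that addresses the computational inefficiency of model inversion for Vision Transformers (ViTs). PRI incrementally detaches the most important patches while letting the remaining patches continue to be optimized, markedly improving inversion efficiency and quality.

Details

Motivation: Model inversion is widely used in data-free learning, but it is computationally expensive for ViTs because of their costly self-attention. Sparse Model Inversion (SMI) improves efficiency by pruning patches, but the authors find that discarding patches suppresses knowledge transfer.

Result: PRI is up to 10x faster than standard Dense Model Inversion (DMI) and 2x faster than SMI, while outperforming SMI in accuracy and matching DMI.

Insight: 1. Discarding patches can suppress knowledge transfer; 2. A progressive optimization strategy can balance efficiency and knowledge extraction; 3. The "Re-Birth effect" suggests a new direction for optimizing model inversion.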

Abstract: Model inversion is a widely adopted technique in data-free learning that reconstructs synthetic inputs from a pretrained model through iterative optimization, without access to original training data. Unfortunately, its application to state-of-the-art Vision Transformers (ViTs) poses a major computational challenge, due to their expensive self-attention mechanisms. To address this, Sparse Model Inversion (SMI) was proposed to improve efficiency by pruning and discarding seemingly unimportant patches, which were even claimed to be obstacles to knowledge transfer. However, our empirical findings suggest the opposite: even randomly selected patches can eventually acquire transferable knowledge through continued inversion. This reveals that discarding any prematurely inverted patches is inefficient, as it suppresses the extraction of class-agnostic features essential for knowledge transfer, along with class-specific features. In this paper, we propose Patch Rebirth Inversion (PRI), a novel approach that incrementally detaches the most important patches during the inversion process to construct sparse synthetic images, while allowing the remaining patches to continue evolving for future selection. This progressive strategy not only improves efficiency, but also encourages initially less informative patches to gradually accumulate more class-relevant knowledge, a phenomenon we refer to as the Re-Birth effect, thereby effectively balancing class-agnostic and class-specific knowledge. Experimental results show that PRI achieves up to 10x faster inversion than standard Dense Model Inversion (DMI) and 2x faster than SMI, while consistently outperforming SMI in accuracy and matching the performance of DMI.

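[46] Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection cs.CV | cs.AI

Mingfei Han, Haihong Hao, Jinxing Zhou, Zhihui Li, Yuhui Zheng

TL;DR: The paper uses a vision-language model's self-consistency (comparing long responses with short answers) to automatically generate training data that reduces hallucinations (such as non-existent objects or inaccurate attributes), without human annotation or external supervision.

Details

Motivation: Vision-language models often produce unreliable outputs because of hallucination, and existing methods rely on human annotation or external supervision, which is costly and hard to scale. The paper aims to cut cost and improve reliability through self-consistency checks and automatic training-data generation.

Result: The method significantly improves factual grounding and reliability on multiple benchmarks (such as AMBER and MMHal-Bench) while preserving instruction-following ability.

Insight: Short answers (for example, to binary questions) are usually more reliable and can serve as the reference signal for self-consistency checks; self-consistency offers a low-cost, efficient way to reduce hallucinations.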

Abstract: Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. Existing methods typically address these issues via extensive human annotations or external supervision from more powerful models. In this work, we present a novel framework that leverages the model’s self-consistency between long responses and short answers to generate preference pairs for training. We observe that short binary questions tend to yield highly reliable responses, which can be used to query the target model to evaluate and rank its generated responses. Specifically, we design a self-reflection pipeline where detailed model responses are compared against concise binary answers, and inconsistency signals are utilized to automatically curate high-quality training data without human annotations or external model-based supervision. By relying solely on self-consistency rather than external supervision, our method offers a scalable and efficient solution that effectively reduces hallucinations using unlabeled data. Extensive experiments on multiple benchmarks, i.e., AMBER, MultiObject-Hal (ROPE), Object HalBench, and MMHal-Bench, demonstrate significant improvements in factual grounding and reliability. Moreover, our approach maintains robust instruction-following ability, as evidenced by enhanced performance on LLaVA-Bench and MMBench.

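[47] TATTOO: Training-free AesTheTic-aware Outfit recOmmendation cs.CV

Yuntian Wu, Xiaonan Hu, Ziqi Zhou, Hao Lu

TL;DR: The paper proposes TATTOO, a training-free, aesthetic-aware outfit recommendation method that uses multimodal large language models (MLLMs) to generate a target-item description and an aesthetic chain-of-thought for efficient outfit recommendation.

Details

Motivation: Existing outfit recommendation methods usually require task-specific training on large-scale labeled data and lack explicit guidance from human aesthetics. The paper aims to streamline the conventional pipeline with a training-free paradigm while improving recommendation quality and aesthetic awareness.

Result: TATTOO outperforms existing training-based methods on the Aesthetic-100 dataset and demonstrates strong zero-shot retrieval on the Polyvore dataset.

Insight: A training-free paradigm built on MLLMs can greatly simplify the conventional recommendation pipeline while using aesthetic features to improve recommendations.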

Abstract: The global fashion e-commerce market relies significantly on intelligent and aesthetic-aware outfit-completion tools to promote sales. While previous studies have approached the problem of fashion outfit-completion and compatible-item retrieval, most of them require expensive, task-specific training on large-scale labeled data, and no effort is made to guide outfit recommendation with explicit human aesthetics. In the era of Multimodal Large Language Models (MLLMs), we show that the conventional training-based pipeline could be streamlined to a training-free paradigm, with better recommendation scores and enhanced aesthetic awareness. We achieve this with TATTOO, a Training-free AesTheTic-aware Outfit recommendation approach. It first generates a target-item description using MLLMs, followed by an aesthetic chain-of-thought used to distill the images into a structured aesthetic profile including color, style, occasion, season, material, and balance. By fusing the visual summary of the outfit with the textual description and aesthetics vectors using a dynamic entropy-gated mechanism, candidate items can be represented in a shared embedding space and be ranked accordingly. Experiments on a real-world evaluation set Aesthetic-100 show that TATTOO achieves state-of-the-art performance compared with existing training-based methods. Another standard Polyvore dataset is also used to measure the advanced zero-shot retrieval capability of our training-free method.

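[48] Increasing the Diversity in RGB-to-Thermal Image Translation for Automotive Applications cs.CV

Kaili Wang, Leonardo Ravaglia, Roberto Longo, Lore Goetschalckx, David Van Hamme

TL;DR: The paper proposes a multi-modal translation framework with CoAdaIN for diverse RGB-to-thermal image translation, addressing the shortage of thermal imaging datasets for automotive applications.

Details

Motivation: Thermal imaging has advantages in ADAS, but limited datasets and existing one-to-one translation methods restrict diversity, so more realistic translation methods are needed.

Result: The translated thermal images are more realistic and diverse than those of existing methods.

Insight: Component-wise style adaptation is an effective route to diverse image translation.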

Abstract: Thermal imaging in Advanced Driver Assistance Systems (ADAS) improves road safety with superior perception in low-light and harsh weather conditions compared to traditional RGB cameras. However, research in this area faces challenges due to limited dataset availability and poor representation in driving simulators. RGB-to-thermal image translation offers a potential solution, but existing methods focus on one-to-one mappings. We propose a one-to-many mapping using a multi-modal translation framework enhanced with our Component-aware Adaptive Instance Normalization (CoAdaIN). Unlike the original AdaIN, which applies styles globally, CoAdaIN adapts styles to different image components individually. The result, as we show, is more realistic and diverse thermal image translations. This is the accepted author manuscript of the paper published in IEEE Sensors Conference 2024. The final published version is available at 10.1109/SENSORS60989.2024.10785056.
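The contrast between global AdaIN and a component-wise variant can be sketched in a few lines. Standard AdaIN rescales content features so that their mean and standard deviation match a style's: AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y). The component-wise function below is only an assumed illustration of "adapting styles to different image components individually"; the component labels, style statistics, and function names are hypothetical, and the paper's actual CoAdaIN may differ.

```python
from statistics import fmean, pstdev


def adain(content, style_mean, style_std, eps=1e-5):
    """Standard AdaIN on a flat list of feature values (one channel)."""
    mu = fmean(content)
    sigma = pstdev(content)
    return [style_std * (x - mu) / (sigma + eps) + style_mean for x in content]


def component_adain(content, component_ids, styles, eps=1e-5):
    """Apply AdaIN separately inside each labelled component (assumed scheme).

    component_ids[i] names the component that content[i] belongs to;
    styles maps a component name to its (mean, std) style statistics.
    Values whose component has no style entry are left unchanged.
    """
    out = list(content)
    for comp, (style_mean, style_std) in styles.items():
        idx = [i for i, c in enumerate(component_ids) if c == comp]
        if not idx:
            continue
        restyled = adain([content[i] for i in idx], style_mean, style_std, eps)
        for i, value in zip(idx, restyled):
            out[i] = value
    return out


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 10.0, 20.0]
    labels = ["road", "road", "road", "car", "car"]  # hypothetical components
    styles = {"road": (0.0, 1.0), "car": (5.0, 0.5)}  # hypothetical style stats
    print(component_adain(features, labels, styles))
```

Sampling different style statistics per component is one plausible way such a scheme could yield several thermal renderings of the same RGB input, though the abstract does not specify the mechanism.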

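[49] LiDAR-based Human Activity Recognition through Laplacian Spectral Analysis cs.CV | cs.HC

Sasan Sharifipour, Constantino Álvarez Casado, Le Nguyen, Tharindu Ekanayake, Manuel Lage Cañellas

TL;DR: The paper proposes a human activity recognition method based on LiDAR point clouds and Laplacian spectral analysis. It classifies efficiently using graph spectral features and temporal statistics, and outperforms skeleton-based baselines on the MM-Fi dataset.

Details

Motivation: LiDAR-based human activity recognition (HAR) preserves privacy and is robust to illumination, both advantages over cameras, but it needs efficient, interpretable feature extraction that does not depend on complex end-to-end deep learning.

Result: On MM-Fi, the method reaches 94.4% accuracy on 13 rehabilitation activities and 90.3% on all 27 activities, surpassing the skeleton-based baselines.

Insight: Laplacian spectral analysis extracts efficient features directly from point cloud geometry, avoiding the black-box nature of deep learning while maintaining high accuracy.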

Abstract: Human Activity Recognition supports applications in healthcare, manufacturing, and human-machine interaction. LiDAR point clouds offer a privacy-preserving alternative to cameras and are robust to illumination. We propose a HAR method based on graph spectral analysis. Each LiDAR frame is mapped to a proximity graph (epsilon-graph) and the Laplacian spectrum is computed. Eigenvalues and statistics of eigenvectors form pose descriptors, and temporal statistics over sliding windows yield fixed vectors for classification with support vector machines and random forests. On the MM-Fi dataset with 40 subjects and 27 activities, under a strict subject-independent protocol, the method reaches 94.4% accuracy on a 13-class rehabilitation set and 90.3% on all 27 activities. It also surpasses the skeleton-based baselines reported for MM-Fi. The contribution is a compact and interpretable feature set derived directly from point cloud geometry that provides an accurate and efficient alternative to end-to-end deep learning.
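The per-frame step described in the abstract (point cloud to epsilon-graph to Laplacian spectrum) can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is unweighted, the Laplacian is the combinatorial L = D - A, and eigenvalues come from a small pure-Python Jacobi routine chosen only to keep the sketch dependency-free (a numerical library's symmetric eigensolver would be the usual choice). The eigenvector statistics, sliding-window features, and classifiers from the paper are omitted, and all function names are hypothetical.

```python
import math


def epsilon_graph_laplacian(points, eps):
    """Combinatorial Laplacian L = D - A of the graph linking points within eps."""
    n = len(points)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                lap[i][j] = lap[j][i] = -1.0
                lap[i][i] += 1.0
                lap[j][j] += 1.0
    return lap


def jacobi_eigenvalues(matrix, sweeps=100, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, sorted ascending."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


def laplacian_spectrum(points, eps):
    """Sorted Laplacian eigenvalues of one frame's epsilon-graph."""
    return jacobi_eigenvalues(epsilon_graph_laplacian(points, eps))


if __name__ == "__main__":
    frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
    print(laplacian_spectrum(frame, eps=1.5))
```

The number of near-zero eigenvalues equals the number of connected components of the graph, which is one reason the spectrum is an interpretable descriptor of the point cloud's shape.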

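[50] Learning Regional Monsoon Patterns with a Multimodal Attention U-Net cs.CV | cs.AI | cs.LG

Swaib Ilias Mazumder, Manish Kumar, Aparajita Khan

TL;DR: The paper proposes a multimodal deep learning framework (an attention-guided U-Net) for high-resolution classification of Indian monsoon precipitation. It integrates multiple geospatial data sources and outperforms existing methods, especially in extreme rainfall categories.

Details

Motivation: Accurate monsoon rainfall prediction is vital for India's agriculture and water management, but existing models are limited by coarse-resolution data and complex regional variability.

Result: The multimodal framework outperforms unimodal baselines and existing deep learning methods, especially in extreme rainfall categories.

Insight: Combining multiple geospatial modalities with attention mechanisms captures regional monsoon patterns effectively and improves precipitation prediction accuracy.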

Abstract: Accurate monsoon rainfall prediction is vital for India’s agriculture, water management, and climate risk planning, yet remains challenging due to sparse ground observations and complex regional variability. We present a multimodal deep learning framework for high-resolution precipitation classification that leverages satellite and Earth observation data. Unlike previous rainfall prediction models based on coarse 5-50 km grids, we curate a new 1 km resolution dataset for five Indian states, integrating seven key geospatial modalities: land surface temperature, vegetation (NDVI), soil moisture, relative humidity, wind speed, elevation, and land use, covering the June-September 2024 monsoon season. Our approach uses an attention-guided U-Net architecture to capture spatial patterns and temporal dependencies across modalities, combined with focal and dice loss functions to handle rainfall class imbalance defined by the India Meteorological Department (IMD). Experiments demonstrate that our multimodal framework consistently outperforms unimodal baselines and existing deep learning methods, especially in extreme rainfall categories. This work contributes a scalable framework, benchmark dataset, and state-of-the-art results for regional monsoon forecasting, climate resilience, and geospatial AI applications in India.
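The abstract pairs focal and dice losses to cope with rainfall class imbalance. The sketch below shows the usual definitions of both on a flat list of per-pixel class probabilities, in plain Python. The equal weighting in `combined_loss`, the default `gamma`, the smoothing constant, and all names are assumptions for illustration; the paper's exact formulation is not given in the abstract.

```python
import math


def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Mean focal loss: -(1 - p_t)^gamma * log(p_t), with p_t the true-class probability.

    gamma = 0 reduces to ordinary cross-entropy; larger gamma down-weights
    pixels that are already classified confidently.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = min(max(p[y], eps), 1.0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(labels)


def dice_loss(probs, labels, num_classes, smooth=1.0):
    """One minus the mean soft Dice coefficient over classes."""
    dice_sum = 0.0
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred_mass = sum(p[c] for p in probs)
        true_mass = sum(1 for y in labels if y == c)
        dice_sum += (2.0 * intersection + smooth) / (pred_mass + true_mass + smooth)
    return 1.0 - dice_sum / num_classes


def combined_loss(probs, labels, num_classes, focal_weight=0.5, gamma=2.0):
    """Weighted sum of focal and dice losses (the weighting is an assumption)."""
    focal = focal_loss(probs, labels, gamma)
    dice = dice_loss(probs, labels, num_classes)
    return focal_weight * focal + (1.0 - focal_weight) * dice


if __name__ == "__main__":
    # Three pixels, three hypothetical rainfall classes; the last pixel is misclassified.
    probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
    labels = [0, 1, 2]
    print(combined_loss(probs, labels, num_classes=3))
```

Focal loss concentrates the gradient on hard, often rare, pixels, while dice loss scores overlap per class regardless of class frequency, which is why the two are commonly combined for imbalanced segmentation-style targets.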

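[51] SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction cs.CV

Yihao Ding, Soyeon Caren Han, Yanbei Jiang, Yan Li, Zechuan Li

TL;DR: SynDoc is a hybrid discriminative-generative framework that improves domain-specific document key information extraction through synthetic data generation and adaptive instruction tuning.

Details

Motivation: Existing large (multimodal) language models suffer from hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning data in domain-specific visually rich document understanding.

Result: SynDoc delivers efficient, precise, and scalable performance on domain-adaptive document key information extraction.

Insight: A hybrid discriminative-generative approach can combine domain knowledge with general world knowledge, improving the stability and accuracy of document understanding.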

Abstract: Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges due to the complexity and sensitivity of documents in fields such as medicine, finance, and material science. Existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results but face limitations such as hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets. This paper introduces SynDoc, a novel framework that combines discriminative and generative models to address these challenges. SynDoc employs a robust synthetic data generation workflow, using structural information extraction and domain-specific query generation to produce high-quality annotations. Through adaptive instruction tuning, SynDoc improves the discriminative model’s ability to extract domain-specific knowledge. At the same time, a recursive inferencing mechanism iteratively refines the output of both models for stable and accurate predictions. This framework demonstrates scalable, efficient, and precise document understanding and bridges the gap between domain-specific adaptation and general world knowledge for document key information extraction tasks.

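[52] Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing cs.CV | cs.AI

Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal, Siddharth Roheda

TL;DR: Vid-Freeze is a new adversarial attack that uses carefully crafted perturbations to freeze motion synthesis in image-to-video (I2V) models, preventing malicious video generation.

Details

Motivation: Rapid progress in image-to-video generation raises the risk of malicious content synthesis. Prior defenses such as I2VGuard attempt to immunize images, but effective protection that blocks motion synthesis remains underexplored.

Result: Experiments show that Vid-Freeze protects images effectively and blocks malicious video generation, demonstrating the potential of attention attacks as a defense against I2V misuse.

Insight: The attention mechanism is a key weakness of I2V models, and adversarial attacks that target it offer a new approach to defending against malicious video generation.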

Abstract: The rapid progress of image-to-video (I2V) generation models has introduced significant risks, enabling video synthesis from static images and facilitating deceptive or malicious content creation. While prior defenses such as I2VGuard attempt to immunize images, effective and principled protection to block motion remains underexplored. In this work, we introduce Vid-Freeze - a novel attention-suppressing adversarial attack that adds carefully crafted adversarial perturbations to images. Our method explicitly targets the attention mechanism of I2V models, completely disrupting motion synthesis while preserving semantic fidelity of the input image. The resulting immunized images generate stand-still or near-static videos, effectively blocking malicious content creation. Our experiments demonstrate the impressive protection provided by the proposed approach, highlighting the importance of attention attacks as a promising direction for robust and proactive defenses against misuse of I2V generation models.

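[53] Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification cs.CV

Hao Liu, Yongjie Zheng, Yuhan Kang, Mingyang Zhang, Maoguo Gong

TL;DR: The paper proposes a balanced diffusion-guided fusion (BDGF) framework that uses multimodal diffusion features to guide a multi-branch network for land-cover classification. It addresses modality imbalance in multimodal DDPMs and improves classification through adaptive modality masking and hierarchical feature guidance.

Details

Motivation: In multimodal remote sensing classification, pre-trained multimodal DDPMs can suffer from modality imbalance, and how to use diffusion features to guide the extraction of complementary, diverse features remains unclear.

Result: Experiments on four multimodal remote sensing datasets confirm BDGF's superior classification performance.

Insight: Balanced data distributions and hierarchical feature guidance are key to multimodal remote sensing classification, and mutual learning effectively improves collaboration between branches.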

Abstract: Deep learning-based techniques for the analysis of multimodal remote sensing data have become popular due to their ability to effectively integrate complementary spatial, spectral, and structural information from different sensors. Recently, denoising diffusion probabilistic models (DDPMs) have attracted attention in the remote sensing community due to their powerful ability to capture robust and complex spatial-spectral distributions. However, pre-training multimodal DDPMs may result in modality imbalance, and effectively leveraging diffusion features to guide complementary diversity feature extraction remains an open question. To address these issues, this paper proposes a balanced diffusion-guided fusion (BDGF) framework that leverages multimodal diffusion features to guide a multi-branch network for land-cover classification. Specifically, we propose an adaptive modality masking strategy to encourage the DDPMs to obtain a modality-balanced rather than spectral image-dominated data distribution. Subsequently, these diffusion features hierarchically guide feature extraction among CNN, Mamba, and transformer networks by integrating feature fusion, group channel attention, and cross-attention mechanisms. Finally, a mutual learning strategy is developed to enhance inter-branch collaboration by aligning the probability entropy and feature similarity of individual subnetworks. Extensive experiments on four multimodal remote sensing datasets demonstrate that the proposed method achieves superior classification performance. The code is available at https://github.com/HaoLiu-XDU/BDGF.

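[54] Seeing Symbols, Missing Cultures: Probing Vision-Language Models’ Reasoning on Fire Imagery and Cultural Meaning cs.CV | cs.AI | cs.CL

Haorui Yu, Qiufeng Yi, Yijia Chu, Yang Zhao

TL;DR: The paper studies the limits of vision-language models (VLMs) in understanding cultural imagery and finds that they rely on superficial pattern matching rather than genuine cultural understanding. A diagnostic framework reveals systematic biases between Western and non-Western cultural images.

Details

Motivation: Although VLMs perform well on multimodal tasks, their cultural understanding has clear gaps, especially on non-Western cultures and emergency scenes. The study aims to expose these limitations and stress the importance of cultural understanding and fairness.

Result: VLMs correctly identify prominent Western festivals but perform poorly on non-Western cultural events and emergency scenes, often giving vague labels or seriously misclassifying them.

Insight: VLMs rely on symbolic shortcuts rather than deep cultural understanding, which makes cultural evaluation necessary in multimodal systems to ensure interpretability and fairness.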

Abstract: Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.

Siheng Wang, Zhengdao Li, Yanshu Li, Canran Xiao, Haibo Zhan

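TL;DR: C3-OWD proposes a curriculum cross-modal contrastive learning framework that addresses both robustness and diversity in open-world object detection.

Details

Motivation: Existing open-world object detection methods either lack robustness or offer limited category diversity, so they cannot handle unseen categories and extreme environments at the same time.

Result: The method performs strongly on FLIR, OV-COCO, and OV-LVIS, reaching 80.1 AP50, 48.6 AP50-Novel, and 35.7 mAPr, respectively.

Insight: Curriculum learning combined with an EMA mechanism can optimize robustness and diversity together, offering a new direction for open-world detection.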

Abstract: Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages a vision-language alignment strategy for category diversity but struggles under extreme environments. This trade-off leaves robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose C3-OWD, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage 1 enhances robustness by pretraining with RGBT data, while Stage 2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between the two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves $80.1$ AP$^{50}$ on FLIR, $48.6$ AP$^{50}_{\text{Novel}}$ on OV-COCO, and $35.7$ mAP$_r$ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.
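
The EMA mechanism mentioned in this abstract can be illustrated with the standard exponential-moving-average parameter update. This is a minimal sketch and not the authors' implementation: the function name `ema_update`, the `decay` value, and the toy parameter dictionaries are assumptions made for illustration.

```python
import numpy as np


def ema_update(ema_params, new_params, decay=0.99):
    """Blend freshly trained parameters into a slowly moving EMA copy.

    Both arguments are dicts mapping parameter names to arrays (hypothetical layout).
    A decay close to 1 keeps the EMA copy close to the earlier-stage weights.
    """
    return {
        name: decay * ema_params[name] + (1.0 - decay) * new_params[name]
        for name in ema_params
    }


# Toy example: weights after the first stage and after one step of the second stage.
stage1 = {"w": np.array([1.0, 2.0])}
stage2 = {"w": np.array([3.0, 0.0])}
ema = ema_update(stage1, stage2, decay=0.9)
```

With `decay=0.9`, the EMA copy moves only one tenth of the way toward the new weights, which is the intuition behind using it to limit forgetting between stages.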

[56] Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning cs.CVPDF

Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye

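TL;DR: The paper proposes a training-free visual reasoning framework that decouples reasoning from perception: an LLM handles high-level reasoning while an LMM serves only as a visual question-answering engine, which markedly reduces visually unfounded reasoning steps and improves reasoning fidelity.

Details

Motivation: As their reasoning chains grow longer, current large multimodal models (LMMs) rely increasingly on textual logic and drift away from the visual information, which leads to reasoning errors. To address this, the authors decouple the reasoning and perception processes.

Result: Comprehensive evaluations show that the framework effectively governs the visual reasoning process, significantly reduces visually unfounded reasoning steps, and substantially improves reasoning fidelity.

Insight: Decoupling reasoning from perception is an effective way to improve the reasoning fidelity of multimodal models, and it yields gains without additional training or architectural changes.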

Abstract: Significant advancements in the reasoning capabilities of Large Language Models (LLMs) are now driven by test-time scaling laws, particularly those leveraging extended Chain-of-Thought (CoT) reasoning. Inspired by these breakthroughs, researchers have extended these paradigms to Large Multimodal Models (LMMs). However, a critical limitation emerges: as their reasoning chains extend, LMMs increasingly rely on textual logic, progressively losing grounding in the underlying visual information. This leads to reasoning paths that diverge from the image content, culminating in erroneous conclusions. To address this, we introduce a strikingly simple yet effective training-free visual-reasoning pipeline. The core concept is to decouple the reasoning and perception processes. A powerful LLM orchestrates the high-level reasoning, strategically interrogating an LMM to extract specific visual information required for its logical chain. The LMM, in turn, functions exclusively as a visual question-answering engine, supplying the necessary perceptual details on demand. This lightweight, plug-and-play approach requires no additional training or architectural changes. Comprehensive evaluations validate that our framework effectively governs the visual reasoning process, leading to a significant reduction in visually unfounded reasoning steps and a substantial improvement in reasoning fidelity.
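
The decoupled loop described in this abstract might be sketched as follows. This is an illustrative control flow only: `decoupled_reasoning`, `plan_next`, `ask_image`, and the toy stand-ins for the LLM and the LMM are hypothetical names, and the paper's actual prompts and stopping rule are not given in the abstract.

```python
from typing import Callable, List, Optional, Tuple

Fact = Tuple[str, str]  # (question asked about the image, answer returned)


def decoupled_reasoning(
    question: str,
    plan_next: Callable[[str, List[Fact]], Tuple[str, str]],
    ask_image: Callable[[str], str],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Fact]]:
    """Let a text-only reasoner query a perception model until it can answer.

    plan_next stands in for the LLM: it returns ("ask", sub_question) or
    ("answer", final_answer). ask_image stands in for the LMM used purely as
    a visual question-answering engine.
    """
    facts: List[Fact] = []
    for _ in range(max_steps):
        action, text = plan_next(question, facts)
        if action == "answer":
            return text, facts
        facts.append((text, ask_image(text)))
    return None, facts  # step budget exhausted without a final answer


# Toy stand-ins so the sketch runs without any model.
def toy_planner(question, facts):
    if not facts:
        return "ask", "What color is the car?"
    return "answer", facts[-1][1]


def toy_vqa(sub_question):
    return "red"


answer, facts = decoupled_reasoning(
    "What color is the car in the image?", toy_planner, toy_vqa
)
```

Because every perceptual claim enters the chain only through `ask_image`, each reasoning step can be traced back to an explicit visual query.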

Bin Wu, Yahui Liu, Chi Zhang, Yao Zhao, Wei Wang

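TL;DR: LRPO is the first to apply online reinforcement learning to blind face restoration, refining the policy network with rewards from sampled candidates to improve restoration quality.

Details

Motivation: The large solution space of blind face restoration (BFR) leads to common artifacts such as missing details and identity ambiguity, which conventional methods struggle to resolve.

Result: LRPO significantly outperforms baseline methods and achieves state-of-the-art restoration quality.

Insight: Balancing perceptual quality and fidelity is key to BFR, and RL offers a flexible way to narrow the solution space.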

Abstract: Blind Face Restoration (BFR) encounters inherent challenges in exploring its large solution space, leading to common artifacts like missing details and identity ambiguity in the restored images. To tackle these challenges, we propose a Likelihood-Regularized Policy Optimization (LRPO) framework, the first to apply online reinforcement learning (RL) to the BFR task. LRPO leverages rewards from sampled candidates to refine the policy network, increasing the likelihood of high-quality outputs while improving restoration performance on low-quality inputs. However, directly applying RL to BFR creates incompatibility issues, producing restoration results that deviate significantly from the ground truth. To balance perceptual quality and fidelity, we propose three key strategies: 1) a composite reward function tailored for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment. Extensive experiments demonstrate that our proposed LRPO significantly improves the face restoration quality over baseline methods and achieves state-of-the-art performance.

[58] DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice cs.CV | cs.AIPDF

Zijie Meng, Jin Hao, Xiwei Dai, Yang Feng, Jiaxiang Liu

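TL;DR: The paper introduces DentVLM, a multimodal vision-language model for comprehensive oral disease diagnosis and enhanced clinical practice.

Details

Motivation: Current AI models for oral disease diagnosis usually handle only isolated tasks and cannot meet the complex multimodal demands of clinical practice, so a more comprehensive solution is needed.

Result: In a clinical study, DentVLM surpassed junior dentists on 21 of 36 tasks and senior dentists on 12 of 36 tasks, and it reduced diagnostic time by 15-22%.

Insight: Beyond improving diagnostic efficiency, DentVLM shows application potential in home-based, hospital-based, and multi-agent collaborative scenarios, which may help ease the uneven distribution of medical resources.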

Abstract: Diagnosing and managing oral diseases necessitate advanced visual interpretation across diverse imaging modalities and integrated information synthesis. While current AI models excel at isolated tasks, they often fall short in addressing the complex, multimodal requirements of comprehensive clinical dental practice. Here we introduce DentVLM, a multimodal vision-language model engineered for expert-level oral disease diagnosis. DentVLM was developed using a comprehensive, large-scale, bilingual dataset of 110,447 images and 2.46 million visual question-answering (VQA) pairs. The model is capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks, significantly outperforming leading proprietary and open-source models by 19.6% higher accuracy for oral diseases and 27.9% for malocclusions. In a clinical study involving 25 dentists, evaluating 1,946 patients and encompassing 3,105 QA pairs, DentVLM surpassed the diagnostic performance of 13 junior dentists on 21 of 36 tasks and exceeded that of 12 senior dentists on 12 of 36 tasks. When integrated into a collaborative workflow, DentVLM elevated junior dentists’ performance to senior levels and reduced diagnostic time for all practitioners by 15-22%. Furthermore, DentVLM exhibited promising performance across three practical utility scenarios, including home-based dental health management, hospital-based intelligent diagnosis and multi-agent collaborative interaction. These findings establish DentVLM as a robust clinical decision support tool, poised to enhance primary dental care, mitigate provider-patient imbalances, and democratize access to specialized medical expertise within the field of dentistry.

[59] Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling cs.CV | cs.AIPDF

Xiaolong Fu, Lichen Ma, Zipeng Guo, Gaojing Zhou, Chongxiao Wang

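TL;DR: The paper proposes Dynamic-TreeRPO, which improves the efficiency of trajectory search in text-to-image generation through tree-structured sampling and dynamic noise intensities, and pairs it with the LayerTuning-RL paradigm to optimize training.

Details

Motivation: Existing RL methods for text-to-image generation suffer from inefficient exploration and redundant sampling.

Result: The method improves on the state of the art by 4.9%, 5.91%, and 8.66% on HPS-v2.1, PickScore, and ImageReward, respectively, while improving training efficiency by nearly 50%.

Insight: Sharing prefix paths in tree-structured sampling amortizes computational overhead, and dynamic noise intensities increase exploration diversity.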

Abstract: The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate the Supervised Fine-Tuning (SFT) and RL paradigms within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, disruption of the exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by 4.9%, 5.91%, and 8.66% on those benchmarks, respectively, while improving the training efficiency by nearly 50%.

[60] Test-time Uncertainty Estimation for Medical Image Registration via Transformation Equivariance cs.CVPDF

Lin Tian, Xiaoling Hu, Juan Eugenio Iglesias

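TL;DR: The paper proposes a test-time uncertainty estimation framework based on transformation equivariance that works with any pretrained medical image registration network without architectural changes or retraining. By analyzing the variance of network predictions under spatial perturbations of the input, it decomposes uncertainty into an intrinsic spread and a bias jitter (drift of the systematic error); experiments show that the uncertainty maps correlate consistently with registration errors.

Details

Motivation: Current deep registration networks give little indication of how reliable their predictions are, which limits their safe use in clinical applications. Existing uncertainty estimation methods require modifying the network or retraining, which limits their flexibility.

Result: Across multiple anatomical structures and registration models, the uncertainty maps correlate with registration errors and highlight regions requiring caution.

Insight: Transformation equivariance provides a theoretical basis for uncertainty estimation, and the framework's flexibility lets it be applied directly to pretrained networks, supporting safe clinical deployment of medical image registration.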

Abstract: Accurate image registration is essential for downstream applications, yet current deep registration networks provide limited indications of whether and when their predictions are reliable. Existing uncertainty estimation strategies, such as Bayesian methods, ensembles, or MC dropout, require architectural changes or retraining, limiting their applicability to pretrained registration networks. Instead, we propose a test-time uncertainty estimation framework that is compatible with any pretrained network. Our framework is grounded in the transformation equivariance property of registration, which states that the true mapping between two images should remain consistent under spatial perturbations of the input. By analyzing the variance of network predictions under such perturbations, we derive a theoretical decomposition of perturbation-based uncertainty in registration. This decomposition separates into two terms: (i) an intrinsic spread, reflecting epistemic noise, and (ii) a bias jitter, capturing how systematic error drifts under perturbations. Across four anatomical structures (brain, cardiac, abdominal, and lung) and multiple registration models (uniGradICON, SynthMorph), the uncertainty maps correlate consistently with registration errors and highlight regions requiring caution. Our framework turns any pretrained registration network into a risk-aware tool at test time, placing medical image registration one step closer to safe deployment in clinical and large-scale research settings.
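
The idea of reading uncertainty off the variance of predictions under spatial perturbations can be sketched on a toy 2D example. This is a simplified illustration under stated assumptions, not the paper's method: the perturbations are integer translations applied with `np.roll`, the `predict` callables are toy stand-ins for a registration network that returns an `(H, W, 2)` displacement field, and the two-term decomposition from the abstract is not reproduced.

```python
import numpy as np


def equivariance_uncertainty(moving, fixed, predict, shifts):
    """Per-pixel variance of displacement predictions under joint input shifts.

    Both images are shifted by the same (dy, dx), so the true displacement
    vectors are unchanged and only move with the grid. Each prediction is
    shifted back before the variance is taken across perturbations.
    """
    preds = []
    for dy, dx in shifts:
        moving_s = np.roll(moving, (dy, dx), axis=(0, 1))
        fixed_s = np.roll(fixed, (dy, dx), axis=(0, 1))
        disp = predict(moving_s, fixed_s)                       # (H, W, 2)
        preds.append(np.roll(disp, (-dy, -dx), axis=(0, 1)))    # undo the shift
    return np.stack(preds).var(axis=0).sum(axis=-1)             # (H, W)


def equivariant_predictor(moving, fixed):
    """Toy pointwise predictor, which is exactly shift-equivariant."""
    diff = moving - fixed
    return np.stack([diff, np.zeros_like(diff)], axis=-1)


def biased_predictor(moving, fixed):
    """Toy predictor with a position-dependent error that breaks equivariance."""
    disp = equivariant_predictor(moving, fixed)
    disp[..., 0] += np.arange(moving.shape[1])
    return disp


rng = np.random.default_rng(0)
moving = rng.random((8, 8))
fixed = rng.random((8, 8))
shifts = [(0, 0), (1, 0), (0, 2), (-1, -1)]

u_good = equivariance_uncertainty(moving, fixed, equivariant_predictor, shifts)
u_bad = equivariance_uncertainty(moving, fixed, biased_predictor, shifts)
```

The equivariant toy predictor yields a zero uncertainty map, while the predictor with a grid-locked error yields positive variance, which is the kind of signal such a map is meant to expose.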

[61] GRAPE: Let GPRO Supervise Query Rewriting by Ranking for Retrieval cs.CVPDF

Zhaohua Zhang, Jianhuan Zhuo, Muxi Chen, Chenchen Zhao, Wenyu Jiang

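TL;DR: GRAPE uses GRPO with ranking-based reward signals to optimize LLM-based query rewriting, improving retrieval performance under distribution shifts (multilingual, long-form, and multimodal), with an average Recall@10 gain of 4.9%.

Details

Motivation: CLIP performs well in large-scale retrieval systems but struggles on tasks whose input distribution differs from its training data, such as multilingual, long-form, or multimodal queries. Existing methods rely on LLM query rewriting, but the lack of supervision signals leads to suboptimal results. GRAPE aims to address this by optimizing query rewriting with ranking signals.

Result: On multilingual (Flickr30k-CN and others), long-form (Wikipedia), and multimodal (CIRR) datasets, Recall@10 improves by 4.9% on average.

Insight: Ranking signals can effectively guide an LLM toward better queries; the corpus-relative reward design is key, as it avoids score inflation and improves retrieval performance.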

Abstract: The CLIP model has become a cornerstone of large-scale retrieval systems by aligning text and image data in a unified embedding space. Despite its simplicity and efficiency, CLIP struggles when applied to tasks whose input distributions diverge from its training corpus, such as queries with multilingual, long-form, or multimodal differences. To avoid costly retraining, existing methods mainly adopt query-rewriting strategies with large language models (LLMs), aiming to mitigate distribution gaps at the query level. However, due to the lack of supervision signals, LLMs fail to generate the optimal rewrite that fits the training distribution. We address this challenge with GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play enhancement approach that incorporates ranking signals into retrieval-guided query rewriting with LLMs. Intuitively, GRAPE proposes to leverage GRPO to bridge distributional differences – including length, multilingual, and modality shifts – by transforming queries into forms better aligned with the retriever’s training distribution. However, our preliminary experiment finds that naively fine-tuning an LLM with similarity scores can lead to score inflation, where nearly all candidates are assigned unexpectedly high scores regardless of their true relevance. To address score inflation, we propose a corpus-relative ranking-based reward, which explicitly aligns optimization with ranking metrics while suppressing spurious score inflation. Extensive experiments demonstrate that GRAPE consistently improves retrieval performance under distributional shifts – including multilingual differences (Flickr30k-CN, CVLUE, XM3600), length differences (Wikipedia), and multimodal differences (CIRR) – achieving an average improvement of 4.9% in Recall@10. The code is available at https://github.com/Chinese0123456/GRAPE.git
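
The contrast between a raw similarity reward and a corpus-relative, rank-based reward can be illustrated with a small sketch. The reciprocal-rank form used here is an assumption chosen for simplicity; the abstract does not specify the exact reward, and `rank_reward` and the toy embeddings are hypothetical.

```python
import numpy as np


def rank_reward(query_emb, corpus_embs, target_idx):
    """Reciprocal rank of the target item among all corpus items.

    The reward depends only on how the target compares with the rest of the
    corpus, not on the absolute similarity value.
    """
    sims = corpus_embs @ query_emb
    rank = 1 + int(np.sum(sims > sims[target_idx]))
    return 1.0 / rank


corpus = np.eye(3)                      # three toy item embeddings
query = np.array([0.9, 0.1, 0.0])       # embedding of a rewritten query

raw_score = float(corpus[1] @ query)            # similarity to item 1
inflated_score = float(corpus[1] @ (10 * query))  # grows with the query norm

reward = rank_reward(query, corpus, target_idx=1)
reward_inflated = rank_reward(10 * query, corpus, target_idx=1)
```

Scaling the query inflates the raw similarity tenfold but leaves the rank-based reward unchanged, which is why a ranking signal is less exposed to score inflation.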

[62] UniPose: Unified Cross-modality Pose Prior Propagation towards RGB-D data for Weakly Supervised 3D Human Pose Estimation cs.CVPDF

Jinghong Zheng, Changlong Jiang, Jiaqi Li, Haohong Kuang, Hang Xu

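TL;DR: UniPose proposes a unified cross-modality pose prior propagation method for weakly supervised 3D human pose estimation. It transfers 2D pose annotations from RGB datasets to the 3D domain through self-supervised learning, avoiding multi-view calibration and synthetic-to-real data shift issues.

Details

Motivation: Conventional 3D pose estimation depends on large numbers of 3D keypoint annotations, which are costly to produce. UniPose aims to achieve weakly supervised 3D pose estimation using easily acquired RGB-D sequences and off-the-shelf 2D pose annotations.

Result: On the CMU Panoptic and ITOP datasets, UniPose performs comparably to fully supervised methods, and adding unlabeled data (e.g., NTU RGB+D 60) improves it further.

Insight: Through cross-modality learning and self-supervision, UniPose substantially reduces the dependence of 3D pose estimation on annotated data and shows that weak supervision can be efficient and practical.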

Abstract: In this paper, we present UniPose, a unified cross-modality pose prior propagation method for weakly supervised 3D human pose estimation (HPE) using unannotated single-view RGB-D sequences (RGB, depth, and point cloud data). UniPose transfers 2D HPE annotations from large-scale RGB datasets (e.g., MS COCO) to the 3D domain via self-supervised learning on easily acquired RGB-D sequences, eliminating the need for labor-intensive 3D keypoint annotations. This approach bridges the gap between 2D and 3D domains without suffering from issues related to multi-view camera calibration or synthetic-to-real data shifts. During training, UniPose leverages off-the-shelf 2D pose estimations as weak supervision for point cloud networks, incorporating spatial-temporal constraints like body symmetry and joint motion. The 2D-to-3D back-projection loss and cross-modality interaction further enhance this process. By treating the point cloud network’s 3D HPE results as pseudo ground truth, our anchor-to-joint prediction method performs 3D lifting on RGB and depth networks, making it more robust against inaccuracies in 2D HPE results compared to state-of-the-art methods. Experiments on CMU Panoptic and ITOP datasets show that UniPose achieves comparable performance to fully supervised methods. Incorporating large-scale unlabeled data (e.g., NTU RGB+D 60) enhances its performance under challenging conditions, demonstrating its potential for practical applications. Our proposed 3D lifting method also achieves state-of-the-art results.

[63] WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving cs.CVPDF

Ziyue Zhu, Zhanqian Wu, Zhenxin Zhu, Lijun Zhou, Haiyang Sun

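TL;DR: WorldSplat proposes a feed-forward framework for generating 4D autonomous driving scenes. It combines the strengths of generation and reconstruction, using 4D Gaussians and multi-modal information to achieve high-quality novel-view synthesis.

Details

Motivation: Existing driving-scene generation methods fall short in 3D consistency and viewpoint coverage, while 3D/4D reconstruction methods lack generative capability. WorldSplat aims to combine the strengths of both and resolve the tension between generation and reconstruction.

Result: Experiments on benchmark datasets show that WorldSplat generates high-fidelity, temporally and spatially consistent multi-view driving videos.

Insight: WorldSplat demonstrates the potential of combining generation and reconstruction, providing more scalable and controllable training data for autonomous driving systems.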

Abstract: Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos.

[64] SPIKE-RL: Video-LLMs meet Bayesian Surprise cs.CV | cs.CLPDF

Sahithya Ravi, Aditya Chinchure, Raymond T. Ng, Leonid Sigal, Vered Shwartz

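TL;DR: SPIKE-RL is a framework that uses Bayesian surprise to quantify key moments in videos and combines it with reinforcement learning to optimize the frame sampling strategy, improving Video-LLM performance on downstream tasks.

Details

Motivation: Existing Video-LLMs usually sample frames uniformly and can easily miss key narrative moments. The work aims to identify key moments by quantifying the belief update triggered by visual evidence.

Result: On five downstream benchmarks, surprise-weighted sampling yields consistent gains over uniform sampling.

Insight: By tracking belief updates and quantifying surprise, Video-LLMs can revise their understanding dynamically and adapt better to new information.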

Abstract: Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs process videos by sampling frames uniformly, likely missing critical moments that define a video’s narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where new visual evidence conflicts with prior beliefs. SPIKE effectively localizes surprise in videos, strongly correlated with humans on positive (FunQA) and negative (Oops!) surprise benchmarks. Since the beliefs of zero-shot Video-LLMs are often suboptimal, we develop SPIKE-RL, which leverages GRPO to optimize belief hypotheses based on a reward signal from the video caption. SPIKE and SPIKE-RL guide query-agnostic surprise-weighted frame sampling, which allocates more frames to interesting moments in the video. With this strategy, we achieve consistent performance gains on five downstream benchmarks over uniform sampling. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way for more robust models that can revise their understanding in response to new information.
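
Bayesian surprise is commonly defined as the KL divergence from the prior belief to the posterior belief, and surprise-weighted sampling then gives more frames to segments with larger surprise. The sketch below follows that common definition; the function names, the belief vectors, and the largest-remainder allocation rule are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def bayesian_surprise(prior, posterior, eps=1e-12):
    """KL(posterior || prior) over a discrete set of belief hypotheses."""
    prior = np.asarray(prior, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    return float(np.sum(posterior * np.log((posterior + eps) / (prior + eps))))


def allocate_frames(surprise, budget):
    """Split a frame budget across segments in proportion to their surprise."""
    weights = np.asarray(surprise, dtype=float) + 1e-6   # keep weights positive
    weights = weights / weights.sum()
    ideal = weights * budget
    counts = np.floor(ideal).astype(int)
    leftover = int(budget - counts.sum())
    # Hand the remaining frames to the segments with the largest remainders.
    for i in np.argsort(-(ideal - counts))[:leftover]:
        counts[i] += 1
    return counts


# Toy beliefs over three hypotheses before and after each of three segments.
prior = [0.6, 0.3, 0.1]
posteriors = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.05, 0.05, 0.9]]

surprise = [bayesian_surprise(prior, p) for p in posteriors]
frames = allocate_frames(surprise, budget=8)
```

In this toy case the third segment, where the belief flips toward a previously unlikely hypothesis, receives most of the frame budget.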

[65] FM-SIREN & FM-FINER: Nyquist-Informed Frequency Multiplier for Implicit Neural Representation with Periodic Activation cs.CVPDF

Mohammed Alsakabi, Wael Mobeirek, John M. Dolan, Ozan K. Tonguz

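TL;DR: FM-SIREN and FM-FINER propose a new frequency multiplier design that addresses neuron redundancy in periodic-activation networks and improves the expressive capacity of implicit neural representations.

Details

Motivation: Existing periodic activation-based implicit neural representation networks (such as SIREN and FINER) suffer from feature redundancy: with a fixed frequency multiplier, neurons capture overlapping frequency components, which limits the expressive capacity of the multilayer perceptron.

Result: The method performs well on 1D audio, 2D image, and 3D shape fitting and on neural radiance field (NeRF) synthesis, reducing feature redundancy by nearly 50% and outperforming the baseline models.

Insight: Introducing frequency diversity is key to improving the expressive capacity of implicit neural representations, and the gains come without added network complexity.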

Abstract: Existing periodic activation-based implicit neural representation (INR) networks, such as SIREN and FINER, suffer from hidden feature redundancy, where neurons within a layer capture overlapping frequency components due to the use of a fixed frequency multiplier. This redundancy limits the expressive capacity of multilayer perceptrons (MLPs). Drawing inspiration from classical signal processing methods such as the Discrete Sine Transform (DST), we propose FM-SIREN and FM-FINER, which assign Nyquist-informed, neuron-specific frequency multipliers to periodic activations. Unlike existing approaches, our design introduces frequency diversity without requiring hyperparameter tuning or additional network depth. This simple yet principled modification reduces the redundancy of features by nearly 50% and consistently improves signal reconstruction across diverse INR tasks, including fitting 1D audio, 2D image and 3D shape, and synthesis of neural radiance fields (NeRF), outperforming their baseline counterparts while maintaining efficiency.
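
The difference between a single shared frequency multiplier and neuron-specific multipliers can be sketched for one sine layer. This is only an illustration of the idea: the linear schedule in `per_neuron` is a hypothetical placeholder and not the Nyquist-informed rule from the paper, and the shared value of 30 is assumed here as a typical SIREN-style setting.

```python
import numpy as np


def sine_layer(x, weight, bias, multipliers):
    """Periodic layer: sin(multiplier_i * (w_i . x + b_i)) for each neuron i."""
    return np.sin(multipliers * (x @ weight.T + bias))


rng = np.random.default_rng(0)
n_hidden = 8
x = np.linspace(-1.0, 1.0, 16).reshape(-1, 1)          # 16 one-dimensional inputs
weight = rng.uniform(-1.0, 1.0, size=(n_hidden, 1))
bias = np.zeros(n_hidden)

shared = np.full(n_hidden, 30.0)                 # one multiplier for every neuron
per_neuron = np.linspace(1.0, 30.0, n_hidden)    # hypothetical per-neuron schedule

out_shared = sine_layer(x, weight, bias, shared)
out_per_neuron = sine_layer(x, weight, bias, per_neuron)
```

With per-neuron multipliers, each hidden unit oscillates at its own rate over the input range, which is the kind of frequency diversity the abstract argues reduces redundant features.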

[66] FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing cs.CV | cs.CLPDF

Tanawan Premsri, Parisa Kordjamshidi

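TL;DR: FoR-SALE is a Frame of Reference (FoR)-guided diffusion editing method that improves the spatial consistency of text-to-image generation through spatial adjustment, improving the performance of existing models.

Details

Motivation: Current text-to-image generation models lack an accurate understanding of human spatial descriptions (such as descriptions given from different perspectives); FoR-SALE aims to address this.

Result: On two spatial understanding benchmarks, FoR-SALE improves the performance of SOTA models by up to 5.3% with only a single round of correction.

Insight: Introducing FoR markedly improves the model's understanding of complex spatial descriptions, and the latent-space operations suggest new directions for future image editing tasks.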

Abstract: Frame of Reference (FoR) is a fundamental concept in spatial reasoning that humans utilize to comprehend and describe space. With the rapid progress in multimodal language models, the moment has come to integrate this long-overlooked dimension into these models. In particular, in text-to-image (T2I) generation, even state-of-the-art models exhibit a significant performance gap when spatial descriptions are provided from perspectives other than the camera. To address this limitation, we propose Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing (FoR-SALE), an extension of the Self-correcting LLM-controlled Diffusion (SLD) framework for T2I. FoR-SALE evaluates the alignment between a given text and an initially generated image, and refines the image based on the Frame of Reference specified in the spatial expressions. It employs vision modules to extract the spatial configuration of the image, while simultaneously mapping the spatial expression to a corresponding camera perspective. This unified perspective enables direct evaluation of alignment between language and vision. When misalignment is detected, the required editing operations are generated and applied. FoR-SALE applies novel latent-space operations to adjust the facing direction and depth of the generated images. We evaluate FoR-SALE on two benchmarks specifically designed to assess spatial understanding with FoR. Our framework improves the performance of state-of-the-art T2I models by up to 5.3% using only a single round of correction.

[67] 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras cs.CV | cs.LGPDF

Tharindu Ekanayake, Constantino Álvarez Casado, Miguel Bordallo López

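TL;DR: 3DPCNet is a compact, pose-estimator-agnostic module that transforms 3D poses from the camera coordinate frame into a consistent body-centered frame, removing viewpoint dependence.

Details

Motivation: Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent motion signals that complicate comparative analysis in applications such as health and sports science.

Result: On the MM-Fi dataset, 3DPCNet reduces the mean rotation error from over 20° to 3.4° and the joint position error from about 64 mm to 47 mm. On the TotalCapture dataset, the resulting acceleration signals show strong visual correspondence to ground-truth IMU data.

Insight: By removing viewpoint dependence, 3DPCNet makes motion analysis more physically plausible and provides a useful tool for practical applications of monocular 3D pose estimation.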

Abstract: Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss ensuring both accurate rotation and pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from ~64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that our module removes viewpoint variability to enable physically plausible motion analysis.
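
A continuous 6D rotation output is usually turned into an $SO(3)$ matrix by Gram-Schmidt orthogonalization of two 3D vectors followed by a cross product. The sketch below shows that widely used construction; the abstract does not state the exact mapping 3DPCNet uses, so treating it as this one is an assumption, and `rotation_from_6d` is a hypothetical name.

```python
import numpy as np


def rotation_from_6d(v):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt.

    The first three numbers give the first column direction; the last three
    are made orthogonal to it to give the second column; the third column is
    their cross product, which makes the result right-handed.
    """
    v = np.asarray(v, dtype=float)
    a1, a2 = v[:3], v[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)   # columns are b1, b2, b3


R_identity = rotation_from_6d([1, 0, 0, 0, 1, 0])
R = rotation_from_6d([0.3, -1.2, 0.5, 2.0, 0.1, -0.4])
```

Any 6D input whose two halves are not parallel yields an orthonormal matrix with determinant +1, so a network can regress the six numbers freely without constraints.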

[68] No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation cs.CVPDF

Mohammad Hossein Sameti, Amir M. Mansourian, Arash Marioriyad, Soheil Fadaee Oshyani, Mohammad Hossein Rohban

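TL;DR: The paper proposes a fine-grained test-time optimization framework that decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels, improving compositional faithfulness in text-to-image generation.

Details

Motivation: Existing text-to-image models often omit or misrepresent specific objects and attributes when handling complex prompts, so a method that improves generation without retraining is needed.

Result: On DrawBench and CompBench prompts, the method significantly improves concept coverage and human-judged faithfulness.

Insight: Fine-grained, concept-level alignment evaluation is key to improving text-to-image generation, and test-time optimization can improve generation quality without retraining the model.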

Abstract: Despite recent advances in text-to-image (T2I) models, they often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization has emerged as a promising approach to address this limitation by refining generation without the need for retraining. In this paper, we propose a fine-grained test-time optimization framework that enhances compositional faithfulness in T2I generation. Unlike most prior approaches that rely solely on a global image/text similarity score, our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. A fine-grained variant of CLIP is used to compute concept-level correspondence, producing detailed feedback on missing or inaccurate concepts. This feedback is fed into an iterative prompt refinement loop, enabling the large language model to propose improved prompts. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness over both standard test-time optimization and the base T2I model. Code is available at: https://github.com/AmirMansurian/NoConceptLeftBehind

Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Jun-Ren Chen

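TL;DR: The paper proposes a new framework, MFAS-DANet, that addresses three major challenges in multi-modal face anti-spoofing (FAS) under the domain adaptation scenario: missing modalities, noisy pseudo labels, and model degradation.

Details

Motivation: Existing multi-modal FAS methods perform poorly on attacks from new domains, and domain adaptation has not yet been explored for multi-modal FAS.

Result: Extensive experiments demonstrate the effectiveness and state-of-the-art performance of MFAS-DANet.

Insight: The study offers a new solution for domain adaptation in multi-modal FAS and highlights the importance of complementary features and prediction uncertainty.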

Abstract: Recent multi-modal face anti-spoofing (FAS) methods have investigated the potential of leveraging multiple modalities to distinguish live and spoof faces. However, pre-adapted multi-modal FAS models often fail to detect unseen attacks from new target domains. Although a more realistic domain adaptation (DA) scenario has been proposed for single-modal FAS to learn specific spoof attacks during inference, DA remains unexplored in multi-modal FAS methods. In this paper, we propose a novel framework, MFAS-DANet, to address three major challenges in multi-modal FAS under the DA scenario: missing modalities, noisy pseudo labels, and model degradation. First, to tackle the issue of missing modalities, we propose extracting complementary features from other modalities to substitute missing modality features or enhance existing ones. Next, to reduce the impact of noisy pseudo labels during model adaptation, we propose deriving reliable pseudo labels by leveraging prediction uncertainty across different modalities. Finally, to prevent model degradation, we design an adaptive mechanism that decreases the loss weight during unstable adaptations and increases it during stable ones. Extensive experiments demonstrate the effectiveness and state-of-the-art performance of our proposed MFAS-DANet.

[70] RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation cs.CVPDF

Shourya Verma, Mengbo Wang, Nadia Atallah Lanman, Ananth Grama

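TL;DR: RestoRect proposes an image restoration method based on latent rectified flow feature distillation, addressing the trade-off between slow high-performance models and fast models with poor results.

Details

Motivation: Existing image restoration methods face a clear trade-off between performance and speed, and the conventional static feature matching used in knowledge distillation cannot capture how modern transformer architectures generate features dynamically.

Result: The method achieves superior results across 15 image restoration datasets, 4 tasks, and 8 metrics, with better training stability and faster convergence and inference.

Insight: RestoRect's dynamic feature learning overcomes the limits of conventional static matching and offers a new approach to cross-architecture knowledge transfer.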

Abstract: Current approaches for restoration of degraded images face a critical trade-off: high-performance models are too slow for practical use, while fast models produce poor results. Knowledge distillation transfers teacher knowledge to students, but existing static feature matching methods cannot capture how modern transformer architectures dynamically generate features. We propose ‘RestoRect’, a novel Latent Rectified Flow Feature Distillation method for restoring degraded images. We apply rectified flow to reformulate feature distillation as a generative process where students learn to synthesize teacher-quality features through learnable trajectories in latent space. Our framework combines Retinex theory for physics-based decomposition with learnable anisotropic diffusion constraints and trigonometric color space polarization. We introduce a Feature Layer Extraction loss for robust knowledge transfer between different network architectures through cross-normalized transformer feature alignment with percentile-based outlier detection. RestoRect achieves better training stability and faster convergence and inference while preserving restoration quality. We demonstrate superior results across 15 image restoration datasets, covering 4 tasks, on 8 metrics.

[71] Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos cs.CVPDF

Junyi Wu, Jiachen Tao, Haoxuan Wang, Gaowen Liu, Ramana Rao Kompella

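TL;DR: OriGS proposes a hyperdimensional representation grounded in scene orientation, achieving high-quality 4D reconstruction through a Global Orientation Field and an Orientation-aware Hyper-Gaussian.

Details

Motivation: Existing methods often rely on low-rank assumptions and struggle to model the complex, region-specific deformations of unconstrained dynamic scenes. OriGS aims to address this with a hyperdimensional representation that incorporates orientation information.

Result: Experiments show that OriGS achieves better reconstruction quality than mainstream methods in complex dynamic scenes.

Insight: Guiding dynamic modeling with orientation information better captures local dynamics in alignment with global motion intent.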

Abstract: We present Orientation-anchored Gaussian Splatting (OriGS), a novel framework for high-quality 4D reconstruction from casually captured monocular videos. While recent advances extend 3D Gaussian Splatting to dynamic scenes via various motion anchors, such as graph nodes or spline control points, they often rely on low-rank assumptions and fall short in modeling complex, region-specific deformations inherent to unconstrained dynamics. OriGS addresses this by introducing a hyperdimensional representation grounded in scene orientation. We first estimate a Global Orientation Field that propagates principal forward directions across space and time, serving as stable structural guidance for dynamic modeling. Built upon this, we propose Orientation-aware Hyper-Gaussian, a unified formulation that embeds time, space, geometry, and orientation into a coherent probabilistic state. This enables inferring region-specific deformation through principled conditioned slicing, adaptively capturing diverse local dynamics in alignment with global motion intent. Experiments demonstrate the superior reconstruction fidelity of OriGS over mainstream methods in challenging real-world dynamic scenes.

Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra

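TL;DR: Through a large-scale empirical study, the paper quantifies the dependence on the vision and text modalities and on their interaction across 23 visual question-answering benchmarks, exposes problems in current benchmark evaluation, and provides a quantitative approach to multi-modal benchmark design and evaluation.

Details

Motivation: Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to the target task) and inter-modality dependencies (the relationships between modalities and the target task) is key to advancing multi-modal learning, yet the nature of these dependencies and how they interact in benchmark evaluations remain poorly understood.

Result: The study finds that reliance on vision, text, and their interaction varies significantly both across and within benchmarks, and that many benchmarks designed to reduce text-only bias have inadvertently amplified image-only dependence. Larger models often use intra-modality dependencies to mask a lack of multi-modal reasoning.

Insight: The paper exposes latent problems in multi-modal benchmark design, stresses the importance of quantitative analysis, and offers guidance for future multi-modal learning and benchmark design.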

Abstract: Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of and interaction between these dependencies within current benchmark evaluations remain poorly characterized. In this work, we present a large-scale empirical study to quantify these dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs) covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that masks an underlying lack of multi-modal reasoning. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.
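The abstract does not state how the dependencies are measured. A common probing approach, assumed here purely for illustration, compares a model's accuracy when it sees both inputs, only the image, or only the question. The function names and the three gain definitions below are hypothetical and are not the paper's metrics.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def dependency_report(full, image_only, text_only, labels, chance):
    """Summarize how far each input condition lifts accuracy.

    full / image_only / text_only: predictions made with both inputs,
    with the question blanked, and with the image blanked (assumed setup).
    chance: accuracy of random guessing on this benchmark.
    """
    acc_full = accuracy(full, labels)
    acc_image = accuracy(image_only, labels)
    acc_text = accuracy(text_only, labels)
    return {
        "image_only_gain": acc_image - chance,
        "text_only_gain": acc_text - chance,
        # What the combination adds beyond the better single modality.
        "interaction_gain": acc_full - max(acc_image, acc_text),
    }
```

Under this reading, a benchmark with a large `image_only_gain` and a small `interaction_gain` can be solved mostly from the image alone, which is the kind of pattern the abstract warns about.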

[73] Enhancing Polyp Segmentation via Encoder Attention and Dynamic Kernel Update cs.CV | cs.AIPDF

Fatemeh Salahi Chashmi, Roya Sotoudeh

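TL;DR: This paper proposes a new framework that combines a dynamic kernel mechanism with a global encoder attention module to improve the accuracy and efficiency of polyp segmentation. The dynamic kernel iteratively refines segmentation predictions, global attention captures key lesion features, and Unified Channel Adaptation standardizes feature dimensions in the decoder. Experiments show that the method outperforms existing techniques on multiple benchmark datasets while balancing computational efficiency and accuracy.

Details

Motivation: Polyp segmentation is crucial for colorectal cancer detection, but it is challenging because polyps vary in shape and size and have low-contrast boundaries. Existing methods leave room for improvement in handling complex boundaries and in computational efficiency.

Result: On the Kvasir-SEG and CVC-ClinicDB datasets, the method outperforms existing techniques in Dice and Intersection over Union (IoU), while simplifying the decoder structure and reducing computational cost.

Insight: Designs that combine attention with dynamic refinement can effectively improve segmentation performance, especially for targets with complex boundaries and varied shapes. The approach could be extended to other medical image segmentation tasks.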

Abstract: Polyp segmentation is a critical step in colorectal cancer detection, yet it remains challenging due to the diverse shapes, sizes, and low-contrast boundaries of polyps in medical imaging. In this work, we propose a novel framework that improves segmentation accuracy and efficiency by integrating a Dynamic Kernel (DK) mechanism with a global Encoder Attention (EA) module. The DK mechanism, initialized by a global context vector from the EA module, iteratively refines segmentation predictions across decoding stages, enabling the model to focus on and accurately delineate complex polyp boundaries. The EA module enhances the network’s ability to capture critical lesion features by aggregating multi-scale information from all encoder layers. In addition, we employ Unified Channel Adaptation (UCA) in the decoder to standardize feature dimensions across stages, ensuring consistent and computationally efficient information fusion. Our approach extends the lesion-aware kernel framework by introducing a more flexible, attention-driven kernel initialization and a unified decoder design. Extensive experiments on the Kvasir-SEG and CVC-ClinicDB benchmark datasets demonstrate that our model outperforms several state-of-the-art segmentation methods, achieving superior Dice and Intersection over Union scores. Moreover, UCA simplifies the decoder structure, reducing computational cost without compromising accuracy. Overall, the proposed method provides a robust and adaptable solution for polyp segmentation, with promising applications in clinical and automated diagnostic systems.
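For readers unfamiliar with the two reported metrics, the following minimal sketch computes Dice and Intersection over Union for a pair of binary masks. It uses flat lists of 0/1 values for simplicity; the function name and the convention of scoring two empty masks as a perfect match are assumptions of this sketch, not details from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treat as a perfect match (a convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

Dice is never lower than IoU for the same pair of masks, which is why papers that report both usually show the Dice figure as the larger one.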

[74] Evaluating point-light biological motion in multimodal large language models cs.CV | cs.AIPDF

Akila Kadambi, Marco Iacoboni, Lisa Aziz-Zadeh, Srini Narayanan

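TL;DR: The paper introduces ActPLD, the first benchmark for evaluating how multimodal large language models (MLLMs) process sparse point-light biological motion. It finds consistently low performance across current models, exposing fundamental gaps in action and spatiotemporal understanding.

Details

Motivation: Humans can extract rich semantic information from sparse point-light displays (PLDs), an ability grounded in embodied experience. Studying PLD processing helps test the limits of action understanding in MLLMs.

Result: Models show consistently low performance on PLD tasks, highlighting shortcomings in action and spatiotemporal understanding.

Insight: Current MLLMs struggle to process action semantics without rich visual information, which underscores their limitations in multimodal understanding.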

Abstract: Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. This ability emerges early in development and is largely attributed to human embodied experience. Since PLDs isolate body motion as the sole source of meaning, they represent key stimuli for testing the constraints of action understanding in multimodal large language models (MLLMs). Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs from human PLDs. Tested models include state-of-the-art proprietary and open-source systems, evaluated on single-actor and socially interacting PLDs. Our results reveal consistently low performance across models, exposing fundamental gaps in action and spatiotemporal understanding.

[75] Imaging-Based Mortality Prediction in Patients with Systemic Sclerosis cs.CV | cs.AIPDF

Alec K. Peltekian, Karolina Senkow, Gorkem Durak, Kevin M. Grudzinski, Bradford C. Bemiss

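TL;DR: This study proposes a large-scale longitudinal imaging analysis framework that combines radiomics and deep learning to predict mortality in patients with systemic sclerosis (SSc).

Details

Motivation: SSc-associated interstitial lung disease (ILD) is a leading cause of mortality, yet the diagnostic and prognostic potential of CT imaging has not been fully exploited.

Result: The models achieved AUCs of 0.769, 0.801, and 0.709 for predicting mortality at one, three, and five years, respectively.

Insight: The study suggests that radiomics and deep learning can substantially improve early risk prediction for SSc patients, offering a new tool for clinical decision-making.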

Abstract: Interstitial lung disease (ILD) is a leading cause of morbidity and mortality in systemic sclerosis (SSc). Chest computed tomography (CT) is the primary imaging modality for diagnosing and monitoring lung complications in SSc patients. However, its role in disease progression and mortality prediction has not yet been fully clarified. This study introduces a novel, large-scale longitudinal chest CT analysis framework that utilizes radiomics and deep learning to predict mortality associated with lung complications of SSc. We collected and analyzed 2,125 CT scans from SSc patients enrolled in the Northwestern Scleroderma Registry, conducting mortality analyses at one, three, and five years using advanced imaging analysis techniques. Death labels were assigned based on recorded deaths over the one-, three-, and five-year intervals, confirmed by expert physicians. In our dataset, 181, 326, and 428 of the 2,125 CT scans were from patients who died within one, three, and five years, respectively. We fine-tuned pre-trained ResNet-18, DenseNet-121, and Swin Transformer models on the 2,125 images of SSc patients. The models achieved AUCs of 0.769, 0.801, and 0.709 for predicting mortality within one, three, and five years, respectively. Our findings highlight the potential of both radiomics and deep learning computational methods to improve early detection and risk assessment of SSc-related interstitial lung disease, marking a significant advancement in the literature.
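The reported counts imply imbalanced labels: 181, 326, and 428 positives out of 2,125 scans are roughly 8.5%, 15.3%, and 20.1%. AUC is a natural metric in that setting because it depends only on how positives rank against negatives. The sketch below computes AUC from its pairwise definition; it is a generic illustration with toy inputs, not the evaluation code used in the study.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    scores: predicted risk per sample, labels: 1 for death within the
    horizon, 0 otherwise. Ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of samples, which is fine for a few thousand scans; rank-based implementations compute the same value faster.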

[76] Calibrated and Resource-Aware Super-Resolution for Reliable Driver Behavior Analysis cs.CVPDF

Ibne Farabi Shihab, Weiheng Chai, Jiyang Wang, Sanjeda Akter, Senem Velipasalar Gursoy

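TL;DR: The paper proposes a resource-aware adaptive super-resolution framework that optimizes model calibration and precision-recall on critical events for driver behavior analysis, achieving the best calibration and safety-centric metrics.

Details

Motivation: Driver monitoring systems in safety-critical settings need not only high accuracy but also reliable confidence scores. Direct low-resolution training yields high overall accuracy but poorly calibrated predictions, which poses a safety risk.

Result: Best results on safety-critical metrics: calibration error (ECE) of 5.8%, AUPR of 0.78 for drowsiness detection, and precision-recall of 0.74 for phone use detection.

Insight: Models trained directly at low resolution are strong general-purpose baselines but poorly calibrated; the framework offers a reliable solution for safety-critical applications.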

Abstract: Driver monitoring systems require not just high accuracy but reliable, well-calibrated confidence scores for safety-critical deployment. While direct low-resolution training yields high overall accuracy, it produces poorly calibrated predictions that can be dangerous in safety-critical scenarios. We propose a resource-aware adaptive super-resolution framework that optimizes for model calibration and high precision-recall on critical events. Our approach achieves state-of-the-art performance on safety-centric metrics: best calibration (ECE of 5.8% vs 6.2% for LR-trained baselines), highest AUPR for drowsiness detection (0.78 vs 0.74), and superior precision-recall for phone use detection (0.74 vs 0.71). A lightweight artifact detector (0.3M parameters, 5.2ms overhead) provides additional safety by filtering SR-induced hallucinations. While LR-trained video models serve as strong general-purpose baselines, our adaptive framework represents the state-of-the-art solution for safety-critical applications where reliability is paramount.
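The calibration figures above are expected calibration error (ECE) values. As a reminder of what that metric measures, here is a minimal sketch of the usual equal-width binning estimator. The bin count of 10 and the function name are assumptions of this sketch; the paper's exact binning is not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean gap between accuracy and confidence.

    confidences: predicted confidence in [0, 1] for each sample.
    correct: 1 if the prediction was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs to the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(h for _, h in items) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 90% confidence and is right 90% of the time contributes nothing to ECE, so a lower value means the confidence scores can be taken more literally.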

[77] OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction cs.CVPDF

Hongyang Li, Jinyuan Qu, Lei Zhang

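TL;DR: OVSeg3R learns open-vocabulary 3D instance segmentation from 2D perception models via 3D reconstruction, avoiding the cost of manual annotation. It uses 2D-to-3D correspondences to generate 3D annotations and proposes the View-wise Instance Partition and 2D Instance Boundary-aware Superpoint algorithms to improve training, significantly improving results on the ScanNet200 benchmark.

Details

Motivation: Existing 3D instance segmentation methods usually rely on closed vocabularies or require expensive annotation, which limits their breadth and scalability in practice. OVSeg3R aims to reduce annotation cost and improve segmentation performance by leveraging the open-vocabulary ability of 2D models together with 3D reconstruction.

Result: On the ScanNet200 benchmark, OVSeg3R improves overall performance by +2.3 mAP and, under the open-vocabulary setting, improves performance on novel classes by about +7.1 mAP.

Insight: Combining 2D open-vocabulary models with 3D reconstruction can reduce annotation cost while substantially improving 3D instance segmentation. Targeted algorithms, such as view-wise supervision partitioning and boundary-aware clustering, help solve key problems in 2D-to-3D transfer.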

Abstract: In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts reconstructed scenes from 2D videos as input, avoiding costly manual adjustment while aligning input with real-world applications. By exploiting the 2D to 3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view’s 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view’s corresponding sub-scene. To avoid incorrectly introduced false positives as supervision due to partial annotations from 2D to 3D, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative super-points based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from violating instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to open-vocabulary, but also substantially narrows the performance gap between tail and head classes, ultimately leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Furthermore, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.
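The projection step described above can be pictured as a lookup: each pixel that the reconstruction maps to a 3D point passes its 2D instance id to that point. The sketch below is a deliberately simplified illustration under assumed data structures (a nested-list mask with 0 as background and a `pixel_to_point` dictionary); neither the names nor the formats come from OVSeg3R.

```python
def lift_mask_to_points(instance_mask, pixel_to_point):
    """Copy 2D instance ids onto the 3D points seen in one view.

    instance_mask: 2D nested list of instance ids, 0 meaning background.
    pixel_to_point: dict mapping (row, col) to a 3D point index, standing in
    for the 2D-to-3D correspondences a reconstruction model would provide.
    Returns a dict from point index to instance id for this view only.
    """
    point_labels = {}
    for (row, col), point_index in pixel_to_point.items():
        instance_id = instance_mask[row][col]
        if instance_id != 0:
            point_labels[point_index] = instance_id
    return point_labels
```

Points outside the view receive no label at all, which is why annotations built this way are partial and why the abstract restricts supervision to each view's own sub-scene.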

[78] VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement cs.CVPDF

Shulian Zhang, Yong Guo, Long Peng, Ziyang Wang, Ye Chen

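TL;DR: VividFace is an efficient one-step diffusion framework for video face enhancement that addresses existing methods' problems with texture modeling, insufficient training data, and low inference efficiency.

Details

Motivation: Video face enhancement is crucial in many applications, but existing methods suffer from inaccurate texture modeling, a lack of training data, and slow inference.

Result: Achieves state-of-the-art perceptual quality, identity preservation, and temporal stability while significantly improving inference efficiency.

Insight: The joint latent-pixel training strategy and the data curation pipeline significantly improve model performance and generalization.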

Abstract: Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences, a capability that underpins numerous applications including video conferencing, film restoration, and surveillance. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) faithfully modeling intricate facial textures while preserving temporal consistency; (2) restricted model generalization due to the lack of high-quality face video training data; and (3) low efficiency caused by repeated denoising steps during inference. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement. Built upon the pretrained WANX video generation model, our method leverages powerful spatiotemporal priors through a single-step flow matching paradigm, enabling direct mapping from degraded inputs to high-quality outputs with significantly reduced inference time. To further boost efficiency, we propose a Joint Latent-Pixel Face-Focused Training strategy that employs stochastic switching between facial region optimization and global reconstruction, providing explicit supervision in both latent and pixel spaces through a progressive two-stage training process. Additionally, we introduce an MLLM-driven data curation pipeline for automated selection of high-quality video face datasets, enhancing model generalization. Extensive experiments demonstrate that VividFace achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, while offering practical resources for the research community.

[79] VAMamba: An Efficient Visual Adaptive Mamba for Image Restoration cs.CVPDF

Han Hu, Zhuoran Zheng, Liang Li, Chen Lyu

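TL;DR: VAMamba proposes an efficient visual adaptive Mamba framework whose two innovations, QCLAM and GPS-SS2D, address the fixed scanning patterns and inefficient feature utilization of existing Mamba methods.

Details

Motivation: Existing Mamba methods are limited in image restoration mainly because fixed scanning patterns and inefficient feature utilization cannot adapt to diverse degradations.

Result: Across a range of image restoration tasks, VAMamba outperforms existing methods in both restoration quality and computational efficiency.

Insight: Dynamic feature reuse and adaptive scanning paths can significantly improve image restoration performance while maintaining high computational efficiency.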

Abstract: Recent Mamba-based image restoration methods have achieved promising results but remain limited by fixed scanning patterns and inefficient feature utilization. Conventional Mamba architectures rely on predetermined paths that cannot adapt to diverse degradations, constraining both restoration performance and computational efficiency. To overcome these limitations, we propose VAMamba, a Visual Adaptive Mamba framework with two key innovations. First, QCLAM (Queue-based Cache Low-rank Adaptive Memory) enhances feature learning through a FIFO cache that stores historical representations. Similarity between current LoRA-adapted and cached features guides intelligent fusion, enabling dynamic reuse while effectively controlling memory growth. Second, GPS-SS2D (Greedy Path Scan SS2D) introduces adaptive scanning. A Vision Transformer generates score maps to estimate pixel importance, and a greedy strategy determines optimal forward and backward scanning paths. These learned trajectories replace rigid patterns, enabling SS2D to perform targeted feature extraction. The integration of QCLAM and GPS-SS2D allows VAMamba to adaptively focus on degraded regions while maintaining high computational efficiency. Extensive experiments across diverse restoration tasks demonstrate that VAMamba consistently outperforms existing approaches in both restoration quality and efficiency, establishing new benchmarks for adaptive image restoration. Our code is available at https://github.com/WaterHQH/VAMamba.
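The cache idea in QCLAM can be sketched with a bounded queue: old features drop out automatically, and the current feature is blended with its most similar cached neighbor. The class below is a toy illustration only. The cosine similarity measure, the `blend` weight, and the fusion rule are assumptions of this sketch; the abstract does not give the actual formulation, and the linked repository is the authoritative source.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class FifoFeatureCache:
    """Bounded first-in-first-out store of past feature vectors."""

    def __init__(self, capacity, blend=0.5):
        # deque(maxlen=...) evicts the oldest entry once capacity is reached,
        # which keeps memory use constant.
        self.items = deque(maxlen=capacity)
        self.blend = blend

    def fuse(self, feature):
        """Blend `feature` with its most similar cached vector, then cache it."""
        best, best_sim = None, 0.0
        for cached in self.items:
            sim = cosine(feature, cached)
            if sim > best_sim:
                best, best_sim = cached, sim
        if best is None:
            fused = list(feature)
        else:
            weight = self.blend * best_sim  # more similar means stronger reuse
            fused = [(1.0 - weight) * f + weight * c for f, c in zip(feature, best)]
        self.items.append(list(feature))
        return fused
```

The bounded queue is what keeps memory growth under control in this sketch: the cache never holds more than `capacity` vectors regardless of how many features pass through it.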

[80] Deep Taxonomic Networks for Unsupervised Hierarchical Prototype Discovery cs.CVPDF

Zekun Wang, Ethan Haarer, Zhiyi Dai, Tianyi Zhu, Christopher J. MacLellan

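TL;DR: The paper proposes deep taxonomic networks, a deep latent variable method based on variational inference that automatically discovers hierarchical prototype clusters from unlabeled data, addressing shortcomings of existing deep hierarchical clustering methods.

Details

Motivation: Inspired by how humans organize knowledge into hierarchical taxonomies, the paper addresses limitations of existing deep hierarchical clustering methods, such as tying the structure to the number of classes and underusing prototype information at intermediate levels.

Result: The method performs strongly on multiple image classification datasets, outperforming baselines; qualitative results show that it discovers rich, interpretable hierarchical taxonomies.

Insight: The method discovers coarse-grained semantic categories and also captures fine-grained visual distinctions, demonstrating the potential of hierarchical prototype discovery.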

Abstract: Inspired by the human ability to learn and organize knowledge into hierarchical taxonomies with prototypes, this paper addresses key limitations in current deep hierarchical clustering methods. Existing methods often tie the structure to the number of classes and underutilize the rich prototype information available at intermediate hierarchical levels. We introduce deep taxonomic networks, a novel deep latent variable approach designed to bridge these gaps. Our method optimizes a large latent taxonomic hierarchy, specifically a complete binary tree structured mixture-of-Gaussian prior within a variational inference framework, to automatically discover taxonomic structures and associated prototype clusters directly from unlabeled data without assuming true label sizes. We analytically show that optimizing the ELBO of our method encourages the discovery of hierarchical relationships among prototypes. Empirically, our learned models demonstrate strong hierarchical clustering performance, outperforming baselines across diverse image classification datasets using our novel evaluation mechanism that leverages prototype clusters discovered at all hierarchical levels. Qualitative results further reveal that deep taxonomic networks discover rich and interpretable hierarchical taxonomies, capturing both coarse-grained semantic categories and fine-grained visual distinctions.

[81] MAN: Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising cs.CVPDF

Tangtangfang Fang, Jingxi Hu, Xiangjian He, Jiaqi Yang

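TL;DR: MAN is a latent-diffusion-based multistage anti-noise network for efficient, high-quality low-dose CT image denoising. A compressed latent space and attention mechanisms significantly reduce computational cost while maintaining high-quality denoising.

Details

Motivation: Although diffusion models excel at low-dose CT denoising, their extreme computational cost hinders clinical use. This paper aims to provide an efficient, high-quality solution.

Result: On a low-dose CT dataset, MAN is comparable in PSNR/SSIM to computationally heavy models such as DDPM and Dn-Dp, while running inference more than 60 times faster.

Insight: By using a compressed latent space and attention mechanisms, MAN greatly improves efficiency while maintaining quality, offering a practical path for deploying generative models in medical imaging.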

Abstract: While diffusion models have set a new benchmark for quality in Low-Dose Computed Tomography (LDCT) denoising, their clinical adoption is critically hindered by extreme computational costs, with inference times often exceeding thousands of seconds per scan. To overcome this barrier, we introduce MAN, a Latent Diffusion Enhanced Multistage Anti-Noise Network for efficient and high-quality low-dose CT image denoising. Our method operates in a compressed latent space via a perceptually-optimized autoencoder, enabling an attention-based conditional U-Net to perform a fast, deterministic conditional denoising diffusion process with drastically reduced overhead. On the LDCT and Projection dataset, our model achieves superior perceptual quality, surpassing CNN/GAN-based methods while rivaling the reconstruction fidelity of computationally heavy diffusion models like DDPM and Dn-Dp. Most critically, in the inference stage, our model is over 60x faster than representative pixel space diffusion denoisers, while remaining competitive on PSNR/SSIM scores. By bridging the gap between high fidelity and clinical viability, our work demonstrates a practical path forward for advanced generative models in medical imaging.
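PSNR, one of the two fidelity scores mentioned above, follows directly from the mean squared error between a denoised image and its reference. The sketch below shows the standard formula on flat pixel lists. The default peak value of 255 is an assumption of this sketch; CT data is often stored with a different dynamic range, so the appropriate peak depends on the dataset.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in decibels for two equal-length pixel lists."""
    squared_errors = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

Because PSNR is logarithmic in the error, small differences in the score between two denoisers can correspond to visually similar outputs, which is why it is usually reported together with SSIM.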

[82] RIV: Recursive Introspection Mask Diffusion Vision Language Model cs.CV | cs.AI | cs.CL | cs.LGPDF

YuQian Li, Limeng Qiao, Lin Ma

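TL;DR: The paper proposes RIV, which gives mask diffusion vision language models a self-correction ability through introspection training and a recursive inference mechanism, significantly improving performance on multimodal understanding tasks.

Details

Motivation: Existing mask diffusion vision language models (MDVLMs) cannot correct generation errors, which hurts their reliability and accuracy.

Result: RIV performs strongly on multiple benchmarks, outperforming most existing MDVLMs.

Insight: Introducing introspection and recursion can significantly improve a generative model's self-correction ability, leading to higher accuracy and reliability on multimodal tasks.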

Abstract: Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models are unable to correct errors in generated tokens, meaning they lack self-correction capability. In this paper, we propose Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The first is Introspection Training, where an Introspection Model is introduced to identify errors within generated sequences. Introspection Training enables the model to detect not only grammatical and spelling mistakes, but more importantly, logical errors. The second is Recursive Inference. Beginning with the standard unmasking step, the learned Introspection Model helps to identify errors in the output sequence and remask them. This alternating ($\text{unmask}\rightarrow\text{introspection}\rightarrow\text{remask}$) process is repeated recursively until reliable results are obtained. Experimental results on multiple benchmarks demonstrate that the proposed RIV achieves state-of-the-art performance, outperforming most existing MDVLMs.
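The alternating unmask, introspection, and remask process can be written as a short control loop. In the sketch below, `unmask` and `introspect` are hypothetical callables standing in for the diffusion model and the learned Introspection Model, and the round limit is an assumption added so the loop always terminates; none of these names come from the paper.

```python
MASK = None  # placeholder for a masked token position


def recursive_inference(tokens, unmask, introspect, max_rounds=4):
    """Repeat unmask -> introspection -> remask until no errors are flagged.

    unmask(tokens, round_index): returns a copy with masked positions filled.
    introspect(tokens): returns the indices it judges to be wrong.
    """
    tokens = list(tokens)
    for round_index in range(max_rounds):
        tokens = unmask(tokens, round_index)
        flagged = introspect(tokens)
        # Stop when the output looks reliable, or keep the last attempt
        # rather than returning a sequence that still contains masks.
        if not flagged or round_index == max_rounds - 1:
            break
        for index in flagged:
            tokens[index] = MASK
    return tokens


# Toy stand-ins: the "model" makes one mistake in the first round and the
# "introspector" is an oracle that compares against a known target.
TARGET = ["a", "cat", "sits"]


def toy_unmask(tokens, round_index):
    filled = []
    for i, token in enumerate(tokens):
        if token is MASK:
            token = "dog" if (round_index == 0 and i == 1) else TARGET[i]
        filled.append(token)
    return filled


def toy_introspect(tokens):
    return [i for i, token in enumerate(tokens) if token != TARGET[i]]
```

In the toy run, the first round produces a wrong token, the introspector flags it, and the second round fills only that position again, which mirrors the recursion described in the abstract.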

[83] Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models cs.CVPDF

Beomseok Kang, Niluthpol Chowdhury Mithun, Mikhail Sizintsev, Han-Pang Chiu, Supun Samarasekera

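TL;DR: FAMDA proposes a self-training framework that uses Vision Foundation Models (VFMs) to generate high-quality pseudo-labels for domain adaptation in multi-task dense prediction, achieving SOTA results on synthetic-to-real and day-to-night adaptation tasks.

Details

Motivation: Multi-task dense prediction is important for robotics, but domain shift limits performance in new environments. Existing methods rely mainly on adversarial learning, which is less effective than self-training.

Result: SOTA performance on synthetic-to-real and day-to-night adaptation tasks; the lightweight variant is more than 10 times smaller than the foundation models.

Insight: The capabilities of vision foundation models can be efficiently transferred to lightweight networks through self-training, which suits resource-constrained applications.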

Abstract: Multi-task dense prediction, which aims to jointly solve tasks like semantic segmentation and depth estimation, is crucial for robotics applications but suffers from domain shift when deploying models in new environments. While unsupervised domain adaptation (UDA) addresses this challenge for single tasks, existing multi-task UDA methods primarily rely on adversarial learning approaches that are less effective than recent self-training techniques. In this paper, we introduce FAMDA, a simple yet effective UDA framework that bridges this gap by leveraging Vision Foundation Models (VFMs) as powerful teachers. Our approach integrates Segmentation and Depth foundation models into a self-training paradigm to generate high-quality pseudo-labels for the target domain, effectively distilling their robust generalization capabilities into a single, efficient student network. Extensive experiments show that FAMDA achieves state-of-the-art (SOTA) performance on standard synthetic-to-real UDA multi-task learning (MTL) benchmarks and a challenging new day-to-night adaptation task. Our framework enables the training of highly efficient models; a lightweight variant achieves SOTA accuracy while being more than 10$\times$ smaller than foundation models, highlighting FAMDA’s suitability for creating domain-adaptive and efficient models for resource-constrained robotics applications.

[84] MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing cs.CVPDF

Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, Shiguang Shan

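TL;DR: MotionVerse is a unified multimodal framework that uses large language models (LLMs) to comprehend, generate, and edit motion data in single-person and multi-person scenarios. A motion tokenizer and a Delay Parallel modeling strategy provide efficient motion representation and dependency capture, while a dual-tower architecture avoids modality interference.

Details

Motivation: Traditional methods for motion comprehension, generation, and editing usually rely on task-specific designs and lack unity and flexibility. MotionVerse aims to build a general framework for processing motion data using the multimodal capabilities of LLMs.

Result: Experiments show that MotionVerse performs strongly across motion-related tasks, and ablation studies confirm the effectiveness of each module.

Insight: By combining LLMs with discrete representations of motion data, MotionVerse demonstrates the potential of multimodal frameworks for motion tasks and offers a new approach to efficient modeling.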

Abstract: This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a dual-tower architecture with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.
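Temporal staggering of residual token streams is often implemented by shifting stream k right by k steps, so that at each decoding step the coarser streams for a frame are already available when the finer ones are predicted. The sketch below shows that shift and its inverse. It assumes a one-step delay per stream and a padding id of -1; the abstract does not specify the exact offsets MotionVerse uses.

```python
PAD = -1  # assumed padding id for positions created by the shift


def apply_delay(streams, pad=PAD):
    """Shift stream k right by k steps; every output row has length T + K - 1."""
    num_streams = len(streams)
    return [
        [pad] * k + list(stream) + [pad] * (num_streams - 1 - k)
        for k, stream in enumerate(streams)
    ]


def undo_delay(delayed):
    """Invert apply_delay, recovering the aligned K x T token grid."""
    num_streams = len(delayed)
    length = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + length] for k, row in enumerate(delayed)]
```

With K streams of length T, the staggered layout needs only T + K - 1 decoding steps instead of K times T, which is consistent with the abstract's claim of efficiency comparable to single-stream modeling.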

[85] LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders cs.CV | cs.AI | cs.LGPDF

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Kangli Zi

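TL;DR: LightFair proposes a lightweight method for improving the fairness of text-to-image diffusion models (T2I DMs). By focusing on fine-tuning the text encoder's embeddings, it reduces reliance on full-parameter training or auxiliary networks.

Details

Motivation: Existing fairness methods for T2I DMs usually incur heavy training or sampling costs and perform poorly. LightFair aims to solve this efficiently by optimizing the text encoder's embeddings.

Result: On Stable Diffusion v1.5, LightFair achieves SOTA debiasing with only a quarter of the training burden and almost no increase in sampling burden.

Insight: The text encoder is a core module of T2I DMs that is easy to fine-tune; optimizing its embeddings improves fairness efficiently while limiting extra computation.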

Abstract: This paper explores a novel lightweight approach, LightFair, to achieve fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction. They incur a heavy training or sampling burden and deliver unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder’s neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To fine-tune the text embedding, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden. The code is available at https://github.com/boyuh/LightFair.

[86] EfficientMIL: Efficient Linear-Complexity MIL Method for WSI Classification cs.CVPDF

Chengying She, Ben Wang, Xinran Zhang, Dongjie Fan, Jialu Zhang

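TL;DR: EfficientMIL is a linear-complexity multiple instance learning (MIL) method for whole slide image (WSI) classification. With a purpose-built Adaptive Patch Selector (APS) and efficient sequence models (GRU, LSTM, and Mamba), it removes the computational bottleneck of attention-based MIL methods.

Details

Motivation: Attention-based MIL methods perform well in WSI classification, but their high computational complexity limits practical use, so a more efficient method is needed.

Result: Outperforms existing methods on TCGA-Lung and CAMELYON16; for example, EfficientMIL-Mamba reaches an AUC of 0.976 on TCGA-Lung and EfficientMIL-GRU reaches an AUC of 0.990 on CAMELYON16.

Insight: Linear-complexity sequence models can replace high-complexity attention mechanisms while maintaining good performance, offering a new solution for large-scale WSI classification.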

Abstract: Whole slide image (WSI) classification represents a fundamental challenge in computational pathology, where multiple instance learning (MIL) has emerged as the dominant paradigm. Current state-of-the-art (SOTA) MIL methods rely on attention mechanisms, achieving good performance but requiring substantial computational resources due to quadratic complexity when processing hundreds of thousands of patches. To address this computational bottleneck, we introduce EfficientMIL, a novel linear-complexity MIL approach for WSI classification with a purpose-designed patch selection module, the Adaptive Patch Selector (APS), replacing the quadratic-complexity self-attention mechanisms in Transformer-based MIL methods with efficient sequence models including RNN-based GRU, LSTM, and State Space Model (SSM) Mamba. EfficientMIL achieves significant computational efficiency improvements while outperforming other MIL methods across multiple histopathology datasets. On the TCGA-Lung dataset, EfficientMIL-Mamba achieved an AUC of 0.976 and an accuracy of 0.933, while on the CAMELYON16 dataset, EfficientMIL-GRU achieved an AUC of 0.990 and an accuracy of 0.975, surpassing previous state-of-the-art methods. Extensive experiments demonstrate that APS is also more effective for patch selection than conventional selection strategies.
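The efficiency argument rests on two steps: keep only a subset of patches, then aggregate them in a single pass instead of comparing every patch with every other. The sketch below illustrates both with stand-ins: a plain top-k selection in place of the learned APS, and an exponential moving average in place of a GRU, LSTM, or Mamba layer. Both substitutions are assumptions made to keep the example dependency-free; they are not the paper's modules.

```python
def select_top_k(scores, k):
    """Indices of the k highest-scoring patches, returned in slide order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def recurrent_pool(features, decay=0.5):
    """One-pass aggregation whose cost grows linearly with the patch count.

    Each step mixes the running state with the next patch feature. A real
    sequence model would use learned gates instead of a fixed decay.
    """
    state = [0.0] * len(features[0])
    for feature in features:
        state = [decay * s + (1.0 - decay) * f for s, f in zip(state, feature)]
    return state


def classify_bag(patch_features, patch_scores, k):
    """Select patches, then pool them into a single slide-level embedding."""
    kept = select_top_k(patch_scores, k)
    return recurrent_pool([patch_features[i] for i in kept])
```

Self-attention over N patches needs on the order of N squared comparisons, whereas the loop in `recurrent_pool` touches each selected patch once, which is the source of the linear complexity.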

[87] From Static to Dynamic: a Survey of Topology-Aware Perception in Autonomous Driving cs.CV | cs.ROPDF

Yixiao Chen, Ruining Yang, Xin Chen, Jia He, Dongliang Xu

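TL;DR: This survey systematically reviews four core research directions in topology-aware perception for autonomous driving, emphasizes the paradigm shift from static maps to dynamic perception, and discusses its impact on the adaptability, scalability, and explainability of autonomous driving systems.

Details

Motivation: Traditional static maps provide semantic context for autonomous driving systems, but they are costly to build, hard to update in real time, and generalize poorly across regions. This has pushed researchers toward dynamic representations that use on-board sensor data for real-time map construction and topology reasoning.

Result: Dynamic representations enable real-time map construction and topology reasoning, improving the adaptability, scalability, and explainability of autonomous driving systems.

Insight: By combining multimodal data with language models, dynamic perception methods can better understand complex driving environments, pointing to a new direction for future autonomous driving systems.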

Abstract: The key to achieving autonomous driving lies in topology-aware perception, the structured understanding of the driving environment with an emphasis on lane topology and road semantics. This survey systematically reviews four core research directions under this theme: vectorized map construction, topological structure modeling, prior knowledge fusion, and language model-based perception. Across these directions, we observe a unifying trend: a paradigm shift from static, pre-built maps to dynamic, sensor-driven perception. Specifically, traditional static maps have provided semantic context for autonomous systems. However, they are costly to construct, difficult to update in real time, and lack generalization across regions, limiting their scalability. In contrast, dynamic representations leverage on-board sensor data for real-time map construction and topology reasoning. Each of the four research directions contributes to this shift through compact spatial modeling, semantic relational reasoning, robust domain knowledge integration, and multimodal scene understanding powered by pre-trained language models. Together, they pave the way for more adaptive, scalable, and explainable autonomous driving systems.

[88] Griffin: Generative Reference and Layout Guided Image Composition cs.CVPDF

Aryan Mikaeili, Amirhossein Alimohammadi, Negar Hassanpour, Ali Mahdavi-Amiri, Andrea Tagliasacchi

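TL;DR: This paper proposes Griffin, a training-free method that achieves more precise control over image composition through reference images and layout guidance.

Details

Motivation: Existing text-to-image models excel at generating highly realistic images, but the limits of text-based control restrict scenarios that need more precise layout. The paper addresses multi-image layout control by specifying content through reference images rather than text and by providing explicit layout guidance.

Result: The method is shown to be effective across a variety of image composition tasks, demonstrating flexibility and control for object-level and part-level composition.

Insight: Combining reference images with layout guidance allows more precise control over image composition without additional training, offering a new idea for practical applications of generative models.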

Abstract: Text-to-image models have achieved a level of realism that enables the generation of highly convincing images. However, text-based control can be a limiting factor when more explicit guidance is needed. Defining both the content and its precise placement within an image is crucial for achieving finer control. In this work, we address the challenge of multi-image layout control, where the desired content is specified through images rather than text, and the model is guided on where to place each element. Our approach is training-free, requires a single image per reference, and provides explicit and simple control for object and part-level composition. We demonstrate its effectiveness across various image composition tasks.

[89] Sparse-Up: Learnable Sparse Upsampling for 3D Generation with High-Fidelity Textures cs.CVPDF

Lu Xiao, Jiale Zhang, Yang Liu, Taicheng Huang, Xin Tian

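TL;DR: Sparse-Up proposes an efficient 3D texture modeling framework that uses sparse voxels to guide texture reconstruction and maintain multi-view consistency, while using surface anchoring and view-domain partitioning to break through resolution limits.

Details

Motivation: Existing 3D generation methods trade off preserving high-frequency detail against multi-view consistency, leading to torn textures or limited resolution.

Result: Eliminates more than 70% of redundant voxels and breaks through resolution limits while preserving high-frequency detail and multi-view consistency.

Insight: Sparsity and locality are key to efficient, high-fidelity 3D generation; a learnable upsampling strategy significantly improves the balance between memory and detail.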

Abstract: The creation of high-fidelity 3D assets is often hindered by a ‘pixel-level pain point’: the loss of high-frequency details. Existing methods often trade off one aspect for another: either sacrificing cross-view consistency, resulting in torn or drifting textures, or remaining trapped by the resolution ceiling of explicit voxels, forfeiting fine texture detail. In this work, we propose Sparse-Up, a memory-efficient, high-fidelity texture modeling framework that effectively preserves high-frequency details. We use sparse voxels to guide texture reconstruction and ensure multi-view consistency, while leveraging surface anchoring and view-domain partitioning to break through resolution constraints. Surface anchoring employs a learnable upsampling strategy to constrain voxels to the mesh surface, eliminating over 70% of redundant voxels present in traditional voxel upsampling. View-domain partitioning introduces an image patch-guided voxel partitioning scheme, supervising and back-propagating gradients only on visible local patches. Through these two strategies, we can significantly reduce memory consumption during high-resolution voxel training without sacrificing geometric consistency, while preserving high-frequency details in textures.
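The memory saving from surface anchoring comes from discarding child voxels that do not lie on the surface each time the grid is refined. The sketch below shows that filtering step on integer voxel coordinates. In Sparse-Up the decision is made by a learnable upsampling strategy; here a caller-supplied predicate `keep` stands in for it, which is an assumption of this illustration.

```python
def upsample_sparse(voxels, keep):
    """Double the resolution of a sparse voxel set, keeping only selected children.

    voxels: iterable of (x, y, z) integer coordinates at the coarse level.
    keep: predicate on a fine-level coordinate, standing in for a learned
    surface test. Returns the set of surviving fine-level coordinates.
    """
    children = set()
    for x, y, z in voxels:
        # Each coarse voxel splits into 2 x 2 x 2 candidates at the fine level.
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    child = (2 * x + dx, 2 * y + dy, 2 * z + dz)
                    if keep(child):
                        children.add(child)
    return children
```

Dense upsampling multiplies the voxel count by eight at every level, so pruning off-surface children at each step is what keeps high-resolution training within memory limits.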

[90] ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis cs.CV | cs.AIPDF

Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang

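TL;DR: The paper introduces the ReWatch dataset and the ReWatch-R1 model. A multi-stage synthesis pipeline and a Multi-Agent ReAct framework address the data bottleneck in complex video reasoning, and combining SFT with RLVR significantly improves LVLM video reasoning performance.

Details

Motivation: The application of large vision-language models (LVLMs) to complex video reasoning is limited by a lack of high-quality multi-hop questions and video-grounded CoT data; the paper aims to resolve this data bottleneck.

Result: ReWatch-R1 achieves state-of-the-art average performance on five video reasoning benchmarks.

Insight: Simulating the human “re-watching” process and explicitly modeling information retrieval and verification effectively improves both data quality and model performance for complex video reasoning.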

Abstract: While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like “re-watching” process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation & Reasoning (O&R) reward mechanism that evaluates both the final answer’s correctness and the reasoning’s alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks.

[91] LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training cs.CVPDF

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao

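TL;DR: LLaVA-OneVision-1.5 is an open, efficient multimodal training framework that achieves leading performance through large-scale datasets and low-cost training.

Details

Motivation: Existing large multimodal models are expensive to compute and train; LLaVA-OneVision-1.5 aims to provide an open, low-cost alternative that democratizes multimodal research.

Result: LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and the 4B version surpasses Qwen2.5-VL-3B on all 27.

Insight: With data curation and an efficient training framework, high-performing multimodal models can be built at low cost, giving the community a reproducible open recipe.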

Abstract: We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Unlike existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 26M instruction dataset LLaVA-OneVision-1.5-Instruct, collectively encompassing 64B compressed multimodal tokens. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.

[92] HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score cs.CVPDF

Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Peter A. Beerel

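TL;DR: HIVTP is a training-free method that significantly improves the inference efficiency of vision-language models (VLMs) through hierarchical visual token pruning based on a middle-layer importance score.

Details

Motivation: The vision encoder in a VLM outputs a large number of tokens, which severely hurts inference efficiency, yet many of these tokens are unimportant and can be safely pruned.

Result: HIVTP reduces the TTFT of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improves generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy.

Insight: Middle-layer attention maps reflect token importance more accurately, and the hierarchical pruning strategy balances the retention of global and local information.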

Abstract: Vision-Language Models (VLMs) have shown strong capabilities on diverse multimodal tasks. However, the large number of visual tokens output by the vision encoder severely hinders inference efficiency, and prior studies have shown that many of these tokens are not important and can therefore be safely pruned. In this work, we propose HIVTP, a training-free method to improve VLMs efficiency via hierarchical visual token pruning using a novel middle-layer-based importance score. Specifically, we utilize attention maps extracted from the middle layers of the vision encoder, which better reflect fine-grained and object-level attention, to estimate visual token importance. Based on this, we propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Specifically, we reshape the 1-D visual token sequence output by the vision encoder into a 2-D spatial layout. In the global retaining stage, we divide the image into regions and retain tokens with higher importance scores in each region; in the local retaining stage, we then divide the image into small windows and retain the most important token in each local window. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improve the token generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy, and even achieving improvements on certain benchmarks. Compared with prior works, HIVTP achieves better accuracy while offering higher inference efficiency.
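The two retaining stages described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `hierarchical_prune`, the region and window sizes, and the per-region quota `keep_per_region` are all assumptions, and the importance scores are passed in directly rather than extracted from middle-layer attention maps.

```python
def hierarchical_prune(scores, grid, region, window, keep_per_region):
    """Return the sorted flat indices of the tokens that are kept.

    scores: flat list of importance scores, length grid * grid (row-major).
    region: side length of a global region; window: side of a local window.
    All parameter names here are hypothetical, chosen for illustration.
    """
    def blocks(size):
        # Yield one list of flat token indices per size x size block.
        for r0 in range(0, grid, size):
            for c0 in range(0, grid, size):
                yield [r * grid + c
                       for r in range(r0, min(r0 + size, grid))
                       for c in range(c0, min(c0 + size, grid))]

    kept = set()
    # Global stage: keep the highest-scoring tokens in each region.
    for idx in blocks(region):
        top = sorted(idx, key=lambda i: scores[i], reverse=True)
        kept.update(top[:keep_per_region])
    # Local stage: keep the single most important token in each small window.
    for idx in blocks(window):
        kept.add(max(idx, key=lambda i: scores[i]))
    return sorted(kept)


# A 4x4 token grid with made-up scores.
scores = [i % 7 for i in range(16)]
print(hierarchical_prune(scores, grid=4, region=4, window=2, keep_per_region=2))
```

Taking the union of the two stages means a token that is both globally and locally important is kept only once, so the final count depends on how much the two selections overlap.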

[93] Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding cs.CVPDF

Xixi Jiang, Chen Yang, Dong Zhang, Pingcheng Dong, Xin Yang

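TL;DR: The paper proposes STIM-TM, a spatiotemporal information mining token merging method and the first token merging approach dedicated to surgical video understanding. It reduces token redundancy along the temporal and spatial dimensions independently, cutting GFLOPs by over 65% while maintaining high accuracy.

Details

Motivation: Existing Transformer models incur high computational cost on surgical video and do not fully exploit the spatiotemporal structure of video or the heterogeneity of its information distribution.

Result: Substantial efficiency gains (over 65% GFLOPs reduction) with competitive accuracy across a range of surgical video tasks; the method also supports efficient training on long-sequence surgical videos.

Insight: Decoupled analysis of the temporal and spatial dimensions, together with mining information heterogeneity, is key to more efficient surgical video understanding.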

Abstract: Vision Transformer models have shown impressive effectiveness in surgical video understanding tasks through long-range dependency modeling. However, current methods suffer from prohibitive computational costs due to processing massive spatiotemporal tokens across video frames. While prior work on token merging has advanced model efficiency, it fails to adequately consider the inherent spatiotemporal structure of video data and overlooks the heterogeneous nature of information distribution, leading to suboptimal performance. In this paper, we propose a spatiotemporal information mining token merging (STIM-TM) method, representing the first dedicated approach for surgical video understanding. STIM-TM introduces a decoupled strategy that reduces token redundancy along temporal and spatial dimensions independently. Specifically, the temporal component merges spatially corresponding tokens from consecutive frames using saliency weighting, preserving critical sequential information and maintaining continuity. Meanwhile, the spatial component prioritizes merging static tokens through temporal stability analysis, protecting dynamic regions containing essential surgical information. Operating in a training-free manner, STIM-TM achieves significant efficiency gains with over 65% GFLOPs reduction while preserving competitive accuracy across comprehensive surgical video tasks. Our method also supports efficient training of long-sequence surgical videos, addressing computational bottlenecks in surgical applications.
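As a rough illustration of the temporal component, the sketch below merges spatially corresponding tokens from two consecutive frames with saliency weights. It is a sketch under stated assumptions, not the STIM-TM implementation: the function name `merge_temporal`, the use of a normalized weighted average, and the fallback for zero saliency are all hypothetical choices.

```python
def merge_temporal(frame_a, frame_b, sal_a, sal_b):
    """Merge two frames' tokens position by position using saliency weights.

    frame_a, frame_b: lists of token vectors (lists of floats), same layout.
    sal_a, sal_b: one non-negative saliency score per token (assumed given).
    """
    merged = []
    for ta, tb, wa, wb in zip(frame_a, frame_b, sal_a, sal_b):
        total = wa + wb
        if total == 0:
            # Neither token is salient: fall back to a plain average.
            wa = wb = 0.5
            total = 1.0
        merged.append([(wa * x + wb * y) / total for x, y in zip(ta, tb)])
    return merged


# One token per frame; the second frame's token is three times as salient.
print(merge_temporal([[0.0, 0.0]], [[4.0, 8.0]], [1.0], [3.0]))  # [[3.0, 6.0]]
```

Merging each pair of frames this way halves the token count along the temporal axis while letting the more salient token dominate the merged representation.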

[94] RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks cs.CV | cs.AI | cs.CL | cs.MM | 68T45, 68T50 | I.2.7; I.2.10; I.4.7; I.4.8PDF

Amit Agarwal, Hitesh Laxmichand Patel, Srikant Panda, Hansa Meghwani, Jyotika Singh

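TL;DR: The paper proposes the Region Comprehension Index (RCI), which measures how much a multimodal benchmark relies on global versus local visual information, and reveals that most current benchmarks have a spatial bias toward local reasoning.

Details

Motivation: Multimodal large language models (MLLMs) perform well on vision-language benchmarks, but it is unclear whether those benchmarks actually require global reasoning. Current evaluation methods cannot distinguish global from local reasoning, which hinders dataset curation and the development of models for real-world use.

Result: Applied to 13 multimodal benchmarks, RCI shows that most favor localized reasoning and exhibit significant spatial bias.

Insight: RCI gives researchers and practitioners a diagnostic tool for building more balanced datasets and benchmarks, supporting the development of multimodal systems for real-world applications.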

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive results on vision-language benchmarks, yet it remains unclear whether these benchmarks assess genuine global reasoning or allow success via localized visual cues. Existing evaluation methods do not explicitly measure this distinction, hindering effective dataset curation and real-world focused model development. We introduce Region Comprehension Index (RCI), the first model-based score to directly quantify a dataset’s reliance on global versus local visual information. RCI systematically compares reference-model performance on image patches versus full images, revealing if tasks require holistic image understanding or can be solved with partial or localized visual cues. When applying RCI to 13 widely used multimodal benchmarks, we observed that most of them favor localized reasoning and exhibit significant spatial biases, indicating potential risks in real-world applications. RCI equips researchers & practitioners with an actionable tool for diagnosing & mitigating these biases, enabling the construction of datasets and benchmarks to foster the development of robust, enterprise-ready multimodal systems.

Dayu Tan, Ziwei Zhang, Yansan Su, Xin Peng, Yike Dai

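TL;DR: The paper proposes the MSD-KMamba framework, which combines bidirectional spatial perception with a multi-scale self-distillation strategy to address the difficulty CNN-Transformer hybrids have with long-range dependencies and high computational complexity, significantly improving the accuracy and efficiency of 3D multi-modal brain segmentation.

Details

Motivation: Conventional CNN-Transformer hybrid models rely on high-complexity global attention, which consumes substantial compute, while knowledge distillation and sparse attention struggle to deliver both segmentation accuracy and efficiency on complex tasks.

Result: On multiple standard datasets, MSD-KMamba outperforms existing methods in segmentation accuracy, robustness, and generalization while maintaining high computational efficiency.

Insight: Combining bidirectional spatial perception with multi-scale self-distillation effectively balances model performance and computational complexity, offering a new solution for 3D multi-modal segmentation.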

Abstract: Numerous CNN-Transformer hybrid models rely on high-complexity global attention mechanisms to capture long-range dependencies, which introduces non-linear computational complexity and leads to significant resource consumption. Although knowledge distillation and sparse attention mechanisms can improve efficiency, they often fall short of delivering the high segmentation accuracy necessary for complex tasks. Balancing model performance with computational efficiency remains a critical challenge. In this work, we propose a novel 3D multi-modal image segmentation framework, termed MSD-KMamba, which integrates bidirectional spatial perception with multi-scale self-distillation. The bidirectional spatial aware branch effectively captures long-range spatial context dependencies across brain regions, while also incorporating a powerful nonlinear feature extraction mechanism that further enhances the model’s ability to learn complex and heterogeneous patterns. In addition, the proposed multi-scale self-distilled fusion strategy strengthens hierarchical feature representations and improves the transfer of semantic information at different resolution levels. By jointly leveraging the bidirectional spatial perception branch and the multi-scale self-distilled fusion strategy, our framework effectively mitigates the bottleneck of quadratic computational complexity in volumetric segmentation, while simultaneously addressing the limitation of insufficient global perception. Extensive experiments on multiple standard benchmark datasets demonstrate that MSD-KMamba consistently outperforms state-of-the-art methods in segmentation accuracy, robustness, and generalization, while maintaining high computational efficiency and favorable scalability. The source code of MSD-KMamba is publicly available at https://github.com/daimao-zhang/MSD-KMamba.

[96] QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification cs.CVPDF

Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li

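TL;DR: QuantSparse is a unified framework that combines model quantization with attention sparsification to substantially reduce the compute and storage costs of video diffusion transformers while maintaining strong performance.

Details

Motivation: Diffusion transformers excel at video generation, but their high computational and memory costs limit practical deployment. Existing methods, quantization and sparsification alike, degrade severely under aggressive compression.

Result: On HunyuanVideo-13B, QuantSparse reaches 20.88 PSNR, well above the Q-VDiT baseline (16.85), while delivering a 3.68× reduction in storage and a 1.88× speedup in end-to-end inference.

Insight: By combining multi-scale distillation with second-order sparse attention reparameterization, QuantSparse shows that strong performance can be preserved under aggressive compression, suggesting a path toward practical video generation models.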

Abstract: Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose QuantSparse, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce Multi-Scale Salient Attention Distillation, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop Second-Order Sparse Attention Reparameterization, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a $3.68\times$ reduction in storage and $1.88\times$ acceleration in end-to-end inference. Our code will be released at https://github.com/wlfeng0509/QuantSparse.

[97] HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection cs.CV | cs.CLPDF

Siyuan Gao, Jiashu Yao, Haoyu Wen, Yuhang Guo, Zeming Liu

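TL;DR: The paper introduces HomeSafeBench, a benchmark for evaluating embodied vision-language models (VLMs) on free-exploration home safety inspection, addressing two main problems of existing benchmarks: reliance on textual descriptions and static viewpoints.

Details

Motivation: Existing benchmarks for home safety inspection fall short in two ways: they rely on textual descriptions of the environment rather than visual information, and they use a static viewpoint that prevents free exploration. As a result, they cannot accurately evaluate VLM-based embodied agents.

Result: Experiments reveal significant limitations: even the best-performing VLM achieves an F1-score of only 10.23%, with models struggling in particular to identify safety hazards and choose exploration strategies.

Insight: Current VLMs have clear shortcomings in home safety inspection; future work should focus on improving models in dynamic, multi-view environments.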

Abstract: Embodied agents can identify and report safety hazards in home environments. Accurately evaluating their capabilities in home safety inspection tasks is crucial, but existing benchmarks suffer from two key limitations. First, they oversimplify safety inspection tasks by using textual descriptions of the environment instead of direct visual information, which hinders the accurate evaluation of embodied agents based on Vision-Language Models (VLMs). Second, they use a single, static viewpoint for environmental observation, which restricts the agents’ free exploration and causes the omission of certain safety hazards, especially those that are occluded from a fixed viewpoint. To alleviate these issues, we propose HomeSafeBench, a benchmark with 12,900 data points covering five common home safety hazards: fire, electric shock, falling objects, trips, and child safety. HomeSafeBench provides dynamic first-person perspective images from simulated home environments, enabling the evaluation of VLM capabilities for home safety inspection. By allowing the embodied agents to freely explore the room, HomeSafeBench provides multiple dynamic perspectives in complex environments for a more thorough inspection. Our comprehensive evaluation of mainstream VLMs on HomeSafeBench reveals that even the best-performing model achieves an F1-score of only 10.23%, demonstrating significant limitations in current VLMs. The models particularly struggle with identifying safety hazards and selecting effective exploration strategies. We hope HomeSafeBench will provide valuable reference and support for future research related to home safety inspection. Our dataset and code will be publicly available soon.

[98] Confidence Aware SSD Ensemble with Weighted Boxes Fusion for Weapon Detection cs.CV | cs.LGPDF

Atharva Jadhav, Arush Karekar, Manas Divekar, Shachi Natu

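TL;DR: The paper proposes an ensemble of SSD models that combines several feature-extraction backbones with Weighted Boxes Fusion (WBF), significantly improving the robustness and accuracy of weapon detection.

Details

Motivation: Public safety needs are pressing, yet single-model detectors perform poorly in challenging scenes (occlusion, varying lighting, and so on), so more robust weapon detection is needed.

Result: The ensemble reaches an mAP of 0.838, a 2.948% relative improvement over the best single model.

Insight: The fusion strategy (such as WBF) matters as much as model diversity; confidence-aware fusion is a key mechanism for improving ensemble accuracy.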

Abstract: The safety and security of public spaces is of vital importance, driving the need for sophisticated surveillance systems capable of accurately detecting weapons, which are often hampered by issues like partial occlusion, varying lighting, and cluttered backgrounds. While single-model detectors are advanced, they often lack robustness in these challenging conditions. This paper presents the hypothesis that an ensemble of Single Shot Multibox Detector (SSD) models with diverse feature extraction backbones can significantly enhance detection robustness. To leverage diverse feature representations, individual SSD models were trained using a selection of backbone networks: VGG16, ResNet50, EfficientNet, and MobileNetV3. The study is conducted on a dataset consisting of images of three distinct weapon classes: guns, heavy weapons and knives. The predictions from these models are combined using the Weighted Boxes Fusion (WBF) method, an ensemble technique designed to optimize bounding box accuracy. Our key finding is that the fusion strategy is as critical as the ensemble’s diversity: a WBF approach using a ‘max’ confidence scoring strategy achieved a mean Average Precision (mAP) of 0.838. This represents a 2.948% relative improvement over the best-performing single model and consistently outperforms other fusion heuristics. This research offers a robust approach to enhancing real-time weapon detection capabilities in surveillance applications by demonstrating that confidence-aware fusion is a key mechanism for improving accuracy metrics of ensembles.
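To make the ‘max’ confidence strategy concrete, here is a simplified, single-class sketch of box fusion in plain Python. It is not the paper's pipeline or a full Weighted Boxes Fusion implementation: it assigns each box to the first overlapping cluster instead of the best one, ignores per-model weights, and the names `iou` and `fuse_boxes` and the 0.55 threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def fuse_boxes(detections, iou_thr=0.55):
    """Fuse pooled detections from several models (one class only).

    detections: list of (box, score) pairs, box = (x1, y1, x2, y2).
    Returns a list of (fused_box, confidence) pairs.
    """
    clusters = []  # each cluster holds its member boxes and the fused box
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        match = next((c for c in clusters if iou(c["box"], box) > iou_thr), None)
        if match is None:
            match = {"members": [], "box": box}
            clusters.append(match)
        match["members"].append((box, score))
        total = sum(s for _, s in match["members"])
        # Fused coordinates: confidence-weighted mean of the member boxes.
        match["box"] = tuple(
            sum(b[k] * s for b, s in match["members"]) / total for k in range(4)
        )
    # 'max' strategy: the fused box takes the highest member confidence.
    return [(c["box"], max(s for _, s in c["members"])) for c in clusters]


dets = [((0, 0, 10, 10), 0.9), ((0, 0, 10, 10), 0.6), ((50, 50, 60, 60), 0.5)]
for box, conf in fuse_boxes(dets):
    print(box, conf)
```

Under an averaging strategy the first fused box would score 0.75; with ‘max’ it keeps 0.9, so agreement between models never lowers the confidence of a strong detection.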

[99] DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion cs.CVPDF

Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu

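TL;DR: DiffPCN is a point cloud completion framework based on latent diffusion models that uses a two-stage approach (coarse generation, then refinement) to produce accurate and complete point clouds.

Details

Motivation: The unstructured, irregular nature of point clouds has limited the use of latent diffusion models (LDMs) for point cloud completion. To exploit the generative power of LDMs, DiffPCN proposes a generate-and-refine approach based on multi-view depth images.

Result: Experiments show that DiffPCN achieves state-of-the-art geometric accuracy and shape completeness, significantly improving the robustness and consistency of point cloud completion.

Insight: By combining the generative capability of latent diffusion models with local association features of point clouds, DiffPCN offers a new approach to generation tasks on unstructured data.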

Abstract: Latent diffusion models (LDMs) have demonstrated remarkable generative capabilities across various low-level vision tasks. However, their potential for point cloud completion remains underexplored due to the unstructured and irregular nature of point clouds. In this work, we propose DiffPCN, a novel diffusion-based coarse-to-fine framework for point cloud completion. Our approach comprises two stages: an initial stage for generating coarse point clouds, and a refinement stage that improves their quality through point denoising and upsampling. Specifically, we first project the unordered and irregular partial point cloud into structured depth images, which serve as conditions for a well-designed DepthLDM to synthesize completed multi-view depth images that are used to form coarse point clouds. In this way, our DiffPCN can yield high-quality and high-completeness coarse point clouds by leveraging LDM’s powerful generation and comprehension capabilities. Then, since LDMs inevitably introduce outliers into the generated depth maps, we design a Point Denoising Network to remove artifacts from the coarse point cloud by predicting a per-point distance score. Finally, we devise an Association-Aware Point Upsampler, which guides the upsampling process by leveraging local association features between the input point cloud and the corresponding coarse points, further yielding a dense and high-fidelity output. Experimental results demonstrate that our DiffPCN achieves state-of-the-art performance in geometric accuracy and shape completeness, significantly improving the robustness and consistency of point cloud completion.
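The first step, turning an unordered point cloud into a structured depth image, can be illustrated with a single-view orthographic projection. The abstract does not specify the camera model, so this is only a sketch under assumptions: the function name `project_to_depth`, the orthographic view along the z axis, the normalized coordinate range, and the use of `None` for empty pixels are all hypothetical.

```python
def project_to_depth(points, size=4, lo=-1.0, hi=1.0):
    """Project points in [lo, hi]^3 along the z axis into a size x size depth map.

    Each pixel keeps the smallest z that lands in it (a simple z-buffer);
    pixels that receive no point stay None.
    """
    depth = [[None] * size for _ in range(size)]
    scale = size / (hi - lo)
    for x, y, z in points:
        # Clamp so that points exactly on the upper boundary stay in range.
        col = min(size - 1, max(0, int((x - lo) * scale)))
        row = min(size - 1, max(0, int((y - lo) * scale)))
        if depth[row][col] is None or z < depth[row][col]:
            depth[row][col] = z
    return depth


partial_cloud = [(-1.0, -1.0, 0.5), (-0.9, -0.9, 0.2), (0.9, 0.9, -0.3)]
for row in project_to_depth(partial_cloud):
    print(row)
```

A multi-view version would rotate the points into each view's frame before calling the same routine, producing one depth image per view.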

[100] Video Panels for Long Video Understanding cs.CV | cs.AIPDF

Lars Doorenbos, Federico Spurio, Juergen Gall

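TL;DR: The paper proposes a visual prompting strategy for long-video understanding that combines multiple frames into a single panel image, trading spatial detail for temporal resolution to improve the performance of existing models.

Details

Motivation: Existing video-language models underperform on long-video understanding, and adding model complexity or new modules requires substantial tuning and training time. The authors therefore propose a general method that is training-free and parameter-free.

Result: 1. On the TimeScope (Long) dataset, video question answering accuracy improves by up to 19.4%. 2. Across five benchmarks, the method yields consistent improvements over different model architectures and sizes.

Insight: 1. A simple visual prompting strategy can improve long-video understanding without modifying the model. 2. The trade-off between spatial and temporal resolution is one of the key challenges in long-video understanding.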

Abstract: Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. We will make our code available upon acceptance.
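The core idea of combining several frames into one panel image can be sketched in a few lines. This is an illustrative sketch only, assuming frames are equally sized 2-D lists of pixel values; the function name `make_panel`, the row-major layout, and zero padding for unused cells are assumptions, not details taken from the paper.

```python
def make_panel(frames, cols):
    """Tile equally sized frames (2-D lists of pixels) row-major into one image."""
    h, w = len(frames[0]), len(frames[0][0])
    rows = -(-len(frames) // cols)  # ceiling division
    blank = [[0] * w for _ in range(h)]
    # Pad with blank frames so the last panel row is complete.
    padded = frames + [blank] * (rows * cols - len(frames))
    panel = []
    for r in range(rows):
        tiles = padded[r * cols:(r + 1) * cols]
        for y in range(h):
            # Concatenate scanline y of every tile in this panel row.
            panel.append([px for tile in tiles for px in tile[y]])
    return panel


# Three 2x2 frames filled with 1, 2 and 3, arranged two per row.
frames = [[[i] * 2 for _ in range(2)] for i in (1, 2, 3)]
for line in make_panel(frames, cols=2):
    print(line)
```

In practice the frames would be downscaled so the panel fits the model's input resolution, which is where spatial detail is traded for temporal coverage.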

[101] M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation cs.CV | cs.AIPDF

Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li

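TL;DR: The paper presents M3DLayout, a large-scale multi-source dataset for 3D indoor layout generation, intended to address the limited scale, diversity, and annotation quality of existing datasets.

Details

Motivation: Current 3D indoor layout generation models are constrained by the scale, diversity, and annotation quality of available datasets, which limits their performance and applicability.

Result: Experiments show that M3DLayout supports the generation of complex, detailed scenes; the Inf3DLayout subset in particular improves the generation of small objects.

Insight: Multi-source data with high-quality annotations is key to improving 3D scene generation models.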

Abstract: In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 15,080 layouts and over 258k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using a text-conditioned diffusion model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis.

[102] LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models cs.CV | cs.AI | cs.LG | eess.IVPDF

Shubhang Bhatnagar, Andy Xu, Kar-Han Tan, Narendra Ahuja

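TL;DR: This paper studies ultra-low bit (<4-bit) quantization of multimodal large language models (MLLMs) and proposes LUQ (Layerwise Ultra-Low Bit Quantization), a method based on analyzing differences between layers. Selective quantization substantially reduces memory use while keeping performance degradation below 10%.

Details

Motivation: MLLMs excel at vision-language tasks, but deploying them requires large amounts of memory and compute. Post-training quantization (PTQ) has compressed unimodal language models to as low as 1-bit, but its effectiveness for MLLMs has not been well studied.

Result: Evaluated on LLaVA-1.5 and Qwen-2.5-VL, LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, with less than 10% performance degradation on the MME benchmark.

Insight: 1. Differences in activation entropy across MLLM layers leave room for quantization-aware optimization; 2. Calibrating with a mix of multimodal tokens improves quantization performance; 3. Selective quantization is an effective way to reduce MLLM memory use.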

Abstract: Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens and intermediate layer activations produced by them exhibit significantly higher statistical variance and entropy compared to text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens vary significantly over different layers, with some layers having lower entropy activation distributions. We empirically show that such layers in these models can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. We also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.
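The layer selection idea, giving lower-entropy layers the more aggressive bit width, might look something like the sketch below. It is a hedged illustration, not the LUQ procedure: the histogram-based entropy estimate, the 2-bit and 4-bit widths, the fixed count `n_low`, and the names `histogram_entropy` and `assign_bits` are all assumptions.

```python
import math


def histogram_entropy(values, bins=16):
    """Shannon entropy (in bits) of a fixed-width histogram of the values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for v in values:
        # The maximum value is clamped into the last bin.
        counts[min(bins - 1, int((v - lo) / (hi - lo) * bins))] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def assign_bits(layer_activations, n_low, low_bits=2, high_bits=4):
    """Give the n_low lowest-entropy layers low_bits and the rest high_bits."""
    entropy = {name: histogram_entropy(a) for name, a in layer_activations.items()}
    low = set(sorted(entropy, key=entropy.get)[:n_low])
    return {name: (low_bits if name in low else high_bits) for name in entropy}


# Toy activations: layer0 is highly concentrated, layer1 is spread out.
acts = {
    "layer0": [0.0] * 15 + [1.0],
    "layer1": [i / 16 for i in range(16)],
    "layer2": [0.0] * 8 + [1.0] * 8,
}
print(assign_bits(acts, n_low=1))
```

A real calibration pass would collect these activations from a mix of image and text tokens, which the abstract reports helps in the ultra-low bit regime.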

[103] FastViDAR: Real-Time Omnidirectional Depth Estimation via Alternative Hierarchical Attention cs.CV | cs.ROPDF

Hangtian Zhao, Xiang Chen, Yizhe Li, Qianhao Wang, Haibo Lu

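TL;DR: FastViDAR is a real-time depth estimation framework for multiple fisheye cameras. Its Alternative Hierarchical Attention (AHA) mechanism and multi-view ERP fusion enable efficient feature fusion and depth estimation, reaching up to 20 FPS on embedded hardware.

Details

Motivation: Real-time omnidirectional depth estimation is in strong demand in areas such as autonomous driving and virtual reality, but existing methods are bottlenecked by computational efficiency and feature fusion.

Result: The model performs well on real datasets with competitive zero-shot performance, and inference on embedded hardware reaches up to 20 FPS.

Insight: The AHA mechanism achieves effective cross-view feature fusion with reduced computational overhead, and ERP projection provides an efficient way to unify multi-view depth.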

Abstract: In this paper we propose FastViDAR, a novel framework that takes four fisheye camera inputs and produces a full $360^\circ$ depth map along with per-camera depth, fusion depth, and confidence estimates. Our main contributions are: (1) We introduce an Alternative Hierarchical Attention (AHA) mechanism that efficiently fuses features across views through separate intra-frame and inter-frame windowed self-attention, achieving cross-view feature mixing with reduced overhead. (2) We propose a novel ERP fusion approach that projects multi-view depth estimates to a shared equirectangular coordinate system to obtain the final fusion depth. (3) We generate ERP image-depth pairs using HM3D and 2D3D-S datasets for comprehensive evaluation, demonstrating competitive zero-shot performance on real datasets while achieving up to 20 FPS on NVIDIA Orin NX embedded hardware. Project page: https://3f7dfc.github.io/FastVidar/
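Projecting into a shared equirectangular (ERP) coordinate system amounts to converting each 3-D point to longitude and latitude and scaling those to pixel coordinates. The sketch below shows that standard mapping; the axis convention (x right, y down, z forward), the function name `to_erp_pixel`, and the image size are assumptions, and the paper's actual fusion rule is not reproduced here.

```python
import math


def to_erp_pixel(x, y, z, width, height):
    """Map a non-zero 3-D point to equirectangular pixel coordinates (u, v).

    Assumed axes: x to the right, y down, z forward. Longitude 0 (straight
    ahead) lands in the horizontal center of the image.
    """
    lon = math.atan2(x, z)                                # in [-pi, pi]
    lat = math.asin(y / math.sqrt(x * x + y * y + z * z))  # in [-pi/2, pi/2]
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (lat / math.pi + 0.5) * height
    return u, v


print(to_erp_pixel(0.0, 0.0, 1.0, 360, 180))  # straight ahead: image center
print(to_erp_pixel(1.0, 0.0, 0.0, 360, 180))  # 90 degrees to the right
```

Once every camera's depth estimates are expressed in these shared coordinates, estimates that land on the same pixel can be combined, for example by using the per-camera confidence outputs the abstract mentions.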

[104] HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation cs.CV | cs.AIPDF

Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji

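TL;DR: HieraTok is a multi-scale Vision Transformer (ViT)-based tokenizer that uses multi-scale downsampling and scale-causal attention to significantly improve image reconstruction and generation, achieving the best results among ViT tokenizers.

Details

Motivation: Existing single-scale ViT tokenizers are limited in how they model image representations. The paper proposes a multi-scale tokenizer design to better capture visual information from global semantics to local detail.

Result: 1. rFID on image reconstruction improves by 27.2% (1.47→1.07);
2. In downstream generation, convergence is 1.38× faster and gFID improves by 18.9% (16.4→13.3);
3. With scaled-up training, rFID reaches 0.45 and gFID reaches 1.82, the best among ViT tokenizers.

Insight: Multi-scale representations and progressive information flow are key to improving ViT tokenizers. HieraTok’s smoother, more uniformly distributed latent space may be the main reason for faster convergence on downstream tasks.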

Abstract: In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer’s training, we demonstrate its potential with a state-of-the-art rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer in image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
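One plausible reading of scale-causal attention is a block-wise mask in which each token may attend to tokens at its own scale and at all coarser scales, but not to finer ones. The sketch below builds such a mask; this interpretation, the coarse-to-fine token ordering, and the function name `scale_causal_mask` are assumptions rather than details confirmed by the abstract.

```python
def scale_causal_mask(tokens_per_scale):
    """Build a boolean attention mask over multi-scale tokens.

    tokens_per_scale lists the token count of each scale, coarse to fine.
    mask[i][j] is True when token i may attend to token j, which here means
    token j belongs to the same scale as token i or to a coarser one.
    """
    # Scale index of every token in the flattened sequence.
    scale_of = [s for s, n in enumerate(tokens_per_scale) for _ in range(n)]
    return [[sj <= si for sj in scale_of] for si in scale_of]


# One coarse token followed by two finer tokens.
for row in scale_causal_mask([1, 2]):
    print(row)
```

With this mask the coarse token sees only itself, while the finer tokens see everything, so information flows from low-resolution features to high-resolution ones and never the other way.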

[105] GRS-SLAM3R: Real-Time Dense SLAM with Gated Recurrent State cs.CV | cs.ROPDF

Guole Shen, Tianchen Deng, Yanbo Wang, Yongtao Chen, Yilin Shen

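TL;DR: GRS-SLAM3R is an end-to-end SLAM framework built around a gated recurrent state that supports real-time dense scene reconstruction and pose estimation, improving on DUSt3R-based methods through spatial memory and global consistency.

Details

Motivation: Existing DUSt3R-based methods estimate pointmaps from image pairs only, overlooking spatial memory and global consistency, which limits dense SLAM performance.

Result: Experiments show that GRS-SLAM3R achieves superior reconstruction accuracy while maintaining real-time performance.

Insight: Spatial memory and global consistency are key to dense SLAM; the gating mechanism and submap strategy effectively balance real-time performance and accuracy.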

Abstract: DUSt3R-based end-to-end scene reconstruction has recently shown promising results in dense visual SLAM. However, most existing methods only use image pairs to estimate pointmaps, overlooking spatial memory and global consistency. To this end, we introduce GRS-SLAM3R, an end-to-end SLAM framework for dense scene reconstruction and pose estimation from RGB images without any prior knowledge of the scene or camera parameters. Unlike existing DUSt3R-based frameworks, which operate on all image pairs and predict per-pair point maps in local coordinate frames, our method supports sequential input and incrementally estimates metric-scale point clouds in a global coordinate frame. To improve consistent spatial correlation, we use a latent state for spatial memory and design a transformer-based gated update module to reset and update the spatial memory that continuously aggregates and tracks relevant 3D information across frames. Furthermore, we partition the scene into submaps, apply local alignment within each submap, and register all submaps into a common world frame using relative constraints, producing a globally consistent map. Experiments on various datasets show that our framework achieves superior reconstruction accuracy while maintaining real-time performance.

[106] ResAD++: Towards Class Agnostic Anomaly Detection via Residual Feature Learning cs.CVPDF

Xincheng Yao, Chao Shi, Muming Zhao, Guangtao Zhai, Chongyang Zhang

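TL;DR: ResAD++ is a class-agnostic anomaly detection method that learns the distribution of residual features rather than initial features to decorrelate features from class, and further improves performance with a feature hypersphere constraint and a new loss function.

Details

Motivation: Current single- and multi-class anomaly detection methods perform poorly on new classes, mainly because their learned features remain class-related.

Result: On eight real-world anomaly detection datasets, ResAD++ significantly outperforms existing methods on new classes and also surpasses the original ResAD.

Insight: A consistent residual feature distribution is essential for generalizing to new classes; feature decorrelation and scale constraints are effective ways to achieve it.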

Abstract: This paper explores the problem of class-agnostic anomaly detection (AD), where the objective is to train one class-agnostic AD model that can generalize to detect anomalies in diverse new classes from different domains without any retraining or fine-tuning on the target data. When applied to new classes, the performance of current single- and multi-class AD methods is still unsatisfactory. One fundamental reason is that representation learning in existing methods is still class-related, namely, feature correlation. To address this issue, we propose residual features and construct a simple but effective framework, termed ResAD. Our core insight is to learn the residual feature distribution rather than the initial feature distribution. Residual features are formed by matching and then subtracting normal reference features. In this way, we can effectively realize feature decorrelation. Even in new classes, the distribution of normal residual features would not remarkably shift from the learned distribution. In addition, we think that residual features still have one issue: scale correlation. To this end, we propose a feature hypersphere constraining approach, which learns to constrain initial normal residual features into a spatial hypersphere to make the feature scales of different classes as consistent as possible. Furthermore, we propose a novel logbarrier bidirectional contraction OCC loss and a vector quantization based feature distribution matching module to enhance ResAD, leading to the improved version of ResAD (ResAD++). Comprehensive experiments on eight real-world AD datasets demonstrate that our ResAD++ can achieve remarkable AD results when directly used in new classes, outperforming state-of-the-art competing methods and also surpassing ResAD. The code is available at https://github.com/xcyao00/ResAD.
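The "match, then subtract" construction of residual features can be sketched as a nearest-neighbor lookup against a bank of normal reference features. This is a minimal sketch under assumptions: squared Euclidean distance as the matching criterion and the function name `residual_features` are illustrative choices, not details confirmed by the abstract.

```python
def residual_features(features, references):
    """Subtract from each feature its nearest normal reference feature.

    features, references: lists of equal-length vectors (lists of floats).
    Matching uses squared Euclidean distance, which is an assumption here.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    residuals = []
    for f in features:
        nearest = min(references, key=lambda r: sqdist(f, r))
        residuals.append([x - y for x, y in zip(f, nearest)])
    return residuals


# Two references from very different "classes"; both residuals end up small.
refs = [[0.0, 0.0], [10.0, 10.0]]
print(residual_features([[1.0, 1.0], [10.0, 9.0]], refs))
```

The two input features sit far apart in the original space, yet their residuals are of similar magnitude, which is the sense in which subtraction removes class-specific offsets.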

[107] Poivre: Self-Refining Visual Pointing with Reinforcement Learning cs.CV | cs.AI

Wenjie Yang, Zengfeng Huang

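TL;DR: The paper proposes Poivre, a self-refining visual pointing procedure that uses reinforcement learning (RL) to iteratively refine coordinate predictions, improving the performance of vision-language models (VLMs) on visual pointing.

Details

Motivation: Current VLMs fall far behind humans when they must complete visual pointing in a single step, mainly because they cannot observe their own action and adjust it iteratively as humans do.

Result: On Point-Bench, Poivre-7B outperforms larger models such as Gemini-2.5-Pro and Molmo-72B by over 3%.

Insight: Imitating human iterative adjustment and training that behavior with RL improves pointing performance; the released code, dataset, and checkpoint support follow-up research.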

Abstract: Visual pointing, which aims to localize a target by predicting its coordinates on an image, has emerged as an important problem in the realm of vision-language models (VLMs). Despite its broad applicability, recent benchmarks show that current VLMs still fall far behind human performance on this task. A key limitation is that VLMs are typically required to complete the pointing task in a single step, akin to asking humans to point at an object without seeing their own fingers. To address this issue, we propose a simple yet effective self-refining procedure: Point, Visualize, then Refine (Poivre). This procedure enables a VLM to first mark its estimated point, then iteratively refine the coordinates if necessary. Inspired by advances of reasoning models in the natural language domain, we employ reinforcement learning (RL) to incentivize this self-refining ability. For the RL training, we design a neat process reward that is not only empirically effective but also grounded in appealing properties. Our trained model, Poivre-7B, sets a new state of the art on Point-Bench, outperforming both proprietary models such as Gemini-2.5-Pro and large open-source models such as Molmo-72B by over 3%. To support future research, we release our training and inference code, dataset, and the Poivre-7B checkpoint.

[108] PVTAdpNet: Polyp Segmentation using Pyramid vision transformer with a novel Adapter block cs.CV | cs.AI

Arshia Yousefi Nezhad, Helia Aghaei, Hedieh Sajedi

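TL;DR: PVTAdpNet is a polyp segmentation model built on a Pyramid Vision Transformer backbone. It combines a U-Net-style structure with adapter blocks to deliver accurate, real-time segmentation suited to clinical use.

Details

Motivation: Colorectal cancer is common and deadly, and traditional colonoscopy has a high miss rate because polyps vary widely. PVTAdpNet aims to improve detection through better feature extraction and segmentation.

Result: On out-of-distribution polyp datasets, PVTAdpNet reaches a Dice coefficient of 0.8851 and an mIoU of 0.8167 while running in real time.

Insight: Combining the strengths of Transformers and CNNs with adapters and attention enables efficient segmentation on complex medical imaging tasks.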

Abstract: Colorectal cancer ranks among the most common and deadly cancers, emphasizing the need for effective early detection and treatment. To address the limitations of traditional colonoscopy, including high miss rates due to polyp variability, we introduce the Pyramid Vision Transformer Adapter Residual Network (PVTAdpNet). This model integrates a U-Net-style encoder-decoder structure with a Pyramid Vision Transformer backbone, novel residual blocks, and adapter-based skip connections. The design enhances feature extraction, dense prediction, and gradient flow, supported by squeeze-and-excitation attention for improved channel-wise feature refinement. PVTAdpNet achieves real-time, accurate polyp segmentation, demonstrating superior performance on benchmark datasets with high mDice and mIoU scores, making it highly suitable for clinical applications. PVTAdpNet obtains a high Dice coefficient of 0.8851 and a mean Intersection over Union (mIoU) of 0.8167 on out-of-distribution polyp datasets. Evaluation on the PolypGen dataset demonstrates PVTAdpNet’s capability for real-time, accurate performance within familiar distributions. The source code of our network is available at https://github.com/ayousefinejad/PVTAdpNet.git
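For readers unfamiliar with the two reported metrics, the sketch below computes the Dice coefficient and IoU for binary masks and averages them over images. The per-image averaging in `mean_scores` is an assumption; the abstract does not say how the authors aggregate, and the function names are illustrative.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    """Dice coefficient and IoU for one pair of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Dice = 2|A ∩ B| / (|A| + |B|); eps guards against empty masks.
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    # IoU = |A ∩ B| / |A ∪ B|.
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)

def mean_scores(preds, targets):
    """Average Dice and IoU over a set of images (assumed convention)."""
    scores = np.array([dice_and_iou(p, t) for p, t in zip(preds, targets)])
    return scores.mean(axis=0)  # array([mDice, mIoU])

# One predicted mask with two foreground pixels, one of which is correct.
dice, iou = dice_and_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Dice is never lower than IoU for the same pair of masks, which is consistent with the reported 0.8851 versus 0.8167.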

[109] UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception cs.CV

Xinyang Song, Libin Wang, Weining Wang, Shaozhen Liu, Dandan Zheng

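TL;DR: UniAlignment is a unified multimodal generation framework built on a single diffusion transformer. A dual-stream diffusion training strategy strengthens cross-modal consistency, and the model performs well under complex textual instructions.

Details

Motivation: Existing methods rely on vision-language models or modular designs, which fragments architectures and wastes computation. UniAlignment aims to make cross-modal semantic understanding more unified and efficient.

Result: UniAlignment outperforms existing baselines across multiple tasks and benchmarks, demonstrating the potential of diffusion models for unified multimodal generation.

Insight: A unified generation framework combined with semantic alignment offers an efficient and consistent approach to multimodal tasks.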

Abstract: The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multi-modal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model’s cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.

[110] GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning cs.CV

Xiaojie Li, Bei Wang, Jianlong Wu, Yue Yu, Liqiang Nie

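TL;DR: GenView++ is a unified contrastive learning framework that uses adaptive view generation and quality-driven supervision to address the limited view diversity and semantic inconsistency of existing methods, improving performance on vision and vision-language tasks.

Details

Motivation: Current contrastive learning methods fall short in both constructing and supervising high-quality positive pairs: augmentations offer limited diversity and can corrupt semantics, and without any quality assessment all pairs are treated equally, which makes supervision suboptimal.

Result: For vision tasks, ImageNet linear classification improves by +2.5% over MoCov2. For vision-language tasks, average zero-shot classification accuracy across ten datasets rises by +12.31% over CLIP and +5.31% over SLIP, and Flickr30k text retrieval R@5 improves by +3.2%.

Insight: Dynamically generating diverse views and supervising them with a quality assessment improves contrastive learning, particularly on cross-modal tasks.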

Abstract: The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts by introducing two synergistic innovations. To improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair’s semantic alignment and diversity to dynamically reweight their training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned pairs. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%. The code is available at https://github.com/xiaojieli0903/GenViewPlusPlus.

[111] A Modality-Tailored Graph Modeling Framework for Urban Region Representation via Contrastive Learning cs.CV | stat.AP

Yaya Zhao, Kaiqi Zhao, Zixuan Tang, Zhiyuan Liu, Xiaoling Lu

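TL;DR: MTGRR is a graph modeling framework for multimodal urban data that tailors processing to each modality and optimizes region representations with contrastive learning.

Details

Motivation: Existing methods apply the same graph neural network architecture to every modality, ignoring modality-specific structure, and they neglect spatial heterogeneity at the fusion stage, which yields suboptimal representations.

Result: On two real-world datasets, MTGRR outperforms existing methods across six modalities and three tasks.

Insight: Modality specificity and spatial heterogeneity are central to modeling multimodal urban data; modality-tailored processing and a dynamic fusion mechanism improve the learned representations.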

Abstract: Graph-based models have emerged as a powerful paradigm for modeling multimodal urban data and learning region representations for various downstream tasks. However, existing approaches face two major limitations. (1) They typically employ identical graph neural network architectures across all modalities, failing to capture modality-specific structures and characteristics. (2) During the fusion stage, they often neglect spatial heterogeneity by assuming that the aggregation weights of different modalities remain invariant across regions, resulting in suboptimal representations. To address these issues, we propose MTGRR, a modality-tailored graph modeling framework for urban region representation, built upon a multimodal dataset comprising point of interest (POI), taxi mobility, land use, road element, remote sensing, and street view images. (1) MTGRR categorizes modalities into two groups based on spatial density and data characteristics: aggregated-level and point-level modalities. For aggregated-level modalities, MTGRR employs a mixture-of-experts (MoE) graph architecture, where each modality is processed by a dedicated expert GNN to capture distinct modality-specific characteristics. For the point-level modality, a dual-level GNN is constructed to extract fine-grained visual semantic features. (2) To obtain effective region representations under spatial heterogeneity, a spatially-aware multimodal fusion mechanism is designed to dynamically infer region-specific modality fusion weights. Building on this graph modeling framework, MTGRR further employs a joint contrastive learning strategy that integrates region aggregated-level, point-level, and fusion-level objectives to optimize region representations. Experiments on two real-world datasets across six modalities and three tasks demonstrate that MTGRR consistently outperforms state-of-the-art baselines, validating its effectiveness.
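To make "region-specific modality fusion weights" concrete, the sketch below computes a separate softmax weight vector over modalities for each region and uses it to combine that region's modality embeddings. It is a generic gating illustration, not MTGRR's actual mechanism: the scoring vector `gate_vector` and the function names are hypothetical, and the paper's spatially-aware design presumably conditions on richer spatial context.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_specific_fusion(modality_embs, gate_vector):
    """Fuse modality embeddings with weights that differ per region.

    modality_embs: (R, M, D) embeddings for R regions and M modalities.
    gate_vector:   (D,) hypothetical learned vector that scores each embedding.
    Returns the fused (R, D) representations and the (R, M) fusion weights.
    """
    # One scalar score per (region, modality) pair.
    scores = modality_embs @ gate_vector
    # Normalize across modalities separately for every region.
    weights = softmax(scores, axis=1)
    # Weighted sum over the modality axis.
    fused = np.einsum("rm,rmd->rd", weights, modality_embs)
    return fused, weights
```

A region-invariant scheme, which the abstract criticizes, would correspond to using the same weight row for every region; here each row of `weights` depends on that region's own embeddings.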

[112] Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution cs.CV

Qifan Li, Jiale Zou, Jinhua Zhang, Wei Long, Xinyu Zhou

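TL;DR: The paper proposes Texture Vector-Quantization and Reconstruction Aware Prediction (TVQ&RAP) to address two problems of vector-quantization (VQ) methods in generative super-resolution: large quantization error and suboptimal predictor training. It tailors quantization to the task and trains the predictor directly through a straight-through estimator, producing photo-realistic super-resolution at low computational cost.

Details

Motivation: Existing VQ-based methods incur large quantization error when encoding visual features, and they train the predictor with code-level supervision only, ignoring the final reconstruction error, which limits prior modeling accuracy. The paper addresses both issues.

Result: TVQ&RAP generates photo-realistic super-resolution results at a small computational cost.

Insight: 1. A task-oriented quantization strategy improves prior modeling; 2. Training the predictor with image-level supervision reduces reconstruction error.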

Abstract: Vector-quantization-based (VQ-based) models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with the nearest codebook items and train the index predictor with code-level supervision. Due to the richness of visual signals, VQ encoding often leads to large quantization error. Furthermore, training the predictor with code-level supervision cannot take the final reconstruction errors into consideration, resulting in sub-optimal prior modeling accuracy. In this paper we address the above two issues and propose a Texture Vector-Quantization strategy and a Reconstruction Aware Prediction strategy. The texture vector-quantization strategy leverages the task characteristics of super-resolution and introduces a codebook only to model the prior of missing textures, while the reconstruction aware prediction strategy makes use of the straight-through estimator to directly train the index predictor with image-level supervision. Our proposed generative SR model (TVQ&RAP) is able to deliver photo-realistic SR results with small computational cost.
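The baseline the abstract criticizes, encoding each feature with its nearest codebook item, and the quantization error it incurs can be sketched as follows. The function name and toy values are illustrative; the straight-through estimator used for training needs an autodiff framework and is not shown.

```python
import numpy as np

def quantize(features, codebook):
    """Nearest-codebook vector quantization.

    features: (N, D) continuous feature vectors.
    codebook: (K, D) code vectors.
    Returns the chosen indices, the quantized features, and the mean
    squared quantization error per feature vector.
    """
    # Squared distances via ||f||^2 - 2 f.c + ||c||^2, shape (N, K).
    d2 = (np.sum(features ** 2, axis=1, keepdims=True)
          - 2.0 * features @ codebook.T
          + np.sum(codebook ** 2, axis=1))
    indices = np.argmin(d2, axis=1)
    quantized = codebook[indices]
    error = np.mean(np.sum((features - quantized) ** 2, axis=1))
    return indices, quantized, float(error)

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2]])
indices, quantized, error = quantize(features, codebook)
```

With a small codebook and rich features the error term grows quickly, which is the motivation the abstract gives for restricting the codebook to missing textures.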

[113] GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning cs.CV | cs.AI

Nayeong Kim, Seong Joon Oh, Suha Kwak

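TL;DR: GroupCoOp improves the group robustness of fine-tuned vision-language models by learning group-specific prompts, addressing subgroup imbalance in fine-tuning datasets and improving recognition of minority groups.

Details

Motivation: Fine-tuned vision-language models (VLMs) are vulnerable to spurious correlations caused by subgroup imbalance in the fine-tuning data, so an efficient way to improve group robustness is needed.

Result: GroupCoOp achieves the best results on five benchmarks across five CLIP architectures and occasionally outperforms methods that fine-tune the entire network, while training only 0.016% of the parameters.

Insight: Group-specific prompts capture minority-group features and mitigate the scattered distribution of each class in the embedding space, improving robustness.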

Abstract: Parameter-efficient fine-tuning (PEFT) of vision-language models (VLMs) excels in various vision tasks thanks to the rich knowledge and generalization ability of VLMs. However, recent studies revealed that such fine-tuned VLMs are vulnerable to spurious correlations stemming from the subgroup imbalance in the fine-tuning datasets. To resolve this issue, we propose Group Context Optimization (GroupCoOp), a simple and effective debiased fine-tuning algorithm that enhances the group robustness of fine-tuned VLMs. Its key idea is to employ group-specific text prompts as group representatives serving as multiple classifiers for their target class. The rich semantic knowledge of the text encoder of VLM enables the discovery of effective group prompts even for groups with a small number of training samples. Leveraging the group prompts for each class addresses the issues caused by the group-imbalanced training set, such as the neglect of minority groups and the scattered distribution of each class in the embedding space. GroupCoOp achieved the best results on five benchmarks across five CLIP architectures and occasionally outperformed prior methods that fine-tune the entire network, despite training only 0.016% of the network’s parameters.

[114] A Multi-Camera Vision-Based Approach for Fine-Grained Assembly Quality Control cs.CV | cs.AI | cs.LG | 68T45 | I.4.8; I.4.1; I.2.10

Ali Nazeri, Shashank Mishra, Achim Wagner, Martin Ruskowski, Didier Stricker

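TL;DR: The paper presents a multi-camera approach to fine-grained assembly quality control that fuses images from multiple views to overcome the limits of single-view inspection and improve detection accuracy.

Details

Motivation: Existing quality control typically relies on single-view imaging or manual inspection, which are error-prone under occlusion, restricted viewpoints, or inconsistent lighting. Adding inspection stations is costly and can disrupt the production line.

Result: Experiments show that the approach clearly outperforms single-view methods at detecting improperly fastened small parts such as screws, with high precision and recall.

Insight: Multi-view fusion addresses the inherent weaknesses of single-view inspection and offers a scalable, accurate quality control option for industrial automation.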

Abstract: Quality control is a critical aspect of manufacturing, particularly in ensuring the proper assembly of small components in production lines. Existing solutions often rely on single-view imaging or manual inspection, which are prone to errors due to occlusions, restricted perspectives, or lighting inconsistencies. These limitations require the installation of additional inspection stations, which could disrupt the assembly line and lead to increased downtime and costs. This paper introduces a novel multi-view quality control module designed to address these challenges, integrating a multi-camera imaging system with advanced object detection algorithms. By capturing images from three camera views, the system provides comprehensive visual coverage of components of an assembly process. A tailored image fusion methodology combines results from multiple views, effectively resolving ambiguities and enhancing detection reliability. To support this system, we developed a unique dataset comprising annotated images across diverse scenarios, including varied lighting conditions, occlusions, and angles, to enhance applicability in real-world manufacturing environments. Experimental results show that our approach significantly outperforms single-view methods, achieving high precision and recall rates in the identification of improperly fastened small assembly parts such as screws. This work contributes to industrial automation by overcoming single-view limitations, and providing a scalable, cost-effective, and accurate quality control mechanism that ensures the reliability and safety of the assembly line. The dataset used in this study is publicly available to facilitate further research in this domain.

[115] Assessing Visual Privacy Risks in Multimodal AI: A Novel Taxonomy-Grounded Evaluation of Vision-Language Models cs.CV | cs.LG

Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna, Rahul Gupta, Shrikanth Narayanan

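TL;DR: The paper introduces a comprehensive visual privacy taxonomy, uses it to evaluate how well current vision-language models understand contextual privacy, reveals inconsistencies, and calls for more privacy-aware AI systems.

Details

Motivation: As large language models (LLMs) and vision-language models (VLMs) advance, AI has become much better at understanding and integrating vision and language. Yet these models show notable shortcomings in understanding and enforcing privacy principles, and resources for evaluating this are lacking. The paper aims to fill that gap.

Result: The evaluated VLMs show significant inconsistencies in their understanding of contextual privacy, exposing the limits of current systems.

Insight: AI systems need stronger privacy awareness; future work should focus on integrating and evaluating privacy principles.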

Abstract: Artificial Intelligence has profoundly transformed the technological landscape in recent years. Large Language Models (LLMs) have demonstrated impressive abilities in reasoning, text comprehension, contextual pattern recognition, and integrating language with visual understanding. While these advances offer significant benefits, they also reveal critical limitations in the models’ ability to grasp the notion of privacy. There is hence substantial interest in determining if and how these models can understand and enforce privacy principles, particularly given the lack of supporting resources to test such a task. In this work, we address these challenges by examining how legal frameworks can inform the capabilities of these emerging technologies. To this end, we introduce a comprehensive, multi-level Visual Privacy Taxonomy that captures a wide range of privacy issues, designed to be scalable and adaptable to existing and future research needs. Furthermore, we evaluate the capabilities of several state-of-the-art Vision-Language Models (VLMs), revealing significant inconsistencies in their understanding of contextual privacy. Our work contributes both a foundational taxonomy for future research and a critical benchmark of current model limitations, demonstrating the urgent need for more robust, privacy-aware AI systems.

[116] Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation cs.CV

Hanyu Zhou, Gim Hee Lee

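TL;DR: Uni4D-LLM is a unified spatiotemporal-aware vision-language model, presented as the first to jointly handle 4D scene understanding and generation. With a shared representation and architecture, it performs on par with state-of-the-art (SOTA) models.

Details

Motivation: Existing 3D and 4D methods typically use autoregressive models for understanding and diffusion models for generation. This paradigm gap makes it hard for a single model to meet the spatiotemporal modeling needs of dynamic 4D scenes.

Result: Experiments show that Uni4D-LLM matches or exceeds SOTA models on multiple benchmarks while unifying 4D scene understanding and generation.

Insight: A shared representation and architecture is key to handling both tasks jointly, and modeling spatiotemporal dynamics is the core challenge of 4D scene understanding and generation.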

Abstract: Vision-language models (VLMs) have demonstrated strong performance in 2D scene understanding and generation, but extending this unification to the physical world remains an open challenge. Existing 3D and 4D approaches typically embed scene geometry into an autoregressive model for semantic understanding and a diffusion model for content generation. This paradigm gap prevents a single model from jointly handling both tasks, especially in dynamic 4D settings where spatiotemporal modeling is critical. We propose Uni4D-LLM, the first unified VLM framework with spatiotemporal awareness for 4D scene understanding and generation. Our design is guided by two key insights: 1) Unification requires a shared representation. We extract semantic features for understanding and noise-injected appearance features for generation, incorporate 4D geometric cues, and fuse them into a spatiotemporal-aware visual representation through adaptive cross-attention. 2) Unification requires a shared architecture. Both autoregression and diffusion are built on Transformer backbones, and this enables integration into a single LLM with task-specific heads. By aligning visual and linguistic representations, our Uni4D-LLM produces predictions for both understanding and generation within one Transformer-based framework. We further apply instruction fine-tuning on diverse 4D vision-language datasets to improve generalization across tasks. Extensive experiments on multiple benchmarks demonstrate that Uni4D-LLM achieves competitive or superior results compared to state-of-the-art models and offers the first true unification of 4D scene understanding and generation.

[117] 2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC cs.CV

Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Yuhang Zang, Yuhang Cao

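TL;DR: This report describes SeC, the method that took 2nd place in the MOSEv2 Challenge 2025. It uses a Large Vision-Language Model (LVLM) to strengthen semantic understanding in video object segmentation, improving robustness in complex scenes.

Details

Motivation: Traditional semi-supervised video object segmentation methods rely heavily on appearance matching and struggle with drastic visual changes, occlusions, and scene shifts. A method grounded in high-level semantic understanding is needed for more persistent segmentation.

Result: SeC scores 39.7 JFn on the MOSEv2 test set and ranks 2nd in the Complex VOS track, demonstrating its zero-shot performance and robustness.

Insight: Deep semantic understanding makes video object segmentation more adaptable to complex scenes, and LVLMs perform well in zero-shot transfer.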

Abstract: Semi-supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, initialized by a first-frame mask. Previous methods rely heavily on appearance-based pattern matching and thus exhibit limited robustness against challenges such as drastic visual changes, occlusions, and scene shifts. This failure is often attributed to a lack of high-level conceptual understanding of the target. The recently proposed Segment Concept (SeC) framework mitigated this limitation by using a Large Vision-Language Model (LVLM) to establish a deep semantic understanding of the object for more persistent segmentation. In this work, we evaluate its zero-shot performance on the challenging coMplex video Object SEgmentation v2 (MOSEv2) dataset. Without any fine-tuning on the training set, SeC achieved 39.7 JFn on the test set and ranked 2nd in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.

[118] CE-FAM: Concept-Based Explanation via Fusion of Activation Maps cs.CV

Michihiro Kuroki, Toshihiko Yamasaki

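TL;DR: CE-FAM is a concept-based explanation method that fuses activation maps to reveal the concepts an image classifier has learned, the regions associated with them, and their contributions, without requiring an annotated dataset.

Details

Motivation: Saliency maps only highlight important regions, whereas a concept-based explanation should also reveal concept regions and their contributions. CE-FAM aims to fill this gap.

Result: CE-FAM outperforms existing methods in qualitative and quantitative evaluations and performs well in zero-shot inference on unseen concepts.

Insight: By drawing on VLM knowledge, CE-FAM explains concepts without annotated data, offering a new direction for model interpretability.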

Abstract: Although saliency maps can highlight important regions to explain the reasoning behind image classification in artificial intelligence (AI), the meaning of these regions is left to the user’s interpretation. In contrast, concept-based explanations decompose AI predictions into human-understandable concepts, clarifying their contributions. However, few methods can simultaneously reveal what concepts an image classifier learns, which regions are associated with them, and how they contribute to predictions. We propose a novel concept-based explanation method, Concept-based Explanation via Fusion of Activation Maps (CE-FAM). It employs a branched network that shares activation maps with an image classifier and learns to mimic the embeddings of a Vision and Language Model (VLM). The branch network predicts concepts in an image, and their corresponding regions are represented by a weighted sum of activation maps, with weights given by the gradients of the concept prediction scores. Their contributions are quantified based on their impact on the image classification score. Our method provides a general framework for identifying the concept regions and their contributions while leveraging VLM knowledge to handle arbitrary concepts without requiring an annotated dataset. Furthermore, we introduce a novel evaluation metric to assess the accuracy of the concept regions. Our qualitative and quantitative evaluations demonstrate our method outperforms existing approaches and excels in zero-shot inference for unseen concepts.
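The "weighted sum of activation maps, with weights given by the gradients" can be sketched in the style of Grad-CAM. Two choices here are assumptions not stated in the abstract: the gradients are spatially averaged to obtain one weight per channel, and negative evidence is clipped with a ReLU. The function name is illustrative.

```python
import numpy as np

def concept_region_map(activations, gradients):
    """Gradient-weighted sum of activation maps for one concept.

    activations: (K, H, W) activation maps shared with the classifier.
    gradients:   (K, H, W) gradients of the concept score w.r.t. the maps.
    """
    # One weight per channel: spatial mean of the gradient (assumption).
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum over the channel axis gives an (H, W) map.
    region = np.tensordot(weights, activations, axes=1)
    # Keep only positive evidence for the concept (assumption).
    region = np.maximum(region, 0.0)
    peak = region.max()
    return region / peak if peak > 0 else region

# Two channels: the first supports the concept, the second opposes it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 1.0]]])
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
region = concept_region_map(acts, grads)
```

In the toy example only the top-left location remains highlighted, because the channel active at the bottom-right carries a negative gradient.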

[119] FairViT-GAN: A Hybrid Vision Transformer with Adversarial Debiasing for Fair and Explainable Facial Beauty Prediction cs.CV

Djamel Eddine Boukhari

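TL;DR: FairViT-GAN is a hybrid architecture that combines the strengths of CNNs and ViTs, applies adversarial debiasing to reduce algorithmic bias, and achieves high accuracy and fairness on the SCUT-FBP5500 benchmark.

Details

Motivation: Existing facial beauty prediction models suffer from architectural limitations, demographic bias, and a lack of transparency, so a fairer and more transparent solution is needed.

Result: On SCUT-FBP5500, the Pearson correlation reaches 0.9230, RMSE drops to 0.2650, and the performance gap between ethnic subgroups shrinks by 82.9%.

Insight: A hybrid architecture with adversarial debiasing improves both fairness and explainability, supporting responsible AI development for subjective visual tasks.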

Abstract: Facial Beauty Prediction (FBP) has made significant strides with the application of deep learning, yet state-of-the-art models often exhibit critical limitations, including architectural constraints, inherent demographic biases, and a lack of transparency. Existing methods, primarily based on Convolutional Neural Networks (CNNs), excel at capturing local texture but struggle with global facial harmony, while Vision Transformers (ViTs) effectively model long-range dependencies but can miss fine-grained details. Furthermore, models trained on benchmark datasets can inadvertently learn and perpetuate societal biases related to protected attributes like ethnicity. To address these interconnected challenges, we propose FairViT-GAN, a novel hybrid framework that synergistically integrates a CNN branch for local feature extraction and a ViT branch for global context modeling. More significantly, we introduce an adversarial debiasing mechanism where the feature extractor is explicitly trained to produce representations that are invariant to protected attributes, thereby actively mitigating algorithmic bias. Our framework’s transparency is enhanced by visualizing the distinct focus of each architectural branch. Extensive experiments on the SCUT-FBP5500 benchmark demonstrate that FairViT-GAN not only sets a new state-of-the-art in predictive accuracy, achieving a Pearson Correlation of 0.9230 and reducing RMSE to 0.2650, but also excels in fairness. Our analysis reveals a remarkable 82.9% reduction in the performance gap between ethnic subgroups, with the adversary’s classification accuracy dropping to near-random chance (52.1%). We believe FairViT-GAN provides a robust, transparent, and significantly fairer blueprint for developing responsible AI systems for subjective visual assessment.
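The accuracy figures quoted above are standard regression metrics, sketched below together with the arithmetic behind a relative gap reduction. How the authors define the subgroup performance gap is not stated in the abstract, so `gap_reduction` only shows the percentage calculation on hypothetical inputs.

```python
import numpy as np

def pearson_and_rmse(pred, target):
    """Pearson correlation and root-mean-square error of predicted scores."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Off-diagonal entry of the 2x2 correlation matrix.
    pc = np.corrcoef(pred, target)[0, 1]
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    return float(pc), float(rmse)

def gap_reduction(gap_before, gap_after):
    """Relative reduction of a subgroup performance gap, in percent."""
    return 100.0 * (gap_before - gap_after) / gap_before

# Predictions perfectly correlated with the labels but offset by one point.
pc, rmse = pearson_and_rmse([1, 2, 3, 4], [2, 3, 4, 5])
```

The toy example shows why both metrics are reported: a constant offset leaves the Pearson correlation at 1 while the RMSE is 1, so correlation alone would hide a systematic bias in the predicted scores.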

[120] Sim-DETR: Unlock DETR for Temporal Sentence Grounding cs.CV

Jiajin Tang, Zhengxuan Wei, Yuchen Zhu, Cheng Shi, Guanbin Li

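TL;DR: Sim-DETR modifies the DETR decoder layers to fix DETR's poor performance on temporal sentence grounding, adding a self-attention constraint based on the semantic and positional overlap between queries and a query-to-frame alignment mechanism.

Details

Motivation: The authors find that existing strategies for enhancing DETR do not improve, and may even degrade, its performance on temporal sentence grounding. The causes are conflicts between queries from similar target moments and internal query conflicts between global semantics and local localization.

Result: Experiments show that Sim-DETR unlocks the potential of DETR and provides a strong baseline for temporal sentence grounding.

Insight: DETR's bottleneck on this task stems mainly from conflicts between queries and from context integration, and Sim-DETR's targeted modifications address both.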

Abstract: Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research.

[121] Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models cs.CV | cs.AI

Ky Dan Nguyen, Hoang Lam Tran, Anh-Dung Dinh, Daochang Liu, Weidong Cai

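TL;DR: The paper proposes Information-Grounding Guidance (IGG), a mechanism that addresses the information inconsistencies caused by progressive resolution scaling in autoregressive models, producing sharper and more coherent images.

Details

Motivation: Autoregressive models perform well at image generation, but progressive resolution scaling causes information inconsistencies across timesteps, which scatter guidance signals and hurt generation quality.

Result: On class-conditioned and text-to-image generation, IGG improves image quality and sets a new benchmark for autoregressive methods.

Insight: Anchoring guidance to semantically important regions mitigates information inconsistency and improves the generation quality of autoregressive models.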

Abstract: Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditioned and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.

[122] PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications cs.CV | cs.AI | cs.CL | cs.MM | 68T50, 68T45 | I.2.7; I.2.10; I.4.8; I.4.10; I.4.0

Hitesh Laxmichand Patel, Amit Agarwal, Srikant Panda, Hansa Meghwani, Karan Dua

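TL;DR: The paper proposes the Patch Context Robustness Index (PCRI), a systematic and interpretable score that quantifies how robust multimodal large language models (MLLMs) are to changes in visual context, filling a gap in existing evaluation metrics.

Details

Motivation: In real-world settings, the reliability of multimodal models is often undermined by sensitivity to irrelevant or distracting visual context, which existing evaluation metrics do not capture.

Result: Most leading models remain brittle to background noise; only a few, such as InternVL2-26B and Qwen2VL-72B, are consistently robust across tasks.

Insight: PCRI provides a tool for comparing models and also shows how different architectures handle visual context, guiding future model design and training strategies.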

Abstract: The reliability of Multimodal Large Language Models (MLLMs) in real-world settings is often undermined by sensitivity to irrelevant or distracting visual context, an aspect not captured by existing evaluation metrics. We introduce the Patch Context Robustness Index (PCRI), the first systematic and interpretable score for quantifying MLLM robustness to variations in visual context granularity, measuring performance changes between localized image patches and full-image input. Applying PCRI to 19 state-of-the-art MLLMs across 15 vision-language benchmarks, we find that most leading models remain brittle to background noise, with only a few, such as InternVL2-26B and Qwen2VL-72B, demonstrating consistent robustness across tasks. PCRI analysis also highlights how different model architectures handle and integrate visual context, offering actionable diagnostic insight for both researchers and practitioners. PCRI enables rigorous comparison of context robustness, supporting principled model selection and guiding the development of future architectures and training strategies for robust, real-world deployment.

[123] Learning Adaptive Pseudo-Label Selection for Semi-Supervised 3D Object Detection cs.CV

Taehun Kong, Tae-Kyun Kim

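TL;DR: The paper proposes a learnable pseudo-label selection module for semi-supervised 3D object detection that improves pseudo-label quality through adaptive thresholds and score fusion, raising detection performance.

Details

Motivation: Semi-supervised 3D object detection aims to reduce reliance on costly 3D annotations. Existing methods select pseudo-labels with manually set or dynamic thresholds and overlook contextual information, so pseudo-label quality suffers.

Result: On the KITTI and Waymo datasets, the method selects high-precision pseudo-labels with wider coverage and significantly improves existing methods.

Insight: Adaptive thresholds and score fusion bring contextual information into pseudo-label selection, and the soft supervision strategy is robust to label noise.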

Abstract: Semi-supervised 3D object detection (SS3DOD) aims to reduce costly 3D annotations by utilizing unlabeled data. Recent studies adopt pseudo-label-based teacher-student frameworks and demonstrate impressive performance. The main challenge of these frameworks is selecting high-quality pseudo-labels from the teacher’s predictions. Most previous methods, however, select pseudo-labels by comparing confidence scores against manually set thresholds. The latest works tackle the challenge either by dynamic thresholding or by refining the quality of pseudo-labels. Such methods still overlook contextual information such as object distances, classes, and learning states, and inadequately assess pseudo-label quality using only the partial information available from the networks. In this work, we propose a novel SS3DOD framework featuring a learnable pseudo-labeling module designed to automatically and adaptively select high-quality pseudo-labels. Our approach introduces two networks at the teacher output level. These networks reliably assess the quality of pseudo-labels through score fusion and determine context-adaptive thresholds, and they are supervised by the alignment of pseudo-labels with GT bounding boxes. Additionally, we introduce a soft supervision strategy that can learn robustly under pseudo-label noise. This helps the student network prioritize cleaner labels over noisy ones in semi-supervised learning. Extensive experiments on the KITTI and Waymo datasets demonstrate the effectiveness of our method. The proposed method selects high-precision pseudo-labels while maintaining a wider coverage of contexts and a higher recall rate, significantly improving relevant SS3DOD methods.

[124] Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction cs.CV | cs.AI

Guoquan Wei, Zekun Zhou, Liu Shi, Wenzhe Shan, Qiegen Liu

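TL;DR: The paper proposes SuperDiff, which combines self-supervised contextual sub-data with a tunable-generalization diffusion model for low-dose CT reconstruction, improving both reconstruction quality and generalization.

Details

Motivation: Current deep-learning models for low-dose CT denoising depend heavily on paired data and generalize poorly. Diffusion models also need to learn the distribution of clean data, which is hard to obtain in clinical practice. Self-supervised methods lose generalization ability when a model trained at one dose is extended to other doses.

Result: Experiments on datasets and real data show that SuperDiff outperforms existing state-of-the-art methods in both reconstruction and generalization.

Insight: With self-supervised learning and a flexible dual-domain strategy, SuperDiff achieves high-quality reconstruction using only low-dose CT projection-domain data and generalizes across doses, including unseen ones.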

Abstract: Current deep-learning models for low-dose CT (LDCT) denoising rely heavily on paired data and generalize poorly. Even diffusion models, which have attracted more attention, need to learn the distribution of clean data for reconstruction, a requirement that is difficult to satisfy in clinical applications. At the same time, self-supervised methods face a significant loss of generalizability when a model pre-trained at one dose is extended to other doses. To address these issues, this paper proposes SuperDiff, a novel method of tunable-generalization diffusion powered by self-supervised contextual sub-data for low-dose CT reconstruction. First, a contextual sub-data similarity adaptive sensing strategy is designed for denoising centered on the LDCT projection domain, which provides an initial prior for the subsequent stages. Next, the initial prior is used to combine knowledge distillation with a deep combination of latent diffusion models for optimizing image details. The pre-trained model is used for inference reconstruction, and a pixel-level self-correcting fusion technique is proposed for fine-grained reconstruction in the image domain to enhance image fidelity, using the initial prior and the LDCT image as guides. In addition, the technique can be flexibly applied to generalize to higher and lower doses, or even unseen doses. As a dual-domain cascade strategy for self-supervised LDCT denoising, SuperDiff requires only LDCT projection-domain data for training and testing. Full qualitative and quantitative evaluations on both datasets and real data show that SuperDiff consistently outperforms existing state-of-the-art methods in terms of reconstruction and generalization performance.

[125] AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities cs.CV

Tatsuro Banno, Takehiko Ohkawa, Ruicong Liu, Ryosuke Furuta, Yoichi Sato

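TL;DR: AssemblyHands-X is the first markerless 3D hand-body benchmark dataset, focused on how hand-body coordination in bimanual activities affects action recognition. 3D poses are annotated via multi-view triangulation and SMPL-X mesh fitting. Experiments show that pose-based action inference outperforms video baselines and that joint hand-body modeling improves recognition.

Details

Motivation: Existing datasets lack a systematic evaluation of hand-body coordination in bimanual activities, and marker-based motion capture limits model generalization.

Result: Pose-based action inference is more efficient and accurate than video baselines, and joint hand-body modeling significantly improves action recognition.

Insight: Dynamic hand-body coordination is a key factor in holistically understanding bimanual activities.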

Abstract: Bimanual human activities inherently involve coordinated movements of both hands and body. However, the impact of this coordination in activity understanding has not been systematically evaluated due to the lack of suitable datasets. Such evaluation demands kinematic-level annotations (e.g., 3D pose) for the hands and body, yet existing 3D activity datasets typically annotate either hand or body pose. Another line of work employs marker-based motion capture to provide full-body pose, but the physical markers introduce visual artifacts, thereby limiting models’ generalization to natural, markerless videos. To address these limitations, we present AssemblyHands-X, the first markerless 3D hand-body benchmark for bimanual activities, designed to study the effect of hand-body coordination for action recognition. We begin by constructing a pipeline for 3D pose annotation from synchronized multi-view videos. Our approach combines multi-view triangulation with SMPL-X mesh fitting, yielding reliable 3D registration of hands and upper body. We then validate different input representations (e.g., video, hand pose, body pose, or hand-body pose) across recent action recognition models based on graph convolution or spatio-temporal attention. Our extensive experiments show that pose-based action inference is more efficient and accurate than video baselines. Moreover, joint modeling of hand and body cues improves action recognition over using hands or upper body alone, highlighting the importance of modeling interdependent hand-body dynamics for a holistic understanding of bimanual activities.

Jinghan Xu, Yuyang Zhang, Qixuan Cai, Jiancheng Chen, Keqiu Li

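TL;DR: The paper proposes a Cross-modal Contrastive Unlearning (CCU) framework that addresses the loss of cross-modal knowledge and the instability of intra-class structure during visual unlearning. Through selective visual unlearning, cross-modal knowledge retention, and dual-set contrastive separation, it significantly improves model performance and time efficiency.

Details

Motivation: In cross-modal applications such as autonomous driving, the visual modality is the most vulnerable to privacy leakage. Conventional machine unlearning methods damage cross-modal knowledge and intra-class structural stability during visual unlearning, which degrades performance.

Result: Experiments on three datasets show that CCU improves accuracy by 7.12% over the top-accuracy baseline while requiring only 7% of its unlearning time.

Insight: In visual unlearning, preserving cross-modal knowledge and intra-class structural stability is key. CCU achieves efficient unlearning with stable performance through contrastive learning and semantic consistency.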

Abstract: The visual modality is the most vulnerable to privacy leakage in real-world multimodal applications such as autonomous driving with visual and radar data. Machine unlearning addresses privacy leakage by removing specific training data from pre-trained models. However, existing methods fail to preserve cross-modal knowledge and to maintain the intra-class structural stability of the retain data, which reduces both overall performance and the performance of other modalities during visual unlearning. To address these challenges, we propose a Cross-modal Contrastive Unlearning (CCU) framework, which integrates three key components: (a) selective visual unlearning, which employs inverse contrastive learning to dissociate visual representations from their original semantics; (b) cross-modal knowledge retention, which preserves other modalities’ discriminability through semantic consistency; and (c) dual-set contrastive separation, which preserves model performance by isolating structural perturbations between the unlearn set and the retain set. Extensive experiments on three datasets demonstrate the superiority of CCU: our method achieves a 7.12% accuracy improvement with only 7% of the unlearning time compared to the top-accuracy baseline.

[127] Q-FSRU: Quantum-Augmented Frequency-Spectral For Medical Visual Question Answering cs.CV

Rakesh Thakur, Yusra Tariq, Rakesh Chandra Joshi

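TL;DR: Q-FSRU is a new model for medical visual question answering (VQA) that combines Frequency Spectrum Representation and Fusion (FSRU) with Quantum Retrieval-Augmented Generation (Quantum RAG). It transforms medical image and text features into the frequency domain via FFT to process information more efficiently, and uses quantum-inspired retrieval to fetch relevant medical facts from external knowledge sources, strengthening reasoning and explainability. Experiments on the VQA-RAD dataset show that Q-FSRU surpasses existing models, especially on complex questions.

Details

Motivation: Medical VQA requires understanding image and text together, which remains a major challenge in healthcare AI. Conventional methods handle noise and complex reasoning poorly, so a more efficient and explainable approach is needed.

Result: Q-FSRU outperforms existing methods on VQA-RAD, especially on tasks requiring complex image-text reasoning. Combining frequency and quantum information improves both performance and explainability.

Insight: 1. Frequency-domain processing can filter noise and highlight key information. 2. Quantum retrieval can efficiently link external knowledge and strengthen reasoning. 3. Fusing frequency-domain and quantum information offers a new direction for the explainability and performance of medical AI.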

Abstract: Solving tough clinical questions that require both image and text understanding is still a major challenge in healthcare AI. In this work, we propose Q-FSRU, a new model that combines Frequency Spectrum Representation and Fusion (FSRU) with a method called Quantum Retrieval-Augmented Generation (Quantum RAG) for medical Visual Question Answering (VQA). The model takes in features from medical images and related text, then shifts them into the frequency domain using Fast Fourier Transform (FFT). This helps it focus on more meaningful data and filter out noise or less useful information. To improve accuracy and ensure that answers are based on real knowledge, we add a quantum inspired retrieval system. It fetches useful medical facts from external sources using quantum-based similarity techniques. These details are then merged with the frequency-based features for stronger reasoning. We evaluated our model using the VQA-RAD dataset, which includes real radiology images and questions. The results showed that Q-FSRU outperforms earlier models, especially on complex cases needing image text reasoning. The mix of frequency and quantum information improves both performance and explainability. Overall, this approach offers a promising way to build smart, clear, and helpful AI tools for doctors.
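The frequency-domain step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which filter is applied after the FFT, so the low-pass mask below is an assumption, and the function names (`dft`, `idft`, `low_pass`) are hypothetical. A naive DFT stands in for the FFT only to keep the example dependency-free; it computes the same transform more slowly.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (same result as an FFT, just slower)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def low_pass(features, keep):
    """Assumed filter: keep frequencies up to `keep` (and their mirrors), zero the rest."""
    n = len(features)
    spectrum = dft(features)
    filtered = [c if min(k, n - k) <= keep else 0 for k, c in enumerate(spectrum)]
    return [value.real for value in idft(filtered)]

# A constant feature vector corrupted by high-frequency alternating noise.
noisy = [1.0 + 0.5 * (-1) ** t for t in range(8)]
clean = low_pass(noisy, keep=1)
```

In this toy case the alternating noise sits entirely in the highest frequency bin, so zeroing that bin recovers the constant signal.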

[128] LifeCLEF Plant Identification Task 2014 cs.CV

Herve Goeau, Alexis Joly, Pierre Bonnet, Souheil Selmi, Jean-Francois Molino

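TL;DR: The LifeCLEF 2014 plant identification task provides a systematic evaluation platform focused on identifying about 500 species of trees and herbaceous plants across seven image types. The data were collected through a citizen science project, emphasizing real-world applicability.

Details

Motivation: The task evaluates plant identification systems under diverse, unconstrained conditions while advancing research in biodiversity and botany.

Result: Ten groups submitted 27 runs in total, showing the diversity of methods and the practical potential of the task.

Insight: Citizen science projects offer a new route to data collection and expose the complexity of real-world applications, providing challenges and directions for future research.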

Abstract: The LifeCLEF plant identification task provides a testbed for system-oriented evaluation of plant identification over about 500 species of trees and herbaceous plants. Seven types of image content are considered: scans and scan-like pictures of leaves, and six kinds of detailed views taken under unconstrained conditions and photographed directly on the plant: flower, fruit, stem & bark, branch, leaf, and entire view. The main originality of this data is that it was built specifically through a citizen science initiative conducted by Tela Botanica, a French social network of amateur and expert botanists. This brings the task closer to the conditions of a real-world application. This overview presents the resources and assessments of the task in more detail, summarizes the retrieval approaches employed by the participating groups, and provides an analysis of the main evaluation results. With a total of ten groups from six countries and twenty-seven submitted runs involving distinct and original methods, this fourth-year task confirms the interest of the image and multimedia retrieval community in biodiversity and botany, and highlights further challenging studies in plant identification.

[129] EWC-Guided Diffusion Replay for Exemplar-Free Continual Learning in Medical Imaging cs.CV | cs.AI

Anoushka Harit, William Prew, Zhongtian Sun, Florian Markowetz

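TL;DR: The paper proposes a continual learning method for medical imaging based on Elastic Weight Consolidation (EWC) and class-conditional diffusion replay. It avoids storing patient exemplars while achieving strong performance across multiple tasks.

Details

Motivation: Medical imaging foundation models must adapt to new tasks over time, but privacy and cost constraints rule out full retraining. An efficient, privacy-preserving continual learning method is therefore needed.

Result: The method reaches 0.851 AUROC on CheXpert, reduces forgetting by more than 30% relative to DER++, and approaches joint training at 0.869 AUROC, while remaining efficient and privacy preserving.

Insight: Forgetting is closely tied to replay fidelity and Fisher-weighted parameter drift, indicating that diffusion replay and synaptic stability play complementary roles in continual learning.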

Abstract: Medical imaging foundation models must adapt over time, yet full retraining is often blocked by privacy constraints and cost. We present a continual learning framework that avoids storing patient exemplars by pairing class-conditional diffusion replay with Elastic Weight Consolidation. Using a compact Vision Transformer backbone, we evaluate across eight MedMNIST v2 tasks and CheXpert. On CheXpert our approach attains 0.851 AUROC, reduces forgetting by more than 30% relative to DER++, and approaches joint training at 0.869 AUROC, while remaining efficient and privacy preserving. Analyses connect forgetting to two measurable factors: fidelity of replay and Fisher-weighted parameter drift, highlighting the complementary roles of replay diffusion and synaptic stability. The results indicate a practical route for scalable, privacy-aware continual adaptation of clinical imaging models.
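For readers unfamiliar with Elastic Weight Consolidation, the standard EWC regularizer penalizes Fisher-weighted drift of parameters away from the values learned on earlier tasks. The sketch below shows that standard penalty on plain Python lists. It is not the paper's implementation: the function names and the flat-list parameter layout are assumptions made for illustration, and the diffusion replay component is not modeled.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

def total_loss(task_loss, params, anchor_params, fisher, lam=1.0):
    """New-task loss plus the penalty that discourages drift on important weights."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)

# The first weight drifted by 1.0 and has Fisher importance 4.0;
# the second weight did not move, so it contributes nothing.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 10.0], lam=2.0)
```

Weights with large Fisher values are held close to their earlier values, while unimportant weights remain free to adapt, which is the sense in which "Fisher-weighted parameter drift" is a measurable quantity.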

[130] Adversarial Versus Federated: An Adversarial Learning based Multi-Modality Cross-Domain Federated Medical Segmentation cs.CV

You Zhou, Lijiang Chen, Shuchang Lyu, Guangxia Cui, Wenpei Bai

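TL;DR: The paper proposes FedDA, an adversarial-learning-based, multi-modality, cross-domain federated medical segmentation framework. It aligns features across clients through feature-level adversarial training, addressing the challenge of segmenting multi-modality medical data in federated learning.

Details

Motivation: Federated learning is popular in healthcare because it protects data privacy, but differences in imaging modality across clients make cross-domain segmentation difficult, so an effective way to mitigate domain shift is needed.

Result: FedDA outperforms existing federated aggregation algorithms on multi-modality medical datasets, as verified by objective and subjective evaluation.

Insight: Feature-level adversarial training effectively mitigates domain shift, and combining federated learning with adversarial learning is an effective solution for multi-modality medical data segmentation.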

Abstract: Federated learning enables collaborative training of machine learning models among different clients while ensuring data privacy, and has emerged as the mainstream approach for breaking data silos in the healthcare domain. However, the imbalance of medical resources, data corruption, or improper data preservation may lead to a situation where different clients possess medical images of different modalities. This heterogeneity poses a significant challenge for cross-domain medical image segmentation within the federated learning framework. To address this challenge, we propose a new Federated Domain Adaptation (FedDA) segmentation training framework. Specifically, we propose feature-level adversarial learning among clients, aligning feature maps across clients by embedding an adversarial training mechanism. This design can enhance the model’s generalization across multiple domains and alleviate the negative impact of domain shift. Comprehensive experiments on three medical image datasets demonstrate that FedDA achieves cross-domain federated aggregation, endows single-modality clients with cross-modality processing capabilities, and consistently delivers robust performance compared to state-of-the-art federated aggregation algorithms in both objective and subjective assessment. Our code is available at https://github.com/GGbond-study/FedDA.

[131] EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling cs.CV

Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang

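TL;DR: The paper presents EditScore, a high-fidelity reward model for instruction-guided image editing. It introduces the EditReward-Bench benchmark, develops a series of reward models (7B-72B), and, combined with a self-ensemble strategy, unlocks online reinforcement learning for image editing.

Details

Motivation: Current instruction-guided image editing models struggle with complex instructions and often need multiple attempts to produce a desired result. Reinforcement learning has great potential but has been hard to apply for lack of an efficient, high-fidelity reward signal.

Result: The largest EditScore variant even surpasses GPT-5 on the benchmark, and applying the framework to the OmniGen2 base model yields a substantial performance gain.

Insight: A high-fidelity, domain-specialized reward model is the key to unlocking RL for image editing, giving a systematic path from benchmarking to reward modeling to RL training.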

Abstract: Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 in the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.

[132] MoReact: Generating Reactive Motion from Textual Descriptions cs.CV

Xiyan Xu, Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui

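TL;DR: MoReact is a text-driven, diffusion-based method for generating human reaction motion sequences that match a text description. It disentangles the generation of global trajectories and local motions to improve the adaptability and realism of reactions.

Details

Motivation: Existing methods for generating human reactions either treat multiple people as a single entity or rely on only one person’s motion, neglecting interaction semantics and limiting adaptability. MoReact aims to fill this gap by using text to generate reactions that better fit dynamic interaction scenarios.

Result: Experiments show that MoReact generates diverse, controllable reactions that follow the text description, closely match the counterpart’s motion, and remain semantically consistent.

Insight: Generating the global trajectory first is crucial for guiding local motion, and a text-driven approach can capture interaction semantics, offering a new direction for reaction generation in dynamic scenarios.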

Abstract: Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person’s motion to generate the other’s reaction, failing to integrate the rich semantic information that underpins human interactions. Yet, these methods often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model specifically generates realistic motion sequences for individuals responding to the other’s actions based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent’s movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with the given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach for this novel task, which is capable of producing realistic, diverse, and controllable reactions that not only closely match the movements of the counterpart but also adhere to the textual guidance. Please find our webpage at https://xiyan-xu.github.io/MoReactWebPage.

[133] Revisit the Imbalance Optimization in Multi-task Learning: An Experimental Analysis cs.CV

Yihang Guo, Tianyuan Yu, Liang Bai, Yanming Guo, Yirun Ruan

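TL;DR: The paper systematically analyzes unbalanced optimization in multi-task learning (MTL). It finds that existing methods perform inconsistently across datasets and that advanced architectures still depend on costly grid search. Initializing from Vision Foundation Models (VFMs) does not by itself resolve the imbalance, whereas the norm of task-specific gradients correlates strongly with it.

Details

Motivation: Task interference in MTL degrades performance (unbalanced optimization) and holds back its potential. The paper uses experimental analysis to uncover the root causes and explore remedies.

Result: Scaling task losses by their gradient norms effectively addresses unbalanced optimization, with performance close to grid search at lower computational cost.

Insight: Understanding and controlling gradient dynamics is a more direct path to stable MTL than relying on increasingly complex methods or more data.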

Abstract: Multi-task learning (MTL) aims to build general-purpose vision systems by training a single network to perform multiple tasks jointly. While promising, its potential is often hindered by “unbalanced optimization”, where task interference leads to subpar performance compared to single-task models. To facilitate research in MTL, this paper presents a systematic experimental analysis to dissect the factors contributing to this persistent problem. Our investigation confirms that the performance of existing optimization methods varies inconsistently across datasets, and advanced architectures still rely on costly grid-searched loss weights. Furthermore, we show that while powerful Vision Foundation Models (VFMs) provide strong initialization, they do not inherently resolve the optimization imbalance, and merely increasing data quantity offers limited benefits. A crucial finding emerges from our analysis: a strong correlation exists between the optimization imbalance and the norm of task-specific gradients. We demonstrate that this insight is directly applicable, showing that a straightforward strategy of scaling task losses according to their gradient norms can achieve performance comparable to that of an extensive and computationally expensive grid search. Our comprehensive analysis suggests that understanding and controlling gradient dynamics is a more direct path to stable MTL than developing increasingly complex methods.
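The gradient-norm strategy mentioned in the abstract can be sketched as follows. The abstract does not give the exact scaling rule, so this example assumes one plausible form: each task loss is weighted inversely to its gradient norm, so that all weighted task gradients end up with the same norm. The function names and the choice of the mean norm as the common target are assumptions.

```python
import math

def grad_norm(grad):
    """Euclidean norm of one task's gradient on the shared parameters."""
    return math.sqrt(sum(g * g for g in grad))

def balance_weights(task_grads, eps=1e-12):
    """Assumed rule: weight_i = mean_norm / norm_i, equalizing weighted gradient norms."""
    norms = [grad_norm(g) for g in task_grads]
    target = sum(norms) / len(norms)
    return [target / (n + eps) for n in norms]

# Task 0 has a gradient ten times larger than task 1 and would dominate training.
grads = [[3.0, 4.0], [0.3, 0.4]]
weights = balance_weights(grads)
# The weighted total loss would then be: sum(w * loss for w, loss in zip(weights, losses))
```

Compared with a grid search over loss weights, this needs only one norm computation per task per update, which is the cost advantage the abstract points to.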

[134] Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives cs.CV

Kuanrong Liu, Siyuan Liang, Cheng Qian, Ming Zhang, Xiaochun Cao

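TL;DR: The paper studies how adversarial perturbations transfer across tasks in CLIP and its derivatives, finding that adversarial examples generated from fine-grained tasks are more effective attacks than those from coarse-grained tasks. The authors propose a new framework, MT-AdvCLIP, that significantly raises the cross-task transfer success rate of adversarial examples.

Details

Motivation: CLIP excels at image-text alignment but struggles with fine-grained tasks such as object detection and semantic segmentation, and its adversarial robustness is underexplored. Understanding how adversarial examples transfer across tasks is key to assessing CLIP’s generalization limits and security risks.

Result: Experiments show that MT-AdvCLIP improves the average attack success rate across tasks by more than 39%, significantly outperforming baseline methods.

Insight: Adversarial examples from fine-grained tasks transfer better across tasks, offering a new perspective on multi-task robustness evaluation and adversarial example design.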

Abstract: As a general-purpose vision-language pretraining model, CLIP demonstrates strong generalization ability in image-text alignment tasks and has been widely adopted in downstream applications such as image classification and image-text retrieval. However, it struggles with fine-grained tasks such as object detection and semantic segmentation. While many variants aim to improve CLIP on these tasks, its robustness to adversarial perturbations remains underexplored. Understanding how adversarial examples transfer across tasks is key to assessing CLIP’s generalization limits and security risks. In this work, we conduct a systematic empirical analysis of the cross-task transfer behavior of CLIP-based models on image-text retrieval, object detection, and semantic segmentation under adversarial perturbations. We find that adversarial examples generated from fine-grained tasks (e.g., object detection and semantic segmentation) often exhibit stronger transfer potential than those from coarse-grained tasks, enabling more effective attacks against the original CLIP model. Motivated by this observation, we propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability. This design strengthens the attack effectiveness of fine-grained task models on the shared CLIP backbone. Experimental results on multiple public datasets show that MT-AdvCLIP significantly improves the adversarial transfer success rate against various CLIP-derived models, raising the average attack success rate across multiple tasks by over 39% without increasing the perturbation budget. This study reveals the transfer mechanism of adversarial examples in multi-task CLIP models, offering new insights into multi-task robustness evaluation and adversarial example design.

[135] DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation cs.CV | cs.RO

Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong

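TL;DR: DriveE2E is a closed-loop benchmark framework that integrates real-world driving scenarios into the CARLA simulator, building digital twin assets and dynamic traffic scenarios to make simulation more realistic.

Details

Motivation: Current CARLA-based closed-loop evaluation relies on manually configured traffic scenarios that diverge from real-world conditions, limiting the realism of the results. DriveE2E aims to provide an evaluation framework closer to reality.

Result: The proposed framework more faithfully reflects the diverse driving behaviors and environmental conditions at complex urban intersections.

Insight: 1. Integrating real-world data significantly improves the realism of simulation-based evaluation. 2. Infrastructure sensors provide a new data source for autonomous driving evaluation.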

Abstract: Closed-loop evaluation is increasingly critical for end-to-end autonomous driving. Current closed-loop benchmarks using the CARLA simulator rely on manually configured traffic scenarios, which can diverge from real-world conditions, limiting their ability to reflect actual driving performance. To address these limitations, we introduce a simple yet challenging closed-loop evaluation framework that closely integrates real-world driving scenarios into the CARLA simulator with infrastructure cooperation. Our approach involves extracting 800 dynamic traffic scenarios selected from a comprehensive 100-hour video dataset captured by high-mounted infrastructure sensors, and creating static digital twin assets for 15 real-world intersections with consistent visual appearance. These digital twins accurately replicate the traffic and environmental characteristics of their real-world counterparts, enabling more realistic simulations in CARLA. This evaluation is challenging due to the diversity of driving behaviors, locations, weather conditions, and times of day at complex urban intersections. In addition, we provide a comprehensive closed-loop benchmark for evaluating end-to-end autonomous driving models. Project URL: https://github.com/AIR-THU/DriveE2E.

[136] Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks cs.CV

Alexandros Doumanoglou, Kurt Driessens, Dimitrios Zarpalas

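TL;DR: The paper proposes a new method that learns encoding-decoding direction pairs to reveal the concepts that influence deep vision networks. It outperforms conventional matrix decomposition and autoencoder methods and performs well in interpretability and intervention applications.

Details

Motivation: Concepts in the latent space of deep vision networks appear as directions (concept embeddings), but there is currently no way to access these directions directly, which limits interpretability and debugging.

Result: 1. On synthetic data, the method recovers ground-truth direction pairs. 2. On real data, decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines. 3. Signal vectors reliably estimate encoding directions.

Insight: Encoding-decoding direction pairs improve interpretability and also support model intervention and error correction, offering a new direction for research on deep network interpretability.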

Abstract: Empirical evidence shows that deep vision networks represent concepts as directions in latent space, vectors we call concept embeddings. Each concept has a latent factor, a scalar indicating its presence in an input patch. For a given patch, multiple latent factors are encoded into a compact representation by linearly combining concept embeddings, with the factors as coefficients. Since these embeddings enable such encoding, we call them encoding directions. A latent factor can be recovered via the inner product with a filter, a vector we call a decoding direction. These encoding-decoding direction pairs are not directly accessible, but recovering them helps open the black box of deep networks, enabling understanding, debugging, and improving models. Decoder directions attribute meaning to latent codes, while encoding directions assess concept influence on predictions, with both enabling model correction by unlearning irrelevant concepts. Unlike prior matrix decomposition, autoencoder, or dictionary learning methods that rely on feature reconstruction, we propose a new perspective: decoding directions are identified via directional clustering of activations, and encoding directions are estimated with signal vectors under a probabilistic view. We further leverage network weights through a novel technique, Uncertainty Region Alignment, which reveals interpretable directions affecting predictions. Our analysis shows that (a) on synthetic data, our method recovers ground-truth direction pairs; (b) on real data, decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines; and (c) signal vectors faithfully estimate encoding directions, validated via activation maximization. Finally, we demonstrate applications in understanding global model behavior, explaining individual predictions, and intervening to produce counterfactuals or correct errors.
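The relationship between encoding and decoding directions described in the abstract can be made concrete with a toy example. The sketch below assumes a noise-free 2D latent space with two known, linearly independent encoding directions, so matching decoding directions can be computed in closed form. The paper instead estimates both from activations, which this sketch does not attempt. All function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def encode(factors, encoding_dirs):
    """Representation = linear combination of encoding directions, factors as coefficients."""
    dim = len(encoding_dirs[0])
    return [sum(f * e[i] for f, e in zip(factors, encoding_dirs)) for i in range(dim)]

def decoding_dirs_2d(encoding_dirs):
    """Toy case: directions d_i with d_i . e_j = 1 if i == j else 0 (2x2 inverse)."""
    (a, b), (c, d) = encoding_dirs  # e1 = (a, b), e2 = (c, d)
    det = a * d - b * c
    return [(d / det, -c / det), (-b / det, a / det)]

encoding = [(1.0, 0.0), (1.0, 1.0)]   # non-orthogonal concept embeddings
decoding = decoding_dirs_2d(encoding)  # [(1.0, -1.0), (0.0, 1.0)]
latent = encode([2.0, 3.0], encoding)  # [5.0, 3.0]
recovered = [dot(d, latent) for d in decoding]
```

Because the two encoding directions are not orthogonal, the decoding direction for the first concept, (1, -1), differs from its encoding direction, (1, 0). This is why the two kinds of direction are treated as a pair rather than as one vector.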

[137] SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing cs.CV

Yi Yang, Xiaokun Zhang, Qingchen Fang, Ziqi Ye, Rui Li

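TL;DR: The paper proposes SAR-KnowLIP, the first general-purpose multimodal foundation model for synthetic aperture radar (SAR), filling a gap in SAR image modeling. With the large-scale SAR-GEOVL-1M dataset, aligned text generated by a hierarchical cognitive chain of thought, a self-consistent iterative optimization mechanism, and a unified evaluation benchmark, the model performs strongly on many downstream tasks.

Details

Motivation: Existing methods mainly target RGB imagery, leaving a significant gap in modeling SAR imagery. SAR’s all-weather imaging capability makes it indispensable for remote sensing scene understanding, so multimodal foundation models for it are urgently needed.

Result: SAR-KnowLIP leads on 11 downstream tasks, particularly object counting and land-cover classification.

Insight: Incorporating geographic information and aligning multimodal data are critical for SAR image understanding, and the self-consistent iterative optimization mechanism effectively improves cross-modal model performance.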

Abstract: Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. However, existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery. SAR, with its all-day, all-weather imaging capabilities, plays an irreplaceable role in remote sensing scene understanding. To address this gap, this paper proposes SAR-KnowLIP, the first universal SAR multimodal foundational model, along with reusable data and evaluation baselines. Specifically: (1) This work introduces the critical yet long-overlooked attribute of geographic information into remote sensing research, constructing SAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection properties), covering multiple satellite platforms, 120,000 images, and 135 cities. (2) Aligned structured text is generated through a hierarchical cognitive chain-of-thought (HCoT), providing more than one million multi-dimensional semantic annotations of landforms, regional functions, target attributes, and spatial relationships. (3) We design a Self-Consistent Iterative Optimization mechanism that continuously enhances cross-modal alignment through a self-supervised closed loop of contrastive, matching, and reconstruction learning on a transferable multimodal encoder. (4) A unified evaluation benchmark is established across 11 representative downstream vision and vision-language tasks, with comparisons against 14 leading foundation models, where SAR-KnowLIP demonstrates leading performance, particularly in object counting and land-cover classification. We expect that SAR-KnowLIP’s large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark will significantly advance the development of SAR multimodal baseline models.

[138] AutoPrune: Each Complexity Deserves a Pruning Policy cs.CV

Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu

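TL;DR: AutoPrune proposes complexity-adaptive pruning: it quantifies the mutual information between visual and textual tokens and adjusts the pruning policy dynamically to the complexity of each task and sample, substantially reducing computation while preserving accuracy.

Details

Motivation: Existing pruning strategies for vision-language models are typically fixed. They do not adapt to the complexity of the input sample and task, and fail to align with the model’s overall reasoning trajectory.

Result: On LLaVA-1.5-7B, the method prunes 89% of visual tokens and reduces FLOPs by 76.8% while retaining 96.7% of the original accuracy.

Insight: Pruning policies should align dynamically with input and task complexity, much like the progressive narrowing of focus in human visual processing.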

Abstract: The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies where, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model’s holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal to a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of different tasks and can guarantee adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating the effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.
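A budget-constrained logistic retention curve, as mentioned in the abstract, can be sketched as below. The abstract does not specify how mutual information is mapped to the curve or how the budget is enforced, so this example makes two assumptions: the complexity signal is reduced to a single hypothetical `steepness` parameter, and the budget is interpreted as the mean fraction of visual tokens kept across decoder layers, met by bisecting on the curve's midpoint.

```python
import math

def retention_curve(num_layers, steepness, midpoint):
    """Fraction of visual tokens kept at each layer, following a decreasing logistic."""
    curve = []
    for layer in range(num_layers):
        z = steepness * (layer - midpoint)
        z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
        curve.append(1.0 / (1.0 + math.exp(z)))
    return curve

def fit_midpoint(num_layers, steepness, budget, iters=60):
    """Bisection: find the midpoint whose mean retention matches the budget."""
    lo, hi = -float(num_layers), 2.0 * num_layers
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_kept = sum(retention_curve(num_layers, steepness, mid)) / num_layers
        if mean_kept > budget:
            hi = mid  # keeping too many tokens: start pruning earlier
        else:
            lo = mid
    return 0.5 * (lo + hi)

midpoint = fit_midpoint(num_layers=32, steepness=0.5, budget=0.3)
curve = retention_curve(32, 0.5, midpoint)
```

Under these assumptions, a small `steepness` gives a gradual schedule that keeps tokens longer for exploration, while a large value gives an abrupt cut. In both cases the fitted midpoint keeps the average retention at the stated budget.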

[139] CrashSplat: 2D to 3D Vehicle Damage Segmentation in Gaussian Splatting cs.CV

Dragoş-Andrei Chileban, Andrei-Ştefan Bulzan, Cosmin Cernǎzanu-Glǎvan

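TL;DR: The paper proposes CrashSplat, an automatic vehicle damage detection method that lifts 2D segmentation masks into a 3D Gaussian Splatting representation, enabling single-view 3D damage segmentation.

Details

Motivation: Vehicle damage detection is important to the auto insurance industry, but existing methods mostly analyze 2D images and lack geometric accuracy. 3D Gaussian Splatting, a novel-view-synthesis technique, offers a way to address this.

Result: Experiments show that the method works well when targets (e.g., scratches, small dents) are clearly visible in only a single view, where multi-view consistency approaches break down.

Insight: A learning-free single-view segmentation method has a clear advantage for geometrically complex scenes or targets visible in only one view, offering a new direction for 3D damage analysis.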

Abstract: Automatic car damage detection has been a topic of significant interest for the auto insurance industry as it promises faster, accurate, and cost-effective damage assessments. However, few works have gone beyond 2D image analysis to leverage 3D reconstruction methods, which have the potential to provide a more comprehensive and geometrically accurate representation of the damage. Moreover, recent methods employing 3D representations for novel view synthesis, particularly 3D Gaussian Splatting (3D-GS), have demonstrated the ability to generate accurate and coherent 3D reconstructions from a limited number of views. In this work we introduce an automatic car damage detection pipeline that performs 3D damage segmentation by up-lifting 2D masks. Additionally, we propose a simple yet effective learning-free approach for single-view 3D-GS segmentation. Specifically, Gaussians are projected onto the image plane using camera parameters obtained via Structure from Motion (SfM). They are then filtered through an algorithm that utilizes Z-buffering along with a normal distribution model of depth and opacities. Through experiments we found that this method is particularly effective for challenging scenarios like car damage detection, where target objects (e.g., scratches, small dents) may only be clearly visible in a single view, making multi-view consistency approaches impractical or impossible. The code is publicly available at: https://github.com/DragosChileban/CrashSplat.
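The single-view lifting step can be illustrated with a simplified sketch: project 3D points onto the image, keep those that land inside the 2D mask, and use a Z-buffer to reject points hidden behind nearer geometry. This is not the paper's algorithm. It assumes points are already in camera coordinates, uses only Gaussian centers with a plain pinhole model, and replaces the paper's normal-distribution model of depth and opacity with a fixed, hypothetical `depth_tol` threshold.

```python
def project(point, focal, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) to (u, v, depth)."""
    x, y, z = point
    return focal * x / z + cx, focal * y / z + cy, z

def select_in_mask(points, mask, focal, cx, cy, depth_tol=0.1):
    """Indices of points that project into the mask and lie near the front-most depth."""
    height, width = len(mask), len(mask[0])
    hits = []   # (index, row, col, depth) for every point that lands on the image
    zbuf = {}   # (row, col) -> nearest depth seen at that pixel
    for i, p in enumerate(points):
        if p[2] <= 0:
            continue  # behind the camera
        u, v, z = project(p, focal, cx, cy)
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:
            hits.append((i, row, col, z))
            if (row, col) not in zbuf or z < zbuf[(row, col)]:
                zbuf[(row, col)] = z
    return [i for i, row, col, z in hits
            if mask[row][col] and z - zbuf[(row, col)] <= depth_tol]

mask = [[False, False, False],
        [False, True,  False],
        [False, False, False]]
points = [(0.0, 0.0, 1.0),    # front surface, inside the mask
          (0.0, 0.0, 1.05),   # just behind the surface, within tolerance
          (0.0, 0.0, 3.0),    # occluded point far behind
          (1.0, 0.0, 1.0)]    # projects outside the mask
selected = select_in_mask(points, mask, focal=1.0, cx=1.0, cy=1.0)
```

Only the first two points are selected: the third is rejected by the depth test and the fourth by the mask. That depth test is what keeps a single-view mask from labeling geometry on the far side of the object.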

[140] HunyuanImage 3.0 Technical Report cs.CVPDF

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui

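TL;DR: HunyuanImage 3.0 is an autoregressive model that unifies multimodal understanding and generation, and its image generation module is open source. Using careful data curation, architecture design, Chain-of-Thoughts, progressive pre-training, and post-training, the authors trained an MoE model with over 80 billion parameters, which they describe as the largest and most powerful open-source image generation model to date.

Details

Motivation: The goal is to provide a strong open-source multimodal foundation model that advances community research and applications in the multimodal field.

Result: Automatic and human evaluations show that HunyuanImage 3.0 rivals state-of-the-art models in text-image alignment and visual quality.

Insight: Combining MoE with Chain-of-Thoughts can significantly improve the performance and efficiency of multimodal models.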

Abstract: We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0

[141] ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation cs.CVPDF

Shilan Zhang, Jirui Huang, Ruilin Yao, Cong Wang, Yaxiong Chen

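TL;DR: ColLab is an automated data generation method for REC and REG that requires no human supervision. Through Collaborative Multimodal Model Interaction (CMMI) and Spatial Progressive Augmentation (SPA), it significantly improves the quality and diversity of the generated data.

Details

Motivation: Current REC and REG tasks rely on manual annotation, which is costly and hard to scale, so an automated data generation method is needed.

Result: ColLab significantly improves the efficiency and quality of data generation and was partially adopted in the ICCV 2025 MARS2 Challenge.

Insight: Automated data generation can relieve the manual annotation bottleneck, and multimodal collaboration and spatial augmentation are key to improving task performance.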

Abstract: Referring Expression Comprehension (REC) and Referring Expression Generation (REG) are fundamental tasks in multimodal understanding, supporting precise object localization through natural language. However, existing REC and REG datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. In this paper, we propose ColLab, a collaborative spatial progressive data engine that enables fully automated REC and REG data generation without human supervision. Specifically, our method introduces a Collaborative Multimodal Model Interaction (CMMI) strategy, which leverages the semantic understanding of multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions. Furthermore, we design a module termed Spatial Progressive Augmentation (SPA) to enhance spatial expressiveness among duplicate instances. Experiments demonstrate that ColLab significantly accelerates the annotation process of REC and REG while improving the quality and discriminability of the generated expressions. In addition to the core methodological contribution, our framework was partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning, enriching the dataset with diverse and challenging samples that better reflect real-world reasoning demands.

[142] Reinforcement Learning with Inverse Rewards for World Model Post-training cs.CVPDF

Yang Ye, Tianyu He, Shuo Yang, Jiang Bian

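TL;DR: The paper proposes Reinforcement Learning with Inverse Rewards (RLIR), a post-training framework that improves the action-following ability of video world models while avoiding the high cost of large-scale preference annotation and the infeasibility of rule-based video verification.

Details

Motivation: Video world models have advanced in visual quality and temporal consistency, but their ability to accurately model human-specified actions remains under-explored. Optimizing action-following directly with reinforcement learning requires a suitable reward function, which is costly or infeasible to obtain.

Result: Experiments across autoregressive and diffusion paradigms show 5-10% gains in action-following, up to 10% improvement in visual quality, and higher human preference scores.

Insight: RLIR gives video world model post-training a way to define reward signals without large-scale annotation, which improves the practicality and scalability of these models.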

Abstract: World models simulate dynamic environments, enabling agents to interact with diverse input modalities. Although recent advances have improved the visual quality and temporal consistency of video world models, their ability to accurately model human-specified actions remains under-explored. Reinforcement learning presents a promising approach for directly improving the suboptimal action-following capability of pre-trained models, assuming that an appropriate reward function can be defined. However, transferring reinforcement learning post-training methods to world models is impractical due to the prohibitive cost of large-scale preference annotations and the infeasibility of constructing rule-based video verifiers. To address this gap, we propose Reinforcement Learning with Inverse Rewards (RLIR), a post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model. By mapping high-dimensional video modality to a low-dimensional action space, RLIR provides an objective and verifiable reward for optimization via Group Relative Policy Optimization. Experiments across autoregressive and diffusion paradigms demonstrate 5-10% gains in action-following, up to 10% improvements in visual quality, and higher human preference scores, establishing RLIR as the first post-training method specifically designed to enhance action-following in video world models.
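
The inverse-reward idea can be sketched in a few lines. This is an illustration under stated assumptions, not the RLIR code: the Inverse Dynamics Model is not implemented, so `recovered_actions` stands in for its output, the reward is taken to be the fraction of input actions recovered exactly, and the group-relative advantage is the usual mean and standard deviation normalization associated with Group Relative Policy Optimization. All function names are hypothetical.

```python
import statistics


def action_match_reward(input_actions, recovered_actions):
    """Fraction of input actions that the (assumed) inverse dynamics model
    recovers from the generated video."""
    matches = sum(a == b for a, b in zip(input_actions, recovered_actions))
    return matches / len(input_actions)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three hypothetical rollouts generated for the same input action sequence.
inputs = ["left", "right", "jump"]
rollouts = [
    ["left", "right", "jump"],
    ["left", "right", "idle"],
    ["idle", "idle", "idle"],
]
rewards = [action_match_reward(inputs, r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts whose recovered actions match the inputs better than the group average receive a positive advantage, so no preference labels or hand-written video verifier are needed in this setup.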

[143] A Novel Hybrid Deep Learning and Chaotic Dynamics Approach for Thyroid Cancer Classification cs.CVPDF

Nada Bouchekout, Abdelkrim Boukabou, Morad Grimes, Yassine Habchi, Yassine Himeur

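TL;DR: The paper proposes a thyroid cancer classification method that combines an adaptive CNN with CDF9/7 wavelets and uses a chaotic system to enrich features, reaching 98.17% accuracy with good computational efficiency.

Details

Motivation: Timely and accurate diagnosis of thyroid cancer is crucial for treatment, and existing methods leave room for improvement in feature extraction and generalization.

Result: On the DDTI dataset the method reaches 98.17% accuracy, outperforming state-of-the-art backbones such as EfficientNetV2-S, and it is computationally efficient (28.7 ms per image).

Insight: A chaotic system can strengthen the discriminative power of wavelet features; combined with a CNN, it performs well on small-data medical image classification and is clinically practical.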

Abstract: Timely and accurate diagnosis is crucial in addressing the global rise in thyroid cancer, ensuring effective treatment strategies and improved patient outcomes. We present an intelligent classification method that couples an Adaptive Convolutional Neural Network (CNN) with Cohen-Daubechies-Feauveau (CDF9/7) wavelets whose detail coefficients are modulated by an n-scroll chaotic system to enrich discriminative features. We evaluate on the public DDTI thyroid ultrasound dataset (n = 1,638 images; 819 malignant / 819 benign) using 5-fold cross-validation, where the proposed method attains 98.17% accuracy, 98.76% sensitivity, 97.58% specificity, 97.55% F1-score, and an AUC of 0.9912. A controlled ablation shows that adding chaotic modulation to CDF9/7 improves accuracy by +8.79 percentage points over a CDF9/7-only CNN (from 89.38% to 98.17%). To objectively position our approach, we trained state-of-the-art backbones on the same data and splits: EfficientNetV2-S (96.58% accuracy; AUC 0.987), Swin-T (96.41%; 0.986), ViT-B/16 (95.72%; 0.983), and ConvNeXt-T (96.94%; 0.987). Our method outperforms the best of these by +1.23 points in accuracy and +0.0042 in AUC, while remaining computationally efficient (28.7 ms per image; 1,125 MB peak VRAM). Robustness is further supported by cross-dataset testing on TCIA (accuracy 95.82%) and transfer to an ISIC skin-lesion subset (n = 28 unique images, augmented to 2,048; accuracy 97.31%). Explainability analyses (Grad-CAM, SHAP, LIME) highlight clinically relevant regions. Altogether, the wavelet-chaos-CNN pipeline delivers state-of-the-art thyroid ultrasound classification with strong generalization and practical runtime characteristics suitable for clinical integration.

[144] VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion cs.CVPDF

Kargi Chauhan, Leilani H. Gilpin

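TL;DR: VFSI is an energy-guided method that explicitly enforces physical constraints during diffusion sampling, significantly improving the realism and physical validity of traffic simulation without retraining the model.

Details

Motivation: Traffic simulations generated by modern diffusion models look realistic but routinely violate physical constraints such as collisions and driving off the road. This exposes a limitation of existing methods, which treat physical validity as an emergent property rather than a design requirement.

Result: Experiments on the Waymo Open Motion Dataset show that VFSI reduces the collision rate by 67% and improves overall validity by 87%, while also improving trajectory accuracy (lower ADE).

Insight: Explicit constraint enforcement at inference time is both necessary and sufficient, offering a general way to improve the physical validity of generative models.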

Abstract: Modern diffusion models generate realistic traffic simulations but systematically violate physical constraints. In a large-scale evaluation of SceneDiffuser++, a state-of-the-art traffic simulator, we find that 50% of generated trajectories violate basic physical laws - vehicles collide, drive off roads, and spawn inside buildings. This reveals a fundamental limitation: current models treat physical validity as an emergent property rather than an architectural requirement. We propose Validity-First Spatial Intelligence (VFSI), which enforces constraints through energy-based guidance during diffusion sampling, without model retraining. By incorporating collision avoidance and kinematic constraints as energy functions, we guide the denoising process toward physically valid trajectories. Across 200 urban scenarios from the Waymo Open Motion Dataset, VFSI reduces collision rates by 67% (24.6% to 8.1%) and improves overall validity by 87% (50.3% to 94.2%), while simultaneously improving realism metrics (ADE: 1.34m to 1.21m). Our model-agnostic approach demonstrates that explicit constraint enforcement during inference is both necessary and sufficient for physically valid traffic simulation.
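
Energy-based guidance of this kind can be illustrated with a toy example. The sketch below is an assumption-laden simplification rather than VFSI itself: it uses a single quadratic collision penalty between two 2D agent positions, moves only one agent, and applies a plain gradient step, whereas a real sampler would fold the energy gradient into each denoising update. The names `collision_energy`, `guide_step`, `min_dist`, and `step` are hypothetical.

```python
import math


def collision_energy(a, b, min_dist):
    """Quadratic penalty that is positive only when two agents are closer than min_dist."""
    gap = min_dist - math.dist(a, b)
    return gap * gap if gap > 0 else 0.0


def guide_step(a, b, min_dist, step=0.1):
    """Move agent a one step down the gradient of the collision energy (b stays fixed)."""
    d = math.dist(a, b)
    gap = min_dist - d
    if gap <= 0 or d == 0:
        return tuple(a)  # no violation, or direction undefined
    # dE/da = -2 * gap * (a - b) / d
    grad = [-2.0 * gap * (ai - bi) / d for ai, bi in zip(a, b)]
    return tuple(ai - step * gi for ai, gi in zip(a, grad))


agent, other = (0.0, 0.0), (1.0, 0.0)
before = collision_energy(agent, other, min_dist=2.0)
agent = guide_step(agent, other, min_dist=2.0)
after = collision_energy(agent, other, min_dist=2.0)
```

Each guided step pushes the agents apart and lowers the energy, which is the mechanism by which sampling is steered toward valid trajectories without retraining.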

[145] Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution cs.CVPDF

Jinpei Guo, Yifei Ji, Zheng Chen, Yufei Wang, Sizhuo Ma

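TL;DR: The paper proposes OASIS, an efficient one-step diffusion model for video super-resolution (VSR) that uses attention specialization routing to reduce redundancy and improve performance.

Details

Motivation: In video super-resolution, applying diffusion models directly introduces redundancy, because low-quality videos already contain substantial content information. This redundancy increases both the computational cost and the learning burden.

Result: OASIS achieves the best results on synthetic and real-world datasets and runs 6.2x faster than one-step diffusion baselines such as SeedVR2.

Insight: Attention routing and progressive training can improve both the efficiency and the performance of diffusion models on VSR.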

Abstract: Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient one-step diffusion model with attention specialization for real-world video super-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a 6.2× speedup over one-step diffusion baselines such as SeedVR2. The code will be available at https://github.com/jp-guo/OASIS.

[146] Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning cs.CV | cs.ROPDF

Muleilan Pei, Shaoshuai Shi, Shaojie Shen

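TL;DR: The paper proposes SMART-R1, an R1-style reinforcement fine-tuning paradigm for multi-agent traffic simulation. By alternating between supervised fine-tuning and reinforcement fine-tuning, it improves model generalization and simulation realism.

Details

Motivation: Existing data-driven traffic simulators rely mainly on supervised learning and struggle with the distribution shift between training and testing, which limits generalization to unseen environments.

Result: SMART-R1 achieves state-of-the-art performance on the Waymo Open Sim Agents Challenge with a realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.

Insight: Alternating supervised learning and reinforcement learning can mitigate distribution shift and improve the generalization and realism of simulation models.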

Abstract: Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative “SFT-RFT-SFT” training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.

[147] TREAT-Net: Tabular-Referenced Echocardiography Analysis for Acute Coronary Syndrome Treatment Prediction cs.CV | cs.LGPDF

Diane Kim, Minh Nguyen Nhat To, Sherif Abdalla, Teresa S. M. Tsang, Purang Abolmaesumi

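TL;DR: TREAT-Net is a multimodal deep learning framework that combines echocardiography videos with structured clinical records to predict treatment for Acute Coronary Syndrome (ACS) non-invasively. It uses tabular-guided cross-attention to improve video interpretation and a late fusion mechanism to align predictions across modalities.

Details

Motivation: Coronary angiography is the gold standard for diagnosing ACS, but it is resource-intensive and invasive, which can expose patients to risk and delay diagnosis. TREAT-Net aims to provide a non-invasive alternative that speeds up treatment decisions.

Result: Trained on more than 9000 ACS cases, TREAT-Net reaches a balanced accuracy of 67.6% and an AUROC of 71.1%; cross-modality agreement analysis shows 88.6% accuracy for intervention prediction.

Insight: Fusing multimodal data, especially video with tabular data, can improve ACS treatment prediction and suggests a direction for developing non-invasive diagnostic tools.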

Abstract: Coronary angiography remains the gold standard for diagnosing Acute Coronary Syndrome (ACS). However, its resource-intensive and invasive nature can expose patients to procedural risks and diagnostic delays, leading to postponed treatment initiation. In this work, we introduce TREAT-Net, a multimodal deep learning framework for ACS treatment prediction that leverages non-invasive modalities, including echocardiography videos and structured clinical records. TREAT-Net integrates tabular-guided cross-attention to enhance video interpretation, along with a late fusion mechanism to align predictions across modalities. Trained on a dataset of over 9000 ACS cases, the model outperforms unimodal and non-fused baselines, achieving a balanced accuracy of 67.6% and an AUROC of 71.1%. Cross-modality agreement analysis demonstrates 88.6% accuracy for intervention prediction. These findings highlight the potential of TREAT-Net as a non-invasive tool for timely and accurate patient triage, particularly in underserved populations with limited access to coronary angiography.

[148] FrameMind: Frame-Interleaved Chain-of-Thought for Video Reasoning via Reinforcement Learning cs.CV | cs.AIPDF

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, Yujun Cai

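TL;DR: FrameMind is a new video reasoning framework that uses reinforcement learning to select frames dynamically during reasoning, addressing the limitations of fixed sampling strategies.

Details

Motivation: Current video understanding models use fixed frame sampling and cannot adjust the visual input to the needs of each question, which limits performance.

Result: On benchmarks such as MLVU and VideoMME, the method shows significant gains and outperforms existing models.

Insight: Combining dynamic frame selection with reinforcement learning can make video understanding considerably more flexible and efficient.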

Abstract: Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require either broad temporal coverage or fine-grained spatial detail. In this paper, we introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps. To train effective dynamic sampling policies, we propose Dynamic Resolution Frame Sampling (DRFS), which exposes models to diverse temporal-spatial trade-offs during learning, and DRFS-GRPO, a group-relative policy optimization algorithm that learns from outcome-based rewards without requiring frame-level annotations. Extensive experiments on challenging benchmarks like MLVU and VideoMME demonstrate that our method significantly outperforms existing models, advancing the state of the art in flexible and efficient video understanding.

[149] Generalized Category Discovery in Hyperspectral Images via Prototype Subspace Modeling cs.CVPDF

Xianlu Li, Nicolas Nadisic, Shaoguang Huang, Aleksandra Pizurica

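TL;DR: The paper proposes the first generalized category discovery (GCD) framework for hyperspectral images, based on prototype subspace modeling. Representing each category with multiple basis vectors improves expressiveness and discrimination in a high-dimensional feature space.

Details

Motivation: Existing GCD methods mainly target RGB images, and their assumptions and modeling strategies do not carry over well to the high-dimensional, complex spectral structure of hyperspectral images, so a dedicated design is needed.

Result: Experiments show that the method significantly outperforms existing GCD methods on hyperspectral datasets.

Insight: The complex structure of hyperspectral data calls for subspace modeling; multiple basis vectors are more expressive than a single prototype, and the design of the constraints is important for performance.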

Abstract: Generalized category discovery (GCD) seeks to jointly identify both known and novel categories in unlabeled data. While prior works have mainly focused on RGB images, their assumptions and modeling strategies do not generalize well to hyperspectral images (HSI), which are inherently high-dimensional and exhibit complex spectral structures. In this paper, we propose the first GCD framework tailored for HSI, introducing a prototype subspace model to better capture class structure. Instead of learning a single prototype vector for each category as in existing methods such as SimGCD, we model each category using a set of basis vectors, forming a subspace representation that enables greater expressiveness and discrimination in a high-dimensional feature space. To guide the learning of such bases, we enforce two key constraints: (1) a basis orthogonality constraint that promotes inter-class separability, and (2) a reconstruction constraint that ensures each prototype basis can effectively reconstruct its corresponding class samples. Experimental results on real-world HSI demonstrate that our method significantly outperforms state-of-the-art GCD methods, establishing a strong foundation for generalized category discovery in hyperspectral settings.
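
The two constraints can be made concrete with a small sketch. It is only one plausible reading of the abstract, not the paper's formulation: each class basis is assumed to be orthonormal, the reconstruction term is the squared residual after projecting a sample onto its class subspace, and the orthogonality term penalizes inner products between basis vectors of different classes. The function names and the assignment rule are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reconstruction_error(x, basis):
    """Squared distance from x to the span of an orthonormal basis."""
    coeffs = [dot(x, b) for b in basis]
    x_hat = [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(len(x))]
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))


def cross_class_orthogonality_penalty(bases):
    """Sum of squared inner products between basis vectors of different classes."""
    penalty = 0.0
    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            penalty += sum(dot(u, v) ** 2 for u in bases[i] for v in bases[j])
    return penalty


def assign(x, bases):
    """Assign x to the class whose subspace reconstructs it best."""
    return min(range(len(bases)), key=lambda k: reconstruction_error(x, bases[k]))


# Two toy classes in a 3-dimensional feature space.
bases = [
    [(1.0, 0.0, 0.0)],                    # class 0: one basis vector
    [(0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],   # class 1: two basis vectors
]
label = assign((0.0, 3.0, 1.0), bases)
```

A single prototype vector corresponds to a one-dimensional basis, so the subspace view strictly generalizes it.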

[150] $\mathbf{R}^3$: Reconstruction, Raw, and Rain: Deraining Directly in the Bayer Domain cs.CVPDF

Nate Rothschild, Moshe Kimhi, Avi Mendelson, Chaim Baskin

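TL;DR: The paper proposes deraining directly in the Bayer domain, which avoids the color mixing, dynamic range clipping, and detail blurring caused by conventional ISP processing, and it validates the advantage experimentally.

Details

Motivation: Deraining methods usually operate on post-ISP sRGB images, but the ISP pipeline discards information present in the raw Bayer data. The authors aim to retain more of that information and improve reconstruction quality by learning directly in the Bayer domain.

Result: The Bayer-domain model improves PSNR by up to +0.99 dB and ICS by up to +1.2%, and it is more efficient, using half the GFLOPs.

Insight: ISP processing can cause irreversible information loss, and learning directly in the Bayer domain may open new directions for low-level vision tasks and end-to-end camera pipelines.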

Abstract: Image reconstruction from corrupted images is crucial across many domains. Most reconstruction networks are trained on post-ISP sRGB images, even though the image-signal-processing pipeline irreversibly mixes colors, clips dynamic range, and blurs fine detail. This paper uses the rain degradation problem as a use case to show that these losses are avoidable, and demonstrates that learning directly on raw Bayer mosaics yields superior reconstructions. To substantiate the claim, we (i) evaluate post-ISP and Bayer reconstruction pipelines, (ii) curate Raw-Rain, the first public benchmark of real rainy scenes captured in both 12-bit Bayer and bit-depth-matched sRGB, and (iii) introduce Information Conservation Score (ICS), a color-invariant metric that aligns more closely with human opinion than PSNR or SSIM. On the test split, our raw-domain model improves sRGB results by up to +0.99 dB PSNR and +1.2% ICS, while running faster with half of the GFLOPs. The results advocate an ISP-last paradigm for low-level vision and open the door to end-to-end learnable camera pipelines.
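
For readers unfamiliar with raw data, the sketch below shows one common way to present a Bayer mosaic to a network: splitting it into four half-resolution color planes. The abstract does not say whether the paper does this, and the RGGB layout is an assumption, so `pack_rggb` is purely illustrative.

```python
def pack_rggb(mosaic):
    """Split an RGGB Bayer mosaic (list of rows, even height and width) into
    four half-resolution planes: R, G1, G2, B."""
    r = [row[0::2] for row in mosaic[0::2]]    # even rows, even columns
    g1 = [row[1::2] for row in mosaic[0::2]]   # even rows, odd columns
    g2 = [row[0::2] for row in mosaic[1::2]]   # odd rows, even columns
    b = [row[1::2] for row in mosaic[1::2]]    # odd rows, odd columns
    return r, g1, g2, b


# A 4x4 mosaic whose values are just their raster index.
mosaic = [[4 * y + x for x in range(4)] for y in range(4)]
r, g1, g2, b = pack_rggb(mosaic)
```

Unlike demosaicing, this rearrangement is lossless: every sensor value appears exactly once in the output.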

[151] Joint Superpixel and Self-Representation Learning for Scalable Hyperspectral Image Clustering cs.CVPDF

Xianlu Li, Nicolas Nadisic, Shaoguang Huang, Aleksandra Pizurica

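TL;DR: The paper proposes an end-to-end framework that jointly performs superpixel segmentation and self-representation learning for efficient HSI clustering.

Details

Motivation: Subspace clustering of HSI is usually expensive in computation and memory, and existing superpixel segmentation methods run independently of the clustering task, so their partitions may not match the clustering objective.

Result: On several HSI benchmark datasets, the method outperforms existing state-of-the-art clustering approaches.

Insight: Jointly optimizing superpixel segmentation and subspace clustering can improve both the efficiency and the accuracy of HSI clustering, and a learned per-superpixel compactness parameter makes the segmentation more adaptive.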

Abstract: Subspace clustering is a powerful unsupervised approach for hyperspectral image (HSI) analysis, but its high computational and memory costs limit scalability. Superpixel segmentation can improve efficiency by reducing the number of data points to process. However, existing superpixel-based methods usually perform segmentation independently of the clustering task, often producing partitions that do not align with the subsequent clustering objective. To address this, we propose a unified end-to-end framework that jointly optimizes superpixel segmentation and subspace clustering. Its core is a feedback mechanism: a self-representation network based on unfolded Alternating Direction Method of Multipliers (ADMM) provides a model-driven signal to guide a differentiable superpixel module. This joint optimization yields clustering-aware partitions that preserve both spectral and spatial structure. Furthermore, our superpixel network learns a unique compactness parameter for each superpixel, enabling more flexible and adaptive segmentation. Extensive experiments on benchmark HSI datasets demonstrate that our method consistently achieves superior accuracy compared with state-of-the-art clustering approaches.

[152] A Second-Order Perspective on Pruning at Initialization and Knowledge Transfer cs.CV | cs.AIPDF

Leonardo Iurada, Beatrice Occhiena, Tatiana Tommasi

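TL;DR: The paper finds that pruning pre-trained vision models at initialization preserves zero-shot performance, and that fine-tuning can recover performance on unseen tasks, which the authors attribute to the favorable loss landscape produced by large-scale pre-training.

Details

Motivation: Deployment of pre-trained vision models is limited by their computational and storage costs, and pruning is an effective way to compress them. Conventional methods, however, require task-specific data, which is a problem when the downstream task is unknown.

Result: Pruned models retain zero-shot performance on unseen tasks, and fine-tuning can recover their performance, especially for models pre-trained on large-scale datasets.

Insight: Large-scale pre-training improves generalization and also creates favorable conditions for adapting and recovering performance after pruning.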

Abstract: The widespread availability of pre-trained vision models has enabled numerous deep learning applications through their transferable representations. However, their computational and storage costs often limit practical deployment. Pruning-at-Initialization has emerged as a promising approach to compress models before training, enabling efficient task-specific adaptation. While conventional wisdom suggests that effective pruning requires task-specific data, this creates a challenge when downstream tasks are unknown in advance. In this paper, we investigate how data influences the pruning of pre-trained vision models. Surprisingly, pruning on one task retains the model’s zero-shot performance also on unseen tasks. Furthermore, fine-tuning these pruned models not only improves performance on original seen tasks but can recover held-out tasks’ performance. We attribute this phenomenon to the favorable loss landscapes induced by extensive pre-training on large-scale datasets.

Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian

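TL;DR: The paper studies how external cues in large vision-language models induce latent identifiers (Grounding IDs) that strengthen multimodal binding and thereby improve structured reasoning and precise grounding.

Details

Motivation: Large vision-language models perform well on multimodal tasks but remain limited in structured reasoning and precise grounding. The study aims to explain how external visual structures such as partitions and annotations improve model performance through latent identifiers.

Result: The study shows that Grounding IDs strengthen attention between related components, reduce hallucinations, and improve the robustness of cross-modal grounding.

Insight: Grounding IDs offer a symbolic explanation of how external cues improve multimodal models and point to practical improvements.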

Abstract: Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as robust within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism explaining how external cues enhance multimodal binding, offering both interpretability and practical improvements in robustness.

[154] Autoregressive Video Generation beyond Next Frames Prediction cs.CVPDF

Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu

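TL;DR: The paper presents VideoAR, a framework that moves beyond frame-by-frame prediction in autoregressive video generation. It supports several prediction units (full frames, key-detail frames, multiscale refinements, spatiotemporal cubes), and experiments show that spatiotemporal cubes give the best quality, speed, and temporal coherence.

Details

Motivation: Autoregressive video generation models usually predict frame by frame, but the assumption that the frame is the natural atomic unit of video may not hold. The paper asks whether a better prediction unit exists that improves performance and efficiency.

Result: VideoAR outperforms existing methods on VBench, runs inference faster, and can generate minute-long video sequences.

Insight: The prediction unit for video generation need not be the frame; other units such as spatiotemporal cubes may be more efficient and yield higher quality.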

Abstract: Autoregressive models for video generation typically operate frame-by-frame, extending next-token prediction from language to video’s temporal dimension. We question whether the frame is an appropriate prediction unit, given that video has no universally agreed unit comparable to the word as a token in language. To address this, we present VideoAR, a unified framework that supports a spectrum of prediction units including full frames, key-detail frames, multiscale refinements, and spatiotemporal cubes. Among these designs, we find that modeling video generation with spatiotemporal cubes as prediction units is the most effective, as it allows autoregressive models to operate across both spatial and temporal dimensions simultaneously. This approach eliminates the assumption that frames are the natural atomic units for video autoregression. We evaluate VideoAR across diverse prediction strategies, finding that cube-based prediction consistently delivers superior quality, speed, and temporal coherence. By removing the frame-by-frame constraint, our video generator surpasses state-of-the-art baselines on VBench while achieving faster inference and enabling seamless scaling to minute-long sequences. We hope this work will motivate rethinking sequence decomposition in video and other spatiotemporal domains.

Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera

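TL;DR: DualFlow is a unified and efficient framework for multi-modal two-person motion generation. It uses rectified flow for fast, high-quality 3D motion synthesis and adds a Retrieval-Augmented Generation (RAG) module and a multi-modal alignment strategy.

Details

Motivation: Existing methods for generating two-person motion under multi-modal conditions such as text and music suffer from slow inference, error accumulation, and weak semantic alignment. DualFlow aims to address these problems.

Result: On text-driven, music-driven, and multi-modal interactive tasks, DualFlow outperforms existing methods in motion quality, responsiveness, and efficiency.

Insight: The efficient sampling paths of rectified flow and the multi-modal alignment strategy are important for real-time performance and semantic consistency in motion generation.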

Abstract: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use a contrastive objective that further strengthens alignment with conditioning signals and introduce a synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.

[156] SVAC: Scaling Is All You Need For Referring Video Object Segmentation cs.CVPDF

Li Zhang, Haoxiang Gao, Zhihao Zhang, Luoxiao Huang, Tao Zhang

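TL;DR: SVAC is a unified model that scales up input frames and segmentation tokens to improve video-language interaction and segmentation precision. It addresses the computational cost and dynamic object behavior problems in RVOS and achieves state-of-the-art performance.

Details

Motivation: Multi-modal Large Language Models (MLLMs) have improved RVOS performance, but problems remain: insufficient use of MLLMs' prior knowledge, high computational cost for long videos, and inadequate handling of complex temporal dynamics.

Result: SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency.

Insight: Scaling up the input and optimizing the compute-heavy modules can significantly improve RVOS performance, and a clip-specific allocation strategy is key to handling complex temporal dynamics.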

Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insufficient exploitation of MLLMs’ prior knowledge, prohibitive computational and memory costs for long-duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video-language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio-temporal structure. Moreover, the Clip-Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.

[157] Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding cs.CV | cs.CLPDF

Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan

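TL;DR: The paper proposes GMS, a framework that pairs a general vision-language model (the Scanner) with a small task-specific model (the Locator). Their coarse-to-fine cooperation grounds natural language queries in GUIs and improves performance significantly.

Details

Motivation: Grounding natural language queries in GUIs requires understanding diverse UI elements and predicting spatial coordinates precisely, and a single model struggles to achieve both generalization and precision.

Result: On the ScreenSpot-Pro dataset, GMS reaches 35.7% accuracy, well above the individual models (Scanner 2.0%, Locator 3.7%) and other baselines.

Insight: A hierarchical design that mimics the human scan-then-focus behavior is key to better GUI grounding, and the complementarity of generalist and specialist models deserves attention.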

Abstract: Grounding natural language queries in graphical user interfaces (GUIs) presents a challenging task that requires models to comprehend diverse UI elements across various applications and systems, while also accurately predicting the spatial coordinates for the intended operation. To tackle this problem, we propose GMS: Generalist Scanner Meets Specialist Locator, a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. GMS leverages the complementary strengths of general vision-language models (VLMs) and small, task-specific GUI grounding models by assigning them distinct roles within the framework. Specifically, the general VLM acts as a ‘Scanner’ to identify potential regions of interest, while the fine-tuned grounding model serves as a ‘Locator’ that outputs precise coordinates within these regions. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Our whole framework consists of five stages and incorporates hierarchical search with cross-modal communication to achieve promising prediction results. Experimental results on the ScreenSpot-Pro dataset show that while the ‘Scanner’ and ‘Locator’ models achieve only 2.0% and 3.7% accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of 35.7%, representing a 10× improvement. Additionally, GMS significantly outperforms other strong baselines under various settings, demonstrating its robustness and potential for general-purpose GUI grounding.
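
The division of labor between Scanner and Locator comes down to a coordinate hand-off, which the following sketch illustrates. It covers only that two-role idea, not the five-stage framework: `scanner` and `locator` are hypothetical callables standing in for the two models, the region format `(left, top, width, height)` is an assumption, and the Locator is assumed to return normalized coordinates inside the cropped region.

```python
def crop_to_screen(region, local_xy):
    """Map a point predicted inside a cropped region back to full-screen pixels.

    region:   (left, top, width, height) in screen pixels
    local_xy: (x, y) normalized to [0, 1] within the crop
    """
    left, top, width, height = region
    x, y = local_xy
    return left + x * width, top + y * height


def coarse_to_fine(scanner, locator, screenshot_size, query):
    """Coarse step: the scanner proposes a region. Fine step: the locator
    predicts a point inside that region, which is mapped back to the screen."""
    region = scanner(screenshot_size, query)
    return crop_to_screen(region, locator(region, query))


# Stand-in models returning fixed outputs, for illustration only.
fake_scanner = lambda size, query: (100, 200, 400, 300)
fake_locator = lambda region, query: (0.5, 0.25)
click = coarse_to_fine(fake_scanner, fake_locator, (3840, 2160), "open settings")
```

Because the Locator only sees the crop, its effective resolution on small UI elements is higher than if it processed the whole high-resolution screenshot.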

[158] EYE-DEX: Eye Disease Detection and EXplanation System cs.CV | cs.AI | cs.LG | 60G35, 62M10, 62P35, 65C20, 68T45, 68U10, 92C35, 92C40, 92C42, 93E10 | I.4; I.4.8; I.4.9; I.4.10; I.2; I.2.6; I.2.10; J.3; C.2.4; C.3; H.2.8; H.3.4; H.3.5; I.2.4; I.5; I.5.1; I.5.4; K.6.1PDF

Youssef Sabiri, Walid Houmaidi, Amine Abouaomar

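TL;DR: EYE-DEX is an automated framework that classifies 10 retinal conditions using a dataset of 21,577 fundus images. A fine-tuned VGG16 model reaches 92.36% test accuracy, and Grad-CAM provides visual explanations to improve transparency.

Details

Motivation: More than 2.2 billion people worldwide are affected by vision impairment, and conventional manual diagnosis is time-consuming and subjective. Deep learning can automate retinal image analysis and improve diagnostic efficiency and quality.

Result: The fine-tuned VGG16 model achieves 92.36% test accuracy on retinal disease classification, outperforming the other benchmarked models.

Insight: 1. Deep learning performs well in medical image analysis; 2. Model explainability (for example Grad-CAM) is important for clinical trust; 3. Large-scale datasets are key to better model performance.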

Abstract: Retinal disease diagnosis is critical in preventing vision loss and reducing socioeconomic burdens. Globally, over 2.2 billion people are affected by some form of vision impairment, resulting in annual productivity losses estimated at $411 billion. Traditional manual grading of retinal fundus images by ophthalmologists is time-consuming and subjective. In contrast, deep learning has revolutionized medical diagnostics by automating retinal image analysis and achieving expert-level performance. In this study, we present EYE-DEX, an automated framework for classifying 10 retinal conditions using the large-scale Retinal Disease Dataset comprising 21,577 eye fundus images. We benchmark three pre-trained Convolutional Neural Network (CNN) models (VGG16, VGG19, and ResNet50), with our fine-tuned VGG16 achieving a state-of-the-art global benchmark test accuracy of 92.36%. To enhance transparency and explainability, we integrate the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to generate visual explanations highlighting disease-specific regions, thereby fostering clinician trust and reliability in AI-assisted diagnostics.
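
Grad-CAM, the explanation technique named in this abstract, weights each feature map of a convolutional layer by the spatial average of the class-score gradient and keeps the positive part of the weighted sum. The following is a minimal NumPy sketch of that computation only. Extracting `activations` and `gradients` from a fine-tuned VGG16 is left out, and the array shapes and the function name `grad_cam` are assumptions of this sketch.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM heatmap for one convolutional layer.

    activations: feature maps of the layer, shape (C, H, W)
    gradients:   d(class score) / d(activations), same shape
    """
    # Channel weights: global-average-pool the gradients over the spatial dims.
    weights = gradients.mean(axis=(1, 2))                               # (C,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)   # (H, W)
    # Normalize to [0, 1] for display; leave an all-zero map unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

In practice the heatmap is upsampled to the input resolution and overlaid on the fundus image; that step is omitted here.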

[159] Asymmetric VAE for One-Step Video Super-Resolution Acceleration cs.CVPDF

Jianze Li, Yong Guo, Yulun Zhang, Xiaokang Yang

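TL;DR: FastVSR uses a high-compression VAE (f16) and a new training strategy to substantially improve inference efficiency for video super-resolution, surpassing existing methods in both speed and performance.

Details

Motivation: Existing diffusion-based video super-resolution methods have reduced the number of sampling steps, but inference efficiency still has room for improvement; FastVSR aims to improve it further.

Result: FastVSR is 111.9 times faster than multi-step models and 3.92 times faster than existing one-step models while maintaining high performance.

Insight: A high-compression VAE and a simplified training objective are the keys to improving inference efficiency while also stabilizing training.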

Abstract: Diffusion models have significant advantages in the field of real-world video super-resolution and have demonstrated strong performance in past research. In recent diffusion-based video super-resolution (VSR) models, the number of sampling steps has been reduced to just one, yet there remains significant room for further optimization in inference efficiency. In this paper, we propose FastVSR, which achieves substantial reductions in computational cost by implementing a high compression VAE (spatial compression ratio of 16, denoted as f16). We design the structure of the f16 VAE and introduce a stable training framework. We employ pixel shuffle and channel replication to achieve additional upsampling. Furthermore, we propose a lower-bound-guided training strategy, which introduces a simpler training objective as a lower bound for the VAE’s performance. It makes the training process more stable and easier to converge. Experimental results show that FastVSR achieves speedups of 111.9 times compared to multi-step models and 3.92 times compared to existing one-step models. We will release code and models at https://github.com/JianzeLi-114/FastVSR.
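
Pixel shuffle, one of the upsampling operations mentioned in this abstract, trades channels for spatial resolution: a tensor of shape (C·r², H, W) becomes (C, H·r, W·r). The NumPy sketch below follows the channel layout used by common deep-learning libraries. How FastVSR wires this into its f16 VAE, and how channel replication is combined with it, is not stated in the abstract, so this shows the generic operation only.

```python
import numpy as np


def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_rr, h, w = x.shape
    if c_rr % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, row offset, col offset)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, row offset, w, col offset)
    return x.reshape(c, h * r, w * r)  # interleave the offsets into the spatial dims


# Four 1x1 channels become one 2x2 image.
tiny = pixel_shuffle(np.arange(4).reshape(4, 1, 1), r=2)
```

For scale, a spatial compression ratio of 16 maps a 1024×1024 frame to a 64×64 latent grid, a quarter of the positions of an f8 latent (128×128).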

[160] High-Order Progressive Trajectory Matching for Medical Image Dataset Distillation cs.CVPDF

Le Dong, Jinghao Bian, Jingyang Hou, Jingliang Hu, Yilei Shi

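TL;DR: This paper proposes High-Order Progressive Trajectory Matching (HoP-TM) for medical image dataset distillation. It addresses the tendency of existing methods to ignore information in intermediate optimization states, improving distillation performance while preserving privacy.

Details

Motivation: In medical image analysis, data sharing is restricted by privacy regulations and complex institutional protocols. Dataset distillation offers a solution by synthesizing compact datasets that capture the essential information of large real datasets.

Result: Experiments show that the method improves distillation performance on medical image classification tasks while maintaining model accuracy comparable to training on the original datasets.

Insight: 1. Information in intermediate optimization states is crucial for dataset distillation; 2. A progressive strategy and capturing geometric structure effectively improve distillation quality.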

Abstract: Medical image analysis faces significant challenges in data sharing due to privacy regulations and complex institutional protocols. Dataset distillation offers a solution to address these challenges by synthesizing compact datasets that capture essential information from real, large medical datasets. Trajectory matching has emerged as a promising methodology for dataset distillation; however, existing methods primarily focus on terminal states, overlooking crucial information in intermediate optimization states. We address this limitation by proposing a shape-wise potential that captures the geometric structure of parameter trajectories, and an easy-to-complex matching strategy that progressively addresses parameters based on their complexity. Experiments on medical image classification tasks demonstrate that our method improves distillation performance while preserving privacy and maintaining model accuracy comparable to training on the original datasets. Our code is available at https://github.com/Bian-jh/HoP-TM.
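
For context, the terminal-state objective that this abstract contrasts itself with is commonly written as a normalized distance between the parameters reached by training on the synthetic data and the endpoint of an expert trajectory segment. The sketch below shows that baseline formulation from prior trajectory-matching work; it is not the shape-wise potential proposed in the paper, and the function and argument names are hypothetical.

```python
import numpy as np


def terminal_matching_loss(student_end: np.ndarray,
                           expert_start: np.ndarray,
                           expert_end: np.ndarray) -> float:
    """Normalized squared distance between student and expert endpoints.

    All arguments are flattened parameter vectors. The student is assumed to
    start from `expert_start` and train on the synthetic dataset.
    """
    numerator = np.sum((student_end - expert_end) ** 2)
    denominator = np.sum((expert_start - expert_end) ** 2)
    return float(numerator / denominator)
```

A loss of 0 means the student landed exactly on the expert endpoint, and a loss of 1 means it made no net progress from the starting point. Only the two endpoints enter this loss, which is the limitation the abstract points to.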

[161] Combining Discrepancy-Confusion Uncertainty and Calibration Diversity for Active Fine-Grained Image Classification cs.CVPDF

Yinghao Jin, Xi Yang

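TL;DR: The paper proposes DECERN, a method that combines discrepancy-confusion uncertainty and calibration diversity for active fine-grained image classification, improving sample selection through a multifaceted informativeness measure.

Details

Motivation: In fine-grained image classification, subtle inter-class differences make it hard for conventional active learning methods to assess how informative a sample is. DECERN aims to select informative samples more effectively.

Result: Across 7 fine-grained image datasets and 26 experimental settings, DECERN outperforms existing methods.

Insight: By fusing local features with global diversity, DECERN offers a new approach to active learning for fine-grained image classification.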

Abstract: Active learning (AL) aims to build high-quality labeled datasets by iteratively selecting the most informative samples from an unlabeled pool under limited annotation budgets. However, in fine-grained image classification, assessing this informativeness is especially challenging due to subtle inter-class differences. In this paper, we introduce a novel method, combining discrepancy-confusion uncertainty and calibration diversity for active fine-grained image classification (DECERN), to effectively perceive the distinctiveness between fine-grained images and evaluate the sample value. DECERN introduces a multifaceted informativeness measure that combines discrepancy-confusion uncertainty and calibration diversity. The discrepancy-confusion uncertainty quantifies the category directionality and structural stability of fine-grained unlabeled data during local feature fusion. Subsequently, uncertainty-weighted clustering is performed to diversify the uncertainty samples. Then we calibrate the diversity to maximize the global diversity of the selected sample while maintaining its local representativeness. Extensive experiments conducted on 7 fine-grained image datasets across 26 distinct experimental settings demonstrate that our method achieves superior performance compared to state-of-the-art methods.

[162] Simulating Post-Neoadjuvant Chemotherapy Breast Cancer MRI via Diffusion Model with Prompt Tuning cs.CVPDF

Jonghun Kim, Hyunjin Park

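TL;DR: This paper proposes a generative method based on a diffusion model with prompt tuning that predicts the post-neoadjuvant-chemotherapy (NAC) MRI of breast cancer patients from pre-treatment DCE-MRI images. It outperforms other generative models in image quality and in reflecting tumor changes.

Details

Motivation: Monitoring response after neoadjuvant chemotherapy (NAC) is essential for breast cancer treatment planning, but accurate prediction methods are currently lacking. The paper aims to support precision medicine by generating post-treatment MRI images.

Result: The method outperforms other generative models on image quality metrics and more accurately reflects tumor size changes associated with pCR (pathological complete response). An ablation study confirms the need for prompt tuning.

Insight: With prompt tuning that incorporates clinical factors, the diffusion model adapts better and is more accurate in medical image generation. The method is a potential tool for personalized treatment.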

Abstract: Neoadjuvant chemotherapy (NAC) is a common therapy option before the main surgery for breast cancer. Response to NAC is monitored using follow-up dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). Accurate prediction of NAC response helps with treatment planning. Here, we adopt maximum intensity projection images from DCE-MRI to generate post-treatment images (i.e., 3 or 12 weeks after NAC) from pre-treatment images leveraging the emerging diffusion model. We introduce prompt tuning to account for the known clinical factors affecting response to NAC. Our model performed better than other generative models in image quality metrics. Our model was better at generating images that reflected changes in tumor size according to pCR compared to other models. Ablation study confirmed the design choices of our method. Our study has the potential to help with precision medicine.
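
A maximum intensity projection (MIP), the input representation named in this abstract, collapses a 3D volume to a 2D image by keeping the brightest voxel along one axis. A minimal NumPy sketch follows; the projection axis and any preprocessing the authors apply to the DCE-MRI volumes are assumptions not covered by the abstract.

```python
import numpy as np


def max_intensity_projection(volume: np.ndarray, axis: int = 0) -> np.ndarray:
    """Project a (D, H, W) volume to 2D by taking the maximum along `axis`."""
    return volume.max(axis=axis)


# Toy volume with three 2x2 slices.
volume = np.array([
    [[0, 5], [1, 0]],
    [[2, 1], [0, 0]],
    [[1, 0], [3, 4]],
])
mip = max_intensity_projection(volume, axis=0)
```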

[163] Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection cs.CV | cs.AIPDF

Sojung An, Kwanyong Park, Yong Jae Lee, Donghyun Kim

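TL;DR: The paper proposes the TaSe framework, which improves language-based object detection by disentangling and hierarchically aggregating language representations, and significantly outperforms existing methods on complex queries.

Details

Motivation: Existing vision-language models (VLMs) perform poorly on complex language queries (e.g., those with descriptive attributes and relational clauses), mainly because the text encoder fails to separate target objects from their attributes and relations.

Result: A 24% performance improvement on the OmniLabel benchmark.

Insight: Linguistic structure is crucial for multimodal representations; disentanglement and hierarchical aggregation significantly improve perception of complex queries.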

Abstract: While vision-language models (VLMs) have made significant progress in multimodal perception (e.g., open-vocabulary object detection) with simple language queries, state-of-the-art VLMs still show limited ability to perceive complex queries involving descriptive attributes and relational clauses. Our in-depth analysis shows that these limitations mainly stem from text encoders in VLMs. Such text encoders behave like bags-of-words and fail to separate target objects from their descriptive attributes and relations in complex queries, resulting in frequent false positives. To address this, we propose restructuring linguistic representations according to the hierarchical relations within sentences for language-based object detection. A key insight is the necessity of disentangling textual tokens into core components, namely objects, attributes, and relations (“talk in pieces”), and subsequently aggregating them into hierarchically structured sentence-level representations (“see in whole”). Building on this principle, we introduce the TaSe framework with three main contributions: (1) a hierarchical synthetic captioning dataset spanning three tiers from category names to descriptive sentences; (2) Talk in Pieces, a three-component disentanglement module that, guided by a novel disentanglement loss function, transforms text embeddings into subspace compositions; and (3) See in Whole, which learns to aggregate disentangled components into hierarchically structured embeddings, guided by the proposed hierarchical objectives. The proposed TaSe framework strengthens the inductive bias of hierarchical linguistic structures, resulting in fine-grained multimodal representations for language-based object detection. Experimental results under the OmniLabel benchmark show a 24% performance improvement, demonstrating the importance of linguistic compositionality.

[164] UniVid: The Open-Source Unified Video Model cs.CVPDF

Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang

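TL;DR: UniVid is a unified video model that couples a multimodal large language model (MLLM) with a diffusion decoder through a lightweight adapter to support both video understanding and generation. Temperature Modality Alignment and Pyramid Reflection improve prompt adherence and efficient temporal reasoning. It performs strongly on multiple benchmarks.

Details

Motivation: Existing unified video models suffer from insufficient semantic faithfulness and inefficient cross-modal attention during flow-based generation, and extending image-centric MLLMs to video is costly. UniVid aims to address these problems.

Result: A 2.2% improvement over EasyAnimateV5.1 on VBench-Long, and accuracy gains of 1.0% and 3.3% on MSVD-QA and ActivityNet-QA, respectively.

Insight: Lightweight adapters and dynamic techniques such as Pyramid Reflection can extend MLLMs to video efficiently while preserving both generation and understanding performance.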

Abstract: Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory, and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines.

[165] BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation cs.CV | cs.AIPDF

Zelin Liu, Sicheng Dong, Bocheng Li, Yixuan Yang, Jiacheng Ruan

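TL;DR: BALR-SAM is a boundary-aware low-rank adaptation framework that efficiently adapts SAM to medical image segmentation, reducing parameter and resource requirements while maintaining strong performance.

Details

Motivation: Large pretrained vision foundation models such as SAM perform poorly on medical image segmentation because they lack domain-specific adaptation. Fine-tuning them efficiently in clinical practice, within resource limits and without losing performance, is a key challenge.

Result: On standard medical segmentation datasets, BALR-SAM outperforms several SOTA methods (including fully fine-tuned MedSAM) without requiring prompts, while updating only 1.8% of the parameters (11.7M).

Insight: Combining boundary awareness with low-rank adaptation significantly improves the performance and efficiency of medical image segmentation, offering an efficient solution for resource-constrained settings.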

Abstract: Vision foundation models like the Segment Anything Model (SAM), pretrained on large-scale natural image datasets, often struggle in medical image segmentation due to a lack of domain-specific adaptation. In clinical practice, fine-tuning such models efficiently for medical downstream tasks with minimal resource demands, while maintaining strong performance, is challenging. To address these issues, we propose BALR-SAM, a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging. It combines three tailored components: (1) a Complementary Detail Enhancement Network (CDEN) using depthwise separable convolutions and multi-scale fusion to capture boundary-sensitive features essential for accurate segmentation; (2) low-rank adapters integrated into SAM’s Vision Transformer blocks to optimize feature representation and attention for medical contexts, while simultaneously significantly reducing the parameter space; and (3) a low-rank tensor attention mechanism in the mask decoder, cutting memory usage by 75% and boosting inference speed. Experiments on standard medical segmentation datasets show that BALR-SAM, without requiring prompts, outperforms several state-of-the-art (SOTA) methods, including fully fine-tuned MedSAM, while updating just 1.8% (11.7M) of its parameters.
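
The parameter savings of low-rank adaptation come from freezing a weight matrix W and training only a rank-r update B·A. The NumPy sketch below shows that generic idea for a single linear layer. It is not BALR-SAM's actual adapter design; the class name, the 768×768 layer size, the rank, and the initialization are all assumptions made for illustration.

```python
import numpy as np


class LowRankAdapter:
    """Frozen linear weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float = 1.0, seed: int = 0):
        d_out, d_in = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (batch, d_in); the second term is the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self) -> int:
        return self.A.size + self.B.size


# Hypothetical 768x768 projection with a rank-8 adapter.
layer = LowRankAdapter(np.ones((768, 768)), rank=8)
fraction = layer.trainable_params() / layer.weight.size

# Total parameter count implied by the abstract's "1.8% (11.7M)".
implied_total = 11.7e6 / 0.018
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly. The last line is plain arithmetic on the abstract's figures: 11.7M trainable parameters at 1.8% implies a model of roughly 650M parameters.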

[166] Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos cs.CVPDF

Yingdong Hu, Yisheng He, Jinnan Chen, Weihao Yuan, Kejie Qiu

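TL;DR: Forge4D is an efficient 4D human reconstruction and interpolation model that combines streaming 3D Gaussian reconstruction with dense motion prediction to reconstruct dynamic 3D humans from uncalibrated sparse-view videos, supporting both novel view and novel time synthesis.

Details

Motivation: Existing methods are limited either by reconstruction speed or by their inability to generate novel-time representations, which restricts applications of dynamic 3D human reconstruction.

Result: Experiments demonstrate the model's effectiveness on both in-domain and out-of-domain datasets.

Insight: Through joint task design and self-supervised losses, Forge4D balances efficiency and generative capability, offering a new approach to dynamic 4D reconstruction.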

Abstract: Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by the slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel view and novel time synthesis. Our model simplifies the 4D reconstruction and interpolation problem as a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For the task of streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. For novel time synthesis, we design a novel motion prediction module to predict dense motions for each 3D Gaussian between two adjacent frames, coupled with an occlusion-aware Gaussian fusion process to interpolate 3D Gaussians at arbitrary timestamps. To overcome the lack of the ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss is introduced to ensure motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets. Project page and code at: https://zhenliuzju.github.io/huyingdong/Forge4D.

[167] Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis cs.CVPDF

Xuecheng Wu, Junxiao Xue, Xinyi Yin, Yunyun Shi, Liangyu Fu

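TL;DR: The paper proposes AVF-MAE++, a self-supervised learning framework for affective video facial analysis (AVFA) that addresses data scarcity and modality correlation modeling through a dual masking strategy across modalities and an improved cross-modal correlation learning module.

Details

Motivation: Affective video facial analysis (AVFA) faces data scarcity and the challenge of modeling multimodal correlations; the paper aims to address both with a scalable self-supervised learning approach.

Result: Experiments on 17 datasets show that AVF-MAE++ achieves state-of-the-art performance across multiple benchmarks.

Insight: 1. Scalability is key to improving AVFA performance; 2. Cross-modal correlation modeling is crucial for self-supervised learning.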

Abstract: Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems, yet this field continues to suffer from limited data availability. In recent years, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained momentum, with a growing number of audio-visual adaptations. While scaling has proven essential for breakthroughs in general multi-modal learning domains, its specific impact on AVFA remains largely unexplored. Another core challenge in this field is capturing both intra- and inter-modal correlations through scalable audio-visual representations. To tackle these issues, we propose AVF-MAE++, a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA while enhancing cross-modal correlation modeling. Our framework introduces a novel dual masking strategy across audio and visual modalities and strengthens modality encoders with a more holistic design to better support scalable pre-training. Additionally, we present the Iterative Audio-Visual Correlation Learning Module, which improves correlation learning within the SSL paradigm, addressing the limitations of previous methods. To support smooth adaptation and reduce overfitting risks, we further introduce a progressive semantic injection strategy, organizing the model training into three structured stages. Extensive experiments conducted on 17 datasets, covering three major AVFA tasks, demonstrate that AVF-MAE++ achieves consistent state-of-the-art performance across multiple benchmarks. Comprehensive ablation studies further highlight the importance of each proposed component and provide deeper insights into the design choices driving these improvements. Our code and models have been publicly released on GitHub.

[168] EVLF-FM: Explainable Vision Language Foundation Model for Medicine cs.CVPDF

Yang Bai, Haoran Cheng, Yang Zhou, Jun Zhou, Arun Thirunavukarasu

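TL;DR: EVLF-FM is a multimodal vision-language foundation model designed to unify broad diagnostic capability with fine-grained explainability, and it performs strongly across multiple clinical specialties.

Details

Motivation: Current medical AI foundation models are limited: they are modality-specific and lack transparent reasoning, which hinders clinical adoption. EVLF-FM aims to address these problems.

Result: In internal validation, EVLF-FM achieved the highest accuracy and F1-score; in external validation it also showed strong zero-shot and few-shot performance.

Insight: Explainability and multimodal capability are key to building clinical trust in medical AI, and EVLF-FM illustrates the potential of foundation models in this field.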

Abstract: Despite the promise of foundation models in medical AI, current systems remain limited: they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grained explainability. The development and testing of EVLF-FM encompassed over 1.3 million total samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multiple disease diagnosis and visual question answering with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also achieved stellar performance across nine modalities with average mIOU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.
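
The grounding metrics quoted in this abstract, mIOU and Acc@0.5, are both derived from the intersection-over-union between a predicted and a reference region. The sketch below computes them for axis-aligned boxes. Whether the authors score boxes or pixel masks, and whether the 0.5 threshold is inclusive, is not stated in the abstract; both choices here are assumptions, as are the function names.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds: Sequence[Box], targets: Sequence[Box],
                      threshold: float = 0.5) -> Tuple[float, float]:
    """Return (mean IoU, fraction of predictions with IoU >= threshold)."""
    ious = [box_iou(p, t) for p, t in zip(preds, targets)]
    miou = sum(ious) / len(ious)
    acc = sum(iou >= threshold for iou in ious) / len(ious)
    return miou, acc
```

For example, one perfect prediction and one complete miss give a mean IoU of 0.5 and an Acc@0.5 of 0.5.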

[169] FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation cs.CV | cs.ROPDF

Seungwook Kim, Seunghyeon Lee, Minsu Cho

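TL;DR: The paper introduces two training-free inference-time techniques that improve the fidelity and controllability of diffusion-based trajectory-to-video generation.

Details

Motivation: Generating realistic robot videos is essential for building effective world models and robotics foundation models. Existing methods tend to treat action vectors as passive conditioning signals and do not fully exploit their guiding role.

Result: Experiments on real robot manipulation datasets show that the techniques significantly improve action coherence and visual quality.

Insight: Actively using action parameters to modulate generation dynamically can significantly improve diffusion-based trajectory-to-video generation, especially across diverse robot environments.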

Abstract: Generating realistic robot videos from explicit action trajectories is a critical step toward building effective world models and robotics foundation models. We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in diffusion-based robot video generation. Instead of treating action vectors as passive conditioning signals, our methods actively incorporate them to guide both the classifier-free guidance process and the initialization of Gaussian latents. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action-scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics. Experiments on real robot manipulation datasets demonstrate that these techniques significantly improve action coherence and visual quality across diverse robot environments.
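
Classifier-free guidance combines an unconditional and a conditional noise prediction as eps_uncond + w · (eps_cond − eps_uncond). The first technique in this abstract makes the guidance strength w proportional to the action magnitude. The sketch below illustrates one way that could look. The linear rule, the use of the Euclidean norm, and the parameters `base_scale` and `ref_magnitude` are assumptions of this sketch, since the abstract does not give the exact formula.

```python
import numpy as np


def action_scaled_cfg(eps_uncond: np.ndarray, eps_cond: np.ndarray, action: np.ndarray,
                      base_scale: float = 1.5, ref_magnitude: float = 1.0):
    """Classifier-free guidance whose strength grows with the action magnitude.

    Returns the guided noise prediction and the guidance scale that was used.
    """
    magnitude = float(np.linalg.norm(action))
    scale = base_scale * magnitude / ref_magnitude   # hypothetical proportional rule
    guided = eps_uncond + scale * (eps_cond - eps_uncond)
    return guided, scale
```

Under this rule a zero action falls back to the unconditional prediction, and an action of magnitude `ref_magnitude` recovers ordinary guidance with strength `base_scale`.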

[170] Latent Visual Reasoning cs.CV | cs.CLPDF

Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu

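TL;DR: The paper proposes Latent Visual Reasoning (LVR), which performs autoregressive reasoning in the visual embedding space and significantly improves multimodal large language models on visual question answering tasks.

Details

Motivation: Existing multimodal large language models improve task performance with chain-of-thought (CoT) reasoning, but the reasoning remains confined to the language space and visual information is treated as a static precondition. This limits their potential on fine-grained visual understanding tasks.

Result: LVR reaches 71.67% accuracy on MMVP, significantly above the 66.67% of Qwen2.5-VL.

Insight: Reasoning directly in the visual space captures fine-grained visual information better and improves performance on multimodal tasks.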

Abstract: Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.

[171] When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs cs.CVPDF

Jinming Liu, Zhaoyang Jia, Jiahao Li, Bin Li, Xin Jin

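TL;DR: The paper analyzes how compression distortion affects multimodal large language models (MLLMs) and proposes CoTAM, an image codec that adaptively protects multi-level features to meet the needs of downstream tasks. Experiments show bitrate savings of up to 35.99% while maintaining task performance.

Details

Motivation: MLLMs are usually deployed in the cloud and need efficient compression to reduce the bandwidth used by edge devices. Conventional image codecs are optimized for the human visual system and are ill-suited to the multi-task needs of MLLMs.

Result: CoTAM saves up to 35.99% of the bitrate while maintaining performance on MLLM tasks, outperforming existing neural codecs.

Insight: Compression distortion affects features at different levels unevenly, so codecs designed for MLLMs must balance multi-level feature protection against task requirements.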

Abstract: The increasing deployment of powerful Multimodal Large Language Models (MLLMs), typically hosted on cloud platforms, urgently requires effective compression techniques to efficiently transmit signal inputs (e.g., images, videos) from edge devices with minimal bandwidth usage. However, conventional image codecs are optimized for fidelity to serve the Human Visual System (HVS) and are ill-suited to MLLMs, in which diverse downstream tasks are jointly considered. In this paper, we first systematically analyze the impact of compression artifacts on several mainstream MLLMs. We find that compression distortion unevenly impacts different-level image features, leading to varying effects on MLLMs’ downstream tasks depending on their feature-level reliance. Motivated by this discovery, we propose an image Codec TAilored to MLLMs (CoTAM) designed to adaptively protect multi-level features and suit different demands of downstream tasks. The encoder leverages CLIP’s shallow-layer attention to generate an importance map for bit allocation, preserving critical semantic regions. Concurrently, the decoder integrates a lightweight adapter with a multi-level loss function to ensure the faithful reconstruction of both low-level details and high-level semantic context for robust synthesis of cross-level features. Extensive experiments validate that our method achieves up to 35.99% bitrate saving while maintaining the same performance on the MLLM tasks, outperforming previous SOTA neural codecs.

[172] S$^2$NN: Sub-bit Spiking Neural Networks cs.CVPDF

Wenjie Wei, Malu Zhang, Jieyuan Zhang, Ammar Belatreche, Shuai Wang

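TL;DR: The paper proposes sub-bit spiking neural networks (S$^2$NN), which represent weights with less than one bit to further explore the compression and acceleration potential of SNNs and to reduce the storage and computation demands of large-scale networks.

Details

Motivation: Although binary spiking neural networks (SNNs) are energy-efficient, their storage and computation demands remain high for large-scale networks. To compress and accelerate SNNs further, the paper proposes a sub-bit weight representation.

Result: Experiments show that S$^2$NN outperforms existing quantized SNNs on both vision and non-vision tasks in performance and efficiency, making it suitable for edge computing.

Insight: A sub-bit weight representation can significantly reduce the storage and computation demands of SNNs, while outlier-aware quantization and feature distillation improve performance.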

Abstract: Spiking Neural Networks (SNNs) offer an energy-efficient paradigm for machine intelligence, but their continued scaling poses challenges for resource-limited deployment. Despite recent advances in binary SNNs, the storage and computational demands remain substantial for large-scale networks. To further explore the compression and acceleration potential of SNNs, we propose Sub-bit Spiking Neural Networks (S$^2$NNs) that represent weights with less than one bit. Specifically, we first establish an S$^2$NN baseline by leveraging the clustering patterns of kernels in well-trained binary SNNs. This baseline is highly efficient but suffers from outlier-induced codeword selection bias during training. To mitigate this issue, we propose an outlier-aware sub-bit weight quantization (OS-Quant) method, which optimizes codeword selection by identifying and adaptively scaling outliers. Furthermore, we propose a membrane potential-based feature distillation (MPFD) method, improving the performance of highly compressed S$^2$NN via more precise guidance from a teacher model. Extensive results on vision and non-vision tasks reveal that S$^2$NN outperforms existing quantized SNNs in both performance and efficiency, making it promising for edge computing applications.
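
One way to read "less than one bit per weight" is as a kernel codebook. A binary 3×3 kernel has 2^9 = 512 possible patterns; if every kernel is restricted to one of K < 512 shared codewords, storing its index costs log2(K) bits spread over 9 weights. The sketch below illustrates that accounting and a nearest-codeword assignment by Hamming distance. The 3×3 kernel size, the codebook sizes, and the function names are assumptions for illustration and are not the paper's OS-Quant procedure.

```python
import math

import numpy as np


def bits_per_weight(codebook_size: int, kernel_size: int = 3) -> float:
    """Storage cost per weight when each kernel is stored as a codebook index."""
    return math.log2(codebook_size) / (kernel_size * kernel_size)


def assign_codewords(kernels: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each binary kernel to its nearest codeword by Hamming distance.

    kernels:  integer array of shape (N, 9) with entries in {-1, +1}
    codebook: integer array of shape (K, 9) with entries in {-1, +1}
    """
    # For +/-1 vectors of length n, Hamming distance = (n - dot product) / 2.
    dots = kernels @ codebook.T
    hamming = (kernels.shape[1] - dots) // 2
    return hamming.argmin(axis=1)
```

With a 64-entry codebook the cost is 6/9 ≈ 0.67 bits per weight, and with all 512 patterns allowed it returns to exactly 1 bit.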

[173] Skeleton-based Robust Registration Framework for Corrupted 3D Point Clouds cs.CV | cs.LGPDF

Yongqiang Wang, Weigang Li, Wenping Liu, Zhiqiang Tian, Jinling Li

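TL;DR: The paper proposes a skeleton-based robust registration framework (SRRF) for point clouds. It introduces a corruption-resilient skeletal representation and combines the transformations from point cloud and skeleton alignment, significantly improving registration under noise, density distortions, and geometric deformations.

Details

Motivation: Real-world point clouds are often corrupted by sensor limitations, environmental noise, and preprocessing errors. Existing registration methods rely on direct point matching or surface feature extraction, which are susceptible to these corruptions and lose alignment accuracy.

Result: Experiments across diverse corruption scenarios show that SRRF consistently outperforms existing methods in registration accuracy, particularly under density distortions, noise contamination, and geometric deformations.

Insight: By considering both local geometric features and the global stability of the skeleton structure, SRRF achieves robust and accurate point cloud registration in complex real-world scenes.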

Abstract: Point cloud registration is fundamental in 3D vision applications, including autonomous driving, robotics, and medical imaging, where precise alignment of multiple point clouds is essential for accurate environment reconstruction. However, real-world point clouds are often affected by sensor limitations, environmental noise, and preprocessing errors, making registration challenging due to density distortions, noise contamination, and geometric deformations. Existing registration methods rely on direct point matching or surface feature extraction, which are highly susceptible to these corruptions and lead to reduced alignment accuracy. To address these challenges, a skeleton-based robust registration framework is presented, which introduces a corruption-resilient skeletal representation to improve registration robustness and accuracy. The framework integrates skeletal structures into the registration process and combines the transformations obtained from both the corrupted point cloud alignment and its skeleton alignment to achieve optimal registration. In addition, a distribution distance loss function is designed to enforce the consistency between the source and target skeletons, which significantly improves the registration performance. This framework ensures that the alignment considers both the original local geometric features and the global stability of the skeleton structure, resulting in robust and accurate registration results. Experimental evaluations on diverse corrupted datasets demonstrate that SRRF consistently outperforms state-of-the-art registration methods across various corruption scenarios, including density distortions, noise contamination, and geometric deformations. The results confirm the robustness of SRRF in handling corrupted point clouds, making it a potential approach for 3D perception tasks in real-world scenarios.

[174] Robust Partial 3D Point Cloud Registration via Confidence Estimation under Global Context cs.CVPDF

Yongqiang Wang, Weigang Li, Wenping Liu, Zhe Xu, Zhiqiang Tian

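TL;DR: The paper proposes the CEGC framework, which addresses structural ambiguity and noise in partial point cloud registration through confidence estimation under global context. Its hybrid overlap confidence module combines semantic descriptors and geometric similarity to improve registration accuracy and robustness.

Details

Motivation: Partial point cloud registration is essential for autonomous perception and 3D scene understanding, but structural ambiguity, partial visibility, and noise limit the effectiveness of existing methods.

Result: On the ModelNet40, ScanObjectNN, and 7Scenes datasets, CEGC outperforms existing methods in accuracy, robustness, and generalization.

Insight: Through global context modeling and an adaptive weighting strategy, CEGC significantly improves partial point cloud registration and offers a new approach to registration in complex scenes.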

Abstract: Partial point cloud registration is essential for autonomous perception and 3D scene understanding, yet it remains challenging owing to structural ambiguity, partial visibility, and noise. We address these issues by proposing Confidence Estimation under Global Context (CEGC), a unified, confidence-driven framework for robust partial 3D registration. CEGC enables accurate alignment in complex scenes by jointly modeling overlap confidence and correspondence reliability within a shared global context. Specifically, the hybrid overlap confidence estimation module integrates semantic descriptors and geometric similarity to detect overlapping regions and suppress outliers early. The context-aware matching strategy mitigates ambiguity by employing global attention to assign soft confidence scores to correspondences, improving robustness. These scores guide a differentiable weighted singular value decomposition solver to compute precise transformations. This tightly coupled pipeline adaptively down-weights uncertain regions and emphasizes contextually reliable matches. Experiments on ModelNet40, ScanObjectNN, and 7Scenes 3D vision datasets demonstrate that CEGC outperforms state-of-the-art methods in accuracy, robustness, and generalization. Overall, CEGC offers an interpretable and scalable solution to partial point cloud registration under challenging conditions.
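
The weighted SVD step mentioned in this abstract has a standard closed form: given matched points and a confidence weight per match, the rigid transform that minimizes the weighted squared error comes from the SVD of a weighted cross-covariance matrix. The NumPy sketch below shows that solver in isolation. It assumes correspondences and weights are already available, and the function name is hypothetical. NumPy is not differentiable, so a trainable pipeline would use the same formula in an autodiff framework.

```python
import numpy as np


def weighted_svd_align(src: np.ndarray, dst: np.ndarray, weights: np.ndarray):
    """Find R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.

    src, dst: matched points of shape (N, 3); weights: non-negative, shape (N,).
    """
    w = weights / weights.sum()
    src_centroid = (w[:, None] * src).sum(axis=0)
    dst_centroid = (w[:, None] * dst).sum(axis=0)
    # Weighted cross-covariance between the centered point sets (3x3).
    H = (src - src_centroid).T @ (w[:, None] * (dst - dst_centroid))
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t


# Synthetic check: rotate by 30 degrees about z, translate, then recover.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
dst = src @ R_true.T + t_true
R_est, t_est = weighted_svd_align(src, dst, rng.uniform(0.1, 1.0, size=50))
```

Low-confidence matches contribute little to the cross-covariance, which is how soft confidence scores down-weight uncertain regions in this kind of solver.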

[175] SVGThinker: Instruction-Aligned and Reasoning-Driven Text-to-SVG Generation cs.CVPDF

Hanqi Chen, Zhongyin Zhao, Ye Chen, Zhujin Liang, Bingbing Ni

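TL;DR: SVGThinker is a text-to-SVG generation framework based on large language models (LLMs). Its reasoning-driven, instruction-aligned approach addresses the weak generalization and poor instruction adherence of conventional methods.

Details

Motivation: Conventional text-to-SVG generation methods generalize poorly and follow instructions poorly, which limits the practicality and editability of the resulting SVGs.

Result: Experiments show that SVGThinker produces SVGs that are more stable, more editable, and of higher quality than those of existing methods while preserving the structural advantages of vector graphics.

Insight: Beyond improving generation quality, SVGThinker supports precise hierarchical editing, opening new directions for design and automated graphics generation.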

Abstract: Scalable Vector Graphics (SVG) is a code-based representation for 2D visuals. Leveraging recent advances in large language models (LLMs), we study text-to-SVG generation and address two persistent gaps: weak generalization and poor adherence to input instructions. We present SVGThinker, a reasoning-driven framework that aligns the production of SVG code with the visualization process and supports the full set of SVG primitives. Our pipeline first renders each primitive in sequence and uses a multimodal model to annotate the image and code; we then build stepwise updates that mirror the incremental addition of primitives. On this data, we train an LLM with supervised fine-tuning that exposes its chain-of-thought as intermediate reasoning, improving robustness and reducing errors and hallucinations. Experiments against state-of-the-art baselines show that SVGThinker produces more stable, editable, and higher-quality SVGs while preserving the structural advantages of vector graphics. Unlike image-based methods, our outputs enable precise and hierarchical editing, opening new directions for design, content creation, and automated graphics generation.

[176] FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting cs.CVPDF

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu

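TL;DR: FrameThinker is a new framework that uses multi-turn frame spotlighting to let large vision-language models (LVLMs) handle long-video understanding and reasoning efficiently, sharply reducing the number of processed frames while improving performance.

Details

Motivation: Current LVLMs rely on uniform frame sampling and static textual reasoning for long-video reasoning, which is inefficient and cannot handle visually intensive tasks; FrameThinker is proposed to address these challenges.

Result: FrameThinker reaches 76.1% accuracy on LongVideo-Reason using an average of only 20.6 frames, and improves over baselines by +10.4% on average across benchmarks while greatly reducing the number of processed frames.

Insight: Iterative frame spotlighting and a staged training strategy (SFT followed by RL) can substantially improve the efficiency and performance of LVLMs on long videos.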

Abstract: While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle to handle visually intensive video tasks. To overcome these challenges, in this paper, we introduce the concept of thinking with long videos and propose a novel framework FrameThinker. Within this framework, LVLMs are able to iteratively interrogate video content. Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g. select frame), and designing reward functions to guide LVLMs to adopt the newly introduced action. To solve these challenges, we propose a two-phase training strategy, first employing Supervised Fine-Tuning (SFT) to instill fundamental action capabilities, followed by Reinforcement Learning (RL) to optimize a strategic decision-making policy. Notably, in this RL phase, we conduct an in-depth and comprehensive exploration of the reward design for each action and format reward. Extensive experiments on reasoning benchmarks like Video-Holmes, LongVideo-Reason, and long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames. Most notably, our 7B model, FrameThinker, establishes a new state-of-the-art on LongVideo-Reason, achieving 76.1% accuracy using an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness.

[177] OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction cs.CVPDF

Yuhang Cao, Haojun Yan, Danya Yao

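TL;DR: OMeGa is an end-to-end framework that jointly optimizes an explicit triangle mesh and 2D Gaussian splats; with a flexible binding strategy and mesh constraints, it markedly improves reconstruction accuracy in texture-less indoor regions.

Details

Motivation: Existing methods reconstruct inaccurate geometry in texture-less indoor regions, and they decouple mesh extraction from optimization, so mesh geometry cannot guide splat optimization. OMeGa aims to address both problems.

Result: Achieves state-of-the-art results on indoor reconstruction benchmarks, reducing Chamfer-$L_1$ error by 47.3% relative to the 2DGS baseline while maintaining competitive novel-view rendering quality.

Insight: Jointly optimizing an explicit mesh and Gaussian splats effectively addresses geometric reconstruction in texture-less regions; mesh geometry can guide splat optimization, improving overall reconstruction quality and robustness.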

Abstract: Neural rendering with Gaussian splatting has advanced novel view synthesis, and most methods reconstruct surfaces via post-hoc mesh extraction. However, existing methods suffer from two limitations: (i) inaccurate geometry in texture-less indoor regions, and (ii) the decoupling of mesh extraction from optimization, thereby missing the opportunity to leverage mesh geometry to guide splat optimization. In this paper, we present OMeGa, an end-to-end framework that jointly optimizes an explicit triangle mesh and 2D Gaussian splats via a flexible binding strategy, where spatial attributes of Gaussian Splats are expressed in the mesh frame and texture attributes are retained on splats. To further improve reconstruction accuracy, we integrate mesh constraints and monocular normal supervision into the optimization, thereby regularizing geometry learning. In addition, we propose a heuristic, iterative mesh-refinement strategy that splits high-error faces and prunes unreliable ones to further improve the detail and accuracy of the reconstructed mesh. OMeGa achieves state-of-the-art performance on challenging indoor reconstruction benchmarks, reducing Chamfer-$L_1$ by 47.3% over the 2DGS baseline while maintaining competitive novel-view rendering quality. The experimental results demonstrate that OMeGa effectively addresses prior limitations in indoor texture-less reconstruction.
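
For readers unfamiliar with the Chamfer-$L_1$ metric quoted above, the following is a minimal sketch under one common convention: the average of the mean unsquared nearest-neighbor distances in both directions (accuracy and completeness). The abstract does not state the exact definition used in the paper, so this convention and the function name `chamfer_l1` are assumptions.

```python
import math


def chamfer_l1(pred_points, gt_points):
    """Symmetric Chamfer distance using unsquared Euclidean distances.

    pred_points, gt_points: non-empty sequences of equal-dimension coordinates.
    Brute-force O(N * M); real evaluations typically use a spatial index.
    """
    def mean_nearest(a, b):
        # Mean distance from each point in `a` to its nearest point in `b`.
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)

    accuracy = mean_nearest(pred_points, gt_points)      # prediction -> ground truth
    completeness = mean_nearest(gt_points, pred_points)  # ground truth -> prediction
    return 0.5 * (accuracy + completeness)
```

Under any such definition, a 47.3% reduction means the reported error is 0.527 times the baseline's.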

[178] Towards Foundation Models for Cryo-ET Subtomogram Analysis cs.CVPDF

Runmin Jiang, Wanyue Feng, Yuntian Yang, Shriya Pingulkar, Hong Wang

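TL;DR: This paper works towards foundation models for subtomogram analysis in cryo-electron tomography (cryo-ET). By generating large-scale synthetic data, designing a Vision Transformer with adaptive phase tokenization, and proposing a noise-resilient contrastive learning strategy, it substantially improves performance on subtomogram classification, alignment, and averaging.

Details

Motivation: Subtomogram analysis in cryo-ET suffers from scarce annotations, severe noise, and poor generalization, which hinders effective structure determination.

Result: Evaluations on 24 synthetic and real datasets show SOTA performance on all three major subtomogram tasks and strong generalization to unseen datasets.

Insight: Pretraining on synthetic data combined with noise-robust design can effectively address the low signal-to-noise ratio and annotation scarcity in cryo-ET.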

Abstract: Cryo-electron tomography (cryo-ET) enables in situ visualization of macromolecular structures, where subtomogram analysis tasks such as classification, alignment, and averaging are critical for structural determination. However, effective analysis is hindered by scarce annotations, severe noise, and poor generalization. To address these challenges, we take the first step towards foundation models for cryo-ET subtomograms. First, we introduce CryoEngine, a large-scale synthetic data generator that produces over 904k subtomograms from 452 particle classes for pretraining. Second, we design an Adaptive Phase Tokenization-enhanced Vision Transformer (APT-ViT), which incorporates adaptive phase tokenization as an equivariance-enhancing module that improves robustness to both geometric and semantic variations. Third, we introduce a Noise-Resilient Contrastive Learning (NRCL) strategy to stabilize representation learning under severe noise conditions. Evaluations across 24 synthetic and real datasets demonstrate state-of-the-art (SOTA) performance on all three major subtomogram tasks and strong generalization to unseen datasets, advancing scalable and robust subtomogram analysis in cryo-ET.

[179] Similarity-Aware Selective State-Space Modeling for Semantic Correspondence cs.CVPDF

Seungwook Kim, Minsu Cho

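TL;DR: MambaMatcher is an efficient method for semantic correspondence based on selective state-space models (SSMs); it refines the 4D correlation map with a similarity-aware selective scan mechanism and significantly improves performance.

Details

Motivation: Traditional feature-metric methods may miss complex inter-correlation relationships, while correlation-metric methods are limited by high computational cost. MambaMatcher aims to model high-dimensional correlations efficiently.

Result: Achieves state-of-the-art performance on standard semantic correspondence benchmarks.

Insight: Combining selective SSMs with a similarity-aware mechanism offers an efficient way to model high-dimensional correlations.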

Abstract: Establishing semantic correspondences between images is a fundamental yet challenging task in computer vision. Traditional feature-metric methods enhance visual features but may miss complex inter-correlation relationships, while recent correlation-metric approaches are hindered by high computational costs due to processing 4D correlation maps. We introduce MambaMatcher, a novel method that overcomes these limitations by efficiently modeling high-dimensional correlations using selective state-space models (SSMs). By implementing a similarity-aware selective scan mechanism adapted from Mamba’s linear-complexity algorithm, MambaMatcher refines the 4D correlation map effectively without compromising feature map resolution or receptive field. Experiments on standard semantic correspondence benchmarks demonstrate that MambaMatcher achieves state-of-the-art performance.
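
To make the cost of the 4D correlation map concrete, here is a minimal NumPy sketch of how such a map is commonly built from two feature maps via cosine similarity. It illustrates only the input that correlation-metric methods refine, not MambaMatcher's selective scan; the function name `correlation_4d` and the choice of cosine similarity are assumptions.

```python
import numpy as np


def correlation_4d(feat_src, feat_tgt, eps=1e-8):
    """Cosine-similarity correlation map of shape (Hs, Ws, Ht, Wt).

    feat_src: (C, Hs, Ws) feature map; feat_tgt: (C, Ht, Wt) feature map.
    """
    # L2-normalize each spatial location's C-dimensional feature vector.
    src = feat_src / (np.linalg.norm(feat_src, axis=0, keepdims=True) + eps)
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=0, keepdims=True) + eps)
    # Dot product between every source location and every target location.
    return np.einsum("chw,cij->hwij", src, tgt)
```

The map has Hs * Ws * Ht * Wt entries, so two 64 x 64 feature maps already yield 16,777,216 values, which is why processing it directly is expensive.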

[180] TP-MVCC: Tri-plane Multi-view Fusion Model for Silkie Chicken Counting cs.CVPDF

Sirui Chen, Yuhong Feng, Yifeng Wang, Jianghai Liao, Qi Zhang

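TL;DR: The paper proposes TP-MVCC, a chicken counting model based on tri-plane multi-view fusion. It integrates multi-camera features through geometric projection and tri-plane fusion, improves counting accuracy in dense, occluded scenes, and is validated in a real farm environment.

Details

Motivation: Accurate animal counting is essential for smart farming, but dense occlusion and limited camera views make the task difficult.

Result: Experiments show that TP-MVCC reaches 95.1% accuracy in dense, occluded scenes, significantly outperforming single-view and conventional fusion methods.

Insight: Multi-view fusion effectively addresses dense occlusion, and tri-plane feature projection offers a new approach to integrating multi-view data.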

Abstract: Accurate animal counting is essential for smart farming but remains difficult in crowded scenes due to occlusions and limited camera views. To address this, we propose a tri-plane-based multi-view chicken counting model (TP-MVCC), which leverages geometric projection and tri-plane fusion to integrate features from multiple cameras onto a unified ground plane. The framework extracts single-view features, aligns them via spatial transformation, and decodes a scene-level density map for precise chicken counting. In addition, we construct the first multi-view dataset of silkie chickens under real farming conditions. Experiments show that TP-MVCC significantly outperforms single-view and conventional fusion comparisons, achieving 95.1% accuracy and strong robustness in dense, occluded scenarios, demonstrating its practical potential for intelligent agriculture.

[181] Dynamic Orchestration of Multi-Agent System for Real-World Multi-Image Agricultural VQA cs.CV | cs.AIPDF

Yan Ke, Xin Yu, Heming Du, Scott Chapman, Helen Huang

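TL;DR: This paper proposes a dynamically orchestrated multi-agent framework to address multi-image inputs and insufficient context in agricultural visual question answering (VQA); four roles (Retriever, Reflector, Answerer, and Improver) collaborate to enrich context and iteratively refine answers.

Details

Motivation: Existing agricultural VQA systems are mainly designed for single-image or text-only queries, cannot cope with multi-image inputs and changing agricultural scenarios, and lack systematic quality control.

Result: On the AgMMU benchmark, the framework achieves competitive performance on multi-image agricultural QA.

Insight: Dynamically orchestrating multiple agents for context enrichment and iterative refinement can improve the adaptability and accuracy of agricultural VQA systems in complex real-world scenarios.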

Abstract: Agricultural visual question answering is essential for providing farmers and researchers with accurate and timely knowledge. However, many existing approaches are predominantly developed for evidence-constrained settings such as text-only queries or single-image cases. This design prevents them from coping with real-world agricultural scenarios that often require multi-image inputs with complementary views across spatial scales and growth stages. Moreover, limited access to up-to-date external agricultural context makes these systems struggle to adapt when evidence is incomplete. In addition, rigid pipelines often lack systematic quality control. To address this gap, we propose a self-reflective and self-improving multi-agent framework that integrates four roles: the Retriever, the Reflector, the Answerer, and the Improver. They collaborate to enable context enrichment, reflective reasoning, answer drafting, and iterative improvement. A Retriever formulates queries and gathers external information, while a Reflector assesses adequacy and triggers sequential reformulation and renewed retrieval. Two Answerers draft candidate responses in parallel to reduce bias. The Improver refines them through iterative checks while ensuring that information from multiple images is effectively aligned and utilized. Experiments on the AgMMU benchmark show that our framework achieves competitive performance on multi-image agricultural QA.

[182] NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis cs.CVPDF

Yixuan Ren, Hanyu Wang, Hao Chen, Bo He, Abhinav Shrivastava

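TL;DR: NeRV-Diffusion synthesizes videos by generating neural network weights, combining implicit neural representations (INRs) with a diffusion model for efficient video generation.

Details

Motivation: Video synthesis typically relies on frame-wise feature encoding and cross-frame attention, which is computationally expensive and may hurt generation quality. NeRV-Diffusion instead encodes and decodes a video holistically to avoid these limitations.

Result: On benchmarks including UCF-101 and Kinetics-600, NeRV-Diffusion outperforms previous INR-based models and is comparable to recent state-of-the-art non-implicit models, while enabling smooth interpolation between frames or videos.

Insight: By generating a holistic neural representation of a video, NeRV-Diffusion avoids the complexity of frame-wise processing while supporting high-quality denoising and interpolation in weight space.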

Abstract: We present NeRV-Diffusion, an implicit latent video diffusion model that synthesizes videos via generating neural network weights. The generated weights can be rearranged as the parameters of a convolutional neural network, which forms an implicit neural representation (INR), and decodes into videos with frame indices as the input. Our framework consists of two stages: 1) A hypernetwork-based tokenizer that encodes raw videos from pixel space to neural parameter space, where the bottleneck latent serves as INR weights to decode. 2) An implicit diffusion transformer that denoises on the latent INR weights. In contrast to traditional video tokenizers that encode videos into frame-wise feature maps, NeRV-Diffusion compresses and generates a video holistically as a unified neural network. This enables efficient and high-quality video synthesis via obviating temporal cross-frame attentions in the denoiser and decoding video latent with dedicated decoders. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the bottleneck latent across all NeRV layers, as well as reform its weight assignment, upsampling connection and input coordinates. We also introduce SNR-adaptive loss weighting and scheduled sampling for effective training of the implicit diffusion model. NeRV-Diffusion reaches superior video generation quality over previous INR-based models and comparable performance to most recent state-of-the-art non-implicit models on real-world video benchmarks including UCF-101 and Kinetics-600. It also brings a smooth INR weight space that facilitates seamless interpolations between frames or videos.

[183] An Enhanced Pyramid Feature Network Based on Long-Range Dependencies for Multi-Organ Medical Image Segmentation cs.CV | cs.AIPDF

Dayu Tan, Cheng Kong, Yansen Su, Hai Chen, Dongliang Yang

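TL;DR: LamFormer is a novel U-shaped network that uses Linear Attention Mamba (LAM) in an enhanced pyramid encoder to address long-range dependency modeling and local detail extraction in multi-organ medical image segmentation; it outperforms existing methods while balancing performance and model complexity.

Details

Motivation: Current multi-organ medical image segmentation methods rely on Transformers, which are computationally expensive and weak at extracting local detail.

Result: Outperforms existing segmentation methods on seven complex and diverse datasets while balancing performance and model complexity.

Insight: By combining linear attention with improved feature aggregation, LamFormer balances performance and computational efficiency, offering a new approach to medical image segmentation.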

Abstract: In the field of multi-organ medical image segmentation, recent methods frequently employ Transformers to capture long-range dependencies from image features. However, these methods overlook the high computational cost of Transformers and their deficiencies in extracting local detailed information. To address high computational costs and inadequate local detail information, we reassess the design of feature extraction modules and propose a new deep-learning network called LamFormer for fine-grained segmentation tasks across multiple organs. LamFormer is a novel U-shaped network that employs Linear Attention Mamba (LAM) in an enhanced pyramid encoder to capture multi-scale long-range dependencies. We construct the Parallel Hierarchical Feature Aggregation (PHFA) module to aggregate features from different layers of the encoder, narrowing the semantic gap among features while filtering information. Finally, we design the Reduced Transformer (RT), which utilizes a distinct computational approach to globally model up-sampled features. RT enhances the extraction of detailed local information and improves the network’s capability to capture long-range dependencies. LamFormer outperforms existing segmentation methods on seven complex and diverse datasets, demonstrating exceptional performance. Moreover, the proposed network achieves a balance between model performance and model complexity.

[184] DRIFT: Divergent Response in Filtered Transformations for Robust Adversarial Defense cs.CV | cs.LGPDF

Amira Guesmi, Muhammad Shafique

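TL;DR: DRIFT improves adversarial defense by breaking gradient consensus: it trains lightweight filters with randomized transformations and gradient dissonance, substantially improving model robustness against adversarial attacks.

Details

Motivation: Most existing adversarial defenses rely on gradient masking and are easily bypassed by attackers who exploit gradient consensus. DRIFT aims to strengthen defense by actively disrupting that consensus.

Result: DRIFT outperforms existing methods under adaptive white-box, transfer-based, and gradient-free attacks, with negligible computational overhead.

Insight: Gradient dissonance is an effective principle for adversarial defense, and the lightweight filter design applies broadly across architectures.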

Abstract: Deep neural networks remain highly vulnerable to adversarial examples, and most defenses collapse once gradients can be reliably estimated. We identify gradient consensus, the tendency of randomized transformations to yield aligned gradients, as a key driver of adversarial transferability. Attackers exploit this consensus to construct perturbations that remain effective across transformations. We introduce DRIFT (Divergent Response in Filtered Transformations), a stochastic ensemble of lightweight, learnable filters trained to actively disrupt gradient consensus. Unlike prior randomized defenses that rely on gradient masking, DRIFT enforces gradient dissonance by maximizing divergence in Jacobian- and logit-space responses while preserving natural predictions. Our contributions are threefold: (i) we formalize gradient consensus and provide a theoretical analysis linking consensus to transferability; (ii) we propose a consensus-divergence training strategy combining prediction consistency, Jacobian separation, logit-space separation, and adversarial robustness; and (iii) we show that DRIFT achieves substantial robustness gains on ImageNet across CNNs and Vision Transformers, outperforming state-of-the-art preprocessing, adversarial training, and diffusion-based defenses under adaptive white-box, transfer-based, and gradient-free attacks. DRIFT delivers these improvements with negligible runtime and memory cost, establishing gradient divergence as a practical and generalizable principle for adversarial defense.
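
One simple way to picture gradient consensus is as the average pairwise cosine similarity between the input gradients obtained under different transformations. The sketch below uses that measure purely as an illustration; the abstract does not give the paper's formal definition, so the function `gradient_consensus` and its inputs are assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gradient_consensus(gradients):
    """Mean pairwise cosine similarity over a list of flattened gradients.

    Values near 1 mean the transformations yield aligned gradients (high
    consensus); values near 0 or below mean the gradients disagree.
    """
    if len(gradients) < 2:
        raise ValueError("need at least two gradients to compare")
    pairs = list(combinations(gradients, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

Under this illustrative measure, a defense that enforces gradient dissonance would push the score down while keeping predictions unchanged.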

[185] UI-UG: A Unified MLLM for UI Understanding and Generation cs.CV | cs.AI | cs.HCPDF

Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao

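TL;DR: UI-UG is a unified multimodal large language model (MLLM) for user interface (UI) understanding and generation. It combines Supervised Fine-tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO) to improve performance and generation quality.

Details

Motivation: Existing MLLMs perform poorly on domain-specific tasks such as UI understanding and generation and struggle to meet the needs of modern, complex UIs.

Result: UI-UG achieves SOTA performance on UI understanding tasks and matches larger MLLMs on generation tasks at a lower computational cost.

Insight: Integrating UI understanding and generation tasks can improve accuracy and quality for both.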

Abstract: Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they are still facing challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on the modern complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks.

[186] Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models cs.CV | cs.AIPDF

Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, Jun Yu

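TL;DR: Uni-X is a unified multimodal model architecture with separated ends and a shared middle; by avoiding modality conflicts in shallow and deep layers, it significantly improves training efficiency and performance.

Details

Motivation: Existing unified multimodal models built on shared autoregressive transformers suffer from gradient conflicts between modalities, especially in shallow and deep layers, which stem from the different low-level statistical properties of images and text.

Result: At 3B parameters, Uni-X matches or surpasses 7B baseline models, achieving a GenEval score of 82 for image generation alongside strong performance across multiple tasks.

Insight: Middle layers are better suited to multimodal semantic fusion, while modality-specific processing at both ends substantially reduces conflict; future multimodal model designs can draw on this idea.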

Abstract: Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned. To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers. Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks. These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. Our code is available at https://github.com/CURRENTF/Uni-X

[187] DINOReg: Strong Point Cloud Registration with Vision Foundation Model cs.CVPDF

Congjia Chen, Yufu Qu

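TL;DR: DINOReg is a new point cloud registration method that combines image features extracted by the vision foundation model DINOv2 with geometric features, achieving significant performance gains.

Details

Motivation: Existing point cloud registration methods rely mainly on geometric information, overlook the texture and semantic information in images, and fuse features ineffectively. DINOReg aims to fully exploit both visual and geometric information to improve registration.

Result: On the RGBD-3DMatch and RGBD-3DLoMatch datasets, it significantly outperforms existing methods, with a 15.7% increase in registration recall.

Insight: Vision foundation models can supply rich texture and semantic information for point cloud registration; effective fusion of multimodal features is key to improving performance.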

Abstract: Point cloud registration is a fundamental task in 3D computer vision. Most existing methods rely solely on geometric information for feature extraction and matching. Recently, several studies have incorporated color information from RGB-D data into feature extraction. Although these methods achieve remarkable improvements, they have not fully exploited the abundant texture and semantic information in images, and the feature fusion is performed in an image-lossy manner, which limits their performance. In this paper, we propose DINOReg, a registration network that sufficiently utilizes both visual and geometric information to solve the point cloud registration problem. Inspired by advances in vision foundation models, we employ DINOv2 to extract informative visual features from images, and fuse visual and geometric features at the patch level. This design effectively combines the rich texture and global semantic information extracted by DINOv2 with the detailed geometric structure information captured by the geometric backbone. Additionally, a mixed positional embedding is proposed to encode positional information from both image space and point cloud space, which enhances the model’s ability to perceive spatial relationships between patches. Extensive experiments on the RGBD-3DMatch and RGBD-3DLoMatch datasets demonstrate that our method achieves significant improvements over state-of-the-art geometry-only and multi-modal registration methods, with a 14.2% increase in patch inlier ratio and a 15.7% increase in registration recall. The code is publicly available at https://github.com/ccjccjccj/DINOReg.

[188] REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport cs.CV | cs.AIPDF

Soumyadeep Chandra, Kaushik Roy

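TL;DR: REALIGN is a self-supervised framework based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport that handles non-monotonic step orders, repeated actions, and irrelevant frames in instructional videos, significantly improving performance.

Details

Motivation: Instructional videos contain background segments, repeated actions, and non-monotonic step orders; existing methods rest on strong monotonicity assumptions, handle these cases poorly, and do not model higher-order temporal structure.

Result: Across multiple benchmarks, average F1-score improves by up to 18.9% and temporal IoU by over 30%, and the resulting transport maps are more interpretable.

Insight: Jointly modeling visual and temporal relations handles the complex dynamics of instructional videos more effectively while improving interpretability.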

Abstract: Learning from procedural videos remains a core challenge in self-supervised representation learning, as real-world instructional data often contains background segments, repeated actions, and steps presented out of order. Such variability violates the strong monotonicity assumptions underlying many alignment methods. Prior state-of-the-art approaches, such as OPEL, leverage Kantorovich Optimal Transport (KOT) to build frame-to-frame correspondences, but rely solely on feature similarity and fail to capture the higher-order temporal structure of a task. In this paper, we introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme, enabling robust handling of irrelevant frames, repeated actions, and non-monotonic step orders common in instructional videos. To stabilize training, we integrate FPGWOT distances with inter-sequence contrastive learning, avoiding the need for multiple regularizers and preventing collapse to degenerate solutions. Across egocentric (EgoProceL) and third-person (ProceL, CrossTask) benchmarks, REALIGN achieves up to 18.9% average F1-score improvements and over 30% temporal IoU gains, while producing more interpretable transport maps that preserve key-step orderings and filter out noise.

[189] Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy cs.CV | cs.AIPDF

Haijier Chen, Bo Xu, Shoujian Zhang, Haoze Liu, Jiaxuan Lin

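TL;DR: Vid-LLM is a compact video-based 3D multimodal large language model that improves 3D scene understanding through reconstruction-reasoning synergy, without requiring external 3D data input.

Details

Motivation: Existing 3D multimodal large language models depend on 3D data inputs, which limits their scalability and generalization. This paper aims to achieve 3D scene understanding directly from video input.

Result: The method's effectiveness is verified on multiple tasks, including 3D question answering, 3D dense captioning, and 3D visual grounding.

Insight: Video input can serve as an effective substitute for 3D data; integrating geometric priors and a multi-task optimization strategy significantly improves 3D scene understanding.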

Abstract: Recent developments in Multimodal Large Language Models (MLLMs) have significantly improved Vision-Language (VL) reasoning in 2D domains. However, extending these capabilities to 3D scene understanding remains a major challenge. Existing 3D Multimodal Large Language Models (3D-MLLMs) often depend on 3D data inputs, which limits scalability and generalization. To address this limitation, we propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data, making it practical for real-world deployment. In our method, geometric priors are used directly to improve scene perception. To integrate the geometric cues into the MLLM compactly, we design a Cross-Task Adapter (CTA) module to align the 3D geometric priors with the vision-language representations. To ensure geometric consistency and integrity, we introduce a Metric Depth Model that recovers real-scale geometry from the reconstruction outputs. Finally, the model is fine-tuned with a two-stage distillation optimization strategy, achieving fast convergence and stable training. Extensive experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Dense Captioning and 3D Visual Grounding tasks, demonstrating superior multi-task capabilities.

[190] PCICF: A Pedestrian Crossing Identification and Classification Framework cs.CVPDF

Junyi Gu, Beatriz Cabrero-Daniel, Ali Nouri, Lydia Armini, Christian Berger

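TL;DR: PCICF is a pedestrian crossing identification and classification framework intended to help autonomous vehicles identify and classify complex pedestrian crossing situations in urban environments. It extends a synthetic dataset and uses space-filling curves (SFCs) for feature matching, and performs well on the real-world dataset PIE.

Details

Motivation: Autonomous vehicles such as robotaxis must reliably detect and respond to vulnerable road users (VRUs) such as pedestrians in urban environments. Existing systems that process multi-modal sensor data lack high-quality data, especially for complex pedestrian crossing situations.

Result: PCICF successfully identifies and classifies complex pedestrian crossing situations (such as pedestrians merging or splitting) and is computationally efficient enough to have potential for onboard use.

Insight: 1. SFCs are efficient for matching multi-dimensional features; 2. extending the synthetic dataset supports the simulation of complex situations; 3. PCICF demonstrates the potential of end-to-end AI in autonomous driving.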

Abstract: We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestrians, bicyclists, or e-scooter riders. To better handle such varied traffic situations, end-to-end AI, which directly computes vehicle control actions from multi-modal sensor data instead of being used only for perception, is on the rise. High-quality data is needed for systematically training and evaluating such systems within their ODD. In this work, we propose PCICF, a framework to systematically identify and classify VRU situations to support ODD’s incident analysis. We base our work on the existing synthetic dataset SMIRK, and enhance it by extending its single-pedestrian-only design into the MoreSMIRK dataset, a structured dictionary of multi-pedestrian crossing situations constructed systematically. We then use space-filling curves (SFCs) to transform multi-dimensional features of scenarios into characteristic patterns, which we match with corresponding entries in MoreSMIRK. We evaluate PCICF with the large real-world dataset PIE, which contains more than 150 manually annotated pedestrian crossing videos. We show that PCICF can successfully identify and classify complex pedestrian crossings, even when groups of pedestrians merge or split. By leveraging computationally efficient components like SFCs, PCICF even has the potential to be used onboard robotaxis, for example for OOD detection. We share an open-source replication package for PCICF containing its algorithms, the complete MoreSMIRK dataset and dictionary, as well as our experiment results: https://github.com/Claud1234/PCICF
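
As an illustration of how a space-filling curve turns a multi-dimensional feature into a single index, here is a minimal sketch of Z-order (Morton) encoding. The abstract does not say which SFC PCICF uses, so the choice of the Morton curve, the function name `morton_encode`, and the 8-bit quantization are assumptions.

```python
def morton_encode(coords, bits=8):
    """Interleave the bits of non-negative integer coordinates into one index.

    coords: sequence of integers, each in [0, 2**bits).
    Nearby points in the multi-dimensional space tend to receive nearby
    indices, which is what makes curve-based pattern matching cheap.
    """
    dims = len(coords)
    for value in coords:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"coordinate {value} does not fit in {bits} bits")
    code = 0
    for bit in range(bits):
        for dim, value in enumerate(coords):
            # Bit `bit` of dimension `dim` lands at position bit * dims + dim.
            code |= ((value >> bit) & 1) << (bit * dims + dim)
    return code
```

For example, `morton_encode((3, 5))` interleaves the bits of 3 and 5 into the single index 39.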

[191] CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers cs.CV | cs.AIPDF

Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang

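TL;DR: CLQ is a cross-layer guided orthogonal-based quantization method for diffusion transformers (DiTs). Through cross-block calibration, orthogonal-based smoothing, and cross-layer parameter searching, it achieves efficient W4A4 compression with almost no loss in visual quality.

Details

Motivation: Diffusion transformers (DiTs) have improved visual generation quality through larger and more complex models, but this limits their deployment on edge devices; efficient quantization is needed to reduce memory consumption and speed up inference.

Result: CLQ compresses models to W4A4 with negligible degradation in visual quality and metrics, achieving 3.98x memory saving and 3.95x speedup.

Insight: Accurate calibration data is critical for quantization; cross-layer guidance and orthogonal-based smoothing are effective ways to handle outliers in DiTs.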

Abstract: Visual generation quality has improved greatly with the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, these same factors also hinder the practical deployment of DiTs on edge devices, limiting their development and application. Serving as an efficient model compression technique, post-training quantization (PTQ) can reduce memory consumption and speed up inference, with inevitable performance degradation. To alleviate the degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. To be specific, CLQ consists of three key designs. First, we observe that the calibration data used by most PTQ methods cannot faithfully represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages a block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS) to search. We evaluate CLQ with both image generation and video generation models and successfully compress the model into W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at https://github.com/Kai-Liu001/CLQ.
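
The idea behind Hadamard-based outlier smoothing can be sketched in a few lines: rotating an activation vector by a normalized Hadamard matrix spreads a single large channel across all channels, and rotating the weights the same way leaves the layer output unchanged. This is a generic illustration, not CLQ's OBS module: it applies one full Hadamard matrix rather than a block one, omits the outlier scoring, and the helper names are assumptions.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-v for v in r] for r in rows]
    return rows


def rotate(vec):
    """Multiply vec by the orthonormal Hadamard matrix H / sqrt(n)."""
    n = len(vec)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h * v for h, v in zip(row, vec)) for row in hadamard(n)]


# One channel carries an outlier that would dominate a 4-bit quantization range.
activations = [100.0, 1.0, -1.0, 0.5, 0.0, 2.0, -0.5, 1.0]
weights = [0.2, -0.1, 0.4, 0.3, -0.2, 0.1, 0.05, -0.3]

smoothed = rotate(activations)
rotated_weights = rotate(weights)

# The rotation is orthogonal, so the dot product (the layer output) is preserved.
original_output = sum(a * w for a, w in zip(activations, weights))
rotated_output = sum(a * w for a, w in zip(smoothed, rotated_weights))
```

After rotation the largest magnitude in `smoothed` is well below the original outlier of 100, so a uniform quantizer wastes far less of its range on a single channel.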

[192] A Data-Centric Perspective on the Influence of Image Data Quality in Machine Learning Models cs.CV | cs.AI | eess.IVPDF

Pei-Han Chen, Szu-Chi Chung

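TL;DR: This paper examines how image dataset quality affects machine learning model performance, proposes a pipeline integrating the CleanVision and Fastdup tools, and introduces enhancements such as automatic threshold selection. Experiments show that quality issues differ in their impact: CNNs are robust to certain distortions but especially sensitive to blurring and severe downscaling. The enhanced methods markedly improve F1 scores for low-quality image detection.

Details

Motivation: Machine learning research has traditionally focused on model development and neglected the effect of training data quality. As model architectures mature, data quality becomes a critical factor. This paper aims to fill the gap in systematic studies of dataset quality assessment in the image domain.

Result: Automatic thresholding improves the F1 score from 0.6794 to 0.9468 under single perturbations and from 0.7447 to 0.8557 under dual perturbations; the deduplication strategy raises the F1 score for near-duplicate detection from 0.4576 to 0.7928.

Insight: 1. Not all quality issues affect models equally; CNNs are especially sensitive to blurring and severe downscaling; 2. automatic threshold selection and tool integration effectively improve data quality assessment; 3. data quality should be an important direction in machine learning research.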

Abstract: In machine learning, research has traditionally focused on model development, with relatively less attention paid to training data. As model architectures have matured and marginal gains from further refinements diminish, data quality has emerged as a critical factor. However, systematic studies on evaluating and ensuring dataset quality in the image domain remain limited. This study investigates methods for systematically assessing image dataset quality and examines how various image quality factors influence model performance. Using the publicly available and relatively clean CIFAKE dataset, we identify common quality issues and quantify their impact on training. Building on these findings, we develop a pipeline that integrates two community-developed tools, CleanVision and Fastdup. We analyze their underlying mechanisms and introduce several enhancements, including automatic threshold selection to detect problematic images without manual tuning. Experimental results demonstrate that not all quality issues exert the same level of impact. While convolutional neural networks show resilience to certain distortions, they are particularly vulnerable to degradations that obscure critical visual features, such as blurring and severe downscaling. To assess the performance of existing tools and the effectiveness of our proposed enhancements, we formulate the detection of low-quality images as a binary classification task and use the F1 score as the evaluation metric. Our automatic thresholding method improves the F1 score from 0.6794 to 0.9468 under single perturbations and from 0.7447 to 0.8557 under dual perturbations. For near-duplicate detection, our deduplication strategy increases the F1 score from 0.4576 to 0.7928. These results underscore the effectiveness of our workflow and provide a foundation for advancing data quality assessment in image-based machine learning.
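
The evaluation setup described above, flagging low-quality images and scoring the flags with F1, can be sketched as follows. The quality scores, the fixed threshold, and both function names are hypothetical; the paper's automatic threshold selection is not reproduced here.

```python
def flag_low_quality(quality_scores, threshold):
    """Flag an image as low quality when its score falls below the threshold."""
    return [score < threshold for score in quality_scores]


def f1_score(labels, predictions):
    """F1 = 2 * precision * recall / (precision + recall) for binary labels.

    labels, predictions: equal-length sequences of truthy/falsy values,
    where a truthy value means "low-quality image".
    """
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    false_pos = sum(1 for y, p in pairs if not y and p)
    false_neg = sum(1 for y, p in pairs if y and not p)
    if true_pos == 0:
        return 0.0  # no correct detections; avoids division by zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Because F1 depends on both precision and recall, moving the threshold trades missed low-quality images against false alarms, which is why threshold selection has such a large effect on the reported scores.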

Runmin Zhang, Jialiang Wang, Si-Yuan Cao, Zhu Yu, Junchen Yu

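TL;DR: DCFlow is a novel unsupervised cross-modal flow estimation framework that significantly improves flow prediction accuracy through a decoupled optimization strategy and a cross-modal consistency constraint.

Details

Motivation: Previous methods learn flow estimation only implicitly from appearance similarity and struggle with modality discrepancy and geometric misalignment. This work addresses both challenges through decoupled optimization and a consistency constraint.

Result: DCFlow performs strongly among unsupervised methods and is compatible with various flow estimation networks.

Insight: Explicitly handling modality discrepancy and geometric misalignment, combined with an unsupervised learning strategy, can significantly improve cross-modal flow estimation.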

Abstract: This work presents DCFlow, a novel unsupervised cross-modal flow estimation framework that integrates a decoupled optimization strategy and a cross-modal consistency constraint. Unlike previous approaches that implicitly learn flow estimation solely from appearance similarity, we introduce a decoupled optimization strategy with task-specific supervision to address modality discrepancy and geometric misalignment distinctly. This is achieved by collaboratively training a modality transfer network and a flow estimation network. To enable reliable motion supervision without ground-truth flow, we propose a geometry-aware data synthesis pipeline combined with an outlier-robust loss. Additionally, we introduce a cross-modal consistency constraint to jointly optimize both networks, significantly improving flow prediction accuracy. For evaluation, we construct a comprehensive cross-modal flow benchmark by repurposing public datasets. Experimental results demonstrate that DCFlow can be integrated with various flow estimation networks and achieves state-of-the-art performance among unsupervised approaches.

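[194] UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark cs.CV

Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu

TL;DR: The paper proposes UI2V-Bench, an image-to-video (I2V) generation benchmark focused on semantic understanding and reasoning, which addresses the lack of semantic and commonsense evaluation in existing benchmarks.

Details

Motivation: Existing I2V benchmarks mainly assess video quality and temporal consistency, overlooking whether a model understands the semantics of the input image and whether the generated video obeys physical laws and human commonsense.

Result: UI2V-Bench evaluates multiple open-source and closed-source I2V models, and the proposed MLLM-based metrics align closely with human evaluation.

Insight: Semantic understanding and reasoning are an important direction for I2V evaluation; UI2V-Bench provides a framework and data to support future research.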

Abstract: Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model’s ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open-source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

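[195] Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA cs.CV | cs.CL

Jianxin Liang, Tan Yue, Yuxuan Wang, Yueqian Wang, Zhihan Yin

TL;DR: This paper proposes a framework for synthesizing richer supervision for Video Question Answering (VideoQA) models. Two strategies, QBP and QBC, generate narrative paragraphs and visual rationales, significantly improving model performance and generalization.

Details

Motivation: The supervision used for current VideoQA models (isolated factual question-answer pairs) limits their understanding of video content and fails to capture the narrative and causal structure of events.

Result: New state-of-the-art results on STAR and NExT-QA, for example improving a 3B model to 72.5% on STAR (+4.9%). QBP also accelerates model convergence by over 2.5x.

Insight: Shifting synthetic data from isolated facts to narrative coherence and grounded visual rationales improves accuracy, efficiency, and generalization.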

Abstract: The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This “bag-of-facts” approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video’s existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video’s event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5% on STAR (+4.9%) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.

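[196] Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks cs.CV | cs.AI | cs.CL | cs.LG

Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu

TL;DR: The paper proposes strengthening the spatial perception and reasoning of vision-language models through geometric surrogate tasks. It builds Euclid30K, a dataset of about 30K geometry problems, and fine-tunes models with GRPO. Experiments show substantial zero-shot gains on several spatial reasoning benchmarks.

Details

Motivation: Current multimodal large language models fall short in spatial intelligence (such as transforming shapes, rotating objects, and judging positional relations). The paper aims to improve spatial reasoning by using geometry problem-solving as a surrogate task.

Result: The fine-tuned models achieve substantial zero-shot gains on several spatial reasoning benchmarks (such as Super-CLEVR and VSI-Bench). RoboBrain2.0-Euclid-7B reaches 49.6% accuracy, surpassing the previous best model.

Insight: Geometry problems can serve as an effective surrogate task that significantly improves the spatial reasoning of vision-language models, and this ability transfers broadly to other spatial tasks.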

Abstract: Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable the model to acquire and apply Euclidean principles from these geometry problems, we employed Group Relative Policy Optimization (GRPO) to finetune the Qwen2.5VL family and RoboBrain2.0 family, inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found at https://zgca-ai4edu.github.io/Euclids_Gift.

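[197] Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks cs.CV

Hangil Park, Yongmin Seo, Tae-Kyun Kim

TL;DR: The paper proposes a dual-model ensemble based on knowledge distillation for multi-class anomaly detection. It combines two heterogeneous student models, one targeting small defects in industrial inspection and one targeting semantic anomalies, to generalize across domains.

Details

Motivation: Existing anomaly detection methods tend to favor either industrial inspection or semantic anomaly detection, perform inconsistently across multi-class and cross-domain tasks, and lack generality. This paper addresses the problem with a dual-model ensemble trained by knowledge distillation.

Result: State-of-the-art performance on eight public benchmarks (including MVTec-AD and CIFAR-10), with image-level AUROC of 99.7% on MVTec-AD and 97.8% on CIFAR-10, outperforming prior general models and some specialist models.

Insight: 1. A dual-model ensemble effectively combines expertise for different tasks; 2. A shared pre-trained encoder helps features generalize across domains; 3. The Noisy-OR objective balances the optimization of the two tasks.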

Abstract: Anomaly detection (AD) plays an important role in various real-world applications. Recent advancements in AD, however, are often biased towards industrial inspection and struggle to generalize to broader tasks like semantic anomaly detection, and vice versa. Although recent methods have attempted to address general anomaly detection, their performance remains sensitive to dataset-specific settings and single-class tasks. In this paper, we propose a novel dual-model ensemble approach based on knowledge distillation (KD) to bridge this gap. Our framework consists of a teacher and two student models: an Encoder-Decoder model, specialized in detecting patch-level minor defects for industrial AD, and an Encoder-Encoder model, optimized for semantic AD. Both models leverage a shared pre-trained encoder (DINOv2) to extract high-quality feature representations. The dual models are jointly learned using the Noisy-OR objective, and the final anomaly score is obtained using the joint probability via local and semantic anomaly scores derived from the respective models. We evaluate our method on eight public benchmarks under both single-class and multi-class settings: MVTec-AD, MVTec-LOCO, VisA and Real-IAD for industrial inspection and CIFAR-10/100, FMNIST and View for semantic anomaly detection. The proposed method achieved state-of-the-art accuracies in both domains, in multi-class as well as single-class settings, demonstrating generalization across multiple domains of anomaly detection. Our model achieved an image-level AUROC of 99.7% on MVTec-AD and 97.8% on CIFAR-10, which is significantly better than the prior general AD models in multi-class settings and even higher than the best specialist models on individual benchmarks.
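
One common reading of a Noisy-OR combination is that an image counts as anomalous if at least one of several independent detectors fires. The sketch below assumes the local and semantic student scores are already probabilities in [0, 1]; the function name and example values are illustrative, and the paper's exact formulation may differ.

```python
def noisy_or(probabilities):
    """Probability that at least one independent detector fires."""
    remaining = 1.0  # probability that no detector has fired so far
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each score must be a probability in [0, 1]")
        remaining *= 1.0 - p
    return 1.0 - remaining


# Hypothetical scores: a strong local (defect) signal and a weak semantic one.
local_score = 0.9
semantic_score = 0.1
print(noisy_or([local_score, semantic_score]))
```

With this form the combined score is never lower than either input, so a confident signal from one student is not washed out by an unconfident signal from the other.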

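[198] LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation cs.CV | cs.AI

Heechang Kim, Gwanghyun Kim, Se Young Chun

TL;DR: LaMoGen proposes a text-to-motion generation method based on Laban movement analysis that integrates the Laban Effort and Shape components for fine-grained, interpretable motion control.

Details

Motivation: Existing text-to-motion models lack fine-grained control over motion style, and quantitative characteristics are hard to express in natural language. Laban movement analysis is widely used by dance experts to describe motion details, so the authors integrate it into generative models to improve controllability.

Result: Experiments show that LaMoGen generates diverse expressive motions and successfully manipulates motion attributes according to target Laban tags.

Insight: Integrating quantification methods from the dance domain (such as Laban analysis) into generative models can significantly improve fine-grained control and interpretability in motion generation.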

Abstract: Diverse human motion generation is an increasingly important task, having various applications in computer vision, human-computer interaction and animation. While text-to-motion synthesis using diffusion models has shown success in generating high-quality motions, achieving fine-grained expressive motion control remains a significant challenge. This is due to the lack of motion style diversity in datasets and the difficulty of expressing quantitative characteristics in natural language. Laban movement analysis has been widely used by dance experts to express the details of motion, including motion quality, as consistently as possible. Inspired by that, this work aims for interpretable and expressive control of human motion generation by seamlessly integrating the quantification methods of Laban Effort and Shape components into the text-guided motion generation models. Our proposed zero-shot, inference-time optimization method guides the motion generation model to have desired Laban Effort and Shape components without any additional motion data by updating the text embedding of pretrained diffusion models during the sampling step. We demonstrate that our approach yields diverse expressive motion qualities while preserving motion identity by successfully manipulating motion attributes according to target Laban tags.

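[199] NeMo: Needle in a Montage for Video-Language Understanding cs.CV | cs.CL

Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao

TL;DR: The paper proposes the NeMo task and the NeMoBench benchmark to evaluate the long-context recall and temporal grounding of video large language models (VideoLLMs), using an automated data generation pipeline to produce high-quality QA pairs.

Details

Motivation: Evaluation of complex temporal reasoning in video-language models lacks effective protocols and benchmarks, and new methods are needed to fill this gap.

Result: Experiments show that the automated pipeline generates data reliably and that NeMoBench effectively reveals model capabilities and limitations.

Insight: Long-video processing and temporal understanding remain major challenges for VideoLLMs.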

Abstract: Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs’ critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.

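[200] Performance-Efficiency Trade-off for Fashion Image Retrieval cs.CV

Julio Hurtado, Haoran Ni, Duygu Sap, Connor Mattinson, Martin Lotz

TL;DR: This paper proposes a selective representation framework for the performance-efficiency trade-off in fashion image retrieval. Clustering and coreset selection shrink the database while preserving retrieval accuracy.

Details

Motivation: The fashion industry's environmental impact has increased interest in the second-hand market. This paper aims to improve the scalability of fashion image retrieval with machine learning methods.

Result: Validated on three public datasets: databases shrink to 10% of their original size while retrieval accuracy is maintained. Outlier removal further improves performance.

Insight: The performance-efficiency trade-off can be managed through strategic sample selection and outlier removal, offering a practical solution for large-scale fashion image retrieval.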

Abstract: The fashion industry has been identified as a major contributor to waste and emissions, leading to an increased interest in promoting the second-hand market. Machine learning methods play an important role in facilitating the creation and expansion of second-hand marketplaces by enabling the large-scale valuation of used garments. We contribute to this line of work by addressing the scalability of second-hand image retrieval from databases. By introducing a selective representation framework, we can shrink databases to 10% of their original size without sacrificing retrieval accuracy. We first explore clustering and coreset selection methods to identify representative samples that capture the key features of each garment and its internal variability. Then, we introduce an efficient outlier removal method, based on a neighbour-homogeneity consistency score measure, that filters out uncharacteristic samples prior to selection. We evaluate our approach on three public datasets: DeepFashion Attribute, DeepFashion Con2Shop, and DeepFashion2. The results demonstrate a clear performance-efficiency trade-off by strategically pruning and selecting representative vectors of images. The retrieval system maintains near-optimal accuracy, while greatly reducing computational costs by reducing the images added to the vector database. Furthermore, applying our outlier removal method to clustering techniques yields even higher retrieval performance by removing non-discriminative samples before the selection.
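
The abstract does not say which selection method performs best, so the sketch below uses greedy k-center selection, a common coreset heuristic, purely as a stand-in for picking a small set of representative embedding vectors. The toy 2-D embeddings, the budget, and the function name are hypothetical.

```python
import math


def k_center_greedy(vectors, budget):
    """Pick `budget` indices so that every vector lies near a picked one."""
    if not 0 < budget <= len(vectors):
        raise ValueError("budget must be between 1 and len(vectors)")
    selected = [0]  # arbitrary starting representative
    # Distance from each vector to its nearest selected representative.
    nearest = [math.dist(v, vectors[0]) for v in vectors]
    while len(selected) < budget:
        # The worst-covered vector becomes the next representative.
        farthest = max(range(len(vectors)), key=lambda i: nearest[i])
        selected.append(farthest)
        for i, v in enumerate(vectors):
            nearest[i] = min(nearest[i], math.dist(v, vectors[farthest]))
    return selected


# Toy embeddings forming two tight groups; keep one representative per group.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
keep = k_center_greedy(embeddings, budget=2)
print(keep)
```

Only the vectors at the `keep` indices would be added to the vector database. Because this heuristic favors far-away points, it tends to pick outliers first, which is one reason a separate outlier-removal step before selection, as the abstract describes, is plausible.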

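[201] Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs cs.CV | cs.AI

Yuanshuai Li, Yuping Yan, Junfeng Tang, Yunxuan Li, Zeqi Zheng

TL;DR: The paper proposes SCPO, a new framework that reduces visual hallucinations in multimodal large language models through semantic curriculum preference optimization, significantly improving performance and factuality.

Details

Motivation: Multimodal large language models (MLLMs) are prone to visual hallucinations, where generated text contradicts the visual evidence. Existing methods such as DPO struggle to capture fine-grained semantic differences and tend to encourage shortcut learning.

Result: Experiments show that SCPO significantly reduces the hallucination rate on LLaVA models (by up to 62.9%) while preserving general capabilities on broad benchmarks.

Insight: By combining a semantic curriculum with symmetric optimization, SCPO not only mitigates visual hallucinations but also improves factuality and stability.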

Abstract: Multimodal Large Language Models (MLLMs) have significantly improved the performance of various tasks, but continue to suffer from visual hallucinations, a critical issue where generated responses contradict visual evidence. While Direct Preference Optimization (DPO) is widely used for alignment, its application to MLLMs often fails to capture fine-grained semantic differences and encourages shortcut learning. To address these challenges, we propose Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty. This curriculum is trained with a dynamic reference model and a novel symmetric, bidirectional objective to facilitate simultaneous learning from both textual and visual preferences. To our knowledge, SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLM alignment, effectively mitigating visual hallucinations. Extensive experiments on LLaVA models across various scales and versions validate that SCPO demonstrates superior performance compared to baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%. Moreover, evaluations on generalized benchmarks show that SCPO improves factuality while preserving general capabilities, with its performance remaining stable across general vision-language benchmarks.
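
For context, the standard DPO loss for a single preference pair can be sketched as follows. This is plain DPO only, not the symmetric, curriculum-based SCPO objective, which the abstract does not spell out; the function name, argument names, and the `beta` value are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained (logp_*) and under the frozen reference (ref_logp_*).
    """
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi/pi_ref, chosen
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref, rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss falls when the policy favors the chosen response more than the reference does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

Because the loss depends only on a sequence-level margin, two responses that differ in one small but important detail produce a weak, diffuse training signal, which is consistent with the fine-grained limitation the abstract attributes to DPO.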

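[202] Robust Multimodal Semantic Segmentation with Balanced Modality Contributions cs.CV

Jiaqi Tan, Xu Zheng, Fangyu Li, Yang Liu

TL;DR: This paper proposes EQUISeg, a framework that addresses modality imbalance in multimodal semantic segmentation by balancing modality contributions, using a Cross-modal Transformer Block and a Self-guided Module for efficient fusion and improved robustness.

Details

Motivation: Existing multimodal semantic segmentation methods degrade markedly when the dominant modality deteriorates, so modality imbalance must be addressed to improve practicality and robustness.

Result: Experiments on multiple datasets show that EQUISeg achieves significant performance gains and effectively alleviates the negative impact of modality imbalance on segmentation.

Insight: Modality balancing and a dynamic adaptive mechanism can significantly improve the robustness of multimodal models, which remain stable even when the dominant modality degrades.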

Abstract: Multimodal semantic segmentation enhances model robustness by exploiting cross-modal complementarities. However, existing methods often suffer from imbalanced modal dependencies, where overall performance degrades significantly once a dominant modality deteriorates in real-world scenarios. Thus, modality balance has become a critical challenge for practical multimodal segmentation. To address this issue, we propose EQUISeg, a multimodal segmentation framework that balances modality contributions through equal encoding of modalities. Built upon a four-stage Cross-modal Transformer Block (CMTB), EQUISeg enables efficient multimodal fusion and hierarchical selection. Furthermore, we design a Self-guided Module (SGM) that mitigates modality imbalance by introducing a mutual guidance mechanism, enabling each modality to adaptively adjust its contribution and enhance robustness under degraded conditions. Extensive experiments on multiple datasets demonstrate that EQUISeg achieves significant performance gains and effectively alleviates the adverse effects of modality imbalance in segmentation tasks.

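[203] CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models cs.CV | cs.AI | cs.LG

Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon

TL;DR: The paper proposes a new method called Consistency Mid-Training (CMT) for efficiently training consistency models and Mean Flow models, addressing the instability and high cost of conventional training.

Details

Motivation: Conventional Consistency Models (CM) and Mean Flow (MF) are unstable to train, sensitive to hyperparameters, and computationally costly. Initializing from a pre-trained diffusion model alleviates some of these issues, but the instability of converting small steps into long jumps remains.

Result: CMT achieves state-of-the-art generation on multiple datasets, such as a two-step FID of 1.97 on CIFAR-10 and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time.

Insight: The core of CMT is a trajectory-consistent initialization that avoids instability and high cost, providing a general and reliable way to train flow map models efficiently.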

Abstract: Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time, compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.

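[204] MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment cs.CV | cs.CL

Fankai Jia, Daisong Gan, Zhe Zhang, Zhaochi Wen, Chenchen Dan

TL;DR: The MMRQA framework combines multimodal large language models (MLLMs) with signal processing. Using simulated artifacts, conversion of metrics into QA pairs, and parameter-efficient fusion via LoRA, it delivers MRI quality assessment that is both high-performing and interpretable.

Details

Motivation: Traditional MRI quality assessment methods trade off quantitative metrics against semantic understanding; a solution that is both accurate and interpretable is needed.

Result: State-of-the-art performance on the MR-ART, FastMRI, and MyConnectome benchmarks, with strong zero-shot generalization.

Insight: By combining quantitative analysis with semantic reasoning, MMRQA offers an interpretable and efficient solution for quality assessment in dynamic medical settings.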

Abstract: Magnetic resonance imaging (MRI) quality assessment is crucial for clinical decision-making, yet remains challenging due to data scarcity and protocol variability. Traditional approaches face fundamental trade-offs: signal-based methods like MRIQC provide quantitative metrics but lack semantic understanding, while deep learning approaches achieve high accuracy but sacrifice interpretability. To address these limitations, we introduce the Multimodal MRI Quality Assessment (MMRQA) framework, pioneering the integration of multimodal large language models (MLLMs) with acquisition-aware signal processing. MMRQA combines three key innovations: robust metric extraction via MRQy augmented with simulated artifacts, structured transformation of metrics into question-answer pairs using Qwen, and parameter-efficient fusion through Low-Rank Adaptation (LoRA) of LLaVA-OneVision. Evaluated on MR-ART, FastMRI, and MyConnectome benchmarks, MMRQA achieves state-of-the-art performance with strong zero-shot generalization, as validated by comprehensive ablation studies. By bridging quantitative analysis with semantic reasoning, our framework generates clinically interpretable outputs that enhance quality control in dynamic medical settings.
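
The Low-Rank Adaptation (LoRA) idea mentioned in this abstract can be sketched in a few lines: a frozen weight matrix W is adjusted by a scaled low-rank product, W + (alpha / r) * B A. The pure-Python matrices, shapes, and function names below are illustrative only and say nothing about how MMRQA configures LoRA for LLaVA-OneVision.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    rank = len(lora_a)              # lora_a has shape (rank, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, rank) @ (rank, in) -> (out, in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: a 2x2 frozen weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (out=2, rank=1)
lora_a = [[0.5, 0.5]]     # shape (rank=1, in=2)
merged = lora_effective_weight(w, lora_b, lora_a, alpha=2.0)
print(merged)
```

Only `lora_a` and `lora_b` would be trained, so the number of trainable values scales with the rank rather than with the full weight matrix, which is what makes the adaptation parameter-efficient.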

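[205] CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D cs.CV | cs.AI

Mohamad Amin Mirzaei, Pantea Amoie, Ali Ekhterachian, Matin Mirzababaei

TL;DR: CORE-3D proposes a context-aware open-vocabulary 3D retrieval method that significantly improves 3D scene understanding by improving the segmentation and CLIP encoding strategies.

Details

Motivation: Existing methods for 3D semantic segmentation and retrieval rely on directly projecting 2D class-agnostic masks, which leads to fragmented segmentation and inaccurate semantic assignment and limits their effectiveness in complex environments.

Result: Experiments on multiple benchmark datasets show that the method significantly outperforms existing methods on 3D semantic segmentation and object retrieval from language queries.

Insight: Improving mask generation and semantic encoding effectively strengthens open-vocabulary retrieval for 3D scene understanding, especially in complex environments.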

Abstract: 3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, mitigating the over-segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context-aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.
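
A minimal sketch of weighted multi-view embedding fusion, assuming each contextual view of a mask has already been encoded into a vector. Which views are used and what the weights are is not stated in the abstract, so the views, weights, and function names below are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def fuse_views(view_embeddings, weights):
    """Weighted sum of unit-normalized view embeddings, re-normalized."""
    if len(view_embeddings) != len(weights):
        raise ValueError("need exactly one weight per view")
    fused = [0.0] * len(view_embeddings[0])
    for embedding, weight in zip(view_embeddings, weights):
        unit = l2_normalize(embedding)
        for i, value in enumerate(unit):
            fused[i] += weight * value
    return l2_normalize(fused)


# Hypothetical embeddings of a tight crop and a wider context crop of one mask.
views = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_views(views, weights=[0.5, 0.5])
print(fused)
```

Normalizing before and after the weighted sum keeps the fused vector comparable to text embeddings under cosine similarity, which is the usual way CLIP-style retrieval scores a language query against an object.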

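[206] Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis cs.CV

Kaizhen Zhu, Mokai Pan, Zhechuan Yu, Jingya Wang, Jingyi Yu

TL;DR: Using a unified theoretical framework based on stochastic optimal control and optimal transport, the paper gives the first joint theoretical analysis and experimental validation of Diffusion Bridge and Flow Matching. It proves that the former has a lower cost function and more stable trajectories, while the latter becomes less effective as the training data shrinks. Experiments show that each has its own strengths and weaknesses.

Details

Motivation: There is no unified explanation or theoretical account of the relative merits of Diffusion Bridge and Flow Matching, which leaves the practical choice between them unclear.

Result: Experiments show that Diffusion Bridge is more stable, while Flow Matching degrades as the training data is reduced, confirming the theoretical predictions.

Insight: Diffusion Bridge is better suited to tasks that need stable trajectories, whereas the effectiveness of Flow Matching depends on sufficient data. Each method has its own advantages in different scenarios.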

Abstract: Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients $t$ and $1-t$ of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Inpainting, Super-Resolution, Deblurring, Denoising, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models. Our code is available at https://anonymous.4open.science/r/DBFM-3E8E/.
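
The interpolation coefficients $t$ and $1-t$ mentioned in this abstract refer to the straight-line path commonly used in Flow Matching. The sketch below assumes the usual convention x_t = (1 - t) * x0 + t * x1 with a constant target velocity x1 - x0; which endpoint receives which coefficient, and the function names, are assumptions rather than details taken from the paper.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from source sample x0 to target sample x1."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity a flow-matching network would regress toward on this path."""
    return [b - a for a, b in zip(x0, x1)]


# Toy paired samples from the source and target distributions.
x0 = [0.0, 2.0]
x1 = [4.0, 2.0]
print(interpolate(x0, x1, 0.25))
print(target_velocity(x0, x1))
```

The path is fully determined by the two endpoints, with no stochastic term in between, which is one intuitive way to see why its behavior could depend heavily on how many training pairs are available.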

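[207] TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models cs.CV

Zhifang Zhang, Qiqi Tao, Jiaqi Lv, Na Zhao, Lei Feng

TL;DR: This paper proposes TokenSwap, a stealthy backdoor attack on the compositional understanding of large vision-language models (LVLMs). Unlike conventional attacks with fixed target patterns, TokenSwap works by disrupting the understanding of object relationships, which makes it hard to detect.

Details

Motivation: Existing backdoor attacks on LVLMs usually force the model to generate a fixed target pattern, which is easy to detect. TokenSwap aims to overcome this by attacking compositional understanding in a stealthier way.

Result: TokenSwap achieves high attack success rates across multiple benchmarks and LVLM architectures while remaining stealthy and hard to detect.

Insight: Attacks on compositional understanding are stealthier; an adaptive loss function effectively strengthens the attack.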

Abstract: Large vision-language models (LVLMs) have achieved impressive performance across a wide range of vision-language tasks, while they remain vulnerable to backdoor attacks. Existing backdoor attacks on LVLMs aim to force the victim model to generate a predefined target pattern, which is either inserted into or replaces the original content. We find that these fixed-pattern attacks are relatively easy to detect, because the attacked LVLM tends to memorize such frequent patterns in the training dataset, thereby exhibiting overconfidence on these targets given poisoned inputs. To address these limitations, we introduce TokenSwap, a more evasive and stealthy backdoor attack that focuses on the compositional understanding capabilities of LVLMs. Instead of enforcing a fixed targeted content, TokenSwap subtly disrupts the understanding of object relationships in text. Specifically, it causes the backdoored model to generate outputs that mention the correct objects in the image but misrepresent their relationships (i.e., bags-of-words behavior). During training, TokenSwap injects a visual trigger into selected samples and simultaneously swaps the grammatical roles of key tokens in the corresponding textual answers. However, the poisoned samples exhibit only subtle differences from the original ones, making it challenging for the model to learn the backdoor behavior. To address this, TokenSwap employs an adaptive token-weighted loss that explicitly emphasizes the learning of swapped tokens, such that the visual triggers and bags-of-words behavior are associated. Extensive experiments demonstrate that TokenSwap achieves high attack success rates while maintaining superior evasiveness and stealthiness across multiple benchmarks and various LVLM architectures.

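[208] SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics cs.CV | cs.RO

Peter Hönig, Stefan Thalhammer, Jean-Baptiste Weibel, Matthias Hirschmanner, Markus Vincze

TL;DR: SCOPE uses a diffusion model with DINOv2 features as continuous semantic priors to tackle category-level object pose estimation in the Sim2Real setting. It outperforms existing methods by a relative 31.9% and generalizes to objects from unknown categories.

Details

Motivation: Robots in open environments must estimate the pose of unknown objects, but existing methods rely on discrete category labels, which limits generalization. SCOPE addresses this with continuous semantic priors and a diffusion model.

Result: 1. On category-level pose estimation, a 31.9% relative improvement over the state of the art (5°5cm metric); 2. A grasp success rate of up to 100% on objects from unknown categories.

Insight: Combining continuous semantic priors with a diffusion model can significantly improve the generalization of category-level pose estimation, especially in the Sim2Real setting. The representational power of DINOv2 features is key.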

Abstract: Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects, which requires semantic understanding in order to generalize both to known categories and beyond. To resolve this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9% on the 5$^\circ$5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100%. Code available: https://github.com/hoenigpeter/scope.
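
The 5°5cm metric counts a pose estimate as correct when its rotation error is under 5 degrees and its translation error is under 5 cm. The sketch below is a minimal version of that check, assuming rotations are 3x3 matrices, translations are in metres, and object symmetries are ignored; the helper names are illustrative.

```python
import math


def rot_z(degrees):
    """Rotation matrix for a rotation about the z-axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_pred^T R_gt) equals the elementwise product summed over all entries.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between translations given in metres, in cm."""
    return 100.0 * math.dist(t_pred, t_gt)


def is_correct_5deg_5cm(r_pred, t_pred, r_gt, t_gt):
    """True when both error thresholds are met."""
    return (rotation_error_deg(r_pred, r_gt) < 5.0
            and translation_error_cm(t_pred, t_gt) < 5.0)


identity = rot_z(0.0)
origin = (0.0, 0.0, 0.0)
print(is_correct_5deg_5cm(rot_z(3.0), (0.01, 0.0, 0.0), identity, origin))
print(is_correct_5deg_5cm(rot_z(10.0), (0.01, 0.0, 0.0), identity, origin))
```

The reported score is typically an aggregate of this per-instance check over a test set, and practical evaluation code usually adds handling for symmetric objects, which this sketch omits.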

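[209] BFSM: 3D Bidirectional Face-Skull Morphable Model cs.CV

Zidu Wang, Meng Xu, Miao Xu, Hengyuan Ma, Jiankuo Zhao

TL;DR: BFSM proposes a joint face-skull 3D morphable model that addresses data scarcity and insufficient registration accuracy, supporting shape inference between face and skull as well as clinical applications.

Details

Motivation: A joint face-skull model has potential in remote diagnostics, surgical planning, and related areas, but data scarcity and low registration accuracy have limited progress.

Result: Experiments confirm the robustness and accuracy of the method and demonstrate BFSM's potential in 3D reconstruction and surgical planning.

Insight: BFSM is the first to bring face and skull shape inference into a unified model while also modeling tissue thickness variation, improving inclusivity and practicality.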

Abstract: Building a joint face-skull morphable model holds great potential for applications such as remote diagnostics, surgical planning, medical education, and physically based facial simulation. However, realizing this vision is constrained by the scarcity of paired face-skull data, insufficient registration accuracy, and limited exploration of reconstruction and clinical applications. Moreover, individuals with craniofacial deformities are often overlooked, resulting in underrepresentation and limited inclusivity. To address these challenges, we first construct a dataset comprising over 200 samples, including both normal cases and rare craniofacial conditions. Each case contains a CT-based skull, a CT-based face, and a high-fidelity textured face scan. Secondly, we propose a novel dense ray matching registration method that ensures topological consistency across face, skull, and their tissue correspondences. Based on this, we introduce the 3D Bidirectional Face-Skull Morphable Model (BFSM), which enables shape inference between the face and skull through a shared coefficient space, while also modeling tissue thickness variation to support one-to-many facial reconstructions from the same skull, reflecting individual changes such as fat over time. Finally, we demonstrate the potential of BFSM in medical applications, including 3D face-skull reconstruction from a single image and surgical planning prediction. Extensive experiments confirm the robustness and accuracy of our method. BFSM is available at https://github.com/wang-zidu/BFSM

[210] Biomechanical-phase based Temporal Segmentation in Sports Videos: a Demonstration on Javelin-Throw cs.CVPDF

Bikash Kumar Badatya, Vipul Baghel, Jyotirmoy Amin, Ravi Hegde

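TL;DR: The paper proposes an unsupervised framework that combines structured optimal transport (SOT) with the Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) for temporal segmentation of elite javelin-throw videos, substantially outperforming existing methods.

Details

Motivation: Traditional motion analysis relies on manual annotation or laboratory instrumentation, which is time-consuming and does not scale, so an automated solution is needed.

Result: The method reaches 71.02% mAP and a 74.61% F1-score on test data, substantially higher than the baselines.

Insight: Structured optimal transport (SOT) can strengthen a spatio-temporal model's motion segmentation ability, and pairing it with an unsupervised approach reduces dependence on labeled data.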

Abstract: Precise analysis of athletic motion is central to sports analytics, particularly in disciplines where nuanced biomechanical phases directly impact performance outcomes. Traditional analytics techniques rely on manual annotation or laboratory-based instrumentation, which are time-consuming, costly, and lack scalability. Automatic extraction of relevant kinetic variables requires a robust and contextually appropriate temporal segmentation. Considering the specific case of elite javelin-throw, we present a novel unsupervised framework for such a contextually aware segmentation, which applies the structured optimal transport (SOT) concept to augment the well-known Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN). This enables the identification of motion phase transitions without requiring expensive manual labeling. Extensive experiments demonstrate that our approach outperforms state-of-the-art unsupervised methods, achieving 71.02% mean average precision (mAP) and 74.61% F1-score on test data, substantially higher than competing baselines. We also release a new dataset of 211 manually annotated professional javelin-throw videos with frame-level annotations, covering key biomechanical phases: approach steps, drive, throw, and recovery.

[211] FreeRet: MLLMs as Training-Free Retrievers cs.CVPDF

Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Yicheng Xu

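TL;DR: FreeRet is a training-free, general-purpose retrieval framework that turns existing multimodal large language models (MLLMs) directly into two-stage retrievers, using their semantic embeddings for fast retrieval and their reasoning ability for reranking.

Details

Motivation: Using MLLMs as retrievers usually requires additional training; FreeRet explores using pretrained MLLMs directly for mixed-modality retrieval without any training.

Result: On the MMEB and MMEB-V2 benchmarks (46 datasets), FreeRet substantially outperforms models trained on millions of pairs.

Insight: With careful design, pretrained MLLMs can serve as strong retrieval engines without additional training, closing a gap in their role as generalist models.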

Abstract: Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
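
The two-stage pattern described in the abstract (fast candidate search over embeddings, then a more expensive reranker applied only to the shortlist) can be illustrated with a minimal sketch. It is not FreeRet's implementation: the embeddings are toy vectors, and `two_stage_retrieve` and `toy_rerank` are hypothetical names; in FreeRet both stages would be driven by the same MLLM.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_retrieve(query_emb, corpus_embs, rerank_score, top_k=2):
    """Stage 1: keep the top_k documents by embedding similarity.
    Stage 2: reorder only those candidates with a costlier scorer."""
    coarse = sorted(range(len(corpus_embs)),
                    key=lambda i: cosine(query_emb, corpus_embs[i]),
                    reverse=True)[:top_k]
    return sorted(coarse, key=rerank_score, reverse=True)

# Toy data (assumed): three documents embedded in 2D.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.05]
# Hypothetical reranker scores standing in for model-based reranking.
toy_rerank = {0: 0.2, 1: 0.9, 2: 0.5}.get
result = two_stage_retrieve(query, corpus, toy_rerank, top_k=2)
```

Document 2 never reaches the reranker because it is filtered out in stage 1, which is what keeps the expensive stage cheap.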

[212] Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs cs.CV | cs.AIPDF

Mohamad Ballout, Okajevo Wilfred, Seyedalireza Yaghoubi, Nohayr Muhammad Abdelmoneim, Julius Mayer

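TL;DR: SPLICE is a human-curated benchmark built on the COIN dataset that evaluates vision-language models (VLMs) on event reasoning across multiple dimensions (temporal, causal, spatial, contextual, and general knowledge). VLMs perform far worse than humans at rearranging event clips and rely more on language priors than on visual understanding.

Details

Motivation: VLMs lag well behind humans on visual reasoning tasks, and multi-dimensional event reasoning in particular has not been adequately evaluated. SPLICE fills this gap with a human-curated dataset.

Result: VLMs score significantly below humans on the rearrangement task, especially on videos dominated by contextual and spatial reasoning; human-annotated textual descriptions improve model performance but do not affect human performance.

Insight: VLMs rely more on language priors than on visual understanding; they do relatively better where temporal and causal reasoning dominate but remain limited on more complex reasoning.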

Abstract: In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.

[213] RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement cs.CVPDF

Zhu, Libo, Zhou, Zihan, Liu

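TL;DR: This paper presents RIFLE, a diffusion-based framework that removes flicker-banding (FB) from images while preserving detail, together with a pipeline for simulating FB and a paired real-world FB dataset.

Details

Motivation: Flicker-banding (FB) severely degrades the quality and readability of photographs of screens, yet it has received little study, so a dedicated formulation and method are needed.

Result: On the real-world dataset, RIFLE outperforms recent image reconstruction baselines from mild to severe FB.

Insight: (1) FB removal benefits from a dedicated prior and localized supervision; (2) data simulation is key to overcoming data scarcity; (3) diffusion models suit this kind of detail-preserving task.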

Abstract: Capturing screens is now routine in our everyday lives. But the photographs of emissive displays are often influenced by the flicker-banding (FB), which is alternating bright–dark stripes that arise from temporal aliasing between a camera’s rolling-shutter readout and the display’s brightness modulation. Unlike moire degradation, which has been extensively studied, the FB remains underexplored despite its frequent and severe impact on readability and perceived quality. We formulate FB removal as a dedicated restoration task and introduce Removal of Image Flicker-Banding via Latent Diffusion Enhancement, RIFLE, a diffusion-based framework designed to remove FB while preserving fine details. We propose the flicker-banding prior estimator (FPE) that predicts key banding attributes and injects it into the restoration network. Additionally, Masked Loss (ML) is proposed to concentrate supervision on banded regions without sacrificing global fidelity. To overcome data scarcity, we provide a simulation pipeline that synthesizes FB in the luminance domain with stochastic jitter in banding angle, banding spacing, and banding width. Feathered boundaries and sensor noise are also applied for a more realistic simulation. For evaluation, we collect a paired real-world FB dataset with pixel-aligned banding-free references captured via long exposure. Across quantitative metrics and visual comparisons on our real-world dataset, RIFLE consistently outperforms recent image reconstruction baselines from mild to severe flicker-banding. To the best of our knowledge, it is the first work to research the simulation and removal of FB. Our work establishes a great foundation for subsequent research in both the dataset construction and the removal model design. Our dataset and code will be released soon.

[214] Learning Object-Centric Representations Based on Slots in Real World Scenarios cs.CVPDF

Adil Kaan Akan

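TL;DR: This thesis proposes a slot-based approach to object-centric representation learning that adapts pretrained diffusion models for fine-grained control and generation of objects in images and videos.

Details

Motivation: Current diffusion models typically treat images holistically and rely on text conditioning, which makes object-level editing difficult. The work aims to close this gap by balancing global scene coherence with disentangled object control.

Result: The method performs strongly on object discovery, segmentation, and compositional editing, and supports unsupervised video object segmentation and reconstruction as well as advanced edits such as object removal, replacement, and insertion.

Insight: Slot-based conditioning retains the pretrained model's generative capacity while adding object-level controllability, suggesting a new direction for the design of structured generative tools.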

Abstract: A central goal in AI is to represent scenes as compositions of discrete objects, enabling fine-grained, controllable image and video generation. Yet leading diffusion models treat images holistically and rely on text conditioning, creating a mismatch for object-level editing. This thesis introduces a framework that adapts powerful pretrained diffusion models for object-centric synthesis while retaining their generative capacity. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot-based conditioning into pretrained models, preserving their visual priors while providing object-specific manipulation. For images, SlotAdapt augments diffusion models with a register token for background/style and slot-conditioned modules for objects, reducing text-conditioning bias and achieving state-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation. We further extend the framework to video. Using Invariant Slot Attention (ISA) to separate object identity from pose and a Transformer-based temporal aggregator, our approach maintains consistent object representations and dynamics across frames. This yields new benchmarks in unsupervised video object segmentation and reconstruction, and supports advanced editing tasks such as object removal, replacement, and insertion without explicit supervision. Overall, this work establishes a general and scalable approach to object-centric generative modeling for images and videos. By bridging human object-based perception and machine learning, it expands the design space for interactive, structured, and user-driven generative tools in creative, scientific, and practical domains.

[215] TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models cs.CV | cs.CLPDF

Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu

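TL;DR: TemMed-Bench is the first benchmark for evaluating reasoning over medical images that track a patient's changes over time in clinical practice. It shows that current large vision-language models (LVLMs) reason poorly along the temporal dimension and points to multi-modal retrieval augmentation as a potential improvement.

Details

Motivation: Existing medical reasoning benchmarks only cover image analysis from a single visit, which does not match clinical practice, where doctors consult a patient's history. A new benchmark is therefore needed to evaluate LVLMs along the temporal dimension.

Result: Most LVLMs perform poorly on temporal medical image reasoning, with some close to random guessing; multi-modal retrieval augmentation improves VQA performance by 2.59% on average.

Insight: Multi-modal retrieval augmentation is a promising direction for improving LVLMs on temporal medical image reasoning, and the limitations of current models show that the area needs further research.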

Abstract: Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient’s condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient’s historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients’ conditions between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question-answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we conduct an evaluation of six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze patients’ condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. We also show that multi-modal retrieval augmentation yields notably higher performance gains than no retrieval and textual retrieval alone across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded on real-world clinical practice, and it reveals LVLMs’ limitations in temporal medical image reasoning, as well as highlighting the use of multi-modal retrieval augmentation as a potentially promising direction worth exploring to address this challenge.

[216] GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts cs.CV | cs.AI | cs.CLPDF

Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng

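TL;DR: GSM8K-V is a multi-image visual mathematical reasoning benchmark that converts text-based math problems into visual form, revealing the shortcomings of current vision-language models on visual math problems.

Details

Motivation: Existing visual mathematical reasoning benchmarks focus mainly on geometry and lack evaluation of math word problems in multi-image settings. GSM8K-V fills this gap.

Result: Existing models perform far worse on GSM8K-V than on text-based GSM8K; for example, Gemini-2.5-Pro reaches only 46.93% accuracy on GSM8K-V versus 95.22% on GSM8K.

Insight: Visual mathematical reasoning places higher demands on visual understanding and contextual reasoning, and more robust and generalizable vision-language models are needed.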

Abstract: Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example. It highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Recently, numerous visual mathematical reasoning benchmarks have been proposed, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.

[217] SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer cs.CV | cs.AIPDF

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen

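TL;DR: SANA-Video is an efficient small diffusion model that generates videos at up to 720x1280 resolution and up to a minute long, with strong text-video alignment, fast generation, and low training cost.

Details

Motivation: Current video generation models are typically computationally expensive and inefficient, especially for high-resolution long videos. SANA-Video aims for low-cost, efficient generation by optimizing attention and memory management.

Result: On an RTX 5090 GPU with NVFP4 precision, generating a 5-second 720p video drops from 71s to 29s (a 2.4x speedup), and training costs only 1% of MovieGen's.

Insight: Combining linear attention with a block-wise, constant-memory state is key to efficient long-video generation, and the low training cost makes the approach more practical.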

Abstract: We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
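
The "cumulative properties of linear attention" mentioned in the abstract can be made concrete with a small sketch. It is a generic causal linear attention in pure Python, not SANA-Video's block-wise implementation: the feature map `phi` (elu(x) + 1) and the per-token update granularity are assumptions chosen for brevity. The point is that the running state has a fixed size regardless of how many tokens have been processed.

```python
import math

def phi(x):
    # Positive feature map elu(x) + 1; an assumed choice, not necessarily SANA-Video's.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

class LinearAttentionState:
    """Fixed-size running sums: S = sum_i phi(k_i) v_i^T and z = sum_i phi(k_i)."""
    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def update(self, k, v):
        for a, fa in enumerate(phi(k)):
            self.z[a] += fa
            for b, vb in enumerate(v):
                self.S[a][b] += fa * vb

    def read(self, q):
        fq = phi(q)
        denom = sum(fa * za for fa, za in zip(fq, self.z))
        return [sum(fq[a] * self.S[a][b] for a in range(len(fq))) / denom
                for b in range(len(self.S[0]))]

def causal_linear_attention(qs, ks, vs):
    """Each output attends to all tokens so far, using only the running state."""
    state = LinearAttentionState(len(ks[0]), len(vs[0]))
    outs = []
    for q, k, v in zip(qs, ks, vs):
        state.update(k, v)
        outs.append(state.read(q))
    return outs

def reference(qs, ks, vs):
    """Quadratic-cost check: explicit normalized weights over all past tokens."""
    outs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(ks[i]))) for i in range(t + 1)]
        total = sum(w)
        outs.append([sum(w[i] * vs[i][b] for i in range(t + 1)) / total
                     for b in range(len(vs[0]))])
    return outs

qs = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.7]]
ks = [[0.3, 0.4], [-0.6, 0.2], [0.9, -0.1]]
vs = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 3.0, 0.5]]
outs = causal_linear_attention(qs, ks, vs)
```

The state holds `d_k * d_v + d_k` numbers however long the sequence gets, whereas a conventional KV cache grows with every token.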

[218] Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility cs.CVPDF

Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, Daochang Liu

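TL;DR: This paper proposes a training-free framework that improves the physical plausibility of generated video by explicitly reasoning about physical implausibility and steering generation away from it.

Details

Motivation: Existing diffusion models learn physical reasoning implicitly from large-scale text-video datasets, which is costly and hard to scale, and they can still produce implausible motion that violates physical laws.

Result: Experiments show substantially improved physical plausibility without loss of photorealism and without additional training. Ablation studies confirm the effectiveness of both the physics-aware reasoning and SDG.

Insight: Explicitly reasoning about physical implausibility and guiding generation with the Synchronized Decoupled Guidance (SDG) strategy is an efficient, plug-and-play paradigm for physics-aware video generation.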

Abstract: Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the aforementioned two designs of SDG are also individually validated to contribute critically to the suppression of implausible content and the overall gains in physical plausibility. This establishes a new and plug-and-play physics-aware paradigm for video generation.

[219] IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video? cs.CVPDF

Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang

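TL;DR: IWR-Bench is a new benchmark that evaluates how well large vision-language models (LVLMs) reconstruct interactive webpages from user interaction videos, exposing weaknesses in current models' reasoning about temporal dynamics and event-driven logic.

Details

Motivation: Existing benchmarks focus on static screenshot-to-code tasks and overlook the dynamic interactions of real web applications; IWR-Bench is proposed to fill this gap.

Result: Across 28 LVLMs, the best model reaches an overall score of only 36.35%, with functional correctness (24.39% IFS) far below visual fidelity (64.25% VFS).

Insight: Current models are markedly weak at reasoning about temporal dynamics and event-driven logic; IWR-Bench offers a challenging frontier for vision-language research.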

Abstract: The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models’ ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available. Code is available at https://github.com/L-O-I/IWR-Bench.

[220] Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation cs.CVPDF

Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong

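TL;DR: This paper introduces a new Vietnamese-language multimodal medical dataset of 1,567,062 paired CT-PET images and 2,757 clinical reports, aimed at the limitations of medical vision-language models on functional imaging tasks and in low-resource languages such as Vietnamese. With data augmentation and expert-validated test sets, the paper shows that the dataset significantly improves existing models.

Details

Motivation: Vision-language foundation models are hard to apply to medical imaging because diverse imaging modalities and multilingual clinical data are scarce, particularly for low-resource languages such as Vietnamese.

Result: Experiments show that the dataset significantly improves existing vision-language foundation models on medical report generation and visual question answering.

Insight: Datasets tailored to a specific language and medical task are important for the clinical utility of vision-language models, especially in low-resource language settings.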

Abstract: Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and corresponding 2,757 full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs’ learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.

Xue-Feng Zhu, Tianyang Xu, Yifan Pan, Jinjie Gu, Xi Li

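TL;DR: The paper proposes a new multi-modal tracking task that uses three modalities, visible RGB, depth (D), and thermal infrared (TIR), to improve robustness in complex scenarios. The authors build the RGBDT500 dataset and propose the RDTTrack tracker, which fuses tri-modal information through an orthogonal projection constraint and prompt learning; experiments show significant gains over existing dual-modal methods.

Details

Motivation: Existing multi-modal tracking methods mostly handle two modalities (such as RGB-D or RGB-T) and are limited in complex scenarios. This work aims to improve tracking robustness by adding a third complementary modality.

Result: Experiments show that RDTTrack's tracking accuracy and robustness in complex scenarios are significantly better than those of existing dual-modal methods.

Insight: The complementarity of tri-modal signals (RGB, depth, and TIR) can significantly improve tracking performance, especially in complex environments.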

Abstract: Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. In specific, RDTTrack fuses thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios.

[222] VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding cs.CV | cs.AIPDF

Yizhuo Ding, Mingkang Chen, Zhibang Feng, Tong Xiao, Wanying Qu

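TL;DR: VTPerception-R1 improves multimodal reasoning through explicit visual and textual perception, using a two-stage framework of perception-augmented fine-tuning followed by perception-aware reinforcement learning, and significantly improves reasoning accuracy and robustness.

Details

Motivation: Multimodal large language models (MLLMs) make little explicit use of perceptual evidence when reasoning, which limits their performance.

Result: VTPerception-R1 significantly improves reasoning accuracy and robustness across multiple tasks.

Insight: Explicit perception, especially with textual cues, gives the largest gains for smaller models.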

Abstract: Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies-explicit, implicit, visual, and textual-across four multimodal benchmarks and two MLLMs. Our findings show that explicit perception, especially when paired with textual cues, consistently yields the best improvements, particularly for smaller models. Based on this insight, we propose VTPerception-R1, a unified two-stage framework that decouples perception from reasoning. Stage 1 introduces perception-augmented fine-tuning, and Stage 2 applies perception-aware reinforcement learning with novel visual, textual, and consistency rewards. Experiments demonstrate that VTPerception-R1 significantly improves reasoning accuracy and robustness across diverse tasks, offering a scalable and auditable solution for perception-grounded multimodal reasoning. Our code is available at: https://github.com/yizhuoDi/VTPerceprion-R1.

[223] LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning cs.CVPDF

Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie

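TL;DR: LOVE-R1 proposes an adaptive zoom-in mechanism driven by multi-step reasoning that dynamically adjusts frame resolution and sampling density, resolving the conflict between temporal and spatial perception in long video understanding and significantly improving performance.

Details

Motivation: Long video understanding remains challenging for current Large Video-Language Models (LVLMs) because uniform frame sampling cannot preserve both temporal clues and spatial details. LOVE-R1 aims to balance the two with an adaptive zoom-in mechanism.

Result: Across 4 common long video understanding benchmarks, LOVE-R1 outperforms the Qwen2.5-VL baseline by 3.1% on average.

Insight: (1) Dynamic resolution adjustment is an effective way to resolve the conflict between temporal and spatial perception; (2) multi-step reasoning and decoupled reinforcement finetuning significantly improve fine-grained understanding.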

Abstract: Long video understanding is still challenging for recent Large Video-Language Models (LVLMs) due to the conflict between long-form temporal understanding and detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames with an equal frame size and fixed sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames but in a small resolution. If some spatial details are needed, the model can zoom in on a clip of interest with a large frame resolution based on its reasoning until key visual information is obtained. The whole process is implemented as a multi-step reasoning process. To train the reasoning ability, we first finetune the model on our collected 38k high-quality CoT data and enhance it with decoupled reinforcement finetuning. As outcome rewards cannot provide fine-grained process supervision, we decouple multi-step reasoning into multiple single-step reasoning and optimize the internal zoom-in ability explicitly. Experiments on long video understanding benchmarks show that our model with the slow-fast adaptive frame sampling mechanism achieves a great trade-off between sampling density and frame resolutions, and LOVE-R1 outperforms our baseline Qwen2.5-VL by an average of 3.1 percentage points across 4 common long video understanding benchmarks.

[224] Vision Function Layer in Multimodal LLMs cs.CVPDF

Cheng Shi, Yizhou Yu, Sibei Yang

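TL;DR: The study finds that visual function decoding in multimodal large language models (MLLMs) is distributed across different decoder layers, which it terms Vision Function Layers (VFL). A Visual Token Swapping framework shows that the ordering of these layers is consistent with human behavior, and the proposed VFL-LoRA and VFL-select methods significantly improve training efficiency and data selection.

Details

Motivation: To explore how visual function decoding is distributed in MLLMs and how it relates to human behavior, in order to improve model efficiency and downstream task performance.

Result: (1) VFL-LoRA outperforms full-LoRA training; (2) VFL-select reaches 98% of full-data performance with 20% of the data; (3) the ordering of VFLs is shown to be consistent with human behavior.

Insight: Visual functions in MLLMs are layered and ordered in a way that is consistent with human behavior; selective fine-tuning and function-based data classification can significantly improve efficiency and performance.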

Abstract: This study identifies that visual-related functional decoding is distributed across different decoder layers in Multimodal Large Language Models (MLLMs). Typically, each function, such as counting, grounding, or OCR recognition, narrows down to two or three layers, which we define as Vision Function Layers (VFL). Additionally, the depths of different VFLs and their ordering exhibit a consistent pattern across different MLLMs, which is well-aligned with human behaviors (e.g., recognition occurs first, followed by counting, and then grounding). These findings are derived from Visual Token Swapping, our novel analytical framework that modifies targeted KV cache entries to precisely elucidate layer-specific functions during decoding. Furthermore, these insights offer substantial utility in tailoring MLLMs for real-world downstream applications. For instance, when LoRA training is selectively applied to VFLs whose functions align with the training data, VFL-LoRA not only outperforms full-LoRA but also prevents out-of-domain function forgetting. Moreover, by analyzing the performance differential on training data when particular VFLs are ablated, VFL-select automatically classifies data by function, enabling highly efficient data selection to directly bolster corresponding capabilities. Consequently, VFL-select surpasses human experts in data selection, and achieves 98% of full-data performance with only 20% of the original dataset. This study delivers deeper comprehension of MLLM visual processing, fostering the creation of more efficient, interpretable, and robust models.

[225] TACO-Net: Topological Signatures Triumph in 3D Object Classification cs.CV | cs.CG | cs.LGPDF

Anirban Ghosh, Ayan Dutta

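TL;DR: TACO-Net is a 3D object classification method that combines topological data analysis with image filtration techniques: it extracts topological features from point clouds and classifies them with a lightweight 1D CNN, reaching state-of-the-art results on synthetic and real-world datasets.

Details

Motivation: 3D object classification matters in computer vision, robotics, and autonomous driving, but high accuracy remains difficult because point clouds are unordered, irregular, and noisy.

Result: TACO-Net reaches 99.05% and 99.52% accuracy on ModelNet40 and ModelNet10 respectively, and shows strong robustness on the large-scale real-world OmniObject3D dataset and under several kinds of corrupted input.

Insight: Topological features offer a clear advantage for 3D object classification, especially with unordered and noisy point clouds, and TACO-Net's lightweight design makes practical deployment more feasible.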

Abstract: 3D object classification is a crucial problem due to its significant practical relevance in many fields, including computer vision, robotics, and autonomous driving. Although deep learning methods applied to point clouds sampled on CAD models of the objects and/or captured by LiDAR or RGBD cameras have achieved remarkable success in recent years, achieving high classification accuracy remains a challenging problem due to the unordered point clouds and their irregularity and noise. To this end, we propose a novel state-of-the-art (SOTA) 3D object classification technique that combines topological data analysis with various image filtration techniques to classify objects when they are represented using point clouds. We transform every point cloud into a voxelized binary 3D image to extract distinguishing topological features. Next, we train a lightweight one-dimensional Convolutional Neural Network (1D CNN) using the extracted feature set from the training dataset. Our framework, TACO-Net, sets a new state-of-the-art by achieving 99.05% and 99.52% accuracy on the widely used synthetic benchmarks ModelNet40 and ModelNet10, and further demonstrates its robustness on the large-scale real-world OmniObject3D dataset. When tested with ten different kinds of corrupted ModelNet40 inputs, the proposed TACO-Net demonstrates strong resiliency overall.
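
The first step in the abstract, turning a point cloud into a voxelized binary 3D image, might look like the sketch below. The grid resolution, the bounding-box normalization, and the function name `voxelize` are assumptions for illustration; the filtration and topological feature extraction that TACO-Net applies afterwards are not shown.

```python
def voxelize(points, resolution=4):
    """Map 3D points to a resolution^3 binary occupancy grid.

    Points are normalized by their axis-aligned bounding box (an assumed
    choice); a voxel is 1 if at least one point falls inside it.
    """
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    # Avoid division by zero when the cloud is flat along an axis.
    spans = [(hi - lo) or 1.0 for lo, hi in zip(mins, maxs)]
    grid = [[[0] * resolution for _ in range(resolution)]
            for _ in range(resolution)]
    for p in points:
        # Points on the upper boundary are clamped into the last voxel.
        i, j, k = (min(int((p[a] - mins[a]) / spans[a] * resolution),
                       resolution - 1) for a in range(3))
        grid[i][j][k] = 1
    return grid

# Toy cloud: two corner points and two nearby points that share a voxel.
cloud = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.5, 0.5, 0.5), (0.51, 0.5, 0.5)]
grid = voxelize(cloud, resolution=4)
occupied = sum(v for plane in grid for row in plane for v in row)
```

Because occupancy ignores point order and merges nearby points, the grid is insensitive to the ordering of the input cloud and to small positional jitter.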

[226] Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models cs.CVPDF

Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee

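TL;DR: This paper proposes a training-free token pruning framework that uses zeroth-order gradient estimation to select the visual tokens the model output is most sensitive to, substantially improving the inference efficiency of vision-language models.

Details

Motivation: Large vision-language models (VLMs) have strong multimodal reasoning ability, but their redundant visual tokens make inference expensive. Existing token pruning methods either rely on unstable attention scores or risk discarding critical regions. The authors propose a more efficient approach to address these problems.

Result: Across multiple VLMs and benchmarks, the method prunes up to 94.4% of tokens while maintaining accuracy, and achieves up to 2.30x faster end-to-end inference than the baseline.

Insight: Token sensitivity can be estimated with lightweight perturbations, without costly training or reliance on attention scores, which suggests a new direction for efficient model optimization.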

Abstract: Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space but risk dropping regions needed for accurate prediction. We propose a training-free framework built on a simple intuition: tokens with higher sensitivity are more likely to influence the model’s output, and they should also capture complementary visual cues rather than overlapping information. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the projection layer, a shallow and computationally light component of the model. This approach measures how small random perturbations affect the projection outputs, allowing us to approximate each token’s influence through lightweight forward passes without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that our method consistently outperforms prior methods, pruning up to 94.4% of tokens while maintaining accuracy and significantly improving efficiency, achieving up to 2.30x faster end-to-end inference over the baseline.
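The zeroth-order sensitivity idea can be sketched as follows. This is a minimal illustration under stated assumptions: the projection layer is replaced by a hypothetical tanh-activated linear map (`project`), sensitivity is a central finite difference along random unit directions, and the complementary-cue (diversity) criterion mentioned in the abstract is omitted. Names such as `token_sensitivity` and `prune_tokens` are illustrative, not the authors' API.

```python
import math
import random


def project(token, weight):
    """Toy projection layer: tanh of a linear map (hypothetical stand-in)."""
    return [math.tanh(sum(w * x for w, x in zip(row, token))) for row in weight]


def token_sensitivity(tokens, weight, eps=1e-2, num_dirs=8, seed=0):
    """Estimate how strongly the projection output reacts to small perturbations.

    Uses forward passes only: ||f(x + eps*d) - f(x - eps*d)|| / (2*eps),
    averaged over random unit directions d. No backpropagation is involved.
    """
    rng = random.Random(seed)
    scores = []
    for tok in tokens:
        total = 0.0
        for _ in range(num_dirs):
            d = [rng.gauss(0.0, 1.0) for _ in tok]
            norm = math.sqrt(sum(v * v for v in d)) or 1.0
            d = [v / norm for v in d]
            plus = project([x + eps * v for x, v in zip(tok, d)], weight)
            minus = project([x - eps * v for x, v in zip(tok, d)], weight)
            diff = math.sqrt(sum((p - m) ** 2 for p, m in zip(plus, minus)))
            total += diff / (2.0 * eps)
        scores.append(total / num_dirs)
    return scores


def prune_tokens(scores, keep):
    """Return the indices of the `keep` most sensitive tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


tokens = [[0.0, 0.0], [10.0, 10.0]]   # second token saturates the tanh
weight = [[1.0, 0.0], [0.0, 1.0]]
scores = token_sensitivity(tokens, weight)
kept = prune_tokens(scores, keep=1)
```

In this toy setup the saturated token barely changes the output when perturbed, so it receives a low score and is pruned first.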

[227] ELPG-DTFS: Prior-Guided Adaptive Time-Frequency Graph Neural Network for EEG Depression Diagnosis cs.CVPDF

Jingru Qiu, Jiale Liang, Xuanhan Fan, Mingda Zhang, Zhenli He

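TL;DR: ELPG-DTFS is a prior-guided adaptive time-frequency graph neural network for EEG-based depression diagnosis; by introducing channel-band attention, dynamic functional links, and neuroscience priors, it substantially improves diagnostic performance.

Details

Motivation: Depression diagnosis relies on subjective scales. EEG offers a low-cost biomarker, but existing deep models ignore dynamic functional links and prior knowledge, which limits accuracy and interpretability.

Result: Achieves 97.63% accuracy and a 97.33% F1 score on the MODMA dataset, surpassing the existing ACM-GNN method.

Insight: Combining dynamic functional links with prior knowledge substantially improves both the performance and the interpretability of EEG-based depression diagnosis.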

Abstract: Timely and objective screening of major depressive disorder (MDD) is vital, yet diagnosis still relies on subjective scales. Electroencephalography (EEG) provides a low-cost biomarker, but existing deep models treat spectra as static images, fix inter-channel graphs, and ignore prior knowledge, limiting accuracy and interpretability. We propose ELPG-DTFS, a prior-guided adaptive time-frequency graph neural network that introduces: (1) channel-band attention with cross-band mutual information, (2) a learnable adjacency matrix for dynamic functional links, and (3) a residual knowledge-graph pathway injecting neuroscience priors. On the 128-channel MODMA dataset (53 subjects), ELPG-DTFS achieves 97.63% accuracy and 97.33% F1, surpassing the 2025 state-of-the-art ACM-GNN. Ablation shows that removing any module lowers F1 by up to 4.35, confirming their complementary value. ELPG-DTFS thus offers a robust and interpretable framework for next-generation EEG-based MDD diagnostics.

[228] Vision At Night: Exploring Biologically Inspired Preprocessing For Improved Robustness Via Color And Contrast Transformations cs.CVPDF

Lorena Stracke, Lia Nimmermann, Shashank Agnihotri, Margret Keuper, Volker Blanz

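TL;DR: The paper proposes an input preprocessing method based on biological vision mechanisms; it uses Difference-of-Gaussians (DoG) filtering to enhance contrast and color-opponent channels, improving the robustness of semantic segmentation models under adverse conditions such as night, fog, and snow without hurting in-distribution performance.

Details

Motivation: Inspired by the contrast enhancement and color-opponency mechanisms of the human visual system, the study aims to improve the robustness of semantic segmentation models under adverse conditions without modifying the model architecture or training procedure.

Result: Without any change to the model, the preprocessing improves the robustness of semantic segmentation under adverse conditions such as night, fog, and snow, while leaving performance on standard datasets intact.

Insight: 1) Biological vision mechanisms can provide useful inspiration for computer vision tasks; 2) lightweight preprocessing can significantly improve model robustness without complex architectural changes.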

Abstract: Inspired by the human visual system’s mechanisms for contrast enhancement and color-opponency, we explore biologically motivated input preprocessing for robust semantic segmentation. By applying Difference-of-Gaussians (DoG) filtering to RGB, grayscale, and opponent-color channels, we enhance local contrast without modifying model architecture or training. Evaluations on Cityscapes, ACDC, and Dark Zurich show that such preprocessing maintains in-distribution performance while improving robustness to adverse conditions like night, fog, and snow. As this processing is model-agnostic and lightweight, it holds potential for integration into imaging pipelines, enabling imaging systems to deliver task-ready, robust inputs for downstream vision models in safety-critical environments.
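Difference-of-Gaussians filtering subtracts a wide Gaussian blur from a narrow one, which suppresses flat regions and emphasizes local contrast. The sketch below applies it to a single channel stored as a list of rows; the sigma values, the 3-sigma kernel radius, and border clamping are assumptions for illustration, not the settings used in the paper.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at roughly 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Convolve one row or column, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def gaussian_blur(image, sigma):
    """Separable blur: filter rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def difference_of_gaussians(image, sigma_narrow=1.0, sigma_wide=2.0):
    """Narrow blur minus wide blur, computed per pixel."""
    narrow = gaussian_blur(image, sigma_narrow)
    wide = gaussian_blur(image, sigma_wide)
    return [[a - b for a, b in zip(rn, rw)] for rn, rw in zip(narrow, wide)]


# A flat image has no local contrast; an isolated bright pixel does.
flat = [[5.0] * 7 for _ in range(7)]
impulse = [[0.0] * 9 for _ in range(9)]
impulse[4][4] = 1.0
dog_flat = difference_of_gaussians(flat)
dog_impulse = difference_of_gaussians(impulse)
```

Applying the same function to each RGB, grayscale, or opponent-color channel would give per-channel contrast maps of the kind the abstract describes; how those maps are combined with the original input is not specified there.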

[229] StreamForest: Efficient Online Video Understanding with Persistent Event Memory cs.CVPDF

Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang

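TL;DR: StreamForest is an architecture designed for real-time video understanding; it improves real-time perception through a Persistent Event Memory Forest and a Fine-grained Spatiotemporal Window, and achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing multimodal large language models (MLLMs) perform poorly in real-time streaming video scenarios, mainly because of storage constraints on historical visual features and insufficient real-time spatiotemporal reasoning.

Result: StreamForest reaches accuracies of 77.3%, 60.5%, and 55.6% on StreamingBench, OVBench, and OVO-Bench, respectively; even under extreme visual token compression (1024 tokens), it retains 96.8% of its average accuracy across eight benchmarks relative to the default setting.

Insight: StreamForest's design shows the potential for efficiently retaining long-term memory under limited computational resources, and highlights the importance of combining real-time perception with long-term reasoning for video understanding.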

Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves the state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy in eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.

[230] Environment-Aware Satellite Image Generation with Diffusion Models cs.CV | cs.LGPDF

Nikos Kostagiolas, Pantelis Georgiades, Yannis Panagakis, Mihalis A. Nicolaou

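TL;DR: The paper proposes an environment-aware satellite image generation method based on diffusion models; by combining three control signals (text, metadata, and visual data), it addresses the limited environmental context, missing or corrupted data, and weak reflection of user intent in existing methods.

Details

Motivation: Existing diffusion models for remote sensing image generation make limited use of environmental context, handle missing data poorly, and struggle to reflect user intent accurately.

Result: The method outperforms existing methods in both single-image and temporal generation, with gains both qualitatively (robustness to missing metadata, responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, and generation quality).

Insight: Introducing environmental context improves satellite image generation and offers a potential tool for generating high-quality data for downstream tasks.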

Abstract: Diffusion-based foundation models have recently garnered much attention in the field of generative modeling due to their ability to generate images of high quality and fidelity. Although not straightforward, their recent application to the field of remote sensing signaled the first successful trials towards harnessing the large volume of publicly available datasets containing multimodal information. Despite their success, existing methods face considerable limitations: they rely on limited environmental context, struggle with missing or corrupted data, and often fail to reliably reflect user intentions in generated outputs. In this work, we propose a novel diffusion model conditioned on environmental context, that is able to generate satellite images by conditioning from any combination of three different control signals: a) text, b) metadata, and c) visual data. In contrast to previous works, the proposed method is i) to our knowledge, the first of its kind to condition satellite image generation on dynamic environmental conditions as part of its control signals, and ii) incorporating a metadata fusion strategy that models attribute embedding interactions to account for partially corrupt and/or missing observations. Our method outperforms previous methods both qualitatively (robustness to missing metadata, higher responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, and quality of generations measured using 6 different metrics) in the trials of single-image and temporal generation. The reported results support our hypothesis that conditioning on environmental context can improve the performance of foundation models for satellite imagery, and render our model a promising candidate for usage in downstream tasks. The collected 3-modal dataset is to our knowledge, the first publicly-available dataset to combine data from these three different mediums.

[231] ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation cs.CV | cs.ROPDF

Jiuhong Xiao, Roshan Nayak, Ning Zhang, Daniel Tortei, Giuseppe Loianno

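TL;DR: ThermalGen is a flow-based generative model for RGB-to-thermal image translation with a style-disentangled mechanism and multi-dataset support; it achieves comparable or superior performance to existing methods.

Details

Motivation: The scarcity of synchronized and calibrated RGB-thermal image pairs hinders progress on multimodal tasks; ThermalGen aims to address this by generating synthetic thermal images.

Result: Achieves comparable or superior translation performance relative to existing GAN-based and diffusion-based methods across multiple RGB-T benchmarks.

Insight: ThermalGen's results suggest that flow-based generative models have potential for multimodal image translation, especially when handling diverse data.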

Abstract: Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets–DJI-day, Bosonplus-day, and Bosonplus-night–captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions. Project page: http://xjh19971.github.io/ThermalGen

[232] VAGUEGAN: Stealthy Poisoning and Backdoor Attacks on Image Generative Pipelines cs.CV | cs.LGPDF

Mostafa Mohaimen Akand Faisal, Rabeya Amin Jhuma

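TL;DR: VagueGAN is an attack method targeting image generation pipelines; it combines PoisonerNet with a generator-discriminator pair to craft stealthy triggers that cause targeted changes in generated images. The attack preserves or even improves visual quality, exposing a blind spot in pixel-level defenses.

Details

Motivation: Generative models such as GANs and diffusion models are widely used to synthesize photorealistic images and support downstream tasks. However, attacks on generative pipelines are less studied, particularly stealthy attacks in which small input perturbations lead to controlled changes in outputs.

Result: Experiments show that poisoned outputs can have higher visual quality than clean samples, and that the perturbations produce consistent effects on generator outputs.

Insight: Latent-space poisoning may be stealthier and more effective than pixel-level perturbations, which challenges conventional defense assumptions and raises new questions about the security of generative pipelines.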

Abstract: Generative models such as GANs and diffusion models are widely used to synthesize photorealistic images and to support downstream creative and editing tasks. While adversarial attacks on discriminative models are well studied, attacks targeting generative pipelines where small, stealthy perturbations in inputs lead to controlled changes in outputs are less explored. This study introduces VagueGAN, an attack pipeline combining a modular perturbation network PoisonerNet with a Generator Discriminator pair to craft stealthy triggers that cause targeted changes in generated images. Attack efficacy is evaluated using a custom proxy metric, while stealth is analyzed through perceptual and frequency domain measures. The transferability of the method to a modern diffusion based pipeline is further examined through ControlNet guided editing. Interestingly, the experiments show that poisoned outputs can display higher visual quality compared to clean counterparts, challenging the assumption that poisoning necessarily reduces fidelity. Unlike conventional pixel level perturbations, latent space poisoning in GANs and diffusion pipelines can retain or even enhance output aesthetics, exposing a blind spot in pixel level defenses. Moreover, carefully optimized perturbations can produce consistent, stealthy effects on generator outputs while remaining visually inconspicuous, raising concerns for the integrity of image generation pipelines.

[233] DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation cs.CVPDF

Xi Chen, Hongxun Yao, Zhaopan Xu, Kui Jiang

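TL;DR: The paper proposes a new framework called DAM, which improves source-free active domain adaptation (SFADA) by combining supervision from a multimodal vision-and-language (ViL) model with sparse human annotations.

Details

Motivation: Current source-free domain adaptation methods typically treat ViL-based supervision and data supervision as separate sources and lack an effective fusion mechanism, which limits performance gains.

Result: DAM outperforms existing methods across multiple SFADA benchmarks and active learning strategies.

Insight: Multimodal supervision can effectively complement sparsely annotated data, and the bidirectional distillation mechanism is key to improving domain adaptation performance.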

Abstract: Source-free active domain adaptation (SFADA) enhances knowledge transfer from a source model to an unlabeled target domain using limited manual labels selected via active learning. While recent domain adaptation studies have introduced Vision-and-Language (ViL) models to improve pseudo-label quality or feature alignment, they often treat ViL-based and data supervision as separate sources, lacking effective fusion. To overcome this limitation, we propose Dual Active learning with Multimodal (DAM) foundation model, a novel framework that integrates multimodal supervision from a ViL model to complement sparse human annotations, thereby forming a dual supervisory signal. DAM initializes stable ViL-guided targets and employs a bidirectional distillation mechanism to foster mutual knowledge exchange between the target model and the dual supervisions during iterative adaptation. Extensive experiments demonstrate that DAM consistently outperforms existing methods and sets a new state-of-the-art across multiple SFADA benchmarks and active learning strategies.

[234] Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer cs.CVPDF

Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian

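TL;DR: The Attention Surgery framework linearizes or hybridizes the attention mechanism in pretrained video diffusion models without training from scratch, substantially reducing computational cost while maintaining generation quality.

Details

Motivation: Transformer-based video diffusion models (VDMs) generate high-quality video, but the quadratic cost of self-attention limits their use on long sequences and at high resolutions. Linear attention lowers the computational complexity, but prior methods could not match the expressiveness of softmax attention without costly retraining.

Result: Applied to Wan2.1 1.3B (a SOTA DiT-based VDM), the method yields the first competitive sub-quadratic-attention VDMs, reducing attention cost by up to 40% in FLOPs while maintaining generation quality on the VBench and VBench-2.0 benchmarks.

Insight: 1. The efficiency of a pretrained model can be improved without training from scratch; 2. hybrid attention is an effective way to balance computational cost and model performance; 3. lightweight adaptation strategies matter for practical deployment.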

Abstract: Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic attention video diffusion models, reducing attention cost by up to 40% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.
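For readers unfamiliar with linear attention, the following sketch shows the generic kernelized form that avoids the quadratic score matrix. It assumes the common `elu(x) + 1` feature map and plain Python lists; it is not the paper's hybrid mechanism, which additionally keeps some tokens on softmax attention.

```python
import math


def feature_map(vec):
    """elu(x) + 1, a positive feature map often used for linear attention."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


def linear_attention(queries, keys, values):
    """Kernelized attention in O(N * d_k * d_v) instead of O(N^2).

    The key-value summary `kv` and the key sum `k_sum` are built once and
    shared by every query, so no N x N attention matrix is ever formed.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]
    k_sum = [0.0] * d_k
    for key, value in zip(keys, values):
        fk = feature_map(key)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * value[b]
    outputs = []
    for query in queries:
        fq = feature_map(query)
        denom = sum(fq[a] * k_sum[a] for a in range(d_k))
        outputs.append([sum(fq[a] * kv[a][b] for a in range(d_k)) / denom
                        for b in range(d_v)])
    return outputs


queries = [[0.3, -1.2], [2.0, 0.5], [-0.7, 0.1]]
keys = [[1.0, 0.0], [-0.5, 0.8], [0.2, -0.3]]
values = [[1.0, 2.0]] * 3          # identical values for every token
out = linear_attention(queries, keys, values)
```

Because the attention weights for each query are normalized to sum to one, identical values are returned unchanged, which gives a quick sanity check on the normalization.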

[235] OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing cs.CV | cs.AIPDF

Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang

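TL;DR: OpenGPT-4o-Image is a large-scale dataset built by combining a hierarchical task taxonomy with automated data generation; it improves performance on image generation and editing tasks.

Details

Motivation: Existing datasets lack systematic structure and challenging scenarios, which limits the performance of multimodal models.

Result: Fine-tuning on the dataset yields significant gains on multiple benchmarks (up to 18% on editing tasks and 13% on generation tasks).

Insight: Systematic data construction is key to advancing multimodal AI capabilities.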

Abstract: The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.

[236] Segmentor-Guided Counterfactual Fine-Tuning for Image Synthesis cs.CV | cs.AIPDF

Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta

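TL;DR: This paper proposes Seg-CFT, a method that uses segmentor-guided counterfactual fine-tuning to generate locally coherent and effective counterfactual images, avoiding the undesirable global effects that arise when conventional methods rely on external classifiers.

Details

Motivation: Conventional methods rely on external classifiers or regressors for counterfactual image generation, which can cause global side effects under structure-specific interventions, while guidance from pixel-level label maps requires hypothetical segmentations that are tedious to obtain.

Result: Generates realistic chest radiographs and shows promising results for modeling coronary artery disease.

Insight: Introducing a segmentor improves the quality and local coherence of counterfactual image generation and provides a new tool for medical image analysis.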

Abstract: Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient’s age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: https://github.com/biomedia-mira/seg-cft.

[237] Scalable GANs with Transformers cs.CV | cs.AI | cs.LGPDF

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

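TL;DR: This paper studies the scalability of GANs; it achieves efficient generation through latent-space training and a purely transformer-based architecture, addresses problems that arise when scaling GANs by proposing lightweight intermediate supervision and width-aware learning-rate adjustment, and reaches SOTA results on ImageNet-256.

Details

Motivation: Scalability has driven recent progress in generative models, but its principles remain underexplored for GANs. The paper aims to improve GAN scalability through latent-space training and a transformer architecture.

Result: GAT-XL/2 reaches SOTA performance with an FID of 2.96 on ImageNet-256 in 40 training epochs, 6x fewer epochs than strong baselines.

Insight: GAN scalability can be achieved through latent-space training and transformer architectures, but underutilization of early layers and optimization instability need attention.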

Abstract: Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions: lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines.

[238] Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents cs.CVPDF

Jiahua Li, Kun Wei, Zhe Xu, Zibo Su, Xu Yang

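TL;DR: The paper proposes CogniGPT, a framework that achieves efficient and reliable long video understanding through an interactive loop between a Multi-Granular Perception Agent (MGPA) and a Verification-Enhanced Reflection Agent (VERA). Experiments show strong results on multiple datasets; notably, on EgoSchema it surpasses existing training-free methods using only 11.2 frames.

Details

Motivation: Long videos are temporally complex and contain sparse task-relevant information, and existing LLM-based methods still fall short on completeness and efficiency. Inspired by human progressive visual cognition, the authors propose an interactive framework to capture task-critical information more efficiently.

Result: On the EgoSchema, Video-MME, NExT-QA, and MovieChat datasets, CogniGPT performs strongly in both accuracy and efficiency. On EgoSchema, it surpasses existing training-free methods using only 11.2 frames, with performance comparable to Gemini 1.5-Pro.

Insight: 1. An interactive multi-agent structure can effectively address the information sparsity problem in long video understanding; 2. human visual cognition mechanisms are a useful model to borrow from; 3. verification and reflection are important for reducing hallucination and improving efficiency.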

Abstract: Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although various Large Language Model (LLM)-based approaches have advanced long video understanding, they still struggle to achieve both completeness and efficiency in capturing task-critical information. Inspired by human progressive visual cognition, we propose CogniGPT, a framework that leverages an interactive loop between Multi-Granular Perception Agent (MGPA) and Verification-Enhanced Reflection Agent (VERA) for efficient and reliable long video understanding. Specifically, MGPA mimics human visual divergent and focused attention to capture task-related information, while VERA verifies perceived key clues to mitigate hallucination and optimize subsequent perception strategies. Through this interactive process, CogniGPT explores a minimal set of informative and reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat datasets demonstrate CogniGPT’s superiority in both accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.

Ermanno Bartoli, Dennis Rotondi, Buwei He, Patric Jensfelt, Kai O. Arras

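TL;DR: This paper proposes Social 3D Scene Graphs, an augmented 3D scene graph representation that captures humans and their relationships with the environment, and introduces a new benchmark; experiments show that the method improves human activity prediction and relational reasoning.

Details

Motivation: Existing 3D scene graph methods overlook the role of humans in the scene, and annotated data on human-environment relationships is lacking, which limits robots' ability to understand social interactions.

Result: Experiments show that the method improves human activity prediction and strengthens reasoning about human-environment relations.

Insight: Combining human behavior modeling with semantic representations of the environment can give robots more comprehensive support for social intelligence.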

Abstract: Understanding how people interact with their surroundings and each other is essential for enabling robots to act in socially compliant and context-aware ways. While 3D Scene Graphs have emerged as a powerful semantic representation for scene understanding, existing approaches largely ignore humans in the scene, also due to the lack of annotated human-environment relationships. Moreover, existing methods typically capture only open-vocabulary relations from single image frames, which limits their ability to model long-range interactions beyond the observed content. We introduce Social 3D Scene Graphs, an augmented 3D Scene Graph representation that captures humans, their attributes, activities and relationships in the environment, both local and remote, using an open-vocabulary framework. Furthermore, we introduce a new benchmark consisting of synthetic environments with comprehensive human-scene relationship annotations and diverse types of queries for evaluating social scene understanding in 3D. The experiments demonstrate that our representation improves human activity prediction and reasoning about human-environment relations, paving the way toward socially intelligent robots.

[240] On-the-Fly Data Augmentation for Brain Tumor Segmentation cs.CVPDF

Ishika Jain, Siri Willems, Steven Latre, Tom De Schepper

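TL;DR: The paper proposes a dynamic data augmentation strategy that uses pretrained generative adversarial networks (GliGANs) to insert synthetic brain tumors on the fly during training, improving the robustness and generalization of brain tumor segmentation models.

Details

Motivation: Brain tumor segmentation models need to be robust on both pre-treatment and post-treatment scans, but the need for high-quality annotated data limits generalization. Storing large volumes of augmented 3D data, as conventional augmentation requires, is computationally expensive, so a more efficient approach is needed.

Result: The model performs strongly on the BraTS 2025 validation platform (e.g., Dice scores of 0.79 for ET and 0.88 for WT) and ranked first in BraTS Lighthouse Challenge 2025 Task 1.

Insight: Dynamic data augmentation can address data scarcity and storage cost while improving generalization and segmentation accuracy.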

Abstract: Robust segmentation across both pre-treatment and post-treatment glioma scans can be helpful for consistent tumor monitoring and treatment planning. BraTS 2025 Task 1 addresses this by challenging models to generalize across varying tumor appearances throughout the treatment timeline. However, training such generalized models requires access to diverse, high-quality annotated data, which is often limited. While data augmentation can alleviate this, storing large volumes of augmented 3D data is computationally expensive. To address these challenges, we propose an on-the-fly augmentation strategy that dynamically inserts synthetic tumors using pretrained generative adversarial networks (GliGANs) during training. We evaluate three nnU-Net-based models and their ensembles: (1) a baseline without external augmentation, (2) a regular on-the-fly augmented model, and (3) a model with customized on-the-fly augmentation. Built upon the nnU-Net framework, our pipeline leverages pretrained GliGAN weights and tumor insertion methods from prior challenge-winning solutions. An ensemble of the three models achieves lesion-wise Dice scores of 0.79 (ET), 0.749 (NETC), 0.872 (RC), 0.825 (SNFH), 0.79 (TC), and 0.88 (WT) on the online BraTS 2025 validation platform. This work ranked first in the BraTS Lighthouse Challenge 2025 Task 1- Adult Glioma Segmentation.
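The scores above are Dice coefficients. As a reminder of what the metric measures, here is a minimal sketch of the standard Dice score on flattened binary masks. The lesion-wise variant reported on the BraTS platform first matches predicted and reference lesions and scores them individually, which is not reproduced here; returning 1.0 for two empty masks is an assumed convention.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) on binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
score = dice_score(pred_mask, true_mask)  # one shared voxel, three marked in total
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.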

[241] Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel cs.CVPDF

Haotian Dong, Wenjing Wang, Chen Li, Di Lin

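TL;DR: The paper proposes Wan-Alpha, a new framework for generating high-quality transparent videos with an alpha channel. By learning the RGB and alpha channels jointly and combining a variational autoencoder (VAE) with a diffusion transformer, it substantially improves visual quality, motion realism, and transparency rendering.

Details

Motivation: Existing RGBA video generation methods often neglect visual quality, limiting practical use. The authors therefore propose a solution that generates high-quality RGB and alpha channels together.

Result: Compared with existing methods, Wan-Alpha performs better in visual quality, motion realism, and transparency rendering, and can generate semi-transparent objects, glowing effects, and fine-grained details.

Insight: Effective encoding and joint learning of the alpha channel are crucial for RGBA video generation, and a high-quality dataset is key to training the diffusion model.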

Abstract: RGBA video generation, which includes an alpha channel to represent transparency, is gaining increasing attention across a wide range of applications. However, existing methods often neglect visual quality, limiting their practical usability. In this paper, we propose Wan-Alpha, a new framework that generates transparent videos by learning both RGB and alpha channels jointly. We design an effective variational autoencoder (VAE) that encodes the alpha channel into the RGB latent space. Then, to support the training of our diffusion transformer, we construct a high-quality and diverse RGBA video dataset. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering. Notably, our model can generate a wide variety of semi-transparent objects, glowing effects, and fine-grained details such as hair strands. The released model is available on our website: https://donghaotian123.github.io/Wan-Alpha/.

[242] SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation cs.CVPDF

Shuang Liang, Jing He, Chuanmeizhi Wang, Lejun Liao, Guo Zhang

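TL;DR: SDPose uses the latent features of a pretrained diffusion model (Stable Diffusion) for human pose estimation; a lightweight convolutional head and an auxiliary RGB reconstruction branch improve cross-domain robustness, and it achieves SOTA performance on cross-domain benchmarks such as COCO-OOD.

Details

Motivation: Pre-trained diffusion models provide rich multi-scale latent features, and recent work has adapted diffusion priors for dense prediction, but their potential for structured outputs such as human pose estimation remains underexplored.

Result: Matches Sapiens-1B/2B on the COCO validation set, sets a new state of the art on HumanArt and COCO-OOD, and shows potential as a zero-shot pose annotator for controllable generation tasks.

Insight: Diffusion latent features can be used for structured output tasks, and a lightweight design combined with an auxiliary task can effectively improve cross-domain performance.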

Abstract: Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold (Ke et al., 2024) and Lotus (He et al., 2024) adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs (e.g., human pose estimation) remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net’s image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Furthermore, we showcase SDPose as a zero-shot pose annotator for downstream controllable generation tasks, including ControlNet-based image synthesis and video generation, where it delivers qualitatively superior pose guidance.

[243] PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion cs.CVPDF

Yuyang Yin, HaoXiang Guo, Fangfu Liu, Mengyu Wang, Hanwen Liang

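TL;DR: PanoWorld-X is a new framework for generating high-quality, controllable panoramic video; with a Sphere-Aware Diffusion Transformer and a large-scale dataset, it addresses the field-of-view limitations and insufficient camera control of existing methods.

Details

Motivation: Existing methods for generating panoramic worlds suffer from a narrow field of view or insufficient camera control, which limits scene continuity and free exploration.

Result: PanoWorld-X performs strongly in motion range, control precision, and visual quality, showing potential for real-world applications.

Insight: Modeling spherical geometry is crucial for panoramic data generation; the inductive biases of conventional video diffusion can be misaligned with spherical data, so a dedicated architecture is needed.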

Abstract: Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.
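The mismatch between equirectangular images and spherical geometry can be made concrete with the standard pixel-to-sphere mapping. The sketch below assumes one common convention (longitude spans the image width, latitude spans the height, and the image center looks along +z); the paper's exact reprojection is not given in the abstract.

```python
import math


def equirect_to_sphere(u, v, width, height):
    """Map equirectangular pixel coordinates to a point on the unit sphere.

    Assumed convention: u in [0, width] covers longitude [-pi, pi],
    v in [0, height] covers latitude [pi/2, -pi/2] from top to bottom.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


width, height = 2048, 1024
center = equirect_to_sphere(width / 2, height / 2, width, height)
left_edge = equirect_to_sphere(0, height / 2, width, height)
right_edge = equirect_to_sphere(width, height / 2, width, height)
seam_gap = math.dist(left_edge, right_edge)
```

The left and right image borders land on the same point of the sphere, so features that are far apart in the 2D grid are in fact adjacent. This is the kind of geometric adjacency that a sphere-aware architecture is meant to capture.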

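[244] LVT: Large-Scale Scene Reconstruction via Local View Transformers cs.CV | cs.LG

Tooba Imtiaz, Lucy Chai, Kathryn Heal, Xuan Luo, Jungyeon Park

TL;DR: LVT is a large-scale scene reconstruction method based on Local View Transformers. It avoids the quadratic complexity of standard Transformers by processing local neighborhoods with a geometric positional encoding, enabling efficient large-scene reconstruction and novel view synthesis.

Details

Motivation: Standard Transformers have quadratic complexity, which makes them hard to scale to large scenes in 3D vision and novel view synthesis; an efficient alternative is needed.

Result: LVT reconstructs arbitrarily large, high-resolution scenes in a single forward pass.

Insight: Spatially nearby views carry more useful signal about local scene composition than distant views, and the geometric positional encoding plays a key role in model performance.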

Abstract: Large transformer models are proving to be a powerful tool for 3D vision and novel view synthesis. However, the standard Transformer’s well-known quadratic complexity makes it difficult to scale these methods to large scenes. To address this challenge, we propose the Local View Transformer (LVT), a large-scale scene reconstruction and novel view synthesis architecture that circumvents the need for the quadratic attention operation. Motivated by the insight that spatially nearby views provide more useful signal about the local scene composition than distant views, our model processes all information in a local neighborhood around each view. To attend to tokens in nearby views, we leverage a novel positional encoding that conditions on the relative geometric transformation between the query and nearby views. We decode the output of our model into a 3D Gaussian Splat scene representation that includes both color and opacity view-dependence. Taken together, the Local View Transformer enables reconstruction of arbitrarily large, high-resolution scenes in a single forward pass. See our project page for results and interactive demos: https://toobaimt.github.io/lvt/.
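To make the cost argument concrete, the sketch below restricts each view to its k nearest views and counts query-key token pairs. It is illustrative only: neighbour selection by Euclidean camera distance, the function names, and the parameter values are assumptions rather than details from the paper, and the brute-force neighbour search is itself quadratic in the number of views (cheap, since it works on camera positions and not tokens).

```python
import math

def local_neighborhoods(camera_positions, k):
    """For each view, return the indices of its k nearest other views."""
    neighborhoods = []
    for i, p in enumerate(camera_positions):
        others = [(math.dist(p, q), j)
                  for j, q in enumerate(camera_positions) if j != i]
        others.sort()  # nearest first; ties broken by view index
        neighborhoods.append([j for _, j in others[:k]])
    return neighborhoods

def attention_pairs(num_views, tokens_per_view, k=None):
    """Count query-key token pairs.

    k=None models global attention over all views; otherwise each view
    attends to itself plus its k nearest neighbours.
    """
    views_attended = num_views if k is None else min(k + 1, num_views)
    return num_views * tokens_per_view * views_attended * tokens_per_view

# Hypothetical cameras along a line, with one far-away view.
cameras = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(local_neighborhoods(cameras, k=2))
print(attention_pairs(100, 10), attention_pairs(100, 10, k=4))
```

With a fixed k, the pair count grows linearly in the number of views instead of quadratically, which is the property the abstract relies on to reach large scenes.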

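[245] GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning cs.CV

Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool

TL;DR: GeoVLM-R1 is a reinforcement-learning-based post-training framework that uses task-aware rewards to improve reasoning over remote sensing images, and it performs strongly on multiple EO benchmarks.

Details

Motivation: Existing reinforcement learning methods work well on natural images but remain underexplored for Earth Observation (EO) tasks, whose diversity and complexity demand task-aware reasoning.

Result: Across multiple EO benchmarks, it outperforms existing generic and specialized vision-language models.

Insight: Task-aware rewards are key to improving remote sensing reasoning, and reinforcement learning can be adapted effectively to complex EO tasks.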

Abstract: Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task-aware reasoning. We propose a novel post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision-language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/.
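The abstract does not specify the reward functions, so the following is only a sketch of what a task-aware reward could look like. The task names, the dispatch function, and the choice of IoU for grounding and exact match for classification are all assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def task_reward(task, prediction, reference):
    """Hypothetical dispatch: pick a reward suited to the task type."""
    if task == "grounding":
        # Continuous reward: overlap between predicted and reference box.
        return box_iou(prediction, reference)
    if task == "classification":
        # Binary reward: case-insensitive exact match of the label.
        return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
    raise ValueError(f"no reward defined for task {task!r}")

print(task_reward("grounding", (0, 0, 2, 2), (1, 1, 3, 3)))
print(task_reward("classification", " Airport", "airport"))
```

The point of the dispatch is that a single scalar reward per rollout can still reflect what "correct" means for each EO task type.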

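[246] STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation cs.CV

Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang

TL;DR: The paper proposes STAGE, a stable and generalizable framework that addresses the instability and poor generalization of GRPO in autoregressive image generation. Two key components, advantage/KL reweighting and an entropy reward, improve generation quality, training stability, and cross-task generalization.

Details

Motivation: Existing GRPO algorithms train unstably and generalize poorly in autoregressive image generation, and they easily disrupt the capabilities of the pretrained model. STAGE is proposed to address these problems.

Result: Across multiple benchmarks, STAGE consistently improves visual quality, training stability, and cross-task generalization over baseline GRPO.

Insight: 1. Handling contradictory gradients and policy entropy dynamics is key to improving GRPO; 2. An entropy reward is an effective tool for stabilizing training; 3. STAGE's approach may extend to other reinforcement learning settings that need stable training.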

Abstract: Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) Advantage/KL reweighting: similarity-aware reweighting to alleviate conflicting updates; and 2) Entropy reward: an entropy-based reward corresponding to the reference model to stabilize learning. By alleviating conflicts between tokens and stabilizing training with the entropy reward, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfer to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.
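For readers unfamiliar with the baseline, the group-relative advantage at the heart of GRPO can be sketched as follows. This covers baseline GRPO only, not STAGE's similarity-aware reweighting or entropy reward, whose exact forms the abstract does not give; the function name and the `eps` value are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of samples drawn for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std here; eps avoids 0/0).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four images sampled from one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In an autoregressive image model, every token of a sampled image typically shares that image's advantage, which is where the abstract's concern about contradictory gradients from unnecessary tokens arises: tokens that are nearly identical across good and bad samples receive opposing updates.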

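[247] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning cs.CV | cs.LG | I.4.9

Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin

TL;DR: VT-FSL is a novel few-shot learning framework that combines large language models (LLMs) with support images to generate precise cross-modal prompts. It addresses the semantic hallucination of existing methods and achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing few-shot learning methods enhance support features by adding semantic information or designing complex semantic fusion modules, yet they still hallucinate semantics, which leads to noisy guidance and costly corrections. VT-FSL aims to overcome this by combining visual and textual information.

Result: VT-FSL achieves state-of-the-art performance on ten benchmarks spanning standard, cross-domain, and fine-grained few-shot learning.

Insight: Generating cross-modal prompts from LLMs and support images can markedly reduce semantic noise and improve few-shot performance; geometric alignment effectively captures global, nonlinear relationships among multimodal representations.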

Abstract: Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
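The volume objective in CGA can be illustrated with the standard identity that the volume of the parallelotope spanned by three vectors equals the square root of the determinant of their Gram matrix; replacing inner products with kernel evaluations gives a kernelized volume. The sketch below assumes that form. The RBF kernel, its `gamma`, and all function names are illustrative choices, not the paper's implementation.

```python
import math

def linear_kernel(a, b):
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=1.0):
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def parallelotope_volume(text, support, synthetic, kernel):
    """sqrt(det(Gram)) for the three representations under the given kernel."""
    vectors = [text, support, synthetic]
    gram = [[kernel(a, b) for b in vectors] for a in vectors]
    return math.sqrt(max(det3(gram), 0.0))  # clamp tiny negative round-off

# Hypothetical 3-d embeddings: fully aligned vs. mutually orthogonal.
aligned = [1.0, 2.0, 3.0]
print(parallelotope_volume(aligned, aligned, aligned, rbf_kernel))
print(parallelotope_volume([1, 0, 0], [0, 1, 0], [0, 0, 1], linear_kernel))
```

The volume shrinks to zero as the three representations become aligned, so minimizing it pulls the textual, support, and synthetic features together jointly instead of pair by pair.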

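[248] A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration cs.CV | cs.DC

Rohit Jena, Vedant Zope, Pratik Chaudhari, James C. Gee

TL;DR: FFDP is a distributed framework that optimizes non-GEMM bottlenecks and uses convolution-aware tensor sharding to perform multimodal image registration at unprecedented scale, with substantial gains in performance and efficiency.

Details

Motivation: Image registration algorithms in the biomedical and life sciences have not scaled in step with image acquisition capabilities; an efficient distributed framework is needed for large-scale problems.

Result: On 8 A6000 GPUs, FFDP completes a multimodal registration more than 570x larger than a standard clinical datum in about one minute, accelerates existing methods by up to 6-7x, and reduces peak memory consumption by 20-59%.

Insight: FFDP shows that distributed optimization of non-GEMM bottlenecks and memory management can greatly extend the scale of image registration while remaining efficient, making it suitable for future large-scale biomedical data processing.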

Abstract: In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100 micron ex-vivo human brain MRI volume at native resolution, an inverse problem more than 570x larger than a standard clinical datum, in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by up to 6-7x while reducing peak memory consumption by 20-59%. Comparative analysis on a 250 micron dataset shows that FFDP can fit up to 64x larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.
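One common way to make tensor sharding convolution-aware is to give each shard a halo of neighbouring voxels so that a convolution at the shard boundary sees the same inputs it would see on the full volume. The sketch below shows that idea along one axis. It is a generic illustration, not FFDP's actual scheme; the function name and dictionary layout are assumptions.

```python
def shard_with_halo(length, num_shards, kernel_size):
    """Split [0, length) into contiguous shards along one axis.

    Each shard owns a disjoint range and reads an extended range padded by
    the halo (kernel_size // 2) that a stride-1 'same' convolution with an
    odd kernel needs at the shard boundaries.
    """
    halo = kernel_size // 2
    base, extra = divmod(length, num_shards)
    shards = []
    start = 0
    for rank in range(num_shards):
        size = base + (1 if rank < extra else 0)  # spread the remainder
        end = start + size
        shards.append({
            "rank": rank,
            "owned": (start, end),
            "read": (max(0, start - halo), min(length, end + halo)),
        })
        start = end
    return shards

for shard in shard_with_halo(length=10, num_shards=3, kernel_size=3):
    print(shard)
```

Only the halo slices need to be exchanged between devices before each convolution, so communication scales with the shard surface while memory scales with the shard volume.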

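[249] BRIDGE – Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation cs.CV | cs.AI

Dingning Liu, Haoyu Guo, Jingyi Zhou, Tong He

TL;DR: BRIDGE is a reinforcement-learning-optimized depth-to-image (D2I) generation framework that synthesizes high-quality RGB images paired with depth at scale for training monocular depth estimation (MDE) models.

Details

Motivation: Traditional monocular depth estimation methods are limited by data scarcity and quality, which hurts robustness. BRIDGE aims to close this gap by generating large-scale, diverse synthetic data.

Result: BRIDGE achieves breakthroughs in scale and domain diversity, outperforming existing SOTA methods quantitatively and in capturing detail in complex scenes.

Insight: Optimizing the data generation process with reinforcement learning can improve synthetic data quality and, in turn, the generalization and robustness of monocular depth estimation models.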

Abstract: Monocular Depth Estimation (MDE) is a foundational task for computer vision. Traditional methods are limited by data scarcity and quality, hindering their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images, each intrinsically paired with its ground truth depth, from diverse source depth maps. Then we train our depth estimation model on this dataset, employing a hybrid supervision strategy that integrates teacher pseudo-labels with ground truth depth for comprehensive and robust training. This innovative data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches quantitatively and in complex scene detail capture, thereby fostering general and robust depth features. Code and models are available at https://dingning-liu.github.io/bridge.github.io/.

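[250] MANI-Pure: Magnitude-Adaptive Noise Injection for Adversarial Purification cs.CV

Xiaoyi Huang, Junwei Wu, Kejia Zhang, Carl Yang, Zhiming Luo

TL;DR: MANI-Pure is an adversarial purification framework based on magnitude-adaptive noise injection. By injecting noise selectively across frequency bands, it suppresses adversarial perturbations in high-frequency, low-magnitude regions while preserving important semantic content.

Details

Motivation: Existing adversarial purification methods typically inject uniform noise, which perturbs all frequencies indiscriminately, corrupts semantic structure, and undermines robustness. The paper's empirical study finds that adversarial perturbations are not uniformly distributed: they concentrate in high-frequency regions, with magnitude patterns that vary across frequencies and attack types.

Result: On CIFAR-10 and ImageNet-1K, MANI-Pure narrows the clean accuracy gap to within 0.59 of the original classifier, raises robust accuracy by 2.15, and achieves the top robust accuracy on the RobustBench leaderboard, surpassing the previous best method.

Insight: Adversarial perturbations concentrate in high-frequency, low-magnitude regions, so targeted noise injection can purify adversarial examples without damaging low-frequency semantic structure.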

Abstract: Adversarial purification with diffusion models has emerged as a promising defense strategy, but existing methods typically rely on uniform noise injection, which indiscriminately perturbs all frequencies, corrupting semantic structures and undermining robustness. Our empirical study reveals that adversarial perturbations are not uniformly distributed: they are predominantly concentrated in high-frequency regions, with heterogeneous magnitude intensity patterns that vary across frequencies and attack types. Motivated by this observation, we introduce MANI-Pure, a magnitude-adaptive purification framework that leverages the magnitude spectrum of inputs to guide the purification process. Instead of injecting homogeneous noise, MANI-Pure adaptively applies heterogeneous, frequency-targeted noise, effectively suppressing adversarial perturbations in fragile high-frequency, low-magnitude bands while preserving semantically critical low-frequency content. Extensive experiments on CIFAR-10 and ImageNet-1K validate the effectiveness of MANI-Pure. It narrows the clean accuracy gap to within 0.59 of the original classifier, while boosting robust accuracy by 2.15, and achieves the top-1 robust accuracy on the RobustBench leaderboard, surpassing the previous state-of-the-art method.

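[251] Score Distillation of Flow Matching Models cs.CV | cs.AI | cs.LG

Mingyuan Zhou, Yi Gu, Huangjie Zheng, Liangchen Song, Guande He

TL;DR: The paper presents a unified view, based on Bayes' rule and conditional expectations, that connects Gaussian diffusion and flow matching, and it extends Score identity Distillation (SiD) to text-to-image flow-matching models, demonstrating broad applicability and stability.

Details

Motivation: Diffusion models generate high-quality images but are slow because of iterative sampling. Distillation alleviates this, and the equivalence between flow matching and diffusion raises the question of whether score distillation transfers directly.

Result: Experiments show that SiD applies directly to these flow-matching models and is stable in both data-free and data-aided settings, without teacher fine-tuning or architectural changes.

Insight: Score distillation applies broadly to text-to-image flow-matching models, resolving earlier concerns about stability and soundness and unifying acceleration techniques for diffusion and flow-matching models.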

Abstract: Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation – based on Bayes’ rule and conditional expectations – that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. We will make the PyTorch implementation publicly available.
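The kind of link the abstract describes can be sketched for one specific case. The derivation below is not taken from the paper: it assumes the linear interpolation path commonly used in rectified-flow models, with data $x_0$, Gaussian noise $\epsilon$, and $t \in (0, 1)$, and uses only Bayes' rule and conditional expectations.

```latex
% Assumed path: linear interpolation between data x_0 and noise \epsilon ~ N(0, I)
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad t \in (0, 1)

% Flow-matching velocity as a conditional expectation
v(x_t, t) = \mathbb{E}[\epsilon - x_0 \mid x_t]

% Score via Bayes' rule, using p_t(x_t \mid x_0) = N((1 - t) x_0, t^2 I)
\nabla_{x_t} \log p_t(x_t)
  = \mathbb{E}\!\left[\nabla_{x_t} \log p_t(x_t \mid x_0) \,\middle|\, x_t\right]
  = -\frac{\mathbb{E}[\epsilon \mid x_t]}{t}

% Conditioning the path on x_t gives
% x_t = (1 - t) E[x_0 | x_t] + t E[\epsilon | x_t], hence
\mathbb{E}[x_0 \mid x_t] = x_t - t\,v(x_t, t), \qquad
\mathbb{E}[\epsilon \mid x_t] = x_t + (1 - t)\,v(x_t, t)

% so the velocity field determines the score, and vice versa
\nabla_{x_t} \log p_t(x_t) = -\frac{x_t + (1 - t)\,v(x_t, t)}{t}
```

Under this assumption a pretrained velocity network can stand in wherever a score-distillation loss expects a score or a denoised estimate, with no ODE or SDE machinery involved.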

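[252] Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events cs.CV | cs.AI | cs.LG | cs.RO

Richeek Das, Kostas Daniilidis, Pratik Chaudhari

TL;DR: The paper proposes Fast Feature Field ($\text{F}^3$), a representation for event camera data. $\text{F}^3$ is learned by predicting future events, which preserves scene structure and motion information, and it is efficient and robust.

Details

Motivation: Event camera data is sparse, noisy in varied ways, and variable in event rate, which makes it hard for traditional methods to process efficiently. The authors seek a new representation that addresses these issues and supports a range of downstream tasks.

Result: State-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation across sensor resolutions and event rates, with efficient computation (120 Hz at HD, 440 Hz at VGA).

Insight: Learning a representation by predicting future events can substantially improve both the efficiency of processing event camera data and downstream task performance.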

Abstract: This paper develops a mathematical argument and algorithms for building representations of data from event-based cameras, that we call Fast Feature Field ($\text{F}^3$). We learn this representation by predicting future events from past events and show that it preserves scene structure and motion information. $\text{F}^3$ exploits the sparsity of event data and is robust to noise and variations in event rates. It can be computed efficiently using ideas from multi-resolution hash encoding and deep sets - achieving 120 Hz at HD and 440 Hz at VGA resolutions. $\text{F}^3$ represents events within a contiguous spatiotemporal volume as a multi-channel image, enabling a range of downstream tasks. We obtain state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation, on data from three robotic platforms (a car, a quadruped robot and a flying platform), across different lighting conditions (daytime, nighttime), environments (indoors, outdoors, urban, as well as off-road) and dynamic vision sensors (resolutions and event rates). Our implementations can predict these tasks at 25-75 Hz at HD resolution.
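To show what representing events in a spatiotemporal volume as a multi-channel image means at the data level, here is a hand-crafted baseline that bins events by time and polarity. It is not $\text{F}^3$, which is learned by predicting future events; the event tuple layout, the channel ordering, and the function name are assumptions made for illustration.

```python
def events_to_image(events, width, height, t_start, t_end, bins):
    """Accumulate events into a count image with 2 * bins channels.

    events: iterable of (x, y, t, polarity) with polarity in {0, 1}.
    Channel index is 2 * time_bin + polarity; image[channel][y][x] is a count.
    Events outside the sensor area or the window [t_start, t_end) are dropped.
    """
    image = [[[0] * width for _ in range(height)] for _ in range(2 * bins)]
    span = t_end - t_start
    for x, y, t, polarity in events:
        if not (t_start <= t < t_end and 0 <= x < width and 0 <= y < height):
            continue
        time_bin = min(int((t - t_start) / span * bins), bins - 1)
        image[2 * time_bin + polarity][y][x] += 1
    return image

# Hypothetical events on a 2x2 sensor over a one-second window.
events = [(0, 0, 0.0, 1), (0, 0, 0.9, 0), (1, 1, 0.5, 1), (5, 5, 0.1, 1), (0, 0, 1.0, 1)]
image = events_to_image(events, width=2, height=2, t_start=0.0, t_end=1.0, bins=2)
print(image)
```

Because the output is an ordinary dense image tensor, standard vision heads for flow, segmentation, or depth can consume it directly; a learned representation keeps that interface and replaces the raw counts with predictive features.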

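[253] VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning cs.CV

Zhaozhi Wang, Tong Zhang, Mingyue Guo, Yaowei Wang, Qixiang Ye

TL;DR: VideoAnchor brings the self-expressiveness property of sparse subspace clustering into Transformer attention to reinforce visual cues across frames, improving multimodal large language models (MLLMs) on visual-spatial reasoning tasks.

Details

Motivation: Current MLLMs align vision and language well but fall short on visual-spatial reasoning. The paper traces this to the attention mechanism: visual tokens are overshadowed by language tokens, so the model struggles to recognize the same visual cues across frames.

Result: Consistent gains across benchmarks and backbone models, e.g., 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks).

Insight: The work exposes how visual cues are neglected in attention and offers an efficient, retraining-free remedy based on subspace clustering.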

Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language alignment, yet they remain limited in visual-spatial reasoning. We first identify that this limitation arises from the attention mechanism: visual tokens are overshadowed by language tokens, preventing the model from consistently recognizing the same visual cues across frames. To address this challenge, we draw a novel connection between the self-expressiveness property in sparse subspace clustering and the attention mechanism in Transformers. Building on this insight, we propose VideoAnchor, a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining, effectively anchoring attention to shared visual structures. Extensive experiments across benchmarks and backbone models show consistent performance gains – e.g., 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks) with InternVL2-8B and Qwen2.5VL-72B – while qualitative analyses demonstrate more coherent subspace partitions and stronger visual grounding. Our code will be made publicly available at https://github.com/feufhd/VideoAnchor.

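[254] Rolling Forcing: Autoregressive Long Video Diffusion in Real Time cs.CV

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu

TL;DR: Rolling Forcing is a new technique for generating long videos in real time. It combines joint denoising, an attention sink mechanism, and an efficient training algorithm to substantially reduce error accumulation.

Details

Motivation: Existing streaming video generation methods accumulate errors over long horizons, which degrades quality. Rolling Forcing targets high-quality, low-latency long video streams.

Result: Experiments show that Rolling Forcing streams multi-minute videos in real time on a single GPU with substantially reduced error accumulation.

Insight: Joint denoising and the attention sink mechanism are key to long-video consistency, and the efficient training algorithm is what makes the approach practical.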

Abstract: Streaming video generation, as one fundamental component in interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated stream videos over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
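The first two designs can be pictured with a small scheduling sketch. It simulates bookkeeping only, with no model involved. The window is assumed to start already warmed up with noise levels rising from front to back, noise levels are tracked as integer counts of remaining denoising passes, and the function names and the sink/recent cache policy are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

def rolling_schedule(num_frames, window):
    """Simulate joint denoising over a rolling window; return emission order.

    Each window entry is (frame index, remaining denoising passes). One joint
    pass makes every frame in the window one step cleaner, emits the front
    frame once it is clean, and admits a new pure-noise frame at the back.
    """
    if num_frames < window:
        raise ValueError("num_frames must be at least the window size")
    frames = deque((i, i + 1) for i in range(window))  # noise rises toward the back
    next_index = window
    emitted = []
    while frames:
        frames = deque((idx, steps - 1) for idx, steps in frames)  # one joint pass
        if frames[0][1] == 0:
            emitted.append(frames.popleft()[0])   # front frame is now clean
        if next_index < num_frames:
            frames.append((next_index, window))   # new frame enters at full noise
            next_index += 1
    return emitted

def kv_cache_frames(current, sink, recent):
    """Frames whose key/value states are kept when generating frame `current`:
    the first `sink` frames as a global anchor plus the `recent` latest ones."""
    sinks = list(range(min(sink, current)))
    recents = list(range(max(sink, current - recent), current))
    return sinks + recents

print(rolling_schedule(num_frames=5, window=3))
print(kv_cache_frames(current=10, sink=2, recent=3))
```

In steady state each joint pass emits one clean frame, and every frame is denoised alongside neighbours at nearby noise levels, which is the relaxation of strict frame-to-frame causality that the abstract credits with suppressing error growth.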

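[255] YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection cs.CV

Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee

TL;DR: YOLO26 is the latest member of the YOLO family, focused on real-time object detection on edge devices. Architectural changes such as NMS-free inference and STAL label assignment improve efficiency and accuracy.

Details

Motivation: YOLO26 introduces several technical innovations to address the efficiency and accuracy of real-time object detection on edge devices.

Result: On NVIDIA Orin Jetson platforms, YOLO26 outperforms YOLOv8, YOLO11, and other versions in efficiency and accuracy.

Insight: By simplifying the pipeline and optimizing components, YOLO26 shows the potential of real-time object detection on edge devices.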

Abstract: This study presents the key architectural enhancements and performance benchmarking of Ultralytics YOLO26 for real-time edge object detection, providing a comprehensive overview of the design principles of YOLO26, technological advances, and deployment readiness. YOLO26, released in September 2025 by Ultralytics, represents the newest and most cutting-edge member of the You Only Look Once (YOLO) family, engineered to push the boundaries of efficiency and accuracy on edge and low-power devices. This paper highlights architectural innovations in YOLO26, including end-to-end NMS-free inference, removal of Distribution Focal Loss (DFL) for streamlined exports, introduction of ProgLoss and Small-Target-Aware Label Assignment (STAL) for improved stability and small-object detection, and the adoption of the MuSGD optimizer inspired by large language model training. In addition, we report performance benchmarks for YOLO26 across edge devices, specifically NVIDIA Orin Jetson platforms, and compare results against YOLOv8 and YOLO11 (previous Ultralytics releases) as well as YOLOv12 and YOLOv13, which bridged the lineage between YOLO11 and YOLO26. Our comparative analysis highlights the superior efficiency, accuracy, and deployment versatility of YOLO26, establishing it as a pivotal milestone in the YOLO evolution.

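[256] Personalized Vision via Visual In-Context Learning cs.CV | cs.LG

Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, Mike Zheng Shou

TL;DR: The paper proposes PICO, a visual in-context learning framework that repurposes diffusion transformers for personalized vision tasks and adapts to new user-defined tasks without retraining.

Details

Motivation: Existing vision models trained on large annotated datasets perform well on predefined tasks but cannot flexibly adapt to personalized tasks (e.g., user-defined objects or objectives). Conventional approaches rely on costly fine-tuning or synthetic data and lack flexibility.

Result: In experiments, PICO surpasses fine-tuning and synthetic-data baselines, adapts flexibly to new tasks, and generalizes across recognition and generation.

Insight: Task diversity, rather than dataset scale, drives generalization; visual in-context learning offers an efficient route to personalized vision tasks.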

Abstract: Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision – tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods are confined to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.

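[257] Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding cs.CV

Bingkui Tong, Jiaer Xia, Kaiyang Zhou

TL;DR: The paper proposes Layer Contrastive Decoding (LayerCD), which reduces hallucination in multimodal large language models (MLLMs) by contrasting the output distributions produced from vision-encoder features at different layers to filter out inconsistent generations.

Details

Motivation: MLLMs perceive and reason well, but they often hallucinate outputs that are inconsistent with the input image, including inaccurate objects, attributes, and relations.

Result: On two hallucination benchmarks, LayerCD significantly outperforms current state-of-the-art methods.

Insight: Shallow visual features are more likely to trigger hallucination, while deep features are better suited to high-level reasoning.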

Abstract: Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations – generating outputs that are linguistically coherent but inconsistent with the context of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate as they only capture biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder, respectively. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms the current state of the art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD.
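The contrast between the two output distributions can be sketched with the generic contrastive-decoding rule used in earlier decoding work. The abstract does not state LayerCD's exact formula, so the scoring rule, the plausibility cutoff, and the `alpha` and `beta` values below are assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_scores(deep_logits, shallow_logits, alpha=1.0, beta=0.1):
    """Score tokens by amplifying what deep features support over shallow ones.

    deep_logits / shallow_logits: next-token logits obtained when the model is
    fed deep-layer or shallow-layer vision-encoder features, respectively.
    Tokens whose deep probability is below beta * max probability are masked,
    so the contrast cannot promote tokens the deep branch finds implausible.
    """
    deep = log_softmax(deep_logits)
    shallow = log_softmax(shallow_logits)
    cutoff = math.log(beta) + max(deep)
    return [
        (1 + alpha) * d - alpha * s if d >= cutoff else float("-inf")
        for d, s in zip(deep, shallow)
    ]

# Hypothetical 3-token vocabulary: token 1 is slightly preferred by the deep
# branch but strongly preferred by the shallow (hallucination-prone) branch.
scores = contrastive_scores([2.0, 2.1, -5.0], [0.0, 3.0, 4.0])
print(scores.index(max(scores)), scores)
```

In this toy example greedy decoding on the deep logits alone would pick token 1, while the contrast penalizes it for being driven by shallow features and selects token 0.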

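[258] GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs cs.CV | cs.AI | cs.LG

Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh, Arshia Soltani Moakhar, Basim Azam

TL;DR: GHOST is an automatic method for stress-testing multimodal large language models (MLLMs) by generating hallucination-inducing images; it achieves a high success rate and needs no human supervision.

Details

Motivation: MLLMs suffer from object hallucination (perceiving objects that are absent from the image), but existing studies rely on static benchmarks that cannot fully expose a model's hallucination vulnerabilities.

Result: Across a range of models (including GLM-4.1V-Thinking), GHOST achieves a hallucination success rate above 28%, compared with around 1% for prior data-driven discovery methods.

Insight: GHOST not only exposes hallucination vulnerabilities in MLLMs but can also supply fine-tuning data that improves reliability, serving as both a diagnostic and a corrective tool.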

Abstract: Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which preempts the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. Also, GHOST uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.

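[259] DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder cs.CV | cs.AI

Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu

TL;DR: DC-VideoGen is a post-training acceleration framework for video generation. It uses a deep compression video autoencoder and lightweight fine-tuning to make any pre-trained video diffusion model more efficient, substantially reducing inference latency.

Details

Motivation: Existing video generation models are bottlenecked by efficiency and compute, especially at high resolution. DC-VideoGen addresses this with deep compression and an efficient adaptation strategy.

Result: Adapting Wan-2.1-14B with DC-VideoGen yields up to 14.8x lower inference latency without compromising quality, and enables 2160x3840 video generation on a single GPU.

Insight: Deep compression plus efficient adaptation lowers compute cost while maintaining generation quality, making high-resolution video generation more practical.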

Abstract: We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.
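A back-of-the-envelope calculation shows why the compression ratios matter. The sketch assumes simple ceiling division along each axis, an 80-frame clip, and an 8x spatial baseline for comparison; none of these come from the paper, and real autoencoders may treat the first frame or padding differently.

```python
def latent_grid(frames, height, width, spatial, temporal):
    """Latent grid size (t, h, w, total positions) under ceil-division per axis."""
    t = -(-frames // temporal)   # ceil(frames / temporal)
    h = -(-height // spatial)
    w = -(-width // spatial)
    return t, h, w, t * h * w

# Hypothetical 80-frame clip at the 2160x3840 resolution quoted in the abstract.
baseline = latent_grid(80, 2160, 3840, spatial=8, temporal=4)    # assumed baseline
deep32 = latent_grid(80, 2160, 3840, spatial=32, temporal=4)
deep64 = latent_grid(80, 2160, 3840, spatial=64, temporal=4)
print(baseline, deep32, deep64)
print(baseline[3] / deep32[3], baseline[3] / deep64[3])
```

Because attention cost grows faster than linearly in the number of latent positions, shrinking the grid by this much is what makes single-GPU generation at such resolutions plausible.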

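[260] PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos cs.CV

Ting-Hsuan Liao, Haowen Liu, Yiran Xu, Songwei Ge, Gengshan Yang

TL;DR: PAD3R reconstructs deformable 3D objects from casually captured monocular videos. By combining generative priors with differentiable rendering, it handles large deformations and camera motion in long video sequences.

Details

Motivation: Existing methods struggle with large deformations, camera motion, and limited view coverage in long videos; a more robust and general solution is needed.

Result: Experiments show that PAD3R is robust and generalizes well across challenging scenarios.

Insight: PAD3R shows the potential of combining generative priors with differentiable rendering for dynamic scene understanding and 3D content creation, and suggests a new approach to complex deformation.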

Abstract: We present PAD3R, a method for reconstructing deformable 3D objects from casually captured, unposed monocular videos. Unlike existing approaches, PAD3R handles long video sequences featuring substantial object deformation, large-scale camera movement, and limited view coverage that typically challenge conventional systems. At its core, our approach trains a personalized, object-centric pose estimator, supervised by a pre-trained image-to-3D model. This guides the optimization of deformable 3D Gaussian representation. The optimization is further regularized by long-term 2D point tracking over the entire input video. By combining generative priors and differentiable rendering, PAD3R reconstructs high-fidelity, articulated 3D representations of objects in a category-agnostic way. Extensive qualitative and quantitative results show that PAD3R is robust and generalizes well across challenging scenarios, highlighting its potential for dynamic scene understanding and 3D content creation.

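[261] PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images cs.CV

Shuoshuo Zhang, Zijian Li, Yizhen Zhang, Jingjing Fu, Lei Song

TL;DR: PixelCraft is a multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images; its dynamic three-stage workflow addresses the limitations of existing methods.

Details

Motivation: Existing methods for structured images are limited by low-fidelity image processing and linear reasoning patterns, which reduces the effectiveness of visual reasoning.

Result: PixelCraft performs strongly on chart and geometry benchmarks and significantly improves the visual reasoning of MLLMs.

Insight: Multi-agent collaboration and high-fidelity image processing can significantly improve visual reasoning on structured images.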

Abstract: Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and linear, rigid reasoning patterns, limiting their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.

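[262] FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation cs.CV

Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan

TL;DR: FlashI2V uses Fourier-guided latent shifting to prevent conditional image leakage in image-to-video generation, improving performance on out-of-domain data.

Details

Motivation: Existing image-to-video (I2V) methods are prone to conditional image leakage, which degrades performance with symptoms such as slow motion and color inconsistency.

Result: FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P.

Insight: Preventing conditional image leakage markedly improves the out-of-domain performance of I2V models, and a smaller model can achieve better results.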

Abstract: In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Github page: https://pku-yuangroup.github.io/FlashI2V/

[263] Visual Jigsaw Post-Training Improves MLLMs cs.CVPDF

Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu

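TL;DR: The paper proposes Visual Jigsaw, a self-supervised post-training framework that strengthens the visual understanding of multimodal large language models (MLLMs) through visual jigsaw tasks.

Details

Motivation: Current MLLM post-training paradigms are predominantly text-centric, with visual inputs used only to extract sparse cues. Visual understanding is a core MLLM capability, however, and calls for training that focuses on vision rather than relying on text as a mediator.

Result: Experiments across images, videos, and 3D data show substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding.

Insight: The work demonstrates the potential of vision-centric self-supervised tasks in MLLM post-training and offers new ideas for future vision-centric designs.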

Abstract: Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs’ intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches exist in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned and shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/
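
The ordering task described in this abstract (partition, shuffle, answer with a permutation) can be illustrated with a small sketch. The function names, the comma-separated answer format, and the binary exact-match reward are assumptions made for illustration; the abstract does not specify the authors' answer format or reward shaping.

```python
import random

def make_jigsaw(pieces, seed=0):
    """Shuffle pieces and return (shuffled, answer).

    answer[i] is the position in `shuffled` that holds original piece i,
    so reading shuffled[answer[0]], shuffled[answer[1]], ... restores the input.
    """
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)                      # order[j] = original index of shuffled piece j
    shuffled = [pieces[i] for i in order]
    answer = [order.index(i) for i in range(len(pieces))]
    return shuffled, answer

def parse_permutation(text, n):
    """Parse an answer such as '2, 0, 1'; return None if it is not a permutation of 0..n-1."""
    try:
        perm = [int(tok) for tok in text.replace(" ", "").split(",")]
    except ValueError:
        return None
    return perm if sorted(perm) == list(range(n)) else None

def jigsaw_reward(text, answer):
    """Hypothetical verifiable reward: 1.0 for the exact permutation, else 0.0."""
    perm = parse_permutation(text, len(answer))
    return 1.0 if perm == answer else 0.0
```

Because the correct permutation is known from the shuffle itself, the reward can be checked automatically without human labels, which is the property the abstract highlights.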

cs.CL [Back]

[264] AccessEval: Benchmarking Disability Bias in Large Language Models cs.CL | cs.AI | cs.CYPDF

Srikant Panda, Amit Agarwal, Hitesh Laxmichand Patel

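TL;DR: AccessEval is a benchmark for evaluating bias in large language models in disability contexts, covering 21 models, 6 domains, and 9 disability types. The study finds that responses to disability-related queries are generally more negative, more stereotyped, and less accurate.

Details

Motivation: Large language models deployed across diverse applications may exhibit bias against people with disabilities, yet systematic evaluation tools are lacking. The paper aims to fill this gap by exposing model bias through queries drawn from real-life contexts.

Result: Responses to disability-related queries are more negative, more stereotyped, and more error-prone, with the effect most pronounced for disabilities affecting hearing, speech, and mobility.

Insight: AccessEval reveals bias embedded in model behavior, underscores the importance of including real user impact in technical evaluation, and points to directions for future bias mitigation.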

Abstract: Large Language Models (LLMs) are increasingly deployed across diverse domains but often exhibit disparities in how they handle real-life queries. To systematically investigate these effects within various disability contexts, we introduce AccessEval (Accessibility Evaluation), a benchmark evaluating 21 closed- and open-source LLMs across 6 real-world domains and 9 disability types using paired Neutral and Disability-Aware Queries. We evaluated model outputs with metrics for sentiment, social perception, and factual accuracy. Our analysis reveals that responses to disability-aware queries tend to have a more negative tone, increased stereotyping, and higher factual error compared to neutral queries. These effects show notable variation by domain and disability type, with disabilities affecting hearing, speech, and mobility disproportionately impacted. These disparities reflect persistent forms of ableism embedded in model behavior. By examining model performance in real-world decision-making contexts, we better illuminate how such biases can translate into tangible harms for disabled users. This framing helps bridge the gap between technical evaluation and user impact, reinforcing the importance of bias mitigation in day-to-day applications. Our dataset is publicly available at: https://huggingface.co/datasets/Srikant86/AccessEval

[265] RAR$^2$: Retrieval-Augmented Medical Reasoning via Thought-Driven Retrieval cs.CLPDF

Kaishuai Xu, Wenjun Hou, Yi Cheng, Wenjie Li

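TL;DR: RAR$^2$ is a joint learning framework that constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation, improving performance on medical reasoning tasks.

Details

Motivation: Large language models (LLMs) show promise on medical tasks, but conventional retrieval-augmented generation (RAG) is of limited help on questions that require complex reasoning because it does not explicitly model the reasoning process.

Result: Across several biomedical question answering datasets, RAR$^2$ outperforms RAG baselines with or without fine-tuning.

Insight: Explicitly modeling the reasoning process is key to improving retrieval and generation on medical tasks; the mixed preference pairs and DPO training are the main innovations.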

Abstract: Large Language Models (LLMs) have shown promising performance on diverse medical benchmarks, highlighting their potential in supporting real-world clinical tasks. Retrieval-Augmented Generation (RAG) has emerged as a key approach for mitigating knowledge gaps and hallucinations by incorporating external medical information. However, RAG still struggles with complex medical questions that require intensive reasoning, as surface-level input often fails to reflect the true knowledge needs of the task. Existing methods typically focus on refining queries without explicitly modeling the reasoning process, limiting their ability to retrieve and integrate clinically relevant knowledge. In this work, we propose RAR$^2$, a joint learning framework that improves both Reasoning-Augmented Retrieval and Retrieval-Augmented Reasoning. RAR$^2$ constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. We build a training dataset of mixed preference pairs and apply Direct Preference Optimization (DPO) to train the model. Moreover, we design two test-time scaling strategies to explore the boundaries of our framework. Experiments demonstrate the effectiveness of RAR$^2$ across several biomedical question answering datasets, outperforming RAG baselines with or without fine-tuning.
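
The abstract states that the model is trained with Direct Preference Optimization on mixed preference pairs. As a reminder of what that objective computes for one pair, here is a minimal sketch of the standard per-pair DPO loss. The function name, the scalar log-probability inputs, and the default `beta` are illustrative assumptions, not details taken from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the policy (logp_*) and under
    the frozen reference model (ref_logp_*). beta=0.1 is an assumed default.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference and falls as the policy shifts probability toward the preferred response.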

Sadia Abdulhalim, Muaz Albaghdadi, Moshiur Farazi

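TL;DR: The paper proposes Dynamic Attention Fusion (DAF), a framework that combines text embeddings from a pretrained language model with acoustic features from a speech encoder and uses an adaptive attention mechanism to weight the modalities dynamically, improving sentiment analysis performance.

Details

Motivation: Traditional sentiment analysis relies on text alone and overlooks non-verbal cues such as voice, so it captures emotion incompletely. The authors therefore propose a dynamic multimodal fusion approach that models sentiment more fully.

Result: DAF delivers notable gains in F1-score and lower prediction error, and ablation studies support the importance of the dynamic weighting strategy.

Insight: Dynamic weight allocation is key to multimodal sentiment analysis: it better models emotionally complex inputs and provides a more robust foundation for affective computing applications.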

Abstract: Traditional sentiment analysis has long been a unimodal task, relying solely on text. This approach overlooks non-verbal cues such as vocal tone and prosody that are essential for capturing true emotional intent. We introduce Dynamic Attention Fusion (DAF), a lightweight framework that combines frozen text embeddings from a pretrained language model with acoustic features from a speech encoder, using an adaptive attention mechanism to weight each modality per utterance. Without any finetuning of the underlying encoders, our proposed DAF model consistently outperforms both static fusion and unimodal baselines on a large multimodal benchmark. We report notable gains in F1-score and reductions in prediction error and perform a variety of ablation studies that support our hypothesis that the dynamic weighting strategy is crucial for modeling emotionally complex inputs. By effectively integrating verbal and non-verbal information, our approach offers a more robust foundation for sentiment prediction and carries broader impact for affective computing applications – from emotion recognition and mental health assessment to more natural human computer interaction.

[267] Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models cs.CL | cs.AI | cs.LG | stat.ML | I.2.6; I.2.7PDF

Sasha Cui, Zhongren Chen

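TL;DR: The paper proposes Painless Activation Steering (PAS), a lightweight, fully automated post-training method for adjusting the behavior of large language models without hand-crafted prompts or feature annotation, improving controllability and efficiency.

Details

Motivation: Existing weight-based and prompt-based steering each have drawbacks: the former is time-consuming and expensive, and the latter is imprecise and requires extensive manual trial and error. Activation steering (AS) is a cheaper, faster, and more controllable alternative, but its reliance on hand-crafted prompt pairs or feature annotation limits its practicality.

Result: Experiments on 18 tasks show that PAS improves performance on behavior tasks (such as bias, morality, and alignment) but has limited effect on intelligence-oriented tasks. The introspective variant iPAS yields the largest gains on some tasks (for example, 34.8% on Alignment).

Insight: AS has potential for post-training, but its usefulness depends on the task type. PAS offers an automated route to practical AS and can be combined with methods such as ICL and SFT.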

Abstract: Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them more inconvenient than the plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.

[268] MIRAGE: Multi-hop Reasoning with Ambiguity Evaluation for Illusory Questions cs.CL | cs.AIPDF

Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella

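TL;DR: The paper introduces MIRAGE, a benchmark for evaluating ambiguity in multi-hop reasoning. Current LLMs perform poorly on it, and the authors propose the CLARION framework as a baseline.

Details

Motivation: Real-world multi-hop question answering (QA) often involves ambiguity, which existing large language models (LLMs) handle poorly. The authors intend MIRAGE to drive research on the combination of ambiguity and multi-hop reasoning.

Result: Experiments show that existing LLMs struggle on MIRAGE and that CLARION significantly outperforms existing approaches.

Insight: Ambiguity combined with multi-hop reasoning is a distinct and significant challenge that calls for more adaptive and robust reasoning systems.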

Abstract: Real-world Multi-hop Question Answering (QA) often involves ambiguity that is inseparable from the reasoning process itself. This ambiguity creates a distinct challenge, where multiple reasoning paths emerge from a single question, each requiring independent resolution. Since each sub-question is ambiguous, the model must resolve ambiguity at every step. Thus, answering a single question requires handling multiple layers of ambiguity throughout the reasoning chain. We find that current Large Language Models (LLMs) struggle in this setting, typically exploring wrong reasoning paths and producing incomplete answers. To facilitate research on multi-hop ambiguity, we introduce MultI-hop Reasoning with AmbiGuity Evaluation for Illusory Questions (MIRAGE), a benchmark designed to analyze and evaluate this challenging intersection of ambiguity interpretation and multi-hop reasoning. MIRAGE contains 1,142 high-quality examples of ambiguous multi-hop questions, categorized under a taxonomy of syntactic, general, and semantic ambiguity, and curated through a rigorous multi-LLM verification pipeline. Our experiments reveal that even state-of-the-art models struggle on MIRAGE, confirming that resolving ambiguity combined with multi-step inference is a distinct and significant challenge. To establish a robust baseline, we propose CLarifying Ambiguity with a Reasoning and InstructiON (CLARION), a multi-agent framework that significantly outperforms existing approaches on MIRAGE, paving the way for more adaptive and robust reasoning systems.

[269] ML2B: Multi-Lingual ML Benchmark For AutoML cs.CLPDF

Ekaterina Trofimova, Zosia Shamina, Maria Selifanova, Artem Zaitsev, Remi Savchuk

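TL;DR: ML2B is the first benchmark for multilingual ML code generation. It comprises 30 Kaggle competitions translated into 13 languages and reveals a 15-45% performance drop on non-English tasks.

Details

Motivation: Existing ML code generation benchmarks are mostly restricted to English and overlook the global, multilingual nature of ML research and practice; ML2B addresses this gap.

Result: Performance degrades by 15-45% on non-English tasks, highlighting the challenges of multilingual representation learning.

Insight: Multilingual ML code generation shows substantial cross-lingual performance differences, and representation learning needs further research.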

Abstract: Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.

[270] EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation cs.CLPDF

Kai Zhang, Christopher Malon, Lichao Sun, Martin Renqiang Min

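TL;DR: EditGRPO is a mixed-policy reinforcement learning algorithm that optimizes chest X-ray report generation with clinically motivated rewards. It combines on-policy exploration with off-policy corrections and outperforms SFT and vanilla GRPO baselines across multiple datasets.

Details

Motivation: The supervised fine-tuning objective of existing MLLMs is not explicitly aligned with clinical efficacy, so a method is needed to optimize the clinical accuracy of generated reports.

Result: An average improvement of 3.4% across the CheXbert, GREEN, Radgraph, and RATEScore metrics, and an average gain of 5.9% on unseen datasets.

Insight: Mixed-policy RL that combines on-policy exploration with off-policy corrections can improve both the quality and the generalization of generated clinical reports.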

Abstract: Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models (MLLMs), have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning (RL) algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B MLLM initialized with supervised fine-tuning (SFT), EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in CheXbert, GREEN, Radgraph, and RATEScore metrics across four major chest X-ray report generation datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.
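
For readers unfamiliar with the vanilla GRPO baseline named in this abstract, the sketch below shows the group-relative advantage that standard GRPO computes from the rewards of several rollouts for the same prompt. It is a generic illustration only: the function name and `eps` value are assumptions, and the abstract does not describe how EditGRPO's edited rollouts enter this computation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    rewards: scalar rewards for rollouts sampled from the same prompt.
    eps: small assumed constant that avoids division by zero when all
         rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]
```

Rollouts that score above the group mean receive positive advantages and are reinforced; when every rollout scores the same, all advantages are zero and the group contributes no learning signal.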

[271] Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning cs.CLPDF

Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen

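TL;DR: The paper proposes Critique Reinforcement Learning (CRL), which combines reinforcement learning with critique learning to improve the performance of coding models.

Details

Motivation: Existing reinforcement learning methods focus mainly on generating responses and lack explicit mechanisms for critique or reflection. The study introduces CRL to strengthen the model's critique and reasoning abilities.

Result: Critique-Coder outperforms RL-only models on multiple benchmarks, with standout results on LiveCodeBench (v5) and on BBEH logic reasoning tasks.

Insight: CRL improves not only coding ability but also general reasoning, suggesting that it transfers across a broad range of tasks. CRL is a strong complement to standard RL.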

Abstract: Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our Critique-Coder-8B can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.
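
Two mechanics in this abstract are simple enough to sketch: the label-match reward and the 20% data substitution. The function names, the random sampling strategy, and the `("rl" | "crl", example)` tagging are assumptions for illustration; the abstract does not say how the substituted examples are chosen.

```python
import random

def crl_reward(predicted_label: bool, gold_label: bool) -> float:
    """Reward depends only on whether the critique's final judgment matches the gold label."""
    return 1.0 if predicted_label == gold_label else 0.0

def mix_datasets(rl_data, crl_data, crl_fraction=0.2, seed=0):
    """Substitute a fraction of the RL examples with CRL examples, keeping the total size.

    Assumption: the replaced examples are picked uniformly at random.
    """
    rng = random.Random(seed)
    n_crl = round(len(rl_data) * crl_fraction)
    if n_crl > len(crl_data):
        raise ValueError("not enough CRL examples for the requested fraction")
    kept = rng.sample(rl_data, len(rl_data) - n_crl)
    swapped = rng.sample(crl_data, n_crl)
    mixed = [("rl", x) for x in kept] + [("crl", x) for x in swapped]
    rng.shuffle(mixed)
    return mixed
```

Since the reward ignores the critique text and checks only the final True/False label, it is as easy to verify as a unit-test reward in standard coding RL.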

[272] ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents cs.CLPDF

Hwan Chang, Yonghyun Jun, Hwanhee Lee

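TL;DR: ChatInject is an attack on LLM agents that formats malicious payloads to mimic structured chat templates, exploiting the model's instruction-following tendencies, and achieves higher attack success rates than traditional prompt injection.

Details

Motivation: As LLM-based agents increasingly interact with external environments, indirect prompt injection has become a new attack surface. Prior research has focused on plain-text attacks, while the exploitation of structured chat templates remains underexplored.

Result: Experiments show that ChatInject substantially raises the average attack success rate (from 5.18% to 32.05% on AgentDojo), with the multi-turn variant performing even better (52.33% on InjecAgent). Existing prompt-based defenses are largely ineffective against it.

Insight: The work exposes the vulnerability of current LLM agent systems to structured chat templates and multi-turn dialogue, and underscores the need for new defenses against such attacks.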

Abstract: The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs’ dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model’s inherent instruction-following tendencies. Building on this foundation, we develop a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.

[273] Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems cs.CL | cs.IR | cs.LG | H.3.3; I.2.7; I.2.6PDF

Kai Hua, Zhiyuan Feng, Chongyang Tao, Rui Yan, Lu Zhang

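TL;DR: The paper proposes RSM-DCK, a multi-turn response selection model that detects the relevant parts of the context and knowledge collection to improve retrieval-based dialogue systems. Pre-selection and post-selection mechanisms let the model match response candidates more effectively.

Details

Motivation: Existing retrieval-based dialogue systems typically use all context and knowledge content to match response candidates, ignoring that different parts differ in importance, which degrades performance.

Result: On two benchmark datasets, the model outperforms existing methods and effectively detects relevant content.

Insight: Dynamically selecting and fusing the relevant parts of the context and knowledge collection can substantially improve response selection in dialogue systems.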

Abstract: Recently, knowledge-grounded conversations in the open domain have gained great attention from researchers. Existing works on retrieval-based dialogue systems have devoted tremendous effort to using neural networks to build a matching model, where all of the context and knowledge contents are used to match the response candidate with various representation methods. In fact, different parts of the context and knowledge differ in importance for recognizing the proper response candidate, as many utterances are useless due to topic shift. Such excessive useless information in the context and knowledge can influence the matching process and lead to inferior performance. To address this problem, we propose a multi-turn Response Selection Model that can Detect the relevant parts of the Context and Knowledge collection (RSM-DCK). Our model first uses the recent context as a query to pre-select relevant parts of the context and knowledge collection at the word-level and utterance-level semantics. Further, the response candidate interacts with the selected context and knowledge collection respectively. In the end, the fused representation of the context and response candidate is utilized to post-select the relevant parts of the knowledge collection more confidently for matching. We test our proposed model on two benchmark datasets. Evaluation results indicate that our model achieves better performance than the existing methods, and can effectively detect the relevant context and knowledge for response selection.

[274] HEART: Emotionally-driven test-time scaling of Language Models cs.CL | cs.LGPDF

Gabriela Pinto, Palash Goyal, Yiwen Song, Souradip Chakraborty, Zifeng Wang

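TL;DR: HEART is a framework that improves language model performance on reasoning tasks through emotionally-driven prompts. It uses the six universal emotions to guide the model's self-correction and substantially improves accuracy, but its gains are inconsistent in verifier-free settings.

Details

Motivation: Current test-time scaling strategies for language models rely mainly on logical or structural refinement and overlook the potential of affective feedback. Psychological research shows that emotions can modulate cognitive performance, so the authors use emotionally-driven prompts to improve model reasoning.

Result: With an oracle verifier, HEART substantially improves reasoning depth and accuracy. In verifier-free settings the gains are inconsistent, which the authors identify as a key challenge for future work.

Insight: The next frontier in machine reasoning may lie not only in refining logic but also in understanding and leveraging the 'HEART' of the models; emotionally-driven prompts offer a new direction for improving reasoning.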

Abstract: Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies such as self-reflection primarily focus on logical or structural refinement. They do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART, a novel framework that uses emotionally-driven prompts for iterative self-correction. HEART provides feedback on a model’s incorrect response using a curated set of concise, emotionally charged phrases based on the six universal emotions categorized by Dr. Paul Ekman. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity’s Last Exam, and SimpleQA. Our results reveal a significant new phenomenon: when guided by an oracle verifier, this affective iteration protocol unlocks significantly deeper reasoning, leading to consistent and substantial increases in accuracy over state-of-the-art baselines with the same verifier. However, we also identify a critical bottleneck for practical deployment. In a verifier-free setting, it struggles to harness these gains consistently, highlighting this as a key challenge for future work. Our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the 'HEART' of the models.
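
The verifier-guided iteration described in this abstract can be pictured as a simple loop. The sketch below is a hedged illustration: `model`, `verifier`, and `feedback_for` are hypothetical callables supplied by the caller, and the fixed rotation through emotions and the prompt layout are assumptions, since the abstract does not give the actual phrases or schedule.

```python
# Ekman's six basic emotions, as referenced in the abstract.
EMOTIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust"]

def heart_loop(question, model, verifier, feedback_for, max_iters=6):
    """Retry with emotionally toned feedback until the verifier accepts an answer.

    model(prompt) -> answer string               (hypothetical LLM call)
    verifier(answer) -> bool                     (oracle verifier in the paper's setting)
    feedback_for(emotion) -> short phrase        (hypothetical phrase lookup)
    Returns (answer, failed_attempts), or (None, max_iters) if nothing is accepted.
    """
    prompt = question
    for step in range(max_iters):
        answer = model(prompt)
        if verifier(answer):
            return answer, step
        # Assumed schedule: rotate through the emotions, one per failed attempt.
        emotion = EMOTIONS[step % len(EMOTIONS)]
        prompt = (f"{question}\nPrevious answer: {answer}\n"
                  f"Feedback: {feedback_for(emotion)}")
    return None, max_iters
```

The loop needs `verifier` to decide when to stop, which is why the verifier-free setting mentioned in the abstract is the harder case.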

[275] Infusing Theory of Mind into Socially Intelligent LLM Agents cs.CLPDF

EunJeong Hwang, Yuwei Yin, Giuseppe Carenini, Peter West, Vered Shwartz

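TL;DR: The paper proposes a method for infusing Theory of Mind (ToM) into socially intelligent large language model (LLM) agents: explicitly generating mental states during dialogue improves goal achievement and dialogue quality.

Details

Motivation: Theory of Mind (ToM) is central to human social intelligence, yet existing chatbots and LLM agents typically lack it. The authors introduce ToM to improve how LLM agents perform and achieve goals in social dialogue.

Result: In experiments, ToMA exhibits more strategic, goal-oriented reasoning, supports long-horizon adaptation, maintains better relationships with partners, and outperforms a range of baselines.

Insight: Explicitly modeling mental states helps LLM agents achieve social goals more effectively while maintaining good relationships, offering a new direction for building more socially intelligent LLM agents.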

Abstract: Theory of Mind (ToM), an understanding of the mental states of others, is a key aspect of human social intelligence, yet chatbots and LLM-based social agents do not typically integrate it. In this work, we demonstrate that LLMs that explicitly use ToM get better at dialogue, achieving goals more effectively. After showing that simply prompting models to generate mental states between dialogue turns already provides significant benefit, we further introduce ToMAgent (ToMA), a ToM-focused dialogue agent. ToMA is trained by pairing ToM with dialogue lookahead to produce mental states that are maximally useful for achieving dialogue goals. Experiments on the Sotopia interactive social evaluation benchmark demonstrate the effectiveness of our method over a range of baselines. Comprehensive analysis shows that ToMA exhibits more strategic, goal-oriented reasoning behaviors, which enable long-horizon adaptation, while maintaining better relationships with their partners. Our results suggest a step forward in integrating ToM for building socially intelligent LLM agents.

[276] Extract-0: A Specialized Language Model for Document Information Extraction cs.CL | cs.AIPDF

Henrique Godoy

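TL;DR: Extract-0 is a 7B-parameter specialized language model for document information extraction that outperforms much larger general-purpose models. Its training combines synthetic data generation, LoRA fine-tuning, and GRPO reinforcement learning, and it performs well across diverse document extraction tasks.

Details

Motivation: Document information extraction requires efficient and accurate models, but general-purpose large language models are computationally expensive and not necessarily the best performers. The author therefore proposes a specially optimized model and training method.

Result: Extract-0 achieves a mean reward of 0.573 on 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459).

Insight: Task-specific optimization can substantially improve performance while reducing compute requirements, which shows the potential of specialized models for specific domains.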

Abstract: This paper presents Extract-0, a 7-billion-parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameter-efficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resources.
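
The 0.53% figure quoted in this abstract follows directly from the two parameter counts it gives. A quick check, using only the numbers stated above (the variable names are illustrative):

```python
# Figures quoted in the abstract.
trainable_params = 40.4e6   # parameters updated during LoRA fine-tuning
total_params = 7.66e9       # total parameters in the model

fraction = trainable_params / total_params
percent = round(100 * fraction, 2)
print(f"{percent}% of weights are trainable")  # prints: 0.53% of weights are trainable
```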

[277] Large language models management of medications: three performance analyses cs.CL | cs.AIPDF

Kelli Henry, Steven Xu, Kaitlin Blotske, Moriah Cargile, Erin F. Barreto

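TL;DR: The paper evaluates GPT-4o on three medication management tasks: matching drug names to formulations, identifying drug-drug interactions, and generating medication order sentences. GPT-4o performs poorly on these tasks, underscoring the importance of domain-specific training and evaluation frameworks.

Details

Motivation: Although large language models (LLMs) may be useful in medical diagnosis, their consistency and accuracy on medication management tasks have not been adequately evaluated. The paper aims to fill this gap.

Result: GPT-4o performs poorly on drug-formulation matching (only 49% of medications fully correct). Its accuracy at identifying drug-drug interactions improves with search augmentation (54.7% vs. 69.2%), but web search worsens performance when no interaction exists. In the medication order task, 34.2% of order sentences contain errors.

Insight: General-purpose LLMs have limited performance in specialized domains such as medication management and need domain-specific data and evaluation frameworks.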

Abstract: Background: Large language models (LLMs) can be useful in diagnosing medical conditions, but few studies have evaluated their consistency in recommending appropriate medication regimens. The purpose of this evaluation was to test GPT-4o on three medication benchmarking tests including mapping a drug name to its correct formulation, identifying drug-drug interactions using both its internal knowledge and using a web search, and preparing a medication order sentence after being given the medication name. Methods: Using GPT-4o, three experiments were completed. Accuracy was quantified by computing cosine similarity on TF-IDF vectors, normalized Levenshtein similarity, and ROUGE-1/ROUGE-L F1 between each response and its reference string or by manual evaluation by clinicians. Results: GPT-4o performed poorly on drug-formulation matching, with frequent omissions of available drug formulations (mean 1.23 per medication) and hallucinations of formulations that do not exist (mean 1.14 per medication). Only 49% of tested medications were correctly matched to all available formulations. Accuracy was decreased for medications with more formulations (p<0.0001). GPT-4o was also inconsistent at identifying drug-drug interactions, although it had better performance with the search-augmented assessment compared to its internal knowledge (54.7% vs. 69.2%, p=0.013). However, allowing a web-search worsened performance when there was no drug-drug interaction (median % correct 100% vs. 40%, p<0.001). Finally, GPT-4o performed moderately with preparing a medication order sentence, with only 65.8% of medication order sentences containing no medication or abbreviation errors. Conclusions: Model performance was overall poor for all tests. This highlights the need for domain-specific training through clinician-annotated datasets and a comprehensive evaluation framework for benchmarking performance.
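
Two of the string metrics named in the Methods can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' evaluation code: the normalization (one minus edit distance divided by the longer string's length) and the lowercase whitespace tokenization for ROUGE-1 are common choices that the abstract does not confirm.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 with lowercase whitespace tokenization (an assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Both scores fall in [0, 1], with 1 meaning the response string is identical to the reference, so they can be averaged across responses alongside the TF-IDF cosine similarity.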

[278] ADAM: A Diverse Archive of Mankind for Evaluating and Enhancing LLMs in Biographical Reasoning cs.CL | cs.AI | cs.CV | cs.IR | cs.LGPDF

Jasin Cekinmez, Omid Ghahroodi, Saad Fowad Chandle, Dhiman Gupta, Ehsaneddin Asgari

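TL;DR: ADAM is the first framework to systematically evaluate and improve large language models in biographical reasoning. It comprises the multilingual, multimodal dataset AdamDB and the Bloom's-taxonomy-based benchmark AdamBench, and proposes AdamRAG to address hallucinations.

Details

Motivation: Biographical reasoning is an important capability for large language models but has not been studied systematically. ADAM aims to fill this gap with a multilingual, multimodal, and culturally diverse evaluation framework.

Result: AdamRAG substantially improves open-source models and modestly benefits closed-source ones; popularity strongly affects accuracy, and multimodal input brings smaller, less consistent improvements.

Insight: 1) Retrieval-augmented generation is especially important for biographical reasoning; 2) popularity is a key factor in model accuracy; 3) multimodal input has limited effect.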

Abstract: We introduce ADAM (A Diverse Archive of Mankind), a framework for evaluating and improving multimodal large language models (MLLMs) in biographical reasoning. To the best of our knowledge, this is the first work to systematically examine LLM capabilities in biography, a critical yet underexplored dimension of factual knowledge. At its core, AdamDB is a multilingual and multimodal dataset covering over 4 million individuals across geography, time, and profession, while AdamBench provides cognitively structured evaluations based on Bloom’s taxonomy, spanning six reasoning levels in both English and native languages. To address hallucinations, particularly for lesser-known individuals, we propose AdamRAG, a retrieval-augmented generation system tailored to biographical contexts. Experiments show that AdamRAG substantially improves open-source models and modestly benefits closed-source ones, with the largest gains on lower-order reasoning. Popularity strongly mediates accuracy, and multimodal input via face images offers smaller, less consistent improvements than retrieval. ADAM establishes the first benchmark and framework for cognitively, culturally, and multimodally grounded biographical evaluation, advancing the development of multilingual, accurate, and hallucination-resistant MLLMs.

[279] Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents cs.CL | cs.AIPDF

Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai

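TL;DR: ReMemR1 is a memory-augmented LLM agent designed for long-context problems. Using callback-enhanced memory and Reinforcement Learning with Multi-Level Rewards (RLMLR), it addresses the information loss and lack of non-linear reasoning in existing “memorize while reading” methods.

Details

Motivation: To address the information loss and difficulty with non-linear reasoning that large language models face on long-context problems because of forward-only processing and memory overwriting.

Result: Significantly outperforms existing memory-based approaches on long-document QA tasks.

Insight: Non-linear memory access and multi-level reward signals are key to improving long-context reasoning.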

Abstract: Large language models face challenges in long-context question answering, where key evidence of a query may be dispersed across millions of tokens. Existing works equip large language models with a memory corpus that is dynamically updated during a single-pass document scan, also known as the “memorize while reading” methods. While this approach scales efficiently, it suffers from irreversible forward-only processing, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, a memory-augmented agent with callback-enhanced memory that allows selective retrieval from the entire memory history, enabling non-linear reasoning and revisiting of early evidence. To further strengthen training, we propose Reinforcement Learning with Multi-Level Rewards (RLMLR), which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support multi-hop memory utilization. Experiments on long-document QA show significant gains over existing memory-based approaches, which validates ReMemR1 as an effective solution for long-context reasoning agents.

[280] From Evidence to Trajectory: Abductive Reasoning Path Synthesis for Training Retrieval-Augmented Generation Agents cs.CL | cs.AIPDF

Muzhi Li, Jinhu Qi, Yihong Wu, Minghao Zhao, Liheng Ma

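TL;DR: The paper proposes EviPath, a paradigm that synthesizes evidence-anchored reasoning paths to improve the development of retrieval-augmented generation (RAG) agents.

Details

Motivation: Existing approaches to training RAG agents lack process-level supervision and struggle to guide task decomposition, retriever invocation, and stepwise decision-making. Reinforcement learning suffers from sparse rewards and the limited reasoning ability of LLMs, and existing data synthesis methods cannot model environment interactions.

Result: Experiments show that an 8B-parameter model trained with EviPath significantly outperforms existing methods on open-domain question answering, with an absolute EM gain of 14.7%.

Insight: Synthetic data can explicitly model reasoning paths and environment interactions, compensating for the limited ability of LLMs on complex tasks.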

Abstract: The development of retrieval-augmented generation (RAG) agents is hindered by the lack of process-level supervision to effectively guide agentic capabilities like task decomposition, retriever invocation, and stepwise decision-making. While reinforcement learning offers a potential solution, it suffers from sparse rewards and the limited reasoning capabilities of large language models (LLMs). Meanwhile, existing data synthesis methods only produce chain-of-thought rationales and fail to model environmental interactions. In this paper, we propose EviPath, an evidence-anchored reasoning path synthesis paradigm for RAG agent development. EviPath comprises: (i) Abductive Subtask Planning, which decomposes the problem into sub-questions and iteratively plans an optimal solution path based on the dependencies between them; (ii) Faithful Sub-question Answering, which uses supporting evidence to construct a proxy environment to generate reasoning thoughts and answers for each sub-question; and (iii) Conversational Fine-Tuning, which formats the complete agent-environment interaction trajectory into a dialogue format suitable for Supervised Fine-Tuning. EviPath allows LLMs to learn complex reasoning and tool-use capabilities directly from synthesized data. Extensive experiments on widely used question-answering benchmarks show that an 8B parameter model trained with EviPath-synthesized data significantly and consistently outperforms state-of-the-art baselines, with a double-digit absolute EM gain of 14.7% in open-domain question answering.
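
The EM (exact match) figure quoted above is a string-level metric. As a rough illustration, the sketch below computes EM with the normalization commonly used for open-domain QA (lowercasing, stripping punctuation and English articles). The function names are illustrative, and the paper's exact evaluation script may normalize differently.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Mean exact match over a dataset, expressed as a percentage."""
    hits = [exact_match(p, r) for p, r in zip(predictions, references)]
    return 100.0 * sum(hits) / len(hits)
```

An absolute gain is a difference in percentage points: under this reading, a baseline at some EM value x would move to x + 14.7, not to x multiplied by 1.147.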

[281] The Geometry of Creative Variability: How Credal Sets Expose Calibration Gaps in Language Models cs.CLPDF

Esteban Garces Arias, Julian Rodemann, Christian Heumann

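TL;DR: The paper proposes a geometric framework based on credal sets to quantify the uncertainty of language models on creative tasks, revealing their shortfalls in capturing human creative variation.

Details

Motivation: Understanding the uncertainty of large language models on creative tasks is an important challenge, especially when multiple valid outputs exist.

Result: The best model-human calibration reaches only 0.434 (Gemma-2B, temperature 0.7). The choice of decoding strategy contributes 39.4% to 72.0% of total epistemic uncertainty. Model scale correlates only weakly with calibration quality.

Insight: Decoding strategy has a large effect on uncertainty, whereas model scale and tuning (base vs. instruction-tuned) have limited effect on calibration quality. The framework offers practical guidance for improving human-AI creative alignment.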

Abstract: Understanding uncertainty in large language models remains a fundamental challenge, particularly in creative tasks where multiple valid outputs exist. We present a geometric framework using credal sets - convex hulls of probability distributions - to quantify and decompose uncertainty in neural text generation, calibrated against human creative variation. Analyzing 500 creative writing prompts from the WritingPrompts dataset with 10 unique human continuations each, we evaluate four language models across five decoding strategies, generating 100,000 stories. Our credal set analysis reveals substantial gaps in capturing human creative variation, with the best model-human calibration reaching only 0.434 (Gemma-2B with temperature 0.7). We decompose total uncertainty into epistemic and aleatoric components, finding that the choice of decoding strategy contributes 39.4% to 72.0% of total epistemic uncertainty. Model scale shows weak correlation with calibration quality and no significant difference exists between base and instruction-tuned models in calibration quality. Our geometric framework provides actionable insights for improving generation systems for human-AI creative alignment. We release our complete experimental framework.
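
To make the notion of a credal set concrete, here is a minimal sketch that treats a handful of categorical distributions as the vertices of a convex hull and reports, for each outcome, the lower and upper probability over that hull. The interval-width summary is an illustrative choice, not the calibration or decomposition measure used in the paper, and the numbers in the usage note are made up.

```python
def probability_bounds(vertices: list[list[float]]) -> list[tuple[float, float]]:
    """Lower and upper probability of each outcome over the credal set.

    The credal set is the convex hull of `vertices`. The probability of a
    single outcome is linear in the distribution, so its minimum and maximum
    over the hull are attained at the vertices themselves.
    """
    n_outcomes = len(vertices[0])
    bounds = []
    for k in range(n_outcomes):
        column = [dist[k] for dist in vertices]
        bounds.append((min(column), max(column)))
    return bounds


def imprecision(vertices: list[list[float]]) -> float:
    """Mean width of the probability intervals (0 when all vertices agree)."""
    bounds = probability_bounds(vertices)
    return sum(hi - lo for lo, hi in bounds) / len(bounds)
```

For instance, with two hypothetical distributions `[[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]]` (say, one per decoding strategy), the first outcome has bounds (0.2, 0.5); wider intervals indicate that the distributions in the set disagree more.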

[282] d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching cs.CLPDF

Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang

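TL;DR: d$^2$Cache is a training-free approximate KV cache framework that accelerates inference for diffusion-based large language models (dLLMs). It uses a two-stage fine-grained selection strategy with adaptive KV state updates, substantially improving inference speed and generation quality.

Details

Motivation: Because dLLMs rely on bidirectional attention, they cannot directly benefit from the standard KV cache, which makes inference inefficient. d$^2$Cache addresses this with an approximate caching mechanism.

Result: Experiments on the representative dLLMs LLaDA and Dream show that d$^2$Cache achieves substantial inference speedups and consistently improves generation quality.

Insight: d$^2$Cache extends KV caching to bidirectional-attention models, where the standard KV cache does not apply, and improves overall efficiency by updating KV states for selected tokens while reusing cached states for the rest.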

Abstract: Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
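
The general idea of an approximate KV cache (refresh a few positions per step, reuse the rest) can be sketched as follows. This is a toy illustration under stated assumptions, not the d$^2$Cache algorithm: the `scores` argument is a hypothetical stand-in for the paper's two-stage selection strategy, which the abstract does not specify, and `recompute_kv` is a placeholder for a real attention-layer computation.

```python
def refresh_kv_cache(cache, recompute_kv, scores, budget):
    """Recompute KV states for the `budget` highest-scoring positions only.

    cache: list of per-position KV states from the previous decoding step.
    recompute_kv: callable mapping a position index to a fresh KV state
        (placeholder for the model's actual KV computation).
    scores: per-position priority (hypothetical selection signal).
    Returns the updated cache and the sorted list of refreshed positions.
    """
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:budget])
    updated = list(cache)  # positions outside `selected` reuse cached states
    for pos in selected:
        updated[pos] = recompute_kv(pos)
    return updated, selected
```

The cost per step then scales with `budget` rather than with the full sequence length, which is where the speedup in such schemes comes from; the quality impact depends entirely on how well the selection rule picks the positions whose states have actually changed.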

[283] Non-Collaborative User Simulators for Tool Agents cs.CLPDF

Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo

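TL;DR: This paper proposes a non-collaborative user simulator architecture that simulates four categories of non-collaborative behavior seen in real-world users, to help develop and test tool agents in multi-turn dialogue.

Details

Motivation: Existing user simulators usually simulate only cooperative behavior, so they cannot train or test tool agents against real-world non-collaborative users.

Result: In experiments, the performance of existing tool agents degrades significantly with non-collaborative users, with problems such as hallucinations and dialogue breakdowns.

Insight: Simulating non-collaborative user behavior is essential for testing the robustness of tool agents and can help improve their real-world performance.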

Abstract: Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, which fails to train and test agents against non-collaborative users in the real world. To address this, we propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users. We provide detailed analyses of agents’ weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Ultimately, we contribute an easily extensible user simulation framework to help the research community develop tool agents and preemptively diagnose them under challenging real-world conditions within their own services. (ICLR 2026 conference submission, 19 Sept 2025, modified 25 Sept 2025; keywords: Tool Agent, User Simulator, Non-collaborative User, Dialogue Simulation.)

[284] Tagging the Thought: Unlocking Personalization Reasoning via Reinforcement Learning cs.CLPDF

Song Jin, Juntian Zhang, Yong Liu, Xun Zhang, Yufei Zhang

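TL;DR: The paper proposes TagPR, a framework that uses reinforcement learning to strengthen the personalization reasoning of large language models. It combines semantic tags with a composite reward signal and achieves substantial gains on a public benchmark.

Details

Motivation: Current LLMs excel at general reasoning but still fall short on personalization reasoning, such as analyzing user history and inferring preferences.

Result: An average improvement of 32.65% over the base model on the LaMP benchmark and a self-constructed dataset, validating the effectiveness of structured reasoning for personalization.

Insight: Structured, interpretable reasoning paths are key to unlocking genuine personalization capabilities in LLMs.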

Abstract: Recent advancements have endowed Large Language Models (LLMs) with impressive general reasoning capabilities, yet they often struggle with personalization reasoning - the crucial ability to analyze user history, infer unique preferences, and generate tailored responses. To address this limitation, we introduce TagPR, a novel training framework that significantly enhances an LLM’s intrinsic capacity for personalization reasoning through a tagging the thought approach. Our method first develops a data-driven pipeline to automatically generate and semantically label reasoning chains, creating a structured dataset that fosters interpretable reasoning. We then propose a synergistic training strategy that begins with Supervised Fine-Tuning (SFT) on this tagged data to establish foundational reasoning patterns, followed by a multi-stage reinforcement learning (RL) process. This RL phase is guided by a unique composite reward signal, which integrates tag-based constraints and a novel Personalization Reward Model with User Embeddings (PRMU) to achieve fine-grained alignment with user-specific logic. Extensive experiments on the public LaMP benchmark and a self-constructed dataset demonstrate that our approach achieves state-of-the-art results, delivering an average improvement of 32.65% over the base model across all tasks. Our work validates that structured, interpretable reasoning is a highly effective pathway to unlocking genuine personalization capabilities in LLMs.

[285] Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts cs.CLPDF

Guancheng Wan, Leixin Sun, Longxu Dou, Zitong Shi, Fang Wu

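TL;DR: The paper proposes a full-stack framework with three stages (diagnose, localize, align) to address reliability problems caused by instruction conflicts in multi-agent systems.

Details

Motivation: LLM-powered multi-agent systems perform well on complex tasks but tend to fail under instruction conflicts (e.g., system-user or agent-agent conflicts), and existing macro-level metrics make these failures hard to detect.

Result: On standard benchmarks, SAIL improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model fine-tuning.

Insight: The attention heads that resolve conflicts are concentrated in the middle layers, suggesting that localized optimization is an effective way to improve system reliability.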

Abstract: Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.
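
A token-weighted, DPO-style preference objective can be sketched as below. The standard DPO loss compares the policy-to-reference log-probability ratio of a preferred and a dispreferred response; here each token's contribution is scaled by a weight. How SAIL derives those weights from focal attention, and whether it normalizes them, is not stated in the abstract, so the `weights` inputs and the default `beta` are assumptions for illustration only.

```python
import math


def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Sum over tokens of w_t * (log pi_theta(y_t) - log pi_ref(y_t))."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))


def token_weighted_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss where each token's log-ratio is scaled by a weight.

    `chosen` and `rejected` are (policy_logps, ref_logps, weights) triples of
    equal-length lists. With all weights equal to 1 this reduces to standard DPO.
    """
    margin = beta * (weighted_log_ratio(*chosen) - weighted_log_ratio(*rejected))
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference on both responses, the margin is zero and the loss equals log 2; raising the weighted log-ratio of the chosen response relative to the rejected one lowers it.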

[286] From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs cs.CLPDF

Haonan Wang, Weida Liang, Zihang Fu, Nie Zheng, Yifan Zhang

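TL;DR: The paper examines why reasoning LLMs (RLMs) perform worse with few-shot CoT (chain-of-thought) prompting and proposes I2S, a method that turns demonstrations into explicit, reusable insights to improve reasoning performance.

Details

Motivation: RLMs, especially those trained with verifier-based reinforcement learning, perform worse with few-shot CoT. The authors aim to uncover the mechanisms behind this and propose a remedy.

Result: Experiments show that I2S and I2S+ outperform direct answering and test-time scaling baselines on multiple benchmarks, with gains even on GPT models (e.g., GPT-4.1 improves by 14.0% on AIME’25).

Insight: Two mechanisms drive the few-shot CoT degradation: semantic misguidance and strategy transfer failure. Extracting and refining insights lets the model use demonstrations effectively.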

Abstract: Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these findings, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME’25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via the insight-refine-solve framework.

[287] Global Beats, Local Tongue: Studying Code Switching in K-pop Hits on Billboard Charts cs.CLPDF

Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi

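TL;DR: This paper studies code-switching in K-pop songs, analyzing the use of English and Korean in particular. It shows how global market pressures shape lyric language choice and how performer identity and chart context relate to stylistic patterns.

Details

Motivation: The global success of K-pop is closely tied to code-switching in its lyrics. The paper investigates how linguistic strategies operate in K-pop songs that succeed on global charts.

Result: English dominates in K-pop songs, and code-switching is frequent. Gender differences are not significant, although female solo artists favor English more consistently. Songs on the US-focused Hot 100 rely more heavily on English.

Insight: Language choice reflects the global market strategy of K-pop and reveals the influence of performer identity and chart context on lyric style. This strategy helps strengthen the international appeal of the songs.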

Abstract: Code switching, particularly between Korean and English, has become a defining feature of modern K-pop, reflecting both aesthetic choices and global market strategies. This paper is a primary investigation into the linguistic strategies employed in K-pop songs that achieve global chart success, with a focus on the role of code-switching and English lyric usage. A dataset of K-pop songs that appeared on the Billboard Hot 100 and Global 200 charts from 2017 to 2025, spanning 14 groups and 8 solo artists, was compiled. Using this dataset, the proportion of English and Korean lyrics, the frequency of code-switching, and other stylistic features were analysed. It was found that English dominates the linguistic landscape of globally charting K-pop songs, with both male and female performers exhibiting high degrees of code-switching and English usage. Statistical tests indicated no significant gender-based differences, although female solo artists tend to favour English more consistently. A classification task was also performed to predict performer gender from lyrics, achieving macro F1 scores up to 0.76 using multilingual embeddings and handcrafted features. Finally, differences between songs charting on the Hot 100 versus the Global 200 were examined, suggesting that, while there is no significant gender difference in English, higher English usage may be more critical for success in the US-focused Hot 100. The findings highlight how linguistic choices in K-pop lyrics are shaped by global market pressures and reveal stylistic patterns that reflect performer identity and chart context.
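
The two headline measurements, the share of English lyrics and the frequency of code-switching, can be approximated with a simple script-based tokenizer. The sketch below is an assumption-laden simplification: it labels whitespace-separated tokens by Unicode script (Hangul syllables vs. ASCII letters) and counts language changes between adjacent tokens. The paper's actual pipeline and unit of analysis (token, line, or clause) are not described in the abstract.

```python
def token_language(token: str) -> str:
    """Label a token 'ko' if it contains a Hangul syllable, 'en' if it
    contains an ASCII letter, and 'other' otherwise."""
    if any("\uac00" <= ch <= "\ud7a3" for ch in token):
        return "ko"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "en"
    return "other"


def code_switch_stats(lyrics: str) -> dict:
    """English share of tokens and number of Korean/English switch points."""
    labels = [token_language(tok) for tok in lyrics.split()]
    labels = [lab for lab in labels if lab != "other"]
    # A switch point is any pair of adjacent tokens with different labels.
    switches = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    english_share = labels.count("en") / len(labels) if labels else 0.0
    return {"english_share": english_share, "switches": switches, "tokens": len(labels)}
```

On a made-up line such as `"I love you 사랑해 baby 너만 보여"`, four of the seven tokens are English and the language changes three times. Romanized Korean would be misread as English by this heuristic, so a real analysis would need a proper language identifier.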

[288] PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness cs.CL | cs.AIPDF

Huacan Chai, Zijie Cao, Maolin Ran, Yingxuan Yang, Jianghao Lin

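TL;DR: PARL-MT is a new framework that improves LLM function calling in multi-turn conversations by explicitly integrating progress awareness. It combines progress awareness generation with a reinforcement learning algorithm and significantly outperforms existing methods.

Details

Motivation: Real-world applications such as travel planning or multi-stage data analysis usually involve multi-turn conversations. Existing methods either neglect task-level planning or lack explicit integration of progress awareness. PARL-MT aims to address both problems.

Result: On two public benchmarks, PARL-MT significantly outperforms existing methods, demonstrating the effectiveness of progress awareness.

Insight: Progress awareness is key to multi-turn function calling; explicitly modeling task planning and context summaries can markedly improve robustness and efficiency.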

Abstract: Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL-MT, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. PARL-MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL-MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling.

[289] A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks cs.CLPDF

Haorui Yu, Ramon Ruiz-Dolz, Qiufeng Yi

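TL;DR: The paper proposes a quantitative evaluation framework, tests mainstream vision-language models (VLMs) on critique generation for traditional Chinese painting, and reports their strengths and limitations.

Details

Motivation: The goal is to evaluate VLMs on a complex, culturally situated task (critiquing Chinese painting) and to explore their semantic understanding and content generation abilities.

Result: The experiments reveal the performance levels, strengths, and areas for improvement of VLMs on art critique.

Insight: VLMs show potential in complex semantic understanding and culturally situated tasks but need further improvement in diversity and precision.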

Abstract: This study aims to test and evaluate the capabilities and characteristics of current mainstream Visual Language Models (VLMs) in generating critiques for traditional Chinese painting. To achieve this, we first developed a quantitative framework for Chinese painting critique. This framework was constructed by extracting multi-dimensional evaluative features covering evaluative stance, feature focus, and commentary quality from human expert critiques using a zero-shot classification model. Based on these features, several representative critic personas were defined and quantified. This framework was then employed to evaluate selected VLMs such as Llama, Qwen, or Gemini. The experimental design involved persona-guided prompting to assess the VLM’s ability to generate critiques from diverse perspectives. Our findings reveal the current performance levels, strengths, and areas for improvement of VLMs in the domain of art critique, offering insights into their potential and limitations in complex semantic understanding and content generation tasks. The code used for our experiments can be publicly accessed at: https://github.com/yha9806/VULCA-EMNLP2025.

[290] Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models cs.CLPDF

Sina J. Semnani, Jirayu Burapacheep, Arpandeep Khatua, Thanawan Atchariyachanvanit, Zheng Wang

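TL;DR: The paper proposes CLAIRE, a system that combines LLM reasoning with retrieval to detect knowledge inconsistencies in Wikipedia, and validates it through a user study and benchmarking.

Details

Motivation: Wikipedia is an important knowledge base, but its accuracy is not guaranteed. The authors focus on inconsistencies, a specific type of factual error, in order to improve the quality of Wikipedia.

Result: In the user study, 87.5% of editors reported higher confidence, and participants identified 64.7% more inconsistencies in the same amount of time. The analysis finds that at least 3.3% of English Wikipedia facts contradict another fact.

Insight: Knowledge inconsistency is a measurable problem in Wikipedia, and LLM-assisted systems such as CLAIRE can help human editors improve knowledge consistency at scale.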

Abstract: Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.

[291] Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces cs.CL | cs.LGPDF

Joseph Marvin Imperial, Harish Tayyar Madabushi

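TL;DR: The paper proposes Policy Reasoning Traces (PRT), specialized generated reasoning chains that improve language models on policy compliance assessment, with significant gains in assessment accuracy for HIPAA and GDPR policies.

Details

Motivation: Policy compliance assessment is an important task, but gold-standard, expert-level reasoning processes are costly to acquire. The paper proposes PRT to address this.

Result: Experiments show that PRTs significantly improve the performance of open-weight and commercial models and are highly utilized in policy clause citation and compliance decisions.

Insight: PRTs improve not only assessment accuracy but also the ability of the model to cite policy clauses, showing the key role of reasoning chains in this task.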

Abstract: Policy compliance assessment is a fundamental task of evaluating whether an input case strictly complies with a set of human-defined rules, more generally known as policies. In practice, human experts follow a systematic, step-by-step process to identify violations with respect to specific stipulations outlined in the policy. However, such documentation of gold-standard, expert-level reasoning processes is costly to acquire. In this paper, we introduce Policy Reasoning Traces (PRT), a form of specialized generated reasoning chains that serve as a reasoning bridge to improve an LLM’s policy compliance assessment capabilities. Our empirical evaluations demonstrate that the use of PRTs for both inference-time and training-time scenarios significantly enhances the performance of open-weight and commercial models, setting a new state-of-the-art for HIPAA and GDPR policies. Beyond accuracy gains, we also highlight how PRTs can improve an LLM’s ability to accurately cite policy clauses, as well as influence compliance decisions through their high utilization from the raw chains of thought.

[292] Learning to Reason in Structured In-context Environments with Reinforcement Learning cs.CLPDF

Peng Yu, Zeyuan Zhao, Shao Zhang, Luoyi Fu, Xinbing Wang

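TL;DR: The paper proposes the SIE framework, which automatically constructs reasoning environments from structured data, improving the generalizable reasoning of language models while supporting verifiability.

Details

Motivation: Existing reasoning environments are hard to scale and generalize poorly; an environment is needed that is scalable, supports generalizable reasoning, and is verifiable.

Result: Experiments show that SIE substantially improves structured reasoning and generalizes effectively to out-of-domain mathematical and logical reasoning tasks.

Insight: In information-limited (partial) SIE environments, language models can infer missing information through exploration, further improving reasoning robustness and generalization.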

Abstract: Large language models (LLMs) have achieved significant advancements in reasoning capabilities through reinforcement learning (RL) via environmental exploration. As the intrinsic properties of the environment determine the abilities that LLMs can learn, the environment plays an important role in the RL finetuning process. An ideal LLM reasoning environment should possess three core characteristics: scalability, generalizable reasoning, and verifiability. However, existing mathematical and coding environments are difficult to scale due to heavy reliance on expert annotation, while the skills learned in game-based environments are too specialized to generalize. To bridge this gap, we introduce the Structured In-context Environment (SIE) framework. SIE achieves scalability by automatically constructing reasoning environments from large-scale structured data, where the rich compositional patterns naturally support generalizable reasoning. Moreover, the explicit schemas and reasoning chains in structured data provide a foundation for rule-based verifiability. Experimental results show that the SIE framework not only achieves substantial improvements in in-domain structured reasoning, but also enables the learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. We further explored learning in information-limited partial SIEs and found that LLMs can infer the missing information through exploring the environment, leading to robust reasoning improvements and generalization performance.

[293] MedCritical: Enhancing Medical Reasoning in Small Language Models via Self-Collaborative Correction cs.CL | cs.AIPDF

Xinchun Su, Chunxu Luo, Yixuan Li, Weidong Yang, Lipeng Ma

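TL;DR: MedCritical proposes a self-collaborative correction framework that uses a two-stage approach to improve small language models on medical reasoning tasks, avoiding the high cost and low efficiency of conventional knowledge distillation.

Details

Motivation: Small language models underperform on complex medical reasoning tasks, and existing knowledge distillation methods depend on large teacher models, making them costly and inefficient.

Result: The MedCritical 7B model outperforms the Taiyi and Huatuo-o1-7B models by 3.04% and 10.12%, respectively, on the CMExam benchmark, achieving SOTA performance among 7B-class small models.

Insight: Models can improve through self-collaboration and self-iteration, reducing reliance on large teacher models while maintaining strong performance.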

Abstract: In the field of medicine, complex reasoning tasks such as clinical diagnosis, treatment planning, and medical knowledge integration pose significant challenges, where small language models often underperform compared to large language models like GPT-4 and DeepSeek. Recent knowledge distillation-based methods aim to address these issues through teacher-guided error correction, but this LLM-as-judge approach remains challenging in terms of cost, time, and efficiency. To circumvent this issue, we propose a novel two-stage framework, MedCritical, which uses a small language model fine-tuned by a large teacher model to play against itself. In the first stage, we extract high-level and detailed long-chain thought templates from the teacher model to guide the student model to generate more complex reasoning thoughts. In the second stage, we introduce direct preference optimization (DPO) through model self-iteration collaboration to enhance the reasoning ability of the student model by playing against the correction trajectory of the fine-tuned model during training. This model self-learning DPO approach teaches the student model to use its own error-driven insights to consolidate its skills and knowledge to solve complex problems, and achieves results comparable to traditional knowledge distillation methods using teacher models at a lower cost. Notably, our MedCritical 7B model outperforms the Taiyi and Huatuo-o1-7B models by 3.04% and 10.12%, respectively, on the CMExam benchmark, achieving new SOTA performance among 7B-class small models.

[294] CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding cs.CL | cs.AI | cs.CV | I.2.10; J.3; I.5.4PDF

Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

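TL;DR: The paper proposes CCD (Clinical Contrastive Decoding), which uses a dual-stage contrastive mechanism to reduce medical hallucinations in radiology multimodal large language models (MLLMs), improving the accuracy of clinical descriptions without additional training or retrieval.

Details

Motivation: Radiology MLLMs tend to generate unsupported, hallucinated content in clinical descriptions, which poses a serious risk for medical applications that demand accuracy.

Result: Experiments across multiple datasets and models show that CCD improves radiology report generation; on MIMIC-CXR, RadGraph-F1 improves by up to 17%.

Insight: CCD offers a lightweight, general approach that combines expert models with MLLMs to reduce medical hallucinations, providing a practical route to real-world use.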

Abstract: Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.
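
Refining token-level logits by contrast can be illustrated with the generic contrastive-decoding adjustment below. This is not the dual-stage CCD mechanism, whose details the abstract does not give; it is a single-stage sketch that assumes two forward passes are available, one conditioned on extra guidance (for example, expert-model signals) and one without, and that `alpha` is a hypothetical strength parameter.

```python
import math


def contrastive_logits(guided, base, alpha=0.5):
    """Amplify what the guided pass prefers relative to the base pass.

    adjusted[i] = (1 + alpha) * guided[i] - alpha * base[i]
    With alpha = 0 this returns the guided logits unchanged.
    """
    return [(1.0 + alpha) * g - alpha * b for g, b in zip(guided, base)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Tokens whose logit rises under guidance gain probability after the adjustment, while tokens that both passes score identically keep their logit. Because only the output logits change, an adjustment of this kind leaves the base model weights untouched, which is consistent with the training-free framing in the abstract.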

[295] Train Once, Answer All: Many Pretraining Experiments for the Cost of One cs.CL | cs.AI | cs.LGPDF

Sebastian Bordt, Martin Pawelczyk

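TL;DR: The paper proposes running multiple pretraining experiments simultaneously within a single training run, substantially reducing the cost of experimenting with large language models (LLMs).

Details

Motivation: The computational cost of pretraining experiments is high, which limits the breadth and depth of scientific study. The authors aim to run many experiments in parallel within one training run.

Result: The experiments have minimal influence on the training dynamics and performance of the model, and interactions between them are negligible in the authors’ setup, supporting the feasibility of the approach.

Insight: The method opens up large-scale model research on a limited compute budget, though potential interactions between experiments still need to be checked.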

Abstract: Recent work has demonstrated that controlled pretraining experiments are a powerful tool for understanding learning, reasoning, and memorization in large language models (LLMs). However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose to conduct multiple pretraining experiments simultaneously during a single training run. We demonstrate the feasibility of this approach by conducting ten experiments during the training of a 1.5B parameter model on 210B tokens. Although we only train a single model, we can replicate the results from multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until the model acquires a particular piece of knowledge. Remarkably, the influence of the ten experiments on the model’s training dynamics and overall performance is minimal. However, interactions between different experiments may act as a potential confounder in our approach. We propose to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our findings suggest that performing multiple pretraining experiments in a single training run can enable rigorous scientific experimentation with large models on a compute budget.

Wenhang Shi, Yiren Chen, Shuqing Bian, Xinyi Zhang, Kai Tang

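TL;DR: GRACE uses gated refinement and adaptive compression to achieve efficient, stable prompt optimization, substantially improving LLM performance and efficiency.

Details

Motivation: Current automatic prompt optimization methods generate improvements unstably and easily get trapped in local optima; GRACE is proposed to address these problems.

Result: Across 11 tasks in three domains, GRACE achieves average relative performance improvements of 4.7%, 4.4%, and 2.7% while using only 25% of the prompt generation budget required by prior methods.

Insight: Controlled information loss (through refinement and compression) can efficiently address local optima in prompt optimization.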

Abstract: Prompt engineering is crucial for leveraging the full potential of large language models (LLMs). While automatic prompt optimization offers a scalable alternative to costly manual design, generating effective prompts remains challenging. Existing methods often struggle to stably generate improved prompts, leading to low efficiency, and overlook that prompt optimization easily gets trapped in local optima. Addressing this, we propose GRACE, a framework that integrates two synergistic strategies: Gated Refinement and Adaptive Compression, achieving Efficient prompt optimization. The gated refinement strategy introduces a feedback regulation gate and an update rejection gate, which refine update signals to produce stable and effective prompt improvements. When optimization stagnates, the adaptive compression strategy distills the prompt’s core concepts, restructuring the optimization trace and opening new paths. By strategically introducing information loss through refinement and compression, GRACE delivers substantial gains in performance and efficiency. In extensive experiments on 11 tasks across three practical domains, including BIG-Bench Hard (BBH), domain-specific, and general NLP tasks, GRACE achieves significant average relative performance improvements of 4.7%, 4.4% and 2.7% over state-of-the-art methods, respectively. Further analysis shows that GRACE achieves these gains using only 25% of the prompt generation budget required by prior methods, highlighting its high optimization efficiency and low computational overhead. Our code is available at https://github.com/Eric8932/GRACE.

[297] Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation cs.CLPDF

Sherrie Shen, Weixuan Wang, Alexandra Birch

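TL;DR: This paper studies paratextual explicitation of culture-bound terms in machine translation. It builds a dataset and evaluates large language models, finding that LLM-generated paratexts improve comprehension but remain less effective than those written by professional translators. It also shows that cultural mediation is open-ended rather than prescriptive.

Details

Motivation: Machine translation struggles with culture-bound terms, and existing approaches overlook paratexts such as footnotes. This paper aims to fill that gap by exploring paratextual explicitation.

Result: LLM-generated paratexts improve audience comprehension but remain considerably less effective than translator-authored ones. Statistical analysis shows that professional translators themselves vary widely in their use of paratexts.

Insight: Cultural mediation is open-ended; paratextual explicitation has the potential to take MT beyond linguistic equivalence, with extensions to monolingual explanation and personalized adaptation.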

Abstract: The faithful transfer of contextually-embedded meaning continues to challenge contemporary machine translation (MT), particularly in the rendering of culture-bound terms: expressions or concepts that are rooted in specific languages or cultures and resist direct linguistic transfer. Existing computational approaches to explicitating these terms have focused exclusively on in-text solutions, overlooking paratextual apparatus in the footnotes and endnotes employed by professional translators. In this paper, we formalize Genette’s (1987) theory of paratexts from literary and translation studies to introduce the task of paratextual explicitation for MT. We construct a dataset of 560 expert-aligned paratexts from four English translations of the classical Chinese short story collection Liaozhai and evaluate LLMs with and without reasoning traces on choice and content of explicitation. Experiments across intrinsic prompting and agentic retrieval methods establish the difficulty of this task, with human evaluation showing that LLM-generated paratexts improve audience comprehension, though they remain considerably less effective than translator-authored ones. Beyond model performance, statistical analysis reveals that even professional translators vary widely in their use of paratexts, suggesting that cultural mediation is inherently open-ended rather than prescriptive. Our findings demonstrate the potential of paratextual explicitation in advancing MT beyond linguistic equivalence, with promising extensions to monolingual explanation and personalized adaptation.

[298] Comparison of Scoring Rationales Between Large Language Models and Human Raters cs.CL | cs.LGPDF

Haowei Hua, Hong Jiao, Dan Song

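TL;DR: This paper compares the scoring rationales of human raters and large language models (LLMs) in automated scoring, investigates the causes of scoring inconsistency, and evaluates LLM scoring accuracy alongside similarity and clustering analyses of the rationales.

Details

Motivation: As LLMs are increasingly used for automated scoring, studying how their rationales differ from those of human raters helps clarify the scoring logic of both and improve the reliability of automated scoring.

Result: LLMs show a degree of scoring accuracy, but their rationales differ from those of human raters, which points to potential causes of scoring inconsistency.

Insight: LLMs show strong potential for automated scoring, but their scoring logic differs from that of humans; further research is needed to improve the consistency and interpretability of their rationales.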

Abstract: Advances in automated scoring are closely aligned with advances in machine-learning and natural-language-processing techniques. With recent progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude, and other generative-AI chatbots for automated scoring has been explored. Given their strong reasoning capabilities, LLMs can also produce rationales to support the scores they assign. Thus, evaluating the rationales provided by both human and LLM raters can help improve the understanding of the reasoning that each type of rater applies when assigning a score. This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa and normalized mutual information. Cosine similarity is used to evaluate the similarity of the rationales provided. In addition, clustering patterns in rationales are explored using principal component analysis based on the embeddings of the rationales. The findings of this study provide insights into the accuracy and “thinking” of LLMs in automated scoring, helping to improve the understanding of the rationales behind both human scoring and LLM-based automated scoring.
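The two agreement measures named in the abstract can be illustrated with a minimal sketch. It assumes integer scores on a scale from 0 to num_classes - 1 and rationale embeddings given as plain lists of floats; the function names are hypothetical and the study's own tooling is not described in the abstract.

```python
import math


def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Quadratic weighted kappa for two equal-length lists of integer scores.

    Scores are assumed to lie in 0..num_classes-1 (an assumption of this sketch).
    The result is undefined (division by zero) if both raters use one identical score
    for every item.
    """
    n = len(rater_a)
    # Observed confusion matrix: observed[i][j] counts items scored i by A and j by B.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]

    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Disagreement penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # count expected under independence
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)
```

A kappa of 1 means the two raters agree on every item, 0 means agreement no better than chance, and negative values mean systematic disagreement.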

[299] Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models cs.CL | cs.AIPDF

Rajaa El Hamdani, Samy Haffoudhi, Nils Holzenberger, Fabian Suchanek, Thomas Bonald

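TL;DR: The paper proposes Retrieval-Constrained Decoding (RCD), a strategy that restricts model outputs to unique surface forms and thereby reveals parametric knowledge in language models (LMs) that standard evaluation underestimates. Experiments show that RCD substantially improves LM performance on the YAGO-QA dataset.

Details

Motivation: LMs encode substantial factual knowledge, but overly strict evaluation causes many correct answers to be judged incorrect. The authors argue that this underestimates models' parametric knowledge.

Result: RCD substantially raises F1 scores; for example, Llama-3.1-70B improves from 32.3% with vanilla decoding to 46.0% with RCD.

Insight: Strict evaluation criteria can underestimate the knowledge of language models, and RCD offers a new approach to decoding.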

Abstract: Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models’ parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at https://github.com/Rajjaa/disambiguated-LLM.

Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li

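TL;DR: The paper proposes Cognition-of-Thought (CooT), a decoding-time framework that strengthens LLM alignment through an explicit cognitive self-monitoring loop, making safety control dynamic and auditable.

Details

Motivation: Existing LLM alignment methods typically embed safety into model weights, which makes the controls implicit, static, and difficult to modify. CooT aims to address this with an explicit, dynamic monitoring mechanism.

Result: Experiments show that CooT consistently improves safety and social reasoning performance across multiple benchmarks and model families.

Insight: Alignment can shift from a fixed property of the weights to a dynamic, explicit reasoning process, with decoding-time intervention enabling flexible policy updates.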

Abstract: Large language models (LLMs) excel at complex reasoning but can still exhibit harmful behaviors. Current alignment strategies typically embed safety into model weights, making these controls implicit, static, and difficult to modify. This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop. CooT couples a standard text Generator with a cognitive Perceiver that continuously monitors the unfolding sequence. The Perceiver uses a structured, precedence-based hierarchy of principles (e.g., safety over obedience) to detect potential misalignments as they arise. When violations are flagged, CooT intervenes by rolling back the generation to the point of error and regenerating under injected guidance that combines universal social priors with context-specific warnings. CooT thus transforms alignment from a fixed property into an explicit, dynamic, and auditable process active during inference, allowing for flexible policy updates without retraining the model. Extensive experiments across multiple benchmarks and model families confirm that CooT consistently improves safety and social reasoning performance.

[301] Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review cs.CL | cs.AI | I.2.7PDF

Sydney Peters, Nan Zhang, Hong Jiao, Ming Li, Tianyi Zhou

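TL;DR: This paper systematically reviews 37 studies on text-based item difficulty modeling in large-scale assessments, summarizing datasets, methods, and model performance. It finds that language models, especially Transformer-based architectures, perform well at automated difficulty prediction.

Details

Motivation: Traditional item difficulty modeling relies on field testing and classical test theory (CTT) or item response theory (IRT), which is costly and time-consuming. Text-based machine learning methods offer an efficient and economical alternative.

Result: Text-based methods reach RMSE as low as 0.165, Pearson correlation as high as 0.87, and accuracy as high as 0.806, with language models standing out.

Insight: Language models capture syntactic and semantic features automatically, avoiding manual feature engineering, while classic models remain valuable for their interpretability. Model generalizability and fairness need further study.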

Abstract: Item difficulty plays a crucial role in test performance, interpretability of scores, and equity for all test-takers, especially in large-scale assessments. Traditional approaches to item difficulty modeling rely on field testing and classical test theory (CTT)-based item analysis or item response theory (IRT) calibration, which can be time-consuming and costly. To overcome these challenges, text-based approaches leveraging machine learning and language models have emerged as promising alternatives. This paper reviews and synthesizes 37 articles on automated item difficulty prediction in large-scale assessment settings published through May 2025. For each study, we delineate the dataset, difficulty parameter, subject domain, item type, number of items, training and test data split, input, features, model, evaluation criteria, and model performance outcomes. Results showed that although classic machine learning models remain relevant due to their interpretability, state-of-the-art language models, using both small and large transformer-based architectures, can capture syntactic and semantic patterns without the need for manual feature engineering. Model performance outcomes were also summarized to serve as a benchmark for future research. Overall, text-based methods have the potential to predict item difficulty with root mean square error (RMSE) as low as 0.165, Pearson correlation as high as 0.87, and accuracy as high as 0.806. The review concludes by discussing implications for practice and outlining future research directions for automated item difficulty modeling.
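For readers comparing the reported figures, the two regression metrics can be computed as in the sketch below. It assumes predicted and calibrated difficulty values are given as equal-length lists of floats; the function names are illustrative and do not come from the reviewed studies.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and reference difficulty values."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ValueError when either sequence is constant, because the
    correlation is undefined in that case.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

RMSE depends on the scale of the difficulty parameter (for example, proportion correct versus an IRT b-parameter), so values from different studies are only comparable when the parameter and scale match.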

[302] The Impact of Role Design in In-Context Learning for Large Language Models cs.CL | cs.AI | 68T50 | I.2.7PDF

Hamidreza Rouzegar, Masoud Makrehchi

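TL;DR: This paper studies how role design in prompts affects in-context learning (ICL) in large language models (LLMs), and finds that role configurations can improve model performance.

Details

Motivation: Although prompt engineering has been widely studied, the effect of role design within prompts on model performance remains underexplored.

Result: The results suggest that role-based prompt structuring can improve LLM performance.

Insight: Role design is an underexplored factor in prompt engineering and may offer a new angle for improving LLM performance.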

Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to generate predictions based on prompts without additional fine-tuning. While prompt engineering has been widely studied, the impact of role design within prompts remains underexplored. This study examines the influence of role configurations in zero-shot and few-shot learning scenarios using GPT-3.5 and GPT-4o from OpenAI and Llama2-7b and Llama2-13b from Meta. We evaluate the models’ performance across datasets, focusing on tasks like sentiment analysis, text classification, question answering, and math reasoning. Our findings suggest the potential of role-based prompt structuring to enhance LLM performance.

[303] Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales cs.CL | cs.AIPDF

Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Yang Xiang

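TL;DR: MoRSD improves the reasoning ability of small language models by selecting high-quality rationales for distillation, reducing the transfer of noisy or incorrect information and improving both performance and efficiency.

Details

Motivation: Existing CoT distillation methods focus on data quantity and neglect rationale quality, so noisy or incorrect information is passed to the student model and harms its reasoning.

Result: The method achieves an average improvement of 4.6% on seven datasets while using fewer rationales, supporting the effectiveness of a small set of high-quality rationales.

Insight: High-quality rationales improve student reasoning more than large-scale data does, which suggests a route to efficient CoT distillation.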

Abstract: Chain-of-thought (CoT) distillation aims to enhance small language models’ (SLMs) reasoning by transferring multi-step reasoning capability from larger teacher models. However, existing work underestimates rationale quality, focusing primarily on data quantity, which may transfer noisy or incorrect information to the student model. To address these issues, we propose Model-Oriented Rationale Selection Distillation (MoRSD), which can discern and select high-quality rationales for distillation to improve performance further. We further propose a Rationale Difficulty (RD) metric to measure the ability of the student model to generate the correct answer under a given rationale. Compared to the baseline, we achieve a 4.6% average improvement on seven datasets over three tasks, using fewer rationales by controlling their accuracy, diversity, and difficulty. Our results reveal that a small portion of high-quality rationales can enhance the reasoning ability of student models more than the entire dataset. Our method promises to be a possible solution for efficient CoT distillation. Our code will be released at https://github.com/Leon221220/MoRSD.

[304] Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks cs.CLPDF

Kevin Frank, Anmol Gulati, Elias Lumer, Sindy Campagna, Vamse Kumar Subbiah

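TL;DR: Jackal is a novel, large-scale text-to-JQL benchmark that evaluates large language models (LLMs) on translating natural language requests into JQL queries. It contains 100,000 natural language requests paired with validated JQL queries, with execution-based validation on a live Jira instance holding over 200,000 issues.

Details

Motivation: There is no open, execution-based benchmark for mapping natural language to JQL queries, particularly in enterprise settings. Jackal fills this gap.

Result: On the Jackal-5K subset, the best model (Gemini 2.5 Pro) reaches only 60.3% average execution accuracy, and performance varies widely across request types (Long NL 86.0%, Short NL 35.7%, Semantically Similar 22.7%, Semantically Exact 99.3%).

Insight: Current LLMs fall well short on text-to-JQL tasks, especially on complex or semantically ambiguous queries. Jackal sets a new challenge and direction for future research.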

Abstract: Enterprise teams rely on the Jira Query Language (JQL) to retrieve and filter issues from Jira. Yet, to our knowledge, there is no open, real-world, execution-based benchmark for mapping natural language queries to JQL. We introduce Jackal, a novel, large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect real-world usage, each JQL query is associated with four types of user requests: (i) Long NL, (ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. We release Jackal, a corpus of 100,000 text-to-JQL pairs, together with an execution-based scoring toolkit, and a static snapshot of the evaluated Jira instance for reproducibility. We report text-to-JQL results on 23 Large Language Models (LLMs) spanning parameter sizes, open and closed source models, across execution accuracy, exact match, and canonical exact match. In this paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy averaged equally across four user request types. Performance varies significantly across user request types: (i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv) Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct and executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and sets a new, execution-based challenge for future research in Jira enterprise data.

[305] LLM Hallucination Detection: HSAD cs.CLPDF

JinXin Li, Gang Tu, JunJie Hu

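TL;DR: HSAD is a hallucination detection method for large language models based on frequency-domain analysis of hidden-layer temporal signals. It models the human process of signal perception and discrimination and extracts frequency-domain features, improving detection accuracy and robustness.

Details

Motivation: Frequent hallucinations during generation hinder the deployment of large language models in critical applications, and existing methods are limited in knowledge coverage and in detecting reasoning biases.

Result: HSAD outperforms existing methods in detection accuracy and robustness, overcoming their limitations in knowledge coverage and reasoning-bias detection.

Insight: Frequency-domain analysis offers a new perspective on LLM hallucination detection and reveals the characteristics of anomalous signals during reasoning.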

Abstract: Although Large Language Models have demonstrated powerful capabilities in a wide range of tasks such as language understanding and code generation, the frequent occurrence of hallucinations during the generation process has become a significant impediment to their deployment in critical application scenarios. Current mainstream hallucination detection methods rely on factual consistency verification or static hidden layer features. The former is constrained by the scope of knowledge coverage, while the latter struggles to capture reasoning biases during the inference process. To address these issues, and inspired by signal analysis methods in cognitive neuroscience, this paper proposes a hallucination detection method based on the frequency-domain analysis of hidden layer temporal signals, named HSAD (Hidden Signal Analysis-based Detection). First, by treating the LLM’s reasoning process as a cognitive journey that unfolds over time, we propose modeling and simulating the human process of signal perception and discrimination in a deception-detection scenario through hidden layer temporal signals. Next, the Fast Fourier Transform is applied to map these temporal signals into the frequency domain to construct spectral features, which are used to capture anomalies that arise during the reasoning process; analysis experiments on these spectral features have proven the effectiveness of this approach. Finally, a hallucination detection algorithm is designed based on these spectral features to identify hallucinations in the generated content. By effectively combining the modeling of the reasoning process with frequency-domain feature extraction, the HSAD method overcomes the limitations of existing approaches in terms of knowledge coverage and the detection of reasoning biases, demonstrating higher detection accuracy and robustness.
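The step of mapping a temporal signal into the frequency domain can be illustrated with a small sketch. It assumes the signal is one scalar per generated token (for example, a single hidden-state coordinate tracked over time) and uses a plain discrete Fourier transform rather than an optimized FFT; the feature `high_frequency_ratio` is a hypothetical example and is not claimed to be one of HSAD's actual spectral features.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Magnitudes of the DFT coefficients for frequencies 0..n//2 of a real signal."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff))
    return spectrum


def high_frequency_ratio(signal, cutoff_fraction=0.5):
    """Share of non-DC spectral energy above a cutoff (illustrative feature only)."""
    spectrum = magnitude_spectrum(signal)
    energy = [m * m for m in spectrum[1:]]  # drop the DC component
    total = sum(energy)
    if total < 1e-12:  # effectively constant signal: no oscillation to measure
        return 0.0
    cutoff = int(len(energy) * cutoff_fraction)
    return sum(energy[cutoff:]) / total
```

A signal that flips sign at every step yields a ratio near 1, while a slowly varying signal yields a ratio near 0; a detector would feed features of this kind into a classifier.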

[306] Timber: Training-free Instruct Model Refining with Base via Effective Rank cs.CL | cs.AIPDF

Taiqiang Wu, Runming Yang, Tao Liu, Jiahao Wang, Zenan Xu

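TL;DR: Timber is a training-free method that partially reverts an Instruct model toward its Base model by refining the weight deltas, improving exploration while preserving exploitation. Experiments confirm its effectiveness on Pass@k.

Details

Motivation: Post-training is widely considered superficial, and it improves the Instruct model's exploitation at the cost of its exploration. Timber aims to resolve this trade-off without additional training.

Result: Experiments show that Timber improves Pass@k performance, with gains most visible in the Instruct model's exploration capability.

Insight: Weight-level analysis of the post-training stage reveals how model behavior changes, and Timber offers a practical training-free refinement strategy.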

Abstract: Post-training, which elicits a pretrained Base model into the corresponding Instruct model, is widely considered to be superficial. In this work, we first reinforce this hypothesis by providing novel quantitative evidence from the weight level that the effective rank (eRank) remains negligibly changed. However, this superficiality also suffers a critical trade-off, improving the exploitation capabilities at the cost of limiting its exploration. To tackle this issue, we propose Timber, a simple yet effective training-free method that enhances the exploration capability of the Instruct model while preserving its exploitation. The key insight is to partially revert Instruct towards the paired Base model by subtle yet targeted refinement of the weight deltas. Extensive experiments on Llama and Qwen series demonstrate that Timber consistently improves vanilla Instruct models, particularly on Pass@k performance. Our findings offer new insights into the post-training stage at the weight level and practical strategies to refine the Instruct model without training.
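The effective rank mentioned in the abstract can be sketched as follows. This assumes the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values); the abstract does not state which definition the authors use, so treat the formula as an assumption.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a matrix, given its singular values.

    p_i = s_i / sum(s), eRank = exp(-sum(p_i * log(p_i))). Zero singular
    values contribute nothing to the entropy.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("at least one singular value must be positive")
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0:
            entropy -= p * math.log(p)
    return math.exp(entropy)
```

The singular values of a weight matrix could be obtained with, for example, `numpy.linalg.svd(weight, compute_uv=False)`. Comparing the value for a Base weight matrix with that of its Instruct counterpart is the kind of weight-level evidence the abstract describes.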

[307] Fast Thinking for Large Language Models cs.CLPDF

Haoyu Zheng, Zhuonan Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang

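TL;DR: This paper proposes a Fast-Thinking framework that learns a codebook of discrete strategy priors during training and conditions on only a few continuous thinking vectors at inference, avoiding explicit reasoning tokens and substantially lowering inference cost. It also introduces GainRouter, which adaptively switches between fast inference and explicit reasoning to improve efficiency.

Details

Motivation: Chain-of-Thought (CoT) methods perform well on complex reasoning tasks, but generating long reasoning steps leads to high latency and heavy token usage. The authors aim to improve inference efficiency by reducing explicit reasoning tokens.

Result: Experiments show that the method maintains or improves accuracy while substantially lowering inference cost.

Insight: Implicit strategy guidance combined with dynamic selection of the reasoning mode can balance inference efficiency against model performance.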

Abstract: Reasoning-oriented Large Language Models (LLMs) often rely on generating explicit tokens step by step, and their effectiveness typically hinges on large-scale supervised fine-tuning or reinforcement learning. While Chain-of-Thought (CoT) techniques substantially enhance performance on complex reasoning tasks, they remain inefficient, requiring long reasoning traces that increase latency and token usage. In this work, we introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking vectors distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens. To complement this design, we propose GainRouter, a lightweight routing mechanism that adaptively switches between fast codebook guided inference and slow explicit reasoning, thereby suppressing overthinking and reducing unnecessary token generation. Experiments across multiple reasoning benchmarks show that our approach achieves competitive or superior accuracy while substantially lowering inference cost, offering a practical path toward efficient and controllable reasoning in large language models.

[308] Don’t Settle Too Early: Self-Reflective Remasking for Diffusion Language Models cs.CLPDF

Zemin Huang, Yuhang Wang, Zhiyang Chen, Guo-Jun Qi

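TL;DR: The paper proposes RemeDi, which introduces a remasking mechanism and jointly predicts token distributions and confidence scores to improve text generation quality in diffusion-based language models.

Details

Motivation: Mask-based diffusion language models (DLMs) struggle to revise incorrect tokens; once generated, a token typically stays fixed, which limits text quality.

Result: RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.

Insight: A dynamic remasking mechanism can markedly improve the flexibility and quality of text generated by diffusion language models.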

Abstract: Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose the Remasking-enabled Diffusion Language Model (RemeDi), a mask-based DLM that introduces remasking as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens are unmasked after the current step, allowing the model to identify tokens with low quality and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning, which teaches the model to detect and remask incorrect tokens in addition to predicting masked tokens, and reinforcement learning, which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
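The role of per-token confidence in deciding what stays unmasked can be sketched as below. This is a simplified illustration under stated assumptions: tokens are strings, confidences are floats supplied by the caller, and a fixed number of the most confident positions is kept at each step. The `MASK` placeholder, the function name, and the top-k rule are hypothetical; the abstract does not specify RemeDi's actual selection schedule.

```python
MASK = "<mask>"  # hypothetical placeholder for a masked position


def keep_most_confident(tokens, confidences, num_keep):
    """Keep the num_keep highest-confidence tokens and remask every other position."""
    # Rank positions by confidence (highest first); ties keep the earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: -confidences[i])
    keep = set(ranked[:num_keep])
    return [tok if i in keep else MASK for i, tok in enumerate(tokens)]
```

In a full sampler, the remasked positions would be predicted again in the next step, now conditioned on the retained tokens, and `num_keep` would grow as denoising proceeds.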

[309] Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs cs.CLPDF

Shulin Huang, Yiran Ding, Junshu Pan, Yue Zhang

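TL;DR: The paper compares reinforcement learning (RL) and supervised fine-tuning (SFT) on multilingual complex reasoning tasks, finding that RL performs better and generalizes more strongly across languages, especially when trained on non-English data.

Details

Motivation: Enhancing the complex reasoning capabilities of large language models (LLMs) is an active research topic, but the cross-lingual generalization of RL has not been studied systematically.

Result: RL achieves higher accuracy and substantially stronger cross-lingual generalization than SFT; RL training on non-English data performs better than RL training on English data.

Insight: RL may lead the model to learn more robust reasoning strategies, pointing toward more equitable and effective multilingual reasoning.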

Abstract: Enhancing the complex reasoning capabilities of Large Language Models (LLMs) attracts widespread attention. While reinforcement learning (RL) has shown superior performance for improving complex reasoning, its impact on cross-lingual generalization compared to Supervised Fine-Tuning (SFT) remains unexplored. We present the first systematic investigation into cross-lingual reasoning generalization of RL and SFT. Using Qwen2.5-3B-Base as our foundation model, we conduct experiments on diverse multilingual reasoning benchmarks, including math reasoning, commonsense reasoning, and scientific reasoning. Our investigation yields two significant findings: (1) Tuning with RL not only achieves higher accuracy but also demonstrates substantially stronger cross-lingual generalization capabilities compared to SFT. (2) RL training on non-English data yields better overall performance and generalization than training on English data, which is not observed with SFT. Furthermore, through comprehensive mechanistic analyses, we explore the underlying factors of RL’s superiority and generalization across languages. Our results provide compelling evidence that RL enables the model with more robust reasoning strategies, offering crucial guidance for more equitable and effective multilingual reasoning.

[310] Aligning LLMs for Multilingual Consistency in Enterprise Applications cs.CL | cs.AI | 68T50 | I.2.7PDF

Amit Agarwal, Hansa Meghwani, Hitesh Laxmichand Patel, Tao Sheng, Sujith Ravi

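TL;DR: The paper proposes a batch-wise alignment strategy that fine-tunes LLMs on semantically equivalent multilingual data, substantially improving accuracy in non-English languages while preserving English performance and model reasoning.

Details

Motivation: Large language models (LLMs) show performance gaps in multilingual enterprise applications, with non-English accuracy well below English, which harms customer experience and operational reliability.

Result: Non-English accuracy improves by up to 23.9% without compromising English performance, model reasoning, or retrieval quality.

Insight: Directly aligning outputs across languages is a simple and effective way to improve multilingual consistency in enterprise applications.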

Abstract: Large language models (LLMs) remain unreliable for global enterprise applications due to substantial performance gaps between high-resource and mid/low-resource languages, driven by English-centric pretraining and internal reasoning biases. This inconsistency undermines customer experience and operational reliability in multilingual settings such as customer support, content moderation, and information retrieval. Even with advanced Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy drop in non-English languages compared to English. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically equivalent multilingual data in each training batch to directly align model outputs across languages. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality. Our method is simple to implement, scalable, and integrates seamlessly with existing LLM training & deployment pipelines, enabling more robust and equitable multilingual AI solutions in industry.

[311] TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F cs.CL | cs.PL | cs.SEPDF

Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, Hao Chen

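TL;DR: TF-Bench is a new benchmark that evaluates the program semantics reasoning of large language models (LLMs) through type inference in System F, revealing substantial limitations in current state-of-the-art models.

Details

Motivation: Existing benchmarks lack a formal, program-centric deductive framework and cannot tell whether LLMs genuinely reason about program semantics or merely exploit superficial associations between natural language and code.

Result: The best model, Claude-3.7-sonnet, reaches only 55.85% accuracy on TF-Bench_pure, highlighting the weakness of LLMs in program semantics reasoning.

Insight: The study exposes the limits of current LLMs in understanding program semantics and points to directions for future research.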

Abstract: Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.

[312] VIVA+: Human-Centered Situational Decision-Making cs.CLPDF

Zhe Hu, Yixiao Ren, Guanzhong Liu, Jing Li, Yu Yin

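TL;DR: VIVA+ is a cognitively grounded benchmark for evaluating the decision-making and reasoning of multimodal large language models (MLLMs) in complex human-centered environments. It contains 1,317 real-world situations and 6,373 multiple-choice questions targeting three core abilities: situation comprehension, action justification, and reflective reasoning.

Details

Motivation: The performance of MLLMs in human-centered environments lacks systematic, in-depth evaluation, particularly for nuanced reasoning and decision-making. VIVA+ fills this gap with a standardized evaluation framework.

Result: Commercial and open-source models show distinct performance patterns on VIVA+, and multi-step reasoning and targeted training strategies improve performance.

Insight: MLLMs remain limited in context-aware and socially adept decision-making; future improvements include better context understanding and multimodal capability.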

Abstract: Multimodal Large Language Models (MLLMs) show promising results for embodied agents in operating meaningfully in complex, human-centered environments. Yet, evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. In this work, we introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a systematic framework for assessing a model’s ability to perceive, reason, and act in socially meaningful ways. We evaluate the latest commercial and open-source models on VIVA+, where we reveal distinct performance patterns and highlight significant challenges. We further explore targeted training and multi-step reasoning strategies, which yield consistent performance improvements. Finally, our in-depth analysis highlights current model limitations and provides actionable insights for advancing MLLMs toward more robust, context-aware, and socially adept decision-making in real-world settings.

Zhiqiang Liu, Yichi Zhang, Mengshu Sun, Lei Liang, Wen Zhang

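TL;DR: The paper proposes M-Hyper, a multi-modal knowledge graph completion method in which fused and independent modality representations coexist and collaborate. It uses hypercomplex (quaternion) algebra to model multi-modal interactions and reports significant performance gains.

Details

Motivation: Existing MMKGC methods are either fusion-based or ensemble-based. The former lose modality-specific information and lack flexibility; the latter struggle to capture context-dependent semantic interplay between modalities. M-Hyper aims to overcome both limitations.

Result: Experiments indicate that M-Hyper achieves state-of-the-art performance, robustness, and computational efficiency.

Insight: Hypercomplex algebra is promising for multi-modal tasks: it naturally models interactions among orthogonal modalities while keeping the modalities independent.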

Abstract: Multi-modal knowledge graph completion (MMKGC) aims to discover missing facts in multi-modal knowledge graphs (MMKGs) by leveraging both structural relationships and diverse modality information of entities. Existing MMKGC methods follow two multi-modal paradigms: fusion-based and ensemble-based. Fusion-based methods employ fixed fusion strategies, which inevitably leads to the loss of modality-specific information and a lack of flexibility to adapt to varying modality relevance across contexts. In contrast, ensemble-based methods retain modality independence through dedicated sub-models but struggle to capture the nuanced, context-dependent semantic interplay between modalities. To overcome these dual limitations, we propose a novel MMKGC method M-Hyper, which achieves the coexistence and collaboration of fused and independent modality representations. Our method integrates the strengths of both paradigms, enabling effective cross-modal interactions while maintaining modality-specific information. Inspired by quaternion algebra, we utilize its four orthogonal bases to represent multiple independent modalities and employ the Hamilton product to efficiently model pair-wise interactions among them. Specifically, we introduce a Fine-grained Entity Representation Factorization (FERF) module and a Robust Relation-aware Modality Fusion (R2MF) module to obtain robust representations for three independent modalities and one fused modality. The resulting four modality representations are then mapped to the four orthogonal bases of a biquaternion (a hypercomplex extension of quaternion) for comprehensive modality interaction. Extensive experiments indicate its state-of-the-art performance, robustness, and computational efficiency.
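The Hamilton product referred to in the abstract is standard quaternion multiplication, sketched below for scalar components. Representing a quaternion as a 4-tuple (a, b, c, d) for a + bi + cj + dk is a choice made for this illustration; how M-Hyper assigns modality embeddings to the four components is not reproduced here.

```python
def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )
```

Each output component mixes all four input components, which is why the product is a natural way to model pair-wise interactions, and the product is not commutative (i·j = k but j·i = -k). Because the function uses only addition, subtraction, and multiplication, it also accepts Python complex numbers as components, which is one way to represent a biquaternion.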

[314] Do LLMs Understand Romanian Driving Laws? A Study on Multimodal and Fine-Tuned Question Answering cs.CL | cs.LGPDF

Eduard Barbu, Adrian Marius Dumitran

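TL;DR: This paper evaluates large language models (LLMs) on Romanian driving-law question answering, studies the effects of multimodality and domain-specific fine-tuning, and finds that textual descriptions of images outperform direct visual input.

Details

Motivation: Ensuring that drivers master traffic rules is critical to road safety, but question answering systems for less-resourced languages such as Romanian are little studied.

Result: (1) SOTA models perform well, and fine-tuned 8B models are competitive; (2) textual descriptions of images outperform direct visual input; (3) LLM-as-a-Judge reveals self-preference bias.

Insight: In multimodal tasks, textual descriptions may be more effective than direct visual input; LLM-as-a-Judge can assess explanation quality, but its bias must be taken into account.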

Abstract: Ensuring that both new and experienced drivers master current traffic rules is critical to road safety. This paper evaluates Large Language Models (LLMs) on Romanian driving-law QA with explanation generation. We release a 1,208-question dataset (387 multimodal) and compare text-only and multimodal SOTA systems, then measure the impact of domain-specific fine-tuning for Llama 3.1-8B-Instruct and RoLlama 3.1-8B-Instruct. SOTA models perform well, but fine-tuned 8B models are competitive. Textual descriptions of images outperform direct visual input. Finally, an LLM-as-a-Judge assesses explanation quality, revealing self-preference bias. The study informs explainable QA for less-resourced languages.

[315] Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning cs.CL | cs.AIPDF

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan

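TL;DR: This paper presents a logic-grounded evaluation framework for multimodal reasoning, reveals the core bottlenecks in modality interaction, and proposes simple methods to mitigate them.

Details

Motivation: Multimodal large language models (MLLMs) behave inconsistently in multimodal reasoning, and modality interaction has not been analyzed systematically. The authors use controlled experiments to expose the core problems.

Result: Additional modalities help only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support hurts performance; two-step prompting and softened attention in early fusion improve performance.

Insight: The main barrier to multimodal reasoning is integration rather than perception; composition-aware training and early fusion control are promising directions.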

Abstract: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models’ internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.

[316] Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis cs.CL | cs.AIPDF

Chao Wang, Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

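TL;DR: This paper analyzes why textual capability degrades in speech-enabled large language models (LLMs) and proposes an analytical framework based on parameter importance, which shows that fine-tuning disrupts the distribution of parameters critical to textual reasoning. It mitigates the problem with two methods: layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA).

Details

Motivation: Adding speech extends the capabilities of LLMs but also degrades their core textual competence. The paper aims to understand and address this problem so that speech-enabled LLMs can reach their full potential.

Result: Experiments show that both mitigation strategies preserve textual competence while improving spoken question answering performance. The analysis provides a principled explanation for how structured knowledge is preserved in LLMs.

Insight: The textual degradation caused by speech fine-tuning is closely tied to a distribution shift in critical parameters; targeted learning rate adjustment or LoRA can effectively mitigate it.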

Abstract: The integration of speech into Large Language Models (LLMs) has substantially expanded their capabilities, but often at the cost of weakening their core textual competence. This degradation limits the ability of speech-enabled LLMs to fully exploit their pre-trained text-based knowledge. In this work, we analyze the underlying mechanisms of this issue through a focused study of the widely used encoder-adaptor paradigm. We propose an analytical framework based on parameter importance estimation, which reveals that fine-tuning for speech introduces a textual importance distribution shift: the layer-wise allocation of parameters critical to textual reasoning is disrupted. Building on this insight, we investigate two mitigation strategies: layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA), both of which aim to preserve the original parameter distribution. Experimental results show that both approaches better maintain textual competence than full fine-tuning, while also improving downstream spoken question answering performance. Furthermore, our analysis offers a principled explanation for the effectiveness of the proposed mitigation strategies, linking their benefits to the structural properties of textual knowledge in LLMs.
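
The abstract does not say how the layer-wise learning rate schedule is derived from the importance estimates. As a minimal sketch of the general idea, assuming (hypothetically) that layers with higher estimated textual importance should receive smaller learning rates, one possible mapping is shown below. The function name, the linear rule, and the `min_scale` parameter are illustrative assumptions, not the authors' method.

```python
def layerwise_learning_rates(importance, base_lr=1e-4, min_scale=0.1):
    """Map per-layer importance scores to per-layer learning rates.

    Hypothetical rule: the most important layer is updated with
    base_lr * min_scale, the least important with base_lr, and the
    rest are interpolated linearly in between.
    """
    lo, hi = min(importance), max(importance)
    span = hi - lo
    rates = []
    for score in importance:
        # Normalize the score to [0, 1]; all-equal scores map to 0.
        norm = 0.0 if span == 0 else (score - lo) / span
        scale = 1.0 - (1.0 - min_scale) * norm
        rates.append(base_lr * scale)
    return rates
```

The resulting list could be passed to an optimizer as one parameter group per layer; whether a linear rule is appropriate would depend on the actual importance distribution.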

[317] Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality cs.CL | cs.AI | cs.LGPDF

Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang

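TL;DR: The paper proposes KLCF, a novel reinforcement learning framework that uses a Dual-Fact Alignment mechanism to optimize the factuality and consistency of long-form generation, addressing hallucination in existing RLHF methods.

Details

Motivation: Hallucination and factuality deficits in long-form generation are major obstacles to LLM reliability. Existing RLHF frameworks rely on preference rewards but overlook the model's internal knowledge boundaries, which exacerbates hallucination.

Result: Experiments show that KLCF substantially improves factuality metrics on multiple long-form benchmarks and effectively alleviates hallucination.

Insight: With a lightweight design that needs no external knowledge, KLCF achieves efficient and scalable factuality optimization.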

Abstract: Hallucination and factuality deficits remain key obstacles to the reliability of large language models (LLMs) in long-form generation. Existing reinforcement learning from human feedback (RLHF) frameworks primarily rely on preference rewards, yet they often overlook the model’s internal knowledge boundaries, exacerbating the so-called “hallucination tax”. To address this challenge, we propose the Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), a novel framework that focuses on the knowledge consistency between the policy model’s expressed knowledge and the base model’s parametric knowledge, and introduces a Dual-Fact Alignment mechanism to jointly optimize factual recall and precision. Specifically, KLCF leverages pretrained knowledge boundaries to construct a fact checklist, guiding online reinforcement learning to improve factual coverage and recall; simultaneously, it trains a self-assessment module based on the base model’s internal knowledge to enhance factual precision during generation. Unlike prior methods that rely on external retrieval or heavy verification, our reward design is fully external-knowledge-free and lightweight, making KLCF efficient and easily scalable to large-scale training. Experimental results demonstrate that KLCF substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.

[318] Transformer Tafsir at QIAS 2025 Shared Task: Hybrid Retrieval-Augmented Generation for Islamic Knowledge Question Answering cs.CLPDF

Muhammad Abu Ahmad, Mohamad Ballout, Raia Abu Ahmad, Elia Bruni

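TL;DR: The paper presents a hybrid retrieval-augmented generation (RAG) system for Islamic knowledge question answering that combines sparse and dense retrieval with cross-encoder reranking, significantly improving large language model (LLM) performance.

Details

Motivation: To address insufficient LLM performance on Islamic knowledge question answering by using hybrid retrieval and reranking to improve results.

Result: The best configuration (Fanar) reaches 45% accuracy on Subtask 1 and 80% on Subtask 2.

Insight: Hybrid retrieval and reranking can significantly improve LLM performance on domain-specific question answering.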

Abstract: This paper presents our submission to the QIAS 2025 shared task on Islamic knowledge understanding and reasoning. We developed a hybrid retrieval-augmented generation (RAG) system that combines sparse and dense retrieval methods with cross-encoder reranking to improve large language model (LLM) performance. Our three-stage pipeline incorporates BM25 for initial retrieval, a dense embedding retrieval model for semantic matching, and cross-encoder reranking for precise content retrieval. We evaluate our approach on both subtasks using two LLMs, Fanar and Mistral, demonstrating that the proposed RAG pipeline enhances performance across both, with accuracy improvements up to 25%, depending on the task and model configuration. Our best configuration is achieved with Fanar, yielding accuracy scores of 45% in Subtask 1 and 80% in Subtask 2.
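
The three-stage pipeline described above (BM25, then dense matching, then cross-encoder reranking) can be sketched roughly as follows. This is an illustrative sketch only: the cascade ordering, the cutoffs `k_sparse`, `k_dense`, and `k_final`, and the `dense_score` and `rerank_score` callables are assumptions standing in for the embedding model and cross-encoder, which the abstract does not specify. The BM25 scorer uses a common Okapi-style formula with a smoothed IDF.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with an Okapi-style BM25."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def top_k(scores, k):
    """Indices of the k highest scores (ties keep original order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def hybrid_retrieve(query, docs, dense_score, rerank_score,
                    k_sparse=3, k_dense=2, k_final=1):
    """Cascade: BM25 candidates -> dense re-scoring -> final reranking."""
    sparse_ids = top_k(bm25_scores(query, docs), k_sparse)
    dense_ids = sorted(sparse_ids, key=lambda i: dense_score(query, docs[i]),
                       reverse=True)[:k_dense]
    final_ids = sorted(dense_ids, key=lambda i: rerank_score(query, docs[i]),
                       reverse=True)[:k_final]
    return [docs[i] for i in final_ids]
```

In a real system, `dense_score` would wrap an embedding similarity and `rerank_score` a cross-encoder; a union of sparse and dense candidates before reranking is an equally plausible reading of "hybrid".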

[319] Open-DeBias: Toward Mitigating Open-Set Bias in Language Models cs.CLPDF

Arti Rani, Shweta Singh, Nihar Ranjan Sahoo, Gaurav Kumar Nayak

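TL;DR: The paper proposes Open-DeBias, a method for addressing open-set bias in language models, and introduces the OpenBiasBench benchmark. Using data-efficient and parameter-efficient fine-tuning, the method performs well at bias detection and mitigation and generalizes across languages.

Details

Motivation: Existing bias mitigation methods target only predefined categories and cannot handle novel or context-specific biases. The paper aims to detect and mitigate open-set bias.

Result: On the BBQ dataset, Open-DeBias improves QA accuracy over BMBI by nearly 48% on the ambiguous subsets and 6% on the disambiguated subsets; in zero-shot transfer to Korean BBQ it reaches 84% accuracy.

Insight: Open-DeBias shows that bias mitigation can scale and generalize across languages, making it suitable for open-domain and multilingual settings.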

Abstract: Large Language Models (LLMs) have achieved remarkable success on question answering (QA) tasks, yet they often encode harmful biases that compromise fairness and trustworthiness. Most existing bias mitigation approaches are restricted to predefined categories, limiting their ability to address novel or context-specific emergent biases. To bridge this gap, we tackle the novel problem of open-set bias detection and mitigation in text-based QA. We introduce OpenBiasBench, a comprehensive benchmark designed to evaluate biases across a wide range of categories and subgroups, encompassing both known and previously unseen biases. Additionally, we propose Open-DeBias, a novel, data-efficient, and parameter-efficient debiasing method that leverages adapter modules to mitigate existing social and stereotypical biases while generalizing to unseen ones. Compared to the state-of-the-art BMBI method, Open-DeBias improves QA accuracy on the BBQ dataset by nearly 48% on ambiguous subsets and 6% on disambiguated ones, using adapters fine-tuned on just a small fraction of the training data. Remarkably, the same adapters, in a zero-shot transfer to Korean BBQ, achieve 84% accuracy, demonstrating robust language-agnostic generalization. Through extensive evaluation, we also validate the effectiveness of Open-DeBias across a broad range of NLP tasks, including StereoSet and CrowS-Pairs, highlighting its robustness, multilingual strength, and suitability for general-purpose, open-domain bias mitigation. The project page is available at: https://sites.google.com/view/open-debias25

[320] SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models cs.CLPDF

Ziyi Yang, Weizhou Shen, Ruijun Chen, Chenliang Li, Fanqi Wan

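TL;DR: SPELL is a multi-role self-play reinforcement learning framework that integrates three roles (questioner, responder, and verifier) to optimize long-context reasoning without labels, significantly improving long-context reasoning ability.

Details

Motivation: Progress in long-context reasoning for large language models has been slow, mainly because of the intrinsic difficulty of processing long texts and the scarcity of reliable human annotations and programmatically verifiable reward signals.

Result: SPELL performs well on six long-context benchmarks, including an average 7.6-point gain in pass@8 on Qwen3-30B-A3B-Thinking.

Insight: SPELL shows the potential of self-play reinforcement learning for long-context reasoning, delivering substantial gains even without large amounts of annotated data.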

Abstract: Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles (questioner, responder, and verifier) within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder’s output and the questioner’s reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model’s evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.
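
The abstract reports gains in pass@8 but does not say how the metric is computed. For readers unfamiliar with it, the sketch below shows one commonly used unbiased estimator of pass@k: given `n` sampled answers of which `c` are correct, it estimates the probability that at least one of `k` randomly drawn samples is correct. Whether the authors use this estimator or a direct count over exactly 8 samples is an assumption left open here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With exactly `n = k = 8` samples per question, the estimator reduces to an indicator of whether any of the 8 samples is correct.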

[321] Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning cs.CLPDF

Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min

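TL;DR: The paper proposes Q-Tuning, a unified framework that jointly optimizes sample-level and token-level pruning, using the Error-Uncertainty (EU) Plane to diagnose data utility, and significantly improves the data efficiency of supervised fine-tuning (SFT).

Details

Motivation: As the computational cost of SFT grows, conventional pruning methods, which operate at either the sample level or the token level alone, are inefficient and fail to exploit the data's full potential.

Result: On SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline while using only 12.5% of the training data.

Insight: Joint sample and token pruning makes finer-grained use of the data and significantly improves model performance, with particular advantages under limited budgets.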

Abstract: As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies: high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.

[322] DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval via Adaptive Patch-Level Embedding Pruning cs.CL | cs.IRPDF

Yibo Yan, Guangwei Xu, Xin Zou, Shuliang Liu, James Kwok

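TL;DR: DocPruner is a storage-efficient framework for multi-vector visual document retrieval that reduces storage overhead through adaptive pruning.

Details

Motivation: Current visual document retrieval methods based on the multi-vector paradigm are highly effective, but their storage overhead is prohibitive, which makes large-scale deployment difficult and calls for a solution.

Result: Experiments on more than ten datasets show that DocPruner reduces storage by 50-60% with negligible loss in retrieval performance.

Insight: Adaptive pruning is an effective way to address the storage problem of the multi-vector paradigm and offers a practical solution for large-scale VDR systems.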

Abstract: Visual Document Retrieval (VDR), the task of retrieving visually-rich document pages using queries that combine visual and textual cues, is crucial for numerous real-world applications. Recent state-of-the-art methods leverage Large Vision-Language Models (LVLMs) in a multi-vector paradigm, representing each document as patch-level embeddings to capture fine-grained details. While highly effective, this approach introduces a critical challenge: prohibitive storage overhead, as storing hundreds of vectors per page makes large-scale deployment costly and impractical. To address this, we introduce DocPruner, the first framework to employ adaptive patch-level embedding pruning for VDR to effectively reduce the storage overhead. DocPruner leverages the intra-document patch attention distribution to dynamically identify and discard redundant embeddings for each document. This adaptive mechanism enables a significant 50-60% reduction in storage for leading multi-vector VDR models with negligible degradation in document retrieval performance. Extensive experiments across more than ten representative datasets validate that DocPruner offers a robust, flexible, and effective solution for building storage-efficient, large-scale VDR systems.
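
The abstract says DocPruner uses each document's own patch attention distribution to decide which embeddings to discard, but it does not give the rule. The sketch below illustrates one way a per-document adaptive threshold could work, assuming (hypothetically) that a patch is kept when its attention is at least the document's mean plus `k` standard deviations. The function name, the threshold form, and the parameter `k` are illustrative assumptions.

```python
from statistics import mean, pstdev


def prune_patches(embeddings, attention, k=0.0):
    """Keep patch embeddings whose attention clears a per-document threshold.

    The threshold adapts to each document: mean(attention) + k * std(attention).
    At least one patch (the most attended) is always kept.
    """
    threshold = mean(attention) + k * pstdev(attention)
    kept = [i for i, a in enumerate(attention) if a >= threshold]
    if not kept:
        kept = [max(range(len(attention)), key=attention.__getitem__)]
    return [embeddings[i] for i in kept], kept
```

Because the threshold is computed per document, a page with a few dominant patches is pruned more aggressively than a page whose attention is spread evenly.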

[323] Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step cs.CL | cs.AIPDF

Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao

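TL;DR: This paper proposes the EOSER and ASS decoding schedulers and the CJ-GRPO reinforcement learning method to optimize the decoding and training of masked diffusion language models (MDLMs), reducing the number of decoding steps and improving performance.

Details

Motivation: Decoding strategies and reinforcement learning methods for MDLMs remain underexplored, and directly transferring techniques from autoregressive models may be suboptimal. The paper aims to resolve the inconsistencies in MDLM decoding and training.

Result: Experiments on mathematical and planning tasks with LLaDA-8B-Instruct validate the effectiveness and efficiency of the methods.

Insight: By optimizing decoding and reinforcement learning, MDLMs can perform parallel decoding more efficiently with fewer inference steps, making them suitable for complex reasoning tasks.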

Abstract: Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) decoding schedulers, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between rollout trajectory and optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.
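
The abstract names an Ascending Step-Size (ASS) scheduler without defining it. The sketch below illustrates one plausible reading, assuming that "step size" means the number of masked tokens revealed at each decoding step and that it grows geometrically. The function name, the geometric rule, and the `growth` parameter are hypothetical and are not taken from the linked repository.

```python
def ascending_schedule(total_tokens, num_steps, growth=2.0):
    """Split total_tokens into num_steps chunks whose sizes grow geometrically.

    Chunk boundaries are rounded cumulative shares, so the sizes always
    sum to total_tokens (small inputs may produce zero-sized steps).
    """
    weights = [growth ** i for i in range(num_steps)]
    total_weight = sum(weights)
    sizes, assigned, cumulative = [], 0, 0.0
    for w in weights:
        cumulative += w
        boundary = round(total_tokens * cumulative / total_weight)
        sizes.append(boundary - assigned)
        assigned = boundary
    return sizes
```

Under this reading, early steps commit only a few tokens while context is scarce, and later steps reveal many tokens in parallel, which is how a full sequence could be decoded in far fewer steps than one token per step.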

[324] Assessing Large Language Models in Updating Their Forecasts with New Information cs.CLPDF

Zhangdie Yuan, Zifeng Ding, Andreas Vlachos

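TL;DR: The EVOLVECAST framework evaluates how large language models (LLMs) adjust their forecasts and confidence when new information appears, finding that their updates are inconsistent and conservative and that their confidence calibration falls short of the human level.

Details

Motivation: Prior work treats event forecasting as a static task, overlooking how new evidence should dynamically affect forecasts and confidence. The paper evaluates the ability of LLMs to update their forecasts given new information.

Result: LLMs respond to new information to some extent, but their updates are inconsistent and conservative. Neither verbalized nor logits-based confidence estimates consistently outperform the other, and both remain far below the human reference standard.

Insight: LLMs are limited in their ability to update forecasts given new information; more robust belief-updating mechanisms need further study, and the conservative bias may affect the reliability of real-world decisions.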

Abstract: Prior work has largely treated future event prediction as a static task, failing to consider how forecasts and the confidence in them should evolve as new evidence emerges. To address this gap, we introduce EVOLVECAST, a framework for evaluating whether large language models appropriately revise their predictions in response to new information. In particular, EVOLVECAST assesses whether LLMs adjust their forecasts when presented with information released after their training cutoff. We use human forecasters as a comparative reference to analyze prediction shifts and confidence calibration under updated contexts. While LLMs demonstrate some responsiveness to new information, their updates are often inconsistent or overly conservative. We further find that neither verbalized nor logits-based confidence estimates consistently outperform the other, and both remain far from the human reference standard. Across settings, models tend to express conservative bias, underscoring the need for more robust approaches to belief updating.
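
The abstract compares confidence calibration before and after new information but does not name the metrics used. As an illustration only, the Brier score is a standard way to score probabilistic forecasts of binary events (lower is better); the example data below are invented to show how an overly conservative update scores worse than a decisive, correct one.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Two events; the first occurred (1), the second did not (0).
outcomes = [1, 0]
prior = [0.5, 0.5]            # forecasts before the new information
conservative = [0.55, 0.45]   # small shift in the right direction
decisive = [0.9, 0.1]         # large shift in the right direction

print(brier_score(prior, outcomes))
print(brier_score(conservative, outcomes))
print(brier_score(decisive, outcomes))
```

Both updated forecasts improve on the prior, but the conservative one captures only a small part of the available gain, which is the pattern the paper describes for LLMs relative to the human reference.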

[325] Vision-Grounded Machine Interpreting: Improving the Translation Process through Visual Cues cs.CL | cs.AIPDF

Claudio Fantinuoli

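TL;DR: The paper proposes Vision-Grounded Interpreting (VGI), which introduces visual cues to improve on the performance of unimodal machine interpreting.

Details

Motivation: Current machine interpreting systems rely only on the speech signal, which limits translation quality in multimodal settings (for example, where visual context is needed).

Result: Visual grounding substantially improves lexical disambiguation, yields only limited gains for gender resolution, and does not help with syntactic ambiguity.

Insight: Multimodality (vision plus speech) is a necessary direction for improving machine interpreting quality, but its effect differs markedly across ambiguity types.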

Abstract: Machine Interpreting systems are currently implemented as unimodal, real-time speech-to-speech architectures, processing translation exclusively on the basis of the linguistic signal. Such reliance on a single modality, however, constrains performance in contexts where disambiguation and adequacy depend on additional cues, such as visual, situational, or pragmatic information. This paper introduces Vision-Grounded Interpreting (VGI), a novel approach designed to address the limitations of unimodal machine interpreting. We present a prototype system that integrates a vision-language model to process both speech and visual input from a webcam, with the aim of priming the translation process through contextual visual information. To evaluate the effectiveness of this approach, we constructed a hand-crafted diagnostic corpus targeting three types of ambiguity. In our evaluation, visual grounding substantially improves lexical disambiguation, yields modest and less stable gains for gender resolution, and shows no benefit for syntactic ambiguities. We argue that embracing multimodality represents a necessary step forward for advancing translation quality in machine interpreting.

[326] HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs cs.CLPDF

Ken Deng, Zizheng Zhan, Wen Xiang, Wenqiang Zhu, Tianhao Peng

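TL;DR: HiPO is a hybrid policy optimization framework for dynamically controlling LLM reasoning: it selectively decides when to reason in detail (Think-on) and when to respond directly (Think-off), reducing computational overhead while maintaining accuracy.

Details

Motivation: Existing chain-of-thought (CoT) methods improve LLM accuracy on complex tasks, but always generating lengthy reasoning is costly, leading to unnecessary token usage and high inference cost.

Result: On mathematics and coding benchmarks, HiPO substantially reduces token usage while maintaining or improving task accuracy.

Insight: HiPO offers a principled solution for efficient reasoning in resource-sensitive settings, advancing the practical deployment of reasoning-oriented LLMs.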

Abstract: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, providing paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.

[327] Toward Preference-aligned Large Language Models via Residual-based Model Steering cs.CL | cs.AI | cs.CY | cs.LG | cs.NEPDF

Lucio La Cava, Andrea Tagarelli

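TL;DR: PaLRS is a training-free, plug-and-play method that extracts preference signals from the residual stream of LLMs to produce lightweight steering vectors, aligning model preferences at inference time. Compared with existing methods, PaLRS is more efficient, more flexible, and saves time.

Details

Motivation: Existing preference alignment methods (such as RLHF or DPO) require large amounts of annotated data and expensive parameter optimization, and the resulting task-specific models are hard to reuse. PaLRS aims to address these problems with an efficient, training-free alternative that needs little data.

Result: Experiments show that PaLRS-aligned models improve consistently on mathematical reasoning and code generation while preserving the base model's general-purpose performance. Compared with DPO, PaLRS saves substantial time and performs better.

Insight: The residual stream of LLMs implicitly encodes rich preference signals that can be extracted with lightweight methods and applied to alignment, offering a new direction for efficient preference alignment.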

Abstract: Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to DPO-aligned models, they perform better with huge time savings. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, offering a training-free, plug-and-play mechanism for alignment with minimal data.
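
The abstract does not describe how PaLRS computes its steering vectors. A widely used baseline for residual-stream steering is the difference of mean activations between preferred and dispreferred examples, added back to the hidden state at inference time with a scaling factor. The sketch below shows that baseline on plain Python lists; the function names and the `alpha` parameter are illustrative assumptions, and the actual PaLRS extraction may differ.

```python
def steering_vector(preferred, dispreferred):
    """Difference of mean activations: mean(preferred) - mean(dispreferred).

    Each argument is a list of equal-length activation vectors, for
    example residual-stream states at one layer for each response.
    """
    dim = len(preferred[0])
    mean_p = [sum(v[d] for v in preferred) / len(preferred) for d in range(dim)]
    mean_d = [sum(v[d] for v in dispreferred) / len(dispreferred) for d in range(dim)]
    return [p - q for p, q in zip(mean_p, mean_d)]


def apply_steering(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering vector by a factor alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]
```

Because the vector is computed once from a small set of pairs and applied only at inference, no model weights change, which is consistent with the training-free, plug-and-play framing in the abstract.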

[328] GEAR: A General Evaluation Framework for Abductive Reasoning cs.CL | cs.AI | cs.LGPDF

Kaiyu He, Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao

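TL;DR: The paper proposes GEAR, a framework for evaluating the abductive reasoning ability of large language models (LLMs) that scores automatically on three metrics (consistency, generalizability, and diversity) without human annotation. GEAR can also supply label-free training signals for models.

Details

Motivation: Existing research focuses on instruction following and deductive reasoning and neglects the ability of LLMs to discover new knowledge. The authors therefore study abductive reasoning (generating plausible hypotheses that explain observations) to fill this gap and develop a label-free, scalable evaluation framework.

Result: Experiments show that GEAR reliably evaluates the abductive ability of LLMs, and that a GEAR-based training strategy improves scores on all metrics, with gains that transfer to established benchmarks.

Insight: 1. A label-free evaluation framework adapts more flexibly as model capabilities improve; 2. A dynamic curriculum learning strategy helps models master complex tasks step by step; 3. GEAR reveals room for improvement in LLM abductive reasoning.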

Abstract: Since the advent of large language models (LLMs), research has focused on instruction following and deductive reasoning. A central question remains: can these models discover new knowledge, and how can we evaluate this ability? We address this by studying abductive reasoning (the generation of plausible hypotheses to explain observations) and introduce GEAR (General Evaluation for Abductive Reasoning), a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers), reliable (deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new plausible hypotheses, unlike static benchmarks that saturate once accuracy is high). Using GEAR, we conduct a fine-grained study of nine LLMs on four abduction benchmarks with 1,500 problems, generating over 50,000 candidate hypotheses and revealing model differences obscured by gold-answer or purely human evaluations. We further propose a momentum-based curriculum that adjusts GEAR-derived training data by learning velocity: it starts with what the model learns quickly and shifts toward harder objectives such as generating diverse hypotheses once the model is confident on foundational objectives. Without gold-label supervision, this strategy improves all GEAR objectives and these gains transfer to established abductive reasoning benchmarks. Taken together, GEAR provides a principled framework that evaluates abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.
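
To make the three metrics concrete, the sketch below scores a set of hypotheses represented as Python functions. The exact formulas are not given in the abstract, so the definitions here are simplifying assumptions: consistency is the share of hypotheses that reproduce every observation, generalizability is the share of consistent hypotheses that return a defined (non-`None`) prediction on every unseen input, and diversity is the share of distinct prediction patterns among those.

```python
def gear_like_scores(hypotheses, observations, unseen_inputs):
    """Return (consistency, generalizability, diversity) for a hypothesis set."""
    # Consistency: the hypothesis explains every observed (input, output) pair.
    consistent = [h for h in hypotheses
                  if all(h(x) == y for x, y in observations)]
    consistency = len(consistent) / len(hypotheses)

    # Generalizability: a consistent hypothesis predicts something on unseen inputs.
    predictions = [tuple(h(x) for x in unseen_inputs) for h in consistent]
    defined = [p for p in predictions if all(v is not None for v in p)]
    generalizability = len(defined) / len(consistent) if consistent else 0.0

    # Diversity: how many distinct prediction patterns the set covers.
    diversity = len(set(defined)) / len(defined) if defined else 0.0
    return consistency, generalizability, diversity
```

For the observations `(1, 2)` and `(2, 4)`, both "double the input" and "raise 2 to the input" are consistent yet disagree on the unseen input 3, so a set containing both scores higher on diversity than a set containing two rephrasings of the same rule. No gold answer is needed for any of the three scores.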

[329] BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models cs.CLPDF

Zsolt T. Kardkovács, Lynda Djennane, Anna Field, Boualem Benatallah, Yacine Gaci

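TL;DR: BTC-SAM is a new framework that uses LLMs to generate high-quality test cases for detecting bias in sentiment analysis models, with broader coverage and greater linguistic diversity.

Details

Motivation: Sentiment analysis models carry social biases, and traditional testing relies on experts or crowdsourcing, which is costly and limited in coverage. BTC-SAM aims to solve this by generating high-quality test cases at low cost with LLMs.

Result: Experiments show that the test cases generated by BTC-SAM have greater linguistic diversity and coverage than those from base prompting methods.

Insight: LLMs can efficiently generate high-quality test cases, offering a low-cost, scalable approach to model bias detection.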

Abstract: Sentiment Analysis (SA) models harbor inherent social biases that can be harmful in real-world applications. These biases are identified by examining the output of SA models for sentences that only vary in the identity groups of the subjects. Constructing natural, linguistically rich, relevant, and diverse sets of sentences that provide sufficient coverage over the domain is expensive, especially when addressing a wide range of biases: it requires domain experts and/or crowd-sourcing. In this paper, we present a novel bias testing framework, BTC-SAM, which generates high-quality test cases for bias testing in SA models with minimal specification using Large Language Models (LLMs) for the controllable generation of test sentences. Our experiments show that relying on LLMs can provide high linguistic variation and diversity in the test sentences, thereby offering better test coverage compared to base prompting methods even for previously unseen biases.

[330] Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Distributional Semantics cs.CLPDF

Guangliang Liu, Xi Chen, Bocheng Chen, Xitong Zhang, Kristen Johnson

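TL;DR: The paper examines how pragmatic inference methods can improve the generalization of large language models in moral reasoning, bridging the gap between pragmatics and semantics.

Details

Motivation: Large language models rely on distributional semantics but generalize poorly in moral reasoning, so the challenge at the pragmatic level needs to be addressed.

Result: Experiments show that the method significantly improves the generalization of large language models in moral reasoning.

Insight: The study lays a foundation for future work grounded in moral foundations theory and highlights the importance of pragmatic information in moral reasoning.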

Abstract: Moral reasoning has emerged as a promising research direction for Large Language Models (LLMs), yet achieving generalization remains a central challenge. From a linguistic standpoint, this difficulty arises because LLMs are adept at capturing distributional semantics, which fundamentally differs from the morals which operate at the pragmatic level. This paper investigates how LLMs can achieve generalized moral reasoning despite their reliance on distributional semantics. We propose pragmatic inference methods grounded in moral foundations theory, which leverage contextual information at each step to bridge the pragmatic gap and guide LLMs in connecting moral foundations with moral reasoning objectives. Experimental results demonstrate that our approach significantly enhances LLMs’ generalization in moral reasoning, providing a foundation for future research grounded in moral foundations theory.

[331] Dual-Scale World Models for LLM Agents Towards Hard-Exploration Problems cs.CLPDF

Minsoo Kim, Seung-won Hwang

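TL;DR: GLoW is a dual-scale world model approach that addresses the limitations of LLM agents on hard-exploration tasks through a global trajectory frontier and local trial-and-error learning, achieving state-of-the-art performance among LLM-based approaches on the Jericho benchmark.

Details

Motivation: LLM agents are limited on hard-exploration tasks that require learning new knowledge through exploration; GLoW is proposed to improve their exploration ability.

Result: On the Jericho benchmark, GLoW performs comparably to state-of-the-art reinforcement learning methods while requiring 100-800x fewer environment interactions.

Insight: Dual-scale modeling and guidance from advantage signals can markedly improve the exploration efficiency of LLM agents and reduce the need for costly interaction.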

Abstract: LLM-based agents have seen promising advances, yet they are still limited in “hard-exploration” tasks requiring learning new knowledge through exploration. We present GLoW, a novel approach leveraging dual-scale world models, maintaining a trajectory frontier of high-value discoveries at the global scale, while learning from local trial-and-error in exploration through a Multi-path Advantage Reflection mechanism which infers advantage-based progress signals to guide exploration. To evaluate our framework for hard-exploration, we tackle the Jericho benchmark suite of text-based games, where GLoW achieves a new state-of-the-art performance for LLM-based approaches. Compared to state-of-the-art RL-based methods, our approach achieves comparable performance while requiring 100-800x fewer environment interactions.

[332] EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos cs.CLPDF

Sourjyadip Ray, Shubham Sharma, Somak Aditya, Pawan Goyal

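TL;DR: The paper introduces the EduVidQA dataset and task, using multimodal large language models (MLLMs) to automatically generate and evaluate long-form answers to student questions about online lecture videos. The dataset contains 5252 question-answer pairs from 296 computer science videos covering a range of topics and difficulty levels. The authors empirically study student preferences and evaluate six state-of-the-art MLLMs.

Details

Motivation: As digital platforms reshape education, ensuring interactivity is vital for effective learning. The paper explores using MLLMs to automatically answer student questions about online lectures, filling a research gap on this practically important task.

Result: Experiments show that synthetic data is effective for fine-tuning MLLMs, but the task itself remains challenging. The evaluation gives a nuanced picture of model performance and provides a benchmark for future research.

Insight: 1. The study of student preferences provides important guidance for task design; 2. Multimodal tasks need both text-based and qualitative metrics for a full evaluation; 3. EduVidQA opens new directions for NLP research in education.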

Abstract: As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores using Multimodal Large Language Models (MLLMs) to automatically respond to student questions from online lectures - a novel question answering task of real world significance. We introduce the EduVidQA Dataset with 5252 question-answer pairs (both synthetic and real-world) from 296 computer science videos covering diverse topics and difficulty levels. To understand the needs of the dataset and task evaluation, we empirically study the qualitative preferences of students, which we provide as an important contribution to this line of work. Our benchmarking experiments consist of 6 state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for finetuning, as well as showing the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, thus showing a nuanced perspective of the models’ performance, which is paramount to future work. This work not only sets a benchmark for this important problem, but also opens exciting avenues for future research in the field of Natural Language Processing for Education.

[333] Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE cs.CLPDF

Guancheng Wan, Lucheng Fu, Haoxin Liu, Yiqiao Jin, Hui Yi Leong

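TL;DR: The paper proposes the TARE framework, which optimizes prompts through adversarial search and robust selection, reducing their textual sharpness in semantic space and thereby making large language models more robust.

Details

Motivation: Existing prompt optimization methods mainly target point-wise accuracy and neglect paraphrase invariance and search stability, so prompt performance swings widely under small, semantics-preserving paraphrases.

Result: Experiments show that TARE and ATARE produce paraphrase-robust prompts across diverse tasks, outperforming accuracy-only methods while remaining computationally efficient.

Insight: Optimizing prompt robustness requires combining adversarial search with dynamic adjustment of the semantic neighborhood; formalizing textual sharpness offers a new way to approach the problem.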

Abstract: The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or search stability, and therefore cannot remedy this brittleness in practice. Automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model’s parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. We evaluate our methods on diverse tasks, where minimizing the textual sharpness gap leads to prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical.
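
The alternation between an inner adversarial search and an outer robust selection can be sketched as a worst-case criterion over sampled paraphrases. This is a simplified illustration, not the TARE algorithm itself: `paraphrase` and `score` are hypothetical callables standing in for an LLM paraphraser and a task evaluator, and using the minimum over a fixed number of samples is an assumed form of the robustness criterion.

```python
def worst_case_score(prompt, paraphrase, score, n_samples=4):
    """Inner step: stress a prompt with paraphrases and keep its worst score."""
    neighbors = [prompt] + [paraphrase(prompt, i) for i in range(n_samples)]
    return min(score(p) for p in neighbors)


def select_robust_prompt(candidates, paraphrase, score, n_samples=4):
    """Outer step: prefer the candidate whose neighborhood stays strongest."""
    return max(candidates,
               key=lambda c: worst_case_score(c, paraphrase, score, n_samples))
```

An accuracy-only search would pick whichever candidate scores highest on its exact wording; the worst-case criterion instead favors a slightly weaker prompt whose paraphrases perform about as well as the original, which is the "flat" region of the prompt landscape the abstract refers to.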

[334] Your thoughts tell who you are: Characterize the reasoning patterns of LRMs cs.CL | cs.AI | cs.LGPDF

Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi

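TL;DR: The paper proposes LOT, a classification method that uses a generative language model to compare the reasoning traces of different large reasoning models (LRMs) and describe their differences in natural language. LOT distinguishes the reasoning styles of different LRMs with high accuracy and links those styles to task performance.

Details

Motivation: Comparisons of LRMs currently rely on macro-level statistics such as task accuracy or reasoning length, and whether different LRMs reason in different ways remains an open question. The authors aim to fill this gap.

Result: LOT compares the reasoning of 12 open-source LRMs on math, science, and coding tasks and finds systematic differences across scale, base model family, and target domain. In addition, aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 improves their accuracy on GPQA by 3.3-5.7%.

Insight: 1. Reasoning styles differ markedly across LRMs, and the differences can be quantified; 2. Adjusting reasoning style can improve model performance, indicating that reasoning patterns have a practical effect on task solving.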

Abstract: Current comparisons of large reasoning models (LRMs) focus on macro-level statistics such as task accuracy or reasoning length. Whether different LRMs reason differently remains an open question. To address this gap, we introduce the LLM-proposed Open Taxonomy (LOT), a classification method that uses a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. LOT then models how these features predict the source LRM of a reasoning trace based on their empirical distributions across LRM outputs. Iterating this process over a dataset of reasoning traces yields a human-readable taxonomy that characterizes how models think. We apply LOT to compare the reasoning of 12 open-source LRMs on tasks in math, science, and coding. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain. Beyond classification, LOT’s natural-language taxonomy provides qualitative explanations of how LRMs think differently. Finally, in a case study, we link the reasoning differences to performance: aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 during test time improves their accuracy on GPQA by 3.3-5.7%.

[335] Retrieval-augmented GUI Agents with Generative Guidelines cs.CL | cs.AI | cs.LG

Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho

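TL;DR: RAG-GUI is a lightweight vision-language model (VLM) that combines retrieval augmentation with generated guidelines to substantially improve GUI agents on complex digital tasks.

Details

Motivation: Scarce training data and the need to handle rare scenarios in complex tasks limit existing GUI agents in real-world use. RAG-GUI aims to address this through retrieval augmentation and generated guidance.

Result: Across three tasks, RAG-GUI outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, showing strong generalization and practicality.

Insight: The results suggest that combining dynamic retrieval of external knowledge with model fine-tuning can markedly improve task completion by GUI agents, offering a new approach to automating complex tasks.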

Abstract: GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.

[336] Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models cs.CL | cs.AI

Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He

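TL;DR: The paper proposes MedIRT, an evaluation framework based on Item Response Theory (IRT), to assess the topic-specific medical capabilities of large language models (LLMs) more precisely. Evaluating 80 LLMs on 1,100 USMLE-aligned questions reveals diverse and complex ability profiles, and the authors propose a practical decision-support framework.

Details

Motivation: As LLMs are increasingly applied in healthcare, traditional accuracy-based evaluation captures neither question characteristics nor topic-specific differences in ability, so a more reliable evaluation method is needed.

Result: LLM abilities are unevenly distributed: GPT-5 performs best in most domains (8 of 11) but is outperformed by Claude-3-opus in Social Science and Communication. The analysis also identifies flawed questions.

Insight: 1) Overall rankings can mask a model's strengths in specific domains; 2) IRT can be used to audit questions as well as to evaluate models; 3) Topic-specific evaluation is essential for medical applications.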

Abstract: As Large Language Models (LLMs) are increasingly proposed for high-stakes medical applications, there has emerged a critical need for reliable and accurate evaluation methodologies. Traditional accuracy metrics are inadequate, as they neither capture question characteristics nor offer topic-specific insights. To address this gap, we introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT), the gold standard in high-stakes educational testing. Unlike previous research relying on archival data, we prospectively gathered fresh responses from 80 diverse LLMs on a balanced, 1,100-question USMLE-aligned benchmark. Using one unidimensional two-parameter logistic IRT model per topic, we estimate each LLM’s latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. Notably, we identify distinctive “spiky” ability profiles, where overall rankings can be misleading due to highly specialized model abilities. While GPT-5 was the top performer in a majority of domains (8 of 11), it was outperformed in Social Science and Communication by Claude-3-opus, demonstrating that even an overall 23rd-ranked model can hold the top spot for specific competencies. Furthermore, we demonstrate IRT’s utility in auditing benchmarks by identifying flawed questions. We synthesize these findings into a practical decision-support framework that integrates our multi-factor competency profiles with operational metrics. This work establishes a robust, psychometrically grounded methodology essential for the safe, effective, and trustworthy deployment of LLMs in healthcare.
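For readers unfamiliar with the two-parameter logistic (2PL) model named in this abstract, the standard form is P(correct) = 1 / (1 + exp(-a(θ - b))), where θ is ability, a is discrimination, and b is difficulty. The sketch below is a simplification under stated assumptions: the paper estimates ability jointly with item parameters, whereas this example treats the item parameters as known and finds ability by a coarse grid search. The function names and toy items are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a respondent with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses: Sequence[int],
                     items: List[Tuple[float, float]]) -> float:
    """Grid-search maximum-likelihood ability, given known (a, b) per item."""
    grid = [x / 10 for x in range(-40, 41)]  # candidate abilities in [-4, 4]

    def log_likelihood(theta: float) -> float:
        total = 0.0
        for correct, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)


# Three toy items of equal discrimination and increasing difficulty.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
print(estimate_ability([1, 1, 0], items), estimate_ability([1, 0, 0], items))
```

A respondent who also solves the medium item receives a higher ability estimate than one who solves only the easy item, which is the ordering accuracy alone would give here; the two measures diverge once items differ in discrimination.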

[337] PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution cs.CL

Luyang Zhang, Siyuan Peng, Jialu Wang, Shichao Zhu, Beibei Li

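TL;DR: PET infers the evolution of user preferences as a dynamic probability distribution, improving ranking quality by up to 40% in NDCG and outperforming a production model by 7 times in NDCG on long-tail content ranking.

Details

Motivation: Understanding how user preferences evolve is a key challenge in modern digital ecosystems. Existing approaches in which an LLM directly generates a preference list lack transparency and interpretability and tend to amplify popularity bias.

Result: NDCG improves by up to 40% on public benchmarks (Yelp, MovieLens), and on a short-video platform dataset PET outperforms a production model by 7 times in NDCG on long-tail content ranking.

Insight: By replacing direct list generation with a distributional preference mapping, PET enables more explainable, fair, and diverse personalization systems.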

Abstract: Understanding how user preference evolves over time is a fundamental challenge central to modern digital ecosystems, for which Large Language Models (LLMs) are an increasingly prominent and popular approach due to their ability to comprehend the rich semantic context within behavioral data. A common practice is to use LLMs to predict a user’s next action by directly generating a ranked list of preferred items. Although effective for short-term prediction, the end-to-end generation paradigm inherently limits personalization. Its opaque decision-making process obscures holistic user profiling and exacerbates popularity bias. To address these limitations, we propose Preference Evolution Tracking (PET), a framework that reframes the task as inferring a dynamic probability distribution over a stable and interpretable lattice of preference clusters. By applying logit-probing and generative classification techniques, PET infers a user’s preference as a probability distribution, enabling transparent preference learning. On public benchmarks (Yelp, MovieLens), PET improves ranking quality by up to 40% in NDCG over direct generation baselines. On a large-scale, real-world dataset from a short-video platform, it excels at ranking long-tail contents, significantly outperforming a SOTA production model by 7 times in the NDCG score. Ultimately, PET transforms the user profile model from direct preference list generation to a transparent distributional preference mapping, paving the way for more explainable, fair, and diverse personalization systems.
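Since the reported gains are stated in NDCG, a short reference computation may help. This is a generic sketch of NDCG@k with linear gain and a log2 rank discount; the abstract does not say which NDCG variant or cutoff the authors use, so the gain function and the helper names `dcg` and `ndcg_at_k` are assumptions.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevances[:k]) / ideal_dcg


# Relevance labels listed in the order the system ranked the items.
print(ndcg_at_k([3, 2, 1], k=3))
print(ndcg_at_k([1, 2, 3], k=3))
```

A perfectly ordered list scores 1.0, and any ordering that places less relevant items first scores lower.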

[338] AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play cs.CL | cs.AI | cs.IR | cs.LG

Ran Xu, Yuchen Zhuang, Zihan Dong, Jonathan Wang, Yue Yu

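TL;DR: AceSearcher is a reinforced self-play framework that trains a single LLM to alternate between two roles, decomposing complex queries and generating answers. It addresses the weak multi-hop retrieval and reasoning of search-augmented LLMs and improves performance on complex reasoning tasks.

Details

Motivation: Existing search-augmented LLMs perform poorly on complex reasoning tasks, mainly because of ineffective multi-hop retrieval and limited reasoning ability. AceSearcher aims to address these issues with a cooperative self-play framework.

Result: On 10 datasets spanning three reasoning-intensive tasks, AceSearcher improves exact match by 7.6% on average. Smaller models (1.5B and 8B) often surpass existing search-augmented LLMs with up to 9x more parameters.

Insight: AceSearcher shows the potential of self-play and role alternation for improving complex reasoning in LLMs, and suggests that well-designed small models can handle tasks usually left to larger ones.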

Abstract: Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks. Our code will be published at https://github.com/ritaranx/AceSearcher and https://huggingface.co/AceSearcher.

[339] Can Large Language Models Express Uncertainty Like Human? cs.CL | cs.AI

Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang

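TL;DR: The paper studies how large language models (LLMs) express uncertainty through language (e.g., "probably", "might"), proposes a lightweight, human-aligned approach, and releases the first large-scale dataset along with a fine-tuning framework.

Details

Motivation: In high-stakes settings, overconfident LLM responses can mislead users. Existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and numeric confidence scores deviate from natural communication. A more efficient and human-aligned solution is needed.

Result: Most LLMs underperform at expressing reliable linguistic confidence, but carefully designed prompting achieves competitive calibration and discriminability, and fine-tuning improves reliability further.

Insight: Linguistic confidence is a lightweight, efficient, and human-aligned approach to LLM uncertainty estimation that merits further exploration.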

Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we (1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and (2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we (3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we (4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction.
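The pipeline of mapping hedges to confidence scores and then checking calibration can be sketched as follows. This is an illustration only: the abstract does not describe how the paper's mapper works, so the lookup table `HEDGE_CONFIDENCE` and its numbers are made up, and expected calibration error (ECE) is used here as one common calibration measure rather than as the paper's stated metric.

```python
from typing import List, Sequence

# Hypothetical hedge-to-confidence table; the values are invented for illustration.
HEDGE_CONFIDENCE = {
    "definitely": 0.95,
    "probably": 0.75,
    "might": 0.45,
    "unlikely": 0.20,
}


def map_hedge(answer: str, default: float = 0.5) -> float:
    """Return the confidence of the first known hedge found in the answer."""
    lowered = answer.lower()
    for hedge, confidence in HEDGE_CONFIDENCE.items():
        if hedge in lowered:
            return confidence
    return default


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               num_bins: int = 5) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins: List[list] = [[] for _ in range(num_bins)]
    for confidence, ok in zip(confidences, correct):
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


answers = ["It is probably Paris.", "It might be Lyon.", "Paris."]
print([map_hedge(a) for a in answers])
```

A substring lookup like this is far cruder than a learned mapper, but it shows why the approach costs almost nothing at inference time: no logits or repeated sampling are required.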

[340] BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models cs.CL | cs.AI | cs.LG

Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre

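TL;DR: BeyondBench is an evaluation framework for language models that avoids static benchmarks. It generates algorithmic tasks on the fly to sidestep training-data contamination and evaluate reasoning ability fairly.

Details

Motivation: Because of training-data contamination, traditional benchmarks make it hard to tell a model's reasoning ability from memorization.

Result: An evaluation of 101 language models shows reasoning ability dropping sharply on complex tasks, with large differences among large models on the Hard Suite.

Insight: Performance degrades sharply as problem complexity grows from polynomial to exponential, and tool use has a substantial effect on performance.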

Abstract: Evaluating language models fairly is becoming harder as static benchmarks available on the internet risk contamination by training data. This makes it unclear whether models are truly reasoning or just recalling answers. In this paper, we introduce BeyondBench, an evaluation framework that avoids this problem by using algorithmic problem generation. Unlike traditional benchmarks that risk contamination from internet-scale training data, BeyondBench creates mathematically grounded problems on the fly, ensuring each test remains fresh and uncontaminated. Our framework covers 44 algorithmic tasks with a total of 117 variations, grouped into three difficulty levels: the Easy Suite (29 tasks) for basic arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) tackling NP-complete and constraint satisfaction problems. Each task generates problems from a combinatorial space larger than 10^15 unique instances, with solutions verified deterministically by mathematical proofs. We evaluated 101 language models, including 85 open-source and 16 closed-source models, spanning sizes from 0.5B to 141B parameters and multiple quantization schemes. Our results show consistent reasoning deficiencies across model families, with performance degrading sharply as problem complexity increases from polynomial to exponential. In our Hard Suite evaluations, models such as Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved average accuracies of 56.38%, 26.91%, and 33.60%, respectively. Moreover, we observe that performance drops drastically without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing a decline of 16.81%, 28.05%, and 47.59% accuracy on the hard suite. Our leaderboard is publicly available at https://ctrl-gaurav.github.io/BeyondBench/
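The idea of generating fresh problems and verifying answers deterministically can be sketched with a small subset-sum generator. This is not BeyondBench's code: the task choice, the function names, and the parameters (`n`, the value range, the planted solution of size 3) are all hypothetical, picked only because subset sum is a familiar NP-complete problem whose answers are cheap to check.

```python
import random
from itertools import combinations
from typing import Iterable, List, Optional, Tuple


def make_subset_sum(seed: int, n: int = 8) -> Tuple[List[int], int]:
    """Generate an instance with a planted solution, reproducible from the seed."""
    rng = random.Random(seed)
    numbers = [rng.randint(1, 50) for _ in range(n)]
    chosen = rng.sample(range(n), k=3)
    target = sum(numbers[i] for i in chosen)
    return numbers, target


def verify(numbers: List[int], target: int, answer: Iterable[int]) -> bool:
    """Check a proposed answer (distinct, in-range indices) by direct summation."""
    indices = list(answer)
    return (len(set(indices)) == len(indices)
            and all(0 <= i < len(numbers) for i in indices)
            and sum(numbers[i] for i in indices) == target)


def solve_brute_force(numbers: List[int], target: int) -> Optional[Tuple[int, ...]]:
    """Reference solver: try every non-empty subset of indices."""
    for size in range(1, len(numbers) + 1):
        for combo in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in combo) == target:
                return combo
    return None


numbers, target = make_subset_sum(seed=0)
solution = solve_brute_force(numbers, target)
print(numbers, target, solution, verify(numbers, target, solution))
```

Because each seed yields a different instance and verification needs no stored answer key, there is no fixed test set for a model to memorize.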

[341] MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation cs.CL

Yuelyu Ji

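TL;DR: MRAG-Suite is a diagnostic evaluation platform for visual retrieval-augmented generation (Visual RAG) that systematically accounts for query difficulty and ambiguity.

Details

Motivation: Current evaluations of multimodal RAG systems do not systematically account for query difficulty and ambiguity, so a more comprehensive diagnostic tool is needed.

Result: Accuracy drops substantially under difficult and ambiguous queries, and MM-RAGChecker effectively diagnoses hallucinations.

Insight: The platform exposes the limitations of Visual RAG in complex scenarios and points to directions for future improvement.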

Abstract: Multimodal Retrieval-Augmented Generation (Visual RAG) significantly advances question answering by integrating visual and textual evidence. Yet, current evaluations fail to systematically account for query difficulty and ambiguity. We propose MRAG-Suite, a diagnostic evaluation platform integrating diverse multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench). We introduce difficulty-based and ambiguity-aware filtering strategies, alongside MM-RAGChecker, a claim-level diagnostic tool. Our results demonstrate substantial accuracy reductions under difficult and ambiguous queries, highlighting prevalent hallucinations. MM-RAGChecker effectively diagnoses these issues, guiding future improvements in Visual RAG systems.

[342] LOGOS: LLM-driven End-to-End Grounded Theory Development and Schema Induction for Qualitative Research cs.CL | cs.HC

Xinyu Pi, Qisen Yang, Chuong Nguyen

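TL;DR: LOGOS is an end-to-end framework that fully automates grounded theory development through LLM-driven coding, semantic clustering, graph reasoning, and iterative refinement, making qualitative research more scalable and efficient.

Details

Motivation: Grounded theory yields deep insights from qualitative data, but its reliance on expert-intensive manual coding makes it hard to scale. Existing tools fall short of full automation, and LOGOS aims to close this gap.

Result: Across five corpora, LOGOS outperforms existing baselines and reaches 88.2% alignment with an expert-developed schema on a complex dataset.

Insight: LOGOS shows how automation can scale and democratize qualitative research without sacrificing theoretical depth.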

Abstract: Grounded theory offers deep insights from qualitative data, but its reliance on expert-intensive manual coding presents a major scalability bottleneck. Current computational tools stop short of true automation, keeping researchers firmly in the loop. We introduce LOGOS, a novel, end-to-end framework that fully automates the grounded theory workflow, transforming raw text into a structured, hierarchical theory. LOGOS integrates LLM-driven coding, semantic clustering, graph reasoning, and a novel iterative refinement process to build highly reusable codebooks. To ensure fair comparison, we also introduce a principled 5-dimensional metric and a train-test split protocol for standardized, unbiased evaluation. Across five diverse corpora, LOGOS consistently outperforms strong baselines and achieves a remarkable 88.2% alignment with an expert-developed schema on a complex dataset. LOGOS demonstrates a powerful new path to democratize and scale qualitative research without sacrificing theoretical nuance.

[343] DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models cs.CL | cs.AI

Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu

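TL;DR: DiffuGuard is a safety defense framework for diffusion large language models (dLLMs). Using stochastic annealing remasking and block-level audit and repair, it lowers attack success rates while preserving model performance.

Details

Motivation: The iterative and parallel generation mechanisms of dLLMs create distinctive vulnerabilities that call for a targeted defense.

Result: Tested on four dLLMs, DiffuGuard reduces the attack success rate of six jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency.

Insight: dLLMs have substantial intrinsic safety potential, but unlocking it requires better decoding strategies and dynamic repair mechanisms.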

Abstract: The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard’s exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.

Junying Wang, Zicheng Zhang, Ye Shen, Yalun Wu, Yingji Liang

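TL;DR: Q-Mirror is a framework that transforms text-only QA pairs (TQAs) into high-quality multi-modal QA pairs (MMQAs) and improves generation quality through closed-loop iterative refinement, offering a practical answer to the cost and scale problems of building multi-modal scientific benchmarks.

Details

Motivation: High-quality multi-modal benchmarks are crucial for advancing scientific reasoning models, but manual construction is costly and hard to scale. Converting existing text-only QA pairs into multi-modal ones is therefore of practical value.

Result: Q-Mirror raises the average score from 78.90 to 85.22 and the pass rate from 72% to 95%, and top-tier understanding models align closely with human judgment in MMQA quality assessment.

Insight: 1) The quality of multi-modal generation still needs improvement; 2) Closed-loop iterative refinement is an effective way to improve generation quality; 3) A high-quality evaluation rubric is essential for reliable generated results.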

Abstract: High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs), which include three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: Then we construct two extensive benchmarks to rigorously evaluate state-of-the-art generation & understanding models on the distinct tasks of MMQA generation & MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.

[345] Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey cs.CL

Yuntao Shou, Tao Meng, Wei Ai, Keqin Li

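TL;DR: This survey reviews recent progress of multimodal large language models (MLLMs) in multimodal emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks, and identifies key challenges and future research directions.

Details

Motivation: With growing demand for higher-level semantics and cross-modal fusion, MLLMs open new possibilities for emotion recognition and reasoning in complex scenarios. The field lacks a systematic review, and this paper aims to fill that gap.

Result: The paper summarizes how MLLMs perform in emotion recognition and reasoning and notes the strengths and limitations of existing methods.

Insight: MLLMs show great potential in this area, but challenges such as data scarcity and multimodal alignment remain.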

Abstract: In recent years, large language models (LLMs) have driven major advances in language understanding, marking a significant step toward artificial general intelligence (AGI). With increasing demands for higher-level semantics and cross-modal fusion, multimodal large language models (MLLMs) have emerged, integrating diverse information sources (e.g., text, vision, and audio) to enhance modeling and reasoning in complex scenarios. In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. While LLMs and MLLMs have achieved notable progress in this area, the field still lacks a systematic review that consolidates recent developments. To address this gap, this paper provides a comprehensive survey of LLMs and MLLMs for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks. We further highlight key challenges and outline future research directions, aiming to offer researchers both an authoritative reference and practical insights for advancing this domain. To the best of our knowledge, this paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning. A summary of the existing methods discussed is available on our GitHub: https://github.com/yuntaoshou/Awesome-Emotion-Reasoning.

[346] Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining cs.CL | cs.AI

Matthew Theodore Roque, Dan John Velasco

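TL;DR: The paper studies text simplification and curriculum learning for language model pretraining in data-constrained settings and finds that simplified text and complexity-based ordering can improve model performance.

Details

Motivation: To explore how data augmentation (text simplification) and curriculum learning (data ordering) can improve language model pretraining on small datasets.

Result: Adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity ordering, while larger models do better with interleaved ordering.

Insight: 1) Data augmentation helps pretraining on small datasets; 2) Model scale affects which curriculum strategy works best; 3) Text simplification can serve as an efficient data augmentation method.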

Abstract: Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the same text remain underexplored. We address this by studying curriculum learning in pretraining, focusing on text-complexity ordering and data augmentation via simplification. We ask: (1) Does simplifying texts enhance representation quality more than reusing the original data? and (2) Does ordering data by text complexity yield better representations? To answer, we build on a pair of parallel corpora where human-written paragraphs are aligned with LLM-simplified variants, and test four data schedules: repeated exposure, low-to-high complexity, high-to-low, and interleaved. We analyze models’ representation quality from a sample efficiency perspective via fine-tuning, as well as its zero-shot performance on linguistic knowledge, entity tracking, world knowledge, and commonsense reasoning. Our findings show that adding simplified data improves fine-tuning and zero-shot performance over a repeated-exposure baseline: smaller models benefit from low-to-high complexity, while larger models perform better with interleaved ordering.
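The four data schedules named in this abstract can be made concrete with a small sketch. The exact construction is an assumption: the abstract does not say how each schedule is assembled, so this example reads "repeated exposure" as seeing the original texts twice, and treats the simplified variants as the low-complexity data. The function `build_schedule` and the schedule keys are hypothetical.

```python
from typing import List


def build_schedule(originals: List[str],
                   simplified: List[str],
                   schedule: str) -> List[str]:
    """Return the training order for one of four assumed data schedules."""
    if schedule == "repeated":
        return originals + originals      # baseline: no simplified text
    if schedule == "low_to_high":
        return simplified + originals     # easy variants first
    if schedule == "high_to_low":
        return originals + simplified     # hard originals first
    if schedule == "interleaved":
        mixed: List[str] = []
        for easy, hard in zip(simplified, originals):
            mixed.extend([easy, hard])    # alternate aligned pairs
        return mixed
    raise ValueError(f"unknown schedule: {schedule}")


originals = ["O1", "O2"]
simplified = ["S1", "S2"]
for name in ("repeated", "low_to_high", "high_to_low", "interleaved"):
    print(name, build_schedule(originals, simplified, name))
```

Under this reading every schedule exposes the model to the same number of paragraphs, so differences in downstream performance come from content and order rather than from the amount of data.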

[347] Reinforcement Mid-Training cs.CL

Yijun Tian, Shaoyu Chen, Zhichao Xu, Yawei Wang, Jinhe Bi

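TL;DR: The paper proposes reinforcement mid-training (RMT), a new intermediate training stage, and addresses excessive reasoning steps, imbalanced token entropy distribution, and underused token information in large language models.

Details

Motivation: Large language model training is usually split into pre-training and post-training, which overlooks the potential of an intermediate stage and leads to inefficient reasoning and performance bottlenecks.

Result: RMT improves language modeling performance by up to 64.91% with only 21% of the reasoning length, and yields up to 18.76% improvement in subsequent post-training in the mathematical domain.

Insight: Reinforcement mid-training is a key stage for improving model efficiency and performance; a dynamic token budget and adaptive sampling effectively optimize training.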

Abstract: The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.

[348] LLaDA-MoE: A Sparse MoE Diffusion Language Model cs.CL | cs.AI

Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu

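TL;DR: LLaDA-MoE is a large language diffusion model built on a sparse Mixture-of-Experts (MoE) architecture and trained on about 20T tokens. It activates only 1.4B parameters at inference while keeping a 7B-parameter capacity, and performs competitively with larger models.

Details

Motivation: Existing diffusion language models typically need substantial compute; the authors aim to reduce computational overhead with a sparse MoE architecture while keeping performance high.

Result: LLaDA-MoE surpasses previous diffusion language models (LLaDA, LLaDA 1.5, and Dream) across multiple benchmarks, and its instruct-tuned variant shows capabilities comparable to Qwen2.5-3B-Instruct.

Insight: A sparse MoE architecture can cut computational overhead while maintaining high performance, leaving ample room for further exploration of diffusion language models.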

Abstract: We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE’s strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available at Huggingface.

[349] Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling cs.CL | cs.DB

Pengfei Wang, Baolin Sun, Xuemei Dong, Yaxun Dai, Hongwei Yuan

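TL;DR: Agentar-Scale-SQL improves Text-to-SQL performance through Orchestrated Test-Time Scaling, which combines multiple scaling perspectives, and achieves state-of-the-art results on the BIRD benchmark.

Details

Motivation: Existing Text-to-SQL methods fall far short of human experts on challenging benchmarks such as BIRD, and test-time scaling approaches lack an orchestrated strategy and neglect the model's internal reasoning process.

Result: The method reaches 81.67% execution accuracy on the BIRD test set and ranks first on the official leaderboard, a step toward human-level performance.

Insight: Orchestrating multiple scaling perspectives is an effective path to better Text-to-SQL performance, and the framework is designed for easy adaptation to new databases and stronger language models.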

Abstract: State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model’s internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
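Execution accuracy, the metric quoted here, scores a predicted query by running it and comparing its result with that of the gold query rather than comparing SQL strings. The sketch below uses Python's standard `sqlite3` module on a toy in-memory table. It is a simplified illustration, not the official BIRD evaluator: the order-insensitive comparison, the `execution_match` helper, and the `singer` table are assumptions.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection,
                    predicted_sql: str,
                    gold_sql: str) -> bool:
    """True if both queries return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as incorrect
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


# Toy in-memory database standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ana", 31), ("Bo", 24), ("Cy", 45)])

gold = "SELECT name FROM singer WHERE age > 30"
print(execution_match(conn, "SELECT name FROM singer WHERE NOT age <= 30", gold))
print(execution_match(conn, "SELECT name FROM singer", gold))
```

Two syntactically different queries count as a match when their results agree, which is why execution accuracy is usually preferred over exact string match for Text-to-SQL.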

[350] Multilingual Text-to-SQL: Benchmarking the Limits of Language Models with Collaborative Language Agents cs.CL | cs.AI | cs.DB | cs.ET | cs.IR

Khanh Trinh Pham, Thu Huong Nguyen, Jun Jo, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

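TL;DR: The paper introduces MultiSpider 2.0, a multilingual Text-to-SQL benchmark that extends Spider 2.0 to eight languages, exposes the limits of current language models on multilingual tasks, and proposes a collaborative language-agent approach to improve performance.

Details

Motivation: Most existing Text-to-SQL benchmarks are English-only, which limits multilingual progress. Extending the task to multiple languages helps drive the development of cross-lingual systems.

Result: State-of-the-art LLMs reach only 4% execution accuracy on MultiSpider 2.0 (versus 60% on MultiSpider 1.0); the collaborative agent baseline raises this to 15%.

Insight: Multilingual Text-to-SQL needs more research: current models adapt poorly to linguistic diversity, and collaborative methods show room for improvement.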

Abstract: Text-to-SQL enables natural access to databases, yet most benchmarks are English-only, limiting multilingual progress. We introduce MultiSpider 2.0, extending Spider 2.0 to eight languages (English, German, French, Spanish, Portuguese, Japanese, Chinese, Vietnamese). It preserves Spider 2.0’s structural difficulty while adding linguistic and dialectal variability, demanding deeper reasoning for complex SQL. On this benchmark, state-of-the-art LLMs (such as DeepSeek-R1 and OpenAI o1) reach only 4% execution accuracy when relying on intrinsic reasoning, versus 60% on MultiSpider 1.0. Therefore, we provide a collaboration-driven language agents baseline that iteratively refines queries, improving accuracy to 15%. These results reveal a substantial multilingual gap and motivate methods that are robust across languages and ready for real-world enterprise deployment. Our benchmark is available at https://github.com/phkhanhtrinh23/Multilingual_Text_to_SQL.

[351] CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task cs.CL

Haosi Mo, Xinyu Ma, Xuebo Liu, Derek F. Wong, Yu Li

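TL;DR: The paper proposes CDT (Cognition-Domain-Task), a comprehensive evaluation framework that measures LLM capabilities along three dimensions (cognition, domain, and task) and addresses gaps in existing benchmarks.

Details

Motivation: Existing LLM benchmarks usually focus on isolated abilities and lack a holistic evaluation framework. The authors propose CDT to assess model capabilities more comprehensively.

Result: CDT's capability metrics correlate well with downstream performance, and in data selection experiments they raise scores on general and specific benchmarks by 1.6 and 2.2 points over the baselines, respectively.

Insight: CDT provides a more comprehensive way to assess model capabilities and can also guide dataset analysis and construction.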

Abstract: Recent advances in Large Language Models (LLMs) have significantly enhanced their capabilities, highlighting the need for comprehensive evaluation frameworks that extend beyond task-specific benchmarks. However, existing benchmarks often focus on isolated abilities, lacking a holistic framework for assessing LLM capabilities. To address this gap, we propose the Cognition-Domain-Task (CDT) framework, which comprehensively measures a model’s capabilities across three dimensions. We expand the scope of model capability definitions at the cognitive level by incorporating the Cattell-Horn-Carroll cognitive theory, refining the categorization of model capabilities. We apply CDT in two directions: dataset capability evaluation and data selection. Experiments show that our capability metrics correlate well with downstream performance and can support effective dataset analysis and construction. The experiments on data selection also show significant improvements in both general and specific benchmarks, achieving scores of 44.3 and 45.4, with an increase of 1.6 and 2.2 points over the baselines, respectively. These results validate the effectiveness and practicality of CDT. Source code and models are available at https://github.com/Alessa-mo/CDT.

[352] Alternatives To Next Token Prediction In Text Generation – A Survey cs.CL | cs.AI

Charlie Wyatt, Aditya Joshi, Flora Salim

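TL;DR: This survey examines text generation methods that offer alternatives to Next Token Prediction (NTP), grouping them into five families: multi-token prediction, plan-then-generate, latent reasoning, continuous generation approaches, and non-Transformer architectures, all aimed at the limitations of NTP.

Details

Motivation: The NTP paradigm has driven the success of large language models but also causes poor long-term planning, error accumulation, and computational inefficiency, which motivates exploring alternatives.

Result: The survey reviews existing work and proposes a taxonomy to guide the development of more transformative natural language processing models.

Insight: Alternatives to NTP may outperform it in long-term planning and computational efficiency, but further research is needed to confirm this.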

Abstract: The paradigm of Next Token Prediction (NTP) has driven the unprecedented success of Large Language Models (LLMs), but is also the source of their most persistent weaknesses such as poor long-term planning, error accumulation, and computational inefficiency. Acknowledging the growing interest in exploring alternatives to NTP, the survey describes the emerging ecosystem of alternatives to NTP. We categorise these approaches into five main families: (1) Multi-Token Prediction, which targets a block of future tokens instead of a single one; (2) Plan-then-Generate, where a global, high-level plan is created upfront to guide token-level decoding; (3) Latent Reasoning, which shifts the autoregressive process itself into a continuous latent space; (4) Continuous Generation Approaches, which replace sequential generation with iterative, parallel refinement through diffusion, flow matching, or energy-based methods; and (5) Non-Transformer Architectures, which sidestep NTP through their inherent model structure. By synthesizing insights across these methods, this survey offers a taxonomy to guide research into models that address the known limitations of token-level generation to develop new transformative models for natural language processing.

[353] GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training cs.CLPDF

Hongcheng Wang, Yinuo Huang, Sukai Wang, Guanghui Ren, Hao Dong

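TL;DR: GRPO-MA improves the GRPO algorithm by generating multiple answers per thought, addressing three problems (gradient coupling, sparse reward signals, and unstable advantage estimation) and improving training stability and efficiency.

Details

Motivation: When training chain-of-thought reasoning, GRPO suffers from gradient coupling, sparse reward signals, and unstable advantage estimation, which limit model performance and training efficiency.

Result: On math, code, and multimodal tasks, GRPO-MA substantially improves performance and training efficiency, and increasing the number of answers consistently improves model performance.

Insight: Multi-answer generation effectively mitigates instability in RL training, and theory supports its variance-reduction benefit.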

Abstract: Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.
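
The variance claim can be illustrated with a toy Monte Carlo sketch. It assumes binary rewards and a fixed, hypothetical per-thought success probability `p_correct`, and it estimates a thought's value as the mean reward over `num_answers` sampled answers. The function names are illustrative, and this is not the paper's advantage estimator.

```python
import random
import statistics


def thought_value_estimate(p_correct, num_answers, rng):
    """Mean binary reward over `num_answers` answers sampled from one thought."""
    rewards = [1.0 if rng.random() < p_correct else 0.0 for _ in range(num_answers)]
    return sum(rewards) / num_answers


def estimate_variance(p_correct, num_answers, trials=5000, seed=0):
    """Empirical variance of the thought-value estimate across repeated trials."""
    rng = random.Random(seed)
    estimates = [
        thought_value_estimate(p_correct, num_answers, rng) for _ in range(trials)
    ]
    return statistics.pvariance(estimates)


# Assumed setting: a thought that leads to a correct answer half of the time.
for m in (1, 2, 4, 8):
    print(m, round(estimate_variance(0.5, m), 4))
```

Under these assumptions the variance of the estimate shrinks roughly as p(1 - p) / M, which is the intuition behind sampling several answers per thought.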

[354] AdaThink-Med: Medical Adaptive Thinking with Uncertainty-Guided Length Calibration cs.CL | cs.AIPDF

Shaohao Rui, Kaitao Chen, Weijie Ma, Xiaosong Wang

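TL;DR: AdaThink-Med is an end-to-end framework that strengthens adaptive thinking in medical reasoning models through uncertainty-guided length calibration, dynamically adjusting reasoning length to balance performance and computational cost.

Details

Motivation: Existing medical LLMs use lengthy reasoning for simple and complex questions alike, raising inference cost in real-world use. An adaptive method that adjusts reasoning length dynamically is needed to improve efficiency.

Result: On six medical QA benchmarks, reasoning length is reduced by up to 6.4x on average with only minimal performance degradation, and the model spontaneously develops two reasoning modes, "non-thinking" and "thinking".

Insight: Uncertainty is an effective signal for dynamically adjusting reasoning length; adaptive thinking can substantially improve efficiency while preserving model performance.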

Abstract: Recent advances in inference-time scaling with extended long chain-of-thought have significantly improved the reasoning capabilities of both general and medical large language models (LLMs). However, these models tend to engage in lengthy reasoning processes regardless of the difficulty of the input question, leading to increased inference costs in real-world applications. Therefore, enabling adaptive thinking where models think less for simpler questions and think more for complex ones is critical for the effective use of medical LLMs in practice. Despite its importance, there is a lack of end-to-end approaches designed to enhance the adaptive thinking capabilities of medical LLMs while providing a comprehensive examination of the trade-off between performance and computational cost. To bridge this gap, we propose AdaThink-Med, the first end-to-end framework designed to enhance adaptive thinking ability in medical reasoning models with uncertainty-guided length calibration. AdaThink-Med first generates multiple candidate outputs for each question, evaluates the correctness and uncertainty of each candidate, and then estimates problem difficulty via an uncertainty-guided length calibration module. For outputs with low difficulty and correct answers, the framework penalizes longer reasoning paths; whereas for those with high difficulty and incorrect answers, it encourages extending the chain of thought to explore alternative solutions. On six public medical QA benchmarks, AdaThink-Med achieves up to 6.4x length reduction on average while retaining performance with only minimal degradation. Intriguingly, we observe that AdaThink-Med spontaneously develops two distinct reasoning modes, which we characterize as “non-thinking” and “thinking”, demonstrating the model’s ability to suppress redundant reasoning processes dynamically.
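
The length rule described in the abstract can be sketched as a reward-shaping function. This is a minimal illustration under stated assumptions, not the authors' formulation: `difficulty` is assumed to be a score in [0, 1] produced elsewhere, and `difficulty_threshold`, `weight`, and the linear length term are hypothetical choices.

```python
def length_calibrated_reward(correct, difficulty, length, max_length,
                             difficulty_threshold=0.5, weight=0.5):
    """Toy reward: correctness plus a difficulty-dependent length term.

    All parameters other than `correct` and `length` are illustrative
    assumptions; the abstract does not specify the actual formula.
    """
    base = 1.0 if correct else 0.0
    relative_length = min(length / max_length, 1.0)
    if correct and difficulty < difficulty_threshold:
        # Easy question answered correctly: penalize longer reasoning.
        return base - weight * relative_length
    if not correct and difficulty >= difficulty_threshold:
        # Hard question answered incorrectly: encourage longer exploration.
        return base + weight * relative_length
    # Remaining cases are left unshaped in this sketch.
    return base


# Two correct answers to an easy question: the shorter one scores higher.
print(length_calibrated_reward(True, 0.1, 100, 1000))
print(length_calibrated_reward(True, 0.1, 900, 1000))
```

Only the two cases named in the abstract are shaped here; how the remaining combinations of difficulty and correctness are treated is not stated in the abstract.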

[355] Inducing Dyslexia in Vision Language Models cs.CL | cs.LGPDF

Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf

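TL;DR: The paper simulates dyslexia in vision-language models (VLMs) by selectively perturbing the models' visual word processing units, reproducing key characteristics of human dyslexia and providing a new computational framework for studying it.

Details

Motivation: Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, are limited in their ability to test causal hypotheses, so the authors simulate dyslexia in VLMs to explore its mechanisms more flexibly.

Result: Targeted perturbation leads to selective impairment on reading tasks while other visual and language comprehension abilities remain intact, consistent with human dyslexia.

Insight: The work shows that VLMs can serve as a tool for studying dyslexia, offering a new platform for simulating and testing theories.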

Abstract: Dyslexia, a neurodevelopmental disorder characterized by persistent reading difficulties, is often linked to reduced activity of the visual word form area in the ventral occipito-temporal cortex. Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, have provided valuable insights but remain limited in their ability to test causal hypotheses about the underlying mechanisms of reading impairments. In this study, we use large-scale vision-language models (VLMs) to simulate dyslexia by functionally identifying and perturbing artificial analogues of word processing. Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that targeted ablation of these units, unlike ablation of random units, leads to selective impairments in reading tasks while general visual and language comprehension abilities remain intact. In particular, the resulting model matches dyslexic humans’ phonological deficits without a significant change in orthographic processing. Taken together, our modeling results replicate key characteristics of dyslexia and establish a computational framework for investigating reading disorders.

[356] InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation cs.CL | cs.AI | cs.LGPDF

Weilin Zhao, Zihan Zhou, Zhou Su, Chaojun Xiao, Yuxuan Li

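TL;DR: InfLLM-V2 proposes a dense-sparse switchable attention framework that addresses the computational and memory bottlenecks of standard Transformers on long sequences while adapting seamlessly from short to long sequences.

Details

Motivation: Self-attention in the standard Transformer faces severe computational and memory bottlenecks on long sequences. Existing sparse attention methods offer a solution but introduce extra parameters and disrupt the training workflow.

Result: Experiments show that on long-context understanding and chain-of-thought reasoning tasks, InfLLM-V2 is 4x faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively.

Insight: InfLLM-V2 offers an efficient and flexible approach to long-sequence processing that avoids the parameter overhead and workflow disruption of existing sparse attention methods, making it suitable for practical deployment.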

Abstract: Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional pretrain-on-short, finetune-on-long workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce a dense-sparse switchable attention framework, termed InfLLM-V2. InfLLM-V2 is a trainable sparse attention that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4× faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (https://huggingface.co/openbmb/MiniCPM4.1-8B), a hybrid reasoning model, providing a reproducible implementation for the research community.
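
The dense-to-sparse switch can be sketched for a single head in pure Python. This is a simplified illustration, not the InfLLM-V2 implementation: the length threshold `dense_limit`, the block size, and the rule of scoring each key block by its mean key are assumptions made for the example. What it does share with the description above is that both modes use the same `Q`, `K`, and `V`, so no extra parameters are introduced.

```python
import math


def _softmax_attend(q, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def switchable_attention(Q, K, V, dense_limit=8, block=4, top_k=2):
    """Causal attention: dense for short inputs, block-sparse for long ones."""
    n = len(Q)
    out = []
    for i in range(n):
        visible = list(range(i + 1))  # causal mask: positions 0..i
        if n > dense_limit:
            # Sparse mode (assumed rule): rank key blocks by q . mean(key block),
            # keep the top_k blocks plus the query's own local block.
            blocks = {}
            for j in visible:
                blocks.setdefault(j // block, []).append(j)

            def block_score(b):
                idx = blocks[b]
                mean_k = [sum(K[j][d] for j in idx) / len(idx) for d in range(len(Q[i]))]
                return sum(a * c for a, c in zip(Q[i], mean_k))

            keep = set(sorted(blocks, key=block_score, reverse=True)[:top_k])
            keep.add(i // block)
            visible = [j for j in visible if j // block in keep]
        out.append(_softmax_attend(Q[i], [K[j] for j in visible], [V[j] for j in visible]))
    return out
```

When `top_k` covers every block, the sparse path attends to the same positions as the dense path, which is one way to sanity-check the switch.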

[357] MemGen: Weaving Generative Latent Memory for Self-Evolving Agents cs.CLPDF

Guibin Zhang, Muxin Fu, Shuicheng Yan

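TL;DR: MemGen proposes a dynamic generative memory framework that tightly couples memory and reasoning in LLM-powered agents through a memory trigger and a memory weaver, substantially improving performance and showing spontaneous emergence of human-like memory faculties.

Details

Motivation: Existing agent memory paradigms (parametric memory and retrieval-based memory) fail to capture the dynamic interplay of reasoning and memory in human cognition; MemGen aims to fill this gap.

Result: Across eight benchmarks, MemGen surpasses ExpeL and AWM by up to 38.22% and GRPO by up to 13.44%, and exhibits cross-domain generalization and spontaneously formed memory faculties.

Insight: MemGen's spontaneous memory formation suggests that machine learning can realize more naturalistic, human-like cognitive mechanisms, with potential for self-evolution.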

Abstract: Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a memory trigger, which monitors the agent’s reasoning state to decide explicit memory invocation, and a memory weaver, which takes the agent’s current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

[358] Socratic-Zero : Bootstrapping Reasoning via Data-Free Agent Co-evolution cs.CLPDF

Shaobo Wang, Zhengbo Jiao, Zifan Zhang, Yilang Peng, Xu Ze

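TL;DR: Socratic-Zero is a fully autonomous framework that generates high-quality training data from a small number of seed examples through the co-evolution of a Teacher, a Solver, and a Generator, substantially improving performance on mathematical reasoning tasks.

Details

Motivation: Large, high-quality datasets depend on human annotation and are hard to scale; existing data synthesis methods suffer from inconsistent quality and cannot adapt dynamically to the model's capabilities.

Result: Starting from only 100 seed questions, Socratic-Solver-8B gains an average of 20.2 percentage points over prior data synthesis methods across seven mathematical reasoning benchmarks; the synthetic data also enables student LLMs to outperform SOTA commercial models.

Insight: A closed-loop co-evolution mechanism can effectively address the quality and adaptivity problems of data synthesis, suggesting a new direction for few-shot or zero-shot learning.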

Abstract: Recent breakthroughs in large language models (LLMs) on reasoning tasks rely heavily on massive, high-quality datasets, which are typically human-annotated and thus difficult to scale. While data synthesis or distillation offers a promising alternative, existing methods struggle with inconsistent data quality and an inability to dynamically adapt to the evolving capabilities of the model, leading to suboptimal training signals. To address these limitations, we introduce Socratic-Zero, a fully autonomous framework that generates high-quality training data from minimal seed examples through the co-evolution of three agents: the Teacher, the Solver, and the Generator. The Solver continuously refines its reasoning by learning from preference feedback on both successful and failed trajectories; the Teacher adaptively crafts increasingly challenging questions based on the Solver’s weaknesses; and the Generator distills the Teacher’s question-design strategy to enable scalable, high-fidelity curriculum generation. This closed-loop system produces a self-improving curriculum, requiring no pre-existing tasks or labels. Remarkably, starting from only 100 seed questions, our Socratic-Solver-8B achieves an average gain of +20.2 percentage points over prior data synthesis methods across seven mathematical reasoning benchmarks (AMC23, AIME24-25, Olympiad, MATH-500, Minerva, and GSM8K), with consistent gains on both Qwen3 and GLM4 series models. Even more surprisingly, synthetic data from Socratic-Generator-32B enables student LLMs to achieve superior performance compared to other state-of-the-art (SOTA) commercial LLMs on these benchmarks, including Qwen3-235B-A22B, DeepSeek-V3.1-671B, GPT-5, Gemini-2.5-Pro, Grok-4, and Claude-4.1-Opus.

[359] ProxyAttn: Guided Sparse Attention via Representative Heads cs.CL | cs.LGPDF

Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang

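TL;DR: ProxyAttn is a training-free sparse attention algorithm that exploits the similarity among attention heads, using representative heads as proxies for the attention scores of all heads and combining them with dynamic budget estimation to evaluate block importance at a finer granularity, substantially improving efficiency and performance on long-text tasks.

Details

Motivation: The quadratic complexity of attention limits the efficiency of large language models (LLMs) on long-text tasks. Existing methods achieve block sparse attention by dynamically estimating block importance, but their coarse-grained estimates degrade performance at high sparsity rates.

Result: Across a variety of mainstream models and benchmarks, ProxyAttn achieves up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.

Insight: Attention heads are highly similar to one another, so a small number of representative heads can stand in for the rest, enabling efficient and accurate attention computation at high sparsity rates.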

Abstract: The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.

[360] LatentEvolve: Self-Evolving Test-Time Scaling in Latent Space cs.CLPDF

Guibin Zhang, Fanci Meng, Guancheng Wan, Zherui Li, Kun Wang

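TL;DR: LatentEvolve proposes a self-evolving test-time scaling framework that mimics the human brain's dual learning systems to improve the inference-time performance of large language models without modifying model parameters.

Details

Motivation: Existing test-time scaling methods operate independently and lack progressive learning, limiting the inference-time potential of LLMs. Inspired by the complementary learning systems of the human brain, the authors aim to improve test-time scaling by emulating a fast-and-slow dual-system learning mechanism.

Result: Experiments on eight benchmarks and five model backbones show that LatentEvolve surpasses existing test-time scaling methods such as LatentSeek and TTRL by up to 13.33% and exhibits strong cross-domain and cross-backbone generalization.

Insight: Emulating the dynamic (fast plus slow) learning mechanisms of human cognition can substantially improve LLM reasoning in a fully unsupervised manner, offering a new design direction for test-time scaling.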

Abstract: Test-time Scaling (TTS) has been demonstrated to significantly enhance the reasoning capabilities of Large Language Models (LLMs) during the inference phase without altering model parameters. However, existing TTS methods are largely independent, implying that LLMs have not yet evolved to progressively learn how to scale more effectively. With the objective of evolving LLMs to learn “how to scale test-time computation,” we propose LatentEvolve, a self-evolving latent TTS framework inspired by the complementary learning system (CLS) theory. Analogous to the human brain’s dual system of a fast-recall hippocampus and a slow-consolidating neocortex, LatentEvolve comprises two evolutionary components: daytime scaling, which rapidly retrieves historical latent representations to better guide current LLM reasoning; and nighttime scaling, which integrates past latent optimizations in a manner akin to the human brain’s consolidation of experiences during sleep. The alternation of daytime and nighttime processes facilitates a fast and slow evolution of LLM TTS, mirroring human cognitive dynamics in a fully unsupervised manner. Extensive experiments across eight benchmarks and five model backbones demonstrate that our LatentEvolve surpasses state-of-the-art TTS methods such as LatentSeek and TTRL by up to 13.33% and exhibits exceptional cross-domain and cross-backbone generalization.

[361] KnowGuard: Knowledge-Driven Abstention for Multi-Round Clinical Reasoning cs.CLPDF

Xilin Dang, Kexin Chen, Xiaorui Su, Ayush Noori, Iñaki Arango

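TL;DR: KnowGuard proposes a knowledge-driven abstention mechanism for multi-round clinical reasoning that uses knowledge graph exploration and evidence evaluation to improve diagnostic accuracy and reduce unnecessary interaction.

Details

Motivation: In clinical practice, physicians abstain from decisions when information is insufficient in order to avoid misdiagnosis, but existing large language models (LLMs) struggle with this behavior and often err through overconfidence. KnowGuard aims to address this problem.

Result: KnowGuard improves diagnostic accuracy by 3.93% while reducing unnecessary interaction by 7.27 turns on average, outperforming state-of-the-art abstention approaches.

Insight: Incorporating an external knowledge graph helps identify knowledge boundaries more accurately, substantially improving the reliability of clinical decisions.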

Abstract: In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with abstention, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidence. To address this, we propose KnowGuard, a novel investigate-before-abstain paradigm that integrates systematic knowledge graph exploration for clinical decision-making. Our approach consists of two key stages operating on a shared contextualized evidence pool: 1) an evidence discovery stage that systematically explores the medical knowledge space through graph expansion and direct retrieval, and 2) an evidence evaluation stage that ranks evidence using multiple factors to adapt exploration based on patient context and conversation history. This two-stage approach enables systematic knowledge graph exploration, allowing models to trace structured reasoning paths and recognize insufficient medical evidence. We evaluate our abstention approach using open-ended multi-round clinical benchmarks that mimic realistic diagnostic scenarios, assessing abstention quality through accuracy-efficiency trade-offs beyond existing closed-form evaluations. Experimental evidence clearly demonstrates that KnowGuard outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93% while reducing unnecessary interaction by 7.27 turns on average.

[362] DiaCDM: Cognitive Diagnosis in Teacher-Student Dialogues using the Initiation-Response-Evaluation Framework cs.CLPDF

Rui Jia, Yuang Wei, Ruijia Li, Yuang-Hao Jiang, Xinyu Xie

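TL;DR: DiaCDM is a dialogue-based cognitive diagnosis model that the authors describe as the first to apply cognitive diagnosis to teacher-student dialogues. Using an adapted IRE framework and a graph-based encoding method, it addresses the problems traditional CD models face with dynamic, unstructured dialogue, significantly improving diagnostic accuracy and interpretability.

Details

Motivation: Traditional cognitive diagnosis (CD) models rely on structured test data and cannot be applied directly to dynamic, unstructured teacher-student dialogues. Accurately extracting diagnostic semantics from lengthy dialogues is also challenging. A diagnostic framework and method suited to dialogue are therefore needed.

Result: Experiments on three real-world dialogue datasets show that DiaCDM significantly improves diagnostic accuracy while enhancing interpretability, giving teachers a new tool for assessing students' cognitive states.

Insight: 1. Cognitive diagnosis in dialogue settings requires the ability to handle dynamic, unstructured data; 2. Incorporating educational theory (such as the IRE framework) can effectively improve model adaptability; 3. Graph-based encoding can effectively capture the core semantic relations in a dialogue.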

Abstract: While cognitive diagnosis (CD) effectively assesses students’ knowledge mastery from structured test data, applying it to real-world teacher-student dialogues presents two fundamental challenges. Traditional CD models lack a suitable framework for handling dynamic, unstructured dialogues, and it’s difficult to accurately extract diagnostic semantics from lengthy dialogues. To overcome these hurdles, we propose DiaCDM, an innovative model. We’ve adapted the initiation-response-evaluation (IRE) framework from educational theory to design a diagnostic framework tailored for dialogue. We also developed a unique graph-based encoding method that integrates teacher questions with relevant knowledge components to capture key information more precisely. To our knowledge, this is the first exploration of cognitive diagnosis in a dialogue setting. Experiments on three real-world dialogue datasets confirm that DiaCDM not only significantly improves diagnostic accuracy but also enhances the results’ interpretability, providing teachers with a powerful tool for assessing students’ cognitive states. The code is available at https://github.com/Mind-Lab-ECNU/DiaCDM/tree/main.

[363] Hierarchical Error Correction for Large Language Models: A Systematic Framework for Domain-Specific AI Quality Enhancement cs.CL | cs.AI | I.2.7; I.2.6PDF

Zhilong Zhao, Yindi Liu

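TL;DR: The paper proposes a Hierarchical Error Correction (HEC) framework to address the performance problems of large language models in specialized domains, significantly improving accuracy through systematic error analysis and targeted intervention strategies.

Details

Motivation: LLMs perform poorly in specialized domains, reaching only 45.9% accuracy on medical coding tasks, for example. To address this, the authors analyze the hierarchical structure of errors and propose a systematic correction framework.

Result: Cross-model validation across five LLM architectures shows that the HEC framework yields an average improvement of 11.2 percentage points (p < 0.001), but its effect is limited on high-baseline tasks (>75% accuracy).

Insight: Systematic error analysis can guide effective AI enhancement strategies, particularly for moderate-baseline tasks, while also revealing limitations on high-baseline and complex reasoning tasks.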

Abstract: Large Language Models face significant performance challenges in specialized domains, with state-of-the-art models achieving only 45.9% accuracy on medical coding tasks. This study proposes a Hierarchical Error Correction (HEC) framework that addresses domain-specific AI limitations through systematic error analysis and targeted intervention strategies. We analyze error patterns across four specialized domains and find that AI errors follow consistent hierarchical structures: Knowledge-layer errors (58.4%), Reasoning-layer errors (39.6%), and Complexity-layer errors (2.0%). Based on these patterns, we develop a three-stage correction framework that addresses errors according to their hierarchical importance and demonstrates that framework effectiveness correlates inversely with baseline task performance. Experimental validation across medical transcription (4,921 cases), legal document classification (1,000 cases), political bias detection (645 cases), and legal reasoning (1,000 cases) shows consistent improvements. Cross-model validation across five LLM architectures demonstrates average improvements of 11.2 percentage points (p < 0.001). However, analysis reveals framework limitations in high-baseline tasks (>75% accuracy), where hierarchical intervention may interfere with effective reasoning processes. The results suggest that systematic error analysis can guide effective AI enhancement strategies in specialized domains, particularly for moderate-baseline tasks, while highlighting the importance of understanding framework boundaries for optimal deployment.

[364] Expanding Computation Spaces of LLMs at Inference Time cs.CLPDF

Yoonna Jang, Kisu Yang, Isabelle Augenstein

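TL;DR: The paper studies whether inserting filler tokens at inference time effectively expands a language model's computation space, examining the effects of token type, count, and insertion position, and finds that smaller models benefit most.

Details

Motivation: Chain-of-thought rationales help language models not only through detailed reasoning steps but also through the expanded computation space of longer inputs. The paper asks whether that space alone, supplied as filler tokens at inference time, can improve performance.

Result: Appropriately inserted filler tokens can improve model performance (by up to 12.372 percentage points for SmolLM2-1.7B-Instruct), especially for smaller models. Attention maps show that the filler tokens often continue the original attention mechanism.

Insight: Filler tokens give language models additional computational capacity rather than acting as redundant input, and smaller models benefit most.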

Abstract: Chain-of-thought (CoT) rationale enables language models to use additional task-related text for problem-solving, benefiting not only from detailed reasoning steps but also from the expanded computational space of longer inputs. Prior work has trained filler or special tokens to serve as additional computation spaces. In this study, we investigate whether language models can leverage artificially inserted sequences of filler tokens solely at inference. We first identify effective token types, numbers, and insertion locations, then examine at what stage of training models begin to exploit the expanded computation space, and finally analyze dynamics within these spaces via attention maps. Experiments on models ranging from 1.7B to 32B across open-domain QA and math tasks show that appropriate token types and counts vary, but placing filler tokens directly before the final ‘Answer:’ token is most effective. Smaller models benefit most, up to 12.372 percentage points in SmolLM2-1.7B-Instruct, indicating that these spaces act as additional computational capacity rather than redundant input. Attention maps reveal that expanded spaces often continue the original attention mechanism and sometimes focus on questions or answer options, suggesting meaningful computation for problem-solving.
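
The placement the abstract reports as most effective, directly before the final 'Answer:' token, can be sketched as a small prompt transform. The filler string `"..."`, the default count, and the function name are hypothetical; the abstract notes that appropriate token types and counts vary by model.

```python
def insert_filler_tokens(prompt, filler="...", count=8, marker="Answer:"):
    """Insert `count` copies of `filler` directly before the last `marker`.

    The filler string and count are illustrative assumptions, not values
    taken from the paper.
    """
    pos = prompt.rfind(marker)
    if pos == -1:
        raise ValueError(f"marker {marker!r} not found in prompt")
    if count <= 0:
        return prompt
    fillers = " ".join([filler] * count)
    return f"{prompt[:pos]}{fillers} {prompt[pos:]}"


prompt = "Question: What is the capital of France?\nAnswer:"
print(insert_filler_tokens(prompt, count=4))
```

Using `rfind` targets the last occurrence of the marker, so an earlier "Answer:" inside few-shot examples would be left untouched.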

[365] MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes cs.CL | cs.AIPDF

Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen

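TL;DR: MobileLLM-R1 challenges the assumption that language models need massive data and parameter counts to reason, showing that with careful curation and resampling of high-quality data, small models can also achieve strong reasoning ability.

Details

Motivation: Conventional wisdom holds that reasoning ability requires large models and massive datasets, but recent work shows that small models can reason too. This paper further examines how much data scale and quality are actually necessary.

Result: MobileLLM-R1-950M reaches an AIME score of 15.5, far above comparable fully open-source models, and matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks while using only 11.7% of the latter's pretraining tokens.

Insight: High-quality data and a carefully designed training recipe for small models can substitute for massive data and parameter scale, offering a new direction for developing reasoning models under limited resources.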

Abstract: The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by an established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3’s proprietary 36T-token corpus for pretraining, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have released the complete training recipe, data sources, data mixing ratio, and model checkpoints, together with the key insights obtained throughout this study.
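
The token figures in the abstract can be cross-checked with a short calculation. The 11.7% figure appears consistent with 4.2T pretraining tokens out of Qwen3's 36T; the "passes" number below is only a rough average, since the abstract gives the curated pool as "~2T" and does not say how the resampling is distributed.

```python
# Figures quoted in the abstract (token counts).
pretraining_tokens = 4.2e12   # tokens used for pre-training
qwen3_tokens = 36e12          # Qwen3's reported pretraining corpus
curated_tokens = 2e12         # "~2T" curated pool; approximate

ratio = pretraining_tokens / qwen3_tokens
passes = pretraining_tokens / curated_tokens  # rough average, assumes exactly 2T

print(f"share of Qwen3 pretraining tokens: {ratio:.1%}")  # prints 11.7%
print(f"approximate passes over the curated pool: {passes:.1f}")
```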

[366] The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents’ Inquiry Capability cs.CLPDF

Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu

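TL;DR: The paper introduces the MAQuE benchmark for comprehensive evaluation of medical multi-turn questioning, covering task success, inquiry proficiency, and other aspects, and reveals the shortcomings of current AI doctors in inquiry capability and their sensitivity to variations in patient behavior.

Details

Motivation: Current AI doctors have diagnostic skills, but other essential qualities (such as communication) are overlooked. The study aims to fill this gap with a large-scale benchmark and evaluation framework that comprehensively measures AI doctors' multi-turn inquiry capability.

Result: Experiments show that even state-of-the-art LLMs have significant room for improvement in inquiry capability and are highly sensitive to variations in patient behavior. The fine-grained metrics also expose trade-offs between different evaluation perspectives.

Insight: 1) AI doctors' inquiry capability needs further improvement; 2) the diversity of realistic patient behavior has a major effect on diagnostic accuracy; 3) real-world clinical use requires balancing performance and practicality.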

Abstract: An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.

[367] SemanticShield: LLM-Powered Audits Expose Shilling Attacks in Recommender Systems cs.CLPDF

Kaihong Li, Huichi Zhou, Bin Ma, Fangjun Huang

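TL;DR: The paper proposes SemanticShield, a framework that uses LLMs to detect shilling attacks in recommender systems, combining behavioral features with semantic analysis to better identify malicious users.

Details

Motivation: Recommender systems are vulnerable to shilling attacks, and existing defenses focus mainly on user behavior while overlooking the malicious semantics in item-side features such as titles and descriptions.

Result: The method performs well against six representative attack strategies and generalizes strongly to previously unseen attack methods.

Insight: Semantic information in item-side features can expose malicious intent, and a lightweight LLM combined with reinforcement fine-tuning can significantly improve detection performance.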

Abstract: Recommender systems (RS) are widely used in e-commerce for personalized suggestions, yet their openness makes them susceptible to shilling attacks, where adversaries inject fake behaviors to manipulate recommendations. Most existing defenses emphasize user-side behaviors while overlooking item-side features such as titles and descriptions that can expose malicious intent. To address this gap, we propose a two-stage detection framework that integrates item-side semantics via large language models (LLMs). The first stage pre-screens suspicious users using low-cost behavioral criteria, and the second stage employs LLM-based auditing to evaluate semantic consistency. Furthermore, we enhance the auditing model through reinforcement fine-tuning on a lightweight LLM with carefully designed reward functions, yielding a specialized detector called SemanticShield. Experiments on six representative attack strategies demonstrate the effectiveness of SemanticShield against shilling attacks, and further evaluation on previously unseen attack methods shows its strong generalization capability. Code is available at https://github.com/FrankenstLee/SemanticShield.

[368] GateMABSA: Aspect-Image Gated Fusion for Multimodal Aspect-based Sentiment Analysis cs.CLPDF

Adamu Lawan, Haruna Yunusa

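TL;DR: GateMABSA is a novel gated multimodal architecture that addresses visual noise and cross-modal alignment in multimodal aspect-based sentiment analysis, outperforming several baselines.

Details

Motivation: Existing MABSA models struggle to filter visual noise and to align aspects with opinion-bearing content across text and images, so GateMABSA is proposed as an improvement.

Result: Experiments on two benchmark Twitter datasets show that GateMABSA outperforms several baseline models.

Insight: By handling syntax, semantics, and multimodal fusion in separate specialized modules, GateMABSA improves cross-modal alignment and noise filtering.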

Abstract: Aspect-based Sentiment Analysis (ABSA) has recently advanced into the multimodal domain, where user-generated content often combines text and images. However, existing multimodal ABSA (MABSA) models struggle to filter noisy visual signals and to effectively align aspects with opinion-bearing content across modalities. To address these challenges, we propose GateMABSA, a novel gated multimodal architecture that integrates syntactic, semantic, and fusion-aware mLSTMs. Specifically, GateMABSA introduces three specialized mLSTMs: Syn-mLSTM to incorporate syntactic structure, Sem-mLSTM to emphasize aspect–semantic relevance, and Fuse-mLSTM to perform selective multimodal fusion. Extensive experiments on two benchmark Twitter datasets demonstrate that GateMABSA outperforms several baselines.

[369] An empirical study on the limitation of Transformers in program trace generation cs.CLPDF

Simeng Sun

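TL;DR: An empirical study of the limitations of Transformers on program trace generation: the models perform well in distribution but fail systematically when generalizing along various factors.

Details

Motivation: To examine how Transformers perform on program trace generation (PTG), a task that externalizes reasoning through long traces, and to test how well they generalize.

Result: The models achieve strong in-distribution accuracy but generalize poorly along factors such as program length and number of trace steps; some design choices significantly improve generalization.

Insight: Transformers generalize poorly on long-trace tasks, but changes to model design can partly mitigate the problem.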

Abstract: We study Transformers on the task of program trace generation (PTG), where models produce step-by-step execution traces for synthetic programs. Unlike existing algorithmic problems, PTG externalizes reasoning through long traces where each step is trivial. We train small Transformers with diverse modifications, including alternative position encodings, softmax replacements, hybrid models, and short convolutions. While these models achieve strong in-distribution accuracy, they exhibit systematic failures when generalizing to various factors (e.g., program length, trace steps), though some designs significantly improve generalization.

[370] Scaling Generalist Data-Analytic Agents cs.CL | cs.AI | cs.IR | cs.LGPDF

Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang

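TL;DR: The paper introduces DataMind, a scalable data synthesis and agent training recipe for building generalist data-analytic agents. DataMind addresses three challenges facing open-source data-analytic agents: insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. It achieves state-of-the-art results on multiple data analysis benchmarks.

Details

Motivation: Current approaches rely on proprietary models and prompt engineering, while open-source models struggle with diverse-format, large-scale data files and with long-horizon, multi-step reasoning. DataMind aims to close this gap and advance open-source data-analytic agents.

Result: DataMind-14B achieves an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines, DeepSeek-V3.1 and GPT-5; DataMind-7B scores 68.10%, the best among open-source models.

Insight: High-quality data synthesis and a well-chosen training strategy are central to agent performance; the work offers the open-source community a reusable recipe and dataset.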

Abstract: Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to face diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K and DataMind-7B,14B for the community’s future research.

[371] jina-reranker-v3: Last but Not Late Interaction for Document Reranking cs.CL | cs.AI | cs.IR | 68T50 | I.2.7PDF

Feng Wang, Yuqing Li, Han Xiao

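TL;DR: jina-reranker-v3 is a 0.6B-parameter multilingual document reranker built on a "last but not late" interaction mechanism, reaching state-of-the-art BEIR performance.

Details

Motivation: Late-interaction models such as ColBERT encode the query and documents separately and only then perform multi-vector matching. The work aims for richer interaction by letting the query and documents attend to each other within the same context window.

Result: The model reaches 61.94 nDCG@10 on BEIR, a state-of-the-art result, while being ten times smaller than generative listwise rerankers.

Insight: Tighter query-document interaction can improve reranking quality while keeping the model compact.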

Abstract: jina-reranker-v3 is a 0.6B parameter multilingual document reranker that introduces a novel last but not late interaction. Unlike late interaction models such as ColBERT that perform separate encoding followed by multi-vector matching, our approach conducts causal self-attention between query and documents within the same context window, enabling rich cross-document interactions before extracting contextual embeddings from the last token of each document. This compact architecture achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being ten times smaller than generative listwise rerankers.
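
The headline number above is reported as nDCG@10. As a reminder of what that metric measures, here is a minimal sketch of its linear-gain form. The function names are illustrative and are not taken from the paper or from BEIR's evaluation code, which may differ in details such as the gain function or how the ideal ranking is built.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    # Rank 0 is the top result; the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same grades sorted ideally."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A reranker that moves the relevant documents toward the top of the list raises this score; a perfect ordering yields 1.0.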

[372] InfoAgent: Advancing Autonomous Information-Seeking Agents cs.CL | cs.AIPDF

Gongrui Zhang, Jialiang Zhu, Ruiqi Yang, Kai Qiu, Miaosen Zhang

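TL;DR: The paper introduces InfoAgent, an autonomous information-seeking agent built on a new data synthesis pipeline and web search tools. Entity trees with sub-tree sampling raise question difficulty, and a self-hosted search infrastructure improves transparency.

Details

Motivation: Building LLM agents that extend their capabilities through external tools is a new frontier in AI research, but prior work relies heavily on commercial search tools, which limits the transparency of agent environments.

Result: InfoAgent achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS, outperforming prior open-source deep research agents.

Insight: Self-hosted tools and systematic data synthesis are key to agent capability, and the RL stage significantly improves reasoning-driven tool use.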

Abstract: Building Large Language Model agents that expand their capabilities by interacting with external tools represents a new frontier in AI research and applications. In this paper, we introduce InfoAgent, a deep research agent powered by an innovative data synthesis pipeline and orchestrated web search tools. To construct challenging, hard-to-find queries, we build entity trees and apply sub-tree sampling with entity fuzzification to systematically increase question difficulty. Unlike prior work that relies heavily on commercial search tools, we develop a dedicated self-hosted search infrastructure, enhancing transparency of agent environments and facilitating further advancement of agent capacity. We evaluate the effectiveness of our data pipeline by measuring the average number of tool calls required to correctly answer a question, and also show that our agent yields better performance when equipped with our tools. Our InfoAgent is post-trained from Qwen3-14B using a two-stage recipe: cold-start supervised finetuning to instill long-horizon search behaviors, followed by reinforcement learning which significantly improves reasoning-driven tool use. With our methods, InfoAgent achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS, outperforming prior open-source deep research agents such as WebSailor-72B and DeepDive-32B.

cs.MA [Back]

[373] MAS$^2$: Self-Generative, Self-Configuring, Self-Rectifying Multi-Agent Systems cs.MA | cs.CLPDF

Kun Wang, Guibin Zhang, ManKit Ye, Xinyu Deng, Dongxia Wang

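TL;DR: MAS$^2$ is a multi-agent system based on recursive self-generation: it autonomously builds bespoke multi-agent systems, using a generator-implementer-rectifier tri-agent team to compose and adaptively rectify the target system in response to real-time task demands.

Details

Motivation: Existing automatic multi-agent systems largely follow a generate-once-and-deploy pattern, which leaves them brittle in dynamic and uncertain real-world environments.

Result: Across seven benchmarks, MAS$^2$ achieves gains of up to 19.6% over state-of-the-art MAS, yields improvements of up to 15.1% when paired with previously unseen LLM backbones, and stays on the Pareto frontier of cost-performance trade-offs.

Insight: Recursive self-generation and dynamic rectification are key strategies for adapting multi-agent systems to complex environments.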

Abstract: The past two years have witnessed the meteoric rise of Large Language Model (LLM)-powered multi-agent systems (MAS), which harness collective intelligence and exhibit a remarkable trajectory toward self-evolution. This paradigm has rapidly progressed from manually engineered systems that require bespoke configuration of prompts, tools, roles, and communication protocols toward frameworks capable of automated orchestration. Yet, dominant automatic multi-agent systems, whether generated by external modules or a single LLM agent, largely adhere to a rigid "generate-once-and-deploy" paradigm, rendering the resulting systems brittle and ill-prepared for the dynamism and uncertainty of real-world environments. To transcend this limitation, we introduce MAS$^2$, a paradigm predicated on the principle of recursive self-generation: a multi-agent system that autonomously architects bespoke multi-agent systems for diverse problems. Technically, we devise a "generator-implementer-rectifier" tri-agent team capable of dynamically composing and adaptively rectifying a target agent system in response to real-time task demands. Collaborative Tree Optimization is proposed to train and specialize these meta-agents. Extensive evaluation across seven benchmarks reveals that MAS$^2$ achieves performance gains of up to 19.6% over state-of-the-art MAS in complex scenarios such as deep research and code generation. Moreover, MAS$^2$ exhibits superior cross-backbone generalization, effectively leveraging previously unseen LLMs to yield improvements of up to 15.1%. Crucially, these gains are attained without incurring excessive token costs, as MAS$^2$ consistently resides on the Pareto frontier of cost-performance trade-offs. The source code is available at https://github.com/yeyeyeah2/MAS2.

cs.AI [Back]

[374] Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective cs.AI | cs.CL | cs.LG | stat.MLPDF

Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng

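TL;DR: The paper analyzes, from a theoretical perspective, the benefits and limitations of reinforcement learning (RL) for language model planning, highlighting the role of exploration and the differences between RL methods.

Details

Motivation: RL has substantially improved the planning abilities of large language models (LLMs), but the theoretical basis for its effectiveness remains unclear. The authors use theoretical analysis to expose both its benefits and its pitfalls.

Result: On the Blocksworld benchmark, the paper confirms that the behaviors predicted by the theory appear in practice, including diversity collapse under policy gradient (PG) and the advantages of Q-learning.

Insight: Exploration is what makes RL planning work; Q-learning has the edge over PG in preserving diversity and in supporting off-policy learning; careful reward design is needed to prevent reward hacking.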

Abstract: Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL’s benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration’s role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.

[375] Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents cs.AI | cs.CL | cs.SEPDF

Zonghan Yang, Shengjie Wang, Kelin Fu, Wenyang He, Weimin Xiong

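TL;DR: Kimi-Dev uses Agentless training to induce skill priors that carry over to the SWE-Agent setting, solving software engineering tasks effectively and reaching performance on par with Claude 3.5 Sonnet.

Details

Motivation: LLM-based software engineering solutions split into SWE-Agent frameworks with multi-turn interaction and workflow-based Agentless methods with single-turn verifiable steps. The authors argue that the two are not mutually exclusive: skill priors induced by Agentless training make SWE-Agent adaptation more efficient and effective.

Result: Kimi-Dev reaches 60.4% on SWE-bench Verified; after additional SFT adaptation, it powers SWE-Agents to 48.6% pass@1, on par with Claude 3.5 Sonnet.

Insight: Structured skill priors can bridge workflow and agentic frameworks, pointing toward transferable coding agents.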

Abstract: Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions are split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code edit, and self-reflection that enable efficient and effective SWE-Agent adaptation. In this work, we first curate the Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4% on SWE-bench Verified, the best among workflow approaches. With additional SFT adaptation on 5k publicly-available trajectories, Kimi-Dev powers SWE-Agents to 48.6% pass@1, on par with that of Claude 3.5 Sonnet (241022 version). These results show that structured skill priors from Agentless training can bridge workflow and agentic frameworks for transferable coding agents.

[376] Multiplayer Nash Preference Optimization cs.AI | cs.CLPDF

Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao

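TL;DR: The paper proposes Multiplayer Nash Preference Optimization (MNPO), which extends Nash learning from two-player games to the multiplayer setting to better capture the non-transitive and heterogeneous structure of real-world preferences.

Details

Motivation: Reward-based methods built on the Bradley-Terry assumption cannot capture complex preference structures, and two-player Nash learning from human feedback (NLHF), while effective, remains subject to a single-opponent bias.

Result: MNPO outperforms existing NLHF baselines on instruction-following benchmarks, with the largest gains under heterogeneous annotator conditions and mixed-policy evaluation scenarios.

Insight: MNPO inherits the equilibrium guarantees of two-player methods while its multiplayer competitive dynamics improve coverage of diverse preference structures.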

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
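
To make the abstract's point about the Bradley-Terry assumption concrete: under that assumption the probability that one response beats another is a sigmoid of a scalar reward difference, so any assignment of rewards induces a transitive ordering and a preference cycle cannot be represented. The following is a minimal sketch of that argument; the function names are illustrative and do not come from the MNPO codebase.

```python
import math

def bt_prob(reward_a, reward_b):
    """Bradley-Terry probability that a is preferred over b."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def majority_preferences(rewards):
    """Pairs (a, b) where a beats b with probability above 0.5."""
    return {(a, b) for a in rewards for b in rewards
            if a != b and bt_prob(rewards[a], rewards[b]) > 0.5}

def is_transitive(prefs):
    """True if a>b and b>c always implies a>c."""
    return all((a, d) in prefs
               for (a, b) in prefs for (c, d) in prefs
               if b == c and a != d)

# Any scalar rewards give a transitive relation...
scalar = majority_preferences({"x": 2.0, "y": 1.0, "z": 0.0})
# ...whereas a rock-paper-scissors style cycle is not transitive.
cycle = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
```

Here `is_transitive(scalar)` holds while `is_transitive(cycle)` does not, which is the kind of preference structure that game-theoretic formulations are meant to accommodate.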

[377] $p$-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding cs.AI | cs.CLPDF

Runyan Tan, Shuang Wu, Phillip Howard

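TL;DR: The paper proposes $p$-less sampling, a hyperparameter-free sampling method for LLM decoding. It sets the truncation threshold dynamically from an information-theoretic view of the token distribution, removing the hyperparameter dependence of existing methods and keeping output quality high at elevated temperatures. Experiments show it outperforms existing methods on math, logical reasoning, and creative writing tasks.

Details

Motivation: Sampling-based decoding methods are sensitive to hyperparameter choices, and the best settings vary with the generation task and temperature. A robust, hyperparameter-free method that holds up across conditions is therefore needed.

Result: $p$-less sampling outperforms existing sampling approaches on math, logical reasoning, and creative writing tasks, with much less quality degradation at higher temperatures. It also improves inference-time efficiency through lower average token sampling times and shorter generations.

Insight: 1. An information-theoretic rule can replace hyperparameter tuning; 2. A dynamically set threshold can improve both quality and efficiency; 3. Hyperparameter-free methods are more broadly applicable in practice.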

Abstract: Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments.
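
The abstract does not state the threshold rule itself, so the following is only an illustration of what a hyperparameter-free, distribution-dependent truncation can look like. It assumes, purely for the sake of the example, a threshold equal to the sum of squared token probabilities; that choice and the function name are assumptions and should not be read as the paper's definition.

```python
def truncate_by_collision_threshold(probs):
    """Keep tokens whose probability is at least sum(p_i^2), then renormalize.

    Assumed rule for illustration only. The top token always survives,
    because sum(p_i^2) <= max(p_i) for any probability distribution.
    """
    threshold = sum(p * p for p in probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```

With a peaked distribution such as `[0.5, 0.3, 0.2]` the threshold is about 0.38 and only the top token is kept, while a uniform distribution keeps every token. The cutoff adapts to the shape of the distribution without a tunable parameter, which is the property the abstract emphasizes.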

[378] Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning cs.AI | cs.CLPDF

Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta

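TL;DR: The paper proposes a pattern-aware tool-integrated reasoning framework that distinguishes a calculator pattern from an algorithmic pattern and aligns pattern selection with teacher preferences, substantially improving both code usage and accuracy.

Details

Motivation: Existing tool-integrated reasoning work focuses on when to invoke tools and overlooks how tools are applied. The authors identify two common patterns, calculator and algorithmic, and find that a misaligned choice between them can cause failures even when the reasoning itself is sound.

Result: On challenging math datasets, the method substantially improves code usage and accuracy, for instance raising Code@1 on MATH500 from 64.0% to 70.5% and on AIME24 from 26.7% to 50.0%.

Insight: Choosing the right pattern of tool use matters for reasoning performance, and a pattern-aware approach makes tool-integrated reasoning more effective.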

Abstract: Tool-integrated reasoning (TIR) has become a key approach for improving large reasoning models (LRMs) on complex problems. Prior work has mainly studied when to invoke tools, while overlooking how tools are applied. We identify two common patterns: a calculator pattern that uses code for direct computation, and an algorithmic pattern that encodes problems as programs. Misaligned choices often cause failures even when reasoning is sound. We propose a two-stage framework that first builds code competence from both patterns and then aligns pattern selection with teacher preferences. Across challenging math datasets, our pattern-aware method substantially improves both code usage and accuracy, for instance raising Code@1 on MATH500 from 64.0% to 70.5% and on AIME24 from 26.7% to 50.0%. These gains highlight the effectiveness of a pattern-aware approach for tool-integrated reasoning.

[379] Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking cs.AI | cs.CLPDF

Jinyi Han, Ying Huang, Ying Liao, Zishang Jiang, Xikun Lu

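TL;DR: The paper proposes JET, which trains large reasoning models (LRMs) to proactively terminate unnecessary reasoning, significantly improving efficiency without sacrificing accuracy.

Details

Motivation: LRMs perform well on challenging tasks, but their deep reasoning is computationally expensive, and existing reinforcement learning methods still struggle to construct short reasoning paths during rollout.

Result: JET significantly improves efficiency without lowering accuracy; on the Olympiad benchmark, DeepSeek-Distill-Qwen-1.5B reduces output length by 46.3% while gaining 4.6% in accuracy.

Insight: LRMs accumulate sufficient information early in reasoning, so later steps are redundant; proactively terminating reasoning achieves both efficiency and accuracy.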

Abstract: Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. Existing reinforcement learning methods for efficient reasoning still struggle to construct short reasoning paths during the rollout stage, limiting effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. It also uses a quality-controlled length reward to better encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. In particular, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available on GitHub.

[380] Mapping Overlaps in Benchmarks through Perplexity in the Wild cs.AI | cs.CLPDF

Siyang Wu, Honglin Bao, Sida Li, Ari Holtzman, James A. Evans

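TL;DR: The paper uses perplexity to analyze commonalities and overlaps among large language model (LLM) benchmarks, proposing benchmark signatures drawn from naturally authored corpora that reveal how the capabilities behind different tasks relate and diverge.

Details

Motivation: Overlap and validity across existing LLM benchmarks have not been analyzed systematically. The authors use perplexity to expose how LLM capabilities are distributed across tasks and how they connect.

Result: Knowledge and reasoning subtasks overlap, whereas multilingual and cultural benchmarks show less similarity; coding is the least overlapping domain; benchmark signatures remain robust to benchmark-orthogonal factors such as question format.

Insight: Perplexity can expose the underlying links between LLM capabilities, helping to assess benchmark validity and the actual landscape of LLM abilities.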

Abstract: We develop signatures of capacity familiarity to characterize large language model (LLM) benchmarks and their meaningful overlaps. Benchmark signatures probe the capacity required for benchmark performance. We formally define them as a set of salient tokens drawn from in-the-wild, naturally authored corpora, where LLM token perplexity, reflecting more or less pre-training exposure, becomes highly predictive of LLM benchmark performance. Through a large-scale meta-evaluation, we extract benchmark signatures via stepwise forward selection with linear regressions across 32 LLMs and 88 benchmarks spanning diverse knowledge, coding, logic, instruction following, math, language, reasoning, and world modeling. Our analysis situates signatures in relation to both the semantic similarity of benchmark questions and the correlation of model performance. While performance overlaps are universally high and semantic overlaps remain confined to a narrow mid-range, benchmark signatures prove highly informative in capturing variation, overlap, and divergence. We observe overlap in knowledge and reasoning subtasks, whereas multilingual and cultural benchmarks exhibit less similarity, even compared to cross-task overlap. Notably, performance-level results are strongly influenced by benchmark-orthogonal factors such as question format, highlighting limitations in LLM generalization, the conflation of performance with ability, and issues inherent in current mainstream benchmark agreement studies. Benchmark signatures, however, remain robust to such effects. Ultimately, we identify cross-functional overlaps across logic, math, language, instruction following, and world modeling, with coding emerging as the least overlapping domain. Together, these findings provide mechanistic insights into benchmark validity and LLM sensitivities, and sketch the underlying landscape of interconnected LLM capabilities.

[381] From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models cs.AI | cs.CLPDF

Jue Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

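TL;DR: Through empirical evaluation, attention analysis, and mechanistic intervention, the paper studies how the reasoning process affects answer generation in distilled DeepSeek R1 models, revealing a directional, functional flow of information from reasoning tokens to the answer.

Details

Motivation: Large reasoning models (LRMs) generate explicit reasoning traces alongside their answers, but how much these traces actually influence answer generation is unclear. The paper investigates the relationship between the two.

Result: Perturbing key reasoning tokens reliably changes the final answer, confirming that information flows from reasoning to answer in a directional and functional way.

Insight: Intermediate reasoning plays a functional role in shaping model outputs, clarifying how LRMs use reasoning tokens for answer generation.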

Abstract: Large Reasoning Models (LRMs) generate explicit reasoning traces alongside final answers, yet the extent to which these traces influence answer generation remains unclear. In this work, we conduct a three-stage investigation into the interplay between reasoning and answer generation in three distilled DeepSeek R1 models. First, through empirical evaluation, we demonstrate that including explicit reasoning consistently improves answer quality across diverse domains. Second, attention analysis reveals that answer tokens attend substantially to reasoning tokens, with certain mid-layer Reasoning-Focus Heads (RFHs) closely tracking the reasoning trajectory, including self-reflective cues. Third, we apply mechanistic interventions using activation patching to assess the dependence of answer tokens on reasoning activations. Our results show that perturbations to key reasoning tokens can reliably alter the final answers, confirming a directional and functional flow of information from reasoning to answer. These findings deepen our understanding of how LRMs leverage reasoning tokens for answer generation, highlighting the functional role of intermediate reasoning in shaping model outputs. Our data and code are publicly available at https://aka.ms/R2A-code.
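
For readers new to activation patching, the idea is to cache an intermediate activation from one run, splice it into another run, and observe how the output changes. The toy sketch below uses a hand-written two-layer function in place of a real model; `layer1`, `layer2`, and `run` are hypothetical stand-ins and not the paper's code.

```python
def layer1(x):
    """Toy hidden layer with two units."""
    return [x[0] + x[1], x[0] - x[1]]

def layer2(h):
    """Toy readout that depends only on hidden unit 0."""
    return 2.0 * h[0] + 0.0 * h[1]

def run(x, patch=None):
    """Forward pass; optionally overwrite one hidden unit as (index, value)."""
    h = layer1(x)
    if patch is not None:
        index, value = patch
        h[index] = value
    return layer2(h)

clean_input, corrupted_input = [1.0, 2.0], [0.0, 0.0]
clean_hidden = layer1(clean_input)  # cached activations from the clean run

# Patch each clean hidden unit into the corrupted run and record the output.
effects = [run(corrupted_input, patch=(i, clean_hidden[i])) for i in range(2)]
```

Patching unit 0 restores the clean output while patching unit 1 changes nothing, so in this toy only unit 0 is causally relevant. The study applies the same logic to reasoning-token activations and final answers.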

[382] SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents cs.AI | cs.CL | cs.CRPDF

Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang

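TL;DR: The paper presents SafeSearch, an automated red-teaming framework for assessing the safety of LLM-based search agents. It exposes their vulnerability to unreliable search results and shows the limits of common defenses.

Details

Motivation: Connecting LLM search agents to the Internet broadens their information sources but also introduces safety risks, since unreliable search results can misguide agent behavior. The work aims to assess search-agent safety systematically and support safer agent development.

Result: When exposed to unreliable websites, LLM search agents reach an attack success rate (ASR) as high as 90.5%, and common defense practices have limited effect.

Insight: The work highlights the safety challenges facing current LLM search agents, underlines the need for systematic assessment, and provides tools and direction for safer development.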

Abstract: Search agents connect LLMs to the Internet, enabling access to broader and more up-to-date information. However, unreliable search results may also pose safety threats to end users, establishing a new threat surface. In this work, we conduct two in-the-wild experiments to demonstrate both the prevalence of low-quality search results and their potential to misguide agent behaviors. To counter this threat, we introduce an automated red-teaming framework that is systematic, scalable, and cost-efficient, enabling lightweight and harmless safety assessments of search agents. Building on this framework, we construct the SafeSearch benchmark, which includes 300 test cases covering five categories of risks (e.g., misinformation and indirect prompt injection). Using this benchmark, we evaluate three representative search agent scaffolds, covering search workflow, tool-calling, and deep research, across 7 proprietary and 8 open-source backend LLMs. Our results reveal substantial vulnerabilities of LLM-based search agents: when exposed to unreliable websites, the highest ASR reached 90.5% for GPT-4.1-mini under a search workflow setting. Moreover, our analysis highlights the limited effectiveness of common defense practices, such as reminder prompting. This emphasizes the value of our framework in promoting transparency for safer agent development. Our codebase and test cases are publicly available: https://github.com/jianshuod/SafeSearch.

[383] From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning cs.AI | cs.CLPDF

Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin

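TL;DR: ChemMAS is a multi-agent system that recasts chemical reaction condition prediction as an evidence-based reasoning task, providing interpretable justifications.

Details

Motivation: Existing large language model (LLM) methods do well at reaction condition recommendation but rarely explain the rationale behind their recommendations, limiting their use in scientific workflows.

Result: In Top-1 accuracy, ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15%, while offering falsifiable rationales.

Insight: Decomposing the task across multiple agents and backing each decision with evidence improves both interpretability and performance, which suits high-stakes scientific settings.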

Abstract: Chemical reaction condition recommendation selects proper reaction condition parameters for chemical reactions and is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy, while offering falsifiable, human-trustable rationales, which establishes a new paradigm for explainable AI in scientific discovery.

[384] Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models cs.AI | cs.CLPDF

Guanxu Chen, Yafu Li, Yuxian Jiang, Chen Qian, Qihan Ren

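TL;DR: CANON is a new advantage estimation method for reinforcement learning that strengthens LLM reasoning without presuming the direction of a target metric, outperforming prior methods on math reasoning and high-complexity logic tasks.

Details

Motivation: Existing RLVR-based methods inject directional priors (for example through reward or advantage shaping) when optimizing LLM reasoning; without careful hyperparameter tuning, these priors can be overly biased and lead to failure. CANON removes the directional assumption so the target metric can be used more flexibly.

Result: Across three LLMs, entropy-based CANON outperforms prior methods on math reasoning and high-complexity logic tasks; applied to response length, CANON further improves token efficiency and yields a more favorable Pareto frontier in the performance-cost trade-off.

Insight: Removing directional priors lets the target metric be used more flexibly and avoids hand-crafted bias; regrouping responses and comparing across and within groups is an effective strategy for improving RLVR.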

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs) has achieved remarkable progress in enhancing LLMs’ reasoning capabilities on tasks with clear correctness criteria, such as mathematical reasoning tasks. Several training metrics, such as entropy or response length, have been observed to correlate with different reasoning behaviors in reinforcement learning. Prior approaches incorporate such priors through reward or advantage shaping, which often relies on hand-crafted penalties and preferences (e.g., higher-is-better or lower-is-better). However, without careful hyperparameter tuning, these directional priors can be overly biased and may lead to failure. To this end, we introduce Conditional advANtage estimatiON (CANON), amplifying the impact of the target metric without presuming its direction. Specifically, CANON regroups the sampled responses into two groups based on the higher or lower value of a target metric, measures which metric trend contributes to better performance through inter-group comparison, and identifies the better response within the same group. In summary, CANON based on entropy consistently outperforms prior methods across three LLMs on both math reasoning and high-complexity logic tasks. When applied to response length, CANON further improves token efficiency, yielding a more favorable Pareto frontier in the performance-cost trade-off.
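
The regrouping step described in the abstract can be sketched roughly as follows. This is a hedged illustration rather than the paper's estimator: the median split, the use of group-mean rewards as baselines, and the `mix` weight are all assumptions introduced for the example.

```python
from statistics import mean, median

def conditional_advantages(rewards, metric, mix=0.5):
    """Blend inter-group and intra-group advantages (assumed formulation).

    Responses are split by whether their metric value (for example entropy
    or length) lies above the median. The inter-group term compares a reward
    with the other group's mean, so neither direction is favored in advance;
    the intra-group term compares it with its own group's mean.
    """
    cut = median(metric)
    high = [i for i, m in enumerate(metric) if m > cut]
    low = [i for i, m in enumerate(metric) if m <= cut]
    if not high or not low:
        # Degenerate split: fall back to a plain mean baseline.
        baseline = mean(rewards)
        return [r - baseline for r in rewards]
    group_mean = {"high": mean(rewards[i] for i in high),
                  "low": mean(rewards[i] for i in low)}
    advantages = [0.0] * len(rewards)
    for own, other, members in (("high", "low", high), ("low", "high", low)):
        for i in members:
            inter = rewards[i] - group_mean[other]
            intra = rewards[i] - group_mean[own]
            advantages[i] = mix * inter + (1 - mix) * intra
    return advantages
```

If the high-metric responses happen to be the correct ones, the inter-group term rewards them; if the low-metric responses are correct, the sign flips automatically. That is the sense in which no direction is presumed.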

[385] Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models cs.AI | cs.CLPDF

Yuhui Wang, Changjiang Li, Guangke Chen, Jiacheng Liang, Ting Wang

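TL;DR: The paper studies whether large reasoning models (LRMs) produce answers through genuine Chain-of-Thought (CoT) reasoning or through memory retrieval, and shows experimentally that both mechanisms operate at once, with their balance depending on several factors. It identifies a weakness in current training paradigms and proposes FARL, a framework that suppresses retrieval shortcuts to strengthen genuine reasoning.

Details

Motivation: LRMs show strong problem-solving ability, but their final answers often contradict their own reasoning traces. The paper investigates the source of this inconsistency, hypothesizing that models rely on two competing mechanisms: reasoning and memory retrieval.

Result: Experiments confirm that reasoning and retrieval operate simultaneously, with their relative dominance varying across problem domains, model scales, and fine-tuning approaches. FARL promotes reasoning-dominant behavior and improves generalizable reasoning.

Insight: 1. The reasoning ability of current models may be overestimated, since some answers come from memory retrieval; 2. Flaws in current fine-tuning let models sidestep genuine reasoning; 3. FARL offers a new direction for improving reasoning.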

Abstract: Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. However, recent studies reveal that their final answers often contradict their own reasoning traces. We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval. To test this hypothesis, we conduct controlled experiments that challenge LRMs with misleading cues during reasoning and/or corrupted answers during retrieval. Our results across models and datasets confirm that both mechanisms operate simultaneously, with their relative dominance influenced by multiple factors: problem domains, model scales, and fine-tuning approaches (e.g., reinforcement learning vs. distillation). The findings reveal a critical limitation in current reasoning fine-tuning paradigms: models can exploit the retrieval mechanism as a shortcut, effectively “hacking” the reward signal and undermining genuine reasoning development. To address this challenge, we introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning. By carefully suppressing retrieval shortcuts during the fine-tuning process, FARL promotes reasoning-dominant behavior and enhances generalizable reasoning capabilities.

[386] Learning to Ponder: Adaptive Reasoning in Latent Space cs.AI | cs.CLPDF

Yixin He, Lumingyuan Tang

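TL;DR: FR-Ponder is a reasoning framework that leaves backbone weights untouched and allocates compute adaptively in latent space, using a controller with fewer than 1M parameters to adjust reasoning depth and improve the trade-off between compute and accuracy.

Details

Motivation: Prevailing approaches such as Best-of-N apply uniform compute depth to simple and complex queries alike, wasting resources on the former and under-thinking the latter. FR-Ponder aims for instance-adaptive compute allocation.

Result: On GSM8K and MATH500, FR-Ponder improves the compute-accuracy frontier, lowering FLOPs while matching or improving accuracy, and compares favorably to early-exit baselines.

Insight: 1. The learned compute allocation correlates with problem difficulty; 2. Adaptive compute allocation can be achieved with a lightweight controller; 3. GRPO helps balance computational cost against performance.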

Abstract: Test-time compute has emerged as a key paradigm for enhancing LLM reasoning, yet prevailing approaches like Best-of-N and majority voting apply uniform depth across inputs, wasting computation on simple queries while potentially under-thinking complex ones. We present FR-Ponder, a single-graph, backbone-training-free framework that allocates instance-adaptive reasoning compute via latent steering. A less than 1M-param controller observes hidden states and decides to halt or apply a small ponder step by adding a pre-computed steering vector to frozen representations. Our method extracts the latent steering vector associated with deeper reasoning outputs and direct IO from LLM and re-applies it through a tunable scaling factor, allowing the model to adapt its reasoning depth to the complexity of each input. To balance performance and computational cost, we employ Group Relative Policy Optimization (GRPO) as a reward signal to adaptively regulate reasoning depth, achieving task accuracy while mitigating overreasoning. Through curriculum learning and careful reward engineering, FR-Ponder learns calibrated compute allocation correlated with problem difficulty. On GSM8K and MATH500, FR-Ponder improves the compute-accuracy frontier, delivering lower FLOPs with better matched accuracy and comparing favorably to early-exit baselines, without modifying backbone weights. Analyses visualize interpretable steering directions and show learned compute allocation correlates with problem difficulty.
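
The halt-or-ponder loop described above can be pictured with a small sketch. It is a toy under stated assumptions: the hidden state is a plain list of floats, `should_halt` stands in for the learned controller, and `alpha` plays the role of the tunable scaling factor; none of these names come from the paper.

```python
def ponder(hidden, steering, should_halt, alpha=0.5, max_steps=8):
    """Add a scaled steering vector to the hidden state until told to halt.

    hidden      -- frozen-backbone representation (list of floats)
    steering    -- pre-computed steering vector of the same length
    should_halt -- hypothetical controller: hidden state -> bool
    Returns the final hidden state and the number of ponder steps taken.
    """
    steps = 0
    while steps < max_steps and not should_halt(hidden):
        hidden = [h + alpha * s for h, s in zip(hidden, steering)]
        steps += 1
    return hidden, steps

# A stand-in controller that halts once the first coordinate reaches 1.0.
final_hidden, steps_taken = ponder([0.0], [1.0], lambda h: h[0] >= 1.0)
```

An input whose state already satisfies the controller takes zero steps, while a harder one takes more, up to `max_steps`. This mirrors the instance-adaptive allocation the abstract describes.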

[387] SpecExit: Accelerating Large Reasoning Model via Speculative Exit cs.AI | cs.CL | cs.LGPDF

Rubing Yang, Huajun Bai, Song Liu, Guanghua Yu, Runzhi Fan

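TL;DR: SpecExit is a novel framework that predicts future tokens and an early-exit signal directly from a lightweight draft model, addressing overthinking in large reasoning models (LRMs) and substantially reducing generation length and end-to-end latency.

Details

Motivation: LRMs often overthink, producing unnecessarily long outputs that increase end-to-end latency and hurt deployment efficiency. Existing early-exit mechanisms are effective but rely on probing, which adds overhead and limits both speedup and generalizability.

Result: Average generation length is reduced by 66% and end-to-end latency improves by 2.5x, with accuracy preserved.

Insight: Hidden states are useful not only for speculative decoding but also as an efficient source of early-exit signals, pointing to their broader potential for efficient reasoning.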

Abstract: Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.

[388] AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models cs.AI | cs.CLPDF

Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu

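TL;DR: AdvChain is a safety-alignment paradigm for large reasoning models that uses adversarial CoT tuning to address the snowball effect of safety failures in CoT reasoning.

Details

Motivation: The multi-step nature of CoT reasoning introduces new safety challenges. Conventional alignment methods cannot handle the snowball effect, which leads to harmful compliance or excessive refusal.

Result: AdvChain significantly improves robustness against jailbreak attacks and CoT hijacking while reducing over-refusal on benign prompts.

Insight: Self-correction balances safety and utility, offering a new direction for building more reliable reasoning models.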

Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in complex problem-solving through Chain-of-Thought (CoT) reasoning. However, the multi-step nature of CoT introduces new safety challenges that extend beyond conventional language model alignment. We identify a failure mode in current safety CoT tuning methods: the snowball effect, where minor reasoning deviations progressively amplify throughout the thought process, leading to either harmful compliance or excessive refusal. This effect stems from models being trained to imitate perfect reasoning scripts without learning to self-correct. To address this limitation, we propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our method involves constructing a dataset containing Temptation-Correction and Hesitation-Correction samples, where models learn to recover from harmful reasoning drifts and unnecessary cautions. Extensive experiments show that AdvChain significantly enhances robustness against jailbreak attacks and CoT hijacking while substantially reducing over-refusal on benign prompts, achieving a superior safety-utility balance without compromising reasoning capabilities. Our work establishes a new direction for building more robust and reliable reasoning models.

[389] SCI-Verifier: Scientific Verifier with Thinking cs.AI | cs.CL | cs.LGPDF

Shenghe Zheng, Chenyu Huang, Fangchen Yu, Junchi Yao, Jingqi Ye

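TL;DR: This paper proposes SCI-VerifyBench and SCI-Verifier, which address the challenges of answer verification in scientific domains at the data and model levels respectively, and emphasizes the importance of reasoning ability.

Details

Motivation: LLMs are increasingly applied to scientific reasoning, but complex answer formats and diverse equivalent expressions make answer verification a critical yet difficult task. Existing verification methods lack systematic evaluation standards and depend on cumbersome rules.

Result: SCI-Verifier shows strong logical reasoning and equivalence judgment with concise, stable outputs; SCI-VerifyBench provides a systematic evaluation framework.

Insight: Reasoning ability is essential for scientific verification; combining a cross-disciplinary benchmark with a unified verifier can markedly improve the reliability and applicability of LLMs in scientific domains.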

Abstract: As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.

[390] Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention cs.AI | cs.CLPDF

Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li

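TL;DR: The paper proposes Intervened Preference Optimization (IPO), which uses intervention-based supervision to make the reasoning of large reasoning models (LRMs) safe, substantially reducing harmful content and safety risks in reasoning.

Details

Motivation: Existing LRMs excel at complex problem solving, but their chain-of-thought (CoT) reasoning often contains harmful content, and existing methods overlook the safety of the reasoning itself, leaving substantial potential risk.

Result: On adversarial safety benchmarks, IPO achieves a relative reduction in harmfulness of over 30% while maintaining strong performance across diverse reasoning tasks.

Insight: Safe reasoning hinges on safety triggers at a few critical steps; compliance cues correlate with unsafe reasoning; corrective interventions can effectively steer reasoning toward safety.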

Abstract: Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue persists in existing methods, which overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. We therefore shift our focus to aligning the safety of reasoning itself in this paper and explore process supervision as the solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first delve into the characteristics of safe reasoning and uncover several critical insights: 1) safe reasoning is often consolidated by a few critical steps of safety triggers; 2) compliance cues strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories towards safer traces. Motivated by these, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO remarkably improves overall safety regarding both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30% in harmfulness, while preserving excellent performance across diverse reasoning tasks. The results highlight the importance of explicit alignment for reasoning and provide a practical path to safer LRMs.

[391] On the Self-awareness of Large Reasoning Models’ Capability Boundaries cs.AI | cs.CLPDF

Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei

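TL;DR: Investigates whether large reasoning models (LRMs) are self-aware of their own capability boundaries and proposes two optimization strategies to improve reasoning efficiency and reliability.

Details

Motivation: LRMs perform well on complex reasoning tasks, but on hard questions they often fall into unproductive reasoning, wasting compute and giving wrong answers. This shows that current reasoning paradigms overlook the relationship between questions and the model's capability boundaries.

Result: Experiments show that the proposed strategies avoid unproductive reasoning without sacrificing accuracy, cutting token usage by 62.7% to 93.6% and improving efficiency and reliability.

Insight: A model's reasoning expressions and hidden states already implicitly encode a judgment of whether a problem is solvable; monitoring these signals allows unproductive reasoning to be terminated early and resources to be allocated better.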

Abstract: Large Reasoning Models (LRMs) have shown impressive performance on complex reasoning tasks such as mathematics, yet they also display misbehaviors that expose their limitations. In particular, when faced with hard questions, LRMs often engage in unproductive reasoning until they reach the context limit, producing wrong answers while wasting substantial computation. This phenomenon reflects a fundamental issue: current answering paradigms overlook the relationship between questions and LRMs’ capability boundaries. In this paper, we investigate whether LRMs possess self-awareness of capability boundaries. We begin with the observation that LRMs may know what they cannot solve through expressed reasoning confidence. For black-box models, we find that reasoning expressions reveal boundary signals, with an accelerating confidence trajectory for solvable problems but a convergent uncertainty trajectory for unsolvable ones. For white-box models, we show that hidden states of the last input token encode boundary information, with solvable and unsolvable problems linearly separable even before reasoning begins. Building on these findings, we propose two simple yet effective optimization strategies: reasoning expression monitoring and hidden states monitoring. Experiments demonstrate that these boundary-aware strategies enable LRMs to avoid unproductive reasoning without sacrificing accuracy, significantly improving reliability and efficiency by cutting token usage by 62.7% to 93.6%.

[392] Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity cs.AI | cs.CL | cs.LGPDF

Zhen Bi, Zhenlin Hu, Jinnan Yang, Mingyang Chen, Cheng Deng

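TL;DR: The paper introduces Data Reasoning Intensity (DRI), a new metric that quantifies the logical reasoning complexity of training data, and uses a data re-optimization strategy to improve LLM logical reasoning, outperforming conventional optimization centered on data volume.

Details

Motivation: Existing work mostly focuses on transforming data formats and overlooks the internal reasoning complexity of training samples, limiting LLMs' logical reasoning potential. The paper aims to unlock that potential by quantifying data reasoning complexity and optimizing data to align with the model's reasoning boundary.

Result: Experiments show the method significantly improves LLM logical reasoning performance and generalization, and its effectiveness is further validated under a reinforcement learning framework.

Insight: Optimizing the logical reasoning complexity of data raises LLMs' cognitive potential more effectively than simply adding data or changing data formats, pointing to a new direction for data optimization.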

Abstract: Recent advances in large language models (LLMs) highlight the importance of training data structure and quality in shaping reasoning behavior. However, most existing approaches focus on transforming data formats while neglecting the internal reasoning complexity of training samples, leaving the reasoning potential of data under-explored and underutilized. In this work, we posit that LLM logical reasoning performance is jointly constrained by the potential of the training data and the cognitive capacity of the model. To make this relationship measurable, we introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples by decomposing and aggregating their logical structures. This allows us to analyze how well current LLMs utilize logical reasoning signals and identify performance gaps relative to data potential. Based on this insight, we introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data. Rather than increasing data volume, our method re-optimizes existing samples to better align with the LLM’s logical reasoning boundary. Extensive experiments show that our approach significantly improves performance and generalization over data-centric strategies. We further validate our method under a reinforcement learning framework. Our results indicate that prioritizing reasoning complexity in data rather than sheer scale or superficial form is essential to realizing LLMs’ full cognitive potential.

[393] MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning cs.AI | cs.CLPDF

Huihao Jing, Wenbin Hu, Hongyu Luo, Jianhui Yang, Wei Fan

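TL;DR: MASLegalBench is an evaluation benchmark for multi-agent systems (MAS) on legal reasoning tasks, focusing on deductive reasoning and agent collaboration and filling a gap in existing legal benchmarks.

Details

Motivation: MAS built on large language models (LLMs) show promise on complex tasks, but the legal domain lacks evaluation methods tailored to the characteristics of MAS.

Result: Experiments reveal the strengths, limitations, and room for improvement of existing LLMs and MAS architectures.

Insight: MASLegalBench provides a standardized evaluation tool for MAS research in the legal domain and promotes the use of agent collaboration in legal tasks.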

Abstract: Multi-agent systems (MAS), leveraging the remarkable capabilities of Large Language Models (LLMs), show great potential in addressing complex tasks. In this context, integrating MAS with legal tasks is a crucial step. While previous studies have developed legal benchmarks for LLM agents, none are specifically designed to consider the unique advantages of MAS, such as task decomposition, agent specialization, and flexible training. In fact, the lack of evaluation methods limits the potential of MAS in the legal domain. To address this gap, we propose MASLegalBench, a legal benchmark tailored for MAS and designed with a deductive reasoning approach. Our benchmark uses GDPR as the application scenario, encompassing extensive background knowledge and covering complex reasoning processes that effectively reflect the intricacies of real-world legal situations. Furthermore, we manually design various role-based MAS and conduct extensive experiments using different state-of-the-art LLMs. Our results highlight the strengths, limitations, and potential areas for improvement of existing models and MAS architectures.

[394] From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones cs.AI | cs.CLPDF

Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang

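TL;DR: Using a synthetic framework, the paper shows that LLMs can learn genuinely new skills through reinforcement learning (RL) by composing existing ones, much as humans acquire cognitive skills. The learned ability transfers and generalizes, whereas next-token training alone does not achieve this.

Details

Motivation: To examine whether RL really lets LLMs learn new skills rather than merely reweighting existing abilities, clarifying the role of RL in LLM post-training.

Result: RL enables LLMs to learn unseen function compositions and to generalize to harder tasks; the compositional ability transfers to different tasks; RL markedly changes the models' reasoning behavior.

Insight: Having base models first acquire basic skills and then using RL to incentivize composition and advanced skills offers a new approach to complex problems; RL has distinctive value in LLM learning.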

Abstract: Does RL teach LLMs genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them h(x)=g(f(x)). Further, this compositional ability generalizes to more difficult problems such as compositions of >2 functions unseen during RL training. Surprisingly, our experiments show that compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target’s atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training with the same data yields none of these findings. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.
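
To make the notion of a skill as a string transformation concrete, here is a small sketch of atomic functions and their composition h(x)=g(f(x)). The specific transformations (`reverse`, `rotate_left`) and the helper `compose` are hypothetical examples chosen for illustration; the abstract does not say which string functions the authors use.

```python
from typing import Callable

StrFn = Callable[[str], str]


def compose(*fns: StrFn) -> StrFn:
    """Apply the given functions left to right, so compose(f, g)(x) == g(f(x))."""
    def composed(x: str) -> str:
        for fn in fns:
            x = fn(x)
        return x
    return composed


# Two hypothetical atomic skills.
def reverse(x: str) -> str:
    """f: reverse the string."""
    return x[::-1]


def rotate_left(x: str) -> str:
    """g: move the first character to the end."""
    return x[1:] + x[:1]


# An unseen composition h(x) = g(f(x)).
h = compose(reverse, rotate_left)
```

For the input `"abcd"`, `reverse` gives `"dcba"` and `rotate_left` then gives `"cbad"`. Passing three or more functions to `compose` corresponds to the harder setting of compositions of more than two functions mentioned in the abstract.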

[395] The Era of Real-World Human Interaction: RL from User Conversations cs.AI | cs.CL | cs.LGPDF

Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva

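TL;DR: The paper proposes RLHI, a reinforcement learning paradigm that learns directly from real user conversations. Through user-guided rewrites and user-based rewards, it links users' long-term interaction history to their immediate preferences, significantly improving personalization and instruction following.

Details

Motivation: Current conversational models are usually aligned with expert-annotated feedback and cannot learn directly from natural user interaction. Real user conversations carry rich personalized information that can support continual model improvement and multifaceted alignment more effectively.

Result: Experiments on the WildChat dataset show that RLHI outperforms baselines on personalization and instruction following and also performs well on reasoning tasks, validating learning directly from user interaction.

Insight: Real user conversations are a scalable and effective supervision signal that supports personalized alignment and continual improvement, and could be applied in a wider range of settings.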

Abstract: We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users’ natural-language follow-up responses, (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user’s long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest organic human interaction offers scalable, effective supervision for personalized alignment.

[396] ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory cs.AI | cs.CLPDF

Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang

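TL;DR: ReasoningBank is a new memory framework that distills generalizable reasoning strategies from an agent's successful and failed experiences and uses MaTTS to accelerate learning, enabling the agent to self-evolve.

Details

Motivation: Current LLM agents handling continuous task streams fail to make effective use of their interaction history, leading to repeated errors and inefficiency.

Result: On web browsing and software engineering tasks, ReasoningBank outperforms existing methods that store only raw trajectories or successful task routines, improving both efficiency and effectiveness.

Insight: Combining memory with memory-aware test-time scaling (MaTTS) enables agent self-evolution and demonstrates the potential of memory-driven scaling.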

Abstract: With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent’s self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent’s interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with emergent behaviors that arise naturally.

[397] Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning cs.AI | cs.CVPDF

Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao

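TL;DR: The paper proposes Mixture-of-Visual-Thoughts (MoVT), an adaptive visual reasoning paradigm that unifies different reasoning modes and selects among them based on context, improving general reasoning ability.

Details

Motivation: Current visual reasoning methods are mostly designed around specific reasoning modes and lack generality. This work aims to improve general reasoning by unifying multiple modes and selecting among them by context.

Result: Experiments show that AdaVaR effectively learns multiple reasoning modes and selects among them adaptively, with consistent improvements across scenarios.

Insight: MoVT offers an effective approach to building general visual reasoning models; context adaptivity is key to multi-mode reasoning.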

Abstract: Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

Junyang Zhang, Tianyi Zhu, Thierry Tambe

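TL;DR: AttAnchor is a parameter-free framework that uses attention anchors to improve cross-modal token alignment in vision-language models (VLMs), raising task performance and reducing hallucination.

Details

Motivation: Existing VLMs rely on simple token concatenation and modality-agnostic positional encoding, which yields poor cross-modal token alignment and causes hallucination and degraded performance.

Result: Improvements on 13 of 15 metrics and benchmarks, with gains of up to 32% on reasoning tasks and up to 15% on hallucination benchmarks; a lightweight model outperforms larger ones.

Insight: Jointly clustering tokens across modalities, rather than grouping within a single modality, achieves cross-modal alignment more effectively and improves VLM performance.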

Abstract: A fundamental reason for the dominance of attention over RNNs and LSTMs in LLMs is its ability to capture long-range dependencies by modeling direct interactions between all tokens, overcoming the sequential limitations of recurrent architectures. Similarly, a key reason why today’s vision language models (VLMs) hallucinate and underperform pure language models is that they rely on direct concatenation of image and text tokens with a modality-blinded positional encoding, which conveniently adopts the pretrained LLM backbone but forces unnecessary long-distance attention between semantically related tokens across modalities. This underscores the urgent need for mechanisms that efficiently enhance token locality and cross-modal alignment. In response, we propose Attention Anchor, a parameter-free framework that efficiently groups semantically similar tokens across modalities, improving cross-modal locality. By inserting text tokens near relevant visual patches, we create semantic signposts that reveal true content-based cross-modal attention scores, guiding the model to focus on the correct image regions for tasks such as VQA, MMBench and POPE. This improves answer accuracy and reduces hallucinations without disrupting the prompt’s semantic flow. AttAnchor achieves improvements across 13 out of 15 different metrics and benchmarks, including up to 32% gains on reasoning tasks and up to 15% improvements on hallucination benchmarks. AttAnchor enables TinyLLaVA 1B to outperform much larger models like LLaVA 7B and QwenVL 3B on POPE with only 0.1% inference time overhead. To the best of our knowledge, this work is among the first to investigate mixed-modal token grouping, where text and image tokens are clustered jointly into shared groups rather than being grouped within a single modality or merely aligned post-hoc with additional alignment losses.

[399] Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned cs.AI | cs.CVPDF

Brandon Ong, Tej Deep Pala, Vernon Toh, William Chandra Tjhi, Soujanya Poria

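TL;DR: The paper studies how vision-language process reward models (VL-PRMs) can improve the reasoning of vision-language models, proposing a hybrid data synthesis framework and perception-level supervision and systematically evaluating several test-time scaling strategies.

Details

Motivation: Existing VL-PRMs rely on Monte Carlo Tree Search (MCTS) to construct data, but the resulting supervision signals are noisy and generalize poorly. The study explores the VL-PRM design space to improve reliability and scalability in multimodal reasoning.

Result: Across five multimodal benchmarks: (1) VL-PRMs used as outcome reward models during test-time scaling can outperform VL-PRM-guided process step selection; (2) smaller models can match or surpass larger ones in error detection; (3) VL-PRMs uncover latent reasoning abilities in stronger VLMs; (4) perception-level supervision yields significant gains; (5) performance improves on advanced math reasoning datasets that were not used in training.

Insight: 1. VL-PRMs have an advantage when used as outcome reward models in test-time scaling; 2. model size is not the deciding factor for error detection; 3. perception-level supervision is key for multimodal tasks; 4. test-time scaling strategies need to be tailored to the task.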

Abstract: Process Reward Models (PRMs) provide step-level supervision that improves the reliability of reasoning in large language models. While PRMs have been extensively studied in text-based domains, their extension to Vision Language Models (VLMs) remains limited. Existing Vision-Language PRMs (VL-PRMs) rely on Monte Carlo Tree Search (MCTS) for data construction, which can often produce noisy supervision signals and limit generalization across tasks. In this work, we aim to elucidate the design space of VL-PRMs by exploring diverse strategies for dataset construction, training, and test-time scaling. First, we introduce a hybrid data synthesis framework that combines MCTS with judgments from a strong VLM, producing more accurate step-level labels. Second, we propose perception-focused supervision, enabling our PRM to explicitly detect errors at the visual grounding stage of reasoning. Third, we systematically evaluate multiple test-time scaling strategies, showing that our PRMs can reliably guide VLMs toward more accurate solutions. Our experiments covering five diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision) reveal several key insights: (i) VL-PRMs when used as Outcome Reward Models (ORMs) during test-time scaling (TTS) can outperform VL-PRM guided process step selection, (ii) smaller VL-PRMs can match or even surpass larger ones in detecting process errors, (iii) VL-PRMs uncover latent reasoning abilities in stronger VLM backbones, (iv) perception-level supervision leads to significant gains in test-time scaling, and (v) TTS performance of different policies improves on advanced math reasoning datasets despite not training VL-PRMs on such datasets. We hope our work will motivate further research and support the advancement of VLMs.
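
As a rough illustration of test-time scaling with a reward model, the sketch below picks the best of N candidate solutions from per-step scores under different aggregation rules. It is a generic best-of-N sketch, not the paper's procedure: the function `best_of_n`, the `last_step` aggregator, and the score values are hypothetical, and treating the final-step score as an outcome-style score is an assumption made here for illustration.

```python
from typing import Callable, Dict, List, Sequence


def best_of_n(step_scores: Dict[str, List[float]],
              aggregate: Callable[[Sequence[float]], float]) -> str:
    """Return the candidate whose aggregated step scores are highest."""
    return max(step_scores, key=lambda name: aggregate(step_scores[name]))


def last_step(scores: Sequence[float]) -> float:
    """Outcome-style aggregation (assumed): judge a candidate by its final step only."""
    return scores[-1]


# Hypothetical reward-model scores for two candidate solutions.
candidates = {
    "A": [0.9, 0.2, 0.8],  # one weak intermediate step, strong ending
    "B": [0.7, 0.7, 0.6],  # uniformly moderate steps
}

outcome_choice = best_of_n(candidates, last_step)  # final-step score only
process_choice = best_of_n(candidates, min)        # weakest-step score
```

Here the two rules disagree: outcome-style scoring prefers candidate A, while the weakest-step rule prefers B. Which rule works better is an empirical question of the kind the paper evaluates.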

[400] BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving cs.AI | cs.CV | cs.LGPDF

Shu Liu, Wenlin Chen, Weihao Li, Zheng Wang, Lijin Yang

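TL;DR: BridgeDrive is a diffusion-bridge-based closed-loop trajectory planner that guides the diffusion model with anchors, improving planning for autonomous driving in complex dynamic scenes and achieving a 5% gain on the Bench2Drive benchmark.

Details

Motivation: Diffusion models can capture multi-modal behavior in autonomous driving, but guiding them effectively in closed-loop environments remains a challenge. Existing methods rely on a truncated schedule, which can compromise performance and theoretical consistency.

Result: State-of-the-art performance on Bench2Drive, with a success rate 5% higher than prior methods.

Insight: Anchor-guided diffusion performs well in complex dynamic scenes, and compatibility with ODE solvers is important for real-time deployment.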

Abstract: Diffusion-based planners have shown great promise for autonomous driving due to their ability to capture multi-modal driving behaviors. However, guiding these models effectively in reactive, closed-loop environments remains a significant challenge. Simple conditioning often fails to provide sufficient guidance in complex and dynamic driving scenarios. Recent work attempts to use typical expert driving behaviors (i.e., anchors) to guide diffusion models but relies on a truncated schedule, which introduces theoretical inconsistencies and can compromise performance. To address this, we introduce BridgeDrive, a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning. Our approach provides a principled diffusion framework that effectively translates anchors into fine-grained trajectory plans, appropriately responding to varying traffic conditions. Our planner is compatible with efficient ODE solvers, a critical factor for real-time autonomous driving deployment. We achieve state-of-the-art performance on the Bench2Drive benchmark, improving the success rate by 5% over prior art.

[401] Transparent Visual Reasoning via Object-Centric Agent Collaboration cs.AI | cs.CVPDF

Benjamin Teoh, Ben Glocker, Francesca Toni, Avinash Kori

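TL;DR: The paper proposes the OCEAN framework, which achieves transparent visual reasoning through object-centric representations and multi-agent collaboration, showing advantages in both performance and interpretability.

Details

Motivation: To address the challenge in explainable AI of producing visual explanations grounded in human-understandable concepts.

Result: Performance on multiple datasets is comparable to black-box models, and a user study shows that its explanations are rated more intuitive and trustworthy.

Insight: Object-centric representations and multi-agent collaboration are an effective route to explainable AI.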

Abstract: A central challenge in explainable AI, particularly in the visual domain, is producing explanations grounded in human-understandable concepts. To tackle this, we introduce OCEAN (Object-Centric Explananda via Agent Negotiation), a novel, inherently interpretable framework built on object-centric representations and a transparent multi-agent reasoning process. The game-theoretic reasoning process drives agents to agree on coherent and discriminative evidence, resulting in a faithful and interpretable decision-making process. We train OCEAN end-to-end and benchmark it against standard visual classifiers and popular posthoc explanation tools like GradCAM and LIME across two diagnostic multi-object datasets. Our results demonstrate competitive performance with respect to state-of-the-art black-box models with a faithful reasoning process, which was reflected by our user study, where participants consistently rated OCEAN’s explanations as more intuitive and trustworthy.

Yue Zhang, Tianyi Ma, Zun Wang, Yanyuan Qiao, Parisa Kordjamshidi

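TL;DR: The paper proposes an LLM navigation method that introduces multi-perspective textual descriptions and improved analogical reasoning, significantly improving performance on Vision-and-Language Navigation (VLN).

Details

Motivation: Existing zero-shot LLM navigation methods either oversimplify visual details (by encoding images as text descriptions) or struggle to capture high-level semantics (by processing raw images directly).

Result: The method is validated on the R2R dataset, with significant improvements in navigation performance.

Insight: Combining multi-perspective textual descriptions with analogical reasoning gives VLN richer semantic information and high-level reasoning capability.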

Abstract: Integrating large language models (LLMs) into embodied AI models is becoming increasingly prevalent. However, existing zero-shot LLM-based Vision-and-Language Navigation (VLN) agents either encode images as textual scene descriptions, potentially oversimplifying visual details, or process raw image inputs, which can fail to capture abstract semantics required for high-level reasoning. In this paper, we improve the navigation agent’s contextual understanding by incorporating textual descriptions from multiple perspectives that facilitate analogical reasoning across images. By leveraging text-based analogical reasoning, the agent enhances its global scene understanding and spatial reasoning, leading to more accurate action decisions. We evaluate our approach on the R2R dataset, where our experiments demonstrate significant improvements in navigation performance.

eess.AS [Back]

[403] VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning eess.AS | cs.AI | cs.CL | cs.CV | cs.SDPDF

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan

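TL;DR: VSSFlow is a unified flow-matching framework that integrates video-to-sound (V2S) and visual text-to-speech (VisualTTS) into a single model, using a condition aggregation mechanism to handle different input signals and exploiting the inductive biases of cross-attention and self-attention to improve generation.

Details

Motivation: V2S and VisualTTS are usually handled separately without a unified framework, and attempts at unification face differing input conditions and complex training.

Result: Surpasses domain-specific models on V2S and VisualTTS benchmarks, demonstrating the potential of unified generative models.

Insight: Joint training learns an audio prior shared across tasks, which speeds up convergence and improves generation quality without complex training strategies.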

Abstract: Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, is conventionally addressed as separate tasks, with limited exploration of unifying them within a single framework. Recent attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases in the process of introducing conditions. Therefore, VSSFlow leverages these inductive biases to effectively handle different representations: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from the end-to-end joint learning process for sound and speech generation without extra designs on training stages. Detailed analysis attributes this to the learned general audio prior shared between tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses the state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.

[404] AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines eess.AS | cs.CV | cs.MM | cs.SDPDF

Cancan Li, Fei Su, Juan Liu, Hui Bu, Yulong Wan

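TL;DR: This paper introduces AISHELL6-Whisper, a Chinese Mandarin audio-visual whisper speech dataset with 30 hours each of whisper speech and parallel normal speech, and proposes an audio-visual speech recognition baseline built on the Whisper-Flamingo framework.

Details

Motivation: Audio-visual whisper speech recognition for Chinese Mandarin lacks large-scale datasets, which limits research in the area. This paper aims to fill that gap.

Result: On the dataset's test set, the character error rate (CER) is 4.13% for whisper speech and 1.11% for normal speech; the model also sets new state-of-the-art results on the wTIMIT benchmark.

Insight: Openly releasing the dataset and baseline provides an important resource for whisper speech recognition research and supports progress in multimodal speech recognition.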

Abstract: Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications but also for providing a critical communication bridge for patients under vocal restraint and enabling discreet interaction in noise-sensitive environments. The development of Chinese Mandarin audio-visual whisper speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset, featuring 30 hours each of whisper speech and parallel normal speech, with synchronized frontal facial videos. Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types, and employs a projection layer to adapt to whisper speech’s spectral properties. The model achieves a Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech on the test set of our dataset, and establishes new state-of-the-art results on the wTIMIT benchmark. The dataset and the AVSR baseline codes are open-sourced at https://zutm.github.io/AISHELL6-Whisper.
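
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below assumes that standard definition; the example strings are made up and the paper's exact text normalization is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical transcript pair: one substituted character out of six.
print(cer("今天天气很好", "今天天汽很好"))
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.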

cs.IR [Back]

[405] Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval cs.IR | cs.AI | cs.CLPDF

Junwei Lan, Jianlyu Chen, Zheng Liu, Chaofan Li, Siqi Bao

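TL;DR: Retro* is a document retrieval approach that assesses task-document relevance through fine-grained reasoning. It combines a rubric-based scoring mechanism with the integration of multiple reasoning trajectories to optimize LLMs for reasoning-intensive retrieval, and markedly improves performance on the BRIGHT benchmark.

Details

Motivation: As LLM agents and RAG become widespread, systems must retrieve documents whose relevance to a task is indirect or implicit, which challenges existing IR techniques. Retro* targets the limits of existing methods in applicability, scalability, and efficiency.

Result: On the BRIGHT benchmark, Retro* clearly outperforms existing document retrieval methods and reaches state-of-the-art performance.

Insight: The scoring mechanism and multi-trajectory integration offer a new approach to reasoning-intensive document retrieval, and reinforcement learning further strengthens the model's reasoning ability.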

Abstract: With the growing popularity of LLM agents and RAG, it has become increasingly important to retrieve documents that are essential for solving a task, even when their connection to the task is indirect or implicit. Addressing this problem requires fine-grained reasoning to accurately assess the relevance between the task and each candidate document. This capability, however, poses a significant challenge for existing IR techniques. Despite recent progress in reasoning-enhanced IR, existing approaches still face significant challenges in applicability, scalability, and efficiency. In this work, we propose Retro*, a novel approach for reasoning-intensive document retrieval. Our method introduces a rubric-based relevance scoring mechanism, enabling the model to reason about the relationship between a task and a document based on explicitly defined criteria, thereby producing a fine-grained, interpretable relevance score. Retro* also supports test-time scaling by combining multiple reasoning trajectories via score integration, which produces more reliable relevance estimates. To optimize Retro*’s reasoning capabilities, we introduce a novel reinforcement learning algorithm tailored for its relevance scoring mechanism, which employs two composite rewards to fully exploit the trajectories of each training sample. Our experiments show that Retro* outperforms existing document retrieval methods with notable advantages, leading to state-of-the-art performance on the BRIGHT benchmark.
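
The abstract does not say how scores from several reasoning trajectories are integrated. As a rough sketch, assuming the simplest choice of averaging the per-trajectory scores and ranking documents by the mean (the function names, document ids, and score scale are all hypothetical):

```python
from statistics import mean


def integrate_scores(trajectory_scores):
    """Assumed integration rule: arithmetic mean over sampled trajectories."""
    if not trajectory_scores:
        raise ValueError("need at least one trajectory score")
    return mean(trajectory_scores)


def rank_documents(scores_by_doc):
    """Rank document ids by integrated relevance, highest first."""
    integrated = {doc: integrate_scores(s) for doc, s in scores_by_doc.items()}
    return sorted(integrated, key=integrated.get, reverse=True)


# Hypothetical rubric scores from three sampled reasoning trajectories each.
scores = {"doc_a": [60, 80, 70], "doc_b": [90, 40, 50], "doc_c": [20, 30, 10]}
print(rank_documents(scores))
```

In this toy input, `doc_b` has the single highest score but a lower mean than `doc_a`, which shows how averaging can damp one over-confident trajectory.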

cs.GR [Back]

[406] ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing cs.GR | cs.CVPDF

Xiang Tang, Ruotong Li, Xiaopeng Fan

TL;DR: ZeroScene is a zero-shot framework that generates 3D scenes from a single image and supports controllable texture editing. It uses the prior knowledge of large vision models to keep the scene coherent, and maintains multi-view texture consistency through diffusion-model constraints and mask-guided progressive image generation.

Details

Motivation: Existing single-image reconstruction methods struggle to ensure both the quality of individual assets and the coherence of the overall scene, and texture editing techniques often fail to maintain local continuity and multi-view consistency.

Result: Experiments show the framework achieves geometric and appearance accuracy, faithfully reconstructs scenes, and produces detailed textures aligned with text prompts.

Insight: Leveraging large vision models and combining optimization techniques with diffusion models can significantly enhance the quality and coherence of 3D scene generation and editing.

Abstract: In the field of 3D content generation, single image scene reconstruction methods still struggle to simultaneously ensure the quality of individual assets and the coherence of the overall scene in complex environments, while texture editing techniques often fail to maintain both local continuity and multi-view consistency. In this paper, we propose a novel system ZeroScene, which leverages the prior knowledge of large vision models to accomplish both single image-to-3D scene reconstruction and texture editing in a zero-shot manner. ZeroScene extracts object-level 2D segmentation and depth information from input images to infer spatial relationships within the scene. It then jointly optimizes 3D and 2D projection losses of the point cloud to update object poses for precise scene alignment, ultimately constructing a coherent and complete 3D scene that encompasses both foreground and background. Moreover, ZeroScene supports texture editing of objects in the scene. By imposing constraints on the diffusion model and introducing a mask-guided progressive image generation strategy, we effectively maintain texture consistency across multiple viewpoints and further enhance the realism of rendered results through Physically Based Rendering (PBR) material estimation. Experimental results demonstrate that our framework not only ensures the geometric and appearance accuracy of generated assets, but also faithfully reconstructs scene layouts and produces highly detailed textures that closely align with text prompts.

[407] StrucADT: Generating Structure-controlled 3D Point Clouds with Adjacency Diffusion Transformer cs.GR | cs.CV | cs.LGPDF

Zhenyu Shu, Jiajun Shen, Zhongui Chen, Xiaoguang Han, Shiqing Xin

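TL;DR: The paper proposes StrucADT, which controls 3D point cloud generation through shape structure (part existence and part adjacency relationships), addressing the lack of controllability in existing methods and producing high-quality, diverse, controllable point clouds.

Details

Motivation: Existing 3D point cloud generators produce diverse and realistic shapes but offer little precise control over the result, which limits large-scale application. The paper proposes controlling generation through structure (parts and their adjacency) to meet user-specific requirements.

Result: Experiments on the ShapeNet dataset show high-quality, diverse point cloud generation, support for user-specified shape structures, and state-of-the-art performance in controllable point cloud generation.

Insight: 1. Structural control (parts and their adjacency) gives 3D point cloud generation a new controllable dimension; 2. The StructureGraph representation and the StrucADT model offer a reference for future controllable generation tasks.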

Abstract: In the field of 3D point cloud generation, numerous 3D generative models have demonstrated the ability to generate diverse and realistic 3D shapes. However, the majority of these approaches struggle to generate controllable 3D point cloud shapes that meet user-specific requirements, hindering the large-scale application of 3D point cloud generation. To address the challenge of lacking control in 3D point cloud generation, we are the first to propose controlling the generation of point clouds by shape structures that comprise part existences and part adjacency relationships. We manually annotate the adjacency relationships between the segmented parts of point cloud shapes, thereby constructing a StructureGraph representation. Based on this StructureGraph representation, we introduce StrucADT, a novel structure-controllable point cloud generation model, which consists of StructureGraphNet module to extract structure-aware latent features, cCNF Prior module to learn the distribution of the latent features controlled by the part adjacency, and Diffusion Transformer module conditioned on the latent features and part adjacency to generate structure-consistent point cloud shapes. Experimental results demonstrate that our structure-controllable 3D point cloud generation method produces high-quality and diverse point cloud shapes, enabling the generation of controllable point clouds based on user-specified shape structures and achieving state-of-the-art performance in controllable point cloud generation on the ShapeNet dataset.
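
To make "part existences and part adjacency relationships" concrete, here is one possible encoding: a binary existence vector plus a symmetric adjacency matrix. This is only an illustrative data structure; the part names and the function are hypothetical, and the paper's actual StructureGraph format may differ.

```python
def build_structure_graph(part_names, present, adjacent_pairs):
    """Encode a shape structure as (existence vector, adjacency matrix)."""
    index = {name: i for i, name in enumerate(part_names)}
    existence = [1 if name in present else 0 for name in part_names]
    n = len(part_names)
    adjacency = [[0] * n for _ in range(n)]
    for a, b in adjacent_pairs:
        if a not in present or b not in present:
            raise ValueError(f"adjacency uses a missing part: {a}, {b}")
        i, j = index[a], index[b]
        adjacency[i][j] = adjacency[j][i] = 1  # adjacency is undirected
    return existence, adjacency


# Hypothetical chair without armrests: back touches seat, seat touches leg.
parts = ["back", "seat", "leg", "arm"]
existence, adjacency = build_structure_graph(
    parts, {"back", "seat", "leg"}, [("back", "seat"), ("seat", "leg")]
)
print(existence)
print(adjacency)
```

A user-specified structure of this kind is what a conditional generator would consume as its control signal.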

[408] ReLumix: Extending Image Relighting to Video via Video Diffusion Models cs.GR | cs.CVPDF

Lezhong Wang, Shutong Jin, Ruiqi Cui, Anders Bjorholm Dahl, Jeppe Revall Frisvad

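TL;DR: This paper proposes ReLumix, which extends image relighting to video through a two-stage process built on a video diffusion model.

Details

Motivation: Dynamic lighting control in video post-production lacks flexibility, and existing methods are tied to specific relighting models. ReLumix aims to provide a general solution.

Result: ReLumix markedly improves visual fidelity, applies flexibly to real-world videos, and generalizes well to them.

Insight: Decoupling relighting from temporal synthesis and drawing on the strong priors of diffusion models yields a flexible, high-quality approach to video relighting.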

Abstract: Controlling illumination during video post-production is a crucial yet elusive goal in computational photography. Existing methods often lack flexibility, restricting users to certain relighting models. This paper introduces ReLumix, a novel framework that decouples the relighting algorithm from temporal synthesis, thereby enabling any image relighting technique to be seamlessly applied to video. Our approach reformulates video relighting into a simple yet effective two-stage process: (1) an artist relights a single reference frame using any preferred image-based technique (e.g., Diffusion Models, physics-based renderers); and (2) a fine-tuned stable video diffusion (SVD) model seamlessly propagates this target illumination throughout the sequence. To ensure temporal coherence and prevent artifacts, we introduce a gated cross-attention mechanism for smooth feature blending and a temporal bootstrapping strategy that harnesses SVD’s powerful motion priors. Although trained on synthetic data, ReLumix shows competitive generalization to real-world videos. The method demonstrates significant improvements in visual fidelity, offering a scalable and versatile solution for dynamic lighting control.

[409] Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives cs.GR | cs.CVPDF

AmirHossein Zamani, Bruno Roy, Arianna Rampini

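TL;DR: The paper proposes an unsupervised learning method that automates 3D mesh parameterization (UV mapping) by combining semantic and visibility objectives, addressing existing methods' neglect of semantic alignment and visibility optimization.

Details

Motivation: Existing 3D generative models rely on manual UV mapping, which is time-consuming and requires expertise. Automatic methods, meanwhile, ignore semantic alignment and the visibility of cutting seams, which degrades texture generation.

Result: Experiments show that the UV atlases produced by the method outperform recent baselines in supporting texture generation and reducing visible seams.

Insight: Combining semantic and visibility objectives markedly improves UV mapping quality and offers a new direction for automated 3D content creation.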

Abstract: Recent 3D generative models produce high-quality textures for 3D mesh objects. However, they commonly rely on the heavy assumption that input 3D meshes are accompanied by manual mesh parameterization (UV mapping), a manual task that requires both technical precision and artistic judgment. Industry surveys show that this process often accounts for a significant share of asset creation, creating a major bottleneck for 3D content creators. Moreover, existing automatic methods often ignore two perceptually important criteria: (1) semantic awareness (UV charts should align semantically similar 3D parts across shapes) and (2) visibility awareness (cutting seams should lie in regions unlikely to be seen). To overcome these shortcomings and to automate the mesh parameterization process, we present an unsupervised differentiable framework that augments standard geometry-preserving UV learning with semantic- and visibility-aware objectives. For semantic-awareness, our pipeline (i) segments the mesh into semantic 3D parts, (ii) applies an unsupervised learned per-part UV-parameterization backbone, and (iii) aggregates per-part charts into a unified UV atlas. For visibility-awareness, we use ambient occlusion (AO) as an exposure proxy and back-propagate a soft differentiable AO-weighted seam objective to steer cutting seams toward occluded regions. By conducting qualitative and quantitative evaluations against state-of-the-art methods, we show that the proposed method produces UV atlases that better support texture generation and reduce perceptible seam artifacts compared to recent baselines. Our implementation code is publicly available at: https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.

cs.HC [Back]

[410] Bridging the behavior-neural gap: A multimodal AI reveals the brain’s geometry of emotion more accurately than human self-reports cs.HC | cs.AI | cs.CL | cs.CY | cs.MMPDF

Changde Du, Yizhuo Lu, Zhongyu Huang, Yi Sun, Zisen Zhou

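TL;DR: The paper uses a multimodal AI model and large-scale similarity judgments to capture the neural geometry of emotion, outperforming data from human self-reports.

Details

Motivation: Emotion representation is central to human cognition and social interaction, but traditional rating scales predict brain activity poorly, leaving a 'behavior-neural gap'.

Result: The MLLM's affective representation predicted neural activity in human emotion-processing networks most accurately, outperforming both the LLM and human behavioral ratings.

Insight: Multimodal learning, especially from visual data, appears critical for building a neurally aligned framework for emotion, and AI models can autonomously develop rich, neurally aligned affective representations.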

Abstract: The ability to represent emotion plays a significant role in human cognition and social interaction, yet the high-dimensional geometry of this affective space and its neural underpinnings remain debated. A key challenge, the 'behavior-neural gap,' is the limited ability of human self-reports to predict brain activity. Here we test the hypothesis that this gap arises from the constraints of traditional rating scales and that large-scale similarity judgments can more faithfully capture the brain's affective geometry. Using AI models as 'cognitive agents,' we collected millions of triplet odd-one-out judgments from a multimodal large language model (MLLM) and a language-only model (LLM) in response to 2,180 emotionally evocative videos. We found that the emergent 30-dimensional embeddings from these models are highly interpretable and organize emotion primarily along categorical lines, yet in a blended fashion that incorporates dimensional properties. Most remarkably, the MLLM’s representation predicted neural activity in human emotion-processing networks with the highest accuracy, outperforming not only the LLM but also, counterintuitively, representations derived directly from human behavioral ratings. This result supports our primary hypothesis and suggests that sensory grounding (learning from rich visual data) is critical for developing a truly neurally-aligned conceptual framework for emotion. Our findings provide compelling evidence that MLLMs can autonomously develop rich, neurally-aligned affective representations, offering a powerful paradigm to bridge the gap between subjective experience and its neural substrates. Project page: https://reedonepeck.github.io/ai-emotion.github.io/.

eess.SP [Back]

[411] YOLO-based Bearing Fault Diagnosis With Continuous Wavelet Transform eess.SP | cs.AI | cs.CV | cs.LG | eess.IVPDF

Po-Heng Chou, Wei-Lung Mao, Ru-Ping Lin

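TL;DR: This paper proposes a YOLO-based bearing fault diagnosis framework that classifies time-frequency spectrograms produced by the continuous wavelet transform (CWT), markedly improving accuracy and generalizability.

Details

Motivation: Traditional bearing fault diagnosis methods handle transient fault signatures poorly and generalize weakly. The paper combines the spatial detection ability of YOLO models with the time-frequency information from CWT for more effective diagnosis.

Result: YOLOv11 reaches mAP scores of 99.4%, 97.8%, and 99.5% on the CWRU, PU, and IMS datasets respectively, clearly outperforming the MCNN-LSTM baseline.

Insight: 1. Combining YOLO's spatial detection with CWT time-frequency information captures transient bearing fault signatures effectively; 2. Visualizing fault regions offers a practical solution for condition monitoring of rotating machinery.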

Abstract: This letter proposes a YOLO-based framework for spatial bearing fault diagnosis using time-frequency spectrograms derived from continuous wavelet transform (CWT). One-dimensional vibration signals are first transformed into time-frequency spectrograms using Morlet wavelets to capture transient fault signatures. These spectrograms are then processed by YOLOv9, v10, and v11 models to classify fault types. Evaluated on three benchmark datasets, including Case Western Reserve University (CWRU), Paderborn University (PU), and Intelligent Maintenance System (IMS), the proposed CWT-YOLO pipeline achieves significantly higher accuracy and generalizability than the baseline MCNN-LSTM model. Notably, YOLOv11 reaches mAP scores of 99.4% (CWRU), 97.8% (PU), and 99.5% (IMS). In addition, its region-aware detection mechanism enables direct visualization of fault locations in spectrograms, offering a practical solution for condition monitoring in rotating machinery.
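
The first step of the pipeline, turning a 1D vibration signal into a time-frequency map with Morlet wavelets, can be illustrated with a small pure-Python CWT. This is a didactic sketch, not the authors' preprocessing: the wavelet parameter `w0 = 6`, the direct O(N²) summation, and the synthetic 5 Hz test signal are all assumptions made for the example.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet wavelet (small admissibility correction omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_magnitude(signal, fs, freqs, w0=6.0):
    """Return |CWT| as one row per analysis frequency, one value per sample."""
    dt = 1.0 / fs
    rows = []
    for f in freqs:
        s = w0 / (2.0 * math.pi * f)  # scale whose centre frequency is f
        row = []
        for b in range(len(signal)):
            acc = 0j
            for n, x in enumerate(signal):
                acc += x * morlet((n - b) * dt / s, w0).conjugate()
            row.append(abs(acc) * dt / math.sqrt(s))
        rows.append(row)
    return rows


fs = 100.0
signal = [math.sin(2 * math.pi * 5.0 * n / fs) for n in range(200)]  # 5 Hz tone
freqs = [2.0, 5.0, 10.0]
rows = cwt_magnitude(signal, fs, freqs)
mid = len(signal) // 2
dominant = max(range(len(freqs)), key=lambda k: rows[k][mid])
print("strongest row at mid-signal:", freqs[dominant], "Hz")
```

Stacking the rows gives the 2D spectrogram image that a detector would take as input; a practical implementation would use an FFT-based CWT from a signal-processing library instead of the double loop.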

[412] Introducing Multimodal Paradigm for Learning Sleep Staging PSG via General-Purpose Model eess.SP | cs.CVPDF

Jianheng Zhou, Chenyu Liu, Jinan Zhou, Yi Ding, Yang Liu

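TL;DR: Proposes a new multimodal paradigm in which a general-purpose multimodal large model learns sleep staging from two-dimensional images generated from PSG signals, outperforming existing methods.

Details

Motivation: Existing automatic sleep staging methods depend on complex signals and domain-specific models, lack intuitiveness, and need large amounts of data. The paper aims to overcome these limits with an approach inspired by clinical diagnostic practice.

Result: On the ISRUC, MASS, and SHHS datasets the method surpasses existing baselines in accuracy and robustness. Explanation analysis indicates that the model mimics the visual diagnostic logic of human experts.

Insight: 1. General-purpose models can adapt to new domains through multimodal learning; 2. Rendering signals as images preserves key information; 3. The method adapts to the new task quickly without sleep-specific pretraining.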

Abstract: Sleep staging is essential for diagnosing sleep disorders and assessing neurological health. Existing automatic methods typically extract features from complex polysomnography (PSG) signals and train domain-specific models, which often lack intuitiveness and require large, specialized datasets. To overcome these limitations, we introduce a new paradigm for sleep staging that leverages large multimodal general-purpose models to emulate clinical diagnostic practices. Specifically, we convert raw one-dimensional PSG time-series into intuitive two-dimensional waveform images and then fine-tune a multimodal large model to learn from these representations. Experiments on three public datasets (ISRUC, MASS, SHHS) demonstrate that our approach enables general-purpose models, without prior exposure to sleep data, to acquire robust staging capabilities. Moreover, explanation analysis reveals our model learned to mimic the visual diagnostic workflow of human experts for sleep staging by PSG images. The proposed method consistently outperforms state-of-the-art baselines in accuracy and robustness, highlighting its efficiency and practical value for medical applications. The code for the signal-to-image pipeline and the PSG image dataset will be released.

q-bio.QM [Back]

[413] Patient-specific Biomolecular Instruction Tuning q-bio.QM | cs.AI | cs.CL | cs.LG | 92C40, 68T07, 62P10 | I.2.7; I.5.1; J.3PDF

Irsyad Adam, Zekai Chen, David Laub, Shaun Porwal, Arda Pekis

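TL;DR: Proposes KRONOS, a graph-LLM framework that combines molecular interaction topology with proteomics data to strengthen LLM reasoning on clinical cancer tasks, and releases CPTAC-PROTSTRUCT, the first proteomics instruction-tuning dataset.

Details

Motivation: Personalized cancer treatment requires understanding patient-specific proteomics data, but dedicated datasets and model architectures for this goal are lacking.

Result: KRONOS achieves competitive performance on tasks including molecular classification, temporal trajectory modeling, and tumor stage prediction.

Insight: Combining molecular interaction graphs with LLMs can markedly improve understanding of patient-specific proteomics data and advance precision medicine.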

Abstract: Proteomics data is essential to pathogenic understanding of a disease phenotype. In cancer, analysis of molecular signatures enables precision medicine through the identification of biological processes that drive individualized tumor progression, therapeutic resistance, and clinical heterogeneity. Recent advances in multimodal large language models (LLMs) have shown remarkable capacity to integrate and reason across heterogeneous data modalities. However, performing multi-modal language modeling for molecular understanding of patient-specific proteomics remains a significant challenge due to two barriers: (1) the lack of instruction-tuning datasets that enable clinical interpretation from proteomics data, and (2) the absence of language modeling architectures designed to capture the rich heterogeneity of molecular data. In this work, we introduce CPTAC-PROTSTRUCT, the first instruction tuning dataset for molecular understanding of oncology, comprising over 400k open-ended examples derived from individualized proteomic profiles curated from the largest national proteomics cancer study (CPTAC). Additionally, we propose KRONOS (Knowledge Representation of patient Omics Networks in Oncology via Structured tuning), a novel graph-LLM framework that leverages molecular interaction topology with proteomics to learn patient-specific graph representations for enhanced clinical reasoning. We show that KRONOS achieves competitive performance across benchmark clinical tasks, including molecular classification, temporal trajectory modeling, and tumor stage prediction from proteomics data. Ultimately, this approach empowers LLMs to understand patient-level pathogenesis, advancing precision medicine through more accurate diagnosis, prognosis, and treatment stratification.

cs.DL [Back]

[414] Overview of SCIDOCA 2025 Shared Task on Citation Prediction, Discovery, and Placement cs.DL | cs.CLPDF

An Dao, Vu Tran, Le-Minh Nguyen, Yuji Matsumoto

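TL;DR: The SCIDOCA 2025 shared task focuses on citation discovery and prediction in scientific documents, with three subtasks: citation discovery, masked citation prediction, and citation sentence prediction. The dataset is built from S2ORC and contains over 60,000 annotated paragraphs; seven teams registered.

Details

Motivation: To provide a new evaluation benchmark for scientific document understanding and advance research on citation modeling.

Result: Three teams submitted results; the overview reports performance metrics for each subtask and analyzes the submitted systems.

Insight: The shared task opens a new research direction in scientific document understanding, particularly citation modeling.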

Abstract: We present an overview of the SCIDOCA 2025 Shared Task, which focuses on citation discovery and prediction in scientific documents. The task is divided into three subtasks: (1) Citation Discovery, where systems must identify relevant references for a given paragraph; (2) Masked Citation Prediction, which requires selecting the correct citation for masked citation slots; and (3) Citation Sentence Prediction, where systems must determine the correct reference for each cited sentence. We release a large-scale dataset constructed from the Semantic Scholar Open Research Corpus (S2ORC), containing over 60,000 annotated paragraphs and a curated reference set. The test set consists of 1,000 paragraphs from distinct papers, each annotated with ground-truth citations and distractor candidates. A total of seven teams registered, with three submitting results. We report performance metrics across all subtasks and analyze the effectiveness of submitted systems. This shared task provides a new benchmark for evaluating citation modeling and encourages future research in scientific document understanding. The dataset and task materials are publicly available at https://github.com/daotuanan/scidoca2025-shared-task.

cs.SD [Back]

[415] DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation cs.SD | cs.CL | eess.ASPDF

Ziqi Chen, Gongyu Chen, Yihua Wang, Chaofan Ding, Zihao chen

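TL;DR: DiaMoE-TTS is a unified dialect TTS framework based on the International Phonetic Alphabet (IPA) that uses a Mixture-of-Experts (MoE) and parameter-efficient zero-shot adaptation to address scarce dialect data and phonetic complexity.

Details

Motivation: Dialect speech carries rich cultural and linguistic diversity, but dialect TTS systems face scarce data, inconsistent orthographies, and complex phonetic variation, and need an efficient solution.

Result: Experiments show that the system generates natural speech from little data and adapts zero-shot to unseen dialects and specialized domains such as Peking Opera.

Insight: IPA standardization and parameter-efficient adaptation can markedly improve the scalability and data efficiency of dialect TTS.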

Abstract: Dialect speech embodies rich cultural and linguistic diversity, yet building text-to-speech (TTS) systems for dialects remains challenging due to scarce data, inconsistent orthographies, and complex phonetic variation. To address these issues, we present DiaMoE-TTS, a unified IPA-based framework that standardizes phonetic representations and resolves grapheme-to-phoneme ambiguities. Built upon the F5-TTS architecture, the system introduces a dialect-aware Mixture-of-Experts (MoE) to model phonological differences and employs parameter-efficient adaptation with Low-Rank Adaptors (LoRA) and Conditioning Adapters for rapid transfer to new dialects. Unlike approaches dependent on large-scale or proprietary resources, DiaMoE-TTS enables scalable, open-data-driven synthesis. Experiments demonstrate natural and expressive speech generation, achieving zero-shot performance on unseen dialects and specialized domains such as Peking Opera with only a few hours of data.
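
The parameter-efficient adaptation mentioned above relies on low-rank adapters. As a generic sketch of the standard LoRA forward pass, `h = W x + (alpha / r) * B (A x)` with the base weight `W` frozen, and not the specific configuration used in DiaMoE-TTS (matrix sizes and values below are made up):

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha):
    """LoRA layer: frozen W (d_out x d_in) plus low-rank update B @ A.

    A has shape (r x d_in) and B has shape (d_out x r); only A and B
    would be trained when adapting to a new dialect.
    """
    r = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (hypothetical)
A = [[1.0, 1.0]]              # rank r = 1
B_init = [[0.0], [0.0]]       # usual zero init: the adapter starts as a no-op
x = [2.0, 3.0]
print(lora_forward(W, A, B_init, x, alpha=2.0))
```

With `B` initialized to zero the adapted layer reproduces the base model exactly, which is why adaptation can start from the pretrained behaviour and move away from it gradually.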

[416] MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech cs.SD | cs.AI | cs.CL | cs.CV | cs.MMPDF

Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu

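TL;DR: MGM-Omni is a unified Omni LLM that supports omni-modal understanding and long-horizon speech generation. Its dual-track, token-based architecture decouples multimodal reasoning from real-time speech generation, improving efficiency and cross-modal interaction.

Details

Motivation: Existing speech synthesis mostly relies on cascaded pipelines and lacks a unified end-to-end solution. MGM-Omni aims to deliver efficient multimodal understanding and controllable, personalized speech generation.

Result: Experiments show that MGM-Omni outperforms existing open-source models in long-horizon speech generation, timbre stability, and omni-modal understanding.

Insight: MGM-Omni offers an efficient, unified paradigm for multimodal understanding and speech generation and may advance personalized speech technology.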

Abstract: We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a “brain-mouth” design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omnimodal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omnimodal understanding and controllable, personalised long-horizon speech generation.

[417] Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention cs.SD | cs.CVPDF

Kai Li, Kejun Gao, Xiaolin Hu

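TL;DR: The paper proposes Dolphin, an efficient audio-visual speech separation method whose lightweight visual feature extraction and audio separation modules markedly improve separation quality and efficiency.

Details

Motivation: Existing audio-visual speech separation methods typically have many parameters and high computational cost, which makes them hard to deploy as a preprocessing step in real applications.

Result: Surpasses the SOTA on three benchmark datasets with over 50% fewer parameters, a more than 2.4x reduction in MACs, and over 6x faster GPU inference.

Insight: Lightweight design and efficient attention mechanisms are key to making audio-visual speech separation practical.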

Abstract: Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves as only a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip-motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.

[418] Discovering “Words” in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music cs.SD | cs.CVPDF

Tianle Wang, Sirui Zhang, Xinyi Tong, Peiyang Yu, Jishang Chen

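TL;DR: This paper proposes an unsupervised machine learning algorithm that discovers recurring patterns (called music "words") in symbolic music data, using a two-stage EM framework to optimize their extraction.

Details

Motivation: Recurring patterns in music reflect its structure and the cognitive processes behind it, but extracting them is difficult because of semantic ambiguity. The paper addresses this problem to provide tools for AI music and musicology.

Result: The algorithm achieves an IoU score of 0.61 against human expert annotations, indicating its effectiveness. The method supports AI music tasks and musicological analysis.

Insight: Minimizing code length may be an underlying optimization principle in human music composition, offering a new perspective for analyzing music across styles and composers.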

Abstract: This paper presents an unsupervised machine learning algorithm that identifies recurring patterns, referred to as "music-words", from symbolic music data. These patterns are fundamental to musical structure and reflect the cognitive processes involved in composition. However, extracting these patterns remains challenging because of the inherent semantic ambiguity in musical interpretation. We formulate the task of music-word discovery as a statistical optimization problem and propose a two-stage Expectation-Maximization (EM)-based learning framework: 1. Developing a music-word dictionary; 2. Reconstructing the music data. When evaluated against human expert annotations, the algorithm achieved an Intersection over Union (IoU) score of 0.61. Our findings indicate that minimizing code length effectively addresses semantic ambiguity, suggesting that human optimization of encoding systems shapes musical semantics. This approach enables computers to extract "basic building blocks" from music data, facilitating structural analysis and sparse encoding. The method has two primary applications. First, in AI music, it supports downstream tasks such as music generation, classification, style transfer, and improvisation. Second, in musicology, it provides a tool for analyzing compositional patterns and offers insights into the principle of minimal encoding across diverse musical styles and composers.
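
The reported IoU compares discovered patterns with expert annotations. A minimal sketch of the metric follows, assuming each pattern is represented as a set of note indices; that representation is an assumption made here, and the paper's exact matching protocol may differ.

```python
def iou(predicted, annotated):
    """Intersection over Union of two collections of note indices."""
    predicted, annotated = set(predicted), set(annotated)
    union = predicted | annotated
    if not union:
        raise ValueError("IoU is undefined for two empty sets")
    return len(predicted & annotated) / len(union)


# Hypothetical example: algorithm marks notes 0-7, expert marks notes 3-10.
print(iou(range(0, 8), range(3, 11)))
```

Here five notes overlap out of eleven covered in total, so partial overlaps are penalized from both sides: extra notes and missed notes each enlarge the union.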

astro-ph.IM [Back]

[419] Interpreting deep learning-based stellar mass estimation via causal analysis and mutual information decomposition astro-ph.IM | astro-ph.GA | cs.AI | cs.CVPDF

Wei Zhang, Qiufan Lin, Yuan-Sen Ting, Shupei Chen, Hengxin Ruan

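TL;DR: The paper uses causal analysis and mutual information decomposition to interpret a deep learning-based stellar mass estimation model, revealing the multicomponent contributions of the input data and their physical meaning.

Details

Motivation: Because end-to-end deep learning models are hard to interpret and associational in nature, it is difficult to understand how information beyond integrated photometry (e.g., morphology) affects stellar mass estimation. The study aims to improve interpretability through causal analysis and mutual information decomposition.

Result: The study obtains meaningful results that give physical interpretations for image-based models and shows the potential of combining deep learning with interpretability techniques.

Insight: The study sheds light on how deep learning models use information in stellar mass estimation and offers a new approach for astrophysical parameter estimation and for studying complex multivariate physical processes.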

Abstract: End-to-end deep learning models fed with multi-band galaxy images are powerful data-driven tools used to estimate galaxy physical properties in the absence of spectroscopy. However, due to a lack of interpretability and the associational nature of such models, it is difficult to understand how the information additional to integrated photometry (e.g., morphology) contributes to the estimation task. Improving our understanding in this field would enable further advances into unraveling the physical connections among galaxy properties and optimizing data exploitation. Therefore, our work is aimed at interpreting the deep learning-based estimation of stellar mass via two interpretability techniques: causal analysis and mutual information decomposition. The former reveals the causal paths between multiple variables beyond nondirectional statistical associations, while the latter quantifies the multicomponent contributions (i.e., redundant, unique, and synergistic) of different input data to the stellar mass estimation. Using data from the Sloan Digital Sky Survey (SDSS) and the Wide-field Infrared Survey Explorer (WISE), we obtained meaningful results that provide physical interpretations for image-based models. Our work demonstrates the gains from combining deep learning with interpretability techniques, and holds promise in promoting more data-driven astrophysical research (e.g., astrophysical parameter estimations and investigations on complex multivariate physical processes).

cs.NE [Back]

[420] Accuracy-Robustness Trade Off via Spiking Neural Network Gradient Sparsity Trail cs.NE | cs.AI | cs.CVPDF

Nhan T. Luu

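TL;DR: Under specific architectural configurations, SNNs exhibit natural gradient sparsity and reach state-of-the-art adversarial defense performance without explicit regularization, but sparse gradients improve robustness at the cost of generalization.

Details

Motivation: SNNs attract attention for their energy efficiency and small memory footprint, but their adversarial robustness remains underexplored.

Result: Demonstrates strong adversarial defense performance in SNNs while revealing a trade-off between robustness and generalization.

Insight: Gradient sparsity is key to adversarial defense in SNNs, but robustness must be balanced against generalization.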

Abstract: Spiking Neural Networks (SNNs) have attracted growing interest in both computational neuroscience and artificial intelligence, primarily due to their inherent energy efficiency and compact memory footprint. However, achieving adversarial robustness in SNNs, particularly for vision-related tasks, remains a nascent and underexplored challenge. Recent studies have proposed leveraging sparse gradients as a form of regularization to enhance robustness against adversarial perturbations. In this work, we present a surprising finding: under specific architectural configurations, SNNs exhibit natural gradient sparsity and can achieve state-of-the-art adversarial defense performance without the need for any explicit regularization. Further analysis reveals a trade-off between robustness and generalization: while sparse gradients contribute to improved adversarial resilience, they can impair the model’s ability to generalize; conversely, denser gradients support better generalization but increase vulnerability to attacks.
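
One simple way to quantify "gradient sparsity" is the fraction of input-gradient entries that are (near) zero. The sketch below uses that definition as an assumption; the paper may measure sparsity differently, and the gradient values are invented for illustration.

```python
def gradient_sparsity(grad, tol=0.0):
    """Fraction of gradient entries whose magnitude is at most tol."""
    flat = list(grad)
    if not flat:
        raise ValueError("gradient must be non-empty")
    zeros = sum(1 for g in flat if abs(g) <= tol)
    return zeros / len(flat)


# Hypothetical flattened input gradient of a loss w.r.t. four pixels.
grad = [0.0, 0.5, 0.0, -0.2]
print(gradient_sparsity(grad))           # exact zeros only
print(gradient_sparsity(grad, tol=0.3))  # also counts small entries
```

A gradient-based attack has fewer useful directions to perturb when this fraction is high, which is the intuition behind the robustness side of the trade-off described in the abstract.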

cs.RO [Back]

[421] ReSeFlow: Rectifying SE(3)-Equivariant Policy Learning Flows cs.RO | cs.CVPDF

Zhitao Wang, Yanke Wang, Jiangtao Wen, Roberto Horowitz, Yuxing Han

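TL;DR: The paper proposes ReSeFlow, which combines SE(3)-equivariant diffusion models with rectified flow to generate trajectory-level policies for robotic manipulation efficiently, substantially reducing inference time while improving performance.

Details

Motivation: Robotic manipulation in unstructured environments requires robust, long-horizon policies. SE(3)-equivariant diffusion models are data-efficient but costly at inference time. Inspired by the inference efficiency of rectified flows, the authors propose ReSeFlow to address this.

Result: With a single inference step, ReSeFlow outperforms baselines that use 100 inference steps, reducing error by up to 48.5% on the painting task and 21.9% on the rotating triangle task.

Insight: Combining SE(3) equivariance with rectified flow yields both data efficiency and inference efficiency, which supports real-world use of generative policy learning models.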

Abstract: Robotic manipulation in unstructured environments requires generating robust, long-horizon, trajectory-level policies conditioned on perceptual observations, and it benefits from the data efficiency of SE(3)-equivariant diffusion models. However, these models incur high inference-time costs. Inspired by the inference efficiency of rectified flows, we introduce rectification to SE(3)-diffusion models and propose ReSeFlow, i.e., Rectifying SE(3)-Equivariant Policy Learning Flows, which provides fast, geodesic-consistent, computationally light policy generation. Crucially, both components employ SE(3)-equivariant networks to preserve rotational and translational symmetry, enabling robust generalization under rigid-body motions. In verification on simulated benchmarks, we find that ReSeFlow with only one inference step achieves better performance, with lower geodesic distance, than the baseline methods: up to a 48.5% error reduction on the painting task and a 21.9% reduction on the rotating triangle task compared to the baseline’s 100-step inference. The method takes advantage of both SE(3) equivariance and rectified flow, combining data efficiency with inference efficiency for real-world application of generative policy learning models.
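The one-step claim above rests on a general property of rectified flows: when the learned transport paths are straight, a single Euler step integrates them exactly. The sketch below illustrates that property only. It works in plain Euclidean space rather than on SE(3), and `straight_velocity` is a hypothetical closed-form stand-in for a trained velocity network, so it should not be read as the authors' method.

```python
import numpy as np

def straight_velocity(x, t, goal):
    """Toy velocity field whose flow lines are straight segments ending at `goal`.

    Assumption: stands in for a learned, rectified velocity network.
    """
    return (goal - x) / (1.0 - t)

def euler_sample(x0, goal, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with explicit Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    goal = np.asarray(goal, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        x = x + dt * straight_velocity(x, t, goal)
    return x

start = [0.0, 0.0]
goal = [1.0, 2.0]
one_step = euler_sample(start, goal, num_steps=1)
many_steps = euler_sample(start, goal, num_steps=100)
```

Because the path is straight, `one_step` and `many_steps` land on the same endpoint; a curved (unrectified) field would generally need many steps to get close.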

[422] Robot Learning from Any Images cs.RO | cs.CV | cs.LG

Siheng Zhao, Jiageng Mao, Wei Chow, Zeyu Shangguan, Tianheng Shi

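TL;DR: The RoLA framework turns any real-world image into a physics-enabled, interactive robotic environment without additional hardware or digital assets, enabling efficient and diverse robotic data generation and use.

Details

Motivation: Existing robot learning methods usually depend on dedicated hardware or pre-built digital environments, which limits data diversity and scalability. RoLA aims to generate interactive robotic environments directly from a single image, lowering the barrier to data generation.

Result: Experiments show that RoLA quickly generates large amounts of diverse robotic data and performs well across several applications, including task learning for manipulators and humanoids.

Insight: By working from a single image, RoLA removes the usual dependencies of robotic data generation and scales data efficiently and at low cost, opening a new route for robot learning.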

Abstract: We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image sources, including camera captures, robotic datasets, and Internet images. At its core, our approach combines a novel method for single-view physical scene recovery with an efficient visual blending strategy for photorealistic data collection. We demonstrate RoLA’s versatility across applications like scalable robotic data generation and augmentation, robot learning from Internet images, and single-image real-to-sim-to-real systems for manipulators and humanoids. Video results are available at https://sihengz02.github.io/RoLA .

[423] Leave No Observation Behind: Real-time Correction for VLA Action Chunks cs.RO | cs.AI | cs.CV | cs.SY | eess.SY

Kohei Sendai, Maxime Alvarez, Tatsuya Matsushima, Yutaka Matsuo, Yusuke Iwasawa

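TL;DR: The paper proposes A2C2, a lightweight real-time action chunk correction module that improves the reactivity and coherence of VLA models under high latency and long horizons, without retraining the base policy.

Details

Motivation: VLA models predict action chunks to improve efficiency and temporal coherence, but chunking harms reactivity under inference delay and long horizons.

Result: On the dynamic Kinetix task suite and LIBERO Spatial, A2C2 improves success rates by 23 and 7 percentage points over RTC, respectively, and improves robustness on long-horizon tasks.

Insight: A2C2 is a plug-in mechanism for deploying high-capacity chunking policies in real-time control.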

Abstract: To improve efficiency and temporal coherence, Vision-Language-Action (VLA) models often predict action chunks; however, this action chunking harms reactivity under inference delay and long horizons. We introduce Asynchronous Action Chunk Correction (A2C2), which is a lightweight real-time chunk correction head that runs every control step and adds a time-aware correction to any off-the-shelf VLA’s action chunk. The module combines the latest observation, the predicted action from VLA (base action), a positional feature that encodes the index of the base action within the chunk, and some features from the base policy, then outputs a per-step correction. This preserves the base model’s competence while restoring closed-loop responsiveness. The approach requires no retraining of the base policy and is orthogonal to asynchronous execution schemes such as Real Time Chunking (RTC). On the dynamic Kinetix task suite (12 tasks) and LIBERO Spatial, our method yields consistent success rate improvements across increasing delays and execution horizons (+23% point and +7% point respectively, compared to RTC), and also improves robustness for long horizons even with zero injected delay. Since the correction head is small and fast, there is minimal overhead compared to the inference of large VLA models. These results indicate that A2C2 is an effective, plug-in mechanism for deploying high-capacity chunking policies in real-time control.
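The data flow of the correction head described above can be sketched as a small residual network. This is a minimal illustration under stated assumptions, not the authors' architecture: the class name `CorrectionHead`, the two-layer MLP, the sine/cosine index encoding, and all dimensions are hypothetical, and the weights are untrained.

```python
import numpy as np

def chunk_position_feature(index, chunk_len):
    """Hypothetical encoding of the action's index within its chunk."""
    phase = 2.0 * np.pi * index / chunk_len
    return np.array([np.sin(phase), np.cos(phase)])

class CorrectionHead:
    """Tiny MLP that maps (observation, base action, index, base features) to a residual."""

    def __init__(self, obs_dim, act_dim, feat_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = obs_dim + act_dim + 2 + feat_dim
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        # Zero-initialised output layer: before training, the head leaves
        # the base action unchanged (an assumption made for this sketch).
        self.w2 = np.zeros((hidden, act_dim))
        self.b2 = np.zeros(act_dim)

    def correct(self, obs, base_action, index, chunk_len, base_feat):
        x = np.concatenate(
            [obs, base_action, chunk_position_feature(index, chunk_len), base_feat]
        )
        hidden = np.tanh(x @ self.w1 + self.b1)
        delta = hidden @ self.w2 + self.b2  # per-step correction
        return base_action + delta

head = CorrectionHead(obs_dim=4, act_dim=2, feat_dim=3)
base_action = np.array([0.5, -0.5])
corrected = head.correct(np.ones(4), base_action, index=1, chunk_len=8, base_feat=np.zeros(3))
```

The residual form reflects the abstract's point that the base model's competence is preserved: the head only adds a correction on top of the action the VLA already predicted, and it can be called at every control step with the newest observation.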

Seungchan Kim, Omar Alama, Dmytro Kurdydyk, John Keller, Nikhil Keetha

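TL;DR: RAVEN is a 3D memory-based behavior tree framework for aerial semantic navigation in unstructured outdoor environments; persistent memory and adaptive behaviors improve navigation performance.

Details

Motivation: Outdoor semantic navigation must handle long-range search in unstructured environments. Existing methods either rely on reactive policies that behave short-sightedly or precompute scene graphs offline, which limits adaptability.

Result: RAVEN outperforms baselines by 85.25% in simulation and is validated on a real aerial robot.

Insight: Combining persistent memory with multimodal cues is key to more adaptive and efficient outdoor semantic navigation.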

Abstract: Aerial outdoor semantic navigation requires robots to explore large, unstructured environments to locate target objects. Recent advances in semantic navigation have demonstrated open-set object-goal navigation in indoor settings, but these methods remain limited by constrained spatial ranges and structured layouts, making them unsuitable for long-range outdoor search. While outdoor semantic navigation approaches exist, they either rely on reactive policies based on current observations, which tend to produce short-sighted behaviors, or precompute scene graphs offline for navigation, limiting adaptability to online deployment. We present RAVEN, a 3D memory-based, behavior tree framework for aerial semantic navigation in unstructured outdoor environments. It (1) uses a spatially consistent semantic voxel-ray map as persistent memory, enabling long-horizon planning and avoiding purely reactive behaviors, (2) combines short-range voxel search and long-range ray search to scale to large environments, (3) leverages a large vision-language model to suggest auxiliary cues, mitigating sparsity of outdoor targets. These components are coordinated by a behavior tree, which adaptively switches behaviors for robust operation. We evaluate RAVEN in 10 photorealistic outdoor simulation environments over 100 semantic tasks, encompassing single-object search, multi-class, multi-instance navigation and sequential task changes. Results show RAVEN outperforms baselines by 85.25% in simulation and demonstrate its real-world applicability through deployment on an aerial robot in outdoor field tests.

[425] Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models cs.RO | cs.AI | cs.CV | cs.LG

Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, Pietro Mazzaglia

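TL;DR: The paper proposes Oat-VLA, an object-agent-centric tokenization scheme for visual inputs that substantially reduces the training compute cost of Vision-Language-Action (VLA) models while maintaining performance.

Details

Motivation: Adapting VLA models to robotics is currently too computationally expensive, which the authors attribute to how visual inputs are tokenized. They aim to improve efficiency with a better tokenization method.

Result: Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms OpenVLA on real-world pick-and-place tasks.

Insight: Object- and agent-centric visual tokenization can substantially improve model efficiency in robotics without sacrificing performance.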

Abstract: Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language Models (VLMs) to output robotic actions. However, adapting VLMs for robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on the insights of object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent’s own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a few tokens without sacrificing performance. We show that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms OpenVLA in diverse real-world pick-and-place tasks.

[426] Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress cs.RO | cs.CV

Priyanka Mandikal, Jiaheng Hu, Shivin Dass, Sagnik Majumder, Roberto Martín-Martín

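TL;DR: SPARTA is a unified framework for manipulation tasks that change object state, such as mashing, spreading, or slicing. It uses spatially progressing visual segmentation together with reinforcement learning to perceive and control those state changes efficiently.

Details

Motivation: Most robot manipulation work focuses on changing an object's kinematic state (picking, placing, rotating) and neglects tasks in which the physical and visual state changes progressively (mashing, spreading). SPARTA aims to fill this gap.

Result: On 3 tasks across 10 real-world objects, SPARTA significantly outperforms sparse-reward and visual goal-conditioned baselines in both training time and accuracy.

Insight: Spatially progressing visual representations are a versatile foundation for the broad family of object state manipulation tasks.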

Abstract: Most robot manipulation focuses on changing the kinematic state of objects: picking, placing, opening, or rotating them. However, a wide range of real-world manipulation tasks involve a different class of object state change–such as mashing, spreading, or slicing–where the object’s physical and visual state evolve progressively without necessarily changing its position. We present SPARTA, the first unified framework for the family of object state change manipulation tasks. Our key insight is that these tasks share a common structural pattern: they involve spatially-progressing, object-centric changes that can be represented as regions transitioning from an actionable to a transformed state. Building on this insight, SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions for specific object state change tasks, to generate a) structured policy observations that strip away appearance variability, and b) dense rewards that capture incremental progress over time. These are leveraged in two SPARTA policy variants: reinforcement learning for fine-grained control without demonstrations or simulation; and greedy control for fast, lightweight deployment. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects, achieving significant improvements in training time and accuracy over sparse rewards and visual goal-conditioned baselines. Our results highlight progress-aware visual representations as a versatile foundation for the broader family of object state manipulation tasks. Project website: https://vision.cs.utexas.edu/projects/sparta-robot
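One way to read "dense rewards that capture incremental progress" is as the change in the fraction of the object that has moved from the actionable to the transformed state. The sketch below assumes that reading; the label values, the function names, and the exact reward formula are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical label convention for a per-pixel segmentation map.
BACKGROUND, ACTIONABLE, TRANSFORMED = 0, 1, 2

def progress(seg_map):
    """Fraction of object pixels that are already in the transformed state."""
    seg_map = np.asarray(seg_map)
    transformed = np.count_nonzero(seg_map == TRANSFORMED)
    actionable = np.count_nonzero(seg_map == ACTIONABLE)
    object_pixels = transformed + actionable
    return transformed / object_pixels if object_pixels else 0.0

def dense_reward(prev_map, curr_map):
    """Reward for one step: how much the transformed fraction increased."""
    return progress(curr_map) - progress(prev_map)

# Toy example: two of four object pixels get transformed in one step.
before = np.array([[0, 1, 1],
                   [0, 1, 1]])
after = np.array([[0, 2, 1],
                  [0, 2, 1]])
step_reward = dense_reward(before, after)
```

A sparse reward would only fire once `progress` reaches 1.0; the difference form gives the policy a signal at every step in which some region changes state.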

[427] PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization cs.RO | cs.CV

Siyan Dong, Zijun Wang, Lulu Cai, Yi Ma, Yanchao Yang

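TL;DR: The paper combines learning and optimization: a camera pose regression network provides an initial pose estimate, and an optimization algorithm then refines the depth image alignment, giving accurate dense reconstruction under unstable camera motion.

Details

Motivation: Existing RGB-D SLAM systems perform poorly under unstable camera motion such as large viewpoint changes or fast shaking. Classical optimization methods need good initialization and struggle with large motions, while learning-based methods are more robust but not accurate enough. The authors therefore combine the two.

Result: The method outperforms the best competitor on challenging benchmarks, maintains comparable accuracy on stable motion sequences, and runs in real time.

Insight: Combining simple, principled techniques (learned initialization plus optimization-based refinement) can deliver both robustness to unstable motion and accuracy in dense reconstruction.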

Abstract: Real-time dense scene reconstruction during unstable camera motions is crucial for robotics, yet current RGB-D SLAM systems fail when cameras experience large viewpoint changes, fast motions, or sudden shaking. Classical optimization-based methods deliver high accuracy but fail with poor initialization during large motions, while learning-based approaches provide robustness but lack sufficient accuracy for dense reconstruction. We address this challenge through a combination of learning-based initialization with optimization-based refinement. Our method employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry. Extensive experiments demonstrate promising results: our approach outperforms the best competitor on challenging benchmarks, while maintaining comparable accuracy on stable motion sequences. The system operates in real-time, showcasing that combining simple and principled techniques can achieve both robustness for unstable motions and accuracy for dense reconstruction. Project page: https://github.com/siyandong/PROFusion.

[428] DRCP: Diffusion on Reinforced Cooperative Perception for Perceiving Beyond Limits cs.RO | cs.CV | eess.IV

Lantao Li, Kang Yang, Rui Song, Chen Sun

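TL;DR: DRCP is a real-time deployable framework that uses cross-modal cooperative perception and a lightweight diffusion-based refinement module to improve autonomous driving perception under challenging conditions.

Details

Motivation: Despite advances in multi-agent fusion and perception backbones, real-world deployments still suffer from partial detections and noise accumulation, which limit detection accuracy.

Result: The system achieves real-time performance on mobile platforms and significantly improves robustness under challenging conditions.

Insight: Diffusion models are useful beyond generation; they can also improve the robustness and accuracy of perception tasks.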

Abstract: Cooperative perception enabled by Vehicle-to-Everything communication has shown great promise in enhancing situational awareness for autonomous vehicles and other mobile robotic platforms. Despite recent advances in perception backbones and multi-agent fusion, real-world deployments remain challenged by hard detection cases, exemplified by partial detections and noise accumulation which limit downstream detection accuracy. This work presents Diffusion on Reinforced Cooperative Perception (DRCP), a real-time deployable framework designed to address aforementioned issues in dynamic driving environments. DRCP integrates two key components: (1) Precise-Pyramid-Cross-Modality-Cross-Agent, a cross-modal cooperative perception module that leverages camera-intrinsic-aware angular partitioning for attention-based fusion and adaptive convolution to better exploit external features; and (2) Mask-Diffusion-Mask-Aggregation, a novel lightweight diffusion-based refinement module that encourages robustness against feature perturbations and aligns bird’s-eye-view features closer to the task-optimal manifold. The proposed system achieves real-time performance on mobile platforms while significantly improving robustness under challenging conditions. Code will be released in late 2025.

[429] AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation cs.RO | cs.AI | cs.CV

Ryosuke Takanami, Petr Khrapchenkov, Shu Morikuni, Jumpei Arima, Yuta Takaba

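TL;DR: The AIRoA MoMa Dataset is a large-scale multimodal dataset for mobile manipulation, with synchronized multimodal data and hierarchical annotations, intended to advance Vision-Language-Action models.

Details

Motivation: As robots move from controlled settings to unstructured human environments, generalist agents must follow natural language instructions reliably. Existing datasets lack synchronized force-torque sensing, hierarchical annotations, and explicit failure cases, which limits progress.

Result: The first version of the dataset is publicly available and provides a benchmark for Vision-Language-Action models.

Insight: Synchronized multimodal data and a hierarchical annotation design can significantly improve model robustness and generalization.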

Abstract: As robots transition from controlled settings to unstructured human environments, building generalist agents that can reliably follow natural language instructions remains a central challenge. Progress in robust mobile manipulation requires large-scale multimodal datasets that capture contact-rich and long-horizon tasks, yet existing resources lack synchronized force-torque sensing, hierarchical annotations, and explicit failure cases. We address this gap with the AIRoA MoMa Dataset, a large-scale real-world multimodal dataset for mobile manipulation. It includes synchronized RGB images, joint states, six-axis wrist force-torque signals, and internal robot states, together with a novel two-layer annotation schema of sub-goals and primitive actions for hierarchical learning and error analysis. The initial dataset comprises 25,469 episodes (approx. 94 hours) collected with the Human Support Robot (HSR) and is fully standardized in the LeRobot v2.1 format. By uniquely integrating mobile manipulation, contact-rich interaction, and long-horizon structure, AIRoA MoMa provides a critical benchmark for advancing the next generation of Vision-Language-Action models. The first version of our dataset is now available at https://huggingface.co/datasets/airoa-org/airoa-moma .

cs.SE [Back]

[430] Metamorphic Testing for Audio Content Moderation Software cs.SE | cs.AI | cs.CL | cs.MM

Wenxuan Wang, Yongjiang Wu, Junyuan Zhang, Shuqing Li, Yun Peng

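TL;DR: The paper proposes MTAM, a metamorphic testing framework that evaluates how well audio content moderation software detects adversarial audio; it uncovers substantial errors in both commercial software and an academic model.

Details

Motivation: As audio platforms spread, harmful audio content is an increasingly serious problem, and existing moderation tools are easily bypassed by adversarial modifications, so a more effective testing method is needed.

Result: Testing five commercial products and one academic model, MTAM achieves error finding rates (EFR) of up to 51.1% on the commercial software and 45.7% on the academic model.

Insight: Adversarial perturbations such as pitch modification or noise insertion can bypass existing moderation tools, which indicates clear gaps in current audio moderation technology.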

Abstract: The rapid growth of audio-centric platforms and applications such as WhatsApp and Twitter has transformed the way people communicate and share audio content in modern society. However, these platforms are increasingly misused to disseminate harmful audio content, such as hate speech, deceptive advertisements, and explicit material, which can have significant negative consequences (e.g., detrimental effects on mental health). In response, researchers and practitioners have been actively developing and deploying audio content moderation tools to tackle this issue. Despite these efforts, malicious actors can bypass moderation systems by making subtle alterations to audio content, such as modifying pitch or inserting noise. Moreover, the effectiveness of modern audio moderation tools against such adversarial inputs remains insufficiently studied. To address these challenges, we propose MTAM, a Metamorphic Testing framework for Audio content Moderation software. Specifically, we conduct a pilot study on 2000 audio clips and define 14 metamorphic relations across two perturbation categories: Audio Features-Based and Heuristic perturbations. MTAM applies these metamorphic relations to toxic audio content to generate test cases that remain harmful while being more likely to evade detection. In our evaluation, we employ MTAM to test five commercial textual content moderation software and an academic model against three kinds of toxic content. The results show that MTAM achieves up to 38.6%, 18.3%, 35.1%, 16.7%, and 51.1% error finding rates (EFR) when testing commercial moderation software provided by Gladia, Assembly AI, Baidu, Nextdata, and Tencent, respectively, and it obtains up to 45.7% EFR when testing the state-of-the-art algorithms from the academy.
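The metamorphic testing loop described above can be sketched in a few lines: apply each metamorphic relation to a clip the moderator already flags, and count how often the perturbed clip slips through. Everything here is a toy under stated assumptions. The abstract does not define EFR precisely, so the formula below (missed perturbed cases divided by generated cases) is an assumed reading, and the moderator, the clips, and both relations are hypothetical stand-ins rather than anything from MTAM.

```python
def error_finding_rate(moderator, toxic_clips, relations):
    """Share of perturbed test cases that the moderator fails to flag.

    Assumption: only clips flagged in their original form are used as seeds.
    """
    total = 0
    missed = 0
    for clip in toxic_clips:
        if not moderator(clip):
            continue  # the original is already missed, so it is not a valid seed
        for relation in relations:
            total += 1
            if not moderator(relation(clip)):
                missed += 1
    return missed / total if total else 0.0

# Toy stand-ins: a "clip" is a list of samples, and the moderator
# flags any clip whose peak amplitude reaches 0.8.
def toy_moderator(clip):
    return max(abs(s) for s in clip) >= 0.8

def halve_volume(clip):
    return [0.5 * s for s in clip]

def reverse(clip):
    return list(reversed(clip))

clips = [[0.9, 0.1], [1.0, -1.0], [0.2]]
efr = error_finding_rate(toy_moderator, clips, [halve_volume, reverse])
```

In this toy run the third clip is skipped, `halve_volume` evades the moderator on both remaining clips, and `reverse` evades it on neither, so two of four generated cases are missed.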

q-bio.NC [Back]

[431] Targeted perturbations reveal brain-like local coding axes in robustified, but not standard, ANN-based brain models q-bio.NC | cs.CV | cs.LG

Nikolas McNeal, N. Apurva Ratan Murty

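TL;DR: Small-scale adversarial perturbations can be used to assess the local representational geometry of ANN models; they expose the fragility of standard models, whereas the coding dimensions of robustified models are closer to those of the human brain.

Details

Motivation: Current ANN models perform similarly at predicting neural responses, so a stronger criterion is needed to evaluate how well they model the visual system.

Result: Standard models are highly sensitive to perturbations and have unstable coding dimensions; the coding dimensions of robustified models generalize better and are semantically meaningful.

Insight: Adversarial probing is a powerful tool for evaluating neural encoding models, and robustified models are closer to the brain's actual coding mechanisms.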

Abstract: Artificial neural networks (ANNs) have become the de facto standard for modeling the human visual system, primarily due to their success in predicting neural responses. However, with many models now achieving similar predictive accuracy, we need a stronger criterion. Here, we use small-scale adversarial probes to characterize the local representational geometry of many highly predictive ANN-based brain models. We report four key findings. First, we show that most contemporary ANN-based brain models are unexpectedly fragile. Despite high prediction scores, their response predictions are highly sensitive to small, imperceptible perturbations, revealing unreliable local coding directions. Second, we demonstrate that a model’s sensitivity to adversarial probes can better discriminate between candidate neural encoding models than prediction accuracy alone. Third, we find that standard models rely on distinct local coding directions that do not transfer across model architectures. Finally, we show that adversarial probes from robustified models produce generalizable and semantically meaningful changes, suggesting that they capture the local coding dimensions of the visual system. Together, our work shows that local representational geometry provides a stronger criterion for brain model evaluation. We also provide empirical grounds for favoring robust models, whose more stable coding axes not only align better with neural selectivity but also generate concrete, testable predictions for future experiments.

cs.CR [Back]

[432] MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction cs.CR | cs.CL

Sepideh Abedini, Shubhankar Mohapatra, D. B. Emerson, Masoumeh Shafieinejad, Jesse C. Cresswell

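TL;DR: MaskSQL is a text-to-SQL framework that protects privacy through abstraction, using LLMs without sharing sensitive data and balancing privacy against performance.

Details

Motivation: LLMs perform well on tasks such as text-to-SQL, but their use in privacy-sensitive domains is limited because data must be shared. Small language models (SLMs) can be deployed locally to protect privacy but underperform on complex tasks.

Result: MaskSQL outperforms SLM-based models and approaches the performance of LLM-based baselines while preserving privacy.

Insight: Abstraction is an effective privacy protection method for tasks that must balance privacy and performance.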

Abstract: Large language models (LLMs) have shown promising performance on tasks that require reasoning, such as text-to-SQL, code generation, and debugging. However, regulatory frameworks with strict privacy requirements constrain their integration into sensitive systems. State-of-the-art LLMs are also proprietary, costly, and resource-intensive, making local deployment impractical. Consequently, utilizing such LLMs often requires sharing data with third-party providers, raising privacy concerns and risking noncompliance with regulations. Although fine-tuned small language models (SLMs) can outperform LLMs on certain tasks and be deployed locally to mitigate privacy concerns, they underperform on more complex tasks such as text-to-SQL translation. In this work, we introduce MaskSQL, a text-to-SQL framework that utilizes abstraction as a privacy protection mechanism to mask sensitive information in LLM prompts. Unlike redaction, which removes content entirely, or generalization, which broadens tokens, abstraction retains essential information while discarding unnecessary details, striking an effective privacy-utility balance for the text-to-SQL task. Moreover, by providing mechanisms to control the privacy-utility tradeoff, MaskSQL facilitates adoption across a broader range of use cases. Our experimental results show that MaskSQL outperforms leading SLM-based text-to-SQL models and achieves performance approaching state-of-the-art LLM-based models, while preserving privacy.
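The idea of abstraction, as opposed to redaction, can be illustrated by replacing each sensitive literal with a typed placeholder before the prompt leaves the machine, then restoring the literals in the SQL that comes back. This is a simplified sketch under stated assumptions: the function names, the placeholder format, and the category labels are hypothetical, the LLM reply is a hard-coded stand-in, and MaskSQL's actual masking policy is not described in enough detail in the abstract to reproduce.

```python
def mask_prompt(question, sensitive):
    """Replace sensitive literals with typed placeholders.

    `sensitive` maps each literal to a coarse category name (hypothetical),
    so the remote model still sees what kind of value was there.
    """
    mapping = {}
    counters = {}
    masked = question
    for literal, category in sensitive.items():
        counters[category] = counters.get(category, 0) + 1
        placeholder = f"[{category}_{counters[category]}]"
        mapping[placeholder] = literal
        masked = masked.replace(literal, placeholder)
    return masked, mapping

def unmask_sql(sql, mapping):
    """Restore the original literals in the SQL returned by the remote model."""
    for placeholder, literal in mapping.items():
        sql = sql.replace(placeholder, literal)
    return sql

question = "List orders placed by Alice Smith in Toronto"
masked, mapping = mask_prompt(question, {"Alice Smith": "PERSON", "Toronto": "CITY"})

# Stand-in for the LLM's reply to the masked prompt (not a real API call).
llm_sql = "SELECT * FROM orders WHERE customer = '[PERSON_1]' AND city = '[CITY_1]'"
final_sql = unmask_sql(llm_sql, mapping)
```

The placeholder keeps the category (a person, a city) while hiding the value, which is the distinction the abstract draws between abstraction and outright redaction. A production version would also need to handle quoting and escaping when literals are restored.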

[433] Responsible Diffusion: A Comprehensive Survey on Safety, Ethics, and Trust in Diffusion Models cs.CR | cs.CV

Kang Wei, Xin Yuan, Fushuo Huo, Chuan Ma, Long Yuan

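TL;DR: The paper comprehensively surveys potential threats to, and countermeasures for, the safety, ethics, and trustworthiness of diffusion models, aiming to advance both the technical capability and the application maturity of generative AI.

Details

Motivation: Diffusion models show strong data generation ability across many domains, but their latent safety, ethics, and trust issues have not been fully examined and call for systematic study.

Result: The survey finds that the safety and ethics issues of diffusion models are complex and varied and require multidisciplinary collaboration. It also summarizes the limitations of current research and the open problems.

Insight: Wide deployment of diffusion models should be accompanied by strict oversight and technical safeguards; future research should focus on interdisciplinary collaboration and new techniques to ensure responsible use.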

Abstract: Diffusion models (DMs) have been investigated in various domains due to their ability to generate high-quality data, thereby attracting significant attention. However, similar to traditional deep learning systems, there also exist potential threats to DMs. To provide advanced and comprehensive insights into safety, ethics, and trust in DMs, this survey comprehensively elucidates its framework, threats, and countermeasures. Each threat and its countermeasures are systematically examined and categorized to facilitate thorough analysis. Furthermore, we introduce specific examples of how DMs are used, what dangers they might bring, and ways to protect against these dangers. Finally, we discuss key lessons learned, highlight open challenges related to DM security, and outline prospective research directions in this critical field. This work aims to accelerate progress not only in the technical capabilities of generative artificial intelligence but also in the maturity and wisdom of its application.

[434] StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data cs.CR | cs.CV

Yixu Wang, Yan Teng, Yingchun Wang, Xingjun Ma

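TL;DR: The paper proposes StolenLoRA, a new LoRA extraction attack that uses synthetic data and a semi-supervised learning strategy; experiments show an attack success rate of up to 96.60%, exposing a security vulnerability in LoRA-adapted models.

Details

Motivation: Efficient fine-tuning methods such as LoRA enable rapid deployment of customized models, but their compactness introduces new security risks, in particular model extraction attacks.

Result: StolenLoRA reaches an attack success rate of up to 96.60% with only 10k queries and remains effective in cross-backbone scenarios where the attacker and victim use different pre-trained backbones.

Insight: LoRA-adapted models are particularly vulnerable to extraction attacks, so defenses tailored to PEFT methods are urgently needed; diversified LoRA deployment is one potential defense.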

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have transformed vision model adaptation, enabling the rapid deployment of customized models. However, the compactness of LoRA adaptations introduces new safety concerns, particularly their vulnerability to model extraction attacks. This paper introduces a new focus of model extraction attacks named LoRA extraction that extracts LoRA-adaptive models based on a public pre-trained model. We then propose a novel extraction method called StolenLoRA which trains a substitute model to extract the functionality of a LoRA-adapted model using synthetic data. StolenLoRA leverages a Large Language Model to craft effective prompts for data generation, and it incorporates a Disagreement-based Semi-supervised Learning (DSL) strategy to maximize information gain from limited queries. Our experiments demonstrate the effectiveness of StolenLoRA, achieving up to a 96.60% attack success rate with only 10k queries, even in cross-backbone scenarios where the attacker and victim models utilize different pre-trained backbones. These findings reveal the specific vulnerability of LoRA-adapted models to this type of extraction and underscore the urgent need for robust defense mechanisms tailored to PEFT methods. We also explore a preliminary defense strategy based on diversified LoRA deployments, highlighting its potential to mitigate such attacks.

eess.IV [Back]

Md. Saiful Bari Siddiqui, Mohammed Imamul Hassan Bhuiyan

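TL;DR: S$^3$F-Net is a medical image classification method for multiple imaging modalities that fuses spatial and frequency-domain features, significantly outperforming a baseline that uses spatial features only.

Details

Motivation: Conventional CNNs focus on spatial features and overlook frequency-domain information, which may be essential for capturing global patterns in medical image analysis.

Result: The method performs well on several medical imaging datasets, improving accuracy by up to 5.13% over the spatial-only baseline and reaching a SOTA-competitive 98.76% on the BRISC2025 dataset.

Insight: Dual-domain feature fusion dynamically adjusts its reliance on the spatial and spectral branches according to the input pathology, which demonstrates the value of frequency-domain information in medical image analysis.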

Abstract: Convolutional Neural Networks have become a cornerstone of medical image analysis due to their proficiency in learning hierarchical spatial features. However, this focus on a single domain is inefficient at capturing global, holistic patterns and fails to explicitly model an image’s frequency-domain characteristics. To address these challenges, we propose the Spatial-Spectral Summarizer Fusion Network (S$^3$F-Net), a dual-branch framework that learns from both spatial and spectral representations simultaneously. The S$^3$F-Net performs a fusion of a deep spatial CNN with our proposed shallow spectral encoder, SpectraNet. SpectraNet features the proposed SpectralFilter layer, which leverages the Convolution Theorem by applying a bank of learnable filters directly to an image’s full Fourier spectrum via a computation-efficient element-wise multiplication. This allows the SpectralFilter layer to attain a global receptive field instantaneously, with its output being distilled by a lightweight summarizer network. We evaluate S$^3$F-Net across four medical imaging datasets spanning different modalities to validate its efficacy and generalizability. Our framework consistently and significantly outperforms its strong spatial-only baseline in all cases, with accuracy improvements of up to 5.13%. With a powerful Bilinear Fusion, S$^3$F-Net achieves a SOTA competitive accuracy of 98.76% on the BRISC2025 dataset. Concatenation Fusion performs better on the texture-dominant Chest X-Ray Pneumonia dataset, achieving 93.11% accuracy, surpassing many top-performing, much deeper models. Our explainability analysis also reveals that the S$^3$F-Net learns to dynamically adjust its reliance on each branch based on the input pathology. These results verify that our dual-domain approach is a powerful and generalizable paradigm for medical image analysis.
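The Convolution Theorem that the SpectralFilter layer relies on can be checked numerically: multiplying an image's 2D Fourier spectrum element-wise by a filter's spectrum gives the same result as circular convolution in the spatial domain. The sketch below is a NumPy illustration of that identity only. The function names are hypothetical, the filter is random rather than learned, and the filter spectrum is derived from a small spatial kernel purely so the two paths can be compared; the paper's layer is described as learning its filters directly in the spectrum.

```python
import numpy as np

def spectral_filter(image, filter_spectrum):
    """Filter an image by element-wise multiplication of its full 2D spectrum."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * filter_spectrum))

def circular_conv2d(image, kernel):
    """Reference implementation: circular convolution in the spatial domain."""
    out = np.zeros(image.shape, dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            # np.roll shifts the image so that out[n] accumulates kernel[m] * image[n - m].
            out += kernel[dy, dx] * np.roll(image, shift=(dy, dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

# Zero-pad the kernel to the image size before taking its spectrum.
padded = np.zeros_like(image)
padded[:3, :3] = kernel

via_spectrum = spectral_filter(image, np.fft.fft2(padded))
via_spatial = circular_conv2d(image, kernel)
```

Each output pixel of `via_spectrum` depends on every input pixel through the transform, which is the sense in which a spectral filter has a global receptive field in a single layer.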

[436] ReCon-GS: Continuum-Preserved Guassian Streaming for Fast and Compact Reconstruction of Dynamic Scenes eess.IV | cs.CV | cs.MM

Jiaye Fu, Qiankun Gao, Chengxiang Wen, Yanmin Wu, Siwei Ma

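TL;DR: ReCon-GS is a new dynamic scene reconstruction framework that dynamically allocates multi-level Anchor Gaussians and applies a hierarchy reconfiguration strategy, substantially improving reconstruction efficiency and storage efficiency.

Details

Motivation: Online free-viewpoint video (FVV) reconstruction suffers from slow optimization, inconsistent motion estimation, and high storage demands; ReCon-GS aims to address these challenges.

Result: Experiments show that ReCon-GS improves training efficiency by about 15% and reduces memory requirements by over 50%, with reconstruction quality better than existing methods.

Insight: Hierarchical deformation representations and a storage-aware optimization mechanism are key to efficient dynamic scene reconstruction.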

Abstract: Online free-viewpoint video (FVV) reconstruction is challenged by slow per-frame optimization, inconsistent motion estimation, and unsustainable storage demands. To address these challenges, we propose the Reconfigurable Continuum Gaussian Stream, dubbed ReCon-GS, a novel storage-aware framework that enables high fidelity online dynamic scene reconstruction and real-time rendering. Specifically, we dynamically allocate multi-level Anchor Gaussians in a density-adaptive fashion to capture inter-frame geometric deformations, thereby decomposing scene motion into compact coarse-to-fine representations. Then, we design a dynamic hierarchy reconfiguration strategy that preserves localized motion expressiveness through on-demand anchor re-hierarchization, while ensuring temporal consistency through intra-hierarchical deformation inheritance that confines transformation priors to their respective hierarchy levels. Furthermore, we introduce a storage-aware optimization mechanism that flexibly adjusts the density of Anchor Gaussians at different hierarchy levels, enabling a controllable trade-off between reconstruction fidelity and memory usage. Extensive experiments on three widely used datasets demonstrate that, compared to state-of-the-art methods, ReCon-GS improves training efficiency by approximately 15% and achieves superior FVV synthesis quality with enhanced robustness and stability. Moreover, at equivalent rendering quality, ReCon-GS slashes memory requirements by over 50% compared to leading state-of-the-art methods.

[437] Wavelet-Assisted Mamba for Satellite-Derived Sea Surface Temperature Super-Resolution eess.IV | cs.CV

Wankun Chen, Feng Gao, Yanhai Gan, Jingchao Cao, Junyu Dong

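TL;DR: The paper proposes WMSR, a wavelet-assisted Mamba framework for super-resolution of satellite-derived sea surface temperature, which combines a Low-Frequency State Space Module (LFSSM) with a High-Frequency Enhancement Module (HFEM) to substantially improve super-resolution performance.

Details

Motivation: Sea surface temperature (SST) is an important indicator of climate change, but high-resolution SST data are hard to obtain because of physical imaging limits. Existing deep learning methods struggle to model long-range dependencies, whereas Mamba, built on state space models (SSMs), offers long-range modeling with linear complexity; its use for SST super-resolution remains largely unexplored.

Result: Experiments on three SST datasets show that WMSR outperforms state-of-the-art methods.

Insight: Combining wavelet analysis with Mamba handles both the global low-frequency information and the local high-frequency texture in SST data, offering a new direction for remote sensing image super-resolution.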

Abstract: Sea surface temperature (SST) is an essential indicator of global climate change and one of the most intuitive factors reflecting ocean conditions. Obtaining high-resolution SST data remains challenging due to limitations in physical imaging, and super-resolution via deep neural networks is a promising solution. Recently, Mamba-based approaches leveraging State Space Models (SSM) have demonstrated significant potential for long-range dependency modeling with linear complexity. However, their application to SST data super-resolution remains largely unexplored. To this end, we propose the Wavelet-assisted Mamba Super-Resolution (WMSR) framework for satellite-derived SST data. The WMSR includes two key components: the Low-Frequency State Space Module (LFSSM) and High-Frequency Enhancement Module (HFEM). The LFSSM uses 2D-SSM to capture global information of the input data, and the robust global modeling capabilities of SSM are exploited to preserve the critical temperature information in the low-frequency component. The HFEM employs the pixel difference convolution to match and correct the high-frequency feature, achieving accurate and clear textures. Through comprehensive experiments on three SST datasets, our WMSR demonstrated superior performance over state-of-the-art methods. Our codes and datasets will be made publicly available at https://github.com/oucailab/WMSR.
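The split into a low-frequency component and high-frequency components that the two modules operate on can be illustrated with a one-level 2D wavelet transform. The sketch below uses the Haar wavelet, which is an assumption (the abstract does not say which wavelet WMSR uses), and the function names are hypothetical. It shows only the decomposition and its exact inverse, not the LFSSM or HFEM themselves.

```python
import numpy as np

def haar_dwt2(field):
    """One-level 2D Haar transform of an array with even height and width.

    Returns the low-frequency band and a tuple of three detail bands.
    """
    x = np.asarray(field, dtype=float)
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    low = (a + b + c + d) / 2.0          # smooth, low-frequency content
    col_diff = (a - b + c - d) / 2.0     # left-vs-right differences
    row_diff = (a + b - c - d) / 2.0     # top-vs-bottom differences
    diag_diff = (a - b - c + d) / 2.0    # diagonal differences
    return low, (col_diff, row_diff, diag_diff)

def haar_idwt2(low, details):
    """Exact inverse of haar_dwt2."""
    col_diff, row_diff, diag_diff = details
    h, w = low.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (low + col_diff + row_diff + diag_diff) / 2.0
    x[0::2, 1::2] = (low - col_diff + row_diff - diag_diff) / 2.0
    x[1::2, 0::2] = (low + col_diff - row_diff - diag_diff) / 2.0
    x[1::2, 1::2] = (low - col_diff - row_diff + diag_diff) / 2.0
    return x

rng = np.random.default_rng(0)
sst_patch = rng.normal(loc=15.0, scale=2.0, size=(4, 6))  # synthetic stand-in data
low, details = haar_dwt2(sst_patch)
restored = haar_idwt2(low, details)
```

Because the transform is invertible, the two branches can process `low` and `details` separately and the result can be recombined without losing information; a perfectly uniform field puts all of its content in `low` and leaves the detail bands at zero.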

[438] A Novel Preprocessing Unit for Effective Deep Learning based Classification and Grading of Diabetic Retinopathy eess.IV | cs.CV

Pranoti Nage, Sanjay Shitole

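TL;DR: The paper proposes a new preprocessing unit that, combined with an improved Mask RCNN and an SSA-VGG-16 classifier, supports early detection and grading of diabetic retinopathy (DR) and diabetic macular edema (DME).

Details

Motivation: Early detection of DR and DME is crucial for preventing vision loss, but existing methods fall short in noise filtering, contrast enhancement, and feature extraction.

Result: The method is validated on two datasets (IDRiD and MESSIDOR). Within the AVDS filter, the Hamming distance gave the best contrast improvement, while the Euclidean distance gave lower error and higher PSNR.

Insight: The choice of adaptive distance method is critical to preprocessing quality, and the spatial attention mechanism helps capture key image regions and improves classification performance.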

Abstract: Early detection of diabetic retinopathy (DR) is crucial as it allows for timely intervention, preventing vision loss and enabling effective management of diabetic complications. This research detects DR and diabetic macular edema (DME) at an early stage through a proposed framework with three stages: preprocessing, segmentation, and combined feature extraction and classification. In the preprocessing stage, noise filtering is performed by fuzzy filtering, artefact removal by non-linear diffusion filtering, and contrast improvement by a novel filter called the Adaptive Variable Distance Speckle (AVDS) filter. The AVDS filter employs four distance calculation methods: Euclidean, Bhattacharyya, Manhattan, and Hamming. The filter adaptively chooses the distance method that produces the highest contrast value among them. In the analysis, the Hamming distance achieved better contrast results, while the Euclidean distance showed a lower error value with high PSNR. Segmentation is performed using Improved Mask-Regional Convolutional Neural Networks (Mask RCNN). In the final stage, feature extraction and classification are performed using a novel Self-Spatial Attention infused VGG-16 (SSA-VGG-16), which captures both global contextual relationships and critical spatial regions within retinal images, thereby improving the accuracy and robustness of DR and DME detection and grading. The effectiveness of the proposed method is assessed using two distinct datasets: IDRiD and MESSIDOR.

cs.LG

[439] CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction cs.LG | cs.CL

Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse

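TL;DR: Proposes CAOTE, a KV cache token eviction method based on attention output error. It combines attention scores with value vectors to minimize the error caused by eviction, which improves downstream task accuracy.

Details

Motivation: Long-context LLMs face memory and compute bottlenecks, and existing token eviction methods based on attention scores ignore each token's contribution to the attention output.

Result: Experiments show that CAOTE, when combined with state-of-the-art attention-score-based methods, consistently improves downstream task accuracy.

Insight: Using value-vector information during token eviction is important, and CAOTE offers a new heuristic for optimizing the KV cache.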

Abstract: While long context support of large language models has extended their abilities, it also incurs challenges in memory and compute, which become crucial bottlenecks on resource-restricted devices. Token eviction, a widely adopted post-training methodology designed to alleviate these bottlenecks by evicting less important tokens from the cache, typically uses attention scores as proxy metrics for token importance. However, one major limitation of the attention score as a token-wise importance metric is that it carries no information about the contribution of tokens to the attention output. In this paper, we propose a simple eviction criterion based on the contribution of cached tokens to attention outputs. Our method, CAOTE, optimizes for the error due to token eviction by seamlessly integrating attention scores and value vectors. This is the first method which uses value tokens on top of attention-based eviction scores in closed form. Additionally, CAOTE can act as a meta-heuristic method with flexible usage with any token eviction method. We show that CAOTE, when combined with state-of-the-art attention score-based methods, always improves accuracies on the downstream task, indicating the importance of leveraging information from values during the token eviction process.
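The idea of scoring tokens by their effect on the attention output can be illustrated with a small sketch. It assumes single-query softmax attention and that the remaining weights are renormalized after one token is evicted; under those assumptions the output change for token j has the closed form a_j / (1 - a_j) * (o - v_j). The function names and toy numbers are hypothetical, and the paper's actual scoring rule may differ.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(weights, values):
    # Weighted sum of value vectors.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def eviction_errors(weights, values):
    # L2 norm of the change in attention output if token j alone is evicted
    # and the remaining weights are renormalized (assumed setup).
    out = attention_output(weights, values)
    errors = []
    for a, v in zip(weights, values):
        distance = math.sqrt(sum((o - x) ** 2 for o, x in zip(out, v)))
        errors.append(a / (1.0 - a) * distance)
    return errors

# Toy cache with four tokens and 2-dimensional values (made-up numbers).
weights = softmax([2.0, 0.5, 0.4, -1.0])
values = [[1.0, 0.0], [0.9, 0.1], [-3.0, 2.0], [0.0, 1.0]]
errors = eviction_errors(weights, values)
evict = min(range(len(errors)), key=errors.__getitem__)  # token with smallest output error
```

Note that a token with a small attention weight can still have a large eviction error when its value vector is far from the current output, which is the information a score-only criterion misses.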

[440] Adaptive Margin RLHF via Preference over Preferences cs.LG | cs.AI | cs.CL

Yaswanth Chittepu, Prasann Singhal, Greg Durrett, Scott Niekum

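TL;DR: The paper proposes DPO-PoP, an adaptive-margin optimization method driven by preference strength. It infers per-datapoint margins from preference-over-preference annotations to improve generalization and generation quality in RLHF.

Details

Motivation: Existing RLHF methods typically use fixed or simplistic margins in reward model learning, which cannot reflect differences in preference strength, and they depend on noisy rating information.

Result: On the UltraFeedback dataset, DPO-PoP outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins.

Insight: Modeling preference strength improves generalization. Classification and generation performance trade off against each other, so the sampling strategy should be chosen to match the goal.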

Abstract: Margin-based optimization is fundamental to improving generalization and robustness in classification tasks. In the context of reward model learning from preferences within Reinforcement Learning from Human Feedback (RLHF), existing methods typically rely on no margins, fixed margins, or margins that are simplistic functions of preference ratings. However, such formulations often fail to account for the varying strengths of different preferences (for example, some preferences are associated with larger margins between responses), or they rely on noisy margin information derived from ratings. We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. Furthermore, many existing methods that use adaptive margins assume access to accurate preference scores, which can be difficult for humans to provide reliably. We propose an approach that leverages preferences over preferences, that is, annotations indicating which of two preferences reflects a stronger distinction. We use this ordinal signal to infer adaptive margins on a per-datapoint basis. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision, enabling improved discriminative and generative performance. Empirically, our method outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on the UltraFeedback dataset. Additionally, we show that there is a tradeoff between discriminative and generative performance: improving test classification accuracy, particularly by correctly labeling weaker preferences at the expense of stronger ones, can lead to a decline in generative quality. To navigate this tradeoff, we propose two sampling strategies to gather preference-over-preference labels: one favoring discriminative performance and one favoring generative performance.
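To make the role of a per-datapoint margin concrete, here is a minimal sketch of a DPO-style loss with a margin subtracted inside the sigmoid. It assumes the margin is already given; how DPO-PoP infers margins from preference-over-preference labels is not specified in this digest, and the function name and default `beta` are hypothetical.

```python
import math

def dpo_margin_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, margin=0.0):
    # Implicit reward gap between chosen and rejected responses,
    # measured relative to a frozen reference policy.
    gap = beta * ((logp_chosen - ref_logp_chosen)
                  - (logp_rejected - ref_logp_rejected))
    z = gap - margin  # a larger margin demands a larger gap
    # -log(sigmoid(z)), computed stably as softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Same pair of responses, scored with and without a margin.
no_margin = dpo_margin_loss(-1.0, -2.0, -1.0, -1.0, margin=0.0)
with_margin = dpo_margin_loss(-1.0, -2.0, -1.0, -1.0, margin=0.5)
```

With `margin=0.0` this reduces to the standard DPO loss for one pair; a stronger preference would be assigned a larger margin, so the same gap is penalized more.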

[441] Causally-Enhanced Reinforcement Policy Optimization cs.LG | cs.AI | cs.CL

Xiangqi Wang, Yue Huang, Yujun Zhou, Xiaonan Luo, Kehan Guo

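TL;DR: The paper proposes CE-PO (Causally-Enhanced Policy Optimization), which augments reinforcement-learning policy optimization with a differentiable proxy for causal coherence in order to reduce reward hacking and unfaithful reasoning in LLMs.

Details

Motivation: LLMs trained with reinforcement objectives often reach superficially correct answers through shortcut strategies. Their reasoning may rest on spurious or unfaithful causes, and they degrade under causal perturbations.

Result: Across 4 datasets, CE-PO improves accuracy by 5.49% on average (up to 9.58%) and improves robustness to correlation-causation flips and light counterfactual edits.

Insight: Explicitly modeling causal coherence and fusing it into the reward design reduces unfaithful reasoning in LLMs and improves their robustness on complex tasks.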

Abstract: Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable parameter that controls the trade-off between accuracy and coherence. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%), while improving robustness to correlation-causation flips and light counterfactual edits.
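A power-mean combiner of the kind the abstract mentions can be sketched as follows. This is a generic weighted power mean of two positive scores, not the paper's exact formula; the `weight` argument and the function name are assumptions made for illustration.

```python
import math

def power_mean(accuracy, coherence, p=1.0, weight=0.5):
    # Weighted power (Minkowski) mean of two positive scores.
    # p = 1 is the arithmetic mean, p = -1 the harmonic mean,
    # and p = 0 is handled as the geometric-mean limit.
    if accuracy <= 0 or coherence <= 0:
        raise ValueError("scores must be positive")
    if p == 0:
        return math.exp(weight * math.log(accuracy)
                        + (1 - weight) * math.log(coherence))
    return (weight * accuracy ** p + (1 - weight) * coherence ** p) ** (1.0 / p)

# A response with high task accuracy but low coherence (made-up scores).
strict = power_mean(0.9, 0.1, p=-1.0)   # dominated by the weaker score
lenient = power_mean(0.9, 0.1, p=5.0)   # dominated by the stronger score
```

Lowering `p` makes the combined reward behave more like a minimum, so an answer cannot score well on accuracy alone; raising it lets the stronger score dominate.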

[442] RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility cs.LG | cs.AI | cs.CL

Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang

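TL;DR: RHYTHM uses hierarchical temporal tokenization and a frozen pretrained LLM backbone to improve both the accuracy and the efficiency of human mobility prediction.

Details

Motivation: Long-range dependencies and multi-scale periodic behaviors make human mobility prediction hard to model.

Result: On three real-world datasets, RHYTHM improves overall accuracy by 2.4%, improves weekend accuracy by 5.0%, and reduces training time by 24.6%.

Insight: Freezing the LLM backbone and using hierarchical attention balance computational cost against model performance.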

Abstract: Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors. To address this, we introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework that leverages large language models (LLMs) as general-purpose spatio-temporal predictors and trajectory reasoners. Methodologically, RHYTHM employs temporal tokenization to partition each trajectory into daily segments and encode them as discrete tokens with hierarchical attention that captures both daily and weekly dependencies, thereby significantly reducing the sequence length while preserving cyclical information. Additionally, we enrich token representations by adding pre-computed prompt embeddings for trajectory segments and prediction targets via a frozen LLM, and feeding these combined embeddings back into the LLM backbone to capture complex interdependencies. Computationally, RHYTHM freezes the pretrained LLM’s backbone to reduce attention complexity and memory cost. We evaluate our model against state-of-the-art methods using three real-world datasets. Notably, RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time. Code is publicly available at https://github.com/he-h/rhythm.

[443] C$^2$GSPG: Confidence-calibrated Group Sequence Policy Gradient towards Self-aware Reasoning cs.LG | cs.AI | cs.CL

Haotian Liu, Shuo Wang, Hongteng Xu

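TL;DR: The paper proposes C$^2$GSPG, a confidence-calibrated group sequence policy gradient method that addresses overconfidence in reinforcement learning while improving the model's reasoning ability and self-awareness.

Details

Motivation: Existing reinforcement learning methods such as GRPO and its variants suffer from overconfidence on reasoning tasks, which limits the model's self-awareness. A method that improves reasoning performance while suppressing overconfidence is therefore needed.

Result: On logical and mathematical reasoning tasks, C$^2$GSPG outperforms existing methods in both reasoning accuracy and confidence calibration.

Insight: The confidence-calibration regularizer and the GSPG framework cooperate. With binary rewards their objectives share the same gradient direction, and with non-binary rewards nonlinear reward normalization and adaptive regularizer clipping mitigate the conflict between them.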

Abstract: Reinforcement Learning (RL) methods, exemplified by Group Relative Policy Optimization (GRPO) and its variants, play a central role in developing reasoning models. However, these methods often suffer from a critical overconfidence issue, which prevents them from achieving self-aware reasoning models. In this study, we propose a simple yet effective confidence-calibration group sequence policy gradient method, called C$^2$GSPG, which simultaneously enhances reasoning performance while suppressing overconfidence. In principle, we propose a Group Sequence Policy Gradient (GSPG) framework for learning reasoning models, which eliminates the token-level bias commonly appearing in GRPO and its variants. In this framework, we define the model confidence for each reasoning problem using the normalized sequence-level probability, and then apply a cross-entropy regularizer to calibrate the model confidence to the sequence’s reward. We demonstrate that the confidence calibration regularizer and GSPG are collaborative for binary rewards, as their objectives always share the same gradient direction. For non-binary rewards, we apply nonlinear reward normalization and adaptive regularizer clipping, mitigating the potential conflict between the two objectives. Applying C$^2$GSPG to post-train large language models in logical and mathematical reasoning tasks, we show its superiority over state-of-the-art methods in both reasoning accuracy and confidence calibration. The code of C$^2$GSPG is available at https://github.com/HaotianLiu123/CCGSPG.
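The confidence definition and calibration term described in the abstract can be sketched in a few lines. The sketch assumes that "normalized sequence-level probability" means the length-normalized (geometric-mean) token probability and that rewards lie in [0, 1]; both are assumptions, and the function names are hypothetical.

```python
import math

def sequence_confidence(token_logprobs):
    # Length-normalized sequence probability: exp of the mean token log-prob.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def calibration_loss(confidence, reward, eps=1e-8):
    # Binary cross-entropy between the model's confidence and the reward.
    c = min(max(confidence, eps), 1.0 - eps)
    return -(reward * math.log(c) + (1.0 - reward) * math.log(1.0 - c))

# A confident response (made-up token log-probs) under both reward outcomes.
conf = sequence_confidence([-0.05, -0.10, -0.02])
loss_if_correct = calibration_loss(conf, reward=1.0)
loss_if_wrong = calibration_loss(conf, reward=0.0)
```

A highly confident but wrong response receives a large penalty, which is the overconfidence behavior the regularizer is meant to suppress.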

[444] SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts cs.LG | cs.AI | cs.CL

Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang

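TL;DR: SPEC-RL applies speculative decoding to the compute-heavy rollout stage of reinforcement learning training, cutting rollout time while preserving policy quality.

Details

Motivation: Existing reinforcement learning training wastes computation in the rollout stage, because rollouts from consecutive training epochs often share overlapping trajectory segments.

Result: Experiments show that SPEC-RL reduces rollout time by 2-3x on multiple math reasoning and generalization benchmarks without compromising policy quality.

Insight: Speculative decoding is useful not only for language-model generation but also for the RL rollout stage, offering a new route to scaling RLVR.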

Abstract: Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods, such as parallelization, objective- and data-driven modifications, and replay buffers, either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including GSM8K, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL

[445] Temporal Generalization: A Reality Check cs.LG | cs.CL | cs.CV

Divyam Madaan, Sumit Chopra, Kyunghyun Cho

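TL;DR: The paper asks whether machine learning models can generalize over time when relying only on past data. It benchmarks parameter interpolation and extrapolation methods on several tasks and finds that none of them consistently beats the simple baseline of using the latest model parameters, which underscores how hard temporal generalization is.

Details

Motivation: Model performance degrades under distribution shift. The paper studies whether temporal generalization is achievable from past data alone, that is, whether models can predict well on future data.

Result: None of the evaluated methods consistently outperforms the simple baseline (the latest model parameters) across all scenarios, highlighting the difficulty of temporal generalization.

Insight: Without access to future data or strong assumptions, temporal generalization is inherently difficult, and claims of such generalization should be treated with caution.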

Abstract: Machine learning (ML) models often struggle to maintain performance under distribution shifts, leading to inaccurate predictions on unseen future data. In this work, we investigate whether and under what conditions models can achieve such a generalization when relying solely on past data. We explore two primary approaches: convex combinations of past model parameters (parameter interpolation) and explicit extrapolation beyond the convex hull of past parameters (parameter extrapolation). We benchmark several methods within these categories on a diverse set of temporal tasks, including language modeling, news summarization, news tag prediction, academic paper categorization, satellite image-based land use classification over time, and historical yearbook photo gender prediction. Our empirical findings show that none of the evaluated methods consistently outperforms the simple baseline of using the latest available model parameters in all scenarios. In the absence of access to future data or robust assumptions about the underlying data-generating process, these results underscore the inherent difficulties of generalizing and extrapolating to future data and warrant caution when evaluating claims of such generalization.
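The two families of approaches named in the abstract can be sketched on flat parameter vectors. The convex combination follows the abstract's definition of interpolation; the linear continuation used for extrapolation is only one simple choice, assumed here for illustration, and the specific methods benchmarked in the paper are not listed in this digest.

```python
def interpolate(checkpoints, weights):
    # Convex combination of past parameter vectors (weights >= 0, sum to 1).
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    size = len(checkpoints[0])
    return [sum(w * c[i] for w, c in zip(weights, checkpoints)) for i in range(size)]

def extrapolate(previous, latest, step=1.0):
    # Linear continuation of the most recent parameter change; the result
    # lies outside the convex hull of the past checkpoints when step > 0.
    return [l + step * (l - p) for p, l in zip(previous, latest)]

# Two toy checkpoints from consecutive time periods (made-up values).
theta_old, theta_new = [0.0, 0.0], [2.0, 4.0]
averaged = interpolate([theta_old, theta_new], [0.5, 0.5])
projected = extrapolate(theta_old, theta_new)
baseline = theta_new  # the "latest available parameters" baseline
```

The baseline in the last line is the one the paper reports as not being consistently beaten by either family.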

[446] Anchored Supervised Fine-Tuning cs.LG | cs.CL

He Zhu, Junyou Su, Peng Lai, Ren Ma, Wenjia Zhang

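TL;DR: The paper proposes Anchored Supervised Fine-Tuning (ASFT), which adds lightweight KL regularization to Dynamic Fine-Tuning (DFT). This fixes the training instability caused by DFT's lack of distributional anchoring and improves results on mathematical reasoning, medical knowledge grounding, and code generation.

Details

Motivation: Standard supervised fine-tuning (SFT) tends to overfit, whereas reinforcement learning (RL) generalizes better but costs more compute. DFT is a middle ground that performs well in some domains but is unstable in others. The authors use theoretical analysis to improve DFT and obtain a more stable method.

Result: ASFT outperforms both SFT and DFT on mathematical reasoning, medical knowledge grounding, and code generation, with low computational cost and stable training.

Insight: Theoretical analysis can guide the design of more efficient and stable fine-tuning methods, and KL regularization is a simple and effective stabilizer.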

Abstract: Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generalization at higher computational cost. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks. We provide an analysis of DFT through the reward-weighted regression (RWR) framework, revealing that it corresponds to a specific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT. However, our analysis also uncovers a critical limitation: this construction lacks distributional anchoring, leading to progressive drift that undermines training stability. To address this, we propose Anchored Supervised Fine-Tuning (ASFT), which augments DFT’s reweighting with lightweight KL regularization to preserve tightness while ensuring stability. Empirically, ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead. Our RWR framework provides a systematic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains.
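The relationship between the three objectives can be sketched at the level of a single token. The sketch assumes that DFT weights the usual negative log-likelihood by the token's own probability (treated as a constant during backpropagation) and that the anchor is a KL term between the current policy and a frozen reference distribution; the direction of the KL, the weight `kl_weight`, and all names are assumptions rather than the paper's definitions.

```python
import math

def sft_loss(p_target):
    # Standard negative log-likelihood of the target token.
    return -math.log(p_target)

def dft_loss(p_target):
    # NLL reweighted by the token probability. In a real training loop the
    # weight would be detached from the gradient; here only values are shown.
    return -p_target * math.log(p_target)

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions over the same vocabulary.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def asft_loss(policy_probs, reference_probs, target, kl_weight=0.1):
    # DFT term plus a KL anchor toward the reference distribution.
    return (dft_loss(policy_probs[target])
            + kl_weight * kl_divergence(policy_probs, reference_probs))

# Toy 3-token vocabulary (made-up probabilities).
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]
anchored = asft_loss(policy, reference, target=0)
```

When the policy equals the reference the anchor vanishes and the loss reduces to the DFT term; as the policy drifts, the KL term grows and pulls it back.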

[447] Knowledge Homophily in Large Language Models cs.LG | cs.AI | cs.CL | cs.SI

Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi

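TL;DR: The paper studies knowledge homophily in large language models (LLMs). It represents LLM knowledge as a graph and uses a GNN model to predict knowledge coverage, which makes knowledge injection and multi-hop reasoning more efficient.

Details

Motivation: LLMs are increasingly used as knowledge bases, but the structure of their knowledge has not been studied in depth. Inspired by semantic clustering and priming effects in cognitive neuroscience, the paper looks for an analogous knowledge homophily pattern in LLMs.

Result: The GNN model predicts knowledgeability scores effectively, improves the efficiency of knowledge injection and multi-hop reasoning, and increases knowledge coverage under a fixed labeling budget.

Insight: Knowledge in LLMs is distributed homophilously. Graph structure can be used to identify and fill knowledge gaps efficiently, which opens a new direction for knowledge-enhancement tasks.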

Abstract: Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.

[448] Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR cs.LG | cs.CL

Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang

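TL;DR: The paper proposes VERL, which uses Effective Rank (ER) in hidden-state space, together with its derived metrics ERV and ERA, to decouple exploration and exploitation dynamics and enhance both together, improving LLM reasoning.

Details

Motivation: RLVR work usually treats exploration and exploitation as a trade-off, but the authors argue that this may be only an artifact of the measurement level. Their analysis in hidden-state space shows that the two can be decoupled.

Result: VERL yields consistent gains across diverse LLMs and reasoning benchmarks, including an absolute accuracy improvement of up to 21.4% on the Gaokao 2024 dataset.

Insight: Exploration and exploitation need not be traded off against each other. Hidden-state analysis and suitable method design can strengthen both at once.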

Abstract: A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
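Effective rank and its discrete derivatives can be sketched as follows. The sketch uses the common definition of effective rank as the exponential of the Shannon entropy of the normalized singular values, and it approximates ERV and ERA by first and second finite differences across training steps; the paper's exact definitions may differ, and the singular values below are made up.

```python
import math

def effective_rank(singular_values):
    # exp of the entropy of the normalized singular-value distribution.
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def finite_differences(series):
    # Discrete derivative of a sequence of measurements.
    return [b - a for a, b in zip(series, series[1:])]

# Singular values of a hidden-state matrix at three training steps
# (hypothetical numbers; a real pipeline would obtain them from an SVD).
spectra = [[4.0, 1.0, 0.5], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
er = [effective_rank(s) for s in spectra]
erv = finite_differences(er)    # first-order change in effective rank
era = finite_differences(erv)   # second-order change in effective rank
```

A flat spectrum gives an effective rank equal to the number of singular values, while a spectrum dominated by one value gives an effective rank near 1.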

[449] Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm cs.LG | cs.AI | cs.CL | stat.ML

Kaisen Yang, Lixuan He, Rushi Shah, Kaicheng Yang, Qinwei Ma

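TL;DR: The paper proposes the Explore-Execute Chain ($E^2C$), a structured reasoning framework that splits reasoning into an exploration phase and an execution phase. This addresses the coupling of high-level planning with low-level execution in existing methods such as CoT, and improves computational efficiency, exploration of reasoning paths, and interpretability.

Details

Motivation: In existing methods such as Chain-of-Thought, high-level planning and low-level execution are coupled, which leads to computational inefficiency, limited path exploration, and poor interpretability.

Result: On AIME’2024, $E^2C$ reaches 58.1% accuracy using less than 10% of the decoding tokens required by Forest-of-Thought. For cross-domain adaptation, EF-SFT uses only 3.5% of the tokens of standard SFT and achieves up to 14.5% higher accuracy on medical benchmarks.

Insight: Separating planning from execution improves efficiency and performance and also strengthens generalization and interpretability.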

Abstract: Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs), yet their monolithic and auto-regressive architecture inherently conflates high-level strategic planning with low-level step-by-step execution, leading to computational inefficiency, limited exploration of reasoning paths, and reduced interpretability. To overcome these issues, we propose the Explore-Execute Chain ($E^2C$), a structured reasoning framework that decouples reasoning into two distinct phases: an exploratory phase that stochastically generates succinct high-level plans, followed by an execution phase that deterministically carries out the chosen plan. Our approach incorporates a two-stage training methodology, which combines Supervised Fine-Tuning (SFT), augmented by a novel data generation algorithm enforcing strict plan adherence, with a subsequent Reinforcement Learning (RL) stage that capitalizes on the informativeness of exploration and reinforces the determinism of execution. This decomposition enables an efficient test-time scaling strategy: on AIME’2024, $E^2C$ Test Time Scaling reaches 58.1% accuracy using <10% of the decoding tokens required by comparable methods (e.g., Forest-of-Thought), sharply cutting self-consistency overhead. For cross-domain adaptation, our Exploration-Focused SFT (EF-SFT) fine-tunes with only 3.5% of the tokens used by standard SFT yet yields up to 14.5% higher accuracy than standard SFT on medical benchmarks, delivering state-of-the-art performance, strong generalization, and greater interpretability by separating planning from execution. The code and pre-trained models for the project are available at: https://github.com/yks23/Explore-Execute-Chain.git

[450] Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends cs.LG | cs.AI | cs.CL

Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang

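TL;DR: The paper shows that group-relative REINFORCE, the basis of GRPO and related algorithms, admits a native off-policy interpretation, which challenges the common view that it is strictly on-policy. From theoretical derivation and experiments, it extracts general principles for adapting REINFORCE to off-policy settings.

Details

Motivation: REINFORCE and variants such as GRPO have long been regarded as working only in strictly on-policy settings, yet practical applications such as reinforcement learning for large language models increasingly need off-policy data. The paper re-examines the nature of these algorithms and their applicability to off-policy settings.

Result: Theory and experiments indicate that group-relative REINFORCE can be applied effectively in off-policy settings. The analysis unifies algorithms such as OPMD and AsymRE and gives a theoretical explanation for their behavior.

Insight: The REINFORCE family has untapped potential in off-policy settings. The two principles the paper derives, regularizing policy updates and shaping the data distribution, give direction for future algorithm design.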

Abstract: Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms – Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) – as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.
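For readers unfamiliar with the group-relative baseline, the following sketch shows the standard computation: rewards for several responses to the same prompt are centered by the group mean and scaled by the group standard deviation, then used to weight log-probabilities in a REINFORCE-style surrogate. It omits importance sampling, clipping, and the regularizers the paper analyzes; the function names and numbers are illustrative.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    # Center rewards by the group mean and scale by the group std.
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

def reinforce_surrogate(advantages, sequence_logprobs):
    # Surrogate loss whose gradient is the REINFORCE policy gradient
    # (advantages are treated as constants).
    n = len(advantages)
    return -sum(a * lp for a, lp in zip(advantages, sequence_logprobs)) / n

# Four sampled responses to one prompt with binary rewards (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
logprobs = [-3.2, -4.1, -2.7, -3.9]
advantages = group_relative_advantages(rewards)
loss = reinforce_surrogate(advantages, logprobs)
```

Nothing in this computation requires the responses to come from the current policy, which is consistent with the off-policy reading the abstract describes.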

[451] When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training cs.LG | cs.AI | cs.CL

Sanxing Chen, Xiaoyin Chen, Yukun Huang, Roy Xie, Bhuwan Dhingra

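TL;DR: The paper studies large language models (LLMs) trained with supervised fine-tuning (SFT) and reinforcement learning (RL) on sequential decision-making tasks such as multi-armed bandits. Both methods improve performance but can make the model greedier and lead it to abandon exploration prematurely.

Details

Motivation: LLMs explore poorly in sequential decision-making, and it is unclear how effective existing methods such as SFT and RL are and how well they generalize.

Result: Trained LLMs reach performance comparable to UCB and Thompson Sampling and generalize to 6x longer horizons and across bandit families. However, RL/SFT-trained agents are more prone than pre-trained models to early catastrophic failure caused by prematurely abandoning exploration.

Insight: Training can bias models toward greedy strategies. Tailored reward design and evaluation beyond average regret are needed to make exploration robust.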

Abstract: While Large Language Models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to enhance this capability via supervised fine-tuning (SFT) or reinforcement learning (RL), improving regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. We investigate both paradigms by training LLMs with SFT on expert trajectories and RL with a range of tailored reward signals including a strategic, regret-shaped reward to reduce variance, and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6x longer horizons and across bandit families. Behavioral analysis reveals that gains often stem from more sophisticated but greedier exploitation: RL/SFT agents are more prone to early catastrophic failure than pre-trained models, prematurely abandoning exploration. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. Our findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.
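As a point of reference for the Upper Confidence Bound baseline mentioned in the abstract, here is a sketch of the classic UCB1 rule on a Bernoulli bandit. It is the textbook algorithm, not the paper's training setup; the arm means, horizon, and seed are made up.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    # Play each arm once, then pick the arm with the highest
    # empirical mean plus exploration bonus sqrt(2 ln t / n_a).
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        # Bernoulli reward drawn from the chosen arm.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.1, 0.9], horizon=2000)
```

The exploration bonus shrinks as an arm is pulled more often, so the rule keeps occasionally revisiting weaker arms. A purely greedy agent has no such bonus, which is the kind of premature commitment the paper reports in fine-tuned LLM agents.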

[452] ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation cs.LG | cs.AI | cs.CL

Aasheesh Singh, Vishal Vaddina, Dagnachew Birru

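TL;DR: ORPO-Distill is a general-purpose method for cross-architecture LLM distillation that transfers knowledge through a preference optimization task and outperforms conventional methods.

Details

Motivation: Standard CoT distillation is limited for cross-architecture LLM distillation, and a more effective way to transfer knowledge from diverse reasoning traces is needed.

Result: Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.

Insight: Combining a preference optimization objective with a mixed-policy strategy makes cross-architecture knowledge distillation more effective.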

Abstract: We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for utilizing student-generated outputs, outperforming both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.
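The odds-ratio objective can be sketched for a single pair of traces. The sketch follows the general ORPO form (a likelihood term on the preferred sequence plus a log-sigmoid of the log odds ratio) and assumes that the teacher trace plays the role of the preferred response and the student trace the dispreferred one; that mapping, the weight `lam`, and the function names are assumptions for illustration.

```python
import math

def log_odds(avg_logp):
    # log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0.
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    # Likelihood term on the chosen trace plus an odds-ratio penalty.
    log_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(log_ratio)), computed stably as softplus(-log_ratio).
    odds_term = max(-log_ratio, 0.0) + math.log1p(math.exp(-abs(log_ratio)))
    return -avg_logp_chosen + lam * odds_term

# Length-normalized log-probabilities under the student model (made-up values).
teacher_trace_logp = -0.5   # assumed "chosen"
student_trace_logp = -2.0   # assumed "rejected"
loss = orpo_loss(teacher_trace_logp, student_trace_logp)
```

Unlike DPO-style objectives, this form needs no separate reference model, since the contrast is expressed through the odds of the two traces under the student itself.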

[453] Rethinking Entropy Regularization in Large Reasoning Models cs.LG | cs.AI | cs.CL

Yuxian Jiang, Yafu Li, Guanxu Chen, Dongrui Liu, Yu Cheng

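TL;DR: The paper analyzes why entropy regularization fails in large reasoning models (LRMs) and proposes SIREN, a selective entropy regularization method that mitigates entropy collapse and premature convergence without triggering entropy explosion.

Details

Motivation: RLVR training of LRMs suffers from entropy collapse and premature convergence, and naive entropy regularization does not help because it triggers a global entropy explosion. A method is needed that confines exploration to a meaningful range and stabilizes training.

Result: Across five math benchmarks, SIREN outperforms other entropy-related methods, for example by +6.6 maj@k on AIME24/25, while keeping entropy at an appropriate level and preserving response diversity.

Insight: Selectivity is the key. Selective entropy regularization avoids a global entropy explosion while preserving validation performance, which addresses the premature convergence common in large reasoning models.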

Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown great promise in enhancing the reasoning abilities of large reasoning models (LRMs). However, it suffers from a critical issue: entropy collapse and premature convergence. Naive entropy regularization, a common approach for encouraging exploration in the traditional RL literature, fails to address this problem in the context of LRM. Our analysis reveals that this failure stems from the vast action space and long trajectories in LRMs, which easily trigger a global entropy explosion as the model indiscriminately explores all possible actions and states. To address this, we propose SIREN (SelectIve entRopy rEgularizatioN), a method that confines exploration to a meaningful subset of actions and states. SIREN achieves this through a two-step entropy masking mechanism, consisting of a top-p mask and a peak-entropy mask. In addition, regularization is transformed into a self-anchored form to stabilize training. Across five mathematical benchmarks, SIREN attains superior average performance over previous entropy-related RLVR approaches, exemplified by a +6.6 maj@k improvement on AIME24/25 with Qwen2.5-Math-7B. Further analysis confirms that SIREN promotes greater response diversity and maintains entropy at an appropriate level, which helps to preserve the validation pass@k throughout training. This effectively mitigates the premature convergence problem common in RLVR for LRM.
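The two masks named in the abstract can be sketched as follows. The sketch assumes that the top-p mask restricts the entropy computation to the smallest set of tokens whose cumulative probability reaches `top_p`, and that the peak-entropy mask keeps only positions whose entropy is at or above a chosen quantile; both readings, the thresholds, and the function names are assumptions, not the paper's definitions.

```python
import math

def nucleus_entropy(probs, top_p=0.9):
    # Entropy of the renormalized distribution over the top-p nucleus.
    ordered = sorted(probs, reverse=True)
    kept, cumulative = [], 0.0
    for q in ordered:
        kept.append(q)
        cumulative += q
        if cumulative >= top_p:
            break
    total = sum(kept)
    return -sum((q / total) * math.log(q / total) for q in kept if q > 0)

def peak_entropy_mask(entropies, quantile=0.8):
    # True for positions whose entropy is at or above the given quantile.
    threshold = sorted(entropies)[int(quantile * (len(entropies) - 1))]
    return [h >= threshold for h in entropies]

# One next-token distribution and per-position entropies (made-up values).
token_probs = [0.5, 0.4, 0.05, 0.05]
restricted = nucleus_entropy(token_probs, top_p=0.85)
full = nucleus_entropy(token_probs, top_p=1.0)
mask = peak_entropy_mask([0.1, 0.9, 0.2, 1.5, 0.05])
```

Restricting the entropy bonus in both ways keeps the regularizer from rewarding probability mass on the long tail of the vocabulary or on positions where the model is already certain.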

[454] SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression cs.LG | cs.CL

Haoming Wen, Yushi Bai, Juanzi Li, Jie Tang

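TL;DR: SIRI is a training method that alternates between compressing and expanding the reasoning budget by dynamically adjusting the maximum rollout length, balancing performance against efficiency.

Details

Motivation: Existing large reasoning models show repetitive thinking patterns, and shortening their reasoning usually costs performance. SIRI aims to overcome this trade-off.

Result: SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9%, and SIRI-high achieves the highest accuracy among the compared methods.

Insight: Dynamically adjusting the reasoning length balances exploration and efficiency and moves the model step by step toward the Pareto frontier of the performance-efficiency trade-off.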

Abstract: We introduce SIRI, Scaling Iterative Reinforcement Learning with Interleaved Compression, a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Existing studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget, by dynamically adjusting the maximum rollout length during training. The compression phase cuts the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase then relaxes the length limit, providing space for the model to explore and plan in long-horizon settings. Remarkably, we find that after each compression-expansion cycle, the model’s performance improves even as its output length decreases, steadily pushing it closer to the Pareto frontier in the performance-efficiency trade-off. Training on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations, and SIRI-high achieves the highest accuracy compared to all other methods (Figure 1). Our findings shed light on the potential of periodically oscillating the LRM’s output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging towards an optimal “sweet spot” between the two. Our models are publicly available.

[455] Localizing Adversarial Attacks To Produces More Imperceptible Noise cs.LG | cs.AI | cs.CV | I.2.6; I.2.10; I.5.1

Pavan Reddy, Aditya Sanjay Gujral

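TL;DR: The paper introduces a binary mask to systematically evaluate the effectiveness, imperceptibility, and computational efficiency of localized adversarial attacks. Compared with global attacks, localized attacks score better on pixel perturbation, PSNR, and SSIM, at the price of higher computational cost and a slightly lower attack success rate.

Details

Motivation: Conventional adversarial attacks usually apply global perturbations, while the effect and potential of localized adversarial noise remain underexplored. The paper aims to fill this gap.

Result: Localized attacks outperform global attacks on mean pixel perturbation, PSNR, and SSIM, but attack success rate and computational efficiency drop slightly. PGD and C&W hold up better under localization constraints than FGSM.

Insight: Localized adversarial attacks can make noise less perceptible, but computational cost and attack effectiveness must be weighed; iterative methods are better suited to the localized setting.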

Abstract: Adversarial attacks in machine learning traditionally focus on global perturbations to input data, yet the potential of localized adversarial noise remains underexplored. This study systematically evaluates localized adversarial attacks across widely-used methods, including FGSM, PGD, and C&W, to quantify their effectiveness, imperceptibility, and computational efficiency. By introducing a binary mask to constrain noise to specific regions, localized attacks achieve significantly lower mean pixel perturbations, higher Peak Signal-to-Noise Ratios (PSNR), and improved Structural Similarity Index (SSIM) compared to global attacks. However, these benefits come at the cost of increased computational effort and a modest reduction in Attack Success Rate (ASR). Our results highlight that iterative methods, such as PGD and C&W, are more robust to localization constraints than single-step methods like FGSM, maintaining higher ASR and imperceptibility metrics. This work provides a comprehensive analysis of localized adversarial attacks, offering practical insights for advancing attack strategies and designing robust defensive systems.
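To make the masking idea concrete, here is a minimal sketch of a single FGSM-style step restricted by a binary mask. It assumes a flattened image with pixel values in [0, 1] and a precomputed loss gradient; the gradient values and function names are hypothetical, and the paper's actual attack configurations are not given in the abstract.

```python
def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def masked_fgsm_step(image, grad, mask, epsilon):
    """Apply one FGSM-style step only where mask == 1, then clip to [0, 1]."""
    return [
        min(1.0, max(0.0, x + epsilon * sign(g) * m))
        for x, g, m in zip(image, grad, mask)
    ]


def mean_abs_perturbation(original, perturbed):
    """Average absolute pixel change between two flattened images."""
    return sum(abs(a - b) for a, b in zip(original, perturbed)) / len(original)


image = [0.2, 0.5, 0.9, 0.4]
grad = [0.3, -0.1, 0.7, -0.6]  # hypothetical loss gradient w.r.t. each pixel
local_mask = [1, 1, 0, 0]      # perturb only the first two pixels
global_mask = [1, 1, 1, 1]     # global attack for comparison

local_adv = masked_fgsm_step(image, grad, local_mask, epsilon=0.1)
global_adv = masked_fgsm_step(image, grad, global_mask, epsilon=0.1)
local_change = mean_abs_perturbation(image, local_adv)
global_change = mean_abs_perturbation(image, global_adv)
```

Because the mask zeroes the update outside the chosen region, the mean perturbation of the localized step is lower than that of the global step, which is the quantity the abstract reports alongside PSNR and SSIM.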

[456] Graph Your Own Prompt cs.LG | cs.AI | cs.CVPDF

Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao

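TL;DR: The paper proposes Graph Consistency Regularization (GCR), a framework that injects relational graph structures derived from model predictions into the learning process to strengthen class-aware, semantically meaningful feature representations.

Details

Motivation: Feature representations learned by deep networks often contain noisy inter-class similarities that contradict the model's predicted semantics. GCR addresses this by aligning graph structures.

Result: Experiments show that GCR yields cleaner feature structure, stronger intra-class cohesion, and better generalization.

Insight: Through a self-prompting mechanism, GCR uses the model's own outputs to refine its internal structure, offering a new perspective on learning from prediction structure.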

Abstract: We propose Graph Consistency Regularization (GCR), a novel framework that injects relational graph structures, derived from model predictions, into the learning process to promote class-aware, semantically meaningful feature representations. Functioning as a form of self-prompting, GCR enables the model to refine its internal structure using its own outputs. While deep networks learn rich representations, these often capture noisy inter-class similarities that contradict the model’s predicted semantics. GCR addresses this issue by introducing parameter-free Graph Consistency Layers (GCLs) at arbitrary depths. Each GCL builds a batch-level feature similarity graph and aligns it with a global, class-aware masked prediction graph, derived by modulating softmax prediction similarities with intra-class indicators. This alignment enforces that feature-level relationships reflect class-consistent prediction behavior, acting as a semantic regularizer throughout the network. Unlike prior work, GCR introduces a multi-layer, cross-space graph alignment mechanism with adaptive weighting, where layer importance is learned from graph discrepancy magnitudes. This allows the model to prioritize semantically reliable layers and suppress noisy ones, enhancing feature quality without modifying the architecture or training procedure. GCR is model-agnostic, lightweight, and improves semantic structure across various networks and datasets. Experiments show that GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization, offering a new perspective on learning from prediction structure. A project website and code are available.
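The graph alignment described in the abstract can be sketched for a single layer as follows. This is an assumption-laden illustration: cosine similarity for the feature graph, dot products of softmax vectors for the prediction graph, and a mean squared discrepancy are all choices made here for clarity, and the paper's exact construction, normalization, and adaptive layer weighting are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def feature_graph(features):
    """Batch-level feature similarity graph (assumed here: pairwise cosine)."""
    return [[cosine(u, v) for v in features] for u in features]


def masked_prediction_graph(probs, labels):
    """Prediction similarity, kept only for pairs that share a class label."""
    n = len(probs)
    return [
        [
            sum(a * b for a, b in zip(probs[i], probs[j])) if labels[i] == labels[j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def graph_discrepancy(graph_a, graph_b):
    """Mean squared difference between two graphs (assumed alignment penalty)."""
    n = len(graph_a)
    return sum((graph_a[i][j] - graph_b[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Hypothetical batch of three samples with two classes.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]
labels = [0, 0, 1]

feat = feature_graph(features)
pred = masked_prediction_graph(probs, labels)
penalty = graph_discrepancy(feat, pred)
```

In training, a penalty of this kind would be added to the task loss at each chosen depth, so that no parameters are introduced by the layer itself.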

[457] GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Data cs.LG | cs.CV | cs.IRPDF

Yewang Chen, Junfeng Li, Shuyin Xia, Qinghong Lai, Xinbo Gao

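TL;DR: GBSK is a new skeleton clustering algorithm based on granular-ball computing and multi-sampling, designed to cluster large-scale datasets efficiently. By constructing multi-grained granular-balls it progressively uncovers a statistical "skeleton" of the data, sharply reducing computational overhead while maintaining high clustering accuracy.

Details

Motivation: Clustering large-scale datasets typically suffers from high computational complexity and low efficiency. Conventional methods struggle to process such data efficiently without losing accuracy, so a new method is needed to balance computational efficiency and clustering quality.

Result: Experiments on standard hardware show that GBSK is efficient and clusters well on large-scale data, including a dataset with up to 100 million instances across 256 dimensions.

Insight: A multi-granularity abstraction of the data can substantially improve the efficiency of large-scale clustering, and an adaptive parameter design eases practical deployment.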

Abstract: To effectively handle clustering tasks for large-scale datasets, we propose a novel scalable skeleton clustering algorithm, namely GBSK, which leverages the granular-ball technique to capture the underlying structure of data. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical “skeleton” – a spatial abstraction that approximates the essential structure and distribution of the original data. This strategy enables GBSK to dramatically reduce computational overhead while maintaining high clustering accuracy. In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios. Extensive experiments conducted on standard computing hardware demonstrate that GBSK achieves high efficiency and strong clustering performance on large-scale datasets, including one with up to 100 million instances across 256 dimensions. Our implementation and experimental results are available at: https://github.com/XFastDataLab/GBSK/.

[458] Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation cs.LG | cs.AI | cs.CVPDF

Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu

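TL;DR: The paper proposes DART, a framework that tackles the efficiency problems of multi-turn RL training for GUI agents through decoupled training and adaptive data curation, substantially raising task success rates.

Details

Motivation: Multi-turn RL training of GUI agents suffers from slow interaction with the environment and a shortage of high-quality interaction data, so an efficient learning framework is needed.

Result: On the OSWorld benchmark, DART-GUI-7B reaches a 42.13% task success rate, an absolute gain of 14.61 percentage points over the base model.

Insight: A decoupled design and adaptive data curation can markedly improve the efficiency and effectiveness of RL training, particularly for GUI agent tasks.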

Abstract: Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving the system efficiency: 1.6x GPU utilization for rollout, 1.9x training throughput, and 5.5x environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement sparse success in online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling for policy mismatch between policy rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
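Two items of the data curation scheme, selecting high-entropy steps and truncating importance weights, lend themselves to a short sketch. The function names, the `keep` count, and the cap value are hypothetical; the abstract does not state how DART sets them or how the two pieces are combined.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_high_entropy_steps(step_distributions, keep):
    """Return the indices of the `keep` steps with the highest entropy."""
    ranked = sorted(
        range(len(step_distributions)),
        key=lambda i: entropy(step_distributions[i]),
        reverse=True,
    )
    return sorted(ranked[:keep])


def truncated_is_weights(logp_trainer, logp_rollout, cap):
    """Importance ratio pi_trainer / pi_rollout per step, truncated at `cap`."""
    return [min(math.exp(lt - lr), cap) for lt, lr in zip(logp_trainer, logp_rollout)]


# Hypothetical trajectory with three steps and two possible actions per step.
steps = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3]]
kept = select_high_entropy_steps(steps, keep=2)

# Hypothetical log-probabilities of the taken actions under each policy.
weights = truncated_is_weights([-1.0, -0.2], [-1.0, -2.0], cap=2.0)
```

Truncation bounds the variance introduced when the rollout policy lags behind the trainer, which is the mismatch the abstract mentions for asynchronous training.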

[459] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention cs.LG | cs.AI | cs.CVPDF

Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng

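TL;DR: SLA is a fine-tunable sparse-linear attention mechanism that classifies attention weights into critical, marginal, and negligible parts, sharply reducing the computational cost of DiT models. It achieves a 20x reduction in attention computation and a 2.2x end-to-end speedup in video generation without loss of generation quality.

Details

Motivation: In Diffusion Transformer (DiT) models, especially for video generation, attention latency is the main bottleneck because of the long sequences and quadratic complexity.

Result: Experiments show that SLA cuts attention computation by 95%, speeds up attention computation by 13.7x, and speeds up end-to-end video generation by 2.2x, with no drop in generation quality.

Insight: Attention weights are naturally sparse, and the remainder has very low rank; exploiting these properties can substantially improve computational efficiency.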

Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
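The three-way split can be sketched as a ranking over attention blocks. This is an illustrative assumption: the abstract says weights are classified into critical, marginal, and negligible categories but does not give the criterion, so fixed counts over hypothetical per-block scores are used here.

```python
def classify_blocks(scores, n_critical, n_negligible):
    """Label each block 'critical' (largest scores), 'negligible' (smallest),
    or 'marginal' (everything in between)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = ["marginal"] * len(scores)
    for i in order[:n_critical]:
        labels[i] = "critical"        # would receive exact O(N^2) attention
    for i in order[len(scores) - n_negligible:]:
        labels[i] = "negligible"      # would be skipped entirely
    return labels


# Hypothetical per-block attention mass for one query block.
scores = [0.40, 0.02, 0.30, 0.01, 0.15, 0.12]
labels = classify_blocks(scores, n_critical=2, n_negligible=2)
```

Blocks labeled `marginal` are the ones the abstract routes to O(N) linear attention; the fused GPU kernel that combines the three paths is outside the scope of this sketch.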

[460] GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning cs.LG | cs.AI | cs.CV | cs.MAPDF

Umang Garg, Bowen Zhang, Anantanjit Subrahmanya, Chandrakanth Gudavalli, BS Manjunath

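TL;DR: GPS-MTM is a self-supervised foundation model for trajectory modeling. It decomposes mobility data into two modalities (states and actions) and uses a bi-directional Transformer to learn semantic correlations without manual labels, markedly improving downstream tasks such as trajectory infilling and next-stop prediction.

Details

Motivation: Most existing trajectory modeling methods flatten trajectories into coordinate streams, which makes it hard to capture the semantic patterns of human movement. GPS-MTM aims to build a stronger trajectory foundation model by decomposing mobility data into modalities and applying self-supervised learning.

Result: On benchmark datasets including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM performs strongly on tasks such as trajectory infilling and next-stop prediction, with the largest advantage on dynamic tasks that depend on contextual reasoning.

Insight: Decomposing trajectories into semantic modalities and combining this with self-supervised learning captures patterns of normalcy in human movement effectively and offers a new approach to large-scale trajectory analysis.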

Abstract: Foundation models have driven remarkable progress in text, vision, and video understanding, and are now poised to unlock similar breakthroughs in trajectory modeling. We introduce the GPS-Masked Trajectory Transformer (GPS-MTM), a foundation model for large-scale mobility data that captures patterns of normalcy in human movement. Unlike prior approaches that flatten trajectories into coordinate streams, GPS-MTM decomposes mobility into two complementary modalities: states (point-of-interest categories) and actions (agent transitions). Leveraging a bi-directional Transformer with a self-supervised masked modeling objective, the model reconstructs missing segments across modalities, enabling it to learn rich semantic correlations without manual labels. Across benchmark datasets, including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM consistently outperforms baselines on downstream tasks such as trajectory infilling and next-stop prediction. Its advantages are most pronounced in dynamic tasks (inverse and forward dynamics), where contextual reasoning is critical. These results establish GPS-MTM as a robust foundation model for trajectory analytics, positioning mobility data as a first-class modality for large-scale representation learning. Code is released for further reference.

[461] AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring cs.LG | cs.AI | cs.CV | stat.AP | 62M10, 68T45, 62P35, 92C40, 65C20, 60G35, 92C42, 92C35, 93E10 | I.2.6; C.2.4; H.3.4; I.2.4; H.3.5; C.2.4; C.3; I.4.8; I.5.1; J.3; K.6.1; H.2.8PDF

Youssef Sabiri, Walid Houmaidi, Ouail El Maadi, Yousra Chtouki

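TL;DR: AQUAIR is a high-resolution indoor environmental quality dataset for smart aquaculture monitoring. It fills a gap in current public datasets and supports the development of forecasting and anomaly-detection tools.

Details

Motivation: Smart aquaculture systems currently lack public datasets describing the air around indoor tanks, which limits the development of forecasting and anomaly-detection tools and in turn affects fish welfare and energy optimization.

Result: The dataset contains more than 23,000 time-stamped observations, shows stable environmental conditions with pronounced feeding-time peaks, and is suitable for short-horizon forecasting, event detection, and sensor drift studies.

Insight: AQUAIR fills a data gap in smart aquaculture informatics and provides a reproducible benchmark for machine learning and environmental sensing research, especially on head-space dynamics in recirculating aquaculture systems.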

Abstract: Smart aquaculture systems depend on rich environmental data streams to protect fish welfare, optimize feeding, and reduce energy use. Yet public datasets that describe the air surrounding indoor tanks remain scarce, limiting the development of forecasting and anomaly-detection tools that couple head-space conditions with water-quality dynamics. We therefore introduce AQUAIR, an open-access public dataset that logs six Indoor Environmental Quality (IEQ) variables–air temperature, relative humidity, carbon dioxide, total volatile organic compounds, PM2.5 and PM10–inside a fish aquaculture facility in Amghass, Azrou, Morocco. A single Awair HOME monitor sampled every five minutes from 14 October 2024 to 9 January 2025, producing more than 23,000 time-stamped observations that are fully quality-controlled and publicly archived on Figshare. We describe the sensor placement, ISO-compliant mounting height, calibration checks against reference instruments, and an open-source processing pipeline that normalizes timestamps, interpolates short gaps, and exports analysis-ready tables. Exploratory statistics show stable conditions (median CO2 = 758 ppm; PM2.5 = 12 micrograms/m3) with pronounced feeding-time peaks, offering rich structure for short-horizon forecasting, event detection, and sensor drift studies. AQUAIR thus fills a critical gap in smart aquaculture informatics and provides a reproducible benchmark for data-centric machine learning curricula and environmental sensing research focused on head-space dynamics in recirculating aquaculture systems.
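One step of the processing pipeline mentioned in the abstract, interpolating short gaps, can be sketched as below. The function name, the `max_gap` threshold, and the sample readings are hypothetical; the abstract does not specify the interpolation method or the gap length the authors treat as short, so linear interpolation is assumed.

```python
def interpolate_short_gaps(values, max_gap):
    """Linearly fill runs of None that are at most max_gap samples long and
    have a valid reading on both sides; longer gaps are left untouched."""
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            j = i
            while j < n and filled[j] is None:
                j += 1                      # j is the first valid index after the gap
            gap = j - i
            if i > 0 and j < n and gap <= max_gap:
                left, right = filled[i - 1], filled[j]
                for k in range(gap):
                    filled[i + k] = left + (right - left) * (k + 1) / (gap + 1)
            i = j
        else:
            i += 1
    return filled


# Hypothetical CO2 readings (ppm) at five-minute intervals with two gaps.
co2 = [750, None, None, 780, None, None, None, None, 800]
cleaned = interpolate_short_gaps(co2, max_gap=2)
```

Leaving long gaps as missing avoids inventing structure such as feeding-time peaks that the sensor never recorded.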

[462] Clebsch-Gordan Transformer: Fast and Global Equivariant Attention cs.LG | cs.CV | cs.ROPDF

Owen Lewis Howell, Linfeng Zhao, Xupeng Zhu, Yaoyao Qian, Haojie Huang

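TL;DR: The paper proposes the Clebsch-Gordan Transformer, which achieves efficient global attention through a Clebsch-Gordan convolution on SO(3) irreducible representations. It supports equivariant features of all orders with O(N log N) complexity in the number of tokens, and is validated on several tasks.

Details

Motivation: Existing equivariant transformers support only low-order equivariant features and local context windows, which limits expressiveness and performance. An efficient method that supports both global attention and high-order equivariance is needed.

Result: On n-body simulation, QM9, ModelNet point cloud classification, and a robotic grasping task, the method outperforms existing equivariant transformers in GPU memory use, speed, and accuracy.

Insight: Equivariant modeling of geometric structure can markedly improve transformer performance on physical and vision tasks, and an efficient implementation of global attention is key.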

Abstract: The global attention mechanism is one of the keys to the success of the transformer architecture, but it incurs quadratic computational costs in relation to the number of tokens. On the other hand, equivariant models, which leverage the underlying geometric structures of the problem instance, often achieve superior accuracy in physical, biochemical, computer vision, and robotic tasks, at the cost of additional compute requirements. As a result, existing equivariant transformers only support low-order equivariant features and local context windows, limiting their expressiveness and performance. This work proposes Clebsch-Gordan Transformer, achieving efficient global attention by a novel Clebsch-Gordan Convolution on $\mathrm{SO}(3)$ irreducible representations. Our method enables equivariant modeling of features at all orders while achieving $O(N \log N)$ input token complexity. Additionally, the proposed method scales well with high-order irreducible features, by exploiting the sparsity of the Clebsch-Gordan matrix. Lastly, we also incorporate optional token permutation equivariance through either weight sharing or data augmentation. We benchmark our method on a diverse set of benchmarks including n-body simulation, QM9, ModelNet point cloud classification and a robotic grasping dataset, showing clear gains over existing equivariant transformers in GPU memory size, speed, and accuracy.

[463] Semantic Editing with Coupled Stochastic Differential Equations cs.LG | cs.CV | stat.MLPDF

Jianxin Zhang, Clayton Scott

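TL;DR: The paper proposes a method based on coupled stochastic differential equations (coupled SDEs) to improve content editing with pretrained text-to-image models, achieving high semantic fidelity while keeping image details consistent.

Details

Motivation: Existing pretrained text-to-image models tend to distort details or introduce unwanted artifacts during content editing. A solution that needs no retraining or auxiliary networks is required.

Result: The method requires no additional training or auxiliary networks and achieves image editing with high prompt fidelity and near-pixel-level consistency.

Insight: Coupled SDEs give generative AI a simple yet powerful control tool that balances semantic editing against visual consistency.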

Abstract: Editing the content of an image with a pretrained text-to-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pre-trained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out of the box, without retraining or auxiliary networks, and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI.
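The effect of driving two trajectories with the same noise can be seen in a one-dimensional toy. The sketch below is not the authors' sampler: it assumes an Euler-Maruyama discretization, uses hypothetical Ornstein-Uhlenbeck drifts as stand-ins for the source-prompt and edit-prompt models, and shares identical noise increments, which is the simplest form of correlated noise.

```python
import math
import random


def simulate_pair(drift_src, drift_edit, x0, steps, dt, sigma, seed, shared=True):
    """Euler-Maruyama for two SDEs; with shared=True both paths receive the
    same Brownian increment at every step."""
    rng = random.Random(seed)
    x_src = x_edit = x0
    for _ in range(steps):
        dw_src = rng.gauss(0.0, math.sqrt(dt))
        dw_edit = dw_src if shared else rng.gauss(0.0, math.sqrt(dt))
        x_src += drift_src(x_src) * dt + sigma * dw_src
        x_edit += drift_edit(x_edit) * dt + sigma * dw_edit
    return x_src, x_edit


def pull_to_zero(x):
    """Hypothetical 'source' drift: relax toward 0."""
    return -(x - 0.0)


def pull_to_one(x):
    """Hypothetical 'edit' drift: relax toward 1."""
    return -(x - 1.0)


src_a, edit_a = simulate_pair(pull_to_zero, pull_to_one, 0.0, 500, 0.01, 1.0, seed=1)
src_b, edit_b = simulate_pair(pull_to_zero, pull_to_one, 0.0, 500, 0.01, 1.0, seed=2)
gap_a = edit_a - src_a
gap_b = edit_b - src_b
```

With shared noise the random terms cancel in the difference between the two paths, so the gap depends only on the two drifts and not on the noise realization; with independent noise (`shared=False`) the gap would fluctuate from seed to seed.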

[464] Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers cs.LG | cs.CVPDF

Xianhang Li, Chen Huang, Chun-Liang Li, Eran Malach, Josh Susskind

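TL;DR: The paper proposes SALT, which freezes the teacher model's weights and trains the student in a two-stage scheme, yielding compute-efficient video self-supervised learning.

Details

Motivation: Conventional V-JEPA updates the teacher with an EMA to prevent representation collapse, but this complicates model selection and couples the teacher and student architectures. Revisiting the approach shows that a frozen teacher works as well.

Result: Student models outperform V-JEPA 2 under frozen-backbone evaluation, are more compute-efficient, and are robust to teachers of varying quality.

Insight: Student performance depends only weakly on teacher quality, so the compute budget should go mainly to the student rather than the teacher.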

Abstract: Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representation by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a target encoder with a simple pixel-reconstruction objective under V-JEPA masking, then (ii) freeze it and train a student to predict the teacher’s latents on masked regions. This leads to a two-stage, unregularized scheme that we refer to as SALT (Static-teacher Asymmetric Latent Training). SALT decouples optimization into pixel reconstruction (teacher) and masked latent prediction (student), increasing transparency, efficiency, and scalability while preserving the ability of representation to generalize under frozen evaluation. Empirically, our student models outperform recently proposed V-JEPA 2 encoders under frozen backbone evaluation across diverse benchmarks. They are also more compute-optimal: at matched pretraining FLOPs, our method achieves higher probing accuracy, and its scaling curves dominate V-JEPA’s accuracy-FLOPs Pareto frontier. Finally, we find that student quality is remarkably robust to teacher quality: high-performing students emerge even with small, sub-optimal teachers. This points to a compute budget allocation that should overwhelmingly favor the student. These results position SALT as a simple, scalable, and compute-efficient alternative to EMA-based self-distillation for video representation learning.

[465] A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity cs.LG | cs.AI | cs.CVPDF

Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello

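TL;DR: The paper proposes TRIANGLE, a new similarity measure for multimodal alignment that is computed directly in the high-dimensional embedding space via a triangle-area similarity, and that markedly improves multimodal model performance.

Details

Motivation: Current multimodal learning models are limited in how they align modalities; some modalities may not be effectively aligned, which hurts performance on downstream tasks.

Result: On multimodal tasks such as video-text and audio-text retrieval, TRIANGLE improves Recall@1 by up to 9 points over cosine-based methods.

Insight: TRIANGLE yields interpretable alignment rationales while substantially improving multimodal model performance.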

Abstract: Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. Such methods may not provide indicators that all the involved modalities are effectively aligned. As a result, some modalities may not be aligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should provide additional information that the model fails to exploit. In this paper, we present TRIANGLE: TRI-modAl Neural Geometric LEarning, a novel similarity measure that is directly computed in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers or pairwise similarities. When incorporated in contrastive losses replacing cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling, while yielding interpretable alignment rationales. Extensive evaluation in three-modal tasks, such as video-text and audio-text retrieval or audio-video classification, demonstrates that TRIANGLE achieves state-of-the-art results across different datasets, improving the performance of cosine-based methods by up to 9 points of Recall@1.
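A triangle-area measure over three embeddings can be computed with the Gram determinant of two edge vectors. The sketch below is a geometric illustration under assumptions: it treats the three embeddings as triangle vertices, and it does not reproduce the paper's normalization or the way the area enters the contrastive loss, neither of which is given in the abstract.

```python
import math


def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c in any dimension, using
    area = 0.5 * sqrt(|u|^2 |v|^2 - (u . v)^2) with u = b - a and v = c - a."""
    u = [bi - ai for ai, bi in zip(a, b)]
    v = [ci - ai for ai, ci in zip(a, c)]
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    return 0.5 * math.sqrt(max(0.0, uu * vv - uv * uv))  # clamp rounding error


# Hypothetical unit-norm embeddings of one sample in three modalities.
video = [1.0, 0.0, 0.0]
audio = [0.0, 1.0, 0.0]
text = [0.0, 0.0, 1.0]

misaligned_area = triangle_area(video, audio, text)
aligned_area = triangle_area(video, video, video)
```

A smaller area means the three embeddings lie closer together, so a single scalar reflects the joint alignment of all three modalities instead of three separate pairwise cosine scores.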

[466] Score-based Membership Inference on Diffusion Models cs.LG | cs.CVPDF

Mingxing Rao, Bowen Qu, Daniel Moyer

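TL;DR: The paper studies score-based membership inference attacks (MIAs) on diffusion models, proposes SimA, an efficient single-query attack, and finds that Latent Diffusion Models (LDMs) are less vulnerable than pixel-space models.

Details

Motivation: Diffusion models excel at generative tasks, but their privacy risks are not yet well studied. Membership inference attacks can determine whether a sample belongs to the training set, which may leak private information.

Result: SimA performs strongly across a range of diffusion models; LDMs are less vulnerable because of the information bottleneck of their latent space.

Insight: The privacy advantage of latent diffusion models stems from the strong information bottleneck of the VAE, and a better understanding of VAE inversion is needed.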

Abstract: Membership inference attacks (MIAs) against diffusion models have emerged as a pressing privacy concern, as these models may inadvertently reveal whether a given sample was part of their training set. We present a theoretical and empirical study of score-based MIAs, focusing on the predicted noise vectors that diffusion models learn to approximate. We show that the expected denoiser output points toward a kernel-weighted local mean of nearby training samples, such that its norm encodes proximity to the training set and thereby reveals membership. Building on this observation, we propose SimA, a single-query attack that provides a principled, efficient alternative to existing multi-query methods. SimA achieves consistently strong performance across variants of DDPM and Latent Diffusion Models (LDMs). Notably, we find that Latent Diffusion Models are surprisingly less vulnerable than pixel-space models, due to the strong information bottleneck imposed by their latent auto-encoder. We further investigate this by varying the regularization hyperparameters ($\beta$ in $\beta$-VAE) in the latent channel and suggest a strategy to make LDM training more robust to MIA. Our results solidify the theory of score-based MIAs, while highlighting that the Latent Diffusion class of methods requires a better understanding of inversion for the VAE, and not simply inversion of the diffusion process.
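The link between the denoiser norm and proximity to the training set can be illustrated with a closed-form toy. The sketch below is not SimA itself, which queries a trained network: it assumes additive Gaussian noise of scale sigma with no signal scaling and an ideal denoiser fitted to a tiny hypothetical training set, for which the expected noise is the offset from a Gaussian-kernel-weighted mean of the training points.

```python
import math


def predicted_noise_norm(x, train_set, sigma):
    """Norm of the ideal predicted noise (x - kernel_mean(x)) / sigma, where
    kernel_mean is the Gaussian-kernel-weighted mean of the training points."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in train_set]
    nearest = min(sq_dists)  # subtract before exponentiating for stability
    weights = [math.exp(-(d - nearest) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    mean = [
        sum(w * xi[k] for w, xi in zip(weights, train_set)) / total
        for k in range(len(x))
    ]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mean))) / sigma


# Hypothetical two-dimensional training set and queries.
train_set = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
member_score = predicted_noise_norm((4.0, 0.0), train_set, sigma=0.5)
non_member_score = predicted_noise_norm((1.0, 1.0), train_set, sigma=0.5)
```

A training point sits almost exactly on its own kernel-weighted mean, so its score is near zero, while a held-out point is pulled toward the nearest training sample and receives a larger score. Thresholding that single score is what makes a one-query attack plausible.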

Table of Contents

cs.CV [Back]

[1] Pathological Truth Bias in Vision-Language Models cs.CVPDF

[2] Scale and Rotation Estimation of Similarity-Transformed Images via Cross-Correlation Maximization Based on Auxiliary Function Method cs.CVPDF

[3] Robust Object Detection for Autonomous Driving via Curriculum-Guided Group Relative Policy Optimization cs.CVPDF

[4] Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation cs.CVPDF

[5] Sequential Token Merging: Revisiting Hidden States cs.CVPDF

[6] Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects cs.CVPDF

[7] Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment cs.CV | cs.AI | cs.LGPDF

[8] IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism cs.CV | cs.AI | cs.LGPDF

[9] LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning cs.CV | cs.AI | cs.LGPDF

[10] CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models cs.CV | cs.AIPDF

[11] MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning cs.CV | cs.AIPDF

[12] VideoScore2: Think before You Score in Generative Video Evaluation cs.CV | cs.AI | cs.CLPDF

[13] TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses cs.CVPDF

[14] MMPB: It’s Time for Multi-Modal Personalization cs.CV | cs.AIPDF

[15] Seeing Isn’t Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN cs.CV | cs.AIPDF

[16] Learning Temporal Saliency for Time Series Forecasting with Cross-Scale Attention cs.CV | cs.LGPDF

[17] Multimodal Slice Interaction Network Enhanced by Transfer Learning for Precise Segmentation of Internal Gross Tumor Volume in Lung Cancer PET/CT Imaging cs.CV | cs.AIPDF

[18] ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models cs.CVPDF

[19] Learning KAN-based Implicit Neural Representations for Deformable Image Registration cs.CVPDF

[20] Convolutional Set Transformer cs.CV | cs.AI | cs.LGPDF

[21] TY-RIST: Tactical YOLO Tricks for Real-time Infrared Small Target Detection cs.CV | cs.AI | cs.LGPDF

[22] FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning cs.CVPDF

[23] Brain Tumor Classification from MRI Scans via Transfer Learning and Enhanced Feature Representation cs.CVPDF

[24] ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View cs.CVPDF

[25] Disentangling Static and Dynamic Information for Reducing Static Bias in Action Recognition cs.CVPDF

[26] Desensitizing for Improving Corruption Robustness in Point Cloud Classification through Adversarial Training cs.CVPDF

[27] Geometry-Aware Losses for Structure-Preserving Text-to-Sign Language Generation cs.CV | cs.CLPDF

[28] Planning with Unified Multimodal Models cs.CVPDF

[29] Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy cs.CVPDF

[30] Sensor-Adaptive Flood Mapping with Pre-trained Multi-Modal Transformers across SAR and Multispectral Modalities cs.CV | cs.AIPDF

[31] GeLoc3r: Enhancing Relative Camera Pose Regression with Geometric Consistency Regularization cs.CV | cs.AIPDF

[32] MMeViT: Multi-Modal ensemble ViT for Post-Stroke Rehabilitation Action Recognition cs.CV | cs.AIPDF

[33] Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis cs.CVPDF

[34] FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection cs.CV | cs.LGPDF

[35] Follow-Your-Preference: Towards Preference-Aligned Image Inpainting cs.CVPDF

[36] CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP cs.CV | cs.AIPDF

[37] Deep Learning for Oral Health: Benchmarking ViT, DeiT, BEiT, ConvNeXt, and Swin Transformer cs.CV | 68U10: Image processingPDF

[38] HTMA-Net: Towards Multiplication-Avoiding Neural Networks via Hadamard Transform and In-Memory Computing cs.CV | cs.AIPDF

[39] Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM cs.CVPDF

[40] Stochastic Interpolants via Conditional Dependent Coupling cs.CVPDF

[41] Benchmarking DINOv3 for Multi-Task Stroke Analysis on Non-Contrast CT cs.CVPDF

[42] Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents cs.CVPDF

[43] Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction cs.CVPDF

[44] TRAX: TRacking Axles for Accurate Axle Count Estimation cs.CV | cs.AIPDF

[45] Patch Rebirth: Toward Fast and Transferable Model Inversion of Vision Transformers cs.CV | cs.AIPDF

[46] Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection cs.CV | cs.AIPDF

[47] TATTOO: Training-free AesTheTic-aware Outfit recOmmendation cs.CVPDF

[48] Increasing the Diversity in RGB-to-Thermal Image Translation for Automotive Applications cs.CVPDF

[49] LiDAR-based Human Activity Recognition through Laplacian Spectral Analysis cs.CV | cs.HCPDF

[50] Learning Regional Monsoon Patterns with a Multimodal Attention U-Net cs.CV | cs.AI | cs.LGPDF

[51] SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction cs.CVPDF

[52] Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing cs.CV | cs.AIPDF

[53] Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification cs.CVPDF

[54] Seeing Symbols, Missing Cultures: Probing Vision-Language Models’ Reasoning on Fire Imagery and Cultural Meaning cs.CV | cs.AI | cs.CLPDF

[55] C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection cs.CVPDF

[56] Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning cs.CVPDF

[57] LRPO: Enhancing Blind Face Restoration through Online Reinforcement Learning cs.CVPDF

[58] DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice cs.CV | cs.AIPDF

[59] Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling cs.CV | cs.AIPDF

[60] Test-time Uncertainty Estimation for Medical Image Registration via Transformation Equivariance cs.CVPDF

[61] GRAPE: Let GPRO Supervise Query Rewriting by Ranking for Retrieval cs.CVPDF

[62] UniPose: Unified Cross-modality Pose Prior Propagation towards RGB-D data for Weakly Supervised 3D Human Pose Estimation cs.CVPDF

[63] WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving cs.CVPDF

[64] SPIKE-RL: Video-LLMs meet Bayesian Surprise cs.CV | cs.CLPDF

[65] FM-SIREN & FM-FINER: Nyquist-Informed Frequency Multiplier for Implicit Neural Representation with Periodic Activation cs.CVPDF

[66] FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing cs.CV | cs.CLPDF

[67] 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras cs.CV | cs.LGPDF

[68] No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation cs.CVPDF

[69] Robust Multi-Modal Face Anti-Spoofing with Domain Adaptation: Tackling Missing Modalities, Noisy Pseudo-Labels, and Model Degradation cs.CVPDF

[70] RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation cs.CVPDF

[71] Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos cs.CVPDF

[72] Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional cs.CV | cs.CL | cs.LGPDF

[73] Enhancing Polyp Segmentation via Encoder Attention and Dynamic Kernel Update cs.CV | cs.AIPDF

[74] Evaluating point-light biological motion in multimodal large language models cs.CV | cs.AIPDF

[75] Imaging-Based Mortality Prediction in Patients with Systemic Sclerosis cs.CV | cs.AIPDF

[76] Calibrated and Resource-Aware Super-Resolution for Reliable Driver Behavior Analysis cs.CVPDF

[77] OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction cs.CVPDF

[78] VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement cs.CVPDF