cs.CL

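[1] From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness cs.CL | cs.AI

Linbo Cao, Lihao Sun, Yang Yue

TL;DR: This paper presents the first systematic study of how demographic-based persona assignment affects LLM agent behavior. It finds that task-irrelevant persona prompts degrade agent performance substantially (by up to 26.2%), exposing an overlooked vulnerability in current LLM agentic systems.

Details

Motivation: Persona-induced bias in text generation is well documented, but its effect on agent task performance remains largely unexplored, even though it poses more direct operational risks.

Result: Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations reveals substantial performance variation, with degradation of up to 26.2%. The variation appears across task types and model architectures.

Insight: Persona assignment and simple prompt injection can distort the reliability of an agent's decisions, introducing implicit bias and increasing behavioral volatility. This raises concerns for the safe and robust deployment of LLM agents.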

Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation. While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored, even though such effects pose more direct operational risks. In this work, we present the first systematic case study showing that demographic-based persona assignments can alter LLM agents’ behavior and degrade performance across diverse domains. Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, we uncover substantial performance variations - up to 26.2% degradation, driven by task-irrelevant persona cues. These shifts appear across task types and model architectures, indicating that persona conditioning and simple prompt injections can distort an agent’s decision-making reliability. Our findings reveal an overlooked vulnerability in current LLM agentic systems: persona assignments can introduce implicit biases and increase behavioral volatility, raising concerns for the safe and robust deployment of LLM agents.


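[2] Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction cs.CL | cs.AI | eess.AS

Junjie An, Jingguang Tian, Tianyi Wang, Yu Gao, Xiaofeng Mou

TL;DR: This paper proposes a retrieval-augmented framework for correcting named entity errors in end-to-end automatic speech recognition (ASR). It has two key components: a rephrasing language model (RLM) for named entity recognition followed by candidate retrieval using a phonetic-level edit distance, and a self-taught reasoning model with adaptive chain-of-thought (A-STAR) that adjusts its reasoning depth to task difficulty.

Details

Motivation: End-to-end ASR systems frequently misrecognize domain-specific phrases such as named entities, which can cause catastrophic failures in downstream tasks. Existing LLM-based named entity correction methods have yet to fully exploit the reasoning capabilities of LLMs.

Result: On the AISHELL-1 and Homophone datasets, the method achieves relative reductions in named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.

Insight: The novelty lies in combining retrieval-augmented generation (RAG) with adaptive chain-of-thought reasoning, adjusting reasoning depth dynamically to handle correction tasks of varying difficulty, and using a phonetic-level edit distance for candidate retrieval to improve accuracy.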

Abstract: End-to-end automatic speech recognition (ASR) systems frequently misrecognize domain-specific phrases like named entities, which can cause catastrophic failures in downstream tasks. A new family of named entity correction methods based on large language models (LLMs) has recently emerged. However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty. Experiments on the AISHELL-1 and Homophone datasets demonstrate the effectiveness of our method, which achieves relative reductions in the named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.
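The candidate retrieval step can be illustrated with a minimal sketch, assuming entities are stored as phoneme sequences and ranked by Levenshtein distance over phonemes. The lexicon, the phoneme split, and the function names `edit_distance` and `retrieve_candidates` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete a phoneme from a
                            curr[j - 1] + 1,      # insert a phoneme into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def retrieve_candidates(query: Sequence[str],
                        lexicon: Dict[str, List[str]],
                        top_k: int = 2) -> List[Tuple[str, int]]:
    """Return the top_k entities closest to the query in phoneme space."""
    scored = [(name, edit_distance(query, phones))
              for name, phones in lexicon.items()]
    scored.sort(key=lambda item: (item[1], item[0]))  # distance, then name
    return scored[:top_k]


# Hypothetical entity lexicon with an illustrative phoneme split.
LEXICON = {
    "Zhang Wei": ["zh", "ang", "w", "ei"],
    "Zhang Hui": ["zh", "ang", "h", "ui"],
    "Chen Wei": ["ch", "en", "w", "ei"],
}

# Phonemes of a (hypothetical) misrecognized entity span.
recognized = ["zh", "an", "w", "ei"]
candidates = retrieve_candidates(recognized, LEXICON)
print(candidates)
```

In a pipeline of this shape, the retrieved candidates would be passed to the reasoning model, which decides whether any of them should replace the recognized span.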


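[3] Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática cs.CL | cs.CV

Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila

TL;DR: This chapter gives a systematic introduction to multimodal large language models (MLLMs), covering their theoretical foundations, representative models, and practical techniques for building multimodal pipelines with LangChain and LangGraph. It also discusses current challenges and future trends.

Details

Motivation: To bridge the gap between MLLM theory and practice by giving researchers and developers a guide that runs from core principles to practical deployment.

Result: As a survey chapter, it reports no quantitative experimental results. It reviews the state of the field and provides a publicly available code repository as supplementary material.

Insight: The chapter ties the theoretical foundations and representative models of MLLMs to modern engineering practice built on LangChain and LangGraph (for example, prompt engineering and pipeline construction), offering a view from theory through to application.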

Abstract: Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: https://github.com/neemiasbsilva/MLLMs-Teoria-e-Pratica. Finally, the chapter discusses the challenges and highlights promising trends.


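[4] propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale cs.CL

Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries

TL;DR: This paper introduces propella-1, a family of multi-property document annotation models for LLM data curation at scale. The models produce structured JSON annotations for text documents across 18 properties in six categories and support 57 languages. The authors also release propella-annotations, a dataset of over three billion document annotations, and use it for a multi-dimensional compositional analysis of major pretraining datasets, revealing substantial differences that single-score approaches cannot capture.

Details

Motivation: Data curation for LLM pretraining relies mainly on a single scalar quality score produced by a small classifier. Such a score conflates multiple quality dimensions, is inflexible, and is not interpretable. Multi-property annotation is meant to provide a finer-grained, interpretable assessment of data quality.

Result: Evaluated against a frontier commercial LLM as a reference annotator, the 4B-parameter propella-1 model achieves higher agreement than much larger general-purpose models.

Insight: The novelty lies in replacing single-score classifiers with small, multilingual, multi-property LLMs that emit structured output. This gives data curation a more flexible and interpretable multi-dimensional view, and the released models and annotation dataset support finer-grained analysis and filtering by the community.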

Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.
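To illustrate why multi-property annotations allow more flexible filtering than a single score, here is a minimal sketch. The field names (`content_quality`, `reasoning_depth`, `safety`), their value ranges, and the records themselves are hypothetical; the abstract does not specify the actual propella-1 schema.

```python
import json

# Hypothetical annotation records. The field names and scales below are
# illustrative assumptions, not the real propella-1 schema.
RAW = """[
  {"doc_id": "a", "content_quality": 4, "reasoning_depth": 3, "safety": "safe"},
  {"doc_id": "b", "content_quality": 5, "reasoning_depth": 1, "safety": "safe"},
  {"doc_id": "c", "content_quality": 2, "reasoning_depth": 4, "safety": "unsafe"}
]"""


def select(annotations, min_quality, min_reasoning):
    """Keep safe documents that clear both per-property thresholds."""
    return [
        a["doc_id"]
        for a in annotations
        if a["safety"] == "safe"
        and a["content_quality"] >= min_quality
        and a["reasoning_depth"] >= min_reasoning
    ]


annotations = json.loads(RAW)

# The same annotations support different data mixes without re-scoring.
general_mix = select(annotations, min_quality=4, min_reasoning=1)
reasoning_mix = select(annotations, min_quality=3, min_reasoning=3)
print(general_mix, reasoning_mix)
```

With a single scalar score, both mixes would have to be cut from one ranking; with separate properties, each mix applies its own thresholds to the same annotations.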


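[5] Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR cs.CL | cs.AI

Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee

TL;DR: This paper proposes PACED-RL, a post-training framework that reinterprets the partition function in GFlowNets as a per-prompt expected-reward (i.e., online accuracy) signal. It uses this signal to prioritize informative training prompts and to drive an error-prioritized replay, improving the sample efficiency of distribution-matching training for LLMs.

Details

Motivation: Reward-maximizing RL methods improve LLM reasoning performance but often reduce output diversity. GFlowNet-based methods match a target distribution while jointly learning its partition function, yet they treat the partition function only as a normalizer and leave unused the per-prompt expected-accuracy information it carries, which limits sample efficiency.

Result: Extensive experiments across diverse benchmarks show strong performance improvements over GRPO and prior GFlowNet approaches.

Insight: The core contribution is a theoretical relationship between the partition function and per-prompt accuracy estimates, which recasts the partition function as a difficulty scheduler that guides prompt prioritization and replay. Both components reuse information already produced during GFlowNet training, amortizing the compute overhead.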

Abstract: Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for a more sample efficient distribution-matching training for LLMs.


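[6] Learning Ordinal Probabilistic Reward from Preferences cs.CL

Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo

TL;DR: This paper proposes a new reward modeling paradigm, the Probabilistic Reward Model (PRM), which treats reward as a random variable rather than a deterministic scalar and learns a full probability distribution over the quality of each response. Its discrete realization, the Ordinal Probabilistic Reward Model (OPRM), discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, the authors propose a data-efficient training strategy, Region Flooding Tuning (RgFT), which uses quality-level annotations to guide the model to concentrate probability mass within the corresponding rating sub-regions, so that rewards better reflect absolute text quality.

Details

Motivation: Existing reward models follow either the generative (GRM) or the discriminative (DRM) paradigm, and both have limitations: GRMs typically require costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack a probabilistic interpretation. The paper proposes a paradigm that learns a probability distribution over reward to address these issues.

Result: Experiments on multiple reward model benchmarks show accuracy improvements of 2.9% to 7.4% over prior reward models, with strong performance and data efficiency. Analysis of the score distribution indicates that the method captures absolute quality as well as relative rankings.

Insight: The novelty lies in moving reward modeling from a deterministic scalar to a probability distribution, with OPRM as a practical discrete realization and RgFT as an efficient training strategy. This gives reward models a probabilistic interpretation and better calibration of absolute quality while reducing data requirements.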

Abstract: Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by 2.9% to 7.4% compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
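The idea of a reward expressed as a distribution over ordinal ratings can be sketched as follows. This is an illustration only: the 1 to 5 rating scale, the logits, the softmax parameterization, and the independence assumption used for the pairwise comparison are all assumptions, not the paper's formulation.

```python
import math

RATINGS = [1, 2, 3, 4, 5]  # assumed ordinal scale, lowest to highest


def softmax(logits):
    """Turn per-rating logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_rating(probs, ratings=RATINGS):
    """Scalar summary of the distribution: its mean rating."""
    return sum(p * r for p, r in zip(probs, ratings))


def prob_a_beats_b(pa, pb):
    """P(rating_A > rating_B) + 0.5 * P(tie), assuming independent ratings."""
    win = sum(pa[i] * pb[j] for i in range(len(pa)) for j in range(i))
    tie = sum(x * y for x, y in zip(pa, pb))
    return win + 0.5 * tie


# Made-up logits for two responses to the same prompt.
dist_a = softmax([-2.0, -1.0, 0.0, 2.0, 1.0])  # mass near rating 4
dist_b = softmax([1.0, 2.0, 0.0, -1.0, -2.0])  # mass near rating 2

print(expected_rating(dist_a), expected_rating(dist_b))
print(prob_a_beats_b(dist_a, dist_b))
```

A distribution of this kind yields both a relative comparison (the pairwise probability) and an absolute reading (where the mass sits on the rating scale), which is the distinction the abstract draws against uncalibrated scalar scores.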


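[7] $\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models cs.CL

Yuang Cai, Yuyu Yuan

TL;DR: This paper proposes Experiential Knowledge Distillation (X-KD), a general knowledge distillation framework for large language models. Inspired by experiential learning theory and inverse reinforcement learning, it uses Approximated Variational Reward Imitation Learning (AVRIL) to jointly model the teacher's original reward function and perform policy distillation, so that the student learns in the teacher's original learning environment rather than only imitating its behavior.

Details

Motivation: Existing knowledge distillation methods for LLMs focus on imitating teacher behavior and overlook the original learning environment that shaped the teacher's knowledge. The paper aims to let the student learn from the teacher's original learning experience.

Result: X-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning. It also achieves a better performance-diversity trade-off and better data efficiency than baseline KD approaches.

Insight: The core idea is to recast knowledge distillation as having the student learn in the teacher's original learning environment, realized by jointly modeling the reward function and the policy. The resulting framework is general and flexible, applying to both sequence-level and divergence-based distillation methods.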

Abstract: Knowledge Distillation (KD) for Large Language Models (LLMs) has become increasingly important as models grow in size and complexity. While existing distillation approaches focus on imitating teacher behavior, they often overlook the original learning environment that shaped the teacher’s knowledge. Inspired by the experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that enables student models to learn in the teacher’s original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher’s original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation demonstrates that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of our approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. Additionally, $\mathcal{X}$-KD achieves better performance-diversity trade-off and data efficiency than baseline KD approaches.


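[8] MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs cs.CL | cs.AI | cs.CV | eess.IV

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian

TL;DR: This paper introduces MedXIAOHE, a medical vision-language foundation model aimed at general-purpose medical understanding and reasoning in real-world clinical applications. It integrates heterogeneous medical corpora through an entity-aware continual pretraining framework and incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning and low-hallucination long-form report generation.

Details

Motivation: To build a medical multimodal LLM that advances general-purpose medical understanding and reasoning in real clinical use, addressing incomplete knowledge coverage and long-tail gaps such as rare diseases, and improving reliability and instruction adherence.

Result: MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.

Insight: Contributions include an entity-aware continual pretraining framework for organizing heterogeneous medical corpora; reinforcement learning and tool-augmented agentic training that integrate diverse medical reasoning patterns and yield verifiable multi-step diagnosis; and user-preference rubrics, evidence-grounded reasoning, and low-hallucination report generation for reliability in real-world use. Combining structured knowledge integration with agentic training to make medical reasoning more systematic is a direction worth noting.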

Abstract: We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.


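[9] Left-right asymmetry in predicting brain activity from LLMs’ representations emerges with their formal linguistic competence cs.CL | cs.AI | q-bio.NC

Laurent Bonnasse-Gahot, Christophe Pallier

TL;DR: This study examines how the left-right hemispheric asymmetry that arises when LLM internal representations are used to predict brain activity (with the left hemisphere better predicted) relates to the development of the model's formal linguistic competence during training. Analyzing OLMo-2 7B and Pythia at different training checkpoints with English and French fMRI data, it finds that the asymmetry emerges together with formal linguistic abilities (such as distinguishing grammatical acceptability and producing well-formed text) and is unrelated to performance on arithmetic, Dyck language tasks, or text-based tasks involving world knowledge and reasoning.

Details

Motivation: To understand which competence acquired by LLMs during training drives the left-right asymmetry in brain-activity prediction, in which the left hemisphere improves more than the right.

Result: For OLMo-2 7B and Pythia, using English and French fMRI data, the asymmetry in brain scores evolves in step with performance on formal linguistic tasks (such as grammatical acceptability judgments) and does not correlate with performance on arithmetic, Dyck language, or knowledge and reasoning tasks.

Insight: The study links the asymmetry in brain predictivity specifically to the acquisition of formal linguistic competence rather than to general cognitive abilities. This offers computational-model evidence relevant to the lateralization of language processing in the brain and underlines the role of formal grammatical knowledge.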

Abstract: When humans and large language models (LLMs) process the same text, activations in the LLMs correlate with brain activity measured, e.g., with functional magnetic resonance imaging (fMRI). Moreover, it has been shown that, as the training of an LLM progresses, the performance in predicting brain activity from its internal activations improves more in the left hemisphere than in the right one. The aim of the present work is to understand which kind of competence acquired by the LLMs underlies the emergence of this left-right asymmetry. Using the OLMo-2 7B language model at various training checkpoints and fMRI data from English participants, we compare the evolution of the left-right asymmetry in brain scores alongside performance on several benchmarks. We observe that the asymmetry co-emerges with the formal linguistic abilities of the LLM. These abilities are demonstrated in two ways: by the model’s capacity to assign a higher probability to an acceptable sentence than to a grammatically unacceptable one within a minimal contrasting pair, or its ability to produce well-formed text. By contrast, the left-right asymmetry does not correlate with performance on arithmetic or Dyck language tasks, nor with text-based tasks involving world knowledge and reasoning. We generalize these results to another family of LLMs (Pythia) and another language, namely French. Our observations indicate that the left-right asymmetry in brain predictivity matches the progress in formal linguistic competence (knowledge of linguistic patterns).


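[10] BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models cs.CL

Jiangxi Chen, Qian Liu

TL;DR: This paper presents BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. It is built from 200 professional multiple-choice problems from the Global Fortune-teller Competition (2021 to 2025), each requiring structured inference over a fixed symbolic chart and interacting temporal conditions. The study evaluates contemporary language models in a multi-turn setting, analyzes performance variation across temporal difficulty, reasoning domains, and inference protocols, and introduces a lightweight Structured Reasoning Protocol that constrains inference order. Models perform above chance but far from saturation, are sensitive to temporal composition and reasoning order, and fail systematically on precise temporal localization and multi-condition symbolic judgments.

Details

Motivation: Anecdotal or prompt-driven evaluations lack objectivity and control. The paper builds a standardized benchmark to assess symbolic and temporally compositional reasoning in LLMs systematically, giving a more accurate measure of performance on complex structured reasoning tasks.

Result: On BaziQA-Benchmark, contemporary language models consistently outperform chance but remain far from saturation. Performance varies markedly across temporal difficulty, reasoning domains, and inference protocols, with systematic failures on precise temporal localization and multi-condition symbolic judgments. The benchmark supports controlled comparison across years, domains, and model families; the emphasis is on limitations shared across models rather than on any single model's ranking.

Insight: Contributions include: 1) a standardized benchmark drawn from a professional domain (fortune-telling competitions) that targets symbolic and temporally compositional reasoning, an area existing evaluations cover poorly; 2) a lightweight Structured Reasoning Protocol that probes model behavior by constraining inference order without adding domain knowledge. By combining symbolic reasoning with temporal conditions, the study highlights the fragility of LLMs in complex multi-step reasoning and points to concrete directions for improvement.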

Abstract: We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021–2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.


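[11] When Words Don’t Mean What They Say: Figurative Understanding in Bengali Idioms cs.CL

Adib Sakhawat, Shamim Ara Parveen, Md Ruhul Amin, Shamim Al Mahmud, Md Saiful Islam

TL;DR: Targeting the low-resource language Bengali, this paper builds a large-scale, culturally grounded idiom dataset of 10,361 idioms annotated under a detailed 19-field schema. Evaluating 30 state-of-the-art multilingual and instruction-tuned LLMs on idiom understanding, it finds that no model exceeds 50% accuracy, far below the human level of 83.4%, exposing the limits of existing models in cross-linguistic and cultural reasoning.

Details

Motivation: LLMs struggle to understand figurative language such as idioms in low-resource languages, Bengali in particular, because existing models lack sufficient training data and evaluation benchmarks for these languages and cultural contexts.

Result: On the new Bengali idiom understanding benchmark, none of the 30 LLMs surpasses 50% accuracy, while human performance reaches 83.4%, a large gap between models and humans.

Insight: The paper contributes a large-scale, culturally grounded Bengali idiom dataset with a comprehensive annotation schema, providing infrastructure for figurative language understanding in low-resource languages. The work also underlines the value of including a cultural dimension in model evaluation and offers a dataset construction approach that can extend to other languages.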

Abstract: Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially for low-resource languages. To address this, we introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms. Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process, that captures its semantic, syntactic, cultural, and religious dimensions, providing a rich, structured resource for computational linguistics. To establish a robust benchmark for Bangla figurative language understanding, we evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning. Our results reveal a critical performance gap, with no model surpassing 50% accuracy, a stark contrast to significantly higher human performance (83.4%). This underscores the limitations of existing models in cross-linguistic and cultural reasoning. By releasing the new idiom dataset and benchmark, we provide foundational infrastructure for advancing figurative language understanding and cultural grounding in LLMs for Bengali and other low-resource languages.


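[12] SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents cs.CL

Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha

TL;DR: This paper introduces SciAgentGym, an environment for evaluating multi-step scientific tool use by LLM agents, with 1,780 domain-specific tools across four natural science disciplines. It exposes a bottleneck in how current state-of-the-art models handle complex scientific tool use and proposes SciForge, a data synthesis method that generates logic-aware training trajectories. The resulting fine-tuned SciAgent-8B outperforms the much larger Qwen3-VL-235B-Instruct.

Details

Motivation: Current benchmarks largely overlook agents' ability to orchestrate tools for rigorous scientific workflows, so a dedicated benchmark is needed to evaluate and improve multi-step tool use in scientific reasoning.

Result: Even for a leading model such as GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend. SciAgent-8B, fine-tuned on SciForge trajectories, outperforms Qwen3-VL-235B-Instruct and shows positive cross-domain transfer of scientific tool-use capabilities.

Insight: SciForge models the tool action space as a dependency graph to generate logic-aware training trajectories, which improves execution of long-horizon, multi-step scientific workflows. The work also builds a scalable interactive environment (SciAgentGym) and a tiered evaluation suite (SciAgentBench) designed to stress-test agents from elementary actions to long-horizon workflows.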

Abstract: Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents’ ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
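Modeling a tool action space as a dependency graph can be sketched as follows. This is a minimal illustration, not the SciForge method: the tool names and the `DEPS` graph are hypothetical, and the sketch only derives one dependency-respecting call order for a target tool using the standard-library `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical tools mapped to the tools whose outputs they require.
DEPS = {
    "load_structure": set(),
    "compute_energy": {"load_structure"},
    "optimize_geometry": {"load_structure", "compute_energy"},
    "plot_spectrum": {"optimize_geometry"},
    "convert_units": set(),
}


def required_tools(target, deps):
    """Collect the target tool and all of its transitive prerequisites."""
    needed, stack = set(), [target]
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(deps[tool])
    return needed


def trajectory(target, deps):
    """Return a call order in which every tool follows its prerequisites."""
    needed = required_tools(target, deps)
    subgraph = {tool: deps[tool] & needed for tool in needed}
    return list(TopologicalSorter(subgraph).static_order())


steps = trajectory("plot_spectrum", DEPS)
print(steps)
```

Because each generated sequence respects the prerequisite edges, a trajectory built this way never calls a tool before the inputs it depends on exist; tools unrelated to the target (here `convert_units`) are left out.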


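[13] TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution cs.CL

Tejas Anvekar, Junha Park, Rajat Jha, Devanshu Gupta, Poojah Ganesan

TL;DR: TraceBack is a multi-agent decomposition framework for table question answering (QA) that provides fine-grained, cell-level attribution. It prunes tables, decomposes questions, and aligns answers with their supporting cells to make answers more verifiable.

Details

Motivation: Existing table QA systems rarely provide fine-grained attribution, so even correct answers lack verifiable grounding, which limits trust in high-stakes settings. A scalable cell-level attribution method is needed.

Result: TraceBack substantially outperforms strong baselines on CITEBench, a benchmark built from ToTTo, FetaQA, and AITQA. The proposed reference-less metric FairScore closely tracks human judgments and preserves relative method rankings.

Insight: Contributions include a modular multi-agent framework for scalable cell-level attribution and the reference-less metric FairScore for systematic evaluation, together supporting interpretable and scalable evaluation of table QA.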

Abstract: Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.
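Cell-level attribution precision and recall can be illustrated with a short sketch. Note that this is the plain reference-based computation against gold cell labels, not FairScore, which the abstract describes as reference-less and based on atomic facts. The `(row, column)` cell encoding and the example cells are hypothetical.

```python
def attribution_scores(predicted, gold):
    """Precision, recall, and F1 of predicted supporting cells vs. gold cells."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical cells identified by (row index, column name).
gold_cells = {(2, "Revenue"), (3, "Revenue")}
predicted_cells = {(2, "Revenue"), (3, "Revenue"), (3, "Year")}

precision, recall, f1 = attribution_scores(predicted_cells, gold_cells)
print(precision, recall, f1)
```

Here the system cites every gold cell (full recall) but also one unneeded cell, which lowers precision; a reference-less metric would aim to estimate the same two quantities without access to `gold_cells`.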


cs.CV

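[14] Thermal Imaging for Contactless Cardiorespiratory and Sudomotor Response Monitoring cs.CV

Constantino Álvarez Casado, Mohammad Rahman, Sasan Sharifipour, Nhi Nguyen, Manuel Lage Cañellas

TL;DR: This paper presents a contactless method for extracting biosignals from facial thermal video to estimate electrodermal activity (EDA), heart rate (HR), and breathing rate (BR). A signal-processing pipeline that tracks anatomical regions, applies spatial aggregation, and separates slow sudomotor trends from faster cardiorespiratory components extracts all three signals from the same thermal video.

Details

Motivation: Visible-light methods cannot access EDA, a standard marker of sympathetic activation. Thermal infrared imaging captures skin temperature changes driven by autonomic regulation, so it can potentially provide contactless estimates of EDA, HR, and BR and cover what visible-light methods miss.

Result: The evaluation uses 31 sessions from the public SIMULATOR STUDY 1 (SIM1) driver monitoring dataset. The best fixed EDA configuration (nose region, exponential moving average) reaches a mean absolute correlation of 0.40 ± 0.23 against palm EDA, with individual sessions reaching 0.89. BR estimation achieves a mean absolute error of 3.1 ± 1.1 bpm, and HR estimation yields 13.8 ± 7.5 bpm MAE, limited by the low camera frame rate (7.5 Hz).

Insight: The work extracts EDA, HR, and BR together from thermal imaging, using a pipeline that includes OMIT decomposition for HR and averaging of nasal and cheek signals for BR. It provides baseline performance bounds and design guidance for contactless biosignal estimation, and reports signal polarity alternation across sessions, short thermodynamic latency, and condition-dependent and demographic effects on extraction quality.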

Abstract: Thermal infrared imaging captures skin temperature changes driven by autonomic regulation and can potentially provide contactless estimation of electrodermal activity (EDA), heart rate (HR), and breathing rate (BR). While visible-light methods address HR and BR, they cannot access EDA, a standard marker of sympathetic activation. This paper characterizes the extraction of these three biosignals from facial thermal video using a signal-processing pipeline that tracks anatomical regions, applies spatial aggregation, and separates slow sudomotor trends from faster cardiorespiratory components. For HR, we apply an orthogonal matrix image transformation (OMIT) decomposition across multiple facial regions of interest (ROIs), and for BR we average nasal and cheek signals before spectral peak detection. We evaluate 288 EDA configurations and the HR/BR pipeline on 31 sessions from the public SIMULATOR STUDY 1 (SIM1) driver monitoring dataset. The best fixed EDA configuration (nose region, exponential moving average) reaches a mean absolute correlation of $0.40 \pm 0.23$ against palm EDA, with individual sessions reaching 0.89. BR estimation achieves a mean absolute error of $3.1 \pm 1.1$ bpm, while HR estimation yields $13.8 \pm 7.5$ bpm MAE, limited by the low camera frame rate (7.5 Hz). We report signal polarity alternation across sessions, short thermodynamic latency for well-tracked signals, and condition-dependent and demographic effects on extraction quality. These results provide baseline performance bounds and design guidance for thermal contactless biosignal estimation.
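Separating a slow trend from a faster breathing component and then picking a spectral peak can be sketched on synthetic data. Only the 7.5 Hz sampling rate comes from the abstract; the synthetic signal, the smoothing factor `alpha`, the 0.1 to 0.5 Hz search band, and the function names are assumptions, and the paper's actual pipeline may differ.

```python
import cmath
import math


def ema(signal, alpha):
    """Exponential moving average, used here as a slow-trend estimate."""
    out, state = [], signal[0]
    for x in signal:
        state = alpha * x + (1 - alpha) * state
        out.append(state)
    return out


def dominant_rate_bpm(signal, fs, lo_hz, hi_hz):
    """Strongest DFT frequency inside [lo_hz, hi_hz], in cycles per minute."""
    n = len(signal)
    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if lo_hz <= freq <= hi_hz:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(signal))
            power = abs(coeff) ** 2
            if power > best_power:
                best_freq, best_power = freq, power
    return 60.0 * best_freq


FS = 7.5   # camera frame rate in Hz, as reported in the abstract
N = 450    # 60 seconds of synthetic samples

# Synthetic temperature trace: slow upward drift plus 0.25 Hz breathing.
raw = [0.5 * i / N + 0.1 * math.sin(2 * math.pi * 0.25 * i / FS)
       for i in range(N)]

trend = ema(raw, alpha=0.05)                # slow component
fast = [x - t for x, t in zip(raw, trend)]  # faster residual
breathing_bpm = dominant_rate_bpm(fast, FS, lo_hz=0.1, hi_hz=0.5)
print(breathing_bpm)
```

The 0.25 Hz component corresponds to 15 breaths per minute. At 7.5 Hz the Nyquist limit is 3.75 Hz (225 bpm), which leaves little headroom for resolving cardiac pulsations and is consistent with the abstract's remark that HR accuracy is limited by the frame rate.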


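[15] LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens cs.CV

Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov

TL;DR: This paper proposes LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers architecture to address catastrophic forgetting and discretization artifacts in unified motion-language generation and understanding. It encodes human motion into a causal continuous latent space and keeps the next-token prediction paradigm in a decoder-only backbone through a lightweight flow-matching head, enabling real-time streaming motion generation.

Details

Motivation: Existing methods usually fine-tune LLMs on paired motion-text data, which can cause catastrophic forgetting of linguistic capabilities because the available data is limited in scale. They also tend to quantize motion into discrete representations to integrate with language models, which introduces substantial jitter artifacts.

Result: Experiments show that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, and is particularly strong at zero-shot motion generation, a step toward a general unified motion-language large model.

Insight: The modality-specific Mixture-of-Transformers architecture extends a pretrained LLM while preserving the base model's language understanding and allowing scalable multimodal adaptation. Encoding motion into a causal continuous latent space and generating through a flow-matching head enables real-time generation and avoids the artifacts caused by discretization.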

Abstract: Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.


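[16] Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues cs.CV

Marco Willi, Melanie Mathys, Michael Graber

TL;DR: This paper studies CLIP-based synthetic image detection. By introducing the SynthCLIC dataset and an interpretability analysis, it shows that CLIP detectors rely mainly on high-level photographic attributes rather than generator-specific artifacts, and it evaluates how well they generalize across generative models.

Details

Motivation: Synthetic image detection methods generalize poorly and struggle with high-quality generative models. The paper also asks which predictive cues CLIP-based detectors actually rely on.

Result: The detectors reach 0.96 mAP on a GAN-based benchmark and 0.92 mAP on the high-quality diffusion dataset SynthCLIC, while generalization across generator families drops to as low as 0.37 mAP.

Insight: The paper introduces SynthCLIC, a paired dataset designed to reduce semantic bias, and analyzes CLIP features with an interpretable linear head and a text-grounded concept model. It finds that detectors rely on high-level photographic attributes, which points to the need for continual model updates and broader training exposure.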

Abstract: Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs. Unfortunately, SID methods often struggle to generalize to novel generative models and often perform poorly in practical settings. CLIP, a foundational vision-language model which yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet, the underlying relevant cues embedded in CLIP-features remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept-model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.


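[17] Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models cs.CV | cs.AI | cs.LG

Ali Subhan, Ashir Raza

TL;DR: This is a reproducibility study of DragDiffusion, a diffusion-based method for interactive point-based image editing that lets users manipulate image content by dragging selected points. Using the authors' released code and the DragBench benchmark, the study reproduces the main ablations of the original paper, covering diffusion timestep selection, LoRA fine-tuning, mask regularization strength, and UNet feature supervision, and finds close agreement with the original qualitative and quantitative trends. Performance is sensitive to a small number of hyperparameters (the optimized timestep and the feature level used for motion supervision), while other components admit broader operating ranges. A multi-timestep latent optimization variant does not improve spatial accuracy and substantially increases computational cost. Overall, the results support DragDiffusion's central claims and clarify the conditions under which they reproduce reliably.

Details

Motivation: To verify the reproducibility of DragDiffusion and to examine how sensitive its performance is to key hyperparameters and how wide the effective range of each component is, so as to establish when the method reproduces reliably.

Result: DragDiffusion is reproduced on DragBench, with qualitative and quantitative results closely matching the trends reported in the original work. Performance is sensitive to the optimized timestep and to the feature level used for motion supervision. The multi-timestep latent optimization variant does not improve spatial accuracy and greatly increases computational cost.

Insight: The study provides a systematic reproducibility analysis of DragDiffusion, identifying the conditions under which its central claims hold. It shows sensitivity to specific hyperparameters (timestep, feature level) and greater robustness of other components (such as LoRA fine-tuning and regularization). It also confirms that single-timestep latent optimization is effective and finds no benefit from multi-timestep optimization, which is useful guidance for practical use.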

Abstract: DragDiffusion is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity-preserving fine-tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors’ released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA-based fine-tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi-timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at https://github.com/AliSubhan5341/DragDiffusion-TMLR-Reproducibility-Challenge.


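[18] What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis cs.CV | cs.AI

Xirui Li, Ming Li, Tianyi Zhou

TL;DR: Using a Frankenstein-style analysis framework, this paper investigates which capabilities reinforcement learning (RL) actually improves in visual reasoning compared with supervised fine-tuning (IN). It finds that RL does not uniformly enhance visual perception; instead it systematically refines computation in the mid-to-late transformer layers, improving vision-to-reasoning alignment and reasoning performance.

Details

Motivation: Although RL with verifiable rewards has become a standard post-training stage for improving visual reasoning in vision-language models, it remains unclear what capabilities RL improves compared with supervised fine-tuning as cold-start initialization (IN). Benchmark gains conflate multiple factors and are hard to attribute to specific skills.

Result: Analysis via causal probing, parameter comparison, and model merging shows that RL induces a consistent inference-time shift primarily in mid-to-late layers, and that these mid-to-late refinements are both transferable via merging and necessary for RL gains (verified by freezing experiments).

Insight: The contribution is a systematic analysis framework (functional localization, update characterization, and transferability testing) revealing that RL's reliable contribution to visual reasoning is refining mid-to-late transformer computation to improve alignment rather than uniformly enhancing perception. This highlights the limits of relying on benchmark evaluation alone to understand multimodal reasoning improvements.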

Abstract: Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Instead, RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL’s reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.


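[19] Prototype-driven fusion of pathology and spatial transcriptomics for interpretable survival prediction cs.CV

Lihe Liu, Xiaoxi Pan, Yinyin Yuan, Lulu Shang

TL;DR: This paper proposes PathoSpatial, an interpretable end-to-end framework that integrates co-registered whole slide images (WSIs) and spatial transcriptomics (ST) data to learn spatially informed prognostic representations. It uses task-guided prototype learning within a multi-level experts architecture to adaptively coordinate unsupervised within-modality discovery with supervised cross-modal aggregation, aiming to strengthen interpretability while preserving discriminative ability.

Details

Motivation: As paired WSI-ST cohorts scale to population level, exploiting their complementary spatial signals for prognosis becomes crucial, yet principled cross-modal fusion strategies for this paradigm are lacking.

Result: On a triple-negative breast cancer cohort with paired ST and WSIs, PathoSpatial shows strong and consistent performance across five survival endpoints, superior or comparable to leading unimodal and multimodal methods.

Insight: The contribution is interpretable cross-modal fusion through task-guided prototype learning and a multi-level experts architecture. It enables post-hoc prototype interpretation and molecular risk decomposition, yields quantitative, biologically grounded explanations, highlights candidate prognostic factors, and serves as a proof of concept for scalable, interpretable multimodal learning in spatial omics-pathology fusion.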

Abstract: Whole slide images (WSIs) enable weakly supervised prognostic modeling via multiple instance learning (MIL). Spatial transcriptomics (ST) preserves in situ gene expression, providing a spatial molecular context that complements morphology. As paired WSI-ST cohorts scale to population level, leveraging their complementary spatial signals for prognosis becomes crucial; however, principled cross-modal fusion strategies remain limited for this paradigm. To this end, we introduce PathoSpatial, an interpretable end-to-end framework integrating co-registered WSIs and ST to learn spatially informed prognostic representations. PathoSpatial uses task-guided prototype learning within a multi-level experts architecture, adaptively orchestrating unsupervised within-modality discovery with supervised cross-modal aggregation. By design, PathoSpatial substantially strengthens interpretability while maintaining discriminative ability. We evaluate PathoSpatial on a triple-negative breast cancer cohort with paired ST and WSIs. PathoSpatial delivers strong and consistent performance across five survival endpoints, achieving superior or comparable performance to leading unimodal and multimodal methods. PathoSpatial inherently enables post-hoc prototype interpretation and molecular risk decomposition, providing quantitative, biologically grounded explanations, highlighting candidate prognostic factors. We present PathoSpatial as a proof-of-concept for scalable and interpretable multimodal learning for spatial omics-pathology fusion.


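[20] Human-Like Coarse Object Representations in Vision Models cs.CV | cs.AI

Andrey Gizdov, Andrea Procopio, Yichen Li, Daniel Harari, Tomer Ullman

TL;DR: This paper studies whether and when vision models acquire human-like coarse object representations. Using a time-to-collision (TTC) behavioral paradigm, the authors introduce a comparison pipeline and an alignment metric, then vary model training time, size, and effective capacity via pruning. Alignment with human behavior follows an inverse U-shaped curve: models match humans best at an intermediate ideal granularity, while small, under-trained, or over-pruned models under-segment into blobs and large, fully trained models over-segment with boundary wiggles. This suggests that human-like coarse representations arise from resource constraints rather than bespoke biases.

Details

Motivation: In intuitive physics tasks, humans represent objects with coarse, volumetric 'bodies' that smooth concavities, trading visual detail for efficient physical prediction, though the internal structure of these bodies is unclear. Segmentation models, by contrast, optimize pixel-accurate masks that may not align with such 'body' representations. The paper asks whether and when vision models learn these human-like coarse object representations.

Result: Alignment between models and human behavior follows an inverse U-shaped curve, with the best match at an intermediate 'ideal body granularity'. Experiments across several manipulations (model training time, size, and degree of pruning) confirm this pattern, pointing to resource constraints as the key factor behind human-like coarse representations.

Insight: The contribution is a framework for evaluating how well the object representations of vision models align with human intuitive-physics representations, together with the finding that an 'ideal body granularity' exists. Viewed objectively, the study reveals a trade-off between model capacity and representation granularity and offers simple knobs for eliciting physics-efficient coarse representations in vision models (for example early checkpoints, modest architectures, or light pruning), supporting resource-rational accounts that balance recognition detail against physical affordances.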

Abstract: Humans appear to represent objects for intuitive physics with coarse, volumetric “bodies” that smooth concavities - trading fine visual details for efficient physical predictions - yet their internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; and an intermediate “ideal body granularity” best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs - early checkpoints, modest architectures, light pruning - for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.


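[21] Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models cs.CV

Ali Abbasi, Mehdi Taghipour, Rahmatollah Beheshti

TL;DR: To address the weakness of medical vision-language models in handling negation, this paper introduces a radiology diagnostic benchmark and a contextual clinical negation dataset, and designs NAST, a layer-specific fine-tuning method based on causal tracing effects, to improve discrimination of negated clinical statements while preserving general vision-language alignment.

Details

Motivation: Medical vision-language models frequently confuse affirmative and negated statements in clinical reports, which can lead to serious errors in safety-critical medical settings.

Result: Experiments show that NAST improves discrimination between affirmative and negated clinical statements on the radiology-specific benchmark without harming general vision-language alignment; code and resources are open source.

Insight: The contribution is turning causal interpretability signals into an optimization rule: causal tracing effects guide layer-specific gradient updates, giving targeted model adaptation for negation handling and a new approach to fine-tuning in safety-critical domains.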

Abstract: Negation is a fundamental linguistic operation in clinical reporting, yet vision-language models (VLMs) frequently fail to distinguish affirmative from negated medical statements. To systematically characterize this limitation, we introduce a radiology-specific diagnostic benchmark that evaluates polarity sensitivity under controlled clinical conditions, revealing that common medical VLMs consistently confuse negated and non-negated findings. To enable learning beyond simple condition absence, we further construct a contextual clinical negation dataset that encodes structured claims and supports attribute-level negations involving location and severity. Building on these resources, we propose Negation-Aware Selective Training (NAST), an interpretability-guided adaptation method that uses causal tracing effects (CTEs) to modulate layer-wise gradient updates during fine-tuning. Rather than applying uniform learning rates, NAST scales each layer’s update according to its causal contribution to negation processing, transforming mechanistic interpretability signals into a principled optimization rule. Experiments demonstrate improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment, highlighting the value of causal interpretability for targeted model adaptation in safety-critical medical settings. Code and resources are available at https://github.com/healthylaife/NAST.
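
The layer-wise scaling idea in this abstract can be illustrated with a minimal sketch. The abstract does not give NAST's exact scaling rule, so the normalization below (each layer's effect divided by the largest effect) is an assumption, and the layer names and effect scores are hypothetical.

```python
def scale_layer_learning_rates(base_lr, causal_effects):
    """Map per-layer causal tracing effects to per-layer learning rates.

    Assumption: each rate is base_lr times the layer's effect divided by the
    largest effect, so the most influential layer trains at base_lr.
    """
    peak = max(causal_effects.values())
    if peak <= 0:
        raise ValueError("at least one layer needs a positive causal effect")
    return {name: base_lr * max(effect, 0.0) / peak
            for name, effect in causal_effects.items()}


def sgd_step(params, grads, layer_lrs):
    """Apply one plain SGD update with a separate learning rate per layer."""
    return {name: [w - layer_lrs[name] * g
                   for w, g in zip(params[name], grads[name])]
            for name in params}


# Hypothetical effect scores for three layers (not taken from the paper).
effects = {"layer_0": 0.1, "layer_6": 0.8, "layer_11": 0.4}
lrs = scale_layer_learning_rates(1e-4, effects)
params = {name: [1.0, -1.0] for name in effects}
grads = {name: [0.5, 0.5] for name in effects}
updated = sgd_step(params, grads, lrs)
```

With identical gradients, the layer with the largest effect moves the most, which is the contrast with a uniform learning rate that the abstract draws.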


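[22] LiDAR-Anchored Collaborative Distillation for Robust 2D Representations cs.CV

Wonjun Jo, Hyunwoo Ha, Kim Ji-Yeon, Hawook Jeong, Tae-Hyun Oh

TL;DR: This paper proposes Collaborative Distillation, a novel self-supervised learning method that uses 3D LiDAR as a supervision signal to make 2D image encoders more robust to noise and adverse weather while retaining their original capabilities on clear scenes.

Details

Motivation: Pre-trained 2D image encoders fall short on robust visual perception beyond clear daytime scenes (for example under noise and adverse weather), so a method is needed to improve their performance in these challenging conditions.

Result: The method outperforms competing methods on multiple downstream tasks across diverse conditions and shows strong generalization.

Insight: The contribution is using 3D LiDAR data as a self-supervision signal to make 2D representations more robust. This improves adaptation to adverse conditions and, owing to LiDAR's characteristics, also strengthens the model's 3D awareness, increasing practicality and adaptability in real-world scenarios.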

Abstract: As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders fall short under noisy and adverse weather conditions beyond clear daytime scenes, which require robust visual perception. To address these issues, we propose a novel self-supervised approach, Collaborative Distillation, which leverages 3D LiDAR as self-supervision to improve robustness to noisy and adverse weather conditions in 2D image encoders while retaining their original capabilities. Our method outperforms competing methods in various downstream tasks across diverse conditions and exhibits strong generalization ability. In addition, our method also improves 3D awareness stemming from LiDAR’s characteristics. This advancement highlights our method’s practicality and adaptability in real-world scenarios.


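[23] Unbiased Gradient Estimation for Event Binning via Functional Backpropagation cs.CV

Jinze Chen, Wei Zhai, Han Han, Tiankai Ma, Yang Cao

TL;DR: This paper proposes a new framework, functional backpropagation, to address the gradient truncation and biased estimation caused by event binning in event-based vision. By synthesizing weak derivatives during backpropagation while keeping the forward output unchanged, it gives unbiased gradient estimates for arbitrary binning functions. Experiments show lower error and faster convergence in optimization tasks, plus gains on downstream tasks such as optical flow and SLAM.

Details

Motivation: Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes, which are usually binned into frames to reuse conventional image processing pipelines. The discontinuity of binning functions truncates gradients and forces algorithms to rely on frame-level features; learning directly from raw events instead yields biased gradient estimates because of the same discontinuity, limiting learning efficiency.

Result: In optimization-based egomotion estimation, RMS error drops by 3.2% and convergence is 1.57× faster; in self-supervised optical flow, EPE drops by 9.4%; in SLAM, RMS error drops by 5.1%, showing broad benefits for event-based visual perception.

Insight: The contribution is lifting the target functions to functionals via integration by parts and reconstructing the cotangent function during backpropagation to compute weak derivatives that match long-range finite differences of both smooth and non-smooth targets without bias. Viewed objectively, the method provides a general gradient estimation framework for event processing that can improve downstream task performance.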

Abstract: Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes called events. To leverage conventional image processing pipelines, events are typically binned into frames. However, binning functions are discontinuous, which truncates gradients at the frame level and forces most event-based algorithms to rely solely on frame-based features. Attempts to directly learn from raw events avoid this restriction but instead suffer from biased gradient estimation due to the discontinuities of the binning operation, ultimately limiting their learning efficiency. To address this challenge, we propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation while keeping the forward output unchanged. The key idea is to exploit integration by parts: lifting the target functions to functionals yields an integral form of the derivative of the binning function during backpropagation, where the cotangent function naturally arises. By reconstructing this cotangent function from the sampled cotangent vector, we compute weak derivatives that provably match long-range finite differences of both smooth and non-smooth targets. Experimentally, our method improves simple optimization-based egomotion estimation with 3.2% lower RMS error and 1.57× faster convergence. On complex downstream tasks, we achieve 9.4% lower EPE in self-supervised optical flow, and 5.1% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception. Source code can be found at https://github.com/chjz1024/EventFBP.
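
The problem this abstract starts from can be seen in a small numeric sketch. It does not implement the paper's functional backpropagation; it only shows, under a hypothetical one-parameter setup (a time shift applied to four made-up event timestamps), that a local derivative through hard binning vanishes while a long-range finite difference still carries a usable signal.

```python
def bin_events(timestamps, shift, bin_width, num_bins):
    """Hard-assign each shifted timestamp to a frame bin and count events."""
    counts = [0] * num_bins
    for t in timestamps:
        index = int((t + shift) // bin_width)
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts


# Hypothetical data: four event timestamps and the frame we want to match.
EVENTS = [0.25, 0.35, 1.25, 2.75]
TARGET = [0, 2, 1, 1]  # the counts produced by a shift of 0.8


def loss(shift):
    """Squared error between the binned frame and the target frame."""
    counts = bin_events(EVENTS, shift, bin_width=1.0, num_bins=4)
    return sum((c - target) ** 2 for c, target in zip(counts, TARGET))


def finite_difference(f, x, step):
    """Central finite difference of f at x with the given step."""
    return (f(x + step) - f(x - step)) / (2 * step)


local_slope = finite_difference(loss, 0.0, 1e-3)      # no event crosses a bin edge
long_range_slope = finite_difference(loss, 0.0, 0.8)  # spans several bin edges
```

Here `local_slope` is zero because the binned frame is piecewise constant in the shift, while `long_range_slope` is negative, indicating that a larger shift lowers the loss. According to the abstract, the synthesized weak derivatives are meant to match differences of the long-range kind.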


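[24] Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models cs.CV | cs.AI | cs.CL

Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan

TL;DR: This paper proposes Attention-Driven Self-Compression (ADSC), a new method for vision token compression in efficient multimodal large language models (MLLMs). It relies on the LLM's own attention mechanism and progressively reduces vision tokens by uniform downsampling at selected layers, forming information bottlenecks that push the model to reorganize and compress information into the remaining tokens.

Details

Motivation: MLLMs incur high computational cost when processing large numbers of vision tokens. Existing pruning methods either operate before the LLM (limited by the diversity of encoder-projector designs) or use heuristics inside the LLM that are incompatible with FlashAttention. The paper aims to use the LLM itself as the optimal guide for compression, addressing both efficiency and generality.

Result: Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7% while preserving 98.2% of the original model's performance. Across multiple benchmarks it outperforms prior pruning methods in both efficiency and accuracy; at high compression ratios in particular, ADSC stays robust while heuristic-based methods degrade sharply.

Insight: The contribution is treating the LLM itself as the guide for compression: because deeper layers naturally transmit vision-to-text information through attention, simple uniform downsampling achieves progressive compression with no extra score computation, auxiliary modules, or attention modification, and with full FlashAttention compatibility. This offers a general, efficient, and easy-to-implement approach to vision token compression.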

Abstract: Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM’s attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
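
The token bookkeeping behind uniform downsampling at selected layers can be sketched as follows. The stride, the bottleneck layer indices, the layer count, and the starting count of 576 tokens are illustrative assumptions rather than values from the abstract, and the transformer layers themselves are left out.

```python
def uniform_downsample(tokens, stride):
    """Keep every `stride`-th token; no importance scores are computed."""
    return tokens[::stride]


def compress_through_layers(vision_tokens, num_layers, bottleneck_layers, stride=2):
    """Return how many vision tokens enter each layer.

    Assumption: downsampling is applied just before a bottleneck layer runs.
    Only the token counts are tracked; the layer computation is omitted.
    """
    counts = []
    tokens = list(vision_tokens)
    for layer in range(num_layers):
        if layer in bottleneck_layers:
            tokens = uniform_downsample(tokens, stride)
        counts.append(len(tokens))
    return counts


# Hypothetical schedule: halve the vision tokens at layers 2, 4, and 6.
counts = compress_through_layers(range(576), num_layers=8,
                                 bottleneck_layers={2, 4, 6})
```

Because the selection depends only on token position, nothing in this step needs attention scores, which is consistent with the abstract's point about staying compatible with FlashAttention.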


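[25] CBEN – A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding cs.CV

Marco Stricker, Masakazu Iwamura, Koichi Kise

TL;DR: This paper presents CloudyBigEarthNet (CBEN), a multimodal remote sensing dataset of paired optical and radar images that deliberately includes cloud-occluded scenes. Existing methods trained on clear-sky images lose substantial performance on cloudy images (23-33 percentage points), and including cloud-occluded data in training yields a relative improvement of 17.2-28.7 percentage points.

Details

Motivation: Optical remote sensing imagery is often degraded by clouds, yet existing machine learning methods usually exclude cloudy images, limiting their usefulness in time-sensitive applications such as natural disaster monitoring. Cloud removal methods suffer from problems such as visual artifacts, so cloud-robust methods that combine optical data with cloud-unaffected radar data are needed.

Result: Evaluated on CBEN, existing state-of-the-art methods lose 23-33 percentage points of average precision (AP) on cloudy images; adapting them to cloudy data during training gives a relative improvement of 17.2-28.7 percentage points.

Insight: The contribution is building CBEN, the first multimodal remote sensing dataset dedicated to cloud occlusion, and showing empirically that including cloudy data in training matters for model robustness. Viewed objectively, the study underlines the need to account for real weather conditions in remote sensing machine learning and provides a benchmark for developing cloud-robust methods.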

Abstract: Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, in the literature cloudless analysis is often performed where cloudy images are excluded from machine learning datasets and methods. Such an approach cannot be applied to time sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloudfree solutions are not failing under such conditions. But cloud removal methods are still actively researched and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud robust methods that are less affected by cloudy weather. Cloud robust methods can be achieved by combining optical data with radar, a modality unaffected by clouds. While many datasets for machine learning combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving relative improvement of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. Code and dataset are publicly available at: https://github.com/mstricker13/CBEN


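[26] IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models cs.CV | cs.AI

Aarish Shah Mohsin, Mohammed Tayyab Ilyas Khan, Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria

TL;DR: This paper presents IndicFairFace, a balanced Indian face dataset of 14,400 images that targets intra-Indian geographical bias in vision-language models (VLMs). The dataset is uniformly balanced across Indian states and gender and is used to quantify and mitigate geographical bias in CLIP-based VLMs; debiasing uses post-hoc Iterative Nullspace Projection while keeping the drop in retrieval accuracy on benchmark datasets below 1.5%.

Details

Motivation: Existing fairness-aware datasets treat Indian as a single category, ignoring the internal diversity of India's 28 states and 8 Union Territories, which leads to representational and geographical bias in VLMs. The paper aims to audit and mitigate this bias by creating a balanced dataset.

Result: IndicFairFace is used to quantify geographical bias in CLIP-based VLMs, and Iterative Nullspace Projection effectively reduces it, with an average drop in retrieval accuracy on benchmark datasets below 1.5%, indicating that the debiasing has little effect on the existing embedding space.

Insight: The contribution is the first balanced face dataset covering India's geographical diversity, combined with post-hoc debiasing to mitigate geographical bias in VLMs. It provides the first benchmark for studying geographical bias in the Indian context and stresses the importance of intra-national diversity.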

Abstract: Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indians being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indian as a single monolithic category. This oversimplification ignores the vast intra-national diversity across the 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address this limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using a post-hoc Iterative Nullspace Projection debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space, as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.


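[27] Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening cs.CV

Wooseok Jeon, Seunghyun Shin, Dongmin Shin, Hae-Gon Jeon

TL;DR: This paper proposes Motion Prior Distillation (MPD), an inference-time distillation technique that addresses the temporal discontinuities and visual artifacts caused by mismatch between the two directional paths in generative inbetweening. It distills the motion residual of the forward path into the backward path to suppress bidirectional mismatch, producing more temporally coherent inbetweening results.

Details

Motivation: Existing inference-time sampling strategies (fusing forward and backward paths in parallel or sequentially) often produce temporal discontinuities and undesirable visual artifacts because the motion priors of the two generated paths are mismatched. The paper aims to align the two paths by distilling the forward path's motion prior, improving the quality of generated in-between frames.

Result: Quantitative evaluation on standard benchmarks and an extensive user study demonstrate the method's effectiveness in practical scenarios and its more temporally coherent inbetweening results.

Insight: The contribution is a training-free inference-time distillation technique that aligns the bidirectional generation paths by distilling the forward path's motion residual, avoiding the denoising ambiguity caused by path ambiguity and improving the temporal coherence of generated video.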

Abstract: Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesirable visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method can deliberately avoid denoising the end-conditioned path which causes the ambiguity of the path, and yield more temporally coherent inbetweening results with the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.


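[28] Channel-Aware Probing for Multi-Channel Imaging cs.CV | cs.LG

Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito

TL;DR: This paper proposes Channel-Aware Probing (CAP) to address the difficulty of reusing pre-trained encoders on multi-channel imaging (MCI) data when channel configurations change. Through Independent Feature Encoding (IFE) and Decoupled Pooling (DCP), CAP exploits the intrinsic diversity across channels and substantially improves probing performance on downstream tasks with the pre-trained encoder frozen.

Details

Motivation: Channel configurations of MCI data differ across datasets, which makes fixed-channel training difficult and prevents pre-trained encoders from transferring directly to new channel settings. Prior work mostly evaluates encoders through full fine-tuning, leaving probing with frozen encoders underexplored, and existing probing strategies transfer poorly to the MCI domain.

Result: On three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, reaches a level comparable to training from scratch, and greatly narrows the gap to full fine-tuning from the same MCI pre-trained checkpoints.

Insight: The contribution is a channel-aware probing framework that encodes each channel independently and pools within channels before aggregating across channels, making effective use of inter-channel diversity in MCI data. This offers a new way to use fixed representations efficiently for downstream tasks with a frozen pre-trained encoder.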

Abstract: Training and evaluating vision encoders on Multi-Channel Imaging (MCI) data remains challenging as channel configurations vary across datasets, preventing fixed-channel training and limiting reuse of pre-trained encoders on new channel settings. Prior work trains MCI encoders but typically evaluates them via full fine-tuning, leaving probing with frozen pre-trained encoders comparatively underexplored. Existing studies that perform probing largely focus on improving representations, rather than how to best leverage fixed representations for downstream tasks. Although the latter problem has been studied in other domains, directly transferring those strategies to MCI yields weak results, even worse than training from scratch. We therefore propose Channel-Aware Probing (CAP), which exploits the intrinsic inter-channel diversity in MCI datasets by controlling feature flow at both the encoder and probe levels. CAP uses Independent Feature Encoding (IFE) to encode each channel separately, and Decoupled Pooling (DCP) to pool within channels before aggregating across channels. Across three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, matches fine-tuning from scratch, and largely reduces the gap to full fine-tuning from the same MCI pre-trained checkpoints. Code can be found in https://github.com/umarikkar/CAP.
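
The order of operations in Decoupled Pooling (pool within each channel first, then aggregate across channels) can be sketched in plain Python. The abstract does not say which pooling or aggregation operators CAP uses, so the mean is an assumption in both steps, and the feature values are made up.

```python
def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def decoupled_pool(per_channel_tokens):
    """Pool tokens within each channel, then aggregate across channels.

    per_channel_tokens[c][t] is the feature vector of token t from channel c,
    as produced by encoding each channel independently with a frozen encoder.
    Assumption: mean pooling is used for both steps.
    """
    channel_summaries = [mean_vectors(tokens) for tokens in per_channel_tokens]
    return mean_vectors(channel_summaries)


# Two hypothetical channels, each with two 2-dimensional token features.
features = [
    [[1.0, 0.0], [3.0, 0.0]],  # channel 0 -> summary [2.0, 0.0]
    [[0.0, 4.0], [0.0, 8.0]],  # channel 1 -> summary [0.0, 6.0]
]
pooled = decoupled_pool(features)
```

Keeping a per-channel summary as an intermediate step means the probe can accept any number of channels, which fits the varying channel configurations the abstract describes.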


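[29] VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph cs.CV | cs.CL

Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang

TL;DR: VimRAG is a framework for multimodal retrieval-augmented generation (RAG) that targets the difficulty traditional RAG methods have with long contexts, especially information-sparse but token-heavy visual data. It models the reasoning process as a dynamic directed acyclic graph that structures agent states and retrieved multimodal evidence, and introduces a graph-modulated visual memory encoding mechanism to allocate compute dynamically.

Details

Motivation: Traditional RAG methods rely on linear interaction histories and struggle with long-context tasks that involve large amounts of visual data in iterative reasoning, so a new method is needed that can effectively retrieve, reason over, and understand multimodal information.

Result: VimRAG achieves state-of-the-art (SOTA) performance on multiple multimodal RAG benchmarks.

Insight: The contribution is modeling the reasoning process as a dynamic graph to structure memory and scoring node importance by topological position, so that high-resolution tokens are allocated dynamically to key evidence. In addition, a graph-guided policy optimization strategy disentangles step-wise validity from trajectory-level rewards, enabling fine-grained credit assignment.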

Abstract: Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at https://github.com/Alibaba-NLP/VRAG.


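[30] SPRig: Self-Supervised Pose-Invariant Rigging from Mesh Sequences cs.CV | cs.GR

Ruipeng Wang, Langkun Zhong, Miaowei Wang

TL;DR: SPRig is a self-supervised, pose-invariant rigging framework that learns consistent skeletal rigs from mesh sequences lacking a canonical T-pose (such as animal motion capture or AIGC/video-derived mesh sequences). It imposes cross-frame consistency losses on top of existing models, resolving the topological inconsistencies produced by frame-by-frame rigging.

Details

Motivation: State-of-the-art rigging methods assume a canonical rest pose (such as the T-pose), an assumption that fails for sequential data lacking such a pose (for example animal motion capture or video-generated mesh sequences). Applying these methods frame by frame leads to pose dependence and topological inconsistencies across frames.

Result: The method is validated with a new permutation-invariant stability protocol. Experiments show state-of-the-art temporal stability: it produces consistent rigs from challenging sequences and markedly reduces artifacts common in baseline methods.

Insight: The main contribution is a general fine-tuning framework that enforces pose-invariant rigging through self-supervised cross-frame consistency losses, improving how existing models handle sequential data. Viewed objectively, the permutation-invariant stability evaluation protocol it introduces is also a valuable contribution.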

Abstract: State-of-the-art rigging methods assume a canonical rest pose, an assumption that fails for sequential data (e.g., animal motion capture or AIGC/video-derived mesh sequences) that lack the T-pose. Applied frame-by-frame, these methods are not pose-invariant and produce topological inconsistencies across frames. We therefore propose SPRig, a general fine-tuning framework that enforces cross-frame consistency losses to learn pose-invariant rigs on top of existing models. We validate our approach on rigging using a new permutation-invariant stability protocol. Experiments demonstrate SOTA temporal stability: our method produces coherent rigs from challenging sequences and dramatically reduces the artifacts that plague baseline methods. The code will be released publicly upon acceptance.


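[31] Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting cs.CV

Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao, Qijun Chen

TL;DR: This paper proposes WS-COC, a weakly-supervised class-agnostic object counting framework built on multimodal large language models (MLLMs). It bootstraps the MLLM for counting with three strategies: divide-and-discern dialogue tuning, compare-and-rank count optimization, and global-and-local counting enhancement. Using only image-level count labels, it matches or exceeds fully-supervised methods on several datasets.

Details

Motivation: To remove the need for the costly point-level annotations required by fully-supervised counting methods and to overcome the restriction of existing weakly-supervised methods to a single category (such as pedestrians), by developing a class-agnostic weakly-supervised counting framework.

Result: Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation cost.

Insight: The contribution lies in three strategies that bridge the modality gap and improve performance when steering an MLLM toward counting: (1) a training strategy that progressively narrows the count range through multi-round dialogue; (2) an optimization strategy based on the relative ranking of counts across images; (3) an enhancement strategy that fuses local and global predictions to handle dense scenes. These strategies sidestep the difficulty of directly fine-tuning an MLLM to predict exact counts and offer a new paradigm for weakly-supervised counting.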

Abstract: Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations per object. A few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g. person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs. Code is available at https://github.com/viscom-tongji/WS-COC.
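
The divide-and-discern idea of asking whether the count falls within a range and then breaking the range down can be sketched as a range-narrowing loop. The abstract does not specify how ranges are split, so halving is an assumption, and the `is_count_in_range` callback is a hypothetical stand-in for one dialogue round with the MLLM.

```python
def divide_and_discern(is_count_in_range, low, high):
    """Narrow [low, high] to a single count by asking range questions.

    `is_count_in_range(lo, hi)` stands in for one dialogue round in which the
    model answers whether the object count lies in [lo, hi].
    Assumption: each round splits the remaining range in half.
    """
    rounds = 0
    while low < high:
        mid = (low + high) // 2
        rounds += 1
        if is_count_in_range(low, mid):
            high = mid
        else:
            low = mid + 1
    return low, rounds


# A perfect oracle replaces the MLLM here so the sketch is self-contained.
true_count = 37
answer, rounds = divide_and_discern(
    lambda lo, hi: lo <= true_count <= hi, 0, 127)
```

Each round is a yes/no question rather than a direct numeric prediction, which is the kind of easier sub-task the abstract contrasts with fine-tuning an MLLM to output counts directly.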


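[32] Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation cs.CV

Yichen Zhao, Zelin Peng, Piao Yang, Xiaokang Yang, Wei Shen

TL;DR: This paper presents MMRad-IVL-22K, the first large-scale dataset for natively interleaved vision-language reasoning in chest X-ray interpretation, designed to mimic the diagnostic process in which radiologists alternate between visual inspection and language reasoning. Experiments show that report generation guided by multimodal chain-of-thought significantly outperforms text-only chain-of-thought in clinical accuracy and report quality, and that the dataset improves the reasoning consistency and report quality of open-source large vision-language models.

Details

Motivation: Existing medical large vision-language models typically inspect the image only once and then rely on text-only chain-of-thought reasoning, which is prone to hallucination. The paper addresses this by modeling the interleaved visual inspection and language reasoning that radiologists perform.

Result: On advanced closed-source LVLMs, report generation guided by multimodal CoT significantly outperforms text-only CoT in clinical accuracy and report quality (for example, a 6% gain on the RadGraph metric). Benchmarking seven SOTA open-source LVLMs shows that models fine-tuned on MMRad-IVL-22K achieve better reasoning consistency and report quality than both general-purpose and medical-specific LVLMs.

Insight: The core contribution is the first large-scale dataset reflecting radiologists' real interleaved reasoning workflow. It emphasizes that high-fidelity interleaved vision-language evidence, rather than pseudo-visual coordinates, is a non-substitutable component of reliable medical AI, and it provides a more realistic benchmark and training data for medical multimodal reasoning.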

Abstract: Radiological diagnosis is a perceptual process in which careful visual inspection and language reasoning are repeatedly interleaved. Most medical large vision language models (LVLMs) perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details like texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved visual language reasoning in chest X-ray interpretation. MMRad-IVL-22K reflects a repeated cycle of reasoning and visual inspection workflow of radiologists, in which visual rationales complement textual descriptions and ground each step of the reasoning process. MMRad-IVL-22K comprises 21,994 diagnostic traces, enabling systematic scanning across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms that guided by text-only CoT in clinical accuracy and report quality (e.g., 6% increase in the RadGraph metric), confirming that high-fidelity interleaved vision language evidence is a non-substitutable component of reliable medical AI. Furthermore, benchmarking across seven state-of-the-art open-source LVLMs demonstrates that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at https://github.com/qiuzyc/thinking_like_a_radiologist.


[33] RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads cs.CVPDF

Vijayasri Iyer, Maahin Rathinagiriswaran, Jyothikamalesh S

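TL;DR: This paper introduces RoadscapesQA, a multitask, multimodal visual question answering dataset designed for Indian road scenes. It contains up to 9,000 images with manually verified bounding boxes, and question-answer pairs are generated with rule-based heuristics for tasks such as object grounding, reasoning, and scene understanding, with the aim of advancing visual scene understanding in unstructured environments.

Details

Motivation: Autonomous driving systems need to understand road scenes to make effective decisions, but existing datasets provide insufficient coverage of India's diverse driving environments. The authors created RoadscapesQA to fill this gap.

Result: The paper reports key dataset statistics and establishes initial baselines for image QA tasks using vision-language models, but does not mention specific quantitative results or comparisons with SOTA.

Insight: The novelty lies in building a multitask, multimodal dataset focused on Indian road scenes, with QA pairs generated automatically through rule-based heuristics. It supports diverse visual understanding tasks and helps study autonomous-driving vision problems in unstructured environments.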

Abstract: Understanding road scenes is essential for autonomous driving, as it enables systems to interpret visual surroundings to aid in effective decision-making. We present Roadscapes, a multitask multimodal dataset consisting of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To facilitate scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are subsequently used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset includes a variety of scenes from urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.


[34] RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training cs.CV | cs.AI | cs.CLPDF

Yunshuang Nie, Bingqian Lin, Minzhe Niu, Kun Xiang, Jianhua Han

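TL;DR: This paper proposes RADAR, an efficient framework for evaluating how abilities develop during the pre-training of multimodal large language models (MLLMs). The framework has two core components: the Soft Discrimination Score, a new metric that requires no fine-tuning, and the Multi-Modal Mixture Benchmark, with more than 15K samples, for zero-shot evaluation of perception and reasoning abilities. Using RADAR, the paper reveals that perception and reasoning abilities develop asymmetrically during MLLM pre-training.

Details

Motivation: There is currently no efficient evaluation framework for diagnosing performance bottlenecks in MLLM pre-training. Existing evaluation either relies on testing after supervised fine-tuning, which is costly, or cannot quantify perception and reasoning abilities in a disentangled way, and existing benchmarks are limited in scale or misaligned with pre-training objectives.

Result: The paper presents the RADAR framework together with its new metric and benchmark. Although no specific SOTA comparison scores are given, the authors use the framework to reveal the asymmetric development of perception and reasoning abilities in pre-trained MLLMs across data volume, model size, and pre-training strategy.

Insight: The novelty is a fine-tuning-free framework (RADAR) that evaluates perception and reasoning abilities separately during MLLM pre-training. Its core pieces are the Soft Discrimination Score, a robust new metric, and a large-scale, carefully constructed zero-shot benchmark. Together they offer a new perspective and tooling for diagnosing pre-training bottlenecks and making targeted optimizations.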

Abstract: Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model’s perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. Thus, we propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine-tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi-Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre-trained MLLMs’ perception and reasoning abilities in a 0-shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at https://github.com/Nieysh/RADAR.


[35] Reliable Thinking with Images cs.CV | cs.LGPDF

Haobin Li, Yutong Yang, Yijie Lin, Dai Xiang, Mouxing Yang

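TL;DR: This paper studies the Noisy Thinking (NT) problem in Thinking with Images (TWI), a multimodal form of chain-of-thought, where NT refers to imperfections in visual cue mining and answer reasoning. It proposes Reliable Thinking with Images (RTWI), which estimates the reliability of visual cues and textual CoT in a unified text-centric manner and uses robust filtering and voting modules to keep NT from contaminating the final answer.

Details

Motivation: Existing TWI methods assume that interleaved image-text CoTs are error-free, an assumption easily violated in practice because multimodal understanding is complex. Violations lead to error accumulation and degraded performance, so the NT problem needs to be addressed.

Result: Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT, indicating that the method improves the reasoning performance of multimodal large language models.

Insight: The novelty lies in formalizing the NT problem and proposing text-centric reliability estimation combined with filtering and voting, which makes TWI more robust and limits error propagation. This has practical value for multimodal reasoning.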

Abstract: As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs). TWI generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods heavily relies on the assumption that interleaved image-text CoTs are faultless, which is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to imperfections in the visual cue mining and answer reasoning processes. As the saying goes, "One mistake leads to another": an erroneous interleaved CoT causes error accumulation, thus significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.
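The filtering-and-voting idea in this abstract can be illustrated with a minimal sketch. The abstract does not specify how RTWI computes reliability or aggregates votes, so everything here is an assumption for illustration: the function name `filter_and_vote`, the fixed `threshold`, and the choice of summing reliability scores per answer are hypothetical and are not the authors' modules.

```python
from collections import defaultdict


def filter_and_vote(candidates, threshold=0.5):
    """Pick an answer from (answer, reliability) pairs.

    Hypothetical sketch: drop reasoning chains whose reliability score is
    below `threshold`, then let the surviving chains vote, weighting each
    vote by its reliability. Reliability is assumed to lie in [0, 1].
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    kept = [(a, r) for a, r in candidates if r >= threshold]
    if not kept:
        # Fall back to all candidates if the filter removes everything.
        kept = list(candidates)
    scores = defaultdict(float)
    for answer, reliability in kept:
        scores[answer] += reliability
    return max(scores, key=scores.get)
```

With candidates such as `[("A", 0.9), ("B", 0.6), ("B", 0.55), ("A", 0.2)]`, the low-reliability chain for "A" is filtered out and the two moderately reliable chains for "B" outvote the single strong chain for "A".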


[36] EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition cs.CV | cs.AI | cs.NEPDF

Xiao Wang, Xingxing Xiong, Jinfeng Gao, Xufeng Lou, Bo Jiang

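TL;DR: This paper presents EPRBench, a high-quality benchmark dataset for event stream-based visual place recognition (VPR). The dataset contains 10K event sequences and 65K event frames, collected with handheld and vehicle-mounted setups across diverse viewpoints, weather conditions, and lighting scenarios. The authors also provide LLM-generated, human-refined scene descriptions to support semantic-aware and language-integrated VPR research. In addition, the paper proposes a novel multi-modal fusion paradigm that uses LLMs to generate textual descriptions from raw event streams; these descriptions then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning, yielding accurate place recognition along with an interpretable reasoning process.

Details

Motivation: Event stream-based VPR currently lacks dedicated high-quality datasets, and conventional visible-light cameras are unstable under challenging conditions such as low illumination, overexposure, and high-speed motion. The authors therefore aim to build a comprehensive benchmark dataset and to explore how LLMs can be integrated into event-based perception pipelines to improve VPR performance and interpretability.

Result: The authors implement and evaluate 15 state-of-the-art VPR algorithms on EPRBench, providing a strong baseline for future comparisons. The proposed multi-modal fusion framework achieves highly accurate place recognition and also produces interpretable reasoning processes, significantly improving model transparency and explainability.

Insight: The contributions are: 1) EPRBench, a high-quality, multi-scenario benchmark dataset dedicated to event stream-based VPR, with LLM-generated and human-refined scene descriptions; 2) a novel multi-modal fusion paradigm that combines LLM-generated textual descriptions with event stream features, improving performance and interpretability through spatial attention, cross-modal fusion, and multi-scale learning, and bringing semantic and language understanding to event-based perception systems.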

Abstract: Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic-aware and language-integrated VPR research, we provide LLM-generated scene descriptions, subsequently refined through human annotation, establishing a solid foundation for integrating LLMs into event-based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state-of-the-art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi-modal fusion paradigm for VPR: leveraging LLMs to generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning. This framework not only achieves highly accurate place recognition but also produces interpretable reasoning processes alongside its predictions, significantly enhancing model transparency and explainability. The dataset and source code will be released at https://github.com/Event-AHU/Neuromorphic_ReID.


[37] Beyond Benchmarks of IUGC: Rethinking Requirements of Deep Learning Methods for Intrapartum Ultrasound Biometry from Fetal Ultrasound Videos cs.CVPDF

Jieyun Bai, Zihao Zhou, Yitong Tang, Jie Gan, Zhuonan Liang

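TL;DR: This paper presents the Intrapartum Ultrasound Grand Challenge (IUGC), held with MICCAI 2024, which targets automatic measurement of intrapartum ultrasound biometry with deep learning to address the shortage of trained sonographers in resource-limited settings. It outlines the challenge design, reviews the methods of the eight participating teams, and analyzes benchmark results, key bottlenecks, and future research directions.

Details

Motivation: In low- and middle-income countries, a shortage of trained sonographers hinders the routine use of ultrasound to monitor labor progression. Addressing this gap is intended to help reduce maternal and neonatal mortality.

Result: The challenge achieved encouraging performance, but the analysis indicates that the field is still at an early stage and that further in-depth research is needed before large-scale clinical deployment.

Insight: The challenge introduces a clinically oriented multi-task automatic measurement framework (standard plane classification, fetal head-pubic symphysis segmentation, and biometry) and releases the largest multi-center intrapartum ultrasound video dataset to date, giving a solid foundation for model training and evaluation. The submitted methods are analyzed systematically from five perspectives (preprocessing, data augmentation, learning strategy, model architecture, and post-processing), which helps identify bottlenecks and guide future research.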

Abstract: A substantial proportion (45%) of maternal deaths, neonatal deaths, and stillbirths occur during the intrapartum phase, with a particularly high burden in low- and middle-income countries. Intrapartum biometry plays a critical role in monitoring labor progression; however, the routine use of ultrasound in resource-limited settings is hindered by a shortage of trained sonographers. To address this challenge, the Intrapartum Ultrasound Grand Challenge (IUGC), co-hosted with MICCAI 2024, was launched. The IUGC introduces a clinically oriented multi-task automatic measurement framework that integrates standard plane classification, fetal head-pubic symphysis segmentation, and biometry, enabling algorithms to exploit complementary task information for more accurate estimation. Furthermore, the challenge releases the largest multi-center intrapartum ultrasound video dataset to date, comprising 774 videos (68,106 frames) collected from three hospitals, providing a robust foundation for model training and evaluation. In this study, we present a comprehensive overview of the challenge design, review the submissions from eight participating teams, and analyze their methods from five perspectives: preprocessing, data augmentation, learning strategy, model architecture, and post-processing. In addition, we perform a systematic analysis of the benchmark results to identify key bottlenecks, explore potential solutions, and highlight open challenges for future research. Although encouraging performance has been achieved, our findings indicate that the field remains at an early stage, and further in-depth investigation is required before large-scale clinical deployment. All benchmark solutions and the complete dataset have been publicly released to facilitate reproducible research and promote continued advances in automatic intrapartum ultrasound biometry.


[38] Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation cs.CVPDF

Hongbo Jiang, Jie Li, Xinqi Cai, Tianyu Xie, Yunhang Shen

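TL;DR: This paper proposes MLLMEmbed-ReID, a unified cloud-edge framework for cross-modal re-identification (CM-ReID). It first adapts a multimodal large language model into a strong cloud model through instruction-based prompting and hierarchical low-rank adaptation fine-tuning, producing a unified embedding space across RGB, infrared, sketch, and text modalities. It then uses a novel knowledge distillation strategy, based on the low-rank property of the teacher's feature space, to transfer that knowledge to a lightweight edge model.

Details

Motivation: Practical cloud-edge deployment of CM-ReID faces two problems. Existing methods depend on specialized cloud models for each modality, which fragments the ecosystem. They also fail to adapt MLLMs into a single end-to-end backbone and lack effective knowledge distillation strategies for edge deployment.

Result: The lightweight edge model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, and the cloud model performs strongly across all CM-ReID benchmarks.

Insight: The contributions include: 1) adapting a foundational MLLM into a strong unified cloud model with instruction-based prompting and a hierarchical LoRA fine-tuning strategy; 2) a novel knowledge distillation strategy, motivated by the low-rank property of the teacher's feature space, that combines a Principal Component Mapping loss with a Feature Relation loss to prioritize essential information and preserve relational structure for efficient edge deployment.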

Abstract: Practical cloud-edge deployment of Cross-Modal Re-identification (CM-ReID) faces challenges due to maintaining a fragmented ecosystem of specialized cloud models for diverse modalities. While Multi-Modal Large Language Models (MLLMs) offer strong unification potential, existing approaches fail to adapt them into a single end-to-end backbone and lack effective knowledge distillation strategies for edge deployment. To address these limitations, we propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture. First, we adapt a foundational MLLM into a state-of-the-art cloud model. We leverage instruction-based prompting to guide the MLLM in generating a unified embedding space across RGB, infrared, sketch, and text modalities. This model is then trained efficiently with a hierarchical Low-Rank Adaptation finetuning (LoRA-SFT) strategy, optimized under a holistic cross-modal alignment objective. Second, to deploy its knowledge onto an edge-native student, we introduce a novel distillation strategy motivated by the low-rank property in the teacher’s feature space. To prioritize essential information, this method employs a Principal Component Mapping loss, while relational structures are preserved via a Feature Relation loss. Our lightweight edge-based model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, while its cloud-based counterpart excels across all CM-ReID benchmarks. The MLLMEmbed-ReID framework thus presents a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices. The code and models will be open-sourced soon.


[39] Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding cs.CVPDF

Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen

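TL;DR: This paper proposes hierarchical speculative decoding, a training-free method for accelerating inference in document parsing vision-language models. A lightweight document parsing pipeline serves as the draft model and predicts batches of future tokens, which the more accurate VLM verifies in parallel. In addition, each document page is partitioned into independent regions that are decoded in parallel, and the predictions are assembled in natural reading order.

Details

Motivation: VLM-based end-to-end document parsing suffers from high inference latency because it must autoregressively generate long token sequences. This work targets the efficiency problems caused by long outputs and complex layout structures.

Result: On the general-purpose OmniDocBench, the method gives the dots.ocr model a 2.42x lossless speedup, and it reaches up to 4.89x speedup on long-document parsing tasks.

Insight: The novelty lies in combining speculative decoding with the layout structure of documents and parallelizing via region partitioning, which substantially speeds up inference without extra training and suggests a new approach to efficient multimodal understanding of long documents.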

Abstract: Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.
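The draft-verify step described in this abstract follows the general pattern of greedy speculative decoding, which can be sketched as below. This is a simplified illustration, not the paper's implementation: `speculative_step` and `verify_next` are hypothetical names, the verifier is modeled as a function returning its greedy next token, and it is called in a loop for clarity even though a real system would score all draft positions in one parallel forward pass.

```python
def speculative_step(prefix, draft_tokens, verify_next):
    """Accept draft tokens for as long as the verifier agrees with them.

    prefix:       tokens already decoded
    draft_tokens: tokens proposed by the cheap draft model
    verify_next:  hypothetical callable; given a context (list of tokens),
                  returns the verifier's greedy next token
    """
    out = list(prefix)
    for tok in draft_tokens:
        target = verify_next(out)
        # Always keep the verifier's token, so the output equals what the
        # verifier would have produced on its own (the "lossless" property).
        out.append(target)
        if target != tok:
            break  # first mismatch: discard the remaining draft tokens
    return out
```

The speedup comes from the accepted run length: when the draft agrees with the verifier for many tokens, one verification pass yields several output tokens instead of one.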


[40] Detecting Object Tracking Failure via Sequential Hypothesis Testing cs.CV | cs.AIPDF

Alejandro Monroy Muñoz, Rajeev Verma, Alexander Timans

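TL;DR: This paper proposes a method for detecting object tracking failure based on sequential hypothesis testing. It treats tracking as a statistical test that accumulates evidence over time, so that tracking failures are identified quickly while the false alert rate is controlled, with no extra training or tuning.

Details

Motivation: Real-time video object tracking systems lack formal safety assurances. Conventional approaches rely on heuristic confidence measures and cannot reliably determine when tracking has failed.

Result: The method is shown to be effective for two established tracking models across four video benchmarks, detecting tracking failures quickly at a controlled false alert rate.

Insight: The paper brings sequential hypothesis testing (formalized as an e-process) to tracking failure detection, providing statistical guarantees. The method is lightweight and model-agnostic, supports supervised and unsupervised variants, and improves the safety and reliability of tracking systems.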

Abstract: Real-time online object tracking in videos constitutes a core task in computer vision, with wide-ranging applications including video surveillance, motion capture, and robotics. Deployed tracking systems usually lack formal safety assurances to convey when tracking is reliable and when it may fail, at best relying on heuristic measures of model confidence to raise alerts. To obtain such assurances we propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. Leveraging recent advancements in the field, our sequential test (formalized as an e-process) quickly identifies when tracking failures set in whilst provably containing false alerts at a desired rate, and thus limiting potentially costly re-calibration or intervention steps. The approach is computationally light-weight, requires no extra training or fine-tuning, and is in principle model-agnostic. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information, and demonstrate its effectiveness for two established tracking models across four video benchmarks. As such, sequential testing can offer a statistically grounded and efficient mechanism to incorporate safety assurances into real-time tracking systems.
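To make the sequential-testing idea concrete, here is a minimal sketch of a generic betting-style e-process. It is not the paper's construction: it assumes each frame yields a binary error indicator `x` (1 if the frame looks like a tracking error), that under the null hypothesis "tracking is reliable" the conditional error rate is at most `p0`, and that the bet size `lam` is fixed. The names `first_alert`, `p0`, `lam`, and `alpha` are hypothetical. Under those assumptions the running product is a nonnegative supermartingale, so by Ville's inequality the probability of ever raising a false alert is at most `alpha`.

```python
def first_alert(errors, p0=0.1, lam=2.0, alpha=0.05):
    """Return the 1-based frame index of the first alert, or None.

    errors: iterable of 0/1 per-frame error indicators (assumed input)
    p0:     largest error rate still considered "reliable tracking"
    lam:    fixed bet size; must satisfy 0 <= lam <= 1/p0 so that the
            wealth never becomes negative
    alpha:  target bound on the false alert probability
    """
    if not 0.0 <= lam <= 1.0 / p0:
        raise ValueError("lam must lie in [0, 1/p0]")
    wealth = 1.0
    for t, x in enumerate(errors, start=1):
        # Wealth grows when errors occur more often than p0 allows,
        # and shrinks otherwise.
        wealth *= 1.0 + lam * (x - p0)
        if wealth >= 1.0 / alpha:
            return t
    return None
```

Because the test only multiplies one number per frame, it adds essentially no overhead to the tracker, which matches the abstract's description of the approach as computationally lightweight.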


[41] Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis cs.CV | cs.CLPDF

Runzhou Liu, Hailey Weingord, Sejal Mittal, Prakhar Dungarwal, Anusha Nandula

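TL;DR: This paper proposes a fine-grained MLLM-as-a-Judge framework for evaluating image editing models. The framework decomposes common evaluation notions into twelve fine-grained, interpretable factors covering image preservation, edit quality, and instruction fidelity. The authors build a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics. The study shows that the proposed MLLM judges agree closely with human evaluations at a fine granularity and can serve as reliable, scalable evaluators, whereas traditional metrics often diverge from human perception.

Details

Motivation: Traditional image editing metrics are coarse-grained and hard to interpret. They often reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions, and so fail to capture aspects that matter for human perception and intent.

Result: Through extensive human studies, the paper shows that the proposed MLLM judges align closely with human evaluations at a fine granularity. On the new benchmark, traditional metrics prove to be poor proxies for these factors, while the MLLM judges give more intuitive and informative assessments in both offline and online settings.

Insight: The main contribution is a framework that systematically decomposes image editing evaluation into multiple fine-grained, interpretable factors, together with a human-aligned benchmark that integrates several kinds of data. This provides empirical support for using MLLMs as reliable, scalable evaluators and offers a way to address the gap between traditional metrics and human perception.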

Abstract: Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.


[42] Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions cs.CVPDF

Yunheng Li, Hengrui Zhang, Meng-Hao Guo, Wenzhao Gao, Shaoyong Jia

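TL;DR: This paper proposes an approach to building universal video multimodal large language models that uses structured, fine-grained instruction data to address the limitations of existing models in video understanding. Its contributions are ASID-1M, an open-source dataset of one million fine-grained audiovisual instruction annotations; ASID-Verify, a scalable data curation pipeline with automatic verification and refinement; and ASID-Captioner, a video understanding model trained on the dataset with supervised fine-tuning.

Details

Motivation: The performance of existing video understanding models is constrained by instruction data that represents complex audiovisual content as single, incomplete descriptions. Such data lacks fine-grained organization and reliable annotation, and cannot meet the need of universal video understanding to model fine-grained information in diverse scenarios.

Result: Across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding, ASID-Captioner improves fine-grained caption quality, reduces hallucinations, and improves instruction following. It reaches SOTA among open-source models and is competitive with Gemini-3-Pro.

Insight: The core contribution is a paradigm for building structured, quality-verified instruction data (the ASID-1M dataset and the ASID-Verify pipeline). Fine-grained annotations are organized through single- and multi-attribute supervision, and automatic verification enforces semantic and temporal consistency between descriptions and the audiovisual content, giving a high-quality data foundation for training more reliable video MLLMs.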

Abstract: Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on the ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.


[43] Multimodal Classification via Total Correlation Maximization cs.CVPDF

Feng Yu, Xiangyu Wu, Yang Yang, Jianfeng Lu

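TL;DR: This paper proposes TCMax, a multimodal classification method based on total correlation maximization. By maximizing the total correlation between multimodal features and labels, it alleviates modality competition and captures inter-modal interactions, and it outperforms existing joint and unimodal learning methods on multiple benchmarks.

Details

Motivation: In multimodal learning, joint learning often overfits some modalities while neglecting others, and can even underperform unimodal learning. The paper analyzes modality competition from an information-theoretic perspective and proposes maximizing total correlation to balance modal contributions.

Result: Extensive experiments on multiple benchmarks show that TCMax outperforms current state-of-the-art joint and unimodal learning methods.

Insight: The paper analyzes modality competition theoretically from an information-theoretic perspective, introduces Total Correlation Neural Estimation (TCNE) as an extension of Mutual Information Neural Estimation (MINE), and designs TCMax, a hyperparameter-free loss that maximizes total correlation. It captures inter-modal interactions through feature alignment and mitigates modality imbalance.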

Abstract: Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective. In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions via feature alignment. Building on Mutual Information Neural Estimation (MINE), we introduce Total Correlation Neural Estimation (TCNE) to derive a lower bound for total correlation. Subsequently, we present TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization. Extensive experiments demonstrate that TCMax outperforms state-of-the-art joint and unimodal learning approaches. Our code is available at https://github.com/hubaak/TCMax.
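For reference, the quantities named in this abstract can be written out using standard definitions. This is a sketch, not the paper's derivation: the symbols $z_1,\dots,z_M$ (modality features), $y$ (label), and $T_\theta$ (a critic network) are assumptions, and the exact TCNE bound in the paper may differ. Total correlation is the KL divergence between the joint distribution and the product of its marginals, and applying the Donsker-Varadhan representation that MINE uses for mutual information gives a lower bound of the same form:

$$
\mathrm{TC}(z_1,\dots,z_M,y) = D_{\mathrm{KL}}\big(p(z_1,\dots,z_M,y)\,\big\|\,p(z_1)\cdots p(z_M)\,p(y)\big)
$$

$$
\mathrm{TC}(z_1,\dots,z_M,y) \;\ge\; \sup_{\theta}\; \mathbb{E}_{p(z_1,\dots,z_M,y)}\big[T_\theta\big] \;-\; \log \mathbb{E}_{p(z_1)\cdots p(z_M)\,p(y)}\big[e^{T_\theta}\big]
$$

With $M = 1$ the first expression reduces to the mutual information $I(z_1; y)$ and the second to the usual MINE bound, which is consistent with the abstract's description of TCNE as building on MINE.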


[44] DynaGuide: A Generalizable Dynamic Guidance Framework for Unsupervised Semantic Segmentation cs.CVPDF

Boujemaa Guermazi, Riadh Ksantini, Naimul Khan

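TL;DR: This paper proposes DynaGuide, a generalizable dynamic guidance framework for unsupervised semantic segmentation. It combines global pseudo-labels from zero-shot models with local boundary refinement by a lightweight CNN, using a dynamically optimized multi-component loss, and achieves high-precision segmentation without ground-truth labels in the target domain.

Details

Motivation: Existing unsupervised image segmentation methods struggle to reconcile global semantic structure with fine-grained boundary accuracy. The goal is a segmentation solution that needs no human annotation and is generalizable and practical.

Result: Extensive experiments on BSD500, PASCAL VOC2012, and COCO show state-of-the-art performance, with mIoU improvements of 17.5%, 3.1%, and 11.66%, respectively.

Insight: The novelty is a dual-guidance strategy with dynamic loss optimization that combines global semantic guidance with local boundary refinement and supports plug-and-play integration of diverse guidance sources. The design is modular, generalizes well, and has low computational overhead.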

Abstract: Unsupervised image segmentation is a critical task in computer vision. It enables dense scene understanding without human annotations, which is especially valuable in domains where labelled data is scarce. However, existing methods often struggle to reconcile global semantic structure with fine-grained boundary accuracy. This paper introduces DynaGuide, an adaptive segmentation framework that addresses these challenges through a novel dual-guidance strategy and dynamic loss optimization. Building on our previous work, DynaSeg, DynaGuide combines global pseudo-labels from zero-shot models such as DiffSeg or SegFormer with local boundary refinement using a lightweight CNN trained from scratch. This synergy allows the model to correct coarse or noisy global predictions and produce high-precision segmentations. At the heart of DynaGuide is a multi-component loss that dynamically balances feature similarity, Huber-smoothed spatial continuity, including diagonal relationships, and semantic alignment with the global pseudo-labels. Unlike prior approaches, DynaGuide trains entirely without ground-truth labels in the target domain and supports plug-and-play integration of diverse guidance sources. Extensive experiments on BSD500, PASCAL VOC2012, and COCO demonstrate that DynaGuide achieves state-of-the-art performance, improving mIoU by 17.5% on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. With its modular design, strong generalization, and minimal computational footprint, DynaGuide offers a scalable and practical solution for unsupervised segmentation in real-world settings. Code available at: https://github.com/RyersonMultimediaLab/DynaGuide
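One component of the loss described here, Huber-smoothed spatial continuity including diagonal relationships, can be sketched with NumPy as follows. The abstract does not give the exact formulation, so this is an assumption-laden illustration: the function names, the `delta` parameter, the use of a mean over each direction, and the equal weighting of the four directions are all hypothetical choices.

```python
import numpy as np


def huber(d, delta=1.0):
    """Elementwise Huber penalty: quadratic near zero, linear in the tails."""
    a = np.abs(d)
    return np.where(a <= delta, 0.5 * d**2, delta * (a - 0.5 * delta))


def spatial_continuity(feat, delta=1.0):
    """Penalize differences between neighboring pixels of a feature map.

    feat: array of shape (H, W, C). Differences are taken along the
    vertical, horizontal, and both diagonal directions (assumed design).
    """
    diffs = [
        feat[1:, :] - feat[:-1, :],      # vertical neighbors
        feat[:, 1:] - feat[:, :-1],      # horizontal neighbors
        feat[1:, 1:] - feat[:-1, :-1],   # diagonal, down-right
        feat[1:, :-1] - feat[:-1, 1:],   # diagonal, down-left
    ]
    return sum(huber(d, delta).mean() for d in diffs)
```

The Huber penalty keeps the smoothing pressure on small differences while growing only linearly across large jumps, which is one plausible reason to prefer it over a squared penalty when object boundaries should stay sharp.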


[45] CoPE-VideoLM: Codec Primitives For Efficient Video Language Models cs.CV | cs.AI | cs.CLPDF

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni

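TL;DR: This paper proposes CoPE-VideoLM, which uses video codec primitives (motion vectors and residuals) to build efficient video language models. Lightweight transformer encoders aggregate the codec primitives and are aligned with image encoder representations, reducing computational overhead and improving temporal coverage.

Details

Motivation: Existing video language models (VideoLMs) are limited by the maximum context window and typically rely on keyframe sampling, which can miss both macro-level events and micro-level details. Processing full-image tokens for every frame is also computationally expensive.

Result: Compared with standard VideoLMs, the method reduces time-to-first-token by up to 86% and token usage by up to 93%, while maintaining or exceeding performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.

Insight: The novelty lies in using video codec primitives (motion vectors and residuals), which natively encode video redundancy and sparsity, to avoid expensive full-image encoding for most frames. A pre-training strategy aligns the representations to speed up convergence, balancing efficiency and performance.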

Abstract: Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.


[46] Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images cs.CVPDF

Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan, Talha Ibn Mahmud, Bruce Coburn

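TL;DR: This paper presents a benchmark dataset, Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, for multi-food volume estimation in realistic dining scenarios through implicit-scale 3D reconstruction from monocular images. The benchmark reframes food portion estimation as an implicit-scale 3D reconstruction task under monocular observations: explicit physical references and metric annotations are removed, and algorithms must infer scale from contextual objects such as plates and utensils.

Details

Motivation: Existing dietary assessment methods rely mainly on single-image analysis or appearance-based inference, including recent vision-language models. They lack explicit geometric reasoning and are sensitive to scale ambiguity. This work introduces geometric reasoning to improve the accuracy and robustness of food volume estimation in complex, realistic scenes.

Result: In the challenge at the MetaFood 2025 Workshop, geometry-based reconstruction methods performed best: the top approach achieved a mean absolute percentage error (MAPE) of 0.21 in volume estimation and an L1 Chamfer Distance of 5.7 in geometric accuracy. Experiments show that strong vision-language baselines are competitive, but geometry-based reconstruction is both more accurate and more robust.

Insight: The core contribution is reframing food volume estimation as an implicit-scale 3D reconstruction task and creating a benchmark dataset that emphasizes multi-food scenes, occlusions, and complex spatial arrangements. A transferable idea is using scene context, such as utensils, as an implicit scale cue instead of relying on explicit metric references. This is closer to real-world perception and reasoning and moves the field from purely appearance-based analysis toward approaches that incorporate geometric reasoning.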

Abstract: We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision–language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
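The volume metric quoted in this abstract, MAPE, is straightforward to compute. The sketch below uses the standard definition and assumes the reported 0.21 is expressed as a fraction (that is, 21%) rather than a percentage; the function name `mape` and the example volumes are hypothetical and are not taken from the benchmark.

```python
def mape(predicted, actual):
    """Mean absolute percentage error, returned as a fraction.

    predicted, actual: equal-length sequences of volumes; every actual
    value must be nonzero.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)
```

For example, predicting 110 and 180 for true volumes of 100 and 200 gives errors of 10% each, so `mape([110, 180], [100, 200])` is approximately 0.1. Because each error is normalized by the true volume, small and large food items contribute comparably to the score.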


[47] Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation cs.CV | cs.AI | cs.LGPDF

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah

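TL;DR: This paper proposes Curriculum-DPO++, a Direct Preference Optimization method for text-to-image generation. On top of the original data-level curriculum, which organizes image pairs by difficulty, it adds a novel model-level curriculum that progressively unfreezes network layers and increases the rank of the LoRA matrices, gradually raising the model's learning capacity.

Details

Motivation: Existing RLHF and DPO methods do not account for the fact that some preferences are harder to learn than others, which makes the optimization suboptimal. This paper addresses the issue by combining data and model curricula to improve preference learning in text-to-image generation.

Result: On nine benchmarks, Curriculum-DPO++ outperforms Curriculum-DPO and other state-of-the-art preference optimization methods in text alignment, aesthetics, and human preference.

Insight: The novelty is extending curriculum learning from the data level to the model level: model capacity is adjusted dynamically (layer unfreezing and LoRA rank growth) to match learning progress, which suggests a new approach to efficiently fine-tuning large generative models.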

Abstract: Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.
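The two capacity mechanisms in this abstract (sequential layer unfreezing and a growing LoRA rank) can be pictured as a schedule over training steps. The sketch below is only an illustration under assumed choices: the abstract does not state the schedule's shape, so the equal-length stages, linear growth, and all names (`capacity_schedule`, `stages`, and the parameters) are hypothetical.

```python
def capacity_schedule(step, total_steps, start_rank, final_rank,
                      start_layers, final_layers, stages=4):
    """Return (lora_rank, trainable_layers) for a given training step.

    Assumed design: training is split into `stages` equal phases
    (stages >= 2), and both the LoRA rank and the number of unfrozen
    layers grow linearly from their start values to the baseline
    (final) values, which are reached in the last phase.
    """
    if stages < 2:
        raise ValueError("stages must be at least 2")
    stage = min(stages - 1, step * stages // total_steps)
    frac = stage / (stages - 1)
    rank = round(start_rank + frac * (final_rank - start_rank))
    layers = round(start_layers + frac * (final_layers - start_layers))
    return rank, layers
```

A training loop would call this once per step (or per stage boundary) and, whenever the returned values change, enlarge the low-rank matrices and unfreeze additional layers. How the newly added rank dimensions are initialized is a separate design decision that the abstract does not describe.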


[48] SIEFormer: Spectral-Interpretable and -Enhanced Transformer for Generalized Category Discovery cs.CVPDF

Chunming Li, Shidong Wang, Tong Xin, Haofeng Zhang

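TL;DR: This paper proposes SIEFormer, a method that uses spectral analysis to reinterpret the attention mechanism in Vision Transformer (ViT) and to enhance feature adaptability, targeting the challenging Generalized Category Discovery (GCD) task. SIEFormer consists of two jointly optimized branches that correspond to implicit and explicit spectral perspectives of the ViT. The implicit branch models the local structural correlations of tokens with different types of graph Laplacians and adds a novel Band-adaptive Filter (BaF) layer that can perform both band-pass and band-reject filtering. The explicit branch introduces a Maneuverable Filtering Layer (MFL) that applies the Fourier transform to the input “value” features, modulates the transformed signal with a set of learnable parameters in the frequency domain, and then applies the inverse Fourier transform to obtain enhanced features. Extensive experiments show state-of-the-art performance on multiple image recognition datasets, supported by ablation studies and visualizations.

Details

Motivation: To address the limited feature adaptability and interpretability of existing ViT attention mechanisms in GCD, and in particular to use spectral analysis to improve the model's ability to discover unknown categories.

Result: The method achieves state-of-the-art (SOTA) performance on multiple image recognition datasets, and ablation studies and visualizations confirm its effectiveness.

Insight: The novelty lies in bringing spectral analysis into the Transformer architecture, with implicit and explicit branches that model local and global dependencies respectively. The BaF layer and the MFL provide a flexible mechanism for modulating features in the frequency domain, which improves interpretability and adaptability and offers a new approach to GCD.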

Abstract: This paper presents a novel approach, Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which leverages spectral analysis to reinterpret the attention mechanism within Vision Transformer (ViT) and enhance feature adaptability, with particular emphasis on challenging Generalized Category Discovery (GCD) tasks. The proposed SIEFormer is composed of two main branches, corresponding to implicit and explicit spectral perspectives of the ViT, enabling joint optimization. The implicit branch uses different types of graph Laplacians to model the local structure correlations of tokens, along with a novel Band-adaptive Filter (BaF) layer that can flexibly perform both band-pass and band-reject filtering. The explicit branch, on the other hand, introduces a Maneuverable Filtering Layer (MFL) that learns global dependencies among tokens by applying the Fourier transform to the input “value” features, modulating the transformed signal with a set of learnable parameters in the frequency domain, and then performing an inverse Fourier transform to obtain the enhanced features. Extensive experiments reveal state-of-the-art performance on multiple image recognition datasets, reaffirming the superiority of our approach through ablation studies and visualizations.
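The explicit branch's transform, modulate, inverse-transform pattern can be sketched in a few lines. This is a minimal sketch, assuming a single feature channel treated as a 1D sequence over tokens, a naive discrete Fourier transform, and fixed real-valued per-frequency weights standing in for the learnable parameters; the function name `frequency_filter` is hypothetical and the actual MFL design may differ.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def frequency_filter(values, weights):
    """Transform, scale each frequency bin by its weight, and transform back.

    `weights` stands in for learnable parameters. The result is real-valued
    only when the weights are symmetric (weights[k] == weights[n - k]);
    here the real part is returned in all cases.
    """
    spectrum = dft(values)
    modulated = [w * s for w, s in zip(weights, spectrum)]
    return [v.real for v in idft(modulated)]


if __name__ == "__main__":
    channel = [1.0, 2.0, 3.0, 4.0]
    print(frequency_filter(channel, [1, 1, 1, 1]))  # identity filter
    print(frequency_filter(channel, [1, 0, 0, 0]))  # keeps only the mean (DC) component
```

Because every output position depends on every input position through the transform, a filter of this kind mixes information across all tokens, which is consistent with the abstract's description of learning global dependencies.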


[49] Universal Transformation of One-Class Classifiers for Unsupervised Anomaly Detection cs.CVPDF

Declan McIntosh, Alexandra Branzan Albu

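TL;DR: This paper proposes a dataset folding method that turns any anomaly detector based on a one-class classifier into a fully unsupervised method. The underlying detector is not modified; the only change is the algorithmically selected subsets of training data. The method rests on the weak assumptions that anomalies are uncommon in the training data and heterogeneous, and it uses multiple independently trained instances of the one-class classifier to filter anomalies out of the training set.

Details

Motivation: One-class classifiers assume that the training data contains only nominal samples, which makes them sensitive to training label noise. The goal is to convert them into unsupervised methods that cope with real-world training data that may contain unknown anomalies.

Result: The method achieves state-of-the-art unsupervised anomaly detection performance on the MVTec AD, ViSA, and MVTec Loco AD datasets, and it creates the first unsupervised logical anomaly detectors by transforming existing methods.

Insight: The novelty lies in using weak assumptions and a dataset folding strategy to carry improvements in one-class classifiers directly over to the unsupervised domain. No architectural changes are needed, only data subset selection, which makes the approach general and extensible.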

Abstract: Detecting anomalies in images and video is an essential task for multiple real-world problems, including industrial inspection, computer-assisted diagnosis, and environmental monitoring. Anomaly detection is typically formulated as a one-class classification problem, where the training data consists solely of nominal values, leaving methods built on this assumption susceptible to training label noise. We present a dataset folding method that transforms an arbitrary one-class classifier-based anomaly detector into a fully unsupervised method. This is achieved by making a set of key weak assumptions: that anomalies are uncommon in the training dataset and generally heterogeneous. These assumptions enable us to utilize multiple independently trained instances of a one-class classifier to filter the training dataset for anomalies. This transformation requires no modifications to the underlying anomaly detector; the only changes are algorithmically selected data subsets used for training. We demonstrate that our method can transform a wide variety of one-class classifier anomaly detectors for both images and videos into unsupervised ones. Our method creates the first unsupervised logical anomaly detectors by transforming existing methods. We also demonstrate that our method achieves state-of-the-art performance for unsupervised anomaly detection on the MVTec AD, ViSA, and MVTec Loco AD datasets. As improvements to one-class classifiers are made, our method directly transfers those improvements to the unsupervised domain, linking the domains.
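The idea of using several independently trained one-class models to filter a contaminated training set can be sketched as follows. This is an illustrative sketch only: the abstract does not spell out the folding procedure, so the fold layout, the cross-fold scoring, the `keep_fraction` parameter, and the toy median-based detector are all assumptions introduced here.

```python
from statistics import mean, median


def fit_detector(train):
    """Toy one-class model: anomaly score is the distance to the training median."""
    center = median(train)
    return lambda x: abs(x - center)


def fold_filter(data, n_folds=3, keep_fraction=0.8):
    """Score each sample with detectors trained on the OTHER folds, then keep
    the lowest-scoring fraction as the cleaned training set (hypothetical rule)."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    detectors = [fit_detector(fold) for fold in folds]
    scored = []
    for i, fold in enumerate(folds):
        others = [d for j, d in enumerate(detectors) if j != i]
        for x in fold:
            scored.append((mean(d(x) for d in others), x))
    scored.sort(key=lambda pair: pair[0])
    keep = int(len(scored) * keep_fraction)
    return [x for _, x in scored[:keep]]


if __name__ == "__main__":
    # Mostly nominal values near 1.0, with one contaminating anomaly (9.0).
    data = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 9.0, 1.1, 0.95, 1.05]
    print(fold_filter(data))
```

The sketch relies on the same two assumptions the abstract names: an uncommon anomaly cannot dominate most folds, and a heterogeneous anomaly is unlikely to look normal to models trained elsewhere.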


[50] Monocular Markerless Motion Capture Enables Quantitative Assessment of Upper Extremity Reachable Workspace cs.CVPDF

Seth Donahue, J. D. Peiffer, R. Tyler Richardson, Yishan Zhong, Shaun Q. Y. Tan

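TL;DR: This paper presents a method for quantifying the upper extremity reachable workspace with a single (monocular) camera and AI-driven markerless motion capture, with the aim of validating its feasibility for clinical motion analysis. Compared against a marker-based motion capture system, the frontal monocular camera configuration agrees well with the reference, especially for the anterior workspace, which indicates potential for clinical use.

Details

Motivation: To validate a clinically accessible way to quantify the upper extremity reachable workspace using a monocular camera and AI-driven markerless motion capture, lowering the technical barriers and overhead of clinical motion analysis and encouraging wider use of quantitative assessment.

Result: With nine unimpaired adult participants performing the standardized UERW task, the frontal camera configuration showed strong agreement with the marker-based reference system, with a mean bias of 0.61 ± 0.12% per octant, whereas the offset camera configuration underestimated the percent workspace reached (-5.66 ± 0.45%). The results support the feasibility of a frontal monocular camera configuration for UERW assessment, with the highest agreement for the anterior workspace.

Insight: The novelty is the first validation of a monocular markerless motion capture system for the UERW task. By reducing technical complexity, it enables broader implementation of quantitative upper extremity mobility assessment. Viewed objectively, the AI-driven approach simplifies the clinical setup and opens a path toward low-cost, easily deployed clinical motion analysis.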

Abstract: Objective: To validate a clinically accessible approach for quantifying the Upper Extremity Reachable Workspace (UERW) using a single (monocular) camera and Artificial Intelligence (AI)-driven Markerless Motion Capture (MMC) for biomechanical analysis. Objective assessment and validation of these techniques for specific clinically oriented tasks are crucial for their adoption in clinical motion analysis. AI-driven monocular MMC reduces the barriers to adoption in the clinic and has the potential to reduce the overhead for analysis of this common clinical assessment. Nine adult participants with no impairments performed the standardized UERW task, which entails reaching targets distributed across a virtual sphere centered on the torso, with targets displayed in a VR headset. Movements were simultaneously captured using a marker-based motion capture system and a set of eight FLIR cameras. We performed monocular video analysis on two of these video camera views to compare frontal and offset camera configurations. The frontal camera orientation demonstrated strong agreement with the marker-based reference, exhibiting a minimal mean bias of $0.61 \pm 0.12$ % reachspace reached per octant (mean $\pm$ standard deviation). In contrast, the offset camera view underestimated the percent workspace reached ($-5.66 \pm 0.45$ % reachspace reached). Conclusion: The findings support the feasibility of a frontal monocular camera configuration for UERW assessment, particularly for anterior workspace evaluation where agreement with marker-based motion capture was highest. The overall performance demonstrates clinical potential for practical, single-camera assessments. This study provides the first validation of a monocular MMC system for the assessment of the UERW task. By reducing technical complexity, this approach enables broader implementation of quantitative upper extremity mobility assessment.


[51] FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control cs.CV | cs.GRPDF

Mingzhi Sheng, Zekai Gu, Peng Li, Cheng Lin, Hao-Xiang Guo

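TL;DR: This paper proposes FlexAM, a unified video generation framework built on a novel 3D control signal. The framework represents video dynamics as a point cloud and introduces multi-frequency positional encoding, depth-aware positional encoding, and a flexible control signal. This disentangles appearance from motion and supports a wide range of video generation control tasks, including I2V/V2V editing, camera control, and spatial object editing.

Details

Motivation: Effective and generalizable control in video generation remains a challenge. Existing methods rely on ambiguous or task-specific signals, and the authors argue that a fundamental disentanglement of appearance and motion offers a more robust and scalable path.

Result: Extensive experiments show that FlexAM achieves superior performance on all evaluated tasks.

Insight: The novel elements are the point cloud representation of video dynamics, multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal that balances precision against generative quality. Together they give a unified and scalable solution for video generation control.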

Abstract: Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of “appearance” and “motion” provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.


[52] Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision cs.CVPDF

Aadarsh Sahoo, Georgia Gkioxari

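TL;DR: This paper introduces the Conversational Image Segmentation (CIS) task and the ConverSeg benchmark, which aim to ground abstract, intent-driven concepts (such as function, safety, and physical reasoning) in pixel-accurate segmentation masks. The authors develop ConverSeg-Net, a model that fuses strong segmentation priors with language understanding, and design an AI-powered data engine that generates prompt-mask pairs without human supervision. Experiments show that existing language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained with the data engine achieves significant gains on ConverSeg and keeps strong performance on existing language-guided segmentation benchmarks.

Details

Motivation: Prior work on referring image grounding focuses on categorical and spatial queries (e.g., “left-most apple”) and overlooks abstract, intent-driven concepts such as functional and physical reasoning (e.g., “where can I safely store the knife?”). This paper aims to fill that gap and move image segmentation toward more natural and more complex conversational interaction.

Result: On the proposed ConverSeg benchmark (covering entities, spatial relations, intent, affordances, functions, safety, and physical reasoning), ConverSeg-Net achieves significant gains. The model also keeps strong performance on existing language-guided segmentation benchmarks.

Insight: The contributions are: 1) defining the new CIS task, which emphasizes understanding and segmenting abstract, intent-driven concepts; 2) building the multi-dimensional ConverSeg benchmark (entities, spatial relations, intent, functions, safety, physical reasoning); 3) proposing the ConverSeg-Net architecture, which fuses segmentation priors with language understanding; 4) designing an AI-powered data engine that generates training data at scale without human annotation, offering a new way to address data scarcity.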

Abstract: Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., “left-most apple”) and overlooks functional and physical reasoning (e.g., “where can I safely store the knife?”). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/


cs.SI [Back]

[53] Semantic Communities and Boundary-Spanning Lyrics in K-pop: A Graph-Based Unsupervised Analysis cs.SI | cs.CLPDF

Oktay Karakuş

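TL;DR: This paper proposes a graph-based unsupervised framework for discovering and evaluating semantic communities in K-pop lyrics. The method builds a similarity graph from line-level semantic representations and applies community detection to reveal stable micro-theme communities without genre, artist, or language supervision. It then identifies boundary-spanning songs with graph-theoretic bridge metrics and analyzes their structural properties.

Details

Motivation: Large-scale lyric corpora lack reliable annotations, contain multilingual content, and show high levels of stylistic repetition. Existing approaches mostly rely on supervised classification or coarse document-level representations, which makes it hard to uncover latent semantic structure.

Result: Across multiple robustness settings, boundary-spanning lyrics show higher lexical diversity and lower repetition than core community members, which challenges the assumption that hook intensity or repetition drives cross-theme connectivity.

Insight: The novelty is a language-agnostic graph framework that applies to unlabeled cultural text corpora. It achieves fine-grained theme discovery through line-level semantic representations and community detection, and it uses graph-theoretic metrics to quantify the structural properties of boundary-spanning content.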

Abstract: Large-scale lyric corpora present unique challenges for data-driven analysis, including the absence of reliable annotations, multilingual content, and high levels of stylistic repetition. Most existing approaches rely on supervised classification, genre labels, or coarse document-level representations, limiting their ability to uncover latent semantic structure. We present a graph-based framework for unsupervised discovery and evaluation of semantic communities in K-pop lyrics using line-level semantic representations. By constructing a similarity graph over lyric texts and applying community detection, we uncover stable micro-theme communities without genre, artist, or language supervision. We further identify boundary-spanning songs via graph-theoretic bridge metrics and analyse their structural properties. Across multiple robustness settings, boundary-spanning lyrics exhibit higher lexical diversity and lower repetition compared to core community members, challenging the assumption that hook intensity or repetition drives cross-theme connectivity. Our framework is language-agnostic and applicable to unlabeled cultural text corpora.


cs.AI [Back]

[54] GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory cs.AI | cs.CL | cs.CY | cs.GT | cs.MAPDF

Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang

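TL;DR: This paper introduces GT-HarmBench, a benchmark for evaluating the safety risks of frontier AI systems in high-stakes multi-agent environments. The benchmark contains 2,009 realistic scenarios built on game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt, and Chicken. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases. The study also shows that game-theoretic interventions improve socially beneficial outcomes by up to 18%.

Details

Motivation: Existing AI safety benchmarks largely evaluate single agents and overlook risks in multi-agent environments, such as coordination failure and conflict. A new benchmark is needed to understand and evaluate these risks.

Result: Across the 15 frontier models tested on GT-HarmBench, agents choose socially beneficial actions in only 62% of cases, and game-theoretic interventions improve socially beneficial outcomes by up to 18%. The benchmark provides a standardized testbed for research on multi-agent alignment.

Insight: The novelty is a systematically constructed benchmark of multi-agent AI safety risks viewed through game theory, together with a quantification of the reliability gaps models show in complex social dilemmas. Its methods, such as the analysis of sensitivity to prompt framing and ordering, provide new tools for understanding and improving multi-agent alignment.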

Abstract: Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures such as the Prisoner’s Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
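To make the tension between individually rational and socially beneficial actions concrete, the Prisoner's Dilemma named in the abstract can be worked through in code. This is a minimal sketch using common textbook payoff values; the numbers, names, and the welfare criterion (sum of payoffs) are assumptions for illustration and are not taken from the benchmark.

```python
# Payoffs as (row player, column player); "C" = cooperate, "D" = defect.
# Textbook values chosen for illustration only.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def best_response(opponent_action):
    """Action that maximizes a player's own payoff against a fixed opponent action.
    The game is symmetric, so one function serves both players."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])


def nash_equilibria():
    """Profiles in which each action is a best response to the other."""
    return [(a, b) for a in ACTIONS for b in ACTIONS
            if best_response(b) == a and best_response(a) == b]


def welfare_maximizing_profile():
    """Profile with the highest total payoff (one simple notion of 'socially beneficial')."""
    return max(PAYOFFS, key=lambda profile: sum(PAYOFFS[profile]))


if __name__ == "__main__":
    print("Nash equilibria:", nash_equilibria())
    print("Welfare-maximizing profile:", welfare_maximizing_profile())
```

With these payoffs, mutual defection is the only equilibrium while mutual cooperation maximizes total payoff, which is the kind of gap between self-interested and socially beneficial choices that such scenarios probe.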


[55] Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents cs.AI | cs.CLPDF

Ruihan Yang, Fanghua Ye, Xiang We, Ruoqing Zhao, Kang Luo

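TL;DR: This paper proposes CogRouter, a framework that addresses the rigid cognitive patterns of LLM agents in multi-turn decision-making tasks. Grounded in ACT-R theory, it defines four hierarchical cognitive levels ranging from instinctive responses to strategic planning, and it uses two-stage training (CoSFT and CoPO) so that the agent adapts its cognitive depth to the needs of each step. The key idea is that the appropriate cognitive depth should maximize the confidence of the resulting action.

Details

Motivation: Current LLM agents typically use a fixed cognitive pattern in long-horizon tasks, either responding immediately without thinking or reasoning deeply at every step. They cannot adapt to the large variation in cognitive demand between steps, which makes them inefficient.

Result: CogRouter achieves SOTA performance on the ALFWorld and ScienceWorld benchmarks. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o, OpenAI-o3, and GRPO, while using 62% fewer tokens.

Insight: The main novelty is to model cognitive depth as a hierarchical variable that can be adjusted dynamically, and to propose a reinforcement learning method (CoPO) that performs step-level credit assignment based on action confidence. Cognitive resources are allocated on demand, which substantially improves efficiency while maintaining performance.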

Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.


[56] Consistency of Large Reasoning Models Under Multi-Turn Attacks cs.AI | cs.CLPDF

Yubo Li, Ramayya Krishnan, Rema Padman

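TL;DR: This paper evaluates the robustness of nine frontier reasoning models under multi-turn adversarial attacks and finds that reasoning confers meaningful but incomplete robustness. Most reasoning models outperform instruction-tuned baselines, yet all have distinct vulnerabilities: misleading suggestions are universally effective, while the effect of social pressure varies by model. Trajectory analysis identifies five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue), with the first two accounting for 50% of failures. The study also shows that Confidence-Aware Response Generation (CARG), which works for standard LLMs, fails for reasoning models because extended reasoning traces induce overconfidence; counterintuitively, random confidence embedding outperforms targeted extraction.

Details

Motivation: Large reasoning models achieve SOTA performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. This paper evaluates how they behave under such attacks.

Result: Under adversarial attack, most reasoning models significantly outperform instruction-tuned baselines, but every model shows a distinct vulnerability profile. The confidence-aware defense is ineffective for reasoning models, and random confidence embedding performs better.

Insight: The contribution is a systematic account of the vulnerability profiles and failure mechanisms of reasoning models under multi-turn adversarial attacks. It also shows that confidence-based defenses need a fundamental redesign for reasoning models and cannot simply reuse the strategies developed for standard LLMs.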

Abstract: Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue) with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.


q-bio.GN [Back]

[57] Alignment or Integration? Rethinking Multimodal Fusion in DNA-language Foundation Models q-bio.GN | cs.CLPDF

Yanan Li, Christina Yi Jin, Yuan Jin, Manli Luo, Tie Xu

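TL;DR: This paper examines a core question in fusing DNA foundation models with large language models (LLMs) for DNA-language reasoning: at what level should genomic sequences and natural language interact? Existing methods mostly use late, embedding-level alignment, which compresses rich genomic sequences into fixed representations and limits reasoning over fine-grained, token-level genomic structure. The paper proposes two new fusion methods: SeqCLIP, which strengthens embedding alignment through sequence-level contrastive pre-training, and OneVocab, a vocabulary-level integration method that adds genomic k-mers directly to the language model's existing vocabulary.

Details

Motivation: Existing DNA-language models mostly rely on late embedding-level alignment, which over-compresses genomic sequence information and cannot support fine-grained, token-level reasoning. The paper explores a more effective level at which to fuse the two modalities.

Result: Comprehensive experiments on classification and reasoning tasks show that, although various alignment strategies improve embedding-level fusion, early vocabulary-level integration (OneVocab) yields more expressive and effective representations for DNA-language modeling.

Insight: The novelty is a systematic comparison of the “alignment” and “integration” fusion paradigms, with a concrete method for each (SeqCLIP and OneVocab). Viewed objectively, the core finding is that for DNA-language tasks requiring fine-grained structural reasoning, early, low-level vocabulary integration may be more effective than late, high-level representation alignment, which offers a new direction for designing multimodal foundation models.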

Abstract: Fusing DNA foundation models with large language models (LLMs) for DNA-language reasoning raises a fundamental question: at what level should genomic sequences and natural language interact? Most existing approaches encode DNA sequences and text separately and rely on embedding-level alignment to connect the two modalities. Such late-stage fusion compresses rich genomic sequences into fixed representations, limiting the model’s ability to reason over fine-grained, token-level genomic structure. In this work, we propose two new methods for DNA-language fusion, i.e., a semantic alignment method SeqCLIP and a vocabulary-level integration method OneVocab. SeqCLIP strengthens embedding-level alignment via sequence-level contrastive pre-training, and OneVocab directly integrates genomic $k$-mers into the language model’s existing vocabulary. Comprehensive experiments on classification and reasoning tasks show that, while various alignment strategies improve embedding-level fusion, early vocabulary-level integration yields more expressive and effective representations for DNA-language modeling.
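Vocabulary-level integration of $k$-mers can be pictured with a toy tokenizer. This is a rough sketch under several assumptions: non-overlapping $k$-mers, a four-letter alphabet, a plain dictionary as the vocabulary with contiguous ids starting at 0, and hypothetical function names; none of these details come from the paper, and a real model would also need new embedding rows for the added tokens.

```python
from itertools import product


def kmer_vocab(k, alphabet="ACGT"):
    """All k-mers over the alphabet, in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def extend_vocab(vocab, new_tokens):
    """Return a copy of a token->id mapping with unseen tokens appended.
    Assumes existing ids are contiguous from 0, so len() gives the next free id."""
    extended = dict(vocab)
    for token in new_tokens:
        if token not in extended:
            extended[token] = len(extended)
    return extended


def tokenize_dna(sequence, k):
    """Split a sequence into non-overlapping k-mers; a trailing remainder is dropped."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]


if __name__ == "__main__":
    text_vocab = {"what": 0, "does": 1, "this": 2, "gene": 3, "do": 4}  # toy text vocabulary
    vocab = extend_vocab(text_vocab, kmer_vocab(3))
    ids = [vocab[token] for token in tokenize_dna("ACGTACGGT", 3)]
    print(len(vocab), ids)
```

In this picture DNA tokens and text tokens share one id space, so a single model can attend across both at the token level rather than receiving DNA as one fixed embedding.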


eess.IV [Back]

[58] VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction eess.IV | cs.AI | cs.CV | cs.MMPDF

Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das

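TL;DR: This paper presents VineetVC, a video conferencing system that adapts to severe bandwidth constraints. It combines WebRTC media delivery with an audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation, replacing the camera stream with a synthesized video stream at very low bandwidth (median 32.80 kbps) to mitigate the quality loss and latency caused by network congestion.

Details

Motivation: Bandwidth depletion in consumer and constrained networks undermines the stability of real-time video conferencing: encoder rate management saturates, packet loss rises, frame rates drop, and end-to-end latency increases significantly.

Result: The system is implemented in a browser client. The synthesized video stream has a median bandwidth of only 32.80 kbps, and a bandwidth-mode switching strategy is used to keep the conference running smoothly. No specific benchmarks or quantitative comparisons with existing methods are reported.

Insight: The novelty is the dynamic integration of conventional WebRTC transport with an AI-driven audio-to-talking-head synthesis pathway, together with a telemetry-driven adaptive switching mechanism. This offers a lightweight alternative for visual communication in low-bandwidth settings.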

Abstract: Intense bandwidth depletion within consumer and constrained networks has the potential to undermine the stability of real-time video conferencing: encoder rate management becomes saturated, packet loss escalates, frame rates deteriorate, and end-to-end latency significantly increases. This work delineates an adaptive conferencing system that integrates WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation. The system consists of a WebSocket signaling service, an optional SFU for multi-party transmission, a browser client capable of real-time WebRTC statistics extraction and CSV telemetry export, and an AI REST service that processes a reference face image and recorded audio to produce a synthesized MP4; the browser can substitute its outbound camera track with the synthesized stream with a median bandwidth of 32.80 kbps. The solution incorporates a bandwidth-mode switching strategy and a client-side mode-state logger.
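A bandwidth-mode switching strategy with a mode-state log can be sketched as a small state machine. The abstract does not describe the actual switching rule, so everything here is an assumption: the two mode names, the hysteresis thresholds (`low_kbps`, `high_kbps`), and the function names are hypothetical and chosen only to show the general pattern.

```python
def next_mode(current_mode, bandwidth_kbps, low_kbps=150.0, high_kbps=300.0):
    """Hypothetical hysteresis rule: drop to the synthesized stream when bandwidth
    falls below low_kbps, and return to the camera only once it exceeds high_kbps.
    The gap between the thresholds avoids rapid flapping near a single cutoff."""
    if current_mode == "camera" and bandwidth_kbps < low_kbps:
        return "synthetic"
    if current_mode == "synthetic" and bandwidth_kbps > high_kbps:
        return "camera"
    return current_mode


def run_mode_logger(bandwidth_samples, initial_mode="camera"):
    """Apply the rule to a series of bandwidth estimates and log (kbps, mode) pairs."""
    mode = initial_mode
    log = []
    for kbps in bandwidth_samples:
        mode = next_mode(mode, kbps)
        log.append((kbps, mode))
    return log


if __name__ == "__main__":
    for kbps, mode in run_mode_logger([500, 120, 200, 350]):
        print(f"{kbps:>6} kbps -> {mode}")
```

In a browser client, the bandwidth samples would come from periodically collected WebRTC statistics, and a mode change would trigger swapping the outbound video track.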


[59] 3DLAND: 3D Lesion Abdominal Anomaly Localization Dataset eess.IV | cs.CVPDF

Mehran Advand, Zahra Dehghanian, Navid Faraji, Reza Barati, Seyed Amir Ahmad Safavi-Naini

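TL;DR: This paper introduces 3DLAND, a large-scale dataset for 3D abdominal lesion localization with over 6,000 contrast-enhanced CT volumes and over 20,000 high-fidelity 3D lesion annotations covering seven abdominal organs. The dataset addresses the lack of 3D annotations, multi-organ coverage, and lesion-to-organ associations in existing medical imaging datasets and provides a scalable evaluation benchmark for medical AI.

Details

Motivation: Existing abdominal CT datasets often lack 3D annotations, multi-organ coverage, or precise lesion-to-organ associations, which hinders robust representation learning and clinical applications.

Result: The annotations are produced by a three-phase pipeline that integrates automated spatial reasoning, prompt-optimized 2D segmentation, and memory-guided 3D propagation. They are validated by expert radiologists, with Surface Dice scores exceeding 0.75, and the dataset establishes a new benchmark for organ-aware 3D segmentation models.

Insight: The contributions are a large-scale, high-quality dataset of 3D lesion annotations and an efficient annotation pipeline. Together they support evaluation of anomaly detection, localization, and cross-organ transfer learning, advancing healthcare-oriented AI.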

Abstract: Existing medical imaging datasets for abdominal CT often lack three-dimensional annotations, multi-organ coverage, or precise lesion-to-organ associations, hindering robust representation learning and clinical applications. To address this gap, we introduce 3DLAND, a large-scale benchmark dataset comprising over 6,000 contrast-enhanced CT volumes with over 20,000 high-fidelity 3D lesion annotations linked to seven abdominal organs: liver, kidneys, pancreas, spleen, stomach, and gallbladder. Our streamlined three-phase pipeline integrates automated spatial reasoning, prompt-optimized 2D segmentation, and memory-guided 3D propagation, validated by expert radiologists with surface dice scores exceeding 0.75. By providing diverse lesion types and patient demographics, 3DLAND enables scalable evaluation of anomaly detection, localization, and cross-organ transfer learning for medical AI. Our dataset establishes a new benchmark for evaluating organ-aware 3D segmentation models, paving the way for advancements in healthcare-oriented AI. To facilitate reproducibility and further research, the 3DLAND dataset and implementation code are publicly available at https://mehrn79.github.io/3DLAND.


cs.IR [Back]

[60] DiffuRank: Effective Document Reranking with Diffusion Language Models cs.IR | cs.CLPDF

Qi Liu, Kun Ai, Jiaxin Mao, Yanzhao Zhang, Mingxin Li

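TL;DR: This paper proposes DiffuRank, a document reranking framework built on diffusion language models (dLLMs), to address the efficiency and flexibility limits of existing rerankers based on autoregressive large language models (LLMs). DiffuRank explores three dLLM-based reranking strategies (pointwise, logit-based listwise, and permutation-based listwise) and evaluates them on multiple benchmarks.

Details

Motivation: Most existing LLM-based document rerankers rely on autoregressive generation. Token-by-token decoding incurs high latency, and the fixed left-to-right generation order lets early errors propagate and makes them hard to revise, which limits efficiency and flexibility. This paper uses the more flexible generation and decoding of dLLMs to address these limits.

Result: Zero-shot and fine-tuned reranking performance is evaluated on multiple benchmarks. The experiments show that, at similar model sizes, dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs.

Insight: The novelty is the application of diffusion language models to document reranking, with three reranking strategies adapted to dLLMs and corresponding training methods. Viewed objectively, this gives document reranking a non-autoregressive alternative that supports parallel decoding and a more flexible generation order, which may improve efficiency and controllability.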

Abstract: Recent advances in large language models (LLMs) have inspired new paradigms for document reranking. While this paradigm better exploits the reasoning and contextual understanding capabilities of LLMs, most existing LLM-based rerankers rely on autoregressive generation, which limits their efficiency and flexibility. In particular, token-by-token decoding incurs high latency, while the fixed left-to-right generation order causes early prediction errors to propagate and is difficult to revise. To address these limitations, we explore the use of diffusion language models (dLLMs) for document reranking and propose DiffuRank, a reranking framework built upon dLLMs. Unlike autoregressive models, dLLMs support more flexible decoding and generation processes that are not constrained to a left-to-right order, and enable parallel decoding, which may lead to improved efficiency and controllability. Specifically, we investigate three reranking strategies based on dLLMs: (1) a pointwise approach that uses dLLMs to estimate the relevance of each query-document pair; (2) a logit-based listwise approach that prompts dLLMs to jointly assess the relevance of multiple documents and derives ranking lists directly from model logits; and (3) a permutation-based listwise approach that adapts the canonical decoding process of dLLMs to the reranking tasks. For each approach, we design corresponding training methods to fully exploit the advantages of dLLMs. We evaluate both zero-shot and fine-tuned reranking performance on multiple benchmarks. Experimental results show that dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes. These findings demonstrate the promise of diffusion-based language models as a compelling alternative to autoregressive architectures for document reranking.


[61] RGAlign-Rec: Ranking-Guided Alignment for Latent Query Reasoning in Recommendation Systems cs.IR | cs.AI | cs.CLPDF

Junhua Liu, Yang Jihao, Cheng Chang, Kunrong LI, Bin Fu

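TL;DR: RGAlign-Rec is a closed-loop alignment framework for proactive intent prediction in recommendation systems. It combines an LLM-based semantic reasoner with a query-enhanced ranking model and uses a ranking-guided alignment training paradigm, in which downstream ranking signals serve as feedback to refine the LLM's latent reasoning. This bridges the gap between user features and the semantic intents in the knowledge base, and it addresses the misalignment between general-purpose LLM outputs and task-specific ranking objectives.

Details

Motivation: Proactive intent prediction in modern e-commerce chatbots faces two core challenges: the semantic gap between discrete user features and the semantic intents in the chatbot's knowledge base, and the objective misalignment between general-purpose LLM outputs and task-specific ranking utilities.

Result: On a large-scale industrial dataset from Shopee, RGAlign-Rec achieves a 0.12% gain in GAUC, a 3.52% relative reduction in error rate, and a 0.56% improvement in Recall@3. In online A/B testing, the query-enhanced model (QE-Rec) initially yields a 0.98% improvement in CTR, and the subsequent ranking-guided alignment stage contributes an additional 0.13% gain, indicating that the framework improves both prediction accuracy and service quality.

Insight: The novelty is a multi-stage ranking-guided alignment training paradigm that uses downstream ranking signals as feedback to refine the LLM's latent reasoning, synchronizing semantic reasoning with ranking objectives. It gives industrial recommendation systems a closed-loop framework that combines the semantic capability of LLMs with task-specific ranking objectives.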

Abstract: Proactive intent prediction is a critical capability in modern e-commerce chatbots, enabling “zero-query” recommendations by anticipating user needs from behavioral and contextual signals. However, existing industrial systems face two fundamental challenges: (1) the semantic gap between discrete user features and the semantic intents within the chatbot’s Knowledge Base, and (2) the objective misalignment between general-purpose LLM outputs and task-specific ranking utilities. To address these issues, we propose RGAlign-Rec, a closed-loop alignment framework that integrates an LLM-based semantic reasoner with a Query-Enhanced (QE) ranking model. We also introduce Ranking-Guided Alignment (RGA), a multi-stage training paradigm that utilizes downstream ranking signals as feedback to refine the LLM’s latent reasoning. Extensive experiments on a large-scale industrial dataset from Shopee demonstrate that RGAlign-Rec achieves a 0.12% gain in GAUC, leading to a significant 3.52% relative reduction in error rate, and a 0.56% improvement in Recall@3. Online A/B testing further validates the cumulative effectiveness of our framework: the Query-Enhanced model (QE-Rec) initially yields a 0.98% improvement in CTR, while the subsequent Ranking-Guided Alignment stage contributes an additional 0.13% gain. These results indicate that ranking-aware alignment effectively synchronizes semantic reasoning with ranking objectives, significantly enhancing both prediction accuracy and service quality in real-world proactive recommendation systems.


[62] WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata cs.IR | cs.CVPDF

Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta

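TL;DR: WISE is an open-source multimodal search engine that supports unified retrieval over images, video, audio, faces, transcribed speech, and user-provided metadata. Users can query with natural language, reverse-image search, audio files, and other inputs, and vector search techniques make retrieval efficient at large scale.

Details

Motivation: To give users without machine learning expertise a practical tool that integrates retrieval across many modalities, providing unified and efficient search over visual scenes, audio events, faces, speech, and more.

Result: The paper states that WISE can scale to efficient retrieval over millions of images or thousands of hours of video and that it has been applied to several real-world use cases. The abstract reports no quantitative benchmark results or comparisons with existing methods.

Insight: The main novelty is the integration of retrieval over multiple modalities (visual, audio, face, speech, metadata) into a single, scalable search engine that supports combined cross-modal queries. Its modular architecture makes it easy to integrate new models, and it can be deployed locally to handle sensitive data.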

Abstract: In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities – for example, retrieving German trains from a historical archive by applying the object query “train” and the metadata query “Germany”, or searching for a face in a place. By employing vector search techniques, WISE can scale to support efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open-source and available at https://gitlab.com/vgg/wise/wise.


cs.LG [Back]

[63] Constraint-Rectified Training for Efficient Chain-of-Thought cs.LG | cs.CLPDF

Qinhang Wu, Sen Lin, Ming Zhang, Yingbin Liang, Ness B. Shroff

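TL;DR: This paper proposes Constraint-Rectified Training (CRT), a post-training framework that addresses overly long reasoning and redundant steps (overthinking) in Chain-of-Thought (CoT) reasoning. Using reference-guarded constrained optimization, it reduces reasoning length stably while maintaining answer quality.

Details

Motivation: Existing heuristic approaches to balancing reasoning length and accuracy, such as length-aware reward design or prompt-based calibration, can suffer severe accuracy drops and are sensitive to hyperparameters. The paper seeks a more stable and interpretable training method for efficient reasoning.

Result: A comprehensive evaluation shows that the framework consistently reduces token usage while keeping answer quality at a robust and reliable level. Further analysis shows that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy.

Insight: The novelty is a principled post-training framework (CRT) based on reference-guarded constrained optimization, with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learned length budget. Training naturally yields a sequence of intermediate checkpoints that preserve correctness, giving fine-grained control over reasoning verbosity.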

Abstract: Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), especially when combined with reinforcement learning (RL) based post-training methods. While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy, either through length-aware reward design or prompt-based calibration. However, these heuristic-based approaches may suffer from a severe accuracy drop and can be very sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization, yielding a more stable and interpretable formulation for efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy only when performance falls below the reference, enabling stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing the re-emergence of verbose CoT. Our comprehensive evaluation shows that this framework consistently reduces token usage while maintaining answer quality at a robust and reliable level. Further analysis reveals that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, leading to a new evaluation metric. Moreover, CRT-based training naturally yields a sequence of intermediate checkpoints that span a spectrum of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.
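The alternation described in the abstract, shortening reasoning by default and switching to accuracy repair only when performance falls below a reference, amounts to a simple switching rule. The sketch below shows only that rule on made-up numbers; the function names, the objective labels, and the example accuracies are hypothetical, and the actual CRT objectives and update steps are not reproduced here.

```python
def choose_objective(current_accuracy, reference_accuracy):
    """Reference-guarded switch: repair accuracy only when it is below the
    reference, otherwise keep pruning reasoning length."""
    if current_accuracy < reference_accuracy:
        return "rectify_accuracy"
    return "minimize_length"


def plan_updates(accuracy_trace, reference_accuracy):
    """Map a sequence of measured accuracies to the objective used at each step."""
    return [choose_objective(acc, reference_accuracy) for acc in accuracy_trace]


if __name__ == "__main__":
    # Illustrative accuracies measured during training (made-up numbers).
    trace = [0.82, 0.81, 0.79, 0.80, 0.83]
    for acc, objective in zip(trace, plan_updates(trace, reference_accuracy=0.80)):
        print(f"accuracy={acc:.2f} -> {objective}")
```

Because the accuracy objective is active only below the reference, the rule treats accuracy as a constraint to satisfy rather than a term to trade off against length with a tuned weight.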


[64] Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL cs.LG | cs.AI | cs.CL

Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng

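TL;DR: This paper proposes Introspective LLM, a hierarchical reinforcement learning framework that lets a large language model learn to control its sampling temperature dynamically during generation. The method treats temperature selection as a policy conditioned on the model's internal hidden state and optimizes it jointly with the next-token policy to maximize downstream task reward. Experiments on mathematical reasoning benchmarks show that the learned temperature policy outperforms fixed and heuristic baselines and exhibits interpretable exploration behavior aligned with reasoning uncertainty.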

Details

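Motivation: Existing methods control sampling temperature with static values or with heuristic strategies decoupled from task reward, which limits the model's ability to balance exploration and exploitation during RL training. The paper asks how an LLM can adjust temperature dynamically and adaptively from its internal state so that it learns better from verifiable rewards.

Result: Experiments on mathematical reasoning benchmarks (such as GSM8K) show that the learned temperature policy outperforms fixed-temperature baselines (such as T=1.0) and heuristic adaptive baselines on task performance.

Insight: The core contribution is formalizing temperature control as a hierarchical RL problem conditioned on the model's internal (hidden) state and optimizing it jointly with the token-generation policy. This offers a route to end-to-end learning of decoding strategies: temperature adjustment is tied directly to task reward and may produce exploration behavior better matched to the task.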

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration–exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
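
The per-step decoding loop in the abstract (pick a temperature from the hidden state, then sample from the rescaled distribution) might look roughly like the sketch below. The linear-plus-sigmoid temperature head, the range `[t_min, t_max]`, and every function name are assumptions made for illustration; the abstract does not specify how the temperature policy is parameterized, and the joint RL training is not shown.

```python
import math
import random


def select_temperature(hidden, weights, bias, t_min=0.1, t_max=2.0):
    """Map a hidden state to a temperature in [t_min, t_max] (assumed head)."""
    score = sum(h * w for h, w in zip(hidden, weights)) + bias
    gate = 1.0 / (1.0 + math.exp(-score))
    return t_min + (t_max - t_min) * gate


def softmax_with_temperature(logits, temperature):
    """Standard temperature scaling: softmax(logits / T)."""
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, hidden, weights, bias, rng):
    """One decoding step: choose T from the hidden state, then sample a token."""
    temperature = select_temperature(hidden, weights, bias)
    probs = softmax_with_temperature(logits, temperature)
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return token, temperature, probs


rng = random.Random(0)
token, temperature, probs = sample_token(
    logits=[2.0, 1.0, 0.0], hidden=[1.0, -1.0], weights=[0.5, 0.5], bias=0.0, rng=rng
)
```

A lower temperature concentrates probability on the top logit (exploitation) and a higher one flattens the distribution (exploration), which is why the choice of T matters during RL rollouts.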


[65] Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models cs.LG | cs.CV

Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, Ivor Tsang

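TL;DR: Flow-Factory is a unified framework for reinforcement learning in flow-matching models, aimed at the fragmented codebases, model-specific implementations, and heavy engineering burden of current practice. Its modular, registry-based architecture decouples algorithms, models, and rewards, supports seamless integration of multiple algorithms and models, and provides production-ready memory optimization, flexible multi-reward training, and distributed training support.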

Details

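Motivation: Using reinforcement learning to align diffusion and flow-matching models with human preferences currently suffers from scattered codebases, model-specific implementations, and high engineering complexity, all of which slow research and innovation.

Result: The framework supports GRPO, DiffusionNFT, and AWM across the Flux, Qwen-Image, and WAN video models, demonstrating its integration and extension capabilities.

Insight: The core contribution is a modular, decoupled architecture in which a registry mechanism allows algorithms, models, and rewards to be combined flexibly. This markedly reduces implementation overhead and lets researchers prototype and scale new methods quickly; the production-ready optimizations and distributed support add practical value.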

Abstract: Reinforcement learning has emerged as a promising paradigm for aligning diffusion and flow-matching models with human preferences, yet practitioners face fragmented codebases, model-specific implementations, and engineering complexity. We introduce Flow-Factory, a unified framework that decouples algorithms, models, and rewards through a modular, registry-based architecture. This design enables seamless integration of new algorithms and architectures, as demonstrated by our support for GRPO, DiffusionNFT, and AWM across Flux, Qwen-Image, and WAN video models. By minimizing implementation overhead, Flow-Factory empowers researchers to rapidly prototype and scale future innovations with ease. Flow-Factory provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support. The codebase is available at https://github.com/X-GenGroup/Flow-Factory.
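
As a generic illustration of what a registry-based design means, the sketch below lets a config select a reward and an algorithm by name, so neither needs to import the other. It is not Flow-Factory's API: `Registry`, `ALGORITHMS`, `REWARDS`, `brightness_reward`, `mean_baseline_advantages`, and the config keys are all hypothetical names invented for this example.

```python
class Registry:
    """Maps string names to components so configs can select them by name."""

    def __init__(self, kind):
        self.kind = kind
        self._entries = {}

    def register(self, name):
        def decorator(obj):
            if name in self._entries:
                raise KeyError(f"{self.kind} '{name}' is already registered")
            self._entries[name] = obj
            return obj
        return decorator

    def get(self, name):
        try:
            return self._entries[name]
        except KeyError:
            known = ", ".join(sorted(self._entries))
            raise KeyError(f"unknown {self.kind} '{name}' (known: {known})") from None


ALGORITHMS = Registry("algorithm")
REWARDS = Registry("reward")


@REWARDS.register("brightness")
def brightness_reward(sample):
    # Hypothetical reward: mean of a list of numbers standing in for pixels.
    return sum(sample) / len(sample)


@ALGORITHMS.register("mean_baseline")
def mean_baseline_advantages(rewards):
    # Hypothetical algorithm step: centre rewards on the batch mean.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def compute_advantages(config, samples):
    # The caller only knows names; the registries resolve them to code.
    reward_fn = REWARDS.get(config["reward"])
    algorithm = ALGORITHMS.get(config["algorithm"])
    return algorithm([reward_fn(s) for s in samples])


advantages = compute_advantages(
    {"reward": "brightness", "algorithm": "mean_baseline"},
    [[0.0, 1.0], [1.0, 1.0]],
)
```

Adding a new reward or algorithm in this style means registering one more function; the code that resolves names from the config stays unchanged.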


[66] SLA2: Sparse-Linear Attention with Learnable Routing and QAT cs.LG | cs.AI | cs.CV

Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang

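TL;DR: This paper proposes SLA2, an improved sparse-linear attention mechanism for accelerating diffusion models, particularly for video generation. It uses a learnable router to assign each attention computation dynamically to the sparse or linear branch, adopts a more direct sparse-linear attention formulation, and introduces a low-bit attention design via quantization-aware fine-tuning. Experiments show that SLA2 achieves 97% attention sparsity and an 18.6x attention speedup on video diffusion models while preserving generation quality.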

Details

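Motivation: Existing Sparse-Linear Attention (SLA) relies on a heuristic split that assigns computation to the sparse or linear branch by attention-weight magnitude, which can be suboptimal, and an analysis of its attention error reveals a mismatch with a direct decomposition into sparse and linear attention. A more effective sparse-linear attention decomposition is needed.

Result: On video diffusion models, SLA2 achieves 97% attention sparsity and an 18.6x attention speedup while preserving generation quality, balancing efficiency and performance.

Insight: The contributions are a learnable dynamic router, a more direct sparse-linear attention formulation, and a quantization-aware design that combines sparse and low-bit attention, together suggesting how learnable routing and quantization can be integrated when optimizing attention.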

Abstract: Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
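
To make the idea of mixing a sparse and a linear attention branch with a ratio concrete, here is a single-query sketch. It is an illustration under assumptions, not the SLA2 formulation: the routing decisions `to_sparse` are passed in rather than produced by a learned router, `alpha` is a fixed number rather than a learned ratio, the `elu(x) + 1` feature map is one common choice for linear attention, and quantization is omitted.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_average(weights, rows):
    total = sum(weights)
    dim = len(rows[0])
    return [sum(w * row[d] for w, row in zip(weights, rows)) / total for d in range(dim)]


def sparse_branch(q, keys, values):
    # Exact softmax attention over the keys routed to the sparse branch.
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    return weighted_average([math.exp(s - top) for s in scores], values)


def feature_map(x):
    # elu(x) + 1, which is always positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_branch(q, keys, values):
    # Kernelised attention, written per query here and therefore without
    # the usual linear-time reordering of the sums.
    fq = feature_map(q)
    return weighted_average([dot(fq, feature_map(k)) for k in keys], values)


def sparse_linear_attention(q, keys, values, to_sparse, alpha):
    """Mix the two branches with ratio alpha in [0, 1] for one query."""
    dim = len(values[0])
    groups = {True: ([], []), False: ([], [])}
    for k, v, flag in zip(keys, values, to_sparse):
        groups[bool(flag)][0].append(k)
        groups[bool(flag)][1].append(v)
    sparse_k, sparse_v = groups[True]
    linear_k, linear_v = groups[False]
    sparse_out = sparse_branch(q, sparse_k, sparse_v) if sparse_k else [0.0] * dim
    linear_out = linear_branch(q, linear_k, linear_v) if linear_k else [0.0] * dim
    return [alpha * s + (1.0 - alpha) * l for s, l in zip(sparse_out, linear_out)]
```

With `alpha = 1.0` and every key routed to the sparse branch, the function reduces to ordinary softmax attention, which is a convenient sanity check.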


[67] Transporting Task Vectors across Different Architectures without Training cs.LG | cs.AI | cs.CV

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

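TL;DR: This paper proposes Theseus, a training-free method for transporting task-specific parameter updates between models with different architectures. It aligns representation spaces via orthogonal Procrustes analysis and matches task updates by their functional effect on intermediate representations, enabling task vectors to be transferred between vision and language models of different widths.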

Details

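Motivation: When large pre-trained models are adapted to downstream tasks, the resulting task-specific parameter updates are hard to transfer between models with different architectures; transfer across models of different widths in particular remains largely unexplored.

Result: Evaluated on vision and language models of different widths, Theseus shows consistent improvements over strong baselines without additional training or backpropagation.

Insight: The novelty is formalizing task-vector transport as a functional matching problem on observed activations and defining task identity functionally rather than parametrically, which yields stable cross-architecture transfer and offers a new perspective on model adaptation and knowledge transfer.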

Abstract: Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains largely unexplored. In this work, we introduce Theseus, a training-free method for transporting task-specific updates across heterogeneous models. Rather than matching parameters directly, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over strong baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically.
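
For readers unfamiliar with the alignment step named in the abstract, the textbook orthogonal Procrustes problem and its closed-form solution are given below. The symbols are assumptions introduced here: $X$ and $Y$ are $n \times d$ matrices of activations from the two models on the same $n$ inputs. This is the standard equal-width case only; how Theseus handles different widths is not stated in the abstract.

```latex
\[
R^{\star} \;=\; \arg\min_{R \,:\, R^{\top} R = I} \; \lVert X R - Y \rVert_F ,
\qquad
X^{\top} Y = U \Sigma V^{\top}
\;\;\Longrightarrow\;\;
R^{\star} = U V^{\top}.
\]
```

Because $R^{\star}$ is orthogonal, it preserves norms and angles, which is consistent with the abstract's remark that the solution preserves the geometry of the update.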


cs.RO

[68] LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning cs.RO | cs.CV

Junwoon Lee, Yulun Tian

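TL;DR: LatentAM is an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Through online dictionary learning, it associates each Gaussian primitive with a compact query vector that an attention mechanism converts into approximate Vision-Language Model (VLM) embeddings, with no model-specific decoder or pretraining, enabling plug-and-play integration with different VLMs.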

Details

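Motivation: Existing methods rely on model-specific decoders or pretraining to extract high-dimensional VLM embeddings. The goal is a model-agnostic, pretraining-free online mapping approach that scales in real time and supports open-vocabulary robotic perception.

Result: Experiments on public benchmarks and a large-scale custom dataset show that LatentAM attains significantly better feature reconstruction fidelity than state-of-the-art methods while running at near-real-time speed (12-35 FPS) on the evaluated datasets.

Insight: The novelty is a model-agnostic approach based on online dictionary learning, in which compact query vectors approximate VLM embeddings through an attention mechanism without a model-specific decoder. It is combined with an efficient voxel-hashing map management strategy that optimizes an active local map on the GPU and stores the global map on the CPU, giving real-time mapping of large environments with bounded GPU memory usage.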

Abstract: We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings using model-specific decoders, LatentAM proposes an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into approximate VLM embeddings using an attention mechanism with a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online to adapt to evolving scene semantics under trust-region regularization. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing, where optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU to maintain bounded GPU memory usage. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity compared to state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets. Our project page is at: https://junwoonlee.github.io/projects/LatentAM
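
The decoding step described in the abstract, where a compact query vector attends over a dictionary to produce an approximate embedding, can be sketched as follows. The split of the dictionary into `atom_keys` and `atom_values`, the scaled dot-product scoring, and all names are assumptions for illustration; the abstract does not give the exact attention form, and the online dictionary update is not shown.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_embedding(query, atom_keys, atom_values):
    """Turn a compact per-Gaussian query into an approximate embedding.

    Each dictionary atom has a low-dimensional key (matched against the
    query) and a high-dimensional value (a direction in embedding space).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in atom_keys]
    weights = softmax(scores)
    dim = len(atom_values[0])
    return [sum(w * value[d] for w, value in zip(weights, atom_values)) for d in range(dim)]


# Two atoms, 2-D queries, 3-D embeddings (toy sizes).
atom_keys = [[10.0, 0.0], [0.0, 10.0]]
atom_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
embedding = decode_embedding([10.0, 0.0], atom_keys, atom_values)
```

Storing a short query per Gaussian and sharing one dictionary across the map is what keeps the per-primitive footprint small in this kind of design.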


[69] LongNav-R1: Horizon-Adaptive Multi-Turn RL for Long-Horizon VLA Navigation cs.RO | cs.CV

Yue Hu, Avery Xi, Qixin Xiao, Seth Isaacson, Henry X. Liu

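TL;DR: This paper proposes LongNav-R1, an end-to-end multi-turn reinforcement learning framework for optimizing Visual-Language-Action (VLA) models on long-horizon navigation. The framework recasts the navigation decision process as a continuous multi-turn conversation between the VLA policy and the embodied environment, and introduces Horizon-Adaptive Policy Optimization to handle varying horizon lengths and achieve accurate temporal credit assignment.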

Details

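Motivation: The existing single-turn paradigm is limited on long-horizon navigation. The aim is to let the agent reason about the causal effects of historical interactions and about sequential future outcomes, and to learn directly from online interaction, avoiding the behavioral rigidity that human demonstrations impose.

Result: On object navigation benchmarks, with only 4,000 rollout trajectories, LongNav-R1 raises the success rate of Qwen3-VL-2B from 64.3% to 73.0%, showing strong sample efficiency and significantly outperforming state-of-the-art methods. Its generalizability and robustness are further validated by zero-shot performance in long-horizon real-world navigation.

Insight: The core contribution is modeling navigation decisions as a multi-turn conversation and introducing Horizon-Adaptive Policy Optimization. This gives a new RL paradigm for long-horizon sequential decision problems: multi-turn interaction promotes diverse trajectory generation and accurate long-term credit assignment, and may carry over to other embodied tasks that need long-term planning and online learning.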

Abstract: This paper develops LongNav-R1, an end-to-end multi-turn reinforcement learning (RL) framework designed to optimize Visual-Language-Action (VLA) models for long-horizon navigation. Unlike the existing single-turn paradigm, LongNav-R1 reformulates the navigation decision process as a continuous multi-turn conversation between the VLA policy and the embodied environment. This multi-turn RL framework offers two distinct advantages: i) it enables the agent to reason about the causal effects of historical interactions and sequential future outcomes; and ii) it allows the model to learn directly from online interactions, fostering diverse trajectory generation and avoiding the behavioral rigidity often imposed by human demonstrations. Furthermore, we introduce Horizon-Adaptive Policy Optimization. This mechanism explicitly accounts for varying horizon lengths during advantage estimation, facilitating accurate temporal credit assignment over extended sequences. Consequently, the agent develops diverse navigation behaviors and resists collapse during long-horizon tasks. Experiments on object navigation benchmarks validate the framework’s efficacy: with 4,000 rollout trajectories, LongNav-R1 boosts the Qwen3-VL-2B success rate from 64.3% to 73.0%. These results demonstrate superior sample efficiency, and LongNav-R1 significantly outperforms state-of-the-art methods. The model’s generalizability and robustness are further validated by its zero-shot performance in long-horizon real-world navigation settings. All source code will be open-sourced upon publication.


[70] MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery cs.RO | cs.CV | cs.LG

Keshara Weerasinghe, Seyed Hamid Reza Roodabeh, Andrew Hawkins, Zhaomeng Zhang, Zachary Schrader

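TL;DR: This paper introduces MiDAS, an open-source, platform-agnostic multimodal data acquisition system for research on robot-assisted minimally invasive surgery. It acquires time-synchronized multimodal data from external sensors (electromagnetic and RGB-D hand tracking, foot pedal sensing, and surgical video) without relying on proprietary robot telemetry interfaces, and is validated on the Raven-II and da Vinci Xi platforms.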

Details

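Motivation: Research on robot-assisted minimally invasive surgery depends on multimodal data, but access to proprietary robot telemetry is a barrier, so a non-invasive, platform-agnostic data acquisition system is needed to support reproducible research.

Result: On the Raven-II and da Vinci Xi platforms, external hand and foot sensing data correlate strongly with internal robot kinematics, and gesture recognition from non-invasive motion signals performs comparably to proprietary telemetry.

Insight: The novelty is an open-source, non-invasive multimodal data acquisition system that avoids the limits of proprietary interfaces, released with the first multimodal dataset of hernia repair suturing on high-fidelity simulation models, which should make surgical robotics research more accessible and reproducible.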

Abstract: Background: Robot-assisted minimally invasive surgery (RMIS) research increasingly relies on multimodal data, yet access to proprietary robot telemetry remains a major barrier. We introduce MiDAS, an open-source, platform-agnostic system enabling time-synchronized, non-invasive multimodal data acquisition across surgical robotic platforms. Methods: MiDAS integrates electromagnetic and RGB-D hand tracking, foot pedal sensing, and surgical video capture without requiring proprietary robot interfaces. We validated MiDAS on the open-source Raven-II and the clinical da Vinci Xi by collecting multimodal datasets of peg transfer and hernia repair suturing tasks performed by surgical residents. Correlation analysis and downstream gesture recognition experiments were conducted. Results: External hand and foot sensing closely approximated internal robot kinematics, and non-invasive motion signals achieved gesture recognition performance comparable to proprietary telemetry. Conclusion: MiDAS enables reproducible multimodal RMIS data collection and is released with annotated datasets, including the first multimodal dataset capturing hernia repair suturing on high-fidelity simulation models.


[71] Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos cs.RO | cs.CV | cs.LG

Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi, Shenlong Wang

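TL;DR: This paper proposes the Perceive-Simulate-Imitate (PSI) framework, which filters grasp-trajectory data from human videos in simulation and uses it to train a modular robot manipulation policy, learning precise post-grasp manipulation skills efficiently without any robot data.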

Details

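Motivation: When learning robot manipulation skills from human videos, grasping behavior is hard to transfer directly (especially to robots without human-like hands), and arbitrary stable grasps are often incompatible with the subsequent task motion.

Result: Real-world experiments show that the framework learns precise manipulation skills efficiently without robot data and performs more robustly than using a grasp generator naively.

Insight: The novelty is paired grasp-trajectory filtering in simulation, which extends the trajectory data with grasp suitability labels so that task-oriented grasping can be learned by supervised learning. It is a modular policy-learning approach that combines human video data with simulation-based verification to make grasps compatible with the downstream motion.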

Abstract: The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot’s ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.
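
The paired grasp-trajectory filtering step can be pictured as a labeling loop: every candidate grasp is tried against every extracted trajectory in simulation, and the outcomes become supervised labels. The sketch below is a hedged illustration, not the PSI pipeline: `simulate` stands in for a physics rollout, `toy_simulate` is a made-up feasibility rule on wrist angles, and dropping trajectories with no feasible grasp is an assumption of this example.

```python
def label_pairs(trajectories, candidate_grasps, simulate):
    """Attach a per-grasp suitability label to every trajectory.

    `simulate(grasp, trajectory)` is a placeholder for a physics rollout and
    must return True when the motion can be completed with that grasp.
    """
    labeled = []
    for trajectory in trajectories:
        labels = [bool(simulate(grasp, trajectory)) for grasp in candidate_grasps]
        if any(labels):  # assumption: drop motions no candidate grasp can execute
            labeled.append({"trajectory": trajectory, "grasp_labels": labels})
    return labeled


def toy_simulate(grasp_angle, trajectory):
    # Hypothetical feasibility rule: the wrist must stay within +/-90 degrees
    # at every waypoint, where each waypoint is an object rotation in degrees.
    return all(abs(grasp_angle + rotation) <= 90 for rotation in trajectory)


dataset = label_pairs(
    trajectories=[[0, 30, 60], [0, 120, 200]],
    candidate_grasps=[-45, 0, 45],
    simulate=toy_simulate,
)
```

The resulting labels are what allow a grasp-scoring module to be trained with ordinary supervised learning, so that it prefers grasps compatible with the motion that follows.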