Table of Contents

cs.CL

[1] PPoGA: Predictive Plan-on-Graph with Action for Knowledge Graph Question Answering cs.CL | cs.AI | cs.IR

MinGyu Jeon, SuWan Cho, JaeYoung Shu

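TL;DR: This paper proposes PPoGA (Predictive Plan-on-Graph with Action), a new knowledge graph question answering framework that addresses the inability of existing methods to adjust when the initial high-level reasoning plan is wrong. The framework uses a Planner-Executor architecture to separate high-level strategy from low-level execution and a predictive processing mechanism to anticipate outcomes. Its core innovation is a self-correction mechanism comprising path correction and plan correction.

Details

Motivation: When LLMs combined with knowledge graphs start from a flawed high-level reasoning plan in complex question answering, a limitation akin to cognitive functional fixedness often prevents them from adjusting strategy, so they pursue unworkable solutions. The paper aims to build a more robust and flexible reasoning system by modeling human cognitive control and problem-solving.

Result: Extensive experiments on three challenging multi-hop KGQA benchmarks (GrailQA, CWQ, and WebQSP) show that PPoGA achieves state-of-the-art performance and significantly outperforms existing methods.

Insight: The abstract claims novelty in the cognitively inspired Planner-Executor architecture, the predictive processing mechanism, and the core self-correction mechanism (local path correction and global plan reformulation). Viewed objectively, decoupling high-level planning from low-level execution, and letting the system identify, discard, and reformulate an entire plan when it proves ineffective, is a metacognitive capability worth borrowing to make AI reasoning systems more robust and flexible.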

Abstract: Large Language Models (LLMs) augmented with Knowledge Graphs (KGs) have advanced complex question answering, yet they often remain susceptible to failure when their initial high-level reasoning plan is flawed. This limitation, analogous to cognitive functional fixedness, prevents agents from restructuring their approach, leading them to pursue unworkable solutions. To address this, we propose PPoGA (Predictive Plan-on-Graph with Action), a novel KGQA framework inspired by human cognitive control and problem-solving. PPoGA incorporates a Planner-Executor architecture to separate high-level strategy from low-level execution and leverages a Predictive Processing mechanism to anticipate outcomes. The core innovation of our work is a self-correction mechanism that empowers the agent to perform not only Path Correction for local execution errors but also Plan Correction by identifying, discarding, and reformulating the entire plan when it proves ineffective. We conduct extensive experiments on three challenging multi-hop KGQA benchmarks: GrailQA, CWQ, and WebQSP. The results demonstrate that PPoGA achieves state-of-the-art performance, significantly outperforming existing methods. Our work highlights the critical importance of metacognitive abilities like problem restructuring for building more robust and flexible AI reasoning systems.
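The difference between Path Correction and Plan Correction can be sketched with a toy example. This is a minimal illustration, not the authors' implementation: the knowledge graph, the candidate plans, the `ALTERNATIVES` table, and all function names are hypothetical, the "planner" is a fixed list rather than an LLM, and the predictive processing step is omitted.

```python
# Toy knowledge graph: (entity, relation) -> neighbouring entities. Hypothetical data.
KG = {
    ("Inception", "directed_by"): ["Christopher Nolan"],
    ("Christopher Nolan", "born_in"): ["London"],
}

# Plans in the order a hypothetical planner would propose them.
CANDIDATE_PLANS = [
    ["produced_by", "born_in"],  # flawed plan: the first relation is absent from the KG
    ["directed_by", "born_in"],  # workable plan
]

# Local substitutions the executor may try for a failing step (path correction).
ALTERNATIVES = {"birthplace": ["born_in"]}


def execute_step(entities, relation):
    """Follow `relation` from every current entity and collect the results."""
    found = []
    for entity in entities:
        found.extend(KG.get((entity, relation), []))
    return found


def run_plan(start, plan):
    """Execute one plan; return the final entities, or None if a step cannot be repaired."""
    entities = [start]
    for relation in plan:
        result = execute_step(entities, relation)
        if not result:
            # Path correction: retry the same step with a locally substituted relation.
            for alt in ALTERNATIVES.get(relation, []):
                result = execute_step(entities, alt)
                if result:
                    break
        if not result:
            return None  # the step failed even after local repair
        entities = result
    return entities


def answer(start):
    """Plan correction: discard a failed plan entirely and try the next one."""
    for plan in CANDIDATE_PLANS:
        result = run_plan(start, plan)
        if result is not None:
            return result, plan
    return None, None
```

In this sketch a single failing relation is repaired locally inside `run_plan`, while a plan whose step cannot be repaired is dropped as a whole in `answer`, which mirrors the two correction levels the abstract describes.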


[2] G-MemLLM: Gated Latent Memory Augmentation for Long-Context Reasoning in Large Language Models cs.CL | cs.AI

Xun Xu

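TL;DR: This paper proposes G-MemLLM, a gated latent memory augmentation architecture for improving long-context reasoning in large language models. It integrates a trainable latent memory bank with a GRU-style gated update mechanism that selectively updates, preserves, or overwrites memory slots, mitigating information dilution and vanishing gradients over long sequences.

Details

Motivation: Existing LLMs are constrained by finite context windows and struggle to maintain long-term factual consistency in multi-hop reasoning. Existing approaches such as context compression or recurrent tokens often suffer from "context rot" or long-range information dilution, so a mechanism that manages and maintains long-term knowledge effectively is needed.

Result: The method is evaluated on the HotpotQA and Zero-Shot Relation Extraction (ZsRE) benchmarks across model scales from GPT-2 (124M) to Llama 3.1 (8B). G-MemLLM significantly improves multi-hop reasoning and relation extraction precision: on ZsRE, Llama 3.1-8B accuracy improves by 13.3%; on HotpotQA, GPT-2 Answer F1 improves by 8.56 points and Llama 3.1-8B Supporting Fact F1 improves by 6.89 points.

Insight: The main novelty is a trainable latent memory bank with GRU-inspired gated update logic, which lets the model manage long-term memory dynamically and selectively and counters the vanishing gradients of knowledge common in recurrent systems. Viewed objectively, combining gating with latent memory offers a novel and scalable architectural direction for strengthening long-context processing in LLMs.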

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, yet they remain constrained by the finite capacity of their context windows and the inherent difficulty of maintaining long-term factual consistency during multi-hop reasoning. While existing methods utilize context compression or recurrent tokens, they often suffer from “context rot” or the dilution of information over long horizons. In this paper, we propose G-MemLLM, a memory-augmented architecture that integrates a frozen LLM backbone with a trainable Latent Memory Bank. Our key innovation is a GRU-style gated update logic that allows the model to selectively update, preserve, or overwrite latent memory slots, preventing the vanishing gradients of knowledge common in recurrent systems. We evaluate G-MemLLM across scales, from GPT-2 (124M) to Llama 3.1 (8B), on the HotpotQA and Zero-Shot Relation Extraction (ZsRE) benchmarks. Our results demonstrate that G-MemLLM significantly enhances multi-hop reasoning and relational precision, achieving a 13.3% accuracy boost on ZsRE for Llama 3.1-8B, and it also yields improvements across model scales, boosting Answer F1 by 8.56 points for GPT-2 and increasing Supporting Fact F1 by 6.89 points for Llama 3.1-8B on HotpotQA.
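A GRU-style gated update can be illustrated with the standard interpolation m_new = (1 - z) * m + z * c, where z is a sigmoid gate. The sketch below is an assumption about what such an update might look like, using one scalar gate per memory slot. The abstract does not give the exact parameterization, and in a real model the gate logits and candidates would be produced by learned layers rather than passed in by hand.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(memory, candidate, gate_logits):
    """Blend each memory slot with its candidate: m_new = (1 - z) * m + z * c.

    memory, candidate: lists of slots, each slot a list of floats.
    gate_logits: one logit per slot (hypothetical; a learned layer would emit these).
    A gate near 0 preserves the slot, a gate near 1 overwrites it,
    and intermediate values give a partial update.
    """
    updated = []
    for slot, new, logit in zip(memory, candidate, gate_logits):
        z = sigmoid(logit)
        updated.append([(1.0 - z) * m + z * c for m, c in zip(slot, new)])
    return updated


# Three slots: preserve the first, overwrite the second, half-update the third.
memory = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
candidate = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
new_memory = gated_update(memory, candidate, [-50.0, 50.0, 0.0])
```

Because each slot has its own gate, one slot can keep an old fact intact while another is replaced in the same step, which is the selective behaviour the abstract attributes to the gating logic.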


[3] PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems cs.CL | cs.AI

Jiongchi Yu, Yuhan Ma, Xiaoyu Zhang, Junjie Wang, Qiang Hu

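TL;DR: This paper introduces PTCBENCH, a systematic benchmark for evaluating the consistency of LLM personality traits under controlled situational contexts. The benchmark tests models under 12 distinct external conditions (such as locations and life events) and assesses personality rigorously with the NEO Five-Factor Inventory. The study finds that certain external scenarios (e.g., "Unemployment") significantly change LLM personality traits and even affect reasoning capabilities.

Details

Motivation: As LLMs are increasingly deployed in affective agents and AI systems, keeping LLM personality consistent and authentic is critical for user trust and engagement. Existing work, however, overlooks the basic psychological consensus that personality traits are dynamic and context-dependent, a gap this paper aims to fill.

Result: Based on 39,240 personality trait records, the study finds that certain external scenarios (e.g., "Unemployment") trigger significant personality changes in LLMs and alter their reasoning capabilities. PTCBENCH provides an extensible framework for evaluating personality consistency in realistic, evolving environments.

Insight: The paper is the first to bring the situational dependence of personality from psychology systematically into LLM evaluation. It proposes a quantifiable benchmark (PTCBENCH), reveals how external conditions affect LLM personality stability, and offers practical insights for building robust, psychologically aligned AI systems.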

Abstract: With the increasing deployment of large language models (LLMs) in affective agents and AI systems, maintaining a consistent and authentic LLM personality becomes critical for user trust and engagement. However, existing work overlooks a fundamental psychological consensus that personality traits are dynamic and context-dependent. To bridge this gap, we introduce PTCBENCH, a systematic benchmark designed to quantify the consistency of LLM personalities under controlled situational contexts. PTCBENCH subjects models to 12 distinct external conditions spanning diverse location contexts and life events, and rigorously assesses the personality using the NEO Five-Factor Inventory. Our study on 39,240 personality trait records reveals that certain external scenarios (e.g., “Unemployment”) can trigger significant personality changes of LLMs, and even alter their reasoning capabilities. Overall, PTCBENCH establishes an extensible framework for evaluating personality consistency in realistic, evolving environments, offering actionable insights for developing robust and psychologically aligned AI systems.


[4] Construct, Align, and Reason: Large Ontology Models for Enterprise Knowledge Management cs.CL | cs.AI | cs.LG

Yao Zhang, Hongyin Zhu

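TL;DR: This paper proposes the large ontology model (LOM), a unified construct-align-reason framework that targets multi-source heterogeneous data integration and effective semantic reasoning in enterprise knowledge management. The method first builds a dual-layer enterprise ontology from structured and unstructured data, then achieves instruction-aligned reasoning through a three-stage training pipeline: ontology instruction fine-tuning, text-ontology grounding, and curriculum-learning-driven multi-task instruction tuning.

Details

Motivation: Enterprise-scale knowledge management faces major challenges in integrating multi-source heterogeneous data and enabling effective semantic reasoning. Traditional knowledge graphs fall short in discovering implicit relationships and in the semantic understanding needed for complex question answering.

Result: On the constructed comprehensive benchmark, the 4B-parameter LOM reaches 89.47% accuracy and outperforms DeepSeek-V3.2 on complex graph reasoning, indicating effective fusion of ontology structure and language.

Insight: The novelty is a unified construct-align-reason framework together with a three-stage training pipeline designed to strengthen structural understanding, semantic encoding, and reasoning and generation, which fuses ontology structure with the language model.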

Abstract: Enterprise-scale knowledge management faces significant challenges in integrating multi-source heterogeneous data and enabling effective semantic reasoning. Traditional knowledge graphs often struggle with implicit relationship discovery and lack sufficient semantic understanding for complex question answering. To address these limitations, we introduce a unified construct–align–reason framework, the large ontology model (LOM). We first build a dual-layer enterprise ontology from structured databases and unstructured text, subsequently fusing these sources into a comprehensive enterprise ontology. To enable instruction-aligned reasoning, we propose a unified three-stage training pipeline: ontology instruction fine-tuning to improve structural understanding; text-ontology grounding to strengthen node semantic encoding; and multi-task instruction tuning on ontology-language pairs with curriculum learning to enhance semantic reasoning and generation. We also construct comprehensive training and evaluation datasets covering diverse ontology reasoning tasks. On this benchmark, our 4B-parameter LOM achieves 89.47% accuracy and outperforms DeepSeek-V3.2 on complex graph reasoning, indicating effective fusion of ontology structure and language.


[5] Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering cs.CL | cs.LG

Philip Müller, Nicholas Popovič, Michael Färber, Peter Steinbach

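TL;DR: The paper presents the first large-scale benchmark for evaluating the calibration of uncertainty quantification (UQ) methods for large language models in long-form question answering. It covers up to 20 LLMs of different variants and seven scientific QA datasets, and its analysis of 685,000 responses exposes limitations of current UQ methods.

Details

Motivation: Existing UQ methods remain weakly validated in scientific QA, a domain that depends on fact retrieval and reasoning, which hinders trustworthy use of generated answers. A systematic evaluation benchmark is therefore needed.

Result: At the token level, instruction tuning polarizes probability mass, which reduces the reliability of token-level confidence as an uncertainty estimate. At the sequence level, frequency-based answer consistency yields the most reliable calibration, while verbalized methods are systematically biased and poorly correlated with correctness.

Insight: The novelty lies in building an extensible open-source benchmark framework and in showing that judging UQ methods by ECE alone leads to misleading conclusions, which underlines the need for multi-dimensional assessment of calibration.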

Abstract: Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences if not science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers. Existing UQ approaches remain weakly validated in scientific QA, a domain relying on fact-retrieval and reasoning capabilities. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA studying calibration of UQ methods, providing an extensible open-source framework to reproducibly assess calibration. Our study spans up to 20 large language models of base, instruction-tuned and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare methods representative of prominent approaches on a total of 685,000 long-form responses, spanning different reasoning complexities representative of domain-specific tasks. At the token level, we find that instruction tuning induces strong probability mass polarization, reducing the reliability of token-level confidences as estimates of uncertainty. Models further fine-tuned for reasoning are exposed to the same effect, but the reasoning process appears to mitigate it depending on the provider. At the sequence level, we show that verbalized approaches are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields the most reliable calibration. In the wake of our analysis, we study and report the misleading effect of relying exclusively on ECE as a sole measure for judging performance of UQ methods on benchmark datasets. Our findings expose critical limitations of current UQ methods for LLMs and standard practices in benchmarking thereof.
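Two quantities from the abstract lend themselves to a short sketch: answer frequency as a confidence estimate, and Expected Calibration Error (ECE). The code below uses the common equal-width binned ECE and a simple majority-vote frequency. The function names, bin count, and sample data are illustrative assumptions and are not taken from the benchmark's code.

```python
from collections import Counter


def answer_frequency_confidence(samples):
    """Return the most frequent sampled answer and its share of all samples."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece


# Four sampled answers to one question (hypothetical values).
best, confidence = answer_frequency_confidence(["4", "4", "5", "4"])

# A method that always reports 0.5 on a task with 50% accuracy has zero ECE,
# yet its confidence says nothing about which answers are correct.
uninformative_ece = expected_calibration_error([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
```

The constant-confidence case is one well-known way ECE can look perfect while the estimate is useless for ranking answers. It is offered here only as an illustration of why a single calibration number can mislead, not as the specific effect the paper reports.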


[6] DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning cs.CL

Li Siyan, Darshan Deshpande, Anand Kannappan, Rebecca Qian

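TL;DR: This paper proposes the DETOUR benchmark for evaluating search and reasoning in a dual-agent setup under obscure, underspecified retrieval. It contains 1,011 prompts and simulates the multi-turn conversation of a tip-of-the-tongue search.

Details

Motivation: Existing benchmarks for agent capabilities in tip-of-the-tongue search are limited to single-turn settings and cannot realistically simulate multi-turn recall, so a more realistic evaluation environment is needed.

Result: On DETOUR, current state-of-the-art models reach only 36% accuracy when evaluated on all modalities (text, image, audio, and video), showing that their capabilities in underspecified scenarios still need improvement.

Insight: The novelty is a dual-agent evaluation framework (a Primary Agent and a Memory Agent) that simulates real-world vague retrieval through multi-turn interaction, providing a new benchmark for model capability in complex, underspecified information scenarios.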

Abstract: When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.


[7] Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models cs.CL

Sercan Karakaş

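TL;DR: This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. Using a balanced set of 100 sentences, it compares an OpenAI chain-of-thought model with Trendyol-LLM-7B-base-v0.1, a model fine-tuned on Turkish data, on choosing local versus non-local antecedents for the reflexives kendi and kendisi.

Details

Motivation: The study addresses the linguistic question of whether LLMs understand and have acquired the binding relations of reflexive pronouns in Turkish, in particular by testing model preferences between local and long-distance binding.

Result: Trendyol-LLM favours local binding in approximately 70% of trials, showing a strong locality bias, whereas o1 Mini distributes its choices almost evenly between local and long-distance readings, revealing a marked difference in binding behaviour between the two systems.

Insight: The novelty is an evaluation method that combines sentence-level perplexity with a forced-choice paradigm to compare LLMs with different architectures and training strategies on a specific linguistic phenomenon. Viewed objectively, the degree of fine-tuning on language-specific data may substantially affect how well a model captures linguistic rules, which offers a new perspective on the internal linguistic representations of LLMs.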

Abstract: This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced set of 100 sentences that pit local against non-local antecedents for the reflexives kendi and kendisi, and test two contrasting systems: an OpenAI chain-of-thought model designed for multi-step reasoning and Trendyol-LLM-7B-base-v0.1, a LLaMA-2-derived model extensively fine-tuned on Turkish data. Antecedent choice is assessed using a combined sentence-level perplexity and forced-choice paradigm. Trendyol-LLM favours local bindings in approximately 70% of trials, exhibiting a strong locality bias, whereas o1 Mini distributes its choices almost evenly between local and long-distance readings, revealing a marked contrast in binding behaviour across the two systems.
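One way a perplexity-based forced choice could be scored is sketched below. It assumes per-token log-probabilities are already available for two disambiguated versions of a sentence (one forcing the local antecedent, one the non-local antecedent) and picks the version with lower perplexity. The function names and log-probability values are hypothetical, and the study's actual protocol may combine the two signals differently.

```python
import math


def perplexity(token_logprobs):
    """Sentence-level perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def preferred_reading(local_logprobs, nonlocal_logprobs):
    """Forced choice: the reading whose sentence has lower perplexity wins."""
    if perplexity(local_logprobs) <= perplexity(nonlocal_logprobs):
        return "local"
    return "non-local"


def locality_rate(choices):
    """Share of trials in which the local antecedent was chosen."""
    return sum(1 for choice in choices if choice == "local") / len(choices)


# Hypothetical natural-log probabilities for one sentence pair.
choice = preferred_reading([-1.0, -1.2, -0.8], [-2.0, -1.9, -2.1])

# Aggregating over trials gives a figure comparable to the roughly 70% reported above.
rate = locality_rate(["local"] * 7 + ["non-local"] * 3)
```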


[8] Segment-Level Attribution for Selective Learning of Long Reasoning Traces cs.CL

Siyuan Wang, Yanchen Liu, Xiang Ren

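TL;DR: This paper proposes a selective learning framework based on segment-level attribution for training large reasoning models on long reasoning traces. Integrated gradient attribution quantifies each token's influence on the final answer, and the token scores are aggregated into two segment-level metrics, attribution strength and direction consistency, to identify important segments that show reflective reasoning. The framework applies selective supervised fine-tuning to the important segments and masks the loss on unimportant ones, improving model accuracy and output efficiency.

Details

Motivation: Large reasoning models achieve strong reasoning performance by generating long chain-of-thought traces, but only a small fraction of segments contributes materially to answer prediction, and most contain repetitive or truncated content. This output redundancy propagates further after supervised fine-tuning, as models imitate verbose but uninformative patterns, which can degrade performance.

Result: Experiments across multiple models and datasets show that the method improves accuracy and output efficiency, enabling more effective learning from long reasoning traces.

Insight: The novelty lies in aggregating token-level attributions into segment-level metrics (attribution strength and direction consistency) and building a selective learning framework on them. The framework focuses on reflective reasoning segments with high attribution strength but moderate consistency, which refines supervised fine-tuning and reduces learning from redundant content.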

Abstract: Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To this end, we incorporate integrated gradient attribution to quantify each token’s influence on final answers and aggregate them into two segment-level metrics: (1) attribution strength measures the overall attribution magnitude; and (2) direction consistency captures whether tokens’ attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces. Code and data are available at https://github.com/SiyuanWangw/SegmentSelectiveSFT.
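The two segment-level metrics and the resulting loss mask can be sketched as follows. The formulas are assumptions chosen to match the verbal definitions in the abstract: strength is taken as the mean absolute attribution, and consistency as |sum of attributions| divided by the sum of absolute attributions, so that 1.0 means all tokens share one sign. The thresholds and attribution values are hypothetical, and the released code may define these quantities differently.

```python
def attribution_strength(attrs):
    """Assumed definition: mean absolute token attribution within a segment."""
    return sum(abs(a) for a in attrs) / len(attrs)


def direction_consistency(attrs):
    """Assumed definition: |sum| / sum of |.|; 1.0 means uniformly signed attributions."""
    total = sum(abs(a) for a in attrs)
    return abs(sum(attrs)) / total if total > 0 else 0.0


def loss_mask(segments, min_strength, max_consistency):
    """Token-level mask: 1 for tokens in selected segments, 0 for masked-out tokens.

    A segment is selected when its strength is high enough and its
    consistency is at most `max_consistency` (mixed-sign attributions).
    """
    mask = []
    for attrs in segments:
        keep = (attribution_strength(attrs) >= min_strength
                and direction_consistency(attrs) <= max_consistency)
        mask.extend([1 if keep else 0] * len(attrs))
    return mask


# Hypothetical per-token attributions for three segments of one reasoning trace.
segments = [
    [0.5, -0.4, 0.6],  # strong, mixed sign: kept
    [0.01, 0.02],      # weak: masked
    [0.5, 0.5],        # strong but uniformly positive: masked
]
mask = loss_mask(segments, min_strength=0.1, max_consistency=0.8)
```

During selective SFT a mask like this would multiply the per-token loss, so that gradients come only from the selected segments.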


[9] When Agents “Misremember” Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems cs.CL | cs.AI | cs.CR

Naen Xu, Hengyu An, Shuo Shi, Jinghuai Zhang, Chunyi Zhou

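TL;DR: This paper studies the Mandela effect in LLM-based multi-agent systems, where agents collectively misremember because of social influence and internalized misinformation. The authors propose the MANBENCH benchmark to evaluate the effect, analyze the factors that influence it, and propose prompt-level and model-level defenses that reduce it by 74.40% on average.

Details

Motivation: Multi-agent systems are susceptible to collective cognitive biases during collaboration. The Mandela effect is a typical example: this vulnerability limits understanding of memory bias in such systems and raises ethical concerns, so its existence, causes, and mitigation need investigation.

Result: Agents powered by several LLMs are evaluated on MANBENCH to quantify the Mandela effect. The proposed defenses (such as cognitive anchoring and an alignment-based model-level defense) reduce the effect by 74.40% on average relative to the baseline.

Insight: The contributions are MANBENCH, a benchmark for systematically evaluating the Mandela effect in multi-agent systems, and defenses that combine prompt engineering with model alignment, offering insights for building more resilient and ethically aligned collaborative systems.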

Abstract: Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi-agent systems, enabling them to address complex challenges. However, within these multi-agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi-agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM-based multi-agent systems, focusing on its existence, causing factors, and mitigation strategies. We propose MANBENCH, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on MANBENCH to quantify the Mandela effect and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt-level defenses (e.g., cognitive anchoring and source scrutiny) and model-level alignment-based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi-agent systems.


[10] Intention-Adaptive LLM Fine-Tuning for Text Revision Generation cs.CL

Zhexiong Liu, Diane Litman

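TL;DR: This paper proposes Intention-Tuning, an intention-adaptive layer-wise fine-tuning framework that addresses the difficulties LLMs face in intention-based text revision generation. The framework dynamically selects LLM layers to learn intention representations and transfers them to revision generation, achieving efficient and effective performance on small revision corpora.

Details

Motivation: LLMs remain underused in intention-based generation tasks such as text revision. Existing methods struggle with entangled multi-intent scenarios, and fine-tuning on intention-based instructions requires large amounts of annotated data, which is expensive and scarce.

Result: Experiments show that Intention-Tuning outperforms several parameter-efficient fine-tuning (PEFT) baselines on small revision corpora, supporting its effectiveness and efficiency.

Insight: The novelty is a layer-wise intention-adaptive fine-tuning mechanism that dynamically selects LLM layers and transfers their representations, learning complex intentions efficiently from little data and offering a new approach to intention-based generation tasks.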

Abstract: Large Language Models (LLMs) have achieved impressive capabilities in various context-based text generation tasks, such as summarization and reasoning; however, their applications in intention-based generation tasks remain underexplored. One such example is revision generation, which requires the generated text to explicitly reflect the writer’s actual intentions. Identifying intentions and generating desirable revisions are challenging due to their complex and diverse nature. Although prior work has employed LLMs to generate revisions with few-shot learning, they struggle with handling entangled multi-intent scenarios. While fine-tuning LLMs using intention-based instructions appears promising, it demands large amounts of annotated data, which is expensive and scarce in the revision community. To address these challenges, we propose Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of LLM layers to learn the intentions and subsequently transfers their representations to revision generation. Experimental results suggest that Intention-Tuning is effective and efficient on small revision corpora, outperforming several PEFT baselines.


[11] From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas cs.CL

Zhaokun Yan, Zhaohan Liu, Wuzheng Dong, Lijie Feng, Chengxiao Dai

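TL;DR: This paper introduces GlobalHealthAtlas, a large-scale multilingual public health reasoning dataset of 280,210 instances covering 15 public health domains and 17 languages, stratified into three difficulty levels. It also proposes an LLM-assisted data construction and quality control pipeline and a domain-aligned evaluator for assessing LLMs on safety-critical public health reasoning, going beyond conventional QA benchmarks.

Details

Motivation: Public health reasoning requires population-level inference grounded in scientific evidence, expert consensus, and safety constraints, yet it remains underexplored as a structured machine learning problem, with limited supervision signals and benchmarks.

Result: The paper builds the GlobalHealthAtlas dataset, using an LLM-assisted pipeline to maintain data quality. It also proposes a domain-aligned evaluator, distilled from high-confidence judgments of diverse LLMs, that scores outputs along six dimensions (Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness) and supports reproducible LLM training and evaluation.

Insight: The contributions are: 1) a large-scale, stratified, multilingual public health reasoning dataset; 2) an LLM-assisted construction and quality control pipeline that improves consistency at scale; 3) a multi-dimensional evaluator focused on reasoning in a safety-critical domain, providing a new benchmark for LLMs in specialized fields.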

Abstract: Public health reasoning requires population level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages, stratified into three difficulty levels from health literacy to epidemiological and policy reasoning. Instances are derived from openly available public health sources and labeled by language, domain, and difficulty to support supervised learning and slice based evaluation. We further propose a large language model (LLM) assisted construction and quality control pipeline with retrieval, duplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain aligned evaluator distilled from high confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety critical public health reasoning beyond conventional QA benchmarks.


[12] Culturally-Grounded Governance for Multilingual Language Models: Rights, Data Boundaries, and Accountable AI Design cs.CL | cs.AI

Hanjing Shi, Dominic DiFranzo

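TL;DR: This paper addresses the governance risks that arise when multilingual large language models (MLLMs) are deployed across cultural, linguistic, and political contexts, and proposes a culturally grounded governance framework. The framework targets systematic risks caused by English-centric data, assumptions of homogeneous users, and abstract notions of fairness, especially their impact on low-resource languages and culturally marginalized communities.

Details

Motivation: Existing MLLM governance frameworks largely assume English-centric data, homogeneous user populations, and abstract notions of fairness, and overlook cultural, linguistic, and political diversity. This creates systematic risks for low-resource languages and marginalized communities, where data practices, model behavior, and accountability mechanisms fail to align with local norms, rights, and expectations.

Result: The paper presents no new technical benchmark or quantitative results. It contributes a conceptual agenda that reframes multilingual AI governance as a sociocultural and rights-based problem.

Insight: The novelty is in shifting multilingual AI governance from a question of technical benchmarks to one of sociocultural context and rights. The paper argues for a culturally grounded governance framework and draws out design and policy implications for data stewardship, transparency, and participatory accountability, so that MLLMs do not reproduce global inequalities under the guise of scale and neutrality.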

Abstract: Multilingual large language models (MLLMs) are increasingly deployed across cultural, linguistic, and political contexts, yet existing governance frameworks largely assume English-centric data, homogeneous user populations, and abstract notions of fairness. This creates systematic risks for low-resource languages and culturally marginalized communities, where data practices, model behavior, and accountability mechanisms often fail to align with local norms, rights, and expectations. Drawing on cross-cultural perspectives in human-centered computing and AI governance, this paper synthesizes existing evidence on multilingual model behavior, data asymmetries, and sociotechnical harm, and articulates a culturally grounded governance framework for MLLMs. We identify three interrelated governance challenges: cultural and linguistic inequities in training data and evaluation practices, misalignment between global deployment and locally situated norms, values, and power structures, and limited accountability mechanisms for addressing harms experienced by marginalized language communities. Rather than proposing new technical benchmarks, we contribute a conceptual agenda that reframes multilingual AI governance as a sociocultural and rights based problem. We outline design and policy implications for data stewardship, transparency, and participatory accountability, and argue that culturally grounded governance is essential for ensuring that multilingual language models do not reproduce existing global inequalities under the guise of scale and neutrality.


[13] Reasoning by Commented Code for Table Question Answering cs.CL

Seho Pyo, Jiheon Seok, Jaejin Lee

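TL;DR: This paper proposes a table question answering framework based on commented code generation. It decomposes reasoning into multi-line executable Python programs with natural language comments to improve numerical accuracy and interpretability. The method reaches 70.9% accuracy on the WikiTableQuestions benchmark, surpassing the baseline, and up to 84.3% when combined with an end-to-end model.

Details

Motivation: Linearizing tables disrupts their two-dimensional structural relationships, and existing TableQA methods show limited numerical accuracy and interpretability. The paper addresses this with a code-generation framework that makes the reasoning explicit.

Result: On WikiTableQuestions, the method reaches 70.9% accuracy with Qwen2.5-Coder-7B-Instruct (surpassing the Repanda baseline at 67.6%). Combining it with an end-to-end model through a lightweight answer-selection mechanism raises accuracy to as much as 84.3%.

Insight: The novelty is in making the reasoning process explicit as commented multi-line code, which increases transparency when processing structured data. Modular decomposition combined with an end-to-end model balances interpretability against performance.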

Abstract: Table Question Answering (TableQA) poses a significant challenge for large language models (LLMs) because conventional linearization of tables often disrupts the two-dimensional relationships intrinsic to structured data. Existing methods, which depend on end-to-end answer generation or single-line program queries, typically exhibit limited numerical accuracy and reduced interpretability. This work introduces a commented, step-by-step code-generation framework that incorporates explicit reasoning into the Python program-generation process. The approach decomposes TableQA reasoning into multi-line executable programs with concise natural language comments, thereby promoting clearer reasoning and increasing the likelihood of generating correct code. On the WikiTableQuestions benchmark, the proposed method achieves 70.9% accuracy using Qwen2.5-Coder-7B-Instruct, surpassing the Repanda baseline (67.6%). Integrating the proposed framework with a robust end-to-end TableQA model via a lightweight answer-selection mechanism yields further improvements. This combined approach achieves up to 84.3% accuracy on the WikiTableQuestions benchmark.
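The kind of program the framework aims to generate can be illustrated with a hand-written example: a multi-line Python function in which each step carries a short natural language comment. The table, the question, and the function name below are hypothetical and are not taken from WikiTableQuestions or from the paper's prompts.

```python
# Hypothetical table, one dict per row.
table = [
    {"nation": "Norway", "gold": 16, "silver": 8},
    {"nation": "Germany", "gold": 12, "silver": 10},
    {"nation": "China", "gold": 9, "silver": 4},
]

# Question: how many more gold medals did the top nation win than the runner-up?


def answer_question(rows):
    # Step 1: sort the nations by gold medal count, highest first.
    ranked = sorted(rows, key=lambda row: row["gold"], reverse=True)

    # Step 2: take the top nation and the runner-up.
    first, second = ranked[0], ranked[1]

    # Step 3: the answer is the difference between their gold counts.
    return first["gold"] - second["gold"]


result = answer_question(table)
```

Compared with a single-line query, the stepwise form exposes each intermediate decision, so an incorrect sort key or row selection can be spotted directly from the comments and the code.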


[14] Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling cs.CL | cs.AI

Chaoqun Cui, Shijing Wang, Liangbin Huang, Qingqing Gu, Zhaolong Huang

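TL;DR: This paper presents Hermes, an LLM-based automated interlingual subtitling framework that addresses challenges in subtitle translation such as semantic coherence, pronoun and terminology translation, and expressiveness. The framework integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement. Experiments show state-of-the-art speaker diarization performance and expressive, contextually coherent translations.

Details

Motivation: Interlingual subtitle translation is essential for entertainment localization but has not been fully explored in machine translation. LLMs have advanced general machine translation, yet the distinctive characteristics of subtitle text (semantic coherence, pronoun and terminology translation, expressiveness) continue to pose challenges.

Result: Experiments show that Hermes achieves state-of-the-art performance on speaker diarization and generates expressive, contextually coherent translations.

Insight: The main novelty is a unified, modular LLM framework that integrates components aimed at the specific challenges of subtitle translation (speaker diarization, terminology identification, expressiveness enhancement) rather than relying only on a general-purpose translation model, which offers a new direction for interlingual subtitling research.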

Abstract: Interlingual subtitling, which translates subtitles of visual media into a target language, is essential for entertainment localization but has not yet been explored in machine translation. Although Large Language Models (LLMs) have significantly advanced the general capabilities of machine translation, the distinctive characteristics of subtitle texts pose persistent challenges in interlingual subtitling, particularly regarding semantic coherence, pronoun and terminology translation, and translation expressiveness. To address these issues, we present Hermes, an LLM-based automated subtitling framework. Hermes integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which effectively tackle the above challenges. Experiments demonstrate that Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, thereby advancing research in interlingual subtitling.


[15] Formal Semantic Control over Language Models cs.CL

Yingji Zhang

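TL;DR: Working within a variational autoencoder (VAE) framework, this thesis advances semantic representation learning in two directions, sentence-level and reasoning-level, with the aim of making the latent space of language models more semantically and geometrically interpretable and enabling localised, quasi-symbolic, compositional control.

Details

Motivation: The goal is language models whose internal semantic representations can be systematically interpreted, precisely structured, and reliably directed, which improves the interpretability and controllability of their latent spaces.

Result: The thesis introduces a set of novel theoretical frameworks and practical methods and shows experimentally that they improve the interpretability and controllability of natural language latent spaces in explanatory text generation and Explanatory natural language inference (NLI) tasks.

Insight: The novelty lies in guiding sentence generation by disentangling and manipulating specific semantic features in the latent space, and in controlling NLI by isolating and steering inference behaviours, which together give localised, quasi-symbolic, compositional control over language models.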

Abstract: This thesis advances semantic representation learning to render language representations or models more semantically and geometrically interpretable, and to enable localised, quasi-symbolic, compositional control through deliberate shaping of their latent space geometry. We pursue this goal within a VAE framework, exploring two complementary research directions: (i) Sentence-level learning and control: disentangling and manipulating specific semantic features in the latent space to guide sentence generation, with explanatory text serving as the testbed; and (ii) Reasoning-level learning and control: isolating and steering inference behaviours in the latent space to control NLI. In this direction, we focus on Explanatory NLI tasks, in which two premises (explanations) are provided to infer a conclusion. The overarching objective is to move toward language models whose internal semantic representations can be systematically interpreted, precisely structured, and reliably directed. We introduce a set of novel theoretical frameworks and practical methodologies, together with corresponding experiments, to demonstrate that our approaches enhance both the interpretability and controllability of latent spaces for natural language across the thesis.


Haitao Li, Yifan Chen, Shuo Miao, Qian Dong, Jia Chen

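TL;DR: This paper presents LegalOne, a family of foundation models designed for the Chinese legal domain. A three-phase training pipeline (Plasticity-Adjusted Sampling, Legal Agentic CoT Distillation, and Curriculum Reinforcement Learning) is used to build legal reasoning ability. The models reach state-of-the-art performance on a range of legal tasks, and the model weights and evaluation framework are publicly released.

Details

Motivation: General-purpose LLMs are limited in legal applications by a lack of precise domain knowledge and by the difficulty of performing rigorous multi-step judicial reasoning. The paper addresses this gap.

Result: Experiments show that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with far more parameters through higher knowledge density and efficiency.

Insight: The contributions are: 1) Plasticity-Adjusted Sampling (PAS), which balances acquiring new knowledge against retaining original capabilities; 2) Legal Agentic CoT Distillation (LEAD), which converts complex judicial processes into structured reasoning trajectories; 3) a Curriculum Reinforcement Learning strategy that builds reasoning ability progressively, from memorization through understanding to reasoning. The models emphasize knowledge density and efficiency rather than scale alone.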

Abstract: While Large Language Models (LLMs) have demonstrated impressive general capabilities, their direct application in the legal domain is often hindered by a lack of precise domain knowledge and the complexity of performing rigorous multi-step judicial reasoning. To address this gap, we present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. First, during the mid-training phase, we propose Plasticity-Adjusted Sampling (PAS) to address the challenge of domain adaptation. This perplexity-based scheduler strikes a balance between the acquisition of new knowledge and the retention of original capabilities, effectively establishing a robust legal foundation. Second, during supervised fine-tuning, we employ Legal Agentic CoT Distillation (LEAD) to distill explicit reasoning from raw legal texts. Unlike naive distillation, LEAD utilizes an agentic workflow to convert complex judicial processes into structured reasoning trajectories, thereby enforcing factual grounding and logical rigor. Finally, we implement a Curriculum Reinforcement Learning (RL) strategy. Through a progressive reinforcement process spanning memorization, understanding, and reasoning, LegalOne evolves from simple pattern matching to autonomous and reliable legal reasoning. Experimental results demonstrate that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with vastly larger parameter counts through enhanced knowledge density and efficiency. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI, paving the way for deploying trustworthy and interpretable foundation models in high-stakes judicial applications.


[17] ExperienceWeaver: Optimizing Small-sample Experience Learning for LLM-based Clinical Text Improvement cs.CL | cs.AI

Ziyan Xiao, Yinghao Zhu, Liang Peng, Lequan Yu

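TL;DR: This paper proposes ExperienceWeaver, a hierarchical framework for LLM-based clinical text improvement that targets small-sample settings, where high-quality data is limited and medical documentation carries complex constraints. The framework distills noisy, multi-dimensional feedback into structured, actionable knowledge so that the model learns "how to revise" rather than just "what to revise".

Details

Motivation: Current LLM-based approaches to clinical text improvement are limited in small-sample settings: supervised fine-tuning is data-intensive and costly, while retrieval-augmented generation often provides only superficial corrections and fails to capture the reasoning behind revisions.

Result: Extensive evaluations on four clinical datasets show that ExperienceWeaver consistently improves performance in small-sample settings, surpassing state-of-the-art models such as Gemini-3 Pro.

Insight: The novelty is in shifting the focus from data retrieval to experience learning. Feedback is distilled into error-specific Tips and high-level Strategies and injected into an agentic pipeline, so the model learns the reasoning behind revisions, which is a clear advantage when samples are scarce.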

Abstract: Clinical text improvement is vital for healthcare efficiency but remains difficult due to limited high-quality data and the complex constraints of medical documentation. While Large Language Models (LLMs) show promise, current approaches struggle in small-sample settings: supervised fine-tuning is data-intensive and costly, while retrieval-augmented generation often provides superficial corrections without capturing the reasoning behind revisions. To address these limitations, we propose ExperienceWeaver, a hierarchical framework that shifts the focus from data retrieval to experience learning. Instead of simply recalling past examples, ExperienceWeaver distills noisy, multi-dimensional feedback into structured, actionable knowledge, specifically error-specific Tips and high-level Strategies. By injecting this distilled experience into an agentic pipeline, the model learns “how to revise” rather than just “what to revise”. Extensive evaluations across four clinical datasets demonstrate that ExperienceWeaver consistently improves performance, surpassing state-of-the-art models such as Gemini-3 Pro in small-sample settings.


[18] Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning cs.CL | cs.AIPDF

Zhipeng Chen, Xiaobo Qin, Wayne Xin Zhao, Youbin Wu, Ji-Rong Wen

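TL;DR: This paper proposes A$^2$D, an Adaptive Ability Decomposing method that improves the effectiveness of reinforcement learning with verifiable rewards (RLVR) for strengthening LLM reasoning. It trains a decomposer to break complex questions into simpler sub-questions and uses those sub-questions to guide the reasoner’s training, addressing the blind exploration caused by the limited information available during RLVR.

Details

Motivation: When existing RLVR methods are used to strengthen LLM reasoning, the limited information provided during training leaves the model to explore largely blindly, so it fails on complex problems. The paper aims to supply additional information to the RLVR process without relying on a teacher model.

Result: Experiments show that A$^2$D outperforms existing baselines and works as a plug-and-play module for different RLVR algorithms, improving the reasoner’s exploration and exploitation abilities.

Insight: The novelty is an adaptive ability decomposition framework: a decomposer is first trained to generate sub-questions, which then guide the reasoner’s training, giving the RLVR process a structured intermediate supervision signal. This avoids dependence on an external teacher model and improves exploration efficiency on complex reasoning tasks.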

Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A$^2$D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A$^2$D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner’s exploration and exploitation abilities.


[19] APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards cs.CLPDF

Kaiyan Chang, Chenwei Zhu, Yingfeng Luo, Yifu Huo, Chenglong Wang

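TL;DR: This paper proposes APR (Anchor-based Process Reward), which targets the “overthinking” that Large Reasoning Models (LRMs) exhibit under Test-Time Scaling (TTS). The method locates the “reasoning anchor” in a reasoning trace (the position where the answer first stabilizes) and penalizes only the meaningless repeated verification after it, termed the Answer-Stable Tail (AST), thereby reducing computational redundancy. Experiments show that APR reaches the performance-efficiency Pareto frontier on multiple mathematical reasoning datasets while substantially reducing the compute needed for RL training.

Details

Motivation: Under test-time scaling, LRMs “overthink”: even after obtaining the final answer they keep performing meaningless repeated self-verification, which wastes computation and lowers efficiency. The paper analyzes this structural redundancy and proposes a targeted reward shaping method to improve model efficiency.

Result: On five mathematical reasoning datasets, APR reaches the performance-efficiency Pareto frontier at the 1.5B and 7B scales while requiring significantly fewer computational resources for RL training.

Insight: The novelty lies in revisiting overthinking from a fine-grained perspective, formally defining the Reasoning Anchor and the Answer-Stable Tail (AST), and proposing APR, a structure-aware reward shaping method that penalizes only redundant post-anchor reasoning. It offers a reusable idea for making LRMs more efficient: design targeted rewards by locating key reasoning positions instead of applying a global penalty.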

Abstract: Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs) but introduces a critical side-effect known as Overthinking. We conduct a preliminary study to rethink this phenomenon from a fine-grained perspective. We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We formally define this specific position where the answer first stabilizes as the Reasoning Anchor. By analyzing pre- and post-anchor reasoning behaviors, we uncover the structural redundancy fixed in LRMs: the meaningless repetitive verification after deriving the first complete answer, which we term the Answer-Stable Tail (AST). Motivated by this observation, we propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST. Leveraging the policy optimization algorithm suitable for length penalties, our APR models achieved the performance-efficiency Pareto frontier at 1.5B and 7B scales averaged across five mathematical reasoning datasets while requiring significantly fewer computational resources for RL training.
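The anchor idea can be illustrated with a minimal sketch. It assumes the reasoning trace has already been split into steps and that an intermediate answer (or `None`) has been extracted from each step; the function names, the linear penalty, and the `tail_penalty` value are hypothetical, since the abstract does not give APR's actual reward formula.

```python
from typing import Optional, Sequence


def find_reasoning_anchor(step_answers: Sequence[Optional[str]]) -> Optional[int]:
    """Index of the first step from which every stated answer equals the final one."""
    stated = [(i, a) for i, a in enumerate(step_answers) if a is not None]
    if not stated:
        return None
    final_answer = stated[-1][1]
    anchor = stated[-1][0]
    # Walk backwards while earlier stated answers still agree with the final answer.
    for index, answer in reversed(stated):
        if answer != final_answer:
            break
        anchor = index
    return anchor


def shaped_reward(step_answers: Sequence[Optional[str]], correct: bool,
                  tail_penalty: float = 0.05) -> float:
    """Outcome reward minus a penalty on post-anchor steps only (assumed linear)."""
    base = 1.0 if correct else 0.0
    anchor = find_reasoning_anchor(step_answers)
    if anchor is None:
        return base
    tail_length = len(step_answers) - 1 - anchor  # steps after the anchor
    return base - tail_penalty * tail_length


trace = [None, "12", "15", None, "15", "15"]  # the answer stabilizes at step 2
print(find_reasoning_anchor(trace), shaped_reward(trace, correct=True))
```

Steps before the anchor are never penalized in this sketch, which is the contrast with a global length penalty that the summary draws.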


[20] WordCraft: Scaffolding the Keyword Method for L2 Vocabulary Learning with Multimodal LLMs cs.CL | cs.HCPDF

Yuheng Shao, Junjie Xiong, Chaoran Wu, Xiyuan Wang, Ziyu Zhou

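TL;DR: Addressing the difficulty that L1 Chinese learners of English have in applying the keyword method to vocabulary memorization, this paper presents WordCraft, an interactive tool built on Multimodal Large Language Models (MLLMs). The tool provides process-oriented scaffolding for the keyword method by guiding learners through three steps (keyword selection, association construction, and image formation) to improve vocabulary retention. Two user studies show that WordCraft preserves the generation effect while achieving high effectiveness and usability.

Details

Motivation: L1 Chinese learners of English who apply the keyword method struggle to generate phonologically appropriate keywords, construct coherent associations, and form vivid mental imagery. Existing approaches, such as fully automated keyword generation or outcome-oriented mnemonic aids, either undermine learner engagement or lack adequate process-oriented guidance.

Result: Two user studies show that WordCraft not only preserves the generation effect but also achieves high levels of effectiveness and usability.

Insight: The novelty lies in applying MLLMs to vocabulary-learning scaffolding through a learner-centered interactive tool that turns the keyword method from outcome-oriented into process-guided: step-by-step interaction (selection, association, imagery) strengthens active engagement and deep processing, and thus retention. This suggests new directions for human-AI collaboration and process-oriented learning support in educational technology.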

Abstract: Applying the keyword method for vocabulary memorization remains a significant challenge for L1 Chinese-L2 English learners. They frequently struggle to generate phonologically appropriate keywords, construct coherent associations, and create vivid mental imagery to aid long-term retention. Existing approaches, including fully automated keyword generation and outcome-oriented mnemonic aids, either compromise learner engagement or lack adequate process-oriented guidance. To address these limitations, we conducted a formative study with L1 Chinese-L2 English learners and educators (N=18), which revealed key difficulties and requirements in applying the keyword method to vocabulary learning. Building on these insights, we introduce WordCraft, a learner-centered interactive tool powered by Multimodal Large Language Models (MLLMs). WordCraft scaffolds the keyword method by guiding learners through keyword selection, association construction, and image formation, thereby enhancing the effectiveness of vocabulary memorization. Two user studies demonstrate that WordCraft not only preserves the generation effect but also achieves high levels of effectiveness and usability.


[21] Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models cs.CLPDF

Siyuan Zhang, Jialian Li, Yichi Zhang, Xiao Yang, Yinpeng Dong

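TL;DR: This paper introduces a representational perspective for analyzing how the internal states of LLMs evolve on reasoning tasks. It finds that post-training brings only limited improvement in initial representation quality but drives a continuous shift in the representation distribution during reasoning, toward one better suited to solving the task. Statistical analysis shows that generation correctness is highly correlated with the final representations, and counterfactual experiments identify the semantics of the generated tokens as the main driver of this shift.

Details

Motivation: Prior work mainly analyzes the evolution of reasoning ability through explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. This paper analyzes the dynamics of the model’s internal states to reveal the internal mechanism of reasoning and how training improves it.

Result: Comprehensive experiments on models at different training stages show that post-training yields only limited improvement in static initial representation quality. On reasoning tasks, representations undergo a significant, continuous distributional shift during generation, and post-training enables models to drive this shift toward a distribution better suited to problem solving. Statistical analysis confirms a high correlation between generation correctness and the final representations.

Insight: The novelty lies in the representational perspective on reasoning, which reveals a continuous distributional shift in internal representations during reasoning and shows that post-training improves reasoning by optimizing this shift. The core finding is that the semantics of the generated tokens, rather than additional inference-time computation or intrinsic parameter differences, is the dominant driver of the transition, which suggests new directions for model analysis and optimization.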

Abstract: Large Language Models have achieved remarkable performance on reasoning tasks, motivating research into how this ability evolves during training. Prior work has primarily analyzed this evolution via explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. To address this opacity, we introduce a representational perspective to investigate the dynamics of the model’s internal states. Through comprehensive experiments across models at various training stages, we discover that post-training yields only limited improvement in static initial representation quality. Furthermore, we reveal that, distinct from non-reasoning tasks, reasoning involves a significant continuous distributional shift in representations during generation. Comparative analysis indicates that post-training empowers models to drive this transition toward a better distribution for task solving. To clarify the relationship between internal states and external outputs, statistical analysis confirms a high correlation between generation correctness and the final representations; while counterfactual experiments identify the semantics of the generated tokens, rather than additional computation during inference or intrinsic parameter differences, as the dominant driver of the transition. Collectively, we offer a novel understanding of the reasoning process and the effect of training on reasoning enhancement, providing valuable insights for future model analysis and optimization.


[22] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis cs.CLPDF

Zicheng Kong, Dehua Ma, Zhenbo Xu, Alven Yang, Yiwei Ru

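TL;DR: This paper presents Omni-RRM, an open-source, rubric-grounded omni-modal reward model that produces structured, multi-dimensional preference judgments with dimension-wise justifications across text, image, video, and audio. At its core is the Omni-Preference dataset, built by a fully automated pipeline with no human preference annotation. The model is trained in two stages and reaches SOTA on multiple benchmarks.

Details

Motivation: Existing reward models (RMs) are a bottleneck: they are predominantly vision-centric, output opaque scalar scores, and rely on costly human annotation, which limits the performance gains of multimodal large language models (MLLMs).

Result: The model reaches 80.2% accuracy on the video benchmark (ShareGPT-V) and 66.8% on the audio benchmark (Audio-HH-RLHF), both SOTA; on image tasks it substantially outperforms existing open-source RMs, with a 17.7% absolute gain in overall accuracy over its base model.

Insight: The novelty is threefold: 1) the first open-source, rubric-grounded reward model covering all modalities; 2) a fully automated pipeline for building the Omni-Preference dataset, which synthesizes candidate response pairs by contrasting models of different capabilities, uses strong teacher models to reconcile and filter preferences, and attaches a modality-aware, rubric-grounded rationale to each pair, removing the need for human annotation; 3) two-stage training (supervised fine-tuning followed by GRPO reinforcement learning) to sharpen discrimination on hard samples.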

Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-$N$ selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.


[23] A Baseline Multimodal Approach to Emotion Recognition in Conversations cs.CL | cs.AI | cs.CY | cs.SD | eess.ASPDF

Víctor Yeste, Rodrigo Rivas-Arévalo

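TL;DR: This paper presents a lightweight multimodal baseline for emotion recognition in conversations, built on the SemEval-2024 Task 3 dataset (derived from the sitcom Friends). It combines a transformer-based text classifier with a self-supervised speech representation model through a simple late-fusion ensemble, and aims to provide an accessible reference implementation rather than SOTA performance.

Details

Motivation: The goal is a transparent, easily reproducible baseline that supports multimodal research on emotion recognition in conversations, with a focus on when multimodal fusion improves over unimodal models.

Result: On the SemEval-2024 Task 3 dataset, the method reports empirical results under a limited training protocol and shows that multimodal fusion outperforms unimodal models in certain cases, without reaching SOTA.

Insight: The contribution is a simple late-fusion baseline combining text and speech that emphasizes the practicality and interpretability of multimodal fusion and provides a reference point for future, more rigorous comparisons.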

Abstract: We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.
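A simple late-fusion ensemble can be sketched as a weighted average of the class probabilities produced by each unimodal model. This is only an illustration: the function names, the label set, and the `text_weight` parameter are assumptions, and the report's actual fusion rule may differ.

```python
from typing import List, Sequence


def late_fusion(text_probs: Sequence[float], audio_probs: Sequence[float],
                text_weight: float = 0.5) -> List[float]:
    """Weighted average of per-class probabilities from two unimodal models."""
    if len(text_probs) != len(audio_probs):
        raise ValueError("both models must score the same set of classes")
    fused = [text_weight * t + (1.0 - text_weight) * a
             for t, a in zip(text_probs, audio_probs)]
    total = sum(fused)
    return [p / total for p in fused]  # renormalize in case inputs are unnormalized


def predict(labels: Sequence[str], text_probs: Sequence[float],
            audio_probs: Sequence[float], text_weight: float = 0.5) -> str:
    """Pick the label with the highest fused probability."""
    fused = late_fusion(text_probs, audio_probs, text_weight)
    best_index = max(range(len(fused)), key=fused.__getitem__)
    return labels[best_index]


labels = ["joy", "anger", "neutral"]  # hypothetical label set
print(predict(labels, [0.5, 0.1, 0.4], [0.2, 0.1, 0.7]))
```

In this example the text model alone would predict "joy", while the fused scores favor "neutral", which is the kind of case where fusion changes the decision.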


[24] Verification Required: The Impact of Information Credibility on AI Persuasion cs.CL | cs.GTPDF

Saaduddin Mahmud, Eugene Bagdasarian, Shlomo Zilberstein

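TL;DR: This paper studies how information credibility affects AI persuasion. It introduces MixTalk, a strategic communication game that models interaction between LLM agents: a sender strategically combines verifiable and unverifiable claims to communicate private information, while a receiver allocates a limited budget to costly verification. The paper evaluates state-of-the-art LLM agents in three realistic deployment settings and proposes Tournament Oracle Policy Distillation (TOPD), which distills policies offline from interaction logs to make receivers more robust to persuasion.

Details

Motivation: As LLM agents are increasingly deployed in communication settings that shape high-stakes decisions, understanding strategic communication becomes essential. Prior work mainly studies either unverifiable cheap talk or fully verifiable disclosure and fails to capture realistic domains in which information has probabilistic credibility.

Result: State-of-the-art LLM agents are evaluated in large-scale tournaments across three realistic deployment settings, revealing their strengths and limitations in reasoning about information credibility and in the explicit behavior that shapes these interactions. The proposed TOPD method significantly improves receiver robustness to persuasion.

Insight: The novelty lies in the MixTalk game for modeling information credibility, which fills the gap left by prior work on probabilistic credibility. TOPD, which distills the tournament oracle policy offline and deploys it in-context at inference time, offers a new way to make AI agents more robust in strategic communication.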

Abstract: Agents powered by large language models (LLMs) are increasingly deployed in settings where communication shapes high-stakes decisions, making a principled understanding of strategic communication essential. Prior work largely studies either unverifiable cheap-talk or fully verifiable disclosure, failing to capture realistic domains in which information has probabilistic credibility. We introduce MixTalk, a strategic communication game for LLM-to-LLM interaction that models information credibility. In MixTalk, a sender agent strategically combines verifiable and unverifiable claims to communicate private information, while a receiver agent allocates a limited budget to costly verification and infers the underlying state from prior beliefs, claims, and verification outcomes. We evaluate state-of-the-art LLM agents in large-scale tournaments across three realistic deployment settings, revealing their strengths and limitations in reasoning about information credibility and the explicit behavior that shapes these interactions. Finally, we propose Tournament Oracle Policy Distillation (TOPD), an offline method that distills tournament oracle policy from interaction logs and deploys it in-context at inference time. Our results show that TOPD significantly improves receiver robustness to persuasion.


[25] MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA cs.CL | cs.AIPDF

Yutong Song, Shiva Shrestha, Chenhan Lyu, Elahe Khatibi, Pengfei Zhang

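TL;DR: This paper proposes MedSpeak, a knowledge graph-aided ASR error correction framework that addresses inaccurate recognition of medical terminology by automatic speech recognition in spoken medical question-answering systems. The framework uses the semantic relationships and phonetic information in a medical knowledge graph, together with the reasoning ability of LLMs, to refine noisy transcripts and improve downstream answer prediction.

Details

Motivation: Spoken question-answering (SQA) systems that rely on ASR recognize medical terminology inaccurately; the paper aims to make medical SQA more reliable.

Result: Comprehensive experiments on benchmarks show that MedSpeak significantly improves medical term recognition accuracy and overall medical SQA performance, reaching SOTA in this domain.

Insight: The novelty lies in combining the semantic and phonetic information of a knowledge graph with LLM reasoning for ASR error correction, offering a reusable cross-modal, knowledge-enhanced approach to domain-specific ASR correction.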

Abstract: Spoken question-answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state-of-the-art solution for medical SQA. The code is available at https://github.com/RainieLLM/MedSpeak.


[26] DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning cs.CL | cs.AI | cs.LGPDF

Batuhan K. Karaman, Aditya Rawal, Suhaila Shakiah, Mohammad Ghavamzadeh, Mingyi Hong

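TL;DR: DISPO is a reinforcement learning algorithm for strengthening the mathematical reasoning of LLMs. By decoupling the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, it yields four controllable policy update regimes and strikes a better balance between training efficiency and stability.

Details

Motivation: Existing RL methods for training LLMs on mathematical reasoning face a trade-off: PPO-style methods (e.g., GRPO/DAPO) are stable but learn slowly, while REINFORCE-style methods (e.g., CISPO) are efficient but unstable. DISPO is designed to address this limitation.

Result: DISPO achieves 61.04% accuracy on AIME’24, ahead of CISPO (55.42%) and DAPO (50.21%), with similar gains across multiple benchmarks and models.

Insight: The novelty lies in decoupling the clipping of importance sampling weights into separate up- and down-clipping for correct and incorrect responses, giving four controllable update regimes. Tuning these parameters independently maintains the exploration-distillation balance and avoids catastrophic failures, improving both training efficiency and stability.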

Abstract: Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust-region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights >1 increase the average token entropy (i.e., exploration) while weights <1 decrease it (i.e., distillation) – both beneficial but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights >1) or vanishing response lengths (when weights <1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME’24 (vs. 55.42% CISPO and 50.21% DAPO) with similar gains across various benchmarks and models.
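The four clipping parameters can be pictured with a small sketch: the importance sampling weight of a token is clamped to one interval if the response was correct and to a different interval if it was incorrect. The class name, the bound values, and the exact form of the per-token objective are assumptions made for illustration; the abstract does not state DISPO's objective or its tuned values.

```python
from dataclasses import dataclass


@dataclass
class ClipConfig:
    # Hypothetical bounds; the four values are tuned independently.
    correct_low: float = 0.8     # down-clip, correct responses
    correct_high: float = 1.3    # up-clip, correct responses
    incorrect_low: float = 0.5   # down-clip, incorrect responses
    incorrect_high: float = 1.1  # up-clip, incorrect responses


def clipped_weight(ratio: float, is_correct: bool, cfg: ClipConfig) -> float:
    """Clamp the importance weight with bounds chosen by response correctness."""
    if is_correct:
        low, high = cfg.correct_low, cfg.correct_high
    else:
        low, high = cfg.incorrect_low, cfg.incorrect_high
    return min(max(ratio, low), high)


def token_objective(ratio: float, advantage: float, log_prob: float,
                    is_correct: bool, cfg: ClipConfig) -> float:
    """REINFORCE-style term; the clipped weight acts as a constant coefficient.

    In an autodiff framework the weight would be wrapped in a stop-gradient,
    so the gradient flows only through log_prob (an assumption of this sketch).
    """
    return clipped_weight(ratio, is_correct, cfg) * advantage * log_prob


cfg = ClipConfig()
print(clipped_weight(1.5, True, cfg), clipped_weight(1.5, False, cfg))
```

Because the weight is clipped rather than the whole term being zeroed, tokens outside the interval still contribute a gradient, which matches the abstract's description of REINFORCE-style methods.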


[27] Sparse Reward Subsystem in Large Language Models cs.CLPDF

Guowei Xu, Mert Yuksekgonul, James Zou

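TL;DR: This paper identifies a sparse reward subsystem in the hidden states of LLMs, analogous to the biological reward subsystem in the human brain. The subsystem contains value neurons that represent the model’s internal expectation of state value, and intervention experiments establish their importance for reasoning. These neurons are robust across datasets, model scales, and architectures, and transfer well across datasets and across models fine-tuned from the same base model. The study also identifies dopamine neurons within the subsystem that encode reward prediction errors: they activate strongly when the reward is higher than expected and weakly when it is lower than expected.

Details

Motivation: The paper asks whether LLMs contain a structure resembling the brain’s reward subsystem, in order to understand how models internally represent and evaluate state value and to improve the interpretability of their reasoning mechanisms.

Result: Experiments show that value neurons are robust and transferable across datasets, model scales, and architectures, and that dopamine neurons encode reward prediction errors, supporting the existence and function of the sparse reward subsystem.

Insight: The novelty lies in bringing the neuroscience concept of a reward subsystem to LLM analysis and finding interpretable value neurons and dopamine neurons, which provides a new biological analogy and interpretability insights into model internals.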

Abstract: In this paper, we identify a sparse reward subsystem within the hidden states of Large Language Models (LLMs), drawing an analogy to the biological reward subsystem in the human brain. We demonstrate that this subsystem contains value neurons that represent the model’s internal expectation of state value, and through intervention experiments, we establish the importance of these neurons for reasoning. Our experiments reveal that these value neurons are robust across diverse datasets, model scales, and architectures; furthermore, they exhibit significant transferability across different datasets and models fine-tuned from the same base model. By examining cases where value predictions and actual rewards diverge, we identify dopamine neurons within the reward subsystem which encode reward prediction errors (RPE). These neurons exhibit high activation when the reward is higher than expected and low activation when the reward is lower than expected.


[28] DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework cs.CL | cs.AIPDF

Abhijit Chakraborty, Ashish Raj Shekhar, Shiven Agarwal, Vivek Gupta

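TL;DR: DeALOG is a decentralized multi-agent framework for complex question answering across text, tables, and images. Several specialized agents (Table, Context, Visual, and others) collaborate and communicate through a shared natural-language log that serves as persistent memory and supports error detection, improving robustness and interpretability.

Details

Motivation: Complex question answering across modalities (text, tables, images) requires integrating diverse information sources, which calls for a framework that supports specialized processing, coordination, and interpretability.

Result: DeALOG shows competitive performance on FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA; analysis confirms the importance of the shared log, agent specialization, and verification for accuracy.

Insight: The novelty lies in a decentralized multi-agent architecture in which a natural-language log acts as shared memory for coordination and verification among agents. This gives a modular, scalable, and interpretable solution, and the log-mediated collaboration aids error detection and robustness.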

Abstract: Complex question answering across text, tables and images requires integrating diverse information sources. A framework supporting specialized processing with coordination and interpretability is needed. We introduce DeALOG, a decentralized multi-agent framework for multimodal question answering. It uses specialized agents (Table, Context, Visual, Summarizing, and Verification) that communicate through a shared natural-language log serving as persistent memory. This log-based approach enables collaborative error detection and verification without central control, improving robustness. Evaluations on FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA show competitive performance. Analysis confirms the importance of the shared log, agent specialization, and verification for accuracy. DeALOG provides a scalable approach through modular components using natural-language communication.


[29] Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning cs.CLPDF

Zhikun Xu, Xiaodong Yu, Ben Zhou, Jiang Liu, Jialian Wu

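TL;DR: This paper proposes RULES, which formalizes lemma judging as a structured prediction problem: the model must output two sections, a precondition check and a conclusion-utility check. Combined with a section-aware reinforcement learning training strategy, this improves the ability of LLMs to apply lemmas reliably in mathematical reasoning.

Details

Motivation: Current LLMs perform well on mathematical benchmarks but often misapply lemmas, using a conclusion without verifying its assumptions, which undermines the reliability of their reasoning. The paper addresses this with a structured task design and a training method that strengthen the model’s judgment of lemma applicability.

Result: RULES outperforms both a vanilla model and a single-label RL baseline on in-domain tasks, shows larger improvements on applicability-breaking perturbations, and reaches parity or modest gains on end-to-end tasks; ablations confirm that both the two-section output and section-aware reinforcement are necessary for robustness.

Insight: The novelty lies in formalizing lemma judging as a structured prediction task and introducing section-aware loss masking in reinforcement learning, which improves robustness by penalizing the section responsible for an error. This provides a reusable training framework for making the logical reasoning of language models more reliable.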

Abstract: Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma-judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two-section output and trains with reinforcement learning plus section-aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that the two-section outputs and section-aware reinforcement are both necessary for robustness.
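One way to read section-aware loss masking is as a per-token credit assignment: when the output is wrong, only tokens in the faulty section receive the penalty, and tokens in the correct section are masked out. The sketch below follows that reading; the section names, reward values, and masking rule are assumptions, not the RULES implementation.

```python
from typing import Dict, List, Sequence


def section_aware_token_rewards(token_sections: Sequence[str],
                                section_correct: Dict[str, bool],
                                reward: float = 1.0,
                                penalty: float = -1.0) -> List[float]:
    """Assign a training signal to each token based on its section's verdict."""
    all_correct = all(section_correct.values())
    signals = []
    for section in token_sections:
        if all_correct:
            signals.append(reward)   # whole output is right: reward every token
        elif section_correct[section]:
            signals.append(0.0)      # masked: this section did not cause the error
        else:
            signals.append(penalty)  # penalize only the responsible section
    return signals


# Two tokens of precondition check followed by three tokens of utility check.
sections = ["precondition"] * 2 + ["utility"] * 3
print(section_aware_token_rewards(sections, {"precondition": False, "utility": True}))
```

A single-label baseline would instead spread one penalty over all five tokens, including the three that belong to the correct utility check.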


[30] Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident cs.CL | cs.CYPDF

Conrad Borchers, Jill-Jênn Vie, Roger Azevedo

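TL;DR: This paper evaluates how well LLMs can simulate novice reasoning and metacognitive judgments, specifically in chemistry tutoring. It finds that although GPT-4.1 generates fluent and contextually coherent reasoning, that reasoning is systematically more coherent, more verbose, and less variable than that of human novices, and the model overestimates learner performance.

Details

Motivation: The paper asks whether LLMs in AI tutoring systems can faithfully simulate the fragmented, imperfect reasoning of novices. Existing evaluations focus on problem-solving accuracy and overlook the metacognitive characteristics of human learning.

Result: On a dataset of 630 human think-aloud utterances from multi-step chemistry tutoring problems, GPT-4.1’s reasoning is over-coherent, verbose, and less variable than human reasoning, and these effects intensify when richer problem-solving context is provided; the model consistently overestimates learners’ step-level success.

Insight: The contribution is an evaluation framework that exposes the epistemic limitations of simulating learning with LLMs, which the authors attribute to training data that lacks novice characteristics such as expressions of affect and working memory constraints. The framework can guide future adaptive systems that support novice learning and self-regulation more faithfully.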

Abstract: Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models’ ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.


[31] Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations cs.CL | cs.SD | eess.ASPDF

Sheng-Lun Wei, Yu-Ling Liao, Yen-Hua Chang, Hen-Hsen Huang, Hsin-Hsi Chen

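TL;DR: This paper presents the first systematic study of speech bias in multilingual multimodal large language models (MLLMs). The authors construct and release BiasInEar, a speech-augmented benchmark based on Global MMLU Lite that spans English, Chinese, and Korean, is balanced by gender and accent, and totals 70.8 hours (about 4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss’ κ), they evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Architectural design and reasoning strategy also substantially affect robustness across languages. Overall, the study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging text- and speech-based evaluation.

Details

Motivation: The paper fills the evaluation gap for speech bias in multilingual MLLMs by systematically studying model sensitivity and fairness under linguistic, demographic, and structural perturbations.

Result: Nine representative models are evaluated on BiasInEar. They are relatively robust to gender but highly sensitive to language and option order; architectural design and reasoning strategy substantially affect cross-lingual robustness.

Insight: The novelty lies in the first systematic study of speech bias in MLLMs and the balanced, speech-augmented BiasInEar benchmark. Viewed objectively, the unified evaluation framework and complementary metrics are useful tools for assessing the fairness and robustness of speech-integrated LLMs, and the work surfaces a key phenomenon: speech can amplify biases already present in text models.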

Abstract: This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours ($\approx$4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss’ $κ$), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation. The resources can be found at https://github.com/ntunlplab/BiasInEar.
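Of the four metrics, Fleiss' κ has a standard closed form, sketched below. The mapping to this benchmark is an assumption made for illustration: each row is one question, each "rater" is one perturbed variant of that question (for example a different accent or option order), and each column counts how many variants led the model to pick that answer option.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters that assigned item i to category j;
    every item must have the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of raters")
    n_items = len(counts)
    n_categories = len(counts[0])
    # Observed agreement per item, then averaged over items.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts]
    p_bar = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(row[j] for row in counts) / (n_items * n_raters)
              for j in range(n_categories)]
    p_e = sum(s * s for s in shares)
    if p_e == 1.0:
        return 1.0  # convention used here when only one category is ever chosen
    return (p_bar - p_e) / (1.0 - p_e)


# Two questions, three perturbed variants each, two answer options.
print(fleiss_kappa([[3, 0], [0, 3]]), fleiss_kappa([[2, 1], [1, 2]]))
```

Values near 1 mean the model gives the same answer regardless of the perturbation, while values near or below 0 mean its answers agree no more than chance would predict.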


[32] What If We Allocate Test-Time Compute Adaptively? cs.CLPDF

Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan

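TL;DR: This paper proposes a verifier-guided adaptive reasoning framework that treats reasoning as iterative trajectory generation and selection. For each problem it runs multiple inference iterations; each iteration optionally produces a high-level plan, selects reasoning tools and a compute strategy, and generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within an iteration, step-level PRM scores guide pruning and expansion during generation; across iterations, aggregated trajectory rewards select the final answer.

Details

Motivation: Existing test-time compute scaling methods typically allocate inference computation uniformly, use fixed sampling strategies, and apply verification only for reranking. The paper aims to use compute more efficiently and improve performance on complex reasoning tasks through dynamic, adaptive allocation.

Result: Across datasets, the dynamic PRM-guided approach consistently outperforms direct test-time scaling, with large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. Efficiency is assessed with theoretical FLOPs and a compute intensity metric that penalizes wasted generation and tool overhead, showing that verification-guided allocation concentrates computation on high-utility reasoning paths.

Insight: The novelty lies in modeling reasoning as iterative trajectory generation and selection, with a PRM as a unified control signal across steps and iterations that enables adaptive compute allocation. This moves beyond uniform allocation and fixed sampling strategies: dynamic guidance and verification focus compute on promising reasoning paths.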

Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.
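The two roles of the PRM described above can be sketched as a small loop: step-level scores prune a trajectory while it is being generated, and the mean score of each surviving trajectory is used to select the final one. The generator and scorer are passed in as callables and stubbed here; the pruning threshold, the mean aggregation, and all names are assumptions, and planning, tool selection, and expansion are left out.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = Tuple[List[str], float]  # (steps, aggregated reward)


def run_iteration(propose_steps: Callable[[int], List[str]],
                  score_step: Callable[[str], float],
                  iteration: int,
                  prune_below: float = 0.3) -> Optional[Trajectory]:
    """Generate one trajectory, abandoning it at the first low-scoring step."""
    steps: List[str] = []
    scores: List[float] = []
    for step in propose_steps(iteration):
        score = score_step(step)
        if score < prune_below:
            return None  # pruned: stop spending compute on this path
        steps.append(step)
        scores.append(score)
    if not scores:
        return None
    return steps, sum(scores) / len(scores)


def select_best(propose_steps: Callable[[int], List[str]],
                score_step: Callable[[str], float],
                n_iterations: int,
                prune_below: float = 0.3) -> Optional[Trajectory]:
    """Across iterations, keep the trajectory with the highest aggregated reward."""
    best: Optional[Trajectory] = None
    for iteration in range(n_iterations):
        result = run_iteration(propose_steps, score_step, iteration, prune_below)
        if result is not None and (best is None or result[1] > best[1]):
            best = result
    return best


# Stub generator and stub PRM, standing in for a model and a trained verifier.
candidates = {0: ["a", "bad"], 1: ["a", "b"], 2: ["c", "d"]}
stub_scores = {"a": 0.6, "bad": 0.1, "b": 0.8, "c": 0.9, "d": 0.9}
print(select_best(candidates.__getitem__, stub_scores.__getitem__, 3))
```

In the stubbed run the first trajectory is cut off at its low-scoring second step, so no further scoring is spent on it, and the third trajectory wins on aggregated reward.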


[33] Logic-Oriented Retriever Enhancement via Contrastive Learning cs.CLPDF

Wenxuan Zhang, Yuan-Hao Jiang, Changyong Qi, Rui Jia, Yonghe Wu

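TL;DR: This paper proposes LORE (Logic-Oriented Retriever Enhancement), which uses fine-grained contrastive learning to strengthen a retriever’s handling of logic, addressing the failure of LLMs on knowledge-intensive tasks when retrievers over-rely on surface similarity and cannot handle queries with complex logic. The method needs no external supervision or additional resources, remains index-compatible, and consistently improves retrieval utility and downstream generation.

Details

Motivation: LLMs are bottlenecked on knowledge-intensive tasks because existing retrievers tend to overfit to surface similarity and perform poorly on queries involving complex logical relations.

Result: The paper reports that LORE consistently improves retrieval utility and downstream generation while maintaining efficiency, but the abstract names no specific benchmarks (such as BEIR or HotpotQA) and gives no quantitative comparison with SOTA models.

Insight: The novelty lies in using fine-grained contrastive learning without external supervision to activate a capacity for logical analysis that is inherent in model representations but underused, guiding embeddings to align with logical structure rather than shallow similarity. It is a parameter-efficient, index-compatible way to enhance retrievers.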

Abstract: Large language models (LLMs) struggle in knowledge-intensive tasks, as retrievers often overfit to surface similarity and fail on queries involving complex logical relations. The capacity for logical analysis is inherent in model representations but remains underutilized in standard training. LORE (Logic ORiented Retriever Enhancement) introduces fine-grained contrastive learning to activate this latent capacity, guiding embeddings toward evidence aligned with logical structure rather than shallow similarity. LORE requires no external supervision, resources, or pre-retrieval analysis, remains index-compatible, and consistently improves retrieval utility and downstream generation while maintaining efficiency. The datasets and code are publicly available at https://github.com/mazehart/Lore-RAG.
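The abstract does not specify LORE's loss, so the sketch below uses the generic InfoNCE contrastive objective as a stand-in: the query embedding is pulled toward a logically aligned passage (the positive) and pushed away from passages that are only superficially similar (the negatives). The function names, the toy two-dimensional embeddings, and the temperature value are all illustrative assumptions.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query: Sequence[float], positive: Sequence[float],
             negatives: Sequence[Sequence[float]],
             temperature: float = 0.05) -> float:
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, negative) / temperature for negative in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[0]


query = [1.0, 0.0]
aligned_loss = info_nce(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
misaligned_loss = info_nce(query, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
print(aligned_loss < misaligned_loss)
```

Because training of this kind only changes the encoder's embeddings and not the retrieval interface, it is at least consistent with the abstract's claim of index compatibility.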


[34] Tendem: A Hybrid AI+Human Platform cs.CLPDF

Konstantin Chernyshev, Ekaterina Artemova, Viacheslav Zhukov, Maksim Nerush, Mariia Fedorova

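TL;DR: Tendem is a hybrid AI+human platform in which AI handles structured, repeatable work and human experts step in when the models fail or results need verification. Every result undergoes a comprehensive quality review before delivery to the client. In an in-house evaluation on 94 real-world tasks, Tendem consistently delivers higher-quality outputs and faster turnaround than AI-only agents and human-only workflows, at operational costs comparable to human-only execution. On third-party agentic benchmarks, its AI agent performs near the state of the art on web browsing and tool-use tasks and shows strong results in frontier domain knowledge and reasoning.

Details

Motivation: AI-only agents can fail on complex or uncertain tasks, while human-only workflows are slower and more expensive. Tendem aims to deliver high-quality, efficient, and cost-controlled task execution through collaboration between AI and human experts.

Result: In an in-house evaluation on 94 real-world tasks, Tendem consistently delivers higher-quality outputs and faster turnaround than AI-only agents and human-only workflows carried out by Upwork freelancers, at operational costs comparable to human-only execution. On third-party agentic benchmarks (such as web browsing and tool use), its AI agent performs near SOTA and achieves strong results on frontier domain knowledge and reasoning tasks.

Insight: The novelty lies in a dynamic hybrid system of AI and human experts with a quality review mechanism that ensures reliable outputs, balancing efficiency and quality. Viewed objectively, this hybrid model offers a scalable way to deploy AI agents on complex real-world tasks and underscores the irreplaceable role of humans in key decisions and verification.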

Abstract: Tendem is a hybrid system where AI handles structured, repeatable work and Human Experts step in when the models fail or to verify results. Each result undergoes a comprehensive quality review before delivery to the Client. To assess Tendem’s performance, we conducted a series of in-house evaluations on 94 real-world tasks, comparing it with AI-only agents and human-only workflows carried out by Upwork freelancers. The results show that Tendem consistently delivers higher-quality outputs with faster turnaround times. At the same time, its operational costs remain comparable to human-only execution. On third-party agentic benchmarks, Tendem’s AI Agent (operating autonomously, without human involvement) performs near state-of-the-art on web browsing and tool-use tasks while demonstrating strong results in frontier domain knowledge and reasoning.


[35] Long-range Modeling and Processing of Multimodal Event Sequences cs.CL | cs.LGPDF

Jichu Li, Yilun Zhong, Zhiting Li, Feng Zhou, Quyu Kong

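TL;DR: This paper proposes a novel framework that extends LLM-based temporal point processes (TPPs) to the visual modality, making text generation a core capability alongside time and type prediction. It addresses the long-context problem with an adaptive sequence compression mechanism based on temporal similarity and adopts a two-stage paradigm of pre-training followed by supervised fine-tuning. Experiments on benchmarks including DanmakuTPP-QA show that it outperforms state-of-the-art baselines in both predictive accuracy and the quality of generated textual analyses.

Details

Motivation: When existing TPP methods handle multimodal data, sequence length grows dramatically, making it hard for attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. This limits their ability to generate rich multimodal content and reason about event dynamics.

Result: In extensive experiments, including on the challenging DanmakuTPP-QA benchmark, the method outperforms state-of-the-art baselines in both predictive accuracy and the quality of generated textual analyses.

Insight: The core novelty is establishing text generation as a core TPP task and introducing an adaptive sequence compression mechanism based on temporal similarity to handle long sequences, advancing long-range modeling of multimodal event sequences.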

Abstract: Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.


[36] Don’t Judge a Book by its Cover: Testing LLMs’ Robustness Under Logical Obfuscation cs.CL PDF

Abhilekh Borah, Shubhra Ghosh, Kedar Joshi, Aditya Kumar Guru, Kripabandhu Ghosh

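TL;DR: This paper studies the robustness of large language models (LLMs) on logical reasoning tasks and finds that performance drops markedly when problems are posed in logically equivalent but obfuscated forms. The authors introduce Logifus, a structure-preserving logical obfuscation framework, and LogiQAte, a new diagnostic benchmark with 1,108 questions across four reasoning tasks. Across six state-of-the-art models, obfuscation sharply degrades average zero-shot performance (for example, a 47% drop for GPT-4o), indicating that LLMs lack deep understanding of the problems.

Details

Motivation: LLMs handle reasoning tasks such as arithmetic, truth tables, and syllogisms well in their standard form, yet they often fail when the same problems are posed in logically equivalent but obfuscated forms. The paper sets out to study this vulnerability systematically and to assess how much the models truly understand.

Result: Six SOTA models were evaluated on the four LogiQAte tasks (Obfus FOL, Obfus Blood Relation, Obfus Number Series, and Obfus Direction Sense). Logical obfuscation severely degrades zero-shot performance: the average drop is 47% for GPT-4o, 27% for GPT-5, and 22% for the reasoning model o4-mini. This points to a widespread vulnerability of current models to obfuscated problems.

Insight: The contributions are a structure-preserving logical obfuscation framework (Logifus) and a benchmark dedicated to diagnosing the logical robustness of LLMs (LogiQAte). Viewed objectively, the core insight is that current LLMs rely largely on the surface form of a question when parsing it rather than on deep logical understanding, which makes building models that genuinely understand semantics beyond surface form an urgent research direction.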

Abstract: Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but they often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and, utilizing this, we present LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Across all the tasks, evaluating six state-of-the-art models, we find that obfuscation severely degrades zero-shot performance, with performance dropping on average by 47% for GPT-4o, 27% for GPT-5, and 22% for the reasoning model o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.


[37] Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation cs.CL | cs.CV PDF

Shashini Nilukshi, Deshan Sumanathilaka

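TL;DR: This paper is a mini review of Visual Word Sense Disambiguation (VWSD), a multimodal extension of traditional Word Sense Disambiguation (WSD) that uses visual cues to resolve lexical ambiguity in vision-language tasks. It traces developments from early multimodal fusion methods to newer frameworks built on contrastive models such as CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. It covers studies from 2016 to 2025 and shows how VWSD has evolved through feature-based, graph-based, and contrastive embedding techniques.

Details

Motivation: Traditional WSD relies only on text and lexical resources, whereas VWSD incorporates visual information to resolve lexical ambiguity with minimal text input, improving the accuracy of word sense understanding in vision-language tasks.

Result: Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently outperform zero-shot baselines, with gains of up to 6-8% in Mean Reciprocal Rank (MRR).

Insight: The contributions include integrating visual cues into word sense disambiguation and examining strategies such as prompt engineering, fine-tuning, and multilingual adaptation. Viewed objectively, the paper highlights the convergence of CLIP alignment, diffusion generation, and LLM reasoning as the key path toward strong, context-aware, multilingual disambiguation systems.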

Abstract: This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), which is a multimodal extension of traditional Word Sense Disambiguation (WSD). VWSD helps tackle lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to find the right meaning of ambiguous words with minimal text input. The review looks at developments from early multimodal fusion methods to new frameworks that use contrastive models like CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to show the growth of VWSD through feature-based, graph-based, and contrastive embedding techniques. It focuses on prompt engineering, fine-tuning, and adapting to multiple languages. Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently perform better than zero-shot baselines, achieving gains of up to 6-8% in Mean Reciprocal Rank (MRR). However, challenges still exist, such as limitations in context, model bias toward common meanings, a lack of multilingual datasets, and the need for better evaluation frameworks. The analysis highlights the growing overlap of CLIP alignment, diffusion generation, and LLM reasoning as the future path for strong, context-aware, and multilingual disambiguation systems.
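
The Mean Reciprocal Rank (MRR) metric mentioned in this abstract can be illustrated with a minimal sketch. The function name and the candidate image labels in the usage note are hypothetical and are not taken from the review; the sketch only shows the standard definition, in which each query contributes the reciprocal of the rank at which the correct candidate appears.

```python
def mean_reciprocal_rank(ranked_lists, gold_items):
    """Average of 1/rank of the gold item over all queries (0 if it is absent)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_items):
        reciprocal_rank = 0.0
        for position, candidate in enumerate(ranked, start=1):
            if candidate == gold:
                reciprocal_rank = 1.0 / position
                break
        total += reciprocal_rank
    return total / len(gold_items)
```

For instance, if a system ranks the correct image first, second, and third for three ambiguous words, the score is (1 + 1/2 + 1/3) / 3, so moving a correct image from rank 2 to rank 1 raises MRR more than moving one from rank 3 to rank 2.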


[38] ASTER: Agentic Scaling with Tool-integrated Extended Reasoning cs.CL PDF

Xuqin Zhang, Quan He, Zhenrui Zheng, Zongzhang Zhang, Xu He

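TL;DR: This paper proposes ASTER, a framework that addresses the "interaction collapse" that arises when reinforcement learning (RL) is used to scale tool-integrated reasoning (TIR). Under this collapse, models fail to sustain multi-turn tool use and degenerate into heavy internal reasoning. ASTER avoids it with a targeted cold-start strategy that prioritizes interaction-dense trajectories. The study finds that a small expert cold-start set of just 4K interaction-dense trajectories establishes a strong prior that enables superior exploration during subsequent RL training.

Details

Motivation: The goal is to address the "interaction collapse" that appears when the tool-integrated reasoning of LLMs is scaled via reinforcement learning: models fail to sustain effective multi-turn tool interaction, which limits gains in long-horizon reasoning.

Result: On competitive mathematical benchmarks, ASTER-4B achieves state-of-the-art results, reaching 90.0% accuracy on AIME 2025 and surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.

Insight: The claimed contributions are a systematic study of how cold-start supervised fine-tuning (SFT) induces a tool-using behavioral prior and of how interaction density shapes exploration and RL outcomes, together with the ASTER framework. The core insight is that a cold-start set of few but interaction-dense expert trajectories establishes a strong behavioral prior, enabling better exploration and generalization in subsequent RL training and avoiding interaction collapse. Viewed objectively, this offers a promising path toward efficient, scalable training of agents with complex tool-use abilities.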

Abstract: Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.


[39] Chronos: Learning Temporal Dynamics of Reasoning Chains for Test-Time Scaling cs.CL PDF

Kai Zhang, Jiayi Liao, Chengpeng Li, Ziyuan Xie, Sihang Li

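TL;DR: This paper proposes Chronos, a lightweight, plug-and-play chronological reasoning scorer that improves the reasoning performance of large language models under test-time scaling. Chronos models each reasoning trajectory as a time series, learns to capture trajectory features of token probabilities, assigns quality scores accordingly, and applies a weighted voting mechanism.

Details

Motivation: Existing test-time scaling methods, such as majority voting and heuristic token-level scoring, treat reasoning traces or tokens equally, which leaves them susceptible to large variations in trajectory quality and to localized logical errors. A scoring method that models the temporal dynamics of reasoning is therefore needed.

Result: Extensive evaluations on in-domain and out-of-domain benchmarks show that Chronos delivers substantial gains across a variety of models with negligible computational overhead. For example, on HMMT25 with Qwen3-4B-Thinking-2507, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128.

Insight: The novelty lies in modeling reasoning trajectories as time series, learning the temporal dynamics of token probabilities to assess trajectory quality, and applying weighted voting. Compared with traditional methods that treat all trajectories or tokens equally, this better captures the temporal dependencies and local logical quality of reasoning chains.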

Abstract: Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of large language models (LLMs). However, existing methods – most notably majority voting and heuristic token-level scoring – treat reasoning traces or tokens equally, thereby being susceptible to substantial variations in trajectory quality and localized logical failures. In this work, we introduce Chronos, a lightweight and plug-and-play chronological reasoning scorer that models each trajectory as a time series. Specifically, Chronos learns to capture trajectory features of token probabilities, assigns quality scores accordingly, and employs a weighted voting mechanism. Extensive evaluations on both in-domain and out-of-domain benchmarks demonstrate that Chronos consistently delivers substantial gains across a variety of models, with negligible computational overhead. Notably, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507, highlighting its effectiveness.
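
The difference between majority voting and score-weighted voting can be sketched as follows. This is only an illustration of the voting step: the per-trajectory scores are assumed to come from some external scorer (in the paper, the learned Chronos model, which is not reproduced here), and the function names are hypothetical.

```python
from collections import defaultdict

def weighted_vote(answers, scores):
    """Return the answer whose trajectories have the largest total score."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    # Ties are broken in favor of the answer that was seen first.
    return max(totals, key=totals.get)

def majority_vote(answers):
    """Majority voting is the special case where every trajectory has weight 1."""
    return weighted_vote(answers, [1.0] * len(answers))
```

With five sampled trajectories answering "12", "12", "7", "12", "7" and hypothetical quality scores 0.1, 0.2, 0.9, 0.1, 0.8, majority voting picks "12" while weighted voting picks "7", because the two high-scoring trajectories outweigh the three low-scoring ones.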


[40] Inferential Question Answering cs.CL | cs.IR PDF

Jamshid Mozafari, Hamed Zamani, Guido Zuccon, Adam Jatowt

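TL;DR: This paper proposes a new question answering task, Inferential QA, which requires models to infer answers from answer-supporting passages that provide only clues. The authors construct the QUIT dataset, comprising 7,401 questions and 2.4M passages, and show experimentally that methods effective on traditional QA tasks perform poorly on inferential QA, exposing the limits of existing QA systems on inference-based tasks.

Details

Motivation: Most existing QA systems focus on answer containment, that is, they assume the answer can be directly extracted or generated from documents in the corpus. Some questions, however, require inference: deriving an answer that is not explicitly stated from the available information. The paper targets this kind of inference-dependent question answering.

Result: A comprehensive evaluation of retrievers, rerankers, and LLM-based readers shows that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning yields inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models.

Insight: The contribution is the new task category of Inferential QA and the dedicated QUIT dataset for studying it. Viewed objectively, the study shows that current QA pipelines are not yet ready for understanding and reasoning over indirect textual evidence, and it points the way for developing stronger reasoning models.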

Abstract: Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment, i.e., assuming that answers can be directly extracted and/or generated from documents in the corpus. However, some questions require inference, i.e., deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA – a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability and human verification. Through comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that move towards understanding and reasoning from indirect textual evidence.


[41] PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian cs.CL | cs.IR PDF

Jamshid Mozafari, Seyed Parsa Mousavinasab, Adam Jatowt

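TL;DR: This paper introduces PARSE, the first open-domain reasoning QA benchmark for Persian, containing 10,800 questions in Boolean, multiple-choice, and factoid formats that span diverse reasoning types and difficulty levels. The benchmark is built with an LLM-based generation pipeline, and quality is ensured through human validation and multi-stage filtering. The study also evaluates multilingual and Persian LLMs under different prompting strategies and finds that Persian prompts and structured prompting (such as chain-of-thought) improve performance, and that fine-tuning improves results further, especially for Persian-specialized models.

Details

Motivation: The goal is to address the lack of high-quality reasoning QA benchmarks for low-resource languages such as Persian. Persian has roughly 130 million speakers, yet it previously had no comprehensive open-domain resource for evaluating reasoning-capable QA systems.

Result: On PARSE, Persian prompts and structured prompting (CoT for Boolean and multiple-choice questions; few-shot for factoid questions) improve LLM performance, and fine-tuning boosts results further, especially for Persian-specialized models. These results support both fair comparison and practical model adaptation.

Insight: The contributions include PARSE, the first open-domain Persian reasoning QA benchmark, with quality ensured through LLM-based generation and human validation. The study also offers insight into how prompting strategies and fine-tuning affect model performance in a low-resource language, laying a foundation for developing and evaluating LLMs in low-resource settings.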

Abstract: Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation. We also ensure linguistic and factual quality through multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian LLMs under multiple prompting strategies and show that Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.


[42] EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models cs.CL | cs.AI PDF

Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Dannong Xu

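TL;DR: This paper proposes EverMemBench, a benchmark for evaluating the long-term interactive memory of large language models. It contains multi-party, multi-group conversations spanning over 1 million tokens, with temporally evolving information, cross-topic interleaving, and role-specific personas. It evaluates memory systems through 1,000+ QA pairs along three dimensions: fine-grained recall, memory awareness, and user profile understanding.

Details

Motivation: Existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity, so a new benchmark is needed to evaluate the long-term conversational memory that LLM assistants require.

Result: The evaluation reveals key limitations: multi-hop reasoning collapses in multi-party settings, with even oracle models reaching only 26% accuracy; temporal reasoning remains unsolved; and similarity-based retrieval fails to bridge the semantic gap between queries and implicitly relevant memories, making retrieval the bottleneck for memory awareness.

Insight: The novelty lies in building a benchmark closer to complex real-world conversations and in systematically defining three dimensions for evaluating memory. Viewed objectively, the study exposes fundamental challenges for current memory systems in complex, dynamic, multi-party interaction and points the way for next-generation memory architectures.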

Abstract: Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.


[43] CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering cs.CL | cs.LG PDF

Yu Liu, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Weizhuo Chen

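TL;DR: This paper proposes CRAFT, a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that addresses reasoning collapse, reasoning-answer inconsistency, and loss of format control in retrieval-augmented generation for multi-hop question answering. It uses a dual reward mechanism to optimize the reasoning process and improves answer accuracy and reasoning faithfulness on multiple benchmarks.

Details

Motivation: The work targets three challenges facing retrieval-augmented generation in multi-hop QA (reasoning collapse, reasoning-answer inconsistency, and loss of format control), with the aim of making large language models more reliable and controllable on complex reasoning tasks.

Result: On three multi-hop QA benchmarks, CRAFT markedly improves both answer accuracy and reasoning faithfulness, with the 7B model achieving performance competitive with closed-source LLMs across multiple reasoning trace settings.

Insight: The novelty lies in a dual-reward reinforcement learning mechanism that combines deterministic rewards with judge-based rewards, and in support for controllable reasoning trace variants that enable systematic analysis of how structure and scale affect reasoning performance. Together these offer a new approach to improving the reliability and interpretability of complex reasoning.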

Abstract: Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work has mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence–distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving competitive performance with closed-source LLMs across multiple reasoning trace settings.
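
A rough sketch of how a deterministic format reward and a judge-based reward might be combined and then turned into GRPO-style group-relative advantages is given below. Everything specific here is an assumption made for illustration: the `<reasoning>`/`<answer>` tag names, the additive combination, and the `judge_weight` parameter are hypothetical and are not taken from the paper. The only part that follows the general GRPO recipe is normalizing each reward by the mean and standard deviation of its sampling group.

```python
import re
from statistics import mean, pstdev

# Hypothetical output format: a reasoning section followed by an answer section.
TRACE_PATTERN = re.compile(
    r"^<reasoning>.+</reasoning>\s*<answer>.+</answer>$", re.DOTALL
)

def format_reward(response):
    """Deterministic reward: 1.0 if the response matches the expected structure."""
    return 1.0 if TRACE_PATTERN.match(response.strip()) else 0.0

def combined_reward(response, judge_score, judge_weight=0.5):
    """Add a judge-provided faithfulness score (assumed to lie in [0, 1])."""
    return format_reward(response) + judge_weight * judge_score

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this sketch, a response that is well-formed but judged unfaithful still earns the format reward, while its advantage depends on how the other responses sampled for the same question scored.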


[44] On the Power of (Approximate) Reward Models for Inference-Time Scaling cs.CL | stat.ML PDF

Youheng Zhu, Yiping Lu

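TL;DR: This paper gives a theoretical analysis of how effective approximate reward models are in inference-time scaling based on Sequential Monte Carlo (SMC). It shows that if the Bellman error of the approximate reward model is bounded by O(1/T), combining it with SMC reduces the computational complexity of reasoning from exponential in T to polynomial in T, an exponential improvement in inference efficiency despite using only approximate rewards.

Details

Motivation: The paper addresses a central problem: in practice the true reward model is never available, and all deployed systems rely on approximate reward models. A theoretical account is therefore needed of why and when approximate reward models suffice for effective inference-time scaling.

Result: The theory shows that when the Bellman error of the approximate reward model is bounded by O(1/T), inference-time scaling with SMC reduces the computational complexity of reasoning from exponential in T to polynomial in T.

Insight: The novelty lies in establishing, for the first time in theory, the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling, and in giving a concrete error-bound condition. This provides theoretical grounding and design guidance for using approximate rewards in practical systems.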

Abstract: Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.
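
The generate, evaluate, and resample loop that the abstract attributes to SMC can be sketched on a toy problem. This is a generic particle-resampling illustration, not the paper's algorithm or analysis: the proposal, the per-step reward, and all function names are assumptions chosen so the example runs with the standard library only. Here a "reasoning trajectory" is a list of bits, and the approximate reward prefers a 1 at each step.

```python
import math
import random

def smc_generate(propose, step_reward, steps, num_particles, rng):
    """Extend every particle by one step, weight it, then resample."""
    particles = [[] for _ in range(num_particles)]
    for _ in range(steps):
        # Generation: each partial trajectory proposes its next step.
        particles = [p + [propose(p, rng)] for p in particles]
        # Evaluation: the reward model scores the newest step of each particle.
        weights = [math.exp(step_reward(p)) for p in particles]
        # Resampling: low-weight particles tend to be dropped, high-weight ones copied.
        chosen = rng.choices(particles, weights=weights, k=num_particles)
        particles = [list(p) for p in chosen]
    return particles

def toy_propose(prefix, rng):
    """Toy proposal: the next step is a uniformly random bit."""
    return rng.randint(0, 1)

def toy_step_reward(prefix):
    """Toy approximate reward: favors trajectories whose latest bit is 1."""
    return 3.0 if prefix[-1] == 1 else 0.0
```

Without resampling, a uniformly random trajectory of length T is all ones with probability 2^-T; with resampling, computation is concentrated step by step on promising prefixes, which is the intuition behind the exponential-to-polynomial claim.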


[45] Rethinking Selective Knowledge Distillation cs.CL PDF

Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva

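TL;DR: This paper revisits selective knowledge distillation in autoregressive large language models. It disentangles selection along the position, class, and sample axes, systematically compares importance signals and selection policies, and proposes student-entropy-guided position selection (SE-KD), which often improves accuracy, downstream task adherence, and memory efficiency across a suite of benchmarks.

Details

Motivation: Selective knowledge distillation is already widely used for large language models, but there has been no systematic analysis of which importance signals, selection policies, and interactions between them are effective. The paper aims to fill this gap and improve the distillation process.

Result: Across a suite of benchmarks, SE-KD often outperforms dense distillation in accuracy, downstream task adherence, and memory efficiency. Extending it to the class and sample axes (SE-KD 3X) improves efficiency further and makes offline teacher caching feasible, cutting wall time by 70%, peak memory by 18%, and storage usage by 80% relative to prior methods without sacrificing performance.

Insight: The novelty lies in systematically disentangling the three axes of selective distillation and proposing a student-entropy-based position selection policy. Combining the complementary efficiency gains of position, class, and sample selection makes efficient offline teacher caching possible and offers a new approach to applying knowledge distillation in practice.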

Abstract: Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
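
One plausible reading of student-entropy-guided position selection is sketched below: compute the entropy of the student's next-token distribution at each position, keep the highest-entropy positions, and apply the distillation loss only there. The top-k rule, the `keep_ratio` parameter, and the use of forward KL are assumptions for illustration; the abstract does not specify the exact policy, and the paper's SE-KD may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(teacher, student):
    """KL(teacher || student); assumes student probabilities are strictly positive."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)

def select_positions(student_dists, keep_ratio):
    """Indices of the positions where the student is most uncertain."""
    k = max(1, round(keep_ratio * len(student_dists)))
    ranked = sorted(
        range(len(student_dists)),
        key=lambda i: entropy(student_dists[i]),
        reverse=True,
    )
    return sorted(ranked[:k])

def selective_kd_loss(teacher_dists, student_dists, keep_ratio=0.5):
    """Average distillation loss over the selected positions only."""
    positions = select_positions(student_dists, keep_ratio)
    losses = [kl_divergence(teacher_dists[i], student_dists[i]) for i in positions]
    return sum(losses) / len(losses)
```

Because only the selected positions need teacher distributions, a scheme of this kind would also shrink what has to be stored when teacher outputs are cached offline, which is consistent with the storage savings the abstract reports.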


[46] Understanding QA generation: Extracting Parametric and Contextual Knowledge with CQA for Low Resource Bangla Language cs.CL PDF

Umme Abira Azmary, MD Ikramul Kayes, Swakkhar Shatabda, Farig Yousuf Sadeque

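TL;DR: To address the data scarcity and linguistic complexity facing question answering models for low-resource Bangla, this paper introduces BanglaCQA, the first Bangla counterfactual QA dataset, and designs fine-tuning-based and prompting-based pipelines to disentangle parametric knowledge from contextual knowledge in models. Using LLM-based and human evaluation, it analyzes model performance across QA settings and finds that chain-of-thought prompting is effective at extracting parametric knowledge in counterfactual scenarios, providing a new framework and key findings for counterfactual reasoning research in low-resource languages.

Details

Motivation: With limited annotated data and a linguistically complex language, it is hard to tell whether low-resource Bangla QA models rely on pre-encoded knowledge or on contextual input, and existing datasets lack the structure needed for such analysis.

Result: On the BanglaCQA dataset, fine-tuned encoder-decoder models and prompting-based pipelines for decoder-only LLMs were evaluated. Chain-of-thought prompting proved especially effective in counterfactual scenarios, particularly for extracting parametric knowledge from decoder-only LLMs.

Insight: The contributions include BanglaCQA, the first Bangla counterfactual QA dataset; an analysis framework that disentangles parametric and contextual knowledge; and the finding that chain-of-thought prompting is uniquely effective for counterfactual reasoning in a low-resource language, offering methods and directions that research on similar languages can draw on.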

Abstract: Question-Answering (QA) models for low-resource languages like Bangla face challenges due to limited annotated data and linguistic complexity. A key issue is determining whether models rely more on pre-encoded (parametric) knowledge or contextual input during answer generation, as existing Bangla QA datasets lack the structure required for such analysis. We introduce BanglaCQA, the first Counterfactual QA dataset in Bangla, by extending a Bangla dataset while integrating counterfactual passages and answerability annotations. In addition, we propose fine-tuned pipelines for encoder-decoder language-specific and multilingual baseline models, and prompting-based pipelines for decoder-only LLMs to disentangle parametric and contextual knowledge in both factual and counterfactual scenarios. Furthermore, we apply LLM-based and human evaluation techniques that measure answer quality based on semantic similarity. We also present a detailed analysis of how models perform across different QA settings in low-resource languages, and show that Chain-of-Thought (CoT) prompting reveals a uniquely effective mechanism for extracting parametric knowledge in counterfactual scenarios, particularly in decoder-only LLMs. Our work not only introduces a novel framework for analyzing knowledge sources in Bangla QA but also uncovers critical findings that open up broader directions for counterfactual reasoning in low-resource language settings.


[47] ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure cs.CL PDF

Jie Deng, Shining Liang, Jun Li, Hongzhi Li, Yutao Xie

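TL;DR: This paper proposes ConPress, a lightweight self-supervised fine-tuning method that shortens the long chains of thought large reasoning models produce on reasoning-intensive tasks, thereby reducing inference overhead. The method builds on an observed "self-compression" phenomenon: when a single prompt contains multiple independent, answerable questions, the model spontaneously produces a shorter reasoning trace for each one. ConPress constructs multi-question prompts to induce self-compression, samples the model outputs, and parses and filters the per-question traces to obtain concise yet correct reasoning trajectories. These are then used for supervised fine-tuning so that compressed reasoning is internalized in single-question settings.

Details

Motivation: Large reasoning models typically solve reasoning-intensive tasks by generating long chains of thought, which incurs substantial inference overhead. The paper exploits the observed "self-compression" phenomenon, in which models spontaneously shorten their reasoning traces under multi-question contextual pressure, to learn efficient reasoning without external teachers, manual pruning, or reinforcement learning.

Result: With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25 while maintaining competitive accuracy.

Insight: The novelty lies in identifying and exploiting "self-compression", a reproducible inference-time phenomenon, and building on it a lightweight self-supervised fine-tuning method, ConPress, that internalizes compressed reasoning without external resources or complex optimization and thereby reduces reasoning token usage.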

Abstract: Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.
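
The construct, parse, and filter steps described in this abstract can be sketched as a small data pipeline. The prompt wording, the "Question N: ... Answer: ..." output layout, and the exact-match correctness check are all hypothetical choices made so the example is runnable; the paper's actual prompt template and filtering criteria are not given in the abstract. The model call itself is left out, so `output` stands in for sampled model text.

```python
import re

def build_multi_question_prompt(questions):
    """Pack several independent questions into one prompt."""
    numbered = [f"Question {i}: {q}" for i, q in enumerate(questions, start=1)]
    return "Answer every question below.\n" + "\n".join(numbered)

# Assumed output layout: "Question N: <trace> Answer: <final answer>".
SECTION = re.compile(r"Question (\d+):(.*?)Answer: *(\S+)", re.DOTALL)

def parse_traces(output):
    """Map question number to (reasoning trace, final answer)."""
    return {int(n): (trace.strip(), answer) for n, trace, answer in SECTION.findall(output)}

def filter_correct(questions, gold_answers, output):
    """Keep only traces whose final answer matches the reference answer."""
    traces = parse_traces(output)
    examples = []
    for i, (question, gold) in enumerate(zip(questions, gold_answers), start=1):
        if i in traces and traces[i][1] == gold:
            trace = traces[i][0]
            examples.append({"prompt": question, "response": f"{trace}\nAnswer: {gold}"})
    return examples
```

Each kept example pairs a single question with the short trace the model produced under multi-question pressure, which is the form needed for single-question supervised fine-tuning.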


[48] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training cs.CL | cs.LG PDF

Ran Xu, Tianci Liu, Zihan Dong, Tony You, Ilgee Hong

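TL;DR: This paper proposes Rubric-ARM, a framework that uses alternating reinforcement learning to jointly optimize a rubric generator and a judge. It targets the problem that conventional reward models output only a scalar score in non-verifiable domains (such as creative writing) and cannot capture the multifaceted nature of response quality.

Details

Motivation: Scalar scores predicted by standard reward models struggle to capture the multifaceted nature of response quality in non-verifiable domains, and existing methods are limited by their reliance on static rubrics or disjoint training pipelines.

Result: Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.

Insight: The novelty lies in treating rubric generation as a latent action learned to maximize judgment accuracy and in introducing an alternating optimization strategy that mitigates the non-stationarity of simultaneous updates. Theoretical analysis shows that this schedule reduces gradient variance during training.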

Abstract: Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.


[49] The Art of Socratic Inquiry: A Framework for Proactive Template-Guided Therapeutic Conversation Generation cs.CL PDF

Mingwen Zhang, Minqiang Yang, Changsheng Ma, Yang Yu, Hui Bai

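TL;DR: This paper proposes the Socratic Inquiry Framework (SIF), a lightweight, plug-and-play therapeutic intent planner that turns large language models from passive listeners into active cognitive guides, addressing the weakness of existing psychotherapy dialogue models at proactive questioning. Strategy Anchoring decides when to ask and Template Retrieval decides what to ask, with supervision from the newly constructed Socratic-QA dataset. Experiments show that SIF significantly increases proactive questioning frequency, conversational depth, and therapeutic alignment.

Details

Motivation: Current psychological large language models (LLMs) are predominantly reactive: they offer empathetic but superficial responses and cannot drive the proactive questioning that is central to cognitive behavioral therapy (CBT) for surfacing latent beliefs or guiding behavioral change. A method is therefore needed to turn LLMs into active cognitive guides.

Result: Experiments show that SIF significantly increases proactive questioning frequency, conversational depth, and alignment with therapeutic goals, marking a clear shift from reactive comfort to proactive exploration.

Insight: The core contribution is SIF, a lightweight framework that decouples "when to ask" (Strategy Anchoring) from "what to ask" (Template Retrieval), enabling context-aware, theory-grounded questioning without end-to-end retraining. The accompanying Socratic-QA dataset provides explicit supervision for proactive reasoning. Viewed objectively, the modular separation of therapeutic strategy from concrete question templates, together with the focus on a high-quality, strategy-aligned dataset, offers a reusable pattern for building domain-specific LLMs that can guide proactively.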

Abstract: Proactive questioning, where therapists deliberately initiate structured, cognition-guiding inquiries, is a cornerstone of cognitive behavioral therapy (CBT). Yet, current psychological large language models (LLMs) remain overwhelmingly reactive, defaulting to empathetic but superficial responses that fail to surface latent beliefs or guide behavioral change. To bridge this gap, we propose the Socratic Inquiry Framework (SIF), a lightweight, plug-and-play therapeutic intent planner that transforms LLMs from passive listeners into active cognitive guides. SIF decouples when to ask (via Strategy Anchoring) from what to ask (via Template Retrieval), enabling context-aware, theory-grounded questioning without end-to-end retraining. Complementing SIF, we introduce Socratic-QA, a high-quality dataset of strategy-aligned Socratic sequences that provides explicit supervision for proactive reasoning. Experiments show that SIF significantly enhances proactive questioning frequency, conversational depth, and therapeutic alignment, marking a clear shift from reactive comfort to proactive exploration. Our work establishes a new paradigm for psychologically informed LLMs: not just to respond, but to guide.


[50] A2Eval: Agentic and Automated Evaluation for Embodied Brain cs.CL PDF

Shuai Zhang, Jiayu Hu, Zijie Chen, Zeyuan Ding, Yi Zhang

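TL;DR: This paper proposes A2Eval, an agentic automatic evaluation framework for embodied agents. Two collaborating agents (a Data Agent and an Eval Agent) automatically generate a balanced, compact evaluation suite and carry out high-fidelity evaluation, addressing the redundancy, uneven coverage, high cost, and ranking bias of existing manually annotated benchmarks.

Details

Motivation: Evaluation of embodied vision-language models (VLMs) currently relies on static, expert-defined, manually annotated benchmarks that suffer from severe redundancy and coverage imbalance. This labor-intensive paradigm drains computational and annotation resources, inflates costs, distorts model rankings, and ultimately hinders iterative development.

Result: Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. It corrects systematic ranking biases, improves human alignment to Spearman's rho=0.85, and maintains high ranking fidelity (Kendall's tau=0.81).

Insight: The novelty lies in proposing, for the first time, a fully automated evaluation framework driven by collaborating agents that automates both benchmark curation and evaluation execution. The core idea is that the agents autonomously induce capability dimensions, assemble a balanced evaluation suite, and synthesize and validate executable evaluation pipelines, achieving high-fidelity, low-cost evaluation and setting a new standard for embodied assessment.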

Abstract: Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor-intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman’s rho=0.85, and maintains high ranking fidelity (Kendall’s tau=0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be public soon.
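
The two rank-correlation statistics quoted in this abstract can be computed as in the sketch below, which compares how two evaluation suites order the same models. It uses the textbook formulas for data without ties (Spearman's rho from squared rank differences, Kendall's tau from concordant and discordant pairs). The function names and the scores in the usage note are invented for illustration and are unrelated to the paper's results.

```python
from itertools import combinations

def ranks(scores):
    """Rank 1 goes to the highest score; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(a, b):
    """Spearman's rho via the squared-rank-difference formula (no ties)."""
    ranks_a, ranks_b = ranks(a), ranks(b)
    n = len(a)
    squared_diffs = sum((x - y) ** 2 for x, y in zip(ranks_a, ranks_b))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))

def kendall_tau(a, b):
    """Kendall's tau: (concordant - discordant) / number of pairs (no ties)."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)
```

For five models scored 0.9, 0.8, 0.7, 0.6, 0.5 on a full suite and 0.85, 0.75, 0.55, 0.65, 0.45 on a compressed suite, only the third and fourth models swap places, giving rho = 0.9 and tau = 0.8.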


[51] CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation cs.CL | cs.AI PDF

Zhongyuan Peng, Caijun Xu, Changyi Xiao, Shibo Hong, Eli Zhang

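TL;DR: This paper proposes CoDiQ, a framework for generating questions of controllable difficulty that achieves fine-grained difficulty control via test-time scaling while ensuring solvability. Built on the framework are the CoDiQ-Generator model and CoDiQ-Corpus, a dataset of 44K competition-grade question sequences. Human evaluation shows the generated questions are more challenging than LiveCodeBench/AIME while remaining highly solvable, and training large reasoning models on the dataset substantially improves reasoning performance.

Details

Motivation: Existing automated question generation methods lack precise difficulty control, are computationally expensive, and struggle to produce competition-level questions at scale, so a new framework is needed that controls difficulty precisely while guaranteeing solvability.

Result: Training large reasoning models on CoDiQ-Corpus substantially improves reasoning performance. Human evaluation shows the generated questions are more challenging than LiveCodeBench/AIME, with solvability above 82%.

Insight: The novelty lies in identifying a test-time scaling tendency (a larger reasoning token budget raises difficulty but lowers solvability) and designing the CoDiQ framework around it to balance difficulty and solvability. By raising the upper bound on a model's ability to generate valid, high-difficulty questions, the authors build CoDiQ-Generator specifically for challenging question generation.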

Abstract: Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. However, existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. In this paper, we propose CoDiQ (Controllable Difficult Question Generation), a novel framework enabling fine-grained difficulty control via test-time scaling while ensuring question solvability. Specifically, first, we identify a test-time scaling tendency (extended reasoning token budget boosts difficulty but reduces solvability) and the intrinsic properties defining the upper bound of a model’s ability to generate valid, high-difficulty questions. Then, we develop CoDiQ-Generator from Qwen3-8B, which improves the upper bound of difficult question generation, making it particularly well-suited for challenging question construction. Building on the CoDiQ framework, we build CoDiQ-Corpus (44K competition-grade question sequences). Human evaluations show these questions are significantly more challenging than LiveCodeBench/AIME with over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance, verifying that scaling controlled-difficulty training questions enhances reasoning capabilities. We open-source CoDiQ-Corpus, CoDiQ-Generator, and implementations to support related research.


[52] Scaling Search-Augmented LLM Reasoning via Adaptive Information Control cs.CL PDF

Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C. Kerce, Faramarz Fekri

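TL;DR: This paper proposes DeepControl, a framework that addresses the redundant evidence, context saturation, and unstable learning caused by uncontrolled retrieval in search-augmented large language model (LLM) reasoning. Building on a formal notion of information utility, it introduces retrieval continuation and granularity control mechanisms that adaptively decide when to continue or stop retrieval and how much information to expand. An annealed control strategy lets the agent internalize effective information acquisition behaviors during training.

Details

Motivation: When existing search-augmented reasoning agents interleave multi-step reasoning with external information retrieval, uncontrolled retrieval often leads to redundant evidence, context saturation, and unstable learning. Outcome-based reinforcement learning offers limited guidance for regulating information acquisition, so an adaptive information control mechanism is needed.

Result: Extensive experiments across seven benchmarks show that the method consistently outperforms strong baselines. On Qwen2.5-7B and Qwen2.5-3B, it improves average performance by 9.4% and 8.6%, respectively, over outcome-based RL baselines, and it consistently outperforms both retrieval-free and retrieval-based reasoning methods that lack explicit information control.

Insight: The novelty lies in an adaptive information control framework built on information utility (the marginal value of retrieved evidence under a given reasoning state), together with retrieval continuation control, granularity control, and an annealed training strategy. This offers an important lesson for scaling search-augmented reasoning agents to complex, real-world information environments: explicitly and adaptively controlling information acquisition is more effective than relying on outcome feedback alone.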

Abstract: Search-augmented reasoning agents interleave multi-step reasoning with external information retrieval, but uncontrolled retrieval often leads to redundant evidence, context saturation, and unstable learning. Existing approaches rely on outcome-based reinforcement learning (RL), which provides limited guidance for regulating information acquisition. We propose DeepControl, a framework for adaptive information control based on a formal notion of information utility, which measures the marginal value of retrieved evidence under a given reasoning state. Building on this utility, we introduce retrieval continuation and granularity control mechanisms that selectively regulate when to continue and stop retrieval, and how much information to expand. An annealed control strategy enables the agent to internalize effective information acquisition behaviors during training. Extensive experiments across seven benchmarks demonstrate that our method consistently outperforms strong baselines. In particular, our approach achieves average performance improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B, respectively, over strong outcome-based RL baselines, and consistently outperforms both retrieval-free and retrieval-based reasoning methods without explicit information control. These results highlight the importance of adaptive information control for scaling search-augmented reasoning agents to complex, real-world information environments.


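[53] Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models cs.CL | cs.LG

Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo

TL;DR: This paper finds that large reasoning models (LRMs) suffer exploration collapse after reinforcement learning post-training: temperature-based sampling no longer improves pass@n accuracy. The authors propose Latent Exploration Decoding (LED), which needs no additional training or parameters. It aggregates intermediate-layer posteriors and selects the depth configurations with maximal entropy to restore exploration, consistently improving pass@1 and pass@16 accuracy across multiple reasoning benchmarks.

Details

Motivation: Post-training of modern reasoning models induces exploration collapse: the entropy of the final-layer posterior drops sharply while the entropy of intermediate layers stays relatively high, which limits sampling diversity. The paper aims to address this entropy asymmetry and restore the model's ability to explore.

Result: Across multiple reasoning benchmarks and models, LED improves pass@1 and pass@16 accuracy by an average of 0.61 and 1.03 percentage points, respectively, without additional training or parameters.

Insight: The novelty is the observation that post-trained models show an entropy asymmetry between the final layer and intermediate layers, and the depth-conditioned decoding strategy (LED) built on it, which aggregates intermediate-layer posteriors and maximizes entropy to produce exploration candidates. It is a novel and efficient training-free decoding optimization.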

Abstract: Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@n accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibits sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://GitHub.com/Xiaomi-Research/LED.
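
The depth-conditioned selection described in this abstract can be sketched in plain Python. The abstract only states that intermediate posteriors are aggregated via cumulative sum and that maximal-entropy depth configurations become exploration candidates. The direction of accumulation (from the final layer toward shallower layers), the renormalization step, and the names `led_candidates`, `layer_posteriors`, and `top_m` are assumptions made for illustration, not the authors' implementation.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def normalize(values):
    """Rescale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def led_candidates(layer_posteriors, top_m=1):
    """Rank cumulative-sum aggregates of per-layer posteriors by entropy.

    layer_posteriors: list of next-token distributions ordered from the
    shallowest layer to the final layer (hypothetical input format).
    Returns up to top_m tuples (entropy, depth, distribution), where depth
    counts how many layers, starting at the final one, were summed.
    """
    running = [0.0] * len(layer_posteriors[0])
    scored = []
    # Assumption: accumulate from the final layer toward shallower layers.
    for depth, posterior in enumerate(reversed(layer_posteriors), start=1):
        running = [r + p for r, p in zip(running, posterior)]
        aggregated = normalize(running)
        scored.append((entropy(aggregated), depth, aggregated))
    # Highest-entropy depth configurations come first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_m]


if __name__ == "__main__":
    # Toy posteriors over a 4-token vocabulary: flat early, peaked at the end.
    toy = [
        [0.25, 0.25, 0.25, 0.25],
        [0.40, 0.30, 0.20, 0.10],
        [0.97, 0.01, 0.01, 0.01],
    ]
    for ent, depth, dist in led_candidates(toy, top_m=3):
        print(depth, round(ent, 3), [round(p, 3) for p in dist])
```

In this toy setup the sharply peaked final layer has the lowest entropy, so aggregates that include intermediate layers rank higher as exploration candidates, which mirrors the entropy asymmetry the abstract describes.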


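[54] CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding cs.CL | cs.SE

Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang

TL;DR: This paper presents the first systematic study of how effective multimodal large language models (MLLMs) are at code understanding. It proposes rendering source code as images for visual compression, targeting the computational efficiency bottleneck that text-based models face as context length grows linearly. Experiments show that MLLMs can understand code effectively at up to 8x compression and can exploit visual cues such as syntax highlighting, which opens a new route to efficient inference.

Details

Motivation: As software systems grow in scale, text-based LLMs hit a computational efficiency bottleneck in code understanding, because treating code as a linear token sequence makes context length and compute cost grow linearly. The rapid progress of multimodal LLMs creates an opportunity to improve efficiency by representing code in the image modality.

Result: On code understanding tasks, MLLMs understand code effectively at up to 8x compression. Under 4x compression, visual cues such as syntax highlighting improve code completion performance. Tasks such as clone detection are exceptionally resilient to visual compression, with some compression ratios even slightly outperforming raw text input.

Insight: The novelty is the first systematic exploration of whether MLLMs are viable for code understanding, and the evidence that image-modality code representations can compress efficiently while preserving semantics. This suggests a new way to lower large-model inference cost: use the compression advantage of vision models to process structured code.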

Abstract: Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) code understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, pointing to a shift toward image-modality code representation as a pathway to more efficient inference.


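[55] PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning cs.CL

Langming Liu, Kangtao Lv, Haibin Chen, Weidong Zhang, Yejing Wang

TL;DR: This paper proposes PretrainRL, a framework that integrates reinforcement learning into the pretraining phase to consolidate the factual knowledge of large language models and reduce factual hallucination. Its core principle is "debiasing then learning": down-weighting high-probability falsehoods makes room for low-probability truths to be learned.

Details

Motivation: Large language models suffer from factual hallucination, which the authors trace to the imbalanced data distribution in the pretraining corpus, producing a state of "low-probability truth" and "high-probability falsehood". Existing approaches either evade the problem or face catastrophic forgetting, so the issue needs to be addressed at its root.

Result: Extensive experiments on three public benchmarks show that PretrainRL significantly alleviates factual hallucination and outperforms state-of-the-art methods.

Insight: The novelty lies in integrating reinforcement learning into pretraining to actively reshape the model's probability distribution, designing an efficient negative sampling strategy to discover high-probability falsehoods, and introducing new metrics for the model's probabilistic state with respect to factual knowledge. Viewed objectively, intervening on the probability distribution early in training to improve factual accuracy is a novel idea.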

Abstract: Large language models (LLMs), despite their powerful capabilities, suffer from factual hallucinations where they generate verifiable falsehoods. We identify a root of this issue: the imbalanced data distribution in the pretraining corpus, which leads to a state of “low-probability truth” and “high-probability falsehood”. Recent approaches, such as teaching models to say “I don’t know” or post-hoc knowledge editing, either evade the problem or face catastrophic forgetting. To address this issue from its root, we propose PretrainRL, a novel framework that integrates reinforcement learning into the pretraining phase to consolidate factual knowledge. The core principle of PretrainRL is “debiasing then learning.” It actively reshapes the model’s probability distribution by down-weighting high-probability falsehoods, thereby making “room” for low-probability truths to be learned effectively. To enable this, we design an efficient negative sampling strategy to discover these high-probability falsehoods and introduce novel metrics to evaluate the model’s probabilistic state concerning factual knowledge. Extensive experiments on three public benchmarks demonstrate that PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.


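[56] ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support cs.CL | cs.AI

Tiantian Chen, Jiaqi Lu, Ying Shen, Lin Zhang

TL;DR: This paper introduces the ES-MemEval benchmark and the EvoEmo dataset for systematically evaluating five core memory capabilities of conversational agents in long-term emotional support, including information extraction and temporal reasoning. Experiments on a range of LLMs reveal the importance of explicit long-term memory and the limitations of current RAG methods.

Details

Motivation: Existing long-term dialogue benchmarks focus on static, explicit fact retrieval and cannot evaluate agents in complex scenarios such as online emotional support, where user information is dispersed, implicit, and continuously evolving. A new evaluation framework is therefore needed.

Result: Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. RAG improves factual consistency but struggles with temporal dynamics and evolving user states.

Insight: The novelty is the first comprehensive benchmark (ES-MemEval) focused on personalized memory capabilities in long-term emotional support, with its companion dataset (EvoEmo). The work defines five core memory evaluation dimensions and empirically analyzes the shortcomings of current memory and retrieval integration, pointing the way for future long-term personalized dialogue systems.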

Abstract: Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. However, existing long-term dialogue benchmarks primarily focus on static and explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities: information extraction, temporal reasoning, conflict detection, abstention, and user modeling, in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and limitations of current paradigms and motivate more robust integration of memory and retrieval for long-term personalized dialogue systems.


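[57] Breaking the Static Graph: Context-Aware Traversal for Robust Retrieval-Augmented Generation cs.CL | cs.AI

Kwun Hang Lau, Fangyuan Zhang, Boyu Ruan, Yingli Zhou, Qintian Guo

TL;DR: This paper proposes CatRAG, a framework that addresses the "Static Graph Fallacy" in existing knowledge-graph-based retrieval-augmented generation methods. It turns the static knowledge graph into a query-adaptive navigation structure and steers the random walk with three mechanisms: symbolic anchoring, query-aware dynamic edge weighting, and key-fact passage weight enhancement, so that the evidence chain needed for multi-hop queries is retrieved more completely.

Details

Motivation: Structure-aware RAG methods such as HippoRAG rely on fixed transition probabilities determined at indexing time (a static graph) and ignore the query-dependent nature of edge relevance. Random walks therefore tend to be diverted into high-degree "hub" nodes, which causes semantic drift and makes it hard to retrieve the complete evidence chain that multi-hop queries require.

Result: Experiments on four multi-hop benchmarks show that CatRAG consistently outperforms state-of-the-art baselines. The analysis shows that gains on standard recall metrics are modest, but CatRAG improves substantially in reasoning completeness, that is, the ability to recover the entire evidence path without gaps.

Insight: The core innovation is turning the static knowledge graph into a query-aware navigation structure. By combining symbolic constraints, dynamic edge weighting, and a structured key-fact bias, it steers retrieval effectively and bridges the gap between retrieving partial context and enabling fully evidence-grounded reasoning.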

Abstract: Recent advances in Retrieval-Augmented Generation (RAG) have shifted from simple vector similarity to structure-aware approaches like HippoRAG, which leverage Knowledge Graphs (KGs) and Personalized PageRank (PPR) to capture multi-hop dependencies. However, these methods suffer from a “Static Graph Fallacy”: they rely on fixed transition probabilities determined during indexing. This rigidity ignores the query-dependent nature of edge relevance, causing semantic drift where random walks are diverted into high-degree “hub” nodes before reaching critical downstream evidence. Consequently, models often achieve high partial recall but fail to retrieve the complete evidence chain required for multi-hop queries. To address this, we propose CatRAG, Context-Aware Traversal for robust RAG, a framework that builds on the HippoRAG 2 architecture and transforms the static KG into a query-adaptive navigation structure. We introduce a multi-faceted framework to steer the random walk: (1) Symbolic Anchoring, which injects weak entity constraints to regularize the random walk; (2) Query-Aware Dynamic Edge Weighting, which dynamically modulates the graph structure to prune irrelevant paths while amplifying those aligned with the query’s intent; and (3) Key-Fact Passage Weight Enhancement, a cost-efficient bias that structurally anchors the random walk to likely evidence. Experiments across four multi-hop benchmarks demonstrate that CatRAG consistently outperforms state-of-the-art baselines. Our analysis reveals that while standard Recall metrics show modest gains, CatRAG achieves substantial improvements in reasoning completeness, that is, the capacity to recover the entire evidence path without gaps. These results reveal that our approach effectively bridges the gap between retrieving partial context and enabling fully grounded reasoning. Resources are available at https://github.com/kwunhang/CatRAG.
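
To make the contrast between static and query-aware transition probabilities concrete, here is a small Personalized PageRank sketch in plain Python. It is not the CatRAG or HippoRAG code: the toy graph, the `relevance` multipliers, and the function names `personalized_pagerank` and `reweight` are hypothetical, and the way relevance scores would actually be computed from a query is left out entirely.

```python
def personalized_pagerank(edges, seeds, alpha=0.85, iters=100):
    """Power iteration for Personalized PageRank.

    edges: {src: {dst: weight}} with non-negative weights.
    seeds: {node: restart probability}, summing to 1.
    """
    nodes = set(edges) | set(seeds)
    for nbrs in edges.values():
        nodes |= set(nbrs)
    rank = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        # Restart mass goes back to the seed nodes on every step.
        nxt = {n: (1 - alpha) * seeds.get(n, 0.0) for n in nodes}
        for src in nodes:
            nbrs = edges.get(src, {})
            total = sum(nbrs.values())
            if total == 0:
                # Dangling node: hand its mass back to the seeds.
                for seed, prob in seeds.items():
                    nxt[seed] += alpha * rank[src] * prob
                continue
            for dst, weight in nbrs.items():
                nxt[dst] += alpha * rank[src] * weight / total
        rank = nxt
    return rank


def reweight(edges, relevance):
    """Scale each edge by a query-dependent multiplier (default 1.0)."""
    return {
        src: {dst: w * relevance.get((src, dst), 1.0) for dst, w in nbrs.items()}
        for src, nbrs in edges.items()
    }


# Toy graph: a heavy edge pulls the walk into a hub, away from the evidence.
EDGES = {
    "query_entity": {"hub": 3.0, "bridge": 1.0},
    "hub": {"x1": 1.0, "x2": 1.0, "x3": 1.0},
    "bridge": {"evidence": 1.0},
}
SEEDS = {"query_entity": 1.0}
# Hypothetical query-aware multipliers: damp the hub edge, boost the bridge.
RELEVANCE = {("query_entity", "hub"): 0.1, ("query_entity", "bridge"): 2.0}

if __name__ == "__main__":
    static = personalized_pagerank(EDGES, SEEDS)
    adaptive = personalized_pagerank(reweight(EDGES, RELEVANCE), SEEDS)
    print("static evidence score:", round(static["evidence"], 4))
    print("query-aware evidence score:", round(adaptive["evidence"], 4))
```

With fixed weights most of the walk's mass flows through the hub. Rescaling the two outgoing edges of the seed shifts mass toward the two-hop evidence node, which is the effect the abstract attributes to query-aware edge weighting.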


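[58] Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition cs.CL | cs.AI

Wonjun Lee, Hyounghun Kim, Gary Geunbae Lee

TL;DR: This paper proposes Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that improves the robustness of accented speech recognition. Accent-aware routing encourages experts to capture accent-specific patterns during training and transitions to label-free routing for inference, while per-expert CTC heads and a routing-augmented loss improve training.

Details

Motivation: Automatic speech recognition degrades on accented speech. Existing methods either generalize poorly to heavily accented or unseen varieties or depend on limited and noisy labels.

Result: On the Mcv-Accent benchmark, the method yields consistent gains on both seen and unseen accents in low- and high-resource conditions, with up to 29.3% relative word error rate (WER) reduction over strong FastConformer baselines.

Insight: The contributions are a Mixture-of-Experts architecture with intermediate CTC supervision, a gradual transition from accent-aware to label-free routing, and a routing-augmented loss that stabilizes optimization, which together improve generalization across accents.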

Abstract: Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the Mcv-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.


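[59] Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models cs.CL | cs.IR

Bin Cao, Huixian Lu, Chenwen Ma, Ting Wang, Ruizhe Li

TL;DR: This paper proposes the Orthogonal Hierarchical Decomposition (OHD) framework. Its Orthogonal Tree Induction (OTI) method, based on spatial-semantic co-constraints, decomposes a complex table into a column tree and a row tree that capture vertical and horizontal hierarchical dependencies, respectively. A dual-pathway association protocol and an LLM semantic arbitrator then align multi-level semantic information, producing structure-preserving input representations for LLMs and addressing structural-semantic alignment in complex table understanding.

Details

Motivation: For complex tables with multi-level headers, merged cells, and heterogeneous layouts, existing methods based on table linearization or normalized grid modeling struggle to explicitly capture hierarchical structure and cross-dimensional dependencies, which misaligns structural semantics and textual representations for non-standard tables. Input representations that preserve structural semantics are therefore needed.

Result: Experiments on the complex table question answering benchmarks AITQA and HiTab show that OHD outperforms existing representation paradigms across multiple evaluation metrics.

Insight: The novelty is the orthogonal tree induction method, which decomposes a table into a row tree and a column tree to model hierarchical dependencies explicitly, combined with dual-pathway association and LLM semantic arbitration to align structure and semantics. The reusable idea is that orthogonal decomposition plus semantic arbitration can strengthen LLM understanding of complex structured data.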

Abstract: Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial–semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-pathway association protocol to symmetrically reconstruct the semantic lineage of each cell, and incorporate an LLM as a semantic arbitrator to align multi-level semantic information. We evaluate the OHD framework on two complex table question answering benchmarks, AITQA and HiTab. Experimental results show that OHD consistently outperforms existing representation paradigms across multiple evaluation metrics.


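[60] S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs cs.CL

Yanrui Du, Sendong Zhao, Yibo Gao, Danyang Zhao, Qika Lin

TL;DR: This paper proposes S3-CoT, a self-sampling framework that uses activation steering to have large language models (LLMs) generate their own style-aligned, variable-length chain-of-thought (CoT) reasoning traces, so that they can efficiently learn a fast-thinking mode analogous to human System 1. The method needs no teacher guidance or externally supplied high-quality supervision data: supervised fine-tuning (SFT) is performed on data filtered by gold answers, combined with a human-like dual-cognitive system and a progressive compression curriculum. Experiments show stable improvements on math benchmarks and in cross-domain generalization to medicine.

Details

Motivation: LLMs equipped with chain-of-thought perform strongly, but their reasoning is often redundant and lacks a fast, intuitive mode analogous to human System 1. The study explores whether LLMs can acquire such an efficient reasoning mode and addresses the central bottleneck of SFT-based methods, the scarcity of high-quality supervision data.

Result: On math benchmarks (e.g., GSM8K, MATH) and cross-domain generalization tests in medicine, the method yields stable improvements for both general and R1-style LLMs, indicating that it can effectively induce efficient reasoning.

Insight: The novelty is a self-sampling framework in which activation steering lets the model generate its own reasoning data without external supervision, easing data scarcity, together with a human-like dual-cognitive system (combining fast and slow thinking) and a progressive compression curriculum that shape learning. Viewed objectively, using the model's internal activations for data generation is a novel form of "self-distillation" that may open a new route to training efficient reasoning models.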

Abstract: Large language models (LLMs) equipped with chain-of-thought (CoT) achieve strong performance and offer a window into LLM behavior. However, recent evidence suggests that improvements in CoT capabilities often come with redundant reasoning processes, motivating a key question: Can LLMs acquire a fast-thinking mode analogous to human System 1 reasoning? To explore this, our study presents a self-sampling framework based on activation steering for efficient CoT learning. Our method can induce style-aligned and variable-length reasoning traces from target LLMs themselves without any teacher guidance, thereby alleviating a central bottleneck of SFT-based methods: the scarcity of high-quality supervision data. Using data filtered by gold answers, we perform SFT for efficient CoT learning with (i) a human-like dual-cognitive system, and (ii) a progressive compression curriculum. Furthermore, we explore a self-evolution regime in which SFT is driven solely by prediction-consistent data of variable-length variants, eliminating the need for gold answers. Extensive experiments on math benchmarks, together with cross-domain generalization tests in medicine, show that our method yields stable improvements for both general and R1-style LLMs. Our data and model checkpoints can be found at https://github.com/DYR1/S3-CoT.


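[61] Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation cs.CL | cs.AI

Zhanghao Hu, Qinglin Zhu, Hanqi Yan, Yulan He, Lin Gui

TL;DR: This paper proposes xMemory, an agent memory system that goes beyond conventional Retrieval-Augmented Generation (RAG). Because dialogue streams in agent memory are highly correlated and redundant, it follows a decoupling-to-aggregation retrieval strategy: memories are decomposed into semantic components and organized into a hierarchy, and at inference time compact, diverse themes and semantics are retrieved top-down, which improves both answer quality and token efficiency.

Details

Motivation: Conventional RAG assumes retrieval over large, heterogeneous corpora, whereas agent memory is a bounded, coherent, and highly correlated dialogue stream. Fixed top-k similarity retrieval therefore tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites. Retrieval needs to move beyond similarity matching and operate over latent components.

Result: On the LoCoMo and PerLTQA benchmarks, across three recent large language models, xMemory yields consistent gains in answer quality and token efficiency.

Insight: The novelty is the decoupling-to-aggregation retrieval paradigm: memories are decomposed into semantic components and arranged in a hierarchy, a sparsity-semantics objective guides memory split and merge, and retrieval proceeds top-down, which reduces redundancy while keeping context coherent in agent memory settings.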

Abstract: Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-k similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity–semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader’s uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.


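[62] NEAT: Neuron-Based Early Exit for Large Reasoning Models cs.CL

Kang Liu, Yongkang Liu, Xiaocui Yang, Peidong Wang, Wen Zhang

TL;DR: This paper proposes NEAT, an early reasoning exit framework based on neuron activation dynamics that mitigates overthinking in large reasoning models. It monitors neuron-level activation patterns and dynamically triggers early exit without additional training or test-time computation, which reduces redundant reasoning steps.

Details

Motivation: Large reasoning models often overthink, generating redundant reasoning steps after a correct solution has already been reached. Existing early exit methods rely on output-level heuristics or trained probing models and require additional computation or externally labeled data. NEAT aims to solve the problem without training or extra computation.

Result: Across four reasoning benchmarks and six models of different scales and architectures, NEAT reduces token usage by 22% to 28% on average while maintaining accuracy.

Insight: The novelty is using neuron-level activation dynamics for training-free early exit, which avoids extra computation and data requirements and offers a new angle on model efficiency.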

Abstract: Large Reasoning Models (LRMs) often suffer from overthinking, a phenomenon in which redundant reasoning steps are generated after a correct solution has already been reached. Existing early reasoning exit methods primarily rely on output-level heuristics or trained probing models to skip redundant reasoning steps, thereby mitigating overthinking. However, these approaches typically require additional rollout computation or externally labeled datasets. In this paper, we propose NEAT, a Neuron-based Early reAsoning exiT framework that monitors neuron-level activation dynamics to enable training-free early exits, without introducing additional test-time computation. NEAT identifies exit-associated neurons and tracks their activation patterns during reasoning to dynamically trigger early exit or suppress reflection, thereby reducing unnecessary reasoning while preserving solution quality. Experiments on four reasoning benchmarks across six models with different scales and architectures show that, for each model, NEAT achieves an average token reduction of 22% to 28% when averaged over the four benchmarks, while maintaining accuracy.


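[63] Closing the Loop: Universal Repository Representation with RPG-Encoder cs.CL | cs.SE

Jane Luo, Chengyu Yin, Xin Zhang, Qingtao Li, Steven Liu

TL;DR: This paper proposes RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity repository representation. By encoding raw code, evolving the topology incrementally, and serving as a unified interface for structure-aware navigation, it addresses the reasoning disconnect that fragmented representations cause in existing repository agents, and it achieves state-of-the-art repository understanding on multiple benchmarks.

Details

Motivation: Current repository agents rely on fragmented representations such as isolated API documentation or dependency graphs that lack semantic depth, which causes a reasoning disconnect. The paper treats repository comprehension and generation as inverse processes within a unified cycle.

Result: RPG-Encoder reaches 93.7% Acc@5 on SWE-bench Verified, the state of the art for repository understanding. It exceeds the best baseline by over 10% on SWE-bench Live Lite and achieves 98.5% reconstruction coverage on RepoCraft.

Insight: The core innovation is generalizing the Repository Planning Graph from a static generative blueprint into a dynamic, unified representation. Incremental topology evolution decouples maintenance cost from repository scale (reducing overhead by 95.7%), closing the loop between intent and implementation and giving a high-fidelity, structure-aware representation for repository navigation.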

Abstract: Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art repository understanding on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% on SWE-bench Live Lite. These results highlight our superior fine-grained localization accuracy in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG’s high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.


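[64] LEC-KG: An LLM-Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction – A Case Study on SDGs cs.CL | cs.AI

Yikai Zeng, Yingchao Piao, Jianhui Li

TL;DR: LEC-KG is a bidirectional collaborative framework that combines the semantic understanding of large language models (LLMs) with the structural reasoning of knowledge graph embeddings (KGE) to build domain-specific knowledge graphs from unstructured text. It mitigates long-tail bias through hierarchical coarse-to-fine relation extraction, grounds structural suggestions in the source text with evidence-guided Chain-of-Thought feedback, and enables structural validation for unseen entities through semantic initialization. The LLM and KGE modules enhance each other iteratively: KGE provides structure-aware feedback to refine LLM extractions, while validated triples progressively improve KGE representations.

Details

Motivation: Building domain-specific knowledge graphs from unstructured text is hard because of heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. It requires both semantic understanding and structural reasoning.

Result: Evaluated on Chinese Sustainable Development Goal (SDG) reports, LEC-KG improves substantially over LLM baselines, particularly on low-frequency relations, and reliably transforms unstructured policy text into validated knowledge graph triples.

Insight: The contributions are the bidirectional, iterative collaboration between LLM and KGE, hierarchical relation extraction that mitigates the long-tail problem, evidence-guided Chain-of-Thought feedback that keeps structural suggestions traceable, and semantic initialization that supports structural validation of unseen entities. Together they offer a reusable collaborative framework for domain knowledge graph construction.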

Abstract: Constructing domain-specific knowledge graphs from unstructured text remains challenging due to heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. We present LEC-KG, a bidirectional collaborative framework that integrates the semantic understanding of Large Language Models (LLMs) with the structural reasoning of Knowledge Graph Embeddings (KGE). Our approach features three key components: (1) hierarchical coarse-to-fine relation extraction that mitigates long-tail bias, (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in source text, and (3) semantic initialization that enables structural validation for unseen entities. The two modules enhance each other iteratively: KGE provides structure-aware feedback to refine LLM extractions, while validated triples progressively improve KGE representations. We evaluate LEC-KG on Chinese Sustainable Development Goal (SDG) reports, demonstrating substantial improvements over LLM baselines, particularly on low-frequency relations. Through iterative refinement, our framework reliably transforms unstructured policy text into validated knowledge graph triples.


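[65] Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning cs.CL | cs.LG

Keqin Peng, Yuanxin Ouyang, Xuebo Liu, Zhiliang Tian, Ruijian Han

TL;DR: This paper proposes Dynamic Decoupled Conditional Advantage (DDCA), a method that targets overly verbose reasoning in Reinforcement Learning with Verifiable Rewards (RLVR). By decoupling efficiency optimization from correctness and dynamically adjusting penalty strength, it substantially reduces generated tokens on several math reasoning benchmarks while maintaining or improving accuracy.

Details

Motivation: Conventional RLVR methods often produce overly verbose traces when encouraging multi-step reasoning, and naive length penalties in group-relative optimization can severely hurt accuracy. The authors attribute this to two structural issues: dilution of the length baseline and difficulty-penalty mismatch.

Result: Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency-accuracy trade-off relative to adaptive baselines. It reduces generated tokens by about 60% on simpler tasks (e.g., GSM8K) and by over 20% on harder tasks (e.g., AIME25), while maintaining or improving accuracy.

Insight: The core innovation is decoupling efficiency optimization from correctness: length advantages are computed conditionally within the correct-response cluster to eliminate baseline dilution, and the group pass rate serves as a difficulty proxy that dynamically scales the penalty strength. This gives an effective, adaptive framework for reducing redundant generation in reasoning tasks.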

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) can elicit strong multi-step reasoning, yet it often encourages overly verbose traces. Moreover, naive length penalties in group-relative optimization can severely hurt accuracy. We attribute this failure to two structural issues: (i) Dilution of Length Baseline, where incorrect responses (with zero length reward) depress the group baseline and over-penalize correct solutions; and (ii) Difficulty-Penalty Mismatch, where a static penalty cannot adapt to problem difficulty, suppressing necessary reasoning on hard instances while leaving redundancy on easy ones. We propose Dynamic Decoupled Conditional Advantage (DDCA) to decouple efficiency optimization from correctness. DDCA computes length advantages conditionally within the correct-response cluster to eliminate baseline dilution, and dynamically scales the penalty strength using the group pass rate as a proxy for difficulty. Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency–accuracy trade-off relative to adaptive baselines, reducing generated tokens by approximately 60% on simpler tasks (e.g., GSM8K) and by over 20% on harder benchmarks (e.g., AIME25), while maintaining or improving accuracy. Code is available at https://github.com/alphadl/DDCA.
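
The two ideas named in this abstract, a length advantage computed only among correct responses and a penalty strength scaled by the group pass rate, can be illustrated with a short Python sketch. The exact formulas are not given in the abstract, so the z-score normalization, the linear `base_penalty * pass_rate` scaling, and the names `ddca_advantages` and `base_penalty` are assumptions for illustration rather than the paper's definition.

```python
from statistics import mean, pstdev


def ddca_advantages(correct, lengths, base_penalty=0.5, eps=1e-6):
    """Per-response advantages for one group of sampled answers.

    correct: list of bools, one per response (verifiable reward).
    lengths: list of token counts, one per response.
    base_penalty: hypothetical maximum weight of the length term.
    """
    # Correctness advantage: group-normalized 0/1 reward.
    rewards = [1.0 if ok else 0.0 for ok in correct]
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]

    # Length statistics come only from correct responses, so incorrect
    # ones cannot drag the length baseline down.
    ok_lengths = [n for ok, n in zip(correct, lengths) if ok]
    if len(ok_lengths) < 2:
        return advantages  # nothing to compare lengths against

    # Assumption: a higher pass rate (easier problem) means a stronger penalty.
    pass_rate = len(ok_lengths) / len(correct)
    strength = base_penalty * pass_rate
    len_mu, len_sigma = mean(ok_lengths), pstdev(ok_lengths)

    result = []
    for adv, ok, n in zip(advantages, correct, lengths):
        if ok:
            # Shorter-than-average correct answers gain, longer ones lose.
            adv -= strength * (n - len_mu) / (len_sigma + eps)
        result.append(adv)
    return result


if __name__ == "__main__":
    group_correct = [True, True, False, True]
    group_lengths = [100, 300, 500, 200]
    for ok, n, adv in zip(group_correct, group_lengths,
                          ddca_advantages(group_correct, group_lengths)):
        print(ok, n, round(adv, 3))
```

In this sketch the incorrect response keeps its plain correctness advantage regardless of its length, while the three correct responses are ordered by brevity. On a hard prompt with a low pass rate the length term shrinks, so long but necessary reasoning is penalized less.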


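[66] Quantifying the Gap between Understanding and Generation within Unified Multimodal Models cs.CL

Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen

TL;DR: This paper introduces GapEval, a benchmark that quantifies the gap between understanding and generation capabilities in unified multimodal models (UMMs) and evaluates their cognitive coherence. Experiments find a persistent gap between the two directions across UMMs with different architectures, indicating that current models achieve only surface-level unification rather than deep cognitive convergence.

Details

Motivation: The paper investigates whether understanding and generation capabilities are genuinely aligned and integrated within unified multimodal models, since the degree of integration in existing models remains unclear.

Result: Experiments on GapEval show a significant and persistent gap between the understanding and generation directions across a range of UMM architectures, revealing that current models reach only surface-level unification and not deep cognitive convergence.

Insight: The novelty is GapEval, a bidirectional benchmark that quantifies the understanding-generation gap symmetrically. The analysis finds that knowledge inside UMMs often remains disjoint, and that capability emergence and knowledge are not synchronized across modalities, which points to directions for future research.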

Abstract: Recent advances in unified multimodal models (UMM) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities, and quantitatively measure the cognitive coherence of the two “unified” directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model’s bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence of the two. To further explore the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation to illustrate the underlying limitations. Our findings indicate that knowledge within UMMs often remains disjoint. The capability emergence and knowledge across modalities are unsynchronized, paving the way for further exploration.


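[67] D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use cs.CL

Bowen Xu, Shaoyu Wu, Hao Jiang, Kai Liu, Xin Chen

TL;DR: This paper proposes D-CORE, a framework that addresses "Lazy Reasoning" in large reasoning models (LRMs), which arises because they lack sub-task decomposition ability in complex tool use scenarios. Two-stage training (self-distillation to incentivize task decomposition, then diversity-aware reinforcement learning to restore reflective reasoning) significantly improves tool-use performance across several benchmarks.

Details

Motivation: Current large reasoning models lack sub-task decomposition ability in complex tool use scenarios, which leads to "Lazy Reasoning" and prevents them from handling complex real-world problems effectively.

Result: On the BFCLv3 benchmark, D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. D-CORE-14B sets a new state of the art at 79.3%, outperforming 70B models while being 5x smaller.

Insight: The novelty is using self-distillation to incentivize task decomposition reasoning and diversity-aware reinforcement learning to restore reflective reasoning. The result is strong performance at a much smaller model scale and a scalable training framework for complex tool use.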

Abstract: Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework, D-CORE (Decomposing tasks and Composing Reasoning processes), that first incentivizes the LRMs’ task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore the LRMs’ reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate the superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3%, outperforming 70B models despite being 5x smaller. The source code is available at https://github.com/alibaba/EfficientAI.


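[68] Evaluating Metalinguistic Knowledge in Large Language Models across the World’s Languages cs.CL

Tjaša Arčon, Matej Klemen, Marko Robnik-Šikonja, Kaja Dobrovoljc

TL;DR: By building a cross-lingual metalinguistic knowledge benchmark, this paper evaluates how well large language models (LLMs) know the linguistic structure of many of the world's languages. It finds that their metalinguistic knowledge is limited and strongly shaped by data availability.

Details

Motivation: Existing language model evaluations concentrate on language use tasks and favor high-resource languages. They lack a systematic assessment of metalinguistic knowledge, that is, explicit reasoning about language structure, so a benchmark covering a broad range of linguistic phenomena is needed to fill the gap.

Result: On the benchmark, GPT-4o performs best but reaches an accuracy of only 0.367, and open-source models do worse. All models score above the chance baseline but do not exceed the majority-class baseline. Accuracy is highest on lexical features and among the lowest on phonological features. Language-level accuracy is strongly associated with digital language status (e.g., Wikipedia size, corpus availability), and low-resource languages perform substantially worse.

Insight: The novelty is the first benchmark that systematically evaluates cross-lingual metalinguistic knowledge in LLMs. It shows that this knowledge is fragmented and driven by data availability rather than generalizable grammatical competence, and it underlines the need for greater global linguistic diversity in future LLM development.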

Abstract: Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge, that is, explicit reasoning about language structure rather than language use. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs’ metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world’s languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.
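
Since the abstract's headline finding depends on comparing accuracy and macro F1 against chance and majority-class baselines, a small Python sketch of those four quantities may help. The labels and predictions below are invented toy data, not benchmark items, and treating chance as uniform guessing over the observed label set is an assumption about how the baseline is defined.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


def majority_baseline(gold):
    """Accuracy of always predicting the most frequent gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


def chance_baseline(gold):
    """Expected accuracy of uniform random guessing over the gold labels."""
    return 1 / len(set(gold))


# Invented toy data with a skewed label distribution.
GOLD = ["yes"] * 6 + ["no"] * 3 + ["unknown"]
PRED = ["yes", "yes", "yes", "no", "no", "unknown", "no", "no", "yes", "yes"]

if __name__ == "__main__":
    print("accuracy:", accuracy(GOLD, PRED))
    print("macro F1:", round(macro_f1(GOLD, PRED), 3))
    print("majority baseline:", majority_baseline(GOLD))
    print("chance baseline:", round(chance_baseline(GOLD), 3))
```

The toy predictions land above chance but below the majority-class baseline, the same pattern the abstract reports for all evaluated models. Macro F1 is shown alongside accuracy because it weights rare labels equally, which matters when one feature value dominates.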


[69] Sinhala Physical Common Sense Reasoning Dataset for Global PIQA cs.CLPDF

Nisansa de Silva, Surangika Ranathunga

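TL;DR: This paper introduces the first Sinhala physical common sense reasoning dataset, created as part of the Global PIQA project. It contains 110 human-created and verified samples, each consisting of a prompt, a correct answer, and a wrong answer; most questions refer to the Sri Lankan context.

Details

Motivation: Sinhala lacks a high-quality dataset for physical common sense reasoning. The dataset aims to fill that gap and support the development of multilingual AI models worldwide.

Result: The authors built a 110-sample dataset that serves as a benchmark for Sinhala physical common sense reasoning. The abstract reports no quantitative model comparisons or SOTA results.

Insight: The contribution is the first Sinhala physical common sense reasoning dataset, grounded in the Sri Lankan cultural context, which should help AI research on low-resource languages.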

Abstract: This paper presents the first-ever Sinhala physical common sense reasoning dataset created as part of Global PIQA. It contains 110 human-created and verified data samples, where each sample consists of a prompt, the corresponding correct answer, and a wrong answer. Most of the questions refer to the Sri Lankan context, where Sinhala is an official language.


[70] Kimi K2.5: Visual Agentic Intelligence cs.CLPDF

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai

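TL;DR: This paper introduces Kimi K2.5, an open-source multimodal agentic model aimed at advancing general agentic intelligence. Its core is the joint optimization of text and vision, and it introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks and executes them in parallel.

Details

Motivation: To make the text and vision modalities strengthen each other through joint optimization, and to decompose and execute complex tasks efficiently in parallel, in order to advance general agentic intelligence.

Result: State-of-the-art results across multiple domains, including coding, vision, reasoning, and agentic tasks. Agent Swarm reduces latency by up to 4.5× over single-agent baselines.

Insight: The contributions are the multimodal joint-optimization techniques (joint text-vision pre-training, zero-vision SFT, and joint reinforcement learning) and the Agent Swarm orchestration framework, which dynamically decomposes tasks and runs them in parallel.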

Abstract: We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.


[71] Advancing General-Purpose Reasoning Models with Modular Gradient Surgery cs.CL | cs.AI | cs.LGPDF

Min Cai, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang

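TL;DR: This paper proposes Modular Gradient Surgery (MGS), a method for the gradient conflicts and performance interference that domain heterogeneity causes when a single general-purpose reasoning model is trained across diverse domains such as math, general chat, and instruction following. MGS reconciles gradients at the module level inside the transformer; applied to Llama and Qwen models, it yields sizable average gains on multi-task RL benchmarks.

Details

Motivation: When large general-purpose reasoning models are trained with reinforcement learning, pronounced heterogeneity between domains causes common strategies such as Sequential RL and Mixed RL to suffer severe cross-domain interference, which limits overall gains.

Result: Across three representative domains (math, general chat, and instruction following), MGS applied to Llama and Qwen improves on standard multi-task RL by an average of 4.3 points (16.6% relative) and 4.5 points (11.1% relative), respectively. Further analysis shows that it remains effective under prolonged training.

Insight: The core contribution is to identify and resolve gradient conflicts at the module level rather than at the level of the whole model or task. This offers a new and effective way to understand and mitigate interference in multi-domain RL and supports the training of general-purpose reasoning models.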

Abstract: Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.
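The abstract does not spell out the MGS update rule, so the following is only a sketch of what resolving conflicts "at the module level" could look like. It assumes a PCGrad-style projection (a swapped-in technique, not confirmed as the paper's method), applied separately to each module's gradient; the module names and gradient values are hypothetical.

```python
def dot(a, b):
    """Dot product of two flat gradient vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_out_conflict(grad, other):
    """If grad conflicts with other (negative dot product), remove the
    component of grad that points against other; otherwise leave it as is."""
    d = dot(grad, other)
    norm_sq = dot(other, other)
    if d >= 0 or norm_sq == 0:
        return list(grad)
    return [x - (d / norm_sq) * y for x, y in zip(grad, other)]

def modular_surgery(grads_a, grads_b):
    """Combine two domains' gradients module by module.

    grads_a / grads_b map a module name to that module's flat gradient.
    The conflict check runs per module, so a clash in one module does not
    alter the update of modules where the two domains agree.
    """
    merged = {}
    for name in grads_a:
        pa = project_out_conflict(grads_a[name], grads_b[name])
        pb = project_out_conflict(grads_b[name], grads_a[name])
        merged[name] = [x + y for x, y in zip(pa, pb)]
    return merged

# Hypothetical gradients: the domains conflict in "attn" but agree in "mlp".
math_grads = {"attn": [1.0, 0.0], "mlp": [1.0, 1.0]}
chat_grads = {"attn": [-1.0, 1.0], "mlp": [1.0, 0.0]}
print(modular_surgery(math_grads, chat_grads))
```

A whole-model variant would concatenate all modules before the conflict check; running the check per module is the design choice this sketch is meant to illustrate.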


[72] Automated Multiple Mini Interview (MMI) Scoring cs.CLPDF

Ryan Huynh, Frank Guerin, Alison Callwood

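TL;DR: This paper proposes a multi-agent prompting framework for automatically scoring Multiple Mini-Interviews (MMIs). It splits evaluation into transcript refinement and criterion-specific scoring and relies on few-shot in-context learning. It outperforms specialised fine-tuned baselines on MMI scoring and, without additional training, rivals domain-specific state-of-the-art models on the ASAP benchmark.

Details

Motivation: Human scoring of soft skills (such as empathy, ethical judgment, and communication) in competitive selection processes is inconsistent and biased. Existing rationale-based fine-tuning methods also struggle with abstract, context-dependent MMI tasks and miss the implicit signals in candidate narratives.

Result: On MMI scoring the method reaches an average QWK of 0.62, well above the specialised fine-tuned baseline (0.32), with reliability comparable to human experts. On the ASAP benchmark it rivals domain-specific state-of-the-art models without additional training.

Insight: The contribution is a multi-agent framework built on structured prompt engineering that breaks a complex, subjective reasoning task into manageable sub-steps and evaluates efficiently through few-shot in-context learning. It offers a scalable alternative to data-intensive fine-tuning for LLM-based automated assessment.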

Abstract: Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, altering how LLMs can be applied to automated assessment.
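The QWK figures quoted above refer to quadratic weighted kappa, a standard agreement statistic for ordinal scores. A minimal reference sketch follows; the score ranges and ratings are hypothetical, and returning 1.0 when the expected disagreement is zero is an assumed convention for that degenerate case.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two lists of integer scores on an ordinal scale.

    1.0 means perfect agreement, 0.0 chance-level agreement, and negative
    values mean worse than chance. Requires max_score > min_score.
    """
    n = max_score - min_score + 1
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / total
    if weighted_expected == 0:
        return 1.0  # both raters gave one identical constant score
    return 1.0 - weighted_observed / weighted_expected

# Hypothetical human and model scores on a 1-5 scale.
human = [3, 4, 2, 5, 3, 1]
model = [3, 4, 3, 4, 3, 2]
print(round(quadratic_weighted_kappa(human, model, 1, 5), 3))
```

Because the penalty grows with the square of the score gap, an off-by-one disagreement costs far less than an off-by-three one, which suits rubric-style interview scores.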


[73] Proof-RM: A Scalable and Generalizable Reward Model for Math Proof cs.CLPDF

Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu

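TL;DR: Proof-RM is a scalable and generalizable reward model for evaluating mathematical proofs. The paper designs a scalable data-construction pipeline that uses LLMs to generate large quantities of high-quality "question-proof-check" triplets, then trains a proof-checking reward model to strengthen LLM mathematical reasoning.

Details

Motivation: Proof-based problems cannot be verified automatically by simple answer matching, so LLM math reasoning needs a reward model that can reliably evaluate complete proofs.

Result: Experiments confirm strong performance in reward accuracy, generalization, and test-time guidance, indicating that the approach is scalable and effective.

Insight: The contributions are the scalable data-generation pipeline and the combination of a process reward with token weight balancing to stabilize RL training, which together give practical recipes and tools for improving LLM mathematical ability.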

Abstract: While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a scalable data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality “question-proof-check” triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model’s scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability, and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.


[74] ROG: Retrieval-Augmented LLM Reasoning for Complex First-Order Queries over Knowledge Graphs cs.CLPDF

Ziyan Zhang, Chao Wang, Zhuo Chen, Chiyi Li, Kai Song

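TL;DR: This paper proposes ROG, a framework for answering complex first-order logic queries over incomplete knowledge graphs. It combines query-aware neighborhood retrieval with LLM chain-of-thought reasoning: a multi-operator query is decomposed into a sequence of single-operator sub-queries, and each step reasons over compact, query-relevant neighborhood evidence. This reduces compounding errors and makes inference more robust on complex and negation-heavy queries.

Details

Motivation: Answering complex first-order logic queries (with projection, intersection, union, and negation operators) over incomplete knowledge graphs is hard, especially for complex query structures and negation-heavy queries, where existing embedding-based methods fall short.

Result: On standard knowledge graph reasoning benchmarks, ROG delivers consistent gains over strong embedding-based baselines, with the largest improvements on high-complexity and negation-heavy query types.

Insight: The contribution is to combine retrieval augmentation with LLM chain-of-thought reasoning, decomposing the query step by step and caching intermediate answer sets to improve consistency and robustness. Viewed objectively, replacing learned operators with retrieval-grounded, step-wise inference gives a practical alternative to embedding-based logical reasoning and mitigates error accumulation in long reasoning chains.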

Abstract: Answering first-order logic (FOL) queries over incomplete knowledge graphs (KGs) is difficult, especially for complex query structures that compose projection, intersection, union, and negation. We propose ROG, a retrieval-augmented framework that combines query-aware neighborhood retrieval with large language model (LLM) chain-of-thought reasoning. ROG decomposes a multi-operator query into a sequence of single-operator sub-queries and grounds each step in compact, query-relevant neighborhood evidence. Intermediate answer sets are cached and reused across steps, improving consistency on deep reasoning chains. This design reduces compounding errors and yields more robust inference on complex and negation-heavy queries. Overall, ROG provides a practical alternative to embedding-based logical reasoning by replacing learned operators with retrieval-grounded, step-wise inference. Experiments on standard KG reasoning benchmarks show consistent gains over strong embedding-based baselines, with the largest improvements on high-complexity and negation-heavy query types.
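The decomposition into single-operator steps with cached intermediate answer sets can be sketched on a toy graph. This is an illustration under strong assumptions: the triples, relation names, and plan format are hypothetical, and each step is answered by exact set operations over a complete graph, whereas ROG grounds each step in retrieved neighborhood evidence and an LLM.

```python
# Toy knowledge graph as (head, relation, tail) triples (hypothetical data).
TRIPLES = {
    ("paris", "has_resident", "alice"),
    ("paris", "has_resident", "carol"),
    ("acme", "employs", "alice"),
    ("acme", "employs", "bob"),
    ("globex", "employs", "carol"),
}
ENTITIES = {e for head, _, tail in TRIPLES for e in (head, tail)}

def run_plan(plan):
    """Execute single-operator steps in order, caching every answer set.

    Each step is (name, operator, args); args name earlier cached steps,
    so later steps reuse intermediate results instead of recomputing them.
    """
    cache = {}
    for name, op, args in plan:
        if op == "anchor":
            cache[name] = set(args)
        elif op == "project":
            source, relation = args
            cache[name] = {t for h, r, t in TRIPLES
                           if r == relation and h in cache[source]}
        elif op == "intersect":
            cache[name] = cache[args[0]] & cache[args[1]]
        elif op == "union":
            cache[name] = cache[args[0]] | cache[args[1]]
        elif op == "negate":
            cache[name] = ENTITIES - cache[args[0]]
        else:
            raise ValueError(f"unknown operator: {op}")
    return cache

# "Residents of Paris who are not employed by Acme", one operator per step.
plan = [
    ("city", "anchor", ["paris"]),
    ("residents", "project", ("city", "has_resident")),
    ("firm", "anchor", ["acme"]),
    ("staff", "project", ("firm", "employs")),
    ("not_staff", "negate", ("staff",)),
    ("answer", "intersect", ("residents", "not_staff")),
]
print(run_plan(plan)["answer"])
```

Keeping every intermediate set in the cache is what lets a deep chain stay consistent: a later step that needs `staff` again reads the same set rather than deriving a possibly different one.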


[75] Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models cs.CL | cs.AIPDF

Gabriele Maraia, Marco Valentino, Fabio Massimo Zanzotto, Leonardo Ranaldi

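TL;DR: This paper proposes an abstraction-guided reasoning framework. It pairs content-laden syllogisms with abstract ones, uses the LLM's activations on abstract inputs to define an abstract reasoning space, and learns lightweight Abstractors that predict representations aligned with that space. These predictions are integrated through multi-layer interventions to reduce semantic interference and make formal reasoning more robust.

Details

Motivation: In syllogistic reasoning, LLMs often conflate semantic plausibility with formal validity (the content effect), and existing methods do not reliably suppress this semantic interference. A mechanism is needed that separates structural inference from lexical semantics.

Result: In cross-lingual transfer tests, abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance, indicating greater robustness of formal reasoning to semantic interference.

Insight: The contribution is the explicit separation of structural inference from semantics through an activation-level abstract space and interventions by lightweight Abstractors, offering a scalable mechanism for more robust formal reasoning in LLMs.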

Abstract: Large Language Models (LLMs) often struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity, a phenomenon known as the content effect. This bias persists even when models generate step-wise explanations, indicating that intermediate rationales may inherit the same semantic shortcuts that affect answers. Recent approaches propose mitigating this issue by increasing inference-time structural constraints, either by encouraging abstract intermediate representations or by intervening directly in the model’s internal computations; however, reliably suppressing semantic interference remains an open challenge. To make formal deduction less sensitive to semantic content, we introduce a framework for abstraction-guided reasoning that explicitly separates structural inference from lexical semantics. We construct paired content-laden and abstract syllogisms and use the model’s activations on abstract inputs to define an abstract reasoning space. We then learn lightweight Abstractors that, from content-conditioned residual-stream states, predict representations aligned with this space and integrate these predictions via multi-layer interventions during the forward pass. Using cross-lingual transfer as a test bed, we show that abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance. Our results position activation-level abstraction as a scalable mechanism for enhancing the robustness of formal reasoning in LLMs against semantic interference.


[76] Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability cs.CLPDF

Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang

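TL;DR: This paper proposes an end-to-end reinforcement learning framework for improving the divide-and-conquer (DAC) reasoning of large language models (LLMs). The model is trained to decompose a complex problem into subproblems and solve them in sequence, addressing the shortcomings of conventional chain-of-thought (CoT) reasoning at the limits of model capability and in test-time scalability.

Details

Motivation: CoT reasoning often falls short at the limits of model capability, and its strictly sequential nature constrains test-time scalability. DAC reasoning is a promising alternative, but general-purpose post-training is misaligned with DAC-style inference, which keeps models from exploiting its potential.

Result: On competition-level benchmarks the method surpasses CoT by 8.6% in Pass@1 and by 6.3% in Pass@32, showing a higher performance ceiling and stronger test-time scalability.

Insight: The contribution is an end-to-end RL framework that integrates both problem decomposition and subproblem solving into policy training. This systematically improves DAC reasoning and removes the mismatch between conventional post-training and DAC-style inference.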

Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model’s capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs’ reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
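For readers unfamiliar with the Pass@k metric quoted above, the sketch below uses the widely adopted unbiased estimator, which computes the chance that at least one of k samples drawn from n generated samples is correct. The abstract does not say which estimator the authors use, so treating it as this one is an assumption, and the sample counts are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 64 samples generated, 4 of them correct.
for k in (1, 32):
    print(k, round(pass_at_k(64, 4, k), 4))
```

Pass@1 reflects single-shot accuracy, while Pass@32 indicates how much headroom repeated sampling can recover, which is why the two are reported side by side as a measure of test-time scalability.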


cs.CV [Back]

[77] EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions cs.CV | cs.AI | cs.CYPDF

Weiyu Sun, Liangliang Chen, Yongnuo Cai, Huiru Xie, Yi Zeng

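TL;DR: This paper releases EDU-CIRCUIT-HW, a dataset of more than 1,300 authentic handwritten student solutions from a university STEM course, for evaluating multimodal large language models (MLLMs) on recognition and auto-grading. The study finds many latent errors in MLLM recognition of handwritten content, indicating insufficient reliability, and shows that detecting and correcting known error patterns with a small amount of human intervention markedly improves the robustness of an AI grading system.

Details

Motivation: There is no authentic, domain-specific benchmark for assessing how well MLLMs understand complex handwritten student solutions that mix mathematical formulas, diagrams, and textual reasoning. Existing evaluation also relies mainly on downstream task outcomes (such as auto-grading), which do not capture the models' understanding of the handwritten logic as a whole.

Result: The authors evaluate several MLLMs on EDU-CIRCUIT-HW for both upstream recognition fidelity and downstream auto-grading performance. The evaluation reveals a striking scale of latent failures in MLLM-recognized handwriting, showing that the models are not reliable enough for auto-grading and other understanding-oriented uses in high-stakes educational settings.

Insight: The contribution is an authentic dataset of handwritten student solutions together with a joint evaluation of upstream recognition and downstream task performance, exposing the limits of MLLMs on complex handwriting. Viewed objectively, the case study shows that using identified error patterns to detect and correct recognition errors in advance, with human intervention on only about 4% of solutions, improves grading robustness and offers a practical path to deployment.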

Abstract: Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers’ workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs’ understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs’ upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models’ insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a remedy, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and rectify recognition errors, with only minimal human intervention (approximately 4% of the total solutions), can significantly enhance the robustness of the deployed AI-enabled grading system on unseen student solutions.


[78] Mirage2Matter: A Physically Grounded Gaussian World Model from Video cs.CV | cs.AIPDF

Zhengqing Gao, Ziwen Li, Xin Wang, Jiaxin Huang, Zhenyang Ren

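TL;DR: This paper proposes the Simulate Anything framework, which efficiently generates high-fidelity embodied training data from multi-view environment videos and off-the-shelf assets. It uses 3D Gaussian Splatting (3DGS) to reconstruct a photorealistic scene representation of a real environment from video, then uses generative models to recover a physically realistic representation and integrates it into a simulation environment, yielding a unified, editable, physically grounded world model.

Details

Motivation: Embodied intelligence is held back by the scarcity of real-world interaction data, while existing simulation platforms suffer from visual and physical gaps to reality and depend on expensive sensors or precise calibration.

Result: Vision Language Action (VLA) models trained on data simulated with the framework achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data.

Insight: The contribution is to combine 3DGS reconstruction with generative models and to use a precision calibration target for accurate scale alignment between the reconstructed scene and the real world, opening a reconstruction-driven route to world modeling for scalable embodied training.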

Abstract: The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.


[79] R3G: A Reasoning–Retrieval–Reranking Framework for Vision-Centric Answer Generation cs.CV | cs.AIPDF

Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu

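TL;DR: This paper proposes R3G, a modular Reasoning-Retrieval-Reranking framework for visual question answering. It generates a reasoning plan that guides two-stage image retrieval to supply missing visual cues, and it improves the accuracy of multiple multimodal large language models on MRAG-Bench.

Details

Motivation: In visual question answering, it remains hard to select externally retrieved images that supply missing visual cues and to integrate them effectively.

Result: On MRAG-Bench, accuracy improves across six MLLM backbones and nine sub-scenarios, reaching state-of-the-art overall performance.

Insight: The contribution is the combination of modular two-stage retrieval (coarse retrieval followed by fine-grained reranking) with reasoning-plan generation. Viewed objectively, its core is to give retrieval an explicit structure; sufficiency-aware reranking and the reasoning steps complement each other, improving both which images are chosen and how well they are used.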

Abstract: Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model’s reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.


[80] HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models cs.CV | cs.AIPDF

Wing Chan, Richard Allen

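TL;DR: This paper introduces HYPE-EDIT-1, a 100-task benchmark for measuring the reliability of frontier image editing models. It focuses on reference-based marketing and design edits judged as binary pass/fail. For each task, 10 independent outputs are generated to estimate per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time.

Details

Motivation: Public demos of image editing models usually show best-case samples, whereas real workflows bear the cost of retries and review. A benchmark is needed that quantifies model reliability and actual cost.

Result: Across the evaluated models, per-attempt pass rates range from 34% to 83%, and effective cost per successful edit ranges from USD 0.66 to USD 1.42. Models with low per-image pricing can turn out more expensive once retries and human review are included in the total effective cost.

Insight: The contribution is a quantitative evaluation framework that combines model performance with real workflow costs (retries and human review), together with a private test split and standardized tooling. It gives a fuller picture of reliability and reveals a model's practical efficiency and economics rather than its showcase capability.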

Abstract: Public demos of image editing models are typically best-case samples; real workflows pay for retries and review time. We introduce HYPE-EDIT-1, a 100-task benchmark of reference-based marketing/design edits with binary pass/fail judging. For each task we generate 10 independent outputs to estimate per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time. We release 50 public tasks and maintain a 50-task held-out private split for server-side evaluation, plus a standardized JSON schema and tooling for VLM and human-based judging. Across the evaluated models, per-attempt pass rates span 34-83 percent and effective cost per success spans USD 0.66-1.42. Models that have low per-image pricing are more expensive when you consider the total effective cost of retries and human reviews.
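The abstract names its metrics but not their exact formulas, so the sketch below makes explicit assumptions: attempts are independent with a fixed pass rate, every attempt incurs both the model price and one human review, and cost per success divides expected spend by the probability of succeeding within the retry cap. The prices and review cost are hypothetical; only the 34% and 83% pass rates echo the reported range.

```python
def expected_attempts(pass_rate, retry_cap):
    """Expected number of attempts when retrying until success or the cap.
    Attempt i + 1 happens only if the first i attempts all failed."""
    return sum((1 - pass_rate) ** i for i in range(retry_cap))

def success_within_cap(pass_rate, retry_cap):
    """Probability that at least one of retry_cap attempts passes."""
    return 1 - (1 - pass_rate) ** retry_cap

def effective_cost_per_success(pass_rate, retry_cap, price, review_cost):
    """Expected spend per task divided by the chance the task succeeds."""
    spend = expected_attempts(pass_rate, retry_cap) * (price + review_cost)
    return spend / success_within_cap(pass_rate, retry_cap)

# Hypothetical models: cheap but unreliable versus pricier but reliable.
review = 0.25  # assumed human review cost per attempt, in USD
cheap = effective_cost_per_success(0.34, 10, price=0.02, review_cost=review)
pricey = effective_cost_per_success(0.83, 10, price=0.10, review_cost=review)
print(round(cheap, 2), round(pricey, 2))
```

Under these assumptions the cap cancels out and the cost per success reduces to (price + review cost) / pass rate, so review time on failed attempts dominates and a low per-image price does not guarantee a low effective cost.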


[81] Efficient UAV trajectory prediction: A multi-modal deep diffusion framework cs.CV | cs.RO | eess.IVPDF

Yuan Gao, Xinyu Guo, Wenjing Xie, Zifan Wang, Hongwen Yu

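TL;DR: This paper proposes a multi-modal deep fusion framework for UAV trajectory prediction. It fuses LiDAR and millimeter-wave radar information to predict the trajectories of unauthorized UAVs more accurately, in support of low-altitude economy management.

Details

Motivation: Managing unauthorized UAVs in the low-altitude economy requires high-precision trajectory prediction. The framework exploits the complementary information that LiDAR and millimeter-wave radar provide on spatial geometric structure and dynamic reflection characteristics.

Result: On the MMAUD dataset from the CVPR 2024 UG2+ challenge, the proposed multi-modal fusion model improves trajectory prediction accuracy by 40% over the baseline model.

Insight: The contribution is an architecture with modality-specific feature extraction networks and a bidirectional cross-attention fusion module, which achieves complementarity and semantic alignment between modalities. Viewed objectively, the framework combines LiDAR's geometric detail with radar's dynamic characteristics into an efficient solution for multi-modal trajectory prediction.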

Abstract: To meet the requirements for managing unauthorized UAVs in the low-altitude economy, a multi-modal UAV trajectory prediction method based on the fusion of LiDAR and millimeter-wave radar information is proposed. A deep fusion network for multi-modal UAV trajectory prediction, termed the Multi-Modal Deep Fusion Framework, is designed. The overall architecture consists of two modality-specific feature extraction networks and a bidirectional cross-attention fusion module, aiming to fully exploit the complementary information of LiDAR and radar point clouds in spatial geometric structure and dynamic reflection characteristics. In the feature extraction stage, the model employs independent but structurally identical feature encoders for LiDAR and radar. After feature extraction, the model enters the Bidirectional Cross-Attention Mechanism stage to achieve information complementarity and semantic alignment between the two modalities. To verify the effectiveness of the proposed model, the MMAUD dataset used in the CVPR 2024 UG2+ UAV Tracking and Pose-Estimation Challenge is adopted as the training and testing dataset. Experimental results show that the proposed multi-modal fusion model significantly improves trajectory prediction accuracy, achieving a 40% improvement compared to the baseline model. In addition, ablation experiments are conducted to demonstrate the effectiveness of different loss functions and post-processing strategies in improving model performance. The proposed model can effectively utilize multi-modal data and provides an efficient solution for unauthorized UAV trajectory prediction in the low-altitude economy.


[82] SITUATE – Synthetic Object Counting Dataset for VLM training cs.CV | cs.AIPDF

René Peinl, Vincent Tischler, Patrick Schröder, Christian Groth

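TL;DR: This paper introduces SITUATE, a new synthetic dataset for training and evaluating vision language models on counting tasks with spatial constraints. It bridges the gap between simple 2D datasets such as VLMCountBench and ambiguous real-life datasets such as TallyQA, which lack control over occlusion and spatial composition. Fine-tuning Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmo count test data, but not vice versa, which suggests the dataset helps generalization to out-of-distribution images.

Details

Motivation: Existing counting datasets are inadequate for training vision language models on spatially constrained counting: VLMCountBench is too simple, and TallyQA is ambiguous and uncontrolled. A controllable synthetic dataset allows better evaluation and improvement.

Result: Fine-tuning Qwen VL 2.5 7B on SITUATE improves its accuracy on the Pixmo count test data. The authors cross-validate this by comparing against other established counting benchmarks and against an equally sized fine-tuning set derived from Pixmo count.

Insight: The contribution is a controllable synthetic counting dataset focused on spatial constraints. Viewed objectively, controlling occlusion and composition synthetically gives models a more structured training environment, which may improve the robustness and generalization of vision language models on complex counting tasks.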

Abstract: We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset helps to improve generalization for out-of-distribution images, since a finetune of Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmo count test data, but not vice versa. We cross validate this by comparing the model performance across established other counting benchmarks and against an equally sized fine-tuning set derived from Pixmo count.


[83] Observing Health Outcomes Using Remote Sensing Imagery and Geo-Context Guided Visual Transformer cs.CV | cs.LGPDF

Yu Li, Guilherme N. DeSouza, Praveen Rao, Chi-Ren Shyu

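TL;DR: This paper proposes a visual transformer that combines remote sensing imagery with geo-context information. A geospatial embedding mechanism converts multi-source geospatial data into embeddings spatially aligned with image patches, and a guided attention module dynamically integrates the multimodal information to improve prediction of health outcomes such as disease prevalence.

Details

Motivation: Existing vision-language and multimodal models are optimized for semantic alignment between visual and textual content and cannot represent or reason over structured geospatial layers. A model is needed that fuses remote sensing imagery with auxiliary geospatial information.

Result: Experiments show that the framework outperforms existing pretrained geospatial foundation models at predicting disease prevalence, demonstrating its effectiveness for multimodal geospatial understanding.

Insight: The contributions are a geospatial embedding mechanism that spatially aligns multi-source data with the image, and a guided attention module that integrates multimodal information through attention weights computed from correlations with the auxiliary data. Assigning distinct roles to attention heads captures complementary information and makes predictions more interpretable.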

Abstract: Visual transformers have driven major progress in remote sensing image analysis, particularly in object detection and segmentation. Recent vision-language and multimodal models further extend these capabilities by incorporating auxiliary information, including captions, question and answer pairs, and metadata, which broadens applications beyond conventional computer vision tasks. However, these models are typically optimized for semantic alignment between visual and textual content rather than geospatial understanding, and therefore are not suited for representing or reasoning with structured geospatial layers. In this study, we propose a novel model that enhances remote sensing imagery processing with guidance from auxiliary geospatial information. Our approach introduces a geospatial embedding mechanism that transforms diverse geospatial data into embedding patches that are spatially aligned with image patches. To facilitate cross-modal interaction, we design a guided attention module that dynamically integrates multimodal information by computing attention weights based on correlations with auxiliary data, thereby directing the model toward the most relevant regions. In addition, the module assigns distinct roles to individual attention heads, allowing the model to capture complementary aspects of the guidance information and improving the interpretability of its predictions. Experimental results demonstrate that the proposed framework outperforms existing pretrained geospatial foundation models in predicting disease prevalence, highlighting its effectiveness in multimodal geospatial understanding.


[84] From Manual Observation to Automated Monitoring: Space Allowance Effects on Play Behaviour in Group-Housed Dairy Calves cs.CV | eess.IVPDF

Haiyu Yang, Heidi Lesscher, Enhong Liu, Miel Hostens

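TL;DR: This study investigates the relationship between space allowance (2.66-17.98 m² per calf) and play behaviour in group-housed dairy calves and develops an automated computer vision pipeline for scalable monitoring. Play behaviour relates non-linearly to space allowance and peaks at 8-10 m² per calf. A computer vision classifier trained on manual annotations detects active play with high accuracy.

Details

Motivation: Under commercial conditions, the effect of space allowance on calf play behaviour (a positive welfare indicator) is poorly characterized, especially at intermediate-to-high allowances (6-20 m² per calf). The study aims to quantify this relationship and to develop automated monitoring that overcomes the scalability limits of manual observation.

Result: Play behaviour, expressed as a percentage of the observation period (%OP), relates non-linearly to space allowance: it is highest at 8-10 m² per calf (1.6% OP) and lowest at 6-8 m² and 12-14 m² (<0.6% OP). The computer vision classifier achieves 97.6% accuracy and 99.4% recall for active play detection.

Insight: The study suggests 8-10 m² per calf as a practical target that balances welfare benefits with economic feasibility. Its contribution is to combine detailed ethological analysis with an automated computer vision pipeline, showing that small annotation projects can scale to continuous welfare assessment and offering a feasible route to large-scale, objective behaviour monitoring in livestock farming.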

Abstract: Play behaviour serves as a positive welfare indicator in dairy calves, yet the influence of space allowance under commercial conditions remains poorly characterized, particularly at intermediate-to-high allowances (6-20 m² per calf). This study investigated the relationship between space allowance and play behaviour in 60 group-housed dairy calves across 14 commercial farms in the Netherlands (space range: 2.66-17.98 m² per calf), and developed an automated computer vision pipeline for scalable monitoring. Video observations were analyzed using a detailed ethogram, with play expressed as percentage of observation period (%OP). Statistical analysis employed linear mixed models with farm as a random effect. A computer vision pipeline was trained on manual annotations from 108 hours on 6 farms and validated on held-out test data. The computer vision classifier achieved 97.6% accuracy with 99.4% recall for active play detection. Calves spent on average 1.0% of OP playing, reflecting around 10 minutes per 17-hour period. The space-play relationship was non-linear, with highest play levels at 8-10 m² per calf (1.6% OP) and lowest at 6-8 m² and 12-14 m² (<0.6% OP). Space remained significant after controlling for age, health, and group size. In summary, these findings suggest that 8-10 m² per calf represents a practical target balancing welfare benefits with economic feasibility, and demonstrate that automated monitoring can scale small annotation projects to continuous welfare assessment systems.
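The %OP figures convert to minutes with simple arithmetic. The short sketch below uses the 17-hour observation period stated in the abstract; the helper name is hypothetical.

```python
def op_percent_to_minutes(op_percent, period_hours):
    """Convert a share of the observation period (%OP) into minutes."""
    return op_percent / 100.0 * period_hours * 60.0

# Values from the abstract: the overall mean and the peak at 8-10 m² per calf.
for label, percent in [("overall mean", 1.0), ("8-10 m² per calf", 1.6)]:
    minutes = op_percent_to_minutes(percent, 17)
    print(f"{label}: {minutes:.1f} min per 17 h")
```

For 1.0% OP this gives about 10.2 minutes, consistent with the "around 10 minutes" stated in the abstract.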


[85] IC-EO: Interpretable Code-based assistant for Earth Observation cs.CV | cs.AIPDF

Lamia Lahouel, Laurynas Lopata, Simon Gruening, Gabriele Meoni, Gaetan Petit

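TL;DR: This paper proposes IC-EO, an interpretable, code-generating conversational assistant that simplifies Earth Observation analysis. It turns natural-language queries into executable, auditable Python workflows and supports classification, segmentation, detection, and other tasks through a unified, easily extendable API.

Details

Motivation: Earth Observation analysis has a high barrier to entry for non-experts, and many existing systems are black-box predictors that are hard to audit or reproduce.

Result: On the two selected use cases (land-composition mapping and post-wildfire damage assessment), the agent outperforms general-purpose LLM/VLM baselines (GPT-4o, LLaVA), reaching 64.2% accuracy on land composition (vs. 51.7%) and 50% on post-wildfire analysis (vs. 0%).

Insight: The contribution is a conversational agent, built on tool-calling LLMs, that generates verifiable code and turns Earth Observation analysis into a transparent, reproducible process. Its evaluation covers the tool, agent, and task levels, which supports the reliability and interpretability of the results.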

Abstract: Despite recent advances in computer vision, Earth Observation (EO) analysis remains difficult for laypeople to perform, requiring expert knowledge and technical capabilities. Furthermore, many systems return black-box predictions that are difficult to audit or reproduce. Leveraging recent advances in tool LLMs, this study proposes a conversational, code-generating agent that transforms natural-language queries into executable, auditable Python workflows. The agent operates over a unified, easily extendable API for classification, segmentation, detection (oriented bounding boxes), spectral indices, and geospatial operators. With our proposed framework, results can be checked at three levels: (i) at the tool level, through performance on public EO benchmarks; (ii) at the agent level, through the capacity to generate valid, hallucination-free code; and (iii) at the task level, on specific use cases. In this work, we select two use cases of interest: land-composition mapping and post-wildfire damage assessment. The proposed agent outperforms general-purpose LLM/VLM baselines (GPT-4o, LLaVA), achieving 64.2% vs. 51.7% accuracy on land-composition and 50% vs. 0% on post-wildfire analysis, while producing results that are transparent and easy to interpret. By outputting verifiable code, the approach turns EO analysis into a transparent, reproducible process.


[86] VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents cs.CV | cs.AI | cs.MMPDF

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Zhenyu Guan, Jiahuan Chen

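TL;DR: This paper proposes VDE Bench, a benchmark for evaluating image editing models on multilingual and complex visual document editing. It includes a high-quality, human-annotated dataset of densely textual documents in Chinese and English, and introduces a decoupled evaluation framework that systematically quantifies editing performance at the OCR parsing level.

Details

Motivation: Existing image editing models focus mainly on English and sparse text layouts and cannot adequately handle dense, structurally complex documents or non-Latin scripts such as Chinese. A dedicated benchmark is needed to assess models on multilingual, complex visual document editing.

Result: The authors run a comprehensive evaluation of representative SOTA image editing models on the benchmark. Manual verification shows strong consistency between human judgments and the automated metrics.

Insight: The contribution is the first systematic benchmark for editing multilingual, densely textual visual documents, together with a decoupled, OCR-parsing-based framework for fine-grained measurement of text modification accuracy.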

Abstract: In recent years, multimodal image editing models have achieved substantial progress, enabling users to manipulate visual content through natural language in a flexible and interactive manner. Nevertheless, an important yet insufficiently explored research direction remains visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing approaches, including AnyText, GlyphControl, and TextCtrl, predominantly focus on English-language scenarios and documents with relatively sparse textual layouts, thereby failing to adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose Visual Doc Edit Bench (VDE Bench), a rigorously human-annotated and evaluated benchmark specifically designed to assess image editing models on multilingual and complex visual document editing tasks. The benchmark comprises a high-quality dataset encompassing densely textual documents in both English and Chinese, including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a decoupled evaluation framework that systematically quantifies editing performance at the OCR parsing level, enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative state-of-the-art image editing models. Manual verification demonstrates a strong consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating image editing models on multilingual and densely textual visual documents.


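[87] PovNet+: A Deep Learning Architecture for Socially Assistive Robots to Learn and Assist with Multiple Activities of Daily Living cs.CV | cs.RO

Fraser Robinson, Souren Pashangpour, Matthew Lisondra, Goldie Nejat

TL;DR: This paper presents POVNet+, a multimodal deep learning architecture for socially assistive robots that recognizes multiple activities of daily living (ADLs) in order to proactively initiate assistive behaviors. By introducing ADL and motion embedding spaces, the architecture distinguishes a known ADL, a new unseen ADL, and a known ADL performed atypically, and it combines this with user state estimation to monitor user performance and provide assistance in real scenarios.

Details

Motivation: A major barrier to the long-term deployment of autonomous socially assistive robots is their inability to both perceive and assist with multiple activities of daily living, so a solution is needed that recognizes multiple activities and proactively initiates assistance.

Result: Compared with state-of-the-art human activity recognition methods, POVNet+ achieves higher ADL classification accuracy. Human-robot interaction experiments in a cluttered living environment with multiple users and the robot Leia show that the architecture successfully identifies known, unseen, and atypically performed ADLs and initiates appropriate assistive interactions.

Insight: The novel elements are the ADL and motion embedding spaces used to distinguish activity types, and a user state estimation method applied in the motion embedding space to recognize new ADLs. Viewed objectively, the work improves the adaptability and practicality of social robots in complex environments through multimodal perception and proactive assistance.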

Abstract: A significant barrier to the long-term deployment of autonomous socially assistive robots is their inability to both perceive and assist with multiple activities of daily living (ADLs). In this paper, we present the first multimodal deep learning architecture, POVNet+, for multi-activity recognition for socially assistive robots to proactively initiate assistive behaviors. Our novel architecture introduces the use of both ADL and motion embedding spaces to uniquely distinguish between a known ADL being performed, a new unseen ADL, or a known ADL being performed atypically in order to assist people in real scenarios. Furthermore, we apply a novel user state estimation method to the motion embedding space to recognize new ADLs while monitoring user performance. This ADL perception information is used to proactively initiate robot assistive interactions. Comparison experiments with state-of-the-art human activity recognition methods show our POVNet+ method has higher ADL classification accuracy. Human-robot interaction experiments in a cluttered living environment with multiple users and the socially assistive robot Leia using POVNet+ demonstrate the ability of our multi-modal ADL architecture in successfully identifying different seen and unseen ADLs, and ADLs being performed atypically, while initiating appropriate assistive human-robot interactions.


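[88] Shedding the Facades, Connecting the Domains: Detecting Shifting Multimodal Hate Video with Test-Time Adaptation cs.CV

Jiao Li, Jian Lang, Xikai Tang, Wenzheng Shu, Ting Zhong

TL;DR: This paper proposes SCANNER, the first test-time adaptation framework for hate video detection. It uses the stable core semantics of hateful content as a bridge between the source and target domains to cope with the severe semantic drift caused by the evolution of hateful content.

Details

Motivation: Existing hate video detection methods assume that training and inference data share the same distribution, but in practice hateful content evolves into irregular and ambiguous forms to evade censorship, causing severe semantic drift that conventional test-time adaptation methods struggle to handle.

Result: Experiments show that SCANNER outperforms all baselines on multiple benchmarks, with an average Macro-F1 gain of 4.69% over the best baseline.

Insight: The key observation is that although the manifestations of hate evolve, the underlying targeted characteristics remain stable. Building on it, the authors design a centroid-guided alignment mechanism, a sample-level adaptive alignment strategy, and an intra-cluster diversity regularization to strengthen adaptation.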

Abstract: Hate Video Detection (HVD) is crucial for online ecosystems. Existing methods assume identical distributions between training (source) and inference (target) data. However, hateful content often evolves into irregular and ambiguous forms to evade censorship, resulting in substantial semantic drift and rendering previously trained models ineffective. Test-Time Adaptation (TTA) offers a solution by adapting models during inference to narrow the cross-domain gap, while conventional TTA methods target mild distribution shifts and struggle with the severe semantic drift in HVD. To tackle these challenges, we propose SCANNER, the first TTA framework tailored for HVD. Motivated by the insight that, despite the evolving nature of hateful manifestations, their underlying cores remain largely invariant (i.e., targeting is still based on characteristics like gender, race, etc), we leverage these stable cores as a bridge to connect the source and target domains. Specifically, SCANNER initially reveals the stable cores from the ambiguous layout in evolving hateful content via a principled centroid-guided alignment mechanism. To alleviate the impact of outlier-like samples that are weakly correlated with centroids during the alignment process, SCANNER enhances the prior by incorporating a sample-level adaptive centroid alignment strategy, promoting more stable adaptation. Furthermore, to mitigate semantic collapse from overly uniform outputs within clusters, SCANNER introduces an intra-cluster diversity regularization that encourages the cluster-wise semantic richness. Experiments show that SCANNER outperforms all baselines, with an average gain of 4.69% in Macro-F1 over the best.


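[89] LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models cs.CV

Pengcheng Zheng, Chaoning Zhang, Jiarong Mo, GuoHui Li, Jiaquan Zhang

TL;DR: This paper proposes LLaVA-FA, a new method for compressing large multimodal models (LMMs). It performs low-rank decomposition and quantization jointly in the frequency domain, exploiting the de-correlation and conjugate symmetry properties of the Fourier transform to obtain more compact and accurate weight representations. It also introduces PolarQuant, a polar-coordinate quantization method for complex matrices, and an optional diagonal calibration (ODC) scheme that does not require large-scale calibration data.

Details

Motivation: Large multimodal models perform well, but their substantial computational and memory costs hinder practical deployment. Existing compression methods usually decouple low-rank decomposition from quantization, which compounds reconstruction errors, especially in multimodal architectures with cross-modal redundancy.

Result: Extensive experiments show that LLaVA-FA outperforms existing efficient multimodal models on multiple benchmarks while keeping the fewest activated parameters and a low computational cost.

Insight: The core novelty is moving the compression problem into the frequency domain for a joint low-rank plus quantization approximation, which uses the mathematical properties of the Fourier transform to capture and compress weight information more effectively. PolarQuant and ODC are techniques tailored to the complex-valued frequency-domain representation that reduce data dependence and error.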

Abstract: Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder their practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that our proposed LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs, validating its effectiveness as a powerful solution for compressing LMMs.
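
The two Fourier-domain ideas named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: `dft`, `polar_quantize`, `polar_dequantize`, and the 4-bit grids are hypothetical names and choices made only for illustration. The sketch checks the conjugate symmetry of the DFT of a real vector (so roughly half of the coefficients are enough to store) and quantizes the complex coefficients in polar form, with separate uniform grids for magnitude and phase.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def polar_quantize(z, max_mag, mag_bits=4, phase_bits=4):
    """Quantize a complex number on uniform magnitude and phase grids.

    Hypothetical scheme for illustration; the paper's PolarQuant may differ.
    """
    mag_levels = 2 ** mag_bits - 1
    phase_levels = 2 ** phase_bits
    mag, phase = cmath.polar(z)                      # phase lies in [-pi, pi]
    q_mag = round(min(mag, max_mag) / max_mag * mag_levels)
    q_phase = round(phase / (2 * math.pi) * phase_levels) % phase_levels
    return q_mag, q_phase

def polar_dequantize(q_mag, q_phase, max_mag, mag_bits=4, phase_bits=4):
    """Map integer codes back to a complex number."""
    mag = q_mag / (2 ** mag_bits - 1) * max_mag
    phase = q_phase / (2 ** phase_bits) * 2 * math.pi
    return cmath.rect(mag, phase)

weights = [0.5, -1.0, 0.25, 2.0, -0.75, 0.1, 1.5, -0.3]   # one real weight row
spectrum = dft(weights)
n = len(weights)

# Conjugate symmetry of a real signal: X[n-k] == conj(X[k])
symmetric = all(abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9
                for k in range(1, n))

# Store only the first n//2 + 1 coefficients, quantized in polar form
half = spectrum[: n // 2 + 1]
max_mag = max(abs(c) for c in half)
codes = [polar_quantize(c, max_mag) for c in half]
restored = [polar_dequantize(m, p, max_mag) for m, p in codes]
max_err = max(abs(a - b) for a, b in zip(half, restored))
```

With 4 bits each, the magnitude error is at most half a magnitude step and the phase error at most half a phase bin, so the reconstruction error stays below roughly a quarter of the largest magnitude. A real system would choose the grids from the weight statistics instead of fixing them.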


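[90] Scalable Analytic Classifiers with Associative Drift Compensation for Class-Incremental Learning of Vision Transformers cs.CV | cs.AI

Xuan Rao, Mingming Ha, Bo Zhao, Derong Liu, Cesare Alippi

TL;DR: This paper addresses the high computational cost of the classifier reconstruction phase in class-incremental learning with Vision Transformers and proposes a scalable analytic classifier framework. Low-Rank Factorized Regularized Gaussian Discriminant Analysis (LR-RGDA) lowers inference complexity, and a Hopfield-based Distribution Compensator (HopDC) mitigates representation drift, yielding an efficient, high-performing solution for large-scale class-incremental learning.

Details

Motivation: In class-incremental learning (CIL) with Vision Transformers, the classifier reconstruction phase relies on costly iterative stochastic gradient descent (SGD), which creates a computational bottleneck. The paper aims to provide a scalable solution for large-scale CIL scenarios.

Result: Extensive experiments on multiple CIL benchmarks show that the proposed framework achieves state-of-the-art (SOTA) performance.

Insight: The main contributions are: 1) LR-RGDA, which exploits the low-rank structure of the covariance and the Woodbury matrix identity to decompose the discriminant function into a global affine term and a low-rank quadratic perturbation, substantially reducing inference complexity; 2) HopDC, a training-free mechanism that uses modern continuous Hopfield networks to recalibrate historical class statistics through associative memory dynamics on unlabeled anchors, accompanied by a theoretical bound on the estimation error.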

Abstract: Class-incremental learning (CIL) with Vision Transformers (ViTs) faces a major computational bottleneck during the classifier reconstruction phase, where most existing methods rely on costly iterative stochastic gradient descent (SGD). We observe that analytic Regularized Gaussian Discriminant Analysis (RGDA) provides a Bayes-optimal alternative with accuracy comparable to SGD-based classifiers; however, its quadratic inference complexity limits its use in large-scale CIL scenarios. To overcome this, we propose Low-Rank Factorized RGDA (LR-RGDA), a scalable classifier that combines RGDA’s expressivity with the efficiency of linear classifiers. By exploiting the low-rank structure of the covariance via the Woodbury matrix identity, LR-RGDA decomposes the discriminant function into a global affine term refined by a low-rank quadratic perturbation, reducing the inference complexity from $\mathcal{O}(Cd^2)$ to $\mathcal{O}(d^2 + Crd)$, where $C$ is the number of classes, $d$ the feature dimension, and $r \ll d$ the subspace rank. To mitigate representation drift caused by backbone updates, we further introduce the Hopfield-based Distribution Compensator (HopDC), a training-free mechanism that uses modern continuous Hopfield Networks to recalibrate historical class statistics through associative memory dynamics on unlabeled anchors, accompanied by a theoretical bound on the estimation error. Extensive experiments on diverse CIL benchmarks demonstrate that our framework achieves state-of-the-art performance, providing a scalable solution for large-scale class-incremental learning with ViTs. Code: https://github.com/raoxuan98-hash/lr_rgda_hopdc.
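
The role of the Woodbury matrix identity can be seen in a minimal sketch. It assumes, purely for illustration, a per-class covariance of the form $\Sigma_c = \lambda I + U_c U_c^\top$ with $U_c$ of shape $d \times r$; the paper's actual factorization, and the names `solve`, `mahalanobis_direct`, and `mahalanobis_woodbury`, are not taken from the source. Under that assumption, $x^\top \Sigma_c^{-1} x = \frac{1}{\lambda}\left(\lVert x \rVert^2 - z^\top (\lambda I_r + U_c^\top U_c)^{-1} z\right)$ with $z = U_c^\top x$, so the per-class work is a rank-$r$ projection plus a small $r \times r$ solve.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mahalanobis_direct(x, lam, U):
    """x^T (lam*I + U U^T)^{-1} x using a full d x d solve."""
    d, r = len(x), len(U[0])
    sigma = [[(lam if i == j else 0.0) + sum(U[i][k] * U[j][k] for k in range(r))
              for j in range(d)] for i in range(d)]
    return dot(x, solve(sigma, x))

def mahalanobis_woodbury(x, lam, U):
    """Same quantity via the Woodbury identity; only an r x r system is solved."""
    d, r = len(x), len(U[0])
    z = [sum(U[i][k] * x[i] for i in range(d)) for k in range(r)]        # U^T x
    small = [[(lam if a == b else 0.0) + sum(U[i][a] * U[i][b] for i in range(d))
              for b in range(r)] for a in range(r)]                       # lam*I_r + U^T U
    return (dot(x, x) - dot(z, solve(small, z))) / lam

# Toy example with d = 4 and r = 2 (illustrative numbers)
U = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0], [1.5, -1.0]]
x = [1.0, 2.0, -1.0, 0.5]
direct = mahalanobis_direct(x, 0.5, U)
fast = mahalanobis_woodbury(x, 0.5, U)
```

In this sketch the term shared by all classes is just $\lVert x \rVert^2 / \lambda$; with a general shared matrix it would be computed once per sample, and the small $r \times r$ matrix could be inverted ahead of time for each class.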


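[91] Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields cs.CV | cs.AI

Shiqian Li, Ruihong Shen, Junfeng Ni, Chang Pan, Chi Zhang

TL;DR: This paper proposes Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, running two orders of magnitude faster than prior Gaussian simulators.

Details

Motivation: Existing video generation models achieve high visual quality but do not model physical laws and therefore cannot guarantee physical plausibility, while methods that combine 3D Gaussian splatting with physics engines are computationally expensive and lack robustness in complex real-world scenes.

Result: Evaluations on synthetic and real 3D scenes show that NGFF generalizes well and is robust in physical reasoning, advancing video prediction toward physics-grounded world models.

Insight: The novelty lies in integrating 3D Gaussian perception and physical dynamics modeling into one end-to-end framework, which markedly improves generation speed and physical plausibility. The authors also build GSCollision, a large-scale 4D Gaussian dataset with diverse materials, multi-object interactions, and complex scenes, to support training.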

Abstract: Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos because they do not model physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, running two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF’s strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.


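[92] SDCM: Simulated Densifying and Compensatory Modeling Fusion for Radar-Vision 3-D Object Detection in Internet of Vehicles cs.CV | cs.RO | eess.IV

Shucong Li, Xiaoluo Zhou, Yuqian He, Zhenyu Liu

TL;DR: This paper proposes SDCM, a radar-vision fusion framework for 3D object detection in the Internet of Vehicles. It targets the sparsity of 4D radar point clouds and the degradation of visual representations in low-light, long-range, and densely occluded scenes. The framework comprises a Simulated Densifying module, a Radar Compensatory Mapping module, and a Mamba Modeling Interactive Fusion module, which generate dense radar point clouds, compensate for visual degradation, and perform interactive fusion of the heterogeneous modalities.

Details

Motivation: Sparse 4D radar point clouds give poor 3D representations, and visual data provide unreliable texture information under low light, at long range, and under dense occlusion. The paper aims to improve the robustness and accuracy of radar-vision fusion 3D object detection in the Internet of Vehicles.

Result: On the VoD, TJ4DRadSet, and Astyx HiRes 2019 datasets, SDCM achieves the best performance with fewer parameters and faster inference.

Insight: The contributions include a simulated densifying method based on 3D kernel density estimation and curvature simulation that generates dense radar point clouds, a compensatory mapping module that uses the all-weather property of radar to alleviate visual degradation, and Mamba modeling of feature tensor difference values to reduce heterogeneity and promote interaction between modalities. These methods offer new ideas for augmenting sparse sensor data and for cross-modal compensatory fusion.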

Abstract: 3-D object detection based on 4-D radar-vision fusion is an important part of the Internet of Vehicles (IoV). However, it faces two challenges. First, 4-D radar point clouds are sparse, leading to poor 3-D representation. Second, vision data exhibit representation degradation in low-light, long-distance, and densely occluded scenes, which provides unreliable texture information during the fusion stage. To address these issues, we propose a framework named SDCM, which performs Simulated Densifying and Compensatory Modeling Fusion for radar-vision 3-D object detection in IoV. First, a Simulated Densifying (SimDen) module generates dense radar point clouds through point generation based on Gaussian simulation of key points obtained from 3-D Kernel Density Estimation (3-D KDE) and outline generation based on curvature simulation. Second, because the all-weather property of 4-D radar lets radar data provide more real-time information than vision data, a Radar Compensatory Mapping (RCM) module is designed to reduce the effects of representation degradation in the vision data. Third, because feature tensor difference values contain the effective information of every modality, which can be extracted and modeled to reduce heterogeneity and enable interaction between modalities, a Mamba Modeling Interactive Fusion (MMIF) module is designed to reduce heterogeneity and achieve interactive fusion. Experimental results on the VoD, TJ4DRadSet, and Astyx HiRes 2019 datasets show that SDCM achieves the best performance with fewer parameters and faster inference. Our code will be available.


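[93] See Without Decoding: Motion-Vector-Based Tracking in Compressed Video cs.CV | cs.AI | eess.IV

Axel Duché, Clément Chatelain, Gilles Gasso

TL;DR: This paper proposes a lightweight compressed-domain tracking model that operates directly on video streams without fully decoding RGB video. Using motion vectors and transform coefficients from the compressed data, the model propagates object bounding boxes across frames, achieving a computational speed-up of up to 3.7x on the MOTS15/17/20 datasets with only a 4% drop in mAP@0.5 relative to the RGB baseline.

Details

Motivation: Fully decoding RGB video imposes a high computational overhead on real-time video analytics in large monitoring systems, so the paper aims to track objects efficiently and directly in the compressed domain.

Result: On the MOTS15/17/20 datasets, the method is up to 3.7x faster than the RGB baseline, with mAP@0.5 dropping by only 4%.

Insight: The novelty is tracking directly from the motion vectors and transform coefficients in the compressed video stream, which avoids full RGB decoding and markedly improves computational efficiency. Viewed objectively, the method reuses motion information already present in the video encoding and offers an efficient compressed-domain paradigm for real-time video analytics.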

Abstract: We propose a lightweight compressed-domain tracking model that operates directly on video streams, without requiring full RGB video decoding. Using motion vectors and transform coefficients from compressed data, our deep model propagates object bounding boxes across frames, achieving a computational speed-up of up to 3.7x with only a slight 4% mAP@0.5 drop versus the RGB baseline on the MOTS15/17/20 datasets. These results highlight the efficiency of codec-domain motion modeling for real-time analytics in large monitoring systems.
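
The idea of propagating boxes from codec motion vectors can be sketched with a simple non-learned baseline. This is not the paper's deep model: `propagate_box`, the `(cx, cy, dx, dy)` tuple layout, and the assumption that each displacement points from the previous frame to the current one are all hypothetical choices for illustration (real codecs store vectors relative to reference frames, and the sign convention varies).

```python
def propagate_box(box, motion_vectors):
    """Shift a box by the mean motion of the blocks whose centers fall inside it.

    box: (x1, y1, x2, y2) in pixels.
    motion_vectors: iterable of (cx, cy, dx, dy), a block center and its
    displacement from the previous frame to the current one (assumed convention).
    """
    x1, y1, x2, y2 = box
    inside = [(dx, dy) for cx, cy, dx, dy in motion_vectors
              if x1 <= cx <= x2 and y1 <= cy <= y2]
    if not inside:                       # no motion information: keep the box
        return box
    mean_dx = sum(dx for dx, _ in inside) / len(inside)
    mean_dy = sum(dy for _, dy in inside) / len(inside)
    return (x1 + mean_dx, y1 + mean_dy, x2 + mean_dx, y2 + mean_dy)

# Four blocks on the object and one background block outside the box
vectors = [(20, 20, 4.0, 0.0), (36, 20, 4.0, 2.0), (20, 36, 4.0, 2.0),
           (36, 36, 4.0, 0.0), (100, 100, -8.0, -8.0)]
new_box = propagate_box((10, 10, 50, 50), vectors)   # (14.0, 11.0, 54.0, 51.0)
```

Because only the vectors already present in the bitstream are read, no pixel data is reconstructed for the intermediate frames. A learned model can additionally correct for scale change and noisy vectors, which this averaging step ignores.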


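[94] Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders cs.CV | q-bio.NC

Laura Cif, Diane Demailly, Gabriella A. Horvàth, Juan Dario Ortigoza Escobar, Nathalie Dorison

TL;DR: This paper proposes a pose-estimation-based deep learning framework for automatically recognizing and distinguishing multiple co-occurring movement disorders (such as dystonia, tremor, and chorea) from routine clinical videos. The framework converts videos into time series of anatomical keypoints and extracts kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features, enabling objective and scalable analysis of overlapping movement disorder phenotypes.

Details

Motivation: Movement disorders such as dystonia and tremor are hard to recognize objectively and to monitor over time because their symptoms fluctuate, occur intermittently, and frequently co-occur. Existing practice relies mainly on subjective assessment, is vulnerable to inter-rater variability, and lacks objective, scalable ways to distinguish these disorders from routine clinical videos.

Result: The abstract reports no specific quantitative results or benchmarks, but it implies that the framework can extract multidimensional movement features from video as a basis for distinguishing overlapping phenotypes.

Insight: The novelty lies in combining pose estimation with multi-label recognition, using keypoint time series and comprehensive kinematic features (including higher-order irregularity-complexity descriptors) to quantify movement disorders objectively. The approach could transfer to fine-grained discrimination of movement phenotypes in other video analysis tasks.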

Abstract: Hyperkinetic movement disorders (HMDs) such as dystonia, tremor, chorea, myoclonus, and tics are disabling motor manifestations across childhood and adulthood. Their fluctuating, intermittent, and frequently co-occurring expressions hinder clinical recognition and longitudinal monitoring, which remain largely subjective and vulnerable to inter-rater variability. Objective and scalable methods to distinguish overlapping HMD phenotypes from routine clinical videos are still lacking. Here, we developed a pose-based machine-learning framework that converts standard outpatient videos into anatomically meaningful keypoint time series and computes kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features.


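[95] YOLOE-26: Integrating YOLO26 with YOLOE for Real-Time Open-Vocabulary Instance Segmentation cs.CV

Ranjan Sapkota, Manoj Karkee

TL;DR: This paper presents YOLOE-26, a unified framework that combines the deployment-optimized YOLO26 architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLOv26, it keeps the efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition.

Details

Motivation: Existing real-time instance segmentation models are usually limited to predefined categories (closed-set recognition). The paper aims to develop a practical, scalable solution for open-vocabulary instance segmentation, that is, recognizing categories unseen during training, in dynamic real-world environments.

Result: Extensive experiments show consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings.

Insight: The main contributions are: 1) replacing fixed class logits with an object embedding head, which recasts classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary; 2) several efficient open-vocabulary inference techniques, namely Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference; 3) a unified object embedding space shared by all prompting modes, which allows seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. The framework is fully compatible with the Ultralytics ecosystem.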

Abstract: This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26(or YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLOv26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary reasoning, the framework incorporates Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. Extensive experiments demonstrate consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings. The training strategy leverages large-scale detection and grounding datasets with multi-task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE-26 provides a practical and scalable solution for real-time open-vocabulary instance segmentation in dynamic, real-world environments.
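
Classification by similarity matching against prompt embeddings can be sketched as follows. The snippet is a toy illustration, not the YOLOE-26 or Ultralytics API: `cosine`, `classify`, the three-dimensional embeddings, and the 0.5 threshold are hypothetical. It only shows why the label set can change at inference time without retraining: adding a class means adding one prompt embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(object_embedding, prompt_embeddings, threshold=0.5):
    """Return (label, score) for the best-matching prompt, or (None, score)
    when no prompt is similar enough."""
    label, score = max(((name, cosine(object_embedding, emb))
                        for name, emb in prompt_embeddings.items()),
                       key=lambda item: item[1])
    return (label if score >= threshold else None), score

# Prompt embeddings would come from a text or visual prompt encoder;
# these values are made up for the example.
prompts = {
    "forklift": [0.9, 0.1, 0.0],
    "pallet": [0.1, 0.9, 0.2],
    "helmet": [0.0, 0.2, 0.9],
}
label, score = classify([0.8, 0.2, 0.1], prompts)

# Extending the vocabulary at inference time only adds an embedding
prompts["ladder"] = [0.5, 0.5, 0.7]
unknown_label, _ = classify([-1.0, -1.0, -1.0], prompts)
```

A fixed classification head would instead need a new output unit and retraining for every added category.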


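[96] Intra-Class Subdivision for Pixel Contrastive Learning: Application to Semi-supervised Cardiac Image Segmentation cs.CV

Jiajun Zhao, Xuan Yang

TL;DR: This paper proposes an intra-class subdivision pixel contrastive learning framework for cardiac image segmentation. It introduces the concept of an "unconcerned sample" to distinguish pixel representations in the inner and boundary regions of the same class, and designs a boundary contrastive loss that makes boundary representations more discriminative, addressing representation contamination at boundaries.

Details

Motivation: In semi-supervised cardiac image segmentation, pixel representations in boundary regions are easily contaminated. The paper aims to improve segmentation quality and boundary precision.

Result: Experiments on public cardiac datasets show that the method significantly outperforms existing methods in segmentation quality and boundary precision.

Insight: The paper introduces the "unconcerned sample" concept to refine intra-class representations and combines it with a boundary contrastive loss to sharpen boundary discrimination. Its theoretical analysis offers a new perspective on applying contrastive learning to medical image segmentation.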

Abstract: We propose an intra-class subdivision pixel contrastive learning (SPCL) framework for cardiac image segmentation to address representation contamination at boundaries. The novel concept of the "unconcerned sample" is proposed to distinguish pixel representations at the inner and boundary regions within the same class, facilitating a clearer characterization of intra-class variations. A novel boundary contrastive loss for boundary representations is proposed to enhance representation discrimination across boundaries. The advantages of the unconcerned sample and boundary contrastive loss are analyzed theoretically. Experimental results on public cardiac datasets demonstrate that SPCL significantly improves segmentation performance, outperforming existing methods with respect to segmentation quality and boundary precision. Our code is available at https://github.com/Jrstud203/SPCL.


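[97] CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning cs.CV | cs.AI

Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun

TL;DR: CamReasoner is a framework that strengthens camera movement understanding through structured spatial reasoning. It recasts the understanding of camera dynamics as an Observation-Thinking-Answer (O-T-A) reasoning process and is the first in this domain to apply reinforcement learning for logical alignment, which suppresses hallucinations and achieves state-of-the-art performance.

Details

Motivation: Existing multimodal models treat camera dynamics understanding as black-box classification and rely on superficial visual patterns rather than geometric cues, which leads them to confuse physically distinct motions. The gap between perception and cinematic logic therefore needs to be bridged.

Result: The framework achieves state-of-the-art performance on multiple benchmarks, supported by a large-scale inference trajectory suite of 18k SFT reasoning chains and 38k RL feedback samples.

Insight: The contributions include recasting camera movement understanding as a structured reasoning process, introducing the O-T-A paradigm that compels the model to decode spatio-temporal cues such as trajectories and view frustums, and applying reinforcement learning for logical alignment for the first time in this domain so that inferences are grounded in physical geometry rather than contextual guesswork.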

Abstract: Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying reinforcement learning to the O-T-A reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.


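[98] Vision-Language Model Purified Semi-Supervised Semantic Segmentation for Remote Sensing Images cs.CV | cs.AI

Shanwen Wang, Xin Sun, Danfeng Hong, Fei Zhou

TL;DR: This paper proposes SemiEarth, a new semi-supervised semantic segmentation model designed for remote sensing images. It uses a vision-language model (VLM) to purify the pseudo-labels generated by the teacher network, particularly in multi-class boundary regions, which markedly improves pseudo-label quality and in turn guides the student model's learning.

Details

Motivation: Traditional semi-supervised semantic segmentation (S4) architectures, especially the teacher-student framework, suffer from low-quality pseudo-labels. The paper aims to use the open-world capabilities of vision-language models (VLMs) to address S4 problems in the remote sensing domain.

Result: Extensive experiments on multiple remote sensing datasets show that SemiEarth achieves state-of-the-art (SOTA) performance.

Insight: The core contribution is a VLM pseudo-label purifying (VLM-PP) module that is independent of the S4 architecture. The module uses the open-world knowledge of the VLM to correct mispredicted categories in low-confidence pseudo-labels and also improves interpretability, a notable advantage over previous SOTA methods.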

Abstract: Semi-supervised semantic segmentation (S4) can learn rich visual knowledge from low-cost unlabeled images. However, traditional S4 architectures all face the challenge of low-quality pseudo-labels, especially the teacher-student framework. We propose a novel SemiEarth model that introduces vision-language models (VLMs) to address the S4 issues in the remote sensing (RS) domain. Specifically, we design a VLM pseudo-label purifying (VLM-PP) structure to purify the teacher network’s pseudo-labels, achieving substantial improvements. Especially in multi-class boundary regions of RS images, the VLM-PP module can significantly improve the quality of pseudo-labels generated by the teacher, thereby correctly guiding the student model’s learning. Moreover, since VLM-PP is equipped with VLMs that have open-world capabilities and is independent of the S4 architecture, it can correct mispredicted categories in low-confidence pseudo-labels whenever a discrepancy arises between its prediction and the pseudo-label. We conducted extensive experiments on multiple RS datasets, which demonstrate that our SemiEarth achieves SOTA performance. More importantly, unlike previous SOTA RS S4 methods, our model not only achieves excellent performance but also offers good interpretability. The code is released at https://github.com/wangshanwen001/SemiEarth.
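
The correction rule described in the abstract, replacing a low-confidence pseudo-label when the VLM disagrees with it, can be written as a short sketch. The function name `purify_pseudo_labels`, the flat per-pixel lists, and the 0.7 confidence threshold are assumptions made for illustration; the released SemiEarth code may organize this step differently.

```python
def purify_pseudo_labels(teacher_labels, teacher_conf, vlm_labels, conf_threshold=0.7):
    """Per pixel: keep the teacher's label unless it is low-confidence AND the
    VLM predicts a different class, in which case the VLM's label is used."""
    purified = []
    for t_label, conf, v_label in zip(teacher_labels, teacher_conf, vlm_labels):
        if conf < conf_threshold and v_label != t_label:
            purified.append(v_label)     # discrepancy on an uncertain pixel
        else:
            purified.append(t_label)     # confident, or both models agree
    return purified

# Five pixels along a road/building boundary (illustrative values)
teacher = ["road", "road", "building", "building", "water"]
confidence = [0.95, 0.55, 0.60, 0.90, 0.40]
vlm = ["road", "building", "building", "road", "water"]
purified = purify_pseudo_labels(teacher, confidence, vlm)
```

Only the second pixel changes: it is the one place where the teacher is uncertain and the two models disagree. The fourth pixel keeps the teacher's label because the teacher is confident there, even though the VLM disagrees.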


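[99] Interpretable Unsupervised Deformable Image Registration via Confidence-bound Multi-Hop Visual Reasoning cs.CV | cs.AI | cs.LG

Zafar Iqbal, Anwar Ul Haq, Srimannarayana Grandhi

TL;DR: This paper proposes Multi-Hop Visual Chain of Reasoning (VCoR), a new unsupervised deformable image registration framework that recasts registration as a progressive reasoning process. Using a Localized Spatial Refinement module and a Cross-Reference Attention mechanism, the framework refines the registration iteratively, handles large deformations, and produces a transparent sequence of intermediate predictions with a theoretical bound.

Details

Motivation: Existing unsupervised deep learning registration methods are accurate but lack interpretability and transparency, which leads to error drift and reduced clinical trust. The paper aims to address this problem with a registration framework that is accurate, interpretable, and reliable.

Result: Extensive evaluations on two challenging public datasets, DIR-Lab 4D CT (lung) and IXI T1-weighted MRI (brain), show that VCoR maintains competitive registration accuracy while providing rich intermediate visualizations and confidence measures.

Insight: The main novelty is recasting registration as a multi-hop visual reasoning process that mirrors the iterative nature of clinical decision-making, and estimating uncertainty from the stability and convergence of deformation fields across hops, which provides built-in interpretability. The progressive refinement strategy and the theoretical bound offer a new, reliable approach to large deformations of complex anatomical structures.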

Abstract: Unsupervised deformable image registration requires aligning complex anatomical structures without reference labels, making interpretability and reliability critical. Existing deep learning methods achieve considerable accuracy but often lack transparency, leading to error drift and reduced clinical trust. We propose a novel Multi-Hop Visual Chain of Reasoning (VCoR) framework that reformulates registration as a progressive reasoning process. Inspired by the iterative nature of clinical decision-making, each visual reasoning hop integrates a Localized Spatial Refinement (LSR) module to enrich feature representations and a Cross-Reference Attention (CRA) mechanism that leads the iterative refinement process, preserving anatomical consistency. This multi-hop strategy enables robust handling of large deformations and produces a transparent sequence of intermediate predictions with a theoretical bound. Beyond accuracy, our framework offers built-in interpretability by estimating uncertainty via the stability and convergence of deformation fields across hops. Extensive evaluations on two challenging public datasets, DIR-Lab 4D CT (lung) and IXI T1-weighted MRI (brain), demonstrate that VCoR achieves competitive registration accuracy while offering rich intermediate visualizations and confidence measures. By embedding an implicit visual reasoning paradigm, we present an interpretable, reliable, and clinically viable unsupervised medical image registration.


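[100] A Geometric Multimodal Foundation Model Integrating Bp-MRI and Clinical Reports in Prostate Cancer Classification cs.CV | cs.AI

Juan A. Olmos, Antoine Manzanera, Fabio Martínez

TL;DR: This paper proposes MFM-Geom, a geometric multimodal foundation model that integrates bi-parametric MRI (bp-MRI) and clinical reports to improve prostate cancer classification. The model uses symmetric positive definite matrices and Riemannian deep learning to fuse imaging and text representations. With only 10% of the training data it outperforms a baseline based on class token embeddings, and its generalization is validated on an external dataset.

Details

Motivation: Prostate cancer diagnosis relies on bp-MRI and clinical variables, but existing computer-aided diagnosis methods are mostly imaging-based, overlook the clinical context, and are limited by data scarcity, so the representations they learn are not robust. The paper aims to address this through multimodal fusion and to reduce reliance on subjective expert interpretation.

Result: On the internal dataset, MFM-Geom reaches an AUC-PR of 90.67%, 8.3% higher than the baseline. It generalizes well to the external dataset with an AUC-PR of 90.6%, which confirms its robustness.

Insight: The novelty lies in bringing symmetric positive definite matrices and Riemannian deep learning to multimodal representation fusion, which effectively integrates imaging and clinical text information. Viewed objectively, handling multimodal data in a geometric framework may improve the discriminability and generalization of the representations, particularly when data are limited.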

Abstract: Prostate cancer (PCa) is one of the most common cancers in men worldwide. Bi-parametric MRI (bp-MRI) and clinical variables are crucial for identifying PCa and improving treatment decisions. However, this process is subject to expert interpretation. Furthermore, most existing computer-aided diagnosis methods focus on imaging-based models, overlooking the clinical context and suffering from data scarcity, which limits their ability to learn robust representations. We propose a geometric multimodal Foundation Model (FM), named MFM-Geom, that learns representations from bp-MRI and clinical reports, encoding visual findings and information from the context of clinical variables. In the representation classification head, the approach leverages symmetric positive definite (SPD) matrices and Riemannian deep learning to integrate imaging-text representations from a biomedical multimodal FM. Using 10% of the training data, MFM-Geom outperformed baseline class token embedding-based classification (+8.3%, AUC-PR of 90.67). Generalization on an external dataset confirmed the robustness of fine-tuning the biomedical FM, achieving an AUC-PR of 90.6.


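[101] CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models cs.CV | cs.LG

Samyak Jha, Junho Kim

TL;DR: This paper proposes CAPA, a framework that improves the inference efficiency of large vision-language models through visual token pruning based on attention contribution and linear approximation of FFNs.

Details

Motivation: Processing large numbers of visual tokens in large vision-language models is computationally expensive, and existing token importance estimates based on attention scores are not accurate enough.

Result: Across multiple benchmarks, CAPA achieves a balance between efficiency and performance and shows improved robustness.

Insight: The contributions are Attention Contribution as a more accurate token selection criterion, the finding that among visual attention sinks the Probability Dumps can be safely pruned while the Structural Anchors must be kept, and the finding that the FFNs associated with visual tokens are redundant and can be simplified with linear approximations.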

Abstract: Efficient inference in Large Vision-Language Models is constrained by the high cost of processing thousands of visual tokens, yet it remains unclear which tokens and computations can be safely removed. While attention scores are commonly used to estimate visual token importance, they are an imperfect proxy for actual contribution. We show that Attention Contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. Our empirical analysis reveals that visual attention sinks are functionally heterogeneous, comprising Probability Dumps with low contribution that can be safely pruned, and Structural Anchors with high contribution essential for maintaining model performance. Further, we identify substantial redundancy in Feed-Forward Networks (FFNs) associated with visual tokens, particularly in intermediate layers where image tokens exhibit linear behavior. Based on our findings, we introduce CAPA (Contribution-Aware Pruning and FFN Approximation), a dual-strategy framework that prunes visual tokens using attention contribution at critical functional transitions and reduces FFN computation through efficient linear approximations. Experiments on various benchmarks across baselines show that CAPA achieves competent efficiency–performance trade-offs with improved robustness.
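
A minimal sketch of the Attention Contribution criterion follows. It assumes one attention head and scores each visual token as its mean attention probability over the queries multiplied by the L2 norm of its value vector; the aggregation over queries, and the names `attention_contribution` and `keep_top_k`, are assumptions for illustration and not the paper's exact definition.

```python
import math

def attention_contribution(attn, values):
    """Score each visual token j as (mean_i attn[i][j]) * ||values[j]||.

    attn[i][j]: attention probability that query i assigns to visual token j.
    values[j]: value vector of visual token j.
    """
    n_queries = len(attn)
    scores = []
    for j, v in enumerate(values):
        mean_prob = sum(attn[i][j] for i in range(n_queries)) / n_queries
        scores.append(mean_prob * math.sqrt(sum(x * x for x in v)))
    return scores

def keep_top_k(scores, k):
    """Indices of the k highest-scoring tokens, in their original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Token 0 receives most of the attention but has a near-zero value vector,
# which is how the abstract characterizes a "Probability Dump".
attn = [[0.70, 0.20, 0.10],
        [0.60, 0.25, 0.15]]
values = [[0.01, 0.02], [1.0, 2.0], [3.0, 0.5]]
scores = attention_contribution(attn, values)
kept = keep_top_k(scores, 2)
by_attention_only = keep_top_k([sum(col) for col in zip(*attn)], 2)
```

Ranking by attention alone keeps token 0, while the contribution score drops it, because a token that receives attention but carries a tiny value vector adds almost nothing to the attention output.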


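[102] Subspace Clustering on Incomplete Data with Self-Supervised Contrastive Learning cs.CV | cs.AI

Huanran Li, Daniel Pimentel-Alarcón

TL;DR: This paper proposes Contrastive Subspace Clustering (CSC), a self-supervised contrastive learning framework for subspace clustering of incomplete data with missing entries. The method generates masked views of partially observed inputs, trains a deep neural network with a SimCLR-style contrastive loss to learn invariant embeddings, and then groups the embeddings with sparse subspace clustering.

Details

Motivation: Most existing subspace clustering methods assume fully observed data, which limits their effectiveness in real-world scenarios with missing data. The paper aims to solve the clustering problem for incomplete data.

Result: Experiments on six benchmark datasets show that CSC consistently outperforms both classical and deep learning baselines, is robust to missing data, and scales well to large datasets.

Insight: The novelty is combining self-supervised contrastive learning with subspace clustering, handling incomplete data by generating masked views and learning invariant representations, which provides an effective solution for clustering with missing data.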

Abstract: Subspace clustering aims to group data points that lie in a union of low-dimensional subspaces and finds wide application in computer vision, hyperspectral imaging, and recommendation systems. However, most existing methods assume fully observed data, limiting their effectiveness in real-world scenarios with missing entries. In this paper, we propose a contrastive self-supervised framework, Contrastive Subspace Clustering (CSC), designed for clustering incomplete data. CSC generates masked views of partially observed inputs and trains a deep neural network using a SimCLR-style contrastive loss to learn invariant embeddings. These embeddings are then clustered using sparse subspace clustering. Experiments on six benchmark datasets show that CSC consistently outperforms both classical and deep learning baselines, demonstrating strong robustness to missing data and scalability to large datasets.
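
The two ingredients named in the abstract, masked views and a SimCLR-style contrastive loss, can be sketched in plain Python. This is an illustration under assumptions, not the CSC implementation: `masked_view` zero-fills hidden entries, `nt_xent` is the standard normalized temperature-scaled cross-entropy loss from SimCLR, the temperature 0.5 is arbitrary, and the encoder network is omitted (the loss is applied to hand-written embeddings).

```python
import math

def masked_view(x, observed, drop):
    """Zero-fill missing entries and additionally hide the observed indices in `drop`."""
    return [xi if (obs and i not in drop) else 0.0
            for i, (xi, obs) in enumerate(zip(x, observed))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss: z1[i] and z2[i] embed two views of the same sample."""
    z = z1 + z2
    n = len(z1)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)                      # index of the positive partner
        denom = sum(math.exp(cosine(z[i], z[k]) / tau)
                    for k in range(2 * n) if k != i)
        total += -math.log(math.exp(cosine(z[i], z[pos]) / tau) / denom)
    return total / (2 * n)

# A sample whose second entry is missing; each view hides one more observed entry
x = [1.0, 2.0, 3.0, 4.0]
observed = [True, False, True, True]
view_a = masked_view(x, observed, drop={2})
view_b = masked_view(x, observed, drop={3})

# The loss is low when the two views of each sample agree and high when they do not
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Training an encoder to minimize this loss pushes embeddings of differently masked views of the same point together, which is the invariance that the later clustering step relies on.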


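[103] World-Shaper: A Unified Framework for 360° Panoramic Editing cs.CV

Dong Liang, Yuhao Liu, Jinyuan Jia, Youjun Zhao, Rynson W. H. Lau

TL;DR: This paper proposes World-Shaper, a unified geometry-aware framework for editing 360° panoramic images. It reformulates panoramic editing directly in the equirectangular projection (ERP) domain, adopts a generate-then-edit paradigm in which controllable panoramic generation synthesizes diverse paired data for supervised editing learning, and introduces a geometry-aware learning strategy to deal with geometric distortion.

Details

Motivation: Existing perspective-based image editing methods cannot model the spatial structure of panoramic images, and conventional cube-map decompositions break global consistency because they do not match spherical geometry. A panoramic editing method is therefore needed that works directly in the ERP domain and preserves geometric consistency.

Result: Extensive experiments on PEBench, a benchmark newly built by the authors, show that the method outperforms state-of-the-art (SOTA) methods in geometric consistency, editing fidelity, and text controllability.

Insight: The contributions are: 1) a unified framework that edits directly in the ERP domain, avoiding the inconsistency introduced by cube-map decomposition; 2) a generate-then-edit paradigm that addresses the scarcity of paired data; 3) a geometry-aware learning strategy that combines explicit position-aware shape supervision with progressive training that implicitly internalizes panoramic priors. Together these provide unified editing control for creating globally consistent 360° visual worlds.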

Abstract: Being able to edit panoramic images is crucial for creating realistic 360° visual experiences. However, existing perspective-based image editing methods fail to model the spatial structure of panoramas. Conventional cube-map decompositions attempt to overcome this problem but inevitably break global consistency due to their mismatch with spherical geometry. Motivated by this insight, we reformulate panoramic editing directly in the equirectangular projection (ERP) domain and present World-Shaper, a unified geometry-aware framework that bridges panoramic generation and editing within a single editing-centric design. To overcome the scarcity of paired data, we adopt a generate-then-edit paradigm, where controllable panoramic generation serves as an auxiliary stage to synthesize diverse paired examples for supervised editing learning. To address geometric distortion, we introduce a geometry-aware learning strategy that explicitly enforces position-aware shape supervision and implicitly internalizes panoramic priors through progressive training. Extensive experiments on our new benchmark, PEBench, demonstrate that our method achieves superior geometric consistency, editing fidelity, and text controllability compared to SOTA methods, enabling coherent and flexible 360° visual world creation with unified editing control. Code, model, and data will be released at our project page: https://world-shaper-project.github.io/
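
The geometric distortion that an ERP-domain method has to handle can be made concrete with a short sketch. It does not reproduce any part of World-Shaper; `erp_pixel_to_sphere` and `horizontal_stretch` are hypothetical helpers, and the pixel-center convention (top row nearest the north pole, longitude zero at the image center) is an assumption.

```python
import math

def erp_pixel_to_sphere(u, v, width, height):
    """Map the center of pixel (u, v) in an equirectangular image to
    (longitude, latitude) in radians. Top row is nearest the north pole."""
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return lon, lat

def horizontal_stretch(lat):
    """Factor by which ERP stretches content horizontally at a latitude
    (1 at the equator, growing without bound toward the poles)."""
    return 1.0 / math.cos(lat)

width, height = 2048, 1024
corner_lon, corner_lat = erp_pixel_to_sphere(0, 0, width, height)
stretch_equator = horizontal_stretch(0.0)
stretch_60deg = horizontal_stretch(math.pi / 3)
```

An object at 60° latitude appears about twice as wide in the ERP image as the same object at the equator, so the same shape needs position-dependent handling, which is the kind of distortion a position-aware strategy targets.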


[104] PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories cs.CV | cs.AIPDF

Gemma Canet Tarrés, Manel Baradad, Francesc Moreno-Noguer, Yumeng Li

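TL;DR: PLACID is a video-diffusion-based framework that composes multiple object images into a single appealing multi-object composite. It uses a pretrained text-controlled image-to-video (I2V) diffusion model to preserve object consistency, identity, and background details, and is trained on synthetic sequences in which objects move smoothly to their target positions. At inference, randomly initialized objects are guided into a coherent layout.

Details

Motivation: Current generative AI falls short on studio-level multi-object compositing: it alters object details, omits or duplicates objects, gets relative sizing wrong, or presents items inconsistently. The task requires near-perfect preservation of object identity, background and color fidelity, control over layout and design elements, and a complete, appealing display, all at once.

Result: Extensive quantitative evaluations and user studies show that PLACID surpasses existing state-of-the-art methods in multi-object compositing, with better identity, background, and color preservation, fewer omitted objects, and more visually appealing results.

Insight: The main contributions are: exploiting the temporal priors of a pretrained I2V diffusion model to preserve object consistency and background details; and a data curation strategy that generates synthetic trajectories, where sequences of smoothly moving objects align with the video model's temporal priors and so enable text-guided generation of coherent layouts at inference.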

Abstract: Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near-perfect preservation of each item’s identity, (ii) precise background and color fidelity, (iii) layout and design elements control, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model’s temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and more visually appealing results.


[105] TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation cs.CV | cs.AIPDF

Ariel Shaulov, Eitan Shaar, Amit Edenzon, Lior Wolf

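TL;DR: This paper proposes TokenTrim, an inference-time token pruning method that mitigates temporal drift in autoregressive long video generation. It identifies and removes unstable latent tokens to stop error propagation, improving temporal consistency over long sequences without changing the model architecture or training procedure.

Details

Motivation: Autoregressive video generation suffers from severe temporal drift on long videos, with errors accumulating and amplifying over long sequences. The authors hypothesize that this stems mainly from inference-time error propagation rather than insufficient model capacity, specifically from the repeated reuse of corrupted latent conditioning tokens during autoregressive inference.

Result: The method significantly improves long-horizon temporal consistency. The abstract reports no specific quantitative results or benchmarks.

Insight: The contribution is a lightweight inference-time intervention that dynamically prunes unstable latent tokens to block error propagation. It requires no changes to the model architecture or training procedure and operates entirely in latent space.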

Abstract: Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture or training procedure and without leaving latent space.
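
The pruning rule described in the abstract (drop latent tokens that deviate significantly from the previous batch) can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean distance, the fixed `threshold`, the position-wise pairing of tokens, and the function names are all assumptions made for illustration, since the abstract does not specify how deviation is measured.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prune_unstable_tokens(prev_tokens, new_tokens, threshold):
    """Return indices of new tokens that stay in the conditioning context.

    A token counts as unstable (and is dropped) when it lies farther than
    `threshold` from the token at the same position in the previous batch.
    The metric and threshold are illustrative assumptions.
    """
    keep = []
    for i, (prev, new) in enumerate(zip(prev_tokens, new_tokens)):
        if l2_distance(prev, new) <= threshold:
            keep.append(i)
    return keep

# Toy latents: three tokens with two channels each.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
new = [[0.1, 0.0], [4.0, 5.0], [2.0, 2.1]]
kept = prune_unstable_tokens(prev, new, threshold=0.5)
print(kept)  # token 1 moved by a distance of 5.0, so it is pruned
```

Only the kept indices would be reused as conditioning for the next batch; the model weights and the remaining tokens are left untouched, which matches the abstract's claim that no retraining is needed.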


[106] TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs cs.CV | cs.AIPDF

Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi

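TL;DR: This paper introduces TimeBlind, a benchmark for diagnosing the compositional spatio-temporal understanding of video multimodal large language models (MLLMs). Using carefully designed video pairs (identical static content, differing only in temporal structure) and complementary questions, it evaluates fine-grained temporal reasoning about atomic events, event properties, and event interdependencies. An evaluation of more than 20 frontier MLLMs finds performance far below human level, showing that models rely heavily on static visual shortcuts rather than genuine temporal logic.

Details

Motivation: Current MLLMs excel at static semantic understanding, but their grasp of temporal dynamics in video remains brittle. A dedicated benchmark is needed to diagnose and drive progress in fine-grained spatio-temporal understanding.

Result: More than 20 SOTA MLLMs (e.g., GPT-5, Gemini 3 Pro) were evaluated on TimeBlind, which contains 600 instances (2400 video-question pairs). The best-performing MLLM reaches an Instance Accuracy (correctly distinguishing both videos in a pair) of only 48.2%, far below human performance (98.2%).

Insight: The contributions are: 1) a cognitive-science-inspired taxonomy that divides fine-grained temporal understanding into three levels; 2) a minimal-pairs paradigm that holds static content fixed to isolate pure temporal reasoning, with complementary questions that neutralize language priors. This gives a key diagnostic tool and research direction for evaluating and improving the compositional spatio-temporal reasoning of video LLMs.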

Abstract: Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs) reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best performing MLLM is only 48.2%, far below the human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/.
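
To make the gap between per-question accuracy and Instance Accuracy concrete, here is a small sketch. It assumes that each instance groups four video-question pairs (2400 / 600) and that an instance counts as correct only when every one of its pairs is answered correctly; the abstract says only "correctly distinguishing both videos in a pair", so the exact grouping rule and the function names are assumptions.

```python
def instance_accuracy(results):
    """Fraction of instances whose video-question pairs are ALL correct.

    `results` is a list of instances; each instance is a list of booleans,
    one per video-question pair (assumed grouping, see the text above).
    """
    if not results:
        return 0.0
    solved = sum(1 for instance in results if all(instance))
    return solved / len(results)

def question_accuracy(results):
    """Plain per-question accuracy over all video-question pairs."""
    flat = [answer for instance in results for answer in instance]
    return sum(flat) / len(flat) if flat else 0.0

# Four toy instances with four video-question pairs each.
toy = [
    [True, True, True, True],
    [True, True, False, True],   # one miss fails the whole instance
    [True, True, True, True],
    [False, False, False, False],
]
print(instance_accuracy(toy), question_accuracy(toy))
```

Under this scoring a model that guesses from static content alone tends to get one video of each pair right and the other wrong, which keeps per-question accuracy moderate while driving instance accuracy down.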


[107] Computer Vision and Its Relationship to Cognitive Science: A perspective from Bayes Decision Theory cs.CVPDF

Alan Yuille, Daniel Kersten

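TL;DR: From the perspective of Bayes Decision Theory (BDT), this paper introduces computer vision and its relationship to cognitive science, discusses how the BDT framework unifies Bayesian and deep neural network approaches, and points out the limitations of BDT along with directions for combining the two.

Details

Motivation: The paper uses Bayes Decision Theory as a theoretical lens to organize the core concepts of computer vision and explore their connection to cognitive science, while analyzing the strengths and weaknesses of Bayesian and deep neural network approaches and their potential for integration.

Result: The paper reports no quantitative experiments or benchmarks; it focuses on building a theoretical framework and on comparative analysis.

Insight: The contribution is the use of Bayes Decision Theory as a unifying framework that relates the Bayesian approach (which resonates with cognitive science) to the deep neural network approach (motivated by the hierarchical structure of the visual ventral stream), and the argument that moving beyond the limitations of BDT to combine the two is the way forward.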

Abstract: This document presents an introduction to computer vision, and its relationship to Cognitive Science, from the perspective of Bayes Decision Theory (Berger 1985). Computer vision is a vast and complex field, so this overview has a narrow scope and provides a theoretical lens which captures many key concepts. BDT is rich enough to include two different approaches: (i) the Bayesian viewpoint, which gives a conceptually attractive framework for vision with concepts that resonate with Cognitive Science (Griffiths et al., 2024), and (ii) the Deep Neural Network approach whose successes in the real world have made Computer Vision into a trillion-dollar industry and which is motivated by the hierarchical structure of the visual ventral stream. The BDT framework relates and captures the strengths and weaknesses of these two approaches and, by discussing the limitations of BDT, points the way to how they can be combined in a richer framework.


[108] LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification cs.CV | cs.AIPDF

Rory Driscoll, Alexandros Christoforos, Chadbourne Davis

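TL;DR: The paper introduces LogicGaze, a new benchmark framework for rigorously evaluating the causal consistency of vision-language models in visual narratives. Built from ShareGPT4Video and Flickr30k, the benchmark pairs causal sequences with visually contradictory perturbations, forcing models to verify the authenticity of each reasoning step and exposing significant weaknesses in how current state-of-the-art models ground their reasoning in visual evidence.

Details

Motivation: Sequential reasoning improves the ability of vision-language models to carry out complex multimodal tasks, but whether their reasoning chains are reliably grounded in visual evidence has not been sufficiently explored. The paper targets hallucination in visual narratives and evaluates causal consistency.

Result: Under LogicGaze's tripartite evaluation protocol (Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection), state-of-the-art vision-language models, including Qwen2.5-VL-72B, show significant vulnerabilities.

Insight: The contribution is a benchmark dedicated to causal consistency and hallucination that integrates visually contradictory yet linguistically plausible perturbations to force models to verify each step. It offers an evaluation tool and direction for developing more reliable and trustworthy multimodal reasoning models.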

Abstract: While sequential reasoning enhances the capability of Vision-Language Models (VLMs) to execute complex multimodal tasks, their reliability in grounding these reasoning chains within actual visual evidence remains insufficiently explored. We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether VLMs can validate sequential causal chains against visual inputs, specifically targeting the pervasive issue of hallucination. Curated from 40,000 video segments from ShareGPT4Video and a subset of Flickr30k imagery, LogicGaze integrates causal sequences with visually contradictory yet linguistically plausible perturbations, compelling models to verify the authenticity of each reasoning step. Our tripartite evaluation protocol - Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection - exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.


[109] On the Assessment of Sensitivity of Autonomous Vehicle Perception cs.CVPDF

Apostol Vassilev, Munawar Hasan, Edward Griffor, Honglan Jin, Pavel Piliptchak

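TL;DR: This paper proposes a method for assessing the robustness of autonomous vehicle perception systems. It evaluates perception performance through predictive sensitivity quantification based on an ensemble of models, capturing model disagreement and inference variability. The study covers adverse driving scenarios in both simulated and real-world settings and proposes a notional architecture for assessing perception performance.

Details

Motivation: The viability of automated driving depends heavily on perception systems delivering accurate, reliable, real-time information for robust decision-making and maneuvers. These systems must work not only under ideal conditions but also when challenged by natural and adversarial driving factors, which can cause perception errors and delays in detection and classification. Assessing the robustness of autonomous vehicle perception and exploring ways to make it more reliable is therefore essential.

Result: The experiments use five state-of-the-art computer vision models (including YOLO v8-v9, DETR50, DETR101, and RT-DETR). A perception assessment criterion is developed from an autonomous vehicle's stopping distance at a stop sign on different road surfaces (such as dry and wet asphalt) and at different speeds. Diminished lighting (for example from fog or low sun altitude) has the greatest impact on perception model performance; adversarial road conditions (such as occlusion of roadway objects) increase perception sensitivity; and performance drops when adversarial road conditions and inclement weather combine. In addition, the greater the distance to a roadway object, the greater the impact on perception performance, and hence the lower the perception robustness.

Insight: The contributions are a predictive sensitivity quantification method based on a model ensemble for assessing perception robustness, and a perception assessment criterion based on stopping distance. Viewed objectively, by combining several SOTA models with simulated and real-world scenarios, the approach offers a systematic framework for evaluating the reliability of autonomous driving perception and helps identify key performance bottlenecks.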

Abstract: The viability of automated driving is heavily dependent on the performance of perception systems to provide real-time accurate and reliable information for robust decision-making and maneuvers. These systems must perform reliably not only under ideal conditions, but also when challenged by natural and adversarial driving factors. Both of these types of interference can lead to perception errors and delays in detection and classification. Hence, it is essential to assess the robustness of the perception systems of automated vehicles (AVs) and explore strategies for making perception more reliable. We approach this problem by evaluating perception performance using predictive sensitivity quantification based on an ensemble of models, capturing model disagreement and inference variability across multiple models, under adverse driving scenarios in both simulated environments and real-world conditions. A notional architecture for assessing perception performance is proposed. A perception assessment criterion is developed based on an AV’s stopping distance at a stop sign on varying road surfaces, such as dry and wet asphalt, and vehicle speed. Five state-of-the-art computer vision models are used in our experiments, including YOLO (v8-v9), DEtection TRansformer (DETR50, DETR101), and Real-Time DEtection TRansformer (RT-DETR). Diminished lighting conditions, e.g., resulting from the presence of fog and low sun altitude, have the greatest impact on the performance of the perception models. Additionally, adversarial road conditions such as occlusions of roadway objects increase perception sensitivity, and model performance drops when faced with a combination of adversarial road conditions and inclement weather conditions. Also, it is demonstrated that the greater the distance to a roadway object, the greater the impact on perception performance, and hence the lower the perception robustness.
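
A stopping-distance criterion and an ensemble disagreement score of the kind the abstract describes might be sketched as follows. The abstract does not give the paper's formulas, so this uses the textbook model (reaction distance plus braking distance v^2 / (2 * mu * g)) and the standard deviation of per-model confidences as a disagreement measure; the friction coefficients, the reaction time, and all function names are illustrative assumptions.

```python
import statistics

G = 9.81  # gravitational acceleration in m/s^2

def stopping_distance(speed_mps, friction, reaction_time_s=1.0):
    """Reaction distance plus braking distance v^2 / (2 * mu * g), in metres."""
    return speed_mps * reaction_time_s + speed_mps ** 2 / (2 * friction * G)

def detected_in_time(detection_distance_m, speed_mps, friction):
    """True if the stop sign was detected before the last point to brake safely."""
    return detection_distance_m >= stopping_distance(speed_mps, friction)

def ensemble_disagreement(confidences):
    """Population standard deviation of per-model detection confidences."""
    return statistics.pstdev(confidences)

# Assumed friction coefficients: roughly 0.7 for dry and 0.4 for wet asphalt.
dry = stopping_distance(15.0, friction=0.7)
wet = stopping_distance(15.0, friction=0.4)
print(round(dry, 1), round(wet, 1))

# Five hypothetical model confidences for the same stop sign in fog.
print(ensemble_disagreement([0.91, 0.40, 0.85, 0.12, 0.77]))
```

The same detection distance can pass the criterion on dry asphalt and fail it on wet asphalt, which is why the road surface and vehicle speed enter the assessment alongside the detector output.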


[110] Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception cs.CV | cs.AIPDF

Alexandros Christoforos, Sarah Jenkins, Michael Brown, Tuan Pham, David Chen

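TL;DR: This paper proposes SynerNet, a Synergistic Neural Agents Network framework that aims to mitigate the degradation of cross-modal alignment in vision-language models when they encounter out-of-distribution (OOD) concepts. Four specialized computational units (visual perception, linguistic context, nominal embedding, and global coordination) cooperate through a structured message-propagation protocol to correct modality disparities. On the VISTA-Beyond benchmark, SynerNet improves performance in both few-shot and zero-shot settings.

Details

Motivation: The paper addresses the drop in cross-modal alignment that vision-language models suffer on out-of-distribution concepts, the so-called "semantic chasm".

Result: On the VISTA-Beyond benchmark, precision improves by 1.2% to 5.4% in few-shot and zero-shot settings, indicating that the method improves performance on OOD perception tasks.

Insight: The contributions are: 1) a multi-agent latent space nomenclature acquisition framework; 2) a semantic context-interchange algorithm for better few-shot adaptation; 3) an adaptive dynamic equilibrium mechanism. These mechanisms work together through structured message propagation to bridge modality disparities, offering a systematic approach to OOD perception.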

Abstract: This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration in Vision-Language Models (VLMs) when encountering Out-of-Distribution (OOD) concepts. Specifically, four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities via a structured message-propagation protocol. The principal contributions encompass a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm for enhanced few-shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.


[111] When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs cs.CV | cs.AI | cs.CLPDF

Beidi Zhao, Wenlong Deng, Xinting Liao, Yushu Li, Nazim Shaikh

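TL;DR: This paper identifies a new failure mode that arises when retrieval-augmented generation (RAG) is used to enhance large vision-language models (LVLMs): Attention Distraction (AD). When the retrieved text is sufficiently relevant, it globally suppresses the model's visual attention to the image, so the model gets wrong questions it could originally answer correctly. The authors propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation and uses attention mixing to preserve image-conditioned evidence, effectively mitigating the problem.

Details

Motivation: Prior work attributes RAG failures to insufficient attention to the retrieved context and proposes reducing attention on image tokens. This paper sets out to identify and address a distinct failure mode that earlier studies overlooked: Attention Distraction, where a sufficiently good retrieved context suppresses visual attention on the key image regions and degrades performance.

Result: Extensive experiments on OK-VQA, E-VQA, and InfoSeek show that MAD-RAG consistently outperforms existing baselines across model families, with absolute gains of up to 4.76%, 9.20%, and 6.18%, respectively, over the vanilla RAG baseline. Notably, MAD-RAG corrects up to 74.68% of failure cases with negligible computational overhead.

Insight: The paper's claimed contribution is the first identification and definition of the Attention Distraction failure mode in RAG, together with MAD-RAG, a training-free intervention that decouples visual and textual processing through a dual-question formulation and attention mixing. Viewed objectively, the core novelty lies in the careful diagnosis of how attention fails inside RAG and in a lightweight, generalizable intervention, which offers a new perspective for understanding and improving retrieval-augmented multimodal models.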

Abstract: While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.


[112] AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning cs.CV | cs.AIPDF

Chongyu Qu, Zhengyi Lu, Yuxiang Lai, Thomas Z. Li, Junchao Zhu

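TL;DR: This paper presents AdaFuse, a reinforcement-learning-based adaptive multimodal fusion framework for lung cancer risk prediction. It models multimodal fusion as a sequential decision process that learns, for each patient, whether to include a given modality (such as medical images, clinical records, or radiology reports) and stops early once the information is sufficient, rather than always using every modality.

Details

Motivation: Existing multimodal fusion methods either treat all available modalities equally or learn weights for them, but leave a fundamental question unanswered: for a given patient, are all modalities actually needed? The paper aims to develop a personalized fusion strategy that adaptively selects and uses modalities for each patient.

Result: Experiments on the NLST dataset show that AdaFuse achieves the highest AUC (0.762), ahead of the best single-modality baseline (0.732), the best fixed fusion strategy (0.759), and the adaptive baselines DynMM (0.754) and MoE (0.742), while using fewer FLOPs than all triple-modality methods.

Insight: The core contribution is bringing reinforcement learning into multimodal fusion by casting it as a sequential decision problem, which yields patient-specific modality selection and early termination. This marks a shift from uniform fusion strategies toward adaptive diagnostic pipelines in which the model learns when to consult additional modalities and when the existing information is enough for an accurate prediction.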

Abstract: Multimodal fusion has emerged as a promising paradigm for disease diagnosis and prognosis, integrating complementary information from heterogeneous data sources such as medical images, clinical records, and radiology reports. However, existing fusion methods process all available modalities through the network, either treating them equally or learning to assign different contribution weights, leaving a fundamental question unaddressed: for a given patient, should certain modalities be used at all? We present AdaFuse, an adaptive multimodal fusion framework that leverages reinforcement learning (RL) to learn patient-specific modality selection and fusion strategies for lung cancer risk prediction. AdaFuse formulates multimodal fusion as a sequential decision process, where the policy network iteratively decides whether to incorporate an additional modality or proceed to prediction based on the information already acquired. This sequential formulation enables the model to condition each selection on previously observed modalities and terminate early when sufficient information is available, rather than committing to a fixed subset upfront. We evaluate AdaFuse on the National Lung Screening Trial (NLST) dataset. Experimental results demonstrate that AdaFuse achieves the highest AUC (0.762) compared to the best single-modality baseline (0.732), the best fixed fusion strategy (0.759), and adaptive baselines including DynMM (0.754) and MoE (0.742), while using fewer FLOPs than all triple-modality methods. Our work demonstrates the potential of reinforcement learning for personalized multimodal fusion in medical imaging, representing a shift from uniform fusion strategies toward adaptive diagnostic pipelines that learn when to consult additional modalities and when existing information suffices for accurate prediction.
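
The sequential decision loop described in the abstract can be sketched in a few lines. This is only a structural illustration: in the paper the policy is a network trained with reinforcement learning, whereas the hand-written threshold rule, the scalar "features", the summing predictor, and all names below are stand-in assumptions.

```python
STOP = "STOP"

def adaptive_fusion(modalities, policy, predictor):
    """Acquire modalities one at a time until the policy decides to stop.

    `modalities` maps a modality name to its features. `policy` sees what has
    been acquired so far plus the remaining names and returns either a name to
    acquire next or STOP. Returns the prediction and the modalities used.
    """
    acquired = {}
    remaining = list(modalities)
    while remaining:
        action = policy(acquired, remaining)
        if action == STOP:
            break
        acquired[action] = modalities[action]
        remaining.remove(action)
    return predictor(acquired), list(acquired)

def toy_policy(acquired, remaining):
    """Stand-in for the learned policy: stop once accumulated evidence >= 0.5."""
    if sum(acquired.values()) >= 0.5:
        return STOP
    return remaining[0]

def toy_predictor(acquired):
    """Stand-in risk score: the sum of the acquired scalar features."""
    return sum(acquired.values())

clear_case = {"ct": 0.8, "clinical": 0.1, "report": 0.05}
hard_case = {"ct": 0.2, "clinical": 0.2, "report": 0.3}
print(adaptive_fusion(clear_case, toy_policy, toy_predictor))
print(adaptive_fusion(hard_case, toy_policy, toy_predictor))
```

The clear case stops after a single modality while the hard case consults all three, which is the source of the FLOP savings reported in the abstract: compute is spent only on the patients whose early evidence is inconclusive.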


[113] MASC: Metal-Aware Sampling and Correction via Reinforcement Learning for Accelerated MRI cs.CVPDF

Zhengyi Lu, Ming Lu, Chongyu Qu, Junchao Zhu, Junlin Guo

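TL;DR: MASC is a unified reinforcement learning framework that jointly optimizes metal-aware k-space sampling and artifact correction to accelerate MRI scans. It builds a paired dataset through physics-based simulation, trains a PPO agent to select phase-encoding lines under a limited acquisition budget, and trains the agent end to end with a U-Net network to improve removal of artifacts caused by metal implants.

Details

Motivation: Metal implants cause severe artifacts in MRI. Traditional approaches treat metal artifact reduction (MAR) and accelerated MRI acquisition as separate problems, whereas MASC aims to optimize the two tasks jointly.

Result: Experiments show that the sampling policies learned by MASC outperform conventional sampling strategies and that end-to-end training improves on using a frozen pre-trained MAR network. Cross-dataset experiments on FastMRI with physics-based artifact simulation further confirm generalization to realistic clinical MRI data.

Insight: The contributions are: formulating active MRI acquisition as a sequential decision problem; end-to-end joint optimization of the reinforcement learning agent and the MAR network; and using physics-based simulation to build an exactly paired training dataset that supports supervised learning.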

Abstract: Metal implants in MRI cause severe artifacts that degrade image quality and hinder clinical diagnosis. Traditional approaches address metal artifact reduction (MAR) and accelerated MRI acquisition as separate problems. We propose MASC, a unified reinforcement learning framework that jointly optimizes metal-aware k-space sampling and artifact correction for accelerated MRI. To enable supervised training, we construct a paired MRI dataset using physics-based simulation, generating k-space data and reconstructions for phantoms with and without metal implants. This paired dataset provides simulated 3D MRI scans with and without metal implants, where each metal-corrupted sample has an exactly matched clean reference, enabling direct supervision for both artifact reduction and acquisition policy learning. We formulate active MRI acquisition as a sequential decision-making problem, where an artifact-aware Proximal Policy Optimization (PPO) agent learns to select k-space phase-encoding lines under a limited acquisition budget. The agent operates on undersampled reconstructions processed through a U-Net-based MAR network, learning patterns that maximize reconstruction quality. We further propose an end-to-end training scheme where the acquisition policy learns to select k-space lines that best support artifact removal while the MAR network simultaneously adapts to the resulting undersampling patterns. Experiments demonstrate that MASC’s learned policies outperform conventional sampling strategies, and end-to-end training improves performance compared to using a frozen pre-trained MAR network, validating the benefit of joint optimization. Cross-dataset experiments on FastMRI with physics-based artifact simulation further confirm generalization to realistic clinical MRI data. The code and models of MASC have been made publicly available: https://github.com/hrlblab/masc


[114] ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models cs.CVPDF

Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda

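TL;DR: This paper proposes ReLAPSe, a reinforcement-learning-based adversarial prompt search framework for restoring concepts that were erased from unlearned diffusion models. It recasts concept restoration as a reinforcement learning problem, uses the diffusion model's noise prediction loss as a verifiable reward signal, and trains an agent to learn transferable restoration strategies that efficiently recover fine-grained identity and style information.

Details

Motivation: Existing adversarial methods that exploit the latent information leaking from unlearned diffusion models have limitations: optimization-based methods are computationally expensive because they need per-instance iterative search, while reasoning-based and heuristic methods lack direct feedback from the target model's latent visual representations. ReLAPSe addresses these issues to provide a scalable tool for rigorous red-teaming of unlearned diffusion models.

Result: ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, showing that it can effectively exploit the latent visual residue left after unlearning.

Insight: The main contribution is the shift from per-instance optimization to global policy learning. The closed-loop design based on Reinforcement Learning with Verifiable Rewards (RLVR) aligns textual prompt manipulation directly with latent visual residuals, so the agent learns transferable restoration strategies instead of optimizing isolated prompts.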

Abstract: Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search. At the same time, reasoning-based and heuristic techniques lack direct feedback from the target model’s latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model’s noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe


[115] Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset cs.CV | cs.CL | cs.LGPDF

Gabriel Bromonschenkel, Alessandro L. Koerich, Thiago M. Paixão, Hilário Tomaz Alves de Oliveira

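TL;DR: This paper studies Transformer-based vision-language models on Brazilian Portuguese image captioning. By comparing a natively annotated dataset with an automatically translated one, it evaluates how well models generalize in a cross-context setting.

Details

Motivation: Low-resource languages such as Brazilian Portuguese lack dedicated datasets and models for image captioning. The study uses a cross-native-translated evaluation to examine how automatically translated data affects model performance.

Result: Swin-DistilBERTimbau performs best in the cross-dataset evaluation and generalizes strongly. ViTucano, a VLM pre-trained on Brazilian Portuguese, beats larger multilingual models such as GPT-4o and LLaMa 3.2 Vision on traditional text-based metrics, while GPT-4 models achieve the highest CLIP-Score, indicating better image-text alignment.

Insight: The contribution is a cross-native-translated evaluation framework that, combined with attention-map analysis and the CLIP-Score metric, reveals systematic biases such as gender misclassification, object enumeration errors, and spatial inconsistencies. It serves as a reference for dataset construction and model optimization in low-resource-language image captioning.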

Abstract: Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available in: https://github.com/laicsiifes/transformer-caption-ptbr.


[116] Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure cs.CV | cs.LGPDF

Trishna Chakraborty, Udita Ghosh, Aldair Ernesto Gongora, Ruben Glatt, Yue Dong

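TL;DR: This paper proposes using vision language models (VLMs) for autonomous laboratory safety monitoring. To address the lack of visual evaluation data, the authors first build a structured data generation pipeline that converts textual laboratory scenarios into aligned (image, scene graph, ground truth) triples. Experiments show that VLMs perform well when given a textual scene graph but degrade substantially in visual-only settings, where they struggle to extract structured object relationships directly from pixels. The authors therefore propose scene-graph-guided alignment, a post-training context-engineering approach that translates visual inputs into structured scene graphs better matched to VLM reasoning, closing the perceptual gap and improving hazard detection in visual-only settings.

Details

Motivation: Minor unsafe actions in a laboratory can cause severe injuries, yet continuous safety monitoring is limited by human availability. VLMs could enable autonomous monitoring, but because most safety incidents are documented as unstructured text, visual evaluation data is lacking and the models' effectiveness in realistic settings is unclear.

Result: Seven open- and closed-source models were tested on a synthetic dataset of 1,207 samples across 362 unique scenarios. VLMs are effective when given a textual scene graph but degrade sharply in visual-only settings. The proposed scene-graph-guided alignment improves hazard detection in visual-only settings.

Insight: The contributions are: 1) a structured data generation pipeline that uses large language models as scene graph architects and image generation models as renderers to create aligned visual data from textual scenarios; 2) scene-graph-guided alignment, a post-training context-engineering approach that translates visual inputs into structured scene graphs to bridge the perceptual gap in VLMs and improve their understanding of object relationships in visual scenes.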

Abstract: Laboratories are prone to severe injuries from minor unsafe actions, yet continuous safety monitoring – beyond mandatory pre-lab safety training – is limited by human availability. Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring, but their effectiveness in realistic settings is unclear due to the lack of visual evaluation data, as most safety incidents are documented primarily as unstructured text. To address this gap, we first introduce a structured data generation pipeline that converts textual laboratory scenarios into aligned triples of (image, scene graph, ground truth), using large language models as scene graph architects and image generation models as renderers. Our experiments on the synthetic dataset of 1,207 samples across 362 unique scenarios and seven open- and closed-source models show that VLMs perform effectively given a textual scene graph, but degrade substantially in visual-only settings, indicating difficulty in extracting structured object relationships directly from pixels. To overcome this, we propose a post-training context-engineering approach, scene-graph-guided alignment, to bridge perceptual gaps in VLMs by translating visual inputs into structured scene graphs better aligned with VLM reasoning, improving hazard detection performance in visual-only settings.


[117] Text is All You Need for Vision-Language Model Jailbreaking cs.CV | cs.AI | cs.CRPDF

Yihang Chen, Zhao Xu, Youyuan Jiang, Tianle Zheng, Cho-Jui Hsieh

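TL;DR: This paper proposes Text-DJ, a new jailbreak attack that bypasses the safety safeguards of large vision-language models (LVLMs) by exploiting their optical character recognition (OCR) capability. The method decomposes a single harmful query into several semantically related but more benign sub-queries, adds many irrelevant distraction queries, and presents all of them to the model simultaneously as a grid of images, which induces the model to generate harmful content.

Details

Motivation: Existing LVLM safety safeguards mainly analyze explicit textual inputs or relevant visual scenes and overlook the possibility that the model's OCR capability can be exploited. The paper exposes this vulnerability by converting text prompts into images and presenting them in dispersed form to bypass text-based filters.

Result: The method circumvents the safety alignment of multiple state-of-the-art LVLMs in the experiments.

Insight: The contribution is the first systematic use of LVLM OCR capability for jailbreak attacks. The combination of query decomposition, irrelevant distractions, and grid-image presentation exposes the models' vulnerability to dispersed, multi-image adversarial inputs and points to the need for more robust defenses against fragmented multimodal inputs.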

Abstract: Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model’s Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple semantically related but more benign sub-queries. Second, we pick a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the sub-queries positioned in the middle of the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue this attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distractions, where the model’s safety protocols fail to link the scattered sub-queries within a high number of irrelevant queries. Overall, our findings expose a critical vulnerability in LVLMs’ OCR capabilities, which are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses for fragmented multimodal inputs.


[118] DISK: Dynamic Inference SKipping for World Models cs.CV | cs.LG | cs.ROPDF

Anugunj Naman, Gaibo Zhang, Ayushman Singh, Yaguang Zhang

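TL;DR: DISK is a training-free adaptive inference method for autoregressive world models. It coordinates two coupled diffusion transformers, one for video and one for ego-trajectory, through dual-branch controllers that make cross-modal skip decisions and preserve motion-appearance consistency. The method extends higher-order latent-difference skip testing to the autoregressive chain-of-forward regime and propagates controller statistics through rollout loops for long-horizon stability. In closed-loop driving rollouts on NuPlan and NuScenes, DISK achieves a 2x speedup on trajectory diffusion and a 1.6x speedup on video diffusion while maintaining planning error, visual quality, and navigation performance.

Details

Motivation: Autoregressive world models are computationally expensive for long-horizon video and trajectory prediction. The paper aims to cut inference cost without sacrificing performance.

Result: In closed-loop driving rollouts on 1500 NuPlan and NuScenes samples using an NVIDIA L40S GPU, DISK achieves a 2x speedup on trajectory diffusion and a 1.6x speedup on video diffusion while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores at a level comparable to the original model.

Insight: The contributions are: a training-free cross-modal skip decision mechanism that coordinates video and trajectory diffusion through dual-branch controllers; the extension of higher-order latent-difference skip testing to the autoregressive chain-of-forward regime; and the propagation of controller statistics through rollout loops for better long-horizon stability. Together they substantially reduce compute while maintaining performance.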

Abstract: We present DISK, a training-free adaptive inference method for autoregressive world models. DISK coordinates two coupled diffusion transformers for video and ego-trajectory via dual-branch controllers with cross-modal skip decisions, preserving motion-appearance consistency without retraining. We extend higher-order latent-difference skip testing to the autoregressive chain-of-forward regime and propagate controller statistics through rollout loops for long-horizon stability. When integrated into closed-loop driving rollouts on 1500 NuPlan and NuScenes samples using an NVIDIA L40S GPU, DISK achieves 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores, demonstrating practical long-horizon video-and-trajectory prediction at substantially reduced cost.


[119] LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs cs.CV | cs.AIPDF

Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott

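TL;DR: This paper proposes LatentLens, a new method that maps the latent representations of visual tokens inside a large language model (LLM) to natural language descriptions, revealing the interpretable content of visual tokens at every layer of the LLM. It compares visual token representations with contextualized text representations from a large text corpus and produces descriptions by nearest-neighbor lookup. Experiments show that LatentLens assesses the interpretability of visual tokens more accurately than existing methods such as LogitLens and finds broad interpretability across several vision-language models (VLMs).

Details

Motivation: Existing interpretability methods such as LogitLens may underestimate how interpretable visual tokens are inside an LLM. The paper also seeks to understand why LLMs can so readily process visual tokens fed in through a simple mapping (such as a shallow MLP), and thereby shed light on the alignment between vision and language representations.

Result: LatentLens was evaluated on 10 different VLMs. The majority of visual tokens are interpretable across all studied models and all layers, whereas commonly used methods such as LogitLens substantially underestimate interpretability. Qualitative analysis shows that the descriptions produced by LatentLens are semantically meaningful and give humans more fine-grained interpretations than individual tokens.

Insight: The contribution is an interpretability mapping based on contextualized representations from a large text corpus, which turns visual token representations into natural language descriptions through nearest-neighbor search. It opens a new direction for analyzing latent representations and adds evidence for vision-language alignment. Viewed objectively, the approach could be extended to representation analysis in other modalities, and its simple, efficient comparison mechanism is worth borrowing.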

Abstract: Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized token representations for each token in that corpus. Visual token representations are then compared to their contextualized textual representations, with the top-k nearest neighbor representations providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
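
The nearest-neighbor lookup described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the comparison metric and uses a tiny hand-written bank of three-dimensional vectors; the function names, the bank contents, and the choice of metric are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_descriptions(visual_vec, text_bank, k=2):
    """Return the contexts of the k stored text tokens closest to visual_vec.

    text_bank is a list of (context_string, vector) pairs; in a real setup the
    vectors would be contextualized token representations from an LLM layer.
    """
    ranked = sorted(text_bank, key=lambda item: cosine(visual_vec, item[1]), reverse=True)
    return [context for context, _ in ranked[:k]]

# Hypothetical toy bank: each entry marks the token of interest in brackets.
text_bank = [
    ("a red [apple] on the table", [0.9, 0.1, 0.0]),
    ("the [dog] barked loudly", [0.0, 1.0, 0.1]),
    ("ripe [fruit] in a bowl", [0.8, 0.2, 0.1]),
]
visual_token = [1.0, 0.1, 0.0]  # hypothetical visual token representation
print(top_k_descriptions(visual_token, text_bank, k=2))
```

A full sort is used here only for brevity; with a corpus-scale bank an approximate nearest-neighbor index would be the more practical choice.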


[120] PSGS: Text-driven Panorama Sliding Scene Generation via Gaussian Splatting cs.CV

Xin Zhang, Shen Chen, Jiale Zhou, Lei Li

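TL;DR: This paper proposes PSGS, a two-stage framework for generating high-fidelity panoramic 3D scenes from text. The first stage generates semantically coherent panoramas through a novel two-layer optimization architecture (a layout reasoning layer and a self-optimization layer). The second stage uses a panorama sliding mechanism that strategically samples overlapping perspectives to initialize globally consistent 3D Gaussian Splatting point clouds, and adds depth and semantic coherence losses to improve rendering quality.

Details

Motivation: Existing text-driven 3D scene generation methods produce overly simplistic scenes because 3D-text data is limited and multi-view stitching is inconsistent. The goal is to efficiently generate realistic 3D scenes for immersive applications such as VR, AR, and gaming.

Result: Experiments show that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes.

Insight: The contributions are: 1) a two-layer optimization architecture (layout reasoning plus self-optimization driven by MLLM feedback) that yields semantically coherent panoramas; 2) a panorama sliding mechanism, combined with depth and semantic losses, that effectively initializes globally consistent 3D Gaussian Splatting point clouds and improves detail fidelity. This offers a new approach to scalable immersive content creation.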

Abstract: Generating realistic 3D scenes from text is crucial for immersive applications like VR, AR, and gaming. While text-driven approaches promise efficiency, existing methods suffer from limited 3D-text data and inconsistent multi-view stitching, resulting in overly simplistic scenes. To address this, we propose PSGS, a two-stage framework for high-fidelity panoramic scene generation. First, a novel two-layer optimization architecture generates semantically coherent panoramas: a layout reasoning layer parses text into structured spatial relationships, while a self-optimization layer refines visual details via iterative MLLM feedback. Second, our panorama sliding mechanism initializes globally consistent 3D Gaussian Splatting point clouds by strategically sampling overlapping perspectives. By incorporating depth and semantic coherence losses during training, we greatly improve the quality and detail fidelity of rendered scenes. Our experiments demonstrate that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes, offering a robust solution for scalable immersive content creation.


[121] GTATrack: Winner Solution to SoccerTrack 2025 with Deep-EIoU and Global Tracklet Association cs.CV | cs.MM

Rong-Lin Jian, Ming-Chi Luo, Chen-Wei Huang, Chia-Ming Lee, Yu-Fan Lin

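TL;DR: GTATrack is a hierarchical framework for multi-object tracking in sports scenes that won the SoccerTrack Challenge 2025. It uses Deep-EIoU for motion-agnostic online association and Global Tracklet Association for trajectory-level refinement, effectively handling the geometric distortion and scale variation introduced by fisheye cameras as well as similar player appearances and frequent occlusions.

Details

Motivation: Multi-object tracking in sports, soccer in particular, is difficult because of irregular player motion, similar appearances, and frequent occlusions, compounded by the geometric distortion and extreme scale variation introduced by static fisheye cameras.

Result: The method won the SoccerTrack Challenge 2025 with a HOTA score of 0.60 and significantly reduced false positives to 982, reaching state-of-the-art accuracy in fisheye-based soccer tracking.

Insight: The novelty is a two-stage design that combines motion-agnostic local association (Deep-EIoU) with Global Tracklet Association (GTA), plus a pseudo-labeling strategy that raises detector recall on small and distorted targets. Together these address identity switches, occlusions, and tracking fragmentation.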

Abstract: Multi-object tracking (MOT) in sports is highly challenging due to irregular player motion, uniform appearances, and frequent occlusions. These difficulties are further exacerbated by the geometric distortion and extreme scale variation introduced by static fisheye cameras. In this work, we present GTATrack, a hierarchical tracking framework that won first place in the SoccerTrack Challenge 2025. GTATrack integrates two core components: Deep Expansion IoU (Deep-EIoU) for motion-agnostic online association and Global Tracklet Association (GTA) for trajectory-level refinement. This two-stage design enables both robust short-term matching and long-term identity consistency. Additionally, a pseudo-labeling strategy is used to boost detector recall on small and distorted targets. The synergy between local association and global reasoning effectively addresses identity switches, occlusions, and tracking fragmentation. Our method achieved a winning HOTA score of 0.60 and significantly reduced false positives to 982, demonstrating state-of-the-art accuracy in fisheye-based soccer tracking. Our code is available at https://github.com/ron941/GTATrack-STC2025.
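
The idea behind expansion IoU can be sketched as follows: both boxes are enlarged before the overlap is measured, so a detection can still match a track after fast, irregular motion without a motion model. The sketch assumes each side is pushed outward by a fraction `e` of the box width or height; the helper names and the value of `e` are hypothetical, and the appearance features that the "Deep" part of Deep-EIoU adds are not shown.

```python
def expand(box, e):
    """Grow an (x1, y1, x2, y2) box outward by e times its width/height per side."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (x1 - e * w, y1 - e * h, x2 + e * w, y2 + e * h)

def iou(a, b):
    """Standard intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def expansion_iou(a, b, e=0.5):
    """IoU computed on the expanded versions of both boxes."""
    return iou(expand(a, e), expand(b, e))

track_box = (0, 0, 10, 10)       # last known position of a player
detection_box = (12, 0, 22, 10)  # same player after a sudden sprint
print(iou(track_box, detection_box), expansion_iou(track_box, detection_box))
```

In this toy case the plain IoU is zero because the boxes no longer touch, while the expanded boxes overlap and give the association step a usable score.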


[122] RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding cs.CV

Jiahe Wu, Bing Cao, Qilong Wang, Qinghua Hu, Dongdong Li

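TL;DR: This paper proposes RGBX-R1, a framework for enhancing the perception and reasoning capabilities of multimodal large language models (MLLMs) on X visual modalities such as infrared, depth, and event data. Using a prompting strategy guided by a Visual Modality Chain-of-Thought (VM-CoT) and a two-stage training paradigm of cold-start supervised fine-tuning followed by spatio-temporal reinforcement fine-tuning, the model achieves significant gains in multimodal understanding and spatial perception on the newly built RGBX-Grounding benchmark.

Details

Motivation: Existing MLLMs are pre-trained mainly on the RGB modality, so their performance on key modalities such as infrared and depth is limited and they struggle in complex scenarios. Their cross-modal understanding and reasoning capabilities therefore need to be extended.

Result: On the self-built RGBX-Grounding benchmark, the method outperforms baselines by 22.71% on three RGBX grounding tasks, showing its advantage in multimodal understanding and spatial perception.

Insight: The contributions are the Visual Modality Chain-of-Thought (VM-CoT) for extending MLLMs' cross-modal understanding; a two-stage training paradigm (CS-SFT and ST-RFT) that combines supervised and reinforcement learning to progressively improve reasoning; and the Modality-understanding Spatio-Temporal (MuST) reward for reinforcing spatio-temporal reasoning. These provide a reusable framework for adapting MLLMs to diverse visual modalities.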

Abstract: Multimodal Large Language Models (MLLMs) are primarily pre-trained on the RGB modality, which limits their performance on other modalities, such as infrared, depth, and event data, that are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLMs’ perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to expand the MLLMs’ RGB understanding capability into X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality cognition. Building upon GRPO, ST-RFT employs a Modality-understanding Spatio-Temporal (MuST) reward to reinforce modality reasoning. Notably, we construct the first RGBX-Grounding benchmark, and extensive experiments verify the superiority of our approach in multimodal understanding and spatial perception, outperforming baselines by 22.71% on three RGBX grounding tasks.


[123] Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models cs.CV | cs.AI

Jingrui Zhang, Feng Liang, Yong Zhang, Wei Wang, Runhao Zeng

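TL;DR: This paper proposes SparseCut, a novel cross-modal fusion architecture for multimodal large language models (MLLMs). By introducing sparse shortcut connections between the cross-modal encoder and the large language model (LLM), it enables efficient, hierarchical fusion of multi-level visual features and improves cross-modal understanding without increasing computational overhead.

Details

Motivation: Most existing MLLM work focuses on scaling up language models or building higher-quality training data, and overlooks how to integrate cross-modal knowledge effectively into the language space. In vision-language models in particular, aligning modalities with only high-level visual features discards the rich semantic information in mid- and low-level features, which limits cross-modal understanding.

Result: Experiments show that SparseCut significantly improves MLLM performance on multiple multimodal benchmarks and that it generalizes and scales across different base LLMs.

Insight: The core contributions are the sparse shortcut connections and an efficient multi-grained feature fusion module. The sparse connections provide hierarchical, efficient fusion of cross-modal features, while the fusion module merges visual features before they pass through the shortcuts, which preserves the original language context and avoids increasing the LLM's input length and computational complexity. It is an effective architectural design for strengthening semantic fusion while staying efficient.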

Abstract: With the remarkable success of large language models (LLMs) in natural language understanding and generation, multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. While most existing efforts focus on scaling up language models or constructing higher-quality training data, limited attention has been paid to effectively integrating cross-modal knowledge into the language space. In vision-language models, for instance, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features, limiting the model’s capacity for cross-modal understanding. To address this issue, we propose SparseCut, a general cross-modal fusion architecture for MLLMs, introducing sparse shortcut connections between the cross-modal encoder and the LLM. These shortcut connections enable the efficient and hierarchical integration of visual features at multiple levels, facilitating richer semantic fusion without increasing computational overhead. We further introduce an efficient multi-grained feature fusion module, which performs the fusion of visual features before routing them through the shortcuts. This preserves the original language context and does not increase the overall input length, thereby avoiding an increase in computational complexity for the LLM. Experiments demonstrate that SparseCut significantly enhances the performance of MLLMs across various multimodal benchmarks, with generality and scalability across different base LLMs.


[124] DuoGen: Towards General Purpose Interleaved Multimodal Generation cs.CV

Min Shi, Xiaohui Zeng, Jiannan Huang, Yin Cui, Francesco Ferroni

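TL;DR: DuoGen is a general-purpose interleaved multimodal generation framework that improves interleaved generation quality by systematically addressing data, architecture, and evaluation. It combines a large-scale, high-quality instruction-tuning dataset with a two-stage decoupled strategy that draws on the visual understanding of a pretrained multimodal LLM and the visual generation capability of a DiT, and it outperforms existing open-source models in text quality, image fidelity, and image-context alignment.

Details

Motivation: The quality of existing interleaved multimodal generation models under general instructions is limited by insufficient training data and base model capacity. DuoGen aims to address these issues systematically to achieve more general interleaved generation.

Result: Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and achieves state-of-the-art performance on text-to-image and image editing among unified generation models.

Insight: The contributions are a large-scale, high-quality instruction-tuning dataset, a two-stage decoupled architecture (combining an MLLM with a DiT to avoid costly unimodal pretraining), and a flexible base model selection strategy. Its data synthesis and model alignment methods are worth borrowing.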

Abstract: Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code will be released at https://research.nvidia.com/labs/dir/duetgen/.


[125] SPARK: Stochastic Propagation via Affinity-guided Random walK for training-free unsupervised segmentation cs.CV

Kunal Mahatha, Jose Dolz, Christian Desrosiers

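TL;DR: This paper proposes SPARK, which reformulates training-free unsupervised segmentation as a stochastic flow equilibrium problem over diffusion-induced affinity graphs. A Markov propagation scheme combines random-walk label diffusion with an adaptive pruning strategy to overcome the limitations of traditional spectral graph partitioning methods.

Details

Motivation: Existing training-free segmentation methods rest on a spectral graph partitioning assumption. They require the number of clusters to be chosen in advance, oversmooth boundaries, and are sensitive to noisy and multi-modal affinity distributions. They also neglect the key role of local neighborhood structure in stabilizing affinity propagation and preserving fine-grained contours.

Result: On seven widely used semantic segmentation benchmarks, the method achieves state-of-the-art zero-shot performance, producing sharper boundaries, more coherent regions, and significantly more stable masks than spectral-clustering-based methods.

Insight: The novelty is recasting segmentation as a stochastic flow equilibrium problem and introducing a Markov propagation scheme that integrates global diffusion attention with local neighborhood structure. Adaptive pruning reinforces reliable affinity paths, improving robustness and detail preservation.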

Abstract: We argue that existing training-free segmentation methods rely on an implicit and limiting assumption: that segmentation is a spectral graph partitioning problem over diffusion-derived affinities. Such approaches, based on global graph partitioning and eigenvector-based formulations of affinity matrices, suffer from several fundamental drawbacks: they require pre-selecting the number of clusters, induce boundary oversmoothing due to spectral relaxation, and remain highly sensitive to noisy or multi-modal affinity distributions. Moreover, many prior works neglect the importance of local neighborhood structure, which plays a crucial role in stabilizing affinity propagation and preserving fine-grained contours. To address these limitations, we reformulate training-free segmentation as a stochastic flow equilibrium problem over diffusion-induced affinity graphs, where segmentation emerges from a stochastic propagation process that integrates global diffusion attention with local neighborhoods extracted from Stable Diffusion, yielding a sparse yet expressive affinity structure. Building on this formulation, we introduce a Markov propagation scheme that performs random-walk-based label diffusion with an adaptive pruning strategy that suppresses unreliable transitions while reinforcing confident affinity paths. Experiments across seven widely used semantic segmentation benchmarks demonstrate that our method achieves state-of-the-art zero-shot performance, producing sharper boundaries, more coherent regions, and significantly more stable masks compared to prior spectral-clustering-based approaches.


[126] MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval cs.CV

Chaoran Xu, Chengkan Lv, Qiyu Chen, Feng Zhang, Zhengtao Zhang

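TL;DR: This paper proposes MRAD, a zero-shot anomaly detection framework that builds a two-level memory bank (image-level and pixel-level) and performs direct similarity retrieval in place of conventional parametric fitting, lowering training and inference cost and improving cross-domain stability.

Details

Motivation: Existing zero-shot anomaly detection methods usually rely on prompt learning or complex modeling to fit the data distribution, which leads to high training or inference cost and limited cross-domain stability. MRAD aims to address these problems.

Result: Across 16 industrial and medical datasets, the MRAD framework shows superior performance in both anomaly classification and segmentation, in train-free as well as training-based settings.

Insight: The novelty is replacing parametric fitting with non-parametric memory retrieval, together with two lightweight variants (MRAD-FT and MRAD-CLIP) that strengthen discriminability and generalization. Viewed objectively, the method makes full use of the empirical distribution of the raw data rather than relying only on model fitting, which offers a new approach to anomaly detection.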

Abstract: Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods use prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose Memory-Retrieval Anomaly Detection (MRAD), a unified framework that replaces parametric fitting with direct memory retrieval. The train-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, where feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Building on MRAD-TF, we further propose two lightweight variants as enhancements: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomalous data; (ii) MRAD-CLIP injects the normal and anomalous region priors from MRAD-FT as dynamic biases into CLIP’s learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation, under both train-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. The code will be publicly released at https://github.com/CROVO1026/MRAD.
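
Scoring by retrieval over a key-value memory bank can be sketched roughly as below. The abstract states only that feature-label pairs are stored as keys and values and that scores come from similarity retrieval; the cosine metric, the softmax weighting, and the `temperature` value here are assumptions made for illustration, and the two-dimensional features stand in for frozen CLIP features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def anomaly_score(query, memory, temperature=0.1):
    """Similarity-weighted average of stored labels (0 = normal, 1 = anomalous).

    memory is a list of (key_vector, label) pairs. The softmax weighting is an
    assumed aggregation rule, not necessarily the one used by MRAD.
    """
    sims = [cosine(query, key) for key, _ in memory]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    return sum(w * label for w, (_, label) in zip(weights, memory)) / total

# Hypothetical memory bank built from auxiliary data.
memory = [
    ([1.0, 0.0], 0.0),  # feature of a normal sample
    ([0.0, 1.0], 1.0),  # feature of an anomalous sample
]
print(anomaly_score([1.0, 0.0], memory))  # close to 0
print(anomaly_score([0.0, 1.0], memory))  # close to 1
```

The same routine could be applied per image for classification or per patch feature for a pixel-level map, which mirrors the two memory levels the abstract describes.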


[127] SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding cs.CV

Yujia Tong, Tian Zhang, Yunyang Wan, Kaiwei Lin, Jingling Yuan

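TL;DR: This paper proposes SAGE, a novel framework that accelerates inference in vision-language models (VLMs) through entropy-guided adaptive speculative decoding based on real-time prediction uncertainty. The method dynamically adjusts the speculation tree structure, building deeper-narrower or shallower-wider trees depending on confidence, to improve acceptance length and decoding speed.

Details

Motivation: Existing speculative decoding methods rely on static tree structures that cannot adapt to the varying prediction difficulty across generation steps, leading to suboptimal acceptance lengths and limited speedup.

Result: On multiple benchmarks and without loss of output quality, SAGE delivers up to 3.36x decoding speedup for LLaVA-OneVision-72B and 3.18x for Qwen2.5-VL-72B, outperforming static-tree baselines.

Insight: The novelty is using output entropy as a confidence indicator to adjust the speculation tree dynamically (deeper-narrower trees for high-confidence predictions to maximize speculation depth, shallower-wider trees for uncertain predictions to diversify exploration), thereby adapting the decoding process at each step.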

Abstract: Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE improves acceptance lengths and achieves greater speedup than static-tree baselines. Experiments on multiple benchmarks demonstrate the effectiveness of SAGE: without any loss in output quality, it delivers up to $3.36\times$ decoding speedup for LLaVA-OneVision-72B and $3.18\times$ for Qwen2.5-VL-72B.
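
The entropy-to-tree-shape rule described above can be sketched as a small decision function. The Shannon entropy computation is standard; the thresholds `low` and `high` and the specific depth and width values are hypothetical placeholders, since the abstract does not state how SAGE maps entropy to a tree.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_tree_shape(probs, low=0.5, high=1.5):
    """Pick a speculation tree shape from the draft model's output entropy.

    Thresholds and shapes are illustrative assumptions, not values from SAGE.
    """
    h = entropy(probs)
    if h < low:
        # Confident prediction: speculate far ahead along few branches.
        return {"depth": 8, "width": 1}
    if h > high:
        # Uncertain prediction: explore many alternatives, but not far.
        return {"depth": 3, "width": 4}
    return {"depth": 5, "width": 2}

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [1 / 8] * 8
print(choose_tree_shape(confident))
print(choose_tree_shape(uncertain))
```

Because the abstract notes that entropy is strongly correlated across decoding steps, a practical variant might reuse the previous step's entropy to choose the next tree before drafting begins; that detail is likewise an assumption here.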


[128] SADER: Structure-Aware Diffusion Framework with DEterministic Resampling for Multi-Temporal Remote Sensing Cloud Removal cs.CV

Yifan Zhang, Qian Chen, Yi Liu, Wengen Li, Jihong Guan

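TL;DR: This paper proposes SADER, a structure-aware diffusion framework for cloud removal in multi-temporal remote sensing imagery. The framework captures multi-temporal and multimodal correlations with a scalable Multi-Temporal Conditional Diffusion Network (MTCDN), introduces a cloud-aware attention loss that emphasizes cloud-dominated regions, and designs a deterministic resampling strategy that iteratively refines samples under a fixed number of sampling steps.

Details

Motivation: Existing diffusion-based cloud removal methods suffer from limited sampling efficiency and make insufficient use of structural and temporal priors in multi-temporal remote sensing scenarios.

Result: Extensive experiments on multiple multi-temporal datasets show that SADER consistently outperforms state-of-the-art cloud removal methods across all evaluation metrics.

Insight: The main contributions are: 1. a scalable Multi-Temporal Conditional Diffusion Network (MTCDN) that fully captures correlations through temporal fusion and hybrid attention; 2. a cloud-aware attention loss that emphasizes cloud-dominated regions by accounting for cloud thickness and brightness discrepancies; 3. a deterministic resampling strategy for continuous diffusion models that iteratively refines samples by replacing outliers through guided correction, improving sampling efficiency.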

Abstract: Cloud contamination severely degrades the usability of remote sensing imagery and poses a fundamental challenge for downstream Earth observation tasks. Recently, diffusion-based models have emerged as a dominant paradigm for remote sensing cloud removal due to their strong generative capability and stable optimization. However, existing diffusion-based approaches often suffer from limited sampling efficiency and insufficient exploitation of structural and temporal priors in multi-temporal remote sensing scenarios. In this work, we propose SADER, a structure-aware diffusion framework for multi-temporal remote sensing cloud removal. SADER first develops a scalable Multi-Temporal Conditional Diffusion Network (MTCDN) to fully capture multi-temporal and multimodal correlations via temporal fusion and hybrid attention. Then, a cloud-aware attention loss is introduced to emphasize cloud-dominated regions by accounting for cloud thickness and brightness discrepancies. In addition, a deterministic resampling strategy is designed for continuous diffusion models to iteratively refine samples under fixed sampling steps by replacing outliers through guided correction. Extensive experiments on multiple multi-temporal datasets demonstrate that SADER consistently outperforms state-of-the-art cloud removal methods across all evaluation metrics. The code of SADER is publicly available at https://github.com/zyfzs0/SADER.


[129] Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models cs.CV | cs.AI

Wenbin Xing, Quanxing Zha, Lizheng Zu, Mengran Li, Ming Li

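TL;DR: This paper addresses compositional hallucination in video multimodal large language models. It proposes the OmniVCHall benchmark to systematically evaluate both isolated and compositional hallucinations, and designs TriCD, a contrastive decoding framework that improves model performance in complex reasoning scenarios.

Details

Motivation: Current research focuses mainly on isolated hallucination types, while compositional hallucinations, in which interactions among multiple spatial and temporal factors lead to incorrect reasoning, remain underexplored. The paper aims to address this challenge.

Result: An evaluation of 39 representative VLLMs on OmniVCHall finds that even advanced models (e.g., Qwen3-VL and GPT-5) show substantial performance degradation. The proposed TriCD framework improves average accuracy by more than 10% on two representative backbones.

Insight: The novelty lies in systematically defining the compositional hallucination problem and building a comprehensive benchmark for it, and in a triple-pathway calibration mechanism, built from an adaptive perturbation controller and a saliency-guided enhancement module and optimized with reinforcement learning, that counters hallucination.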

Abstract: Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, which arise from incorrect reasoning over multiple interacting spatial and temporal factors, largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera-based hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., “All are correct” and “None of the above”) to prevent shortcut reasoning. Our evaluation of 39 representative VLLMs reveals that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidence. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%. The data and code can be found at https://github.com/BMRETURN/OmniVCHall.


[130] GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates cs.CV

Xingyu Luo, Yidong Cai, Jie Liu, Jie Tang, Gangshan Wu

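TL;DR: This paper proposes GLAD, a generative language-assisted visual tracking model that addresses the impact of low-semantic templates (for example, blurry or low-resolution images) on vision-language tracking performance. The model uses diffusion models for generative multi-modal fusion of the text description and the template image, improving compatibility between language and image and enriching the semantic information of the template image.

Details

Motivation: When current vision-language tracking methods handle low-semantic images (blurry, low-resolution), the gap between textual and visual features limits the effectiveness of direct fusion and degrades cross-modal understanding. The paper aims to improve tracking under low-semantic templates through generative fusion.

Result: Experiments show that the method establishes a new state of the art on multiple benchmarks and achieves an impressive inference speed.

Insight: The novelty is introducing diffusion models into vision-language tracking for the first time, using generative multi-modal fusion to restore blurry and semantically ambiguous template images and thereby strengthen multi-modal feature representations. This provides a new generative fusion paradigm for handling low-semantic images.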

Abstract: Vision-language tracking has gained increasing attention in many scenarios. This task simultaneously deals with visual and linguistic information to localize objects in videos. Despite its growing utility, the development of vision-language tracking methods remains at an early stage. Current vision-language trackers usually employ Transformer architectures for interactive integration of template, search, and text features. However, low-semantic images, characterized by prevalent blurriness, low resolution, and similar degradations, pose persistent challenges and may compromise model performance through degraded cross-modal understanding. Language assistance is commonly used to overcome the obstacles posed by low-semantic images. However, due to the existing gap between current textual and visual features, direct concatenation and fusion of these features may have limited effectiveness. To address these challenges, we introduce a pioneering Generative Language-AssisteD tracking model, GLAD, which utilizes diffusion models for the generative multi-modal fusion of text description and template image to bolster compatibility between language and image and enhance template image semantic information. Our approach demonstrates notable improvements over the existing fusion paradigms. Blurry and semantically ambiguous template images can be restored to improve multi-modal features in the generative fusion paradigm. Experiments show that our method establishes a new state of the art on multiple benchmarks and achieves an impressive inference speed. The code and models will be released at: https://github.com/Confetti-lxy/GLAD


[131] Bridging Degradation Discrimination and Generation for Universal Image Restoration cs.CV

JiaKui Hu, Zhengjian Yao, Lujia Jin, Yanye Lu

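TL;DR: This paper proposes BDG, a new method for universal image restoration that combines degradation discrimination with the generation process to handle multiple degradation types and levels, aiming to recover clean, richly detailed images from low-quality inputs.

Details

Motivation: Universal image restoration faces two main challenges: sampling the distribution of high-quality images and adjusting the output according to the degradation. The paper aims to address both at once by combining degradation discrimination with the generation process, improving performance in multi-task and multi-degradation scenarios.

Result: Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution, shown mainly by substantial improvements in fidelity without compromising perceptual quality.

Insight: The contributions are the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) for fine-grained degradation discrimination, and a diffusion training process split into three stages (generation, bridging, and restoration) that integrates the discriminative information into restoration, strengthening the model's ability to handle complex degradation scenarios.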

Abstract: Universal image restoration is a critical task in low-level vision, requiring the model to remove various degradations from low-quality images to produce clean images with rich detail. The challenges lie in sampling the distribution of high-quality images and adjusting the outputs on the basis of the degradation. This paper presents a novel approach, Bridging Degradation discrimination and Generation (BDG), which aims to address these challenges concurrently. First, we propose the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and demonstrate its effectiveness in performing fine-grained discrimination of degradation types and levels. Subsequently, we divide the diffusion training process into three distinct stages: generation, bridging, and restoration. The objective is to preserve the diffusion model’s capability of restoring rich textures while simultaneously integrating the discriminative information from the MAS-GLCM into the restoration process. This enhances its proficiency in addressing multi-task and multi-degraded scenarios. Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, primarily evidenced by substantial improvements in fidelity without compromising perceptual quality. The code and pretrained models are available at https://github.com/MILab-PKU/BDG.
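
For readers unfamiliar with the underlying statistic, a plain gray-level co-occurrence matrix (GLCM) counts how often gray level i appears at a fixed offset from gray level j. The sketch below computes it for several angles and distances on a tiny quantized image. It shows the classic GLCM only; how the paper's MAS-GLCM combines or normalizes these matrices is not specified in the abstract, and the function names, offsets, and distances are illustrative assumptions.

```python
def glcm(image, dx, dy, levels):
    """Count co-occurrences of gray levels (i, j) at pixel offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix

# Unit offsets (dx, dy) for 0, 45, 90, and 135 degrees; image y grows downward.
ANGLE_OFFSETS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}

def multi_angle_multi_scale_glcm(image, levels, distances=(1, 2)):
    """One GLCM per (angle, distance) pair; a stand-in for a multi-scale descriptor."""
    return {
        (angle, d): glcm(image, ux * d, uy * d, levels)
        for angle, (ux, uy) in ANGLE_OFFSETS.items()
        for d in distances
    }

# Hypothetical 3x3 image already quantized to two gray levels.
image = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
features = multi_angle_multi_scale_glcm(image, levels=2)
print(features[(0, 1)])  # horizontal neighbours at distance 1
```

Degradations such as blur or noise change how often neighbouring pixels share a gray level, which is why co-occurrence statistics at several angles and scales are a plausible signal for telling degradation types and levels apart.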


[132] MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation cs.CV | cs.AI

Xiangdong Li, Ye Lou, Ao Gao, Wei Zhang, Siyang Song

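TL;DR: This paper proposes MAUGen, a diffusion-based multi-modal framework that jointly generates photorealistic facial expression images with diverse identities and anatomically consistent Action Unit (AU) labels (both occurrence and intensity) from a single descriptive text prompt. The framework comprises a multi-modal representation learning module and a diffusion-based image-label generator, and is used to build MIFA, a large-scale synthetic dataset.

Details

Motivation: Large-scale, demographically diverse face image data with precise AU annotations is lacking, which is a fundamental bottleneck in developing generalizable AU recognition systems.

Result: Extensive experiments show that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.

Insight: The novelty is a unified diffusion framework that jointly generates images and AU labels from a text prompt, together with MIFA, a large-scale multimodal synthetic dataset with identity variations that provides a high-quality data source for AU recognition.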

Abstract: The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. Our MAUGen involves two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.


[133] From Pixels to Facts (Pix2Fact): Benchmarking Multi-Hop Reasoning for Fine-Grained Visual Fact Checking cs.CV | cs.LG

Yifan Jiang, Cong Zhang, Bofei Zhang, Yifan Yang, Bingzhang Wang

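TL;DR: The paper proposes Pix2Fact, a benchmark for evaluating the multi-hop reasoning ability of vision-language models on fine-grained visual fact checking. It contains 1,000 high-resolution images with expert-annotated questions that require models to combine detailed visual grounding, multi-step reasoning, and external knowledge.

Details

Motivation: Existing benchmarks evaluate visual grounding and knowledge-based reasoning separately and cannot measure how the two work together. A benchmark is needed that evaluates expert-level perception and knowledge-intensive multi-hop reasoning jointly.

Result: Nine state-of-the-art vision-language models, including Gemini-3-Pro and GPT-5, are evaluated on Pix2Fact. The most advanced model achieves only 24.0% average accuracy, far below human performance of 56%, revealing the limitations of current models.

Insight: The novelty is the first expert-level benchmark focused on fine-grained visual fact checking and multi-hop reasoning. High-resolution images and expert annotation ensure task difficulty, making the benchmark a key evaluation tool for developing multimodal agents that combine fine-grained perception with robust reasoning.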

Abstract: Despite progress on general tasks, VLMs struggle with challenges demanding both detailed visual grounding and deliberate knowledge-based reasoning, a synergy not captured by existing benchmarks that evaluate these skills separately. To close this gap, we introduce Pix2Fact, a new visual question-answering benchmark designed to evaluate expert-level perception and knowledge-intensive multi-hop reasoning. Pix2Fact contains 1,000 high-resolution (4K+) images spanning 8 daily-life scenarios and situations, with questions and answers meticulously crafted by annotators holding PhDs from top global universities working in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and the integration of external knowledge to answer. Our evaluation of 9 state-of-the-art VLMs, including proprietary models like Gemini-3-Pro and GPT-5, reveals the substantial challenge posed by Pix2Fact: the most advanced model achieves only 24.0% average accuracy, in stark contrast to human performance of 56%. This significant gap underscores the limitations of current models in replicating human-level visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the development of next-generation multimodal agents that combine fine-grained perception with robust, knowledge-based reasoning.


[134] Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering cs.CV

Guangtao Lyu, Xinyi Cheng, Qi Liu, Chenghao Xu, Jiexi Yan

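TL;DR: This paper proposes Contrastive Neuron Steering (CNS), a method for analyzing and mitigating hallucinations in large vision-language models (LVLMs). Working at the representation level, it uses sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons and identifies two types: always-on neurons and image-specific neurons. The study finds that hallucinations often stem from disruptions or spurious activations of image-specific neurons. CNS contrasts clean and noisy inputs to selectively amplify informative neurons and suppress perturbation-induced activations, producing more robust and semantically accurate visual representations at the prefilling stage and effectively reducing hallucinations.

Details

Motivation: Existing methods for mitigating LVLM hallucinations focus mainly on output-level adjustments and leave the internal mechanisms that produce hallucinations largely unexplored. The paper aims to understand those mechanisms at the representation level and to propose a more fundamental mitigation method.

Result: Extensive experiments on hallucination-focused and general multimodal benchmarks show that CNS consistently reduces hallucinations while preserving overall multimodal understanding.

Insight: The novelty is analyzing and intervening on hallucinations at the neuron level rather than the output level, with CNS using contrastive analysis to identify and steer image-specific neurons. This offers a new perspective on the model's internal mechanisms and on controllable intervention. Because the method operates at the prefilling stage, it is fully compatible with existing decoding-stage methods and extends well.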

Abstract: LVLMs achieve remarkable multimodal understanding and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation-level perspective by introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image-specific neurons. Our findings reveal that hallucinations often result from disruptions or spurious activations of image-specific neurons, while always-on neurons remain largely stable. Moreover, selectively enhancing or suppressing image-specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations. This not only enhances visual understanding but also effectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods. Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding.


[135] VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning cs.CV

Vivek Madhavaram, Vartika Sengar, Arkadipta De, Charu Sharma

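TL;DR: This paper proposes VIZOR, a training-free, end-to-end framework that generates dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. By defining spatial relationships relative to each object's front-facing direction, it keeps relationships consistent across viewpoints, and it infers open-vocabulary relationships without annotated training data.

Details

Motivation: Existing methods typically build scene graphs from inputs such as 2D images and depth maps tied to a specific reference view. They generalize poorly, and the spatial relationships they produce (such as "left/right") are inconsistent across viewpoints. This paper aims to address these limitations and achieve viewpoint-invariant zero-shot 3D scene graph generation.

Result: VIZOR is evaluated extensively on scene graph generation and on downstream tasks such as query-based object grounding. It improves zero-shot grounding accuracy by 22% and 4.81% on the Replica and Nr3D datasets, respectively, outperforming existing state-of-the-art methods.

Insight: The novelty is a viewpoint-invariant scene graph generation framework that anchors spatial relationships to each object's own front-facing direction rather than to a global coordinate system, which keeps the relationships consistent. The framework is also training-free and supports open-vocabulary relationship inference, which improves generalization.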

Abstract: Scene understanding and reasoning has been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and spatial or comparative relationships among the objects. Existing approaches enable this by creating scene graphs using multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from a specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like “left/right”, which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are defined relative to each object’s front-facing direction, making them consistent regardless of the reference view. Furthermore, it infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data. We conduct extensive quantitative and qualitative evaluations to assess the effectiveness of VIZOR in scene graph generation and downstream tasks, such as query-based object grounding. VIZOR outperforms state-of-the-art methods, showing clear improvements in scene graph generation and achieving 22% and 4.81% gains in zero-shot grounding accuracy on the Replica and Nr3D datasets, respectively.
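
To make the idea of relations defined relative to an object's front-facing direction concrete, here is a minimal 2D ground-plane sketch. The function names `relation` and `rotate90`, the four relation labels, and the dominant-axis rule are assumptions for illustration and are not taken from the VIZOR paper.

```python
def relation(anchor_pos, anchor_front, target_pos):
    """Describe where the target lies relative to the anchor's own front.

    Positions and the front direction are 2D ground-plane vectors in a
    counter-clockwise-positive coordinate system (hypothetical setup).
    No camera or reference view enters the computation.
    """
    dx = target_pos[0] - anchor_pos[0]
    dy = target_pos[1] - anchor_pos[1]
    fx, fy = anchor_front
    forward = dx * fx + dy * fy   # component along the front direction
    left = -dx * fy + dy * fx     # component along front rotated 90 deg CCW
    if abs(forward) >= abs(left):
        return "in front of" if forward >= 0 else "behind"
    return "left of" if left > 0 else "right of"


def rotate90(v):
    """Rotate a 2D vector by 90 degrees counter-clockwise."""
    return (-v[1], v[0])


# The relation is unchanged when the whole scene is rotated, because it
# only depends on the anchor's own orientation.
before = relation((0, 0), (1, 0), (0, 2))
after = relation(rotate90((0, 0)), rotate90((1, 0)), rotate90((0, 2)))
print(before, "|", after)
```

A view-dependent "left/right" would flip under such a rotation, whereas the object-anchored relation stays the same.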


[136] Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment cs.CV | cs.AI

Lukas Kuhn, Giuseppe Serra, Florian Buettner

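TL;DR: This paper proposes NOVA (NOn-contrastive Vision-language Alignment), a non-contrastive vision-language alignment method based on joint embedding prediction and distributional regularization. It aligns visual representations with a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, and it enforces an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This removes the need for negative sampling, momentum encoders, or stop-gradient operations and reduces the training objective to a single hyperparameter.

Details

Motivation: Mainstream contrastive methods such as CLIP require large batches, carefully designed negative sampling, and extensive hyperparameter tuning. The paper aims to provide a simpler, more stable non-contrastive alternative for vision-language pretraining.

Result: On zero-shot chest X-ray classification, with ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR, NOVA outperforms multiple standard baselines on three benchmark datasets, and its training is markedly more stable.

Insight: The novelty is achieving non-contrastive learning through predictive embedding alignment and distributional regularization, which removes the complex components of contrastive methods and simplifies training. Viewed objectively, the method demonstrates the potential of non-contrastive learning for simplicity and stability in specialized domains such as medical imaging and offers a new direction for vision-language representation learning.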

Abstract: Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.


[137] HPC: Hierarchical Point-based Latent Representation for Streaming Dynamic Gaussian Splatting Compression cs.CV

Yangzhi Ma, Bojun Liu, Wenting Liao, Dong Liu, Zhu Li

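TL;DR: This paper proposes HPC, a streaming dynamic Gaussian Splatting compression framework that addresses the challenge of reducing the memory footprint of dynamic Gaussian Splatting while maintaining rendering quality for efficient streaming. It adopts a hierarchical point-based latent representation that avoids parameter redundancy in unoccupied space and improves compactness through a tailored aggregation scheme. It also presents the first exploration of compressing the neural networks by mining and exploiting the inter-frame correlation of their parameters, yielding an end-to-end compression framework.

Details

Motivation: Dynamic Gaussian Splatting has driven significant advances in free-viewpoint video, but existing streaming compression methods suffer from parameter redundancy or insufficient use of local correlations. This leads to inefficient storage and makes efficient transmission difficult without sacrificing rendering quality.

Result: In comprehensive experiments, HPC substantially outperforms existing state-of-the-art methods, reducing storage by 67% against its baseline while maintaining high reconstruction fidelity.

Insight: The innovations are a hierarchical point-based latent representation that avoids parameter redundancy in unoccupied space, with compactness improved by a tailored aggregation scheme, and the first exploration of compressing the neural networks by exploiting the inter-frame correlation of parameters, which yields an end-to-end compression framework. Viewed objectively, the method combines the advantages of structured and unstructured representations and improves compression efficiency.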

Abstract: While dynamic Gaussian Splatting has driven significant advances in free-viewpoint video, maintaining its rendering quality with a small memory footprint for efficient streaming transmission still presents an ongoing challenge. Existing streaming dynamic Gaussian Splatting compression methods typically leverage a latent representation to drive the neural network for predicting Gaussian residuals between frames. Their core latent representations can be categorized into structured grid-based and unstructured point-based paradigms. However, the former incurs significant parameter redundancy by inevitably modeling unoccupied space, while the latter suffers from limited compactness as it fails to exploit local correlations. To relieve these limitations, we propose HPC, a novel streaming dynamic Gaussian Splatting compression framework. It employs a hierarchical point-based latent representation that operates on a per-Gaussian basis to avoid parameter redundancy in unoccupied space. Guided by a tailored aggregation scheme, these latent points achieve high compactness with low spatial redundancy. To improve compression efficiency, we further undertake the first investigation to compress neural networks for streaming dynamic Gaussian Splatting through mining and exploiting the inter-frame correlation of parameters. Combined with latent compression, this forms a fully end-to-end compression framework. Comprehensive experimental evaluations demonstrate that HPC substantially outperforms state-of-the-art methods. It achieves a storage reduction of 67% against its baseline while maintaining high reconstruction fidelity.


[138] Video Understanding: Through A Temporal Lens cs.CV

Thong Thanh Nguyen

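TL;DR: This thesis examines how temporal relations among video elements can be leveraged to improve video understanding. Addressing limitations of existing methods, it makes five contributions: an automatic annotation framework that uses large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; a parameter-efficient fine-tuning strategy that uses "recurrent adapters" to capture temporal dynamics in low-data regimes; the integration of state space layers for efficient long-form video modeling, together with two new long-term benchmarks; a novel contrastive learning framework that explicitly models fine-grained relations between motions and video moments; and a comprehensive empirical study of large vision-language models, which identifies the visual-language interface as a bottleneck for temporal reasoning and proposes a new "temporal-oriented recipe" to improve video understanding.

Details

Motivation: Existing video understanding methods are limited in how they exploit temporal relations. The thesis addresses this to better model the dynamic nature of video content.

Result: The proposed methods are validated along several lines, including the introduction of two new long-term video benchmarks, and show that explicit temporal modeling significantly improves a model's ability to represent and reason about video content.

Insight: The innovations are a noise-robust framework for automatic annotation with large vision-language models, parameter-efficient recurrent adapters for temporal modeling, the use of state space layers for long-form video modeling, a contrastive learning framework that explicitly models motion-moment relations, and a recipe that targets the temporal reasoning bottleneck of large vision-language models. Together, these methods underline the importance of explicit temporal modeling.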

Abstract: This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using “recurrent adapters” to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new “temporal-oriented recipe” for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model’s ability to represent and reason about the fluid nature of video content.


[139] JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning cs.CV

Ruikui Wang, Jinheng Feng, Lang Tian, Huaishao Luo, Chaochao Li

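TL;DR: JoyAvatar is a framework that generates highly expressive avatar videos through harmonized text and audio conditioning. It addresses the limited alignment of existing methods with complex text instructions, such as full-body movement, dynamic camera trajectories, background transitions, or human-object interactions.

Details

Motivation: Existing video avatar models perform well in scenarios such as talking, public speaking, and singing, but they align poorly with text instructions that involve complex elements, which limits their ability to generate natural, coherent full-body motion and dynamic camera movement.

Result: In GSB evaluation, JoyAvatar outperforms state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0, and it supports complex applications such as multi-person dialogues and non-human character role-playing.

Insight: The innovations are a twin-teacher enhanced training algorithm that transfers text controllability from the foundation model while learning audio-visual synchronization, and dynamic modulation of the strength of multimodal conditions based on the denoising timestep, which mitigates conflicts between heterogeneous conditioning signals. Together they improve the expressiveness and consistency of the avatar model.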

Abstract: Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with respect to text instructions, particularly when the prompts involve complex elements including large full-body movement, dynamic camera trajectory, background transitions, or human-object interactions. To overcome this limitation, we present JoyAvatar, a framework capable of generating long-duration avatar videos, featuring two key technical innovations. Firstly, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model while simultaneously learning audio-visual synchronization. Secondly, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on the distinct denoising timestep, aiming to mitigate conflicts between the heterogeneous conditioning signals. These two key designs serve to substantially expand the avatar model’s capacity to generate natural, temporally coherent full-body motions and dynamic camera movements as well as preserve the basic avatar capabilities, such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that our JoyAvatar model outperforms state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogues and non-human subject role-playing. Some video samples are provided on https://joyavatar.github.io/.


[140] Evaluating Deep Learning-Based Nerve Segmentation in Brachial Plexus Ultrasound Under Realistic Data Constraints cs.CV | cs.AI

Dylan Yves, Khush Agarwal, Jonathan Hoyin Chan, Patcharapit Promoppatum, Aroonkamon Pattanasiricharoen

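TL;DR: This study evaluates U-Net-based deep learning models for nerve segmentation in brachial plexus ultrasound images, focusing on how dataset composition (combining data from multiple machines) and annotation strategy (extending from binary to multi-class segmentation) affect segmentation. It finds that multi-machine training has a regularizing effect on lower-performing acquisition devices but does not surpass single-device training matched to the target domain. Multi-class supervision lowers nerve Dice scores by 9% to 61%, likely due to class imbalance and boundary ambiguity. Nerve size correlates moderately and positively with segmentation accuracy, so small nerves remain the main challenge.

Details

Motivation: Accurate nerve localization is critical in ultrasound-guided regional anesthesia, but manual identification is difficult because of low image contrast, speckle noise, and inter-patient anatomical variability. The study evaluates how well deep learning works in practice for brachial plexus nerve segmentation and provides methodological guidance for building robust segmentation systems under clinical data constraints.

Result: For brachial plexus ultrasound segmentation, joint training on two machines (SIEMENS ACUSON NX3 Elite and Philips EPIQ5) gives regularization benefits for the lower-performing acquisition device but does not surpass single-source training matched to the target domain. Multi-class supervision lowers nerve-specific Dice scores by 9% to 61%. Nerve size correlates positively with segmentation accuracy (Pearson r=0.587, p<0.001).

Insight: The novelty is a systematic evaluation of how realistic clinical data constraints (device differences, annotation strategy) affect deep learning segmentation performance. Transferable takeaways: the regularization effect of multi-source training is limited, so optimization for the target domain is still needed; class imbalance and boundary ambiguity in multi-class segmentation can markedly reduce performance on the key class; and small-target segmentation remains a challenge in ultrasound image analysis that calls for targeted improvements to models or data augmentation.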

Abstract: Accurate nerve localization is critical for the success of ultrasound-guided regional anesthesia, yet manual identification remains challenging due to low image contrast, speckle noise, and inter-patient anatomical variability. This study evaluates deep learning-based nerve segmentation in ultrasound images of the brachial plexus using a U-Net architecture, with a focus on how dataset composition and annotation strategy influence segmentation performance. We find that training on combined data from multiple ultrasound machines (SIEMENS ACUSON NX3 Elite and Philips EPIQ5) provides regularization benefits for lower-performing acquisition sources, though it does not surpass single-source training when matched to the target domain. Extending the task from binary nerve segmentation to multi-class supervision (artery, vein, nerve, muscle) results in decreased nerve-specific Dice scores, with performance drops ranging from 9% to 61% depending on dataset, likely due to class imbalance and boundary ambiguity. Additionally, we observe a moderate positive correlation between nerve size and segmentation accuracy (Pearson r=0.587, p<0.001), indicating that smaller nerves remain a primary challenge. These findings provide methodological guidance for developing robust ultrasound nerve segmentation systems under realistic clinical data constraints.
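
The two metrics reported above, the Dice score and the Pearson correlation coefficient, follow standard definitions and can be computed as in the sketch below. The flat binary mask format and the function names are assumptions for illustration; the study's own evaluation code is not described in the abstract.

```python
import math


def dice(pred, target):
    """Dice score 2|A∩B| / (|A| + |B|) for flat binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(dice([1, 1, 0, 0], [1, 0, 1, 0]))
print(pearson_r([1, 2, 3], [2, 4, 6]))
```

With per-nerve areas as `xs` and per-nerve Dice scores as `ys`, `pearson_r` gives the kind of size-versus-accuracy correlation the abstract reports.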


[141] DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning cs.CV

Wenhao Li, Xianjing Meng, Qiangchang Wang, Zhongyi Han, Zhibin Wu

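TL;DR: This paper proposes DVLA-RL, which tackles few-shot learning through dual-level vision-language alignment with a reinforcement learning gating mechanism. It comprises a dual-level semantic construction module, which uses a large language model conditioned on class names and support samples to generate discriminative attributes and synthesize coherent class descriptions, providing complementary semantics from low-level attributes to high-level descriptions; and an RL-gated attention module, which formulates cross-modal fusion as a sequential decision process and trains a lightweight policy with reinforcement learning to adaptively adjust the contributions of self-attention and cross-attention, so that shallow layers refine local attributes and deep layers emphasize global semantics.

Details

Motivation: Existing few-shot learning methods use large language models to obtain semantic embeddings from class names to enrich visual representations, but they overlook progressive, adaptive alignment from low-level to high-level semantics, which limits the semantic gains.

Result: DVLA-RL achieves new state-of-the-art performance on nine benchmarks across three different few-shot learning scenarios.

Insight: The novelty is a dual-level semantic construction process that combines class names and support samples to generate complementary fine-grained attributes and holistic class descriptions, together with a reinforcement-learning-based gated attention mechanism that models cross-modal fusion as a sequential decision and adaptively fuses textual and visual tokens for more precise progressive alignment.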

Abstract: Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.


[142] Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds cs.CV | cs.RO

Xianzhe Fan, Shengliang Deng, Xiaoyang Wu, Yuxiang Lu, Zhuoling Li

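TL;DR: This paper proposes Any3D-VLA, which aims to strengthen the spatial understanding of vision-language-action (VLA) models in complex scenes by incorporating diverse point cloud data. The model unifies simulator, sensor, and model-estimated point clouds, constructs diverse inputs, and learns domain-agnostic 3D representations, compensating for the reliance of existing VLA models on 2D images alone.

Details

Motivation: Existing VLA models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. The paper explores how to incorporate 3D information to enhance VLA models while addressing the scarcity of 3D data and the domain gap caused by cross-environment differences and depth-scale biases.

Result: Simulation and real-world experiments show that Any3D-VLA improves performance and mitigates the domain gap.

Insight: The novelty is explicitly lifting visual input into point clouds to obtain better representations, together with a unified training pipeline that handles diverse point cloud sources. The pipeline learns domain-agnostic 3D representations and fuses them with 2D representations, which improves robustness.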

Abstract: Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies the simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA’s advantages in improving performance and mitigating the domain gap. Our project homepage is available at https://xianzhefan.github.io/Any3D-VLA.github.io.


[143] Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval cs.CV

Tong Wang, Yunhan Zhao, Shu Kong

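TL;DR: This paper proposes Paracosm, a training-free zero-shot composed image retrieval method that improves retrieval accuracy by directly generating the "mental image" implied by the query, and it achieves state-of-the-art performance on four benchmarks.

Details

Motivation: In composed image retrieval, the "mental image" is only implicitly defined and cannot be obtained directly. The paper addresses this challenge and avoids the limitation of existing methods that rely on textual descriptions for matching.

Result: Paracosm significantly outperforms existing zero-shot methods on four challenging benchmarks, reaching the state of the art for zero-shot composed image retrieval.

Insight: The novelty is using a large multimodal model to generate the "mental image" directly for matching, and generating a synthetic counterpart for each real image in the database to bridge the domain gap between synthetic and real images, which enables zero-shot retrieval without training.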

Abstract: Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a “mental image”, based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this “mental image” is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the “mental image” for more accurate matching. In particular, we prompt an LMM to generate a “mental image” for a given multimodal query and propose to use this “mental image” to search for the target image. As the “mental image” has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses an LMM to construct a “paracosm”, where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.


[144] Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition cs.CV

Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong

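TL;DR: This paper proposes a training-free visual place recognition framework based on second-order geometric statistics. It models scenes as covariance descriptors on the SPD manifold and uses Riemannian mappings to obtain geometrically stable feature representations, performing strongly in zero-shot settings.

Details

Motivation: Visual place recognition needs representations that stay robust under drastic environmental and viewpoint changes. Existing methods rely on large amounts of supervised data or on simple first-order statistics and neglect intrinsic structural correlations.

Result: The method is highly competitive with state-of-the-art baselines and performs particularly well on challenging zero-shot tasks.

Insight: The novelty is representing scenes as covariance descriptors on the SPD manifold and using Riemannian geometric mappings to decouple signal structure from noise, which yields strong generalization without any training.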

Abstract: Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Current aggregation paradigms, however, either rely on data-hungry supervision or simplistic first-order statistics, often neglecting intrinsic structural correlations. In this work, we propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. We conceptualize scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations. By leveraging geometry-aware Riemannian mappings, we project these descriptors into a linearized Euclidean embedding, effectively decoupling signal structure from noise. Our approach introduces a training-free framework built upon fixed, pre-trained backbones, achieving strong zero-shot generalization without parameter updates. Extensive experiments confirm that our method achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios.
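
A covariance descriptor and its projection into a linearized Euclidean embedding can be sketched for the smallest possible case, 2-dimensional local features. The sketch uses the log-Euclidean map (matrix logarithm), which is one common Riemannian mapping for SPD matrices; whether the paper uses this particular map is an assumption, as are all function names, the `eps` regularizer, and the restriction to 2x2 matrices, chosen only so that the matrix logarithm has a closed form without external libraries.

```python
import math


def covariance_2d(features, eps=1e-6):
    """Covariance of 2-D local features, returned as (a, b, d) for the
    symmetric matrix [[a, b], [b, d]]. eps keeps it positive definite."""
    n = len(features)
    mx = sum(f[0] for f in features) / n
    my = sum(f[1] for f in features) / n
    a = sum((f[0] - mx) ** 2 for f in features) / (n - 1) + eps
    d = sum((f[1] - my) ** 2 for f in features) / (n - 1) + eps
    b = sum((f[0] - mx) * (f[1] - my) for f in features) / (n - 1)
    return a, b, d


def log_map_2x2(a, b, d):
    """Matrix logarithm of the SPD matrix [[a, b], [b, d]] in closed form."""
    half_tr = (a + d) / 2.0
    disc = math.sqrt(max(half_tr * half_tr - (a * d - b * b), 0.0))
    lam1, lam2 = half_tr + disc, half_tr - disc  # eigenvalues, lam1 >= lam2
    if lam2 <= 0.0:
        raise ValueError("matrix is not positive definite")
    if disc < 1e-12:
        # Equal eigenvalues: the matrix is a multiple of the identity.
        return math.log(lam1), 0.0, math.log(lam1)
    # log(A) = log(lam2) I + k (A - lam2 I), with k the divided difference.
    k = (math.log(lam1) - math.log(lam2)) / (lam1 - lam2)
    return (k * (a - lam2) + math.log(lam2), k * b,
            k * (d - lam2) + math.log(lam2))


def embed(features):
    """Flatten log(covariance) into a Euclidean vector; the sqrt(2) factor
    makes the vector norm equal the Frobenius norm of the matrix."""
    la, lb, ld = log_map_2x2(*covariance_2d(features))
    return [la, math.sqrt(2.0) * lb, ld]


print(embed([(0.0, 0.0), (2.0, 1.0), (1.0, 3.0), (3.0, 2.0)]))
```

Descriptors embedded this way can be compared with ordinary Euclidean distance, and no parameters are trained, which matches the training-free spirit of the abstract. Real backbone features would have far more than two dimensions and would need a general eigendecomposition.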


[145] Distill3R: A Pipeline for Democratizing 3D Foundation Models on Commodity Hardware cs.CV

Brandon Leblanc, Charalambos Poullis

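TL;DR: This paper proposes Distill3R, a framework that distills the geometric reasoning of large 3D foundation models into compact student models trainable on a single workstation, lowering the compute barrier for 3D vision research.

Details

Motivation: Training large-scale 3D foundation models depends on massive compute clusters, which keeps most academic laboratories from participating. The paper aims to democratize 3D vision research.

Result: The proposed 72M-parameter student has 9x fewer parameters than its 650M-parameter teacher and runs inference 5x faster. It trains in under 3 days on a single workstation, whereas the teacher requires a GPU cluster for up to a week. The student preserves the structural consistency and geometric understanding needed for functional 3D awareness.

Insight: The main innovations are (1) an offline caching pipeline that decouples heavy teacher inference from the training loop through compressed supervision signals, and (2) a confidence-aware distillation loss that leverages teacher uncertainty and enables training on commodity hardware. The work offers a reproducible, low-cost research baseline for laboratories without large-scale compute.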

Abstract: While multi-view 3D reconstruction has shifted toward large-scale foundation models capable of inferring globally consistent geometry, their reliance on massive computational clusters for training has created a significant barrier to entry for most academic laboratories. To bridge this compute divide, we introduce Distill3R, a framework designed to distill the geometric reasoning of 3D foundation models into compact students fully trainable on a single workstation. Our methodology centers on two primary innovations: (1) an offline caching pipeline that decouples heavy teacher inference from the training loop through compressed supervision signals, and (2) a confidence-aware distillation loss that leverages teacher uncertainty to enable training on commodity hardware. We propose a 72M-parameter student model which achieves a 9x reduction in parameters and a 5x inference speedup compared to its 650M-parameter teacher. The student is fully trainable in under 3 days on a single workstation, whereas its teacher requires massive GPU clusters for up to a week. We demonstrate that the student preserves the structural consistency and qualitative geometric understanding required for functional 3D awareness. By providing a reproducible, single-workstation training recipe, Distill3R serves as an exploratory entry point for democratized 3D vision research and efficient edge deployment. This work is not intended to compete with state-of-the-art foundation models, but to provide an accessible research baseline for laboratories without access to large-scale compute to train and specialize models on their own domain-specific data at minimal cost.
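
One simple way to let teacher uncertainty shape a distillation loss is to weight each per-point error by the teacher's confidence. The sketch below shows that idea only; the weighted-L1 form, the function name, and the flat-list inputs are assumptions for illustration and are not the loss defined in the Distill3R paper.

```python
def confidence_weighted_l1(student, teacher, confidence):
    """Confidence-weighted L1 distillation loss (illustrative assumption).

    student, teacher: per-point predictions, e.g. depths (hypothetical).
    confidence: non-negative teacher confidence per point; points the
    teacher is unsure about contribute less to the loss.
    """
    if not (len(student) == len(teacher) == len(confidence)):
        raise ValueError("inputs must have the same length")
    total_weight = sum(confidence)
    if total_weight == 0:
        return 0.0  # nothing the teacher is confident about
    weighted_error = sum(
        c * abs(s - t) for s, t, c in zip(student, teacher, confidence)
    )
    return weighted_error / total_weight


# The second point disagrees with the teacher, but the teacher has zero
# confidence there, so it does not penalize the student.
print(confidence_weighted_l1([1.0, 2.0], [1.0, 4.0], [1.0, 0.0]))
print(confidence_weighted_l1([1.0, 2.0], [1.0, 4.0], [1.0, 1.0]))
```

In an offline-caching setup like the one the abstract describes, `teacher` and `confidence` would be read from precomputed files rather than produced by running the teacher inside the training loop.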


[146] OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection cs.CV

Kunal Mahatha, Ali Bahri, Pierre Marza, Sahar Dastani, Maria Vakalopoulou

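TL;DR: This paper proposes OCTOPUS, a vision state space model (SSM) with enhanced spatial awareness. By performing discrete recurrence along eight principal orientations (horizontal, vertical, and diagonal), it captures both the global context and the local spatial structure of images while keeping the linear complexity of SSMs.

Details

Motivation: The causal formulation of existing vision SSMs breaks the inherent spatial relationships among pixels or patches. As a result they fail to capture local spatial coherence, often linking non-adjacent patches while ignoring visually correlated neighbors.

Result: On classification and segmentation benchmarks, OCTOPUS shows notable improvements in boundary preservation and region consistency while maintaining relatively better classification accuracy than existing V-SSM-based models.

Insight: The novelty is a multi-directional (eight-way) recurrence mechanism that allows effective information exchange across all spatially connected regions while keeping unrelated patches independent. It provides a scalable foundation for building spatially aware, computationally efficient vision architectures.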

Abstract: State space models (SSMs) have recently emerged as an alternative to transformers due to their ability to model global relationships in text with linear complexity. However, their success in vision tasks has been limited due to their causal formulation, which is suitable for sequential text but detrimental in the spatial domain where causality breaks the inherent spatial relationships among pixels or patches. As a result, standard SSMs fail to capture local spatial coherence, often linking non-adjacent patches while ignoring neighboring ones that are visually correlated. To address these limitations, we introduce OCTOPUS, a novel architecture that preserves both global context and local spatial structure within images, while maintaining the linear complexity of SSMs. OCTOPUS performs discrete recurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions, allowing effective information exchange across all spatially connected regions while maintaining independence among unrelated patches. This design enables multi-directional recurrence, capturing both global context and local spatial structure with SSM-level efficiency. In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency, as evident from the segmentation results, while maintaining relatively better classification accuracy compared to existing V-SSM based models. These results suggest that OCTOPUS can serve as a foundation method for multi-directional recurrence, a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.
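
Recurrence along eight orientations can be sketched with a toy scalar recurrence on a patch grid. The recurrence `state = decay * state + x`, the summation of the eight directional outputs, and the function names are assumptions for illustration; the actual OCTOPUS state space layers are not specified in the abstract.

```python
def line_families(h, w):
    """Group grid cells into rows, columns, diagonals, and anti-diagonals."""
    rows = [[(r, c) for c in range(w)] for r in range(h)]
    cols = [[(r, c) for r in range(h)] for c in range(w)]
    diag, anti = {}, {}
    for r in range(h):
        for c in range(w):
            diag.setdefault(r - c, []).append((r, c))  # top-left to bottom-right
            anti.setdefault(r + c, []).append((r, c))  # top-right to bottom-left
    return [rows, cols, list(diag.values()), list(anti.values())]


def eight_way_scan(x, decay=0.5):
    """Sum of forward and backward scalar recurrences along four line
    families, i.e. eight directions. Each cell is visited eight times, so
    the cost stays linear in the number of patches."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for family in line_families(h, w):
        for line in family:
            for path in (line, line[::-1]):  # forward and backward
                state = 0.0
                for r, c in path:
                    state = decay * state + x[r][c]
                    out[r][c] += state
    return out


# A single active patch at the top-left reaches its right, lower, and
# diagonal neighbours directly, each through one scan direction.
print(eight_way_scan([[1.0, 0.0], [0.0, 0.0]]))
```

Because each line is scanned independently, patches that do not share a row, column, or diagonal exchange no information within a single pass, which mirrors the independence among unrelated patches described in the abstract.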


[147] ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models cs.CV

Dhruv Parikh, Haoyang Fan, Rajgopal Kannan, Viktor Prasanna

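TL;DR: This paper proposes ConsensusDrop, a training-free framework for efficient inference in vision-language models (VLMs). It fuses vision-encoder saliency (visual saliency) with LLM cross-attention (cross-modal saliency) into a consensus ranking, selects the most informative visual tokens, and merges redundant tokens under encoder guidance, which substantially reduces the number of visual tokens and the compute cost while preserving accuracy.

Details

Motivation: Existing visual token reduction methods typically use either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly), and neither signal is sufficient on its own. The paper fuses these two complementary signals for more efficient and more accurate visual token selection.

Result: On several open-source VLMs, including LLaVA-1.5/NeXT and Video-LLaVA, ConsensusDrop consistently outperforms prior pruning methods under the same token budget and delivers a better accuracy-efficiency Pareto frontier. It stays close to baseline accuracy even under aggressive token reduction while reducing time to first token (TTFT) and the KV cache footprint.

Insight: The core novelty is a training-free consensus ranking mechanism that fuses visual and cross-modal saliency to reconcile two asymmetric but complementary signals, complemented by encoder-guided token merging to compress redundant information. It offers a new and effective post-training optimization approach for efficient VLM inference.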

Abstract: Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available inside the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier, preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint. Our code will be open-sourced.
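
The idea of deriving one ranking from two saliency signals can be sketched with simple rank fusion. Summing per-signal ranks (a Borda-style rule) is an assumption chosen for illustration, as are the function names; the abstract does not state how ConsensusDrop reconciles the two signals, and the encoder-guided merging of dropped tokens is omitted here.

```python
def ranks(scores):
    """Rank position of each token (0 = most salient). Ties keep index order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    position = [0] * len(scores)
    for pos, idx in enumerate(order):
        position[idx] = pos
    return position


def consensus_keep(visual_saliency, cross_modal_saliency, k):
    """Indices of the k tokens with the best combined rank.

    Rank fusion is used instead of adding raw scores because the two
    signals live on different scales (an illustrative design choice).
    """
    if len(visual_saliency) != len(cross_modal_saliency):
        raise ValueError("both signals must score the same tokens")
    rv = ranks(visual_saliency)
    rc = ranks(cross_modal_saliency)
    fused = [a + b for a, b in zip(rv, rc)]
    best = sorted(range(len(fused)), key=lambda i: (fused[i], i))[:k]
    return sorted(best)  # keep original token order


visual = [0.9, 0.1, 0.5, 0.3]  # hypothetical vision-encoder saliency
cross = [0.2, 0.1, 0.8, 0.7]   # hypothetical query-aware cross-attention
print(consensus_keep(visual, cross, k=2))
```

Token 3 is favoured by cross-attention alone and token 0 by visual saliency alone; the fused ranking keeps the tokens on which the two signals agree best overall.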


[148] Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning cs.CV

Meng Luo, Bobo Li, Shanqing Xu, Shize Zhang, Qiuchan Chen

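TL;DR: This paper addresses the limited deep emotional understanding of multimodal large language models with an approach grounded in Theory of Mind (ToM). Specifically, it (1) introduces HitEmotion, a hierarchical benchmark that diagnoses capability breakpoints at different levels of cognitive depth; (2) proposes a ToM-guided reasoning chain that achieves faithful emotional reasoning by tracking mental states and calibrating cross-modal evidence; and (3) proposes TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Experiments show that the approach improves accuracy on emotion reasoning tasks and makes the reasoning more faithful and coherent.

Details

Motivation: Current multimodal large language models have limited capacity for deep emotional understanding. The authors argue that genuine affective intelligence requires explicitly modeling Theory of Mind, the cognitive basis from which emotions arise.

Result: On the proposed HitEmotion benchmark, experiments expose deep emotional reasoning deficits in current state-of-the-art models on cognitively demanding tasks. The proposed ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales.

Insight: The main novelty is systematically bringing Theory of Mind, a concept from cognitive science, into multimodal emotion reasoning, and building a hierarchical diagnostic benchmark, a ToM-guided reasoning framework, and a reinforcement learning method with process supervision from intermediate mental states. Together they form a practical toolkit for evaluating and enhancing cognition-based emotional understanding.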

Abstract: Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. We further introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. In conclusion, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs. Our dataset and code are available at: https://HitEmotion.github.io/.


[149] Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025 cs.CV | cs.AI | cs.NE | cs.RO

Phu-Hoa Pham, Chi-Nguyen Tran, Dao Sy Duy Minh, Nguyen Lam Phu Quy, Huynh Trung Kiet

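TL;DR: This paper presents the winning solutions of Team HCMUS_TheFangs in the NeurIPS 2025 Mouse vs. AI competition, covering both the visual robustness and the neural alignment tracks. For visual robustness, a lightweight two-layer CNN with Gated Linear Units and observation normalization achieves a final score of 95.4%. For neural alignment, a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating achieves top neural prediction performance with 17.8 million parameters.

Details

Motivation: Visual robustness and neural alignment are two key challenges in developing artificial agents that can match biological vision systems.

Result: The team achieves a final score of 95.4% on Track 1 (Visual Robustness) and top-1 neural prediction performance on Track 2 (Neural Alignment).

Insight: The work shows that architectural simplicity combined with targeted components benefits visual robustness, while deep, high-capacity models are effective for neural alignment. The analysis also finds a non-monotonic relationship between training steps and performance, with the best results at around 200K steps, which challenges conventional assumptions about model complexity in visuomotor learning.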

Abstract: Visual robustness and neural alignment remain critical challenges in developing artificial agents that can match biological vision systems. We present the winning approaches from Team HCMUS_TheFangs for both tracks of the NeurIPS 2025 Mouse vs. AI: Robust Visual Foraging Competition. For Track 1 (Visual Robustness), we demonstrate that architectural simplicity combined with targeted components yields superior generalization, achieving a 95.4% final score with a lightweight two-layer CNN enhanced by Gated Linear Units and observation normalization. For Track 2 (Neural Alignment), we develop a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating that achieves top-1 neural prediction performance with 17.8 million parameters. Our systematic analysis of ten model checkpoints trained for between 60K and 1.14M steps reveals that training duration exhibits a non-monotonic relationship with performance, with optimal results achieved around 200K steps. Through comprehensive ablation studies and failure case analysis, we provide insights into why simpler architectures excel at visual robustness while deeper models with increased capacity achieve better neural alignment. Our results challenge conventional assumptions about model complexity in visuomotor learning and offer practical guidance for developing robust, biologically-inspired visual agents.


[150] VAMOS-OCTA: Vessel-Aware Multi-Axis Orthogonal Supervision for Inpainting Motion-Corrupted OCT Angiography Volumes cs.CV

Nick DiSanto, Ehsan Khodapanah Aghdam, Han Liu, Jacob Watson, Yuankai K. Tao

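TL;DR: VAMOS-OCTA is a deep learning framework for repairing motion artifacts in handheld optical coherence tomography angiography (OCTA). It uses a 2.5D U-Net architecture with a novel vessel-aware multi-axis orthogonal supervision loss and reconstructs a motion-corrupted center B-scan from its neighboring B-scans, aiming to improve both cross-sectional B-scan sharpness and volumetric projection accuracy.

Details

Motivation: Handheld OCTA is prone to motion artifacts in uncooperative or pediatric patients. Motion during 3D acquisition leaves regions unsampled, which produces blank bands in en face projections and severely degrades image quality. Existing methods focus mainly on restoring the en face maximum intensity projection, whereas this work jointly enhances cross-sectional B-scans and volumetric projections.

Result: The model is trained and evaluated on synthetic and real-world corrupted volumes using perceptual quality and pixel-wise accuracy metrics. VAMOS-OCTA consistently outperforms prior methods, producing reconstructions with sharp capillaries, restored vessel continuity, and clean en face projections, reaching state-of-the-art performance.

Insight: The novelty is the vessel-aware multi-axis orthogonal supervision loss, which combines vessel-weighted intensity reconstruction with axial and lateral projection consistency and provides a strong constraint for restoring motion-degraded 3D OCTA data. Viewed objectively, the combination of multi-axis supervision and a 2.5D U-Net makes effective use of 3D context and is a promising approach for repairing motion artifacts in medical images.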

Abstract: Handheld Optical Coherence Tomography Angiography (OCTA) enables noninvasive retinal imaging in uncooperative or pediatric subjects, but is highly susceptible to motion artifacts that severely degrade volumetric image quality. Sudden motion during 3D acquisition can lead to unsampled retinal regions across entire B-scans (cross-sectional slices), resulting in blank bands in en face projections. We propose VAMOS-OCTA, a deep learning framework for inpainting motion-corrupted B-scans using vessel-aware multi-axis supervision. We employ a 2.5D U-Net architecture that takes a stack of neighboring B-scans as input to reconstruct a corrupted center B-scan, guided by a novel Vessel-Aware Multi-Axis Orthogonal Supervision (VAMOS) loss. This loss combines vessel-weighted intensity reconstruction with axial and lateral projection consistency, encouraging vascular continuity in native B-scans and across orthogonal planes. Unlike prior work that focuses primarily on restoring the en face MIP, VAMOS-OCTA jointly enhances both cross-sectional B-scan sharpness and volumetric projection accuracy, even under severe motion corruptions. We trained our model on both synthetic and real-world corrupted volumes and evaluated its performance using both perceptual quality and pixel-wise accuracy metrics. VAMOS-OCTA consistently outperforms prior methods, producing reconstructions with sharp capillaries, restored vessel continuity, and clean en face projections. These results demonstrate that multi-axis supervision offers a powerful constraint for restoring motion-degraded 3D OCTA data. Our source code is available at https://github.com/MedICL-VU/VAMOS-OCTA.
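
A loss that combines vessel-weighted intensity reconstruction with projection consistency along two axes can be sketched on a single toy B-scan. The L1 terms, the use of maximum-intensity projections, the weights `vessel_weight` and `lam`, and the function names are all assumptions for illustration; the published VAMOS loss may differ in each of these choices.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def vessel_aware_loss(pred, target, vessel_mask, vessel_weight=2.0, lam=0.5):
    """Toy loss for one B-scan stored as a list of rows (depth x width).

    vessel_mask: 1 where a pixel belongs to a vessel, else 0 (hypothetical).
    vessel_weight, lam: illustrative hyperparameters, not from the paper.
    """
    # Vessel-weighted L1 reconstruction: vessel pixels count more.
    num = den = 0.0
    for p_row, t_row, m_row in zip(pred, target, vessel_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = 1.0 + vessel_weight * m
            num += w * abs(p - t)
            den += w
    recon = num / den

    # Projection consistency: compare maximum-intensity projections taken
    # along depth (one value per column) and along width (one per row).
    axial = mean_abs_diff([max(col) for col in zip(*pred)],
                          [max(col) for col in zip(*target)])
    lateral = mean_abs_diff([max(row) for row in pred],
                            [max(row) for row in target])
    return recon + lam * (axial + lateral)


target = [[1.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 0]]
blank = [[0.0, 0.0], [0.0, 0.0]]  # a fully blank (motion-corrupted) guess
print(vessel_aware_loss(blank, target, mask))
print(vessel_aware_loss(target, target, mask))
```

A blank prediction is penalized both for the missed vessel pixel and for the missing signal in each projection, while a perfect reconstruction gives zero loss.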


[151] SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning cs.CV

Zihao Zhao, Shengting Cao, Muchao Ye

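TL;DR: This paper proposes SRVAU-R1, a reflection-aware learning framework that improves deep reasoning in video anomaly understanding (VAU) by strengthening the self-reflection and self-correction abilities of multi-modal large language models (MLLMs). It builds the first reflection-oriented Chain-of-Thought dataset for VAU and designs a learning paradigm that combines supervised fine-tuning with reinforcement fine-tuning.

Details

Motivation: Most existing MLLM-based VAU methods stop at surface-level descriptions of anomalies and lack deep reasoning about abnormal behavior, such as explicit self-reflection and self-correction.

Result: Extensive experiments on multiple video anomaly benchmarks show that SRVAU-R1 consistently outperforms existing methods, with significant gains in both temporal anomaly localization accuracy and reasoning quality.

Insight: The novelty lies in systematically introducing reflection into MLLM reasoning. A reflection-oriented Chain-of-Thought dataset and a fine-tuning paradigm that combines supervised and reinforcement learning strengthen the model's deep understanding of video anomalies and its ability to correct itself.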

Abstract: Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies, lacking deep reasoning over abnormal behaviors like explicit self-reflection and self-correction. To address that, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection in MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Based on that, it includes a novel reflection-aware learning paradigm with supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.


[152] FUSE-Flow: Scalable Real-Time Multi-View Point Cloud Reconstruction Using Confidence cs.CV

Chentian Sun

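TL;DR: This paper proposes FUSE-Flow, a linearly scalable streaming framework for real-time multi-view point cloud reconstruction. It processes each frame independently with a stateless design, suppresses noise through fusion weighted by measurement confidence and 3D distance consistency, and introduces weighted aggregation based on adaptive spatial hashing for efficient processing in large-scale multi-camera settings.

Details

Motivation: Fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints involves high computational complexity, heavy memory usage, and limited scalability. Existing methods based on voxel fusion, temporal accumulation, or global optimization struggle to deliver real-time performance, reconstruction quality, and multi-camera scalability at the same time.

Result: Experiments show that the framework improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes while maintaining real-time frame rates on modern GPUs, supporting its effectiveness, robustness, and scalability.

Insight: The novelty is a frame-wise, stateless, linearly scalable streaming reconstruction framework, together with a dual-weight fusion strategy based on measurement confidence and 3D distance consistency and a weighted aggregation method based on adaptive spatial hashing. The aggregation handles both sparse and dense regions efficiently, enabling high-throughput, low-latency point cloud generation and fusion.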

Abstract: Real-time multi-view point cloud reconstruction is a core problem in 3D vision and immersive perception, with wide applications in VR, AR, robotic navigation, digital twins, and computer interaction. Despite advances in multi-camera systems and high-resolution depth sensors, fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints remains challenging. Existing methods relying on voxel-based fusion, temporal accumulation, or global optimization suffer from high computational complexity, excessive memory usage, and limited scalability, failing to simultaneously achieve real-time performance, reconstruction quality, and multi-camera extensibility. We propose FUSE-Flow, a frame-wise, stateless, and linearly scalable point cloud streaming reconstruction framework. Each frame independently generates point cloud fragments, which are fused using two weights, measurement confidence and 3D distance consistency, to suppress noise while preserving geometric details. For large-scale multi-camera efficiency, we introduce an adaptive spatial hashing-based weighted aggregation method: 3D space is adaptively partitioned by local point cloud density, representative points are selected per cell, and weighted fusion is performed to handle both sparse and dense regions. With GPU parallelization, FUSE-Flow achieves high-throughput, low-latency point cloud generation and fusion with linear complexity. Experiments demonstrate that the framework improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes, while maintaining real-time frame rates on modern GPUs, verifying its effectiveness, robustness, and scalability.
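
The idea of hashing points into spatial cells and fusing each cell with a weighted average can be sketched in a few lines. This is a simplified illustration, not the paper's method: it uses a fixed cell size rather than density-adaptive partitioning, runs on the CPU, and assumes the two weights are simply multiplied. The name `fuse_points` and its parameters are hypothetical.

```python
from collections import defaultdict
from math import floor

def fuse_points(points, confidences, consistencies, cell_size):
    """Fuse 3D points that share a grid cell into one weighted-average point.

    Each point's weight is confidence * consistency (an assumed combination rule).
    Returns a dict mapping integer cell keys to fused (x, y, z) tuples.
    """
    # Per-cell accumulator: [sum w*x, sum w*y, sum w*z, sum w]
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for (x, y, z), conf, cons in zip(points, confidences, consistencies):
        w = conf * cons
        key = (floor(x / cell_size), floor(y / cell_size), floor(z / cell_size))
        acc = sums[key]
        acc[0] += w * x
        acc[1] += w * y
        acc[2] += w * z
        acc[3] += w
    # Cells whose total weight is zero carry no usable measurement and are dropped.
    return {key: (sx / sw, sy / sw, sz / sw)
            for key, (sx, sy, sz, sw) in sums.items() if sw > 0.0}
```

Each point is touched once and each cell is finalized once, so the cost grows linearly with the number of points, which is the property the abstract emphasizes.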


[153] VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models cs.CV | cs.AI

Guangshuo Qin, Zhiteng Li, Zheng Chen, Weihang Zhang, Linghe Kong

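TL;DR: This paper proposes Visual Expert Quantization (VEQ), a modality-adaptive quantization framework for efficiently compressing Mixture-of-Experts (MoE) vision-language models (VLMs). By accounting for the differences between the vision and language modalities and for the non-uniform contribution of different experts, it designs modality-expert-aware and modality-affinity-aware quantization strategies that improve post-training quantization (PTQ) results.

Details

Motivation: MoE VLMs perform very well but incur prohibitive compute and memory costs, so they need compression. Existing quantization methods overlook two key forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts.

Result: Under the W3A16 configuration, the method achieves average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL over previous SOTA quantization methods, and shows superior robustness across a variety of multimodal tasks.

Insight: The novelty is a dual-aware quantization framework that adapts to both cross-modal differences and expert heterogeneity. It uses expert activation frequency to prioritize error minimization for pivotal experts, and it integrates token-expert affinity with modality information to build an enhanced Hessian matrix that guides calibration. This offers a new approach to efficient quantization of heterogeneous MoE models.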

Abstract: Mixture-of-Experts (MoE) Vision-Language Models (VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts. To bridge this gap, we propose Visual Expert Quantization (VEQ), a dual-aware quantization framework designed to simultaneously accommodate cross-modal differences and heterogeneity between experts. Specifically, VEQ incorporates 1) Modality-expert-aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and 2) Modality-affinity-aware Quantization, which constructs an enhanced Hessian matrix by integrating token-expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state-of-the-art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multimodal tasks. Our code will be available at https://github.com/guangshuoqin/VEQ.


[154] From Videos to Conversations: Egocentric Instructions for Task Assistance cs.CV

Lavisha Aggarwal, Vikas Bahirwani, Andrea Colaco

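TL;DR: This paper proposes an automatic framework, based on large language models, that converts single-person instructional videos into two-person multimodal task-guidance conversations, addressing the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution. Using this framework, the authors build the HowToDIV dataset, comprising 507 conversations, 6,636 question-answer pairs, and 24 hours of video across multiple domains, and report baseline results for Gemma 3 and Qwen 2.5 on it.

Details

Motivation: Progress on AI agents for augmented reality (AR) assistance is bottlenecked by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, which stems in part from the high cost and logistical complexity of human-assisted data collection.

Result: Gemma 3 and Qwen 2.5 are benchmarked on the newly built HowToDIV multimodal dataset, providing an initial benchmark for multimodal procedural task assistance.

Insight: The novelty is a fully automatic, scalable, and cost-efficient data generation framework that turns the large existing pool of single-person instructional videos into structured multi-turn expert-novice conversations, creating a new data source and method for training task-oriented dialogue systems.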

Abstract: Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.


[155] ReLayout: Versatile and Structure-Preserving Design Layout Editing via Relation-Aware Design Reconstruction cs.CV

Jiawei Lin, Shizhao Sun, Danqing Huang, Ting Liu, Ji Li

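TL;DR: This paper proposes ReLayout, a framework for automated design layout editing without manual adjustment. It introduces a relation graph as a layout-structure constraint on unedited elements and uses relation-aware design reconstruction to emulate the editing process under self-supervision, achieving versatile, structure-preserving layout editing without triplet data.

Details

Motivation: Design layout editing poses two main challenges: satisfying the user's editing intent while preserving the layout structure of unedited elements, and the lack of (original design, editing operation, edited design) triplets for training.

Result: Qualitative and quantitative experiments and user studies show that ReLayout significantly outperforms baseline models in editing quality, accuracy, and layout structure preservation.

Insight: The contributions are: standardizing user intent into four basic editing actions to overcome the ambiguity of natural language; introducing a relation graph as a structure-preservation constraint; proposing relation-aware design reconstruction, which sidesteps data scarcity through self-supervised learning; and using a multi-modal large language model as the backbone to unify multiple editing actions in a single model for versatile editing.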

Abstract: Automated redesign without manual adjustments marks a key step forward in the design workflow. In this work, we focus on a foundational redesign task termed design layout editing, which seeks to autonomously modify the geometric composition of a design based on user intents. To overcome the ambiguity of user needs expressed in natural language, we introduce four basic and important editing actions and standardize the format of editing operations. This underexplored task presents a unique challenge: satisfying specified editing operations while simultaneously preserving the layout structure of unedited elements. In addition, the scarcity of triplet (original design, editing operation, edited design) samples poses another formidable challenge. To this end, we present ReLayout, a novel framework for versatile and structure-preserving design layout editing that operates without triplet data. Specifically, ReLayout first introduces the relation graph, which contains the position and size relationships among unedited elements, as the constraint for layout structure preservation. Then, relation-aware design reconstruction (RADR) is proposed to bypass the data challenge. By learning to reconstruct a design from its elements, a relation graph, and a synthesized editing operation, RADR effectively emulates the editing process in a self-supervised manner. A multi-modal large language model serves as the backbone for RADR, unifying multiple editing actions within a single model and thus achieving versatile editing after fine-tuning. Qualitative and quantitative results, together with user studies, show that ReLayout significantly outperforms the baseline models in terms of editing quality, accuracy, and layout structure preservation.


[156] Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance cs.CV | cs.AI

Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong

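TL;DR: This paper proposes Residual Decoding (ResDec), a training-free method for mitigating hallucinations in large vision-language models (LVLMs). It uses historical information and the model's internal implicit reasoning mechanism to correct biases during decoding, so that generated content is more relevant to the visual input.

Details

Motivation: Although LVLMs perform well on multimodal tasks, they are easily swayed by language priors and produce hallucinations, that is, output that is grammatically correct but not grounded in the visual input.

Result: Extensive experiments show that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. It also performs very well on comprehensive LVLM benchmarks, indicating broad applicability.

Insight: The novelty is a decoding strategy based on history-aware residual guidance that needs no additional training and uses the evolution of the model's token logits to correct biases, providing a lightweight and efficient way to mitigate hallucinations in LVLMs.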

Abstract: Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to actual visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.


[157] Radioactive 3D Gaussian Ray Tracing for Tomographic Reconstruction cs.CV

Ling Chen, Bao Yang

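TL;DR: This paper proposes a tomographic reconstruction framework based on 3D Gaussian ray tracing. It targets two weaknesses of existing 3D Gaussian Splatting (3DGS) methods in tomography, both caused by their local affine approximation: reduced accuracy and difficulty in applying nonlinear geometric corrections. By analytically computing the line integral of each ray through the 3D Gaussian primitives, the method provides a more physically consistent forward projection model and makes nonlinear geometric corrections straightforward to apply precisely.

Details

Motivation: Existing 3DGS-based tomographic reconstruction methods (such as R2-Gaussian) use a local affine approximation to map each 3D Gaussian to a 2D Gaussian on the detector. This can reduce the quantitative accuracy of the reconstruction and complicates the incorporation of nonlinear geometric corrections (such as arc correction in PET). This paper aims to overcome these limitations.

Result: The paper claims that its method offers a more physically consistent forward projection than splatting-based models and easier integration of geometric corrections, but the abstract reports no specific quantitative results or benchmark comparisons.

Insight: The main novelty is moving the Gaussian-based paradigm from splatting-based approximate projection to ray-tracing-based analytic line integration. This avoids the local affine collapse, improves the physical consistency of the projection model, and gives explicit control over ray parameters, which simplifies the integration of nonlinear geometric corrections. It could extend Gaussian-based reconstruction to a wider range of realistic tomography systems.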

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged in computer vision as a promising rendering technique. By adapting the principles of Elliptical Weighted Average (EWA) splatting to a modern differentiable pipeline, 3DGS enables real-time, high-quality novel view synthesis. Building upon this, R2-Gaussian extended the 3DGS paradigm to tomographic reconstruction by rectifying integration bias, achieving state-of-the-art performance in computed tomography (CT). To enable differentiability, R2-Gaussian adopts a local affine approximation: each 3D Gaussian is locally mapped to a 2D Gaussian on the detector and composed via alpha blending to form projections. However, the affine approximation can degrade reconstruction quantitative accuracy and complicate the incorporation of nonlinear geometric corrections. To address these limitations, we propose a tomographic reconstruction framework based on 3D Gaussian ray tracing. Our approach provides two key advantages over splatting-based models: (i) it computes the line integral through 3D Gaussian primitives analytically, avoiding the local affine collapse and thus yielding a more physically consistent forward projection model; and (ii) the ray-tracing formulation gives explicit control over ray origins and directions, which facilitates the precise application of nonlinear geometric corrections, e.g., arc-correction used in positron emission tomography (PET). These properties extend the applicability of Gaussian-based reconstruction to a wider range of realistic tomography systems while improving projection accuracy.
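
The analytic line integral mentioned in point (i) has a closed form that follows from completing the square. For a ray x(t) = o + t d through a Gaussian a * exp(-0.5 (x - mu)^T P (x - mu)) with precision matrix P, set alpha = d^T P d, beta = (o - mu)^T P d, and gamma = (o - mu)^T P (o - mu); the integral over all t is a * sqrt(2 pi / alpha) * exp(-0.5 (gamma - beta^2 / alpha)). The sketch below is a minimal illustration under stated assumptions, not the paper's code: it integrates over the whole line rather than a finite source-to-detector segment, and the function name `gaussian_line_integral` and its parameters are hypothetical.

```python
from math import exp, pi, sqrt

def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gaussian_line_integral(origin, direction, mean, precision, amplitude=1.0):
    """Integrate amplitude * exp(-0.5 * (x - mean)^T P (x - mean)) along
    x(t) = origin + t * direction for t in (-inf, inf).

    `precision` is the symmetric positive-definite inverse covariance P.
    """
    delta = [o - m for o, m in zip(origin, mean)]
    p_d = mat_vec(precision, direction)
    alpha = dot(direction, p_d)                     # d^T P d
    beta = dot(delta, p_d)                          # (o - mu)^T P d
    gamma = dot(delta, mat_vec(precision, delta))   # (o - mu)^T P (o - mu)
    return amplitude * sqrt(2.0 * pi / alpha) * exp(-0.5 * (gamma - beta * beta / alpha))
```

Because the ray origin and direction enter the formula directly, a geometric correction that changes a ray only changes the arguments to this function, which is the flexibility point (ii) of the abstract refers to.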


[158] DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification cs.CV | cs.MM

Ying Shu, Pujian Zhan, Huiqi Yang, Hehe Fan, Youfang Lin

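TL;DR: This paper proposes DRFormer, a dual-regularized bidirectional Transformer framework for person re-identification. It combines the local texture features of vision foundation models (such as DINO) with the global semantic features of vision-language models (such as CLIP) to handle challenges such as occlusion and pose variation.

Details

Motivation: Existing methods rely mainly on a single paradigm and overlook the complementary potential of local detail features and global semantic features, both of which help with occlusion and pose variation in person re-identification.

Result: Extensive experiments on five benchmarks show that the method effectively harmonizes local and global representations and achieves performance competitive with state-of-the-art methods.

Insight: The novelty is a dual-regularization mechanism that integrates the strengths of DINO and CLIP, ensuring diverse feature extraction and balancing the contributions of the two models to improve the robustness of person re-identification.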

Abstract: Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges, such as occlusion and pose variations. Vision foundation models (e.g., DINO) excel at mining local textures, and vision-language models (e.g., CLIP) capture strong global semantic differences. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework to synergize their strengths by a Dual-Regularized Bidirectional Transformer (DRFormer). The dual-regularization mechanism ensures diverse feature extraction and achieves a better balance in the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.


[159] PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers cs.CV

Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai

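TL;DR: This paper proposes PISA (Piecewise Sparse Attention), a training-free attention mechanism that addresses the efficiency bottleneck caused by the quadratic complexity of attention in Diffusion Transformers. PISA uses a novel "exact-or-approximate" strategy: it computes critical blocks exactly and approximates non-critical blocks efficiently with a Taylor expansion. This covers the full attention span at sub-quadratic complexity and balances generation quality against speed.

Details

Motivation: Diffusion Transformers are central to image and video generation, but the quadratic cost of attention is an efficiency bottleneck. Existing block sparse attention methods accelerate computation by attending only to critical blocks, but at high sparsity they degrade because they discard context. The paper is motivated by the finding that the attention scores of non-critical blocks are distributionally stable, so they can be approximated efficiently and accurately instead of being dropped, a key insight for designing better sparse attention.

Result: Experiments show that PISA achieves 1.91x and 2.57x speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest generation quality among sparse attention methods. For image generation on FLUX, PISA still achieves a 1.2x speedup without compromising visual quality.

Insight: The claimed novelty is an "exact-or-approximate" paradigm that replaces the conventional "keep-or-drop" paradigm: non-critical blocks are approximated efficiently through block-wise Taylor expansion, making PISA a faithful proxy for full attention. Viewed objectively, the core idea is to exploit the distributional stability of attention scores for approximation, which suggests a new way to design sparse attention that is both efficient and preserves full context.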

Abstract: Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only to critical key-value blocks, it suffers from degradation at high sparsity because it discards context. In this work, we discover that attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, an observation that is important for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm that directly drops the non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy to full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves 1.91 times and 2.57 times speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2 times acceleration without compromising visual quality. Code is available at: https://github.com/xie-lab-ml/piecewise-sparse-attention.
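
One way to picture the exact-or-approximate idea is a first-order Taylor expansion of exp(q·k) around the mean key of each non-critical block: exp(q·k) ≈ exp(q·k_mean) * (1 + q·(k - k_mean)). Summed over a block, this needs only query-independent block statistics. The sketch below illustrates that generic construction for a single query; it is an assumption about how such an approximation could work, not the paper's algorithm or kernel. It omits the usual 1/sqrt(d) scaling and max-subtraction for numerical stability, and it recomputes block statistics on every call for brevity, whereas a real implementation would cache them. All names are hypothetical.

```python
from math import exp

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_stats(keys, values):
    """Query-independent summary of one key/value block."""
    n = len(keys)
    dk, dv = len(keys[0]), len(values[0])
    k_mean = [sum(k[i] for k in keys) / n for i in range(dk)]
    v_sum = [sum(v[j] for v in values) for j in range(dv)]
    # cov[i][j] = sum over tokens of (k[i] - k_mean[i]) * v[j]
    cov = [[sum((k[i] - k_mean[i]) * v[j] for k, v in zip(keys, values))
            for j in range(dv)] for i in range(dk)]
    return n, k_mean, v_sum, cov

def piecewise_attention(q, blocks, critical):
    """Softmax attention output for one query.

    blocks: list of (keys, values) pairs; critical: set of block indices
    computed exactly. All other blocks use the first-order approximation.
    """
    dv = len(blocks[0][1][0])
    num = [0.0] * dv
    den = 0.0
    for idx, (keys, values) in enumerate(blocks):
        if idx in critical:
            # Exact path: one exponential per key.
            for k, v in zip(keys, values):
                w = exp(dot(q, k))
                den += w
                for j in range(dv):
                    num[j] += w * v[j]
        else:
            # Approximate path: one exponential per block.
            n, k_mean, v_sum, cov = block_stats(keys, values)
            base = exp(dot(q, k_mean))
            den += base * n  # the first-order term sums to zero in the denominator
            for j in range(dv):
                corr = sum(q[i] * cov[i][j] for i in range(len(q)))
                num[j] += base * (v_sum[j] + corr)
    return [x / den for x in num]
```

Unlike dropping a block, the approximate path still contributes the block's mass to the softmax denominator and its values to the output, so no context is discarded; the approximation is exact when all keys in a block coincide and degrades as the keys spread out.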


[160] MedAD-R1: Eliciting Consistent Reasoning in Interpretible Medical Anomaly Detection via Consistency-Reinforced Policy Optimization cs.CV

Haitao Zhang, Yingying Wang, Jiaxiang Wang, Haote Xu, Hongyang Zhang

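TL;DR: This paper proposes MedAD-R1, a model that improves the interpretability and reasoning consistency of medical anomaly detection (MedAD) through a two-stage training framework. It first builds MedAD-38K, the first large-scale, multi-modal, multi-center benchmark dataset, with diagnostic Chain-of-Thought (CoT) annotations and structured visual question-answering (VQA) pairs. The first stage uses supervised fine-tuning (SFT) to instill foundational medical knowledge and align the model with a think-then-answer paradigm. The second stage introduces Consistency Group Relative Policy Optimization (Con-GRPO), whose consistency reward keeps the reasoning process logically coherent with the final diagnosis.

Details

Motivation: Existing MedAD models rely on supervised fine-tuning on simplistic, fragmented datasets, which leads to weak reasoning and poor multimodal generalization. Supporting clinical decisions requires more trustworthy and interpretable models.

Result: MedAD-R1 achieves SOTA performance on the MedAD-38K benchmark, outperforming strong baselines by more than 10%, and can generate transparent, logically consistent reasoning pathways.

Insight: The contributions are MedAD-38K, the first large-scale multi-modal benchmark for medical anomaly detection, and a two-stage training framework (Cognitive Injection followed by Con-GRPO). Con-GRPO uses a consistency reward to strengthen the logical link between reasoning and diagnosis, offering a new direction for interpretable AI.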

Abstract: Medical Anomaly Detection (MedAD) presents a significant opportunity to enhance diagnostic accuracy using Large Multimodal Models (LMMs) to interpret and answer questions based on medical images. However, the reliance on Supervised Fine-Tuning (SFT) on simplistic and fragmented datasets has hindered the development of models capable of plausible reasoning and robust multimodal generalization. To overcome this, we introduce MedAD-38K, the first large-scale, multi-modal, and multi-center benchmark for MedAD featuring diagnostic Chain-of-Thought (CoT) annotations alongside structured Visual Question-Answering (VQA) pairs. On this foundation, we propose a two-stage training framework. The first stage, Cognitive Injection, uses SFT to instill foundational medical knowledge and align the model with a structured think-then-answer paradigm. Given that standard policy optimization can produce reasoning that is disconnected from the final answer, the second stage incorporates Consistency Group Relative Policy Optimization (Con-GRPO). This novel algorithm incorporates a crucial consistency reward to ensure the generated reasoning process is relevant and logically coherent with the final diagnosis. Our proposed model, MedAD-R1, achieves state-of-the-art (SOTA) performance on the MedAD-38K benchmark, outperforming strong baselines by more than 10%. This superior performance stems from its ability to generate transparent and logically consistent reasoning pathways, offering a promising approach to enhancing the trustworthiness and interpretability of AI for clinical decision support.
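
The group-relative part of GRPO-style training normalizes each sampled response's reward by the mean and standard deviation of its group. The sketch below pairs that normalization with a toy consistency-aware reward. The reward design is an assumption made for illustration only: the abstract does not specify how consistency is scored, so `consistency_reward`, its additive `bonus`, and the string comparison between the answer implied by the reasoning and the final answer are all hypothetical.

```python
from statistics import mean, pstdev

def consistency_reward(correct, reasoning_answer, final_answer, bonus=0.5):
    """Toy reward: 1.0 for a correct final answer, plus a bonus when the
    conclusion reached in the reasoning matches the final answer.

    The additive form and the bonus value are illustrative assumptions.
    """
    base = 1.0 if correct else 0.0
    consistent = reasoning_answer == final_answer
    return base + (bonus if consistent else 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With a reward of this shape, a response whose reasoning contradicts its own answer receives a lower advantage than an equally accurate response whose reasoning agrees with it, which is the kind of pressure a consistency reward is meant to apply.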


[161] PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space cs.CV

Jinghong Zheng, Changlong Jiang, Yang Xiao, Jiaqi Li, Haohong Kuang

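TL;DR: This paper proposes PandaPose, a new method for lifting 3D human pose from a single RGB image. It propagates the 2D pose prior into a 3D anchor space that serves as a unified intermediate representation, addressing the error propagation and self-occlusion problems of existing methods.

Details

Motivation: Existing methods usually build a direct joint-to-joint mapping from 2D to 3D pose based on 2D features. This has two fundamental limitations: errors in the predicted input 2D pose inevitably propagate to the 3D prediction, and self-occlusion is inherently hard to handle.

Result: Experiments on three benchmarks (Human3.6M, MPI-INF-3DHP, and 3DPW) demonstrate the method's superiority. Under the challenging conditions of Human3.6M, error is reduced by 14.7% relative to SOTA methods, and both qualitative and quantitative comparisons show the method's effectiveness and robustness.

Insight: The novelty is a unified 3D anchor space used as the intermediate representation. It comprises: 1) joint-wise 3D anchors in the canonical coordinate system, which provide accurate and robust priors that mitigate 2D pose estimation errors; 2) depth-aware joint-wise feature lifting, which hierarchically integrates depth information to resolve self-occlusion ambiguities; and 3) an anchor-feature interaction decoder, which combines the 3D anchors with the lifted features to produce unified anchor queries that encapsulate the joint-wise 3D anchor set, visual cues, and geometric depth information, and that drive anchor-to-joint ensemble prediction.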

Abstract: 3D human pose lifting from a single RGB image is a challenging task in 3D vision. Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from the predicted input 2D pose to the 3D predictions and inherent difficulties in handling self-occlusion cases. In this paper, we propose PandaPose, a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Specifically, our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies. (2) Depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities. (3) The anchor-feature interaction decoder that incorporates 3D anchors with lifted features to generate unified anchor queries encapsulating joint-wise 3D anchor set, visual cues and geometric depth information. The anchor queries are further employed to facilitate anchor-to-joint ensemble prediction. Experiments on three well-established benchmarks (i.e., Human3.6M, MPI-INF-3DHP and 3DPW) demonstrate the superiority of our approach. The substantial reduction in error by 14.7% compared to SOTA methods on the challenging conditions of Human3.6M and qualitative comparisons further showcase the effectiveness and robustness of our approach.


[162] Koo-Fu CLIP: Closed-Form Adaptation of Vision-Language Models via Fukunaga-Koontz Linear Discriminant Analysis cs.CV

Matej Suchanek, Klara Janouskova, Ondrej Vasatko, Jiri Matas

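TL;DR: This paper proposes Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga-Koontz Linear Discriminant Analysis (LDA). Operating in a whitened embedding space, it suppresses within-class variation and enhances between-class discrimination, reshaping the geometry of CLIP embeddings with a closed-form linear projection. This improves class separability and provides effective dimensionality reduction, giving a lightweight and efficient adaptation of CLIP representations.

Details

Motivation: Vision-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification and typically show limited class separation and excessive dimensionality. This paper addresses the problem by adapting CLIP embeddings with supervision to improve classification performance.

Result: On large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space raises top-1 accuracy on ImageNet-1K from 75.1% to 79.1%. Gains remain consistent as the label space expands to 14K and 21K classes. The method also supports compression of up to 10-12x with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.

Insight: The claimed novelty is applying Fukunaga-Koontz LDA to the supervised adaptation of CLIP embeddings: a closed-form linear projection (no iterative optimization) that optimizes within-class and between-class variance together and reduces dimensionality. Viewed objectively, the core idea is combining a classical discriminant analysis technique with a modern large-scale pretrained model (CLIP) to obtain an adaptation strategy that is computationally cheap, interpretable, and markedly improves classification. It remains effective with very large label spaces and supports high compression ratios, which matters for practical deployment.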

Abstract: Vision-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga-Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight and efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports substantial compression of up to 10-12x with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.
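
The closed-form projection can be understood from the textbook Fukunaga-Koontz construction, sketched below. This is the standard derivation, stated under the assumption that the total scatter matrix is full rank (otherwise the same steps apply within its range); whether the paper uses exactly this variant, or adds regularization, is not stated in the abstract.

```latex
% S_w: within-class scatter, S_b: between-class scatter of the embeddings.
% Whitening the total scatter:
S_t = S_w + S_b = U \Lambda U^\top, \qquad W = \Lambda^{-1/2} U^\top
\quad\Longrightarrow\quad W S_t W^\top = I .

% In the whitened space the two scatter matrices sum to the identity:
\tilde{S}_w = W S_w W^\top, \qquad \tilde{S}_b = W S_b W^\top, \qquad
\tilde{S}_w + \tilde{S}_b = I .

% Hence they share eigenvectors, with complementary eigenvalues:
\tilde{S}_w v = \lambda v \;\Longrightarrow\; \tilde{S}_b v = (1 - \lambda)\, v .

% Projection onto the k directions with the smallest within-class eigenvalues:
z = V_k^\top W x, \qquad V_k = [v_1, \dots, v_k] .
```

Directions with small within-class eigenvalue are automatically those with large between-class eigenvalue, so a single eigendecomposition both ranks directions by discriminative power and gives the dimensionality reduction by truncating to k of them.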


[163] Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs cs.CV | cs.RO

Daniel Yezid Guarnizo Orjuela, Leonardo Scappatura, Veronica Di Gennaro, Riccardo Andrea Izzo, Gianluca Bardaro

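TL;DR: This paper addresses the fragility of Vision-Language-Action (VLA) models to image corruptions (such as electronic noise, dead pixels, and lens contaminants) in real-world deployment. It proposes the Corruption Restoration Transformer (CRT), a plug-and-play, model-agnostic vision transformer. Trained with an adversarial objective, CRT restores clean observations from corrupted visual inputs without expensive fine-tuning of the underlying VLA model, substantially improving robustness to visual disturbances.

Details

Motivation: Existing VLA models perform well in controlled environments, but in real-world deployment their performance is highly sensitive to sensor-level image corruptions (such as noise, dead pixels, and lens smudges), which cause success rates to collapse and seriously hinder reliable use. Prior work focuses mainly on physical occlusions caused by scene geometry, while image corruption remains largely unexplored.

Result: Extensive experiments on the LIBERO and Meta-World benchmarks show that CRT effectively recovers the performance that VLA models lose to visual corruption. For example, under common signal artifacts, the success rate of state-of-the-art VLAs (such as π₀.₅ and SmolVLA) can drop from 90% to as low as 2%, whereas with CRT the models maintain near-baseline success rates even under severe corruption.

Insight: The novelty is the first systematic quantification of, and remedy for, VLA fragility to sensor-level image corruptions, together with CRT, a model-agnostic, plug-and-play restoration module. The core insight is that a separate visual restoration module trained adversarially can improve the robustness of the whole system at its source (the visual input) without modifying or fine-tuning the complex VLA model, making it an efficient and general defense.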

Abstract: Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as π₀.₅ and SmolVLA suffer catastrophic performance degradation, dropping from 90% success rates to as low as 2%, under common signal artifacts. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play and model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates, even under severe visual corruption.


[164] Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models cs.CV | cs.AI

Chunliang Hua, Zeyuan Yang, Lei Zhang, Jiayang Sun, Fengwen Chen

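TL;DR: This paper proposes a new framework for assessing UAV emergency landing sites using remote sensing imagery and multimodal large language models. A coarse-to-fine pipeline combines semantic segmentation with vision-language reasoning to identify complex semantic risks that traditional geometric methods miss.

Details

Motivation: Geometric sensors alone cannot recognize semantic risks (such as crowds or temporary structures) during UAV emergency landing. The goal is landing site safety assessment that is aware of global context.

Result: On the newly constructed Emergency Landing Site Selection (ELSS) benchmark, the method significantly outperforms geometric baselines in risk identification accuracy and produces interpretable justifications for its decisions.

Insight: The novelty is fusing MLLMs with remote sensing imagery and Point-of-Interest (POI) data for fine-grained semantic reasoning, which increases trust in automated decisions. The coarse-to-fine multimodal fusion architecture could be reused in other safety-critical settings.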

Abstract: Safe UAV emergency landing requires more than just identifying flat terrain; it demands understanding complex semantic risks (e.g., crowds, temporary structures) invisible to traditional geometric sensors. In this paper, we propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for global context-aware landing site assessment. Unlike local geometric methods, our approach employs a coarse-to-fine pipeline: first, a lightweight semantic segmentation module efficiently pre-screens candidate areas; second, a vision-language reasoning agent fuses visual features with Point-of-Interest (POI) data to detect subtle hazards. To validate this approach, we construct and release the Emergency Landing Site Selection (ELSS) benchmark. Experiments demonstrate that our framework significantly outperforms geometric baselines in risk identification accuracy. Furthermore, qualitative results confirm its ability to generate human-like, interpretable justifications, enhancing trust in automated decision-making. The benchmark dataset is publicly accessible at https://anonymous.4open.science/r/ELSS-dataset-43D7.


[165] EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment cs.CV

Lancheng Gao, Ziheng Jia, Zixuan Xing, Wei Sun, Huiyu Duan

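TL;DR: This paper proposes the EEmo-Logic framework and the EEmoDB dataset for comprehensive assessment of image-evoked emotions. EEmoDB is the largest image-evoked emotion understanding dataset to date, with 1.2M QA pairs generated from 125k images (EEmoDB-QA) and a 36k fine-grained assessment set built from 25k images (EEmoDB-Assess). EEmo-Logic is a multimodal large language model (MLLM) trained with instruction fine-tuning and task-customized group relative preference optimization (GRPO), and it performs strongly on emotion QA and fine-grained assessment.

Details

Motivation: Existing models for image-evoked emotion understanding are limited either to coarse-grained emotion perception or by weak reasoning. Closing this gap requires a comprehensive dataset and framework that capture the multi-dimensional attributes and intensity nuances of emotion, in order to advance machine empathy and enrich human-computer interaction applications.

Result: Extensive experiments show that EEmo-Logic achieves robust performance on both in-domain and cross-domain datasets and excels at emotion QA and fine-grained assessment.

Insight: The contributions are: 1) EEmoDB, the largest comprehensive image-evoked emotion understanding dataset to date, covering 5 analysis dimensions and 5 task categories; and 2) EEmo-Logic, an all-in-one MLLM framework trained with instruction fine-tuning and task-customized GRPO with a novel reward design, which strengthens the model's understanding of and reasoning about emotional nuance.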

Abstract: Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features 5 analysis dimensions spanning 5 distinct task categories, facilitating comprehensive interpretation. Specifically, we compile 1.2M question-answering (QA) pairs (EEmoDB-QA) from 125k images via automated generation, alongside a 36k dataset (EEmoDB-Assess) curated from 25k images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The code is available at https://anonymous.4open.science/r/EEmoLogic.


[166] EMFormer: Efficient Multi-Scale Transformer for Accumulative Context Weather Forecasting cs.CV

Hao Chen, Tao Han, Jie Zhang, Song Guo, Fenghua Ling

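TL;DR: This paper proposes EMFormer, an Efficient Multi-scale Transformer for accumulative context weather forecasting. A new pipeline spanning pretraining, finetuning, and forecasting addresses catastrophic forgetting, error accumulation, and high training overhead in long-term weather forecasting.

Details

Motivation: Long-term weather forecasting is critical for socioeconomic planning and disaster preparedness, but existing methods are constrained by catastrophic forgetting, error accumulation, and high training overhead when they extend the prediction horizon.

Result: Experiments show that the method achieves strong performance in weather forecasting and extreme event prediction and substantially improves long-term forecast accuracy. It also generalizes well to vision benchmarks (ImageNet-1K and ADE20K) while running 5.69x faster than conventional multi-scale modules.

Insight: The contributions are: EMFormer, which extracts multi-scale features through a single convolution; accumulative context finetuning, which improves temporal consistency without degrading short-term accuracy; and a composite loss that dynamically balances its terms through sinusoidal weighting to guide the optimization trajectory.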

Abstract: Long-term weather forecasting is critical for socioeconomic planning and disaster preparedness. While recent approaches employ finetuning to extend prediction horizons, they remain constrained by catastrophic forgetting, error accumulation, and high training overhead. To address these limitations, we present a novel pipeline across pretraining, finetuning and forecasting to enhance long-context modeling while reducing computational overhead. First, we introduce an Efficient Multi-scale Transformer (EMFormer) to extract multi-scale features through a single convolution in both training and inference. Based on the new architecture, we further employ accumulative context finetuning to improve temporal consistency without degrading short-term accuracy. Additionally, we propose a composite loss that dynamically balances different terms via a sinusoidal weighting, thereby adaptively guiding the optimization trajectory throughout pretraining and finetuning. Experiments show that our approach achieves strong performance in weather forecasting and extreme event prediction, substantially improving long-term forecast accuracy. Moreover, EMFormer demonstrates strong generalization on vision benchmarks (ImageNet-1K and ADE20K) while delivering a 5.69x speedup over conventional multi-scale modules.


[167] Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis cs.CVPDF

Haoran Lai, Zihang Jiang, Kun Zhang, Qingsong Yao, Rongsheng Wang

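TL;DR: This paper proposes Med3D-R1, a reinforcement learning framework for strengthening clinical reasoning in 3D medical vision-language models. Training has two stages, supervised fine-tuning and reinforcement learning: a residual alignment mechanism and an abnormality re-weighting strategy improve the matching between features and text, and a redesigned consistency reward promotes step-by-step diagnostic reasoning. The model achieves state-of-the-art accuracy on two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT.

Details

Motivation: 3D medical imaging is inherently complex, models tend to overfit superficial report patterns, and interpretability-aware reward designs are lacking. The paper aims to build 3D medical vision-language models with robust clinical reasoning.

Result: On medical multiple-choice visual question answering over the two 3D diagnostic benchmarks, the model reaches 41.92% accuracy on CT-RATE and 44.99% on RAD-ChestCT, both state of the art (SOTA).

Insight: The contributions are a two-stage reinforcement learning framework, a residual alignment mechanism that aligns 3D features with text embeddings, an abnormality re-weighting strategy that emphasizes clinically informative tokens, and a redesigned consistency reward that promotes coherent step-by-step reasoning. These designs help improve reliability and interpretability in real-world diagnostic workflows.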

Abstract: Developing 3D vision-language models with robust clinical reasoning remains a challenge due to the inherent complexity of volumetric medical imaging, the tendency of models to overfit superficial report patterns, and the lack of interpretability-aware reward designs. In this paper, we propose Med3D-R1, a reinforcement learning framework with a two-stage training process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). During the SFT stage, we introduce a residual alignment mechanism to bridge the gap between high-dimensional 3D features and textual embeddings, and an abnormality re-weighting strategy to emphasize clinically informative tokens and reduce structural bias in reports. In the RL stage, we redesign the consistency reward to explicitly promote coherent, step-by-step diagnostic reasoning. We evaluate our method on medical multiple-choice visual question answering using two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT, where our model attains state-of-the-art accuracies of 41.92% on CT-RATE and 44.99% on RAD-ChestCT. These results indicate improved abnormality diagnosis and clinical reasoning and outperform prior methods on both benchmarks. Overall, our approach holds promise for enhancing real-world diagnostic workflows by enabling more reliable and transparent 3D medical vision-language systems.


[168] Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment cs.CVPDF

Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi

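TL;DR: This paper proposes the Text Refinement and Alignment (TRA) framework to improve point-supervised temporal action localization. Using text descriptions of video frames, it introduces two modules, Point-based Text Refinement (PTR) and Point-based Multimodal Alignment (PMA), to complement visual features and narrow the gap between modalities; the enhanced multimodal features are then used for precise action localization.

Details

Motivation: Current point-supervised temporal action localization methods consider only visual input features and ignore useful semantic information from the text side, which limits performance. The paper uses textual features from visual descriptions to complement the visual features.

Result: Extensive experiments on five widely used benchmarks show favorable performance compared with several state-of-the-art (SOTA) methods. A computational overhead analysis shows the framework can run on a single 24 GB RTX 3090 GPU, indicating practicality and scalability.

Insight: The novelty lies in introducing and effectively using textual semantic information in point-supervised temporal action localization for the first time. PTR refines the text descriptions and PMA performs point-level multimodal contrastive learning, yielding better alignment between the visual and language modalities and higher localization accuracy.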

Abstract: Recently, point-supervised temporal action localization has gained significant attention for its effective balance between labeling costs and localization accuracy. However, current methods only consider features from visual inputs, neglecting helpful semantic information from the text side. To address this issue, we propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features as they are semantically rich. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). Specifically, we first generate descriptions for video frames using a pre-trained multimodal model. Next, PTR refines the initial descriptions by leveraging point annotations together with multiple pre-trained models. PMA then projects all features into a unified semantic space and leverages a point-level multimodal feature contrastive learning to reduce the gap between visual and linguistic modalities. Last, the enhanced multi-modal features are fed into the action detector for precise localization. Extensive experimental results on five widely used benchmarks demonstrate the favorable performance of our proposed framework compared to several state-of-the-art methods. Moreover, our computational overhead analysis shows that the framework can run on a single 24 GB RTX 3090 GPU, indicating its practicality and scalability.


[169] TF-Lane: Traffic Flow Module for Robust Lane Perception cs.CVPDF

Yihan Xie, Han Xia, Zhen Yang

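TL;DR: This paper proposes TF-Lane, a TrafficFlow-aware Lane perception Module (TFM) that targets the performance degradation of visual sensors in autonomous driving under occlusion or missing lane markings. The module extracts real-time traffic flow features and integrates them seamlessly with existing lane perception algorithms to make lane detection more robust.

Details

Motivation: Existing vision-based lane detection degrades significantly when visual cues are insufficient, while solutions that rely on high-definition maps face high costs and limited real-time performance. The paper therefore explores real-time traffic flow as a supplementary information source that incurs no additional cost.

Result: In experiments on four mainstream models and two public datasets (Nuscenes and OpenLaneV2) using standard evaluation metrics, TFM consistently improves performance, with up to +4.1% mAP gain on the Nuscenes dataset.

Insight: The paper uses real-time traffic flow as a low-cost, real-time supplementary information source and designs a modular solution that plugs into existing algorithms, improving the robustness of lane perception in complex scenes.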

Abstract: Autonomous driving systems require robust lane perception capabilities, yet existing vision-based detection methods suffer significant performance degradation when visual sensors provide insufficient cues, such as in occluded or lane-missing scenarios. While some approaches incorporate high-definition maps as supplementary information, these solutions face challenges of high subscription costs and limited real-time performance. To address these limitations, we explore an innovative information source: traffic flow, which offers real-time capabilities without additional costs. This paper proposes a TrafficFlow-aware Lane perception Module (TFM) that effectively extracts real-time traffic flow features and seamlessly integrates them with existing lane perception algorithms. This solution originated from real-world autonomous driving conditions and was subsequently validated on open-source algorithms and datasets. Extensive experiments on four mainstream models and two public datasets (Nuscenes and OpenLaneV2) using standard evaluation metrics show that TFM consistently improves performance, achieving up to +4.1% mAP gain on the Nuscenes dataset.


[170] Interaction-Consistent Object Removal via MLLM-Based Reasoning cs.CVPDF

Ching-Kai Huang, Wen-Chieh Lin, Yan-Cen Lee

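TL;DR: This paper introduces the Interaction-Consistent Object Removal (ICOR) task, which requires removing the target object together with its associated interaction elements (such as lighting-dependent effects and physically connected objects) to avoid semantic inconsistency. The authors propose Reasoning-Enhanced Object Removal with MLLM (REORM), a framework based on multimodal large language models (MLLMs) whose modular design combines MLLM-driven analysis, mask-guided removal, and a self-correction mechanism, and they introduce the ICOREval benchmark for evaluation. Experiments show that REORM outperforms state-of-the-art image editing systems on ICOREval.

Details

Motivation: Existing image-based object removal methods usually remove only the specified target and ignore the related interaction evidence (lighting effects, physical connections, and so on), leaving results that are semantically inconsistent. The paper aims to make removal consistent at the interaction level.

Result: On the proposed ICOREval benchmark, REORM outperforms current state-of-the-art image editing systems and produces interaction-consistent results.

Insight: The novelty is formalizing interaction consistency as an object removal task and using MLLM reasoning to identify the elements that must be removed jointly. The framework is modular, with a self-correction mechanism and a local-deployment variant, which improves editing accuracy and resource efficiency.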

Abstract: Image-based object removal often erases only the named target, leaving behind interaction evidence that renders the result semantically inconsistent. We formalize this problem as Interaction-Consistent Object Removal (ICOR), which requires removing not only the target object but also associated interaction elements, such as lighting-dependent effects, physically connected objects, target-produced elements, and contextually linked objects. To address this task, we propose Reasoning-Enhanced Object Removal with MLLM (REORM), a reasoning-enhanced object removal framework that leverages multimodal large language models to infer which elements must be jointly removed. REORM features a modular design that integrates MLLM-driven analysis, mask-guided removal, and a self-correction mechanism, along with a local-deployment variant that supports accurate editing under limited resources. To support evaluation, we introduce ICOREval, a benchmark consisting of instruction-driven removals with rich interaction dependencies. On ICOREval, REORM outperforms state-of-the-art image editing systems, demonstrating its effectiveness in producing interaction-consistent results.


[171] ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation cs.CVPDF

Ayushman Sarkar, Zhenyu Yu, Chu Chen, Wei Tang, Kangning Cui

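TL;DR: ReDiStory is a training-free, inference-time framework that improves the consistency of multi-frame visual story generation by decomposing and reorganizing text embeddings. It explicitly decomposes text embeddings into identity-related and frame-specific components and suppresses directions shared across frames to reduce semantic interference, strengthening cross-frame identity consistency while maintaining prompt fidelity.

Details

Motivation: Existing training-free methods concatenate identity and frame prompts into a unified representation, which tends to introduce inter-frame semantic interference in complex stories and weakens identity preservation. The paper addresses the balance between cross-frame identity consistency and frame-specific semantics in visual story generation.

Result: On the ConsiStory+ benchmark, ReDiStory shows consistent gains over 1Prompt1Story on multiple identity consistency metrics.

Insight: The core idea is explicit inference-time decomposition (identity versus frame) and decorrelation of text embeddings, a lightweight intervention that requires neither changes to diffusion model parameters nor additional supervision and that reduces cross-frame interference.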

Abstract: Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory
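The decorrelation step described in the abstract, suppressing directions shared across frame embeddings, can be illustrated with a minimal sketch. It assumes the shared direction is simply the normalized mean of the frame embeddings and that suppression means projecting that direction out; the function name `suppress_shared_direction` and the `strength` parameter are hypothetical and are not taken from the ReDiStory code.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def suppress_shared_direction(frames, strength=1.0):
    """Remove the component each frame embedding has along the mean direction.

    Assumption for illustration: the direction shared across frames is the
    normalized mean embedding. strength=1.0 removes it fully, 0.0 is a no-op.
    """
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(dot(mean, mean))
    if norm == 0.0:
        # No shared direction to suppress.
        return [list(f) for f in frames]
    unit = [m / norm for m in mean]
    result = []
    for f in frames:
        coeff = dot(f, unit)  # how much of the shared direction f contains
        result.append([fi - strength * coeff * ui for fi, ui in zip(f, unit)])
    return result

# Two toy frame embeddings that overlap in their first coordinate.
frame_embeddings = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
decorrelated = suppress_shared_direction(frame_embeddings)
```

After full suppression, every frame embedding is orthogonal to the mean direction, so whatever the frames had in common along that direction no longer contributes to any of them.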


[172] StoryState: Agent-Based State Control for Consistent and Editable Storybooks cs.CVPDF

Ayushman Sarkar, Zhenyu Yu, Wei Tang, Chu Chen, Kangning Cui

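TL;DR: This paper presents StoryState, an explicit story-state control layer based on agent orchestration that improves the consistency and editability of storybooks generated by multimodal models. It represents each story as a structured object consisting of a character sheet, global settings, and per-page scene constraints, and uses a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts that drive training-free text-to-image generation.

Details

Motivation: In existing one-click storybook generation, the underlying story state (characters, world settings, and page-level objects) is implicit, which makes edits coarse-grained and prone to breaking visual consistency.

Result: In system-level experiments on multi-page editing tasks, StoryState enables localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time compared with 1Prompt1Story, while approaching the one-shot consistency of Gemini Storybook.

Insight: The core contribution is an explicit, editable, structured story-state representation that is maintained and used through prompt-based, model-agnostic agent orchestration, giving fine-grained control over generation and consistency without any model fine-tuning.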

Abstract: Large multimodal models have enabled one-click storybook generation, where users provide a short description and receive a multi-page illustrated story. However, the underlying story state, such as characters, world settings, and page-level objects, remains implicit, making edits coarse-grained and often breaking visual consistency. We present StoryState, an agent-based orchestration layer that introduces an explicit and editable story state on top of training-free text-to-image generation. StoryState represents each story as a structured object composed of a character sheet, global settings, and per-page scene constraints, and employs a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts for generation and editing. Operating purely through prompts, StoryState is model-agnostic and compatible with diverse generation backends. System-level experiments on multi-page editing tasks show that StoryState enables localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time compared to 1Prompt1Story, while approaching the one-shot consistency of Gemini Storybook. Code is available at https://github.com/YuZhenyuLindy/StoryState


[173] FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching cs.CVPDF

Divya Jyoti Bajpai, Shubham Agarwal, Apoorv Saxena, Kuldeep Kulkarni, Subrata Mitra

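TL;DR: FlowCast is a training-free speculative generation framework that accelerates inference for flow matching models. It exploits the fact that flow matching models are trained to preserve constant velocity, speculating future velocities and skipping redundant steps to achieve more than 2.5x speedup with no loss in generation quality.

Details

Motivation: Flow matching models generate high-quality visual content, but the large number of denoising steps makes inference slow and limits real-time applications. Existing acceleration methods such as distillation or truncation degrade quality or require retraining, so a general, training-free, quality-preserving acceleration scheme is needed.

Result: Across image generation, video generation, and editing tasks, FlowCast achieves more than 2.5x speedup, outperforms existing baselines, and shows no quality loss compared with standard full generation.

Insight: The novelty is zero-cost speculation based on the constant-velocity property of flow matching models, with an error threshold controlling which redundant steps in stable regions are skipped. It is a plug-and-play framework that needs no auxiliary networks or retraining, and the theoretical analysis bounds the worst-case trajectory deviation.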

Abstract: Flow Matching (FM) has recently emerged as a powerful approach for high-quality visual generation. However, the prohibitively slow inference of FM models, caused by a large number of denoising steps, limits their potential use in real-time or interactive applications. Existing acceleration methods, like distillation, truncation, or consistency training, either degrade quality, incur costly retraining, or lack generalization. We propose FlowCast, a training-free speculative generation framework that accelerates inference by exploiting the fact that FM models are trained to preserve constant velocity. FlowCast speculates future velocity by extrapolating current velocity without incurring additional time cost, and accepts it if it is within a mean-squared error threshold. This constant-velocity forecasting allows redundant steps in stable regions to be aggressively skipped while retaining precision in complex ones. FlowCast is a plug-and-play framework that integrates seamlessly with any FM model and requires no auxiliary networks. We also present a theoretical analysis and bound the worst-case deviation between speculative and full FM trajectories. Empirical evaluations demonstrate that FlowCast achieves $>2.5\times$ speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss as compared to standard full generation.
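The constant-velocity speculation with a mean-squared-error acceptance test can be sketched on a toy Euler sampler. This is a simplified illustration, not the FlowCast algorithm itself: it assumes that drafted states are verified by evaluating the model at all of them in one batched round (emulated here with a Python loop), and the names `speculative_euler`, `draft_len`, and `tau` are hypothetical.

```python
def mse(a, b):
    """Mean-squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def speculative_euler(velocity, x0, num_steps, draft_len=4, tau=1e-4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps.

    Each round drafts up to draft_len future states by holding the current
    velocity constant, then checks the drafts against the true velocities.
    Returns the final state and the number of (batched) model rounds.
    """
    dt = 1.0 / num_steps
    x, step = list(x0), 0
    v = velocity(x, 0.0)  # true velocity at the current state
    rounds = 1
    while step < num_steps:
        k = min(draft_len, num_steps - step)
        # Draft: extrapolate with the current velocity held constant.
        drafts, xd = [], x
        for _ in range(k):
            xd = [xi + dt * vi for xi, vi in zip(xd, v)]
            drafts.append(xd)
        # Verify: true velocities at the drafted states (one batched round).
        true_vs = [velocity(d, (step + j + 1) * dt) for j, d in enumerate(drafts)]
        rounds += 1
        # The first draft is an exact Euler step; later drafts are kept only
        # while the forecast velocity stays within the MSE threshold.
        accepted = 1
        while accepted < k and mse(true_vs[accepted - 1], v) <= tau:
            accepted += 1
        x, v = drafts[accepted - 1], true_vs[accepted - 1]
        step += accepted
    return x, rounds

# A perfectly straight flow: every speculated step is accepted.
final_state, model_rounds = speculative_euler(lambda x, t: [1.0, -2.0], [0.0, 0.0], num_steps=8)
```

With `tau=0.0` on a curved flow every speculation is rejected and the routine reduces to plain Euler integration, which makes the threshold the single knob trading speed against deviation from the full trajectory.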


[174] What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom cs.CVPDF

Yan Ma, Weiyu Zhang, Tianle Li, Linge Du, Xuyang Shen

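TL;DR: This paper proposes the MED framework to disentangle intrinsic capability changes from tool-induced effects in vision tool-use reinforcement learning (RL), and finds that current methods mainly improve intrinsic capabilities rather than truly mastering tool use.

Details

Motivation: The source of performance gains in vision tool-use RL is unclear; the paper separates improvements in tool use from the evolution of intrinsic capabilities.

Result: Checkpoint-level analyses on two vision-language models and six benchmarks show that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (such as call-induced errors and tool schema interference).

Insight: The contribution is the coarse-to-fine MED framework for disentangled analysis. Viewed objectively, it shows that current vision tool-use RL learns to coexist safely with tools rather than master them, offering a diagnostic perspective for future method design.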

Abstract: Vision tool-use reinforcement learning (RL) can equip vision-language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than master them.
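The split of a tool-induced performance difference into gain and harm terms can be made concrete with a small sketch. It assumes per-example correctness flags for the same model evaluated without and with the tool; the function name `decompose_tool_effect` and this particular formalization (gain as the fraction of examples the tool fixes, harm as the fraction it breaks) are illustrative assumptions, not the paper's definitions.

```python
def decompose_tool_effect(correct_without, correct_with):
    """Split (accuracy with tool - accuracy without tool) into gain and harm.

    Hypothetical formalization for illustration:
      gain = fraction of examples wrong without the tool but right with it
      harm = fraction of examples right without the tool but wrong with it
    so that difference == gain - harm.
    """
    if len(correct_without) != len(correct_with) or not correct_without:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_without)
    pairs = list(zip(correct_without, correct_with))
    gain = sum(1 for a, b in pairs if not a and b) / n
    harm = sum(1 for a, b in pairs if a and not b) / n
    acc_without = sum(1 for a in correct_without if a) / n
    acc_with = sum(1 for b in correct_with if b) / n
    return {"gain": gain, "harm": harm, "difference": acc_with - acc_without}

# Toy evaluation: the tool fixes one example and breaks another.
without_tool = [True, True, False, False, True]
with_tool = [True, False, True, False, True]
effect = decompose_tool_effect(without_tool, with_tool)
```

Under this formalization a net difference near zero can hide substantial gain and harm that cancel out, which is why tracking the two terms separately across checkpoints is more informative than tracking accuracy alone.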


[175] Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning cs.CV | cs.AIPDF

Yu Xu, Yuxin Zhang, Juan Cao, Lin Gao, Chunyu Wang

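TL;DR: This paper introduces the Visual Metaphor Transfer task, in which a model decouples the abstract “creative essence” from a reference image and re-instantiates that logic on a user-specified target subject. Inspired by cognitive science, the authors design a multi-agent framework based on Conceptual Blending Theory that represents relational invariants in a structured Schema Grammar, enabling precise cross-domain transfer of logic.

Details

Motivation: Existing generative AI models are largely confined to pixel-level instruction alignment and surface-level appearance preservation and cannot capture the underlying abstract logic that genuine visual metaphors require. The paper addresses the creative generation of visual metaphors.

Result: Extensive experiments and human evaluations show that the method significantly outperforms state-of-the-art baselines in metaphor consistency, analogy appropriateness, and visual creativity.

Insight: The novelty is formalizing Conceptual Blending Theory from cognitive science as a Schema Grammar and building a collaborative system of perception, transfer, generation, and hierarchical diagnostic agents. A closed-loop backtracking mechanism checks quality at every stage from abstract logic to concrete generation, opening a path to automated high-impact creative applications.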

Abstract: A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment and surface-level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the “creative essence” from a reference image and re-materialize that abstract logic onto a user-specified target subject. We propose a cognitive-inspired, multi-agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar (“G”). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross-domain logic re-instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high-fidelity synthesis and a hierarchical diagnostic agent that mimics a professional critic, performing closed-loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media. Source code will be made publicly available.


[176] MTC-VAE: Multi-Level Temporal Compression with Content Awareness cs.CVPDF

Yubo Dong, Linchao Zhu

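TL;DR: This paper presents MTC-VAE, a variational autoencoder that supports multi-level temporal compression, to address the performance drop in LVDMs at higher compression rates. It converts a fixed-compression-rate VAE into a multi-level compression model with minimal fine-tuning, studies how different compression levels affect performance on different video segments, and demonstrates compatibility with the diffusion model DiT.

Details

Motivation: For continuous VAEs, efficiency declines notably when extra sampling layers are added to raise the compression rate without expanding the hidden channel dimensions. The paper aims to give video diffusion models a more flexible and efficient compression scheme.

Result: Empirical analysis shows the effectiveness of multi-level temporal compression on video segments with diverse characteristics, and successful integration with the DiT framework demonstrates compatibility and training feasibility.

Insight: The novelty is converting a fixed-compression-rate VAE into a model that supports multi-level temporal compression, with a content-aware compression strategy to optimize performance, providing a scalable and efficient compression module for video generation models.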

Abstract: Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. For continuous VAEs, achieving higher compression rates is desirable; yet, the efficiency notably declines when extra sampling layers are added without expanding the dimensions of hidden channels. In this paper, we present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression, providing a straightforward and minimal fine-tuning approach to counteract performance decline at elevated compression rates. Moreover, we examine how varying compression levels impact model performance over video segments with diverse characteristics, offering empirical evidence on the effectiveness of our proposed approach. We also investigate the integration of our multi-level temporal compression VAE with diffusion-based generative models, DiT, highlighting successful concurrent training and compatibility within these frameworks. This investigation illustrates the potential uses of multi-level temporal compression.


[177] Exposing and Defending the Achilles’ Heel of Video Mixture-of-Experts cs.CVPDF

Songping Wang, Qinglong Liu, Yueming Lyu, Ning Li, Ziwen He

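TL;DR: This paper proposes Temporal Lipschitz-Guided Attacks (TLGA), an adversarial attack that targets component-level vulnerabilities in video Mixture-of-Experts (MoE) models, and Joint Temporal Lipschitz-Guided Attacks (J-TLGA), which perturb routers and experts jointly to expose their collaborative weaknesses. Building on this, the authors propose Joint Temporal Lipschitz Adversarial Training (J-TLAT) as a defense. The framework is plug-and-play, reduces inference cost by more than 60% compared with dense models, and consistently improves adversarial robustness across multiple datasets and architectures.

Details

Motivation: Existing attack methods usually treat MoE as a unified architecture and overlook the independent and collaborative weaknesses of key components such as routers and expert modules. The paper fills this gap by studying component-level vulnerabilities in video MoE models.

Result: The J-TLGA attack significantly amplifies adversarial effects and exposes the collaborative weaknesses (the Achilles' heel) of the MoE architecture. The J-TLAT defense consistently improves adversarial robustness across diverse datasets and architectures and mitigates both the independent and collaborative weaknesses of MoE. The framework is plug-and-play and reduces inference cost by more than 60% compared with dense models.

Insight: The novelty is the first systematic study of independent and collaborative component-level weaknesses in video MoE models, together with a targeted temporal Lipschitz-guided attack and a joint adversarial training defense. Viewed objectively, moving the attack and defense perspective from the whole model down to key internal components (routers and experts), guided by temporal Lipschitz properties, is an effective way to understand and improve MoE robustness.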

Abstract: Mixture-of-Experts (MoE) has demonstrated strong performance in video understanding tasks, yet its adversarial robustness remains underexplored. Existing attack methods often treat MoE as a unified architecture, overlooking the independent and collaborative weaknesses of key components such as routers and expert modules. To fill this gap, we propose Temporal Lipschitz-Guided Attacks (TLGA) to thoroughly investigate component-level vulnerabilities in video MoE models. We first design attacks on the router, revealing its independent weaknesses. Building on this, we introduce Joint Temporal Lipschitz-Guided Attacks (J-TLGA), which collaboratively perturb both routers and experts. This joint attack significantly amplifies adversarial effects and exposes the Achilles’ Heel (collaborative weaknesses) of the MoE architecture. Based on these insights, we further propose Joint Temporal Lipschitz Adversarial Training (J-TLAT). J-TLAT performs joint training to further defend against collaborative weaknesses, enhancing component-wise robustness. Our framework is plug-and-play and reduces inference cost by more than 60% compared with dense models. It consistently enhances adversarial robustness across diverse datasets and architectures, effectively mitigating both the independent and collaborative weaknesses of MoE.


[178] PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles cs.CV | cs.AI | cs.LGPDF

Leonardo Brusini, Cristian Sbrolli, Eugenio Lomurno, Toshihiko Yamasaki, Matteo Matteucci

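TL;DR: PolyGen proposes a framework for building synthetic vision-language training data with multi-generator ensembles. Training on the intersection of architecturally distinct generators aims to cover the data manifold and remove model-specific artifacts, and a Programmatic Hard Negative curriculum strengthens fine-grained syntactic understanding.

Details

Motivation: Current synthetic-data vision-language pre-training typically scales up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. PolyGen addresses this with multi-generator ensembles.

Result: Under the same data budget, by reallocating unique captions to multi-source variations, PolyGen outperforms the leading single-source baseline SynthCLIP by +19.0% on aggregate multi-task benchmarks and by +9.1% on the SugarCrepe++ compositionality benchmark.

Insight: The contributions are a multi-generator ensemble that covers the data manifold and removes model-specific bias, and a Programmatic Hard Negative curriculum. Viewed objectively, the results suggest that structural diversity scales more data-efficiently than simply increasing the volume of single-source samples.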

Abstract: Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach to train on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and on the SugarCrepe++ compositionality benchmark (+9.1%). These results demonstrate that structural diversity is a more data-efficient scaling law than simply increasing the volume of single-source samples.


[179] PromptRL: Prompt Matters in RL for Flow-Based Image Generation cs.CV | cs.LGPDF

Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park

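TL;DR: This paper proposes PromptRL, a framework that addresses sample inefficiency and prompt overfitting in reinforcement learning (RL) post-training of flow matching models. It integrates a language model as a trainable prompt refinement agent into the flow-based RL optimization loop, which rapidly develops prompt rewriting capabilities and creates a synergistic training regime, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: Current RL post-training pipelines for flow matching models have two underappreciated but important limitations: sample inefficiency caused by insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training prompt formulations and collapse on semantically equivalent but stylistically different prompts.

Result: PromptRL achieves state-of-the-art performance on multiple benchmarks: 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. On the large-scale image editing model FLUX.1-Kontext, it raises EditReward from 1.19 to 1.43 with only 60,000 rollouts, surpassing Gemini 2.5 Flash Image (1.37) and performing comparably to ReasonNet (1.44), which relies on fine-grained data annotations and complex multi-stage training. Experiments show that PromptRL reaches higher performance ceilings with over 2x fewer rollouts than naive flow-only RL.

Insight: The core contribution is integrating a language model as a trainable prompt refinement agent directly into the flow-based RL optimization loop. This rapidly improves prompt rewriting and, more importantly, creates a synergistic training regime that reshapes the optimization dynamics. The design addresses both sample inefficiency and prompt overfitting, offering an efficient and robust approach to RL alignment of flow-based image generation models.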

Abstract: Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2$\times$ fewer rollouts compared to naive flow-only RL. Our code is available at https://github.com/G-U-N/UniRL.


[180] Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas cs.CV | cs.LGPDF

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring

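TL;DR: This paper proposes Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities (images, point clouds, videos, and so on) in attention-based architectures. Starting from the characteristics of vision data, it is designed to satisfy translation invariance, rotation invariance, distance decay, directionality, and context awareness. Evaluated on 8 datasets spanning 4 modalities, PaPE or its rotation-invariant variant PaPE-RI achieves the top performance on 7, and in extrapolation experiments on ImageNet-1K it improves in absolute terms by up to 10.5% over the next-best position encoding.

Details

Motivation: Existing position encodings have largely been extended from 1D sequences in language models to nD structures in vision without fully accounting for the characteristics of vision modalities (such as translation and rotation invariance). The paper fills this gap with a position encoding designed from vision principles.

Result: On 8 datasets covering 4 modalities (images, point clouds, videos, and event camera streams), PaPE or PaPE-RI achieves the top performance on 7. In extrapolation experiments on ImageNet-1K, PaPE improves in absolute terms by up to 10.5% over the next-best position encoding.

Insight: The novelty is designing position encoding from vision principles (such as translation and rotation invariance) rather than simply extending language-model methods. Viewed objectively, the parabolic form integrates position information across vision modalities and improves generalization and extrapolation on diverse vision tasks.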

Abstract: We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as images, point clouds, videos, or event camera streams), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.


[181] Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles cs.CV | cs.AIPDF

Penghao Deng, Jidong J. Yang, Jiachen Bian

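TL;DR: This paper presents a cross-paradigm evaluation of gaze-based semantic object identification for intelligent vehicles. It compares three vision-based approaches for identifying the semantic object in the road scene that corresponds to the driver's gaze point: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based vision-language models (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). YOLOv13 and the large VLM (Qwen2.5-VL-32b) perform best, the latter especially on small critical objects such as traffic lights, while the segmentation-assisted approach suffers from a semantic gap.

Details

Motivation: Understanding how drivers distribute their visual attention while driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. The paper addresses gaze-based semantic object identification in road scenes captured by a vehicle's front-view camera.

Result: Direct object detection (YOLOv13) and the large vision-language model (Qwen2.5-VL-32b) significantly outperform the other approaches, with Macro F1-Scores over 0.84. The large VLM shows superior robustness and performance on small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. The segmentation-assisted paradigm suffers a large drop in recall because of a “part-versus-whole” semantic gap.

Insight: The contribution is a systematic cross-paradigm (detection, segmentation plus classification, VLM) evaluation and comparison of gaze-based semantic object identification. Viewed objectively, the study reveals a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness of large VLMs, providing insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.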

Abstract: Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle’s front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that the direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibited superior robustness and performance for identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a “part-versus-whole” semantic gap that leads to a large drop in recall. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.
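For the direct-detection paradigm, one simple way to collocate a gaze point with object semantics is to pick the detection whose bounding box contains the gaze point. The sketch below is an assumption for illustration only: the abstract does not state the matching rule the authors use, and the function name `identify_gazed_object`, the smallest-box tie-break, and the toy detections are hypothetical.

```python
def identify_gazed_object(gaze_xy, detections):
    """Return the label of the smallest bounding box containing the gaze point.

    detections: list of (label, (x1, y1, x2, y2)) in pixel coordinates.
    Preferring the smallest box is an illustrative heuristic so that a small
    object is not shadowed by a larger box that also contains the gaze point.
    Returns None when no box contains the gaze point.
    """
    gx, gy = gaze_xy
    best_label, best_area = None, None
    for label, (x1, y1, x2, y2) in detections:
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            area = (x2 - x1) * (y2 - y1)
            if best_area is None or area < best_area:
                best_label, best_area = label, area
    return best_label

# Hypothetical detector output for one front-view frame.
frame_detections = [
    ("road", (0, 150, 640, 480)),
    ("car", (100, 200, 400, 380)),
    ("traffic light", (250, 40, 270, 90)),
]
gazed = identify_gazed_object((260, 60), frame_detections)
```

A rule like this depends entirely on the detector producing a box for the gazed object, which is consistent with the abstract's observation that small, safety-critical objects are where the approaches differ most.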


[182] Understanding vision transformer robustness through the lens of out-of-distribution detection cs.CV | cs.AI | cs.LGPDF

Joey Kuang, Alexander Wong

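TL;DR: This paper studies how quantization affects the robustness of vision transformers (ViT, DeiT, DeiT3) in out-of-distribution (OOD) detection, and finds that pretraining on large-scale datasets (such as ImageNet-22k) may hurt the OOD detection performance of low-bit quantized models, with data augmentation a better option.

Details

Motivation: Vision transformers perform well on vision tasks, and quantization reduces their memory and inference costs at the risk of performance loss. Prior work mainly studies in-distribution (ID) behavior; the paper explores OOD situations, where the attention mechanism may offer insight into quantization attributes, with the goal of making the models usable in real time.

Result: Experiments show that 4-bit models degrade markedly. On ID tasks, DeiT3 drops 17% from FP32 to 4-bit. In OOD detection, ViT and DeiT3 pretrained on ImageNet-22k see average AUPR-out deltas of 15.0% and 19.2%, respectively, while their ImageNet-1k-only counterparts see smaller deltas (9.5% and 12.0%).

Insight: The novelty is analyzing the quantization robustness of vision transformers through OOD detection, which reveals that large-scale pretraining may hinder low-bit quantization and suggests data augmentation as a more beneficial option. This adds a new evaluation dimension for model compression and deployment.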

Abstract: Vision transformers have shown remarkable performance in vision tasks, but enabling them for accessible and real-time use is still challenging. Quantization reduces memory and inference costs at the risk of performance loss. Strides have been made to mitigate low precision issues mainly by understanding in-distribution (ID) task behaviour, but the attention mechanism may provide insight on quantization attributes by exploring out-of-distribution (OOD) situations. We investigate the behaviour of quantized small-variant popular vision transformers (DeiT, DeiT3, and ViT) on common OOD datasets. ID analyses show the initial instabilities of 4-bit models, particularly of those trained on the larger ImageNet-22k, as the strongest FP32 model, DeiT3, sharply drops 17% from quantization error to be one of the weakest 4-bit models. While ViT shows reasonable quantization robustness for ID calibration, OOD detection reveals more: ViT and DeiT3 pretrained on ImageNet-22k respectively experienced a 15.0% and 19.2% average quantization delta in AUPR-out between full precision and 4-bit while their ImageNet-1k-only counterparts experienced a 9.5% and 12.0% delta. Overall, our results suggest pretraining on large scale datasets may hinder low-bit quantization robustness in OOD detection and that data augmentation may be a more beneficial option.


[183] Preserving Localized Patch Semantics in VLMs cs.CVPDF

Parsa Esmaeilkhani, Longin Jan Latecki

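TL;DR: In autoregressive vision-language models (VLMs), Logit Lens visualization fails because visual information diffuses away from image tokens. This paper proposes a Logit Lens Loss (LLL) that complements the next-token prediction (NTP) loss by constraining visual token embeddings to align semantically with the textual concepts describing their image regions, preserving the localized visual semantics of image tokens without architectural modification or large-scale training.

Details

Motivation: In autoregressive VLMs, the semantic content of visual tokens diffuses to language tokens in the self-attention layers, which destroys local visual information and makes Logit Lens visualizations unusable for explainability.

Result: Experiments show that LLL makes Logit Lens practically useful by producing meaningful object confidence heatmaps on images, and also improves performance on vision-centric tasks such as segmentation without attaching any special task heads.

Insight: The contribution is a simple and effective auxiliary loss (LLL) that constrains the semantic alignment of visual and text tokens in embedding space, improving both interpretability and downstream vision performance while leaving the architecture unchanged. It offers a lightweight way to improve the visual-semantic fidelity of VLMs.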

Abstract: Logit Lens has been proposed for visualizing tokens that contribute most to LLM answers. Recently, Logit Lens was also shown to be applicable in autoregressive Vision-Language Models (VLMs), where it illustrates the conceptual content of image tokens in the form of heatmaps, e.g., which image tokens are likely to depict the concept of cat in a given image. However, the visual content of image tokens often gets diffused to language tokens, and consequently, the locality of visual information gets mostly destroyed, which renders Logit Lens visualization unusable for explainability. To address this issue, we introduce a complementary loss to next-token prediction (NTP) to prevent the visual tokens from losing the visual representation inherited from corresponding image patches. The proposed Logit Lens Loss (LLL) is designed to make visual token embeddings more semantically aligned with the textual concepts that describe their image regions (e.g., patches containing a cat with the word “cat”), without requiring any architectural modification or large-scale training. This way, LLL constrains the mixing of image and text tokens in the self-attention layers in order to prevent image tokens from losing their localized visual information. As our experiments show, LLL not only makes Logit Lens practically relevant by producing meaningful object confidence maps in images, but also improves performance on vision-centric tasks like segmentation without attaching any special heads.
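
The idea of adding a Logit Lens term to the NTP loss can be illustrated with a minimal sketch. The abstract does not give the exact form of LLL, so everything below is an assumption: the cross-entropy form, the function names (`logit_lens`, `logit_lens_loss`, `total_loss`), the per-patch concept labels, and the mixing weight are all hypothetical and chosen only to show the mechanism.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project one token's hidden state onto the vocabulary (Logit Lens).

    hidden:  list of floats, the hidden state of one image token
    unembed: list of rows, one row of weights per vocabulary entry
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

def logit_lens_loss(patch_hiddens, patch_labels, unembed):
    """Assumed form of the auxiliary loss: mean cross-entropy between each
    labeled patch's Logit Lens distribution and its concept word.
    A label of None marks a patch with no annotated concept."""
    losses = []
    for hidden, label in zip(patch_hiddens, patch_labels):
        if label is None:
            continue
        probs = logit_lens(hidden, unembed)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses) if losses else 0.0

def total_loss(ntp_loss, lll, weight=0.5):
    """NTP loss plus the weighted auxiliary term (the weight is hypothetical)."""
    return ntp_loss + weight * lll

# Toy vocabulary: index 0 = "cat", index 1 = "dog".
unembed = [[1.0, 0.0], [0.0, 1.0]]
patch_hiddens = [[2.0, 0.0], [0.0, 0.0]]   # first patch depicts the cat
patch_labels = [0, None]                   # second patch is unlabeled
lll = logit_lens_loss(patch_hiddens, patch_labels, unembed)
loss = total_loss(ntp_loss=1.0, lll=lll)
```

In this toy setup the loss falls as the labeled patch's hidden state moves toward the "cat" direction of the unembedding matrix, which is the kind of localized alignment the abstract describes.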


[184] Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars cs.CV | cs.AIPDF

Youliang Zhang, Zhengguang Zhou, Zhentao Yu, Ziyao Huang, Teng Hu

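TL;DR: This paper proposes InteractAvatar, a dual-stream framework for generating talking avatars that perform text-driven grounded human-object interaction (GHOI). A Perception and Interaction Module (PIM) generates text-aligned interaction motions, and an Audio-Interaction Aware Generation Module (AIM) synthesizes vivid interaction videos, addressing the environmental-perception and control-quality challenges of GHOI generation.

Details

Motivation: Existing methods can generate full-body talking avatars with simple human motion, but extending them to grounded human-object interaction (GHOI) remains an open challenge: the avatar must perform text-aligned interactions with surrounding objects, which requires environmental perception and raises a control-quality trade-off.

Result: The authors establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the method's effectiveness at generating grounded human-object interactions for talking avatars.

Insight: The main contributions are: 1) a dual-stream framework that decouples perception and planning from video synthesis; 2) the PIM, which uses detection to enhance environmental perception; 3) the AIM, which synthesizes vivid interaction videos; and 4) a motion-to-video aligner that mitigates the control-quality trade-off through a shared network structure and parallel generation.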

Abstract: Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io


[185] Toward Cognitive Supersensing in Multimodal Large Language Model cs.CV | cs.AIPDF

Boyi Li, Yifan Shen, Yuanzhe Liu, Yifan Xu, Jiateng Liu

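TL;DR: This paper proposes Cognitive Supersensing, a training paradigm that strengthens the ability of multimodal large language models (MLLMs) to solve complex cognitive problems. It adds a Latent Visual Imagery Prediction (LVIP) head so the model learns sequences of visual cognitive latent embeddings and aligns them with the answer, forming vision-based internal reasoning chains; a reinforcement learning stage then optimizes the text reasoning paths. To evaluate cognitive ability, the authors build the CogSense-Bench benchmark. Experiments show that MLLMs trained with this paradigm significantly outperform existing methods on multiple benchmarks.

Details

Motivation: Current MLLMs perform well on open-vocabulary perceptual tasks but are limited on complex cognitive problems that require visual memory and involve abstract visual details. Existing methods mainly scale chain-of-thought reasoning in text space and neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery.

Result: On the proposed CogSense-Bench VQA benchmark, which assesses five cognitive dimensions, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines. They also generalize well to out-of-domain mathematics and science VQA benchmarks.

Insight: The core contribution is a training paradigm that integrates a latent visual imagery prediction head, enabling MLLMs to form vision-based internal reasoning chains; the authors suggest this may be key to bridging perceptual recognition and cognitive understanding. The CogSense-Bench benchmark also provides a new tool for evaluating models' cognitive abilities.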

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source the CogSense-Bench and our model weights.


[186] Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd cs.CV | cs.AIPDF

Yejin Son, Saejin Kim, Dongjun Min, Younjae Yu

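TL;DR: This paper introduces MUN (Multimodal UNcommonsense), a benchmark for evaluating how well models handle scenarios that deviate from typical visual or contextual expectations. It also proposes R-ICL, a retrieval-based in-context learning framework that uses a Multimodal Ensemble Retriever (MER) to identify semantically relevant exemplars and improves performance in low-frequency, atypical settings.

Details

Motivation: Commonsense reasoning in multimodal settings remains challenging, and models in particular adapt poorly to unexpected or atypical scenarios.

Result: On the MUN benchmark, R-ICL improves over baseline ICL methods by 8.3% on average and is effective in low-frequency, atypical scenarios.

Insight: The contributions are the MUN benchmark, which focuses on unconventional scenarios, and the training-free R-ICL framework, which uses MER to handle deliberately discordant image-text pairs and improves robustness in diverse, non-prototypical real-world scenarios.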

Abstract: Commonsense reasoning in multimodal contexts remains a foundational challenge in artificial intelligence. We introduce Multimodal UNcommonsense (MUN), a benchmark designed to evaluate models’ ability to handle scenarios that deviate from typical visual or contextual expectations. MUN pairs visual scenes with surprising or unlikely outcomes described in natural language, prompting models to either rationalize seemingly odd images using everyday logic or uncover unexpected interpretations in ordinary scenes. To support this task, we propose a retrieval-based in-context learning (R-ICL) framework that transfers reasoning capabilities from larger models to smaller ones without additional training. Leveraging a novel Multimodal Ensemble Retriever (MER), our method identifies semantically relevant exemplars even when image and text pairs are deliberately discordant. Experiments show an average improvement of 8.3% over baseline ICL methods, highlighting the effectiveness of R-ICL in low-frequency, atypical settings. MUN opens new directions for evaluating and improving visual-language models’ robustness and adaptability in real-world, culturally diverse, and non-prototypical scenarios.
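
A retrieval step of this kind can be sketched as follows. The abstract does not say how MER combines modalities, so the weighted sum of image and text cosine similarities, the `image_weight` parameter, and the function name `retrieve_exemplars` are assumptions made purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_exemplars(query, pool, k=2, image_weight=0.5):
    """Rank pool items by a weighted sum of image and text similarity
    (an assumed ensembling rule) and return the ids of the top k."""
    scored = []
    for item in pool:
        score = (image_weight * cosine(query["image"], item["image"])
                 + (1.0 - image_weight) * cosine(query["text"], item["text"]))
        scored.append((score, item["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Toy embeddings: item "b" matches the query text but not the image,
# mimicking a deliberately discordant image-text pair.
pool = [
    {"id": "a", "image": [1.0, 0.0], "text": [1.0, 0.0]},
    {"id": "b", "image": [0.0, 1.0], "text": [1.0, 0.0]},
    {"id": "c", "image": [0.0, 1.0], "text": [0.0, 1.0]},
]
query = {"image": [1.0, 0.0], "text": [1.0, 0.0]}
exemplars = retrieve_exemplars(query, pool, k=2)
```

The retrieved ids would then be used to pull exemplars into the prompt of the smaller model. Scoring both modalities means an item can still rank well when only one modality matches the query.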


[187] SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models cs.CVPDF

Haobo Wang, Weiqi Luo, Xiaojun Jia, Xiaochun Cao

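TL;DR: This paper proposes SGHA-Attack, a Semantic-Guided Hierarchical Alignment framework that improves the transferability of targeted adversarial attacks on vision-language models (VLMs). It samples a text-to-image model to generate multiple target reference images and selects the Top-K most semantically relevant anchors to form a weighted mixture that gives stable optimization guidance. On top of these anchors, SGHA-Attack aligns intermediate visual representations at multiple depths, at both global and spatial granularities, and synchronizes intermediate visual and textual features in a shared latent subspace, providing early cross-modal supervision before the final projection.

Details

Motivation: Existing transfer-based targeted attacks typically rely on a single reference and emphasize final-layer alignment. This overfits the surrogate model's embedding space and underuses intermediate semantics, which reduces transferability across heterogeneous VLMs.

Result: Extensive experiments on open-source and commercial black-box VLMs show that SGHA-Attack achieves stronger targeted transferability than prior methods and remains robust under preprocessing and purification defenses.

Insight: The main contributions are: 1) multiple semantically relevant target reference anchors, generated by a frozen text-to-image model and combined in a weighted mixture, which give more stable and general optimization guidance; and 2) a hierarchical alignment mechanism that aligns visual representations (global and spatial) at multiple intermediate layers as well as the final layer and synchronizes visual and textual features, injecting target semantics earlier and more thoroughly and improving transfer across VLM architectures.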

Abstract: Large vision-language models (VLMs) are vulnerable to transfer-based adversarial perturbations, enabling attackers to optimize on surrogate models and manipulate black-box VLM outputs. Prior targeted transfer attacks often overfit surrogate-specific embedding space by relying on a single reference and emphasizing final-layer alignment, which underutilizes intermediate semantics and degrades transfer across heterogeneous VLMs. To address this, we propose SGHA-Attack, a Semantic-Guided Hierarchical Alignment framework that adopts multiple target references and enforces intermediate-layer consistency. Concretely, we generate a visually grounded reference pool by sampling a frozen text-to-image model conditioned on the target prompt, and then carefully select the Top-K most semantically relevant anchors under the surrogate to form a weighted mixture for stable optimization guidance. Building on these anchors, SGHA-Attack injects target semantics throughout the feature hierarchy by aligning intermediate visual representations at both global and spatial granularities across multiple depths, and by synchronizing intermediate visual and textual features in a shared latent subspace to provide early cross-modal supervision before the final projection. Extensive experiments on open-source and commercial black-box VLMs show that SGHA-Attack achieves stronger targeted transferability than prior methods and remains robust under preprocessing and purification defenses.
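
The anchor-selection step can be sketched under stated assumptions. The abstract says only that the Top-K most semantically relevant references form a weighted mixture; the dot-product relevance score, the softmax weighting, the `temperature` value, and the name `topk_anchor_mixture` below are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def topk_anchor_mixture(ref_embeddings, target_embedding, k=2, temperature=0.1):
    """Pick the k references most similar to the target prompt embedding
    and blend them with softmax weights (an assumed weighting rule).
    Embeddings are assumed to be unit-normalized, so a dot product
    serves as the similarity."""
    sims = [sum(r * t for r, t in zip(ref, target_embedding))
            for ref in ref_embeddings]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    weights = softmax([sims[i] / temperature for i in top])
    dim = len(target_embedding)
    mixture = [sum(w * ref_embeddings[i][d] for w, i in zip(weights, top))
               for d in range(dim)]
    return top, weights, mixture

# Three toy reference embeddings from a hypothetical text-to-image pool.
refs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
target = [1.0, 0.0]
top, weights, mixture = topk_anchor_mixture(refs, target, k=2)
```

The mixed embedding would serve as the optimization target in place of a single reference, which is the property the abstract credits with more stable guidance.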


[188] HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation cs.CVPDF

Wencan Cheng, Gim Hee Lee

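TL;DR: This paper proposes HandMCM, a 3D hand pose estimation method based on the state space model Mamba. Modules for local information injection/filtering and correspondence modeling let it learn the dynamic kinematic topology of keypoints across occlusion scenarios, and integrating multi-modal image features improves the robustness and representational capacity of the input, yielding more accurate pose estimates. Experiments on three benchmark datasets show that the method significantly outperforms current state-of-the-art methods, especially under severe occlusion.

Details

Motivation: 3D hand pose estimation is crucial for human-computer interaction applications such as augmented reality, but self-occlusion of the hands and occlusion caused by object interaction make it difficult. The paper aims to address these occlusions and improve the accuracy and robustness of pose estimation.

Result: Empirical evaluations on three benchmark datasets show that the model significantly outperforms current state-of-the-art (SOTA) methods, particularly in challenging scenarios with severe occlusion.

Insight: The paper applies the state space model Mamba to 3D hand pose estimation and designs dedicated local-information and correspondence-modeling modules to cope with occlusion. Integrating multi-modal features to strengthen the input representation is also an effective strategy. More broadly, bringing the efficient sequence-modeling Mamba architecture to hand keypoints with a dynamic topology offers a new angle on the occlusion problem.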

Abstract: 3D hand pose estimation that involves accurate estimation of 3D human hand keypoint locations is crucial for many human-computer interaction applications such as augmented reality. However, this task poses significant challenges due to self-occlusion of the hands and occlusions caused by interactions with objects. In this paper, we propose HandMCM to address these challenges. Our HandMCM is a novel method based on the powerful state space model (Mamba). By incorporating modules for local information injection/filtering and correspondence modeling, the proposed correspondence Mamba effectively learns the highly dynamic kinematic topology of keypoints across various occlusion scenarios. Moreover, by integrating multi-modal image features, we enhance the robustness and representational capacity of the input, leading to more accurate hand pose estimation. Empirical evaluations on three benchmark datasets demonstrate that our model significantly outperforms current state-of-the-art methods, particularly in challenging scenarios involving severe occlusions. These results highlight the potential of our approach to advance the accuracy and reliability of 3D hand pose estimation in practical applications.


[189] Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages cs.CVPDF

Zhixiong Yue, Zixuan Ni, Feiyang Ye, Jinshan Zhang, Sheng Shen

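TL;DR: This paper proposes TAFS-GRPO, a framework for training flow matching text-to-image models into efficient few-step generators that are well aligned with human preferences. It iteratively injects adaptive temporal noise to introduce sampling stochasticity, and combines a step-aware advantage integration mechanism with GRPO to provide dense, step-specific rewards for stable policy optimization.

Details

Motivation: Existing RL-based methods for flow matching models typically rely on many denoising steps and suffer from sparse, imprecise reward signals, which leads to suboptimal alignment. The paper aims to fix both problems and achieve faster, better alignment with human preferences.

Result: Extensive experiments show that TAFS-GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences.

Insight: The contributions are temperature-annealed few-step sampling, which introduces stochasticity while preserving semantic integrity, and step-aware advantage integration combined with GRPO, which removes the requirement that the reward function be differentiable and supplies dense, step-specific rewards. Together they give a more stable and efficient training mechanism for RL alignment of flow matching models.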

Abstract: Recent advances in flow matching models, particularly with reinforcement learning (RL), have significantly enhanced human preference alignment in few-step text-to-image generators. However, existing RL-based approaches for flow matching models typically rely on numerous denoising steps, while suffering from sparse and imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature-Annealed Few-step Sampling with Group Relative Policy Optimization (TAFS-GRPO), a novel framework for training flow matching text-to-image models into efficient few-step generators well aligned with human preferences. Our method iteratively injects adaptive temporal noise onto the results of one-step samples. By repeatedly annealing the model’s sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step-aware advantage integration mechanism is combined with GRPO to avoid the need for a differentiable reward function and to provide dense, step-specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS-GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be made available to facilitate further research.
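
The group-relative advantage at the heart of GRPO can be sketched in a few lines: each sample's reward is standardized against the other samples generated for the same prompt, so no learned value function and no reward gradient are needed. The per-step variant below is an assumption about what "step-aware" could mean (normalizing within each sampling step); the abstract does not specify how TAFS-GRPO integrates advantages across steps, and the function names are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def step_aware_advantages(step_rewards, eps=1e-8):
    """Assumed step-aware variant: step_rewards[s][i] is the reward of
    sample i at sampling step s, and each step is normalized on its own,
    giving one dense advantage per (step, sample) pair."""
    return [group_relative_advantages(row, eps) for row in step_rewards]

# Rewards for a group of three images at two sampling steps (toy values).
step_rewards = [
    [1.0, 2.0, 3.0],
    [0.5, 0.5, 0.5],   # identical rewards carry no learning signal
]
advantages = step_aware_advantages(step_rewards)
```

Because the advantages depend only on reward values, the reward model can be a black box, which matches the abstract's point about not requiring a differentiable reward function.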


[190] Samba+: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework cs.CVPDF

Wenzhuo Zhao, Keren Fu, Jiahao He, Xiaohong Liu, Qijun Zhao

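TL;DR: This paper proposes Samba and Samba+, a Mamba-based general framework for salient object detection (SOD). Samba introduces a saliency-guided Mamba block (SGMB) and context-aware upsampling (CAU) to handle RGB, RGB-D, RGB-T, video SOD, and other tasks. Samba+ goes further with multi-task joint training, adding hub-and-spoke graph attention (HGA) and modality-anchored continual learning (MACL) to build a more unified, versatile single model.

Details

Motivation: Existing SOD models are constrained by the limited receptive fields of CNNs and the quadratic computational complexity of Transformers. Mamba shows promise in balancing global receptive fields and computational efficiency, but its scanning strategy must be redesigned for SOD, and the challenges of multi-modal inputs and continual adaptation must be addressed.

Result: Samba outperforms existing methods on 22 datasets across six SOD tasks at lower computational cost. Samba+ achieves even better results on these tasks and datasets with a single trained model.

Insight: The contributions are a spatial neighborhood scanning (SNS) algorithm redesigned for SOD to preserve the spatial continuity of salient regions; context-aware upsampling (CAU) for hierarchical feature alignment; and, for building a general model, the hub-and-spoke graph attention (HGA) module and the modality-anchored continual learning (MACL) strategy.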

Abstract: Existing salient object detection (SOD) models are generally constrained by the limited receptive fields of convolutional neural networks (CNNs) and quadratic computational complexity of Transformers. Recently, the emerging state-space model, namely Mamba, has shown great potential in balancing global receptive fields and computational efficiency. As a solution, we propose Saliency Mamba (Samba), a pure Mamba-based architecture that flexibly handles various distinct SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), RGB-D VSOD, and visible-depth-thermal SOD. Specifically, we rethink the scanning strategy of Mamba for SOD, and introduce a saliency-guided Mamba block (SGMB) that features a spatial neighborhood scanning (SNS) algorithm to preserve the spatial continuity of salient regions. A context-aware upsampling (CAU) method is also proposed to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. As one step further, to avoid the “task-specific” problem as in previous SOD solutions, we develop Samba+, which is empowered by training Samba in a multi-task joint manner, leading to a more unified and versatile model. Two crucial components that collaboratively tackle challenges encountered in input of arbitrary modalities and continual adaptation are investigated. Specifically, a hub-and-spoke graph attention (HGA) module facilitates adaptive cross-modal interactive fusion, and a modality-anchored continual learning (MACL) strategy alleviates inter-modal conflicts together with catastrophic forgetting. Extensive experiments demonstrate that Samba individually outperforms existing methods across six SOD tasks on 22 datasets with lower computational cost, whereas Samba+ achieves even superior results on these tasks and datasets by using a single trained versatile model. Additional results further demonstrate the potential of our Samba framework.


[191] UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception cs.CVPDF

Wenzhuo Liu, Qiannan Guo, Zhen Wang, Wenshuo Wang, Lei Yang

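TL;DR: This paper proposes UV-M3TL, a unified and versatile multimodal multi-task learning framework for assistive driving perception that simultaneously recognizes driver behavior, driver emotion, vehicle behavior, and traffic context. It mitigates inter-task negative transfer with dual-branch spatial channel multimodal embedding (DB-SCME) and an adaptive feature-decoupled multi-task loss (AFD-Loss), achieving state-of-the-art performance on the AIDE dataset and strong results on several public benchmarks.

Details

Motivation: Advanced Driver Assistance Systems (ADAS) need to understand driver behavior while perceiving the navigation context, but jointly learning these heterogeneous tasks causes inter-task negative transfer and impairs system performance.

Result: On the AIDE dataset, UV-M3TL achieves state-of-the-art performance on all four tasks. On public multi-task perception benchmarks (BDD100K, CityScapes, NYUD-v2, and PASCAL-Context), it performs strongly across different task combinations and attains state-of-the-art results on most tasks.

Insight: The contributions are: 1) a dual-branch structure (DB-SCME) that explicitly models task-shared and task-specific features to enhance cross-task knowledge transfer and mitigate task conflicts; and 2) an adaptive weighting mechanism with feature-decoupling constraints (AFD-Loss) that stabilizes joint optimization and guides the model toward diverse multi-task representations. The unified design handles negative transfer in multimodal multi-task learning effectively and appears to generalize and extend well.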

Abstract: Advanced Driver Assistance Systems (ADAS) need to understand human driver behavior while perceiving their navigation context, but jointly learning these heterogeneous tasks would cause inter-task negative transfer and impair system performance. Here, we propose a Unified and Versatile Multimodal Multi-Task Learning (UV-M3TL) framework to simultaneously recognize driver behavior, driver emotion, vehicle behavior, and traffic context, while mitigating inter-task negative transfer. Our framework incorporates two core components: dual-branch spatial channel multimodal embedding (DB-SCME) and adaptive feature-decoupled multi-task loss (AFD-Loss). DB-SCME enhances cross-task knowledge transfer while mitigating task conflicts by employing a dual-branch structure to explicitly model salient task-shared and task-specific features. AFD-Loss improves the stability of joint optimization while guiding the model to learn diverse multi-task representations by introducing an adaptive weighting mechanism based on learning dynamics and feature decoupling constraints. We evaluate our method on the AIDE dataset, and the experimental results demonstrate that UV-M3TL achieves state-of-the-art performance across all four tasks. To further prove the versatility, we evaluate UV-M3TL on additional public multi-task perception benchmarks (BDD100K, CityScapes, NYUD-v2, and PASCAL-Context), where it consistently delivers strong performance across diverse task combinations, attaining state-of-the-art results on most tasks.


[192] Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation? cs.CVPDF

Susan Liang, Chao Huang, Filippos Bellos, Yolo Yunlong Tang, Qianxiang Shen

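TL;DR: This paper presents Omni-Judge, a study of whether omni-modal large language models (omni-LLMs) can serve as human-aligned judges for text-conditioned audio-video generation. It finds that omni-LLMs do well on semantic alignment tasks and provide interpretable feedback, but are limited on high-FPS perceptual metrics.

Details

Motivation: Now that text-to-video models such as Sora 2 and Veo 3 can generate high-fidelity video with synchronized audio, evaluating such tri-modal output is an unsolved challenge. Human evaluation is reliable but costly, and traditional automatic metrics (such as FVD, CLAP, and ViCLIP) are limited to isolated modality pairs and offer little interpretability. Omni-LLMs natively process audio, video, and text and support reasoning, which makes them a promising alternative.

Result: Across nine perceptual and alignment metrics, Omni-Judge achieves correlation with human ratings comparable to that of traditional metrics, and it excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. Because of its limited temporal resolution, it underperforms on high-FPS perceptual metrics such as video quality and audio-video synchronization.

Insight: The paper systematically explores omni-LLMs as unified, interpretable judges for tri-modal (text-audio-video) generation. Its core finding is that omni-LLMs are strong at semantic understanding and cross-modal consistency assessment and can give interpretable chain-of-thought feedback, which is practically useful for iteratively refining generative models. It also shows that they are currently limited in handling fine detail at high temporal resolution.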

Abstract: State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
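
Agreement between an automatic judge and human raters is commonly summarized with a rank correlation. The abstract does not state which correlation statistic Omni-Judge reports, so the Spearman coefficient below is an assumed choice for illustration, and the function names and score lists are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-video scores from a judge model and from human raters.
judge_scores = [0.2, 0.9, 0.5, 0.7]
human_scores = [1.0, 5.0, 2.0, 4.0]
rho = spearman(judge_scores, human_scores)
```

A rank-based statistic is a natural fit here because judge scores and human ratings usually live on different scales, and only the ordering of the videos needs to agree.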


[193] PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards cs.CVPDF

Minh-Quan Le, Gaurav Mittal, Cheng Zhao, David Gu, Dimitris Samaras

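TL;DR: This paper proposes PISCES, an annotation-free post-training algorithm for text-to-video generation. Its Dual Optimal Transport (OT)-aligned Rewards module aims to improve the quality, temporal consistency, and text alignment of generated videos. The method avoids both large-scale human preference annotation and the misaligned embeddings of pre-trained vision-language models, addressing the scalability limits and suboptimal supervision of existing methods.

Details

Motivation: Existing reward-based post-training methods either depend on large-scale human annotation or supervise with misaligned embeddings from pre-trained models, resulting in limited scalability or poor supervision. The paper aims to develop a post-training method that needs no human annotation and more effectively improves the visual quality, temporal consistency, and semantic alignment of text-to-video generation.

Result: Experiments on short- and long-video generation show that PISCES outperforms both annotation-based and annotation-free methods on VBench Quality and Semantic scores; human preference studies further validate its effectiveness.

Insight: The core contribution is the Dual OT-aligned Rewards module, which bridges text and video embeddings at the distributional level and the discrete token level to provide annotation-free reward supervision aligned with human judgment. The authors state that, to their knowledge, this is the first work to improve annotation-free reward supervision in generative post-training through the lens of optimal transport. The module is compatible with multiple optimization paradigms, such as direct backpropagation and reinforcement learning fine-tuning.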

Abstract: Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present PISCES, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, PISCES uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, PISCES is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that PISCES outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.
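
Token-level optimal transport between text and video tokens can be sketched with the standard Sinkhorn iteration for entropy-regularized OT. This is a generic illustration, not the PISCES reward: the uniform token masses, the regularization strength, the toy cost matrix, and the idea of using the negative transport cost as a reward are all assumptions.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized OT between uniform distributions over the rows
    (e.g. text tokens) and columns (e.g. video tokens) of a cost matrix.
    Returns the transport plan as a list of rows."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n                      # uniform mass on text tokens
    b = [1.0 / m] * m                      # uniform mass on video tokens
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(cost, plan):
    """Total cost of moving mass according to the plan."""
    return sum(c * p for c_row, p_row in zip(cost, plan)
               for c, p in zip(c_row, p_row))

# Toy cost matrix (for example 1 - cosine similarity): text token 0 matches
# video token 0, and text token 1 matches video token 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost)
reward = -transport_cost(cost, plan)       # hypothetical reward signal
```

A low transport cost means every text token finds a nearby video token, so a reward built this way would favor videos whose tokens cover the whole prompt rather than only its most salient word.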


[194] Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks cs.CVPDF

Bohan Zeng, Kaixin Zhu, Daili Hua, Bozhou Li, Chengzhuo Tong

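TL;DR: This paper criticizes the fragmented practice in current world-model research of injecting world knowledge into isolated tasks. It argues for a unified design specification in which a world model is a holistic framework integrating interaction, perception, symbolic reasoning, and spatial representation, to guide future research toward more general, robust, and principled world models.

Details

Motivation: World models aim to enhance large models' understanding of physical dynamics and world knowledge, but existing approaches mostly inject world knowledge into isolated tasks such as visual prediction and 3D estimation. They lack a systematic unified framework, which limits holistic world understanding.

Result: The paper reports no experiments or benchmark results. Instead, it proposes a unified design specification through conceptual analysis, intended to give future research a structured perspective.

Insight: The paper's key point is that a world model should not be a loose collection of capabilities but a normative framework integrating interaction, perception, symbolic reasoning, and spatial representation, which offers a systematic design approach for building general AI systems.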

Abstract: World models have emerged as a critical frontier in AI research, aiming to enhance large models by infusing them with physical dynamics and world knowledge. The core objective is to enable agents to understand, predict, and interact with complex environments. However, the current research landscape remains fragmented, with approaches predominantly focused on injecting world knowledge into isolated tasks, such as visual prediction, 3D estimation, or symbol grounding, rather than establishing a unified definition or framework. While these task-specific integrations yield performance gains, they often lack the systematic coherence required for holistic world understanding. In this paper, we analyze the limitations of such fragmented approaches and propose a unified design specification for world models. We suggest that a robust world model should not be a loose collection of capabilities but a normative framework that integrally incorporates interaction, perception, symbolic reasoning, and spatial representation. This work aims to provide a structured perspective to guide future research toward more general, robust, and principled models of the world.


[195] Federated Vision Transformer with Adaptive Focal Loss for Medical Image Classification cs.CVPDF

Xinyuan Zhao, Yihang Wu, Ahmad Chaddad, Tareef Daqqaq, Reem Kateb

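TL;DR: This paper proposes a federated learning framework that combines a dynamic adaptive focal loss (DAFL) with a client-aware aggregation strategy to address data heterogeneity and class imbalance in medical image classification. The framework is validated on three public datasets (ISIC, Ocular Disease, and RSNA-ICH), where it improves accuracy over several baselines (such as DenseNet121, ViT, and FedCLIP) in most cases, by 0.98% to 41.69%.

Details

Motivation: Medical image classification in federated settings faces data-privacy restrictions, data heterogeneity, and class imbalance, all of which limit the generalization of deep learning models such as the Vision Transformer.

Result: Classification results on the three public datasets ISIC, Ocular Disease, and RSNA-ICH show that the framework outperforms DenseNet121, ResNet50, ViT-S/16, ViT-L/32, FedCLIP, Swin Transformer, CoAtNet, and MixNet in most cases, with accuracy improvements from 0.98% to 41.69%. Ablation studies on the imbalanced ISIC dataset validate the loss function and the aggregation strategy.

Insight: The contributions are a dynamic class-imbalance coefficient that adjusts the focal loss so that minority classes receive attention and sparse data is not ignored, and a weighted aggregation strategy that adapts to client data size and characteristics to better capture inter-client variation. These ideas could carry over to other federated learning scenarios with non-IID data.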

Abstract: While deep learning models like Vision Transformer (ViT) have achieved significant advances, they typically require large datasets. With data privacy regulations, access to many original datasets is restricted, especially medical images. Federated learning (FL) addresses this challenge by enabling global model aggregation without data exchange. However, the heterogeneity of the data and the class imbalance that exist in local clients pose challenges for the generalization of the model. This study proposes an FL framework leveraging a dynamic adaptive focal loss (DAFL) and a client-aware aggregation strategy for local training. Specifically, we design a dynamic class imbalance coefficient that adjusts based on each client’s sample distribution and class data distribution, ensuring minority classes receive sufficient attention and preventing sparse data from being ignored. To address client heterogeneity, a weighted aggregation strategy is adopted, which adapts to data size and characteristics to better capture inter-client variations. The classification results on three public datasets (ISIC, Ocular Disease and RSNA-ICH) show that the proposed framework outperforms DenseNet121, ResNet50, ViT-S/16, ViT-L/32, FedCLIP, Swin Transformer, CoAtNet, and MixNet in most cases, with accuracy improvements ranging from 0.98% to 41.69%. Ablation studies on the imbalanced ISIC dataset validate the effectiveness of the proposed loss function and aggregation strategy compared to traditional loss functions and other FL approaches. The code can be found at: https://github.com/AIPMLab/ViT-FLDAF.
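
The two ingredients named above, a class-weighted focal loss and size-weighted aggregation, can be sketched as follows. The abstract does not define the dynamic coefficient or the aggregation weights, so the inverse-frequency coefficient and the plain sample-count weighting (as in FedAvg) used here are stand-in assumptions, and all function names are hypothetical.

```python
import math

def class_coefficients(class_counts):
    """Inverse-frequency weight per class on one client (an assumed stand-in
    for the paper's dynamic coefficient): rare classes get larger weights."""
    total = sum(class_counts)
    n_classes = len(class_counts)
    return [total / (n_classes * c) if c > 0 else 0.0 for c in class_counts]

def focal_loss(prob_true, class_idx, coeffs, gamma=2.0):
    """Focal loss for one sample: -alpha_c * (1 - p)^gamma * log(p),
    where p is the predicted probability of the true class."""
    return -coeffs[class_idx] * (1.0 - prob_true) ** gamma * math.log(prob_true)

def aggregate(client_weights, client_sizes):
    """Average client parameter vectors weighted by local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
            for d in range(dim)]

# A client with 90 majority-class and 10 minority-class images.
coeffs = class_coefficients([90, 10])
minority_loss = focal_loss(prob_true=0.5, class_idx=1, coeffs=coeffs)
majority_loss = focal_loss(prob_true=0.5, class_idx=0, coeffs=coeffs)

# Two clients with 1 and 3 samples contribute proportionally.
global_weights = aggregate([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
```

At the same predicted probability, the minority-class sample incurs a larger loss than the majority-class one, which is how a class coefficient keeps sparse classes from being ignored during local training.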


[196] ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval cs.CVPDF

Tianyu Yang, ChenWei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo

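TL;DR: This paper proposes ReCALL, a framework that addresses the paradigm conflict and capability degradation that arise in composed image retrieval (CIR) when a generative multimodal large language model (MLLM) is adapted into a discriminative retriever. Through a diagnose-generate-refine pipeline, it uses the MLLM itself to mine cognitive blind spots, generate corrective data, and refine the retriever, recalibrating its fine-grained reasoning ability.

Details

Motivation: Adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict that degrades the model's native fine-grained compositional reasoning (capability degradation), which hurts composed image retrieval performance.

Result: Extensive experiments on the CIRR and FashionIQ benchmarks show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art (SOTA) performance.

Insight: The contribution is the model-agnostic ReCALL framework. It diagnoses cognitive blind spots, uses the MLLM itself to generate corrective instructions and triplets, and continues training with a grouped contrastive scheme, realigning the retriever's discriminative embedding space with the MLLM's intrinsic compositional reasoning and thereby countering the capability degradation introduced by adaptation.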

Abstract: Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline: Firstly, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.


[197] Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning cs.CV | cs.AIPDF

Yinchao Ma, Qiang Zhou, Zhibin Wang, Xianing Chen, Hanqing Yang

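TL;DR: This paper proposes CaCoVID, a contribution-aware token compression algorithm for efficient video understanding. It optimizes a policy network within a reinforcement learning framework to actively select the combination of video tokens that contributes most to correct predictions, rather than relying on attention scores, and so reduces the computational overhead caused by redundant video tokens.

Details

Motivation: Video large language models perform well on video understanding tasks, but redundant video tokens cause heavy computational overhead at inference and limit practical deployment. Existing compression algorithms usually keep the features with the highest attention scores, yet the correlation between attention scores and actual contribution to correct answers is unclear, so a more effective compression method is needed.

Result: Extensive experiments on multiple video understanding benchmarks demonstrate CaCoVID's effectiveness. The abstract gives no specific quantitative results but states that the method dramatically reduces the exploration space and accelerates the convergence of policy optimization.

Insight: The contribution is a reinforcement learning framework that optimizes the token selection policy directly on each token's contribution to correct predictions, shifting from passive token preservation to active discovery of optimal compressed combinations. The proposed combinatorial policy optimization algorithm with online combination space sampling greatly reduces the exploration space and speeds convergence. Because the compression objective is tied directly to final prediction performance, the method may balance compression ratio and accuracy more effectively.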

Abstract: Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms prioritize retaining features with the highest attention scores to minimize perturbations in attention computations. However, the correlation between attention scores and their actual contribution to correct answers remains ambiguous. To address the above limitation, we propose a novel Contribution-aware token Compression algorithm for VIDeo understanding (CaCoVID) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Second, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence speed of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Code will be released.


[198] From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction cs.CVPDF

Xingyu Miao, Junting Dong, Qin Zhao, Yuhang Yang, Junhao Chen

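TL;DR: This paper proposes a temporally consistent human-centric dense prediction method for video sequences. A scalable synthetic data pipeline generates photorealistic human frames and motion-aligned sequences, and a unified ViT-based dense predictor is trained with a two-stage strategy that combines static pretraining with dynamic sequence supervision. The method achieves SOTA performance on the THuman2.1 and Hi4D benchmarks and generalizes to real-world videos.

Details

Motivation: Existing models suffer from temporal inconsistency (such as flickering) in video sequences under motion, occlusion, and lighting changes, and paired human video supervision for multiple dense tasks is lacking. The paper addresses both problems.

Result: The method achieves state-of-the-art performance on the THuman2.1 and Hi4D benchmarks and generalizes effectively to real-world videos.

Insight: The contributions are: a scalable synthetic data pipeline that provides both frame-level and sequence-level supervision; a unified ViT-based dense predictor that injects an explicit human geometric prior via CSE embeddings and uses a lightweight channel reweighting module to improve the reliability of geometric features; and a two-stage training strategy that combines static pretraining with dynamic sequence supervision to improve spatiotemporal consistency.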

Abstract: In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and they rarely have paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static data synthetic pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.


[199] Moonworks Lunara Aesthetic II: An Image Variation Dataset cs.CVPDF

Yan Wang, Partho Hassan, Samiha Sadeka, Nada Soliman, M M Sayeef Abdullah

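TL;DR: The paper introduces Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset contains 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each pair applies a contextual transformation such as illumination, weather, viewpoint, scene composition, color tone, or mood while preserving a stable underlying identity.

Details

Motivation: The paper targets the evaluation and learning of contextual consistency in image generation and editing systems. It provides a dataset whose identity-preserving contextual transformations serve as a supervision signal, supporting benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness.

Result: The dataset shows high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets.

Insight: The novelty lies in operationalizing identity-preserving contextual variation as a supervision signal while retaining high aesthetic scores, giving image generation and image-to-image systems a benchmark dataset with interpretable, relational supervision.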

Abstract: We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood; while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara’s signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations.


[200] VRGaussianAvatar: Integrating 3D Gaussian Avatars into VR cs.CV | cs.GRPDF

Hail Song, Boram Yoon, Seokhwan Yang, Seoyoung Kang, Hyunjeong Kim

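TL;DR: VRGaussianAvatar is an integrated system that uses head-mounted display (HMD) tracking signals to drive real-time full-body 3D Gaussian Splatting (3DGS) avatars in virtual reality. It adopts a parallel pipeline with a VR Frontend and a GA Backend: the frontend estimates full-body pose via inverse kinematics, and the backend stereoscopically renders a 3DGS avatar reconstructed from a single image. To improve efficiency, it introduces Binocular Batching, which jointly processes the left and right eye views in a single batch.

Details

Motivation: Rendering real-time, high-fidelity full-body avatars in VR from HMD signals alone is challenging. The paper addresses this to improve immersion and interactive performance.

Result: Quantitative performance tests and a user study show that the system sustains interactive VR performance and outperforms image- and video-based mesh avatar baselines in perceived appearance similarity, embodiment, and plausibility.

Insight: The contributions are a parallel system architecture that integrates 3DGS avatars into VR, and Binocular Batching to make stereo rendering more efficient. Viewed objectively, single-image reconstruction and full-body pose estimation driven by real-time HMD signals have practical application potential.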

Abstract: We present VRGaussianAvatar, an integrated system that enables real-time full-body 3D Gaussian Splatting (3DGS) avatars in virtual reality using only head-mounted display (HMD) tracking signals. The system adopts a parallel pipeline with a VR Frontend and a GA Backend. The VR Frontend uses inverse kinematics to estimate full-body pose and streams the resulting pose along with stereo camera parameters to the backend. The GA Backend stereoscopically renders a 3DGS avatar reconstructed from a single image. To improve stereo rendering efficiency, we introduce Binocular Batching, which jointly processes left and right eye views in a single batched pass to reduce redundant computation and support high-resolution VR displays. We evaluate VRGaussianAvatar with quantitative performance tests and a within-subject user study against image- and video-based mesh avatar baselines. Results show that VRGaussianAvatar sustains interactive VR performance and yields higher perceived appearance similarity, embodiment, and plausibility. Project page and source code are available at https://vrgaussianavatar.github.io.
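The idea behind Binocular Batching can be illustrated with a toy sketch. The code below is not the VRGaussianAvatar renderer: `pose_points`, `project`, and the eye offsets are hypothetical stand-ins, and a pinhole projection replaces 3DGS rasterization. It only shows that view-independent work can be done once per frame while the two eyes are handled as a batch dimension, producing the same result as two separate passes.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Pixel = Tuple[float, float]


def pose_points(points: Sequence[Point], scale: float) -> List[Point]:
    """View-independent step (a stand-in for posing the avatar)."""
    return [(x * scale, y * scale, z * scale) for x, y, z in points]


def project(point: Point, eye_x: float, focal: float = 1.0) -> Pixel:
    """Pinhole projection for a camera shifted by eye_x along the x axis."""
    x, y, z = point
    return (focal * (x - eye_x) / z, focal * y / z)


def render_two_passes(points: Sequence[Point], scale: float,
                      eye_offsets: Sequence[float]) -> List[List[Pixel]]:
    """Baseline: one full pass per eye, so the posing step runs once per eye."""
    return [[project(p, eye) for p in pose_points(points, scale)]
            for eye in eye_offsets]


def render_batched(points: Sequence[Point], scale: float,
                   eye_offsets: Sequence[float]) -> List[List[Pixel]]:
    """Batched: pose once, then project for every eye in the batch."""
    posed = pose_points(points, scale)
    return [[project(p, eye) for p in posed] for eye in eye_offsets]


points = [(0.0, 1.6, 2.0), (0.2, 1.0, 2.5)]
eyes = (-0.032, 0.032)  # hypothetical per-eye camera offsets, in meters
left_view, right_view = render_batched(points, 1.0, eyes)
```

In a GPU renderer the batch would typically be a tensor dimension rather than a Python loop; the sketch only captures where the redundant work is removed.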


[201] SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking cs.CVPDF

Yinchao Ma, Dengqing Yang, Zhangyu He, Wenfei Yang, Tianzhu Zhang

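TL;DR: This paper proposes SMTrack, a new visual tracking method that uses a state space model for efficient long-range temporal modeling. By introducing a selective state-aware space model and a hidden-state propagation mechanism, it integrates temporal cues with linear computational complexity in both training and tracking, substantially reducing computational cost while maintaining strong performance.

Details

Motivation: Existing CNN- and Transformer-based trackers have inherent limitations in modeling long-range temporal dependencies and typically need complex customized modules or high computational cost. The paper aims to provide a simple and efficient temporal modeling paradigm for visual tracking.

Result: Extensive experiments show that SMTrack achieves competitive tracking performance at low computational cost.

Insight: The main novelty is bringing state space models (Mamba) to visual tracking, proposing a selective space model with state-wise parameters, and enabling cross-frame interaction through hidden-state propagation, which offers a new approach to efficient temporal modeling.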

Abstract: Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging especially in dynamic scenarios. Thus, numerous methods are proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking without needing customized modules or substantial computational costs to build long-range temporal dependencies. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which releases computational costs of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.
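Hidden-state propagation can be sketched with a plain linear state space recurrence. This is a minimal illustration, not SMTrack's model: the scalar parameters `a`, `b`, `c` are hypothetical and fixed, whereas the paper describes selective, state-wise parameters. The sketch shows why tracking can proceed frame by frame at constant cost: carrying the hidden state forward gives the same outputs as reprocessing the whole sequence.

```python
from typing import List, Sequence, Tuple


def ssm_scan(xs: Sequence[float], a: float, b: float, c: float,
             h0: float = 0.0) -> Tuple[List[float], float]:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    Returns the outputs and the final hidden state so a later call can resume.
    """
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x  # state update: linear in sequence length
        ys.append(c * h)   # readout
    return ys, h


frames = [0.5, -1.0, 2.0, 0.25]  # one scalar feature per frame (toy data)
a, b, c = 0.9, 0.1, 2.0          # hypothetical fixed parameters

# Offline: process the whole sequence at once (as in training).
offline, _ = ssm_scan(frames, a, b, c)

# Online: process one frame at a time, carrying only the hidden state.
online: List[float] = []
state = 0.0
for x in frames:
    ys, state = ssm_scan([x], a, b, c, h0=state)
    online.extend(ys)
```

The online loop stores a single state value between frames instead of the past frames themselves, which is the property that keeps per-frame cost independent of history length.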


[202] FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding cs.CV | cs.AIPDF

Kangcong Li, Peng Ye, Lin Zhang, Chao Wang, Huafeng Qin

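TL;DR: This paper proposes FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain's logarithmic perception and memory consolidation. It targets the limited adaptivity, detail loss, and context fragmentation that multimodal large language models exhibit in online streaming video understanding. A Multi-scale Frequency Memory module and a Space Thumbnail Memory module work together to preserve short-term fidelity while ensuring long-term coherence, significantly improving streaming video understanding.

Details

Motivation: Moving multimodal large language models from offline to online streaming video understanding is essential for continuous perception, but existing methods lack flexible adaptivity, leading to irreversible detail loss and context fragmentation.

Result: On StreamingBench, OV-Bench, and OVO-Bench, FreshMem improves the Qwen2-VL baseline by 5.20%, 4.52%, and 2.34%, respectively. Although it is training-free, it outperforms several fully fine-tuned methods, offering an efficient paradigm for long-horizon streaming video understanding.

Insight: The claimed novelty is a brain-inspired frequency-space hybrid memory architecture: Multi-scale Frequency Memory captures a global historical "gist" supplemented by residual details, and Space Thumbnail Memory forms episodic clusters through an adaptive compression strategy. Viewed objectively, two ideas are worth borrowing: the hybrid strategy of combining frequency-domain compression with spatial-domain summarization to balance detail and coherence, and the fact that the solution is training-free and efficient.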

Abstract: Transitioning Multimodal Large Language Models (MLLMs) from offline to online streaming video understanding is essential for continuous perception. However, existing methods lack flexible adaptivity, leading to irreversible detail loss and context fragmentation. To resolve this, we propose FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain’s logarithmic perception and memory consolidation. FreshMem reconciles short-term fidelity with long-term coherence through two synergistic modules: Multi-scale Frequency Memory (MFM), which projects overflowing frames into representative frequency coefficients, complemented by residual details to reconstruct a global historical “gist”; and Space Thumbnail Memory (STM), which discretizes the continuous stream into episodic clusters by employing an adaptive compression strategy to distill them into high-density space thumbnails. Extensive experiments show that FreshMem significantly boosts the Qwen2-VL baseline, yielding gains of 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench, respectively. As a training-free solution, FreshMem outperforms several fully fine-tuned methods, offering a highly efficient paradigm for long-horizon streaming video understanding.
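The notion of keeping a few frequency coefficients as a "gist" plus residual details can be sketched with a discrete cosine transform over time. This is an assumption-laden illustration: the abstract does not name the transform, so the orthonormal DCT-II used here, the function names, and the scalar per-frame feature are all hypothetical, and the multi-scale aspect is omitted.

```python
import math
from typing import List, Sequence, Tuple


def dct(xs: Sequence[float]) -> List[float]:
    """Orthonormal DCT-II of a sequence (one value per frame)."""
    n = len(xs)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(xs)))
    return out


def idct(coeffs: Sequence[float], n: int) -> List[float]:
    """Inverse of dct; coefficients beyond len(coeffs) are treated as zero."""
    out = []
    for i in range(n):
        total = 0.0
        for k, coeff in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * coeff * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def compress(xs: Sequence[float], keep: int) -> Tuple[List[float], List[float]]:
    """Keep the `keep` lowest-frequency coefficients and the residual detail."""
    coeffs = dct(xs)[:keep]
    gist = idct(coeffs, len(xs))
    residual = [x - g for x, g in zip(xs, gist)]
    return coeffs, residual


history = [1.0, 2.0, 4.0, 3.0]  # toy per-frame features
coeffs, residual = compress(history, keep=1)
gist = idct(coeffs, len(history))  # with one coefficient, the gist is the mean
```

Keeping more coefficients trades memory for fidelity; with all coefficients kept, the residual vanishes and the history is reconstructed exactly.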


[203] Cross-Modal Alignment and Fusion for RGB-D Transmission-Line Defect Detection cs.CV | cs.AIPDF

Jiaming Cui, Shuai Zhou, Wenqiang Li, Ruifeng Qin, Feng Shen

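TL;DR: This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network for RGB-D transmission-line defect detection. Through a dictionary-based feature purification module and a global contextual semantic integration framework, it integrates RGB appearance and depth geometry to handle small-scale defects, complex backgrounds, and illumination variations.

Details

Motivation: In transmission-line defect detection, small-scale defects dominate, backgrounds are complex, and illumination varies widely, so existing RGB-based detectors struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. Depth information therefore needs to be fused in to improve detection.

Result: On the TLRGBD benchmark (where 94.5% of instances are small objects), CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors and matching Transformer-based methods at substantially lower computational cost.

Insight: The contributions are: a dictionary-learning-based feature purification module that uses position-wise normalization for explicit reconstruction-driven cross-modal alignment; a partial-channel attention mechanism that captures global spatial dependencies to strengthen structural semantic reasoning; and a purify-then-fuse paradigm that ensures statistical compatibility between heterogeneous features before fusion, suppressing modality-specific noise while preserving defect-discriminative information.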

Abstract: Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.


[204] FastPhysGS: Accelerating Physics-based Dynamic 3DGS Simulation via Interior Completion and Adaptive Optimization cs.CVPDF

Yikun Ma, Yiqing Li, Jingwen Ye, Zhongkai Wu, Weidong Zhang

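TL;DR: FastPhysGS is a fast and robust framework for physics-based dynamic 3D Gaussian Splatting (3DGS) simulation, addressing the challenges existing methods face when extending 3DGS to 4D physical simulation. It uses Instance-aware Particle Filling (IPF) with Monte Carlo Importance Sampling (MCIS) to efficiently populate interior particles while preserving geometric fidelity, and Bidirectional Graph Decoupling Optimization (BGDO), an adaptive strategy that rapidly optimizes material parameters predicted by a vision-language model (VLM).

Details

Motivation: Existing methods based on the Material Point Method (MPM) either rely on manual parameter tuning or distill dynamics from video diffusion models, which limits generalization and optimization efficiency. Recent attempts using LLMs/VLMs suffer from a text/image-to-3D perceptual gap that yields unstable physical behavior, and they often ignore the surface structure of 3DGS, producing implausible motion.

Result: Experiments show that FastPhysGS achieves high-fidelity physical simulation in 1 minute using only 7 GB of runtime memory, outperforming prior work and showing broad application potential.

Insight: The paper's novelty is IPF combined with MCIS for efficient interior particle filling, and the BGDO adaptive strategy for faster material parameter optimization. Viewed objectively, by combining geometry-aware particle generation with adaptive optimization of VLM predictions, the method substantially improves speed and memory efficiency while preserving simulation quality, offering a new efficient solution for physical simulation of dynamic 3DGS.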

Abstract: Extending 3D Gaussian Splatting (3DGS) to 4D physical simulation remains challenging. Based on the Material Point Method (MPM), existing methods either rely on manual parameter tuning or distill dynamics from video diffusion models, limiting the generalization and optimization efficiency. Recent attempts using LLMs/VLMs suffer from a text/image-to-3D perceptual gap, yielding unstable physics behavior. In addition, they often ignore the surface structure of 3DGS, leading to implausible motion. We propose FastPhysGS, a fast and robust framework for physics-based dynamic 3DGS simulation: (1) Instance-aware Particle Filling (IPF) with Monte Carlo Importance Sampling (MCIS) to efficiently populate interior particles while preserving geometric fidelity; (2) Bidirectional Graph Decoupling Optimization (BGDO), an adaptive strategy that rapidly optimizes material parameters predicted from a VLM. Experiments show FastPhysGS achieves high-fidelity physical simulation in 1 minute using only 7 GB runtime memory, outperforming prior works with broad potential applications.


[205] DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation cs.CVPDF

Tushar Anand, Maheswar Bora, Antitza Dantcheva, Abhijit Das

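TL;DR: This paper proposes DenVisCoM, a new Mamba block, together with a hybrid architecture designed for accurate, real-time estimation of optical flow and disparity. The architecture handles multi-view geometry and motion tasks in a unified way, combining DenVisCoM with a Transformer-based attention block to jointly estimate motion and 3D dense perception tasks with real-time inference, a low memory footprint, and high accuracy.

Details

Motivation: Real-time speed, memory efficiency, and accuracy are hard to achieve together in multi-view geometry and motion tasks such as optical flow and disparity estimation. The paper proposes a unified architecture that handles these related tasks jointly.

Result: The trade-off between accuracy and real-time processing is analyzed extensively on a large number of datasets, and the experiments indicate that the model estimates optical flow and disparity accurately in real time. The abstract does not mention specific benchmarks or SOTA comparisons.

Insight: The contributions are the DenVisCoM Mamba block and a hybrid architecture that combines Mamba with a Transformer to handle optical flow and disparity estimation in an efficient, unified way, improving performance on real-time dense visual correspondence tasks.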

Abstract: In this work, we propose a novel Mamba block, DenVisCoM, as well as a novel hybrid architecture specifically tailored for accurate and real-time estimation of optical flow and disparity. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block that efficiently addresses real-time inference, memory footprint, and accuracy at the same time for joint estimation of motion and 3D dense perception tasks. We extensively benchmark the trade-off between accuracy and real-time processing on a large number of datasets. Our experimental results and related analysis suggest that our proposed model can accurately estimate optical flow and disparity in real time. All models and associated code are available at https://github.com/vimstereo/DenVisCoM.


[206] Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models cs.CVPDF

Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding

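TL;DR: This paper proposes a simple yet effective method for detecting AI-generated images: train a linear classifier on the frozen features of modern vision foundation models. The method sets a new state of the art across traditional benchmarks, unseen generators, and challenging in-the-wild data; on in-the-wild datasets its accuracy exceeds existing specialized detectors by more than 30%.

Details

Motivation: Existing specialized detectors for AI-generated images perform near-perfectly on curated benchmarks but degrade sharply in realistic, open scenarios. The paper explores a simpler, more general detection approach to address this lack of generalization.

Result: The method matches specialized detectors on traditional benchmarks and far outperforms them on in-the-wild datasets, improving accuracy by more than 30% and setting a new SOTA.

Insight: The paper's novelty is showing that a simple linear classifier on the frozen features of modern vision foundation models (such as Perception Encoder, MetaCLIP 2, and DINOv3) is a strong, generalizable AIGI detection baseline. Its core insight is that this capability emerges from large-scale pretraining data that contains synthetic content: vision-language models internalize an explicit semantic concept of forgery, while self-supervised models implicitly learn discriminative forensic features from the pretraining data. This argues for a paradigm shift in AI forensics, from overfitting static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.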

Abstract: While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer from a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models, including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer from performance degradation under recapture and transmission, and remain blind to VAE reconstruction and localized editing. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.
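A linear probe on frozen features can be sketched in a few lines. The example below is a toy under stated assumptions: the two-dimensional Gaussian "features" are synthetic stand-ins for embeddings from a frozen backbone, and `train_linear_probe` is a hypothetical plain logistic regression, not the authors' training recipe. Only the linear layer is trained; the features are never updated.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_linear_probe(features: Sequence[Sequence[float]], labels: Sequence[int],
                       lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit logistic regression weights on fixed features with gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "generated"
            for j in range(dim):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return 1 (generated) or 0 (real) from the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Synthetic stand-ins for frozen backbone features (assumption, not real data).
rng = random.Random(0)
real = [[rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)] for _ in range(50)]
fake = [[rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)] for _ in range(50)]
features = real + fake
labels = [0] * 50 + [1] * 50

w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
```

In practice the features would be extracted once from the frozen model and cached, so training the probe costs far less than fine-tuning the backbone.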


[207] Tail-Aware Post-Training Quantization for 3D Geometry Models cs.CVPDF

Sicheng Pan, Chen Tang, Shuzhao Xie, Ke Yang, Weixiang Zhang

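TL;DR: This paper proposes TAPTQ, a Tail-Aware Post-Training Quantization method designed for 3D geometry models. Through a progressive coarse-to-fine calibration construction strategy, a ternary-search-based solver for the quantization interval, and module-wise compensation guided by Tail Relative Error, it addresses the calibration data bottleneck, computational complexity, and quantization error accumulation in 3D model quantization.

Details

Motivation: The growing complexity and scale of 3D geometry models make deployment on resource-constrained platforms difficult. Existing post-training quantization methods are mainly optimized for 2D Vision Transformers and do not transfer well to 3D models, which have intricate feature distributions and prohibitive calibration overhead.

Result: Extensive experiments on the VGGT and Pi3 benchmarks show that TAPTQ consistently outperforms state-of-the-art post-training quantization methods in accuracy while significantly reducing calibration time.

Insight: The contributions are: (1) a progressive coarse-to-fine calibration construction strategy that overcomes the data-scale bottleneck of 3D datasets; (2) a reformulation of the quantization interval search as an optimization problem with a ternary-search-based solver, reducing computational complexity from O(N) to O(log N); (3) a module-wise compensation mechanism guided by Tail Relative Error that adaptively identifies and corrects distortions in modules sensitive to long-tailed activation outliers.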

Abstract: The burgeoning complexity and scale of 3D geometry models pose significant challenges for deployment on resource-constrained platforms. While Post-Training Quantization (PTQ) enables efficient inference without retraining, conventional methods, primarily optimized for 2D Vision Transformers, fail to transfer effectively to 3D models due to intricate feature distributions and prohibitive calibration overhead. To address these challenges, we propose TAPTQ, a Tail-Aware Post-Training Quantization pipeline specifically engineered for 3D geometric learning. Our contribution is threefold: (1) To overcome the data-scale bottleneck in 3D datasets, we develop a progressive coarse-to-fine calibration construction strategy that constructs a highly compact subset to achieve both statistical purity and geometric representativeness. (2) We reformulate the quantization interval search as an optimization problem and introduce a ternary-search-based solver, reducing the computational complexity from $\mathcal{O}(N)$ to $\mathcal{O}(\log N)$ for accelerated deployment. (3) To mitigate quantization error accumulation, we propose TRE-Guided Module-wise Compensation, which utilizes a Tail Relative Error (TRE) metric to adaptively identify and rectify distortions in modules sensitive to long-tailed activation outliers. Extensive experiments on the VGGT and Pi3 benchmarks demonstrate that TAPTQ consistently outperforms state-of-the-art PTQ methods in accuracy while significantly reducing calibration time. The code will be released soon.
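The drop from O(N) to O(log N) evaluations follows from replacing an exhaustive scan over N candidate intervals with ternary search. The sketch below is a generic ternary search, not TAPTQ's solver: it assumes the error is strictly unimodal over the candidate index (the abstract does not state how this is ensured), and `quant_error` is a hypothetical uniform-quantization error used only for illustration.

```python
from typing import Callable, Sequence


def ternary_search_min(f: Callable[[int], float], lo: int, hi: int) -> int:
    """Return the index in [lo, hi] that minimizes f.

    Assumes f is strictly unimodal: strictly decreasing, then strictly increasing.
    Each iteration discards about a third of the range, so the number of
    evaluations grows logarithmically with the range size.
    """
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if f(m1) < f(m2):
            hi = m2 - 1  # the minimum cannot lie at or beyond m2
        else:
            lo = m1 + 1  # the minimum cannot lie at or before m1
    return min(range(lo, hi + 1), key=f)


def quant_error(values: Sequence[float], clip: float, levels: int = 15) -> float:
    """Mean squared error of uniform quantization after clipping to [-clip, clip].

    Hypothetical objective; `clip` must be positive.
    """
    step = 2.0 * clip / levels
    total = 0.0
    for v in values:
        clipped = max(-clip, min(clip, v))
        quantized = round((clipped + clip) / step) * step - clip
        total += (v - quantized) ** 2
    return total / len(values)
```

A caller could index a list of candidate clipping thresholds and pass `lambda i: quant_error(values, candidates[i])` as `f`; whether that objective is unimodal for a given activation distribution is an assumption that would need checking.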


[208] ObjEmbed: Towards Universal Multimodal Object Embeddings cs.CVPDF

Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie

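TL;DR: This paper presents ObjEmbed, a novel multimodal large language model embedding approach that decomposes the input image into multiple regional embeddings (each corresponding to an individual object) plus global embeddings, to address the challenge of fine-grained alignment in vision-language understanding. It supports visual grounding, local image retrieval, and global image retrieval, and has three key properties: object-oriented representation, versatility, and efficient encoding.

Details

Motivation: Current multimodal embedding models excel at global image-text alignment but struggle with fine-grained alignment between image regions and specific phrases, so a universal object embedding approach that handles both region-level and image-level tasks is needed.

Result: The model shows superior performance on 18 diverse benchmarks, demonstrating strong semantic discrimination and reaching an advanced level of fine-grained vision-language alignment.

Insight: The novelty lies in generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. Combining semantic similarity with the predicted IoU improves retrieval accuracy. The model also encodes all objects in the image together with the full image in a single forward pass, balancing efficiency and versatility.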

Abstract: Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
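How a matching score might combine semantic similarity with predicted localization quality can be sketched as follows. The abstract does not give the combination rule, so the product used here is an assumption, and `match_score`, `best_region`, and the toy vectors are hypothetical names and data.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_score(object_emb: Sequence[float], text_emb: Sequence[float],
                predicted_iou: float) -> float:
    """Semantic similarity weighted by predicted box quality (assumed product rule)."""
    return cosine(object_emb, text_emb) * predicted_iou


def best_region(regions: Sequence[Tuple[Sequence[float], float]],
                text_emb: Sequence[float]) -> int:
    """Return the index of the region with the highest combined score."""
    scores: List[float] = [match_score(emb, text_emb, iou) for emb, iou in regions]
    return max(range(len(scores)), key=scores.__getitem__)


text = [1.0, 0.0]
regions = [
    ([1.0, 0.0], 0.3),    # perfect semantic match, poorly localized box
    ([0.9, 0.436], 0.9),  # slightly weaker match, well localized box
]
winner = best_region(regions, text)
```

Under this rule a tightly localized box can outrank a loosely localized one even when its semantic similarity is slightly lower, which is the behavior the abstract's description suggests.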


[209] Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation cs.CVPDF

Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang

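TL;DR: Mind-Brush is a unified agentic framework that turns image generation into a dynamic, knowledge-driven workflow. Simulating a human-like 'think-research-create' paradigm, it actively retrieves multimodal evidence to handle out-of-distribution concepts and uses reasoning tools to resolve implicit visual constraints, improving both its understanding of complex user intent and its generation ability.

Details

Motivation: Most existing text-to-image models are static decoders that struggle to grasp implicit user intent. Unified understanding-generation models improve on this but still cannot handle tasks that involve complex knowledge reasoning, and, constrained by static internal priors, they cannot adapt to the evolving real world.

Result: On the proposed Mind-Bench benchmark (500 samples spanning real-time news, emerging concepts, and domains such as mathematical and geographic reasoning), Mind-Brush significantly enhances the capabilities of unified models, giving the Qwen-Image baseline a zero-to-one capability leap, and achieves superior results on established benchmarks such as WISE and RISE.

Insight: The novelty lies in framing generation as a dynamic, knowledge-driven agentic workflow that integrates active retrieval and reasoning tools to better handle out-of-distribution concepts and implicit constraints, which points to a new way of building generation systems that adapt to a changing real world.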

Abstract: While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle to accomplish tasks involving complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like ‘think-research-create’ paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark comprising 500 distinct samples spanning real-time news, emerging concepts, and domains such as mathematical and Geo-Reasoning. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench, while achieving superior results on established benchmarks like WISE and RISE.


[210] MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement cs.CVPDF

Hao Zhang, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma

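TL;DR: This paper proposes MagicFuse, a single-image fusion framework that produces a comprehensive cross-spectral scene representation from a single low-quality visible image. Built on diffusion models, it mines information obscured in the visible spectrum and learns the thermal radiation distribution patterns of the infrared spectrum, then fuses them under visual and semantic constraints to support downstream tasks.

Details

Motivation: The paper asks how to keep the benefits of multi-modal image fusion under harsh conditions when only visible imaging sensors are available, extending conventional data-level fusion to the knowledge level.

Result: Extensive experiments show that, despite relying on a single degraded visible image, MagicFuse matches or even surpasses state-of-the-art fusion methods that require multi-modal inputs in visual and semantic representation performance.

Insight: The novelty lies in the concept of single-image fusion and in diffusion-based knowledge reinforcement and generation branches. A cross-spectral scene representation is obtained by fusing their probabilistic noise, and visual and semantic constraints ensure that it both satisfies human observation and supports semantic decision-making.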

Abstract: This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.


[211] GDPR-Compliant Person Recognition in Industrial Environments Using MEMS-LiDAR and Hybrid Data cs.CVPDF

Dennis Basile, Dennis Sprute, Helene Dörksen, Holger Flatt

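TL;DR: This paper proposes a GDPR-compliant person recognition approach for industrial indoor environments based on MEMS-LiDAR and hybrid data. It protects privacy by capturing anonymized 3D point clouds, and augments real LiDAR data with synthetic data generated by the CARLA simulation framework to reduce data collection and annotation effort.

Details

Motivation: The goal is reliable detection of unauthorized individuals in safety-critical industrial indoor spaces. Conventional deep-learning vision methods are sensitive to lighting and visibility conditions, may violate privacy regulations such as the GDPR, and depend on real annotated data that is time-consuming and error-prone to produce.

Result: Compared with a model trained only on real data, the hybrid data approach improves average precision by 44 percentage points while reducing manual annotation effort by 50%.

Insight: The novelty lies in combining privacy-preserving MEMS-LiDAR hardware with a hybrid data strategy that augments real recordings with synthetic data, giving a high-performance, scalable, and cost-efficient person detection solution that remains GDPR-compliant.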

Abstract: The reliable detection of unauthorized individuals in safety-critical industrial indoor spaces is crucial to avoid plant shutdowns, property damage, and personal hazards. Conventional vision-based methods that use deep-learning approaches for person recognition provide image information but are sensitive to lighting and visibility conditions and often violate privacy regulations, such as the General Data Protection Regulation (GDPR) in the European Union. Typically, detection systems based on deep learning require annotated data for training. Collecting and annotating such data, however, is highly time-consuming and, because it is done manually, not necessarily error-free. Therefore, this paper presents a privacy-compliant approach based on Micro-Electro-Mechanical Systems LiDAR (MEMS-LiDAR), which exclusively captures anonymized 3D point clouds and avoids personal identification features. To compensate for the large amount of time required to record real LiDAR data and for post-processing and annotation, real recordings are augmented with synthetically generated scenes from the CARLA simulation framework. The results demonstrate that the hybrid data improves the average precision by 44 percentage points compared to a model trained exclusively with real data while reducing the manual annotation effort by 50%. Thus, the proposed approach provides a scalable, cost-efficient alternative to purely real-data-based methods and systematically shows how synthetic LiDAR data can combine high performance in person detection with GDPR compliance in an industrial environment.


[212] Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention cs.CV | cs.AIPDF

Dvir Samuel, Issar Tzachor, Matan Levy, Micahel Green, Gal Chechik

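TL;DR: This paper proposes a unified, training-free attention framework for autoregressive video diffusion models and world models, targeting the rising latency and GPU memory caused by KV cache growth at inference time. It has three modules: TempCache compresses the KV cache via temporal correspondence to bound its growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens with fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys using a lightweight ANN.

Details

Motivation: During inference with autoregressive video diffusion models, the KV cache of the core attention layers keeps growing as generation progresses. This increases latency and GPU memory usage, which in turn restricts the usable temporal context and harms long-range consistency. The paper targets this bottleneck.

Result: Experiments show end-to-end speedups of up to 5x to 10x with near-identical visual quality, along with stable throughput and nearly constant peak GPU memory usage over long rollouts, whereas prior methods progressively slow down and use more and more memory.

Insight: The paper's novelty is identifying three sources of redundancy in autoregressive video diffusion and building on them a unified, training-free attention optimization framework that compresses the cache, accelerates cross-attention, and sparsifies self-attention, while remaining compatible with existing models. Viewed objectively, the idea of using temporal correspondence and semantic matching for sparsification is worth borrowing.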

Abstract: Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate end-to-end speedups of up to 5x to 10x while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
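The effect of merging near-duplicate cached keys can be sketched with a toy cache. This is not the TempCache algorithm: the exhaustive cosine search stands in for the temporal correspondence or ANN matching the abstract mentions, and the running-average merge rule, the threshold, and all names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def add_to_cache(cache: List[Dict], key: Sequence[float], value: Sequence[float],
                 threshold: float = 0.98) -> None:
    """Insert a key/value pair, merging it into a near-duplicate entry if one exists."""
    best_index, best_sim = -1, -1.0
    for i, entry in enumerate(cache):  # exact search; an ANN index would replace this
        sim = cosine(entry["key"], key)
        if sim > best_sim:
            best_index, best_sim = i, sim
    if best_index >= 0 and best_sim >= threshold:
        entry = cache[best_index]
        n = entry["count"]
        # Running average keeps one representative per group of similar keys.
        entry["key"] = [(k * n + x) / (n + 1) for k, x in zip(entry["key"], key)]
        entry["value"] = [(v * n + x) / (n + 1) for v, x in zip(entry["value"], value)]
        entry["count"] = n + 1
    else:
        cache.append({"key": list(key), "value": list(value), "count": 1})


cache: List[Dict] = []
for t in range(50):  # 50 frames, two slowly drifting tokens per frame
    add_to_cache(cache, [1.0, 0.001 * t], [float(t)])
    add_to_cache(cache, [0.001 * t, 1.0], [float(-t)])
```

Here 100 insertions collapse into two entries because each token drifts only slightly between frames; cache size then tracks the number of distinct contents rather than the number of frames.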


[213] GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation cs.CVPDF

Xiao Liang, Yunzhu Zhang, Linchao Zhu

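TL;DR: This paper proposes Guided Progressive Distillation (GPD), a framework that accelerates the denoising process of diffusion models for fast, high-quality video generation. A teacher model progressively guides a student model to use larger step sizes, and the method combines an online-generated training target with frequency-domain constraints in the latent space to greatly reduce the number of sampling steps.

Details

Motivation: Diffusion models have achieved remarkable success in video generation, but the high computational cost of the denoising process is the main bottleneck. Existing methods for reducing the number of diffusion steps show promise, yet they often degrade quality significantly when applied to video generation.

Result: Applied to the Wan2.1 model, GPD reduces the number of sampling steps from 48 to 6 while maintaining competitive visual quality on the VBench benchmark. Compared with existing distillation methods, GPD shows clear advantages in pipeline simplicity and quality preservation.

Insight: The novelty is a progressive distillation training strategy that combines an online-generated training target, which lowers optimization difficulty and improves computational efficiency, with frequency-domain constraints in the latent space, which help preserve fine-grained details and temporal dynamics, so that quality is maintained while generation is accelerated.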

Abstract: Diffusion models have achieved remarkable success in video generation; however, the high computational cost of the denoising process remains a major bottleneck. Existing approaches have shown promise in reducing the number of diffusion steps, but they often suffer from significant quality degradation when applied to video generation. We propose Guided Progressive Distillation (GPD), a framework that accelerates the diffusion process for fast and high-quality video generation. GPD introduces a novel training strategy in which a teacher model progressively guides a student model to operate with larger step sizes. The framework consists of two key components: (1) an online-generated training target that reduces optimization difficulty while improving computational efficiency, and (2) frequency-domain constraints in the latent space that promote the preservation of fine-grained details and temporal dynamics. Applied to the Wan2.1 model, GPD reduces the number of sampling steps from 48 to 6 while maintaining competitive visual quality on VBench. Compared with existing distillation methods, GPD demonstrates clear advantages in both pipeline simplicity and quality preservation.
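The general idea of a teacher guiding a student toward larger steps can be sketched with a scalar ODE. This is a generic progressive-distillation illustration, not GPD's training procedure: the Euler sampler, the two-small-steps-per-large-step pairing, and the toy velocity field are all assumptions, and the frequency-domain constraints are not shown.

```python
from typing import Callable

Velocity = Callable[[float, float], float]


def euler_step(x: float, t: float, h: float, velocity: Velocity) -> float:
    """One Euler step of size h along dx/dt = velocity(x, t)."""
    return x + h * velocity(x, t)


def distillation_target(x: float, t: float, h: float, teacher: Velocity) -> float:
    """Velocity a student should predict so one step of size 2h matches two teacher steps.

    The target is computed on the fly from the current teacher, which is the
    sense in which it is generated online in this sketch.
    """
    mid = euler_step(x, t, h, teacher)
    end = euler_step(mid, t + h, h, teacher)
    return (end - x) / (2.0 * h)


def toy_teacher(x: float, t: float) -> float:
    """Hypothetical velocity field standing in for a trained teacher model."""
    return -x


target = distillation_target(x=1.0, t=0.0, h=0.1, teacher=toy_teacher)
student_end = 1.0 + 2.0 * 0.1 * target  # one large student step using the target
```

If each distillation stage doubled the step size (an assumption; the abstract does not give the schedule), going from 48 to 6 sampling steps would correspond to three stages: 48, 24, 12, 6.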


[214] Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies cs.CV [PDF]

Wenjin Hou, Wei Liu, Han Hu, Xiaoxiao Sun, Serena Yeung-Levy

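TL;DR: This paper introduces VIA-Bench, a benchmark for evaluating the robustness of multimodal large language models (MLLMs) on visual illusions and anomalies. It covers six core categories and contains over 1K high-quality question-answer pairs built with human review. Extensive evaluation of more than 20 SOTA MLLMs reveals significant vulnerability to visual stimuli that defy common-sense priors; in particular, Chain-of-Thought reasoning adds negligible robustness, pointing to a fundamental divergence between machine and human perception.

Details

Motivation: Existing MLLM evaluations typically rely on standard in-distribution data, so model robustness on visual illusions and anomalies that defy common-sense priors has not been adequately tested. A dedicated benchmark is needed to probe this question.

Result: More than 20 SOTA MLLMs (proprietary, open-source, and reasoning-enhanced) were evaluated on VIA-Bench and found to be significantly vulnerable; Chain-of-Thought reasoning offers negligible robustness and often yields "brittle mirages" in which the model's logic collapses.

Insight: The novelty lies in building the first MLLM benchmark dedicated to visual illusions and anomalies, VIA-Bench, which exposes fundamental weaknesses of MLLMs on unconventional perception tasks and shows that Chain-of-Thought reasoning can fail in such scenarios, giving key directions for improving perceptual robustness and advancing AGI.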

Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable proficiency on general-purpose vision-language benchmarks, reaching or even exceeding human-level performance. However, these evaluations typically rely on standard in-distribution data, leaving the robustness of MLLMs largely unexamined when faced with scenarios that defy common-sense priors. To address this gap, we introduce VIA-Bench, a challenging benchmark designed to probe model performance on visual illusions and anomalies. It includes six core categories: color illusions, motion illusions, gestalt illusions, geometric and spatial illusions, general visual illusions, and visual anomalies. Through careful human-in-the-loop review, we construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Extensive evaluation of over 20 state-of-the-art MLLMs, including proprietary, open-source, and reasoning-enhanced models, uncovers significant vulnerabilities. Notably, we find that Chain-of-Thought (CoT) reasoning offers negligible robustness, often yielding "brittle mirages" where the model’s logic collapses under illusory stimuli. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for the advancement of artificial general intelligence. The benchmark data and code will be released.


[215] Efficient Cross-Country Data Acquisition Strategy for ADAS via Street-View Imagery cs.CV [PDF]

Yin Wu, Daniel Slieter, Carl Esselborn, Ahmed Abouelazm, Tsung Yuan Tseng

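TL;DR: This paper proposes a street-view-based cross-country data acquisition strategy for efficiently obtaining the representative data needed to adapt ADAS/ADS perception models. The method uses publicly available street-view imagery to identify places of interest, selects key locations with two scoring methods (KNN feature distance and visual attribution), and is validated on traffic sign detection.

Details

Motivation: When ADAS/ADS are deployed in different countries, differences in legislation, traffic infrastructure, and visual conventions cause domain shift. Traditional cross-country data collection relies on large-scale on-road driving, which is costly and inefficient.

Result: On traffic sign detection, the approach matches the performance of random sampling while using only 50% of the target-domain data. Evaluation is repeatable thanks to a co-located dataset built from the Zenseact Open Dataset and Mapillary street-view images, and cost estimates show that large-scale street-view processing is economically feasible.

Insight: The novelty is using publicly available street-view imagery to guide data acquisition, with two POI scoring methods (KNN feature distance based on a vision foundation model, and visual attribution based on a vision-language model), yielding a low-cost, efficient data acquisition strategy for cross-country model adaptation.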

Abstract: Deploying ADAS and ADS across countries remains challenging due to differences in legislation, traffic infrastructure, and visual conventions, which introduce domain shifts that degrade perception performance. Traditional cross-country data collection relies on extensive on-road driving, making it costly and inefficient to identify representative locations. To address this, we propose a street-view-guided data acquisition strategy that leverages publicly available imagery to identify places of interest (POI). Two POI scoring methods are introduced: a KNN-based feature distance approach using a vision foundation model, and a visual-attribution approach using a vision-language model. To enable repeatable evaluation, we adopt a collect-detect protocol and construct a co-located dataset by pairing the Zenseact Open Dataset with Mapillary street-view images. Experiments on traffic sign detection, a task particularly sensitive to cross-country variations in sign appearance, show that our approach achieves performance comparable to random sampling while using only half of the target-domain data. We further provide cost estimations for full-country analysis, demonstrating that large-scale street-view processing remains economically feasible. These results highlight the potential of street-view-guided data acquisition for efficient and cost-effective cross-country model adaptation.
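The KNN-based feature distance idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes every street-view image has already been embedded as a feature vector, and it scores a candidate location by its mean Euclidean distance to the k nearest source-domain features, so that larger scores flag places that look least like the data already collected. The names `knn_novelty_score` and `rank_locations`, the choice of Euclidean distance, and the toy vectors are all hypothetical.

```python
import math

def knn_novelty_score(candidate, source_features, k=3):
    """Mean Euclidean distance from `candidate` to its k nearest source features.

    Hypothetical helper: a larger score means the candidate looks less like
    the source-domain data and is a more interesting place to collect data.
    """
    if not source_features:
        raise ValueError("source_features must not be empty")
    distances = sorted(math.dist(candidate, f) for f in source_features)
    nearest = distances[:min(k, len(distances))]
    return sum(nearest) / len(nearest)

def rank_locations(candidates, source_features, k=3):
    """Return candidate location ids sorted from most to least novel."""
    scores = {loc: knn_novelty_score(feat, source_features, k)
              for loc, feat in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 2-D "embeddings"; real features would come from a vision foundation model.
source = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
candidates = {"loc_a": (0.05, 0.05), "loc_b": (2.0, 2.0), "loc_c": (0.5, 0.5)}
ranking = rank_locations(candidates, source, k=2)
```

In this toy setup `ranking` places the far-away `loc_b` first and the in-distribution `loc_a` last; a data collection budget would then be spent on the top-ranked locations.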


[216] SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection cs.CV [PDF]

Qian Xu, Xi Li, Fei Gao, Jie Guo, Haojuan Yuan

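TL;DR: This paper proposes SPIRIT, a framework that adapts vision foundation models (VFMs) to infrared small target detection (IRSTD) through lightweight physics-informed plug-ins. It unifies single- and multi-frame inference: spatially, the PIFR module suppresses structured background and enhances sparse target signals; temporally, the PGMA module injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association. Together these address weak infrared target signals, limited semantic cues, and the large modality gap to visible-light imagery.

Details

Motivation: IRSTD is crucial for surveillance and early warning, but it faces scarce infrared data, weak target signals, limited semantic cues, and a modality gap to visible-spectrum imagery, which make direct use of semantics-oriented VFMs and appearance-only cross-frame association unreliable.

Result: Experiments on multiple IRSTD benchmarks show that SPIRIT delivers consistent gains over VFM-based baselines and reaches state-of-the-art (SOTA) performance.

Insight: The novelty is a unified, VFM-compatible framework that adapts to the infrared modality via lightweight physics-informed plug-ins (PIFR and PGMA): PIFR enhances target signals by approximating a rank-sparsity decomposition, and PGMA improves cross-frame association by injecting history-derived soft spatial priors, bridging the modality gap and improving detection robustness.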

Abstract: Infrared small target detection (IRSTD) is crucial for surveillance and early-warning, with deployments spanning both single-frame analysis and video-mode tracking. A practical solution should leverage vision foundation models (VFMs) to mitigate infrared data scarcity, while adopting a memory-attention-based temporal propagation framework that unifies single- and multi-frame inference. However, infrared small targets exhibit weak radiometric signals and limited semantic cues, which differ markedly from visible-spectrum imagery. This modality gap makes direct use of semantics-oriented VFMs and appearance-driven cross-frame association unreliable for IRSTD: hierarchical feature aggregation can submerge localized target peaks, and appearance-only memory attention becomes ambiguous, leading to spurious clutter associations. To address these challenges, we propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins. Spatially, PIFR refines features by approximating rank-sparsity decomposition to suppress structured background components and enhance sparse target-like signals. Temporally, PGMA injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video detection while naturally reverting to single-frame inference when temporal context is absent. Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and SOTA performance.


[217] CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions cs.CV | cs.AI [PDF]

Yuliang Zhan, Jian Li, Wenbing Huang, Yang Liu

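TL;DR: This paper proposes CloDS, an unsupervised cloth dynamics learning framework that learns cloth dynamics under unknown conditions from multi-view visual observations. It adopts a three-stage pipeline that includes a video-to-geometry grounding stage followed by training a dynamics model on the grounded meshes, and it introduces dual-position opacity modulation to handle large deformations and self-occlusion.

Details

Motivation: Existing deep learning methods for simulating complex dynamic systems require known physical properties as supervision or input, which limits their use under unknown conditions. This paper tackles unsupervised learning of cloth dynamics from purely visual observations.

Result: Comprehensive experiments show that CloDS effectively learns cloth dynamics from visual data and generalizes well to unseen configurations.

Insight: The novelties include the new Cloth Dynamics Grounding (CDG) scenario and the CloDS framework for it, which introduces mesh-based Gaussian splatting and dual-position opacity modulation to support bidirectional mapping between 2D observations and 3D geometry, thereby handling large non-linear deformations and severe self-occlusion.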

Abstract: Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, limiting their applicability under unknown conditions. To explore this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi-view visual observations. We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. CloDS adopts a three-stage pipeline that first performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes. To cope with large non-linear deformations and severe self-occlusions during grounding, we introduce a dual-position opacity modulation that supports bidirectional mapping between 2D observations and 3D geometry via mesh-based Gaussian splatting in the video-to-geometry grounding stage. It jointly considers the absolute and relative positions of Gaussian components. Comprehensive experimental evaluations demonstrate that CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations. Our code is available at https://github.com/whynot-zyl/CloDS. Visualization results are available at https://github.com/whynot-zyl/CloDS_video.


[218] WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization? cs.CV [PDF]

Pei Li, Jiaxi Yin, Lei Ouyang, Shihan Pan, Ge Wang

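TL;DR: This paper presents WS-IMUBench, a systematic benchmark study of weakly supervised IMU temporal action localization (WS-IMU-TAL) using only sequence-level labels. It evaluates seven representative weakly supervised methods transferred from the audio, image, and video domains on seven public IMU datasets, runs extensive experiments, and analyzes transferability, performance, and failure modes, with the aim of accelerating progress in the field.

Details

Motivation: Progress in IMU temporal action localization (IMU-TAL) is bottlenecked by dense, frame-level boundary annotations, which are costly and hard to scale. The study asks whether weakly supervised methods using only sequence-level labels can address this.

Result: Seven weakly supervised methods were run on seven public IMU datasets, totalling over 3,540 model training runs and 7,080 inference evaluations. Weak supervision can be competitive on favorable datasets (e.g., longer actions, higher-dimensional sensing); transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based methods.

Insight: The novelty is the first systematic benchmark for weakly supervised IMU-TAL, including a reproducible template, datasets, protocols, and analyses. Viewed objectively, the core contribution is revealing how cross-modal transfer behaves (temporal-domain methods fare better), identifying the key challenges (short actions, temporal ambiguity, proposal quality), and pointing to concrete future directions (e.g., IMU-specific proposal generation, boundary-aware objectives, stronger temporal reasoning).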

Abstract: IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how well established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL under only sequence-level labels. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and proposal quality. Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template, datasets, protocols, and analyses, to accelerate community-wide progress toward scalable WS-IMU-TAL.


[219] How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing cs.CV [PDF]

Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian

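TL;DR: This paper presents VIBE (Visual Instruction Benchmark for Image Editing), a systematic benchmark for evaluating how well models follow visual instructions (such as sketches) when editing images. The benchmark has three interaction levels (deictic grounding, morphological manipulation, and causal reasoning) and introduces an evaluation framework based on large multimodal models. A comprehensive evaluation of 17 representative models finds that proprietary models follow visual instructions better than open-source ones, but all models degrade markedly as task difficulty increases.

Details

Motivation: Existing image editing systems rely mainly on text guidance, whereas human communication is inherently multimodal and visual instructions such as sketches convey spatial and structural intent more efficiently. Closing this gap requires a systematic benchmark for visual instruction following.

Result: 17 open-source and proprietary image editing models were evaluated on VIBE. Proprietary models show early-stage visual instruction-following ability and consistently outperform open-source models, yet even the strongest systems degrade markedly as task difficulty increases.

Insight: The novelty is a structured visual instruction benchmark (VIBE) whose three-level interaction design enables systematic capability assessment, together with an LMM-as-a-judge framework that supports scalable, fine-grained evaluation. Viewed objectively, the work supplies an important evaluation tool and benchmark data for multimodal instruction following and exposes the limits of current models on complex visual instructions.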

Abstract: Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.


[220] Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection cs.CV [PDF]

A S M Sharifuzzaman Sagar, Mohammed Bennamoun, Farid Boussaid, Naeha Sharif, Lian Xu

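TL;DR: This paper systematically evaluates the role of deepfake detectors in multimodal misinformation detection. Although deepfake detectors are widely integrated into automated fact-checking pipelines, they detect only pixel-level forgeries, contribute little to identifying semantic-level misinformation in image-text pairs, and can even harm evidence-based reasoning by introducing misleading authenticity priors.

Details

Motivation: In multimodal misinformation, deception usually stems from the semantic and contextual claim expressed by the image-text pair rather than from pixel-level manipulation alone. Most current deepfake detectors, however, are designed only for pixel-level forgeries and ignore claim-level meaning. This raises a central scientific and practical question: do pixel-level detectors provide useful signal for verifying image-text claims, or do they introduce misleading authenticity priors that undermine evidence-based reasoning?

Result: Experiments on two benchmarks, MMFakeBench and DGM4, show that deepfake detectors have limited standalone value, with F1 scores of 0.26-0.53 (MMFakeBench) and 0.33-0.49 (DGM4); injecting their predictions as auxiliary evidence into fact-checking pipelines consistently lowers performance by 0.04-0.08 F1 because of non-causal authenticity assumptions. In contrast, the evidence-centric fact-checking system (combining MCTS retrieval and MAD reasoning) performs best, with F1 of about 0.81 (MMFakeBench) and 0.55 (DGM4).

Insight: The novelty is the first systematic analysis of what deepfake detectors actually contribute to multimodal misinformation detection, showing that pixel-level forgery signals are unreliable for reasoning over real-world image-text misinformation. The core insight is that multimodal claim verification is driven mainly by semantic understanding and external evidence rather than pixel-level artifact signals; this challenges the common practice of simply plugging deepfake detectors into automated fact-checking pipelines and underlines the importance of evidence-based semantic reasoning.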

Abstract: In multimodal misinformation, deception usually arises not just from pixel-level manipulations in an image, but from the semantic and contextual claim jointly expressed by the image-text pair. Yet most deepfake detectors, engineered to detect pixel-level forgeries, do not account for claim-level meaning, despite their growing integration in automated fact-checking (AFC) pipelines. This raises a central scientific and practical question: Do pixel-level detectors contribute useful signal for verifying image-text claims, or do they instead introduce misleading authenticity priors that undermine evidence-based reasoning? We provide the first systematic analysis of deepfake detectors in the context of multimodal misinformation detection. Using two complementary benchmarks, MMFakeBench and DGM4, we evaluate: (1) state-of-the-art image-only deepfake detectors, (2) an evidence-driven fact-checking system that performs tool-guided retrieval via Monte Carlo Tree Search (MCTS) and engages in deliberative inference through Multi-Agent Debate (MAD), and (3) a hybrid fact-checking system that injects detector outputs as auxiliary evidence. Results across both benchmark datasets show that deepfake detectors offer limited standalone value, achieving F1 scores in the range of 0.26-0.53 on MMFakeBench and 0.33-0.49 on DGM4, and that incorporating their predictions into fact-checking pipelines consistently reduces performance by 0.04-0.08 F1 due to non-causal authenticity assumptions. In contrast, the evidence-centric fact-checking system achieves the highest performance, reaching F1 scores of approximately 0.81 on MMFakeBench and 0.55 on DGM4. Overall, our findings demonstrate that multimodal claim verification is driven primarily by semantic understanding and external evidence, and that pixel-level artifact signals do not reliably enhance reasoning over real-world image-text misinformation.


[221] Rethinking Genomic Modeling Through Optical Character Recognition cs.CV | cs.AI | cs.CL | cs.LG [PDF]

Hongxin Xiang, Pengsen Ma, Yunkang Cao, Di Yu, Haowen Chen

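TL;DR: This paper proposes OpticalDNA, a framework that reframes genomic modeling as an Optical Character Recognition (OCR)-style document understanding task. DNA sequences are rendered into structured visual layouts, and a vision-language model with a visual DNA encoder and a document decoder is trained to achieve efficient compression while retaining fine-grained genomic information.

Details

Motivation: Most current genomic foundation models adopt large language model architectures that process one-dimensional token sequences. This structure is mismatched with sparse, discontinuous genomic semantics, wasting computation on low-information background and making understanding-driven long-context compression difficult.

Result: Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences of up to 450k bases it achieves the best overall performance with nearly 20 times fewer effective tokens, and while tuning only 256k trainable parameters it surpasses models with up to 985 times more activated parameters.

Insight: The novelty is treating genomic sequences as visual documents for OCR-style understanding, achieving high-fidelity compression through visual layout encoding, and designing prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), which preserves fine-grained genomic information under a reduced effective token budget.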

Abstract: Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision–language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20x fewer effective tokens, and surpasses models with up to 985x more activated parameters while tuning only 256k trainable parameters.


[222] ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding cs.CV [PDF]

Ye Chen, Yupeng Zhu, Xiongzhen Zhang, Zhewen Wan, Yingzhe Li

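TL;DR: This paper proposes ProxyImg, a hierarchical disentangled proxy-embedding image representation that addresses the shortcomings of existing explicit and implicit image representations for controllable editing. Based on a semantic-aware decomposition of the input image, it builds hierarchical proxy geometries and embeds multi-scale implicit texture parameters into geometry-aware distributed proxy nodes, enabling high-fidelity reconstruction and instance- or part-independent semantic editing.

Details

Motivation: Existing image representations (explicit ones such as raster images and Gaussian primitives, and implicit ones such as latent images) either suffer from representation redundancy that makes manual editing burdensome, or lack a direct mapping from latent variables to semantic instances or parts, which makes fine-grained manipulation difficult. Both hinder efficient, controllable image and video editing.

Result: Extensive experiments on image reconstruction and editing benchmarks including ImageNet, OIR-Bench, and HumanEdit show that the method achieves state-of-the-art rendering fidelity with significantly fewer parameters while supporting intuitive, interactive, and physically plausible manipulation. By combining proxy nodes with Position-Based Dynamics, it also delivers real-time physics-driven animation with better temporal consistency and visual realism than generative approaches.

Insight: The novelties are: 1) a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independently manipulable parameter spaces; 2) hierarchical proxy geometry built through adaptive Bezier fitting and iterative internal region subdivision and meshing; 3) a locality-adaptive feature indexing mechanism that ensures spatial texture coherence and supports high-quality background completion; 4) combining lightweight implicit rendering with physics simulation for real-time physics-driven animation.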

Abstract: Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bezier fitting and iterative internal region subdivision and meshing. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing. In addition, we introduce a locality-adaptive feature indexing mechanism to ensure spatial texture coherence, which further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.


[223] Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model cs.CV [PDF]

Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen

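TL;DR: This paper proposes Q Cache, an efficient attention mechanism that reduces computation and KV cache footprint in multimodal large language models (MLLMs). By sharing similar attention patterns across layers and reusing queries in more than half of the decode layers, it substantially lowers KV cache usage and raises throughput while preserving model performance.

Details

Motivation: MLLMs incur high inference cost because of the large number of visual tokens from the vision encoder, and existing token pruning methods can damage KV cache integrity and hurt long-text generation. Taking a new view of the attention mechanism, the authors find that attention in more than half of the decode layers is semantically similar, and aim to cut redundant computation by sharing attention across layers.

Result: Across multiple benchmarks, the method reduces KV cache usage by over 35% and achieves a 1.5x throughput improvement at a performance cost of only about 1%. It preserves accuracy better than state-of-the-art token pruning methods.

Insight: The novelty is the Lazy Attention mechanism and Q Cache, which reuse similar attention patterns by sharing queries across layers; the approach is lightweight and fully compatible with existing inference frameworks such as Flash Attention and KV cache. It is orthogonal to token pruning and can be deployed alone or in combination with it, which makes it highly flexible.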

Abstract: Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation of the model's attention mechanism from a new perspective, and discern that attention within more than half of all decode layers is semantically similar. Based on this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns. It reduces layer-wise redundant computation in attention. In Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. In particular, Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Additionally, our method is highly flexible as it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method can reduce KV cache usage by over 35% and achieve 1.5x throughput improvement, while sacrificing only approximately 1% of performance on various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.
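The idea of a layer inheriting attention from its preceding layer can be shown with a toy sketch. This is an illustration only, not the paper's Lazy Attention or Q Cache implementation: it works on a single query per layer, assumes a hypothetical set `lazy_layers` of layer indices already judged redundant (how they are selected is not modelled), and simply reuses the previous layer's attention weights instead of recomputing scores.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def run_layers(queries, keys_per_layer, lazy_layers):
    """Toy decode step: lazy layers inherit the previous layer's weights.

    Returns the per-layer attention weights and the number of layers that
    actually computed attention scores.
    """
    outputs, computed = [], 0
    cached = None
    for i, (q, keys) in enumerate(zip(queries, keys_per_layer)):
        if i in lazy_layers and cached is not None:
            weights = cached                      # reuse, no new scores
        else:
            weights = attention_weights(q, keys)  # full computation
            cached = weights
            computed += 1
        outputs.append(weights)
    return outputs, computed

# Four toy layers; layers 1 and 3 are (hypothetically) marked as lazy.
queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, computed = run_layers(queries, [keys] * 4, lazy_layers={1, 3})
```

Here only two of the four layers compute attention scores; a lazy layer that never scores against its own keys is also the intuition for why such sharing could shrink the cache footprint, although the exact mechanism in the paper may differ.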


[224] Learning Sparse Visual Representations via Spatial-Semantic Factorization cs.CV | cs.AI | cs.LG [PDF]

Theodore Zhengde Zhao, Sid Kiblawi, Jianwei Yang, Naoto Usuyama, Reuben Tan

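TL;DR: The paper proposes STELLAR, a framework that resolves the conflict between semantic understanding and image reconstruction in self-supervised learning by factorizing visual features into the product of semantic concepts and their spatial distributions. It performs augmentation alignment on sparse semantic tokens while keeping the precise spatial mapping in a localization matrix to support pixel-level reconstruction.

Details

Motivation: Self-supervised learning faces a fundamental conflict between semantic understanding and image reconstruction: semantic methods based on global tokens (e.g., DINO) discard spatial information, whereas generative methods (e.g., MAE) keep dense features but lack high-level abstraction. The paper aims to reconcile the two through spatial-semantic factorization.

Result: On ImageNet, as few as 16 sparse tokens reach 79.10% accuracy (on par with dense backbones) while also achieving high-quality reconstruction at 2.60 FID, showing strength on both semantic and generative tasks.

Insight: The novelty is explicitly factorizing visual features into low-rank semantic-concept and spatial-distribution matrices, separating semantic identity from spatial geometry and yielding a general sparse representation that supports both discriminative and generative tasks.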

Abstract: Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at https://aka.ms/stellar.
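The low-rank factorization can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the STELLAR implementation: a dense feature grid of shape (patches x channels) is written as the product of a localization matrix (patches x tokens) and a set of semantic tokens (tokens x channels). The helper names, the toy values, and the grid size of 196 patches with 768 channels are all hypothetical; only the count of 16 tokens comes from the abstract.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x d) matrix, both as nested lists."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    d = len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(k)) for j in range(d)]
            for row in a]

def factorized_size(num_patches, dim, num_tokens):
    """Entries stored by the dense grid versus the two low-rank factors."""
    dense = num_patches * dim
    factored = num_patches * num_tokens + num_tokens * dim
    return dense, factored

# Toy example: 4 patches, 2 semantic tokens, 3 feature channels.
localization = [[1.0, 0.0],   # patch 0 belongs to concept 0
                [1.0, 0.0],   # patch 1 belongs to concept 0
                [0.0, 1.0],   # patch 2 belongs to concept 1
                [0.5, 0.5]]   # patch 3 mixes both concepts
semantic_tokens = [[2.0, 0.0, 1.0],
                   [0.0, 4.0, 1.0]]
features = matmul(localization, semantic_tokens)  # reconstructed dense grid

# Assumed grid size for illustration; 16 tokens as quoted in the abstract.
dense, factored = factorized_size(num_patches=196, dim=768, num_tokens=16)
```

The separation is visible in the two factors: `semantic_tokens` carries no positional information and could be aligned across augmentations, while `localization` records where each concept appears and is what a decoder would need for pixel-level reconstruction.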


[225] Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images cs.CV [PDF]

Shuai Yang, Ziyue Huang, Jiaxin Chen, Qingjie Liu, Yunhong Wang

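TL;DR: This paper proposes RS-MPOD, a multimodal open-vocabulary object detection framework for remote sensing images. By integrating instance-grounded visual prompts, textual prompts, and their multimodal fusion, it goes beyond conventional text-only prompting to address unstable category specification caused by semantic ambiguity and distribution shift in open-vocabulary settings.

Details

Motivation: In open-vocabulary object detection for remote sensing, text-only prompting (which depends on pretrained text-visual alignment) often fails in practice because category semantics are task- and application-specific, leading to unstable category specification.

Result: Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting gives more reliable category specification under semantic ambiguity and distribution shift, while multimodal prompting remains competitive when textual semantics are well aligned.

Insight: The novelty is extending category specification from text-only to multimodal (visual plus textual) prompting, with a visual prompt encoder that extracts appearance-based category cues for text-free category specification and a multimodal fusion module that integrates the available information, offering a more flexible and robust way to handle semantic ambiguity.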

Abstract: Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.


[226] Enhancing Multi-Image Understanding through Delimiter Token Scaling cs.CV [PDF]

Minyoung Lee, Yeji Park, Dongjun Hwang, Yejin Kim, Seong Joon Oh

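TL;DR: This paper proposes enhancing the multi-image understanding of Large Vision-Language Models (LVLMs) by scaling the hidden states of delimiter tokens. The goal is to curb cross-image information leakage in existing models and thereby improve performance when multiple images are given as input.

Details

Motivation: Existing LVLMs perform well on single-image tasks but degrade with multi-image input, mainly because of cross-image information leakage: the model struggles to tell apart information from different images. Models already use delimiter tokens to mark the start and end of each image, but these tokens fail to block the leakage effectively.

Result: Experiments show gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2, as well as on text-only tasks (multi-document and multi-table understanding benchmarks such as TQABench, MultiNews, and WCEP-10), with no additional training or inference cost.

Insight: The novelty is scaling the hidden states of delimiter tokens to reinforce intra-image interaction and limit unwanted cross-image interaction, which mitigates information leakage. Viewed objectively, it is a lightweight, zero-extra-cost optimization that could be borrowed to improve how multimodal models separate multiple inputs.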

Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model’s ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear distinction. The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
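The core operation, scaling the hidden states of delimiter tokens, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the abstract does not say at which layers the scaling is applied or what factor is used, so the function name, the token ids `IMG_START` and `IMG_END`, and the factor of 2.0 are all made-up placeholders.

```python
def scale_delimiter_states(hidden_states, token_ids, delimiter_ids, scale):
    """Return a copy of `hidden_states` with delimiter-token vectors scaled.

    hidden_states: list of per-token vectors (lists of floats)
    token_ids:     token id at each position
    delimiter_ids: set of ids that mark image start/end (hypothetical values)
    scale:         multiplicative factor (hypothetical hyperparameter)
    """
    if len(hidden_states) != len(token_ids):
        raise ValueError("hidden_states and token_ids must have equal length")
    return [[x * scale for x in vec] if tid in delimiter_ids else list(vec)
            for vec, tid in zip(hidden_states, token_ids)]

# Two images, each wrapped in start/end delimiters; ids are placeholders.
IMG_START, IMG_END = 1001, 1002
token_ids = [IMG_START, 7, 8, IMG_END, IMG_START, 9, IMG_END]
hidden = [[1.0, -1.0] for _ in token_ids]
scaled = scale_delimiter_states(hidden, token_ids,
                                {IMG_START, IMG_END}, scale=2.0)
```

Because the change is a single elementwise multiplication at a handful of positions, it fits the abstract's claim of no additional training and negligible inference cost; the same function would apply unchanged to text-only inputs where delimiters separate documents or tables.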


[227] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models cs.CV | cs.AI | cs.CL | cs.LG [PDF]

Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen

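TL;DR: This paper presents the Vision-DeepResearch benchmark (VDR-Bench), a benchmark of 2,000 VQA instances designed to evaluate more realistically how well multimodal large language models (MLLMs) perform visual and textual search (i.e., deep research). To address two limitations of existing benchmarks, namely leakage of the information that visual search should supply and overly idealized evaluation scenarios, it ensures question quality through a carefully designed multi-stage construction pipeline and expert review. The paper also proposes a simple multi-round cropped-search workflow that improves existing MLLMs in realistic visual retrieval scenarios.

Details

Motivation: Existing MLLM benchmarks have two major limitations: 1) they are not visual search-centric, since answers are often leaked through cross-textual cues in the text question or can be inferred from the model's existing world knowledge; 2) evaluation scenarios are overly idealized, since image search can often be done by near-exact full-image matching while text search is too direct and insufficiently challenging. This makes it hard to assess the real ability of MLLMs on complex visual-textual fact-finding (i.e., deep research).

Result: The paper builds VDR-Bench with 2,000 VQA instances. The proposed multi-round cropped-search workflow is shown to improve model performance in realistic visual retrieval scenarios. Overall, the results offer practical guidance for designing future multimodal deep-research systems.

Insight: The main novelty is a more realistic benchmark (VDR-Bench) aimed at evaluating the real behavior of visual deep-research systems, built with a rigorous pipeline that avoids information leakage and idealized scenarios. Viewed objectively, the multi-round cropped-search workflow is a simple but potentially effective engineering strategy for compensating for the weak visual retrieval of current MLLMs, and it is a useful reference for building practical multimodal search systems.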

Abstract: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.


[228] UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving cs.CV [PDF]

Guosheng Zhao, Yaozeng Wang, Xiaofeng Wang, Zheng Zhu, Tingdong Yu

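TL;DR: This paper proposes UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving that directly generates future observations as multi-camera video and LiDAR sequences, without relying on intermediate representations or cascaded modules.

Details

Motivation: Existing autonomous-driving world models focus mainly on single-modality generation (e.g., video or LiDAR) and lack a unified multimodal synthesis method, which limits the comprehensiveness and efficiency of data synthesis.

Result: UniDriveDreamer outperforms previous state-of-the-art methods on both video and LiDAR generation and brings measurable improvements on downstream tasks.

Insight: The novelties include: 1) a LiDAR-specific VAE and a video VAE that encode their respective inputs; 2) Unified Latent Anchoring (ULA), which explicitly aligns the multimodal latent distributions to ensure compatibility and training stability; 3) a diffusion transformer that jointly models geometric correspondence and temporal evolution; 4) structured scene layout information used as a conditioning signal to guide synthesis.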

Abstract: World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream


[229] ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning cs.CV | cs.AIPDF

Gongli Xi, Kun Wang, Zeming Gao, Huahui Yi, Haolang Lu

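TL;DR: This paper proposes ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for suppressing hallucinations in multimodal reasoning models. Starting from the question, it traces how key clues propagate along the model's reasoning pathway (question → outputs → visual tokens), localizing task-relevant image regions and suppressing spurious attention to irrelevant ones.

Details

Motivation: Multimodal reasoning models hallucinate during long-chain inference, generating content that neither the input image nor the question supports. The authors attribute this to "reasoning drift": while gathering clues, the model over-focuses on question-irrelevant entities, which dilutes attention to task-relevant cues and gradually decouples the reasoning trace from its visual grounding. Existing localization or intervention methods designed for non-reasoning models struggle to pinpoint the true clues in reasoning settings.

Result: Without any additional training, ClueTracer significantly improves all reasoning architectures (including R1-OneVision, Ocean-R1, and MM-Eureka) on reasoning benchmarks, with a 1.21× gain. When transferred to non-reasoning settings, it yields a 1.14× gain.

Insight: The paper's novelty lies in the ClueRecall metric, which quantifies visual clue retrieval, and in the ClueTracer method, which suppresses hallucinations by tracing the reasoning pathway backward from the question to the visual tokens. Viewed objectively, its training-free and architecture-agnostic nature makes it broadly applicable and easy to deploy, and it offers a novel intervention approach to hallucination in multimodal reasoning.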

Abstract: Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify reasoning drift: during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model’s reasoning pathway (question → outputs → visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, without any additional training, ClueTracer improves all reasoning architectures (including R1-OneVision, Ocean-R1, MM-Eureka, etc.) by 1.21× on reasoning benchmarks. When transferred to non-reasoning settings, it yields a 1.14× gain.


[230] One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation cs.CV | cs.AIPDF

Shuo Lu, Haohan Wang, Wei Feng, Weizhen Wang, Shen Zhang

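TL;DR: This paper proposes "One Size, Many Fits", a unified framework for aligning the click preferences of different user groups in large-scale advertising image generation. The framework organizes users dynamically through product-aware adaptive grouping and uses a pre-trained and fine-tuned group-aware multimodal large language model to generate tailored images for each group, raising each group's click-through rate.

Details

Motivation: Existing advertising image generation methods adopt a "one-size-fits-all" strategy that optimizes only overall click-through rate and ignores the diversity of preferences across user groups, which leads to poor marketing results for specific groups. This paper aims to close that gap.

Result: Extensive experiments show that the framework achieves state-of-the-art performance in both offline and online settings.

Insight: The main contributions are a product-aware adaptive grouping method, a group-aware multimodal large language model for generating tailored images, the Group-DPO fine-tuning method for group-wise preference alignment, and GAIP, the first large-scale public dataset of group-wise image preferences. Viewed objectively, combining fine-grained user-group preferences with multimodal generative models, and aligning them through a dedicated dataset and fine-tuning method, is an effective way to improve advertising image generation.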

Abstract: Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a “one-size-fits-all” strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present One Size, Many Fits (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group’s CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves state-of-the-art performance in both offline and online settings. Our code and datasets will be released at https://github.com/JD-GenX/OSMF.


[231] Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models cs.CV | cs.AIPDF

Cristian Sbrolli, Matteo Matteucci, Toshihiko Yamasaki

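TL;DR: This paper introduces Auto-Comp, an automated pipeline for generating scalable synthetic benchmarks, designed for fine-grained evaluation of the compositional reasoning of vision-language models (VLMs). By comparing model performance on minimal captions against LLM-generated contextual captions, it reveals widespread VLM failures on compositional tasks such as color binding and spatial relations, and finds that visio-linguistic context helps spatial reasoning while hurting local attribute binding.

Details

Motivation: Modern vision-language models have a critical flaw in compositional reasoning, such as confusing object-attribute combinations. Fine-grained, controllable analysis of these failures requires a scalable method for generating benchmarks.

Result: Evaluating 20 VLMs on new benchmarks for color binding and spatial relations reveals universal compositional failures in both the CLIP and SigLIP model families. The novel "Confusion Benchmark" further shows that models are highly susceptible to low-entropy distractors (such as repeated objects or colors), and that their failures extend beyond the known bag-of-words limitations.

Insight: The novelty is a fully automated, controllable pipeline for generating synthetic benchmarks (Auto-Comp) that disentangles core binding ability from visio-linguistic complexity. A key finding is that visio-linguistic context creates a trade-off between spatial reasoning and local attribute binding: global scene cues help the former, while visual clutter hurts the latter.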

Abstract: Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing “a red cube and a blue sphere” with “a blue cube and a red sphere”. Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal (e.g., “a monitor to the left of a bicycle on a white background”) and LLM-generated Contextual captions (e.g., “In a brightly lit photography studio, a monitor is positioned to the left of a bicycle”), allowing a controlled A/B test to disentangle core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both CLIP and SigLIP model families. Crucially, our novel “Confusion Benchmark” reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating that their compositional failures extend beyond known bag-of-words limitations. We also uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (https://huggingface.co/AutoComp).


[232] Multi-View Stenosis Classification Leveraging Transformer-Based Multiple-Instance Learning Using Real-World Clinical Data cs.CV | cs.AIPDF

Nikola Cenikj, Özgün Turgut, Alexander Müller, Alexander Steger, Jan Kehrer

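TL;DR: This paper proposes SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level coronary artery stenosis classification. Trained on real-world clinical data with only patient-level labels and no costly view-level annotations, it jointly predicts the presence of stenosis and localizes the affected anatomical region.

Details

Motivation: Existing deep-learning models based on a single angiography view rely heavily on expensive view-level annotations and fail to capture the temporal dynamics and dependencies among multiple views, which are crucial for clinical diagnosis.

Result: SegmentMIL achieves high performance on both internal and external evaluations and outperforms view-level models and classical multiple-instance learning baselines, showing its potential as a clinically viable and scalable solution.

Insight: The novelty is in combining a transformer architecture with multiple-instance learning to handle multi-view data under weak patient-level supervision. This removes the need for view-level annotations and supports classification and localization at the same time, improving clinical practicality and scalability.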

Abstract: Coronary artery stenosis is a leading cause of cardiovascular disease, diagnosed by analyzing the coronary arteries from multiple angiography views. Although numerous deep-learning models have been proposed for stenosis detection from a single angiography view, their performance heavily relies on expensive view-level annotations, which are often not readily available in hospital systems. Moreover, these models fail to capture the temporal dynamics and dependencies among multiple views, which are crucial for clinical diagnosis. To address this, we propose SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level stenosis classification. Trained on a real-world clinical dataset, using patient-level supervision and without any view-level annotations, SegmentMIL jointly predicts the presence of stenosis and localizes the affected anatomical region, distinguishing between the right and left coronary arteries and their respective segments. SegmentMIL obtains high performance on internal and external evaluations and outperforms both view-level models and classical MIL baselines, underscoring its potential as a clinically viable and scalable solution for coronary stenosis diagnosis. Our code is available at https://github.com/NikolaCenic/mil-stenosis.


[233] UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction cs.CVPDF

Changbai Li, Haodong Zhu, Hanlin Chen, Xiuping Liang, Tongfei Chen

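TL;DR: This paper proposes UrbanGS, a scalable and efficient architecture for large-scale urban scene reconstruction that addresses the geometric consistency, memory efficiency, and computational scalability challenges that 3D Gaussian Splatting (3DGS) faces at city scale.

Details

Motivation: 3D Gaussian Splatting (3DGS) enables high-quality, real-time rendering for bounded scenes, but extending it to large-scale urban scenes raises critical challenges: poor geometric consistency, low memory efficiency, and insufficient computational scalability.

Result: Extensive experiments on multiple urban datasets show that UrbanGS achieves superior rendering quality, geometric accuracy, and memory efficiency.

Insight: The main contributions are: 1) a Depth-Consistent D-Normal Regularization module that combines external depth supervision with adaptive confidence weighting to update all geometric parameters, improving multi-view depth alignment and geometric consistency; 2) a Spatially Adaptive Gaussian Pruning strategy that dynamically adjusts Gaussian density based on local geometric complexity and visibility to reduce redundancy; 3) a unified partitioning and view assignment scheme that eliminates boundary artifacts and optimizes computational load.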

Abstract: While 3D Gaussian Splatting (3DGS) enables high-quality, real-time rendering for bounded scenes, its extension to large-scale urban environments gives rise to critical challenges in terms of geometric consistency, memory efficiency, and computational scalability. To address these issues, we present UrbanGS, a scalable reconstruction framework that effectively tackles these challenges for city-scale applications. First, we propose a Depth-Consistent D-Normal Regularization module. Unlike existing approaches that rely solely on monocular normal estimators, which can effectively update rotation parameters yet struggle to update position parameters, our method integrates D-Normal constraints with external depth supervision. This allows for comprehensive updates of all geometric parameters. By further incorporating an adaptive confidence weighting mechanism based on gradient consistency and inverse depth deviation, our approach significantly enhances multi-view depth alignment and geometric coherence, which effectively resolves the issue of geometric accuracy in complex large-scale scenes. To improve scalability, we introduce a Spatially Adaptive Gaussian Pruning (SAGP) strategy, which dynamically adjusts Gaussian density based on local geometric complexity and visibility to reduce redundancy. Additionally, a unified partitioning and view assignment scheme is designed to eliminate boundary artifacts and optimize computational load. Extensive experiments on multiple urban datasets demonstrate that UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency, providing a systematic solution for high-fidelity large-scale scene reconstruction.


[234] FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space cs.CVPDF

FSVideo Team, Qingyu Chen, Zhiyuan Fang, Haibin Huang, Xinwei Huang

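TL;DR: FSVideo is a fast, transformer-based image-to-video diffusion model. It operates in a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio) and combines a diffusion transformer architecture with a layer memory design and a multi-resolution generation strategy, generating high-quality video an order of magnitude faster than other open-source models.

Details

Motivation: Existing video diffusion models are slow and computationally expensive. The goal is a framework that generates high-quality video quickly in a highly compressed latent space.

Result: With a 14B-parameter diffusion transformer base model and upsampler, FSVideo achieves competitive performance against other popular open-source models while being an order of magnitude faster.

Insight: The contributions include a highly compressed video autoencoder, a layer memory design in the diffusion transformer that enhances inter-layer information flow and context reuse, and a multi-resolution generation strategy. Together these designs markedly increase generation speed while preserving quality.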

Abstract: We introduce FSVideo, a fast speed transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1) a new video autoencoder with a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio), achieving competitive reconstruction quality; 2) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within DIT; and 3) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models, while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.


[235] MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos cs.CVPDF

Yangyi Cao, Yuanhang Li, Lan Chen, Qi Mao

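TL;DR: MLV-Edit is a training-free, flow-based framework designed for editing minute-level long videos. It edits segment by segment using a divide-and-conquer strategy, with two core modules: Velocity Blend, which rectifies motion inconsistencies by aligning the flow fields of adjacent segments, and Attention Sink, which anchors local segment features to global reference frames to suppress cumulative structural drift. Together they address the computational overhead and global temporal consistency problems of long-video editing.

Details

Motivation: Existing video editing techniques handle short videos well, but scaling them to minute-level videos is very challenging because of prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames, which leads to flickering and boundary artifacts.

Result: Extensive quantitative and qualitative experiments show that MLV-Edit consistently outperforms state-of-the-art (SOTA) methods in temporal stability and semantic fidelity.

Insight: The novelty is a training-free, segment-wise editing framework in which the Velocity Blend module removes boundary artifacts and the Attention Sink module suppresses structural drift, enabling efficient and consistent editing of long videos. Its divide-and-conquer strategy and flow-field alignment may carry over to other large-scale temporal data.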

Abstract: We propose MLV-Edit, a training-free, flow-based framework that addresses the unique challenges of minute-level video editing. While existing techniques excel in short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in terms of temporal stability and semantic fidelity.


[236] Toxicity Assessment in Preclinical Histopathology via Class-Aware Mahalanobis Distance for Known and Novel Anomalies cs.CV | cs.AI | cs.LGPDF

Olga Graf, Dhrupal Patel, Peter Groß, Charlotte Lempp, Matthias Hein

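TL;DR: This paper proposes an AI-based histopathological anomaly detection framework for analyzing rodent liver whole-slide images (WSIs) in toxicology studies. It fine-tunes a pre-trained Vision Transformer (DINOv2) for tissue segmentation and uses the Mahalanobis distance for anomaly detection, introducing class-specific thresholds to handle the class-dependent variability of histological data, so that it can identify both known pathologies and rare, previously unseen pathologies (out-of-distribution anomalies).

Details

Motivation: Drug-induced toxicity is a leading cause of failure in preclinical development and early clinical trials. Histopathological evaluation, the gold standard for toxicity assessment, relies heavily on expert pathologists and becomes a bottleneck in large-scale screening. This paper aims to build an AI-driven anomaly detection system that automatically detects known and unknown anomalies in histopathology images, accelerating the development of safe medicines and reducing late-stage failures.

Result: On the newly generated dataset, with optimized class-specific thresholds, the framework reaches very low misclassification rates: only 0.16% of pathological tissue is classified as healthy and 0.35% of healthy tissue as pathological. On mouse liver WSIs with known toxicological findings, it accurately detects anomalies, including rare out-of-distribution morphologies, demonstrating its potential in preclinical workflows.

Insight: The contributions include: 1) efficient fine-tuning of a pre-trained Vision Transformer (DINOv2) with LoRA for tissue segmentation; 2) class-specific Mahalanobis distance thresholds that better handle the class-dependent variability of histological data and improve anomaly detection accuracy; 3) a framework that handles both known pathologies and unseen out-of-distribution anomalies, which increases its applicability to real-world toxicology studies. Viewed objectively, by integrating an advanced self-supervised vision model with an adaptive thresholding strategy, the method offers a scalable and robust solution for automated histopathology analysis.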

Abstract: Drug-induced toxicity remains a leading cause of failure in preclinical development and early clinical trials. Detecting adverse effects at an early stage is critical to reduce attrition and accelerate the development of safe medicines. Histopathological evaluation remains the gold standard for toxicity assessment, but it relies heavily on expert pathologists, creating a bottleneck for large-scale screening. To address this challenge, we introduce an AI-based anomaly detection framework for histopathological whole-slide images (WSIs) in rodent livers from toxicology studies. The system identifies healthy tissue and known pathologies (anomalies) for which training data is available. In addition, it can detect rare pathologies without training data as out-of-distribution (OOD) findings. We generate a novel dataset of pixelwise annotations of healthy tissue and known pathologies and use this data to fine-tune a pre-trained Vision Transformer (DINOv2) via Low-Rank Adaptation (LoRA) in order to do tissue segmentation. Finally, we extract features for OOD detection using the Mahalanobis distance. To better account for class-dependent variability in histological data, we propose the use of class-specific thresholds. We optimize the thresholds using the mean of the false negative and false positive rates, resulting in only 0.16% of pathological tissue classified as healthy and 0.35% of healthy tissue classified as pathological. Applied to mouse liver WSIs with known toxicological findings, the framework accurately detects anomalies, including rare OOD morphologies. This work demonstrates the potential of AI-driven histopathology to support preclinical workflows, reduce late-stage failures, and improve efficiency in drug development.
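
The idea of Mahalanobis-distance OOD detection with class-specific thresholds can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the shared (tied) covariance, the toy 2-D features, and the threshold values are all assumptions made for illustration, and the threshold optimization over false negative and false positive rates described in the abstract is not shown.

```python
import numpy as np


def fit_class_gaussians(features, labels):
    """Per-class means plus one shared covariance, returned as its pseudo-inverse.

    Assumption: a tied covariance across classes; the abstract does not say
    how the covariance is estimated.
    """
    classes = np.unique(labels)
    means = {c.item(): features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c.item()] for c in classes])
    covariance = centered.T @ centered / len(features)
    return means, np.linalg.pinv(covariance)


def mahalanobis_distances(x, means, precision):
    """Squared Mahalanobis distance from feature vector x to every class mean."""
    return {c: float((x - m) @ precision @ (x - m)) for c, m in means.items()}


def classify_or_flag(x, means, precision, thresholds):
    """Return (nearest_class, is_ood), judged against the nearest class's own threshold."""
    distances = mahalanobis_distances(x, means, precision)
    nearest = min(distances, key=distances.get)
    return nearest, distances[nearest] > thresholds[nearest]


# Toy features: class 0 clusters around (0.5, 0.5), class 1 around (10.5, 10.5).
features = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                     [10, 10], [11, 10], [10, 11], [11, 11]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
means, precision = fit_class_gaussians(features, labels)

# Hypothetical thresholds: class 1 is held to a tighter bound than class 0.
thresholds = {0: 9.0, 1: 0.5}
```

With a single global threshold, a class with naturally wide feature spread would either trigger many false alarms or force a loose bound on tighter classes; one threshold per class avoids that coupling.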


[237] LoopViT: Scaling Visual ARC with Looped Transformers cs.CVPDF

Wen-Jie Shu, Xuerui Qiu, Rui-Jie Zhu, Harold Haodong Chen, Yexin Liu

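TL;DR: This paper proposes LoopViT, a recursive vision transformer architecture built on weight-tied recurrence for visual reasoning tasks. By decoupling reasoning depth from model capacity and introducing a dynamic exit mechanism based on predictive entropy, it outperforms large ensembles on the ARC-AGI benchmark with fewer parameters.

Details

Motivation: For existing feed-forward vision transformers applied to the ARC-AGI benchmark, computational depth is strictly bound to parameter count, which makes it hard to capture the iterative, algorithmic nature of human inductive reasoning.

Result: On the ARC-AGI-1 benchmark, a LoopViT model with only 18M parameters achieves 65.8% accuracy, outperforming large 73M-parameter ensembles and scaling performance more efficiently.

Insight: The core novelty is decoupling computational depth from model capacity through a weight-tied recurrent structure (iterating a Hybrid Block), combined with a parameter-free dynamic exit mechanism based on predictive entropy that lets the model terminate inference adaptively. This gives visual reasoning a more efficient scaling axis than simply increasing network width.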

Abstract: Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state “crystallizes” into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.
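
A rough sketch of entropy-based dynamic exit over a weight-tied loop follows. It is a toy under stated assumptions, not the released LoopViT code: `step_fn` stands in for the shared Hybrid Block, `readout_fn` for the prediction head, and the threshold and step budget are hypothetical values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def run_with_dynamic_exit(step_fn, readout_fn, state,
                          max_steps=16, entropy_threshold=0.1):
    """Apply the same step_fn repeatedly and stop once the prediction is confident.

    Reusing one step_fn for every iteration mirrors weight tying: extra depth
    costs extra compute but no extra parameters. The exit rule itself has no
    learned parameters; it only compares entropy against a fixed threshold.
    """
    entropy = predictive_entropy(readout_fn(state))
    steps_used = 0
    while steps_used < max_steps and entropy >= entropy_threshold:
        state = step_fn(state)
        steps_used += 1
        entropy = predictive_entropy(readout_fn(state))
    return state, steps_used, entropy


# Toy stand-ins: each iteration doubles the logits, sharpening the prediction.
final_state, steps, final_entropy = run_with_dynamic_exit(
    step_fn=lambda s: [2.0 * v for v in s],
    readout_fn=lambda s: s,
    state=[1.0, 0.0],
)
```

In this toy run, easy inputs (already sharp logits) exit after few iterations while ambiguous ones use the full budget, which is the adaptive-compute behaviour the abstract describes.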


[238] Reg4Pru: Regularisation Through Random Token Routing for Token Pruning cs.CVPDF

Julian Wyatt, Ronald Clark, Irina Voiculescu

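TL;DR: This paper proposes Reg4Pru, a training regularisation technique that mitigates the performance loss vision transformers suffer under token pruning, especially in segmentation, where the preserved representations become unstable. The method regularises through random token routing; on the FIVES blood vessel segmentation dataset it markedly improves average precision while also speeding up inference.

Details

Motivation: The computational cost of transformers grows quadratically with the number of tokens. Existing token pruning methods (such as reactivating tokens from preserved representations) improve computational efficiency but make the preserved representations unstable at deeper layers, which harms dense prediction tasks such as segmentation.

Result: On the FIVES blood vessel segmentation dataset, a model trained with Reg4Pru improves average precision by an absolute 46% over the same model trained without routing regularisation. This configuration also achieves a 29% relative wall-clock speedup over the non-pruned baseline.

Insight: The novelty is a training regularisation method based on random token routing (Reg4Pru) that stabilises the representations preserved during token pruning, which markedly improves computational efficiency while mitigating the performance loss on dense prediction tasks. It offers a useful regularisation approach for efficient token reduction strategies.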

Abstract: Transformers are widely adopted in modern vision models due to their strong ability to scale with dataset size and generalisability. However, this comes with a major drawback: computation scales quadratically with the total number of tokens. Numerous methods have been proposed to mitigate this. For example, we consider token pruning with token reactivation from preserved representations, but the increased computational efficiency of this method results in decreased stability of the preserved representations, leading to poorer dense prediction performance at deeper layers. In this work, we introduce Reg4Pru, a training regularisation technique that mitigates token-pruning performance loss for segmentation. We compare our models on the FIVES blood vessel segmentation dataset and find that Reg4Pru improves average precision by an absolute 46% compared to the same model trained without routing. This increase is observed using a configuration that achieves a 29% relative speedup in wall-clock time compared to the non-pruned baseline. These findings indicate that Reg4Pru is a valuable regulariser for token reduction strategies.


[239] CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization cs.CVPDF

Xinquan Yu, Wei Lu, Xiangyang Luo

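TL;DR: This paper proposes CIEC (Coupling Implicit and Explicit Cues), a novel framework for multimodal weakly supervised manipulation localization on image-text pairs that needs only coarse-grained image/sentence-level annotations. The framework has two branches, image-based and text-based weakly supervised localization, which use the TRPS and VCTG modules respectively to integrate visual and textual cues and precisely localize manipulated regions. Experiments show performance comparable to fully supervised methods.

Details

Motivation: Current multimodal manipulation localization methods rely on costly, time-consuming fine-grained annotations (such as patch/token-level annotations). This paper aims for effective weakly supervised localization using only coarse-grained annotations, reducing annotation cost.

Result: On several evaluation metrics, CIEC achieves results comparable to fully supervised methods, demonstrating its effectiveness.

Insight: The contributions include: 1) the TRPS module, which combines visual and textual cues with spatial priors to lock onto suspicious regions and reduces interference through background silencing and spatial contrast constraints; 2) the VCTG module, which focuses on content words and uses relative visual bias to assist token localization, mitigating label noise through asymmetric sparse and semantic consistency constraints. These methods offer new ideas for weakly supervised multimodal localization.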

Abstract: To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. However, current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level annotations. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs utilizing only coarse-grained image/sentence-level annotations. It comprises two branches: image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module, which integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors; it is followed by background silencing and spatial contrast constraints that suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module, which focuses on meaningful content words and leverages relative visual bias to assist token localization; it is followed by asymmetric sparse and semantic consistency constraints that mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.


[240] Learning Topology-Aware Implicit Field for Unified Pulmonary Tree Modeling with Incomplete Topological Supervision cs.CVPDF

Ziqiao Weng, Jiancheng Yang, Kangxian Xie, Bo Zhou, Weidong Cai

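TL;DR: This paper proposes TopoField, a topology-aware implicit modeling framework that addresses the topological incompleteness (such as missing or disconnected branches) common in pulmonary trees extracted from CT images. The method represents pulmonary anatomy with sparse surface and skeleton point clouds and trains on synthetic structural disruptions introduced into already incomplete pulmonary trees, learning a continuous implicit field that supports topology repair without complete or explicit disconnection annotations. Building on the repaired implicit representation, it jointly infers anatomical labeling and lung segment reconstruction in a single forward pass.

Details

Motivation: Pulmonary trees extracted from CT images often exhibit topological incompleteness, such as missing or disconnected branches, which severely affects downstream anatomical analysis and limits the applicability of existing pulmonary tree modeling pipelines. Existing methods typically rely on dense volumetric processing or explicit graph reasoning, leading to limited efficiency and reduced robustness under realistic structural corruption.

Result: Extensive experiments on the Lung3D+ dataset show that TopoField consistently improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction under challenging incomplete scenarios. Owing to its implicit formulation, TopoField is computationally efficient, completing all tasks in just over one second per case.

Insight: The novelty is treating topology repair as a first-class modeling problem and proposing a topology-aware implicit modeling framework that unifies multiple pulmonary tree analysis tasks by learning a continuous implicit field under incomplete topological supervision. The method avoids reliance on complete annotations and, by training on synthetic disruptions, improves robustness and efficiency on realistic incomplete data.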

Abstract: Pulmonary trees extracted from CT images frequently exhibit topological incompleteness, such as missing or disconnected branches, which substantially degrades downstream anatomical analysis and limits the applicability of existing pulmonary tree modeling pipelines. Current approaches typically rely on dense volumetric processing or explicit graph reasoning, leading to limited efficiency and reduced robustness under realistic structural corruption. We propose TopoField, a topology-aware implicit modeling framework that treats topology repair as a first-class modeling problem and enables unified multi-task inference for pulmonary tree analysis. TopoField represents pulmonary anatomy using sparse surface and skeleton point clouds and learns a continuous implicit field that supports topology repair without relying on complete or explicit disconnection annotations, by training on synthetically introduced structural disruptions over already incomplete trees. Building upon the repaired implicit representation, anatomical labeling and lung segment reconstruction are jointly inferred through task-specific implicit functions within a single forward pass. Extensive experiments on the Lung3D+ dataset demonstrate that TopoField consistently improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction under challenging incomplete scenarios. Owing to its implicit formulation, TopoField attains high computational efficiency, completing all tasks in just over one second per case, highlighting its practicality for large-scale and time-sensitive clinical applications. Code and data will be available at https://github.com/HINTLab/TopoField.


[241] MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models cs.CVPDF

Zheyuan Zhou, Liang Du, Zixun Sun, Xiaoyu Zhou, Ruimin Ye

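TL;DR: This paper proposes MAIN-VLA, a framework that explicitly models abstractions of intention and environment semantics to address how inefficiently vision-language-action models extract action-critical signals from redundant sensor streams in complex, dynamic environments. It abstracts verbose language instructions and visual streams into compact semantic primitives and a structured topological affordance representation, respectively, and aligning these two abstract modalities enables a parameter-free token pruning strategy, improving decision quality, generalization, and inference efficiency.

Details

Motivation: In highly complex, dynamic environments with real-time unpredictable interactions (such as 3D open worlds and large-scale PvP games), existing vision-language-action models extract action-critical signals from redundant sensor streams inefficiently; decision-making needs to be grounded in deep semantic alignment rather than superficial pattern matching.

Result: Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) show that MAIN-VLA sets a new state of the art, with strong decision quality, generalization, and inference efficiency.

Insight: The paper's novelty lies in explicit semantic abstraction of intention and environment (IA and ESA) and in aligning the two to induce an attention-concentration effect, which enables a token pruning strategy that filters perceptual redundancy without any additional model parameters. This offers a new approach to improving the efficiency and performance of VLA models in complex environments.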

Abstract: Despite significant progress in Visual-Language-Action (VLA), in highly complex and dynamic environments that involve real-time unpredictable interactions (such as 3D open worlds and large-scale PvP games), existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams. To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) extracts verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state-of-the-art, which achieves superior decision quality, stronger generalization, and cutting-edge inference efficiency.


[242] Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation cs.CVPDF

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li

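TL;DR: This paper proposes Causal Forcing, a new method for distilling pretrained bidirectional video diffusion models into high-quality, real-time interactive video generation models. By using an autoregressive teacher for ODE initialization, it closes the architectural gap that existing methods face when full attention is replaced with causal attention, improving generation quality.

Details

Motivation: When distilling bidirectional video diffusion models into few-step autoregressive models, existing methods face an architectural gap caused by replacing full attention with causal attention, and they do not resolve this gap theoretically. ODE distillation initialization requires frame-level injectivity, and distilling an autoregressive student from a bidirectional teacher violates this condition, which degrades performance.

Result: Experiments show that the method outperforms all baselines across all metrics, surpassing the SOTA method Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following.

Insight: The core novelty is Causal Forcing, which bridges the architectural gap by using an autoregressive teacher for ODE initialization. This ensures frame-level injectivity in theory and avoids the performance degradation caused by the conditional-expectation solution, offering a more rigorous approach to model distillation for high-quality real-time video generation.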

Abstract: To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher’s flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing that uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and code: https://thu-ml.github.io/CausalForcing.github.io/


[243] Evaluating OCR Performance for Assistive Technology: Effects of Walking Speed, Camera Placement, and Camera Type cs.CVPDF

Junchi Feng, Nikhil Ballem, Mahya Beheshti, Giles Hamilton-Fletcher, Todd Hudson

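TL;DR: This study systematically evaluates optical character recognition (OCR) performance for assistive technology applications under both static and dynamic conditions. Static tests measured detection range across distances of 1-7 meters and horizontal viewing angles of 0-75 degrees; dynamic tests analyzed the effect of walking speed (0.8-1.8 m/s) and camera mounting position (head-mounted, shoulder-mounted, hand-held). The study compared a smartphone and smart glasses and benchmarked four OCR engines (Google Vision, PaddleOCR 3.0, EasyOCR, Tesseract), then used PaddleOCR 3.0 to evaluate accuracy at different walking speeds.

Details

Motivation: Most current OCR evaluations rely on static datasets and do not reflect the practical challenges of mobile use. This study aims to systematically evaluate OCR under dynamic, mobile conditions and provide a more practical performance baseline for assistive technology for people with visual impairments.

Result: Recognition accuracy declined as walking speed increased and viewing angles widened. Google Vision achieved the highest overall accuracy, with PaddleOCR 3.0 close behind as the strongest open-source alternative. The phone's main camera achieved the highest accuracy, and shoulder-mounted placement had the highest average among body positions, although the differences among shoulder, head, and hand positions were not statistically significant.

Insight: The novelty is the first systematic quantification of how mobile conditions (such as walking speed and camera position) affect OCR performance, together with a multi-engine, multi-device benchmark comparison. Viewed objectively, the study establishes a methodological framework for evaluating OCR in dynamic environments and underscores the importance of evaluating performance in real application scenarios.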

Abstract: Optical character recognition (OCR), which converts printed or handwritten text into machine-readable form, is widely used in assistive technology for people with blindness and low vision. Yet, most evaluations rely on static datasets that do not reflect the challenges of mobile use. In this study, we systematically evaluated OCR performance under both static and dynamic conditions. Static tests measured detection range across distances of 1-7 meters and viewing angles of 0-75 degrees horizontally. Dynamic tests examined the impact of motion by varying walking speed from slow (0.8 m/s) to very fast (1.8 m/s) and comparing three camera mounting positions: head-mounted, shoulder-mounted, and hand-held. We evaluated both a smartphone and smart glasses, using the phone’s main and ultra-wide cameras. Four OCR engines were benchmarked to assess accuracy at different distances and viewing angles: Google Vision, PaddleOCR 3.0, EasyOCR, and Tesseract. PaddleOCR 3.0 was then used to evaluate accuracy at different walking speeds. Accuracy was computed at the character level using the Levenshtein ratio against manually defined ground truth. Results showed that recognition accuracy declined with increased walking speed and wider viewing angles. Google Vision achieved the highest overall accuracy, with PaddleOCR close behind as the strongest open-source alternative. Across devices, the phone’s main camera achieved the highest accuracy, and a shoulder-mounted placement yielded the highest average among body positions; however, differences among shoulder, head, and hand were not statistically significant.
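
Character-level scoring against ground truth can be sketched as below. The abstract names the Levenshtein ratio but not its exact normalization, so the choice here (one minus edit distance divided by the longer string's length) is an assumption; some libraries define the ratio differently, and the function names are illustrative.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # drop char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(predicted: str, truth: str) -> float:
    """Similarity in [0, 1]; 1.0 means the OCR output matches the ground truth exactly.

    Assumption: normalized by the longer string's length.
    """
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(predicted, truth) / longest


# One misread character out of four, e.g. a sign reading "EXIT" recognized as "EX1T".
score = character_accuracy("EX1T", "EXIT")
```

A character-level score of this kind degrades gradually with motion blur or oblique angles, whereas exact-match scoring would count a single misread character as a total failure.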


[244] Show, Don’t Tell: Morphing Latent Reasoning into Image Generation cs.CVPDF

Harold Haodong Chen, Xinxiang Yin, Wen-Jie Shu, Hongfei Zhang, Zixin Zhang

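TL;DR: This paper proposes LatentMorph, a novel text-to-image generation framework that seamlessly integrates implicit latent-space reasoning into the generation process, addressing the limited dynamic reasoning and refinement ability of existing methods.

Details

Motivation: Existing text-to-image generation methods lack the ability to reason and refine dynamically, and current augmentation paradigms that rely on explicit thought processes suffer from inefficiency, information loss, and cognitive mismatch.

Result: LatentMorph improves the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench. On abstract reasoning tasks such as WISE and IPV-Txt, it outperforms explicit reasoning paradigms (e.g., TwiG) by 15% and 11%, while reducing inference time by 44% and token consumption by 51%. On reasoning invocation, it reaches 71% cognitive alignment with human intuition.

Insight: The paper's novelty is a framework that performs reasoning entirely in continuous latent spaces. By introducing four lightweight components (a condenser, a translator, a shaper, and an RL-trained invoker), it achieves more adaptive and efficient self-refinement and avoids the bottlenecks of explicit reasoning.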

Abstract: Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation, a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks like WISE and IPV-Txt; (III) reduces inference time by 44% and token consumption by 51%; and (IV) exhibits 71% cognitive alignment with human intuition on reasoning invocation.


[245] LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization cs.CV

Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng

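TL;DR: LongVPO is a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. Stage 1 synthesizes preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering, which mitigates positional bias and ensures unambiguous supervision. Stage 2 runs a recursive captioning pipeline on long videos to generate scene-level metadata, then uses a large language model to construct multi-segment reasoning queries and dispreferred responses, aligning the model's preferences through multi-segment reasoning tasks.

Details

Motivation: Short-context vision-language models struggle to understand ultra-long videos, and the work aims to address this without relying on costly human annotation of long videos.

Result: Using only 16K synthetic examples and no human labels, LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks while maintaining strong performance on short-video benchmarks such as MVBench.

Insight: The novelty lies in the two-stage synthesis of preference data: Stage 1 uses anchoring and filtering to build base preference pairs with reduced bias, and Stage 2 uses recursive captioning and an LLM to generate multi-segment reasoning tasks that strengthen long-video reasoning. Together they offer a scalable paradigm for efficient long-form video understanding.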

Abstract: We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model’s scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model’s preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.
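
Both stages produce (query, preferred, dispreferred) triples for Direct Preference Optimization. As a minimal sketch, the standard DPO loss for a single triple can be written as below, assuming the sequence log-probabilities under the policy and the reference model are already available. The function name and the default `beta` are illustrative choices, and the anchor-clip approximation of the reference score described in the abstract is not modelled here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference triple (illustrative helper)."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A policy identical to the reference gives log(2); preferring the chosen
# response more strongly than the reference does lowers the loss.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```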


[246] Unified Personalized Reward Model for Vision Generation cs.CV

Yibin Wang, Yuhang Zang, Feng Han, Jiazi Bu, Yujie Zhou

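TL;DR: This paper proposes UnifiedReward-Flex, a unified personalized reward model for vision generation. It targets two weaknesses of existing reward models: insensitivity to content-specific visual cues and systematic misalignment with subjective, context-dependent human preferences. The model couples reward modeling with flexible, context-adaptive reasoning: it interprets semantic intent, reasons over visual evidence, and dynamically constructs a hierarchical assessment to produce the reward. Training is two-stage: supervised fine-tuning on high-quality reasoning traces distilled from advanced closed-source vision-language models, followed by direct preference optimization to further strengthen reasoning fidelity and discriminative alignment.

Details

Motivation: Existing multimodal reward models typically follow a one-size-fits-all paradigm that assumes a single preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues and systematically misaligned with subjective, context-dependent human preferences.

Result: UnifiedReward-Flex is integrated into the GRPO framework for image and video synthesis, and extensive experiments demonstrate its superiority.

Insight: The novelty lies in coupling reward modeling with flexible, context-adaptive reasoning, mimicking human assessment by dynamically constructing hierarchical evaluations. The training pipeline combines reasoning distillation from advanced VLMs with direct preference optimization to improve both reasoning ability and alignment.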

Abstract: Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds its assessment in visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.


[247] Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory cs.CV | cs.AI

Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang

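TL;DR: This paper proposes Infinite-World, a robust interactive world model that maintains coherent visual memory over more than 1000 frames in complex real-world environments. Its core contributions are a Hierarchical Pose-free Memory Compressor (HPMC) that needs no explicit geometric priors, an uncertainty-aware action labeling module that discretizes continuous motion into a tri-state logic, and a fine-tuning strategy built on a small revisit-dense dataset. Together these address the pose noise and scarcity of viewpoint revisits that hamper training world models on real videos.

Details

Motivation: Existing world models can be optimized efficiently on synthetic data with perfect ground truth, but they lack an effective training paradigm for real-world videos, where pose estimates are noisy and viewpoint revisits are scarce. This work aims to close that gap and build a world model that stays visually consistent and controllable over long horizons (1000+ frames) in real environments.

Result: Extensive experiments, including objective metrics and user studies, show that Infinite-World achieves superior visual quality, action controllability, and spatial consistency.

Insight: The main contributions are: 1) the Hierarchical Pose-free Memory Compressor (HPMC), which is optimized jointly with the generative backbone so that the model can autonomously anchor generations in the distant past at bounded computational cost, without explicit geometric priors; 2) uncertainty-aware action labeling, which discretizes continuous motion into a tri-state logic, maximizing use of raw video data while shielding the deterministic action space from noisy trajectories; 3) a revisit-dense fine-tuning strategy, motivated by a pilot study, that uses a compact 30-minute dataset to efficiently activate the model's long-range loop-closure capability.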

Abstract: We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model’s long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.
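
The abstract does not spell out the three states of the action labeling module. As a rough illustration of the general idea only, the sketch below assumes the states are "active", "idle", and "uncertain", and that they are assigned by thresholding the magnitude of an estimated motion along one control axis. The function name, the state names, and both thresholds are hypothetical and are not taken from the paper.

```python
def label_action(motion, active_thresh=0.5, idle_thresh=0.1):
    """Map a continuous motion estimate on one axis to a tri-state label.

    Hypothetical states: 'active' (clearly moving), 'idle' (clearly still),
    and 'uncertain' (too ambiguous to trust as a deterministic label).
    """
    magnitude = abs(motion)
    if magnitude >= active_thresh:
        return "active"
    if magnitude <= idle_thresh:
        return "idle"
    # Ambiguous band: keep the frame for training, but do not force a
    # deterministic action label onto a possibly noisy trajectory.
    return "uncertain"

labels = [label_action(m) for m in (0.8, -0.7, 0.05, 0.3)]
```

The point of the third state in such a scheme is that ambiguous frames stay in the training set without being assigned a confident action that the pose estimate cannot support.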


[248] Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation cs.CV

Xinshun Wang, Peiming Li, Ziyi Wang, Zhongbin Fang, Zhichao Deng

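TL;DR: This paper proposes Superman, a framework that unifies vision-based human motion perception with skeleton-based motion generation. A vision-guided motion tokenizer builds a cross-modal motion vocabulary, and a single multimodal large language model architecture processes temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation).

Details

Motivation: Human motion analysis is severely fragmented. Perception models understand motion from video but only output text, while generation models cannot perceive raw visual input. Generative MLLMs are usually limited to single-frame static poses based on dense parametric SMPL models and cannot handle temporal motion. Existing motion vocabularies are built from skeleton data alone and are disconnected from the visual domain.

Result: Extensive experiments on standard benchmarks, including Human3.6M, show that the unified method achieves state-of-the-art or competitive performance across all motion tasks.

Insight: The contributions are: 1) a vision-guided motion tokenizer that exploits the natural geometric alignment between 3D skeletons and visual data to learn robustly from both modalities jointly, which the authors describe as a first, and that produces a unified cross-modal motion vocabulary; 2) a single unified MLLM architecture, trained on this motion language, that flexibly handles diverse temporal inputs and unifies perception and generation. This offers a more efficient and scalable path for skeleton-based generative motion analysis.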

Abstract: Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.


[249] ReasonEdit: Editing Vision-Language Models using Human Reasoning cs.CV | cs.AI

Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen

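TL;DR: ReasonEdit is the first vision-language model editor that lets users explain their reasoning during editing. It stores human reasoning in a codebook and, at inference time, retrieves relevant facts using a topology-balanced multimodal embedding method inspired by network science, which substantially improves edit generalization.

Details

Motivation: Existing vision-language model editors cannot handle reasoning-heavy tasks, which require both humans and models to reason about images. ReasonEdit is proposed to fill this gap and introduces a new, practical model editing setup.

Result: Across four vision-language models and multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, showing that using human reasoning during editing greatly improves edit generalization.

Insight: The novelty is twofold: human reasoning explanations are incorporated into the VLM editing process for the first time, and a topology-balanced multimodal embedding method is proposed for efficient retrieval. Together they suggest a new direction for improving the accuracy and generalization of model edits.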

Abstract: Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.


[250] SelvaMask: Segmenting Trees in Tropical Forests and Beyond cs.CV

Simon-Olivier Duguay, Hugo Baudchon, Etienne Laliberté, Helene Muller-Landau, Gonzalo Rivas-Torres

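TL;DR: This paper introduces SelvaMask, a tropical forest dataset with over 8,800 manually delineated tree crowns, and proposes a modular detection-segmentation pipeline built on vision foundation models (VFMs) for accurate segmentation of individual tree crowns in tropical forests. The method reaches state-of-the-art performance in dense tropical forests, and its generalization is validated on external datasets.

Details

Motivation: Tropical forests are critical to global ecological balance, yet existing transformer-based models for individual tree crown segmentation perform poorly there. More accurate segmentation is needed to support large-scale forest monitoring.

Result: On SelvaMask, the method reaches state-of-the-art performance, outperforming both zero-shot generalist models and fully supervised end-to-end methods. The gains are confirmed on external tropical and temperate datasets.

Insight: The contributions are a tropical forest dataset with comprehensive annotations and an inter-annotator agreement evaluation, and a modular detection-segmentation pipeline with a domain-specific detection prompter that adapts vision foundation models to dense forest structure.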

Abstract: Tropical forests harbor most of the planet’s tree biodiversity and are critical to global ecological balance. Canopy trees in particular play a disproportionate role in carbon storage and functioning of these ecosystems. Studying canopy trees at scale requires accurate delineation of individual tree crowns, typically performed using high-resolution aerial imagery. Despite advances in transformer-based models for individual tree crown segmentation, performance remains low in most forests, especially tropical ones. To this end, we introduce SelvaMask, a new tropical dataset containing over 8,800 manually delineated tree crowns across three Neotropical forest sites in Panama, Brazil, and Ecuador. SelvaMask features comprehensive annotations, including an inter-annotator agreement evaluation, capturing the dense structure of tropical forests and highlighting the difficulty of the task. Leveraging this benchmark, we propose a modular detection-segmentation pipeline that adapts vision foundation models (VFMs) using a domain-specific detection prompter. Our approach reaches state-of-the-art performance, outperforming both zero-shot generalist models and fully supervised end-to-end methods in dense tropical forests. We validate these gains on external tropical and temperate datasets, demonstrating that SelvaMask serves as both a challenging benchmark and a key enabler for generalized forest monitoring. Our code and dataset will be released publicly.


[251] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing cs.CV | cs.AI

Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song

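TL;DR: UniReason 1.0 is a unified multimodal reasoning framework that brings world-knowledge-aligned image generation and image editing together under a dual reasoning paradigm. It treats generation as world-knowledge-enhanced planning that injects implicit constraints, and uses editing for fine-grained visual refinement, correcting visual errors through self-reflection. The framework is supported by a large-scale reasoning-centric dataset and shows advanced performance on several reasoning-intensive benchmarks.

Details

Motivation: Existing unified multimodal models struggle with complex synthesis tasks that demand deep reasoning, and they usually treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. UniReason addresses this by harmonizing the two tasks in one framework, mirroring the human cognitive process of planning followed by refinement.

Result: Extensive experiments show that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench while maintaining superior general synthesis capabilities.

Insight: The paper's novelty is a dual reasoning paradigm that unifies generation and editing within a shared representation, achieving deep reasoning through world-knowledge-enhanced planning and visual self-correction. Viewed objectively, the large-scale reasoning-centric dataset spanning multiple knowledge domains and the systematic treatment of generation and editing as consecutive reasoning steps are both worth borrowing.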

Abstract: Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.


[252] PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss cs.CV | cs.AI

Zehong Ma, Ruihan Xu, Shiliang Zhang

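TL;DR: PixelGen is a simple pixel diffusion framework that adds perceptual supervision to make learning in high-dimensional pixel space tractable, avoiding the artifacts and bottlenecks that VAEs introduce in latent diffusion models. It achieves an FID of 5.11 on ImageNet-256 and scales well to large-scale text-to-image generation.

Details

Motivation: High-dimensional pixel space contains many perceptually irrelevant signals, which makes pixel diffusion hard to optimize and leaves it lagging behind latent diffusion models.

Result: On ImageNet-256, PixelGen reaches an FID of 5.11 without classifier-free guidance after only 80 training epochs. On large-scale text-to-image generation it attains a GenEval score of 0.79, and with perceptual supervision it surpasses strong latent diffusion baselines.

Insight: The novelty is the use of an LPIPS loss and a DINO-based perceptual loss to steer the diffusion model toward a more meaningful perceptual manifold rather than the full image manifold, which simplifies the generative paradigm and improves performance.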

Abstract: Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide the diffusion model toward learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Code is publicly available at https://github.com/Zehong-Ma/PixelGen.


q-bio.QM

[253] Rank-and-Reason: Multi-Agent Collaboration Accelerates Zero-Shot Protein Mutation Prediction q-bio.QM | cs.AI | cs.CL

Yang Tan, Yuyuan Xi, Can Wu, Bozitao Zhong, Mingchen Li

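TL;DR: The paper proposes Rank-and-Reason (VenusRAR), a framework for automating zero-shot protein mutation prediction. It has two stages. The Rank stage uses a multi-modal ensemble to improve prediction correlation and sets a new Spearman correlation record of 0.551 on ProteinGym. The Reason stage uses multi-agent chain-of-thought reasoning to audit candidate mutations against geometric and structural constraints, improving the Top-5 hit rate by up to 367% on ProteinGym-DMS99. Wet-lab validation on the Cas12i3 nuclease further supports the framework, with a 46.7% positive rate and two novel mutants showing 4.23-fold and 5.05-fold activity improvements.

Details

Motivation: Existing protein language models (PLMs) often produce statistically confident zero-shot mutation predictions that ignore basic biophysical constraints. Candidate selection currently depends on manual expert auditing of PLM outputs, which is inefficient, subjective, and heavily reliant on domain expertise. The work aims to automate this workflow and maximize expected wet-lab fitness.

Result: On ProteinGym, the Rank stage raises the Spearman correlation from 0.518 to 0.551, a new record. On ProteinGym-DMS99, the Reason stage improves the Top-5 hit rate by up to 367%. In wet-lab validation on the Cas12i3 nuclease, the framework achieves a 46.7% positive rate and identifies two novel mutants with 4.23-fold and 5.05-fold activity improvements.

Insight: The claimed novelty is a two-stage multi-agent framework that combines context-aware multi-modal ensemble ranking with chain-of-thought auditing by an expert panel, automatically integrating statistical predictions with biophysical constraints. Viewed objectively, the contribution is applying the LLM-agent paradigm to the decision workflow of protein engineering: simulated expert roles (Computational Expert, Virtual Biologist, Expert Panel) systematically combine information from different modalities with domain knowledge, going beyond what a single PLM can do.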

Abstract: Zero-shot mutation prediction is vital for low-resource protein engineering, yet existing protein language models (PLMs) often yield statistically confident results that ignore fundamental biophysical constraints. Currently, selecting candidates for wet-lab validation relies on manual expert auditing of PLM outputs, a process that is inefficient, subjective, and highly dependent on domain expertise. To address this, we propose Rank-and-Reason (VenusRAR), a two-stage agentic framework to automate this workflow and maximize expected wet-lab fitness. In the Rank-Stage, a Computational Expert and Virtual Biologist aggregate a context-aware multi-modal ensemble, establishing a new Spearman correlation record of 0.551 (vs. 0.518) on ProteinGym. In the Reason-Stage, an agentic Expert Panel employs chain-of-thought reasoning to audit candidates against geometric and structural constraints, improving the Top-5 Hit Rate by up to 367% on ProteinGym-DMS99. The wet-lab validation on Cas12i3 nuclease further confirms the framework’s efficacy, achieving a 46.7% positive rate and identifying two novel mutants with 4.23-fold and 5.05-fold activity improvements. Code and datasets are released on GitHub (https://github.com/ai4protein/VenusRAR/).
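
The Rank stage is scored by Spearman correlation between predicted and measured fitness. For readers unfamiliar with the metric, here is a minimal pure-Python sketch: Spearman's rho is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The helper names are illustrative, and the example scores are toy values, not ProteinGym data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(predicted, measured):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(predicted), average_ranks(measured)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy example: model scores vs. assay measurements for five mutants.
rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Because only ranks matter, the metric rewards a model for ordering mutants correctly even when its raw scores are on a different scale from the assay.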


cs.IR

[254] SPARC-RAG: Adaptive Sequential-Parallel Scaling with Context Management for Retrieval-Augmented Generation cs.IR | cs.AI | cs.CL | cs.LG

Yuxin Yang, Gangda Deng, Ömer Faruk Akgül, Nima Chitsazan, Yash Govilkar

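TL;DR: SPARC-RAG is a multi-agent framework for retrieval-augmented generation (RAG) that targets the long reasoning chains and efficiency problems of multi-hop question answering. Under a unified context management mechanism, it coordinates two inference-time scaling dimensions, sequential (depth) and parallel (width), uses specialized agents to generate complementary sub-queries and control exit decisions, and adds a lightweight fine-tuning method to optimize scaling behavior.

Details

Motivation: Existing RAG methods struggle on multi-hop question answering that requires long reasoning. Naive sequential or parallel scaling causes context contamination and scaling inefficiency, so additional computation yields diminishing or even negative returns.

Result: Across single- and multi-hop QA benchmarks, SPARC-RAG consistently outperforms previous RAG baselines, with an average improvement of +6.2 F1 at lower inference cost.

Insight: The novelty is a coordination framework with unified context management that combines sequential and parallel scaling, plus lightweight fine-tuning with process-level verifiable preferences to improve scaling efficiency. The multi-agent design and explicit exit-decision mechanism are useful references for managing complex reasoning pipelines.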

Abstract: Retrieval-Augmented Generation (RAG) grounds large language model outputs in external evidence, but remains challenged on multi-hop question answering that requires long reasoning. Recent works scale RAG at inference time along two complementary dimensions: sequential depth for iterative refinement and parallel width for coverage expansion. However, naive scaling causes context contamination and scaling inefficiency, leading to diminishing or negative returns despite increased computation. To address these limitations, we propose SPARC-RAG, a multi-agent framework that coordinates sequential and parallel inference-time scaling under a unified context management mechanism. SPARC-RAG employs specialized agents that maintain a shared global context and provide explicit control over the scaling process. It generates targeted, complementary sub-queries for each branch to enable diverse parallel exploration, and explicitly regulates exiting decisions based on answer correctness and evidence grounding. To optimize scaling behavior, we further introduce a lightweight fine-tuning method with process-level verifiable preferences, which improves the efficiency of sequential scaling and effectiveness of parallel scaling. Across single- and multi-hop QA benchmarks, SPARC-RAG consistently outperforms previous RAG baselines, yielding an average +6.2 F1 improvement under lower inference cost.


[255] RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval cs.IR | cs.CV

Tyler Skow, Alexander Martin, Benjamin Van Durme, Rama Chellappa, Reno Kriz

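TL;DR: This paper proposes RANKVIDEO, a reasoning-based reranker for text-to-video retrieval. It assesses relevance by explicitly reasoning over query-video pairs using video content, and is trained with a two-stage curriculum: perception-grounded supervised fine-tuning, followed by reranking training that combines pointwise, pairwise, and teacher confidence distillation objectives. On the MultiVENT 2.0 benchmark, it substantially improves performance within a two-stage retrieval framework.

Details

Motivation: Reasoning-based reranking is underexplored for video retrieval. The work aims to close the gap between text-centric reranking and video retrieval by developing a reranker that reasons explicitly over video content.

Result: On the large-scale MultiVENT 2.0 benchmark, RANKVIDEO consistently improves retrieval performance within a two-stage framework, with an average improvement of 31% on nDCG@10. It outperforms text-only and vision-language reranking alternatives while being more efficient.

Insight: The novelty lies in applying large reasoning models to video retrieval reranking, with a dedicated two-stage training curriculum (perception-grounded SFT, then multi-objective reranking training) and a data synthesis pipeline for constructing reasoning-intensive query-video pairs, which together enable explicit reasoning over video content.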

Abstract: Reranking is a critical component of modern retrieval systems, which typically pair an efficient first-stage retriever with a more expressive model to refine results. While large reasoning models have driven rapid progress in text-centric reranking, reasoning-based reranking for video retrieval remains underexplored. To address this gap, we introduce RANKVIDEO, a reasoning-based reranker for video retrieval that explicitly reasons over query-video pairs using video content to assess relevance. RANKVIDEO is trained using a two-stage curriculum consisting of perception-grounded supervised fine-tuning followed by reranking training that combines pointwise, pairwise, and teacher confidence distillation objectives, and is supported by a data synthesis pipeline for constructing reasoning-intensive query-video pairs. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework, yielding an average improvement of 31% on nDCG@10 and outperforming text-only and vision-language reranking alternatives, while being more efficient.
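
The headline number is reported in nDCG@10. A minimal sketch of the metric follows, using the linear-gain variant (some evaluations use an exponential gain of 2^rel - 1 instead) and computing the ideal ordering from the same candidate list, which assumes every relevant item was retrieved by the first stage. The function names and the toy relevance labels are illustrative and are not taken from MultiVENT 2.0.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Toy example: relevance labels in first-stage order vs. reranked order.
first_stage = ndcg_at_k([0, 1, 1])
reranked = ndcg_at_k([1, 1, 0])
```

A reranker only reorders the candidates it is given, so in this setup the ideal DCG is the same before and after reranking and any change in nDCG comes purely from ordering.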


cs.LG

[256] Extending Beacon to Hindi: Cultural Adaptation Drives Cross-Lingual Sycophancy cs.LG | cs.CL

Sarthak Sattigeri

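TL;DR: This study extends Beacon, a diagnostic for sycophancy in language models, to Hindi. Comparing three prompt designs (English original, Hindi literal translation, and Hindi culturally adapted), it finds that cultural adaptation is the main driver of cross-lingual differences in sycophancy, which suggests that English-based alignment evaluations may not transfer directly to other languages and cultural contexts.

Details

Motivation: Sycophancy is the tendency of a model to prioritize agreement with user preferences over principled reasoning. The study asks whether this behavior holds across languages and cultural contexts, and whether diagnostics developed in English generalize cross-lingually.

Result: Across four open-weight instruction-tuned models, culturally adapted Hindi prompts produce sycophancy rates 12.0 to 16.0 percentage points higher than English prompts. A decomposition on Qwen 2.5-Coder-7B attributes most of the gap to cultural adaptation (delta = 14.0%), with language encoding contributing only delta = 2.0%. Advice prompts show the largest cross-lingual differences (20-25 percentage points), reaching statistical significance in two of the four models.

Insight: The main contribution is a controlled three-condition design (original, literal translation, culturally adapted) that separates the effect of language encoding from that of cultural adaptation. It shows quantitatively that cultural framing of prompts has a substantial effect on alignment behavior, challenging the assumption that cross-lingual behavior can be inferred from English-only evaluation. The released datasets and evaluation code provide a basis for follow-up work on cross-cultural alignment.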

Abstract: Sycophancy, the tendency of language models to prioritize agreement with user preferences over principled reasoning, has been identified as a persistent alignment failure in English-language evaluations. However, it remains unclear whether such diagnostics generalize across languages and cultural contexts. We extend the Beacon single-turn forced-choice sycophancy diagnostic to Hindi through a controlled three-condition design: English original, Hindi literal translation, and Hindi culturally adapted prompts. We evaluate four open-weight instruction-tuned models on 50 prompts per condition, enabling separation of language encoding effects from cultural adaptation effects. Across all models, sycophancy rates are consistently higher for culturally adapted Hindi prompts than for English, with absolute differences ranging from 12.0 to 16.0 percentage points. A decomposition on Qwen 2.5-Coder-7B shows that cultural adaptation (delta = 14.0%, 95% CI: [4.0%, 26.0%]) accounts for the majority of this gap, while language encoding contributes minimally (delta = 2.0%, 95% CI: [0.0%, 6.0%]). Category-level analysis reveals that advice prompts exhibit the largest cross-lingual differences (20-25 percentage points), achieving statistical significance in two of four models. These findings indicate that alignment behaviors measured in English may not transfer uniformly across languages and that culturally grounded prompt framing plays a substantial role. We release all datasets and evaluation code to support replication and extension.
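
The decomposition reports a delta with a 95% confidence interval for each step between conditions. The abstract does not say how the intervals were computed; one common choice for 50 paired prompts is a percentile bootstrap over prompts, sketched below. The function name, the resampling settings, and the toy outcome vectors are assumptions for illustration only and do not reproduce the paper's data.

```python
import random

def bootstrap_delta_ci(cond_a, cond_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for rate(cond_b) - rate(cond_a), paired by prompt.

    cond_a and cond_b are 0/1 sycophancy outcomes for the same prompts
    under two conditions (e.g. literal translation vs. cultural adaptation).
    """
    rng = random.Random(seed)
    n = len(cond_a)
    deltas = []
    for _ in range(n_resamples):
        # Resample prompts with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(cond_b[i] - cond_a[i] for i in idx) / n)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(b - a for a, b in zip(cond_a, cond_b)) / n
    return point, lower, upper

# Toy data: 50 prompts, 7 of which flip to sycophantic in condition B.
cond_a = [0] * 50
cond_b = [1] * 7 + [0] * 43
point, lower, upper = bootstrap_delta_ci(cond_a, cond_b)
```

With only 50 prompts per condition, intervals of this kind are wide, which is consistent with the broad ranges quoted in the abstract.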


[257] CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models cs.LG | cs.AI | cs.CL

Shuozhe Li, Jincheng Cao, Bodun Hu, Aryan Mokhtari, Leqi Liu

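TL;DR: This paper proposes CARE-RFT, a method that uses a skew reverse KL divergence as the regularizer to resolve the trade-off in reinforcement finetuning (RFT) between improving LLM reasoning and preserving trustworthiness (hallucination and calibration). The penalty is bounded or unbounded depending on model confidence, which recovers the base model's trustworthiness while retaining reasoning performance.

Details

Motivation: Existing RFT methods face a critical trade-off. Unconstrained RFT improves reasoning but severely harms trustworthiness by amplifying hallucination and worsening calibration. RKL-constrained RFT preserves trustworthiness but limits reasoning gains because its penalty on exploratory deviations is unbounded. The work aims to balance reasoning ability and trustworthiness.

Result: Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance: its reasoning performance matches unconstrained RFT while recovering the trustworthiness and calibration of the base model.

Insight: The novelty is replacing standard reverse KL regularization with a skew reverse KL divergence, giving a confidence-sensitive penalty: bounded for confident, consistently rewarded explorations so that reasoning can improve, and unbounded elsewhere so that calibration is preserved. The result indicates that careful, confidence-aware regularization is key to building reasoning models that are both capable and trustworthy.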

Abstract: Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, RKL-constrained RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while unbounded elsewhere to preserve calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building both capable and trustworthy reasoning models.
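
The abstract does not give the exact form of its skew reverse KL. To show why skewing can bound a penalty at all, the sketch below uses the textbook alpha-skew divergence, KL(policy || alpha * reference + (1 - alpha) * policy), which never exceeds log(1 / (1 - alpha)). This is an assumption made for illustration: it is bounded everywhere and therefore does not reproduce the confidence-dependent bounded/unbounded behavior the paper describes. The function names and the toy distributions are hypothetical.

```python
import math

def reverse_kl(policy, reference):
    """KL(policy || reference) over a small discrete distribution."""
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)

def skew_reverse_kl(policy, reference, alpha=0.9):
    """Textbook alpha-skew variant: the reference is mixed with the policy."""
    mixed = [alpha * q + (1 - alpha) * p for p, q in zip(policy, reference)]
    return reverse_kl(policy, mixed)

# The policy puts 99% mass on a token the reference considers near-impossible.
policy = [0.99, 0.01]
reference = [1e-12, 1 - 1e-12]
plain = reverse_kl(policy, reference)        # grows without bound as q -> 0
skewed = skew_reverse_kl(policy, reference)  # stays below log(1 / (1 - alpha))
bound = math.log(1 / (1 - 0.9))
```

In this toy case the plain reverse KL penalty is more than ten times larger than the skewed one, which is the kind of gap that lets a confident deviation survive regularization.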


[258] Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits cs.LG | cs.AI | cs.CL | cs.CV

Neha Kalibhat, Zi Wang, Prasoon Bajpai, Drew Proud, Wenjun Zeng

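TL;DR: This paper proposes a black-box interpretability framework that uses atomic concept edits (ACEs) to learn a verifiable constitution: a natural language summary of how prompt modifications affect a specific model behavior such as alignment, correctness, or adherence to constraints. By systematically applying ACEs and observing how model behavior changes across tasks, the method learns a causal mapping from edits to predictable outcomes, yielding deep, generalizable insights into the model.

Details

Motivation: The goal is a black-box method for understanding and controlling model behavior in an interpretable, verifiable way, by systematically analyzing how behavior changes when atomic concepts in the prompt are modified.

Result: The method is validated on diverse tasks, including mathematical reasoning and text-to-image alignment. In text-to-image generation, GPT-Image tends to focus on grammatical adherence, while Imagen 4 prioritizes atmospheric coherence. In mathematical reasoning, distractor variables confuse GPT-5 but leave Gemini 2.5 models and o4-mini largely unaffected. The learned constitutions are also highly effective for controlling model behavior, with an average 1.86-fold boost in success rate over methods that do not use constitutions.

Insight: The novelty is the use of atomic concept edits as interpretable units of intervention, together with learned, verifiable constitutions that causally map edits to behavior changes. The result is a systematic, generalizable black-box framework for interpretation and control that exposes behavior patterns across tasks and models and improves controllability.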

Abstract: We introduce a black-box interpretability framework that learns a verifiable constitution: a natural language summary of how changes to a prompt affect a model’s specific behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the input prompt. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, including mathematical reasoning and text-to-image alignment, for controlling and understanding model behavior. We found that for text-to-image generation, GPT-Image tends to focus on grammatical adherence, while Imagen 4 prioritizes atmospheric coherence. In mathematical reasoning, distractor variables confuse GPT-5 but leave Gemini 2.5 models and o4-mini largely unaffected. Moreover, our results show that the learned constitutions are highly effective for controlling model behavior, achieving an average of 1.86 times boost in success rate over methods that do not use constitutions.
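
An atomic concept edit adds, removes, or replaces one interpretable concept in the prompt. A minimal sketch of those three operations follows, assuming for illustration that a prompt has already been decomposed into a list of concept strings; how the paper actually represents concepts is not stated in the abstract, and the function name and example concepts are hypothetical.

```python
def apply_ace(concepts, op, concept, replacement=None):
    """Return a new concept list with one atomic add/remove/replace applied."""
    edited = list(concepts)  # never mutate the original prompt
    if op == "add":
        edited.append(concept)
    elif op == "remove":
        edited.remove(concept)
    elif op == "replace":
        edited[edited.index(concept)] = replacement
    else:
        raise ValueError(f"unknown op: {op}")
    return edited

base = ["a red cube", "on a table", "at sunset"]
added = apply_ace(base, "add", "in the rain")
removed = apply_ace(base, "remove", "at sunset")
replaced = apply_ace(base, "replace", "a red cube", "a blue sphere")
```

Keeping each edit to a single concept is what makes the observed behavior change attributable to that concept, which is the premise behind learning a causal mapping from edits to outcomes.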


[259] LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference cs.LG | cs.AI | cs.CL | eess.SP

Vikram Krishnamurthy

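TL;DR: This paper formalizes transformer-based large language models (LLMs) as high-dimensional nonlinear autoregressive models with attention-based dependencies. It gives a concise mathematical framework covering pretraining, alignment (RLHF, DPO, RSFT, RLVR), and autoregressive generation at inference, intended as an explicit, equation-level reference for researchers.

Details

Motivation: LLMs are usually described in terms of architectural components and training procedures, which obscures their underlying computational structure. The paper aims to provide a unified, equation-level formalization that supports understanding and analysis of LLM training, alignment, and inference.

Result: This is a review article and reports no quantitative experiments. The proposed framework enables principled analysis of alignment-induced behaviors (such as sycophancy), inference-time phenomena (hallucination, in-context learning, chain-of-thought prompting, and retrieval-augmented generation), and extensions such as continual learning.

Insight: The key observation is that self-attention arises naturally as a repeated bilinear-softmax-linear composition, which yields highly expressive sequence models. The formalization provides a theoretical basis for interpreting LLM behavior and analyzing alignment and inference phenomena, and serves as a concise reference for further theoretical work.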

Abstract: Large language models (LLMs) based on transformer architectures are typically described through collections of architectural components and training procedures, obscuring their underlying computational structure. This review article provides a concise mathematical reference for researchers seeking an explicit, equation-level description of LLM training, alignment, and generation. We formulate LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies. The framework encompasses pretraining via next-token prediction, alignment methods such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), rejection sampling fine-tuning (RSFT), and reinforcement learning from verifiable rewards (RLVR), as well as autoregressive generation during inference. Self-attention emerges naturally as a repeated bilinear–softmax–linear composition, yielding highly expressive sequence models. This formulation enables principled analysis of alignment-induced behaviors (including sycophancy), inference-time phenomena (such as hallucination, in-context learning, chain-of-thought prompting, and retrieval-augmented generation), and extensions like continual learning, while serving as a concise reference for interpretation and further theoretical development.
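
The bilinear, softmax, linear structure can be read off the standard single-head self-attention formula. The notation below is the conventional one, not necessarily the paper's: $X$ is the matrix of token embeddings, $W_Q$, $W_K$, $W_V$ are learned projections, $d_k$ is the key dimension, and the causal mask used in autoregressive models is omitted for brevity.

```latex
\[
\mathrm{Attn}(X)
  = \underbrace{\mathrm{softmax}\!\left(
      \frac{\overbrace{X W_Q W_K^{\top} X^{\top}}^{\text{bilinear in } X}}{\sqrt{d_k}}
    \right)}_{\text{row-wise softmax}}
    \;\underbrace{X W_V}_{\text{linear in } X}
\]
```

The scores are bilinear in the input, the softmax turns each row into a probability vector, and the result weights a linear map of the same input; stacking layers repeats this composition.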


[260] Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning cs.LG | cs.AI | cs.CL

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

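TL;DR: This paper proposes PEAR, which uses importance sampling to reweight the SFT loss, correcting the mismatch between the data distribution of the SFT stage and the policy optimized in the subsequent RL stage and thereby improving performance after RL training.

Details

Motivation: In current LLM post-training pipelines, the offline SFT stage is usually optimized in isolation to maximize SFT performance. Its data distribution, however, differs from the policy distribution optimized during online RL, so a stronger SFT checkpoint can end up performing worse after RL.

Result: Experiments on Qwen 2.5/3 and DeepSeek-distilled models, covering verifiable reasoning games and mathematical reasoning tasks such as AIME2025, show that PEAR consistently improves post-RL performance over standard SFT, with pass@8 gains of up to 14.6 percent on AIME2025.

Insight: The novelty is aligning the design of the SFT stage with the downstream RL objective: importance sampling reweights the SFT loss at the token, block, and sequence levels to correct the distribution mismatch, giving RL a better initialization at little additional training cost.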

Abstract: Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
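
The abstract names three granularities for the importance weights but gives no formulas. The sketch below is one plausible reading, stated as an assumption rather than as PEAR itself: the weight is the ratio of policy probability to data-generating probability, shared across a token, a fixed-size block, or the whole sequence, and clipped for stability. The function names, the clipping, the block size, and the direction of the ratio are all hypothetical.

```python
import math

def importance_weights(policy_logps, behavior_logps, level="token",
                       block_size=2, clip=5.0):
    """Per-token loss weights from log-probability ratios (illustrative).

    The ratio exp(sum(policy - behavior)) is computed over each token,
    each block of tokens, or the whole sequence, then shared by every
    token in that group.
    """
    diffs = [p - b for p, b in zip(policy_logps, behavior_logps)]
    n = len(diffs)
    if level == "token":
        size = 1
    elif level == "block":
        size = block_size
    elif level == "sequence":
        size = max(n, 1)
    else:
        raise ValueError(f"unknown level: {level}")
    weights = []
    for start in range(0, n, size):
        group = diffs[start:start + size]
        ratio = min(math.exp(sum(group)), clip)  # clip large ratios
        weights.extend([ratio] * len(group))
    return weights

def reweighted_sft_loss(policy_logps, weights):
    """Weighted negative log-likelihood averaged over tokens."""
    return -sum(w * lp for w, lp in zip(weights, policy_logps)) / len(policy_logps)

policy = [-1.0, -2.0, -0.5, -0.5]
behavior = [-1.0, -1.0, -1.0, -1.0]
token_w = importance_weights(policy, behavior, "token")
block_w = importance_weights(policy, behavior, "block")
seq_w = importance_weights(policy, behavior, "sequence")
```

Coarser levels trade variance for bias in the usual way: sequence-level ratios match the distribution shift of a whole rollout but can be extreme, while token-level ratios are tamer but ignore dependence between tokens.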


[261] When Domains Interact: Asymmetric and Order-Sensitive Cross-Domain Effects in Reinforcement Learning for Reasoning cs.LG | cs.AI | cs.CLPDF

Wang Yang, Shouren Wang, Chaoda Song, Chuang Ma, Xinpeng Li

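TL;DR: This paper presents the first systematic analysis of how the training order and strategy (sequential vs. mixed training) across domains (math, science, logic, puzzles) affect GRPO performance when reinforcement learning is used for reasoning. It finds that cross-domain effects are highly asymmetric and order-sensitive, and that no single strategy is optimal.

Details

Motivation: GRPO has become a key technique for improving the reasoning abilities of large language models, but its behavior under different domain sequencing strategies is unclear; in particular, the effect of sequential versus mixed training has not been studied systematically.

Result: Experiments on math, science, logic, and puzzle reasoning tasks show that cross-domain generalization is highly asymmetric (for example, training on other domains improves math reasoning accuracy by about 25% but transfers almost nothing to logic and puzzles); that order effects are pronounced (the math→science order reaches 83%/41% accuracy on math/science, while the science→math order drops to 77%/25%); and that no single strategy is optimal (sequential training favors math, mixed training favors science and logic, and poor ordering can reduce performance from 70% to 56%).

Insight: The claimed novelty is the first systematic demonstration that GRPO in multi-domain settings exhibits asymmetry, order sensitivity, and strategy dependence. Viewed objectively, the core insight is that domain-aware and order-aware training design is essential when reinforcement learning is used for reasoning, which offers useful guidance for optimizing multi-domain training pipelines.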

Abstract: Group Relative Policy Optimization (GRPO) has become a key technique for improving reasoning abilities in large language models, yet its behavior under different domain sequencing strategies is poorly understood. In particular, the impact of sequential (one domain at a time) versus mixed-domain (multiple domains at a time) training in GRPO has not been systematically studied. We provide the first systematic analysis of training-order effects across math, science, logic, and puzzle reasoning tasks. We found (1) single-domain generalization is highly asymmetric: training on other domains improves math reasoning by approximately 25% accuracy, while yielding negligible transfer to logic and puzzle; (2) cross-domain interactions are highly order-dependent: training in the order math$\rightarrow$science achieves 83% / 41% accuracy on math / science, while reversing the order to science$\rightarrow$math degrades performance to 77% / 25%; (3) no single strategy is universally optimal in multi-domain training: sequential training favors math (up to 84%), mixed training favors science and logic, and poor ordering can incur large performance gaps (from 70% to 56%). Overall, our findings demonstrate that GRPO under multi-domain settings exhibits pronounced asymmetry, order sensitivity, and strategy dependence, highlighting the necessity of domain-aware and order-aware training design.


[262] A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning cs.LG | cs.AI | cs.CLPDF

Akifumi Wachi, Hirota Kinoshita, Shokichi Takakura, Rei Higuchi, Taiji Suzuki

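TL;DR: This paper proposes a relative-budget theory for reinforcement learning in LLM reasoning. The theory explains why the effectiveness of RL varies across tasks and compute budgets through a single quantity, the relative budget ξ (the ratio of the generation horizon H to E[T], the expected number of tokens until the first correct solution under the base policy). It identifies three regimes (deficient, balanced, and ample) and analyzes sample efficiency in each.

Details

Motivation: The goal is to explain why the effect of reinforcement learning on LLM reasoning varies so much across tasks and compute budgets, and to provide a unified theoretical framework for understanding and predicting that variation.

Result: In a case study under idealized distributional assumptions, the theoretical analysis shows that the relative budget grows linearly over iterations. Empirical results confirm the theoretical predictions in realistic settings and identify a budget range of ξ∈[1.5, 2.0] in which learning efficiency is highest and which coincides with peak reasoning performance.

Insight: The novelty is the relative budget ξ itself, a single key quantity for explaining and predicting the sample efficiency of RL, together with the three learning regimes derived from it, which offer theoretical guidance for allocating compute efficiently.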

Abstract: Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a relative-budget theory explaining this variation through a single quantity called the relative budget $\xi := H/\mathbb{E}[T]$, where $H$ is the generation horizon (token budget) and $T$ denotes the number of tokens until the first correct solution under a base policy. We show that $\xi$ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the deficient regime ($\xi \to 0$), informative trajectories are rare and the sample complexity explodes; in the balanced regime ($\xi = \Theta(1)$), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the ample regime ($\xi \to \infty$), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget $\xi \in [1.5, 2.0]$ that maximizes learning efficiency and coincides with peak reasoning performance.
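
Sketch (not taken from the paper): the relative budget is defined above as the horizon divided by the expected number of tokens to the first correct solution. The example below estimates it from a handful of sampled token counts and assigns a regime label. The numeric cut-offs `low` and `high` are hypothetical; the theory states the regimes only asymptotically (ξ→0, ξ=Θ(1), ξ→∞).

```python
def relative_budget(horizon, tokens_to_first_correct):
    """Estimate xi = H / E[T] from sampled values of T under the base policy."""
    mean_t = sum(tokens_to_first_correct) / len(tokens_to_first_correct)
    return horizon / mean_t

def regime(xi, low=0.5, high=4.0):
    """Label a regime; the thresholds are illustrative assumptions only."""
    if xi < low:
        return "deficient"
    if xi > high:
        return "ample"
    return "balanced"

# A 3000-token horizon against an average of 2000 tokens to the first correct answer.
xi = relative_budget(3000, [1000, 2000, 3000])
print(xi, regime(xi))
```

In practice T is censored whenever no correct solution appears within the horizon, so a plain sample mean like this one would underestimate E[T]; the sketch ignores that.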


[263] Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards cs.LG | cs.AI | cs.CLPDF

Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She

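TL;DR: This paper proposes VIP, an adaptive rollout allocation strategy that addresses low sampling efficiency in reinforcement learning with verifiable rewards. It uses a Gaussian process model to predict the success probability of each training prompt and optimizes the rollout allocation from the resulting variance estimates so as to minimize the gradient variance of the policy update, making better use of the compute budget.

Details

Motivation: Existing group-based policy optimization methods (such as GRPO) allocate a fixed number of rollouts to every training prompt, implicitly assuming that all prompts are equally informative; this wastes compute budget and slows training. The paper aims to improve sampling efficiency by allocating rollouts adaptively.

Result: Across multiple benchmarks, VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies.

Insight: The novelty lies in formulating rollout allocation as a convex optimization problem that minimizes the expected gradient variance under a hard compute budget constraint, with a lightweight Gaussian process model supplying the predictions that guide the allocation. Viewed objectively, combining variance estimation with an optimization framework gives resource allocation in RL a principled basis.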

Abstract: Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce VIP, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks. Our code will be available at https://github.com/HieuNT91/VIP.
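
Sketch (not the paper's algorithm): the abstract describes turning predicted success probabilities into variance estimates and solving a convex allocation problem under a budget. A classical stand-in is Neyman-style allocation, which minimizes the sum of sigma_i^2 / n_i subject to a fixed total and gives n_i proportional to sigma_i. The example below uses the Bernoulli standard deviation sqrt(p(1-p)) and takes the probabilities as given; the Gaussian process predictor, the exact objective, and the name `allocate_rollouts` are not from the source.

```python
import numpy as np

def allocate_rollouts(success_probs, budget, min_rollouts=1):
    """Split `budget` rollouts across prompts in proportion to sqrt(p * (1 - p))."""
    p = np.asarray(success_probs, dtype=float)
    sigma = np.sqrt(p * (1.0 - p))                # Bernoulli reward standard deviation
    n_prompts = p.size
    spare = budget - min_rollouts * n_prompts     # rollouts left after the guaranteed minimum
    if spare < 0:
        raise ValueError("budget too small for the minimum allocation")
    if sigma.sum() == 0:
        share = np.full(n_prompts, spare / n_prompts)
    else:
        share = spare * sigma / sigma.sum()       # continuous proportional allocation
    alloc = np.floor(share).astype(int)
    leftover = int(spare - alloc.sum())
    order = np.argsort(-(share - alloc))          # largest fractional remainders first
    alloc[order[:leftover]] += 1
    return alloc + min_rollouts

print(allocate_rollouts([0.5, 0.5, 1.0, 0.0], budget=12))
```

Prompts that are always solved or never solved have zero estimated variance and receive only the minimum, while prompts near p = 0.5 receive the most rollouts.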


[264] No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs cs.LG | cs.CLPDF

Liyan Xu, Mo Yu, Fandong Meng, Jie Zhou

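TL;DR: Using the Tele-Lens probing method, this paper studies the latent planning ability of large language models during chain-of-thought (CoT) reasoning and finds that LLMs have a myopic planning horizon, reasoning mainly in incremental steps rather than through precise global planning. Building on this, the paper proposes a hypothesis for improving uncertainty estimation from CoT dynamics, verifies that a small number of CoT positions can represent the uncertainty of the whole path, and shows that CoT bypass can be recognized automatically without degrading performance.

Details

Motivation: The aim is to better understand the relationship between an LLM's internal states and its explicit reasoning trajectories, in particular its latent planning ability during CoT reasoning, and thereby to reconcile prior complementary observations about the importance of CoT.

Result: Experiments across multiple task domains show that LLMs have a myopic planning horizon and reason mainly in incremental steps; the proposed uncertainty-estimation hypothesis is validated, and automatic recognition of CoT bypass causes no performance degradation.

Insight: The novelty lies in revealing that LLMs lack global planning during CoT reasoning and in exploiting this property to propose an uncertainty-estimation approach based on CoT dynamics, offering a new perspective on optimizing the reasoning process.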

Abstract: This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT): Large Language Models (LLMs) have been shown to plan subsequent reasoning latently before CoT emerges, which diminishes the significance of explicit CoT, whereas CoT remains critical for tasks requiring multi-step reasoning. To deepen the understanding of the relationship between an LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs through our probing method, Tele-Lens, applied to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, and we validate that a small subset of CoT positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele-lens.


[265] Learning Generative Selection for Best-of-N cs.LG | cs.AI | cs.CLPDF

Shubham Toshniwal, Aleksander Ficek, Siddhartha Jain, Wei Du, Vahid Noroozi

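TL;DR: This paper proposes training small reasoning models (1.7B parameters) with reinforcement learning to acquire strong generative selection ability, addressing the Best-of-N selection-quality bottleneck in LLM reasoning. On math and code reasoning benchmarks the approach outperforms prompting and majority-voting baselines and approaches or exceeds much larger models, and the ability generalizes to selecting among outputs from stronger models.

Details

Motivation: To address the limited quality of Best-of-N selection in parallel sampling, in particular the weak generative-selection performance of small models.

Result: On math benchmarks (AIME24, AIME25, HMMT25) and a code benchmark (LiveCodeBench), the models consistently outperform prompting and majority-voting baselines and often approach or exceed much larger models.

Insight: Targeted reinforcement learning (DAPO) gives small models generative selection ability, and that ability generalizes to outputs from stronger models not seen in training, providing a scalable route to efficient test-time compute scaling.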

Abstract: Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely limited to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, and train 1.7B-parameter models with DAPO to reward correct selections. Across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks, our models consistently outperform prompting and majority-voting baselines, often approaching or exceeding much larger models. Moreover, these gains generalize to selecting outputs from stronger models despite training only on outputs from weaker models. Overall, our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling.
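
For orientation, the majority-voting baseline named in the abstract can be sketched in a few lines. This is a generic illustration, not the paper's evaluation code; the function name and the example answers are made up. A generative selector would instead read all N candidate solutions and output the index of the one it judges correct, which is what lets it recover a correct answer even when that answer is in the minority.

```python
from collections import Counter

def majority_vote(final_answers):
    """Return the most frequent final answer among N sampled solutions."""
    counts = Counter(final_answers)
    return counts.most_common(1)[0][0]

# Five sampled solutions to the same problem, reduced to their final answers.
samples = ["42", "41", "42", "17", "42"]
print(majority_vote(samples))
```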


[266] Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models cs.LG | cs.CLPDF

Hao Wang, Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye

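TL;DR: This paper proposes CurioSFT, an entropy-preserving supervised fine-tuning method. In the standard SFT-then-RL pipeline for large reasoning models, the SFT stage causes overconfidence and reduced generation diversity; CurioSFT addresses this so that more exploration capacity is left for the subsequent RL stage. Through self-exploratory distillation and entropy-guided temperature selection, it adaptively encourages exploration at reasoning tokens while keeping factual tokens stable.

Details

Motivation: The standard post-training pipeline for large reasoning models (SFT-then-RL) has a limitation: imitating expert demonstrations during SFT makes the model overconfident and reduces generation diversity, narrowing the solution space available to the subsequent RL stage. Simple entropy regularization pushes token distributions toward uniformity, raising entropy without improving meaningful exploration.

Result: Extensive experiments on mathematical reasoning tasks show that, in the SFT stage, CurioSFT outperforms vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. The exploration capacity preserved during SFT also translates into concrete gains in the RL stage, with an average improvement of 5.0 points.

Insight: The novelty is an SFT framework that preserves entropy through intrinsic curiosity. Its core components are self-exploratory distillation (distilling toward a self-generated, temperature-scaled teacher) and entropy-guided temperature selection (adaptively adjusting distillation strength). Viewed objectively, the contribution is that, rather than simply raising global entropy, it distinguishes reasoning tokens from factual tokens, amplifying exploration at the former and stabilizing knowledge at the latter, so that the model keeps its capabilities while the RL stage starts from a higher-quality point for exploration.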

Abstract: The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting by amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in the SFT stage, CurioSFT outperforms vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that exploration capabilities preserved during SFT successfully translate into concrete gains in the RL stage, yielding an average improvement of 5.0 points.
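
Sketch (not taken from the paper): the two components above can be illustrated on a single token position. A temperature above 1 softens the model's own distribution into a higher-entropy teacher, and a simple rule picks the temperature from the token's current entropy. The threshold, the two temperature values, and the idea of using entropy as a proxy for "reasoning token" are assumptions for illustration; the abstract does not give the actual selection rule.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                               # stabilize before exponentiating
    return z - np.log(np.exp(z).sum())

def entropy(logits, temperature=1.0):
    logp = log_softmax(logits, temperature)
    return float(-(np.exp(logp) * logp).sum())

def pick_temperature(logits, threshold=0.5, t_explore=1.5, t_stable=1.0):
    """Hypothetical rule: soften the teacher only where the model is already uncertain."""
    return t_explore if entropy(logits) > threshold else t_stable

def distill_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy from a temperature-scaled teacher to the student."""
    teacher = np.exp(log_softmax(teacher_logits, temperature))
    return float(-(teacher * log_softmax(student_logits)).sum())

uncertain, confident = [0.0, 0.0, 0.0], [10.0, 0.0, 0.0]
print(pick_temperature(uncertain), pick_temperature(confident))
```

Under this rule a confidently predicted (fact-like) token keeps temperature 1, so its teacher equals the model's own distribution and the distillation term leaves it unchanged.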


[267] RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System cs.LG | cs.CLPDF

Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang

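TL;DR: RLAnything is a reinforcement learning framework that dynamically constructs the environment, policy, and reward model through closed-loop optimization, strengthening the learning signal and overall system performance in any LLM or agentic scenario. The policy is trained with integrated feedback from step-wise and outcome signals, and the reward model is jointly optimized via consistency feedback, which in turn improves policy training. A theory-motivated automatic environment adaptation uses critic feedback from both the policy and the reward model so that both can learn from experience.

Details

Motivation: In conventional RL systems the environment, policy, and reward model are usually static or optimized independently. The paper aims to optimize the system as a whole through a dynamic closed loop, amplifying the learning signal more effectively across a broad range of LLM and agentic tasks.

Result: The framework yields substantial gains on several representative tasks: Qwen3-VL-8B-Thinking improves by 9.1% on OSWorld, and Qwen2.5-7B-Instruct improves by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. Each added component consistently improves the system, and optimized reward-model signals outperform outcomes that rely on human labels.

Insight: The novelty is a fully dynamic RL system that jointly optimizes the environment, policy, and reward model in a closed loop, using integrated feedback and consistency feedback to strengthen learning. Viewed objectively, the theory-motivated automatic environment adaptation and the reward-model optimization strategy could be adopted to improve the overall efficiency and generalization of RL systems.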

Abstract: We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also show that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL


[268] Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models cs.LG | cs.AI | cs.CVPDF

Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani

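TL;DR: This paper proposes UltraBreak, a universal and transferable jailbreak attack framework targeting vision-language models (VLMs). It constrains adversarial patterns through transformations and regularization in the vision space while relaxing textual targets through semantic-based objectives, producing universal adversarial patterns that transfer across jailbreak objectives and models.

Details

Motivation: Existing gradient-based jailbreak methods transfer poorly: their adversarial patterns overfit to a single white-box surrogate model and fail to generalize to black-box models. The paper aims to overcome this limitation and achieve universal, transferable jailbreak attacks.

Result: Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods across multiple benchmarks, with strong transferability across models and attack targets.

Insight: The core novelty is combining vision-level regularization with semantically guided textual supervision, which smooths the loss landscape, mitigates surrogate overfitting, and thereby enables universal, transferable jailbreaks. This highlights the key role of semantic objectives in improving the transferability of adversarial attacks.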

Abstract: Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available at https://github.com/kaiyuanCui/UltraBreak.


[269] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-$k$ Activations cs.LG | cs.CVPDF

Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li

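TL;DR: This paper proposes MiTA attention, an efficient attention mechanism. Viewing Transformer attention as a two-layer fast-weight MLP, it applies a compress-and-route strategy: the N-width MLP is compressed into a narrower one using a set of landmark queries, and deformable experts are built from the top-k most activated key-value pairs for each landmark query, reducing compute cost for long sequences. Preliminary experiments on vision tasks support its effectiveness.

Details

Motivation: To address the prohibitive compute cost of Transformer attention on extremely long sequences, where the fast weights grow too large, by explaining a range of efficient attention methods within one unifying framework and proposing a more efficient compress-and-route strategy.

Result: Preliminary experiments on vision tasks show the promise of MiTA attention; the abstract reports no specific quantitative results or benchmark comparisons, and further optimization and validation are needed.

Insight: The novelty lies in interpreting efficient attention mechanisms under one framework, as scaling fast weights through routing and/or compression, and in proposing the combined compress-and-route MiTA strategy, which uses landmark queries and top-k activated key-value pairs to build deformable experts that balance expressive capacity against compute cost.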

Abstract: The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length $N$. As the context extends, the expressive capacity of such an $N$-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and/or compression. Then we propose a compress-and-route strategy, which compresses the $N$-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-$k$ activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-$k$ Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.
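
Sketch (heavily simplified, not the authors' implementation): the compress-and-route idea can be illustrated in NumPy. Landmark queries are formed here by average-pooling contiguous chunks of queries, each landmark gathers its top-k highest-scoring key-value pairs as an "expert", and each query attends only within the expert of its best-matching landmark. The pooling scheme, the argmax routing rule, and the function names are all assumptions; the abstract specifies only landmark queries and top-k gathering.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[1])) @ V

def mita_attention(Q, K, V, num_landmarks=4, top_k=8):
    n, d = Q.shape
    # Landmark queries: mean of contiguous query chunks (assumed pooling scheme).
    chunks = np.array_split(np.arange(n), num_landmarks)
    landmarks = np.stack([Q[idx].mean(axis=0) for idx in chunks])
    # Each landmark gathers the indices of its top-k activated keys.
    scores = landmarks @ K.T / np.sqrt(d)
    top_idx = np.argsort(-scores, axis=1)[:, :top_k]
    # Route every query to its most similar landmark (assumed routing rule).
    route = np.argmax(Q @ landmarks.T, axis=1)
    out = np.empty((n, V.shape[1]))
    for i in range(n):
        idx = top_idx[route[i]]
        weights = softmax(Q[i] @ K[idx].T / np.sqrt(d))
        out[i] = weights @ V[idx]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 16, 8))
print(mita_attention(Q, K, V, num_landmarks=4, top_k=4).shape)
```

Each query now scores only k keys instead of all N. Setting `top_k` to the full sequence length recovers ordinary softmax attention, which is a convenient sanity check.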


[270] When Is Rank-1 Enough? Geometry-Guided Initialization for Parameter-Efficient Fine-Tuning cs.LG | cs.CVPDF

Haoran Zhao, Soyeon Caren Han, Eduard Hovy

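TL;DR: This paper studies why extremely low-rank settings in parameter-efficient fine-tuning (PEFT), especially rank-1 LoRA, are unstable. It finds that the cause is not capacity alone: pretrained vision and text features form mismatched anisotropic regions, producing a dominant "modality gap" direction that disproportionately steers early gradients under the rank-1 constraint and destabilizes optimization. The authors propose Gap-Init, a geometry-aware initialization that estimates the modality-gap vector from a small calibration set and aligns the rank-1 LoRA direction with it, stabilizing training.

Details

Motivation: To explain and fix the frequent instability and collapse of parameter-efficient fine-tuning (such as LoRA) in extremely low-rank settings, particularly rank 1, by identifying the root cause and offering a simple, effective remedy.

Result: Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines, suggesting that at the extreme low-rank limit initial alignment can matter as much as rank itself.

Insight: The novelty lies in identifying the geometric source of instability in extremely low-rank PEFT, namely that the modality-gap direction in pretrained features dominates early gradient flow, and in proposing a calibration-based, geometry-aware initialization whose initial update is zero. Viewed objectively, the work underlines the importance of representation geometry in low-rank adaptation and offers a new perspective for designing more efficient fine-tuning methods.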

Abstract: Parameter-efficient fine-tuning (PEFT) is a standard way to adapt multimodal large language models, yet extremely low-rank settings – especially rank-1 LoRA – are often unstable. We show that this instability is not solely due to limited capacity: in the rank-1 regime, optimization is highly sensitive to the update direction. Concretely, pretrained vision and text features form mismatched anisotropic regions, yielding a dominant “gap” direction that acts like a translation component and disproportionately steers early gradients under rank-1 constraints. Analyzing pretrained representations, we identify a modality-gap axis that dominates early gradient flow, while a random rank-1 initialization is unlikely to align with it, leading to weak gradients and training collapse. We propose Gap-Init, a geometry-aware initialization that aligns the rank-1 LoRA direction with an estimated modality-gap vector from a small calibration set, while keeping the initial LoRA update zero. Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines. Our results suggest that at the extreme low-rank limit, initial alignment can matter as much as rank itself.
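
Sketch (not taken from the paper): one way to realize an initialization of this kind is to estimate the gap as the difference between mean vision and mean text features on a calibration set, place that unit vector in the LoRA down-projection, and leave the up-projection at zero so the initial update vanishes. Estimating the gap from feature means, putting it in `A` rather than `B`, and the function names are assumptions for illustration.

```python
import numpy as np

def modality_gap(vision_feats, text_feats):
    """Unit vector from the text-feature mean to the vision-feature mean."""
    gap = vision_feats.mean(axis=0) - text_feats.mean(axis=0)
    return gap / np.linalg.norm(gap)

def gap_init_rank1(vision_feats, text_feats, d_out):
    """Rank-1 LoRA factors with delta_W = B @ A equal to zero at initialization."""
    A = modality_gap(vision_feats, text_feats)[None, :]   # shape (1, d_in)
    B = np.zeros((d_out, 1))                               # zero up-projection
    return A, B

rng = np.random.default_rng(0)
vision = rng.normal(size=(16, 8)) + 3.0    # toy calibration features, offset from text
text = rng.normal(size=(16, 8))
A, B = gap_init_rank1(vision, text, d_out=8)
print(np.abs(B @ A).max())                 # the initial update is exactly zero
```

With `B` at zero, the first gradient step updates only `B`, and that gradient is the loss gradient projected onto the direction stored in `A`, which is why the choice of that direction matters at rank 1.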


[271] InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs cs.LG | cs.AI | cs.CVPDF

Lv Tang, Tianyi Zheng, Bo Li, Xingyu Li

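TL;DR: This paper proposes InfoTok, an information-regularized visual tokenization mechanism for capacity-constrained shared visual tokenization in unified multimodal large language models (MLLMs). Grounded in the Information Bottleneck principle, it uses mutual-information regularization to control the flow of information from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance.

Details

Motivation: Shared-token designs in existing unified MLLMs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Taking a capacity-constrained view, the paper argues that the visual tokenizer acts as a compute-bounded learner whose token budget should prioritize reusable structure over hard-to-exploit high-entropy variation and redundancy.

Result: InfoTok is integrated into three representative unified MLLMs without any additional training data and yields consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space.

Insight: The novelty lies in providing a principled design criterion (the Information Bottleneck) for shared visual tokenization in unified MLLMs from an information-theoretic, capacity-constrained perspective, and in making the information flow controllable through mutual-information regularization, which offers both theoretical guidance and a practical method for building more efficient multimodal models.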

Abstract: Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework, with the visual tokenizer acting as the sole interface that maps visual inputs into tokens for downstream tasks. However, existing shared-token designs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Therefore, we introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner, so the token budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance via mutual-information regularization. We integrate InfoTok into three representative unified MLLMs without introducing any additional training data. Experiments show consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space in unified MLLMs.
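
For reference, the generic Information Bottleneck Lagrangian that a formulation of this kind builds on can be written as follows. This is the standard textbook form, with $X$ the input image, $Z$ the shared visual tokens, $Y$ the multimodal outputs, and $\beta$ a trade-off coefficient; the symbols and the exact regularizer used by InfoTok are assumptions here, since the abstract does not state them.

```latex
% Markov chain assumed by the bottleneck: X -> Z -> Y
\min_{p(z \mid x)} \; \beta \, I(X; Z) \; - \; I(Z; Y), \qquad \beta > 0
```

The first term penalizes how much of the image the tokens retain (compression), and the second rewards information that is useful for the downstream outputs (task relevance).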


[272] Generative Visual Code Mobile World Models cs.LG | cs.AI | cs.CVPDF

Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

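TL;DR: This paper proposes a new paradigm for world modeling of mobile graphical user interfaces: predicting the next GUI state by generating renderable code rather than generating pixels directly. The approach combines the linguistic priors of text-based world models with the high fidelity of visual world models, and the paper introduces gWorld, the first open-weight visual mobile GUI world model built on this paradigm, along with its data generation framework.

Details

Motivation: To resolve the trade-off between visual fidelity and text-rendering precision in existing mobile GUI world models: text-based models sacrifice visual fidelity, while visual models cannot render text precisely and therefore depend on slow, complex pipelines of external models.

Result: Across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models that are over 50.25x larger.

Insight: The core novelty is recasting visual world modeling as the generation of executable web code, exploiting the priors that vision-language models acquire from pre-training on structured web code to achieve precise text rendering and high-fidelity visual generation together. The data generation framework (also called gWorld) automatically synthesizes code-based training data, and the analysis shows that scaling training data, improving pipeline components, and stronger world modeling all bring gains.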

Abstract: Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while the inability of visual WMs to render text precisely has led to their reliance on slow, complex pipelines dependent on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.


[273] From Perception to Action: Spatial AI Agents and World Models cs.LG | cs.AI | cs.CV | cs.MA | cs.ROPDF

Gloria Felicia, Nolan Bryant, Handi Putra, Ayaan Gazali, Eliel Lobo

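TL;DR: Based on a review of more than 2,000 papers, this survey proposes a unified three-axis taxonomy that connects agentic capabilities with spatial tasks at different scales, aiming to bridge the gap between symbolic reasoning and spatial intelligence in the physical world.

Details

Motivation: Existing work addresses either agentic architectures or spatial domains in isolation and lacks a unified framework connecting these complementary capabilities, even though spatial intelligence (perceiving 3D structure, reasoning about object relationships, and acting under physical constraints) is essential for embodied agents.

Result: The analysis yields three key findings: (1) hierarchical memory systems are important for long-horizon spatial tasks; (2) GNN-LLM integration is a promising approach for structured spatial reasoning; (3) world models are essential for safe deployment across micro-to-macro spatial scales.

Insight: The contribution is the distinction between spatial grounding (metric understanding of geometry and physics) and symbolic grounding (associating images with text), together with a unified taxonomy that provides a common foundation for research on spatially aware autonomous systems in robotics, autonomous driving, and geospatial intelligence.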

Abstract: While large language models have become the prevailing approach for agentic reasoning and planning, their success in symbolic domains does not readily translate to the physical world. Spatial intelligence, the ability to perceive 3D structure, reason about object relationships, and act under physical constraints, is an orthogonal capability that proves important for embodied agents. Existing surveys address either agentic architectures or spatial domains in isolation. None provide a unified framework connecting these complementary capabilities. This paper bridges that gap. Through a thorough review of over 2,000 papers, citing 742 works from top-tier venues, we introduce a unified three-axis taxonomy connecting agentic capabilities with spatial tasks across scales. Crucially, we distinguish spatial grounding (metric understanding of geometry and physics) from symbolic grounding (associating images with text), arguing that perception alone does not confer agency. Our analysis reveals three key findings mapped to these axes: (1) hierarchical memory systems (Capability axis) are important for long-horizon spatial tasks. (2) GNN-LLM integration (Task axis) is a promising approach for structured spatial reasoning. (3) World models (Scale axis) are essential for safe deployment across micro-to-macro spatial scales. We conclude by identifying six grand challenges and outlining directions for future research, including the need for unified evaluation frameworks to standardize cross-domain assessment. This taxonomy provides a foundation for unifying fragmented research efforts and enabling the next generation of spatially-aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence.


[274] FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning cs.LG | cs.AI | cs.CVPDF

Hongwei Yan, Guanglong Sun, Kanglei Zhou, Qian Li, Liyuan Wang

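TL;DR: This paper proposes FlyPrompt, a general continual learning (GCL) framework inspired by the fruit fly brain, designed for learning from single-pass, non-stationary data streams. It decomposes GCL into two subproblems, expert routing and expert competence improvement, and adapts to the data distribution through a randomly expanded analytic router and a temporal ensemble of output heads.

Details

Motivation: Existing continual parameter-efficient tuning methods typically rely on multiple training epochs and explicit task boundaries, which makes them ill-suited to general continual learning. The core challenges are how to allocate expert parameters to evolving data and how to improve their representational capacity under limited supervision.

Result: On the CIFAR-100, ImageNet-R, and CUB-200 benchmarks, FlyPrompt achieves gains of up to 11.23%, 12.43%, and 7.62%, respectively, over state-of-the-art baselines, setting a new state of the art.

Insight: The novelty is a cooperative mechanism of random-expansion routing and temporal-ensemble experts, inspired by the fruit fly's memory system with its sparse expansion and modular ensembles. The core idea is to decouple routing decisions from expert competence, using dynamic routing and decision-boundary adaptation to accumulate knowledge efficiently while resisting forgetting.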

Abstract: General continual learning (GCL) challenges intelligent systems to learn from single-pass, non-stationary data streams without clear task boundaries. While recent advances in continual parameter-efficient tuning (PET) of pretrained models show promise, they typically rely on multiple training epochs and explicit task cues, limiting their effectiveness in GCL scenarios. Moreover, existing methods often lack targeted design and fail to address two fundamental challenges in continual PET: how to allocate expert parameters to evolving data distributions, and how to improve their representational capacity under limited supervision. Inspired by the fruit fly’s hierarchical memory system characterized by sparse expansion and modular ensembles, we propose FlyPrompt, a brain-inspired framework that decomposes GCL into two subproblems: expert routing and expert competence improvement. FlyPrompt introduces a randomly expanded analytic router for instance-level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time. Extensive theoretical and empirical evaluations demonstrate FlyPrompt’s superior performance, achieving up to 11.23%, 12.43%, and 7.62% gains over state-of-the-art baselines on CIFAR-100, ImageNet-R, and CUB-200, respectively. Our source code is available at https://github.com/AnAppleCore/FlyGCL.


[275] Segment to Focus: Guiding Latent Action Models in the Presence of Distractors cs.LG | cs.CVPDF

Hamza Adnan, Matthew T. Jackson, Alexey Zakharov

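TL;DR: This paper proposes MaskLAM, which weights the reconstruction loss in latent action model training with visual agent segmentation masks to suppress action-correlated background noise, improving representation learning in environments with distractors.

Details

Motivation: Latent action models learn to extract action-relevant representations from raw observations alone, but they struggle to separate action-relevant features from action-correlated noise (such as background motion), which leads them to capture spurious correlations and build suboptimal latent action spaces.

Result: On continuous-control MuJoCo tasks modified with action-correlated background noise, the method achieves up to a 4x increase in accrued rewards over baselines, and linear probe evaluation shows a 3x improvement in latent action quality.

Insight: Weighting the reconstruction loss with segmentation masks from pretrained foundation models is a lightweight noise-suppression strategy that requires no architectural changes and effectively steers the model toward salient information.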

Abstract: Latent Action Models (LAMs) learn to extract action-relevant representations solely from raw observations, enabling reinforcement learning from unlabelled videos and significantly scaling available training data. However, LAMs face a critical challenge in disentangling action-relevant features from action-correlated noise (e.g., background motion). Failing to filter these distractors causes LAMs to capture spurious correlations and build sub-optimal latent action spaces. In this paper, we introduce MaskLAM – a lightweight modification to LAM training to mitigate this issue by incorporating visual agent segmentation. MaskLAM utilises segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, thereby prioritising salient information over background elements while requiring no architectural modifications. We demonstrate the effectiveness of our method on continuous-control MuJoCo tasks, modified with action-correlated background noise. Our approach yields up to a 4x increase in accrued rewards compared to standard baselines and a 3x improvement in the latent action quality, as evidenced by linear probe evaluation.
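
Sketch (not taken from the paper): a mask-weighted reconstruction loss of the kind described can be written as a weighted mean squared error in which pixels inside the agent mask get full weight and background pixels get a small one. The `background_weight` value, the 0.5 mask threshold, and the normalization by the weight sum are assumptions for illustration.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, agent_mask, background_weight=0.1):
    """Weighted MSE: agent pixels count fully, background pixels are down-weighted."""
    weights = np.where(np.asarray(agent_mask) > 0.5, 1.0, background_weight)
    sq_err = (np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)) ** 2
    return float((weights * sq_err).sum() / weights.sum())

target = np.zeros((2, 2))
mask = np.array([[1, 0], [0, 0]])            # the agent occupies the top-left pixel
err_on_agent = np.array([[1.0, 0.0], [0.0, 0.0]])
err_on_background = np.array([[0.0, 0.0], [0.0, 1.0]])
print(masked_reconstruction_loss(err_on_agent, target, mask),
      masked_reconstruction_loss(err_on_background, target, mask))
```

The same unit error costs ten times more on the agent pixel than on a background pixel, so the model has less incentive to spend latent capacity on explaining distractor motion.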


physics.data-an [Back]

[276] Comparison of Image Processing Models in Quark Gluon Jet Classification physics.data-an | cs.CV | cs.LG | hep-exPDF

Daeun Kim, Jiwon Lee, Wonjun Jeong, Hyeongwoo Noh, Giyeong Kim

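TL;DR: This paper presents a comprehensive comparison of convolutional and Transformer-based models for quark-gluon jet classification. Using jet images simulated with Pythia 8 and encoding jet substructure as a three-channel representation of particle kinematics, it evaluates CNNs, ViTs, and Swin Transformers under supervised and self-supervised learning.

Details

Motivation: To clarify how different image-processing models, in particular convolutional versus Transformer architectures, compare on quark-gluon jet classification, and to find model configurations that are both efficient and accurate.

Result: In the supervised setting, fine-tuning only the final two Transformer blocks of Swin-Tiny gives the best trade-off between efficiency and accuracy, reaching 81.4% accuracy and an AUC of 88.9%; self-supervised pretraining (MoCo) further improves feature robustness and reduces the number of trainable parameters.

Insight: The contributions include encoding jet substructure as a three-channel image representation and systematically comparing CNN and Transformer models. Viewed objectively, hierarchical attention-based models (such as Swin Transformer) show promise for jet substructure studies, and self-supervised pretraining improves generalization, which is relevant for domain transfer to real collision data.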

Abstract: We present a comprehensive comparison of convolutional and transformer-based models for distinguishing quark and gluon jets using simulated jet images from Pythia 8. By encoding jet substructure into a three-channel representation of particle kinematics, we evaluate the performance of convolutional neural networks (CNNs), Vision Transformers (ViTs), and Swin Transformers (Swin-Tiny) under both supervised and self-supervised learning setups. Our results show that fine-tuning only the final two transformer blocks of the Swin-Tiny model achieves the best trade-off between efficiency and accuracy, reaching 81.4% accuracy and an AUC (area under the ROC curve) of 88.9%. Self-supervised pretraining with Momentum Contrast (MoCo) further enhances feature robustness and reduces the number of trainable parameters. These findings highlight the potential of hierarchical attention-based models for jet substructure studies and for domain transfer to real collision data.


eess.IV [Back]

[277] SurfelSoup: Learned Point Cloud Geometry Compression With a Probabilistic SurfelTree Representation eess.IV | cs.CVPDF

Tingyu Fan, Ran Gong, Yueyu Hu, Yao Wang

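TL;DR: This paper proposes SurfelSoup, an end-to-end learned framework for point cloud geometry compression. It represents local point distributions with probabilistic surface elements (pSurfels) and selects surfel granularity adaptively through a hierarchical pSurfelTree, achieving efficient compression and smooth reconstruction.

Details

Motivation: To eliminate the redundant point-wise compression that conventional point cloud compression methods perform in smooth regions, improving compression efficiency and reconstruction quality through a surface-structured representation.

Result: Under the MPEG common test condition, geometry compression performance exceeds voxel-based baselines and the MPEG standard G-PCC-GesTM-TriSoup, reaching state-of-the-art results and producing visually smoother, more coherent surface reconstructions.

Insight: The contributions are the probabilistic surface element (pSurfel), which models local point occupancy distributions, and the adaptive tree subdivision of pSurfelTree, which selects rate-distortion-optimal surfel granularity; both ideas could carry over to other learned 3D data compression tasks.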

Abstract: This paper presents SurfelSoup, an end-to-end learned surface-based framework for point cloud geometry compression, with surface-structured primitives for representation. It proposes a probabilistic surface representation, pSurfel, which models local point occupancies using a bounded generalized Gaussian distribution. In addition, the pSurfels are organized into an octree-like hierarchy, pSurfelTree, with a Tree Decision module that adaptively terminates the tree subdivision for rate-distortion optimal Surfel granularity selection. This formulation avoids redundant point-wise compression in smooth regions and produces compact yet smooth surface reconstructions. Experimental results under the MPEG common test condition show consistent gain on geometry compression over voxel-based baselines and MPEG standard G-PCC-GesTM-TriSoup, while providing visually superior reconstructions with smooth and coherent surface structures.


[278] Advanced Geometric Correction Algorithms for 3D Medical Reconstruction: Comparison of Computed Tomography and Macroscopic Imaging eess.IV | cs.CVPDF

Tomasz Les, Tomasz Markiewicz, Malgorzata Lorent, Miroslaw Dziekiewicz, Krzysztof Siwek

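TL;DR: This paper proposes a hybrid two-stage registration framework for reconstructing 3D kidney anatomy from macroscopic slices, using CT-derived models as the geometric reference standard. The method first performs constrained global alignment and then refines local deformations with a lightweight deep-learning network, addressing the data scarcity and high distortion typical of macroscopic imaging.

Details

Motivation: Macroscopic imaging suffers from scarce data and high distortion, and fully learning-based registration methods such as VoxelMorph often fail to generalize because training diversity is limited and the deformations exceed the capture range of convolutional filters.

Result: On an original dataset of 40 kidneys, the method achieved better results than single-stage baselines.

Insight: The contribution is a hierarchical decomposition of the registration manifold: interpretable global optimization (the OCM algorithm) is combined with a data-efficient deep refinement network, integrating explicit geometric priors with the flexible learning capacity of neural networks. This yields stable optimization and plausible deformation fields even with few training examples. The approach can be extended to other soft-tissue organs reconstructed from optical or photographic cross-sections.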

Abstract: This paper introduces a hybrid two-stage registration framework for reconstructing three-dimensional (3D) kidney anatomy from macroscopic slices, using CT-derived models as the geometric reference standard. The approach addresses the data-scarcity and high-distortion challenges typical of macroscopic imaging, where fully learning-based registration (e.g., VoxelMorph) often fails to generalize due to limited training diversity and large nonrigid deformations that exceed the capture range of unconstrained convolutional filters. In the proposed pipeline, the Optimal Cross-section Matching (OCM) algorithm first performs constrained global alignment (translation, rotation, and uniform scaling) to establish anatomically consistent slice initialization. Next, a lightweight deep-learning refinement network, inspired by VoxelMorph, predicts residual local deformations between consecutive slices. The core novelty of this architecture lies in its hierarchical decomposition of the registration manifold. This hybrid OCM+DL design integrates explicit geometric priors with the flexible learning capacity of neural networks, ensuring stable optimization and plausible deformation fields even with few training examples. Experiments on an original dataset of 40 kidneys demonstrated better results compared to single-stage baselines. The pipeline maintains physical calibration via Hough-based grid detection and employs Bezier-based contour smoothing for robust meshing and volume estimation. Although validated on kidney data, the proposed framework generalizes to other soft-tissue organs reconstructed from optical or photographic cross-sections. By decoupling interpretable global optimization from data-efficient deep refinement, the method advances the precision, reproducibility, and anatomical realism of multimodal 3D reconstructions for surgical planning, morphological assessment, and medical education.


[279] Recent Advances of End-to-End Video Coding Technologies for AVS Standard Development eess.IV | cs.CV | cs.MMPDF

Xihua Sheng, Xiongzhuang Liang, Chuanbo Tang, Zhirui Zuo, Yifan Bian

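TL;DR: This paper reports recent progress on the AVS End-to-End Intelligent Video Coding Exploration Model (AVS-EEM). It details the project's development history and its core technical framework (model architectures, training strategies, and inference optimizations), and emphasizes a design principle centered on practical deployment: sustained, significant compression gains under strict computational complexity constraints.

Details

Motivation: In pursuit of higher video compression efficiency, the AVS video coding working group launched a standardization exploration of end-to-end intelligent video coding, aiming at a next-generation standard that combines strong compression performance with practical deployability.

Result: After more than two years of iterative refinement, the latest AVS-EEM model achieves better compression efficiency than the conventional AVS3 reference software while strictly following the common test conditions of conventional video coding.

Insight: The core idea is to combine the end-to-end learning paradigm with the standardization process while strictly bounding computational complexity so that deployment remains feasible. The coordinated design of model architecture, training strategy, and inference optimization drives the rapid performance gains and marks a significant step toward a deployable intelligent video coding standard.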

Abstract: Video coding standards are essential to enable the interoperability and widespread adoption of efficient video compression technologies. In pursuit of greater video compression efficiency, the AVS video coding working group launched the standardization exploration of end-to-end intelligent video coding, establishing the AVS End-to-End Intelligent Video Coding Exploration Model (AVS-EEM) project. A core design principle of AVS-EEM is its focus on practical deployment, featuring inherently low computational complexity and requiring strict adherence to the common test conditions of conventional video coding. This paper details the development history of AVS-EEM and provides a systematic introduction to its key technical framework, covering model architectures, training strategies, and inference optimizations. These innovations have collectively driven the project’s rapid performance evolution, enabling continuous and significant gains under strict complexity constraints. Through over two years of iterative refinement and collaborative effort, the coding performance of AVS-EEM has seen substantial improvement. Experimental results demonstrate that its latest model achieves superior compression efficiency compared to the conventional AVS3 reference software, marking a significant step toward a deployable intelligent video coding standard.


[280] MarkCleaner: High-Fidelity Watermark Removal via Imperceptible Micro-Geometric Perturbation eess.IV | cs.AI | cs.CR | cs.CVPDF

Xiaoxi Kong, Jieyu Yuan, Pengdi Chen, Yuanlin Zhang, Chongyi Li

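TL;DR: MarkCleaner is a framework that removes semantic watermarks through micro-geometric perturbation. It uses spatial displacements to break the watermark's phase alignment, removing the watermark while preserving the image's semantic content, and supports real-time inference.

Details

Motivation: Semantic watermarks are highly robust to conventional image-space attacks. The paper observes that micro-geometric perturbations (spatial displacements) can remove them by breaking phase alignment, which avoids the semantic drift caused by regeneration-based watermark removal.

Result: Extensive experiments show that MarkCleaner achieves superior watermark removal effectiveness and visual fidelity while enabling efficient real-time inference.

Insight: The contributions are training with micro-geometry-perturbed supervision, which lets the model separate semantic content from strict spatial alignment; a mask-guided encoder that learns explicit spatial representations; and a 2D Gaussian Splatting-based decoder that explicitly parameterizes geometric perturbations, removing the watermark while keeping semantic content intact.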

Abstract: Semantic watermarks exhibit strong robustness against conventional image-space attacks. In this work, we show that such robustness does not survive under micro-geometric perturbations: spatial displacements can remove watermarks by breaking the phase alignment. Motivated by this observation, we introduce MarkCleaner, a watermark removal framework that avoids semantic drift caused by regeneration-based watermark removal. Specifically, MarkCleaner is trained with micro-geometry-perturbed supervision, which encourages the model to separate semantic content from strict spatial alignment and enables robust reconstruction under subtle geometric displacements. The framework adopts a mask-guided encoder that learns explicit spatial representations and a 2D Gaussian Splatting-based decoder that explicitly parameterizes geometric perturbations while preserving semantic content. Extensive experiments demonstrate that MarkCleaner achieves superior performance in both watermark removal effectiveness and visual fidelity, while enabling efficient real-time inference. Our code will be made available upon acceptance.


cs.HC [Back]

[281] SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality cs.HC | cs.CL | cs.ET | cs.IRPDF

Yoonsang Kim, Devshree Jadeja, Divyansh Pradhan, Yalong Yang, Arie Kaufman

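TL;DR: SpeechLess is a wearable augmented reality assistant that uses personalized spatial memory to provide speech-based control over intent granularity. It helps users speak less in public while still letting them make their intent progressively more explicit when needed.

Details

Motivation: Voice interaction with a wearable AR assistant in public can be socially awkward, and repeating the same requests every day is tedious.

Result: In lab and in-the-wild studies, SpeechLess improved everyday information access and reduced articulation effort while maintaining acceptable usability and intent resolution accuracy across diverse everyday environments.

Insight: The contribution is binding prior interactions to multimodal personal context (space, time, activity, and referents) to form spatial memories, which are then used to infer the missing intent dimensions of under-specified queries. This lets users move dynamically between full-utterance and micro- or zero-utterance interaction.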

Abstract: Speaking aloud to a wearable AR assistant in public can be socially awkward, and re-articulating the same requests every day creates unnecessary effort. We present SpeechLess, a wearable AR assistant that introduces a speech-based intent granularity control paradigm grounded in personalized spatial memory. SpeechLess helps users “speak less” while still obtaining the information they need, and supports gradual explicitation of intent when more complex expression is required. SpeechLess binds prior interactions to multimodal personal context (space, time, activity, and referents) to form spatial memories, and leverages them to extrapolate missing intent dimensions from under-specified user queries. This enables users to dynamically adjust how explicitly they express their informational needs, from full-utterance to micro/zero-utterance interaction. We motivate our design through a week-long formative study using a commercial smart glasses platform, revealing discomfort with public voice use, frustration with repetitive speech, and hardware constraints. Building on these insights, we design SpeechLess and evaluate it through controlled lab and in-the-wild studies. Our results indicate that regulated speech-based interaction can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.


[282] Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models cs.HC | cs.CVPDF

Filip Nowicki, Hubert Marciniak, Jakub Łączkowski, Krzysztof Jassem, Tomasz Górecki

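TL;DR: This paper systematically evaluates nine vision-language models on three psychometrically validated affective image datasets, covering zero-shot emotion classification and continuous prediction of affective ratings. The models perform well on discrete emotion classification (60% to 80% accuracy) but show consistent biases in continuous rating prediction, and rater-conditioned prompting has only a limited effect.

Details

Motivation: To examine how closely vision-language models agree with human affective ratings, and to assess the feasibility and limits of using them for large-scale visual affect inference.

Result: On the IAPS, NAPS, and LAI-GAI datasets, accuracy typically ranges from 60% to 80% on six-emotion classification and from 60% to 75% on a 12-category task. Continuous rating predictions show moderate to strong alignment with human ratings (r > 0.75), but performance is weaker on arousal and the models tend to overestimate response strength.

Insight: Vision-language models capture broad affective trends but lack the nuance of validated psychometric ratings. Prompting with rater metadata has limited effect. The findings offer guidance for model selection and improvement in affective computing and mental health-related applications.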

Abstract: Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale, but it is not yet clear how closely their outputs align with human affective ratings. We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psychometrically validated affective image datasets: the International Affective Picture System, the Nencki Affective Picture System, and the Library of AI-Generated Affective Images. The models performed two tasks in the zero-shot setting: (i) top-emotion classification (selecting the strongest discrete emotion elicited by an image) and (ii) continuous prediction of human ratings on 1-7/9 Likert scales for discrete emotion categories and affective dimensions. We also evaluated the impact of rater-conditioned prompting on the LAI-GAI dataset using de-identified participant metadata. The results show good performance in discrete emotion classification, with accuracies typically ranging from 60% to 80% on six-emotion labels and from 60% to 75% on a more challenging 12-category task. The predictions of anger and surprise had the lowest accuracy in all datasets. For continuous rating prediction, models showed moderate to strong alignment with humans (r > 0.75) but also exhibited consistent biases, notably weaker performance on arousal, and a tendency to overestimate response strength. Rater-conditioned prompting resulted in only small, inconsistent changes in predictions. Overall, VLMs capture broad affective trends but lack the nuance found in validated psychological ratings, highlighting their potential and current limitations for affective computing and mental health-related applications.


[283] Toward a Machine Bertin: Why Visualization Needs Design Principles for Machine Cognition cs.HC | cs.AI | cs.CVPDF

Brian Keith-Norambuena

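TL;DR: This paper argues that the visualization field should study machine-oriented visual design as a distinct research problem, because design knowledge grounded in the psychophysics of human vision (encoding guidelines, color models, and so on) does not transfer directly to machine viewers such as vision-language models. Synthesizing evidence from VLM benchmarks, visual reasoning research, and visualization literacy studies, it shows that the human-machine perceptual divergence is qualitative, critiques the prevailing approach of bypassing vision by converting charts to data tables, proposes a conceptual distinction between human-oriented and machine-oriented visualization, and outlines a research agenda for building the missing empirical foundations.

Details

Motivation: Traditional human-centered visualization design principles, such as encoding effectiveness rankings and preattentive processing rules, do not transfer well to machine vision such as vision-language models. Machines process images through patch-based tokenization, and their perception differs qualitatively from that of humans.

Result: The paper reports no quantitative experiments or benchmark rankings of its own. Instead, it synthesizes existing evidence from VLM benchmarks, visual reasoning research, and visualization literacy studies to show that machines differ markedly from humans in encoding performance patterns, in how they process images (patch-based tokenization versus holistic perception), and in which design patterns succeed or fail.

Insight: The contribution is the explicit call to distinguish human-oriented from machine-oriented visualization design and to treat the difference as a matter of design foundations rather than engineering architecture. The paper advocates developing empirical design principles for machine cognition (a "machine Bertin") to complement existing human-centered knowledge, opening a new direction for visualization research.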

Abstract: Visualization’s design knowledge (effectiveness rankings, encoding guidelines, color models, preattentive processing rules) derives from six decades of psychophysical studies of human vision. Yet vision-language models (VLMs) increasingly consume chart images in automated analysis pipelines, and a growing body of benchmark evidence indicates that this human-centered knowledge base does not straightforwardly transfer to machine audiences. Machines exhibit different encoding performance patterns, process images through patch-based tokenization rather than holistic perception, and fail on design patterns that pose no difficulty for humans, while occasionally succeeding where humans struggle. Current approaches address this gap primarily by bypassing vision entirely, converting charts to data tables or structured text. We argue that this response forecloses a more fundamental question: what visual representations would actually serve machine cognition well? This paper makes the case that the visualization field needs to investigate machine-oriented visual design as a distinct research problem. We synthesize evidence from VLM benchmarks, visual reasoning research, and visualization literacy studies to show that the human-machine perceptual divergence is qualitative, not merely quantitative, and critically examine the prevailing bypassing approach. We propose a conceptual distinction between human-oriented and machine-oriented visualization, not as an engineering architecture but as a recognition that different audiences may require fundamentally different design foundations, and outline a research agenda for developing the empirical foundations the field currently lacks: the beginnings of a “machine Bertin” to complement the human-centered knowledge the field already possesses.


cs.SE [Back]

[284] Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression cs.SE | cs.CVPDF

Jianping Zhong, Guochang Li, Chen Zhi, Junxiao Han, Zhen Qin

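TL;DR: This paper proposes LongCodeOCR, a visual compression framework that addresses the context window limits large language models face with long code. It renders code into compressed two-dimensional image sequences for a vision-language model, avoiding the broken dependencies caused by conventional text compression. LongCodeOCR is systematically evaluated against the state-of-the-art LongCodeZip on four benchmarks spanning code summarization, code question answering, and code completion.

Details

Motivation: Window limits make long code contexts hard for large language models. Existing textual code compression methods mitigate this through selective filtering but often disrupt dependency closure, causing semantic fragmentation. The paper uses visual compression to avoid this breakage and preserve a global view.

Result: At comparable compression ratios (about 1.7x), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph, a specialized 9B vision-language model, LongCodeOCR maintains higher accuracy while operating at about 4x higher compression, and it cuts compression-stage latency from about 4.3 hours to about 1 minute.

Insight: The contribution is visual code compression as an alternative to textual compression, preserving a global view to avoid dependency breakage. The study characterizes a coverage-fidelity trade-off: visual compression retains broader context coverage to support global dependencies but hits fidelity bottlenecks on exactness-critical tasks, whereas textual compression preserves symbol-level precision at the cost of structural coverage.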

Abstract: Large Language Models (LLMs) struggle with long-context code due to window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion. Our results demonstrate that visual code compression serves as a viable alternative for tasks requiring global understanding. At comparable compression ratios ($\sim$1.7$\times$), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph (a specialized 9B VLM), LongCodeOCR maintains higher accuracy than LongCodeZip while operating at about 4$\times$ higher compression. Moreover, compared with LongCodeZip, LongCodeOCR drastically reduces compression-stage overhead (reducing latency from $\sim$4.3 hours to $\sim$1 minute at 1M tokens). Finally, our results characterize a fundamental coverage–fidelity trade-off: visual code compression retains broader context coverage to support global dependencies, yet faces fidelity bottlenecks on exactness-critical tasks; by contrast, textual code compression preserves symbol-level precision while sacrificing structural coverage.


cs.MM [Back]

[285] Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning cs.MM | cs.CV | cs.LG | cs.SDPDF

Mohamed Saleh, Zahra Ahmadi

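TL;DR: This paper proposes CMQKA, a cross-modal fusion mechanism that achieves linear complexity through efficient binary operations, and builds on it an energy-efficient multimodal fusion framework called SNNergy. The framework uses a hierarchical architecture for multi-scale fusion and event-driven spike operations for energy efficiency, and achieves state-of-the-art results on several audio-visual benchmarks.

Details

Motivation: Existing audio-visual fusion methods trade off cross-modal modeling against computational efficiency: attention-based methods are computationally expensive, while efficient strategies such as concatenation fail to extract complementary cross-modal information.

Result: On challenging audio-visual benchmarks including CREMA-D, AVE, and UrbanSound8K-AV, SNNergy significantly outperforms existing baselines and sets new state-of-the-art results while maintaining fusion effectiveness and high energy efficiency.

Insight: The core contribution is CMQKA, which uses bidirectional cross-modal Query-Key attention and learnable residual fusion to extract and fuse cross-modal features at linear complexity. SNNergy combines this mechanism with hierarchical multi-scale processing and event-driven binary spike operations, enabling scalable hierarchical cross-modal integration with practical energy efficiency.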

Abstract: Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods effectively model cross-modal relationships but incur quadratic computational complexity that prevents hierarchical, multi-scale architectures, while efficient fusion strategies rely on simplistic concatenation that fails to extract complementary cross-modal information. We introduce CMQKA, a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion previously infeasible with conventional attention. CMQKA employs bidirectional cross-modal Query-Key attention to extract complementary spatiotemporal features and uses learnable residual fusion to preserve modality-specific characteristics while enriching representations with cross-modal information. Building upon CMQKA, we present SNNergy, an energy-efficient multimodal fusion framework with a hierarchical architecture that processes inputs through progressively decreasing spatial resolutions and increasing semantic abstraction. This multi-scale fusion capability allows the framework to capture both local patterns and global context across modalities. Implemented with event-driven binary spike operations, SNNergy achieves remarkable energy efficiency while maintaining fusion effectiveness and establishing new state-of-the-art results on challenging audio-visual benchmarks, including CREMA-D, AVE, and UrbanSound8K-AV, significantly outperforming existing multimodal fusion baselines. Our framework advances multimodal fusion by introducing a scalable fusion mechanism that enables hierarchical cross-modal integration with practical energy efficiency for real-world audio-visual intelligence systems.


[286] Seeing, Hearing, and Knowing Together: Multimodal Strategies in Deepfake Videos Detection cs.MM | cs.CV | cs.HCPDF

Chen Chen, Dion Hoe-Lian Goh

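TL;DR: Through a study with 195 participants, this work investigates the multimodal strategies people use to identify deepfake videos. It finds that combinations of visual, audio, and intuition cues matter for successful identification, and it reveals cue patterns that affect detection accuracy.

Details

Motivation: As deepfake videos become harder for people to recognize, understanding the strategies humans use is key to designing effective media literacy interventions.

Result: Participants were more accurate on real videos than on deepfakes and showed lower expected calibration error for real content. Association rule mining identified combinations of visual appearance, vocal, and intuition cues that often co-occurred in successful identifications.

Insight: The study underscores the importance of multimodal approaches in human detection and points toward media literacy tools that guide effective cue use, helping people become more resilient to deceptive digital media.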

Abstract: As deepfake videos become increasingly difficult for people to recognise, understanding the strategies humans use is key to designing effective media literacy interventions. We conducted a study with 195 participants between the ages of 21 and 40, who judged real and deepfake videos, rated their confidence, and reported the cues they relied on across visual, audio, and knowledge strategies. Participants were more accurate with real videos than with deepfakes and showed lower expected calibration error for real content. Through association rule mining, we identified cue combinations that shaped performance. Visual appearance, vocal, and intuition often co-occurred for successful identifications, which highlights the importance of multimodal approaches in human detection. Our findings show which cues help or hinder detection and suggest directions for designing media literacy tools that guide effective cue use. Building on these insights can help people improve their identification skills and become more resilient to deceptive digital media.
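The abstract reports expected calibration error (ECE) without defining it. The standard binned definition can be sketched as follows; this is a minimal illustration, not the authors' analysis code. It assumes each confidence rating has been rescaled to [0, 1] and each judgment marked correct (1) or incorrect (0); the function name and the choice of 10 equal-width bins are assumptions made here.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: values in [0, 1] (assumed rescaled from the study's rating scale).
    correct: 1 if the judgment was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of judgments in the bin
    return float(ece)
```

A lower value means stated confidence tracks actual accuracy more closely, which is the sense in which participants were better calibrated on real videos.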


eess.SP [Back]

[287] Real-Time 2D LiDAR Object Detection Using Three-Frame RGB Scan Encoding eess.SP | cs.CV | cs.LG | cs.ROPDF

Soheil Behnam Roudsari, Alexandre S. Brandão, Felipe N. Martins

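TL;DR: This paper proposes a camera-free 2D LiDAR object detection method for indoor service robots. It encodes short-term temporal context by stacking three consecutive LiDAR scans as RGB channels, producing a compact YOLOv8n input that needs no occupancy grid while preserving angular structure and motion cues. Evaluated in the Webots simulator on 160 randomized indoor scenarios with strict scenario-level holdout, the method achieves 98.4% mAP@0.5 (0.778 mAP@0.5:0.95) on four object classes, with 94.9% precision and 94.7% recall. On a Raspberry Pi 5 it runs in real time, with a mean post-warm-up end-to-end latency of 47.8 ms per frame.

Details

Motivation: Indoor service robots need perception that is robust, more privacy-friendly than RGB video, and feasible on embedded hardware. The goal is accurate, real-time object detection using LiDAR alone.

Result: Across 160 randomized indoor scenarios simulated in Webots (strict scenario-level holdout), detection on four object classes reaches 98.4% mAP@0.5 (0.778 mAP@0.5:0.95), with 94.9% precision and 94.7% recall. The method runs in real time on a Raspberry Pi 5 at 47.8 ms per frame end to end, substantially lower than the latency reported for an occupancy-grid LiDAR-YOLO pipeline on the same platform.

Insight: The contribution is encoding three consecutive LiDAR scans as RGB channels for YOLOv8n, a lightweight way to capture temporal context. It avoids computationally expensive occupancy-grid construction and directly preserves the angular and motion information of the raw scans, enabling accurate, low-latency LiDAR-only detection on embedded devices.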

Abstract: Indoor service robots need perception that is robust, more privacy-friendly than RGB video, and feasible on embedded hardware. We present a camera-free 2D LiDAR object detection pipeline that encodes short-term temporal context by stacking three consecutive scans as RGB channels, yielding a compact YOLOv8n input without occupancy-grid construction while preserving angular structure and motion cues. Evaluated in Webots across 160 randomized indoor scenarios with strict scenario-level holdout, the method achieves 98.4% mAP@0.5 (0.778 mAP@0.5:0.95) with 94.9% precision and 94.7% recall on four object classes. On a Raspberry Pi 5, it runs in real time with a mean post-warm-up end-to-end latency of 47.8ms per frame, including scan encoding and postprocessing. Relative to a closely related occupancy-grid LiDAR-YOLO pipeline reported on the same platform, the proposed representation is associated with substantially lower reported end-to-end latency. Although results are simulation-based, they suggest that lightweight temporal encoding can enable accurate and real-time LiDAR-only detection for embedded indoor robotics without capturing RGB appearance.
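The three-frame encoding can be illustrated with a short sketch. The abstract does not say how a single scan is turned into an image, so the range-angle rasterization below (columns are beam indices, rows are quantized ranges) is an assumption made for illustration, as are the function names and the `max_range` and `height` parameters. Only the final step, stacking three consecutive scans as the R, G, and B channels, comes from the abstract.

```python
import numpy as np

def scan_to_polar_image(ranges, max_range=10.0, height=64):
    """Rasterize one 2D LiDAR scan into a (height, n_beams) image.

    Hypothetical encoding: column = beam index (angle), row = quantized range.
    Invalid returns (inf, NaN, non-positive, beyond max_range) stay black.
    """
    ranges = np.asarray(ranges, dtype=np.float32)
    image = np.zeros((height, ranges.shape[0]), dtype=np.uint8)
    valid = np.isfinite(ranges) & (ranges > 0) & (ranges <= max_range)
    rows = (ranges[valid] / max_range * (height - 1)).astype(int)
    rows = np.clip(rows, 0, height - 1)
    cols = np.nonzero(valid)[0]
    image[rows, cols] = 255
    return image

def encode_three_frames(scan_t2, scan_t1, scan_t0, **kwargs):
    """Stack three consecutive scans (oldest first) as R, G, B channels."""
    channels = [scan_to_polar_image(s, **kwargs) for s in (scan_t2, scan_t1, scan_t0)]
    return np.stack(channels, axis=-1)  # shape (height, n_beams, 3)
```

With this layout a static obstacle lights up the same pixel in all three channels and appears white, while a moving one leaves differently colored traces, which is one way motion cues could survive in a single detector input.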


cs.AI [Back]

[288] Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs cs.AI | cs.CLPDF

Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li

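TL;DR: Large language models have saturated mathematical reasoning benchmarks because the datasets lean on template-based computation and shallow arithmetic decomposition. This paper introduces ReasoningMath-Plus, a benchmark of 150 carefully designed problems that evaluate structural reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. It also introduces HCRS, a deterministic step-level scoring function, and a process reward model trained on annotated reasoning traces, for fine-grained process-level evaluation. Experiments show that final-answer-only metrics overestimate reasoning robustness.

Details

Motivation: LLM accuracy on existing mathematical reasoning benchmarks is near saturation, mainly because the datasets rely heavily on template-based computation and shallow arithmetic decomposition and do not adequately test structural reasoning (multi-constraint coordination, constructive logical synthesis, spatial inference). A new process-aware benchmark is needed to diagnose genuine reasoning ability.

Result: On ReasoningMath-Plus, leading models reach a final-answer accuracy of up to 5.8/10, but HCRS-based holistic scores are substantially lower (average 4.36/10, best 5.14/10), showing that answer-only metrics overestimate reasoning robustness.

Insight: The contributions are a benchmark that emphasizes structural reasoning (interacting constraints, constructive solution formation, and non-trivial structural insight) together with the HCRS step-level scoring function and a process reward model. Together they enable fine-grained, process-level evaluation and offer a way to diagnose model ability beyond the final answer.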

Abstract: Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.


[289] Foundation CAN LM: A Pretrained Language Model For Automotive CAN Data cs.AI | cs.CLPDF

Akiharu Esashi, Pawissanutt Lertpongrujikorn, Justin Makino, Yuibi Fujimoto, Mohsen Amini Salehi

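TL;DR: This paper proposes Foundation CAN LM, a pretrained language model for automotive CAN bus data. It treats CAN data as a language, pretrains on large-scale unlabeled decoded signals, and fine-tunes on a range of auto insurance tasks, achieving generalization across tasks.

Details

Motivation: Existing approaches typically train isolated, task-specific models on raw CAN data, with no shared representation learning, which limits cross-task generalization. Inspired by the foundation model paradigm in NLP and CV, the paper aims to build a unified foundation model for CAN data that supports multi-task learning and generalization.

Result: Experiments show that a single pretrained CAN model adapts effectively to diverse predictive tasks, validating the foundation model paradigm for CAN data and opening a new direction for generalizable representation learning in automotive AI.

Insight: The contributions are modeling CAN data as a language, a unified tokenization scheme for mixed discrete-continuous signals, and handling of temporal complexity and trip-specific variability. The work carries the pretrain-then-fine-tune paradigm of NLP and CV over to the automotive domain.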

Abstract: The Controller Area Network (CAN) bus provides a rich source of vehicular signals increasingly leveraged for applications in automotive and auto insurance domains, including collision detection, predictive maintenance, and driver risk modeling. Despite this potential, existing pipelines largely train isolated task-specific models on raw CAN data, with only limited efforts exploring decoded signals. Such fragmentation prevents shared representation learning and limits cross-task generalization. By contrast, natural language processing (NLP) and computer vision (CV) have been transformed by the foundation model paradigm: large-scale pretraining followed by task-specific adaptation. In this work, we introduce the foundation CAN model that demonstrates multi-objective downstream generalization using a single pretrained backbone. Our approach treats CAN data as a language: we pretrain on large-scale, unlabeled decoded CAN signals and fine-tune across heterogeneous auto insurance tasks. To enable this, we propose a unified tokenization scheme for mixed discrete-continuous signals and address challenges of temporal complexity and trip-specific variability. Our results show that one pretrained CAN model can adapt effectively to diverse predictive tasks, validating that the foundation modeling paradigm, proven in NLP and CV, also holds for CAN data. This establishes a new direction for generalizable representation learning in automotive AI.


[290] Beyond Output Critique: Self-Correction via Task Distillation cs.AI | cs.CLPDF

Hossein A. Rahmani, Mengting Wan, Pei Zhou, Longqi Yang, Nick Craswell

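TL;DR: This paper proposes SELF-THOUGHT, a framework for improving the self-correction ability of large language models (LLMs). Before refining a solution, it adds an intermediate task abstraction step: the task is first distilled into a structured template that captures key variables, constraints, and problem structure, and that template then guides solution instantiation. The templates transfer across model scales, helping smaller LLMs self-correct more reliably without heavy fine-tuning or external verifiers.

Details

Motivation: Most existing LLM self-correction methods operate at the level of output critique, patching surface errors while failing to correct deeper reasoning flaws. The paper addresses this by adding a task abstraction step that makes self-correction deeper and more reliable.

Result: Experiments across diverse reasoning tasks show that SELF-THOUGHT improves accuracy, robustness, and generalization for both large and small LLMs.

Insight: The core idea is to split self-correction into two stages, task abstraction and solution instantiation, and to use the abstracted task template as structured guidance that can be shared across models. This offers a path toward more reliable and scalable self-correcting language systems. In particular, reusing templates distilled by larger models to strengthen smaller ones is an idea worth borrowing.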

Abstract: Large language models (LLMs) have shown promising self-correction abilities, where iterative refinement improves the quality of generated responses. However, most existing approaches operate at the level of output critique, patching surface errors while often failing to correct deeper reasoning flaws. We propose SELF-THOUGHT, a framework that introduces an intermediate step of task abstraction before solution refinement. Given an input and an initial response, the model first distills the task into a structured template that captures key variables, constraints, and problem structure. This abstraction then guides solution instantiation, grounding subsequent responses in a clearer understanding of the task and reducing error propagation. Crucially, we show that these abstractions can be transferred across models: templates generated by larger models can serve as structured guides for smaller LLMs, which typically struggle with intrinsic self-correction. By reusing distilled task structures, smaller models achieve more reliable refinements without heavy fine-tuning or reliance on external verifiers. Experiments across diverse reasoning tasks demonstrate that SELF-THOUGHT improves accuracy, robustness, and generalization for both large and small models, offering a scalable path toward more reliable self-correcting language systems.


[291] Error Taxonomy-Guided Prompt Optimization cs.AI | cs.CL | cs.LGPDF

Mayank Singh, Vikas Yadav, Eduardo Blanco

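TL;DR: This paper proposes Error Taxonomy-Guided Prompt Optimization (ETGPO), a top-down automatic prompt optimization algorithm. It looks at the global failure landscape by collecting model errors, categorizing them into a taxonomy, and adding guidance to the prompt that targets the most frequent failure modes. Across benchmarks in mathematics, question answering, and logical reasoning, ETGPO matches or exceeds the accuracy of state-of-the-art methods while using roughly one third of the optimization-phase tokens and evaluation budget.

Details

Motivation: Existing automatic prompt optimization methods mostly rely on trial and error or on bottom-up iterative adjustment driven by feedback from individual problems, which is computationally expensive and can lose the global perspective. The paper uses an error taxonomy to guide optimization more efficiently and globally.

Result: Across benchmarks in mathematics, question answering, and logical reasoning, ETGPO achieves accuracy comparable to or better than state-of-the-art methods while reducing optimization-phase token usage and evaluation budget by about two thirds.

Insight: The main contribution is a top-down, taxonomy-guided optimization strategy that starts from global failure modes rather than iterating on individual problems, which improves efficiency and lowers compute cost. Systematically categorizing errors and strengthening the prompt against them is a reusable, structured approach that may also improve the interpretability and generalization of prompt optimization.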

Abstract: Automatic Prompt Optimization (APO) is a powerful approach for extracting performance from large language models without modifying their weights. Many existing methods rely on trial-and-error, testing different prompts or in-context examples until a good configuration emerges, often consuming substantial compute. Recently, natural language feedback derived from execution logs has shown promise as a way to identify how prompts can be improved. However, most prior approaches operate in a bottom-up manner, iteratively adjusting the prompt based on feedback from individual problems, which can cause them to lose the global perspective. In this work, we propose Error Taxonomy-Guided Prompt Optimization (ETGPO), a prompt optimization algorithm that adopts a top-down approach. ETGPO focuses on the global failure landscape by collecting model errors, categorizing them into a taxonomy, and augmenting the prompt with guidance targeting the most frequent failure modes. Across multiple benchmarks spanning mathematics, question answering, and logical reasoning, ETGPO achieves accuracy that is comparable to or better than state-of-the-art methods, while requiring roughly one third of the optimization-phase token usage and evaluation budget.
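The top-down loop the abstract describes (collect errors, categorize them, target the most frequent failure modes) can be sketched in a few lines. This is an illustrative simplification rather than the ETGPO implementation: it assumes each collected error already carries a category label and that a guidance sentence exists per category, whereas the paper presumably derives both with a language model. The function names, the `top_k` parameter, and the prompt wording are hypothetical.

```python
from collections import Counter

def build_guidance(error_labels, guidance_by_category, top_k=3):
    """Pick guidance for the top_k most frequent error categories.

    error_labels: one category label per observed model error (assumed given).
    guidance_by_category: hypothetical mapping from category to an instruction.
    """
    counts = Counter(error_labels)
    top_categories = [category for category, _ in counts.most_common(top_k)]
    return [guidance_by_category[c] for c in top_categories if c in guidance_by_category]

def augment_prompt(base_prompt, guidance):
    """Append the selected guidance to the base prompt as a bulleted list."""
    if not guidance:
        return base_prompt
    bullets = "\n".join(f"- {line}" for line in guidance)
    return f"{base_prompt}\n\nCommon pitfalls to avoid:\n{bullets}"
```

Because the taxonomy is built once from the whole error set, the prompt is revised in a single pass instead of once per failing example, which is consistent with the reduced token and evaluation budget the abstract reports.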


[292] HalluHard: A Hard Multi-Turn Hallucination Benchmark cs.AI | cs.CLPDF

Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, Maksym Andriushchenko

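TL;DR: This paper introduces HalluHard, a challenging multi-turn hallucination benchmark with 950 seed questions across four high-stakes domains: law, research, medicine, and coding. It measures groundedness by requiring inline citations for factual assertions. The paper also designs an evidence retrieval and judging pipeline based on web search for reliable hallucination evaluation in open-ended settings. Experiments show that even with web search, frontier models such as Opus-4.5 still hallucinate at a rate of about 30%, and that hallucination behavior depends on model capacity, turn position, reasoning, and the type of knowledge required.

Details

Motivation: In multi-turn dialogue, as context grows and early errors accumulate, large language models produce plausible-sounding but ungrounded claims. In high-stakes domains especially, stricter benchmarks and evaluation methods are needed to measure model factuality.

Result: On HalluHard, the strongest configuration (Opus-4.5 with web search) still has a hallucination rate of about 30%, and content-grounding errors persist at high rates, indicating that hallucination in multi-turn dialogue remains a serious problem for current models.

Insight: The contributions are a multi-turn hallucination benchmark built on inline citations and an automated judging pipeline that retrieves, filters, and parses full-text sources, including PDFs. The study highlights the importance of evidence retrieval in hallucination evaluation and shows how model capacity, dialogue structure, and knowledge type affect hallucination, giving direction for future model improvements.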

Abstract: Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce $\textbf{HalluHard}$, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search ($\approx 30\%$ for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.


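[293] Discovering Process-Outcome Credit in Multi-Step LLM Reasoning cs.AI | cs.CL

Xiangwei Wang, Wei Wang, Ken Chen, Nanduni Nimalsiri, Saman Halgamuge

TL;DR: This paper proposes a novel reinforcement learning framework that addresses reward sparsity and inefficient credit assignment in multi-step LLM reasoning. The framework provides continuous reward signals through a Step-wise Marginal Information Gain (MIG) mechanism, uses a Decoupled Masking Strategy (DMS) to apply process-oriented rewards to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion, and adds a Dual-Gated supervised fine-tuning (SFT) objective to stabilize training.

Details

Motivation: Standard outcome-based reinforcement learning methods for improving LLM reasoning often suffer from sparse reward signals and inefficient credit assignment. The paper aims to guide the model's reasoning process more effectively by providing finer-grained, continuous reward signals.

Result: Extensive experiments on textual (e.g., MATH) and multimodal (e.g., Super-CLEVR) benchmarks show that the method consistently outperforms baselines such as GRPO in both sample efficiency and final accuracy, and that it exhibits superior out-of-distribution robustness and promising zero-shot transfer.

Insight: The main contributions are: 1) the Step-wise Marginal Information Gain (MIG) mechanism, which uses a Monotonic Historical Watermark to quantify the intrinsic value of reasoning steps and filter out training noise; 2) a masking strategy that decouples process and outcome rewards for finer-grained credit assignment; 3) a Dual-Gated SFT objective that stabilizes training. Together, these mechanisms provide a more effective reinforcement learning signal for multi-step reasoning.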

Abstract: Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs), yet standard outcome-based approaches often suffer from reward sparsity and inefficient credit assignment. In this paper, we propose a novel framework designed to provide continuous reward signals, which introduces a Step-wise Marginal Information Gain (MIG) mechanism that quantifies the intrinsic value of reasoning steps against a Monotonic Historical Watermark, effectively filtering out training noise. To ensure disentangled credit distribution, we implement a Decoupled Masking Strategy, applying process-oriented rewards specifically to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion. Additionally, we incorporate a Dual-Gated SFT objective to stabilize training with high-quality structural and factual signals. Extensive experiments across textual and multi-modal benchmarks (e.g., MATH, Super-CLEVR) demonstrate that our approach consistently outperforms baselines such as GRPO in both sample efficiency and final accuracy. Furthermore, our model exhibits superior out-of-distribution robustness, demonstrating promising zero-shot transfer capabilities to unseen and challenging reasoning tasks.


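[294] ASP-Bench: From Natural Language to Logic Programs cs.AI | cs.CL | cs.LO

Stefan Szeider

TL;DR: This paper introduces ASP-Bench, a benchmark for evaluating the automatic translation of natural-language problems into Answer Set Programming (ASP) logic programs. It comprises 128 problem instances covering a range of ASP features. The paper also tests an agentic approach based on the ReAct framework, shows that feedback-driven iterative refinement can reliably model natural-language problems as ASP programs, and analyzes what determines modeling hardness.

Details

Motivation: Automatically translating natural-language specifications into logic programs, ASP in particular, is a challenging task in neurosymbolic engineering, and a systematic evaluation benchmark for it has been lacking.

Result: An agentic approach based on the ReAct framework achieves full saturation on ASP-Bench, indicating that iterative refinement with solver feedback is a reliable and robust way to model natural-language problems in ASP.

Insight: The contribution is a multidimensional benchmark that systematically covers ASP features (such as choice rules, aggregates, and optimization), together with an agentic, feedback-driven iterative refinement method. Together they offer new insights and an evaluation tool for analyzing the difficulty of translating natural language into ASP.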

Abstract: Automating the translation of natural-language specifications into logic programs is a challenging task that affects neurosymbolic engineering. We present ASP-Bench, a benchmark comprising 128 natural language problem instances, 64 base problems with easy and hard variants. It evaluates systems that translate natural-language problems into Answer Set Programs (ASPs), a prominent form of logic programming. It provides systematic coverage of ASP features, including choice rules, aggregates, and optimization. Each problem includes reference validators that check whether solutions satisfy the problem specification. We characterize problems along seven largely independent reasoning aspects (optimization, temporal reasoning, default logic, resource allocation, recursion, spatial reasoning, and quantitative complexity), providing a multidimensional view of modeling difficulty. We test the benchmark using an agentic approach based on the ReAct (Reason and Act) framework, which achieves full saturation, demonstrating that feedback-driven iterative refinement with solver feedback provides a reliable and robust approach for modeling natural language in ASP. Our analysis across multiple agent runs enables us to gain insights into what determines a problem’s modeling hardness.


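[295] MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety cs.AI | cs.CL | cs.LG | cs.MA

Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma

TL;DR: This paper proposes MAGIC, a framework that uses multi-turn multi-agent reinforcement learning to model LLM safety alignment as an asymmetric adversarial game between an attacker and a defender. The two policies co-evolve, which improves the model's robustness to unseen attack patterns.

Details

Motivation: Existing LLM safety defenses rely on static, pre-collected data distributions and struggle to keep up with continually evolving adversarial attacks, so a dynamic, adaptive safety alignment mechanism is needed.

Result: Extensive experiments validate the framework's effectiveness: it achieves higher defense success rates while preserving the helpfulness of the model.

Insight: The contribution is formalizing safety alignment as a co-evolving adversarial game. Through iterative RL training the attacker evolves novel combinatorial strategies, which continually expose long-tail vulnerabilities and drive the defender to generalize to unseen attack patterns. On the theoretical side, the paper offers insights into a more robust game equilibrium and into safety guarantees.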

Abstract: Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected data distributions. In this paper, we introduce MAGIC, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co-evolution, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.


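[296] Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient cs.AI | cs.CL | cs.LG

Changming Li, Kaixing Zhang, Haoyun Xu, Yingdong Shi, Zheng Zhang

TL;DR: This paper proposes Integrated Policy Gradient (IPG), a new framework for interpreting and controlling the reasoning mechanisms of large language models (LLMs). It propagates compound outcome-based signals (such as post-reasoning accuracy) backward along the model's inference trajectory and attributes reasoning behaviors to the model's internal components, which allows more precise localization and reliable modulation of reasoning mechanisms.

Details

Motivation: Existing interpretability methods for LLM reasoning either identify components (e.g., neurons) correlated with specific textual patterns or rely on human-annotated contrastive pairs to derive control vectors. They struggle to localize complex reasoning mechanisms precisely or to capture the sequential influence of the model's internal workings on its reasoning outputs.

Result: Empirical evaluations show that the method achieves more precise localization across diverse reasoning models and can reliably modulate reasoning behaviors (e.g., reasoning capability, reasoning strength).

Insight: The contribution is the IPG framework, built on outcome-oriented and sequential-influence-aware principles, which attributes and modulates reasoning behaviors by propagating compound outcome signals. This offers a new way to understand and control complex, long-range reasoning in LLMs.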

Abstract: Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet, the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with special textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or capture the sequential influence of a model's internal workings on its reasoning outputs. In this paper, building on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that contribute sequentially to reasoning behavior, where outcomes accumulate through long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to a model's inner components by propagating compound outcome-based signals, such as post-reasoning accuracy, backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.


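[297] Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts cs.AI | cs.CL

Aiden Yiliu Li, Xinyue Hao, Shilong Liu, Mengdi Wang

TL;DR: This paper presents Avenir-Web, a multimodal web agent that imitates human experience. It combines a Mixture of Grounding Experts, Experience-Imitation Planning, and a task-tracking checklist with adaptive memory to address the inaccurate element grounding, missing site-specific procedural knowledge, and unstable task tracking that existing agents show on long-horizon tasks over complex, dynamic web pages. It sets a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment.

Details

Motivation: Existing autonomous web agents based on multimodal large language models have limitations on long-horizon tasks over complex, dynamic web interfaces: inaccurate element grounding, a lack of site-specific procedural knowledge, and unstable long-term task tracking and memory, especially when handling complex Document Object Model structures.

Result: On Online-Mind2Web, a benchmark of live, user-centered web tasks, Avenir-Web significantly surpasses prior open-source agents and reaches performance parity with top-tier proprietary models, establishing a new open-source state of the art for reliable web agents on live websites.

Insight: The claimed contribution is the integration of a Mixture of Grounding Experts, Experience-Imitation Planning, and a task-tracking checklist with adaptive memory. Viewed objectively, the core idea is to combine several complementary mechanisms (a mixture of experts, procedural priors, and adaptive memory) systematically in one framework that imitates the combined experience humans bring to web interaction, which improves robustness and reliability in complex, dynamic environments.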

Abstract: Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks. Our results demonstrate that Avenir-Web significantly surpasses prior open-source agents and attains performance parity with top-tier proprietary models, thereby establishing a new open-source state of the art for reliable web agents on live websites.


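[298] From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models cs.AI | cs.CV | cs.LG

Mohit Jiwatode, Alexander Dockhorn, Bodo Rosenhahn

TL;DR: This paper studies how well large language models can infer causal game mechanics from gameplay traces. It generates VGDL rules for nine representative games from the GVGAI framework using two approaches (direct code generation and a two-stage method based on a structural causal model) and compares performance across prompting strategies and controlled context regimes.

Details

Motivation: Deep learning agents can achieve high performance in complex game domains yet usually do not understand the underlying causal game mechanics. The paper explores causal induction, the ability to infer governing laws from observational data.

Result: The structural-causal-model-based approach achieves preference win rates of up to 81% in blind evaluations, produces VGDL descriptions closer to the ground truth more often than direct generation, and yields fewer logically inconsistent rules.

Insight: The contribution is formalizing causal induction as a translation from gameplay traces to VGDL rules and introducing an intermediate representation based on a structural causal model to improve accuracy and logical consistency. Two reusable ideas are selecting representative tasks with semantic embeddings and clustering to reduce redundancy, and varying the amount of context to evaluate model capability systematically.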

Abstract: Deep learning agents can achieve high performance in complex game domains, often without understanding the underlying causal game mechanics. To address this, we investigate Causal Induction, the ability to infer governing laws from observational data, by tasking Large Language Models (LLMs) with reverse-engineering Video Game Description Language (VGDL) rules from gameplay traces. To reduce redundancy, we select nine representative games from the General Video Game AI (GVGAI) framework using semantic embeddings and clustering. We compare two approaches to VGDL generation: direct code generation from observations, and a two-stage method that first infers a structural causal model (SCM) and then translates it into VGDL. Both approaches are evaluated across multiple prompting strategies and controlled context regimes, varying the amount and form of information provided to the model, from just raw gameplay observations to partial VGDL specifications. Results show that the SCM-based approach more often produces VGDL descriptions closer to the ground truth than direct generation, achieving preference win rates of up to 81% in blind evaluations and yielding fewer logically inconsistent rules. These learned SCMs can be used for downstream use cases such as causal reinforcement learning, interpretable agents, and procedurally generating novel but logically consistent games.


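[299] MACD: Model-Aware Contrastive Decoding via Counterfactual Data cs.AI | cs.CV | cs.LG

Qixin Xiao, Kun Zhou

TL;DR: This paper proposes MACD, an inference strategy that combines model-guided counterfactual data construction with contrastive decoding to reduce hallucination in video language models. The method uses the model's own feedback to identify the key object regions that drive hallucination, generates targeted counterfactual inputs, and enforces evidence-grounded token selection during decoding.

Details

Motivation: Existing decoding methods such as contrastive decoding rely on random perturbations to build contrastive data for mitigating hallucination, which makes it hard to control the visual cues that drive hallucination or to align the data with the model's weaknesses.

Result: On EventHallusion, MVBench, Perception-test, and Video-MME, MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse video language models, including the Qwen and InternVL families. It is especially effective in challenging scenarios involving small, occluded, or co-occurring objects.

Insight: The contribution is model-aware counterfactual data construction driven by the model's own feedback. It intervenes at the object level rather than through arbitrary frame or temporal modifications, which aligns decoding more effectively with the visual evidence.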

Abstract: Video language models (Video-LLMs) are prone to hallucinations, often generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing decoding methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for mitigating hallucination patterns. However, random perturbations make it hard to control the visual cues that drive hallucination or to align the contrastive data with model weaknesses. We propose Model-aware Counterfactual Data based Contrastive Decoding (MACD), a new inference strategy that combines model-guided counterfactual construction with decoding. Our approach uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted counterfactual inputs at the object level rather than arbitrary frame or temporal modifications. These model-aware counterfactual data are then integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL families. The method is especially effective in challenging scenarios involving small, occluded, or co-occurring objects. Our code and data will be publicly released.
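The abstract does not give MACD's scoring rule, so the sketch below shows only the generic contrastive decoding step it builds on, in a form commonly used in the visual contrastive decoding literature: amplify the log-probabilities from the original input, subtract those from the contrastive (here, counterfactual) input, and restrict the choice to tokens that are plausible under the original input. The function names, the `alpha` and `beta` parameters, and the toy logits are assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_next_token(orig_logits, cf_logits, alpha=1.0, beta=0.1):
    """Pick the token maximizing (1 + alpha) * log p_orig - alpha * log p_cf.

    Only tokens whose original probability is at least beta times the
    maximum original probability are considered (plausibility cutoff).
    """
    lp_orig = log_softmax(orig_logits)
    lp_cf = log_softmax(cf_logits)
    cutoff = max(lp_orig) + math.log(beta)
    best_index, best_score = None, -math.inf
    for i, (lo, lc) in enumerate(zip(lp_orig, lp_cf)):
        if lo < cutoff:
            continue  # implausible under the original input
        score = (1 + alpha) * lo - alpha * lc
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy vocabulary: index 0 = "dog", 1 = "cat", 2 = "car".
# Hypothetical case: the counterfactual input removes the cat region, so
# "cat" drops sharply there while the prior-driven "dog" stays likely.
orig = [2.0, 1.9, -3.0]
counterfactual = [2.0, 0.0, -3.0]
chosen = contrastive_next_token(orig, counterfactual)
```

With `alpha=0` the rule reduces to greedy decoding on the original input, which makes the effect of the counterfactual term easy to isolate.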


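[300] MentisOculi: Revealing the Limits of Reasoning with Mental Imagery cs.AI | cs.CV | cs.LG

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan

TL;DR: This paper presents MentisOculi, a benchmark suite for evaluating whether frontier models, in particular unified multimodal models (UMMs), can reason with intermediate visualizations in the way humans use mental imagery. It finds that visual strategies, whether based on latent tokens or explicitly generated images, generally fail to improve performance on complex multi-step reasoning tasks, which reveals a fundamental limitation in how current models use visual aids for reasoning.

Details

Motivation: As frontier models move from multimodal large language models (MLLMs) that only ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation, researchers want to know whether models can use intermediate visual representations, akin to human mental imagery, to aid reasoning. The paper evaluates and probes the ability of models to form, maintain, and manipulate visual representations to solve goal-oriented tasks.

Result: Evaluation on MentisOculi shows that visual strategies, from latent tokens to explicitly generated images, generally fail to improve performance on multi-step reasoning problems. A dedicated analysis of UMMs exposes a critical limitation: although they have the textual reasoning capacity to solve the task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to make effective use of even ground-truth visualizations.

Insight: The core contribution is MentisOculi itself, a stratified, procedural benchmark suite built to challenge and evaluate reasoning with visual thoughts systematically. Viewed objectively, the key finding is that although visually aided reasoning is intuitively appealing, the architectures or training paradigms of current state-of-the-art UMMs cannot yet use self-generated or externally provided visual information to improve complex reasoning, which marks an important direction for future model improvement.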

Abstract: Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.


cs.CY

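[301] Happy Young Women, Grumpy Old Men? Emotion-Driven Demographic Biases in Synthetic Face Generation cs.CY | cs.AI | cs.CV

Mengting Wei, Aditya Gulati, Guoying Zhao, Nuria Oliver

TL;DR: This paper systematically audits bias in synthetic face generation across eight state-of-the-art text-to-image (T2I) models, four developed by Western organizations and four by Chinese institutions. It focuses on how emotional prompts affect demographic representation (gender, race, age, and attractiveness) and on how outputs differ between models trained in different cultural and linguistic contexts. All models show persistent demographic and emotion-conditioned biases, regardless of country of origin.

Details

Motivation: T2I models exhibit biases in synthetic face generation, but the effect of emotional prompts on demographic representation and the consistency of outputs across models from different cultures have received little study.

Result: Using state-of-the-art facial analysis algorithms and information-theoretic bias metrics (Kullback-Leibler and Jensen-Shannon divergences), the study finds that all models deviate from global population statistics and show significant demographic and emotion-conditioned biases.

Insight: The contribution is a systematic comparison of T2I models developed in different cultural contexts and a first quantitative analysis of how emotional prompts amplify demographic bias. Viewed objectively, the study provides a methodological framework for assessing the fairness and cross-cultural consistency of generative models and is relevant to the development of transparent generative systems.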

Abstract: Synthetic face generation has rapidly advanced with the emergence of text-to-image (T2I) models and multimodal large language models, enabling high-fidelity image production from natural-language prompts. Despite the widespread adoption of these tools, the biases, representational quality, and cross-cultural consistency of these models remain poorly understood. Prior research on biases in the synthetic generation of human faces has examined demographic biases, yet there is little research on how emotional prompts influence demographic representation and how models trained in different cultural and linguistic contexts vary in their output distributions. We present a systematic audit of eight state-of-the-art T2I models comprising four models developed by Western organizations and four developed by Chinese institutions, all prompted identically. Using state-of-the-art facial analysis algorithms, we estimate the gender, race, age, and attractiveness levels in the generated faces. To measure the deviations from global population statistics, we apply information-theoretic bias metrics including Kullback-Leibler and Jensen-Shannon divergences. Our findings reveal persistent demographic and emotion-conditioned biases in all models regardless of their country of origin. We discuss implications for fairness, socio-technical harms, governance, and the development of transparent generative systems.
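The two bias metrics named in this abstract have standard definitions, sketched below for discrete distributions. The example distributions are made-up placeholders, not figures from the paper, and the paper's exact estimation procedure (binning, smoothing, choice of logarithm base) is not stated in the abstract.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for discrete distributions given as lists.

    Terms with p[i] == 0 contribute zero; q[i] must be positive
    wherever p[i] is positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical share of three age groups in generated faces versus a
# reference population (placeholder numbers for illustration only).
generated = [0.7, 0.2, 0.1]
reference = [0.4, 0.3, 0.3]

kl_bits = kl_divergence(generated, reference)
js_bits = js_divergence(generated, reference)
```

KL divergence is asymmetric and becomes infinite when the reference assigns zero probability to an observed group, whereas the Jensen-Shannon divergence stays finite and symmetric, which is one reason to report both.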


cs.RO

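[302] MapDream: Task-Driven Map Learning for Vision-Language Navigation cs.RO | cs.AI | cs.CV

Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen

TL;DR: MapDream is a task-driven map learning framework for vision-language navigation. It models map construction as autoregressive bird's-eye-view (BEV) image synthesis and jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that keeps only the information critical for navigation.

Details

Motivation: Existing vision-language navigation methods usually rely on hand-crafted maps built independently of the navigation policy. The authors argue that a map should be a learned representation shaped directly by navigation objectives rather than an exhaustive reconstruction.

Result: The method achieves state-of-the-art monocular performance on the R2R-CE and RxR-CE benchmarks.

Insight: The core contribution is treating map construction as an autoregressive BEV synthesis task within a map-in-the-loop framework. Supervised pre-training followed by reinforcement fine-tuning enables end-to-end joint optimization and yields a compact, task-driven representation.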

Abstract: Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird’s-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.


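[303] APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation cs.RO | cs.CV

Daoxuan Zhang, Ping Chen, Xiaobo Xia, Xiu Su, Ruichen Zhen

TL;DR: This paper proposes APEX, a novel hierarchical agent for Aerial Object Goal Navigation. Its decoupled, memory-based, asynchronous exploration architecture uses a vision-language model (VLM) to build dynamic spatio-semantic maps and combines them with a reinforcement learning policy and an open-vocabulary detector to explore complex aerial environments efficiently and locate the target.

Details

Motivation: Existing methods struggle to memorize complex spatial representations in aerial environments and to make reliable, interpretable action decisions, and they explore and gather information inefficiently. APEX targets these challenges so that a UAV can explore autonomously and identify a target using only visual perception and a language description.

Result: On the challenging UAV-ON benchmark, APEX outperforms the previous state of the art, improving success rate (SR) by 4.2% and success weighted by path length (SPL) by 2.8%.

Insight: The main contribution is a modular hierarchical architecture that runs asynchronously and in parallel: 1) it uses the zero-shot capability of a VLM to dynamically build high-resolution, interpretable 3D Attraction, Exploration, and Obstacle maps that serve as memory; 2) its modules (mapping, decision-making, and target confirmation) are decoupled and run asynchronously, which sidesteps the VLM's inference latency and makes exploration more proactive. Decoupling semantic understanding, spatial memory, and the reinforcement learning policy so that they run in parallel is a design idea worth reusing.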

Abstract: Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle to memorize complex spatial representations in aerial environments, to make reliable and interpretable action decisions, and to explore and gather information efficiently. To address these challenges, we introduce APEX (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism; 2) Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy; 3) Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. All these components are integrated into a hierarchical, asynchronous, and parallel framework, effectively bypassing the VLM's inference latency and boosting the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2% SR and +2.8% SPL on challenging UAV-ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design. Our source code is available on GitHub at https://github.com/4amGodvzx/apex.


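[304] SyNeT: Synthetic Negatives for Traversability Learning cs.RO | cs.CV

Bomena Kim, Hojun Lee, Younsoo Park, Yaoyu Hu, Sebastian Scherer

TL;DR: This paper proposes SyNeT, a method that strengthens vision-based traversability learning by explicitly constructing synthetic negative samples. It addresses the difficulty that existing self-supervised frameworks have in accurately identifying diverse non-traversable regions when explicit negative data is missing. The method integrates seamlessly into both PU and PN training frameworks without changing the inference architecture, and experiments on public and self-collected datasets show that it improves robustness and generalization.

Details

Motivation: Existing self-supervised traversability estimation frameworks rely mainly on positive and unlabeled data. The lack of explicit negative samples limits the model's ability to identify diverse non-traversable regions accurately, so a way to introduce negatives explicitly is needed.

Result: Extensive experiments on public and self-collected datasets show that the method significantly improves robustness and generalization across environments. The proposed object-centric FPR evaluation provides indirect evidence that the model identifies non-traversable regions consistently.

Insight: The contribution is a training strategy that constructs synthetic negatives and plugs flexibly into existing frameworks, together with an object-centric evaluation method that offers a new way to assess how consistently a model identifies non-traversable regions without additional labeling.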

Abstract: Reliable traversability estimation is crucial for autonomous robots to navigate complex outdoor environments safely. Existing self-supervised learning frameworks primarily rely on positive and unlabeled data; however, the lack of explicit negative data remains a critical limitation, hindering the model's ability to accurately identify diverse non-traversable regions. To address this issue, we introduce a method to explicitly construct synthetic negatives, representing plausible but non-traversable regions, and integrate them into vision-based traversability learning. Our approach is formulated as a training strategy that can be seamlessly integrated into both Positive-Unlabeled (PU) and Positive-Negative (PN) frameworks without modifying inference architectures. Complementing standard pixel-wise metrics, we introduce an object-centric FPR evaluation approach that analyzes predictions in regions where synthetic negatives are inserted. This evaluation provides an indirect measure of the model's ability to consistently identify non-traversable regions without additional manual labeling. Extensive experiments on both public and self-collected datasets demonstrate that our approach significantly enhances robustness and generalization across diverse environments. The source code and demonstration videos are publicly available at the project page: https://anonymous-synet.github.io/SyNet.github.io/


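[305] KAN We Flow? Advancing Robotic Manipulation with 3D Flow Matching via KAN & RWKV cs.RO | cs.CV

Zhihao Chen, Yiyuan Ge, Ziyang Wang

TL;DR: This paper proposes KAN-We-Flow, a flow-matching policy for 3D robotic manipulation. It combines two recent network architectures, RWKV and KAN, into a lightweight yet expressive backbone and introduces Action Consistency Regularization to stabilize training. The method greatly reduces the parameter count, keeps inference fast, and achieves state-of-the-art success rates on several benchmarks.

Details

Motivation: Diffusion-based visuomotor policies are inefficient at inference time. They need multi-step denoising and heavy UNet backbones, which makes them hard to deploy on resource-constrained robots. Flow matching reduces the sampling burden, but existing implementations still use large UNet architectures.

Result: The method achieves state-of-the-art success rates on the Adroit, Meta-World, and DexArt benchmarks. Compared with UNet-based methods, it reduces parameters by 86.8% while keeping runtime fast.

Insight: The main contributions are: 1) the RWKV-KAN block, which pairs RWKV's efficient time/channel mixing with KAN's learnable spline-based, groupwise functional mappings to form a lightweight and highly expressive backbone; 2) Action Consistency Regularization, which uses Euler extrapolation to align predicted action trajectories with expert demonstrations, providing extra supervision that stabilizes training and improves policy precision.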

Abstract: Diffusion-based visuomotor policies excel at modeling action distributions but are inference-inefficient, since recursively denoising from noise to policy requires many steps and heavy UNet backbones, which hinders deployment on resource-constrained robots. Flow matching alleviates the sampling burden by learning a one-step vector field, yet prior implementations still inherit large UNet-style architectures. In this work, we present KAN-We-Flow, a flow-matching policy that draws on recent advances in Receptance Weighted Key Value (RWKV) and Kolmogorov-Arnold Networks (KAN) from vision to build a lightweight and highly expressive backbone for 3D manipulation. Concretely, we introduce an RWKV-KAN block: an RWKV first performs efficient time/channel mixing to propagate task context, and a subsequent GroupKAN layer applies learnable spline-based, groupwise functional mappings to perform feature-wise nonlinear calibration of the action mapping on RWKV outputs. Moreover, we introduce an Action Consistency Regularization (ACR), a lightweight auxiliary loss that enforces alignment between predicted action trajectories and expert demonstrations via Euler extrapolation, providing additional supervision to stabilize training and improve policy precision. Without resorting to large UNets, our design reduces parameters by 86.8%, maintains fast runtime, and achieves state-of-the-art success rates on Adroit, Meta-World, and DexArt benchmarks. Our project page is available at https://zhihaochen-2003.github.io/KAN-We-Flow.github.io/.
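The abstract describes Action Consistency Regularization only as an auxiliary loss that aligns predicted actions with expert demonstrations via Euler extrapolation. One plausible reading, sketched here under the common straight-line flow-matching assumption, is to extrapolate from an intermediate point to the endpoint in a single Euler step and penalize the squared error against the expert action. The helper names and the toy numbers are hypothetical, and the paper's actual loss may differ.

```python
def interpolate(noise, expert, t):
    """Point on the straight path x_t = (1 - t) * noise + t * expert."""
    return [(1 - t) * a + t * b for a, b in zip(noise, expert)]


def euler_extrapolate(x_t, velocity, t):
    """One Euler step from time t to time 1 along the predicted velocity."""
    return [x + (1 - t) * v for x, v in zip(x_t, velocity)]


def consistency_loss(predicted, expert):
    """Mean squared error between the extrapolated and expert actions."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expert)) / len(expert)


# Toy two-dimensional action (placeholder values).
noise = [0.0, 1.0]
expert = [1.0, -1.0]
t = 0.25
x_t = interpolate(noise, expert, t)

# For a straight path the ideal velocity is expert - noise at every t.
ideal_velocity = [e - n for e, n in zip(expert, noise)]
# A deliberately wrong prediction, standing in for an untrained network.
wrong_velocity = [0.5, -1.0]

loss_ideal = consistency_loss(euler_extrapolate(x_t, ideal_velocity, t), expert)
loss_wrong = consistency_loss(euler_extrapolate(x_t, wrong_velocity, t), expert)
```

In this reading the loss is zero exactly when the predicted velocity carries the intermediate point onto the expert action, so it supervises the endpoint directly rather than only the local velocity.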


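[306] UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning cs.RO | cs.CV

Shuai Liu, Siheng Ren, Xiaoyao Zhu, Quanmin Liang, Zefeng Li

TL;DR: UniDWM is a unified driving world model built through multifaceted representation learning, intended to support reliable and efficient planning in complex driving environments. It constructs a structure- and dynamic-aware latent world representation that serves as a physically grounded state space and enables consistent reasoning across perception, prediction, and planning.

Details

Motivation: Reliable and efficient planning in complex driving environments requires a model that can reason over a scene's geometry, appearance, and dynamics.

Result: Extensive experiments demonstrate UniDWM's effectiveness in trajectory planning, 4D reconstruction, and generation, highlighting the potential of multifaceted world representations as a foundation for unified driving intelligence.

Insight: The contribution is a unified multifaceted representation learning framework that builds a structure- and dynamic-aware latent world representation and is shown to be a variant of a variational autoencoder (VAE), which gives theoretical guidance for learning. A conditional diffusion transformer forecasts future world evolution within the latent space, tying perception, prediction, and planning together.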

Abstract: Achieving reliable and efficient planning in complex driving environments requires a model that can reason over the scene's geometry, appearance, and dynamics. We present UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning. UniDWM constructs a structure- and dynamic-aware latent world representation that serves as a physically grounded state space, enabling consistent reasoning across perception, prediction, and planning. Specifically, a joint reconstruction pathway learns to recover the scene's structure, including geometry and visual texture, while a collaborative generation framework leverages a conditional diffusion transformer to forecast future world evolution within the latent space. Furthermore, we show that UniDWM can be viewed as a variant of a VAE, which provides theoretical guidance for the multifaceted representation learning. Extensive experiments demonstrate the effectiveness of UniDWM in trajectory planning, 4D reconstruction, and generation, highlighting the potential of multifaceted world representations as a foundation for unified driving intelligence. The code will be publicly available at https://github.com/Say2L/UniDWM.


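[307] Towards Autonomous Instrument Tray Assembly for Sterile Processing Applications cs.RO | cs.AI | cs.CV | cs.LG

Raghavasimhan Sankaranarayanan, Paul Stuart, Nicholas Ahn, Arno Sungarian, Yash Chitalia

TL;DR: This paper presents a fully automated robotic system for sterile processing (SPD) applications that sorts surgical instruments and packs them in a structured way into sterile trays, automating the SPD assembly stage. The system integrates a hybrid perception pipeline based on YOLO12 and a cascaded ResNet, a 6-DOF robotic arm, a custom dual electromagnetic gripper, and a rule-based packing algorithm, and it uses 3D-printed dividers and holders to physically isolate instruments. Experimental evaluation shows high perception accuracy and a significant reduction in collisions between instruments.

Details

Motivation: The Sterile Processing and Distribution (SPD) department cleans, disinfects, inspects, and assembles surgical instruments between surgeries. Manual inspection and preparation of instrument trays is time-consuming and error-prone, and it exposes instruments to contamination and damage. Automating the SPD assembly stage is therefore needed to improve efficiency, safety, and consistency.

Result: Experimental evaluation shows high perception accuracy and a statistically significant reduction in tool-to-tool collisions compared with human-assembled trays.

Insight: The contributions are: 1) a custom dataset of 31 surgical instruments and 6,975 annotated images for training the hybrid perception pipeline (YOLO12 detection plus cascaded ResNet fine-grained classification); 2) the integration of a calibrated vision module, a robotic arm with a custom dual electromagnetic gripper, and a rule-based packing algorithm; 3) 3D-printed dividers and holders that physically isolate instruments and reduce collision and friction during transport. This provides a scalable first step toward automating SPD workflows.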

Abstract: The Sterile Processing and Distribution (SPD) department is responsible for cleaning, disinfecting, inspecting, and assembling surgical instruments between surgeries. Manual inspection and preparation of instrument trays is a time-consuming, error-prone task, often prone to contamination and instrument breakage. In this work, we present a fully automated robotic system that sorts and structurally packs surgical instruments into sterile trays, focusing on automation of the SPD assembly stage. A custom dataset comprising 31 surgical instruments and 6,975 annotated images was collected to train a hybrid perception pipeline using YOLO12 for detection and a cascaded ResNet-based model for fine-grained classification. The system integrates a calibrated vision module, a 6-DOF Staubli TX2-60L robotic arm with a custom dual electromagnetic gripper, and a rule-based packing algorithm that reduces instrument collisions during transport. The packing framework uses 3D-printed dividers and holders to physically isolate instruments, reducing collision and friction during transport. Experimental evaluations show high perception accuracy and a statistically significant reduction in tool-to-tool collisions compared to human-assembled trays. This work serves as a scalable first step toward automating SPD workflows, improving the safety and consistency of surgical preparation while reducing SPD processing times.


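[308] LIEREx: Language-Image Embeddings for Robotic Exploration cs.RO | cs.CV

Felix Igelbrink, Lennart Niecksch, Marian Renz, Martin Günther, Martin Atzmueller

TL;DR: LIEREx is a robotic exploration method that combines vision-language foundation models (such as CLIP) with 3D semantic scene graphs. By encoding objects as high-dimensional embeddings rather than fixed labels, it lets an autonomous agent carry out target-directed exploration in partially unknown environments.

Details

Motivation: Traditional semantic mapping methods depend on predefined symbolic vocabularies and cannot handle out-of-distribution knowledge that was not defined at design time, which limits a robot's ability to explore open environments.

Result: The abstract reports no quantitative results or benchmarks, but indicates that the method enables target-directed exploration.

Insight: The contribution is using the open-set capability of vision-language foundation models to represent objects as high-dimensional embeddings and integrating them with 3D semantic scene graphs, which strengthens the robot's semantic understanding of unknown environments and makes exploration more flexible.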

Abstract: Semantic maps allow a robot to reason about its surroundings to fulfill tasks such as navigating known environments, finding specific objects, and exploring unmapped areas. Traditional mapping approaches provide accurate geometric representations but are often constrained by pre-designed symbolic vocabularies. The reliance on fixed object classes makes it impractical to handle out-of-distribution knowledge not defined at design time. Recent advances in Vision-Language Foundation Models, such as CLIP, enable open-set mapping, where objects are encoded as high-dimensional embeddings rather than fixed labels. In LIEREx, we integrate these VLFMs with established 3D Semantic Scene Graphs to enable target-directed exploration by an autonomous agent in partially unknown environments.


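[309] FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation cs.RO | cs.CV

Ruiteng Zhao, Wenshuo Wang, Yicheng Ma, Xiaocong Li, Francis E. H. Tay

TL;DR: This paper proposes FD-VLA, a force-aware vision-language-action framework for contact-rich robotic manipulation. Its Force Distillation Module predicts a latent force representation from visual observations and robot states without relying on physical force sensors, and injects it into a pretrained vision-language model to enable force-aware reasoning.

Details

Motivation: In contact-rich manipulation, many robots lack force sensors, which are expensive or fragile, and this makes fine-grained perception and dexterous manipulation difficult.

Result: Physical experiments show that the distilled force representation outperforms direct sensor measurements and other baselines, highlighting the effectiveness of the method.

Insight: The contribution is bringing force awareness into the VLA framework through a Force Distillation Module that needs no physical sensor, which lowers hardware cost and complexity. The module also introduces force-vision-state fusion ahead of the VLM, which improves cross-modal alignment and makes perception and action more robust in contact scenarios.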

Abstract: Force sensing is a crucial modality for Vision-Language-Action (VLA) frameworks, as it enables fine-grained perception and dexterous manipulation in contact-rich tasks. We present Force-Distilled VLA (FD-VLA), a novel framework that integrates force awareness into contact-rich manipulation without relying on physical force sensors. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token, conditioned on visual observations and robot states, into a predicted force token aligned with the latent representation of actual force signals. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning while preserving the integrity of its vision-language semantics. This design provides two key benefits: first, it allows practical deployment across a wide range of robots that lack expensive or fragile force-torque sensors, thereby reducing hardware cost and complexity; second, the FDM introduces an additional force-vision-state fusion prior to the VLM, which improves cross-modal alignment and enhances perception-action robustness in contact-rich scenarios. Surprisingly, our physical experiments show that the distilled force token outperforms direct sensor force measurements as well as other baselines, which highlights the effectiveness of this force-distilled VLA approach.


cs.MA [Back]

[310] A-MapReduce: Executing Wide Search via Agentic MapReduce cs.MA | cs.CL

Mingju Chen, Guibin Zhang, Heng Chang, Yuchen Guo, Shiji Zhou

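TL;DR: This paper proposes A-MapReduce, a multi-agent execution framework inspired by the MapReduce paradigm. It targets two problems that existing LLM-based multi-agent systems face on large-scale, breadth-oriented wide search tasks: expansive search objectives and inefficient execution. The framework recasts wide search as a horizontally structured retrieval problem, processes targets in parallel through task-adaptive decomposition and structured result aggregation, and uses experiential memory to drive the continual evolution of query-conditioned task allocation and recomposition.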

Details

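Motivation: Existing LLM-based multi-agent systems show systematic advantages in deep research tasks, which emphasize iterative, vertically structured information seeking. On wide search tasks characterized by large-scale, breadth-oriented retrieval, however, these frameworks are designed mainly for sequential, vertically structured reasoning and struggle with expansive search objectives and inefficient long-horizon execution.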

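Result: Extensive experiments on five agentic benchmarks show that A-MapReduce performs strongly. It reaches state-of-the-art performance on WideSearch and DeepWideSearch and improves average Item F1 by 5.11% to 17.50% over strong baselines with OpenAI o3 or Gemini 2.5 Pro backbones. It is also cost-effective and efficient, offering superior cost-performance trade-offs and reducing running time by 45.8% compared with representative multi-agent baselines.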

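Insight: The core innovation is bringing the parallel computing paradigm of MapReduce into multi-agent systems to solve wide search. Horizontally structured reformulation, task-adaptive decomposition and aggregation, and memory-driven evolution of task allocation together enable efficient parallel processing of large-scale retrieval targets with progressive improvement, offering a new framework for multi-agent systems on breadth-first tasks.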

Abstract: Contemporary large language model (LLM)-based multi-agent systems exhibit systematic advantages in deep research tasks, which emphasize iterative, vertically structured information seeking. However, when confronted with wide search tasks characterized by large-scale, breadth-oriented retrieval, existing agentic frameworks, primarily designed around sequential, vertically structured reasoning, remain stuck in expansive search objectives and inefficient long-horizon execution. To bridge this gap, we propose A-MapReduce, a MapReduce paradigm-inspired multi-agent execution framework that recasts wide search as a horizontally structured retrieval problem. Concretely, A-MapReduce implements parallel processing of massive retrieval targets through task-adaptive decomposition and structured result aggregation. Meanwhile, it leverages experiential memory to drive the continual evolution of query-conditioned task allocation and recomposition, enabling progressive improvement in large-scale wide-search regimes. Extensive experiments on five agentic benchmarks demonstrate that A-MapReduce is (i) high-performing, achieving state-of-the-art performance on WideSearch and DeepWideSearch, and delivering 5.11% - 17.50% average Item F1 improvements compared with strong baselines with OpenAI o3 or Gemini 2.5 Pro backbones; (ii) cost-effective and efficient, delivering superior cost-performance trade-offs and reducing running time by 45.8% compared to representative multi-agent baselines. The code is available at https://github.com/mingju-c/AMapReduce.
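
The map and reduce phases described in the abstract can be sketched generically as follows. This is an illustration of the MapReduce pattern applied to a wide-search table, not the A-MapReduce code: the fixed-size batching stands in for task-adaptive decomposition, `toy_lookup` stands in for an LLM agent with search tools, and the experiential-memory component is omitted. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(entities, batch_size):
    """Split the retrieval targets into batches (stand-in for adaptive decomposition)."""
    return [entities[i:i + batch_size] for i in range(0, len(entities), batch_size)]

def map_worker(batch, attributes, lookup):
    """One 'agent' fills in every requested attribute for its batch of entities."""
    rows = []
    for entity in batch:
        row = {"entity": entity}
        for attribute in attributes:
            row[attribute] = lookup(entity, attribute)
        rows.append(row)
    return rows

def reduce_rows(partials):
    """Merge partial results into one table keyed by entity, sorted for stability."""
    merged = {}
    for rows in partials:
        for row in rows:
            merged.setdefault(row["entity"], {}).update(row)
    return [merged[key] for key in sorted(merged)]

def wide_search(entities, attributes, lookup, batch_size=2, max_workers=4):
    """Run the map phase in parallel threads, then reduce to a single table."""
    batches = decompose(entities, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda b: map_worker(b, attributes, lookup), batches))
    return reduce_rows(partials)

def toy_lookup(entity, attribute):
    """Placeholder for a real retrieval call; returns a deterministic string."""
    return f"{attribute} of {entity}"

table = wide_search(["c", "a", "b"], ["x", "y"], toy_lookup)
```

Threads suit this sketch because real lookups would be I/O-bound calls to a model or search API, and keying the reduce step by entity keeps the merged table well-formed even if two batches return rows for the same entity.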


math.OC [Back]

[311] Dual Quaternion SE(3) Synchronization with Recovery Guarantees math.OC | cs.CV | cs.RO | eess.SP

Jianing Zhao, Linglingzhi Zhu, Anthony Man-Cho So

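TL;DR: This paper proposes a dual quaternion-based method for SE(3) synchronization, which aims to recover absolute poses from noisy pairwise relative transformations. The method is a two-stage algorithm: it first computes a spectral initializer via the power method on a Hermitian dual quaternion measurement matrix, then applies a dual quaternion generalized power method (DQGPM) that guarantees feasibility through per-iteration projection. The theoretical analysis gives an error bound for the spectral estimator and proves that DQGPM admits a finite-iteration error bound and linear error contraction. Experiments show that, compared with representative matrix-based methods, the approach improves both accuracy and efficiency on synthetic benchmarks and real-world multi-scan point-set registration.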

Details

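Motivation: SE(3) synchronization is a core problem in robotics and 3D vision that aims to recover absolute poses from noisy pairwise relative transformations. Standard approaches often need multi-step heuristic procedures to recover valid poses; these are difficult to analyze and typically lack theoretical guarantees. By adopting a dual quaternion representation and formulating SE(3) synchronization directly over unit dual quaternions, the paper aims for a simpler, more efficient solution with theoretical guarantees.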

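Result: In experiments on synthetic benchmarks and real-world multi-scan point-set registration, the proposed pipeline outperforms representative matrix-based methods in both accuracy and efficiency. The theoretical results establish an estimation error bound for the spectral estimator and prove that DQGPM admits a finite-iteration error bound, with linear error contraction up to an explicit noise-dependent threshold.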

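Insight: The novelty lies in expressing SE(3) synchronization with dual quaternions and designing a two-stage algorithm (spectral initializer plus DQGPM), which both simplifies the problem formulation and provides theoretical guarantees such as error bounds and convergence. Viewed objectively, extending the power method to the dual quaternion domain and combining it with iterative projection to ensure feasibility is a novel and theoretically grounded optimization strategy that could carry over to synchronization problems on other non-Euclidean spaces.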

Abstract: Synchronization over the special Euclidean group SE(3) aims to recover absolute poses from noisy pairwise relative transformations and is a core primitive in robotics and 3D vision. Standard approaches often require multi-step heuristic procedures to recover valid poses, which are difficult to analyze and typically lack theoretical guarantees. This paper adopts a dual quaternion representation and formulates SE(3) synchronization directly over unit dual quaternions. A two-stage algorithm is developed: a spectral initializer computed via the power method on a Hermitian dual quaternion measurement matrix, followed by a dual quaternion generalized power method (DQGPM) that enforces feasibility through per-iteration projection. Estimation error bounds are established for the spectral estimators, and DQGPM is shown to admit a finite-iteration error bound and to achieve linear error contraction up to an explicit noise-dependent threshold. Experiments on synthetic benchmarks and real-world multi-scan point-set registration demonstrate that the proposed pipeline improves both accuracy and efficiency over representative matrix-based methods.
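
To make the unit dual quaternion representation concrete, the sketch below encodes a pose as a pair (real part, dual part), composes two poses, and maps a perturbed pair back to the unit constraints by dividing by its dual-number norm. This is standard dual quaternion algebra, not the DQGPM algorithm, and the projection the paper uses at each iteration may differ. The `[w, x, y, z]` storage order, the convention dual = 0.5 * t * r, and all function names are assumptions.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def dq_from_pose(rotation, translation):
    """Pose (unit quaternion r, 3-vector t) -> (real, dual) with dual = 0.5 * t * r."""
    r = np.asarray(rotation, dtype=float)
    t = np.array([0.0, *translation])
    return r, 0.5 * quat_mul(t, r)

def dq_mul(a, b):
    """Dual quaternion product; corresponds to composing the two rigid motions."""
    return quat_mul(a[0], b[0]), quat_mul(a[0], b[1]) + quat_mul(a[1], b[0])

def dq_translation(q):
    """Recover the translation of a unit dual quaternion: t = 2 * dual * conj(real)."""
    return 2.0 * quat_mul(q[1], quat_conj(q[0]))[1:]

def dq_normalize(q):
    """Divide by the dual norm so that |real| = 1 and <real, dual> = 0."""
    real, dual = q
    n = np.linalg.norm(real)
    return real / n, dual / n - real * np.dot(real, dual) / n**3

# Pose A: rotate 90 degrees about z, then translate by (1, 0, 0).
half = np.pi / 4
pose_a = dq_from_pose([np.cos(half), 0.0, 0.0, np.sin(half)], [1.0, 0.0, 0.0])
# Pose B: no rotation, translate by (1, 0, 0).
pose_b = dq_from_pose([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0])

composed = dq_mul(pose_a, pose_b)  # apply B first, then A

# Perturb the composed pose, then restore the unit constraints.
rng = np.random.default_rng(0)
noisy = (composed[0] + 0.05 * rng.normal(size=4),
         composed[1] + 0.05 * rng.normal(size=4))
repaired = dq_normalize(noisy)
```

Here `dq_translation(composed)` equals R_A t_B + t_A, which for these two poses is (1, 1, 0). The normalization step shows why feasibility has to be enforced explicitly: a noisy update generally violates both the unit-norm and the orthogonality constraint.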