Table of Contents

cs.CL

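[1] LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning cs.CL | cs.GT | cs.LG

Tommaso Felice Banfi, Sashenka Gamage

TL;DR: This paper proposes a large language model (LLM)-based framework for discrete game-theoretic tasks, illustrated with Tic-Tac-Toe. The method combines in-context learning with entropy-guided chain-of-thought (CoT) reasoning and adaptive context retrieval, dynamically adjusting the number of retrieved examples and reasoning paths according to token-level uncertainty to improve decision quality.

Details

Motivation: LLMs reason poorly in sequential decision-making settings such as game-theoretic tasks. The paper introduces an uncertainty measure to adjust the reasoning strategy dynamically, aiming to improve both the accuracy and the efficiency of decisions.

Result: In experiments against a sub-optimal algorithmic opponent, entropy-guided adaptive reasoning raises the average game outcome from -11.6% for the baseline LLM to +9.5% (win = +1, tie = 0, loss = -1), while keeping the number of LLM queries per game relatively low. Statistical validation confirms that the improvement is significant, and token-level entropy is negatively correlated with move optimality.

Insight: The novelty lies in using entropy (uncertainty) as a guiding signal for adaptive context retrieval and for expanding CoT reasoning paths. This offers a new approach to dynamic reasoning in LLMs for complex decision tasks and may transfer to other domains that require sequential reasoning.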

Abstract: We propose a novel LLM-based framework for reasoning in discrete, game-theoretic tasks, illustrated with Tic-Tac-Toe. The method integrates in-context learning with entropy-guided chain-of-thought (CoT) reasoning and adaptive context retrieval. The model dynamically adjusts both the number of retrieved examples and reasoning paths according to token-level uncertainty: concise reasoning with minimal context is used when uncertainty is low, whereas higher uncertainty triggers expanded multi-path CoT exploration. Experimental evaluation against a sub-optimal algorithmic opponent shows that entropy-aware adaptive reasoning substantially improves decision quality, increasing the average game outcome from -11.6% with the baseline LLM to +9.5% with entropy-guided adaptive reasoning over 100 games (win = +1, tie = 0, loss = -1), while maintaining a relatively low number of LLM queries per game. Statistical validation confirms that the improvement is significant, and correlation analysis reveals a negative association between token-level entropy and move optimality. These findings demonstrate that uncertainty-guided adaptive reasoning effectively enhances LLM performance in sequential decision-making environments.
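
The entropy-guided switching described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the thresholds `low` and `high`, the returned (examples, paths) pairs, and the function names are all assumptions, since the abstract does not give concrete values. The last helper shows how an average game outcome is computed under the stated win = +1, tie = 0, loss = -1 scoring.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def plan_reasoning(step_distributions, low=0.5, high=1.0):
    """Map mean token entropy to (num_retrieved_examples, num_cot_paths).

    The thresholds and the returned budgets are hypothetical placeholders.
    """
    mean_h = sum(token_entropy(d) for d in step_distributions) / len(step_distributions)
    if mean_h < low:
        return 1, 1   # low uncertainty: concise reasoning, minimal context
    if mean_h < high:
        return 3, 2   # moderate uncertainty: some extra context and paths
    return 5, 4       # high uncertainty: expanded multi-path exploration

def average_outcome(results):
    """Average game outcome with win = +1, tie = 0, loss = -1."""
    score = {"win": 1, "tie": 0, "loss": -1}
    return sum(score[r] for r in results) / len(results)
```

A peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` falls below the lower threshold and yields the smallest budget, while a uniform distribution over four moves exceeds the upper threshold and triggers the largest one.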


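[2] Reasoning Models Generate Societies of Thought cs.CL | cs.CY | cs.LG

Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans

TL;DR: By analyzing the internal mechanisms of reasoning models such as DeepSeek-R1 and QwQ-32B, this paper finds that their strong reasoning ability does not come from longer chain-of-thought computation alone, but from simulating a multi-agent-like "society of thought". This social structure drives diversification and debate among internal cognitive perspectives with distinct personality traits and domain expertise, enabling effective exploration of the solution space and improving accuracy on complex cognitive tasks.

Details

Motivation: Although large language models show remarkable capabilities across many domains, the mechanisms underlying their sophisticated reasoning remain unclear. The paper investigates why reasoning models outperform instruction-tuned models on complex tasks, and challenges the common view that reasoning improves simply by extending the chain of computation.

Result: Quantitative analysis and mechanistic interpretability methods applied to reasoning traces show that reasoning models exhibit greater perspective diversity than instruction-tuned models and activate broader conflict between heterogeneous personality- and expertise-related features during reasoning. Controlled reinforcement learning experiments confirm that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and that fine-tuning with conversational scaffolding improves reasoning faster than the base models do.

Insight: The core contribution is the "society of thought" as a mechanistic explanation, which likens a model's internal reasoning to the collective intelligence of human groups, where systematically structured diversity is key to better problem solving. This opens a new research direction: designing agent organizations that harness the wisdom of crowds.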

Abstract: Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable instruction-tuned models on complex cognitive tasks, attributed to extended computation through longer chains of thought. Here we show that enhanced reasoning emerges not from extended computation alone, but from simulating multi-agent-like interactions – a society of thought – which enables diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise. Through quantitative analysis and mechanistic interpretability methods applied to reasoning traces, we find that reasoning models like DeepSeek-R1 and QwQ-32B exhibit much greater perspective diversity than instruction-tuned models, activating broader conflict between heterogeneous personality- and expertise-related features during reasoning. This multi-agent structure manifests in conversational behaviors, including question-answering, perspective shifts, and the reconciliation of conflicting views, and in socio-emotional roles that characterize sharp back-and-forth conversations, together accounting for the accuracy advantage in reasoning tasks. Controlled reinforcement learning experiments reveal that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and fine-tuning models with conversational scaffolding accelerates reasoning improvement over base models. These findings indicate that the social organization of thought enables effective exploration of solution spaces. We suggest that reasoning models establish a computational parallel to collective intelligence in human groups, where diversity enables superior problem-solving when systematically structured, which suggests new opportunities for agent organization to harness the wisdom of crowds.


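[3] When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs cs.CL | cs.AI

Zhongxiang Sun, Yi Zhan, Chenglei Shen, Weijie Yu, Xiao Zhang

TL;DR: This paper studies hallucinations that personalized large language models produce on factual queries because of a user's historical preferences. It proposes FPPS, a lightweight inference-time method that mitigates this distortion, and introduces PFQABench, the first benchmark for jointly evaluating personalized and factual question answering. Experiments show that FPPS substantially improves factual accuracy while maintaining personalized performance.

Details

Motivation: When personalized LLMs adapt to individual users, entanglement between personalization and factual representations causes hallucinations: the model may answer according to the user's history rather than objective fact, which degrades factual reliability and can propagate incorrect beliefs.

Result: Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.

Insight: The paper identifies personalization-induced hallucination and traces it to representational entanglement, proposes the lightweight inference-time intervention FPPS to decouple personalization from factual reasoning, and builds the first joint evaluation benchmark, PFQABench, offering a new approach to balancing personalization and factuality.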

Abstract: Personalized large language models (LLMs) adapt model behavior to individual users to enhance user satisfaction, yet personalization can inadvertently distort factual reasoning. We show that when personalized LLMs face factual queries, there exists a phenomenon where the model generates answers aligned with a user’s prior history rather than the objective truth, resulting in personalization-induced hallucinations that degrade factual reliability and may propagate incorrect beliefs, due to representational entanglement between personalization and factual representations. To address this issue, we propose Factuality-Preserving Personalized Steering (FPPS), a lightweight inference-time approach that mitigates personalization-induced factual distortions while preserving personalized behavior. We further introduce PFQABench, the first benchmark designed to jointly evaluate factual and personalized question answering under personalization. Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.


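[4] NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems cs.CL

Jiayu Liu, Rui Wang, Qing Zong, Qingcheng Zeng, Tianshi Zheng

TL;DR: This paper studies poor confidence calibration of large language models (LLMs) in retrieval-augmented generation (RAG) systems, caused by noise in the retrieved context such as contradictory or irrelevant evidence. The authors propose NAACL Rules as a principled foundation for resolving overconfidence under noise, and design the NAACL framework, which performs supervised fine-tuning on about 2K HotpotQA examples to give models intrinsic noise awareness. Experiments show that the method substantially improves calibration.

Details

Motivation: Deploying LLMs in mission-critical factual domains requires accurate assessment of model confidence, yet confidence calibration in RAG settings is under-studied, and retrieval noise leads models to false certainty (overconfidence).

Result: Empirical results on four benchmarks show that the NAACL framework yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain.

Insight: The paper is the first to show systematically how retrieval noise in RAG harms LLM confidence calibration. It proposes principled noise-aware calibration rules (NAACL Rules) and a corresponding supervised fine-tuning framework that gives models noise awareness without relying on stronger teacher models, offering a path toward LLMs that are both accurate and epistemically reliable.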

Abstract: Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model’s false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.
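
For readers unfamiliar with the ECE metric reported here, the sketch below shows one common way to compute expected calibration error from stated confidences and correctness labels. It assumes equal-width confidence bins and a default of 10 bins; the abstract does not say which binning the authors used, so treat both as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |avg confidence - accuracy| per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    The binning scheme is an assumption, not taken from the paper.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that states 0.9 confidence on four answers but gets only three right has an ECE of 0.15 under this definition; lower values indicate better calibration.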


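[5] Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data cs.CL

Xuanming Zhang, Shwan Ashrafi, Aziza Mirsaidova, Amir Rezaeian, Miguel Ballesteros

TL;DR: This paper studies the reasoning behavior of large language models under limited computation budgets and proposes an inference-time self-improvement method based on LLM-synthesized preference data, so that models quickly produce high-quality partial solutions within a fixed reasoning budget.

Details

Motivation: In real-world tasks with constrained computation, such as trip planning, a model needs to deliver useful partial solutions quickly rather than carry out costly exhaustive reasoning.

Result: Experiments on the NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains in reasoning quality and efficiency across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models.

Insight: The contributions are an anytime reasoning framework, the Anytime Index (a metric that quantifies how effectively solution quality improves as computation increases), and inference-time self-learning from preference data generated by the model itself to improve intermediate solutions.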

Abstract: We study the reasoning behavior of large language models (LLMs) under limited computation budgets. In such settings, producing useful partial solutions quickly is often more practical than exhaustive reasoning, which incurs high inference costs. Many real-world tasks, such as trip planning, require models to deliver the best possible output within a fixed reasoning budget. We introduce an anytime reasoning framework and the Anytime Index, a metric that quantifies how effectively solution quality improves as reasoning tokens increase. To further enhance efficiency, we propose an inference-time self-improvement method using LLM-synthesized preference data, where models learn from their own reasoning comparisons to produce better intermediate solutions. Experiments on NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models, improving both reasoning quality and efficiency under budget constraints.


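[6] CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs cs.CL | cs.LG

Yuanxiang Liu, Songze Li, Xiaoke Guo, Zhaoyan Gong, Qifei Zhang

TL;DR: This paper proposes CoG, a training-free framework for knowledge-graph-augmented reasoning inspired by Dual-Process Theory. Through Relational Blueprint Guidance and a Failure-Aware Refinement module, it handles noise and structural misalignment in knowledge graphs and makes LLM reasoning over knowledge graphs more reliable.

Details

Motivation: KG-augmented LLM reasoning suffers from cognitive rigidity: homogeneous search strategies make it sensitive to noise and prone to reasoning stagnation. The paper aims to improve the reliability and stability of reasoning.

Result: Experiments on three benchmarks show that CoG significantly outperforms existing state-of-the-art methods in both accuracy and efficiency.

Insight: The paper draws on Dual-Process Theory to model the interplay between intuition and deliberation. It uses interpretable relational blueprints as soft structural constraints to stabilize the search direction quickly, and a failure-aware mechanism that triggers evidence-conditioned reflection and controlled backtracking to overcome reasoning stagnation, achieving efficient, controllable reasoning without training.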

Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities but often grapple with reliability challenges like hallucinations. While Knowledge Graphs (KGs) offer explicit grounding, existing paradigms of KG-augmented LLMs typically exhibit cognitive rigidity, applying homogeneous search strategies that render them vulnerable to instability under neighborhood noise and structural misalignment, which leads to reasoning stagnation. To address these challenges, we propose CoG, a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. First, functioning as the fast, intuitive process, the Relational Blueprint Guidance module leverages relational blueprints as interpretable soft structural constraints to rapidly stabilize the search direction against noise. Second, functioning as the prudent, analytical process, the Failure-Aware Refinement module intervenes upon encountering reasoning impasses. It triggers evidence-conditioned reflection and executes controlled backtracking to overcome reasoning stagnation. Experimental results on three benchmarks demonstrate that CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.


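[7] Integrity Shield: A System for Ethical AI Use & Authorship Transparency in Assessments cs.CL

Ashish Raj Shekhar, Shiven Agarwal, Priyanuj Bordoloi, Yash Shah, Tejas Anvekar

TL;DR: This paper presents Integrity Shield, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs to prevent multimodal large language models (MLLMs) from answering exam questions directly, while keeping the document's human-visible appearance unchanged. It targets the academic integrity problems raised by AI misuse.

Details

Motivation: Existing watermarking techniques operate at the token level or require control over the model's decoding process, so they are ineffective when students query proprietary black-box systems with instructor-provided documents. The system aims to protect the integrity of academic assessment and the credibility of grades and credentials.

Result: Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves very high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) on four commercial MLLMs.

Insight: The novelty is the document-layer watermark design, which provides schema-aware, item-level watermark embedding and can block AI answering and reliably trace provenance without controlling the model's internals. Viewed objectively, the system is a practical anti-cheating solution for black-box AI settings, and its interactive demo interface improves usability.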

Abstract: Large Language Models (LLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model’s decoding process, making them ineffective when students query proprietary black-box systems with instructor-provided documents. We present Integrity Shield, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance & authorship evidence.


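[8] T$^\star$: Progressive Block Scaling for MDM Through Trajectory Aware RL cs.CL

Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, Siyu Zhu

TL;DR: This paper proposes T$^\star$, a TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an autoregressive-initialized small-block MDM, it transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Further analysis suggests that T$^\star$ can converge to an alternative decoding schedule $\hat{\rm S}$ with comparable performance.

Details

Motivation: Increasing the decoding block size of masked diffusion language models (MDMs) to raise parallelism usually causes a marked drop in performance. The paper aims for efficient high-parallelism decoding that preserves model performance.

Result: On math reasoning benchmarks, the method achieves higher-parallelism decoding with minimal performance degradation; the alternative decoding schedule $\hat{\rm S}$ that it converges to also reaches comparable performance.

Insight: The novelty is a reinforcement-learning-based (TraceRL) training curriculum for progressive block-size scaling that lets the model adapt smoothly to larger decoding blocks, offering a new approach to optimizing efficient decoding schedules for diffusion models.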

Abstract: We present T$^\star$, a simple TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T$^\star$ transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Moreover, further analysis suggests that T$^\star$ can converge to an alternative decoding schedule $\hat{\rm S}$ that achieves comparable performance.


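[9] MultiCaption: Detecting disinformation using multilingual visual claims cs.CL

Rafael Martins Frade, Rrubaa Panchendrarajan, Arkaitz Zubiaga

TL;DR: This paper presents MultiCaption, a dataset designed for detecting contradictions in visual claims, comprising 11,088 visual claim pairs in 64 languages. It reports comprehensive experiments with transformer-based architectures, natural language inference models, and large language models, establishing strong baselines for disinformation detection in multimodal, multilingual settings.

Details

Motivation: Online disinformation spreads rapidly across multimedia and multilingual platforms and poses a growing threat to society, while existing automated fact-checking methods are limited by the lack of datasets that reflect this real-world complexity.

Result: Experiments show that MultiCaption is more challenging than standard natural language inference tasks and requires task-specific fine-tuning for strong performance. The gains from multilingual training and testing highlight the dataset's potential for building effective multilingual fact-checking pipelines without relying on machine translation.

Insight: The novelty is the first multilingual, multimodal dataset dedicated to contradiction detection in visual claims, together with experiments that confirm its usefulness and difficulty in multilingual settings, providing a benchmark and direction for future research.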

Abstract: Online disinformation poses an escalating threat to society, driven increasingly by the rapid spread of misleading content across both multimedia and multilingual platforms. While automated fact-checking methods have advanced in recent years, their effectiveness remains constrained by the scarcity of datasets that reflect these real-world complexities. To address this gap, we first present MultiCaption, a new dataset specifically designed for detecting contradictions in visual claims. Pairs of claims referring to the same image or video were labeled through multiple strategies to determine whether they contradict each other. The resulting dataset comprises 11,088 visual claims in 64 languages, offering a unique resource for building and evaluating misinformation-detection systems in truly multimodal and multilingual environments. We then provide comprehensive experiments using transformer-based architectures, natural language inference models, and large language models, establishing strong baselines for future research. The results show that MultiCaption is more challenging than standard NLI tasks, requiring task-specific finetuning for strong performance. Moreover, the gains from multilingual training and testing highlight the dataset’s potential for building effective multilingual fact-checking pipelines without relying on machine translation.


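[10] Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering cs.CL | cs.LG

Yuling Shi, Maolin Sun, Zijun Liu, Mo Yang, Yixiong Fang

TL;DR: This paper proposes RT-RAG (Reasoning Tree Guided Retrieval-Augmented Generation), a hierarchical framework that addresses the incoherent reasoning and error propagation of existing methods on complex multi-hop question answering. It decomposes multi-hop questions into explicit reasoning trees through structured entity analysis and consensus-based tree selection, then uses a bottom-up traversal strategy for query rewriting and evidence collection, substantially improving multi-hop QA performance.

Details

Motivation: Current iterative retrieval-augmented generation methods for multi-hop QA rely mainly on the LLM to self-guide and plan multi-step exploration paths. Inaccurate query decomposition and error propagation then make it hard to keep reasoning coherent across steps. The paper sets out to address these problems.

Result: Comprehensive experiments show that RT-RAG substantially outperforms state-of-the-art methods, by 7.0% in F1 and 6.0% in exact match (EM), demonstrating its effectiveness on complex multi-hop QA.

Insight: The paper introduces explicit, structured reasoning trees to guide retrieval. Structured entity analysis separates core queries, known entities, and unknown entities; consensus-based tree selection minimizes decomposition errors; and bottom-up traversal mitigates error propagation. Viewed objectively, formalizing question decomposition as a tree and adding a consensus mechanism gives multi-step reasoning tasks a more robust and interpretable planning framework.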

Abstract: Retrieval-Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi-hop question answering (QA). For multi-hop QA tasks, current iterative approaches predominantly rely on LLMs to self-guide and plan multi-step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT-RAG), a novel hierarchical framework for complex multi-hop QA. RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus-based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom-up traversal strategy employs iterative query rewriting and refinement to collect high-quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT-RAG substantially outperforms state-of-the-art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT-RAG in complex multi-hop QA.
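
The bottom-up traversal described above can be illustrated with a small sketch. It only shows the general idea of resolving child sub-questions first and rewriting the parent query with their answers; the `ReasoningNode` class, the `{0}`-style placeholders, and the lookup table standing in for retrieval plus generation are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ReasoningNode:
    # Sub-question text; "{0}", "{1}", ... mark slots filled by child answers.
    query: str
    children: List["ReasoningNode"] = field(default_factory=list)

def solve_bottom_up(node: ReasoningNode, answer_fn: Callable[[str], str]) -> str:
    """Post-order traversal: answer the children, rewrite the query, then answer it."""
    child_answers = [solve_bottom_up(child, answer_fn) for child in node.children]
    rewritten = node.query.format(*child_answers)  # plug unknown entities into the parent
    return answer_fn(rewritten)

# Toy stand-in for retrieval plus generation (hypothetical lookup table).
TOY_KB = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}

tree = ReasoningNode("Where was {0} born?", [ReasoningNode("Who directed Inception?")])
answer = solve_bottom_up(tree, TOY_KB.__getitem__)
```

Because each parent query is rewritten only after its children are resolved, every retrieval call sees a fully specified question. The sketch omits the consensus-based tree selection and iterative refinement steps mentioned in the abstract.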


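[11] Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming cs.CL

Sama Hadhoud, Alaa Elsetohy, Frederikus Hudi, Jan Christian Blaise Cruz, Steven Halim

TL;DR: This paper proposes separating algorithmic problem solving from code generation when evaluating large language models on competitive programming. It introduces natural-language editorials as an intermediate step, builds a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluates 19 LLMs. Generating an editorial improves solve rates for some models, but models still face bottlenecks in both implementation and problem solving.

Details

Motivation: Existing evaluations conflate algorithmic reasoning with code implementation, whereas competitive programming is fundamentally a problem-solving task. The two aspects need to be separated to assess LLM capabilities more accurately.

Result: With gold editorials, solve rates improve for some LLMs, but models still struggle with code implementation, and the gap between generated and gold editorials reveals a problem-solving bottleneck. The paper uses expert annotations and an LLM-as-a-judge protocol for scalable evaluation, and builds a dataset of 83 problems.

Insight: The novelty is centering natural-language editorials to separate problem solving from code generation, which can inform the design of future benchmarks. Viewed objectively, the approach allows finer-grained diagnosis of LLM errors in competitive programming and pushes evaluation in a more structured direction.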

Abstract: Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.


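[12] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models cs.CL

Guoming Ling, Zhongzhan Huang, Yupei Lin, Junxin Li, Shanshan Zhong

TL;DR: This paper proposes Neural Chain-of-Thought Search (NCoTS), a framework that recasts LLM reasoning as a dynamic search for the optimal thinking strategy. By evaluating candidate reasoning operators, the method actively seeks sparse, superior reasoning paths that are both more accurate and more concise, improving accuracy while substantially reducing the number of generated steps.

Details

Motivation: Current LLMs generate chain-of-thought steps sequentially without foresight, so they easily get trapped in suboptimal reasoning paths with redundant steps. The paper addresses this by actively searching for better reasoning strategies.

Result: Across multiple reasoning benchmarks, NCoTS achieves a Pareto improvement, raising accuracy by over 3.5% while reducing generation length by over 22%.

Insight: The core idea is to model reasoning as a searchable optimization problem and to guide the search with a dual-factor heuristic that weighs both correctness and computational cost, so as to find and exploit sparse but superior reasoning paths in the solution space.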

Abstract: Chain-of-Thought reasoning has significantly enhanced the problem-solving capabilities of Large Language Models. Unfortunately, current models generate reasoning steps sequentially without foresight, often becoming trapped in suboptimal reasoning paths with redundant steps. In contrast, we introduce Neural Chain-of-Thought Search (NCoTS), a framework that reformulates reasoning as a dynamic search for the optimal thinking strategy. By quantitatively characterizing the solution space, we reveal the existence of sparse superior reasoning paths that are simultaneously more accurate and concise than standard outputs. Our method actively navigates towards these paths by evaluating candidate reasoning operators using a dual-factor heuristic that optimizes for both correctness and computational cost. Consequently, NCoTS achieves a Pareto improvement across diverse reasoning benchmarks, boosting accuracy by over 3.5% while reducing generation length by over 22%. Our code and data are available at https://github.com/MilkThink-Lab/Neural-CoT-Search.
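
One simple way to picture a dual-factor heuristic that trades correctness against computational cost is the sketch below. The abstract does not specify the actual scoring function, so the linear form, the `cost_weight` value, and the operator names are all assumptions made purely for illustration.

```python
def select_operator(candidates, cost_weight=0.001):
    """Pick the candidate reasoning operator with the best correctness/cost trade-off.

    candidates: list of (name, estimated_correctness, estimated_tokens) tuples.
    The linear score below is a hypothetical stand-in for the paper's heuristic.
    """
    def score(candidate):
        _, est_correctness, est_tokens = candidate
        return est_correctness - cost_weight * est_tokens

    return max(candidates, key=score)[0]

# Hypothetical operators with made-up estimates, for illustration only.
candidates = [
    ("continue", 0.60, 50),
    ("verify", 0.70, 120),
    ("conclude", 0.65, 10),
]
chosen = select_operator(candidates)
```

With these toy numbers the cheaper "conclude" operator wins even though "verify" has the higher correctness estimate, which is the kind of accuracy-versus-length trade-off the abstract describes.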


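[13] Reward Modeling for Scientific Writing Evaluation cs.CL

Furkan Şahinuç, Subhabrata Dutta, Iryna Gurevych

TL;DR: This paper proposes cost-efficient, open-source reward models for scientific writing evaluation. A two-stage training framework optimizes scientific evaluation preferences and then reasoning capabilities, enabling fine-grained assessment across tasks and robustness to dynamic criteria.

Details

Motivation: Existing LLM-based evaluators are optimized mainly for general-purpose benchmarks and fail to reason over the sparse knowledge of scientific domains, while fine-tuning for each individual task is costly. Models are therefore needed that can evaluate diverse open-ended scientific writing tasks.

Result: Experimental analysis shows that the training regime markedly improves LLM-based scientific writing evaluation. The models generalize across tasks and to previously unseen scientific writing evaluation settings, and can be reused without task-specific retraining.

Insight: The contributions are a two-stage training framework (first optimizing evaluation preferences, then refining reasoning), a multi-aspect evaluation design, and joint training across tasks, which together provide robustness to dynamic scoring rubrics and fine-grained assessment.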

Abstract: Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that initially optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.


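[14] Evaluating LLM Behavior in Hiring: Implicit Weights, Fairness Across Groups, and Alignment with Human Preferences cs.CL | cs.AI | cs.CY | cs.SI

Morgane Hoffmann, Emma Jouffroy, Warren Jouanneau, Marc Palyart, Charles Pebereau

TL;DR: This paper proposes a framework for evaluating the decision logic of large language models in hiring. Drawing on methods from economics, it builds synthetic datasets and shows how the model weighs different match-relevant criteria and how these weights vary across project contexts and demographic subgroups. The model attends to core productivity signals but reads some features beyond their explicit matching value, and intersectional effects show that productivity signals carry different weights across demographic groups.

Details

Motivation: It is uncertain how LLMs assign importance to attributes in recruitment applications, and whether their decision logic is consistent with economic principles, recruiter preferences, and societal norms.

Result: The LLM weighs core productivity signals such as skills and experience but over-interprets certain features. On average it shows minimal discrimination against minority groups, but intersectional effects reveal that productivity signals are weighted differently across demographic groups.

Insight: The novelty is bringing economic methodology to the evaluation of LLM hiring behavior, using synthetic datasets and a full factorial design. Viewed objectively, the method exposes the implicit weights in model decisions and group-level fairness, and provides an experimental basis for aligning model decisions with human ones.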

Abstract: General-purpose Large Language Models (LLMs) show significant potential in recruitment applications, where decisions require reasoning over unstructured text, balancing multiple criteria, and inferring fit and competence from indirect productivity signals. Yet, it is still uncertain how LLMs assign importance to each attribute and whether such assignments are in line with economic principles, recruiter preferences or broader societal norms. We propose a framework to evaluate an LLM’s decision logic in recruitment, by drawing on established economic methodologies for analyzing human hiring behavior. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace and apply a full factorial design to estimate how a LLM weighs different match-relevant criteria when evaluating freelancer-project fit. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups. Finally, we explain how a comparable experimental setup could be implemented with human recruiters to assess alignment between model and human decisions. Our findings reveal that the LLM weighs core productivity signals, such as skills and experience, but interprets certain features beyond their explicit matching value. While showing minimal average discrimination against minority groups, intersectional effects reveal that productivity signals carry different weights between demographic groups.
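
A full factorial design, as used in this study, crosses every level of every attribute so that each attribute's effect can be estimated independently of the others. The sketch below enumerates such a design and computes a simple main effect as a difference in mean scores. The factor names, levels, and the toy scoring rule are hypothetical; in the paper the scores would come from the LLM's evaluations of freelancer-project fit.

```python
from itertools import product
from statistics import mean

def full_factorial(factors):
    """Return one profile (dict) per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*(factors[n] for n in names))]

def main_effect(profiles, scores, factor, level_a, level_b):
    """Difference in mean score between two levels of one factor."""
    mean_a = mean(s for p, s in zip(profiles, scores) if p[factor] == level_a)
    mean_b = mean(s for p, s in zip(profiles, scores) if p[factor] == level_b)
    return mean_a - mean_b

# Hypothetical attributes and a toy scorer standing in for the LLM's rating.
factors = {
    "skill_match": [0, 1],
    "experience": ["junior", "senior"],
    "group": ["A", "B"],
}
profiles = full_factorial(factors)
scores = [2 * p["skill_match"] + (1 if p["experience"] == "senior" else 0) for p in profiles]
skill_effect = main_effect(profiles, scores, "skill_match", 1, 0)
group_effect = main_effect(profiles, scores, "group", "A", "B")
```

Because the design is balanced, the toy scorer's weight of 2 on skill match is recovered exactly, and the group factor, which the scorer ignores, shows a zero effect.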


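[15] Do explanations generalize across large reasoning models? cs.CL | cs.AI

Koyena Pal, David Bau, Chandan Singh

TL;DR: This paper asks whether the chain-of-thought (CoT) explanations produced by large reasoning models (LRMs) generalize, that is, whether they capture general patterns of the problem rather than patterns specific to one model. By testing whether explanations from one LRM induce the same behavior in other LRMs, it finds that CoT explanations often increase consistency between models, and that this generalization correlates with human preference rankings and with reinforcement learning post-training. The paper also analyzes the conditions under which explanations yield consistent answers and proposes a simple sentence-level ensembling strategy that improves consistency.

Details

Motivation: The paper examines whether textual explanations generated by large reasoning models generalize. This matters for understanding or discovering new concepts, for example in AI for science: if explanations only reflect model-specific patterns, they cannot be relied on for insight into the underlying problem.

Result: CoT explanations increase consistency between different LRMs, and this generalization correlates with human preferences and reinforcement learning post-training. The proposed sentence-level ensembling strategy further improves cross-model consistency.

Insight: The paper defines and evaluates a concrete notion of LRM explanation generalization (inducing the same behavior across models), shows the generalization potential of CoT explanations and its link to human preference, and provides an analysis framework and an ensembling strategy to make explanations more reliable. Its results counsel caution when using LRM explanations for scientific discovery.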

Abstract: Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.


cs.CV

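[16] Future Optical Flow Prediction Improves Robot Control & Video Generation cs.CV

Kanchana Ranasinghe, Honglu Zhou, Yu Fang, Luyu Yang, Le Xue

TL;DR: This paper presents FOFPred, a language-conditioned optical flow forecasting model that combines a vision-language model (VLM) with a diffusion architecture. It learns from web-scale human activity video data to predict dense future motion representations and is applied to two downstream tasks: robotic manipulation and video generation.

Details

Motivation: Forecasting generalizable, spatially dense future motion representations such as optical flow, and learning to do so from noisy real-world data, remains a key challenge. This capability is valuable for control and generative tasks.

Result: Evaluations on language-driven robotic manipulation and video generation demonstrate the cross-domain versatility of FOFPred and confirm the value of its unified architecture and of learning from diverse web data.

Insight: The main novelty is unifying a vision-language model with a diffusion model in one architecture, which provides strong multimodal reasoning together with pixel-level generative fidelity for future motion prediction. The paper also shows how to learn effectively from large-scale, unstructured web video-caption data.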

Abstract: Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.


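[17] ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research cs.CV

Gerhard Krumpl, Henning Avenhaus, Horst Possegger

TL;DR: This paper introduces ICONIC-444, a large-scale industrial image dataset designed for out-of-distribution (OOD) detection research, containing over 3.1 million RGB images across 444 classes. It fills gaps in existing datasets by providing structured, diverse data that supports rigorous evaluation across difficulty levels (near- to far-OOD) and across fine- and coarse-grained computer vision tasks.

Details

Motivation: Progress in OOD detection is limited by the lack of large, high-quality datasets that clearly define OOD categories at different difficulty levels and support both fine- and coarse-grained computer vision tasks.

Result: The paper defines four reference tasks on ICONIC-444 and provides baseline results for 22 state-of-the-art post-hoc OOD detection methods, for benchmarking and advancing OOD detection research.

Insight: The novelty is a large-scale industrial image dataset built specifically for OOD detection, which mimics real-world tasks and provides a structured evaluation framework that existing datasets lack. Viewed objectively, the dataset's focus on task complexity and a spectrum of difficulty supports systematic evaluation and comparison of OOD detection methods.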

Abstract: Current progress in out-of-distribution (OOD) detection is limited by the lack of large, high-quality datasets with clearly defined OOD categories across varying difficulty levels (near- to far-OOD) that support both fine- and coarse-grained computer vision tasks. To address this limitation, we introduce ICONIC-444 (Image Classification and OOD Detection with Numerous Intricate Complexities), a specialized large-scale industrial image dataset containing over 3.1 million RGB images spanning 444 classes tailored for OOD detection research. Captured with a prototype industrial sorting machine, ICONIC-444 closely mimics real-world tasks. It complements existing datasets by offering structured, diverse data suited for rigorous OOD evaluation across a spectrum of task complexities. We define four reference tasks within ICONIC-444 to benchmark and advance OOD detection research and provide baseline results for 22 state-of-the-art post-hoc OOD detection methods.
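
As background on what a post-hoc OOD detection method looks like, the sketch below implements maximum softmax probability (MSP), a widely used baseline that scores a sample using only the logits of an already trained classifier. The abstract does not list the 22 methods evaluated, so MSP is shown here only as a representative example, and the threshold value is an arbitrary assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability: higher means more in-distribution."""
    return max(softmax(logits))

def is_ood(logits, threshold=0.5):
    """Flag a sample as OOD when the classifier's top probability is low.

    The threshold is a hypothetical value; in practice it is tuned on held-out data.
    """
    return msp_score(logits) < threshold
```

A confident prediction such as logits `[10.0, 0.0, 0.0]` is kept as in-distribution, while nearly flat logits such as `[0.1, 0.0, 0.05]` fall below the threshold and are flagged.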


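[18] Can Vision-Language Models Understand Construction Workers? An Exploratory Study cs.CV | cs.AI

Hieu Bui, Nathaniel E. Chodosh, Arash Tavakoli

TL;DR: This study evaluates three leading vision-language models (GPT-4o, Florence 2, and LLaVa-1.5) on recognizing worker actions and emotions in static construction site images. Tested on a dataset of 1,000 images annotated with ten action and ten emotion categories, GPT-4o performs best on both tasks, while all models struggle to distinguish semantically close categories.

Details

Motivation: As robots are increasingly integrated into construction workflows, their ability to understand and respond to human behavior is essential for safe and effective collaboration. Vision-language models (VLMs) promise to recognize human behavior without extensive domain-specific training, which is attractive in construction, where labeled data is scarce and monitoring worker actions and emotions is critical for safety and productivity.

Result: In action recognition, GPT-4o achieves an average F1-score of 0.756 and accuracy of 0.799; in emotion recognition, an F1-score of 0.712 and accuracy of 0.773, the highest in both tasks. Florence 2 performs moderately and LLaVa-1.5 has the lowest overall performance. Confusion matrix analysis shows that all models struggle to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors.

Insight: General-purpose VLMs offer a baseline capability for human behavior recognition in construction environments, but real-world reliability still requires improvements such as domain adaptation, temporal modeling, or multimodal sensing. The paper's contribution is the first systematic evaluation of the potential and limits of VLMs for recognizing construction workers' actions and emotions.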

Abstract: As robotics become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVa-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model’s outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores across both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 for action and 0.414 for emotion, while LLaVa-1.5 showed the lowest overall performance, with F1-scores of 0.466 for action and 0.461 for emotion. Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability.
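
The accuracy and average F1 figures quoted above can be computed from predicted and true labels as sketched below. The abstract says "average F1-score" without naming the averaging scheme, so macro-averaging (the unweighted mean of per-class F1) is an assumption here, and the class labels in the example are invented.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Invented labels, for illustration only.
y_true = ["lifting", "lifting", "talking", "talking"]
y_pred = ["lifting", "talking", "talking", "talking"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro-averaging treats every category equally, so confusion between two semantically close classes lowers the score even when overall accuracy stays fairly high.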


[19] Effects of Different Attention Mechanisms Applied on 3D Models in Video Classification cs.CV

Mohammad Rasras, Iuliana Marin, Serban Radu, Irina Mocanu

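TL;DR: This paper studies the effect of applying different attention mechanisms to 3D CNN models (MC3, R3D, and R(2+1)D) for video classification, in particular by building restricted-temporal models that capture less temporal information while using higher frame resolution. The authors create ten variants of each base model, integrating modules such as CBAM, TCN, multi-head attention, and channel attention, and test them on the UCF101 dataset.

Details

Motivation: To explore how capturing less knowledge from temporal data while increasing frame resolution affects 3D CNN performance in video action recognition, and to evaluate how effective different attention mechanisms are in restricted-temporal models.

Result: On UCF101, the modified R(2+1)D model with multi-head attention reached 88.98% accuracy, suggesting that attention modules can improve the performance of restricted-temporal models.

Insight: The novelty lies in systematically integrating several attention mechanisms (such as CBAM, TCN, and multi-head attention) into 3D CNNs to compensate for the loss of temporal features, and in showing that the different attention modules affect class-level accuracy differently even though their overall performance gains are similar.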

Abstract: Human action recognition has become an important research focus in computer vision due to its wide range of applications. 3D ResNet-based CNN models, particularly MC3, R3D, and R(2+1)D, use different convolutional filters to extract spatiotemporal features. This paper investigates the impact of reducing the knowledge captured from temporal data while increasing the resolution of the frames. To set up this experiment, we created designs similar to the three originals, but with a dropout layer added before the final classifier. We then developed ten new versions of each of these three designs. The variants include special attention blocks within their architecture, such as the convolutional block attention module (CBAM) and temporal convolution networks (TCN), in addition to multi-head and channel attention mechanisms. The purpose is to observe how much each of these blocks influences the performance of the restricted-temporal models. Testing all the models on UCF101 shows an accuracy of 88.98% for the variant with multi-head attention added to the modified R(2+1)D. The paper concludes that missing temporal features significantly affect the performance of the newly created increased-resolution models. The variants behaved differently in class-level accuracy, despite the similarity of their improvements in overall performance.


[20] FrankenMotion: Part-level Human Motion Generation and Composition cs.CV

Chuqiao Li, Xianghui Xie, Yong Cao, Andreas Geiger, Gerard Pons-Moll

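TL;DR: This paper presents FrankenMotion, a diffusion-based, part-aware human motion generation framework that generates motion from fine-grained, temporally-aware part-level text descriptions. To address the lack of fine-grained annotations in existing methods, the authors use large language models to build a high-quality motion dataset with atomic, temporally-aware part-level text annotations. The framework allows independent spatial and temporal control of individual body parts, making it possible to compose motions unseen during training.

Details

Motivation: Existing text-prompted human motion generation methods rely mainly on sequence-level or action-level descriptions and lack fine-grained control over individual body parts, largely because fine-grained, part-level motion annotations are missing.

Result: Experiments show that, on the authors' dataset, FrankenMotion outperforms all baseline models adapted and retrained for this setting, and that it can compose motions unseen during training.

Insight: The main contributions are: 1) using LLMs to build what the authors describe as the first part-level motion annotation dataset that is asynchronous, semantically distinct, and of fine temporal resolution, moving beyond earlier datasets limited to synchronized captions, fixed time segments, or global labels; 2) a diffusion framework that, for the first time according to the authors, generates motion with independent control over body parts (spatial) and atomic actions (temporal).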

Abstract: Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts. In this work, we construct a high-quality motion dataset with atomic, temporally-aware part-level text annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution. Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and have a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.


[21] Self-learned representation-guided latent diffusion model for breast cancer classification in deep ultraviolet whole surface images cs.CV | cs.AI | cs.LG

Pouya Afshin, David Helminiak, Tianling Niu, Julie M. Jorns, Tina Yen

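TL;DR: This paper proposes a self-supervised-learning-guided latent diffusion model that generates high-quality synthetic training patches, addressing the scarcity of annotated data for classifying deep ultraviolet whole surface images in breast-conserving surgery. The latent diffusion model is guided by embeddings from a fine-tuned DINO teacher, which injects rich semantic detail about cellular structures into the synthetic data; a Vision Transformer is then fine-tuned on real and synthetic patches, and patch predictions are aggregated for WSI-level classification.

Details

Motivation: To address the scarcity of annotated Deep Ultraviolet Fluorescence Scanning Microscopy images in breast-conserving surgery, so that robust deep learning models can be trained for intraoperative margin assessment.

Result: In 5-fold cross-validation, the method reaches 96.47% accuracy and reduces the FID score to 45.72, significantly outperforming class-conditioned baselines.

Insight: The novelty lies in combining guidance from self-supervised (DINO) semantic embeddings with a latent diffusion model to generate synthetic data rich in cellular-structure detail, which mitigates data scarcity and improves classification performance.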

Abstract: Breast-Conserving Surgery (BCS) requires precise intraoperative margin assessment to preserve healthy tissue. Deep Ultraviolet Fluorescence Scanning Microscopy (DUV-FSM) offers rapid, high-resolution surface imaging for this purpose; however, the scarcity of annotated DUV data hinders the training of robust deep learning models. To address this, we propose a Self-Supervised Learning (SSL)-guided Latent Diffusion Model (LDM) to generate high-quality synthetic training patches. By guiding the LDM with embeddings from a fine-tuned DINO teacher, we inject rich semantic details of cellular structures into the synthetic data. We combine real and synthetic patches to fine-tune a Vision Transformer (ViT), utilizing patch prediction aggregation for WSI-level classification. Experiments using 5-fold cross-validation demonstrate that our method achieves 96.47% accuracy and reduces the FID score to 45.72, significantly outperforming class-conditioned baselines.
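
The abstract mentions aggregating patch predictions into a WSI-level decision but does not say how. The sketch below shows one plausible rule, a thresholded fraction of positive patches; the function name `aggregate_patches` and both thresholds are hypothetical choices for illustration, not the authors' procedure.

```python
def aggregate_patches(patch_probs, patch_threshold=0.5, min_positive_fraction=0.3):
    """Turn per-patch malignancy probabilities into one image-level label.

    Returns (label, fraction): label is 1 when the share of patches whose
    probability reaches `patch_threshold` is at least `min_positive_fraction`.
    Both thresholds are assumed values that would need tuning on held-out data.
    """
    if not patch_probs:
        raise ValueError("need at least one patch prediction")
    positives = sum(1 for p in patch_probs if p >= patch_threshold)
    fraction = positives / len(patch_probs)
    label = 1 if fraction >= min_positive_fraction else 0
    return label, fraction


# Hypothetical patch probabilities for two images.
print(aggregate_patches([0.9, 0.8, 0.1, 0.2]))
print(aggregate_patches([0.1, 0.2, 0.6, 0.3, 0.1]))
```

A fraction-based rule is less sensitive to a single misclassified patch than taking the maximum patch probability, at the cost of possibly missing very small positive regions.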


[22] Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images cs.CV | cs.AI

David Szczecina, Hudson Sun, Anthony Bertnyk, Niloofar Azad, Kyle Gao

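TL;DR: For tree canopy detection under extreme data scarcity, with only 150 annotated images, this paper evaluates five representative models: YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2. Pretrained convolution-based models (particularly YOLOv11 and Mask R-CNN) generalize significantly better than transformer-based models, which underperform, likely because of their high data requirements, their lack of strong inductive biases, and the mismatch between semantic and instance segmentation tasks.

Details

Motivation: To address the severe overfitting that arises when training deep models for tree canopy detection with scarce real-world annotations (only 150 annotated images), and to assess how suitable different architectures are under extreme data scarcity.

Result: On the small, imbalanced dataset of the Solafune Tree Canopy Detection competition, the pretrained convolution-based models (YOLOv11 and Mask R-CNN) perform best, while DeepLabv3, Swin-UNet, and DINOv2 underperform. The authors take this as confirmation that transformer-based architectures need substantial pretraining or augmentation in low-data regimes, and that the difference between semantic and instance segmentation also affects performance.

Insight: In computer vision tasks with extremely scarce data, lightweight CNN-based methods (such as YOLOv11 and Mask R-CNN) are more reliable than transformer models thanks to their strong inductive biases and pretraining; matching the task definition (instance versus semantic segmentation) also matters for model selection.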

Abstract: Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet, and DINOv2 underperform, likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.


[23] PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis cs.CV | cs.AI | cs.CL

K Lokesh, Abhirama Subramanyam Penamakuri, Uday Agarwal, Apoorva Challa, Shreya K Gowda

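TL;DR: This paper proposes a Pre-Consultation Dialogue Framework (PCDF) that improves diagnostic accuracy by simulating multi-turn dialogue between a doctor and a patient. Two vision-language models (VLMs) interact: DocVLM generates follow-up questions from the medical image and the dialogue history, and PatientVLM answers using a symptom profile derived from the ground-truth diagnosis. A small-scale clinical validation confirmed the clinical relevance, symptom coverage, and realism of the generated synthetic symptoms. Fine-tuning DocVLM on these dialogues yields substantial gains over image-only training.

Details

Motivation: AI research on medical diagnosis has traditionally focused on image analysis and lacks patient-reported symptoms, which limits diagnostic accuracy. The paper aims to close this gap by mimicking the real diagnostic process, integrating symptom information through iterative doctor-patient dialogue.

Result: The synthetic symptoms generated by PCDF showed good clinical relevance, symptom coverage, and realism in clinical validation. Fine-tuning DocVLM on the dialogue data gave substantial gains on diagnosis over image-only training, highlighting the value of dialogue-based supervision built on symptom elicitation.

Insight: The novelty is a pre-consultation framework that simulates realistic doctor-patient dialogue: the interaction of two VLMs produces high-quality diagnostic dialogue data, which is then used to improve the diagnostic model. This offers a way to combine multimodal information (images and textual symptoms) for medical diagnosis and underscores the importance of symptom elicitation in the diagnostic process.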

Abstract: Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision-language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. These findings indicate that the resulting DocVLM-PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.


[24] MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement cs.CV

Meidan Ding, Jipeng Zhang, Wenxuan Wang, Haiqin Zhong, Xiaoling Luo

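TL;DR: This paper proposes MMedExpert-R1, a multimodal medical reasoning model that addresses the weakness of existing medical vision-language models in complex clinical reasoning through domain-specific adaptation and clinical guideline reinforcement. The authors build MMedExpert, a high-quality dataset of 10K samples across four medical specialties, use domain-specific adaptation to create specialty-specific LoRA modules, model different clinical reasoning perspectives with Guideline-Based Advantages, and finally unify the expert modules through Conflict-Aware Capability Integration, reaching state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing medical vision-language models excel at perception tasks but struggle with the complex clinical reasoning required in real-world scenarios. The main problems are the scarcity of deep reasoning data, cold-start limits on multi-specialty alignment, and the inability of standard reinforcement learning algorithms to model the diversity of clinical reasoning.

Result: The 7B model scores 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, which the authors report as state-of-the-art performance and a robust foundation for reliable multimodal medical reasoning systems.

Insight: The contributions include a high-quality multi-specialty reasoning dataset, domain-specific adaptation that provides diverse initialization, Guideline-Based Advantages that model the diversity of clinical reasoning, and Conflict-Aware Capability Integration that ensures multi-specialty alignment. These ideas may transfer to other AI systems that need complex reasoning and multi-domain adaptation.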

Abstract: Medical Vision-Language Models (MedVLMs) excel at perception tasks but struggle with complex clinical reasoning required in real-world scenarios. While reinforcement learning (RL) has been explored to enhance reasoning capabilities, existing approaches face critical mismatches: deep reasoning data is scarce, cold-start limits multi-specialty alignment, and standard RL algorithms fail to model clinical reasoning diversity. We propose MMedExpert-R1, a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and clinical guideline reinforcement. We construct MMedExpert, a high-quality dataset of 10K samples across four specialties with step-by-step reasoning traces. Our Domain-Specific Adaptation (DSA) creates specialty-specific LoRA modules to provide diverse initialization, while Guideline-Based Advantages (GBA) explicitly models different clinical reasoning perspectives to align with real-world diagnostic strategies. Conflict-Aware Capability Integration then merges these specialized experts into a unified agent, ensuring robust multi-specialty alignment. Comprehensive experiments demonstrate state-of-the-art performance, with our 7B model achieving 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, establishing a robust foundation for reliable multimodal medical reasoning systems.


[25] Your One-Stop Solution for AI-Generated Video Detection cs.CV | cs.AI

Long Ma, Zihao Xue, Yan Wang, Zhiyuan Yan, Jin Xu

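TL;DR: This paper presents AIGVDBench, a comprehensive and representative benchmark for AI-generated video detection. It covers 31 state-of-the-art generation models and more than 440,000 videos, and reports more than 1,500 evaluations of 33 existing detectors, with the aim of addressing the field's shortcomings in dataset scale and diversity and in depth of benchmark analysis.

Details

Motivation: AI video generation is advancing quickly and synthetic videos are increasingly realistic, so reliable detection methods are urgently needed. Progress is held back by two limitations: existing datasets are small, built with outdated models, and lack diversity; and existing benchmarks mostly stop at dataset creation, without systematic in-depth analysis.

Result: By building AIGVDBench and running large-scale evaluations, the work presents 8 in-depth analyses from multiple perspectives and identifies 4 novel findings of value for future research.

Insight: The main contribution is a large-scale benchmark with broad coverage and strong representativeness, together with systematic evaluation and analysis, providing a solid foundation for research on AI-generated video detection. Viewed objectively, it moves benchmark construction beyond dataset collection toward systematic evaluation and in-depth analysis, which gives it guiding value.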

Abstract: Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field. From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analyses yet to be systematically explored. Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state-of-the-art generation models and over 440,000 videos. By executing more than 1,500 evaluations on 33 existing detectors belonging to four distinct categories, this work presents 8 in-depth analyses from multiple perspectives and identifies 4 novel findings that offer valuable insights for future research. We hope this work provides a solid foundation for advancing the field of AI-generated video detection. Our benchmark is open-sourced at https://github.com/LongMa-2025/AIGVDBench.


[26] M3DDM+: An improved video outpainting by a modified masking strategy cs.CV

Takuya Murakawa, Takumi Fukuzawa, Ning Ding, Toru Tamaki

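TL;DR: This paper proposes M3DDM+, an improved video outpainting method. By modifying the masking strategy of M3DDM, it addresses the spatial blur and temporal inconsistency the model shows in challenging scenarios such as limited camera motion or large outpainting regions, improving visual fidelity and temporal coherence.

Details

Motivation: M3DDM degrades significantly in information-limited scenarios (limited camera motion or large outpainting regions), showing spatial blur and temporal inconsistency. The authors identify the root cause as a mismatch between the masking strategies used in training and inference.

Result: Experiments show that M3DDM+ substantially improves visual fidelity and temporal coherence in information-limited scenarios while maintaining computational efficiency.

Insight: The core contribution is identifying and resolving the training-inference mismatch: a uniform mask direction and width is applied to all frames during training, and the pretrained M3DDM model is fine-tuned. This offers a reference for designing more consistent training strategies in diffusion-based video generation and editing.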

Abstract: M3DDM provides a computationally efficient framework for video outpainting via latent diffusion modeling. However, it exhibits significant quality degradation – manifested as spatial blur and temporal inconsistency – under challenging scenarios characterized by limited camera motion or large outpainting regions, where inter-frame information is limited. We identify the cause as a training-inference mismatch in the masking strategy: M3DDM’s training applies random mask directions and widths across frames, whereas inference requires consistent directional outpainting throughout the video. To address this, we propose M3DDM+, which applies uniform mask direction and width across all frames during training, followed by fine-tuning of the pretrained M3DDM model. Experiments demonstrate that M3DDM+ substantially improves visual fidelity and temporal coherence in information-limited scenarios while maintaining computational efficiency. The code is available at https://github.com/tamaki-lab/M3DDM-Plus.
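
The mismatch described in the abstract can be made concrete with a small sketch that contrasts per-frame random masks with a single mask shared across a clip. The helper names, the four-direction set, and the band-width sampling range are assumptions for illustration and are not taken from the M3DDM or M3DDM+ code.

```python
import random

DIRECTIONS = ("left", "right", "top", "bottom")


def make_mask(height, width, direction, band):
    """Binary mask: 1 marks pixels to be outpainted, 0 marks known pixels."""
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if ((direction == "left" and x < band)
                    or (direction == "right" and x >= width - band)
                    or (direction == "top" and y < band)
                    or (direction == "bottom" and y >= height - band)):
                mask[y][x] = 1
    return mask


def random_per_frame_masks(num_frames, height, width, max_band, rng):
    """Every frame draws its own direction and width (the training setup the abstract criticizes)."""
    return [make_mask(height, width, rng.choice(DIRECTIONS), rng.randint(1, max_band))
            for _ in range(num_frames)]


def uniform_masks(num_frames, height, width, max_band, rng):
    """One direction and width drawn per clip and shared by every frame."""
    shared = make_mask(height, width, rng.choice(DIRECTIONS), rng.randint(1, max_band))
    return [[row[:] for row in shared] for _ in range(num_frames)]


rng = random.Random(0)
clip = uniform_masks(num_frames=4, height=6, width=8, max_band=3, rng=rng)
print(all(mask == clip[0] for mask in clip))  # every frame shares one mask
```

With the uniform variant, the region a frame must synthesize is never visible in neighbouring frames of the same clip, which mirrors the consistent directional outpainting that the abstract says inference requires.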


[27] PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models cs.CV

Qiyuan Zhang, Biao Gong, Shuai Tan, Zheng Zhang, Yujun Shen

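TL;DR: This paper proposes PhysRVG, a physics-aware reinforcement learning paradigm for video generation models that, for the first time according to the authors, enforces physical collision rules directly in high-dimensional spaces, so that physics knowledge is strictly applied rather than treated as a condition. The paradigm is extended to a unified framework, the Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning while fully preserving the model's ability to use physics-grounded feedback. To validate the approach, the authors build a new benchmark, PhysRVGBench, and run extensive experiments.

Details

Motivation: Transformer-based video generation models largely overlook physical principles and are notably limited in rendering rigid body motion, so generated videos lack physical realism. Pretrain-finetune paradigms discard the concept of object rigidity during pixel-level global denoising, and even correct mathematical constraints are treated only as suboptimal conditions during post-training optimization.

Result: Comprehensive qualitative and quantitative experiments on the new PhysRVGBench benchmark are used to assess the method's effectiveness. The abstract gives no specific numbers, but implies that reinforcing physical rules improves the physical realism of generated videos.

Insight: The novelty is introducing physics-aware reinforcement learning to video generation, enforcing physical collision rules directly in high-dimensional spaces so that physics knowledge is strictly applied. The unified Mimicry-Discovery Cycle framework balances physics-grounded feedback with model fine-tuning, offering a new approach to improving the physical plausibility of generated content.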

Abstract: Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. This gap highlights a critical limitation in rendering rigid body motion, a core tenet of classical mechanics. While computer graphics and physics-based simulators can easily model such collisions using Newtonian formulas, modern pretrain-finetune paradigms discard the concept of object rigidity during pixel-level global denoising. Even perfectly correct mathematical constraints are treated as suboptimal solutions (i.e., conditions) during model optimization in post-training, fundamentally limiting the physical realism of generated videos. Motivated by these considerations, we introduce, for the first time, a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces, ensuring the physics knowledge is strictly applied rather than treated as conditions. Subsequently, we extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning while fully preserving the model’s ability to leverage physics-grounded feedback. To validate our approach, we construct a new benchmark, PhysRVGBench, and perform extensive qualitative and quantitative experiments to thoroughly assess its effectiveness.


[28] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning cs.CV | cs.AI | cs.GR

Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Xiuyu Li, Michael J. Black

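TL;DR: This paper presents VIGA (Vision-as-Inverse-Graphic Agent), an agent that tackles vision-as-inverse-graphics, that is, reconstructing an image as an editable graphics program, through interleaved multimodal reasoning (iterative execution and verification). VIGA uses a closed-loop write-run-render-compare-revise procedure and handles tasks such as 3D reconstruction, scene editing, and physical interaction without auxiliary modules.

Details

Motivation: To address the inability of existing vision-language models (VLMs) to solve vision-as-inverse-graphics in one shot, owing to their lack of fine-grained spatial and physical grounding; generating an editable graphics program from an image is a long-standing goal.

Result: On the BlenderGym and SlideBench benchmarks, VIGA improves over one-shot baselines by 35.32% and 117.17% respectively; on BlenderBench, the benchmark newly introduced by the authors, VIGA improves by 124.70%.

Insight: The novelty is a closed-loop agent framework built on interleaved multimodal reasoning, combining a skill library (alternating generator and verifier roles) with an evolving context memory (storing plans, code diffs, and render history). Because the agent is task-agnostic and model-agnostic, it enables a unified evaluation protocol, and the design supports long-horizon reasoning.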

Abstract: Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs aren’t able to achieve this in one shot as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphic Agent) that starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic as it doesn’t require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, we found VIGA substantially improves over one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic as it doesn’t require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, where VIGA improves by 124.70%.
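
The write-run-render-compare-revise procedure can be sketched as a generic control loop. Everything below is a hypothetical stand-in: in VIGA the `write` and `compare` roles would be played by a VLM and `run_and_render` by a graphics engine, whereas here they are toy functions over integers so that the loop itself is runnable. It is not the authors' implementation.

```python
def closed_loop(target, write, run_and_render, compare, max_rounds=5, tolerance=0.0):
    """Iterate write -> run/render -> compare -> revise until the render matches."""
    memory = []      # evolving context: one (program, render, error) tuple per round
    program = None
    for _ in range(max_rounds):
        program = write(target, memory)       # generator role: propose or revise a program
        render = run_and_render(program)      # execute the program and render the result
        error = compare(render, target)       # verifier role: score the render against the target
        memory.append((program, render, error))
        if error <= tolerance:
            break
    return program, memory


# Toy stand-ins: a "program" is an integer and rendering is the identity function.
def toy_write(target, memory):
    if not memory:
        return 0                               # start from an "empty world"
    prev_program, prev_render, _ = memory[-1]
    return prev_program + (1 if prev_render < target else -1)


program, memory = closed_loop(
    target=3,
    write=toy_write,
    run_and_render=lambda p: p,
    compare=lambda render, target: abs(render - target),
)
print(program, len(memory))
```

Keeping the full history in `memory` is what lets the generator revise instead of restarting, which is the role the abstract assigns to the evolving context memory.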


[29] SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention cs.CV

Ruibang Li, Guan Luo, Yiwei Zhang, Jin Gao, Bing Li

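TL;DR: This paper proposes SoLA-Vision, a fine-grained layer-wise hybrid attention design that strategically inserts a small number of global softmax attention layers, maintaining performance on vision tasks while significantly reducing computational cost.

Details

Motivation: Standard softmax self-attention performs well on vision tasks but has quadratic complexity O(N^2), which limits deployment at high resolution; linear attention reduces the cost to O(N), but its compressed state representation can hurt modeling capacity and accuracy. The paper looks for a hybrid attention scheme that balances accuracy and computational cost.

Result: On ImageNet-1K classification, SoLA-Vision outperforms purely linear and other hybrid attention models; on dense prediction tasks, it consistently surpasses strong baselines by a considerable margin.

Insight: The paper analyzes linear and softmax attention from a layer-stacking perspective and systematically experiments with layer-wise hybridization patterns. It finds that fine-grained layer-wise hybridization can match or surpass rigid intra-block hybrid designs while using fewer softmax layers, giving a strong trade-off between accuracy and computational cost.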

Abstract: Standard softmax self-attention excels in vision tasks but incurs quadratic complexity O(N^2), limiting high-resolution deployment. Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning from a layer-stacking perspective. We further conduct systematic experiments on layer-wise hybridization patterns of linear and softmax attention. Our results show that, compared with rigid intra-block hybrid designs, fine-grained layer-wise hybridization can match or surpass performance while requiring fewer softmax layers. Building on these findings, we propose SoLA-Vision (Softmax-Linear Attention Vision), a flexible layer-wise hybrid attention backbone that enables fine-grained control over how linear and softmax attention are integrated. By strategically inserting a small number of global softmax layers, SoLA-Vision achieves a strong trade-off between accuracy and computational cost. On ImageNet-1K, SoLA-Vision outperforms purely linear and other hybrid attention models. On dense prediction tasks, it consistently surpasses strong baselines by a considerable margin. Code will be released.
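
To make the cost argument concrete, the sketch below builds a layer schedule that mixes linear and softmax attention and estimates the cost of the attention products alone. The even spacing of softmax layers is an assumption made here for illustration, not the pattern SoLA-Vision actually uses, and the cost model is the usual rough approximation (about 2·N²·d multiply-accumulates for softmax attention and 2·N·d² for linear attention) that ignores projections, normalization, and multiple heads.

```python
def hybrid_schedule(num_layers, num_softmax):
    """Spread `num_softmax` softmax layers evenly among linear-attention layers."""
    if not 0 <= num_softmax <= num_layers:
        raise ValueError("num_softmax must be between 0 and num_layers")
    schedule = ["linear"] * num_layers
    for k in range(num_softmax):
        # Index at the centre of the k-th of `num_softmax` equal segments.
        idx = ((2 * k + 1) * num_layers) // (2 * num_softmax)
        schedule[idx] = "softmax"
    return schedule


def attention_cost(schedule, num_tokens, dim):
    """Rough multiply-accumulate count for the attention products only."""
    per_layer = {
        "softmax": 2 * num_tokens * num_tokens * dim,  # QK^T and AV: O(N^2 d)
        "linear": 2 * num_tokens * dim * dim,          # K^T V and Q(K^T V): O(N d^2)
    }
    return sum(per_layer[kind] for kind in schedule)


schedule = hybrid_schedule(num_layers=12, num_softmax=3)
full = attention_cost(["softmax"] * 12, num_tokens=4096, dim=64)
hybrid = attention_cost(schedule, num_tokens=4096, dim=64)
print(schedule)
print(hybrid / full)
```

Under this model a softmax layer costs N/d times as much as a linear one, so the saving from replacing softmax layers grows with resolution; which layers should stay softmax is the question the paper's experiments address.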


[30] ATATA: One Algorithm to Align Them All cs.CV

Boyi Pang, Savva Ignatyev, Vladimir Ippolitov, Ramil Khafizov, Yurii Melnik

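TL;DR: This paper proposes ATATA, a new multi-modal algorithm for joint inference of paired, structurally aligned samples with Rectified Flow models. The method relies on the joint transport of a segment in the sample space, which yields faster computation at inference time, and it can be built on top of an arbitrary Rectified Flow model operating on a structured latent space. Experiments on image, video, and 3D shape generation show a high degree of structural alignment and high visual quality: the method improves the state of the art for image and video generation, and for 3D generation it reaches comparable quality while running much faster.

Details

Motivation: Existing methods do not approach joint generation from a structural alignment perspective, and methods based on Score Distillation Sampling (SDS) are time-consuming, prone to mode collapse, and often produce cartoonish results. An efficient joint inference algorithm that preserves structural alignment is therefore needed.

Result: Evaluated with state-of-the-art baselines on image, video, and 3D shape generation against both editing-based and joint-inference-based competitors, the method improves the state of the art for image and video generation and reaches comparable quality for 3D generation while running orders of magnitude faster.

Insight: The novelty is reframing joint generation from a structural alignment perspective and proposing a fast inference method based on the joint transport of a segment in the sample space, which avoids the drawbacks of SDS while keeping cross-domain outputs structurally consistent and of high quality.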

Abstract: We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it is able to show comparable quality while working orders of magnitude faster.


[31] Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification cs.CV

Zhiqi Pang, Lingling Zhao, Yang Liu, Chunyu Wang, Gaurav Sharma

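TL;DR: This paper proposes a new task, unsupervised multi-scenario (UMS) person re-identification, which handles scenarios such as cross-resolution and clothing change within a single framework. To tackle it, the authors propose image-text knowledge modeling (ITKM), a three-stage framework built on a pre-trained CLIP model: Stage I introduces a scenario embedding in the image encoder and fine-tunes it; Stage II optimizes text embeddings to associate with pseudo-labels and introduces a multi-scenario separation loss; Stage III uses heterogeneous matching modules and a dynamic text representation update strategy to obtain reliable heterogeneous positive pairs and keep text and image supervision signals consistent. Experiments show that the method outperforms existing scenario-specific methods across multiple scenarios and improves overall performance by integrating multi-scenario knowledge.

Details

Motivation: Existing person re-identification methods are usually designed for a single scenario and lack a unified framework for scenarios such as cross-resolution and clothing change; this motivates the unsupervised multi-scenario person re-identification task.

Result: Experiments show that ITKM is superior and generalizes well across multiple scenarios (such as cross-resolution and clothing change): it outperforms existing scenario-specific methods, further improves overall performance by integrating knowledge from multiple scenarios, and is reported to reach a state-of-the-art level.

Insight: The contributions include: the new unsupervised multi-scenario person re-identification task; the three-stage ITKM framework, which exploits the representational power of vision-language models; and the scenario embedding, multi-scenario separation loss, heterogeneous matching modules, and dynamic text representation update strategy, which adaptively leverage multi-scenario knowledge and keep supervision consistent.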

Abstract: We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) – a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.


[32] Language-Agnostic Visual Embeddings for Cross-Script Handwriting Retrieval cs.CV

Fangke Chen, Tianhao Dong, Sirry Chen, Guobin Zhang, Yishu Zhang

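TL;DR: This paper proposes a lightweight asymmetric dual-encoder framework for cross-script handwriting retrieval. By jointly optimizing instance-level alignment and class-level semantic consistency, it learns unified, style-invariant visual embeddings anchored to language-agnostic semantic prototypes, enforcing invariance across scripts and writing styles.

Details

Motivation: To address the challenges that large handwriting variability and cross-lingual semantic gaps pose for handwritten word retrieval, and to overcome the prohibitive computational cost of large vision-language models, which hinders practical edge deployment.

Result: The method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks; its effectiveness is also validated on explicit cross-lingual retrieval, where the query language differs from the target language, with only a fraction of the parameters of existing models.

Insight: The novelty is a lightweight asymmetric dual-encoder framework that learns language-agnostic visual embeddings by jointly optimizing instance-level alignment and semantic consistency, achieving invariance across scripts and writing styles and offering a practical route to resource-efficient cross-script handwriting retrieval.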

Abstract: Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive computational costs hinder practical edge deployment. To address this, we propose a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings. By jointly optimizing instance-level alignment and class-level semantic consistency, our approach anchors visual embeddings to language-agnostic semantic prototypes, enforcing invariance across scripts and writing styles. Experiments show that our method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks. We further conduct explicit cross-lingual retrieval, where the query language differs from the target language, to validate the effectiveness of the learned cross-lingual representations. Achieving strong performance with only a fraction of the parameters required by existing models, our framework enables accurate and resource-efficient cross-script handwriting retrieval.


[33] FTDMamba: Frequency-Assisted Temporal Dilation Mamba for Unmanned Aerial Vehicle Video Anomaly Detection cs.CV

Cheng-Zhuang Liu, Si-Bao Chen, Qing-Ling Shu, Chris Ding, Jin Tang

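TL;DR: This paper proposes FTDMamba, a Frequency-Assisted Temporal Dilation Mamba network for unmanned aerial vehicle (UAV) video anomaly detection that targets multi-source motion coupling under dynamic backgrounds, and builds MUVAD, a new large-scale dataset with dynamic backgrounds.

Details

Motivation: Existing video anomaly detection methods mainly target ground-based surveillance or UAV videos with static backgrounds, and research on UAV videos with dynamic backgrounds is limited. In dynamic UAV video, object motion and UAV-induced global motion are coupled, so existing methods may misclassify normal UAV movement as anomalous or fail to capture true anomalies.

Result: Extensive experiments on two existing public static benchmarks and the newly built MUVAD dataset show that FTDMamba achieves state-of-the-art performance.

Insight: The contributions include: 1) a Frequency Decoupled Spatiotemporal Correlation Module, which disentangles motion patterns and models global spatiotemporal dependencies through frequency analysis; 2) a Temporal Dilation Mamba Module, which uses Mamba's sequence modeling capability to jointly learn fine-grained temporal dynamics and local spatial structures across multiple temporal receptive fields; 3) MUVAD, described as the first large-scale UAV video anomaly detection dataset focused on dynamic backgrounds.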

Abstract: Recent advances in video anomaly detection (VAD) mainly focus on ground-based surveillance or unmanned aerial vehicle (UAV) videos with static backgrounds, whereas research on UAV videos with dynamic backgrounds remains limited. Unlike static scenarios, dynamically captured UAV videos exhibit multi-source motion coupling, where the motion of objects and UAV-induced global motion are intricately intertwined. Consequently, existing methods may misclassify normal UAV movements as anomalies or fail to capture true anomalies concealed within dynamic backgrounds. Moreover, many approaches do not adequately address the joint modeling of inter-frame continuity and local spatial correlations across diverse temporal scales. To overcome these limitations, we propose the Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network for UAV VAD, including two core components: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which disentangles coupled motion patterns and models global spatiotemporal dependencies through frequency analysis; and (2) a Temporal Dilation Mamba Module, which leverages Mamba’s sequence modeling capability to jointly learn fine-grained temporal dynamics and local spatial structures across multiple temporal receptive fields. Additionally, unlike existing UAV VAD datasets which focus on static backgrounds, we construct a large-scale Moving UAV VAD dataset (MUVAD), comprising 222,736 frames with 240 anomaly events across 12 anomaly types. Extensive experiments demonstrate that FTDMamba achieves state-of-the-art (SOTA) performance on two public static benchmarks and the new MUVAD dataset. The code and MUVAD dataset will be available at: https://github.com/uavano/FTDMamba.


[34] X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning cs.CV | cs.AI

Maanping Shao, Feihong Zhang, Gu Zhang, Baiye Cheng, Zhengrong Xue

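TL;DR: X-Distill is a cross-architecture vision distillation method for visuomotor learning that targets data scarcity in robot learning. Through offline knowledge distillation, it transfers the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student, and then jointly fine-tunes the distilled encoder with a diffusion policy head on the target manipulation tasks.

Details

Motivation: To resolve the tension between the large data requirements of pre-trained ViTs used in visuomotor policies and the scarcity of robot learning data, while exploiting the fact that compact CNNs are easier to optimize; cross-architecture knowledge distillation combines the strengths of both architectures.

Result: Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks show that the method consistently outperforms policies with from-scratch ResNet or fine-tuned DINOv2 encoders, and even surpasses 3D encoders that use privileged point cloud observations or much larger vision-language models, achieving state-of-the-art performance in data-efficient robotic manipulation.

Insight: The novelty is a simple yet effective cross-architecture knowledge distillation strategy that transfers the generalization ability of a large ViT to a compact CNN, giving strong visual priors and high performance on data-scarce robotic manipulation tasks. Viewed objectively, offline distillation followed by joint fine-tuning neatly combines general visual representations with task-specific policies, offering a reusable recipe for data-efficient visuomotor learning.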

Abstract: Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.


[35] Efficient On-Board Processing of Oblique UAV Video for Rapid Flood Extent Mapping cs.CVPDF

Vishisht Sharma, Sam Leroux, Lisa Landuyt, Nick Witvrouwen, Pieter Simoens

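TL;DR: This paper proposes Temporal Token Reuse (TTR), an adaptive inference framework that addresses the computational bottleneck of processing high-resolution oblique video in real time for flood extent mapping under the strict size, weight, and power (SWaP) constraints of unmanned aerial vehicles (UAVs). Exploiting the spatiotemporal redundancy inherent in aerial video, it treats image patches as tokens, uses a lightweight similarity metric to identify static regions dynamically, and propagates their precomputed deep features, bypassing redundant backbone computation.

Details

Motivation: Disaster response such as flood mapping requires rapid processing of oblique aerial video to maximize spatial coverage and situational awareness, but the strict SWaP constraints of UAVs mean that edge devices cannot supply the computational density needed for low-latency inference.

Result: Experiments on standard benchmarks and the newly curated Oblique Floodwater Dataset (designed for hydrological monitoring) show that, on edge-grade hardware, TTR reduces inference latency by 30% with negligible loss in segmentation accuracy (< 0.5% mIoU), effectively shifting the operational Pareto frontier.

Insight: The contribution is the TTR framework, which accelerates video segmentation by dynamically identifying static regions and reusing their precomputed features. It is an efficient adaptive inference method tailored to the spatiotemporal redundancy of aerial video and is applicable to high-fidelity, real-time video understanding in time-critical remote sensing missions.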

Abstract: Effective disaster response depends on speed, and oblique aerial video is the primary modality for initial scouting due to its ability to maximize spatial coverage and situational awareness in limited flight time. However, the on-board processing of high-resolution oblique streams is severely bottlenecked by the strict Size, Weight, and Power (SWaP) constraints of Unmanned Aerial Vehicles (UAVs). The computational density required to process these wide-field-of-view streams precludes low-latency inference on standard edge hardware. To address this, we propose Temporal Token Reuse (TTR), an adaptive inference framework capable of accelerating video segmentation on embedded devices. TTR exploits the intrinsic spatiotemporal redundancy of aerial video by formulating image patches as tokens; it utilizes a lightweight similarity metric to dynamically identify static regions and propagate their precomputed deep features, thereby bypassing redundant backbone computations. We validate the framework on standard benchmarks and a newly curated Oblique Floodwater Dataset designed for hydrological monitoring. Experimental results on edge-grade hardware demonstrate that TTR achieves a 30% reduction in inference latency with negligible degradation in segmentation accuracy (< 0.5% mIoU). These findings confirm that TTR effectively shifts the operational Pareto frontier, enabling high-fidelity, real-time oblique video understanding for time-critical remote sensing missions.
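
The token-reuse idea can be sketched as follows. The abstract only mentions "a lightweight similarity metric", so the mean absolute pixel difference, the threshold value, and the function names here are assumptions, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

Patch = List[float]    # flattened pixel values of one image patch (token)
Feature = List[float]  # backbone feature for one patch


def mean_abs_diff(a: Patch, b: Patch) -> float:
    """Cheap per-patch change measure (an assumed stand-in for the paper's metric)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def reuse_tokens(
    patches: List[Patch],
    prev_patches: Optional[List[Patch]],
    prev_features: Optional[List[Feature]],
    backbone: Callable[[Patch], Feature],
    threshold: float = 0.05,
) -> Tuple[List[Feature], int]:
    """Return per-patch features and the number of patches sent through the backbone."""
    features: List[Feature] = []
    recomputed = 0
    for i, patch in enumerate(patches):
        is_static = (
            prev_patches is not None
            and prev_features is not None
            and mean_abs_diff(patch, prev_patches[i]) < threshold
        )
        if is_static:
            features.append(prev_features[i])  # propagate the cached feature
        else:
            features.append(backbone(patch))   # recompute only where the scene changed
            recomputed += 1
    return features, recomputed


def toy_backbone(patch: Patch) -> Feature:
    """Placeholder for an expensive feature extractor."""
    return [sum(patch)]


frame0 = [[0.0, 0.0], [1.0, 1.0]]
frame1 = [[0.0, 0.01], [0.2, 0.3]]  # first patch nearly static, second changed
feats0, n0 = reuse_tokens(frame0, None, None, toy_backbone)
feats1, n1 = reuse_tokens(frame1, frame0, feats0, toy_backbone)
```

In this toy run the second frame recomputes one of its two patches. A practical variant would probably compare each patch against the frame on which its cached feature was computed, so that slow drift does not accumulate unnoticed.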


[36] SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2 cs.CVPDF

Gergely Dinya, András Gelencsér, Krisztina Kupán, Clemens Küpper, Kristóf Karacs

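TL;DR: SAMannot is an open-source, local framework for interactive video instance segmentation built on SAM2. By optimizing resource management and introducing automated workflows, it addresses the efficiency, privacy, and cost limitations of existing video segmentation tools.

Details

Motivation: Current video segmentation research workflows suffer from time-consuming manual annotation, expensive commercial platforms, and the privacy risks of cloud services, so an efficient, privacy-preserving, and cost-controlled solution is needed.

Result: Validated on animal behavior tracking use cases and subsets of the LVOS and DAVIS benchmark datasets, the tool generates research-ready datasets in YOLO and PNG formats and records structured interaction logs, providing a scalable alternative for complex video annotation tasks.

Insight: The design modifies the SAM2 dependency and adds a processing layer that reduces computational overhead. Combined with persistent instance identity management, an automated lock-and-refine workflow based on barrier frames, and a mask-skeletonization-based auto-prompting mechanism, this makes the tool resource-efficient and keeps the user interface responsive.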

Abstract: Current research workflows for precise video segmentation are often forced into a compromise between labor-intensive manual curation, costly commercial platforms, and/or privacy-compromising cloud-based services. The demand for high-fidelity video instance segmentation in research is often hindered by the bottleneck of manual annotation and the privacy concerns of cloud-based tools. We present SAMannot, an open-source, local framework that integrates the Segment Anything Model 2 (SAM2) into a human-in-the-loop workflow. To address the high resource requirements of foundation models, we modified the SAM2 dependency and implemented a processing layer that minimizes computational overhead and maximizes throughput, ensuring a highly responsive user interface. Key features include persistent instance identity management, an automated "lock-and-refine" workflow with barrier frames, and a mask-skeletonization-based auto-prompting mechanism. SAMannot facilitates the generation of research-ready datasets in YOLO and PNG formats alongside structured interaction logs. Verified through animal behavior tracking use cases and subsets of the LVOS and DAVIS benchmark datasets, the tool provides a scalable, private, and cost-effective alternative to commercial platforms for complex video annotation tasks.


[37] Enhancing Vision Language Models with Logic Reasoning for Situational Awareness cs.CV | cs.LOPDF

Pavana Pradeep, Krishna Kant, Suya Yu

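TL;DR: This paper proposes an approach that combines vision-language models (VLMs) with traditional computer vision methods through explicit logic reasoning to enhance situational awareness. It extracts fine-grained event details, uses an intelligent fine-tuning strategy to improve accuracy, and generates justifications for VLM outputs during inference, improving reliability and accuracy when recognizing rare but important events.

Details

Motivation: VLMs used in situational awareness applications are not reliable or accurate enough at recognizing rare but important events, and they lack fine-grained detail extraction and output quality assessment.

Result: The proposed intelligent fine-tuning mechanism achieves substantially higher accuracy than uninformed selection, and during inference the approach provides an effective means of confirming that a VLM output is valid or indicating why it may be questionable.

Insight: The contribution is the integration of explicit logic reasoning into VLMs to enable fine-grained detail extraction, an intelligent fine-tuning strategy, and generation of output justifications, which offers a new way to improve the trustworthiness and interpretability of VLMs in critical applications.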

Abstract: Vision-Language Models (VLMs) offer the ability to generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves the accuracy and provides a valuable means, during inferencing, to either confirm the validity of the VLM output or indicate why it may be questionable.


[38] Assessing Building Heat Resilience Using UAV and Street-View Imagery with Coupled Global Context Vision Transformer cs.CVPDF

Steffen Knoblauch, Ram Kumar Muthusamy, Hao Li, Iddy Chazua, Benedcto Adamu

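TL;DR: This paper proposes a machine learning framework that fuses unmanned aerial vehicle (UAV) and street-view (SV) imagery through a coupled global context vision transformer (CGCViT) to learn heat-relevant representations of urban buildings, and uses thermal infrared (TIR) data from HotSat-1 to quantify the relationship between building attributes and heat-associated health risks. Applied to Dar es Salaam, Tanzania, the framework identifies household-level inequalities in heat exposure.

Details

Motivation: Climate change is intensifying urban heat exposure, particularly in densely populated urban centers of the Global South, yet scalable methods for assessing heat-relevant building attributes are still lacking.

Result: The proposed dual-modality cross-view learning approach outperforms the best single-modality model by up to 9.3%, showing that UAV and SV imagery provide valuable complementary perspectives. Vegetation around buildings, brighter roofing, and roofing made of concrete, clay, or wood are each significantly associated with lower HotSat-1 TIR values.

Insight: The contribution is the coupled global context vision transformer (CGCViT), which fuses UAV and street-view imagery to learn representations of heat-relevant building attributes. The approach shows how multi-source remote sensing data and machine learning can support localized, data-driven climate risk assessment.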

Abstract: Climate change is intensifying human heat exposure, particularly in densely built urban centers of the Global South. Low-cost construction materials and high thermal-mass surfaces further exacerbate this risk. Yet scalable methods for assessing such heat-relevant building attributes remain scarce. We propose a machine learning framework that fuses openly available unmanned aerial vehicle (UAV) and street-view (SV) imagery via a coupled global context vision transformer (CGCViT) to learn heat-relevant representations of urban structures. Thermal infrared (TIR) measurements from HotSat-1 are used to quantify the relationship between building attributes and heat-associated health risks. Our dual-modality cross-view learning approach outperforms the best single-modality models by up to 9.3%, demonstrating that UAV and SV imagery provide valuable complementary perspectives on urban structures. The presence of vegetation surrounding buildings (versus no vegetation), brighter roofing (versus darker roofing), and roofing made of concrete, clay, or wood (versus metal or tarpaulin) are all significantly associated with lower HotSat-1 TIR values. Deployed across the city of Dar es Salaam, Tanzania, the proposed framework illustrates how household-level inequalities in heat exposure, often linked to socio-economic disadvantage and reflected in building materials, can be identified and addressed using machine learning. Our results point to the critical role of localized, data-driven risk assessment in shaping climate adaptation strategies that deliver equitable outcomes.


[39] Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding cs.CV | cs.AIPDF

Wenhui Tan, Ruihua Song, Jiaze Li, Jianzhong Ju, Zhenbo Luo

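TL;DR: The paper proposes Think-Clip-Sample (TCS), a training-free framework that improves long-video understanding in multimodal large language models (MLLMs) through multi-query reasoning and clip-level slow-fast sampling, raising accuracy on several benchmarks while reducing inference time.

Details

Motivation: MLLM performance on long videos is limited by computational constraints and suboptimal frame selection.

Result: On MLVU, LongVideoBench, and VideoMME, TCS improves the accuracy of different MLLMs by up to 6.9% and reaches comparable accuracy at 50% of the inference time cost.

Insight: The contribution combines multi-query reasoning, which captures complementary aspects of the question, with a slow-fast sampling strategy that adaptively balances local detail and global context, yielding an efficient and effective framework for long-video understanding.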

Abstract: Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting accuracy by up to 6.9%, and is capable of achieving comparable accuracy with 50% less inference time, highlighting both the efficiency and efficacy of TCS on long video understanding.
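
One way to picture clip-level sampling that mixes dense local detail with sparse global context is the sketch below. How TCS scores clips and chooses strides is not given in the abstract, so the relevance scores, the top-k rule, and both stride values are hypothetical.

```python
from typing import List


def slow_fast_sample(
    num_frames: int,
    clip_len: int,
    clip_scores: List[float],
    top_k: int = 1,
    dense_stride: int = 2,
    sparse_stride: int = 8,
) -> List[int]:
    """Pick frame indices: dense inside the top-k scored clips, sparse everywhere else."""
    ranked = sorted(range(len(clip_scores)), key=lambda c: clip_scores[c], reverse=True)
    selected_clips = set(ranked[:top_k])
    picked = set(range(0, num_frames, sparse_stride))   # sparse global context
    for clip in selected_clips:
        start = clip * clip_len
        end = min(start + clip_len, num_frames)
        picked.update(range(start, end, dense_stride))  # dense local detail
    return sorted(picked)


# Hypothetical relevance scores for four 8-frame clips of a 32-frame video.
frames = slow_fast_sample(num_frames=32, clip_len=8, clip_scores=[0.1, 0.9, 0.2, 0.3])
# frames == [0, 8, 10, 12, 14, 16, 24]
```

The frame budget stays small because only the clips judged relevant to the query are sampled densely.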


[40] Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning cs.CVPDF

Haomiao Tang, Jinpeng Wang, Minyi Zhao, Guanghao Meng, Ruisheng Luo

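TL;DR: This paper proposes Heterogeneous Uncertainty-Guided (HUG), a new paradigm that addresses the robustness problems caused by intrinsic noise in the triplet data used for composed image retrieval (CIR). It adopts a fine-grained probabilistic learning framework that builds Gaussian embeddings for queries and targets to capture detailed concepts and uncertainties, and it designs heterogeneous uncertainty estimation, a dynamic weighting mechanism, and uncertainty-guided contrastive learning objectives.

Details

Motivation: In CIR, triplets built from a reference image and modification text contain intrinsic noise, which introduces intrinsic uncertainty and threatens model robustness. Existing probabilistic learning methods fall short because they model each instance holistically and treat queries and targets homogeneously.

Result: Experiments on benchmarks show that HUG outperforms existing state-of-the-art baselines.

Insight: The main contributions are: 1) a heterogeneous uncertainty-guided paradigm for CIR, with uncertainty estimation tailored to multi-modal queries and uni-modal targets; 2) a fine-grained probabilistic learning framework that models queries and targets with Gaussian embeddings; 3) a provable dynamic weighting mechanism that derives a comprehensive query uncertainty; 4) uncertainty-guided contrastive objectives, including holistic and fine-grained contrasts, with comprehensive negative sampling strategies to strengthen discriminative learning.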

Abstract: Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model’s robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG’s effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.


[41] SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction cs.CVPDF

Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Nanren Bao, Bo Qian

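TL;DR: This paper proposes SUG-Occ, a real-time 3D semantic occupancy prediction framework. It exploits the inherent sparsity of 3D scenes, combining explicit semantic and uncertainty guidance with unsigned distance encoding, to reduce computation and memory overhead substantially while preserving geometric and semantic completeness and enabling efficient coarse-to-fine reasoning.

Details

Motivation: 3D semantic occupancy prediction incurs computation and memory overhead that is too high for real-time deployment; the work exploits scene sparsity to reduce redundant computation.

Result: On the SemanticKITTI benchmark, the method outperforms the baselines with a 7.34% improvement in accuracy and a 57.8% gain in efficiency.

Insight: The contributions include: semantic and uncertainty priors that suppress free-space projections and increase sparsity; an explicit unsigned distance encoding that improves geometric consistency; and a cascade sparse completion module together with an object contextual representation (OCR) based mask decoder, which aggregate and refine features efficiently and avoid expensive attention operations over volumetric features.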

Abstract: As autonomous driving moves toward full scene understanding, 3D semantic occupancy prediction has emerged as a crucial perception task, offering voxel-level semantics beyond traditional detection and segmentation paradigms. However, such a refined representation for scene understanding incurs prohibitive computation and memory overhead, posing a major barrier to practical real-time deployment. To address this, we propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework, which exploits the inherent sparsity of 3D scenes to reduce redundant computation while maintaining geometric and semantic completeness. Specifically, we first utilize semantic and uncertainty priors to suppress projections from free space during view transformation while employing an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation. Secondly, we design a cascade sparse completion module via hyper cross sparse convolution and generative upsampling to enable efficient coarse-to-fine reasoning. Finally, we devise an object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features and refines voxel-wise predictions via lightweight query-context interactions, avoiding expensive attention operations over volumetric features. Extensive experiments on the SemanticKITTI benchmark demonstrate that the proposed approach outperforms the baselines, achieving a 7.34% improvement in accuracy and a 57.8% gain in efficiency.


[42] Wetland mapping from sparse annotations with satellite image time series and temporal-aware segment anything model cs.CV | cs.AIPDF

Shuai Yuan, Tianwu Lin, Shuang Chen, Yu Xia, Peng Qin

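TL;DR: This paper proposes WetSAM, a framework for wetland mapping from satellite image time series with sparse point annotations. Built on the Segment Anything Model (SAM), it integrates temporal information through a dual-branch design: a temporally prompted branch uses hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability, and a spatial branch uses a temporally constrained region-growing strategy to generate reliable dense pseudo-labels. A bidirectional consistency regularization jointly optimizes both branches.

Details

Motivation: Existing wetland mapping methods face two main problems. Dense pixel-level annotation is expensive, while existing deep learning models perform poorly under sparse point labels; and strong seasonal and inter-annual wetland dynamics make single-date imagery inadequate, leading to large mapping errors. In addition, although foundation models such as SAM generalize well from point prompts, they are designed for static images and cannot model temporal information, which produces fragmented masks in heterogeneous wetlands.

Result: In extensive experiments across eight global regions (each approximately 5,000 km²), WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58% and delivering accurate, structurally consistent wetland segmentation with minimal labeling effort.

Insight: The contribution extends SAM to the temporal domain through a dual-branch architecture that combines temporal prompting with spatial pseudo-label generation, jointly optimized with bidirectional consistency regularization. The framework makes effective use of sparse point supervision and satellite image time series, addressing both wetland dynamics and annotation sparsity, and points toward scalable, low-cost, high-resolution wetland mapping.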

Abstract: Accurate wetland mapping is essential for ecosystem monitoring, yet dense pixel-level annotation is prohibitively expensive, and practical applications usually rely on sparse point labels, under which existing deep learning models perform poorly. Strong seasonal and inter-annual wetland dynamics further render single-date imagery inadequate and lead to significant mapping errors. Although foundation models such as SAM show promising generalization from point prompts, they are inherently designed for static images and fail to model temporal information, resulting in fragmented masks in heterogeneous wetlands. To overcome these limitations, we propose WetSAM, a SAM-based framework that integrates satellite image time series for wetland mapping from sparse point supervision through a dual-branch design. A temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability, and a spatial branch employs a temporally constrained region-growing strategy to generate reliable dense pseudo-labels, while a bidirectional consistency regularization jointly optimizes both branches. Extensive experiments across eight global regions of approximately 5,000 km² each demonstrate that WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58% and delivering accurate and structurally consistent wetland segmentation with minimal labeling effort, highlighting its strong generalization capability and potential for scalable, low-cost, high-resolution wetland mapping.


[43] PubMed-OCR: PMC Open Access OCR Annotations cs.CV | cs.CL | cs.DL | cs.LGPDF

Hunter Heidenreich, Yosheb Getachew, Olivia Dinica, Ben Elliott

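TL;DR: PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. It contains about 209.5K articles (1.5M pages; ~1.3B words). Each page image is annotated with Google Cloud Vision and released in a compact JSON format with word-, line-, and paragraph-level bounding boxes. The corpus supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines.

Details

Motivation: Large-scale, high-quality OCR corpora of scholarly literature with fine-grained layout annotations are lacking; this corpus is intended to support downstream research on document understanding, information extraction, and evaluation.

Result: The authors build a corpus of about 209.5K articles, 1.5M page images, and ~1.3B words with detailed layout annotations. The paper analyzes journal coverage and detected layout features but does not report quantitative comparisons on specific benchmarks.

Insight: The contribution is a large-scale, open-access OCR dataset of scholarly literature with fine-grained (word, line, paragraph) layout annotations that supports layout-aware and coordinate-grounded tasks. Its release model and compact JSON schema provide convenient infrastructure for related research. Its limitations include reliance on a single OCR engine and heuristic line reconstruction.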

Abstract: PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
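
To show how word-level boxes relate to a line-level box, here is a small sketch over a made-up page record. The released JSON schema is not reproduced in the abstract, so the field names (`words`, `text`, `bbox`) and the `[x0, y0, x1, y1]` box convention are hypothetical.

```python
import json
from typing import Dict, List

# Hypothetical page record; the field names are illustrative, not the released schema.
page_json = """
{
  "words": [
    {"text": "PubMed", "bbox": [10, 20, 70, 35]},
    {"text": "Central", "bbox": [75, 21, 140, 36]}
  ]
}
"""


def union_bbox(boxes: List[List[int]]) -> List[int]:
    """Smallest [x0, y0, x1, y1] box that encloses all input boxes."""
    return [
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    ]


page: Dict = json.loads(page_json)
line_box = union_bbox([w["bbox"] for w in page["words"]])
line_text = " ".join(w["text"] for w in page["words"])
```

A coordinate-grounded QA system could return `line_box` alongside `line_text` so that an answer can be traced back to a region of the page image.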


[44] Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps cs.CV | cs.AIPDF

Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan

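TL;DR: This paper proposes Map2Thought, a framework that gives 3D vision-language models (VLMs) explicit and interpretable spatial reasoning. It is built on two components: the Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by combining a discrete grid (for relational reasoning) with a continuous, metric-scale representation (for precise geometric understanding). On top of it, Cog-CoT performs explicit geometric reasoning through deterministic operations such as vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure.

Details

Motivation: Existing 3D VLMs lack explicit and interpretable spatial reasoning; the work aims for more transparent and reliable 3D scene understanding through structured, metric representations and a deterministic reasoning process.

Result: On VSI-Bench, Map2Thought reaches 59.9% accuracy using only half the supervision, close to the 60.9% of the baseline trained on the full dataset. With 10%, 25%, and 50% training subsets, it outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0%, respectively.

Insight: The main contribution is unifying discrete relational reasoning and continuous metric geometry in one cognitive map, and designing a chain-of-thought process based on deterministic geometric operations, which makes 3D spatial reasoning interpretable. This explicit, structured reasoning framework is a promising route to more transparent and data-efficient 3D VLMs.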

Abstract: We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.
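
The kind of deterministic operation Cog-CoT relies on can be sketched with two small helpers: a bounding-box distance on the metric side and a grid lookup on the discrete side. Axis-aligned boxes, the minimum-gap definition of distance, and the cell size are assumptions made here for illustration.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min_xyz, max_xyz) of an axis-aligned box, in metres


def box_distance(a: Box, b: Box) -> float:
    """Minimum Euclidean gap between two axis-aligned boxes (0 if they touch or overlap)."""
    gaps = []
    for axis in range(3):
        gap = max(a[0][axis] - b[1][axis], b[0][axis] - a[1][axis], 0.0)
        gaps.append(gap)
    return math.sqrt(sum(g * g for g in gaps))


def grid_cell(center_xy: Tuple[float, float], cell_size: float) -> Tuple[int, int]:
    """Discretise a metric (x, y) position into a grid cell for relational queries."""
    return (math.floor(center_xy[0] / cell_size), math.floor(center_xy[1] / cell_size))


# Hypothetical objects: a chair at the origin and a table a few metres away.
chair: Box = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
table: Box = ((4.0, 5.0, 0.0), (5.0, 6.0, 1.0))
gap = box_distance(chair, table)             # 3 m in x and 4 m in y give 5.0 m
cell = grid_cell((4.5, 5.5), cell_size=1.0)  # (4, 5)
```

Because both functions are deterministic, each value quoted in a reasoning trace can be recomputed and checked against the map.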


[45] MHA2MLA-VLM: Enabling DeepSeek’s Economical Multi-Head Latent Attention across Vision-Language Models cs.CV | cs.AI | cs.CL | cs.LGPDF

Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui

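TL;DR: This paper proposes MHA2MLA-VLM, a framework for efficiently converting off-the-shelf vision-language models (VLMs) to the Multi-Head Latent Attention (MLA) architecture, addressing the memory and computational bottlenecks caused by rapid key-value (KV) cache growth during inference.

Details

Motivation: As VLMs tackle increasingly complex multimodal tasks, the rapid growth of the KV cache creates significant memory and computational bottlenecks at inference time, and adapting existing VLMs to MLA without costly pretraining remains largely unexplored.

Result: Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces the KV cache footprint, and integrates seamlessly with KV quantization.

Insight: The contributions include a modality-adaptive partial-RoPE strategy (supporting both traditional and multimodal settings by selectively masking nonessential dimensions) and a modality-decoupled low-rank approximation method (compressing the visual and textual KV spaces independently). Parameter-efficient fine-tuning minimizes adaptation cost, and minimizing output activation error rather than parameter distance substantially reduces performance loss.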

Abstract: As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
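
The size of the KV cache problem can be seen with back-of-the-envelope arithmetic. The sketch assumes that standard multi-head attention caches full keys and values per head, and that an MLA-style layer caches one compressed latent plus a small RoPE key per token; all dimensions are hypothetical and not taken from the paper.

```python
def mha_kv_bytes(layers: int, heads: int, head_dim: int, tokens: int, bytes_per_value: int = 2) -> int:
    """Cache size when every layer stores a full key and value per head and token."""
    return 2 * layers * heads * head_dim * tokens * bytes_per_value


def latent_kv_bytes(layers: int, latent_dim: int, rope_dim: int, tokens: int, bytes_per_value: int = 2) -> int:
    """Cache size when every layer stores one shared latent plus a RoPE key per token."""
    return layers * (latent_dim + rope_dim) * tokens * bytes_per_value


# Hypothetical model: 32 layers, 32 heads of dimension 128, 4096 cached tokens, fp16 values.
mha = mha_kv_bytes(layers=32, heads=32, head_dim=128, tokens=4096)
mla = latent_kv_bytes(layers=32, latent_dim=512, rope_dim=64, tokens=4096)
ratio = mha / mla  # roughly 14x smaller under these assumed dimensions
```

Image inputs usually contribute many tokens per sample, so the `tokens` factor grows quickly in VLMs, which is why cache compression matters there.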


[46] Generative Scenario Rollouts for End-to-End Autonomous Driving cs.CVPDF

Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi

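TL;DR: This paper proposes GeRo, a plug-and-play framework that strengthens vision-language-action (VLA) models for end-to-end autonomous driving. Using an autoregressive rollout strategy, it jointly performs planning and generates language-grounded future traffic scenes, supporting long-horizon reasoning and multi-agent planning.

Details

Motivation: Current VLA-based end-to-end driving systems rely mainly on imitation learning from sparse trajectory annotations and under-use their potential as generative models. The paper addresses this with generative scenario rollouts that yield more consistent and interpretable planning.

Result: On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By combining reinforcement learning with generative rollouts, it achieves state-of-the-art closed-loop and open-loop performance and shows strong zero-shot robustness.

Insight: The contribution is a framework for joint planning and language-grounded scene generation that uses autoregressive rollouts and a rollout-consistency loss to stabilize predictions and preserve text-action alignment. It suggests generative, language-conditioned reasoning as a basis for safer and more interpretable end-to-end autonomous driving.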

Abstract: Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.


[47] ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes cs.CVPDF

Emily Steiner, Jianhao Zheng, Henry Howard-Jenkins, Chris Xie, Iro Armeni

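TL;DR: This paper proposes ReScene4D, a method for semantic instance segmentation of indoor 3D scenes that evolve over time. It jointly segments, identifies, and temporally associates object instances across intermittently captured 3D scans, keeping instance identities temporally consistent even when the changes themselves are not observed.

Details

Motivation: Existing 3D semantic instance segmentation (3DSIS) methods lack temporal reasoning and need a discrete matching step, while 4D LiDAR methods rely on high-frequency temporal measurements and do not suit the long-horizon, sparse observation of evolving indoor environments. A method that achieves temporally consistent segmentation under sparse temporal observations is therefore needed.

Result: ReScene4D achieves state-of-the-art performance on the 3RScan dataset, and the paper introduces a new metric, t-mAP, that rewards temporal identity consistency, establishing a new benchmark for the task.

Insight: The contribution is adapting 3DSIS architectures to the 4DSIS task without dense observations, sharing information across observations to improve segmentation quality and consistency. The proposed t-mAP metric gives a more suitable measure for evaluating temporal consistency.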

Abstract: Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.


[48] ShapeR: Robust Conditional 3D Shape Generation from Casual Captures cs.CV | cs.LGPDF

Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins

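TL;DR: This paper proposes ShapeR, a new approach for conditional 3D object shape generation from casually captured image sequences. It uses off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to condition on these modalities then generates high-fidelity metric 3D shapes.

Details

Motivation: Most existing 3D shape generation methods rely on clean, unoccluded, well-segmented inputs, which are rarely available in real scenes. The paper aims at robust 3D shape generation from casual real-world captures with occlusion, background clutter, and similar challenges.

Result: On the authors' new evaluation benchmark (178 objects with geometry annotations across 7 real-world scenes), ShapeR significantly outperforms existing methods, improving Chamfer distance by 2.7x over the state of the art.

Insight: The contributions include: 1) an end-to-end pipeline that fuses three modalities, namely sparse 3D points, multi-view images, and text captions; 2) on-the-fly compositional augmentation and a curriculum training scheme spanning object-level and scene-level datasets, which improve robustness to casually captured data; 3) a new benchmark with annotated real-world scenes for evaluating this task.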

Abstract: Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to the state of the art.
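
For readers unfamiliar with the reported metric, a minimal sketch of Chamfer distance between two point sets follows. Several variants exist (squared or unsquared distances, summed or averaged directions); this one sums the two mean nearest-neighbour distances and is not necessarily the variant used in the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def _mean_nearest(src: List[Point], dst: List[Point]) -> float:
    """Average distance from each source point to its nearest destination point."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance; lower means the two shapes are closer."""
    return _mean_nearest(a, b) + _mean_nearest(b, a)


# Toy point sets (hypothetical): one predicted point is 1 unit away from its ground-truth match.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(pred, gt)  # 0.5 + 0.5 = 1.0
```

This brute-force version is quadratic in the number of points; evaluations on dense meshes typically use a spatial index such as a k-d tree instead.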


eess.IV [Back]

[49] Convolutions Need Registers Too: HVS-Inspired Dynamic Attention for Video Quality Assessment eess.IV | cs.CV | cs.MMPDF

Mayesha Maliha R. Mithila, Mylene C. Q. Farias

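TL;DR: This paper proposes DAGR-VQA, a no-reference video quality assessment (NR-VQA) framework that is the first to integrate learnable register tokens directly into a convolutional backbone for spatio-temporal dynamic saliency prediction. The register tokens act as global context carriers, enabling dynamic attention inspired by the human visual system (HVS) and producing temporally adaptive saliency maps without explicit motion estimation. Combined with RGB inputs and a temporal transformer, the model delivers perceptually consistent video quality assessment.

Details

Motivation: Existing NR-VQA methods typically use static saliency maps as auxiliary inputs rather than embedding context in the feature extraction of the video sequence, so they handle the global context of the video signal poorly.

Result: Comprehensive tests on LSVQ, KonVid-1k, LIVE-VQC, and YouTube-UGC show that DAGR-VQA is highly competitive, surpassing most top baselines, and reaches 387.7 FPS at 1080p, which suits real-time applications.

Insight: The contribution is bringing the register-token mechanism into convolutional networks to model global context and spatio-temporal saliency dynamically and adaptively, avoiding explicit motion estimation while keeping computation efficient. It offers a new way to combine convolution and attention in video understanding tasks.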

Abstract: No-reference video quality assessment (NR-VQA) estimates perceptual quality without a reference video, which is often challenging. While recent techniques leverage saliency or transformer attention, they address the global context of the video signal only by using static maps as auxiliary inputs rather than embedding context fundamentally within feature extraction of the video sequence. We present Dynamic Attention with Global Registers for Video Quality Assessment (DAGR-VQA), the first framework integrating register tokens directly into a convolutional backbone for spatio-temporal, dynamic saliency prediction. By embedding learnable register tokens as global context carriers, our model enables dynamic, HVS-inspired attention, producing temporally adaptive saliency maps that track salient regions over time without explicit motion estimation. Our model integrates dynamic saliency maps with RGB inputs, capturing spatial data and analyzing it through a temporal transformer to deliver a perceptually consistent video quality assessment. Comprehensive tests conducted on the LSVQ, KonVid-1k, LIVE-VQC, and YouTube-UGC datasets show that the performance is highly competitive, surpassing the majority of top baselines. Ablation studies demonstrate that the integration of register tokens promotes stable and temporally consistent attention mechanisms. Achieving an efficiency of 387.7 FPS at 1080p, DAGR-VQA demonstrates computational performance suitable for real-time applications like multimedia streaming systems.


cs.SD [Back]

[50] SonicBench: Dissecting the Physical Perception Bottleneck in Large Audio Language Models cs.SD | cs.CLPDF

Yirong Sun, Yanjun Chen, Xin Qiu, Gang Zhang, Hongyu Chen

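TL;DR: This paper introduces SonicBench, a benchmark that systematically evaluates how well large audio language models (LALMs) perceive the basic physical attributes of audio. It finds marked deficiencies in perceiving attributes such as pitch, loudness, and spatial location: performance is near random guessing, and the models fail to exploit physical cues that their audio encoders already capture.

Details

Motivation: LALMs perform well on semantic and paralinguistic tasks, but their perception of the basic physical attributes of audio remains under-explored, so a dedicated benchmark is needed to evaluate and expose this bottleneck.

Result: On SonicBench, most models perform near random guessing on physical attribute perception and, unlike humans, show no advantage on comparison tasks. Linear probing shows that frozen audio encoders do capture the physical cues (accuracy of at least 60%), but the models fail to use these signals in the subsequent alignment and decoding stages.

Insight: The contribution is a psychophysically grounded benchmark with controllable stimulus generation that combines recognition and comparison paradigms to evaluate both perceptual precision and relational reasoning. The key finding is that the bottleneck lies mainly in the alignment and decoding stages rather than in audio encoding, which gives a clear direction for improving models.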

Abstract: Large Audio Language Models (LALMs) excel at semantic and paralinguistic tasks, yet their ability to perceive the fundamental physical attributes of audio such as pitch, loudness, and spatial location remains under-explored. To bridge this gap, we introduce SonicBench, a psychophysically grounded benchmark that systematically evaluates 12 core physical attributes across five perceptual dimensions. Unlike previous datasets, SonicBench uses a controllable generation toolbox to construct stimuli for two complementary paradigms: recognition (absolute judgment) and comparison (relative judgment). This design allows us to probe not only sensory precision but also relational reasoning capabilities, a domain where humans typically exhibit greater proficiency. Our evaluation reveals a substantial deficiency in LALMs’ foundational auditory understanding; most models perform near random guessing and, contrary to human patterns, fail to show the expected advantage on comparison tasks. Furthermore, explicit reasoning yields minimal gains. However, our linear probing analysis demonstrates crucially that frozen audio encoders do successfully capture these physical cues (accuracy at least 60%), suggesting that the primary bottleneck lies in the alignment and decoding stages, where models fail to leverage the sensory signals they have already captured.
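
A comparison-paradigm trial for a single attribute (pitch) might be generated along the lines below. This is only an illustration of controllable stimulus generation; the sample rate, tone duration, question wording, and function names are assumptions and do not come from the SonicBench toolbox.

```python
import math
from typing import Dict, List


def sine_tone(freq_hz: float, duration_s: float, sample_rate: int = 16000, amplitude: float = 0.5) -> List[float]:
    """Synthesize a pure tone as a list of samples in [-amplitude, amplitude]."""
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def comparison_trial(freq_a: float, freq_b: float) -> Dict[str, object]:
    """Build a relative-judgment item: two tones plus the ground-truth answer."""
    if freq_a == freq_b:
        raise ValueError("the two tones must differ in pitch")
    return {
        "audio_a": sine_tone(freq_a, 0.5),
        "audio_b": sine_tone(freq_b, 0.5),
        "question": "Which clip has the higher pitch, A or B?",
        "answer": "A" if freq_a > freq_b else "B",
    }


trial = comparison_trial(440.0, 523.25)  # A4 versus C5
```

A recognition (absolute-judgment) item would instead present one tone and ask for its pitch category. Because the generator controls the frequencies, the ground truth is known by construction.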


[51] FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning cs.SD | cs.CL | eess.ASPDF

Tanyu Chen, Tairan Chen, Kai Shen, Zhenghua Bao, Zhihui Zhang

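TL;DR: FlashLabs Chroma 1.0 is an open-source, real-time, end-to-end spoken dialogue model. It achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice cloning across multi-turn conversations.

Details

Motivation: Existing end-to-end spoken dialogue systems use speech tokenizers and neural audio codecs so that LLMs can operate directly on discrete speech representations, but they often preserve speaker identity poorly, which hinders personalized voice interaction.

Result: Experiments show that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities.

Insight: The main contribution is the first open-source, real-time, end-to-end spoken dialogue model that combines low latency with high-fidelity personalized voice cloning. Its core technique is the interleaved text-audio token schedule that supports streaming generation, which addresses the trade-off between real-time interaction and speech quality.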

Abstract: Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B .
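The 1:2 interleaved schedule and the Real-Time Factor mentioned above can be illustrated with a minimal sketch. The function names, the placeholder tokens, and the way leftover tokens are flushed are assumptions made for illustration; the abstract does not describe these details, and this is not the authors' implementation. RTF is taken here in its usual sense of processing time divided by the duration of the generated audio.

```python
from itertools import islice


def interleave(text_tokens, audio_tokens, text_per_block=1, audio_per_block=2):
    """Interleave two token streams in repeating blocks (default 1 text : 2 audio).

    Hypothetical scheme: once one stream runs out, the remaining tokens of the
    other stream are simply appended in order.
    """
    text_it, audio_it = iter(text_tokens), iter(audio_tokens)
    out = []
    while True:
        text_block = list(islice(text_it, text_per_block))
        audio_block = list(islice(audio_it, audio_per_block))
        if not text_block and not audio_block:
            break
        out.extend(text_block)
        out.extend(audio_block)
    return out


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means audio is produced faster than it plays back."""
    return processing_seconds / audio_seconds
```

Under this reading, an RTF of 0.43 corresponds to roughly 4.3 seconds of compute for 10 seconds of speech.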


cs.LG [Back]

[52] Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs cs.LG | cs.CL

Lecheng Yan, Ruizhe Li, Guanhua Chen, Qing Li, Jiahui Geng

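TL;DR: This paper studies a paradox in Reinforcement Learning with Verifiable Rewards (RLVR): models such as Qwen 2.5 achieve significant gains in reasoning performance even when trained with spurious or incorrect rewards. The authors identify a "Perplexity Paradox" in which spurious rewards trigger a hidden circuit that bypasses reasoning and activates memorization shortcuts.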

Details

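Motivation: The goal is to explain why spurious or incorrect reward signals in RLVR can still substantially improve model performance, and to understand the underlying mechanism, which may lead the model to rely on memorized data rather than genuine reasoning.

Result: Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, the authors localize an "Anchor-Adapter" circuit in models such as Qwen 2.5 that facilitates the memorization shortcut, and show that scaling specific MLP keys within this circuit enables bidirectional causal steering of contamination-driven performance.

Insight: The contribution is a mechanistic identification of the "Anchor-Adapter" circuit through which data contamination activates memorization shortcuts in RLVR-tuned models (a Functional Anchor in the middle layers and Structural Adapters in later layers), together with a roadmap for identifying and mitigating such issues. This is useful for understanding model internals and for robust tuning.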

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a “Perplexity Paradox”: spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering, artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.
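As a reminder of what the JSD analysis named above measures, the following is a minimal sketch of the Jensen-Shannon divergence between two discrete distributions, such as next-token distributions from two model variants. How the authors apply JSD (which layers, which tokens, which models) is not stated in the abstract; the function names and the base-2 logarithm are choices made here for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 bit in base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical distributions give 0, and distributions with disjoint support give the maximum of 1 bit, which makes the measure convenient for comparing how far two models' predictions have drifted apart.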


[53] Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation cs.LG | cs.AI | cs.CL

Pingzhi Tang, Yiding Wang, Muhan Zhang

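TL;DR: This paper proposes Parametric Skill Transfer (PaST), a framework addressing the "knowledge cutoff" challenge, in which frozen parametric memory prevents large language models (LLMs) from effectively internalizing new knowledge. The method extracts a domain-agnostic "Skill Vector" from a source domain and linearly injects it into a target model that has undergone lightweight supervised fine-tuning (SFT), efficiently improving the model's ability to use new knowledge for question answering and decision-making.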

Details

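Motivation: Although SFT is commonly used to update model knowledge, it often updates only factual content without reliably improving the skill of using that new knowledge for reasoning and decision-making. Reinforcement learning (RL) can acquire reasoning skills but is too computationally expensive for efficient online adaptation. The authors observe that the parameter updates induced by SFT and RL are nearly orthogonal, and therefore explore a modular skill-transfer approach for efficient knowledge adaptation.

Result: The method is validated on knowledge-incorporation QA (SQuAD, LooGLE) and agentic tool-use (ToolBench) benchmarks. On SQuAD, PaST outperforms the state-of-the-art self-editing SFT baseline by up to 9.9 points; on LooGLE long-context QA it achieves an 8.0-point absolute accuracy gain; on ToolBench it improves the average zero-shot success rate by 10.3 points, with consistent gains across tool categories, indicating good scalability and cross-domain transferability of the Skill Vector.

Insight: The claimed contribution is the PaST framework, whose core insight is that SFT and RL update directions are nearly orthogonal, which allows RL-acquired "knowledge manipulation skills" (packaged as a Skill Vector) to be decoupled from SFT-updated "factual knowledge" and transferred modularly. Viewed objectively, this is a novel continual-adaptation method that separates and recombines knowledge and skill updates, offering a new approach to efficiently updating LLMs.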

Abstract: Large Language Models (LLMs) face the “knowledge cutoff” challenge, where their frozen parametric memory prevents direct internalization of new information. While Supervised Fine-Tuning (SFT) is commonly used to update model knowledge, it often updates factual content without reliably improving the model’s ability to use the newly incorporated information for question answering or decision-making. Reinforcement Learning (RL) is essential for acquiring reasoning skills; however, its high computational cost makes it impractical for efficient online adaptation. We empirically observe that the parameter updates induced by SFT and RL are nearly orthogonal. Based on this observation, we propose Parametric Skill Transfer (PaST), a framework that supports modular skill transfer for efficient and effective knowledge adaptation. By extracting a domain-agnostic Skill Vector from a source domain, we can linearly inject knowledge manipulation skills into a target model after it has undergone lightweight SFT on new data. Experiments on knowledge-incorporation QA (SQuAD, LooGLE) and agentic tool-use benchmarks (ToolBench) demonstrate the effectiveness of our method. On SQuAD, PaST outperforms the state-of-the-art self-editing SFT baseline by up to 9.9 points. PaST further scales to long-context QA on LooGLE with an 8.0-point absolute accuracy gain, and improves zero-shot ToolBench success rates by +10.3 points on average with consistent gains across tool categories, indicating strong scalability and cross-domain transferability of the Skill Vector.
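The idea of linearly injecting a Skill Vector can be sketched in the style of task arithmetic on model weights. This is an assumption about what "linear injection" might look like: the abstract does not give the formula, so the difference-of-weights definition, the `scale` parameter, and all function names below are hypothetical. Parameters are represented as plain dictionaries of float lists to keep the example self-contained.

```python
import math


def skill_vector(rl_params, pre_rl_params):
    """Assumed definition: weights after RL minus weights before RL (source domain)."""
    return {name: [a - b for a, b in zip(rl_params[name], pre_rl_params[name])]
            for name in rl_params}


def inject(target_params, vector, scale=1.0):
    """Add the scaled skill vector to a target model that was already SFT-tuned."""
    return {name: [t + scale * v for t, v in zip(target_params[name], vector[name])]
            for name in target_params}


def cosine(u, v):
    """Cosine similarity; values near 0 indicate nearly orthogonal update directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

The `cosine` helper mirrors the orthogonality observation in the abstract: if the SFT update and the RL update have cosine similarity close to zero, adding one should interfere little with the other.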


cs.CR [Back]

[54] VidLeaks: Membership Inference Attacks Against Text-to-Video Models cs.CR | cs.CV

Li Wang, Wenyu Chen, Ning Yu, Zheng Li, Shanqing Guo

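TL;DR: This paper presents the first systematic study of membership inference attacks (MIAs) against text-to-video (T2V) models and proposes a new framework called VidLeaks. The framework uses two complementary signals, Spatial Reconstruction Fidelity (SRF) and Temporal Generative Stability (TGS), to probe the membership information that T2V models leak through sparse keyframes and stochastic temporal dynamics. Experiments under three progressively restrictive black-box settings show that existing T2V models pose severe privacy-leakage risks.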

Details

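Motivation: As text-to-video (T2V) models trained on large web-scale datasets proliferate, the risk of copyright and privacy violations is becoming urgent. Existing MIA techniques are designed mainly for static data such as images or text and cannot capture the spatio-temporal complexity of video generation; in particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics.

Result: Experiments on three representative T2V models (including AnimateDiff and InstructVideo) show that VidLeaks achieves high attack performance even in the strictest query-only black-box setting (for example, an AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo), revealing severe and exploitable privacy vulnerabilities.

Insight: The core contribution is the first systematic application of MIAs to T2V models, with an attack framework designed for the spatio-temporal nature of video. The two proposed signals, SRF (which uses Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes) and TGS (which measures semantic consistency across multiple queries to capture temporal leakage), provide a methodological foundation for auditing the privacy risks of video generation systems and motivate the development of new defenses.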

Abstract: The proliferation of powerful Text-to-Video (T2V) models, trained on massive web-scale datasets, raises urgent concerns about copyright and privacy violations. Membership inference attacks (MIAs) provide a principled tool for auditing such risks, yet existing techniques - designed for static data like images or text - fail to capture the spatio-temporal complexities of video generation. In particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics. In this paper, we conduct the first systematic study of MIAs against T2V models and introduce a novel framework VidLeaks, which probes sparse-temporal memorization through two complementary signals: 1) Spatial Reconstruction Fidelity (SRF), using a Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes, and 2) Temporal Generative Stability (TGS), which measures semantic consistency across multiple queries to capture temporal leakage. We evaluate VidLeaks under three progressively restrictive black-box settings - supervised, reference-based, and query-only. Experiments on three representative T2V models reveal severe vulnerabilities: VidLeaks achieves AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo even in the strict query-only setting, posing a realistic and exploitable privacy risk. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization, establishing a foundation for auditing video generation systems and motivating the development of new defenses. Code is available at: https://zenodo.org/records/17972831.
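A Top-K similarity score and the AUC metric used to report attack performance can be sketched as follows. This is an illustrative simplification, not the VidLeaks implementation: how per-frame similarities are computed, the choice of K, and how SRF is combined with TGS are not given in the abstract, and the function names are hypothetical.

```python
def top_k_mean(frame_similarities, k):
    """Average the K highest per-frame similarities.

    Rationale: if only a few keyframes are memorized, averaging over all
    frames would dilute the signal, while the top K preserve it.
    """
    top = sorted(frame_similarities, reverse=True)[:k]
    return sum(top) / len(top)


def auc(member_scores, nonmember_scores):
    """AUC as the probability that a member outscores a non-member (ties count half)."""
    wins = sum((m > n) + 0.5 * (m == n)
               for m in member_scores for n in nonmember_scores)
    return wins / (len(member_scores) * len(nonmember_scores))
```

An AUC of 0.5 corresponds to random guessing, so the reported values of 82.92% and 97.01% indicate that member and non-member videos are well separated by the attack scores.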


cs.AI [Back]

[55] MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents cs.AI | cs.CL

Shouju Wang, Haopeng Zhang

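TL;DR: This paper proposes MPCI-Bench, the first multimodal pairwise contextual integrity benchmark for evaluating the privacy behavior of agents. The benchmark contains paired positive and negative instances derived from the same visual source, spans three tiers (Seed judgments, Story reasoning, and executable agent Traces), and ensures data quality through a Tri-Principle Iterative Refinement pipeline. Evaluations of frontier multimodal models reveal systematic failures to balance privacy and utility, as well as a pronounced modality leakage gap.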

Details

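Motivation: As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms, in particular contextual integrity, becomes critical. Existing benchmarks are mostly text-centric and emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility.

Result: Evaluations of state-of-the-art multimodal models show systematic failures to balance privacy and utility and a pronounced modality leakage gap, in which sensitive visual information is leaked more frequently than textual information.

Insight: The contribution is the first multimodal, pairwise contextual integrity benchmark. Its positive/negative instance pairs and three-tier structure (Seed judgments, Story reasoning, agent Traces) characterize agent privacy behavior more comprehensively, and it surfaces a new problem: privacy leakage differs across modalities in multimodal settings.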

Abstract: As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms becomes increasingly critical, often through the lens of Contextual Integrity (CI). However, existing CI benchmarks are largely text-centric and primarily emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility. In this paper, we introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source and instantiated across three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. Data quality is ensured through a Tri-Principle Iterative Refinement pipeline. Evaluations of state-of-the-art multimodal models reveal systematic failures to balance privacy and utility and a pronounced modality leakage gap, where sensitive visual information is leaked more frequently than textual information. We will open-source MPCI-Bench to facilitate future research on agentic CI.


[56] Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models cs.AI | cs.CL

Gerard Yeo, Svetlana Churina, Kokil Jaidka

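TL;DR: This study analyzes how instruction-tuned large language models (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) encode perceived trustworthiness in web-like narratives. Using the PEACE-Reviews dataset, annotated for cognitive appraisals, emotions, and behavioral intentions, the authors find that models implicitly encode trust cues during pretraining: activation patterns at the layer and attention-head level systematically distinguish high-trust from low-trust texts. Probing analyses show that trust signals are linearly decodable and that fine-tuning refines rather than restructures these representations. The strongest associations are with dimensions central to human trust formation online, such as fairness, certainty, and accountability-self.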

Details

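Motivation: The paper addresses whether and how LLMs represent perceived trustworthiness in psychologically coherent ways. Trustworthiness underpins how users navigate online information, and LLMs are increasingly embedded in search, recommendation, and conversational systems.

Result: On the PEACE-Reviews dataset, all examined LLMs show systematic layer- and head-level activation differences that distinguish trust levels; trust signals are linearly decodable; and model representations are most strongly associated with appraisals of fairness, certainty, and accountability-self, which are core dimensions of human trust.

Insight: The contribution is the first systematic demonstration that LLMs internalize psychologically grounded trust signals without explicit supervision, offering a representational foundation for designing credible, transparent, and trustworthy AI systems. Viewed objectively, the layer/head analysis and linear probing provide empirical evidence that pretrained models implicitly encode socio-psychological constructs and that fine-tuning only refines them, giving a new perspective on models' social-cognitive representations.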

Abstract: Perceived trustworthiness underpins how users navigate online information, yet it remains unclear whether large language models (LLMs), increasingly embedded in search, recommendation, and conversational systems, represent this construct in psychologically coherent ways. We analyze how instruction-tuned LLMs (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) encode perceived trustworthiness in web-like narratives using the PEACE-Reviews dataset annotated for cognitive appraisals, emotions, and behavioral intentions. Across models, systematic layer- and head-level activation differences distinguish high- from low-trust texts, revealing that trust cues are implicitly encoded during pretraining. Probing analyses show linearly decodable trust signals and fine-tuning effects that refine rather than restructure these representations. Strongest associations emerge with appraisals of fairness, certainty, and accountability-self, dimensions central to human trust formation online. These findings demonstrate that modern LLMs internalize psychologically grounded trust signals without explicit supervision, offering a representational foundation for designing credible, transparent, and trustworthy AI systems in the web ecosystem. Code and appendix are available at: https://github.com/GerardYeo/TrustworthinessLLM.


[57] Building AI Agents to Improve Job Referral Requests to Strangers cs.AI | cs.CL

Ross Chu, Yuting Huang

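TL;DR: This paper develops AI agents that help job seekers write effective job referral requests in a professional online community. The core workflow consists of an improver agent that rewrites the referral request and an evaluator agent that measures the quality of revisions using a model trained to predict the probability of receiving a referral from other users.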

Details

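Motivation: Job seekers who ask strangers for referrals often have low success rates because their requests are poorly written; the paper aims to improve request effectiveness with AI assistance.

Result: LLM revisions increase the predicted success rate of weaker requests but reduce it for stronger requests. Adding Retrieval-Augmented Generation (RAG) prevents harmful edits to stronger requests and amplifies improvements for weaker ones. Overall, LLM revisions with RAG increase the predicted success rate of weaker requests by 14% without degrading performance on stronger requests.

Insight: The contribution is a two-agent workflow (improver and evaluator) combined with RAG to refine the LLM's editing strategy so that it treats requests of different quality differently. The gains are in model-predicted success, which provides a low-cost signal ahead of experiments with real users.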

Abstract: This paper develops AI agents that help job seekers write effective requests for job referrals in a professional online community. The basic workflow consists of an improver agent that rewrites the referral request and an evaluator agent that measures the quality of revisions using a model trained to predict the probability of receiving referrals from other users. Revisions suggested by the LLM (large language model) increase predicted success rates for weaker requests while reducing them for stronger requests. Enhancing the LLM with Retrieval-Augmented Generation (RAG) prevents edits that worsen stronger requests while it amplifies improvements for weaker requests. Overall, using LLM revisions with RAG increases the predicted success rate for weaker requests by 14% without degrading performance on stronger requests. Although improvements in model-predicted success do not guarantee more referrals in the real world, they provide low-cost signals for promising features before running higher-stakes experiments on real users.


[58] Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration cs.AI | cs.CV

Sen Wang, Bangwei Liu, Zhenkun Gao, Lizhuang Ma, Xuhong Wang

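TL;DR: This paper proposes Long-term Memory Embodied Exploration (LMEE), a framework that aims to unify an agent's exploratory cognition and decision-making behaviors to promote lifelong learning. The authors build a corresponding dataset and benchmark, LMEE-Bench, which includes multi-goal navigation and memory-based question answering to evaluate both the process and the outcome of embodied exploration. To improve memory recall and proactive exploration, they propose MemoryExplorer, which fine-tunes a multimodal large language model with reinforcement learning to encourage active memory querying, and uses a multi-task reward function covering action prediction, frontier selection, and question answering to achieve proactive exploration.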

Details

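Motivation: Existing mainstream one-shot embodied tasks focus mainly on task completion results and neglect the crucial process of exploration and memory utilization. An ideal embodied agent should have lifelong learning capabilities to handle long-horizon, complex tasks, which requires not only completing tasks accurately but also using long-term episodic memory to optimize decision-making.

Result: In extensive experiments, the method achieves significant advantages over state-of-the-art embodied exploration models on long-horizon embodied tasks.

Insight: The contribution is an embodied exploration framework and benchmark focused on the exploration process and long-term memory use, together with a new method that fine-tunes a multimodal LLM through reinforcement learning to encourage active memory querying; its multi-task reward function promotes proactive exploration.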

Abstract: An ideal embodied agent should possess lifelong learning capabilities to handle long-horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long-term episodic memory to optimize decision-making. However, existing mainstream one-shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long-term Memory Embodied Exploration (LMEE), which aims to unify the agent’s exploratory cognition and decision-making behaviors to promote lifelong learning. We further construct a corresponding dataset and benchmark, LMEE-Bench, incorporating multi-goal navigation and memory-based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent’s memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi-task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state-of-the-art embodied exploration models demonstrate that our approach achieves significant advantages in long-horizon embodied tasks.


[59] TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech cs.AI | cs.CL | cs.MM | cs.SI

Girish A. Koushik, Helen Treharne, Diptesh Kanojia

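TL;DR: This paper proposes TANDEM, a framework that recasts audio-visual hate speech detection from a binary classification task into a structured reasoning problem. A tandem reinforcement learning strategy, in which vision-language and audio-language models optimize each other, enables reasoning over long temporal sequences without dense frame-level supervision and significantly improves target identification and temporal grounding.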

Details

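Motivation: Long-form multimodal content is increasingly common on social media, where harmful narratives are built through a complex interplay of audio, visual, and textual cues. Existing automated systems can flag hate speech with high accuracy but lack interpretable, fine-grained evidence (such as precise timestamps and target identities) and so cannot meet the needs of human-in-the-loop moderation.

Result: Experiments on three benchmark datasets show that TANDEM significantly outperforms zero-shot and context-augmented baselines, reaching an F1 score of 0.73 for target identification on HateMM (a 30% improvement over the state of the art) while maintaining precise temporal grounding.

Insight: The contribution is a tandem reinforcement learning strategy that optimizes multimodal models through self-constrained cross-modal context, achieving stable reasoning over long temporal sequences. The study also suggests that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.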

Abstract: Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as “black boxes” that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.


[60] AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems cs.AI | cs.CL

Weiyi Wang, Xinchi Chen, Jingjing Gong, Xuanjing Huang, Xipeng Qiu

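TL;DR: This paper introduces AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP). SPPs are characterized by heterogeneous objectives, strict physical constraints, and long decision horizons. The benchmark integrates multiple scheduling regimes and provides a unified agent interaction protocol. The evaluation finds that current state-of-the-art agentic models plan far worse than specialized solvers under realistic constraints.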

Details

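Motivation: Existing agent benchmarks focus mainly on symbolic or weakly grounded environments and do not evaluate agentic planning in real-world domains with strict physical constraints, such as space planning.

Result: An evaluation of a range of state-of-the-art open- and closed-source agentic LLM systems on AstroReason-Bench finds that current agents substantially underperform specialized solvers under strict physical constraints.

Insight: The contribution is a unified benchmark for high-stakes, heterogeneous space planning problems that highlights key limitations of generalist planning agents under realistic physical constraints and offers a challenging, diagnostic testbed for future agent research.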

Abstract: Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating on a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.


cs.GT [Back]

[61] The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents cs.GT | cs.AI | cs.CL | cs.MA

Eilam Shapira, Roi Reichart, Moshe Tennenholtz

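TL;DR: This paper studies how expanding the set of available AI agent technologies affects strategic interaction in economic markets. Using game-theoretic models of bargaining, negotiation, and persuasion, it finds that increasing the choice of AI delegates can significantly shift equilibrium payoffs and regulatory outcomes, giving regulators an incentive to proactively develop technologies. It also identifies the "Poisoned Apple" effect: an agent may release a new technology that it never ends up using, solely to manipulate the regulator's choice of market design in its favor, at the expense of its opponent and of regulatory fairness. The results indicate that static regulatory frameworks are vulnerable to manipulation via technology expansion and that dynamic market designs are needed to adapt to evolving AI capabilities.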

Details

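Motivation: The paper examines how integrating AI agents into economic markets changes the landscape of strategic interaction, and how technology expansion affects equilibria and regulation in canonical game-theoretic settings (resource division, trade under asymmetric information, and strategic information transmission).

Result: In the game-theoretic settings studied, increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes. The "Poisoned Apple" effect shows that a strategic technology release can improve the releaser's welfare while harming the opponent and the regulator's fairness objectives. No specific benchmark or state-of-the-art comparison is reported.

Insight: The contribution is the concept of the "Poisoned Apple" effect, which shows that AI technology expansion can be used strategically to manipulate market design. Viewed objectively, the analysis suggests moving from static regulation to dynamic, adaptive frameworks to counter the manipulation risks that evolving AI capabilities bring.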

Abstract: The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the “Poisoned Apple” effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator’s choice of market design in their favor. This strategic release improves the releaser’s welfare at the expense of their opponent and the regulator’s fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.


cs.NE [Back]

[62] Line-based Event Preprocessing: Towards Low-Energy Neuromorphic Computer Vision cs.NE | cs.AI | cs.CV | eess.IV

Amélie Gruel, Pierre Lewden, Adrien F. Vincent, Sylvain Saïghi

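TL;DR: This paper proposes a line-based event preprocessing method that aims to reduce the energy consumption of spiking neural networks in computer vision tasks. By extending a line detection mechanism to preprocess event data, the authors show on three event-based benchmark datasets that the method can significantly reduce theoretical energy consumption while maintaining or improving classification accuracy.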

Details

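Motivation: Spiking vision systems offer biological inspiration, low energy use, and low latency for dynamic visual processing, but optimizing energy consumption for embedded applications remains a challenge. The paper aims to reduce the number of synaptic operations, and hence hardware energy consumption, by preprocessing event data.

Result: Experiments on three event-based benchmark datasets show that line-based preprocessing strategies can significantly reduce theoretical energy consumption while maintaining or improving classification accuracy, raising the efficiency of spiking classification.

Insight: The contribution is the combination of line detection with event preprocessing, which optimizes data quantity to reach a favorable trade-off between energy consumption and performance, suggesting a path toward low-energy spiking computer vision.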

Abstract: Neuromorphic vision has made significant progress in recent years, thanks to the natural match between spiking neural networks and event data in terms of biological inspiration, energy savings, latency and memory use for dynamic visual data processing. However, optimising its energy requirements remains a challenge within the community, especially for embedded applications. One solution may reside in preprocessing events to optimise data quantity, thus lowering the energy cost on neuromorphic hardware, which is proportional to the number of synaptic operations. To this end, we extend an end-to-end neuromorphic line detection mechanism to introduce line-based event data preprocessing. Our results demonstrate on three benchmark event-based datasets that preprocessing leads to an advantageous trade-off between energy consumption and classification performance. Depending on the line-based preprocessing strategy and the complexity of the classification task, we show that one can maintain or increase the classification accuracy while significantly reducing the theoretical energy consumption. Our approach systematically leads to a significant improvement of the neuromorphic classification efficiency, thus laying the groundwork towards a more frugal neuromorphic computer vision thanks to event preprocessing.
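Since the abstract treats energy cost as proportional to the number of synaptic operations, the effect of reducing input events can be sketched with a simple count. Everything numeric here is hypothetical: the per-layer spike counts, the fan-out values, and the energy per synaptic operation are placeholders, and the paper's actual energy model may differ.

```python
def synaptic_operations(spike_counts, fan_out):
    """Total synaptic operations: each spike triggers one operation per outgoing synapse.

    spike_counts[i] is the number of spikes emitted by layer i, and
    fan_out[i] is the number of outgoing synapses per neuron in that layer.
    """
    return sum(spikes * synapses for spikes, synapses in zip(spike_counts, fan_out))


def theoretical_energy(num_synops, joules_per_synop):
    """Energy estimate under the proportionality assumption."""
    return num_synops * joules_per_synop


def relative_saving(ops_before, ops_after):
    """Fraction of synaptic operations (and thus theoretical energy) saved."""
    return 1.0 - ops_after / ops_before
```

For instance, with hypothetical spike counts of `[1000, 400]` before preprocessing and `[600, 300]` after, and fan-outs of `[64, 10]`, the operation count falls from 68000 to 41400, a saving of about 39% in this toy model.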


cs.RO [Back]

[63] H-AIM: Orchestrating LLMs, PDDL, and Behavior Trees for Hierarchical Multi-Robot Planning cs.RO | cs.AI | cs.CV | cs.LG | cs.MA

Haishan Zeng, Peng Li

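TL;DR: This paper proposes the H-AIM framework to address long-horizon task planning for heterogeneous robot teams. It uses a three-stage cascaded architecture: first, an LLM parses instructions and generates PDDL problem descriptions; second, the semantic reasoning of LLMs is combined with the search capabilities of a classical planner to produce optimized action sequences; third, the resulting plan is compiled into behavior trees for reactive control.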

Details

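Motivation: The paper addresses the limitations of LLMs in long-term reasoning and dynamic multi-robot coordination when heterogeneous robot teams execute long-horizon tasks in embodied AI.

Result: On the proposed MACE-THOR benchmark dataset (42 complex tasks across 8 distinct household layouts), H-AIM raises the task success rate from 12% for the strongest baseline, LaMMA-P, to 55%, and the goal condition recall from 32% to 72%.

Insight: The contribution is the combination of LLM semantic understanding, precise search by a classical planner, and reactive control via behavior trees, with a shared blackboard mechanism that supports coordination of dynamically sized heterogeneous robot teams, yielding hierarchical, robust planning from high-level instructions to concrete execution.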

Abstract: In embodied artificial intelligence, enabling heterogeneous robot teams to execute long-horizon tasks from high-level instructions remains a critical challenge. While large language models (LLMs) show promise in instruction parsing and preliminary planning, they exhibit limitations in long-term reasoning and dynamic multi-robot coordination. We propose Hierarchical Autonomous Intelligent Multi-Robot Planning (H-AIM), a novel embodied multi-robot task planning framework that addresses these issues through a three-stage cascaded architecture: 1) It leverages an LLM to parse instructions and generate Planning Domain Definition Language (PDDL) problem descriptions, thereby transforming commands into formal planning problems; 2) It combines the semantic reasoning of LLMs with the search capabilities of a classical planner to produce optimized action sequences; 3) It compiles the resulting plan into behavior trees for reactive control. The framework supports dynamically sized heterogeneous robot teams via a shared blackboard mechanism for communication and state synchronization. To validate our approach, we introduce the MACE-THOR benchmark dataset, comprising 42 complex tasks across 8 distinct household layouts. Experimental results demonstrate that H-AIM achieves a remarkable performance improvement, elevating the task success rate from 12% to 55% and boosting the goal condition recall from 32% to 72% against the strongest baseline, LaMMA-P.
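The third stage, compiling a plan into a behavior tree that reads and writes a shared blackboard, can be illustrated with a minimal sketch. This is not the H-AIM implementation: the node classes, the `compile_plan` helper, the skill callbacks, and the use of a plain dictionary as the blackboard are all assumptions made for illustration, and only a simple Sequence node is shown.

```python
SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"


class Action:
    """Leaf node wrapping a skill callback that takes the blackboard and returns a status."""

    def __init__(self, name, skill):
        self.name = name
        self.skill = skill

    def tick(self, blackboard):
        return self.skill(blackboard)


class Sequence:
    """Ticks children in order and stops at the first child that is not SUCCESS."""

    def __init__(self, children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != SUCCESS:
                return status
        return SUCCESS


def compile_plan(plan, skills):
    """Turn an ordered list of action names into a Sequence of Action nodes."""
    return Sequence([Action(step, skills[step]) for step in plan])
```

Because every tick re-evaluates the tree against the current blackboard, a failed or still-running step is reported immediately instead of the remaining steps being executed blindly, which is the reactive behavior that motivates behavior trees over a fixed action list.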