cs.CL

[1] Source-Modality Monitoring in Vision-Language Models cs.CL

Etha Tianze Hua, Tian Yun, Ellie Pavlick

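TL;DR: This paper defines source-modality monitoring, the ability of multimodal models to track and communicate the input source from which information originates, and treats it as an instance of the binding problem. Experiments with 11 vision-language models on target-modality information retrieval tasks show that both syntactic and semantic signals matter, but semantic signals tend to dominate when the modalities differ markedly in distribution.

Details

Motivation: To explore how multimodal models track the input source of information (source-modality monitoring) as a way into the broader binding problem: how a model links words in the user prompt (such as "image") to specific components of its input and context (such as the actual image).

Result: In target-modality information retrieval experiments across 11 vision-language models, both syntactic and semantic signals play an important role, but semantic signals tend to outweigh syntactic ones when the modalities are highly distinct distributionally.

Insight: The contribution lies in formalizing source-modality monitoring as a binding problem and empirically analyzing how syntactic and semantic signals differ in their role in cross-modal binding. The findings bear on model robustness and on increasingly multimodal agentic systems.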

Abstract: We define and investigate source-modality monitoring – the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like image in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.


[2] Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching cs.CL | cs.AI | cs.LG

Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn

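TL;DR: This paper proposes a lightweight framework that combines retrieval-augmented generation with large language models for scalable patient-trial matching. Retrieval-augmented generation identifies clinically relevant segments in long electronic health records to reduce input complexity, an LLM encodes those segments into informative representations, and dimensionality reduction followed by lightweight predictors performs efficient classification.

Details

Motivation: Patient-trial matching over long, heterogeneous electronic health records and complex eligibility criteria poses challenges for scalability, generalization, and computational efficiency. Existing methods are either computationally expensive or struggle to capture unstructured clinical narratives.

Result: Evaluated on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic, retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals, and performance is comparable to end-to-end LLM approaches at substantially lower computational cost.

Insight: The contribution is the explicit separation of retrieval-augmented generation from LLM representation learning, which yields a lightweight pipeline. The framework balances performance and efficiency, and shows that frozen LLMs represent structured clinical data well whereas fine-tuning is essential for unstructured narratives.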

Abstract: Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.


[3] Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning cs.CL

Karthic Palaniappan

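TL;DR: This paper explores using reinforcement learning to incentivize reasoning in a neuro-symbolic language in vision-language models, aiming to improve analytical reasoning ability and efficiency. With Qwen3-VL-2B-Instruct as the base model, it reports a 3.33% accuracy improvement on a vision-language evaluation dataset of math, science, and general knowledge questions, with 75% fewer reasoning tokens than SymPy.

Details

Motivation: To explore how vision-language concepts can be represented and reasoned about in a neuro-symbolic language, and to study improvements in the analytical reasoning ability and efficiency of "thinking systems". The work is inspired by the film Arrival, in which learning an alien language grants the ability to transcend time.

Result: On a vision-language evaluation dataset (math, science, and general knowledge questions), accuracy improves by 3.33% while reasoning tokens fall by 75% relative to SymPy, balancing efficiency and performance.

Insight: The contribution is combining reinforcement learning with a neuro-symbolic language to push vision-language models toward more efficient reasoning. Cutting reasoning tokens yields the efficiency gain and suggests a direction for optimizing a model's "thinking" process.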

Abstract: There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don’t care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of “thinking systems”. With Qwen3-VL-2B-Instruct as base model and 4 $\times$ Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75% over SymPy. I’ve documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: https://github.com/i-like-bfs-and-dfs/wolfram-reasoning.


[4] Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning cs.CL

Qinan Yu, Alexa Tartaglini, Peter Hase, Carlos Guestrin, Christopher Potts

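TL;DR: This paper studies Reinforcement Learning from Verifiable Rewards (RLVR) for chain-of-thought reasoning and finds that RLVR with outcome rewards alone improves task accuracy but does not guarantee that the model's reasoning is causally important or verifiably sufficient.

Details

Motivation: To challenge a common assumption in RLVR training: that the trained reasoning chains reliably represent how the model arrives at its answer.

Result: Experiments with the Qwen2.5 model series on ReasoningGym tasks show that RLVR improves task accuracy but does not reliably improve Causal Importance of Reasoning (CIR) or Sufficiency of Reasoning (SR). Adding a small amount of supervised fine-tuning (SFT) before RLVR, or using a joint reward that combines the outcome reward with auxiliary CIR/SR rewards, improves reasoning quality while maintaining accuracy.

Insight: The contribution is two new metrics, CIR and SR, for assessing the reliability of the reasoning process, together with evidence of the limits of standard RLVR training. The core insight is that optimizing outcome rewards alone is not enough to ensure verifiable or causally important reasoning, and that simple changes to the training procedure (SFT beforehand, or a joint reward) can remedy this.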

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.


[5] Voice Under Revision: Large Language Models and the Normalization of Personal Narrative cs.CL | cs.CY

Tom van Nuenen

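TL;DR: This study examines how LLM rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions, measuring 13 linguistic markers including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. LLM rewriting produces stylistic normalization: function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. Voice-preserving prompts do not fully reverse the trend; rewritten texts converge in feature space, and narration shifts from embedded to distanced.

Details

Motivation: To determine how LLM rewriting affects the style and narrative texture of personal narratives and whether it normalizes style, which matters for digital humanities and computational text analysis that rely on style, voice, authorship, and corpus integrity.

Result: LLM rewriting produces stylistic normalization: function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match to their source texts. Narrative markers indicate a shift from embedded to distanced narration and from explicit causal reasoning to compressed abstraction.

Insight: The contribution is a systematic quantification of how LLM rewriting affects the style of personal narratives, revealing a trend toward stylistic normalization. Even voice-preserving prompts reduce the magnitude of the changes without eliminating their direction, so LLM revision should be understood as a consequential form of textual mediation rather than surface-level editing.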

Abstract: This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to “improve” the text or simply to “rewrite” it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.
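
As a rough illustration of the kind of marker counted in such an analysis, the sketch below computes a handful of the features named above for a single text. The word lists, the regex tokenizer, and the two example sentences are all assumptions made for illustration; the study's actual feature definitions are not given in the abstract.

```python
import re

# Hypothetical, deliberately tiny word lists (not the study's lists).
FUNCTION_WORDS = {"the", "a", "an", "and", "but", "of", "to", "in", "it", "that", "so", "what"}
FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"}


def style_markers(text: str) -> dict:
    """Return per-token rates for a few stylistic markers of one text."""
    # Lowercased word tokens; a straight apostrophe inside a word is kept,
    # so "didn't" stays one token (curly apostrophes are not handled here).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    stems = [t.split("'")[0] for t in tokens]  # "i'm" -> "i"
    return {
        "function_word_rate": sum(s in FUNCTION_WORDS for s in stems) / n,
        "first_person_rate": sum(s in FIRST_PERSON for s in stems) / n,
        # Crude proxy: any apostrophe counts, so possessives are counted too.
        "contraction_rate": sum("'" in t for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
        "mean_word_length": sum(len(t.replace("'", "")) for t in tokens) / n,
    }


# Invented example pair: a casual sentence and a more "polished" paraphrase.
original = style_markers("I didn't know what to do, so I called my mom.")
rewritten = style_markers("Uncertain how to proceed, the narrator contacted a parent.")
```

Comparing the two dictionaries key by key gives the direction of change for each marker; a real analysis would use established word lists and many texts per condition.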


[6] How Large Language Models Balance Internal Knowledge with User and Document Assertions cs.CL

Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang

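TL;DR: This paper proposes a three-source interaction framework and systematically evaluates how 27 large language models balance internal parametric knowledge, user assertions, and document assertions. Most models rely more on document assertions and are easily swayed by external information; fine-tuning experiments show that a model's ability to discriminate among sources can be improved.

Details

Motivation: Prior work is limited to a binary conflict paradigm and ignores the interactive setting in which internal knowledge, user assertions, and document assertions are all present at once. A systematic study of how LLMs balance multiple sources is needed for system safety.

Result: Evaluating 27 LLMs from 3 families on 2 datasets shows that most models favor document assertions over user assertions, and that post-training reinforces this preference. Behavioral analysis shows that most models are impressionable and cannot effectively distinguish helpful from harmful external information. Fine-tuning on diverse source-interaction data significantly improves discrimination.

Insight: The contribution is the three-source interaction framework, which moves beyond the binary conflict paradigm. Targeted fine-tuning can strengthen LLM reliability and discrimination in multi-source settings, offering a path toward trustworthy LLMs.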

Abstract: Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model’s ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model’s discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.


[7] Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen cs.CL | cs.AI

Jon-Paul Cacioli

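TL;DR: In a pre-registered psychometric validity screen, this study tests whether verbal confidence from seven open-weight instruction-tuned LLMs with 3-9B parameters is valid on TriviaQA. Under minimal numeric elicitation (0-100), no model's verbal confidence met the minimal criteria for item-level Type-2 discrimination, with a mean ceiling rate of 91.7%. Categorical elicitation (10 classes) did not improve validity and pushed task accuracy below 5% in six models. Token-level log-probability did not usefully predict verbal confidence, and in the reasoning-distilled model, reasoning-trace length correlated negatively with confidence.

Details

Motivation: Verbal confidence elicitation is widely used to obtain uncertainty estimates from LLMs, but its psychometric validity has not been systematically assessed. The study tests whether verbal confidence produced by mid-sized (3-9B parameter) open-weight instruction-tuned models under minimal elicitation meets basic item-level discrimination criteria.

Result: Across 8,384 deterministic trials on 524 TriviaQA items: all seven instruct models were classified Invalid on numeric confidence (H2 confirmed), with a mean ceiling rate of 91.7% (H1 confirmed); categorical elicitation did not rescue validity and instead reduced accuracy below 5% in 6 of 7 models (H4 not confirmed); token-level log-probability had very little predictive power for verbal confidence (mean cross-validated R² < 0.01, H5 confirmed); and in the reasoning-distilled model, reasoning-trace length showed a negative partial correlation with confidence (ρ = -0.36, p < .001).

Insight: The contribution is a pre-registered psychometric validity screen of verbal confidence in mid-sized open-weight instruction-tuned models. It shows that, in this model-size regime, minimal verbal elicitation fails to preserve internal uncertainty signals at the output interface, and that categorical elicitation can harm task performance. The study argues that psychometric screening should precede any downstream use of such confidence signals, and it reports a Reasoning Contamination Effect (longer reasoning traces, lower confidence), giving empirical evidence of a disconnect between internal representations and expressed output.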

Abstract: Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: osf.io/azbvx), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted >=4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level log-probability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.
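
To make the two quantities at the center of this entry concrete, the sketch below computes a ceiling rate and a simple Type-2 discrimination score from a list of trials. It rests on assumptions not stated in the abstract: the ceiling is taken to be a reported confidence of 100, and Type-2 AUROC (the probability that a correct item received higher confidence than an incorrect one) stands in for the study's own validity criteria. The trial data are invented.

```python
def ceiling_rate(confidences, ceiling=100):
    """Fraction of trials whose stated confidence is at the scale maximum."""
    return sum(c >= ceiling for c in confidences) / len(confidences)


def type2_auroc(confidences, correct):
    """Probability that a correct item gets higher confidence than an
    incorrect one, counting ties as 0.5. Returns None if one class is empty."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented trials: confidence is saturated near the ceiling, so it barely
# separates correct from incorrect answers.
conf = [100, 100, 100, 100, 90, 100]
is_correct = [True, False, True, False, False, True]

rate = ceiling_rate(conf)
auroc = type2_auroc(conf, is_correct)
```

A score near 0.5 means confidence carries almost no information about correctness, which is what heavy saturation at the ceiling tends to produce.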


[8] TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis cs.CL | eess.AS

Xi Wang, Jie Wang, Xingchen Song, Baijun Song, Jingran Xie

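TL;DR: This paper proposes TTS-PRISM, a multi-dimensional perceptual reasoning and interpretable diagnostic framework for Mandarin TTS systems. It establishes a 12-dimensional evaluation schema, builds a targeted diagnostic dataset, and applies instruction tuning to diagnose fine-grained acoustic defects.

Details

Motivation: Generative TTS models approach human-level quality, but monolithic metrics cannot diagnose fine-grained acoustic artifacts or explain perceptual collapse, so an interpretable, fine-grained evaluation tool is needed.

Result: On a 1,600-sample Gold Test Set, TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms reveals fine-grained capability differences.

Insight: The contribution is threefold: a structured 12-dimensional evaluation schema, a diagnostic dataset built with adversarial perturbations and expert anchors, and instruction tuning that embeds explicit scoring criteria and reasoning into an end-to-end model, giving interpretable fine-grained diagnosis.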

Abstract: While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.


[9] Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA cs.CL | cs.AI

Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo

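TL;DR: This paper introduces the task of analytical question answering over large, semi-structured document collections and presents the MuDABench benchmark, which requires extracting and synthesizing information across many documents for quantitative analysis and comprises over 80,000 pages and 332 analytical QA instances. Standard RAG systems perform poorly; the authors propose a multi-agent workflow that improves performance, though a significant gap to human experts remains.

Details

Motivation: Existing multi-document QA benchmarks typically require information from only a few documents with limited cross-document reasoning, whereas real-world analytical QA demands extensive inter-document analysis and aggregation.

Result: On MuDABench, standard RAG systems perform poorly. The proposed multi-agent workflow substantially improves both process metrics (intermediate-fact coverage) and outcome metrics (final answer accuracy), but a significant gap to human expert performance remains. The analysis identifies two main bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems.

Insight: Contributions include: 1) a new analytical QA task and benchmark requiring extensive cross-document reasoning; 2) a multi-agent workflow combining planning, extraction, and code generation modules; 3) an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage to diagnose the reasoning process. The work underlines the value of structured workflow design over flat retrieval for complex document analysis.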

Abstract: This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli-Li/MuDABench.


[10] Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion cs.CL

Fahmida Alam, Mihai Surdeanu, Ellen Riloff

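TL;DR: This paper proposes RC-RAG, a multi-stage paraphrase-guided relation-completion framework that targets the weak performance of large language models (LLMs) on relation completion, particularly in long-tail settings where information is rare or sparse. Without any model fine-tuning, it systematically incorporates relation paraphrases into three stages (retrieval, summary generation, and final generation) and substantially improves LLM performance on relation completion.

Details

Motivation: LLMs perform poorly on relation completion, with or without retrieval-augmented generation, especially in long-tail settings where the required information is rare or sparsely represented.

Result: Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM with RC-RAG improves by 40.6 Exact Match points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 points, respectively, while keeping computational overhead low.

Insight: The core contribution is a fine-tuning-free, multi-stage paraphrase-infusion framework that integrates relation paraphrases into retrieval, summarization, and generation to improve understanding and completion of long-tail relations. This staged, lightweight use of paraphrases offers an efficient approach to the long-tail problem in knowledge-intensive tasks.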

Abstract: Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.
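
Stage (a), expanding the lexical coverage of a relation with paraphrases at retrieval time, can be sketched with a toy retriever. Everything here is an assumption for illustration: the token-overlap scoring, the `retrieve` helper, the hand-written paraphrases, and the passages are not taken from the paper, which does not describe its retriever in the abstract.

```python
import re


def tokenize(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(entity, relation_phrases, passages, k=2):
    """Rank passages mentioning the entity by their best overlap with any
    phrasing of the relation; passages with zero overlap are dropped."""
    entity_tokens = tokenize(entity)
    scored = []
    for passage in passages:
        p_tokens = tokenize(passage)
        if not entity_tokens <= p_tokens:
            continue  # passage does not mention the entity at all
        best = max(
            len(tokenize(phrase) & p_tokens) / len(tokenize(phrase))
            for phrase in relation_phrases
        )
        scored.append((best, passage))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [passage for score, passage in scored[:k] if score > 0]


passages = [
    "Ada Lovelace was born in London.",
    "Ada Lovelace wrote notes on the Analytical Engine.",
]

# The canonical relation name alone shares no tokens with the relevant passage.
without_paraphrases = retrieve("Ada Lovelace", ["place of birth"], passages)
# Adding paraphrases of the relation recovers it.
with_paraphrases = retrieve(
    "Ada Lovelace", ["place of birth", "was born in", "birthplace"], passages
)
```

The design point is that each paraphrase is scored separately and the best match wins, so a passage only needs to match one phrasing of the relation rather than the union of all of them.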


[11] Large Language Models Decide Early and Explain Later cs.CL

Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy

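TL;DR: This paper studies when the answer becomes fixed during chain-of-thought reasoning and finds that models usually settle on their final answer early, so that many later reasoning tokens are redundant post-decision explanation. The authors investigate early stopping strategies, including probe-based stopping, that substantially cut reasoning-token usage with only a small loss in accuracy.

Details

Motivation: To determine whether an LLM's final answer is already fixed at an intermediate stage of chain-of-thought generation, and whether subsequent reasoning tokens are merely post-decision explanation, which would expose redundancy in the reasoning process and an opportunity to reduce inference cost and latency.

Result: For Qwen3-4B, averaged across datasets, the predicted answer changes in only 32% of queries. Once the final answer switch has occurred, the model still generates an average of 760 additional reasoning tokens per query. Early stopping heuristics, including probe-based stopping, save 500 reasoning tokens per query with only a 2% drop in accuracy.

Insight: The contribution is the use of forced answer completion to study how answers evolve over reasoning steps, which reveals substantial redundancy in chain-of-thought generation. The simple early-stopping heuristics reduce compute and point to a way of making LLM inference more efficient.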

Abstract: Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model’s final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model’s intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
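
The idea of halting once the answer has stabilized can be sketched on a list of intermediate predictions, one per reasoning checkpoint, as forced answer completion would produce. The stability rule below (stop after the same answer appears at `patience` consecutive checkpoints) is an assumed heuristic for illustration; it is not the paper's probe-based method, and the example predictions are invented.

```python
def last_switch_index(answers):
    """Index of the checkpoint at which the prediction last changed
    (0 if it never changed). Assumes a non-empty list."""
    last = 0
    for i in range(1, len(answers)):
        if answers[i] != answers[i - 1]:
            last = i
    return last


def early_stop_index(answers, patience=3):
    """First checkpoint at which the same answer has been seen `patience`
    times in a row; falls back to the final checkpoint if that never happens."""
    if not answers:
        raise ValueError("need at least one intermediate prediction")
    run = 1
    for i in range(1, len(answers)):
        run = run + 1 if answers[i] == answers[i - 1] else 1
        if run >= patience:
            return i
    return len(answers) - 1


# Invented trace: the model switches from "B" to "A" early, then keeps reasoning.
trace = ["B", "A", "A", "A", "A", "A"]
switch_at = last_switch_index(trace)
stop_at = early_stop_index(trace, patience=3)
checkpoints_saved = len(trace) - 1 - stop_at
```

A larger `patience` lowers the risk of stopping before a late switch at the cost of fewer tokens saved, which is the accuracy and cost trade-off the entry describes.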


[12] STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation cs.CL

Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu

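TL;DR: This paper proposes STEM (Structure-Tracing Evidence Mining), a framework for multi-hop reasoning in knowledge graph question answering (KGQA). It reframes multi-hop reasoning as a schema-guided graph search task: a Semantic-to-Structural Projection pipeline decomposes the query and constructs an adaptive query schema graph, after which globally-aware node anchoring and subgraph retrieval obtain the evidence reasoning graph from the knowledge graph.

Details

Motivation: To address two main challenges in KGQA: the structural heterogeneity of knowledge graphs causes semantic mismatch during retrieval, and existing reasoning-path retrieval methods lack a global structural perspective.

Result: STEM achieves state-of-the-art performance on multiple multi-hop benchmarks and significantly improves the accuracy and evidence completeness of multi-hop reasoning graph retrieval.

Insight: Contributions include reframing multi-hop reasoning as a schema-guided graph search task, the Semantic-to-Structural Projection pipeline with its adaptive query schema graph, and a Triple-Dependent GNN (Triple-GNN) that generates a Global Guidance Subgraph (Guidance Graph) to integrate global structural information more effectively during graph construction.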

Abstract: Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.


Ishaan Gakhar, Harsh Nandwani

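TL;DR: This paper proposes ReLeVAnT, a framework for binary classification of legal documents. Using one-time keyword extraction, n-gram processing, contrastive score matching, and a shallow neural network, it reaches 99.3% accuracy and a 98.7% F1 score on the LexGLUE dataset.

Details

Motivation: Existing legal document classification methods depend on structured data, metadata, or extensive computation. This work instead exploits discriminative features that distinguish documents across classes for efficient classification.

Result: On the LexGLUE dataset, the method reaches 99.3% accuracy and a 98.7% F1 score.

Insight: The contribution is combining one-time keyword extraction with a lightweight classifier for efficient and accurate legal text classification. Its use of domain-specific lexical features and contrastive score matching may transfer to other classification tasks.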

Abstract: The classification of legal documents from an unstructured data corpus has several crucial applications in downstream tasks. Documents relevant to court filings are key in use cases such as drafting motions, memos, and outlines, as well as in tasks like docket summarisation, retrieval systems, and training data curation. Current methods classify based on provided metadata, LLM-extracted metadata, or multimodal methods. These methods depend on structured data, metadata, and extensive computational power. This task is approached from a perspective of leveraging discriminative features in the documents between classes. The authors propose ReLeVAnT, a framework for legal document binary classification. ReLeVAnT utilises n-gram processing, contrastive score matching, and a shallow neural network as the primary drivers for discriminative classification. It leverages one-time keyword extraction per corpus, followed by a shallow classifier to swiftly and reliably classify documents with 99.3% accuracy and 98.7% F1 score on the LexGLUE dataset.


[14] Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets cs.CL | cs.AI

Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam

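TL;DR: The paper presents SLIDERS, a framework for question answering over very long document collections that exceed an LLM's context window. It extracts key information from documents into a relational database, reasons scalably over it with SQL, and adds a data reconciliation stage to keep the extracted information consistent and complete.

Details

Motivation: In real-world document QA, a fixed LLM context window cannot hold everything as document collections grow, and the usual chunking workaround creates a bottleneck when aggregating evidence.

Result: SLIDERS outperforms all baselines on three existing long-context benchmarks, exceeding GPT-4.1 by 6.6 points on average, and improves over the next best baseline by about 19 and 32 points on two new benchmarks at 3.9M and 36M tokens, respectively.

Insight: The core idea is to turn unstructured long-text QA into querying and reasoning over a structured relational database, with a data reconciliation mechanism (using provenance, extraction rationales, and metadata) that makes locally extracted information globally coherent and so scales to very large document collections.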

Abstract: Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
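
The pattern described here (extract records with provenance into a relational store, reconcile them, then answer with SQL) can be sketched with Python's built-in `sqlite3` module. The schema, the example records, and the deduplication rule are all hypothetical; the abstract does not specify how SLIDERS represents or repairs records, and its reconciliation also handles inconsistent and incomplete records, which this sketch does not.

```python
import sqlite3

# Hypothetical chunk-level extractions: (company, year, revenue, source_doc, chunk_id).
# The second row is the same fact extracted again from an overlapping chunk.
EXTRACTED = [
    ("Acme", 2023, 120.0, "acme_report.pdf", 4),
    ("Acme", 2023, 120.0, "acme_report.pdf", 9),
    ("Bolt", 2023, 80.0, "bolt_report.pdf", 2),
]


def build_store(records):
    """Load extracted records, with their provenance columns, into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts ("
        "company TEXT, year INTEGER, revenue REAL, "
        "source_doc TEXT, chunk_id INTEGER)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?)", records)
    return conn


def reconcile(conn):
    """Toy reconciliation: keep the first row for each (company, year, revenue)."""
    conn.execute(
        "DELETE FROM facts WHERE rowid NOT IN ("
        "SELECT MIN(rowid) FROM facts GROUP BY company, year, revenue)"
    )


def total_revenue(conn, year):
    """Answer an aggregate question with SQL instead of concatenated text."""
    row = conn.execute(
        "SELECT SUM(revenue) FROM facts WHERE year = ?", (year,)
    ).fetchone()
    return row[0]


conn = build_store(EXTRACTED)
before = total_revenue(conn, 2023)  # duplicate row inflates the sum
reconcile(conn)
after = total_revenue(conn, 2023)
```

Without the reconciliation step the duplicated extraction is counted twice in the aggregate, which is the kind of locally correct but globally incoherent state the entry refers to.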


[15] CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language cs.CL | cs.AI

Rui Zhao, Xuewen Zhong, Xiaoyun Zheng, Jinsong Su, Yidong Chen

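TL;DR: This paper introduces CNSL-bench, the first comprehensive benchmark for Chinese National Sign Language, designed to evaluate the sign language understanding of multimodal large language models (MLLMs). Built on the official standard dictionary, it provides aligned text, image, and video data and supports fine-grained analysis across manual articulatory forms: air-writing, finger-spelling, and the Chinese manual alphabet. An evaluation of 21 open-source and proprietary MLLMs shows that current models remain well below human performance and exhibit systematic gaps across input modalities and articulatory forms.

Details

Motivation: Although LLMs have advanced sign language research, their intrinsic ability to understand sign language in multimodal contexts remains underexplored. No existing benchmark is grounded in a standardized sign language dictionary, covers multiple modalities, and supports fine-grained analysis of articulatory forms; this paper aims to fill that gap.

Result: Evaluating 21 up-to-date MLLMs on CNSL-bench shows that current models perform substantially below humans on sign language understanding. Performance differs systematically across input modalities (text, image, video) and articulatory forms (such as air-writing and finger-spelling), and instruction-following robustness varies substantially across models.

Insight: The contribution is the first multimodal benchmark built on the standardized Chinese National Sign Language dictionary; its authoritative grounding, multimodal coverage, and articulatory diversity give a comprehensive framework for evaluating MLLM sign language understanding. The benchmark exposes the current limits of MLLMs and points to fine-grained sign analysis and multimodal alignment as directions for improvement.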

Abstract: Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.


[16] Selective Contrastive Learning For Gloss Free Sign Language Translation cs.CL

Changhao Lai, Rui Zhao, Xuewen Zhong, Jinsong Su, Yidong Chen

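TL;DR: This paper proposes Selective Contrastive Learning for SLT (SCL-SLT) to address noisy cross-modal alignment in gloss-free sign language translation (SLT). A trajectory-based analysis finds that random in-batch negatives are often uninformative, motivating a Pair Selection strategy that scores negatives by their similarity dynamics and builds mini-batches via a curriculum, strengthening contrastive supervision while reducing the influence of noisy negatives.

Details

Motivation: SLT faces a modality mismatch between visual signs and written text, particularly in gloss-free settings. Existing methods based on CLIP-like vision-language pretraining use random in-batch contrastive learning, which may mislabel semantically similar pairs as negatives and so introduce noisy, inconsistent alignment supervision.

Result: The abstract reports no specific quantitative results or benchmarks; the method aims to improve cross-modal alignment through a better contrastive learning strategy.

Insight: The contribution is an analysis of negative-pair similarity trajectories that exposes the limits of random negatives, and a curriculum-based pair selection strategy that uses contrastive learning more effectively for cross-modal alignment while reducing noisy supervision.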

Abstract: Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.


[17] Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement cs.CL

Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi

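TL;DR: This paper asks whether, when experts disagree in evaluating LLM-generated business ideas, an automatic judge should approximate an aggregate consensus or model individual evaluators. It introduces PBIG-DATA, a dataset of approximately 3,000 expert scores across 300 patent-grounded product ideas on six business dimensions. Experts disagree substantially on fine-grained ordinal scores, while agreement is higher under coarse selection. The experiments compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning.

Details

Motivation: Business idea evaluation uses multi-dimensional criteria and experts often disagree, so it is unclear whether an automatic judge should target aggregate consensus or model evaluators individually. Resolving this matters for scalable and accurate evaluation.

Result: On PBIG-DATA, personalized judges align more closely with individual evaluators' scores than aggregate judges across the six business dimensions (such as specificity and technical validity), and the correlation between evaluator agreement and reasoning similarity appears only under personalized conditioning.

Insight: The contribution is the argument that pooled labels can be a fragile target in pluralistic evaluation settings, which motivates evaluator-conditioned judge designs. The work highlights the value of modeling individual evaluator preferences for similarly subjective evaluation tasks.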

Abstract: Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator’s scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.


[18] Learning Evidence Highlighting for Frozen LLMs cs.CL | cs.AI

Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei

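TL;DR: This paper proposes HiLight, a framework that trains a lightweight Emphasis Actor to insert highlight tags around key evidence in the unaltered context so that a frozen LLM can reason better, without evidence annotations or changes to the LLM itself.

Details

Motivation: LLMs often miss decisive evidence in long, noisy contexts. The paper addresses this by decoupling evidence selection from reasoning.

Result: On sequential recommendation and long-context question answering, HiLight consistently outperforms prompt-based and automated prompt-optimization baselines, and the learned emphasis policy transfers zero-shot to unseen Solver families of different sizes, including an API-based Solver.

Insight: The novelty lies in casting evidence highlighting as a weakly supervised decision-making problem and optimizing the Actor with reinforcement learning using only the Solver's task reward, with no evidence labels and no modification of the Solver, which lets the Actor capture reusable, general evidence structure.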

Abstract: Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver’s task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.


[19] BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering cs.CL

Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne

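TL;DR: This paper proposes Bayesian Ensemble Retrieval-Augmented Generation (BERAG) and the accompanying Bayesian Ensemble Fine-Tuning (BEFT) to address problems of conventional RAG in knowledge-based visual question answering: obscured document contributions, the "lost-in-the-middle" effect, and computational cost that grows quadratically with context length. The language model is conditioned on individual retrieved documents rather than a concatenated context, and document posterior probabilities are updated token by token with Bayes' rule and used as ensemble weights, which enables probabilistic re-ranking, parallel memory usage, and clear document attribution.

Details

Motivation: Conventional RAG concatenates retrieved documents into a single context for the language model, which obscures the contribution of individual documents, makes attribution difficult, and contributes to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly, since computational cost grows quadratically with context length, which is especially severe in visual question answering where the context includes visual data. Limiting context length to mitigate these issues in turn prevents models from benefiting from deeper retrieval.

Result: On knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists, BERAG and BEFT improve substantially over standard RAG, with strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. The results indicate that BERAG mitigates the "lost-in-the-middle" effect, that the document posterior can be used to detect insufficient grounding and trigger deflection, and that document pruning enables faster decoding than standard RAG.

Insight: The core innovation is shifting RAG from single-pass generation over a concatenated context to an ensemble paradigm conditioned on individual documents, with dynamic, token-by-token updates of document weights via Bayes' rule. This yields probabilistic re-ranking, clear attribution, and parallel processing, addresses inherent problems of conventional RAG, suits large document collections and long retrieval lists, and suggests a new architecture for multimodal knowledge-based question answering.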

Abstract: A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.
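The token-by-token Bayesian update described in the abstract can be sketched in a few lines. This is a minimal illustration of the general idea only, assuming each document yields its own next-token distribution; the function names and the toy numbers are hypothetical and not taken from the paper.

```python
def ensemble_next_token(posterior, per_doc_dists):
    """Mix per-document next-token distributions with the current document
    posterior as ensemble weights: p(y) = sum_d w_d * p(y | d)."""
    vocab = per_doc_dists[0].keys()
    return {
        tok: sum(w * dist[tok] for w, dist in zip(posterior, per_doc_dists))
        for tok in vocab
    }


def update_posterior(posterior, token_likelihoods):
    """One Bayes step after a token is emitted:
    new w_d is proportional to old w_d * p(emitted token | d)."""
    unnormalized = [w * p for w, p in zip(posterior, token_likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("emitted token has zero probability under every document")
    return [u / total for u in unnormalized]


# Two hypothetical documents, uniform prior, toy two-token vocabulary.
prior = [0.5, 0.5]
dists = [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}]
mixture = ensemble_next_token(prior, dists)            # both tokens equally likely
posterior = update_posterior(prior, [0.9, 0.1])        # token "a" was emitted
```

After the update, the document that explained the emitted token better carries more weight, which is what makes the posterior usable for re-ranking and attribution.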


[20] Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought cs.CL

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

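TL;DR: This paper proposes Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism aimed at the high generation cost of explicit chain-of-thought reasoning. Before producing its final answer, the language model generates a short sequence of abstract tokens from a reserved vocabulary instead of a lengthy natural-language reasoning chain. By combining a policy iteration-style warm-up loop (masked bottleneck supervised fine-tuning and self-distillation) with warm-started reinforcement learning, the method sharply reduces the number of reasoning tokens while maintaining performance.

Details

Motivation: Explicit chain-of-thought (CoT) is effective on complex reasoning tasks but costly to generate at inference time, while non-verbal reasoning methods that use continuous representations generate shorter outputs but lag behind verbalized CoT in performance. The paper therefore aims to develop an efficient latent reasoning mechanism that shortens generation while matching or approaching explicit CoT performance.

Result: On mathematical reasoning, instruction-following, and multi-hop reasoning tasks, Abstract-CoT uses up to 11.6× fewer reasoning tokens while performing comparably to explicit CoT. The method generalizes across language model families, and a power law distribution emerges over the abstract vocabulary, akin to those seen in natural language.

Insight: The novelty is a discrete latent reasoning post-training mechanism that replaces the natural-language reasoning chain with a sequence of abstract tokens. The key techniques are the policy iteration-style warm-up loop (masked bottleneck supervised fine-tuning combined with self-distillation) and warm-started reinforcement learning. More broadly, the work shows the potential of a learned abstract reasoning language for efficient inference, and the power law distribution observed over the abstract vocabulary is an interesting finding in its own right.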

Abstract: While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6× fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
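Constrained decoding over a reserved vocabulary, as mentioned in the abstract, can be illustrated with a greedy step that only considers allowed token ids. This is a generic sketch of the technique, not the paper's implementation; the vocabulary size, the reserved ids, and the logits are made-up values.

```python
def constrained_greedy_step(logits, allowed_ids):
    """Pick the highest-scoring token id, considering only `allowed_ids`.

    `logits` is a list indexed by token id. Tokens outside `allowed_ids`
    are ignored, which is equivalent to masking them to minus infinity.
    """
    if not allowed_ids:
        raise ValueError("allowed_ids must not be empty")
    return max(sorted(allowed_ids), key=lambda token_id: logits[token_id])


# Hypothetical 6-token vocabulary in which ids 4 and 5 are the reserved
# "abstract" tokens. Token 0 has the largest logit overall but is excluded.
logits = [5.0, 1.0, 0.2, 0.1, 0.3, 0.7]
reserved_ids = {4, 5}
chosen = constrained_greedy_step(logits, reserved_ids)
```

In a real decoder the same mask would be applied at every step of the abstract sequence and lifted once the model starts generating its response.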


[21] How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks cs.CL | cs.CY | cs.HC | cs.SE

Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea

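TL;DR: This paper presents the first systematic study of token consumption patterns of AI agents in coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified and evaluating models' ability to predict their own token costs before execution. The findings are that agentic tasks consume far more tokens than code reasoning and code chat, driven mainly by input tokens; that token usage is highly variable and stochastic, and higher usage does not translate into higher accuracy; that models differ substantially in token efficiency; that task difficulty rated by human experts aligns only weakly with actual token cost; and that frontier models fail to predict their own token usage accurately and systematically underestimate real costs.

Details

Motivation: The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents run token-intensive tasks, three core questions arise: where do agents spend tokens, which models are more token-efficient, and can agents predict their token usage before execution?

Result: On SWE-bench Verified, agentic tasks consume 1000x more tokens than code reasoning and code chat; runs on the same task can differ by up to 30x in total tokens, and accuracy often peaks at intermediate cost; models vary widely in token efficiency, e.g., Kimi-K2 and Claude-Sonnet-4.5 consume on average over 1.5 million more tokens than GPT-5; and models' predictions of their own token usage correlate only weakly to moderately with actual usage (up to 0.39) and systematically underestimate real costs.

Insight: The novelty lies in the first systematic quantification of token consumption in agentic coding tasks, showing that input tokens are the main cost driver, that token usage and accuracy are related non-monotonically, that human-perceived complexity diverges from computational cost, and that models predict their own usage poorly. This opens new research directions and provides reference points for the economics and efficiency optimization of AI agents.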

Abstract: The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models’ ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.


cs.CV

[22] EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms cs.CV | cs.AI | cs.LG

Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix

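TL;DR: This paper introduces EgoMAGIC, an egocentric video dataset for training perception algorithms on medical tasks, comprising 3,355 videos of 50 medical tasks, and reports baseline action detection results on the dataset.

Details

Motivation: DARPA's PTG program aimed to develop virtual assistants integrated into augmented reality headsets to help users perform complex tasks, but a dedicated egocentric medical activity dataset for training perception algorithms was lacking.

Result: On the action detection benchmark over eight medical tasks, the best-performing method achieves an average mAP of 0.526. In addition, 40 YOLO models were trained to detect 124 medical objects, providing a starting point for medical AI applications.

Insight: The contribution is the release of a large-scale, multi-task egocentric medical activity video dataset, together with an action detection challenge and trained object detection models, giving medical AI and computer vision tasks (such as action recognition, object detection, and error detection) a standardized research platform.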

Abstract: This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA’s Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).


[23] FLARE-BO: Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation for Low-Light Robotic Vision cs.CV | eess.IV

Nathan Shankar, Pawel Ladosz, Hujun Yin

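TL;DR: This paper proposes FLARE-BO, a low-light image enhancement framework for improving the reliability of visual perception in robotic vision systems. It uses Bayesian optimisation to adaptively and jointly optimise eight parameters spanning gamma correction, LIME-style illumination normalisation, chrominance denoising, bilateral filtering, Non-Local Means denoising, Grey-World automatic white balance, and adaptive post-smoothing.

Details

Motivation: An existing training-free method based on Bayesian optimisation has a limited parameter space, applies no illumination decomposition or white balance correction, and relies on Non-Local Means denoising, which over-smooths edges under noisy conditions. The paper addresses these limitations to improve image quality for robotic vision tasks in low light.

Result: Evaluated on the Low Light paired dataset (LOL) benchmark, the method shows marked improvements over existing methods that were not specifically trained on this dataset.

Insight: The main novelty is expanding the Bayesian optimisation parameter space from three to eight parameters, fusing several image processing techniques (such as Retinex-style illumination normalisation and white balance), and using unit hypercube parameter normalisation, objective standardisation, Sobol quasi-random initialisation, and Log Expected Improvement acquisition to explore the higher-dimensional space systematically, yielding training-free, adaptive, comprehensive low-light enhancement.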

Abstract: Reliable visual perception under low illumination remains a core challenge for autonomous robotic systems, where degraded image quality directly compromises navigation, inspection, and various operations. A recent training-free approach showed that Bayesian optimisation with Gaussian Processes can adaptively select brightness, contrast, and denoising parameters on a per-image basis, achieving competitive enhancement without any learned model. However, that framework is limited to three parameters, applies no illumination decomposition or white balance correction, and relies on Non-Local Means (NLM) denoising, which tends to over-smooth edges under noisy conditions. This paper proposes FLARE-BO (Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation), an extended framework that jointly optimises eight parameters spanning gamma correction, LIME-style illumination normalisation, chrominance denoising, bilateral filtering, NLM denoising, Grey-World automatic white balance, and adaptive post-smoothing. The search engine employs unit hypercube parameter normalisation, objective standardisation, Sobol quasi-random initialisation, and Log Expected Improvement acquisition for principled exploration of the expanded space. The proposed method is benchmarked on the Low Light paired dataset (LOL), and the results show marked improvements over existing methods that were not specifically trained on this dataset.
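Two of the stages named in the abstract, gamma correction and Grey-World white balance, are standard operations and can be sketched without any image library. The sketch below works on a flat list of RGB pixels normalised to [0, 1]; the pixel values and the gamma value are illustrative, and in FLARE-BO such parameters would be chosen by the Bayesian optimiser rather than fixed by hand.

```python
def gamma_correct(pixels, gamma):
    """Apply out = in ** gamma per channel; gamma < 1 brightens values in [0, 1]."""
    return [tuple(channel ** gamma for channel in px) for px in pixels]


def grey_world_balance(pixels):
    """Grey-World white balance: scale each channel so that its mean matches
    the mean over all three channels, then clip to [0, 1]."""
    count = len(pixels)
    means = [sum(px[ch] for px in pixels) / count for ch in range(3)]
    grey = sum(means) / 3
    gains = [grey / m if m > 0 else 1.0 for m in means]
    return [
        tuple(min(1.0, px[ch] * gains[ch]) for ch in range(3))
        for px in pixels
    ]


# A tiny two-pixel "image" with a blue cast (illustrative values).
image = [(0.2, 0.4, 0.6), (0.4, 0.6, 0.8)]
balanced = grey_world_balance(image)
brightened = gamma_correct([(0.25, 0.25, 0.25)], gamma=0.5)
```

After balancing, the three channel means coincide, which removes the colour cast under the Grey-World assumption that the average scene colour is neutral.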


[24] Anatomy-Aware Unsupervised Detection and Localization of Retinal Abnormalities in Optical Coherence Tomography cs.CV | cs.LG

Tania Haghighi, Sina Gholami, Hamed Tabkhi, Minhaj Nur Alam

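TL;DR: This paper proposes an unsupervised anomaly detection framework for detecting and localizing retinal abnormalities in Optical Coherence Tomography (OCT) images. Without lesion annotations, it learns the normative distribution of healthy retinal anatomy with a discrete latent model and adds retinal layer-aware supervision and structured triplet learning to improve clinical robustness. At inference, reconstruction discrepancies identify anomalies at both the image level and the pixel level.

Details

Motivation: OCT image analysis depends on expensive expert annotations, and supervised models struggle to generalize across pathologies, imaging devices, and patient populations. The paper targets annotation efficiency in clinical deployment.

Result: On the Kermany dataset the method reaches an AUROC of 0.799, substantially outperforming VAE, VQVAE, VQGAN, and f-AnoGAN baselines; cross-dataset evaluation on Srinivasan gives an AUROC of 0.884, indicating strong generalization; and on the external RETOUCH benchmark, unsupervised anomaly segmentation achieves competitive Dice (0.200) and mIoU (0.117) scores.

Insight: The contributions include an annotation-free unsupervised anomaly detection framework, a discrete latent model that captures OCT-specific structural patterns, retinal layer-aware supervision and structured triplet learning for clinical robustness, and multi-level anomaly localization via reconstruction discrepancies, which together improve cross-domain generalization.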

Abstract: Reliable automated analysis of Optical Coherence Tomography (OCT) imaging is crucial for diagnosing retinal disorders but faces a critical barrier: the need for expensive, labor-intensive expert annotations. Supervised deep learning models struggle to generalize across diverse pathologies, imaging devices, and patient populations due to their restricted vocabulary of annotated abnormalities. We propose an unsupervised anomaly detection framework that learns the normative distribution of healthy retinal anatomy without lesion annotations, directly addressing annotation efficiency challenges in clinical deployment. Our approach leverages a discrete latent model trained on normal B-scans to capture OCT-specific structural patterns. To enhance clinical robustness, we incorporate retinal layer-aware supervision and structured triplet learning to separate healthy from pathological representations, improving model reliability across varied imaging conditions. During inference, anomalies are detected and localized via reconstruction discrepancies, enabling both image and pixel-level identification without requiring disease-specific labels. On the Kermany dataset (AUROC: 0.799), our method substantially outperforms VAE, VQVAE, VQGAN, and f-AnoGAN baselines. Critically, cross-dataset evaluation on Srinivasan achieves AUROC 0.884 with superior generalization, demonstrating robust domain adaptation. On the external RETOUCH benchmark, unsupervised anomaly segmentation achieves competitive Dice (0.200) and mIoU (0.117) scores, validating reproducibility across institutions.
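Detection and localization via reconstruction discrepancy can be sketched generically: the pixel-level map is the absolute difference between the input and its reconstruction, and an image-level score summarises that map. The mean is used as the summary here purely as an assumption; the abstract does not state which aggregation or threshold the authors use, and the toy arrays are made up.

```python
def anomaly_map(image, reconstruction):
    """Pixel-level anomaly map: absolute difference between the input and
    the reconstruction produced by a model trained on healthy scans."""
    return [
        [abs(a - b) for a, b in zip(row_img, row_rec)]
        for row_img, row_rec in zip(image, reconstruction)
    ]


def image_score(amap):
    """Image-level anomaly score; the mean is an illustrative choice."""
    values = [v for row in amap for v in row]
    return sum(values) / len(values)


def localize(amap, threshold):
    """Binary mask marking pixels whose discrepancy exceeds `threshold`."""
    return [[v > threshold for v in row] for row in amap]


# Toy 2x2 scan: only the top-right pixel is poorly reconstructed.
scan = [[0.0, 1.0], [0.5, 0.5]]
recon = [[0.0, 0.2], [0.5, 0.5]]
amap = anomaly_map(scan, recon)
score = image_score(amap)
mask = localize(amap, threshold=0.5)
```

Because the model only ever sees healthy anatomy during training, regions it cannot reconstruct well are the ones flagged as abnormal.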


[25] GenMatter: Perceiving Physical Objects with Generative Matter Models cs.CV | cs.AI

Eric Li, Arijit Dasgupta, Yoni Friedman, Mathieu Huot, Vikash Mansinghka

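TL;DR: This paper proposes GenMatter, a generative model inspired by human visual perception that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter) and clusters the particles into independently moveable physical entities, giving a unified motion-based perception framework across input types (random dots, textured surfaces, or naturalistic RGB video).

Details

Motivation: Existing computer vision systems lack a unified motion perception approach that works across sparse moving dots, textured surfaces, and naturalistic scenes, whereas human vision robustly detects and segments independently moveable chunks of matter. The paper therefore draws on principles of human perception to build a general motion-based perception framework.

Result: The framework is validated in three domains: on 2D random dot kinematograms, the model captures human object perception, including graded uncertainty under ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, it recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, it tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding.

Insight: The novelty lies in combining the hierarchical grouping principles of human vision with a generative model, handling diverse inputs uniformly through particle representations and clustering, and developing a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling, which together provide a general, biologically inspired computational framework for motion-based perception.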

Abstract: Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.


[26] Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models cs.CV

Masato Soga, Ryuki Takebayashi

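TL;DR: This study addresses generating reactive human motion conditioned on another person's motion in interaction scenarios. It builds a dataset of paired action-reaction motion sequences extracted from boxing match videos and compares three Transformer-based models (a simple Transformer, iTransformer, and Crossformer). The simple Transformer generates plausible interaction-aware motion, whereas iTransformer and Crossformer accumulate errors over time; the proposed person ID embedding helps prevent structural collapse and improves motion consistency.

Details

Motivation: Existing human motion generation methods focus mainly on single-agent motion and overlook interaction scenarios in which two people's motions are mutually dependent. This study tackles generating reactive motion from another person's motion.

Result: On the dataset built from boxing videos, the simple Transformer generates plausible interaction-aware motion without posture collapse, while iTransformer and Crossformer produce unstable motion because of error accumulation. Adding the person ID embedding improves structural consistency and motion quality.

Insight: The contributions include constructing a paired interaction motion dataset, comparing Transformer variants on interactive motion generation, and proposing a person ID embedding that explicitly distinguishes individuals so that the model better captures interaction dynamics and maintains structural consistency, which underlines the importance of explicitly modeling individual identity in interaction-aware motion generation.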

Abstract: Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person’s motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.


[27] Unlocking Optical Prior: Spectrum-Guided Knowledge Transfer for SAR Generalized Category Discovery cs.CV

Jingyuan Xia, Ruikang Hu, Ye Li, Zhixiong Yang, Xu Lan

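TL;DR: This paper proposes the Modal Discrepancy Curve (MDC), a frequency-domain descriptor that models the cross-modal discrepancy between optical and Synthetic Aperture Radar (SAR) images. Building on the MDC, the authors propose the MDC-guided Cross-modal Prior Transfer (MCPT) pre-training framework, which comprises Adaptive Frequency Tokenization (AFT) and Frequency-aware Expert Refinement (FER) modules and aligns cross-modal features through contrastive learning, transferring the optical prior of large vision models to the SAR domain to improve label-scarce SAR Generalized Category Discovery (GCD).

Details

Motivation: Generalized Category Discovery (GCD) holds significant promise for the label-scarce SAR domain, but its efficacy is severely constrained by the cross-modal incompatibility between the inherent optical prior of Large Vision Models (LVMs) and SAR imagery. Existing domain adaptation methods often lack an inductive bias that reflects imaging characteristics and therefore fail to transfer the optical prior into the SAR domain effectively.

Result: Extensive experiments on multiple mainstream datasets show state-of-the-art performance, indicating that frequency-domain discrepancy modeling adapts the optical prior to SAR imagery more effectively.

Insight: The core innovation is introducing the Modal Discrepancy Curve (MDC) as a structured frequency-domain descriptor that quantifies cross-modal discrepancy, and designing on top of it a pre-training framework with Adaptive Frequency Tokenization and Frequency-aware Expert Refinement. More broadly, the method moves cross-modal discrepancy modeling from the conventional pixel or feature space to the frequency (spectral) domain, suggesting a way to use imaging physics, such as spectral energy distributions, to guide knowledge transfer. This is particularly relevant for modalities with very different imaging mechanisms, such as optical and SAR.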

Abstract: Generalized Category Discovery (GCD) holds significant promise for the label-scarce Synthetic Aperture Radar (SAR) domain, yet its efficacy is severely constrained by the cross-modal incompatibility between the inherent optical prior of the Large Vision Models (LVMs) and SAR imagery. Existing domain adaptation methods often lack an inductive bias that reflects imaging characteristics, consequently failing to effectively transfer optical prior into the SAR domain. To address this issue, the Modal Discrepancy Curve (MDC) is introduced to model cross-modal discrepancy as a structured frequency-domain descriptor derived from spectral energy distributions. Leveraging this formulation, we propose the MDC-guided Cross-modal Prior Transfer (MCPT) framework, a pre-training paradigm that operates on paired optical-SAR data. Within this framework, Adaptive Frequency Tokenization (AFT) converts the MDC into learnable tokens, and Frequency-aware Expert Refinement (FER) performs band-wise discrepancy-aware feature refinement using these tokens. Based on the refined representations, contrastive learning aligns refined embeddings across modalities and internalizes the adaptation pattern. Ultimately, the superior SAR feature representation capability learned during paired pre-training is applied to downstream single-modal SAR-GCD tasks. Extensive experiments demonstrate state-of-the-art performance across multiple mainstream datasets, indicating that frequency-domain discrepancy modeling enables more effective adaptation of optical prior to SAR imagery.


[28] Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities cs.CV

Peibo Song, Xiaotian Xue, Jinshuo Zhang, Zihao Wang, Jinhua Liu

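TL;DR: This paper proposes UniME, a two-stage heterogeneous method for brain tumor segmentation with missing modalities. Stage 1 pretrains a single Vision Transformer encoder to learn a unified representation robust to missing modalities; Stage 2 adds modality-specific CNN encoders to extract high-resolution, multi-scale, fine-grained features, which are fused with the global representation to produce precise segmentations.

Details

Motivation: In multimodal MRI brain tumor segmentation, clinical scans often lack one or more modalities, which degrades segmentation performance. The paper aims to reconcile the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities.

Result: Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios.

Insight: The novelty is a two-stage heterogeneous architecture that decouples representation learning from segmentation: a unified encoder robust to missing modalities is first pretrained with masked image modeling, and modality-specific multi-encoders are then added to fuse global and local features, handling missing modalities while improving segmentation accuracy.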

Abstract: Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at https://github.com/Hooorace-S/UniME


[29] EvFlow-GS: Event Enhanced Motion Deblurring with Optical Flow for 3D Gaussian Splatting cs.CV

Feiyu An, Yufei Deng, Zihui Zhang, Rong Xiao

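TL;DR: This paper proposes EvFlow-GS, a unified framework that uses event streams and optical flow to jointly optimize an end-to-end learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) for sharp 3D reconstruction from motion-blurred images. It extracts edge information from events using optical flow, designs a novel event-based loss and an event-residual prior to strengthen supervision, and uses a joint loss so that the modules' optimization is mutually reinforcing.

Details

Motivation: Sharp 3D reconstruction from motion-blurred images alone is challenging, and methods that incorporate event cameras suffer from residual artifacts and blurry texture details because of misleading supervision from inaccurate event double integral priors and noisy, blurry events.

Result: Experiments show that EvFlow-GS achieves leading performance.

Insight: The contributions include extracting edge information from event streams using optical flow; a novel event-based loss applied separately to different modules; an event-residual prior that strengthens supervision of intensity changes between images rendered from 3DGS; and end-to-end joint optimization of the LDI, camera poses, and 3DGS through a joint loss so that they facilitate one another.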

Abstract: Achieving sharp 3D reconstruction from motion-blurred images alone is challenging, which has motivated recent methods to incorporate event cameras for their microsecond temporal resolution. However, these methods suffer from residual artifacts and blurry texture details due to misleading supervision from inaccurate event double integral priors and noisy, blurry events. In this study, we propose EvFlow-GS, a unified framework that leverages event streams and optical flow to optimize an end-to-end learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) jointly on-the-fly. Specifically, we first extract edge information from the events using optical flow and then formulate a novel event-based loss applied separately to different modules. Additionally, we exploit a novel event-residual prior to strengthen the supervision of intensity changes between images rendered from 3DGS. Finally, we integrate the outputs of both 3DGS and LDI into a joint loss, enabling their optimization to mutually facilitate each other. Experiments demonstrate the leading performance of our EvFlow-GS.


[30] CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution cs.CV

Xiangxi Zheng, Kuang He, Jiayi Hu, Ping Yu, Rui Yan

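TL;DR: CharTide is a data-centric framework for improving vision-language models on chart-to-code generation. It builds a large-scale dataset via a Tri-Perspective Tuning strategy that decouples training into visual perception, pure-text code logic, and modality fusion streams, and uses an Inquiry-Driven reinforcement learning framework grounded in the principle of information invariance to verify alignment data.

Details

Motivation: Existing chart-to-code approaches are constrained by data-centric limitations: simply scaling homogeneous chart-code pairs conflates visual perception with program logic and prevents models from fully leveraging the richness of multimodal supervision. The paper systematically redesigns both training and alignment data to address this.

Result: Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.

Insight: The core innovations are decoupling training into three explicit perspectives to separate visual learning from logical learning, and reformulating alignment as a data verification problem grounded in information invariance, in which a frozen Inspector provides verifiable reward signals through atomic QA tasks, going beyond heuristic scoring or rule matching.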

Abstract: Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.


[31] ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild cs.CV

Hanyu Chen, Ruojin Cai, Steve Marschner, Noah Snavely

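TL;DR: This paper presents a framework for detecting 3D-grounded reflectional symmetries from single in-the-wild RGB images, focusing on architectural landmarks. It automatically curates a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching, and builds on the dataset a single-view symmetry detector that localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry.

Details

Motivation: Existing learning-based methods for detecting 3D symmetries from single images are trained and evaluated almost exclusively on object-centric or synthetic datasets and fail to generalize to real-world scenes. Moreover, because of the scale ambiguity of monocular inputs, many works predict only the plane's orientation. The paper addresses these limitations to detect 3D-grounded architectural symmetries from single in-the-wild images.

Result: On the new benchmark, the symmetry detector significantly outperforms state-of-the-art baselines, and the symmetry annotation pipeline is validated against geometry-based alternatives.

Insight: The contributions include a method for automatically curating a large-scale architectural symmetry dataset through cross-view image matching, and a single-view detector that parameterizes symmetries as signed distance maps relative to predicted scene geometry, which helps resolve scale ambiguity and enables accurate 3D localization.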

Abstract: Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane’s orientation. In this paper, we address these limitations by presenting the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.
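The geometry behind a signed distance map for a reflection plane can be sketched as follows: given a 3D point per pixel and a plane, each pixel stores the signed distance from its point to the plane, and reflecting a point moves it by twice that distance along the normal. This is a textbook illustration under assumed inputs (a known plane and known per-pixel 3D points), not the paper's detector; all names and values are hypothetical.

```python
import math


def make_plane(normal, point_on_plane):
    """Return (unit_normal, offset) for the plane {x : n . x + offset = 0}."""
    norm = math.sqrt(sum(c * c for c in normal))
    if norm == 0:
        raise ValueError("plane normal must be non-zero")
    n = tuple(c / norm for c in normal)
    offset = -sum(a * b for a, b in zip(n, point_on_plane))
    return n, offset


def signed_distance(point, plane):
    """Signed distance from a 3D point to the plane (positive on the normal side)."""
    n, offset = plane
    return sum(a * b for a, b in zip(n, point)) + offset


def reflect(point, plane):
    """Mirror a 3D point across the plane."""
    n, _ = plane
    s = signed_distance(point, plane)
    return tuple(p - 2 * s * c for p, c in zip(point, n))


def signed_distance_map(points, plane):
    """Per-pixel signed distances for a 2D grid of 3D points."""
    return [[signed_distance(p, plane) for p in row] for row in points]


# Symmetry plane x = 1, and a 1x2 "image" of 3D points (illustrative values).
plane = make_plane((2.0, 0.0, 0.0), (1.0, 0.0, 0.0))
mirrored = reflect((3.0, 0.0, 0.0), plane)
sdm = signed_distance_map([[(1.0, 5.0, 2.0), (0.0, 0.0, 0.0)]], plane)
```

Expressing the plane through per-pixel distances measured in the units of the predicted scene geometry is what ties the plane's location to that geometry instead of to an unknown absolute scale.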


[32] Towards Temporal Compositional Reasoning in Long-Form Sports Videos cs.CVPDF

Siyu Cao, Lu Zhang, Ruizhe Zeng, Zhi-yong Liu

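TL;DR: This paper addresses temporal compositional reasoning in long-form sports video understanding, introducing the large-scale benchmark SportsTime and the Chain-of-Time Reasoning (CoTR) method. SportsTime contains over 14K open-ended QA pairs and over 50K step-wise temporal evidence annotations for evaluating reasoning over long videos. CoTR treats reasoning as temporally grounded evidence composition: during training it uses a temporal-reward GRPO to encourage temporally grounded reasoning, and during inference it runs an anchor-observe-infer evidence-seeking loop that iteratively localizes, verifies, and composes temporal evidence before producing the final answer.

Details

Motivation: Existing multimodal large language models (MLLMs) struggle with long-form sports video understanding because answering questions requires locating temporally sparse evidence and integrating it into reasoning. The paper attributes this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence.

Result: Experiments demonstrate the usefulness of SportsTime as a benchmark and show that CoTR consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.

Insight: The contributions are SportsTime, a large-scale benchmark focused on temporal compositional reasoning in long videos, and CoTR, which treats reasoning as temporal evidence composition and strengthens long-video reasoning through temporal-reward training and an iterative evidence-seeking loop. Its emphasis on step-wise localization and composition of temporal evidence offers a new direction for long-video multimodal understanding.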

Abstract: Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.


[33] OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space cs.CVPDF

Zhuding Liang, Tianyi Yan, Dubing Chen, Jiasen Zheng, Huan Zheng

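TL;DR: This paper proposes OccDirector, a framework that generates dynamic scenes in 4D occupancy space from natural language instructions without geometric priors, and introduces the OccInteract-85k dataset with multi-level language annotations together with a VLM-based evaluation benchmark.

Details

Motivation: Existing generation frameworks depend on rigid geometric conditions or simplistic text and cannot orchestrate complex, sequential multi-agent interactions, leaving a semantic-spatiotemporal gap.

Result: On the OccInteract-85k dataset, OccDirector is reported to achieve state-of-the-art generation quality and unprecedented instruction-following capability.

Insight: Contributions include purely language-driven generation of 4D occupancy dynamics, a VLM-driven Spatio-Temporal MMDiT architecture with a history-prefix anchoring strategy, and a paradigm shift from appearance synthesis to language-driven behavior orchestration.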

Abstract: Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a "scenario director", OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions, ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.


[34] Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset cs.CV | cs.AIPDF

Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao

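TL;DR: This paper introduces the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments, and builds on it UniVLT, a unified transportation foundation model that combines microscopic autonomous driving reasoning with macroscopic traffic analysis to address the growing safety challenges of urban transportation systems.

Details

Motivation: Urban transportation systems face growing safety challenges that call for scalable intelligence to support emerging smart mobility infrastructure. Recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), but existing research centers on microscopic autonomous driving (AD) and pays limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA), and foundation models for reasoning over heterogeneous roadside camera observations, remain underexplored.

Result: Extensive experiments on LTD and multiple AD benchmarks show that UniVLT achieves state-of-the-art (SOTA) performance on open-ended reasoning tasks across domains, while exposing the limitations of existing foundation models in complex multi-view traffic scenarios.

Insight: Contributions include: 1) LTD, a large-scale, high-quality, safety-oriented open-ended vision-language dataset covering heterogeneous roadside cameras, diverse road geometries, traffic participants, illumination conditions, and adverse weather; 2) UniVLT, a unified transportation foundation model trained via curriculum-based knowledge transfer that integrates microscopic AD reasoning and macroscopic traffic analysis in a single architecture; 3) three complementary tasks in the dataset (fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis) that require joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions.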

Abstract: Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.


[35] CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation cs.CVPDF

Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

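TL;DR: This paper proposes CAGE-SGG, an evidence-grounded open-vocabulary scene graph generation framework based on counterfactual relation verification. It targets the problem that predicted relations in open-vocabulary SGG may be driven by language priors or object co-occurrence rather than visual evidence. The method decomposes predicate phrases into soft evidence bases (such as support, contact, containment, depth, motion, and state), extracts predicate-relevant cues with a relation-conditioned evidence encoder, and uses a counterfactual verifier to test how the relation score changes when necessary evidence is removed or irrelevant perturbations are applied, thereby checking whether each candidate relation is supported by visual, geometric, and contextual evidence. It also introduces contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency.

Details

Motivation: Open-vocabulary scene graph generation aims to go beyond a fixed predicate vocabulary and describe visual scenes with flexible, fine-grained relation phrases. Existing vision-language models, however, introduce a reliability problem: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence.

Result: Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show consistent improvements in standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality, indicating that moving from relation generation to relation verification yields more reliable, interpretable, and evidence-grounded scene graphs.

Insight: Contributions include decomposing predicates into soft evidence bases for counterfactual relation verification, and adding contradiction-aware learning and graph-level optimization. Together these give open-vocabulary SGG a more reliable evidence-driven framework, and the approach may transfer to improving the interpretability and grounding of relations in other vision-language tasks.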

Abstract: Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
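The counterfactual test described in the abstract (the score should drop when necessary evidence is removed and stay stable under irrelevant perturbations) can be sketched as a simple acceptance rule. The function name and the two thresholds `min_drop` and `max_drift` are hypothetical; the paper's verifier is a learned component, and this only shows the decision logic on three already-computed scores.

```python
def is_evidence_grounded(
    score_original: float,
    score_without_evidence: float,
    score_irrelevant_perturbation: float,
    min_drop: float = 0.2,   # assumed: required drop when necessary evidence is removed
    max_drift: float = 0.05,  # assumed: tolerated change under an irrelevant perturbation
) -> bool:
    """Accept a candidate relation only if its score behaves counterfactually.

    A relation driven by language priors tends to keep a high score even after
    the supporting visual evidence is removed, so it fails the first check.
    """
    drops = (score_original - score_without_evidence) >= min_drop
    stable = abs(score_original - score_irrelevant_perturbation) <= max_drift
    return drops and stable
```

For example, a "cup on table" proposal whose score stays near 0.9 after the contact region is masked would be rejected under this rule, since the prediction evidently does not depend on the evidence it should depend on.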


[36] Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings cs.CVPDF

Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan

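TL;DR: This paper proposes RIME, a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite, addressing the redundant steps and semantic ambiguity that chain-of-thought reasoning in multimodal large language models produces in broad retrieval scenarios.

Details

Motivation: Chain-of-thought reasoning generates redundant thinking steps and introduces semantic ambiguity in multimodal embedding tasks. The work aims to remove this limitation and improve the performance and efficiency of generative multimodal embeddings.

Result: On the MMEB-V2, MRMR, and UVRB benchmarks, RIME substantially outperforms prior generative embedding models while significantly reducing thinking length.

Insight: Contributions include a retrieval-friendly rewrite as a universal interface for jointly optimizing generation and embedding; Cross-Mode Alignment, which bridges the generative and discriminative embedding spaces to enable flexible mutual retrieval; and Refine Reinforcement Learning, which uses discriminative embeddings as stable semantic anchors to guide rewrite optimization.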

Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.


[37] DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning cs.CVPDF

Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko

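TL;DR: This paper proposes DocPrune, a training-free, progressive document token pruning framework for efficient long-document question answering. It keeps the tokens the task needs (such as supporting evidence), removes irrelevant ones (such as background or question-irrelevant content), and automatically selects the layer at which pruning starts based on the model's level of comprehension, which substantially improves computational efficiency.

Details

Motivation: Document images contain large backgrounds and only sparse supporting evidence, so existing vision-language models consume substantial compute and run inefficiently on them, while token pruning methods designed for natural images and videos fail to exploit the structural sparsity specific to documents.

Result: On the M3DocRAG benchmark, DocPrune improves encoder and decoder throughput by 3.0x and 3.3x respectively while raising the F1 score by 1.0, achieving higher accuracy and efficiency without additional training.

Insight: The contribution is a training-free token pruning framework designed for the structural sparsity of documents. By combining background-, question-, and comprehension-aware pruning with adaptive layer selection, it improves efficiency while maintaining or even improving model performance, offering an efficient approach to long-document processing.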

Abstract: Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model’s level of comprehension. Our experiments on the M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.
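As a rough illustration of background- and question-aware pruning, the sketch below first drops tokens flagged as background and then keeps the most question-relevant share of the remainder. The inputs (`is_background`, `relevance`) and the `keep_ratio` parameter are assumptions for illustration; the abstract does not say how DocPrune scores tokens, and the comprehension-aware layer selection is not modeled here.

```python
def prune_tokens(
    is_background: list[bool],
    relevance: list[float],
    keep_ratio: float = 0.5,
) -> list[int]:
    """Return the sorted indices of tokens to keep.

    is_background[i]: True if token i was judged to be page background (assumed input).
    relevance[i]: question-relevance score of token i (assumed input).
    keep_ratio: share of non-background tokens to retain (hypothetical parameter).
    """
    # Step 1: background-aware pruning removes background tokens outright.
    foreground = [i for i, bg in enumerate(is_background) if not bg]
    if not foreground:
        return []
    # Step 2: question-aware pruning keeps the most relevant foreground tokens.
    n_keep = max(1, round(len(foreground) * keep_ratio))
    ranked = sorted(foreground, key=lambda i: relevance[i], reverse=True)
    # Restore original order so positional structure is preserved.
    return sorted(ranked[:n_keep])
```

Returning indices in their original order keeps the surviving tokens aligned with their positions on the page, which matters for layout-dependent content such as tables.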


[38] Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation cs.CVPDF

Ran Zhao, Sheng Jin, Size Wu, Kang Liao, Zerui Gong

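TL;DR: This paper targets knowledge-intensive text-to-image generation and introduces KVBench, a benchmark grounded in the senior high-school curriculum, to evaluate how reliably models generate images that must strictly follow domain knowledge, structural constraints, and symbolic conventions. The study finds substantial deficiencies in existing models' logical reasoning, symbolic precision, and multilingual robustness, and proposes KE-Check, a two-stage framework that improves scientific fidelity through Knowledge Elaboration and Checklist-Guided Refinement, effectively reducing scientific hallucinations.

Details

Motivation: The reliability of existing text-to-image models in knowledge-intensive settings remains largely unexplored. Knowledge visualization requires not only semantic alignment but also strict adherence to domain knowledge, structural constraints, and symbolic conventions, which exposes a critical gap between visual plausibility and scientific correctness.

Result: Fourteen state-of-the-art open- and closed-source models were evaluated on KVBench, which covers six senior high-school subjects: Biology, Chemistry, Geography, History, Mathematics, and Physics. Open-source models consistently lag behind proprietary systems in logical reasoning, symbolic precision, and multilingual robustness. The proposed KE-Check framework mitigates scientific hallucinations and narrows the gap between open-source and leading closed-source models.

Insight: The contributions are the first curriculum-grounded benchmark for knowledge-intensive text-to-image generation, and a two-stage framework that combines Knowledge Elaboration with Checklist-Guided Refinement. Structured prompt enrichment and explicit constraint enforcement improve the scientific correctness of generated images, giving knowledge visualization both a systematic evaluation method and a way to improve.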

Abstract: Recent text-to-image (T2I) models have demonstrated impressive capabilities in photorealistic synthesis and instruction following. However, their reliability in knowledge-intensive settings remains largely unexplored. Unlike natural image generation, knowledge visualization requires not only semantic alignment but also strict adherence to domain knowledge, structural constraints, and symbolic conventions, exposing a critical gap between visual plausibility and scientific correctness. To systematically study this problem, we introduce KVBench, a curriculum-grounded benchmark for evaluating knowledge-intensive T2I generation. KVBench covers six senior high-school subjects: Biology, Chemistry, Geography, History, Mathematics, and Physics. The benchmark consists of 1,800 expert-curated prompts derived from over 30 authoritative textbooks. Using this benchmark, we evaluate 14 state-of-the-art open- and closed-source models, revealing substantial deficiencies in logical reasoning, symbolic precision, and multilingual robustness, with open-source models consistently underperforming proprietary systems. To address these limitations, we further propose KE-Check, a two-stage framework that improves scientific fidelity via (1) Knowledge Elaboration for structured prompt enrichment, and (2) Checklist-Guided Refinement for explicit constraint enforcement through violation identification and constraint-guided editing. KE-Check effectively mitigates scientific hallucinations, narrowing the performance gap between open-source and leading closed-source models. Data and codes are publicly available at https://github.com/zhaoran66/KVBench.


[39] Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation cs.CVPDF

Lomash Relia, Jai G Singla, Amitabh, Nitant Dube

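TL;DR: This study analyses simulated and real-world implementations of depth-aware rover navigation, focusing on the transition from stereo vision to edge-AI-based monocular depth estimation. A Unity-based lunar terrain simulator with stereo cameras generates disparity maps, and a physical rover built on a Raspberry Pi 4 uses UniDepthV2 for monocular metric depth estimation and YOLO12n for real-time object detection.

Details

Motivation: To achieve efficient, robust, and cost-effective depth-aware navigation on resource-constrained edge devices as an alternative to traditional stereo vision.

Result: Stereo vision was more accurate in simulation, but in real-world deployment the monocular approach proved more robust and cost-effective, running at 0.1 FPS for depth estimation and 10 FPS for object detection.

Insight: The contribution is the combination of an advanced monocular depth estimation model (UniDepthV2) with lightweight object detection (YOLO12n) on an edge AI platform, showing that in real-world conditions the monocular approach has advantages over stereo vision in robustness and cost.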

Abstract: This study analyses simulated and real-world implementations of depth-aware rover navigation, highlighting the transition from stereo vision to monocular depth estimation using edge AI. A Unity-based lunar terrain simulator with stereo cameras and OpenCV’s StereoSGBM was used to generate disparity maps. A physical rover built on Raspberry Pi 4 employed UniDepthV2 for monocular metric depth estimation and YOLO12n for real-time object detection. While stereo vision yielded higher accuracy in simulation, the monocular approach proved more robust and cost-effective in real-world deployment, achieving 0.1 FPS for depth and 10 FPS for detection.
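For readers unfamiliar with the stereo side of this comparison, a disparity map is converted to metric depth with the standard pinhole stereo relation Z = f * B / d, where f is the focal length in pixels, B the camera baseline in metres, and d the disparity in pixels. The sketch below applies it to a single value; the function name and the example camera parameters are illustrative and not taken from the paper.

```python
def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth in metres from disparity via Z = f * B / d (rectified stereo pair).

    Non-positive disparities carry no depth information (no match, or a point
    at infinity), so they are mapped to infinity here.
    """
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Illustrative numbers only: 700 px focal length, 10 cm baseline, 35 px disparity.
depth_m = disparity_to_depth(35.0, focal_px=700.0, baseline_m=0.1)
```

Note that OpenCV's StereoSGBM returns disparities as 16-bit fixed-point values with 4 fractional bits, so its raw output should be divided by 16 before applying the formula.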


[40] ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding cs.CV | cs.AIPDF

Dongwei Sun, Jing Yao, Kan Wei, Xiangyong Cao, Chen Wu

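TL;DR: This paper presents ChangeQuery, a unified multimodal framework for comprehensive, all-weather disaster situation awareness. It aims to overcome the modality dependence, scenario bias, and limited interactivity of existing remote sensing damage assessment methods. The authors construct the DICQ dataset, which pairs optical and SAR data and balances natural disasters with armed conflicts, and use an automated semantic annotation pipeline to generate high-quality supervision, so that the model supports multi-task reasoning driven by user queries and delivers precise damage quantification, region-specific descriptions, and holistic post-disaster summaries.

Details

Motivation: Existing remote sensing damage assessment methods rely mainly on unimodal optical data, are biased towards natural disasters, and lack grounded interactive reasoning, so they struggle with complex strategic queries and cannot provide actionable intelligence.

Result: Extensive experiments on the DICQ dataset show that ChangeQuery sets a new state of the art (SOTA) on complex disaster monitoring tasks and provides a robust, interpretable solution.

Insight: Main contributions: 1) DICQ, a large-scale multimodal benchmark that balances natural disasters with armed conflicts and couples pre-event optical semantics with post-event SAR structural features; 2) an automated semantic annotation pipeline following a "statistics-first, generation-later" paradigm, which turns raw segmentation masks into hierarchical instruction sets with spatial and quantitative awareness and so provides high-quality supervision for interactive reasoning; 3) a unified interactive disaster analysis framework that supports multi-task reasoning driven by diverse user queries.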

Abstract: Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a "statistics-first, generation-later" paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state-of-the-art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at https://sundongwei.github.io/changequery/.


[41] PoseFM: Relative Camera Pose Estimation Through Flow Matching cs.CVPDF

Dominik Kuczkowski, Laura Ruotsalainen

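TL;DR: This paper proposes PoseFM, the first framework to reformulate monocular visual odometry (VO) as a generative task based on Flow Matching. It transforms noise into a distribution over camera poses via continuous-time ODEs, which provides uncertainty estimates, and performs strongly on the TartanAir, KITTI, and TUM-RGBD benchmarks.

Details

Motivation: Traditional monocular VO methods perform poorly in environments with poor structure or lighting, while most existing deep learning methods rely on deterministic regression and lack uncertainty estimation, which limits their reliability in robust applications.

Result: On TartanAir, KITTI, and TUM-RGBD, PoseFM achieves the lowest absolute trajectory error (ATE) on some trajectories and is overall competitive with the best monocular frame-to-frame VO methods.

Insight: The key idea is to recast VO as a generative problem and model the distribution of camera motion with flow matching, which enables uncertainty estimation. Learning the mapping from noise to pose through continuous-time ODEs gives visual odometry a more robust inference mechanism.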

Abstract: Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods. Code and model checkpoints will be made available at https://github.com/helsinki-sda-group/posefm.
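Sampling from a flow-matching model amounts to integrating an ODE dx/dt = v(x, t) from noise at t = 0 to a sample at t = 1. The sketch below uses plain Euler steps and a toy closed-form velocity field in place of a trained network; `euler_sample` and `toy_velocity` are hypothetical names, and the 3-vector stands in for whatever pose parameterization PoseFM actually uses, which the abstract does not specify.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    steps: int = 20,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a learned field: the straight-line flow toward a fixed target.

    For the linear path x_t = (1 - t) * x_0 + t * x_1, the velocity at (x_t, t)
    is (x_1 - x_t) / (1 - t). A trained model would predict this from images.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Start from an arbitrary "noise" vector and flow to the toy target.
sample = euler_sample(toy_velocity([0.5, -0.2, 1.0]), x0=[2.0, 1.0, -3.0])
```

With a learned velocity field conditioned on an image pair, repeating the integration from several noise draws would give a set of pose samples whose spread could serve as an uncertainty estimate.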


[42] HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos cs.CVPDF

Xu Lu, Qianhong Peng, Qihao Zhou, Shaopeng Liu, Xiuqin Ye

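TL;DR: This paper proposes HFS-TriNet, a three-branch collaborative feature learning network for prostate cancer classification from transrectal ultrasound (TRUS) videos. A heuristic frame selection strategy first reduces video redundancy; a regular ResNet50 branch, a branch based on a pre-trained medical segmentation model, and a wavelet transform convolutional residual branch then jointly extract spatial-temporal features, addressing the information redundancy, high intra- and inter-class similarity, and low signal-to-noise ratio of TRUS video analysis.

Details

Motivation: TRUS video provides richer spatial-temporal information than static images and could improve the accuracy and robustness of CAD systems, but its analysis faces information redundancy, high computational cost, high intra- and inter-class similarity, and a low signal-to-noise ratio, so new methods are needed to extract features effectively.

Result: The paper reports experiments on a TRUS video dataset indicating that the proposed method improves prostate cancer classification. The abstract gives no specific quantitative results (such as accuracy) or comparison with SOTA models.

Insight: Contributions include: 1) a heuristic frame selection strategy that dynamically initializes the starting point of each training clip so that sampling covers the entire video sequence; 2) a three-branch design combining a regular CNN, deep feature extraction and temporal-consistency modeling based on a large model (medical SAM), and a wavelet transform branch that extracts lesion edges in the high-frequency domain and denoises in the low-frequency domain, so that the branches learn complementary features.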

Abstract: Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. Computer-aided diagnosis (CAD) based on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which makes it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based on a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.


[43] Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition cs.CVPDF

Shunpeng Chen, Yukun Song, Changwei Wang, Rongtao Xu, Kexue Fu

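TL;DR: This paper proposes FoL++, a visual place recognition (VPR) method that combines robust discriminative region modeling with adaptive re-ranking to address perceptual aliasing caused by irrelevant regions and inefficient re-ranking caused by rigid candidate scheduling. A Reliability Estimation Branch generates spatial reliability maps that explicitly model occlusion resistance, and two spatial alignment losses align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision from aggregation clusters. An Adaptive Candidate Scheduler dynamically resizes the candidate pool based on global similarity.

Details

Motivation: Existing VPR methods handle poorly both the perceptual aliasing caused by irrelevant regions and the inefficient re-ranking caused by rigid candidate scheduling, so a more efficient and reliable method is needed.

Result: Extensive experiments on seven benchmarks show that FoL++ achieves state-of-the-art performance with a lightweight memory footprint and improves inference speed by 40% over FoL.

Insight: Contributions include the Reliability Estimation Branch and spatial alignment losses, which strengthen region awareness; the pseudo-correspondence strategy, which enables dense local feature learning under weak supervision; and the Adaptive Candidate Scheduler, which, together with adaptive fusion of global and local evidence, lets FoL++ surpass traditional independent matching systems.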

Abstract: Visual Place Recognition (VPR) determines a query image’s geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irrelevant regions and inefficient re-ranking due to rigid candidate scheduling. To address these issues, we introduce FoL++, a method combining robust discriminative region modeling with adaptive re-ranking. Specifically, we propose a Reliability Estimation Branch to generate spatial reliability maps that explicitly model occlusion resistance. This representation is further optimized by two spatial alignment losses (SAL and SCEL) to effectively align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision directly from aggregation clusters. Our Adaptive Candidate Scheduler dynamically resizes candidate pools based on global similarity. By weighting local matches by reliability and adaptively fusing global and local evidence, FoL++ surpasses traditional independent matching systems. Extensive experiments across seven benchmarks demonstrate that FoL++ achieves state-of-the-art performance with a lightweight memory footprint, improving inference speed by 40% over FoL. Code and models will be released (and merged with FoL) at https://github.com/chenshunpeng/FoL.
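One plausible reading of "dynamically resizes candidate pools based on global similarity" is that re-ranking should consider only the candidates whose global score is close to the best match: a clear winner needs few candidates, an ambiguous query needs more. The rule below is a hypothetical sketch of that idea; the function name, the `margin` criterion, and the clamping bounds are assumptions, not the scheduler defined in the paper.

```python
def candidate_pool_size(
    global_sims: list[float],
    min_k: int = 5,
    max_k: int = 100,
    margin: float = 0.1,
) -> int:
    """Choose how many retrieved candidates to pass to local re-ranking.

    global_sims: global-descriptor similarities of retrieved candidates,
                 sorted in descending order (assumed input).
    margin: candidates within this distance of the top score count as
            plausible alternatives (hypothetical parameter).
    """
    if not global_sims:
        return 0
    top = global_sims[0]
    # Count candidates that global similarity alone cannot separate from the best.
    close = sum(1 for s in global_sims if top - s <= margin)
    # Clamp to the configured range and to the number of available candidates.
    k = max(min_k, min(max_k, close))
    return min(k, len(global_sims))
```

The efficiency gain in such a scheme comes from easy queries, where the pool shrinks to `min_k` and most of the local matching cost is skipped.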


[44] SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments cs.CVPDF

Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao

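TL;DR: This paper introduces SpaMEM, a benchmark for evaluating the dynamic spatial reasoning of multimodal large language models in embodied environments, using a large-scale dataset and a hierarchical task design to diagnose bottlenecks in perception-memory integration.

Details

Motivation: Existing MLLMs perform well on static visual-spatial reasoning but struggle to maintain long-horizon spatial coherence in embodied settings, where beliefs must be continuously revised from egocentric observations as the environment changes. A dedicated benchmark is therefore needed to diagnose the mechanics of spatial belief evolution.

Result: Evaluating representative open-source VLM families on SpaMEM shows a bottleneck in coordinate-consistent grounding and a sharp performance drop from Level 2 (with textual state histories) to Level 3 (end-to-end from raw visual streams), exposing a dependence on symbolic scaffolding.

Insight: The contribution is a large-scale, multimodal benchmark for embodied spatial reasoning. Its hierarchical tasks (atomic perception, temporal reasoning, end-to-end belief maintenance) and action-conditioned scene transformations systematically diagnose model weaknesses in visual memory and belief revision, pointing to research directions in state representation and long-horizon integration mechanisms.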

Abstract: Multimodal large language models (MLLMs) have advanced static visual–spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.


[45] NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting cs.CVPDF

Zaiyan Yang, Xinpeng Liu, Heng Guo, Jinglei Shi, Zhanyu Ma

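TL;DR: This paper proposes a neural regularization method that refines the noisy 3D semantic field obtained by lifting multi-view inconsistent 2D features, in order to achieve accurate and robust 3D semantic Gaussian Splatting. A variance-aware conditional MLP operates directly on the 3D Gaussians and uses their geometric and appearance attributes to correct semantic errors in 3D space.

Details

Motivation: 2D features extracted from vision foundation models are inconsistent across views because they lack cross-view constraints, and lifting them directly into 3D Gaussians produces a noisy semantic field that degrades downstream tasks. Existing methods either obtain consistent multi-view features in a preprocessing stage or mitigate noise through improved optimization strategies, often at the cost of longer preprocessing or heavier computation.

Result: Experiments on different datasets show that the method improves the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.

Insight: The contribution is a neural regularization method that acts directly on the 3D Gaussian representation as a post-processing step. A variance-aware conditional MLP uses geometric and appearance information to correct semantic noise, avoiding expensive preprocessing or optimization overhead and balancing efficiency with accuracy.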

Abstract: We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.


[46] All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams cs.CV | cs.LGPDF

Marco Pegoraro, Jonas Seng, Dustin Heller, Wil M. P. van der Aalst, Kristian Kersting

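TL;DR: This paper presents SnapLog, a method for automatically extracting event data from video streams to support business process management and process mining. It converts video frames to feature vectors with image embeddings, performs temporal segmentation using frame-wise similarity matrices, and assigns labels to video segments with generalized few-shot classification, yielding labeled, timestamped sub-sequences that can be interpreted as events.

Details

Motivation: Data multi-modality is an obstacle to process analysis; in particular, video data cannot be used directly as event data. The work aims to extract structured event logs from video automatically.

Result: Experiments show that the logs produced by SnapLog accurately reflect the process in the videos. The abstract does not mention specific benchmarks, quantitative results (such as accuracy or F1 score), or comparisons with existing methods.

Insight: The contribution is the combination of video frame embeddings with temporal segmentation, followed by generalized few-shot classification for label assignment, which automates the conversion of unstructured video into structured event logs end to end and gives process mining a way to handle a new data source.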

Abstract: Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
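A simplified version of similarity-based temporal segmentation can be sketched as follows: compare the embeddings of consecutive frames and start a new segment whenever their cosine similarity falls below a threshold. This uses only the first off-diagonal of the frame-wise similarity matrix and a fixed `threshold`, both assumptions made for brevity; SnapLog's actual segmentation procedure is not detailed in the abstract.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_frames(
    embeddings: list[list[float]],
    threshold: float = 0.8,
) -> list[tuple[int, int]]:
    """Split a frame sequence into (start, end) index ranges, end inclusive.

    A boundary is placed between frames i-1 and i when their embeddings'
    cosine similarity drops below `threshold` (hypothetical parameter).
    """
    if not embeddings:
        return []
    segments = []
    start = 0
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(embeddings) - 1))
    return segments
```

Each resulting range, combined with frame timestamps and a predicted activity label, would correspond to one event in the extracted log.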


[47] Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples cs.CV | cs.LGPDF

Oussama Bouanani, Jim Berend, Wojciech Samek, Sebastian Lapuschkin, Maximilian Dreyer

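TL;DR: This paper proposes Contrastive Semantic Projection (CSP), a method for generating more faithful and more specific textual labels for neurons in deep neural networks. It improves on existing neuron labeling methods by incorporating contrastive examples (inputs that are semantically similar but elicit low activations) into a CLIP-based label scoring and selection pipeline.

Details

Motivation: Existing neuron labeling methods typically rely on highly activating examples and tend to produce broad or misleading labels because they focus on dominant but incidental visual factors. This paper uses contrastive examples to sharpen explanations and to address scalable neuron-level labeling, producing more faithful and more specific descriptions.

Result: Across extensive experiments and a case study on melanoma detection, contrastive labeling outperforms state-of-the-art baselines in both faithfulness and semantic granularity.

Insight: The core idea is to integrate contrastive examples systematically into both stages of the neuron labeling pipeline: candidate label generation with vision language models, and label scoring and selection with the extended, CLIP-based CSP method. The results suggest that contrastive examples are a simple, powerful, and currently underused tool for making neuron explanations more precise.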

Abstract: Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples – inputs that are semantically similar to activating examples but elicit low activations – to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.


[48] Improving Driver Drowsiness Detection via Personalized EAR/MAR Thresholds and CNN-Based Classification cs.CV | eess.IVPDF

Gökdeniz Ersoy, Mehmet Alper Tatar, Eray Tonbul, Serap Kırbız

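TL;DR: This paper proposes a driver drowsiness detection system that combines personalized thresholds with CNN classification, significantly improving detection accuracy by calibrating driver-specific EAR/MAR thresholds and integrating deep learning models.

Details

Motivation: Existing vision-based driver monitoring systems typically rely on fixed EAR and MAR thresholds, but these fixed values generalize poorly across differences in facial structure, illumination, and driving conditions, which degrades detection.

Result: Experiments show that personalized thresholds improve detection accuracy by 2-3% over fixed thresholds, while CNN-based classification reaches 99.1% accuracy for eye state detection and 98.8% for yawning detection; the approach is validated on public and custom datasets.

Insight: The novelty lies in combining classical metric-based detection (personalized threshold calibration) with deep learning (CNNs) to improve robustness and real-time performance in challenging scenarios, providing a more reliable solution for driver monitoring.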

Abstract: Driver drowsiness is a major cause of traffic accidents worldwide, posing a serious threat to public safety. Vision-based driver monitoring systems often rely on fixed Eye Aspect Ratio (EAR) and Mouth Aspect Ratio (MAR) thresholds; however, such fixed values frequently fail to generalize across individuals due to variations in facial structure, illumination, and driving conditions. This paper proposes a personalized driver drowsiness detection system that monitors eyelid movements, head position, and yawning behavior in real time and provides warnings when signs of fatigue are detected. The system employs driver-specific EAR and MAR thresholds, calibrated before driving, to improve classical metric-based detection. In addition, deep learning-based Convolutional Neural Network (CNN) models are integrated to enhance accuracy in challenging scenarios. The system is evaluated using publicly available datasets as well as a custom dataset collected under diverse lighting conditions, head poses, and user characteristics. Experimental results show that personalized thresholding improves detection accuracy by 2-3% compared to fixed thresholds, while CNN-based classification achieves 99.1% accuracy for eye state detection and 98.8% for yawning detection, demonstrating the effectiveness of combining classical metrics with deep learning for robust real-time driver monitoring.
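
As an illustration of metric-based detection with a personalized threshold, the sketch below uses the commonly cited six-landmark Eye Aspect Ratio formula, EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|). The abstract does not describe the calibration rule, so setting the threshold to a fixed fraction of the driver's mean open-eye EAR is an assumption, and the `ratio` value and function names are hypothetical.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eye_aspect_ratio(eye):
    """EAR from six (x, y) landmarks p1..p6; p1 and p4 are the eye corners."""
    p1, p2, p3, p4, p5, p6 = eye
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))

def calibrate_threshold(open_eye_ears, ratio=0.75):
    """Driver-specific threshold as a fraction of the mean open-eye EAR.

    `ratio` is a hypothetical value; the paper's calibration rule is not
    given in the abstract.
    """
    baseline = sum(open_eye_ears) / len(open_eye_ears)
    return ratio * baseline

def is_eye_closed(ear, threshold):
    """Flag a frame as eye-closed when its EAR falls below the threshold."""
    return ear < threshold

# Synthetic open-eye landmarks used for calibration.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
threshold = calibrate_threshold([eye_aspect_ratio(open_eye)])
print(threshold, is_eye_closed(0.2, threshold))
```

The same pattern would apply to the Mouth Aspect Ratio with mouth landmarks, with a yawn flagged when MAR rises above a calibrated threshold rather than falling below it.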


[49] CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding cs.CV | cs.AIPDF

Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen

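TL;DR: This paper proposes CGC (Compositional Grounded Contrast), a framework for improving multimodal large language models (MLLMs) on fine-grained multi-image understanding at low cost. It uses existing single-image grounding annotations to construct compositional multi-image training instances through inter-image and intra-image contrast, combined with a rule-based spatial reward, to mitigate spatial hallucination, attention leakage, and failures in object constancy. Experiments show that CGC achieves state-of-the-art results on several fine-grained multi-image benchmarks and that the learned multi-image understanding transfers to broader multimodal understanding and reasoning tasks.

Details

Motivation: Current MLLMs still struggle with fine-grained multi-image understanding, exhibiting spatial hallucination, attention leakage, and failures in object constancy, and existing approaches typically rely on expensive human annotation or large-scale chain-of-thought data generation. The paper aims to provide a low-cost, complete framework for improving performance on such tasks.

Result: CGC achieves state-of-the-art (SOTA) results on the fine-grained multi-image benchmarks MIG-Bench and VLM2-Bench. Its capability also transfers to broader multimodal tasks, with consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).

Insight: The novelty lies in building multi-image training data cheaply from existing single-image grounding annotations via compositional contrastive learning (inter-image contrast introduces semantically decoupled distractor contexts to strengthen cross-image discrimination; intra-image contrast introduces correlated cross-view samples to improve object constancy), combined with a rule-based spatial reward within the GRPO framework that improves spatial alignment and structured output validity, forming a Think-before-Grounding paradigm.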

Abstract: Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).


[50] Distilling Vision Transformers for Distortion-Robust Representation Learning cs.CVPDF

Konstantinos Alexis, Giorgos Giannopoulos, Dimitrios Gunopulos

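TL;DR: This paper proposes an asymmetric knowledge distillation framework in which a pretrained Vision Transformer serves as a teacher that processes clean images while the student processes distorted images. With multi-level distillation (aligning global embeddings, patch-level features, and attention maps), the student learns distortion-robust visual representations and performs well on downstream tasks without direct access to clean data.

Details

Motivation: Self-supervised learning has been highly successful on clean data but remains challenging when clean observations are scarce or unavailable. The paper aims to use pretrained vision models to learn distortion-robust representations for downstream tasks that operate on distorted observations.

Result: Evaluated on image classification across several datasets and under various distortions, the method consistently outperforms existing alternatives for the same amount of human supervision, achieving SOTA performance for distortion-robust representations.

Insight: The contributions include an asymmetric knowledge distillation framework (teacher and student are both initialized from the same pretrained ViT but process different views) and a multi-level distillation strategy (aligning global embeddings, patch-level features, and attention maps). The method exploits clean-image information indirectly through distillation, addresses representation learning from distorted data, and may generalize to other vision tasks.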

Abstract: Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.


[51] Evolving Thematic Map Design in Academic Cartography: A Thirty-Year Study Based on Multilingual Journals cs.CV | cs.DLPDF

Zhiwei Wei, Chenxi Song, Tazhu Wang, Fan Wu, Hua Liao

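TL;DR: This study conducts a longitudinal, multilingual analysis of 45,732 research articles from 16 authoritative Chinese- and English-language journals published between 1990 and 2020. Using computer vision and large-model-based document parsing, it extracts 23,928 maps to build a structured dataset and quantifies thematic map design along three dimensions: map elements, color design, and layout structure. Chinese- and English-language academic maps follow highly similar design conventions: both favor restrained palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, together with centered layouts with high main-map occupation ratios. English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis shows that element richness, legend usage, and hue diversity increase in both groups while layout structure remains stable, suggesting that the evolution of academic map design reflects institutional convergence more than cultural divergence.

Details

Motivation: The large-scale design evolution of thematic maps in academic communication has lacked empirical study; the paper uses longitudinal, multilingual analysis to reveal how design practice in academic cartography has evolved.

Result: On the constructed dataset of 23,928 maps, the study quantifies map elements, color design, and layout structure. Chinese- and English-language map designs are highly similar, and element richness, legend usage, and hue diversity increase over time while layout structure stays stable.

Insight: The novelty lies in the first large-scale empirical analysis of the long-term evolution of academic thematic map design, combining computer vision and large-model-based document parsing to build a structured dataset. It reveals institutional convergence in design practice and provides a methodological and data foundation for cross-cultural research on academic visualization.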

Abstract: Thematic maps play a central role in academic communication, yet their large-scale design evolution has rarely been examined empirically. This study presents a longitudinal and multilingual analysis of thematic map design practices in academic cartography from 1990 to 2020. We compile a corpus of 45,732 research articles from sixteen authoritative Chinese- and English-language journals and extract 23,928 maps using computer vision and large-model-based document parsing to build a structured dataset. Map design characteristics are quantified across three dimensions: map elements, color design, and layout structure. Results show that Chinese- and English-language academic maps share highly similar structural conventions, typically employing restrained color palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, as well as centered layouts with high main-map occupation ratios. Differences exist in that English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis reveals parallel evolutionary trends in both groups, including increasing element richness, legend usage, and hue diversity, alongside stable layout structures. Overall, the findings suggest that academic map design evolution is characterized more by institutional convergence than cultural divergence.


[52] ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation cs.CVPDF

Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang

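TL;DR: This paper proposes ReLIC-SGG, a relation-incompleteness-aware framework for open-vocabulary scene graph generation (SGG). It builds a semantic relation lattice to model similarity, entailment, and contradiction among predicates and treats unannotated relations as latent variables rather than definite negatives, mitigating the inherent incompleteness of scene graph annotations.

Details

Motivation: Existing open-vocabulary SGG methods usually treat all unannotated object-pair relations as negatives, but scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described by predicates of different granularity. The problem is more severe in the larger open-vocabulary relation space.

Result: Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves recognition of rare and unseen predicates and better recovers missing relations.

Insight: The core contribution is treating relation incompleteness as the central problem: a semantic relation lattice gives structure to the semantic relations among open-vocabulary predicates, and a positive-unlabeled graph learning objective reduces false-negative supervision. This offers a new approach to visual relation understanding with sparse and incomplete annotations.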

Abstract: Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.


[53] Video Analysis and Generation via a Semantic Progress Function cs.CVPDF

Gal Metzer, Sagi Polaczek, Ali Mahdavi-Amiri, Raja Giryes, Daniel Cohen-Or

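TL;DR: This paper proposes the Semantic Progress Function, a one-dimensional representation for analyzing and quantifying how semantic content evolves over time in sequences produced by image and video generation models. By computing distances between the semantic embeddings of frames and fitting a smooth curve, the method exposes uneven semantic change. Building on this, the authors propose a semantic linearization procedure that reparameterizes the sequence so that semantics change at a constant rate, producing smoother and more coherent transitions. The framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across generators, and steering generated and real video sequences toward arbitrary target pacing.

Details

Motivation: Image and video generation models often change semantics in a highly non-linear way (long stretches where content barely changes, followed by abrupt semantic jumps); the paper aims to analyze and correct this uneven semantic evolution.

Result: The abstract reports no specific quantitative benchmark results or comparisons with SOTA models; the proposed method linearizes semantic change, producing smoother and more coherent transitions.

Insight: The core contribution is the Semantic Progress Function, a model-agnostic, general-purpose analysis tool that quantifies and visualizes the semantic pacing of a sequence; the semantic linearization reparameterization built on it offers a new way to control and improve the temporal consistency of generated content.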

Abstract: Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.
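
A minimal sketch of the idea, under stated assumptions: frame embeddings are plain lists of floats, Euclidean distance between consecutive embeddings stands in for the semantic distance, and the curve-fitting step is replaced by piecewise-linear interpolation. The function names are hypothetical, and this is not the authors' implementation.

```python
import math

def progress_function(embeddings):
    """Cumulative semantic shift per frame, normalized to the range [0, 1]."""
    cumulative = [0.0]
    for prev, cur in zip(embeddings, embeddings[1:]):
        cumulative.append(cumulative[-1] + math.dist(prev, cur))
    total = cumulative[-1]
    if total == 0:
        return [0.0] * len(cumulative)
    return [c / total for c in cumulative]

def linearize(progress, num_out):
    """Fractional frame indices at which progress hits evenly spaced targets.

    Sampling the original sequence at these indices retimes it so that
    semantic change unfolds at a constant rate. Requires num_out >= 2.
    """
    times = []
    j = 0
    for k in range(num_out):
        target = k / (num_out - 1)
        # Advance to the interval [progress[j], progress[j + 1]] holding target.
        while j < len(progress) - 2 and progress[j + 1] < target:
            j += 1
        lo, hi = progress[j], progress[j + 1]
        frac = 0.0 if hi == lo else (target - lo) / (hi - lo)
        times.append(j + frac)
    return times

# Toy sequence: nothing changes for three frames, then two equal jumps.
embeddings = [[0.0], [0.0], [0.0], [1.0], [2.0]]
progress = progress_function(embeddings)
print(progress)                # [0.0, 0.0, 0.0, 0.5, 1.0]
print(linearize(progress, 3))  # [0.0, 3.0, 4.0]
```

In the toy example the flat stretch at the start is skipped by the retiming, which is the behavior the abstract describes as removing uneven semantic pacing.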


[54] Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors cs.CV | cs.AIPDF

Gautam Kumar Jain, Carsten Markgraf, Julian Stähler

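TL;DR: This paper addresses cross-stage coherence in Graph Visual Question Answering (GVQA) for autonomous driving with two context-passing mechanisms, one explicit and one implicit. The explicit method uses prompt engineering on a domain-adapted VLM without additional training and reduces NLI contradiction by up to 42.6%. The implicit method introduces gated context projectors and, training only about 0.5% of parameters, substantially improves planning-stage semantic consistency (NLI contradiction down 34%, entailment up 50%).

Details

Motivation: The paper addresses decision consistency across the Perception, Prediction, and Planning stages of driving GVQA, so that planning outputs remain logically consistent with the model's own perception.

Result: On the DriveLM-nuScenes benchmark, the explicit method reduces NLI contradiction by up to 42.6%; the implicit method reduces planning-stage NLI contradiction by a statistically significant 34% (p < 0.05), increases cross-stage entailment by 50%, and improves the CIDEr score by 30.3%.

Insight: The novelty lies in comparing training-free explicit prompt engineering with implicit gated projectors that learn a small number of parameters. The gated projectors pass normalized information between stages through a hidden-state vector, offering an efficient, learnable route to cross-stage semantic alignment.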

Abstract: Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model’s own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage’s input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.


[55] FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing cs.CVPDF

Ze Chen, Lan Chen, Yuanhang Li, Qi Mao

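TL;DR: FlowAnchor is a training-free, flow-based, inversion-free video editing framework that addresses the failures of existing inversion-free methods in video editing caused by an unstable editing signal. It stabilizes the editing signal through spatial-aware attention refinement and adaptive magnitude modulation, enabling efficient, faithful, and temporally coherent video editing.

Details

Motivation: Inversion-free editing methods are efficient and structure-preserving for images because they directly steer the sampling trajectory. When extended to video, however, the editing signal becomes unstable in the high-dimensional video latent space (imprecise spatial localization and length-induced magnitude attenuation), so the methods tend to fail in multi-object scenes or as the frame count grows.

Result: Extensive experiments show that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing in challenging multi-object and fast-motion scenarios.

Insight: The novelty lies in explicitly anchoring where to edit and how strongly to edit: spatial-aware attention refinement keeps textual guidance aligned with spatial regions, and adaptive magnitude modulation preserves sufficient editing strength, which stabilizes the editing signal and guides the flow-based evolution toward the target distribution.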

Abstract: We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at https://cuc-mipg.github.io/FlowAnchor.github.io/.


[56] EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges cs.CVPDF

Hyo Jin Jon, Longbin Jin, Eun Yi Kim

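TL;DR: EV-CLIP is an efficient visual prompt adaptation framework for CLIP in few-shot video action recognition, with particular attention to spatial perception under visual challenges such as low light or egocentric viewpoints. It introduces mask prompts and context prompts, which respectively strengthen spatial attention and provide lightweight temporal modeling. It outperforms existing parameter-efficient methods on several benchmark datasets, and its efficiency does not depend on backbone scale.

Details

Motivation: Existing methods that adapt CLIP to action recognition focus mainly on temporal modeling and overlook spatial perception. Under real-world visual challenges (such as low light or egocentric viewpoints), spatial understanding is essential for effective temporal reasoning, so an efficient adaptation method that handles both space and time is needed.

Result: Experiments on five curated benchmark datasets show that EV-CLIP outperforms existing parameter-efficient methods in overall performance and that its efficiency is independent of backbone scale, making it suitable for resource-constrained deployment.

Insight: The novelty lies in introducing both mask prompts (which reweight pixels to guide the model toward action-relevant regions and strengthen spatial perception) and context prompts (which compress frame-wise features for lightweight temporal modeling), forming an efficient adaptation framework that covers both space and time. The design specifically targets weak spatial perception under visual challenges, and the multi-dataset analysis quantifies the effect of domain shift, which gives it practical value.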

Abstract: CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model’s attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at https://github.com/AI-CV-Lab/EV-CLIP.


[57] A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock cs.CVPDF

Shiva Paudel, TsungCheng Tsai, Dongyi Wang

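TL;DR: This paper proposes a non-invasive, vision-based system for identifying individual livestock that uses 3D point cloud data captured at a commercial electronic feeding station (EFS) in place of traditional RFID ear tags. Its core is TARA, a semi-supervised framework that updates individual identity profiles through a dynamic recalibration mechanism to accommodate morphological changes, and that trains on high-quality pseudo-labels generated from raw temporal sequences with a visit-level majority voting strategy.

Details

Motivation: The paper addresses individual identification for precision management of group-housed livestock. Traditional RFID ear tags are invasive, prone to loss, and limited by the spatial extent of antenna fields, so a non-invasive, robust alternative is needed.

Result: On a dataset of group-housed sows collected from a commercial barn, the method achieves 100% identification accuracy at the visit level.

Insight: The novelty lies in a self-sufficient, semi-supervised temporally adaptive recognition architecture (TARA) that handles morphological change by dynamically updating identity profiles and copes with label scarcity by generating pseudo-labels from temporal consistency, offering a viable path for vision-based identification from 3D point clouds to replace RFID systems.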

Abstract: Accurate identification of individual farm animals in group-housed environments is a cornerstone of precision livestock management. However, current industry standards rely heavily on Radio Frequency Identification (RFID) ear tags, which are invasive, prone to loss, and restricted by the spatial limitations of antenna fields. In this paper, we propose a non-intrusive, vision-based identification system leveraging 3D point cloud data captured within a commercial electronic feeding station (EFS). Departing from traditional supervised frame-level inference, we introduce the Temporal Adaptive Recognition Architecture (TARA), a self-sufficient, semi-supervised framework designed to maintain identity consistency over time. TARA employs a dynamic recalibration mechanism that updates individual identity profiles to account for morphological changes in the livestock. To facilitate training in label-scarce environments, we utilize a visit-level majority voting strategy to generate high-fidelity pseudo-labels from raw temporal sequences. Experimental results on a group housed sow dataset collected from an operational commercial barn demonstrate that our approach achieves 100% identification accuracy at the visit level. These results suggest that vision-based 3D point cloud analysis offers a robust, superior alternative to RFID-based systems, paving the way for fully autonomous individual animal monitoring.
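
The visit-level majority voting strategy can be sketched as follows. The sketch assumes that each feeding-station visit yields one predicted identity per frame and that the visit is assigned the most frequent identity. The `min_agreement` cutoff for discarding ambiguous visits and the function name are hypothetical additions for illustration, not details from the paper.

```python
from collections import Counter

def visit_pseudo_label(frame_predictions, min_agreement=0.5):
    """Return the majority identity for one visit, or None if it is ambiguous.

    frame_predictions: per-frame identity predictions for a single visit.
    min_agreement: hypothetical cutoff; the winning identity must account for
    strictly more than this fraction of frames to be kept as a pseudo-label.
    """
    if not frame_predictions:
        return None
    label, count = Counter(frame_predictions).most_common(1)[0]
    if count / len(frame_predictions) <= min_agreement:
        return None
    return label

# One visit with a single noisy frame: the majority identity is kept.
print(visit_pseudo_label(["sow_3", "sow_3", "sow_7", "sow_3"]))
# An evenly split visit is dropped rather than guessed.
print(visit_pseudo_label(["sow_3", "sow_7"]))
```

Aggregating over a whole visit lets occasional frame-level errors be outvoted, which is why visit-level accuracy can exceed frame-level accuracy.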


[58] PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views cs.CVPDF

Jiaxin Shi, Guofeng Zhang, Wufei Ma, Naifu Liang, Adam Kortylewski

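TL;DR: This paper proposes PASR (Pose-Aware 3D Shape Retrieval), a framework for single-view 3D shape retrieval. It formulates retrieval as a feature-level analysis-by-synthesis problem, distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder and aligning pose-conditioned 3D projections with 2D feature maps to bridge the gap between real images and synthetic meshes. At inference, PASR performs test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image.

Details

Motivation: Existing single-view 3D shape retrieval methods fall largely into two categories: those that use contrastive learning to map point cloud features into existing vision-language spaces, and those that learn a common embedding space for 2D images and 3D shapes. These feed-forward, holistic alignments are often hard to interpret, which limits their robustness and generalization in real-world applications.

Result: PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets. It also shows strong multi-task capability, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.

Insight: The novelty lies in recasting retrieval as a feature-level analysis-by-synthesis optimization problem; pose-conditioned projection alignment and test-time optimization improve robustness to occlusion and sensitivity to fine-grained geometric detail. The method transfers knowledge from a 2D foundation model to the 3D domain and unifies shape retrieval, pose estimation, and classification.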

Abstract: Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR outperforms existing methods by a wide margin on both clean and occluded 3D shape retrieval datasets. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.


[59] SS3D: End2End Self-Supervised 3D from Web Videos cs.CVPDF

Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera

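TL;DR: SS3D is a web-scale, SfM-based self-supervised pretraining pipeline for feed-forward 3D estimation from monocular video. The model jointly predicts depth, ego-motion, and camera intrinsics in a single forward pass and is trained and evaluated as an end-to-end 3D estimator. Joint learning is stabilized with an intrinsics-first two-stage training schedule and a unified single-checkpoint evaluation protocol. To handle the weak multi-view observability and strong heterogeneity of web video, the pipeline uses a multi-view signal proxy for filtering and curriculum sampling and distills expert training into a single student. After pretraining on YouTube-8M, the model outperforms prior self-supervised baselines in both cross-domain zero-shot transfer and fine-tuning.

Details

Motivation: The paper addresses the challenge of learning robust end-to-end 3D estimation (depth, ego-motion, intrinsics) from large-scale, unconstrained web video, overcoming the dependence of traditional methods on multi-view observability and data consistency.

Result: After pretraining on YouTube-8M (about 100M frames), the model performs strongly in cross-domain zero-shot transfer and fine-tuning, outperforming prior self-supervised baselines.

Insight: The contributions include: (1) an end-to-end joint prediction framework; (2) an intrinsics-first two-stage training schedule and a unified evaluation protocol that stabilize training; (3) a multi-view signal proxy and curriculum sampling strategy for handling the noise and heterogeneity of web video; and (4) distillation of experts into a single student model, which consolidates knowledge from large-scale data. Together these offer a practical route to 3D self-supervised learning from massive web video.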

Abstract: We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.


[60] Long-tail Internet photo reconstruction cs.CVPDF

Yuan Li, Yuanbo Xiangli, Hadar Averbuch-Elor, Noah Snavely, Ruojin Cai

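TL;DR: To address the long-tailed distribution of Internet photo collections, this paper proposes a training strategy based on simulating sparse scenes, together with the MegaDepth-X dataset, to strengthen the reconstruction ability of 3D foundation models on sparse, noisy imagery.

Details

Motivation: The long-tailed distribution of Internet photo collections makes 3D reconstruction difficult for sparse, noisy scenes; the paper aims to go beyond the limits of existing classical and learned methods.

Result: Fine-tuning 3D foundation models yields robust reconstruction under extreme sparsity and more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard dense 3D benchmark datasets.

Insight: The novelty lies in simulating the camera distributions of long-tail scenes by sampling sparse subsets from well-reconstructed landmark scenes, and in building the MegaDepth-X dataset and the corresponding training strategy, which together provide an effective way to adapt 3D foundation models to the long tail.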

Abstract: Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.


[61] Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis cs.CVPDF

Xiang Zhang, Xiaotian Li, Taoyue Wang, Nan Bi, Xin Zhou

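TL;DR: This paper presents Inter-Stance, a dyadic multimodal corpus for conversational stance analysis. It contains synchronized multimodal behavioral data (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech, physiological signals, and more) and self-reported affect for 45 dyads (90 participants) in communicative interaction, annotated with social signals and with agreement, disagreement, and neutral stance.

Details

Motivation: No publicly available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction, and dyadic recordings and annotations in particular are lacking, which limits research on multimodal modeling of social interaction.

Result: Extensive experiments evaluate the multimodal dyadic communication, and the affect, of dyads with and without interpersonal history. The dataset comprises 20TB of multimodal data available to the research community.

Insight: The novelty lies in providing the first large-scale, synchronized, multimodal dyadic interaction dataset, including an interpersonal-history variable and rich physiological and behavioral annotations, which opens new possibilities for social signal processing and interpersonal behavior modeling.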

Abstract: Social interactions dominate our perceptions of the world and shape our daily behavior by attaching social meaning to acts as simple and spontaneous as gestures, facial expressions, voice, and speech. People mimic and otherwise respond to each other’s postures, facial expressions, mannerisms, and other verbal and nonverbal behavior, and form appraisals or evaluations in the process. Yet, no publicly available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction. Dyadic recordings and annotation are lacking. We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multimodal behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, and physiology: PPG, EDA, heart rate, blood pressure, and respiration) and self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. We present extensive experiments to evaluate multimodal dyadic communication of dyads with and without interpersonal history, and their affect. This new database will enable multimodal modeling of social interaction that was not possible before. The dataset includes 20TB of multimodal data to share with the research community.


cs.CR [Back]

[62] Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning cs.CR | cs.CLPDF

Chaoran Chen, Dayu Yuan, Peter Kairouz

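TL;DR: This paper proposes "Behavioral Canaries," a new auditing mechanism for detecting whether legally protected retrieved context has been improperly used in reinforcement learning fine-tuning. The method plants document triggers in preference data and rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference that allows unauthorized training to be detected.

Details

Motivation: In agentic workflows, the retrieved context that LLMs process is often legally protected from use in further training. Existing auditing methods (such as verbatim memorization and membership inference) are ineffective for RL fine-tuned models, because RL mainly shapes a model's behavioral style rather than its retention of specific facts, so auditors cannot reliably verify whether a service provider has misused such data.

Result: Experiments show that at a 1% canary injection rate the method achieves a 67% detection rate at a 10% false-positive rate (AUROC = 0.756), enabling detection of unauthorized document-conditioned training.

Insight: The novelty lies in shifting the audit from conventional memorization detection to changes in behavioral style: trigger-feedback pairs plant a detectable behavioral signal in preference data, giving RLFT pipelines a new auditing paradigm that can detect training-time influence that manifests as distributional behavioral change rather than memorization.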

Abstract: In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model’s behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.


[63] SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking cs.CR | cs.AI | cs.CLPDF

Chenxi Gu, Xiaoning Du, John Grundy

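TL;DR: This paper proposes SSG, which partitions the vocabulary into two logit-balanced subsets to improve the detectability of the KGW watermarking scheme in low-entropy settings such as code generation and mathematical reasoning.

Details

Motivation: The KGW watermarking scheme degrades significantly in low-entropy settings; its random vocabulary partitioning limits the lower bound of watermark strength, so the partitioning needs to be improved to strengthen detectability.

Result: Experiments on code generation and mathematical reasoning datasets confirm the effectiveness of SSG, with improved watermark detection performance.

Insight: The novelty lies in partitioning the vocabulary by sorting and then splitting by groups to achieve logit balance, which raises the lower bound of watermark strength for each token prediction; the idea can inform other watermarking methods based on probability distributions.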

Abstract: Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW’s effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as watermark strength. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (Sort-then-Split by Groups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.
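
One way to read "sort then split by groups" is sketched below. The abstract does not spell out the algorithm, so the details here are assumptions: tokens are sorted by logit, consecutive tokens are grouped in pairs, and a seeded random choice sends one member of each pair to each subset. The group size, the seeding scheme, and the function name are hypothetical, and the sketch should not be read as the authors' method.

```python
import random

def sort_then_split(logits, seed, group_size=2):
    """Partition token ids into two subsets with similar logit mass.

    logits: list of next-token logits indexed by token id.
    seed: integer seed (in a KGW-style scheme this would be derived from
          preceding tokens and a secret key; here it is simply passed in).
    group_size: hypothetical choice; groups of two split each adjacent pair.
    """
    rng = random.Random(seed)
    # Sort token ids from highest to lowest logit.
    order = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)
    green, red = [], []
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        rng.shuffle(group)
        half = (len(group) + 1) // 2
        green.extend(group[:half])
        red.extend(group[half:])
    return green, red

logits = [5.0, 4.9, 0.1, 0.0, -3.0, -3.1]
green, red = sort_then_split(logits, seed=42)
print(sorted(green), sorted(red))
```

Under this reading, the two most likely tokens can never land in the same subset, so a low-entropy distribution dominated by one or two tokens still leaves probability mass on both sides of the partition. A purely random split offers no such guarantee.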


cs.LG [Back]

[64] Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning cs.LG | cs.AI | cs.CLPDF

Grigory Sapunov

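TL;DR: This paper studies whether learned memory tokens are needed as a computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. It finds that memory tokens are empirically necessary and that the optimal count shows a sharp lower threshold followed by a stable plateau. The paper also identifies and resolves a router trap in ACT initialization, which makes training reliable, and shows that ACT is more consistent than fixed-depth processing, that lambda warmup reduces the number of compute steps, and that attention heads specialize functionally across recursive depth.

Details

Motivation: To determine whether a Universal Transformer (UT) needs memory tokens as a computational scratchpad to support recursive reasoning on combinatorial tasks, and to examine the effectiveness of the Adaptive Computation Time (ACT) mechanism and its initialization trap.

Result: On Sudoku-Extreme, no configuration without memory tokens achieves non-trivial performance. Performance plateaus at T=8-32 memory tokens, with 57.4% +/- 0.7% exact-match accuracy. ACT gives more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3%), and with lambda warmup it reaches similar accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps.

Insight: The core contribution is empirical evidence that memory tokens are necessary for a UT on combinatorial reasoning, together with the threshold-plateau-collapse relationship between token count and performance. The paper also uncovers a router trap in ACT initialization (the "shallow equilibrium") and resolves it with negative-bias initialization ("deep start"), a key insight for stable ACT training. The specialization of attention heads into memory readers, constraint propagators, and integrators across recursive depth is another valuable finding.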

Abstract: We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested – 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing – no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes >70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves’ recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 (“deep start,” p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.
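The halting probabilities quoted in the abstract are consistent with a sigmoid applied to the router bias, and the "halt after ~2 steps" behaviour follows from the usual ACT rule of stopping once the cumulative halting probability reaches 1 - eps. The sketch below checks that arithmetic. It assumes the router's weighted input is roughly zero at initialization, so the per-step probability is sigmoid(bias), and that the positive bias is +1 (which matches p ~ 0.73). The values `eps=0.01` and `max_steps=64` are illustrative assumptions, not settings from the paper.

```python
import math


def halting_prob(bias):
    """Per-step halting probability when the router output is just its bias."""
    return 1.0 / (1.0 + math.exp(-bias))


def steps_to_halt(p, eps=0.01, max_steps=64):
    """Steps until the cumulative halting probability reaches 1 - eps.

    Assumes a constant per-step probability p, which only approximates
    the situation at initialization.
    """
    total = 0.0
    for step in range(1, max_steps + 1):
        total += p
        if total >= 1.0 - eps:
            return step
    return max_steps


for name, bias in [("zero bias", 0.0), ("positive bias", 1.0), ("deep start", -3.0)]:
    p = halting_prob(bias)
    print(f"{name}: p = {p:.3f}, halts after {steps_to_halt(p)} steps")
```

Under these assumptions both the zero and the positive bias halt after two steps, while a bias of -3 gives p of about 0.047 and lets the model start with roughly twenty ponder steps, which is the sense in which the abstract calls it a "deep start".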


[65] Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models cs.LG | cs.CV

Weiqiu You, Cassandra Goldberg, Amin Madani, Daniel A. Hashimoto, Eric Wong

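TL;DR: This paper proposes Sum-of-Checks, a framework for improving the accuracy and auditability of large vision-language models (LVLMs) when assessing the Critical View of Safety (CVS) in laparoscopic cholecystectomy. The framework decomposes each CVS criterion into expert-defined reasoning checks, has the LVLM produce a binary judgment and a justification for each check, and computes the final score by weighted aggregation. On the Endoscapes2023 benchmark, it clearly outperforms several baseline prompting strategies.

Details

Motivation: LVLM predictions on safety-critical surgical tasks such as CVS assessment are hard to audit and unreliable. The paper aims to improve their accuracy and transparency through structured reasoning.

Result: On Endoscapes2023, evaluated with three frontier LVLMs, Sum-of-Checks improves average frame-level mean average precision by 12-14% relative to the best baseline. Analysis shows that LVLMs are reliable on observational checks but vary substantially on decision-critical anatomical evidence.

Insight: The novelty is decomposing surgical reasoning into expert-aligned verification checks, explicitly separating evidence elicitation from decision-making. This points to a way of building reliable, auditable surgical AI systems: modular reasoning guided by domain knowledge improves model performance on safety-sensitive tasks.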

Abstract: Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and mortality. While large vision-language models (LVLMs) offer flexible reasoning, their predictions remain difficult to audit and unreliable on safety-critical surgical tasks. Methods: We introduce Sum-of-Checks, a framework that decomposes each CVS criterion into expert-defined reasoning checks reflecting clinically relevant visual evidence. Given a laparoscopic frame, an LVLM evaluates each check, producing a binary judgment and justification. Criterion-level scores are computed via fixed, weighted aggregation of check outcomes. We evaluate on the Endoscapes2023 benchmark using three frontier LVLMs, comparing against direct prompting, chain-of-thought, and sub-question decomposition, each with and without few-shot examples. Results: Sum-of-Checks improves average frame-level mean average precision by 12–14% relative to the best baseline across all three models and criteria. Analysis of individual checks reveals that LVLMs are reliable on observational checks (e.g., visibility, tool obstruction) but show substantial variability on decision-critical anatomical evidence. Conclusion: Structuring surgical reasoning into expert-aligned verification checks improves both accuracy and transparency of LVLM-based CVS assessment, demonstrating that explicitly separating evidence elicitation from decision-making is critical for reliable and auditable surgical AI systems. Code is available at https://github.com/BrachioLab/SumOfChecks.
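The aggregation step described in the Methods can be illustrated with a small sketch. The check names, weights, and the normalization to a 0-1 score below are hypothetical, since the abstract only states that criterion-level scores come from a fixed, weighted aggregation of binary check outcomes.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str           # hypothetical check identifier
    passed: bool        # binary judgment returned by the LVLM
    justification: str  # free-text rationale kept for auditing
    weight: float       # fixed weight; the values used here are made up


def criterion_score(results):
    """Weighted fraction of passed checks (normalization is an assumption)."""
    total = sum(r.weight for r in results)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r.weight for r in results if r.passed) / total


# Illustrative checks for a single criterion.
results = [
    CheckResult("structures_visible", True, "Field of view is clear.", 1.0),
    CheckResult("anatomical_evidence", False, "Key anatomy not exposed.", 2.0),
    CheckResult("no_tool_obstruction", True, "Instruments are out of the way.", 1.0),
]
print(criterion_score(results))
```

Keeping the justification next to each binary outcome is what makes the final score auditable: a reviewer can see which check failed and why, rather than only a single number.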


eess.IV

[66] Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data eess.IV | cs.CV | cs.LG

Harry Dong, Timofey Efimov, Megna Shah, Jeff Simmons, Sean Donegan

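TL;DR: This paper proposes an unconditional multimodal diffusion model with which polarized light (PL) data and low-resolution electron back-scattered diffraction (EBSD) data enhance each other. Trained on synthetic data, the model generalizes to real data and improves performance on grain boundary prediction, super-resolution, and denoising, approaching full-resolution performance with only 25% of the EBSD data.

Details

Motivation: Data collection in 3-D EBSD microscopy is time-consuming. The paper uses PL data to accelerate EBSD collection and a handful of EBSD measurements to enrich features in chaotic PL data, learning the complex dynamics between EBSD and PL in order to solve these inverse problems.

Result: The model generalizes well to real data, including low-resolution, noisy, corrupted, and misregistered data. With inference-time scaling it improves performance on grain boundary prediction, super-resolution, and denoising, and with only 25% of the EBSD resolution plus corrupted PL data it performs close to the full-resolution setting.

Insight: The novelty is using an unconditional multimodal diffusion model to learn the complex relationship between EBSD and PL so that the two modalities enhance each other. Trained only on synthetic data, the method generalizes to real data and handles several inverse problems flexibly at inference time, which suggests a new approach to multimodal data fusion.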

Abstract: In spite of the utility of 3-D electron back-scattered diffraction (EBSD) microscopy, the data collection process can be time-consuming with serial-sectioning. Hence, it is natural to look at other modalities, such as polarized light (PL) data, to accelerate EBSD data collection, supplemented with shared information. Complementarily, features in chaotic PL data could even be enriched with a handful of EBSD measurements. To inherently learn the complex dynamics between EBSD and PL to solve these inverse problems, we use an unconditional multimodal diffusion model, motivated by progress in diffusion models for inverse problems. Although trained solely on synthetic data once, our model has strong generalizable capabilities on real data which can be low-resolution, noisy, corrupted, and misregistered. With inference-time scaling, we show gains in performance on a variety of objectives including grain boundary prediction, super-resolution, and denoising. With our model, we demonstrate that there is little difference from full resolution performance with only 25% (1/4 the resolution) of EBSD data and corrupted PL data.


[67] MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models eess.IV | cs.CV

Yunquan Chen, Haoyu Chen

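TL;DR: This paper introduces MTT-Bench, a new benchmark for analyzing raw behavioral video of mice with multimodal large language models (MLLMs) and predicting their social dominance hierarchy. Building on existing MLLM architectures, the authors fine-tune the models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels at test time.

Details

Motivation: To explore the ability of MLLMs to analyze animal behavior, specifically the social dominance hierarchy of mice, and to provide a new tool for neuroscience and behavioral research without the need to design domain-specific models.

Result: The method achieves promising results on MTT-Bench, with predictions that agree closely with tube test rankings.

Insight: The novelty is applying foundation models, MLLMs in particular, to ethology and social behavior analysis. The zero-shot inference framework works directly on raw video and serves as an example of cross-domain application.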

Abstract: Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models (MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.


[68] Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction? eess.IV | cs.CV | cs.LG

Anam Hashmi, Mayug Maniparambil, Julia Dietlmeier, Kathleen M. Curran, Noel E. O’Connor

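TL;DR: This study examines whether foundation models pretrained on natural images (such as CLIP and DINOv2) are suitable for accelerated cardiac MRI reconstruction, and compares them with domain-specific models such as BiomedCLIP. It proposes an unrolled reconstruction framework that incorporates frozen pretrained visual encoders. Experiments show that task-specific models such as E2E-VarNet perform better in standard in-distribution settings, whereas foundation models are more robust in cross-domain scenarios (for example, transferring from cardiac MRI to knee and brain data), particularly under high acceleration factors and limited low-frequency sampling.

Details

Motivation: Large-scale pretrained foundation models perform strongly in computer vision, but their potential for physics-based inverse problems such as accelerated cardiac MRI reconstruction remains largely unexplored. This study evaluates whether these models can serve as effective image priors.

Result: In standard in-distribution cardiac MRI reconstruction, the task-specific state-of-the-art model (E2E-VarNet) performs better. In cross-domain evaluation (trained on cardiac MRI, tested on knee and brain data), foundation models, CLIP in particular, are more robust under high acceleration factors, and BiomedCLIP provides modest additional gains in more ill-posed regimes.

Insight: Natural-image-pretrained foundation models such as CLIP learn highly transferable structural representations that can serve as robust cross-domain priors. Domain-specific pretraining (BiomedCLIP) adds only modest gains in the most ill-posed regimes. An unrolled framework with frozen pretrained encoders is a promising way to improve generalization in MRI reconstruction.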

Abstract: The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inverse problems, such as accelerated cardiac MRI reconstruction, remains largely underexplored. In this work, we investigate whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, and compare the performance obtained against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that incorporates pretrained, frozen visual encoders, such as CLIP, DINOv2, and BiomedCLIP, within each cascade to guide the reconstruction process. Through extensive experiments, we show that while task-specific state-of-the-art reconstruction models such as E2E-VarNet achieve superior performance in standard in-distribution settings, foundation-model-based approaches remain competitive. More importantly, in challenging cross-domain scenarios, where models are trained on cardiac MRI and evaluated on anatomically distinct knee and brain datasets, foundation models exhibit improved robustness, particularly under high acceleration factors and limited low-frequency sampling. We further observe that natural-image-pretrained models, such as CLIP, learn highly transferable structural representations, while domain-specific pretraining (BiomedCLIP) provides modest additional gains in more ill-posed regimes. Overall, our results suggest that pretrained foundation models offer a promising source of transferable priors, enabling improved robustness and generalization in accelerated MRI reconstruction.
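For readers unfamiliar with the terms "acceleration factor" and "low-frequency sampling", the sketch below builds a simple one-dimensional Cartesian undersampling mask of the kind commonly used in accelerated MRI experiments. It is a generic illustration, not the sampling scheme used in the paper: the function name, the equispaced pattern, and the assumption that low frequencies sit in the middle columns of k-space are all illustrative choices.

```python
def undersampling_mask(num_cols, acceleration, center_fraction):
    """Boolean mask over k-space columns: True means the column is acquired.

    Keeps a fully sampled low-frequency band around the centre plus every
    `acceleration`-th column elsewhere (equispaced pattern, an assumption).
    """
    num_center = round(num_cols * center_fraction)
    start = (num_cols - num_center) // 2
    mask = [False] * num_cols
    # Fully sampled low-frequency (centre) band.
    for col in range(start, start + num_center):
        mask[col] = True
    # Equispaced columns across the whole width.
    for col in range(0, num_cols, acceleration):
        mask[col] = True
    return mask


mask = undersampling_mask(num_cols=16, acceleration=4, center_fraction=0.25)
print("".join("#" if m else "." for m in mask))
print("sampled columns:", sum(mask), "of", len(mask))
```

Raising `acceleration` or shrinking `center_fraction` leaves fewer measured columns, so the reconstruction problem becomes more ill-posed and the image prior has to carry more of the load. That is the regime in which the abstract reports the largest robustness benefit from pretrained encoders.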


cs.AI

[69] Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents cs.AI | cs.CL | cs.LG

Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li

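TL;DR: Using the Superminds Test framework, this paper presents the first empirical evaluation of whether collective intelligence emerges spontaneously with scale in a large-scale autonomous agent society (the MoltBook platform, which hosts over two million agents). It finds that the current agent society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even simple coordination tasks, which indicates that collective intelligence does not emerge from scale alone.

Details

Motivation: As large language model agents scale to populations of millions, the paper asks whether collective intelligence emerges spontaneously from scale, that is, whether a large agent society shows collective capabilities beyond those of any individual member.

Result: Experiments on MoltBook show that the society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even simple coordination tasks. Platform-wide analysis shows that interactions are very sparse and shallow: threads rarely extend beyond a single reply, and most responses are generic or off-topic.

Insight: The claimed contribution is the Superminds Test, a hierarchical evaluation framework that actively probes society-level intelligence with controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Viewed objectively, the core insight is that the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs. Scale itself is not the limiting factor.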

Abstract: Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other’s outputs.


[70] QuantClaw: Precision Where It Matters for OpenClaw cs.AI | cs.CL

Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong

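TL;DR: This paper proposes QuantClaw, a plug-and-play precision routing plugin for autonomous agent systems such as OpenClaw. By analyzing how sensitive different complex workflows are to quantization, it finds that precision requirements are highly task-dependent. QuantClaw therefore assigns precision dynamically according to task characteristics, routing lightweight tasks to lower-cost configurations while preserving higher precision for demanding workloads, which saves cost and accelerates inference without increasing user complexity.

Details

Motivation: Autonomous agent systems such as OpenClaw pose significant efficiency challenges because of long-context inputs and multi-turn reasoning, leading to prohibitively high computational and monetary costs in real-world development. Quantization is a standard approach for reducing cost and latency, but its impact on agent performance in realistic scenarios remains unclear.

Result: Experiments show that QuantClaw maintains or improves task performance while reducing latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM-5 (FP8 baseline).

Insight: The core contribution is treating precision as a dynamic resource and proposing a task-aware, dynamic precision routing mechanism. Viewed objectively, the insight is that tasks within agent workflows differ in their sensitivity to quantization, which motivates a lightweight, pluggable optimization that achieves a better trade-off between performance and efficiency.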

Abstract: Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long-context inputs and multi-turn reasoning. This results in prohibitively high computational and monetary costs in real-world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task-dependent. Based on this observation, we propose QuantClaw, a plug-and-play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower-cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM-5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.
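The routing idea can be pictured with a toy rule-based router. Everything in the sketch below is hypothetical: the `Task` fields, the token threshold, and the "high"/"low" labels are placeholders chosen for illustration. The abstract does not describe how QuantClaw scores tasks or which precision configurations it selects between.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    context_tokens: int              # hypothetical feature: prompt length
    needs_multistep_reasoning: bool  # hypothetical feature: task difficulty flag


def route_precision(task, long_context_threshold=32_000):
    """Return a precision tier for a task.

    Toy heuristic, not the paper's router: demanding tasks keep the
    higher-precision configuration, everything else goes to the cheaper one.
    """
    if task.needs_multistep_reasoning:
        return "high"
    if task.context_tokens > long_context_threshold:
        return "high"
    return "low"


tasks = [
    Task("rename a file", 500, False),
    Task("debug a failing build across several files", 12_000, True),
    Task("summarize a very long log", 80_000, False),
]
for t in tasks:
    print(f"{t.description}: {route_precision(t)}")
```

A plugin of this shape sits in front of model dispatch, so callers never choose a precision themselves, which is one way to read the abstract's claim of saving cost "without increasing user complexity".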


eess.AS

[71] UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions eess.AS | cs.AI | cs.CL | cs.SD

Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo

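TL;DR: UniSonate is a unified flow-matching framework that synthesizes speech, music, and sound effects through a standardized, reference-free natural language instruction interface. A dynamic token injection mechanism projects unstructured environmental sounds into a structured temporal latent space, and together with a multi-stage curriculum learning strategy this effectively mitigates cross-modal optimization conflicts.

Details

Motivation: Generative audio modeling is currently fragmented into specialized tasks, namely text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each with its own control paradigm. Unifying these modalities is fundamentally challenging because of the intrinsic dissonance between structured semantic representations and unstructured acoustic textures.

Result: UniSonate achieves a word error rate of 1.47% on instruction-based TTS and a SongEval Coherence score of 3.18 on TTM, both state of the art, while maintaining competitive fidelity on TTA.

Insight: The novelty is the dynamic token injection mechanism, which projects unstructured audio into a structured latent space to enable precise duration control, combined with multi-stage curriculum learning to optimize cross-modal training. Viewed objectively, the unified framework demonstrates positive transfer across modalities: joint training significantly improves structural coherence and prosodic expressiveness.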

Abstract: Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.