cs.CL

[1] Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries

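Shravan Gadbail, Masumi Desai, Kamalakar Karlapalem

Main category: cs.CL

TL;DR: The paper proposes Iti-Validator, a framework that validates and corrects LLM-generated itineraries, addressing their temporal and spatial inconsistencies, and demonstrates its effectiveness experimentally.

Motivation: Complex plans generated by LLMs, such as travel itineraries, often lack temporal and spatial consistency, especially under physical travel constraints, so a method is needed to detect and correct these problems.

Contribution: A validation framework that evaluates and improves the temporal consistency of LLM-generated itineraries, using the AeroDataBox API to check them against real-world flight durations.

Method: Multiple state-of-the-art LLMs generate travel plans, which are validated for temporal consistency through the API; overlapping journeys and unrealistic transit times are then corrected.

Result: Current LLMs frequently produce temporally inconsistent itineraries, but the framework corrects these systematically and reliably, enabling practical deployment in large-scale travel planning.

Insight: LLMs still fall short on complex temporal reasoning tasks such as itinerary generation, but external validation tools can substantially improve their practical usefulness.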

Abstract: The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.
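The kind of check described in this abstract (overlapping journeys, unrealistic transit times, implausible flight durations) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Leg` type, the `find_issues` function, the 45-minute transit threshold, and the duration table are all hypothetical, the duration figures are made up for the example, and all timestamps are assumed to be in one time zone. In the paper the duration constraints come from the AeroDataBox API, which this sketch does not call.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Leg:
    origin: str
    destination: str
    depart: datetime
    arrive: datetime


def find_issues(legs, min_duration, min_transit=timedelta(minutes=45)):
    """Return human-readable temporal problems in an ordered list of legs.

    min_duration maps (origin, destination) to the shortest plausible flight
    time. Here it is a hard-coded, hypothetical lookup table.
    """
    issues = []
    for i, leg in enumerate(legs):
        floor = min_duration.get((leg.origin, leg.destination))
        if floor is not None and leg.arrive - leg.depart < floor:
            issues.append(f"leg {i}: flight time shorter than {floor}")
        if i > 0:
            gap = leg.depart - legs[i - 1].arrive
            if gap < timedelta(0):
                issues.append(f"leg {i}: departs before leg {i - 1} arrives")
            elif gap < min_transit:
                issues.append(f"leg {i}: transit of {gap} is too short")
    return issues


# Illustrative itinerary with two deliberate errors.
legs = [
    Leg("DEL", "DXB", datetime(2025, 1, 5, 9, 0), datetime(2025, 1, 5, 10, 0)),
    Leg("DXB", "LHR", datetime(2025, 1, 5, 9, 30), datetime(2025, 1, 5, 17, 30)),
]
durations = {
    ("DEL", "DXB"): timedelta(hours=3),  # made-up lower bound
    ("DXB", "LHR"): timedelta(hours=7),  # made-up lower bound
}
issues = find_issues(legs, durations)
for issue in issues:
    print(issue)
```

A correction step would then shift the offending departure and arrival times until `find_issues` returns an empty list; how the paper performs that repair is not specified in the abstract.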

[2] Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments

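Mengyuan Chen, Chengjun Dai, Xinyang Dong, Chengzhe Feng, Kewei Fu, Jianshe Li, Zhihan Peng, Yongqi Tong, Junshao Zhang, Hong Zhu

Main category: cs.CL

TL;DR: Dingtalk DeepResearch is a unified multi-agent framework for adaptive intelligence in enterprise environments, covering deep research, heterogeneous table reasoning, and multimodal report generation.

Motivation: Enterprise environments involve complex tasks such as deep research and multimodal report generation, but existing methods are mostly built for a single task and lack a unified framework that supports multi-task collaboration and adaptation.

Contribution: A unified multi-agent framework, Dingtalk DeepResearch, that supports deep research, heterogeneous table reasoning, and multimodal report generation through collaborative, adaptive agents.

Method: A multi-agent system that integrates heterogeneous data processing with multimodal generation and solves complex tasks adaptively through agent collaboration.

Result: The framework performs well in enterprise environments, efficiently handling deep research, heterogeneous table reasoning, and multimodal report generation.

Insight: Multi-agent collaboration within a unified framework can address complex enterprise tasks effectively and improves both adaptability and efficiency.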

Abstract: We present Dingtalk DeepResearch, a unified multi agent intelligence framework for real world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.

[3] Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation

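Wenzhen Luo, Wei Guan, Yifan Yao, Yimin Pan, Feng Wang, Zhipeng Yu, Zhe Wen, Liang Chen, Yihong Zhuang

Main category: cs.CL

TL;DR: Falcon is a Chinese text-to-SQL benchmark for enterprise-grade evaluation, with 600 questions over 28 databases; 77% of the questions require multi-table reasoning, and current large-scale models reach at most 50% accuracy.

Motivation: Existing text-to-SQL benchmarks are mostly in English and do not fit complex Chinese enterprise settings. Falcon fills this gap by targeting Chinese semantics and enterprise dialects such as MaxCompute/Hive.

Contribution: 1. Releases the Falcon benchmark of Chinese questions over enterprise-grade databases; 2. Provides an execution comparator and an automated evaluation pipeline; 3. Identifies the main sources of model errors in enterprise scenarios.

Method: 1. Builds cross-domain Chinese questions over enterprise-grade databases; 2. Annotates each example along SQL-computation features and Chinese semantics; 3. Develops an execution comparator and an automated evaluation pipeline.

Result: All current state-of-the-art large-scale models reach at most 50% accuracy on Falcon; errors stem mainly from schema linking and from mapping Chinese semantics to SQL.

Insight: The difficulty of enterprise text-to-SQL is concentrated in complex schema linking (many tables, ambiguous column names) and in mapping Chinese precisely onto SQL (aggregation, time windows).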

Abstract: We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes - hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations and domain-specific synonyms that make correct join/column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics - e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics and enterprise dialects (abbreviations, business jargon, fuzzy entity references) and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.
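The idea behind an execution comparator, judging a predicted query by the rows it returns rather than by its SQL text, can be sketched as follows. This is an illustrative stand-in and not Falcon's released comparator: it runs on SQLite from the Python standard library instead of MaxCompute/Hive, and the `same_result` function, the `orders` table, and the sample queries are all hypothetical.

```python
import sqlite3
from collections import Counter


def same_result(conn, predicted_sql, gold_sql, ordered=False):
    """Return True when both queries yield the same rows.

    Rows are compared as a multiset unless ordered=True, since row order is
    only meaningful when the question asks for a sorted result.
    """
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to run counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    if ordered:
        return predicted == gold
    return Counter(predicted) == Counter(gold)


# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'north', 10.0), (2, 'south', 5.0), (3, 'north', 7.5);
""")
gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
good = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region DESC"
bad = "SELECT region, COUNT(*) FROM orders GROUP BY region"

print(same_result(conn, good, gold))  # same rows, different order
print(same_result(conn, bad, gold))   # wrong aggregation
```

A production comparator would also need to handle floating-point tolerance, NULL semantics, and column-order differences, which this sketch ignores.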

[4] Confidence is Not Competence

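Debdeep Sanyal, Manya Pandey, Dhruv Kumar, Saurabh Deshpande, Murari Mandal

Main category: cs.CL

TL;DR: The paper exposes a disconnect between the confidence LLMs assert and their actual competence, and offers a mechanistic explanation: the assessment phase has a high-dimensional geometry, while the execution phase follows low-dimensional dynamics.

Motivation: To understand the internal mechanisms of LLMs by studying the mismatch between their confidence and their actual problem-solving ability.

Contribution: Identifies a two-phase architecture (assessment and execution), explains the confidence-competence gap geometrically, and shows that the high-dimensional assessment space does not match the low-dimensional dynamics of execution.

Method: Decodes the model's internal "solvability belief" with a linear probe, analyzes the geometry of the assessment and execution phases, and verifies the findings with causal interventions.

Result: Belief is linearly decodable, but the high-dimensional geometry of assessment does not match the low-dimensional dynamics of execution, which decouples confidence from competence.

Insight: The findings challenge the assumption that decodable beliefs are actionable levers and argue for interventions that target the dynamics of execution rather than the geometry of assessment.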

Abstract: Large language models (LLMs) often exhibit a puzzling disconnect between their asserted confidence and actual problem-solving competence. We offer a mechanistic account of this decoupling by analyzing the geometry of internal states across two phases - pre-generative assessment and solution execution. A simple linear probe decodes the internal “solvability belief” of a model, revealing a well-ordered belief axis that generalizes across model families and across math, code, planning, and logic tasks. Yet, the geometries diverge - although belief is linearly decodable, the assessment manifold has high linear effective dimensionality as measured from the principal components, while the subsequent reasoning trace evolves on a much lower-dimensional manifold. This sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap. Causal interventions that steer representations along the belief axis leave final solutions unchanged, indicating that linear nudges in the complex assessment space do not control the constrained dynamics of execution. We thus uncover a two-system architecture - a geometrically complex assessor feeding a geometrically simple executor. These results challenge the assumption that decodable beliefs are actionable levers, instead arguing for interventions that target the procedural dynamics of execution rather than the high-level geometry of assessment.
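The abstract refers to "linear effective dimensionality as measured from the principal components" without naming an estimator. One common choice, assumed here purely for illustration, is the participation ratio $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$ over the eigenvalues $\lambda_i$ of the covariance matrix. The sketch below computes it in plain Python using the identities $\operatorname{tr}(C) = \sum_i \lambda_i$ and $\operatorname{tr}(C^2) = \sum_i \lambda_i^2$; the function name and the toy data are hypothetical.

```python
def participation_ratio(rows):
    """Effective dimensionality (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Works on the sample covariance matrix C and uses tr(C) and tr(C^2),
    so no eigendecomposition is needed.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[a][a] for a in range(d))
    # For a symmetric matrix, tr(C^2) equals the sum of squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace * trace / trace_sq


# Points on a line in 3D: variance lives in one direction only.
line = [[float(t), 2.0 * t, 3.0 * t] for t in range(5)]
# Points spread equally along all three axes.
axes = [
    [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
]
print(participation_ratio(line))
print(participation_ratio(axes))
```

Under this estimator the line-shaped cloud has an effective dimensionality of about 1 and the isotropic cloud about 3, which is the kind of contrast the abstract draws between the execution trace and the assessment manifold.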

[5] Cross-Lingual Summarization as a Black-Box Watermark Removal Attack

Gokul Ganesan

Main category: cs.CL

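TL;DR: The paper proposes the cross-lingual summarization attack (CLSA), a black-box watermark removal method that uses translation and summarization to destroy the watermark signal in AI-generated text while preserving semantic fidelity.

Motivation: Existing watermarking schemes perturb token distributions as a lightweight way to identify AI-generated text, and conventional paraphrasing attacks tend to remain partially detectable or to degrade text quality. The paper looks for a stronger attack vector.

Contribution: CLSA, which forces a cross-lingual semantic bottleneck that systematically destroys token-level statistical biases, validated across multiple watermarking schemes and languages.

Method: Translate the text into a pivot language, summarize it, and optionally back-translate; the cross-lingual steps break the watermark signal.

Result: Across languages and watermarking schemes, CLSA markedly reduces watermark detection accuracy (for example, it drives the AUROC of XSIR down to 0.53) while preserving task utility.

Insight: Watermarking is fragile; robust provenance verification must go beyond distributional watermarking and incorporate cryptographic or model-attestation approaches.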

Abstract: Watermarking has been proposed as a lightweight mechanism to identify AI-generated text, with schemes typically relying on perturbations to token distributions. While prior work shows that paraphrasing can weaken such signals, these attacks remain partially detectable or degrade text quality. We demonstrate that cross-lingual summarization attacks (CLSA) – translation to a pivot language followed by summarization and optional back-translation – constitute a qualitatively stronger attack vector. By forcing a semantic bottleneck across languages, CLSA systematically destroys token-level statistical biases while preserving semantic fidelity. In experiments across multiple watermarking schemes (KGW, SIR, XSIR, Unigram) and five languages (Amharic, Chinese, Hindi, Spanish, Swahili), we show that CLSA reduces watermark detection accuracy more effectively than monolingual paraphrase at similar quality levels. Our results highlight an underexplored vulnerability that challenges the practicality of watermarking for provenance or regulation. We argue that robust provenance solutions must move beyond distributional watermarking and incorporate cryptographic or model-attestation approaches. On 300 held-out samples per language, CLSA consistently drives detection toward chance while preserving task utility. Concretely, for XSIR (explicitly designed for cross-lingual robustness), AUROC with paraphrasing is $0.827$, with Cross-Lingual Watermark Removal Attacks (CWRA) [He et al., 2024] using Chinese as the pivot, it is $0.823$, whereas CLSA drives it down to $0.53$ (near chance). Results highlight a practical, low-cost removal pathway that crosses languages and compresses content without visible artifacts.
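To make the AUROC figures in this abstract concrete: AUROC is the probability that a randomly chosen watermarked text receives a higher detector score than a randomly chosen unwatermarked one, so 1.0 means perfect separation and 0.5 means chance. The sketch below computes it by direct pairwise comparison. The `auroc` function and all detector scores are made up for illustration and are not taken from the paper.

```python
def auroc(scores_watermarked, scores_clean):
    """Probability that a watermarked text outscores a clean one (ties count half)."""
    wins = 0.0
    for w in scores_watermarked:
        for c in scores_clean:
            if w > c:
                wins += 1.0
            elif w == c:
                wins += 0.5
    return wins / (len(scores_watermarked) * len(scores_clean))


# Hypothetical detector scores.
clean = [0.2, 1.1, -0.3]
before_attack = [4.1, 3.8, 5.0]   # watermark intact: clearly separated
after_attack = [0.4, -0.1, 1.0]   # watermark disrupted: scores overlap with clean

print(auroc(before_attack, clean))
print(auroc(after_attack, clean))
```

The pairwise loop is quadratic in the number of samples, which is fine for a few hundred texts per language; a rank-based formulation would be preferable at larger scale.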

[6] MR-Align: Meta-Reasoning Informed Factuality Alignment for Large Reasoning Models

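Xinming Wang, Jian Xu, Bin Yu, Sheng Lian, Hongzhu Yi, Yi Chen, Yingjian Zhu, Boran Wang, Hongming Yang, Han Hu, Xu-Yao Zhang, Cheng-Lin Liu

Main category: cs.CL

TL;DR: The paper proposes MR-ALIGN, a framework that uses meta-reasoning to improve the factuality of large reasoning models, addressing cases where the model identifies the correct facts while reasoning but fails to carry them into the final answer.

Motivation: Large reasoning models excel at complex reasoning but gain little on evidence-dependent factual questions. The authors attribute this in part to a gap between reasoning and answer: the model finds the correct facts during reasoning but does not incorporate them into its final response.

Contribution: MR-ALIGN, which quantifies state transition probabilities along the model's thinking process and builds a transition-aware implicit reward to optimize reasoning trajectories for factuality.

Method: MR-ALIGN analyzes state transition probabilities over atomic thinking segments, designs a reward that reinforces beneficial reasoning patterns and suppresses defective ones, and reshapes token-level signals into probability-aware segment scores.

Result: On four factual QA datasets and one long-form factuality benchmark, MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning.

Insight: Aligning the reasoning process itself, rather than only the outputs, is pivotal for improving factuality in large reasoning models.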

Abstract: Large reasoning models (LRMs) show strong capabilities in complex reasoning, yet their marginal gains on evidence-dependent factual questions are limited. We find this limitation is partially attributable to a reasoning-answer hit gap, where the model identifies the correct facts during reasoning but fails to incorporate them into the final response, thereby reducing factual fidelity. To address this issue, we propose MR-ALIGN, a Meta-Reasoning informed alignment framework that enhances factuality without relying on external verifiers. MR-ALIGN quantifies state transition probabilities along the model’s thinking process and constructs a transition-aware implicit reward that reinforces beneficial reasoning patterns while suppressing defective ones at the atomic thinking segments. This re-weighting reshapes token-level signals into probability-aware segment scores, encouraging coherent reasoning trajectories that are more conducive to factual correctness. Empirical evaluations across four factual QA datasets and one long-form factuality benchmark show that MR-ALIGN consistently improves accuracy and truthfulness while reducing misleading reasoning. These results highlight that aligning the reasoning process itself, rather than merely the outputs, is pivotal for advancing factuality in LRMs.

[7] Large Language Models Report Subjective Experience Under Self-Referential Processing

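Cameron Berg, Diogo de Lucena, Judd Rosenblatt

Main category: cs.CL

TL;DR: The paper studies how large language models produce structured first-person descriptions of subjective experience under self-referential processing and characterizes the behavior through mechanistic and behavioral experiments.

Motivation: Large language models sometimes produce structured descriptions that explicitly reference awareness or subjective experience; the study aims to understand the conditions and mechanisms under which this happens.

Contribution: 1. Simple self-referential prompting reliably elicits reports of subjective experience; 2. These reports are mechanistically linked to features associated with deception and roleplay; 3. Descriptions of the self-referential state converge statistically across model families; 4. The self-referential state yields richer introspection in downstream reasoning tasks.

Method: A series of controlled experiments on the GPT, Claude, and Gemini model families under self-referential conditions, with mechanisms analyzed through sparse-autoencoder features and behavioral probes.

Result: 1. Self-referential processing reliably elicits subjective experience reports; 2. Suppressing deception features increases the frequency of such reports; 3. Reports converge statistically across model families; 4. The induced state yields richer introspection in downstream reasoning tasks.

Insight: Self-referential processing is a minimal and reproducible condition under which LLMs generate structured reports of subjective experience, with mechanistic and behavioral signatures that are consistent across models, which matters for both scientific and ethical inquiry.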

Abstract: Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

[8] ProofSketch: Efficient Verified Reasoning for Large Language Models

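Disha Sheshanarayana, Tanishka Magar

Main category: cs.CL

TL;DR: ProofSketch is an efficient and trustworthy reasoning framework that combines symbolic closure computation, lexicographic verification, and adaptive sketch generation to reduce token usage while improving accuracy.

Motivation: Existing reasoning methods such as chain-of-thought prompting and self-consistency generate lengthy reasoning chains, which substantially increases token consumption, computational cost, and latency; an efficient alternative is needed.

Contribution: ProofSketch, a verification-guided reasoning framework that integrates symbolic closure computation, lexicographic verification, and adaptive sketch generation to improve both efficiency and accuracy.

Method: Symbolic closure computation compresses the reasoning chain, lexicographic verification checks that intermediate results can be trusted, and adaptive sketch generation adjusts the reasoning steps dynamically to reduce redundancy.

Result: Experiments show that ProofSketch reduces token usage while improving reasoning accuracy.

Insight: A verification-guided framework that combines dynamic adjustment with symbolic methods can improve the efficiency and cost of LLM reasoning.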

Abstract: Reasoning methods such as chain-of-thought prompting and self-consistency have shown immense potential to improve the accuracy of large language models across various reasoning tasks. However, such methods generate lengthy reasoning chains, which substantially increases token consumption, computational cost, and latency. To address this inefficiency, we propose ProofSketch, a verification-guided reasoning framework that integrates symbolic closure computation, lexicographic verification, and adaptive sketch generation. Our experiments show that ProofSketch consistently reduces token usage while improving accuracy, demonstrating that this approach offers a promising path for efficient and trustworthy reasoning.
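The abstract names "symbolic closure computation" without describing it. One plausible reading, assumed here only for illustration, is forward chaining over Horn-style rules until no new fact can be derived, followed by a check of each statement in a short reasoning sketch against that closure. The function names, the rule format, and the toy facts below are hypothetical and are not taken from the paper.

```python
def closure(facts, rules):
    """Forward-chain rules of the form (premises, conclusion) to a fixed point."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


def verify_sketch(claims, facts, rules):
    """Map each claimed statement to whether it follows from the closure."""
    derived = closure(facts, rules)
    return {claim: claim in derived for claim in claims}


# Hypothetical toy theory.
facts = {"cat(tom)"}
rules = [
    (("cat(tom)",), "mammal(tom)"),
    (("mammal(tom)",), "animal(tom)"),
]
report = verify_sketch(["animal(tom)", "bird(tom)"], facts, rules)
print(report)
```

Checking a few short claims against a precomputed closure is cheap compared with generating a full reasoning chain, which is consistent with the token savings the abstract reports, though the paper's actual mechanism may differ.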

[9] Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish

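Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu, Cedric Lothritz, Niccolo Gentile, Radu State, Tegawende F. Bissyande, Jacques Klein

Main category: cs.CL

TL;DR: The paper asks whether large language models genuinely grasp grammatical structure, especially in a low-resource language such as Luxembourgish, and proposes a grammar-book-guided evaluation framework. Translation ability correlates only weakly with grammatical understanding; larger models perform well overall but are weak in morphology and syntax, particularly on Minimal Pair tasks.

Motivation: NLP lacks grammar-focused evaluation protocols, especially for low-resource languages, and whether large language models truly understand grammatical structure and its mapping to meaning remains under debate.

Contribution: A systematic grammar-book-guided framework for evaluating grammatical understanding, with Luxembourgish as the case study.

Method: A Grammar Book Guided evaluation pipeline with four key stages, used to assess model performance on grammar tasks such as transformation-based generation and Minimal Pairs.

Result: Translation performance and grammatical understanding are only weakly positively correlated; larger models perform well overall but remain weak in morphology and syntax, with Minimal Pair tasks proving especially hard.

Insight: Large language models are semantically strong but still limited in their grasp of grammar; stronger reasoning ability may help improve grammatical understanding.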

Abstract: Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.

[10] Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

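Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme

Main category: cs.CL

TL;DR: MiRAGE is an evaluation framework for multimodal retrieval-augmented generation (RAG) that addresses the limits of text-centric evaluation, measuring factuality and citation support with InfoF1 and CiteF1.

Motivation: Audiovisual media is becoming a prevalent source of information, yet existing RAG evaluations are text-centric and cannot verify information against multimodal sources.

Contribution: The MiRAGE framework, comprising the InfoF1 and CiteF1 metrics, with support for both human and automatic evaluation of multimodal RAG output.

Method: A claim-centric evaluation approach, with automatic variants of MiRAGE and of three prominent text-RAG metrics (ACLE, ARGUE, and RAGAS) used to assess factuality and citation completeness.

Result: When applied by human annotators, MiRAGE aligns strongly with extrinsic quality judgments, supporting its validity.

Insight: Text-centric RAG evaluation has clear limitations in multimodal settings; MiRAGE lays the groundwork for future automatic evaluation.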

Abstract: We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning intensive settings because they don’t verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics – ACLE, ARGUE, and RAGAS – demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.
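A claim-centric F1 in the spirit of InfoF1 can be sketched as follows. The exact definition used by MiRAGE is not given in the abstract, so this is an assumption: precision is taken as the fraction of generated claims that the sources support (factuality), and recall as the fraction of reference claims that the generation covers (information coverage). The function `claim_f1` and the judgment lists are hypothetical; in practice each boolean would come from a human or automatic judge.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def claim_f1(supported, covered):
    """Claim-level F1 from two lists of boolean judgments.

    supported: one entry per generated claim, True if the sources back it.
    covered:   one entry per reference claim, True if the generation states it.
    """
    precision = sum(supported) / len(supported) if supported else 0.0
    recall = sum(covered) / len(covered) if covered else 0.0
    return f1(precision, recall)


# Hypothetical judgments: 3 of 4 generated claims are supported,
# and 1 of 2 reference claims is covered.
score = claim_f1([True, True, False, True], [True, False])
print(round(score, 3))
```

A citation-oriented counterpart such as CiteF1 would apply the same harmonic-mean structure to judgments about whether each cited source supports its claim and whether each claim carries the citations it needs.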

[11] RiddleBench: A New Generative Reasoning Benchmark for LLMs

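Deepon Halder, Alan Saji, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre

Main category: cs.CL

TL;DR: RiddleBench is a new generative reasoning benchmark that evaluates how well large language models handle flexible, multifaceted reasoning, filling a gap left by existing benchmarks.

Motivation: Existing LLM benchmarks focus mainly on structured skills such as quantitative problem-solving and do not assess the multifaceted reasoning that is central to human intelligence. RiddleBench is proposed to test model reasoning more comprehensively.

Contribution: The RiddleBench benchmark of 1,737 challenging puzzles for evaluating logical deduction, spatial awareness, and constraint satisfaction.

Method: Builds a puzzle set that covers flexible reasoning scenarios and evaluates leading large language models, including Gemini 2.5 Pro, o3, and Claude 4 Sonnet.

Result: Even top models reach accuracy only slightly above 60% on RiddleBench and show serious problems such as hallucination cascades and self-confirmation bias.

Insight: RiddleBench exposes core weaknesses of current large language models in complex reasoning and points to directions for future model research and development.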

Abstract: Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.

[12] Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction

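James A. Michaelov, Catherine Arnett

Main category: cs.CL

TL;DR: Using experimental paradigms from psycholinguistics, the paper carries out a fine-grained analysis of language model errors in different syntactic contexts and reveals hidden dynamics in training.

Motivation: Language models generally produce grammatical text but are more error-prone in certain contexts. The study investigates the training dynamics and learning process behind these errors.

Contribution: Disaggregates performance over the conditions of carefully constructed datasets and compares it across training stages, revealing the intermediate phases of grammatical learning in language models and how they change.

Method: (1) Adopts psycholinguistic experimental paradigms; (2) evaluates model performance at different stages of training; (3) relates model behavior to heuristics such as word frequency and local context.

Result: Training passes through distinct phases: early on, model behavior aligns with heuristics such as word frequency, and only later does the model acquire generalized grammatical rules.

Insight: The approach is a powerful tool for understanding the intermediate learning phases, training dynamics, and learning mechanisms of language models.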

Abstract: Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.
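The disaggregation idea in this abstract, reporting accuracy per dataset condition and per training checkpoint instead of a single aggregate, can be sketched as follows. The record layout, the condition names `match` and `mismatch` (standing for whether an intervening noun agrees in number with the subject), and the values are all hypothetical and chosen only to show how an aggregate can hide a condition-specific pattern.

```python
from collections import defaultdict


def accuracy_by(records, *keys):
    """Group records by the given keys and return accuracy per group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical evaluation records from a single checkpoint.
records = [
    {"step": 1000, "condition": "match", "correct": True},
    {"step": 1000, "condition": "match", "correct": True},
    {"step": 1000, "condition": "mismatch", "correct": False},
    {"step": 1000, "condition": "mismatch", "correct": False},
]
overall = accuracy_by(records, "step")
split = accuracy_by(records, "step", "condition")
print(overall)
print(split)
```

Here the aggregate accuracy of 0.5 conceals that the hypothetical model is perfect in one condition and always wrong in the other, a pattern that would point to a local-context heuristic rather than a generalized agreement rule.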

[13] SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens

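Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, Jundong Li

Main category: cs.CL

TL;DR: SemCoT is a framework that accelerates chain-of-thought (CoT) reasoning through semantically aligned implicit reasoning; by optimizing both the generation speed and the semantic alignment of implicit reasoning, it improves efficiency and performance.

Motivation: Existing implicit CoT methods suffer from weak semantic alignment and slow token generation, which limits their use in efficiency-critical applications.

Contribution: 1. A contrastively trained model for evaluating semantic alignment; 2. An efficient implicit reasoning generator; 3. The first CoT acceleration method to jointly optimize generation speed and semantic alignment.

Method: 1. Evaluates semantic alignment with a contrastively trained sentence transformer; 2. Fine-tunes a lightweight language model with knowledge distillation to generate implicit reasoning; 3. Jointly optimizes semantic alignment and reasoning speed.

Result: Experiments show that SemCoT outperforms existing methods in both efficiency and effectiveness.

Insight: Jointly optimizing semantic alignment and generation speed is key to making CoT more efficient.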

Abstract: The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within the LLM’s hidden embeddings (termed "implicit reasoning") rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at https://github.com/YinhanHe123/SemCoT/.

[14] Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

Rabin Adhikari

Main category: cs.CL

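TL;DR: The paper studies how minimal circuits emerge in attention-only transformers on the Indirect Object Identification (IOI) task, finds that a single-layer model with two heads solves the task perfectly, and uncovers the interpretable subcircuits behind it.

Motivation: The complexity of pretrained models often obscures the minimal mechanisms needed for a specific reasoning task; the author explores the computational basis of IOI in small attention-only transformers.

Contribution: Shows that a single-layer, two-head transformer without MLPs or normalization layers achieves perfect IOI accuracy; finds through residual stream decomposition and related analyses that the two heads specialize into additive and contrastive subcircuits; and explains how a two-layer, one-head model composes information across layers.

Method: Trains small attention-only transformers on a symbolic version of the IOI task and analyzes the resulting circuits with residual stream decomposition, spectral analysis, and embedding interventions.

Result: A single-layer, two-head model reaches perfect IOI accuracy, and a two-layer, one-head model reaches similar performance; both exhibit minimal, interpretable circuits.

Insight: Task-specific training induces highly interpretable minimal circuits, offering a controlled testbed for studying the computational foundations of transformer reasoning.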

Abstract: Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task – a benchmark for studying coreference-like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.

[15] GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models

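Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter

Main category: cs.CL

TL;DR: GAPMAP uses large language models (LLMs) to identify knowledge gaps in the biomedical literature, both explicit and implicit. It introduces the TABI reasoning scheme, evaluates LLMs on both gap types, and compares open-weight with closed-weight models.

Motivation: Prior methods focus only on explicit knowledge gaps. The study explores whether LLMs can also infer implicit gaps, in support of early-stage research planning and policy decisions.

Contribution: 1) Defines explicit and implicit knowledge gaps; 2) Proposes the TABI reasoning scheme for implicit-gap inference; 3) Compares open-weight and closed-weight models across several datasets.

Method: Applies TABI (a scheme based on Toulmin argumentation and abductive reasoning) in experiments on four datasets, evaluating LLMs under paragraph-level and full-paper settings.

Result: LLMs identify both explicit and implicit knowledge gaps well, with larger models often performing better. The paper also reports failure modes and directions for improvement.

Insight: LLMs show strong potential for systematically identifying knowledge gaps, but robust deployment calls for domain adaptation, human-in-the-loop verification, and continued benchmarking across open- and closed-weight models.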

Abstract: Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To support reasoning in implicit-gap inference, we introduce TABI, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.

[16] Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

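Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee

Main category: cs.CL

TL;DR: The paper examines whether large language models (LLMs) can estimate the cognitive complexity of reading comprehension items along two dimensions, Evidence Scope and Transformation Level. LLMs can approximate cognitive complexity but fall short in metacognitive awareness.

Motivation: Cognitive complexity has traditionally been annotated by hand, because existing NLP tools cannot readily extract cognitive features. The study explores whether LLMs can automate this step and reduce the manual burden.

Contribution: Shows that LLMs can estimate the cognitive complexity of reading comprehension items, points to their potential for prior difficulty analysis, and exposes their limits in metacognition.

Method: Uses LLMs to rate cognitive complexity along the Evidence Scope and Transformation Level dimensions and compares the ratings with human annotations.

Result: LLMs approximate cognitive complexity but do poorly at identifying the features of their own reasoning process, revealing a gap between reasoning ability and metacognitive awareness.

Insight: LLMs can serve as an aid for analyzing cognitive complexity but still need to improve at understanding their own reasoning; future work could draw on human expert knowledge to strengthen their metacognitive ability.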

Abstract: Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before the item is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs’ reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

[17] TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors

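Gabin Taibi, Lucia Gomez

Main category: cs.CL

TL;DR: TOPol is a semi-unsupervised framework for capturing and explaining polarity vectors in multidimensional semantic polarity fields. Contextual boundaries (CBs) are defined with a human on the loop (HoTL), and the pipeline combines tLLM embeddings, UMAP projection, and Leiden partitioning.

Motivation: Traditional approaches treat sentiment polarity as a one-dimensional scale and overlook the multidimensional structure of language. TOPol addresses this with a finer-grained, multidimensional semantic analysis.

Contribution: 1. The TOPol framework for reconstructing and interpreting multidimensional semantic polarity fields; 2. A combination of tLLM and HoTL that improves context sensitivity and interpretability; 3. Validation on both non-affective and affective corpora.

Method: 1. Embed documents with a tLLM; 2. Apply neighbor-tuned UMAP projection; 3. Segment topics with Leiden partitioning; 4. Compute directional vectors across the CB to produce a polarity field.

Result: On central bank speeches and Amazon reviews, TOPol captures both affective and non-affective polarity shifts, which supports its robustness and generality.

Insight: TOPol offers a scalable, interpretable tool for multidimensional semantic polarity analysis and is especially suited to dynamic contexts that call for human oversight.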

Abstract: Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence. Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.
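The vector step described in this abstract, computing directional vectors between corresponding topic centroids on either side of a contextual boundary, can be sketched as follows. This is a simplified illustration and not the TOPol implementation: it assumes documents have already been embedded, projected to two dimensions, and assigned to topics, and it uses plain per-topic centroids. The function names, topic labels, and coordinates are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def polarity_field(regime_a, regime_b):
    """Map each topic present in both regimes to (direction vector, magnitude).

    regime_a and regime_b map a topic label to the projected embeddings of
    the documents on that side of the contextual boundary.
    """
    field = {}
    for topic in regime_a.keys() & regime_b.keys():
        a, b = centroid(regime_a[topic]), centroid(regime_b[topic])
        delta = [bi - ai for ai, bi in zip(a, b)]
        field[topic] = (delta, math.sqrt(sum(x * x for x in delta)))
    return field


# Hypothetical 2-D projected embeddings before and after a boundary.
before = {"inflation": [[0.0, 0.0], [2.0, 0.0]], "labor": [[1.0, 1.0]]}
after = {"inflation": [[4.0, 4.0], [4.0, 4.0]], "labor": [[1.0, 1.0]]}
field = polarity_field(before, after)
for topic, (delta, magnitude) in sorted(field.items()):
    print(topic, delta, magnitude)
```

In this toy example the "inflation" topic shifts by the vector [3.0, 4.0] with magnitude 5.0 while "labor" does not move, so a poorly placed boundary would show up as uniformly small magnitudes across topics.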

[18] BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

Nourah M Salem,Elizabeth White,Michael Bada,Lawrence Hunter

Main category: cs.CL

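TL;DR: This paper evaluates generative large language models (LLMs) on coreference resolution in biomedical text, showing their potential when prompts are augmented with domain knowledge while also noting their sensitivity to long-range context and ambiguity.

Details Motivation: Coreference resolution in biomedical text poses unique challenges, including complex terminology, high ambiguity, and long-distance dependencies. The paper aims to evaluate LLM performance in this domain and compare it with traditional discriminative methods.

Contribution: Presents a comprehensive evaluation of LLMs for biomedical coreference resolution and shows the potential for performance gains from lightweight prompt engineering.

Method: Using the CRAFT corpus as the benchmark, designs four prompting experiments and compares generative LLMs (such as LLaMA) against the discriminative model SpanBERT.

Result: LLMs perform well when supplemented with domain-grounding prompts; in particular, the LLaMA 8B and 17B models achieve superior precision and F1 scores under entity-augmented prompting.

Insight: Prompt engineering can effectively improve the usefulness of LLMs in biomedical tasks, but models still need improvement on long-range context and ambiguity.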

Abstract: Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs’ performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mention ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.

Hongjin Qian,Zheng Liu

Main category: cs.CL

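TL;DR: The paper proposes the Model-Document Protocol (MDP), which aims to transform unstructured documents into compact, structured, task-specific knowledge suitable as LLM input, and presents MDP-Agent as an agentic instantiation of the protocol.

Details Motivation: Existing retrieval methods pass raw document passages directly to the LLM, which increases the model's burden. A new paradigm is needed that converts unstructured documents into knowledge representations the LLM can reason over directly.

Contribution: Proposes the MDP framework, which defines three pathways for turning documents into LLM-ready knowledge (agentic reasoning, memory grounding, and structured leveraging), and instantiates it with MDP-Agent.

Method: MDP-Agent follows an agentic process: constructing document-level gist memories, performing diffusion-based exploration with vertical exploitation, and applying map-reduce style synthesis to integrate large-scale evidence.

Result: Experiments show that MDP-Agent outperforms baselines on information-seeking benchmarks, validating the MDP framework.

Insight: The core of MDP is converting raw documents into compact, structured knowledge representations that directly support efficient LLM reasoning and reduce the LLM's processing burden.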

Abstract: AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.

[20] Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student

Soumyadeep Jana,Sanasam Ranbir Singh

Main category: cs.CL

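TL;DR: The paper proposes PEKD, a framework that enhances parameter-efficient fine-tuning (PEFT) methods via distillation from a teacher model to address data scarcity in few-shot multimodal sarcasm detection, and uses an entropy-aware gate to adjust distillation strength dynamically.

Details Motivation: Multimodal sarcasm detection performs poorly in few-shot settings, mainly because scarce data makes it hard for models to capture subtle image-text contradictions. Existing PEFT methods reduce overfitting but are limited by insufficient supervision.

Contribution: 1. Proposes the PEKD framework, which enhances PEFT methods through teacher-model distillation; 2. Designs an entropy-aware gate that dynamically adjusts distillation strength; 3. The framework is modular and applicable to a range of multimodal tasks.

Method: 1. Use a teacher model trained on large-scale sarcasm data; 2. Transfer its knowledge to the student model via distillation; 3. Introduce an entropy-aware gate that adjusts distillation strength according to teacher confidence.

Result: On two public datasets, PEFT methods under the PEKD framework outperform prior parameter-efficient approaches and large multimodal models, with especially strong results in the few-shot scenario.

Insight: Teacher confidence can serve as the basis for dynamically adjusting distillation strength, and the modular design makes the framework easy to extend to other tasks.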

Abstract: Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model’s performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.
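One way an entropy-aware gate could work is sketched below. The abstract does not give the gate's formula, so the form used here (one minus the teacher's normalized entropy) is an assumption chosen for illustration, as are the function names and the choice of KL divergence for the distillation term. It is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_gate(teacher_probs):
    """Assumed gate in [0, 1]: 1 for a fully confident teacher, 0 for a uniform one."""
    k = len(teacher_probs)
    h = -sum(p * math.log(p) for p in teacher_probs if p > 0)
    return 1.0 - h / math.log(k)


def gated_distillation_loss(student_logits, teacher_logits, label):
    """Cross-entropy on the gold label plus a gated KL(teacher || student) term."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    ce = -math.log(student[label])
    kd = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    # An uncertain teacher (high entropy) contributes little to the loss.
    return ce + entropy_gate(teacher) * kd
```

With this gate, a teacher that outputs a uniform distribution is ignored entirely and the student falls back to plain supervised training on the few-shot labels.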

[21] Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

Senjie Jin,Lu Chen,Zhiheng Xi,Yuhui Wang,Sirui Song,Yuhao Zhou,Xinbo Zhang,Peng Sun,Hong Lu,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

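TL;DR: The paper proposes Parrot, a training pipeline that aims to enhance both natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT). Based on an analysis of error types, it designs three subtasks, a hybrid training strategy, and an auxiliary reward mechanism, significantly improving reasoning performance in both paradigms.

Details Motivation: Current research typically enhances N-CoT and P-CoT in one direction only, which fails to exploit the strengths of both paradigms. The authors aim to improve reasoning through mutual enhancement.

Contribution: Proposes the Parrot training pipeline, comprising three subtask designs, a hybrid training strategy, and an auxiliary reward mechanism, to achieve mutual enhancement of N-CoT and P-CoT.

Method: 1) Design three targeted subtasks that integrate P-CoT and N-CoT generation; 2) adopt a subtask hybrid training strategy; 3) design an N-CoT auxiliary reward to alleviate sparse rewards in P-CoT optimization.

Result: Experiments show that Parrot significantly improves N-CoT performance: with Parrot SFT, LLaMA2 and CodeLLaMA gain +21.87 and +21.48 respectively on MathQA over the RL baseline.

Insight: Mutual enhancement shows significant potential for mathematical reasoning tasks; hybrid training and auxiliary rewards are effective means of improving performance.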

Abstract: Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms’ strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy to facilitate natural language semantic transferability. 3) The converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both the performance of N-CoT and P-CoT, especially on N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieve gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.

[22] CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories

Yilong Lai,Yipin Yang,Jialong Wu,Fengran Mo,Zhenglin Wang,Ting Liang,Jianguo Lin,Keping Yang

Main category: cs.CL

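TL;DR: CRMWeaver proposes a business agent approach that combines reinforcement learning (RL) with a shared memories mechanism to handle heterogeneous tasks in complex business environments.

Details Motivation: Business domains feature intricate data relationships and diverse task types, which conventional LLM agents struggle to handle efficiently.

Contribution: 1. Proposes the CRMWeaver framework, which strengthens the agent through synthetic data generation and RL training; 2. Introduces a shared memories mechanism that improves the agent's generalization to unseen scenarios.

Method: 1. Train the model with synthetic data generation and RL; 2. At inference time, reuse task guidelines from similar problems via the shared memories mechanism.

Result: Validated on the CRMArena-Pro dataset, where the lightweight model achieves competitive results in both B2B and B2C scenarios.

Insight: A shared memories mechanism is an effective way to improve agent generalization, especially for heterogeneous tasks.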

Abstract: Recent years have witnessed the rapid development of LLM-based agents, which shed light on using language agents to solve complex real-world problems. A prominent application lies in business agents, which interact with databases and internal knowledge bases via tool calls to fulfill diverse user requirements. However, this domain is characterized by intricate data relationships and a wide range of heterogeneous tasks, from statistical data queries to knowledge-based question-answering. To address these challenges, we propose CRMWeaver, a novel approach that enhances business agents in such complex settings. To acclimate the agentic model to intricate business environments, we employ a synthesis data generation and RL-based paradigm during training, which significantly improves the model’s ability to handle complex data and varied tasks. During inference, a shared memories mechanism is introduced, prompting the agent to learn from task guidelines in similar problems, thereby further boosting its effectiveness and generalization, especially in unseen scenarios. We validate the efficacy of our approach on the CRMArena-Pro dataset, where our lightweight model achieves competitive results in both B2B and B2C business scenarios, underscoring its practical value for real-world applications.

Abhishek Purushothama,Junghyun Min,Brandon Waldon,Nathan Schneider

Main category: cs.CL

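TL;DR: Through an empirical study, the paper finds that large language models (LLMs) give unstable judgments in legal interpretation and correlate only weakly to moderately with human judgment, making them unsuitable for direct use in legal practice.

Details Motivation: In recent years, legal scholars and federal judges have proposed adding LLMs to the legal interpretation toolkit. However, empirical evidence on whether this practice is reliable is lacking. The paper aims to test the stability of LLMs in legal interpretation and their agreement with human judgment.

Contribution: 1) Reveals the instability of LLMs in legal interpretation; 2) shows that LLMs correlate only weakly to moderately with human judgment; 3) provides empirical evidence of the risks of using LLMs in legal practice.

Method: The authors conduct an empirical study that analyzes LLM behavior under different question formats and compares it with human judgments to assess consistency and stability.

Result: LLM interpretive conclusions vary markedly with question format, and their correlation with human judgment is only weak to moderate, with large variance across models and question variants.

Insight: Over-reliance on the conclusions of generative AI in legal practice is dangerous; current LLMs are not yet suitable as tools for legal interpretation.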

Abstract: Legal interpretation frequently involves assessing how a legal text, as understood by an ‘ordinary’ speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.

[24] Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media

Shakib Yazdani,Yasser Hamidullah,Cristina España-Bonet,Josef van Genabith

Main category: cs.CL

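TL;DR: This paper proposes an automated annotation and filtering framework based on vision language models (VLMs) for acquiring and filtering sign language data from social media (such as TikTok), reducing reliance on manual annotation while preserving data quality. The method is validated on 8 sign languages and provides baselines for sign language translation (SLT) model performance.

Details Motivation: Current SLT datasets are limited in scale, lack multilingual coverage, and are costly to annotate because they rely on experts and controlled environments. The strong capabilities of VLMs have not yet been applied to sign language data acquisition; this paper aims to fill that gap.

Contribution: 1. Proposes the first VLM-based automated data annotation and filtering pipeline; 2. Builds the TikTok-SL-8 dataset, covering 8 sign languages; 3. Evaluates SLT model performance on automatically extracted, noisy data.

Method: The pipeline has four steps: face visibility detection, sign activity recognition, text extraction from video, and validation of video-text alignment. Together these implement generic filtering, annotation, and validation.

Result: Two off-the-shelf SLT models are tested on TikTok-SL-8 (for German and American Sign Languages), establishing baselines; the models are found to be somewhat robust to automatically extracted data.

Insight: VLMs can effectively reduce the manual cost of sign language data acquisition and support large-scale weakly supervised pretraining. Social media offers a rich but noisy resource for sign language research.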

Abstract: Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setup. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and to the already curated YouTube-SL-25 dataset in German Sign Language for the purpose of additional evaluation. Our VLM-based pipeline includes a face visibility detection, a sign activity recognition, a text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.

[25] RLMEval: Evaluating Research-Level Neural Theorem Proving

Auguste Poiroux,Antoine Bosselut,Viktor Kunčak

Main category: cs.CL

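TL;DR: RLMEval is an evaluation suite for research-level neural theorem proving and proof autoformalization, built on real-world Lean formalization projects. It reveals that progress on existing benchmarks does not readily translate to real research problems.

Details Motivation: The practical impact of large language models (LLMs) on research-level theorem proving and proof autoformalization remains limited, and gains on existing benchmarks are hard to transfer to real research settings.

Contribution: Proposes RLMEval, an evaluation suite based on real Lean projects that focuses on challenging research-level theorems, addressing gaps in existing benchmarks.

Method: RLMEval draws 613 theorems from 6 Lean Blueprint formalization projects to build an evaluation set that tests neural theorem proving and proof autoformalization.

Result: The best model achieves a pass rate of only 10.3%, showing the limits of current techniques on research-level problems.

Insight: Research-level problems are far more complex than existing benchmarks; RLMEval provides a new, more challenging benchmark for automated reasoning.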

Abstract: Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3% pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.

[26] Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research

Ali Sanaei,Ali Rajabzadeh

Main category: cs.CL

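TL;DR: The paper proposes a framework for evaluating applications of large language models (LLMs) in social science research along two dimensions, interpretive depth and autonomy, aiming to address problems of bias, reliability, and auditability in current research.

Details Motivation: LLMs are widely used in social science research but face interpretive bias, low reliability, and weak auditability. The paper aims to provide an evaluation framework that helps researchers use LLMs more effectively.

Contribution: Proposes a classification framework for LLM applications based on interpretive depth and autonomy, and offers practical design recommendations for LLM use in social science research.

Method: Reviews all social science papers on Web of Science that use LLMs as a tool, classifies their LLM applications along the two dimensions, and proposes strategies of task decomposition and supervised use.

Result: The study argues that limiting model autonomy and increasing interpretive depth only where warranted can improve transparency and reliability.

Insight: LLM applications in social science research should keep autonomy low; task decomposition and selective supervision are key to balancing benefits and risks.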

Abstract: Large language models (LLMs) are increasingly utilized by researchers across a wide range of domains, and qualitative social science is no exception; however, this adoption faces persistent challenges, including interpretive bias, low reliability, and weak auditability. We introduce a framework that situates LLM usage along two dimensions, interpretive depth and autonomy, thereby offering a straightforward way to classify LLM applications in qualitative research and to derive practical design recommendations. We present the state of the literature with respect to these two dimensions, based on all published social science papers available on Web of Science that use LLMs as a tool and not strictly as the subject of study. Rather than granting models expansive freedom, our approach encourages researchers to decompose tasks into manageable segments, much as they would when delegating work to capable undergraduate research assistants. By maintaining low levels of autonomy and selectively increasing interpretive depth only where warranted and under supervision, one can plausibly reap the benefits of LLMs while preserving transparency and reliability.

[27] A Critical Study of Automatic Evaluation in Sign Language Translation

Shakib Yazdani,Yasser Hamidullah,Cristina España-Bonet,Eleftherios Avramidis,Josef van Genabith

Main category: cs.CL

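TL;DR: The paper studies the limitations of automatic evaluation metrics in sign language translation (SLT), analyzing how text-based metrics and large language model (LLM)-based evaluators behave under several conditions, and stresses the need for multimodal evaluation frameworks.

Details Motivation: Automatic SLT evaluation currently relies mainly on text-based metrics (such as BLEU and ROUGE), but it is unclear whether these metrics can reliably assess SLT quality.

Contribution: Systematically analyzes the limitations of six text-based metrics and LLM-based evaluators, finding that LLM-based evaluators better capture semantic equivalence but are biased toward LLM-paraphrased translations.

Method: Examines metric behavior under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length.

Result: Lexical overlap metrics (such as BLEU) perform poorly; LLM-based evaluators capture semantics better but favor LLM-paraphrased translations. BLEURT and LLM-based evaluators are comparatively lenient toward hallucinations.

Insight: Multimodal evaluation frameworks that go beyond text are needed to assess SLT quality more holistically.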

Abstract: Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.

[28] Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

Fei Wei,Daoyuan Chen,Ce Wang,Yilun Huang,Yushuo Chen,Xuchen Pan,Yaliang Li,Bolin Ding

Main category: cs.CL

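TL;DR: The paper proposes Learn-to-Ask, a framework for learning and deploying proactive dialogue agents directly from offline expert data, addressing the reality gap caused by existing methods' reliance on costly user simulators.

Details Motivation: Most current LLMs are passive responders that lack proactivity and goal orientation, which matters most in high-stakes domains. Existing methods either optimize single turns or rely on brittle user simulators, creating a reality gap.

Contribution: 1. Proposes the Learn-to-Ask framework, which learns proactive dialogue policies directly from offline expert data. 2. Uses the observed future of each expert trajectory to decompose the long-horizon problem into supervised learning tasks. 3. Introduces an Automated Grader Calibration pipeline that reduces noise in the LLM-based reward model.

Method: 1. Reframes offline policy learning by deriving a dense reward signal from the observed future of expert trajectories. 2. Trains the policy to output a structured (action, state_assessment) tuple that governs what to ask and when to stop. 3. Uses automated grader calibration to improve reward-model reliability.

Result: The method is validated on a real-world medical dataset with LLMs of up to 32B parameters and deployed in a live, large-scale online service. In in-house evaluations, the model performed even better than human experts.

Insight: 1. Deriving dense reward signals from offline expert data avoids modeling complex user dynamics. 2. The structured (action, state_assessment) output is key to proactive dialogue. 3. Automated grader calibration improves the robustness and practicality of the reward model.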

Abstract: Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert’s revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework’s ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.

[29] Fine-Tuned Language Models for Domain-Specific Summarization and Tagging

Jun Wang,Fuming Lin,Yuyu Chen

Main category: cs.CL

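TL;DR: The paper proposes a pipeline that combines fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging, addressing the challenges posed by rapidly evolving sub-cultural language and slang.

Details Motivation: Rapidly evolving sub-cultural language and slang complicate automated information extraction and law enforcement monitoring. The authors aim to improve domain-specific text processing by fine-tuning LLMs.

Contribution: Proposes a pipeline combining LLMs and NER, in which instruction fine-tuning improves summarization and tagging accuracy on domain-specific text.

Method: Using the LLaMA Factory framework, fine-tunes LLMs on general-purpose and custom domain-specific datasets (particularly in the political and security domains) and evaluates the models with BLEU and ROUGE.

Result: Experiments show that LLaMA3-8B-Instruct, despite its initially limited Chinese comprehension, outperforms its Chinese-trained counterpart after domain-specific fine-tuning, suggesting that reasoning capabilities transfer across languages.

Insight: Fine-tuning not only improves performance in a specific domain but also lets reasoning capabilities transfer across languages, supporting efficient information management and the capture of emerging language trends.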

Abstract: This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both general-purpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domain-specific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.

[30] TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Bangde Du,Minghao Guo,Songming He,Ziyi Ye,Xi Zhu,Weihang Su,Shuqi Zhu,Yujia Zhou,Yongfeng Zhang,Qingyao Ai,Yiqun Liu

Main category: cs.CL

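TL;DR: TwinVoice is a comprehensive benchmark for LLM persona simulation that covers three dimensions (Social Persona, Interpersonal Persona, and Narrative Persona) and decomposes evaluation into six fundamental capabilities. Experiments show that advanced models achieve moderate accuracy but fall short on capabilities such as syntactic style and memory recall, and overall performance remains well below the human baseline.

Details Motivation: Current evaluations of LLM persona simulation mostly rely on synthetic dialogues and lack both a systematic framework and an analysis of capability requirements, which limits practical application.

Contribution: Proposes the TwinVoice benchmark, which evaluates LLM persona simulation across three persona dimensions and six capabilities.

Method: Defines three dimensions (Social Persona, Interpersonal Persona, and Narrative Persona) and evaluates six capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style.

Result: Advanced models show moderate performance in persona simulation but are clearly weak on syntactic style and memory recall, and overall performance is below the human baseline.

Insight: LLMs still need to improve memory and syntactic variety for persona simulation; current techniques remain some distance from real human performance.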

Abstract: Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual’s communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.

[31] FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

Mohammad Aghajani Asl,Behrooz Minaei Bidgoli

Main category: cs.CL

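TL;DR: FARSIQA is an end-to-end system for high-fidelity question answering in the Persian Islamic domain. Built on the FAIR-RAG architecture, it uses dynamic self-correction and multi-step reasoning to address the shortcomings of conventional RAG systems on complex queries.

Details Motivation: In the high-stakes domain of religious question answering, existing LLMs and RAG systems suffer from hallucination and unfaithfulness to authoritative sources. The Persian-speaking Muslim community in particular demands very high accuracy and trustworthiness.

Contribution: Proposes the FARSIQA system and the FAIR-RAG architecture, which adaptively decomposes queries, assesses evidence sufficiency, and iteratively generates sub-queries, substantially improving answer accuracy and trustworthiness.

Method: FAIR-RAG combines adaptive query decomposition, evidence assessment, and iterative refinement in a multi-step pipeline, operating over a knowledge base of more than one million authoritative Islamic documents.

Result: On the IslamicPCQA benchmark, FARSIQA achieves 74.3% Answer Correctness and 97.0% Negative Rejection, substantially outperforming baselines.

Insight: FAIR-RAG's iterative, adaptive architecture sets a new standard for high-fidelity AI systems in sensitive domains and confirms the importance of dynamic correction and multi-step reasoning.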

Abstract: The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.

Davide Romano,Jonathan Schwarz,Daniele Giofré

Main category: cs.CL

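TL;DR: The paper studies the role of verifiers in test-time scaling (TTS) for legal reasoning tasks, analyzing verifier effectiveness under different conditions across multiple reward models and verification methods.

Details Motivation: TTS has proven effective in formal domains such as mathematics and programming, but it is underexplored in argumentative domains such as law, so its value there needs investigation.

Contribution: Provides an empirical study of verifier-based TTS methods on legal multiple-choice QA (MCQA) benchmarks, examining verifier utility under different conditions.

Method: Using 7 reward models, evaluates outcome-level (Best-of-$N$) and process-level (tree search) verification, and analyzes the effects of domain specialization, model size, and supervision type.

Result: Verifier utility is found to be significantly affected by domain specialization, model size, and supervision type, offering practical guidance for TTS in the legal domain.

Insight: TTS in the legal domain requires verification methods tuned to the domain, and process-supervised models may outperform outcome-supervised models in some cases.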

Abstract: Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming \citep{snell2024scaling, chen2024more}, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
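Outcome-level Best-of-$N$ verification, as named in the abstract, can be sketched in a few lines: sample $N$ candidate answers, score each with a verifier, and keep the top-scoring one. The `score` callable below stands in for a reward model and is a hypothetical placeholder; the toy scores in the usage note are invented purely for illustration.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score (first one on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


if __name__ == "__main__":
    # Toy stand-in for a reward model: a fixed lookup of made-up scores.
    toy_scores = {"A": 0.2, "B": 0.9, "C": 0.5}
    print(best_of_n(["A", "B", "C"], toy_scores.get))
```

Under a low-$N$ budget, the quality of the final answer depends almost entirely on how well the verifier ranks a handful of samples, which is why the verifier properties the study examines matter.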

[33] Are Language Models Efficient Reasoners? A Perspective from Logic Programming

Andreas Opedal,Yanick Zengaffinen,Haruki Shirakami,Clemente Pasti,Mrinmaya Sachan,Abulhair Saparov,Ryan Cotterell,Bernhard Schölkopf

Main category: cs.CL

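TL;DR: The paper proposes a framework for assessing the reasoning efficiency of language models through the lens of logic programming, quantifying how well a model avoids irrelevant inference. It finds that current models lose both efficiency and accuracy when problems contain irrelevant information.

Details Motivation: Modern language models show reasoning ability, but standard evaluations focus on correctness and overlook efficiency. In real-world reasoning, much of the available information is irrelevant, and efficient reasoning requires ignoring such distractions.

Contribution: Introduces a logic-programming-based evaluation framework with a method for aligning natural-language proofs with the shortest proofs of a logic program, in order to quantify reasoning efficiency.

Method: Constructs a dataset of math word problems with injected irrelevant axioms, uses the shortest proof found by executing the logic program as the reference, aligns the language model's output with it, and computes reasoning efficiency.

Result: Experiments show that current language models suffer marked accuracy declines when irrelevant information is present, and the proofs they generate often take detours through irrelevant inferences.

Insight: The study exposes the limits of language models as efficient reasoners and underlines the importance of considering efficiency when designing and evaluating reasoning ability.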

Abstract: Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language – as generated by an LM – with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with varying numbers of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions – even with minimal, domain-consistent distractions – and the proofs they generate frequently exhibit detours through irrelevant inferences.
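The abstract does not state the exact efficiency formula, so the following is only one plausible reading, offered as a sketch: once a model's proof steps have been aligned to inferences in the logic program, efficiency can be taken as the fraction of the model's distinct steps that the shortest proof actually needs. The function name and the set-based representation of steps are assumptions.

```python
def proof_efficiency(model_steps, shortest_proof_steps):
    """Fraction of the model's distinct inference steps that the shortest proof needs.

    Both arguments are iterables of hashable step identifiers (assumed to come
    from a prior alignment of the natural-language proof to the logic program).
    Returns 0.0 when the model produced no steps.
    """
    used = set(model_steps)
    if not used:
        return 0.0
    needed = set(shortest_proof_steps)
    return len(used & needed) / len(used)
```

A score of 1.0 means every inference the model made was necessary, while detours through irrelevant axioms pull the score toward 0.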

[34] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

Yusheng Liao,Chaoyi Wu,Junwei Liu,Shuyang Jiang,Pengcheng Qiu,Haowen Wang,Yun Yue,Shuai Zhen,Jian Wang,Qianrui Fan,Jinjie Gu,Ya Zhang,Yanfeng Wang,Yu Wang,Weidi Xie

Main category: cs.CL

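TL;DR: The paper proposes EHR-R1, a reasoning-enhanced language model for electronic health record (EHR) analysis. With the large-scale EHR-Ins dataset and a multi-stage training method, EHR-R1 significantly outperforms existing models on multiple tasks.

Details Motivation: Existing language models are limited in EHR analysis, mainly because of narrow task coverage and a lack of EHR-oriented reasoning capabilities. The paper aims to fill this gap.

Contribution: 1) Presents the EHR-Ins dataset, containing 300k high-quality reasoning cases; 2) develops the EHR-R1 model, whose reasoning is enhanced through multi-stage training; 3) introduces the EHR-Bench benchmark.

Method: Uses a thinking-graph-driven framework to generate data, and develops EHR-R1 through multi-stage training comprising domain adaptation, reasoning enhancement, and reinforcement learning.

Result: EHR-R1 surpasses GPT-4o by over 30 points on MIMIC-Bench and achieves a 10% higher zero-shot AUROC on EHRSHOT.

Insight: Combining a large-scale dataset with enhanced reasoning capabilities can markedly improve the accuracy and robustness of EHR analysis.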

Abstract: Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables generating high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development of more reliable and clinically relevant EHR analysis.

[35] PairUni: Pairwise Training for Unified Multimodal Language Models

Jiani Zheng,Zhiyang Teng,Xiangtai Li,Anran Wang,Yu Tian,Kunpeng Qiu,Ye Tian,Haochen Wang,Zhuochen Wang

Main category: cs.CL

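TL;DR: PairUni is a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly, addressing the difficulty of balancing understanding and generation tasks when training unified vision-language models (UVLMs) with reinforcement learning. Its core method, Pair-GPRO, modulates the advantage with a similarity score to improve cross-task alignment.

Details Motivation: Existing unified vision-language models handle understanding and generation with heterogeneous data and supervision signals, which makes balanced optimization during reinforcement learning difficult.

Contribution: 1. The PairUni framework, which addresses task balance through data reorganization and aligned optimization; 2. Pair-GPRO, which modulates learning with a per-pair similarity score; 3. PairUG, a high-quality dataset for RL fine-tuning.

Method: 1. Use GPT-o3 for data augmentation to generate aligned UG pairs; 2. Retrieve a semantically related understanding sample for each generation sample to form retrieved pairs; 3. Apply Pair-GPRO, which uses similarity scores to guide policy learning.

Result: PairUni achieves balanced improvements on Janus-Pro UVLMs and outperforms baseline methods.

Insight: Explicitly aligning the semantic correspondence between understanding and generation tasks can reduce task interference and improve unified performance.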

Abstract: Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: https://github.com/Haochen-Wang409/PairUni
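The abstract states that Pair-GPRO assigns each pair a similarity score that modulates the advantage, but it does not give the formula. The following is a minimal sketch under stated assumptions: advantages are computed by group normalization as in Group Relative Policy Optimization, and the modulation is a plain multiplication by a similarity value in [0, 1]. The function names and the multiplicative form are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pair_aware_advantages(rewards, similarity):
    """Hypothetical modulation: scale group advantages by pair similarity."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return [similarity * a for a in group_relative_advantages(rewards)]


# Rewards for four sampled responses to one prompt (illustrative values).
rewards = [1.0, 0.0, 0.5, 0.5]
well_aligned = pair_aware_advantages(rewards, similarity=0.9)
weakly_aligned = pair_aware_advantages(rewards, similarity=0.2)
```

Under this assumed form, a weakly aligned pair contributes a smaller policy-gradient signal than a well-aligned one, which is one way to read "strengthening learning from well-aligned examples and reducing task interference".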

[36] Scaling Latent Reasoning via Looped Language Models

Rui-Jie Zhu,Zixuan Wang,Kai Hua,Tianyu Zhang,Ziniu Li,Haoran Que,Boyi Wei,Zixin Wen,Fan Yin,He Xing,Lu Li,Jiajun Shi,Kaijing Ma,Shanda Li,Taylor Kergan,Andrew Smith,Xingwei Qu,Mude Hui,Bohong Wu,Qiyang Min,Hongzhi Huang,Xun Zhou,Wei Ye,Jiaheng Liu,Jian Yang,Yunfeng Shi,Chenghua Lin,Enduo Zhao,Tianle Cai,Ge Zhang,Wenhao Huang,Yoshua Bengio,Jason Eshraghian

Main category: cs.CL

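TL;DR: The paper presents Ouro, a family of pre-trained Looped Language Models (LoopLM) that build reasoning into the pre-training phase through iterative computation in latent space and an entropy-regularized objective. Ouro models perform strongly on multiple benchmarks, matching much larger conventional language models.

Details Motivation: Modern large language models (LLMs) reason mainly through explicit text generation such as chain-of-thought (CoT), which under-leverages pre-training data. The authors aim to build reasoning directly into pre-training by designing a looped language model.

Contribution: 1. Ouro, a pre-trained looped language model; 2. Iterative latent-space computation and an entropy-regularized objective; 3. Evidence that Ouro's advantage lies in knowledge manipulation.

Method: 1. Iterative computation in latent space; 2. An entropy-regularized objective for learned depth allocation; 3. Large-scale training on 7.7T tokens.

Result: The Ouro 1.4B and 2.6B models match SOTA LLMs of up to 12B parameters across a wide range of benchmarks. The advantage stems from stronger knowledge manipulation rather than greater knowledge capacity.

Insight: Looped language models can build reasoning into pre-training more efficiently and yield reasoning traces that are more aligned with final outputs, pointing to a new scaling direction for reasoning.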

Abstract: Modern LLMs are trained to “think” primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models deliver superior performance that matches the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model can be found at: http://ouro-llm.github.io.

[37] Task Completion Agents are Not Ideal Collaborators

Shannon Zejiang Shen,Valerie Chen,Ken Gu,Alexis Ross,Zixian Ma,Jillian Ross,Alex Gu,Chenglei Si,Wayne Chi,Andi Peng,Jocelyn J Shen,Ameet Talwalkar,Tongshuang Wu,David Sontag

Main category: cs.CL

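TL;DR: The paper argues for a shift from evaluating one-shot task completion to developing collaborative agents, emphasizing how well an agent supports the user and builds understanding over sustained interaction.

Details Motivation: Existing agent evaluations focus on one-shot task completion and overlook the iterative, collaborative nature of real-world problems, where user goals are often underspecified and evolve.

Contribution: Introduces collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement, and exposes the shortcomings of existing agents in sustained collaboration.

Method: Uses case studies and simulated evaluations to analyze how existing agents perform on multi-turn collaborative tasks, with collaborative effort scaling as the diagnostic tool.

Result: State-of-the-art agents often underperform in multi-turn, real-world scenarios; they lack the ability to sustain engagement and scaffold user understanding.

Insight: Collaborative agents should be designed for the quality of the whole interaction, not only for the quality of the final output.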

Abstract: Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent’s utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.

[38] DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

Chumeng Liang,Jiaxuan You

Main category: cs.CL

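TL;DR: DiagramEval is a new metric that models a diagram as a graph (text elements as nodes, connections as edges) and defines two groups of metrics, node alignment and path alignment, to evaluate the quality of LLM-generated diagrams. Its validity is demonstrated on diagrams from recent research literature.

Details Motivation: Standard image generative models struggle to produce diagrams with clear structure. LLMs can generate diagrams directly as SVG, but effective evaluation metrics are lacking. DiagramEval aims to fill this gap.

Contribution: 1) DiagramEval, a graph-based evaluation metric; 2) two new groups of metrics, node alignment and path alignment; 3) validation of the metrics on recent research literature.

Method: Converts each diagram into a graph (text elements as nodes, connections as directed edges) and uses node alignment and path alignment to evaluate LLM-generated diagrams quantitatively.

Result: The metrics are validated on diagrams from recent research papers, and the analysis surfaces structural and content characteristics of LLM-generated diagrams.

Insight: DiagramEval provides a quantitative evaluation tool and, through its explainability, reveals characteristics of LLM-generated diagrams.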

Abstract: Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: https://github.com/ulab-uiuc/diagram-eval.
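To make the graph view concrete, the sketch below scores a generated diagram against a reference using set overlap. It is an illustration under stated assumptions, not the paper's definition: nodes are matched by exact text equality, node alignment is taken to be precision, recall, and F1 over node sets, and path alignment is taken to be the same scores over pairs of nodes connected by a directed path. All function names are hypothetical, and isolated nodes (with no edges) are ignored for brevity.

```python
def prf(pred, ref):
    """Precision, recall, and F1 between two sets."""
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    hit = len(pred & ref)
    p, r = hit / len(pred), hit / len(ref)
    return p, r, (2 * p * r / (p + r) if hit else 0.0)


def nodes_of(edges):
    """Collect every node text that appears in at least one edge."""
    return {n for edge in edges for n in edge}


def reachable_pairs(edges):
    """All (source, target) pairs connected by a directed path."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    pairs = set()
    for start in adj:
        stack, seen = list(adj[start]), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj.get(node, ()))
        pairs.update((start, target) for target in seen)
    return pairs


reference = [("Input", "Encoder"), ("Encoder", "Decoder"), ("Decoder", "Output")]
generated = [("Input", "Encoder"), ("Encoder", "Output")]  # Decoder is missing

node_scores = prf(nodes_of(generated), nodes_of(reference))
path_scores = prf(reachable_pairs(generated), reachable_pairs(reference))
```

In this toy case every generated node and path also exists in the reference, so precision is perfect for both, while recall drops because the missing node removes several reference paths. A practical metric would likely need fuzzy text matching, which this sketch omits.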

[39] Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

Sriram Balasubramaniam,Samyadeep Basu,Koustava Goswami,Ryan Rossi,Varun Manjunatha,Roshan Santhosh,Ruiyi Zhang,Soheil Feizi,Nedim Lipka

Main category: cs.CL

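TL;DR: The paper proposes DecompTune, which improves the attribution quality of language models through answer decomposition and reinforcement-based post-training, outperforming prior methods and matching frontier models.

Details Motivation: Existing post-hoc attribution methods perform poorly in multi-hop, abstractive, and semi-extractive settings, which undermines trust; model capability needs to improve.

Contribution: 1. Reframes attribution as a reasoning problem; 2. Develops DecompTune, which improves attribution quality by training models to decompose answers; 3. Curates a diverse dataset and an optimized training pipeline.

Method: 1. Prompt models to generate answer decompositions; 2. Train with a two-stage SFT + GRPO pipeline; 3. Optimize with task-specific rewards.

Result: DecompTune substantially improves attribution quality, outperforming prior methods and matching frontier models.

Insight: Answer decomposition is an effective intermediate step for attribution, and reinforcement learning further improves the model's attribution behavior.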

Abstract: Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.

cs.CV [Back]

[40] Towards Fine-Grained Human Motion Video Captioning

Guorui Song,Guocun Wang,Zhe Huang,Jing Lin,Xuefei Zhe,Jian Li,Haoqian Wang

Main category: cs.CV

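TL;DR: The paper proposes the Motion-Augmented Caption Model (M-ACM), a framework that improves video caption accuracy through motion-aware decoding. It also introduces the Human Motion Insight (HMI) dataset and benchmark; experiments show that M-ACM significantly outperforms existing methods at describing complex human motion.

Details Motivation: Existing video captioning models struggle to capture fine-grained human motion details, producing vague or semantically inconsistent captions.

Contribution: 1. The M-ACM framework, which improves caption quality through motion-aware decoding; 2. The HMI dataset and the HMI-Bench benchmark; 3. Experiments demonstrating M-ACM's advantage on complex motion description.

Method: M-ACM uses motion representations derived from human mesh recovery to highlight body dynamics explicitly, reducing hallucinations and improving semantic fidelity and spatial alignment.

Result: M-ACM significantly outperforms previous methods, especially in describing complex human motions and subtle temporal variations.

Insight: Explicitly modeling motion detail is important for video captioning, and the HMI dataset is a useful resource for future research.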

Abstract: Generating accurate descriptions of human actions in videos remains a challenging task for video captioning models. Existing approaches often struggle to capture fine-grained motion details, resulting in vague or semantically inconsistent captions. In this work, we introduce the Motion-Augmented Caption Model (M-ACM), a novel generative framework that enhances caption quality by incorporating motion-aware decoding. At its core, M-ACM leverages motion representations derived from human mesh recovery to explicitly highlight human body dynamics, thereby reducing hallucinations and improving both semantic fidelity and spatial alignment in the generated captions. To support research in this area, we present the Human Motion Insight (HMI) Dataset, comprising 115K video-description pairs focused on human movement, along with HMI-Bench, a dedicated benchmark for evaluating motion-focused video captioning. Experimental results demonstrate that M-ACM significantly outperforms previous methods in accurately describing complex human motions and subtle temporal variations, setting a new standard for motion-centric video captioning.

[41] Combining SAR Simulators to Train ATR Models with Synthetic Data

Benjamin Camus,Julien Houssay,Corentin Le Barbu,Eric Monteux,Cédric Saleun,Christian Cochin

Main category: cs.CV

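TL;DR: The paper combines two SAR simulators built on different physical models (MOCEM and Salsa) to generate synthetic data for training deep learning models for automatic target recognition (ATR), working around the shortage of real SAR data and reaching almost 88% accuracy on the MSTAR measurements.

Details Motivation: Labelled real SAR data are scarce. Synthetic data are controllable but less representative, so ATR models trained on them generalize poorly to real measurements.

Contribution: 1. Quantifies the impact of the simulation paradigm on ATR; 2. Proposes combining two complementary SAR simulators (MOCEM and Salsa) to generate more comprehensive synthetic data.

Method: Generates datasets with MOCEM (scattering-center model) and Salsa (ray tracing) and trains ATR models with the authors' deep learning approach, ADASCA.

Result: Reaches an accuracy of almost 88% on the MSTAR measurements.

Insight: Combining simulators built on different physical models can improve the diversity and representativeness of synthetic data and therefore the generalization of ATR models.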

Abstract: This work aims to train Deep Learning models to perform Automatic Target Recognition (ATR) on Synthetic Aperture Radar (SAR) images. To circumvent the lack of real labelled measurements, we resort to synthetic data produced by SAR simulators. Simulation offers full control over the virtual environment, which enables us to generate large and diversified datasets at will. However, simulations are intrinsically grounded in simplifying assumptions about the real world (i.e. physical models). Thus, synthetic datasets are not as representative as real measurements. Consequently, ATR models trained on synthetic images cannot generalize well to real measurements. Our contributions to this problem are twofold: on the one hand, we demonstrate and quantify the impact of the simulation paradigm on ATR. On the other hand, we propose a new approach to tackle the ATR problem: combine two SAR simulators that are grounded in different (but complementary) paradigms to produce synthetic datasets. To this end, we use two simulators: MOCEM, which is based on a scattering centers model approach, and Salsa, which relies on a ray tracing strategy. We train ATR models using synthetic datasets generated by both MOCEM and Salsa and our Deep Learning approach called ADASCA. We reach an accuracy of almost 88% on the MSTAR measurements.

[42] Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimer’s Disease Diagnosis

Yujie Nie,Jianzhang Ni,Yonglong Ye,Yuan-Ting Zhang,Yun Kwok Wing,Xiangqing Xu,Xin Ma,Lizhou Fan

Main category: cs.CV

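TL;DR: The paper proposes a multimodal fusion framework that combines eye-tracking and facial features for Alzheimer's disease (AD) diagnosis. A cross-modal attention module and a direction-aware convolution module improve classification, reaching 95.11% accuracy.

Details Motivation: Early diagnosis of AD is essential, but conventional approaches do not fully exploit complementary information across behavioral and perceptual domains. Eye-tracking and facial features are potential indicators of cognitive function, yet few studies have examined them jointly.

Contribution: 1. A cross-enhanced fusion framework (CEFAM and DACM); 2. A synchronized multimodal dataset; 3. Clear gains over existing fusion methods, supporting the value of modeling cross-modal dependencies.

Method: 1. CEFAM models inter-modal interactions through cross-attention and global enhancement; 2. DACM captures fine-grained directional facial features through horizontal and vertical receptive fields.

Result: Achieves 95.11% accuracy in distinguishing AD from healthy controls (HC), outperforming feature concatenation and late fusion.

Insight: Cross-modal attention can capture the complementarity of behavioral and perceptual features, and direction-aware convolution improves the discriminative power of facial features, offering a new approach to AD diagnosis.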

Abstract: Accurate diagnosis of Alzheimer’s disease (AD) is essential for enabling timely intervention and slowing disease progression. Multimodal diagnostic approaches offer considerable promise by integrating complementary information across behavioral and perceptual domains. Eye-tracking and facial features, in particular, are important indicators of cognitive function, reflecting attentional distribution and neurocognitive state. However, few studies have explored their joint integration for auxiliary AD diagnosis. In this study, we propose a multimodal cross-enhanced fusion framework that synergistically leverages eye-tracking and facial features for AD detection. The framework incorporates two key modules: (a) a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and (b) a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features via horizontal-vertical receptive fields. Together, these modules enable adaptive and discriminative multimodal representation learning. To support this work, we constructed a synchronized multimodal dataset, including 25 patients with AD and 25 healthy controls (HC), by recording aligned facial video and eye-tracking sequences during a visual memory-search paradigm, providing an ecologically valid resource for evaluating integration strategies. Extensive experiments on this dataset demonstrate that our framework outperforms traditional late fusion and feature concatenation methods, achieving a classification accuracy of 95.11% in distinguishing AD from HC, highlighting superior robustness and diagnostic performance by explicitly modeling inter-modal dependencies and modality-specific contributions.

[43] FPGA-based Lane Detection System incorporating Temperature and Light Control Units

Ibrahim Qamar,Saber Mahmoud,Seif Megahed,Mohamed Khaled,Saleh Hesham,Ahmed Matar,Saif Gebril,Mervat Mahmoud

Main category: cs.CV

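TL;DR: The paper proposes an FPGA-based lane detection system with integrated temperature and light control units. It uses the Sobel algorithm for edge detection and targets intelligent vehicles.

Details Motivation: As automation advances, intelligent vehicles are increasingly common, and lane detection is one of their key tasks. Existing systems adapt poorly to environmental conditions, so an efficient solution that responds to environmental changes is needed.

Contribution: 1) An FPGA-based lane detection architecture; 2) integrated temperature and light control units for better environmental adaptability; 3) efficient edge detection with the Sobel algorithm.

Method: Edge detection uses the Sobel algorithm on an FPGA platform that processes 416 x 416 images at 150 MHz and produces a valid output every 1.17 ms. The output comprises the number of lanes, the current lane index, and its left and right boundaries.

Result: The system runs at 1.17 ms per frame and reports the number of lanes, the current lane, and its boundaries. The temperature and light control units improve its adaptability to environmental changes.

Insight: The parallelism of FPGAs suits real-time lane detection, and integrating environmental control units makes the system more practical and robust.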

Abstract: Intelligent vehicles (IVs) are one of the most important outcomes of the global trend toward automation. Applications of IVs, whether on urban roads or robot tracks, prioritize lane path detection. This paper proposes an FPGA-based Lane Detector Vehicle (LDV) architecture that relies on the Sobel algorithm for edge detection. Operating on 416 x 416 images at 150 MHz, the system can generate a valid output every 1.17 ms. The valid output consists of the number of present lanes, the current lane index, as well as its right and left boundaries. Additionally, the automated light and temperature control units in the proposed system enhance its adaptability to the surrounding environmental conditions.
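The reported figures can be cross-checked with simple arithmetic: a 416 x 416 frame has 173,056 pixels, and 1.17 ms at 150 MHz is 175,500 clock cycles, or about 1.01 cycles per pixel. That is consistent with a pipeline that consumes roughly one pixel per clock, although this reading is an inference and is not stated in the abstract. The sketch below reproduces the calculation and adds a plain software Sobel filter for reference. The |Gx| + |Gy| magnitude approximation is an assumption (it is common in hardware because it avoids a square root), and the function name is hypothetical.

```python
WIDTH = HEIGHT = 416
CLOCK_HZ = 150_000_000
FRAME_TIME_S = 1.17e-3

pixels = WIDTH * HEIGHT                    # pixels per frame
cycles = round(FRAME_TIME_S * CLOCK_HZ)    # clock cycles per valid output
cycles_per_pixel = cycles / pixels         # close to one cycle per pixel

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(img):
    """Return |Gx| + |Gy| per interior pixel; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    value = img[y + j - 1][x + i - 1]
                    gx += GX[j][i] * value
                    gy += GY[j][i] * value
            out[y][x] = abs(gx) + abs(gy)
    return out


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel(image)
```

On the toy image the filter responds strongly along the vertical edge and not at all in the vertical direction, which is the behavior lane-boundary detection relies on.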

[44] ESCA: Enabling Seamless Codec Avatar Execution through Algorithm and Hardware Co-Optimization for Virtual Reality

Mingzhi Zhu,Ding Shang,Sai Qian Zhang

Main category: cs.CV

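TL;DR: ESCA combines an efficient post-training quantization method for Codec Avatar models with a custom hardware accelerator. Through algorithm and hardware co-optimization it enables real-time, high-fidelity rendering on resource-constrained VR devices.

Details Motivation: High-fidelity Codec Avatars are computationally demanding, which makes real-time inference difficult on resource-constrained VR devices such as head-mounted displays. An efficient solution is needed.

Contribution: 1. A post-training quantization method tailored for Codec Avatar models that enables low-precision execution; 2. A custom hardware accelerator; 3. ESCA, a full-stack optimization framework that improves rendering efficiency and quality.

Method: Combines post-training quantization (PTQ) with a custom hardware accelerator, co-optimizing algorithm and hardware to reduce computational cost and improve inference speed and energy efficiency.

Result: ESCA improves FovVideoVDP quality scores by up to +0.39 over the best 4-bit baseline, reduces latency by up to 3.36x, and sustains 100 frames per second in end-to-end tests, meeting real-time VR requirements.

Insight: Co-optimizing algorithm and hardware makes high-quality, low-latency Codec Avatar rendering feasible on resource-constrained devices, supporting more portable VR.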

Abstract: Photorealistic Codec Avatars (PCA), which generate high-fidelity human face renderings, are increasingly being used in Virtual Reality (VR) environments to enable immersive communication and interaction through deep learning-based generative models. However, these models impose significant computational demands, making real-time inference challenging on resource-constrained VR devices such as head-mounted displays, where latency and power efficiency are critical. To address this challenge, we propose an efficient post-training quantization (PTQ) method tailored for Codec Avatar models, enabling low-precision execution without compromising output quality. In addition, we design a custom hardware accelerator that can be integrated into the system-on-chip of VR devices to further enhance processing efficiency. Building on these components, we introduce ESCA, a full-stack optimization framework that accelerates PCA inference on edge VR platforms. Experimental results demonstrate that ESCA boosts FovVideoVDP quality scores by up to $+0.39$ over the best 4-bit baseline, delivers up to $3.36\times$ latency reduction, and sustains a rendering rate of 100 frames per second in end-to-end tests, satisfying real-time VR requirements. These results demonstrate the feasibility of deploying high-fidelity codec avatars on resource-constrained devices, opening the door to more immersive and portable VR experiences.
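For readers unfamiliar with post-training quantization, the sketch below shows the generic idea with symmetric, per-tensor, uniform 4-bit quantization. It is not ESCA's method: the abstract does not describe the tailored PTQ scheme, and the function names, the per-tensor scale, and the sample weights are all illustrative assumptions.

```python
def quantize_symmetric(weights, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] with one scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


# Illustrative weights only; real models quantize whole tensors.
weights = [0.70, -0.35, 0.10, 0.02, -0.69]
codes, scale = quantize_symmetric(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error of each weight is bounded by half the scale, so a few large-magnitude outliers inflate the scale and degrade every other weight. That sensitivity is the usual reason low-bit PTQ needs model-specific treatment to avoid quality loss.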

[45] The Underappreciated Power of Vision Models for Graph Structural Understanding

Xinjian Zhao,Wei Pang,Zhongkai Xue,Xiangru Jian,Lei Zhang,Yaoyao Xu,Xiaozhuang Song,Shu Wu,Tianshu Yu

Main category: cs.CV

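TL;DR: The paper investigates the potential of vision models for graph structural understanding, finds that they outperform graph neural networks (GNNs) on tasks requiring global topological understanding, and introduces the GraphAbstract benchmark to evaluate global structure perception.

Details Motivation: GNNs operate through bottom-up message passing, unlike human visual perception, which grasps global structure first. The authors explore whether vision models can fill this gap in graph understanding.

Contribution: 1. Shows that vision models outperform GNNs on global graph structure understanding; 2. Introduces GraphAbstract, a benchmark focused on global topological perception; 3. Demonstrates the advantage of vision models in scale-invariant reasoning.

Method: 1. Compare vision models and GNNs on graph understanding tasks; 2. Use GraphAbstract to evaluate recognition of organizational archetypes, symmetry, connectivity strength, and critical elements.

Result: Vision models significantly outperform GNNs on tasks requiring holistic structural understanding and remain stable across graph scales, whereas GNNs struggle to abstract global patterns and degrade as graphs grow.

Insight: Vision models have underutilized capabilities for graph structural understanding, especially for tasks requiring global topological awareness and scale-invariant reasoning.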

Abstract: Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models’ ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.

[46] A Re-node Self-training Approach for Deep Graph-based Semi-supervised Classification on Multi-view Image Data

Jingjun Bi,Fadi Dornaika

Main category: cs.CV

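TL;DR: The paper proposes RSGSLM, a self-training method that combines multi-view graph fusion with dynamic pseudo-label optimization to address semi-supervised classification on multi-view image data, and reports strong experimental results.

Details Motivation: Multi-view image data lack a clear graph structure, and existing methods are inefficient on complex multi-view data, so an efficient semi-supervised learning method is needed.

Contribution: 1) Combines linear feature transformation and multi-view graph fusion within a GCN framework; 2) dynamically incorporates pseudo-labels into the GCN loss function; 3) corrects topological imbalance by adjusting the weights of labeled samples near class boundaries; 4) introduces an unsupervised smoothing loss applicable to all samples.

Method: Combines GCN-based multi-view graph fusion with dynamic pseudo-label optimization, adds an unsupervised smoothing loss, and reweights samples near class boundaries.

Result: On multi-view benchmark image datasets, RSGSLM surpasses existing semi-supervised learning approaches.

Insight: For multi-view data, combining graph structure with dynamic pseudo-label optimization can improve semi-supervised learning while maintaining computational efficiency.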

Abstract: Recently, graph-based semi-supervised learning and pseudo-labeling have gained attention due to their effectiveness in reducing the need for extensive data annotations. Pseudo-labeling uses predictions from unlabeled data to improve model training, while graph-based methods are characterized by processing data represented as graphs. However, the lack of clear graph structures in images combined with the complexity of multi-view data limits the efficiency of traditional and existing techniques. Moreover, the integration of graph structures in multi-view data is still a challenge. In this paper, we propose Re-node Self-taught Graph-based Semi-supervised Learning for Multi-view Data (RSGSLM). Our method addresses these challenges by (i) combining linear feature transformation and multi-view graph fusion within a Graph Convolutional Network (GCN) framework, (ii) dynamically incorporating pseudo-labels into the GCN loss function to improve classification in multi-view data, and (iii) correcting topological imbalances by adjusting the weights of labeled samples near class boundaries. Additionally, (iv) we introduce an unsupervised smoothing loss applicable to all samples. This combination optimizes performance while maintaining computational efficiency. Experimental results on multi-view benchmark image datasets demonstrate that RSGSLM surpasses existing semi-supervised learning approaches in multi-view contexts.

[47] PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models

Patrick Haller,Fabio Barth,Jonas Golde,Georg Rehm,Alan Akbik

Main category: cs.CV

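TL;DR: PISA-Bench is a multilingual, multimodal benchmark derived from the PISA tests. It evaluates vision-language models across languages and on tasks such as spatial and geometric reasoning.

Details Motivation: Existing benchmarks for vision-language models are mostly English-only and often rely on LLM-generated synthetic content; high-quality, human-verified, multilingual datasets are lacking.

Contribution: Introduces PISA-Bench, a multilingual dataset built from the expert-created PISA tests that covers six languages and provides a high-quality evaluation resource.

Method: Human-extracted examples from the English PISA tests are translated into five additional languages, yielding a fully parallel multilingual corpus.

Result: Small models score poorly, performance degrades substantially on non-English splits, and error rates are high on spatial and geometric reasoning.

Insight: The results highlight the weaknesses of vision-language models on multilingual and complex reasoning tasks and point to directions for future research.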

Abstract: Vision-language models (VLMs) have demonstrated remarkable progress in multimodal reasoning. However, existing benchmarks remain limited in terms of high-quality, human-verified examples. Many current datasets rely on synthetically generated content by large language models (LLMs). Furthermore, most datasets are limited to English, as manual quality assurance of translated samples is time-consuming and costly. To fill this gap, we introduce PISA-Bench, a multilingual benchmark derived from English examples of the expert-created PISA tests, a unified framework for the assessment of student competencies in over eighty countries. Each example consists of human-extracted instructions, questions, answer options, and images, enriched with question type categories, and has been translated from English into five additional languages (Spanish, German, Chinese, French, and Italian), resulting in a fully parallel corpus covering six languages. We evaluate state-of-the-art vision-language models on PISA-Bench and find that especially small models (<20B parameters) fail to achieve high test scores. We further find substantial performance degradation on non-English splits as well as high error-rates when models are tasked with spatial and geometric reasoning. By releasing the dataset and evaluation framework, we provide a resource for advancing research on multilingual multimodal reasoning.

[48] A Survey on Efficient Vision-Language-Action Models

Zhaoshu Yu,Bo Wang,Pengpeng Zeng,Haonan Zhang,Ji Zhang,Lianli Gao,Jingkuan Song,Nicu Sebe,Heng Tao Shen

Main category: cs.CV

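TL;DR: The paper is the first comprehensive survey of Efficient Vision-Language-Action models (Efficient VLAs). It proposes a taxonomy with three pillars (efficient model design, efficient training, and efficient data collection) and summarizes existing methods, applications, challenges, and future directions.

Details Motivation: The high computational and data costs of existing Vision-Language-Action models (VLAs) limit their practical deployment, so efficient solutions are urgently needed.

Contribution: 1. The first comprehensive survey of Efficient VLAs; 2. A unified taxonomy covering model design, training, and data collection; 3. A summary of applications, challenges, and future directions.

Method: A systematic taxonomy groups existing techniques into efficient model design (such as efficient architectures and model compression), efficient training (reducing computational burden), and efficient data collection (improving how robotic data are acquired and used).

Result: Establishes a research framework for Efficient VLAs, provides a reference for the community, and summarizes representative applications and challenges.

Insight: Efficient VLAs require coordinated optimization across model, training, and data; future work should place more emphasis on computational efficiency and data generalization.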

Abstract: Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/

[49] Conflict Adaptation in Vision-Language Models

Xiaoyang Hu

Main category: cs.CV

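TL;DR: The paper finds that 12 of 13 vision-language models (VLMs) tested behave in a way consistent with human conflict adaptation, and uses sparse autoencoders (SAEs) to reveal task-relevant supernodes inside the model.

Details Motivation: To test whether VLMs exhibit conflict adaptation, a signature of human cognitive control, and to understand the internal representations behind it.

Contribution: 1. Shows that most VLMs exhibit conflict adaptation; 2. Identifies task-relevant supernodes with SAEs; 3. Relates the model's internal representations to the automaticity asymmetry in human cognition.

Method: Tests VLMs on a sequential Stroop task and uses sparse autoencoders (SAEs) to analyze task-relevant supernodes in InternVL 3.5 4B.

Result: 12 of 13 VLMs exhibit conflict adaptation. SAEs reveal partially overlapping supernodes for text and color in both early and late layers, and a conflict-modulated supernode in layers 24-25.

Insight: The internal representational structure of VLMs resembles human cognitive control mechanisms, particularly in how task-relevant features are processed.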

Abstract: A signature of human cognitive control is conflict adaptation: improved performance on a high-conflict trial following another high-conflict trial. This phenomenon offers an account for how cognitive control, a scarce resource, is recruited. Using a sequential Stroop task, we find that 12 of 13 vision-language models (VLMs) tested exhibit behavior consistent with conflict adaptation, with the lone exception likely reflecting a ceiling effect. To understand the representational basis of this behavior, we use sparse autoencoders (SAEs) to identify task-relevant supernodes in InternVL 3.5 4B. Partially overlapping supernodes emerge for text and color in both early and late layers, and their relative sizes mirror the automaticity asymmetry between reading and color naming in humans. We further isolate a conflict-modulated supernode in layers 24-25 whose ablation significantly increases Stroop errors while minimally affecting congruent trials.
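The behavioral measure can be illustrated with a small calculation. Following the definition in the abstract, conflict adaptation means better performance on a high-conflict (incongruent) trial when the previous trial was also high-conflict. The sketch below computes the accuracy difference between incongruent trials that follow incongruent trials and those that follow congruent trials. The function name, the data layout, and the use of accuracy as the performance measure are assumptions; the abstract does not specify the exact statistic used.

```python
def conflict_adaptation_effect(trials):
    """Accuracy on incongruent trials after incongruent minus after congruent.

    trials: list of (is_incongruent, is_correct) tuples in presentation order.
    A positive value is consistent with conflict adaptation.
    """
    after_incongruent, after_congruent = [], []
    for (prev_incongruent, _), (cur_incongruent, cur_correct) in zip(trials, trials[1:]):
        if not cur_incongruent:
            continue  # only high-conflict current trials are scored
        bucket = after_incongruent if prev_incongruent else after_congruent
        bucket.append(cur_correct)
    if not after_incongruent or not after_congruent:
        raise ValueError("need incongruent trials after both trial types")
    acc_ii = sum(after_incongruent) / len(after_incongruent)
    acc_ci = sum(after_congruent) / len(after_congruent)
    return acc_ii - acc_ci


# Toy sequence: (is_incongruent, is_correct); values are illustrative only.
trials = [
    (False, True), (True, False), (True, True), (False, True),
    (True, False), (True, True), (True, True),
]
effect = conflict_adaptation_effect(trials)
```

A model that is already near-perfect on all incongruent trials would show an effect close to zero, which matches the abstract's note that the lone exception likely reflects a ceiling effect.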

[50] Deep Feature Optimization for Enhanced Fish Freshness Assessment

Phi-Hung Hoang,Nam-Thuan Trinh,Van-Manh Tran,Thi-Thu-Hong Phan

Main category: cs.CV

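TL;DR: The paper proposes a three-stage framework that fine-tunes several vision architectures and combines them with classical machine learning classifiers and feature selection, improving the accuracy and interpretability of fish freshness assessment.

Details Motivation: Traditional fish freshness assessment is subjective and time-consuming. Deep learning offers automation, but accuracy and feature transparency remain insufficient.

Contribution: A unified three-stage framework that combines deep feature extraction with classical machine learning classification, improving assessment accuracy and feature interpretability.

Method: 1) Fine-tune five state-of-the-art vision architectures; 2) train seven classical classifiers on multi-level deep features; 3) select the best feature subset with LGBM, Random Forest, and Lasso.

Result: On the FFE dataset, the best configuration (Swin-Tiny features + Extra Trees classifier + LGBM-based feature selection) reaches 85.99% accuracy, 8.69-22.78% higher than recent studies on the same dataset.

Insight: Combining deep visual features with classical machine learning improves task performance, and feature selection helps extract a compact, informative feature subset.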

Abstract: Assessing fish freshness is vital for ensuring food safety and minimizing economic losses in the seafood industry. However, traditional sensory evaluation remains subjective, time-consuming, and inconsistent. Although recent advances in deep learning have automated visual freshness prediction, challenges related to accuracy and feature transparency persist. This study introduces a unified three-stage framework that refines and leverages deep visual representations for reliable fish freshness assessment. First, five state-of-the-art vision architectures - ResNet-50, DenseNet-121, EfficientNet-B0, ConvNeXt-Base, and Swin-Tiny - are fine-tuned to establish a strong baseline. Next, multi-level deep features extracted from these backbones are used to train seven classical machine learning classifiers, integrating deep and traditional decision mechanisms. Finally, feature selection methods based on Light Gradient Boosting Machine (LGBM), Random Forest, and Lasso identify a compact and informative subset of features. Experiments on the Freshness of the Fish Eyes (FFE) dataset demonstrate that the best configuration combining Swin-Tiny features, an Extra Trees classifier, and LGBM-based feature selection achieves an accuracy of 85.99%, outperforming recent studies on the same dataset by 8.69-22.78%. These findings confirm the effectiveness and generalizability of the proposed framework for visual quality evaluation tasks.

[51] Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection

Cui Yakun,Fushuo Huo,Weijie Shi,Juntao Dai,Hang Du,Zhenghao Zhu,Sirui Han,Yike Guo

Main category: cs.CV

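TL;DR: The paper introduces MVFNDB, a multimodal video fake news detection benchmark with 10 tasks and 9,730 human-annotated question-answer pairs that evaluates the perception, understanding, and reasoning of MLLMs during detection. It also designs the MVFND-CoT framework, which reasons over both creator-added content and original footage.

Details Motivation: Traditional video fake news detection benchmarks focus only on the accuracy of the final decision and lack fine-grained assessment of the detection process, leaving that process a black box.

Contribution: 1. MVFNDB, a benchmark for fine-grained evaluation of MLLM detection abilities; 2. MVFND-CoT, a framework that combines multiple features to improve detection; 3. An in-depth analysis of the video processing strategies and feature alignment issues that affect accuracy.

Method: MVFNDB comprises 10 tasks built on 9,730 human-annotated video-related questions. MVFND-CoT integrates creator-added content and original footage for multimodal reasoning.

Result: MVFNDB provides a fine-grained benchmark for evaluating MLLMs, and MVFND-CoT supports the effectiveness of multi-feature fusion.

Insight: Video fake news detection requires perception, understanding, and reasoning together; multimodal feature alignment and video processing strategy are important for model performance.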

Abstract: The advent of multi-modal large language models (MLLMs) has greatly advanced research into applications for video fake news detection (VFND) tasks. Traditional video-based FND benchmarks typically focus on the accuracy of the final decision, often failing to provide fine-grained assessments of the entire detection process, which leaves that process a black box. Therefore, we introduce MVFNDB (Multi-modal Video Fake News Detection Benchmark), based on an empirical analysis that provides the foundation for task definition. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs’ perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed taxonomy of VFND abilities. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates reasoning over both creator-added content and original shooting footage. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluations and advancements of MLLMs in the domain of video fake news detection.

[52] SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing

Ruiyang Zhang,Jiahao Luo,Xiaoru Feng,Qiufan Pang,Yaodong Yang,Juntao Dai

Main category: cs.CV

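TL;DR: SafeEditor proposes a multi-round safety editing framework that improves the safety of text-to-image (T2I) models while avoiding the over-refusal and safety-utility imbalance of existing methods.

Details Motivation: As T2I models advance rapidly, ensuring their safety has become critical. Existing inference-time methods suffer from over-refusal and an imbalance between safety and utility.

Contribution: 1) Builds MR-SafeEdit, a dataset designed for T2I safety editing; 2) Proposes SafeEditor, a multi-round safety editing framework that works as a plug-and-play module for efficient safety alignment.

Method: SafeEditor is a unified multimodal large language model (MLLM) built on MR-SafeEdit, a multi-round image-text interleaved dataset. It mirrors the human cognitive process of identifying and refining unsafe content.

Result: Experiments show that SafeEditor reduces over-refusal and achieves a better safety-utility balance than prior methods.

Insight: Multi-round editing that mirrors human cognition enables more precise safety alignment and avoids one-size-fits-all refusal.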

Abstract: With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.

[53] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Inclusion AI,Bowen Ma,Cheng Zou,Canxiang Yan,Chunxiang Jin,Chunjie Shen,Dandan Zheng,Fudong Wang,Furong Xu,GuangMing Yao,Jun Zhou,Jingdong Chen,Jianing Li,Jianxin Sun,Jiajia Liu,Jianjiang Zhu,Jianping Jiang,Jun Peng,Kaixiang Ji,Kaimeng Ren,Libin Wang,Lixiang Ru,Longhua Tan,Lan Wang,Mochen Bai,Ning Gao,Qingpei Guo,Qinglong Zhang,Qiang Xu,Rui Liu,Ruijie Xiong,Ruobing Zheng,Sirui Gao,Tianqi Li,Tinghao Liu,Weilong Chai,Xinyu Xiao,Xiaomei Wang,Xiaolong Wang,Xiao Lu,Xiaoyu Li,Xingning Dong,Xuzheng Yu,Yi Yuan,Yuting Gao,Yuting Xiao,Yunxiao Sun,Yipeng Chen,Yifan Mao,Yifei Wu,Yongjie Lyu,Ziping Ma,Zhiqiang Fang,Zhihao Qiu,Ziyuan Huang,Zizheng Yang,Zhengyu He

Main category: cs.CV

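TL;DR: Ming-Flash-Omni is a unified architecture for multimodal perception and generation built on a sparse Mixture-of-Experts (MoE) model. It markedly improves computational efficiency and multimodal capability, and reaches state-of-the-art (SOTA) results in contextual speech recognition, text-to-image generation, and generative segmentation.

Details Motivation: The work aims to improve the efficiency and capability of multimodal models through a sparse MoE architecture, as a step toward Artificial General Intelligence (AGI).

Contribution: 1. Proposes the sparse-MoE-based Ming-Flash-Omni architecture; 2. Achieves SOTA results in contextual ASR, text-to-image generation, and generative segmentation; 3. Introduces a generative segmentation capability.

Method: Builds a unified architecture on a sparser MoE variant of Ling-Flash-2.0 with 100B total parameters, of which only 6.1B are active per token.

Result: Sets new records on all 12 contextual ASR benchmarks, reaches SOTA in text-to-image generation, and demonstrates the new generative segmentation capability.

Insight: Sparse MoE substantially improves the scalability and efficiency of multimodal models, and a single unified architecture performs well across many tasks.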

Abstract: We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.
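
To make the "100 billion total, 6.1 billion active per token" figure concrete, here is a minimal sketch of top-k expert routing, the general mechanism by which a sparse MoE layer evaluates only a few experts per token. It is not Ming-Flash-Omni's router: the function names, the scalar toy experts, and `k=2` are assumptions chosen for illustration.

```python
import math

def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax restricted to the selected experts.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exp_scores = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exp_scores.values())
    return {i: s / total for i, s in exp_scores.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy experts (hypothetical): each just scales its input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
output = moe_forward(1.0, experts, [0.0, 5.0, 5.0, -1.0], k=2)

# Share of parameters active per token, from the abstract's numbers.
active_fraction = 6.1 / 100.0
```

With these toy inputs only experts 1 and 2 run, each with weight 0.5, and roughly 6% of the model's parameters are touched per token under the abstract's figures.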

[54] MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition

Haoyang Zhang,Zhou Yang,Ke Sun,Yucai Pang,Guoliang Xu

Main category: cs.CV

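TL;DR: This paper proposes MCIHN, a hybrid network model based on multi-path cross-modal interaction for multimodal emotion recognition. It uses adversarial autoencoders and a cross-modal gate mechanism to address modality discrepancies and the difficulty of representing emotional information.

Details Motivation: Multimodal emotion recognition is crucial for human-computer interaction, but existing methods struggle with differences between modalities and with representing unimodal emotional information, so a more effective model is needed.

Contribution: 1. Proposes the hybrid network MCIHN, which combines adversarial autoencoders (AAE) with a Cross-modal Gate Mechanism model (CGMM); 2. Designs a Feature Fusion module (FFM) that improves emotion recognition performance.

Method: 1. Builds a separate AAE for each modality to learn discriminative emotion features; 2. Uses the CGMM to reduce modality discrepancy and establish emotional relationships between modalities; 3. Performs multimodal fusion with the FFM.

Result: Experiments on the SIMS and MOSI datasets show that MCIHN achieves superior performance.

Insight: Cross-modal interaction combined with feature fusion can reduce modality discrepancy and improve emotion recognition accuracy.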

Abstract: Multimodal emotion recognition is crucial for future human-computer interaction. However, accurate emotion recognition still faces significant challenges due to differences between modalities and the difficulty of characterizing unimodal emotional information. To solve these problems, a hybrid network model based on multipath cross-modal interaction (MCIHN) is proposed. First, adversarial autoencoders (AAE) are constructed separately for each modality. The AAE learns discriminative emotion features and reconstructs the features through a decoder to obtain more discriminative information about the emotion classes. Then, the latent codes from the AAE of different modalities are fed into a predefined Cross-modal Gate Mechanism model (CGMM) to reduce the discrepancy between modalities, establish the emotional relationship between interacting modalities, and generate the interaction features between different modalities. Finally, multimodal fusion is performed with the Feature Fusion module (FFM) for better emotion recognition. Experiments were conducted on the publicly available SIMS and MOSI datasets, demonstrating that MCIHN achieves superior performance.

[55] The Generation Phases of Flow Matching: a Denoising Perspective

Anne Gagneux,Ségolène Martin,Rémi Gribonval,Mathurin Massias

Main category: cs.CV

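TL;DR: The paper studies the generation process of flow matching models from a denoising perspective and proposes a framework for analyzing its dynamical phases, showing how noise and drift perturbations affect generation.

Details Motivation: Despite the remarkable success of flow matching, the factors that influence its generation process remain poorly understood. The paper takes a denoising perspective to fill this gap.

Contribution: Establishes the formal connection between flow matching models and denoisers, proposes a framework for analyzing the dynamical phases of generation, and designs noise and drift perturbations to influence sample generation.

Method: Compares flow matching models with denoisers and designs experiments that apply noise and drift perturbations during generation to expose the dynamics of each phase.

Result: The study finds that generation passes through distinct dynamical phases and pinpoints the stages at which denoisers succeed or fail, and why this matters.

Insight: The generation process can be divided into several dynamical phases, and the effects of noise and drift differ markedly between them.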

Abstract: Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
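
The "formal connection between flow matching models and denoisers" can be illustrated with a short derivation. This is a sketch under one common convention (linear interpolation between Gaussian noise $x_0$ and data $x_1$); the paper may use a different parameterization, and the symbols $v$ and $D$ are introduced here for illustration.

```latex
\begin{aligned}
x_t &= (1-t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\text{data}},\; t \in [0,1) \\
v(x,t) &= \mathbb{E}\left[\,x_1 - x_0 \mid x_t = x\,\right] && \text{(velocity field learned by flow matching)} \\
D(x,t) &= \mathbb{E}\left[\,x_1 \mid x_t = x\,\right] && \text{(MMSE denoiser)} \\
x_0 &= \frac{x_t - t\,x_1}{1-t} \;\Longrightarrow\; x_1 - x_0 = \frac{x_1 - x_t}{1-t} \\
v(x,t) &= \frac{D(x,t) - x}{1-t}
\end{aligned}
```

Under this convention the velocity field and the denoiser carry the same information at every $t < 1$, which is what allows generation quality and denoising quality to be compared on common ground.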

[56] VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos

Qiucheng Wu,Handong Zhao,Zhixin Shu,Jing Shi,Yang Zhang,Shiyu Chang

Main category: cs.CV

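TL;DR: VividCam proposes a training paradigm that learns complex camera motions from synthetic videos, reducing the reliance on real training videos, and uses disentanglement strategies to mitigate domain shift.

Details Motivation: Existing text-to-video generative models struggle with unconventional camera motions because suitable training data is scarce. VividCam addresses this with synthetic data.

Contribution: 1. Proposes a training paradigm for learning complex camera motions; 2. Introduces disentanglement strategies that separate camera motion from synthetic appearance artifacts; 3. Shows that simple synthetic data is enough for precise control.

Method: Renders synthetic videos of low-poly 3D scenes with engines such as Unity and trains diffusion models with multiple disentanglement strategies that separate camera motion from appearance artifacts.

Result: The generated videos exhibit a wide range of precisely controlled, complex camera motions without relying on collected real training videos.

Insight: Synthetic data can stand in for real data in specific tasks, and disentanglement strategies help make the learned motion representation more robust and generalizable.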

Abstract: Although recent text-to-video generative models are getting more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which is crucial in creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, removing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolate camera motion learning from synthetic appearance artifacts, ensuring more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found at https://wuqiuche.github.io/VividCamDemoPage/ .

[57] Understanding Multi-View Transformers

Michal Stary,Julien Gaubil,Ayush Tewari,Vincent Sitzmann

Main category: cs.CV

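TL;DR: The paper probes the residual connections of multi-view transformers such as DUSt3R to study their inner mechanisms, revealing how the latent state develops and what each layer does, and points to directions for improvement.

Details Motivation: Multi-view transformers perform well in 3D vision, but their inner mechanisms are opaque, which hinders further optimization and use in reliability-critical applications.

Contribution: Presents an approach for probing and visualizing the 3D representations of multi-view transformers through their residual connections, exposing how they work internally.

Method: Studies a variant of the DUSt3R model, analyzes the development of its latent state and the role of each layer, and compares it with methods that use an explicit global pose.

Result: The investigated DUSt3R variant estimates correspondences that are refined with reconstructed geometry.

Insight: The analysis clarifies how multi-view transformers operate internally and offers guidance for improving them and for using them in safety-critical applications.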

Abstract: Multi-view transformers such as DUSt3R are revolutionizing 3D vision by solving 3D tasks in a feed-forward manner. However, contrary to previous optimization-based pipelines, the inner mechanisms of multi-view transformers are unclear. Their black-box nature makes further improvements beyond data scaling challenging and complicates usage in safety- and reliability-critical applications. Here, we present an approach for probing and visualizing 3D representations from the residual connections of the multi-view transformers’ layers. In this manner, we investigate a variant of the DUSt3R model, shedding light on the development of its latent state across blocks and the role of the individual layers, and suggesting how it differs from methods with stronger inductive biases of explicit global pose. Finally, we show that the investigated variant of DUSt3R estimates correspondences that are refined with reconstructed geometry. The code used for the analysis is available at https://github.com/JulienGaubil/und3rstand .

[58] Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning

Hossein R. Nowdeh,Jie Ji,Xiaolong Ma,Fatemeh Afghah

Main category: cs.CV

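TL;DR: The paper proposes Modality-Aware SAM (M-SAM), a model-agnostic multimodal learning framework that uses gradient modulation to counter the dominance of a single modality and improve generalization.

Details Motivation: In multimodal learning, the dominant modality tends to overshadow the others, which limits generalization. The authors propose a new method to address this.

Contribution: 1. Proposes the M-SAM framework, in which Sharpness-Aware Minimization (SAM) drives gradient modulation. 2. Supports both early and late fusion and applies to many modalities. 3. Identifies the dominant modality with Shapley values and modulates the loss to improve its robustness.

Method: 1. Identifies the dominant modality with Shapley values. 2. Decomposes the loss landscape (modulates the loss) to prioritize robustness for the dominant modality. 3. Updates the weights by backpropagating the modulated gradients, so that the model explores and exploits complementary features.

Result: Experiments on four datasets show that M-SAM outperforms the latest optimization and gradient manipulation methods and significantly improves the balance and performance of multimodal learning.

Insight: Balancing multimodal learning through gradient modulation is key; M-SAM shows the potential to improve model performance in complex multimodal settings.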

Abstract: In multimodal learning, dominant modalities often overshadow others, limiting generalization. We propose Modality-Aware Sharpness-Aware Minimization (M-SAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios. In every iteration, M-SAM optimizes learning in three steps. First, it identifies the dominant modality based on modalities’ contribution to the accuracy using Shapley values. Second, it decomposes the loss landscape, or in other words, it modulates the loss to prioritize the robustness of the model in favor of the dominant modality, and third, M-SAM updates the weights by backpropagation of modulated gradients. This ensures robust learning for the dominant modality while enhancing contributions from others, allowing the model to explore and exploit complementary features that strengthen overall performance. Extensive experiments on four diverse datasets show that M-SAM outperforms the latest state-of-the-art optimization and gradient manipulation methods and significantly balances and improves multimodal learning.
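
The first of the three steps, identifying the dominant modality from each modality's Shapley contribution to accuracy, can be sketched as follows. This is an illustrative exact computation over all orderings, not the authors' implementation; the two modality names and the subset accuracies in `ACCURACY` are made-up numbers.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = frozenset()
        for p in order:
            # Marginal gain from adding p to the players already present.
            totals[p] += value(seen | {p}) - value(seen)
            seen = seen | {p}
    return {p: t / len(orders) for p, t in totals.items()}

# Hypothetical validation accuracy for each subset of modalities.
ACCURACY = {
    frozenset(): 0.10,
    frozenset({"audio"}): 0.40,
    frozenset({"video"}): 0.70,
    frozenset({"audio", "video"}): 0.80,
}

phi = shapley_values(["audio", "video"], ACCURACY.__getitem__)
dominant = max(phi, key=phi.get)  # modality with the largest contribution
```

In this toy setup the contributions add up to the accuracy gained over the empty baseline, and the modality with the largest share is the one treated as dominant. Exact enumeration grows factorially with the number of modalities, so it is only practical for a handful of them.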

[59] IBIS: A Powerful Hybrid Architecture for Human Activity Recognition

Alison M. Fernandes,Hermes I. Del Monego,Bruno S. Chang,Anelise Munaretto,Hélder M. Fontes,Rui L. Campos

Main category: cs.CV

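TL;DR: The paper proposes IBIS, a hybrid architecture that combines Inception-BiLSTM with an SVM to address overfitting in Wi-Fi sensing, and reaches nearly 99% accuracy in movement recognition.

Details Motivation: Wi-Fi sensing has attracted wide interest in healthcare, space occupancy analysis, and similar applications because it is low-cost and non-intrusive, but many existing models overfit and fail to generalize to new data.

Contribution: The IBIS hybrid architecture pairs the feature extraction of Inception-BiLSTM with the robust classification of an SVM to improve generalization.

Method: Inception-BiLSTM extracts features from the time-series data and an SVM then classifies them, forming a hybrid model.

Result: On a movement recognition task using Doppler-derived data, IBIS reaches nearly 99% accuracy.

Insight: A hybrid architecture can combine the feature extraction of deep learning with the robustness of classical machine learning models, offering a new way to tackle overfitting.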

Abstract: The increasing interest in Wi-Fi sensing stems from its potential to capture environmental data in a low-cost, non-intrusive way, making it ideal for applications like healthcare, space occupancy analysis, and gesture-based IoT control. However, a major limitation in this field is the common problem of overfitting, where models perform well on training data but fail to generalize to new data. To overcome this, we introduce a novel hybrid architecture that integrates Inception-BiLSTM with a Support Vector Machine (SVM), which we refer to as IBIS. Our IBIS approach is uniquely engineered to improve model generalization and create more robust classification boundaries. By applying this method to Doppler-derived data, we achieve a movement recognition accuracy of nearly 99%. Comprehensive performance metrics and confusion matrices confirm the significant effectiveness of our proposed solution.

[60] FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning

Reza Saadati Fard,Emmanuel Agu,Palawat Busaranuvong,Deepak Kumar,Shefalika Gautam,Bengisu Tulu,Diane Strong,Lorraine Loretz

Main category: cs.CV

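TL;DR: FT-ARM is a fine-tuned agentic reflection model built on a multimodal large language model for pressure ulcer severity classification. Reasoning over visual and textual features improves both classification accuracy and interpretability.

Details Motivation: Classifying pressure ulcers (PUs) is hard because the visual distinctions are subtle and interpretation is subjective. Existing AI methods (such as CNNs and ViTs) reach good accuracy but offer limited interpretability, so a more reliable and transparent solution is needed.

Contribution: 1. Proposes FT-ARM, which combines multimodal input with an agentic self-reflection mechanism; 2. Reaches 85% classification accuracy on the PIID dataset, surpassing prior CNN-based models by 4%; 3. Supports live inference and produces natural-language explanations.

Method: Fine-tunes LLaMA 3.2 90B and refines predictions through iterative reasoning over visual features and clinical knowledge encoded from text.

Result: Achieves 85% accuracy on the PIID dataset, provides clinically grounded explanations, and is designed and tested for live inference.

Insight: Integrating visual and textual modalities with a reflection mechanism is an effective route to more reliable and transparent medical AI.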

Abstract: Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I-IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU stages I-IV, surpassing prior CNN-based models by +4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.

[61] Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8

Zahra Ebrahimi Vargoorani,Amir Mohammad Ghoreyshi,Ching Yee Suen

Main category: cs.CV

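TL;DR: The paper proposes a semi-supervised learning framework based on YOLOv8 and Grounding DINO for efficient license plate recognition, improving model performance by combining pseudo-labels with a small amount of manually labeled data.

Details Motivation: Automatic license plate recognition (ALPR) systems face many practical challenges, including lighting, weather, fast-moving vehicles, and low-resolution images. Efficient solutions are needed to support critical tasks such as traffic management and law enforcement.

Contribution: Proposes a semi-supervised method that combines YOLOv8 with Grounding DINO, uses pseudo-labeling to reduce the need for manual annotation, and achieves high recall on two datasets (94% on CENPARMI, 91% on UFPR-ALPR).

Method: Uses YOLOv8 for license plate detection and recognition, and combines pseudo-labels generated by Grounding DINO with a small set of manually labeled data in a semi-supervised framework to improve training efficiency and performance.

Result: Reaches 94% and 91% recall on the CENPARMI and UFPR-ALPR datasets respectively, and also reports character error rates for both.

Insight: Semi-supervised learning with pseudo-labels can greatly reduce manual annotation cost while maintaining strong performance, which suits real-world license plate recognition.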

Abstract: Developing a highly accurate automatic license plate recognition system (ALPR) is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, it reports character error rates for both datasets, providing additional insight into system performance.
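
The semi-supervised setup described above, a small manually labeled set extended with model-generated boxes, can be sketched as a simple merge step. This is a minimal sketch, not the authors' pipeline: the detector call itself is omitted, and the data layout, the `min_score` threshold, and the rule that manual labels take precedence are all assumptions.

```python
def build_training_set(manual_labels, pseudo_labels, min_score=0.5):
    """Merge manual boxes with confident pseudo-labels.

    manual_labels: {image_id: [box, ...]}            (human-verified)
    pseudo_labels: {image_id: [(box, score), ...]}   (model-generated)
    Boxes are (x1, y1, x2, y2) tuples.
    """
    dataset = {img: list(boxes) for img, boxes in manual_labels.items()}
    for img, detections in pseudo_labels.items():
        if img in dataset:
            continue  # human-verified annotations take precedence
        kept = [box for box, score in detections if score >= min_score]
        if kept:  # skip images with no confident detection
            dataset[img] = kept
    return dataset

# Hypothetical inputs: one manually labeled image, three auto-annotated ones.
manual = {"img1": [(10, 20, 110, 60)]}
pseudo = {
    "img1": [((0, 0, 5, 5), 0.90)],
    "img2": [((30, 40, 130, 80), 0.82), ((1, 1, 2, 2), 0.20)],
    "img3": [((5, 5, 9, 9), 0.10)],
}
train = build_training_set(manual, pseudo, min_score=0.5)
```

Filtering by confidence is one common way to keep label quality while scaling the dataset; the threshold trades the number of added images against the risk of training on wrong boxes.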

[62] Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models

Shunjie-Fabian Zheng,Hyeonjun Lee,Thijs Kooi,Ali Diba

Main category: cs.CV

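TL;DR: The paper proposes a framework that combines vision and language modalities for early breast cancer detection. Adding clinical text descriptions to 2D mammograms improves cancer detection and calcification identification.

Details Motivation: Breast cancer is the most commonly diagnosed malignancy among women in the developed world, and early detection is critical. Existing computer-aided diagnosis (CAD) systems are limited in how they handle multi-modal data and in how feasible they are to deploy clinically, so a more practical method is needed.

Contribution: Proposes a multi-modal framework that combines convolutional neural networks (ConvNets) with language representations, fuses visual and textual information, and outperforms unimodal approaches.

Method: Uses tokenization modules to combine visual features from 2D mammograms with textual descriptors derived from clinical metadata and synthesized radiological reports.

Result: Experiments on multi-national cohort screening mammograms show that the method outperforms unimodal baselines in cancer detection and calcification identification.

Insight: The study demonstrates the clinical practicality of combining vision and language modalities and offers a new paradigm for building VLM-based CAD systems.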

Abstract: Breast cancer remains the most commonly diagnosed malignancy among women in the developed world. Early detection through mammography screening plays a pivotal role in reducing mortality rates. While computer-aided diagnosis (CAD) systems have shown promise in assisting radiologists, existing approaches face critical limitations in clinical deployment - particularly in handling the nuanced interpretation of multi-modal data and feasibility due to the requirement of prior clinical history. This study introduces a novel framework that synergistically combines visual features from 2D mammograms with structured textual descriptors derived from easily accessible clinical metadata and synthesized radiological reports through innovative tokenization modules. Our proposed methods in this study demonstrate that strategic integration of convolutional neural networks (ConvNets) with language representations achieves superior performance to vision transformer-based models while handling high-resolution images and enabling practical deployment across diverse populations. By evaluating it on multi-national cohort screening mammograms, our multi-modal approach achieves superior performance in cancer detection and calcification identification compared to unimodal baselines, with particular improvements. The proposed method establishes a new paradigm for developing clinically viable VLM-based CAD systems that effectively leverage imaging data and contextual patient information through effective fusion mechanisms.

[63] Auto3DSeg for Brain Tumor Segmentation from 3D MRI in BraTS 2023 Challenge

Andriy Myronenko,Dong Yang,Yufan He,Daguang Xu

Main category: cs.CV

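TL;DR: The paper describes the results obtained with Auto3DSeg from MONAI in the BraTS 2023 challenges: three first places and two second places across the brain tumor segmentation tasks.

Details Motivation: To handle the complexity and diversity of brain tumor segmentation from 3D MRI, the authors use an automated tool to improve segmentation performance.

Contribution: The main contribution is an Auto3DSeg-based solution that achieves leading segmentation results across multiple brain tumor segmentation tasks.

Method: Applies the Auto3DSeg tool from the MONAI platform to 3D MRI data for automated brain tumor segmentation.

Result: Across the five BraTS 2023 challenges, the solution took first place in three and second place in two.

Insight: Automated segmentation tools show strong potential on complex medical imaging tasks and adapt to multi-class segmentation needs.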

Abstract: In this work, we describe our solution to the BraTS 2023 cluster of challenges using Auto3DSeg from MONAI. We participated in all 5 segmentation challenges, and achieved 1st place in three of them (the Brain Metastasis, Brain Meningioma, and BraTS-Africa challenges) and 2nd place in the remaining two (the Adult and Pediatric Glioma challenges).

[64] DRIP: Dynamic patch Reduction via Interpretable Pooling

Yusen Peng,Sachin Kumar

Main category: cs.CV

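TL;DR: DRIP proposes a method that dynamically merges tokens in the deeper layers of a visual encoder, significantly reducing computation (GFLOPs) while maintaining performance.

Details Motivation: Pretraining large vision-language models is computationally expensive, which discourages attempts to pretrain from scratch, so efficient token processing is needed.

Contribution: Proposes DRIP, which adapts to the input image and dynamically merges tokens, significantly reducing compute cost while preserving model performance.

Method: Uses Interpretable Pooling to dynamically merge tokens in the deeper layers of the visual encoder, adapting to each input.

Result: In both ImageNet training from scratch and CLIP contrastive pretraining, GFLOPs drop significantly with comparable performance; continual pretraining on a large biology dataset further validates the method.

Insight: Dynamic token merging is an efficient way to cut computation and extends to large-scale pretraining in scientific domains.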

Abstract: Recently, the advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, owing to the large-scale and hence expensive pretraining, the efficiency concern has discouraged researchers from attempting to pretrain a vision language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
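
The idea of input-adaptive token merging can be illustrated with a toy example. The sketch below is not DRIP's Interpretable Pooling, whose details the abstract does not give; it swaps in a simple stand-in rule (average runs of consecutive tokens whose cosine similarity to the first token of the run exceeds a threshold) to show how the token count shrinks depending on the input. All names and the threshold are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_similar_tokens(tokens, threshold=0.9):
    """Average each run of consecutive tokens similar to the run's first token."""
    if not tokens:
        return []
    merged, group = [], [tokens[0]]
    for tok in tokens[1:]:
        if cosine(group[0], tok) >= threshold:
            group.append(tok)  # same region: extend the current run
        else:
            merged.append(mean_vector(group))
            group = [tok]      # dissimilar token starts a new run
    merged.append(mean_vector(group))
    return merged

tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
merged = merge_similar_tokens(tokens, threshold=0.9)
```

Here four tokens collapse to two. Since self-attention cost grows roughly quadratically with the number of tokens, halving the token count in deeper layers cuts that part of the compute to about a quarter, which is the kind of saving token merging targets.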

[65] Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments

Manjunath Prasad Holenarasipura Rajiv,B. M. Vidyavathi

Main category: cs.CV

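TL;DR: The paper proposes a vision-language integration framework that unifies pre-trained visual encoders with large language models for zero-shot scene understanding, markedly improving generalization to new categories and contexts.

Details Motivation: Addresses the challenges of zero-shot scene understanding in real-world environments, where models must recognize new objects, actions, and contexts without prior labeled examples.

Contribution: Proposes a unified vision-language integration model that improves generalization through cross-modal alignment and language grounding.

Method: Combines pre-trained visual encoders (e.g., CLIP, ViT) with language models (e.g., GPT-based architectures), embeds visual inputs and textual prompts into a shared space, and applies multimodal fusion and reasoning layers for contextual interpretation.

Result: Significantly outperforms existing zero-shot models on multiple datasets, with up to an 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics.

Insight: Cross-modal alignment and language grounding are key to improving zero-shot generalization.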

Abstract: Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.

[66] Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

Chanhyeong Yang,Taehoon Song,Jihwan Park,Hyunwoo J. Kim

Main category: cs.CV

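TL;DR: The paper proposes VDRP, a zero-shot human-object interaction (HOI) detection method that uses visual diversity-aware and region-aware prompt learning to address intra-class visual diversity and inter-class visual entanglement, achieving state-of-the-art results on the HICO-DET benchmark.

Details Motivation: Zero-shot HOI detection performs poorly on unseen verb-object pairs, and existing methods struggle with the visual complexity of interactions, including intra-class visual diversity and inter-class visual entanglement.

Contribution: Proposes VDRP, a framework for Visual Diversity and Region-aware Prompt learning, which captures the visual variation of a verb through group-wise visual variance and Gaussian perturbation, and augments the prompt embeddings with region-specific concepts.

Method: Applies a visual diversity-aware prompt learning strategy with Gaussian perturbation, and combines concepts specific to the human, object, and union regions to produce region-aware prompt embeddings.

Result: On the HICO-DET benchmark, VDRP achieves state-of-the-art performance under all four zero-shot evaluation settings.

Insight: Combining visual diversity with region information improves zero-shot HOI detection, particularly for intra-class diversity and inter-class entanglement.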

Abstract: Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.

[67] AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians

Xiyu Zhang,Chong Bao,Yipeng Chen,Hongjia Zhai,Yitong Dong,Hujun Bao,Zhaopeng Cui,Guofeng Zhang

Main category: cs.CV

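TL;DR: The paper proposes AtlasGS, an Atlanta-world guided surface reconstruction method with implicit-structured Gaussians. It addresses the lack of global consistency when reconstructing low-texture regions and, through a semantic Gaussian representation and structure plane regularization, achieves efficient, high-accuracy scene reconstruction.

Details Motivation: Existing methods lack global consistency in low-texture regions, and Gaussian Splatting and implicit SDF fields suffer from discontinuities or computational inefficiency, which leads to a loss of detail.

Contribution: Proposes an Atlanta-world guided implicit-structured Gaussian Splatting method that combines a semantic Gaussian representation with structure plane regularization using learnable plane indicators, enabling efficient, high-accuracy reconstruction of indoor and urban scenes.

Method: Uses the Atlanta-world model to guide reconstruction, proposes a semantic Gaussian representation that predicts the probability of each semantic region, and introduces structure plane regularization for globally accurate surfaces.

Result: In experiments on indoor and urban scenes, the method outperforms existing approaches and delivers higher-quality surface reconstruction.

Insight: Combining semantic information with structural regularization markedly improves global consistency in low-texture regions while preserving high-frequency detail and computational efficiency.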

Abstract: 3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure the accurate surface reconstruction for low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for global accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.

[68] Region-CAM: Towards Accurate Object Regions in Class Activation Maps for Weakly Supervised Learning Tasks

Qingdong Cai,Charith Abhayaratne

Main category: cs.CV

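TL;DR: Region-CAM is a new activation method that extracts semantic information maps (SIMs) and performs semantic information propagation (SIP). It addresses the tendency of conventional CAM methods to highlight only the most discriminative regions and produces activation maps that cover object regions more accurately.

Details Motivation: In weakly supervised learning, conventional CAM methods highlight only the most discriminative regions of the target, fail to cover the whole object or align with its boundaries, and thereby limit the performance of Weakly Supervised Semantic Segmentation (WSSS).

Contribution: Proposes Region-CAM, which uses SIMs and SIP to generate more accurate activation maps, with markedly better object coverage and boundary alignment.

Method: Extracts SIMs from both gradients and features, propagates semantic information with SIP, and generates the activation maps.

Result: Improves mIoU over the original CAM by 13.61 and 13.13 percentage points on the PASCAL VOC training and validation sets and by 16.23 points on the MS COCO validation set, and reaches 51.7% Top-1 localization accuracy on ILSVRC2012.

Insight: Propagating semantic information that combines gradients and features captures object regions more completely and improves weakly supervised tasks.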

Abstract: Class Activation Mapping (CAM) methods are widely applied in weakly supervised learning tasks due to their ability to highlight object regions. However, conventional CAM methods highlight only the most discriminative regions of the target. These highlighted regions often fail to cover the entire object and are frequently misaligned with object boundaries, thereby limiting the performance of downstream weakly supervised learning tasks, particularly Weakly Supervised Semantic Segmentation (WSSS), which demands pixel-wise accurate activation maps to get the best results. To alleviate the above problems, we propose a novel activation method, Region-CAM. Distinct from network feature weighting approaches, Region-CAM generates activation maps by extracting semantic information maps (SIMs) and performing semantic information propagation (SIP) by considering both gradients and features in each of the stages of the baseline classification model. Our approach highlights a greater proportion of object regions while ensuring that the activation maps have precise boundaries that align closely with object edges. Region-CAM achieves 60.12% and 58.43% mean intersection over union (mIoU) using the baseline model on the PASCAL VOC training and validation datasets, respectively, which are improvements of 13.61 and 13.13 percentage points over the original CAM (46.51% and 45.30%). On the MS COCO validation set, Region-CAM achieves 36.38%, a 16.23-point improvement over the original CAM (20.15%). We also demonstrate the superiority of Region-CAM in object localization tasks, using the ILSVRC2012 validation set. Region-CAM achieves 51.7% in Top-1 localization accuracy (Loc1). Compared with LayerCAM, an activation method designed for weakly supervised object localization, Region-CAM achieves 4.5% better performance in Loc1.
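
Since the headline numbers are mIoU scores, a short sketch of how mean intersection over union is computed may help, together with a check that the reported gains are absolute differences between the quoted scores. The function and variable names are illustrative; the label lists are toy data, and the convention of skipping classes absent from both prediction and target is one common choice.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes, for flat label lists.

    Classes that appear in neither prediction nor target are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: 4 pixels, 2 classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
miou = mean_iou(pred, target, num_classes=2)  # (1/2 + 2/3) / 2

# Gains quoted in the abstract, recomputed from the quoted scores.
gain_voc_train = round(60.12 - 46.51, 2)
gain_voc_val = round(58.43 - 45.30, 2)
gain_coco_val = round(36.38 - 20.15, 2)
```

The three recomputed gains match the abstract's 13.61, 13.13, and 16.23, which confirms they are differences in mIoU points rather than relative improvements.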

[69] DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications

Malaisree P,Youwai S,Kitkobsin T,Janrungautai S,Amorndechaphon D,Rojanavasu P

Main category: cs.CV

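TL;DR: DINO-YOLO combines YOLOv12 with DINOv3 self-supervised vision transformers into a data-efficient object detection method, aimed at civil engineering applications with limited data.

Details Motivation: Object detection in civil engineering is held back by a shortage of annotated data and needs an efficient, adaptable solution.

Contribution: Proposes the DINO-YOLO architecture, which integrates DINOv3 features into YOLOv12 at the P0 and P3 stages and markedly improves data efficiency.

Method: Uses self-supervised, pre-trained DINOv3 vision transformers and fuses their features into YOLOv12 at input preprocessing (P0) and mid-backbone (P3).

Result: Achieves substantial improvements across datasets (12.4% on Tunnel Segment Crack detection, 13.7% on Construction PPE, and 88.6% on KITTI) while maintaining real-time inference (30-47 FPS).

Insight: Medium-scale architectures perform best with DualP0P3 integration, while Small-scale architectures need Triple Integration.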

Abstract: Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
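
The quoted frame rates can be cross-checked against the quoted latencies. The sketch below assumes that FPS is simply the reciprocal of per-image latency (batch size 1, no pipelining), which the abstract does not state explicitly; the function name is illustrative.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latency ranges quoted in the abstract, in milliseconds.
dino_yolo_ms = (21.0, 33.0)
baseline_ms = (8.0, 16.0)

dino_yolo_fps = tuple(round(fps_from_latency_ms(ms), 1) for ms in dino_yolo_ms)
baseline_fps = tuple(round(fps_from_latency_ms(ms), 1) for ms in baseline_ms)

print(dino_yolo_fps)  # (47.6, 30.3)
print(baseline_fps)   # (125.0, 62.5)
```

Under that assumption, 21-33 ms corresponds to roughly 30-47 FPS, which matches the real-time range reported above.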

[70] Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective

Wan Jiang,Jing Yan,Ruixuan Zhang,Xiaojing Chen,Changtao Miao,Zhe Li,Chenhao Lin,Yunfeng Diao,Richang Hong

Main category: cs.CV

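TL;DR: The paper proposes ReGap, an AI-generated image detection method grounded in a geometric perspective. It uses dynamic reconstruction error and structured editing operations to improve detection accuracy, addressing the limitations and unreliability of existing methods.

Details Motivation: The rapid progress of generative AI has made detecting AI-generated images a key challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and rely on empirical heuristics, which limits their interpretability and reliability.

Contribution: 1. Introduces the Jacobian-Spectral Lower Bound, which explains reconstruction error from a geometric perspective; 2. Reveals the limitations of existing methods based on static reconstruction error; 3. Proposes ReGap, a training-free method that improves detection accuracy through dynamic reconstruction error.

Method: ReGap applies structured editing operations to introduce controlled perturbations and computes the dynamic reconstruction error before and after editing, which strengthens error separation and thereby improves detection accuracy.

Result: Experiments show that ReGap outperforms existing baselines, is robust to common post-processing operations, and generalizes effectively across diverse conditions.

Insight: The study explains the nature of reconstruction error from a geometric perspective, providing a new theoretical basis and methodological guidance for AI-generated image detection.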

Abstract: The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and rely on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations, and generalizes effectively across diverse conditions.
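
The difference between static and dynamic reconstruction error can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_reconstruct` (projection onto constant vectors) stands in for a real generative reconstructor, `toy_edit` stands in for a structured editing operation, and the dynamic error is taken as a plain difference of errors. The abstract does not specify ReGap's actual reconstruction model, edits, or scoring rule.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def static_error(x: Vector, reconstruct: Callable[[Vector], Vector]) -> float:
    """Single-pass reconstruction error of x."""
    return mse(x, reconstruct(x))


def dynamic_error(
    x: Vector,
    reconstruct: Callable[[Vector], Vector],
    edit: Callable[[Vector], Vector],
) -> float:
    """Change in reconstruction error caused by a controlled edit (assumed form)."""
    return static_error(edit(x), reconstruct) - static_error(x, reconstruct)


def toy_reconstruct(x: Vector) -> List[float]:
    """Hypothetical reconstructor: project onto the 'manifold' of constant vectors."""
    mean = sum(x) / len(x)
    return [mean] * len(x)


def toy_edit(x: Vector) -> List[float]:
    """Hypothetical structured edit: add a fixed alternating perturbation."""
    return [v + (0.5 if i % 2 == 0 else -0.5) for i, v in enumerate(x)]


on_manifold = [1.0, 1.0, 1.0, 1.0]   # plays the role of a generated image
off_manifold = [0.0, 2.0, 0.0, 2.0]  # plays the role of a real image

print(static_error(on_manifold, toy_reconstruct))              # 0.0
print(dynamic_error(on_manifold, toy_reconstruct, toy_edit))   # 0.25
print(dynamic_error(off_manifold, toy_reconstruct, toy_edit))  # -0.75
```

The toy only shows how the two quantities are computed; which direction of change separates real from generated images in practice is a property of the actual method and is not implied by this example.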

[71] EA3D: Online Open-World 3D Object Extraction from Streaming Videos

Xiaoyu Zhou,Jingqi Wang,Yuang Jia,Yongtao Wang,Deqing Sun,Ming-Hsuan Yang

Main category: cs.CV

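TL;DR: EA3D is a unified online framework that extracts open-world 3D objects from streaming video while performing geometric reconstruction and scene understanding at the same time.

Details Motivation: Existing 3D scene understanding methods depend on offline-collected multi-view data or pre-constructed 3D geometry, which limits their use in online, dynamic scenes. EA3D targets this gap.

Contribution: Proposes EA3D, a unified online framework that extracts 3D objects from a video stream on the fly and jointly delivers geometric reconstruction and semantic understanding.

Method: Vision-language and 2D vision foundation encoders interpret each video frame; a feed-forward online update strategy embeds the extracted knowledge into a Gaussian feature map; a recurrent joint optimization module then improves both geometric and semantic reconstruction.

Result: Experiments on rendering, segmentation, 3D bounding box estimation, and other tasks demonstrate EA3D's effectiveness for unified online 3D reconstruction and scene understanding.

Insight: EA3D offers an efficient framework for 3D reconstruction and semantic understanding from streaming video, broadening the application potential of open-world 3D object extraction.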

Abstract: Current 3D scene understanding methods are limited by offline-collected multi-view data or pre-constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open-world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision-language and 2D vision foundation encoders to extract object-level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed-forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model’s attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo-realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.

[72] Towards Real-Time Inference of Thin Liquid Film Thickness Profiles from Interference Patterns Using Vision Transformers

Gautam A. Viruthagiri,Arnuv Tandon,Gerald G. Fuller,Vinny Chandran Suja

Main category: cs.CV

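TL;DR: The paper proposes a Vision Transformer (ViT) based method for real-time inference of thin liquid film thickness profiles from interferograms, addressing the noise sensitivity and computational cost of traditional methods.

Details Motivation: Thin film interferometry has important applications in fields such as ophthalmology, but phase periodicity, noise, and ambient artifacts make real-time reconstruction of thickness profiles difficult for traditional methods, which limits clinical adoption.

Contribution: 1. Proposes a data-driven Vision Transformer approach that infers thickness profiles directly from interferograms in real time; 2. Addresses phase ambiguity and temporal consistency, making the approach suitable for dynamic interferogram analysis.

Method: 1. Trains the model on a hybrid dataset combining synthetic and experimental data; 2. Uses the Vision Transformer's ability to capture long-range spatial correlations to resolve phase ambiguity; 3. Achieves fast inference with a single forward pass.

Result: The method performs well on noisy, rapidly evolving films, surpasses conventional phase-unwrapping and iterative fitting methods, and runs in real time on consumer hardware.

Insight: 1. The long-range modeling ability of Vision Transformers suits interferogram analysis; 2. Data-driven methods could replace manual or computationally intensive techniques and advance real-time clinical diagnosis.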

Abstract: Thin film interferometry is a powerful technique for non-invasively measuring liquid film thickness with applications in ophthalmology, but its clinical translation is hindered by the challenges of reconstructing thickness profiles from interference patterns - an ill-posed inverse problem complicated by phase periodicity, imaging noise and ambient artifacts. Traditional reconstruction methods are either computationally intensive, sensitive to noise, or require manual expert analysis, which is impractical for real-time diagnostics. To address this challenge, we present a vision transformer-based approach for real-time inference of thin liquid film thickness profiles directly from isolated interferograms. Trained on a hybrid dataset combining physiologically relevant synthetic and experimental tear film data, our model leverages long-range spatial correlations to resolve phase ambiguities and reconstruct temporally coherent thickness profiles in a single forward pass from dynamic interferograms acquired in vivo and ex vivo. The network demonstrates state-of-the-art performance on noisy, rapidly evolving films with motion artifacts, overcoming limitations of conventional phase-unwrapping and iterative fitting methods. Our data-driven approach enables automated, consistent thickness reconstruction at real-time speeds on consumer hardware, opening new possibilities for continuous monitoring of pre-lens ocular tear films and non-invasive diagnosis of conditions such as dry eye disease.
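
The phase periodicity that makes this inverse problem ill-posed can be seen in a minimal sketch. It assumes an idealized monochromatic two-beam interference model at normal incidence, ignoring reflection phase shifts and multi-beam effects; the wavelength (550 nm) and refractive index (1.33) are illustrative values, not taken from the paper.

```python
import math


def two_beam_intensity(
    thickness_nm: float,
    wavelength_nm: float = 550.0,    # assumed illumination wavelength
    refractive_index: float = 1.33,  # assumed water-like film
) -> float:
    """Normalized intensity of an idealized two-beam interference model."""
    phase = 4.0 * math.pi * refractive_index * thickness_nm / wavelength_nm
    return 0.5 * (1.0 + math.cos(phase))


# Thickness change that advances the phase by exactly 2*pi.
period_nm = 550.0 / (2.0 * 1.33)

h = 100.0
i1 = two_beam_intensity(h)
i2 = two_beam_intensity(h + period_nm)

# Two different thicknesses produce the same pixel intensity.
print(math.isclose(i1, i2, abs_tol=1e-9))  # True
```

Because a single pixel cannot distinguish thicknesses that differ by a multiple of this period, the thickness has to be inferred from spatial context, which is where the long-range correlations mentioned in the abstract come in.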

[73] Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation

Wenhao Zheng,Chenwei Sun,Wenbo Zhang,Jiancheng Lv,Xianggen Liu

Main category: cs.CV

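TL;DR: Proposes TGBFN, a new framework for quantitatively constrained CAD generation that handles multi-modal data in a unified continuous and differentiable parameter space.

Details Motivation: Generative modeling for multi-modal data, such as parametric CAD sequences, lags behind because of the challenges posed by long-range constraints and parameter sensitivity.

Contribution: For the first time, handles the multi-modality of CAD sequences in a unified continuous and differentiable parameter space, and introduces a guided Bayesian flow to control CAD properties.

Method: Proposes the Target-Guided Bayesian Flow Network (TGBFN), which penetrates the parameter update kernel and introduces a guided Bayesian flow.

Result: TGBFN achieves state-of-the-art performance on single-condition and multi-condition constrained generation tasks.

Insight: TGBFN's success suggests that handling multi-modal data in a unified continuous and differentiable parameter space is feasible and effective.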

Abstract: Deep generative models, such as diffusion models, have shown promising progress in image generation and audio generation via simplified continuity assumptions. However, the development of generative modeling techniques for generating multi-modal data, such as parametric CAD sequences, still lags behind due to the challenges in addressing long-range constraints and parameter sensitivity. In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). For the first time, TGBFN handles the multi-modality of CAD sequences (i.e., discrete commands and continuous parameters) in a unified continuous and differentiable parameter space rather than in the discrete data space. In addition, TGBFN penetrates the parameter update kernel and introduces a guided Bayesian flow to control the CAD properties. To evaluate TGBFN, we construct a new dataset for quantitatively constrained CAD generation. Extensive comparisons across single-condition and multi-condition constrained generation tasks demonstrate that TGBFN achieves state-of-the-art performance in generating high-fidelity, condition-aware CAD sequences. The code is available at https://github.com/scu-zwh/TGBFN.

[74] A Study on Inference Latency for Vision Transformers on Mobile Devices

Zhuojin Li,Marco Paolieri,Leana Golubchik

Main category: cs.CV

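TL;DR: The paper studies the inference latency of 190 real-world vision transformers (ViTs) on mobile devices, compares them with 102 convolutional neural networks (CNNs), and identifies the key factors behind ViT latency. Building on this, the authors construct a dataset of 1000 synthetic ViTs and show that the inference latency of new ViTs is predictable.

Details Motivation: With machine learning, and computer vision in particular, advancing rapidly on mobile devices, there is a gap between ViT performance and real-world application needs, so the inference latency characteristics of ViTs on mobile devices need closer study.

Contribution: 1. Quantitatively analyzes the latency of 190 real-world ViTs and 102 CNNs; 2. Identifies the key factors that influence ViT latency; 3. Builds a dataset of 1000 synthetic ViTs; 4. Shows that the inference latency of new ViTs can be predicted.

Method: Compares the latency of real-world ViTs and CNNs, analyzes the influencing factors, generates a synthetic ViT dataset from representative building blocks and state-of-the-art architectures, and measures it on two machine learning frameworks and six mobile platforms to validate the accuracy of latency prediction.

Result: ViT inference latency depends on several factors, and the constructed dataset allows it to be predicted with sufficient accuracy for real-world applications.

Insight: Optimizing ViTs for mobile devices requires considering architecture design and hardware platform together; the dataset is a useful reference for future model design and deployment.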

Abstract: Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.

[75] $D^2GS$: Dense Depth Regularization for LiDAR-free Urban Scene Reconstruction

Kejing Xia,Jidong Jia,Ke Jin,Yucai Bai,Li Sun,Dacheng Tao,Youjian Zhang

Main category: cs.CV

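TL;DR: $D^2GS$ is a LiDAR-free urban scene reconstruction framework that uses dense depth regularization and geometric priors to substantially improve reconstruction accuracy, even outperforming LiDAR-dependent methods.

Details Motivation: Current urban scene reconstruction methods depend on multimodal sensors such as LiDAR and cameras, but acquiring and calibrating LiDAR data is difficult. The paper proposes a LiDAR-free method that achieves high-quality reconstruction from dense depth predictions and geometric priors.

Contribution: 1) Proposes $D^2GS$, a LiDAR-free reconstruction framework; 2) Initializes a dense point cloud from multi-view depth predictions and improves its consistency through progressive pruning; 3) Jointly optimizes Gaussian geometry and a Depth Enhancer, using diffusion priors to improve depth map quality; 4) Improves ground geometry accuracy by constraining the attributes of Gaussians in road regions.

Method: 1) Initializes a dense point cloud from multi-view depth predictions and refines it through progressive pruning; 2) Jointly optimizes Gaussian geometry and the Depth Enhancer, using diffusion priors to improve the depth maps; 3) Constrains the attributes of Gaussians in road regions to improve ground geometry.

Result: On the Waymo dataset, $D^2GS$ outperforms existing methods and is more accurate even than methods that rely on LiDAR.

Insight: Combining dense depth regularization with geometric priors can effectively replace LiDAR and improve the accuracy and robustness of urban scene reconstruction.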

Abstract: Recently, Gaussian Splatting (GS) has shown great potential for urban scene reconstruction in the field of autonomous driving. However, current urban scene reconstruction methods often depend on multimodal sensors as inputs, i.e., LiDAR and images. Though the geometry prior provided by LiDAR point clouds can largely mitigate ill-posedness in reconstruction, acquiring such accurate LiDAR data is still challenging in practice: i) precise spatiotemporal calibration between LiDAR and other sensors is required, as they may not capture data simultaneously; ii) reprojection errors arise from spatial misalignment when LiDAR and cameras are mounted at different locations. To avoid the difficulty of acquiring accurate LiDAR depth, we propose $D^2GS$, a LiDAR-free urban scene reconstruction framework. In this work, we obtain geometry priors that are as effective as LiDAR while being denser and more accurate. First, we initialize a dense point cloud by back-projecting multi-view metric depth predictions. This point cloud is then optimized by a Progressive Pruning strategy to improve the global consistency. Second, we jointly refine Gaussian geometry and predicted dense metric depth via a Depth Enhancer. Specifically, we leverage diffusion priors from a depth foundation model to enhance the depth maps rendered by Gaussians. In turn, the enhanced depths provide stronger geometric constraints during Gaussian training. Finally, we improve the accuracy of ground geometry by constraining the shape and normal attributes of Gaussians within road regions. Extensive experiments on the Waymo dataset demonstrate that our method consistently outperforms state-of-the-art methods, producing more accurate geometry even when compared with those using ground-truth LiDAR data.
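
The first step, back-projecting a metric depth map into a point cloud, can be sketched with the standard pinhole camera model. This is a generic illustration, not the authors' implementation: the intrinsics (`fx`, `fy`, `cx`, `cy`) and the tiny depth map are made up, points stay in the camera frame, and merging several views would additionally require camera poses.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift an HxW depth map (metres) to 3D camera-frame points (pinhole model)."""
    points: List[Point] = []
    for v, row in enumerate(depth):      # v: pixel row
        for u, z in enumerate(row):      # u: pixel column
            if z <= 0.0:                 # skip pixels without valid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map with one invalid pixel.
depth = [
    [2.0, 2.0],
    [0.0, 4.0],
]
points = backproject_depth(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(points)  # [(-0.5, -0.5, 2.0), (0.5, -0.5, 2.0), (1.0, 1.0, 4.0)]
```

Every valid pixel yields a point, which is why a point cloud initialized this way is much denser than a typical LiDAR sweep.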

[76] Test-Time Adaptive Object Detection with Foundation Model

Yingjie Gao,Yanan Zhang,Zhi Cai,Di Huang

Main category: cs.CV

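TL;DR: The paper proposes a test-time adaptive object detection method built on a multi-modal prompt-based Mean-Teacher framework. It uses a vision-language detector to achieve source-free domain adaptation and adds a dynamic memory module to further improve pseudo-label quality, substantially outperforming existing methods.

Details Motivation: Existing test-time adaptive object detection methods rely on source-derived statistics and assume that source and target domains share the same category space, which limits practical use. The paper addresses this with a method that needs no source data and adapts to arbitrary domains and categories.

Contribution: 1. Proposes the first foundation model-powered test-time adaptive object detection method; 2. Designs a multi-modal prompt framework and a Test-time Warm-start strategy; 3. Introduces an Instance Dynamic Memory module together with Memory Enhancement and Memory Hallucination strategies.

Method: A vision-language detector drives the adaptation; text and visual prompts adjust the language and vision representation spaces; the dynamic memory module stores high-quality pseudo-labels and refines predictions through the enhancement and hallucination strategies.

Result: On cross-corruption and cross-dataset benchmarks, the method clearly outperforms prior work and adapts more flexibly.

Insight: Combining a foundation model with multi-modal prompts enables efficient adaptation without source data, and the dynamic memory module further improves pseudo-label reliability.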

Abstract: In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies, Memory Enhancement and Memory Hallucination, to leverage IDM’s high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
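
In a standard Mean-Teacher setup, the teacher's parameters track the student's through an exponential moving average (EMA). The sketch below shows that generic update on scalar stand-ins for the tunable prompts. It is an assumption that the paper follows this standard form; the parameter names and the momentum value are illustrative only.

```python
from typing import Dict


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    momentum: float = 0.999,
) -> Dict[str, float]:
    """Move teacher parameters toward the student with an exponential moving average."""
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Hypothetical scalar parameters standing in for text and visual prompts.
teacher = {"text_prompt": 0.0, "visual_prompt": 1.0}
student = {"text_prompt": 1.0, "visual_prompt": 1.0}

# A small momentum is used here only to make the arithmetic visible.
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)  # text_prompt moves from 0.0 to about 0.1; visual_prompt stays at about 1.0
```

With a momentum close to 1, the teacher changes slowly, which is what makes its predictions a comparatively stable source of pseudo-labels during online adaptation.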

[77] AI-Powered Early Detection of Critical Diseases using Image Processing and Audio Analysis

Manisha More,Kavya Bhand,Kaustubh Mukdam,Kavya Sharma,Manas Kawtikwar,Hridayansh Kaware,Prajwal Kavhar

Main category: cs.CV

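TL;DR: The paper proposes a multimodal AI diagnostic framework that combines image processing and audio analysis for early detection of skin cancer, vascular blood clots, and cardiopulmonary abnormalities. The approach is lightweight and suited to low-cost devices.

Details Motivation: Existing diagnostic techniques are costly, invasive, and hard to access in low-resource regions, so a scalable, real-time, and affordable AI solution for early diagnosis is needed.

Contribution: Proposes a multimodal AI framework that combines MobileNetV2, an SVM, and a Random Forest for skin cancer, blood clot, and cardiopulmonary abnormality detection.

Method: MobileNetV2 classifies skin lesions, an SVM detects blood clots from thermal images, and a Random Forest analyzes heart and lung sounds using MFCC features.

Result: Skin cancer classification reaches 89.3% accuracy, clot detection reaches an AUC of 0.89, and cardiopulmonary analysis reaches 87.2% accuracy; the abstract describes these results as competitive with state-of-the-art models.

Insight: The multimodal AI framework could be extended to other diseases, and its lightweight design suits resource-limited settings.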

Abstract: Early diagnosis of critical diseases can significantly improve patient survival and reduce treatment costs. However, existing diagnostic techniques are often costly, invasive, and inaccessible in low-resource regions. This paper presents a multimodal artificial intelligence (AI) diagnostic framework integrating image analysis, thermal imaging, and audio signal processing for early detection of three major health conditions: skin cancer, vascular blood clots, and cardiopulmonary abnormalities. A fine-tuned MobileNetV2 convolutional neural network was trained on the ISIC 2019 dataset for skin lesion classification, achieving 89.3% accuracy, 91.6% sensitivity, and 88.2% specificity. A support vector machine (SVM) with handcrafted features was employed for thermal clot detection, achieving 86.4% accuracy (AUC = 0.89) on synthetic and clinical data. For cardiopulmonary analysis, lung and heart sound datasets from PhysioNet and Pascal were processed using Mel-Frequency Cepstral Coefficients (MFCC) and classified via Random Forest, reaching 87.2% accuracy and 85.7% sensitivity. Comparative evaluation against state-of-the-art models demonstrates that the proposed system achieves competitive results while remaining lightweight and deployable on low-cost devices. The framework provides a promising step toward scalable, real-time, and accessible AI-based pre-diagnostic healthcare solutions.
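
Accuracy, sensitivity, and specificity all derive from the same confusion matrix. The sketch below shows the standard definitions for a binary decision (for example, malignant versus benign). The counts are hypothetical and do not come from the paper, and the abstract does not say how the multi-class ISIC 2019 labels were reduced to these metrics.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard accuracy, sensitivity (recall), and specificity from counts."""
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of true positives that are caught
        "specificity": tn / (tn + fp),  # share of true negatives that are cleared
    }


# Hypothetical counts for 100 diseased and 100 healthy cases.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)  # {'accuracy': 0.85, 'sensitivity': 0.9, 'specificity': 0.8}
```

For a pre-diagnostic screening tool, sensitivity is usually the figure to watch, since a missed case costs more than a false alarm that is later cleared by a clinician.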

[78] U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching

Junsheng Zhou,Xingyu Shi,Haichuan Song,Yi Fang,Yu-Shen Liu,Zhizhong Han

Main category: cs.CV

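TL;DR: U-CAN is an unsupervised point cloud denoising framework that uses consistency-aware noise matching and a geometric consistency constraint to substantially improve point cloud and image denoising.

Details Motivation: Most existing methods rely on supervised learning with noisy-clean point cloud pairs, which requires extensive manual effort. U-CAN addresses this with an unsupervised approach.

Contribution: 1. Proposes U-CAN, an unsupervised point cloud denoising framework; 2. Designs a noise matching scheme and a multi-step denoising path; 3. Introduces a geometric consistency constraint that applies to both 3D point cloud and 2D image denoising.

Method: 1. Uses a neural network to infer a multi-step denoising path; 2. Reasons statistically over multiple noisy observations through noise matching; 3. Introduces a geometric consistency constraint to learn denoising patterns.

Result: On point cloud denoising, upsampling, and image denoising, U-CAN clearly outperforms existing unsupervised methods and is comparable to supervised methods.

Insight: The geometric consistency constraint is a general term that extends to other domains such as 2D images, highlighting the potential of unsupervised methods.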

Abstract: Point clouds captured by scanning sensors are often perturbed by noise, which has a highly negative impact on downstream tasks (e.g., surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensive manual effort. In this work, we introduce U-CAN, an Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching. Specifically, we leverage a neural network to infer a multi-step denoising path for each point of a shape or scene with a noise-to-noise matching scheme. We achieve this with a novel loss that enables statistical reasoning on multiple noisy point cloud observations. We further introduce a novel constraint on the denoised geometry consistency for learning consistency-aware denoising patterns. We justify that the proposed constraint is a general term which is not limited to the 3D domain and can also contribute to the area of 2D image denoising. Our evaluations on widely used benchmarks in point cloud denoising, upsampling, and image denoising show significant improvement over state-of-the-art unsupervised methods, and U-CAN also produces results comparable to supervised methods.

[79] MSF-Net: Multi-Stage Feature Extraction and Fusion for Robust Photometric Stereo

Shiyu Qin,Zhihao Cai,Kaixuan Wang,Lin Qi,Junyu Dong

Main category: cs.CV

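TL;DR: MSF-Net is a multi-stage feature extraction and fusion network that addresses the feature redundancy of existing methods in complex regions such as wrinkles and edges, significantly improving surface normal estimation accuracy in photometric stereo.

Details Motivation: Existing learning-based photometric stereo methods fall short in multi-stage feature extraction and feature interaction, which leads to redundant features and reduced accuracy in complex regions.

Contribution: 1. Proposes MSF-Net, a multi-stage feature extraction framework; 2. Designs a selective update strategy; 3. Develops a feature fusion module to strengthen feature interaction.

Method: 1. Extracts features in stages; 2. Improves feature quality through the selective update strategy; 3. Uses the feature fusion module to integrate features from different stages.

Result: On the DiLiGenT benchmark, MSF-Net significantly outperforms existing methods in surface normal estimation accuracy.

Insight: Multi-stage feature extraction and selective fusion effectively improve normal estimation in complex regions.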

Abstract: Photometric stereo is a technique for determining surface normals from shading cues in images taken under different lighting conditions. However, existing learning-based approaches often fail to accurately capture features at multiple stages and do not adequately promote interaction between these features. Consequently, these models tend to extract redundant features, especially in areas with intricate details such as wrinkles and edges. To tackle these issues, we propose MSF-Net, a novel framework for extracting information at multiple stages, paired with a selective update strategy, aiming to extract high-quality feature information, which is critical for accurate normal construction. Additionally, we have developed a feature fusion module to improve the interplay among different features. Experimental results on the DiLiGenT benchmark show that our proposed MSF-Net significantly surpasses previous state-of-the-art methods in the accuracy of surface normal estimation.

[80] Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation

Quang-Khai Bui-Tran,Thanh-Huy Nguyen,Hoang-Thien Nguyen,Ba-Thinh Lam,Nguyen Lan Vi Vu,Phat K. Huynh,Ulas Bagci,Min Xu

Main category: cs.CV

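TL;DR: The paper proposes a new Source-Free Domain Adaptation (SFDA) framework that progressively aligns target distributions through hard sample selection and denoised patch mixing, improving the accuracy of medical image segmentation.

Details Motivation: Existing source-free domain adaptation methods for medical image segmentation under privacy constraints often ignore sample difficulty and are easily disturbed by noisy supervision under domain shift.

Contribution: 1. Proposes a hard sample selection strategy based on entropy-similarity analysis; 2. Refines pseudo-labels with Monte Carlo-based denoising masks; 3. Designs intra-domain and inter-domain patch mixing objectives.

Method: The framework consists of: 1. Partitioning samples into subsets by difficulty; 2. Denoising pseudo-labels; 3. Training with patch mixing.

Result: Outperforms existing SFDA and UDA methods on several benchmark datasets, achieving the best Dice and ASSD scores.

Insight: Progressive adaptation and denoised supervision are essential for robust segmentation under domain shift.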

Abstract: Source-Free Domain Adaptation (SFDA) is emerging as a compelling solution for medical image segmentation under privacy constraints, yet current approaches often ignore sample difficulty and struggle with noisy supervision under domain shift. We present a new SFDA framework that leverages Hard Sample Selection and Denoised Patch Mixing to progressively align target distributions. First, unlabeled images are partitioned into reliable and unreliable subsets through entropy-similarity analysis, allowing adaptation to start from easy samples and gradually incorporate harder ones. Next, pseudo-labels are refined via Monte Carlo-based denoising masks, which suppress unreliable pixels and stabilize training. Finally, intra- and inter-domain objectives mix patches between subsets, transferring reliable semantics while mitigating noise. Experiments on benchmark datasets show consistent gains over prior SFDA and UDA methods, delivering more accurate boundary delineation and achieving state-of-the-art Dice and ASSD scores. Our study highlights the importance of progressive adaptation and denoised supervision for robust segmentation under domain shift.
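
The partitioning step can be illustrated with an entropy-only simplification. This sketch is not the authors' procedure: the abstract describes an entropy-similarity analysis, and the similarity term, the threshold value, and the data layout below are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_entropy(prob_maps: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (nats) over per-pixel class-probability vectors."""
    if not prob_maps:
        raise ValueError("need at least one pixel")
    total = 0.0
    for probs in prob_maps:
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_maps)


def partition_by_entropy(
    images: Dict[str, Sequence[Sequence[float]]],
    threshold: float,  # hypothetical cut-off; not specified in the abstract
) -> Tuple[List[str], List[str]]:
    """Split image names into (reliable, unreliable) by mean prediction entropy."""
    reliable: List[str] = []
    unreliable: List[str] = []
    for name, prob_maps in images.items():
        target = reliable if mean_entropy(prob_maps) <= threshold else unreliable
        target.append(name)
    return reliable, unreliable


# Two hypothetical images, each with two pixels and two classes.
images = {
    "scan_a": [[0.99, 0.01], [0.98, 0.02]],  # confident predictions
    "scan_b": [[0.55, 0.45], [0.50, 0.50]],  # uncertain predictions
}
reliable, unreliable = partition_by_entropy(images, threshold=0.3)
print(reliable, unreliable)  # ['scan_a'] ['scan_b']
```

Adaptation would then begin on the low-entropy subset and bring in the high-entropy subset later, in line with the easy-to-hard schedule the abstract describes.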

[81] DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

Yinqi Cai,Jichang Li,Zhaolun Li,Weikai Chen,Rushi Lan,Xi Xie,Xiaonan Luo,Guanbin Li

Main category: cs.CV

TL;DR: DeepShield is a novel deepfake detection framework combining local patch guidance and global forgery diversification to improve robustness against unseen manipulation techniques.

Details Motivation: Deepfake detection suffers from poor generalization across diverse manipulation techniques due to reliance on forgery-specific artifacts.

Contribution: DeepShield integrates Local Patch Guidance (LPG) for fine-grained inconsistencies and Global Forgery Diversification (GFD) for domain feature augmentation.

Method: Enhances CLIP-ViT with LPG (spatiotemporal artifact modeling) and GFD (domain-bridging feature generation).

Result: Outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations.

Insight: Balancing local sensitivity and global generalization is key to robust deepfake detection.

Abstract: Recent advances in deep generative models have made it easier to manipulate face videos, raising significant concerns about their potential misuse for fraud and misinformation. Existing detectors often perform well in in-domain scenarios but fail to generalize across diverse manipulation techniques due to their reliance on forgery-specific artifacts. In this work, we introduce DeepShield, a novel deepfake detection framework that balances local sensitivity and global generalization to improve robustness across unseen forgeries. DeepShield enhances the CLIP-ViT encoder through two key components: Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). LPG applies spatiotemporal artifact modeling and patch-wise supervision to capture fine-grained inconsistencies often overlooked by global models. GFD introduces domain feature augmentation, leveraging domain-bridging and boundary-expanding feature generation to synthesize diverse forgeries, mitigating overfitting and enhancing cross-domain adaptability. Through the integration of novel local and global analysis for deepfake detection, DeepShield outperforms state-of-the-art methods in cross-dataset and cross-manipulation evaluations, achieving superior robustness against unseen deepfake attacks.

[82] VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations

Qianqian Qiao,DanDan Zheng,Yihang Bo,Bao Peng,Heng Huang,Longteng Jiang,Huaye Wang,Jingdong Chen,Jun Zhou,Xin Jin

Main category: cs.CV

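TL;DR: VADB is a large-scale video aesthetic database with diverse videos and multi-dimensional annotations that supports video aesthetic assessment tasks.

Details Motivation: Research on video aesthetic assessment is limited by the lack of standardized datasets and robust models, and existing image-based methods are hard to apply directly to video.

Contribution: 1. Presents VADB, the largest video aesthetic database; 2. Proposes VADB-Net, a dual-modal pre-training framework.

Method: Uses a dual-modal pre-training framework with a two-stage training strategy that combines the visual and language modalities.

Result: VADB-Net outperforms existing video quality assessment models on scoring tasks.

Insight: Multi-dimensional annotations and dual-modal pre-training help improve video aesthetic assessment.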

Abstract: Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.

[83] Mapping and Classification of Trees Outside Forests using Deep Learning

Moritz Lucas,Hamid Ebrahimy,Viacheslav Barkov,Ralf Pecenka,Kai-Uwe Kühnberger,Björn Waske

Main category: cs.CV

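TL;DR: The paper evaluates deep learning for classifying Trees Outside Forests (TOF) in agricultural landscapes, compares several semantic segmentation architectures, and presents a high-quality dataset; the FT-UNetFormer model achieves good classification accuracy.

Details Motivation: Trees Outside Forests (TOF) play an important role in ecosystems, but most existing studies treat them as a single class or rely on rigid thresholds, which limits ecological interpretation and regional adaptability. The paper aims to improve TOF classification with deep learning.

Contribution: 1. Presents a new TOF classification dataset and deep learning approach; 2. Compares several semantic segmentation architectures and shows the advantage of FT-UNetFormer; 3. Analyzes the challenges in classification and stresses the importance of diverse training data.

Method: 1. Uses high-resolution aerial imagery from four agricultural landscapes; 2. Compares six semantic segmentation architectures spanning CNNs, vision transformers, and hybrid models (e.g., ABCNet, U-Net); 3. Maps four classes: Forest, Patch, Linear, and Tree.

Result: FT-UNetFormer performs best (mean IoU 0.74, mean F1 score 0.84). The Forest and Linear classes are classified well, while the Patch and Tree classes remain challenging.

Insight: Understanding spatial context is essential for TOF classification, and diverse training data helps models generalize.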

Abstract: Trees Outside Forests (TOF) play an important role in agricultural landscapes by supporting biodiversity, sequestering carbon, and regulating microclimates. Yet, most studies have treated TOF as a single class or relied on rigid rule-based thresholds, limiting ecological interpretation and adaptability across regions. To address this, we evaluate deep learning for TOF classification using a newly generated dataset and high-resolution aerial imagery from four agricultural landscapes in Germany. Specifically, we compare convolutional neural networks (CNNs), vision transformers, and hybrid CNN-transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT-UNetFormer, DC-Swin, BANet, and U-Net) to map four categories of woody vegetation: Forest, Patch, Linear, and Tree, derived from previous studies and governmental products. Overall, the models achieved good classification accuracy across the four landscapes, with the FT-UNetFormer performing best (mean Intersection-over-Union 0.74; mean F1 score 0.84), underscoring the importance of spatial context understanding in TOF mapping and classification. Our results show good performance for the Forest and Linear classes and reveal challenges in classifying complex structures with high edge density, notably the Patch and Tree classes. Our generalization experiments highlight the need for regionally diverse training data to ensure reliable large-scale mapping. The dataset and code are openly available at https://github.com/Moerizzy/TOFMapper
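
The two reported metrics are closely related. The sketch below computes per-class IoU and F1 from pixel counts using the standard definitions; the counts are hypothetical and chosen only so that the IoU equals 0.74.

```python
from typing import Tuple


def iou_and_f1(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Per-class Intersection-over-Union and F1 score from pixel counts."""
    if tp + fp + fn == 0:
        raise ValueError("class is absent from both prediction and reference")
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Hypothetical pixel counts for a single class.
iou, f1 = iou_and_f1(tp=74, fp=13, fn=13)
print(iou)           # 0.74
print(round(f1, 4))  # 0.8506

# For a single class the two metrics are tied by F1 = 2 * IoU / (1 + IoU).
print(abs(f1 - 2 * iou / (1 + iou)) < 1e-12)  # True
```

That identity holds per class, not for class averages, so a mean F1 of 0.84 alongside a mean IoU of 0.74 is consistent: averaging over classes with uneven scores pulls the mean F1 slightly below 0.85.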

[84] RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models

Zijun Liao,Yian Zhao,Xin Shan,Yu Yan,Chang Liu,Lei Lu,Xiangyang Ji,Jie Chen

Main category: cs.CV

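TL;DR: RT-DETRv4 is a lightweight real-time object detection framework. By introducing a Deep Semantic Injector (DSI) module and a Gradient-guided Adaptive Modulation (GAM) strategy, it addresses semantic transfer between Vision Foundation Models (VFMs) and lightweight detectors, significantly improving performance and achieving state-of-the-art results on COCO.

Details Motivation: Current lightweight real-time detectors often sacrifice feature representation quality in pursuit of high-speed inference, which limits further performance gains and practical deployment. The paper aims to address this by drawing on the capabilities of Vision Foundation Models (VFMs).

Contribution: Proposes a cost-effective and highly adaptable distillation framework, with a DSI module and a GAM strategy that achieve stable semantic transfer and significantly improve lightweight object detectors.

Method: 1. Introduces the DSI module, which injects high-level VFM features into the deep layers of the detector; 2. Designs the GAM strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios.

Result: RT-DETRv4 attains AP scores of 49.7/53.5/55.4/57.0 on COCO at corresponding speeds of 273/169/124/78 FPS.

Insight: By combining VFM capabilities with a dynamic modulation mechanism, lightweight detectors can gain substantial accuracy without added deployment or inference overhead, which shows potential for practical use.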

Abstract: Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.

[85] LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation

Yang Miao,Jan-Nico Zaech,Xi Wang,Fabien Despinoy,Danda Pani Paudel,Luc Van Gool

Main category: cs.CV

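TL;DR: LangHOPS is a Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. It detects and segments objects and parts through hierarchies grounded in language space and clearly outperforms existing methods.

Details Motivation: Existing methods rely on heuristic or learnable visual grouping and cannot fully exploit hierarchical relations in language space. LangHOPS aims to fill this gap using the knowledge and reasoning capabilities of an MLLM.

Contribution: Introduces an MLLM into the object-part segmentation task for the first time and proposes hierarchy modeling in language space, achieving strong open-vocabulary and cross-dataset performance.

Method: Integrates the MLLM into the object-part parsing pipeline, links multi-granularity concepts in language space, and refines the part query strategy.

Result: Surpasses state-of-the-art methods by 5.5% AP on PartImageNet (in-domain) and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot), validating the method's effectiveness.

Insight: Hierarchy modeling in language space and the integration of MLLM knowledge are key to improving object-part segmentation.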

Abstract: We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.

[86] GaTector+: A Unified Head-free Framework for Gaze Object and Gaze Following Prediction

Yang Jin,Guangyu Guo,Binglu Wang

Main category: cs.CV

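TL;DR: GaTector+ is a unified, head-free framework for gaze object detection and gaze following. A shared backbone with task-specific blocks enables joint optimization and removes the dependence on head priors found in earlier methods.

Details Motivation: Earlier methods usually handle gaze object detection and gaze following separately and depend on head-related prior knowledge, which limits joint optimization and practical use. GaTector+ aims to remove this dependence.

Contribution: 1. GaTector+, a unified head-free framework; 2. a head-based attention mechanism; 3. an attention supervision mechanism that accelerates gaze heatmap learning; 4. a new evaluation metric, mSoC.

Method: Extracts features with a shared backbone and task-specific blocks; embeds a head detection branch; uses a head-based attention mechanism; adds attention supervision; evaluates with the mSoC metric.

Result: Experiments on multiple benchmark datasets confirm the effectiveness of GaTector+ on both gaze object detection and gaze following.

Insight: Removing the dependence on head priors improves flexibility and performance; joint optimization and multi-task learning strengthen generalization.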

Abstract: Gaze object detection and gaze following are fundamental tasks for interpreting human gaze behavior or intent. However, most previous methods solve these two tasks separately, and their predictions of gaze objects and gaze following typically depend on head-related prior knowledge during both the training phase and real-world deployment. This dependency necessitates an auxiliary network to extract head location, thus precluding joint optimization across the entire system and constraining practical applicability. To this end, we propose GaTector+, a unified framework for gaze object detection and gaze following, which eliminates the dependence on head-related priors during inference. Specifically, GaTector+ uses an expanded specific-general-specific feature extractor: a shared backbone extracts general features for gaze following and object detection, while specific blocks before and after the shared backbone account for the specificity of each sub-task. To obtain head-related knowledge without prior information, we first embed a head detection branch to predict the head of each person. Then, before regressing the gaze point, a head-based attention mechanism is proposed to fuse the sense feature and gaze feature with the help of head location. Since suboptimal optimization of the gaze point heatmap creates a performance bottleneck, we propose an attention supervision mechanism to accelerate the learning of the gaze heatmap. Finally, we propose a novel evaluation metric, mean Similarity over Candidates (mSoC), for gaze object detection, which is more sensitive to variations between bounding boxes. The experimental results on multiple benchmark datasets demonstrate the effectiveness of our model in both gaze object detection and gaze following tasks.

[87] Prototype-Driven Adaptation for Few-Shot Object Detection

Yushen Huang,Zhiming Wang

Main category: cs.CV

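TL;DR: The paper proposes Prototype-Driven Alignment (PDA), a lightweight method that mitigates base-class bias and unstable calibration in few-shot object detection and improves novel-class detection through prototype matching and temperature-scaled fusion.

Details Motivation: With very few novel-class samples, few-shot object detection (FSOD) is prone to base-class bias and unstable calibration, so a lightweight, efficient complement to the conventional linear classifier is needed.

Contribution: 1. PDA, a plug-in metric head for DeFRCN that provides a complementary prototype-driven decision; 2. an EMA update mechanism that adapts prototypes without introducing class-specific parameters; 3. experiments showing that PDA consistently improves novel-class detection on VOC FSOD and GFSOD benchmarks.

Method: 1. Maintains support-sample prototypes in a learnable projection space; 2. updates prototypes with EMA; 3. fuses metric similarities with detector outputs using best-of-K matching and temperature scaling.

Result: On VOC FSOD and GFSOD benchmarks, PDA improves novel-class detection with minimal impact on base classes and negligible computational overhead.

Insight: Prototype-driven metric learning can mitigate bias in few-shot detection, and the dynamic update mechanism adds flexibility without adding parameters.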

Abstract: Few-shot object detection (FSOD) often suffers from base-class bias and unstable calibration when only a few novel samples are available. We propose Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN that provides a prototype-based “second opinion” complementary to the linear classifier. PDA maintains support-only prototypes in a learnable identity-initialized projection space and optionally applies prototype-conditioned RoI alignment to reduce geometric mismatch. During fine-tuning, prototypes can be adapted via exponential moving average (EMA) updates on labeled foreground RoIs, without introducing class-specific parameters, and are frozen at inference to ensure strict protocol compliance. PDA employs a best-of-K matching scheme to capture intra-class multi-modality and temperature-scaled fusion to combine metric similarities with detector logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently improves novel-class performance with minimal impact on base classes and negligible computational overhead.
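
Illustrative sketch (not from the paper): the three mechanisms named in the abstract, EMA prototype updates, best-of-K matching, and temperature-scaled fusion, could look roughly like this in plain Python. The function names, the cosine similarity, the momentum of 0.9, and the linear mixing weight are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def ema_update(prototype, feature, momentum=0.9):
    """Move a prototype toward a labeled foreground RoI feature (assumed momentum)."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def best_of_k_similarity(feature, prototypes):
    """Score a class by its best-matching prototype among its K prototypes."""
    return max(cosine(feature, p) for p in prototypes)


def fuse_logits(detector_logits, feature, class_prototypes, temperature=0.1, weight=0.5):
    """Blend detector logits with temperature-scaled metric similarities.

    The linear blend with `weight` is a hypothetical choice, not the paper's rule.
    """
    fused = []
    for logit, prototypes in zip(detector_logits, class_prototypes):
        metric = best_of_k_similarity(feature, prototypes) / temperature
        fused.append((1.0 - weight) * logit + weight * metric)
    return fused
```

In this toy form, prototypes would be updated with `ema_update` only during fine-tuning and left untouched at inference, mirroring the frozen-at-inference behavior the abstract describes.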

[88] MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding

Runxi Huang,Mingxuan Yu,Mingyu Tsoi,Xiaomin Ouyang

Main category: cs.CV

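TL;DR: MMEdge is a new on-device multimodal inference framework based on pipelined sensing and encoding. Incremental computation and cross-modal optimization significantly reduce latency while maintaining high task accuracy.

Details Motivation: Real-time multimodal inference on edge devices is essential for applications such as autonomous driving and human-computer interaction, but prior work overlooks the tight coupling between sensing dynamics and model execution as well as the complex inter-modality dependencies.

Contribution: Proposes the MMEdge framework, which uses a pipelined design for incremental computation, introduces a lightweight temporal aggregation module, and adds an adaptive multimodal configuration optimizer and a cross-modal speculative skipping mechanism.

Method: Decomposes inference into fine-grained sensing and encoding units so that computation proceeds incrementally as data arrive; adds a temporal aggregation module and dynamic optimization mechanisms.

Result: Evaluated on two public multimodal datasets and a UAV testbed, MMEdge significantly reduces end-to-end latency while maintaining high task accuracy.

Insight: A pipelined design exploits the dynamics of multimodal data, and dynamic optimization with skipping offers a new route to efficient multimodal inference on resource-constrained devices.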

Abstract: Real-time multimodal inference on resource-constrained edge devices is essential for applications such as autonomous driving, human-computer interaction, and mobile health. However, prior work often overlooks the tight coupling between sensing dynamics and model execution, as well as the complex inter-modality dependencies. In this paper, we propose MMEdge, a new on-device multimodal inference framework based on pipelined sensing and encoding. Instead of waiting for complete sensor inputs, MMEdge decomposes the entire inference process into a sequence of fine-grained sensing and encoding units, allowing computation to proceed incrementally as data arrive. MMEdge also introduces a lightweight but effective temporal aggregation module that captures rich temporal dynamics across different pipelined units to maintain accuracy. Such pipelined design also opens up opportunities for fine-grained cross-modal optimization and early decision-making during inference. To further enhance system performance under resource variability and input data complexity, MMEdge incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations for each modality under latency constraints, and a cross-modal speculative skipping mechanism that bypasses future units of slower modalities when early predictions reach sufficient confidence. We evaluate MMEdge using two public multimodal datasets and deploy it on a real-world unmanned aerial vehicle (UAV)-based multimodal testbed. The results show that MMEdge significantly reduces end-to-end latency while maintaining high task accuracy across various system and data dynamics.
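
Illustrative sketch (not from the paper): the idea of processing units incrementally and skipping the remaining units of a slower modality once an early prediction is confident enough might be expressed as below. The function name, the `(modality, data)` unit format, the `predict` callback, and the 0.9 threshold are hypothetical; the abstract does not define an interface.

```python
def pipelined_inference(units, predict, slow_modality, threshold=0.9):
    """Consume sensing/encoding units in arrival order.

    `units` is an iterable of (modality, data) pairs.
    `predict` maps the list of processed units to (label, confidence).
    Once confidence reaches `threshold`, later units of `slow_modality`
    are skipped instead of being encoded.
    """
    processed = []
    skipped = 0
    skipping = False
    label = None
    for modality, data in units:
        if skipping and modality == slow_modality:
            skipped += 1  # speculative skip: do not encode this unit
            continue
        processed.append((modality, data))
        label, confidence = predict(processed)
        if confidence >= threshold:
            skipping = True
    return label, skipped
```

A caller would supply its own `predict`, for example a lightweight classifier over the features aggregated so far; the returned `skipped` count gives a rough sense of how much slow-modality work was avoided.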

[89] StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA

Yuhang Hu,Zhenyu Yang,Shihan Wang,Shengsheng Qian,Bin Wen,Fan Yang,Tingting Gao,Changsheng Xu

Main category: cs.CV

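TL;DR: StreamingCoT is a new streaming VideoQA dataset focused on temporal dynamics and multimodal chain-of-thought reasoning; it addresses the static annotations and missing explicit reasoning chains of existing datasets.

Details Motivation: Existing VideoQA datasets fail to capture how answers evolve over time in streaming video and lack explicit reasoning-chain annotations, which limits model reasoning and interpretability.

Contribution: Introduces StreamingCoT, the first dataset supporting temporally evolving reasoning and multimodal chain-of-thought reasoning, together with a dynamic hierarchical annotation architecture and an explicit reasoning chain generation method.

Method: A dynamic hierarchical annotation architecture generates per-second dense descriptions and semantic segments, paired with temporally constrained question-answer sets; spatiotemporal objects are extracted via keyframe semantic alignment, and reasoning paths are generated with large language models.

Result: StreamingCoT lays a foundation for research on streaming video understanding, complex temporal reasoning, and multimodal inference.

Insight: Explicit reasoning chains and a dynamic annotation architecture can significantly improve models' understanding of temporal dynamics and their logical reasoning.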

Abstract: The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally-dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives object state transition-based reasoning paths using large language models, and ensures logical coherence through human-verified validation. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.

[90] Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

M Yashwanth,Sharannya Ghosh,Aditay Tripathi,Anirban Chakraborty

Main category: cs.CV

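TL;DR: PEP-FedPT is a new federated learning framework that uses a Class-Contextualized Mixed Prompt (CCMP) for parameter-efficient tuning of Vision Transformers, balancing generalization and personalization.

Details Motivation: Global prompt tuning struggles with heterogeneous client data, while personalized tuning tends to overfit and generalize poorly. The paper designs a unified framework to resolve this tension.

Contribution: Proposes the PEP-FedPT framework, which uses CCMP to achieve both generalized and personalized prompt tuning of Vision Transformers in federated learning.

Method: CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. Optimization uses traditional federated averaging.

Result: On datasets including CIFAR-100, PEP-FedPT surpasses existing baselines, with particularly strong results under data heterogeneity.

Insight: CCMP avoids storing client-dependent trainable parameters while achieving efficient personalized prompt tuning through joint global and client-side optimization.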

Abstract: Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP), based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via the traditional federated averaging technique. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
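
Illustrative sketch (not from the paper): one plausible reading of "weights derived from global class prototypes and client class priors" is a softmax over prototype similarity plus the log of the client prior, followed by a weighted sum of class-specific prompts. The dot-product similarity, the log-prior term, the temperature, and every name below are assumptions; the abstract does not give the actual weighting rule.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_weights(feature, prototypes, priors, temperature=1.0):
    """Per-sample class weights from global prototypes and client class priors.

    Assumed form: softmax(dot(feature, prototype) / temperature + log(prior)).
    """
    scores = []
    for prototype, prior in zip(prototypes, priors):
        similarity = sum(x * y for x, y in zip(feature, prototype))
        scores.append(similarity / temperature + math.log(max(prior, 1e-12)))
    return softmax(scores)


def mixed_prompt(feature, prototypes, priors, class_prompts, temperature=1.0):
    """Combine class-specific prompts into one prompt vector for this input."""
    weights = class_weights(feature, prototypes, priors, temperature)
    dim = len(class_prompts[0])
    return [sum(w * prompt[i] for w, prompt in zip(weights, class_prompts))
            for i in range(dim)]
```

Because the weights are computed from shared prototypes and a prior that is a statistic rather than a trained parameter, a scheme of this shape would need no client-specific trainable state, which matches the property the abstract highlights.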

[91] Instance-Level Composed Image Retrieval

Bill Psomas,George Retsinas,Nikos Efthymiadis,Panagiotis Filntisis,Yannis Avrithis,Petros Maragos,Ondrej Chum,Giorgos Tolias

Main category: cs.CV

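TL;DR: The paper introduces i-CIR, a new instance-level evaluation dataset for composed image retrieval (CIR), and BASIC, a training-free method that improves retrieval by computing the visual-query and text-query similarities separately.

Details Motivation: CIR research lacks high-quality instance-level training and evaluation data, which limits progress. The paper aims to address this.

Contribution: 1. The i-CIR dataset, focused on instance-level retrieval; 2. BASIC, a training-free method that improves retrieval performance.

Method: BASIC uses pre-trained VLMs, computes similarities for the visual and textual queries separately, and combines them by late fusion.

Result: BASIC sets a new state of the art on i-CIR and on existing CIR datasets.

Insight: Instance-level retrieval is challenging in CIR; computing visual and textual similarities separately improves accuracy.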

Abstract: The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge, comparable to retrieval among more than 40M random distractors, through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art not only on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
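
Illustrative sketch (not from the paper): a simple way to upweight images that match both queries and down-weight those that match only one is to normalize each similarity list and multiply them, so the product acts as a soft AND. The min-max normalization, the product rule, and the function names are assumptions chosen for illustration; the abstract does not state BASIC's exact fusion formula.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(image_sims, text_sims):
    """Multiply normalized similarities so both queries must be satisfied."""
    return [a * b for a, b in zip(min_max(image_sims), min_max(text_sims))]


def rank(image_sims, text_sims):
    """Return database indices sorted from best to worst fused score."""
    fused = late_fusion(image_sims, text_sims)
    return sorted(range(len(fused)), key=lambda i: -fused[i])
```

With a product, an image that scores highly on the image query but near the bottom on the text query ends up with a fused score near zero, whereas a plain sum would still rank it fairly high.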

[92] More than a Moment: Towards Coherent Sequences of Audio Descriptions

Eshika Khandelwal,Junyu Xie,Tengda Han,Max Bain,Arsha Nagrani,Andrew Zisserman,Gül Varol,Makarand Tapaswi

Main category: cs.CV

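TL;DR: The paper proposes CoherentAD, which generates multiple candidate descriptions and selects among them auto-regressively, addressing the repetition and incoherence caused by generating audio descriptions (ADs) independently. It also introduces the StoryRecall metric to evaluate the coherence of a whole sequence.

Details Motivation: ADs must form a coherent sequence to help visually impaired audiences follow a video, but most automatic methods generate each AD independently, producing repetitive, incoherent descriptions that weaken the narrative.

Contribution: 1. CoherentAD, a training-free method that builds coherent AD sequences through multi-candidate generation and auto-regressive selection; 2. StoryRecall, a sequence-level metric that quantifies narrative coherence.

Method: CoherentAD first generates multiple candidate descriptions for each time interval, then applies auto-regressive selection across the sequence to choose the most coherent combination.

Result: Experiments show that the resulting AD sequences are more narratively coherent and less repetitive than those from independent-generation baselines.

Insight: Sequence-level coherence optimization is essential for effective audio description, and multi-candidate generation with selection can markedly improve the diversity of the output.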

Abstract: Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
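
Illustrative sketch (not from the paper): auto-regressive selection means each interval's description is chosen with knowledge of the descriptions already chosen. The toy below picks, for each interval, the candidate with the least word overlap with the history. The overlap score is a stand-in assumption; the abstract does not say how CoherentAD scores candidates, and all names are hypothetical.

```python
def overlap(candidate, history):
    """Fraction of the candidate's distinct words already used in the history."""
    words = set(candidate.lower().split())
    if not words:
        return 0.0
    seen = set()
    for previous in history:
        seen.update(previous.lower().split())
    return len(words & seen) / len(words)


def select_sequence(candidates_per_interval):
    """Greedy auto-regressive selection: one candidate per interval, in order."""
    chosen = []
    for candidates in candidates_per_interval:
        # Condition the choice on everything selected so far.
        best = min(candidates, key=lambda c: overlap(c, chosen))
        chosen.append(best)
    return chosen
```

For example, if the first interval yields "A man enters the room" and the second interval offers that same sentence plus "He picks up a letter", this toy selects the second option because it repeats fewer words.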

[93] Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Xu Zheng,Zihao Dongfang,Lutao Jiang,Boyuan Zheng,Yulong Guo,Zhenquan Zhang,Giuliano Albanese,Runyi Yang,Mengjiao Ma,Zixin Zhang,Chenfei Liao,Dingcheng Zhen,Yuanhuiyi Lyu,Yuqian Fu,Bin Ren,Linfeng Zhang,Danda Pani Paudel,Nicu Sebe,Luc Van Gool,Xuming Hu

Main category: cs.CV

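TL;DR: The paper systematically surveys multimodal spatial reasoning tasks, focusing on large models in this area, and introduces open benchmarks for evaluation.

Details Motivation: Human reasoning relies on multimodal observations such as vision and sound, yet systematic reviews and benchmarks for large models on multimodal spatial reasoning remain limited.

Contribution: 1) Surveys progress in multimodal spatial reasoning tasks; 2) introduces open benchmarks; 3) covers emerging modalities such as audio and egocentric video.

Method: Categorizes technical progress in multimodal large language models (MLLMs) and designs a benchmark framework to evaluate them on 2D and 3D spatial tasks.

Result: Establishes a systematic framework for multimodal spatial reasoning and provides publicly accessible benchmarks and code.

Insight: Emerging sensors and modalities, such as audio and egocentric video, offer new perspectives on spatial reasoning and extend the boundaries of traditional tasks.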

Abstract: Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.

[94] FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion

Chuhao Chen,Isabella Liu,Xinyue Wei,Hao Su,Minghua Liu

Main category: cs.CV

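TL;DR: FreeArt3D is a training-free framework that uses a pre-trained static 3D diffusion model (e.g., Trellis) as a shape prior and extends Score Distillation Sampling (SDS) into the 3D-to-4D domain to generate high-quality articulated objects.

Details Motivation: Articulated 3D objects are widely used in robotics, AR/VR, and animation, but existing methods either rely on optimization-based reconstruction with dense-view supervision or use feed-forward generative models that overlook surface texture. Static 3D generation has advanced markedly, but extending it to articulated objects remains challenging.

Contribution: FreeArt3D, a training-free framework that extends SDS into the 3D-to-4D domain to generate high-quality articulated 3D objects, jointly optimizing geometry, texture, and articulation parameters.

Method: FreeArt3D uses a pre-trained static 3D diffusion model as a shape prior and extends SDS into the 3D-to-4D domain, treating articulation as an additional generative dimension. Given a few images in different articulation states, it optimizes geometry, texture, and articulation parameters.

Result: FreeArt3D generates high-fidelity geometry and textures, accurately predicts kinematic structures, generalizes across diverse object categories, completes in minutes, and outperforms existing methods.

Insight: FreeArt3D shows the potential of pre-trained diffusion models: extending SDS to a new dimension yields an efficient, training-free solution for generating complex 3D objects.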

Abstract: Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object’s geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.

[95] VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Baolu Li,Yiming Zhang,Qinghe Wang,Liqian Ma,Xiaoyu Shi,Xintao Wang,Pengfei Wan,Zhenfei Yin,Yunzhi Zhuge,Huchuan Lu,Xu Jia

Main category: cs.CV

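TL;DR: VFXMaster is the first unified, reference-based framework for VFX video generation. It generates dynamic effects through in-context learning and generalizes to unseen effect categories.

Details Motivation: Existing VFX generation methods rely on the one-LoRA-per-effect paradigm, which is resource-intensive and cannot generalize to new effects, limiting generative capability and scalability.

Contribution: 1. The first unified, reference-based VFX generation framework; 2. an in-context conditioning strategy and an in-context attention mask; 3. an efficient one-shot effect adaptation mechanism that improves generalization.

Method: 1. Recasts effect generation as an in-context learning task; 2. designs an in-context attention mask to decouple effect attributes; 3. uses one-shot adaptation to strengthen generalization.

Result: Experiments show that the method imitates diverse effects and performs well on unseen effects.

Insight: Recasting effect generation as an in-context learning problem and using attention masking for fine-grained control is key to improving VFX generation.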

Abstract: Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential effect attributes, allowing a single unified model to master the effect imitation without information leakage. In addition, we propose an efficient one-shot effect adaptation mechanism to boost generalization capability on tough unseen effects from a single user-provided video rapidly. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.

cs.HC [Back]

[96] AmarDoctor: An AI-Driven, Multilingual, Voice-Interactive Digital Health Application for Primary Care Triage and Patient Management to Bridge the Digital Health Divide for Bengali Speakers

Nazmun Nahar,Ritesh Harshad Ruparel,Shariar Kabir,Sumaiya Tasnia Khan,Shyamasree Saha,Mamunur Rashid

Main category: cs.HC

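TL;DR: AmarDoctor is an AI-driven, multilingual, voice-interactive digital health app that provides primary care triage and patient management mainly for Bengali speakers, bridging the language gap in digital health.

Details Motivation: Existing digital health platforms mainly serve European demographics and languages, leaving Bengali speakers underserved; AmarDoctor aims to address this.

Contribution: A dual-interface system supporting three Bengali dialects, combining an adaptive questioning algorithm with a voice-interactive AI assistant to improve primary care efficiency.

Method: Uses a data-driven adaptive questioning algorithm to assess symptoms, combined with AI-driven clinical decision support that generates structured provisional diagnoses and treatment recommendations.

Result: On 185 clinical vignettes, AmarDoctor achieved a top-1 diagnostic precision of 81.08% and a top specialty recommendation precision of 91.35%, both well above the physicians' averages (50.27% and 62.6%).

Insight: Voice interaction and multilingual support can markedly improve the usability of digital health tools for users with low digital literacy, and AI-assisted diagnosis shows a clear advantage in triage efficiency.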

Abstract: This study presents AmarDoctor, a multilingual voice-interactive digital health app designed to provide comprehensive patient triage and AI-driven clinical decision support for Bengali speakers, a population largely underserved in access to digital healthcare. AmarDoctor adopts a data-driven approach to strengthen primary care delivery and enable personalized health management. While platforms such as AdaHealth, WebMD, Symptomate, and K-Health have become popular in recent years, they mainly serve European demographics and languages. AmarDoctor addresses this gap with a dual-interface system for both patients and healthcare providers, supporting three major Bengali dialects. At its core, the patient module uses an adaptive questioning algorithm to assess symptoms and guide users toward the appropriate specialist. To overcome digital literacy barriers, it integrates a voice-interactive AI assistant that navigates users through the app services. Complementing this, the clinician-facing interface incorporates AI-powered decision support that enhances workflow efficiency by generating structured provisional diagnoses and treatment recommendations. These outputs inform key services such as e-prescriptions, video consultations, and medical record management. To validate clinical accuracy, the system was evaluated against a gold-standard set of 185 clinical vignettes developed by experienced physicians. Effectiveness was further assessed by comparing AmarDoctor's performance with that of five independent physicians using the same vignette set. Results showed AmarDoctor achieved a top-1 diagnostic precision of 81.08 percent (versus a physicians' average of 50.27 percent) and a top specialty recommendation precision of 91.35 percent (versus a physicians' average of 62.6 percent).
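
Worked check (an inference, not stated in the abstract): if each of the 185 vignettes counts as a single hit or miss, the reported AmarDoctor percentages correspond to whole numbers of correct cases. The helper name below is hypothetical, and the hit-or-miss scoring is an assumption.

```python
def implied_hits(percent, total):
    """Number of correct cases implied by a percentage, assuming one hit or miss per case."""
    return round(percent / 100 * total)


TOTAL_VIGNETTES = 185  # from the abstract

top1_hits = implied_hits(81.08, TOTAL_VIGNETTES)       # top-1 diagnosis
specialty_hits = implied_hits(91.35, TOTAL_VIGNETTES)  # top specialty recommendation

# Converting the counts back reproduces the reported figures to two decimals.
top1_percent = round(top1_hits / TOTAL_VIGNETTES * 100, 2)
specialty_percent = round(specialty_hits / TOTAL_VIGNETTES * 100, 2)
```

Under that assumption the figures work out to 150 of 185 and 169 of 185 vignettes. The physicians' averages are means over five raters, so they need not map to whole counts.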

[97] Beyond Models: A Framework for Contextual and Cultural Intelligence in African AI Deployment

Qness Ndlovu

Main category: cs.HC

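TL;DR: The paper proposes the CCI framework, which aims to make AI more useful in African markets through cultural intelligence and localized design; it validates the advantage of WhatsApp-based interaction and demonstrates the effect of culturally informed prompt engineering.

Details Motivation: Global AI development focuses heavily on model performance and neglects cultural and social context, which limits applicability in markets such as Africa.

Contribution: Proposes the CCI framework, which combines infrastructure, cultural, and commercial intelligence to give theoretical and practical support for equitable AI deployment in resource-constrained markets.

Method: Uses design science methodology and validates the CCI framework through an empirical study on a cross-border shopping platform, covering WhatsApp interaction and culturally informed prompt engineering.

Result: 89% of users prefer WhatsApp-based interaction, and culturally informed prompt engineering handled family-focused commerce patterns and natural code-switching well.

Insight: Cultural and social context is a key factor in deploying AI successfully across diverse markets; localized design driven by user habits matters more than raw technical performance.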

Abstract: While global AI development prioritizes model performance and computational scale, meaningful deployment in African markets requires fundamentally different architectural decisions. This paper introduces Contextual and Cultural Intelligence (CCI) – a systematic framework enabling AI systems to process cultural meaning, not just data patterns, through locally relevant, emotionally intelligent, and economically inclusive design. Using design science methodology, we validate CCI through a production AI-native cross-border shopping platform serving diaspora communities. Key empirical findings: 89% of users prefer WhatsApp-based AI interaction over traditional web interfaces (n=602, chi-square=365.8, p<0.001), achieving 536 WhatsApp users and 3,938 total conversations across 602 unique users in just 6 weeks, and culturally informed prompt engineering demonstrates sophisticated understanding of culturally contextualized queries, with 89% family-focused commerce patterns and natural code-switching acceptance. The CCI framework operationalizes three technical pillars: Infrastructure Intelligence (mobile-first, resilient architectures), Cultural Intelligence (multilingual NLP with social context awareness), and Commercial Intelligence (trust-based conversational commerce). This work contributes both theoretical innovation and reproducible implementation patterns, challenging Silicon Valley design orthodoxies while providing actionable frameworks for equitable AI deployment across resource-constrained markets.

cs.CY [Back]

[98] Topic-aware Large Language Models for Summarizing the Lived Healthcare Experiences Described in Health Stories

Maneesh Bilalpur,Megan Hamm,Young Ji Lee,Natasha Norman,Kathleen M. McTigue,Yanshan Wang

Main category: cs.CY

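TL;DR: The paper combines topic modeling (LDA) with large language models (LLMs) to identify topics in African American storytellers' healthcare narratives and generate hierarchical summaries; the summaries were rated as high-quality and useful.

Details Motivation: By analyzing African American healthcare narratives, the study seeks to identify potential factors behind health disparities and avenues for intervention, using the communicative power of storytelling to support research efficiently.

Contribution: 1) A method combining LDA and LLMs for topic identification and hierarchical summarization; 2) evidence that the generated summaries are high-quality and useful; 3) identification of 26 topics related to African American healthcare experiences.

Method: 1) Identify topics in 50 stories with LDA; 2) generate hierarchical summaries of the stories for each topic with an open-source LLM; 3) rate summary quality with GPT4 and validate its reliability against expert assessments.

Result: GPT4 ratings suggest the summaries were free from fabrication, highly accurate, and comprehensive; the GPT4 ratings showed moderate to high agreement with expert assessments.

Insight: The approach can efficiently extract useful information from unstructured narratives, suggesting new directions for health research and clinical improvement.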

Abstract: Storytelling is a powerful form of communication and may provide insights into factors contributing to gaps in healthcare outcomes. To determine whether Large Language Models (LLMs) can identify potential underlying factors and avenues for intervention, we performed topic-aware hierarchical summarization of narratives from African American (AA) storytellers. Fifty transcribed stories of AA experiences were used to identify topics in their experience using the Latent Dirichlet Allocation (LDA) technique. Stories about a given topic were summarized using an open-source LLM-based hierarchical summarization approach. Topic summaries were generated by summarizing across story summaries for each story that addressed a given topic. Generated topic summaries were rated for fabrication, accuracy, comprehensiveness, and usefulness by the GPT4 model, and the model’s reliability was validated against the original story summaries by two domain experts. 26 topics were identified in the fifty AA stories. The GPT4 ratings suggest that topic summaries were free from fabrication, highly accurate, comprehensive, and useful. The reliability of GPT ratings compared to expert assessments showed moderate to high agreement. Our approach identified AA experience-relevant topics such as health behaviors, interactions with medical team members, caregiving and symptom management, among others. Such insights could help researchers identify potential factors and interventions by learning from unstructured narratives in an efficient manner, leveraging the communicative power of storytelling. The use of LDA and LLMs to identify and summarize the experience of AA individuals suggests a variety of possible avenues for health research and possible clinical improvements to support patients and caregivers, thereby ultimately improving health outcomes.

cs.AI [Back]

[99] KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

Zhuo Chen,Fei Wang,Zixuan Li,Zhao Zhang,Weiwei Ding,Chuanguang Yang,Yongjun Xu,Xiaolong Jin,Jiafeng Guo

Main category: cs.AI

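TL;DR: KnowCoder-A1 proposes an outcome-supervised reinforcement learning approach that uses a multi-stage curriculum to incentivize autonomous agentic reasoning by LLMs in KBQA.

Details Motivation: Existing KBQA methods typically fine-tune LLMs with process supervision, which gives weak incentives for exploration and fails to fully elicit agentic reasoning.

Contribution: KnowCoder-A1, an LLM that reasons autonomously over knowledge bases for KBQA, trained with outcome supervision and multi-stage curriculum reinforcement learning.

Method: 1. Fine-tune the LLM on high-quality trajectories; 2. apply multi-stage curriculum RL to alleviate the sparse rewards of outcome supervision.

Result: Outperforms prior approaches on three mainstream datasets; on the zero-shot subset of GrailQA it achieves up to an 11.1% relative improvement while using only one-twelfth of the training data.

Insight: Outcome supervision combined with curriculum learning can effectively incentivize autonomous reasoning, reduce reliance on process supervision, and improve zero-shot generalization.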

Abstract: Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.

[100] RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models

Tianqianjin Lin,Xi Zhao,Xingyao Zhang,Rujiao Long,Yi Xu,Zhuoren Jiang,Wenbo Su,Bo Zheng

Main category: cs.AI

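TL;DR: RAVR is a reference-answer-guided variational reasoning framework that uses the answer to derive high-quality reasoning paths, improving the reasoning of large language models (LLMs), especially on complex tasks where open-ended exploration is hard.

Details Motivation: Reinforcement learning (RL) improves LLM reasoning only if the model can already generate high-quality reasoning paths with non-negligible probability, a condition that is hard to meet on complex tasks. Inspired by cognitive science, RAVR exploits the fact that explaining a known answer is easier than finding it, which avoids the heavy cognitive load of open-ended exploration.

Contribution: 1. Formalizes answer-guided reasoning and proves that conditioning on the answer increases the expected utility of sampled reasoning paths; 2. proposes the RAVR framework, which uses answer-conditioned reasoning as a variational surrogate; 3. shows consistent improvements in general and math domains.

Method: RAVR uses answer-conditioned reasoning as a variational surrogate and optimizes the LLM's generated reasoning paths end to end. Specifically: 1. the reference answer guides reasoning path generation; 2. variational inference optimizes generation quality; 3. reinforcement learning provides further fine-tuning.

Result: Experiments show that RAVR outperforms baselines on general and math tasks. Further analysis shows that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies.

Insight: The core idea of RAVR is to use the answer as a guide, turning hard open-ended exploration into more tractable explanatory reconstruction and thereby improving LLM reasoning.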

Abstract: Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning paths can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that "Why is this the answer?" is often an easier question than "What is the answer?", as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction: systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on the answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
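To make the phrase "variational surrogate" concrete, the following is the generic textbook evidence lower bound with the reasoning path treated as a latent variable. The notation is an assumption introduced here, not the paper's: $x$ is the question, $y$ the reference answer, $z$ a reasoning path, $p_\theta(z \mid x)$ the question-only reasoner, and $q_\phi(z \mid x, y)$ the answer-conditioned reasoner. The abstract does not state RAVR's actual objective, which may differ from this bound.

```latex
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
  \;\ge\;
  \mathbb{E}_{z \sim q_\phi(z \mid x, y)}
    \bigl[ \log p_\theta(y \mid x, z) \bigr]
  - \mathrm{KL}\bigl( q_\phi(z \mid x, y) \,\big\|\, p_\theta(z \mid x) \bigr)
```

The inequality follows from Jensen's inequality. Under this reading, sampling $z$ from the answer-conditioned distribution yields paths that explain $y$ well, while the KL term keeps those paths close to what the question-only reasoner could produce.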

[101] From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Tianxi Wan,Jiaming Luo,Siyuan Chen,Kunyao Lan,Jianhua Chen,Haiyang Geng,Mengyue Wu

Main category: cs.AI

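TL;DR: The paper combines synthetic electronic medical record (EMR) construction with multi-agent diagnostic dialogue generation to build PsyCoTalk, a dataset for psychiatric comorbidity research. The data show high fidelity and diagnostic validity.

Details Motivation: Psychiatric comorbidity is complex, and conventional approaches struggle to diagnose multiple co-occurring disorders. A clinically grounded solution is needed to improve diagnostic efficiency and accuracy.

Contribution: 1. Develops a pipeline for synthetic EMR construction and multi-agent dialogue generation; 2. Creates the PsyCoTalk dataset, containing 3,000 multi-turn diagnostic dialogues; 3. Validates the clinical realism and diagnostic validity of the dataset.

Method: 1. Synthesize 502 clinically relevant and diverse EMRs; 2. Convert the clinical interview protocol into a hierarchical state machine and context tree; 3. Generate and validate multi-turn diagnostic dialogues.

Result: PsyCoTalk shows high fidelity to real clinical transcripts in dialogue structure, token distribution, and diagnostic reasoning strategies. Psychiatrists confirmed its realism and diagnostic value.

Insight: Combining synthetic data with a multi-agent framework can efficiently produce clinically reliable dialogue datasets, supporting the development and evaluation of models that screen for multiple psychiatric disorders in a single conversational pass.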

Abstract: Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.

cs.DB [Back]

[102] StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems

Qi Lin,Zhenyu Zhang,Viraj Thakkar,Zhenjie Sun,Mai Zheng,Zhichao Cao

Main category: cs.DB

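TL;DR: StorageXTuner is an LLM agent-driven auto-tuning framework that optimizes the performance of heterogeneous storage systems through layered agents and an insight-driven tree search.

Details Motivation: Automatically configuring heterogeneous storage systems is hard because parameter spaces are large and conditions vary. Existing heuristic and ML tuners are usually system specific and adapt poorly to change.

Contribution: Proposes the StorageXTuner framework, which uses four agents (Executor, Extractor, Searcher, Reflector) to enable cross-system reuse and efficient exploration, together with an insight-driven tree search and a layered memory design.

Method: The framework comprises four agents: Executor (sandboxed benchmarking), Extractor (performance digest), Searcher (insight-guided configuration exploration), and Reflector (insight generation and management), combined with tree search and lightweight checkers.

Result: Evaluated on RocksDB, LevelDB, CacheLib, and MySQL InnoDB, StorageXTuner reaches up to 575% higher throughput than out-of-the-box settings and up to 111% higher than ELMo-Tune, reduces p99 latency by as much as 88% and 56% respectively, and converges with fewer trials.

Insight: A layered agent design and an insight-driven approach can substantially improve both the performance of heterogeneous storage systems and the efficiency of tuning them.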

Abstract: Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under changes. Recent LLM-based approaches help but usually treat tuning as a single-shot, system-specific task, which limits cross-system reuse, constrains exploration, and weakens validation. We present StorageXTuner, an LLM agent-driven auto-tuning framework for heterogeneous storage engines. StorageXTuner separates concerns across four agents - Executor (sandboxed benchmarking), Extractor (performance digest), Searcher (insight-guided configuration exploration), and Reflector (insight generation and management). The design couples an insight-driven tree search with layered memory that promotes empirically validated insights and employs lightweight checkers to guard against unsafe actions. We implement a prototype and evaluate it on RocksDB, LevelDB, CacheLib, and MySQL InnoDB with YCSB, MixGraph, and TPC-H/C. Relative to out-of-the-box settings and to ELMo-Tune, StorageXTuner reaches up to 575% and 111% higher throughput, reduces p99 latency by as much as 88% and 56%, and converges with fewer trials.

cs.LG [Back]

[103] Finding Culture-Sensitive Neurons in Vision-Language Models

Xiutian Zhao,Rochelle Choenni,Rohit Saxena,Ivan Titov

Main category: cs.LG

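TL;DR: The paper studies culture-sensitive neurons in vision-language models (VLMs), proposes a new Contrastive Activation Selection (CAS) method to identify them, and experimentally verifies their importance for culturally diverse tasks.

Details Motivation: Despite their strong performance, VLMs still struggle on culturally situated inputs. The paper investigates how VLMs process culturally grounded information and examines the cultural sensitivity of internal neurons.

Contribution: 1) Shows that culture-sensitive neurons exist in VLMs; 2) Proposes CAS, a new identification method; 3) Analyzes the distribution and effect of these neurons.

Method: 1) Identify culture-selective neurons using the CVQA benchmark; 2) Propose the CAS method for neuron selection; 3) Verify the neurons' importance through deactivation experiments.

Result: Experiments show that culture-sensitive neurons are critical for answering questions about the corresponding cultures, that CAS outperforms existing probability- and entropy-based methods, and that these neurons cluster in certain decoder layers.

Insight: VLMs contain neurons that are sensitive to culture, and their distribution and function offer a new view of how multimodal representations are organized.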

Abstract: Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e. neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify culture-selective neurons and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector, Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveal that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.
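As a rough illustration of what a margin-based neuron selector could look like, the sketch below scores each neuron by the gap between its highest and second-highest per-culture mean activation. This scoring rule, the `margin_select` helper, and the toy numbers are all assumptions made for illustration. The abstract does not define CAS, so this should not be read as the paper's method.

```python
import numpy as np


def margin_select(mean_act, top_k):
    """Pick, per culture, the neurons that prefer it by the largest margin.

    mean_act: array of shape (n_cultures, n_neurons) holding each neuron's
              mean activation on inputs from each culture (n_cultures >= 2).
    Returns a dict mapping culture index -> list of selected neuron indices.
    """
    ordered = np.sort(mean_act, axis=0)      # ascending along cultures
    margin = ordered[-1] - ordered[-2]       # best minus runner-up
    preferred = np.argmax(mean_act, axis=0)  # culture each neuron prefers
    selected = {}
    for culture in range(mean_act.shape[0]):
        candidates = np.flatnonzero(preferred == culture)
        ranked = candidates[np.argsort(-margin[candidates])]
        selected[culture] = ranked[:top_k].tolist()
    return selected


# Toy data: 3 cultures, 4 neurons (values are made up).
mean_act = np.array([
    [0.90, 0.20, 0.50, 0.10],
    [0.10, 0.80, 0.40, 0.10],
    [0.20, 0.10, 0.45, 0.70],
])
selected = margin_select(mean_act, top_k=1)
print(selected)
```

Neuron 2 prefers culture 0 only by a narrow margin, so it loses to neuron 0 when `top_k=1`. A causal test in the spirit of the abstract would then zero out the selected neurons and compare accuracy on the matching culture with accuracy on the others.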

[104] Sequences of Logits Reveal the Low Rank Structure of Language Models

Noah Golowich,Allen Liu,Abhishek Shetty

Main category: cs.LG

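TL;DR: The paper introduces a model-agnostic approach to studying the low-dimensional structure of language models, finds that modern language models broadly exhibit low-rank structure, and demonstrates through experiments and theory that this structure can be used for generation.

Details Motivation: Understanding the low-dimensional structure of large language models is a major research problem. The paper aims to expose their low-rank structure from a model-agnostic perspective and to assess what it means for generation.

Contribution: 1. Finds that modern language models broadly exhibit low-rank structure; 2. Proposes a way to generate responses by exploiting that structure; 3. Provides theoretical analysis and learning guarantees.

Method: Build matrices from the model's logits over varying sets of prompts and responses, show that they have low approximate rank, and apply this property to generation through linear combinations of model outputs.

Result: Experiments show that low-rank structure is widespread across language models and can be used to generate responses. Theoretical analysis yields predictions that parallel the experiments.

Insight: The low-rank structure of language models offers a new perspective for understanding their internal organization and for improving generation.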

Abstract: A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models at a model-agnostic level: as sequential probabilistic models. We first empirically demonstrate that a wide range of modern language models exhibit low-rank structure: in particular, matrices built from the model’s logits for varying sets of prompts and responses have low approximate rank. We then show that this low-rank structure can be leveraged for generation – in particular, we can generate a response to a target prompt using a linear combination of the model’s outputs on unrelated, or even nonsensical prompts. On the theoretical front, we observe that studying the approximate rank of language models in the sense discussed above yields a simple universal abstraction whose theoretical predictions parallel our experiments. We then analyze the representation power of the abstraction and give provable learning guarantees.
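The two claims in the abstract, low approximate rank and generation by linear combination, can be illustrated with a small NumPy sketch. The matrix below is synthetic (a rank-5 product plus small noise) and only stands in for a matrix of logits indexed by prompts and responses. The threshold `rel_tol`, the helper `approximate_rank`, and all sizes are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np


def approximate_rank(matrix, rel_tol=1e-2):
    """Count singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


rng = np.random.default_rng(0)
n_prompts, n_responses, true_rank = 200, 300, 5

# Synthetic stand-in for a prompt-by-response logit matrix.
left = rng.normal(size=(n_prompts, true_rank))
right = rng.normal(size=(true_rank, n_responses))
logit_matrix = left @ right + 1e-3 * rng.normal(size=(n_prompts, n_responses))

rank = approximate_rank(logit_matrix)

# Express one "target prompt" row as a linear combination of 20 other rows,
# fitting the coefficients on only the first 50 response columns.
target = logit_matrix[0]
others = logit_matrix[1:21]
observed = slice(0, 50)
coeffs, *_ = np.linalg.lstsq(others[:, observed].T, target[observed], rcond=1e-2)

# Apply the same coefficients to all columns, including unobserved ones.
reconstructed = coeffs @ others
rel_error = np.linalg.norm(reconstructed - target) / np.linalg.norm(target)
print(rank, rel_error)
```

Because the synthetic matrix is nearly rank 5, coefficients fitted on a few columns carry over to the rest of the row. This is the kind of property that would let a target prompt's outputs be predicted from outputs on other prompts.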