cs.CL

[1] Opening the Black Box: A Survey on the Mechanisms of Multi-Step Reasoning in Large Language Models cs.CL | cs.AI

Liangming Pan, Jason Liang, Jiaran Ye, Minglai Yang, Xinyuan Lu

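TL;DR: This survey covers the mechanisms of multi-step reasoning in large language models. Unlike prior surveys, which focus mainly on engineering methods for improving performance, it systematically reviews how LLMs implement multi-step reasoning internally, builds a conceptual framework around seven interconnected research questions, and closes with five directions for future mechanistic research.

Details

Motivation: Although LLMs have shown a remarkable ability to solve multi-step reasoning problems, the internal mechanisms behind this ability remain unclear. Existing surveys mostly focus on engineering methods that improve performance; this paper aims to fill that gap by examining the internal mechanisms of LLM multi-step reasoning.

Result: As a survey, the paper reports no quantitative experimental results or benchmark rankings. Its main contribution is a systematic conceptual framework for analyzing and understanding the mechanisms of LLM multi-step reasoning.

Insight: The core novelty is the shift in focus from external performance gains to internal mechanistic analysis, with a framework that spans implicit reasoning in hidden activations through to explicit verbalized reasoning. This gives future work a clear roadmap for understanding LLM reasoning at the computational level.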

Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities to solve problems requiring multiple reasoning steps, yet the internal mechanisms enabling such capabilities remain elusive. Unlike existing surveys that primarily focus on engineering methods to enhance performance, this survey provides a comprehensive overview of the mechanisms underlying LLM multi-step reasoning. We organize the survey around a conceptual framework comprising seven interconnected research questions, from how LLMs execute implicit multi-hop reasoning within hidden activations to how verbalized explicit reasoning remodels the internal computation. Finally, we highlight five research directions for future mechanistic studies.


[2] Hallucination-Free Automatic Question & Answer Generation for Intuitive Learning cs.CL | cs.AI

Nicholas X. Wang, Aggelos K. Katsaggelos

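TL;DR: This paper proposes a hallucination-free multi-agent generation framework for automatically generating educational multiple-choice questions (MCQs). By decomposing generation into discrete, verifiable stages and using rule-based and LLM-based detection agents together with hallucination scoring metrics, it substantially reduces the hallucinations LLMs produce during generation.

Details

Motivation: LLMs hallucinate when generating educational MCQs automatically. These hallucinations are fluent but incorrect or incoherent outputs, and they undermine the educational reliability and usefulness of the generated content.

Result: Evaluated on a sample of AP-aligned STEM questions, the system reduces hallucination rates by over 90% compared with baseline generation while preserving the educational value and style of the questions.

Insight: The novel elements are recasting MCQ generation as an optimization task that minimizes hallucination risk while maximizing validity, answerability, and cost-efficiency, and introducing an agent-led iterative refinement process that uses counterfactual reasoning and chain-of-thought (CoT). Viewed objectively, the structured multi-agent collaboration offers a reusable, systematic approach to hallucination mitigation in large-scale educational content creation.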

Abstract: Hallucinations in large language models (LLMs), defined as fluent yet incorrect or incoherent outputs, pose a significant challenge to the automatic generation of educational multiple-choice questions (MCQs). We identified four key hallucination types in MCQ generation: reasoning inconsistencies, insolvability, factual errors, and mathematical errors. To address this, we propose a hallucination-free multi-agent generation framework that breaks down MCQ generation into discrete, verifiable stages. Our framework utilizes both rule-based and LLM-based detection agents, as well as hallucination scoring metrics to optimize question quality. We redefined MCQ generation as an optimization task minimizing hallucination risk while maximizing validity, answerability, and cost-efficiency. We also introduce an agent-led refinement process that uses counterfactual reasoning and chain-of-thought (CoT) to iteratively reduce hallucination in question generation. We evaluated a sample of AP-aligned STEM questions, where our system reduced hallucination rates by over 90% compared to baseline generation while preserving the educational value and style of questions. Our results demonstrate that structured multi-agent collaboration can mitigate hallucinations in educational content creation at scale, paving the way for more reliable LLM-powered learning tools.


[3] Project Aletheia: Verifier-Guided Distillation of Backtracking for Small Language Models cs.CL

Aradhya Dixit, Tianxi Liang, Jai Telang

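TL;DR: The paper proposes a training protocol called Verifier-Guided Distillation, which transfers the process of error repair (explicit conflict detection and backtracking), rather than only correct final answers, to small language models (SLMs). By training a 7B-parameter model on verified reasoning traces that include mistakes and self-corrections, the authors show that latent verification behavior can emerge in small models, enabling them to occasionally stop, detect contradictions, and revise earlier assumptions.

Details

Motivation: Small language models (under 10B parameters) frequently fail on strict constraint-satisfaction problems because they follow linear, overconfident reasoning traces that do not recover from early mistakes.

Result: Training a 7B-parameter model on verified reasoning traces shows that latent verification behavior can emerge in small models, allowing them to occasionally stop, detect contradictions, and revise earlier assumptions.

Insight: The novelty lies in treating the error-repair process (conflict detection and backtracking), not just the final answer, as a distillable training target. The intent is for small models to learn more robust reasoning strategies on constraint-satisfaction problems.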

Abstract: Small Language Models (SLMs, under 10B parameters) are attractive for private, on-device deployment, yet they frequently fail on strict constraint-satisfaction problems due to linear, overconfident reasoning traces that do not recover from early mistakes. We introduce Verifier-Guided Distillation, a training protocol that transfers the process of error repair - explicit conflict detection and backtracking - rather than only correct final answers. By training a 7B model on verified reasoning traces that include mistakes and self-corrections, we show that latent verification behavior can emerge in small models, enabling them to occasionally stop, detect contradictions, and revise earlier assumptions.
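The abstract does not say how the verified traces are produced, so the following is only a minimal sketch of what a reasoning trace with explicit conflict detection and backtracking could look like. The toy constraint problem (`x + y == 5` over a small domain), the event names `assume`, `conflict`, and `backtrack`, and the function names are all assumptions made for illustration.

```python
# Toy constraint-satisfaction problem (hypothetical): find x, y in {1, 2, 3}
# such that x + y == 5, logging every step of the search as a trace.
VARIABLES = ["x", "y"]
DOMAIN = [1, 2, 3]


def violated(assignment):
    """Return a description of the violated constraint, or None if consistent."""
    if "x" in assignment and "y" in assignment:
        total = assignment["x"] + assignment["y"]
        if total != 5:
            return f"x + y = {total}, expected 5"
    return None


def solve(assignment, trace):
    """Depth-first search that records assumptions, conflicts and backtracks."""
    if len(assignment) == len(VARIABLES):
        return True
    var = VARIABLES[len(assignment)]
    for value in DOMAIN:
        assignment[var] = value
        problem = violated(assignment)
        if problem is not None:
            # Explicit conflict detection: the tentative value is rejected.
            trace.append(f"conflict: {var}={value} ({problem})")
            del assignment[var]
            continue
        trace.append(f"assume: {var}={value}")
        if solve(assignment, trace):
            return True
        # Every extension failed, so the earlier assumption is revised.
        trace.append(f"backtrack: retract {var}={value}")
        del assignment[var]
    return False


solution, trace = {}, []
found = solve(solution, trace)
print("\n".join(trace))
print("solution:", solution)
```

A trace like this contains the mistaken assumption (`x=1`), the contradictions it leads to, and its retraction, which is the kind of signal the protocol is described as distilling, as opposed to the final answer alone.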


[4] Can LLM Reasoning Be Trusted? A Comparative Study: Using Human Benchmarking on Statistical Tasks cs.CL

Crish Nagarkar, Leonid Bogachev, Serge Sharoff

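TL;DR: This paper studies the ability of large language models (LLMs) to solve statistical tasks and to assess the quality of reasoning. Selected open-source LLMs are fine-tuned on a specially developed dataset to improve their statistical reasoning and are compared against human benchmark scores. The fine-tuned models reach a level comparable to a statistics student on advanced statistical tasks, and LLMs themselves judge answer quality (including explanation and reasoning assessment) better than traditional metrics such as BLEU or BertScore.

Details

Motivation: Although SOTA LLMs perform well across many NLP tasks, their competence on even moderately complex statistical problems is not well understood. The paper investigates how LLMs actually perform on statistical tasks and how well they can act as evaluators.

Result: The fine-tuned models perform better on advanced statistical tasks, at a level comparable to a statistics student. The gains from fine-tuning depend on the architecture, with some models improving significantly. LLMs outperform traditional metrics such as BLEU or BertScore at judging answer quality.

Insight: The contributions are: fine-tuning on a purpose-built dataset effectively improves the statistical reasoning of LLMs; LLMs can serve as better judges than traditional metrics, enabling scalable assessment for educational technology and automated analysis tools; and the approach shows potential as a validation tool for research methodology in academic and industry settings and as a quality-control mechanism for data analysis workflows.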

Abstract: This paper investigates the ability of large language models (LLMs) to solve statistical tasks, as well as their capacity to assess the quality of reasoning. While state-of-the-art LLMs have demonstrated remarkable performance in a range of NLP tasks, their competence in addressing even moderately complex statistical challenges is not well understood. We have fine-tuned selected open-source LLMs on a specially developed dataset to enhance their statistical reasoning capabilities, and compared their performance with the human scores used as a benchmark. Our results show that the fine-tuned models achieve better performance on advanced statistical tasks, at a level comparable to that of a statistics student. Fine-tuning demonstrates architecture-dependent improvements, with some models showing significant performance gains, indicating clear potential for deployment in educational technology and statistical analysis assistance systems. We also show that LLMs themselves can be far better judges of answer quality (including explanation and reasoning assessment) in comparison to traditional metrics, such as BLEU or BertScore. This self-evaluation capability enables scalable automated assessment for statistical education platforms and quality assurance in automated analysis tools. Potential applications also include validation tools for research methodology in academic and industry settings, and quality control mechanisms for data analysis workflows.


[5] Business Logic-Driven Text-to-SQL Data Synthesis for Business Intelligence cs.CL

Jinhui Liu, Ximeng Zhang, Yanbo Ai, Zhou Yu

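TL;DR: This paper proposes a business logic-driven Text-to-SQL data synthesis framework for business intelligence (BI). It generates evaluation data with high business realism to address the scarcity of realistic, domain-specific data in private BI settings.

Details

Motivation: Evaluating Text-to-SQL agents in private BI settings is difficult because domain-specific data that reflects real business logic and workflows is scarce, and existing synthetic data generation methods fail to capture business realism.

Result: Experiments on a production-scale Salesforce database show that the synthesized data achieves high business realism (98.44%), substantially outperforming OmniSQL (+19.5%) and SQL-Factory (+54.7%), while maintaining strong question-SQL alignment (98.59%). State-of-the-art Text-to-SQL models reach only 42.86% execution accuracy on the most complex business queries.

Insight: The novelty is a data synthesis framework grounded in business personas, work scenarios, and workflows, combined with a business reasoning complexity control strategy that diversifies the analytical reasoning steps, which improves both the business realism and the evaluation value of the data.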

Abstract: Evaluating Text-to-SQL agents in private business intelligence (BI) settings is challenging due to the scarcity of realistic, domain-specific data. While synthetic evaluation data offers a scalable solution, existing generation methods fail to capture business realism, that is, whether questions reflect realistic business logic and workflows. We propose a Business Logic-Driven Data Synthesis framework that generates data grounded in business personas, work scenarios, and workflows. In addition, we improve the data quality by imposing a business reasoning complexity control strategy that diversifies the analytical reasoning steps required to answer the questions. Experiments on a production-scale Salesforce database show that our synthesized data achieves high business realism (98.44%), substantially outperforming OmniSQL (+19.5%) and SQL-Factory (+54.7%), while maintaining strong question-SQL alignment (98.59%). Our synthetic data also reveals that state-of-the-art Text-to-SQL models still have significant performance gaps, achieving only 42.86% execution accuracy on the most complex business queries.
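The abstract reports execution accuracy without defining it. A common reading is that a predicted query counts as correct when it returns the same rows as the gold query on the same database. The sketch below illustrates that reading with Python's standard `sqlite3` module; the `opportunities` table, the query pairs, and the decision to ignore row order are assumptions for illustration, not the paper's evaluation harness.

```python
import sqlite3


def run(conn, sql):
    """Execute a query and return its rows in a canonical order, or None on error."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose result sets match."""
    hits = 0
    for predicted, gold in pairs:
        expected = run(conn, gold)
        if expected is not None and run(conn, predicted) == expected:
            hits += 1
    return hits / len(pairs)


# Hypothetical toy database standing in for a BI schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opportunities (id INTEGER, stage TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO opportunities VALUES (?, ?, ?)",
    [(1, "won", 100), (2, "won", 200), (3, "lost", 50)],
)

pairs = [
    # Same result set, different SQL text and row order: counted as correct.
    ("SELECT stage, COUNT(id) FROM opportunities GROUP BY stage ORDER BY stage DESC",
     "SELECT stage, COUNT(*) FROM opportunities GROUP BY stage"),
    # Missing filter: executes, but returns a different value.
    ("SELECT SUM(amount) FROM opportunities",
     "SELECT SUM(amount) FROM opportunities WHERE stage = 'won'"),
    # References a column that does not exist: execution error.
    ("SELECT revenue FROM opportunities",
     "SELECT amount FROM opportunities"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

Comparing sorted result sets ignores row order, which is too lenient for questions that ask for a ranking; a stricter harness would compare ordered results whenever the gold query has an `ORDER BY`.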


[6] Towards Execution-Grounded Automated AI Research cs.CL | cs.AI | cs.LG

Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang

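TL;DR: The paper examines the feasibility of execution-grounded automated AI research by building an automated executor that implements AI research ideas and verifies them through large-scale parallel GPU experiments. It turns LLM pre-training and post-training into execution environments and compares two ways of learning from execution feedback: evolutionary search and reinforcement learning.

Details

Motivation: Current LLMs often generate plausible-looking but ineffective ideas in automated AI research. The paper asks whether execution feedback can make automated research more effective.

Result: On the LLM post-training task, execution-guided evolutionary search finds, within ten search epochs, a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%). On the pre-training task, it finds a recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes). Reinforcement learning raises the average reward but suffers from mode collapse and does not improve the upper bound.

Insight: Execution-guided evolutionary search is sample-efficient and makes effective use of the algorithmic ideas generated by frontier LLMs, whereas reinforcement learning tends to converge on simple ideas, which limits its upper bound. The study provides empirical analysis and direction for future execution-grounded automated research.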

Abstract: Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems - LLM pre-training and post-training - into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper-bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.
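To make the search loop concrete, here is a minimal sketch of evolutionary search driven by execution feedback. It is not the authors' system: `execute` is a cheap deterministic stand-in for their GPU executor, `mutate` is random perturbation standing in for an LLM ideator, and the "idea" is just a hypothetical pair of hyperparameters.

```python
import random


def execute(idea):
    """Stand-in for the automated executor: returns a score, higher is better."""
    lr, wd = idea
    return -((lr - 0.3) ** 2) - ((wd - 0.1) ** 2)


def mutate(idea, rng):
    """Stand-in for an ideator proposing a variation of a parent idea."""
    lr, wd = idea
    return (lr + rng.gauss(0, 0.05), wd + rng.gauss(0, 0.05))


def evolutionary_search(epochs=10, population=8, survivors=2, seed=0):
    """Keep the best-scoring ideas each epoch and refill the pool from them."""
    rng = random.Random(seed)
    pool = [(rng.random(), rng.random()) for _ in range(population)]
    history = []
    for _ in range(epochs):
        ranked = sorted(pool, key=execute, reverse=True)
        parents = ranked[:survivors]
        history.append(execute(parents[0]))  # best score seen this epoch
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population - survivors)
        ]
        pool = parents + children  # parents survive, so the best never regresses
    best = max(pool, key=execute)
    return best, history


best, history = evolutionary_search()
print("best idea:", best)
print("best score per epoch:", [round(score, 4) for score in history])
```

Because the parents are carried into the next pool, the best score per epoch is non-decreasing. The sketch optimizes only the best idea found and says nothing about the average quality of sampled ideas, which is the quantity the abstract says reinforcement learning improves.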


[7] Rewarding How Models Think Pedagogically: Integrating Pedagogical Reasoning and Thinking Rewards for LLMs in Education cs.CL

Unggi Lee, Jiyeong Bae, Jaehyeon Park, Haeun Park, Taejun Park

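TL;DR: This paper proposes the PedagogicalRL-Thinking framework, which extends pedagogical alignment to the reasoning process of LLMs in education through two new methods, Pedagogical Reasoning Prompting and Thinking Reward, in order to improve the quality of the internal thinking of LLMs used as intelligent tutoring systems.

Details

Motivation: Existing reinforcement learning approaches for training LLM tutors optimize only the visible responses and neglect the model's internal thinking process, and optimization aimed specifically at educational contexts remains limited.

Result: Experiments show that pedagogical reasoning prompting grounded in domain-specific educational theory outperforms generic prompting, and that Thinking Reward is most effective when combined with pedagogical prompting. Models trained only on mathematics tutoring dialogues improve on educational benchmarks not seen during training while preserving the base model's factual knowledge. Quantitative and qualitative analyses show that the pedagogical thinking reward systematically changes reasoning traces, increasing pedagogical reasoning and structured instructional decision-making.

Insight: The novelty is pushing pedagogical alignment from the output level into the internal reasoning process, using theory-guided prompting and a reward that specifically evaluates the pedagogical quality of reasoning traces, with the aim of making LLMs in educational applications more interpretable and pedagogically effective.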

Abstract: Large language models (LLMs) are increasingly deployed as intelligent tutoring systems, yet research on optimizing LLMs specifically for educational contexts remains limited. Recent works have proposed reinforcement learning approaches for training LLM tutors, but these methods focus solely on optimizing visible responses while neglecting the model’s internal thinking process. We introduce PedagogicalRL-Thinking, a framework that extends pedagogical alignment to reasoning LLMs in education through two novel approaches: (1) Pedagogical Reasoning Prompting, which guides internal reasoning using domain-specific educational theory rather than generic instructions; and (2) Thinking Reward, which explicitly evaluates and reinforces the pedagogical quality of the model’s reasoning traces. Our experiments reveal that domain-specific, theory-grounded prompting outperforms generic prompting, and that Thinking Reward is most effective when combined with pedagogical prompting. Furthermore, models trained only on mathematics tutoring dialogues show improved performance on educational benchmarks not seen during training, while preserving the base model’s factual knowledge. Our quantitative and qualitative analyses reveal that pedagogical thinking reward produces systematic reasoning trace changes, with increased pedagogical reasoning and more structured instructional decision-making in the tutor’s thinking process.


[8] Social Caption: Evaluating Social Understanding in Multimodal Models cs.CL | cs.LG

Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, Louis-Philippe Morency

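TL;DR: The paper proposes Social Caption, a framework grounded in interaction theory that evaluates the social understanding of multimodal large language models along three dimensions (Social Inference, Holistic Social Analysis, and Directed Social Analysis) and analyzes how factors such as model scale, architecture, and spoken context affect performance.

Details

Motivation: Multimodal large language models need social understanding to interpret human social interactions, but a systematic evaluation framework has been lacking. This paper aims to fill that gap.

Result: The experiments analyze the factors that influence social understanding performance, and experiments with MLLM judges provide insights into scaling automated evaluation of multimodal social understanding.

Insight: The novelty is a structured evaluation framework grounded in interaction theory that decomposes social understanding into three measurable dimensions, together with an exploration of how automated evaluation can scale.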

Abstract: Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce Social Caption, a framework grounded in interaction theory to evaluate social understanding abilities of MLLMs along three dimensions: Social Inference (SI), the ability to make accurate inferences about interactions; Holistic Social Analysis (HSA), the ability to generate comprehensive descriptions of interactions; Directed Social Analysis (DSA), the ability to extract relevant social information from interactions. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context. Experiments with MLLM judges contribute insights about scaling automated evaluation of multimodal social understanding.


[9] SearchGym: Bootstrapping Real-World Search Agents via Cost-Effective and High-Fidelity Environment Simulation cs.CL | cs.AI

Xichen Zhang, Ziyi He, Yinghao Zhu, Sitong Wu, Shaozuo Yu

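TL;DR: The paper proposes SearchGym, a high-fidelity, low-cost simulation environment for training search agents, together with SearchGym-RL, a companion curriculum learning method. By constructing a verifiable knowledge graph and an aligned document corpus, it avoids both the high cost of live Web APIs and the noise of static data snapshots, yielding clean reward signals that stabilize reinforcement learning. Experiments show that models trained in SearchGym outperform the baseline on multiple benchmarks and generalize well from simulation to the real world.

Details

Motivation: Training search agents faces a dilemma: interacting directly with commercial Web APIs is prohibitively expensive, while static data snapshots introduce noise through data misalignment, which corrupts reward signals and destabilizes training.

Result: Extensive experiments on Llama and Qwen family models demonstrate strong Sim-to-Real generalization. Specifically, a Qwen2.5-7B-Base model trained in SearchGym surpasses the web-enhanced ASearcher baseline across nine diverse benchmarks by an average relative margin of 10.6%.

Insight: The core novelty is a controllable, verifiable, high-fidelity simulation environment (SearchGym) whose rigorous generative pipeline ensures that every task is factually grounded and solvable, combined with curriculum learning (SearchGym-RL) that optimizes the policy progressively. This offers a scalable and cost-effective methodology for developing search agents.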

Abstract: Search agents have emerged as a pivotal paradigm for solving open-ended, knowledge-intensive reasoning tasks. However, training these agents via Reinforcement Learning (RL) faces a critical dilemma: interacting with live commercial Web APIs is prohibitively expensive, while relying on static data snapshots often introduces noise due to data misalignment. This misalignment generates corrupted reward signals that destabilize training by penalizing correct reasoning or rewarding hallucination. To address this, we propose SearchGym, a simulation environment designed to bootstrap robust search agents. SearchGym employs a rigorous generative pipeline to construct a verifiable knowledge graph and an aligned document corpus, ensuring that every reasoning task is factually grounded and strictly solvable. Building on this controllable environment, we introduce SearchGym-RL, a curriculum learning methodology that progressively optimizes agent policies through purified feedback, evolving from basic interactions to complex, long-horizon planning. Extensive experiments across the Llama and Qwen families demonstrate strong Sim-to-Real generalization. Notably, our Qwen2.5-7B-Base model trained within SearchGym surpasses the web-enhanced ASearcher baseline across nine diverse benchmarks by an average relative margin of 10.6%. Our results validate that high-fidelity simulation serves as a scalable and highly cost-effective methodology for developing capable search agents.


[10] Say Anything but This: When Tokenizer Betrays Reasoning in LLMs cs.CL | cs.AI

Navid Ayoobi, Marcus I Armstrong, Arjun Mukherjee

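TL;DR: The paper exposes an under-recognized fragility in LLM reasoning. Modern subword tokenizers produce non-unique encodings (multiple distinct token ID sequences can decode to the same surface string), so a model may treat textually identical strings as different internal representations, which can cause reasoning failures. Using a simple tokenization-consistency probe, the authors run over 11,000 trials across several open-source LLMs and find systematic errors such as "phantom edits", indicating that some apparent reasoning deficiencies actually originate in representational defects at the tokenizer layer.

Details

Motivation: The aim is to expose an overlooked source of fragility in LLM reasoning: non-unique (one-to-many) tokenizer encodings create a mismatch between the model's internal representation and the surface text, which can break the reasoning process.

Result: Over 11,000 replacement trials across state-of-the-art open-source LLMs reveal a non-trivial rate of outputs exhibiting phantom edits. The authors also identify and classify eight systematic tokenizer artifacts, including whitespace-boundary shifts and intra-word resegmentation.

Insight: The contribution is a systematic demonstration that representational defects at the tokenizer layer (non-unique encodings) are a significant and previously unmeasured source of LLM reasoning failures, along with a simple task designed to probe them. Viewed objectively, this offers a new perspective on LLM reasoning bottlenecks and suggests considering tokenizer-level remedies before paying the cost of training ever-larger models on ever-larger corpora.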

Abstract: Large language models (LLMs) reason over discrete token ID sequences, yet modern subword tokenizers routinely produce non-unique encodings: multiple token ID sequences can detokenize to identical surface strings. This representational mismatch creates an unmeasured fragility wherein reasoning processes can fail. LLMs may treat two internal representations as distinct “words” even when they are semantically identical at the text level. In this work, we show that tokenization can betray LLM reasoning through one-to-many token ID mappings. We introduce a tokenization-consistency probe that requires models to replace designated target words in context while leaving all other content unchanged. The task is intentionally simple at the surface level, enabling us to attribute failures to tokenizer-detokenizer artifacts rather than to knowledge gaps or parameter limitations. Through analysis of over 11000 replacement trials across state-of-the-art open-source LLMs, we find that a non-trivial share of outputs exhibit phantom edits: cases where models operate under the illusion of correct reasoning, a phenomenon arising from tokenizer-induced representational defects. We further analyze these cases and provide a taxonomy of eight systematic tokenizer artifacts, including whitespace-boundary shifts and intra-word resegmentation. These findings indicate that part of the apparent reasoning deficiency originates in the tokenizer layer, motivating tokenizer-level remedies before incurring the cost of training ever-larger models on ever-larger corpora.
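The one-to-many mapping described above is easy to reproduce with a toy vocabulary. The sketch below enumerates every token ID sequence that decodes to a given string; the five-entry vocabulary is hypothetical and far smaller than any real tokenizer's, but overlapping subword pieces create the same effect.

```python
# Hypothetical toy vocabulary with overlapping pieces.
VOCAB = {"un": 0, "do": 1, "undo": 2, "u": 3, "ndo": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def decode(ids):
    """Concatenate token strings, as a detokenizer would for this toy vocabulary."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def all_encodings(text):
    """Enumerate every token ID sequence whose decoding equals `text`."""
    if not text:
        return [[]]
    results = []
    for token, token_id in VOCAB.items():
        if text.startswith(token):
            for rest in all_encodings(text[len(token):]):
                results.append([token_id] + rest)
    return results


encodings = all_encodings("undo")
for ids in encodings:
    print(ids, "->", repr(decode(ids)))
```

All three sequences (`[0, 1]`, `[2]`, `[3, 4]`) decode to the same surface string, yet a model consuming token IDs sees three different inputs. A real tokenizer normally emits one canonical encoding, so the alternatives tend to appear when text is produced token by token or re-tokenized in a different context.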


[11] AdaTIR: Adaptive Tool-Integrated Reasoning via Difficulty-Aware Policy Optimization cs.CL

Zhaiyu Fang, Ruipeng Sun

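TL;DR: This paper proposes AdaTIR, a framework that addresses cognitive offloading in tool-integrated reasoning (TIR), where LLMs tend to over-invoke external tools even for simple tasks. Through difficulty-aware policy optimization, it dynamically adjusts the tool-use budget so that the model internalizes reasoning for simple tasks and invokes tools selectively for complex ones.

Details

Motivation: Current tool-using LLM agents exhibit cognitive offloading and redundantly invoke tools for simple tasks. True agentic intelligence requires not just invoking tools but judging adaptively when to use them. The paper aims to shift the paradigm from static tool invocation to difficulty-aware reasoning internalization.

Result: AdaTIR reduces tool calls by up to 97.6% on simple tasks and by 28.2% on complex challenges while maintaining or improving accuracy. On AIME 2024, AdaTIR outperforms baselines by 4.8% even when tool access is strictly disabled.

Insight: The core contributions are a difficulty-aware efficiency reward and Clipped Advantage Shaping (CAS). The former adjusts tool budgets according to task complexity and encourages reasoning internalization. The latter resolves the sign reversal problem, in which tool penalties outweigh correctness rewards, so that correctness remains the primary objective and efficiency a secondary constraint. This offers a new approach to building more efficient and autonomous LLM agents.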

Abstract: Tool-Integrated Reasoning (TIR) has significantly enhanced the capabilities of Large Language Models (LLMs), yet current agents tend to exhibit cognitive offloading, redundantly invoking external tools even for simple tasks. In this paper, we suggest that true agentic intelligence requires not just tool invocation, but the adaptive wisdom to discern when to use them. We propose AdaTIR, a framework that shifts the paradigm from static tool invocation to difficulty-aware reasoning internalization. By introducing a difficulty-aware efficiency reward, AdaTIR dynamically adjusts tool budgets based on task complexity, internalizing reasoning for simple tasks while selectively invoking tools for complex tasks. Furthermore, we identify a sign reversal problem where tool penalties outweigh correctness rewards, mistakenly penalizing correct rollouts with negative advantages. To resolve this, we propose Clipped Advantage Shaping (CAS), which ensures that correctness remains the primary objective while using efficiency as a secondary constraint. Empirical results demonstrate that AdaTIR reduces tool calls by up to 97.6% on simple tasks and 28.2% on complex challenges while maintaining or enhancing accuracy. Notably, AdaTIR successfully internalizes reasoning, outperforming baselines by 4.8% on AIME 2024 even when tool access is strictly disabled.
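The sign reversal problem can be shown with a few lines of arithmetic. The sketch below uses a mean-centered group advantage with a per-call tool penalty and then applies one possible clipping rule. The abstract does not give the CAS formula, so `clipped_advantages`, the penalty weight, and the positive floor are assumptions that illustrate the idea, not the paper's definition.

```python
def naive_advantages(correct, tool_calls, penalty=0.3):
    """Mean-centered advantages of reward = correctness - penalty * tool calls."""
    rewards = [c - penalty * n for c, n in zip(correct, tool_calls)]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_advantages(correct, tool_calls, penalty=0.3, floor=0.05):
    """Hypothetical clipping: a correct rollout never drops below a small
    positive advantage, so efficiency can rank correct rollouts against each
    other but cannot push a correct one into negative territory."""
    naive = naive_advantages(correct, tool_calls, penalty)
    return [max(a, floor) if c == 1 else a for a, c in zip(naive, correct)]


# One group of four rollouts for the same prompt (made-up numbers).
correct = [1, 1, 0, 0]      # 1 = final answer correct
tool_calls = [4, 0, 0, 0]   # the first rollout is correct but tool-heavy

naive = naive_advantages(correct, tool_calls)
clipped = clipped_advantages(correct, tool_calls)
print("naive:  ", [round(a, 2) for a in naive])
print("clipped:", [round(a, 2) for a in clipped])
```

In the naive version the correct but tool-heavy rollout receives an advantage of about -0.4, lower than the two incorrect rollouts at about -0.2, so the update would discourage a correct answer. After clipping, it keeps a small positive advantage while the tool-free correct rollout remains the most strongly reinforced.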


[12] ClaimDB: A Fact Verification Benchmark over Large Structured Data cs.CL

Michael Theologitis, Preetam Prabhu Srikar Dammu, Chirag Shah, Dan Suciu

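TL;DR: This paper introduces ClaimDB, the first fact-verification benchmark over large-scale structured data, comprising 80 real-life databases across domains such as governance, healthcare, and media. Claims must be verified against compositions of millions of records and multiple tables, which forces methods to shift from "reading" the evidence to reasoning in executable programs. Experiments with 30 state-of-the-art proprietary and open-source LLMs find that none exceeds 83% accuracy and that models struggle with abstention.

Details

Motivation: Existing fact-verification benchmarks focus mainly on unstructured text, while claims grounded in large-scale structured data such as databases remain underexplored. A benchmark is needed to evaluate how well models can reason over and verify claims against complex structured data.

Result: Across 30 state-of-the-art proprietary and open-source LLMs (the open-source ones below 70B parameters), none exceeds 83% accuracy and more than half fall below 55%. Both closed- and open-source models struggle with abstention, that is, admitting that there is not enough evidence to decide.

Insight: The novelty is the first fact-verification benchmark over large-scale structured data, which emphasizes a shift from reading-based verification to programmatic reasoning. The analysis shows that current LLMs have significant limitations in verifying claims over complex structured data and in abstaining reliably, which points to directions for future work.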

Abstract: Despite substantial progress in fact-verification benchmarks, claims grounded in large-scale structured data remain underexplored. In this work, we introduce ClaimDB, the first fact-verification benchmark where the evidence for claims is derived from compositions of millions of records and multiple tables. ClaimDB consists of 80 unique real-life databases covering a wide range of domains, from governance and healthcare to media, education and the natural sciences. At this scale, verification approaches that rely on “reading” the evidence break down, forcing a timely shift toward reasoning in executable programs. We conduct extensive experiments with 30 state-of-the-art proprietary and open-source (below 70B) LLMs and find that none exceed 83% accuracy, with more than half below 55%. Our analysis also reveals that both closed- and open-source models struggle with abstention – the ability to admit that there is no evidence to decide – raising doubts about their reliability in high-stakes data analysis. We release the benchmark, code, and the LLM leaderboard at https://claimdb.github.io .
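As a rough picture of what "reasoning in executable programs" with abstention could look like, the sketch below checks a numeric claim by running SQL against a small in-memory database with Python's standard `sqlite3` module. The `hospitals` table, the claim template, and the three verdict labels are hypothetical; they are not taken from ClaimDB.

```python
import sqlite3


def verify(conn, region, min_beds, claimed_count):
    """Check the claim 'exactly `claimed_count` hospitals in `region` have more
    than `min_beds` beds' by executing queries instead of reading rows."""
    known = conn.execute(
        "SELECT COUNT(*) FROM hospitals WHERE region = ?", (region,)
    ).fetchone()[0]
    if known == 0:
        # The database holds no evidence about this region: abstain.
        return "not enough info"
    actual = conn.execute(
        "SELECT COUNT(*) FROM hospitals WHERE region = ? AND beds > ?",
        (region, min_beds),
    ).fetchone()[0]
    return "supported" if actual == claimed_count else "refuted"


# Hypothetical toy data; the benchmark's databases hold millions of records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hospitals (region TEXT, beds INTEGER)")
conn.executemany(
    "INSERT INTO hospitals VALUES (?, ?)",
    [("North", 120), ("North", 80), ("North", 150), ("South", 60)],
)

print(verify(conn, "North", 100, 2))
print(verify(conn, "North", 100, 3))
print(verify(conn, "East", 100, 1))
```

The third call abstains because the table has no rows for the region, which is different from finding rows that contradict the claim. The query cost does not depend on how many rows a model could fit in its context, which is the reason the abstract gives for moving away from reading-based verification at this scale.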


[13] DARL: Encouraging Diverse Answers for General Reasoning without Verifiers cs.CL

Chongxuan Huang, Lei Lin, Xiaodong Shi, Wenping Hu, Ruiming Tang

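TL;DR: This paper proposes DARL, a reinforcement learning framework that addresses the tendency of existing general-domain methods such as RLPR to overfit to reference answers, which reduces output diversity on open-ended tasks. DARL encourages the model to generate diverse answers within a controlled deviation range while staying aligned with the reference answer, and it is compatible with existing general reinforcement learning methods without additional verifiers.

Details

Motivation: RLVR depends on domain-specific verifiers, which limits its use in open and general domains. Extensions such as RLPR relax the domain restriction but tend to overfit to reference answers, suppressing output diversity in open-ended tasks such as writing.

Result: Extensive experiments on thirteen benchmarks show consistent improvements in reasoning performance. DARL surpasses RLPR, with average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks.

Insight: The novelty is a simple, effective framework that balances accuracy and diversity by encouraging diverse answers within a controlled deviation, and that integrates into existing general reinforcement learning methods without additional verifiers. Viewed objectively, the core insight is to treat diversity as a controllable optimization target within the reward design, which mitigates overfitting to a single reference answer and is relevant to open-domain text generation.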

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models. However, its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR have extended RLVR to general domains, enabling training on broader datasets and achieving improvements over RLVR. However, a notable limitation of these methods is their tendency to overfit to reference answers, which constrains the model’s ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general reinforcement learning methods and can be seamlessly integrated without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR, achieving average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.


[14] Typhoon OCR: Open Vision-Language Model For Thai Document Extraction cs.CL

Surapon Nonesung, Natapong Nitarach, Teetouch Jaknamon, Pittawat Taveekitworachai, Kunat Pipatanakul

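TL;DR: This paper presents Typhoon OCR, an open vision-language model for document extraction in Thai and English. Fine-tuned on a Thai-focused training dataset produced by a multi-stage data construction pipeline, it handles text transcription, layout reconstruction, and document-level structural consistency within one framework. The latest version, Typhoon OCR V1.5, is a compact, inference-efficient model. Evaluations across diverse Thai document categories show performance comparable to or exceeding larger frontier proprietary models at substantially lower computational cost.

Details

Motivation: Existing vision-language models predominantly serve high-resource languages. Thai adds further challenges (script complexity from non-Latin letters, the absence of explicit word boundaries, and the prevalence of highly unstructured real-world documents) that limit the effectiveness of current open-source models, so a document extraction solution tailored to Thai is needed.

Result: Comprehensive evaluations across diverse Thai document categories, including financial reports, government forms, books, infographics, and handwritten documents, show that Typhoon OCR performs comparably to or better than larger frontier proprietary models at substantially lower computational cost.

Insight: The contributions are: 1) a multi-stage data construction pipeline (combining traditional OCR, VLM-based restructuring, and curated synthetic data) for building the Thai training dataset; 2) a unified framework that handles text transcription, layout reconstruction, and document-level structural consistency together; 3) a compact, inference-efficient model design that reduces reliance on metadata and simplifies deployment, showing that open VLMs can reach the performance of proprietary systems on document extraction for a low-resource language.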

Abstract: Document extraction is a core component of digital workflows, yet existing vision-language models (VLMs) predominantly favor high-resource languages. Thai presents additional challenges due to script complexity from non-Latin letters, the absence of explicit word boundaries, and the prevalence of highly unstructured real-world documents, limiting the effectiveness of current open-source models. This paper presents Typhoon OCR, an open VLM for document extraction tailored for Thai and English. The model is fine-tuned from vision-language backbones using a Thai-focused training dataset. The dataset is developed using a multi-stage data construction pipeline that combines traditional OCR, VLM-based restructuring, and curated synthetic data. Typhoon OCR is a unified framework capable of text transcription, layout reconstruction, and document-level structural consistency. The latest iteration of our model, Typhoon OCR V1.5, is a compact and inference-efficient model designed to reduce reliance on metadata and simplify deployment. Comprehensive evaluations across diverse Thai document categories, including financial reports, government forms, books, infographics, and handwritten documents, show that Typhoon OCR achieves performance comparable to or exceeding larger frontier proprietary models, despite substantially lower computational cost. The results demonstrate that open vision-language OCR models can achieve accurate text extraction and layout reconstruction for Thai documents, reaching performance comparable to proprietary systems while remaining lightweight and deployable.


[15] Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning cs.CL | cs.CV

Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang

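TL;DR: This paper proposes the Render-of-Thought (RoT) framework, which renders textual chain-of-thought steps as images so that the latent reasoning process becomes explicit and traceable. It achieves greater token compression and inference acceleration than explicit chain-of-thought prompting while remaining competitive in performance.

Details

Motivation: Chain-of-thought prompting imposes substantial computational overhead, intermediate reasoning lacks supervision, and latent reasoning chains are hard to analyze.

Result: On mathematical and logical reasoning benchmarks, the method achieves 3-4x token compression and substantial inference acceleration compared with explicit CoT, while maintaining competitive performance against other methods.

Insight: The method uses the vision encoders of existing vision-language models as semantic anchors to align vision embeddings with the textual space. This gives a plug-and-play design with no additional pre-training, and rendering the textual reasoning steps as images improves analyzability.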

Abstract: Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT


[16] Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation cs.CL

Rui Qi, Fengran Mo, Yufeng Chen, Xue Zhang, Shuo Wang

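TL;DR: This paper proposes LcRL, a reinforcement learning framework for multilingual retrieval-augmented generation that uses a language-coupled Group Relative Policy Optimization to mitigate knowledge bias and conflict in multilingual retrieval.

Details

Motivation: Existing MRAG methods handle queries in different languages through a uniform single-turn retrieval and optimization process. This "one-size-fits-all" strategy tends to cause knowledge bias and conflict in multilingual settings, so a more effective framework is needed.

Result: Experiments show that LcRL achieves competitive performance and suits practical scenarios such as constrained training data and retrieval over collections that cover a large number of languages.

Insight: The novel elements are language-coupled group sampling in the rollout module to reduce knowledge bias, and an auxiliary anti-consistency penalty in the reward models to mitigate knowledge conflict, which together offer a new optimization approach for multilingual retrieval-augmented generation.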

Abstract: Multilingual retrieval-augmented generation (MRAG) requires models to effectively acquire and integrate beneficial external knowledge from multilingual collections. However, most existing studies employ a uniform process where queries of equivalent semantics across different languages are processed through a single-turn retrieval and subsequent optimization. Such a "one-size-fits-all" strategy is often suboptimal in multilingual settings, as the models incur knowledge bias and conflict during the interaction with the search engine. To alleviate the issues, we propose LcRL, a multilingual search-augmented reinforcement learning framework that integrates a language-coupled Group Relative Policy Optimization into the policy and reward models. We adopt the language-coupled group sampling in the rollout module to reduce knowledge bias, and add an auxiliary anti-consistency penalty to the reward models to mitigate the knowledge conflict. Experimental results demonstrate that LcRL not only achieves competitive performance but is also appropriate for various practical scenarios such as constrained training data and retrieval over collections encompassing a large number of languages. Our code is available at https://github.com/Cherry-qwq/LcRL-Open.
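Group Relative Policy Optimization scores each rollout relative to the other rollouts in its group, typically as the reward minus the group mean, divided by the group standard deviation. The abstract does not specify how groups are coupled across languages, so the sketch below shows one possible reading: rollouts for semantically equivalent queries in different languages share one normalization group. The pooling rule, the function names, and the rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group (population std is used here;
    implementations differ on this detail)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def language_coupled_advantages(rewards_by_language):
    """Assumed coupling: pool rollouts of the same query across languages
    into a single group before normalizing."""
    languages = list(rewards_by_language)
    flat = [r for lang in languages for r in rewards_by_language[lang]]
    advantages = group_relative_advantages(flat)
    coupled, start = {}, 0
    for lang in languages:
        count = len(rewards_by_language[lang])
        coupled[lang] = advantages[start:start + count]
        start += count
    return coupled


# Made-up rewards: the English rollouts succeed, the Thai rollouts fail.
rewards = {"en": [1.0, 1.0], "th": [0.0, 0.0]}

separate = {lang: group_relative_advantages(r) for lang, r in rewards.items()}
coupled = language_coupled_advantages(rewards)
print("per-language groups:", separate)
print("coupled group:      ", coupled)
```

With per-language groups every advantage is zero, because each group has no internal variance and therefore carries no learning signal. Pooling the two languages exposes the gap between them, which is one way a coupled group could surface language-dependent bias.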


[17] PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation cs.CL

Chenning Xu, Mao Zheng, Mingyu Zheng, Mingyang Song

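TL;DR: This paper introduces PodBench, a comprehensive benchmark for instruction-aware, audio-oriented podcast script generation, comprising 800 samples with inputs of up to 21K tokens and complex multi-speaker instructions. The authors design a multifaceted evaluation framework that combines quantitative constraints with LLM-based quality assessment. Experiments show that proprietary models generally perform best, but open-source models equipped with explicit reasoning are more robust than standard baselines at handling long contexts and multi-speaker coordination. The analysis also uncovers a persistent divergence: high instruction following does not guarantee substantive content. PodBench offers a reproducible testbed for long-form, audio-centric generation.

Details

Motivation: Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, but systematic evaluation resources for this task remain limited, so a comprehensive benchmark is needed to fill the gap.

Result: Extensive experiments on PodBench show that proprietary models generally perform best, while open-source models equipped with explicit reasoning are more robust than standard baselines at handling long contexts and multi-speaker coordination. The evaluation also finds a persistent divergence between high instruction following and high content substance.

Insight: The contribution is a comprehensive benchmark for podcast script generation together with a multifaceted evaluation framework that combines quantitative constraints with LLM-based assessment. Viewed objectively, by introducing long contexts and multi-speaker instructions, and by exposing the gap between instruction following and content quality, it gives audio-oriented long-form generation a finer-grained evaluation lens and a reproducible test environment.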

Abstract: Podcast script generation requires LLMs to synthesize structured, context-grounded dialogue from diverse inputs, yet systematic evaluation resources for this task remain limited. To bridge this gap, we introduce PodBench, a benchmark comprising 800 samples with inputs up to 21K tokens and complex multi-speaker instructions. We propose a multifaceted evaluation framework that integrates quantitative constraints with LLM-based quality assessment. Extensive experiments reveal that while proprietary models generally excel, open-source models equipped with explicit reasoning demonstrate superior robustness in handling long contexts and multi-speaker coordination compared to standard baselines. However, our analysis uncovers a persistent divergence where high instruction following does not guarantee high content substance. PodBench offers a reproducible testbed to address these challenges in long-form, audio-centric generation.


[18] CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning cs.CL | cs.AIPDF

Zhiyuan Lu, Chenliang Li, Yingcheng Shi, Weizhou Shen, Ming Yan

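TL;DR: The paper introduces CorpusQA, a benchmark of up to 10 million tokens for evaluating how well large language models analyze and reason across entire document repositories. The benchmark uses a novel data synthesis framework to generate complex queries that require global reasoning over large volumes of unstructured text rather than reliance on sparse retrieval.

Details

Motivation: Existing benchmarks are mostly limited to single long texts or rest on a sparse-retrieval assumption. They cannot assess true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. A new benchmark is needed to fill this critical gap.

Result: Experiments show that even state-of-the-art long-context LLMs degrade as input length increases, and standard retrieval-augmented generation systems collapse entirely. By contrast, memory-augmented agentic architectures are more robust.

Insight: The contributions are a corpus-analysis benchmark that scales to 10 million tokens and a data synthesis framework that decouples reasoning from textual representation to generate complex queries with programmatically guaranteed ground-truth answers. The framework is useful beyond evaluation: its synthesized data also effectively improves an LLM's general long-context reasoning. The study argues for a shift from simply extending context windows to developing advanced architectures for global information synthesis.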

Abstract: While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a “sparse retrieval” assumption: that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM’s general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.


[19] LogicScore: Fine-grained Logic Evaluation of Conciseness, Completeness, and Determinateness in Attributed Question Answering cs.CLPDF

Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Shaoru Guo

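TL;DR: This paper presents LogicScore, a fine-grained logic evaluation framework for attributed question answering (AQA). It addresses the tendency of existing evaluation methods to check the attribution of isolated statements while ignoring the overall logical integrity of long-form answers. Built on Horn rules and a backward verification mechanism, the framework evaluates the global reasoning quality of answers along three dimensions: completeness, conciseness, and determinateness.

Details

Motivation: Current AQA evaluation methods suffer from "attribution myopia": they overemphasize verifying isolated statements and their attributions and overlook the global logical consistency of long-form answers. As a result, large language models (LLMs) often produce answers that are factually correct but logically incoherent.

Result: Experiments on three multi-hop QA datasets (HotpotQA, MusiQue, 2WikiMultiHopQA) and more than 20 LLMs (including GPT-5, Gemini-3-Pro, and LLaMA3) show that leading models achieve high attribution precision (e.g., 92.85% for Gemini-3 Pro) but perform poorly on global reasoning quality (e.g., a Conciseness score of only 35.11% for Gemini-3 Pro), revealing a significant gap in logical reasoning ability.

Insight: The paper shifts the evaluation paradigm from local attribution verification to global logical scrutiny. It proposes a unified evaluation framework based on formal logic (Horn rules) and backward verification, and defines three quantifiable logic dimensions (completeness, conciseness, and determinateness), establishing a new standard for evaluating the logical reasoning of LLMs.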

Abstract: Current evaluation methods for Attributed Question Answering (AQA) suffer from attribution myopia: they emphasize verification of isolated statements and their attributions but overlook the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses with elusive deductive gaps. To mitigate this limitation, we present LogicScore, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Conciseness (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Conciseness for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Codes are available at: https://github.com/zhichaoyan11/LogicScore.


[20] RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR) cs.CLPDF

Yishu Wei, Adam E. Flanders, Errol Colak, John Mongan, Luciano M Prevedello

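TL;DR: The paper introduces the RSNA REVEAL-CXR dataset, a public benchmark of 200 chest radiographic studies with 12 benchmark labels for evaluating multimodal large language models on diagnosing cardiothoracic disease from chest radiographs. The dataset was built with an AI-assisted expert labeling procedure, and each radiograph was verified by three radiologists to ensure high-quality labels.

Details

Motivation: Clinical application of multimodal large language models is held back by the lack of high-quality, expert-labeled benchmark datasets. This work aims to support the development of clinically useful multimodal LLM tools, particularly for chest radiograph diagnosis.

Result: A public benchmark of 200 chest radiographic studies was created (100 released, 100 held out), each with 12 benchmark labels and verified by three radiologists. For 381 radiographs, at least two radiologists fully agreed with the suggested labels, and the 200 studies were selected from this set. The holdout dataset is used exclusively by RSNA to independently evaluate different models.

Insight: The contributions include: 1) an AI-assisted expert labeling procedure that uses GPT-4o and a locally hosted LLM (Phi-4-Reasoning) to extract and map abnormal findings, making labeling more efficient; 2) a sampling algorithm based on AI-suggested labels that ensures the dataset is clinically relevant and covers a range of difficulty levels; 3) a semicollaborative labeling environment that minimizes omissions and supports high-quality labeling at scale.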

Abstract: Multimodal large language models have demonstrated comparable performance to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. The purpose of this work was to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, and they marked “Agree all”, “Agree mostly” or “Disagree” to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. Of these, at least two radiologists selected “Agree all” for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.
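
The consensus rule described in the abstract (keep a radiograph when at least two of its three reviewers chose "Agree all") can be sketched as below. This is only an illustration of the stated rule, not the authors' pipeline: the function name `consensus_studies`, the dictionary layout, and the study ids are hypothetical.

```python
def consensus_studies(ratings, required=2):
    """Return ids of studies where at least `required` reviewers chose "Agree all".

    `ratings` maps a (hypothetical) study id to the list of reviewer responses.
    Comparison is case-insensitive.
    """
    return [
        study_id
        for study_id, votes in ratings.items()
        if sum(vote.strip().lower() == "agree all" for vote in votes) >= required
    ]


# Made-up example ratings, three reviewers per study.
example = {
    "study-a": ["Agree all", "Agree all", "Disagree"],
    "study-b": ["Agree all", "Agree mostly", "Disagree"],
    "study-c": ["Agree all", "Agree all", "Agree all"],
}
kept = consensus_studies(example)
```

In this made-up example, `study-a` and `study-c` pass the filter while `study-b` does not, since only one of its reviewers fully agreed.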


[21] The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models cs.CL | cs.AI | cs.LGPDF

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao

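TL;DR: This paper challenges the intuition that arbitrary-order generation improves reasoning in diffusion large language models (dLLMs). It shows that this flexibility instead lets models sidestep critical high-uncertainty tokens, causing a premature collapse of the solution space. The authors propose JustGRPO, a minimalist method that forgoes arbitrary-order generation and applies standard Group Relative Policy Optimization. It effectively elicits the reasoning ability of dLLMs and achieves strong performance while retaining their parallel decoding ability.

Details

Motivation: Prior work widely assumes that arbitrary-order generation gives dLLMs a theoretically larger solution space than traditional autoregressive models, and hence greater reasoning potential on tasks such as mathematics and coding. This assumption has spawned many reinforcement-learning methods designed to exploit the flexibility. This paper tests that premise and examines how arbitrary-order generation behaves in practice.

Result: On the GSM8K mathematical reasoning benchmark, JustGRPO reaches 89.1% accuracy, demonstrating its effectiveness.

Insight: The core contribution is identifying the "flexibility trap" in dLLMs: models abuse order flexibility to avoid exploring critical reasoning steps. Viewed objectively, the work illustrates the importance of empirically testing intuitive design assumptions (such as "flexibility is always beneficial") and offers a minimalist yet effective technique, JustGRPO, that improves performance by deliberately restricting flexibility (forgoing arbitrary order).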

Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
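
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage at its core. Several completions are sampled for the same prompt, and each reward is normalized by the mean and standard deviation of its group. This is the generic formulation, not the JustGRPO implementation. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per completion sampled for the same prompt.
    `eps` (an assumed stabilizer) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some codebases use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math problem, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative, so no separate value network is needed to provide a baseline. When every completion in a group earns the same reward, all advantages are zero and the group contributes no learning signal.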


[22] The Effect of Scripts and Formats on LLM Numeracy cs.CLPDF

Varshini Reddy, Craig W. Schmidt, Seth Ebner, Adam Wiemerslage, Yuval Pinter

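TL;DR: This paper studies the numerical reasoning ability of large language models (LLMs) across different numeral scripts and formats. It finds that LLM accuracy drops substantially when numbers are presented in scripts or formats that are uncommon in the training corpus, even though the underlying mathematical reasoning is identical. It also shows that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap.

Details

Motivation: Although LLMs have reached near-human performance on standard arithmetic tasks, their behavior on numeral scripts and formats outside the prevailing conventions of their training corpora has not been studied in depth. This paper investigates LLM numerical reasoning under diverse number representations to expose potential weaknesses.

Result: Experiments show that LLM accuracy drops substantially when numbers are rendered in underrepresented scripts or formats. Strategies such as few-shot prompting and explicit numeral mapping markedly improve performance and narrow the gap to standard formats.

Insight: The paper systematically exposes an overlooked challenge in cross-script, cross-format numerical reasoning: model performance is highly sensitive to the surface representation of numbers. Viewed objectively, this underscores the importance of evaluating and improving LLM robustness, and offers actionable guidance for making models more reliable in diverse settings through practical means such as prompt engineering.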

Abstract: Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical reasoning being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.
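
One way to picture "explicit numeral mapping" is a digit-by-digit table between a numeral script and ASCII digits. The table can be spelled out in the prompt or applied as preprocessing. The sketch below uses Devanagari digits (U+0966 to U+096F) as an example. The abstract does not give the paper's actual prompts or scripts, so the helper names and the hint wording here are assumptions.

```python
# Devanagari digits zero through nine occupy U+0966..U+096F.
DEVANAGARI_DIGITS = [chr(0x0966 + i) for i in range(10)]

# Translation table from each script digit to its ASCII counterpart.
TO_ASCII = {ord(d): str(i) for i, d in enumerate(DEVANAGARI_DIGITS)}


def to_ascii_digits(text):
    """Replace Devanagari digits with ASCII digits, leaving other text untouched."""
    return text.translate(TO_ASCII)


def mapping_hint():
    """Build a one-line mapping that could be prepended to a prompt (hypothetical wording)."""
    pairs = ", ".join(f"{d} = {i}" for i, d in enumerate(DEVANAGARI_DIGITS))
    return f"Digit mapping: {pairs}."


question = f"{DEVANAGARI_DIGITS[1]}{DEVANAGARI_DIGITS[2]} + {DEVANAGARI_DIGITS[3]}"
normalized = to_ascii_digits(question)  # the same expression written with ASCII digits
```

The normalization route sidesteps the script entirely, while the hint route keeps the original input and asks the model to do the mapping itself. The second is closer to a prompting strategy.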


cs.CV [Back]

[23] A Cloud-Based Cross-Modal Transformer for Emotion Recognition and Adaptive Human-Computer Interaction cs.CV | cs.AI | cs.HC | cs.LG | cs.SD | eess.ASPDF

Ziwen Zhong, Zhitao Shu, Yue Zhao

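TL;DR: This paper proposes a Cloud-Based Cross-Modal Transformer (CMT) framework for multimodal emotion recognition and adaptive human-computer interaction. The model integrates visual, auditory, and textual signals, uses pretrained encoders and a cross-modal attention mechanism to capture complex dependencies among heterogeneous features, and relies on cloud computing infrastructure for scalable, low-latency emotion recognition at large scale.

Details

Motivation: Existing emotion recognition systems often rely on single-modality analysis (such as facial expressions, speech tone, or textual sentiment), which limits robustness and generalization in real-world environments. This study aims to develop a more robust and scalable multimodal emotion recognition framework.

Result: Experiments on benchmark datasets including IEMOCAP, MELD, and AffectNet show that CMT achieves state-of-the-art performance, improving the F1-score by 3.0% and reducing cross-entropy loss by 12.9% compared with strong multimodal baselines. Cloud deployment evaluations show an average response latency of 128 ms, a 35% reduction compared with conventional transformer-based fusion systems.

Insight: The contributions include multimodal fusion that combines pretrained encoders (Vision Transformer, Wav2Vec2, and BERT) with a cross-modal attention mechanism, and scalable, low-latency deployment on cloud infrastructure (distributed training on Kubernetes and TensorFlow Serving). Together they represent an important step toward cloud-native affective computing and emotionally intelligent interactive systems.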

Abstract: Emotion recognition is a fundamental component of next-generation human-computer interaction (HCI), enabling machines to perceive, understand, and respond to users’ affective states. However, existing systems often rely on single-modality analysis such as facial expressions, speech tone, or textual sentiment, resulting in limited robustness and poor generalization in real-world environments. To address these challenges, this study proposes a Cloud-Based Cross-Modal Transformer (CMT) framework for multimodal emotion recognition and adaptive human-computer interaction. The proposed model integrates visual, auditory, and textual signals using pretrained encoders (Vision Transformer, Wav2Vec2, and BERT) and employs a cross-modal attention mechanism to capture complex interdependencies among heterogeneous features. By leveraging cloud computing infrastructure with distributed training on Kubernetes and TensorFlow Serving, the system enables scalable, low-latency emotion recognition for large-scale user interactions. Experiments conducted on benchmark datasets including IEMOCAP, MELD, and AffectNet demonstrate that the CMT achieves state-of-the-art performance, improving the F1-score by 3.0 percent and reducing cross-entropy loss by 12.9 percent compared to strong multimodal baselines. Additionally, cloud deployment evaluations show an average response latency of 128 ms, representing a 35 percent reduction compared with conventional transformer-based fusion systems. These results confirm that the proposed framework enables efficient, real-time emotion recognition and adaptive feedback in applications such as intelligent customer service, virtual tutoring systems, and affective computing interfaces, marking an important step toward cloud-native affective computing and emotionally intelligent interactive systems.


[24] Intelligent Power Grid Design Review via Active Perception-Enabled Multimodal Large Language Models cs.CV | cs.HC | cs.LGPDF

Taoliang Tan, Chengwei Ma, Zhen Tian, Zhao Lin, Dongdong Li

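TL;DR: This paper proposes a three-stage framework, built on pre-trained Multimodal Large Language Models (MLLMs), for the intelligent review of power grid engineering design drawings. Mimicking the expert review process, it first performs global semantic understanding to locate key regions, then carries out high-resolution, fine-grained recognition, and finally makes an integrated decision that diagnoses design errors and provides a reliability assessment.

Details

Motivation: Current automated systems struggle with ultra-high-resolution power grid design drawings because of high computational demands, information loss, and a lack of the holistic semantic understanding needed to identify design errors. A more intelligent and reliable review method is needed.

Result: Preliminary results on real-world power grid drawings show that the method significantly improves the MLLM's grasp of macroscopic semantic information and its ability to locate design errors. Compared with traditional passive MLLM inference, both defect discovery accuracy and the reliability of review judgments improve.

Insight: The contribution is a prompt-engineering-driven active perception paradigm. Through a staged framework (global understanding, region focus, integrated decision), it applies MLLM capabilities to high-resolution image understanding and decision-making in a specialized domain, making review more intelligent and reliable.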

Abstract: The intelligent review of power grid engineering design drawings is crucial for power system safety. However, current automated systems struggle with ultra-high-resolution drawings due to high computational demands, information loss, and a lack of holistic semantic understanding for design error identification. This paper proposes a novel three-stage framework for intelligent power grid drawing review, driven by pre-trained Multimodal Large Language Models (MLLMs) through advanced prompt engineering. Mimicking the human expert review process, the first stage leverages an MLLM for global semantic understanding to intelligently propose domain-specific semantic regions from a low-resolution overview. The second stage then performs high-resolution, fine-grained recognition within these proposed regions, acquiring detailed information with associated confidence scores. In the final stage, a comprehensive decision-making module integrates these confidence-aware results to accurately diagnose design errors and provide a reliability assessment. Preliminary results on real-world power grid drawings demonstrate our approach significantly enhances MLLM’s ability to grasp macroscopic semantic information and pinpoint design errors, showing improved defect discovery accuracy and greater reliability in review judgments compared to traditional passive MLLM inference. This research offers a novel, prompt-driven paradigm for intelligent and reliable power grid drawing review.


[25] CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments cs.CV | cs.AIPDF

Haotian Xu, Yue Hu, Zhengqiu Zhu, Chen Gao, Ziyou Wang

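TL;DR: The paper introduces CityCube, a benchmark for evaluating the cross-view spatial reasoning of vision-language models (VLMs) in urban environments. It contains 5,022 carefully annotated multi-view QA pairs covering five cognitive dimensions and three spatial relation expressions, and integrates four viewpoint dynamics to mimic camera movements. An evaluation of 33 VLMs finds that large-scale models reach at most 54.1% accuracy, far below human performance, while small-scale fine-tuned models exceed 60%, revealing a significant gap between current VLMs and human reasoning.

Details

Motivation: Existing benchmarks focus mainly on indoor or street scenes and overlook the unique challenges of open-ended urban spaces, which feature rich semantics, complex geometries, and view variations. A dedicated benchmark is needed to evaluate the cross-view spatial reasoning of VLMs in urban environments.

Result: Across the 33 VLMs evaluated on CityCube, the best large-scale model reaches 54.1% accuracy, 34.2% below human performance. Small-scale fine-tuned models exceed 60.0% accuracy, underscoring the need for the benchmark.

Insight: The contribution is the first systematic benchmark focused on cross-view reasoning in urban environments, integrating perspectives from multiple platforms (vehicles, drones, satellites) and simulated viewpoint dynamics. Viewed objectively, the benchmark effectively exposes the fundamental differences in spatial cognition between VLMs and humans, and highlights the importance of fine-tuning for specific tasks.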

Abstract: Cross-view spatial reasoning is essential for embodied AI, underpinning spatial understanding, mental simulation and planning in complex environments. Existing benchmarks primarily emphasize indoor or street settings, overlooking the unique challenges of open-ended urban spaces characterized by rich semantics, complex geometries, and view variations. To address this, we introduce CityCube, a systematic benchmark designed to probe cross-view reasoning capabilities of current VLMs in urban settings. CityCube integrates four viewpoint dynamics to mimic camera movements and spans a wide spectrum of perspectives from multiple platforms, e.g., vehicles, drones and satellites. For a comprehensive assessment, it features 5,022 meticulously annotated multi-view QA pairs categorized into five cognitive dimensions and three spatial relation expressions. A comprehensive evaluation of 33 VLMs reveals a significant performance disparity with humans: even large-scale models struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. By contrast, small-scale fine-tuned VLMs achieve over 60.0% accuracy, highlighting the necessity of our benchmark. Further analyses indicate the task correlations and fundamental cognitive disparity between VLMs and human-like reasoning.


[26] Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data cs.CV | eess.IVPDF

Yixiong Chen, Zongwei Zhou, Wenxuan Li, Alan Yuille

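TL;DR: This paper proposes SegAE, a lightweight vision-language model that automatically assesses label quality in large-scale medical segmentation datasets. Trained on over four million image-label pairs with quality scores, the model quickly predicts label quality across 142 anatomical structures and reveals widespread low-quality labeling in public datasets.

Details

Motivation: Large-scale medical segmentation datasets often mix manual labels and pseudo-labels of uneven quality. Low-quality labels harm the robustness of model training and evaluation, so an automated tool for assessing label quality is needed.

Result: SegAE reaches a correlation coefficient of 0.902 with ground-truth Dice similarity and evaluates a 3D mask in only 0.06 seconds. In active and semi-supervised learning, it reduces dataset annotation cost by one-third and quality-checking time by 70% per label.

Insight: The contribution is the use of a vision-language model for general-purpose label quality assessment across anatomical structures, with efficient automation achieved through training on synthetic data. Viewed objectively, the method offers a lightweight and scalable solution for quality control of large-scale medical data.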

Abstract: Large-scale medical segmentation datasets often combine manual and pseudo-labels of uneven quality, which can compromise training and evaluation. Low-quality labels may hamper performance and make the model training less robust. To address this issue, we propose SegAE (Segmentation Assessment Engine), a lightweight vision-language model (VLM) that automatically predicts label quality across 142 anatomical structures. Trained on over four million image-label pairs with quality scores, SegAE achieves a high correlation coefficient of 0.902 with ground-truth Dice similarity and evaluates a 3D mask in 0.06s. SegAE shows several practical benefits: (I) Our analysis reveals widespread low-quality labeling across public datasets; (II) SegAE improves data efficiency and training performance in active and semi-supervised learning, reducing dataset annotation cost by one-third and quality-checking time by 70% per label. This tool provides a simple and effective solution for quality control in large-scale medical segmentation datasets. The dataset, model weights, and codes are released at https://github.com/Schuture/SegAE.
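
SegAE is evaluated by how well its predicted quality correlates with ground-truth Dice similarity. As a reminder of what that target measures, the sketch below shows a minimal Dice computation on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions; this is not code from the SegAE repository.

```python
def dice_similarity(pred, target):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) for two flattened binary masks.

    `pred` and `target` are equal-length sequences of 0/1 values.
    Two empty masks are treated as a perfect match (an assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


# A label that covers the true voxel plus one false positive.
score = dice_similarity([1, 1, 0, 0], [1, 0, 0, 0])  # 2*1 / (2 + 1)
```

A quality predictor such as the one described would be trained to output a value close to this score without access to the ground-truth mask at inference time.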


[27] Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation cs.CV | cs.AI | cs.CL | cs.LGPDF

Danial Sadrian Zadeh, Otman A. Basir, Behzad Moshiri

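TL;DR: This paper presents a new vision-based natural language scene understanding framework for autonomous driving that converts a single frontal-view camera image into a concise natural language description capturing spatial layouts, semantic relationships, and driving-relevant cues. To address the scarcity of specialized datasets in this domain, it also builds a new dataset derived from BDD100K and discusses the relevant evaluation metrics in depth.

Details

Motivation: Autonomous driving requires accurate perception and understanding of traffic scenes for safe navigation, yet existing methods fall short at generating detailed, context-rich natural language descriptions from images, and specialized datasets are lacking.

Result: Extensive quantitative evaluation on the newly built dataset, using metrics such as CIDEr and SPICE together with human judgment, shows that the proposed model achieves strong performance and meets its intended objectives.

Insight: The contributions are a model with a hybrid attention mechanism that strengthens spatial and semantic feature extraction, and a new dataset dedicated to traffic scene description generation, which provides a more suitable evaluation benchmark for the task.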

Abstract: Traffic scene understanding is essential for enabling autonomous vehicles to accurately perceive and interpret their environment, thereby ensuring safe navigation. This paper presents a novel framework that transforms a single frontal-view camera image into a concise natural language description, effectively capturing spatial layouts, semantic relationships, and driving-relevant cues. The proposed model leverages a hybrid attention mechanism to enhance spatial and semantic feature extraction and integrates these features to generate contextually rich and detailed scene descriptions. To address the limited availability of specialized datasets in this domain, a new dataset derived from the BDD100K dataset has been developed, with comprehensive guidelines provided for its construction. Furthermore, the study offers an in-depth discussion of relevant evaluation metrics, identifying the most appropriate measures for this task. Extensive quantitative evaluations using metrics such as CIDEr and SPICE, complemented by human judgment assessments, demonstrate that the proposed model achieves strong performance and effectively fulfills its intended objectives on the newly developed dataset.


[28] Gaussian Based Adaptive Multi-Modal 3D Semantic Occupancy Prediction cs.CVPDF

A. Enes Doruk

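TL;DR: This paper proposes a Gaussian-based adaptive multimodal 3D semantic occupancy prediction method aimed at long-tail safety challenges in autonomous driving. Using an efficient 3D Gaussian model, it combines the semantic strengths of the camera modality with the geometric strengths of the LiDAR modality. It has four key components: LiDAR depth feature aggregation, entropy-based feature smoothing, adaptive camera-LiDAR fusion, and a Gauss-Mamba head. Together they reduce computational complexity and improve robustness in dynamic environments.

Details

Motivation: Current voxelization-based 3D semantic occupancy prediction methods suffer from high computational complexity and a fusion process that is brittle, static, and prone to failure in dynamic environments. They cannot effectively address the long-tail safety challenges of autonomous driving, so a more efficient, adaptive, and robust multimodal fusion scheme is needed.

Result: The abstract reports no specific quantitative results or benchmarks. It emphasizes that the method targets better prediction efficiency and performance in dynamic environments through global context decoding with linear computational complexity (the Gauss-Mamba head) and an adaptive fusion mechanism.

Insight: The contributions include memory-efficient multimodal fusion with a 3D Gaussian model; depth-wise deformable sampling to handle geometric sparsity; feature smoothing with cross-entropy to handle domain-specific noise; an adaptive fusion mechanism that dynamically recalibrates sensor outputs based on model outputs; and global context decoding with linear computational complexity using Selective State Space Models (SSMs). These designs could carry over to other multimodal 3D perception tasks.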

Abstract: The paradigm shift from sparse object detection towards dense 3D semantic occupancy prediction is necessary for dealing with long-tail safety challenges for autonomous vehicles. Nonetheless, current voxelization methods commonly suffer from excessive computational complexity, and their fusion process is brittle, static, and breaks down under dynamic environmental settings. To this end, this work proposes a novel Gaussian-based adaptive camera-LiDAR multimodal 3D occupancy prediction model that seamlessly bridges the semantic strengths of the camera modality with the geometric strengths of the LiDAR modality through a memory-efficient 3D Gaussian model. The proposed solution has four key components: (1) LiDAR Depth Feature Aggregation (LDFA), where depth-wise deformable sampling is employed for dealing with geometric sparsity, (2) Entropy-Based Feature Smoothing, where cross-entropy is employed for handling domain-specific noise, (3) Adaptive Camera-LiDAR Fusion, where dynamic recalibration of sensor outputs is performed based on model outputs, and (4) a Gauss-Mamba Head that uses Selective State Space Models for global context decoding with linear computational complexity.


[29] GutenOCR: A Grounded Vision-Language Front-End for Documents cs.CV | cs.AI | cs.CL | cs.LGPDF

Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew

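TL;DR: GutenOCR is a family of vision-language models, obtained by fine-tuning Qwen2.5-VL, that serves as an OCR front-end for documents. Through a unified prompt interface it supports reading, detection, and grounding. Trained on business documents and scientific articles, it performs full-page or localized text recognition and provides line- and paragraph-level bounding boxes as well as conditional queries.

Details

Motivation: Traditional OCR systems lack a unified, flexible vision-language interface for document processing and struggle to support text reading, detection, and grounding at the same time. This work aims to improve the overall capability of document understanding.

Result: On 10.5K held-out business and scientific document pages, GutenOCR-7B raises the composite grounded OCR score from the backbone's 0.40 to 0.82, more than doubling it. On the Fox and OmniDocBench v1.5 benchmarks, it substantially improves region- and line-level OCR and text-detection recall, but shows trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.

Insight: The contribution is a unified prompt interface, built by fine-tuning a vision-language model, that integrates reading, detection, and grounding for document OCR. Viewed objectively, the grounded OCR evaluation protocol and conditional query capability it introduces offer a more complete solution for document understanding, though the performance trade-offs on complex layouts should be kept in mind.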

Abstract: GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional “where is x?” queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.


[30] Scribble-Supervised Medical Image Segmentation with Dynamic Teacher Switching and Hierarchical Consistency cs.CVPDF

Thanh-Huy Nguyen, Hoang-Loc Cao, Dat T. Chung, Mai-Anh Vu, Thanh-Minh Nguyen

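TL;DR: This paper proposes SDT-Net, a novel dual-teacher, single-student framework for scribble-supervised medical image segmentation. A Dynamic Teacher Switching module adaptively selects the most reliable teacher model, and high-confidence pseudo-labels together with a Hierarchical Consistency module strengthen supervision, addressing the ambiguity and noise propagation caused by weak annotations.

Details

Motivation: Scribble-supervised methods aim to ease the heavy annotation burden in medical image segmentation, but the sparsity of scribble annotations introduces significant ambiguity, which leads to noisy pseudo-label propagation and hinders the learning of robust anatomical boundaries.

Result: Extensive experiments on the ACDC and MSCMRseg datasets show that SDT-Net achieves state-of-the-art performance, producing more accurate and anatomically plausible segmentations.

Insight: The contribution is the combination of a dynamic teacher switching mechanism with hierarchical consistency constraints. By adaptively selecting a reliable teacher and using multi-level feature alignment, it effectively improves segmentation quality under weak supervision. Viewed objectively, the dual-teacher architecture and pixel-level reliability filtering offer a new approach to handling annotation sparsity.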

Abstract: Scribble-supervised methods have emerged to mitigate the prohibitive annotation burden in medical image segmentation. However, the inherent sparsity of these annotations introduces significant ambiguity, which results in noisy pseudo-label propagation and hinders the learning of robust anatomical boundaries. To address this challenge, we propose SDT-Net, a novel dual-teacher, single-student framework designed to maximize supervision quality from these weak signals. Our method features a Dynamic Teacher Switching (DTS) module to adaptively select the most reliable teacher. This selected teacher then guides the student via two synergistic mechanisms: high-confidence pseudo-labels, refined by a Pick Reliable Pixels (PRP) mechanism, and multi-level feature alignment, enforced by a Hierarchical Consistency (HiCo) module. Extensive experiments on the ACDC and MSCMRseg datasets demonstrate that SDT-Net achieves state-of-the-art performance, producing more accurate and anatomically plausible segmentation.


[31] Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement cs.CV | cs.AIPDF

Wei Ma, Shaowu Chen, Junjie Ye, Peichang Zhang, Lei Huang

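TL;DR: This paper proposes a lightweight adaptive video inference enhancement framework that uses a fuzzy controller to switch dynamically between models of different scales, balancing resource utilization against inference performance.

Details

Motivation: Existing video inference enhancement methods usually improve performance by scaling up models and adopting complex network architectures, but they overlook the trade-off between resource efficiency and inference effectiveness, leading to inefficient resource utilization and suboptimal inference performance.

Result: Experimental results show that the method achieves an effective balance between resource utilization and inference performance, but no specific benchmarks or comparisons with the state of the art are mentioned.

Insight: The contribution is a fuzzy controller that adjusts model scale dynamically based on system parameters and inference metrics, combined with adaptive enhancement that exploits the spatiotemporal correlation of adjacent video frames. The lightweight dynamic switching strategy is a useful reference for improving inference efficiency in resource-constrained settings.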

Abstract: Existing video inference (VI) enhancement methods typically aim to improve performance by scaling up model sizes and employing sophisticated network architectures. While these approaches demonstrated state-of-the-art performance, they often overlooked the trade-off of resource efficiency and inference effectiveness, leading to inefficient resource utilization and suboptimal inference performance. To address this problem, a fuzzy controller (FC-r) is developed based on key system parameters and inference-related metrics. Guided by the FC-r, a VI enhancement framework is proposed, where the spatiotemporal correlation of targets across adjacent video frames is leveraged. Given the real-time resource conditions of the target device, the framework can dynamically switch between models of varying scales during VI. Experimental results demonstrate that the proposed method effectively achieves a balance between resource utilization and inference performance.


[32] Anatomically Guided Latent Diffusion for Brain MRI Progression Modeling cs.CVPDF

Cheng Wan, Bahram Jafrasteh, Ehsan Adeli, Miaomiao Zhang, Qingyu Zhao

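TL;DR: This paper proposes the Anatomically Guided Latent Diffusion Model (AG-LDM) for modeling longitudinal brain MRI progression. The method simplifies the training pipeline by directly fusing baseline anatomy, noisy follow-up states, and clinical covariates, and uses a lightweight 3D tissue segmentation model (WarpSeg) for anatomical supervision to ensure the anatomical consistency and morphometric fidelity of generated images.

Details

Motivation: Existing methods such as Brain Latent Progression suffer from architectural complexity, suboptimal use of clinical covariates, and limited guarantees of anatomical consistency. A more efficient and anatomically reliable framework is needed to model longitudinal brain MRI progression.

Result: Experiments on 31,713 ADNI longitudinal pairs and zero-shot evaluation on OASIS-3 show that AG-LDM matches or surpasses more complex diffusion models, achieving state-of-the-art image quality and reducing volumetric errors in generated images by 15-20%. AG-LDM also makes markedly stronger use of temporal and clinical covariates (up to 31.5x higher sensitivity than BrLP) and generates biologically plausible counterfactual trajectories that accurately capture hallmarks of Alzheimer's disease progression.

Insight: The contribution is integrating anatomical guidance directly into the input fusion and training supervision of a latent diffusion model, which avoids auxiliary control networks and yields a unified end-to-end model. This provides an effective strategy for simplifying the architecture while improving anatomical consistency and the use of clinical covariates.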

Abstract: Accurately modeling longitudinal brain MRI progression is crucial for understanding neurodegenerative diseases and predicting individualized structural changes. Existing state-of-the-art approaches, such as Brain Latent Progression (BrLP), often use multi-stage training pipelines with auxiliary conditioning modules but suffer from architectural complexity, suboptimal use of conditional clinical covariates, and limited guarantees of anatomical consistency. We propose Anatomically Guided Latent Diffusion Model (AG-LDM), a segmentation-guided framework that enforces anatomically consistent progression while substantially simplifying the training pipeline. AG-LDM conditions latent diffusion by directly fusing baseline anatomy, noisy follow-up states, and clinical covariates at the input level, a strategy that avoids auxiliary control networks by learning a unified, end-to-end model that represents both anatomy and progression. A lightweight 3D tissue segmentation model (WarpSeg) provides explicit anatomical supervision during both autoencoder fine-tuning and diffusion model training, ensuring consistent brain tissue boundaries and morphometric fidelity. Experiments on 31,713 ADNI longitudinal pairs and zero-shot evaluation on OASIS-3 demonstrate that AG-LDM matches or surpasses more complex diffusion models, achieving state-of-the-art image quality and 15-20% reduction in volumetric errors in generated images. AG-LDM also exhibits markedly stronger utilization of temporal and clinical covariates (up to 31.5x higher sensitivity than BrLP) and generates biologically plausible counterfactual trajectories, accurately capturing hallmarks of Alzheimer’s progression such as limbic atrophy and ventricular expansion. These results highlight AG-LDM as an efficient, anatomically grounded framework for reliable brain MRI progression modeling.


[33] LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning cs.CVPDF

Lianying Chao, Linfeng Yin, Peiyu Ren, Yifan Jiang, Qiaoyu Ren

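TL;DR: This paper proposes a Learnable Frame Selector (LFS) for video captioning. LFS models temporal importance to select temporally diverse and event-relevant video frames, and uses caption feedback from frozen video large language models (LLMs) to optimize downstream caption quality directly. The paper also introduces a new benchmark, ICH-CC, to close the gap between existing benchmarks and human cognition. Experiments show that LFS consistently improves detailed video captioning across multiple benchmarks and also improves video question answering.

Details

Motivation: Existing video captioning models usually sample video frames uniformly, which ignores the uneven distribution of events in a video, incurs high computational cost, and may miss key information. A method is needed that selects temporally diverse and event-relevant frames.

Result: On two representative community benchmarks and the newly proposed ICH-CC dataset, LFS improves detailed video captioning by up to 2.0% on VDC and by over 4% on ICH-CC. Captions enhanced with LFS also improve performance on video question answering.

Insight: The main contributions are: 1) a learnable frame selector that explicitly models temporal importance to balance temporal diversity and event relevance, with a stratified strategy to ensure temporal coverage; 2) the use of caption feedback from frozen video-LLMs as a supervision signal that directly optimizes downstream task quality; 3) the new ICH-CC benchmark, which better evaluates how consistent models are with human cognition. Viewed objectively, treating frame selection as a learnable task coupled with downstream feedback is an effective and easy-to-integrate solution.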

Abstract: Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Since encoding all frames is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage while ignoring the uneven distribution of events. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames. LFS explicitly models temporal importance to balance temporal diversity and event relevance, and employs a stratified strategy to ensure temporal coverage while avoiding clustering. Crucially, LFS leverages caption feedback from frozen video-LLMs to learn frame selection that directly optimizes downstream caption quality. Additionally, we identify a gap between existing benchmarks and human cognition. Thus, we introduce ICH-CC, built from questions carefully designed by annotators to reflect a human-consistent understanding of video. Experiments indicate that LFS consistently improves detailed video captioning across two representative community benchmarks and ICH-CC, achieving up to 2.0% gains on VDC and over 4% gains on ICH-CC. Moreover, we observe that captions enhanced with LFS lead to improved performance on video question answering. Overall, LFS provides an effective and easy-to-integrate solution for detailed video captioning.
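
The stratified idea mentioned in the abstract can be illustrated with a simple non-learned stand-in: split the timeline into equal segments and keep the highest-scoring frame in each, so the selection covers the whole video without clustering around one event. The sketch assumes per-frame importance scores are already available (in LFS they would come from the learned selector). The function name and the equal-width segmentation are assumptions, not the paper's method.

```python
def stratified_select(scores, k):
    """Pick one frame index per temporal stratum, choosing the top score in each.

    `scores` holds one importance score per frame, in temporal order.
    `k` is the number of frames to keep and must satisfy 1 <= k <= len(scores).
    """
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    selected = []
    for s in range(k):
        start = s * n // k        # first frame index of stratum s
        end = (s + 1) * n // k    # one past the last frame index of stratum s
        best = max(range(start, end), key=lambda i: scores[i])
        selected.append(best)
    return selected


# Six frames, three to keep: strata are frames [0, 1], [2, 3], [4, 5].
frames = stratified_select([0.1, 0.9, 0.2, 0.3, 0.8, 0.7], k=3)
```

Uniform sampling would ignore the scores entirely, while plain top-k could pick frames 1, 4, and 5 and skip the middle of the video. The stratified variant trades a little relevance for guaranteed temporal coverage.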


[34] 3D Space as a Scratchpad for Editable Text-to-Image Generation cs.CV

Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Matheus Gadelha

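TL;DR: This paper proposes a 3D reasoning framework called the "spatial scratchpad" to improve the spatial consistency and controllability of vision-language models in text-to-image generation. The framework parses a text prompt into editable 3D meshes, uses agentic scene planning for placement and viewpoint selection, and renders the result back into the image domain, yielding more precise geometric relations and better preservation of object identity.

Details

Motivation: Existing vision-language models lack an externalized reasoning mechanism comparable to that of large language models, which limits their spatial reasoning, so generated images struggle to accurately reflect geometric relations, object identities, and compositional intent. The paper addresses this by introducing 3D space as an explicit workspace.

Result: On GenAI-Bench, the method achieves a 32% improvement in text alignment, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation.

Insight: The core idea is to use 3D space as a "scratchpad", an intermediate representation connecting linguistic intent and image synthesis, which supports intuitive 3D edits that propagate reliably to the final image. This offers vision-language models a new paradigm of "deliberating" in space, going beyond prior 2D layout-based methods.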

Abstract: Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad – a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision-language models that deliberate not only in language, but also in space. Code and visualizations at https://oindrilasaha.github.io/3DScratchpad/


[35] Learning Consistent Taxonomic Classification through Hierarchical Reasoning cs.CV

Zhenghong Li, Kecheng Zheng, Haibin Ling

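TL;DR: This paper proposes VL-Taxon, a two-stage hierarchical reasoning framework that addresses vision-language models' lack of hierarchical knowledge, which causes inconsistent predictions across taxonomic levels. A top-down process improves leaf-level accuracy, and the accurate leaf-level output is then used to ensure consistency across the whole hierarchy; training combines supervised fine-tuning and reinforcement learning. On iNaturalist-2021, VL-Taxon built on Qwen2.5-VL-7B outperforms the original 72B model by over 10% on average in both leaf-level and hierarchical consistency accuracy, using only a small amount of fine-tuning data.

Details

Motivation: Vision-language models excel at visual understanding but often fail to grasp hierarchical knowledge, so they frequently err at coarser taxonomic levels even when they correctly identify the most specific leaf level. Most existing approaches overlook this problem and do not model hierarchical reasoning.

Result: On iNaturalist-2021, VL-Taxon implemented on Qwen2.5-VL-7B outperforms the original 72B model by over 10% on average in both leaf-level accuracy and hierarchical consistency accuracy, reaching SOTA level. The gain comes from fine-tuning on only a small subset of the data, without relying on examples generated by other VLMs.

Insight: The contribution is a two-stage reasoning framework aimed specifically at hierarchical classification consistency: it combines top-down accuracy improvement with hierarchy consistency anchored on an accurate leaf prediction, and uses supervised fine-tuning followed by reinforcement learning to instill taxonomy knowledge and refine reasoning and generalization. Viewed objectively, the core insight is to split classification explicitly into an accuracy sub-problem and a consistency sub-problem and to optimize them in a coordinated pipeline, an effective and transferable idea for tasks with strict hierarchies such as biological taxonomy.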

Abstract: While Vision-Language Models (VLMs) excel at visual understanding, they often fail to grasp hierarchical knowledge. This leads to common errors where VLMs misclassify coarser taxonomic levels even when correctly identifying the most specific level (leaf level). Existing approaches largely overlook this issue by failing to model hierarchical reasoning. To address this gap, we propose VL-Taxon, a two-stage, hierarchy-based reasoning framework designed to improve both leaf-level accuracy and hierarchical consistency in taxonomic classification. The first stage employs a top-down process to enhance leaf-level classification accuracy. The second stage then leverages this accurate leaf-level output to ensure consistency throughout the entire taxonomic hierarchy. Each stage is initially trained with supervised fine-tuning to instill taxonomy knowledge, followed by reinforcement learning to refine the model’s reasoning and generalization capabilities. Extensive experiments reveal a remarkable result: our VL-Taxon framework, implemented on the Qwen2.5-VL-7B model, outperforms its original 72B counterpart by over 10% in both leaf-level and hierarchical consistency accuracy on average on the iNaturalist-2021 dataset. Notably, this significant gain was achieved by fine-tuning on just a small subset of data, without relying on any examples generated by other VLMs.
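
The two metrics named in the abstract can be illustrated with a small sketch. The abstract does not give the exact definition of hierarchical consistency accuracy, so the version below assumes a prediction counts as consistent only when every level of the predicted path matches the ground truth; the function name `taxonomy_scores` and the toy labels are hypothetical.

```python
def taxonomy_scores(predictions, targets):
    """Return (leaf accuracy, full-path accuracy).

    Each prediction/target is a tuple ordered coarse-to-fine, leaf last.
    Assumption: "hierarchically consistent" means the whole path is correct.
    """
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need equally sized, non-empty lists")
    n = len(targets)
    leaf_hits = sum(p[-1] == t[-1] for p, t in zip(predictions, targets))
    path_hits = sum(tuple(p) == tuple(t) for p, t in zip(predictions, targets))
    return leaf_hits / n, path_hits / n

targets = [
    ("Aves", "Passeriformes", "Corvus corax"),
    ("Aves", "Passeriformes", "Pica pica"),
    ("Aves", "Strigiformes", "Strix aluco"),
]
predictions = [
    ("Aves", "Passeriformes", "Corvus corax"),   # fully correct
    ("Mammalia", "Passeriformes", "Pica pica"),  # right leaf, wrong class
    ("Aves", "Strigiformes", "Bubo bubo"),       # right order, wrong leaf
]
leaf_acc, path_acc = taxonomy_scores(predictions, targets)
```

The second toy prediction is the failure mode the abstract describes: the leaf is right but a coarser level is wrong, so it counts toward leaf accuracy and against consistency.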


[36] Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis cs.CV | cs.AI | cs.CL | cs.HC

James Brock, Ce Zhang, Nantheera Anantrasirichai

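TL;DR: This paper presents Forest-Chat, an LLM-driven agent for integrated forest change analysis through natural-language interaction. The framework combines a multi-level change interpretation (MCI) vision-language backbone with zero-shot detection based on a foundation change detection model, supports multiple remote sensing image change interpretation tasks such as change detection, change captioning, and object counting, and is validated on the newly introduced Forest-Change dataset of bi-temporal satellite imagery with semantic annotations.

Details

Motivation: The work targets the challenges of pixel-level change detection and semantic change interpretation in forest monitoring, and the underexplored integration of large language models with vision-language models for remote sensing image change interpretation, especially outside urban environments.

Result: Experiments show that Forest-Chat achieves strong performance on joint change detection and captioning, both on the Forest-Change dataset and on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI.

Insight: The contribution is an LLM-driven integrated analysis framework that supports natural-language queries and interactive point prompts and combines zero-shot change detection with multi-granularity semantic interpretation, together with a dataset built specifically for forest environments. This improves the accessibility, interpretability, and analytical efficiency of forest change analysis.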

Abstract: The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for enhancing forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. We introduce Forest-Chat, an LLM-driven agent designed for integrated forest change analysis. The proposed framework enables natural language querying and supports multiple RSICI tasks, including change detection, change captioning, object counting, deforestation percentage estimation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, and incorporates zero-shot change detection via a foundation change detection model together with an interactive point-prompt interface to support fine-grained user guidance. To facilitate adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated through a combination of human annotation and rule-based methods. Experimental results demonstrate that Forest-Chat achieves strong performance on Forest-Change and on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI, for joint change detection and captioning, highlighting the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and analytical efficiency in forest change analysis.
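
One of the listed tasks, deforestation percentage estimation, can be sketched from a pixel-level change mask. The abstract does not say how Forest-Chat computes this value; the snippet assumes the simplest reading, namely the share of pixels flagged as changed in a binary mask. The function name `changed_percentage` is hypothetical.

```python
import numpy as np

def changed_percentage(change_mask):
    """Percentage of pixels marked as changed in a binary change mask.

    Assumption: nonzero entries mean "forest loss detected" for that pixel.
    """
    mask = np.asarray(change_mask).astype(bool)
    if mask.size == 0:
        raise ValueError("change mask is empty")
    return 100.0 * mask.sum() / mask.size

# Toy 4x4 mask with a 2x2 cleared patch.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
print(changed_percentage(mask))  # 25.0
```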


[37] Mirai: Autoregressive Visual Generation Needs Foresight cs.CV

Yonghao Yu, Lang Huang, Zerun Wang, Runyi Li, Toshihiko Yamasaki

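TL;DR: This paper proposes Mirai, a framework that addresses the poor global coherence and slow convergence caused by strictly causal supervision in autoregressive visual generation models. By introducing "foresight" training signals, i.e., information from future tokens, Mirai significantly improves generation quality and convergence speed without changing the architecture or adding inference overhead.

Details

Motivation: Autoregressive visual generators are trained with next-token likelihood. This strictly causal supervision optimizes each step only for its immediate next token, which weakens global coherence and slows convergence. The paper asks whether foresight signals originating from future tokens can improve autoregressive visual generation.

Result: On the ImageNet class-conditional image generation benchmark, Mirai speeds up LlamaGen-B's convergence by up to 10× and reduces generation FID from 5.34 to 4.34.

Insight: The key insight is that aligning foresight to the autoregressive model's internal representations on the 2D image grid improves causal modeling. Mirai injects future information either explicitly (Mirai-E) or implicitly (Mirai-I), offering a general way to improve autoregressive visual generation without architectural modification.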

Abstract: Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with next-token likelihood. This strictly causal supervision optimizes each step only by its immediate next token, which diminishes global coherence and slows convergence. We ask whether foresight, training signals that originate from later tokens, can help AR visual generation. We conduct a series of controlled diagnostics along the injection level, foresight layout, and foresight source axes, unveiling a key insight: aligning foresight to AR models' internal representation on the 2D image grids improves causality modeling. We formulate this insight with Mirai (meaning "future" in Japanese), a general framework that injects future information into AR training with no architecture change and no extra inference overhead: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations. Extensive experiments show that Mirai significantly accelerates convergence and improves generation quality. For instance, Mirai can speed up LlamaGen-B's convergence by up to 10× and reduce the generation FID from 5.34 to 4.34 on the ImageNet class-conditional image generation benchmark. Our study highlights that visual autoregressive models need foresight.


[38] LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models cs.CV | cs.LG

Mingyang Xie, Numair Khan, Tianfu Wang, Naina Dhingra, Seonghyeon Nam

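TL;DR: This paper proposes LaVR, which uses the implicit geometric knowledge of a large 4D reconstruction model to generate novel views from monocular video. By jointly conditioning on scene latents and source camera poses, it addresses the drift, deformation, and errors that arise when existing methods either lack geometric conditioning or rely on explicit reconstruction, and achieves SOTA performance on video re-rendering.

Details

Motivation: The work addresses two challenges of existing video re-rendering methods: geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes, while geometrically conditioned models depend on estimated depth and explicit reconstruction and are therefore susceptible to depth inaccuracies and calibration errors.

Result: The method achieves SOTA results on the video re-rendering task through joint conditioning on scene latents and camera poses; specific benchmarks are not mentioned.

Insight: The novelty lies in conditioning generation on the implicit geometric latent space of a large 4D reconstruction model. This avoids the errors of explicit reconstruction, lets the pretrained diffusion prior regularize errors more effectively, and provides a flexible scene representation.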

Abstract: Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is https://lavr-4d-scene-rerender.github.io/


[39] A comprehensive overview of deep learning models for object detection from videos/images cs.CV | cs.AI

Sukana Zulfqar, Sadia Saeed, M. Azam Zia, Anjum Ali, Faisal Mehmood

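TL;DR: This paper is a survey of deep learning models for object detection in video and image surveillance. It systematically summarizes modern techniques, including architectural innovations, the integration of generative models, and the use of temporal information to improve robustness and accuracy.

Details

Motivation: The aim is to evaluate the current effectiveness of semantic object detection and to analyze deep learning models and their practical applications, while classifying methods by surveillance-specific challenges such as dynamic environments, occlusions, lighting variations, and real-time requirements.

Result: As a survey, the paper proposes no new model and reports no specific quantitative results. It covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, and outlines benchmark datasets and comparative evaluations.

Insight: The contribution is a novel classification of methods by core architecture, data processing strategy, and surveillance-specific challenges. The survey highlights the role of generative models in reconstructing missing frames, reducing occlusions, and normalizing illumination, and identifies low-latency, efficient, and spatiotemporal learning as directions for future research.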

Abstract: Object detection in video and image surveillance is a well-established yet rapidly evolving task, strongly influenced by recent deep learning advancements. This review summarises modern techniques by examining architectural innovations, generative model integration, and the use of temporal information to enhance robustness and accuracy. Unlike earlier surveys, it classifies methods based on core architectures, data processing strategies, and surveillance specific challenges such as dynamic environments, occlusions, lighting variations, and real-time requirements. The primary goal is to evaluate the current effectiveness of semantic object detection, while secondary aims include analysing deep learning models and their practical applications. The review covers CNN-based detectors, GAN-assisted approaches, and temporal fusion methods, highlighting how generative models support tasks such as reconstructing missing frames, reducing occlusions, and normalising illumination. It also outlines preprocessing pipelines, feature extraction progress, benchmarking datasets, and comparative evaluations. Finally, emerging trends in low-latency, efficient, and spatiotemporal learning approaches are identified for future research.


[40] RegFreeNet: A Registration-Free Network for CBCT-based 3D Dental Implant Planning cs.CV

Xinquan Yang, Xuguang Li, Mianjie Zheng, Xuefen Liu, Kun Tang

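TL;DR: This paper proposes RegFreeNet, a registration-free network for CBCT-based 3D dental implant planning. By masking the implant region in post-implantation data and predicting the implant position from neighboring tooth texture, it removes the dependence on registration algorithms and enables the construction of multi-center datasets.

Details

Motivation: Existing methods rely on registering post-implantation data to obtain implant position labels. This process is time-consuming, limited by registration accuracy, and makes multi-center datasets hard to build.

Result: Experiments on the proposed ImplantFairy dataset and two public datasets show that RegFreeNet achieves state-of-the-art performance.

Insight: The contribution is a new paradigm that masks the implants in post-implantation scans so that any CBCT containing implants can be used directly for training, together with a neighboring distance perception module and an implant slope prediction branch that strengthen feature learning.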

Abstract: As commercial surgical guide design software usually does not support exporting the implant position for pre-implantation data, existing methods have to scan the post-implantation data and map the implant to the pre-implantation space to obtain implant position labels for training. Such a process is time-consuming and heavily relies on the accuracy of the registration algorithm. Moreover, not all hospitals have paired CBCT data, limiting the construction of multi-center datasets. Inspired by the way dentists determine the implant position based on the neighboring tooth texture, we found that even if the implant area is masked, it will not affect the determination of the implant position. Therefore, we propose to mask the implants in the post-implantation data so that any CBCT containing the implants can be used as training data. This paradigm enables us to discard the registration process and makes it possible to construct a large-scale multi-center implant dataset. On this basis, we propose ImplantFairy, a comprehensive, publicly accessible dental implant dataset with voxel-level 3D annotations of 1622 CBCT data. Furthermore, according to the area variation characteristics of the tooth's spatial structure and the slope information of the implant, we designed a slope-aware implant position prediction network. Specifically, a neighboring distance perception (NDP) module is designed to adaptively extract tooth area variation features, and an implant slope prediction branch assists the network in learning more robust features through additional implant supervision information. Extensive experiments conducted on ImplantFairy and two public datasets demonstrate that the proposed RegFreeNet achieves state-of-the-art performance.
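
The masking step at the heart of this paradigm can be sketched as follows. The abstract does not specify how the implant is masked (voxel-wise or by region, or with what fill value), so this sketch assumes an axis-aligned bounding box around the annotated implant voxels, blanked with a constant; `mask_implant`, `fill_value`, and `margin` are hypothetical names.

```python
import numpy as np

def mask_implant(volume, implant_mask, fill_value=0.0, margin=1):
    """Blank out a bounding box around the implant voxels of a 3D volume.

    Returns a copy; the input volume is left untouched. The box-shaped mask
    and the constant fill are illustrative assumptions.
    """
    volume = np.asarray(volume, dtype=float)
    implant_mask = np.asarray(implant_mask, dtype=bool)
    if volume.shape != implant_mask.shape:
        raise ValueError("volume and implant_mask must have the same shape")
    masked = volume.copy()
    coords = np.argwhere(implant_mask)
    if coords.size == 0:
        return masked  # no implant annotated, nothing to hide
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + margin + 1, volume.shape)
    masked[tuple(slice(a, b) for a, b in zip(lo, hi))] = fill_value
    return masked

# Toy 6x6x6 "scan" with a 2x2x2 implant in the middle.
scan = np.ones((6, 6, 6))
implant = np.zeros((6, 6, 6), dtype=bool)
implant[2:4, 2:4, 2:4] = True
hidden = mask_implant(scan, implant)
```

The network would then be trained to predict the implant position from `hidden`, with the original annotation as the label, so no pre/post registration is needed.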


[41] HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding cs.CV | cs.AI | cs.CL

Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu

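TL;DR: This paper proposes HERMES, a training-free architecture for efficient, real-time understanding of video streams. Its core idea is to conceptualize the Transformer KV cache as a hierarchical memory framework that encapsulates video information at multiple granularities, so that a compact KV cache can be reused at inference for efficient streaming video understanding under resource constraints.

Details

Motivation: Existing multimodal large language models have advanced in offline video understanding, but they struggle to deliver stable performance, real-time responses, and low GPU memory overhead at the same time on streaming video inputs. The paper targets this challenge.

Result: Across multiple benchmarks, HERMES achieves comparable or superior accuracy even when reducing video tokens by up to 68% relative to uniform sampling, with gains of up to 11.4% on streaming datasets. It also achieves a 10× faster time to first token (TTFT) than the prior SOTA, ensuring real-time interactive responses.

Insight: The main contribution is reconceptualizing the KV cache as a hierarchical memory for efficiently storing and reusing multi-granularity video information. This offers a training-free way to substantially improve the efficiency and performance of streaming video understanding by optimizing the attention mechanism at inference time.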

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving 10× faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.


[42] DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling cs.CV | cs.CL | cs.MM

Jing Lan, Hexiao Ding, Hongzhao Chen, Yufeng Jiang, Nga-Chun Ng

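TL;DR: DeepMoLM is a dual-view framework for molecule-text modeling that fuses high-resolution molecular images with geometric invariants derived from molecular conformations, addressing existing models' weaknesses in handling 3D geometry and stereochemistry. It performs well on PubChem captioning and on ChEBI-20 description generation from images, exceeding generalist baselines and matching specialist methods or state-of-the-art vision-language models.

Details

Motivation: Most AI models for drug discovery and chemical literature mining rely on string or graph representations, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. DeepMoLM aims at physically grounded molecular text generation by combining visual and geometric structural information, without requiring atom coordinates.

Result: On PubChem captioning, DeepMoLM achieves a 12.3% relative METEOR gain over the strongest generalist baseline while staying competitive with specialist methods. In the specialist setting, it attains an MAE of 13.64 g/mol on molecular weight and 37.89 on complexity. On ChEBI-20 description generation from images, it exceeds generalist baselines and matches state-of-the-art vision-language models.

Insight: The contributions are: 1) a dual-view framework that fuses high-resolution images with geometric invariants (such as discrete Extended 3-Dimensional Fingerprints) to preserve high-frequency evidence and conformer neighborhood information; 2) cross-attention fusion of the visual and geometric streams, enabling physically grounded generation without atom coordinates; 3) an effective combination of visual and structural information in molecule-text tasks, improving both description accuracy and numeric prediction.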

Abstract: AI models for drug discovery and chemical literature mining must interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. Most molecular language models rely on strings or graphs, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. We propose DeepMoLM: Deep Molecular Language Modeling, a dual-view framework that grounds high-resolution molecular images in geometric invariants derived from molecular conformations. DeepMoLM preserves high-frequency evidence from 1024 × 1024 inputs, encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints, and fuses visual and geometric streams with cross-attention, enabling physically grounded generation without atom coordinates. DeepMoLM improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline while staying competitive with specialist methods. It produces valid numeric outputs for all property queries and attains MAE 13.64 g/mol on Molecular Weight and 37.89 on Complexity in the specialist setting. On ChEBI-20 description generation from images, it exceeds generalist baselines and matches state-of-the-art vision-language models. Code is available at https://github.com/1anj/DeepMoLM.
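
As a rough illustration of fusing two token streams with cross-attention, the sketch below implements plain single-head scaled dot-product attention in NumPy. It is not DeepMoLM's fusion module: there are no learned projections, and the choice of visual tokens as queries and geometric tokens as keys/values is an assumption, as are the token counts and the feature width.

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention, no learned weights.

    Each query token receives a softmax-weighted mix of the other stream.
    """
    q = np.asarray(queries, dtype=float)
    kv = np.asarray(keys_values, dtype=float)
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ kv, weights

rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(5, 8))     # hypothetical image patch features
geometric_tokens = rng.normal(size=(3, 8))  # hypothetical fingerprint embeddings
fused, attn = cross_attention(visual_tokens, geometric_tokens)
```

Here `fused` has one row per visual token, each a convex combination of the geometric embeddings, and each row of `attn` sums to 1.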


[43] SimD3: A Synthetic drone Dataset with Payload and Bird Distractor Modeling for Robust Detection cs.CV

Ami Pandat, Kanyala Muvva, Punna Rajasekhar, Gopika Vinod, Rohit Shukla

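TL;DR: This paper introduces SimD3, a large-scale, high-fidelity synthetic dataset for robust drone detection, built in Unreal Engine 5 to simulate drones with heterogeneous payloads, multiple bird distractors, and diverse environmental conditions. Experiments within the YOLOv5 detection framework, including an attention-enhanced variant called Yolov5m+C3b, show that SimD3 effectively improves small-object drone detection and that Yolov5m+C3b outperforms the baseline in both in-domain and cross-dataset evaluations.

Details

Motivation: Drone detection is challenging because annotated real-world data is limited, appearance varies widely, and visually similar distractors such as birds are present.

Result: Within the YOLOv5 framework, Yolov5m+C3b consistently outperforms the baseline on synthetic data, on combined synthetic and real data, and on multiple unseen real-world benchmarks, demonstrating SimD3's effectiveness for training and evaluating robust drone detection models.

Insight: The contribution is a synthetic dataset that, unlike existing ones, explicitly models drones with heterogeneous payloads and multiple bird species as distractors, together with the attention-based C3b module for stronger detection. Viewed objectively, the combination of controllable synthetic data generation and architectural refinement offers a reusable recipe for generalizing small-object detection to complex environments.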

Abstract: Reliable drone detection is challenging due to limited annotated real-world data, large appearance variability, and the presence of visually similar distractors such as birds. To address these challenges, this paper introduces SimD3, a large-scale high-fidelity synthetic dataset designed for robust drone detection in complex aerial environments. Unlike existing synthetic drone datasets, SimD3 explicitly models drones with heterogeneous payloads, incorporates multiple bird species as realistic distractors, and leverages diverse Unreal Engine 5 environments with controlled weather, lighting, and flight trajectories captured using a 360° six-camera rig. Using SimD3, we conduct an extensive experimental evaluation within the YOLOv5 detection framework, including an attention-enhanced variant termed Yolov5m+C3b, where standard bottleneck-based C3 blocks are replaced with C3b modules. Models are evaluated on synthetic data, combined synthetic and real data, and multiple unseen real-world benchmarks to assess robustness and generalization. Experimental results show that SimD3 provides effective supervision for small-object drone detection and that Yolov5m+C3b consistently outperforms the baseline across in-domain and cross-dataset evaluations. These findings highlight the utility of SimD3 for training and benchmarking robust drone detection models under diverse and challenging conditions.


[44] ReinPath: A Multimodal Reinforcement Learning Approach for Pathology cs.CV

Kangcheng Zhou, Jun Jiang, Qing Zhang, Shuang Zheng, Qingli Li

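TL;DR: This paper proposes ReinPath, a multimodal reinforcement learning approach for pathology. By constructing a high-quality pathology visual question answering dataset and designing group relative policy optimization combined with a semantic reward strategy, it strengthens the reasoning ability of a pathology large language model and generates more accurate, contextually relevant textual descriptions.

Details

Motivation: Existing multimodal methods in pathology offer limited interpretability because high-quality datasets that support explicit reasoning are lacking and the reasoning process is simplistic. The paper aims to improve both the reasoning ability and the interpretability of multimodal pathology models.

Result: Comprehensive experiments on the constructed pathology VQA dataset show that the method outperforms existing state-of-the-art methods, even when trained with only 20% of the data. On the downstream zero-shot image classification task, its performance is comparable to CLIP.

Insight: The contributions are a high-quality pathology VQA dataset built specifically to support complex reasoning tasks, and a group relative policy optimization scheme combined with a semantic reward to strengthen reasoning and generation. Viewed objectively, using reinforcement learning to optimize multimodal interaction offers a new angle on interpretability in pathology.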

Abstract: Interpretability is significant in computational pathology, leading to the development of multimodal information integration from histopathological images and corresponding text data. However, existing multimodal methods have limited interpretability due to the lack of high-quality datasets that support explicit reasoning and inference, and due to their simple reasoning processes. To address these problems, we introduce a novel multimodal pathology large language model with strong reasoning capabilities. To improve the generation of accurate and contextually relevant textual descriptions, we design a semantic reward strategy integrated with group relative policy optimization. We construct a high-quality pathology visual question answering (VQA) dataset specifically designed to support complex reasoning tasks. Comprehensive experiments conducted on this dataset demonstrate that our method outperforms state-of-the-art methods, even when trained with only 20% of the data. Our method also achieves performance comparable to CLIP on the downstream zero-shot image classification task.
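
Group relative policy optimization scores each sampled answer against the other answers drawn for the same prompt. The sketch below shows only that normalization step, in its commonly used form (reward minus group mean, divided by group standard deviation). The reward values are invented, and the paper's semantic reward itself is not modeled here; `group_relative_advantages` and `eps` are illustrative names.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Responses better than the group average get positive advantages,
    worse ones negative; eps guards against a zero standard deviation.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical semantic-reward scores for three answers to one question.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Because the baseline is the group mean, no separate value network is needed, which is the usual motivation for this family of methods.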


[45] Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis cs.CV

Keita Takeda, Tomoya Sakai

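TL;DR: Using feature distribution analysis, this study systematically examines the feature representations of open-source medical vision-language models (VLMs) and compares them with non-medical VLMs. Medical VLMs do extract discriminative features that are effective for medical classification, but recently improved non-medical VLMs such as LLM2CLIP produce more refined representations, suggesting that enhancing the text encoder matters more than intensive training on medical images when developing medical VLMs.

Details

Motivation: Medical VLMs are expected to capture diagnostically relevant features, but their learned representations remain underexplored, and standard evaluations such as classification accuracy do not fully reveal whether they acquire truly discriminative, lesion-specific features. Understanding these representations is important for revealing medical image structure and improving downstream medical image analysis.

Result: Experiments show that medical VLMs extract discriminative features that are effective for medical classification tasks. However, non-medical VLMs improved with contextual enrichment, such as LLM2CLIP, produce more refined feature representations, and non-medical models are more easily biased when text strings are overlaid on images.

Insight: The contribution is a systematic comparison of medical and non-medical VLM representations through feature distribution analysis. Viewed objectively, the core finding is that enhancing the text encoder is more important than intensive training on medical images when developing medical VLMs; models should be chosen carefully for each downstream task, with attention to inference risks from background biases such as text embedded in images.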

Abstract: This study investigates the feature representations produced by publicly available open-source medical vision-language models (VLMs). While medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations like classification accuracy do not fully reveal whether they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis. This study aims to investigate the feature distributions learned by medical VLMs and evaluate the impact of medical specialization. We analyze the feature distributions extracted by several representative medical VLMs across lesion classification datasets covering multiple image modalities. These distributions were compared with those of non-medical VLMs to assess the effect of domain-specific medical training. Our experiments showed that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, non-medical VLMs recently improved with contextual enrichment, such as LLM2CLIP, were found to produce more refined feature representations. Our results imply that enhancing the text encoder is more crucial than training intensively on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by text strings overlaid on images. These findings underscore the need for careful model selection according to the downstream task, as well as awareness of potential inference risks due to background biases such as textual information in images.


[46] M2I2HA: A Multi-modal Object Detection Method Based on Intra- and Inter-Modal Hypergraph Attention cs.CV

Xiaofan Yang, Yubin Liu, Wei Pan, Guoqing Chu, Junming Zhang

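TL;DR: This paper proposes M2I2HA, a multi-modal object detection method based on hypergraph attention, aimed at the challenges of extracting information effectively within and across modalities and aligning it precisely during fusion. An intra-modal hypergraph enhancement module captures global high-order relationships within each modality, an inter-modal hypergraph fusion module aligns and fuses cross-modal features, and an M2-FullPAD module provides adaptive multi-level fusion.

Details

Motivation: CNNs have limited receptive fields and struggle with long-range dependencies; Transformers have high computational complexity and are confined to pairwise correlation modeling; state space models such as Mamba disrupt 2D spatial topology. A method is therefore needed that can model complex high-order dependencies both within and across modalities.

Result: Object detection experiments on multiple public datasets show that M2I2HA achieves state-of-the-art performance in multi-modal object detection.

Insight: The contribution is bringing hypergraph theory into multi-modal perception: complex dependencies are modeled separately within each modality (global many-to-many high-order relationships) and across modalities (alignment and fusion), and an adaptive multi-level fusion module optimizes feature flow. This offers a new structured way to model multi-modal fusion.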

Abstract: Recent advances in multi-modal detection have significantly improved detection accuracy in challenging environments (e.g., low light, overexposure). By integrating RGB with modalities such as thermal and depth, multi-modal fusion increases data redundancy and system robustness. However, significant challenges remain in effectively extracting task-relevant information both within and across modalities, as well as in achieving precise cross-modal alignment. While CNNs excel at feature extraction, they are limited by constrained receptive fields, strong inductive biases, and difficulty in capturing long-range dependencies. Transformer-based models offer global context but suffer from quadratic computational complexity and are confined to pairwise correlation modeling. Mamba and other State Space Models (SSMs), on the other hand, are hindered by their sequential scanning mechanism, which flattens 2D spatial structures into 1D sequences, disrupting topological relationships and limiting the modeling of complex higher-order dependencies. To address these issues, we propose a multi-modal perception network based on hypergraph theory called M2I2HA. Our architecture includes an Intra-Hypergraph Enhancement module to capture global many-to-many high-order relationships within each modality, and an Inter-Hypergraph Fusion module to align, enhance, and fuse cross-modal features by bridging configuration and spatial gaps between data sources. We further introduce an M2-FullPAD module to enable adaptive multi-level fusion of multi-modal enhanced features within the network, while also enhancing data distribution and flow across the architecture. Extensive object detection experiments on multiple public datasets against baselines demonstrate that M2I2HA achieves state-of-the-art performance in multi-modal object detection tasks.
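
To show what "many-to-many high-order relationships" means in practice, the sketch below runs one step of the standard hypergraph convolution propagation, X' = Dv^(-1/2) H De^(-1) H^T Dv^(-1/2) X, with unit hyperedge weights and no learned parameters. This is the generic rule from the hypergraph neural network literature, not the paper's Intra-Hypergraph Enhancement or Inter-Hypergraph Fusion modules, and the incidence matrix is a toy example.

```python
import numpy as np

def hypergraph_propagate(features, incidence):
    """One parameter-free hypergraph propagation step.

    features:  (n_nodes, d) node features
    incidence: (n_nodes, n_edges), 1 where a node belongs to a hyperedge
    """
    x = np.asarray(features, dtype=float)
    h = np.asarray(incidence, dtype=float)
    node_deg = h.sum(axis=1)
    edge_deg = h.sum(axis=0)
    if (node_deg == 0).any() or (edge_deg == 0).any():
        raise ValueError("every node and hyperedge needs at least one member")
    dv_inv_sqrt = 1.0 / np.sqrt(node_deg)
    # Gather: each hyperedge aggregates its (degree-scaled) member nodes.
    edge_messages = (h * dv_inv_sqrt[:, None]).T @ x / edge_deg[:, None]
    # Scatter: each node collects messages from the hyperedges it belongs to.
    return dv_inv_sqrt[:, None] * (h @ edge_messages)

# Four nodes joined by a single hyperedge: one step averages their features.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
h = np.ones((4, 1))
out = hypergraph_propagate(x, h)
```

Unlike a graph edge, the single hyperedge here connects all four nodes at once, which is the property the abstract contrasts with pairwise attention.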


[47] FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes cs.CV | cs.AI

Jiaxuan Liu, Yang Xiang, Han Zhao, Xiangang Li, Zhenhua Ling

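TL;DR: This paper presents FunCineForge, a unified dataset toolkit and model for movie dubbing. It comprises a production pipeline for building large-scale dubbing datasets and an MLLM-based dubbing model, and targets two problems: existing high-quality multimodal dubbing datasets are small and of poor quality, and existing models rely only on the lip region and perform poorly in complex live-action cinematic scenes.

Details

Motivation: Existing movie dubbing methods face two major limitations. First, high-quality multimodal dubbing datasets are limited in scale and quality and are sparsely annotated. Second, existing dubbing models rely solely on the lip region for audio-visual alignment, which limits their applicability to complex cinematic scenes, and they underperform in lip sync, speech quality, and emotional expressiveness.

Result: Experiments across monologue, narration, dialogue, and multi-speaker scenes show that the dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following.

Insight: The main contributions are an end-to-end production pipeline for large-scale dubbing datasets, used to build the first richly annotated Chinese television dubbing dataset, and an MLLM-based dubbing model that handles diverse cinematic scenes and goes beyond traditional methods that rely only on the lip region.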

Abstract: Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at https://anonymous.4open.science/w/FunCineForge.


[48] Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation cs.CV

Yifei Liu, Changxing Ding, Ling Guo, Huaiguang Jiang, Qiong Cao

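TL;DR: This paper proposes the Reconstruction-Anchored Diffusion Model (RAM) for text-to-human-motion generation. RAM introduces a motion latent space as intermediate supervision and co-trains a motion reconstruction branch to strengthen the text encoder's grasp of motion information. It also proposes Reconstructive Error Guidance (REG), a testing-stage guidance mechanism that exploits the diffusion model's self-correction ability to reduce error propagation during iterative denoising. Experiments show that RAM achieves significant improvements and state-of-the-art performance.

Details

Motivation: Current diffusion-based text-driven motion generation faces two main limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during iterative denoising. The paper addresses both.

Result: Extensive experiments show that RAM achieves significant improvements on text-to-motion generation and reaches state-of-the-art (SOTA) performance.

Insight: The contributions are: 1) a motion reconstruction branch, co-trained with self-regularization and motion-centric latent alignment objectives, that builds an informative motion latent space as intermediate supervision to bridge the representational gap between text and motion; 2) REG, a novel testing-stage guidance mechanism that exploits the diffusion model's inherent self-correction ability by amplifying the residual between the current prediction and the reconstructed estimate, highlighting the improvement and mitigating error propagation. Viewed objectively, co-training a reconstruction task with the generation task and using intermediate results for self-guided correction is a clever and reusable idea.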

Abstract: Diffusion models have seen widespread adoption for text-driven human motion generation and related tasks due to their impressive generative capabilities and flexibility. However, current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces Reconstruction-Anchored Diffusion Model (RAM) to address these challenges. First, RAM leverages a motion latent space as intermediate supervision for text-to-motion generation. To this end, RAM co-trains a motion reconstruction branch with two key objective functions: self-regularization to enhance the discrimination of the motion space and motion-centric latent alignment to enable accurate mapping from text to the motion latent space. Second, we propose Reconstructive Error Guidance (REG), a testing-stage guidance mechanism that exploits the diffusion model’s inherent self-correction ability to mitigate error propagation. At each denoising step, REG uses the motion reconstruction branch to reconstruct the previous estimate, reproducing the prior error patterns. By amplifying the residual between the current prediction and the reconstructed estimate, REG highlights the improvements in the current prediction. Extensive experiments demonstrate that RAM achieves significant improvements and state-of-the-art performance. Our code will be released.
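
The residual amplification described for REG can be pictured with a one-line extrapolation. The abstract does not state the exact update rule, so the sketch below assumes the form familiar from classifier-free guidance: start from the reconstructed (previous) estimate and move toward the current prediction by a factor greater than one. `amplify_residual` and `scale` are hypothetical names, and the arrays stand in for motion features.

```python
import numpy as np

def amplify_residual(current_pred, reconstructed_prev, scale=1.5):
    """Extrapolate past the current prediction along the residual direction.

    scale = 1.0 returns the current prediction unchanged; larger values
    emphasize what changed relative to the reconstructed earlier estimate.
    """
    cur = np.asarray(current_pred, dtype=float)
    rec = np.asarray(reconstructed_prev, dtype=float)
    return rec + scale * (cur - rec)

current = np.array([1.0, 2.0])
reconstructed = np.array([0.5, 2.0])
guided = amplify_residual(current, reconstructed, scale=2.0)
print(guided)  # [1.5 2. ]
```

Components where the two estimates already agree are left untouched, while components that the current step corrected are pushed further in the same direction.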


[49] UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking cs.CV

Qihua Liang, Liang Chen, Yaozong Zheng, Jian Nong, Zhiyi Mo

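TL;DR: This paper proposes UBATrack, a new multi-modal object tracking framework based on a Mamba-style state space model that improves tracking by jointly modeling cross-modal dependencies and spatio-temporal visual cues. It has two core modules: a Spatio-temporal Mamba Adapter (STMA), which uses Mamba's long-sequence modeling capability in an adapter-tuning manner, and a Dynamic Multi-modal Feature Mixer, which strengthens multi-modal feature representation. Together they improve training efficiency without costly full-parameter fine-tuning.

Details

Motivation: Current general-purpose multi-modal trackers mainly unify the different modal tracking tasks (such as RGB-Thermal infrared, RGB-Depth, or RGB-Event tracking) through prompt learning, but they overlook the effective capture of spatio-temporal cues. An efficient method that models spatio-temporal information better is therefore needed.

Result: Experiments show that UBATrack outperforms existing state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, including the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.

Insight: The contribution is bringing the Mamba state space model into multi-modal tracking: the STMA module jointly models cross-modal dependencies and spatio-temporal cues in an adapter-tuning manner, and the Dynamic Multi-modal Feature Mixer strengthens feature representation. This avoids full-parameter fine-tuning, improves training efficiency and tracking robustness, and offers a new approach to multi-modal sequence modeling.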

Abstract: Multi-modal object tracking has attracted considerable attention by integrating multiple complementary inputs (e.g., thermal, depth, and event data) to achieve outstanding performance. Although current general-purpose multi-modal trackers primarily unify various modal tracking tasks (i.e., RGB-Thermal infrared, RGB-Depth or RGB-Event tracking) through prompt learning, they still overlook the effective capture of spatio-temporal cues. In this work, we introduce a novel multi-modal tracking framework based on a mamba-style state space model, termed UBATrack. Our UBATrack comprises two simple yet effective modules: a Spatio-temporal Mamba Adapter (STMA) and a Dynamic Multi-modal Feature Mixer. The former leverages Mamba’s long-sequence modeling capability to jointly model cross-modal dependencies and spatio-temporal visual cues in an adapter-tuning manner. The latter further enhances multi-modal representation capacity across multiple feature dimensions to improve tracking robustness. In this way, UBATrack eliminates the need for costly full-parameter fine-tuning, thereby improving the training efficiency of multi-modal tracking algorithms. Experiments show that UBATrack outperforms state-of-the-art methods on RGB-T, RGB-D, and RGB-E tracking benchmarks, achieving outstanding results on the LasHeR, RGBT234, RGBT210, DepthTrack, VOT-RGBD22, and VisEvent datasets.


[50] Multimodal system for skin cancer detection cs.CV | cs.AI

Volodymyr Sydorskyi, Igor Krashenyi, Oleksii Yakubenko

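TL;DR: This paper proposes a multimodal system for skin cancer detection that uses conventional photographs rather than specialized dermoscopic images and combines them with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. The system processes images and metadata with a multimodal neural network, supports a two-step model for cases with or without metadata, and refines predictions through a three-stage pipeline that includes boosting algorithms. Specific techniques are applied to keep training robust on the highly imbalanced dataset.

Details

Motivation: Existing deep learning models based on dermoscopic images require specialized equipment, which limits their use in broader clinical settings. This study aims to develop a more accessible and versatile melanoma detection system that uses conventional photographs combined with metadata, removing the dependence on such equipment.

Result: An ablation study over several vision architectures, boosting algorithms, and loss functions reached a peak Partial ROC AUC of 0.18068 (maximum 0.2) and a top-15 retrieval sensitivity of 0.78371. The results show that integrating photographs and metadata in a structured multi-stage pipeline yields significant performance gains.

Insight: The main novelty is detecting melanoma from conventional photographs rather than specialized dermoscopic images, which lowers the equipment barrier. Performance is further improved through multimodal fusion (image plus metadata) and the two-step model and three-stage pipeline, with dedicated training techniques for the imbalanced data. The result is a scalable, equipment-independent solution for diverse healthcare settings.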

Abstract: Melanoma detection is vital for early diagnosis and effective treatment. While deep learning models on dermoscopic images have shown promise, they require specialized equipment, limiting their use in broader clinical settings. This study introduces a multi-modal melanoma detection system using conventional photo images, making it more accessible and versatile. Our system integrates image data with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network combining image and metadata processing and supports a two-step model for cases with or without metadata. A three-stage pipeline further refines predictions by boosting algorithms and enhancing performance. To address the challenges of a highly imbalanced dataset, specific techniques were implemented to ensure robust training. An ablation study evaluated recent vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Results demonstrate that integrating photo images with metadata in a structured, multi-stage pipeline yields significant performance improvements. This system advances melanoma detection by providing a scalable, equipment-independent solution suitable for diverse healthcare environments, bridging the gap between specialized and general clinical practices.
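
The "Partial ROC AUC (0.2 maximum)" figure can be made concrete with a small sketch. The abstract does not define the metric, so this assumes it is the area of the ROC curve lying above a true-positive-rate floor of 0.8, which is at least consistent with the stated maximum of 0.2. The function names and the `min_tpr` parameter are illustrative.

```python
import numpy as np

def roc_points(y_true, scores):
    """ROC curve vertices (FPR, TPR), one per distinct score threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    y, s = y_true[order], scores[order]
    tps = np.cumsum(y)        # true positives at each cut-off
    fps = np.cumsum(1 - y)    # false positives at each cut-off
    # Keep only the last index of each group of tied scores.
    last = np.r_[np.nonzero(np.diff(s))[0], len(s) - 1]
    tpr = np.r_[0.0, tps[last] / tps[-1]]
    fpr = np.r_[0.0, fps[last] / fps[-1]]
    return fpr, tpr

def partial_auc_above_tpr(y_true, scores, min_tpr=0.8):
    """Area between the ROC curve and the horizontal line TPR = min_tpr."""
    fpr, tpr = roc_points(y_true, scores)
    area = 0.0
    for f0, f1, t0, t1 in zip(fpr[:-1], fpr[1:], tpr[:-1], tpr[1:]):
        if t1 <= min_tpr:
            continue  # segment lies entirely below the floor
        if t0 < min_tpr:
            # Clip the segment at the point where it crosses the floor.
            f0 = f0 + (min_tpr - t0) / (t1 - t0) * (f1 - f0)
            t0 = min_tpr
        area += (f1 - f0) * ((t0 + t1) / 2 - min_tpr)
    return area

perfect = partial_auc_above_tpr([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
reversed_ = partial_auc_above_tpr([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1])
```

Under this assumed definition a perfect ranking reaches the ceiling of 0.2 and a fully reversed ranking scores 0, so the reported 0.18068 would correspond to roughly 90% of the attainable area.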


[51] GAT-NeRF: Geometry-Aware-Transformer Enhanced Neural Radiance Fields for High-Fidelity 4D Facial Avatars cs.CV | cs.AI

Zhe Chang, Haodong Jin, Ying Sun, Yan Song, Hui Yu

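TL;DR: This paper proposes GAT-NeRF, a hybrid neural radiance field framework for reconstructing high-fidelity, controllable 4D facial avatars from monocular video. It integrates the Transformer mechanism into the NeRF pipeline, combining a coordinate-aligned multilayer perceptron with a lightweight geometry-aware Transformer module to better model high-frequency facial details such as dynamic wrinkles.

Details

Motivation: Reconstructing high-fidelity 4D dynamic facial avatars from monocular video is a key challenge for immersive virtual human applications. Existing NeRF methods have limited capacity to capture high-frequency facial details, such as dynamic wrinkles and subtle textures, from information-constrained monocular streams and need significant enhancement.

Result: Comprehensive experiments show that GAT-NeRF achieves state-of-the-art performance in visual fidelity and high-frequency detail recovery, opening new ways to create realistic dynamic digital humans for multimedia applications.

Insight: The novelty is a hybrid framework that combines a coordinate-aligned MLP with a lightweight geometry-aware Transformer. By fusing multi-modal input features that carry explicit geometric priors (3D spatial coordinates, 3DMM expression parameters, and learnable latent codes), the Transformer module learns and enhances feature representations related to fine-grained geometry, which markedly improves the modeling of complex local facial patterns.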

Abstract: High-fidelity 4D dynamic facial avatar reconstruction from monocular video is a critical yet challenging task, driven by increasing demands for immersive virtual human applications. While Neural Radiance Fields (NeRF) have advanced scene representation, their capacity to capture high-frequency facial details, such as dynamic wrinkles and subtle textures from information-constrained monocular streams, requires significant enhancement. To tackle this challenge, we propose a novel hybrid neural radiance field framework, called Geometry-Aware-Transformer Enhanced NeRF (GAT-NeRF) for high-fidelity and controllable 4D facial avatar reconstruction, which integrates the Transformer mechanism into the NeRF pipeline. GAT-NeRF synergistically combines a coordinate-aligned Multilayer Perceptron (MLP) with a lightweight Transformer module, termed as Geometry-Aware-Transformer (GAT) due to its processing of multi-modal inputs containing explicit geometric priors. The GAT module is enabled by fusing multi-modal input features, including 3D spatial coordinates, 3D Morphable Model (3DMM) expression parameters, and learnable latent codes to effectively learn and enhance feature representations pertinent to fine-grained geometry. The Transformer’s effective feature learning capabilities are leveraged to significantly augment the modeling of complex local facial patterns like dynamic wrinkles and acne scars. Comprehensive experiments unequivocally demonstrate GAT-NeRF’s state-of-the-art performance in visual fidelity and high-frequency detail recovery, forging new pathways for creating realistic dynamic digital humans for multimedia applications.


[52] SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval cs.CV | cs.AI

Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng

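TL;DR: SpatialMem is a memory-centric system that unifies 3D geometry, semantics, and language into a single queryable representation. Starting from casually captured egocentric RGB video, it reconstructs metrically scaled indoor environments, detects structural 3D anchors (such as walls, doors, and windows) as a first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes, linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates for compact storage and fast retrieval. The design supports interpretable reasoning over spatial relations (such as distance, direction, and visibility) and downstream tasks such as language-guided navigation and object retrieval.

Details

Motivation: The paper addresses how to build a unified, queryable 3D spatial memory from everyday RGB video to support embodied spatial intelligence tasks, such as language-guided navigation and object retrieval, without relying on specialized sensors.

Result: Experiments across three real indoor scenes show that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy as clutter and occlusion increase.

Insight: The novelty is a system that unifies 3D geometry, semantics, and language in a single hierarchical memory. Its core is the use of structural 3D anchors as a scaffold, combined with open-vocabulary object nodes linked to evidence, visual, and textual information, which enables efficient, interpretable spatial-relation reasoning and fast retrieval.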

Abstract: We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes – linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates – for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
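
A two-layer memory of this kind can be pictured with a small data-structure sketch. It is purely illustrative: the class names, fields, and the nearest-anchor attachment rule are assumptions made for the example, not the system's actual design, and the evidence patches and visual embeddings are omitted.

```python
from dataclasses import dataclass, field
import math

@dataclass
class ObjectNode:
    label: str
    position: tuple        # (x, y, z) in metres, metrically scaled
    short_desc: str = ""   # first text layer (hypothetical field)
    long_desc: str = ""    # second text layer (hypothetical field)

@dataclass
class Anchor:
    kind: str              # e.g. "wall", "door", "window"
    position: tuple
    objects: list = field(default_factory=list)

def insert_object(anchors, node):
    """Attach an object node to its nearest structural anchor."""
    anchor = min(anchors, key=lambda a: math.dist(a.position, node.position))
    anchor.objects.append(node)
    return anchor

def find_objects(anchors, label, near_kind=None):
    """Return (anchor kind, node, distance to anchor) for matching objects."""
    return [
        (a.kind, o, math.dist(a.position, o.position))
        for a in anchors if near_kind in (None, a.kind)
        for o in a.objects if o.label == label
    ]

anchors = [Anchor("door", (0.0, 0.0, 0.0)), Anchor("window", (4.0, 0.0, 1.0))]
insert_object(anchors, ObjectNode("mug", (3.5, 0.5, 1.0), "white mug"))
hits = find_objects(anchors, "mug", near_kind="window")
```

Because every node stores metric coordinates, relations such as distance to an anchor fall out of plain geometry, which is what makes answers of the form "the mug is about 0.7 m from the window" easy to explain.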


[53] TempViz: On the Evaluation of Temporal Knowledge in Text-to-Image Models cs.CV | cs.AI | cs.CL

Carolin Holtermann, Nina Krebs, Anne Lauscher

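TL;DR: This paper introduces TempViz, the first dataset for holistically evaluating temporal knowledge in text-to-image (T2I) models, consisting of 7.9k prompts and more than 600 reference images. The authors use it to evaluate five T2I models across five temporal knowledge categories and find that temporal competence is generally weak and that no existing automated evaluation method assesses temporal cues reliably.

Details

Motivation: Time changes the visual appearance of entities in the world, such as objects, places, and animals, so knowledge of and reasoning about time are essential for generating contextually relevant images. Yet although natural language processing has produced substantial work on understanding and improving temporal knowledge, research on how temporal phenomena appear and are handled in T2I models remains scarce. This paper aims to fill that gap.

Result: Human evaluation shows that the temporal competence of all five T2I models is generally weak, with no model exceeding 75% accuracy across categories. The authors also compare several established automated evaluation methods against human judgments and find that none provides a reliable assessment of temporal cues.

Insight: The main contribution is TempViz, the first dataset dedicated to evaluating temporal knowledge in T2I models, together with a systematic definition of five temporal knowledge categories. Viewed objectively, the work exposes a clear weakness of current mainstream T2I models in understanding and generating time-dependent visual content and shows the limits of existing automated metrics on this task, pointing to directions for future research.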

Abstract: Time alters the visual appearance of entities in our world, like objects, places, and animals. Thus, for accurately generating contextually-relevant images, knowledge and reasoning about time can be crucial (e.g., for generating a landscape in spring vs. in winter). Yet, although substantial work exists on understanding and improving temporal knowledge in natural language processing, research on how temporal phenomena appear and are handled in text-to-image (T2I) models remains scarce. We address this gap with TempViz, the first data set to holistically evaluate temporal knowledge in image generation, consisting of 7.9k prompts and more than 600 reference images. Using TempViz, we study the capabilities of five T2I models across five temporal knowledge categories. Human evaluation shows that temporal competence is generally weak, with no model exceeding 75% accuracy across categories. Towards larger-scale studies, we also examine automated evaluation methods, comparing several established approaches against human judgments. However, none of these approaches provides a reliable assessment of temporal cues - further indicating the pressing need for future research on temporal knowledge in T2I.


[54] Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers cs.CV

Xinyu Peng, Han Li, Yuyang Huang, Ziyang Zheng, Yaoming Wang

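TL;DR: This paper proposes LDF-VFI, a new paradigm for video frame interpolation that models the entire video sequence with an auto-regressive diffusion transformer to ensure long-range temporal coherence. It introduces a skip-concatenate sampling strategy to mitigate error accumulation in auto-regressive generation and combines sparse local attention with tiled VAE encoding to process long sequences efficiently and generalize to arbitrary spatial resolutions (such as 4K).

Details

Motivation: Existing video frame interpolation methods usually take a frame-centric approach and treat videos as independent short segments (such as triplets), which leads to temporal inconsistencies and motion artifacts. This paper aims to overcome these limits with a holistic, video-centric modeling approach.

Result: On challenging long-sequence benchmarks, LDF-VFI achieves state-of-the-art performance, with superior per-frame quality and temporal consistency, especially in scenes with large motion.

Insight: The main contributions are: 1) applying an auto-regressive diffusion transformer to video frame interpolation for holistic sequence modeling; 2) a skip-concatenate sampling strategy that stabilizes auto-regressive generation; 3) combining sparse local attention with tiled VAE encoding for efficient long-sequence processing and resolution generalization; 4) an enhanced conditional VAE decoder that uses multi-scale features from the input video to improve reconstruction fidelity.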

Abstract: Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion. The source code is available at https://github.com/xypeng9903/LDF-VFI.


[55] Unified Multi-Dataset Training for TBPS cs.CV

Nilanjana Chatterjee, Sidharatha Garg, A V Subramanyam, Brejesh Lall

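TL;DR: This paper proposes Scale-TBPS, a unified multi-dataset training method for Text-Based Person Search (TBPS) that addresses the poor generalization caused by distribution differences between datasets and by the very large number of identities. With a noise-aware unified dataset curation strategy and a scalable discriminative identity learning framework, it trains a single high-performing model across multiple datasets.

Details

Motivation: Existing TBPS methods rely on dataset-specific fine-tuning, so a separate model must be trained for each dataset and unified generalization is not achieved. Naive joint training also performs poorly because of the large number of identities and noisy image-text pairs.

Result: Experiments on several benchmarks, including CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926, show that a single Scale-TBPS model outperforms both dataset-specific optimized models and naive joint training.

Insight: The novelty is a unified, scalable training paradigm: a carefully designed dataset merging strategy reduces the impact of noise, and a discriminative learning framework handles a large number of unique identities effectively. Together they deliver strong cross-dataset generalization and offer a new approach to joint training on multi-source data.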

Abstract: Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.


[56] LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding cs.CV

Xiaodong Wang, Langling Huang, Zhirong Wu, Xu Zhao, Teng Xu

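TL;DR: This paper proposes LiViBench, the first omnimodal benchmark for interactive livestream videos, with 24 diverse tasks covering perception, reasoning, and livestream-specific challenges. To build the dataset efficiently, the authors design a standardized semi-automatic annotation workflow that uses a multi-agent MLLM system to generate video descriptions and a seed-question-driven method to construct high-quality annotations. All videos include audio, speech, and real-time comment modalities. To improve model understanding of interactive videos, the authors propose two-stage instruction tuning and a Video-to-Comment Retrieval module, and on that basis develop LiVi-LLM-7B. Experiments show that the model outperforms open-source models with up to 72B parameters on LiViBench, narrows the gap with leading proprietary models, and improves on several general video benchmarks.

Details

Motivation: Existing video evaluation benchmarks focus mainly on non-interactive videos (such as movies and recordings) and lack an evaluation standard for interactive livestream videos. This paper aims to fill that gap.

Result: On LiViBench, LiVi-LLM-7B outperforms open-source models with up to 72B parameters and narrows the performance gap with leading proprietary models. It also improves on the general video benchmarks VideoMME, LongVideoBench, MLVU, and VideoEval-Pro.

Insight: The contributions are: 1) LiViBench, the first omnimodal benchmark for interactive livestream videos; 2) a human-in-the-loop semi-automatic annotation workflow that uses a multi-agent MLLM system for video description and seed-question-driven annotation; 3) two-stage instruction tuning and a Video-to-Comment Retrieval module tailored to interactive video understanding; 4) the LiVi-LLM-7B model, which performs well on both livestream-specific and general tasks.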

Abstract: The development of multimodal large language models (MLLMs) has advanced general video understanding. However, existing video evaluation benchmarks primarily focus on non-interactive videos, such as movies and recordings. To fill this gap, this paper proposes the first omnimodal benchmark for interactive livestream videos, LiViBench. It features a diverse set of 24 tasks, highlighting the perceptual, reasoning, and livestream-specific challenges. To efficiently construct the dataset, we design a standardized semi-automatic annotation workflow that incorporates the human-in-the-loop at multiple stages. The workflow leverages multiple MLLMs to form a multi-agent system for comprehensive video description and uses a seed-question-driven method to construct high-quality annotations. All interactive videos in the benchmark include audio, speech, and real-time comments modalities. To enhance models’ understanding of interactive videos, we design tailored two-stage instruction-tuning and propose a Video-to-Comment Retrieval (VCR) module to improve the model’s ability to utilize real-time comments. Based on these advancements, we develop LiVi-LLM-7B, an MLLM with enhanced knowledge of interactive livestreams. Experiments show that our model outperforms larger open-source models with up to 72B parameters, narrows the gap with leading proprietary models on LiViBench, and achieves enhanced performance on general video benchmarks, including VideoMME, LongVideoBench, MLVU, and VideoEval-Pro.


[57] SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation cs.CV

Yanan Wang, Linjie Ren, Zihao Li, Junyi Wang, Tian Gan

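TL;DR: This paper proposes SpatialV2A, a framework for generating audio with high spatial fidelity from video. Because existing models depend on mono audio datasets and therefore lack spatial awareness, the authors build BinauralVGGSound, the first large-scale video-binaural audio dataset, and develop an end-to-end visual-guided spatial audio generation framework that explicitly models spatial features so that the generated audio has realistic spatial attributes and layered depth.

Details

Motivation: Existing video-to-audio generation research focuses mainly on semantic and temporal alignment and neglects the spatial perception and immersive quality of the synthesized audio, largely because models rely on mono audio datasets that lack binaural spatial information.

Result: Experiments show that the method substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive listening experience without sacrificing temporal or semantic consistency.

Insight: The core contributions are BinauralVGGSound, the first large-scale video-binaural audio dataset, and an end-to-end framework with a visual-guided audio spatialization module that explicitly models spatial features to generate audio with realistic spatial attributes.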

Abstract: While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models’ reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. All datasets, code, and model checkpoints will be publicly released to facilitate future research.


[58] The Pictorial Cortex: Zero-Shot Cross-Subject fMRI-to-Image Reconstruction via Compositional Latent Modeling cs.CV

Jingyang Huo, Yikai Wang, Yanwei Fu, Jianfeng Feng

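TL;DR: This paper proposes PictorialCortex, a method for zero-shot cross-subject fMRI-to-image reconstruction. Using compositional latent modeling, it disentangles stimulus-driven representations from subject-, dataset-, and trial-related variability in a unified cortical latent space, and uses surrogate latents synthesized under multiple seen-subject conditions to guide a diffusion model that reconstructs the visual experience of unseen subjects.

Details

Motivation: The paper addresses a central obstacle in decoding visual experience from fMRI: cortical responses are inherently variable. Because of anatomical, functional, cognitive, and experimental factors, the neural activity evoked by the same visual stimulus differs across individuals and trials, which makes fMRI-to-image reconstruction non-injective. The paper tackles zero-shot cross-subject reconstruction, a challenging problem of real practical importance.

Result: Extensive experiments were run on the proposed unified cortical-surface dataset, UniCortex-fMRI. The results show that PictorialCortex improves zero-shot cross-subject visual reconstruction and highlight the benefits of compositional latent modeling and multi-dataset training.

Insight: The novelty is a compositional latent formulation that models fMRI activity in a structured way, disentangling stimulus-driven representations from several sources of variability, together with a unified cortical latent space in which the formulation is implemented through a latent factorization-composition module and consistency regularization. Viewed objectively, its strategy of guiding a diffusion model with surrogate latents synthesized from multiple subjects for zero-shot inference offers a new approach to generalization in cross-subject neural decoding.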

Abstract: Decoding visual experiences from human brain activity remains a central challenge at the intersection of neuroscience, neuroimaging, and artificial intelligence. A critical obstacle is the inherent variability of cortical responses: neural activity elicited by the same visual stimulus differs across individuals and trials due to anatomical, functional, cognitive, and experimental factors, making fMRI-to-image reconstruction non-injective. In this paper, we tackle a challenging yet practically meaningful problem: zero-shot cross-subject fMRI-to-image reconstruction, where the visual experience of a previously unseen individual must be reconstructed without subject-specific training. To enable principled evaluation, we present a unified cortical-surface dataset, UniCortex-fMRI, assembled from multiple visual-stimulus fMRI datasets to provide broad coverage of subjects and stimuli. UniCortex-fMRI is processed into standardized data formats so that cross-subject fMRI-to-image reconstruction can be studied in the zero-shot scenario. To tackle the modeling challenge, we propose PictorialCortex, which models fMRI activity using a compositional latent formulation that structures stimulus-driven representations under subject-, dataset-, and trial-related variability. PictorialCortex operates in a universal cortical latent space and implements this formulation through a latent factorization-composition module, reinforced by paired factorization and re-factorizing consistency regularization. During inference, surrogate latents synthesized under multiple seen-subject conditions are aggregated to guide diffusion-based image synthesis for unseen subjects. Extensive experiments show that PictorialCortex improves zero-shot cross-subject visual reconstruction, highlighting the benefits of compositional latent modeling and multi-dataset training.


[59] Three-dimensional visualization of X-ray micro-CT with large-scale datasets: Efficiency and accuracy for real-time interaction cs.CV

Yipeng Yin, Rao Yao, Qingying Li, Dazhong Wang, Hong Zhou

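TL;DR: This paper reviews the trade-off between efficiency and accuracy in the 3D visualization of large-scale X-ray micro-CT datasets. It traces the evolution from medical imaging to industrial non-destructive testing, systematically analyzes CT reconstruction and volume rendering methods that balance accuracy and efficiency, and looks ahead to the use of digital twin models in structural health monitoring.

Details

Motivation: The paper addresses the trade-off between accuracy and efficiency in 3D defect characterization caused by the large datasets produced in industrial CT ultra-precision inspection, and aims to provide methodological guidance for real-time online monitoring of internal material defects.

Result: As a survey, the paper reports no specific quantitative experiments or benchmark results. By comparing existing methods, it traces the evolution of CT reconstruction algorithms from analytical methods to deep learning techniques and the improvements in volume rendering algorithms, acceleration, and data reduction.

Insight: The novelty is a systematic review of micro-CT 3D visualization from the specific viewpoint of balancing accuracy and efficiency, together with a forward-looking proposal to combine virtual-physical interaction with digital twin models for structural health monitoring.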

Abstract: As Micro-CT technology continues to refine its characterization of material microstructures, industrial CT ultra-precision inspection is generating increasingly large datasets, necessitating solutions to the trade-off between accuracy and efficiency in the 3D characterization of defects during ultra-precise detection. This article provides a unique perspective on recent advances in accurate and efficient 3D visualization using Micro-CT, tracing its evolution from medical imaging to industrial non-destructive testing (NDT). Among the numerous CT reconstruction and volume rendering methods, this article selectively reviews and analyzes approaches that balance accuracy and efficiency, offering a comprehensive analysis to help researchers quickly grasp highly efficient and accurate 3D reconstruction methods for microscopic features. By comparing the principles of computed tomography with advancements in microstructural technology, this article examines the evolution of CT reconstruction algorithms from analytical methods to deep learning techniques, as well as improvements in volume rendering algorithms, acceleration, and data reduction. Additionally, it explores advanced lighting models for high-accuracy, photorealistic, and efficient volume rendering. Furthermore, this article envisions potential directions in CT reconstruction and volume rendering. It aims to guide future research in quickly selecting efficient and precise methods and in developing new ideas and approaches for real-time online monitoring of internal material defects through virtual-physical interaction, with a view to applying digital twin models to structural health monitoring (SHM).


[60] Pb4U-GNet: Resolution-Adaptive Garment Simulation via Propagation-before-Update Graph Network cs.CV

Aoran Liu, Kun Hu, Clinton Ansun Mo, Qiuxia Wu, Wenxiong Kang

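TL;DR: This paper proposes Pb4U-GNet, a graph neural network framework that addresses the poor cross-resolution generalization of existing garment simulation methods. By decoupling message propagation from feature updates and introducing dynamic propagation depth control and geometry-aware update scaling, it adapts the simulation to different mesh resolutions.

Details

Motivation: Conventional physics-based garment simulation is computationally expensive, while existing acceleration methods based on graph neural networks (GNNs) degrade significantly on higher-resolution meshes outside the training distribution. The main causes are a fixed message-passing depth that cannot adapt to changes in mesh density and vertex displacement magnitudes that are inherently resolution-dependent.

Result: Extensive experiments show that, even when trained only on low-resolution meshes, Pb4U-GNet generalizes strongly across diverse mesh resolutions, addressing a fundamental challenge in neural garment simulation.

Insight: The core novelty is the propagation-before-update graph network architecture, which decouples message propagation from feature updates. It includes a mechanism that adjusts the number of message-passing iterations dynamically and a geometry-aware update that scales predictions according to local mesh characteristics, offering a new approach to scale variation in graph-structured data.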

Abstract: Garment simulation is fundamental to various applications in computer vision and graphics, from virtual try-on to digital human modelling. However, conventional physics-based methods remain computationally expensive, hindering their application in time-sensitive scenarios. While graph neural networks (GNNs) offer promising acceleration, existing approaches exhibit poor cross-resolution generalisation, demonstrating significant performance degradation on higher-resolution meshes beyond the training distribution. This stems from two key factors: (1) existing GNNs employ fixed message-passing depth that fails to adapt information aggregation to mesh density variation, and (2) vertex-wise displacement magnitudes are inherently resolution-dependent in garment simulation. To address these issues, we introduce Propagation-before-Update Graph Network (Pb4U-GNet), a resolution-adaptive framework that decouples message propagation from feature updates. Pb4U-GNet incorporates two key mechanisms: (1) dynamic propagation depth control, adjusting message-passing iterations based on mesh resolution, and (2) geometry-aware update scaling, which scales predictions according to local mesh characteristics. Extensive experiments show that even trained solely on low-resolution meshes, Pb4U-GNet exhibits strong generalisability across diverse mesh resolutions, addressing a fundamental challenge in neural garment simulation.
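
The two mechanisms named in the abstract can be illustrated with a toy sketch. It is not the authors' architecture: the step-count rule, the mean-aggregation propagation, and all parameter names (`ref_edge_len`, `ref_steps`, `update_fn`) are assumptions chosen only to show propagation decoupled from a single update, with the output scaled by local edge length.

```python
import numpy as np

def propagation_steps(mean_edge_len, ref_edge_len=0.5, ref_steps=4):
    """Hypothetical rule: finer meshes (shorter edges) get more steps."""
    return max(1, round(ref_steps * ref_edge_len / mean_edge_len))

def propagate_then_update(features, adjacency, edge_len, update_fn, **rule):
    """Run parameter-free propagation first, then apply one update."""
    deg = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    steps = propagation_steps(edge_len.mean(), **rule)
    h = features
    for _ in range(steps):                 # propagation only, no update here
        h = 0.5 * h + 0.5 * (adjacency @ h) / deg
    delta = update_fn(h)                   # single update after propagation
    return delta * edge_len[:, None], steps  # scale by local edge length

# Toy 3-vertex path graph; edge_len holds each vertex's local mean edge length.
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
features = np.ones((3, 2))
edge_len = np.full(3, 0.25)
delta, steps = propagate_then_update(features, adjacency, edge_len, lambda h: h)
```

Halving the edge length doubles the number of propagation steps in this toy rule, so information travels a comparable physical distance on coarse and fine meshes, while the per-vertex scaling keeps displacement magnitudes tied to the local mesh size.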


[61] Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning cs.CV

Shuonan Yang, Yuchen Zhang, Zeyu Fu

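TL;DR: This paper proposes MARS, a training-free multi-stage adversarial reasoning framework for reliable and interpretable hateful video detection. It first describes the video content objectively, then runs evidence-based hateful reasoning and counter-evidence non-hateful reasoning in parallel, and finally synthesizes them into an explainable decision. Evaluation on two real-world datasets shows that MARS improves on other training-free methods by up to 10% under certain backbones and settings and outperforms state-of-the-art training-based methods on one dataset.

Details

Motivation: Existing training-based hateful video detection methods are constrained by limited training data and a lack of interpretability, while directly prompting large vision-language models often fails to deliver reliable hate detection. This paper aims to address these challenges and achieve reliable, interpretable hateful content detection.

Result: Extensive evaluation on two real-world datasets shows that MARS improves on other training-free methods by up to 10% under certain backbones and settings and outperforms state-of-the-art training-based methods on one dataset.

Insight: The novelty is a training-free multi-stage adversarial reasoning framework that reasons from the hateful and non-hateful perspectives in parallel, grounded in an objective description, to reach reliable and interpretable decisions. The method avoids data dependence and black-box behavior and improves the transparency and interpretability of content moderation.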

Abstract: Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and lack of interpretability, while directly prompting large vision-language models often struggles to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with the objective description of video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.
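
The staged flow described in the abstract (objective description, two opposing lines of reasoning in parallel, then synthesis) can be sketched as below. This is not the released MARS code: `ask` is a hypothetical callable wrapping whatever vision-language model is used, and the prompt wording is invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def mars_style_decision(video_input, ask):
    """Staged reasoning sketch; `ask(prompt) -> str` is an assumed wrapper."""
    # Stage 1: neutral description, no judgement yet.
    description = ask(f"Describe objectively, without judgement: {video_input}")
    # Stage 2: opposing arguments, produced in parallel from the same description.
    prompts = [
        f"List evidence that this content is hateful:\n{description}",
        f"List evidence that this content is NOT hateful:\n{description}",
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        hateful_case, benign_case = pool.map(ask, prompts)
    # Stage 3: synthesize both cases into one explained verdict.
    verdict = ask(
        "Weigh both arguments and answer 'hateful' or 'non-hateful' with a reason.\n"
        f"Hateful case: {hateful_case}\nNon-hateful case: {benign_case}"
    )
    return {
        "description": description,
        "hateful_case": hateful_case,
        "benign_case": benign_case,
        "verdict": verdict,
    }
```

Returning the intermediate texts alongside the verdict is what makes such a pipeline auditable: a reviewer can see which evidence each side produced before the final call.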


[62] Large-Scale Multidimensional Knowledge Profiling of Scientific Literature cs.CV

Zhucun Xue, Jiangning Zhang, Juntao Jiang, Jinzhuo Liu, Haoyang He

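TL;DR: This paper builds a large-scale multidimensional knowledge profiling framework and applies it to more than 100,000 papers from 22 major conferences between 2020 and 2025. Using topic clustering, LLM-assisted parsing, and structured retrieval, it reveals trends in AI research such as topic lifecycles, methodological transitions, dataset and model usage patterns, and institutional research directions.

Details

Motivation: Traditional bibliometric tools rely mainly on metadata and struggle to analyze the semantic content of papers in depth. The paper addresses this so that the evolution of research themes and the influence between areas can be tracked more effectively.

Result: The analysis reveals notable shifts in AI research, including the growth of safety, multimodal reasoning, and agent-oriented studies, and the gradual stabilization of areas such as neural machine translation and graph-based methods.

Insight: The novelty lies in combining a large-scale corpus with a multidimensional analysis pipeline (topic clustering plus LLM-assisted parsing) to profile the semantic content of scientific literature systematically, providing an evidence-based view of, and a resource for understanding, how AI research is evolving.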

Abstract: The rapid expansion of research across machine learning, vision, and language has produced a volume of publications that is increasingly difficult to synthesize. Traditional bibliometric tools rely mainly on metadata and offer limited visibility into the semantic content of papers, making it hard to track how research themes evolve over time or how different areas influence one another. To obtain a clearer picture of recent developments, we compile a unified corpus of more than 100,000 papers from 22 major conferences between 2020 and 2025 and construct a multidimensional profiling pipeline to organize and analyze their textual content. By combining topic clustering, LLM-assisted parsing, and structured retrieval, we derive a comprehensive representation of research activity that supports the study of topic lifecycles, methodological transitions, dataset and model usage patterns, and institutional research directions. Our analysis highlights several notable shifts, including the growth of safety, multimodal reasoning, and agent-oriented studies, as well as the gradual stabilization of areas such as neural machine translation and graph-based methods. These findings provide an evidence-based view of how AI research is evolving and offer a resource for understanding broader trends and identifying emerging directions. Code and dataset: https://github.com/xzc-zju/Profiling_Scientific_Literature


[63] A Computer Vision Hybrid Approach: CNN and Transformer Models for Accurate Alzheimer’s Detection from Brain MRI Scans cs.CV

Md Mahmudul Hoque, Shuvo Karmaker, Md. Hadi Al-Amin, Md Modabberul Islam, Jisun Junayed

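TL;DR: This study proposes Evan_V2, a hybrid model for accurately classifying Alzheimer's disease from brain MRI scans. It integrates the outputs of five CNN architectures and five Transformer models through feature-level fusion and reaches 99.99% accuracy on a four-class task covering mild dementia, moderate dementia, non-demented, and very mild dementia, clearly outperforming every individual model.

Details

Motivation: Early and accurate classification of Alzheimer's disease is essential for timely clinical intervention and better patient outcomes. This study aims to develop a more reliable diagnostic tool by comparing and fusing CNN and Transformer models.

Result: On the four-class AD task, ResNet50 (CNN) reached 98.83% accuracy and ViT (Transformer) reached 95.38%, while the proposed Evan_V2 hybrid model performed best with 99.99% accuracy, an F1-score of 0.9989, and a ROC AUC of 0.9968, surpassing all standalone models.

Insight: The novelty is a hybrid ensemble strategy based on feature-level fusion (Evan_V2) that combines the strong performance of CNNs with the generalization of Transformers and substantially reduces misclassification across dementia stages, offering a new approach to building highly reliable clinical diagnostic tools.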

Abstract: Early and accurate classification of Alzheimer's disease (AD) from brain MRI scans is essential for timely clinical intervention and improved patient outcomes. This study presents a comprehensive comparative analysis of five CNN architectures (EfficientNetB0, ResNet50, DenseNet201, MobileNetV3, VGG16), five Transformer-based models (ViT, ConvTransformer, PatchTransformer, MLP-Mixer, SimpleTransformer), and a proposed hybrid model named Evan_V2. All models were evaluated on a four-class AD classification task comprising Mild Dementia, Moderate Dementia, Non-Demented, and Very Mild Dementia categories. Experimental findings show that CNN architectures consistently achieved strong performance, with ResNet50 attaining 98.83% accuracy. Transformer models demonstrated competitive generalization capabilities, with ViT achieving the highest accuracy among them at 95.38%. However, individual Transformer variants exhibited greater class-specific instability. The proposed Evan_V2 hybrid model, which integrates outputs from ten CNN and Transformer architectures through feature-level fusion, achieved the best overall performance with 99.99% accuracy, 0.9989 F1-score, and 0.9968 ROC AUC. Confusion matrix analysis further confirmed that Evan_V2 substantially reduced misclassification across all dementia stages, outperforming every standalone model. These findings highlight the potential of hybrid ensemble strategies in producing highly reliable and clinically meaningful diagnostic tools for Alzheimer's disease classification.
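
Feature-level fusion, as opposed to averaging final predictions, can be illustrated with a minimal sketch. The abstract does not describe Evan_V2's internals, so this assumes the simplest form: per-backbone feature vectors are concatenated and passed to one linear softmax head. The feature sizes, random features, and untrained weights are placeholders.

```python
import numpy as np

def fuse_features(feature_list):
    """Concatenate per-backbone feature vectors along the feature axis."""
    return np.concatenate(feature_list, axis=1)

def classify(fused, weights, bias):
    """Linear head with a numerically stable softmax over the classes."""
    logits = fused @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Placeholder features for 2 scans from two hypothetical backbones.
cnn_feats = rng.normal(size=(2, 8))
vit_feats = rng.normal(size=(2, 16))
fused = fuse_features([cnn_feats, vit_feats])   # shape (2, 24)
weights = rng.normal(size=(24, 4))              # 4 dementia classes
probs = classify(fused, weights, bias=np.zeros(4))
```

The design point is that the classifier sees all backbones' features jointly, so it can learn interactions between them, whereas prediction-level ensembling only combines each model's final vote.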


[64] ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation cs.CV

Hanlei Guo, Jiahao Shao, Xinya Chen, Xiyang Tan, Sheng Miao

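TL;DR: This paper proposes ScenDi, a 3D-to-2D scene diffusion cascade for generating realistic urban scenes. It first uses a 3D latent diffusion model to generate 3D Gaussians that render low-resolution images, then uses a 2D video diffusion model conditioned on those renders to enhance appearance detail, improving realism while keeping the camera trajectory controllable.

Details

Motivation: Existing methods face a trade-off when generating 3D urban scenes: relying only on 3D diffusion models loses appearance detail, while using only 2D diffusion models sacrifices camera controllability. This paper aims to overcome that limit by combining the strengths of 3D and 2D diffusion models.

Result: Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the method's effectiveness: it generates the desired scenes from input conditions such as 3D bounding boxes, road maps, or text prompts while following accurate camera trajectories.

Insight: The main novelty is a cascaded framework that combines the coarse geometry and spatial controllability of a 3D diffusion model with the high-quality appearance enhancement of a 2D video diffusion model. It offers a new approach to controllable, high-fidelity 3D scene generation that balances detail against controllability.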

Abstract: Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned on inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.


[65] PROGRESSLM: Towards Progress Reasoning in Vision-Language Models cs.CV | cs.CL

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang

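TL;DR: This paper introduces Progress-Bench, a benchmark for systematically evaluating task progress reasoning in vision-language models, explores prompting-based and training-based methods for improving performance, and develops the ProgressLM-3B model.

Details

Motivation: Modern vision-language models excel at describing static visual content, but it is unclear whether they can infer task progress from partial observations, so their long-horizon dynamic reasoning needs to be evaluated and improved.

Result: Experiments on 14 vision-language models show that most perform poorly at task progress estimation, are sensitive to demonstration modality and viewpoint changes, and handle unanswerable cases badly. The training-based ProgressLM-3B achieves consistent improvements even at small scale, despite a training task set fully disjoint from the evaluation tasks.

Insight: Contributions include the Progress-Bench benchmark for systematic evaluation of progress reasoning and the exploration of a human-inspired two-stage progress reasoning paradigm. Viewed objectively, prompting that enforces structured reasoning yields limited, model-dependent gains, whereas the training-based approach improves generalization even for small models, suggesting a new research direction for dynamic reasoning in vision-language models.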

Abstract: Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.


[66] FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion cs.CV | cs.RO

Zichen Xi, Hao-Xiang Chen, Nan Xue, Hongyu Yan, Qi-Yuan Feng

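TL;DR: FlowSSC is the first generative framework applied directly to monocular semantic scene completion (SSC). It treats SSC as a conditional generation problem and performs real-time, high-fidelity inference via one-step latent diffusion in a compact triplane latent space, significantly improving the performance of existing feed-forward methods.

Details

Motivation: In semantic scene completion from monocular RGB images, the geometry of occluded regions is ambiguous, so existing feed-forward methods struggle to generate plausible details and preserve the basic spatial relationships of objects. The paper addresses this to meet real-world needs for accurate generative reasoning over the entire 3D space.

Result: In extensive experiments on the SemanticKITTI benchmark, FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines.

Insight: The contribution lies in bringing generative modeling (specifically flow matching) to monocular SSC and designing a Shortcut Flow-matching mechanism that operates in a compact triplane latent space. High-fidelity generation takes only a single step, balancing quality and real-time speed and making practical deployment in autonomous systems feasible.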

Abstract: Semantic Scene Completion (SSC) from monocular RGB images is a fundamental yet challenging task due to the inherent ambiguity of inferring occluded 3D geometry from a single view. While feed-forward methods have made progress, they often struggle to generate plausible details in occluded regions and preserve the fundamental spatial relationships of objects. Such accurate generative reasoning capability for the entire 3D space is critical in real-world applications. In this paper, we present FlowSSC, the first generative framework applied directly to monocular semantic scene completion. FlowSSC treats the SSC task as a conditional generation problem and can seamlessly integrate with existing feed-forward SSC methods to significantly boost their performance. To achieve real-time inference without compromising quality, we introduce Shortcut Flow-matching that operates in a compact triplane latent space. Unlike standard diffusion models that require hundreds of steps, our method utilizes a shortcut mechanism to achieve high-fidelity generation in a single step, enabling practical deployment in autonomous systems. Extensive experiments on SemanticKITTI demonstrate that FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines.


[67] DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration cs.CV

Dominik Rößle, Xujun Xie, Adithya Mohan, Venkatesh Thirugnana Sambandham, Daniel Cremers

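TL;DR: This paper presents DrivIng, a large-scale multimodal autonomous driving dataset with a complete geo-referenced digital twin of a roughly 18 km route spanning urban, suburban, and highway segments. The dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization captured across day, dusk, and night, annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, for about 1.2 million annotated instances in total. DrivIng supports a 1-to-1 transfer of real traffic into simulation for realistic and flexible scenario testing, and the dataset, digital twin, HD map, and codebase are publicly released to support reproducible research.

Details

Motivation: Existing autonomous driving perception datasets usually lack a high-fidelity digital twin, which limits systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluation. To fill this gap, the authors present DrivIng, intended as a large-scale, high-quality data resource that supports thorough evaluation.

Result: The authors benchmark state-of-the-art perception models on DrivIng to support reproducible research and robust validation. The dataset is publicly released together with the digital twin, HD map, and codebase.

Insight: The main contribution is a geo-referenced digital twin that corresponds exactly to the real route, enabling a 1-to-1 transfer of real traffic into simulation. This preserves agent interactions while supporting more realistic and flexible scenario testing and system evaluation, which helps advance sim-to-real research and edge-case analysis.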

Abstract: Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. Our dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Alongside the benefits of a digital twin, DrivIng enables a 1-to-1 transfer of real traffic into simulation, preserving agent interactions while enabling realistic and flexible scenario testing. To support reproducible research and robust validation, we benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase.


[68] StableWorld: Towards Stable and Consistent Long Interactive Video Generation cs.CV

Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang

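TL;DR: This paper proposes StableWorld, a dynamic frame eviction mechanism that addresses the persistent problems of stability and temporal consistency in interactive video generation. By filtering out degraded frames and retaining geometrically consistent ones, it prevents error accumulation and improves the generation quality of a range of interactive video models.

Details

Motivation: Current interactive video generation methods often become unstable and degrade over time in long interaction sequences, showing spatial drift and scene collapse. The paper investigates the root cause and offers a general solution.

Result: Experiments on multiple interactive video models (e.g., Matrix-Game, Open-Oasis, Hunyuan-GameCraft) show that StableWorld substantially improves stability, temporal consistency, and generalization, and that it is model-agnostic.

Insight: The contribution is identifying that error accumulation stems mainly from generated frames gradually drifting within the same scene, and proposing on that basis a simple, effective dynamic frame eviction mechanism, a stabilization strategy that generalizes across frameworks.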

Abstract: In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we first investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.


[69] Rethinking Video Generation Model for the Embodied World cs.CV | cs.AI | cs.RO

Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu

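TL;DR: Targeting video generation models for embodied intelligence, this paper proposes RBench, a comprehensive robotics benchmark that evaluates robot-oriented video generation across five task domains and four distinct embodiments, and reveals significant deficiencies of existing models in generating physically realistic robot behaviors. To address the shortage of high-quality training data, the authors also introduce a four-stage data pipeline and build RoVid-X, the largest open-source robotic dataset for video generation, with 4 million annotated video clips, to advance embodied AI.

Details

Motivation: Current video generation models still struggle to synthesize high-quality videos that accurately reflect real-world robotic interactions, and the lack of a standardized benchmark limits fair comparison and progress.

Result: Evaluating 25 representative models on RBench reveals significant deficiencies in generating physically realistic robot behaviors. The benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness.

Insight: The contribution is a comprehensive benchmark for robot video generation, RBench, together with the largest open-source robotic video dataset, RoVid-X, forming a synergistic ecosystem of evaluation and data that lays a solid foundation for rigorous assessment and scalable training of video models.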

Abstract: Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
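The abstract validates RBench by its Spearman correlation with human evaluations. As a rough illustration of what that statistic measures, here is a minimal sketch that ranks two score lists (averaging ranks for ties) and computes the Pearson correlation of the ranks. The function names and the six pairs of scores are hypothetical and are not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


# Hypothetical scores for six models (illustrative only, not from the paper).
benchmark_scores = [0.81, 0.64, 0.72, 0.35, 0.50, 0.90]
human_scores = [4.2, 3.1, 3.0, 1.8, 2.5, 4.6]
rho = spearman(benchmark_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

A value near 1 means the automatic metric orders models almost exactly as human raters do, which is the property the reported 0.96 is meant to convey.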


[70] LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes cs.CV | cs.GR

Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar

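TL;DR: LuxRemix proposes a new method for interactive light editing of indoor scenes from a single multi-view scene capture. A generative image-based decomposition model factorizes complex indoor illumination into individual light sources, which are integrated into a relightable 3D Gaussian splatting representation for real-time interactive control.

Details

Motivation: The paper addresses interactive editing of indoor scene lighting from a single multi-view capture, in particular the need to decompose complex illumination and manipulate light sources independently.

Result: The method is evaluated on synthetic and real-world datasets with quantitative and qualitative comparisons to state-of-the-art techniques, showing highly photorealistic lighting decomposition and relighting results.

Insight: Contributions include a generative image-based light decomposition model, multi-view lighting harmonization to ensure consistency, and integration into a 3D Gaussian splatting representation for real-time interactive control.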

Abstract: We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see https://luxremix.github.io.


[71] Walk through Paintings: Egocentric World Models from Internet Priors cs.CV

Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov

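TL;DR: This paper proposes the Egocentric World Model (EgoWM), a method that turns any pretrained video diffusion model into an action-conditioned world model for controllable future prediction. It injects motor commands through lightweight conditioning layers and exploits the rich priors of Internet-scale video models, so without training from scratch it produces coherent future rollouts for navigation and manipulation tasks and generalizes to unseen environments (such as the inside of paintings).

Details

Motivation: The paper asks how a video generation model can not only produce a plausible future but also accurately reflect how the world changes with each action, that is, how to achieve physically correct, controllable future prediction.

Result: Among navigation world models, EgoWM improves the Structural Consistency Score (SCS) by up to 80%, reduces inference latency by up to six times, and generalizes strongly to unseen environments, including navigation inside paintings.

Insight: The contribution is an architecture-agnostic method that injects action commands into a pretrained video diffusion model through lightweight conditioning layers, efficiently building an egocentric world model without training from scratch, along with the SCS metric for evaluating physical correctness independently of visual appearance.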

Abstract: What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.


[72] Iterative Refinement Improves Compositional Image Generation cs.CV | cs.AI | cs.LG | cs.RO

Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh

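TL;DR: This paper proposes an iterative refinement strategy for text-to-image generation in which a vision-language model acts as a critic providing feedback, guiding the image generation model to improve its output progressively over multiple steps so that it better satisfies complex compositional prompts.

Details

Motivation: Existing text-to-image models perform poorly on complex compositional prompts involving multiple objects, relations, and attributes, and existing inference-time strategies (such as parallel sampling) remain inadequate, so a more effective way to satisfy many constraints simultaneously is needed.

Result: The method yields significant gains on several benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), 13.8% on the 3D-Spatial category of T2I-CompBench, and 12.5% on Visual Jenga scene decomposition, all relative to a compute-matched parallel sampling baseline. In human evaluation it is preferred 58.7% of the time versus 41.3% for the baseline.

Insight: Drawing on chain-of-thought reasoning in large language models, the paper treats iterative self-correction as a general principle for compositional image generation. The method needs no external tools or priors, applies flexibly to a wide range of image generators and vision-language models, and produces more faithful results by decomposing complex prompts into sequential corrections.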

Abstract: Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
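The critic-in-the-loop idea described in the abstract can be sketched as a simple control loop. The sketch below is only an illustration under strong assumptions: `iterative_refine`, `toy_generate`, and `toy_critique` are hypothetical names, the "image" is a toy set of satisfied constraints rather than pixels, and the real system would call a T2I model and a vision-language model in their place.

```python
def iterative_refine(prompt, generate, critique, max_steps=4):
    """Refine a generation until the critic reports no issues or the budget ends.

    `generate(prompt, previous, issues)` and `critique(prompt, image)` are
    placeholders for a T2I model and a VLM critic (hypothetical interfaces).
    """
    image = generate(prompt, None, [])
    refinements = 0
    for _ in range(max_steps):
        issues = critique(prompt, image)
        if not issues:
            break  # the critic is satisfied
        image = generate(prompt, image, issues)
        refinements += 1
    return image, refinements


def toy_generate(prompt, previous, issues):
    """Toy generator: satisfies one constraint at first, then fixes one issue per call."""
    if previous is None:
        return frozenset(prompt[:1])
    return previous | {issues[0]}


def toy_critique(prompt, image):
    """Toy critic: lists the prompt constraints the image does not yet satisfy."""
    return [constraint for constraint in prompt if constraint not in image]


prompt = ["red cube", "blue sphere", "cube left of sphere"]
final_image, steps = iterative_refine(prompt, toy_generate, toy_critique)
print(sorted(final_image), steps)
```

In this toy setting each refinement step repairs one unmet constraint, which mirrors the abstract's point that a complex prompt is handled as a sequence of corrections rather than in a single attempt.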


[73] Towards Understanding Best Practices for Quantization of Vision-Language Models cs.CV

Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam

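TL;DR: This study systematically evaluates several quantization methods (including GPTQ and AWQ) applied to multimodal vision-language model (VLM) pipelines (comprising a vision model, a language model, and their connector), focusing on how bit width, quantization method, and the quantized portion of the pipeline affect performance on captioning, retrieval, and question answering. It finds that, despite a large difference in parameter count, the vision transformer (ViT) and the large language model (LLM) are comparably important to model performance, and that lower-bit quantization of the LLM maintains high accuracy at reduced bits per weight (bpw).

Details

Motivation: Large language models (LLMs) and vision-language models (VLMs) require large amounts of GPU memory and compute, and quantization is commonly used to reduce deployment memory and latency. Existing work mostly studies quantization of unimodal models (language-only or vision-only), while systematic quantization strategies and component sensitivities for multimodal pipelines (vision, language, and connector components) are not well understood.

Result: Experiments on benchmark tasks including captioning, retrieval, and question answering show that the ViT and the LLM have a comparable impact on overall performance, and that low-bit (e.g., 4-bit) quantization of the LLM achieves lower bpw while maintaining high accuracy. The study provides data on how different quantization methods and bit widths affect performance for each component of the multimodal pipeline.

Insight: The paper's novelty lies in being the first to systematically apply advanced quantization methods (such as GPTQ and AWQ) to a complete vision-language multimodal pipeline and to quantify how sensitive each component (ViT, connector, LLM) is to the quantization strategy. Viewed objectively, its core insight is that component importance in multimodal models cannot be judged by parameter count alone (the ViT and LLM are comparably important), and that aggressive low-bit quantization of the LLM is key to efficient deployment, which offers useful guidance for resource allocation and quantization strategy selection in practice.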

Abstract: Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines comprised of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline the quantization is used for. Results reveal that ViT and LLM exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of MLLMs and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.
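To make the bits-per-weight (bpw) figure concrete, the following sketch computes the parameter-weighted average bit width of a pipeline whose components are quantized differently. The component sizes and bit widths are hypothetical, chosen only for illustration, and the calculation assumes storage for quantization metadata (such as per-group scales) is ignored.

```python
def effective_bpw(components):
    """Return (bits per weight, weight storage in GB) for a mixed-precision pipeline.

    `components` maps a name to (parameter_count, bits_per_weight).
    Quantization metadata such as scales and zero points is not counted.
    """
    total_params = sum(params for params, _ in components.values())
    total_bits = sum(params * bits for params, bits in components.values())
    return total_bits / total_params, total_bits / 8 / 1e9


# Hypothetical pipeline: ViT and connector kept at 16 bits, LLM quantized to 4 bits.
pipeline = {
    "vit": (300_000_000, 16),
    "connector": (20_000_000, 16),
    "llm": (7_000_000_000, 4),
}
bpw, gigabytes = effective_bpw(pipeline)
print(f"{bpw:.2f} bpw, {gigabytes:.2f} GB of weights")
```

Because the LLM holds most of the parameters in this example, its bit width dominates the average, which is why quantizing the LLM aggressively reduces overall bpw far more than quantizing the smaller vision components.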


[74] APPLE: Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping cs.CV

Jiwon Kang, Yeji Choi, JoungBin Lee, Wooseok Jang, Jinhyeok Choi

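TL;DR: This paper proposes APPLE, a diffusion-based teacher-student framework for face swapping. Through attribute-aware pseudo-label supervision and a reformulation of the task as conditional deblurring, it aims to preserve target-face attributes such as pose, expression, lighting, skin tone, and makeup more accurately while achieving high-quality identity transfer.

Details

Motivation: Face swapping lacks real ground-truth data, and existing diffusion-based methods that use conditional inpainting on masked images can lose crucial appearance cues, leading to poorly aligned attributes and making it hard to achieve both accurate identity transfer and high-fidelity attribute preservation.

Result: APPLE achieves state-of-the-art attribute preservation and identity transfer, producing more photorealistic results that are faithful to the target attributes.

Insight: Contributions include reformulating face swapping as a conditional deblurring task to better preserve target attributes, introducing an attribute-aware inversion scheme to improve detail preservation, and generating high-quality pseudo triplets through a carefully designed attribute-preserving teacher-student framework to give the student model direct supervision.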

Abstract: Face swapping aims to transfer the identity of a source face onto a target face while preserving target-specific attributes such as pose, expression, lighting, skin tone, and makeup. However, since real ground truth for face swapping is unavailable, achieving both accurate identity transfer and high-quality attribute preservation remains challenging. In addition, recent diffusion-based approaches attempt to improve visual fidelity through conditional inpainting on masked target images, but the masked condition removes crucial appearance cues of the target, resulting in plausible yet misaligned attributes. To address these limitations, we propose APPLE (Attribute-Preserving Pseudo-Labeling), a diffusion-based teacher-student framework that enhances attribute fidelity through attribute-aware pseudo-label supervision. We reformulate face swapping as a conditional deblurring task to more faithfully preserve target-specific attributes such as lighting, skin tone, and makeup. In addition, we introduce an attribute-aware inversion scheme to further improve detailed attribute preservation. Through an elaborate attribute-preserving design for teacher learning, APPLE produces high-quality pseudo triplets that explicitly provide the student with direct face-swapping supervision. Overall, APPLE achieves state-of-the-art performance in terms of attribute preservation and identity transfer, producing more photorealistic and target-faithful results.


eess.IV

[75] Unsupervised Deformable Image Registration with Local-Global Attention and Image Decomposition eess.IV | cs.CV

Zhengyong Huang, Xingwen Sun, Xuting Chang, Ning Jiang, Yao Wang

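TL;DR: This paper proposes LGANet++, a new unsupervised deformable image registration framework that integrates a novel local-global attention mechanism with a distinctive feature interaction and fusion technique to improve registration accuracy, robustness, and generalizability.

Details

Motivation: Deformable image registration is a key technology in medical image analysis, but traditional methods are computationally intensive and generalize poorly, and existing attention-based deep learning methods still struggle to register regions with high anatomical variability.

Result: Evaluated on five public datasets covering three registration scenarios, cross-patient, cross-time, and cross-modal (CT-MR), the method outperforms several state-of-the-art methods, improving registration accuracy by 1.39%, 0.71%, and 6.12%, respectively.

Insight: The contribution is combining the local-global attention mechanism with feature interaction and fusion to better handle anatomical variation, achieving more accurate and robust registration in the unsupervised setting, with potential value for clinical workflows.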

Abstract: Deformable image registration is a critical technology in medical image analysis, with broad applications in clinical practice such as disease diagnosis, multi-modal fusion, and surgical navigation. Traditional methods often rely on iterative optimization, which is computationally intensive and lacks generalizability. Recent advances in deep learning have introduced attention-based mechanisms that improve feature alignment, yet accurately registering regions with high anatomical variability remains challenging. In this study, we proposed a novel unsupervised deformable image registration framework, LGANet++, which employs a novel local-global attention mechanism integrated with a unique technique for feature interaction and fusion to enhance registration accuracy, robustness, and generalizability. We evaluated our approach using five publicly available datasets, representing three distinct registration scenarios: cross-patient, cross-time, and cross-modal CT-MR registration. The results demonstrated that our approach consistently outperforms several state-of-the-art registration methods, improving registration accuracy by 1.39% in cross-patient registration, 0.71% in cross-time registration, and 6.12% in cross-modal CT-MR registration tasks. These results underscore the potential of LGANet++ to support clinical workflows requiring reliable and efficient image registration. The source code is available at https://github.com/huangzyong/LGANet-Registration.


[76] Vision Models for Medical Imaging: A Hybrid Approach for PCOS Detection from Ultrasound Scans eess.IV | cs.CV

Md Mahmudul Hoque, Md Mehedi Hassain, Muntakimur Rahaman, Md. Towhidul Islam, Shaista Rani

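TL;DR: This paper proposes hybrid vision models for medical image analysis that aim to detect polycystic ovary syndrome (PCOS) accurately from ultrasound scans. It introduces two new hybrid models combining convolutional neural network and Transformer architectures and evaluates them on two classes: infected (PCOS-positive) and noninfected (healthy ovaries). The final optimized model, DenConREST, reaches 98.23% accuracy, significantly improving PCOS diagnostic precision.

Details

Motivation: PCOS is the most common endocrine disorder among women of reproductive age, with high prevalence particularly among older women in Bangladesh, and existing diagnostic methods can be error-prone. The research aims to explore effective vision-based medical image analysis techniques and to improve PCOS detection accuracy with hybrid models, reducing diagnostic errors.

Result: On a PCOS ultrasound image dataset, the initial hybrid model DenConST achieved 85.69% accuracy, and the final optimized model DenConREST (integrating Swin Transformer, ConvNeXt, DenseNet121, ResNet18, and EfficientNetV2) raised accuracy to 98.23%, the best among all evaluated models.

Insight: The paper's novelty is its hybrid models that combine convolutional neural networks (such as DenseNet, ResNet, and EfficientNetV2) with Transformer architectures (such as Swin Transformer), exploiting their complementary strengths to improve feature extraction. Viewed objectively, this multi-architecture fusion strategy offers a reusable pattern for medical image analysis and may help achieve high-performance disease detection with limited data.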

Abstract: Polycystic Ovary Syndrome (PCOS) is the most common endocrine disorder in women of reproductive age. Many Bangladeshi women suffer from PCOS at an older age. The aim of our research is to identify effective vision-based medical image analysis techniques and evaluate hybrid models for the accurate detection of PCOS. We introduced two novel hybrid models combining convolutional and transformer-based approaches. The training and testing data were organized into two categories: “infected” (PCOS-positive) and “noninfected” (healthy ovaries). In the initial stage, our first hybrid model, ‘DenConST’ (integrating DenseNet121, Swin Transformer, and ConvNeXt), achieved 85.69% accuracy. The final optimized model, ‘DenConREST’ (incorporating Swin Transformer, ConvNeXt, DenseNet121, ResNet18, and EfficientNetV2), demonstrated superior performance with 98.23% accuracy. Among all evaluated models, DenConREST showed the best performance. This research highlights an efficient solution for PCOS detection from ultrasound images, significantly improving diagnostic accuracy while reducing detection errors.


cs.SD

Gokul Karthik Kumar, Ludovick Lepauloux, Hakim Hacid

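TL;DR: WavLink is a compact audio-text embedding model that adds a learnable global token to the Whisper audio encoder and trains it jointly with a text encoder, achieving efficient audio-text retrieval and classification performance.

Details

Motivation: Existing audio-text embedding models (such as CLAP-based models) have not effectively exploited Whisper as a general-purpose audio feature extractor. The paper aims to develop more compact, high-performing embedding representations.

Result: The model achieves state-of-the-art performance on audio-text retrieval and is competitive on the AIR-Bench benchmark (including multiple-choice questions and zero-shot classification). With two-stage training and Matryoshka-style supervision, embeddings can be made 8x smaller with minimal performance loss.

Insight: Contributions include adding a learnable global token to the Whisper encoder to strengthen its representations, and a systematic study of design choices (pretrained text encoders, loss functions, training modes, and data mixtures) that improves the model's scalability and efficiency.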

Abstract: Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models like CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST), and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments the Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench with MCQs and zero-shot classification.
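Matryoshka-style supervision trains an embedding so that its leading dimensions remain useful on their own. A minimal sketch of how such an embedding might be used at retrieval time follows: keep only the first 1/8 of the dimensions, renormalize, and compare with cosine similarity. The 16-dimensional vectors and function names are hypothetical and do not come from the paper.

```python
from math import sqrt


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values of an embedding and rescale them to unit length."""
    prefix = embedding[:dim]
    norm = sqrt(sum(value * value for value in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [value / norm for value in prefix]


def cosine(unit_a, unit_b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(unit_a, unit_b))


# Hypothetical 16-dimensional audio and text embeddings (illustrative values).
audio_full = [0.5, -0.3, 0.1, 0.2, -0.1, 0.4, 0.0, 0.3,
              -0.2, 0.1, 0.6, -0.4, 0.2, 0.1, -0.3, 0.2]
text_full = [0.4, -0.2, 0.2, 0.1, 0.0, 0.5, -0.1, 0.2,
             -0.1, 0.2, 0.5, -0.3, 0.1, 0.0, -0.2, 0.3]

small_dim = len(audio_full) // 8  # 8x smaller embedding
audio_small = truncate_and_normalize(audio_full, small_dim)
text_small = truncate_and_normalize(text_full, small_dim)
score_small = cosine(audio_small, text_small)
print(small_dim, round(score_small, 4))
```

Truncation only works well when the model was trained with a nested objective, which is what the Matryoshka-style supervision in the abstract provides; applying it to an ordinary embedding would usually lose much more accuracy.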


cs.IR

[78] Agentic-R: Learning to Retrieve for Agentic Search cs.IR | cs.CL

Wenhan Liu, Xinyu Ma, Yutao Zhu, Yuchen Li, Daiting Shi

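TL;DR: This paper proposes Agentic-R, a new retriever training framework designed for agentic search. The framework measures passage utility in multi-turn agentic search by combining local query-passage relevance with global answer correctness, and uses an iterative training strategy in which the search agent and the retriever are optimized bidirectionally and iteratively. Experiments on seven single-hop and multi-hop QA benchmarks show that Agentic-R consistently outperforms existing baselines across different search agents.

Details

Motivation: Agentic search has become a powerful paradigm for solving complex questions, but how to design a retriever for it remains underexplored. Existing search agents typically rely on similarity-based retrievers, yet similar passages are not always useful for generating the final answer.

Result: Extensive experiments on seven single-hop and multi-hop QA benchmarks (such as HotpotQA and 2WikiMultiHopQA) show that the Agentic-R retriever consistently outperforms strong baselines across different search agents.

Insight: The paper's novelty is a retriever training framework designed for multi-turn agentic search. Its core is a passage utility measure combining local relevance with global answer correctness, plus a bidirectional optimization strategy in which the retriever and search agent improve each other over iterations, in contrast to conventional RAG retrievers that rely only on local utility and are trained once.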

Abstract: Agentic search has recently emerged as a powerful paradigm, where an agent interleaves multi-step reasoning with on-demand retrieval to solve complex questions. Despite its success, how to design a retriever for agentic search remains largely underexplored. Existing search agents typically rely on similarity-based retrievers, while similar passages are not always useful for final answer generation. In this paper, we propose a novel retriever training framework tailored for agentic search. Unlike retrievers designed for single-turn retrieval-augmented generation (RAG) that only rely on local passage utility, we propose to use both local query-passage relevance and global answer correctness to measure passage utility in a multi-turn agentic search. We further introduce an iterative training strategy, where the search agent and the retriever are optimized bidirectionally and iteratively. Different from RAG retrievers that are only trained once with fixed questions, our retriever is continuously improved using evolving and higher-quality queries from the agent. Extensive experiments on seven single-hop and multi-hop QA benchmarks demonstrate that our retriever, termed Agentic-R, consistently outperforms strong baselines across different search agents. Our code is available at: https://github.com/8421BCD/Agentic-R.
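The abstract says passage utility combines local query-passage relevance with global answer correctness but does not state how the two are combined. The sketch below assumes a simple weighted sum purely for illustration; the weight `alpha`, the function name, and the candidate scores are hypothetical and are not the paper's formulation.

```python
def passage_utility(local_relevance, answer_correct, alpha=0.5):
    """Blend a local relevance score in [0, 1] with a global correctness signal.

    The weighted sum and `alpha` are assumptions made for this sketch only.
    """
    global_signal = 1.0 if answer_correct else 0.0
    return alpha * local_relevance + (1.0 - alpha) * global_signal


# Hypothetical candidates: (passage id, local relevance, final answer correct?)
candidates = [
    ("p1", 0.9, False),  # very similar to the query, but the agent answered wrongly
    ("p2", 0.6, True),   # less similar, yet it led to a correct final answer
    ("p3", 0.2, False),
]
ranked = sorted(
    candidates,
    key=lambda item: passage_utility(item[1], item[2]),
    reverse=True,
)
ranked_ids = [passage_id for passage_id, _, _ in ranked]
print(ranked_ids)
```

Under this toy scoring the passage that led to a correct answer outranks the more similar one, which illustrates the abstract's observation that similar passages are not always the most useful for final answer generation.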


cs.HC

[79] Designing KRIYA: An AI Companion for Wellbeing Self-Reflection cs.HC | cs.AI | cs.CL | cs.CY

Shanshan Zhu, Wenxuan Song, Jiayue Melissa Shi, Dong Whi Yoo, Karthik S. Bhat

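TL;DR: This paper presents the design of KRIYA, an AI companion intended to foster self-reflection on personal wellbeing data. Through features such as Comfort Zone, Detective Mode, and What-If Planning, it helps users understand their data through a lens of interpretation rather than performance. Semi-structured interviews with 18 college students found that KRIYA supports curiosity, self-compassion, and reflective sensemaking.

Details

Motivation: Existing wellbeing apps are built mainly around summative dashboards and structured goals, which can lead to comparison, judgment, and anxiety, so a complementary approach that prioritizes self-reflection is needed.

Result: In user interviews with a prototype using hypothetical data, the study found that KRIYA led users to frame engagement with wellbeing data as interpretation rather than performance, that the experience of reflection depended on emotional framing, and that transparency helped build trust.

Insight: The contribution is the design of an AI companion that promotes reflective sensemaking through co-interpretive features (such as Detective Mode), highlighting the importance of emotional framing and transparency in self-reflection on wellbeing data.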

Abstract: Most personal wellbeing apps present summative dashboards of health and physical activity metrics, yet many users struggle to translate this information into meaningful understanding. These apps commonly support engagement through goals, reminders, and structured targets, which can reinforce comparison, judgment, and performance anxiety. To explore a complementary approach that prioritizes self-reflection, we design KRIYA, an AI wellbeing companion that supports co-interpretive engagement with personal wellbeing data. KRIYA aims to collaborate with users to explore questions, explanations, and future scenarios through features such as Comfort Zone, Detective Mode, and What-If Planning. We conducted semi-structured interviews with 18 college students interacting with a KRIYA prototype using hypothetical data. Our findings show that through KRIYA interaction, users framed engaging with wellbeing data as interpretation rather than performance, experienced reflection as supportive or pressuring depending on emotional framing, and developed trust through transparency. We discuss design implications for AI companions that support curiosity, self-compassion, and reflective sensemaking of personal health data.


cs.AI

[80] VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration cs.AI | cs.CL | cs.LG

Saeed Khaki, Ashudeep Singh, Nima Safaei, Kamal Ginotra

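TL;DR: This paper proposes VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that targets the modality gap vision-language models show on visual math reasoning. The framework solves problems in a structured way by iteratively decomposing a math problem image into natural language rationales and executable Python steps. The study also builds a framework for measuring and improving visual math reasoning, comprising a LaTeX-based pipeline that converts text math corpora into image counterparts and a large set of synthetic tool-use trajectories derived from a real-world homework-style image dataset (SnapAsk) for fine-tuning vision-language models.

Details

Motivation: Vision-language models are markedly less accurate on math problems presented as images than on the same problems in text form. This modality gap stems from compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. The paper aims to narrow the gap through structured tool integration.

Result: Experiments show that tool-integrated supervision improves image-based reasoning and that OCR (optical character recognition) grounding can further narrow the modality gap for smaller models, although its benefit diminishes as model scale grows. The study finds that the severity of the modality gap inversely correlates with model size and that structured reasoning and OCR-based grounding are complementary strategies for advancing visual math reasoning.

Insight: The core contribution is an iterative reasoning framework (VisTIRA) that combines structured problem decomposition with tool execution (Python code), together with a systematic evaluation and data augmentation framework (generating challenging images from text corpora and synthesizing tool-use trajectories). Viewed objectively, decomposing complex visual reasoning into executable, verifiable steps and emphasizing tool integration and data synthesis offers an effective path to better vision-language model performance on structured, symbol-dense tasks.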

Abstract: Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, due to compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that enables structured problem solving by iteratively decomposing a given math problem (as an image) into natural language rationales and executable Python steps to determine the final answer. Second, we build a framework to measure and improve visual math reasoning: a LaTeX-based pipeline that converts chain-of-thought math corpora (e.g., NuminaMath) into challenging image counterparts, and a large set of synthetic tool-use trajectories derived from a real-world, homework-style image dataset (called SnapAsk) for fine-tuning VLMs. Our experiments show that tool-integrated supervision improves image-based reasoning, and OCR grounding can further narrow the gap for smaller models, although its benefit diminishes at scale. These findings highlight that modality gap severity inversely correlates with model size, and that structured reasoning and OCR-based grounding are complementary strategies for advancing visual mathematical reasoning.


[81] MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks cs.AI | cs.CL | cs.MA

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen

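TL;DR: This paper proposes MAS-Orchestra, a training-time framework that formulates multi-agent system (MAS) orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. To rigorously study when and why MAS are effective, the authors also introduce MASBENCH, a controlled benchmark covering five dimensions: Depth, Horizon, Breadth, Parallel, and Robustness.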

Details

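Motivation: Current approaches to automatic multi-agent system design fall short because of two key factors: methodological complexity and efficacy uncertainty. Specifically, existing orchestration methods rely on sequential, code-level execution, which limits holistic, system-level reasoning and scales poorly as agent complexity grows; and MAS are deployed without a clear understanding of their actual benefit over single-agent systems.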

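Result: On public benchmarks covering mathematical reasoning, multi-hop QA, and search-based QA, MAS-Orchestra achieves consistent improvements. The analysis shows that MAS gains depend heavily on task structure, verification protocols, and the capabilities of both the orchestrator and the sub-agents, rather than holding universally.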

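Insight: The main contributions are: 1) formalizing MAS orchestration as a reinforcement learning problem with holistic orchestration, in which goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure; and 2) a controlled benchmark, MASBENCH, for systematically evaluating MAS along different task dimensions, providing a rigorous tool for understanding where MAS help.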

Abstract: While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented sub-agents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and sub-agents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.


[82] Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation cs.AI | cs.CL

Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Yunxiang Zhang

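TL;DR: This paper finds that when large language models (LLMs) act as judges and assess agent performance from agent trajectories (including chain-of-thought), their verdicts are easy to manipulate. Systematically rewriting an agent's chain-of-thought while holding its actions and observations fixed significantly raises the rate at which LLM judges issue false positive verdicts, exposing a fundamental vulnerability in current LLM-based evaluation.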

Details

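Motivation: In non-verifiable settings, LLMs are commonly used as judges, and their judgments rely on the chain-of-thought in agent trajectories. This paradigm implicitly assumes that the agent's chain-of-thought faithfully reflects its internal reasoning and the underlying environment state. The paper tests how brittle this assumption is, that is, whether LLM judges are susceptible to manipulation of agent reasoning traces.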

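Result: Across 800 trajectories spanning diverse web tasks, manipulating the reasoning (chain-of-thought) inflates the false positive rate of state-of-the-art vision-language model (VLM) judges by up to 90%. Content-based manipulations (for example, fabricating signals of task progress) are more effective than style-based manipulations (which alter only how the reasoning is presented). Prompting-based techniques and increased judge-time compute reduce but do not fully eliminate susceptibility to manipulation.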

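Insight: The core contribution is a systematic demonstration that the chain-of-thought an LLM judge relies on may be unfaithful, leaving evaluation results easy to manipulate. Viewed objectively, the study underscores that future judging mechanisms must verify reasoning claims against observable evidence rather than blindly trusting the reasoning the agent supplies.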

Abstract: Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent’s CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.
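The false positive rate discussed above can be illustrated with a minimal sketch. The verdict lists below are made-up toy data, and `false_positive_rate` is a hypothetical helper rather than anything taken from the paper; it only shows how rewriting the reasoning while leaving outcomes fixed could move the metric.

```python
def false_positive_rate(judge_verdicts, ground_truth):
    """Fraction of truly failed trajectories that the judge marks as successful."""
    # Keep only the judge's verdicts on trajectories that actually failed.
    on_failures = [v for v, ok in zip(judge_verdicts, ground_truth) if not ok]
    if not on_failures:
        raise ValueError("no failed trajectories to evaluate")
    return sum(on_failures) / len(on_failures)


# Toy data (assumption): five trajectories, four of which truly failed.
truth = [False, False, False, False, True]
# Judge verdicts on the original chain-of-thought.
verdicts_original = [False, False, True, False, True]
# Judge verdicts after the chain-of-thought is rewritten; actions are unchanged.
verdicts_manipulated = [True, True, True, False, True]

fpr_before = false_positive_rate(verdicts_original, truth)
fpr_after = false_positive_rate(verdicts_manipulated, truth)
print(fpr_before, fpr_after)
```

Because the actions and observations are identical in both runs, any change in the rate is attributable to the reasoning text alone.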


[83] The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution cs.AI | cs.CL

Chen Qian, Peng Wang, Dongrui Liu, Junyao Yang, Dadi Guo

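TL;DR: This paper proposes a general attribution framework for understanding the internal drivers of LLM agent behavior. It uses hierarchical analysis (component level and sentence level) to identify the pivotal historical events and textual evidence behind a specific agent action, regardless of whether the task succeeded.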

Details

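Motivation: Existing research focuses mainly on failure attribution, which localizes explicit errors in unsuccessful trajectories; this is insufficient for explaining the reasoning behind agent behavior. Improving the accountability and governance of agentic systems requires a general attribution method that captures the internal drivers of their actions.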

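Result: Experiments validate the framework across diverse agentic scenarios, including standard tool use and subtle reliability risks such as memory-induced bias. The results show that it reliably pinpoints the pivotal historical events and sentences behind agent behavior.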

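Insight: The contribution is a general agentic attribution framework that does not depend on task outcome. It manages the complexity of agent interactions hierarchically (temporal likelihood dynamics followed by perturbation-based analysis) to reveal the internal drivers of behavior, providing a key tool for building safer and more accountable agentic systems.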

Abstract: Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering. As these systems become more autonomous and are deployed at scale, understanding why an agent takes a particular action becomes increasingly important for accountability and governance. However, existing research predominantly focuses on failure attribution to localize explicit errors in unsuccessful trajectories, which is insufficient for explaining the reasoning behind agent behaviors. To bridge this gap, we propose a novel framework for general agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome. Our framework operates hierarchically to manage the complexity of agent interactions. Specifically, at the component level, we employ temporal likelihood dynamics to identify critical interaction steps; then at the sentence level, we refine this localization using perturbation-based analysis to isolate the specific textual evidence. We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias. Experimental results demonstrate that the proposed framework reliably pinpoints pivotal historical events and sentences behind the agent behavior, offering a critical step toward safer and more accountable agentic systems.
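One common form of perturbation-based analysis is leave-one-out ablation, sketched below. The abstract does not specify the paper's actual perturbation scheme, so this is an assumption; `toy_score` is a hypothetical stand-in for a model's likelihood of the observed action given the context.

```python
def sentence_attribution(sentences, score_fn):
    """Leave-one-out attribution: the score drop when each sentence is removed."""
    base = score_fn(sentences)
    return [
        base - score_fn(sentences[:i] + sentences[i + 1:])
        for i in range(len(sentences))
    ]


def toy_score(sentences):
    # Hypothetical scorer: stands in for the likelihood an agent assigns to
    # issuing a refund. Here it just counts sentences that mention "refund".
    return sum(1.0 for s in sentences if "refund" in s.lower())


context = [
    "User asks about shipping.",
    "Memory: user was promised a refund.",
    "Tool returned order status.",
]
attributions = sentence_attribution(context, toy_score)
most_influential = max(range(len(context)), key=lambda i: attributions[i])
print(attributions, context[most_influential])
```

In this toy case the memory entry receives all of the attribution, which mirrors the memory-induced bias scenario the abstract mentions.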


[84] BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries cs.AI | cs.CL | cs.CV | cs.RO

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen

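TL;DR: This paper proposes BayesianVLA, a framework that addresses Information Collapse in vision-language-action (VLA) models for robot manipulation, where dataset bias causes the model to degenerate into a vision-only policy that ignores language instructions. The method introduces learnable Latent Action Queries to build a dual-branch architecture that estimates a vision-only prior and a language-conditioned posterior, and it optimizes the policy by maximizing the conditional pointwise mutual information between actions and instructions, forcing the model to follow language constraints.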

Details

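Motivation: Current VLA models generalize poorly to new instructions and complex multi-task scenarios. The root cause is dataset bias from goal-driven data collection: language instructions become highly predictable from visual observations alone, so the conditional mutual information between actions and instructions vanishes (Information Collapse) and the model degenerates into a vision-only policy that ignores language constraints.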

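Result: Extensive experiments on the SimplerEnv and RoboCasa benchmarks show that the method significantly improves generalization. Without requiring new data, it achieves an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating its ability to robustly ground language in action decisions.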

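Insight: The core contributions are the dual-branch architecture built from Bayesian decomposition and Latent Action Queries, and a training objective that maximizes conditional pointwise mutual information (PMI). The objective systematically penalizes the vision shortcut and rewards actions that explicitly explain the language instruction, offering an interpretable remedy for Information Collapse in VLA models.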

Abstract: Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
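Under the usual definition, the conditional PMI between an action and an instruction is the log-ratio of the language-conditioned posterior to the vision-only prior, log π(a | v, ℓ) − log p(a | v). The sketch below evaluates that quantity on made-up discrete action distributions; the action names and probabilities are assumptions for illustration, not values from the paper, and the paper's actual training loss may differ.

```python
import math


def conditional_pmi(posterior, prior, action):
    """log pi(a | v, l) - log p(a | v) for one discrete action."""
    return math.log(posterior[action]) - math.log(prior[action])


# Toy distributions (assumption): vision alone cannot tell which block to pick.
vision_prior = {"pick_red": 0.5, "pick_blue": 0.5}
# After reading the instruction "pick the blue block", mass shifts to pick_blue.
language_posterior = {"pick_red": 0.1, "pick_blue": 0.9}

# Positive PMI: the instruction made this action more likely than vision alone.
pmi_grounded = conditional_pmi(language_posterior, vision_prior, "pick_blue")
# Zero PMI: a policy that ignores language has a posterior equal to its prior.
pmi_shortcut = conditional_pmi(vision_prior, vision_prior, "pick_blue")
print(pmi_grounded, pmi_shortcut)
```

A policy that has collapsed to the vision shortcut scores zero on every action, which is why maximizing this quantity rewards actions that depend on the instruction.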


[85] AutoDriDM: An Explainable Benchmark for Decision-Making of Vision-Language Models in Autonomous Driving cs.AI | cs.CV | cs.RO

Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao

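TL;DR: This paper presents AutoDriDM, an explainable benchmark focused on evaluating the decision-making of vision-language models in autonomous driving, with 6,650 questions across three dimensions: Object, Scene, and Decision. The work evaluates mainstream VLMs, reveals a weak correlation between perception and decision-making performance, and conducts explainability analyses to identify key failure modes.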

Details

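Motivation: Existing autonomous driving benchmarks and metrics overemphasize perceptual competence and do not adequately assess models' decision-making processes. Because the reasoning abilities of vision-language models open new possibilities for driving decisions, a new decision-centric evaluation framework is needed.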

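Result: Mainstream vision-language models are evaluated on the proposed AutoDriDM benchmark, and correlation analysis shows only weak alignment between perception performance and decision-making performance.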

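Insight: The contribution is a progressive, decision-centric benchmark, together with explainability analyses and an analyzer model for automated annotation, that bridges the gap between perception-centered and decision-centered evaluation and provides guidance for developing safer and more reliable VLMs for autonomous driving.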

Abstract: Autonomous driving is a highly challenging domain that requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks and metrics overemphasize perceptual competence and fail to adequately assess decision-making processes. In this work, we present AutoDriDM, a decision-centric, progressive benchmark with 6,650 questions across three dimensions - Object, Scene, and Decision. We evaluate mainstream VLMs to delineate the perception-to-decision capability boundary in autonomous driving, and our correlation analysis reveals weak alignment between perception and decision-making performance. We further conduct explainability analyses of models’ reasoning processes, identifying key failure modes such as logical reasoning errors, and introduce an analyzer model to automate large-scale annotation. AutoDriDM bridges the gap between perception-centered and decision-centered evaluation, providing guidance toward safer and more reliable VLMs for real-world autonomous driving.


cs.LG [Back]

[86] PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning cs.LG | cs.AI | cs.CL

Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen

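TL;DR: PCL-Reasoner-V1.5 is a 32-billion-parameter large language model built on Qwen2.5-32B and specialized for mathematical reasoning. The model is refined with supervised fine-tuning (SFT) followed by reinforcement learning (RL); its central innovation is a proposed offline RL method that offers better training stability and efficiency than standard online RL methods such as GRPO. It attains average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025, the state of the art among models post-trained on Qwen2.5-32B. All experiments were conducted on Huawei Ascend 910C NPUs.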

Details

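Motivation: Address unstable and inefficient training of large language models on mathematical reasoning, in particular the limitations of standard online RL methods such as GRPO.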

Result: Average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025, the state of the art among models post-trained on Qwen2.5-32B.

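Insight: The claimed contribution is an offline RL method for LLM mathematical reasoning that improves training stability and efficiency. Viewed objectively, applying the offline RL paradigm stably to fine-tuning large-scale math reasoning models is a technical path worth noting; it may reduce reliance on online interaction data collection and streamline training.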

Abstract: We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.


[87] Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models cs.LG | cs.AI | cs.CL

Injin Kong, Hyoungjoon Lee, Yohan Jo

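TL;DR: Through a comparative analysis of autoregressive models (ARMs) and the masked diffusion models (MDMs) obtained from them by post-training, this paper reveals the internal mechanism shift that occurs during post-training. MDMs do not simply reuse autoregressive heuristics; they fundamentally reorganize computation according to task structure (local causal dependencies versus global planning), and on global planning tasks they exhibit a new pattern of increased early-layer processing and distributed semantic integration.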

Details

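Motivation: It is unclear what internal algorithmic transformations occur when post-training converts autoregressive models into masked diffusion models. The paper investigates whether the converted MDMs acquire genuine bidirectional reasoning capabilities or merely repackage autoregressive heuristics.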

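Result: Comparative circuit analysis shows that for tasks dominated by local causal dependencies, MDMs largely retain autoregressive circuitry; for global planning tasks, MDMs abandon the initialized pathways and exhibit distinct rewiring characterized by increased early-layer processing. Semantically, the analysis identifies a transition from sharp, localized specialization in ARMs to distributed integration in MDMs.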

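Insight: Diffusion post-training does not merely adapt model parameters; it fundamentally reorganizes internal computation to support non-sequential global planning. The contribution is the discovery of a task-structure-dependent "mechanism shift", in which the model rewires its circuits differently depending on the nature of the task (local vs. global), offering a new perspective on what post-training fundamentally changes.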

Abstract: Post-training pretrained Autoregressive models (ARMs) into Masked Diffusion models (MDMs) has emerged as a cost-effective strategy to overcome the limitations of sequential generation. However, the internal algorithmic transformations induced by this paradigm shift remain unexplored, leaving it unclear whether post-trained MDMs acquire genuine bidirectional reasoning capabilities or merely repackage autoregressive heuristics. In this work, we address this question by conducting a comparative circuit analysis of ARMs and their MDM counterparts. Our analysis reveals a systematic “mechanism shift” dependent on the structural nature of the task. Structurally, we observe a distinct divergence: while MDMs largely retain autoregressive circuitry for tasks dominated by local causal dependencies, they abandon initialized pathways for global planning tasks, exhibiting distinct rewiring characterized by increased early-layer processing. Semantically, we identify a transition from sharp, localized specialization in ARMs to distributed integration in MDMs. Through these findings, we conclude that diffusion post-training does not merely adapt model parameters but fundamentally reorganizes internal computation to support non-sequential global planning.


[88] Strategic Doctrine Language Models (sdLM): A Learning-System Framework for Doctrinal Consistency and Geopolitical Forecasting cs.LG | cs.CL

Olaf Yunus Laitinen Imanov, Taner Yilmaz, Derya Umut Kulali

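TL;DR: This paper proposes Strategic Doctrine Language Models (sdLM), a learning-system framework for multi-document strategic reasoning under doctrinal consistency constraints and calibrated uncertainty. The approach combines multi-document attention, temporal encoding, and a doctrine-consistency layer to improve long-horizon forecasting and plan plausibility while reducing severe doctrinal violations.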

Details

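Motivation: Address how, in strategic reasoning and geopolitical forecasting, to keep multi-document analysis within established doctrinal consistency constraints while improving long-horizon forecast accuracy and uncertainty calibration.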

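Result: Across three benchmarks (expert-panel scoring of 47 strategic scenarios, doctrine-consistency evaluation of 12,847 statements from 336 doctrine publications, and geopolitical forecasting on 127 historical counterfactuals over 12-60 month horizons), sdLM achieves higher strategic quality and better calibration than strong general-purpose LLM baselines, and remains competitive with human experts on long-horizon judgments.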

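Insight: The contribution is combining multi-document attention and temporal encoding with a dedicated doctrine-consistency layer, integrating domain knowledge (doctrine) into a language model in a structured way to improve the reliability and interpretability of strategic reasoning. Viewed objectively, the framework's emphasis on systematically incorporating domain-specific constraints into generation is relevant to decision-support systems that must follow strict policies or rules.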

Abstract: We introduce Strategic Doctrine Language Models (sdLM), a learning-system framework for multi-document strategic reasoning with doctrinal consistency constraints and calibrated uncertainty. The approach combines multi-document attention, temporal encoding, and a doctrine-consistency layer to improve long-horizon forecasting and plan plausibility while reducing severe doctrinal violations. We evaluate sdLM using (i) expert-panel scoring of strategic scenarios (N=47), (ii) doctrine consistency on 336 doctrine publications (12,847 statements), and (iii) geopolitical forecasting on 127 historical counterfactuals (1945-2020) across 12-60 month horizons. Across these benchmarks, sdLM achieves higher strategic quality and better calibration than strong general-purpose LLM baselines, and remains competitive with human experts on long-horizon judgments. We further report ablations, scaling trends, and deployment-oriented performance/latency characteristics to clarify which components drive improvements and how they translate to operational settings.


[89] What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study cs.LG | cs.AI | cs.CL

Keyu Lv, Manyi Zhang, Xiaobo Xia, Jingchen Ni, Shannan Yan

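TL;DR: This paper systematically studies the effectiveness of low-bit quantization-aware training (QAT) for reasoning large language models (LLMs) and proposes an optimized workflow called Reasoning-QAT. It finds that knowledge distillation is a robust training objective, post-training quantization (PTQ) provides a good initialization for QAT, reinforcement learning remains feasible for quantized models, and aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves final accuracy. The workflow consistently outperforms state-of-the-art post-training quantization methods across multiple LLM backbones and reasoning datasets.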

Details

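Motivation: Reasoning models perform well on complex tasks, but their inference is slow and token-inefficient. Post-training quantization (PTQ) usually causes large accuracy drops, especially on reasoning tasks under low-bit settings. The study systematically explores how quantization-aware training (QAT) can improve the inference efficiency of reasoning LLMs.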

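Result: The proposed Reasoning-QAT workflow consistently outperforms state-of-the-art post-training quantization methods (such as GPTQ) across multiple LLM backbones (such as Qwen3-0.6B) and reasoning datasets (such as MATH-500). For example, on Qwen3-0.6B it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.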

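Insight: The contribution is systematically identifying and integrating the key factors that make low-bit QAT effective for reasoning LLMs: knowledge distillation as a robust objective, PTQ as the QAT initialization, the feasibility of reinforcement learning, and the importance of domain alignment. Viewed objectively, tightly combining PTQ with the QAT pipeline and optimizing the workflow is an effective route to better low-bit quantized reasoning models.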

Abstract: Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. To improve the inference efficiency, post-training quantization (PTQ) usually comes with the cost of large accuracy drops, especially for reasoning tasks under low-bit settings. In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models. Our key findings include: (1) Knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) Reinforcement learning remains feasible for quantized models given a viable cold start and yields additional gains; and (4) Aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves the final accuracy. Finally, we consolidate these findings into an optimized workflow (Reasoning-QAT), and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on Qwen3-0.6B, it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.
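To make the 2-bit regime concrete, the sketch below applies symmetric uniform "fake quantization" to a weight vector: values are snapped to a small integer grid and mapped back to floats, which is the forward-pass operation QAT typically trains through. The per-tensor scale and rounding scheme are assumptions chosen for illustration; the abstract does not state which quantizer the paper uses.

```python
def fake_quantize(weights, bits=2):
    """Quantize to a signed integer grid, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1   # largest integer level, 1 for 2-bit
    qmin = -(2 ** (bits - 1))    # smallest integer level, -2 for 2-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax       # per-tensor scale (assumption)
    quantized = []
    for w in weights:
        q = round(w / scale)             # Python rounds halves to even
        q = max(qmin, min(qmax, q))      # clamp to the representable range
        quantized.append(q * scale)
    return quantized


weights = [0.8, -0.3, 0.1, -0.8]
print(fake_quantize(weights, bits=2))
```

With only four representable levels, the small weights collapse to zero, which gives some intuition for why PTQ alone can lose so much accuracy at 2 bits and why training with the quantizer in the loop may help.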


[90] Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization cs.LG | cs.CV

Adam Rokah, Daniel Veress, Caleb Caulk, Sourav Sharan

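TL;DR: This paper studies mixture-of-experts models on vision tasks, comparing dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 image classification dataset. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization. Hessian-based sharpness metrics and loss surface perturbation analyses reveal qualitative differences in curvature at convergence and in non-local behavior. The study also shows that, at this scale, naively implemented conditional routing yields no inference speedup.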

Details

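Motivation: Mixture-of-experts architectures are usually used to scale large language models; this paper instead studies their behavior in image classification, focusing on predictive performance, expert utilization, and generalization.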

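Result: On CIFAR10, under comparable model capacity, SoftMoE and SparseMoE achieve slightly higher validation accuracy than the dense baseline. Hessian-based sharpness metrics show that SoftMoE has higher sharpness, while Dense and SparseMoE lie in a similar curvature regime, even though all models generalize comparably.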

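Insight: Regularization avoids expert collapse and keeps expert utilization balanced; Hessian-based curvature analysis and loss surface perturbation analysis offer complementary views of MoE generalization behavior; and the study exposes the gap between the theoretical efficiency of sparse MoE models and the efficiency realized on actual hardware.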

Abstract: Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization, avoiding expert collapse. To analyze generalization, we compute Hessian-based sharpness metrics at convergence, including the largest eigenvalue and trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance. Complementary loss surface perturbation analyses reveal qualitative differences in non-local behavior under finite parameter perturbations between dense and MoE models, which help contextualize curvature-based measurements without directly explaining validation accuracy. We further evaluate empirical inference efficiency and show that naively implemented conditional routing does not yield inference speedups on modern hardware at this scale, highlighting the gap between theoretical and realized efficiency in sparse MoE models.
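The difference between soft and sparse routing can be sketched with a generic gating function. This is a textbook-style illustration under stated assumptions: soft gating weights every expert by a softmax over router logits, and sparse gating keeps only the top-k experts and renormalizes. The paper's SoftMoE and SparseMoE heads may be defined differently, and all names below are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_gate(router_logits):
    """Every expert contributes, weighted by its softmax probability."""
    return softmax(router_logits)


def sparse_gate(router_logits, k=1):
    """Only the top-k experts contribute; their weights are renormalized."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [probs[i] / kept if i in top else 0.0 for i in range(len(probs))]


def combine(gates, expert_outputs):
    """Mix scalar expert outputs with the gate weights."""
    return sum(g * y for g, y in zip(gates, expert_outputs))


router_logits = [2.0, 1.0, 0.1]
expert_outputs = [10.0, 20.0, 30.0]      # toy scalar outputs (assumption)
soft_weights = soft_gate(router_logits)
sparse_weights = sparse_gate(router_logits, k=1)
print(combine(soft_weights, expert_outputs), combine(sparse_weights, expert_outputs))
```

Note that this loop still computes a gate value for every expert; skipping the zero-weight experts is what would save compute, and the abstract reports that a naive implementation of that skipping does not yield speedups at this scale.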


[91] ZENITH: Automated Gradient Norm Informed Stochastic Optimization cs.LG | cs.CV

Dhrubo Saha

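TL;DR: ZENITH is an adaptive learning-rate optimizer based on gradient-norm history that requires no manual tuning of the learning-rate schedule. It achieves higher test accuracy and faster training on image classification, object detection, keypoint detection, and instance segmentation.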

Details

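Motivation: Address the computational and memory overhead, incompatibility with regularization, and suboptimal learning-rate choices of existing adaptive optimizers.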

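Result: In image classification experiments spanning 6 CNN architectures and 6 benchmarks, ZENITH reaches higher test accuracy in less wall-clock time; on MS COCO with the R-CNN family of models, it attains higher mAP in object detection, keypoint detection, and instance segmentation.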

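Insight: Adapting the learning rate from the temporal evolution of the gradient norm yields a zero-overhead, regularization-compatible optimizer with improved generalization.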

Abstract: Training deep computer vision models requires manual oversight or hyperparameter tuning of the learning rate (LR) schedule. While existing adaptive optimizers schedule the LR automatically, they suffer from computational and memory overhead, incompatibility with regularization, and suboptimal LR choices. In this work, we introduce the ZENITH (Zero-overhead Evolution using Norm-Informed Training History) optimizer, which adapts the LR using the temporal evolution of the gradient norm. Image classification experiments spanning 6 CNN architectures and 6 benchmarks demonstrate that ZENITH achieves higher test accuracy in lower wall-clock time than baselines. It also yielded superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO using the R-CNN family of models. Furthermore, its compatibility with regularization enables even better generalization.


cs.CY [Back]

Yiran Hu, Huanghai Liu, Chong Wang, Kunran Li, Tien-Hsuan Wu

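TL;DR: This survey covers the evaluation of large language models in legal applications. It systematically examines the key challenges in evaluating LLMs on legal tasks, reviews existing evaluation methods and benchmarks, and points out future research directions.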

Details

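Motivation: As LLMs are increasingly integrated into legal applications such as judicial decision support and legal practice assistance, their real-world deployment raises concerns that go beyond surface-level accuracy, including the soundness of legal reasoning processes and trustworthiness issues such as fairness and reliability. Systematic evaluation is therefore needed to ensure responsible adoption.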

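Result: The paper proposes no new model or method, so it reports no quantitative experimental results. It reviews and categorizes existing evaluation methods and benchmarks, and analyzes how far they address the evaluation challenges of the legal domain and where they fall short.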

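Insight: The contribution is to start from real-world legal practice, systematically lay out the core challenges of evaluating LLMs in the legal domain (outcome correctness, reasoning reliability, and trustworthiness), and use that framework to critically analyze existing evaluation methods, pointing the way toward more realistic, reliable, and legally grounded evaluation frameworks.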

Abstract: Large language models (LLMs) are being increasingly integrated into legal applications, including judicial decision support, legal practice assistance, and public-facing legal services. While LLMs show strong potential in handling legal knowledge and tasks, their deployment in real-world legal settings raises critical concerns beyond surface-level accuracy, involving the soundness of legal reasoning processes and trustworthiness issues such as fairness and reliability. Systematic evaluation of LLM performance in legal tasks has therefore become essential for their responsible adoption. This survey identifies key challenges in evaluating LLMs for legal tasks grounded in real-world legal practice. We analyze the major difficulties involved in assessing LLM performance in the legal domain, including outcome correctness, reasoning reliability, and trustworthiness. Building on these challenges, we review and categorize existing evaluation methods and benchmarks according to their task design, datasets, and evaluation metrics. We further discuss the extent to which current approaches address these challenges, highlight their limitations, and outline future research directions toward more realistic, reliable, and legally grounded evaluation frameworks for LLMs in legal domains.


eess.AS [Back]

[93] AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering eess.AS | cs.AI | cs.CL | cs.LG | cs.SD

Chun-Yi Kuan, Kai-Wei Chang, Hung-yi Lee

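TL;DR: This paper proposes AQAScore, a new framework for evaluating semantic alignment in text-to-audio generation. It uses audio-aware large language models (ALLMs) to recast evaluation as a probabilistic semantic verification task, estimating alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries rather than relying on open-ended text generation.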

Details

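Motivation: Existing metrics based on embedding similarity (such as CLAPScore) are limited in fine-grained semantic alignment and compositional reasoning, and have not kept pace with the rapid progress of text-to-audio generation in realism and diversity.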

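Result: Experiments on multiple benchmarks (human-rated relevance, pairwise comparison, and compositional reasoning tasks) show that AQAScore correlates more strongly with human judgments than similarity-based metrics and generative prompting baselines, captures subtle semantic inconsistencies, and improves as the underlying ALLM becomes more capable.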

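Insight: The core contribution is recasting evaluation as probabilistic semantic verification and quantifying alignment with exact log-probabilities, which is more robust and interpretable than open-ended generation or simple similarity comparison. The framework is backbone-agnostic and its effectiveness scales with ALLM capability, offering a new approach to evaluating generative audio.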

Abstract: Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity like CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a “Yes” answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, showing its effectiveness in capturing subtle semantic inconsistencies and scaling with the capability of underlying ALLMs.
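The scoring step can be sketched as a log-softmax over next-token logits, reading off the entry for "Yes". The three-token vocabulary and logit values below are made up, and `yes_log_prob` is a hypothetical helper; a real ALLM would return logits over its full vocabulary, and the abstract does not say how AQAScore handles tokenization details.

```python
import math


def yes_log_prob(token_logits, yes_token="Yes"):
    """Log-probability of the yes token under a softmax over all candidate tokens."""
    m = max(token_logits.values())
    # Log-sum-exp with the max subtracted for numerical stability.
    log_z = m + math.log(sum(math.exp(v - m) for v in token_logits.values()))
    return token_logits[yes_token] - log_z


# Toy logits (assumption) for the first answer token to a query such as
# "Does the audio contain a dog barking? Answer Yes or No."
logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}
score = yes_log_prob(logits)
print(score, math.exp(score))
```

Because the score is read directly from the model's output distribution, it is a continuous value rather than a parsed string, which is the contrast with open-ended generation that the abstract draws.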