Table of Contents
- cs.CL [Total: 31]
- cs.CV [Total: 64]
- cs.SD [Total: 1]
- cs.IR [Total: 1]
- cs.CR [Total: 1]
- cs.LG [Total: 1]
- cs.MA [Total: 1]
- cs.RO [Total: 2]
- cs.AI [Total: 4]
- cs.SE [Total: 1]
- cs.HC [Total: 1]
cs.CL
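[1] Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection cs.CL
Janek Bevendorff, Maik Fröbe, André Greiner-Petter, Andreas Jakoby, Maximilian Mayerl
TL;DR: This paper gives an overview of the five tasks at the PAN 2026 workshop, which aims to advance computational stylometry and text forensics through objective, reproducible evaluation. The tasks are: Voight-Kampff Generative AI Detection (particularly in mixed and obfuscated authorship scenarios), Text Watermarking (benchmarking the robustness of existing schemes), Multi-Author Writing Style Analysis (locating positions of authorship change), Generative Plagiarism Detection (retrieving the source documents of generated text and aligning them), and Reasoning Trajectory Detection (source detection and safety detection for LLM-generated or human-written reasoning trajectories).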
Details
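Motivation: The PAN workshop aims to advance computational stylometry and text forensics through objective, reproducible evaluation, in response to the challenges posed by generative AI and complex authorship scenarios.
Result: The paper reports no quantitative results for specific models, but notes that since 2012 more than 1,100 software submissions have been made as Docker containers through the TIRA experimentation platform, indicating broad participation and reproducibility.
Insight: The workshop sets up a diverse task suite covering generative AI detection, text watermarking, multi-author analysis, generative plagiarism detection, and reasoning trajectory detection. The two new tasks, text watermarking and reasoning trajectory detection, reflect current concern with the safety and traceability of AI-generated content. Evaluating submissions as Docker containers for reproducibility is a standardization practice worth borrowing.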
Abstract: The goal of the PAN workshop is to advance computational stylometry and text forensics via objective and reproducible evaluation. In 2026, we run the following five tasks: (1) Voight-Kampff Generative AI Detection, particularly in mixed and obfuscated authorship scenarios, (2) Text Watermarking, a new task that aims to find new and benchmark the robustness of existing text watermarking schemes, (3) Multi-author Writing Style Analysis, a continued task that aims to find positions of authorship change, (4) Generative Plagiarism Detection, a continued task that targets source retrieval and text alignment between generated text and source documents, and (5) Reasoning Trajectory Detection, a new task that deals with source detection and safety detection of LLM-generated or human-written reasoning trajectories. As in previous years, PAN invites software submissions as easy-to-reproduce Docker containers for most of the tasks. Since PAN 2012, more than 1,100 submissions have been made this way via the TIRA experimentation platform.
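[2] Effective Reasoning Chains Reduce Intrinsic Dimensionality cs.CL | cs.AI | cs.LG
Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, Peter Shaw
TL;DR: This paper proposes intrinsic dimensionality as a quantitative measure of how effective chain-of-thought reasoning strategies are, and finds that effective strategies reduce the intrinsic dimensionality of the task and thereby improve generalization.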
Details
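Motivation: The mechanisms by which chain-of-thought reasoning improves generalization are poorly understood, and there is no consistent quantitative measure linking reasoning strategies to generalization performance.
Result: Validated on GSM8K with Gemma-3 1B and 4B: the intrinsic dimensionality of a reasoning strategy shows a strong inverse correlation with its generalization performance on both in-distribution and out-of-distribution data.
Insight: The novelty lies in using intrinsic dimensionality as a quantitative tool for analyzing reasoning. Effective reasoning facilitates learning by compressing the task more efficiently (using fewer parameters), which gives a new metric for evaluating reasoning strategies.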
Abstract: Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.
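A minimal sketch of the definition above, assuming accuracy has already been measured at several candidate dimensions for each reasoning strategy. The function name `intrinsic_dimension`, the strategy labels, and all numbers are hypothetical and are not taken from the paper; how accuracy at a given dimension is obtained is left out.

```python
from typing import Optional


def intrinsic_dimension(acc_by_dim: dict[int, float], threshold: float) -> Optional[int]:
    """Return the smallest measured dimension whose accuracy reaches the threshold.

    Returns None when no measured dimension reaches it.
    """
    reaching = [d for d, acc in acc_by_dim.items() if acc >= threshold]
    return min(reaching) if reaching else None


# Hypothetical accuracy curves, keyed by number of trainable dimensions.
curves = {
    "no_cot": {1_000: 0.10, 10_000: 0.18, 100_000: 0.31},
    "short_cot": {1_000: 0.15, 10_000: 0.33, 100_000: 0.41},
}

dims = {name: intrinsic_dimension(curve, threshold=0.30) for name, curve in curves.items()}
```

Under these made-up curves, the strategy that reaches the threshold with fewer dimensions would be the one the abstract describes as more effective.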
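[3] Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization cs.CL | cs.AI | cs.LG
Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu
TL;DR: This paper proposes counterfactual importance weighting, a method for improving policy gradient optimization on language model reasoning tasks. It masks reasoning spans, measures the resulting drop in answer probability to estimate the importance of each generated token, and weights the policy gradient update accordingly, replacing the uniform credit that conventional methods assign to all tokens.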
Details
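Motivation: Existing policy gradient methods such as GRPO and DAPO assign uniform credit to every generated token, so irrelevant filler phrases receive the same gradient update as critical calculation steps, which limits the efficiency of policy optimization.
Result: On GSM8K, across three models from the Qwen and Llama families, the method consistently improves over uniform baselines and converges faster to equivalent accuracy. Inverting the importance signal hurts performance, which supports the claim that the method captures genuine causal structure rather than noise.
Insight: Token importance is estimated directly from the policy model's own probability shifts, with no auxiliary model or external annotation, yielding causal credit assignment. Analysis shows that the method correctly prioritizes calculation steps over scaffolding text, and the authors present it as a foundation for further research.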
Abstract: Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens - the filler phrase “Let me think” receives the same gradient update as the critical calculation “23 + 45 = 68.” We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation, instead importance is estimated directly from the policy model’s own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
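The mask, measure, and upweight steps in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the answer probabilities are supplied as plain numbers rather than computed by a model, and the clipping at zero, the `base` offset, and the mean-1 normalization are choices made here for the example.

```python
def span_importance(p_answer_full: float, p_answer_masked: list[float]) -> list[float]:
    """Importance of each span: drop in answer probability when that span is masked.

    Negative drops are clipped to zero (an assumption of this sketch).
    """
    return [max(0.0, p_answer_full - p) for p in p_answer_masked]


def token_weights(importance: list[float], span_lengths: list[int], base: float = 1.0) -> list[float]:
    """Spread each span's importance over its tokens and normalize weights to mean 1."""
    raw: list[float] = []
    for imp, n_tokens in zip(importance, span_lengths):
        raw.extend([base + imp] * n_tokens)
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_pg_loss(logprobs: list[float], advantage: float, weights: list[float]) -> float:
    """Policy-gradient surrogate loss with per-token weights instead of uniform credit."""
    return -sum(w * advantage * lp for w, lp in zip(weights, logprobs)) / len(logprobs)


# Hypothetical numbers: masking the second span (a calculation) hurts the answer far more.
importance = span_importance(0.90, [0.88, 0.30])
weights = token_weights(importance, span_lengths=[3, 5])
loss = weighted_pg_loss([-0.2] * 8, advantage=1.0, weights=weights)
```

Because the weights average to 1, the overall gradient scale stays comparable to the uniform baseline while credit shifts toward the span whose removal changes the answer most.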
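[4] FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding cs.CL
Siyuan Huang, Ziyu Wang, Chao Pan, Han Zhao
TL;DR: This paper proposes FM SO.P, a framework that combines progressive task mixtures with an automatic multi-agent evaluation system to address language models' difficulties with understanding Standard Operating Procedures (SOPs) and generalizing across domains.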
Details
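Motivation: Existing language models struggle with the terminology precision, sequential ordering, and constraint reasoning that SOP understanding requires, and they generalize poorly across domains, so a method that builds these capabilities in stages is needed.
Result: On SOPBench, which covers seven domains including banking, DMV, and healthcare, the 32B FM SO.P model reaches a 48.3% pass rate and the open-source 7B model reaches 34.3%, matching the Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.
Insight: The contributions are progressive task mixtures that build concept disambiguation, action sequence understanding, and scenario-aware graph reasoning in stages, and an automatic multi-agent evaluation system that adaptively generates rubrics, stratified test sets, and domain-specific evaluations.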
Abstract: Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOP requires: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3% pass rate with our 32B model and 34.3% with our opensource 7B model, matching Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.
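[5] Contractual Deepfakes: Can Large Language Models Generate Contracts? cs.CL | cs.AI
Eliza Mik
TL;DR: This paper critically examines the ability of large language models (LLMs) to generate contracts. It argues that LLMs only reproduce statistically dominant word patterns and cannot understand context or reason about the law, so the contracts they generate may be useless assemblages of inconsistent provisions, or enforceable but unsuitable for the transaction at hand.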
Details
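Motivation: Responding to the popular view that LLMs can assist with contract drafting, the paper sets out the limits of that idea, stressing that LLMs lack any understanding of meaning, context, or legal reasoning.
Result: The paper offers no quantitative experiments or benchmarks. Through theoretical analysis it argues that LLM-generated contracts may prove useless or unsuitable in practice.
Insight: The paper critiques, from the perspective of legal practice, the assumption that LLMs can be applied to professional tasks such as contract drafting. It stresses the fundamental difference between generating language and reasoning about the law, and cautions the industry against overly simplistic optimism about AI.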
Abstract: Notwithstanding their unprecedented ability to generate text, LLMs do not understand the meaning of words, have no sense of context and cannot reason. Their output constitutes an approximation of statistically dominant word patterns. And yet, the drafting of contracts is often presented as a typical legal task that could be facilitated by this technology. This paper seeks to put an end to such unreasonable ideas. Predicting words differs from using language in the circumstances of specific transactions and reconstituting common contractual phrases differs from reasoning about the law. LLMs seem to be able to generate generic and superficially plausible contractual documents. In the cold light of day, such documents may turn out to be useless assemblages of inconsistent provisions or contracts that are enforceable but unsuitable for a given transaction. This paper casts a shadow on the simplistic assumption that LLMs threaten the continued viability of the legal industry.
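[6] Are Language Models Sensitive to Morally Irrelevant Distractors? cs.CL | cs.CY
Andrew Shaw, Christina Hahn, Catherine Rasgaitis, Yash Mishra, Alisa Liu
TL;DR: This paper asks whether the moral judgements of large language models (LLMs), like those of humans, are influenced by morally irrelevant situational factors ("moral distractors"). The authors curate a multimodal dataset of 60 moral distractors from psychological datasets and inject them into existing moral benchmarks, finding that the distractors can shift LLM moral judgements by over 30%, even in low-ambiguity scenarios.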
Details
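Motivation: As LLMs are deployed widely in high-stakes settings, ensuring that their behavior aligns with human values is critical. Existing moral benchmarks assume that LLMs report relatively stable moral preferences, yet moral psychology research shows that human judgements are sensitive to morally irrelevant factors such as ambient smells or noise. The paper evaluates whether LLMs exhibit similar cognitive moral biases.
Result: After morally irrelevant distractors are injected into existing moral benchmarks, LLM moral judgements shift by more than 30%, which highlights the instability of LLM moral judgement.
Insight: The paper brings the "situationist" view from moral psychology to LLM evaluation and systematically tests how morally irrelevant distractors affect LLM judgements. This challenges the assumption that LLMs hold stable moral preferences and underscores the need for more contextual moral evaluations and more nuanced cognitive moral modeling.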
Abstract: With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this “situationist” view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 “moral distractors” from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.
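[7] Breaking the Pre-Sampling Barrier: Activation-Informed Difficulty-Aware Self-Consistency cs.CL
Taewoong Yoon, Geunyeong Jeong, Geon Park, Sihyeong Yeom, Harksoo Kim
TL;DR: This paper proposes ACTSC (Activation-Informed Difficulty-Aware Self-Consistency), a method for reducing the inference cost of self-consistency decoding. It uses internal difficulty signals reflected in feed-forward network neuron activations to build a lightweight difficulty estimation probe that dynamically adjusts the number of samples, without additional token generation or model calls, and it can be applied to new datasets without pre-sampling.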
Details
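Motivation: Self-consistency decoding improves LLM reasoning performance but is expensive because it requires many samples. Existing difficulty-adaptive self-consistency methods reduce token usage on easy problems, but they need additional model calls and pre-sampling to estimate difficulty, which adds significant computational overhead.
Result: Experiments on five benchmarks show that ACTSC reduces inference cost while maintaining accuracy comparable to existing methods.
Insight: The novelty lies in using internal activation signals (feed-forward network neuron activations) as a lightweight difficulty probe, which avoids pre-sampling and extra model calls, adjusts the sample count dynamically, and transfers to new datasets. More broadly, the method couples internal model representations with the decoding strategy, offering a new route to cheaper self-consistency.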
Abstract: Self-Consistency (SC) is an effective decoding strategy that improves the reasoning performance of Large Language Models (LLMs) by generating multiple chain-of-thought reasoning paths and selecting the final answer via majority voting. However, it suffers from substantial inference costs because it requires a large number of samples. To mitigate this issue, Difficulty-Adaptive Self-Consistency (DSC) was proposed to reduce unnecessary token usage for easy problems by adjusting the number of samples according to problem difficulty. However, DSC requires additional model calls and pre-sampling to estimate difficulty, and this process is repeated when applying to each dataset, leading to significant computational overhead. In this work, we propose Activation-Informed Difficulty-Aware Self-Consistency (ACTSC) to address these limitations. ACTSC leverages internal difficulty signals reflected in the feed-forward network neuron activations to construct a lightweight difficulty estimation probe, without any additional token generation or model calls. The probe dynamically adjusts the number of samples for SC and can be applied to new datasets without requiring pre-sampling for difficulty estimation. To validate its effectiveness, we conduct experiments on five benchmarks. Experimental results show that ACTSC effectively reduces inference costs while maintaining accuracy relative to existing methods.
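To illustrate how a difficulty score could drive the sample budget for self-consistency, here is a rough sketch. The probe over feed-forward activations is not modeled: the difficulty score is simply an input in [0, 1], and the linear mapping, the bounds `min_n` and `max_n`, and both function names are assumptions made for this example rather than details from the paper.

```python
from collections import Counter


def num_samples(difficulty: float, min_n: int = 1, max_n: int = 16) -> int:
    """Map a difficulty score in [0, 1] to a sample count between min_n and max_n.

    Scores outside [0, 1] are clamped. The linear mapping is an assumption.
    """
    clamped = min(1.0, max(0.0, difficulty))
    return min_n + round(clamped * (max_n - min_n))


def majority_vote(answers: list[str]) -> str:
    """Select the most frequent final answer, as in standard self-consistency."""
    return Counter(answers).most_common(1)[0][0]


# An easy question gets a single sample; a hard one gets the full budget.
easy_budget = num_samples(0.0)
hard_budget = num_samples(1.0)
final_answer = majority_vote(["42", "41", "42"])
```

The saving comes from the easy cases: when the score is low, most of the sampling budget is never spent.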
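[8] Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts cs.CL | cs.AI
Shweta Parihar, Lu Cheng
TL;DR: This paper evaluates social bias in retrieval-augmented generation (RAG) systems. It finds that adding external context helps reduce bias, whereas adding Chain-of-Thought (CoT) prompting improves accuracy but increases bias, revealing a trade-off between bias and accuracy.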
Details
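Motivation: To address the social biases inherent in large language models (LLMs) and to examine how RAG architectures affect bias when they draw on external knowledge, with the goal of understanding how to improve fairness.
Result: Extensive experiments on datasets covering more than 13 bias types show that RAG reduces bias, while CoT improves accuracy but raises overall bias.
Insight: The paper systematically evaluates the effect of RAG on bias and finds that external context can counteract stereotypes, whereas reasoning processes such as CoT can amplify bias, underscoring the need for bias-aware reasoning frameworks.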
Abstract: Social biases inherent in large language models (LLMs) raise significant fairness concerns. Retrieval-Augmented Generation (RAG) architectures, which retrieve external knowledge sources to enhance the generative capabilities of LLMs, remain susceptible to the same bias-related challenges. This work focuses on evaluating and understanding the social bias implications of RAG. Through extensive experiments across various retrieval corpora, LLMs, and bias evaluation datasets, encompassing more than 13 different bias types, we surprisingly observe a reduction in bias in RAG. This suggests that the inclusion of external context can help counteract stereotype-driven predictions, potentially improving fairness by diversifying the contextual grounding of the model’s outputs. To better understand this phenomenon, we then explore the model’s reasoning process by integrating Chain-of-Thought (CoT) prompting into RAG while assessing the faithfulness of the model’s CoT. Our experiments reveal that the model’s bias inclinations shift between stereotype and anti-stereotype responses as more contextual information is incorporated from the retrieved documents. Interestingly, we find that while CoT enhances accuracy, contrary to the bias reduction observed with RAG, it increases overall bias across datasets, highlighting the need for bias-aware reasoning frameworks that can mitigate this trade-off.
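[9] Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models cs.CL
Hikaru Asano, Tadashi Kozuno, Kuniaki Saito, Yukino Baba
TL;DR: This paper proposes Gt-Margin, a position-wise score derived from ground-truth tokens, to guide the decoding order of masked diffusion language models (MDLMs). Gt-Margin is the probability margin between the correct token and its strongest alternative, and it yields an oracle unmasking order that handles easier positions first. The authors then train a supervised unmasking planner to imitate this order, improving MDLM generation quality on reasoning tasks (particularly logical reasoning benchmarks) without modifying the token prediction model.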
Details
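Motivation: At inference time, the unmasking order (where-to-unmask) of an MDLM is usually decided by heuristic confidence measures or by costly reinforcement learning. The paper seeks a more effective, learnable strategy for choosing that order.
Result: Using the oracle unmasking order significantly improves final generation quality, particularly on logical reasoning benchmarks. The trained supervised planner, integrated into standard MDLM sampling, improves reasoning accuracy without modifying the underlying token prediction model.
Insight: The core contribution is Gt-Margin, a position-wise score derived from ground-truth tokens that defines an oracle unmasking order prioritizing easier positions. It provides an effective supervision signal, so a planner can imitate the order via learning-to-rank and improve MDLM reasoning performance at low cost.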
Abstract: Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or trained through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
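As a worked illustration of the Gt-Margin definition (the probability of the correct token minus that of its strongest alternative), the sketch below ranks masked positions for a single partially masked state. The function names, the toy three-token vocabulary, and the probabilities are hypothetical; in the setting the abstract describes, the ranking would be recomputed at each state as more positions are unmasked.

```python
def gt_margin(probs: list[float], gold: int) -> float:
    """Margin between the ground-truth token's probability and the best other token's."""
    best_other = max(p for i, p in enumerate(probs) if i != gold)
    return probs[gold] - best_other


def oracle_order(position_probs: dict[int, list[float]], gold_tokens: dict[int, int]) -> list[int]:
    """Rank masked positions from easiest (largest margin) to hardest."""
    margins = {pos: gt_margin(position_probs[pos], gold_tokens[pos]) for pos in position_probs}
    return sorted(margins, key=lambda pos: margins[pos], reverse=True)


# Toy predictive distributions over a 3-token vocabulary at three masked positions.
position_probs = {
    0: [0.7, 0.2, 0.1],  # gold token 0: margin 0.5
    1: [0.4, 0.5, 0.1],  # gold token 0: margin -0.1 (model prefers a wrong token)
    2: [0.1, 0.1, 0.8],  # gold token 2: margin 0.7
}
gold_tokens = {0: 0, 1: 0, 2: 2}

order = oracle_order(position_probs, gold_tokens)
```

A negative margin marks a position where the model currently prefers a wrong token, so the oracle defers it until more context has been filled in.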
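[10] The CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking cs.CL
Julia Maria Struß, Sebastian Schellhammer, Stefan Dietze, Venktesh V, Vinay Setty
TL;DR: The CLEF-2026 CheckThat! lab aims to advance multilingual fact-checking through three core tasks: Task 1, source retrieval for scientific web claims (a follow-up of the 2025 edition); Task 2, fact-checking numerical and temporal claims (with an added reasoning component); and Task 3, generating full fact-checking articles, which extends the verification pipeline.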
Details
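Motivation: The lab aims to develop innovative technologies that counter disinformation and manipulation in online communication across many languages and platforms, and to broaden the coverage of the verification pipeline.
Result: The abstract reports no quantitative results or benchmark performance. It mainly describes the task setup: the tasks pose classification, retrieval, and generation challenges at the document and span level.
Insight: The lab systematizes the fact-checking pipeline and introduces concrete tasks for scientific claims, numerical and temporal reasoning, and full article generation. It emphasizes the combined challenges of multilingual settings and offers a structured framework for building end-to-end fact-checking systems.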
Abstract: The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While in early editions the focus has been on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), in the past three editions, the lab added additional tasks linked to the verification process. In this year’s edition, the verification pipeline is at the center again with the following tasks: Task 1 on source retrieval for scientific web claims (a follow-up of the 2025 edition), Task 2 on fact-checking numerical and temporal claims, which adds a reasoning component to the 2025 edition, and Task 3, which expands the verification pipeline with generation of full-fact-checking articles. These tasks represent challenging classification and retrieval problems as well as generation challenges at the document and span level, including multilingual settings.
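[11] Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models cs.CL
Sangwon Yu, Ik-hwan Kim, Donghun Kang, Bongkyu Hwang, Junhwa Choi
TL;DR: This paper identifies Knowledge Integration Decay in the search-augmented reasoning of large language models: as the reasoning chain grows, models increasingly fail to make use of retrieved external knowledge. It proposes Self-Anchored Knowledge Encoding, a training-free method that anchors retrieved knowledge at both the beginning and the end of the reasoning process to improve knowledge integration.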
Details
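Motivation: To address the knowledge integration decay bottleneck in long-chain search-augmented reasoning, where retrieved external knowledge is increasingly ignored or forgotten as the number of reasoning steps grows.
Result: Experiments on multi-hop QA and complex reasoning benchmarks show that the method significantly mitigates knowledge integration decay and improves model performance.
Insight: The paper identifies the knowledge integration decay phenomenon and proposes a lightweight, training-free inference strategy, Self-Anchored Knowledge Encoding, that anchors retrieved knowledge at the beginning and end of reasoning to preserve its semantic integrity, offering an effective approach to knowledge integration in agentic LLMs.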
Abstract: Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that as the length of reasoning generated before search grows, models increasingly fail to integrate retrieved evidence into subsequent reasoning steps, limiting performance even when relevant information is available. To address this, we propose Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. By anchoring retrieved knowledge at both the beginning and end of the reasoning process, SAKE prevents it from being overshadowed by prior context, thereby preserving its semantic integrity. Extensive experiments on multi-hop QA and complex reasoning benchmarks demonstrate that SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.
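One literal reading of "anchoring retrieved knowledge at both the beginning and end of the reasoning process" is a context builder that places the same evidence block before and after the reasoning generated so far. The sketch below shows only that reading. The function name, the section labels, and the prompt layout are assumptions made for illustration; the abstract does not specify how SAKE encodes the anchors.

```python
def anchor_knowledge(question: str, reasoning_so_far: str, retrieved: list[str]) -> str:
    """Build a context with the retrieved evidence at both the start and the end.

    The layout and labels are hypothetical, chosen only to illustrate the idea.
    """
    evidence = "\n".join(f"- {doc}" for doc in retrieved)
    block = f"[Retrieved evidence]\n{evidence}"
    return "\n\n".join([block, f"Question: {question}", reasoning_so_far, block])


# Placeholder strings stand in for a real question, reasoning trace, and documents.
prompt = anchor_knowledge(
    "Who wrote X?",
    "Step 1: ...",
    ["Doc A says ...", "Doc B says ..."],
)
```

Repeating the evidence after a long reasoning prefix keeps it adjacent to the next generation step, which is the position where the abstract says integration tends to fail.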
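[12] Advancing Block Diffusion Language Models for Test-Time Scaling cs.CL
Yi Lu, Deyang Kong, Jianing Wang, Linsen Guo, Xue Wang
TL;DR: This paper proposes a unified framework for test-time scaling in block diffusion language models (BDLMs). It introduces adaptivity in both decoding and block-wise generation to balance decoding speed and effectiveness in long chain-of-thought reasoning.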
Details
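Motivation: BDLMs have seen limited exploration under the test-time scaling setting, and in long chain-of-thought reasoning they struggle to balance decoding speed and effectiveness, which calls for more efficient adaptive methods.
Result: Applying the proposed methods (BACD and TCCF) to TDAR-8B yields a 2.26x speedup and an 11.2-point improvement on AIME24 over the TraDo-8B baseline.
Insight: The contributions are Bounded Adaptive Confidence Decoding (BACD), a dynamic denoising strategy, and Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm. Progressive Block Size Extension mitigates the performance degradation that comes with larger blocks, giving a balance between efficiency and effectiveness.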
Abstract: Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
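[13] On the Optimal Reasoning Length for RL-Trained Language Models cs.CL | cs.AI | cs.LG
Daisuke Nohara, Taishi Nakamura, Rio Yokota
TL;DR: This paper studies the optimal output length for reinforcement-learning-trained language models on reasoning tasks. Comparing several length control methods on Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B, it finds that length penalties may hinder the acquisition of reasoning ability, while properly tuned length control can improve efficiency for models with strong prior reasoning.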
Details
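Motivation: Reinforcement learning improves the reasoning of large language models but tends to lengthen chain-of-thought outputs, raising compute cost in both training and inference. It remains unclear what output length best balances efficiency and performance under existing length control methods.
Result: Experiments on Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency. The study also identifies two failure modes: long outputs increase dispersion, and short outputs lead to under-thinking.
Insight: The paper extends the study of length control to RL-trained policies and identifies two failure modes: dispersion from long outputs and under-thinking from short outputs. It provides empirical grounding for optimizing reasoning length under reinforcement learning and highlights the importance of tailoring length control to a model's prior reasoning ability.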
Abstract: Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain of thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL trained policies, we identify two failure modes, 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
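[14] Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning cs.CL
Qiao Liang, Yuke Zhu, Chao Ge, Lei Yang, Ying Shen
TL;DR: This paper proposes Error-Localized Policy Optimization (ELPO) to address the sparse, delayed rewards and weak step-level credit assignment that LLM agents face in tool-integrated reasoning (TIR). ELPO localizes the first irrecoverable step with binary-search rollout trees, converts the tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix.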
Details
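Motivation: In TIR, outcome-only reinforcement learning performs poorly because rewards are sparse and delayed and credit assignment is weak. In long-horizon trajectories an early irrecoverable mistake can decide success or failure, so such mistakes need to be localized accurately and used for fine-grained optimization.
Result: Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency.
Insight: ELPO systematically localizes the first irrecoverable step in a trajectory and uses it for credit assignment. Binary-search rollout trees and hierarchical advantage attribution turn sparse failure signals into stable learning targets, and adaptive clipping concentrates optimization on the critical error region, offering a new approach to error correction in long-horizon decision making.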
Abstract: Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
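The binary-search idea behind localizing the first irrecoverable step can be sketched as follows, under a simplifying assumption that recoverability is monotone: once a prefix cannot be completed successfully, no longer prefix can be either. The predicate `recoverable` is a hypothetical stand-in for "some rollout continued from the first k steps succeeds"; the rollout trees, the fixed budget, and the noise of sampled rollouts described in the abstract are not modeled here.

```python
from typing import Callable, Optional


def first_irrecoverable_step(num_steps: int, recoverable: Callable[[int], bool]) -> Optional[int]:
    """Return the 1-based index of the first step after which recovery fails.

    Assumes recoverable(0) is True (the task is solvable from scratch) and that
    recoverability is monotone in the prefix length. Returns None when the full
    trajectory is still recoverable.
    """
    if recoverable(num_steps):
        return None
    lo, hi = 0, num_steps  # invariant: recoverable(lo) is True, recoverable(hi) is False
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if recoverable(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Toy trajectory of 8 steps where step 5 is the fatal mistake.
calls: list[int] = []


def toy_recoverable(k: int) -> bool:
    calls.append(k)  # record each probe to show how few are needed
    return k < 5


fatal_step = first_irrecoverable_step(8, toy_recoverable)
```

With 8 steps the search probes 4 prefixes instead of all 8, which is the kind of saving that matters when each probe costs one or more rollouts.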
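[15] MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering cs.CL | cs.AI
Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do
TL;DR: This paper proposes MATA, a multi-agent framework for table question answering (TableQA) that targets the reliability, scalability, and efficiency challenges of existing large language models (LLMs). MATA generates candidate answers through multiple complementary reasoning paths, refines or selects the best answer with a set of tools built on small language models, and uses an algorithm to reduce calls to expensive LLMs. Experiments on two benchmarks with ten LLMs show state-of-the-art accuracy and efficient reasoning.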
Details
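Motivation: Although LLMs have advanced table understanding, ensuring reliability, scalability, and efficiency remains difficult in resource-constrained or privacy-sensitive environments.
Result: Extensive experiments on two benchmarks of varying difficulty with ten different LLMs show that MATA achieves state-of-the-art accuracy and efficient reasoning while avoiding excessive LLM inference.
Insight: A multi-agent framework orchestrates multiple complementary reasoning paths, uses small-language-model tools to improve reliability and efficiency, and includes an algorithm that limits the cost of LLM calls, resulting in a scalable and reliable TableQA system.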
Abstract: Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDAS-Lab/MATA.
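[16] TraceMem: Weaving Narrative Memory Schemata from User Conversational Traces cs.CL
Yiming Shu, Pei Liu, Tiange Zhang, Ruiyang Gao, Jun Ma
TL;DR: This paper proposes TraceMem, a cognitively inspired framework that builds structured narrative memory schemata from user conversational traces, addressing the difficulty LLMs have in managing dialogue histories within limited context windows over long-term interactions. A three-stage pipeline (Short-term Memory Processing, Synaptic Memory Consolidation, Systems Memory Consolidation) organizes dialogue into coherent, time-evolving narrative threads that are encapsulated into structured user memory cards. On the LoCoMo benchmark, TraceMem achieves state-of-the-art performance and does especially well on multi-hop and temporal reasoning.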
Details
Motivation: 解决大语言模型在长期交互中的瓶颈,即有限上下文窗口难以管理随时间延伸的对话历史,且现有记忆系统常将交互视为不连贯的片段,无法捕捉对话流的底层叙事连贯性。
Result: 在LoCoMo基准测试中,TraceMem实现了最先进的性能,通过构建连贯叙事,在多跳和时间推理方面超越基线模型,突显其在深度叙事理解中的关键作用。
Insight: 创新点包括:受认知启发的三阶段叙事记忆构建流程(短期记忆处理、突触记忆巩固、系统记忆巩固),将对话组织成连贯的叙事线程;以及代理搜索机制以增强推理过程。从客观角度看,该方法通过分层聚类和主题统一,有效模拟了人类记忆的叙事结构,提升了长期对话的连贯性和推理能力。
Abstract: Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively-inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three-stage pipeline: (1) Short-term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representation; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user-specific traces; and (3) Systems Memory Consolidation, which utilizes two-stage hierarchical clustering to organize these traces into coherent, time-evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state-of-the-art performance with a brain-inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi-hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: https://github.com/YimingShu-teay/TraceMem
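[17] Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs cs.CL
Longhuan Xu, Cunjian Chen, Feng Yin
TL;DR: This paper proposes an unsupervised, layer-wise dynamic test-time adaptation (TTA) method for large language models. A lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers that finely control a TTA process updating only LoRA parameters, addressing the overfitting and performance degradation that a fixed learning rate causes in single-sample unsupervised TTA.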
Details
Motivation: 解决大语言模型在无监督、样本特定的测试时适应中,由于固定手工学习率导致的更新不稳定、过拟合提示特定统计量以及生成质量下降的问题。
Result: 在多个数据集和LLM上的实验表明,该方法通过学习适应步骤和Transformer层投影的有效缩放模式,显著增强了TTA的稳定性并提升了性能。
Insight: 创新点在于将TTA强度明确建模为提示表示、LLM结构和适应步骤的函数,通过超网络实现细粒度的层间和步间动态学习率调控,为少样本无监督在线适应提供了更稳定有效的框架。
Abstract: Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
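[18] AI-Assisted Scientific Assessment: A Case Study on Climate Change cs.CL
Christian Buck, Levke Caesar, Michelle Chen Huebscher, Massimiliano Ciaramita, Erich M. Fischer
TL;DR: This study evaluates a Gemini-based AI-assisted environment applied to climate science. Thirteen scientists collaborated on the question of the stability of the Atlantic Meridional Overturning Circulation (AMOC). The AI accelerated the scientific workflow, with a synthesis of 79 papers completed in just over 46 person-hours, but expert oversight remained essential for scientific rigor.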
Details
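Motivation: To examine whether AI is suitable for scientific assessment tasks where repeated verification is impossible and ground truth rests on a consensus synthesis of theory and evidence, and to explore how AI can support collaborative scientific assessment.
Result: In the climate science case study, the AI-assisted system helped the team synthesize 79 papers through 104 revision cycles. Most AI-generated content was retained, but expert contributions made up more than half of the report, and substantial oversight was needed to reach scientific standards.
Insight: AI is integrated into a standard scientific workflow to handle a complex, consensus-driven question. The study shows that AI can improve efficiency and logical consistency while expert oversight remains indispensable, providing an empirical case for the role of AI in scientific assessment.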
Abstract: The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in ‘guess and check’ loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.
[19] Targum – A Multilingual New Testament Translation Corpus cs.CLPDF
Maciej Rapacz, Aleksander Smywiński-Pohl
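TL;DR: This paper introduces Targum, a multilingual corpus of 657 New Testament translations, 352 of them unique, with unprecedented depth in five languages: English, French, Italian, Polish, and Spanish. The corpus is aggregated from 12 online biblical libraries and one preexisting corpus, and each translation is manually annotated with metadata covering the work identifier, edition, and year of revision. The resource supports flexible multilevel analysis, from micro-level studies of translation families to macro-level deduplication, and sets a new benchmark for the quantitative study of translation history.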
Details
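Motivation: Existing Bible translation corpora tend to prioritize linguistic breadth over the depth of translation history. The rich translation variants in European languages in particular are poorly captured, so a dedicated resource that supports deep, multilevel analysis is needed.
Result: The corpus contains 657 New Testament translations (352 unique versions) and reaches unprecedented depth in English (208 unique versions), French (41), Italian (18), Polish (30), and Spanish (55), establishing a new benchmark for the quantitative study of translation history.
Insight: Manually annotated, standardized metadata (work ID, edition, year) canonicalizes the translations and lets researchers define "uniqueness" for their own needs. This supports flexible analysis from the micro level (for example, the KJV lineage) to the macro level (deduplication), and provides the first corpus designed for multilevel translation analysis.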
Abstract: Many European languages possess rich biblical translation histories, yet existing corpora, in prioritizing linguistic breadth, often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define “uniqueness” for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first resource designed for such flexible, multilevel analysis, our corpus establishes a new benchmark for the quantitative study of translation history.
[20] Decomposing Reasoning Efficiency in Large Language Models cs.CL | cs.AI | cs.LGPDF
Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud
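TL;DR: This paper proposes a trace-optional framework that decomposes the token efficiency of large language models on reasoning tasks into interpretable factors: completion under a fixed token budget, conditional correctness given completion, and verbosity. An evaluation of 25 models shows that accuracy and token-efficiency rankings diverge, that efficiency gaps are often driven by conditional correctness, and that verbalization overhead varies widely.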
Details
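Motivation: Standard evaluations report only final accuracy, which hides where tokens are spent or wasted during reasoning. A method is needed to decompose and explain token efficiency and expose where models are inefficient.
Result: Across 25 models on the CogniLoad benchmark, the Spearman correlation between accuracy and token-efficiency rankings is 0.63. Efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9x and is only weakly related to model scale.
Insight: The contributions are a trace-optional framework that decomposes token efficiency into interpretable factors, and deterministic trace-quality measures (grounding, repetition, prompt copying) that separate degenerate looping from verbose-but-engaged reasoning without human labeling or LLM judges. The decomposition reveals distinct bottleneck profiles that can guide efficiency interventions.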
Abstract: Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $ρ=0.63$), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
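A rough sketch of the three-factor decomposition follows. It assumes one plausible reading of the abstract: truncated runs count as incorrect, so accuracy is completion rate times conditional correctness, and efficiency is expressed as correct answers per 1,000 generated tokens. The paper's exact definitions may differ, and the `Run` record and `decompose` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Run:
    tokens: int      # tokens generated for this instance
    completed: bool  # finished within the token budget (not truncated)
    correct: bool    # final answer correct; treated as False when truncated


def decompose(runs):
    """Split accuracy-per-token into completion, conditional correctness, verbosity."""
    n = len(runs)
    done = [r for r in runs if r.completed]
    completion = len(done) / n
    # Correctness is measured only over runs that actually finished.
    cond_correct = sum(r.correct for r in done) / len(done) if done else 0.0
    mean_tokens = sum(r.tokens for r in runs) / n
    accuracy = completion * cond_correct
    return {
        "completion": completion,
        "conditional_correctness": cond_correct,
        "mean_tokens": mean_tokens,
        "accuracy": accuracy,
        "correct_per_1k_tokens": 1000 * accuracy / mean_tokens,
    }


runs = [Run(100, True, True), Run(200, True, False),
        Run(300, False, False), Run(400, True, True)]
report = decompose(runs)
```

Two models with the same accuracy can then differ in which factor limits them: one may truncate often (low completion), another may finish but answer wrongly (low conditional correctness), and a third may simply spend more tokens.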
[21] AnalyticsGPT: An LLM Workflow for Scientometric Question Answering cs.CL | cs.DLPDF
Khang Ly, Georgios Cheirmpos, Adrian Raudaschl, Christopher James, Seyed Amin Tabatabaei
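TL;DR: This paper presents AnalyticsGPT, an LLM-based workflow for scientometric question answering. The system handles the task end to end using retrieval-augmented generation and agentic concepts, recognizes academic entities, integrates multi-faceted data, and produces structured analytical reports.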
Details
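Motivation: Scientometric question answering, a subcategory of meta-scientific questions, poses distinct challenges in the planning phase, such as named-entity recognition of academic entities and multi-faceted data retrieval. LLMs have shown potential in task decomposition and reasoning, which motivates exploring their use in this niche downstream task.
Result: The evaluation consults subject matter experts and uses LLMs-as-judges, yielding insights on the efficacy of LLMs for scientometric question answering. The abstract does not report specific quantitative results or benchmark performance.
Insight: The novelty lies in combining LLMs with retrieval-augmented generation and agentic concepts in an end-to-end workflow built specifically for scientometric question answering, using a proprietary research performance assessment platform as the database, and in exploring LLMs-as-judges for evaluation.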
Abstract: This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the “science of science.” When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase. Namely, the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g. impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition and planning and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: https://github.com/lyvykhang/llm-agents-scientometric-qa/tree/acl.
[22] Text summarization via global structure awareness cs.CL | cs.AIPDF
Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Yibei Liu, Chenghao Li
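TL;DR: This paper proposes GloSA-sum, a text summarization method that achieves global structure awareness via topological data analysis (TDA). It builds a semantic-weighted graph, uses persistent homology to identify core semantics and logical structures, and keeps them in a "protection pool" that serves as the summary backbone. A topology-guided iterative strategy uses lightweight proxy metrics to approximate sentence importance for efficiency, and a hierarchical strategy improves long-text processing. Experiments show that the method reduces redundancy while preserving semantic and logical integrity, balances accuracy and efficiency, and benefits downstream LLM tasks by shortening contexts while retaining essential reasoning chains.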
Details
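Motivation: Existing summarization research focuses mainly on model improvements and sentence-level pruning but often overlooks global structure, which disrupts coherence and weakens downstream performance. LLM-based approaches are more accurate but incur substantial resource and time costs.
Result: Experiments on multiple datasets show that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency.
Insight: The authors present this as the first application of topological data analysis (TDA) to text summarization for global structure awareness. Persistent homology identifies semantic cores and logical dependencies, and the topology-guided iterative strategy and hierarchical processing address efficiency and long texts together.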
Abstract: Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a “protection pool” as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
[23] LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse cs.CLPDF
Bakhtawar Ahtisham, Kirk Vanacore, Zhuqian Zhou, Jinsook Lee, Rene F. Kizilcec
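TL;DR: This study examines whether reasoning generated by a large language model (LLM) can be used to predict whether its own prediction is correct. Analyzing 30,300 teacher utterances from classroom dialogue, it finds that TF-IDF-encoded reasoning features can train classifiers (for example, a Random Forest with an F1 score of 0.83) to identify incorrect predictions, and that specialist detectors for specific instructional moves improve performance further.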
Details
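Motivation: Current pipelines that use LLMs to label and analyze educational dialogue automatically lack reliable ways to detect when the model is wrong. The study asks whether model-generated reasoning can predict the correctness of the model's own predictions.
Result: On a dataset with human-verified ground-truth labels, a Random Forest classifier trained on TF-IDF-encoded reasoning achieves an F1 score of 0.83 (recall 0.854), identifying most incorrect predictions and outperforming baselines. Specialist detectors for specific instructional moves further improve performance on difficult constructs.
Insight: The paper proposes a scalable, reasoning-based error detection approach for quality control in automated educational dialogue analysis. Correct predictions tend to use grounded causal language (e.g., because, therefore), while incorrect reasoning relies more on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize), which gives interpretable linguistic cues for error detection.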
Abstract: Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model’s own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model’s assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
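The linguistic-marker analysis can be illustrated with a small counting sketch. The word lists below are only the example words quoted in the abstract, not the LIWC dictionaries, and pairing hedging with Tentativeness and metacognition with Insight is an assumption made here for illustration. The `MARKERS` table and `marker_rates` function are hypothetical names.

```python
import re

# Tiny stand-in lexicons built from the abstract's examples (not the LIWC lists).
MARKERS = {
    "causation": {"because", "therefore"},
    "tentativeness": {"might", "could"},
    "insight": {"think", "realize"},
}


def marker_rates(reasoning):
    """Return the fraction of tokens in `reasoning` that fall in each marker set."""
    tokens = re.findall(r"[a-z']+", reasoning.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        name: sum(token in words for token in tokens) / total
        for name, words in MARKERS.items()
    }


grounded = marker_rates(
    "The teacher asks why, therefore this is a probe because it seeks reasons."
)
hedged = marker_rates("I think this might be a probe, but it could be a cue.")
```

Rates like these could serve as extra features next to a TF-IDF encoding when training an error detector, although the abstract only reports using them for analysis.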
[24] Steer2Edit: From Activation Steering to Component-Level Editing cs.CLPDF
Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng
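TL;DR: The paper proposes Steer2Edit, a framework that turns semantic directions in an LLM's hidden representations (steering vectors) from inference-time intervention signals into diagnostic signals for weight editing of model components (attention heads and MLP neurons), giving finer control over model behavior while preserving performance.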
Details
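Motivation: Existing activation-steering methods apply a fixed, global modification to the model's internal states at inference time. They ignore the fact that many behaviors are governed by a small, heterogeneous subset of components, which leads to unfavorable attribute-utility trade-offs under strong control.
Result: Across safety alignment, hallucination mitigation, and reasoning efficiency, at matched downstream performance, Steer2Edit improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average.
Insight: The core contribution is translating steering vectors into training-free, component-level rank-1 weight edits, which provides a principled bridge from representation steering to parameter updates. Selectively redistributing behavioral influence yields more interpretable edits, and the method preserves the standard forward pass and stays compatible with optimized parallel inference.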
Abstract: Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model’s internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
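To make "rank-1 weight editing" concrete, the sketch below applies a generic rank-1 update W' = W + alpha * u v^T to a small weight matrix. How Steer2Edit chooses the directions and strengths for each attention head or MLP neuron is not described in the abstract, so `u`, `v`, and `alpha` here are hypothetical inputs.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank1_edit(W, u, v, alpha):
    """Return W + alpha * outer(u, v) without modifying W."""
    return [
        [w + alpha * ui * vj for w, vj in zip(row, v)]
        for row, ui in zip(W, u)
    ]


W = [[1.0, 0.0],
     [0.0, 1.0]]
u = [1.0, 0.0]   # output direction to push along (hypothetical steering direction)
v = [0.0, 1.0]   # input direction that triggers the push (hypothetical)
W_edited = rank1_edit(W, u, v, alpha=2.0)

x = [3.0, 4.0]
before = matvec(W, x)        # unedited output
after = matvec(W_edited, x)  # output shifted by alpha * (v . x) * u
```

Because the change lives in the weights, the forward pass itself is unchanged, which is consistent with the abstract's point about compatibility with optimized inference.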
[25] LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations cs.CL | cs.AI | cs.LGPDF
William Lugoloobi, Thomas Foster, William Bankes, Chris Russell
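TL;DR: This paper studies whether an LLM's internal activations before generation can predict its probability of success on math and coding tasks, and whether that signal can guide more efficient inference. Models are found to encode a model-specific notion of difficulty that differs from human difficulty, and linear probes on these activations substantially outperform surface-feature baselines. Routing based on the predictions reduces inference cost by up to 70% on MATH while exceeding the best single model.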
Details
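Motivation: Running LLMs with extended reasoning on every problem is computationally expensive. The goal is to test whether a model's likelihood of success can be recovered from its internal representations before generation and used to guide more efficient inference.
Result: On math and coding tasks, linear probes trained on pre-generation activations substantially outperform surface features such as question length and TF-IDF at predicting model-specific success. On E2H-AMC, the difficulty encoded by models is shown to differ from human difficulty, and the difference grows with extended reasoning. Routing experiments on MATH reduce inference cost by up to 70% while exceeding the best single model.
Insight: Pre-generation activations carry a signal about the model's own likelihood of success, giving a model-specific difficulty measure that diverges from human intuition. Using this signal for lightweight probing and query routing is a practical route to more compute-efficient LLM inference.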
Abstract: Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
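A minimal sketch of probe-based routing is shown below. It assumes each model in the pool has a linear probe that maps a pre-generation activation to a success probability, and it uses a simple illustrative policy: pick the cheapest model whose predicted success clears a threshold. The routing policy, the threshold, and all names are assumptions, not the authors' method.

```python
import math


def probe_success(activation, weights, bias):
    """Linear probe: sigmoid(w . h + b), read as a predicted success probability."""
    z = sum(w * h for w, h in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def route(candidates, threshold=0.7):
    """Choose a model name from (name, predicted_success, cost) tuples.

    Prefer the cheapest model predicted to succeed; if none clears the
    threshold, fall back to the model with the highest predicted success.
    """
    confident = [c for c in candidates if c[1] >= threshold]
    if confident:
        return min(confident, key=lambda c: c[2])[0]
    return max(candidates, key=lambda c: c[1])[0]


easy = route([("small", 0.90, 1.0), ("large", 0.95, 10.0)])
hard = route([("small", 0.20, 1.0), ("large", 0.60, 10.0)])
neutral = probe_success([0.0, 0.0], [1.0, 1.0], 0.0)
```

Under this policy, easy queries go to the cheap model and only hard ones pay for the expensive model, which is the kind of saving the abstract reports.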
[26] ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning cs.CLPDF
Shuaiyi Nie, Siyu Ding, Wenyuan Zhang, Linhao Yu, Tianmeng Yang
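TL;DR: This paper proposes ATTNPO, a low-overhead process-supervised reinforcement learning framework that targets the redundant reasoning large reasoning models produce when they overthink on complex tasks. It uses the model's intrinsic attention signals for step-level credit assignment: a set of special attention heads separates essential steps from redundant ones, and two sub-strategies discourage redundant steps while protecting accuracy on essential ones.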
Details
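Motivation: Reasoning models trained with reinforcement learning and verifiable rewards perform strongly on complex tasks but often overthink, producing redundant reasoning that hurts efficiency. Trajectory-level length penalties often fail to shorten reasoning and degrade accuracy because they treat all steps uniformly and lack fine-grained signals to tell redundancy from necessity. Process-supervised methods, meanwhile, are typically resource-intensive and suffer from inaccurate credit assignment.
Result: Experiments show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
Insight: The novelty lies in using the model's own attention as a low-cost, fine-grained process supervision signal for step-level credit assignment, which separates and suppresses redundant steps while protecting essential ones. This offers an efficient, resource-friendly way to mitigate overthinking in large reasoning models.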
Abstract: Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to effectively shorten reasoning length and degrade accuracy, as they uniformly treat all reasoning steps and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model’s intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. By leveraging the attention scores of these heads, we then employ two sub-strategies to mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
[27] ViMultiChoice: Toward a Method That Gives Explanation for Multiple-Choice Reading Comprehension in Vietnamese cs.CLPDF
Trung Tien Cao, Lam Minh Thai, Nghia Hieu Nguyen, Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen
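TL;DR: This paper proposes ViMultiChoice, a method designed for Vietnamese reading comprehension that jointly predicts the correct answer and generates a corresponding explanation. The authors also introduce a new Vietnamese dataset for training and evaluating multiple-choice reading comprehension models with explanation generation. ViMultiChoice achieves state-of-the-art performance on both the ViMMRC 2.0 benchmark and the new dataset.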
Details
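Motivation: Existing multiple-choice reading comprehension models typically cannot explain the reasoning behind their choices. The paper addresses this gap, particularly for Vietnamese, by developing a model that generates explanations.
Result: ViMultiChoice outperforms existing baselines and achieves state-of-the-art performance on both the ViMMRC 2.0 benchmark and the newly introduced dataset. Jointly training option decision and explanation generation significantly improves multiple-choice accuracy.
Insight: The novelty lies in a joint model for Vietnamese that combines answer prediction with explanation generation, supported by a new dataset, which opens a direction for explainability research in multiple-choice reading comprehension.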
Abstract: Multiple-choice Reading Comprehension (MCRC) models aim to select the correct answer from a set of candidate options for a given question. However, they typically lack the ability to explain the reasoning behind their choices. In this paper, we introduce a novel Vietnamese dataset designed to train and evaluate MCRC models with explanation generation capabilities. Furthermore, we propose ViMultiChoice, a new method specifically designed for modeling Vietnamese reading comprehension that jointly predicts the correct answer and generates a corresponding explanation. Experimental results demonstrate that ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art (SotA) performance on both the ViMMRC 2.0 benchmark and the newly introduced dataset. Additionally, we show that jointly training option decision and explanation generation leads to significant improvements in multiple-choice accuracy.
[28] Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference cs.CL | cs.AIPDF
Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu
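TL;DR: This paper proposes DRIFT, a dual-model framework that addresses the entanglement of knowledge integration and reasoning patterns when LLMs process long contexts. A lightweight knowledge model dynamically compresses document chunks into implicit fact tokens, which are projected into the reasoning model's embedding space, replacing redundant text while maintaining inference accuracy.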
Details
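Motivation: Existing approaches such as retrieval-augmented generation and parametric knowledge editing are constrained by finite context windows, retriever noise, or the risk of catastrophic forgetting, and cannot integrate dynamic knowledge effectively. DRIFT explicitly decouples knowledge extraction from reasoning to extend the effective context window and reasoning capabilities of LLMs.
Result: Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models.
Insight: The novelty lies in a dual-model architecture that dynamically compresses documents into implicit fact tokens, decoupling knowledge extraction from reasoning and offering a scalable, efficient paradigm for extending LLM context windows.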
Abstract: The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model’s embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.
[29] MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval cs.CLPDF
Delvin Ce Zhang, Suhan Cui, Zhelin Chu, Xianren Zhang, Dongwon Lee
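TL;DR: This paper proposes MEVER, a model for multi-modal, explainable claim verification that integrates textual and visual evidence through graph-based evidence retrieval and generates explanations. It comprises three modules (evidence retrieval, multi-modal claim verification, and explanation generation) and is validated on AIChartClaim, a newly built scientific-domain dataset.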
Details
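Motivation: Most existing claim verification methods focus only on textual evidence or ignore explainability, which leads to inaccurate and unconvincing verification. Joint multi-modal reasoning with transparent explanations is needed.
Result: Experiments show that the model performs well on claim verification. The abstract does not report specific benchmarks or quantitative results (such as accuracy) and states only that experiments demonstrate the model's strength.
Insight: Contributions include a two-layer multi-modal graph for evidence retrieval, image-to-text and text-to-image cross-modal reasoning, a multi-modal Fusion-in-Decoder for explanation generation, and a scientific-domain multi-modal verification dataset.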
Abstract: Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both textual caption and chart image for claim verification. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification works mainly focus on the reasoning over textual evidence only or ignore the explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce multi-modal Fusion-in-Decoder for explainability. Finally, since almost all the datasets are in general domain, we create a scientific dataset, AIChartClaim, in AI domain to complement claim verification community. Experiments show the strength of our model.
[30] Anagent For Enhancing Scientific Table & Figure Analysis cs.CL | cs.AIPDF
Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
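TL;DR: The paper proposes Anagent, a multi-agent framework for enhanced scientific table and figure analysis in which four specialized agents (Planner, Expert, Solver, Critic) decompose tasks, retrieve information, synthesize analysis, and refine it iteratively. It also introduces the AnaBench benchmark, with 63,178 instances from nine scientific domains, to quantify the challenges.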
Details
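Motivation: Current AI systems struggle to interpret complex multimodal scientific knowledge accurately. The complexity of scientific tables and figures, their heterogeneous structures, and their long-context requirements pose fundamental obstacles, which the paper sets out to address.
Result: In a comprehensive evaluation across 170 subdomains, Anagent achieves improvements of up to 13.43% in training-free settings and up to 42.12% with finetuning.
Insight: The novelty lies in a multi-agent collaboration framework built specifically for scientific multimodal analysis and in the systematic AnaBench benchmark. The modular training strategy combines supervised finetuning with specialized reinforcement learning, and the results highlight the importance of task-oriented reasoning and context-aware problem-solving for high-quality analysis.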
Abstract: In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: https://xhguo7.github.io/Anagent/.
[31] Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing cs.CLPDF
Mohamed Afane, Kayla Laufer, Wenqi Wei, Ying Mao, Junaid Farooq
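TL;DR: The paper presents Quantum-Audit, a benchmark of 2,700 questions for systematically evaluating how well large language models understand and reason about quantum computing concepts. Across 26 leading models, top performers such as Claude Opus 4.5 exceed the human expert average in overall accuracy, but performance drops markedly on expert-written questions, advanced topics, and critical-reasoning tasks that require identifying false premises.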
Details
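Motivation: Existing benchmarks mainly evaluate quantum code generation and circuit design, and LLM understanding of quantum computing concepts has not been measured systematically. A benchmark dedicated to conceptual understanding and reasoning is needed.
Result: On Quantum-Audit, human participants scored between 23% and 86%, with experts averaging 74%. The top model, Claude Opus 4.5, reached 84% overall accuracy, above the expert average. Top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones, fell to 73% on advanced topics such as security, and stayed below 66% on false-premise tasks.
Insight: The novelty lies in a large, multi-faceted benchmark for quantum computing concepts that includes expert-written, LLM-generated, open-ended, and false-premise questions. The results show that current LLMs still have clear limits in deep domain reasoning, critical thinking, and handling advanced or misleading information, which underlines the importance of domain-specific evaluation.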
Abstract: Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, their understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
cs.CV [Back]
[32] UI-Venus-1.5 Technical Report cs.CV | cs.AI | cs.CL | cs.LGPDF
Veuns-Team: Changlong Gao, Zhangxuan Gu, Yulin Liu
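TL;DR: This report presents UI-Venus-1.5, a unified, end-to-end graphical user interface (GUI) agent designed for robust real-world applications. The model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B). Compared to the previous version, it introduces three key technical advances: a comprehensive mid-training stage, online reinforcement learning, and a unified GUI agent built via model merging. It sets new state-of-the-art results on several benchmarks.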
Details
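Motivation: GUI agents still struggle to achieve both broad generality and consistently strong task performance. The goal is a robust, unified, end-to-end GUI agent suitable for real-world applications.
Result: UI-Venus-1.5 establishes new state-of-the-art (SOTA) performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines.
Insight: The contributions are: (1) mid-training on large-scale data across many datasets to establish foundational GUI semantics; (2) online reinforcement learning with full-trajectory rollouts, which aligns training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) model merging that synthesizes domain-specific models (grounding, web, and mobile) into one unified checkpoint. Combining these training stages with model merging is a useful recipe for building a general, high-performing GUI agent.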
Abstract: GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application scenarios. Compared to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code: https://github.com/inclusionAI/UI-Venus; Model: https://huggingface.co/collections/inclusionAI/ui-venus
[33] SemanticMoments: Training-Free Motion Similarity via Third Moment Features cs.CVPDF
Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady
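TL;DR: This paper proposes SemanticMoments, a training-free method for retrieving videos by semantic motion similarity. It computes higher-order moments (in particular the third moment) over time on features from pre-trained semantic models, addressing the tendency of existing methods to rely on static appearance rather than motion dynamics.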
Details
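Motivation: Existing video representations rely too heavily on static appearance and scene context rather than motion dynamics, while traditional motion-centric inputs such as optical flow lack high-level semantic grounding. Semantic motion retrieval therefore remains unsolved.
Result: On the proposed SimMotion benchmarks, which combine controlled synthetic data with a human-annotated real-world dataset, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods.
Insight: The novelty lies in using temporal higher-order moments (the third moment) of pre-trained semantic features as a training-free motion representation, which provides a scalable and perceptually grounded basis for motion-centric video understanding.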
Abstract: Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
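The temporal-statistics idea can be sketched as follows: given one feature vector per frame, compute the mean, variance, and third central moment of each feature dimension over time and concatenate them into a video descriptor. The abstract does not specify normalization or which moments are kept, so this particular recipe and the name `temporal_moments` are assumptions.

```python
def temporal_moments(frames):
    """Concatenate per-dimension mean, variance, and third central moment over time.

    frames: list of per-frame feature vectors (lists of floats), all the same
            length; in practice these would come from a pre-trained semantic model.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n
                 for d in range(dims)]
    thirds = [sum((f[d] - means[d]) ** 3 for f in frames) / n
              for d in range(dims)]
    return means + variances + thirds


# Dimension 0 changes over time; dimension 1 is static.
video = [[0.0, 1.0], [0.0, 1.0], [3.0, 1.0]]
descriptor = temporal_moments(video)

# Adding a constant offset (a different static appearance) shifts only the means.
shifted = temporal_moments([[x + 10.0 for x in f] for f in video])
```

In this toy case a constant appearance offset leaves the central moments untouched, which hints at why such statistics may emphasize motion over static content.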
[34] A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video cs.CV | cs.AI | cs.MMPDF
Andrea Filiberto Lucas, Dylan Seychell
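TL;DR: This paper proposes a hybrid deterministic framework for named entity extraction in broadcast news video, aimed at automatically detecting and extracting personal names that appear on screen. It contributes a curated, balanced corpus of annotated frames and an interpretable, modular extraction pipeline that operates under deterministic and auditable conditions.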
Details
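Motivation: As video news content grows, transparent and reliable methods for extracting on-screen information are needed, yet the variety of graphical layouts, typographic conventions, and platform-specific design patterns makes manual indexing impractical.
Result: The detector achieves 95.8% mAP@0.5 on graphical element localisation, and the pipeline delivers balanced precision (79.9%) and recall (74.4%) on extraction. Generative multimodal methods achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%) but lack transparency.
Insight: The contributions are a diverse annotated corpus and an interpretable, modular, deterministic extraction pipeline that offers full traceability and avoids hallucination while maintaining solid performance, giving a methodologically rigorous and interpretable baseline for journalistic and analytical use.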
Abstract: The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
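As a quick consistency check on the reported figures, the F1 score can be recomputed from the stated precision and recall. This assumes the reported F1 is the harmonic mean of those two aggregate numbers; the abstract does not say how the scores were aggregated.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 79.9 \times 74.4}{79.9 + 74.4}
    = \frac{11889.12}{154.3}
    \approx 77.05\%
```

This is within 0.03 points of the reported 77.08%. The small gap would be consistent with rounding of the precision and recall values, though that explanation is a guess.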
[35] Wearable environmental sensing to forecast how legged systems will interact with upcoming terrain cs.CVPDF
Michael D. Murray, James Tung, Richard W. Nuckols
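TL;DR: This study examines the feasibility of using wearable environmental sensing (an RGB-D camera) to forecast, during gait, how the foot will interact with upcoming ground (here a level-ground to stair-ascent transition), specifically the anterior-posterior foot center-of-pressure (COP) and the time-of-impact (TOI). A CNN-RNN forecasts COP and TOI continuously within a 250 ms window before foot-strike, and the study evaluates the forecast error and the factors that affect it.
Details
Motivation: Computer vision has been used for environmental classification during gait to inform control systems, but predicting how the foot will contact a changing environment (e.g., COP and TOI) is underexplored. The paper explores whether these interaction parameters can be forecast from visual data in a level-ground to stair-ascent transition, with a view to improving anticipatory control in assistive systems.
Result: On data from 8 subjects wearing an RGB-D camera and instrumented insoles, the CNN-RNN's COP mean absolute error (MAE) at 150, 100, and 50 ms before foot-strike was 29.42 mm, 26.82 mm, and 23.72 mm, and the TOI MAE was 21.14 ms, 20.08 ms, and 17.73 ms. The model is lightweight and runs at 60 FPS on a consumer-grade laptop or an edge computing device.
Insight: The novelty lies in using wearable visual data to continuously forecast foot-terrain interaction parameters (COP and TOI), which gives assistive systems information for anticipatory control. The method achieves real-time prediction with a lightweight model, and kinematic factors such as toe-swing speed are found to affect prediction accuracy, which suggests new directions for adaptive control of robots and exoskeletons.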
Abstract: Computer vision (CV) has been used for environmental classification during gait and is often used to inform control in assistive systems; however, the ability to predict how the foot will contact a changing environment is underexplored. We evaluated the feasibility of forecasting the anterior-posterior (AP) foot center-of-pressure (COP) and time-of-impact (TOI) prior to foot-strike on a level-ground to stair-ascent transition. Eight subjects wore an RGB-D camera on their right shank and instrumented insoles while performing the task of stepping onto the stairs. We trained a CNN-RNN to forecast the COP and TOI continuously within a 250ms window prior to foot-strike, termed the forecast horizon (FH). The COP mean-absolute-error (MAE) at 150, 100, and 50ms FH was 29.42mm, 26.82mm, and 23.72mm, respectively. The TOI MAE was 21.14, 20.08, and 17.73ms at 150, 100, and 50ms FH, respectively. While torso velocity had no effect on the error in either task, faster toe-swing speeds prior to foot-strike were found to improve the prediction accuracy in the COP case but had no significant effect in the TOI case. Further, more anterior foot-strikes were found to reduce COP prediction accuracy but did not affect the TOI prediction accuracy. We also found that our lightweight model was capable of running at 60 FPS on either a consumer-grade laptop or an edge computing device. This study demonstrates that forecasting COP and TOI from visual data is feasible using a lightweight model, which may have important implications for anticipatory control in assistive systems.
[36] VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models cs.CVPDF
Chenyu Wang, Tianle Chen, H. M. Sabbir Ahmad, Kayhan Batmanghelich, Wenchao Li
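TL;DR: The paper presents VLM-UQBench, a benchmark for evaluating modality-specific and cross-modal uncertainty in vision-language models (VLMs). It contains 600 real-world samples drawn from the VizWiz dataset, split into clean, image-, text-, and cross-modal uncertainty subsets, together with a scalable perturbation pipeline. Evaluating a range of uncertainty quantification (UQ) methods, the study finds that existing methods are strongly modality-specific, depend heavily on the underlying VLM, correlate only weakly with hallucinations, and struggle to detect subtle instance-level ambiguity.
Details
Motivation: Uncertainty quantification is vital for keeping vision-language models safe and reliable, but the central challenge is localizing the source of uncertainty (the image, the text, or a misalignment between the two). Existing methods lack a fine-grained, modality-aware evaluation benchmark.
Result: A range of UQ methods evaluated on VLM-UQBench show strong modality-specific specialization and substantial dependence on the underlying VLM. Modality-specific uncertainty frequently co-occurs with hallucinations, yet current UQ scores provide only weak and inconsistent risk signals. UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity but largely fail to detect the subtle, instance-level ambiguity introduced by the perturbation pipeline.
Insight: The novelty lies in building the first VLM benchmark focused on modality-specific and cross-modal uncertainty, and in proposing simple metrics that quantify how sensitive UQ scores are to perturbations and how well they correlate with hallucinations. The study exposes a significant gap between current UQ practice and the fine-grained uncertainty that reliable VLM deployment requires, underlining the need for finer-grained UQ methods.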
Abstract: Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.
[37] VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models cs.CV | cs.AI | cs.MAPDF
Ange Lou, Yamin Li, Qi Chang, Nan Xi, Luyuan Xie
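TL;DR: This paper proposes IR-SIS, an iterative refinement system for surgical image segmentation. It uses a fine-tuned SAM3 for initial segmentation, a vision-language model to detect instruments and assess segmentation quality, and an agentic workflow that adaptively selects refinement strategies, with support for natural-language interaction with clinicians.
Details
Motivation: Existing surgical image segmentation methods are limited to predefined categories, lack adaptive refinement, and provide no means of clinician interaction.
Result: The system achieves state-of-the-art performance on both in-domain and out-of-distribution data from the EndoVis2017 and EndoVis2018 benchmarks, and clinician interaction yields further gains.
Insight: It is the first language-based surgical segmentation framework with adaptive self-refinement; it builds a multi-granularity language-annotated dataset; and it introduces an iterative refinement workflow that combines VLM quality assessment with agentic strategy selection.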
Abstract: Surgical image segmentation is essential for robot-assisted surgery and intraoperative guidance. However, existing methods are constrained to predefined categories, produce one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction. We propose IR-SIS, an iterative refinement system for surgical image segmentation that accepts natural language descriptions. IR-SIS leverages a fine-tuned SAM3 for initial segmentation, employs a Vision-Language Model to detect instruments and assess segmentation quality, and applies an agentic workflow that adaptively selects refinement strategies. The system supports clinician-in-the-loop interaction through natural language feedback. We also construct a multi-granularity language-annotated dataset from EndoVis2017 and EndoVis2018 benchmarks. Experiments demonstrate state-of-the-art performance on both in-domain and out-of-distribution data, with clinician interaction providing additional improvements. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
[38] Rethinking Global Text Conditioning in Diffusion Transformers cs.CVPDF
Nikita Starodubcev, Daniil Pakhomov, Zongze Wu, Ilya Drobyshevskiy, Yuchen Liu
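TL;DR: This paper revisits the role of global text conditioning in diffusion transformers. It finds that conventional modulation with a pooled embedding contributes little to performance, since attention alone is enough to propagate prompt information. Used instead as a guidance signal, the pooled embedding enables controllable shifts toward desirable properties and improves performance across several tasks.
Details
Motivation: To determine whether modulation-based text conditioning is necessary in diffusion transformers and whether it offers any performance advantage, clarifying how effective current text-conditioning mechanisms actually are.
Result: The approach brings improvements across diverse tasks, including text-to-image/video generation and image editing, while being training-free, simple to implement, and negligible in runtime overhead.
Insight: The novelty lies in repurposing the pooled text embedding from its conventional modulation role into a guidance signal that allows controllable property adjustment. It is a flexible, general, training-free enhancement that can be applied to a variety of diffusion models.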
Abstract: Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective, serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
[39] A Deep Multi-Modal Method for Patient Wound Healing Assessment cs.CV | cs.AIPDF
Subba Reddy Oota, Vijay Rowtula, Shahid Mohammed, Jeffrey Galitz, Minghsun Liu
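TL;DR: This paper proposes a deep multi-modal learning method for predicting a patient's risk of hospitalization during wound healing. It combines wound variables (such as clinical indicators) with wound images and uses transfer learning to predict the wound variables and their healing trajectories, enabling early detection of complications and supporting clinical diagnosis.
Details
Motivation: Wounds can deteriorate and eventually lead to hospitalization because of delayed treatment, patient non-compliance, or co-morbid conditions. The paper aims to predict hospitalization risk from multi-modal data in order to reduce care costs and improve clinical efficiency.
Result: The abstract reports no quantitative results or benchmarks, but states that the proposed transfer-learning solution can predict wound variables and healing trajectories, which helps detect wound complexities early.
Insight: The novelty lies in combining multi-modal data (wound variables and images) with transfer learning for wound assessment, which may improve prediction accuracy and reduce the time clinicians spend on diagnosis, offering a new direction for intelligent clinical decision support.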
Abstract: Hospitalization of patients is one of the major factors for high wound care costs. Most patients do not acquire a wound which needs immediate hospitalization. However, due to factors such as delay in treatment, patient’s non-compliance or existing co-morbid conditions, an injury can deteriorate and ultimately lead to patient hospitalization. In this paper, we propose a deep multi-modal method to predict the patient’s risk of hospitalization. Our goal is to predict the risk confidently by collectively using the wound variables and wound images of the patient. Existing works in this domain have mainly focused on healing trajectories based on distinct wound types. We developed a transfer learning-based wound assessment solution, which can predict both wound variables from wound images and their healing trajectories, which is our primary contribution. We argue that the development of a novel model can help in early detection of the complexities in the wound, which might affect the healing process and also reduce the time spent by a clinician to diagnose the wound.
[40] GAFR-Net: A Graph Attention and Fuzzy-Rule Network for Interpretable Breast Cancer Image Classification cs.CV | cs.AIPDF
Lin-Guo Gao, Suxing Liu
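TL;DR: This paper proposes GAFR-Net, a graph attention and fuzzy-rule network for interpretable classification of breast cancer histopathology images. The network builds a similarity-driven graph representation to model relationships between samples and uses multi-head graph attention to capture complex relational features across heterogeneous tissue structures. Its differentiable fuzzy-rule module encodes intrinsic topological descriptors (node degree, clustering coefficient, and label consistency) into explicit, human-understandable diagnostic logic, providing transparent "IF-THEN" mappings that mimic the heuristic reasoning of medical experts.
Details
Motivation: In breast cancer histopathology image classification, conventional deep learning models degrade under limited annotations, and their "black-box" nature hinders clinical integration.
Result: Extensive evaluation on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) shows that GAFR-Net consistently outperforms a range of state-of-the-art methods across multiple magnifications and classification tasks, supporting its generalization ability and practical utility.
Insight: The novelty lies in combining a graph neural network with a differentiable fuzzy-rule system into an end-to-end interpretable model. Graph structure models the relationships between samples, and fuzzy logic rules provide a transparent, expert-like basis for each decision, which offers both strong performance and interpretability for weakly supervised medical image analysis.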
Abstract: Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic intervention. However, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a “black-box” nature, hindering their clinical integration. To mitigate these limitations, we propose GAFR-Net, a robust and interpretable Graph Attention and Fuzzy-Rule Network specifically engineered for histopathology image classification with scarce supervision. GAFR-Net constructs a similarity-driven graph representation to model inter-sample relationships and employs a multi-head graph attention mechanism to capture complex relational features across heterogeneous tissue structures. Concurrently, a differentiable fuzzy-rule module encodes intrinsic topological descriptors (including node degree, clustering coefficient, and label consistency) into explicit, human-understandable diagnostic logic. This design establishes transparent “IF-THEN” mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFR-Net consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFR-Net as a reliable decision-support tool for weakly supervised medical image analysis.
[41] Deep Modeling and Interpretation for Bladder Cancer Classification cs.CVPDF
Ahmad Chaddad, Yihang Wu, Xianrui Chen
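TL;DR: The paper systematically evaluates the performance, calibration, and interpretability of a range of deep models based on convolutional neural networks (CNNs) and vision transformers (ViTs) for bladder cancer image classification. Extensive experiments on a public multicenter dataset show that the ConvNext series generalizes poorly, while ViT models are better calibrated. The study also examines how test-time augmentation improves interpretability, and finds that different models have different strengths when explaining in-distribution versus out-of-distribution samples, so no single model is the best interpretable choice in every case.
Details
Motivation: Although ViTs and CNNs perform well on natural image datasets, abnormal regions in medical images (such as bladder cancer images) usually cover only a small portion of the image, so existing models may not transfer. A systematic evaluation of the latest deep models on bladder cancer classification, covering actual performance, calibration, and interpretability, is needed to guide clinical diagnostic use.
Result: About 300 experiments were run on a public multicenter bladder cancer dataset. The ConvNext series shows limited generalization ability when classifying bladder cancer images, with accuracy of about 60%. ViT models are better calibrated than the ConvNext and Swin Transformer series. Test-time augmentation helps improve model interpretability.
Insight: The novelty lies in a comprehensive evaluation of performance, calibration, and interpretability for many state-of-the-art deep models on a medical image classification task, and in showing that models differ in how well they explain in-distribution and out-of-distribution samples. The systematic evaluation framework, combined with calibration analysis and interpretability tools such as GradCAM++, is a useful reference for model selection and optimization in medical imaging, and underlines that models should be chosen for the specific clinical scenario rather than as a universal solution.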
Abstract: Deep models based on vision transformer (ViT) and convolutional neural network (CNN) have demonstrated remarkable performance on natural datasets. However, these models may not perform comparably in medical imaging, where abnormal regions cover only a small portion of the image. This challenge motivates this study to investigate the latest deep models for bladder cancer classification tasks. We propose the following to evaluate these deep models: 1) standard classification using 13 models (four CNNs and eight transformer-based models), 2) calibration analysis to examine if these models are well calibrated for bladder cancer classification, and 3) interpretability evaluation using GradCAM++ for clinical diagnosis. We simulate $\sim 300$ experiments on a publicly available multicenter bladder cancer dataset, and the experimental results demonstrate that the ConvNext series shows limited generalization ability to classify bladder cancer images (e.g., $\sim 60\%$ accuracy). In addition, ViTs show better calibration effects compared to the ConvNext and Swin Transformer series. We also apply test-time augmentation to improve the models' interpretability. Finally, no model provides a one-size-fits-all solution for a feasible interpretable model. The ConvNext series is suitable for in-distribution samples, while ViT and its variants are suitable for interpreting out-of-distribution samples.
[42] Impact of domain adaptation in deep learning for medical image classifications cs.CVPDF
Yihang Wu, Ahmad Chaddad
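TL;DR: This study systematically evaluates domain adaptation (DA) for medical image classification, using 10 deep learning models to simulate common DA methods across multiple scenarios on four medical image datasets. DA improves model performance (e.g., a 4.7% gain for ResNet34 on a brain tumor dataset), increases robustness to noise (about a 3% accuracy gain), improves interpretability (via GradCAM++), and lowers the expected calibration error by about 2% on multi-modality data, but brings only a limited gain in a federated learning framework (about 0.3% for skin cancer classification).
Details
Motivation: Differences in data distribution (multi-modality, noise, insufficient labels) reduce model generalization in medical image analysis. The study explores the practical value of DA in such complex clinical scenarios.
Result: On a brain tumor dataset, DA improves ResNet34 performance by 4.7%; on noisy data, DA yields about a 3% accuracy gain; on a multi-modality dataset, DA lowers the expected calibration error by about 2%; for skin cancer classification under federated learning, DA brings only a limited gain of about 0.3%.
Insight: Beyond raising performance, DA also increases noise robustness, improves interpretability (which has clinical value), and improves calibration. In distributed settings such as federated learning, DA strategies need further work, which marks out where DA is and is not useful across medical imaging tasks.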
Abstract: Domain adaptation (DA) is a quickly expanding area in machine learning that involves adjusting a model trained in one domain to perform well in another domain. While there have been notable advances, the fundamental concept of numerous DA methodologies has persisted: aligning the data from various domains into a shared feature space. In this space, knowledge acquired from labeled source data can improve the model training on target data that lacks sufficient labels. In this study, we demonstrate the use of 10 deep learning models to simulate common DA techniques and explore their application in four medical image datasets. We have considered various situations such as multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 on a brain tumor (BT) dataset results in an enhancement of 4.7% in model performance. Similarly, the use of DA can reduce the impact of Gaussian noise, as it provides a $\sim 3\%$ accuracy increase using ResNet34 on a BT dataset. Furthermore, simply introducing DA into an FL framework shows limited potential (e.g., a $\sim 0.3\%$ increase in performance) for skin cancer classification. In addition, the DA method can improve the interpretability of the models using the GradCAM++ technique, which offers clinical value. Calibration analysis also demonstrates that using DA provides a lower expected calibration error (ECE) value (by $\sim 2\%$) compared to CNN alone on a multi-modality dataset.
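For readers unfamiliar with the calibration metric cited in this entry, the sketch below shows the standard binned form of the expected calibration error (ECE): predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is averaged with weights proportional to bin size. The function name, the equal-width binning, and the toy inputs are assumptions for illustration; the abstract does not say which ECE variant or bin count the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over equal-width bins."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy predictions (hypothetical values, not taken from the paper).
toy_conf = [0.95, 0.92, 0.85, 0.65]
toy_correct = [True, True, False, True]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
print(round(toy_ece, 4))
```

A lower value means that the stated confidences track the observed accuracy more closely, which is the sense in which the abstract reports DA improving calibration.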
[43] Fully Differentiable Bidirectional Dual-Task Synergistic Learning for Semi-Supervised 3D Medical Image Segmentation cs.CVPDF
Jun Li
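TL;DR: This paper proposes a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework for semi-supervised 3D medical image segmentation. Through online bidirectional cross-task collaboration, the framework integrates and enhances four key semi-supervised learning (SSL) components (supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation) to ease the shortage of high-quality labeled data.
Details
Motivation: High-quality labeled data is scarce and costly to produce in medical image analysis. Existing dual-task collaborative learning methods are usually limited to unidirectional interaction (e.g., regression to segmentation) and cannot fully exploit online bidirectional cross-task collaboration.
Result: Experiments on two benchmark datasets show that the method achieves state-of-the-art (SOTA) performance.
Insight: The novelty lies in a fully differentiable bidirectional synergistic learning framework that enables online two-way interaction and supervision between the segmentation and regression tasks. It provides a new architectural foundation for dual-task-driven SSL and can serve as a generic multitask learning framework for broader computer vision tasks.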
Abstract: Semi-supervised learning relaxes the need for large pixel-wise labeled datasets for image segmentation by leveraging unlabeled data. The scarcity of high-quality labeled data remains a major challenge in medical image analysis due to the high annotation costs and the need for specialized clinical expertise. Semi-supervised learning has demonstrated significant potential in addressing this bottleneck, with pseudo-labeling and consistency regularization emerging as two predominant paradigms. Dual-task collaborative learning, an emerging consistency-aware paradigm, seeks to derive supplementary supervision by establishing prediction consistency between related tasks. However, current methodologies are limited to unidirectional interaction mechanisms (typically regression-to-segmentation), as segmentation results can only be transformed into regression outputs in an offline manner, thereby failing to fully exploit the potential benefits of online bidirectional cross-task collaboration. Thus, we propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Experiments on two benchmark datasets demonstrate our method’s state-of-the-art performance. Beyond technical contributions, this work provides new insights into unified SSL framework design and establishes a new architectural foundation for dual-task-driven SSL, while offering a generic multitask learning framework applicable to broader computer vision applications. The code will be released on GitHub upon acceptance.
[44] K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge cs.CVPDF
Zhikai Li, Jiatong Li, Xuewen Liu, Wangbo Zhao, Pan Du
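TL;DR: This paper proposes K-Sort Eval, a reliable and efficient preference evaluation framework for visual generative models based on vision-language models (VLMs). By integrating posterior correction and a dynamic matching strategy, it addresses the high cost and poor scalability of human evaluation as well as the hallucination, bias, and alignment problems that arise when VLMs are used directly as judges.
Details
Motivation: The rapid development of visual generative models calls for evaluation methods that scale better and align with human preferences. Crowdsourced Arena platforms provide human preference assessments but are costly and time-consuming, which limits scalability. Replacing human judgment with VLMs is a promising option, but their inherent hallucinations and biases undermine reliability, and static evaluation is inefficient.
Result: Extensive experiments show that K-Sort Eval produces evaluation results consistent with K-Sort Arena (whose thousands of human votes are the source of the curated high-quality dataset), typically with fewer than 90 model runs, demonstrating both efficiency and reliability.
Insight: The novelty lies in a VLM evaluation framework that combines posterior correction with dynamic matching. Posterior correction adaptively corrects the posterior probability in Bayesian updating according to the consistency between VLM predictions and human supervision, improving alignment and reliability. Dynamic matching balances uncertainty and diversity to maximize the expected benefit of each comparison, making evaluation more efficient. Together they offer a promising route to automated, scalable evaluation of generative models.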
Abstract: The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While the crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language models (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach leads to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When evaluating a new model, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provides the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability.
[45] LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging cs.CV | cs.AI | cs.LGPDF
Xinyu Wang, Ke Deng, Fei Dou, Jinbo Bi, Jin Lu
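TL;DR: This paper proposes LARV (Layer-wise Adaptive Rescaling Veneer), a layer-adaptive rescaling method that needs no training data and no additional training, is orthogonal to the specific merging method, and improves task-vector merging. LARV assigns a per-layer scale to each task vector before aggregation, adaptively suppressing shallow-layer interference and strengthening deeper-layer alignment, and thereby improves existing merging methods (such as TIES, TSV-M, and Iso-C/CTS) on vision transformers.
Details
Motivation: Existing task-vector merging methods (such as TIES, TSV-M, and Iso-C/CTS) treat all layers almost uniformly during aggregation, overlooking the strong layer-wise heterogeneity in large vision transformers: shallow layers are sensitive to interference, while deeper layers encode stable task-specific features.
Result: On the FusionBench benchmark, LARV consistently improves all task-vector baselines in the 8/14/20-task settings. For example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layer-wise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference and modestly amplifies stable deeper-layer features.
Insight: The novelty lies in introducing, for the first time, a layer-aware scaling mechanism for task-vector merging: simple data-free layer proxies and a lightweight rule produce per-layer scales, turning model merging from a uniform procedure into a robust, layer-aware one. The method is orthogonal to the base merger, adds negligible computational cost, and requires no modification of existing merging methods.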
Abstract: Model merging aims to combine multiple fine-tuned models into a single multi-task model without access to training data. Existing task-vector merging methods such as TIES, TSV-M, and Iso-C/CTS differ in their aggregation rules but treat all layers nearly uniformly. This assumption overlooks the strong layer-wise heterogeneity in large vision transformers, where shallow layers are sensitive to interference while deeper layers encode stable task-specific features. We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer that plugs into any task-vector merger and assigns a per-layer scale to each task vector before aggregation, and show it consistently boosts diverse merging rules. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule, requiring no retraining or modification to existing mergers. To our knowledge, this is the first work to perform layer-aware scaling for task-vector merging. LARV computes simple data-free layer proxies and turns them into scales through a lightweight rule; we study several instantiations within one framework (e.g., tiered two/three-level scaling with fixed values, or continuous mappings) and show that tiered choices offer the best robustness, while continuous mappings remain an ablation. LARV is orthogonal to the base merger and adds negligible cost. On FusionBench with Vision Transformers, LARV consistently improves all task-vector baselines across 8/14/20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layerwise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features, turning model merging into a robust, layer-aware procedure rather than a uniform one.
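The idea of assigning a per-layer scale to each task vector before aggregation can be illustrated with a minimal sketch. Everything specific here is an assumption: the abstract does not give the layer proxies, the tier boundaries, or the scale values, so the sketch uses relative depth as a stand-in proxy, hypothetical three-tier scales of 0.5 / 1.0 / 1.2, and plain task-vector summation as the base merger. It is not the authors' implementation.

```python
def tiered_scale(layer_idx, n_layers, scales=(0.5, 1.0, 1.2)):
    """Hypothetical three-tier schedule: damp shallow layers, boost deep ones."""
    frac = layer_idx / max(n_layers - 1, 1)  # relative depth in [0, 1]
    if frac < 1 / 3:
        return scales[0]
    if frac < 2 / 3:
        return scales[1]
    return scales[2]


def merge_with_layer_scales(base, task_vectors, lam=1.0):
    """Add layer-scaled, summed task vectors to the base weights.

    base: list of layers, each a flat list of floats.
    task_vectors: list of task vectors with the same layout as base.
    """
    n_layers = len(base)
    merged = []
    for layer_idx, weights in enumerate(base):
        scale = tiered_scale(layer_idx, n_layers)
        summed = [sum(tv[layer_idx][i] for tv in task_vectors)
                  for i in range(len(weights))]
        merged.append([w + lam * scale * d for w, d in zip(weights, summed)])
    return merged


# Toy model with three layers of two weights each, and two identical task vectors.
toy_base = [[0.0, 0.0] for _ in range(3)]
toy_tasks = [[[1.0, 1.0] for _ in range(3)] for _ in range(2)]
toy_merged = merge_with_layer_scales(toy_base, toy_tasks)
print(toy_merged)
```

In this toy case the shallow layer receives half of the summed update and the deep layer slightly more than the full update, which mirrors the tiered behaviour described in the abstract without claiming its actual values.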
[46] Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification cs.CV | cs.LGPDF
Yiqiao Li, Bo Shang, Jie Wei
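TL;DR: This paper proposes a training-free vision-language model framework that addresses the modality gap in fine-grained truck classification from roadside LiDAR data. It converts sparse 3D point clouds into depth-encoded 2D visual proxies and uses off-the-shelf vision-language models for few-shot classification, achieving competitive accuracy on a real-world dataset and serving as a cold-start strategy for bootstrapping lightweight supervised models.
Details
Motivation: Current LiDAR-based fine-grained truck classification relies on supervised deep learning and manual annotation, which limits scalability. Vision-language models offer few-shot generalization, but applying them to roadside LiDAR is hindered by the modality gap between sparse 3D point clouds and dense 2D imagery.
Result: On a real-world dataset of 20 vehicle classes, the method reaches competitive classification accuracy with only 16-30 examples per class. For specific drayage categories (20ft, 40ft, and 53ft containers), the few-shot vision-language model achieves a correct classification rate above 75% without any training or fine-tuning.
Insight: The contributions are: a depth-aware image generation pipeline that turns sparse, occluded LiDAR scans into 2D visual proxies to bridge the modality gap; the observation of a "Semantic Anchor" effect, in which text guidance regularizes performance in ultra-low-shot regimes but lowers accuracy with more shots because of semantic mismatch; and the use of the framework as a cold-start strategy, where labels generated by the vision-language model bootstrap supervised models and reduce the initial annotation burden.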
Abstract: Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a “Semantic Anchor” effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$), but degrades accuracy in more-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a Cold Start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without costly training or fine-tuning, significantly reducing the intensive demands of initial manual labeling and making the method practical for ITS applications.
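To make the notion of a depth-encoded 2D visual proxy concrete, the sketch below rasterises a point cloud into a side-view image whose pixel value encodes range. It covers only this final projection step, under assumptions of my own: x is taken as the along-road axis, z as height, and y as distance from the sensor, with nearer points rendered brighter. The paper's pipeline (noise removal, registration, orientation rectification, morphological operations, anisotropic smoothing) is not reproduced, and the function and parameter names are hypothetical.

```python
def depth_image(points, width, height, x_range, z_range, y_range):
    """Rasterise (x, y, z) points into a height x width grid of 0-255 values.

    Columns follow x, rows follow z (top row = highest point), and the pixel
    value encodes y so that nearer points are brighter. Empty pixels stay 0.
    """
    x0, x1 = x_range
    z0, z1 = z_range
    y0, y1 = y_range
    img = [[0] * width for _ in range(height)]
    for x, y, z in points:
        if not (x0 <= x <= x1 and z0 <= z <= z1 and y0 <= y <= y1):
            continue  # drop points outside the region of interest
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row_from_bottom = min(int((z - z0) / (z1 - z0) * height), height - 1)
        row = height - 1 - row_from_bottom
        value = 1 + int(round(254 * (y1 - y) / (y1 - y0)))  # 1..255
        img[row][col] = max(img[row][col], value)  # keep the nearest return
    return img


# Two toy points: one near and low, one far and high (hypothetical coordinates).
toy_points = [(0.0, 0.0, 0.0), (9.9, 10.0, 2.9)]
toy_img = depth_image(toy_points, width=10, height=3,
                      x_range=(0.0, 10.0), z_range=(0.0, 3.0), y_range=(0.0, 10.0))
for image_row in toy_img:
    print(image_row)
```

An image produced this way can be passed to an off-the-shelf VLM like any other picture, which is the sense in which the proxy bridges the gap between sparse 3D points and dense 2D imagery.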
[47] SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL cs.CVPDF
Yang Zhao, Shizhao Sun, Meisheng Zhang, Yingdong Shi, Xubo Yang
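TL;DR: This paper proposes SceneReVis, a vision-grounded self-reflection framework that uses a multi-turn reinforcement learning "diagnose-and-act" loop to resolve spatial hallucinations (such as collisions) in 3D indoor scene synthesis. The method uses multi-modal feedback to intercept and resolve spatial conflicts iteratively, and builds the large-scale SceneChain-12k dataset to support step-wise reasoning. A two-stage training recipe, from supervised fine-tuning to agentic reinforcement learning, turns the model into an active spatial planner. Experiments show state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
Details
Motivation: Current one-pass 3D scene synthesis methods lack deliberative reasoning and often produce spatial hallucinations such as collisions. The paper aims to close this gap by introducing an iterative self-reflection mechanism for more reliable scene generation.
Result: In extensive experiments, SceneReVis achieves state-of-the-art (SOTA) performance on high-fidelity generation and goal-oriented optimization, and generalizes robustly to long-tail domains.
Insight: The core novelty is a vision-grounded self-reflection framework that explicitly resolves spatial conflicts through a multi-turn "diagnose-and-act" loop with multi-modal feedback. The large-scale SceneChain-12k dataset of causal construction trajectories and the two-stage training paradigm from supervised learning to agentic reinforcement learning, which turns the model into an active planner, are also reusable ideas.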
Abstract: Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative “diagnose-and-act” loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
[48] ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs cs.CV | cs.AI | cs.LGPDF
James Burgess, Rameen Abdal, Dan Stoddart, Sergey Tulyakov, Serena Yeung-Levy
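TL;DR: ArtifactLens is a vision-language model (VLM) based system for detecting artifacts in synthetic images. Through its architecture design, which includes in-context learning and text instruction optimization, it unlocks knowledge already present in pretrained VLMs using only a few hundred labeled examples per artifact category, reaching SOTA on multiple benchmarks while greatly reducing the reliance on large-scale labeled data.
Details
Motivation: Current VLM-based artifact detectors need tens of thousands of labeled images for fine-tuning, which is expensive and hard to repeat as generators evolve or new artifact types emerge.
Result: The system achieves state-of-the-art (SOTA) results on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data.
Insight: The core finding is that pretrained VLMs already encode the knowledge needed to detect artifacts, and that a carefully designed scaffolding architecture (with novel improvements to in-context learning and text instruction optimization) unlocks this capability efficiently, yielding a large gain in data efficiency. The method generalizes to other artifact types (object morphology, animal anatomy, entity interactions) and to the AIGC detection task.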
Abstract: Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts - with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types - object morphology, animal anatomy, and entity interactions - and to the distinct task of AIGC detection.
[49] FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation cs.CVPDF
Chuanhai Zang, Jiabao Hu, XW Song
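TL;DR: This paper proposes FD-DB, a frequency-decoupled dual-branch network that addresses the trade-off between photorealism and structural stability in unpaired synthetic-to-real domain translation. It decomposes appearance transfer into low-frequency interpretable editing and high-frequency residual compensation: an interpretable branch predicts physical editing parameters to stabilize the low-frequency appearance, a free branch adds detail, and a gated fusion mechanism combines the two branches under frequency constraints.
Details
Motivation: Synthetic data provides low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Existing unpaired synthetic-to-real translation methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics.
Result: Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structure.
Insight: The novelty lies in the frequency-decoupled dual-branch design that splits domain translation into low-frequency interpretable editing and high-frequency residual compensation, together with a stable generation mechanism that combines physical parameter prediction with gated fusion. The two-stage training strategy and explicit frequency constraints help balance content preservation against detail generation, offering a structured, decoupled approach to domain adaptation.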
Abstract: Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
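The frequency decoupling described in this entry can be illustrated on a one-dimensional signal. The sketch below splits a signal into a low-frequency band (moving average) and a high-frequency remainder, applies an interpretable edit (a gain, standing in for exposure) to the low band only, and adds a gated residual after removing the residual's own low-frequency part so that it cannot shift the base appearance. The moving-average filter, the gain and gate values, and all names are assumptions for illustration; the abstract does not specify FD-DB's filters, fusion rule, or network branches.

```python
def low_pass(signal, radius=1):
    """Moving-average low-pass filter with shrinking windows at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def split_bands(signal, radius=1):
    """Return (low, high) with low + high equal to the input signal."""
    low = low_pass(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def translate(signal, gain=1.0, residual=None, gate=0.5, radius=1):
    """Edit the low band with a gain and add a gated, high-passed residual."""
    low, high = split_bands(signal, radius)
    edited_low = [gain * v for v in low]  # interpretable edit, e.g. exposure
    if residual is None:
        residual = [0.0] * len(signal)
    _, residual_high = split_bands(residual, radius)  # limit low-frequency drift
    return [l + h + gate * r for l, h, r in zip(edited_low, high, residual_high)]


toy_signal = [1.0, 2.0, 4.0, 2.0, 1.0]
toy_low, toy_high = split_bands(toy_signal)
identity = translate(toy_signal)               # gain 1, no residual
brighter = translate(toy_signal, gain=1.2)     # low band scaled, detail kept
print(identity)
print(brighter)
```

With a gain of 1 and no residual the signal is returned unchanged, which is a useful sanity check that the decomposition itself loses no information.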
[50] Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions cs.CVPDF
Lin Chen, Xiaoke Zhao, Kun Ding, Weiwei Feng, Changtao Miao
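TL;DR: This paper proposes Align-TI, a new knowledge distillation framework designed from the perspective of token interactions for compressing multimodal large language models. By having the student imitate the teacher's dynamic behaviour in vision-instruction token interactions and intra-response token interactions, it markedly improves student performance.
Details
Motivation: Existing knowledge distillation methods for MLLMs rely mainly on static next-token alignment and neglect the dynamic token interactions that carry key capabilities for multimodal understanding and generation, which limits the performance of the compressed models.
Result: In experiments, Align-TI achieves a 2.6% relative improvement over Vanilla KD, and the distilled 2B-parameter model even outperforms the much larger LLaVA-1.5-7B by 7.0%, establishing a new SOTA distillation framework for training parameter-efficient MLLMs.
Insight: The core novelty is recasting knowledge distillation from the perspective of token interactions: the IVA component aligns on salient visual regions to extract instruction-relevant information, and the TPA component aligns sequential token-to-token transition probabilities to capture the teacher's dynamic generative logic, which offers a new design direction for model compression.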
Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher’s instruction-relevant visual information extraction capability by aligning on salient visual regions, and TPA captures the teacher’s dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI’s superiority. Notably, our approach achieves a $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
[51] A Universal Action Space for General Behavior Analysis cs.CVPDF
Hung-Shuo Chang, Yue-Cheng Yang, Yu-Hsi Chen, Wei-Hsin Chen, Chien-Yao Wang
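TL;DR: The paper proposes a Universal Action Space (UAS), built from existing large-scale labeled human-action datasets, and uses it as the foundation for analyzing and categorizing mammalian behavior data (including chimpanzees), aiming to provide a unified representation framework for behavior analysis.
Details
Motivation: Traditional behavior analysis relies on hand-crafted features and trajectory modeling, which limits robustness and generalization. Following the ImageNet-inspired deep learning paradigm, the paper builds a general high-level action representation space to improve behavior analysis.
Result: The paper uses the UAS to analyze and categorize mammalian and chimpanzee behavior datasets, but the abstract reports no quantitative results (such as accuracy) and no comparison with existing methods (such as whether it reaches SOTA).
Insight: The novelty lies in transferring large-scale human-action data to animal behavior analysis by building a unified action representation space, which provides a scalable deep learning foundation for cross-species behavior research. The code is open-sourced to support reproducibility.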
Abstract: Analyzing animal and human behavior has long been a challenging task in computer vision. Early approaches from the 1970s to the 1990s relied on hand-crafted edge detection, segmentation, and low-level features such as color, shape, and texture to locate objects and infer their identities, an inherently ill-posed problem. Behavior analysis in this era typically proceeded by tracking identified objects over time and modeling their trajectories using sparse feature points, which further limited robustness and generalization. A major shift occurred with the introduction of ImageNet by Deng and Li in 2010, which enabled large-scale visual recognition through deep neural networks and effectively served as a comprehensive visual dictionary. This development allowed object recognition to move beyond complex low-level processing toward learned high-level representations. In this work, we follow this paradigm to build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets. We then use this UAS as the foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets. The source code is released on GitHub at https://github.com/franktpmvu/Universal-Action-Space.
[52] Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs cs.CVPDF
Jingyi Wang, Fei Li, Rujie Liu
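TL;DR: This paper proposes a training-free attention intervention algorithm that mitigates hallucinations in large vision-language models (LVLMs) by strengthening attention to task-relevant visual tokens. It builds reweighting matrices from visual-textual similarity to reallocate attention, and injects visual attention values into beam search decoding to raise the contribution of visual tokens.
Details
Motivation: LVLMs hallucinate because of insufficient visual attention, and existing methods that boost attention for all visual tokens also increase attention to irrelevant tokens, so attention should be strengthened specifically for task-relevant visual tokens.
Result: Extensive experiments on mainstream LVLMs show that the method significantly reduces hallucinations while preserving the accuracy and coherence of generated content.
Insight: The novelty is dynamic attention reweighting based on visual-textual similarity, which avoids boosting irrelevant tokens, plus injecting visual attention at the decoding stage to raise the visual contribution. Together they form an efficient training-free intervention.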
Abstract: Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. These methods present a limitation that boosting attention for all visual tokens inevitably increases attention to task irrelevant tokens. To tackle this challenge, we propose a training free attentional intervention algorithm to enhance the attention of task-relevant tokens based on the argument that task-relevant tokens generally demonstrate high visual-textual similarities. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct the reweighting matrices to reallocate attention. Besides, to enhance the contribution of visual tokens, we inject visual attention values into the beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs, while preserving the accuracy and coherence of generated content.
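The reallocation idea in the abstract can be illustrated with a toy example. This is a hedged sketch under strong assumptions, not the paper's algorithm: it takes a single query's attention row, a hypothetical per-token similarity score, and redistributes only the attention mass already assigned to visual tokens in proportion to `attention * similarity`. How the paper builds its reweighting matrices from cross-attention submatrices, and whether it changes the total visual mass, is not specified in the abstract.

```python
def reweight_visual_attention(attention, similarity, is_visual):
    """Shift visual attention toward tokens with high visual-text similarity.

    attention:  attention weights of one query over all tokens (sums to 1)
    similarity: non-negative similarity score per token (hypothetical input;
                ignored for text tokens)
    is_visual:  True for visual tokens, False for text tokens

    Text-token weights are left untouched, and the total mass on visual
    tokens is preserved, so the row still sums to 1.
    """
    visual_mass = sum(a for a, v in zip(attention, is_visual) if v)
    scaled = [a * s if v else 0.0
              for a, s, v in zip(attention, similarity, is_visual)]
    scaled_sum = sum(scaled)
    if scaled_sum == 0.0:
        return list(attention)  # nothing to reallocate
    return [sc * visual_mass / scaled_sum if v else a
            for a, sc, v in zip(attention, scaled, is_visual)]
```

For instance, with two visual tokens of equal attention but similarities 0.9 and 0.1, the first token ends up with nine times the attention of the second, while the text token's weight is unchanged.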
[53] Singpath-VL Technical Report cs.CVPDF
Zhen Qiu, Kaiwen Xiao, Zhengwei Lu, Xiangyu Liu, Lei Zhao
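TL;DR: Singpath-VL is a vision-language large model for cervical cytology that aims to fill the gap in AI assistants for this field. The paper first develops a three-stage pipeline that uses general-purpose multimodal LLMs as weak annotators to synthesize a million-scale image-description dataset. It then fine-tunes Qwen3-VL-4B on this dataset in multiple stages to build a specialized cytopathology MLLM. The model performs well in fine-grained morphological perception and cell-level diagnostic classification, and the authors plan to open-source part of the synthetic dataset and a benchmark.
Details
Motivation: MLLMs remain underexplored in cytopathology, particularly cervical cytology, mainly because large-scale, high-quality annotated datasets are scarce.
Result: The model shows superior performance in fine-grained morphological perception and cell-level diagnostic classification, but the abstract gives no specific benchmarks or quantitative results (such as accuracy) and does not state whether it reaches SOTA.
Insight: The novelty is a three-stage pipeline for synthesizing a high-quality dataset, using general-purpose MLLMs as weak annotators combined with consensus fusion and expert knowledge injection, together with a multi-stage fine-tuning strategy for cytopathology. This offers an effective solution for data-scarce domains.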
Abstract: We present Singpath-VL, a vision-language large model, to fill the vacancy of AI assistant in cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly propelled the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline leverages multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.
[54] SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem cs.CVPDF
Ziqiang Shi, Rujie Liu, Shanshan Yu, Satoshi Munakata, Koichi Shirahata
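TL;DR: This paper proposes SchröMind, a framework that mitigates hallucinations in multimodal large language models by solving the Schrödinger bridge problem. It achieves state-of-the-art performance on the POPE and MME benchmarks while adding only minimal computational overhead.
Details
Motivation: The use of MLLMs in high-stakes fields such as healthcare is limited because generated text often contradicts or ignores the visual input (hallucination). The models can comprehend images but struggle to produce accurate token sequences.
Result: Extensive experiments on the POPE and MME benchmarks show that SchröMind achieves state-of-the-art performance while adding only minimal computational overhead.
Insight: By solving the Schrödinger bridge problem, the method establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, reducing hallucinations while preserving the model's original capabilities.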
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework that reduces hallucinations by solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model’s original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.
[55] DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment cs.CVPDF
Bohan Fu, Guanyi Qin, Fazhan Zhang, Zihao Huang, Mingxuan Li
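TL;DR: This paper proposes DR.Experts, a novel prior-driven blind image quality assessment (BIQA) framework that improves performance by explicitly incorporating distortion priors. It first uses a degradation-aware vision-language model to obtain distortion-specific priors, refines them with the proposed Distortion-Saliency Differential Module, and finally fuses multiple features through the Dynamic Distortion Weighting Module for the final quality prediction.
Details
Motivation: Existing BIQA models fail to capture subtle distortion cues and so misalign with human subjective judgments. The root cause is the lack of reliable distortion priors: models learn only shallow relationships between unified image features and quality scores.
Result: Extensive experiments on five challenging BIQA benchmarks show that DR.Experts outperforms current methods and performs well in generalization and data efficiency.
Insight: The novelty is a prior-driven framework that refines distortion priors with the Distortion-Saliency Differential Module and uses a mixture-of-experts style Dynamic Distortion Weighting Module to weight distortion-specific features, aligning the final prediction with human perception. Viewed objectively, using a vision-language model to obtain distortion priors and designing dedicated modules to distinguish and weight them is an effective way to make a model more sensitive to distortions.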
Abstract: Blind Image Quality Assessment, aiming to replicate human perception of visual quality without reference, plays a key role in vision tasks, yet existing models often fail to effectively capture subtle distortion cues, leading to a misalignment with human subjective judgments. We identify that the root cause of this limitation lies in the lack of reliable distortion priors, as methods typically learn shallow relationships between unified image features and quality scores, resulting in their insensitive nature to distortions and thus limiting their performance. To address this, we introduce DR.Experts, a novel prior-driven BIQA framework designed to explicitly incorporate distortion priors, enabling a reliable quality assessment. DR.Experts begins by leveraging a degradation-aware vision-language model to obtain distortion-specific priors, which are further refined and enhanced by the proposed Distortion-Saliency Differential Module through distinguishing them from semantic attentions, thereby ensuring the genuine representations of distortions. The refined priors, along with semantics and bridging representation, are then fused by a proposed mixture-of-experts style module named the Dynamic Distortion Weighting Module. This mechanism weights each distortion-specific feature as per its perceptual impact, ensuring that the final quality prediction aligns with human perception. Extensive experiments conducted on five challenging BIQA benchmarks demonstrate the superiority of DR.Experts over current methods and showcase its excellence in terms of generalization and data efficiency.
[56] AUHead: Realistic Emotional Talking Head Generation via Action Units Control cs.CVPDF
Jiayi Lyu, Leigang Qu, Wenjing Zhang, Hanyu Jiang, Kai Liu
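TL;DR: This paper proposes AUHead, a two-stage method for generating realistic emotional talking-head videos through Action Unit (AU) control. The first stage uses a large audio-language model, with spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism, to disentangle fine-grained AUs from raw speech. The second stage proposes an AU-driven controllable diffusion model that synthesizes realistic talking-head videos from AU sequences, with an AU disentanglement guidance strategy that provides flexible AU-quality trade-off control at inference time.
Details
Motivation: Current talking-head generation methods struggle with nuanced emotional expression because they lack fine-grained emotion control. The paper addresses this by disentangling AUs from audio and using them for controllable generation.
Result: Results on benchmark datasets show that the method achieves competitive performance in emotional realism, lip-synchronization accuracy, and visual coherence, significantly surpassing existing techniques.
Insight: Contributions include: 1) exploring the AU generation abilities of large audio-language models through spatial-temporal tokenization and a chain-of-thought mechanism; 2) an AU-driven controllable diffusion model that maps AU sequences into a structured 2D facial representation to enhance spatial fidelity; 3) an AU disentanglement guidance strategy at inference that trades off AU control against quality, improving emotional expressiveness and identity consistency.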
Abstract: Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an “emotion-then-AU” chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR
[57] Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination cs.CVPDF
Ziqiang Shi, Rujie Liu, Shanshan Yu, Satoshi Munakata, Koichi Shirahata
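TL;DR: This paper proposes Scalpel, a method for mitigating multimodal hallucination in large vision-language models (LVLMs). It uses a Gaussian mixture model to capture the distribution of attention activations and entropic optimal transport (equivalent to the Schrödinger bridge problem) to map precisely between the hallucination and trust manifolds, dynamically adjusting attention activations toward more credible regions during inference. Experiments show that Scalpel reduces hallucinations across multiple datasets and benchmarks, achieving state-of-the-art performance with no additional computation.
Details
Motivation: Because of the strong prior of large language models (LLMs) and misaligned cross-modal attention, LVLMs often generate outputs inconsistent with the visual content (hallucination). The paper addresses this by refining attention activation distributions.
Result: Extensive experiments across multiple datasets and benchmarks show that Scalpel effectively mitigates hallucinations, outperforms previous methods, and reaches state-of-the-art (SOTA) performance.
Insight: The novelty is modeling the multi-peak distribution of attention activations with a Gaussian mixture model and applying entropic optimal transport (the Schrödinger bridge) to map precisely between the hallucination and trust manifolds, so that attention can be adjusted dynamically at inference. The method is model- and data-agnostic and requires no additional computation, only a single decoding step, which makes it efficient and general.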
Abstract: Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content, termed hallucination. To address this, we propose Scalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to the Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation, only a single decoding step.
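Entropic optimal transport between two discrete weight vectors is usually solved with Sinkhorn iterations. The sketch below shows that generic solver in plain Python, as one way the matching between mixture components could be computed; it is not taken from the paper. The weights `a` and `b` stand in for component weights of the two mixtures, and `cost` for a pairwise distance between components, all of which are illustrative assumptions.

```python
import math

def sinkhorn(a, b, cost, epsilon=0.5, iterations=200):
    """Entropic optimal transport plan between discrete weights a and b.

    a, b:    source and target weights, each summing to 1
    cost:    cost[i][j] of moving mass from source i to target j
    epsilon: entropic regularization strength (larger = smoother plan)

    Returns a matrix P whose rows sum (approximately) to a and whose
    columns sum to b, found by alternating scaling updates.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)]
              for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)]
            for i in range(n)]
```

Each row of the returned plan says how one source component's mass is split across target components; a mitigation step could then move an activation toward the targets its component maps to, weighted by that row.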
[58] Delving into Spectral Clustering with Vision-Language Representations cs.CVPDF
Bo Peng, Yuanwei Hu, Bo Liu, Ling Chen, Jie Lu
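TL;DR: This paper proposes a multimodal spectral clustering method built on pre-trained vision-language models, called Neural Tangent Kernel Spectral Clustering. It exploits the cross-modal alignment in pre-trained vision-language models, constructs the affinity matrix by coupling images' visual proximity with their semantic overlap, and introduces a regularized affinity diffusion mechanism that adaptively ensembles the affinity matrices induced by different prompts. Experiments on 16 benchmarks show that the method outperforms the state of the art by a large margin.
Details
Motivation: Most spectral clustering methods are driven by a single modality and leave the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, the paper extends spectral clustering from the single-modal to the multi-modal regime and exploits cross-modal alignment to improve clustering.
Result: Extensive experiments on 16 benchmarks, including classical, large-scale, fine-grained, and domain-shifted datasets, show that the method consistently outperforms the current state of the art (SOTA) by a large margin.
Insight: The novelty is combining the neural tangent kernel with pre-trained vision-language models: anchoring the kernel with semantically close positive nouns yields an affinity that strengthens within-cluster connections and suppresses spurious cross-cluster ones, encouraging block-diagonal structure. In addition, the regularized affinity diffusion mechanism adaptively ensembles the affinity matrices induced by multiple prompts, improving robustness and generalization.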
Abstract: Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. Particularly, we propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks (including classical, large-scale, fine-grained and domain-shifted datasets) show that our method consistently outperforms the state-of-the-art by a large margin.
[59] MieDB-100k: A Comprehensive Dataset for Medical Image Editing cs.CV | cs.AIPDF
Yongfan Lai, Wen Qian, Bo Liu, Hongyan Li, Hao Luo
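TL;DR: This paper presents MieDB-100k, a large-scale, high-quality, and diverse dataset for text-guided medical image editing. It addresses existing datasets' limited diversity, neglect of medical image understanding, and difficulty balancing quality with scale. The dataset is built by combining expert models with rule-based data synthesis and undergoes rigorous manual inspection to ensure clinical fidelity. Experiments show that models trained on it outperform existing open-source and proprietary models and generalize well.
Details
Motivation: The scarcity of high-quality data is the main bottleneck in adapting multimodal generative models to medical image editing, and existing datasets fall short in diversity, medical image understanding, and the balance between quality and scalability.
Result: In extensive experiments, models trained on MieDB-100k consistently outperform open-source and proprietary models and show strong generalization.
Insight: The novelty is a dataset construction framework that categorizes editing tasks from three perspectives (Perception, Modification, and Transformation), plus a curation pipeline that combines modality-specific expert models with rule-based synthesis, followed by rigorous manual inspection, ensuring both clinical fidelity and scale. This provides a high-quality foundational dataset for specialized medical image editing.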
Abstract: The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding, and an inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthesis methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
[60] Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures cs.CVPDF
Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen
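TL;DR: Hand2World is an autoregressive framework that generates egocentric interaction videos from a single scene image and free-space hand gestures. It achieves occlusion-invariant hand conditioning by projecting 3D hand meshes, injects camera geometry through Plücker-ray embeddings to stabilize viewpoint changes, and distills a bidirectional diffusion model to enable arbitrary-length synthesis.
Details
Motivation: Egocentric interaction generation faces three key challenges: the distribution shift between free-space gestures and contact-heavy training data, the ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation.
Result: On three egocentric interaction benchmarks, perceptual quality and 3D consistency improve substantially, and the method supports camera control and long-horizon interactive generation.
Insight: Contributions include occlusion-invariant conditioning based on projected 3D hand meshes, explicit injection of camera geometry through Plücker-ray embeddings to disentangle camera motion from hand motion, and distillation of a bidirectional diffusion model into a causal generator for long-sequence generation.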
Abstract: Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
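A Plücker-ray embedding represents each pixel's viewing ray by its unit direction d and its moment o × d, where o is the camera center, giving six numbers per pixel. The following is a minimal sketch of that standard construction; the (direction, moment) ordering, the pinhole intrinsics, and the function names are illustrative assumptions, and the paper's exact convention is not stated in the abstract.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (d, o x d) of a ray.

    The direction is normalized first, so the result does not depend on
    the length of the input direction or on where along the ray the
    origin is taken.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    return d + cross(origin, d)

def pixel_ray_direction(u, v, fx, fy, cx, cy):
    """Camera-frame ray direction through pixel (u, v) for a pinhole camera.

    Assumes no lens distortion; rotate by the camera pose to obtain the
    world-frame direction before calling plucker_ray.
    """
    return ((u - cx) / fx, (v - cy) / fy, 1.0)
```

Because the moment changes when the camera translates and the direction changes when it rotates, a per-pixel map of these six values gives the generator an explicit signal for camera motion that is separate from the hand conditioning.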
[61] Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing cs.CVPDF
Jialun Liu, Yukuo Ma, Xiao Cao, Tian Li, Gonghu Shang
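TL;DR: This paper proposes Tele-Omni, a unified multimodal framework for video generation and editing. It follows instructions in multiple modalities, including text, images, and reference videos: pretrained multimodal large language models parse the instructions and infer structured intents, and diffusion-based generators synthesize high-quality video.
Details
Motivation: Most diffusion-based video generation methods are task-specific and rely mainly on textual instructions, so they struggle to handle multimodal inputs, contextual references, and diverse generation and editing scenarios within one framework. Many video editing methods also depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability.
Result: Experiments show that Tele-Omni achieves competitive performance across multiple tasks: text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing.
Insight: The core innovation is decoupling instruction parsing from video synthesis and combining it with task-aware data design, which gives flexible multimodal control while maintaining strong temporal coherence and visual consistency. The task-aware data processing pipeline unifies heterogeneous inputs into a structured instruction format, enabling joint training across heterogeneous video tasks.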
Abstract: Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.
[62] AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models cs.CV | cs.AI | cs.CRPDF
Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo
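TL;DR: This paper proposes AGMark, an attention-guided dynamic watermarking framework for large vision-language models (LVLMs), addressing the risk that existing watermarking methods damage visual fidelity. By dynamically identifying semantic-critical evidence and jointly considering uncertainty awareness and evidence calibration, it adaptively selects protected tokens, markedly improving generation quality and visual semantic fidelity while keeping detection accuracy and robustness high.
Details
Motivation: Existing watermarking methods introduce visually irrelevant tokens, disrupt visual grounding, and rely on static weight estimation that ignores how visual dependence changes during generation, which degrades generation quality and produces low-quality tokens.
Result: AGMark outperforms conventional methods in experiments, clearly improving generation quality and especially visual semantic fidelity in the later stages of generation, while maintaining highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency.
Insight: The novelty is dynamically identifying semantic-critical evidence (from attention weights and context-aware cues) and jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density) to partition the vocabulary adaptively. This avoids introducing irrelevant tokens and so preserves visual fidelity while protecting intellectual property.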
Abstract: Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, observably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.
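The detection and robustness figures above are reported as AUC, the probability that a randomly chosen watermarked sample receives a higher detector score than a randomly chosen unwatermarked one. The sketch below computes that statistic directly from two score lists; the score values and the function name are hypothetical, and the detector that would produce such scores is not specified in the abstract.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    positive_scores: detector scores for watermarked text (hypothetical)
    negative_scores: detector scores for unwatermarked text (hypothetical)

    Counts the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half. Quadratic in the number of samples, which
    is fine for an illustration.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

An AUC of 0.5 means the detector cannot tell the two groups apart, and 1.0 means every watermarked sample outranks every unwatermarked one. Robustness under attack is typically measured the same way, with the positive scores taken from attacked watermarked text.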
[63] Towards Training-free Multimodal Hate Localisation with Large Language Models cs.CV | cs.MMPDF
Yueming Sun, Long Yang, Jianbo Jiao, Zeyu Fu
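TL;DR: This paper proposes LELA, the first training-free, LLM-based framework for multimodal hate video localization. It decomposes a video into five modalities (image, speech, OCR, music, and video context), uses a multi-stage prompting scheme to compute fine-grained hate scores for each frame, and adds a composition matching mechanism to strengthen cross-modal reasoning. On the HateMM and MultiHateClip benchmarks it significantly outperforms existing training-free baselines.
Details
Motivation: The proliferation of hateful content in online videos seriously threatens individual well-being and societal harmony, while existing video hate detection methods either rely heavily on large-scale human annotation or lack fine-grained temporal precision.
Result: Experiments on two challenging benchmarks, HateMM and MultiHateClip, show that LELA outperforms all existing training-free baselines by a large margin.
Insight: The novelty is the first training-free LLM framework for multimodal hate video localization. Through multimodal decomposition, multi-stage prompting, and composition matching, it achieves fine-grained temporal localization and scalable, interpretable detection without the dependence on annotated data that supervised learning requires.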
Abstract: The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
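Once each frame has a hate score, temporal localization reduces to turning the score sequence into time segments. The sketch below shows one simple way to do that, by thresholding and keeping runs of a minimum length. The threshold, the minimum length, and the function name are hypothetical choices made for illustration; the abstract does not describe how LELA converts scores into segments.

```python
def localize_segments(frame_scores, threshold=0.5, min_length=2):
    """Group consecutive high-scoring frames into (start, end) segments.

    frame_scores: per-frame scores in [0, 1] (hypothetical scale)
    threshold:    frames scoring at or above this value count as flagged
    min_length:   runs shorter than this many frames are discarded

    Returns inclusive frame-index pairs.
    """
    segments = []
    start = None
    for index, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = index  # a flagged run begins
        elif score < threshold and start is not None:
            if index - start >= min_length:
                segments.append((start, index - 1))
            start = None  # the run has ended
    # Close a run that extends to the final frame.
    if start is not None and len(frame_scores) - start >= min_length:
        segments.append((start, len(frame_scores) - 1))
    return segments
```

Dividing the returned frame indices by the sampling rate gives timestamps, and the per-frame scores behind each segment remain available for inspection.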
[64] VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model cs.CVPDF
Hanqing Wang, Mingyu Liu, Xiaoyu Chen, Chengwei MA, Yiming Zhong
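TL;DR: The paper proposes VideoAfford, which learns dynamic interaction priors from human-object-interaction (HOI) videos to address 3D affordance grounding. The authors build a large-scale video dataset named VIDA and design a unified framework based on multimodal large language models that combines world knowledge reasoning with fine-grained affordance segmentation, highlighting actionable regions on 3D objects.
Details
Motivation: Prior work learns affordance knowledge mainly from static cues such as language and images, which struggle to provide dynamic interaction context that reveals temporal and causal cues. This limits accurate understanding of actionable regions on 3D objects in robotic manipulation.
Result: Extensive experimental evaluations show that the model significantly outperforms well-established methods and exhibits strong open-world generalization and affordance reasoning abilities.
Insight: Contributions include: 1) VIDA, a large-scale video-based 3D affordance dataset; 2) a unified multimodal large language model framework that integrates world knowledge reasoning and fine-grained segmentation; 3) a latent action encoder that extracts dynamic interaction priors from HOI videos; 4) a spatial-aware loss function that strengthens 3D spatial knowledge. Viewed objectively, combining dynamic video information with 3D point clouds and fusing knowledge through a large language model is an effective way to improve affordance understanding.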
Abstract: 3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline: VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function to enable VideoAfford to obtain comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.
[65] Time2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation cs.CVPDF
Siyu Chen, Ting Han, Haoling Huang, Chaolei Wang, Chengzheng Fu
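TL;DR: This paper proposes Time2General, a framework that addresses domain shift and temporal-sampling shift in domain-generalized video semantic segmentation (DGVSS). Built on Stability Queries, it aggregates multi-frame context with a Spatio-Temporal Memory Decoder and uses a Masked Temporal Consistency Loss to suppress flicker and improve robustness to varying sampling rates.
Details
Motivation: When a model trained on a single labeled driving domain is deployed on unseen domains, domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing frame-to-frame flicker in predictions over video streams.
Result: Extensive experiments on multiple driving benchmarks show that Time2General substantially improves cross-domain accuracy and temporal stability over prior domain-generalized semantic segmentation and video semantic segmentation baselines while running at up to 18 FPS.
Insight: Contributions include a Spatio-Temporal Memory Decoder that avoids explicit correspondence propagation, and a Masked Temporal Consistency Loss that regularizes temporal prediction discrepancies across strides, with training strides randomized to make the model robust to diverse temporal gaps.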
Abstract: Domain Generalized Video Semantic Segmentation (DGVSS) is trained on a single labeled driving domain and is directly deployed on unseen domains without target labels and test-time adaptation while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, the Masked Temporal Consistency Loss is proposed to regularize temporal prediction discrepancies across different strides, and randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
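The flicker that the abstract describes, predictions changing between frames even where the true label does not, can be quantified with a simple masked disagreement rate at a chosen stride. The sketch below is an evaluation-style illustration of that idea, not the paper's Masked Temporal Consistency Loss. The flat per-frame label lists, the use of ground-truth labels to define "label-stable" pixels, and the function name are all assumptions.

```python
def masked_flicker_rate(predictions, labels, stride=1):
    """Fraction of label-stable pixels whose prediction changes.

    predictions: list of frames, each a flat list of predicted class ids
    labels:      list of frames, each a flat list of reference class ids
    stride:      temporal gap between the two frames being compared

    Only pixels whose reference label is identical in both frames are
    counted, so genuine scene changes are not penalized.
    """
    changed = 0
    counted = 0
    for t in range(len(predictions) - stride):
        frame_pairs = zip(predictions[t], predictions[t + stride],
                          labels[t], labels[t + stride])
        for pred_a, pred_b, label_a, label_b in frame_pairs:
            if label_a == label_b:  # label-stable pixel
                counted += 1
                if pred_a != pred_b:
                    changed += 1
    return changed / counted if counted else 0.0
```

Evaluating the same clip at several strides shows why sampling rate matters: a prediction that flips for a single frame registers at stride 1 but can vanish at stride 2.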
[66] Semi-supervised Liver Segmentation and Patch-based Fibrosis Staging with Registration-aided Multi-parametric MRI cs.CVPDF
Boya Wang, Ruizhe Li, Chao Chen, Xin Chen
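TL;DR: This study proposes a multi-task deep learning framework for liver segmentation and liver fibrosis staging that handles multi-parametric MRI data through semi-supervised learning and registration, and evaluates it on the independent test set of the CARE Liver 2025 challenge.
Details
Motivation: Liver fibrosis diagnosis is challenging in clinical practice, particularly because multi-parametric MRI data come with limited annotations, cross-modality domain shift, and variability.
Result: The method was evaluated on the independent test set of the CARE Liver 2025 Track 4 challenge, covering in-distribution and out-of-distribution cases with three-channel and seven-channel MRI data. The abstract reports no specific performance metrics.
Insight: Contributions include a semi-supervised model that integrates image segmentation and registration to exploit unlabeled data, and a patch-based classification method that allows visualization of liver fibrosis stages, handling multi-modality data and domain shift effectively.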
Abstract: Liver fibrosis poses a substantial challenge in clinical practice, emphasizing the necessity for precise liver segmentation and accurate disease staging. Based on the CARE Liver 2025 Track 4 Challenge, this study introduces a multi-task deep learning framework developed for liver segmentation (LiSeg) and liver fibrosis staging (LiFS) using multi-parametric MRI. The LiSeg phase addresses the challenge of limited annotated images and the complexities of multi-parametric MRI data by employing a semi-supervised learning model that integrates image segmentation and registration. By leveraging both labeled and unlabeled data, the model overcomes the difficulties introduced by domain shifts and variations across modalities. In the LiFS phase, we employed a patch-based method which allows the visualization of liver fibrosis stages based on the classification outputs. Our approach effectively handles multimodality imaging data, limited labels, and domain shifts. The proposed method has been tested by the challenge organizer on an independent test set that includes in-distribution (ID) and out-of-distribution (OOD) cases using three-channel MRIs (T1, T2, DWI) and seven-channel MRIs (T1, T2, DWI, GED1-GED4). The code is freely available. GitHub link: https://github.com/mileywang3061/Care-Liver
[67] GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation cs.CV | cs.AIPDF
Sandesh Hegde, Jaison Saji Chacko, Debarshi Banerjee, Uma Mahesh
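TL;DR: This paper proposes GenSeg-R1, a framework for fine-grained referring segmentation. It uses a decoupled reason-then-segment pipeline: a vision-language model (VLM) first produces structured spatial prompts (bounding boxes and keypoints) from the image and natural-language query, and a frozen promptable segmenter (SAM 2) then converts these prompts into high-quality masks. The Qwen3-VL models are fine-tuned with reinforcement learning (GRPO), with no supervised reasoning-chain annotations.
Details
Motivation: Fine-grained referring image segmentation aims to segment specific instances in an image from a natural-language query. Existing methods may depend on supervised annotations or lack the ability to detect no-target queries, so the paper uses reinforcement-learning-driven vision-language grounding to improve segmentation accuracy and robustness.
Result: On RefCOCOg validation, GenSeg-R1-8B achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the Qwen3-VL Instruct baselines (by 15.3 and 21.9 points, respectively) and surpassing Seg-Zero-7B by 3.3 cIoU. On GRefCOCO validation, GenSeg-R1-G achieves 76.69% target mIoU and 82.40% accuracy on no-target prompts, outperforming Seg-R1-7B and Seg-Zero-7B. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by 7.0 points and Seg-R1-7B by 10.7 points.
Insight: Contributions include: 1) a decoupled reason-then-segment pipeline that pairs the VLM's structured prompt generation with the frozen SAM 2 segmenter to improve mask quality; 2) GRPO fine-tuning of the VLM with no reasoning-chain supervision, which reduces data requirements; 3) a SAM 2 in-the-loop reward that directly optimizes mask quality, along with better detection of no-target queries. Viewed objectively, the method combines the strengths of large vision-language models and promptable segmenters, uses reinforcement learning for efficient vision-language grounding, and offers a scalable solution for referring segmentation.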
Abstract: We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
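The abstract reports both cIoU and mIoU without defining them. As a minimal sketch, assuming the convention common in referring-segmentation work (cIoU as total intersection over total union across the dataset, mIoU as the mean of per-sample IoUs), the two metrics could be computed as follows. The function name `ciou_and_miou` and the choice to score an empty prediction against an empty ground truth as 1.0 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ciou_and_miou(preds, gts):
    """Return (cIoU, mIoU) for paired lists of binary masks.

    Assumed definitions (not taken from the paper):
      cIoU = sum of intersections / sum of unions over all samples
      mIoU = mean of the per-sample IoU values
    """
    inter_total = 0
    union_total = 0
    per_sample = []
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        inter_total += inter
        union_total += union
        # Assumption: two empty masks count as a perfect match.
        per_sample.append(inter / union if union > 0 else 1.0)
    ciou = inter_total / union_total if union_total > 0 else 1.0
    return float(ciou), float(np.mean(per_sample))

# Two toy 2x2 samples: the first over-segments, the second is exact.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
ciou, miou = ciou_and_miou(preds, gts)
```

Because cIoU pools pixels across the dataset, large objects dominate it, whereas mIoU weights every sample equally; that is one reason the two numbers can move differently between models.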
[68] Toward Fine-Grained Facial Control in 3D Talking Head Generation cs.CVPDF
Shaoyang Xie, Xiaofeng Cong, Baosheng Yu, Zhipeng Gui, Jie Gui
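TL;DR: This paper proposes FG-3DGS, a new framework for 3D talking head generation with fine-grained facial control. A frequency-aware disentanglement strategy models low-frequency and high-frequency facial motion regions separately, and a high-frequency-refined post-rendering alignment mechanism addresses inaccurate lip synchronization and facial jitter, yielding temporally consistent, high-fidelity talking head videos.
Details
Motivation: Current talking head methods based on 3D Gaussian Splatting still struggle with precise fine-grained control of facial motion, particularly inaccurate lip synchronization and facial jitter, which can contribute to the uncanny valley effect.
Result: Extensive experiments on widely used talking head datasets show that the method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
Insight: The novelty lies in a frequency-aware disentanglement strategy that explicitly models facial regions by their motion characteristics, together with a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs, that improves per-frame generation and lip-sync accuracy.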
Abstract: Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
[69] Robust Vision Systems for Connected and Autonomous Vehicles: Security Challenges and Attack Vectors cs.CVPDF
Sandeep Gupta, Roberto Passerone
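TL;DR: This article studies the robustness of vision systems in Connected and Autonomous Vehicles (CAVs), which is critical for reaching Level-5 autonomous driving. It analyzes the key sensors and vision components needed for CAV navigation, derives a reference architecture for the CAV vision system, and uses it to identify potential attack surfaces. It then details the attack vectors targeting each attack surface and evaluates their implications for confidentiality, integrity, and availability (CIA).
Details
Motivation: The work addresses the security of CAV vision systems, since safe and reliable CAV navigation depends heavily on robust vision systems that accurately detect objects, lane markings, and traffic signage.
Result: The study reports no quantitative experiments or benchmarks; instead it derives a reference architecture and identifies attack vectors through analysis, providing a basis for formulating security measures.
Insight: The novelty lies in proposing a reference architecture for CAV vision systems and systematically identifying and evaluating the attack vectors against it, along with their impact on the CIA triad, which helps in understanding the security challenges of vision systems and in designing protections.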
Abstract: This article investigates the robustness of vision systems in Connected and Autonomous Vehicles (CAVs), which is critical for developing Level-5 autonomous driving capabilities. Safe and reliable CAV navigation undeniably depends on robust vision systems that enable accurate detection of objects, lane markings, and traffic signage. We analyze the key sensors and vision components essential for CAV navigation to derive a reference architecture for CAV vision system (CAVVS). This reference architecture provides a basis for identifying potential attack surfaces of CAVVS. Subsequently, we elaborate on identified attack vectors targeting each attack surface, rigorously evaluating their implications for confidentiality, integrity, and availability (CIA). Our study provides a comprehensive understanding of attack vector dynamics in vision systems, which is crucial for formulating robust security measures that can uphold the principles of the CIA triad.
[70] Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets cs.CVPDF
Abhipsa Basu, Yugam Bahl, Kirti Bhagat, Preethi Seshadri, R. Venkatesh Babu
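TL;DR: By analyzing English image captions from three large multimodal datasets (Re-LAION, DataComp1B, and Conceptual Captions) and extracting location information with LLMs, this paper geographically profiles the datasets and reveals a severe geographic imbalance in training data: the United States, the United Kingdom, and Canada account for 48.0% of samples, while South American and African countries account for only 1.8% and 3.8%, respectively. A country's GDP correlates strongly with its representation in the data (ρ = 0.82), and non-English subsets likewise skew toward the countries where those languages are predominantly spoken. In addition, higher representation does not imply greater visual or semantic diversity, and images generated by Stable Diffusion v1.3 trained on Re-LAION look realistic but cover far less than real-world images.
Details
Motivation: The paper addresses the poor geographic representativeness of images generated by text-to-image models by investigating where their training data comes from, in order to assess dataset representativeness and potential bias.
Result: Across three widely used datasets, an analysis of English captions for 20 common entities shows that the geographic distribution is heavily skewed toward high-GDP countries (the US, UK, and Canada account for nearly half of the samples) and correlates strongly with national GDP (ρ = 0.82). Non-English subsets show a similar skew, and images generated by Stable Diffusion v1.3 have limited coverage.
Insight: The novelty lies in using LLMs to extract locations from image captions and geographically profile datasets at scale. The study exposes the geographic bias in multimodal datasets and its strong link to GDP, shows that a large volume of data does not imply high diversity, and provides both a methodology and empirical evidence for assessing and improving dataset representativeness.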
Abstract: Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across $20$ common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for $48.0\%$ of samples, while South American and African countries are severely under-represented with only $1.8\%$ and $3.8\%$ of images, respectively. We observe a strong correlation between a country’s GDP and its representation in the data ($\rho = 0.82$). Examining non-English subsets for $4$ languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
[71] SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing cs.CVPDF
Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang
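TL;DR: This paper proposes SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation. Built from real scientific PDFs, it uses a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison, so the structural correctness of pixel-level outputs is evaluated directly. Experiments show that preserving structural correctness, especially for diagrams with complex topology, remains a fundamental challenge.
Details
Motivation: Scientific diagrams produced by modern text-to-image models are often visually plausible but structurally incorrect. Existing benchmarks either rely on image-centric or subjective metrics that are insensitive to structure, or evaluate intermediate symbolic representations rather than the final rendered image, leaving pixel-based diagram generation insufficiently evaluated.
Result: Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology.
Insight: The novelty lies in a structure-first benchmark, SciFlow-Bench, whose closed-loop inverse-parsing protocol evaluates generated pixel images by structural recoverability rather than visual similarity alone. This is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning.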
Abstract: Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
[72] CompSplat: Compression-aware 3D Gaussian Splatting for Real-world Video cs.CVPDF
Hojun Song, Heejung Choi, Aro Kim, Chae-yeong Song, Gahyeon Kim
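TL;DR: CompSplat is a compression-aware 3D Gaussian Splatting framework for real-world video. It targets the irregular camera trajectories, unknown poses, and compression-induced geometric distortion found in long video sequences. By explicitly modeling frame-wise compression characteristics, it reduces inter-frame inconsistency and accumulated geometric error and improves reconstruction quality.
Details
Motivation: Real-world videos typically contain long sequences with irregular camera trajectories and unknown poses, which cause pose drift, feature misalignment, and geometric distortion during reconstruction. Lossy compression adds further inconsistencies that degrade geometry and rendering quality. Existing methods have not sufficiently explored the diverse compression patterns in long videos.
Result: On challenging benchmarks including Tanks and Temples, Free, and Hike, CompSplat achieves state-of-the-art rendering quality and pose accuracy under severe compression, significantly surpassing the most recent NVS methods.
Insight: The contributions include compression-aware frame weighting and an adaptive pruning strategy that improve robustness and geometric consistency. Viewed objectively, the method systematically builds compression modeling into 3D Gaussian Splatting training for long video sequences and effectively mitigates compression-induced accumulated error.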
Abstract: High-quality novel view synthesis (NVS) from real-world videos is crucial for applications such as cultural heritage preservation, digital twins, and immersive media. However, real-world videos typically contain long sequences with irregular camera trajectories and unknown poses, leading to pose drift, feature misalignment, and geometric distortion during reconstruction. Moreover, lossy compression amplifies these issues by introducing inconsistencies that gradually degrade geometry and rendering quality. While recent studies have addressed either long-sequence NVS or unposed reconstruction, compression-aware approaches still focus on specific artifacts or limited scenarios, leaving diverse compression patterns in long videos insufficiently explored. In this paper, we propose CompSplat, a compression-aware training framework that explicitly models frame-wise compression characteristics to mitigate inter-frame inconsistency and accumulated geometric errors. CompSplat incorporates compression-aware frame weighting and an adaptive pruning strategy to enhance robustness and geometric consistency, particularly under heavy compression. Extensive experiments on challenging benchmarks, including Tanks and Temples, Free, and Hike, demonstrate that CompSplat achieves state-of-the-art rendering quality and pose accuracy, significantly surpassing most recent state-of-the-art NVS approaches under severe compression conditions.
[73] SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding cs.CVPDF
Zhaoxu Li, Chenqi Kong, Peijun Bao, Song Xia, Yi Tu
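TL;DR: Targeting hallucination in Large Vision-Language Models (LVLMs), this paper proposes SAKED (Stability-Aware Knowledge-Enhanced Decoding), a training-free method. It analyzes instability in the model's internal knowledge (fluctuations across attention heads, model layers, and decoding tokens), introduces a layer-wise Knowledge Stability Score (KSS) to quantify that stability, and contrasts the most stability-aware and stability-agnostic layers to suppress decoding noise and dynamically use the most reliable internal knowledge when generating faithful tokens.
Details
Motivation: Hallucination, meaning generated content that is unfaithful to the input image, creates security and reliability risks for LVLMs in real-world applications. Motivated by the observation that humans are more error-prone when uncertain or hesitant, the paper studies how instability in a model's internal knowledge leads to hallucination.
Result: Extensive experiments show that SAKED achieves state-of-the-art hallucination mitigation across various models, tasks, and benchmarks (such as POPE, MME, and LLaVA-Bench).
Insight: The novelty lies in systematically analyzing hallucination patterns from the new perspective of internal knowledge stability (visual activation drift across attention heads, knowledge fluctuations across layers, and visual focus distraction between neighboring output tokens) and, on that basis, proposing a training-free, plug-and-play decoding strategy (SAKED) that quantifies and exploits differences in layer-wise knowledge stability to improve the faithfulness of generation.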
Abstract: Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.
[74] ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge cs.CVPDF
Yijie Lin, Guofeng Ding, Haochen Zhou, Haobin Li, Mouxing Yang
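TL;DR: ARK is a dual-axis multimodal retrieval benchmark that evaluates models along two complementary dimensions, knowledge domains and reasoning skills. It covers 5 knowledge domains (17 subtypes) and 6 categories of reasoning skills, includes 16 heterogeneous visual data types, and uses targeted hard negatives to prevent shortcut matching.
Details
Motivation: Existing multimodal retrieval benchmarks mainly focus on semantic matching over daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. ARK aims to fill this gap.
Result: The authors evaluate 23 representative text-based and multimodal retrievers and find a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning as persistent bottlenecks. Simple re-ranking and query rewriting bring consistent improvements, but substantial headroom remains.
Insight: The novelty lies in a dual-axis evaluation framework spanning knowledge domains and reasoning skills, with targeted hard negatives that force multi-step reasoning. Viewed objectively, the benchmark offers a systematic tool for diagnosing the fine-grained capabilities of multimodal retrievers in professional settings.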
Abstract: Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
[75] Kelix Technique Report cs.CVPDF
Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao
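TL;DR: Kelix is a fully discrete autoregressive unified model that uses discrete visual tokenization for unified understanding and generation of multimodal data, with the goal of closing the understanding gap between discrete and continuous visual representations.
Details
Motivation: Existing vision-language models (VLMs) rely on continuous visual features, which biases them toward understanding and prevents them from fully exploiting large-scale self-supervised learning on non-text data. Discrete visual tokens, in turn, lose information because of limited code capacity and therefore yield weaker understanding.
Result: Kelix narrows the understanding gap between discrete and continuous visual representations; the abstract gives no specific benchmarks or quantitative results.
Insight: The novelty lies in a fully discrete autoregressive unified model whose improved discrete visual tokenization retains more information, supporting unified understanding and generation of multimodal data and extending the autoregressive LLM paradigm.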
Abstract: Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
[76] Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection cs.CVPDF
Peng Chen, Chao Huang, Yunkang Cao, Chengliang Liu, Wenqiang Wang
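TL;DR: The paper proposes Reason-IAD, a framework that brings in category-specific textual descriptions through a retrieval-augmented knowledge module and combines an entropy-driven latent reasoning mechanism with a dynamic visual injection strategy to reason iteratively in a compact latent space, with the aim of improving the accuracy and interpretability of industrial anomaly detection.
Details
Motivation: Existing multimodal large language models pretrained on general-domain data struggle to capture the fine-grained, category-specific defect patterns of industrial settings, which limits detection accuracy and interpretability.
Result: Extensive experiments show that Reason-IAD consistently outperforms state-of-the-art methods on industrial anomaly detection.
Insight: The contributions are: 1) a retrieval-augmented knowledge module that enables domain-aware reasoning; 2) an entropy-driven latent reasoning mechanism that explores iteratively with optimizable latent think tokens; 3) a dynamic visual injection strategy that selectively attends to the most informative image regions. Together these strengthen the model's reasoning about specific defects and the stability of its decisions.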
Abstract: Industrial anomaly detection demands precise reasoning over fine-grained defect patterns. However, existing multimodal large language models (MLLMs), pretrained on general-domain data, often struggle to capture category-specific anomalies, thereby limiting both detection accuracy and interpretability. To address these limitations, we propose Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Reason-IAD comprises two core components. First, a retrieval-augmented knowledge module incorporates category-specific textual descriptions into the model input, enabling context-aware reasoning over domain-specific defects. Second, an entropy-driven latent reasoning mechanism conducts iterative exploration within a compact latent space using optimizable latent think tokens, guided by an entropy-based reward that encourages confident and stable predictions. Furthermore, a dynamic visual injection strategy selectively incorporates the most informative image patches into the latent sequence, directing the reasoning process toward regions critical for anomaly detection. Extensive experimental results demonstrate that Reason-IAD consistently outperforms state-of-the-art methods. The code will be publicly available at https://github.com/chenpeng052/Reason-IAD.
[77] Code2World: A GUI World Model via Renderable Code Generation cs.CV | cs.AI | cs.CL | cs.HCPDF
Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu
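TL;DR: The paper proposes Code2World, a world model that simulates the next visual state of a graphical user interface (GUI) by generating renderable code, addressing the trade-off between visual fidelity and structural controllability in existing text- and pixel-based methods.
Details
Motivation: Existing text- and pixel-based GUI world models struggle to achieve high visual fidelity and fine-grained structural controllability at the same time, which limits the predictive ability of autonomous GUI agents.
Result: Code2World-8B achieves top performance on next-UI prediction, rivaling GPT-5 and Gemini-3-Pro-Image, and boosts Gemini-2.5-Flash by +9.5% on AndroidWorld navigation.
Insight: The novelty lies in building a dataset by translating GUI trajectories into high-fidelity HTML code and in applying render-aware reinforcement learning, which uses the rendered outcome as the reward signal to optimize visual semantic fidelity and action consistency, yielding high-fidelity and controllable GUI state prediction.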
Abstract: Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.
[78] Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence cs.CVPDF
Xiaoyue Ling, Chuqin Zhou, Chunyi Li, Yunuo Chen, Yuan Tian
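TL;DR: This paper proposes Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior, achieving high perceptual quality and temporal coherence at ultra-low bitrates.
Details
Motivation: Existing generative video compression methods make limited use of temporal correlations, which causes noticeable flicker and degraded temporal coherence at ultra-low bitrates. The paper aims to solve this problem.
Result: Experiments show that Free-GVC achieves an average 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
Insight: The novelty lies in a training-free framework with two modules: an Adaptive Quality Control module that builds an online rate-perception surrogate model to predict the optimal diffusion step for each GOP, and an Inter-GOP Alignment module that performs latent fusion to improve temporal coherence. The design avoids training overhead and effectively mitigates flicker.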
Abstract: Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
[79] MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation cs.CVPDF
Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang
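TL;DR: This paper proposes MVISTA-4D, an embodied 4D world model for robotic manipulation. Given only a single-view RGBD observation, the model generates geometrically consistent RGBD sequences from arbitrary viewpoints and fuses the views into a more complete spatiotemporal 3D structure. The paper also proposes a test-time action optimization strategy that backpropagates through the generative model to infer the trajectory-level latent best matching the predicted future, and combines it with a residual inverse dynamics model to produce executable actions.
Details
Motivation: Existing world-model-based manipulation methods typically support either purely image-based forecasting or reasoning over partial 3D geometry, which limits their ability to predict complete 4D (3D plus time) scene dynamics. The paper aims to remove this limitation with a 4D world model that generates geometrically consistent multi-view output, and to address the ill-posedness of inverse dynamics when mapping predicted futures to actions, since the same state transition can correspond to multiple actions.
Result: Experiments on three datasets show strong performance on both 4D scene generation and downstream manipulation, and ablation studies validate the key design choices.
Insight: The main contributions are: 1) an explicit cross-view and cross-modality feature fusion mechanism that jointly encourages consistency between RGB and depth and enforces geometric alignment across views; 2) a novel test-time action optimization strategy that infers a trajectory-level latent by backpropagating through the generative model and, together with a residual inverse dynamics model, addresses the ill-posed mapping from predicted states to actions, producing more accurate executable actions.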
Abstract: World-model-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
[80] AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization cs.CVPDF
Shaoqiu Zhang, Zizhong Ding, Kaicheng Yang, Junyi Wu, Xianglong Yan
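TL;DR: This paper proposes AdaTSQ, a post-training quantization framework intended to make Diffusion Transformers (DiTs) more efficient to deploy on edge devices. Based on an analysis of the temporal dynamics specific to diffusion processes, it introduces a Pareto-aware timestep-dynamic bit-width allocation strategy and a Fisher-guided temporal calibration mechanism that compress the model while preserving generation quality.
Details
Motivation: Diffusion Transformers are the state-of-the-art backbone for image and video generation, but their massive computational cost and memory footprint hinder deployment on edge devices. Existing post-training quantization methods perform poorly when applied directly to DiTs because they ignore the temporal dynamics inherent in diffusion processes.
Result: Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) show that AdaTSQ significantly outperforms state-of-the-art methods such as SVDQuant and ViDiT-Q and pushes the Pareto frontier of efficiency and quality.
Insight: The novelty lies in systematically bringing the temporal sensitivity of diffusion models into the quantization process through timestep-dynamic bit-width allocation and Fisher-guided temporal calibration. This suggests a broader idea for quantizing dynamic temporal models: use the temporal structure intrinsic to the task to guide the compression strategy.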
Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at https://github.com/Qiushao-E/AdaTSQ.
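The abstract frames bit-width allocation as a constrained pathfinding problem solved with beam search. The following is a minimal sketch of that idea only, under strong simplifying assumptions that do not come from the paper: one bit-width per timestep (rather than per layer and timestep), a precomputed additive error table `error[t][b]` standing in for the end-to-end reconstruction error, and a hard cap `total_bits` on the summed bit-widths. All names here are hypothetical.

```python
def beam_search_bits(error, choices, total_bits, beam_width=4):
    """Pick one bit-width per timestep under a total-bit budget.

    error[t][b] is a hypothetical proxy error for running timestep t at
    bit-width b. Partial paths are pruned when even the cheapest
    completion would exceed the budget, then the lowest-error
    `beam_width` paths survive to the next timestep.
    """
    num_steps = len(error)
    min_bits = min(choices)
    beams = [(0.0, 0, [])]  # (accumulated error, bits used, path)
    for t in range(num_steps):
        remaining = num_steps - t - 1
        candidates = []
        for err, used, path in beams:
            for b in choices:
                new_used = used + b
                # Constraint check: can the budget still be met?
                if new_used + remaining * min_bits > total_bits:
                    continue
                candidates.append((err + error[t][b], new_used, path + [b]))
        if not candidates:
            raise ValueError("bit budget is infeasible for these choices")
        candidates.sort(key=lambda c: c[0])
        beams = candidates[:beam_width]
    best_err, _, best_path = beams[0]
    return best_path, best_err

# Toy table: timesteps 0 and 2 are sensitive to low precision.
error = [{4: 1.0, 8: 0.1}, {4: 0.2, 8: 0.1}, {4: 0.9, 8: 0.05}]
path, err = beam_search_bits(error, choices=[4, 8], total_bits=20)
```

In this toy table a beam width of 1 behaves greedily and spends the budget too early, while a wider beam keeps the cheaper middle timestep in play and reserves high precision for the sensitive final timestep.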
[81] A benchmark for video-based laparoscopic skill analysis and assessment cs.CVPDF
Isabel Funke, Sebastian Bodenstedt, Felix von Bechtolsheim, Florian Oehme, Michael Maruschke
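TL;DR: This paper introduces LASANA, a benchmark dataset for video-based laparoscopic skill analysis and assessment. It contains 1270 stereo video recordings of four basic laparoscopic training tasks, each annotated with a structured skill rating and binary labels for task-specific errors.
Details
Motivation: Progress on deep learning models for automatic video-based assessment of laparoscopic surgical skill is held back by the limited size of annotated datasets, so a large, high-quality dataset is needed to advance research in this area.
Result: The paper provides predefined data splits and reports baseline results from a deep learning model as a reference point for comparing future methods.
Insight: By collecting videos from a real training course, LASANA reflects the natural variation in participants' skill, and its structured skill ratings and error labels establish a standardized benchmark for video-based skill assessment and error recognition.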
Abstract: Laparoscopic surgery is a complex surgical technique that requires extensive training. Recent advances in deep learning have shown promise in supporting this training by enabling automatic video-based assessment of surgical skills. However, the development and evaluation of deep learning models is currently hindered by the limited size of available annotated datasets. To address this gap, we introduce the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four basic laparoscopic training tasks. Each recording is annotated with a structured skill rating, aggregated from three independent raters, as well as binary labels indicating the presence or absence of task-specific errors. The majority of recordings originate from a laparoscopic training course, thereby reflecting a natural variation in the skill of participants. To facilitate benchmarking of both existing and novel approaches for video-based skill assessment and error recognition, we provide predefined data splits for each task. Furthermore, we present baseline results from a deep learning model as a reference point for future comparisons.
[82] Monocular Normal Estimation via Shading Sequence Estimation cs.CV | cs.AIPDF
Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu
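TL;DR: The paper proposes a new paradigm for monocular normal estimation that recasts normal estimation as shading sequence estimation, and presents a method called RoSE. RoSE uses an image-to-video generative model to predict a shading sequence and then converts the sequence into a normal map by solving an ordinary least-squares problem. Trained on the synthetic MultiShade dataset, RoSE achieves state-of-the-art performance on real-world benchmark datasets.
Details
Motivation: Existing monocular normal estimation methods often suffer from 3D misalignment: the estimated normal map looks correct, but the reconstructed surface does not align with the geometric details. The authors attribute this to the current paradigm, in which the model struggles to distinguish and reconstruct geometry from subtle color variations.
Result: Experiments show that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
Insight: The core novelty is recasting normal estimation as shading sequence estimation, which is more sensitive to geometric information, and using an image-to-video generative model to predict those sequences. This offers a new and more robust paradigm for addressing 3D misalignment.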
Abstract: Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
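The abstract states that predicted shading sequences are converted into normal maps by ordinary least squares but does not spell out the formulation. One plausible reading, sketched below purely as an assumption, is the classic Lambertian setup: each frame $k$ of the sequence is rendered under a known light direction $l_k$, so the shading at a pixel is $s_k = l_k \cdot n$, and stacking the frames gives a linear system $L n = s$ per pixel. The function name, the known-lights assumption, and the omission of shadow clamping are illustrative choices, not the authors' implementation.

```python
import numpy as np

def normals_from_shading(shading, lights):
    """Recover unit normals from a shading sequence via least squares.

    shading: array of shape (K, H, W), one shading image per light.
    lights:  array of shape (K, 3), assumed-known light directions.
    Assumes an unclamped Lambertian model s_k = l_k . n (no shadows).
    """
    shading = np.asarray(shading, dtype=float)
    lights = np.asarray(lights, dtype=float)
    k, h, w = shading.shape
    rhs = shading.reshape(k, h * w)          # one column per pixel
    # Solve lights @ G = rhs for all pixels at once; G has shape (3, H*W).
    g, *_ = np.linalg.lstsq(lights, rhs, rcond=None)
    norms = np.linalg.norm(g, axis=0, keepdims=True)
    unit = g / np.maximum(norms, 1e-12)      # guard against zero vectors
    return unit.T.reshape(h, w, 3)

# Toy check: synthesize shading for two known normals, then recover them.
lights = np.array([[0.0, 0.0, 1.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0],
                   [-1.0, 0.0, 1.0]])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
true_normals = np.array([[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]])
shading = (lights @ true_normals.T).reshape(4, 1, 2)
recovered = normals_from_shading(shading, lights)
```

At least three non-coplanar light directions are needed for the per-pixel system to have a unique solution; additional frames over-determine it, which is what makes a least-squares fit tolerant of noise in individual predicted frames.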
[83] VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization cs.CVPDF
Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao
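TL;DR: This paper proposes VersaViT, which strengthens the vision backbones of Multimodal Large Language Models (MLLMs) through task-guided optimization. The study finds that MLLM vision encoders underperform on dense prediction tasks and therefore designs a multi-task collaborative post-training framework that optimizes the backbone using lightweight task heads and multi-granularity supervision, so that it can serve both language-mediated reasoning and pixel-level understanding.
Details
Motivation: The goal is to remedy the weak dense feature representations of MLLM vision encoders so that they can act as versatile vision backbones that reliably handle classic vision-centric tasks such as semantic segmentation and depth estimation.
Result: Extensive experiments across multiple downstream tasks demonstrate the effectiveness of the method; the resulting backbone performs well on both language-mediated reasoning and pixel-level understanding.
Insight: The novelty lies in a multi-task collaborative post-training framework (VersaViT) that uses lightweight task heads and multi-granularity supervision to improve the dense feature representations of the MLLM vision backbone, turning it into a general-purpose backbone. Viewed objectively, the method combines the high-level semantic alignment of MLLMs with the dense prediction needs of classic vision tasks and is a promising strategy for strengthening backbones.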
Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
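[84] Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework cs.CV | cs.AI
Franziska Krauß, Matthias Ege, Zoltan Lovasz, Albrecht Bartz-Schmidt, Igor Tsaur
TL;DR: This paper proposes a Hybrid Attention-Convolution (HAC) architecture for automatically segmenting bladder vessels in endoscopic video. It targets the difficulties that organ deformation, image artifacts, and interfering mucosal folds pose for vessel segmentation in bladder cancer surveillance. The method combines a Transformer that captures a global vessel topology prior with a CNN that learns a residual refinement map to recover thin-vessel details, and improves performance through optimized annotations and a physics-aware pretraining strategy.
Details
Motivation: Bladder cancer surveillance requires tracking tumor sites across repeated interventions, but the deformable, hollow bladder lacks stable anatomical landmarks. Vessels visible during endoscopy can serve as a patient-specific “vascular fingerprint” for navigation, yet automated segmentation is challenged by imperfect endoscopic data (sparse labels, bubbles, lighting changes, continuous deformation, and mucosal folds that mimic vessels), and existing state-of-the-art methods struggle with these domain-specific complexities.
Result: Evaluated on the BlaVeS dataset of endoscopic video frames, the method achieves high accuracy (0.94) as well as precision (0.61) and clDice (0.66) superior to state-of-the-art medical segmentation models. It also suppresses false positives caused by mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery.
Insight: The novelty lies in combining the global modeling ability of a Transformer with the local detail capture of a CNN, and in handling domain-specific challenges and data scarcity through optimized annotations (excluding short and terminal branches to emphasize structural connectivity) and physics-aware self-supervised pretraining (clinically grounded augmentations on unlabeled data), which provides the reliable structural stability needed for clinical navigation.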
Abstract: Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable and hollow bladder lacks stable landmarks for orientation. While blood vessels visible during endoscopy offer a patient-specific “vascular fingerprint” for navigation, automated segmentation is challenged by imperfect endoscopic data, including sparse labels, artifacts like bubbles or variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines a Transformer, which captures a global vessel topology prior, with a CNN that learns a residual refinement map to precisely recover thin-vessel details. To prioritize structural connectivity, the Transformer is trained on optimized ground truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ physics-aware pretraining, a self-supervised strategy that uses clinically grounded augmentations on unlabeled data. Evaluated on the BlaVeS dataset, consisting of endoscopic video frames, our approach achieves high accuracy (0.94) and superior precision (0.61) and clDice (0.66) compared to state-of-the-art medical segmentation models. Crucially, our method successfully suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery. Hence, HAC provides the reliable structural stability required for clinical navigation.
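[85] Learning to Detect Baked Goods with Limited Supervision cs.CV
Thomas H. Schmitt, Maximilian Bundscherer, Tobias Bocklet
TL;DR: This paper proposes a method for training an object detection model to recognize baked goods under limited supervision. It combines the localization ability of open-vocabulary detectors with image-level supervision for weakly supervised training, and uses Segment Anything 2 for pseudo-label propagation to improve viewpoint robustness. The resulting model surpasses the fully supervised baseline under non-ideal deployment conditions.
Details
Motivation: Automated monitoring of leftover baked goods is important for optimizing production, but the wide variety of German baked goods makes fully supervised training expensive and hard to scale. Existing open-vocabulary detectors (e.g., OWLv2, Grounding DINO) do not meet the needs of the task, so the work addresses the challenge of deploying computer vision in industrial settings where annotated data is scarce.
Result: On a dataset covering 19 classes of baked goods, the model trained with image-level supervision alone reaches an mAP of 0.91. Fine-tuning with pseudo-labels improves performance by 19.3% under non-ideal deployment conditions, and the combined workflows surpass the fully supervised baseline under those conditions.
Insight: The novelty lies in combining open-vocabulary detector localization with image-level supervision for weakly supervised training, and in using Segment Anything 2 for pseudo-label propagation to improve robustness. This offers an efficient training paradigm for industrial vision tasks where annotated data is scarce.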
Abstract: Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industries where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully supervised baseline model under non-ideal deployment conditions, despite relying only on image-level supervision.
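[86] Coupled Inference in Diffusion Models for Semantic Decomposition cs.CV | cs.AI | cs.LG
Calvin Yeung, Ali Zakeri, Zhuowen Zou, Mohsen Imani
TL;DR: This paper proposes a diffusion-model-based framework for semantic decomposition that uses coupled inference to decompose visual scenes into latent factors. The method treats semantic decomposition as an inverse problem, couples the diffusion processes with a reconstruction-driven guidance term, and introduces a novel iterative sampling scheme that improves performance. Experiments show that the framework outperforms resonator networks on synthetic semantic decomposition tasks, and that attention-based resonator networks are a special case of it.
Details
Motivation: Address the problem of forming and decomposing compositional representations of latent factors in visual scenes. Motivated by reported similarities between Hopfield networks, which underlie resonator networks, and diffusion models, the authors build a coupled inference framework in diffusion models for more effective semantic decomposition.
Result: Across a range of synthetic semantic decomposition tasks, the coupled inference framework outperforms resonator networks.
Insight: The novelty includes framing semantic decomposition as an inverse problem, coupling the diffusion processes with a reconstruction guidance term, and designing an iterative sampling scheme. Viewed objectively, the framework subsumes attention-based resonator networks and extends the potential of diffusion models for structured reasoning.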
Abstract: Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
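To make the binding and decomposition vocabulary concrete, here is a minimal sketch that assumes Hadamard (element-wise) binding of random bipolar vectors, one common choice that the abstract does not confirm. It shows only the easy case where one factor is already known; a resonator network, or the coupled diffusion processes described above, is needed when all factors are unknown. The codebook names and sizes are hypothetical.

```python
import random

rng = random.Random(0)
DIM = 256


def random_bipolar(dim):
    """Random vector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(u, v):
    """Hadamard (element-wise) binding of two bipolar vectors."""
    return [a * b for a, b in zip(u, v)]


def cleanup(query, codebook):
    """Index of the codebook entry with the largest dot product with the query."""
    scores = [sum(q * c for q, c in zip(query, entry)) for entry in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)


# Two hypothetical factor codebooks with five entries each.
shapes = [random_bipolar(DIM) for _ in range(5)]
colors = [random_bipolar(DIM) for _ in range(5)]

# Compose a scene vector from one shape and one color.
bound = bind(shapes[2], colors[4])

# For bipolar vectors binding is its own inverse, so multiplying by the
# known color factor returns the shape factor exactly.
recovered = bind(bound, colors[4])
shape_index = cleanup(recovered, shapes)
```

The decomposition problem is hard precisely because neither factor is known in advance, so the search space grows with the product of the codebook sizes.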
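[87] Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection cs.CV | cs.AI
Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu
TL;DR: This paper proposes Fake-HR1, a large-scale hybrid-reasoning model for synthetic image detection. According to the authors, it is the first model to adaptively decide whether reasoning is needed based on the characteristics of the generative detection task, which addresses the resource overhead of lengthy reasoning in existing methods. With a two-stage training framework (hybrid fine-tuning followed by online reinforcement learning), the model implicitly learns to select an appropriate reasoning mode, improving detection performance while significantly improving response efficiency.
Details
Motivation: Prior work introduces Chain-of-Thought (CoT) reasoning into the detection process to improve synthetic image detection, but overly long reasoning incurs substantial resource overhead (token consumption and latency), which is especially redundant for obviously forged images.
Result: Experiments show that Fake-HR1 adaptively performs reasoning across different types of queries and surpasses existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
Insight: The novelty lies in adaptively deciding whether reasoning is necessary for the generative detection task, and in a two-stage training framework (HFT and HGRPO) that implicitly learns reasoning mode selection, balancing performance and efficiency.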
Abstract: Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model’s ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
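[88] Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving cs.CV
Serin Varghese, Kevin Ross, Fabian Hueger, Kira Maag
TL;DR: This paper proposes a spatio-temporal attention mechanism that extends Transformer self-attention to the spatio-temporal dimension, exploiting temporal consistency across video frames to improve the accuracy and stability of video semantic segmentation in automated driving scenes.
Details
Motivation: Existing deep neural networks, especially Transformer-based models, usually process each frame independently in video semantic segmentation and fail to exploit temporal consistency, which limits performance in dynamic scenes.
Result: Evaluation on the Cityscapes and BDD100k datasets shows improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union (mIoU) over single-frame baselines.
Insight: The novelty lies in modifying standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. The mechanism applies broadly across Transformer architectures, from lightweight to large-scale models, and offers an effective architectural enhancement for video semantic segmentation.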
Abstract: Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
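The abstract does not specify how STA forms its spatio-temporal sequences, so the following is only an illustrative sketch of the general idea: queries come from the current frame, while keys and values are drawn from the tokens of all frames in a short window. It omits the learned query, key, and value projections (identity projections are assumed), and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of token vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def spatio_temporal_attention(frames):
    """frames: list of T frames, each a list of N token vectors; the last is current."""
    current = frames[-1]
    # Multi-frame context: tokens of every frame in the window, flattened.
    context = [token for frame in frames for token in frame]
    return attend(current, context, context)


frames = [
    [[1.0, 0.0], [0.0, 1.0]],   # frame t-2
    [[0.5, 0.5], [1.0, 1.0]],   # frame t-1
    [[0.0, 2.0], [2.0, 0.0]],   # frame t (current)
]
out = spatio_temporal_attention(frames)
```

With a window of one frame the function reduces to ordinary single-frame self-attention, which is consistent with the abstract's claim that the change to existing architectures is small.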
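[89] 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere cs.CV
Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, Chen Change Loy
TL;DR: 4RC is a unified feed-forward framework for 4D reconstruction from monocular video. Through an encode-and-query paradigm, it jointly captures dense scene geometry and motion dynamics and supports queries at any time and location.
Details
Motivation: Address the limitation that existing methods typically decouple motion from geometry or produce only limited 4D attributes (such as sparse trajectories or two-view scene flow), with the goal of learning a holistic 4D representation.
Result: Across a wide range of 4D reconstruction tasks, 4RC outperforms prior and concurrent methods.
Insight: The novelty includes the “encode once, query anywhere and anytime” paradigm and a minimally factorized representation that decomposes per-view 4D attributes into base geometry and time-dependent relative motion to facilitate learning.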
Abstract: We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
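[90] Causality in Video Diffusers is Separable from Denoising cs.CV | cs.AI | cs.LG
Xingjian Bai, Guande He, Zhengqi Li, Eli Shechtman, Xun Huang
TL;DR: This paper proposes the Separable Causal Diffusion (SCD) architecture, which decouples causal reasoning from the denoising process and thereby significantly improves the efficiency of video diffusion models. SCD uses a causal Transformer encoder for once-per-frame temporal reasoning and a lightweight diffusion decoder for multi-step frame-wise rendering, significantly improving throughput and per-frame latency while matching or surpassing baseline generation quality.
Details
Motivation: Current causal diffusion models tightly couple temporal reasoning with iterative denoising, applying causal attention at every denoising step and in all layers, which leads to redundant computation. The paper aims to show that causal reasoning can be separated from the multi-step denoising process to improve efficiency.
Result: On pretraining and post-training tasks across synthetic and real benchmarks, SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
Insight: The novelty lies in uncovering two regularities in autoregressive video diffusion models: early-layer features are highly similar across denoising steps (redundant computation), and deeper layers show sparse cross-frame attention and mainly perform intra-frame rendering. Based on these findings, the SCD architecture explicitly decouples causal reasoning from denoising, which suggests a new direction for designing efficient video generation models.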
Abstract: Causality – referring to temporal, uni-directional cause-effect relationships between components – underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
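The control flow implied by "once-per-frame temporal reasoning" and "multi-step frame-wise rendering" can be sketched as two nested loops. The code below is a schematic with scalar stand-ins, not the SCD architecture: `causal_encoder` and `diffusion_decoder` are hypothetical placeholders, and the only point is how often each component runs.

```python
def causal_encoder(history):
    """Hypothetical stand-in: summarize all past frames into one context value."""
    return sum(history) / len(history) if history else 0.0


def diffusion_decoder(noisy_frame, context, step, num_steps):
    """Hypothetical stand-in for one denoising step conditioned on the context."""
    target = context + 1.0
    return noisy_frame + (target - noisy_frame) / (num_steps - step)


def generate(num_frames, num_steps):
    """Autoregressive generation with the encoder outside the denoising loop."""
    calls = {"encoder": 0, "decoder": 0}
    frames = []
    for _ in range(num_frames):
        # Temporal (causal) reasoning over past frames: once per frame.
        context = causal_encoder(frames)
        calls["encoder"] += 1
        frame = 0.0  # stand-in for the initial noise sample
        # Frame-wise rendering: many denoising steps, reusing the same context.
        for step in range(num_steps):
            frame = diffusion_decoder(frame, context, step, num_steps)
            calls["decoder"] += 1
        frames.append(frame)
    return frames, calls


frames, calls = generate(num_frames=3, num_steps=4)
```

In the entangled design described in the abstract, the causal attention would instead sit inside the inner loop and run `num_frames * num_steps` times.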
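[91] VideoWorld 2: Learning Transferable Knowledge from Real-world Videos cs.CV
Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao
TL;DR: VideoWorld 2 extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. Its core is a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, which lets the dLDM learn latent codes focused on compact, task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning.
Details
Motivation: Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents, but existing methods struggle on real-world videos.
Result: On challenging real-world handcraft-making tasks, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, it acquires effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on the CALVIN benchmark.
Insight: The novelty lies in decoupling action dynamics from visual appearance through the dLDM and delegating appearance to a pretrained video diffusion model, so that the model can focus on compact, task-related dynamics and learn transferable world knowledge directly from raw videos. This suggests a new direction for video understanding and robot learning.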
Abstract: Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
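[92] Olaf-World: Orienting Latent Actions for Video World Modeling cs.CV | cs.AI | cs.LG
Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
TL;DR: The paper proposes the Olaf-World framework, which uses the SeqΔ-REPA objective to align the semantics of latent actions with temporal feature differences from a self-supervised video encoder. This allows action-conditioned world models to be pretrained from large-scale unlabeled video and addresses the poor cross-scene transfer of learned latent actions.
Details
Motivation: Scaling action-controllable world models is limited by the scarcity of action labels, and existing latent action learning methods transfer poorly across contexts because they entangle scene-specific cues and lack a shared coordinate system.
Result: Extensive experiments show that the method learns a more structured latent action space and outperforms state-of-the-art baselines both in zero-shot action transfer and in data efficiency when adapting to new control interfaces.
Insight: The core idea is to use the semantic effects of actions, observed through temporal feature differences, as a shared reference for aligning latent actions. The proposed sequence-level control-effect alignment objective (SeqΔ-REPA) disentangles scene information and establishes a unified action representation across contexts.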
Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$Δ$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
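[93] ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation cs.CV
Mingyang Wu, Ashirbad Mishra, Soumik Dey, Shuo Xing, Naveen Ravipati
TL;DR: This paper proposes the ConsID-Gen framework to address object identity drift and geometric distortion caused by viewpoint changes in image-to-video generation. The work builds a large-scale object-centric dataset, ConsIDVid, with a corresponding benchmark, and introduces auxiliary-view augmentation and a dual-stream encoder to improve the view consistency and identity preservation of generated videos.
Details
Motivation: Existing image-to-video generation methods are prone to appearance drift and geometric distortion under viewpoint changes, which stems from the sparsity of single-view 2D observations and weak cross-modal alignment.
Result: Experiments on ConsIDVid-Bench show that ConsID-Gen consistently outperforms existing methods on multiple metrics, with overall performance surpassing leading video generation models such as Wan2.1 and HunyuanVideo, and achieves superior identity fidelity and temporal coherence in challenging real-world scenarios.
Insight: The contributions include a high-quality, temporally aligned, large-scale object-centric dataset with a multi-view consistency evaluation framework, and a generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues through a dual-stream visual-geometric encoder and a text-visual connector, providing unified conditioning for a Diffusion Transformer backbone.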
Abstract: Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.
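[94] Quantum Multiple Rotation Averaging cs.CV
Shuteng Wang, Natacha Kuete Meli, Michael Möller, Vladislav Golyanik
TL;DR: This paper proposes IQARS, the first algorithm to reformulate multiple rotation averaging as a sequence of local quadratic non-convex sub-problems executable on quantum annealers. It aims to exploit the inherent advantages of quantum hardware and to improve accuracy in high-noise scenarios while preserving the geometry of the rotation manifold.
Details
Motivation: Address the limitations of classical multiple rotation averaging methods (such as L1-IRLS and Shonan): susceptibility to local minima and reliance on convex relaxations that fail to preserve the exact manifold geometry, which reduces accuracy particularly in high-noise settings.
Result: Evaluation on synthetic and real-world datasets shows that, although current quantum annealers are at an early stage and support only problems of limited scale, IQARS on D-Wave annealers already achieves about 12% higher accuracy than Shonan, the best-performing classical method evaluated.
Insight: The novelty lies in converting the MRA problem into a sequence of binarized local quadratic non-convex sub-problems suited to quantum annealers, which removes the dependence on convex relaxation, better preserves the non-Euclidean rotation manifold geometry, and exploits quantum tunneling and parallelism for efficient exploration of the solution space.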
Abstract: Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS’s performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.
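For readers unfamiliar with the problem itself, the sketch below evaluates a rotation averaging cost in the planar case, where each rotation is a single angle. This is a deliberate simplification: the paper concerns 3D rotations, and nothing here reflects IQARS or its quantum formulation. The cost form and the toy measurements are assumptions chosen for illustration.

```python
import math


def mra_cost(absolute, relative):
    """Sum over edges (i, j) of 1 - cos(theta_j - theta_i - theta_ij).

    absolute: list of absolute angles, one per node (radians).
    relative: dict mapping an edge (i, j) to its measured relative angle.
    """
    return sum(
        1.0 - math.cos(absolute[j] - absolute[i] - theta_ij)
        for (i, j), theta_ij in relative.items()
    )


# Hypothetical ground truth and slightly noisy relative measurements.
truth = [0.0, 0.5, 1.2]
relative = {(0, 1): 0.52, (1, 2): 0.69, (0, 2): 1.21}

cost_truth = mra_cost(truth, relative)
cost_zero = mra_cost([0.0, 0.0, 0.0], relative)

# Adding the same offset to every absolute angle leaves the cost unchanged,
# so the solution is only defined up to a global rotation (gauge freedom).
shifted = [t + 0.3 for t in truth]
cost_shifted = mra_cost(shifted, relative)
```

The cost is non-convex in the angles, which is the source of the local minima that the abstract attributes to classical solvers.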
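[95] SAGE: Scalable Agentic 3D Scene Generation for Embodied AI cs.CV | cs.RO
Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu
TL;DR: SAGE is a scalable agentic framework for generating simulation-ready 3D scenes for embodied AI. It interprets the intent of a user-specified embodied task and automatically generates large-scale, realistic, and physically valid environments. The framework couples generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability, and refines scenes through iterative reasoning and adaptive tool selection.
Details
Motivation: Address the high cost and safety risks of collecting real-world data for embodied AI, and the limitation that existing scene generation systems rely on rule-based or task-specific pipelines that yield artifacts and physically invalid scenes.
Result: The environments in the resulting SAGE-10k dataset are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data show clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling.
Insight: The novelty lies in framing scene generation as an agent-driven iterative refinement process that combines multiple generators with critics for semantic, visual, and physical evaluation and self-refinement. This enables automated generation of large-scale, high-quality, simulation-ready environments and offers a scalable data solution for embodied AI.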
Abstract: Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: https://nvlabs.github.io/sage.
cs.SD
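[96] Covo-Audio Technical Report cs.SD | cs.CL | eess.AS
Wenfu Wang, Chenxing Li, Liqiang Zhang, Yiyang Zhao, Yuxiang Zou
TL;DR: This paper introduces Covo-Audio, a 7B-parameter end-to-end large audio language model (LALM) that directly processes continuous audio inputs and generates audio outputs. Through large-scale pretraining and targeted post-training, the model achieves state-of-the-art or competitive performance among models of comparable scale across a broad range of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction.
Details
Motivation: Develop an end-to-end LALM with a unified architecture that directly processes audio input and output, addressing the challenges traditional methods face in integrating audio intelligence with semantic reasoning, and reduce deployment cost.
Result: On multiple benchmarks, the pretrained foundation model outperforms open-source models of comparable scale in speech-text comprehension and semantic reasoning. The dialogue variant Covo-Audio-Chat shows strong spoken conversational abilities, and the full-duplex variant Covo-Audio-Chat-FD performs substantially better on both spoken dialogue capabilities and full-duplex interaction behaviors.
Insight: The innovations include a unified end-to-end audio processing architecture, a large-scale pretraining and post-training strategy, and an intelligence-speaker decoupling strategy proposed to reduce deployment cost. The decoupling strategy enables flexible voice customization with a small amount of TTS data while preserving dialogue performance, and offers a scalable path toward more capable and versatile LALMs.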
Abstract: In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially superior performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its competence in practical robustness. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.
cs.IR
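[97] QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search cs.IR | cs.CL
Jianzhao Huang, Xiaorui Huang, Fei Zhao, Yunpeng Liu, Hui Zhang
TL;DR: This paper proposes QP-OneModel, a unified generative large language model for multi-task query understanding in Xiaohongshu search. It reformulates heterogeneous sub-tasks into a unified sequence generation paradigm and adopts a progressive three-stage alignment strategy combined with multi-reward reinforcement learning. The model also generates intent descriptions as a high-fidelity semantic signal to augment downstream tasks.
Details
Motivation: Traditional query processing systems rely on pipelines of isolated discriminative models, which suffer from limited semantic understanding and high maintenance cost. Existing LLM approaches often optimize sub-tasks in isolation and neglect intrinsic semantic synergy. They also lack grounding in social network scenarios and struggle to bridge the gap between open-domain corpora and informal social network language patterns.
Result: Offline evaluation shows that QP-OneModel achieves a 7.35% overall gain over discriminative baselines, with F1 improvements of 9.01% on named entity recognition and 9.31% on term weighting. On unseen tasks, it surpasses a 32B-parameter model by 7.60% in accuracy. Online A/B tests confirm its industrial value, improving retrieval relevance (DCG) by 0.21% and user retention by 0.044%.
Insight: The main innovations are: 1) unifying heterogeneous query understanding sub-tasks into a generative sequence modeling paradigm; 2) adopting a progressive three-stage alignment strategy (instruction tuning, preference alignment, multi-reward reinforcement learning); 3) introducing intent description generation as a high-fidelity semantic signal that augments downstream tasks. This offers a template for building unified, generalizable, and business-aligned domain-specific LLMs.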
Abstract: Query Processing (QP) bridges user intent and content supply in large-scale Social Network Service (SNS) search engines. Traditional QP systems rely on pipelines of isolated discriminative models (e.g., BERT), suffering from limited semantic understanding and high maintenance overhead. While Large Language Models (LLMs) offer a potential solution, existing approaches often optimize sub-tasks in isolation, neglecting intrinsic semantic synergy and necessitating independent iterations. Moreover, standard generative methods often lack grounding in SNS scenarios, failing to bridge the gap between open-domain corpora and informal SNS linguistic patterns, while struggling to adhere to rigorous business definitions. We present QP-OneModel, a Unified Generative LLM for Multi-Task Query Understanding in the SNS domain. We reformulate heterogeneous sub-tasks into a unified sequence generation paradigm, adopting a progressive three-stage alignment strategy culminating in multi-reward Reinforcement Learning. Furthermore, QP-OneModel generates intent descriptions as a novel high-fidelity semantic signal, effectively augmenting downstream tasks such as query rewriting and ranking. Offline evaluations show QP-OneModel achieves a 7.35% overall gain over discriminative baselines, with significant F1 boosts in NER (+9.01%) and Term Weighting (+9.31%). It also exhibits superior generalization, surpassing a 32B model by 7.60% accuracy on unseen tasks. Fully deployed at Xiaohongshu, online A/B tests confirm its industrial value, optimizing retrieval relevance (DCG) by 0.21% and lifting user retention by 0.044%.
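The online result above is reported as a relative change in DCG. As a reminder of what that metric measures, here is a minimal sketch of discounted cumulative gain with linear gain and a log2 position discount. The abstract does not say which DCG variant or cutoff Xiaohongshu uses, so both are assumptions (an exponential gain of `2**rel - 1` is another common choice), and the relevance grades are made up.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@k with linear gain: sum of rel_i / log2(i + 1) over ranks i = 1..k."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


# Hypothetical graded relevance of the top three results for one query.
good_ranking = [3, 2, 0]
bad_ranking = [0, 2, 3]

dcg_good = dcg_at_k(good_ranking, k=3)
dcg_bad = dcg_at_k(bad_ranking, k=3)
```

Because the discount grows with rank, placing the most relevant result first yields a higher score than the same results in reverse order.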
cs.CR
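[98] Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models cs.CR | cs.CV
Xinwei Zhang, Li Bai, Tianwei Zhang, Youqian Zhang, Qingqing Ye
TL;DR: This paper presents the first systematic study of encoder-based adversarial transferability against large vision-language models (LVLMs). Large-scale benchmarking reveals that existing attacks have severely limited transferability, and the analysis identifies two root causes: inconsistent visual grounding across models and redundant semantic alignment within models. Based on these findings, the paper proposes the Semantic-Guided Multimodal Attack (SGMA) framework, which directs perturbations toward semantically critical regions and disrupts cross-modal alignment at both global and local levels, significantly improving the transferability of adversarial examples across LVLM architectures.
Details
Motivation: LVLMs perform well on multimodal tasks, but their reliance on visual inputs exposes them to serious adversarial threats. Existing encoder-based attacks optimize only against the vision encoder, which is computationally efficient, but their transferability across different LVLM architectures in realistic black-box scenarios has not been well understood.
Result: Large-scale benchmarking over eight different LVLMs shows that SGMA achieves higher transferability than existing attacks across multiple victim models and tasks, exposing critical security risks in LVLM deployment.
Insight: The novelty lies in the first systematic study of encoder-based adversarial transferability in LVLMs and in identifying two intrinsic mechanisms that hinder it. The proposed SGMA framework uses semantic guidance to target weaknesses in the multimodal understanding of the models (global and local semantic alignment), which offers a new way to understand and enhance adversarial transferability and underscores the urgency of developing robust multimodal defenses.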
Abstract: Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely on the vision encoder, rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study towards encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we reveal that existing attacks exhibit severely limited transferability. Second, we perform in-depth analysis, disclosing two root causes that hinder the transferability: (1) inconsistent visual grounding across models, where different models focus their attention on distinct regions; (2) redundant semantic alignment within models, where a single object is dispersed across multiple overlapping token representations. Third, we propose Semantic-Guided Multimodal Attack (SGMA), a novel framework to enhance the transferability. Inspired by the discovered causes in our analysis, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both global and local levels. Extensive experiments across different victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
cs.LG
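[99] Flexible Entropy Control in RLVR with Gradient-Preserving Perspective cs.LG | cs.AI | cs.CL
Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng
TL;DR: Targeting the policy entropy collapse that is common in Reinforcement Learning with Verifiable Rewards (RLVR) training, this paper proposes a flexible entropy control method from the perspective of Gradient-Preserving Clipping. The authors theoretically and empirically verify how specific regions of the importance sampling ratio affect entropy changes, introduce a dynamic clipping threshold as the regulation mechanism, and design several dynamic entropy control strategies. Experiments show that these strategies mitigate entropy collapse and achieve better performance on multiple benchmarks.
Details
Motivation: Address policy entropy collapse in continued RLVR training, in which Gradient-Preserving Clipping is a primary factor. The collapse appears as a rapid decay in entropy that causes premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Existing mitigation strategies are largely static and lack a framework that connects clipping mechanisms to precise entropy control.
Result: Experiments show that the proposed dynamic entropy control strategies (increase-then-decrease, decrease-increase-decrease, and oscillatory decay) effectively mitigate entropy collapse and outperform baseline methods on multiple benchmarks.
Insight: The core novelty lies in reshaping entropy control from the perspective of Gradient-Preserving Clipping: the paper establishes a theoretical link between the clipping mechanism and entropy dynamics and, on that basis, designs a precise regulation method built on dynamic clipping thresholds. This provides an interpretable and flexible framework for entropy stability and performance gains in RL training.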
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.
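The abstract does not define Gradient-Preserving Clipping or the exact schedules, so the sketch below substitutes the standard PPO-style clipped surrogate as a stand-in and pairs it with a hypothetical triangular "increase-then-decrease" schedule for the upper clipping threshold. It illustrates only the general idea of making the clipping range a function of the training step. All names and numeric values are assumptions.

```python
def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped objective for one token (stand-in, not the paper's rule)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def increase_then_decrease(step, total_steps, eps_min, eps_max):
    """Hypothetical triangular schedule: eps_min -> eps_max -> eps_min."""
    phase = 1.0 - abs(2.0 * step / total_steps - 1.0)
    return eps_min + (eps_max - eps_min) * phase


# Upper threshold at the start, middle, and end of a 100-step run.
eps_start = increase_then_decrease(0, 100, eps_min=0.1, eps_max=0.3)
eps_mid = increase_then_decrease(50, 100, eps_min=0.1, eps_max=0.3)
eps_end = increase_then_decrease(100, 100, eps_min=0.1, eps_max=0.3)

# A token whose probability rose sharply (ratio 1.5) with positive advantage
# contributes more to the objective when the upper threshold is wider.
tight = clipped_surrogate(1.5, 1.0, eps_low=0.2, eps_high=eps_start)
wide = clipped_surrogate(1.5, 1.0, eps_low=0.2, eps_high=eps_mid)
```

Which ratio regions raise or lower entropy is the subject of the paper's analysis and is not reproduced here.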
cs.MA [Back]
[100] LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis cs.MA | cs.CLPDF
Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan
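TL;DR: This paper presents LingxiDiagBench, a multi-agent benchmark framework for evaluating large language models on Chinese psychiatric consultation and diagnosis. At its core is LingxiDiag-16K, a dataset of 16,000 synthetic consultation dialogues aligned with electronic medical records (EMRs) and covering 12 ICD-10 psychiatric categories. Experiments find that LLMs perform well on binary classification, but performance drops markedly on comorbidity recognition and 12-way differential diagnosis, and dynamic consultation often underperforms static evaluation.
Details
Motivation: Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the subjectivity of interview-based diagnosis hinder timely and consistent assessment. Progress in AI-assisted psychiatric diagnosis is limited by the lack of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation.
Result: Extensive experiments on state-of-the-art LLMs show accuracy of up to 92.3% on binary depression-anxiety classification, dropping to 43.0% on depression-anxiety comorbidity recognition and 28.5% on 12-way differential diagnosis. Dynamic consultation often underperforms static evaluation, and consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy.
Insight: The paper's contribution is a large-scale, multi-agent Chinese psychiatric diagnosis benchmark that supports dynamic multi-turn consultation, with a dataset designed to reproduce real clinical distributions. The work systematically exposes the limitations of current LLMs on complex psychiatric diagnosis tasks such as comorbidity recognition and multi-class diagnosis, and it indicates that well-structured questioning alone does not ensure correct diagnostic decisions, which matters for the design of future AI-assisted diagnostic systems.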
Abstract: Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression–anxiety classification (up to 92.3%), performance deteriorates substantially for depression–anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
cs.RO [Back]
[101] LLM-Grounded Dynamic Task Planning with Hierarchical Temporal Logic for Human-Aware Multi-Robot Collaboration cs.RO | cs.CVPDF
Shuyuan Hu, Tao Lin, Kai Ye, Yang Yang, Tianwei Zhang
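TL;DR: This paper proposes a neuro-symbolic framework that combines the reasoning ability of large language models (LLMs) with hierarchical Linear Temporal Logic (LTL) specifications to solve open-world multi-robot collaborative task planning. Using a receding horizon planning (RHP) loop with real-time perception, the method handles environmental changes such as moving users or updated instructions dynamically, with the aim of producing kinematically feasible and efficient long-horizon plans.
Details
Motivation: Existing approaches fall short: LLM-generated plans often lack kinematic feasibility and are inefficient, while formal methods such as LTL offer correctness and optimality guarantees but are typically confined to static, offline settings and scale poorly computationally. The paper aims to bridge this gap and achieve dynamic, efficient, and reliable multi-robot task planning.
Result: Extensive real-world experiments show that the method significantly outperforms baselines in success rate and interaction fluency while minimizing planning latency.
Insight: The novelty lies in combining the open-world understanding of LLMs with the formal guarantees of LTL, and in using a hierarchical state space with receding horizon planning to adjust plans dynamically in real time, which improves handling of environmental uncertainty while preserving correctness.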
Abstract: While Large Language Models (LLMs) enable non-experts to specify open-world multi-robot tasks, the generated plans often lack kinematic feasibility and are inefficient, especially in long-horizon scenarios. Formal methods like Linear Temporal Logic (LTL) offer correctness and optimality guarantees, but are typically confined to static, offline settings and struggle with computational scalability. To bridge this gap, we propose a neuro-symbolic framework that grounds LLM reasoning into hierarchical LTL specifications and solves the corresponding Simultaneous Task Allocation and Planning (STAP) problem. Unlike static approaches, our system resolves stochastic environmental changes, such as moving users or updated instructions, via a receding horizon planning (RHP) loop with real-time perception, which dynamically refines plans through a hierarchical state space. Extensive real-world experiments demonstrate that our approach significantly outperforms baseline methods in success rate and interaction fluency while minimizing planning latency.
[102] VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model cs.RO | cs.CVPDF
Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu
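TL;DR: This paper proposes VLA-JEPA, a pretraining framework for Vision-Language-Action models based on the Joint-Embedding Predictive Architecture (JEPA). Through leakage-free state prediction, it learns dynamics abstractions in latent space, avoids interference from pixel variation, and simplifies the training pipeline.
Details
Motivation: Existing methods that pretrain VLA policies on internet video are vulnerable to appearance bias, nuisance motion, and information leakage, which leads them to learn representations that are not action-relevant. A pretraining objective that robustly captures state transitions is needed.
Result: Experiments on the LIBERO, LIBERO-Plus, and SimplerEnv simulation environments and on real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
Insight: The core innovation is the leakage-free state prediction mechanism: future frames serve only as supervision targets, never as input, which forces the model to predict dynamics in latent space and learn abstractions that are robust to camera motion and background changes, while avoiding the multi-stage complexity of earlier pipelines.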
Abstract: Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation – future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe – JEPA pretraining followed by action-head fine-tuning – without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
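The data flow described in the abstract (student sees only the current observation, target encoder sees the future frame, loss is computed in latent space) can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the linear encoders, the mean squared error loss, and the exponential moving average (EMA) update of the target encoder are common JEPA-style choices assumed here for illustration, and the predictor head is omitted. None of these details are confirmed by the abstract.

```python
# Toy, dependency-free illustration; all names and values are hypothetical.

def encode(weights, obs):
    """Toy linear encoder: one latent value per weight row."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]

def latent_loss(predicted, target):
    """Mean squared error between predicted and target latents."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def ema_update(target_w, student_w, tau=0.99):
    """Move target weights toward student weights (assumed EMA rule)."""
    return [
        [tau * t + (1.0 - tau) * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(target_w, student_w)
    ]

student_w = [[1.0, 0.0], [0.0, 1.0]]
target_w = [[0.5, 0.0], [0.0, 0.5]]

current_obs = [1.0, 2.0]   # the only input the student pathway receives
future_obs = [2.0, 6.0]    # used solely to build the supervision target

z_pred = encode(student_w, current_obs)    # prediction from current observation
z_target = encode(target_w, future_obs)    # target latent; no gradient would flow here
loss = latent_loss(z_pred, z_target)

# After a student update step (not shown), the target encoder is refreshed.
target_w = ema_update(target_w, student_w, tau=0.5)
```

The point of the arrangement is that `future_obs` never enters `encode(student_w, ...)`, so the student cannot copy future information from its input.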
cs.AI [Back]
[103] FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases cs.AI | cs.CL | cs.IRPDF
Xingjian Zhang, Sophia Moylan, Ziyang Xiong, Qiaozhu Mei, Yichen Luo
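TL;DR: This paper presents FlyBench, a benchmark for evaluating AI agents on end-to-end ontology curation from scientific literature. Given only a gene symbol, agents must search and read a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The study evaluates four baseline agent architectures and finds that multi-agent designs perform best, though all baselines leave substantial room for improvement.
Details
Motivation: Existing benchmarks focus on isolated subtasks such as named entity recognition or relation extraction and do not capture the end-to-end workflow of maintaining scientific knowledge bases, in which expert curators search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations. A new benchmark is therefore needed to evaluate AI agents on the complete ontology curation task.
Result: Four baseline architectures (memorization, fixed pipeline, single-agent, and multi-agent) were evaluated on 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila knowledge base. Architectural choices significantly affect performance, with multi-agent designs outperforming simpler alternatives, while scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement.
Insight: The paper's contribution is FlyBench, a benchmark focused on end-to-end agentic ontology curation that mirrors the real scientific curation workflow. Its main value lies in driving progress on retrieval-augmented scientific reasoning and in the finding that agents primarily use retrieval to confirm parametric knowledge rather than to discover new information, which gives direction for future work.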
Abstract: Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
[104] Not-in-Perspective: Towards Shielding Google’s Perspective API Against Adversarial Negation Attacks cs.AI | cs.CLPDF
Michail S. Alexiou, J. Sukarno Mertoguno
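TL;DR: This paper proposes formal reasoning-based wrapper methods that strengthen existing machine learning toxicity detection systems against negation attacks. Acting as pre-processing and post-processing steps, the wrapper alleviates adversarial attacks that rely on logic-based modifications such as negation and significantly improves the accuracy and efficacy of toxicity scoring.
Details
Motivation: The rise of cyberbullying and toxic comments on social media platforms has increased the need for effective monitoring and moderation of online interactions. Existing automated toxicity detection systems are mainly based on machine learning or deep learning algorithms, and such statistics-based solutions are prone to adversarial attacks that contain logic-based modifications such as negation.
Result: Different variants of the wrapper were evaluated on multiple machine learning models against a negation adversarial dataset. The results show that hybrid (formal reasoning plus machine learning) methods improve on purely statistical solutions, raising toxicity detection accuracy.
Insight: The novelty lies in combining formal reasoning with machine learning as a wrapper layer that defends against logic-based adversarial attacks, negation attacks in particular, offering a reusable hybrid framework for making existing toxicity detection systems more robust.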
Abstract: The rise of cyberbullying on social media platforms involving toxic comments has escalated the need for effective ways to monitor and moderate online interactions. Existing automated toxicity detection systems are based on machine learning or deep learning algorithms. However, statistics-based solutions are generally prone to adversarial attacks that contain logic-based modifications such as negation in phrases and sentences. In that regard, we present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems. Acting as both pre-processing and post-processing steps, our formal reasoning wrapper helps alleviate negation attacks and significantly improves the accuracy and efficacy of toxicity scoring. We evaluate different variations of our wrapper on multiple machine learning models against a negation adversarial dataset. Experimental results highlight the improvement of hybrid (formal reasoning and machine learning) methods over various purely statistical solutions.
[105] Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices cs.AI | cs.CLPDF
Manon Reusens, Sofie Goethals, Toon Calders, David Martens
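TL;DR: This paper studies how large language models (LLMs) behave in subjective decision settings such as travel assistance. It presents models with choice dilemmas, analyzes their responses with multinomial logit models to derive implied willingness to pay (WTP) estimates, and compares these with human benchmarks from the economics literature. Larger LLMs yield meaningful WTP values but show systematic deviations at the attribute level and tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning on users' prior preferences for cheaper options brings valuations closer to human benchmarks.
Details
Motivation: As LLMs are increasingly deployed in applications such as travel assistance and purchasing support, they often have to make subjective choices on behalf of users where no objectively correct answer exists. The paper examines LLM behavior in such decisions, in particular the implied willingness to pay, and compares it with human behavior to assess the potential and limitations of LLMs as decision support tools.
Result: Larger LLMs produce meaningful WTP estimates but show systematic deviations at the attribute level and overestimate human WTP overall. The overestimation is more pronounced when expensive options or business-oriented personas are introduced. When models are conditioned on users' prior preferences for cheaper options, their valuations move closer to human benchmarks.
Insight: The paper's contribution is to apply the economic concept of willingness to pay and multinomial logit models to evaluate the subjective decision behavior of LLMs, with a systematic comparison against human benchmarks. The study reveals potential biases when LLMs emulate human subjective preferences and underscores the importance of careful model selection, prompt design, and user representation in deployment, offering empirical insight into the reliability and calibration of LLMs for subjective decision support.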
Abstract: As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users’ past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
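As background for the WTP derivation mentioned in the abstract: in a multinomial logit model with a linear utility, the implied willingness to pay for an attribute is conventionally the negative ratio of that attribute's coefficient to the price coefficient. The sketch below illustrates this textbook relationship. The coefficient values, attribute names, and room options are hypothetical and are not taken from the paper, and the coefficient estimation step is not shown.

```python
import math

def choice_probabilities(utilities):
    """Multinomial logit choice probabilities (numerically stable softmax)."""
    highest = max(utilities)
    exps = [math.exp(u - highest) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def utility(option, betas):
    """Linear utility: sum of coefficient times attribute level."""
    return sum(betas[name] * level for name, level in option.items())

def willingness_to_pay(beta_attribute, beta_price):
    """Implied WTP for one unit of an attribute: -beta_attribute / beta_price."""
    if beta_price == 0:
        raise ValueError("price coefficient must be non-zero")
    return -beta_attribute / beta_price

# Hypothetical fitted coefficients (price in currency units, view as 0/1).
betas = {"price": -0.02, "view": 0.8}

room_plain = {"price": 100.0, "view": 0.0}
room_view = {"price": 130.0, "view": 1.0}

probs = choice_probabilities(
    [utility(room_plain, betas), utility(room_view, betas)]
)
wtp_view = willingness_to_pay(betas["view"], betas["price"])
print(f"P(plain)={probs[0]:.3f} P(view)={probs[1]:.3f} WTP(view)={wtp_view:.1f}")
```

With these made-up coefficients the room with a view costs less than the implied WTP premium, so the model assigns it the higher choice probability.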
[106] Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning cs.AI | cs.CL | cs.LGPDF
Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han
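TL;DR: This paper proposes Agent World Model (AWM), a fully synthetic environment generation pipeline that addresses the lack of diverse, reliable environments for agentic reinforcement learning. The pipeline produces 1,000 code-driven environments covering everyday scenarios, each with 35 tools on average, high-quality observations, and database-backed state transitions, supporting large-scale reinforcement learning for multi-turn tool-use agents.
Details
Motivation: Training LLM-based autonomous agents for complex tasks that require multi-turn interaction with tools and environments is hard to scale because diverse, reliable environments are scarce.
Result: Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization.
Insight: The novelty is a fully code-driven, database-backed approach to generating synthetic environments. Compared with LLM-simulated environments, state transitions are more reliable and consistent; compared with collecting trajectories from realistic environments, agent interaction is more efficient. The executable environments also allow reliable reward functions to be designed, providing a scalable, high-quality resource for agent training.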
Abstract: Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
cs.SE [Back]
[107] SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents cs.SE | cs.AI | cs.CLPDF
Zhirui Zhang, Hongbo Zhang, Haoxiang Fei, Zhiyuan Bao, Yubin Chen
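TL;DR: SWE-AGI is an open-source benchmark that evaluates whether LLM-based agents can autonomously build production-scale software systems from explicit specifications such as authoritative standards and RFCs. The systems are written in MoonBit, and tasks involve implementing core logic such as parsers and interpreters.
Details
Motivation: Although large language models show strong coding ability, whether they can autonomously build production-scale software from explicit specifications remains an open question, so a benchmark is needed to evaluate end-to-end, specification-driven software construction.
Result: Among the frontier models evaluated, gpt-5.3-codex achieves the best overall performance (19 of 22 tasks solved, 86.4%), ahead of claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 is the strongest open-source model. Performance degrades sharply as task difficulty increases.
Insight: The novelty lies in using the nascent MoonBit ecosystem to minimize data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Behavioral analysis shows that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development.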
Abstract: Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.
cs.HC [Back]
[108] Towards Human-AI Accessibility Mapping in India: VLM-Guided Annotations and POI-Centric Analysis in Chandigarh cs.HC | cs.CV | cs.CYPDF
Varchita Lalwani, Utkarsh Agarwal, Michael Saugstad, Manish Kumar, Jon E. Froehlich
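TL;DR: This paper describes adapting and deploying Project Sidewalk (a web-based tool for crowdsourced sidewalk accessibility annotation using Google Street View) in Chandigarh, India, including modified annotation types, provided examples, and integrated mission guidance based on a vision-language model (VLM). With this tool, the authors conduct a point-of-interest (POI)-centric accessibility analysis of about 40 km of sidewalks across three sectors of Chandigarh with different land uses (residential, commercial, and institutional), identifying many locations that need infrastructure improvements.
Details
Motivation: To extend Project Sidewalk, which has been used in 40 cities worldwide, to Chandigarh, India, in order to assess and improve the city's sidewalk accessibility, particularly across areas with different land uses.
Result: The evaluation shows that the integrated AI mission guidance received an average utility score of 4.66 from 3 annotators (the scale is not stated; presumably out of 5). Across about 40 km of roads audited in three sectors and around 230 POIs, 1,644 of 2,913 locations were identified where infrastructure improvements could enhance accessibility.
Insight: The main contributions are customizing the annotation tool for a specific region (India) and integrating a VLM for adaptive mission guidance based on street scene and metadata analysis, which is intended to make annotation more efficient and targeted. The POI-centric analysis also ties accessibility problems to specific urban functional areas, giving urban planners more actionable insights.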
Abstract: Project Sidewalk is a web-based platform that enables crowdsourced assessment of sidewalk accessibility at city scale by having users virtually walk through city streets in Google Street View. The tool has been used in 40 cities across the world, including the US, Mexico, Chile, and Europe. In this paper, we describe adaptation efforts to enable deployment in Chandigarh, India, including modifying annotation types and provided examples and integrating VLM-based mission guidance, which adapts instructions based on street scene and metadata analysis. Our evaluation with 3 annotators indicates the utility of AI mission guidance, with an average score of 4.66. Using this adapted Project Sidewalk tool, we conduct a Points of Interest (POI)-centric accessibility analysis for three sectors in Chandigarh with very different land uses (residential, commercial, and institutional), covering about 40 km of sidewalks. Across 40 km of roads audited in three sectors and around 230 POIs, we identified 1,644 of 2,913 locations where infrastructure improvements could enhance accessibility.