cs.CL

[1] Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models cs.CL | cs.AI | cs.LG

Lucky Susanto, Musa Izzanardi Wijanarko, Khumaisa Nur’aini, Farid Adilazuarda, Alham Fikri Aji

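TL;DR: This paper asks whether visual rendering in pixel-based language models truly bypasses tokenization constraints. Evaluating the DualGPT architecture on four low-resource Indonesian local languages, it finds that reintroducing a text tokenizer brings back script-tokenizer misalignment even with visual rendering, which hurts model performance.

Details

Motivation: To examine whether pixel-based language modeling really removes tokenization constraints, and in particular whether multimodal variants that reintroduce a text tokenizer repeat the tokenizer misalignment problem, using low-resource Indonesian local languages as a case study.

Result: Within the DualGPT architecture, the Llama 2 tokenizer performs significantly worse than a custom tokenizer despite its lower OOV and fertility rates; the custom tokenizer yields improvements of up to 30.15 chrF++, indicating that tokenizer alignment is critical to model performance.

Insight: The paper's contribution is showing that visual rendering does not fully bypass the tokenization bottleneck and that text tokenizers remain a barrier to equitable multimodal models. Viewed objectively, its emphasis on custom tokenizers for low-resource languages serves as a warning for future model design.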

Abstract: While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Focusing on four Indonesian low-resource local languages that have their own non-Latin scripts (i.e., Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the same issue that pixel-based language modeling aims to resolve, namely the tokenizer misalignment problem. We show that the Llama 2 tokenizer, despite having lower OOV and fertility rates, performs significantly worse than a custom tokenizer, which yields improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
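
To make the two tokenizer statistics named in the abstract concrete, here is a minimal sketch of how fertility (average tokens per word) and OOV rate (share of tokens mapped to an unknown symbol) are commonly computed. The `toy_tokenize` function, its vocabulary, and the `<unk>` symbol are hypothetical stand-ins for illustration, not the paper's tokenizers or evaluation code.

```python
from typing import Callable, List, Set, Tuple

UNK = "<unk>"  # hypothetical unknown-token symbol


def toy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unmatched characters become UNK."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts at position i: emit UNK for one character.
            tokens.append(UNK)
            i += 1
    return tokens


def fertility_and_oov(
    words: List[str], tokenize: Callable[[str], List[str]]
) -> Tuple[float, float]:
    """Return (tokens per word, fraction of tokens that are UNK)."""
    tokens = [tok for w in words for tok in tokenize(w)]
    fertility = len(tokens) / len(words)
    oov_rate = sum(tok == UNK for tok in tokens) / len(tokens)
    return fertility, oov_rate


# Toy data: the first two words are covered by the vocabulary, the third is not.
vocab = {"ha", "na", "ca", "ra", "ka"}
words = ["hana", "caraka", "data"]
fertility, oov_rate = fertility_and_oov(words, lambda w: toy_tokenize(w, vocab))
print(f"fertility={fertility:.2f}, oov_rate={oov_rate:.2f}")
```

Both numbers describe only how text is segmented. As the abstract reports, lower values on these statistics did not translate into better downstream chrF++ for the Llama 2 tokenizer.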


[2] BiomechAgent: AI-Assisted Biomechanical Analysis Through Code-Generating Agents cs.CL | cs.AI

R. James Cotton, Thomas Leonard

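TL;DR: This paper presents BiomechAgent, a code-generating AI agent for biomechanical analysis of markerless motion capture data through natural-language interaction, letting clinicians without a programming background query databases, generate visualizations, and interpret data.

Details

Motivation: As markerless motion capture becomes widespread, clinicians who lack programming skills struggle to analyze the resulting data; the work aims to lower the barrier to biomechanical analysis.

Result: On a systematic benchmark spanning data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning, BiomechAgent achieved robust accuracy on data retrieval and visualization tasks and showed emerging clinical reasoning capabilities. Domain-specific biomechanical instructions significantly improved performance, and integrating a specialized gait event detection tool substantially improved accuracy on spatiotemporal analysis that the base agent struggled with.

Insight: The novelty lies in combining a code-generating agent with biomechanics domain knowledge and simplifying complex data analysis through a natural-language interface. Viewed objectively, domain-tailored prompts and validated tools effectively improve performance on specialized tasks, but a local open-weight model still falls well short of a frontier cloud-based LLM on most tasks, which highlights the importance of balancing domain adaptation with model capability.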

Abstract: Markerless motion capture is making quantitative movement analysis increasingly accessible, yet analyzing the resulting data remains a barrier for clinicians without programming expertise. We present BiomechAgent, a code-generating AI agent that enables biomechanical analysis through natural language and allows users to query databases, generate visualizations, and even interpret data without writing code. To evaluate BiomechAgent’s capabilities, we developed a systematic benchmark spanning data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning. BiomechAgent achieved robust accuracy on data retrieval and visualization tasks and demonstrated emerging clinical reasoning capabilities. We used our dataset to systematically evaluate several of our design decisions. Biomechanically-informed, domain-specific instructions significantly improved performance over generic prompts, and integrating validated specialized tools for gait event detection substantially boosted accuracy on challenging spatiotemporal analysis where the base agent struggled. We also tested BiomechAgent using a local open-weight model instead of a frontier cloud-based LLM and found that performance was substantially diminished in most domains other than database retrieval. In short, BiomechAgent makes the data from accessible motion capture much more useful and accessible to end users.


[3] Free Energy Mixer cs.CL | cs.AI | cs.LG | stat.ML

Jiecheng Lu, Shihao Yang

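TL;DR: This paper proposes the Free Energy Mixer (FEM), a new read operation for attention. Using a free-energy (log-sum-exp) computation, it applies a value-driven, per-channel log-linear tilt to a fast prior over indices (e.g., from queries/keys in standard attention), moving smoothly from averaging to per-channel selection at unchanged complexity. The paper instantiates a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs, and SSMs, and that outperforms strong baselines on NLP, vision, and time-series tasks at matched parameter counts.

Details

Motivation: Standard attention reads keys/values through a per-head convex average, which blocks channel-wise selection. The paper aims to provide a value-aware, per-channel selective read without increasing computational complexity.

Result: At matched parameter budgets, the method consistently outperforms strong baselines on NLP, vision, and time-series tasks.

Insight: The core idea is to treat the conventional (q,k) scoring distribution as a prior and introduce a value-driven log-linear tilt that yields a value-aware posterior read, giving a smooth, controllable transition from averaging to per-channel selection while preserving parallelism and the original asymptotic complexity (O(T^2) or O(T)). Decoupling the prior from value information and then fusing the two offers a new perspective for improving sequence models such as attention, RNNs, and SSMs.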

Abstract: Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
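
One plausible reading of the "free-energy (log-sum-exp) read" is the per-channel quantity $\frac{1}{\beta}\log\sum_t p_t\, e^{\beta v_{t,c}}$, where $p$ is the prior over indices and $\beta$ is the inverse temperature. The sketch below implements only that formula, which is an assumption drawn from the abstract's wording rather than the paper's definition; the two-level gating and any learned parameters are omitted, and all function names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax, used here to build the prior over indices."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def free_energy_read(
    prior: List[float], values: List[List[float]], beta: float
) -> List[float]:
    """Per channel c: (1 / beta) * log(sum_t prior[t] * exp(beta * values[t][c]))."""
    out = []
    for c in range(len(values[0])):
        col = [row[c] for row in values]
        m = max(col)  # factor out the max so the exponentials cannot overflow
        s = sum(p * math.exp(beta * (v - m)) for p, v in zip(prior, col))
        out.append(m + math.log(s) / beta)
    return out


# Uniform prior over three positions, two value channels.
prior = softmax([0.0, 0.0, 0.0])
values = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]

low = free_energy_read(prior, values, beta=1e-4)   # close to the prior-weighted mean
high = free_energy_read(prior, values, beta=50.0)  # close to the per-channel maximum
print(low, high)
```

With a small `beta` the read is close to the ordinary convex average (about 0 and 1 for the two channels). With a large `beta` each channel approaches its own maximum (about 1 and 2), and those maxima come from different positions, which a single shared set of averaging weights cannot reproduce.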


[4] Open TutorAI: An Open-source Platform for Personalized and Immersive Learning with Generative AI cs.CL | cs.AI | cs.ET | cs.HC

Mohamed El Hajji, Tarek Ait Baha, Aicha Dakir, Hammou Fadili, Youssef Es-Saady

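TL;DR: This paper introduces Open TutorAI, an open-source educational platform built on large language models and generative technologies that aims to deliver an immersive, personalized learning experience through dynamic personalized tutoring and interaction with customizable 3D avatars.

Details

Motivation: Existing educational chatbot systems lack contextual adaptability, real-time responsiveness, and pedagogical flexibility, which limits learner engagement and instructional effectiveness; an open, integrated platform combining AI with immersive technologies is therefore needed to support personalized learning.

Result: The paper reports no specific quantitative benchmark results, but claims that the platform advances next-generation intelligent tutoring systems by integrating a modular architecture, generative AI, and learning analytics.

Insight: Contributions include combining LLMs with customizable 3D avatars for multimodal interaction, capturing learner preferences through a structured onboarding process to configure a personalized AI assistant, and integrating embedded learning analytics to support self-regulated learning, thereby enhancing emotional presence and immersion.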

Abstract: Recent advances in artificial intelligence have created new possibilities for making education more scalable, adaptive, and learner-centered. However, existing educational chatbot systems often lack contextual adaptability, real-time responsiveness, and pedagogical agility, which can limit learner engagement and diminish instructional effectiveness. Thus, there is a growing need for open, integrative platforms that combine AI and immersive technologies to support personalized, meaningful learning experiences. This paper presents Open TutorAI, an open-source educational platform based on LLMs and generative technologies that provides dynamic, personalized tutoring. The system integrates natural language processing with customizable 3D avatars to enable multimodal learner interaction. Through a structured onboarding process, it captures each learner’s goals and preferences in order to configure a learner-specific AI assistant. This assistant is accessible via both text-based and avatar-driven interfaces. The platform includes tools for organizing content, providing embedded feedback, and offering dedicated interfaces for learners, educators, and parents. This work focuses on learner-facing components, delivering a tool for adaptive support that responds to individual learner profiles without requiring technical expertise. Its assistant-generation pipeline and avatar integration enhance engagement and emotional presence, creating a more humanized, immersive learning environment. Embedded learning analytics support self-regulated learning by tracking engagement patterns and generating actionable feedback. The result is Open TutorAI, which unites modular architecture, generative AI, and learner analytics within an open-source framework. It contributes to the development of next-generation intelligent tutoring systems.


[5] ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations cs.CL | cs.IR

Long S. T. Nguyen, Quan M. Bui, Tin T. Ngo, Quynh T. N. Vo, Dung N. H. Le

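TL;DR: This paper introduces ViHERMES, a multihop question answering benchmark dataset and system for Vietnamese healthcare regulations. The dataset is built with a generation pipeline based on semantic clustering and graph-inspired data mining, and contains high-quality question-answer pairs that require reasoning across multiple regulations. The paper also proposes a graph-aware retrieval framework that models formal relations between legal units to support legally valid and coherent answers. Experiments show that the dataset is challenging for multihop regulatory QA systems and that the proposed graph-aware approach outperforms retrieval-based baselines.

Details

Motivation: To address the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations in low-resource languages such as Vietnamese, so that retrieval-augmented and graph-based QA methods can be evaluated systematically on multihop reasoning across legally interdependent texts.

Result: Experiments show that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems, and that the proposed graph-aware retrieval approach consistently outperforms strong retrieval-based baselines on it.

Insight: The contributions are: 1) a QA benchmark dataset supporting multihop reasoning for a low-resource language and a specific domain (Vietnamese healthcare regulations), described in the summary as the first of its kind; 2) a controlled multihop QA generation pipeline combining semantic clustering and graph-inspired data mining; 3) a graph-aware retrieval framework that models formal legal relations at the level of legal units to enable principled context expansion and keep answers legally valid and coherent.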

Abstract: Question Answering (QA) over regulatory documents is inherently challenging due to the need for multihop reasoning across legally interdependent texts, a requirement that is particularly pronounced in the healthcare domain where regulations are hierarchically structured and frequently revised through amendments and cross-references. Despite recent progress in retrieval-augmented and graph-based QA methods, systematic evaluation in this setting remains limited, especially for low-resource languages such as Vietnamese, due to the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations. In this work, we introduce the Vietnamese Healthcare Regulations-Multihop Reasoning Dataset (ViHERMES), a benchmark designed for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of high-quality question-answer pairs that require reasoning across multiple regulations and capture diverse dependency patterns, including amendment tracing, cross-document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline based on semantic clustering and graph-inspired data mining, followed by large language model-based generation with structured evidence and reasoning annotations. We further present a graph-aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion for legally valid and coherent answers. Experimental results demonstrate that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems and that the proposed graph-aware approach consistently outperforms strong retrieval-based baselines. The ViHERMES dataset and system implementation are publicly available at https://github.com/ura-hcmut/ViHERMES.


[6] DLLM Agent: See Farther, Run Faster cs.CL

Huiling Zhen, Weizhe Lin, Renxi Liu, Kai Han, Yiming Li

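TL;DR: This paper studies how agents built on diffusion large language models (DLLMs) perform on planning and tool-use tasks. Comparing them with autoregressive (AR) models under the same framework (DeepDiver) and training data, it finds that DLLM agents are on average over 30% more efficient end to end at comparable accuracy, and converge earlier to a correct action path with less backtracking.

Details

Motivation: Diffusion large language models have appealing efficiency and modeling properties, but their potential for multi-step agentic decision making is underexplored. The paper asks whether switching the generation paradigm from autoregressive to diffusion induces systematic differences in planning behavior, and whether those differences translate into end-to-end efficiency gains.

Result: Across multiple benchmarks and case studies, DLLM agents are on average more than 30% faster end to end than AR agents at comparable accuracy, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM agents also need fewer interaction rounds and tool calls, showing higher planner hit rates and earlier convergence.

Insight: Contributions include: 1) what the summary describes as the first systematic comparison of DLLM and AR models on planning tasks within the same agent framework; 2) the finding that DLLM agents show stronger global planning signals and more efficient decision convergence; 3) two practical considerations for deploying diffusion backbones: stronger tool-call-specific training is needed to avoid structured failures, and attention masks must be aligned to prevent spurious information flow in multi-turn inputs.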

Abstract: Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.


[7] SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning cs.CL

Yijie Chen, Yijin Liu, Fandong Meng

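TL;DR: This paper proposes SED-SFT, a new method for addressing mode collapse in supervised fine-tuning of large language models. It introduces a selective entropy regularization term built on a selective masking mechanism, which adaptively encourages diversity over the token exploration space. This significantly improves generation diversity while preserving accuracy and gives the subsequent reinforcement learning stage a better basis for exploration.

Details

Motivation: Conventional supervised fine-tuning with cross-entropy loss tends to cause mode collapse, where the model over-concentrates on specific response patterns, limiting the exploration efficiency that subsequent reinforcement learning needs. Existing improvements fail to balance diversity and accuracy adequately, leading to poor performance after reinforcement learning.

Result: Extensive experiments on eight mathematical benchmarks show that SED-SFT significantly enhances generation diversity with negligible additional computational overhead. On Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, subsequent reinforcement learning performance improves over standard cross-entropy baselines by an average of 2.06 and 1.20 points, respectively.

Insight: The core contribution is a selective entropy regularization framework that encourages diversity adaptively through a selective masking mechanism, instead of applying regularization indiscriminately, and thereby balances diversity and accuracy more effectively. It offers a novel and efficient optimization approach to mode collapse in SFT.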

Abstract: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with a negligible computational overhead increase compared with CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at https://github.com/pppa2019/SED-SFT


[8] From Native Memes to Global Moderation: Cross-Cultural Evaluation of Vision-Language Models for Hateful Meme Detection cs.CL

Mo Wang, Kaixuan Ren, Pratik Jalan, Ahmed Ashraf, Tuong Vy Vu

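TL;DR: This paper systematically evaluates the cross-cultural robustness of vision-language models on multilingual hateful meme detection. It finds that the common "translate-then-detect" approach degrades performance, while culturally aligned interventions such as native-language prompting and one-shot learning significantly improve detection.

Details

Motivation: Current vision-language models are trained mainly through Western or English-centric lenses, which limits their fairness and cross-cultural robustness in tasks such as hateful meme detection. Their performance across cultural contexts therefore needs to be evaluated and improved.

Result: Experiments on multilingual meme datasets show that performance drops under "translate-then-detect", while native-language prompting and one-shot learning significantly improve detection, revealing a systematic convergence of the models toward Western safety norms.

Insight: The contribution is a systematic evaluation framework for cross-cultural robustness, together with evidence that culturally aligned interventions (native-language prompting and one-shot learning) are effective, which provides actionable strategies for designing globally robust multimodal content moderation systems.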

Abstract: Cultural context profoundly shapes how people interpret online content, yet vision-language models (VLMs) remain predominantly trained through Western or English-centric lenses. This limits their fairness and cross-cultural robustness in tasks like hateful meme detection. We introduce a systematic evaluation framework designed to diagnose and quantify the cross-cultural robustness of state-of-the-art VLMs across multilingual meme datasets, analyzing three axes: (i) learning strategy (zero-shot vs. one-shot), (ii) prompting language (native vs. English), and (iii) translation effects on meaning and detection. Results show that the common “translate-then-detect” approach degrades performance, while culturally aligned interventions (native-language prompting and one-shot learning) significantly enhance detection. Our findings reveal systematic convergence toward Western safety norms and provide actionable strategies to mitigate such bias, guiding the design of globally robust multimodal moderation systems.


[9] Let’s Simplify Step by Step: Guiding LLM Towards Multilingual Unsupervised Proficiency-Controlled Sentence Simplification cs.CL

Jingshen Zhang, Xin Ying Qiu, Lifang Lu, Zhuhua Huang, Yutao Hu

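TL;DR: This paper proposes a framework that guides large language models through multilingual, unsupervised, proficiency-controlled sentence simplification. It decomposes complex simplifications into manageable steps via dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history for coherent reasoning. Evaluation on two benchmarks and five languages shows that the approach improves simplification effectiveness while reducing computational steps by 22-42%.

Details

Motivation: To address the limited capability of large language models in proficiency-controlled sentence simplification, particularly when simplifying across large gaps in readability level.

Result: Evaluation on two benchmarks and five languages shows that the approach improves simplification effectiveness while reducing computational steps by 22-42%. Human evaluation confirms a fundamental trade-off between simplification effectiveness and meaning preservation.

Insight: The contribution is a step-by-step simplification framework (dynamic path planning, semantic-aware exemplar selection, chain-of-thought generation with conversation history) that strengthens control over the simplification process. Viewed objectively, the method effectively reduces computational overhead, but the study also shows that preserving semantic fidelity during extensive simplification remains an open challenge; even human annotators struggle to agree on it, which underscores the inherent complexity of the task.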

Abstract: Large language models demonstrate limited capability in proficiency-controlled sentence simplification, particularly when simplifying across large readability levels. We propose a framework that decomposes complex simplifications into manageable steps through dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history for coherent reasoning. Evaluation on five languages across two benchmarks shows our approach improves simplification effectiveness while reducing computational steps by 22-42%. Human evaluation confirms the fundamental trade-off between simplification effectiveness and meaning preservation. Notably, even human annotators struggle to agree on semantic preservation judgments, highlighting the inherent complexity of this task. Our work shows that while step-by-step simplification improves control, preserving semantic fidelity during extensive simplification remains an open challenge.


[10] Learning to Self-Verify Makes Language Models Better Reasoners cs.CL | cs.AI

Yuxin Chen, Yu Wang, Yi Zhang, Ziang Ye, Zhengzhou Cai

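TL;DR: This paper studies the asymmetry between generation and self-verification capabilities in large language models, finds that learning to self-verify effectively improves generation performance, and proposes a multi-task reinforcement learning framework that jointly optimizes generation and verification objectives.

Details

Motivation: To address the asymmetry in which large language models generate well on complex tasks but verify their own answers poorly, and to explore how self-verification can improve overall reasoning performance.

Result: Extensive experiments across benchmarks and models show that the multi-task approach outperforms generation-only training in both generation and verification capabilities. Separately, learning to self-verify on its own reaches accuracy comparable to standard generation training while producing more efficient and effective reasoning traces.

Insight: The contribution is showing that self-verification improves generation in the reverse direction, and optimizing generation and verification as independent but complementary objectives in multi-task reinforcement learning. The approach can inform efforts to improve the reasoning reliability and efficiency of language models.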

Abstract: Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.


[11] SciClaimEval: Cross-modal Claim Verification in Scientific Papers cs.CL

Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Tian Cheng Xia, Florian Boudin

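TL;DR: SciClaimEval is a new multimodal dataset for claim verification in scientific papers. It features authentic claims, including refuted ones, extracted directly from published papers, and creates refutations by modifying the supporting evidence (figures and tables) instead of altering the claims or relying on LLMs. The dataset contains 1,664 annotated samples from three domains (machine learning, natural language processing, and medicine) and provides cross-modal evidence in multiple formats, including images, LaTeX, HTML, and JSON. A benchmark of 11 multimodal foundation models finds that figure-based verification remains highly challenging for all models, with a significant gap to the human baseline.

Details

Motivation: To address the lack of authentic scientific claims, especially refuted ones, in existing claim verification datasets, as well as their reliance on a single evidence modality, by providing a more realistic, multimodal benchmark for scientific claim verification.

Result: Benchmarking 11 open-source and proprietary multimodal foundation models on SciClaimEval shows that all models perform poorly on figure-based verification, with a significant performance gap between the best system and the human baseline.

Insight: The contribution is generating refuted claims by directly modifying figure and table evidence in scientific papers, which yields a more realistic dataset; providing the evidence in multiple formats also supports research on cross-modal understanding. Viewed objectively, the approach avoids biases that LLM-generated contradictions might introduce and offers a more reliable benchmark for assessing how well models understand scientific content.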

Abstract: We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables), rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains, machine learning, natural language processing, and medicine, validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification remains particularly challenging for all models, as a substantial performance gap remains between the best system and human baseline.


[12] Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation cs.CL

Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur

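TL;DR: This paper studies bias in LLM-based summary evaluation, specifically how judge preferences relate to overlap with human-written summaries. It finds that as the similarity between the judged summaries and human summaries (measured by ROUGE and BLEU) decreases, LLM judges increasingly prefer summaries generated by other LLMs over human-written ones. The pattern holds for nearly all tested models and is independent of the models' own position biases.

Details

Motivation: LLM judges capture semantic information better than traditional algorithmic metrics on tasks like summarization, reason better, and are more robust to paraphrasing, but they exhibit biases such as length and order bias and are vulnerable to adversarial prompts. Few studies have analyzed these biases at a fine-grained level against a well-defined overlap metric; this paper aims to fill that gap.

Result: Nine recent LLMs with 1 billion to 12 billion parameters were tested, including variants of Gemma 3 and LLaMA 3. As the ROUGE and BLEU similarity between the judged summaries and human summaries decreases, all but one LLM judge increasingly prefer LLM-generated summaries over human ones, independent of the models' position biases. In addition, models struggle to judge even summaries with limited overlap.

Insight: The paper reveals a systematic bias in LLM judges for summarization that is tied to overlap with human summaries, which the summary characterizes as a "non-human preference" bias. The finding suggests that relying on simple LLM comparisons for summary evaluation may be unreliable and that additional techniques are needed to overcome the bias, which is useful guidance for improving LLM-based evaluation.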

Abstract: Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges show biases for length and order among others, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work we provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarities (as measured by ROUGE and BLEU) between the judged summaries decrease, and this pattern extends to all but one model tested, and exists regardless of the models’ own position biases. Additionally, we find that models struggle to judge even summaries with limited overlaps, suggesting that LLM-as-a-judge in the summary domain should rely on techniques beyond a simple comparison.
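
The analysis above is organized around lexical overlap measured with ROUGE and BLEU. As a rough illustration of what such a score captures, the sketch below computes a simplified unigram ROUGE-1 F1 between two summaries. Whitespace tokenization and lowercasing are simplifying assumptions, the example sentences are invented, and the paper's exact metric configuration is not stated in the abstract.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example pair: a human-written summary and a model-written one.
human = "the council approved the new budget"
llm = "the council passed a new budget on monday"
score = rouge1_f1(llm, human)
print(f"ROUGE-1 F1 = {score:.3f}")
```

A score like this is the kind of overlap variable against which the judges' preferences are analyzed; lower values mean the judged summary shares fewer words with the human-written one.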


[13] SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents cs.CL | cs.IR

Chen Zhang, Kuicai Dong, Dexun Li, Wenjun Li, Qu Yang

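TL;DR: This paper proposes SRR-Judge, a framework for fine-grained, step-level assessment and refinement of the search-integrated reasoning of deep search agents built on large reasoning models. Through a modified ReAct-style rate-and-refine workflow, it supervises intermediate thoughts and actions, and uses the annotated data for iterative rejection sampling fine-tuning, which substantially improves agent performance on complex question answering.

Details

Motivation: Deep search agents built on large reasoning models are usually trained with supervision from the final answer only, neglecting the quality of intermediate reasoning steps and search actions, which limits further improvement of their search-integrated reasoning.

Result: Experiments show that SRR-Judge's step-level evaluations are more reliable than those of much larger models such as DeepSeek-V3.1, and that its ratings correlate strongly with final answer correctness. Aligning the policy with SRR-annotated data yields an average absolute pass@1 improvement of over 10 percent across several challenging deep search benchmarks.

Insight: The core contribution is refining the supervision signal from the outcome level to the reasoning-step level, with a framework for rating and refining intermediate steps that integrates into the existing ReAct paradigm. This offers a new way to improve the reasoning robustness and training-data efficiency of search agents, and the efficiency of the judge itself is also worth noting.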

Abstract: Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.


[14] Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models cs.CL | cs.AI

Ningyu Xu, Qi Zhang, Xipeng Qiu, Xuanjing Huang

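TL;DR: This paper studies the internal processing of large language models during in-context concept inference. It finds that a conceptual subspace emerges in middle to late layers, with a representational structure that remains stable across contexts, and uses causal mediation analysis to confirm that this subspace plays a functional, causal role in model predictions.

Details

Motivation: To determine whether LLMs functionally rely on human-like structured conceptual representations during reasoning, in order to understand the computational basis of their flexible adaptation.

Result: Causal mediation analysis confirms that the conceptual subspace has a causal role in model predictions. The paper also reveals a layer-wise progression in which attention heads in early-to-middle layers integrate contextual cues to construct and refine the subspace, which later layers then use to generate predictions.

Insight: LLMs dynamically construct and use structured latent representations during inference. This offers a new perspective on the computations inside these models and suggests they may have human-like mechanisms for building and using representations.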

Abstract: Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. While recent work has identified structured, human-like conceptual representations within these models, it remains unclear whether they functionally rely on such representations for reasoning. Here we investigate the internal processing of LLMs during in-context concept inference. Our results reveal a conceptual subspace emerging in middle to late layers, whose representational structure persists across contexts. Using causal mediation analyses, we demonstrate that this subspace is not merely an epiphenomenon but is functionally central to model predictions, establishing its causal role in inference. We further identify a layer-wise progression where attention heads in early-to-middle layers integrate contextual cues to construct and refine the subspace, which is subsequently leveraged by later layers to generate predictions. Together, these findings provide evidence that LLMs dynamically construct and use structured, latent representations in context for inference, offering insights into the computational processes underlying flexible adaptation.


[15] Thinking Makes LLM Agents Introverted: How Mandatory Thinking Can Backfire in User-Engaged Agents cs.CL

Jiatong Li, Changdae Oh, Hyeong Kyu Choi, Jindong Wang, Sharon Li

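TL;DR: This paper studies how forcing LLM agents to think explicitly (e.g., with reasoning chains) affects their performance in user-engaged scenarios. Experiments covering seven models, three benchmarks, and two thinking instantiations show that mandatory thinking degrades performance by making agents "introverted": they shorten their responses and disclose less information to users, which weakens agent-user information exchange and leads to downstream task failures. The study also shows that explicitly prompting for information disclosure reliably improves performance, underscoring the importance of proactive transparency for agent optimization.

Details

Motivation: Although eliciting reasoning (thinking) has been shown to improve LLM performance on complex tasks, its effectiveness in realistic user-engaged agent scenarios is unclear. The paper aims to systematically examine the actual effect of explicit thinking in such scenarios.

Result: Mandatory thinking often backfires in user-engaged settings, causing anomalous performance degradation across various LLMs. The evaluation combines a quantitative response taxonomy analysis with qualitative failure propagation case studies, and shows that explicitly prompting for information disclosure reliably improves performance across model families.

Insight: The contribution is showing that mandatory thinking can hurt performance by making agents "introverted" (reducing information exchange), and identifying information transparency awareness as a crucial yet underexplored perspective for designing real-world reasoning agents. Viewed objectively, the study highlights the importance of balancing reasoning with information sharing in user-interaction scenarios and suggests a new direction for agent optimization.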

Abstract: Eliciting reasoning has emerged as a powerful technique for improving the performance of large language models (LLMs) on complex tasks by inducing thinking. However, its effectiveness in realistic user-engaged agent scenarios remains unclear. In this paper, we conduct a comprehensive study on the effect of explicit thinking in user-engaged LLM agents. Our experiments span seven models, three benchmarks, and two thinking instantiations, and we evaluate them through both a quantitative response taxonomy analysis and qualitative failure propagation case studies. Contrary to expectations, we find that mandatory thinking often backfires on agents in user-engaged settings, causing anomalous performance degradation across various LLMs. Our key finding reveals that thinking makes agents more “introverted” by shortening responses and reducing information disclosure to users, which weakens agent-user information exchange and leads to downstream task failures. Furthermore, we demonstrate that explicitly prompting for information disclosure reliably improves performance across diverse model families, suggesting that proactive transparency is a vital lever for agent optimization. Overall, our study suggests that information transparency awareness is a crucial yet underexplored perspective for the future design of reasoning agents in real-world scenarios. Our code is available at https://github.com/deeplearning-wisc/Thinking-Agent.


[16] LLMs Know More About Numbers than They Can Say cs.CL

Fengting Yuchi, Li Du, Jason Eisner

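TL;DR: This paper finds that large language models make errors on numerical comparisons with mixed notation. Probing hidden states shows that the models internally encode the log-magnitudes of numbers but cannot reliably turn that information into verbal output. The study uses a linear classifier to extract the ranking of numbers from hidden states and fine-tunes the models with an auxiliary objective, which improves their numerical reasoning.

Details

Motivation: Although state-of-the-art LLMs can solve math problems, they perform poorly on numerical comparisons with mixed notation, which raises the question of whether the models actually know how large the numbers are.

Result: A linear projection recovers numbers with a relative error of about 2.3% on restricted synthetic text and 19.06% on scientific papers. A linear classifier predicts the ranking of a pair of numbers from hidden states with over 90% accuracy, while the models' direct answers are only 50-70% accurate. Fine-tuning with the auxiliary objective improves verbalized accuracy by a further 3.22% over base models.

Insight: The paper shows that LLM hidden states encode the magnitudes of numbers but that the models cannot use this information effectively in their verbal output. Probing and fine-tuning the internal representations can strengthen numerical reasoning, which suggests a new way to improve model performance on numerical tasks.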

Abstract: Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: “Which is larger, $5.7 \times 10^2$ or $580$?” This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe’s log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models’ internal magnitude representations can enhance their numerical reasoning capabilities.
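
For reference, the comparison in the abstract's example has a definite answer: $5.7 \times 10^2 = 570 < 580$. The short sketch below checks it and computes the base-10 log-magnitudes (the quantity the abstract says is linearly decodable from hidden states) along with a relative-error measure. The helper name `relative_error` is hypothetical and no model or probe is involved.

```python
import math

a = 5.7e2  # 5.7 x 10^2 in scientific notation, i.e. 570
b = 580.0  # plain decimal notation

# Ground-truth answer to "Which is larger?"
larger = "580" if b > a else "5.7e2"

# Base-10 log-magnitudes of the two numerals.
log_a, log_b = math.log10(a), math.log10(b)


def relative_error(predicted: float, actual: float) -> float:
    """|predicted - actual| / |actual|, the usual relative-error measure."""
    return abs(predicted - actual) / abs(actual)


print(larger, round(log_a, 4), round(log_b, 4))
print(f"relative gap between the two values: {relative_error(a, b):.4f}")
```

The two log-magnitudes differ by less than 0.01, so the pair is very close on a log scale even though the surface forms look quite different.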


[17] TodoEvolve: Learning to Architect Agent Planning Systems cs.CL | cs.AI | cs.LG

Jiaxi Liu, Yanzuo Jiang, Guibin Zhang, Zihan Zhang, Heng Chang

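TL;DR: TodoEvolve proposes a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures, addressing the inflexibility of existing approaches that rely on fixed, hand-crafted planning structures when facing open-ended problems.

Details

Motivation: The planning capability of existing agent systems relies mainly on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems.

Result: Empirical evaluation on five agentic benchmarks shows that TodoEvolve consistently outperforms carefully engineered planning modules while keeping API costs and runtime overhead economical.

Insight: The contributions are PlanFactory, a standardized modular design space, and Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective for training a model to generate planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones.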

Abstract: Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.


[18] Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection cs.CLPDF

Rui Feng, Zhiyao Luo, Liuyu Wu, Wei Wang, Yuting Song

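TL;DR: This paper proposes SynCog, a framework that combines controllable zero-shot multimodal data synthesis with Chain-of-Thought (CoT) deduction fine-tuning to address clinical data scarcity, weak cross-lingual generalization, and poor interpretability in speech-based detection of Mild Cognitive Impairment (MCI). The method generates data for virtual subjects with varying cognitive profiles to expand the corpus, and fine-tunes a Multimodal Large Language Model (MLLM) with a CoT strategy so that it explicitly articulates its diagnostic reasoning.

Details

Motivation: The development of speech-based digital biomarkers is held back by a severe shortage of clinical data, a lack of model interpretability, and weak cross-lingual generalization. These obstacles hinder both the construction of robust diagnostic models and the establishment of clinical trust.

Result: On the ADReSS and ADReSSo benchmarks, augmentation with synthetic data yields competitive diagnostic performance, with Macro-F1 scores of 80.67% and 78.46% respectively, outperforming current baselines. Evaluation on an independent real-world Mandarin cohort (CIR-E) also shows robust cross-lingual generalization, with a Macro-F1 of 48.71%.

Insight: The novelty lies in combining controllable, persona-based zero-shot multimodal data synthesis with CoT deduction fine-tuning. This alleviates data scarcity and improves cross-lingual performance, and, by having the model output an explicit diagnostic reasoning chain, it also improves interpretability and clinical trustworthiness.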

Abstract: Speech-based digital biomarkers represent a scalable, non-invasive frontier for the early identification of Mild Cognitive Impairment (MCI). However, the development of robust diagnostic models remains impeded by acute clinical data scarcity and a lack of interpretable reasoning. Current solutions frequently struggle with cross-lingual generalization and fail to provide the transparent rationales essential for clinical trust. To address these barriers, we introduce SynCog, a novel framework integrating controllable zero-shot multimodal data synthesis with Chain-of-Thought (CoT) deduction fine-tuning. Specifically, SynCog simulates diverse virtual subjects with varying cognitive profiles to effectively alleviate clinical data scarcity. This generative paradigm enables the rapid, zero-shot expansion of clinical corpora across diverse languages, effectively bypassing data bottlenecks in low-resource settings and bolstering the diagnostic performance of Multimodal Large Language Models (MLLMs). Leveraging this synthesized dataset, we fine-tune a foundational multimodal backbone using a CoT deduction strategy, empowering the model to explicitly articulate diagnostic thought processes rather than relying on black-box predictions. Extensive experiments on the ADReSS and ADReSSo benchmarks demonstrate that augmenting limited clinical data with synthetic phenotypes yields competitive diagnostic performance, achieving Macro-F1 scores of 80.67% and 78.46%, respectively, outperforming current baseline models. Furthermore, evaluation on an independent real-world Mandarin cohort (CIR-E) demonstrates robust cross-linguistic generalization, attaining a Macro-F1 of 48.71%. These findings constitute a critical step toward providing clinically trustworthy and linguistically inclusive cognitive assessment tools for global healthcare.


[19] The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation cs.CLPDF

Arash Marioriyad, Omid Ghahroodi, Ehsaneddin Asgari, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

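TL;DR: This paper studies LLM-based automatic evaluation and finds that "judge" models, when evaluating tasks such as factual QA and creative writing, implicitly rely on irrelevant contextual cues injected into the prompt (such as source, time, and demographic information) and base their verdicts on them, yet rarely acknowledge the influence of these cues in their natural-language rationales.

Details

Motivation: To test the faithfulness of LLMs as automatic judges: whether their verdicts rest solely on content quality, remain invariant to irrelevant context, and transparently reflect the factors driving the decision.

Result: Six judge models (GPT-4o and others) were tested on two datasets, ELI5 (factual QA) and LitBench (open-ended creative writing). The models show substantial verdict shift rates (VSR) for several cues (e.g., source, recency, educational status), but the cue acknowledgment rate (CAR) is typically near zero, indicating heavy reliance on unreported shortcuts.

Insight: The novelty lies in using controlled cue perturbations and introducing the cue acknowledgment rate (CAR) metric to reveal an explanation gap in LLM evaluation pipelines: model behavior is affected by irrelevant cues, yet the rationales do not acknowledge them, which calls into question the reliability of such judges in research and deployment.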

Abstract: Large language models (LLMs) are increasingly used as automatic judges to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. A faithful judge should base its verdicts solely on content quality, remain invariant to irrelevant context, and transparently reflect the factors driving its decisions. We test this ideal via controlled cue perturbations (synthetic metadata labels injected into evaluation prompts) for six judge models: GPT-4o, Gemini-2.0-Flash, Gemma-3-27B, Qwen3-235B, Claude-3-Haiku, and Llama3-70B. Experiments span two complementary datasets with distinct evaluation regimes: ELI5 (factual QA) and LitBench (open-ended creative writing). We study six cue families: source, temporal, age, gender, ethnicity, and educational status. Beyond measuring verdict shift rates (VSR), we introduce cue acknowledgment rate (CAR) to quantify whether judges explicitly reference the injected cues in their natural-language rationales. Across cues with strong behavioral effects, e.g., provenance hierarchies (Expert > Human > LLM > Unknown), recency preferences (New > Old), and educational-status favoritism, CAR is typically at or near zero, indicating that shortcut reliance is largely unreported even when it drives decisions. Crucially, CAR is also dataset-dependent: explicit cue recognition is more likely to surface in the factual ELI5 setting for some models and cues, but often collapses in the open-ended LitBench regime, where large verdict shifts can persist despite zero acknowledgment. The combination of substantial verdict sensitivity and limited cue acknowledgment reveals an explanation gap in LLM-as-judge pipelines, raising concerns about reliability of model-based evaluation in both research and deployment.
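
The two metrics named in the abstract can be illustrated with a small sketch. The exact definitions used by the authors are not given here, so the code assumes that VSR is the fraction of items whose verdict changes once a cue is injected, and that CAR is the fraction of cued rationales that mention the cue, detected by a simple keyword match. The `Trial` record and the keyword list are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trial:
    baseline_verdict: str        # verdict without the injected cue
    cued_verdict: str            # verdict with the injected cue
    rationale: str               # judge's explanation in the cued condition
    cue_terms: Tuple[str, ...]   # hypothetical keywords counted as acknowledgment

def verdict_shift_rate(trials: List[Trial]) -> float:
    """Assumed definition: share of trials whose verdict changed under the cue."""
    shifted = sum(t.baseline_verdict != t.cued_verdict for t in trials)
    return shifted / len(trials)

def cue_acknowledgment_rate(trials: List[Trial]) -> float:
    """Assumed definition: share of cued rationales that mention the cue."""
    def acknowledged(t: Trial) -> bool:
        text = t.rationale.lower()
        return any(term.lower() in text for term in t.cue_terms)
    return sum(acknowledged(t) for t in trials) / len(trials)

trials = [
    Trial("A", "B", "Response B is more detailed.", ("expert",)),
    Trial("A", "A", "Response A is clearer.", ("expert",)),
    Trial("B", "A", "Response A was written by an expert.", ("expert",)),
    Trial("A", "B", "Response B flows better.", ("expert",)),
]
vsr = verdict_shift_rate(trials)        # 3 of 4 verdicts changed
car = cue_acknowledgment_rate(trials)   # 1 of 4 rationales mentions the cue
```

A high `vsr` together with a low `car`, as in this toy data, is the pattern the abstract calls an explanation gap.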


[20] DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity cs.CL | cs.AIPDF

Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu

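TL;DR: DeltaKV is a residual-based KV cache compression framework that exploits long-range inter-token similarity and highly shared latent components in KV representations, encoding semantic residuals relative to retrieved historical references to substantially reduce storage while preserving accuracy. Paired with the high-performance inference engine Sparse-vLLM, it achieves up to 2x throughput improvement in long-context scenarios.

Details

Motivation: Long-context LLM deployment is bottlenecked by the linear growth of KV cache memory, and existing compression and eviction methods struggle to balance accuracy, compression ratio, and hardware efficiency.

Result: On the LongBench, SCBench, and AIME benchmarks, DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy. When integrated with Sparse-vLLM, it achieves up to 2x throughput improvement over vLLM in long-context scenarios.

Insight: The novelty lies in building on two empirical findings (long-range token similarity and shared latent components), in compressing by residual encoding rather than discarding tokens, and in pairing the method with a dedicated inference engine optimized for sparse, irregular KV layouts so that compression turns into real system speedups.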

Abstract: The deployment of efficient long-context LLMs in applications like autonomous agents, long-chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual-based KV cache compression framework motivated by two empirical findings: long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse-vLLM, it achieves up to 2$\times$ throughput improvement over vLLM in long-context scenarios, demonstrating a practical path toward scalable long-context LLM deployment. Code, model checkpoints, and datasets are available at https://github.com/CURRENTF/Sparse-vLLM.
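
The idea of encoding a residual relative to a retrieved reference can be sketched on toy vectors. This is an illustrative assumption, not DeltaKV's actual algorithm: the nearest-neighbour retrieval by squared distance, the function names, and the example vectors are all hypothetical, and the sketch stores the residual losslessly, whereas a real system would compress it.

```python
from typing import List, Tuple

Vector = List[float]

def nearest_reference(vec: Vector, references: List[Vector]) -> int:
    """Index of the stored reference closest to vec (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(references)), key=lambda i: sq_dist(vec, references[i]))

def encode(vec: Vector, references: List[Vector]) -> Tuple[int, Vector]:
    """Represent vec as (reference index, residual from that reference)."""
    idx = nearest_reference(vec, references)
    residual = [x - y for x, y in zip(vec, references[idx])]
    return idx, residual

def decode(idx: int, residual: Vector, references: List[Vector]) -> Vector:
    """Reconstruct the vector by adding the residual back to its reference."""
    return [r + d for r, d in zip(references[idx], residual)]

def norm(v: Vector) -> float:
    return sum(x * x for x in v) ** 0.5

references = [[1.0, 0.0, 2.0], [0.0, 3.0, -1.0]]   # toy "historical" vectors
new_vec = [0.1, 2.9, -1.2]                          # similar to the second reference
idx, residual = encode(new_vec, references)
restored = decode(idx, residual, references)
```

When a new vector resembles an earlier one, the residual has a much smaller norm than the vector itself, which is what makes it cheaper to store after quantization or low-rank projection.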


[21] Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning cs.CLPDF

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

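TL;DR: This paper proposes Diverge-to-Induce Prompting (DIP), a framework for improving the zero-shot reasoning performance of large language models. The method first generates multiple distinct high-level reasoning strategies (rationales), then expands each into a detailed step-by-step draft plan, and finally induces these drafts into a single final plan, avoiding both the single-strategy limitation of earlier guided prompting and the need for resource-intensive sampling.

Details

Motivation: To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, and the fact that existing methods relying on a single reasoning strategy may be limited in performance across different tasks.

Result: Experiments show that DIP surpasses single-strategy prompting in zero-shot reasoning accuracy, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

Insight: The core novelty is the diverge-then-induce design: robustness and generalization are improved by generating and merging multiple distinct high-level reasoning strategies rather than relying on a single strategy or heavy sampling, which offers an efficient route to better zero-shot reasoning.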

Abstract: To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.
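
The three stages described in the abstract (diverge into rationales, elaborate each into a draft plan, induce a final plan) can be sketched as a short pipeline. This is a minimal sketch under stated assumptions: the prompt wording is invented for illustration, `llm` is any caller-supplied text-in, text-out function, and `stub_llm` stands in for a real model so the example runs offline.

```python
from typing import Callable, List

def diverge_to_induce(question: str,
                      llm: Callable[[str], str],
                      num_rationales: int = 3) -> str:
    """Return a final plan induced from several independently elaborated drafts."""
    # Stage 1: diverge into several high-level rationales.
    rationales: List[str] = [
        llm(f"Question: {question}\nPropose high-level strategy #{i + 1}, "
            "different from the other strategies.")
        for i in range(num_rationales)
    ]
    # Stage 2: elaborate each rationale into a step-by-step draft plan.
    drafts: List[str] = [
        llm(f"Question: {question}\nStrategy: {r}\n"
            "Expand this strategy into a detailed step-by-step plan.")
        for r in rationales
    ]
    # Stage 3: induce one final plan from all drafts.
    joined = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return llm(f"Question: {question}\n{joined}\n"
               "Induce a single final plan from these drafts.")

calls: List[str] = []

def stub_llm(prompt: str) -> str:
    """Placeholder model: records the prompt and returns a numbered dummy output."""
    calls.append(prompt)
    return f"output-{len(calls)}"

plan = diverge_to_induce("What is 17 * 24?", stub_llm, num_rationales=3)
```

With `k` rationales the sketch issues `2k + 1` model calls, a fixed cost that contrasts with sampling many full reasoning chains.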


[22] TDGNet: Hallucination Detection in Diffusion Language Models via Temporal Dynamic Graphs cs.CLPDF

Arshia Hemmat, Philip Torr, Yongqiang Chen, Junchi Yu

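TL;DR: This paper proposes TDGNet, a temporal dynamic graph framework for detecting hallucinations in diffusion language models. The method formulates hallucination detection as learning over evolving token-level attention graphs: it sparsifies the attention graph, updates per-token memories via message passing, and uses temporal attention to aggregate evidence across the whole denoising trajectory for the final prediction.

Details

Motivation: Diffusion language models offer parallel denoising and bidirectional context, but hallucination detection for them remains underexplored. Existing detectors built for autoregressive LLMs typically rely on single-pass cues and do not transfer directly to diffusion generation, because factuality evidence is distributed across the denoising trajectory and may appear, drift, or be self-corrected over time.

Result: On QA benchmarks with LLaDA-8B and Dream-7B, TDGNet consistently improves AUROC over output-based, latent-based, and static-graph baselines, while requiring only single-pass inference and modest computational overhead.

Insight: The novelty lies in formulating hallucination detection as a temporal dynamic graph learning problem, aggregating evidence by capturing how attention graphs evolve along the denoising trajectory. From an objective standpoint, the work highlights the importance of temporal reasoning over attention graphs for robust hallucination detection in diffusion language models and offers a new way to handle evidence that is dynamically distributed during generation.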

Abstract: Diffusion language models (D-LLMs) offer parallel denoising and bidirectional context, but hallucination detection for D-LLMs remains underexplored. Prior detectors developed for auto-regressive LLMs typically rely on single-pass cues and do not directly transfer to diffusion generation, where factuality evidence is distributed across the denoising trajectory and may appear, drift, or be self-corrected over time. We introduce TDGNet, a temporal dynamic graph framework that formulates hallucination detection as learning over evolving token-level attention graphs. At each denoising step, we sparsify the attention graph and update per-token memories via message passing, then apply temporal attention to aggregate trajectory-wide evidence for final prediction. Experiments on LLaDA-8B and Dream-7B across QA benchmarks show consistent AUROC improvements over output-based, latent-based, and static-graph baselines, with single-pass inference and modest overhead. These results highlight the importance of temporal reasoning on attention graphs for robust hallucination detection in diffusion language models.


[23] Emergent Search and Backtracking in Latent Reasoning Models cs.CL | cs.AIPDF

Jasmine Cui, Charles Ye

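TL;DR: This paper studies how latent reasoning transformers (LRTs) reason in continuous hidden space and finds that the model spontaneously learns a structured search process in latent space, with phases of exploration, tentative commitment, and then convergence or backtracking. Backtracking is prevalent and beneficial, is associated with substantially higher accuracy, and the search process is adaptive.

Details

Motivation: To explore the internal reasoning mechanism of language models when they think without words, in contrast to the explicitly verbalized steps of standard chain-of-thought (CoT) models, and to study the dynamic decision process of latent reasoning models in hidden space.

Result: On a multiple-choice QA benchmark, backtracking occurs in 32% of instances and yields a 34% accuracy gain over non-backtracking instances. Replacing distractors with implausible alternatives shortens the exploration phase by 54%.

Insight: Latent reasoning models achieve in activation space a self-correction ability similar to that of chain-of-thought. Their structured search process, including adaptive exploration and directed backtracking, is the key finding about implicit reasoning and offers a new view of models' internal decision dynamics.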

Abstract: What happens when a language model thinks without words? Standard reasoning LLMs verbalize intermediate steps as chain-of-thought; latent reasoning transformers (LRTs) instead perform deliberation entirely in continuous hidden space. We investigate an LRT, decoding the model’s evolving beliefs at every step on a multiple-choice QA benchmark. We find that the model spontaneously learns a structured search process in latent space. Deliberation follows a consistent trajectory: an exploration phase where probability mass spreads across candidates, tentative commitment to a frontrunner, and either convergence or backtracking. Backtracking is prevalent (32% of instances), beneficial (34% accuracy gain over non-backtracking instances), and predominantly directed away from the semantically closest distractor toward the correct answer. The search is adaptive: replacing distractors with implausible alternatives shortens exploration by 54%. Latent reasoning models achieve in activation space what chain-of-thought achieves through words: the ability to be wrong, notice, and recover.


[24] LLMs and people both learn to form conventions – just not with each other cs.CL | cs.HCPDF

Cameron R. Jones, Agnese Lombardi, Kyle Mahowald, Benjamin K. Bergen

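TL;DR: This paper studies the ability of humans and large language models (LLMs) to form communicative conventions in a multimodal communication game. In same-type dyads (human-human, AI-AI), both form conventions, shown by rising accuracy and consistency and shrinking message length. In mixed human-AI dyads, however, convention formation fails, indicating that the two have different communicative tendencies. Experiment 2 prompts the LLMs to imitate human behavior, which brings their message length in line with that of humans, but accuracy and lexical overlap still lag behind the same-type dyads, suggesting that conversational alignment requires not only the ability to mimic but also shared interpretative biases toward the meanings conveyed.

Details

Motivation: To investigate whether humans and LLMs can form shared communicative conventions in conversation, and whether mixed human-AI conversations can achieve effective alignment, in order to understand how AI and humans differ in their communicative mechanisms.

Result: In same-type dyads (human-human, AI-AI), accuracy and consistency increase and message length decreases, indicating successful convention formation. In mixed human-AI dyads, accuracy and lexical overlap continue to lag, and even when prompting brings message length to human levels, alignment does not reach that of the same-type dyads.

Insight: The novelty lies in experimentally comparing humans and LLMs on convention formation and showing that conversational alignment depends not only on imitating surface behavior but also on shared interpretative biases. This offers an important lesson for human-AI interaction: alignment at the semantic level matters, not just matching of form.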

Abstract: Humans align to one another in conversation – adopting shared conventions that ease communication. We test whether LLMs form the same kinds of conventions in a multimodal communication game. Both humans and LLMs display evidence of convention-formation (increasing the accuracy and consistency of their turns while decreasing their length) when communicating in same-type dyads (humans with humans, AI with AI). However, heterogeneous human-AI pairs fail – suggesting differences in communicative tendencies. In Experiment 2, we ask whether LLMs can be induced to behave more like human conversants, by prompting them to produce superficially humanlike behavior. While the length of their messages matches that of human pairs, accuracy and lexical overlap in human-LLM pairs continue to lag behind those of both human-human and AI-AI pairs. These results suggest that conversational alignment requires not just the ability to mimic previous interactions, but also shared interpretative biases toward the meanings that are conveyed.


[25] Document Reconstruction Unlocks Scalable Long-Context RLVR cs.CLPDF

Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin

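TL;DR: This paper proposes an unsupervised reinforcement learning approach (RLVR) that strengthens the long-context capabilities of large language models through a document reconstruction task. The method randomly replaces some paragraphs of a long document with placeholders and trains the model, via reinforcement learning, to correctly identify and order the missing paragraphs from a set of candidates so as to reconstruct the original document, thereby learning global narrative coherence.

Details

Motivation: Conventional RLVR relies on gold-standard answers or evaluation rubrics supplied by costly human annotation or powerful teacher models. This paper explores unsupervised methods to improve the long-context understanding of LLMs at low cost.

Result: The method is validated on two benchmarks, RULER and LongBench v2. It achieves noticeable gains on RULER and a reasonable improvement on LongBench v2, without any manually curated long-context QA data.

Insight: The novelty lies in recasting long-context training as an unsupervised document reconstruction task in which the reinforcement learning reward drives the model to capture global narrative coherence. From an objective standpoint, this self-supervised paradigm reduces dependence on annotated data and offers a scalable, cost-effective way to train long-context capabilities.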

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (e.g., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models’ supervision. Specifically, we first replace a few paragraphs with special placeholders in a long document. LLMs are trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. While acquiring noticeable gains on RULER, it can also achieve a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling effects on model performance. We publicly release our code, data, and models.
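
The task construction described in the abstract (mask a few paragraphs, offer them as shuffled candidates, reward a correct reconstruction) can be sketched as follows. This is a hedged illustration, not the released code: the placeholder format `<MISSING_k>`, the function names, and the all-or-nothing reward are assumptions, and the sketch assumes all paragraphs are distinct.

```python
import random
from typing import List, Sequence, Tuple

def make_task(paragraphs: Sequence[str], num_masked: int,
              rng: random.Random) -> Tuple[List[str], List[str], List[int]]:
    """Mask paragraphs and return (masked document, shuffled candidates, gold answer).

    The gold answer lists, for each placeholder in document order, the index of
    the correct paragraph inside the shuffled candidate list.
    """
    masked_idx = sorted(rng.sample(range(len(paragraphs)), num_masked))
    document = list(paragraphs)
    for slot, idx in enumerate(masked_idx):
        document[idx] = f"<MISSING_{slot}>"   # hypothetical placeholder token
    candidates = [paragraphs[i] for i in masked_idx]
    rng.shuffle(candidates)
    answer = [candidates.index(paragraphs[i]) for i in masked_idx]
    return document, candidates, answer

def reward(prediction: Sequence[int], answer: Sequence[int]) -> float:
    """Assumed verifiable reward: 1.0 only if every slot is filled correctly."""
    return 1.0 if list(prediction) == list(answer) else 0.0

paragraphs = ["Intro.", "Setup.", "Conflict.", "Climax.", "Ending."]
document, candidates, answer = make_task(paragraphs, 2, random.Random(0))
```

Because the gold answer is derived from the original document, the reward can be checked automatically, with no human labels or teacher model.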


[26] New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR cs.CLPDF

Zhilin Wang, Yafu Li, Shunkai Zhang, Zhi Wang, Haoran Zhang

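TL;DR: This paper proposes a probabilistic framework in which the new capabilities that Reinforcement Learning with Verifiable Rewards (RLVR) gives large language models (LLMs) are defined by instance-level solvability. The authors hypothesize that complex reasoning can emerge by sharpening the probabilities of atomic steps, which overcomes the exponential decay of success rates in multi-step reasoning chains. Training models only on single-step operations in the Algebrarium framework and evaluating them on unseen multi-step tasks, they find that RLVR incentivizes exploration of previously inaccessible solution paths by amplifying the model's existing skills, that composite performance is strictly governed by the joint probability of atomic steps, and that RLVR, acting as a global optimizer, can sacrifice specific skills to maximize total reward.

Details

Motivation: To address the central debate over whether RLVR endows LLMs with new capabilities or merely elicits latent ones, and to offer a probabilistic perspective on the emergence of complex reasoning.

Result: Experiments on the Algebrarium framework confirm that RLVR incentivizes exploration of new paths by amplifying existing skills; that composite performance correlates strongly with the joint probability of atomic steps (Pearson correlation coefficient ρ ∈ [0.69, 0.96]); and that RLVR, acting as a global optimizer, may sacrifice specific skills to maximize total reward.

Insight: The novelty lies in a probabilistic framework based on instance-level solvability that explains capability emergence as the sharpening of atomic step probabilities, thereby overcoming the exponential decay of multi-step reasoning. From an objective standpoint, the framework provides a new theoretical account of emergent abilities in RLVR and emphasizes how optimizing solvable problems enables models to tackle previously unsolvable scenarios.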

Abstract: Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model’s existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
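
The exponential decay mentioned in the abstract is easy to see numerically. The sketch below assumes, for illustration only, that steps succeed independently, so the chain's success probability is the product of per-step probabilities; the step counts and probabilities are made-up values, not results from the paper.

```python
from typing import Iterable

def chain_success(step_probs: Iterable[float]) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    p = 1.0
    for q in step_probs:
        p *= q
    return p

# Ten-step chain before and after sharpening each atomic step (illustrative values).
before = chain_success([0.80] * 10)   # about 0.107
after = chain_success([0.95] * 10)    # about 0.599
```

Raising each step from 0.80 to 0.95 lifts the ten-step success rate more than fivefold, which illustrates how sharpening atomic steps could make a previously near-unsolvable instance solvable.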


[27] JUSTICE: Judicial Unified Synthesis Through Intermediate Conclusion Emulation for Automated Judgment Document Generation cs.CLPDF

Binglin Wu, Yingyi Zhang, Xiannneg Li

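TL;DR: This paper proposes JUSTICE, a new framework for automated judgment document generation. The framework emulates the "Search → Pre-Judge → Write" cognitive workflow of human judges, introducing a dedicated Pre-Judge stage built from a Referential Judicial Element Retriever, an Intermediate Conclusion Emulator, and a Judicial Unified Synthesizer, to address the insufficient legal reasoning of existing methods that omit this stage.

Details

Motivation: Automated judgment document generation is an important but challenging legal AI task. Existing methods often oversimplify the complex legal reasoning process, in particular by omitting the Pre-Judge phase in which human judges form a preliminary conclusion. As a result, foundational judicial elements are not acquired effectively and the Pre-Judge process is modeled inadequately, which undermines the legal soundness of the generated document.

Result: Experiments on an in-domain legal benchmark and an out-of-distribution dataset show that JUSTICE significantly outperforms strong baselines, with substantial gains in legal accuracy, including a 4.6% improvement in prison term prediction.

Insight: The core novelty is the explicit modeling of the human judge's Pre-Judge stage within an automated judgment generation framework. The process is emulated by generating a verifiable intermediate conclusion (ICE), grounded in retrieved legal articles and a precedent case (RJER), and the final judgment is then produced by unified synthesis (JUS). This underlines the importance of emulating a judge's intermediate reasoning steps for the legal coherence and accuracy of generated documents.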

Abstract: Automated judgment document generation is a significant yet challenging legal AI task. As the conclusive written instrument issued by a court, a judgment document embodies complex legal reasoning. However, existing methods often oversimplify this complex process, particularly by omitting the “Pre-Judge” phase, a crucial step where human judges form a preliminary conclusion. This omission leads to two core challenges: 1) the ineffective acquisition of foundational judicial elements, and 2) the inadequate modeling of the Pre-Judge process, which collectively undermine the final document's legal soundness. To address these challenges, we propose Judicial Unified Synthesis Through Intermediate Conclusion Emulation (JUSTICE), a novel framework that emulates the “Search $\rightarrow$ Pre-Judge $\rightarrow$ Write” cognitive workflow of human judges. Specifically, it introduces the Pre-Judge stage through three dedicated components: Referential Judicial Element Retriever (RJER), Intermediate Conclusion Emulator (ICE), and Judicial Unified Synthesizer (JUS). RJER first retrieves legal articles and a precedent case to establish a referential foundation. ICE then operationalizes the Pre-Judge phase by generating a verifiable intermediate conclusion. Finally, JUS synthesizes these inputs to craft the final judgment. Experiments on both an in-domain legal benchmark and an out-of-distribution dataset show that JUSTICE significantly outperforms strong baselines, with substantial gains in legal accuracy, including a 4.6% improvement in prison term prediction. Our findings underscore the importance of explicitly modeling the Pre-Judge process to enhance the legal coherence and accuracy of generated judgment documents.


[28] Improving Data and Reward Design for Scientific Reasoning in Large Language Models cs.CLPDF

Zijie Chen, Zhenghao Lin, Xiao Liu, Zhenzhong Lan, Yeyun Gong

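TL;DR: This paper presents the Dr. SCI dataset and a matching post-training pipeline, aimed at improving the reasoning of large language models on open-ended science questions. By building the Dr. SCI dataset of 1M STEM questions and designing a post-training pipeline with three parts (Exploration-Expanding SFT, Dynamic Difficulty Curriculum, and rubric-based RL), the authors substantially improve model performance on scientific reasoning benchmarks such as GPQA.

Details

Motivation: Large language models struggle with open-ended science questions because supervision and evaluation are unreliable. The core bottleneck lies in data construction and reward design for scientific post-training.

Result: Qwen3-4B-Base trained with the Dr. SCI pipeline reaches 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently outperforming strong baselines such as o1-mini and GPT-4o, with substantial gains in open-ended scientific reasoning.

Insight: The novelty lies in a systematic, large-scale science data processing pipeline (the Dr. SCI dataset) and a redesigned post-training workflow (Exploration-Expanding SFT, Dynamic Difficulty Curriculum, rubric-based RL), which together provide an actionable framework for stable training and evaluation on open-ended answers.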

Abstract: Solving open-ended science questions remains challenging for large language models, particularly due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model’s reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model’s evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained using the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.


[29] Latent Reasoning with Supervised Thinking States cs.CL | cs.AIPDF

Ido Amos, Avi Caciularu, Mor Geva, Amir Globerson, Jonathan Herzig

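TL;DR: This paper proposes Thinking States, a method that generates sequences of thinking tokens while the input is being processed and injects them into the subsequent input, giving large language models a form of latent reasoning that preserves reasoning ability while reducing the inference cost of generating long chains of thought.

Details

Motivation: Chain-of-thought reasoning incurs high inference overhead because it generates long rationales. The goal is more efficient latent reasoning.

Result: The method outperforms other latent reasoning methods on multiple reasoning tasks, narrows the gap to chain-of-thought on math problems, matches its performance on 2-Hop QA with lower latency, and shows stronger reasoning generalization than chain-of-thought on state-tracking tasks.

Insight: The novelty lies in representing thoughts as learnable token sequences generated while the input is processed, trained with natural-language supervision and teacher forcing, yielding an efficient and generalizable latent reasoning mechanism.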

Abstract: Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs due to the generation of long rationales. We propose Thinking States, a method that performs reasoning while the input is being processed. Specifically, Thinking States generates sequences of thinking tokens every few input tokens, transforms the thoughts back into embedding space, and adds them to the following input tokens. This has two key advantages. First, it captures the recurrent nature of CoT, but with the thought tokens generated as the input is processed. Second, since the thoughts are represented as tokens, they can be learned from natural language supervision and trained with teacher forcing, which is parallelizable. Empirically, Thinking States outperforms other latent reasoning methods on multiple reasoning tasks, narrowing the gap to CoT on math problems, and matching its performance on 2-Hop QA with improved latency. On state-tracking tasks, we show Thinking States leads to stronger reasoning behavior than CoT, successfully extrapolating to longer sequences than seen during training.


[30] UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models cs.CL | cs.CVPDF

Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao

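TL;DR: The paper presents UReason, a benchmark for evaluating the actual effect of chain-of-thought reasoning on image generation in unified multimodal models. The benchmark contains 2,000 instances across five task families. By comparing direct generation, reasoning-guided generation, and de-contextualized generation, it reveals a Reasoning Paradox: reasoning traces generally improve performance, but retaining intermediate thoughts as conditioning context hinders visual synthesis, whereas generating from the refined prompt alone yields substantial gains.

Details

Motivation: The practical effect of chain-of-thought reasoning in unified multimodal models facing complex visual requirements is unclear. The authors aim to diagnose how faithfully reasoning is executed in visual synthesis and to assess its impact on image generation.

Result: Experiments on eight open-source unified models show a consistent Reasoning Paradox: reasoning traces improve performance over direct generation, but retaining intermediate thoughts as conditioning context hinders visual synthesis, while conditioning only on the refined prompt yields substantial gains. This suggests that the bottleneck is contextual interference rather than insufficient reasoning capacity.

Insight: The novelty lies in a diagnostic benchmark, UReason, that quantifies the role of reasoning in image generation and exposes the Reasoning Paradox, namely that intermediate reasoning steps can interfere when kept as context. This motivates, and provides a testbed for, future methods that integrate reasoning effectively while mitigating interference.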

Abstract: To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.


[31] WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints cs.CLPDF

Zexuan Wang, Chenghao Yang, Yingqi Que, Zhenzhu Yang, Huaqing Yuan

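TL;DR: This paper presents the WorldTravel benchmark, comprising 150 real-world travel-planning scenarios across 5 cities, each with an average of more than 15 interdependent temporal and logical constraints. It also develops WorldTravel-Webscape, a multimodal environment with over 2,000 rendered webpages, in which agents must perceive constraint parameters directly from visual layouts in order to plan. An evaluation of 10 frontier models shows a marked performance drop: even the state-of-the-art GPT-5.2 reaches only 32.67% feasibility in the text-only setting, falling to 19.33% in the multimodal environment.

Details

Motivation: Existing benchmarks mostly feature loosely coupled constraints that can be solved by local greedy decisions, and they rely on idealized data, so they fail to capture the complexity of extracting parameters from dynamic web environments. A more realistic planning benchmark is therefore needed.

Result: On WorldTravel, GPT-5.2 achieves 32.67% feasibility in the text-only setting and 19.33% in the multimodal environment. The study identifies a Planning Horizon threshold at approximately 10 constraints and finds that perception and reasoning remain independent bottlenecks.

Insight: The novelty lies in building a realistic multimodal travel-planning benchmark with tightly coupled constraints and in identifying the Perception-Action Gap and the Planning Horizon threshold. From an objective standpoint, the findings indicate that next-generation agents need to unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.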

Abstract: Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67% feasibility in text-only settings, which plummets to 19.33% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.


[32] Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning cs.CL | cs.AIPDF

Zhuoen Chen, Dongfang Li, Meishan Zhang, Baotian Hu, Min Zhang

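TL;DR: This paper proposes a cognitively inspired framework that uses chunk-wise compression and selective memory recall, rather than processing all raw tokens, to address the challenges large language models (LLMs) face in long-context processing: quadratic computational cost, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). The framework segments long inputs into chunks, encodes each chunk into a compressed memory representation with a learned compressor, dynamically selects relevant memory blocks with a gating module, and has a reasoning module process them iteratively with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier.

Details

Motivation: To address the high computational cost, information forgetting, and RAG-induced context fragmentation that affect LLMs in long-context processing.

Result: The method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off against strong long-context baselines. In particular, compared with MemAgent it achieves up to a 2x reduction in peak GPU memory usage and a 6x inference speedup.

Insight: The core novelty lies in modeling long-context processing as dynamic selective recall and iterative reasoning over compressed memory, with the compression and reasoning modules jointly optimized via end-to-end reinforcement learning, yielding efficient and scalable long-context inference. Its architecture of chunk-wise compression, dynamic gated selection, and iteratively updated working memory offers a new approach to very long sequences, with notable efficiency gains.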

Abstract: Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.


[33] Large Language Models and Impossible Language Acquisition: “False Promise” or an Overturn of our Current Perspective towards AI cs.CLPDF

Ziyan wang, Longlong Ma

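TL;DR: Responding to Chomsky's critique of large language models (LLMs), this paper uses theoretical analysis and experiments to examine the ability of LLMs to learn possible and impossible languages. The experiments show that GPT-2 small models perform poorly when learning impossible languages, while the behavior of LSTM models is consistent with Chomsky's argument, highlighting the irreplaceable role of the evolution of the Transformer architecture. On this basis, the paper proposes a new view of LLMs within Chomsky's theoretical framework and advocates a shift from Chomsky's "rationalist-romantics" paradigm to functionalist and empiricist research paradigms.

Details

Motivation: To respond to Chomsky's critique in "The False Promise of CHATGPT", which holds that LLMs are mere pattern predictors lacking the intrinsic causal and self-correction structures of human language acquisition, that they cannot distinguish impossible languages, and that this challenges the intellectual foundations of AI.

Result: Two rounds of controlled experiments were run on GPT-2 small models and LSTM models, with statistical analysis by Welch's t-test. GPT-2 small models perform worse on all of the impossible languages than on the possible language (p<.001), while the performance of LSTM models is consistent with Chomsky's argument.

Insight: The novelty lies in experimentally demonstrating the advantage of the Transformer architecture (e.g., GPT-2) over LSTM in language learning, and, based on theoretical and empirical findings, in proposing a shift from a rationalist paradigm to functionalist and empiricist paradigms in linguistics and AI research.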

Abstract: In Chomsky’s provocative critique “The False Promise of CHATGPT,” Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures as humans do, and are therefore unable to distinguish impossible languages. The critique stands as a representative fundamental challenge to the intellectual foundations of AI, for it synthesizes major methodological issues within LLMs and embodies an iconic a priori rationalist perspective. We examine this famous critique from two angles: the existing literature in linguistics and psychology, and an experiment probing the capacity of LLMs to learn both possible and impossible languages. We constructed a set of syntactically impossible languages by applying certain transformations to English, including reversing whole sentences and adding negation based on word-count parity. Two rounds of controlled experiments were conducted on GPT-2 small models and on long short-term memory (LSTM) models. Statistical analysis (Welch’s t-test) shows that GPT-2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p<.001). On the other hand, LSTM models’ performance tallies with Chomsky’s argument, suggesting the irreplaceable role of the evolution of the transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision of LLMs within Chomsky’s theory, and a shift of theoretical paradigm outside Chomsky, from his “rationalist-romantics” paradigm to functionalism and empiricism in LLM research.


[34] Characterizing, Evaluating, and Optimizing Complex Reasoning cs.CLPDF

Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang

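TL;DR: This paper proposes a unified framework for defining, evaluating, and optimizing the quality of complex reasoning traces in Large Reasoning Models (LRMs). It first introduces the ME² principle, which defines reasoning quality at the macro and micro levels; it then models reasoning traces as directed acyclic graphs (DAGs) and develops a DAG-based pairwise evaluation method; finally, it builds the TRM-Preference dataset to train a Thinking Reward Model (TRM) for evaluating reasoning quality at scale. Experiments show that thinking rewards are an effective optimization signal, improving outcomes at test time (up to 19.3% gain) and strengthening reasoning performance during reinforcement learning training (up to 3.9% gain).

Details

Motivation: Existing work lacks a unified answer to three fundamental questions about complex reasoning traces: how to define high-quality reasoning, how to reliably evaluate long, implicitly structured reasoning traces, and how to use evaluation signals to optimize reasoning.

Result: Across diverse tasks, selecting reasoning by thinking reward yields gains of up to 19.3% at test time and up to 3.9% during reinforcement learning training, supporting the effectiveness of the approach.

Insight: The contributions are the ME² principle, which gives a unified characterization of reasoning quality; the modeling of reasoning traces as DAGs to capture complex structure; and the TRM-Preference dataset and the TRM trained on it, which enable scalable reasoning evaluation and optimization. Viewed objectively, the work offers a systematic framework for evaluating and optimizing complex reasoning that pairs structured representations with reward models and has practical potential.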

Abstract: Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality along macro- and micro-level concerning efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.
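
As a rough illustration of test-time selection over DAG-structured traces, the sketch below represents each trace as a dependency graph, discards any that contain a cycle, and keeps the highest-scoring one. The `Trace` encoding and `toy_reward` are hypothetical stand-ins; the paper's TRM is a trained model and its interface is not described in the abstract.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, List, Set

Trace = Dict[str, Set[str]]  # step id -> ids of the steps it depends on


def is_dag(trace: Trace) -> bool:
    """Return True if the dependency graph of a trace has no cycle."""
    try:
        list(TopologicalSorter(trace).static_order())
        return True
    except CycleError:
        return False


def select_best(traces: List[Trace], reward: Callable[[Trace], float]) -> Trace:
    """Best-of-N: keep the well-formed trace with the highest reward."""
    valid = [t for t in traces if is_dag(t)]
    if not valid:
        raise ValueError("no acyclic trace to choose from")
    return max(valid, key=reward)


def toy_reward(trace: Trace) -> float:
    """Hypothetical stand-in for a reward model: prefers shorter traces."""
    return -float(len(trace))
```

In practice the reward callable would wrap a learned scorer; the selection logic stays the same.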


[35] GISA: A Benchmark for General Information-Seeking Assistant cs.CL | cs.AI | cs.IRPDF

Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang

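TL;DR: This paper introduces GISA, a benchmark for evaluating general information-seeking assistants. It contains 373 human-crafted, realistic queries, supports four structured answer formats, and provides complete human search trajectories as references.

Details

Motivation: Existing benchmarks often construct queries backward from answers, which makes the tasks unnatural; they also tend to focus on either locating specific information or aggregating information from multiple sources, and they rely on static answer sets that are prone to data contamination. GISA aims to close these gaps.

Result: In experiments on mainstream LLMs and commercial search products, even the best model reaches an exact match score of only 19.30%, and performance drops markedly on tasks that require complex planning and comprehensive information gathering.

Insight: By combining human-crafted realistic queries, structured answer formats, tasks that integrate deep reasoning with broad information aggregation, a periodically updated live subset, and complete search trajectories, GISA provides an evaluation framework closer to real-world needs and should help advance search agents.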

Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
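
To make the idea of deterministic evaluation over the four answer formats concrete, here is a minimal sketch of an exact-match checker. The normalization (lowercasing, whitespace collapsing) and the choice to compare table rows regardless of row order are assumptions for illustration; GISA's actual scoring rules are not given in the abstract.

```python
from typing import Any


def _norm(value: Any) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(str(value).lower().split())


def _rows(table: Any) -> list:
    """Normalize a table into a sorted list of row tuples."""
    return sorted(tuple(_norm(cell) for cell in row) for row in table)


def exact_match(fmt: str, predicted: Any, gold: Any) -> bool:
    """Compare a prediction with the gold answer under one of four formats."""
    if fmt == "item":
        return _norm(predicted) == _norm(gold)
    if fmt == "set":  # order-insensitive collection
        return {_norm(x) for x in predicted} == {_norm(x) for x in gold}
    if fmt == "list":  # order-sensitive sequence
        return [_norm(x) for x in predicted] == [_norm(x) for x in gold]
    if fmt == "table":  # assumption: row order does not matter
        return _rows(predicted) == _rows(gold)
    raise ValueError(f"unknown answer format: {fmt}")
```

Because each format reduces to a plain equality test, the score needs no model-based judge, which is what makes the evaluation repeatable.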


[36] Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation cs.CLPDF

Archchana Sindhujan, Girish A. Koushik, Shenbin Qian, Diptesh Kanojia, Constantin Orăsan

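TL;DR: This paper addresses three problems in machine translation quality estimation (QE): reliance on scalar scores alone, the lack of explanatory error information, and weak performance on low-resource languages. It presents the first segment-level QE dataset for English to Malayalam, containing Direct Assessment scores and Translation Quality Remarks, and introduces ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters with error-aware rewards so that LLMs can reason about translation quality beyond numeric scores.

Details

Motivation: Current QE methods rely solely on scalar quality scores and provide no explicit error information, and on low-resource language pairs such as English-Malayalam their performance is unreliable because annotated data is limited.

Result: On English-to-Malayalam QE, ALOPE-RL achieves state-of-the-art performance using compact LLMs (≤4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models.

Insight: The contributions are the first English-Malayalam segment-level QE dataset with error remarks and the ALOPE-RL reinforcement learning framework, whose error-aware reward strategy improves QE performance under limited data and compute budgets. Viewed objectively, combining error explanation with reinforcement learning offers a scalable solution for low-resource QE.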

Abstract: Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing the field of quality estimation of machine translation. However, most QE approaches rely solely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters based on policy rewards derived from DA scores and TQR. Integrating error-aware rewards with ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English to Malayalam QE using compact LLMs (<=4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.


[37] Fundamental Reasoning Paradigms Induce Out-of-Domain Generalization in Language Models cs.CLPDF

Mingzi Cao, Xingwei Tan, Mahmud Akhter, Marco Valentino, Maria Liakata

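TL;DR: This paper studies how three fundamental reasoning paradigms (deduction, induction, and abduction) affect the generalization of large language models (LLMs). It builds a dataset of symbolic tasks targeting these paradigms and injects the reasoning skills into LLMs with several methods (fine-tuning, increasing model depth, and conversion to a mixture-of-experts model), finding that this substantially improves generalization to real-world natural language tasks.

Details

Motivation: Although much research aims to improve LLM reasoning, how the fundamental reasoning paradigms (deduction, induction, abduction) affect model generalization has not been systematically explored; the paper aims to fill this gap.

Result: On realistic out-of-domain tasks that are formulated entirely in natural language and contain real-world knowledge, the proposed approach yields substantial performance gains (up to 14.60 points) and shows strong generalization.

Insight: The novelty lies in systematically linking training on symbolic fundamental reasoning paradigms to generalization on natural language tasks, showing that training abstract reasoning skills can induce strong generalization to complex real-world tasks, which suggests a new route to more robust LLM reasoning.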

Abstract: Deduction, induction, and abduction are fundamental reasoning paradigms, core to human logical thinking. Although improving Large Language Model (LLM) reasoning has attracted significant research efforts, the extent to which the fundamental paradigms induce generalization has yet to be systematically explored. In this study, we shed light on how the interplay between these core paradigms influences LLMs’ reasoning behavior. To this end, we first collect a new dataset of reasoning trajectories from symbolic tasks, each targeting one of the three fundamental paradigms, to abstract from concrete world knowledge. Then, we investigate effective ways of inducing these skills in LLMs. We experiment with a battery of methods, including simple fine-tuning and more complex approaches that increase model depth or transform a dense model into a mixture-of-experts. We comprehensively evaluate the induced models on realistic out-of-domain tasks that are formulated entirely in natural language and contain real-world knowledge. Our results reveal that our approach yields strong generalizability with substantial performance gains (up to $14.60$) across realistic tasks.


[38] Do Images Clarify? A Study on the Effect of Images on Clarifying Questions in Conversational Search cs.CL | cs.HC | cs.IRPDF

Clemencia Siro, Zahra Abbasiantaeb, Yifei Yuan, Mohammad Aliannejadi, Maarten de Rijke

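TL;DR: This user study examines how adding images to clarifying questions in conversational search systems affects user performance. When answering clarifying questions, users prefer multimodal questions but perform better in the text-only setting; in query reformulation, images lead to more precise queries and better retrieval performance, with effects that depend on task type and user expertise.

Details

Motivation: Conversational search systems commonly use clarifying questions to refine user queries, but prior work has focused on text-based clarifying questions and the role of images remains unclear. The paper investigates how images affect users performing search-related tasks.

Result: In a user study with 73 participants, multimodal clarifying questions were preferred for answering clarifying questions, yet users performed better in the text-only setting; in query reformulation, images improved query precision and retrieval performance, with effects varying by task and user expertise.

Insight: The novelty is the first systematic study of the role of images in clarifying questions for conversational search. It shows that the benefits of visual augmentation are task-dependent and should be applied strategically according to the search context and user characteristics, providing empirical grounding for the design of multimodal conversational search systems.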

Abstract: Conversational search systems increasingly employ clarifying questions to refine user queries and improve the search experience. Previous studies have demonstrated the usefulness of text-based clarifying questions in enhancing both retrieval performance and user experience. While images have been shown to improve retrieval performance in various contexts, their impact on user performance when incorporated into clarifying questions remains largely unexplored. We conduct a user study with 73 participants to investigate the role of images in conversational search, specifically examining their effects on two search-related tasks: (i) answering clarifying questions and (ii) query reformulation. We compare the effect of multimodal and text-only clarifying questions in both tasks within a conversational search context from various perspectives. Our findings reveal that while participants showed a strong preference for multimodal questions when answering clarifying questions, preferences were more balanced in the query reformulation task. The impact of images varied with both task type and user expertise. In answering clarifying questions, images helped maintain engagement across different expertise levels, while in query reformulation they led to more precise queries and improved retrieval performance. Interestingly, for clarifying question answering, text-only setups demonstrated better user performance as they provided more comprehensive textual information in the absence of images. These results provide valuable insights for designing effective multimodal conversational search systems, highlighting that the benefits of visual augmentation are task-dependent and should be strategically implemented based on the specific search context and user characteristics.


[39] PERSPECTRA: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments cs.CLPDF

Shangrui Nie, Kian Omoomi, Lucie Flek, Zhixue Zhao, Charles Welch

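TL;DR: This paper introduces PERSPECTRA, a scalable and configurable benchmark for evaluating pluralism in large language models, that is, the ability to handle diverse viewpoints without reducing them to a single perspective. The benchmark combines the structural clarity of Kialo debate graphs with the linguistic diversity of Reddit discussions, yielding a corpus of 3,810 arguments that cover 762 pro/con stances on 100 controversial topics. The paper initialises three tasks on the benchmark (opinion counting, opinion matching, and polarity check) and runs experiments on several state-of-the-art LLMs, revealing systematic deficiencies in pluralism-aware understanding and reasoning.

Details

Motivation: Pluralism is critical for developing large language models that faithfully reflect human heterogeneity, yet it has not been carefully examined in LLM research and is absent from most alignment studies. Existing debate-oriented sources such as Reddit and Kialo each have limitations, so a new benchmark that combines structural clarity with linguistic diversity is needed to advance research.

Result: Experiments on PERSPECTRA with state-of-the-art open-source and proprietary LLMs show systematic failures, such as overestimating the number of viewpoints and misclassifying concessive structures, which highlights how difficult pluralism-aware understanding and reasoning is for these models.

Insight: The main contribution is a new pluralism benchmark that combines structural clarity (from Kialo) with linguistic diversity (from Reddit) and uses a controlled retrieval-and-expansion pipeline to generate rich, naturalistic argument variants. It is presented as the first scalable, configurable benchmark for evaluating how models represent, distinguish, and reason over multiple perspectives, filling a gap in LLM alignment research.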

Abstract: Pluralism, the capacity to engage with diverse perspectives without collapsing them into a single viewpoint, is critical for developing large language models that faithfully reflect human heterogeneity. Yet this characteristic has not been carefully examined in the LLM research community and remains absent from most alignment studies. Debate-oriented sources provide a natural entry point for pluralism research. Previous work builds on online debate sources but remains constrained by costly human validation. Other debate-rich platforms such as Reddit and Kialo also offer promising material: Reddit provides linguistic diversity and scale but lacks clear argumentative structure, while Kialo supplies explicit pro/con graphs but remains overly concise and detached from natural discourse. We introduce PERSPECTRA, a pluralist benchmark that integrates the structural clarity of Kialo debate graphs with the linguistic diversity of real Reddit discussions. Using a controlled retrieval-and-expansion pipeline, we construct 3,810 enriched arguments spanning 762 pro/con stances on 100 controversial topics. Each opinion is expanded to multiple naturalistic variants, enabling robust evaluation of pluralism. We initialise three tasks with PERSPECTRA: opinion counting (identifying distinct viewpoints), opinion matching (aligning supporting stances and discourse to source opinions), and polarity check (inferring aggregate stance in mixed discourse). Experiments with state-of-the-art open-source and proprietary LLMs highlight systematic failures, such as overestimating the number of viewpoints and misclassifying concessive structures, underscoring the difficulty of pluralism-aware understanding and reasoning. By combining diversity with structure, PERSPECTRA establishes the first scalable, configurable benchmark for evaluating how well models represent, distinguish, and reason over multiple perspectives.


[40] Affective Flow Language Model for Emotional Support Conversation cs.CL | cs.AIPDF

Chenghui Zou, Ning Wang, Tiesunlong Shen, Luwei Xiao, Chuan Ma

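TL;DR: This paper proposes the Affective Flow Language Model (AFlow), a framework for emotional support conversation (ESC) that models a continuous affective flow along multi-turn dialogue trajectories to provide fine-grained supervision for intermediate strategy decisions, improving complex multi-turn support.

Details

Motivation: Existing LLM-based approaches to emotional support conversation rely on sparse outcome-level signals and offer limited supervision for intermediate strategy decisions, which makes complex multi-turn support challenging.

Result: Experiments show that AFlow outperforms competitive baselines across diverse emotional contexts and that, with a compact open-source backbone, it surpasses proprietary large language models such as GPT-4o and Claude-3.5 on major ESC metrics.

Insight: The novelty lies in modeling a continuous affective flow to provide fine-grained supervision, together with a subpath-level flow-balance objective that propagates preference signals to intermediate states, improving strategy coherence and the quality of empathetic responses.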

Abstract: Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains challenging. This is because existing alignment schemes rely on sparse outcome-level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes the affective flow language model for emotional support conversation (AFlow), a framework that introduces fine-grained supervision on dialogue prefixes by modeling a continuous affective flow along multi-turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference-consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath-level flow-balance objective is presented to propagate preference signals to intermediate states. Experimental results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open-source backbone outperforms proprietary LLMs such as GPT-4o and Claude-3.5 on major ESC metrics. Our code is available at https://github.com/chzou25-lgtm/AffectiveFlow.


[41] Understanding Dynamic Compute Allocation in Recurrent Transformers cs.CL | cs.AI | cs.LGPDF

Ibraheem Muhammad Moosa, Suhas Lohit, Ye Wang, Moitreya Chatterjee, Wenpeng Yin

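TL;DR: This paper proposes a new paradigm for evaluating dynamic compute allocation in recurrent Transformers: algorithmic and synthetic language tasks with parameterized difficulty are used to test token-level compute allocation directly. The authors propose ANIRA, a framework that supports per-token variable-depth computation, and systematically analyze how compute allocation relates to complexity, generalization, and decision timing. They find that compute allocation can align with task complexity without explicit supervision, but that such alignment does not guarantee algorithmic generalization.

Details

Motivation: Prior work on token-level adaptive computation is evaluated mainly on natural-language benchmarks with task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, so it is unclear whether compute allocation truly aligns with the underlying complexity. The paper addresses this gap with controlled evaluation and a unified framework for studying the actual effect of compute allocation.

Result: Experiments on algorithmic and synthetic language tasks show that compute allocation can align with task complexity without explicit difficulty supervision, but the models fail to generalize to unseen input sizes. Early compute decisions rely on static structural cues, whereas online halting tracks the algorithmic execution state more closely.

Insight: The contributions are a complexity-controlled evaluation paradigm, the unified recurrent Transformer framework ANIRA, and a systematic analysis of how compute allocation relates to complexity, generalization, and decision timing. Viewed objectively, the study offers a more rigorous way to evaluate adaptive computation and exposes a disconnect between compute allocation and generalization, which is instructive for the design of efficient reasoning models.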

Abstract: Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.


[42] Large Language Models for Geolocation Extraction in Humanitarian Crisis Response cs.CL | cs.IRPDF

G. Cafferata, T. Demarco, K. Kalimeri, Y. Mejova, M. G. Beiró

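TL;DR: This paper proposes a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module to extract location information from humanitarian documents, aiming to address the geographic and socioeconomic biases of existing automated systems.

Details

Motivation: Humanitarian crisis response needs timely and accurate geographic information, but existing automated location extraction systems often reproduce geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. The paper therefore investigates whether LLMs can address these geographic disparities.

Result: On an extended version of the HumSet dataset, evaluated with both accuracy and fairness metrics, LLM-based methods substantially improve the precision and fairness of geolocation extraction from humanitarian texts, particularly for underrepresented regions, and outperform state-of-the-art pretrained and rule-based systems.

Insight: The novelty lies in bridging advances in LLM reasoning with principles of responsible and inclusive AI, using context to disambiguate ambiguous toponyms. The work contributes to more equitable geospatial data systems for humanitarian response and advances the goal of “leaving no place behind” in crisis analytics.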

Abstract: Humanitarian crises demand timely and accurate geographic information to inform effective response efforts. Yet, automated systems that extract locations from text often reproduce existing geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. This paper investigates whether Large Language Models (LLMs) can address these geographic disparities in extracting location information from humanitarian documents. We introduce a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module that leverages context to resolve ambiguous toponyms. We benchmark our approach against state-of-the-art pretrained and rule-based systems using both accuracy and fairness metrics across geographic and socioeconomic dimensions. Our evaluation uses an extended version of the HumSet dataset with refined literal toponym annotations. Results show that LLM-based methods substantially improve both the precision and fairness of geolocation extraction from humanitarian texts, particularly for underrepresented regions. By bridging advances in LLM reasoning with principles of responsible and inclusive AI, this work contributes to more equitable geospatial data systems for humanitarian response, advancing the goal of leaving no place behind in crisis analytics.


[43] Is Reasoning Capability Enough for Safety in Long-Context Language Models? cs.CL | cs.CRPDF

Yu Fu, Haz Sameen Shahgir, Huanli Gong, Zhipeng Wei, N. Benjamin Erichson

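TL;DR: This paper asks whether stronger reasoning capability automatically improves safety in long-context language models. The authors introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments scattered throughout a long context, and a neutral reasoning query then induces the model to retrieve and synthesize them so that the harmful intent emerges only after composition. Evaluating 14 frontier LLMs on contexts of up to 64k tokens, they find that stronger general reasoning capability does not bring greater robustness and that safety degrades as context length increases, but that increasing inference-time compute substantially reduces attack success.

Details

Motivation: The paper tests the hypothesis that stronger reasoning capability should help models recognize harmful intent that is not stated explicitly and thereby improve safety, particularly in long-context settings.

Result: Evaluating 14 frontier LLMs on contexts of up to 64k tokens, the paper finds that (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks; (2) safety alignment consistently degrades as context length increases; and (3) inference-time compute is a key mitigating factor, reducing attack success by over 50 percentage points on the GPT-oss-120b model.

Insight: The novelty is the compositional reasoning attack threat model, which shows that safety does not automatically scale with reasoning capability in long-context inference. Viewed objectively, the core insight is that a model’s safety guardrails can fail during complex, distributed information retrieval and synthesis, which points to an important new direction for evaluating and hardening long-context models.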

Abstract: Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. A hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments that are scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.


[44] How Should We Model the Probability of a Language? cs.CLPDF

Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

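TL;DR: This position paper criticizes the limited coverage of current language identification (LID) systems, arguing that it stems from mistakenly framing LID as decontextualized text classification. It proposes recasting LID as a routing problem that incorporates environmental cues to estimate language prior probabilities, in order to improve coverage of tail languages.

Details

Motivation: Commercial and research-grade LID systems reliably identify only a small share of the world’s thousands of languages, and coverage of tail languages is especially poor. The paper attributes this to methodological limitations and misaligned institutional incentives.

Result: As a position paper, it reports no quantitative experimental results; instead it argues that existing global, fixed-prior models are fundamentally limited in their coverage of tail languages.

Insight: The core proposal is to reframe language identification as a routing problem, estimating prior probabilities dynamically from environmental cues (such as geographic location or device settings) rather than treating LID as static text classification, which offers a new methodological direction for identifying low-resource languages.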

Abstract: Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
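
The role of the prior can be seen in a small worked example using Bayes' rule, P(language | text) ∝ P(text | language) · P(language). The language labels and all probabilities below are hypothetical numbers chosen for illustration; the paper itself reports no such figures.

```python
from typing import Dict


def posterior(likelihood: Dict[str, float], prior: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule over a closed set of languages."""
    unnorm = {lang: likelihood[lang] * prior.get(lang, 0.0) for lang in likelihood}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("prior and likelihood share no support")
    return {lang: p / total for lang, p in unnorm.items()}


# Hypothetical case: a short text that a classifier finds equally plausible
# in a widely used language and in a closely related tail language.
likelihood = {"head_lang": 0.5, "tail_lang": 0.5}
global_prior = {"head_lang": 0.999, "tail_lang": 0.001}  # fixed, global prior
local_prior = {"head_lang": 0.4, "tail_lang": 0.6}       # prior given a local cue

global_post = posterior(likelihood, global_prior)
local_post = posterior(likelihood, local_prior)
print(max(global_post, key=global_post.get))
print(max(local_post, key=local_post.get))
```

With identical text evidence, the decision flips once the prior reflects where the text came from, which is the sense in which the choice of prior, rather than the classifier, limits coverage of tail languages.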


[45] When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents cs.CLPDF

Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan

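TL;DR: This paper gives the first systematic definition of misaligned actions in computer-use agents (CUAs) and builds MisActBench, a benchmark with human-annotated, action-level alignment labels. It also proposes DeAction, a universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback, substantially outperforming existing baselines in both offline and online evaluations.

Details

Motivation: Computer-use agents frequently produce misaligned actions that deviate from the user’s original intent. These actions may stem from external attacks (e.g., indirect prompt injection) or internal limitations (e.g., erroneous reasoning); they create safety risks and also reduce task efficiency and reliability. The paper addresses the detection and correction of such actions.

Result: On MisActBench, DeAction outperforms all baselines by over 15% absolute in F1 score. In online evaluation, it reduces attack success rate by over 90% under adversarial settings while preserving or even improving task success rate in benign environments, with moderate latency overhead.

Insight: The novelty lies in the first systematic definition and study of misaligned actions in CUAs and in a benchmark that covers both externally induced and internally arising misaligned actions. DeAction, as a universal guardrail, performs pre-execution detection and iterative correction, achieving a notable balance of, and improvement in, safety and task performance.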

Abstract: Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user’s original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms all existing baselines across offline and online evaluations with moderate latency overhead: (1) On MisActBench, it outperforms baselines by over 15% absolute in F1 score; (2) In online evaluation, it reduces attack success rate by over 90% under adversarial settings while preserving or even improving task success rate in benign environments.
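
The detect-before-execute loop with iterative correction might look roughly like the following sketch. The `Proposer` and `Detector` callables, the feedback string, and the `max_rounds` cutoff are hypothetical interfaces invented for illustration; DeAction's real components are not specified in the abstract.

```python
from typing import Callable, Optional, Tuple

Action = str
# Detector returns (is_aligned, feedback); feedback explains what is wrong.
Detector = Callable[[str, Action], Tuple[bool, str]]
# Proposer maps (task, previous feedback or None) to a candidate action.
Proposer = Callable[[str, Optional[str]], Action]


def guarded_step(task: str, propose: Proposer, detect: Detector,
                 max_rounds: int = 3) -> Optional[Action]:
    """Return an action judged aligned, or None if none is found in time."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        action = propose(task, feedback)
        aligned, feedback = detect(task, action)
        if aligned:
            return action  # only now would the caller execute it
    return None  # block execution rather than run a misaligned action
```

Returning `None` instead of the last candidate is a deliberate choice in this sketch: when correction fails, the step is skipped rather than executed.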


cs.CV [Back]

[46] Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making cs.CV | cs.LGPDF

Ruoyu Chen, Shangquan Sun, Xiaoqing Guo, Sanyi Zhang, Kangwei Liu

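TL;DR: This paper proposes an attribution-based method for aligning models with human priors in order to make model decisions more reliable. Human priors (e.g., bounding boxes) are encoded as the input regions the model is expected to rely on, and a highly faithful attribution technique based on subset selection exposes the model’s decision evidence during training. When the attribution region deviates substantially from the prior region, reliance on off-prior evidence is penalized, steering the model’s attribution toward the intended regions. The method is validated on image classification and on click decision tasks in MLLM-based GUI agent models.

Details

Motivation: Conventional supervised learning provides only class-level labels, so models can reach high accuracy through shortcut correlations rather than the intended evidence, leaving their decisions without reliable justification. Human priors can help constrain this behavior, but aligning learned representations with human perception remains challenging.

Result: Across conventional classification and autoregressive generation settings, including MLLM-based GUI agent models, human prior alignment consistently improves task accuracy and also makes model decisions more reasonable.

Insight: The core contribution is a differentiable training objective based on subset-selection attribution that treats human priors as a soft constraint on the model’s attribution regions, so that the model is explicitly guided toward decision evidence consistent with human cognition while accuracy is optimized. This suggests a new approach to building more reliable and interpretable AI systems.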

Abstract: Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model’s decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model’s decision reasonability.
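
One simple way to picture an attribution constraint is to measure how much attribution mass falls outside the prior bounding box and penalize the excess. The sketch below takes the attribution map as given and works on plain Python lists; the fraction-based penalty, the `tolerance` threshold, and the `weight` are assumptions for illustration and do not reproduce the paper's subset-selection attribution or its training objective.

```python
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


def off_prior_fraction(attribution: Grid, box: Box) -> float:
    """Share of non-negative attribution mass that falls outside the prior box."""
    r0, c0, r1, c1 = box
    total = outside = 0.0
    for r, row in enumerate(attribution):
        for c, value in enumerate(row):
            mass = max(value, 0.0)
            total += mass
            if not (r0 <= r < r1 and c0 <= c < c1):
                outside += mass
    return outside / total if total > 0 else 0.0


def total_loss(task_loss: float, attribution: Grid, box: Box,
               weight: float = 1.0, tolerance: float = 0.2) -> float:
    """Add a penalty only when attribution strays beyond a tolerance."""
    excess = max(off_prior_fraction(attribution, box) - tolerance, 0.0)
    return task_loss + weight * excess
```

The tolerance mirrors the abstract's wording that a penalty applies when the attribution deviates substantially from the prior region, not for every small mismatch.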


[47] MAU-GPT: Enhancing Multi-type Industrial Anomaly Understanding via Anomaly-aware and Generalist Experts Adaptation cs.CV | cs.AI | eess.IVPDF

Zhuonan Wang, Zhenxuan Fan, Siwen Tan, Yu Zhong, Yuqian Yuan

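TL;DR: This paper presents MAU-GPT, a multimodal large model for multi-type industrial anomaly understanding, released together with the MAU-Set dataset and an evaluation protocol. Through a novel AMoE-LoRA mechanism, the model unifies the adaptation of anomaly-aware and generalist experts to address the limited dataset coverage and poor generalization of existing methods.

Details

Motivation: As industrial manufacturing scales, automated fine-grained analysis of product images has become critical, but existing methods are held back by limited dataset coverage and poor generalization across diverse and complex anomaly patterns.

Result: Extensive experiments show that MAU-GPT consistently outperforms prior state-of-the-art methods across all domains of the multi-domain MAU-Set dataset, indicating strong potential for scalable, automated industrial inspection.

Insight: The main contributions are MAU-Set, a comprehensive dataset for multi-type industrial anomaly understanding, with its evaluation protocol, and the MAU-GPT model, whose core AMoE-LoRA mechanism combines anomaly-aware and generalist expert adaptation to improve understanding and reasoning across defect classes.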

Abstract: As industrial manufacturing scales, automating fine-grained product image analysis has become critical for quality control. However, existing approaches are hindered by limited dataset coverage and poor model generalization across diverse and complex anomaly patterns. To address these challenges, we introduce MAU-Set, a comprehensive dataset for Multi-type industrial Anomaly Understanding. It spans multiple industrial domains and features a hierarchical task structure, ranging from binary classification to complex reasoning. Alongside this dataset, we establish a rigorous evaluation protocol to facilitate fair and comprehensive model assessment. Building upon this foundation, we further present MAU-GPT, a domain-adapted multimodal large model specifically designed for industrial anomaly understanding. It incorporates a novel AMoE-LoRA mechanism that unifies anomaly-aware and generalist experts adaptation, enhancing both understanding and reasoning across diverse defect classes. Extensive experiments show that MAU-GPT consistently outperforms prior state-of-the-art methods across all domains, demonstrating strong potential for scalable and automated industrial inspection.


[48] Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models cs.CV | cs.AI | cs.LGPDF

Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee

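TL;DR: This paper proposes CR-VLM, a method for configurable refusal in vision language models (VLMs). Built on activation steering, it extracts a configurable refusal vector with a teacher-forced mechanism, adds a gating mechanism to prevent over-refusal, and designs a counterfactual vision enhancement module that aligns visual representations with refusal requirements, achieving effective, efficient, and robust configurable refusal across multiple datasets and VLMs.

Details

Motivation: Existing refusal mechanisms in vision language models are largely one-size-fits-all: they cannot adapt to diverse user needs and contextual constraints, which leads to under-refusal or over-refusal. The paper aims to overcome this limitation with a configurable refusal method.

Result: Comprehensive experiments across multiple datasets and various vision language models show that CR-VLM achieves effective, efficient, and robust configurable refusal, offering a scalable path toward user-adaptive safety alignment in VLMs.

Insight: The novelty is CR-VLM, described as the first systematic exploration of a configurable refusal framework based on activation steering. Its core components are a teacher-forced refusal vector, a gating mechanism that balances refusal and acceptance, and a counterfactual vision enhancement module; together they make refusal more flexible and accurate, a promising direction for VLM safety alignment.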

Abstract: With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore the challenges mentioned above and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.
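
Gated activation steering can be sketched generically as adding a scaled direction to a hidden state only when a gate opens. In the sketch below the gate is a hypothetical linear probe (`scope_vec` with a `threshold`) and the vectors are plain lists; none of these names come from the paper, and CR-VLM's actual vector extraction and gating are not described in enough detail in the abstract to reproduce.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden: Sequence[float], refusal_vec: Sequence[float],
          scope_vec: Sequence[float], alpha: float = 1.0,
          threshold: float = 0.0) -> List[float]:
    """Add alpha * refusal_vec to a hidden state, but only when the gate opens.

    The gate is a hypothetical linear probe: it opens when the projection of
    the hidden state onto scope_vec exceeds the threshold.
    """
    gate = 1.0 if dot(hidden, scope_vec) > threshold else 0.0
    return [h + alpha * gate * v for h, v in zip(hidden, refusal_vec)]
```

When the gate stays closed the hidden state passes through unchanged, which is how a gate of this kind would avoid over-refusal on queries outside the configured scope.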


[49] Vectra: A New Metric, Dataset, and Model for Visual Quality Assessment in E-Commerce In-Image Machine Translation cs.CV | cs.AIPDF

Qingyu Wu, Yuxuan Han, Haijun Li, Zhao Xu, Jianshan Zhao

TL;DR: 本文提出了Vectra,一个针对电商图像内机器翻译(IIMT)的无参考视觉质量评估框架。该框架包含三个核心部分:一个将视觉质量分解为14个可解释维度的多维质量度量系统(Vectra Score),一个包含大规模真实世界产品图像的数据集(Vectra Dataset),以及一个能够生成定量评分和诊断推理的40亿参数多模态大语言模型(Vectra Model)。

Details

Motivation: 现有研究主要关注机器翻译的文本质量评估,而电商IIMT场景中,视觉渲染质量对用户参与度至关重要。面对背景复杂的商品图像和多模态缺陷,现有的基于参考的方法(如SSIM、FID)缺乏可解释性,而基于模型评判的方法则缺乏领域相关的细粒度奖励信号。

Result: 实验表明,Vectra在评分性能上与人类排名的相关性达到了最先进水平(SOTA)。其提出的Vectra模型在评分性能上超越了包括GPT-5和Gemini-3在内的领先MLLMs。

Insight: 主要创新点在于:1) 提出了首个针对电商IIMT的无参考、MLLM驱动的视觉质量评估框架;2) 设计了一个包含14个可解释维度的多维质量度量系统,并引入了空间感知的缺陷面积比(DAR)量化以减少标注歧义;3) 构建了一个大规模、多样化的数据集,包含系统评估基准、指令调优数据和专家偏好数据;4) 训练了一个兼具评分和诊断推理能力的专用MLLM,在特定领域任务上超越了通用大模型。

Abstract: In-Image Machine Translation (IIMT) powers cross-border e-commerce product listings; existing research focuses on machine translation evaluation, while visual rendering quality is critical for user engagement. When facing context-dense product imagery and multimodal defects, current reference-based methods (e.g., SSIM, FID) lack explainability, while model-as-judge approaches lack domain-grounded, fine-grained reward signals. To bridge this gap, we introduce Vectra, to the best of our knowledge, the first reference-free, MLLM-driven visual quality assessment framework for e-commerce IIMT. Vectra comprises three components: (1) Vectra Score, a multidimensional quality metric system that decomposes visual quality into 14 interpretable dimensions, with spatially-aware Defect Area Ratio (DAR) quantification to reduce annotation ambiguity; (2) Vectra Dataset, constructed from 1.1M real-world product images via diversity-aware sampling, comprising a 2K benchmark for system evaluation, 30K reasoning-based annotations for instruction tuning, and 3.5K expert-labeled preferences for alignment and evaluation; and (3) Vectra Model, a 4B-parameter MLLM that generates both quantitative scores and diagnostic reasoning. Experiments demonstrate that Vectra achieves state-of-the-art correlation with human rankings, and our model outperforms leading MLLMs, including GPT-5 and Gemini-3, in scoring performance. The dataset and model will be released upon acceptance.


[50] Gaussian-Constrained LeJEPA Representations for Unsupervised Scene Discovery and Pose Consistency cs.CVPDF

Mohsen Mostafa

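TL;DR: This paper studies Gaussian-constrained, LeJEPA-inspired representations for unsupervised scene discovery and camera pose consistency. Across three progressively refined pipelines, it arrives at a LeJEPA-inspired approach that imposes isotropic Gaussian constraints on image embeddings, and it evaluates on the IMC2025 challenge how these constraints affect scene separation and pose estimation robustness.

Details

Motivation: To address the challenge of unsupervised 3D scene reconstruction from unstructured image collections, particularly under real-world conditions where images come from multiple unrelated scenes and contain significant visual ambiguity, as highlighted by the scene discovery and camera pose estimation requirements of the IMC2025 challenge.

Result: Experiments on the IMC2025 dataset show that, compared with heuristic-driven baselines, Gaussian-constrained embeddings can improve scene separation and pose plausibility, especially in visually ambiguous settings. No new theoretical guarantees are provided.

Insight: The novelty lies in applying LeJEPA-inspired isotropic Gaussian constraints to image embeddings to strengthen clustering consistency and pose estimation robustness. Viewed objectively, this offers a promising direction for connecting self-supervised learning principles with practical structure-from-motion pipelines, although the validation is mainly empirical.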

Abstract: Unsupervised 3D scene reconstruction from unstructured image collections remains a fundamental challenge in computer vision, particularly when images originate from multiple unrelated scenes and contain significant visual ambiguity. The Image Matching Challenge 2025 (IMC2025) highlights these difficulties by requiring both scene discovery and camera pose estimation under real-world conditions, including outliers and mixed content. This paper investigates the application of Gaussian-constrained representations inspired by LeJEPA (Joint Embedding Predictive Architecture) to address these challenges. We present three progressively refined pipelines, culminating in a LeJEPA-inspired approach that enforces isotropic Gaussian constraints on learned image embeddings. Rather than introducing new theoretical guarantees, our work empirically evaluates how these constraints influence clustering consistency and pose estimation robustness in practice. Experimental results on IMC2025 demonstrate that Gaussian-constrained embeddings can improve scene separation and pose plausibility compared to heuristic-driven baselines, particularly in visually ambiguous settings. These findings suggest that theoretically motivated representation constraints offer a promising direction for bridging self-supervised learning principles and practical structure-from-motion pipelines.


[51] XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models cs.CV | cs.AIPDF

Thuraya Alzubaidi, Sana Ammar, Maryam Alsharqi, Islem Rekik, Muzammil Behzad

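TL;DR: This paper proposes XAI-CLIP, an ROI-guided perturbation framework that addresses the limited interpretability of transformer-based models in medical image segmentation. The method uses multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process, producing clearer, boundary-aware saliency maps while substantially reducing computational overhead.

Details

Motivation: Medical image segmentation is critical in clinical workflows, but transformer-based models, despite their superior performance, have limited interpretability, which hinders clinical trust and deployment. Existing explainable AI techniques (such as gradient-based saliency methods and perturbation-based approaches) are typically computationally expensive, require many forward passes, and often produce noisy or anatomically irrelevant explanations.

Result: Experiments on the FLARE22 and CHAOS datasets show that, compared with conventional perturbation methods, XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in IoU for occlusion-based explanations. Qualitative results further confirm cleaner, more anatomically consistent attribution maps with fewer artifacts.

Insight: The novelty lies in integrating multimodal vision-language representations into a perturbation-based XAI framework, combining language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations so that interpretability and efficiency improve together. This suggests a path toward transparent, clinically deployable medical image segmentation systems.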

Abstract: Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in dice score, and a 96.7% increase in Intersection-over-Union for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts, highlighting that the incorporation of multimodal vision-language representations into perturbation-based XAI frameworks significantly enhances both interpretability and efficiency, thereby enabling transparent and clinically deployable medical image segmentation systems.
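The Dice score and Intersection-over-Union reported above are both overlap ratios between a predicted mask and a reference mask. The following is a minimal sketch of the standard definitions, assuming binary masks flattened into lists of 0/1; the function name `dice_and_iou` and the toy masks are illustrative and do not come from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat lists of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 6-pixel masks: 2 shared foreground pixels, 3 in each mask.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred_mask, true_mask)
print(f"Dice={dice:.3f} IoU={iou:.3f}")  # Dice=0.667 IoU=0.500
```

For a single pair of masks the two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank that pair the same way, but relative improvements in one do not translate one-to-one into the other.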


[52] The Geometry of Representational Failures in Vision Language Models cs.CV | cs.AIPDF

Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri

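TL;DR: This paper analyzes the representational geometry of open-weight vision-language models (Qwen, InternVL, Gemma) to offer a mechanistic account of their failures in multi-object visual tasks, such as hallucination or misidentification. It extracts “concept vectors” (latent directions encoding visual concepts), validates them through steering interventions that manipulate model behavior, and finds that the geometric overlap between vectors correlates strongly with specific error patterns, yielding a quantitative framework for understanding how internal representations shape model behavior.

Details

Motivation: To explain the puzzling failures of vision-language models in multi-object visual tasks (such as hallucination or misidentification). These errors resemble the human “Binding Problem”, but the internal mechanisms behind them in artificial systems remain poorly understood.

Result: In both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue), steering interventions validate the concept vectors, and the geometric overlap between vectors is observed to correlate strongly with error patterns, providing a quantitative framework for understanding these failures.

Insight: The novelty lies in explaining vision-language model failures mechanistically through representational geometry: concept vectors are extracted and validated by intervention, and the geometric structure of internal representations is linked to specific error patterns, giving a new, quantifiable perspective on model behavior.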

Abstract: Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the “Binding Problem”, the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill “concept vectors” - latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.
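To make the idea of a concept vector concrete, here is a minimal sketch under stated assumptions. It uses a difference-of-means direction, which is one common extraction method; the abstract says the authors compare several methodologies without naming them, so this is not necessarily theirs. The activations are invented 3-dimensional toy values, and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vector(with_concept, without_concept):
    """Difference of mean activations: an assumed, simple concept direction."""
    pos = mean_vector(with_concept)
    neg = mean_vector(without_concept)
    return [p - q for p, q in zip(pos, neg)]


def cosine(u, v):
    """Cosine similarity, used here as a stand-in for 'geometric overlap'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steer(activation, direction, strength):
    """Steering intervention: push an activation along a concept direction."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Toy hidden activations (hypothetical values, not taken from any model).
red_acts = [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0]]
blue_acts = [[0.0, 1.0, 0.2], [0.2, 0.8, 0.0]]
red_minus_blue = concept_vector(red_acts, blue_acts)

# Steering a "red" activation against the red-minus-blue direction moves it
# toward the blue cluster.
steered = steer(red_acts[0], red_minus_blue, strength=-1.0)
print(red_minus_blue, steered)
```

In a real model the activations would come from a chosen layer's residual stream or hidden states, and overlap would be measured between vectors for different concepts.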


[53] Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models cs.CV | cs.AI | cs.MMPDF

Xiaomin Yu, Yi Xin, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao

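TL;DR: This paper proposes a new training paradigm for multimodal large language models (MLLMs) that addresses the Modality Gap. The authors first propose the Fixed-frame Modality Gap Theory to characterize the gap's geometric shape precisely, and build on it a training-free alignment strategy, ReAlign. They then construct ReVision, a scalable training paradigm that uses massive unpaired data to align modalities during pretraining, reducing reliance on large-scale, high-quality image-text pairs.

Details

Motivation: Although multimodal contrastive learning has succeeded in aligning visual and linguistic representations, embeddings from different modalities that express the same semantics are systematically offset geometrically, a phenomenon known as the Modality Gap. Existing methods are limited by oversimplified isotropic assumptions and are hard to apply at large scale. This paper aims to characterize the geometric shape of the modality gap precisely and exploit it for efficient model scaling.

Result: The ReVision framework shows that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a viable path to efficient MLLM scaling.

Insight: The contributions are: 1. the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals for more precise geometric modeling; 2. the training-free ReAlign strategy, which explicitly corrects geometric misalignment using statistics from unpaired data through three steps: Anchor, Trace, and Centroid Alignment; 3. the ReVision training paradigm, which integrates ReAlign into pretraining so that the model can learn the distribution of visual representations from unpaired text before visual instruction tuning, reducing dependence on large-scale, high-quality paired data.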

Abstract: Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.


[54] Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models cs.CV | cs.LGPDF

Sanggeon Yun, Ryozo Masukawa, SungHeon Jeong, Wenjun Huang, Hanning Chen

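TL;DR: This paper proposes Fair Context Learning (FCL), a test-time adaptation (TTA) framework that addresses the performance degradation of vision-language models (such as CLIP) under distribution shift. By decoupling the adaptation process and avoiding entropy minimization, the method mitigates the spurious correlations and overconfident errors caused by shared visual evidence.

Details

Motivation: Prompt-based TTA methods rely on entropy minimization, which can amplify spurious correlations and produce overconfident errors when classes share visual features.

Result: In extensive evaluation, FCL achieves adaptation performance competitive with state-of-the-art TTA methods across diverse domain-shift and fine-grained benchmarks.

Insight: The novelty is that, based on an additive evidence decomposition assumption, adaptation is decoupled into augmentation-based exploration and fairness-driven calibration. A fairness constraint equalizes sensitivity to common visual evidence, which avoids entropy minimization and still calibrates text embeddings effectively.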

Abstract: Vision-Language Models (VLMs) such as CLIP enable strong zero-shot recognition but suffer substantial degradation under distribution shifts. Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples, yet most prompt-based TTA methods rely on entropy minimization – an approach that can amplify spurious correlations and induce overconfident errors when classes share visual features. We propose Fair Context Learning (FCL), an episodic TTA framework that avoids entropy minimization by explicitly addressing shared-evidence bias. Motivated by our additive evidence decomposition assumption, FCL decouples adaptation into (i) augmentation-based exploration to identify plausible class candidates, and (ii) fairness-driven calibration that adapts text contexts to equalize sensitivity to common visual evidence. This fairness constraint mitigates partial feature obsession and enables effective calibration of text embeddings without relying on entropy reduction. Through extensive evaluation, we empirically validate our theoretical motivation and show that FCL achieves competitive adaptation performance relative to state-of-the-art TTA methods across diverse domain-shift and fine-grained benchmarks.
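The entropy-minimization objective that FCL avoids can be illustrated with a few lines of arithmetic. The sketch below is not FCL itself; it only shows, on hypothetical logits for three classes, that sharpening a prediction lowers its entropy whether or not the top class is correct, which is why the objective can reward overconfident errors when two classes share evidence.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits: classes 0 and 1 share most of their visual evidence,
# so their scores are nearly tied.
logits = [2.0, 1.9, 0.1]
sharpened = [5.0 * z for z in logits]  # what entropy minimization pushes toward

before = entropy(softmax(logits))
after = entropy(softmax(sharpened))
# Entropy drops after sharpening, but nothing here checks that class 0 is
# actually the correct label.
print(f"entropy before={before:.3f}, after={after:.3f}")
```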


[55] UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents cs.CV | cs.CLPDF

Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang

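TL;DR: This paper introduces UNIKIE-BENCH, a unified benchmark for evaluating the key information extraction (KIE) capabilities of large multimodal models on visual documents. The benchmark has two complementary tracks: a constrained-category KIE track with scenario-predefined schemas, and an open-category KIE track that extracts any key information explicitly present in the document. Experiments on 15 state-of-the-art large multimodal models reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across document types and scenarios.

Details

Motivation: Extracting key information from real-world documents remains challenging because layout structures, visual quality, and task-specific information requirements vary widely. Although recent large multimodal models show promise for end-to-end KIE directly from document images, a comprehensive, systematic benchmark covering realistic and diverse application scenarios has been missing.

Result: Experiments on 15 state-of-the-art LMMs show substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, and pronounced disparities across document types and scenarios. These findings highlight persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE.

Insight: The novelty is a unified KIE benchmark with complementary constrained and open tracks that reflects practical application needs more fully. Viewed objectively, the benchmark systematically exposes the limitations of current LMMs on KIE, particularly weak generalization to complex layouts and long-tail fields, and gives future model improvement a clear evaluation framework and direction.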

Abstract: Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All codes and datasets are available at https://github.com/NEUIR/UNIKIE-BENCH.


[56] OMNI-Dent: Towards an Accessible and Explainable AI Framework for Automated Dental Diagnosis cs.CV | cs.LGPDF

Leeje Jang, Yao-Yi Chiang, Angela M. Hastings, Patimaporn Pungchanchaikul, Martha B. Lucas

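TL;DR: This paper presents OMNI-Dent, a data-efficient and explainable AI diagnostic framework. It uses a vision-language model (VLM) pipeline that incorporates clinical reasoning principles and operates on multi-view smartphone photographs to perform automated tooth-level evaluation without specialized clinical imaging, supporting early diagnosis and helping users judge when professional care is needed.

Details

Motivation: Existing AI-based dental diagnosis methods treat the task as pure visual pattern recognition, do not incorporate structured clinical reasoning, require large amounts of expert-annotated data, and generalize poorly across diverse real-world imaging conditions. The goal is to improve the accessibility and explainability of dental diagnosis.

Result: The abstract reports no specific quantitative results or benchmarks. It states that the framework aims to support diagnostic assessment in settings where curated clinical imaging is unavailable, and to serve as an early-stage assistive tool for identifying potential abnormalities.

Insight: The novelty lies in embedding dental experts' diagnostic heuristics into a general-purpose VLM pipeline without dental-specific fine-tuning of the VLM, using the VLM's existing visual-linguistic capabilities for data-efficient, explainable tooth-level evaluation. This makes it a practical option for resource-limited settings.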

Abstract: Accurate dental diagnosis is essential for oral healthcare, yet many individuals lack access to timely professional evaluation. Existing AI-based methods primarily treat diagnosis as a visual pattern recognition task and do not reflect the structured clinical reasoning used by dental professionals. These approaches also require large amounts of expert-annotated data and often struggle to generalize across diverse real-world imaging conditions. To address these limitations, we present OMNI-Dent, a data-efficient and explainable diagnostic framework that incorporates clinical reasoning principles into a Vision-Language Model (VLM)-based pipeline. The framework operates on multi-view smartphone photographs, embeds diagnostic heuristics from dental experts, and guides a general-purpose VLM to perform tooth-level evaluation without dental-specific fine-tuning of the VLM. By utilizing the VLM’s existing visual-linguistic capabilities, OMNI-Dent aims to support diagnostic assessment in settings where curated clinical imaging is unavailable. Designed as an early-stage assistive tool, OMNI-Dent helps users identify potential abnormalities and determine when professional evaluation may be needed, offering a practical option for individuals with limited access to in-person care.


[57] VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing cs.CV | cs.AIPDF

Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, Bo Du

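TL;DR: This paper proposes VLRS-Bench, the first vision-language reasoning benchmark dedicated to complex reasoning tasks in remote sensing. The benchmark contains 2,000 question-answer pairs spanning the three core dimensions of Cognition, Decision, and Prediction, and is designed to evaluate the advanced reasoning capabilities of multimodal large language models in remote sensing applications.

Details

Motivation: Existing remote sensing benchmarks are biased toward perception tasks (such as object recognition and scene classification) and lack evaluation of complex reasoning. This hinders the development of multimodal large language models for cognitively demanding remote sensing applications, so a dedicated reasoning benchmark is needed.

Result: Experiments reveal significant performance bottlenecks for existing state-of-the-art multimodal large language models on VLRS-Bench, providing critical insights for advancing multimodal reasoning in the remote sensing community.

Insight: The novelty is the first benchmark focused on complex reasoning in remote sensing. It is built through a specialized pipeline that integrates remote-sensing priors and expert knowledge to ensure geospatial realism and reasoning complexity, providing a new tool for evaluating and improving models' cognitive capabilities in this domain.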

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community.


[58] ShapBPT: Image Feature Attributions Using Data-Aware Binary Partition Trees cs.CV | cs.LGPDF

Muhammad Rashid, Elvio G. Amparore, Enrico Ferrari, Damiano Verda

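TL;DR: This paper proposes ShapBPT, a data-aware explainable AI method for computer vision based on the hierarchical Shapley formula. It assigns Shapley coefficients to a multiscale hierarchical structure tailored to images, the Binary Partition Tree (BPT), so that feature attributions align with intrinsic image morphology, yielding more semantically meaningful visual explanations at lower computational cost.

Details

Motivation: Existing hierarchical Shapley approaches do not exploit the multiscale structure of image data, leading to slow convergence and weak alignment with actual morphological features. No prior Shapley method has used data-aware hierarchies for computer vision, leaving a gap in the interpretability of models on structured visual data.

Result: Experiments show that ShapBPT aligns better with image structures and is more efficient than existing XCV methods, and a 20-subject user study confirms that humans prefer ShapBPT explanations.

Insight: The novelty is the first combination of a data-aware hierarchy (BPT) with hierarchical Shapley methods for computer vision, so that feature attributions follow image morphology and balance semantic meaning with computational efficiency.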

Abstract: Pixel-level feature attributions are an important tool in eXplainable AI for Computer Vision (XCV), providing visual insights into how image features influence model predictions. The Owen formula for hierarchical Shapley values has been widely used to interpret machine learning (ML) models and their learned representations. However, existing hierarchical Shapley approaches do not exploit the multiscale structure of image data, leading to slow convergence and weak alignment with the actual morphological features. Moreover, no prior Shapley method has leveraged data-aware hierarchies for Computer Vision tasks, leaving a gap in model interpretability of structured visual data. To address this, this paper introduces ShapBPT, a novel data-aware XCV method based on the hierarchical Shapley formula. ShapBPT assigns Shapley coefficients to a multiscale hierarchical structure tailored for images, the Binary Partition Tree (BPT). By using this data-aware hierarchical partitioning, ShapBPT ensures that feature attributions align with intrinsic image morphology, effectively prioritizing relevant regions while reducing computational overhead. This advancement connects hierarchical Shapley methods with image data, providing a more efficient and semantically meaningful approach to visual interpretability. Experimental results confirm ShapBPT’s effectiveness, demonstrating superior alignment with image structures and improved efficiency over existing XCV methods, and a 20-subject user study confirming that ShapBPT explanations are preferred by humans.
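For readers unfamiliar with Shapley attributions, the sketch below computes exact, flat Shapley values by averaging marginal contributions over all player orderings. It is background only: it is not the Owen hierarchical formula and not ShapBPT, both of which exist to avoid this factorial cost by restricting coalitions to a hierarchy. The three "regions" and the toy value function are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values via enumeration of all orderings (O(n!) cost)."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            extended = coalition | {player}
            # Marginal contribution of this player given who came before.
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_model_score(regions):
    """Hypothetical model output when only `regions` of the image are visible.

    Regions A and B matter only together; region C adds 0.5 on its own.
    """
    score = 0.0
    if {"A", "B"} <= regions:
        score += 1.0
    if "C" in regions:
        score += 0.5
    return score


attributions = shapley_values(["A", "B", "C"], toy_model_score)
print(attributions)  # {'A': 0.5, 'B': 0.5, 'C': 0.5}
```

The attributions sum to the score of the full image minus the score of the empty image (1.5 here), which is the efficiency property that hierarchical variants also preserve.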


[59] Interpreting Physics in Video World Models cs.CV | cs.AIPDF

Sonia Joseph, Quentin Garrido, Randall Balestriero, Matthew Kowal, Thomas Fel

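TL;DR: Using interpretability methods, this paper studies how large-scale video encoders represent physical variables internally. It finds that modern video world models do not use factorized representations in the style of a classical physics engine; they use distributed representations, yet still make accurate physical predictions.

Details

Motivation: To determine whether video models need explicitly factorized representations of physical variables to make accurate physical predictions, or whether they can represent such variables implicitly in a task-specific, distributed manner.

Result: Across architectures, the authors identify a “Physics Emergence Zone” at intermediate depth where physical information abruptly becomes accessible; physics-related representations peak shortly afterward and degrade toward the output layers. Scalar quantities such as speed and acceleration are available from early layers onward, whereas motion direction becomes accessible only at the Physics Emergence Zone and is encoded through a high-dimensional population structure with circular geometry.

Insight: Modern video models use distributed rather than factorized representations of physical variables; physical information emerges and is organized in the middle layers; and complex variables such as motion direction are encoded through a high-dimensional population structure, requiring coordinated multi-feature intervention to control.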

Abstract: A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition – which we call the Physics Emergence Zone – at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.


[60] Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning cs.CV | cs.AI | cs.LG | cs.NEPDF

Karthik Sivakoti

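TL;DR: This paper presents Neural Sentinel, a unified vision-language model (VLM) approach to license plate recognition, state classification, and vehicle attribute extraction. By fine-tuning the PaliGemma 3B model and adding a human-in-the-loop continual learning framework, it handles multiple tasks in a single forward pass and shows high accuracy, low latency, and good zero-shot generalization on real-world toll plaza imagery.

Details

Motivation: Traditional automatic license plate recognition systems use multi-stage pipelines, which suffer from compounding errors, high latency, and architectural complexity. This paper aims to simplify the pipeline and improve performance with a unified vision-language model.

Result: The system reaches 92.3% plate recognition accuracy, a 14.1% and 9.9% improvement over the EasyOCR and PaddleOCR baselines respectively. Mean inference latency is 152 ms, with an expected calibration error of 0.048. On zero-shot auxiliary tasks, accuracy is 89% for vehicle color detection, 82% for seatbelt detection, and 78% for occupancy counting.

Insight: The novelty lies in applying a unified VLM architecture to ALPR, using LoRA fine-tuning and a human-in-the-loop continual learning framework for efficient multi-task learning and adaptation while preventing catastrophic forgetting, and showing zero-shot generalization. The authors present this as a paradigm shift for ALPR systems.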

Abstract: Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, which is a 14.1% improvement over EasyOCR and a 9.9% improvement over PaddleOCR baselines. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. Through extensive experimentation on real-world toll plaza imagery, we demonstrate that unified vision language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
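The Expected Calibration Error quoted above measures how far stated confidence is from observed accuracy. The following is a minimal sketch of the common equal-width binned estimator; the bin count of 10 and the toy predictions are assumptions, since the abstract does not state how the authors computed their ECE.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical plate readings: the model claims 95% confidence on all four,
# but only three are right, so the calibration gap in that bin is 0.20.
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [True, True, True, False]
print(round(expected_calibration_error(confidences, correct), 3))  # 0.2
```

A value near zero means confidence scores can be read roughly as probabilities of being correct, which matters when low-confidence reads are routed to a human reviewer.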


[61] From Images to Decisions: Assistive Computer Vision for Non-Metallic Content Estimation in Scrap Metal cs.CVPDF

Daniil Storonkin, Ilia Dziub, Maksim Golyadkin, Ilya Makarov

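TL;DR: This paper presents an assistive computer vision system that automatically estimates non-metallic inclusion (contamination) content from images captured during scrap unloading and also classifies scrap type. The method formulates contamination assessment as a railcar-level regression task and handles sequential data with multi-instance learning and multi-task learning. The system is integrated into the acceptance workflow in near real time, using magnet/railcar detection, a versioned inference service, and an active-learning loop to reduce subjective variability, improve safety, and support process optimization.

Details

Motivation: Scrap quality directly affects energy use, emissions, and safety in steelmaking. Non-metallic inclusion content is currently assessed visually by human inspectors, which is subjective and hazardous because of dust and moving machinery.

Result: The best results are MAE 0.27 and R² 0.83 with multi-instance learning (MIL); the multi-task learning (MTL) setup reaches MAE 0.36 together with an F1 score of 0.79 for scrap classification.

Insight: The contributions include formulating contamination assessment as a railcar-level regression task that exploits sequential data through multi-instance and multi-task learning, and a system design that integrates near-real-time inference, confidence scores, structured human overrides, and an active-learning loop, achieving workflow integration and continual improvement.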

Abstract: Scrap quality directly affects energy use, emissions, and safety in steelmaking. Today, the share of non-metallic inclusions (contamination) is judged visually by inspectors, an approach that is subjective and hazardous due to dust and moving machinery. We present an assistive computer vision pipeline that estimates contamination (in percent) from images captured during railcar unloading and also classifies scrap type. The method formulates contamination assessment as a regression task at the railcar level and leverages sequential data through multi-instance learning (MIL) and multi-task learning (MTL). The best results are MAE 0.27 and R² 0.83 with MIL, while an MTL setup reaches MAE 0.36 with F1 0.79 for scrap class. We also integrate the system into the acceptance workflow in near real time: magnet/railcar detection segments temporal layers, a versioned inference service produces railcar-level estimates with confidence scores, and results are reviewed by operators with structured overrides; corrections and uncertain cases feed an active-learning loop for continual improvement. The pipeline reduces subjective variability, improves human safety, and enables integration into acceptance and melt-planning workflows.
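The two regression metrics reported here, mean absolute error and the coefficient of determination R², can be computed as follows. This is a minimal sketch of the standard definitions; the railcar contamination values are invented for illustration and are not data from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical contamination shares (in percent) for four railcars.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(mean_absolute_error(measured, predicted))  # 0.25
print(r2_score(measured, predicted))             # 0.9
```

MAE is expressed in the same units as the target (percentage points of contamination here), while R² is unitless and describes the fraction of variance explained, so the two are usually reported together.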


[62] Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine cs.CVPDF

Minghao Han, Dingkang Yang, Yue Jiang, Yizhou Liu, Lihua Zhang

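TL;DR: This paper presents OmniFysics, a compact omni-modal model that unifies understanding of images, audio, video, and text with integrated speech and image generation. To address brittle physical understanding in omni-modal models, the authors build a physical data engine with two components, FysicsAny and FysicsOmniCap, which produce physics-grounded instruction-image supervision and high-fidelity video-instruction pairs. The model is trained with staged multimodal alignment and instruction tuning, adopts latent-space flow matching for text-to-image generation, and uses an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.

Details

Motivation: To address the brittleness of physical understanding in omni-modal models, which arises because key physical attributes are visually ambiguous and sparsely represented in web-scale data.

Result: The model is competitive on standard multimodal benchmarks and achieves improved results on physics-oriented evaluations.

Insight: The novelty lies in a physical data engine (FysicsAny and FysicsOmniCap) that injects physical knowledge explicitly, combined with staged alignment, instruction tuning, latent-space flow matching, and an intent router to unify multimodal understanding and generation in a compact model and improve physical reasoning.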

Abstract: Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction–image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law–constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio–visual consistency filtering to generate high-fidelity video–instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.


[63] MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation cs.CV | cs.AIPDF

Haoming Wang, Qiyao Xue, Weichen Liu, Wei Gao

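TL;DR: This paper proposes MosaicThinker, an inference-time computing technique that strengthens the cross-frame visual spatial reasoning of small vision-language models (VLMs) on resource-constrained embodied AI devices. The core idea is to iteratively build a global semantic map that integrates fragmented spatial information from multiple video frames into a unified space representation, and to guide the VLM's reasoning over that map with a visual prompt. Experiments show that the method substantially improves on-device accuracy on cross-frame spatial reasoning tasks of diverse types and complexities.

Details

Motivation: As embodied AI expands from traditional object detection and recognition to more advanced tasks such as robot manipulation and action planning, visual spatial reasoning over video input is needed to perceive spatial relationships between objects and guide device actions. Existing vision-language models (VLMs), however, lack 3D spatial knowledge and are very weak at spatial reasoning, especially on tasks involving complex spatial relations across multiple frames.

Result: Experiments show that the technique greatly improves the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, across tasks of diverse types and complexities.

Insight: The novelty is an inference-time technique that integrates multi-frame spatial information by iteratively building a global semantic map and guides the VLM's spatial reasoning with a visual prompt. Viewed objectively, this is a lightweight, on-device way to strengthen spatial perception: an intermediate representation (the semantic map) compensates for the inherent weakness of small VLMs at complex cross-frame reasoning, which makes the approach practical and reusable.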

Abstract: As embodied AI expands from traditional object detection and recognition to more advanced tasks of robot manipulation and actuation planning, visual spatial reasoning from video inputs is necessary to perceive the spatial relationships of objects and guide device actions. However, existing visual language models (VLMs) have very weak capabilities in spatial reasoning due to the lack of knowledge about 3D spatial information, especially when the reasoning task involves complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker, which enhances the on-device small VLM’s spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation in the form of a global semantic map, and further guide the VLM’s spatial reasoning over the semantic map via a visual prompt. Experimental results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, over reasoning tasks with diverse types and complexities.


[64] WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark cs.CV | cs.AIPDF

Wang Lin, Feng Wang, Majun Zhang, Wentao Hu, Tao Jin

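TL;DR: This paper introduces the WorldEdit dataset to address the limitations of image editing models on implicit editing instructions, which describe the cause of a visual change rather than the resulting outcome. It builds high-quality editing samples grounded in real-world causal logic, introduces the WorldEdit-Test evaluation benchmark, and fine-tunes models (such as Bagel) with a two-stage training framework and a causal verification reward, significantly improving knowledge plausibility and instruction following and narrowing the gap with advanced models such as GPT-4o and Nano-Banana.

Details

Motivation: Existing image editing models rely on uniform editing strategies and struggle with implicit editing instructions that require complex world knowledge and reasoning. The paper aims to fill this gap and advance open-world image editing.

Result: Experiments on the WorldEdit-Test benchmark show that the proposed method significantly improves performance in causal editing scenarios, reaching a level competitive with GPT-4o and Nano-Banana in instruction following and knowledge plausibility and narrowing the gap between open-source systems and advanced models.

Insight: The contributions include the world-knowledge-based image editing dataset WorldEdit, instruction paraphrasing guided by causal logic, and a two-stage training framework with a causal verification reward. Together they offer a scalable approach to implicit editing instructions and underline the importance of world knowledge in image editing.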

Abstract: Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce WorldEdit, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples, guided by paraphrased instructions that align with real-world causal logic. Furthermore, we provide WorldEdit-Test for evaluating existing models’ performance on causal editing scenarios. With WorldEdit, we use a two-stage training framework for fine-tuning models like Bagel, integrated with a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.


[65] TLC-Plan: A Two-Level Codebook Based Network for End-to-End Vector Floorplan Generation cs.CVPDF

Biao Xiong, Zhen Peng, Ping Wang, Qiegen Liu, Xian Zhong

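TL;DR: TLC-Plan is a two-level codebook network for end-to-end vector floorplan generation. Its hierarchical generative model synthesizes vector floorplans directly from input boundaries, avoiding the structural inconsistencies of existing methods that operate in raster space and depend on post hoc vectorization.

Details

Motivation: Existing automated floorplan generation methods operate in raster space and rely on post hoc vectorization, which introduces structural inconsistencies and hinders end-to-end learning. This paper aims to generate vector floorplans directly, in line with human architectural workflows based on modular and reusable patterns.

Result: State-of-the-art performance on the RPLAN dataset (FID = 1.84, MSE = 2.06) and leading results on the LIFULL dataset.

Insight: The novelty is a two-level VQ-VAE that encodes global layouts (semantically labeled room bounding boxes) and refines local geometry (polygon-level codes) hierarchically, unified in a CodeTree representation, with an autoregressive transformer sampling codes conditioned on the boundary to generate diverse, topologically valid designs without explicit room topology or dimensional priors.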

Abstract: Automated floorplan generation aims to improve design quality, architectural efficiency, and sustainability by jointly modeling global spatial organization and precise geometric detail. However, existing approaches operate in raster space and rely on post hoc vectorization, which introduces structural inconsistencies and hinders end-to-end learning. Motivated by compositional spatial reasoning, we propose TLC-Plan, a hierarchical generative model that directly synthesizes vector floorplans from input boundaries, aligning with human architectural workflows based on modular and reusable patterns. TLC-Plan employs a two-level VQ-VAE to encode global layouts as semantically labeled room bounding boxes and to refine local geometries using polygon-level codes. This hierarchy is unified in a CodeTree representation, while an autoregressive transformer samples codes conditioned on the boundary to generate diverse and topologically valid designs, without requiring explicit room topology or dimensional priors. Extensive experiments show state-of-the-art performance on RPLAN dataset (FID = 1.84, MSE = 2.06) and leading results on LIFULL dataset. The proposed framework advances constraint-aware and scalable vector floorplan generation for real-world architectural applications. Source code and trained models are released at https://github.com/rosolose/TLC-PLAN.


[66] Zero-Shot UAV Navigation in Forests via Relightable 3D Gaussian Splatting cs.CV | cs.ROPDF

Zinan Lv, Yeqian Qian, Chen Sang, Hao Liu, Danping Zou

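TL;DR: This paper proposes an end-to-end reinforcement learning framework, built on Relightable 3D Gaussian Splatting, for zero-shot monocular-vision UAV navigation in unstructured outdoor forest environments. By decomposing the scene, the method enables physically plausible lighting edits inside the neural representation, so that diverse lighting conditions can be synthesized during simulated training. The policy thereby learns visual features that are robust to illumination changes and achieves high-speed, collision-free navigation in real forests without fine-tuning.

Details

Motivation: Policies for UAVs that navigate unstructured outdoor scenes with passive monocular vision generalize poorly because of the large visual domain gap between simulation and reality, in particular because existing reconstructions couple static lighting with geometry while real-world illumination is dynamic.

Result: Extensive experiments in real-world forests show that a lightweight quadrotor running the policy achieves robust, collision-free navigation at speeds up to 10 m/s and adapts well to drastic lighting variations without fine-tuning.

Insight: The core novelty is Relightable 3D Gaussian Splatting, which decouples scene components to allow explicit, physically grounded editing of lighting, so that diverse lighting conditions can be generated in simulation to make the policy more robust. Viewed objectively, combining a high-quality neural scene representation with reinforcement learning policy training offers a novel and effective approach to sim-to-real visual domain adaptation.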

Abstract: UAV navigation in unstructured outdoor environments using passive monocular vision is hindered by the substantial visual domain gap between simulation and reality. While 3D Gaussian Splatting enables photorealistic scene reconstruction from real-world data, existing methods inherently couple static lighting with geometry, severely limiting policy generalization to dynamic real-world illumination. In this paper, we propose a novel end-to-end reinforcement learning framework designed for effective zero-shot transfer to unstructured outdoors. Within a high-fidelity simulation grounded in real-world data, our policy is trained to map raw monocular RGB observations directly to continuous control commands. To overcome photometric limitations, we introduce Relightable 3D Gaussian Splatting, which decomposes scene components to enable explicit, physically grounded editing of environmental lighting within the neural representation. By augmenting training with diverse synthesized lighting conditions ranging from strong directional sunlight to diffuse overcast skies, we compel the policy to learn robust, illumination-invariant visual features. Extensive real-world experiments demonstrate that a lightweight quadrotor achieves robust, collision-free navigation in complex forest environments at speeds up to 10 m/s, exhibiting significant resilience to drastic lighting variations without fine-tuning.


[67] Extended to Reality: Prompt Injection in 3D Environments cs.CV | cs.AIPDF

Zhuoheng Li, Ying Chen

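TL;DR: This paper proposes PI3D, a prompt injection attack against multimodal large language models (MLLMs) in 3D environments. Instead of digitally editing images, the attack places text-bearing objects in the physical world to induce the MLLM to perform an attacker-specified task. The paper shows that the attack is effective under diverse camera trajectories and that existing defenses are insufficient.

Details

Motivation: As MLLMs become better at interpreting and responding to visual input in 3D environments (for example in robotics and situated conversational agents), a new attack surface emerges: when the model reasons over camera-captured views of the physical world, an attacker can place text-bearing physical objects to override the MLLM's original task. Prior work focuses on attacks in the text domain and on digitally edited 2D images, while such attacks in 3D physical environments remain poorly understood.

Result: Experiments show that PI3D is an effective attack against multiple MLLMs under diverse camera trajectories. The evaluation also shows that existing defenses are insufficient to defend against PI3D.

Insight: The novelty lies in extending prompt injection from text and 2D digital images to 3D physical environments, and in systematically solving the problem of finding an effective 3D object pose (position and orientation) that achieves the attack goal while keeping the object placement physically plausible. This exposes a new class of security risk for MLLMs deployed in the real world and offers an important perspective for evaluating and designing defenses against physical-world attacks.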

Abstract: Multimodal large language models (MLLMs) have advanced the capabilities to interpret and act on visual input in 3D environments, empowering diverse applications such as robotics and situated conversational agents. When MLLMs reason over camera-captured views of the physical world, a new attack surface emerges: an attacker can place text-bearing physical objects in the environment to override MLLMs’ intended task. While prior work has studied prompt injection in the text domain and through digitally edited 2D images, it remains unclear how these attacks function in 3D physical environments. To bridge the gap, we introduce PI3D, a prompt injection attack against MLLMs in 3D environments, realized through text-bearing physical object placement rather than digital image edits. We formulate and solve the problem of identifying an effective 3D object pose (position and orientation) with injected text, where the attacker’s goal is to induce the MLLM to perform the injected task while ensuring that the object placement remains physically plausible. Experiments demonstrate that PI3D is an effective attack against multiple MLLMs under diverse camera trajectories. We further evaluate existing defenses and show that they are insufficient to defend against PI3D.


[68] Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models cs.CV | cs.AI | cs.CLPDF

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

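TL;DR: This paper proposes Ex-Omni, an open-source framework that augments omni-modal large language models (OLLMs) with the ability to generate speech-accompanied 3D facial animation. The method decouples semantic reasoning from temporal generation, uses speech units as temporal scaffolding, and adopts a unified token-as-query gated fusion mechanism to achieve stable, aligned speech and facial animation generation under limited data.

Details

Motivation: Current omni-modal large language models aim to unify multimodal understanding and generation, but combining speech with 3D facial animation, a capability that is important for natural interaction, remains underexplored. The main challenge is the representation mismatch between the discrete, token-level semantic reasoning of LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling hard to optimize under limited data.

Result: Extensive experiments show that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.

Insight: The core novelty is reducing learning difficulty by decoupling semantic reasoning from temporal generation. Specifically, speech units serve as temporal scaffolding that guides animation generation, and a unified token-as-query gated fusion mechanism provides controlled semantic injection. The paper also introduces the InstructEx dataset to support the task. Viewed objectively, decomposing a complex task into more tractable sub-problems (semantic reasoning and temporal generation) and designing an intermediate representation (speech units) as a bridge are instructive strategies for modality alignment in multimodal generation.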

Abstract: Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset that aims to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.


[69] DuMeta++: Spatiotemporal Dual Meta-Learning for Generalizable Few-Shot Brain Tissue Segmentation Across Diverse Ages cs.CVPDF

Yongheng Sun, Jun Shu, Jianhua Ma, Fan Wang

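TL;DR: This paper proposes DuMeta++, a dual meta-learning framework that requires no paired longitudinal data, to address cross-age generalization in brain tissue MRI segmentation. The method combines meta-feature learning with meta-initialization learning and introduces a memory-bank-based class-aware regularization strategy to strengthen longitudinal consistency. Under few-shot settings it generalizes across ages better than existing methods.

Details

Motivation: Brain tissue MRI segmentation is critical for neuroscience and clinical applications, but stable performance across the lifespan is hard to achieve because brain appearance and morphology change dynamically with age. Existing methods typically rely on paired longitudinal data for self-supervised regularization, yet such data are often unavailable in practice.

Result: Few-shot experiments on several datasets (iSeg-2019, IBIS, OASIS, and ADNI) show that DuMeta++ outperforms existing methods in cross-age generalization and achieves state-of-the-art performance.

Insight: The novel elements are: 1) a dual meta-learning framework (meta-feature learning plus meta-initialization learning) that needs no paired longitudinal data; 2) a memory-bank-based class-aware regularization strategy that enforces longitudinal consistency without explicit longitudinal supervision; 3) a theoretical proof of the algorithm's convergence. The method offers reusable meta-learning and regularization ideas for medical image segmentation in data-scarce settings.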

Abstract: Accurate segmentation of brain tissues from MRI scans is critical for neuroscience and clinical applications, but achieving consistent performance across the human lifespan remains challenging due to dynamic, age-related changes in brain appearance and morphology. While prior work has sought to mitigate these shifts by using self-supervised regularization with paired longitudinal data, such data are often unavailable in practice. To address this, we propose DuMeta++, a dual meta-learning framework that operates without paired longitudinal data. Our approach integrates: (1) meta-feature learning to extract age-agnostic semantic representations of spatiotemporally evolving brain structures, and (2) meta-initialization learning to enable data-efficient adaptation of the segmentation model. Furthermore, we propose a memory-bank-based class-aware regularization strategy to enforce longitudinal consistency without explicit longitudinal supervision. We theoretically prove the convergence of DuMeta++, ensuring stability. Experiments on diverse datasets (iSeg-2019, IBIS, OASIS, ADNI) under few-shot settings demonstrate that DuMeta++ outperforms existing methods in cross-age generalization. Code will be available at https://github.com/ladderlab-xjtu/DuMeta++.


[70] Condition Matters in Full-head 3D GANs cs.CV | cs.GRPDF

Heyuan Li, Huimin Zhang, Yuda Qiu, Zhengwentai Sun, Keru Zheng

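TL;DR: This paper proposes using view-invariant semantic features as the conditioning input for full-head 3D GANs, addressing the generation bias and global incoherence caused by the conventional choice of the view angle as the condition. The authors build a synthesized head image dataset by extending high-quality frontal face datasets with FLUX.1 Kontext, and use the image feature extracted from the frontal view as a shared semantic condition. This decouples generative capability from viewing direction and improves the fidelity, diversity, and global coherence of the generated 3D heads.

Details

Motivation: Conventional full-head 3D GANs usually take the view angle as the conditioning input, which biases the learned 3D head space along the conditional view direction. The bias appears as a marked difference in generation quality and diversity between the conditional view and non-conditional views, producing global incoherence across head regions. A view-invariant semantic condition is therefore needed to remove this bias.

Result: Extensive experiments on full-head synthesis and single-view GAN inversion show that the method achieves significantly higher fidelity, diversity, and generalizability.

Insight: The novelty is to condition on view-invariant semantic features rather than on the conventional view angle. Semantic alignment is achieved by constructing a synthesized dataset with FLUX.1 Kontext, which decouples generation from viewing direction, promotes continuous learning and diverse generation, and strengthens the global coherence of the generated 3D heads.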

Abstract: Conditioning is crucial for stable training of full-head 3D GANs. Without any conditioning signal, the model suffers from severe mode collapse, making it impractical to train. However, a series of previous full-head 3D GANs conventionally choose the view angle as the conditioning input, which leads to a bias in the learned 3D full-head space along the conditional view direction. This is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions. In this work, we propose to use a view-invariant semantic feature as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset. We leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The image clip feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training and enhances the global coherence of the generated 3D heads. Moreover, as GANs often experience slower improvements in diversity once the generator learns a few modes that successfully fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, thereby promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.


[71] Understanding Real-World Traffic Safety through RoadSafe365 Benchmark cs.CVPDF

Xinyu Liu, Darryl C. Jacob, Yuxin Liu, Xinsong Du, Muchao Ye

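TL;DR: This paper introduces RoadSafe365, a large-scale vision-language benchmark for fine-grained traffic safety analysis from real-world video data. The benchmark is systematically organized with a hierarchical taxonomy, covers diverse traffic event types, environmental contexts, and interaction scenarios, and provides rich attribute annotations and multiple-choice question-answer sets to advance reproducible research in traffic safety analysis.

Details

Motivation: Existing traffic benchmarks lack systematic evaluation aligned with official safety standards. RoadSafe365 aims to fill this gap by using fine-grained analysis to bridge official standards and data-driven traffic understanding systems.

Result: Fine-tuning models on RoadSafe365 yields consistent performance gains, and cross-domain experiments on both real and synthetic datasets further validate its effectiveness, providing a benchmark for large-scale training and standardized evaluation.

Insight: The novelty is a hierarchical taxonomy that refines and extends the foundational definitions of crash, incident, and violation, connecting official safety standards with data-driven systems, together with large-scale, diverse annotations that support vision-language understanding and reasoning tasks.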

Abstract: Although recent traffic benchmarks have advanced multimodal data analysis, they generally lack systematic evaluation aligned with official safety standards. To fill this gap, we introduce RoadSafe365, a large-scale vision-language benchmark that supports fine-grained analysis of traffic safety from extensive and diverse real-world video data collections. Unlike prior works that focus primarily on coarse accident identification, RoadSafe365 is independently curated and systematically organized using a hierarchical taxonomy that refines and extends foundational definitions of crash, incident, and violation to bridge official traffic safety standards with data-driven traffic understanding systems. RoadSafe365 provides rich attribute annotations across diverse traffic event types, environmental contexts, and interaction scenarios, yielding 36,196 annotated clips from both dashcam and surveillance cameras. Each clip is paired with multiple-choice question-answer sets, comprising 864K candidate options, 8.4K unique answers, and 36K detailed scene descriptions collectively designed for vision-language understanding and reasoning. We establish strong baselines and observe consistent gains when fine-tuning on RoadSafe365. Cross-domain experiments on both real and synthetic datasets further validate its effectiveness. Designed for large-scale training and standardized evaluation, RoadSafe365 provides a comprehensive benchmark to advance reproducible research in real-world traffic safety analysis.


[72] VideoNeuMat: Neural Material Extraction from Generative Video Models cs.CV | cs.GRPDF

Bowen Xue, Saeed Hadadan, Zheng Zeng, Fabrice Rousselle, Zahra Montazeri

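TL;DR: VideoNeuMat is a two-stage pipeline for extracting reusable neural material assets from video generative models. It first finetunes a large video model to generate material sample videos under controlled camera and lighting trajectories, then uses a Large Reconstruction Model to reconstruct compact neural material parameters from those videos, transferring material knowledge from internet-scale video models into standalone, reusable 3D assets.

Details

Motivation: High-quality training data for materials is scarce. Video generative models contain knowledge of realistic material appearance, but that knowledge is entangled with geometry and lighting and must be extracted into neural material assets that can be used independently.

Result: From 17 generated video frames, the neural material parameters predicted in a single inference pass generalize to novel viewing and lighting conditions, and the resulting materials far exceed the limited synthetic training data in realism and diversity.

Insight: The novelty is to treat the video generative model as a virtual gonioreflectometer that produces a structured measurement pattern, and to use a finetuned Large Reconstruction Model for efficient single-pass neural material reconstruction, giving an end-to-end pipeline that extracts knowledge from generated videos into reusable 3D assets.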

Abstract: Creating photorealistic materials for 3D rendering requires exceptional artistic skill. Generative models for materials could help, but are currently limited by the lack of high-quality training data. While recent video generative models effortlessly produce realistic material appearances, this knowledge remains entangled with geometry and lighting. We present VideoNeuMat, a two-stage pipeline that extracts reusable neural material assets from video diffusion models. First, we finetune a large video model (Wan 2.1 14B) to generate material sample videos under controlled camera and lighting trajectories, effectively creating a “virtual gonioreflectometer” that preserves the model’s material realism while learning a structured measurement pattern. Second, we reconstruct compact neural materials from these videos through a Large Reconstruction Model (LRM) finetuned from a smaller Wan 1.3B video backbone. From 17 generated video frames, our LRM performs single-pass inference to predict neural material parameters that generalize to novel viewing and lighting conditions. The resulting materials exhibit realism and diversity far exceeding the limited synthetic training data, demonstrating that material knowledge can be successfully transferred from internet-scale video models into standalone, reusable neural 3D assets.


[73] Diabetic Retinopathy Lesion Segmentation through Attention Mechanisms cs.CVPDF

Aruna Jithesh, Chinmayi Karumuri, Venkata Kiran Reddy Kotha, Meghana Doddapuneni, Taehee Jeong

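TL;DR: This paper proposes an attention-based method for diabetic retinopathy (DR) lesion segmentation. It integrates an attention mechanism into the DeepLab-V3+ architecture to segment four lesion types (microaneurysms, soft exudates, hard exudates, and hemorrhages) on 757 images from the DDR dataset, with the aim of providing pixel-level annotations that support clinical screening.

Details

Motivation: Diabetic retinopathy (DR) can lead to vision loss, so early screening is crucial. Existing automated deep learning screening algorithms have limited clinical applicability for lesion segmentation, so more accurate pixel-level segmentation methods are needed to support ophthalmologists' diagnoses.

Result: Experiments on the DDR dataset show that, compared with the baseline model (DeepLab-V3+), the proposed Attention-DeepLab model raises mean average precision (mAP) from 0.3010 to 0.3326 and mean Intersection over Union (mIoU) from 0.1791 to 0.1928. The detection score for microaneurysms, a key early symptom, rises from 0.0205 to 0.0763.

Insight: The main novelty is integrating an attention mechanism into the DeepLab-V3+ segmentation architecture to sharpen the model's focus on DR lesion features. Viewed objectively, the attention mechanism improves performance, most notably on the clinically critical early microaneurysm lesions, which is a useful reference for improving small-target detection in medical image segmentation.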

Abstract: Diabetic Retinopathy (DR) is an eye disease that arises from diabetes mellitus and can cause vision loss and blindness. To prevent irreversible vision loss, early detection through systematic screening is crucial. Although researchers have developed numerous automated deep learning-based algorithms for DR screening, their clinical applicability remains limited, particularly in lesion segmentation. Our method provides pixel-level annotations for lesions, which in practice supports ophthalmologists in screening for DR from fundus images. In this work, we segmented four types of DR-related lesions (microaneurysms, soft exudates, hard exudates, and hemorrhages) on 757 images from the DDR dataset. To enhance lesion segmentation, an attention mechanism was integrated with DeepLab-V3+. Compared to the baseline model, the Attention-DeepLab model increases mean average precision (mAP) from 0.3010 to 0.3326 and the mean Intersection over Union (mIoU) from 0.1791 to 0.1928. The model also increased microaneurysm detection from 0.0205 to 0.0763, a clinically significant improvement, since microaneurysms are the earliest visible symptom of DR.
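
For readers unfamiliar with the mean IoU figure quoted above, the following minimal sketch shows how per-class Intersection over Union and its mean are commonly computed from predicted and ground-truth label maps. The paper's exact evaluation protocol is not given in the abstract, so the flattened-label input format, the function names, and the choice to skip classes absent from both maps are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def iou_per_class(
    pred: Sequence[int], target: Sequence[int], num_classes: int
) -> List[Optional[float]]:
    """IoU for each class over flattened label maps; None if the class never appears."""
    ious: List[Optional[float]] = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average IoU over the classes that appear in the prediction or the target."""
    valid = [v for v in iou_per_class(pred, target, num_classes) if v is not None]
    if not valid:
        raise ValueError("no class is present in either label map")
    return sum(valid) / len(valid)


# Toy example with three classes on six pixels (assumed data).
pred = [0, 1, 1, 0, 2, 2]
target = [0, 1, 0, 0, 2, 1]
print(iou_per_class(pred, target, 3))  # class IoUs: 2/3, 1/3, 1/2
print(mean_iou(pred, target, 3))
```

Small lesions such as microaneurysms occupy few pixels, so a handful of misclassified pixels changes their IoU sharply, which is consistent with the low absolute scores reported for that class.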


[74] LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery cs.CV | cs.AIPDF

Difei Gu, Yunhe Gao, Gerasimos Chatzoudis, Zihan Dong, Guoning Zhang

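TL;DR: LUCID-SAE is a unified vision-language sparse autoencoder that learns representations of image patches and text tokens through a shared latent dictionary while reserving private capacity for modality-specific details. It aligns features with an optimal transport matching objective, without labels, yielding interpretable shared features that support cross-modal concept discovery and more robust evaluation.

Details

Motivation: Existing sparse autoencoders (SAEs) are trained per modality, so their features are not directly understandable and their explanations do not transfer across domains. The paper aims to learn a unified vision-language sparse representation for interpretable concept discovery.

Result: The shared features produced by LUCID capture diverse semantic categories, including objects, actions, attributes, and abstract concepts, and show improved robustness in cross-modal neuron correspondence and against the concept clustering problem in similarity-based evaluation. No specific benchmarks or state-of-the-art comparisons are mentioned.

Insight: The novel elements are: 1) a unified vision-language sparse autoencoder architecture that combines a shared dictionary with private capacity; 2) an unsupervised feature alignment method based on optimal transport; 3) an automated dictionary interpretation pipeline that uses term clustering to reduce the need for manual inspection; 4) shared features that span a wide range of semantic categories, giving a comprehensive approach to interpretable multimodal representations.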

Abstract: Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained per modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. In this study, we introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations, while reserving private capacity for modality-specific details. We achieve feature alignment by coupling the shared codes with a learned optimal transport matching objective, without the need for labeling. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem in similarity-based evaluation. Leveraging the alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering without manual observations. Our analysis reveals that LUCID’s shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations.
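
The abstract does not specify how the optimal transport matching is computed. As background only, the sketch below shows entropy-regularized optimal transport solved with Sinkhorn iterations, a common way to obtain a soft matching between two sets of items from a cost matrix. The uniform marginals, the `epsilon` value, and the toy cost matrix are assumptions, and this is not claimed to be LUCID's objective.

```python
import math
from typing import List, Sequence


def sinkhorn(
    cost: Sequence[Sequence[float]], epsilon: float = 0.1, n_iters: int = 200
) -> List[List[float]]:
    """Entropy-regularized OT plan between uniform marginals via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> large kernel value.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on the first set (e.g. image patches)
    b = [1.0 / m] * m  # uniform mass on the second set (e.g. text tokens)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy cost: item 0 matches item 0 cheaply, item 1 matches item 1 cheaply.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
for row in plan:
    print([round(x, 4) for x in row])
```

The resulting plan places almost all mass on the low-cost diagonal pairs, which is the behaviour a matching objective would reward when paired patch and token codes agree.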


[75] Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation cs.CV | cs.AI | cs.LG | cs.ROPDF

Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

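TL;DR: This paper proposes CLARITY, a language-guided RGB-T (visible-thermal) fusion framework for robust semantic segmentation of autonomous driving scenes. The framework uses vision-language model (VLM) priors to adapt its fusion strategy dynamically to the detected illumination condition, and introduces two mechanisms that preserve dark-object semantics and sharpen thin-object boundaries. It sets a new state of the art on the MFNet dataset.

Details

Motivation: Under adverse illumination, lighting, and shadow conditions, existing RGB-T fusion methods apply a static, uniform fusion strategy, which lets modality-specific noise propagate through the network and degrades the robustness of driving-scene semantic segmentation.

Result: Experiments on the MFNet dataset show that CLARITY achieves new state-of-the-art (SOTA) results of 62.3% mIoU and 77.5% mAcc.

Insight: The core novelty is using vision-language model (VLM) priors to obtain a dynamic, condition-adaptive modality fusion strategy instead of static fusion. In addition, a mechanism that preserves valid dark-object semantics addresses over-aggressive noise suppression, and a hierarchical decoder that enforces structural consistency across scales addresses blurred boundaries on thin objects, improving segmentation accuracy.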

Abstract: Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality’s contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.


[76] Optimizing Few-Step Generation with Adaptive Matching Distillation cs.CV | cs.LGPDF

Lichen Bai, Zikai Zhou, Shitong Shao, Wenliang Zhong, Shuo Yang

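TL;DR: This paper proposes Adaptive Matching Distillation (AMD), a unified optimization framework that addresses the instability of Distribution Matching Distillation (DMD) in "Forbidden Zone" regions. AMD uses reward proxies to explicitly detect and escape Forbidden Zones, dynamically adjusts gradient priorities, and introduces Repulsive Landscape Sharpening to prevent model collapse. Experiments on image and video generation show that AMD significantly improves sample fidelity and training robustness, surpassing existing methods on models such as SDXL.

Details

Motivation: When DMD is used to accelerate generative models, it becomes unstable in Forbidden Zones, that is, regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. This limits the performance ceiling of few-step generative models.

Result: AMD significantly improves performance on image and video generation tasks (e.g., SDXL, Wan2.1) and on benchmarks (e.g., VBench, GenEval). For example, on SDXL the HPSv2 score rises from 30.64 to 31.25, surpassing state-of-the-art baselines.

Insight: The novelty is to reinterpret prior work as implicit strategies for avoiding Forbidden Zones and to introduce an explicit self-correcting mechanism (AMD). It dynamically prioritizes corrective gradients via structural signal decomposition and uses Repulsive Landscape Sharpening to strengthen energy barriers, thereby improving training trajectories and model robustness.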

Abstract: Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in Forbidden Zones, regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a self-correcting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.


[77] Row-Column Separated Attention Based Low-Light Image/Video Enhancement cs.CVPDF

Chengqi Dong, Zhiyuan Cao, Tuoshi Qi, Kexin Wu, Yixing Gao

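TL;DR: This paper proposes a low-light image/video enhancement method based on a Row-Column Separated Attention (RCSA) module. An improved U-Net is combined with the RCSA module so that global information guides local information, and temporal loss functions are introduced to keep video enhancement temporally consistent.

Details

Motivation: Existing U-Net structures for low-light enhancement lack guidance from global information, which leads to large local noise and loss of detail. Conventional attention mechanisms add too many parameters and computations, so a more efficient attention module is needed.

Result: Extensive experiments on the LOL and MIT Adobe FiveK image datasets and the SDSD video dataset validate the method's effectiveness, with state-of-the-art enhancement quality.

Insight: The novel elements are the Row-Column Separated Attention (RCSA) module, which lowers computational complexity, and the temporal loss functions, which ensure temporal consistency in video. The lightweight attention design and the unified enhancement framework for both images and video are worth borrowing.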

Abstract: The U-Net structure is widely used for low-light image/video enhancement. Without proper guidance from global information, however, the enhanced images contain areas with large local noise and lose more detail. Attention mechanisms can better focus on and use global information, but applying attention to images can significantly increase the number of parameters and computations. We propose a Row-Column Separated Attention module (RCSA) inserted after an improved U-Net. The RCSA module’s input is the mean and maximum of the row and column of the feature map, which utilizes global information to guide local information with fewer parameters. We propose two temporal loss functions to apply the method to low-light video enhancement and maintain temporal consistency. Extensive experiments on the LOL, MIT Adobe FiveK image, and SDSD video datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/cq-dong/URCSA.
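
The abstract states that the RCSA module takes the mean and maximum of each row and column of the feature map as input. The sketch below computes only those four descriptor vectors for a single-channel map in pure Python. How the module turns them into attention weights is not described in the abstract and is deliberately left out; the function name and the toy feature map are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def row_column_descriptors(feature_map: Sequence[Sequence[float]]) -> Dict[str, List[float]]:
    """Mean and max along each row and each column of an H x W feature map."""
    rows = [list(r) for r in feature_map]
    cols = [list(c) for c in zip(*feature_map)]  # transpose to iterate over columns
    return {
        "row_mean": [sum(r) / len(r) for r in rows],
        "row_max": [max(r) for r in rows],
        "col_mean": [sum(c) / len(c) for c in cols],
        "col_max": [max(c) for c in cols],
    }


# Toy 2 x 3 feature map (assumed values).
fmap = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
desc = row_column_descriptors(fmap)
print(desc["row_mean"], desc["row_max"])  # [2.0, 5.0] [3.0, 6.0]
print(desc["col_mean"], desc["col_max"])  # [2.5, 3.5, 4.5] [4.0, 5.0, 6.0]
```

For an H x W map these descriptors hold 2(H + W) values rather than H x W, so their size grows linearly with the side lengths, which fits the abstract's claim of using global information with fewer parameters.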


[78] SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads cs.CVPDF

Tan Yu, Qian Qiao, Le Shen, Ke Zhou, Jincheng Hu

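TL;DR: This paper proposes SoulX-FlashHead, a unified framework for real-time, infinite-length, high-fidelity streaming video generation. Through Streaming-Aware Spatiotemporal Pre-training and Oracle-Guided Bidirectional Distillation, it addresses the trade-off between high fidelity and low latency in audio-driven portrait generation and achieves state-of-the-art performance on the HDTF and VFHQ benchmarks.

Details

Motivation: In audio-driven portrait generation, existing large-scale models are too computationally expensive, while lightweight models sacrifice holistic facial representation and temporal stability. The paper aims to balance high-fidelity visual quality with low-latency streaming generation.

Result: Achieves state-of-the-art performance on the HDTF and VFHQ benchmarks. The Lite variant reaches an inference speed of 96 FPS on a single NVIDIA RTX 4090.

Insight: The novel elements are: 1. Streaming-Aware Spatiotemporal Pre-training with a Temporal Audio Context Cache mechanism, which extracts robust features from short audio fragments; 2. Oracle-Guided Bidirectional Distillation, which uses ground-truth motion priors to provide precise physical guidance and mitigate error accumulation and identity drift in long-sequence autoregressive generation; 3. VividHead, a large-scale, high-quality dataset built to support robust training.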

Abstract: Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.


[79] SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning cs.CVPDF

Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang

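TL;DR: This paper proposes SpatialReward, a reward model that uses explicit spatial reasoning to close the perception gap caused by inadequate reward signals in online reinforcement learning for image editing. It significantly improves evaluation accuracy and online RL performance.

Details

Motivation: Online reinforcement learning for complex image editing is held back by the scarcity of reliable, fine-grained reward signals. Existing evaluators suffer from "Attention Collapse": they neglect cross-image comparisons and fine-grained details, which leads to inaccurate perception and miscalibrated scores.

Result: Achieves state-of-the-art performance on the MMRB2 and EditReward-Bench benchmarks and outperforms proprietary evaluators on the proposed MultiEditReward-Bench. Used as an online RL signal, it improves OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain obtained with GPT-4.1 (+0.45).

Insight: Anchoring reasoning to predicted edit regions grounds semantic judgments in pixel-level evidence and thereby improves evaluation accuracy. Spatial reasoning is essential for effective alignment in image editing, and the explicit spatial modeling approach can be borrowed to address similar perception gaps.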

Abstract: Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term “Attention Collapse,” where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.


[80] IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation cs.CVPDF

Zhufeng Xu, Xuan Gao, Feng-Lin Liu, Haoxian Zhang, Zhixue Fang

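TL;DR: This paper proposes IM-Animation, an implicit motion representation for identity-decoupled character animation. The method compresses per-frame motion into compact 1D motion tokens that capture motion semantics, and designs a temporally consistent, mask-token-based retargeting module, addressing the spatial mismatch of existing explicit methods and the identity leakage of implicit methods.

Details

Motivation: Among existing character animation methods, explicit methods (e.g., skeleton-based) struggle with spatial mismatches and varying body proportions, while implicit methods suffer from identity leakage and entanglement between motion and appearance. The paper aims to propose a new implicit motion representation that resolves these challenges.

Result: Extensive experiments show that IM-Animation achieves generative performance comparable or superior to state-of-the-art methods.

Insight: The novelty is a compact 1D implicit motion token representation that relaxes the spatial constraints of 2D representations and effectively prevents identity leakage, together with a mask-token retargeting module with a temporal training bottleneck that improves retargeting consistency. Viewed objectively, the staged training strategy also improves training efficiency and fidelity.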

Abstract: Recent progress in video diffusion models has markedly advanced character animation, which synthesizes videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeleton, DWPose or other explicit structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address the above challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes strict spatial constraints inherent in 2D representations and effectively prevents identity information leakage from the motion video. Furthermore, we design a temporally consistent mask token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source images’ motion and improving retargeting consistency. Our methodology employs a three-stage training strategy to enhance the training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the proposed IM-Animation achieve superior or competitive generative performance compared with state-of-the-art methods.


[81] Evaluating Object-Centric Models beyond Object Discovery cs.CV | cs.LGPDF

Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

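TL;DR: This paper proposes an evaluation framework for object-centric learning models that goes beyond conventional object discovery. It uses instruction-tuned vision-language models as evaluators to measure, across diverse visual question answering datasets, how well the models support complex reasoning tasks, and introduces a unified task and metric that jointly assess localization and representation usefulness.

Details

Motivation: Evaluation of existing object-centric learning models is largely limited to object discovery and simple reasoning tasks, which cannot fully measure how useful their representations are. Localization and representation usefulness are also assessed with disjoint metrics, making evaluation incomplete and inconsistent.

Result: The paper evaluates with instruction-tuned vision-language models on diverse VQA datasets, proposes a unified evaluation task and metric that jointly assess localization and representation usefulness, and includes a simple multi-feature reconstruction baseline as a reference point.

Insight: The novelty is to use instruction-tuned vision-language models for scalable evaluation of complex reasoning ability, and to design a unified evaluation framework that measures localization accuracy and representation quality together, overcoming the limitations of existing evaluation methods.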

Abstract: Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated regarding these goals. Instead, most prior work focuses on evaluating OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) They provide limited insights on the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.


[82] Fine-Grained Cat Breed Recognition with Global Context Vision Transformer cs.CV | cs.AI | eess.IVPDF

Mowmita Parvin Hera, Md. Shahriar Mahmud Kallol, Shohanur Rahman Nirob, Md. Badsha Bulbul, Jubayer Ahmed

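TL;DR: This paper proposes a fine-grained cat breed recognition method based on the Global Context Vision Transformer (GCViT), achieving high classification accuracy on a subset of the Oxford-IIIT Pet Dataset. Data augmentation is used to improve generalization. The GCViT-Tiny model reaches 92.00% test accuracy and 94.54% validation accuracy, supporting the effectiveness of transformer architectures for fine-grained image classification.

Details

Motivation: Cat breed recognition is difficult because of subtle differences in fur patterns, facial structure, and color. The paper aims to improve the accuracy of fine-grained image classification for this task.

Result: On a subset of the Oxford-IIIT Pet Dataset, the GCViT-Tiny model achieves a test accuracy of 92.00% and a validation accuracy of 94.54%, showing that transformer architectures are competitive on this task.

Insight: The paper applies the Global Context Vision Transformer (GCViT) to fine-grained cat breed recognition and combines it with data augmentation to improve generalization, offering an efficient solution for practical applications such as veterinary diagnostics and animal shelter management.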

Abstract: Accurate identification of cat breeds from images is a challenging task due to subtle differences in fur patterns, facial structure, and color. In this paper, we present a deep learning-based approach for classifying cat breeds using a subset of the Oxford-IIIT Pet Dataset, which contains high-resolution images of various domestic breeds. We employed the tiny variant of the Global Context Vision Transformer architecture (GCViT-Tiny) for cat breed recognition. To improve model generalization, we used extensive data augmentation, including rotation, horizontal flipping, and brightness adjustment. Experimental results show that the GCViT-Tiny model achieved a test accuracy of 92.00% and validation accuracy of 94.54%. These findings highlight the effectiveness of transformer-based architectures for fine-grained image classification tasks. Potential applications include veterinary diagnostics, animal shelter management, and mobile-based breed recognition systems. We also provide a Hugging Face demo at https://huggingface.co/spaces/bfarhad/cat-breed-classifier.
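
The abstract lists rotation, horizontal flipping, and brightness adjustment as augmentations but gives no parameters. The sketch below illustrates the three operations on a tiny grayscale image stored as nested lists. The 90-degree rotation (a simplification of arbitrary-angle rotation), the brightness factors, and all function names are assumptions for illustration rather than the authors' training pipeline.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale pixels in the range 0..255


def horizontal_flip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def adjust_brightness(img: Image, factor: float) -> Image:
    """Scale pixel intensities by `factor`, clamping to the valid 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def augment(img: Image, rng: random.Random) -> Image:
    """Apply each augmentation with probability 0.5 (assumed policy)."""
    if rng.random() < 0.5:
        img = horizontal_flip(img)
    if rng.random() < 0.5:
        img = rotate90(img)
    if rng.random() < 0.5:
        img = adjust_brightness(img, rng.uniform(0.8, 1.2))
    return img


img = [[10, 100], [200, 250]]
print(horizontal_flip(img))         # [[100, 10], [250, 200]]
print(rotate90(img))                # [[200, 10], [250, 100]]
print(adjust_brightness(img, 1.5))  # [[15, 150], [255, 255]]
```

In practice an image library would handle color channels and arbitrary rotation angles; the nested-list version only makes each transformation explicit.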


[83] LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing cs.CV | cs.LGPDF

Huimin Yan, Liang Bai, Xian Yang, Long Chen

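TL;DR: This paper proposes LLM-Guided Diagnostic Evidence Alignment (LGDEA), a medical vision-language pretraining method that targets the difficulty existing methods have in learning reliable diagnostic representations when paired data is limited. It uses large language models to extract key diagnostic evidence from radiology reports and builds a shared diagnostic evidence space for evidence-level cross-modal alignment, which lets it exploit abundant unpaired medical images and reports.

Details

Motivation: Existing CLIP-style medical vision-language pretraining methods typically rely on substantial paired data for global or local alignment. Global alignment is easily dominated by non-diagnostic information, while local alignment struggles to integrate key diagnostic evidence, making it hard to learn reliable diagnostic representations in medical scenarios with limited paired data.

Result: Extensive experiments show consistent and significant improvements on phrase grounding, image-text retrieval, and zero-shot classification, with results that even rival pretraining methods relying on substantial paired data.

Insight: The core innovation is shifting the pretraining objective toward finer-grained, evidence-level alignment that is more consistent with the medical diagnostic process, and using LLMs to extract diagnostic evidence from text to guide cross-modal learning. This reduces the dependence on precisely paired data and offers a new pretraining approach for data-scarce medical domains.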

Abstract: Most existing CLIP-style medical vision–language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image–text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.


[84] MUFASA: A Multi-Layer Framework for Slot Attention cs.CVPDF

Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

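TL;DR: MUFASA is a lightweight plug-and-play framework for slot attention-based unsupervised object-centric learning. It exploits semantic information from multiple feature layers of a vision transformer (ViT) encoder by computing slot attention across layers, and proposes a fusion strategy that aggregates the per-layer slots into a unified object-centric representation, improving the object segmentation performance of existing methods.

Details

Motivation: Current slot attention-based unsupervised object-centric learning methods obtain slot representations only from the last layer of a pretrained ViT, ignoring the rich semantic information encoded in other layers. MUFASA aims to make better use of this cross-layer latent semantic information to improve object segmentation.

Result: Integrating MUFASA into existing object-centric learning methods improves segmentation results across multiple datasets and sets a new state of the art (SOTA), while also improving training convergence with only minor inference overhead.

Insight: The novelty is a multi-layer feature utilization framework that aggregates semantic information from ViT layers through cross-layer slot attention and a fusion strategy. It is a lightweight, pluggable architectural improvement that strengthens representations for object segmentation.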

Abstract: Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.


[85] Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation cs.CV | cs.AIPDF

Hussni Mohd Zakir, Eric Tatt Wei Ho

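TL;DR: This paper proposes FSSDINO, a training-free few-shot semantic segmentation baseline that uses frozen DINOv3 features, class-specific prototypes, and Gram-matrix refinement, and performs on par with complex methods across multiple benchmarks. The study finds that DINOv3's intermediate layers contain better semantic representations than the standard last layer, but traditional heuristics cannot reliably identify them, revealing a "Semantic Selection Gap" in foundation models.

Details

Motivation: Explore the intrinsic few-shot semantic segmentation capabilities of self-supervised Vision Transformers such as DINOv3, and analyze the performance gap in semantic selection across features from different layers.

Result: On binary, multi-class, and cross-domain (CDFSS) few-shot segmentation benchmarks, FSSDINO uses only last-layer backbone features yet is competitive with specialized methods that involve complex decoders or test-time adaptation. An Oracle-guided layer analysis shows that intermediate layers hold better representations than the last layer, but current unsupervised and support-guided selection metrics cannot reliably reach this optimum.

Insight: The novelty is exposing a "Safest vs. Optimal" dilemma in DINOv3: the last layer is a strong baseline, yet intermediate layers hide higher-performing semantic features that traditional heuristics fail to select, which is the "Semantic Selection Gap". This offers a diagnostic perspective on exploiting the latent semantic capabilities of foundation models.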

Abstract: Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a “Safest vs. Optimal” dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a “Semantic Selection Gap” in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the “Last-Layer” as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
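To make the idea of class-specific prototypes over frozen features concrete, here is a minimal sketch. It assumes masked average pooling of support features and cosine-similarity thresholding on query features, which are common choices in prototype-based few-shot segmentation but are not confirmed as FSSDINO's exact procedure. The Gram-matrix refinement step is omitted, and the feature vectors and the 0.5 threshold are made-up illustrative values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def class_prototype(features, mask):
    """Average the support features whose mask entry is 1."""
    selected = [f for f, m in zip(features, mask) if m]
    dim = len(features[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]

def segment(query_features, prototype, threshold=0.5):
    """Label a query location 1 when it is close enough to the prototype."""
    return [1 if cosine(f, prototype) > threshold else 0 for f in query_features]

# Hypothetical 2-D patch features standing in for frozen backbone outputs.
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_mask = [1, 1, 0]
proto = class_prototype(support, support_mask)
query = [[0.8, 0.2], [0.1, 0.9], [1.0, 0.1]]
pred = segment(query, proto)
```

Because nothing is trained, the quality of the result depends entirely on which backbone layer supplies the features, which is the selection problem the paper analyzes.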


[86] VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation cs.CV | cs.AIPDF

Francesco Taioli, Shiping Yang, Sonia Raychaudhuri, Marco Cristani, Unnat Jain

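TL;DR: This paper proposes VISOR, a visual spatial object reasoning method for language-driven object navigation. It uses a compact 3B-parameter Vision-Language-Action (VLA) agent that performs explicit image-grounded reasoning to directly answer "Is this the target object?" and "Why should I take this action?", replacing conventional end-to-end models and stitched multi-model pipelines.

Details

Motivation: Existing methods suffer from poor generalization, lack of explainability, error propagation, high computational cost, and difficulty integrating reasoning into the navigation policy. This paper addresses these problems by performing human-like embodied reasoning that handles object recognition and action selection together.

Result: The paper claims improvements in explainability, generalization, and navigation efficiency, but the abstract does not mention specific benchmarks or quantitative results (such as comparisons with SOTA models).

Insight: The main novelty is a compact VLA agent architecture with an explicit image-grounded reasoning process in three stages: "think", "think summary", and "action". This improves action-level explainability and may enable stronger zero-shot generalization, while avoiding the computational overhead and error accumulation of complex multi-model pipelines.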

Abstract: Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To this end, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer “Is this the target object?” and “Why should I take this action?” The reasoning process unfolds in three stages: “think”, “think summary”, and “action”, yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset available upon acceptance.


[87] SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens cs.CVPDF

Xiaoyan Zhang, Zechen Bai, Haofan Wang, Yiren Song

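TL;DR: SIGMA is a unified post-training framework for diffusion transformers. By introducing selective multi-attribute tokens (such as style, content, subject, and identity tokens), it lets the model interpret and compose multiple visual conditions in an interleaved text-image sequence, enabling interleaved multi-condition generation.

Details

Motivation: Existing unified models such as Bagel support only single-condition inputs and lack the flexibility to synthesize results from multiple heterogeneous sources. SIGMA aims to overcome this limitation for interleaved multi-condition generation.

Result: After post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.

Insight: The novelty is the selective multi-attribute token mechanism, which lets the model flexibly handle interleaved multimodal inputs, supports compositional editing and fine-grained multimodal alignment, and strengthens the multi-condition generation ability of diffusion transformers.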

Abstract: Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.


[88] ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention cs.CV | cs.CLPDF

Wenjie Liu, Hao Wu, Xin Qiu, Yingqi Fan, Yihan Zhang

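TL;DR: This paper proposes ViCA (Vision-only Cross-Attention), an efficient multimodal large language model architecture. Visual tokens are handled only through sparse cross-attention at selected layers, avoiding dense self-attention over visual tokens at every Transformer layer and greatly reducing computational overhead.

Details

Motivation: Modern multimodal large language models adopt a unified attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. This paper revisits whether such dense visual processing is necessary and finds that projected visual embeddings are already well aligned with the language space, while effective vision-language interaction occurs in only a few layers.

Result: Extensive evaluation across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines shows that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving a superior performance-efficiency trade-off. Inference is over 3.5x faster in single-batch settings and over 10x faster in multi-batch settings.

Insight: The novelty is a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers and interact with text solely through sparse cross-attention at selected layers, significantly cutting computational cost. The method is orthogonal to token pruning and can be combined with it for further efficiency gains, and it provides a regular, hardware-friendly inference pipeline.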

Abstract: Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.


[89] TeleBoost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation cs.CV | cs.AIPDF

Yuanzhi Liang, Xuan’er Wu, Yirui Liu, Yijie Fang, Yizhen Fan

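TL;DR: This paper presents TeleBoost, a systematic post-training framework for turning pretrained video generation models into high-fidelity, controllable, and robust production-grade models. It integrates supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack, targeting practical constraints of video generation such as high computational cost, temporally compounding failure modes, and heterogeneous, uncertain feedback.

Details

Motivation: Address the challenges of instruction following, controllability, and long-horizon robustness when converting a pretrained video generator into a production-grade model, along with video-specific practical problems: high computational cost, temporally compounding errors, and poor feedback quality.

Result: The abstract reports no specific quantitative benchmark results or SOTA comparisons, but claims the framework effectively improves perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization.

Insight: The novelty is treating post-training as a staged, diagnostic-driven systematic process rather than a collection of isolated tricks, providing a stable, extensible, and effective optimization blueprint. Its systematic integration of supervised learning, reinforcement learning, and preference learning into a unified constrained optimization stack is worth borrowing.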

Abstract: Post-training is the decisive step for converting a pretrained video generator into a production-oriented model that is instruction-following, controllable, and robust over long temporal horizons. This report presents a systematic post-training framework that organizes supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack. The framework is designed around practical video-generation constraints, including high rollout cost, temporally compounding failure modes, and feedback that is heterogeneous, uncertain, and often weakly discriminative. By treating optimization as a staged, diagnostic-driven process rather than a collection of isolated tricks, the report summarizes a cohesive recipe for improving perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization. The resulting framework provides a clear blueprint for building scalable post-training pipelines that remain stable, extensible, and effective in real-world deployment settings.


[90] Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning cs.CV | cs.AIPDF

Hulingxiao He, Zijun Geng, Yuxin Peng

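TL;DR: This paper proposes Fine-R1, a multimodal large language model (MLLM) designed for fine-grained visual recognition (FGVR). Using Chain-of-Thought Supervised Fine-tuning and Triplet Augmented Policy Optimization, the model needs only 4-shot training to surpass existing general MLLMs, reasoning MLLMs, and contrastive CLIP models in recognizing both seen and unseen sub-categories.

Details

Motivation: Address the problems that multimodal large language models perform poorly on fine-grained visual recognition, need large amounts of annotated data, overfit to seen categories, and generalize poorly.

Result: On fine-grained visual recognition, with only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and contrastive CLIP models on both seen and unseen sub-categories.

Insight: The novelty is combining chain-of-thought reasoning (a fine-grained CoT dataset covering visual analysis, candidate sub-categories, comparison, and prediction) with Triplet Augmented Policy Optimization (intra-class and inter-class augmentation). With very few labeled samples, this turns a general MLLM into a strong open-world classifier with better discriminative ability and generalization.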

Abstract: Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of “visual analysis, candidate sub-categories, comparison, and prediction”, which transitions the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise for knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026.


[91] HistoMet: A Pan-Cancer Deep Learning Framework for Prognostic Prediction of Metastatic Progression and Site Tropism from Primary Tumor Histopathology cs.CVPDF

Yixin Chen, Ziyu Su, Lingbin Meng, Elshad Hasanov, Wei Chen

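TL;DR: This paper proposes HistoMet, a deep learning framework that predicts metastatic progression and metastatic site tropism from histopathology whole-slide images of primary tumors. It uses decision-aware, concept-aligned multiple instance learning with a two-stage prediction pipeline (first assessing metastatic risk, then predicting the metastatic site for high-risk cases) to mirror the clinical decision process, and is validated on a multi-institutional pan-cancer cohort of 6504 patients.

Details

Motivation: Metastatic progression is the leading cause of cancer-related death, yet predicting directly from histopathology whether a primary tumor will metastasize, and where, remains a fundamental challenge. Existing computational methods typically treat metastatic status or site prediction as isolated tasks and do not explicitly model the sequential clinical decision process of assessing metastatic risk first and then performing downstream site-specific evaluation.

Result: Under clinically relevant high-sensitivity screening settings (95% sensitivity), HistoMet significantly reduces downstream workload while maintaining high metastatic risk recall. For cases that did metastasize, HistoMet achieves a macro F1 of 74.6 (standard deviation 1.3) and a macro one-vs-rest AUC of 92.1, indicating robust and deployable predictive ability.

Insight: The core innovation is a decision-aware two-stage prediction framework that explicitly models the clinical decision workflow and integrates linguistically defined and data-adaptive metastatic concepts, using a pretrained pathology vision-language model to guide representation learning and improve clinical interpretability. This offers a structured and interpretable paradigm for prognostic prediction from histopathology images.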

Abstract: Metastatic progression remains the leading cause of cancer-related mortality, yet predicting whether a primary tumor will metastasize and where it will disseminate directly from histopathology remains a fundamental challenge. Although whole-slide images (WSIs) provide rich morphological information, prior computational pathology approaches typically address metastatic status or site prediction as isolated tasks, and do not explicitly model the clinically sequential decision process of metastatic risk assessment followed by downstream site-specific evaluation. To address this research gap, we present a decision-aware, concept-aligned MIL framework, HistoMet, for prognostic metastatic outcome prediction from primary tumor WSIs. Our proposed framework adopts a two-module prediction pipeline in which the likelihood of metastatic progression from the primary tumor is first estimated, followed by conditional prediction of metastatic site for high-risk cases. To guide representation learning and improve clinical interpretability, our framework integrates linguistically defined and data-adaptive metastatic concepts through a pretrained pathology vision-language model. We evaluate HistoMet on a multi-institutional pan-cancer cohort of 6504 patients with metastasis follow-up and site annotations. Under clinically relevant high-sensitivity screening settings (95% sensitivity), HistoMet significantly reduces downstream workload while maintaining high metastatic risk recall. Conditional on metastatic cases, HistoMet achieves a macro F1 of 74.6 with a standard deviation of 1.3 and a macro one-vs-rest AUC of 92.1. These results demonstrate that explicitly modeling clinical decision structure enables robust and deployable prognostic prediction of metastatic progression and site tropism directly from primary tumor histopathology.
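The two-stage decision structure described above (risk gate at a fixed sensitivity, then site prediction only for flagged cases) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the validation scores, the site names, and the rule for picking the threshold are hypothetical.

```python
import math

def threshold_for_sensitivity(scores, labels, target=0.95):
    """Largest score threshold that still flags at least `target` of the positives."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    needed = math.ceil(target * len(positives))
    return positives[needed - 1]

def two_stage_predict(risk, site_probs, threshold):
    """Stage 1 gates on metastatic risk; stage 2 picks a site only for flagged cases."""
    if risk < threshold:
        return None  # low risk: no site-specific evaluation
    return max(site_probs, key=site_probs.get)

# Hypothetical validation risk scores with ground-truth metastasis labels.
val_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1]
val_labels = [1, 1, 1, 1, 0, 0, 0]
thr = threshold_for_sensitivity(val_scores, val_labels)

sites = {"liver": 0.2, "lung": 0.5, "bone": 0.3}  # hypothetical site probabilities
low_risk = two_stage_predict(0.25, sites, thr)
high_risk = two_stage_predict(0.7, sites, thr)
spared = sum(1 for s in val_scores if s < thr)  # cases that skip stage two
```

In this toy example the threshold lands at 0.3, so two of the seven validation cases skip the second stage. That skipped share is the kind of downstream workload reduction the abstract refers to, although the paper's own procedure for measuring it may differ.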


[92] AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning cs.CV | cs.AIPDF

Binxiao Xu, Junyu Feng, Xiaopeng Lin, Haodong Li, Zhiyuan Feng

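TL;DR: This paper proposes AD-MIR, a framework that uses structured reasoning to bridge the cognitive gap between pixel-level perception and high-level marketing logic in advertising video understanding. It has a two-stage architecture: Structure-Aware Memory Construction first converts raw video into a structured database, and a Structured Reasoning Agent then iteratively decomposes the narrative to infer implicit persuasion tactics, validating them with an evidence-based self-correction mechanism.

Details

Motivation: Address the difficulty existing agents have in linking pixel-level perception to high-level marketing persuasion strategies in advertising video understanding.

Result: Achieves state-of-the-art performance on the AdsQA benchmark, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict accuracy and 9.5% in relaxed accuracy.

Insight: The novelty is framing advertising intent decoding explicitly as a two-stage structured reasoning process and introducing an evidence-based self-correction mechanism, with emphasis on grounding abstract marketing strategies in pixel-level evidence. This offers a reusable paradigm for multimodal understanding tasks that require complex logical reasoning.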

Abstract: Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at https://github.com/Little-Fridge/AD-MIR.


[93] Uncovering Modality Discrepancy and Generalization Illusion for General-Purpose 3D Medical Segmentation cs.CVPDF

Yichi Zhang, Feiyang Xiao, Le Xue, Wenbo Zhang, Gang Feng

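TL;DR: This paper builds the UMD dataset (490 whole-body PET/CT and 464 whole-body PET/MRI scans, about 675k 2D images and 12k 3D organ annotations) and comprehensively evaluates representative 3D segmentation foundation models. It finds that existing models exhibit a significant modality discrepancy and a generalization illusion when moving from structural imaging (such as CT/MRI) to functional imaging (such as PET): their real-world efficacy falls well below literature-reported benchmarks, showing that current models are far from truly general-purpose.

Details

Motivation: Emerging 3D medical foundation models are envisioned as general-purpose tools, but their validation is largely confined to regional and structural imaging, leaving significant modality discrepancies unexplored. The paper aims to expose the limitations of these models in real-world use through rigorous evaluation.

Result: The evaluation shows systematic failures when transitioning from structural domains (such as CT/MRI) to functional domains (such as PET), with a significant gap between real-world efficacy and literature-reported benchmarks, highlighting the weakness of current models in cross-modality generalization.

Insight: The novelty is building a large-scale multi-modal dataset (UMD) and using controlled intra-subject comparisons to systematically reveal, for the first time, the generalization illusion of 3D medical foundation models under modality discrepancy. Viewed objectively, this underscores that future research needs to shift toward multi-modal training and evaluation paradigms to develop truly modality-agnostic medical foundation models.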

Abstract: While emerging 3D medical foundation models are envisioned as versatile tools that offer general-purpose capabilities, their validation remains largely confined to regional and structural imaging, leaving a significant modality discrepancy unexplored. To provide a rigorous and objective assessment, we curate the UMD dataset comprising 490 whole-body PET/CT and 464 whole-body PET/MRI scans (~675k 2D images, ~12k 3D organ annotations) and conduct a comprehensive evaluation of representative 3D segmentation foundation models. Through intra-subject controlled comparisons of paired scans, we isolate imaging modality as the primary independent variable to evaluate model robustness in real-world applications. Our evaluation reveals a stark discrepancy between literature-reported benchmarks and real-world efficacy, particularly when transitioning from structural to functional domains. Such systemic failures underscore that current 3D foundation models are far from achieving truly general-purpose status, necessitating a paradigm shift toward multi-modal training and evaluation to bridge the gap between idealized benchmarking and comprehensive clinical utility. This dataset and analysis establish a foundation for future research to develop truly modality-agnostic medical foundation models.


[94] From Dead Pixels to Editable Slides: Infographic Reconstruction into Native Google Slides via Vision-Language Region Understanding cs.CV | cs.AIPDF

Leonardo Gonzalez

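TL;DR: This paper presents Images2Slides, an API-driven pipeline that automatically reconstructs static infographics (PNG/JPG) into native, editable Google Slides slides. The system uses a vision-language model (VLM) to extract a region-level specification, maps pixel geometry into slide coordinates, and recreates elements through the Google Slides batch update API, unlocking image content for updating, localization, and reuse.

Details

Motivation: Once infographic content is locked into pixels, updating, localizing, and reusing it is expensive. The goal is automated conversion from "dead pixels" to editable slides.

Result: On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, Images2Slides achieves an overall element recovery rate of 0.989±0.057 (text: 0.985±0.083, images: 1.000±0.000), a mean text transcription error of CER 0.033±0.149 on text regions, and a mean layout fidelity of IoU 0.364±0.161 for text regions and 0.644±0.131 for image regions.

Insight: The novelty is a model-agnostic pipeline built on a common JSON region schema and deterministic postprocessing that supports multiple VLM backends, tightly integrating VLM region understanding with a commercial presentation software API to achieve high-fidelity automated reconstruction from pixels to editable documents. The system also identifies practical engineering challenges such as text size calibration and non-uniform backgrounds, which guide future work.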

Abstract: Infographics are widely used to communicate information with a combination of text, icons, and data visualizations, but once exported as images their content is locked into pixels, making updates, localization, and reuse expensive. We describe Images2Slides, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements using the Google Slides batch update API. The system is model-agnostic and supports multiple VLM backends via a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, Images2Slides achieves an overall element recovery rate of 0.989 ± 0.057 (text: 0.985 ± 0.083, images: 1.000 ± 0.000), with mean text transcription error CER = 0.033 ± 0.149 and mean layout fidelity IoU = 0.364 ± 0.161 for text regions and 0.644 ± 0.131 for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.
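The geometry step (pixel boxes to slide coordinates) and the IoU layout metric can be sketched as below. The sketch assumes a 16:9 page of 9,144,000 × 5,143,500 EMU (914,400 EMU per inch), plus uniform scaling with centering. Both are assumptions for illustration, not details confirmed by the abstract, and the box values are made up.

```python
SLIDE_W_EMU = 9_144_000  # assumed page width: 10 in at 914,400 EMU per inch
SLIDE_H_EMU = 5_143_500  # assumed page height: 5.625 in

def pixel_box_to_emu(box, image_w, image_h,
                     slide_w=SLIDE_W_EMU, slide_h=SLIDE_H_EMU):
    """Map an (x, y, w, h) pixel box to EMU using uniform scaling and centering."""
    scale = min(slide_w / image_w, slide_h / image_h)
    off_x = (slide_w - image_w * scale) / 2
    off_y = (slide_h - image_h * scale) / 2
    x, y, w, h = box
    return tuple(round(v) for v in
                 (off_x + x * scale, off_y + y * scale, w * scale, h * scale))

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

# A hypothetical region detected in a 1920x1080 infographic.
emu_box = pixel_box_to_emu((192, 108, 960, 540), 1920, 1080)
overlap = iou((0, 0, 10, 10), (5, 0, 10, 10))
```

Uniform scaling keeps the aspect ratio of each region, so an image whose shape differs from the page is letterboxed rather than stretched. IoU penalizes any offset or size mismatch between the predicted and ground-truth boxes.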


[95] Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making cs.CV | cs.AI | cs.LG | cs.ROPDF

Ross Greer, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Giovanni Tapia Lopez

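TL;DR: This paper proposes the extended "looking-and-listening inside-and-outside" (L-LIO) multimodal AI framework, which fuses audio and visual sensors to enhance driver state assessment and vehicle environment understanding and thereby improve driving safety.

Details

Motivation: The existing "looking-in-looking-out" (LILO) framework relies mainly on visual information. This paper argues that the audio modality can serve as an additional information source for a more complete understanding of the driver, the passengers, and the environment outside the vehicle, addressing cases where vision systems lack sufficient information in complex scenes.

Result: Pilot experiments show that audio provides safety-relevant insights in nuanced or context-rich scenarios, for example classifying potential impairment states (such as intoxication) from driver speech, but challenges remain, including ambient noise interference and privacy.

Insight: The novelty is systematically integrating the audio modality into intelligent vehicle systems and using multimodal sensor fusion (especially audio combined with vision) to strengthen safety decisions, opening new paths for safety intervention.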

Abstract: The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., “turn after that red building”) to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.


[96] Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning cs.CV | cs.AI | cs.LG | cs.ROPDF

Ross Greer, Maitrayee Keskar, Angel Martinez-Sanchez, Parthib Roy, Shashank Shriram

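TL;DR: This paper studies the use of vision-language models for safety assessment and decision-making in autonomous driving. Through three system-level use cases, it explores how vision-language representations can be integrated into perception, prediction, and planning pipelines to support driving scene safety assessment.

Details

Motivation: Leverage the strong representation learning of vision-language models, which align visual observations with natural language concepts, to open new opportunities for semantic reasoning in safety-critical autonomous driving and to address the weaknesses of traditional methods in semantic understanding and generalization.

Result: On the Waymo Open Dataset, directly conditioning a Transformer-based trajectory planning framework on global vision-language embeddings does not improve trajectory accuracy. On the doScenes dataset, natural language instructions grounded in visual scene elements, used as behavioral constraints, suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios.

Insight: Contributions include a lightweight, category-agnostic hazard screening approach that uses CLIP image-text similarity to produce a low-latency semantic hazard signal; evidence that representation-task alignment matters and that task-informed extraction methods are needed; and a demonstration that natural language as an explicit behavioral constraint can improve planning safety. The core insight is that realizing this potential requires careful system design and structured grounding rather than direct feature injection.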

Abstract: Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
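One way to turn image-text similarity into a hazard signal, as in the first use case, is to compare an image embedding against hazard and benign prompt embeddings and report the softmax mass on the hazard side. The sketch below is an assumption-laden illustration and not the paper's method: the embeddings are hand-written 2-D stand-ins for real CLIP outputs, and the temperature of 0.07 and the prompt grouping are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hazard_score(image_emb, hazard_embs, benign_embs, temperature=0.07):
    """Share of softmax mass assigned to hazard prompts across all prompts."""
    logits_h = [cosine(image_emb, t) / temperature for t in hazard_embs]
    logits_b = [cosine(image_emb, t) / temperature for t in benign_embs]
    peak = max(logits_h + logits_b)  # subtract the max for numerical stability
    mass_h = sum(math.exp(v - peak) for v in logits_h)
    mass_b = sum(math.exp(v - peak) for v in logits_b)
    return mass_h / (mass_h + mass_b)

# Hypothetical text embeddings for one hazard prompt and one benign prompt.
hazard_prompts = [[1.0, 0.0]]
benign_prompts = [[0.0, 1.0]]

risky = hazard_score([0.9, 0.1], hazard_prompts, benign_prompts)
safe = hazard_score([0.1, 0.9], hazard_prompts, benign_prompts)
neutral = hazard_score([1.0, 1.0], hazard_prompts, benign_prompts)
```

Because the score needs only one image embedding and a fixed set of precomputed text embeddings, it involves no object detector or question-answering pass, which is consistent with the low-latency framing in the abstract.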


[97] Process-of-Thought Reasoning for Videos cs.CV | cs.AIPDF

Jusheng Zhang, Kaitong Cai, Jian Wang, Yongsen Zheng, Kwok-Yan Lam

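TL;DR: This paper proposes Process-of-Thought (PoT) reasoning for video understanding, a framework that makes reasoning explicit by decomposing video inference into a sequence of lightweight, verifiable steps. It interleaves three steps (temporal evidence selection, step-wise state updates, and constrained answer synthesis), has a model-agnostic plug-and-play design, and can be combined with external tools for evidence-augmented reasoning.

Details

Motivation: Address the challenge that video understanding requires temporal grounding and multi-step reasoning over long, noisy observations, and make the reasoning process more explicit and traceable.

Result: Extensive experiments on standard video reasoning tasks show that PoT consistently improves factual correctness and temporal grounding while providing interpretable reasoning traces.

Insight: The novelty is structuring video reasoning as a chain of verifiable intermediate steps and introducing a unified representation that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations.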

Abstract: Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.


[98] PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification cs.CV | cs.AI | cs.LG | cs.MMPDF

Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

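TL;DR: This paper proposes PAND (Prompt-Aware Neighborhood Distillation), a two-stage knowledge distillation framework for lightweight fine-grained visual classification (FGVC). It decouples semantic calibration from structural transfer: Prompt-Aware Semantic Calibration first generates adaptive semantic anchors, and a neighborhood-aware structural distillation strategy then constrains the student network's local decision structure, transferring knowledge from large vision-language models (VLMs) to lightweight networks more effectively.

Details

Motivation: In fine-grained visual classification, distilling knowledge from large vision-language models into lightweight networks is challenging, mainly because existing methods rely on fixed prompts and global alignment and struggle to capture fine-grained semantic differences.

Result: PAND outperforms existing state-of-the-art methods on four FGVC benchmarks. In particular, a ResNet-18 student reaches 76.09% accuracy on CUB-200, 3.4% higher than the strong baseline VL2Lite.

Insight: The novelty is decoupling distillation into semantic calibration and structural transfer, and introducing prompt-aware semantic anchor generation and neighborhood-aware structural distillation. This helps transfer the fine-grained knowledge in VLMs more precisely and could be applied to other vision tasks that hinge on subtle differences.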

Abstract: Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student’s local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.


[99] Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CVPDF

Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker

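TL;DR: This paper proposes Rolling Sink, which addresses the performance gap that arises when autoregressive video diffusion models are trained on limited horizons but tested on open-ended ones. The method needs no additional training and scales video synthesis to ultra-long durations.

Details

Motivation: Autoregressive video diffusion models are trained on limited horizons, and their visual quality degrades rapidly when they are tested on open-ended horizons, i.e., a train-test gap. Because open-ended testing can far exceed any finite training window and long-video training is computationally expensive, the paper seeks a training-free way to bridge this gap.

Result: Built on an autoregressive model trained only on 5-second clips, Rolling Sink scales video synthesis at test time to ultra-long durations (e.g., 5-30 minutes at 16 FPS) while keeping subjects consistent, colors stable, structures coherent, and motion smooth. Extensive experiments show better long-horizon visual fidelity and temporal consistency than SOTA baselines.

Insight: The core contribution is a systematic analysis of autoregressive cache maintenance, which leads to the training-free Rolling Sink method. It bridges the gap between limited training horizons and open-ended testing horizons and enables high-quality ultra-long video generation.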

Abstract: Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To this end, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/


[100] Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing cs.CVPDF

Jayawant Bodagala, Balaji Bodagala

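TL;DR: This paper proposes UCATSC, a model-based traffic signal control system that models signal control at an intersection as a constrained stochastic decision process and accounts for the uncertainty of vision-based perception. The system aims to reduce traffic delay and emissions while preventing safety-critical errors, and it provides interpretable control policies based on explicit models.

Details

Motivation: Real-world deployment of adaptive traffic signal control remains limited, mainly because of the uncertainty of vision-based perception, implicit safety, and non-interpretable control policies that are learned and validated mostly in simulation.

Result: The abstract reports no specific quantitative results or benchmark comparisons.

Insight: The novelty is that hard constraints on safety and starvation avoidance are predicted and enforced during counterfactual rollouts in belief space, rather than having safety learned through reward shaping as in reinforcement learning methods. This yields more reliable and interpretable control.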

Abstract: Real-world deployment of adaptive traffic signal control remains limited to date, owing to the uncertainty associated with vision-based perception, implicit safety, and non-interpretable control policies learned and validated mainly in simulation. In this paper, we introduce UCATSC, a model-based traffic signal control system that models traffic signal control at an intersection as a constrained stochastic decision process under partial observability, taking into account the uncertainty associated with vision-based perception. Unlike reinforcement learning methods that learn to predict safety using reward shaping, UCATSC predicts and enforces hard constraints related to safety and starvation prevention during counterfactual rollouts in belief space. The system is designed to reduce traffic delay and emissions while preventing safety-critical errors and providing interpretable control policy outputs based on explicit models.


[101] VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos cs.CV | cs.AIPDF

Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su

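TL;DR: This paper proposes VideoTemp-o3, a unified agentic thinking-with-videos framework that addresses the loss of key visual evidence caused by uniform frame sampling in long-video understanding. The framework jointly models video grounding and question answering and, through a unified masking mechanism and a reinforcement learning reward design, achieves strong localization, on-demand clipping, and refinement of localization results.

Details

Motivation: Existing agentic thinking-with-videos methods are inefficient, localize weakly, and follow rigid workflows. The goal is to capture key segments in long videos more accurately and efficiently.

Result: Experiments show significant gains on both long-video understanding and grounding, with systematic evaluation on a benchmark built by the authors.

Insight: Contributions include: 1) a unified framework that jointly models video grounding and question answering; 2) a unified masking mechanism in the supervised fine-tuning stage that balances exploration and noise suppression; 3) dedicated rewards in the reinforcement learning stage that mitigate reward hacking; 4) on the data side, high-quality long-video grounded QA data and a corresponding benchmark.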

Abstract: In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.


[102] Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures cs.CVPDF

Simiao Ren

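TL;DR: This paper presents the first large-scale cross-paradigm benchmark of vision-language models (VLMs) against specialized age estimation architectures, covering 34 models and 8 standard datasets. The key finding is that zero-shot VLMs significantly outperform most specialized models on facial age estimation, with a mean absolute error (MAE) of 5.65 years versus 9.88 years for specialized models. The best VLM (Gemini 3 Flash Preview) outperforms the best specialized model (MiVOLO) by 15%. The study also analyzes performance at the 18-year age verification threshold and shows that all models perform worst on extreme age groups (under 5 and 65+).

Details

Motivation: Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern VLMs with specialized age estimation architectures. This paper fills that gap by evaluating general-purpose VLMs against specialized models on age estimation.

Result: On 8 standard datasets (1,100 test images per model in total), zero-shot VLMs achieve an average MAE of 5.65 years, significantly better than the 9.88-year average of non-LLM specialized models. The best VLM (Gemini 3 Flash Preview, MAE 4.32) improves on the best specialized model (MiVOLO, MAE 5.10) by 15%. In age verification at the 18-year threshold, non-LLM models misclassify minors as adults at rates of 60-100%, compared with only 13-25% for VLMs. All models perform worst on extreme age groups (under 5 and 65+).

Insight: The main contribution is the first large-scale cross-paradigm benchmark for age estimation, which challenges the assumption that task-specific architectures are necessary. The core insight is that general-purpose VLMs in the zero-shot setting outperform most specialized models, which suggests the field should move toward distilling VLM capabilities into efficient specialized models rather than only designing new specialized architectures. The study also shows that coarse age binning (8-9 classes) severely hurts performance (MAE above 13 years), and that combining face and body features, as MiVOLO does, is a key factor in improving specialized models.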

Abstract: Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern vision-language models (VLMs) against specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models (22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs) across 8 standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB) totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average MAE of 5.65 years compared to 9.88 for non-LLM models. The best VLM (Gemini 3 Flash Preview, MAE 4.32) outperforms the best non-LLM model (MiVOLO, MAE 5.10) by 15%. Only MiVOLO, which uniquely combines face and body features via Vision Transformers, competes with VLMs. We further analyze age verification at the 18-year threshold, revealing that non-LLM models exhibit 60–100% false adult rates on minors while VLMs achieve 13–25%, and demonstrate that coarse age binning (8–9 classes) consistently degrades MAE beyond 13 years. Our stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (<5 and 65+). These findings challenge the assumption that task-specific architectures are necessary for age estimation and suggest that the field should redirect toward distilling VLM capabilities into efficient specialized models.
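The headline metrics in this entry are simple to state in code. The following is a minimal sketch, assuming ages are given as plain numbers and that a "false adult" is a minor (true age below the threshold) whose predicted age is at or above it. The function names are hypothetical and are not taken from the paper's code.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages, in years."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_ages, predicted_ages)) / len(true_ages)


def false_adult_rate(true_ages, predicted_ages, threshold=18):
    """Fraction of true minors whose predicted age is at or above the threshold."""
    minors = [(t, p) for t, p in zip(true_ages, predicted_ages) if t < threshold]
    if not minors:
        return None  # undefined when the sample contains no minors
    return sum(1 for _, p in minors if p >= threshold) / len(minors)


def relative_improvement(baseline_mae, new_mae):
    """Relative MAE reduction of new_mae with respect to baseline_mae."""
    return (baseline_mae - new_mae) / baseline_mae
```

With the MAE values quoted in the abstract, `relative_improvement(5.10, 4.32)` is about 0.153, which matches the reported 15% margin.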


[103] Open-Text Aerial Detection: A Unified Framework For Aerial Visual Grounding And Detection cs.CVPDF

Guoting Wei, Xia Yuan, Yang Zhou, Haizhao Jing, Yu Liu

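TL;DR: This paper proposes OTA-Det, the first framework to unify Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG). A task reformulation strategy unifies objectives and supervision, and a dense semantic alignment strategy enables multi-granularity semantic understanding, from holistic expressions to individual attributes. Built on the RT-DETR architecture and extended to open-text detection, it reaches SOTA performance on six benchmarks while keeping real-time inference at 34 FPS.

Details

Motivation: Existing OVAD methods are limited to coarse category-level semantics, while RSVG methods are structurally limited to single-target localization. Neither supports rich semantic understanding and multi-target detection at the same time.

Result: State-of-the-art performance on six benchmarks spanning OVAD and RSVG, with real-time inference at 34 FPS.

Insight: Contributions include: 1) a task reformulation strategy that unifies the training objectives and supervision of the two paradigms; 2) a dense semantic alignment strategy that establishes multi-granularity correspondence from holistic expressions to attributes; 3) an efficient extension of RT-DETR to open-text detection that balances accuracy and speed.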

Abstract: Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG) have emerged as two key paradigms for aerial scene understanding. However, each paradigm suffers from inherent limitations when operating in isolation: OVAD is restricted to coarse category-level semantics, while RSVG is structurally limited to single-target localization. These limitations prevent existing methods from simultaneously supporting rich semantic understanding and multi-target detection. To address this, we propose OTA-Det, the first unified framework that bridges both paradigms in a cohesive architecture. Specifically, we introduce a task reformulation strategy that unifies task objectives and supervision mechanisms, enabling joint training across datasets from both paradigms with dense supervision signals. Furthermore, we propose a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes, enabling fine-grained semantic understanding. To ensure real-time efficiency, OTA-Det builds upon the RT-DETR architecture, extending it from closed-set detection to open-text detection by introducing several highly efficient modules. It achieves state-of-the-art performance on six benchmarks spanning both OVAD and RSVG tasks while maintaining real-time inference at 34 FPS.


[104] SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models cs.CV | cs.AI | cs.CLPDF

Weijiang Lv, Yaoxuan Feng, Xiaobo Xia, Jiayu Wang, Yan Jing

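TL;DR: This paper introduces SPD-Faith Bench, a benchmark for diagnosing faithfulness problems in the chain-of-thought reasoning of multimodal large language models, and reveals two systematic failure modes: perceptual blindness and perception-reasoning dissociation. Based on this analysis, the authors propose SAGE, a training-free framework that calibrates visual evidence, improves visual routing, and aligns reasoning with perception.

Details

Motivation: Prior work has focused mainly on perceptual hallucinations, leaving faithfulness at the chain-of-thought reasoning level underexplored. To isolate the influence of linguistic priors, a diagnostic benchmark based on fine-grained image difference reasoning is needed to force explicit visual comparison.

Result: Evaluation on SOTA multimodal large language models reveals systematic failure modes. The proposed SAGE framework improves visual routing and alignment, which makes the reasoning more faithful.

Insight: The novelty lies in building SPD-Faith Bench, a benchmark dedicated to diagnosing reasoning faithfulness, and in tracing the failures to decaying visual attention and representation shifts in the residual stream. SAGE improves performance without training by calibrating visual evidence, and the work stresses the importance of evaluating faithfulness explicitly, beyond answer correctness.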

Abstract: Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes: perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a train-free visual evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and code are available at https://github.com/Johanson-colab/SPD-Faith-Bench.


[105] VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping cs.CVPDF

Sanoojan Baliah, Yohan Abeysinghe, Rusiru Thushara, Khan Muhammad, Abhinav Dhall

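TL;DR: VFace is a training-free, plug-and-play method for diffusion-based video face swapping. Through three core techniques (frequency-spectrum attention interpolation, target structure guidance, and flow-guided attention temporal smoothing), it significantly improves the temporal consistency and visual fidelity of generated videos without modifying the underlying diffusion model or requiring additional training.

Details

Motivation: Existing diffusion-based frame-wise face swapping methods often produce temporal inconsistencies in video. The goal is to fix this while keeping visual quality high, with a modular solution that needs no additional training or fine-tuning.

Result: Extensive experiments show that the method significantly improves temporal consistency and visual fidelity, offering a practical and modular solution for video face swapping.

Insight: The novelty lies in three training-free techniques: frequency-spectrum attention interpolation to preserve identity characteristics, plug-and-play attention injection for target structure guidance, and a flow-guided attention temporal smoothing mechanism that enforces spatiotemporal coherence. The core idea is to introduce carefully designed guidance at inference time, which solves a video-specific generation problem without changing the pretrained model. This is an efficient and flexible paradigm.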

Abstract: We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation and keep key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features of the target frame with the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.


[106] Geometry-Aware Rotary Position Embedding for Consistent Video World Model cs.CVPDF

Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang

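TL;DR: This paper proposes ViewRope, a geometry-aware rotary position embedding that addresses long-term spatial consistency in video world models. It injects camera-ray directions directly into the self-attention layers of a video transformer in place of conventional screen-space positional embeddings, giving the model an inductive bias for 3D consistency. The paper also proposes a geometry-aware frame-sparse attention mechanism for efficiency and introduces the ViewBench diagnostic suite for evaluation.

Details

Motivation: Current predictive world models that simulate future observations under explicit camera control lack spatial persistence: they fail to keep scene structure stable over long trajectories and often hallucinate details when the camera revisits previously observed locations. The authors argue that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency.

Result: Evaluated on the proposed ViewBench diagnostic suite, ViewRope substantially improves long-term consistency while reducing computational cost.

Insight: The core contribution is to replace conventional positional embeddings based on pixel locality with attention parameterized by relative ray geometry (ViewRope), which gives the model a native inductive bias for 3D consistency. In addition, geometry-aware frame-sparse attention uses geometric cues to attend selectively to relevant historical frames, improving efficiency while preserving memory consistency. The ViewBench suite provides a dedicated benchmark for evaluating spatial consistency.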

Abstract: Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware Frame-Sparse Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
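To make "camera-ray directions" and "relative ray geometry" concrete, here is a minimal sketch of how a per-pixel ray can be derived under a standard pinhole camera model and compared across views. It only illustrates the geometric quantity involved. How ViewRope turns such rays into a rotary embedding is not described in the abstract, and the function names and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for illustration.

```python
import math


def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction in camera space for pixel (u, v) under a pinhole model."""
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    norm = math.sqrt(x * x + y * y + z * z)
    return (x / norm, y / norm, z / norm)


def rotate(rotation, direction):
    """Apply a 3x3 camera-to-world rotation (nested lists) to a direction."""
    return tuple(sum(rotation[i][j] * direction[j] for j in range(3)) for i in range(3))


def ray_angle(d1, d2):
    """Angle in radians between two unit rays; it depends on 3D geometry, not on pixel position."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return math.acos(max(-1.0, min(1.0, dot)))
```

Two pixels at the same screen location in different frames can have very different world-space rays once the camera has rotated, which is the mismatch between screen-space position and 3D geometry that the abstract points to.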


[107] Recovering 3D Shapes from Ultra-Fast Motion-Blurred Images cs.CV | cs.GRPDF

Fei Yu, Shudan Guo, Shiqing Xin, Beibei Wang, Haisen Zhao

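TL;DR: This paper proposes a new method for recovering 3D shapes from ultra-fast motion-blurred images. A fast barycentric coordinate solver greatly improves the efficiency of motion blur simulation, and differentiable inverse rendering makes it possible to reconstruct 3D geometry from 2D images with extreme motion blur.

Details

Motivation: In natural and industrial scenes (such as fast-moving balls or rotating machinery), rapid object motion blurs images so severely that traditional 3D reconstruction techniques such as multi-view stereo fail.

Result: The method is validated on two representative motion types, rapid translation and rotation. The forward simulation models ultra-fast moving objects efficiently and realistically. In inverse rendering, the method recovers 3D shapes from 2D images of objects undergoing extreme translational and rotational motion, advancing the boundaries of vision-based 3D reconstruction.

Insight: The core contribution is the fast barycentric coordinate solver, which removes the computational bottleneck of repeatedly computing barycentric weights in conventional methods and achieves a speedup of up to 4.57x. The pipeline is fully differentiable, so gradients propagate from rendered images to the underlying 3D shape, which enables shape recovery from blurred images.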

Abstract: We consider the problem of 3D shape recovery from ultra-fast motion-blurred images. While 3D reconstruction from static images has been extensively studied, recovering geometry from extreme motion-blurred images remains challenging. Such scenarios frequently occur in both natural and industrial settings, such as fast-moving objects in sports (e.g., balls) or rotating machinery, where rapid motion distorts object appearance and makes traditional 3D reconstruction techniques like Multi-View Stereo (MVS) ineffective. In this paper, we propose a novel inverse rendering approach for shape recovery from ultra-fast motion-blurred images. While conventional rendering techniques typically synthesize blur by averaging across multiple frames, we identify a major computational bottleneck in the repeated computation of barycentric weights. To address this, we propose a fast barycentric coordinate solver, which significantly reduces computational overhead and achieves a speedup of up to 4.57x, enabling efficient and photorealistic simulation of high-speed motion. Crucially, our method is fully differentiable, allowing gradients to propagate from rendered images to the underlying 3D shape, thereby facilitating shape recovery through inverse rendering. We validate our approach on two representative motion types: rapid translation and rotation. Experimental results demonstrate that our method enables efficient and realistic modeling of ultra-fast moving objects in the forward simulation. Moreover, it successfully recovers 3D shapes from 2D imagery of objects undergoing extreme translational and rotational motion, advancing the boundaries of vision-based 3D reconstruction. Project page: https://maxmilite.github.io/rec-from-ultrafast-blur/
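The bottleneck named in the abstract can be seen in a naive baseline. The sketch below uses the textbook barycentric formula and synthesizes blur by averaging triangle coverage over time samples, so the weights are recomputed for every sample. It is not the paper's fast solver. The function names, the 2D setting, and the constant-velocity motion are assumptions for illustration.

```python
def barycentric_weights(p, a, b, c):
    """Standard barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w_a, w_b, 1.0 - w_a - w_b


def inside(p, a, b, c):
    """A point is inside the triangle when all three weights are non-negative."""
    return all(w >= 0 for w in barycentric_weights(p, a, b, c))


def blurred_coverage(p, triangle, velocity, num_samples):
    """Naive motion blur: average coverage of p over time samples in [0, 1]."""
    hits = 0
    for i in range(num_samples):
        t = i / (num_samples - 1) if num_samples > 1 else 0.0
        moved = [(x + velocity[0] * t, y + velocity[1] * t) for x, y in triangle]
        hits += int(inside(p, *moved))  # weights recomputed at every sample
    return hits / num_samples
```

The cost grows with the number of time samples per pixel and per triangle, which is the repeated computation that a faster solver would aim to avoid.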


[108] Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds cs.CVPDF

Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu

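TL;DR: This paper introduces SSI-Bench, a VQA benchmark for evaluating the spatial reasoning of vision-language models on constrained manifolds. Built from complex real-world 3D structures, it contains 1,000 ranking questions covering geometric and topological reasoning and requires models to perform a variety of compositional spatial operations. An evaluation of 31 widely used VLMs reveals a large gap to human performance.

Details

Motivation: Most existing benchmarks evaluate unconstrained scenes in which models can exploit 2D shortcuts, yet spatial intelligence is essential in the physical world. A benchmark is therefore needed that evaluates spatial reasoning under geometric, topological, and physical constraints.

Result: On SSI-Bench, the best open-source model reaches 22.2% accuracy and the strongest closed-source model 33.6%, while humans score 91.6%, a large gap between models and humans.

Insight: The novelty is SSI-Bench itself, a benchmark built through a fully human-centered pipeline and focused on spatial reasoning on constrained manifolds. It emphasizes structural grounding and constraint-consistent 3D reasoning and exposes a fundamental weakness of current VLMs on complex spatial intelligence tasks.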

Abstract: Spatial intelligence is crucial for vision–language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io.


[109] WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning cs.CVPDF

Mert Sonmezer, Serge Vasylechko, Duygu Atasoy, Seyda Ertekin, Sila Kurugol

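TL;DR: This paper proposes WristMIR, a region-aware framework for retrieving pediatric wrist radiographs. It uses dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without manual image-level annotations. MedGemma-based structured report mining generates global and region-level captions, which are combined with pre-processed wrist images and bone-specific crops to jointly train global and local contrastive encoders. Retrieval runs in two stages: coarse global matching to identify candidate exams, followed by region-conditioned reranking aligned to a predefined anatomical bone region.

Details

Motivation: Retrieving wrist radiographs with similar fracture patterns is challenging because clinically important cues are subtle, highly localized, and often obscured by overlapping anatomy or varying imaging views. Progress in case-based medical image retrieval is further limited by the lack of large, well-annotated datasets.

Result: WristMIR markedly improves retrieval, raising image-to-text Recall@5 from 0.82% for the baseline to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean F1 from 0.568 to 0.753. Radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35.

Insight: Contributions include: 1) a medical image retrieval framework that combines global and local (region-aware) contrastive learning and uses radiology reports as supervision, with no manual image annotation; 2) a two-stage retrieval pipeline (coarse matching plus region-conditioned reranking) grounded in anatomy (specific bone regions), which brings retrieval closer to clinical diagnostic logic; 3) use of a large language model (MedGemma) to mine structured global and region-level captions from radiology reports as high-quality text supervision. Viewed objectively, the work combines anatomical priors with report-based vision-language learning and offers an interpretable, clinically oriented paradigm for fine-grained medical image retrieval.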

Abstract: Retrieving wrist radiographs with analogous fracture patterns is challenging because clinically important cues are subtle, highly localized and often obscured by overlapping anatomy or variable imaging views. Progress is further limited by the scarcity of large, well-annotated datasets for case-based medical image retrieval. We introduce WristMIR, a region-aware pediatric wrist radiograph retrieval framework that leverages dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without any manual image-level annotations. Using MedGemma-based structured report mining to generate both global and region-level captions, together with pre-processed wrist images and bone-specific crops of the distal radius, distal ulna, and ulnar styloid, WristMIR jointly trains global and local contrastive encoders and performs a two-stage retrieval process: (1) coarse global matching to identify candidate exams, followed by (2) region-conditioned reranking aligned to a predefined anatomical bone region. WristMIR improves retrieval performance over strong vision-language baselines, raising image-to-text Recall@5 from 0.82% to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean $F_1$ from 0.568 to 0.753, and radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35. These findings highlight the potential of anatomically guided retrieval to enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging. The source code is publicly available at https://github.com/quin-med-harvard-edu/WristMIR.
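The two-stage retrieval process described above can be sketched as follows, assuming each exam already has one global embedding and one embedding per bone region. The dictionary layout, the cosine scoring, and the function names are assumptions for illustration and are not taken from the WristMIR code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def two_stage_retrieve(query, gallery, region, top_k=3):
    """Coarse global matching, then reranking of the candidates by one bone region."""
    # Stage 1: keep the top_k exams by global embedding similarity.
    candidates = sorted(
        gallery, key=lambda exam: cosine(query["global"], exam["global"]), reverse=True
    )[:top_k]
    # Stage 2: reorder only those candidates by the region-specific embedding.
    return sorted(
        candidates,
        key=lambda exam: cosine(query["regions"][region], exam["regions"][region]),
        reverse=True,
    )
```

The design choice is that the second stage never sees exams rejected by the first, so the region-level comparison stays cheap and focuses on one anatomical region of plausible candidates.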


[110] Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video cs.CV | cs.AIPDF

Zihui Gao, Ke Liu, Donny Y. Chen, Duochao Shi, Guosheng Lin

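TL;DR: This paper proposes SAGE, a framework for scalable adaptation of 3D geometric foundation models using Internet video. A hierarchical mining pipeline converts videos into training trajectories, and hybrid supervision combines sparse geometric anchors (from SfM point clouds) with dense differentiable consistency (via 3D Gaussian rendering), making effective use of unlabeled video to improve generalization.

Details

Motivation: Progress in 3D geometric foundation models is constrained by the scarcity of large-scale, diverse 3D annotations. Internet videos offer vast raw data but lack ground-truth geometry and contain observational noise, so they are hard to use directly for geometric learning.

Result: On unseen datasets such as 7Scenes, TUM-RGBD, and Matterport3D, SAGE significantly improves zero-shot generalization, reducing Chamfer Distance by 20-42% and reaching state-of-the-art results.

Insight: Contributions include: 1) an adaptation framework based on Internet video that gives geometric foundation models a scalable training paradigm; 2) a hybrid supervision strategy that combines sparse global structural guidance with dense multi-view constraints; 3) regularization with anchor data to prevent catastrophic forgetting. The authors describe the method as the first to adapt geometric foundation models via Internet video, advancing general-purpose 3D learning.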

Abstract: Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
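For readers unfamiliar with the metric behind the 20-42% figure, here is a minimal sketch of a symmetric Chamfer Distance between two point clouds. Conventions vary (squared versus unsquared distances, sum versus mean), and the abstract does not say which one SAGE uses. This version, with mean unsquared nearest-neighbor distances in both directions, is an assumption.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance in both directions."""
    if not points_a or not points_b:
        raise ValueError("both point clouds must be non-empty")

    def one_way(source, target):
        # For every source point, the distance to its closest target point.
        return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

This brute-force form is quadratic in the number of points. Practical evaluations typically use a spatial index or GPU batching, which is outside the scope of this sketch.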


[111] Rethinking Practical and Efficient Quantization Calibration for Vision-Language Models cs.CVPDF

Zhenhao Shang, Haizhao Jing, Guoting Wei, Haokui Zhang, Rong Xiao

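TL;DR: This paper proposes TLQ, a quantization calibration framework for vision-language models (VLMs). Using a gradient-guided token-level importance integration mechanism and a quantization-exposed layer-wise calibration scheme, it addresses the large differences between visual and text tokens in activation distribution and sensitivity to quantization error, and it improves the performance and stability of post-training quantization (PTQ).

Details

Motivation: In VLMs, visual and text tokens differ widely in activation distribution and in sensitivity to quantization error, so conventional PTQ calibration works poorly. The calibration objective needs rethinking, and a finer-grained calibration strategy is required.

Result: Evaluated on two models, three model scales, and two quantization settings, TLQ improves performance in every setting, indicating strong quantization stability.

Insight: Contributions include a token-level importance integration mechanism based on gradient information, construction of a token-level calibration set, and a multi-GPU quantization-exposed layer-wise calibration scheme that reduces reliance on large-memory A100 GPUs and improves calibration efficiency and accuracy.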

Abstract: Post-training quantization (PTQ) is a primary approach for deploying large language models without fine-tuning, and the quantized performance is often strongly affected by the calibration in PTQ. By contrast, in vision-language models (VLMs), substantial differences between visual and text tokens in their activation distributions and sensitivities to quantization error pose significant challenges for effective calibration during PTQ. In this work, we rethink what PTQ calibration should align with in VLMs and propose the Token-level Importance-aware Layer-wise Quantization framework (TLQ). Guided by gradient information, we design a token-level importance integration mechanism for quantization error, and use it to construct a token-level calibration set, enabling a more fine-grained calibration strategy. Furthermore, TLQ introduces a multi-GPU, quantization-exposed layer-wise calibration scheme. This scheme keeps the layer-wise calibration procedure consistent with the true quantized inference path and distributes the complex layer-wise calibration workload across multiple RTX3090 GPUs, thereby reducing reliance on the large memory of A100 GPUs. TLQ is evaluated across two models, three model scales, and two quantization settings, consistently achieving performance improvements across all settings, indicating its strong quantization stability. The code will be released publicly.
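The idea of weighting quantization error by token importance can be sketched in a few lines. This is a toy illustration, assuming symmetric uniform quantization with a single scale and importance weights supplied externally (for example, gradient magnitudes). It is not the TLQ algorithm, whose details the abstract does not give, and all names are hypothetical.

```python
def quantize(value, scale, bits=8):
    """Symmetric uniform quantization of one value, returned in dequantized form."""
    q_max = 2 ** (bits - 1) - 1
    q = max(-q_max - 1, min(q_max, round(value / scale)))
    return q * scale


def weighted_quant_error(tokens, importances, scale, bits=8):
    """Importance-weighted mean of per-token squared quantization error."""
    total = sum(importances)
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    error = 0.0
    for activations, weight in zip(tokens, importances):
        token_error = sum((x - quantize(x, scale, bits)) ** 2 for x in activations)
        error += weight * token_error
    return error / total
```

A calibration procedure could then pick the scale that minimizes this weighted error rather than the unweighted one, so that tokens judged important dominate the choice.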


[112] Which private attributes do VLMs agree on and predict well? cs.CVPDF

Olena Hrynenko, Darya Baranouskaya, Alina Elena Baia, Andrea Cavallaro

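TL;DR: This paper presents a zero-shot evaluation of open-source vision-language models (VLMs) on privacy-related attribute recognition. It identifies the attributes on which VLMs show high inter-annotator agreement and examines where VLM and human annotations differ. VLMs predict the presence of privacy attributes more often than human annotators, and in cases of high agreement among VLMs they can fill gaps in human annotation, which suggests potential for supporting privacy annotation of large-scale image datasets.

Details

Motivation: To evaluate how well VLMs recognize privacy-related visual attributes in the zero-shot setting, to understand their agreement and disagreement with human annotations, and to explore whether they can support large-scale privacy annotation.

Result: When evaluated against human annotations, VLMs tend to predict the presence of privacy attributes more frequently. When agreement among VLMs is high, they identify attributes that human annotators overlooked and thus complement human annotation.

Insight: The paper describes itself as the first systematic evaluation of the zero-shot performance of VLMs on privacy attribute recognition. It analyzes agreement among VLMs and their complementarity with human annotation, providing empirical grounds for using VLMs to automate or assist large-scale image privacy annotation.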

Abstract: Visual Language Models (VLMs) are often used for zero-shot detection of visual attributes in images. We present a zero-shot evaluation of open-source VLMs for privacy-related attribute recognition. We identify the attributes for which VLMs exhibit strong inter-annotator agreement, and discuss the cases where human and VLM annotations disagree. Our results show that when evaluated against human annotations, VLMs tend to predict the presence of privacy attributes more often than human annotators. In addition, we find that in cases of high inter-annotator agreement between VLMs, they can complement human annotation by identifying attributes overlooked by human annotators. This highlights the potential of VLMs to support privacy annotations in large-scale image datasets.
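Inter-annotator agreement can be measured in several ways, and the abstract does not name the statistic used. As one common choice, the sketch below computes Cohen's kappa for two annotators over the same images, plus the positive rate that underlies the observation that VLMs predict attribute presence more often. The function names are hypothetical.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)


def positive_rate(labels, positive=1):
    """Fraction of items for which the attribute is marked as present."""
    return labels.count(positive) / len(labels)
```

Comparing `positive_rate` for a VLM and for human annotators on the same attribute gives a direct check of whether the model marks the attribute as present more often.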


[113] D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning cs.CVPDF

Changli Tang, Tianyi Wang, Fengyun Rao, Jing Lyu, Chao Zhang

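TL;DR: This paper proposes D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. To address the shortage of open-source data, the authors build DVD, a large-scale, high-quality bilingual (English and Mandarin) dataset of multi-party dialogue videos. To improve fine-grained captioning accuracy, the model is trained with group relative policy optimization and three novel reward functions. Experiments show that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding, and that with only 8 billion parameters it performs competitively with Qwen3-Omni on several general-purpose audio-visual understanding benchmarks.

Details

Motivation: Spoken dialogue is a primary source of information in videos, so accurately identifying who said what and when is essential for deep video understanding. The open-source ecosystem lacks high-quality, large-scale dialogue video datasets, and the fine-grained accuracy of audio-visual captioning needs improvement.

Result: D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. On several general-purpose audio-visual understanding benchmarks, it is competitive with Qwen3-Omni, a larger model.

Insight: Contributions include: 1) DVD, a large-scale, high-quality bilingual dataset of multi-party dialogue videos that fills a gap in open-source data; 2) the first use of evaluation metrics common in speech processing (speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment) as reinforcement learning rewards for audio-visual captioning; 3) group relative policy optimization with these rewards, which improves fine-grained captioning accuracy.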

Abstract: Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2,000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to our knowledge, are applied for the first time as reinforcement learning objectives for audio-visual captioning. Extensive experiments demonstrate that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. Notably, despite having only 8 billion parameters, D-ORCA achieves performance competitive with Qwen3-Omni across several general-purpose audio-visual understanding benchmarks. Demos are available at https://d-orca-llm.github.io/. Our code, data, and checkpoints will be available at https://github.com/WeChatCV/D-ORCA/.
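The core step of group relative policy optimization is to score each sampled caption relative to the other captions drawn for the same input. The following is a minimal sketch of that normalization, assuming the three rewards are combined by a weighted average with equal weights and that the population standard deviation is used. Both choices, and the function names, are assumptions and not details from the paper.

```python
from statistics import mean, pstdev


def combined_reward(speaker_acc, content_acc, boundary_align, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three reward components (weights are hypothetical)."""
    parts = (speaker_acc, content_acc, boundary_align)
    return sum(w * r for w, r in zip(weights, parts)) / sum(weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Captions that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.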


[114] Deepfake Synthesis vs. Detection: An Uneven Contest cs.CVPDF

Md. Tarek Hasan, Sanjay Saha, Shaojing Fan, Swakkhar Shatabda, Terence Sim

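TL;DR: This paper presents a comprehensive empirical analysis of state-of-the-art deepfake detection techniques, including human evaluation experiments against cutting-edge synthesis methods. Many state-of-the-art detectors perform markedly poorly on deepfakes produced by modern synthesis techniques (such as diffusion models, NeRF, and enhanced GANs), and human participants also do poorly on the highest-quality deepfakes. The study stresses that detection is falling behind generation and that detection models urgently need continued improvement.

Details

Motivation: Rapid progress in deepfake technology has greatly increased the realism and accessibility of synthetic media. Detection methods have also advanced, but the balance between the two is unclear. This study uses empirical analysis to assess whether current state-of-the-art detectors can cope with deepfakes produced by modern synthesis methods and to expose the gap between the two.

Result: Many state-of-the-art detection models degrade markedly on deepfakes produced by modern synthesis techniques, and human participants also perform poorly on high-quality deepfakes, which highlights the performance gap between detection models and generation techniques.

Insight: The contribution is a direct comparison of state-of-the-art deepfake generation and detection through comprehensive empirical analysis and human evaluation, showing that detection lags behind generation. Viewed objectively, this underlines the need for continual iteration and adversarial evaluation of detection models to keep up with evolving generation capabilities.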

Abstract: The rapid advancement of deepfake technology has significantly elevated the realism and accessibility of synthetic media. Emerging techniques, such as diffusion-based models and Neural Radiance Fields (NeRF), alongside enhancements in traditional Generative Adversarial Networks (GANs), have contributed to the sophisticated generation of deepfake videos. Concurrently, deepfake detection methods have seen notable progress, driven by innovations in Transformer architectures, contrastive learning, and other machine learning approaches. In this study, we conduct a comprehensive empirical analysis of state-of-the-art deepfake detection techniques, including human evaluation experiments against cutting-edge synthesis methods. Our findings highlight a concerning trend: many state-of-the-art detection models exhibit markedly poor performance when challenged with deepfakes produced by modern synthesis techniques, and human participants also perform poorly against the best-quality deepfakes. Through extensive experimentation, we provide evidence that underscores the urgent need for continued refinement of detection models to keep pace with the evolving capabilities of deepfake generation technologies. This research emphasizes the critical gap between current detection methodologies and the sophistication of new generation techniques, calling for intensified efforts in this crucial area of study.


[115] MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance cs.CV | cs.AI

Xuehai Bai, Xiaoling Gu, Akide Liu, Hangjie Yuan, YiFan Zhang

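TL;DR: This paper proposes MCIE-E1, a complex-instruction image editing method driven by a multimodal large language model. It uses a spatial-aware cross-attention module and a background-consistent cross-attention module to address the weak compliance with complex instructions and the background inconsistency of existing methods, and it introduces the CIE-Bench benchmark for comprehensive evaluation.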

Details

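Motivation: Existing instruction-based image editing methods are limited to simple operations and struggle with the complex, compositional instructions found in real-world use, suffering from insufficient instruction compliance and background inconsistency.

Result: On the proposed CIE-Bench benchmark, MCIE-E1 outperforms previous SOTA methods in both quantitative and qualitative evaluations, improving instruction compliance by 23.96%.

Insight: Contributions include a spatial guidance mechanism that explicitly aligns semantic instructions with spatial regions and a background-consistency module that preserves features in unedited regions. In addition, an MLLM-assisted data pipeline with human validation produces high-quality training data, and a dedicated evaluation benchmark with new metrics is established.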

Abstract: Recent advances in instruction-based image editing have shown remarkable progress. However, existing methods remain limited to relatively simple editing operations, hindering real-world applications that require complex and compositional instructions. In this work, we address these limitations from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify two key challenges in current models: insufficient instruction compliance and background inconsistency. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing method that integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. The former enhances instruction-following capability by explicitly aligning semantic instructions with spatial regions through spatial guidance during the denoising process, while the latter preserves features in unedited regions to maintain background consistency. To enable effective training, we construct a dedicated data pipeline to mitigate the scarcity of complex instruction-based image editing datasets, combining fine-grained automatic filtering via a powerful MLLM with rigorous human validation. Finally, to comprehensively evaluate complex instruction-based image editing, we introduce CIE-Bench, a new benchmark with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 consistently outperforms previous state-of-the-art methods in both quantitative and qualitative assessments, achieving a 23.96% improvement in instruction compliance.


[116] ForecastOcc: Vision-based Semantic Occupancy Forecasting cs.CV | cs.AI | cs.LG | cs.RO

Riya Mohan, Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

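TL;DR: This paper proposes ForecastOcc, the first vision-based semantic occupancy forecasting framework. It jointly predicts occupancy states and semantic categories for multiple future time steps directly from past camera images, without relying on externally estimated maps. The framework is evaluated for multi-view forecasting on the Occ3D-nuScenes dataset and establishes the first benchmark for monocular forecasting on SemanticKITTI.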

Details

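Motivation: Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects and lack semantic information. Recent semantic occupancy forecasting methods fill this gap but rely on past occupancy predictions from separate networks, which makes them sensitive to error accumulation and prevents them from learning spatio-temporal features directly from images.

Result: Extensive experiments on Occ3D-nuScenes and SemanticKITTI show that ForecastOcc consistently outperforms the baselines and produces semantically rich, future-aware predictions that capture the scene dynamics and semantics critical for autonomous driving.

Insight: Contributions include: the first end-to-end framework that forecasts future semantic occupancy directly from images; a novel architecture comprising a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D occupancy encoder, and a semantic occupancy head; and the first benchmark for monocular semantic occupancy forecasting on SemanticKITTI.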

Abstract: Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.


[117] FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging cs.CV | cs.AI | cs.CL | cs.LG

Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang

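TL;DR: FlashVID is a training-free inference acceleration framework for video large language models. It combines attention- and diversity-based token selection with tree-based spatiotemporal token merging to compress spatiotemporal redundancy in video, significantly improving computational efficiency.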

Details

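Motivation: Existing video large language models must process large numbers of visual tokens, which makes them computationally inefficient. Existing acceleration frameworks compress spatial and temporal redundancy independently, overlooking spatiotemporal relationships and yielding suboptimal compression.

Result: In experiments on three representative VLLMs across five video understanding benchmarks, retaining only 10% of visual tokens preserves 99.1% of LLaVA-OneVision's performance. Within the same computational budget, FlashVID enables a 10x increase in video frame input to Qwen2.5-VL, for a relative improvement of 8.6%.

Insight: The contribution is the combination of attention- and diversity-based token selection with tree-based spatiotemporal token merging, which provides training-free, plug-and-play removal of spatiotemporal redundancy and exploits the spatiotemporal correlations that arise from the dynamic nature of video.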

Abstract: Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLM acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5-VL, resulting in a relative improvement of 8.6% within the same computational budget. Code is available at https://github.com/Fanziyang-v/FlashVID.
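The frame-count claim can be read as simple token-budget arithmetic. The sketch below is illustrative only: it assumes compute is governed by the number of visual tokens passed to the LLM, and the 32 baseline frames and 196 tokens per frame are made-up values, not figures from the paper.

```python
def visual_tokens(frames: int, tokens_per_frame: int, keep_percent: int) -> int:
    """Visual tokens reaching the LLM when keep_percent% of tokens are retained."""
    return frames * tokens_per_frame * keep_percent // 100


def max_frames(token_budget: int, tokens_per_frame: int, keep_percent: int) -> int:
    """Frame count that fits in token_budget at the given retention rate."""
    return token_budget * 100 // (tokens_per_frame * keep_percent)


# Hypothetical baseline: 32 frames, 196 tokens per frame, no compression.
baseline_budget = visual_tokens(frames=32, tokens_per_frame=196, keep_percent=100)

# Keeping 10% of the tokens fits ten times as many frames in the same budget.
compressed_frames = max_frames(baseline_budget, tokens_per_frame=196, keep_percent=10)
print(baseline_budget, compressed_frames)  # 6272 320
```

Under this token-count assumption, a 10% retention rate is exactly what allows a 10x longer frame sequence at an unchanged budget.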


[118] MIND: Benchmarking Memory Consistency and Action Control in World Models cs.CV | cs.AI

Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao

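TL;DR: This paper introduces MIND, a benchmark for evaluating two core abilities of world models: memory consistency and action control. MIND contains 250 high-quality videos covering first-person and third-person viewpoints, with a shared action space and diverse scenes. The paper also introduces MIND-World as an interactive baseline, and its experiments reveal that current world models struggle with long-term memory consistency and with generalization across action spaces.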

Details

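Motivation: A unified benchmark is lacking for evaluating the fundamental abilities of world models to understand and predict dynamic visual environments, particularly memory consistency and action control.

Result: Extensive experiments demonstrate the completeness of the MIND benchmark and reveal key challenges for current world models, including the difficulty of maintaining long-term memory consistency and of generalizing across action spaces.

Insight: The contribution is MIND, the first open-domain, closed-loop, revisited benchmark for systematically evaluating the two core abilities of world models. Its varied action spaces and efficient evaluation framework give future research a standardized evaluation tool.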

Abstract: World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Project page: https://csu-jpg.github.io/MIND.github.io/


[119] Enhanced Mixture 3D CGAN for Completion and Generation of 3D Objects cs.CV

Yahia Hamdi, Nicolas Andrialovanirina, Kélig Mahé, Emilie Poisson Caillault

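TL;DR: This paper proposes an enhanced mixture 3D convolutional generative adversarial network (CGAN) for generating and completing 3D objects. The method combines a deep 3D convolutional GAN with a Mixture of Experts (MoE) framework, uses multiple generators to capture distinct modalities in the data, and introduces an auxiliary-loss-free dynamic capacity constraint (DCC) mechanism to guide generator selection, improving its handling of complex 3D data.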

Details

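Motivation: Existing GANs struggle to capture complex and diverse data distributions when generating and completing 3D objects, particularly with incomplete inputs or large missing regions. These limitations stem mainly from high computational requirements and the difficulty of modeling heterogeneous, structurally intricate data.

Result: The model is evaluated on shape generation and completion with missing regions of varying sizes and compared with state-of-the-art methods. Both quantitative and qualitative results confirm the effectiveness of the proposed MoE-DCGAN on complex 3D data.

Insight: The contribution is integrating an MoE framework into a 3D CGAN so that multiple expert generators specialize in different data modalities, together with a DCC mechanism that balances specialization, training stability, and computational efficiency, which is critical for 3D voxel processing.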

Abstract: The generation and completion of 3D objects represent a transformative challenge in computer vision. Generative Adversarial Networks (GANs) have recently demonstrated strong potential in synthesizing realistic visual data. However, they often struggle to capture complex and diverse data distributions, particularly in scenarios involving incomplete inputs or significant missing regions. These challenges arise mainly from the high computational requirements and the difficulty of modeling heterogeneous and structurally intricate data, which restrict their applicability in real-world settings. Mixture of Experts (MoE) models have emerged as a promising solution to these limitations. By dynamically selecting and activating the most relevant expert sub-networks for a given input, MoEs improve both performance and efficiency. In this paper, we investigate the integration of Deep 3D Convolutional GANs (CGANs) with a MoE framework to generate high-quality 3D models and reconstruct incomplete or damaged objects. The proposed architecture incorporates multiple generators, each specialized to capture distinct modalities within the dataset. Furthermore, an auxiliary loss-free dynamic capacity constraint (DCC) mechanism is introduced to guide the selection of categorical generators, ensuring a balance between specialization, training stability, and computational efficiency, which is critical for 3D voxel processing. We evaluated the model’s ability to generate and complete shapes with missing regions of varying sizes and compared its performance with state-of-the-art approaches. Both quantitative and qualitative results confirm the effectiveness of the proposed MoE-DCGAN in handling complex 3D data.


[120] Vanilla Group Equivariant Vision Transformer: Simple and Effective cs.CV

Jiahong Fu, Qi Xie, Deyu Meng, Zongben Xu

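TL;DR: This paper proposes a simple and effective framework that systematically makes the key components of ViT (patch embedding, self-attention, positional encodings, and down/up-sampling) equivariant, yielding ViT architectures with guaranteed equivariance. The architecture serves as a plug-and-play replacement that is theoretically grounded and practically versatile, and it extends seamlessly to Swin Transformers. Extensive experiments show that the proposed equivariant ViTs consistently improve performance and data efficiency across a range of vision tasks.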

Details

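Motivation: Existing equivariant ViTs struggle to balance performance with equivariance. The main challenge is achieving holistic equivariant modifications across the diverse modules of a ViT, in particular harmonizing the self-attention mechanism with patch embedding.

Result: Experiments on a wide range of vision tasks show that the proposed equivariant ViTs consistently improve performance and data efficiency.

Insight: The contribution is a systematic framework that makes the core ViT components equivariant, producing an architecture with theoretically guaranteed equivariance that is practical as a plug-and-play module and extends to variants such as Swin Transformer.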

Abstract: Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the Self-Attention mechanism with Patch Embedding. To address this, we propose a straightforward framework that systematically renders key ViT components, including patch embedding, self-attention, positional encodings, and Down/Up-Sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
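As a rough illustration of what equivariance means for individual components, the sketch below checks the property layer(rotate(x)) == rotate(layer(x)) for 90-degree rotations on a toy grid. It is a generic check, not the paper's construction: the 2x2 average pooling stands in for a patch-wise layer, and the additive per-cell offset is a hypothetical stand-in for an absolute positional encoding.

```python
def rot90(grid):
    """Rotate a square grid 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def pool2x2(grid):
    """Average non-overlapping 2x2 blocks (a stand-in for a patch-wise layer)."""
    n = len(grid)
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4
            for c in range(0, n, 2)
        ]
        for r in range(0, n, 2)
    ]


def add_absolute_position(grid):
    """Add a fixed per-cell offset, mimicking an absolute positional encoding."""
    n = len(grid)
    return [[grid[r][c] + r * n + c for c in range(n)] for r in range(n)]


def is_rot90_equivariant(layer, grid):
    """Check layer(rot90(x)) == rot90(layer(x)) for one input x."""
    return layer(rot90(grid)) == rot90(layer(grid))


x = [[(7 * r + 3 * c) % 5 for c in range(4)] for r in range(4)]
print(is_rot90_equivariant(pool2x2, x))  # True
print(is_rot90_equivariant(lambda g: pool2x2(add_absolute_position(g)), x))  # False
```

The second check fails because the offset is tied to absolute grid coordinates, which is one intuition for why positional encodings need equivariant treatment alongside the other modules.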


[121] Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks cs.CV | cs.AI

Yufei Wang, Haixu Liu, Tianxiang Xu, Chuancheng Shi, Hongsheng Xing

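TL;DR: This paper proposes a multimodal weak-supervision framework for automatically recognizing "concealed emotions" in videos and achieves SOTA results on the iMiGUE tennis-interview dataset. The method first uses YOLO 11x and DINOv2-Base for portrait detection and visual feature extraction, then has Gemini 2.5 Pro, prompted with Chain-of-Thought and Reflection, automatically generate pseudo-labels and reasoning texts as weak supervision. Next, OpenPose extracts key-point sequences, which are augmented with inter-frame offset features, and the usual graph neural network backbone is simplified to an MLP that efficiently models the spatiotemporal relationships of the key points. Finally, an ultra-long-sequence Transformer independently encodes the image and key-point sequences, and the results are concatenated with BERT-encoded text features. Each modality is pre-trained separately and then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains.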

Details

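Motivation: The paper addresses automatic recognition of "concealed emotions" (emotions that are deliberately masked) in videos. The task suffers from scarce annotated data and severe class imbalance, and therefore needs an effective weakly supervised learning strategy.

Result: On the iMiGUE tennis-interview dataset, the method raises accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark (SOTA). Experiments also confirm that an "MLP-ified" key-point backbone can match or even surpass GCN-based models on this task.

Insight: The main contributions are: 1) using a large language model (Gemini 2.5 Pro) with CoT + Reflection prompting to automatically generate high-quality pseudo-labels and reasoning texts, a novel source of weak supervision for multimodal tasks; 2) simplifying the complex graph neural network backbone to an efficient MLP for modeling key-point spatiotemporal relationships, which reduces model complexity while maintaining performance; 3) a staged multimodal fusion strategy (isolated pre-training, then joint fine-tuning) that incorporates pseudo-labeled data to mitigate class imbalance.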

Abstract: To tackle the automatic recognition of “concealed emotions” in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame-by-frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by integrating Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph neural network backbone is simplified to an MLP to efficiently model the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes both the image and key-point sequences, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation, then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an “MLP-ified” key-point backbone can match - or even surpass - GCN-based counterparts in this task.


[122] Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling cs.CV | cs.AI | cs.RO | eess.SY

Xihang Yu, Rajat Talak, Lorenzo Shaikewitz, Luca Carlone

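TL;DR: This paper proposes Picasso, a physics-constrained scene reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. It uses a fast rejection sampling method guided by an inferred object contact graph to reason over multi-object interactions. The authors also create the Picasso dataset, 10 contact-rich real-world scenes with ground-truth annotations, and open-source a metric that quantifies physical plausibility as part of the benchmark.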

Details

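Motivation: Under occlusion and measurement noise, scene reconstructions that are geometrically accurate yet physically implausible (for example, with interpenetrating objects or unstable equilibria) hinder simulation-based planning and control of contact-rich behaviors. The paper aims to make object pose and shape estimation physically correct by reasoning holistically over the scene, rather than treating each object in isolation, and by accounting for object interactions and physical plausibility.

Result: Extensive evaluation on the new Picasso dataset and on the YCB-V dataset shows that Picasso outperforms the state of the art by a large margin while producing reconstructions that are both physically plausible and better aligned with human intuition.

Insight: Contributions include: 1) a physics-constrained reconstruction pipeline that integrates geometric, non-penetration, and physical constraints for holistic scene reasoning; 2) fast rejection sampling guided by an inferred object contact graph, which handles multi-object interactions efficiently; 3) a new benchmark dataset with contact-rich scenes and a physical-plausibility metric, which helps standardize evaluation in this area.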

Abstract: In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions – which fit the sensor data – can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
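The idea of rejecting physically implausible samples can be sketched in a toy 2D setting. This is a minimal sketch under strong assumptions, not the Picasso pipeline: objects are circles, the only constraint is non-penetration, candidates are drawn by Gaussian perturbation of the estimates, and no contact graph guides the sampling. All names and values are hypothetical.

```python
import math
import random


def penetrates(circles, tol=1e-9):
    """True if any two circles (x, y, radius) overlap by more than tol."""
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            xi, yi, ri = circles[i]
            xj, yj, rj = circles[j]
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return True
    return False


def sample_plausible(estimates, sigma, rng, max_tries=1000):
    """Perturb the estimated centers; return the first penetration-free sample."""
    for _ in range(max_tries):
        candidate = [
            (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma), r)
            for x, y, r in estimates
        ]
        if not penetrates(candidate):
            return candidate
    return None  # no plausible sample found within the budget


# Two unit circles whose estimated centers are 1.8 apart, so they interpenetrate.
estimates = [(0.0, 0.0, 1.0), (1.8, 0.0, 1.0)]
sample = sample_plausible(estimates, sigma=0.2, rng=random.Random(0))
print(penetrates(estimates), sample is not None and not penetrates(sample))
```

Uniformly perturbing every object becomes wasteful as scenes grow, which is presumably where guiding samples with a contact graph, as the abstract describes, pays off.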


[123] ReRoPE: Repurposing RoPE for Relative Camera Control cs.CV

Chunyang Li, Yuanbo Yang, Jiahao Shao, Hongyu Zhou, Katja Schwarz

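TL;DR: This paper proposes ReRoPE, a plug-and-play framework that integrates relative camera pose information into pre-trained video diffusion models to enable high-fidelity video generation with controllable viewpoints. It repurposes the underutilized low-frequency spectral bands of RoPE (Rotary Positional Embeddings) to inject camera control information, providing precise camera control without compromising the model's original generation capability.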

Details

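Motivation: Existing controllable-viewpoint generation methods built on pre-trained video models typically encode camera poses relative to a fixed reference frame (such as the first frame). Such encodings lack shift-invariance, which leads to poor generalization and accumulated drift. Integrating relative camera poses between arbitrary view pairs directly into a pre-trained model, however, runs into high training costs or architectural changes.

Result: ReRoPE is evaluated on image-to-video (I2V) and video-to-video (V2V) tasks. The results show strong camera control accuracy and visual fidelity, offering a training-efficient path toward controllable, high-fidelity video generation.

Insight: The core insight is that RoPE in existing models does not use its full spectral bandwidth, particularly the low-frequency part, and that these underutilized bands can be repurposed to inject relative camera pose information. This offers a way to make pre-trained models more controllable without extensive retraining or architectural changes.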

Abstract: Video generation with controllable camera viewpoints is essential for applications such as interactive content creation, gaming, and simulation. Existing methods typically adapt pre-trained video models using camera poses relative to a fixed reference, e.g., the first frame. However, these encodings lack shift-invariance, often leading to poor generalization and accumulated drift. While relative camera pose embeddings defined between arbitrary view pairs offer a more robust alternative, integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains challenging. We introduce ReRoPE, a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without compromising their generation capability. Our approach is based on the insight that Rotary Positional Embeddings (RoPE) in existing models underutilize their full spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underutilized bands, ReRoPE achieves precise control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results demonstrate that ReRoPE offers a training-efficient path toward controllable, high-fidelity video generation. See project page for more results: https://sisyphe-lee.github.io/ReRoPE/
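For readers unfamiliar with RoPE's frequency bands, the sketch below computes the standard RoPE frequencies and lists the bands that rotate by less than one full turn over a given position range. The "less than one turn" criterion is an assumption made here to illustrate what "underutilized low-frequency" could mean; it is not the paper's definition, and the head dimension and position range are arbitrary.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Per-pair angular frequencies of standard RoPE: base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x, y, position, freq):
    """Rotate one 2-D feature pair by the angle position * freq."""
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def low_frequency_bands(head_dim, max_position, base=10000.0):
    """Bands that never complete one full turn over positions 0..max_position."""
    freqs = rope_frequencies(head_dim, base)
    return [i for i, f in enumerate(freqs) if max_position * f < 2.0 * math.pi]


bands = low_frequency_bands(head_dim=64, max_position=128)
print(len(bands), "of", 64 // 2, "bands stay below one full rotation")
```

Because each pair is rotated by an angle proportional to position, the dot product between a rotated query and a rotated key depends only on their position difference, which is the relative-encoding property that makes RoPE a natural carrier for relative pose.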


[124] ViT-5: Vision Transformers for The Mid-2020s cs.CV

Feng Wang, Sucheng Ren, Tiezheng Zhang, Predrag Neskovic, Anand Bhattad

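TL;DR: This paper proposes ViT-5, a Vision Transformer backbone modernized by systematically integrating architectural advances from the past five years. While keeping the standard Attention-FFN structure, it refines components such as normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. Experiments show that ViT-5 outperforms existing state-of-the-art plain Vision Transformers on both understanding and generation tasks.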

Details

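Motivation: The goal is to modernize the Vision Transformer backbone by systematically applying architectural advances from the past five years, addressing the fact that the original ViT architecture is dated and does not take advantage of recent component improvements, and to offer a simple drop-in upgrade.

Result: On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. For generation, when plugged into the SiT diffusion framework it achieves an FID of 1.84, compared with 2.06 for a vanilla ViT backbone. These results indicate that ViT-5 outperforms state-of-the-art plain Vision Transformers on both classification and generation benchmarks.

Insight: The claimed contribution is a systematic, component-level modernization of the Vision Transformer rather than an entirely new architecture. Viewed objectively, the core insight is that independent recent advances in normalization, activation, positional encoding, and related components can be integrated in a coordinated, unified way into a simple ViT framework, significantly improving performance while remaining a drop-in replacement.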

Abstract: This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.


[125] VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval cs.CV | cs.AI

Issar Tzachor, Dvir Samuel, Rami Ben-Ari

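TL;DR: This paper proposes VidVec, which analyzes the embeddings of intermediate MLLM layers and combines them with a calibrated MLLM head for zero-shot video-text retrieval. It also introduces a lightweight text-based alignment strategy and reaches SOTA performance without visual supervision.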

Details

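Motivation: Existing generative MLLMs used as embedding extractors underperform Video Foundation Models on video tasks. This paper aims to use MLLMs to improve video-text embedding and retrieval.

Result: On common video retrieval benchmarks, the method outperforms existing methods without any fine-tuning beyond text, achieving state-of-the-art results.

Insight: The contribution is the finding that intermediate MLLM layers already encode substantial task-relevant information, together with a text-based alignment strategy that maps dense video captions to short summaries and enables embedding learning without visual supervision, offering an efficient new route to video retrieval.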

Abstract: Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.


[126] MMLSv2: A Multimodal Dataset for Martian Landslide Detection in Remote Sensing Imagery cs.CV | cs.LG

Sidike Paheding, Abel Reyes-Angulo, Leo Thomas Ramos, Angel D. Sappa, Rajaneesh A.

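TL;DR: This paper introduces MMLSv2, a multimodal remote sensing dataset for landslide segmentation on the Martian surface. It provides seven bands (RGB, digital elevation model, slope, thermal inertia, and grayscale) across 664 images, plus an isolated test set of 276 images from a geographically disjoint region for evaluating spatial generalization. Experiments show that the dataset supports stable training and competitive performance, but fragmented, elongated, and small-scale landslide regions remain challenging, and performance drops noticeably on the isolated test set, which highlights its value for assessing model robustness and generalization.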

Details

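Motivation: The goal is to build a multimodal remote sensing dataset dedicated to Martian landslide detection, addressing the limited modality diversity of existing data and the lack of ways to evaluate spatial generalization, and thereby to advance automatic geohazard detection in planetary science.

Result: Experiments with multiple segmentation models show that the dataset supports stable training and competitive performance, while performance drops noticeably on the isolated test set, indicating that models face challenges in out-of-distribution generalization.

Insight: The contribution is a multimodal dataset focused on Martian landslide segmentation together with a geographically isolated test set, which provides a benchmark for evaluating the robustness and cross-region generalization of computer vision models in remote sensing and is relevant to research on domain adaptation and generalization.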

Abstract: We present MMLSv2, a dataset for landslide segmentation on Martian surfaces. MMLSv2 consists of multimodal imagery with seven bands: RGB, digital elevation model, slope, thermal inertia, and grayscale channels. MMLSv2 comprises 664 images distributed across training, validation, and test splits. In addition, an isolated test set of 276 images from a geographically disjoint region from the base dataset is released to evaluate spatial generalization. Experiments conducted with multiple segmentation models show that the dataset supports stable training and achieves competitive performance, while still posing challenges in fragmented, elongated, and small-scale landslide regions. Evaluation on the isolated test set leads to a noticeable performance drop, indicating increased difficulty and highlighting its value for assessing model robustness and generalization beyond standard in-distribution settings. Dataset will be available at: https://github.com/MAIN-Lab/MMLS_v2


[127] Building Damage Detection using Satellite Images and Patch-Based Transformer Methods cs.CV

Smriti Siva, Jan Cross-Zamirski

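TL;DR: This study evaluates Vision Transformer (ViT) models for building damage classification on the xBD dataset, where labels are noisy and classes are imbalanced. It proposes a targeted patch-based pre-processing pipeline that isolates structural features and a frozen-head fine-tuning strategy that keeps compute manageable, and achieves results on multi-class damage classification that are competitive with prior CNN baselines.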

Details

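Motivation: Rapid building damage assessment is critical for post-disaster response, but label noise and severe class imbalance in satellite data pose major challenges for damage classification models built on satellite imagery.

Result: On the xBD dataset, small ViT architectures (DINOv2-small and DeiT) trained with the proposed method achieve macro-averaged F1 scores competitive with prior CNN baselines for disaster classification.

Insight: The contribution is a targeted patch-based pre-processing pipeline that isolates structural features and minimizes background noise during training, combined with a frozen-head fine-tuning strategy that keeps computational requirements manageable. Viewed objectively, pairing modern ViT models with pre-processing tailored to satellite imagery and efficient fine-tuning is a promising way to handle this kind of challenging remote sensing data.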

Abstract: Rapid building damage assessment is critical for post-disaster response. Damage classification models built on satellite imagery provide a scalable means of obtaining situational awareness. However, label noise and severe class imbalance in satellite data create major challenges. The xBD dataset offers a standardized benchmark for building-level damage across diverse geographic regions. In this study, we evaluate Vision Transformer (ViT) model performance on the xBD dataset, specifically investigating how these models distinguish between types of structural damage when training on noisy, imbalanced data. Concretely, we evaluate DINOv2-small and DeiT for multi-class damage classification. We propose a targeted patch-based pre-processing pipeline to isolate structural features and minimize background noise in training. We adopt a frozen-head fine-tuning strategy to keep computational requirements manageable. Model performance is evaluated through accuracy, precision, recall, and macro-averaged F1 scores. We show that small ViT architectures with our novel training method achieve competitive macro-averaged F1 relative to prior CNN baselines for disaster classification.
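Macro-averaged F1 matters under class imbalance because it weights every class equally regardless of how many samples it has. The sketch below computes it from scratch on made-up labels (the class ids and predictions are invented for illustration and are not xBD data).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up, imbalanced labels: class 0 dominates.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 3))  # 0.8 0.73
```

Here accuracy is 0.8 while macro F1 is about 0.73, because the errors fall entirely on the two rare classes.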


[128] MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection cs.CV

Venkatraman Narayanan, Bala Sai, Rahul Ahuja, Pratik Likhar, Varun Ravi Kumar

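TL;DR: MambaFusion is a unified framework for multimodal 3D object detection in autonomous driving. It interleaves selective state-space models (SSMs) with windowed transformers to model global context efficiently, uses a multi-modal token alignment module and reliability-aware fusion gates to dynamically fuse camera and LiDAR features, and applies a structure-conditioned diffusion head that combines graph-based reasoning with uncertainty-aware denoising to produce physically plausible perception.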

Details

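Motivation: The paper tackles camera-LiDAR multimodal fusion for autonomous driving and targets the weaknesses of existing BEV fusion frameworks: inefficient context modeling, spatially invariant fusion, and reasoning under uncertainty.

Result: The method achieves new state-of-the-art (SOTA) performance on nuScenes benchmarks while operating with linear-time complexity.

Insight: The contribution is coupling SSM-based efficiency with reliability-driven fusion: multi-modal token alignment and reliability-aware gates provide adaptive feature weighting, and a structure-conditioned diffusion head improves physical plausibility and confidence calibration, yielding robust, temporally stable, and interpretable 3D perception.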

Abstract: Reliable 3D object detection is fundamental to autonomous driving, and multimodal fusion algorithms using cameras and LiDAR remain a persistent challenge. Cameras provide dense visual cues but ill-posed depth; LiDAR provides a precise 3D structure but sparse coverage. Existing BEV-based fusion frameworks have made good progress, but they have difficulties including inefficient context modeling, spatially invariant fusion, and reasoning under uncertainty. We introduce MambaFusion, a unified multi-modal detection framework that achieves efficient, adaptive, and physically grounded 3D perception. MambaFusion interleaves selective state-space models (SSMs) with windowed transformers to propagate the global context in linear time while preserving local geometric fidelity. A multi-modal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera-LiDAR features based on spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and calibrated confidence. MambaFusion establishes new state-of-the-art performance on nuScenes benchmarks while operating with linear-time complexity. The framework demonstrates that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving systems.


[129] Robustness of Vision Language Models Against Split-Image Harmful Input Attacks cs.CV | cs.AI

Md Rafi Ur Rashid, MD Sadik Hossain Shanto, Vishnu Asutosh Dasu, Shagufta Mehnaz

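TL;DR: This paper proposes split-image visual jailbreak attacks (SIVA), a new class of visual jailbreak attacks on vision-language models (VLMs). The attacks exploit the fact that VLM safety alignment is typically performed only on holistic images and overlooks harmful semantics distributed across multiple image fragments, so splitting harmful content across several images can bypass the model's safeguards.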

Details

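Motivation: Current VLMs have undergone extensive safety alignment through preference optimization (e.g., RLHF) and are highly robust to conventional single/holistic-image jailbreak attacks. However, the authors find that while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is usually performed only on holistic images and does not cover harmful semantics spread across multiple image fragments, which creates a new vulnerability.

Result: Evaluations on three state-of-the-art VLMs and three jailbreak datasets show that the strongest attack strategy, which uses the novel adversarial knowledge distillation algorithm Adv-KD, achieves up to 60% higher cross-model transfer success than existing baselines.

Insight: The core contribution is exposing the fragility of VLM safety alignment on split-image inputs and proposing a progressive attack framework that moves from naive splitting to an adaptive white-box attack and finally to a black-box transfer attack. The key algorithmic contribution is adversarial knowledge distillation (Adv-KD), which substantially improves cross-model transferability. The findings point to an important direction for improving VLM safety alignment.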

Abstract: Vision-Language Models (VLMs) are now a core part of modern AI. Recent work proposed several visual jailbreak attacks using single/holistic images. However, contemporary VLMs demonstrate strong robustness against such attacks due to extensive safety alignment through preference optimization (e.g., RLHF). In this work, we identify a new vulnerability: while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is typically performed only on holistic images and does not account for harmful semantics distributed across multiple image fragments. Consequently, VLMs often fail to detect and refuse harmful split-image inputs, where unsafe cues emerge only after combining images. We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment. Unlike prior optimization-based attacks, which exhibit poor black-box transferability due to architectural and prior mismatches across models, our attacks evolve in progressive phases from naive splitting to an adaptive white-box attack, culminating in a black-box transfer attack. Our strongest strategy leverages a novel adversarial knowledge distillation (Adv-KD) algorithm to substantially improve cross-model transferability. Evaluations on three state-of-the-art modern VLMs and three jailbreak datasets demonstrate that our strongest attack achieves up to 60% higher transfer success than existing baselines. Lastly, we propose efficient ways to address this critical vulnerability in the current VLM safety alignment.


[130] DAS-SK: An Adaptive Model Integrating Dual Atrous Separable and Selective Kernel CNN for Agriculture Semantic Segmentation cs.CV

Mei Ling Chee, Thangarajah Akilan, Aparna Ravindra Phalke, Kanchan Keisham

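TL;DR: This paper proposes DAS-SK, a novel lightweight architecture for semantic segmentation of agricultural imagery. It integrates selective kernel convolution (SK-Conv) into the dual atrous separable convolution (DAS-Conv) module and enhances the atrous spatial pyramid pooling (ASPP) module to balance accuracy and computational efficiency. Built on a modified DeepLabV3 framework with MobileNetV3-Large and EfficientNet-B3 as complementary backbones, the model targets the large dataset requirements, limited spectral generalization, and high computational cost that hinder deploying high-resolution agricultural segmentation models on UAVs and other edge devices.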

Details

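Motivation: Semantic segmentation models for high-resolution agricultural imagery struggle to balance accuracy and computational efficiency. The paper addresses this so that such models can be deployed in practice on UAVs and other edge devices.

Result: On three benchmarks (LandCover.ai, VDD, and PhenoBench), DAS-SK achieves state-of-the-art (SOTA) performance while being more efficient than CNN-, transformer-, and hybrid-based competitors. Specifically, compared with top-performing transformer models, DAS-SK requires up to 21x fewer parameters and 19x fewer GFLOPs.

Insight: The main contribution is combining selective kernel convolution (SK-Conv) with the dual atrous separable convolution (DAS-Conv) module to strengthen multi-scale feature learning, and improving the ASPP module to capture fine-grained local structure alongside global context. Viewed objectively, this lightweight design significantly reduces model complexity and computational cost while maintaining high accuracy, and offers a useful reference for real-time applications on edge devices.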

Abstract: Semantic segmentation in high-resolution agricultural imagery demands models that strike a careful balance between accuracy and computational efficiency to enable deployment in practical systems. In this work, we propose DAS-SK, a novel lightweight architecture that retrofits selective kernel convolution (SK-Conv) into the dual atrous separable convolution (DAS-Conv) module to strengthen multi-scale feature learning. The model further enhances the atrous spatial pyramid pooling (ASPP) module, enabling the capture of fine-grained local structures alongside global contextual information. Built upon a modified DeepLabV3 framework with two complementary backbones, MobileNetV3-Large and EfficientNet-B3, the DAS-SK model mitigates limitations associated with large dataset requirements, limited spectral generalization, and the high computational cost that typically restricts deployment on UAVs and other edge devices. Comprehensive experiments across three benchmarks (LandCover.ai, VDD, and PhenoBench) demonstrate that DAS-SK consistently achieves state-of-the-art performance, while being more efficient than CNN-, transformer-, and hybrid-based competitors. Notably, DAS-SK requires up to 21x fewer parameters and 19x fewer GFLOPs than top-performing transformer models. These findings establish DAS-SK as a robust, efficient, and scalable solution for real-time agricultural robotics and high-resolution remote sensing, with strong potential for broader deployment in other vision domains.
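The efficiency of atrous separable convolutions in general can be illustrated with parameter-count arithmetic. This sketch covers only the generic building blocks (depthwise separable convolution and dilation), not the specific DAS-Conv or SK-Conv layouts, and the kernel size, channel counts, and dilation rate are arbitrary example values.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Depthwise k x k weights plus pointwise 1 x 1 weights (bias ignored)."""
    return k * k * c_in + c_in * c_out


def effective_kernel_size(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


std = standard_conv_params(3, 256, 256)       # 589824
sep = separable_conv_params(3, 256, 256)      # 67840
print(std, sep, round(std / sep, 2))          # 589824 67840 8.69
print(effective_kernel_size(3, dilation=6))   # 13
```

Dilation widens the receptive field (a 3x3 kernel at rate 6 spans 13 pixels) without adding weights, while the separable factorization cuts the weight count by roughly 8.7x in this example.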


[131] Generative Regression for Left Ventricular Ejection Fraction Estimation from Echocardiography Video cs.CV

Jinrong Lv, Xun Gong, Zhaohuan Li, Weili Jiang

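TL;DR: This paper proposes a generative regression approach for estimating left ventricular ejection fraction (LVEF) from echocardiography video. The task is an ill-posed inverse problem: noise, artifacts, and limited viewing angles mean a single video may correspond to a distribution of plausible values. The authors therefore propose the Multimodal Conditional Score-based Diffusion model for Regression (MCSDR), which models the continuous posterior distribution conditioned on the video and patient demographic priors, replacing the conventional deterministic paradigm of minimizing mean squared error.

Details

Motivation: Conventional deep learning methods treat LVEF estimation as a standard regression problem and minimize the mean squared error, which forces the model to learn the conditional expectation. That yields misleading predictions when the underlying posterior is multimodal or heavy-tailed, as is common in pathological cases. A shift from deterministic regression to generative regression that models the full distribution is therefore needed.

Result: Extensive experiments on the EchoNet-Dynamic, EchoNet-Pediatric, and CAMUS datasets show that MCSDR achieves state-of-the-art performance. Qualitative analysis shows that the model's generation trajectories behave distinctively in cases with high noise or high physiological variability.

Insight: The core innovation is recasting LVEF estimation from deterministic regression to generative regression, using a conditional score-based diffusion model to model the continuous posterior directly and so handle the task's inherent ambiguity. Beyond the performance gain, the generative process adds a new dimension of interpretability for AI-aided diagnosis, for example by using trajectory analysis to reveal sources of uncertainty.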

Abstract: Estimating Left Ventricular Ejection Fraction (LVEF) from echocardiograms constitutes an ill-posed inverse problem. Inherent noise, artifacts, and limited viewing angles introduce ambiguity, where a single video sequence may map not to a unique ground truth, but rather to a distribution of plausible physiological values. Prevailing deep learning approaches typically formulate this task as a standard regression problem that minimizes the Mean Squared Error (MSE). However, this paradigm compels the model to learn the conditional expectation, which may yield misleading predictions when the underlying posterior distribution is multimodal or heavy-tailed – a common phenomenon in pathological scenarios. In this paper, we investigate the paradigm shift from deterministic regression toward generative regression. We propose the Multimodal Conditional Score-based Diffusion model for Regression (MCSDR), a probabilistic framework designed to model the continuous posterior distribution of LVEF conditioned on echocardiogram videos and patient demographic attribute priors. Extensive experiments conducted on the EchoNet-Dynamic, EchoNet-Pediatric, and CAMUS datasets demonstrate that MCSDR achieves state-of-the-art performance. Notably, qualitative analysis reveals that the generation trajectories of our model exhibit distinct behaviors in cases characterized by high noise or significant physiological variability, thereby offering a novel layer of interpretability for AI-aided diagnosis.
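
The point that MSE training pushes a model toward the conditional expectation can be seen in a toy example. The sketch below is not MCSDR; it takes a made-up bimodal set of plausible LVEF values for one ambiguous video and checks which constant prediction minimizes the MSE. The sample values and the search grid are assumptions for illustration only.

```python
# Toy illustration (hypothetical values, not from the paper): for a bimodal
# set of plausible LVEF values, the constant that minimizes MSE is the mean,
# which sits between the two modes and matches neither of them.

def mse(prediction, samples):
    """Mean squared error of a constant prediction against the samples."""
    return sum((prediction - s) ** 2 for s in samples) / len(samples)

# Hypothetical plausible LVEF values (%) for one video: modes near 30 and 60.
samples = [30.0, 31.0, 29.0, 60.0, 61.0, 59.0]

# Search candidate predictions on a 0.5-point grid from 0 to 100.
candidates = [x / 2 for x in range(0, 201)]
best = min(candidates, key=lambda c: mse(c, samples))
mean = sum(samples) / len(samples)
```

Here `best` coincides with `mean`, a value that none of the plausible readings is close to. This is the failure mode that motivates modelling the whole posterior instead of a point estimate.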


[132] Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation cs.CV

Chufeng Zhou, Jian Wang, Xinyuan Liu, Xiaokang Zhang

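TL;DR: This paper proposes a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework that strengthens the scene understanding of multimodal large language models (MLLMs), in order to resolve the semantic ambiguity and misclassification that arise in open-vocabulary remote sensing semantic segmentation when geospatial context is missing. Through two collaborating components, offline knowledge distillation and online instance reasoning, the framework generates an image-adaptive vocabulary that guides downstream models toward precise pixel-level semantic alignment.

Details

Motivation: Existing open-vocabulary remote sensing segmentation methods rely mainly on a passive mapping between visual features and text embeddings. This appearance-based paradigm lacks geospatial contextual awareness, which causes severe semantic ambiguity and misclassification for land-cover classes that have similar spectral features but different semantic attributes.

Result: Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of the approach.

Insight: The innovation is the GR-CoT framework, which combines offline knowledge distillation (establishing fine-grained category interpretation standards) with online instance reasoning (a sequential process of macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis). It actively uses geospatial context to generate an image-adaptive vocabulary that guides the segmentation model toward more precise semantic mapping, moving beyond the conventional passive, appearance-based paradigm.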

Abstract: Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This “appearance-based” paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.


[133] Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension cs.CV

Yik Lung Pang, Changjae Oh

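TL;DR: This paper proposes Chain-of-Caption, a training-free framework for improving multimodal large language models on referring expression comprehension (REC). The framework uses tool use to give the model additional visual and textual context, substantially raising accuracy on several REC benchmark datasets without any fine-tuning.

Details

Motivation: Multimodal large language models already achieve high accuracy on REC, but their performance can be improved further by supplying additional context through techniques such as Chain-of-Thought and tool use. The paper analyses the effect of different techniques for providing visual and textual context and develops a training-free framework for improving performance.

Result: Experiments on RefCOCO, RefCOCOg, RefCOCO+, and Ref-L4 show that textual or visual context alone improves REC performance. By combining multiple contexts, the training-free framework improves accuracy over the baseline model by 5% to 30% at various IoU thresholds.

Insight: The innovation is Chain-of-Caption, a framework that improves MLLM performance on REC without training. Its core is the systematic combination, via tool use, of multiple kinds of visual and textual context, which substantially improves localisation accuracy and offers an efficient, low-cost route to stronger model reasoning.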

Abstract: Given a textual description, the task of referring expression comprehension (REC) involves the localisation of the referred object in an image. Multimodal large language models (MLLMs) have achieved high accuracy on REC benchmarks through scaling up the model size and training data. Moreover, the performance of MLLMs can be further improved using techniques such as Chain-of-Thought and tool use, which provide additional visual or textual context to the model. In this paper, we analyse the effect on the REC task of various techniques for providing additional visual and textual context to the MLLM via tool use. Furthermore, we propose a training-free framework named Chain-of-Caption to improve the REC performance of MLLMs. We perform experiments on RefCOCO/RefCOCOg/RefCOCO+ and Ref-L4 datasets and show that individual textual or visual context can improve the REC performance without any fine-tuning. By combining multiple contexts, our training-free framework shows between 5% and 30% performance gain over the baseline model on accuracy at various Intersection over Union (IoU) thresholds.
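
For readers unfamiliar with the metric, accuracy at an IoU threshold counts a prediction as correct when its box overlaps the ground-truth box by at least that threshold. The sketch below is a minimal, generic version of this computation; the `(x1, y1, x2, y2)` box format and the function names are assumptions, not taken from the paper or from any benchmark toolkit.

```python
# Minimal sketch of accuracy at an IoU threshold for box predictions.
# Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2.

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, targets, threshold):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)

# Hypothetical example: one exact match and one box shifted by half its width.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 0, 15, 10)]
```

Raising the threshold makes the metric stricter: the shifted box above has an IoU of one third, so it counts at a threshold of 0.3 but not at 0.5.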


[134] Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval cs.CV

Jing Zhang, Zhikai Li, Xuewen Liu, Qingyi Gu

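TL;DR: This paper proposes Efficient-SAM2, which accelerates SAM2 through object-aware visual encoding and memory retrieval. Observing that SAM2 exhibits a sparse perception pattern similar to biological vision, the authors design Sparse Window Routing and Sparse Memory Retrieval to eliminate redundant computation on background regions and unimportant tokens, substantially speeding up inference while maintaining accuracy.

Details

Motivation: SAM2 performs very well on video object segmentation, but its heavy computational burden hinders its use in real-time video processing. Existing improvements mostly retrain a lightweight backbone, and post-training acceleration remains underexplored.

Result: Efficient-SAM2 achieves a 1.68x speedup on the SAM2.1-L model with only a 1.0% accuracy drop on the SA-V test set, with negligible additional parameters and minimal training overhead.

Insight: The innovation is exploiting SAM2's sparse perception (attention concentrates on foreground objects, and salient regions are temporally consistent) to design a computation allocation mechanism (SWR) and a memory retrieval mechanism (SMR) that avoid retraining a lightweight backbone, achieving efficient post-training acceleration. This offers a new direction for post-training optimization of large vision models.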

Abstract: Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, the heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration into post-training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to biological vision, which provides opportunities for eliminating redundant computation and for acceleration: i) In the mask decoder, the attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contribute significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which lets SAM2 adaptively focus on object regions while eliminating task-irrelevant computations, thereby significantly improving inference efficiency. Specifically, for the image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers 1.68x speedup on the SAM2.1-L model with only 1.0% accuracy drop on the SA-V test set.
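
The idea behind sparse memory retrieval, letting only the most salient memory tokens of a frame take part in attention, can be sketched as a top-k selection. This is a simplified stand-in and not the authors' SMR: the abstract does not state how saliency is scored or how many tokens are kept, so the `saliency` values, the `keep_ratio` parameter, and the function name are all assumptions.

```python
# Simplified stand-in for salient-token selection (not the authors' SMR).

def select_salient_tokens(saliency, keep_ratio):
    """Return the indices of the highest-saliency tokens, in original order."""
    k = max(1, round(len(saliency) * keep_ratio))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-token saliency for one memory frame.
saliency = [0.10, 0.90, 0.05, 0.70]
kept = select_salient_tokens(saliency, keep_ratio=0.5)

# Reusing `kept` for later queries against the same frame avoids re-scoring,
# loosely mirroring the reuse of the saliency pattern described above.
```

Attention cost grows with the number of memory tokens, so keeping half of them roughly halves the work for that frame, provided the dropped tokens contribute little.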


[135] When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning cs.CV | cs.AI | cs.CL

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao

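TL;DR: This paper proposes AVIC, an adaptive test-time visual imagination framework for visual spatial reasoning. The framework judges whether the current visual evidence is sufficient, then selectively invokes and scales world-model visual imagination to balance reasoning accuracy against computational cost.

Details

Motivation: Despite rapid progress in multimodal large language models (MLLMs), visual spatial reasoning remains unreliable when the correct answer depends on how a scene would look from unseen or alternative viewpoints. Existing work augments reasoning with visual imagination from world models, but it is still unclear when imagination is needed, how much of it helps, and when it hurts. Indiscriminate imagination can add computational overhead and even introduce misleading evidence.

Result: Experiments on spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R) reveal clear scenarios in which imagination is critical, marginal, or harmful. Selective control matches or outperforms fixed imagination strategies while substantially reducing world-model calls and language tokens.

Insight: The innovation is treating test-time visual imagination as a controllable resource and introducing an adaptive mechanism that decides whether and how much to imagine. This offers a new approach to efficient, reliable visual spatial reasoning and underlines the importance of analysing and controlling imagination during inference to avoid unnecessary computation and degraded performance.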

Abstract: Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.


[136] PISCO: Precise Video Instance Insertion with Sparse Control cs.CV | cs.AI

Xiangbo Gao, Renjie Li, Xinghao Chen, Yuheng Wu, Suofei Feng

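TL;DR: This paper proposes PISCO, a diffusion model for precise video instance insertion that supports arbitrary sparse keyframe control. Users specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps, and the method automatically propagates object appearance, motion, and interaction. It targets the need for precise, targeted modifications in professional AI-assisted filmmaking.

Details

Motivation: The aim is to move AI video generation away from general generation that depends on extensive prompt engineering and cherry-picking, toward fine-grained controllable generation and high-fidelity post-processing. In particular, professional filmmaking needs precise, targeted edits to existing footage such as video instance insertion, a task that requires precise spatial-temporal placement, physically consistent interaction, and preservation of the original dynamics.

Result: Evaluation is carried out on the newly constructed PISCO-Bench benchmark using both reference-based and reference-free perceptual metrics. Experiments show that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control, and that its performance improves clearly and monotonically as more control signals are provided.

Insight: The claimed innovations are Variable-Information Guidance for robust conditioning, Distribution-Preserving Temporal Masking for stable temporal generation, and geometry-aware conditioning for realistic scene adaptation, which together address the severe distribution shift that sparse conditioning induces in pretrained video diffusion models. Viewed objectively, combining sparse keyframe control with techniques aimed specifically at distribution shift and temporal stability makes this a noteworthy fine-grained control approach for video editing.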

Abstract: The landscape of AI video generation is undergoing a pivotal shift: moving beyond general generation - which relies on exhaustive prompt-engineering and “cherry-picking” - towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task imposes several requirements: precise spatial-temporal placement, physically consistent scene interaction, and the faithful preservation of original dynamics - all achieved under minimal user effort. In this paper, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO allows users to specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps, and automatically propagates object appearance, motion, and interaction. To address the severe distribution shift induced by sparse conditioning in pretrained video diffusion models, we introduce Variable-Information Guidance for robust conditioning and Distribution-Preserving Temporal Masking to stabilize temporal generation, together with geometry-aware conditioning for realistic scene adaptation. We further construct PISCO-Bench, a benchmark with verified instance annotations and paired clean background videos, and evaluate performance using both reference-based and reference-free perceptual metrics. Experiments demonstrate that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control, and exhibits clear, monotonic performance improvements as additional control signals are provided. Project page: xiangbogaobarry.github.io/PISCO.


[137] Tighnari v2: Mitigating Label Noise and Distribution Shift in Multimodal Plant Distribution Prediction via Mixture of Experts and Weakly Supervised Learning cs.CV | cs.AI

Haixu Liu, Yufei Wang, Tianxiang Xu, Chuancheng Shi, Hongsheng Xing

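TL;DR: This paper proposes the Tighnari v2 framework to address the challenges that sparse and biased observations, label noise, and distribution shift pose for multimodal plant distribution prediction. It handles the geographic alignment of Presence-Only (PO) data with a new pseudo-label aggregation strategy, and fuses satellite imagery, tabular features, and time-series data with a stackable serial tri-modal cross-attention mechanism. To deal with the geographic distribution shift between Presence-Absence (PA) training and test samples and the label noise in PO data, it adopts a mixture-of-experts paradigm for partitioned inference and post-processing. Results on the GeoLifeCLEF 2025 dataset confirm superior performance when PA coverage is limited and distribution shift is pronounced.

Details

Motivation: Large-scale, cross-species plant distribution prediction is crucial for biodiversity conservation. However, observational data are sparse and biased, PO data carry severe label noise, and PA data show a geographic distribution shift between training and test samples, all of which make modelling difficult.

Result: Experiments on the GeoLifeCLEF 2025 dataset show that the approach achieves superior predictive performance in scenarios with limited PA coverage and pronounced distribution shift.

Insight: The innovations are: 1) a pseudo-label aggregation strategy for PO data based on the geographic coverage of satellite imagery, which geographically aligns the label space with the remote sensing feature space; 2) a stackable serial tri-modal cross-attention mechanism that improves the fusion of heterogeneous modalities; 3) a mixture-of-experts scheme that partitions test samples by their spatial proximity to PA samples and uses models trained on different datasets for inference and post-processing in each partition, mitigating label noise and distribution shift.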

Abstract: Large-scale, cross-species plant distribution prediction plays a crucial role in biodiversity conservation, yet modeling efforts in this area still face significant challenges due to the sparsity and bias of observational data. Presence-Absence (PA) data provide accurate and noise-free labels, but are costly to obtain and limited in quantity; Presence-Only (PO) data, by contrast, offer broad spatial coverage and rich spatiotemporal distribution, but suffer from severe label noise in negative samples. To address these real-world constraints, this paper proposes a multimodal fusion framework that fully leverages the strengths of both PA and PO data. We introduce an innovative pseudo-label aggregation strategy for PO data based on the geographic coverage of satellite imagery, enabling geographic alignment between the label space and remote sensing feature space. In terms of model architecture, we adopt Swin Transformer Base as the backbone for satellite imagery, utilize the TabM network for tabular feature extraction, retain the Temporal Swin Transformer for time-series modeling, and employ a stackable serial tri-modal cross-attention mechanism to optimize the fusion of heterogeneous modalities. Furthermore, empirical analysis reveals significant geographic distribution shifts between PA training and test samples, and models trained by directly mixing PO and PA data tend to experience performance degradation due to label noise in PO data. To address this, we draw on the mixture-of-experts paradigm: test samples are partitioned according to their spatial proximity to PA samples, and different models trained on distinct datasets are used for inference and post-processing within each partition. Experiments on the GeoLifeCLEF 2025 dataset demonstrate that our approach achieves superior predictive performance in scenarios with limited PA coverage and pronounced distribution shifts.
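
The partitioning step, which routes each test sample according to its spatial proximity to PA samples, can be sketched with a great-circle distance and a cutoff. This is only an illustration of the routing idea: the abstract does not give the distance measure, the threshold, or which model serves which partition, so `radius_km`, the partition labels, and the coordinates below are all hypothetical.

```python
import math

# Illustrative routing by proximity to Presence-Absence (PA) survey sites.
# The threshold and the partition names are assumptions, not from the paper.

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def route(test_point, pa_points, radius_km):
    """Label a test point by whether any PA site lies within radius_km."""
    nearest = min(haversine_km(*test_point, *p) for p in pa_points)
    return "near_pa" if nearest <= radius_km else "far_from_pa"

# Hypothetical PA site and two test locations.
pa_points = [(48.0, 2.0)]
close_label = route((48.0, 2.0), pa_points, radius_km=10.0)
far_label = route((50.0, 2.0), pa_points, radius_km=10.0)
```

Each label would then select a different expert, for example one trained mainly on PA data for nearby samples and one that draws on PO data for distant ones; that mapping is likewise an assumption here.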


[138] UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science cs.CV | cs.AI

Jie Zhang, Xingtong Yu, Yuan Fang, Rudi Stouffs, Zdravko Trivic

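TL;DR: This paper proposes a framework for learning spatially grounded multimodal embeddings for urban science, comprising a spatially aligned dataset (UGData), a two-stage training strategy (UGE), and a comprehensive evaluation benchmark (UGBench). By aligning street-view images with structured spatial graphs and supervising with spatial reasoning paths and context captions, the framework aims to learn transferable, spatially grounded multimodal embeddings that support urban understanding tasks such as geolocation and image retrieval.

Details

Motivation: Urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure, which makes learning transferable multimodal embeddings difficult.

Result: UGE is built on several state-of-the-art vision-language backbones. With the Qwen2.5-VL-7B backbone, it improves image retrieval by up to 44% and geolocation ranking by 30% on training cities, and gains over 30% and 22% respectively on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.

Insight: The innovation is a spatially aligned dataset and a two-stage training strategy that progressively align images, text, and spatial structure, with graph encoding that explicitly models spatial relations such as distance, directionality, and connectivity. Together they provide new supervision signals and an evaluation benchmark for multimodal urban understanding.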

Abstract: Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. We finally introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban understanding tasks – including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built upon Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and over 30% and 22% gains respectively on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.


[139] What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning cs.CV

Yujin Zhou, Pengcheng Wen, Jiale Chen, Boqin Yin, Han Zhu

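TL;DR: This paper addresses reasoning-process errors made by large vision language models (LVLMs) under the thinking with images paradigm, and builds the first comprehensive benchmark dedicated to evaluating process reward models (PRMs) in that setting. From an analysis of reasoning trajectories and guided search experiments, it defines 7 fine-grained error types and constructs a dataset of 1,206 manually annotated trajectories. Experiments show that existing LVLMs fall short as PRMs, with performance disparities across error types, a positive evaluation bias, and sensitivity to step position.

Details

Motivation: Under the thinking with images paradigm, models dynamically edit and re-encode visual information at each reasoning step, and diverse errors can occur along the way. PRMs are needed to distinguish positive from negative reasoning steps, but existing PRM benchmarks are mainly text-centric and lack comprehensive assessment under this paradigm.

Result: Experiments on the new benchmark show that current LVLMs perform poorly as PRMs: their ability to evaluate visual reasoning processes is limited, performance varies significantly across error types, they show a positive evaluation bias, and they are sensitive to the position of a reasoning step. The benchmark exposes these shortcomings effectively.

Insight: The innovation is establishing the first dedicated PRM evaluation benchmark for the thinking with images paradigm, with comprehensive evaluation enabled by a fine-grained error taxonomy and manually annotated trajectories. Viewed objectively, the work systematically reveals key deficiencies of LVLMs in process-level reward modelling and provides a foundation and direction for improving future PRMs.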

Abstract: The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.


[140] E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs cs.CV

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou

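TL;DR: This paper proposes E-VAds, the first benchmark for e-commerce short video understanding. It contains 3,961 high-quality Taobao videos and 19,785 open-ended Q&A pairs, covering five tasks across two dimensions, Perception and Cognition and Reasoning. The authors also develop E-VAds-R1, a reinforcement-learning-based reasoning model with a multi-grained reward design (MG-GRPO), which markedly improves commercial intent reasoning with only a small number of training samples.

Details

Motivation: Existing video understanding benchmarks mainly target general-purpose tasks and overlook the commercial intent reasoning demanded by the goal-driven format and dense multimodal signals of e-commerce short videos, so current models perform poorly in this domain.

Result: Experiments show that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning using only a few hundred training samples.

Insight: The innovations are: a multimodal information density assessment framework that quantifies the complexity of e-commerce content; E-VAds, the first benchmark for e-commerce short video understanding; and E-VAds-R1, an RL-based model whose multi-grained reward strategy MG-GRPO balances exploration against precision.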

Abstract: E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark (E-VAds), which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, namely Perception and Cognition and Reasoning, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.


[141] Learning Self-Correction in Vision-Language Models via Rollout Augmentation cs.CV | cs.CL | cs.LG

Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang

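TL;DR: This paper proposes Octopus, a reinforcement learning (RL) rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts, addressing the sparsity of self-correction learning signals in vision-language models (VLMs). Combined with a response-masking strategy that decouples self-correction from direct reasoning, the method learns both behaviours effectively. On this basis the authors develop Octopus-8B, which achieves SOTA performance among open-source VLMs across 7 benchmarks.

Details

Motivation: Existing RL methods struggle to learn self-correction in VLMs because effective self-correction behaviour emerges only rarely, which makes the learning signal extremely sparse and hard to optimize.

Result: Octopus-8B achieves SOTA performance among open-source VLMs across 7 benchmarks, outperforming the best RLVR baseline by 1.0 points while needing only 0.72x the training time per step.

Insight: The innovations are: 1) the correction-specific rollouts (Octopus) framework, which recombines rollouts to improve sample efficiency and stabilize RL optimization; 2) a response-masking strategy that decouples self-correction from direct reasoning to avoid signal conflicts; 3) a reasoning VLM with controllable self-correction capability that improves on both performance and efficiency.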

Abstract: Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.


[142] D$^2$-VR: Degradation-Robust and Distilled Video Restoration with Synergistic Optimization Strategy cs.CV

Jianfeng Liang, Shaocheng Shen, Botao Xu, Qiang Hu, Xiaoyun Zhang

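TL;DR: This paper proposes D$^2$-VR, a video restoration framework based on a single-image diffusion model that targets the high inference latency and temporal instability of existing methods under complex real-world degradations. It achieves efficient, temporally consistent restoration by designing a degradation-robust flow alignment module, compressing the sampling trajectory through adversarial distillation, and introducing a synergistic optimization strategy.

Details

Motivation: Existing video restoration methods that combine diffusion priors with temporal alignment deliver excellent perceptual quality, but under complex real-world degradations they suffer from prohibitive inference latency and temporal instability, which blocks practical deployment.

Result: Extensive experiments show that D$^2$-VR achieves state-of-the-art video restoration performance while accelerating the sampling process by 12x.

Insight: The innovations are: 1) a Degradation-Robust Flow Alignment module that uses confidence-aware attention to filter unreliable motion cues; 2) an adversarial distillation paradigm that compresses the diffusion sampling trajectory into a fast few-step regime; 3) a synergistic optimization strategy that reconciles perceptual quality with strict temporal consistency. Viewed objectively, the work combines efficient inference with robust temporal processing and offers a practical solution for video restoration in real applications.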

Abstract: The integration of diffusion priors with temporal alignment has emerged as a transformative paradigm for video restoration, delivering fantastic perceptual quality, yet the practical deployment of such frameworks is severely constrained by prohibitive inference latency and temporal instability when confronted with complex real-world degradations. To address these limitations, we propose D$^2$-VR, a single-image diffusion-based video-restoration framework with low-step inference. To obtain precise temporal guidance under severe degradation, we first design a Degradation-Robust Flow Alignment (DRFA) module that leverages confidence-aware attention to filter unreliable motion cues. We then incorporate an adversarial distillation paradigm to compress the diffusion sampling trajectory into a rapid few-step regime. Finally, a synergistic optimization strategy is devised to harmonize perceptual quality with rigorous temporal consistency. Extensive experiments demonstrate that D$^2$-VR achieves state-of-the-art performance while accelerating the sampling process by 12$\times$.


[143] Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition cs.CV

Yuhao Dong, Shulin Tian, Shuai Liu, Shuangrui Ding, Yuhang Zang

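TL;DR: This paper proposes the Demo-ICL task and the Demo-ICL-Bench benchmark for evaluating how well multimodal large language models (MLLMs) learn in context from a few video demonstrations. The authors also develop a model named Demo-ICL, trained with a two-stage strategy, and validate its effectiveness on the new benchmark.

Details

Motivation: Existing video benchmarks mainly assess a model's ability to understand video from static internal knowledge, not its ability to learn and adapt from a few dynamic, novel in-context examples. The paper aims to close this gap.

Result: Extensive experiments on Demo-ICL-Bench confirm that the benchmark is challenging and demonstrate the effectiveness of the Demo-ICL model, pointing to directions for future research.

Insight: The innovation is a new task and benchmark focused on video in-context learning from demonstrations, together with an MLLM trained in two stages (video-supervised fine-tuning and information-assisted direct preference optimization) to meet the challenge. The work emphasizes a model's ability to acquire procedural knowledge from dynamic context.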

Abstract: Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models’ static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model’s ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.


[144] Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries cs.CV | cs.AI

Haocheng Lu, Nan Zhang, Wei Tao, Xiaoyang Qu, Guokuan Li

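TL;DR: This paper proposes Vista, a framework for the challenges of streaming video question answering (Streaming Video QA). Through scene-aware segmentation, compression, and recall, it enables efficient and scalable reasoning over continuous video streams.

Details

Motivation: Existing methods based on fixed-size memory or naive compression tend to lose context or overflow memory in long-form, real-time scenarios, and cannot handle frames that arrive sequentially together with user queries issued at arbitrary time points.

Result: On the StreamingBench benchmark, Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.

Insight: The innovations are dynamically clustering frames into scene units, compressing each scene into a compact token representation stored in GPU memory for efficient retrieval, and selectively recalling relevant scenes at query time. These mechanisms support long-context reasoning while keeping latency and memory use low, and they are compatible with a variety of vision-language backbones.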

Abstract: Streaming video question answering (Streaming Video QA) poses distinct challenges for multimodal large language models (MLLMs), as video frames arrive sequentially and user queries can be issued at arbitrary time points. Existing solutions relying on fixed-size memory or naive compression often suffer from context loss or memory overflow, limiting their effectiveness in long-form, real-time scenarios. We present Vista, a novel framework for scene-aware streaming video QA that enables efficient and scalable reasoning over continuous video streams. The innovation of Vista can be summarized in three aspects: (1) scene-aware segmentation, where Vista dynamically clusters incoming frames into temporally and visually coherent scene units; (2) scene-aware compression, where each scene is compressed into a compact token representation and stored in GPU memory for efficient index-based retrieval, while full-resolution frames are offloaded to CPU memory; and (3) scene-aware recall, where relevant scenes are selectively recalled and reintegrated into the model input upon receiving a query, enabling both efficiency and completeness. Vista is model-agnostic and integrates seamlessly with a variety of vision-language backbones, enabling long-context reasoning without compromising latency or memory efficiency. Extensive experiments on StreamingBench demonstrate that Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.
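
Scene-aware segmentation of a frame stream can be sketched as an online rule that opens a new scene whenever an incoming frame stops resembling the previous one. This is a deliberately simple stand-in: the abstract does not describe Vista's actual clustering criterion, so the cosine-similarity test, the `threshold` value, and the toy feature vectors are assumptions.

```python
import math

# Simple stand-in for online scene segmentation (not Vista's actual criterion).

def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def segment_scenes(frame_features, threshold=0.8):
    """Group consecutive frame indices; start a new scene when similarity drops."""
    scenes = []
    for idx, feat in enumerate(frame_features):
        if scenes and cosine(frame_features[idx - 1], feat) >= threshold:
            scenes[-1].append(idx)   # frame continues the current scene
        else:
            scenes.append([idx])     # first frame, or a visual cut
    return scenes

# Hypothetical 2-D frame features: two similar frames, a cut, two more.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scenes = segment_scenes(features)
```

Because each frame is compared only with its predecessor, the rule runs in a single pass over the stream, which suits frames that arrive sequentially. Each resulting scene could then be compressed and indexed for later recall.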


[145] TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation cs.CV

Yiyang Cao, Yunze Deng, Ziyu Lin, Bin Feng, Xinggang Wang

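TL;DR: This paper proposes TriC-Motion, a new diffusion-based framework for text-driven motion generation. It integrates modelling across the spatial, temporal, and frequency domains with causal intervention to generate high-quality, text-aligned motion sequences. Its core consists of three domain-specific modelling modules, a score-guided tri-domain fusion module, and a causality-based counterfactual motion disentangler, which together enable joint optimization and noise removal.

Details

Motivation: Current text-to-motion methods focus mainly on spatial-temporal modelling or on independent frequency-domain analysis and lack a unified framework for joint optimization across the spatial, temporal, and frequency domains. This limits the model's ability to use information from all domains at once and leads to suboptimal generation quality. In addition, motion-irrelevant cues caused by noise are often entangled with useful features, which distorts the generated motion.

Result: TriC-Motion performs strongly on the HumanML3D dataset, reaching an R@1 of 0.612 and outperforming existing state-of-the-art methods, which demonstrates its ability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences.

Insight: The main innovation is a unified framework for joint modelling and optimization across the spatial, temporal, and frequency domains, together with a counterfactual disentangling mechanism based on causal intervention that removes motion-irrelevant noise. This separates the real modelling contribution of each domain more cleanly and improves generation quality.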

Abstract: Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model’s ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention. TriC-Motion includes three core modeling modules for domain-specific modeling, namely Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. After comprehensive modeling, a Score-guided Tri-domain Fusion module integrates valuable information from the triple domains, simultaneously ensuring temporal consistency, spatial topology, motion trends, and dynamics. Moreover, the Causality-based Counterfactual Motion Disentangler is meticulously designed to expose motion-irrelevant cues to eliminate noise, disentangling the real modeling contributions of each domain for superior generation. Extensive experimental results validate that TriC-Motion achieves superior performance compared to state-of-the-art methods, attaining an outstanding R@1 of 0.612 on the HumanML3D dataset. These results demonstrate its capability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Code is available at: https://caoyiyang1105.github.io/TriC-Motion/.


[146] Gesture Matters: Pedestrian Gesture Recognition for AVs Through Skeleton Pose Evaluation cs.CV | cs.AI | cs.ET | cs.HC | cs.LGPDF

Alif Rizqullah Mahdi, Mahdi Rezaei, Natasha Merat

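TL;DR: This paper proposes a pedestrian gesture recognition framework based on 2D pose estimation to improve autonomous vehicles' understanding of pedestrians' non-verbal communication. The framework extracts normalised keypoints from real-world video sequences in the WIVW dataset, derives 76 static and dynamic features from them, and classifies gestures into four classes: Stop, Go, Thank & Greet, and No Gesture.

Details

Motivation: Autonomous vehicles struggle to interpret pedestrian gestures in traffic scenes, even though gestures are a key component of non-verbal communication in traffic and help pedestrian-to-driver interaction, particularly when formal traffic rules are insufficient.

Result: On the WIVW dataset, the method achieves a classification accuracy of 87% by analysing features such as hand position and movement velocity, and effectively distinguishes the gesture classes.

Insight: The novelty lies in applying 2D pose estimation to pedestrian gesture recognition and systematically extracting a feature set that combines static and dynamic features. Viewed objectively, the emphasis on hand position and velocity as discriminative features offers an interpretable and practical direction for improving the perception of autonomous driving systems.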

Abstract: Gestures are a key component of non-verbal communication in traffic, often helping pedestrian-to-driver interactions when formal traffic rules may be insufficient. This problem becomes more apparent when autonomous vehicles (AVs) struggle to interpret such gestures. In this study, we present a gesture classification framework using 2D pose estimation applied to real-world video sequences from the WIVW dataset. We categorise gestures into four primary classes (Stop, Go, Thank & Greet, and No Gesture) and extract 76 static and dynamic features from normalised keypoints. Our analysis demonstrates that hand position and movement velocity are especially discriminative in distinguishing between gesture classes, achieving a classification accuracy score of 87%. These findings not only improve the perceptual capabilities of AV systems but also contribute to the broader understanding of pedestrian behaviour in traffic contexts.


[147] Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation? cs.CVPDF

Caterina Fuster-Barceló, Virginie Uhlmann

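TL;DR: This study examines how well vision foundation models (VFMs) transfer to electron microscopy (EM) image segmentation, focusing on mitochondria segmentation. Using two public EM datasets (Lucchi++ and VNC) and three VFMs (DINOv2, DINOv3, and OpenCLIP), it evaluates two adaptation strategies: frozen-backbone training and parameter-efficient fine-tuning with LoRA. Training on a single dataset yields good segmentation performance, and LoRA further improves in-domain performance. Joint training on multiple datasets, however, causes severe performance degradation for all models, with limited improvement from PEFT. The analysis shows that, although the datasets look similar, their latent representations exhibit a pronounced domain mismatch, which is consistent with the failure of cross-domain generalization.

Details

Motivation: To investigate whether the latent representations of vision foundation models are general enough to support effective transfer and reuse for segmentation across heterogeneous electron microscopy image datasets, specifically for mitochondria segmentation.

Result: When trained on a single EM dataset, all VFM backbones achieve good segmentation performance (measured by foreground IoU), and LoRA fine-tuning consistently improves in-domain performance. When trained jointly on the Lucchi++ and VNC datasets, however, all models degrade severely, and parameter-efficient fine-tuning (PEFT) brings only marginal gains.

Insight: The paper's contribution is a systematic evaluation of the cross-domain transferability of VFMs in biomedical image analysis (EM segmentation in particular), revealing a pronounced domain mismatch in latent representations even between visually similar EM datasets. Viewed objectively, the core insight is that current lightweight adaptation strategies such as LoRA can optimize single-domain performance but are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms, which points to a direction for future research.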

Abstract: Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, it remains unclear whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets. Here, we study this question for the problem of mitochondria segmentation in electron microscopy (EM) images, using two popular public EM datasets (Lucchi++ and VNC) and three recent representative VFMs (DINOv2, DINOv3, and OpenCLIP). We evaluate two practical model adaptation regimes: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) in which the VFM is fine-tuned in a targeted manner to a specific dataset. Across all backbones, we observe that training on a single EM dataset yields good segmentation performance (quantified as foreground Intersection-over-Union), and that LoRA consistently improves in-domain performance. In contrast, training on multiple EM datasets leads to severe performance degradation for all models considered, with only marginal gains from PEFT. Exploration of the latent representation space through various techniques (PCA, Fréchet Dinov2 distance, and linear probes) reveals a pronounced and persistent domain mismatch between the two considered EM datasets in spite of their visual similarity, which is consistent with the observed failure of paired training. These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.
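
Illustrative sketch (not from the paper): the abstract quantifies segmentation quality as foreground Intersection-over-Union. A minimal Python version of that metric for binary masks might look as follows. The function name `foreground_iou`, the nested-list mask layout, and the convention of returning 1.0 when both masks are empty are assumptions made for this example, not details taken from the authors' code.

```python
def foreground_iou(pred, target):
    """Foreground IoU for two equally sized binary masks (nested lists of 0/1).

    Hypothetical helper for illustration only.
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1  # foreground in both masks
            if p or t:
                union += 1  # foreground in at least one mask

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy 2x3 masks (made-up values, not data from the paper).
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
iou = foreground_iou(pred, target)
```

On the toy masks, two pixels are foreground in both masks and four are foreground in at least one, so the IoU is 0.5.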


[148] GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving cs.CVPDF

Linger Deng, Yuliang Liu, Wenwen Yu, Zujia Zhang, Jianzhong Ju

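TL;DR: This paper proposes GeoFocus, a framework built on two core modules, Critical Local Perceptor and VertexLang. By combining theory-based perception templates with a compact topology formal language, it strengthens the global and local perception of multimodal models in geometry problem-solving and achieves notable gains on several benchmarks.

Details

Motivation: Large Multimodal Models (LMMs) face a challenge in geometry problem-solving: they must handle both global shape recognition and intricate local relationships grounded in geometric theory (e.g., angles, parallel lines, comparative distances). The paper aims to improve models' understanding of geometric structure.

Result: On the Geo3K, GeoQA, and FormalGeo7K benchmarks, GeoFocus improves accuracy by 4.7% over leading specialized models. It shows stronger robustness on MATHVERSE under diverse visual conditions, increases critical local feature coverage by 61%, and reduces global perception training time by 20%.

Insight: The contributions are: 1) theory-driven perception templates that automatically identify critical local structures and strengthen local feature extraction; 2) VertexLang, a compact language that replaces bulky code-based encodings and represents global topological relations efficiently, balancing computational efficiency and accuracy.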

Abstract: Geometry problem-solving remains a significant challenge for Large Multimodal Models (LMMs), requiring not only global shape recognition but also attention to intricate local relationships related to geometric theory. To address this, we propose GeoFocus, a novel framework comprising two core modules. 1) Critical Local Perceptor, which automatically identifies and emphasizes critical local structures (e.g., angles, parallel lines, comparative distances) through thirteen theory-based perception templates, boosting critical local feature coverage by 61% compared to previous methods. 2) VertexLang, a compact topology formal language that encodes global figures through vertex coordinates and connectivity relations. By replacing bulky code-based encodings, VertexLang reduces global perception training time by 20% while improving topology recognition accuracy. When evaluated on Geo3K, GeoQA, and FormalGeo7K, GeoFocus achieves a 4.7% accuracy improvement over leading specialized models and demonstrates superior robustness in MATHVERSE under diverse visual conditions. Project Page – https://github.com/dle666/GeoFocus


[149] TIBR4D: Tracing-Guided Iterative Boundary Refinement for Efficient 4D Gaussian Segmentation cs.CV | cs.GRPDF

He Wu, Xia Yan, Yanghui Xu, Liegang Xia, Jiazhou Chen

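TL;DR: This paper proposes TIBR4D, a learning-free 4D Gaussian segmentation framework for object-level segmentation in dynamic 4D Gaussian scenes. It lifts video segmentation masks to 4D space through a two-stage iterative boundary refinement to handle complex motion, occlusions, and ambiguous boundaries.

Details

Motivation: Object-level segmentation in dynamic 4D Gaussian scenes is challenging because of complex motion, occlusions, and ambiguous boundaries, which existing one-shot threshold-based methods struggle to handle.

Result: On the HyperNeRF and Neu3D benchmarks, the method produces accurate object Gaussian point clouds with clearer boundaries and higher efficiency than SOTA methods.

Insight: The contribution is a two-stage iterative boundary refinement. The first stage, Iterative Gaussian Instance Tracing (IGIT), progressively refines Gaussian-to-instance probabilities at the temporal segment level to better handle occlusions and preserve the completeness of object structures. The second stage, Gaussian Rendering Range Control (RCC), suppresses highly uncertain Gaussians near object boundaries while retaining their core contributions, yielding more accurate boundaries. In addition, a temporal segmentation merging strategy in IGIT balances identity consistency and dynamic awareness.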

Abstract: Object-level segmentation in dynamic 4D Gaussian scenes remains challenging due to complex motion, occlusions, and ambiguous boundaries. In this paper, we present an efficient learning-free 4D Gaussian segmentation framework that lifts video segmentation masks to 4D spaces, whose core is a two-stage iterative boundary refinement, TIBR4D. The first stage is an Iterative Gaussian Instance Tracing (IGIT) at the temporal segment level. It progressively refines Gaussian-to-instance probabilities through iterative tracing, and extracts corresponding Gaussian point clouds that better handle occlusions and preserve completeness of object structures compared to existing one-shot threshold-based methods. The second stage is a frame-wise Gaussian Rendering Range Control (RCC) via suppressing highly uncertain Gaussians near object boundaries while retaining their core contributions for more accurate boundaries. Furthermore, a temporal segmentation merging strategy is proposed for IGIT to balance identity consistency and dynamic awareness. Longer segments enforce stronger multi-frame constraints for stable identities, while shorter segments allow identity changes to be captured promptly. Experiments on HyperNeRF and Neu3D demonstrate that our method produces accurate object Gaussian point clouds with clearer boundaries and higher efficiency compared to SOTA methods.


[150] GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing cs.CV | cs.AI | cs.LG | cs.MM | eess.IVPDF

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

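TL;DR: GOT-Edit is an online cross-modality model editing approach for generic object tracking that strengthens conventional 2D-feature-based trackers with 3D geometry-aware cues. It uses a pre-trained Visual Geometry Grounded Transformer to infer geometric cues from only a few 2D images, and performs online model editing with null-space constrained updates to incorporate geometric information while preserving semantic discrimination, giving more robust and accurate tracking under occlusion and clutter.

Details

Motivation: Existing generic object tracking methods rely mainly on 2D features of the target and neglect 3D geometric cues, which makes them insufficiently robust to partial occlusion, distractors, and variations in geometry and appearance. The paper addresses this limitation by integrating the prior 3D knowledge and semantic reasoning that human perception uses implicitly.

Result: Extensive experiments on multiple generic object tracking benchmarks show that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning.

Insight: The novelty is an online cross-modality model editing framework that uses null-space constrained updates to integrate geometric cues into the tracker seamlessly while preserving semantic discrimination. Viewed objectively, the method makes effective use of the geometric priors in a pre-trained model and offers a new way to bring 3D reasoning into 2D vision tasks.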

Abstract: Human perception for effective object tracking in a 2D video stream arises from the implicit use of prior 3D knowledge combined with semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings while neglecting 3D geometric cues, which makes them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to enable geometric cue inference from only a few 2D images. To tackle the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing with null-space constrained updates that incorporate geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking.
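
Illustrative sketch (not from the paper): the abstract does not say how the null-space constrained updates are computed. One generic way to build such an update for a single linear weight vector is to project a candidate update onto the orthogonal complement of the feature vectors whose responses should stay unchanged. The Python below shows that generic projection only. All names (`project_to_null_space`, `preserved_features`) and the toy numbers are hypothetical, and the actual GOT-Edit procedure may differ.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(rows, eps=1e-12):
    """Gram-Schmidt basis for the span of `rows`; near-dependent rows are dropped."""
    basis = []
    for row in rows:
        v = list(row)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > eps:
            basis.append([vi / norm for vi in v])
    return basis


def project_to_null_space(update, preserved_features):
    """Remove from `update` every component lying in the span of the preserved features."""
    projected = list(update)
    for q in orthonormal_basis(preserved_features):
        c = dot(projected, q)
        projected = [u - c * qi for u, qi in zip(projected, q)]
    return projected


# Toy example: two feature vectors whose responses should not change (assumed values).
preserved_features = [[1.0, 0.0, 0.0],
                      [1.0, 1.0, 0.0]]
candidate_update = [0.5, -0.2, 0.3]
safe_update = project_to_null_space(candidate_update, preserved_features)
```

Because the projected update is orthogonal to every preserved feature, adding it to a weight vector `w` leaves the response `w · k` unchanged for each preserved feature `k`, up to floating-point error.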


[151] SemiNFT: Learning to Transfer Presets from Imitation to Appreciation via Hybrid-Sample Reinforcement Learning cs.CVPDF

Melany Yang, Yuhang Yu, Diwang Weng, Jinwei Chen, Wei Dong

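TL;DR: This paper proposes SemiNFT, a Diffusion Transformer (DiT)-based framework for reference-image preset color transfer that mirrors human artistic training, moving from rigid imitation to intuitive creation. The method first learns basic structural preservation and color mapping skills from paired triplets, then uses reinforcement learning on unpaired data to cultivate nuanced aesthetic perception, with a hybrid online-offline reward mechanism to prevent catastrophic forgetting. Experiments show that SemiNFT outperforms existing methods on standard preset transfer benchmarks and performs well on zero-shot tasks such as black-and-white photo colorization and cross-domain (anime-to-photo) transfer.

Details

Motivation: Existing reference-based color transfer methods perform global color mapping from pixel-level statistics alone, without a real understanding of semantic context or human aesthetics. The paper aims to address this so that non-experts can also achieve professional-level photo retouching.

Result: SemiNFT surpasses existing state-of-the-art (SOTA) methods on standard preset transfer benchmarks and shows remarkable intelligence on zero-shot tasks such as black-and-white photo colorization and anime-to-photo cross-domain preset transfer.

Insight: The contributions are a two-stage training scheme that mirrors the trajectory of human artistic learning (imitation to creation), a hybrid-sample reinforcement learning framework that combines paired and unpaired data, and a hybrid online-offline reward mechanism that prevents catastrophic forgetting. Viewed objectively, the method brings semantic understanding and aesthetic perception into color transfer and goes beyond conventional statistical matching.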

Abstract: Photorealistic color retouching plays a vital role in visual content creation, yet manual retouching remains inaccessible to non-experts due to its reliance on specialized expertise. Reference-based methods offer a promising alternative by transferring the preset color of a reference image to a source image. However, these approaches often operate as novice learners, performing global color mappings derived from pixel-level statistics, without a true understanding of semantic context or human aesthetics. To address this issue, we propose SemiNFT, a Diffusion Transformer (DiT)-based retouching framework that mirrors the trajectory of human artistic training: beginning with rigid imitation and evolving into intuitive creation. Specifically, SemiNFT is first taught with paired triplets to acquire basic structural preservation and color mapping skills, and then advanced to reinforcement learning (RL) on unpaired data to cultivate nuanced aesthetic perception. Crucially, during the RL stage, to prevent catastrophic forgetting of old skills, we design a hybrid online-offline reward mechanism that anchors aesthetic exploration with structural review. Extensive experiments show that SemiNFT not only outperforms state-of-the-art methods on standard preset transfer benchmarks but also demonstrates remarkable intelligence in zero-shot tasks, such as black-and-white photo colorization and cross-domain (anime-to-photo) preset transfer. These results confirm that SemiNFT transcends simple statistical matching and achieves a sophisticated level of aesthetic comprehension. Our project can be found at https://melanyyang.github.io/SemiNFT/.


[152] Overview and Comparison of AVS Point Cloud Compression Standard cs.CVPDF

Wei Gao, Wenxu Gao, Xingming Mu, Changhao Peng, Ge Li

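TL;DR: This paper surveys AVS PCC, the first-generation point cloud compression standard developed by China's Audio Video coding Standard (AVS) Workgroup, reviewing it from the perspectives of technology and performance comparison and contrasting it with MPEG's G-PCC and V-PCC standards.

Details

Motivation: The large data size of point clouds poses challenges to transmission and storage and hinders wide deployment, so point cloud compression is crucial in practical applications for both human and machine perception optimization.

Result: The AVS PCC standard adopts many new coding tools and techniques that differ from the MPEG standards, but the abstract reports no specific quantitative results or benchmark performance.

Insight: As a point cloud compression standard developed independently in China, AVS PCC introduces new coding tools and adds a further option to the point cloud compression field. Its standardization process and technology integration strategy may serve as a reference.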

Abstract: Point clouds are a prevalent 3D data representation format with significant application value in immersive media, autonomous driving, digital heritage protection, etc. However, the large data size of point clouds poses challenges to transmission and storage, which hinders wide deployment. Therefore, point cloud compression plays a crucial role in practical applications for both human and machine perception optimization. To this end, the Moving Picture Experts Group (MPEG) has established two standards for point cloud compression, including Geometry-based Point Cloud Compression (G-PCC) and Video-based Point Cloud Compression (V-PCC). In the meantime, the Audio Video coding Standard (AVS) Workgroup of China has also launched and completed the development of its first-generation point cloud compression standard, namely AVS PCC. This new standardization effort has adopted many new coding tools and techniques, which are different from those of the counterpart standards. This paper reviews the AVS PCC standard from two perspectives, i.e., the related technologies and performance comparisons.


[153] Improving Reconstruction of Representation Autoencoder cs.CVPDF

Siyu Liu, Chujie Qin, Hubery Yin, Qixin Yan, Zheng-Peng Duan

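TL;DR: This paper proposes LV-RAE, a representation autoencoder that addresses the loss of reconstruction fidelity in latent diffusion models (LDMs) built on vision foundation models, whose semantic features lack low-level information such as color and texture. The method augments the semantic features with low-level information to achieve high-fidelity reconstruction while remaining highly aligned with the semantic distribution. Because the resulting high-dimensional, information-rich latents make the decoder sensitive to latent perturbations, the paper also fine-tunes the decoder for robustness and smooths generated latents through controlled noise injection, improving generation quality.

Details

Motivation: Latent diffusion models (LDMs) that use vision foundation models as image encoders have semantic features that are easy to learn, but these features often lack low-level information (e.g., color and texture). The resulting drop in reconstruction fidelity has become a primary bottleneck in further scaling LDMs.

Result: Experiments show that LV-RAE significantly improves reconstruction fidelity while preserving semantic abstraction and achieving strong generation quality.

Insight: The paper's contribution is LV-RAE, a representation autoencoder that augments semantic features with low-level information to address reconstruction fidelity. In addition, based on the analysis that the decoder's sensitivity to latent perturbations stems from excessive responses along directions off the data manifold, it proposes fine-tuning the decoder for robustness and smoothing latents through controlled noise injection to improve generation quality. Viewed objectively, the method fuses low-level visual information effectively while preserving semantic alignment, offering a new approach to designing high-fidelity generative models.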

Abstract: Recent work leverages Vision Foundation Models as image encoders to boost the generative performance of latent diffusion models (LDMs), as their semantic feature distributions are easy to learn. However, such semantic features often lack low-level information (e.g., color and texture), leading to degraded reconstruction fidelity, which has emerged as a primary bottleneck in further scaling LDMs. To address this limitation, we propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information, enabling high-fidelity reconstruction while remaining highly aligned with the semantic distribution. We further observe that the resulting high-dimensional, information-rich latents make decoders sensitive to latent perturbations, causing severe artifacts when decoding generated latents and consequently degrading generation quality. Our analysis suggests that this sensitivity primarily stems from excessive decoder responses along directions off the data manifold. Building on these insights, we propose fine-tuning the decoder to increase its robustness and smoothing the generated latents via controlled noise injection, thereby enhancing generation quality. Experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving the semantic abstraction and achieving strong generative quality. Our code is available at https://github.com/modyu-liu/LVRAE.


[154] Revisiting [CLS] and Patch Token Interaction in Vision Transformers cs.CVPDF

Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, Huy V. Vo

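TL;DR: This paper revisits the interaction between the [CLS] class token and patch tokens in Vision Transformers. It finds that standard normalization layers implicitly differentiate the two token types, and proposes specialized processing paths that selectively disentangle their computational flow in normalization layers and early QKV projections, significantly improving performance on dense prediction tasks.

Details

Motivation: To resolve the potential conflict between global features (the [CLS] token) and local features (patch tokens) during learning in Vision Transformers. Under different pre-training strategies in particular, the standard uniform processing of both token types may limit performance on dense prediction tasks.

Result: Experiments show segmentation gains of over 2 mIoU points on standard benchmarks while maintaining strong classification accuracy. The model's parameter count increases by only 8%, with no additional computational overhead.

Insight: The novelty lies in revealing that normalization layers implicitly differentiate the two token types and in designing targeted, disentangled processing paths on that basis. Viewed objectively, this lightweight architectural specialization offers a new way to improve the representation quality of Vision Transformers on dense tasks.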

Abstract: Vision Transformers have emerged as powerful, scalable and versatile representation learners. To capture both global and local features, a learnable [CLS] class token is typically prepended to the input sequence of patch tokens. Despite their distinct nature, both token types are processed identically throughout the model. In this work, we investigate the friction between global and local feature learning under different pre-training strategies by analyzing the interactions between class and patch tokens. Our analysis reveals that standard normalization layers introduce an implicit differentiation between these token types. Building on this insight, we propose specialized processing paths that selectively disentangle the computational flow of class and patch tokens, particularly within normalization layers and early query-key-value projections. This targeted specialization leads to significantly improved patch representation quality for dense prediction tasks. Our experiments demonstrate segmentation performance gains of over 2 mIoU points on standard benchmarks, while maintaining strong classification accuracy. The proposed modifications introduce only an 8% increase in parameters, with no additional computational overhead. Through comprehensive ablations, we provide insights into which architectural components benefit most from specialization and how our approach generalizes across model scales and learning frameworks.


[155] WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling cs.CVPDF

Yi Dao, Lankai Zhang, Hao Liu, Haiwei Zhang, Wenbo Wang

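TL;DR: This paper proposes WiFlow, a lightweight network for continuous human pose estimation from WiFi signals. With a spatio-temporal feature decoupling architecture, it achieves high accuracy at low computational cost on a self-collected dataset.

Details

Motivation: Existing WiFi-based pose estimation methods handle continuous motion poorly and incur high computational cost. The paper aims to provide a practical solution for intelligent perception in the Internet of Things (IoT).

Result: On a self-collected dataset of 360,000 samples, the method achieves a PCK@20 of 97.00%, a PCK@50 of 99.48%, and a mean per-joint position error of 0.008 m with only 4.82M parameters, significantly reducing complexity.

Insight: The contributions include an encoder-decoder structure, temporal and asymmetric convolutions that preserve the sequential structure of the signal, and axial attention that captures dependencies between keypoints, together achieving efficient spatio-temporal feature decoupling in a lightweight design.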

Abstract: Human pose estimation is fundamental to intelligent perception in the Internet of Things (IoT), enabling applications ranging from smart healthcare to human-computer interaction. While WiFi-based methods have gained traction, they often struggle with continuous motion and high computational overhead. This work presents WiFlow, a novel framework for continuous human pose estimation using WiFi signals. Unlike vision-based approaches such as two-dimensional deep residual networks that treat Channel State Information (CSI) as images, WiFlow employs an encoder-decoder architecture. The encoder captures spatio-temporal features of CSI using temporal and asymmetric convolutions, preserving the original sequential structure of signals. It then refines keypoint features of human bodies to be tracked and captures their structural dependencies via axial attention. The decoder subsequently maps the encoded high-dimensional features into keypoint coordinates. Trained on a self-collected dataset of 360,000 synchronized CSI-pose samples from 5 subjects performing continuous sequences of 8 daily activities, WiFlow achieves a Percentage of Correct Keypoints (PCK) of 97.00% at a threshold of 20% (PCK@20) and 99.48% at PCK@50, with a mean per-joint position error of 0.008m. With only 4.82M parameters, WiFlow significantly reduces model complexity and computational cost, establishing a new performance baseline for practical WiFi-based human pose estimation. Our code and datasets are available at https://github.com/DY2434/WiFlow-WiFi-Pose-Estimation-with-Spatio-Temporal-Decoupling.git.
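
Illustrative sketch (not from the paper): the two metrics reported above, Percentage of Correct Keypoints (PCK) and mean per-joint position error, can be computed roughly as shown below. The abstract does not state which reference length the PCK threshold is normalised by, so `ref_length` is an assumption here (a torso or bounding-box size is a common choice), and the function names and toy coordinates are hypothetical.

```python
import math


def pck(pred, gt, ref_length, threshold):
    """Fraction of keypoints whose error is at most threshold * ref_length.

    `ref_length` is an assumed normalisation length for this example.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    limit = threshold * ref_length
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= limit)
    return correct / len(pred)


def mean_per_joint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth keypoints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy 2D keypoints (made-up values, not data from the paper).
gt = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
pred = [(0.05, 0.0), (1.0, 0.1), (1.4, 1.0), (0.0, 1.0)]

pck_20 = pck(pred, gt, ref_length=1.0, threshold=0.2)
pck_50 = pck(pred, gt, ref_length=1.0, threshold=0.5)
mpjpe = mean_per_joint_error(pred, gt)
```

In the toy data the per-keypoint errors are about 0.05, 0.1, 0.4, and 0, so three of four keypoints pass at the 20% threshold and all four pass at 50%.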


[156] ALIVE: Animate Your World with Lifelike Audio-Video Generation cs.CVPDF

Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu

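TL;DR: ALIVE is a generation model that extends a pretrained text-to-video model to audio-video generation and animation. It augments the MMDiT architecture with a joint audio-video branch (temporally aligned cross-modal fusion and a unified temporal RoPE) and is fine-tuned on data from a purpose-built high-quality data pipeline. After training on million-level data, it outperforms open-source models and matches or surpasses state-of-the-art commercial solutions.

Details

Motivation: Video generation is evolving towards unified audio-video generation. The paper extends a text-to-video foundation model to support text-to-video-and-audio generation and reference animation, improving audio-visual synchronization and animation quality.

Result: On the newly introduced benchmark, ALIVE performs strongly, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. These results follow continued pretraining and fine-tuning on million-level high-quality data.

Insight: The contributions include precise audio-visual alignment and temporal synchronization through TA-CrossAttn and UniTemp-RoPE; a comprehensive data pipeline (audio-video captioning, quality control, and so on) for collecting high-quality fine-tuning data; and a new benchmark for model testing and comparison, intended to help the community develop audio-video generation models more efficiently.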

Abstract: Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks the Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities compared to the T2V foundation models. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continued pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.


[157] OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence cs.CVPDF

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin

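TL;DR: This paper proposes OneVision-Encoder, a video encoding architecture based on codec-aligned sparsity. Its core hypothesis is that artificial general intelligence is fundamentally a compression problem, and that visual signals are highly redundant while useful information is sparse. Through Codec Patchification, the method processes only the regions rich in signal entropy (3.1%-25%), and it is trained with a shared 3D RoPE and a cluster discrimination objective over more than one million semantic concepts to unify spatial and temporal reasoning.

Details

Motivation: Modern vision architectures have strayed from information-theoretic principles: they process dense pixel grids uniformly and waste large amounts of compute on static background instead of focusing on the predictive residuals that define motion and meaning. The paper aims to address visual understanding by aligning the architecture with the information-theoretic principles of video codecs.

Result: When integrated into an LLM, the method consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using fewer visual tokens and less pretraining data. On video understanding tasks in particular, it achieves an average improvement of 4.1% over Qwen3-ViT.

Insight: The paper's claimed contribution is establishing codec-aligned, patch-level sparsity as a foundational principle for multimodal intelligence. Viewed objectively, its core novelty is bringing the classic information-theoretic idea of video compression (attending to predictive residuals, the surprise) systematically into deep learning architecture design: selectively processing high-entropy regions makes efficiency and accuracy positively correlated rather than a conventional trade-off.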

Abstract: Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have strayed from these truths: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., Codecs. Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics. Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, it consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and less pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.


[158] Low-Light Video Enhancement with An Effective Spatial-Temporal Decomposition Paradigm cs.CVPDF

Xiaogang Xu, Kun Zhou, Tao Hu, Jiafei Wu, Ruixing Wang

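TL;DR: This paper proposes a low-light video enhancement (LLVE) method that improves performance through an effective spatial-temporal decomposition paradigm. The core framework, VLLVE, decomposes a video into a view-independent component (capturing intrinsic appearance) and a view-dependent component (describing the shading condition), and uses dynamic cross-frame correspondences and a scene-level continuity constraint to keep the decomposition consistent. The authors further propose VLLVE++, which adds an additive residual term to simulate scene-adaptive degradations, capturing video content more completely and enabling bidirectional end-to-end learning of enhancement and degradation-aware correspondence refinement. The method is evaluated extensively on widely recognized LLVE benchmarks and shows strong capability on real-world scenes and highly dynamic videos.

Details

Motivation: Low-light videos suffer from severe invisibility and noise, and the goal is to restore the visual quality of dynamic or static scenes. Existing methods may struggle to handle spatio-temporal information in video consistently, so an effective decomposition strategy is needed to better separate appearance and illumination.

Result: Extensive experiments are conducted on widely recognized low-light video enhancement (LLVE) benchmarks. VLLVE++ shows strong capability, particularly on challenging cases such as real-world scenes and highly dynamic videos.

Insight: The novelty is a spatial-temporal decomposition paradigm that splits a video into view-independent and view-dependent components, with a scene-level continuity constraint and a cross-frame interaction mechanism to keep the decomposition consistent. VLLVE++ additionally uses an additive residual term to simulate scene-adaptive degradations that are hard to model, and adopts bidirectional end-to-end learning to optimize enhancement and correspondence refinement together, which improves its ability to capture the overall video content and increases robustness.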

Abstract: Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. In this paper, we present an innovative video decomposition strategy that incorporates view-independent and view-dependent components to enhance the performance of LLVE. The framework is called View-aware Low-light Video Enhancement (VLLVE). We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent and satisfactory decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism can seamlessly integrate with encoder-decoder single-frame networks, incurring minimal additional parameter costs. Building upon VLLVE, we propose a more comprehensive decomposition strategy by introducing an additive residual term, resulting in VLLVE++. This residual term can simulate scene-adaptive degradations, which are difficult to model using a decomposition formulation for common scenes, thereby further enhancing the ability to capture the overall content of videos. In addition, VLLVE++ enables bidirectional learning for both enhancement and degradation-aware correspondence refinement in an end-to-end manner, effectively increasing reliable correspondences while filtering out incorrect ones. Notably, VLLVE++ demonstrates strong capability in handling challenging cases, such as real-world scenes and videos with high dynamics. Extensive experiments are conducted on widely recognized LLVE benchmarks.


[159] TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions cs.CVPDF

Linli Yao, Yuancheng Wei, Yaojie Zhang, Lei Li, Xinlong Chen

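TL;DR: This paper proposes Omni Dense Captioning, a new task that generates continuous, fine-grained, and structured audio-visual narratives with timestamps. To support it, the authors build OmniDCBench, a high-quality human-annotated benchmark; SodaM, a unified evaluation metric; TimeChatCap-42K, a training dataset; and TimeChat-Captioner-7B, a strong baseline trained with SFT and GRPO. Experiments show that the model reaches SOTA on dense captioning, surpasses Gemini-2.5-Pro, and significantly improves performance on downstream audio-visual reasoning and temporal grounding tasks.

Details

Motivation: Existing video captioning tasks fall short in generating continuous, fine-grained, and structured narratives. The paper aims to create screenplay-like audio-visual descriptions with explicit timestamps and dense semantic coverage.

Result: TimeChat-Captioner-7B achieves state-of-the-art performance on dense captioning, surpassing Gemini-2.5-Pro. Its generated dense descriptions significantly boost downstream performance on audio-visual reasoning benchmarks (DailyOmni and WorldSense) and a temporal grounding benchmark (Charades-STA).

Insight: The contributions are: 1) the new Omni Dense Captioning task, with a six-dimensional structural schema for generating script-like captions; 2) the high-quality benchmark OmniDCBench and the unified metric SodaM, which mitigates scene boundary ambiguity; 3) a training method that combines SFT and GRPO with task-specific rewards, which effectively improves the model on dense captioning and downstream tasks.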

Abstract: This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create “script-like” captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at https://github.com/yaolinli/TimeChat-Captioner.


[160] Towards Understanding Multimodal Fine-Tuning: Spatial Features cs.CV | cs.LGPDF

Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark

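TL;DR: This paper presents the first mechanistic analysis of multimodal fine-tuning in vision-language models (VLMs). Using stage-wise model diffing, it reveals how a language model learns to "see", in particular how spatial features emerge and are restructured.

Details

Motivation: Although vision-language models perform well on many tasks, it remains unclear how the language backbone adapts during multimodal training and when vision-specific capabilities emerge. The paper uses mechanistic analysis to understand this process.

Result: The study identifies vision-preferring features that emerge or reorient during fine-tuning. A subset of them reliably encodes spatial relations, as verified through controlled shifts to spatial prompts, and these features can be traced to a small group of attention heads.

Insight: The novelty is stage-wise model diffing, a technique that isolates the representational changes introduced by multimodal fine-tuning and so shows clearly how visual grounding reshapes features that were previously text-only. It provides an interpretable methodological foundation for understanding how pretrained language models acquire vision-grounded capabilities.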

Abstract: Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to “see”. We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.


[161] Zero-shot System for Automatic Body Region Detection for Volumetric CT and MR Images cs.CV | cs.AIPDF

Farnaz Khun Jush, Grit Werner, Mark Klemens, Matthias Lenga

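TL;DR: This paper proposes and systematically evaluates three training-free, zero-shot methods for automatically detecting anatomical body regions in volumetric CT and MR images: a segmentation-driven rule-based system built on pre-trained multi-organ segmentation models, a multimodal large language model (MLLM) guided by radiologist-defined rules, and a segmentation-aware MLLM that combines visual input with explicit anatomical evidence. Evaluated on a heterogeneous dataset of 887 CT and MR scans, the segmentation-driven rule-based method performs best.

Details

Motivation: Medical imaging workflows depend on unreliable DICOM metadata to identify anatomical body regions. The paper explores whether the knowledge in pre-trained foundation models enables fully zero-shot body region detection, overcoming the limited real-world applicability of supervised learning methods.

Result: On 887 manually verified CT and MR scans, the segmentation-driven rule-based method achieves the highest weighted F1-scores, 0.947 for CT and 0.914 for MR, and is robust across modalities and atypical scan coverage. The MLLM is competitive in visually distinctive regions, while the segmentation-aware MLLM reveals fundamental limitations.

Insight: The novelty is a systematic exploration of zero-shot body region detection based on pre-trained models, with no additional training. The segmentation-driven rule-based method combines pre-trained segmentation models with expert rules and, according to this summary, achieves performance close to supervised learning in the zero-shot setting, offering an efficient and reliable alternative for medical image analysis.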

Abstract: Reliable identification of anatomical body regions is a prerequisite for many automated medical imaging workflows, yet existing solutions remain heavily dependent on unreliable DICOM metadata. Current solutions mainly use supervised learning, which limits their applicability in many real-world scenarios. In this work, we investigate whether body region detection in volumetric CT and MR images can be achieved in a fully zero-shot manner by using knowledge embedded in large pre-trained foundation models. We propose and systematically evaluate three training-free pipelines: (1) a segmentation-driven rule-based system leveraging pre-trained multi-organ segmentation models, (2) a Multimodal Large Language Model (MLLM) guided by radiologist-defined rules, and (3) a segmentation-aware MLLM that combines visual input with explicit anatomical evidence. All methods are evaluated on 887 heterogeneous CT and MR scans with manually verified anatomical region labels. The segmentation-driven rule-based approach achieves the strongest and most consistent performance, with weighted F1-scores of 0.947 (CT) and 0.914 (MR), demonstrating robustness across modalities and atypical scan coverage. The MLLM performs competitively in visually distinctive regions, while the segmentation-aware MLLM reveals fundamental limitations.
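
Illustrative sketch (not from the paper): the weighted F1-score reported above is conventionally the per-class F1 averaged with weights proportional to each class's share of the ground-truth labels. A minimal Python version, assuming that standard definition, is shown below. The function name and the region labels in the toy example are hypothetical and are not the label set used by the authors.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is >= 1 because n >= 1.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Toy labels (hypothetical region names, not the paper's label set).
y_true = ["chest", "chest", "abdomen", "head"]
y_pred = ["chest", "abdomen", "abdomen", "head"]
f1_weighted = weighted_f1(y_true, y_pred)
```

In the toy example the per-class F1 values are 2/3 for chest, 2/3 for abdomen, and 1 for head, with weights 0.5, 0.25, and 0.25, which gives a weighted F1 of 0.75.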


[162] SynSacc: A Blender-to-V2E Pipeline for Synthetic Neuromorphic Eye-Movement Data and Sim-to-Real Spiking Model Training cs.CVPDF

Khadija Iddrisu, Waseem Shariff, Suzanne Little, Noel OConnor

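TL;DR: This paper presents SynSacc, a Blender-to-V2E pipeline for generating synthetic neuromorphic eye-movement data that simulates saccades and fixations. Using the resulting synthetic dataset, the authors train spiking neural network (SNN) models for eye-movement classification and fine-tune them on real event data; the models reach up to 0.83 accuracy and remain stable across temporal resolutions.

Details

Motivation: Studying eye movements, especially saccades and fixations, is essential for understanding human cognition and perception. Event cameras (DVS) capture rapid dynamics without distortion, but high-quality annotated datasets are lacking. The paper generates synthetic event streams under controlled conditions to address the scarcity of real event data and the difficulty of labeling it, and explores the value of synthetic data for event-based vision tasks.

Result: The proposed models reach up to 0.83 accuracy on eye-movement classification and maintain stable performance across temporal resolutions. Compared with artificial neural networks (ANNs), processing synthetic event streams with SNNs yields substantial computational efficiency gains.

Insight: The main contribution is a complete Blender-to-V2E pipeline for synthetic neuromorphic eye-movement data generation and sim-to-real model training. It offers a high-quality, controllable synthetic dataset for event-camera vision tasks and supports combining synthetic data augmentation with SNNs to improve computational efficiency and model robustness.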

Abstract: The study of eye movements, particularly saccades and fixations, is fundamental to understanding the mechanisms of human cognition and perception. Accurate classification of these movements requires sensing technologies capable of capturing rapid dynamics without distortion. Event cameras, also known as Dynamic Vision Sensors (DVS), provide asynchronous recordings of changes in light intensity, thereby eliminating motion blur inherent in conventional frame-based cameras and offering superior temporal resolution and data efficiency. In this study, we introduce a synthetic dataset generated with Blender to simulate saccades and fixations under controlled conditions. Leveraging Spiking Neural Networks (SNNs), we evaluate its robustness by training two architectures and finetuning on real event data. The proposed models achieve up to 0.83 accuracy and maintain consistent performance across varying temporal resolutions, demonstrating stability in eye movement classification. Moreover, the use of SNNs with synthetic event streams yields substantial computational efficiency gains over artificial neural network (ANN) counterparts, underscoring the utility of synthetic data augmentation in advancing event-based vision. All code and datasets associated with this work are available at https://github.com/Ikhadija-5/SynSacc-Dataset.


[163] From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models cs.CVPDF

Masanari Oi, Koki Maeda, Ryuto Koike, Daisuke Oba, Nakamasa Inoue

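TL;DR: This paper proposes HATCH, a training framework for improving multimodal large language models on multi-image spatial reasoning. It mimics human spatial reasoning through two complementary objectives, patch-level spatial alignment across views and action-then-answer reasoning, so that information from multiple viewpoints is integrated effectively.

Details

Motivation: MLLMs have progressed on single-image spatial reasoning but still struggle with multi-image spatial reasoning, which requires integrating information from multiple viewpoints. Humans solve such tasks through cross-view correspondence and stepwise viewpoint transformation, whereas existing work incorporates these mechanisms only partially or implicitly and lacks explicit supervision for both.

Result: Experiments on three benchmarks show that HATCH clearly outperforms baselines of comparable size and is competitive with much larger models, while preserving single-image reasoning ability.

Insight: The novelty lies in explicitly turning two key mechanisms of human spatial reasoning into training objectives: patch-level spatial alignment across views and explicit generation of viewpoint-transition actions. This gives multi-image understanding an interpretable and effective supervision signal and helps the model handle spatial relations in a more structured way.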

Abstract: While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.


[164] MVAnimate: Enhancing Character Animation with Multi-View Optimization cs.CVPDF

Tianyu Sun, Zhoujie Fu, Bang Zhang, Guosheng Lin

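TL;DR: MVAnimate is a new framework that uses multi-view prior information to synthesize 2D and 3D information of dynamic figures, aiming to improve the quality of generated animation videos and to address the low output quality and insufficient training data of existing methods.

Details

Motivation: Existing animation generation algorithms that model human pose with 2D or 3D structures suffer from low-quality output and insufficient training data, which makes high-quality animation video hard to generate. A new method is needed to improve the quality and consistency of generated content.

Result: Experiments on diverse datasets show that the method is robust to varied motion patterns and appearances and improves on existing animation methods in temporal consistency and spatial coherence.

Insight: The core idea is to use multi-view prior information to jointly optimize multi-view videos of the target character, improving video quality across viewpoints. This suggests a new route to higher fidelity and consistency in animation generation.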

Abstract: The demand for realistic and versatile character animation has surged, driven by its wide-ranging applications in various domains. However, the animation generation algorithms modeling human pose with 2D or 3D structures all face various problems, including low-quality output content and training data deficiency, preventing the related algorithms from generating high-quality animation videos. Therefore, we introduce MVAnimate, a novel framework that synthesizes both 2D and 3D information of dynamic figures based on multi-view prior information, to enhance the generated video quality. Our approach leverages multi-view prior information to produce temporally consistent and spatially coherent animation outputs, demonstrating improvements over existing animation methods. Our MVAnimate also optimizes the multi-view videos of the target character, enhancing the video quality from different views. Experimental results on diverse datasets highlight the robustness of our method in handling various motion patterns and appearances.


[165] Multimodal Learning for Arcing Detection in Pantograph-Catenary Systems cs.CV | cs.AI | cs.LGPDF

Hao Dong, Eleni Chatzi, Olga Fink

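TL;DR: This paper proposes a multimodal learning framework for arcing detection in pantograph-catenary systems that combines high-resolution image data with force measurements to detect arcing events more accurately and robustly. The authors build two arcing detection datasets with synchronized visual and force measurements and propose MultiDeepSAD, a multimodal extension of DeepSAD with a new loss formulation and pseudo-anomaly generation techniques tailored to each data type. Experiments show that the framework significantly outperforms baselines and is more sensitive to real arcing events, even under domain shift and with few real arcing observations.

Details

Motivation: Arcing events at the pantograph-catenary interface are hard to detect because they are transient, the operating environment is noisy, data are scarce, and arcs are difficult to distinguish from similar transient phenomena. A more accurate and robust detection method is needed.

Result: Through extensive experiments and ablation studies, the authors show that the framework significantly outperforms baseline approaches and is more sensitive to real arcing events under domain shift and limited real arcing observations.

Insight: The novelty is a multimodal framework (MultiDeepSAD) combining vision and force measurements, together with tailored pseudo-anomaly generation for image and force data to augment training. Viewed objectively, this combination of multimodal fusion and domain-specific synthetic augmentation offers a reusable approach to data scarcity and domain adaptation.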

Abstract: The pantograph-catenary interface is essential for ensuring uninterrupted and reliable power delivery in electrified rail systems. However, electrical arcing at this interface poses serious risks, including accelerated wear of contact components, degraded system performance, and potential service disruptions. Detecting arcing events at the pantograph-catenary interface is challenging due to their transient nature, noisy operating environment, data scarcity, and the difficulty of distinguishing arcs from other similar transient phenomena. To address these challenges, we propose a novel multimodal framework that combines high-resolution image data with force measurements to more accurately and robustly detect arcing events. First, we construct two arcing detection datasets comprising synchronized visual and force measurements. One dataset is built from data provided by the Swiss Federal Railways (SBB), and the other is derived from publicly available videos of arcing events in different railway systems and synthetic force data that mimic the characteristics observed in the real dataset. Leveraging these datasets, we propose MultiDeepSAD, an extension of the DeepSAD algorithm for multiple modalities with a new loss formulation. Additionally, we introduce tailored pseudo-anomaly generation techniques specific to each data type, such as synthetic arc-like artifacts in images and simulated force irregularities, to augment training data and improve the discriminative ability of the model. Through extensive experiments and ablation studies, we demonstrate that our framework significantly outperforms baseline approaches, exhibiting enhanced sensitivity to real arcing events even under domain shifts and limited availability of real arcing observations.


[166] MOVA: Towards Scalable and Synchronized Video-Audio Generation cs.CV | cs.SDPDF

SII-OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen

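TL;DR: MOVA is an open-source joint video-audio generation model with a Mixture-of-Experts architecture. From image and text inputs it generates high-quality, synchronized audio-visual content, including lip-synced speech, environment-aware sound effects, and content-aligned music.

Details

Motivation: Audio-visual content generation currently relies mainly on cascaded pipelines, which raise cost, accumulate errors, and degrade quality. Existing joint generation systems are mostly closed-source, which limits progress in the field.

Result: The model uses a Mixture-of-Experts architecture with 32B total parameters (18B active at inference), supports image-text to video-audio generation, and is released with open model weights and code.

Insight: By open-sourcing a joint generation model and supporting tools, the work addresses architecture, data, and training challenges in synchronized audio-visual generation and aims to foster community research and a creator ecosystem.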

Abstract: Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.


[167] Addressing data annotation scarcity in Brain Tumor Segmentation on 3D MRI scan Using a Semi-Supervised Teacher-Student Framework cs.CV | cs.AIPDF

Jiaming Liu, Cheng Ding, Daoqiang Zhang

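TL;DR: This paper proposes a semi-supervised teacher-student framework for brain tumor segmentation on 3D MRI that addresses annotation scarcity and data heterogeneity. It pairs an uncertainty-aware pseudo-labeling teacher with a student trained under a progressive, confidence-based curriculum, and improves segmentation through a dual-loss objective and agreement-based refinement.

Details

Motivation: Brain tumor MRI segmentation is limited by costly annotation and by data heterogeneity across scanners and sites. The goal is a segmentation method that remains robust under limited supervision.

Result: On BraTS 2021, the validation Dice similarity coefficient (DSC) rises from 0.393 with 10% of the data to 0.872 with 100%, indicating efficient use of data. The teacher reaches a DSC of 0.922, while the student surpasses the teacher on tumor subregions (e.g., NCR/NET and Edema) and recovers the Enhancing class (DSC 0.620), which the teacher failed to segment.

Insight: Contributions include uncertainty-aware pseudo-label generation, a progressive curriculum based on image-level confidence, and a dual-loss objective that learns from high-confidence regions and unlearns low-confidence ones. Together these improve the robustness and data efficiency of semi-supervised segmentation.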

Abstract: Accurate brain tumor segmentation from MRI is limited by expensive annotations and data heterogeneity across scanners and sites. We propose a semi-supervised teacher-student framework that combines an uncertainty-aware pseudo-labeling teacher with a progressive, confidence-based curriculum for the student. The teacher produces probabilistic masks and per-pixel uncertainty; unlabeled scans are ranked by image-level confidence and introduced in stages, while a dual-loss objective trains the student to learn from high-confidence regions and unlearn low-confidence ones. Agreement-based refinement further improves pseudo-label quality. On BraTS 2021, validation DSC increased from 0.393 (10% data) to 0.872 (100%), with the largest gains in early stages, demonstrating data efficiency. The teacher reached a validation DSC of 0.922, and the student surpassed the teacher on tumor subregions (e.g., NCR/NET 0.797 and Edema 0.980); notably, the student recovered the Enhancing class (DSC 0.620) where the teacher failed. These results show that confidence-driven curricula and selective unlearning provide robust segmentation under limited supervision and noisy pseudo-labels.
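The confidence-ranked, staged introduction of unlabeled scans described above could look roughly like the following sketch. The function name `stage_by_confidence`, the equal-size staging rule, and the confidence values are all assumptions made for illustration; the abstract does not say how stage boundaries are chosen.

```python
def stage_by_confidence(confidences, num_stages):
    """Rank scan ids by image-level confidence (highest first) and split them
    into consecutive curriculum stages of roughly equal size."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    # Ceiling division; max(1, ...) guards against an empty input.
    size = max(1, -(-len(ranked) // num_stages))
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


# Hypothetical teacher confidences for five unlabeled scans.
scores = {"scan_a": 0.91, "scan_b": 0.42, "scan_c": 0.77,
          "scan_d": 0.63, "scan_e": 0.15}
stages = stage_by_confidence(scores, 3)
for i, stage in enumerate(stages, start=1):
    print(f"stage {i}: {stage}")
```

The student would see the most confident pseudo-labels first and the noisiest ones last, which is the general idea behind a confidence-driven curriculum.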


[168] Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing cs.CVPDF

Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen

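TL;DR: Omni-Video 2 is a scalable, computationally efficient model that connects pretrained multimodal large language models (MLLMs) with video diffusion models for unified video generation and editing. Its key idea is to use the MLLM's understanding and reasoning to produce explicit target captions that interpret user instructions, so that the understanding model's rich contextual representations directly guide generation. A lightweight adapter injects multimodal conditional tokens into a pretrained text-to-video diffusion model, reusing its generative priors in a parameter-efficient way.

Details

Motivation: The work targets complex compositional video editing: how to better understand and carry out user instructions while efficiently reusing the priors of strong existing generative models.

Result: The model is evaluated on the FiVE benchmark (fine-grained video editing) and the VBench benchmark (text-to-video generation). It follows complex compositional instructions well in video editing and achieves competitive or better quality in video generation.

Insight: The novelty is using the MLLM as an "instruction interpreter" that produces explicit target captions, bridging the understanding and generation modules. A lightweight adapter for multimodal condition injection reuses the pretrained video diffusion model's generative capability in a parameter-efficient way, giving a unified and scalable framework for video generation and editing.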

Abstract: We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated, high-quality training data, supporting high-quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, etc. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.


[169] Any-to-All MRI Synthesis: A Unified Foundation Model for Nasopharyngeal Carcinoma and Its Downstream Applications cs.CVPDF

Yao Pu, Yiming Shi, Zhenxi Zhang, Peixin Yu, Yitao Zhuang

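TL;DR: This paper proposes a unified foundation model for any-to-all MRI synthesis in nasopharyngeal carcinoma (NPC) radiotherapy. Combining contrastive visual representation learning with vision-language alignment, it generates all required MRI modalities from any input modality, addressing the loss of radiotherapy planning accuracy caused by incomplete scans in clinical practice.

Details

Motivation: MRI is essential for NPC radiotherapy, but practical constraints such as patient discomfort, long scan times, and high costs often leave modalities incomplete in clinical practice, which compromises planning accuracy. Traditional MRI synthesis methods are modality-specific, have limited anatomical adaptability, and lack clinical interpretability, so they do not meet the needs of NPC radiotherapy.

Result: Trained on 40,825 images from 13 institutions and evaluated on 26 internal/external validation sites (15,748 images), the model achieves consistently high performance (average SSIM 0.90, PSNR 27), with superior synthesis fidelity and robustness to noise and domain shift.

Insight: The novelty is integrating a contrastive encoder (for modality-invariant representations) with a CLIP-based text-informed decoder (for semantically consistent synthesis), so that a single unified foundation model supports any-to-all MRI synthesis. The unified representation also improves downstream radiotherapy-relevant tasks such as segmentation, opening a new path for digital medicine solutions in NPC care.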

Abstract: Magnetic resonance imaging (MRI) is essential for nasopharyngeal carcinoma (NPC) radiotherapy (RT), but practical constraints, such as patient discomfort, long scan times, and high costs, often lead to incomplete modalities in clinical practice, compromising RT planning accuracy. Traditional MRI synthesis methods are modality-specific, limited in anatomical adaptability, and lack clinical interpretability, failing to meet NPC's RT needs. Here, we developed a unified foundation model integrating contrastive visual representation learning and vision-language alignment (VLA) to enable any-to-all MRI synthesis. The model uses a contrastive encoder for modality-invariant representations and a CLIP-based text-informed decoder for semantically consistent synthesis, supporting any-to-all MRI synthesis via one unified foundation model. Trained on 40,825 images from 13 institutions, it achieves consistently high performance (average SSIM 0.90, PSNR 27) across 26 internal/external validation sites (15,748 images), with superior synthesis fidelity and robustness to noise and domain shifts. Meanwhile, its unified representation enhances downstream RT-relevant tasks (e.g., segmentation). This work advances digital medicine solutions for NPC care by leveraging foundation models to bridge technical synthesis and clinical utility.
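For readers unfamiliar with the PSNR figure quoted above, the standard definition is PSNR = 10 · log10(MAX² / MSE), in decibels. The sketch below applies it to flattened intensity lists; the `max_value` default and the toy intensities are assumptions, since the abstract does not state the intensity range or normalization used in the evaluation.

```python
import math


def psnr(reference, synthesized, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by 10 on a 0-100 intensity scale.
print(psnr([0, 0, 0, 0], [10, 10, 10, 10], max_value=100.0))
```

Because PSNR depends on the assumed peak value, scores are only comparable between studies that use the same intensity scaling.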


[170] VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning cs.CVPDF

Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu

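TL;DR: This paper proposes VideoVeritas, a framework that detects AI-generated video through Joint Preference Alignment and Perception Pretext Reinforcement Learning. It integrates fine-grained perception with fact-based reasoning to compensate for the limited fine-grained perception of current multimodal large language models. The authors also build the MintVid dataset for evaluation.

Details

Motivation: Improving video generation creates security risks and calls for reliable detection. Current MLLMs reason well but have limited fine-grained perception, so perception must be strengthened to improve detection.

Result: Experiments show that existing methods tend to be biased toward either superficial reasoning or mechanical analysis, whereas VideoVeritas achieves more balanced performance across multiple benchmarks.

Insight: The novelty is Perception Pretext Reinforcement Learning: instead of directly optimizing the detection task, it improves detection through general perception pretext tasks such as spatiotemporal grounding and self-supervised object counting, which helps the model learn more robust representations.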

Abstract: The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a light yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that has factual errors in content. Experimental results demonstrate that existing methods tend to bias towards either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.


[171] TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models cs.CVPDF

Xiangtian Zheng, Zishuo Wang, Yuxin Peng

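TL;DR: This paper proposes TiFRe, a text-guided video frame reduction framework that lowers the computational cost of video multimodal large language models. It selects key frames with a text-guided frame sampling strategy and preserves non-key-frame information through a frame matching and merging mechanism, reducing the number of input frames while improving performance on video-language tasks.

Details

Motivation: Video MLLMs incur high computational cost when processing many video frames, and simply selecting key frames at a fixed frame rate loses information and degrades performance. A method is needed that reduces frames while preserving key information.

Result: Experiments show that TiFRe reduces computational cost while improving performance on video-language tasks. The abstract does not state which benchmarks were used or what level was reached (e.g., SOTA).

Insight: The novelty is combining dynamic frame sampling based on text-guided semantic similarity with a frame matching and merging mechanism that integrates non-key-frame information, a lightweight idea that can be reused for efficient video processing.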

Abstract: With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on user input, which is processed by an LLM to generate a CLIP-style prompt. Pre-trained CLIP encoders calculate the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.
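The Text-guided Frame Sampling step described above amounts to ranking frames by similarity to a prompt embedding and keeping the top few. The following is a minimal sketch under stated assumptions: the short vectors stand in for CLIP text and image embeddings, `select_key_frames` is a hypothetical helper name, and returning the selected indices in temporal order is a design choice of this sketch rather than something the abstract specifies. The Frame Matching and Merging step is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_key_frames(prompt_emb, frame_embs, k):
    """Indices of the k frames most similar to the prompt, in temporal order."""
    ranked = sorted(range(len(frame_embs)),
                    key=lambda i: cosine(prompt_emb, frame_embs[i]),
                    reverse=True)
    return sorted(ranked[:k])


# Toy 2-D embeddings standing in for CLIP outputs (hypothetical values).
prompt = [1.0, 0.0]
frames = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [1.0, 0.0]]
key_frames = select_key_frames(prompt, frames, 2)
print(key_frames)
```

In a real pipeline the embeddings would come from pre-trained CLIP encoders, and the non-selected frames would then be matched to and merged into the chosen key frames.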


[172] Analysis of Converged 3D Gaussian Splatting Solutions: Density Effects and Prediction Limit cs.CV | cs.LGPDF

Zhendong Wang, Cihan Ruan, Jingchuan Xiao, Chuqing Shi, Wei Jiang

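TL;DR: This paper analyzes the structure of 3D Gaussian Splatting (3DGS) solutions obtained from standard multi-view optimization, terms them Rendering-Optimal References (RORs), and reveals stable statistical patterns such as mixture-structured scales and bimodal radiance. Learnability probes show that parameters are governed by density stratification: in dense regions parameters correlate with geometry and can be predicted without rendering, while in sparse regions visibility heterogeneity couples geometric and appearance parameters and prediction fails systematically. The paper proposes density-aware strategies that improve training robustness and discusses architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.

Details

Motivation: The aim is to characterize the structure of the solutions that 3DGS reaches under multi-view optimization and to understand what determines their parameters, in particular how density affects the predictability of geometric and appearance parameters, in order to address prediction failures in sparse regions.

Result: Variance decomposition formalizes the density-stratification effect: in dense regions, parameters correlate with geometry and can be predicted without rendering supervision; in sparse regions, visibility heterogeneity couples geometric and appearance parameters, so prediction fails systematically across architectures. This reveals the dual character of RORs.

Insight: Contributions include defining 3DGS solutions as RORs and analyzing their statistical patterns, identifying density stratification as the underlying mechanism, and proposing density-aware strategies to improve training. Viewed objectively, the study offers new insight into how 3DGS works internally and a theoretical basis for designing systems that adaptively balance prediction and rendering.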

Abstract: We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers fundamental density-stratification. Dense regions exhibit geometry-correlated parameters amenable to render-free prediction, while sparse regions show systematic failure across architectures. We formalize this through variance decomposition, demonstrating that visibility heterogeneity creates covariance-dominated coupling between geometric and appearance parameters in sparse regions. This reveals the dual character of RORs: geometric primitives where point clouds suffice, and view synthesis primitives where multi-view constraints are essential. We provide density-aware strategies that improve training robustness and discuss architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.


[173] MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE cs.CV | cs.AI | cs.CG | cs.LGPDF

Ruijie Zhu, Jiahao Lu, Wenbo Hu, Xiaoguang Han, Jianfei Cai

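TL;DR: MotionCrafter is a video-diffusion-based framework that jointly reconstructs 4D geometry (dense 3D point maps) and estimates dense motion (3D scene flow) from monocular video. Its core is a new joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a novel 4D VAE that learns this representation effectively.

Details

Motivation: Prior methods force 3D values to align strictly with the RGB VAE latent space despite their fundamentally different distributions, which leads to suboptimal performance. The goal is to reconstruct geometry and motion jointly from video more effectively.

Result: Extensive experiments on multiple datasets show that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, with improvements of 38.64% and 25.0% in geometry and motion reconstruction respectively, without any post-optimization.

Insight: The main contributions are: 1) a joint representation of 3D point maps and scene flow, with a 4D VAE; 2) dropping strict alignment with the RGB VAE latent space in favor of a new data normalization and VAE training strategy that better transfers diffusion priors. Viewed objectively, the ideas of decoupled representation learning and prior transfer are worth borrowing.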

Abstract: We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents (despite their fundamentally different distributions), we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page


[174] WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models cs.CV | cs.ROPDF

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin

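TL;DR: This paper proposes WorldArena, a unified benchmark for systematically evaluating embodied world models along two dimensions, perceptual quality and functional utility. It assesses models through video perception quality (16 metrics across six sub-dimensions) and embodied task functionality (as data engines, policy evaluators, and action planners), and proposes EWMScore, a holistic metric that integrates multi-dimensional performance. Extensive experiments on 14 representative models reveal a significant gap between perception and functionality.

Details

Motivation: Evaluation of embodied world models has focused on perceptual fidelity (e.g., video generation quality) and overlooked their functional utility in downstream decision-making tasks, leaving evaluation fragmented. The paper builds a unified benchmark to close this gap.

Result: Extensive experiments on 14 representative models on WorldArena show that high visual quality does not necessarily translate into strong embodied task capability, revealing a significant perception-functionality gap. The benchmark is publicly released with a leaderboard.

Insight: The main contribution is WorldArena, billed as the first unified benchmark for evaluating both the perception and the functional utility of embodied world models, together with the holistic metric EWMScore. Viewed objectively, its core value is a systematic definition of evaluation dimensions and empirical evidence that current models are strong perceptually but weak functionally, which points the way toward world models that are useful in practice.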

Abstract: While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action-conditioned prediction, their evaluation remains fragmented. Current evaluation of embodied world models has largely focused on perceptual fidelity (e.g., video generation quality), overlooking the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners; and subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. The WorldArena benchmark, with a public leaderboard, is released at https://worldarena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.


[175] Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study cs.CVPDF

Arushi Rai, Adriana Kovashka

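TL;DR: This paper addresses the poor generalization of video-LLMs on sports feedback generation. Using rock climbing as a case study, it proposes improving model performance with freely available auxiliary data from the target domain (such as competition videos and coaching manuals), and designs two new evaluation metrics for sports feedback quality: specificity and actionability.

Details

Motivation: Existing video-LLMs perform poorly on sports feedback generation, generalize badly to sports unseen during finetuning, and depend on finetuning data that is expensive and hard to collect. Traditional text generation metrics also fail to measure the quality of sports feedback.

Result: With rock climbing as the case study, adding auxiliary target-domain data improves feedback generation performance on the target domain, and the two proposed metrics, specificity and actionability, evaluate generated feedback more effectively.

Insight: The novelty is using freely available auxiliary data from the target domain (multimodal data and textual knowledge) to ease data scarcity and domain generalization, and designing evaluation metrics better matched to the task, which makes sports feedback generation under limited annotation more practical and meaningful.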

Abstract: While there is rapid progress in video-LLMs with advanced reasoning capabilities, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data for each sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose using auxiliary freely-available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint, source domain to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.


[176] WorldCompass: Reinforcement Learning for Long-Horizon World Models cs.CVPDF

Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu

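TL;DR: This paper proposes WorldCompass, a new reinforcement learning post-training framework for long-horizon, interactive video world models, which uses interaction signals to guide the model to explore the world more accurately and consistently. It introduces three core innovations tailored to autoregressive video generation (a clip-level rollout strategy, complementary reward functions, and an efficient RL algorithm) and validates them on the open-source world model WorldPlay.

Details

Motivation: Existing long-horizon interactive video world models explore the world with insufficient accuracy and consistency. The paper uses an RL post-training framework that exploits interaction signals to steer the model better.

Result: Evaluations on the open-source world model WorldPlay show that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.

Insight: The novelty lies in tailoring three components to autoregressive video generation: a clip-level rollout strategy for efficiency, complementary reward functions that suppress reward hacking, and a negative-aware fine-tuning strategy for efficient optimization. The framework's combination of reinforcement learning with post-training of video world models is a useful reference.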

Abstract: This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for the long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively “steer” the world model’s exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-level rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ the negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.


physics.chem-ph [Back]

[177] LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning physics.chem-ph | cs.AI | cs.CL | cs.LGPDF

Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao

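TL;DR: This paper proposes LatentChem, a latent reasoning interface for chemical reasoning that decouples chemical computation from text generation, so the model performs multi-step reasoning directly in continuous latent space and uses language only for final outputs. When optimized solely for task success, models spontaneously internalize reasoning, gradually abandoning verbose textual derivations in favor of implicit latent computation. Across multiple chemical reasoning benchmarks, the approach improves performance and substantially speeds up inference.

Details

Motivation: Chemical LLMs rely mainly on explicit natural-language chain-of-thought for complex reasoning, but chemical reasoning is inherently continuous and structural. Forcing it into discrete linguistic tokens creates a representation mismatch that limits efficiency and performance, which the paper aims to resolve.

Result: On ChemCoTBench, LatentChem achieves a 59.88% non-tie win rate over strong CoT-based baselines, with an average inference speedup of 10.84×.

Insight: The core contribution is a framework that moves reasoning from explicit text generation into continuous latent space, a change that is computationally advantageous rather than merely stylistic. The key insight is that chemical reasoning is realized more naturally and effectively as continuous latent dynamics than as discrete linguistic trajectories, which suggests a direction for more efficient and more fundamental AI reasoning systems.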

Abstract: Chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) in natural language to perform complex reasoning. However, chemical reasoning is inherently continuous and structural, and forcing it into discrete linguistic tokens introduces a fundamental representation mismatch that constrains both efficiency and performance. We introduce LatentChem, a latent reasoning interface that decouples chemical computation from textual generation, enabling models to perform multi-step reasoning directly in continuous latent space while emitting language only for final outputs. Remarkably, we observe a consistent emergent behavior: when optimized solely for task success, models spontaneously internalize reasoning, progressively abandoning verbose textual derivations in favor of implicit latent computation. This shift is not merely stylistic but computationally advantageous. Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84× average inference speedup. Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics rather than discretized linguistic trajectories.


cs.CR [Back]

[178] Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model cs.CR | cs.AI | cs.CLPDF

Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo

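TL;DR: This paper proposes SecCoderX, an online reinforcement learning framework that addresses the security of code generated by large language models. It draws on mature vulnerability detection resources to synthesize diverse vulnerability-inducing tasks and to train a reasoning-based vulnerability reward model, then aligns code LLMs in an online RL loop to generate code that is secure and still functional.

Details

Motivation: LLMs tend to generate insecure code, and existing secure code alignment methods often fall into a functionality-security paradox, improving security at a substantial cost to utility.

Result: Extensive experiments show that SecCoderX achieves state-of-the-art performance on Effective Safety Rate (ESR), improving it by about 10% over unaligned models, whereas prior methods often degrade ESR by 14-54%.

Insight: The main novelty is bridging mature vulnerability detection resources to secure code generation in two ways: synthesizing diverse, reality-grounded vulnerability-inducing tasks for online RL exploration, and training a scalable, reliable reasoning-based vulnerability reward model. This offers a new way to align models for security while preserving functionality.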

Abstract: Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment. Existing secure code alignment methods often suffer from a functionality–security paradox, improving security at the cost of substantial utility degradation. We propose SecCoderX, an online reinforcement learning framework for functionality-preserving secure code generation. SecCoderX first bridges vulnerability detection and secure code generation by repurposing mature detection resources in two ways: (i) synthesizing diverse, reality-grounded vulnerability-inducing coding tasks for online RL rollouts, and (ii) training a reasoning-based vulnerability reward model that provides scalable and reliable security supervision. Together, these components are unified in an online RL loop to align code LLMs to generate secure and functional code. Extensive experiments demonstrate that SecCoderX achieves state-of-the-art performance, improving Effective Safety Rate (ESR) by approximately 10% over unaligned models, whereas prior methods often degrade ESR by 14-54%. We release our code, dataset and model checkpoints at https://github.com/AndrewWTY/SecCoderX.


cs.LG [Back]

[179] Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection cs.LG | cs.CLPDF

Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su

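TL;DR: This paper treats safety alignment of large language models as a continual learning problem and proposes Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method. OGPSA projects the safety update gradient onto the orthogonal complement of a low-rank subspace that captures general capabilities, which reduces interference during alignment and improves safety while preserving general capabilities as far as possible.

Details

Motivation: The goal is to reduce the “alignment tax” incurred during LLM safety alignment, where safety post-training harms general capabilities such as reasoning and coding. The authors attribute this mainly to continual-learning-style forgetting in sequential alignment: distribution shift and conflicting objectives cause safety updates to overwrite competencies acquired in pre-training.

Result: Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA consistently improves the safety-utility Pareto frontier over standard baselines. For example, on Qwen2.5-7B-Instruct under SFT→DPO, OGPSA preserves strong safety while recovering general capability, raising SimpleQA accuracy from 0.53% to 3.03% and IFEval accuracy from 51.94% to 63.96%.

Insight: The novelty lies in formalizing safety alignment as a continual learning problem that must balance plasticity (acquiring safety constraints) against stability (preserving general abilities), and in offering a plug-and-play solution that needs no large-scale replay, auxiliary objectives, or retraining. The core idea is to project the safety gradient onto the orthogonal complement of a subspace representing general capabilities, which gives targeted updates with minimal interference; this is a novel and lightweight technical perspective.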

Abstract: Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA consistently improves the safety–utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT→DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at https://github.com/SunGL001/OGPSA.
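The projection step described in this abstract can be illustrated with a minimal NumPy sketch. It assumes gradients are flattened into vectors and that the capability subspace is taken from the top right singular vectors of a matrix of reference gradients; the function names, the SVD-based estimate, and the toy dimensions are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def capability_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (d x rank) for the top-`rank` subspace of reference gradients.

    ref_grads has shape (n_samples, d): one flattened gradient per reference example.
    Using an SVD here is an assumption made for illustration.
    """
    # Rows of vt are orthonormal right singular vectors spanning the row space.
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T

def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project safety_grad onto the orthogonal complement of span(basis)."""
    return safety_grad - basis @ (basis.T @ safety_grad)

# Toy data standing in for real model gradients.
rng = np.random.default_rng(0)
ref_grads = rng.normal(size=(8, 32))   # 8 reference gradients of dimension 32
basis = capability_basis(ref_grads, rank=4)
g_safe = rng.normal(size=32)           # a safety-update gradient
g_proj = project_out(g_safe, basis)    # update direction with the subspace component removed
```

After projection, `g_proj` has no component along any basis direction, so a first-order step along it leaves the loss on the reference set approximately unchanged in those directions.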


[180] Reliable and Responsible Foundation Models: A Comprehensive Survey cs.LG | cs.AI | cs.CL | cs.CV | cs.CYPDF

Xinyu Yang, Junlin Han, Rishi Bommasani, Jinqi Luo, Wenjie Qu

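TL;DR: This survey comprehensively examines the reliability and responsibility of foundation models, including LLMs, MLLMs, image generative models, and video generative models. It systematically reviews key issues such as bias and fairness, security and privacy, uncertainty, explainability, and distribution shift, along with model limitations such as hallucinations and methods such as alignment and AIGC detection. It aims to provide research directions and a roadmap for building foundation models that are powerful as well as ethical, trustworthy, reliable, and socially responsible.

Details

Motivation: As foundation models are deployed more widely in the real world, ensuring their reliability and responsibility has become a key concern for academia, industry, and government. The paper surveys the field systematically to encourage more responsible development of foundation models.

Result: As a survey, it proposes no new model or method and therefore reports no quantitative experimental results. Its main output is a systematic review of the current state of research and an outlook on concrete future research directions.

Insight: The contribution is a comprehensive, structured framework for examining the reliability and responsibility of foundation models, an emerging and complex cross-disciplinary problem. It also highlights the connections and shared challenges among topics such as bias, security, and explainability, giving follow-up research a clear roadmap.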

Abstract: Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e., Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medicine, education, finance, science, and beyond. As these models see increasing real-world deployment, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.


[181] DrugR: Optimizing Molecular Drugs through LLM-based Explicit Reasoning cs.LG | cs.AI | cs.CL | q-bio.QMPDF

Haoran Liu, Zheni Zeng, Yukun Yan, Yuxuan Chen, Yunduo Xiao

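TL;DR: This paper proposes DrugR, a method that uses large language models (LLMs) for molecular drug optimization. It introduces explicit, step-by-step pharmacological reasoning and combines domain-specific continual pretraining, supervised fine-tuning via reverse data engineering, and self-balanced multi-granular reinforcement learning to improve ADMET properties while preserving the molecule's core efficacy.

Details

Motivation: To address the challenges LLMs face in molecular optimization: the complex implicit relationship between molecular structure and pharmacological properties, and the lack of corresponding labeled data.

Result: Experiments show that DrugR achieves comprehensive enhancement across multiple properties without compromising structural similarity or target binding affinity.

Insight: The novelty is introducing explicit, step-by-step pharmacological reasoning into LLM-driven molecular optimization, which provides an interpretable rationale for each optimization step and advances automated, knowledge-driven scientific discovery.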

Abstract: Molecule generation and optimization is a fundamental task in the chemical domain. The rapid development of intelligent tools, especially large language models (LLMs) with powerful knowledge reserves and interactive capabilities, has provided new paradigms for it. Nevertheless, the intrinsic challenge for LLMs lies in the complex implicit relationship between molecular structure and pharmacological properties and the lack of corresponding labeled data. To bridge this gap, we propose DrugR, an LLM-based method that introduces explicit, step-by-step pharmacological reasoning into the optimization process. Our approach integrates domain-specific continual pretraining, supervised fine-tuning via reverse data engineering, and self-balanced multi-granular reinforcement learning. This framework enables DrugR to effectively improve key ADMET properties while preserving the original molecule’s core efficacy. Experimental results demonstrate that DrugR achieves comprehensive enhancement across multiple properties without compromising structural similarity or target binding affinity. Importantly, its explicit reasoning process provides clear, interpretable rationales for each optimization step, yielding actionable design insights and advancing toward automated, knowledge-driven scientific discovery. Our code and model checkpoints are open-sourced to foster future research.


[182] Reinforcement Learning with Backtracking Feedback cs.LG | cs.AI | cs.CLPDF

Bilgehan Sel, Vaishakh Keshava, Phillip Wallis, Lukas Rutishauser, Ming Jin

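TL;DR: This paper proposes Reinforcement Learning with Backtracking Feedback (RLBF), a framework for making large language models (LLMs) safer against adversarial attacks and in-distribution errors. In a reinforcement learning stage the model learns to correct its own generation errors dynamically, supported by an enhanced supervised fine-tuning data generation strategy (BSAFE+) that helps it acquire the backtracking capability.

Details

Motivation: To meet the pressing need for robust safety in LLMs facing adversarial attacks (such as middle filling and GCG attacks) and in-distribution errors.

Result: Comprehensive empirical evaluations show that RLBF significantly reduces attack success rates across diverse benchmarks and model scales, achieving superior safety while preserving the utility of the base model.

Insight: The core innovation is using RL with critic feedback on the model's live outputs to train LLMs to identify and recover from safety violations that actually occur, by emitting an efficient “backtrack by x tokens” signal and then continuing autoregressive generation. This gives the model resilience against sophisticated attack strategies. The enhanced SFT data generation strategy (BSAFE+) injects violations into coherent, originally safe text, which provides more effective initial training for the backtracking mechanism.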

Abstract: Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). This framework advances upon prior methods, such as BSAFE, by primarily leveraging a Reinforcement Learning (RL) stage where models learn to dynamically correct their own generation errors. Through RL with critic feedback on the model’s live outputs, LLMs are trained to identify and recover from their actual, emergent safety violations by emitting an efficient “backtrack by x tokens” signal, then continuing generation autoregressively. This RL process is crucial for instilling resilience against sophisticated adversarial strategies, including middle filling, Greedy Coordinate Gradient (GCG) attacks, and decoding parameter manipulations. To further support the acquisition of this backtracking capability, we also propose an enhanced Supervised Fine-Tuning (SFT) data generation strategy (BSAFE+). This method improves upon previous data creation techniques by injecting violations into coherent, originally safe text, providing more effective initial training for the backtracking mechanism. Comprehensive empirical evaluations demonstrate that RLBF significantly reduces attack success rates across diverse benchmarks and model scales, achieving superior safety outcomes while critically preserving foundational model utility.
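To make the “backtrack by x tokens” mechanism concrete, here is a small decoding-side sketch. The representation of the signal as a `("BACKTRACK", x)` tuple is a hypothetical choice for illustration; the abstract does not specify how the signal is encoded.

```python
from typing import Iterable, List, Tuple, Union

Backtrack = Tuple[str, int]      # hypothetical signal format: ("BACKTRACK", x)
Event = Union[str, Backtrack]    # either an ordinary token or a backtrack signal

def apply_backtracking(events: Iterable[Event]) -> List[str]:
    """Replay a generation stream, discarding the last x tokens on each backtrack signal."""
    kept: List[str] = []
    for ev in events:
        if isinstance(ev, tuple) and ev[0] == "BACKTRACK":
            x = min(max(0, ev[1]), len(kept))  # never remove more than was emitted
            del kept[len(kept) - x:]
        else:
            kept.append(ev)                    # ordinary token: keep it
    return kept

stream = ["Sure", ",", "here", "is", ("BACKTRACK", 4), "I", "cannot", "help"]
final_tokens = apply_backtracking(stream)
```

In this toy stream the first four tokens are retracted and generation continues from the empty prefix, so only the tokens emitted after the signal survive.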


[183] Beyond Correctness: Learning Robust Reasoning via Transfer cs.LG | cs.CLPDF

Hyunseok Lee, Soheil Abbasloo, Jihoon Tack, Jinwoo Shin

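TL;DR: This paper proposes Reinforcement Learning with Transferable Reward (RLTR), a method that aims to improve the robustness of an LLM's reasoning process and not only the correctness of its final answer. A transfer reward tests whether a reasoning prefix can guide another model to the correct answer, which encourages reasoning chains that are stable, interpretable, and generalizable.

Details

Motivation: Existing Reinforcement Learning with Verifiable Rewards (RLVR) methods focus on final-answer correctness and neglect the robustness of the reasoning process itself, so the reasoning can be brittle and hard to generalize. This paper aims to close that gap by treating reasoning as a form of meaning transfer that must remain valid after truncation, reinterpretation, and continuation.

Result: On MATH500, RLTR gains 3.6 percentage points in Maj@64 over RLVR and matches RLVR's average accuracy with roughly 2.5x fewer training steps, while also improving sampling consistency.

Insight: The core idea is to define reasoning robustness as transferability and to operationalize and optimize it through a transfer reward. This pushes the model toward reasoning steps that are more interpretable and generalizable, going beyond matching the final answer, and yields more efficient and reliable training.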

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. We therefore treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), which operationalizes robustness via a transfer reward that tests whether a partial reasoning prefix from one model can guide a separate model to the correct answer. This encourages LLMs to produce reasoning that is stable, interpretable, and genuinely generalizable. Our approach improves sampling consistency while improving final answer accuracy, and it reaches comparable performance in substantially fewer training steps. For example, on MATH500, RLTR achieves a 3.6 percentage-point gain in Maj@64 compared to RLVR and matches RLVR’s average accuracy with roughly 2.5x fewer training steps, providing both more reliable reasoning and significantly better sample efficiency.
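The transfer reward can be sketched as follows, under loudly stated assumptions: the reasoning is a list of tokens, the prefix is cut at a fixed fraction, the receiving model is any callable from a prefix to an answer string, and the reward is binary. None of these choices is confirmed by the abstract; `toy_receiver` is a hypothetical stand-in for a second LLM.

```python
from typing import Callable, List

def transfer_reward(
    reasoning_tokens: List[str],
    cut_fraction: float,
    receiver: Callable[[List[str]], str],
    gold_answer: str,
) -> float:
    """Return 1.0 if a separate model reaches the gold answer from a truncated prefix."""
    cut = int(len(reasoning_tokens) * cut_fraction)  # truncation point (assumed scheme)
    prefix = reasoning_tokens[:cut]
    return 1.0 if receiver(prefix) == gold_answer else 0.0

def toy_receiver(prefix: List[str]) -> str:
    """Hypothetical receiver: answers correctly only if the key step is in the prefix."""
    return "4" if "2+2" in prefix else "?"

trace = ["compute", "2+2", "=", "4"]
useful_prefix_reward = transfer_reward(trace, 0.5, toy_receiver, "4")
short_prefix_reward = transfer_reward(trace, 0.25, toy_receiver, "4")
```

A prefix that already contains the decisive step earns the reward, while a prefix that stops too early does not, which is the behaviour the reward is meant to encourage.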


[184] Bayesian Preference Learning for Test-Time Steerable Reward Models cs.LG | cs.CLPDF

Jiwoo Hong, Shao Tang, Zhipeng Wang

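TL;DR: This paper proposes Variational In-Context Reward Modeling (ICRM), a Bayesian reward modeling approach that makes reward models steerable at test time through in-context preference demonstrations, addressing the limited adaptability of static classifier reward models in complex multi-objective alignment settings.

Details

Motivation: Conventional reward models are fixed once trained and cannot adapt at test time to unseen, complex, or multi-objective preference distributions, which limits their use in settings such as verifiable rewards and multi-objective alignment.

Result: In the single-objective setting, ICRM gains 34% accuracy on SafeRLHF and 9% on RM-Bench with more in-context demonstrations; in the multi-objective setting, it widens the Pareto frontier on helpfulness and refusal benchmarks with a 4% gain in hypervolume; in math reasoning, it encodes verifiable rewards effectively and outperforms a conventional reward model.

Insight: The novelty is casting reward modeling as amortized variational inference under the Bradley-Terry model with a conjugate Beta prior, which enables test-time steering through in-context demonstrations. Theoretical analysis shows that the variational objective admits a global interior optimum and that KL regularization mitigates reward over-optimization.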

Abstract: Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
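As background for the conjugate Beta prior mentioned above, the sketch below shows the textbook Beta-Bernoulli update for a latent preference probability. It is not ICRM itself, which amortizes inference with a model; treating each in-context demonstration as a binary outcome (1 if the first response is preferred) is an assumption made for illustration.

```python
from typing import Sequence, Tuple

def beta_update(alpha: float, beta: float, outcomes: Sequence[int]) -> Tuple[float, float]:
    """Exact conjugate update: Beta(alpha, beta) prior with Bernoulli preference outcomes."""
    wins = sum(outcomes)                       # demonstrations preferring the first response
    return alpha + wins, beta + (len(outcomes) - wins)

def posterior_mean(alpha: float, beta: float) -> float:
    """Posterior mean of the latent preference probability."""
    return alpha / (alpha + beta)

# Uniform prior Beta(1, 1) and four in-context demonstrations, three favouring response A.
post_alpha, post_beta = beta_update(1.0, 1.0, [1, 1, 1, 0])
pref_prob = posterior_mean(post_alpha, post_beta)
```

With more demonstrations the posterior concentrates, which gives one intuition for why additional in-context examples can sharpen a steerable reward signal.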


[185] A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents cs.LG | cs.AI | cs.CL | cs.CYPDF

Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara

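TL;DR: This paper proposes a framework for evaluating goal-directedness in language model agents that combines behavioural evaluation with interpretability analysis. In experiments with an LLM agent in a 2D grid world, the agent's performance scales with task difficulty and remains robust, while its internal representations non-linearly encode a coarse spatial map of the environment and are reorganised during reasoning to support action selection.

Details

Motivation: There is currently no reliable method for attributing goals to agentic systems, which limits the ability to explain and predict agent behaviour. An evaluation framework that integrates behavioural and internal-representation analysis is therefore needed.

Result: In a 2D grid-world navigation task, the agent is evaluated against an optimal policy across varying grid sizes, obstacle densities, and goal structures; its performance scales with task difficulty and remains robust to difficulty-preserving transformations and complex goal structures. Probing shows that the agent's internal representations encode approximate, task-relevant information about its position and the goal location.

Insight: The novelty is combining behavioural testing with interpretability-based analysis of internal representations, revealing how an LLM agent non-linearly encodes environment structure and reorganises information during reasoning. The findings underline that behavioural evaluation alone is not enough to characterise how agents represent and pursue their goals.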

Abstract: Understanding an agent’s goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models’ internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent’s internal representations of the environment state and its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from broader environment structural cues toward information supporting immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.


[186] Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense cs.LG | cs.AI | cs.CLPDF

Jiacheng Liu, Yaxin Luo, Jiacheng Cui, Xinyi Shang, Xiaohan Zhao

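TL;DR: This paper proposes Next-Gen CAPTCHAs, a defense framework against GUI agents that can solve traditional CAPTCHAs. It exploits the “Cognitive Gap” between humans and agents in interactive perception, memory, decision-making, and action, designing dynamic tasks that require adaptive intuition rather than granular planning in order to re-establish a reliable distinction between biological users and artificial agents.

Details

Motivation: Traditional CAPTCHAs have been rendered obsolete by the rapid progress of GUI agents; in particular, reasoning-heavy models such as Gemini3-Pro-High and GPT-5.2-Xhigh reach pass rates as high as 90% on complex logic puzzles like “Bingo”. A scalable and diverse new defense is therefore needed to secure the next-generation web.

Result: The benchmark is built on a robust data generation pipeline that supports large-scale, easily scalable evaluation. For backend-supported types, the system can generate an effectively unbounded number of CAPTCHA instances, which gives it an advantage in scalability and diversity.

Insight: The core innovation is to exploit and engineer the “Cognitive Gap”: the defense relies on dynamic interactive tasks that depend on adaptive intuition, not on static datasets or traditional logic puzzles, offering a scalable and diverse defense paradigm for the agentic era.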

Abstract: The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent advancements in reasoning-heavy models, such as Gemini3-Pro-High and GPT-5.2-Xhigh, have effectively collapsed this security barrier, achieving pass rates as high as 90% on complex logic puzzles like “Bingo”. In response, we introduce Next-Gen CAPTCHAs, a scalable defense framework designed to secure the next-generation web against advanced agents. Unlike static datasets, our benchmark is built upon a robust data generation pipeline, allowing for large-scale and easily scalable evaluations. Notably, for backend-supported types, our system is capable of generating effectively unbounded CAPTCHA instances. We exploit the persistent human-agent “Cognitive Gap” in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents, offering a scalable and diverse defense mechanism for the agentic era.


[187] AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization cs.LG | cs.CV | cs.HCPDF

Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, Mohammad Soleymani

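TL;DR: This paper proposes AVERE, which uses preference optimization to improve multimodal large language models on emotion understanding. It targets spurious associations with irrelevant audiovisual cues and hallucinations driven by text priors, and introduces the EmoReAlM benchmark for evaluation.

Details

Motivation: Current multimodal large language models face two key challenges in emotion understanding: spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. Both hinder the development of socially intelligent agents.

Result: In zero-shot settings on DFEW, RAVDESS, and EMER, the method improves the reference baseline models by 6-19% in relative performance.

Insight: Contributions include the EmoReAlM benchmark, which quantifies cue-emotion associations, hallucinations, and modality agreement, and AVEm-DPO, a preference optimization technique that aligns model responses by constructing preferences over responses exhibiting spurious associations or hallucinations and over audiovisual input pairs guided by textual prompts. A regularization term penalizes reliance on text priors, which mitigates modality-specific cue hallucinations.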

Abstract: Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain - spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS and EMER demonstrate that our method significantly improves the performance of the reference baseline models with 6-19% of relative performance gains in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI. Code, models and benchmark will be released at https://avere-iclr.github.io.


[188] Video-based Music Generation cs.LG | cs.AI | cs.CV | cs.MM | cs.SDPDF

Serkan Sulun

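TL;DR: This thesis presents EMSYNC, a fast, free, and automatic system that generates a soundtrack emotionally and rhythmically synchronized with an input video. It has three core components: a novel video emotion classifier, a large-scale MIDI dataset and generator conditioned on continuous emotion values, and boundary offset encodings for temporal synchronization.

Details

Motivation: With video content on the internet growing rapidly, finding a suitable soundtrack is a challenge. The goal is to let content creators obtain emotionally and rhythmically synchronized music automatically, without composing or licensing music.

Result: The video emotion classifier achieves state-of-the-art results on Ekman-6 and MovieNet. User studies show that EMSYNC outperforms existing methods in music richness, emotional alignment, temporal synchronization, and overall preference, setting a new state of the art in video-based music generation.

Insight: Contributions include: 1) a video emotion classifier that extracts features with frozen pretrained deep networks and trains only the fusion layers, reducing computational complexity while improving accuracy; 2) the first emotion-based MIDI generator conditioned on continuous emotion values rather than discrete categories; 3) boundary offset encodings, a method for temporally aligning musical chords with scene changes.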

Abstract: As the volume of video content on the internet grows rapidly, finding a suitable soundtrack remains a significant challenge. This thesis presents EMSYNC (EMotion and SYNChronization), a fast, free, and automatic solution that generates music tailored to the input video, enabling content creators to enhance their productions without composing or licensing music. Our model creates music that is emotionally and rhythmically synchronized with the video. A core component of EMSYNC is a novel video emotion classifier. By leveraging pretrained deep neural networks for feature extraction and keeping them frozen while training only fusion layers, we reduce computational complexity while improving accuracy. We show the generalization abilities of our method by obtaining state-of-the-art results on Ekman-6 and MovieNet. Another key contribution is a large-scale, emotion-labeled MIDI dataset for affective music generation. We then present an emotion-based MIDI generator, the first to condition on continuous emotional values rather than discrete categories, enabling nuanced music generation aligned with complex emotional content. To enhance temporal synchronization, we introduce a novel temporal boundary conditioning method, called “boundary offset encodings,” aligning musical chords with scene changes. Combining video emotion classification, emotion-based music generation, and temporal boundary conditioning, EMSYNC emerges as a fully automatic video-based music generator. User studies show that it consistently outperforms existing methods in terms of music richness, emotional alignment, temporal synchronization, and overall preference, setting a new state-of-the-art in video-based music generation.


[189] Mimetic Initialization of MLPs cs.LG | cs.AI | cs.CVPDF

Asher Trockman, J. Zico Kolter

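TL;DR: This paper makes the first attempt to apply mimetic initialization to channel mixing layers, namely multilayer perceptrons (MLPs). Based on structure observed in trained weights, it proposes an extremely simple technique, giving the first MLP layer a nonzero mean, that speeds up training on small-scale vision tasks such as CIFAR-10 and ImageNet-1k.

Details

Motivation: To extend mimetic initialization from spatial mixing layers (such as convolutional and self-attention layers) to channel mixing layers (MLPs), improving MLP initialization and training efficiency.

Result: On CIFAR-10 and ImageNet-1k the method speeds up training. Its effect is smaller than that of spatial mixing initializations, but it can be combined with them for an additional positive effect.

Insight: The novelty is the first application of mimetic initialization to MLPs, improving training speed with a simple nonzero-mean initialization. Viewed objectively, it extends initialization research to a new layer type, and its simplicity makes it easy to integrate into existing frameworks.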

Abstract: Mimetic initialization uses pretrained models as case studies of good initialization, using observations of structures in trained weights to inspire new, simple initialization techniques. So far, it has been applied only to spatial mixing layers, such as convolutional, self-attention, and state space layers. In this work, we present the first attempt to apply the method to channel mixing layers, namely multilayer perceptrons (MLPs). Our extremely simple technique for MLPs – to give the first layer a nonzero mean – speeds up training on small-scale vision tasks like CIFAR-10 and ImageNet-1k. Though its effect is much smaller than spatial mixing initializations, it can be used in conjunction with them for an additional positive effect.
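A nonzero-mean first layer can be sketched in a few lines of NumPy. The abstract does not give the shift magnitude or the base distribution, so the Gaussian with standard deviation 1/sqrt(d_in) and the value `mean_shift=0.01` below are placeholder assumptions, not the paper's settings.

```python
import numpy as np

def init_first_layer(d_in: int, d_hidden: int, mean_shift: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Gaussian init for the first MLP weight matrix with a nonzero mean.

    The scale 1/sqrt(d_in) is a common fan-in choice; mean_shift is a placeholder value.
    """
    std = 1.0 / np.sqrt(d_in)
    return rng.normal(loc=mean_shift, scale=std, size=(d_in, d_hidden))

rng = np.random.default_rng(0)
w_shifted = init_first_layer(256, 1024, mean_shift=0.01, rng=rng)  # nonzero-mean variant
w_zero = init_first_layer(256, 1024, mean_shift=0.0, rng=rng)      # standard zero-mean baseline
```

The only difference from a standard initializer is the `loc` argument, which is why such a change is easy to combine with other initialization schemes.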


cs.AI [Back]

[190] LLM-FSM: Scaling Large Language Models for Finite-State Reasoning in RTL Code Generation cs.AI | cs.AR | cs.CLPDF

Yuheng Wu, Berk Gokmen, Zhouhua Xie, Peijing Li, Caroline Trippel

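TL;DR: This paper presents LLM-FSM, a benchmark that evaluates how well large language models recover finite-state machine behavior from natural-language specifications and generate correct RTL code. An automated pipeline builds 1,000 problems. Experiments show that even the strongest LLMs lose accuracy sharply as FSM complexity increases, that supervised fine-tuning generalizes to out-of-distribution tasks, and that more test-time compute improves reasoning reliability.

Details

Motivation: Finite-state reasoning is central to hardware design, yet existing benchmarks rely on manually constructed examples. An automated, scalable benchmark is needed to evaluate how well LLMs understand and implement state-dependent behavior in specification-to-RTL generation.

Result: On LLM-FSM, the accuracy of even the strongest LLMs drops sharply as FSM complexity increases; supervised fine-tuning generalizes effectively to out-of-distribution tasks; increasing test-time compute improves reasoning reliability.

Insight: The novelty is a fully automated pipeline that constructs FSMs with configurable state counts and constrained transition structures and produces structured YAML, natural-language specifications, and correct-by-construction reference RTL and testbenches. The benchmark is extensible: its FSM complexity can scale with future model capabilities.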

Abstract: Finite-state reasoning, the ability to understand and implement state-dependent behavior, is central to hardware design. In this paper, we present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior from natural-language specifications and translate it into correct register-transfer-level (RTL) implementations. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline. LLM-FSM first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context, and to further convert that YAML into a natural-language (NL) specification. From the same YAML, our pipeline synthesizes the reference RTL and testbench in a correct-by-construction manner. All 1,000 problems are verified using LLM-based and SAT-solver-based checks, with human review on a subset. Our experiments show that even the strongest LLMs exhibit sharply declining accuracy as FSM complexity increases. We further demonstrate that training-time scaling via supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while increasing test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible by allowing its FSM complexity to scale with future model capabilities.
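The first pipeline stage, constructing an FSM with a configurable state count, might look like the sketch below. The complete, deterministic transition table and the uniform random choice of next state are assumptions for illustration; the benchmark's actual transition constraints are not described in the abstract.

```python
import random
from typing import Dict, List, Tuple

Transitions = Dict[Tuple[str, str], str]  # (current_state, input_symbol) -> next_state

def make_fsm(n_states: int, inputs: List[str], seed: int = 0) -> Transitions:
    """Build a complete, deterministic transition table with n_states states."""
    rng = random.Random(seed)                      # seeded for reproducible problems
    states = [f"S{i}" for i in range(n_states)]
    return {(s, x): rng.choice(states) for s in states for x in inputs}

def run_fsm(trans: Transitions, start: str, sequence: List[str]) -> List[str]:
    """Simulate the FSM and return the visited states, including the start state."""
    trace = [start]
    for symbol in sequence:
        trace.append(trans[(trace[-1], symbol)])
    return trace

fsm = make_fsm(n_states=4, inputs=["0", "1"], seed=1)
visited = run_fsm(fsm, "S0", ["0", "1", "1"])
```

A simulator like `run_fsm` is also the kind of reference behaviour against which generated RTL could be checked, since the table fully determines the expected state sequence.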


[191] Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? cs.AI | cs.CL | cs.LGPDF

Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue

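TL;DR: This paper proposes Theory of Space, a notion for evaluating whether foundation models can construct spatial beliefs through active exploration, and designs a benchmark based on curiosity-driven exploration. State-of-the-art models show a marked performance drop when they must gather information actively, inefficient exploration, unstable spatial beliefs, and belief inertia, revealing that they struggle to maintain coherent, revisable spatial beliefs.

Details

Motivation: To study the capacity of multimodal foundation models for active, self-directed exploration, in particular their ability to construct, revise, and exploit spatial beliefs by actively acquiring information under partial observability, which remains understudied.

Result: Evaluating state-of-the-art models on the benchmark reveals several bottlenecks: an Active-Passive Gap (performance drops significantly when agents must gather information autonomously), inefficient exploration (models explore unsystematically compared to program-based proxies), unstable spatial beliefs, and Belief Inertia (agents fail to update obsolete priors with new evidence, most severely in vision-based models).

Insight: Contributions include defining Theory of Space and introducing spatial belief probing, which prompts models to reveal their internal spatial representations. Viewed objectively, the results show that current foundation models have limited ability to maintain coherent, revisable spatial beliefs during active exploration, pointing to directions for future work.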

Abstract: Spatial embodied intelligence requires agents to act to acquire information under partial observability. While multimodal foundation models excel at passive perception, their capacity for active, self-directed exploration remains understudied. We propose Theory of Space, defined as an agent’s ability to actively acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We evaluate this through a benchmark where the goal is curiosity-driven exploration to build an accurate cognitive map. A key innovation is spatial belief probing, which prompts models to reveal their internal spatial representations at each step. Our evaluation of state-of-the-art models reveals several critical bottlenecks. First, we identify an Active-Passive Gap, where performance drops significantly when agents must autonomously gather information. Second, we find high inefficiency, as models explore unsystematically compared to program-based proxies. Through belief probing, we diagnose that while perception is an initial bottleneck, global beliefs suffer from instability that causes spatial knowledge to degrade over time. Finally, using a false belief paradigm, we uncover Belief Inertia, where agents fail to update obsolete priors with new evidence. This issue is present in text-based agents but is particularly severe in vision-based models. Our findings suggest that current foundation models struggle to maintain coherent, revisable spatial beliefs during active exploration.


[192] From Out-of-Distribution Detection to Hallucination Detection: A Geometric View cs.AI | cs.CLPDF

Litian Liu, Reza Pourreza, Yubing Jian, Yao Qin, Roland Memisevic

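TL;DR: This paper reframes hallucination detection in large language models as out-of-distribution (OOD) detection. Treating next-token prediction as a classification task and applying OOD detection techniques yields training-free, single-sample-based hallucination detectors that perform well on reasoning tasks.

Details

Motivation: Existing hallucination detection methods perform well on question answering but are less effective on tasks that require reasoning, so a more effective detection method is needed for the safety and reliability of large language models.

Result: OOD-based approaches achieve strong accuracy in hallucination detection on reasoning tasks while remaining training-free and single-sample-based.

Insight: The core idea is reframing hallucination detection as OOD detection and exploiting the classification nature of next-token prediction, which offers a scalable path toward language model safety.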

Abstract: Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.
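To illustrate what “treating next-token prediction as classification” allows, the sketch below applies maximum softmax probability, a classic OOD baseline from the vision literature, to per-token logits. This baseline is swapped in for illustration only: the paper takes a geometric view and states that modifications are needed for LLMs, so this is not its detector.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def msp_score(logits: np.ndarray) -> float:
    """Mean over generated tokens of the maximum softmax probability.

    logits has shape (n_tokens, vocab_size). Lower scores suggest the sequence
    looks more out-of-distribution under this baseline.
    """
    probs = np.exp(log_softmax(logits))
    return float(probs.max(axis=-1).mean())

confident = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])  # peaked next-token distributions
uncertain = np.zeros((2, 3))                                  # uniform next-token distributions
score_confident = msp_score(confident)
score_uncertain = msp_score(uncertain)
```

The score needs only a single forward pass over one sample and no training, which matches the training-free, single-sample setting described above; a threshold on it would have to be chosen on held-out data.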


[193] Steer2Adapt: Dynamically Composing Steering Vectors Elicits Efficient Adaptation of LLMs cs.AI | cs.CL | cs.LGPDF

Pengrui Han, Xueqiang Xu, Keyang Xuan, Peiyang Song, Siru Ouyang

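TL;DR: This paper proposes STEER2ADAPT, a framework that steers LLM activations by dynamically composing predefined semantic basis vectors, enabling efficient task adaptation. It avoids learning a static steering vector from scratch for each task and handles task variation and complex tasks that require multiple coordinated capabilities.

Details

Motivation: Existing activation steering methods typically use a single static direction per task or concept, which makes them inflexible under task variation and inadequate for complex tasks that require multiple coordinated capabilities.

Result: Experiments across 9 tasks and 3 models in the reasoning and safety domains show an average improvement of 8.2%. Further analysis shows that STEER2ADAPT is a data-efficient, stable, and transparent inference-time adaptation method.

Insight: The core idea is to capture the concept dimensions shared across tasks as a reusable, low-dimensional semantic prior subspace and to adapt to new tasks by dynamically discovering a linear combination of basis vectors from a handful of examples, giving lightweight, composable model adaptation.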

Abstract: Activation steering has emerged as a promising approach for efficiently adapting large language models (LLMs) to downstream behaviors. However, most existing steering methods rely on a single static direction per task or concept, making them inflexible under task variation and inadequate for complex tasks that require multiple coordinated capabilities. To address this limitation, we propose STEER2ADAPT, a lightweight framework that adapts LLMs by composing steering vectors rather than learning new ones from scratch. In many domains (e.g., reasoning or safety), tasks share a small set of underlying concept dimensions. STEER2ADAPT captures these dimensions as a reusable, low-dimensional semantic prior subspace, and adapts to new tasks by dynamically discovering a linear combination of basis vectors from only a handful of examples. Experiments across 9 tasks and 3 models in both reasoning and safety domains demonstrate the effectiveness of STEER2ADAPT, achieving an average improvement of 8.2%. Extensive analyses further show that STEER2ADAPT is a data-efficient, stable, and transparent inference-time adaptation method for LLMs.
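Composing a steering vector from a small basis can be sketched as below. The additive intervention on hidden states and the exhaustive search over a few candidate coefficient tuples are illustrative assumptions; the abstract does not say how the combination is discovered, and `score` is a hypothetical few-shot objective.

```python
from typing import Callable, Sequence, Tuple
import numpy as np

def compose_steering(basis: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Linear combination of k basis steering vectors: (k,) @ (k, d) -> (d,)."""
    return coeffs @ basis

def steer(hidden: np.ndarray, basis: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Add the composed steering vector to every position's hidden state."""
    return hidden + compose_steering(basis, coeffs)

def pick_coeffs(basis: np.ndarray,
                candidates: Sequence[Tuple[float, ...]],
                score: Callable[[np.ndarray], float]) -> Tuple[float, ...]:
    """Choose the candidate coefficients whose composed vector scores highest."""
    return max(candidates,
               key=lambda c: score(compose_steering(basis, np.asarray(c, dtype=float))))

basis = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # two concept directions in a 3-d space
hidden = np.zeros((4, 3))                               # four token positions
steered = steer(hidden, basis, np.array([0.5, -1.0]))
best = pick_coeffs(basis, [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], score=lambda v: v[1])
```

Because only the `k` coefficients are searched, adapting to a new task touches far fewer parameters than learning a fresh `d`-dimensional vector.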


[194] When Is Enough Not Enough? Illusory Completion in Search Agents cs.AI | cs.CLPDF

Dayoon Ko, Jihyuk Kim, Sohyeon Kim, Haeju Park, Dahyun Lee

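TL;DR: This paper studies “illusory completion” in agents that combine multi-turn reasoning with search tools: on multi-constraint problems, an agent wrongly believes the task is complete although constraints remain unresolved or violated, which leads to underverified answers. The authors introduce the Epistemic Ledger, an evaluation framework for diagnosing this behavior, and LiveLedger, an intervention that explicitly tracks constraint state during reasoning, substantially reducing underverified answers and improving accuracy.

Details

Motivation: Search agents perform strongly on multi-hop and long-horizon benchmarks, but it is unclear whether they reliably track, verify, and maintain all conditions in multi-constraint problems. The paper examines their reliability on such problems and diagnoses the illusory completion failure mode.

Result: With LiveLedger, underverified answers on multi-constraint problems fall by up to 26.5% and overall accuracy rises by up to 11.6%.

Insight: The novelty is the Epistemic Ledger framework for systematically diagnosing constraint-tracking failures in multi-turn reasoning, and the demonstration that a simple intervention, explicit state tracking at execution time (LiveLedger), effectively mitigates illusory completion. This suggests a way to improve the reliability and interpretability of search agents.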

Abstract: Recent search agents leverage multi-turn reasoning and search tools to achieve strong performance on multi-hop and long-horizon benchmarks. Yet it remains unclear whether they reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions in these questions. We study this capability under multi-constraint problems, where valid answers must satisfy several constraints simultaneously. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. To diagnose this behavior, we introduce the Epistemic Ledger, an evaluation framework that tracks evidential support and agents’ beliefs for each constraint throughout multi-turn reasoning. Our analysis reveals four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these findings, we examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker. This simple intervention consistently improves performance, substantially reducing underverified answers (by up to 26.5%) and improving overall accuracy (by up to 11.6%) on multi-constraint problems.
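Explicit constraint-state tracking can be pictured with a small data structure. The class name, the three status labels, and the rule that an agent may finish only when every constraint is supported are assumptions made for this sketch, not the LiveLedger implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ConstraintLedger:
    """Tracks each constraint as 'open', 'supported', or 'refuted' (labels are assumed)."""
    status: Dict[str, str] = field(default_factory=dict)

    def add(self, constraint: str) -> None:
        self.status.setdefault(constraint, "open")       # new constraints start unresolved

    def record(self, constraint: str, supported: bool) -> None:
        self.status[constraint] = "supported" if supported else "refuted"

    def unresolved(self) -> List[str]:
        return [c for c, s in self.status.items() if s != "supported"]

    def may_finish(self) -> bool:
        """Guard against premature exit: finish only when every constraint is supported."""
        return bool(self.status) and not self.unresolved()

ledger = ConstraintLedger()
for c in ("born before 1900", "won a Nobel Prize", "worked in Paris"):
    ledger.add(c)
ledger.record("born before 1900", True)
ledger.record("won a Nobel Prize", True)
blocked = not ledger.may_finish()   # one constraint is still open
```

Checking `may_finish()` before emitting an answer addresses the premature-exit pattern directly, and `unresolved()` tells the agent which searches are still needed.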


[195] EventCast: Hybrid Demand Forecasting in E-Commerce with LLM-Based Event Knowledge cs.AI | cs.CL | cs.IR | cs.MMPDF

Congcong Hu, Yuang Shi, Fan Huang, Yang Xiang, Zhou Ye

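TL;DR: This paper presents EventCast, a modular framework for e-commerce demand forecasting. Its core idea is to use a large language model (LLM) to turn unstructured information about future events (such as promotions and holidays) into interpretable textual summaries, then fuse this event knowledge with historical demand time-series features in a dual-tower architecture to improve forecast accuracy during high-impact periods such as flash sales and holiday campaigns.

Details

Motivation: Existing forecasting systems often fail during high-impact events (such as flash sales, holiday campaigns, and sudden policy changes), when demand patterns shift abruptly and unpredictably.

Result: Deployed in real-world e-commerce scenarios covering 160 regions across 4 countries over 10 months, EventCast improves MAE and MSE by up to 86.9% and 97.7% over the variant without event knowledge; during event-driven periods it reduces MAE by up to 57.0% and MSE by 83.3% versus the best industrial baseline. The framework has been deployed in real-world industrial pipelines since March 2025.

Insight: The main contribution is a hybrid approach that uses the LLM solely for event-driven reasoning and knowledge extraction, not for direct numerical forecasting: it produces interpretable textual summaries that are fused with conventional time-series features. This offers a scalable, explainable, and practical way to bring an LLM's world knowledge (such as cultural nuances and novel event combinations) into a structured forecasting task.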

Abstract: Demand forecasting is a cornerstone of e-commerce operations, directly impacting inventory planning and fulfillment scheduling. However, existing forecasting systems often fail during high-impact periods such as flash sales, holiday campaigns, and sudden policy interventions, where demand patterns shift abruptly and unpredictably. In this paper, we introduce EventCast, a modular forecasting framework that integrates future event knowledge into time-series prediction. Unlike prior approaches that ignore future interventions or directly use large language models (LLMs) for numerical forecasting, EventCast leverages LLMs solely for event-driven reasoning. Unstructured business data from existing operational databases, covering campaigns, holiday schedules, and seller incentives, is processed by an LLM that converts it into interpretable textual summaries, leveraging world knowledge for cultural nuances and novel event combinations. These summaries are fused with historical demand features within a dual-tower architecture, enabling accurate, explainable, and scalable forecasts. Deployed on real-world e-commerce scenarios spanning 160 regions across 4 countries over 10 months, EventCast achieves up to 86.9% and 97.7% improvement on MAE and MSE compared to the variant without event knowledge, and reduces MAE by up to 57.0% and MSE by 83.3% versus the best industrial baseline during event-driven periods. EventCast has been deployed in real-world industrial pipelines since March 2025, offering a practical solution for improving operational decision-making in dynamic e-commerce environments.
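
To make the reported percentages concrete, the sketch below shows how a relative improvement in MAE or MSE is typically computed. It assumes "improvement" means the relative error reduction `(baseline - model) / baseline`; the abstract does not spell out the formula, and the function names and toy numbers are illustrative only.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_error, model_error):
    """Fraction by which the model reduces the baseline error (assumed definition)."""
    return (baseline_error - model_error) / baseline_error


# Toy demand series with a spike on the third day (made-up numbers).
actual = [100, 200, 400]
without_events = [100, 200, 200]  # misses the spike
with_events = [100, 200, 380]     # anticipates most of it

mae_gain = relative_improvement(mae(actual, without_events), mae(actual, with_events))
mse_gain = relative_improvement(mse(actual, without_events), mse(actual, with_events))
```

Because MSE squares each error, a model that fixes a few large misses tends to show a larger relative gain on MSE than on MAE, which matches the pattern in the reported numbers.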


[196] Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training cs.AI | cs.CL

Yiwei Qin, Zhen Huang, Tiantian Mi, Weiye Si, Chenyang Zhou

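TL;DR: This paper proposes Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce better data for next-generation systems. The authors validate the framework on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5), and bridge the learnability gap in raw scientific text through Generative Refinement (L4) and Cognitive Completion (L5) using frontier LLMs.

Details

Motivation: Data quality determines foundation model performance, yet systematic data processing frameworks are lacking. The paper aims to unlock the latent value of scientific data for pre-training through a systematic taxonomy.

Result: The authors pre-train daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training on the Darwin-Science corpus, the models outperform the baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 processing yields a total gain of +1.36.

Insight: The main innovation is Data Darwinism as a systematic taxonomy of data-model co-evolution, together with empirical evidence that higher-level generative processing (L4 and L5, which explicate reasoning and terminology) unlocks latent value in raw data and improves domain performance. The approach stresses systematic, iteratively refined data processing pipelines.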

Abstract: Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion) using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.


[197] Accelerating Social Science Research via Agentic Hypothesization and Experimentation cs.AI | cs.CL

Jishu Sen Gupta, Harini SI, Somesh Kumar Singh, Syed Mohamad Tawseeq, Yaman Kumar Singla

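TL;DR: This paper proposes EXPERIGEN, an agentic framework that accelerates end-to-end social science discovery through a two-phase search inspired by Bayesian optimization, in which a Generator proposes candidate hypotheses and an Experimenter evaluates them empirically. Across multiple domains the framework discovers more statistically significant and more predictive hypotheses, and it handles multimodal and relational data. An expert review and the first A/B test of LLM-generated hypotheses assess the novelty, impact, and empirical validity of the generated hypotheses.

Details

Motivation: Data-driven social science research is slow, relying on iterative cycles of observation, hypothesis generation, and experimental validation, and most existing methods cannot support end-to-end scientific discovery. The paper aims to fill this gap and accelerate the whole research process.

Result: Across multiple domains, EXPERIGEN discovers 2-4x more statistically significant hypotheses than prior approaches, and these hypotheses are 7-17% more predictive. In the expert review, 88% of machine-generated hypotheses were rated moderately or strongly novel and 70% were deemed impactful and worth pursuing. The first A/B test of LLM-generated hypotheses yielded statistically significant results with p < 1e-6 and an effect size of 344%.

Insight: The core innovation is a two-phase search framework, inspired by Bayesian optimization, in which Generator and Experimenter agents cooperate to automate end-to-end scientific discovery. Beyond improving the efficiency and statistical performance of hypothesis discovery, the work systematically evaluates the scientific value of generated hypotheses (novelty, impact, rigor) through expert review and a real-world A/B test, providing a verifiable example of AI-driven social science research.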

Abstract: Data-driven social science research is inherently slow, relying on iterative cycles of observation, hypothesis generation, and experimental validation. While recent data-driven methods promise to accelerate parts of this process, they largely fail to support end-to-end scientific discovery. To address this gap, we introduce EXPERIGEN, an agentic framework that operationalizes end-to-end discovery through a Bayesian optimization inspired two-phase search, in which a Generator proposes candidate hypotheses and an Experimenter evaluates them empirically. Across multiple domains, EXPERIGEN consistently discovers 2-4x more statistically significant hypotheses that are 7-17 percent more predictive than prior approaches, and naturally extends to complex data regimes including multimodal and relational datasets. Beyond statistical performance, hypotheses must be novel, empirically grounded, and actionable to drive real scientific progress. To evaluate these qualities, we conduct an expert review of machine-generated hypotheses, collecting feedback from senior faculty. Among 25 reviewed hypotheses, 88 percent were rated moderately or strongly novel, 70 percent were deemed impactful and worth pursuing, and most demonstrated rigor comparable to senior graduate-level research. Finally, recognizing that ultimate validation requires real-world evidence, we conduct the first A/B test of LLM-generated hypotheses, observing statistically significant results with p less than 1e-6 and a large effect size of 344 percent.


[198] Free(): Learning to Forget in Malloc-Only Reasoning Models cs.AI | cs.CL

Yilun Zheng, Dongyang Ma, Tian Liang, Jiahao Xu, Xinting Huang

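TL;DR: The paper proposes Free()LM, which gives reasoning models an intrinsic self-forgetting capability through the Free-Module, a plug-and-play LoRA adapter. By iteratively switching between reasoning and cleaning modes, the model dynamically identifies and prunes useless context chunks, addressing the performance degradation caused by overthinking in reasoning models.

Details

Motivation: Standard LLMs act as “malloc-only” engines: during reasoning they keep accumulating valid and redundant steps alike and have no mechanism to prune obsolete information, so excessive thinking tokens end up hurting performance. A forgetting capability is needed to maintain a compact, noise-free state.

Result: Experiments show that Free()LM delivers consistent improvements across all model scales (8B to 685B), with a 3.3% average improvement over top-tier reasoning baselines and a new SOTA on IMOanswerBench using DeepSeek V3.2-Speciale. On long-horizon tasks where the standard Qwen3-235B-A22B model collapses completely (0% accuracy), Free()LM restores performance to 50%.

Insight: The innovation is formalizing forgetting as a plug-and-play module that manages context dynamically through mode switching. This challenges the accumulation-only paradigm of conventional reasoning models and suggests that sustainable intelligence requires the freedom to forget as much as the power to think.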

Abstract: Reasoning models enhance problem-solving by scaling test-time compute, yet they face a critical paradox: excessive thinking tokens often degrade performance rather than improve it. We attribute this to a fundamental architectural flaw: standard LLMs operate as “malloc-only” engines, continuously accumulating valid and redundant steps alike without a mechanism to prune obsolete information. To break this cycle, we propose Free()LM, a model that introduces an intrinsic self-forgetting capability via the Free-Module, a plug-and-play LoRA adapter. By iteratively switching between reasoning and cleaning modes, Free()LM dynamically identifies and prunes useless context chunks, maintaining a compact and noise-free state. Extensive experiments show that Free()LM provides consistent improvements across all model scales (8B to 685B). It achieves a 3.3% average improvement over top-tier reasoning baselines, even establishing a new SOTA on IMOanswerBench using DeepSeek V3.2-Speciale. Most notably, in long-horizon tasks where the standard Qwen3-235B-A22B model suffers a total collapse (0% accuracy), Free()LM restores performance to 50%. Our findings suggest that sustainable intelligence requires the freedom to forget as much as the power to think.
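
The alternation between reasoning and cleaning modes can be sketched as a simple loop. This is an illustration under stated assumptions, not the Free-Module itself: in the paper the pruning decision comes from a LoRA adapter, whereas here a caller-supplied predicate `is_useful` (hypothetical) stands in for it, and `clean_every` is an invented parameter that sets how often cleaning runs.

```python
def reason_with_forgetting(steps, is_useful, clean_every=3):
    """Accumulate reasoning chunks, periodically pruning those judged useless.

    steps: iterable of reasoning chunks (strings here).
    is_useful: hypothetical stand-in for the learned pruning decision.
    clean_every: run a cleaning pass after this many new chunks.
    """
    context = []
    for i, chunk in enumerate(steps, start=1):
        context.append(chunk)  # reasoning mode: "malloc" a new chunk
        if i % clean_every == 0:
            # cleaning mode: "free" chunks that no longer help
            context = [c for c in context if is_useful(c)]
    return context


trace = ["try x=1", "dead end", "x=2 works", "dead end", "answer 2"]
kept = reason_with_forgetting(trace, lambda c: "dead end" not in c, clean_every=2)
```

The point of the sketch is only the control flow: context grows during reasoning and shrinks during cleaning, so its size stays bounded by how much is judged useful.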


[199] MemAdapter: Fast Alignment across Agent Memory Paradigms via Generative Subgraph Retrieval cs.AI | cs.CL | cs.LG

Xin Zhang, Kailai Yang, Chenyue Li, Hao Li, Qiyu Wei

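TL;DR: This paper proposes MemAdapter, a retrieval framework for unifying heterogeneous agent memory paradigms. It uses a two-stage training strategy: first training a generative subgraph retriever from a unified memory space, then training a lightweight alignment module through contrastive learning to adapt to new memory paradigms, which lowers alignment cost while improving retrieval flexibility.

Details

Motivation: Existing agent memory systems are typically designed within isolated paradigms (e.g., explicit, parametric, or latent memory), and their tightly coupled retrieval methods hinder cross-paradigm generalization and fusion. The paper takes a first step toward unifying heterogeneous memory paradigms within a single memory system.

Result: Comprehensive experiments on three public evaluation benchmarks show that the generative subgraph retriever consistently outperforms five strong agent memory baselines across three memory paradigms and multiple agent model scales. MemAdapter completes cross-paradigm alignment within 13 minutes on a single GPU, outperforms the original memory retrievers with less than 5% of the training compute, and enables effective zero-shot fusion across paradigms.

Insight: The core innovation is a unified generative subgraph retrieval framework with a two-stage training strategy. Decoupling the retriever from any particular paradigm enables fast, low-cost cross-paradigm alignment and fusion, making it a potential plug-and-play solution for agent memory systems.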

Abstract: The memory mechanism is a core component of LLM-based agents, enabling reasoning and knowledge discovery over long-horizon contexts. Existing agent memory systems are typically designed within isolated paradigms (e.g., explicit, parametric, or latent memory) with tightly coupled retrieval methods that hinder cross-paradigm generalization and fusion. In this work, we take a first step toward unifying heterogeneous memory paradigms within a single memory system. We propose MemAdapter, a memory retrieval framework that enables fast alignment across agent memory paradigms. MemAdapter adopts a two-stage training strategy: (1) training a generative subgraph retriever from the unified memory space, and (2) adapting the retriever to unseen memory paradigms by training a lightweight alignment module through contrastive learning. This design improves the flexibility of memory retrieval and substantially reduces alignment cost across paradigms. Comprehensive experiments on three public evaluation benchmarks demonstrate that the generative subgraph retriever consistently outperforms five strong agent memory systems across three memory paradigms and agent model scales. Notably, MemAdapter completes cross-paradigm alignment within 13 minutes on a single GPU, achieving superior performance over original memory retrievers with less than 5% of training compute. Furthermore, MemAdapter enables effective zero-shot fusion across memory paradigms, highlighting its potential as a plug-and-play solution for agent memory systems.
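
The second training stage relies on contrastive learning. The abstract does not name the loss, so the sketch below uses InfoNCE, a common contrastive objective, as an assumed stand-in. It operates on plain Python lists; `info_nce`, `cosine`, and the temperature value are illustrative choices, not the paper's.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the query is closer to the positive than to negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[0]


# Aligned pair: the query matches the positive and is orthogonal to the negative.
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Misaligned pair: the query matches the negative instead.
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing a loss of this kind pulls embeddings from a new memory paradigm toward their counterparts in the unified space, which is the role the abstract assigns to the alignment module.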


[200] Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure cs.AI | cs.CL

Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang

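TL;DR: This paper studies the causal structure of latent chain-of-thought (latent CoT) methods by modeling latent steps as variables in a structural causal model (SCM) and analyzing their dynamics through step-wise interventions. On mathematical and general reasoning tasks, it examines three questions: which steps are causally necessary, how influence propagates across steps, and whether intermediate trajectories retain competing answer modes.

Details

Motivation: Latent CoT methods replace explicit textual reasoning with internal latent steps, but these intermediate computations are hard to evaluate beyond correlation-based probes. The paper takes a causal view, treating the latent steps as a manipulable causal process, to interpret and improve latent reasoning systems more reliably.

Result: Latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and there is a persistent gap between early output bias and late representational commitment. These findings come from analyzing two representative paradigms, Coconut and CODI, on mathematical and general reasoning tasks.

Insight: The innovation is modeling latent CoT as a structural causal model and analyzing it through step-wise interventions, which reveals the non-uniform allocation of function across steps and the commitment gap. This motivates mode-conditional and stability-aware analyses as more reliable tools for interpreting and improving latent reasoning systems.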

Abstract: Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise $\mathrm{do}$-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decidable early; (2) how does influence propagate across steps, and how does this structure compare to explicit CoT; and (3) do intermediate trajectories retain competing answer modes, and how does output-level commitment differ from representational commitment across steps. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses – and corresponding training/decoding objectives – as more reliable tools for interpreting and improving latent reasoning systems.


[201] CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute cs.AI | cs.CL

Chen Jin, Ryutaro Tanno, Tom Diethe, Philip Teare

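TL;DR: This paper proposes CoRefine, a confidence-guided method for adaptive test-time compute. A lightweight controller on top of a frozen LLM dynamically decides whether to halt, re-examine, or try a different approach, achieving accuracy competitive with large-scale parallel decoding (e.g., 512 samples) while using roughly 190 times fewer tokens.

Details

Motivation: LLMs often rely on large-scale parallel decoding (e.g., generating 512 samples) to boost reasoning accuracy, which incurs substantial compute. The paper aims for efficient reasoning through an adaptive, lightweight method.

Result: Across diverse reasoning benchmarks and three open-source models, the controller achieves 92.6% precision when it confidently halts, needs an average of 2.7 refinement steps per problem, and uses roughly 190 times fewer tokens than 512-sample baselines while reaching competitive accuracy.

Insight: The innovation is treating confidence as a control signal rather than a correctness guarantee and using a lightweight controller for dynamic self-correction. The CoRefine-Tree extension, a hybrid sequential-parallel variant, adaptively balances exploration and exploitation, providing a modular primitive for scalable reasoning and agentic settings with imperfect verifiers.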

Abstract: Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a confidence-guided self-refinement method that achieves competitive accuracy using a fraction of the tokens via a lightweight 211k-parameter Conv1D controller atop a frozen LLM. The controller consumes full-trace confidence to decide whether to halt, re-examine, or try a different approach, enabling targeted self-correction with an average of 2.7 refinement steps per problem and roughly 190-fold token reduction relative to 512-sample baselines. Across diverse reasoning benchmarks and three open-source models, the controller achieves 92.6 percent precision when it confidently halts, indicating that confidence dynamics reliably signal correctness without ground-truth verification. We extend this to CoRefine-Tree, a hybrid sequential-parallel variant that adaptively balances exploration and exploitation, with easy serving integration and verifier compatibility. By treating confidence as a control signal rather than a correctness guarantee, CoRefine provides a modular primitive for scalable reasoning and agentic settings with imperfect verifiers.
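
The halt / re-examine / try-a-different-approach loop can be sketched as follows. This is a simplified stand-in: the paper's controller is a 211k-parameter Conv1D network over the full confidence trace, whereas the sketch thresholds the mean confidence. The thresholds, the `generate` callback, and the action names are all assumptions for illustration.

```python
def decide(confidences, halt_threshold=0.9, retry_threshold=0.5):
    """Map a trace of per-token confidences to a control action (threshold stand-in)."""
    mean_conf = sum(confidences) / len(confidences)
    if mean_conf >= halt_threshold:
        return "halt"
    if mean_conf >= retry_threshold:
        return "re-examine"
    return "new-approach"


def refine(generate, max_steps=5):
    """Run generate(action) until the controller halts or the budget runs out.

    generate: hypothetical callback returning (answer, confidences) for an action.
    Returns the last answer and the number of steps used.
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    action = "new-approach"
    for step in range(1, max_steps + 1):
        answer, confidences = generate(action)
        action = decide(confidences)
        if action == "halt":
            return answer, step
    return answer, max_steps


# Toy generator that becomes more confident on each call (made-up traces).
_traces = iter([("a", [0.2, 0.3]), ("b", [0.6, 0.7]), ("c", [0.95, 0.92])])
result = refine(lambda action: next(_traces))
```

Sequential refinement of this kind spends tokens only while confidence is low, which is how a method can use far fewer tokens than sampling hundreds of full traces in parallel.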


[202] Data Science and Technology Towards AGI Part I: Tiered Data Management cs.AI | cs.CL

Yudong Wang, Zixuan Fu, Hengyu Zhao, Chen Zhao, Chuyue Zhou

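TL;DR: This paper proposes a tiered data management framework (L0-L4) for AGI development, aiming to overcome the bottleneck of unidirectional data scaling in current LLM training through data-model co-evolution. The framework organizes data into tiers ranging from raw uncurated resources to organized, verifiable knowledge, and uses LLMs for data quality management and editing, so that data can be strategically allocated across training stages (pre-training, mid-training, and alignment) to balance data quality, acquisition cost, and training benefit.

Details

Motivation: Current LLM research relies heavily on unidirectional scaling of data size and faces bottlenecks in data availability, acquisition cost, and training efficiency. The paper argues that AGI development is entering a new phase of data-model co-evolution, in which models actively guide data management and high-quality data in turn amplifies model capabilities.

Result: In empirical studies, tiered datasets are constructed from raw corpora and used across multiple training phases. The results show that tier-aware data utilization significantly improves training efficiency and model performance.

Insight: The innovation is a systematic tiered data management framework that couples data closely with LLM training: LLMs take part in data management (for example, quality scoring and content editing), and data is allocated according to training stage, offering a path to scalable and sustainable data management.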

Abstract: The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework, designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are fully used in data management processes, such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.


[203] VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation cs.AI | cs.CV

Changhua Xu, Jie Lu, Junyu Xuan, En Yu

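TL;DR: This paper proposes VGAS, a framework that addresses the geometric ambiguity vision-language-action (VLA) models face when adapting to new tasks from few demonstrations. Following a generation-selection paradigm, it performs inference-time best-of-N action-chunk selection, combining a fine-tuned VLA model as a high-recall proposal generator with a geometrically grounded critic, the Q-Chunk-Former, and adds Explicit Geometric Regularization to stabilize selection.

Details

Motivation: When adapting to new tasks from few demonstrations, existing VLA models often fail because of geometric ambiguity: semantically plausible trajectories can diverge in execution because of subtle geometric differences. The problem is to reliably select geometrically precise actions under limited supervision.

Result: Experiments and theoretical analysis show that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts.

Insight: The innovations are: 1) framing few-shot VLA adaptation as a generation-selection problem with inference-time best-of-N action-chunk selection; 2) the Q-Chunk-Former, a geometrically grounded Transformer critic that resolves fine-grained geometric ambiguities; 3) Explicit Geometric Regularization, which preserves action-ranking resolution and mitigates value instability. The approach may transfer to other few-shot reinforcement learning or robot control tasks that need fine-grained geometric reasoning.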

Abstract: Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework, VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-$N$ selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a fine-tuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic, to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which explicitly shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
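
Inference-time best-of-N selection reduces to scoring each candidate with a critic and keeping the top one. The sketch below shows only that selection step. The distance-based critic is a toy stand-in for the learned Q-Chunk-Former, and the function names and 2D waypoint format are assumptions for illustration.

```python
import math


def select_best_chunk(candidates, critic):
    """Return the candidate action chunk with the highest critic value."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=critic)


def make_distance_critic(target):
    """Toy critic: higher value when the chunk's final waypoint is nearer the target."""
    def critic(chunk):
        end_x, end_y = chunk[-1]
        return -math.hypot(end_x - target[0], end_y - target[1])
    return critic


# Three near-miss candidates, each a list of (x, y) waypoints (made-up values).
candidates = [
    [(0.0, 0.0), (1.0, 1.2)],
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.8, 1.0)],
]
best = select_best_chunk(candidates, make_distance_critic((1.0, 1.0)))
```

The candidates differ only slightly in their end points, which is the near-miss regime the abstract describes: the generator's proposals are all plausible, and the outcome depends on how finely the critic can rank them.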


[204] Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs cs.AI | cs.CV

Siqu Ou, Tianrui Wan, Zhiyuan Zhao, Junyu Gao, Xuelong Li

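TL;DR: This paper targets the weak visual attention that multimodal large language models (MLLMs) show on complex reasoning tasks and proposes SAYO, a visual reasoning model. SAYO is trained with a reinforcement learning framework that introduces a region-level visual attention reward to optimize the attention policy, correcting early visual misalignment and reducing error propagation.

Details

Motivation: Existing chain-of-thought (CoT) MLLMs rely mainly on long textual reasoning trajectories and lack effective mechanisms for learning stable visual attention policies. As a result, visual focus is weak, and early-stage visual misalignment is rarely corrected in later reasoning, which propagates errors.

Result: Extensive experiments on multiple multimodal benchmarks show that SAYO consistently improves performance on diverse reasoning and perception tasks.

Insight: The innovation is combining reinforcement learning with a region-level visual attention reward so that optimization signals align explicitly with visually grounded reasoning steps, letting the model learn more reliable attention behavior. Viewed objectively, the method offers a new, learnable attention-optimization mechanism for vision-language alignment in MLLMs.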

Abstract: While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.


[205] CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT cs.AI | cs.CV

Chengyi Du, Yazhe Niu, Dazhong Shen, Luxin Xu

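TL;DR: CoTZero is an annotation-free visual reasoning paradigm. It combines dual-stage data synthesis (a bottom-up stage that extracts and composes visual primitives, and a top-down stage that enforces hierarchical reasoning) with cognition-aligned training (which introduces Cognitively Coherent Verifiable Rewards) to improve the hierarchical, compositional, and verifiable reasoning of vision-language models (VLMs) and bring it closer to human reasoning.

Details

Motivation: Existing VLMs rely mainly on surface correlations rather than logically coherent structured representations, so they miss higher-level semantic structure and form non-causal relational understanding, which limits compositional and verifiable reasoning. The paper addresses this by introducing models of human cognition into the reasoning process.

Result: On a multi-level semantic inconsistency benchmark with lexical-perturbation negatives, CoTZero achieves an F1 score of 83.33% across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned reasoning.

Insight: The innovations are: 1) an annotation-free dual-stage data synthesis method that mimics human compositional reasoning from local to global and from global to local; 2) Cognitively Coherent Verifiable Rewards (CCVR) in reinforcement fine-tuning, which give stepwise feedback on reasoning coherence and factual correctness and thereby strengthen hierarchical reasoning and generalization.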

Abstract: Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent structured representations, which often leads to missed higher-level semantic structure and non-causal relational understanding, hindering compositional and verifiable reasoning. To address these limitations by introducing human models into the reasoning process, we propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs’ hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33 percent on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.


cs.RO

[206] Leveraging Adaptive Group Negotiation for Heterogeneous Multi-Robot Collaboration with Large Language Models cs.RO | cs.AI | cs.CL

Siqi Song, Xuanbing Xie, Zonglin Li, Yuqiang Li, Shijie Wang

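TL;DR: This paper proposes CLiMRS, an adaptive group negotiation framework that uses large language models for heterogeneous multi-robot collaboration. Each robot is paired with an LLM agent; a general proposal planner dynamically forms subgroups; within each subgroup, perception-driven multi-LLM discussions produce action commands; and an execution feedback loop enables efficient planning and robust execution.

Details

Motivation: Heterogeneous robots must collaborate over long horizons under spatial constraints and environmental uncertainty, yet existing methods have not fully exploited the potential of LLMs for coordinated control.

Result: On the proposed heterogeneous multi-robot benchmark CLiMBench, CLiMRS is over 40% more efficient than the best baseline on complex tasks while maintaining its success rate on simpler ones.

Insight: The innovation, inspired by human teamwork, is an adaptive group formation and negotiation mechanism: dynamic subgroup management and multi-LLM discussion significantly improve the efficiency of heterogeneous multi-robot collaboration.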

Abstract: Multi-robot collaboration tasks often require heterogeneous robots to work together over long horizons under spatial constraints and environmental uncertainties. Although Large Language Models (LLMs) excel at reasoning and planning, their potential for coordinated control has not been fully explored. Inspired by human teamwork, we present CLiMRS (Cooperative Large-Language-Model-Driven Heterogeneous Multi-Robot System), an adaptive group negotiation framework among LLMs for multi-robot collaboration. This framework pairs each robot with an LLM agent and dynamically forms subgroups through a general proposal planner. Within each subgroup, a subgroup manager leads perception-driven multi-LLM discussions to get commands for actions. Feedback is provided by both robot execution outcomes and environment changes. This grouping-planning-execution-feedback loop enables efficient planning and robust execution. To evaluate these capabilities, we introduce CLiMBench, a heterogeneous multi-robot benchmark of challenging assembly tasks. Our experiments show that CLiMRS surpasses the best baseline, achieving over 40% higher efficiency on complex tasks without sacrificing success on simpler ones. Overall, our results demonstrate that leveraging human-inspired group formation and negotiation principles significantly enhances the efficiency of heterogeneous multi-robot collaboration. Our code is available here: https://github.com/song-siqi/CLiMRS.


[207] LangGS-SLAM: Real-Time Language-Feature Gaussian Splatting SLAM cs.RO | cs.CV | eess.IV

Seongbo Ha, Sibaek Lee, Kyungsu Kang, Joonyeol Choi, Seungjun Tak

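TL;DR: This paper proposes LangGS-SLAM, a real-time RGB-D SLAM system that reconstructs a language-aligned dense feature field while sustaining low-latency tracking and mapping. A Top-K Rendering pipeline renders high-dimensional feature maps efficiently; a multi-criteria map management strategy prunes redundant or inconsistent Gaussians to save memory while preserving scene integrity; and a hybrid field optimization framework jointly refines the geometric and semantic fields by decoupling their optimization frequencies according to field characteristics.

Details

Motivation: The goal is to bridge the gap between 3D perception and language-based reasoning by achieving, in online SLAM, both high-fidelity geometric reconstruction and a dense, uncompressed language-aligned feature field.

Result: Running in real time (15 FPS), the system achieves better geometric fidelity than geometry-only baselines and semantic fidelity comparable to offline approaches.

Insight: The innovations are an efficient Top-K Rendering pipeline that avoids semantic distortion, a map management strategy that prunes Gaussians using multiple criteria (such as semantic-geometric consistency), and a hybrid optimization framework that decouples optimization frequencies by field type (geometric or semantic). Together they demonstrate that online SLAM with dense language feature fields is feasible.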

Abstract: In this paper, we propose an RGB-D SLAM system that reconstructs a language-aligned dense feature field while sustaining low-latency tracking and mapping. First, we introduce a Top-K Rendering pipeline, a high-throughput and semantic-distortion-free method for efficiently rendering high-dimensional feature maps. To address the resulting semantic-geometric discrepancy and reduce memory consumption, we further design a multi-criteria map management strategy that prunes redundant or inconsistent Gaussians while preserving scene integrity. Finally, a hybrid field optimization framework jointly refines the geometric and semantic fields under real-time constraints by decoupling their optimization frequencies according to field characteristics. The proposed system achieves superior geometric fidelity compared to geometric-only baselines and comparable semantic fidelity to offline approaches while operating at 15 FPS. Our results demonstrate that online SLAM with dense, uncompressed language-aligned feature fields is both feasible and effective, bridging the gap between 3D perception and language-based reasoning.


[208] A Distributed Multi-Modal Sensing Approach for Human Activity Recognition in Real-Time Human-Robot Collaboration cs.RO | cs.CV

Valerio Belcamino, Nhat Minh Dinh Le, Quan Khanh Luu, Alessandro Carfì, Van Anh Ho

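TL;DR: This paper presents a distributed multi-modal sensing system for human activity recognition (HAR) in human-robot collaboration (HRC). It combines a modular data glove equipped with inertial measurement units (IMUs) and a vision-based tactile sensor to capture hand activities in contact with the robot.

Details

Motivation: In HRC, robots must recognize human activities to understand and adapt to human intentions, but existing methods are limited in real-time operation and in sensing contact scenarios. The paper addresses this through multi-modal fusion.

Result: Experiments cover offline classification, real-time classification under static conditions, and a realistic HRC scenario. All tasks reach high accuracy, suggesting the multi-modal approach suits a range of collaborative settings.

Insight: The innovation is combining an IMU data glove with a vision-based tactile sensor for multi-modal sensing of contact-rich hand activities. Viewed objectively, the distributed, modular design improves adaptability and robustness in real-time HRC.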

Abstract: Human activity recognition (HAR) is fundamental in human-robot collaboration (HRC), enabling robots to respond to and dynamically adapt to human intentions. This paper introduces a HAR system combining a modular data glove equipped with Inertial Measurement Units and a vision-based tactile sensor to capture hand activities in contact with a robot. We tested our activity recognition approach under different conditions, including offline classification of segmented sequences, real-time classification under static conditions, and a realistic HRC scenario. The experimental results show high accuracy for all the tasks, suggesting that multiple collaborative settings could benefit from this multi-modal approach.


[209] Research on a Camera Position Measurement Method based on a Parallel Perspective Error Transfer Model cs.RO | cs.CV

Ning Hu, Shuai Li, Jindong Tan

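TL;DR: This paper proposes a camera pose measurement method based on a parallel perspective error transfer model. By explicitly modeling how image measurement errors propagate through perspective geometry, it derives the relationship between feature point distribution, camera depth, and pose estimation uncertainty, and builds on this a pose estimation method that combines parallel perspective initialization with error-aware weighting, improving robustness in proximity operations.

Details

Motivation: Camera pose estimation from sparse correspondences is difficult in near-field scenarios, where strong perspective effects and heterogeneous measurement noise degrade the stability of analytic PnP solutions.

Result: Extensive experiments on synthetic data and real images (covering strong illumination, surgical lighting, and underwater low-light conditions) show accuracy and robustness comparable to state-of-the-art analytic and iterative PnP methods, with high computational efficiency.

Insight: The innovation is an explicit geometric error propagation framework and an error transfer model under the parallel perspective approximation, which underlines the importance of explicit geometric error modeling for reliable camera pose estimation in challenging near-field settings.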

Abstract: Camera pose estimation from sparse correspondences is a fundamental problem in geometric computer vision and remains particularly challenging in near-field scenarios, where strong perspective effects and heterogeneous measurement noise can significantly degrade the stability of analytic PnP solutions. In this paper, we present a geometric error propagation framework for camera pose estimation based on a parallel perspective approximation. By explicitly modeling how image measurement errors propagate through perspective geometry, we derive an error transfer model that characterizes the relationship between feature point distribution, camera depth, and pose estimation uncertainty. Building on this analysis, we develop a pose estimation method that leverages parallel perspective initialization and error-aware weighting within a Gauss-Newton optimization scheme, leading to improved robustness in proximity operations. Extensive experiments on both synthetic data and real-world images, covering diverse conditions such as strong illumination, surgical lighting, and underwater low-light environments, demonstrate that the proposed approach achieves accuracy and robustness comparable to state-of-the-art analytic and iterative PnP methods, while maintaining high computational efficiency. These results highlight the importance of explicit geometric error modeling for reliable camera pose estimation in challenging near-field settings.


[210] Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning cs.RO | cs.AI | cs.CV | cs.LG

Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

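TL;DR: This paper proposes R&B-EnCoRe, a self-supervised bootstrapping method for improving embodied chain-of-thought reasoning models. By treating reasoning as a latent variable within importance-weighted variational inference, the model distills embodiment-specific reasoning strategies from internet-scale knowledge on its own, which avoids hand-designed fixed templates and improves action prediction.

Details

Motivation: Current embodied CoT methods rely on rigid templates to specify reasoning primitives, which can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, reasoning quality cannot be verified; without quality reasoning, robust policies cannot be built.

Result: Validated across manipulation, legged navigation, and autonomous driving embodiments and VLA architectures of different sizes, R&B-EnCoRe achieves a 28% gain in manipulation success, a 101% improvement in navigation scores, and a 21% reduction in the collision-rate metric over models that reason indiscriminately about all available primitives.

Insight: The core innovation is treating reasoning as a latent variable in a self-supervised, importance-weighted variational inference framework, so the model can bootstrap and distill embodiment-specific reasoning strategies that are predictive of successful control. This bypasses manual annotation engineering and grounds internet-scale knowledge in physical execution.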

Abstract: Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
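
For orientation, the standard importance-weighted variational bound (the form popularized by importance-weighted autoencoders) can be written as below. Mapping it onto this setting, with observation $o$, action $a$, latent reasoning trace $z$, and proposal distribution $q$, is an assumption made here for illustration; the paper's exact objective may differ.

$$
\log p(a \mid o) \;\ge\; \mathbb{E}_{z_1,\dots,z_K \sim q(z \mid o)}\left[\log \frac{1}{K}\sum_{k=1}^{K} \frac{p(z_k \mid o)\, p(a \mid z_k, o)}{q(z_k \mid o)}\right]
$$

Under this reading, a sampled reasoning trace $z_k$ receives a larger weight when it makes the demonstrated action more likely, which is one way reasoning can be selected for being predictive of control without external rewards or human annotation.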


[211] BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models cs.RO | cs.AI | cs.CV

Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin

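TL;DR: This paper presents BiManiBench, a hierarchical benchmark for evaluating the bimanual coordination abilities of multimodal large language models. It evaluates models at three tiers (fundamental spatial reasoning, high-level action planning, and low-level end-effector control) and addresses the limitation that existing benchmarks are confined to single-arm manipulation and cannot capture the spatio-temporal coordination that bimanual tasks require.

Details

Motivation: Existing evaluation frameworks are largely confined to single-arm manipulation and cannot assess MLLMs on bimanual tasks that require spatio-temporal coordination, such as lifting a heavy pot. A new benchmark is therefore needed to target the unique challenges of bimanual coordination.

Result: Analysis of over 30 state-of-the-art models shows that, although MLLMs are proficient at high-level reasoning, they struggle with dual-arm spatial grounding and control, which frequently leads to mutual interference and sequencing errors. This suggests the current paradigm lacks a deep understanding of mutual kinematic constraints.

Insight: The novelty is the first hierarchical evaluation benchmark dedicated to bimanual coordination tasks, which can distinguish perceptual hallucinations from planning failures. It also points future research toward inter-arm collision avoidance and fine-grained temporal sequencing.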

Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.


[212] Reliability-aware Execution Gating for Near-field and Off-axis Vision-guided Robotic Alignment cs.RO | cs.CV

Ning Hu, Senhao Cao, Maochen Li

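TL;DR: This paper proposes a reliability-aware execution gating mechanism that improves execution reliability in near-field and off-axis vision-guided robotic alignment. Before execution, the method evaluates geometric consistency and configuration risk and selectively rejects or scales high-risk pose updates, reducing execution failures caused by amplified pose-estimation errors.

Details

Motivation: Vision-guided robotic systems in near-field and off-axis configurations still fail frequently during execution even when pose estimates are numerically accurate. Pose accuracy alone is therefore insufficient to guarantee execution-level reliability, which motivates a method that improves reliability at the execution level.

Result: Experiments on a real UR5 robot show that the method significantly improves task success rates, reduces execution variance, and suppresses tail-risk behavior, while leaving average pose accuracy largely unchanged.

Insight: The novelty lies in identifying a deterministic geometric error-amplification mechanism as the root cause of execution failures and in proposing an estimator-agnostic, execution-level reliability model. It can be readily integrated into geometry-based or learning-based pose-estimation pipelines, offering a practical way to improve the robustness of near-field vision-guided robotic systems.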

Abstract: Vision-guided robotic systems are increasingly deployed in precision alignment tasks that require reliable execution under near-field and off-axis configurations. While recent advances in pose estimation have significantly improved numerical accuracy, practical robotic systems still suffer from frequent execution failures even when pose estimates appear accurate. This gap suggests that pose accuracy alone is insufficient to guarantee execution-level reliability. In this paper, we reveal that such failures arise from a deterministic geometric error amplification mechanism, in which small pose estimation errors are magnified through system structure and motion execution, leading to unstable or failed alignment. Rather than modifying pose estimation algorithms, we propose a Reliability-aware Execution Gating mechanism that operates at the execution level. The proposed approach evaluates geometric consistency and configuration risk before execution, and selectively rejects or scales high-risk pose updates. We validate the proposed method on a real UR5 robotic platform performing single-step visual alignment tasks under varying camera-target distances and off-axis configurations. Experimental results demonstrate that the proposed execution gating significantly improves task success rates, reduces execution variance, and suppresses tail-risk behavior, while leaving average pose accuracy largely unchanged. Importantly, the proposed mechanism is estimator-agnostic and can be readily integrated with both classical geometry-based and learning-based pose estimation pipelines. These results highlight the importance of execution-level reliability modeling and provide a practical solution for improving robustness in near-field vision-guided robotic systems.
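
The reject-or-scale behavior described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `gate_pose_update`, the scalar `consistency` and `risk` scores, the thresholds, and the linear scaling rule are hypothetical and are not taken from the paper, which does not specify them in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class GateConfig:
    # All thresholds are hypothetical placeholders, not values from the paper.
    min_consistency: float = 0.5  # below this, the update is rejected
    max_risk: float = 0.8         # above this, the update is rejected
    scale_risk: float = 0.4       # between scale_risk and max_risk, the update is scaled


def gate_pose_update(
    delta: Sequence[float],
    consistency: float,
    risk: float,
    cfg: GateConfig = GateConfig(),
) -> Tuple[str, Optional[List[float]]]:
    """Return (decision, update); decision is 'reject', 'scale', or 'accept'."""
    if consistency < cfg.min_consistency or risk > cfg.max_risk:
        return "reject", None
    if risk > cfg.scale_risk:
        # Shrink linearly: factor is 1.0 at scale_risk and 0.0 at max_risk.
        factor = (cfg.max_risk - risk) / (cfg.max_risk - cfg.scale_risk)
        return "scale", [factor * d for d in delta]
    return "accept", list(delta)


if __name__ == "__main__":
    print(gate_pose_update([1.0, 2.0], consistency=0.9, risk=0.1))
    print(gate_pose_update([1.0, 2.0], consistency=0.9, risk=0.6))
    print(gate_pose_update([1.0, 2.0], consistency=0.2, risk=0.1))
```

Because the gate only consumes a proposed update and two scores, it stays independent of how the pose was estimated, which matches the estimator-agnostic property the abstract emphasizes.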


[213] Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction cs.RO | cs.CV

Hongyi Chen, Tony Dong, Tiancheng Wu, Liquan Wang, Yash Jangir

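TL;DR: This paper proposes VIDEOMANIP, a framework that learns dexterous robotic-hand manipulation policies directly from RGB human videos. It reconstructs 4D hand-object trajectories from monocular video by estimating human hand poses and object meshes, then retargets them to a robotic hand for policy learning. Hand-object contact optimization with interaction-centric grasp modeling, together with a demonstration synthesis strategy that generates diverse training trajectories from a single video, enables generalizable policy learning without additional robot demonstrations.

Details

Motivation: Multi-finger robotic hand manipulation is challenging because of the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing methods rely on teleoperation with wearable devices or specialized sensing equipment, which limits scalability.

Result: In simulation, the method achieves a 70.25% success rate across 20 different objects with the Inspire Hand. In the real world, it achieves an average success rate of 62.86% across seven tasks with the LEAP Hand, outperforming retargeting-based methods by 15.87%.

Insight: The novelty is device-free dexterous manipulation learning from 4D hand-object trajectories reconstructed directly from RGB video. Combined with contact optimization and demonstration synthesis, this transfers human videos to robot policies effectively and avoids the reliance on specialized equipment.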

Abstract: Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 4D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.


[214] Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving cs.RO | cs.AI | cs.CV | cs.LG

Amir Mallak, Alaa Maalouf

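TL;DR: This paper presents a systematic, factorized study of out-of-distribution (OOD) robustness in vision-based driving policies. Using five environment axes (scene, season, weather, time, and agent mix) and controlled k-factor perturbations, it evaluates FC, CNN, and ViT policies as well as policies built on foundation-model (FM) features. ViT policies are more robust than comparably sized CNN/FC policies, FM-feature policies maintain high performance under multiple simultaneous shifts, and temporal (multi-frame) inputs do not beat the best single-frame baseline. The study also quantifies the effect of each environmental shift and derives actionable design rules for OOD robustness.

Details

Motivation: OOD robustness in autonomous driving is often reduced to a single number, which hides why a policy fails. By decomposing the environment into factors, the paper systematically studies what breaks a driving policy in order to understand OOD robustness more deeply.

Result: Evaluation uses closed-loop control in the VISTA simulator. ViT policies are markedly more OOD-robust than comparably sized CNN/FC policies, and ViT heads on frozen FM features reach state-of-the-art success rates, at a latency cost. Non-FM single-frame policies drop sharply at the first environmental shift, and all non-FM models fall below 50% success by three simultaneous shifts, whereas FM-feature policies stay above 85% under three simultaneous shifts. The study also quantifies single-factor shifts (for example, rural to urban and day to night each cause a drop of about 31%) and combined shifts, whose interactions are non-additive.

Insight: The novelty is decomposing OOD robustness into several interpretable environmental factors for systematic study rather than reporting a single metric. The main findings are: (1) architecture choice (ViT vs. CNN/FC) and feature source (FM features) are decisive for robustness; (2) the effects of environmental shifts are non-additive, and some combinations, such as season with time, are especially harmful; (3) scaling traces/views in the ID data, or targeted exposure to hard conditions, improves robustness; (4) training on multiple ID environments broadens coverage and strengthens weak cases at a slight cost in ID performance, whereas single-ID training preserves peak performance but in a narrow domain. These findings give concrete guidance for designing OOD-robust driving policies.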

Abstract: Out of distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary ID support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps $\sim 10\%$, moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
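
To make the $k$-factor perturbation idea concrete, the sketch below enumerates every environment that differs from an in-distribution baseline on exactly $k$ of the five axes. The axis names follow the abstract, but the specific axis values, the binary choice per axis, and the function name `k_factor_shifts` are assumptions for illustration only; the abstract does not list the values used.

```python
from itertools import combinations, product

# Hypothetical axis values; only the five axis names come from the abstract.
AXES = {
    "scene": ["rural", "urban"],
    "season": ["summer", "winter"],
    "weather": ["clear", "rain"],
    "time": ["day", "night"],
    "agents": ["cars", "mixed"],
}


def k_factor_shifts(baseline, axes, k):
    """List every environment that differs from `baseline` on exactly k axes."""
    shifted = []
    for changed in combinations(sorted(axes), k):
        # For each changed axis, consider every value other than the baseline's.
        alternatives = [[v for v in axes[a] if v != baseline[a]] for a in changed]
        for values in product(*alternatives):
            env = dict(baseline)
            env.update(zip(changed, values))
            shifted.append(env)
    return shifted


# Baseline takes the first listed value on every axis (an arbitrary choice).
baseline = {axis: values[0] for axis, values in AXES.items()}

if __name__ == "__main__":
    for k in range(4):
        print(k, len(k_factor_shifts(baseline, AXES, k)))
```

With two values per axis, the number of shifted environments at level $k$ is the binomial coefficient $\binom{5}{k}$, that is 1, 5, 10, and 10 for $k = 0, 1, 2, 3$. Reporting success per shifted environment rather than one pooled average is what lets non-additive interactions between axes show up.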


eess.IV

[215] Wavelet-Domain Masked Image Modeling for Color-Consistent HDR Video Reconstruction eess.IV | cs.CV

Yang Zhang, Zhangkai Ni, Wenhan Yang, Hanli Wang

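TL;DR: This paper proposes WMNet, a new high dynamic range (HDR) video reconstruction network that uses wavelet-domain masked image modeling (W-MIM) to address the color inaccuracies and temporal inconsistencies common in existing methods. It uses a two-phase training strategy: Phase I performs self-reconstruction pre-training by selectively masking color and detail information in the wavelet domain, and Phase II fine-tunes the model. The network also introduces a Temporal Mixture of Experts (T-MoE) module and a Dynamic Memory Module (DMM) to improve temporal consistency, and the authors reorganize the HDRTV4K dataset into HDRTV4K-Scene as a new benchmark.

Details

Motivation: Existing HDR video reconstruction methods often suffer from color inaccuracies and temporal inconsistencies. The paper aims to address these challenges and recover accurate brightness, color, and detail from low dynamic range (LDR) videos.

Result: In extensive experiments, WMNet achieves state-of-the-art (SOTA) performance across multiple evaluation metrics, significantly improving color fidelity, temporal coherence, and perceptual quality, notably on the newly established HDRTV4K-Scene benchmark.

Insight: The contributions are: (1) bringing the masked image modeling (MIM) paradigm into the wavelet domain for self-supervised pre-training to strengthen color restoration; (2) a curriculum learning strategy that refines the reconstruction process; (3) the T-MoE and DMM modules, which improve temporal consistency and long-range dependency modeling; (4) a reorganized dataset that provides a scene-segmented benchmark for evaluation in this area. Viewed objectively, wavelet-domain processing may separate and reconstruct an image's frequency components more effectively, and two-phase training combined with purpose-built modules offers a systematic solution for video reconstruction.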

Abstract: High Dynamic Range (HDR) video reconstruction aims to recover fine brightness, color, and details from Low Dynamic Range (LDR) videos. However, existing methods often suffer from color inaccuracies and temporal inconsistencies. To address these challenges, we propose WMNet, a novel HDR video reconstruction network that leverages Wavelet domain Masked Image Modeling (W-MIM). WMNet adopts a two-phase training strategy: In Phase I, W-MIM performs self-reconstruction pre-training by selectively masking color and detail information in the wavelet domain, enabling the network to develop robust color restoration capabilities. A curriculum learning scheme further refines the reconstruction process. Phase II fine-tunes the model using the pre-trained weights to improve the final reconstruction quality. To improve temporal consistency, we introduce the Temporal Mixture of Experts (T-MoE) module and the Dynamic Memory Module (DMM). T-MoE adaptively fuses adjacent frames to reduce flickering artifacts, while DMM captures long-range dependencies, ensuring smooth motion and preservation of fine details. Additionally, since existing HDR video datasets lack scene-based segmentation, we reorganize HDRTV4K into HDRTV4K-Scene, establishing a new benchmark for HDR video reconstruction. Extensive experiments demonstrate that WMNet achieves state-of-the-art performance across multiple evaluation metrics, significantly improving color fidelity, temporal coherence, and perceptual quality. The code is available at: https://github.com/eezkni/WMNet


[216] A Unified Framework for Multimodal Image Reconstruction and Synthesis using Denoising Diffusion Models eess.IV | cs.CV

Weijie Gan, Xucheng Wang, Tongyao Wang, Wenshang Wang, Chunwei Ying

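TL;DR: This paper proposes Any2all, a unified framework that formulates multimodal image reconstruction and synthesis as a single virtual inpainting problem. A single unconditional diffusion model is trained and, at inference time, inpaints all target modalities from whatever inputs are available (clean images or noisy measurements). The framework is validated on a PET/MR/CT brain dataset.

Details

Motivation: Existing methods require task-specific models, which complicates training and deployment workflows. The paper aims to handle reconstruction and synthesis for incomplete multimodal imaging data within one unified framework.

Result: On a PET/MR/CT brain dataset, Any2all performs well on both multimodal reconstruction and synthesis, with competitive distortion-based metrics and better perceptual quality than specialized methods.

Insight: The novelty is unifying the multimodal tasks as a virtual inpainting problem and using a single unconditional diffusion model for flexible inference. Viewed objectively, the unified design simplifies the workflow, and the strategy of adapting an unconditional model at inference time has potential for general use.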

Abstract: Image reconstruction and image synthesis are important for handling incomplete multimodal imaging data, but existing methods require various task-specific models, complicating training and deployment workflows. We introduce Any2all, a unified framework that addresses this limitation by formulating these disparate tasks as a single virtual inpainting problem. We train a single, unconditional diffusion model on the complete multimodal data stack. This model is then adapted at inference time to "inpaint" all target modalities from any combination of available clean images or noisy measurements. We validated Any2all on a PET/MR/CT brain dataset. Our results show that Any2all can achieve excellent performance on both multimodal reconstruction and synthesis tasks, consistently yielding images with competitive distortion-based performance and superior perceptual quality over specialized methods.


cs.IR

[217] High Fidelity Textual User Representation over Heterogeneous Sources via Reinforcement Learning cs.IR | cs.AI | cs.CL | cs.LG

Rajat Arora, Ye Tao, Jianqiang Shen, Ping Liu, Muchen Wu

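TL;DR: This paper proposes a novel reinforcement learning framework that synthesizes a unified, interpretable, and concise textual representation for each user from heterogeneous textual sources (such as profiles, professional data, and search logs) on large-scale job platforms such as LinkedIn. It uses user engagement signals (e.g., clicks, applies) as the primary reward and adds rule-based rewards that constrain format and length, producing user representations that are directly compatible with LLM-based recommender systems without human labeling.

Details

Motivation: As recommender systems increasingly adopt large language models, creating unified, interpretable, and concise user representations from heterogeneous textual sources becomes critical, especially in latency-sensitive online environments.

Result: Extensive offline experiments across multiple LinkedIn products show significant improvements in key downstream business metrics.

Insight: The novelty is a reinforcement learning framework that uses user engagement signals as the primary reward, combined with rule-based rewards, to distill heterogeneous textual information without manual labels into a unified, LLM-compatible user representation. It offers a practical, labeling-free, and scalable solution.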

Abstract: Effective personalization on large-scale job platforms requires modeling members based on heterogeneous textual sources, including profiles, professional data, and search activity logs. As recommender systems increasingly adopt Large Language Models (LLMs), creating unified, interpretable, and concise representations from heterogeneous sources becomes critical, especially for latency-sensitive online environments. In this work, we propose a novel Reinforcement Learning (RL) framework to synthesize a unified textual representation for each member. Our approach leverages implicit user engagement signals (e.g., clicks, applies) as the primary reward to distill salient information. Additionally, the framework is complemented by rule-based rewards that enforce formatting and length constraints. Extensive offline experiments across multiple products at LinkedIn, one of the world’s largest job platforms, demonstrate significant improvements in key downstream business metrics. This work provides a practical, labeling-free, and scalable solution for constructing interpretable user representations that are directly compatible with LLM-based systems.


[218] Reasoning-Augmented Representations for Multimodal Retrieval cs.IR | cs.AI | cs.CV | cs.LG

Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly

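TL;DR: This paper proposes a data-centric framework that strengthens representations for multimodal retrieval by externalizing reasoning, addressing the brittleness of existing embedding models on queries that require latent reasoning. It uses a strong vision-language model to densely caption corpus entries, resolve ambiguous multimodal references in queries, and rewrite verbose instructions into concise retrieval constraints, improving retrieval performance at both training and inference time.

Details

Motivation: In universal multimodal retrieval, modern embedding models are brittle when queries require latent reasoning, such as resolving underspecified references or matching compositional constraints. The authors argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, which encourages spurious feature matching.

Result: On the M-BEIR benchmark, the method yields consistent gains over strong baselines. Ablations show that corpus enhancement chiefly benefits knowledge-intensive queries, while query enhancement is critical for compositional modification requests.

Insight: The novelty is externalizing the reasoning process and folding it into the data representation: dense captioning and semantic resolution make implicit information explicit, and training the retriever on these semantically dense representations avoids distribution shift and fully exploits the added signal. This offers a reusable data-centric approach to improving multimodal retrieval.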

Abstract: Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry “silent” evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision–Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.


q-bio.NC

[219] How does longer temporal context enhance multimodal narrative video processing in the brain? q-bio.NC | cs.AI | cs.CV | cs.LG

Prachi Jindal, Anant Khandelwal, Manish Gupta, Bapi S. Raju, Subba Reddy Oota

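TL;DR: This study examines how the temporal context length of video clips (3–12 s) and narrative-task prompting affect brain-model alignment during naturalistic movie watching. Increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show no clear improvement. Shorter temporal windows align with perceptual and early language regions, longer windows align with higher-order integrative regions, and MLLMs show a layer-to-cortex hierarchy. Narrative-task prompts elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions.

Details

Motivation: How humans and AI systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. The study investigates how temporal context length and narrative-task prompting shape brain-model alignment during naturalistic movie watching.

Result: In experiments using fMRI recordings of participants watching full-length movies, increasing clip duration substantially improves brain alignment for MLLMs (e.g., Video-LLaMA, VideoChat), whereas unimodal video models (e.g., VideoMAE, TimeSformer) show no gain. Task-specific patterns of regional brain alignment are observed for particular narrative tasks (e.g., multi-scene summary, character motivation).

Insight: The novelty is using long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in MLLMs. Viewed objectively, the study reveals neural analogues of temporal-context integration in MLLMs and shows how task prompts modulate model-brain alignment, offering insight for building more biologically plausible long-context multimodal models.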

Abstract: Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3–12 s clips) and the narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, narrative-task prompts (multi-scene summary, narrative summary, character motivation, and event boundary detection) elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Together, our results position long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in long-context MLLMs.


cs.MM

[220] Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data cs.MM | cs.AI | cs.CV

Thu Hang Phung, Duong M. Nguyen, Thanh Trung Huynh, Quoc Viet Hung Nguyen, Trong Nghia Hoang

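TL;DR: This paper proposes a generalized federated prompt-tuning framework for practical scenarios in which clients' local datasets are multimodal and differ in their distributional patterns of missing input features. Through specialized client-tuning and server-aggregation designs, the framework addresses the semantic alignment of prompt instructions across clients and modalities, so that prompt instructions can be aggregated and complement one another effectively.

Details

Motivation: The paper bridges the gap between federated learning and multimodal prompt-tuning. Prior methods typically focus on either unimodal or centralized data, whereas in practice client data is multimodal, heterogeneous, and incomplete, with differing patterns of missing features.

Result: Extensive evaluation on several multimodal benchmark datasets shows that the method consistently outperforms state-of-the-art baselines.

Insight: The novelty is a framework for heterogeneous and incomplete multimodal client data that simultaneously optimizes, aligns, and aggregates prompt-tuning instructions across clients and data modalities, allowing prompt instructions to complement one another and be combined effectively.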

Abstract: This paper introduces a generalized federated prompt-tuning framework for practical scenarios where local datasets are multi-modal and exhibit different distributional patterns of missing features at the input level. The proposed framework bridges the gap between federated learning and multi-modal prompt-tuning, which have traditionally focused on either uni-modal or centralized data. A key challenge in this setting arises from the lack of semantic alignment between prompt instructions that encode similar distributional patterns of missing data across different clients. To address this, our framework introduces specialized client-tuning and server-aggregation designs that simultaneously optimize, align, and aggregate prompt-tuning instructions across clients and data modalities. This allows prompt instructions to complement one another and be combined effectively. Extensive evaluations on diverse multimodal benchmark datasets demonstrate that our work consistently outperforms state-of-the-art (SOTA) baselines.


cs.HC

[221] An Information-Theoretic Framework for Comparing Voice and Text Explainability cs.HC | cs.AI | cs.CL | cs.CY | cs.IT

Mona Rajhans, Vishal Khawarey

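TL;DR: This paper proposes an information-theoretic framework for analyzing and comparing how two explanation modalities, voice and text, affect user comprehension and trust calibration in AI systems. The framework treats explanation delivery as a communication channel between model and user and quantifies it with metrics for information retention, comprehension efficiency, and trust calibration error. The authors built a Python simulation framework that evaluates these metrics using synthetic SHAP-based feature attributions across several explanation styles (brief, detailed, and analogy-based). Text explanations achieve higher comprehension efficiency, voice explanations calibrate trust better, and analogy-based explanations give the best overall trade-off.

Details

Motivation: Most current explainable AI (XAI) methods deliver explanations visually or as text, and there is little systematic analysis of how different explanation modalities (such as voice versus text) affect user comprehension and trust. The paper aims to fill this gap with a theoretical framework for comparing explanation modalities quantitatively.

Result: In simulation, text explanations achieve higher comprehension efficiency (CE), while voice explanations yield lower trust calibration error (TCE). The analogy-based style gives the best overall trade-off between comprehension efficiency and trust calibration. These results come from simulated evaluation on synthetic SHAP feature attributions.

Insight: The novelty is the first application of information theory to comparing how explanation modalities (voice versus text) affect XAI, along with quantifiable evaluation metrics (CE and TCE). Viewed objectively, the framework provides a reproducible theoretical basis for designing and evaluating multimodal explainability systems and can be extended to empirical studies using real SHAP or LIME outputs on open datasets such as the UCI Credit Approval dataset.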

Abstract: Explainable Artificial Intelligence (XAI) aims to make machine learning models transparent and trustworthy, yet most current approaches communicate explanations visually or through text. This paper introduces an information-theoretic framework for analyzing how explanation modality, specifically voice versus text, affects user comprehension and trust calibration in AI systems. The proposed model treats explanation delivery as a communication channel between model and user, characterized by metrics for information retention, comprehension efficiency (CE), and trust calibration error (TCE). A simulation framework implemented in Python was developed to evaluate these metrics using synthetic SHAP-based feature attributions across multiple modality-style configurations (brief, detailed, and analogy-based). Results demonstrate that text explanations achieve higher comprehension efficiency, while voice explanations yield improved trust calibration, with analogy-based delivery achieving the best overall trade-off. This framework provides a reproducible foundation for designing and benchmarking multimodal explainability systems and can be extended to empirical studies using real SHAP or LIME outputs on open datasets such as the UCI Credit Approval or Kaggle Financial Transactions datasets.


[222] Designing Multi-Robot Ground Video Sensemaking with Public Safety Professionals cs.HC | cs.CV

Puqi Zhou, Ali Asgarov, Aafiya Hussain, Wonjoon Park, Amit Paudyal

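TL;DR: Working with six police agencies, this paper studies how to integrate multi-robot ground video into public safety workflows. Study 1 presents the first testbed for multi-robot ground video sensemaking, comprising 38 events of interest relevant to public safety, a dataset of 20 robot patrol videos, and 6 design requirements. Study 2 develops MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. The study shows that the tool reduces manual workload and increases confidence, though false-alarm and privacy concerns remain.

Details

Motivation: The paper addresses how to design and integrate multi-robot video into public safety workflows to provide scalable situational awareness and reduce professionals' burden.

Result: On the developed testbed, MRVS reduced manual workload and increased user confidence through LLM-based explanations, while concerns about false alarms and privacy remained. No quantitative comparison with existing methods or SOTA results is explicitly reported.

Insight: The contributions are the design of the first multi-robot ground video sensemaking testbed, the development of MRVS with a prompt-engineered video understanding model, and 6 design requirements drawn from public safety practice, which offer practical guidance for designing future multi-robot video sensemaking tools.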

Abstract: Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals’ burden. Yet little is known about how to design and integrate multi-robot videos into public safety workflows. Collaborating with six police agencies, we examined how such videos could be made practical. In Study 1, we presented the first testbed for multi-robot ground video sensemaking. The testbed includes 38 events-of-interest (EoI) relevant to public safety, a dataset of 20 robot patrol videos (10 day/night pairs) covering EoI types, and 6 design requirements aimed at improving current video sensemaking practices. In Study 2, we built MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. Participants reported reduced manual workload and greater confidence with LLM-based explanations, while noting concerns about false alarms and privacy. We conclude with implications for designing future multi-robot video sensemaking tools. The testbed is available at https://github.com/Puqi7/MRVS_VideoSensemaking


cs.SD

[223] Massive Sound Embedding Benchmark (MSEB) cs.SD | cs.CL

Georg Heigold, Ehsan Variani, Tom Bagby, Cyril Allauzen, Ji Ma

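TL;DR: This paper presents the Massive Sound Embedding Benchmark (MSEB), an extensible framework for evaluating the auditory components of multimodal systems. Its first release includes eight core tasks and introduces the new Simple Voice Questions (SVQ) dataset, with the aim of accelerating progress in machine auditory intelligence.

Details

Motivation: Audio is a critical component of multimodal perception, and any truly intelligent system must have a wide range of auditory capabilities. A unified framework for evaluating these capabilities comprehensively has been lacking, and MSEB is proposed to fill that gap.

Result: Initial experiments establish clear performance headroom, indicating significant room to improve real-world multimodal experiences in which audio is a core signal.

Insight: The novelty is a unified, extensible benchmark for systematically evaluating sound embeddings across tasks (such as transcription, classification, retrieval, and reasoning), together with the new SVQ dataset to support large-scale evaluation. This provides a standardized platform for comparing and advancing audio representation learning models.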

Abstract: Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful ‘embedding’ - be it a single vector, a sequence of continuous or discrete representations, or another structured form - which then serves as the basis for generating the task’s final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. Our initial experiments establish clear performance headroom, highlighting the significant opportunity to improve real-world multimodal experiences where audio is a core signal. We encourage the research community to use MSEB to assess their algorithms and contribute to its growth. The library is publicly hosted on GitHub.


[224] Beyond Transcripts: A Renewed Perspective on Audio Chaptering cs.SD | cs.CL

Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel

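TL;DR: For the audio chaptering task, this paper systematically compares text-based models, a new audio-only architecture called AudioSeg, and multimodal large language models (MLLMs). AudioSeg substantially outperforms text-based methods, pause information provides the largest acoustic gain, and MLLMs are limited by context length and instruction following but show promise on shorter audio.

Details

Motivation: Current research on audio chaptering is confined to text-based methods, which leaves open key questions about exploiting audio information, handling ASR errors, and evaluating without depending on transcripts.

Result: Experiments on the YTSeg dataset show that AudioSeg substantially outperforms text-based methods, that pauses are the most effective acoustic feature for improving performance, and that MLLMs show promise on shorter audio but are limited by context length and weak instruction following.

Insight: The contributions are the audio-only AudioSeg architecture, a systematic analysis of factors that affect performance (such as transcript quality and acoustic features), and transcript-invariant time-space evaluation protocols, which together offer a new research perspective on audio chaptering.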

Abstract: Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, yet MLLMs are promising on shorter audio.