cs.CL

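[1] EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research cs.CL | cs.AI

Houping Yue, Zixiang Di, Mei Jiang, Bingdong Li, Hao Hao

TL;DR: This paper introduces EduResearchBench, a comprehensive benchmark for evaluating educational academic writing. Built on the Hierarchical Atomic Task Decomposition (HATD) framework, it decomposes the end-to-end research workflow into 6 specialized research modules and 24 fine-grained atomic tasks, and adopts a curriculum learning strategy that progresses from foundational skills to complex reasoning. From 55K raw academic samples, the authors curate 11K high-quality instruction pairs to train a specialized model, EduWrite (30B). Experiments indicate that, in vertical domains, data quality density and staged training curricula matter more than parameter scale.

Details

Motivation: Existing benchmarks mainly emphasize single-shot, monolithic generation and cannot evaluate complex academic research workflows at fine granularity, so they fail to pinpoint the specific capability bottlenecks of LLMs in academic writing.

Result: EduWrite (30B), trained specifically for educational academic writing, substantially outperforms larger general-purpose models (72B) on multiple core metrics, which suggests that in vertical domains data quality density and hierarchically staged training curricula are more decisive than parameter scale.

Insight: The contribution is the first comprehensive benchmark focused on educational academic writing (EduResearchBench) together with its HATD framework, which yields fine-grained diagnostic feedback, plus a curriculum learning strategy that builds model capability progressively. Viewed objectively, the systematic decomposition of a complex workflow into an automated evaluation pipeline, and the empirical finding that "data quality density" and "training curriculum" outweigh "parameter scale", are useful references for vertical-domain model development.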

Abstract: While Large Language Models (LLMs) are reshaping the paradigm of AI for Social Science (AI4SS), rigorously evaluating their capabilities in scholarly writing remains a major challenge. Existing benchmarks largely emphasize single-shot, monolithic generation and thus lack the fine-grained assessments required to reflect complex academic research workflows. To fill this gap, we introduce EduResearchBench, the first comprehensive evaluation platform dedicated to educational academic writing. EduResearchBench is built upon our Hierarchical Atomic Task Decomposition (HATD) framework, which decomposes an end-to-end research workflow into six specialized research modules (e.g., Quantitative Analysis, Qualitative Research, and Policy Research) spanning 24 fine-grained atomic tasks. This taxonomy enables an automated evaluation pipeline that mitigates a key limitation of holistic scoring, where aggregate scores often obscure specific capability bottlenecks, and instead provides fine-grained, diagnostic feedback on concrete deficiencies. Moreover, recognizing the high cognitive load inherent in scholarly writing, we propose a curriculum learning strategy that progressively builds competence from foundational skills to complex methodological reasoning and argumentation. Leveraging 55K raw academic samples, we curate 11K high-quality instruction pairs to train EduWrite, a specialized educational scholarly writing model. Experiments show that EduWrite (30B) substantially outperforms larger general-purpose models (72B) on multiple core metrics, demonstrating that in vertical domains, data quality density and hierarchically staged training curricula are more decisive than parameter scale.


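[2] CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding cs.CL | cs.AI | cs.CV

Tahir Hussain, Saddam Hussain Khan

TL;DR: This paper proposes CGRA-DeBERTa, a concept-guided residual augmentation transformer framework for improving question answering (QA) accuracy over Islamic Hadith texts. The model builds on a customized DeBERTa backbone and combines lightweight LoRA adaptation, prior knowledge from an Islamic Concept Dictionary of 12 core terms, and a concept gating mechanism that selectively amplifies semantically critical tokens.

Details

Motivation: Accurate QA over classical Islamic texts is hard because of domain-specific semantics, long-context dependencies, and concept-sensitive reasoning.

Result: On a purpose-built dataset of 42,591 QA pairs, CGRA-DeBERTa reaches an exact match (EM) score of 97.85, well above the BERT (75.87) and DeBERTa (89.77) baselines, achieving SOTA performance while adding only about 8% inference overhead.

Insight: The novelty lies in integrating domain priors (the Islamic Concept Dictionary) into a pretrained language model through concept-guided residual blocks and an importance-weighted attention gating mechanism, strengthening domain-specific semantic representations and reasoning precision while keeping computation efficient. The approach offers an interpretable and efficient architectural pattern for domain-specific NLP tasks.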

Abstract: Accurate QA over classical Islamic texts remains challenging due to domain-specific semantics, long-context dependencies, and concept-sensitive reasoning. Therefore, CGRA-DeBERTa, a concept-guided residual domain augmentation transformer framework, is proposed to enhance theological QA over Hadith corpora. CGRA-DeBERTa builds on a customized DeBERTa transformer backbone with lightweight LoRA-based adaptations and a residual concept-aware gating mechanism. The customized DeBERTa embedding block learns global and positional context, while Concept Guided Residual Blocks incorporate theological priors from a curated Islamic Concept Dictionary of 12 core terms. Moreover, the Concept Gating Mechanism selectively amplifies semantically critical tokens via importance-weighted attention, applying differential scaling from 1.04 to 3.00. This design preserves contextual integrity, strengthens domain-specific semantic representations, and enables accurate span extraction while maintaining computational efficiency. CGRA-DeBERTa is trained on a specially constructed dataset of 42,591 QA pairs drawn from Sahih al-Bukhari and Sahih Muslim. While BERT achieved an EM score of 75.87 and DeBERTa 89.77, our model scored 97.85, surpassing the stronger DeBERTa baseline by 8.08 points absolute, while adding approximately 8% inference overhead due to parameter-efficient gating. The qualitative evaluation noted better extraction, discrimination, and theological precision. This study presents Hadith QA systems that are efficient, interpretable, and accurate, and that can provide educational materials at scale with the necessary theological nuance.


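[3] AIC CTU@AVerImaTeC: dual-retriever RAG for image-text fact checking cs.CL

Herbert Ullrich, Jan Drchal

TL;DR: This paper describes the system that placed third in the AVerImaTeC shared task. It combines last year's retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module to form a dual-retriever RAG framework for image-text fact-checking.

Details

Motivation: The goal is an efficient, low-cost, and easily reproducible image-text fact-checking system that addresses the challenges of multimodal information verification.

Result: The system achieved a competitive third place on the AVerImaTeC benchmark at an average cost of only $0.013 per fact-check (using GPT5.1), with a single multimodal LLM call per fact-check.

Insight: The novelty lies in decoupling text retrieval (based on similarity search) and image retrieval (based on API-accessed RIS) into two independent modules and combining them with a generation module (GPT5.1), yielding a simple, reproducible, and cost-effective framework that serves as an accessible starting point for further experimentation.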

Abstract: In this paper, we present our 3rd place system in the AVerImaTeC shared task, which combines our last year's retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module. Despite its simplicity, our system delivers competitive performance with a single multimodal LLM call per fact-check at just $0.013 on average using GPT5.1 via OpenAI Batch API. Our system is also easy to reproduce and tweak, consisting of only three decoupled modules - a textual retrieval module based on similarity search, an image retrieval module based on API-accessed RIS, and a generation module using GPT5.1 - which is why we suggest it as an accessible starting point for further experimentation. We publish its code and prompts, as well as our vector stores and insights into the scheme's running costs and directions for further improvement.


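[4] Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory cs.CL

Zihao Tang, Xin Yu, Ziyu Xiao, Zengxuan Wen, Zelin Li

TL;DR: This paper proposes Mnemis, a novel long-term memory framework for LLMs that integrates fast similarity-based retrieval (System-1) with slow retrieval based on global reasoning over a hierarchical graph (System-2), efficiently retrieving historical information that is both semantically and structurally relevant.

Details

Motivation: Existing LLM memory methods (such as RAG and Graph-RAG) rely mainly on similarity-based retrieval. This System-1-style retrieval falls short when a task requires global reasoning or comprehensive coverage of all relevant information. The paper addresses this with a complementary dual-route retrieval mechanism.

Result: Mnemis achieves state-of-the-art performance on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S (using GPT-4.1-mini).

Insight: The core idea is to organize memory into a base graph and a hierarchical graph and to fuse System-1 similarity search with the System-2 Global Selection mechanism, so that fast retrieval and deliberate hierarchical traversal complement each other and improve retrieval coverage and relevance.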

Abstract: AI Memory, specifically how models organize and retrieve historical messages, becomes increasingly valuable to Large Language Models (LLMs), yet existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms. While efficient, such System-1-style retrieval struggles with scenarios that require global reasoning or comprehensive coverage of all relevant information. In this work, we propose Mnemis, a novel memory framework that integrates System-1 similarity search with a complementary System-2 mechanism, termed Global Selection. Mnemis organizes memory into a base graph for similarity retrieval and a hierarchical graph that enables top-down, deliberate traversal over semantic hierarchies. By combining the complementary strengths of both retrieval routes, Mnemis retrieves memory items that are both semantically and structurally relevant. Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.


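[5] NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering cs.CL | cs.AI

Rong Fu, Yang Li, Zeyu Zhang, Jiekai Wu, Yaohua Liu

TL;DR: This paper proposes NeuroSymActive, a framework for Knowledge Graph Question Answering (KGQA) that combines a differentiable neural-symbolic reasoning layer with an active exploration controller. Using soft-unification style symbolic modules, a neural path evaluator, and a Monte-Carlo style exploration policy, it aims for precise multi-hop reasoning while reducing the number of expensive graph queries and model calls.

Details

Motivation: Large pretrained language models and neural reasoning systems struggle with knowledge-intensive queries that require precise, structured multi-hop reasoning, and integrating knowledge graphs with neural models suffers from inefficiency, high retrieval cost, and a lack of gradient-based optimization.

Result: On standard KGQA benchmarks, NeuroSymActive maintains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared with common retrieval-augmented baselines.

Insight: The novelty lies in combining differentiable neural-symbolic reasoning with an active exploration strategy: soft-unification symbolic modules and value-guided Monte-Carlo exploration prioritize path expansions, balancing efficiency and accuracy and providing a modular, differentiable approach to neural-symbolic reasoning.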

Abstract: Large pretrained language models and neural reasoning systems have advanced many natural language tasks, yet they remain challenged by knowledge-intensive queries that require precise, structured multi-hop inference. Knowledge graphs provide a compact symbolic substrate for factual grounding, but integrating graph structure with neural models is nontrivial: naively embedding graph facts into prompts leads to inefficiency and fragility, while purely symbolic or search-heavy approaches can be costly in retrievals and lack gradient-based refinement. We introduce NeuroSymActive, a modular framework that combines a differentiable neural-symbolic reasoning layer with an active, value-guided exploration controller for Knowledge Graph Question Answering. The method couples soft-unification style symbolic modules with a neural path evaluator and a Monte-Carlo style exploration policy that prioritizes high-value path expansions. Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.


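[6] The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems cs.CL | cs.CV | cs.LG

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He

TL;DR: This paper proposes the Vision Wormhole, a framework that addresses the inefficiency of discrete text communication in heterogeneous multi-agent systems. It introduces a Universal Visual Codec that maps the reasoning traces of heterogeneous models into a shared continuous latent space, and uses the visual interface of vision-language models as a universal port for text-free, model-agnostic latent state transfer.

Details

Motivation: Existing LLM-based multi-agent systems rely on discrete text communication, which incurs significant runtime overhead and information quantization loss. Existing latent state transfer methods either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, which limits scalability and modularity across heterogeneous model families with disjoint manifolds.

Result: Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) show that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based multi-agent systems.

Insight: The novelty lies in repurposing the vision encoder of vision-language models as a universal port for inter-agent latent state communication, adopting a hub-and-spoke topology that reduces pairwise alignment complexity from O(N^2) to O(N), and using a label-free teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway.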

Abstract: Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver’s visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas
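The O(N^2) to O(N) reduction attributed to the hub-and-spoke topology can be illustrated with a simple count. The sketch below is not taken from the paper: it assumes one learned translator per ordered sender-receiver pair in the pairwise setting, and one encoder plus one decoder per model in the hub-and-spoke setting. Both function names are hypothetical.

```python
def pairwise_translators(num_models: int) -> int:
    """Count translators if every ordered (sender, receiver) pair needs its own (assumption)."""
    return num_models * (num_models - 1)


def hub_and_spoke_adapters(num_models: int) -> int:
    """Count adapters if each model only maps to and from one shared space (assumption)."""
    return 2 * num_models


# Compare the two counts as the number of heterogeneous models grows.
for n in (2, 5, 10):
    print(n, pairwise_translators(n), hub_and_spoke_adapters(n))
```

Under these assumptions the hub-and-spoke count grows linearly, so the saving only appears once more than a handful of model families are involved.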


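[7] TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models cs.CL | cs.LG | cs.SE

Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen

TL;DR: This paper proposes TAROT, a test-driven and capability-adaptive curriculum reinforcement fine-tuning method that improves the functional correctness and robustness of LLM-generated code. It builds a four-tier test suite (basic, intermediate, complex, edge) for each problem and designs a capability-adaptive curriculum strategy, addressing the imbalanced reward signals and biased gradient updates that uneven test-case difficulty causes in existing reinforcement fine-tuning methods.

Details

Motivation: Although LLMs are changing the coding paradigm, generating algorithmically sophisticated and robust code remains a key challenge. Existing reinforcement fine-tuning methods tend to overlook the heterogeneous difficulty and granularity inherent in test cases, which leads to an imbalanced reward-signal distribution and biased gradients during training. A more systematic, adaptive curriculum learning strategy is therefore needed to incentivize deep reasoning.

Result: Extensive experiments show that the optimal curriculum for reinforcement fine-tuning in code generation is closely tied to a model's inherent capability: less capable models gain more from an easy-to-hard curriculum, whereas more capable models perform better under a hard-first curriculum. TAROT adaptively tailors curriculum design to model capability, consistently improving the functional correctness and robustness of generated code.

Insight: The core idea is to decouple curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies instead of relying on incidental test-case difficulty composition. This gives reinforcement fine-tuning a reproducible, capability-adaptive curriculum design method that promotes stable optimization and more efficient competency acquisition.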

Abstract: Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model’s inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model’s capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
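The relationship between the four test tiers and the two curriculum directions can be sketched as follows. This is an illustration only, not TAROT's implementation: the abstract does not say how capability is measured or how a policy is selected, so the `capability` score, the `threshold`, and the reading of "hard-first" as the reversed tier order are all assumptions, and the function names are hypothetical.

```python
# Tier names come from the abstract; treating this list as ordered
# from easiest to hardest is an assumption.
TIERS = ["basic", "intermediate", "complex", "edge"]


def curriculum_order(policy: str) -> list[str]:
    """Return the order in which test tiers are presented during training."""
    if policy == "easy_to_hard":
        return list(TIERS)
    if policy == "hard_first":
        # Assumption: "hard-first" is modelled as the reversed tier order.
        return list(reversed(TIERS))
    raise ValueError(f"unknown policy: {policy}")


def select_policy(capability: float, threshold: float = 0.5) -> str:
    """Pick a policy from a hypothetical capability score in [0, 1]."""
    return "hard_first" if capability >= threshold else "easy_to_hard"


weaker = curriculum_order(select_policy(capability=0.2))
stronger = curriculum_order(select_policy(capability=0.8))
print(weaker)
print(stronger)
```

The only point carried over from the source is the direction of the finding: weaker models are paired with easy-to-hard progression and stronger models with a hard-first curriculum.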


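[8] ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling cs.CL | eess.AS

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

TL;DR: This paper proposes ZeroSyl, a simple training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model for pure speech language modeling. It segments syllables using the L2 norms of features from WavLM's intermediate layers, then produces syllable units by mean pooling and K-means discretization, and outperforms existing syllabic tokenizers on lexical, syntactic, and narrative benchmarks.

Details

Motivation: Pure speech language models learn language directly from raw audio, but the discrete token sequences produced by self-supervised speech encoders are excessively long, and existing syllable-unit methods depend on intricate multi-stage training pipelines. A simple, training-free syllable tokenization method is therefore needed.

Result: ZeroSyl achieves competitive syllable segmentation performance and outperforms prior syllabic tokenizers (such as Sylber and SyllableLM) on lexical, syntactic, and narrative benchmarks. Scaling experiments show that finer-grained units benefit lexical tasks, while the discovered syllabic units scale better for syntactic modeling.

Insight: The novelty is a fully training-free syllable segmentation and embedding extraction method based on the L2 norms of intermediate-layer features from a pretrained model. It avoids complex training pipelines while remaining competitive, giving spoken language modeling a more efficient syllable-level representation.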

Abstract: Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM’s intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
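The segment-then-pool pipeline described above can be sketched on toy data. The abstract does not state how boundaries are derived from the L2 norms, so the rule used here (a strict local minimum of the norm curve starts a new segment) is an assumption, as are all function names. The frames are made-up 2-D vectors standing in for WavLM features, and the K-means discretization step is omitted.

```python
import math


def l2_norms(frames):
    """L2 norm of each frame-level feature vector."""
    return [math.sqrt(sum(x * x for x in frame)) for frame in frames]


def boundaries_from_norms(norms):
    """Hypothetical rule: a strict local minimum of the norm curve starts a new segment."""
    cuts = [0]
    for t in range(1, len(norms) - 1):
        if norms[t] < norms[t - 1] and norms[t] < norms[t + 1]:
            cuts.append(t)
    cuts.append(len(norms))
    return cuts


def mean_pool(frames, cuts):
    """Average the frames inside each [start, end) segment into one embedding."""
    pooled = []
    for start, end in zip(cuts, cuts[1:]):
        segment = frames[start:end]
        dim = len(segment[0])
        pooled.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return pooled


# Toy stand-in for intermediate-layer features: five frames, two dimensions.
frames = [[3.0, 4.0], [6.0, 8.0], [0.0, 1.0], [3.0, 4.0], [6.0, 8.0]]
cuts = boundaries_from_norms(l2_norms(frames))
units = mean_pool(frames, cuts)
print(cuts, units)
```

In the described method the pooled segment embeddings would then be clustered with K-means to obtain discrete units for language model training.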


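[9] Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL cs.CL | cs.AI

Yihan Wang, Peiyu Liu, Runyu Chen, Wei Xu

TL;DR: This paper proposes SquRL, a reinforcement learning framework that addresses the poor generalization of static workflows for Text-to-SQL in real-world scenarios. By letting the system adaptively construct dynamic workflows at inference time, the method outperforms the best static workflow methods on multiple benchmarks, especially on complex and out-of-distribution queries.

Details

Motivation: Existing Text-to-SQL methods rely on a single static workflow that scales poorly to out-of-distribution and long-tail scenarios, limiting real-world effectiveness. The paper aims to let systems construct workflows adaptively at inference time, replacing the conventional practice in which users select a suitable method through extensive experimentation.

Result: On widely used Text-to-SQL benchmarks, dynamic workflow construction consistently outperforms the best static workflow methods. The gains are driven mainly by heterogeneity across candidate workflows and are most pronounced on complex and out-of-distribution queries.

Insight: The novelty lies in the reinforcement learning framework SquRL, which strengthens LLM reasoning for adaptive workflow construction, together with a rule-based reward function and two training mechanisms: dynamic actor masking (to encourage broader exploration) and pseudo rewards (to improve training efficiency). Viewed objectively, dynamic policy selection exploits the heterogeneity across workflows and offers a new route to more robust and scalable Text-to-SQL systems.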

Abstract: Text-to-SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real-world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs' reasoning capability in adaptive workflow construction. We design a rule-based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries. The code is available at https://github.com/Satissss/SquRL
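The claim that an optimal dynamic policy cannot do worse than the best static workflow, and that the gap comes from heterogeneity, can be checked on a toy example. The sketch below is not SquRL itself: it compares a per-query oracle against the best single workflow on made-up 0/1 correctness scores, and the workflow names are hypothetical.

```python
def best_static_accuracy(results):
    """Accuracy of the single workflow that is best on average."""
    return max(sum(scores) / len(scores) for scores in results.values())


def oracle_dynamic_accuracy(results):
    """Accuracy when the best workflow is chosen separately for every query."""
    per_query_best = [max(scores) for scores in zip(*results.values())]
    return sum(per_query_best) / len(per_query_best)


# Made-up correctness (1 = correct SQL) of three workflows on four queries.
heterogeneous = {
    "workflow_a": [1, 1, 0, 0],
    "workflow_b": [0, 1, 1, 0],
    "workflow_c": [0, 0, 0, 1],
}
# Workflows that succeed and fail on the same queries leave nothing to gain.
homogeneous = {
    "workflow_a": [1, 1, 0, 0],
    "workflow_b": [1, 1, 0, 0],
}

print(best_static_accuracy(heterogeneous), oracle_dynamic_accuracy(heterogeneous))
print(best_static_accuracy(homogeneous), oracle_dynamic_accuracy(homogeneous))
```

A learned policy will generally sit between the two numbers; the oracle only marks the upper bound that per-query selection could reach.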


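[10] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens cs.CL | cs.AI

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng

TL;DR: This paper proposes STAPO, a method for stabilizing reinforcement learning fine-tuning of large language models (LLMs). It identifies and suppresses the abnormal gradient updates caused by rare spurious tokens during training, preventing late-stage performance collapse and improving training stability and reasoning performance.

Details

Motivation: Existing RL-based LLM fine-tuning methods (such as GRPO) rely heavily on heuristic techniques (such as entropy regularization) to maintain stability, yet in practice they often suffer late-stage performance collapse, which degrades reasoning quality and destabilizes training. The authors find that this instability is driven by a tiny fraction of tokens (about 0.01%), called spurious tokens, which appear in correct responses and contribute little to the reasoning yet inherit the sequence-level reward, leading to abnormally amplified gradients.

Result: Across six mathematical reasoning benchmarks with Qwen 1.7B, 8B, and 14B base models, STAPO consistently shows superior entropy stability and improves average performance by 7.13% over GRPO, 20-Entropy, and JustRL.

Insight: The core contribution is a theoretical derivation that the magnitude of token-wise policy gradients is negatively correlated with token probability and local policy entropy, which is then used to identify the rare spurious tokens that destabilize training. By selectively masking the gradient updates of these tokens and renormalizing the loss over valid tokens, STAPO offers a theory-driven, more stable RL optimization scheme in place of conventional heuristic regularization.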

Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refining, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy and JustRL.
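The "mask, then renormalize over valid tokens" step can be shown with plain numbers. This is a minimal sketch, not the STAPO objective: the per-token losses are made up, the spurious-token mask is taken as given because the identification rule is not spelled out here, and `masked_token_loss` is a hypothetical name.

```python
def masked_token_loss(token_losses, spurious_mask):
    """Average per-token loss over tokens that are not flagged as spurious."""
    kept = [loss for loss, is_spurious in zip(token_losses, spurious_mask) if not is_spurious]
    if not kept:
        return 0.0  # every token was masked, so nothing contributes
    # Renormalize by the number of valid tokens, not the full sequence length.
    return sum(kept) / len(kept)


# Made-up per-token losses; the third token has an abnormally large loss.
token_losses = [0.2, 0.4, 9.0, 0.6]
spurious_mask = [False, False, True, False]  # assumed to come from some detector

plain = sum(token_losses) / len(token_losses)
masked = masked_token_loss(token_losses, spurious_mask)
print(plain, masked)
```

Dividing by the count of kept tokens keeps the loss scale comparable across sequences with different numbers of masked positions.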


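[11] Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos cs.CL | cs.AI

Laura De Grazia, Danae Sánchez Villegas, Desmond Elliott, Mireia Farrús, Mariona Taulé

TL;DR: This paper presents FineMuSe, a Spanish multimodal sexism detection dataset with both binary and fine-grained annotations, and introduces a comprehensive hierarchical taxonomy covering forms of sexism, non-sexism, and the rhetorical devices of irony and humor. The authors evaluate a range of large language models on binary and fine-grained sexism detection and find that multimodal LLMs perform competitively with human annotators at identifying nuanced sexism but struggle with co-occurring sexist types conveyed through visual cues.

Details

Motivation: Online sexism takes many forms, which makes detection challenging. Existing automated tools are usually limited to binary classification, so subtler sexism may go undetected for lack of fine-grained, context-sensitive labels.

Result: The evaluation covers a range of LLMs on binary and fine-grained sexism detection. Multimodal LLMs perform competitively with human annotators at identifying nuanced sexism but struggle to capture co-occurring sexist types conveyed through visual cues.

Insight: The paper contributes FineMuSe, a new Spanish multimodal dataset for fine-grained sexism detection, and a hierarchical taxonomy, providing a benchmark for fine-grained sexism detection. It also exposes the limits of multimodal LLMs in understanding visual cues, pointing to directions for improving future multimodal models.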

Abstract: Online sexism appears in various forms, which makes its detection challenging. Although automated tools can enhance the identification of sexist content, they are often restricted to binary classification. Consequently, more subtle manifestations of sexism may remain undetected due to the lack of fine-grained, context-sensitive labels. To address this issue, we make the following contributions: (1) we present FineMuSe, a new multimodal sexism detection dataset in Spanish that includes both binary and fine-grained annotations; (2) we introduce a comprehensive hierarchical taxonomy that encompasses forms of sexism, non-sexism, and rhetorical devices of irony and humor; and (3) we evaluate a wide range of LLMs for both binary and fine-grained sexism detection. Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.


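[12] ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models cs.CL | cs.AI

Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé

TL;DR: This paper introduces ChartEditBench, a benchmark for evaluating multimodal large language models on code-based, multi-turn chart editing. It comprises 5,000 difficulty-controlled modification chains and a rigorously human-verified subset, and comes with a robust evaluation framework that combines execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments show that current state-of-the-art MLLMs degrade substantially in multi-turn interaction because of error accumulation and breakdowns in shared context.

Details

Motivation: Although MLLMs perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations over multiple turns, which requires the model to maintain common ground, track prior edits, and adapt to evolving preferences.

Result: Experiments on ChartEditBench show that state-of-the-art MLLMs degrade substantially in multi-turn settings, with frequent execution failures on data-centric transformations but strong performance on stylistic edits. The benchmark establishes a challenging testbed for grounded, intent-aware multimodal programming.

Insight: The contribution is ChartEditBench, a benchmark dedicated to multi-turn, code-based chart editing that goes beyond prior one-shot evaluation and emphasizes sustained, context-aware editing. The paper also designs a comprehensive evaluation framework that combines several verification methods to mitigate the limitations of relying on LLM-as-a-Judge metrics alone. Viewed objectively, the work shifts evaluation from static generation to dynamic interaction, which matters for applying MLLMs in real data analysis scenarios.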

Abstract: While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.


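[13] ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution cs.CL

Yahia Alqurnawi, Preetom Biswas, Anmol Rao, Tejas Anvekar, Chitta Baral

TL;DR: This paper studies how multimodal large language models perform on visual table attribution, evaluating their ability to cite the specific rows and columns that support an answer to a table question. Although question answering accuracy is moderate, attribution accuracy is much lower and near random for JSON inputs; models cite rows more reliably than columns and struggle more with textual formats than with images.

Details

Motivation: Multimodal LLMs are weak at citing the sources of their answers in structured data such as tables; improving this would make them more trustworthy in applications that require transparency and traceability.

Result: On the ViTaB-A benchmark, attribution accuracy is low across all models and near random for JSON inputs. Models are more reliable at citing rows than columns, textual formats are more challenging than images, and there are notable differences across model families.

Insight: The contribution is a systematic evaluation of the fine-grained trustworthiness of multimodal LLMs on visual table attribution. It exposes the gap between question answering and evidence attribution and highlights the limits of current models in transparently citing structured data, pointing to directions for improvement.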

Abstract: Multimodal Large Language Models (mLLMs) are often used to answer questions in structured data such as tables in Markdown, JSON, and images. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution/citation, which is the ability of the models to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution. Although question answering accuracy remains moderate, attribution accuracy is much lower, near random for JSON inputs, across all models. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their usage in applications requiring transparency and traceability.


cs.CV

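[14] GRAFNet: Multiscale Retinal Processing via Guided Cortical Attention Feedback for Enhancing Medical Image Polyp Segmentation cs.CV | cs.AI

Abdul Joseph Fofanah, Lian Wen, Alpha Alimamy Kamara, Zhongyi Zhang, David Chen

TL;DR: GRAFNet is a polyp segmentation network for medical images inspired by the biological visual system. Emulating the hierarchical organization of the human visual system, it integrates a Guided Asymmetric Attention Module, a MultiScale Retinal Module, and a Guided Cortical Attention Feedback Module to improve the accuracy and robustness of polyp segmentation.

Details

Motivation: Polyp segmentation in colonoscopy is challenging because of high morphological variability, visual similarity to normal structures, and the need for multi-scale detection. Existing deep learning methods suffer from unidirectional processing, weak multi-scale fusion, and a lack of anatomical constraints, which often leads to false positives and false negatives.

Result: Extensive experiments on five public benchmarks (Kvasir-SEG, CVC-300, CVC-ColonDB, CVC-Clinic, and PolypGen) show SOTA performance, with 3-8% Dice improvements and 10-20% higher generalization over leading methods, along with interpretable decision pathways.

Insight: The novelty lies in a modular design that emulates the biological visual system, including guided attention and parallel multi-scale analysis, together with iterative refinement via predictive coding. The work proposes a paradigm for bridging AI accuracy and clinically trustworthy reasoning.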

Abstract: Accurate polyp segmentation in colonoscopy is essential for cancer prevention but remains challenging due to: (1) high morphological variability (from flat to protruding lesions), (2) strong visual similarity to normal structures such as folds and vessels, and (3) the need for robust multi-scale detection. Existing deep learning approaches suffer from unidirectional processing, weak multi-scale fusion, and the absence of anatomical constraints, often leading to false positives (over-segmentation of normal structures) and false negatives (missed subtle flat lesions). We propose GRAFNet, a biologically inspired architecture that emulates the hierarchical organisation of the human visual system. GRAFNet integrates three key modules: (1) a Guided Asymmetric Attention Module (GAAM) that mimics orientation-tuned cortical neurones to emphasise polyp boundaries, (2) a MultiScale Retinal Module (MSRM) that replicates retinal ganglion cell pathways for parallel multi-feature analysis, and (3) a Guided Cortical Attention Feedback Module (GCAFM) that applies predictive coding for iterative refinement. These are unified in a Polyp Encoder-Decoder Module (PEDM) that enforces spatial-semantic consistency via resolution-adaptive feedback. Extensive experiments on five public benchmarks (Kvasir-SEG, CVC-300, CVC-ColonDB, CVC-Clinic, and PolypGen) demonstrate consistent state-of-the-art performance, with 3-8% Dice improvements and 10-20% higher generalisation over leading methods, while offering interpretable decision pathways. This work establishes a paradigm in which neural computation principles bridge the gap between AI accuracy and clinically trustworthy reasoning. Code is available at https://github.com/afofanah/GRAFNet.


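[15] Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition cs.CV

Shiyu Xuan, Dongkai Wang, Zechao Li, Jinhui Tang

TL;DR: This paper proposes a decoupled framework for zero-shot human-object interaction (HOI) detection that separates object detection from interaction recognition and uses multimodal large language models for zero-shot interaction recognition. A deterministic generation method casts interaction recognition as a visual question answering task, enabling training-free zero-shot recognition; a spatial-aware pooling module and a one-pass deterministic matching method further improve performance when the model is fine-tuned.

Details

Motivation: Existing zero-shot HOI detection methods typically couple interaction recognition tightly to a specific detector and rely on coarse-grained vision-language model features, which limits generalization to unseen interactions. The paper tackles interaction recognition under combinatorial diversity and aims for detector-agnostic recognition that generalizes well.

Result: Extensive experiments on HICO-DET and V-COCO show superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detector without retraining.

Insight: The novelty lies in a decoupled framework that separates interaction recognition from the detector, zero-shot interaction recognition through deterministic generation with MLLMs, a spatial-aware pooling module that integrates appearance and spatial cues, and a one-pass matching method that improves efficiency. Together these give HOI detection a more flexible and more generalizable approach.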

Abstract: Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limit generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further enhance performance and efficiency by fine-tuning the model, we design a spatial-aware pooling module that integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Extensive experiments on HICO-DET and V-COCO demonstrate that our method achieves superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detector without retraining. The code is publicly available at https://github.com/SY-Xuan/DA-HOI.


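[16] Loss Knows Best: Detecting Annotation Errors in Videos via Loss Trajectories cs.CV | cs.LG

Praditha Alwis, Soumyadeep Chandra, Deepak Ravikumar, Kaushik Roy

TL;DR: This paper proposes a model-agnostic method that detects annotation errors in video datasets, such as mislabeling and temporal disordering, by analyzing each frame's Cumulative Sample Loss trajectory over training. The method is validated on the EgoPER and Cholec80 datasets.

Details

Motivation: Annotation errors are common in real-world video datasets and are especially harmful in tasks that depend on temporal consistency (such as phase detection), where they undermine the reliability of model training.

Result: Experiments on EgoPER and Cholec80 show strong detection of mislabeling and frame disordering, without requiring ground truth on the annotation errors.

Insight: The key idea is to treat a frame's loss trajectory across training epochs as a dynamic fingerprint of its "learnability" and to detect erroneous annotations from their persistently high or irregular loss patterns. The result is a general dataset auditing tool that needs no ground-truth error labels.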

Abstract: High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as mislabeling, where segments are assigned incorrect class labels, and disordering, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL), defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video-based machine learning.
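The CSL computation described above (average a frame's loss over the saved checkpoints, then flag frames whose average stays high) can be sketched in a few lines. This is an illustration under assumptions: the loss values are made up, the fixed `threshold` is a hypothetical flagging rule that the abstract does not specify, and the function names are not from the paper.

```python
from statistics import mean


def cumulative_sample_loss(losses_per_checkpoint):
    """losses_per_checkpoint[e][i] is the loss of frame i under the checkpoint from epoch e."""
    num_frames = len(losses_per_checkpoint[0])
    return [mean(epoch[i] for epoch in losses_per_checkpoint) for i in range(num_frames)]


def flag_frames(csl, threshold):
    """Hypothetical rule: flag frames whose CSL exceeds a fixed threshold."""
    return [i for i, value in enumerate(csl) if value > threshold]


# Made-up losses for 4 frames under 3 checkpoints; frame 1 never becomes easy.
losses = [
    [2.0, 2.1, 2.0, 1.9],  # early checkpoint
    [1.0, 2.0, 0.9, 1.1],  # middle checkpoint
    [0.3, 2.2, 0.2, 0.3],  # late checkpoint
]
csl = cumulative_sample_loss(losses)
suspects = flag_frames(csl, threshold=1.5)
print(csl, suspects)
```

In practice the threshold would need to be tuned per dataset, and irregular (not just high) trajectories would call for a richer statistic than the mean.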


[17] How to Train Your Long-Context Visual Document Model cs.CV | cs.AI | cs.CL

Austin Veselka

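TL;DR: This paper presents the first comprehensive, large-scale study of training long-context vision-language models with context lengths up to 344K, targeting long-document visual question answering and examining transfer to long-context text tasks. It systematically studies the effects of continued pretraining, supervised finetuning, and preference optimization on 24B- and 32B-parameter models, uses extensive evaluations and ablations to fill the gap left by the non-reproducible training recipes of existing open-weight models (such as Qwen3 VL and GLM 4.5/6V), and achieves state-of-the-art performance on the MMLongBenchDoc benchmark.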

Details

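Motivation: Existing open-weight long-context vision-language models (such as Qwen3 VL and GLM 4.5/6V) are strong, but their training recipes and data pipelines are not reproducible and have not been studied systematically. This paper aims to fill that gap with a reproducible study of training methods and to improve long-document visual question answering.

Result: On the MMLongBenchDoc benchmark, the proposed approach reaches state-of-the-art performance at both the 24B and 32B parameter scales.

Insight: Key findings include: 1) training on context lengths that match the evaluation lengths outperforms training on longer contexts; 2) using page indices in training and evaluation substantially improves long-document performance; 3) the proposed synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning; 4) the paper shows that visual long-context training transfers in the reverse direction and improves long-context text performance. The authors also release MMLBD-C, a manually corrected version of the benchmark that reduces erroneous and low-quality examples.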

Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several strong models of this kind are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive long-context (LC) evaluations and ablations to bridge this gap, and achieve state-of-the-art performance on MMLongBenchDoc for both parameter scales. In addition to this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc to reduce erroneous and low quality examples in the benchmark.


[18] Visual Persuasion: What Influences Decisions of Vision-Language Models? cs.CV | cs.AI

Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

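TL;DR: This paper proposes a framework for studying the visual decision preferences of vision-language models (VLMs): it places VLMs in controlled image-based choice tasks and systematically perturbs their inputs to infer a latent visual utility function. The core method is visual prompt optimization, which uses an image generation model to produce visually plausible modifications of the original image (for example in composition, lighting, or background) and then evaluates which edits raise the VLM's selection probability.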

Details

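Motivation: The web is full of images originally designed for humans that are now increasingly interpreted and acted on (clicking, recommending, buying) by VLM-based agents, yet little is known about the structure of these agents' visual preferences. The paper aims to study systematically the visual factors that influence VLM decisions.

Result: Large-scale experiments on frontier VLMs show that optimized image edits significantly shift the models' choice probabilities in head-to-head comparisons. The authors also develop an automatic interpretability pipeline that explains these preferences and identifies consistent visual themes that drive selection.

Insight: The novelty lies in applying revealed preference theory from economics to VLMs, using controlled, systematic image edits to uncover a model's latent visual utility. The method offers a proactive, efficient way to surface visual vulnerabilities and safety concerns in VLMs (for example, susceptibility to manipulation through specific visual features), supporting more proactive auditing and governance of image-based AI agents.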

Abstract: The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent’s decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.


[19] Consistency-Preserving Diverse Video Generation cs.CV

Xinshuang Liu, Runfa Blark Li, Truong Nguyen

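TL;DR: This paper proposes a joint-sampling framework for flow-matching video generators that increases the diversity of video samples within a batch while preserving temporal consistency within each video. Both the diversity and consistency objectives are computed with lightweight latent-space models, avoiding costly video decoding and backpropagation.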

Details

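Motivation: Text-to-video generation yields only a few samples per prompt, so the value of each batch must be maximized; existing methods that improve diversity often degrade temporal consistency and are computationally expensive.

Result: Experiments on a state-of-the-art text-to-video flow-matching model show that the method matches the diversity of strong joint-sampling baselines while substantially improving temporal consistency and color naturalness.

Insight: The novelty is that, after applying diversity-driven updates within the joint-sampling framework, only the components that would reduce temporal consistency are removed, and latent-space models are used to avoid image-space gradients and decoding overhead, balancing efficiency and quality.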

Abstract: Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.
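
One way to read "removes only the components that would decrease a temporal-consistency objective" is as a gradient projection. The sketch below rests on that assumption: it projects out the part of a diversity update that points against the gradient of the consistency objective. The function names and the two-dimensional toy vectors are hypothetical, and the paper's actual update rule may differ.

```python
def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_conflicting_component(diversity_update, consistency_grad):
    """Drop the part of the update that would lower the consistency objective.

    To first order, an update lowers the objective only if its inner product
    with the objective's gradient is negative; in that case the projection
    onto the gradient direction is subtracted (assumed rule, not from the paper).
    """
    inner = dot(diversity_update, consistency_grad)
    norm_sq = dot(consistency_grad, consistency_grad)
    if inner >= 0 or norm_sq == 0:
        return list(diversity_update)
    scale = inner / norm_sq
    return [d - scale * g for d, g in zip(diversity_update, consistency_grad)]


# Toy latent update: the second coordinate opposes the consistency gradient.
update = [1.0, -2.0]
grad = [0.0, 1.0]
safe_update = remove_conflicting_component(update, grad)
print(safe_update)
```

The filtered update keeps the component orthogonal to the consistency gradient, so the diversity push survives wherever it does not conflict with temporal consistency.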


[20] Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models cs.CV | stat.ML

Tai Le-Gia, Jaehyun Ahn

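TL;DR: This paper proposes a training-free, zero-shot anomaly detection framework for 3D brain MRI. It builds localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models, which restores cubic spatial context and integrates directly with distance-based, batch-level anomaly detection pipelines.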

Details

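Motivation: Zero-shot anomaly detection has mostly been limited to 2D datasets, and extending it to 3D medical images is challenging: existing methods rely on slice-wise features and vision-language models, which fail to capture volumetric structure.

Result: The results show that training-free, batch-level zero-shot anomaly detection can be extended effectively from 2D encoders to full 3D MRI volumes, offering a simple and robust approach to volumetric anomaly detection.

Insight: The novelty lies in building 3D volumetric tokens through multi-axis slice aggregation to restore spatial context, which enables 3D anomaly detection without training, prompts, or supervision; 2D foundation models are applied to 3D data and yield compact representations that are practical to compute.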

Abstract: Zero-shot anomaly detection (ZSAD) has gained increasing attention in medical imaging as a way to identify abnormalities without task-specific supervision, but most advances remain limited to 2D datasets. Extending ZSAD to 3D medical images has proven challenging, with existing methods relying on slice-wise features and vision-language models, which fail to capture volumetric structure. In this paper, we introduce a fully training-free framework for ZSAD in 3D brain MRI that constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines. The framework provides compact 3D representations that are practical to compute on standard GPUs and require no fine-tuning, prompts, or supervision. Our results show that training-free, batch-based ZSAD can be effectively extended from 2D encoders to full 3D MRI volumes, offering a simple and robust approach for volumetric anomaly detection.


[21] Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs cs.CV | cs.AI

Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li

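TL;DR: This paper proposes Sparrow, a framework that uses text-anchored window attention and a visual-semantic glimpsing mechanism to resolve the performance collapse of speculative decoding in video large language models, delivering efficient acceleration on long sequences.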

Details

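Motivation: Speculative decoding in video large language models suffers from attention dilution and negative visual gain caused by key-value cache explosion and context window mismatch; the paper proposes a remedy.

Result: Experiments show that Sparrow achieves an average speedup of 2.82x even on long sequences with 25k visual tokens, effectively resolving the performance degradation on long sequences.

Insight: The contributions include visually-aware text-anchored window attention via hidden state reuse, intermediate-layer visual state bridging for training the draft model, and a multi-token prediction strategy that bridges the training-inference distribution shift.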

Abstract: Although speculative decoding is widely used to accelerate Vision-Language Model (VLM) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.


[22] EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use cs.CV

Siwei Wen, Zhangcheng Wang, Xingjian Zhang, Lei Huang, Wenjun Wu

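TL;DR: This paper proposes EventMemAgent, an active online video understanding agent framework built on hierarchical event memory. Short-term memory dynamically detects event boundaries and processes streaming video frames, while long-term memory stores past observations in an event-structured form; combined with a multi-granular perception toolkit and agentic reinforcement learning, the framework addresses the conflict between unbounded streaming input and the limited context window of MLLMs.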

Details

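Motivation: Online video understanding faces a fundamental conflict between unbounded streaming input and the limited context window of multimodal large language models. Existing passive processing methods trade off long-range context against fine-grained detail, so an active framework is needed to balance the two.

Result: Experiments show that EventMemAgent achieves competitive results on online video benchmarks, performing on par with or better than existing methods on the relevant tasks.

Insight: The contributions include a hierarchical event-centric memory mechanism (combining short-term and long-term memory), event-granular reservoir sampling for dynamically processing streaming video, a multi-granular perception toolkit for active, iterative evidence capture, and agentic reinforcement learning that internalizes reasoning and tool-use strategies end to end. The shift from passive processing to an active agent framework is intended to improve the efficiency and accuracy of online video understanding.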

Abstract: Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams. Its fundamental challenge lies in the conflict between the unbounded nature of streaming media input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods primarily rely on passive processing, which often faces a trade-off between maintaining long-range context and capturing the fine-grained details necessary for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework based on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and utilizes event-granular reservoir sampling to dynamically process streaming video frames within a fixed-length buffer; long-term memory archives past observations in a structured, event-by-event manner. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to internalize reasoning and tool-use strategies end to end into the agent’s intrinsic capabilities. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released here: https://github.com/lingcco/EventMemAgent.
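
For readers unfamiliar with reservoir sampling, the following sketch shows the classic single-pass algorithm (Algorithm R) that keeps a uniform sample of a stream in a fixed-length buffer. It is not the paper's event-granular variant, whose details the abstract does not give; the function name, buffer size, and seed are illustrative assumptions.

```python
import random


def reservoir_sample(stream, capacity, rng):
    """Keep a uniform random sample of `capacity` items from a stream.

    Classic Algorithm R: the first `capacity` items fill the buffer; item
    number `index` (0-based) then replaces a random slot with probability
    capacity / (index + 1).
    """
    buffer = []
    for index, item in enumerate(stream):
        if index < capacity:
            buffer.append(item)
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < capacity:
                buffer[slot] = item
    return buffer


# Frame indices stand in for streaming video frames (toy data).
frames = range(1000)
kept = reservoir_sample(frames, capacity=8, rng=random.Random(0))
print(sorted(kept))
```

Memory stays constant no matter how long the stream runs, which is the property that makes the technique attractive for unbounded video input.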


[23] Effective and Robust Multimodal Medical Image Analysis cs.CV

Joy Dhar, Nayyar Zaidi, Maryam Haghighat

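TL;DR: This paper proposes MAIL (Multi-Attention Integration Learning), a new multimodal fusion learning network that addresses the shortcomings of existing medical image analysis methods in generalizability, computational efficiency, and adversarial robustness. MAIL uses an efficient residual learning attention block to capture modality-specific multi-scale patterns and a multimodal cross-attention module to capture complementary shared representations across modalities. The authors further design Robust-MAIL, which adds random projection filters and modulated attention noise to strengthen adversarial robustness.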

Details

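Motivation: Existing multimodal fusion learning methods have three key limitations: a) they usually specialize in specific modalities and overlook effective complementary information across modalities, which limits their generalizability for multi-disease analysis; b) they rely on computationally expensive models, which makes them poorly suited to resource-limited settings; c) they lack robustness against adversarial attacks, which compromises the reliability of medical AI applications.

Result: Extensive evaluation on 20 public datasets shows that both MAIL and Robust-MAIL outperform existing methods, with performance gains of up to 9.34% and computational cost reductions of up to 78.3%, yielding more reliable predictions than the top competitors.

Insight: The contributions include: 1) an efficient residual learning attention block that captures fine-grained, modality-specific multi-scale patterns; 2) an efficient multimodal cross-attention module that learns rich complementary shared representations across modalities; 3) improved adversarial robustness through random projection filters and modulated attention noise. Viewed objectively, the method improves performance while substantially reducing computational overhead and also accounts for adversarial robustness, offering a viable solution for resource-limited medical AI applications.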

Abstract: Multimodal Fusion Learning (MFL), leveraging disparate data from various imaging modalities (e.g., MRI, CT, SPECT), has shown great potential for addressing medical problems such as skin cancer and brain tumor prediction. However, existing MFL methods face three key limitations: a) they often specialize in specific modalities, and overlook effective shared complementary information across diverse modalities, hence limiting their generalizability for multi-disease analysis; b) they rely on computationally expensive models, restricting their applicability in resource-limited settings; and c) they lack robustness against adversarial attacks, compromising reliability in medical AI applications. To address these limitations, we propose a novel Multi-Attention Integration Learning (MAIL) network, incorporating two key components: a) an efficient residual learning attention block for capturing refined modality-specific multi-scale patterns and b) an efficient multimodal cross-attention module for learning enriched complementary shared representations across diverse modalities. Furthermore, to ensure adversarial robustness, we extend the MAIL network into Robust-MAIL by incorporating random projection filters and modulated attention noise. Extensive evaluations on 20 public datasets show that both MAIL and Robust-MAIL outperform existing methods, achieving performance gains of up to 9.34% while reducing computational costs by up to 78.3%. These results highlight the superiority of our approaches, ensuring more reliable predictions than top competitors. Code: https://github.com/misti1203/MAIL-Robust-MAIL.


[24] CREMD: Crowd-Sourced Emotional Multimodal Dogs Dataset cs.CV

Jinho Baek, Houwei Cao, Kate Blackwell

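TL;DR: This paper introduces CREMD, a crowd-sourced multimodal dataset for studying how presentation modes (such as context, audio, and video) and annotator characteristics (such as dog ownership, gender, and professional experience) affect the perception and labeling of dog emotions. The dataset contains 923 video clips presented in three modes, and an analysis of annotations from participants with diverse backgrounds reveals key factors that affect the reliability of dog emotion recognition.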

Details

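Motivation: Dog emotion recognition matters for improving human-animal interaction, veterinary care, and automated systems for monitoring canine well-being, but interpreting dog emotions accurately is difficult because emotional assessments are subjective and standardized ground-truth methods are lacking.

Result: The study finds that: (1) adding visual context significantly improves annotation agreement, while the effect of audio cues remains inconclusive because of design limitations (no no-context-with-audio condition and limited clean audio); (2) contrary to expectations, non-owners and male annotators show higher agreement than dog owners and female annotators, respectively, whereas professionals show higher agreement, consistent with the initial hypothesis; (3) the presence of audio substantially increases annotators' confidence in identifying specific emotions, particularly anger and fear.

Insight: The novelty lies in building a crowd-sourced dataset that systematically studies how multimodal information and annotator characteristics affect dog emotion recognition. Viewed objectively, the study design underlines the importance of context and annotator diversity in animal emotion annotation, and it provides data and methodological reference points for building more reliable benchmarks in animal affective computing.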

Abstract: Dog emotion recognition plays a crucial role in enhancing human-animal interactions, veterinary care, and the development of automated systems for monitoring canine well-being. However, accurately interpreting dog emotions is challenging due to the subjective nature of emotional assessments and the absence of standardized ground truth methods. We present the CREMD (Crowd-sourced Emotional Multimodal Dogs Dataset), a comprehensive dataset exploring how different presentation modes (e.g., context, audio, video) and annotator characteristics (e.g., dog ownership, gender, professional experience) influence the perception and labeling of dog emotions. The dataset consists of 923 video clips presented in three distinct modes: without context or audio, with context but no audio, and with both context and audio. We analyze annotations from diverse participants, including dog owners, professionals, and individuals with varying demographic backgrounds and experience levels, to identify factors that influence reliable dog emotion recognition. Our findings reveal several key insights: (1) while adding visual context significantly improved annotation agreement, our findings regarding audio cues are inconclusive due to design limitations (specifically, the absence of a no-context-with-audio condition and limited clean audio availability); (2) contrary to expectations, non-owners and male annotators showed higher agreement levels than dog owners and female annotators, respectively, while professionals showed higher agreement levels, aligned with our initial hypothesis; and (3) the presence of audio substantially increased annotators’ confidence in identifying specific emotions, particularly anger and fear.


[25] GMAIL: Generative Modality Alignment for generated Image Learning cs.CV | cs.AI | cs.LG | eess.IV

Shentong Mo, Sukmin Yun

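TL;DR: This paper proposes GMAIL, a new framework for the discriminative use of generated images in learning. It treats generated images as a modality distinct from real images and aligns the two in latent space through multimodal alignment rather than substituting them directly in pixel space, thereby exploiting advances in generative models to improve performance on vision-language tasks.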

Details

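Motivation: Generative models can synthesize highly realistic images, offering a potentially abundant data source for training machine learning models. However, indiscriminately using generated images as if they were real can cause mode collapse because of the modality discrepancy between the real and synthetic domains. The paper aims to solve this problem and make effective use of generated images.

Result: The framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval. Experiments also show positive scaling trends with generated data and a notable improvement in the captioning performance of the large multimodal model LLaVA.

Insight: The core novelty is to treat generated images explicitly as a separate modality and to align them in latent space with a cross-modality alignment loss instead of simply replacing pixels. This gives a more effective way to use generated data, avoids the negative effects of the modality discrepancy, and can be integrated easily into a range of vision-language models.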

Abstract: Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy through extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.


[26] Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation cs.CV

Shuwei Li, Lei Tan, Robby T. Tan

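TL;DR: This paper proposes a new framework for suppressing target-class semantic hallucinations in unpaired image translation. A dual-head discriminator detects hallucinated content in background regions, and class-specific prototypes serve as semantic anchors; hallucinated features are iteratively pushed away from the prototypes in feature space, which preserves object semantics in day-to-night translation.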

Details

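Motivation: Address the semantic hallucinations that arise in day-to-night unpaired image translation because of large appearance shifts and the lack of pixel-level supervision, such as incorrectly synthesized target-class objects (traffic signs, vehicles) and man-made light effects; these hallucinations significantly degrade downstream task performance.

Result: On the BDD100K dataset, the method improves mAP by 15.5% for day-to-night domain adaptation, with a 31.7% mAP gain for hallucination-prone classes such as traffic lights, and it outperforms existing methods both qualitatively and quantitatively.

Insight: The contributions include: 1) a dual-head discriminator that also performs semantic segmentation to detect background hallucinations; 2) class-specific prototypes, built from the features of annotated target-domain objects, that act as semantic anchors; 3) an iterative refinement framework based on a Schrodinger Bridge that explicitly suppresses hallucinated features in feature space. Together these designs improve the semantic fidelity of cross-domain translation.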

Abstract: Day-to-night unpaired image translation is important to downstream tasks but remains challenging due to large appearance shifts and the lack of direct pixel-level supervision. Existing methods often introduce semantic hallucinations, where objects from target classes such as traffic signs and vehicles, as well as man-made light effects, are incorrectly synthesized. These hallucinations significantly degrade downstream performance. We propose a novel framework that detects and suppresses hallucinations of target-class features during unpaired translation. To detect hallucination, we design a dual-head discriminator that additionally performs semantic segmentation to identify hallucinated content in background regions. To suppress these hallucinations, we introduce class-specific prototypes, constructed by aggregating features of annotated target-domain objects, which act as semantic anchors for each class. Built upon a Schrodinger Bridge-based translation model, our framework performs iterative refinement, where detected hallucination features are explicitly pushed away from class prototypes in feature space, thus preserving object semantics across the translation trajectory. Experiments show that our method outperforms existing approaches both qualitatively and quantitatively. On the BDD100K dataset, it improves mAP by 15.5% for day-to-night domain adaptation, with a notable 31.7% gain for classes such as traffic lights that are prone to hallucinations.
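
The prototype idea can be sketched in a few lines, under two loud assumptions: "aggregating features" is taken to mean averaging, and "pushed away" is modeled as a fixed-size step along the direction from the prototype to the feature. The paper may instead use a loss term inside its Schrodinger Bridge refinement; the names and toy features below are hypothetical.

```python
import math


def class_prototype(features):
    """Mean of the feature vectors of annotated objects (assumed aggregation)."""
    dim = len(features[0])
    return [sum(f[k] for f in features) / len(features) for k in range(dim)]


def push_away(feature, prototype, step):
    """Move a feature `step` units further from the prototype (illustrative rule)."""
    diff = [f - p for f, p in zip(feature, prototype)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        return list(feature)  # direction undefined; leave the feature unchanged
    return [f + step * d / norm for f, d in zip(feature, diff)]


# Toy 2-D features of annotated traffic lights in the target (night) domain.
traffic_light_feats = [[1.0, 0.0], [3.0, 0.0]]
proto = class_prototype(traffic_light_feats)

# A background feature flagged as a hallucinated traffic light.
hallucinated = [2.0, 1.0]
moved = push_away(hallucinated, proto, step=0.5)
print(proto, moved)
```

After the step, the flagged background feature lies farther from the traffic-light prototype, which is the qualitative effect the abstract describes.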


[27] Emergent Morphing Attack Detection in Open Multi-modal Large Language Models cs.CV

Marija Ivanovska, Vitomir Štruc

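TL;DR: This paper presents the first systematic evaluation of open-source multimodal large language models (MLLMs) for single-image face morphing attack detection (MAD) in a zero-shot setting. Many MLLMs show substantial discriminative ability without fine-tuning or domain adaptation, and LLaVA1.6-Mistral-7B surpasses task-specific MAD baselines by at least 23% in equal error rate (EER), reaching state-of-the-art performance. This suggests that multimodal pretraining implicitly encodes the fine-grained facial inconsistencies that indicate morphing artifacts, giving the models zero-shot forensic sensitivity.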

Details

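Motivation: Face morphing attacks threaten biometric verification, yet most existing MAD systems require task-specific training and generalize poorly to unseen attack types. Meanwhile, open-source multimodal large language models (MLLMs) have shown strong visual-linguistic reasoning, but their potential in biometric forensics remains underexplored.

Result: Evaluated across diverse morphing techniques, LLaVA1.6-Mistral-7B surpasses highly competitive task-specific MAD baselines by at least 23% in equal error rate (EER), reaching state-of-the-art (SOTA) performance.

Insight: The claimed novelty is the first systematic zero-shot evaluation of open-source MLLMs for MAD, together with the finding that multimodal pretraining itself implicitly encodes fine-grained features for detecting face morphing artifacts, an emergent capability. Viewed objectively, the study positions open-source MLLMs as reproducible, interpretable, and competitive foundations for biometric security and forensic image analysis, and points to new opportunities to build SOTA MAD systems through targeted fine-tuning or lightweight adaptation.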

Abstract: Face morphing attacks threaten biometric verification, yet most morphing attack detection (MAD) systems require task-specific training and generalize poorly to unseen attack types. Meanwhile, open-source multimodal large language models (MLLMs) have demonstrated strong visual-linguistic reasoning, but their potential in biometric forensics remains underexplored. In this paper, we present the first systematic zero-shot evaluation of open-source MLLMs for single-image MAD, using publicly available weights and a standardized, reproducible protocol. Across diverse morphing techniques, many MLLMs show non-trivial discriminative ability without any fine-tuning or domain adaptation, and LLaVA1.6-Mistral-7B achieves state-of-the-art performance, surpassing highly competitive task-specific MAD baselines by at least 23% in terms of equal error rate (EER). The results indicate that multimodal pretraining can implicitly encode fine-grained facial inconsistencies indicative of morphing artifacts, enabling zero-shot forensic sensitivity. Our findings position open-source MLLMs as reproducible, interpretable, and competitive foundations for biometric security and forensic image analysis. This emergent capability also highlights new opportunities to develop state-of-the-art MAD systems through targeted fine-tuning or lightweight adaptation, further improving accuracy and efficiency while preserving interpretability. To support future research, all code and evaluation protocols will be released upon publication.
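
Because the headline comparison is stated in terms of equal error rate, a small sketch of how EER is commonly estimated may help: sweep a decision threshold and report the operating point where the false positive and false negative rates are closest. The scores and labels below are made-up toy values, not data from the paper, and the helper name is hypothetical.

```python
def equal_error_rate(scores, labels):
    """Estimate EER; higher score means more likely morph, label 1 = morph.

    Each observed score is tried as a threshold (predict morph when
    score >= threshold) and the point with the smallest gap between the
    false positive rate and the false negative rate is reported.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        fpr = false_pos / negatives
        fnr = false_neg / positives
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Toy detector outputs: three bona fide images (0) and three morphs (1).
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]
eer = equal_error_rate(scores, labels)
print(round(eer, 3))
```

A lower EER is better, so an improvement "in terms of EER" means the detector's balanced error point moves toward zero.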


[28] RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution cs.CV | cs.AI

Youngwan Jin, Incheol Park, Yagiz Nalcakan, Hyeongjin Ju, Sanghyeop Yeo

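TL;DR: This paper proposes RPT-SR, a new Transformer architecture designed for infrared image super-resolution. It introduces a dual-token framework that fuses learnable regional prior tokens with local tokens, explicitly encoding scene layout information into the attention mechanism to address the inefficiency of general-purpose super-resolution models in fixed-viewpoint infrared imaging.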

Details

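Motivation: General-purpose super-resolution models (especially Vision Transformers) are inefficient in infrared imaging scenarios with fixed or nearly static viewpoints, such as surveillance and autonomous driving, because they fail to exploit the strong, persistent spatial priors inherent in these scenes, which leads to redundant learning and suboptimal performance.

Result: In extensive experiments on multiple datasets covering both the long-wave (LWIR) and short-wave (SWIR) spectra, RPT-SR achieves new state-of-the-art (SOTA) performance, demonstrating broad applicability and versatility.

Insight: The core novelty is a dual-token attention framework in which learnable regional prior tokens act as a persistent memory of the scene's global structure and are fused with local tokens that capture the frame-specific content of the current input, so that the priors can dynamically modulate local reconstruction and the scene's spatial priors are exploited effectively.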

Abstract: General-purpose super-resolution models, particularly Vision Transformers, have achieved remarkable success but exhibit fundamental inefficiencies in common infrared imaging scenarios like surveillance and autonomous driving, which operate from fixed or nearly-static viewpoints. These models fail to exploit the strong, persistent spatial priors inherent in such scenes, leading to redundant learning and suboptimal performance. To address this, we propose the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), a novel architecture that explicitly encodes scene layout information into the attention mechanism. Our core contribution is a dual-token framework that fuses (1) learnable, regional prior tokens, which act as a persistent memory for the scene’s global structure, with (2) local tokens that capture the frame-specific content of the current input. By integrating these tokens into the attention mechanism, our model allows the priors to dynamically modulate the local reconstruction process. Extensive experiments validate our approach. While most prior works focus on a single infrared band, we demonstrate the broad applicability and versatility of RPT-SR by establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) spectra.


[29] Semantic-Guided 3D Gaussian Splatting for Transient Object Removal cs.CV

Aditi Prabakaran, Priyesh Shukla

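TL;DR: This paper proposes a semantic-guided 3D Gaussian Splatting (3DGS) framework that removes transient objects (such as pedestrians and vehicles) from multi-view captures to eliminate ghosting artifacts in reconstruction. It uses a vision-language model (CLIP) to compute similarity scores between rendered views and distractor text prompts, accumulates the scores per Gaussian, and applies opacity regularization and periodic pruning, achieving category-aware transient removal.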

Details

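Motivation: Transient objects in multi-view captures cause ghosting artifacts in 3D Gaussian Splatting reconstruction. Existing methods rely either on scene decomposition, which is memory-intensive, or on motion-based heuristics, which are vulnerable to parallax ambiguity, so a more robust and efficient solution is needed.

Result: On four sequences from the RobustNeRF benchmark, the method consistently improves reconstruction quality over vanilla 3DGS while keeping memory overhead minimal and preserving real-time rendering. Threshold calibration and baseline comparisons support semantic guidance as a practical strategy in scenarios with predictable distractor categories.

Insight: The novelty lies in using the semantic similarity of a vision-language model (CLIP) for category-aware identification of transient objects, independent of motion patterns, which resolves parallax ambiguity; combining this with opacity regularization and periodic pruning improves reconstruction quality while keeping memory use low and rendering real-time.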

Abstract: Transient objects in casual multi-view captures cause ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction. Existing solutions relied on scene decomposition at significant memory cost or on motion-based heuristics that were vulnerable to parallax ambiguity. A semantic filtering framework was proposed for category-aware transient removal using vision-language models. CLIP similarity scores between rendered views and distractor text prompts were accumulated per-Gaussian across training iterations. Gaussians exceeding a calibrated threshold underwent opacity regularization and periodic pruning. Unlike motion-based approaches, semantic classification resolved parallax ambiguity by identifying object categories independently of motion patterns. Experiments on the RobustNeRF benchmark demonstrated consistent improvement in reconstruction quality over vanilla 3DGS across four sequences, while maintaining minimal memory overhead and real-time rendering performance. Threshold calibration and comparisons with baselines validated semantic guidance as a practical strategy for transient removal in scenarios with predictable distractor categories.
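
The accumulate-then-prune loop described in the abstract can be sketched as follows. This is a simplification under stated assumptions: the view-level similarity is credited to every Gaussian visible in that view, the similarity values are made up rather than produced by CLIP, and the opacity regularization step is omitted. All names are hypothetical.

```python
def accumulate_scores(accumulated, visible_ids, similarity):
    """Add one view's distractor similarity to each Gaussian visible in it."""
    for gid in visible_ids:
        accumulated[gid] = accumulated.get(gid, 0.0) + similarity
    return accumulated


def prune(gaussian_ids, accumulated, threshold):
    """Keep only Gaussians whose accumulated score stays at or below the threshold."""
    return [gid for gid in gaussian_ids if accumulated.get(gid, 0.0) <= threshold]


# Toy training iterations: (ids of Gaussians visible in the view, similarity
# of the rendered view to a distractor prompt such as "a pedestrian").
iterations = [
    ([0, 1], 0.5),
    ([1, 2], 0.5),
    ([1], 0.25),
]
scores = {}
for visible, sim in iterations:
    accumulate_scores(scores, visible, sim)

kept = prune([0, 1, 2, 3], scores, threshold=1.0)
print(scores, kept)
```

Gaussian 1 appears in every distractor-like view and accumulates the highest score, so it is the one removed; Gaussian 3 was never scored and is kept.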


[30] Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs cs.CV

Guangtao Lyu, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang

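TL;DR: This paper proposes PADE, a training-free attention intervention that mitigates hallucinations by using the positive attention dynamics inside large vision-language models to identify semantically core visual regions. The method builds a PAD map, adaptively controls the intervention strength, and applies system-token compensation to strengthen visual grounding and reduce inconsistent outputs.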

Details

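Motivation: Large vision-language models reason well across modalities but remain prone to hallucinations, producing outputs inconsistent with the visual input or the user's instructions. Existing training-free methods (such as contrastive decoding and auxiliary expert models) carry heavy computational overhead and are easily disturbed by the attention sink phenomenon, so a more efficient and robust mitigation method is needed.

Result: Experiments on multiple large vision-language models and benchmarks show that PADE markedly improves visual grounding and reduces hallucinations, validating the use of internal attention dynamics to make multimodal reasoning more reliable.

Insight: The novelty lies in the finding that positive attention dynamics inside the model naturally reveal semantically core visual regions, and in the adaptive intervention strategy built on that finding. Viewed objectively, the method adjusts attention weights dynamically rather than enhancing them statically, which avoids the attention sink problem while keeping responses consistent with complex instructions.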

Abstract: LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free methods, including contrastive decoding and auxiliary expert models, which incur several times more computational overhead and may introduce potential interference, as well as static internal signal enhancement, are often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
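
Median Absolute Deviation scaling is a standard robust normalization, and a per-head version can be sketched as below. How PADE turns the scaled values into an intervention strength is not stated in the abstract, so this only shows the scaling itself; the function name, the `eps` guard, and the toy attention values are assumptions.

```python
from statistics import median


def mad_scale(values, eps=1e-8):
    """Center by the median and divide by the median absolute deviation.

    Unlike mean/std normalization, one extreme value (for example an
    attention sink) barely changes the center or the scale.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [(v - center) / (mad + eps) for v in values]


# Toy per-head attention-change values over five visual tokens; the last
# token of head_0 mimics an outlier.
per_head = {
    "head_0": [1.0, 2.0, 3.0, 4.0, 100.0],
    "head_1": [0.0, 0.5, 1.0, 1.5, 2.0],
}
scaled = {name: mad_scale(vals) for name, vals in per_head.items()}
print({name: [round(v, 2) for v in vals] for name, vals in scaled.items()})
```

Scaling each head separately keeps heads with very different value ranges comparable, which is one plausible reason for a per-head design.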


[31] Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation cs.CV

Marco Salmè, Federico Siciliano, Fabrizio Silvestri, Paolo Soda, Rosa Sicilia

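TL;DR: This paper proposes Concept-Enhanced Multimodal Retrieval-Augmented Generation (CEMRAG), a framework for interpretable and accurate radiology report generation. It decomposes visual representations into interpretable clinical concepts and combines them with multimodal RAG, using enriched contextual prompts to improve both the interpretability and the factual accuracy of generated reports.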

Details

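Motivation: Radiology report generation (RRG) based on vision-language models (VLMs) faces two main obstacles to clinical adoption: a lack of interpretability and a tendency to hallucinate findings that do not match the imaging evidence. Existing work usually treats interpretability and accuracy as separate goals, handled by concept-based explainability techniques and retrieval-augmented generation (RAG) respectively; this paper aims to unify them.

Result: Experiments on the MIMIC-CXR and IU X-Ray datasets cover multiple VLM architectures, training regimes, and retrieval configurations. CEMRAG consistently outperforms both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP metrics.

Insight: The main novelty is a unified framework that combines interpretable visual concept decomposition with multimodal retrieval-augmented generation. It challenges the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise the diagnostic accuracy of medical VLMs. The modular design decomposes interpretability into visual transparency and structured language model conditioning, offering a principled path toward clinically trustworthy AI-assisted radiology.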

Abstract: Radiology Report Generation (RRG) through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical workflows. However, their clinical adoption remains limited by the lack of interpretability and the tendency to hallucinate findings misaligned with imaging evidence. Existing research typically treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency, while Retrieval-Augmented Generation (RAG) methods target factual grounding through external retrieval. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. This approach exploits enriched contextual prompts for RRG, improving both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations demonstrate consistent improvements over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs. Our modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled pathway toward clinically trustworthy AI-assisted radiology.


[32] ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT cs.CV

Hyunchan Moon, Cheonjun Park, Steven L. Waslander

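TL;DR: This paper proposes ToaSt, a framework for efficient compression of Vision Transformers (ViTs). It follows a decoupled strategy: coupled head-wise structured pruning is applied to the Multi-Head Self-Attention modules, and Token Channel Selection (TCS) is introduced for the Feed-Forward Networks (FFNs), reducing computation (FLOPs) while maintaining or improving accuracy.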

Details

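Motivation: ViTs are computationally expensive, which hinders practical deployment. Existing structured weight pruning suffers from long retraining times, and token compression suffers from optimization difficulties caused by global propagation. This paper aims to address both problems and achieve more efficient ViT compression.

Result: ToaSt is evaluated extensively on nine models, including DeiT, ViT-MAE, and Swin Transformer. On ViT-MAE-Huge it reduces FLOPs by 39.4% while raising accuracy by 1.64% to 88.52%. On downstream tasks such as COCO object detection it reaches 52.2 mAP versus the baseline's 51.9 mAP, outperforming existing baselines.

Insight: The main novelty is the decoupled framework design, which applies specialized compression strategies to different ViT components (MHSA and FFN). In particular, Token Channel Selection (TCS), proposed for the FFN (which accounts for most of the computation), raises the compression ratio while avoiding global propagation issues, and the analysis indicates that TCS effectively filters redundant noise during selection. The coupled head-wise structured pruning leverages the characteristics of the attention operation to enhance robustness.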

Abstract: Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.


[33] Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation cs.CV | cs.AIPDF

Shutian Gu, Chengkai Huang, Ruoyu Wang, Lina Yao

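TL;DR: This paper proposes a retrieval-augmented framework that improves the efficiency and stability of LLM-based Vision-and-Language Navigation (VLN). Without modifying or fine-tuning the underlying LLM, it uses two complementary retrieval modules (instruction-level trajectory retrieval and an imitation-learned candidate-direction retriever) to supply task priors and reduce decision noise.

Details

Motivation: Prompt-based LLM navigation makes decisions inefficiently, because the model must reinterpret the instruction from scratch at every step and reason over many noisy, verbose navigable candidates.

Result: On the Room-to-Room (R2R) benchmark, the method yields consistent gains in Success Rate, Oracle Success Rate, and SPL in both seen and unseen environments.

Insight: The contribution is a lightweight, modular retrieval-augmented decision-support framework. Instruction-level embedding retrieval supplies global task priors, and an imitation-learned candidate retriever prunes irrelevant directions before LLM inference. The two are complementary, improving global guidance and step-wise decision efficiency respectively, which makes this an effective and scalable strategy.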

Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits to global guidance and step-wise decision efficiency. These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
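The two retrieval levels described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `top_k_exemplars` and `prune_candidates` are hypothetical, cosine similarity is an assumed choice of similarity measure, and the candidate scores stand in for the output of the imitation-learned retriever, which the abstract does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_exemplars(query_emb, bank_embs, k=3):
    """Episode level: indices of the k stored instruction embeddings
    most similar to the query instruction (assumed: cosine similarity)."""
    sims = [cosine(query_emb, e) for e in bank_embs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order[:k]

def prune_candidates(scores, keep=2):
    """Step level: keep the indices of the highest-scoring navigable
    directions. `scores` is a placeholder for a learned retriever's output."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])

# Toy usage: three stored instructions, four candidate directions.
exemplars = top_k_exemplars([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
kept = prune_candidates([0.1, 0.9, 0.5, 0.7], keep=2)
```

In this sketch the retrieved exemplar indices would select past trajectories to place in the prompt, and only the kept directions would be shown to the LLM at each step.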


[34] RaCo: Ranking and Covariance for Practical Learned Keypoints cs.CV | cs.ROPDF

Abhiram Shenoi, Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys

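TL;DR: RaCo is a lightweight neural network for learning robust, versatile keypoints suited to a range of 3D computer vision tasks. It integrates three components: a repeatable keypoint detector, a differentiable ranker that maximizes matches under a limited keypoint budget, and a covariance estimator that quantifies spatial uncertainty in metric scale. It is trained only on perspective image crops, without covisible image pairs, and achieves strong rotational robustness through extensive data augmentation rather than computationally expensive equivariant architectures.

Details

Motivation: Existing learned keypoint methods require covisible image pairs, can be computationally costly (e.g., equivariant networks), and struggle to estimate keypoint ranking and metric covariance together. RaCo aims to provide a standalone, simple strategy that detects interpretable, repeatable interest points without additional labels.

Result: Evaluated on several challenging datasets, RaCo demonstrates state-of-the-art keypoint repeatability and two-view matching, particularly under large in-plane rotations.

Insight: The contribution is integrating a differentiable ranker and a covariance estimator into the keypoint learning framework, so that keypoint ranking and metric covariance can be estimated independently without extra supervision labels. Achieving rotational robustness through data augmentation alone, rather than a complex network architecture, is an efficient and practical design choice.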

Abstract: This paper introduces RaCo, a lightweight neural network designed to learn robust and versatile keypoints suitable for a variety of 3D computer vision tasks. The model integrates three key components: the repeatable keypoint detector, a differentiable ranker to maximize matches with a limited number of keypoints, and a covariance estimator to quantify spatial uncertainty in metric scale. Trained on perspective image crops only, RaCo operates without the need for covisible image pairs. It achieves strong rotational robustness through extensive data augmentation, even without the use of computationally expensive equivariant network architectures. The method is evaluated on several challenging datasets, where it demonstrates state-of-the-art performance in keypoint repeatability and two-view matching, particularly under large in-plane rotations. Ultimately, RaCo provides an effective and simple strategy to independently estimate keypoint ranking and metric covariance without additional labels, detecting interpretable and repeatable interest points. The code is available at https://github.com/cvg/RaCo.


[35] Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models cs.CV | cs.AIPDF

Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang

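TL;DR: To address the optimization dilemma in multimodal models, where generation and understanding constrain each other, this paper proposes the Reason-Reflect-Refine (R3) framework. R3 recasts single-step generation as a multi-step "generate-understand-regenerate" process, improving generation performance while also strengthening the related understanding ability.

Details

Motivation: In current multimodal models, improving generation often comes at the cost of understanding, and vice versa. The root cause is a competitive conflict between the generation and understanding tasks inside the model.

Result: The R3 framework mitigates the optimization dilemma, producing stronger results on the relevant generation tasks and improving the understanding ability tied to the generation process.

Insight: The contribution is explicitly restructuring single-step generation into a multi-step loop that actively uses the model's understanding ability to reflect and refine during generation, which offers a new direction for designing next-generation unified multimodal models.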

Abstract: Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify its likely primary cause as a conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This innovative algorithm re-frames the single-step generation task into a multi-step process of “generate-understand-regenerate”. By explicitly leveraging the model’s understanding capability during generation, we successfully mitigate the optimization dilemma, achieving stronger generation results and improved understanding ability related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.


[36] NeRFscopy: Neural Radiance Fields for in-vivo Time-Varying Tissues from Endoscopy cs.CVPDF

Laura Salort-Benejam, Antonio Agudo

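TL;DR: This paper proposes NeRFscopy, a self-supervised neural rendering pipeline for novel view synthesis and 3D reconstruction of deformable tissue from monocular endoscopic video. It models tissue dynamics by combining a canonical radiance field with a time-varying deformation field based on SE(3) transformations, and introduces carefully designed loss terms so that the 3D implicit model is learned from data alone, without a template or pre-trained model.

Details

Motivation: Endoscopy is essential in medical imaging, but existing methods struggle to robustly reconstruct dynamic tissue in 3D from monocular video. The main challenges are tissue deformation, the monocular camera, illumination changes, occlusions, and unknown camera trajectories.

Result: NeRFscopy achieves accurate novel view synthesis and outperforms competing methods across a variety of challenging endoscopy scenes.

Insight: The contributions are: 1) decomposing the deformable model into a canonical radiance field and an SE(3)-based time-varying deformation field, which effectively models dynamic tissue; 2) exploiting the color images through carefully designed loss terms, enabling self-supervised learning without a template or pre-training; 3) a design tailored to endoscopic scenes that addresses challenges specific to medical imagery.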

Abstract: Endoscopy is essential in medical imaging, used for diagnosis, prognosis and treatment. Developing a robust dynamic 3D reconstruction pipeline for endoscopic videos could enhance visualization, improve diagnostic accuracy, aid in treatment planning, and guide surgery procedures. However, challenges arise due to the deformable nature of the tissues, the use of monocular cameras, illumination changes, occlusions and unknown camera trajectories. Inspired by neural rendering, we introduce NeRFscopy, a self-supervised pipeline for novel view synthesis and 3D reconstruction of deformable endoscopic tissues from a monocular video. NeRFscopy includes a deformable model with a canonical radiance field and a time-dependent deformation field parameterized by SE(3) transformations. In addition, the color images are efficiently exploited by introducing sophisticated terms to learn a 3D implicit model without assuming any template or pre-trained model, solely from data. NeRFscopy achieves accurate results in terms of novel view synthesis, outperforming competing methods across various challenging endoscopy scenes.


[37] Meteorological data and Sky Images meets Neural Models for Photovoltaic Power Forecasting cs.CVPDF

Ines Montoya-Espinagosa, Antonio Agudo

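TL;DR: This paper proposes a multimodal hybrid approach that combines sky images, photovoltaic power history, and meteorological data for short- and long-term photovoltaic power forecasting. It aims to improve forecast accuracy, increase robustness in cloudy conditions, and extend forecasting capability to support efficient grid operation.

Details

Motivation: Weather changes cause photovoltaic power output to fluctuate. The work aims in particular to improve the prediction of ramp events, increase forecast robustness in cloudy conditions, and go beyond nowcasting to support longer-term grid management.

Result: Experiments show that adding meteorological data (especially surface long-wave radiation downwards, and the combination of wind and solar position) significantly improves both nowcasting and longer-term forecasting accuracy, particularly in cloudy conditions.

Insight: The contribution is combining multimodal data fusion (sky images, power history, and meteorological variables) with deep neural models. The study emphasizes that integrating diverse data sources improves the reliability and interpretability of solar forecasting models, offering a hybrid approach that other renewable-energy forecasting work can draw on.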

Abstract: With the growing use of renewable energy, and solar energy in particular, as an alternative to traditional sources, there is increasing interest in how different methodologies can address photovoltaic forecasting given the variability of photovoltaic energy production. This work develops a hybrid approach for short and long-term forecasting based on two studies with the same purpose. A multimodal approach that combines images of the sky and photovoltaic energy history with meteorological data is proposed. The main goal is to improve the accuracy of ramp event prediction, increase the robustness of forecasts in cloudy conditions, and extend capabilities beyond nowcasting, to support more efficient operation of the power grid and better management of solar variability. Deep neural models are used for both nowcasting and forecasting solutions, incorporating individual and multiple meteorological variables, as well as an analytical solar position. The results demonstrate that the inclusion of meteorological data, particularly surface long-wave radiation downwards and the combination of wind and solar position, significantly improves current predictions in both nowcasting and forecasting tasks, especially on cloudy days. This study highlights the importance of integrating diverse data sources to improve the reliability and interpretability of solar energy prediction models.


[38] Context-aware Skin Cancer Epithelial Cell Classification with Scalable Graph Transformers cs.CVPDF

Lucas Sancéré, Noémie Moreau, Katarzyna Bozek

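TL;DR: This paper proposes a method based on scalable Graph Transformers for classifying healthy versus tumor epithelial cells in whole-slide images (WSIs) of cutaneous squamous cell carcinoma (cSCC). By building a full-WSI cell graph that preserves tissue-level context, the method achieves higher balanced accuracy than image-based deep learning methods in both single-WSI and multi-WSI training settings.

Details

Motivation: WSIs are very large and have complex cellular organization, so existing deep learning methods based on convolutional neural networks and Vision Transformers typically rely on patch-based representations, which lose vital tissue-level context.

Result: In 3-fold cross-validation on a single WSI, the Graph Transformer models SGFormer and DIFFormer reached balanced accuracies of 85.2 ± 1.5 and 85.1 ± 2.5 respectively, compared with 81.2 ± 3.0 for the best image-based method. In the multi-patient setting with several WSIs, DIFFormer reached a balanced accuracy of 83.6 ± 1.9, ahead of the state-of-the-art image-based model CellViT256 at 78.1 ± 0.5.

Insight: The contribution is modeling a whole-slide image as a cell graph and applying scalable Graph Transformers to integrate contextual relationships between cells, which handles the classification of morphologically similar cells more effectively. Viewed objectively, the core insight is that node features combining cell morphology, texture, and the class labels of non-epithelial cells work best, underlining the importance of the surrounding cellular context for separating highly similar cell types.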

Abstract: Whole-slide images (WSIs) from cancer patients contain rich information that can be used for medical diagnosis or to follow treatment progress. To automate their analysis, numerous deep learning methods based on convolutional neural networks and Vision Transformers have been developed and have achieved strong performance in segmentation and classification tasks. However, due to the large size and complex cellular organization of WSIs, these models rely on patch-based representations, losing vital tissue-level context. We propose using scalable Graph Transformers on a full-WSI cell graph for classification. We evaluate this methodology on a challenging task: the classification of healthy versus tumor epithelial cells in cutaneous squamous cell carcinoma (cSCC), where both cell types exhibit very similar morphologies and are therefore difficult to differentiate for image-based approaches. We first compared image-based and graph-based methods on a single WSI. Graph Transformer models SGFormer and DIFFormer achieved balanced accuracies of $85.2 \pm 1.5$ ($\pm$ standard error) and $85.1 \pm 2.5$ in 3-fold cross-validation, respectively, whereas the best image-based method reached $81.2 \pm 3.0$. By evaluating several node feature configurations, we found that the most informative representation combined morphological and texture features as well as the cell classes of non-epithelial cells, highlighting the importance of the surrounding cellular context. We then extended our work to train on several WSIs from several patients. To address the computational constraints of image-based models, we extracted four $2560 \times 2560$ pixel patches from each image and converted them into graphs. In this setting, DIFFormer achieved a balanced accuracy of $83.6 \pm 1.9$ (3-fold cross-validation), while the state-of-the-art image-based model CellViT256 reached $78.1 \pm 0.5$.


[39] VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation cs.CVPDF

Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba

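TL;DR: This paper proposes VideoSketcher, a data-efficient method for generating sequential sketches. It fine-tunes a pretrained text-to-video diffusion model and treats sketch generation as a video of strokes being drawn progressively. A large language model handles semantic planning and stroke ordering, while the video diffusion model acts as a renderer that produces high-quality, temporally coherent visuals.

Details

Motivation: Most existing generative models treat sketches as static images and ignore the temporal structure of the creative drawing process. This paper addresses sequential sketch generation, capturing the meaningful stroke order in which a sketch is drawn.

Result: Despite using very little human-drawn sketching-process data (as few as seven examples), the method generates high-quality sequential sketches that closely follow the text-specified stroke order and show rich visual detail.

Insight: The core idea is to represent a sketch as a short video in which strokes are drawn progressively on a blank canvas, and to exploit the complementary strengths of LLMs and video diffusion models. A two-stage fine-tuning strategy decouples learning stroke order from learning sketch appearance: the former is trained on synthetic shape compositions and the latter on a small amount of real sketch data, yielding data-efficient, high-quality generation. The method also shows flexibility through extensions such as brush style conditioning and autoregressive generation.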

Abstract: Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.


cs.CY [Back]

[40] FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health cs.CY | cs.CLPDF

Victor De Lima, Jiqun Liu, Grace Hui Yang

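TL;DR: This paper presents FrameRef, a large-scale dataset of more than one million systematically reframed claims across five framing dimensions (authoritative, consensus, emotional, prestige, and sensationalist), together with a simulation-based framework for studying how sequential information exposure and reinforcement dynamics in ranking and recommendation systems affect information health.

Details

Motivation: Ranking and personalization policies in modern search and recommendation systems play a central role in users' exposure to adverse digital experiences and in the long-term effects of that exposure. To study these effects in a controlled setting, the paper provides a dataset and a simulation testbed.

Result: Using Monte Carlo trajectory sampling, the study shows that small, systematic shifts in acceptance and confidence compound over time and lead to substantial divergence in cumulative information health trajectories. Human evaluation further confirms that the framings generated for FrameRef measurably affect human judgment.

Insight: The contribution is a large-scale, multi-dimensional framing dataset and a simulation framework built on framing-sensitive agent personas (constructed by fine-tuning language models with framing-conditioned loss attenuation), which lays a foundation for systematic, simulation-based information health research.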

Abstract: Information ecosystems increasingly shape how people internalize exposure to adverse digital experiences, raising concerns about the long-term consequences for information health. In modern search and recommendation systems, ranking and personalization policies play a central role in shaping such exposure and its long-term effects on users. To study these effects in a controlled setting, we present FrameRef, a large-scale dataset of 1,073,740 systematically reframed claims across five framing dimensions: authoritative, consensus, emotional, prestige, and sensationalist, and propose a simulation-based framework for modeling sequential information exposure and reinforcement dynamics characteristic of ranking and recommendation systems. Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence. Using Monte Carlo trajectory sampling, we show that small, systematic shifts in acceptance and confidence can compound over time, producing substantial divergence in cumulative information health trajectories. Human evaluation further confirms that FrameRef’s generated framings measurably affect human judgment. Together, our dataset and framework provide a foundation for systematic information health research through simulation, complementing and informing responsible human-centered research. We release FrameRef, code, documentation, human evaluation data, and persona adapter models at https://github.com/infosenselab/frameref.


cs.RO [Back]

[41] Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation cs.RO | cs.CV | cs.LGPDF

Yuxuan Kuang, Sungjae Park, Katerina Fragkiadaki, Shubham Tulsiani

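TL;DR: Dex4D is a framework for dexterous manipulation. It learns in simulation a task-agnostic policy conditioned on 3D point tracks that can manipulate any object to any desired pose. The policy transfers zero-shot to real-world tasks without fine-tuning, prompted only by object-centric point tracks extracted from generated videos.

Details

Motivation: Learning generalist policies that can accomplish many everyday tasks is an open challenge in dexterous manipulation. Collecting large-scale data through real-world teleoperation is expensive and hard to scale, and while learning in simulation is feasible, designing many task-specific environments and rewards is similarly challenging.

Result: Extensive experiments in simulation and on real robots show that the method enables zero-shot deployment on diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. It also generalizes strongly to novel objects, scene layouts, backgrounds, and trajectories.

Insight: The contribution is a task-agnostic "Anypose-to-Anypose" policy conditioned on 3D point tracks. Training in simulation across a large, diverse set of objects and pose configurations enables zero-shot real-world transfer and closed-loop perception and control, highlighting the robustness and scalability of the framework.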

Abstract: Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a feasible alternative, designing multiple task-specific environments and rewards for training is similarly challenging. We propose Dex4D, a framework that instead leverages simulation for learning task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic 3D point track conditioned policy capable of manipulating any object to any desired pose. We train this ‘Anypose-to-Anypose’ policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, this policy can be zero-shot transferred to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment for diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.


cs.IR [Back]

[42] Automatic Funny Scene Extraction from Long-form Cinematic Videos cs.IR | cs.CVPDF

Sibendu Paul, Haotian Jiang, Caren Chen

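TL;DR: This paper presents an end-to-end system for automatically identifying and ranking humorous scenes in long-form cinematic videos, combining shot detection, multimodal scene localization, and humor tagging optimized for cinematic content.

Details

Motivation: Automatically extracting high-quality humorous scenes from long-form cinematic titles is challenging, and solving it would boost user engagement on streaming platforms and make content creation more efficient.

Result: On the OVSD dataset, scene detection improves by 18.3% AP over the prior state of the art. Humor detection reaches an F1 score of 0.834 on long text. In an evaluation across five cinematic titles, 87% of the extracted clips were judged to be intended as funny and 98% of scenes were accurately localized.

Insight: The contributions include a scene segmentation approach that combines visual and textual cues, improved shot representations via guided triplet mining, and a multimodal humor tagging framework that uses both audio and text. The approach generalizes to formats such as trailers and can improve content generation workflows.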

Abstract: Automatically extracting engaging and high-quality humorous scenes from cinematic titles is pivotal for creating captivating video previews and snackable content, boosting user engagement on streaming platforms. Long-form cinematic titles, with their extended duration and complex narratives, challenge scene localization, while humor’s reliance on diverse modalities and its nuanced style add further complexity. This paper introduces an end-to-end system for automatically identifying and ranking humorous scenes from long-form cinematic titles, featuring shot detection, multimodal scene localization, and humor tagging optimized for cinematic content. Key innovations include a novel scene segmentation approach combining visual and textual cues, improved shot representations via guided triplet mining, and a multimodal humor tagging framework leveraging both audio and text. Our system achieves an 18.3% AP improvement over state-of-the-art scene detection on the OVSD dataset and an F1 score of 0.834 for detecting humor in long text. Extensive evaluations across five cinematic titles demonstrate 87% of clips extracted by our pipeline are intended to be funny, while 98% of scenes are accurately localized. With successful generalization to trailers, these results showcase the pipeline’s potential to enhance content creation workflows, improve user engagement, and streamline snackable content generation for diverse cinematic media formats.


cs.CR [Back]

[43] Weight space Detection of Backdoors in LoRA Adapters cs.CR | cs.AI | cs.CL | cs.LGPDF

David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li

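TL;DR: This paper proposes detecting backdoors in LoRA adapters directly in weight space, without running the model or knowing the trigger. It identifies anomalous adapters by analyzing statistics of the weight matrices, such as singular value concentration, entropy, and distribution shape.

Details

Motivation: LoRA adapters shared through open repositories are vulnerable to backdoor attacks. Existing detection methods need test input data and the trigger is unknown, so they do not scale to large-scale screening. A data-agnostic detection method is therefore needed.

Result: Evaluated on 500 LoRA adapters for Llama-3.2-3B (400 clean, 100 poisoned) spanning instruction and reasoning datasets including Alpaca, Dolly, and GSM8K, the method achieves 97% detection accuracy with a false positive rate below 2%.

Insight: The contribution is detecting backdoors directly from weight statistics (such as the singular value distribution) without model inference, enabling efficient, data-agnostic adapter screening. The idea may extend to security assessment of other parameter-efficient fine-tuning methods.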

Abstract: LoRA adapters let users fine-tune large language models (LLMs) efficiently. However, LoRA adapters are shared through open repositories such as Hugging Face Hub, making them vulnerable to backdoor attacks. Current detection methods require running the model with test input data, which makes them impractical for screening thousands of adapters where the trigger for backdoor behavior is unknown. We detect poisoned adapters by analyzing their weight matrices directly, without running the model, which makes our method data-agnostic. Our method extracts simple statistics (how concentrated the singular values are, their entropy, and the distribution shape) and flags adapters that deviate from normal patterns. We evaluate the method on 500 LoRA adapters (400 clean and 100 poisoned) for Llama-3.2-3B on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE. We achieve 97% detection accuracy with less than 2% false positives.
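To make the weight-space statistics concrete, here is a minimal NumPy sketch under stated assumptions. It is not the paper's pipeline: the function names, the use of excess kurtosis as the "distribution shape" statistic, and the z-score flagging rule are all illustrative choices that the abstract does not specify. It assumes the usual LoRA factorization, where the update is the product of a low-rank pair `B @ A`.

```python
import numpy as np

def spectral_stats(A, B):
    """Statistics of the LoRA update W = B @ A (A: r x d_in, B: d_out x r).
    Assumes W is not the all-zero matrix."""
    W = B @ A
    s = np.linalg.svd(W, compute_uv=False)      # singular values, descending
    s = s[s > 1e-12 * s.max()]                  # drop numerically zero values
    p = s / s.sum()                             # normalize to a distribution
    x = (W - W.mean()) / W.std()                # standardized weight entries
    return {
        "top1_concentration": float(p[0]),          # mass on the largest value
        "entropy": float(-(p * np.log(p)).sum()),   # spectral entropy
        "excess_kurtosis": float((x ** 4).mean() - 3.0),  # assumed shape stat
    }

def flag_outlier(value, clean_values, z=3.0):
    """Flag a statistic that lies more than z standard deviations away
    from a reference set of clean adapters (hypothetical decision rule)."""
    mu, sigma = np.mean(clean_values), np.std(clean_values)
    return bool(abs(value - mu) / (sigma + 1e-12) > z)

# Toy usage: a rank-1 update puts all spectral mass on one singular value.
A = np.array([[1.0, 2.0, 3.0]])
B = np.array([[1.0], [2.0]])
stats = spectral_stats(A, B)
```

A screening tool along these lines would compute the statistics for each adapter layer and compare them against a reference population of known-clean adapters, with no forward pass through the model.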


cs.LG [Back]

[44] Seeing to Generalize: How Visual Data Corrects Binding Shortcuts cs.LG | cs.CLPDF

Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel

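TL;DR: This paper finds that Vision Language Models (VLMs) can outperform their underlying Large Language Models (LLMs) on text-only tasks, particularly long-context information retrieval. Using a controlled synthetic retrieval task, the authors find that a transformer trained only on text is perfect in distribution (ID) but fails to generalize out of distribution (OOD), whereas subsequent training on an image-tokenized version of the task nearly doubles its text-only OOD performance. Mechanistic interpretability analysis shows that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, while image-based training disrupts them through spatial translation invariance, forcing a more robust symbolic binding mechanism that persists after text-only examples are reintroduced. The authors also characterize how binding strategies vary with training regime, visual encoder, and initialization, and show that analogous shifts occur in pretrained LLM-to-VLM transitions.

Details

Motivation: The work investigates a surprising phenomenon: why VLMs can outperform their underlying LLMs on text-only tasks, especially long-context information retrieval. The goal is to understand how cross-modal training affects generalization on single-modality tasks.

Result: On the controlled synthetic retrieval task, the text-only model achieves perfect ID accuracy but fails OOD, and subsequent training on the image-tokenized task nearly doubles text-only OOD performance. Mechanistic analysis reveals the shift in binding strategy and shows a similar pattern in pretrained LLM-to-VLM transitions.

Insight: The claimed contribution is showing that visual training, by introducing spatial translation invariance, breaks the positional-shortcut binding formed during text training and forces the model to learn a more robust symbolic binding mechanism that continues to improve generalization on text-only tasks. Viewed objectively, the study offers a new mechanistic explanation and experimental evidence for how cross-modal training improves reasoning and generalization on single-modality tasks, which is relevant to model design and training strategy.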

Abstract: Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model’s internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic binding mechanism that persists even after text-only examples are reintroduced. We further characterize how binding strategies vary across training regimes, visual encoders, and initializations, and show that analogous shifts occur during pretrained LLM-to-VLM transitions. Our findings suggest that cross-modal training can enhance reasoning and generalization even for tasks grounded in a single modality.


[45] Prescriptive Scaling Reveals the Evolution of Language Model Capabilities cs.LG | cs.AI | cs.CL | stat.MLPDF

Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade

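TL;DR: This paper proposes a prescriptive scaling method, based on large-scale observational evaluations, for estimating the downstream capability boundary of language models at a given pre-training compute budget, and validates its stability over time. Using smoothed quantile regression with a monotone, saturating sigmoid parameterization, it quantifies how the capability boundary changes with pre-training FLOPs and finds that the boundary is stable over time for most tasks, with math reasoning as the exception.

Details

Motivation: Practitioners deploying foundation models need prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is reliably attainable with contemporary post-training practice, and how stable is that mapping as the field evolves?

Result: In a large-scale evaluation using 5k observational and 2k newly sampled data points, the estimated capability boundaries are mostly stable across tasks, but math reasoning shows a boundary that keeps advancing over time. The proposed efficient algorithm recovers near-full-data frontiers with roughly 20% of the evaluation budget.

Insight: The contributions are: 1) estimating capability boundaries with smoothed quantile regression and a monotone, saturating sigmoid parameterization; 2) validating the temporal reliability of the boundaries; 3) extending the analysis to task-dependent saturation and contamination-related shifts; 4) an efficient algorithm that lowers evaluation cost. The work also releases Proteus 2k, an up-to-date model performance evaluation dataset, and offers a practical method for turning compute budgets into reliable performance expectations and for monitoring how capability boundaries shift over time.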

Abstract: For deploying foundation models, practitioners increasingly need prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations with 5k observational and 2k newly sampled data on model performance, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate the temporal reliability by fitting on earlier model generations and evaluating on later releases. Across various tasks, the estimated boundaries are mostly stable, with the exception of math reasoning, which exhibits a consistently advancing boundary over time. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget. Together, our work releases Proteus 2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift across time.
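The idea of fitting a high conditional quantile with a saturating sigmoid can be sketched as follows. This is an illustration under assumptions, not the paper's estimator: it uses the plain (unsmoothed) pinball loss rather than the smoothed variant the abstract mentions, and the parameter names `lo`, `hi`, `slope`, and `mid` and the quantile level 0.95 are hypothetical.

```python
import math

def sigmoid_frontier(log_flops, lo, hi, slope, mid):
    """Monotone, saturating curve: rises from lo toward hi as compute
    grows (monotone increasing when slope > 0 and hi > lo)."""
    return lo + (hi - lo) / (1.0 + math.exp(-slope * (log_flops - mid)))

def pinball_loss(y, y_hat, tau=0.95):
    """Quantile (pinball) loss. Under-prediction is weighted by tau and
    over-prediction by 1 - tau, so minimizing it targets the tau-quantile."""
    diff = y - y_hat
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def mean_pinball(params, data, tau=0.95):
    """Average pinball loss of the sigmoid frontier over (log_flops, score)
    pairs; this is the objective a generic optimizer would minimize."""
    lo, hi, slope, mid = params
    losses = [pinball_loss(y, sigmoid_frontier(x, lo, hi, slope, mid), tau)
              for x, y in data]
    return sum(losses) / len(losses)

# Toy usage: made-up (log10 FLOPs, benchmark score) pairs.
toy_data = [(20.0, 0.25), (22.0, 0.45), (24.0, 0.70)]
objective = mean_pinball((0.2, 0.8, 1.0, 22.0), toy_data)
```

With a high quantile level such as 0.95, points above the curve are penalized far more than points below it, which pushes the fitted curve toward the upper envelope of observed scores at each compute level.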


[46] GLM-5: from Vibe Coding to Agentic Engineering cs.LG | cs.CLPDF

GLM-5 Team, Aohan Zeng, Xin Lv, Zhenyu Hou

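TL;DR: GLM-5 is a next-generation foundation model designed to move the programming paradigm from "vibe coding" to "agentic engineering". It adopts DSA to reduce training and inference costs, and uses a new asynchronous reinforcement learning infrastructure and algorithms to improve model alignment, autonomy, and the ability to learn from complex, long-horizon interactions.

Details

Motivation: Existing models fall short on real-world end-to-end software engineering tasks. GLM-5 aims to push the shift from intuition-driven programming toward a systematic, agent-driven engineering paradigm.

Result: GLM-5 achieves state-of-the-art performance on major open benchmarks, and on real-world coding tasks involving end-to-end software engineering challenges it surpasses previous baselines.

Insight: The core contributions are using DSA to lower the cost of large models while maintaining long-context fidelity, and an asynchronous reinforcement learning framework that decouples generation from training to improve alignment and autonomy efficiently. Together these suggest a path toward more practical and more economical AI agent engineering systems.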

Abstract: We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.


[47] On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks cs.LG | cs.CVPDF

Yannic Neuhaus, Nicolas Flammarion, Matthias Hein, Francesco Croce

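TL;DR: This paper studies how well reasoning in multimodal large language models generalizes on simple visual planning tasks, with a focus on out-of-distribution (OOD) generalization. The authors present an evaluation framework built on a grid-based navigation task, fine-tune model variants with different input representations (visual and textual) and chain-of-thought (CoT) reasoning strategies, and systematically evaluate them under in-distribution (ID) and OOD test conditions. CoT reasoning improves ID generalization, but OOD generalization (e.g., to larger maps) remains very limited in most cases. Reasoning traces that combine multiple text formats give the best non-trivial OOD generalization, and purely text-based models consistently outperform models that take image inputs.

Details

Motivation: Although large language models and multimodal large language models have made notable progress in reasoning, their generalization, especially OOD generalization, is still vaguely defined and poorly understood. The paper uses a simple planning task to rigorously evaluate how well CoT approaches generalize.

Result: In the grid-based navigation experiments, CoT reasoning improves ID generalization for all input representations, but OOD generalization (e.g., to larger maps) remains very limited once trivial matches with the ID data are controlled for. Reasoning traces that combine multiple text formats yield the best non-trivial OOD generalization, and purely text-based models consistently outperform image-input models, including a recently proposed approach that relies on latent space reasoning.

Insight: The contribution is a systematic evaluation framework for quantifying the OOD generalization of reasoning models, and the finding that multi-format textual reasoning helps generalization. Viewed objectively, the results expose the limits of OOD generalization in current multimodal models on visual planning tasks and suggest that textual representations may be more robust than visual ones for reasoning, pointing to directions for improving generalization.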

Abstract: Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-distribution generalization (e.g., to larger maps) remains very limited in most cases when controlling for trivial matches with the ID data. Surprisingly, we find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Finally, purely text-based models consistently outperform those utilizing image-based inputs, including a recently proposed approach relying on latent space reasoning.


[48] Guided Diffusion by Optimized Loss Functions on Relaxed Parameters for Inverse Material Design cs.LG | cs.CE | cs.CVPDF

Jens U. Kreber, Christian Weißenfels, Joerg Stueckler

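TL;DR: This paper proposes a novel diffusion-based method for inverse design problems in engineering and materials science. It relaxes the original design space into a continuous grid representation, computes gradients through a differentiable simulation, trains a diffusion model as a prior, generates diverse designs that meet the target properties via guided diffusion sampling, and finally projects them back into the original parameter space.

Details

Motivation: In inverse design, several design parameters can produce the same or similar output values, so multi-modal probabilistic approaches are needed to obtain diverse solutions. Discrete parameters or constraints in the design space, however, prevent direct gradient-based optimization, which calls for a method that handles such structure and still generates diverse, feasible designs.

Result: On a composite material design problem where the forward process is modeled as a linear finite element problem, the method generates diverse designs within 1% relative error for medium to high target bulk moduli in both 2D and 3D settings. A multi-objective loss function can additionally minimize the material density of the generated samples.

Insight: The contributions are: relaxing a discrete design space into a continuous representation so that gradients can be computed; using a diffusion model as a prior and guiding diffusion sampling with gradients from the differentiable simulation to obtain diverse inverse designs; and extending the method to multi-objective optimization, such as simultaneously optimizing material density.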

Abstract: Inverse design problems are common in engineering and materials science. The forward direction, i.e., computing output quantities from design parameters, typically requires running a numerical simulation, such as a FEM, as an intermediate step, which is an optimization problem by itself. In many scenarios, several design parameters can lead to the same or similar output values. For such cases, multi-modal probabilistic approaches are advantageous to obtain diverse solutions. A major difficulty in inverse design stems from the structure of the design space, since discrete parameters or further constraints disallow the direct use of gradient-based optimization. To tackle this problem, we propose a novel inverse design method based on diffusion models. Our approach relaxes the original design space into a continuous grid representation, where gradients can be computed by implicit differentiation in the forward simulation. A diffusion model is trained on this relaxed parameter space in order to serve as a prior for plausible relaxed designs. Parameters are sampled by guided diffusion using gradients that are propagated from an objective function specified at inference time through the differentiable simulation. A design sample is obtained by backprojection into the original parameter space. We develop our approach for a composite material design problem where the forward process is modeled as a linear FEM problem. We evaluate the performance of our approach in finding designs that match a specified bulk modulus. We demonstrate that our method can propose diverse designs within 1% relative error margin from medium to high target bulk moduli in 2D and 3D settings. We also demonstrate that the material density of generated samples can be minimized simultaneously by using a multi-objective loss function.


cs.AI

[49] Protecting Language Models Against Unauthorized Distillation through Trace Rewriting cs.AI | cs.CL

Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik

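TL;DR: This paper studies how dynamically rewriting the reasoning traces generated by a teacher model can prevent unauthorized knowledge distillation. The aim is to achieve both an anti-distillation effect and API watermark embedding, thereby protecting the intellectual property of large language models.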

Details

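Motivation: Unauthorized knowledge distillation takes unfair advantage of the work behind frontier models; the goal is to protect the effort and cost invested by model developers.

Result: Experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance, and supports highly reliable watermark detection with essentially no false alarms.

Insight: The novelty lies in dynamically rewriting reasoning traces to achieve anti-distillation and watermark embedding at the same time, using either the rewriting capabilities of LLMs or gradient-based techniques, while preserving answer correctness and semantic coherence.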

Abstract: Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher’s reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.


[50] RUVA: Personalized Transparent On-Device Graph Reasoning cs.AI | cs.CL

Gabriele Conte, Alessio Mattiace, Gianni Carmosino, Potito Aghilar, Giovanni Servedio

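TL;DR: RUVA is a new “Glass Box” architecture that addresses the “Black Box” problem of vector-based retrieval-augmented generation (RAG) in current personal AI. It moves the knowledge foundation of personal AI from a vector database to a personal knowledge graph, letting users inspect and precisely edit the AI's memory, which ensures data interpretability and the “Right to be Forgotten”.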

Details

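Motivation: Mainstream personal AI systems (such as RAG built on vector databases) lack interpretability and accountability: when the AI hallucinates or retrieves sensitive data, the user cannot trace the cause or correct the error. Moreover, “deleting” from a vector space is mathematically imprecise and leaves behind probabilistic “ghost” data that violate true privacy.

Result: The paper presents a prototype of the RUVA architecture and provides a project website and demo video, but the abstract does not report specific quantitative experiments or benchmark results.

Insight: The core novelty is shifting the knowledge representation of personal AI from “vector matching” to “graph reasoning” and introducing a human-in-the-loop memory curation mechanism. This makes the AI's memory inspectable and editable, offering a new technical path toward transparent, controllable, and privacy-preserving personalized AI systems.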

Abstract: The Personal AI landscape is currently dominated by “Black Box” Retrieval-Augmented Generation. While standard vector databases offer statistical matching, they suffer from a fundamental lack of accountability: when an AI hallucinates or retrieves sensitive data, the user cannot inspect the cause nor correct the error. Worse, “deleting” a concept from a vector space is mathematically imprecise, leaving behind probabilistic “ghosts” that violate true privacy. We propose Ruva, the first “Glass Box” architecture designed for Human-in-the-Loop Memory Curation. Ruva grounds Personal AI in a Personal Knowledge Graph, enabling users to inspect what the AI knows and to perform precise redaction of specific facts. By shifting the paradigm from Vector Matching to Graph Reasoning, Ruva ensures the “Right to be Forgotten.” Users are the editors of their own lives; Ruva hands them the pen. The project and the demo video are available at http://sisinf00.poliba.it/ruva/.
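
The contrast between approximate deletion in a vector space and precise redaction in a graph can be shown with a small sketch. This is a toy triple store, not Ruva's implementation; the class and method names (`Fact`, `PersonalGraph`, `inspect`, `redact`) are hypothetical and chosen only to mirror the abstract's vocabulary.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One explicit edge of the graph: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


@dataclass
class PersonalGraph:
    """Toy personal knowledge graph stored as a set of explicit facts."""
    facts: set = field(default_factory=set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.facts.add(Fact(subject, relation, obj))

    def inspect(self, subject: str) -> list:
        """Return every fact stored about a subject, in a stable order."""
        matching = (f for f in self.facts if f.subject == subject)
        return sorted(matching, key=lambda f: (f.relation, f.obj))

    def redact(self, subject: str, relation: str, obj: str) -> bool:
        """Remove exactly one fact; report whether it was present."""
        fact = Fact(subject, relation, obj)
        if fact in self.facts:
            self.facts.remove(fact)
            return True
        return False
```

Because each fact is a discrete record, a redaction either removes it entirely or leaves the store unchanged, and the result can be checked by calling `inspect` again.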


[51] Recursive Concept Evolution for Compositional Reasoning in Large Language Models cs.AI | cs.CL | cs.LG

Sarim Chaudhry

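TL;DR: This paper proposes Recursive Concept Evolution (RCE), a framework that lets pretrained large language models dynamically modify their internal representation geometry during inference, addressing the performance drop on compositional reasoning tasks. The method detects representational inadequacy, spawns low-rank concept subspaces, selects and merges them, and consolidates them via constrained optimization, thereby constructing new abstract representations.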

Details

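Motivation: Large language models perform well on complex reasoning tasks, but their accuracy drops sharply on benchmarks that require compositional reasoning, such as ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search but leave the model's latent representation space fixed, so performance collapses when the required abstraction is not already encoded.

Result: Integrating RCE with Mistral-7B yields notable improvements on compositional reasoning benchmarks: gains of 12-18 points on ARC-AGI-2, improvements of 8-14 points on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.

Insight: The novelty is allowing the model to adjust its representation space at inference time rather than only recombining existing abstractions. Subspaces are selected through a minimum description length criterion, synergistic concepts are merged, and constrained optimization preserves stability, which offers a new route to stronger compositional reasoning.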

Abstract: Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or reinforcement learning, but they leave the model’s latent representation space fixed. When the required abstraction is not already encoded in this space, performance collapses. We propose Recursive Concept Evolution (RCE), a framework that enables pretrained language models to modify their internal representation geometry during inference. RCE introduces dynamically generated low-rank concept subspaces that are spawned when representational inadequacy is detected, selected through a minimum description length criterion, merged when synergistic, and consolidated via constrained optimization to preserve stability. This process allows the model to construct new abstractions rather than recombining existing ones. We integrate RCE with Mistral-7B and evaluate it across compositional reasoning benchmarks. RCE yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.


[52] CARE Drive A Framework for Evaluating Reason-Responsiveness of Vision Language Models in Automated Driving cs.AI | cs.CV

Lucas Elbert Suryana, Farah Bierenga, Sanne van Buuren, Pepijn Kooij, Elsefien Tulleners

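TL;DR: This paper proposes the CARE Drive framework for evaluating the “reason-responsiveness” of vision language models (VLMs) in automated driving, that is, whether model decisions are based on human-relevant considerations (such as safety and efficiency) rather than post hoc rationalization. The framework compares the decisions of a baseline model and a reason-augmented model under controlled contextual variation to assess the causal influence of human reasons on model behavior.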

Details

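Motivation: Existing evaluation methods for automated driving focus mainly on outcome performance (such as safety and trajectory accuracy) and cannot determine whether model decisions genuinely reflect human-relevant considerations, which can create false confidence in safety-critical domains.

Result: In a cyclist overtaking demonstration, explicit human reasons (such as safety margins and social pressure) significantly influence model decisions and bring them closer to expert-recommended behavior, but responsiveness differs across types of reasons.

Insight: The novelty is a model-agnostic framework that requires no changes to model parameters and uses systematic contextual perturbation to quantify the “reason-responsiveness” of foundation models, providing an empirical method for understanding the causal mechanisms behind model decisions.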

Abstract: Foundation models, including vision language models, are increasingly used in automated driving to interpret scenes, recommend actions, and generate natural language explanations. However, existing evaluation methods primarily assess outcome based performance, such as safety and trajectory accuracy, without determining whether model decisions reflect human relevant considerations. As a result, it remains unclear whether explanations produced by such models correspond to genuine reason responsive decision making or merely post hoc rationalizations. This limitation is especially significant in safety critical domains because it can create false confidence. To address this gap, we propose CARE Drive, Context Aware Reasons Evaluation for Driving, a model agnostic framework for evaluating reason responsiveness in vision language models applied to automated driving. CARE Drive compares baseline and reason augmented model decisions under controlled contextual variation to assess whether human reasons causally influence decision behavior. The framework employs a two stage evaluation process. Prompt calibration ensures stable outputs. Systematic contextual perturbation then measures decision sensitivity to human reasons such as safety margins, social pressure, and efficiency constraints. We demonstrate CARE Drive in a cyclist overtaking scenario involving competing normative considerations. Results show that explicit human reasons significantly influence model decisions, improving alignment with expert recommended behavior. However, responsiveness varies across contextual factors, indicating uneven sensitivity to different types of reasons. These findings provide empirical evidence that reason responsiveness in foundation models can be systematically evaluated without modifying model parameters.
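
The comparison of baseline and reason-augmented decisions described in the abstract can be sketched as two simple statistics. The abstract does not specify the framework's actual metrics, so the function names (`alignment_rate`, `responsiveness`) and the decision labels in the test data are hypothetical. The sketch assumes one discrete decision per scenario variant.

```python
def alignment_rate(decisions: list, expert: list) -> float:
    """Fraction of scenario variants where the decision matches the expert's."""
    if not expert or len(decisions) != len(expert):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(d == e for d, e in zip(decisions, expert))
    return matches / len(expert)


def responsiveness(baseline: list, augmented: list, expert: list) -> dict:
    """Summarize how adding explicit human reasons changes the decisions."""
    if len(baseline) != len(augmented):
        raise ValueError("baseline and augmented must have equal length")
    # alignment_rate also rejects empty input, so the division below is safe.
    gain = alignment_rate(augmented, expert) - alignment_rate(baseline, expert)
    changed = sum(b != a for b, a in zip(baseline, augmented))
    return {
        "changed_fraction": changed / len(baseline),
        "alignment_gain": gain,
    }
```

A positive `alignment_gain` with a nonzero `changed_fraction` would indicate that the stated reasons moved decisions toward expert behavior. Computing the same summary separately per contextual factor would expose the uneven sensitivity the abstract reports.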


cs.MM

[53] Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU cs.MM | cs.CL | cs.LG

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

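TL;DR: This paper proposes a real-time conversational assistant based on audio and IMU inputs that guides procedural tasks such as furniture assembly. The assistant does not rely on video input, which protects user privacy, and a novel User Whim Agnostic (UWA) LoRA fine-tuning method optimizes the language model's dialogue behavior so that it proactively delivers key instructions while suppressing redundant dialogue.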

Details

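Motivation: Existing real-time conversational assistants depend on video input, which is computationally expensive and risks privacy leakage. The paper explores using only lightweight, privacy-preserving audio and IMU modalities to understand task context and provide comprehensive guidance.

Result: On the constructed dataset, the proposed UWA LoRA fine-tuning method improves the F-score by more than 30%, and removing in-context examples from the prompt yields a 16x speedup. The system can be deployed on edge devices with no dependence on the cloud.

Insight: The novelty is the first real-time conversational assistant for procedural tasks built on audio and IMU modalities alone, together with the UWA LoRA fine-tuning method, which improves the conciseness of the model's dialogue and its delivery of key information, balancing privacy protection and efficiency. More broadly, combining non-video multimodal sensing with a targeted fine-tuning strategy for edge-deployed dialogue systems is a promising direction.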

Abstract: Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user’s wearable device to understand the context. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions. We construct a dataset containing conversations where the assistant guides the user in performing the task. On observing that an off-the-shelf language model is a very talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA finetuning method which improves the model’s ability to suppress less informative dialogues, while maintaining its tendency to communicate important instructions. This leads to >30% improvement in the F-score. Finetuning the model also results in a 16x speedup by eliminating the need to provide in-context examples in the prompt. We further describe how such an assistant is implemented on edge devices with no dependence on the cloud.
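
To see why an overly talkative assistant scores poorly on an F-score, the following minimal sketch computes the standard F1 over per-step speak-or-stay-silent decisions. The abstract does not say how its F-score is defined, so treating each step as a binary decision is an assumption, and the function name `f_score` and its arguments are illustrative only.

```python
def f_score(predicted_speak: list, should_speak: list) -> float:
    """F1 over binary speak (True) / stay silent (False) decisions."""
    pairs = list(zip(predicted_speak, should_speak))
    true_pos = sum(1 for p, s in pairs if p and s)
    false_pos = sum(1 for p, s in pairs if p and not s)
    false_neg = sum(1 for p, s in pairs if not p and s)
    # Guard against empty denominators when nothing is predicted or expected.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

An assistant that speaks at every step reaches full recall but low precision, so suppressing uninformative turns while keeping the important ones raises the score.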