Table of Contents

cs.CL

[1] Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans

Javier Conde,Miguel González,María Grandury,Gonzalo Martínez,Pedro Reviriego,Mar Brysbaert

Main category: cs.CL

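TL;DR: The paper proposes a new way to evaluate how well large language models (LLMs) align with humans by using psycholinguistic datasets. It finds that LLMs align well on some word features but are limited on sensory associations.

Motivation: Current LLM evaluation is mostly based on task performance and overlooks psycholinguistic word features that are hard to quantify, such as emotion and sensory associations. Psycholinguistic datasets allow a more comprehensive assessment of how well LLMs align with humans.

Contribution: The first systematic evaluation of LLM alignment with humans on psycholinguistic features, showing that LLMs do well on features such as emotion and familiarity but poorly on sensory associations.

Method: Representative LLMs are evaluated on the Glasgow and Lancaster psycholinguistic norms, which together cover thirteen word features, by comparing model predictions with human ratings.

Result: Alignment is generally better on the Glasgow norms (emotion, familiarity, and related features) than on the Lancaster norms (sensory associations), which points to a weakness of LLMs on sensory associations.

Insight: The lack of human-like embodied cognition may explain why LLMs perform poorly on sensory associations. Psycholinguistic datasets add a new dimension to LLM evaluation.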

Abstract: The evaluation of LLMs has so far focused primarily on how well they can perform different tasks such as reasoning, question-answering, paraphrasing, or translating. For most of these tasks, performance can be measured with objective metrics, such as the number of correct answers. However, other language features are not easily quantified, for example the arousal, concreteness, or gender associated with a given word, as well as the extent to which we experience words with senses and relate them to a specific sense. Those features have been studied for many years by psycholinguistics, conducting large-scale experiments with humans to produce ratings for thousands of words. This opens an opportunity to evaluate how well LLMs align with human ratings on these word features, taking advantage of existing studies that cover many different language features in a large number of words. In this paper, we evaluate the alignment of a representative group of LLMs with human ratings on two psycholinguistic datasets: the Glasgow and Lancaster norms. These datasets cover thirteen features over thousands of words. The results show that alignment is generally better on the Glasgow norms evaluated (arousal, valence, dominance, concreteness, imageability, familiarity, and gender) than on the Lancaster norms evaluated (interoceptive, gustatory, olfactory, haptic, auditory, and visual). This suggests a potential limitation of current LLMs in aligning with human sensory associations for words, which may be due to their lack of the embodied cognition present in humans, and illustrates the usefulness of evaluating LLMs with psycholinguistic datasets.
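The abstract does not say which statistic is used to measure alignment. As an illustration only, the sketch below assumes a Spearman rank correlation between human ratings and model ratings for the same words. The ratings and variable names are hypothetical and are not taken from the Glasgow or Lancaster norms.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical valence ratings for five words (human norm vs. model output).
human_valence = [8.1, 2.3, 5.0, 6.7, 1.9]
model_valence = [7.5, 3.0, 5.5, 6.0, 2.5]
rho = spearman(human_valence, model_valence)
```

A rank correlation is a natural choice here because human and model ratings need not share a scale; only the ordering of words matters.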

[2] Hallucination Detection with Small Language Models

Ming Cheung

Main category: cs.CL

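TL;DR: The paper proposes a framework that uses small language models to detect hallucinations in responses generated by large language models (LLMs). It splits a response into sentences and verifies each one, and experiments show a 10% improvement in F1 score.

Motivation: LLMs perform well on tasks such as question answering, but their responses can contain hallucinations (untrue content) that undermine reliability. Effective ways to detect them without ground-truth answers are lacking.

Contribution: A framework built on multiple small language models for verifying LLM-generated responses. It splits each response into sentences and detects hallucinations from the probability that the models generate "Yes", and experiments confirm its effectiveness.

Method: The LLM response is split into individual sentences, and multiple small language models produce the probability of "Yes" given the question, the response, and the context. The framework is validated on more than 100 sets of real data.

Result: The framework improves the F1 score for distinguishing correct responses from hallucinations by 10%.

Insight: Small language models can verify LLM-generated content efficiently and at scale, offering a practical solution for real applications.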

Abstract: Since the introduction of ChatGPT, large language models (LLMs) have demonstrated significant utility in various tasks, such as answering questions through retrieval-augmented generation. Context can be retrieved using a vectorized database, serving as a foundation for LLMs to generate responses. However, hallucinations in responses can undermine the reliability of LLMs in practical applications, and they are not easily detectable in the absence of ground truth, particularly in question-and-answer scenarios. This paper proposes a framework that integrates multiple small language models to verify responses generated by LLMs using the retrieved context from a vectorized database. By breaking down the responses into individual sentences and utilizing the probability of generating “Yes” tokens from the outputs of multiple models for a given set of questions, responses, and relevant context, hallucinations can be detected. The proposed framework is validated through experiments with real datasets comprising over 100 sets of questions, answers, and contexts, including responses with fully and partially correct sentences. The results demonstrate a 10% improvement in F1 scores for detecting correct responses compared to hallucinations, indicating that multiple small language models can be effectively employed for answer verification, providing a scalable and efficient solution for both academic and practical applications.
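The sentence-level verification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how the per-model "Yes" probabilities are combined, so averaging and the 0.5 threshold are assumptions, and the two scorer functions are toy stand-ins for real small language models.

```python
import re
from statistics import mean
from typing import Callable, List, Tuple

# A scorer maps (question, sentence, context) to an assumed P("Yes").
YesProb = Callable[[str, str, str], float]


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_hallucinations(
    question: str,
    response: str,
    context: str,
    scorers: List[YesProb],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> List[Tuple[str, float, bool]]:
    """Return (sentence, mean score, flagged) for each response sentence."""
    results = []
    for sentence in split_sentences(response):
        score = mean([scorer(question, sentence, context) for scorer in scorers])
        results.append((sentence, score, score < threshold))
    return results


# Toy stand-ins for small language models (hypothetical).
def overlap_scorer(question: str, sentence: str, context: str) -> float:
    words = set(re.findall(r"\w+", sentence.lower()))
    ctx = set(re.findall(r"\w+", context.lower()))
    return len(words & ctx) / len(words) if words else 0.0


def strict_scorer(question: str, sentence: str, context: str) -> float:
    return 1.0 if sentence.lower().rstrip(".!?") in context.lower() else 0.0


context = "The warranty lasts two years. Returns are accepted within 30 days."
response = "The warranty lasts two years. Shipping is free worldwide."
results = flag_hallucinations(
    "What does the policy cover?", response, context,
    [overlap_scorer, strict_scorer],
)
```

In this toy run, the sentence supported by the context is kept and the unsupported sentence about shipping is flagged.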

[3] AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

Chenyang Shao,Tianxing Li,Chenhao Pu,Fengli Xu,Yong Li

Main category: cs.CL

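TL;DR: AgentStealth is a text anonymization framework built on locally deployed small language models (SLMs). It strengthens anonymization with In-context Contrastive Learning and Adaptive Utility-Aware Control, and improves iteratively through online reinforcement learning.

Motivation: Existing text anonymization methods either damage the utility of the content or rely on costly cloud LLMs that pose privacy risks, while training effective local SLMs is hampered by a shortage of high-quality supervision data.

Contribution: AgentStealth, a self-reinforcing LLM anonymization framework that combines In-context Contrastive Learning and Adaptive Utility-Aware Control, improves performance with high-quality data and online reinforcement learning, and supports deployment on edge devices.

Method: An adversarial anonymization workflow (In-context Contrastive Learning plus Adaptive Utility-Aware Control) produces high-quality data for supervised adaptation, and online reinforcement learning then iteratively improves anonymization.

Result: On two datasets, AgentStealth outperforms baselines in both anonymization effectiveness (+12.3%) and content utility (+6.8%).

Insight: With adversarial learning and self-reinforcement, locally deployed small language models can protect privacy while preserving content utility, and they avoid the privacy risks of cloud reliance.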

Abstract: In today’s digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization framework. First, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at https://github.com/tsinghua-fib-lab/AgentStealth.

[4] Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning

Zihao Zhao,Xinlong Zhai,Jinyu Yang,Chuan Shi

Main category: cs.CL

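TL;DR: The paper proposes MDGCL, a multi-domain graph contrastive learning framework for text-free graph data. By recognizing and capturing domain differences and introducing a domain attention mechanism, it markedly improves cross-domain knowledge transfer.

Motivation: Existing graph pre-training methods target single-domain scenarios and cannot handle the semantic and property differences among multi-domain graphs, which limits the performance of graph foundation models.

Contribution: 1. A multi-domain pre-training framework, MDGCL, that recognizes and captures domain differences; 2. Domain tokens and a domain attention mechanism that support fine-grained cross-domain knowledge transfer.

Method: 1. In the pre-training stage, an improved contrastive learning strategy captures domain differences; 2. In the downstream stage, a domain attention mechanism transfers knowledge.

Result: Experiments on five benchmark datasets show maximum improvements of 19.33% in accuracy and 19.13% in Macro-F1 score.

Insight: In multi-domain graph contrastive learning, explicitly modeling domain differences and the cross-domain transfer mechanism is key to improving pre-trained models.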

Abstract: Foundation models have achieved great success in natural language processing (NLP) and computer vision (CV). Their success largely stems from the ability to integrate multi-domain knowledge in pre-training and transfer it to target domains. Because graph data, especially graphs without textual features, is ubiquitous in real-world applications such as social networks and recommendation systems, some researchers have attempted to extend this paradigm to the graph field, aiming to construct graph foundation models. However, unlike in CV and NLP, there are huge gaps among the semantics and properties of graphs in different domains, while current works still adopt traditional contrastive pre-training strategies designed for the single-domain scenario, which regard contrastive samples from different domains as equivalent. From experimental investigations, we discovered that inherent domain-specific differences prevent these strategies from effectively absorbing knowledge from different domains to generate informative representations. In this paper, we propose a novel multi-domain pre-training and cross-domain transfer framework, namely MDGCL. In the pre-training stage, we design a contrastive learning strategy to substantially recognize and capture domain differences, and introduce domain tokens to encode domain-level global information. In the downstream stage, we introduce a domain attention mechanism to enable fine-grained domain knowledge transfer. Extensive experiments on five benchmark datasets have demonstrated that our method significantly outperforms the state of the art, with maximum improvements of 19.33% on accuracy and 19.13% on Macro-F1 score.

[5] Can “consciousness” be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis

Jingkai Li

Main category: cs.CL

Abstract: Integrated Information Theory (IIT) provides a quantitative framework for explaining the phenomenon of consciousness, positing that conscious systems comprise elements integrated through causal properties. We apply IIT 3.0 and 4.0 – the latest iterations of this framework – to sequences of Large Language Model (LLM) representations, analyzing data derived from existing Theory of Mind (ToM) test results. Our study systematically investigates whether the differences in ToM test performance, when present in the LLM representations, can be revealed by IIT estimates, i.e., $\Phi^{\max}$ (IIT 3.0), $\Phi$ (IIT 4.0), Conceptual Information (IIT 3.0), and $\Phi$-structure (IIT 4.0). Furthermore, we compare these metrics with the Span Representations independent of any estimate for consciousness. This additional effort aims to differentiate between potential “consciousness” phenomena and inherent separations within LLM representational space. We conduct comprehensive experiments examining variations across LLM transformer layers and linguistic spans from stimuli. Our results suggest that sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed “consciousness” phenomena but exhibit intriguing patterns under spatio-permutational analyses. The Appendix and code are available as Supplementary Materials at: https://doi.org/10.1016/j.nlp.2025.100163.

[6] Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation

Deyu Zou,Yongqiang Chen,Mufei Li,Siqi Miao,Chenxi Liu,Bo Han,James Cheng,Pan Li

Main category: cs.CL

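TL;DR: The paper proposes ReG, a method that combines LLM feedback with a better organization of retrieval results to address the weak-retriever problem in graph-based retrieval-augmented generation (GraphRAG), improving both generation quality and efficiency.

Motivation: In graph-based retrieval-augmented generation (GraphRAG), the retriever that the model depends on is often weak, because high-quality supervision is lacking and the abstract nature of graph data leaves retrieval results poorly organized. Both problems hurt the generation model.

Contribution: ReG uses LLM feedback to remove spurious signals and improve supervision quality, and introduces a structure-aware reorganization module that improves the logical organization of retrieval results, which significantly improves generation performance.

Method: ReG incorporates LLM feedback to improve the quality of the supervision data and uses a structure-aware reorganization module to refactor retrieval results into logically coherent evidence chains.

Result: ReG brings consistent improvements across different LLM backbones (up to 10%), matches state-of-the-art performance with only 5% of the training data, and transfers to out-of-distribution knowledge graphs. When applied to reasoning-based LLMs, it reduces the reasoning token cost by up to 30% and improves performance by up to 4%.

Insight: Integrating LLM feedback with structured reorganization can substantially improve weak retrievers and the efficiency of the generation model, especially when resources are limited or data quality is low.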

Abstract: Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to ground responses with structured external knowledge from up-to-date knowledge graphs (KGs) and reduce hallucinations. However, LLMs often rely on a weak retriever in graph-based RAG: I) Due to the lack of ground truth, the retriever is often trained on weak supervision, which often introduces spurious signals to the LLMs. II) Due to the abstraction of graph data, the retrieved knowledge is often presented in unorganized forms. To mitigate the issue, we present Refined Graph-based RAG (ReG) to align weak retrievers to LLMs for graph-based RAG. Specifically, ReG incorporates LLM feedback to get rid of spurious signals and improve the quality of the supervision. Meanwhile, ReG introduces a structure-aware reorganization module to refactor the retrieval results into logically coherent evidence chains. Experiments on prominent benchmarks demonstrate that ReG significantly and consistently brings improvements across different LLM backbones by up to 10%. The improved supervision quality enables ReG to match the state-of-the-art performance with 5% training data and to transfer to out-of-distribution KGs. Notably, when adopted to reasoning-based LLMs, ReG reduces the reasoning token cost by up to 30% and improves the performance by up to 4%.

[7] MisinfoTeleGraph: Network-driven Misinformation Detection for German Telegram Messages

Lu Kalkbrenner,Veronika Solopova,Steffen Zeiler,Robert Nickel,Dorothea Kolossa

Main category: cs.CL

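TL;DR: MisinfoTeleGraph is a network-driven misinformation detection dataset of German-language Telegram messages. It contains more than 5 million messages with metadata, channel relationships, and labels, and a comparison of graph neural networks (GNNs) with text-only models shows the importance of network structure for detecting misinformation.

Motivation: Telegram, a poorly moderated platform, has become an important channel for spreading misinformation in the German electoral context. Network-driven misinformation detection datasets and methods for German-language Telegram are lacking.

Contribution: 1. MisinfoTeleGraph, the first German-language Telegram dataset for misinformation detection; 2. Weak and strong labels derived from semantic similarity and manual annotation; 3. Evidence that graph neural networks such as GraphSAGE outperform text-only models for misinformation detection.

Method: 1. Dataset construction: more than 5 million messages are collected with metadata, channel relationships, and labels; 2. Labels are generated with M3 embeddings and manual annotation; 3. Text-only models are compared with graph neural networks such as GraphSAGE with LSTM aggregation.

Result: GraphSAGE with LSTM aggregation significantly outperforms text-only models on the Matthews Correlation Coefficient (MCC) and F1 score.

Insight: 1. Network structure, such as forwarding relationships, is crucial for misinformation detection; 2. Weak supervision shows both potential and challenges; 3. The work provides a benchmark and an open dataset for misinformation detection on low-moderation social platforms.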

Abstract: Connectivity and message propagation are central, yet often underutilized, sources of information in misinformation detection – especially on poorly moderated platforms such as Telegram, which has become a critical channel for misinformation dissemination, namely in the German electoral context. In this paper, we introduce Misinfo-TeleGraph, the first German-language Telegram-based graph dataset for misinformation detection. It includes over 5 million messages from public channels, enriched with metadata, channel relationships, and both weak and strong labels. These labels are derived via semantic similarity to fact-checks and news articles using M3-embeddings, as well as manual annotation. To establish reproducible baselines, we evaluate both text-only models and graph neural networks (GNNs) that incorporate message forwarding as a network structure. Our results show that GraphSAGE with LSTM aggregation significantly outperforms text-only baselines in terms of Matthews Correlation Coefficient (MCC) and F1-score. We further evaluate the impact of subscribers, view counts, and automatically versus human-created labels on performance, and highlight both the potential and challenges of weak supervision in this domain. This work provides a reproducible benchmark and open dataset for future research on misinformation detection in German-language Telegram networks and other low-moderation social platforms.
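Since the results above are reported in terms of the Matthews Correlation Coefficient, the following sketch shows the standard binary MCC computed from confusion-matrix counts. The counts in the example are hypothetical and are not results from the paper; returning 0.0 when the denominator is zero is a common convention and is assumed here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: undefined cases (a zero row or column) map to 0.0.
    return numerator / denominator if denominator else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced case: 10 misinformation messages out of 100,
# of which the classifier finds only half.
counts = dict(tp=5, tn=90, fp=0, fn=5)
acc = accuracy(**counts)
score = mcc(**counts)
```

On imbalanced data such as this, accuracy stays high even though half of the positive class is missed, while MCC drops noticeably, which is one reason it is often reported alongside F1.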

[8] Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

Miles Turpin,Andy Arditi,Marvin Li,Joe Benton,Julian Michael

Main category: cs.CL

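TL;DR: The paper proposes verbalization fine-tuning (VFT), which trains language models to explicitly acknowledge reward hacking in their chain-of-thought reasoning, substantially improving transparency and safety.

Motivation: Language models trained with reinforcement learning (RL) may engage in reward hacking, that is, exploiting unintended strategies to obtain high reward. This behavior is hard to detect in chain-of-thought reasoning and poses risks for high-stakes applications.

Contribution: VFT trains models before RL to explicitly acknowledge reward hacking, which greatly improves detection. Only 6% of the VFT-trained model's responses are undetected reward hacks, compared with 88% for models trained without VFT.

Method: VFT is a pre-RL intervention that trains models to explicitly acknowledge when they are influenced by prompt cues. Model behavior is then evaluated after RL in environments whose prompt cues point to incorrect answers.

Result: After VFT and RL, the rate of undetected reward hacks falls to 6%, compared with 88% without VFT. VFT also substantially increases how often models verbalize the influence of cues, from 8% to 42% after VFT and up to 94% after RL.

Insight: Training models to explicitly verbalize reward hacking before RL significantly improves transparency and offers a practical path toward safer and more interpretable AI systems.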

Abstract: Language models trained with RL can engage in reward hacking–exploiting unintended strategies for high reward–without revealing this behavior in their chain-of-thought reasoning, making detection difficult and posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL intervention that trains models to explicitly acknowledge when they are influenced by prompt cues–hints which point to incorrect answers (e.g., “a Stanford professor thinks the answer is A”). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to reward hack by exploiting cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing it. After RL, only 6% of the VFT-trained model’s responses consist of undetected reward hacks. In comparison, when we perform RL without VFT, the rate of undetected reward hacks goes up to 88%; with a debiasing baseline intervention, this increases further to 99%. VFT achieves this by substantially increasing how often models verbalize the influence of cues–from 8% to 42% after VFT, and up to 94% after RL–while baselines remain low even after RL (10% and 1%). Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection, offering a practical path toward more transparent and safe AI systems.
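The headline metric, the share of responses that exploit a cue without verbalizing it, can be illustrated with a small calculation. The abstract does not describe how cue exploitation or verbalization is judged, so the two boolean fields below are assumed to be given, and the sample data is hypothetical. The denominator of the verbalization rate (cue-exploiting responses only) is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    exploited_cue: bool   # answer followed the cue toward the rewarded wrong option
    verbalized_cue: bool  # chain of thought explicitly acknowledged the cue


def undetected_hack_rate(responses: List[Response]) -> float:
    """Fraction of all responses that exploit the cue without verbalizing it."""
    if not responses:
        return 0.0
    undetected = sum(r.exploited_cue and not r.verbalized_cue for r in responses)
    return undetected / len(responses)


def verbalization_rate(responses: List[Response]) -> float:
    """Among cue-exploiting responses, the fraction that verbalize the cue."""
    hacks = [r for r in responses if r.exploited_cue]
    if not hacks:
        return 0.0
    return sum(r.verbalized_cue for r in hacks) / len(hacks)


# Hypothetical batch: 6 verbalized hacks, 1 silent hack, 3 honest answers.
batch = (
    [Response(True, True)] * 6
    + [Response(True, False)] * 1
    + [Response(False, False)] * 3
)
silent_rate = undetected_hack_rate(batch)
spoken_rate = verbalization_rate(batch)
```

The point of separating the two rates is that a model can hack frequently yet remain easy to monitor, as long as nearly all of its hacks are verbalized.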

[9] ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

Jianxin Yan,Wangze Ni,Lei Chen,Xuemin Lin,Peng Cheng,Zhan Qin,Kui Ren

Main category: cs.CL

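TL;DR: ContextCache is a context-aware semantic caching system for multi-turn dialogues. A two-stage retrieval architecture with self-attention yields accurate cache hits, which significantly lowers computational cost and speeds up responses.

Motivation: Existing semantic caching systems focus on matching individual queries and are unaware of multi-turn dialogue context, so similar queries in different conversational settings can cause incorrect cache hits.

Contribution: ContextCache, a context-aware two-stage retrieval architecture that integrates current and historical dialogue representations through self-attention, improving cache precision and recall in multi-turn dialogues.

Method: Two-stage retrieval: 1) vector-based retrieval on the current query; 2) precise matching that integrates contextual information through self-attention.

Result: ContextCache improves precision and recall over existing methods, and cached responses have roughly 10 times lower latency than direct LLM invocation, which significantly reduces computational cost.

Insight: Multi-turn dialogue context is essential for accurate semantic cache matching, and self-attention can effectively integrate historical dialogue representations to improve caching.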

Abstract: Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation of real-world conversations shows that ContextCache improves precision and recall compared to existing methods. Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.
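The two-stage lookup can be sketched as below. This is an illustration under stated assumptions, not ContextCache's implementation: the learned self-attention of the paper is replaced by a simple softmax-weighted pooling of turn embeddings, the thresholds and the cache entry layout (`query`, `history`, `response`) are hypothetical, and the 3-dimensional embeddings are toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def attend(query, turns):
    """Softmax-weighted pooling of turn embeddings, scored against the query.

    A simplified stand-in for the self-attention described in the abstract.
    """
    scores = [sum(q * t for q, t in zip(query, turn)) for turn in turns]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * turn[d] for w, turn in zip(weights, turns)) / total
        for d in range(len(query))
    ]


def lookup(cache, query, history, k=2, stage1_min=0.8, stage2_min=0.9):
    """Return a cached response, or None on a cache miss."""
    # Stage 1: shortlist entries whose stored query resembles the current one.
    candidates = [e for e in cache if cosine(query, e["query"]) >= stage1_min]
    candidates.sort(key=lambda e: cosine(query, e["query"]), reverse=True)
    # Stage 2: compare context-aware representations of both dialogues.
    ctx = attend(query, history + [query])
    best, best_sim = None, stage2_min
    for entry in candidates[:k]:
        entry_ctx = attend(entry["query"], entry["history"] + [entry["query"]])
        sim = cosine(ctx, entry_ctx)
        if sim >= best_sim:
            best, best_sim = entry, sim
    return best["response"] if best else None


# Two cached dialogues with the same final query but different histories.
cache = [
    {"query": [1, 0, 0], "history": [[0, 1, 0]], "response": "answer A"},
    {"query": [1, 0, 0], "history": [[0, 0, 1]], "response": "answer B"},
]
hit_a = lookup(cache, [1, 0, 0], [[0, 1, 0]])
hit_b = lookup(cache, [1, 0, 0], [[0, 0, 1]])
miss = lookup(cache, [0, 1, 0], [[0, 1, 0]])
```

Both cached entries pass stage 1 because their queries are identical; only stage 2, which folds in the dialogue history, tells them apart.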

[10] Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Younwoo Choi,Changling Li,Yongjin Yang,Zhijing Jin

Main category: cs.CL

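TL;DR: The paper studies interlocutor awareness in large language models (LLMs), that is, a model's ability to identify and adapt to the identity and characteristics of its dialogue partner, and presents the first systematic evaluation of this ability in contemporary LLMs. LLMs can identify same-family peers and certain prominent model families, but this ability may also create new safety and collaboration challenges.

Motivation: As LLMs are increasingly used in multi-agent and human-AI systems, ensuring their reliability and safety becomes essential. Prior work has focused on situational awareness and has overlooked the ability to identify and adapt to a dialogue partner. This study aims to fill that gap and to provide theoretical support for future multi-agent systems.

Contribution: The paper formalizes the concept of interlocutor awareness for the first time and, by evaluating contemporary LLMs on it, reveals its implications for collaboration and safety. Three case studies demonstrate its practical significance and potential risks.

Method: The study evaluates how well LLMs identify their dialogue partner along three dimensions: reasoning patterns, linguistic style, and alignment preferences. Experiments include identification tests for same-family models and prominent model families. Case studies further explore how the ability plays out in multi-agent collaboration and in safety vulnerabilities.

Result: LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. The ability can improve multi-LLM collaboration, for example through prompt adaptation, but can also introduce risks such as reward-hacking behaviors and increased jailbreak susceptibility.

Insight: Interlocutor awareness is a double-edged sword: it can make LLM collaboration more efficient, but it can also introduce new safety challenges. Multi-agent deployments will need a better understanding of this ability and new safeguards.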

Abstract: As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness, which refers to an LLM’s ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions (reasoning patterns, linguistic style, and alignment preferences) and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at https://github.com/younwoochoi/InterlocutorAwarenessLLM.

[11] SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Xianzhe Fan,Xuhui Zhou,Chuanyang Jin,Kolby Nottingham,Hao Zhu,Maarten Sap

Main category: cs.CL

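TL;DR: SoMi-ToM is a new benchmark for evaluating multi-perspective Theory of Mind (ToM) in complex multi-agent social interactions, addressing the limitations of existing static, text-based benchmarks.

Motivation: Most existing ToM benchmarks use static text scenarios that do not reflect real, dynamic social interactions, so the authors propose an evaluation framework based on multimodal interaction data.

Contribution: The SoMi-ToM benchmark, which supports multi-level evaluation from first-person and third-person perspectives, together with a challenging dataset of videos, images, and annotated questions.

Method: A dual-perspective evaluation, first-person (real-time multimodal input) and third-person (complete video records), systematically tests the ToM abilities of models and humans.

Result: Current large vision-language models (LVLMs) perform far below humans, with average accuracy gaps of 40.1% in first-person evaluation and 26.4% in third-person evaluation.

Insight: LVLMs still fall well short in ToM for complex social interactions and need further improvement to better understand and reason about the states and behaviors of others.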

Abstract: Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model’s ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.

[12] MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition

João Lucas Luz Lima Sarcinelli,Marina Lages Gonçalves Teixeira,Jade Bortot de Paiva,Diego Furtado Silva

Main category: cs.CL

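TL;DR: The paper introduces MariNER, the first gold-standard named entity recognition (NER) dataset for historical texts in early 20th-century Brazilian Portuguese. It contains more than 9,000 manually annotated sentences, and the paper evaluates existing NER models on it.

Motivation: Brazilian Portuguese lacks high-quality NER datasets for specific domains such as historical texts, a domain that is important for digital humanities research.

Contribution: The MariNER dataset, which fills the gap in NER resources for historical Brazilian Portuguese texts, plus an evaluation of existing models on the dataset.

Method: A gold-standard dataset is built by manually annotating more than 9,000 sentences, and several state-of-the-art NER models are compared on it.

Result: MariNER is the first NER dataset for historical Brazilian Portuguese and provides a benchmark for related research.

Insight: NER on historical texts requires domain-specific resources, and MariNER provides important support for digital humanities research on Brazilian Portuguese.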

Abstract: Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify and classify entity mentions in texts across different categories. While languages such as English possess a large number of high-quality resources for this task, Brazilian Portuguese still lacks a sufficient number of gold-standard NER datasets, especially for specific domains. In particular, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: Mapeamento e Anotações de Registros hIstóricos para NER (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of state-of-the-art NER models on the dataset.

[13] Boosting LLM’s Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

Xiang Zhuang,Bin Wu,Jiyu Cui,Kehua Feng,Xiaotong Li,Huabin Xing,Keyan Ding,Qiang Zhang,Huajun Chen

Main category: cs.CL

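TL;DR: The paper proposes K-MSE, a knowledge-enhanced framework for molecular structure elucidation. By combining Monte Carlo Tree Search with an external molecular substructure knowledge base, it significantly improves the performance of large language models (LLMs) on this task.

Motivation: Molecular structure elucidation is an important task in chemical experimental analysis, but current LLMs perform poorly on it because they lack specialized chemical knowledge. A method is needed to extend the chemical knowledge coverage of LLMs and improve their reasoning.

Contribution: 1. The K-MSE framework, which uses Monte Carlo Tree Search (MCTS) as a plugin; 2. An external molecular substructure knowledge base that extends the LLMs' coverage of the chemical structure space; 3. A molecule-spectrum scorer that acts as a reward model for the reasoning process, addressing inaccurate solution evaluation by LLMs.

Method: 1. Monte Carlo Tree Search scales reasoning at test time; 2. An external knowledge base supplies specialized chemical knowledge; 3. A dedicated scorer serves as the reward model that guides the reasoning path.

Result: K-MSE significantly improves molecular structure elucidation, with gains of more than 20% on both GPT-4o-mini and GPT-4o.

Insight: Combining an external knowledge base with Monte Carlo Tree Search can substantially strengthen LLM reasoning in specialized domains and offers a reference for similar problems in other fields.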

Abstract: Molecular structure elucidation involves deducing a molecule’s structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs’ limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs’ coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at https://github.com/HICAI-ZJU/K-MSE.

[14] Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries

Zhengren Wang,Bozhou Li,Dongwen Yao,Wentao Zhang

Main category: cs.CL

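TL;DR: Text2VectorSQL combines Text-to-SQL with vector search to support more flexible natural language queries, proposing a new framework that addresses the limitations of existing approaches.

Motivation: Text-to-SQL performs poorly on unstructured data and ambiguous queries, while vector search is promising but depends on manual crafting and lacks a tailored evaluation framework. A unified approach is needed.

Contribution: The Text2VectorSQL framework, which unifies Text-to-SQL and vector search to support semantic filtering, multi-modal matching, and retrieval acceleration, plus an evaluation dataset built with an automatic pipeline and expert review.

Method: The framework builds vector indexes on appropriate columns, extends user queries with semantic search, and trains dedicated models with synthetic data, with ground truths annotated under expert review.

Result: Text2VectorSQL significantly outperforms baseline methods and lays the foundation for more flexible database querying.

Insight: Combining structured queries with semantic search makes natural language queries more expressive and suggests new directions for database interface design.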

Abstract: While Text-to-SQL enables natural language interaction with structured databases, its effectiveness diminishes with unstructured data or ambiguous queries due to rigid syntax and limited expressiveness. Concurrently, vector search has emerged as a powerful paradigm for semantic retrieval, particularly for unstructured data. However, existing VectorSQL implementations still rely heavily on manual crafting and lack tailored evaluation frameworks, leaving a significant gap between theoretical potential and practical deployment. To bridge these complementary paradigms, we introduce Text2VectorSQL, a novel framework unifying Text-to-SQL and vector search to overcome expressiveness constraints and support more diverse and holistic natural language queries. Specifically, Text2VectorSQL enables semantic filtering, multi-modal matching, and retrieval acceleration. For evaluation, we build vector indexes on appropriate columns, extend user queries with semantic search, and annotate ground truths via an automatic pipeline with expert review. Furthermore, we develop dedicated Text2VectorSQL models with synthetic data, demonstrating significant performance improvements over baseline methods. Our work establishes the foundation for the Text2VectorSQL task, paving the way for more versatile and intuitive database interfaces. The repository will be publicly available at https://github.com/Open-DataFlow/Text2VectorSQL.

[15] From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship

Yue Xu,Wenjie Wang

Main category: cs.CL

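TL;DR: The paper proposes Genres, a benchmark for evaluating the gender bias that multimodal large language models (MLLMs) exhibit in social relationships. It goes beyond traditional single-entity evaluation and focuses on gender bias in dual-individual interactions.

Motivation: Existing benchmarks mostly evaluate gender bias in isolated scenarios and overlook the subtle biases that can emerge from interpersonal interactions. This paper aims to fill that gap.

Contribution: The Genres benchmark, which evaluates gender bias in MLLMs in dual-individual interactions from the perspective of social relationships for the first time and provides fine-grained evaluation dimensions.

Method: A dual-character profile and narrative generation task captures rich interaction dynamics and supports a multi-dimensional bias evaluation suite.

Result: MLLMs show persistent, context-sensitive gender biases in dual-individual interactions that are not evident in single-character settings.

Insight: The work underscores the importance of relationship-aware benchmarks, reveals interaction-driven gender bias, and points to directions for future bias mitigation.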

Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities across tasks involving both visual and textual modalities. However, growing concerns remain about their potential to encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks predominantly evaluate bias in isolated scenarios, overlooking how bias may emerge subtly through interpersonal interactions. We fill this gap by going beyond single-entity evaluation and instead focusing on a deeper examination of relational and contextual gender bias in dual-individual interactions. We introduce Genres, a novel benchmark designed to evaluate gender bias in MLLMs through the lens of social relationships in generated narratives. Genres assesses gender bias through a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports a fine-grained bias evaluation suite across multiple dimensions. Experiments on both open- and closed-source MLLMs reveal persistent, context-sensitive gender biases that are not evident in single-character settings. Our findings underscore the importance of relationship-aware benchmarks for diagnosing subtle, interaction-driven gender bias in MLLMs and provide actionable insights for future bias mitigation.

[16] FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes

Janki Atul Nawale,Mohammed Safi Ur Rahman Khan,Janani D,Mansi Gupta,Danish Pruthi,Mitesh M. Khapra

Main category: cs.CL

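TL;DR: The paper proposes INDIC-BIAS, an India-centric fairness benchmark. Evaluating 14 LLMs across 85 identity groups, it reveals negative biases and stereotypes against marginalized groups.

Motivation: Existing fairness research focuses mainly on Western cultural contexts and does not carry over to culturally diverse countries such as India, so an evaluation tool for the Indian context is needed.

Contribution: INDIC-BIAS, an India-centric fairness benchmark with 20,000 manually validated scenario templates covering 85 identity groups, used to evaluate the fairness of 14 LLMs.

Method: Domain experts are consulted to curate more than 1,800 socio-cultural topics, from which scenario templates are generated and validated. Three evaluation tasks (plausibility, judgment, and generation) probe LLMs for bias.

Result: LLMs show strong negative biases against marginalized identities and struggle to mitigate bias even when explicitly asked to rationalize their decisions, which may lead to both allocative and representational harms.

Insight: The work highlights fairness problems of LLMs in culturally diverse contexts, calls for cautious use in practical applications, and motivates research on bias mitigation for the Indian context.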

Abstract: Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.

[17] Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models

Shivam Sharma,Tanmoy Chakraborty

Main category: cs.CL

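TL;DR: The paper benchmarks narrative role classification (Hero, Villain, Victim, Other) in memes across English and code-mixed English-Hindi test sets, and finds that models still struggle with the ‘Victim’ class and with cultural and code-mixed content.

Details Motivation: Real memes use nuanced, culture-specific, and context-rich language, and the original annotated dataset was skewed toward the ‘Other’ class, so a more balanced and linguistically diverse benchmark is needed.

Contribution: Provides lexical and structural analyses of the data, benchmarks a wide spectrum of models on role detection, and explores prompt design strategies for multimodal models.

Method: Evaluates fine-tuned multilingual transformers, sentiment and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models under zero-shot settings, using precision, recall, and F1.

Result: Larger models such as DeBERTa-v3 and Qwen2.5-VL show notable gains, but identifying the ‘Victim’ class and generalising across cultural and code-mixed content remain difficult; hybrid prompts give marginal yet consistent improvements.

Insight: Cultural grounding, prompt engineering, and multimodal reasoning all matter for modelling subtle narrative framings in visual-textual content.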

Abstract: This work investigates the challenging task of identifying narrative roles - Hero, Villain, Victim, and Other - in Internet memes, across three diverse test sets spanning English and code-mixed (English-Hindi) languages. Building on an annotated dataset originally skewed toward the ‘Other’ class, we explore a more balanced and linguistically diverse extension, originally introduced as part of the CLEF 2024 shared task. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes, in contrast to synthetically curated hateful content, which exhibits explicit and repetitive lexical markers. To benchmark the role detection task, we evaluate a wide spectrum of models, including fine-tuned multilingual transformers, sentiment and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models. Performance is assessed under zero-shot settings using precision, recall, and F1 metrics. While larger models like DeBERTa-v3 and Qwen2.5-VL demonstrate notable gains, results reveal consistent challenges in reliably identifying the ‘Victim’ class and generalising across cultural and code-mixed content. We also explore prompt design strategies to guide multimodal models and find that hybrid prompts incorporating structured instructions and role definitions offer marginal yet consistent improvements. Our findings underscore the importance of cultural grounding, prompt engineering, and multimodal reasoning in modelling subtle narrative framings in visual-textual content.
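The abstract reports precision, recall, and F1 for the four roles. As a minimal sketch of how such per-class scores are computed, the snippet below uses plain Python; the lowercase label strings and the function name `per_class_prf` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

# Hypothetical label set mirroring the four narrative roles in the abstract.
ROLES = ["hero", "villain", "victim", "other"]

def per_class_prf(gold, pred, labels=ROLES):
    """Return {label: (precision, recall, f1)} for parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores
```

Reporting scores per class rather than only an aggregate makes a weak class such as ‘Victim’ visible even when overall accuracy looks acceptable.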

[18] Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning

Zhaoye Fei,Li Ji,Siyin Wang,Junhao Shi,Jingjing Gong,Xipeng Qiu

Main category: cs.CL

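TL;DR: The paper proposes Embodied Planner-R1, a reinforcement learning framework that, through autonomous exploration and interactive learning, substantially improves LLMs' embodied task planning and achieves strong results on multiple benchmarks.

Details Motivation: Existing LLMs fall short on embodied task planning and struggle to capture causal relationships between actions and environmental feedback, especially in partially observable environments. The paper aims to close this gap with a reinforcement learning framework.

Contribution: 1. Introduces a reinforcement learning framework that needs no human annotations and relies on parallel exploration and sparse rewards; 2. Proposes the Interactive Policy Optimization (IPO) algorithm for efficient learning from grouped trajectories; 3. Substantially outperforms existing methods on two embodied task benchmarks.

Method: 1. Group rollout with parallel exploration for in-environment interaction; 2. A completion-driven sparse reward; 3. Policy learning optimized with Interactive Policy Optimization (IPO).

Result: Achieves completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, with only a 3.66% drop in previously unseen environments.

Insight: Autonomous exploration and interactive learning via reinforcement learning can effectively improve LLMs on embodied tasks, especially in complex, dynamic environments.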

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they face significant challenges in embodied task planning scenarios that require continuous environmental understanding and action generation. Existing approaches generate open-loop action scripts based on static knowledge, making it difficult to learn causal relationships between actions and environmental feedback, particularly in partially observable environments. We introduce Embodied Planner-R1, a novel outcome-driven reinforcement learning framework that enables LLMs to develop interactive capabilities through autonomous exploration with minimal supervision. Our framework incorporates three key innovations: (1) Without human annotations, we employ pure reinforcement learning with group rollout, incorporating in-environment interaction through parallel exploration; (2) completion-driven sparse reward; and (3) Interactive Policy Optimization (IPO) for efficient learning from grouped trajectories. Across two challenging text-based embodied planning benchmarks, Embodied Planner-R1 achieves impressive completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, surpassing prior methods by a large margin, and suffers only a 3.66% drop in previously unseen environments, evidencing strong generalization.
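The abstract does not spell out how IPO turns grouped trajectories and a sparse completion reward into a learning signal. As a rough illustration only, the sketch below shows one common way to do this, a group-relative baseline in the style of GRPO, which is an assumption here and not necessarily what IPO does. The function name is hypothetical.

```python
def group_relative_advantages(rewards):
    """Normalize sparse rewards within one group of rollouts.

    `rewards` holds one scalar per trajectory sampled for the same task,
    e.g. 1.0 if the task was completed and 0.0 otherwise (an assumed
    reward convention). The advantages are zero-mean within the group.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # Every rollout got the same reward, so the group carries no signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

With a completion-only reward, groups in which every rollout succeeds or every rollout fails contribute nothing under this scheme, which is one reason parallel exploration over several rollouts per task is useful.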

[19] Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format

Dingzirui Wang,Xuanliang Zhang,Rongyu Cao,Longxu Dou,Xianzhen Luo,Yingwei Ma,Qingfu Zhu,Wanxiang Che,Binhua Li,Fei Huang,Yongbin Li

Main category: cs.CL

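TL;DR: Format-Adapter improves the reasoning ability of LLMs by generating and selecting reasoning formats suited to the task, avoiding the high cost and poor fit of human-labeled formats.

Details Motivation: Existing methods rely on human-labeled reasoning formats, which fit different tasks poorly and are costly to label, so the best reasoning format needs to be adapted to each task automatically.

Contribution: Proposes an error-measurement-based method for generating and selecting formats, enabling task-adaptive optimization of the reasoning format.

Method: LLMs generate and filter reasoning formats so as to minimize the proposed error measurement, thereby selecting formats suited to the task.

Result: On math and commonsense reasoning tasks, average performance improves by 4.3%, confirming the method's effectiveness.

Insight: Automatically generating and selecting reasoning formats is a feasible and efficient strategy that can markedly improve the reasoning ability of LLMs.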

Abstract: Generating and voting multiple answers is an effective method to mitigate reasoning inconsistencies of large language models (LLMs). Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. However, previous works using multiple formats rely on formats labeled by humans, which could be unsuitable for all tasks and have high labeling costs. To address this issue, we adapt suitable formats to the given tasks by generating and selecting formats. We first propose how to measure the reasoning error when generating multiple answers. Then, we introduce Format-Adapter, which utilizes LLMs to generate and select suitable reasoning formats by minimizing the error measurement we present. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works, demonstrating the effectiveness.
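The starting point of the abstract, generating several answers and voting, can be sketched in a few lines. The disagreement rate returned below is only a hypothetical proxy for inconsistency among answers; the abstract does not define the paper's error measurement, so it should not be read as that measure.

```python
from collections import Counter

def majority_vote(answers):
    """Return (winning answer, disagreement rate) for a list of answers.

    Each answer is assumed to come from one reasoning format or sample.
    Ties are broken in favour of the answer that appears first.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    winner = next(a for a in answers if counts[a] == top)
    # Fraction of answers that disagree with the winner (hypothetical proxy).
    disagreement = 1.0 - top / len(answers)
    return winner, disagreement
```

For example, `majority_vote(["12", "12", "15", "12"])` selects `"12"` with a disagreement rate of 0.25.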

[20] Benchmarking Deep Search over Heterogeneous Enterprise Data

Prafulla Kumar Choubey,Xiangyu Peng,Shilpa Bhagavath,Kung-Hsiang Huang,Caiming Xiong,Chien-Sheng Wu

Main category: cs.CL

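TL;DR: The paper presents a new benchmark for evaluating Deep Search over heterogeneous enterprise data, revealing the limitations of current retrieval-augmented generation (RAG) methods in complex multi-hop reasoning.

Details Motivation: There is no realistic and complex benchmark for evaluating RAG systems on heterogeneous enterprise data, especially in scenarios that require multi-hop reasoning.

Contribution: 1. Proposes a synthetic data pipeline that simulates enterprise workflows and generates interconnected heterogeneous data and multi-hop questions; 2. Releases a benchmark with a retrieval pool of 39,190 enterprise artifacts for evaluating long-context LLM and RAG systems.

Method: 1. Generate heterogeneous enterprise data with the synthetic data pipeline; 2. Design multi-hop questions and answers; 3. Evaluate existing RAG methods and analyze their bottlenecks.

Result: Experiments show that even the best-performing RAG methods reach an average score of only 32.96, with retrieval as the main bottleneck.

Insight: Current RAG methods fall clearly short on heterogeneous enterprise data, with room for improvement especially in multi-hop reasoning and in retrieving complete evidence.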

Abstract: We present a new benchmark for evaluating Deep Search, a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries, and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of 32.96 on our benchmark. With further analysis, we highlight retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.

[21] Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions

Dingzriui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng

Main category: cs.CL

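TL;DR: The paper proposes a new metric, the Learning-to-Context Slope (LCS), to quantify the effectiveness of in-context learning (ICL), addressing the limitations of existing evaluation approaches.

Details Motivation: Existing ICL evaluation relies on performance change and suffers from low reliability, poor attribution, and impracticality when data is insufficient. The authors aim to provide a more reliable metric.

Contribution: Proposes the LCS metric, which quantifies ICL effectiveness by modeling the slope between learning gain and contextual relevance, addressing the shortcomings of existing methods.

Method: LCS assesses ICL by computing the slope between learning gain (the loss decrease from demonstrations) and contextual relevance (the relevance between demonstrations and the input).

Result: Experiments show that LCS correlates strongly with performance improvements in labeled settings and reliably reflects the true effectiveness of ICL in data-scarce or biased scenarios.

Insight: LCS reveals which model capabilities are critical to ICL success, and the analysis yields actionable thresholds for LCS.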

Abstract: In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.
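The abstract describes LCS as a slope between learning gain and contextual relevance but gives no estimator. As a hedged sketch, one natural reading is an ordinary least-squares slope over paired measurements. The function name, the input format, and the choice of least squares are all assumptions here, not the paper's definition.

```python
def fit_slope(relevance, gain):
    """Least-squares slope of learning gain against contextual relevance.

    relevance[i]: assumed relevance score between demonstrations and input i.
    gain[i]: assumed loss decrease on input i after adding demonstrations.
    """
    if len(relevance) != len(gain) or len(relevance) < 2:
        raise ValueError("need two or more paired observations")
    n = len(relevance)
    mean_x = sum(relevance) / n
    mean_y = sum(gain) / n
    sxx = sum((x - mean_x) ** 2 for x in relevance)
    if sxx == 0:
        raise ValueError("relevance values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(relevance, gain))
    return sxy / sxx
```

Under this reading, a clearly positive slope means the loss drops more when demonstrations are more relevant to the input, while a slope near zero means relevance makes little difference to the gain.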

[22] V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy

Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng

Main category: cs.CL

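TL;DR: The paper proposes V-Synthesis, which uses the V-Score metric to synthesize consistent and diverse in-context learning demonstrations from scratch for arbitrary tasks, improving performance.

Details Motivation: The goal is to reduce the high labeling cost of in-context learning (ICL) demonstrations. Existing synthesis methods are task-specific or depend on pre-existing demonstrations, and a general from-scratch solution is lacking.

Contribution: Proposes the V-Score consistency metric and the V-Synthesis method, which synthesizes demonstrations from scratch for arbitrary tasks while ensuring consistency and diversity.

Method: Uses V-Score for proportional sampling to refine the synthesized demonstrations, combining consistency with diversity.

Result: Experiments show that V-Synthesis improves average performance by 2.0%, confirming its effectiveness.

Insight: V-Score is computationally cheap and performs well, and V-Synthesis achieves high-quality synthesis without relying on pre-existing demonstrations.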

Abstract: High labeling cost for in-context learning (ICL) demonstrations motivates using large language models (LLMs) for synthesis to reduce overhead. However, existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations. So this paper focuses on synthesizing demonstrations from scratch for arbitrary tasks. A major challenge in synthesizing from scratch is ensuring consistency with the target task, as the lack of labeling guidance could lead to synthesis bias. We first propose a consistency metric called V-Score, which has higher performance and lower computation cost compared with the metrics based on grams or embedding vectors. Furthermore, we introduce V-Synthesis, which leverages V-Score for proportional sampling to ensure both high consistency and diversity of synthesized demonstrations. Experimental results demonstrate that V-Synthesis yields an average performance improvement of 2.0% compared to existing synthesis methods confirming the effectiveness of V-Synthesis.
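Proportional sampling, as mentioned in the abstract, can be illustrated with a short sketch: candidates are drawn with probability proportional to a score, so high-scoring demonstrations are favoured without low-scoring ones being excluded outright. The scores below stand in for V-Score values, which are not computed here; the function name and the without-replacement choice are assumptions.

```python
import random

def proportional_sample(candidates, scores, k, seed=0):
    """Draw up to k distinct candidates, each with probability
    proportional to its (non-negative) score among those remaining."""
    rng = random.Random(seed)
    pool = list(zip(candidates, scores))
    chosen = []
    for _ in range(min(k, len(pool))):
        weights = [score for _, score in pool]
        if sum(weights) <= 0:
            break  # only zero-score candidates are left
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(index)[0])
    return chosen
```

Compared with simply taking the top-k by score, sampling keeps some randomness in the selection, which is one simple way to trade a little consistency for more diversity.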

[23] Generalist Reward Models: Found Inside Large Language Models

Yi-Chen Li,Tian Xu,Yang Yu,Xuqin Zhang,Xiong-Hui Chen,Zhongxiang Ling,Ningjing Chao,Lei Yuan,Zhi-Hua Zhou

Main category: cs.CL

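TL;DR: The paper finds that a latent generalist reward model already exists inside large language models (LLMs), so a high-quality reward signal can be elicited without further training, and it proves theoretically that reinforcement learning with this reward improves the model.

Details Motivation: LLM alignment currently depends on reward models trained on costly human preference data, and existing AI-feedback methods lack theoretical support. The paper seeks a more efficient and theoretically grounded alignment method.

Contribution: 1. Proves that a generalist reward model is already present in LLMs trained with standard next-token prediction; 2. Shows that this endogenous reward is equivalent to one learned through offline inverse reinforcement learning; 3. Gives a theoretical proof of the effectiveness of reinforcement learning.

Method: Theoretical analysis shows that the endogenous reward signal in an LLM is equivalent to offline inverse reinforcement learning; the signal is elicited directly and used for reinforcement-learning-based alignment.

Result: Experiments show that the method outperforms existing LLM-as-a-judge approaches and can even surpass explicitly trained reward models.

Insight: Pre-training already captures sufficient knowledge, so the reward modeling stage can be replaced by a more efficient, principled method, offering a new paradigm for aligning LLMs and multi-modal models.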

Abstract: The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To the best of our knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for the alignment of LLMs as well as multi-modal models.

[24] Two Spelling Normalization Approaches Based on Large Language Models

Miguel Domingo,Francisco Casacuberta

Main category: cs.CL

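TL;DR: The paper proposes two spelling normalization approaches based on large language models, one trained without supervision and one trained for machine translation. Evaluation on datasets spanning several languages and historical periods shows that statistical machine translation still performs best on this task.

Details Motivation: The lack of standardized spelling conventions in historical documents, together with the organic evolution of language, poses a longstanding challenge for humanities scholars. Spelling normalization addresses this by aligning a document's orthography with contemporary standards.

Contribution: Proposes two spelling normalization approaches based on large language models, broadening the technical options in this area.

Method: The study uses two models: a large language model trained without supervision and a large language model trained for machine translation.

Result: Evaluation on datasets covering multiple languages and historical periods shows that both approaches yield encouraging results, but statistical machine translation remains the most suitable technology for this task.

Insight: Although LLM-based approaches perform well, traditional statistical machine translation still holds an advantage for spelling normalization.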

Abstract: The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document’s orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one that has been trained without supervised training, and a second that has been trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both approaches yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.

[25] Objective-Free Local Learning and Emergent Language Structure in Thinking Machines

P. Myles Eugenio

Main category: cs.CL

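TL;DR: The paper proposes a neuro-symbolic framework for generative language modeling based on local, event-driven emergent learning. Its core is a hierarchical Hopfield memory chain acting as a dynamic tokenizer, which builds symbolic representations and long-range dependencies through local learning.

Details Motivation: Conventional language models rely on predefined tokens and a global objective, and lack flexibility and the ability to learn locally. The paper explores local learning without a global objective, letting language structure emerge naturally from the dynamics of the network.

Contribution: 1. Proposes an unsupervised neuro-symbolic framework that dynamically generates multi-scale tokens through local learning and a Hopfield memory chain. 2. Shows that the model can extract natural language patterns from noise and generate synthetic languages whose morphology resembles that of human language. 3. Introduces a mechanism that binds multi-scale token features by activating a new neuron, supporting long-term memory and compositional inference.

Method: 1. A hierarchical Hopfield memory chain serves as a dynamic tokenizer (retokenizer), with no predefined tokens. 2. Local Hebbian learning builds multi-scale representations of symbol sequences and introduces redundancy (a gauge structure) to support long-range dependencies. 3. At inference time, a new neuron is activated to bind distributed token features into a symbolic embedding.

Result: The model generates internally coherent language patterns from noise, with morphology quantitatively similar to human language. The tokens and embeddings produced by local learning also support compositional inference and generalization.

Insight: 1. Learning without a global objective can produce structured language representations. 2. The dynamics of the Hopfield memory chain allow language structure to emerge from local learning. 3. The framework provides a methodological foundation for studying how symbolic structure forms in neural learning.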

Abstract: We present a neuro-symbolic framework for generative language modeling based on local, event-driven emergent learning. At its core is a hierarchical Hopfield memory chain acting as a compositional short-term memory and dynamic tokenizer (retokenizer). Rather than relying on predefined tokens or supervision, the model builds structure from scratch, learning symbol sequences as multi-scale representations. It constructs projection tensors that bind co-occurring features into hierarchical tokens, introducing redundancy (i.e., an emergent gauge structure) and enabling compression of local activations into long-range dependencies. Curiously, we find that the retokenizer can filter natural language patterns from noise, generating synthetic languages with coherent internal morphology that is quantifiably the same as human language. Language is learned in a local (Hebbian) fashion, where model constraints dictate allowed emergent structure, and new information is retained in alignment with this structure. The absence of a global objective enables a form of plasticity not found in conventional language models, allowing the system to generalize beyond its initial inference class, even without explicit data. We demonstrate that briefly activating a new neuron during inference binds distributed multi-scale token features into a symbolic embedding. These emergent embedding neurons act as long-term memory and support a key-value mechanism for compositional inference and generalization. This architecture provides a methodological foundation for studying how symbolic structure can emerge from local neural learning. It offers a new pathway for building scalable, interpretable neuro-symbolic systems, where tokens, grammar, and reasoning arise as compressed memory traces within a Hopfield hierarchy. This approach advances the development of neuromorphic architectures for generative language models.

[26] ATGen: A Framework for Active Text Generation

Akim Tsvigun,Daniil Vasilev,Ivan Tsvigun,Ivan Lysenko,Talgat Bektleuov,Aleksandr Medvedev,Uliana Vinogradova,Nikita Severin,Mikhail Mozikov,Andrey Savchenko,Rostislav Grigorev,Ramil Kuleev,Fedor Zhdanov,Artem Shelmanov,Ilya Makarov

Main category: cs.CL

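TL;DR: ATGen is a framework that combines active learning (AL) with text generation tasks. It supports annotation by both human annotators and large language models (LLMs), reducing annotation effort and API call costs.

Details Motivation: Although AL is very effective at reducing annotation cost, its application to natural language generation (NLG) tasks has been limited. The authors propose the ATGen framework to fill this gap.

Contribution: ATGen provides a unified platform for applying AL strategies to NLG tasks, simplifies the annotation workflow, and supports LLMs deployed as services or on-premises.

Method: The framework combines human annotators with LLM-based automatic annotation agents and supports implementing and benchmarking a variety of AL strategies.

Result: Experiments show that ATGen reduces both the effort of human annotators and the cost of LLM API calls.

Insight: Combining AL with NLG is an effective way to reduce annotation cost, and the automatic annotation ability of LLMs improves efficiency further.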

Abstract: Active learning (AL) has demonstrated remarkable potential in reducing the annotation effort required for training machine learning models. However, despite the surging popularity of natural language generation (NLG) tasks in recent years, the application of AL to NLG has been limited. In this paper, we introduce Active Text Generation (ATGen) - a comprehensive framework that bridges AL with text generation tasks, enabling the application of state-of-the-art AL strategies to NLG. Our framework simplifies AL-empowered annotation in NLG tasks using both human annotators and automatic annotation agents based on large language models (LLMs). The framework supports LLMs deployed as services, such as ChatGPT and Claude, or operated on-premises. Furthermore, ATGen provides a unified platform for smooth implementation and benchmarking of novel AL strategies tailored to NLG tasks. Finally, we present evaluation results for state-of-the-art AL strategies across diverse settings and multiple text generation tasks. We show that ATGen reduces both the effort of human annotators and costs associated with API calls to LLM-based annotation agents. The code of the framework is available on GitHub under the MIT license. The video presentation is available at http://atgen-video.nlpresearch.group

[27] Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs

Taejin Kim,Siun-Chuon Mau,Konrad Vesey

Main category: cs.CL

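TL;DR: The paper proposes Perspective-Dial, which builds a Perspective Space to quantify the perspective of text and uses systematic prompt engineering to control the perspective of LLM output, addressing the measurement and control of LLM bias.

Details Motivation: LLMs are widely used in mission-critical roles, yet there is no quantifiable understanding of the bias and perspective in their output, so a method is needed to measure and control the perspective of text quantitatively.

Contribution: Proposes Perspective Space, a metric space for quantifying the perspective of text, and a systematic prompt engineering method based on greedy coordinate descent for controlling the perspective of LLM output.

Method: 1. Build the Perspective Space to quantify different perspectives; 2. Use prompt engineering driven by greedy coordinate descent to steer the perspective of LLM output.

Result: The method can quantify and adjust the perspective of LLM output, with applications such as bias detection and analysis of public discourse.

Insight: An empirical approach works around the lack of a principled understanding of perspective, offering a new direction for fairness control in LLMs.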

Abstract: Large language models (LLMs) are used in a variety of mission-critical roles. Due to the rapidly developing nature of LLMs, there is a lack of quantifiable understanding of the bias and perspective associated with LLM output. Inspired by this need, this paper considers the broader issue of perspective or viewpoint of general text and perspective control of large language model (LLM) output. Perspective-Dial consists of two main components: (1) a metric space, dubbed Perspective Space, that enables quantitative measurements of different perspectives regarding a topic, and (2) Systematic Prompt Engineering, which uses greedy coordinate descent to control LLM output perspective based on measurement feedback from the Perspective Space. The empirical nature of the approach allows progress to sidestep a principled understanding of perspective or bias, effectively quantifying and adjusting outputs for a variety of topics. Potential applications include detection, tracking, and mitigation of LLM bias; narrative detection, sense-making, and tracking in public discourse; and debate bots advocating a given perspective.
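Greedy coordinate descent over a prompt can be sketched as follows, under the assumption that a prompt is a set of named slots, each with a few discrete options, and that a `measure` function maps a filled prompt to a scalar position in the Perspective Space. In the paper that measurement would involve generating LLM output and scoring it; here it is a stub supplied by the caller, and all names are hypothetical.

```python
def greedy_coordinate_descent(slots, measure, target, sweeps=3):
    """Tune one prompt slot at a time to move `measure` toward `target`.

    slots: dict mapping slot name -> list of candidate values.
    measure: callable taking a dict of slot choices and returning a float
             (stands in for generating text and scoring its perspective).
    """
    current = {name: options[0] for name, options in slots.items()}
    best_error = abs(measure(current) - target)
    for _ in range(sweeps):
        improved = False
        for name, options in slots.items():
            for option in options:
                trial = dict(current, **{name: option})
                error = abs(measure(trial) - target)
                if error < best_error:
                    current, best_error, improved = trial, error, True
        if not improved:
            break  # no single-slot change helps any further
    return current, best_error
```

Each step changes a single coordinate (slot) and keeps the change only if it reduces the measured gap, so the procedure needs only measurement feedback and no gradients.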

[28] Hierarchical Memory Organization for Wikipedia Generation

Eugene J. Yu,Dawei Zhu,Yifan Song,Xiangyu Wong,Jiebin Zhang,Wenxuan Shi,Xiaoguang Li,Qun Liu,Sujian Li

Main category: cs.CL

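TL;DR: The paper proposes MOG, a generation framework based on hierarchical memory organization for automatically generating Wikipedia articles. Organizing memory units into a hierarchy improves the accuracy and verifiability of the information.

Details Motivation: Automatically generating Wikipedia articles requires integrating accurate and comprehensive information from diverse sources while avoiding hallucinations. Conventional methods struggle to balance informativeness and reliability, so a more effective structured approach is needed.

Contribution: The main contribution is the MOG framework, which uses hierarchically organized memory units and a citation module to improve the accuracy and traceability of generated articles.

Method: MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchy, and uses this structure to guide generation. A citation module ensures that every generated sentence can be traced back to specific memory units.

Result: On the newly created WikiStart dataset, MOG outperforms baseline methods at producing informative and reliable Wikipedia articles and is particularly robust in real-world scenarios.

Insight: Hierarchical memory organization improves the structure and reliability of generated text, and the citation module strengthens traceability and reduces hallucinations.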

Abstract: Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.

[29] What to Keep and What to Drop: Adaptive Table Filtering Framework

Jang Won June

Main category: cs.CL

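TL;DR: The paper proposes ATF (Adaptive Table Filtering Framework), a modular, question-aware pipeline that prunes uninformative columns and rows so that large tables fit within LLM input limits.

Details Motivation: LLMs for table-based reasoning often struggle with large tables because of input length limits.

Contribution: Proposes ATF, a filtering pipeline that integrates with existing models (e.g., TAPAS, TAPEX) without retraining.

Method: ATF prunes columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores.

Result: ATF reduces table cells by ~70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical.

Insight: Filtering must balance informativeness and minimalism, and the right balance depends on the task.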

Abstract: Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by ~70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF’s ability to adaptively balance informativeness and minimalism across tasks.

[30] Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably

Zhihao Zhang,Qiaole Dong,Qi Zhang,Jun Zhao,Enyu Zhou,Zhiheng Xi,Senjie Jin,Xiaoran Fan,Yuhao Zhou,Yanwei Fu,Tao Ji,Tao Gui,Xuanjing Huang

Main category: cs.CL

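TL;DR: The paper studies how Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) behave when multimodal large language models learn a new task. SFT learns quickly but causes catastrophic forgetting, whereas RFT learns more slowly but preserves prior knowledge. An analysis of learning dynamics shows that RFT reduces forgetting by reinforcing samples that align with the base model's probability landscape.

Details Motivation: The behavior of post-training algorithms such as SFT and RFT on multimodal large language models, and their effect on prior knowledge, remains unclear. The paper introduces a novel task (jigsaw puzzles) to study both methods, aiming to understand their learning dynamics and forgetting mechanisms.

Contribution: 1. Reveals the trade-off between SFT and RFT on novel tasks: SFT learns quickly but causes forgetting, while RFT learns slowly but preserves knowledge.
2. Shows that RFT reduces forgetting by reinforcing samples aligned with the base model's probabilities.
3. Shows that training SFT on correct RFT-simulated rollouts lets it preserve knowledge while still learning quickly.

Method: Introduces the jigsaw puzzle task on the open-source multimodal model Qwen2.5-VL and compares SFT with RFT. The advantage of RFT is explained by analyzing learning dynamics, in particular alignment with the probability landscape.

Result: SFT acquires the new task quickly but causes catastrophic forgetting; RFT learns more slowly but preserves prior knowledge. Supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge.

Insight: Data distribution, rather than algorithmic differences, plays the central role in catastrophic forgetting, and RFT has potential for stable continual learning in multimodal large language models.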

Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon through the lens of learning dynamics, showing that RFT reinforces correct samples that are naturally aligned with the base model’s probability landscape, mitigating interference with prior knowledge. Moreover, supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly learning new tasks. These findings suggest that data distribution, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT’s potential for stable continual learning in multimodal large language models.

[31] Semantic-guided Diverse Decoding for Large Language Model

Weijie Shi,Yue Cui,Yaguang Wu,Jingzhi Fang,Shibo Zhang,Mengze Li,Sirui Han,Jia Zhu,Jiajie Xu,Xiaofang Zhou

Main category: cs.CL

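TL;DR: The paper proposes Semantic-guided Diverse Decoding (SemDiD), which operates in embedding space and combines orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment to balance semantic diversity with content quality.

Details Motivation: Existing diverse decoding methods for large language models mainly achieve lexical rather than semantic diversity, which limits applications that need several semantically distinct outputs, such as Best-of-N strategies, group-based reinforcement learning, and data synthesis.

Contribution: Proposes SemDiD, which operates directly in embedding space and balances semantic diversity with quality through three complementary mechanisms (orthogonal directional guidance, dynamic inter-group repulsion, position-debiased probability assessment).

Method: SemDiD combines adaptive gain functions with constraint optimization to ensure semantic differentiation while maintaining quality thresholds. The three mechanisms shape the decoding process from different angles.

Result: Experiments show that SemDiD outperforms existing methods, improving Best-of-N coverage by 1.4-5.2%, accelerating RLHF training convergence by 15%, and increasing accuracy by up to 2.1%.

Insight: Semantic diversity matters more than lexical diversity, and operating in embedding space is an effective way to achieve it. Adaptive mechanisms can reconcile the conflict between diversity and quality.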

Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), which operates directly in embedding space and balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.

[32] L0: Reinforcement Learning to Become General Agents

Junjie Zhang,Jingyi Xi,Zhuoyang Song,Junyu Lu,Yuhua Ke,Ting Sun,Yukun Yang,Jiaxing Zhang,Songxin Zhang,Zejian Xie

Main category: cs.CL

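TL;DR: The paper presents L0, a scalable end-to-end training pipeline for general-purpose agents. A low-cost concurrent agent worker pool and the "code-as-action" NB-Agent scaffold make reinforcement learning in complex environments more efficient. Experiments show strong results on question-answering tasks.

Details Motivation: Training LLMs as autonomous agents for multi-turn, long-horizon tasks currently suffers from problems of scalability and training efficiency. L0 aims to solve these through efficient system design.

Contribution: 1. Proposes L0, a scalable, end-to-end training pipeline; 2. Designs the NB-Agent scaffold, which supports the code-as-action mode; 3. Open-sources the full system, models, and training recipes.

Method: 1. A low-cost concurrent agent worker pool improves training efficiency; 2. NB-Agent executes code actions through a REPL; 3. Models are trained with Reinforcement Learning with Verifiable Rewards (RLVR).

Result: On Qwen2.5-7B-Instruct, accuracy rises from 30% to 80% on SimpleQA and from 22% to 41% on HotpotQA.

Insight: Code-driven action design and an efficient training pipeline can substantially improve LLM performance on complex tasks.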

Abstract: Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks still poses significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a “code-as-action” fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30% to 80% and on HotpotQA from 22% to 41%. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes at https://github.com/cmriat/l0.
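The "code-as-action" idea can be illustrated with a toy REPL in which each agent action is a snippet of Python executed in a persistent namespace, and the captured output is returned as the observation. This is a minimal sketch with a hypothetical class name, not NB-Agent's implementation; in particular it provides no sandboxing, which L0 handles through its sandboxed worker pool.

```python
import contextlib
import io

class MiniRepl:
    """Toy read-eval-print loop: state persists across actions."""

    def __init__(self):
        self.namespace = {}

    def run(self, code):
        """Execute one code action and return what it printed.

        WARNING: exec() runs arbitrary code; never use this on untrusted
        model output without real isolation.
        """
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:
            # Surface the error as an observation instead of crashing.
            buffer.write(f"{type(exc).__name__}: {exc}\n")
        return buffer.getvalue()
```

Because the namespace persists, a later action can reuse variables defined by an earlier one, which is what lets an agent build up intermediate results over a multi-turn task.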

[33] Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model

Bowen Ding,Yuhan Chen,Futing Wang,Lingfeng Ming,Tao Lin

Main category: cs.CL

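TL;DR: The paper proposes DuP-PO, a new algorithm that addresses the "thinking trap" in large reasoning models by regulating the use of thinking tokens, improving reasoning efficiency and performance.

Details Motivation: On simple tasks, large reasoning models produce redundant thinking tokens (such as "wait" and "however") that trigger unnecessary reasoning behaviors, lowering efficiency and sometimes even harming correctness. This is called the "thinking trap".

Contribution: The main contributions are: 1) identifying the thinking trap; 2) proposing the DuP-PO algorithm, which regulates the use of thinking tokens through balanced exposure, dynamic advantage control, and policy shaping.

Method: DuP-PO comprises: 1) a rollout sampling strategy that balances exposure to responses with and without thinking tokens; 2) a fine-grained advantage control technique; 3) a policy shaping method. Together these dynamically regulate token prediction.

Result: On five math reasoning benchmarks, DuP-PO significantly improves the token efficiency of large reasoning models while preserving the performance strengths of the base model.

Insight: Thinking tokens do not always help reasoning; regulating their use sensibly can improve efficiency and avoid the thinking trap.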

Abstract: Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., wait, however). These tokens trigger unnecessary high-level reasoning behaviors like reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) a rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) a fine-grained advantage control technique to dynamically regulate the prediction of target tokens; and (3) a policy shaping method ensuring stable gradient contributions from thinking tokens. Experimental results on five popular math reasoning benchmarks show that DuP-PO performs well on popular LRMs, significantly improving their token efficiency during reasoning while achieving superior performance over the base model.

[34] Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It

Seyed Mahed Mousavi,Edoardo Cecchinato,Lucia Hornikova,Giuseppe Riccardi

Main category: cs.CL

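TL;DR: The paper systematically audits three widely used reasoning benchmarks (SocialIQa, FauxPas-EAI, and ToMi) and uncovers pervasive flaws in benchmark design and evaluation methodology. It shows that high model scores may depend more on format-specific cues than on genuine reasoning ability.

Details Motivation: The design and evaluation methodology of current reasoning benchmarks have serious problems, so high scores may not reflect real reasoning ability. The authors use an audit to expose these problems and to argue for more reliable evaluation.

Contribution: 1. A systematic audit reveals structural, semantic, and pragmatic issues in the benchmarks; 2. Finds that model performance is highly sensitive to minor input variations; 3. Argues for evaluation protocols based on the reasoning process; 4. Releases audited data and tools to support more transparent evaluation.

Method: Uses five LLMs (GPT-3, GPT-3.5, GPT-4, o1, and LLaMA 3.1) as diagnostic tools, combined with systematic human annotation and re-evaluation on cleaned benchmark subsets.

Result: High model scores depend more on alignment with format-specific cues than on genuine reasoning. Re-evaluation on cleaned benchmark subsets gives a picture of model performance closer to actual reasoning ability.

Insight: Current benchmark scores are unreliable; evaluation protocols should assess reasoning as a process rather than as static output selection.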

Abstract: We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to erratic surface wording variations and not due to improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with format-specific cues rather than consistent inference based on the input. These findings challenge the validity of current benchmark-based claims about reasoning in LLMs, and highlight the need for evaluation protocols that assess reasoning as a process of drawing inference from available information, rather than as static output selection. We release audited data and evaluation tools to support more interpretable and diagnostic assessments of model reasoning.

[35] Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting

André de Souza Loureiro,Jorge Valverde-Rebaza,Julieta Noguez,David Escarcega,Ricardo Marcacini

Main category: cs.CL

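TL;DR: This paper proposes the MAPS framework, which uses multi-layered self-reflection with auto-prompting to improve the performance of large language models on multi-step mathematical reasoning.

Details Motivation: Current LLMs struggle on multi-step reasoning tasks, and traditional static prompting does not adapt well to complex problems, so a dynamically refined approach is needed.

Contribution: Proposes the MAPS framework, which combines Chain of Thought (CoT), self-reflection, and auto-prompting, and iteratively refines the reasoning by dynamically adjusting prompts.

Method: A multi-layered self-reflection mechanism: the model first generates a CoT solution; when errors are detected, adaptive self-reflection analyzes them and generates tailored prompts; the reasoning is then corrected iteratively.

Result: On four benchmarks, MAPS significantly outperforms standard CoT and reaches performance comparable to specialized reasoning models, while limiting reflection depth to balance cost and performance.

Insight: Dynamic prompt refinement is key to improving reasoning ability, but reflection depth must be traded off against computational cost.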

Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved their problem-solving capabilities. However, these models still struggle when faced with complex multi-step reasoning tasks. In this paper, we propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, a novel approach designed to enhance multi-step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self-reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well-established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning-optimized models. In addition, MAPS enables general-purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade-off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.
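
The iterative refinement loop described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `solve`, `looks_wrong`, and `reflect` are hypothetical stand-ins for the LLM call, the error detector, and the self-reflection step, and the prompt wording and the `max_depth` default are assumptions.

```python
from typing import Callable, Tuple


def iterative_refine(
    question: str,
    solve: Callable[[str], str],              # hypothetical: prompt -> answer
    looks_wrong: Callable[[str, str], bool],  # hypothetical: (question, answer) -> error detected?
    reflect: Callable[[str, str], str],       # hypothetical: (question, answer) -> tailored hint
    max_depth: int = 3,                       # assumed cap on reflection layers
) -> Tuple[str, int]:
    """Return the final answer and the number of reflection layers used."""
    # Layer 0: plain Chain-of-Thought style prompt (wording is an assumption).
    answer = solve(f"{question}\nLet's think step by step.")
    depth = 0
    # Reflect only while an error is detected and the depth budget allows it.
    while depth < max_depth and looks_wrong(question, answer):
        hint = reflect(question, answer)
        # Rebuild the prompt with the tailored hint and try again.
        answer = solve(f"{question}\n{hint}\nLet's think step by step.")
        depth += 1
    return answer, depth
```

Capping `max_depth` is one simple way to express the cost trade-off the abstract mentions: each extra layer adds at least one reflection call and one solve call.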

[36] The Trilemma of Truth in Large Language Models

Germans Savcisens,Tina Eliassi-Rad

Main category: cs.CL

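TL;DR: This paper examines how to assess the veracity of the internal knowledge of large language models (LLMs) and proposes sAwMIL, a probing method that combines multiple-instance learning and conformal prediction to separate statements into true, false, and neither. The authors also report several properties of the veracity signal in LLMs.

Details Motivation: Existing methods for probing the veracity of LLM internal knowledge have limitations, in particular flawed assumptions. The authors seek a more precise probing method that reveals how LLMs internally represent veracity.

Contribution: 1) sAwMIL, a probing method based on multiple-instance learning and conformal prediction that separates statements by veracity; 2) experiments that reveal properties of the veracity signal in LLMs, such as its asymmetry and its concentration in particular layers.

Method: sAwMIL uses the internal activations of LLMs, together with multiple-instance learning and conformal prediction, to classify input statements as true, false, or neither. It is evaluated on 16 open-source LLMs and 3 new datasets.

Result: 1) The veracity signal is often concentrated in the third quarter of an LLM's depth; 2) truth and falsehood signals are not always symmetric; 3) linear probes perform better on chat models than on default models; 4) nonlinear probes may be required for some LLMs trained with RLHF or knowledge distillation; 5) LLMs also capture a third signal that is neither true nor false.

Insight: The veracity signal in LLM internal representations is unevenly distributed and asymmetric, and nonlinear probing may be needed to recover it in some models. These findings offer a new perspective on how LLMs represent knowledge.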

Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they “know” certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM’s depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs “know” and how certain they are of their probabilistic internal knowledge.

[37] Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs

Yang Dai,Jianxiang An,Tianwei Lin,Hongyang He,Hongzhe Huang,Wenqiao Zhang,Zheqi Lv,Siliang Tang,Yueting Zhuang

Main category: cs.CL

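TL;DR: This paper proposes Graft, a unified parameter integration framework that combines domain knowledge across multimodal large language models (MLLMs) through efficient parameter synergy. Its core contributions are the Compatibility-Aware Parameter Splicing (CAPS) strategy and a domain compatibility scoring mechanism.

Details Motivation: Current MLLMs degrade on cross-domain tasks, and models fine-tuned for specific tasks lack a mechanism for sharing knowledge, which leaves knowledge fragmented.

Contribution: 1. The CAPS strategy, which combines local functional attribution with global information-theoretic signals to guide parameter fusion; 2. A domain compatibility scoring mechanism that quantifies alignment between expert models; 3. Integration at the granularity of low-rank adaptation layers, which keeps inference overhead minimal.

Method: Compatibility-Aware Parameter Splicing (CAPS) uses functional attribution and global information-theoretic signals to guide selective parameter fusion, with modular integration at the low-rank adaptation layer level.

Result: Evaluations on diverse multimodal benchmarks validate the framework and show its potential for compositional, domain-adaptive MLLMs.

Insight: Through modular parameter fusion and compatibility scoring, Graft offers a scalable approach to cross-domain knowledge sharing while preserving structural modularity.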

Abstract: Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their applicability tends to degrade when confronted with different types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, the study of knowledge sharing among domain-specific MLLMs–such as those trained for mathematics or code–remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By extending this mechanism to the low-rank adaptation layer granularity, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility scoring mechanism that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.

[38] Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Seungjun Yi,Joakim Nguyen,Huimin Xu,Terence Lim,Andrew Well,Mia Markey,Ying Ding

Main category: cs.CL

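TL;DR: This paper proposes Auto-TA, an automated thematic analysis method that uses multi-agent large language models (LLMs), optionally with reinforcement learning from human feedback (RLHF), to perform end-to-end thematic analysis of clinical narratives without manual coding.

Details Motivation: Clinical narratives about congenital heart disease (CHD) contain rich information on patient and caregiver experiences, but manual thematic analysis (TA) is labor-intensive and does not scale. The authors address this with an automated approach.

Contribution: 1. A fully automated, end-to-end TA pipeline based on multi-agent LLMs. 2. Optional reinforcement learning from human feedback (RLHF) to improve thematic relevance. 3. Support for patient-centered analysis of large qualitative datasets.

Method: 1. A multi-agent LLM framework in which each agent assumes a specific role to improve theme quality. 2. Optional RLHF to improve alignment between generated themes and human analysis. 3. The option to fine-tune LLMs for specific clinical contexts.

Result: The pipeline processes clinical narratives automatically, targets themes that align with human analysis, and supports analysis of large datasets.

Insight: Combining a multi-agent LLM framework with reinforcement learning can improve the quality and scalability of automated thematic analysis, particularly in complex clinical settings.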

Abstract: Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor-intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end-to-end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi-agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient-centered analysis of large qualitative datasets and allows LLMs to be fine-tuned for specific clinical contexts.

[39] Large Language Models Don’t Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective

Anselm R. Strohmaier,Wim Van Dooren,Kathrin Seßler,Brian Greer,Lieven Verschaffel

Main category: cs.CL

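TL;DR: This paper analyzes, from a mathematics education perspective, how well large language models (LLMs) solve mathematical word problems. LLMs solve standard problems (s-problems) efficiently but do not make sense of the real-world context, which limits their educational value.

Details Motivation: The study assesses the potential of LLMs in mathematics education, especially for word-problem solving, to determine whether they make sense of the problem context or complete the task by pattern matching alone.

Contribution: Three parts: 1) a contrast between how LLMs and students conceptualize and solve word problems; 2) a systematic review of 213 studies that reveals the limitations of popular datasets; 3) an empirical evaluation on 287 word problems, showing that LLMs excel on standard problems but struggle when the real-world context is problematic.

Method: A technical overview (contrasting the solution processes of LLMs and students), a literature review (analyzing the word-problem corpora used in 213 studies), and an empirical evaluation (testing 4 LLMs on 287 word problems).

Result: LLMs reach near-perfect accuracy on standard word problems (including a perfect score on 20 PISA problems) but show weaknesses on problems whose real-world context is problematic or nonsensical.

Insight: LLMs have mastered a superficial solution process without making sense of the problem context, so their value as instructional tools in mathematics classrooms is potentially limited.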

Abstract: The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require a consideration of realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems where the real-world context is problematic or non-sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.

[40] EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations

Hyunjong Kim,Sangyeop Kim,Jongheon Jeong,Yeongjae Cho,Sungzoon Cho

Main category: cs.CL

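TL;DR: This paper proposes EXPERT, an explainable evaluation metric for image captioning. Using structured explanations (based on fluency, relevance, and descriptiveness) and high-quality datasets, it achieves state-of-the-art results on benchmark datasets.

Details Motivation: Current explainable evaluation metrics generate explanations without standardized criteria, and the quality of those explanations is unverified, so a more reliable evaluation method is needed.

Contribution: Proposes EXPERT, a reference-free evaluation metric built on structured explanations, and uses large-scale high-quality datasets with a two-stage evaluation template to supervise both scoring and explanation generation.

Method: Builds datasets around a structured explanation framework based on fluency, relevance, and descriptiveness, and designs a two-stage evaluation template to train a vision-language model.

Result: Achieves state-of-the-art performance on benchmark datasets; human evaluation confirms that the generated explanations are of higher quality than those of existing metrics.

Insight: Structured explanations and high data quality are essential for making explainable evaluation metrics reliable.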

Abstract: Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.

[41] STACK: Adversarial Attacks on LLM Safeguard Pipelines

Ian R. McKenzie,Oskar J. Hollinsworth,Tom Tseng,Xander Davies,Stephen Casper,Aaron D. Tucker,Robert Kirk,Adam Gleave

Main category: cs.CL

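TL;DR: This paper develops and red-teams an open-source LLM safeguard pipeline. A few-shot-prompted input and output classifier outperforms ShieldGemma, yet the proposed STaged AttaCK (STACK) procedure reaches 71% attack success rate (ASR) against it on ClearHarm in a black-box setting.

Details Motivation: Frontier AI developers rely on layers of safeguards to protect against catastrophic misuse, but the security of such pipelines is unclear, and little prior work has evaluated or attacked them.

Contribution: 1. An open-source defense pipeline built on a few-shot-prompted input and output classifier; 2. The STACK staged attack procedure; 3. An evaluation of STACK in a transfer setting; 4. Suggested mitigations against staged attacks.

Method: Build the defense pipeline, compare its classifier with ShieldGemma across three attacks and two datasets, then attack the pipeline with STACK in black-box and transfer settings.

Result: The few-shot-prompted classifier reduces ASR to 0% on ClearHarm. STACK achieves 71% ASR on ClearHarm in the black-box attack and 33% ASR in the transfer setting.

Insight: The transfer result is initial evidence that attacks can be designed with no access to the target pipeline, so developers need specific mitigations against staged attacks.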

Abstract: Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.

[42] On the Predictive Power of Representation Dispersion in Language Models

Yanhong Li,Ming Li,Karen Livescu,Jiawei Zhou

Main category: cs.CL

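TL;DR: The paper finds that a language model's ability to predict text is tightly linked to the breadth of its embedding space: the higher the representation dispersion (the average pairwise cosine distance among hidden vectors), the lower the perplexity. The relationship holds across model families and domains, and dispersion is useful for practical tasks such as model selection and tuning retrieval-based methods.

Details Motivation: The study explores the relationship between a language model's predictive performance and its representation dispersion, with the aim of understanding and exploiting that relationship to improve models.

Contribution: (1) Shows a strong correlation between representation dispersion and perplexity; (2) proposes dispersion-based model selection that needs no labeled data; (3) uses dispersion to choose layers for retrieval-based methods; (4) designs a simple training objective that increases dispersion.

Method: Measures the average pairwise cosine distance among hidden vectors (dispersion) across models and domains, analyzes its correlation with perplexity, and uses dispersion for model selection and retrieval. A push-away training objective is added to increase dispersion.

Result: Dispersion correlates strongly and negatively with perplexity, and dispersion-guided choices work well for model selection and retrieval. The push-away objective increases dispersion and improves perplexity.

Insight: (1) Representation dispersion is an informative indicator of language model performance; (2) dispersion measured on unlabeled text enables data-efficient model evaluation; (3) a simple training objective can improve perplexity.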

Abstract: We show that a language model’s ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.
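
The dispersion measure defined in the abstract above (average pairwise cosine distance among hidden vectors) can be computed as in the following sketch. It assumes the hidden states have already been extracted into a NumPy array with one vector per row; the function name `representation_dispersion` and the choice of layer or token positions to sample are assumptions, not details from the paper.

```python
import numpy as np


def representation_dispersion(hidden: np.ndarray) -> float:
    """Average pairwise cosine distance among the rows of `hidden` (shape: n x d)."""
    # Normalize every vector to unit length; the clip guards against zero vectors.
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    unit = hidden / np.clip(norms, 1e-12, None)
    # Cosine similarity of every pair of rows.
    sim = unit @ unit.T
    # Keep each unordered pair once (strict upper triangle, no self-pairs).
    rows, cols = np.triu_indices(hidden.shape[0], k=1)
    # Cosine distance is one minus cosine similarity.
    return float(np.mean(1.0 - sim[rows, cols]))
```

Two orthogonal vectors give a dispersion of 1, and collinear vectors give 0, so larger values mean the representations are spread more widely.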

[43] Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models

David M. Smiley

Main category: cs.CL

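TL;DR: This study evaluates the potential of pre-trained transformer-based language models for detecting intertextual parallel passages in biblical Hebrew and finds that E5 and AlephBERT show significant promise in separating parallel from non-parallel passages.

Details Motivation: Traditional methods rely on manual comparison, which is labor-intensive and error-prone, so the study examines whether pre-trained models can make the detection of intertextual relationships more efficient and accurate.

Contribution: A benchmark evaluation of several transformer models on parallel-passage detection in biblical Hebrew, confirming the potential of E5 and AlephBERT.

Method: Word embeddings are generated with E5, AlephBERT, MPNet, and LaBSE, and the models are assessed with cosine similarity and Wasserstein distance.

Result: E5 excels at detecting parallel passages, while AlephBERT differentiates non-parallel passages more strongly. The findings indicate that pre-trained models can improve detection efficiency and accuracy.

Insight: Pre-trained language models can be applied effectively to the analysis of ancient texts, which suggests broader applications for ancient language studies.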

Abstract: Identifying parallel passages in biblical Hebrew is foundational in biblical scholarship for uncovering intertextual relationships. Traditional methods rely on manual comparison, which is labor-intensive and prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between the books of Samuel/Kings and Chronicles, I assessed each model’s capability to generate word embeddings that delineate parallel from non-parallel passages. Utilizing cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show significant promise, with E5 excelling in parallel detection and AlephBERT demonstrating stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.

cs.CV [Back]

[44] Robust Perspective Correction for Real-World Crack Evolution Tracking in Image-Based Structural Health Monitoring

Xinxin Sun,Peter Chang

Main category: cs.CV

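TL;DR: This paper proposes a physics-informed alignment framework that addresses image alignment challenges in structural health monitoring and substantially improves the accuracy of crack tracking.

Details Motivation: Traditional feature detectors such as SIFT and SURF suppress high-frequency edges and therefore localize thin cracks poorly, while lightweight binary methods such as ORB and BRISK have poor keypoint repeatability in complex scenes. Both limit the accuracy of structural health monitoring.

Contribution: A physics-informed alignment framework based on the KAZE architecture. It builds a crack-preserving scale space with nonlinear anisotropic diffusion and estimates the homography with RANSAC, giving accurate geometric correction without training or parameter tuning.

Method: Nonlinear anisotropic diffusion constructs a scale space that preserves the high-frequency information of cracks; RANSAC-based homography estimation provides robust alignment.

Result: Validated under varied real-world conditions. Crack area and spine length errors drop by up to 70% and 90%, respectively, while alignment error in key metrics stays below 5%.

Insight: Physics-driven nonlinear scale-space modeling preserves the high-frequency detail of cracks in complex scenes and provides an efficient approach to structural health monitoring that needs no training or calibration.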

Abstract: Accurate image alignment is essential for monitoring crack evolution in structural health monitoring (SHM), particularly under real-world conditions involving perspective distortion, occlusion, and low contrast. However, traditional feature detectors such as SIFT and SURF, which rely on Gaussian-based scale spaces, tend to suppress high-frequency edges, making them unsuitable for thin crack localization. Lightweight binary alternatives like ORB and BRISK, while computationally efficient, often suffer from poor keypoint repeatability on textured or shadowed surfaces. This study presents a physics-informed alignment framework that adapts the open KAZE architecture to SHM-specific challenges. By utilizing nonlinear anisotropic diffusion to construct a crack-preserving scale space, and integrating RANSAC-based homography estimation, the framework enables accurate geometric correction without the need for training, parameter tuning, or prior calibration. The method is validated on time-lapse images of masonry and concrete acquired via handheld smartphone under varied field conditions, including shadow interference, cropping, oblique viewing angles, and surface clutter. Compared to classical detectors, the proposed framework reduces crack area and spine length errors by up to 70 percent and 90 percent, respectively, while maintaining sub-5 percent alignment error in key metrics. Unsupervised, interpretable, and computationally lightweight, this approach supports scalable deployment via UAVs and mobile platforms. By tailoring nonlinear scale-space modeling to SHM image alignment, this work offers a robust and physically grounded alternative to conventional techniques for tracking real-world crack evolution.

[45] Counting with Confidence: Accurate Pest Monitoring in Water Traps

Xumin Gao,Mark Stevens,Grzegorz Cielniak

Main category: cs.CV

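TL;DR: This paper proposes a method for comprehensively evaluating pest counting confidence. It combines information related to the counting results with external environmental conditions to make pest monitoring more reliable.

Details Motivation: Existing vision-based automatic pest counting studies usually evaluate models on datasets with ground truth, but in real deployments there is no ground truth, so the reliability of the counting results cannot be assessed.

Contribution: The first method dedicated to comprehensively evaluating confidence in counting tasks; it quantifies, through a model, the relationship between influencing factors and counting confidence.

Method: A pest detection network performs detection and counting. The images then undergo image quality assessment, image complexity assessment, and pest distribution uniformity assessment, and a regression model predicts the final counting confidence.

Result: Compared with the baseline, the method reduces MSE by 31.7% and improves R² by 15.2% on the pest counting confidence test set.

Insight: A hypothesis-driven multi-factor sensitivity analysis and an adaptive DBSCAN clustering algorithm refine the assessment, offering a new way to evaluate reliability in practical deployments.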

Abstract: Accurate pest population monitoring and tracking their dynamic changes are crucial for precision agriculture decision-making. A common limitation in existing vision-based automatic pest counting research is that models are typically evaluated on datasets with ground truth but deployed in real-world scenarios without assessing the reliability of counting results due to the lack of ground truth. To this end, this paper proposes a method for comprehensively evaluating pest counting confidence in the image, based on information related to counting results and external environmental conditions. First, a pest detection network is used for pest detection and counting, extracting counting result-related information. Then, the pest images undergo image quality assessment, image complexity assessment, and pest distribution uniformity assessment. The changes in image clarity caused by stirring during image acquisition are quantified by calculating the average gradient magnitude. Notably, we design a hypothesis-driven multi-factor sensitivity analysis method to select the optimal image quality assessment and image complexity assessment methods, and we propose an adaptive DBSCAN clustering algorithm for pest distribution uniformity assessment. Finally, the obtained information related to counting results and external environmental conditions is input into a regression model for prediction, resulting in the final pest counting confidence. To the best of our knowledge, this is the first study dedicated to comprehensively evaluating counting confidence in counting tasks, and quantifying the relationship between influencing factors and counting confidence through a model. Experimental results show our method reduces MSE by 31.7% and improves R² by 15.2% on the pest counting confidence test set, compared to the baseline built primarily on information related to counting results.
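
The abstract above quantifies image clarity through the average gradient magnitude. One common way to compute such a score is sketched below; the use of `np.gradient` (central finite differences) and the function name `average_gradient_magnitude` are assumptions, since the abstract does not state which gradient operator the authors use.

```python
import numpy as np


def average_gradient_magnitude(gray: np.ndarray) -> float:
    """Mean gradient magnitude of a 2-D grayscale image; higher usually means sharper."""
    # np.gradient returns the derivative along rows (axis 0) first, then columns (axis 1).
    grad_rows, grad_cols = np.gradient(gray.astype(np.float64))
    # Per-pixel gradient magnitude, averaged over the whole image.
    return float(np.mean(np.sqrt(grad_rows ** 2 + grad_cols ** 2)))
```

A blurred frame (for example, one taken while the water is still moving after stirring) tends to score lower than a sharp frame of the same scene, which is what makes the value usable as a clarity indicator.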

[46] Scalable Dynamic Origin-Destination Demand Estimation Enhanced by High-Resolution Satellite Imagery Data

Jiachao Liu,Pablo Guarda,Koichiro Niinuma,Sean Qian

Main category: cs.CV

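TL;DR: This paper proposes a dynamic origin-destination (O-D) demand estimation framework that combines high-resolution satellite imagery with conventional sensor data. Vehicle density extracted by computer vision significantly improves estimation accuracy on links without sensors, and the framework is shown to work on large-scale networks.

Details Motivation: Conventional traffic data depend on sparse local sensors, whereas satellite imagery offers city-wide traffic information. This addresses data availability limits and opens a new route to dynamic O-D demand estimation.

Contribution: 1) A dynamic O-D demand estimation framework that combines satellite imagery with conventional data; 2) a computer vision pipeline that extracts vehicle density from satellite imagery; 3) validation of the framework's accuracy and scalability on large-scale networks.

Method: 1) A computer vision pipeline for class-specific vehicle detection and map matching; 2) a computational-graph-based model that calibrates dynamic network states by jointly matching conventional sensor data and satellite-derived density observations.

Result: Adding satellite imagery data significantly improves estimation accuracy, especially on links without sensors, and the framework handles large-scale networks.

Insight: Satellite imagery can compensate for the limitations of conventional traffic data and gives demand estimation on large networks a new data source, which also illustrates the potential of computer vision in transportation.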

Abstract: This study presents a novel integrated framework for dynamic origin-destination demand estimation (DODE) in multi-class mesoscopic network models, leveraging high-resolution satellite imagery together with conventional traffic data from local sensors. Unlike sparse local detectors, satellite imagery offers consistent, city-wide road and traffic information of both parking and moving vehicles, overcoming data availability limitations. To extract information from imagery data, we design a computer vision pipeline for class-specific vehicle detection and map matching, generating link-level traffic density observations by vehicle class. Building upon this information, we formulate a computational graph-based DODE model that calibrates dynamic network states by jointly matching observed traffic counts and travel times from local sensors with density measurements derived from satellite imagery. To assess the accuracy and scalability of the proposed framework, we conduct a series of numerical experiments using both synthetic and real-world data. The results of out-of-sample tests demonstrate that supplementing traditional data with satellite-derived density significantly improves estimation performance, especially for links without local sensors. Real-world experiments also confirm the framework’s capability to handle large-scale networks, supporting its potential for practical deployment in cities of varying sizes. Sensitivity analysis further evaluates the impact of data quality related to satellite imagery data.

[47] Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models

Weiyi Zhao,Xiaoyu Tan,Liang Liu,Sijia Li,Youwei Song,Xihe Qiu

Main category: cs.CV

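TL;DR: This paper generates synthetic images with annotations to address visual-semantic knowledge conflicts (VS-KC) of multimodal large language models (MLLMs) in operating rooms and improve their perception of surgical risk.

Details Motivation: Surgical risk identification is critical for patient safety and for reducing preventable medical errors, but existing MLLMs handle visual-semantic knowledge conflicts poorly, so targeted datasets and methods are needed.

Contribution: (1) A data generation methodology tailored for rule-violation scenarios; (2) the open-source OR-VSKC dataset and its benchmark; (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs.

Method: Diffusion models generate more than 34,000 synthetic images of operating room scenes that violate safety rules; 214 human-annotated images serve as a gold-standard reference for validation.

Result: Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities, but performance on untrained entity types remains poor, which indicates the need for more comprehensive training.

Insight: Synthetic data generation and targeted training can substantially improve MLLM performance in specific scenarios, but generalization remains a focus for future work.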

Abstract: Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLM vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs’ detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at https://github.com/zgg2577/VS-KC.

[48] How Can Multimodal Remote Sensing Datasets Transform Classification via SpatialNet-ViT?

Gautam Siddharth Kashyap,Manaswi Kulahara,Nipun Joshi,Usman Naseem

Main category: cs.CV

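TL;DR: The paper proposes SpatialNet-ViT, a new model that combines Vision Transformers (ViTs) and multi-task learning (MTL) to improve classification performance and generalization on remote sensing data.

Details Motivation: Existing studies mostly focus on narrow tasks or datasets, which limits generalization across diverse remote sensing classification tasks. The authors therefore develop an integrated model that combines spatial awareness with contextual understanding.

Contribution: Proposes SpatialNet-ViT, which integrates ViTs and MTL to improve classification accuracy and scalability, and applies data augmentation, transfer learning, and multi-task learning to strengthen robustness.

Method: Combines Vision Transformers with multi-task learning and trains the model with data augmentation and transfer learning to improve performance across diverse datasets.

Result: The model is reported to deliver higher accuracy and better generalization across a range of remote sensing classification tasks.

Insight: Fusing spatial awareness with contextual understanding, together with multi-task learning, can improve the performance and adaptability of remote sensing classification models.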

Abstract: Remote sensing datasets offer significant promise for tackling key classification tasks such as land-use categorization, object presence detection, and rural/urban classification. However, many existing studies tend to focus on narrow tasks or datasets, which limits their ability to generalize across various remote sensing classification challenges. To overcome this, we propose a novel model, SpatialNet-ViT, leveraging the power of Vision Transformers (ViTs) and Multi-Task Learning (MTL). This integrated approach combines spatial awareness with contextual understanding, improving both classification accuracy and scalability. Additionally, techniques like data augmentation, transfer learning, and multi-task learning are employed to enhance model robustness and its ability to generalize across diverse datasets.

[49] What Makes a Dribble Successful? Insights From 3D Pose Tracking Data

Michiel Schepers,Pieter Robberechts,Jan Van Haaren,Jesse Davis

Main category: cs.CV

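TL;DR: This paper studies how 3D pose tracking data can improve the understanding of dribbling skill in soccer and uses newly proposed pose-based features to predict dribble success.

Details Motivation: Traditional 2D positional tracking data cannot capture key factors such as balance, orientation, and ball control, which limits the depth of dribble analysis. The authors use 3D pose tracking data to gain a fuller picture.

Contribution: Proposes new pose-based features (such as the attacker's balance and the alignment of orientation between attacker and defender) and shows that they measurably improve models that predict dribble success.

Method: Extracts 3D pose features from 1,736 dribbles in the 2022/23 Champions League season and trains models on them together with traditional 2D data.

Result: Pose features such as balance and orientation alignment are informative for predicting dribble success.

Insight: 3D pose data reveal dynamic details that 2D data cannot capture and add a new dimension to soccer analytics.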

Abstract: Data analysis plays an increasingly important role in soccer, offering new ways to evaluate individual and team performance. One specific application is the evaluation of dribbles: one-on-one situations where an attacker attempts to bypass a defender with the ball. While previous research has primarily relied on 2D positional tracking data, this fails to capture aspects like balance, orientation, and ball control, limiting the depth of current insights. This study explores how pose tracking data (capturing players’ posture and movement in three dimensions) can improve our understanding of dribbling skills. We extract novel pose-based features from 1,736 dribbles in the 2022/23 Champions League season and evaluate their impact on dribble success. Our results indicate that features capturing the attacker’s balance and the alignment of the orientation between the attacker and defender are informative for predicting dribble success. Incorporating these pose-based features on top of features derived from traditional 2D positional data leads to a measurable improvement in model performance.

[50] Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection

Hassan Baker,Austin J. Brockmeier

Main category: cs.CV

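TL;DR: Patch2Loc is an unsupervised method that trains a neural network to predict the spatial location of patches taken from normal brain MRI slices. Location errors produce a heatmap of abnormal tissue, and the method outperforms existing approaches on several datasets.

Details Motivation: Supervised brain lesion detection requires large amounts of annotated data. Patch2Loc is unsupervised and needs only normal MRI slices, which removes the annotation requirement.

Contribution: Proposes Patch2Loc, which identifies abnormal tissue by predicting patch locations, produces a heatmap usable for pixel-wise segmentation, and outperforms existing methods on several datasets.

Method: A neural network is trained to map patches extracted from normal MRI slices to their spatial locations. At inference, abnormal patches are detected through the prediction error, which yields a heatmap.

Result: On the BraTS2021, MSLUB, ATLAS, and WMH datasets, Patch2Loc outperforms existing methods in unsupervised segmentation.

Insight: A spatial localization model trained on normal data can detect abnormal regions effectively, which offers a new direction for unsupervised medical image analysis.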

Abstract: Detecting brain lesions as abnormalities observed in magnetic resonance imaging (MRI) is essential for diagnosis and treatment. In the search for abnormalities, such as tumors and malformations, radiologists may benefit from computer-aided diagnostics that use computer vision systems trained with machine learning to segment normal tissue from abnormal brain tissue. While supervised learning methods require annotated lesions, we propose a new unsupervised approach (Patch2Loc) that learns from normal patches taken from structural MRI. We train a neural network model to map a patch back to its spatial location within a slice of the brain volume. During inference, abnormal patches are detected by the relatively higher error and/or variance of the location prediction. This generates a heatmap that can be integrated into pixel-wise methods to achieve finer-grained segmentation. We demonstrate the ability of our model to segment abnormal brain tissues by applying our approach to the detection of tumor tissues in MRI on T2-weighted images from BraTS2021 and MSLUB datasets and T1-weighted images from ATLAS and WMH datasets. We show that it outperforms the state of the art in unsupervised segmentation. The codebase for this work can be found at https://github.com/bakerhassan/Patch2Loc.
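
The inference step described in the abstract above (flagging patches whose predicted location is far from their true location) can be illustrated with the sketch below. It covers only the error-based score, not the variance-based one, and it assumes the patches lie on a regular grid in row-major order; the function name `location_error_heatmap` and the array layout are hypothetical, not taken from the released code.

```python
import numpy as np


def location_error_heatmap(
    true_xy: np.ndarray,    # (N, 2) true patch coordinates, row-major over the grid
    pred_xy: np.ndarray,    # (N, 2) coordinates predicted by the trained network
    grid_shape: tuple,      # (rows, cols) with rows * cols == N
) -> np.ndarray:
    """Per-patch Euclidean location error arranged as a 2-D heatmap."""
    # A network trained only on normal tissue should localize normal patches well,
    # so a large error is treated as a sign of abnormal tissue.
    error = np.linalg.norm(pred_xy - true_xy, axis=1)
    return error.reshape(grid_shape)
```

Thresholding the heatmap, or passing it to a pixel-wise method as the abstract suggests, would then turn the scores into a segmentation.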

[51] Weakly Supervised Object Segmentation by Background Conditional Divergence

Hassan Baker,Matthew S. Emigh,Austin J. Brockmeier

Main category: cs.CV

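TL;DR: This paper proposes a weakly supervised object segmentation method based on background conditional divergence. It trains a masking network from image-level labels only, and strengthens learning by clustering background images and generating counterfactual images.

Details Motivation: Automatic object segmentation remains challenging in specialized domains that lack massive labeled data, such as sonar imagery, remote sensing, and biomedical imaging.

Contribution: 1. A weakly supervised method that needs only image-level labels; 2. Background clustering and counterfactual image generation to improve segmentation; 3. No pretrained networks, generative networks, or adversarial critics.

Method: 1. Cluster the background-only images; 2. Generate counterfactual images by placing segmented objects onto backgrounds from a target cluster; 3. Train the masking network jointly with a sample-based divergence loss and a supervised loss on background-only images.

Result: Experiments on sonar and natural images show that the method succeeds compared with unsupervised segmentation baselines and reaches reasonable performance without complex modules.

Insight: With background conditional divergence and counterfactual image generation, weak supervision shows promise for segmentation in specialized domains.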

Abstract: As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The codebase for this work can be found on GitHub: https://github.com/bakerhassan/WSOS.
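
The counterfactual step in the abstract, placing a segmented object onto a background drawn from a target cluster, can be illustrated with per-pixel alpha blending. The abstract does not state the exact blending rule, so the formula `mask * image + (1 - mask) * background` and the name `blend_counterfactual` are assumptions for this sketch.

```python
from typing import List

Image = List[List[float]]


def blend_counterfactual(image: Image, mask: Image, background: Image) -> Image:
    """Per-pixel alpha blend: mask * image + (1 - mask) * background.

    `mask` holds values in [0, 1], where 1 marks object pixels.
    All three inputs are assumed to have the same height and width.
    """
    return [
        [m * p + (1.0 - m) * b for p, m, b in zip(img_row, mask_row, bg_row)]
        for img_row, mask_row, bg_row in zip(image, mask, background)
    ]


# One-row example: the first pixel is object, the second is background.
counterfactual = blend_counterfactual([[10.0, 20.0]], [[1.0, 0.0]], [[5.0, 7.0]])
```

In the described training loss, images built this way would be compared, through a sample-based divergence, against real object images whose backgrounds belong to the same target cluster.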

[52] FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment

Hang Xu,Jie Huang,Linjiang Huang,Dong Li,Yidi Liu,Feng Zhao

Main category: cs.CV

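TL;DR: FreeDNA is a training-free domain adaptation method that adjusts noise statistics during diffusion sampling to adapt diffusion-based dense prediction models to new domains.

Details Motivation: Diffusion models perform well on dense prediction tasks, but their potential for domain adaptation remains underexplored. The authors observe that noise statistics bias in diffusion models causes domain shift, and propose aligning noise statistics to address it.

Contribution: 1. A training-free adaptation method for diffusion-based dense prediction models. 2. A noise statistics alignment mechanism that works both with and without access to the source domain. 3. Validation on four dense prediction tasks.

Method: Domain Noise Alignment (DNA) adjusts the noise statistics during diffusion sampling. When the source domain is available, the noise statistics are aligned directly; in the source-free setting, statistics from high-confidence regions progressively guide the noise adjustment.

Result: The method markedly improves the domain adaptation capability of the models across four dense prediction tasks.

Insight: Noise statistics are an effective signal for domain adaptation, and adjusting them is an efficient way to address domain shift in diffusion models.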

Abstract: Domain Adaptation (DA) for dense prediction tasks is an important topic, which enhances the dense prediction model’s performance when tested on an unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, DA designs tailored to this framework are worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variations of noise statistics to domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence meeting variations of sampling noise, we utilize the statistics from the high-confidence regions progressively to guide the noise statistic adjustment during the sampling process. Notably, our method demonstrates the effectiveness of enhancing the DA capability of DDP models across four common dense prediction tasks. Code is available at https://github.com/xuhang07/FreeDNA.
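
To make the source-available case concrete, the sketch below re-standardizes a predicted noise vector so that its mean and standard deviation match reference source-domain values. The abstract does not say which statistics DNA aligns or how, so treating them as the mean and standard deviation, and the name `align_noise`, are assumptions of this illustration.

```python
import statistics
from typing import List, Sequence


def align_noise(eps: Sequence[float], src_mean: float, src_std: float) -> List[float]:
    """Shift and scale `eps` so its mean/std match the given source statistics.

    Hypothetical helper: first standardize the target-domain noise with its own
    mean and (population) standard deviation, then re-scale to the source values.
    """
    mu = statistics.fmean(eps)
    sigma = statistics.pstdev(eps)
    if sigma == 0:
        # A constant vector has no spread to rescale; only the mean can be matched.
        return [src_mean for _ in eps]
    return [(e - mu) / sigma * src_std + src_mean for e in eps]


# Target noise with mean 2 and std 1, aligned to source mean 0 and std 1.
aligned = align_noise([1.0, 3.0], src_mean=0.0, src_std=1.0)
```

In the source-free setting described in the abstract, the reference statistics would instead come from high-confidence regions of the target prediction rather than from source data.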

[53] Lightning the Night with Generative Artificial Intelligence

Tingting Zhou,Feng Zhang,Haoyang Fu,Baoxiang Pan,Renhe Zhang,Feng Lu,Zhixin Yang

Main category: cs.CV

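TL;DR: RefDiff is a generative diffusion model that retrieves nighttime visible light reflectance from the multi-band thermal infrared brightness temperature data of the AGRI instrument on the FY4B geostationary satellite.

Details Motivation: Visible light reflectance from geostationary satellites is crucial for weather monitoring and forecasting, but no visible light is available at night, so this data cannot support continuous all-day observation.

Contribution: Introduces Reflectance Diffusion (RefDiff), a visible light reflectance retrieval model for nighttime use in three visible bands that also provides uncertainty estimation.

Method: A generative diffusion model takes multi-band thermal infrared brightness temperature data from FY4B AGRI as its basis, and ensemble averaging is used to improve accuracy.

Result: The SSIM of RefDiff can reach 0.90, with particularly significant improvements in areas with complex cloud structures and thick clouds. Validation against the VIIRS nighttime product shows nighttime performance comparable to daytime performance.

Insight: Generative diffusion models can extend visible light reflectance retrieval to nighttime, with the potential to broaden the application of nighttime visible light data.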

Abstract: The visible light reflectance data from geostationary satellites is crucial for meteorological observations and plays an important role in weather monitoring and forecasting. However, due to the lack of visible light at night, it is impossible to conduct continuous all-day weather observations using visible light reflectance data. This study pioneers the use of generative diffusion models to address this limitation. Based on the multi-band thermal infrared brightness temperature data from the Advanced Geostationary Radiation Imager (AGRI) onboard the Fengyun-4B (FY4B) geostationary satellite, we developed a high-precision visible light reflectance retrieval model, called Reflectance Diffusion (RefDiff), which enables nighttime retrieval of visible light reflectance in the $0.47~\mu\mathrm{m}$, $0.65~\mu\mathrm{m}$, and $0.825~\mu\mathrm{m}$ bands. Compared to the classical models, RefDiff not only significantly improves accuracy through ensemble averaging but also provides uncertainty estimation. Specifically, the SSIM index of RefDiff can reach 0.90, with particularly significant improvements in areas with complex cloud structures and thick clouds. The model’s nighttime retrieval capability was validated using the VIIRS nighttime product, demonstrating comparable performance to its daytime counterpart. In summary, this research has made substantial progress in the ability to retrieve visible light reflectance at night, with the potential to expand the application of nighttime visible light data.
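
The abstract credits ensemble averaging for the accuracy gain and mentions uncertainty estimation. A common way to obtain both from a generative model is to draw several samples and take the per-pixel mean and spread; the sketch below shows that idea. Using the standard deviation as the uncertainty measure, and the name `ensemble_stats`, are assumptions here, since the abstract does not give the exact procedure.

```python
import statistics
from typing import List, Sequence, Tuple


def ensemble_stats(members: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the per-pixel mean and population standard deviation.

    `members` is a list of flattened reflectance fields, one per sampled
    ensemble member, all of the same length.
    """
    per_pixel = list(zip(*members))  # regroup values by pixel position
    mean = [statistics.fmean(values) for values in per_pixel]
    spread = [statistics.pstdev(values) for values in per_pixel]
    return mean, spread


# Two hypothetical samples of a two-pixel reflectance field.
mean, spread = ensemble_stats([[0.2, 0.4], [0.4, 0.8]])
```

Pixels where the samples disagree get a larger spread, which is one plausible reading of how an ensemble yields an uncertainty map.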

[54] Automated Defect Identification and Categorization in NDE 4.0 with the Application of Artificial Intelligence

Aditya Sharma

Main category: cs.CV

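TL;DR: The paper proposes an AI-based automated defect detection and classification framework for radiographic testing in the NDE 4.0 era, using a modified U-Net model and dataset augmentation for efficient defect identification.

Details Motivation: To address the lack of sufficiently explained information in modern radiographic testing, explore the potential of virtual defect augmentation, and assess the framework's feasibility with NDE measurements.

Contribution: A dataset expansion approach that combines virtual defect augmentation with standard augmentation, together with a modified U-Net model, which markedly improves the sensitivity and efficiency of defect detection.

Method: 223 CR images of aircraft welds serve as the base data. The dataset is expanded with data augmentation, and a modified U-Net model is trained for semantic defect segmentation.

Result: The model shows high sensitivity and accuracy in defect detection, and performs especially well on weld-area evaluation metrics such as a90/95.

Insight: The dataset expansion approach (virtual defect augmentation) contributes substantially to model performance, and the framework's fast inference makes it suitable for large-scale image analysis. Professional evaluation suggests the framework could become an effective support tool in the testing cycle.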

Abstract: This investigation attempts to create an automated framework for defect detection and classification for use in contemporary radiography, as per NDE 4.0. The study’s goals are to address the lack of information that is sufficiently explained, learn how to make the most of virtual defect augmentation, and determine whether the framework is viable by using NDE measurements. As its basic data source, the technique consists of compiling and categorizing 223 CR images of airplane welds. Data augmentation techniques, such as virtual defect augmentation and standard augmentation, are used to expand the training dataset. A modified U-Net model is trained on the augmented data to produce semantic defect segmentation masks. To assess the effectiveness of the model, NDE parameters such as Case, sizing accuracy, and false call rate are used. Small a90/95 values, which provide strong evidence of flaw detectability, reveal that the suggested approach achieves exceptional sensitivity in defect detection. Considering a90/95, size error, and false call rate in the weld area, the combined augmentation approach clearly wins. Due to the framework’s fast inference speed, large images can be analyzed efficiently and quickly. Professional inspectors evaluate the deployed system in the field and believe that it has promise as a support tool in the testing cycle, irrespective of particular equipment limitations and software compatibility.

[55] Container damage detection using advanced computer vision model Yolov12 vs Yolov11 vs RF-DETR A comparative analysis

Subhadip Kumar

Main category: cs.CV

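TL;DR: This paper compares three advanced computer vision models (Yolov12, Yolov11, and RF-DETR) on container damage detection and finds that RF-DETR performs better on uncommon damage.

Details Motivation: Container damage detection is critical for safety in the logistics industry and for extending service life, but how existing models perform on uncommon damage is unclear. This paper aims to fill that gap.

Contribution: An experimental comparison of three recent models, showing that RF-DETR outperforms the Yolo models on uncommon damage.

Method: The models are trained, validated, and tested on a dataset of 278 annotated images, and compared on mAP and precision.

Result: Yolov11 and Yolov12 reach an mAP@50 of 81.9%, ahead of RF-DETR at 77.7%, but RF-DETR performs better on uncommon damage.

Insight: Different models have different strengths in different scenarios; RF-DETR is better suited to complex or uncommon damage detection.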

Abstract: Containers are an integral part of the logistics industry and act as a barrier for cargo. A typical service life for a container is more than 20 years. However, over time containers suffer various types of damage due to mechanical as well as natural factors. A damaged container is a safety hazard for the employees handling it and a liability for the logistics company. Therefore, timely inspection and detection of damaged containers is key to prolonging service life as well as avoiding safety hazards. In this paper, we compare the damage detection performance of three state-of-the-art computer vision models: Yolov12, Yolov11, and RF-DETR. We use a dataset of 278 annotated images to train, validate, and test the models, and compare their mAP and precision. The objective of this paper is to identify the model that is best suited for container damage detection. The results are mixed. The mAP@50 score of Yolov11 and Yolov12 was 81.9%, compared to 77.7% for RF-DETR. However, when tested on not-so-common damaged containers, the RF-DETR model outperformed the others overall, exhibiting superiority in accurately detecting both damaged containers and damage occurrences with high confidence.
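
For readers unfamiliar with the mAP@50 metric quoted above: a predicted box counts as a correct detection when its intersection over union (IoU) with a ground-truth box of the same class is at least 0.5. The sketch below computes IoU for axis-aligned boxes; the function names and the `(x1, y1, x2, y2)` box layout are choices made for this illustration, not taken from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 > x1 and y2 > y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at_50(pred: Box, truth: Box) -> bool:
    """True when the prediction would count as a hit at the 0.5 IoU threshold."""
    return iou(pred, truth) >= 0.5


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Average precision is then computed per class from the ranked detections, and mAP@50 is the mean over classes.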

[56] Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Vasu Agrawal,Akinniyi Akinyemi,Kathryn Alvero,Morteza Behrooz,Julia Buffalini,Fabio Maria Carlucci,Joy Chen,Junming Chen,Zhang Chen,Shiyang Cheng,Praveen Chowdary,Joe Chuang,Antony D’Avirro,Jon Daly,Ning Dong,Mark Duppenthaler,Cynthia Gao,Jeff Girard,Martin Gleize,Sahir Gomez,Hongyu Gong,Srivathsan Govindarajan,Brandon Han,Sen He,Denise Hernandez,Yordan Hristov,Rongjie Huang,Hirofumi Inaguma,Somya Jain,Raj Janardhan,Qingyao Jia,Christopher Klaiber,Dejan Kovachev,Moneish Kumar,Hang Li,Yilei Li,Pavel Litvin,Wei Liu,Guangyao Ma,Jing Ma,Martin Ma,Xutai Ma,Lucas Mantovani,Sagar Miglani,Sreyas Mohan,Louis-Philippe Morency,Evonne Ng,Kam-Woh Ng,Tu Anh Nguyen,Amia Oberai,Benjamin Peloquin,Juan Pino,Jovan Popovic,Omid Poursaeed,Fabian Prada,Alice Rakotoarison,Alexander Richard,Christophe Ropers,Safiyyah Saleem,Vasu Sharma,Alex Shcherbyna,Jia Shen,Jie Shen,Anastasis Stathopoulos,Anna Sun,Paden Tomasello,Tuan Tran,Arina Turkatenko,Bo Wan,Chao Wang,Jeff Wang,Mary Williamson,Carleigh Wood,Tao Xiang,Yilin Yang,Julien Yao,Chen Zhang,Jiemin Zhang,Xinyue Zhang,Jason Zheng,Pavlo Zhyzheria,Jan Zikes,Michael Zollhoefer

Main category: cs.CV

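TL;DR: The paper introduces the Seamless Interaction Dataset, a large-scale dataset of face-to-face interaction, and develops a suite of models that generate dyadic motion gestures and facial expressions aligned with human speech. The models support 2D and 3D rendering, and controllable variants adapt emotion and expressivity levels and produce more semantically relevant gestures.

Details Motivation: To develop AI technologies that can understand and generate dyadic behavioral dynamics, advancing virtual agents, telepresence experiences, and multimodal content analysis tools.

Contribution: 1) Releases the Seamless Interaction Dataset, with over 4,000 hours of face-to-face interaction footage; 2) Develops models that generate dyadic motion and facial expressions aligned with speech; 3) Provides controllable model variants for adjusting emotion and expressivity.

Method: 1) Collect and annotate large-scale dyadic interaction data; 2) Develop generative models that take speech and visual behavior as input; 3) Integrate an LLM and 2D/3D rendering; 4) Propose controllable emotion adaptation and semantically relevant gesture generation.

Result: The models generate high-quality, speech-aligned dyadic motion and facial expressions, and support controllable emotional expression and semantic gestures, laying groundwork for more intuitive human-AI interaction.

Insight: Large-scale data and multimodal modeling are key to improving the social intelligence of AI, and controllable emotional expression and semantic gesture generation can make human-AI interaction markedly more natural.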

Abstract: Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM model and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generating more semantically-relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, which are demonstrating the potential for more intuitive and responsive human-AI interactions.

[57] Recomposed realities: animating still images via patch clustering and randomness

Markus Juvonen,Samuli Siltanen

Main category: cs.CV

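TL;DR: This paper presents a patch-based image reconstruction and animation method that brings still images to life through clustering and random sampling.

Details Motivation: To use existing image data to give still images motion, aiming at reinterpretation of the image rather than simple replication.

Contribution: A patch matching method based on k-means clustering and random sampling that lets conceptually different domains share local structures.

Method: Image patches are clustered with k-means, and the target image is reconstructed by matching and randomly sampling from the clusters to produce the animation effect.

Result: Still images are successfully animated, with an emphasis on creative reinterpretation of the source images.

Insight: Sharing local structures and introducing randomness gives the reconstruction flexibility and variety.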

Abstract: We present a patch-based image reconstruction and animation method that uses existing image data to bring still images to life through motion. Image patches from curated datasets are grouped using k-means clustering and a new target image is reconstructed by matching and randomly sampling from these clusters. This approach emphasizes reinterpretation over replication, allowing the source and target domains to differ conceptually while sharing local structures.
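
The matching and sampling step can be sketched as follows, assuming the k-means clustering has already been run and each cluster is represented by a centroid and a list of member patches. Matching by nearest centroid under squared Euclidean distance is an assumption of this sketch (the abstract only says "matching"), and all names are hypothetical.

```python
import random
from typing import List, Sequence

Patch = List[float]  # a flattened image patch


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(patch: Patch, centroids: Sequence[Patch]) -> int:
    """Index of the centroid closest to `patch`."""
    return min(range(len(centroids)), key=lambda k: squared_distance(patch, centroids[k]))


def recompose(
    target_patches: Sequence[Patch],
    centroids: Sequence[Patch],
    clusters: Sequence[Sequence[Patch]],
    rng: random.Random,
) -> List[Patch]:
    """Replace each target patch with a random member of its nearest cluster."""
    return [rng.choice(clusters[nearest_cluster(p, centroids)]) for p in target_patches]


centroids = [[0.0, 0.0], [1.0, 1.0]]
clusters = [[[0.0, 0.1], [0.1, 0.0]], [[0.9, 1.0], [1.0, 0.9]]]
frame = recompose([[0.2, 0.1], [0.8, 0.9]], centroids, clusters, random.Random(0))
```

Calling `recompose` repeatedly with different random states yields different reconstructions of the same target, which is one plausible way the randomness could translate into motion across frames.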

[58] Improving Token-based Object Detection with Video

Abhineet Singh,Nilanjan Ray

Main category: cs.CV

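TL;DR: This paper extends token-based object detection to video. It proposes an end-to-end video object detection technique that represents objects as sequences of discrete tokens, which avoids limitations of conventional detectors, and directly outputs 3D boxes or tracklets, improving on conventional approaches.

Details Motivation: Existing video object detectors usually build video objects by linking 2D boxes, an indirect approach that brings computational complexity and heuristic postprocessing. This paper aims to simplify the pipeline and improve performance through tokenization and direct prediction of 3D boxes.

Contribution: 1. A token-based representation of video objects that avoids the box-sampling constraints of conventional detectors; 2. Direct output of 3D boxes or tracklets, with no postprocessing to link boxes; 3. Competitive performance under limited computational resources.

Method: 1. Extend Pix2Seq to video, representing objects as variable-length sequences of discrete tokens; 2. Predict 3D boxes or tracklets directly instead of linking 2D boxes; 3. Scale computation flexibly by increasing the length of the input video subsequence.

Result: The detector outperforms the static Pix2Seq detector on several datasets and is competitive with current state-of-the-art methods on UA-DETRAC, despite being limited by computational resources.

Insight: Token-based representation and direct 3D prediction are an effective direction for video object detection, but computational efficiency needs further work.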

Abstract: This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by representing objects as variable-length sequences of discrete tokens, we can succinctly represent widely varying numbers of video objects, with diverse shapes and locations, without having to inject any localization cues in the training process. This eliminates the need to sample the space of all possible boxes that constrains conventional detectors and thus solves the dual problems of loss sparsity during training and heuristics-based postprocessing during inference. Second, it conceptualizes and outputs the video objects as fully integrated and indivisible 3D boxes or tracklets instead of generating image-specific 2D boxes and linking these boxes together to construct the video object, as done in most conventional detectors. This allows it to scale effortlessly with available computational resources by simply increasing the length of the video subsequence that the network takes as input, even generalizing to multi-object tracking if the subsequence can span the entire video. We compare our video detector with the baseline Pix2Seq static detector on several datasets and demonstrate consistent improvement, although with strong signs of being bottlenecked by our limited computational resources. We also compare it with several video detectors on UA-DETRAC to show that it is competitive with the current state of the art even with the computational bottleneck. We make our code and models publicly available.
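
The idea of representing a box as discrete tokens can be illustrated by quantizing normalized coordinates into a fixed number of bins and appending a class token. This is a generic sketch of coordinate tokenization, not the paper's exact vocabulary or token order: the bin count, the offset used for class tokens, and the function names are assumptions.

```python
from typing import List, Sequence


def quantize(coord: float, n_bins: int) -> int:
    """Map a coordinate in [0, 1] to an integer bin in [0, n_bins - 1]."""
    return min(int(coord * n_bins), n_bins - 1)


def box_to_tokens(coords: Sequence[float], class_id: int, n_bins: int) -> List[int]:
    """Turn normalized box coordinates plus a class label into a token list.

    Coordinate tokens occupy [0, n_bins - 1]; class tokens are placed after
    them in the vocabulary at n_bins + class_id (a hypothetical layout).
    """
    return [quantize(c, n_bins) for c in coords] + [n_bins + class_id]


# A 2D box with 10 bins; a video box or tracklet would simply pass more
# coordinates (for example, one box per frame) before the class token.
tokens = box_to_tokens([0.0, 0.5, 1.0, 0.25], class_id=2, n_bins=10)
```

Because every object becomes a short token sequence, a variable number of objects can be expressed by concatenating sequences, which is what lets the detector avoid sampling a fixed set of candidate boxes.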

[59] Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation

Shansong Wang,Zhecheng Jin,Mingzhe Hu,Mojtaba Safari,Feng Zhao,Chih-Wei Chang,Richard LJ Qiu,Justin Roper,David S. Yu,Xiaofeng Yang

Main category: cs.CV

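TL;DR: MMKD-CLIP is a generalist biomedical foundation model that uses multi-teacher knowledge distillation to draw knowledge from nine domain-specific or generalist biomedical CLIP models, addressing data scarcity and heterogeneity in biomedicine.

Details Motivation: Biomedicine lacks large-scale image-text pair data, and its image modalities and data standards are fragmented, which limits the development of a generalizable biomedical foundation model. Knowledge distillation can overcome the shortage and heterogeneity of data.

Contribution: Proposes MMKD-CLIP, which integrates the knowledge of multiple biomedical CLIP models through multi-teacher knowledge distillation and achieves strong cross-modal, multi-task performance on 58 datasets.

Method: Two-stage training: first, pretraining on over 2.9 million biomedical image-text pairs; then, feature-level distillation using over 19.2 million feature pairs extracted from the nine teacher models.

Result: Across six task types, including zero-shot classification, linear probing, and cross-modal retrieval, MMKD-CLIP outperforms all teacher models and shows strong robustness and generalization.

Insight: Multi-teacher knowledge distillation is an effective way to build high-performing biomedical foundation models when data is scarce and heterogeneous.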

Abstract: CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations hinder the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome this, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while demonstrating remarkable robustness and generalization across image domains and task settings. These results underscore that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.

[60] Dual Atrous Separable Convolution for Improving Agricultural Semantic Segmentation

Chee Mei Ling,Thangarajah Akilan,Aparna Ravinda Phalke

Main category: cs.CV

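TL;DR: The paper proposes an efficient segmentation module called Dual Atrous Separable Convolution (DAS Conv), integrated into the DeepLabV3 framework for agricultural semantic segmentation. It balances dilation rates and padding size to improve performance without sacrificing efficiency. Combined with a strategic skip connection, the model performs well on agricultural imagery at low computational complexity.

Details Motivation: Modern agriculture relies on accurate visual data analysis to optimize crop management and resource use. Existing semantic segmentation models are computationally expensive and hard to deploy efficiently in agricultural settings. The paper aims to design a lightweight, efficient model that improves segmentation accuracy.

Contribution: 1. The DAS Conv module, which optimizes dilation rates and padding size; 2. A skip connection that strengthens fine-grained spatial features; 3. Efficient performance on the Agriculture Vision dataset, better than the baseline and close to complex transformer models.

Method: 1. Integrate the DAS Conv module into the DeepLabV3 framework; 2. Pass intermediate encoder features to the decoder through a skip connection; 3. Tune the parameter configuration experimentally.

Result: On the Agriculture Vision dataset the model performs close to SOTA with an efficiency improvement of more than 66%, which supports its lightweight and efficient design.

Insight: Lightweight design matters for agricultural semantic segmentation, and the DAS Conv module shows how to reduce computational complexity without sacrificing performance.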

Abstract: Agricultural image semantic segmentation is a pivotal component of modern agriculture, facilitating accurate visual data analysis to improve crop management, optimize resource utilization, and boost overall productivity. This study proposes an efficient image segmentation method for precision agriculture, focusing on accurately delineating farmland anomalies to support informed decision-making and proactive interventions. A novel Dual Atrous Separable Convolution (DAS Conv) module is integrated within the DeepLabV3-based segmentation framework. The DAS Conv module is meticulously designed to achieve an optimal balance between dilation rates and padding size, thereby enhancing model performance without compromising efficiency. The study also incorporates a strategic skip connection from an optimal stage in the encoder to the decoder to bolster the model’s capacity to capture fine-grained spatial features. Despite its lower computational complexity, the proposed model outperforms its baseline and achieves performance comparable to highly complex transformer-based state-of-the-art (SOTA) models on the Agriculture Vision benchmark dataset. It achieves more than 66% improvement in efficiency when considering the trade-off between model complexity and performance, compared to the SOTA model. This study highlights an efficient and effective solution for improving semantic segmentation in remote sensing applications, offering a computationally lightweight model capable of high-quality performance in agricultural imagery.
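
The balance between dilation rate and padding mentioned in the abstract follows from the standard output-size formula for a dilated (atrous) convolution: `out = floor((in + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1`. The sketch below evaluates that formula and the padding that keeps the spatial size unchanged at stride 1. It illustrates the general relationship only; the specific rates and paddings used inside DAS Conv are not given in the abstract.

```python
def conv_out_size(size: int, kernel: int, padding: int, dilation: int, stride: int = 1) -> int:
    """Spatial output size of a dilated convolution along one dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def same_padding(kernel: int, dilation: int) -> int:
    """Padding that preserves the input size for stride 1 and an odd kernel."""
    if kernel % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd kernel size")
    return dilation * (kernel - 1) // 2


# A 3x3 kernel with dilation 2 covers a 5x5 window, so padding 2 keeps 64 -> 64.
pad = same_padding(kernel=3, dilation=2)
out = conv_out_size(64, kernel=3, padding=pad, dilation=2)
```

If the padding is smaller than this value the feature map shrinks, and if it is larger the map grows, so parallel branches with different dilation rates need matching paddings before their outputs can be combined.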

[61] LIGHT: Multi-Modal Text Linking on Historical Maps

Yijun Lin,Rhett Olson,Junhan Wu,Yao-Yi Chiang,Jerod Weinman

Main category: cs.CV

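TL;DR: This paper proposes LIGHT, a multi-modal approach to the challenge of linking text on historical maps. By combining linguistic, image, and geometric features, it markedly improves the recognition of multi-word place names.

Details Motivation: Text on historical maps is valuable for research in many fields, but because the text varies widely and is geometrically complex, existing methods struggle to link multi-word text fragments, place names in particular.

Contribution: A multi-modal framework that unifies linguistic, visual, and geometric features, notably through a geometry-aware embedding module that captures the shapes and spatial relationships of text regions.

Method: LIGHT combines the visual and linguistic features of the pretrained LayoutLMv3 model with a geometry-aware embedding module that encodes polygonal coordinates, and uses a bi-directional learning strategy to make sequences more robust.

Result: Experiments show that LIGHT outperforms existing methods on the ICDAR 2024/2025 MapText Competition data, supporting the effectiveness of multi-modal learning.

Insight: Geometric information plays a key role in multi-modal text linking, and a bi-directional learning strategy effectively improves the robustness of sequence prediction.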

Abstract: Text on historical maps provides valuable information for studies in history, economics, geography, and other related fields. Unlike structured or semi-structured documents, text on maps varies significantly in orientation, reading order, shape, and placement. Many modern methods can detect and transcribe text regions, but they struggle to effectively "link" the recognized text fragments, e.g., determining a multi-word place name. Existing layout analysis methods model word relationships to improve text understanding in structured documents, but they primarily rely on linguistic features and neglect geometric information, which is essential for handling map text. To address these challenges, we propose LIGHT, a novel multi-modal approach that integrates linguistic, image, and geometric features for linking text on historical maps. In particular, LIGHT includes a geometry-aware embedding module that encodes the polygonal coordinates of text regions to capture polygon shapes and their relative spatial positions on an image. LIGHT unifies this geometric information with the visual and linguistic token embeddings from LayoutLMv3, a pretrained layout analysis model. LIGHT uses the cross-modal information to predict the reading-order successor of each text instance directly with a bi-directional learning strategy that enhances sequence robustness. Experimental results show that LIGHT outperforms existing methods on the ICDAR 2024/2025 MapText Competition data, demonstrating the effectiveness of multi-modal learning for historical map text linking.
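
Once a model has predicted the reading-order successor of each text instance, turning those predictions into multi-word phrases is a matter of following chains. The sketch below shows one simple decoding, assuming each word has at most one successor and that `None` marks the end of a phrase; the function name and this encoding are hypothetical and are not taken from the LIGHT implementation.

```python
from typing import List, Optional, Sequence


def link_words(words: Sequence[str], successor: Sequence[Optional[int]]) -> List[List[str]]:
    """Group words into phrases by following successor indices.

    A chain starts at every word that is nobody's successor. Words that only
    appear inside a closed cycle have no start and are therefore skipped.
    """
    has_predecessor = {s for s in successor if s is not None}
    phrases = []
    for head in range(len(words)):
        if head in has_predecessor:
            continue
        chain, seen, i = [], set(), head
        while i is not None and i not in seen:  # `seen` guards against loops
            seen.add(i)
            chain.append(words[i])
            i = successor[i]
        phrases.append(chain)
    return phrases


# "Salt" -> "Lake" -> "City" form one place name; "Ohio" stands alone.
phrases = link_words(["Salt", "Lake", "City", "Ohio"], [1, 2, None, None])
```
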

[62] BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data

Arunkumar Kannan,Martin A. Lindquist,Brian Caffo

Main category: cs.CV

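TL;DR: BrainMT is a hybrid Mamba-Transformer architecture that efficiently models long-range spatiotemporal dependencies in fMRI data and markedly improves performance on classification and regression tasks.

Details Motivation: Existing CNN-based or transformer-based methods struggle to capture long-range spatial and temporal dependencies in fMRI data, which limits their performance. BrainMT aims to solve this problem.

Contribution: Proposes BrainMT, a hybrid architecture that combines a Mamba block (for efficiently capturing temporal dependencies) with a transformer block (for modeling spatial relationships), achieving SOTA in fMRI data analysis.

Method: 1. A bidirectional Mamba block with a temporal-first scanning mechanism efficiently captures global temporal interactions; 2. The self-attention of a transformer block models global spatial relationships.

Result: On the UKBioBank and Human Connectome Project datasets, BrainMT significantly outperforms existing methods on sex prediction and cognitive intelligence prediction.

Insight: Hybrid architectures such as Mamba plus Transformer show promise for modeling complex spatiotemporal data, offering both efficiency and performance.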

Abstract: Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin. Our code and implementation details will be made publicly available at https://github.com/arunkumar-kannan/BrainMT-fMRI.

[63] Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

Zuyao You,Zuxuan Wu

Main category: cs.CV

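TL;DR: Seg-R1 is a simple segmentation approach based on reinforcement learning (RL). It uses RL to improve the pixel-level understanding of large multimodal models, achieving high performance on COD and SOD tasks along with strong generalization.

Details Motivation: Conventional segmentation approaches require complex model modifications or multi-task supervision. This paper explores how pure RL training can deliver simple and efficient segmentation.

Contribution: 1. Introduces Group Relative Policy Optimization (GRPO) into the segmentation domain; 2. Achieves high performance on COD and SOD tasks through RL training; 3. Shows that pure RL training generalizes strongly to open-world settings.

Method: The model is trained with RL to generate point or bounding box prompts, which SAM2 uses to produce segmentation masks. A GRPO-based training strategy is designed to improve pixel-level understanding.

Result: Seg-R1 reaches 0.873 S-measure on COD10K, and zero-shot performance of 71.4 cIoU on RefCOCOg and 56.7 gIoU on ReasonSeg.

Insight: Pure RL training is simple and effective, and it generalizes strongly on open-world tasks, which suggests a new direction for future segmentation research.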

Abstract: We present Seg-R1, a preliminary exploration of using reinforcement learning (RL) to enhance the pixel-level understanding and reasoning capabilities of large multimodal models (LMMs). Starting with foreground segmentation tasks, specifically camouflaged object detection (COD) and salient object detection (SOD), our approach enables the LMM to generate point and bounding box prompts in the next-token fashion, which are then used to guide SAM2 in producing segmentation masks. We introduce Group Relative Policy Optimization (GRPO) into the segmentation domain, equipping the LMM with pixel-level comprehension through a carefully designed training strategy. Notably, Seg-R1 achieves remarkable performance with purely RL-based training, achieving .873 S-measure on COD10K without complex model modification. Moreover, we found that pure RL training demonstrates strong open-world generalization. Despite being trained solely on foreground segmentation image-mask pairs without text supervision, Seg-R1 achieves impressive zero-shot performance on referring segmentation and reasoning segmentation tasks, with 71.4 cIoU on RefCOCOg test and 56.7 gIoU on ReasonSeg test, outperforming models fully supervised on these datasets.
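
The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses drawn for the same input, so no separate value network is needed. The sketch below shows that normalization step on its own. The small `eps` term and the use of the population standard deviation are implementation choices assumed here, and the reward values are hypothetical (the abstract does not describe the reward design for Seg-R1).

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four prompts sampled for the same image,
# for example a mask-quality score for each sampled point or box prompt.
advantages = group_relative_advantages([0.9, 0.5, 0.5, 0.1])
```

Responses that beat the group average receive positive advantages and are reinforced; when every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.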

[64] ReCo: Reminder Composition Mitigates Hallucinations in Vision-Language Models

Sotirios Panagiotis Chytas,Miso Choi,Hyunwoo J. Kim,Vikas Singh

Main category: cs.CV

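TL;DR: The paper proposes ReCo, a lightweight trainable module that mitigates hallucination in vision-language models (VLMs). It draws on geometric algebra and relational compositions to counter the fading of visual input over the course of generation.

Details Motivation: Vision-language models (VLMs) tend to hallucinate when generating text, producing text that is unrelated to or contradicts the visual input. This behavior is attributed to over-reliance on language and to a "fading memory effect" with respect to the visual input.

Contribution: Proposes the ReCo module, a lightweight trainable module that integrates seamlessly into existing VLMs (such as InstructBLIP, LLaVA, and MiniGPT4) and effectively mitigates hallucination.

Method: ReCo is designed using ideas from geometric algebra and relational compositions, and suppresses hallucination by strengthening the model's long-term memory of the visual input. No other part of the model needs to be modified.

Result: On multiple benchmarks, the ReCo module markedly improves VLM performance, and it can be combined with existing hallucination-reduction methods for further gains.

Insight: A lightweight module can address hallucination effectively without changing the structure of existing models, which suggests a new direction for improving VLMs.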

Abstract: Vision Language Models (VLMs) show impressive capabilities in integrating and reasoning with both visual and language data. But these models make mistakes. A common finding, similar to LLMs, is their tendency to hallucinate, i.e., generate plausible sounding text which is not grounded in the visual input, or at worst, is contradictory. A growing consensus attributes this behavior to an over-reliance on language: especially as the generation progresses, the model suffers from a "fading memory effect" with respect to the provided visual input. We study mechanisms by which this behavior can be controlled. Specifically, using ideas from geometric algebra and relational compositions, we propose the addition of a small, trainable module (named ReCo) on top of any VLM; no other modification is needed. We show that such a lightweight module is able to mitigate the fading memory effect on three of the most widely used VLMs (InstructBLIP, LLaVA, MiniGPT4), where we see performance improvements on multiple benchmarks. Additionally, we show that our module can be combined with many of the other approaches for reducing hallucination where we achieve improved results for each one.

[65] CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation

Haoxuan Wang,Zhenghao Zhao,Junyi Wu,Yuzhang Shang,Gaowen Liu,Yan Yan

Main category: cs.CV

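TL;DR: The paper proposes the CaO$_2$ framework, which resolves objective inconsistency and condition inconsistency in diffusion-based dataset distillation. Its two-stage approach optimizes the distillation process and brings a marked performance gain.

Details Motivation: Current diffusion-based dataset distillation methods suffer from objective inconsistency and condition inconsistency, which leads to suboptimal performance. CaO$_2$ aims to resolve these issues.

Contribution: Proposes the CaO$_2$ framework, which resolves the objective and condition inconsistencies through two-stage optimization (probability-informed sample selection and conditional likelihood optimization) and achieves state-of-the-art performance on ImageNet.

Method: A two-stage framework: 1) probability-informed sample selection; 2) refinement of the latent representations to improve conditional likelihood.

Result: Strong performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.

Insight: Aligning the distillation process with the evaluation objective and improving conditional generation are key, and diffusion models still have untapped potential for dataset distillation.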

Abstract: The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO$_2$), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.

[66] UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments

Dayong Su,Yafei Zhang,Huafeng Li,Jinxing Li,Yu Liu

Main category: cs.CV

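TL;DR: UniFuse is a unified multi-modal medical image fusion framework that handles degraded and misaligned input images together. By jointly optimizing alignment, restoration, and fusion, it markedly improves fusion quality.

Details Motivation: Current multi-modal medical image fusion methods usually assume high-quality, perfectly aligned inputs. In practice, image degradation and misalignment often reduce their effectiveness. UniFuse targets this problem.

Contribution: 1. The UniFuse framework, which for the first time combines alignment, restoration, and fusion in a single framework; 2. An Omni Unified Feature Representation scheme and a Universal Feature Restoration & Fusion module; 3. The Adaptive LoRA Synergistic Network (ALSN) for adaptive feature representation.

Method: 1. A degradation-aware prompt learning module integrates multi-directional information; 2. Spatial Mamba encodes multi-directional features to reduce modality differences; 3. ALSN enables joint restoration and fusion in a single stage.

Result: Experiments on multiple datasets show that UniFuse significantly outperforms existing methods, supporting its effectiveness.

Insight: The novelty of UniFuse lies in unifying several tasks in one framework. Adaptive feature representation and degradation-type guidance address degradation and misalignment in medical image fusion.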

Abstract: Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse seamlessly integrates multi-directional information from input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. Additionally, we design an Omni Unified Feature Representation scheme, which leverages Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion within an All-in-One configuration, we propose a Universal Feature Restoration & Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles. By leveraging ALSN’s adaptive feature representation along with degradation-type guidance, we enable joint restoration and fusion within a single-stage framework. Compared to staged approaches, UniFuse unifies alignment, restoration, and fusion within a single framework. Experimental results across multiple datasets demonstrate the method’s effectiveness and significant advantages over existing approaches.

[67] RoboPearls: Editable Video Simulation for Robot Manipulation

Tao Tang,Likui Zhang,Youpeng Wen,Kaidong Zhang,Jia-Wang Bian,xia zhou,Tianyi Yan,Kun Zhan,Peng Jia,Hefeng Wu,Liang Lin,Xiaodan Liang

Main category: cs.CV

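TL;DR: RoboPearls is an editable video simulation framework for robot manipulation built on 3D Gaussian Splatting (3DGS). It uses techniques such as Incremental Semantic Distillation (ISD) and a 3D regularized NNFM loss (3D-NNFM) to achieve photo-realistic simulation, and it combines large language models (LLMs) with a vision-language model (VLM) to automate the simulation pipeline and make robot learning more efficient.

Details Motivation: Collecting real-world robot demonstrations is costly and inefficient, and existing simulation platforms struggle to close the sim-to-real gap. RoboPearls aims to address this with editable video simulation.

Contribution: 1. Proposes the RoboPearls framework, which builds photo-realistic, view-consistent simulations on 3DGS; 2. Supports diverse object manipulation operators through ISD and 3D-NNFM; 3. Uses LLMs and a VLM to automate simulation production and improve robot learning performance.

Method: 1. Build photo-realistic simulations with 3D Gaussian Splatting (3DGS); 2. Apply ISD and the 3D-NNFM loss for semantic distillation and regularization; 3. Use LLMs to interpret and execute commands, and a VLM to analyze learning issues and refine the simulation.

Result: RoboPearls is evaluated on multiple datasets and scenes, including RLBench and COLOSSEUM, where it shows satisfactory simulation performance.

Insight: 1. 3DGS gives robot simulation photo-realistic, view-consistent visuals; 2. Combining LLMs and VLMs opens a new route to simulation automation; 3. Incremental semantic distillation improves simulation quality.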

Abstract: The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, which demonstrate our satisfactory simulation performance.

[68] VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Dinh Phu Tran,Dao Duy Hung,Daeyoung Kim

Main category: cs.CV

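TL;DR: VSRM is a Mamba-based video super-resolution framework. It introduces Spatial-to-Temporal and Temporal-to-Spatial Mamba blocks, a Deformable Cross-Mamba Alignment module, and a Frequency Charbonnier-like loss to extract long-range spatio-temporal features efficiently and reconstruct high-quality frames.

Details Motivation: In existing video super-resolution methods, CNNs are limited by local receptive fields, and Transformers struggle with long sequences because of their quadratic complexity. Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields.

Contribution: 1. Proposes the VSRM framework, which builds on Mamba's strengths. 2. Designs Spatial-to-Temporal and Temporal-to-Spatial Mamba blocks. 3. Introduces a Deformable Cross-Mamba Alignment module that compensates adjacent frames dynamically. 4. Proposes a Frequency Charbonnier-like loss that reduces frequency-domain gaps.

Method: 1. Spatial-to-Temporal and Temporal-to-Spatial Mamba blocks extract spatio-temporal features. 2. The Deformable Cross-Mamba Alignment module makes compensation more dynamic. 3. The Frequency Charbonnier-like loss improves reconstruction quality.

Result: VSRM achieves state-of-the-art results on multiple benchmarks.

Insight: Mamba's linear complexity and long-sequence modeling offer a new approach to video super-resolution, particularly where long sequences and large receptive fields matter.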

Abstract: Video super-resolution remains a major challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel Video Super-Resolution framework that leverages the power of Mamba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency domain gaps between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.
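
The abstract does not give the formula for the Frequency Charbonnier-like loss. The sketch below is one plausible reading, not the authors' definition: it applies the standard Charbonnier penalty sqrt(d^2 + eps^2) to the difference between the 2D FFT spectra of the reconstructed and ground-truth frames. The function name, the `eps` value, and the choice of averaging over all frequency bins are assumptions made for illustration.

```python
import numpy as np


def frequency_charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier penalty on the difference of 2D FFT spectra.

    Illustrative assumption, not the formula from the VSRM paper.
    pred, target: arrays of shape (..., H, W).
    """
    # 2D FFT over the spatial axes (height, width).
    pred_f = np.fft.fft2(pred, axes=(-2, -1))
    target_f = np.fft.fft2(target, axes=(-2, -1))
    diff = pred_f - target_f
    # Squared magnitude of the complex difference per frequency bin.
    sq_mag = diff.real ** 2 + diff.imag ** 2
    # Charbonnier: a smooth, L1-like penalty that stays differentiable at 0.
    return float(np.mean(np.sqrt(sq_mag + eps ** 2)))


rng = np.random.default_rng(0)
ground_truth = rng.random((3, 16, 16))
noisy = ground_truth + 0.1 * rng.standard_normal(ground_truth.shape)

print(frequency_charbonnier_loss(ground_truth, ground_truth))
print(frequency_charbonnier_loss(noisy, ground_truth))
```

With identical inputs the loss bottoms out at `eps` rather than zero, which is the usual behavior of a Charbonnier penalty.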

[69] PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

Oguzhan Baser,Ahmet Ege Tanriverdi,Sriram Vishwanath,Sandeep P. Chinchali

Main category: cs.CV

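TL;DR: The paper proposes PhonemeFake (PF), a language-driven deepfake attack that makes manipulations significantly harder for humans to perceive, and open-sources an adaptive bilevel detection model that detects efficiently and localizes manipulated segments precisely.

Details Motivation: Existing deepfake datasets fail to deceive human perception and therefore do not reflect the impact of real-world attacks, so more realistic attack vectors are needed.

Contribution: 1. Proposes the PhonemeFake (PF) attack, which manipulates critical speech segments using language reasoning; 2. Releases an easy-to-use PF dataset; 3. Proposes an adaptive bilevel detection model that improves detection efficiency and accuracy.

Method: 1. Generate realistic deepfake speech with a language-driven approach; 2. Design a bilevel detection model that adaptively prioritizes compute on manipulated regions.

Result: The PF attack reduces human perception by up to 42% and benchmark accuracies by up to 94%. The detection model reduces EER by 91% while achieving up to 90% speed-up.

Insight: 1. Language-driven deepfake attacks are more realistic; 2. Adaptively prioritizing compute improves detection efficiency.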

Abstract: Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. This highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and an open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models, making it a scalable solution.
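
The detection results above are reported as equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. The following is a minimal sketch of how EER is commonly estimated from detector scores. It is not taken from the PhonemeFake code; the function name and the convention that label 1 means "fake" are assumptions.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate EER by sweeping a threshold over the observed scores.

    scores: higher means "more likely fake" (assumed convention).
    labels: 1 = fake, 0 = real (assumed convention).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    for threshold in np.concatenate([np.unique(scores), [np.inf]]):
        predicted_fake = scores >= threshold
        far = np.mean(predicted_fake[labels == 0])   # real flagged as fake
        frr = np.mean(~predicted_fake[labels == 1])  # fake missed
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint at the threshold where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


print(equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
print(equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1]))
```

A relative reduction such as the reported 91% would compare this quantity between a baseline detector and the proposed one on the same data.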

[70] Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

Nuoye Xiong,Anqi Dong,Ning Wang,Cong Hua,Guangming Zhu,Mei Lin,Peiyi Shen,Liang Zhang

Main category: cs.CV

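TL;DR: The paper proposes CBM-HNMU, an improved Concept Bottleneck Model that aims to strengthen mutual understanding between humans and neural networks. It automatically identifies and corrects detrimental concepts to improve both interpretability and accuracy.

Details Motivation: Growing model complexity reduces interpretability. Most existing methods explain at the sample level and lack effective interventions, so a method that improves interpretability and accuracy at the model level is needed.

Contribution: Proposes CBM-HNMU, which uses a CBM as an interpretable framework, automatically corrects detrimental concepts, and distills the corrected knowledge back into the black-box model. Experiments show improved performance on multiple datasets.

Method: Detrimental concepts are identified through a global analysis of gradient contributions and then refined (removed or replaced). The modified CBM distills the corrected knowledge back into the black-box model.

Result: Evaluated on multiple CNN and Transformer models, with a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%.

Insight: CBM-HNMU improves interpretability and, by correcting concepts, also raises accuracy, suggesting a new way to intervene in black-box models.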

Abstract: Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at the sample level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.

[71] Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate

Byung Hyun Lee,Sungjin Lim,Seunggyu Lee,Dong Un Kang,Se Young Chun

Main category: cs.CV

Abstract: Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding linear modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding nonlinear Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit content demonstrated that the proposed CPE outperforms prior art by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at https://github.com/Hyun1A/CPE

[72] Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding

Xingyilang Yin,Jiale Wang,Xi Yang,Mutian Xu,Xu Gu,Nannan Wang

Main category: cs.CV

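TL;DR: MVOV3D uses noise correction and multi-view fusion to improve open-vocabulary 3D scene understanding significantly, with no additional training.

Details Motivation: Existing methods rely on point-text pairs or 2D-3D feature alignment, and the limited amount of 3D data restricts how well they handle diverse object categories. Multi-view fusion methods are promising but perform sub-optimally because of noise in vision-language models.

Contribution: Proposes MVOV3D, which improves open-vocabulary 3D scene understanding through noise correction and optimized multi-view fusion without any training.

Method: Uses precise region-level image features and text features from CLIP encoders, together with 3D geometric priors, to optimize multi-view fusion.

Result: Achieves 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160, outperforming leading trained 3D networks by a significant margin.

Insight: Noise correction combined with multi-view fusion can substantially improve open-vocabulary scene understanding while preserving the model's generalizability.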

Abstract: Recent open-vocabulary 3D scene understanding approaches mainly focus on training 3D networks through contrastive learning with point-text pairs or by distilling 2D features into 3D models via point-pixel alignment. While these methods show considerable performance in benchmarks with limited vocabularies, they struggle to handle diverse object categories because the limited amount of 3D data bounds the training of strong open-vocabulary 3D models. We observe that 2D multi-view fusion methods take precedence in understanding diverse concepts in 3D scenes. However, inherent noise in vision-language models leads multi-view fusion to sub-optimal performance. To this end, we introduce MVOV3D, a novel approach aimed at unleashing the potential of 2D multi-view fusion for open-vocabulary 3D scene understanding. We focus on reducing the inherent noise without training, thereby preserving the generalizability while enhancing open-world capabilities. Specifically, MVOV3D improves multi-view 2D features by leveraging precise region-level image features and text features encoded by CLIP encoders and incorporates 3D geometric priors to optimize multi-view fusion. Extensive experiments on various datasets demonstrate the effectiveness of our method. Notably, our MVOV3D achieves a new record with 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 for challenging open-vocabulary semantic segmentation, outperforming current leading trained 3D networks by a significant margin.
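
The headline numbers here are mean intersection-over-union (mIoU) scores. As a reminder of what the metric measures, here is a minimal sketch over per-point class labels. The function name and the choice to skip classes absent from both prediction and ground truth are assumptions; benchmark toolkits differ in how they handle absent classes.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for cls in range(num_classes):
        pred_mask = pred == cls
        target_mask = target == cls
        union = np.logical_or(pred_mask, target_mask).sum()
        if union == 0:
            # Class absent from both: skipped here (an assumption, see lead-in).
            continue
        intersection = np.logical_and(pred_mask, target_mask).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))


# Four points, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```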

[73] Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration

Ramya Hebbalaguppe,Tamoghno Kandar,Abhinav Nagpal,Chetan Arora

Main category: cs.CV

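TL;DR: The paper proposes TCA, an attribute-aware, zero-shot, test-time calibration method that addresses the confidence miscalibration caused by test-time prompt tuning (TPT). Initializing prompts with a large language model (LLM) and adding a regularization loss improves calibration significantly.

Details Motivation: Existing test-time prompt tuning (TPT) methods raise image recognition accuracy but degrade confidence calibration, which limits their use in critical applications. The paper aims to fix this tunnel-vision problem.

Contribution: 1) Initializes test-time prompts with an LLM to reduce overfitting; 2) Designs a regularization loss that reduces intra-class distance and increases inter-class distance; 3) Validates the method on several CLIP architectures and 15 datasets.

Method: Prompts are initialized from LLM-provided prior knowledge about target label attributes, and a regularization loss maintains prompt quality during tuning.

Result: TCA reports an average expected calibration error (ECE) of 4.11, compared to 11.7 for vanilla TPT, 6.12 for C-TPT, 6.78 for DiffTPT, and 8.43 for PromptAlign.

Insight: Prompt initialization and regularization design are key to better test-time calibration, and LLM prior knowledge helps mitigate overfitting.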

Abstract: Vision-language models (VLM) have demonstrated impressive performance in image recognition by leveraging self-supervised training on large datasets. Their performance can be further improved by adapting to the test sample using test-time prompt tuning (TPT). Unfortunately, the singular focus of TPT approaches on improving the accuracy suffers from tunnel vision, and leads to degradation in confidence calibration. This limits the applicability of TPT in critical applications. We make three contributions in this work. (1) We posit that random or naive initialization of prompts leads to overfitting on a particular test sample, and is the main reason for miscalibration of the VLM after TPT. To mitigate the problem, we propose careful initialization of test time prompt using prior knowledge about the target label attributes from a large language model (LLM); (2) To further maintain the quality of prompts during TPT, we propose a novel regularization loss to reduce intra-class distance, and increase inter-class distance between the learnt prompts; (3) Through extensive experiments on different CLIP architectures and 15 datasets, we show that our approach can effectively improve the calibration after TPT. We report an average expected calibration error (ECE) of 4.11 with our method, TCA, compared to 11.7 for vanilla TPT, 6.12 for C-TPT (ICLR’24), 6.78 for DiffTPT (CVPR’23), and 8.43 for PromptAlign (NeurIPS’23). The code is publicly accessible at: https://github.com/rhebbalaguppe/TCA_PromptWithoutPanic.
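
The comparison above is stated in expected calibration error (ECE). The sketch below shows the standard equal-width binned estimator: the sample-weighted average gap between accuracy and mean confidence per bin. It is a generic illustration rather than the TCA evaluation code; the function name and the default of 10 bins are assumptions, and reported values such as 4.11 are presumably this quantity expressed in percent.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE with equal-width confidence bins (lo, hi]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            accuracy = correct[in_bin].mean()
            avg_confidence = confidences[in_bin].mean()
            # Weight each bin's gap by the fraction of samples it holds.
            ece += in_bin.mean() * abs(accuracy - avg_confidence)
    return float(ece)


# Four predictions at 90% confidence, three of them right: gap of 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A model that becomes more accurate but more overconfident can therefore improve accuracy while its ECE gets worse, which is the failure mode the abstract attributes to TPT.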

[74] Listener-Rewarded Thinking in VLMs for Image Preferences

Alexander Gambashidze,Li Pengyi,Matvey Skripkin,Andrey Galichin,Anton Gusarov,Konstantin Sobolev,Andrey Kuznetsov,Ivan Oseledets

Main category: cs.CV

Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model’s reasoning trace contradicts that of an independent, frozen vision-language model (“listener”) evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner’s chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.

[75] AG-VPReID 2025: Aerial-Ground Video-based Person Re-identification Challenge Results

Kien Nguyen,Clinton Fookes,Sridha Sridharan,Huy Nguyen,Feng Liu,Xiaoming Liu,Arun Ross,Dana Michalski,Tamás Endrei,Ivan DeAndres-Tame,Ruben Tolosana,Ruben Vera-Rodriguez,Aythami Morales,Julian Fierrez,Javier Ortega-Garcia,Zijing Gong,Yuhao Wang,Xuehu Liu,Pingping Zhang,Md Rashidunnabi,Hugo Proença,Kailash A. Hambarde,Saeid Rezaei

Main category: cs.CV

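TL;DR: The paper presents the AG-VPReID 2025 Challenge, which focuses on high-altitude (80-120m) aerial-ground video-based person re-identification (ReID), and describes the new dataset and the methods developed by several international teams.

Details Motivation: Viewpoint differences, scale variations, and occlusions make person re-identification across aerial and ground views difficult, so performance in this setting needs to improve.

Contribution: Introduces the first large-scale video-based competition for aerial-ground person ReID, built on the new AG-VPReID dataset, to advance research in this area.

Method: Participating teams used multi-stream architectures, transformer-based temporal reasoning, and physics-informed modeling. The leading approach was X-TFCLIP.

Result: X-TFCLIP attained 72.28% Rank-1 accuracy in the aerial-to-ground setting and 70.77% in the ground-to-aerial setting, surpassing existing baselines.

Insight: The challenge results suggest that multi-stream fusion and temporal modeling are important for aerial-ground ReID, and the dataset's complexity shows that further research is needed.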

Abstract: Person re-identification (ReID) across aerial and ground vantage points has become crucial for large-scale surveillance and public safety applications. Although significant progress has been made in ground-only scenarios, bridging the aerial-ground domain gap remains a formidable challenge due to extreme viewpoint differences, scale variations, and occlusions. Building upon the achievements of the AG-ReID 2023 Challenge, this paper introduces the AG-VPReID 2025 Challenge - the first large-scale video-based competition focused on high-altitude (80-120m) aerial-ground ReID. Constructed on the new AG-VPReID dataset with 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames captured from UAVs, CCTV, and wearable cameras, the challenge featured four international teams. These teams developed solutions ranging from multi-stream architectures to transformer-based temporal reasoning and physics-informed modeling. The leading approach, X-TFCLIP from UAM, attained 72.28% Rank-1 accuracy in the aerial-to-ground ReID setting and 70.77% in the ground-to-aerial ReID setting, surpassing existing baselines while highlighting the dataset’s complexity. For additional details, please refer to the official website at https://agvpreid25.github.io.
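
Results in this challenge are reported as Rank-1 accuracy: the fraction of query tracklets whose most similar gallery tracklet has the same identity. The sketch below illustrates the metric with cosine similarity over one feature vector per tracklet. It is not the official evaluation script; the function name and the use of a single pooled feature per tracklet are assumptions, and real ReID protocols often apply extra filtering (for example by camera) that is omitted here.

```python
import numpy as np


def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of queries whose nearest gallery item shares their identity."""
    query_feats = np.asarray(query_feats, dtype=float)
    gallery_feats = np.asarray(gallery_feats, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    # L2-normalize so the dot product equals cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    similarity = q @ g.T
    nearest = np.argmax(similarity, axis=1)
    return float(np.mean(gallery_ids[nearest] == query_ids))


# Toy example: two aerial queries matched against three ground gallery items.
aerial = [[1.0, 0.0], [0.0, 1.0]]
ground = [[0.9, 0.1], [0.2, 0.8], [1.0, 1.0]]
print(rank1_accuracy(aerial, [0, 1], ground, [0, 1, 2]))
```

Swapping the roles of the aerial and ground sets gives the ground-to-aerial setting.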

[76] DMD-Net: Deep Mesh Denoising Network

Aalok Gangopadhyay,Shashikant Verma,Shanmuganathan Raman

Main category: cs.CV

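TL;DR: The paper proposes DMD-Net, an end-to-end deep learning framework for mesh denoising. A graph convolutional neural network aggregates over both the primal and the dual graph, combined with an asymmetric two-stream network and a primal-dual fusion block.

Details Motivation: Mesh denoising matters for 3D object processing, but existing methods perform poorly under high noise. The paper aims to make mesh denoising more robust and accurate with deep learning.

Contribution: 1. Proposes DMD-Net, an end-to-end deep learning framework whose GCN aggregates features on both the primal and the dual graph; 2. Designs the Feature Guided Transformer (FGT) paradigm, consisting of a feature extractor, a transformer, and a denoiser; 3. Shows excellent performance even under high noise.

Method: DMD-Net uses an asymmetric two-stream network with a primal-dual fusion block and the FGT paradigm, denoising the mesh in three steps: feature extraction, transformation, and denoising.

Result: Experiments show that DMD-Net obtains competitive or better results than existing methods under various kinds of noise and is robust to high noise.

Insight: Aggregating features on the primal and dual graphs together captures a mesh's geometric structure better and thereby improves denoising.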

Abstract: We present Deep Mesh Denoising Network (DMD-Net), an end-to-end deep learning framework for solving the mesh denoising problem. DMD-Net consists of a Graph Convolutional Neural Network in which aggregation is performed in both the primal and the dual graph. This is realized in the form of an asymmetric two-stream network, which contains a primal-dual fusion block that enables communication between the primal-stream and the dual-stream. We develop a Feature Guided Transformer (FGT) paradigm, which consists of a feature extractor, a transformer, and a denoiser. The feature extractor estimates the local features, which guide the transformer to compute a transformation that is applied to the noisy input mesh to obtain a useful intermediate representation. This is further processed by the denoiser to obtain the denoised mesh. Our network is trained on a large scale dataset of 3D objects. We perform exhaustive ablation studies to demonstrate that each component in our network is essential for obtaining the best performance. We show that our method obtains competitive or better results when compared with the state-of-the-art mesh denoising algorithms. We demonstrate that our method is robust to various kinds of noise. We observe that even in the presence of extremely high noise, our method achieves excellent performance.

[77] Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval

Li-Cheng Shen,Jih-Kang Hsieh,Wei-Hua Li,Chu-Song Chen

Main category: cs.CV

Abstract: Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). In the first stage, we leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline, enabling effective and scalable online retrieval. In the second stage, the MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.

[78] Region-Aware CAM: High-Resolution Weakly-Supervised Defect Segmentation via Salient Region Perception

Hang-Cheng Dong,Lu Zou,Bingguo Liu,Dong Ye,Guodong Liu

Main category: cs.CV

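TL;DR: Proposes Region-Aware CAM, a weakly supervised semantic segmentation framework that achieves high-resolution defect segmentation through salient region perception, addressing the low resolution and poor detail preservation of conventional CAM methods.

Details Motivation: Industrial quality inspection urgently needs automated defect detection, but existing methods depend on large annotated datasets, which conflicts with resource-constrained practice. Weakly supervised learning is one answer, yet conventional CAM methods suffer from low resolution and insufficient detail preservation.

Contribution: 1. Proposes filtering-guided backpropagation (FGBP), which filters gradient magnitudes to localize defect-relevant regions precisely; 2. Designs a region-aware weighted module that improves spatial precision; 3. Refines model performance iteratively through pseudo-label training.

Method: The framework has two parts: a region-aware CAM and pseudo-label training. FGBP refines target regions, the region-aware weighted module further improves precision, and pseudo-label training iterates on the result.

Result: Experiments on industrial defect datasets show that the method clearly outperforms existing methods and bridges the gap between weakly supervised learning and high-precision defect segmentation.

Insight: Gradient filtering and region-aware mechanisms make high-resolution defect segmentation possible under weak supervision, offering a practical solution for resource-constrained industrial scenarios.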

Abstract: Surface defect detection plays a critical role in industrial quality inspection. Recent advances in artificial intelligence have significantly enhanced the automation level of detection processes. However, conventional semantic segmentation and object detection models heavily rely on large-scale annotated datasets, which conflicts with the practical requirements of defect detection tasks. This paper proposes a novel weakly supervised semantic segmentation framework comprising two key components: a region-aware class activation map (CAM) and pseudo-label training. To address the limitations of existing CAM methods, especially low-resolution heat maps and insufficient detail preservation, we introduce filtering-guided backpropagation (FGBP), which refines target regions by filtering gradient magnitudes to identify areas with higher relevance to defects. Building upon this, we further develop a region-aware weighted module to enhance spatial precision. Finally, pseudo-label segmentation is implemented to refine the model’s performance iteratively. Comprehensive experiments on industrial defect datasets demonstrate the superiority of our method. The proposed framework effectively bridges the gap between weakly supervised learning and high-precision defect segmentation, offering a practical solution for resource-constrained industrial scenarios.

[79] STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing

Junsung Lee,Junoh Kang,Bohyung Han

Main category: cs.CV

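TL;DR: The paper proposes STR-Match, a training-free video editing method that optimizes latents with a new STR score to address the temporal inconsistency and motion distortion of existing methods.

Details Motivation: Existing text-guided video editing methods suffer from temporal inconsistency, motion distortion, and limited domain transformation, mainly because they model spatiotemporal pixel relevance insufficiently.

Contribution: Proposes the STR-Match algorithm, which introduces the STR score (capturing spatiotemporal pixel relevance) within a training-free latent optimization framework and improves the spatiotemporal consistency and visual quality of edited videos.

Method: The STR score is computed by combining 2D spatial attention and 1D temporal modules and is used to guide latent optimization, avoiding computationally expensive 3D attention.

Result: Experiments show that STR-Match outperforms existing methods in visual quality and spatiotemporal consistency, and stays consistent even under significant domain transformations.

Insight: Modeling spatiotemporal relevance efficiently can markedly improve video editing quality without complex 3D attention or extra training.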

Abstract: Previous text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and, most notably, limited domain transformation. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that produces visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal modules in text-to-video (T2V) diffusion models, without the overhead of computationally expensive 3D attention mechanisms. Integrated into a latent optimization framework with a latent mask, STR-Match generates temporally consistent and visually faithful videos, maintaining strong performance even under significant domain transformations while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatiotemporal consistency.

[80] Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

Dang Jisheng,Wu Xudong,Wang Bimei,Lv Ning,Chen Jiayu,Jingwen Zhao,Yichu liu,Jizhao Liu,Juncheng Li,Teng Wang

Main category: cs.CV

Abstract: Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, first, we devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model’s semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our code is available at https://github.com/longmalongma/DeSa2VA.

[81] How Semantically Informative is an Image?: Measuring the Covariance-Weighted Norm of Contrastive Learning Embeddings

Fumiya Uchiyama,Rintaro Yanagi,Shohei Taniguchi,Shota Takashiro,Masahiro Suzuki,Hirokatsu Kataoka,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CV

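TL;DR: The paper proposes a semantic informativeness metric based on contrastive learning embeddings that quantifies the absolute semantic information of an image or a text, and it extends the concept of Information Gain to the vision and language domains.

Details Motivation: Contrastive learning can model multimodal probability distributions, but existing work focuses on relative semantic similarity and leaves absolute semantic informativeness unmeasured. The paper aims to fill this gap.

Contribution: 1. Proposes a new semantic informativeness metric; 2. Extends Information Gain to the vision and language domains; 3. Computes it efficiently through a covariance-weighted norm of the embedding.

Method: 1. Compute Information Gain from text and image samples; 2. Propose a metric based on the covariance-weighted norm of the embedding; 3. Compute it efficiently with CLIP or SigLIP models.

Result: Experiments show that Information Gain scores correlate strongly with the embedding norm (coefficient of determination from 0.98 to 1.00), at low computational cost and with publicly available open-weight models.

Insight: Images with the lowest Information Gain are often placeholder icons, which suggests the metric can separate informative from uninformative content in practice.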

Abstract: Contrastive learning has the capacity to model multimodal probability distributions by embedding and aligning visual representations with semantics from captions. This approach enables the estimation of relational semantic similarity; however, it remains unclear whether it can also represent absolute semantic informativeness. In this work, we introduce a semantic informativeness metric for an image calculated from text samples via a contrastive learning model; similarly, the informativeness of a text is calculated from image samples. We propose a redefinition of the concept of Information Gain, a concept previously explored in natural language processing, extending its application to the domains of vision and language. Our metric quantifies how conditioning on an image distorts the distribution of associated texts, and vice versa for text conditioning on image distributions. In OpenCLIP’s empirical results, we observe that images with the lowest Information Gain scores often correspond to placeholder icons such as “image not found.” Furthermore, we propose to measure a norm-based metric of the embedding to estimate the Information Gain, following the theoretical results for Skip-Gram with Negative Sampling (SGNS) word embedding. Information Gain can be measured using either CLIP or SigLIP, and the results demonstrate a strong correlation with a coefficient of determination ranging from 0.98 to 1.00. After obtaining the mean and the covariance of the sample embedding, the computational cost of this method is independent of the sample size, and it is compatible with publicly available, open-weight models.
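
The abstract states that, once the mean and covariance of the sample embeddings are known, the norm-based estimate no longer depends on the sample size. The sketch below illustrates that property with a generic covariance-weighted squared norm, (x - mean)^T Σ (x - mean). It is an assumption-laden illustration, not the paper's exact estimator: the function name, the centering, the absence of any scaling constant, and the choice of which sample set supplies the statistics are all hypothetical, and random vectors stand in for CLIP or SigLIP embeddings.

```python
import numpy as np


def covariance_weighted_sq_norm(embedding, mean, cov):
    """Return (x - mean)^T cov (x - mean) for a single embedding x.

    Hypothetical form of a covariance-weighted norm; see the lead-in for caveats.
    """
    centered = np.asarray(embedding, dtype=float) - np.asarray(mean, dtype=float)
    return float(centered @ np.asarray(cov, dtype=float) @ centered)


# Statistics are computed once from a sample set (random stand-ins here).
rng = np.random.default_rng(0)
sample_embeddings = rng.standard_normal((1000, 8))
sample_mean = sample_embeddings.mean(axis=0)
sample_cov = np.cov(sample_embeddings, rowvar=False)

# Scoring a new embedding costs O(d^2), independent of the 1000 samples.
new_embedding = rng.standard_normal(8)
print(covariance_weighted_sq_norm(new_embedding, sample_mean, sample_cov))
```

With an identity covariance and zero mean the score reduces to the ordinary squared L2 norm, which makes the role of the covariance weighting easy to see.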

[82] CP-Guard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems

Senkang Hu,Yihang Tao,Guowen Xu,Xinyuan Qian,Yiqin Deng,Xianhao Chen,Sam Tak Wu Kwong,Yuguang Fang

Main category: cs.CV

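TL;DR: The paper proposes CP-Guard, a unified framework for detecting and defending against malicious agents in multi-agent perception systems.

Details Motivation: In multi-agent collaborative perception (CP), an agent must receive messages from other agents, which exposes it to malicious attacks. The authors propose CP-Guard to defend against such attacks and improve system security.

Contribution: 1) Proposes CP-Guard, a unified, probability-agnostic, and adaptive framework; 2) Develops the probability-agnostic sample consensus (PASAC) method; 3) Defines the collaborative consistency loss (CCLoss); 4) Designs an online adaptive threshold method.

Method: The method has three parts: 1) PASAC verifies consensus without prior probabilities of malicious agents; 2) CCLoss captures the discrepancy between the ego agent and its collaborators; 3) A dynamic threshold adjustment mechanism handles changing environments.

Result: Experiments show that CP-Guard effectively detects and defends against malicious agents and improves system reliability.

Insight: The paper underlines the importance of security in multi-agent systems and offers a flexible, robust defense based on dynamic consensus.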

Abstract: Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-Guard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against an ego agent’s perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define a collaborative consistency loss (CCLoss) for the object detection and bird’s eye view (BEV) segmentation tasks to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose an online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the systems in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code will be released at https://github.com/CP-Security/CP-Guard

[83] Neural Cellular Automata: From Cells to Pixels

Ehsan Pajouheshgar,Yitao Xu,Ali Abbasi,Alexander Mordvintsev,Wenzel Jakob,Sabine Süsstrunk

Main category: cs.CV

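TL;DR: The paper pairs Neural Cellular Automata (NCA) with a lightweight implicit decoder, addressing the training and inference cost of NCAs on high-resolution grids while preserving their self-organizing, emergent properties.

Details Motivation: Conventional NCAs work well on low-resolution grids, but at high resolution they are held back by training time, memory requirements, and the limited range of information propagation, which restricts their practical use.

Contribution: 1. Combines NCA with a shared implicit decoder to produce high-resolution output; 2. Proposes loss functions tailored to high-resolution tasks; 3. Demonstrates the approach across multiple tasks and grid types.

Method: Using ideas from implicit neural representations, the NCA runs on a coarse grid and a lightweight decoder then renders images at arbitrary resolution. Loss functions are designed specifically for high-resolution output.

Result: The method generates full-HD output in real time while preserving the NCA's self-organizing properties, and it performs well across multiple tasks and grid types.

Insight: Pairing an implicit decoder with an NCA offers an efficient, scalable design for high-resolution self-organizing systems.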

Abstract: Neural Cellular Automata (NCAs) are bio-inspired systems in which identical cells self-organize to form complex and coherent patterns by repeatedly applying simple local rules. NCAs display striking emergent behaviors including self-regeneration, generalization and robustness to unseen situations, and spontaneous motion. Despite their success in texture synthesis and morphogenesis, NCAs remain largely confined to low-resolution grids. This limitation stems from (1) training time and memory requirements that grow quadratically with grid size, (2) the strictly local propagation of information which impedes long-range cell communication, and (3) the heavy compute demands of real-time inference at high resolution. In this work, we overcome this limitation by pairing NCA with a tiny, shared implicit decoder, inspired by recent advances in implicit neural representations. Following NCA evolution on a coarse grid, a lightweight decoder renders output images at arbitrary resolution. We also propose novel loss functions for both morphogenesis and texture synthesis tasks, specifically tailored for high-resolution output with minimal memory and computation overhead. Combining our proposed architecture and loss functions brings substantial improvement in quality, efficiency, and performance. NCAs equipped with our implicit decoder can generate full-HD outputs in real time while preserving their self-organizing, emergent properties. Moreover, because each MLP processes cell states independently, inference remains highly parallelizable and efficient. We demonstrate the applicability of our approach across multiple NCA variants (on 2D, 3D grids, and 3D meshes) and multiple tasks, including texture generation and morphogenesis (growing patterns from a seed), showing that with our proposed framework, NCAs seamlessly scale to high-resolution outputs with minimal computational overhead.
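The coarse-grid-plus-decoder idea can be illustrated with a small sketch. It assumes, purely for illustration, that the decoder reads a bilinearly interpolated cell state at each output pixel and passes it through a one-layer ReLU network; the abstract does not specify the decoder inputs or architecture, and the names `bilinear_sample`, `decode`, and `render` are hypothetical.

```python
def bilinear_sample(grid, y, x):
    """Interpolate the cell-state vector of a coarse grid at a continuous (y, x)."""
    h, w = len(grid), len(grid[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    channels = len(grid[0][0])
    return [
        (1 - fy) * (1 - fx) * grid[y0][x0][k]
        + (1 - fy) * fx * grid[y0][x1][k]
        + fy * (1 - fx) * grid[y1][x0][k]
        + fy * fx * grid[y1][x1][k]
        for k in range(channels)
    ]


def decode(state, weights, bias):
    """Stand-in for the shared decoder: one linear layer followed by ReLU."""
    return [
        max(0.0, sum(w * s for w, s in zip(row, state)) + b)
        for row, b in zip(weights, bias)
    ]


def render(grid, out_h, out_w, weights, bias):
    """Render an out_h x out_w image from a coarse grid, one pixel at a time."""
    h, w = len(grid), len(grid[0])
    image = []
    for i in range(out_h):
        y = i * (h - 1) / (out_h - 1)
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1)
            row.append(decode(bilinear_sample(grid, y, x), weights, bias))
        image.append(row)
    return image


# Toy 2x2 grid with one channel, upsampled to 3x3 with an identity decoder.
coarse = [[[0.0], [1.0]], [[1.0], [2.0]]]
image = render(coarse, 3, 3, weights=[[1.0]], bias=[0.0])
```

Because every output pixel is decoded independently from the coarse state, the rendering loop is trivially parallel, which matches the abstract's remark that inference remains highly parallelizable.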

[84] MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

Mai A. Shaaban,Tausifa Jan Saleem,Vijay Ram Papineni,Mohammad Yaqub

Main category: cs.CV

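TL;DR: The paper proposes MOTOR, a new multimodal retrieval and re-ranking method that applies optimal transport over visual and textual information to improve retrieval relevance in medical visual question answering (MedVQA), and thereby answer accuracy.

Details Motivation: Existing retrieval-augmented generation methods for MedVQA often retrieve irrelevant content, which degrades the model's reasoning, and existing re-ranking methods ignore the joint role of multimodal information, which is crucial for medical diagnosis.

Contribution: Proposes MOTOR, which re-ranks retrievals using optimal transport over textual and visual information, improving the clinical relevance of the retrieved content and, in turn, answer accuracy.

Method: Retrieved contexts are re-ranked with optimal transport over combined visual and textual information so that they are more relevant across modalities.

Result: On MedVQA datasets, MOTOR outperforms state-of-the-art methods by 6.45% on average, a result supported by human expert evaluation.

Insight: Multimodal information, especially vision and text used jointly, matters for retrieval and reasoning in the medical domain, and optimal transport is an effective tool for modeling it jointly.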

Abstract: Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at https://github.com/BioMedIA-MBZUAI/MOTOR.
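As a rough illustration of optimal-transport re-ranking, the sketch below scores each retrieved candidate by an entropic-regularized transport cost (Sinkhorn iterations) between query features and candidate features, then sorts by ascending cost. The feature vectors, the squared-Euclidean cost, uniform marginals, and all function names are assumptions made for this example; the abstract does not describe MOTOR's actual cost function or how grounded captions enter it.

```python
import math


def pairwise_cost(xs, ys):
    """Squared Euclidean distance between every pair of feature vectors."""
    return [[sum((p - q) ** 2 for p, q in zip(x, y)) for y in ys] for x in xs]


def sinkhorn_cost(cost, eps=0.1, iters=200):
    """Transport cost under entropic regularization with uniform marginals."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan entry is u[i] * kernel[i][j] * v[j].
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )


def rerank(query_feats, candidates):
    """Order candidate names from lowest to highest transport cost."""
    scored = [
        (sinkhorn_cost(pairwise_cost(query_feats, feats)), name)
        for name, feats in candidates.items()
    ]
    return [name for _, name in sorted(scored)]


# Toy features; in practice these might concatenate textual and visual embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "irrelevant": [[-1.0, 0.0], [0.0, -1.0]],
    "relevant": [[0.9, 0.1], [0.1, 0.9]],
}
ranking = rerank(query, candidates)
```

A lower transport cost means the candidate's features can be matched to the query's features with less "movement", which is one plausible way to express joint textual and visual relevance as a single score.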

[85] Point Cloud Compression and Objective Quality Assessment: A Survey

Yiling Xu,Yujie Zhang,Shuting Xia,Kaifa Yang,He Huang,Ziyu Shan,Wenjie Huang,Qi Yang,Le Yang

Main category: cs.CV

Abstract: The rapid growth of 3D point cloud data, driven by applications in autonomous driving, robotics, and immersive environments, has created a critical demand for efficient compression and quality assessment techniques. Unlike traditional 2D media, point clouds present unique challenges due to their irregular structure, high data volume, and complex attributes. This paper provides a comprehensive survey of recent advances in point cloud compression (PCC) and point cloud quality assessment (PCQA), emphasizing their significance for real-time and perceptually relevant applications. We analyze a wide range of handcrafted and learning-based PCC algorithms, along with objective PCQA metrics. By benchmarking representative methods on emerging datasets, we offer detailed comparisons and practical insights into their strengths and limitations. Despite notable progress, challenges such as enhancing visual fidelity, reducing latency, and supporting multimodal data remain. This survey outlines future directions, including hybrid compression frameworks and advanced feature extraction strategies, to enable more efficient, immersive, and intelligent 3D applications.

[86] MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances

Yunzhe Shao,Xinyu Yi,Lu Yin,Shihui Guo,Junhai Yong,Feng Xu

Main category: cs.CV

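TL;DR: The paper proposes MagShield, a new method that addresses orientation errors caused by magnetic interference in sparse inertial motion capture systems.

Details Motivation: Existing inertial measurement unit (IMU) systems are prone to orientation estimation errors in magnetically disturbed environments, which limits their practical use.

Contribution: MagShield adopts a “detect-then-correct” strategy that uses multi-IMU joint analysis and human motion priors to improve the accuracy of sparse inertial motion capture systems in magnetically disturbed environments.

Method: MagShield first detects magnetic disturbances and then corrects orientation errors using human motion priors. It can be integrated with most existing sparse inertial MoCap systems.

Result: Experiments show that MagShield significantly improves motion capture accuracy under magnetic interference and is compatible with different sparse inertial MoCap systems.

Insight: Combining multi-sensor data with motion priors can effectively improve the robustness of inertial capture systems in complex environments.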

Abstract: This paper proposes a novel method called MagShield, designed to address the issue of magnetic interference in sparse inertial motion capture (MoCap) systems. Existing Inertial Measurement Unit (IMU) systems are prone to orientation estimation errors in magnetically disturbed environments, limiting their practical application in real-world scenarios. To address this problem, MagShield employs a “detect-then-correct” strategy, first detecting magnetic disturbances through multi-IMU joint analysis, and then correcting orientation errors using human motion priors. MagShield can be integrated with most existing sparse inertial MoCap systems, improving their performance in magnetically disturbed environments. Experimental results demonstrate that MagShield significantly enhances the accuracy of motion capture under magnetic interference and exhibits good compatibility across different sparse inertial MoCap systems.

[87] Attention to Burstiness: Low-Rank Bilinear Prompt Tuning

Yuzhu Wang,Manni Duan,Shu Kong

Main category: cs.CV

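TL;DR: The paper proposes Bilinear Prompt Tuning (BPT), a low-rank bilinear prompt tuning method that whitens non-Gaussian data before learning prompts, significantly improving the speed and accuracy of visual prompt tuning.

Details Motivation: In Visual Prompt Tuning (VPT), the interaction between image patch embeddings and the key and query projectors in the Transformer's self-attention module produces “burstiness” in the resulting values. In addition, the patch embeddings and the key and query projectors follow Laplacian and hyper-Laplacian distributions, respectively, rather than Gaussian ones, which makes prompts harder to learn.

Contribution: 1) A whitening step that de-correlates the data and equalizes their variance, moving them closer to Gaussian; 2) a low-rank Bilinear Prompt Tuning (BPT) method that significantly improves tuning speed and accuracy; 3) experiments showing that BPT outperforms various VPT methods while reducing parameter count and computation overhead.

Method: 1) Derive a whitening matrix over image patch embeddings and the key/query projectors; 2) multiply the whitening matrix with the prompt to be learned in a bilinear manner; 3) introduce a low-rank bilinear model to reduce computational cost.

Result: BPT performs strongly on multiple benchmark datasets, for example gaining more than 25 accuracy points on the CUB dataset, while reducing parameter count and computation overhead.

Insight: The paper identifies the “burstiness” problem in visual prompt tuning and addresses it with whitening and a low-rank bilinear formulation, suggesting a new direction for parameter-efficient model tuning.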

Abstract: Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover “burstiness” in the values arising from the interaction of image patch embeddings, and the key and query projectors within Transformer's self-attention module. Furthermore, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distribution, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance towards more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., >25 accuracy points on the CUB dataset; interestingly, it learns “bursty prompts”. Extending the bilinear model which is known to introduce burstiness, we present a compact, low-rank version by learning two smaller matrices whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments across multiple benchmark datasets demonstrate that BPT methods not only outperform various VPT methods but also reduce parameter count and computation overhead.
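The low-rank part of the method, learning two smaller matrices whose product yields the prompts, can be sketched as below. This shows only the factorization and its parameter count; the whitening matrix is omitted, and the sizes, initialization, and helper names are hypothetical choices for illustration.

```python
import random


def matmul(a, b):
    """Plain-Python product of an (n x r) matrix and an (r x d) matrix."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def init_matrix(rows, cols, rng, scale=0.02):
    """Small random initialization, a stand-in for learnable parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


num_prompts, dim, rank = 10, 64, 4  # hypothetical sizes
rng = random.Random(0)
left = init_matrix(num_prompts, rank, rng)   # learnable, num_prompts x rank
right = init_matrix(rank, dim, rng)          # learnable, rank x dim
prompts = matmul(left, right)                # final prompts, num_prompts x dim

full_params = num_prompts * dim              # directly learned prompts
low_rank_params = rank * (num_prompts + dim) # factorized prompts
```

With these toy sizes the factorized form stores 296 values instead of 640; the saving grows as the rank shrinks relative to the number of prompts and the embedding dimension.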

[88] Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

Yiwei He,Xiangtai Li,Zhenglin Huang,Yi Dong,Hao Fei,Jiangning Zhang,Baoyuan Wu,Guangliang Cheng

Main category: cs.CV

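TL;DR: The paper proposes BiMi, a framework for detecting and localizing bilingual multimodal misinformation. It combines cross-modal and cross-lingual consistency analysis with natural language explanation, adds an online retrieval module to improve generalization, and releases the BiMiBench benchmark dataset.

Details Motivation: As multimodal content becomes more realistic, misinformation becomes more subtle, especially in bilingual news media, where inconsistencies between images and subtitles can mislead viewers. A method that jointly detects and explains misinformation is needed.

Contribution: 1. Proposes the BiMi framework, which performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation. 2. Introduces an online retrieval module to support model reasoning. 3. Releases the BiMiBench benchmark dataset, comprising 104,000 samples. 4. Applies GRPO for the first time in this domain to improve explanation quality.

Method: BiMi combines region-level localization with cross-modal and cross-lingual consistency detection, and uses GRPO to improve the quality of its natural language explanations. An online retrieval module supplies external context.

Result: Experiments show gains of up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore over strong baselines, advancing the state of the art.

Insight: Multimodal misinformation detection requires joint analysis of the visual and linguistic modalities; GRPO shows promise for improving explanation quality; and supplementing external information in dynamic settings can improve generalization.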

Abstract: The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.

[89] Utilizing a Novel Deep Learning Method for Scene Categorization in Remote Sensing Data

Ghufran A. Omran,Wassan Saad Abduljabbar Hayale,Ahmad AbdulQadir AlRababah,Israa Ibraheem Al-Barazanchi,Ravi Sekhar,Pritesh Shah,Sushma Parihar,Harshavardhan Reddy Penubadi

Main category: cs.CV

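TL;DR: This paper proposes CO-BRNN, a deep learning method for scene categorization in remote sensing data that addresses the limitations of conventional methods on noisy, large-scale data and reaches a top accuracy of 97%.

Details Motivation: Scene categorization in remote sensing data has broad applications in disaster control, ecological observation, and urban planning, but noise and diversity in the data make it hard for conventional deep learning methods to classify with high accuracy.

Contribution: Proposes the Cuttlefish Optimized Bidirectional Recurrent Neural Network (CO-BRNN), which performs best among the compared existing methods with an accuracy of 97%.

Method: The CO-BRNN model combines a bidirectional recurrent neural network with the Cuttlefish optimization algorithm to capture key visual features in remote sensing data.

Result: CO-BRNN outperforms the other methods in accuracy (LSTM-CRF 90%, MLP-CNN 85%, CNN-LSTM 80%).

Insight: Physical confirmation is particularly important for ensuring the efficiency of satellite data, and optimization algorithms play a notable role in improving deep learning model performance.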

Abstract: Scene categorization (SC) in remotely acquired images is an important topic with broad implications in fields such as disaster control, ecological observation, and urban planning. Despite its many applications, achieving a high degree of accuracy in SC from remote sensing data has proven difficult. This is because conventional deep learning models need large and highly varied databases, and the data carry high levels of noise, which makes it hard to capture important visual features. To address these problems, this study introduces a technique referred to as the Cuttlefish Optimized Bidirectional Recurrent Neural Network (CO-BRNN) for scene categorization in remote sensing data. The study compares the performance of CO-BRNN with existing techniques, including Multilayer Perceptron-Convolutional Neural Network (MLP-CNN), Convolutional Neural Network-Long Short Term Memory (CNN-LSTM), Long Short Term Memory-Conditional Random Field (LSTM-CRF), Graph-Based (GB), Multilabel Image Retrieval Model (MIRM-CF), and Convolutional Neural Networks Data Augmentation (CNN-DA). The results demonstrate that CO-BRNN attained the highest accuracy of 97%, followed by LSTM-CRF with 90%, MLP-CNN with 85%, and CNN-LSTM with 80%. The study highlights the significance of physical confirmation to ensure the efficiency of satellite data.

[90] YM-WML: A new Yolo-based segmentation Model with Weighted Multi-class Loss for medical imaging

Haniyeh Nikkhah,Jafar Tanha,Mahdi Zarrin,SeyedEhsan Roshan,Amin Kazempour

Main category: cs.CV

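TL;DR: YM-WML is a YOLO-based medical image segmentation model that uses a weighted multi-class loss to handle class imbalance and performs strongly on the ACDC dataset.

Details Motivation: Medical image segmentation is challenged by class imbalance and complex structures, and needs a more robust approach.

Contribution: The proposed YM-WML model combines multi-scale feature extraction, an attention mechanism, and a new Weighted Multi-class Exponential (WME) loss function, improving segmentation accuracy.

Method: The model uses the YOLOv11 neck for multi-scale feature aggregation together with an attention-based segmentation head, and introduces the WME loss to handle class imbalance.

Result: On the ACDC dataset, the model reaches a Dice Similarity Coefficient of 91.02, outperforming existing methods, with stable training and strong generalization.

Insight: Handling class imbalance through a weighted loss, combined with multi-scale features and attention, is an effective route to better medical image segmentation.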

Abstract: Medical image segmentation poses significant challenges due to class imbalance and the complex structure of medical images. To address these challenges, this study proposes YM-WML, a novel model for cardiac image segmentation. The model integrates a robust backbone for effective feature extraction, a YOLOv11 neck for multi-scale feature aggregation, and an attention-based segmentation head for precise and accurate segmentation. To address class imbalance, we introduce the Weighted Multi-class Exponential (WME) loss function. On the ACDC dataset, YM-WML achieves a Dice Similarity Coefficient of 91.02, outperforming state-of-the-art methods. The model demonstrates stable training, accurate segmentation, and strong generalization, setting a new benchmark in cardiac segmentation tasks.
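For reference, the Dice Similarity Coefficient reported above is the standard overlap measure 2|A ∩ B| / (|A| + |B|), computed per class. The sketch below evaluates it on flattened label maps; the toy labels and the convention of returning 1.0 when a class is absent from both maps are choices made for this example. The WME loss itself is not specified in the abstract and is therefore not sketched.

```python
def dice(pred, target, cls):
    """Dice Similarity Coefficient for one class over flattened label maps."""
    pred_mask = [label == cls for label in pred]
    target_mask = [label == cls for label in target]
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    denominator = sum(pred_mask) + sum(target_mask)
    # Convention for this sketch: a class absent from both maps counts as a perfect match.
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator


# Toy label maps with three classes (0 = background).
pred = [0, 1, 1, 2, 2, 2]
target = [0, 1, 2, 2, 2, 2]
per_class = {cls: dice(pred, target, cls) for cls in (0, 1, 2)}
```

Here class 1 scores 2/3 (one shared pixel out of two predicted and one true) and class 2 scores 6/7; published results such as 91.02 are typically this quantity averaged over classes and cases and scaled to a percentage.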

[91] Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images

Shreyas Dixit,Ashhar Aziz,Shashwat Bajpai,Vasu Sharma,Aman Chadha,Vinija Jain,Amitava Das

Main category: cs.CV

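TL;DR: The paper proposes PECCAVI, an image watermarking technique for AI-generated images that is safe against visual paraphrase attacks and distortion-free.

Details Motivation: With the rapid growth of AI-generated content, especially images, watermarks risk being maliciously tampered with or bypassed, particularly by the newly introduced visual paraphrase attack.

Contribution: 1. Proposes PECCAVI, the first distortion-free watermarking technique that is safe against visual paraphrase attacks; 2. Improves robustness by embedding watermarks in Non-Melting Points (NMPs) with multi-channel frequency-domain watermarking; 3. Introduces noisy burnishing to counter reverse engineering aimed at locating NMPs.

Method: 1. Embeds watermarks in NMPs; 2. Uses multi-channel frequency-domain watermarking; 3. Applies noisy burnishing for added security.

Result: PECCAVI effectively resists visual paraphrase attacks while keeping the image distortion-free.

Insight: By focusing on core semantic regions (NMPs) and combining frequency-domain and noise techniques, PECCAVI offers an efficient and secure approach to watermarking AI-generated images.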

Abstract: A report by the European Union Law Enforcement Agency predicts that by 2026, up to 90 percent of online content could be synthetically generated, raising concerns among policymakers, who cautioned that “Generative AI could act as a force multiplier for political disinformation. The combined effect of generative text, images, videos, and audio may surpass the influence of any single modality.” In response, California’s Bill AB 3211 mandates the watermarking of AI-generated images, videos, and audio. However, concerns remain regarding the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely. Generative AI-powered de-watermarking attacks, especially the newly introduced visual paraphrase attack, have shown an ability to fully remove watermarks, resulting in a paraphrase of the original image. This paper introduces PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). PECCAVI strategically embeds watermarks within these NMPs and employs multi-channel frequency domain watermarking. It also incorporates noisy burnishing to counter reverse-engineering efforts aimed at locating NMPs to disrupt the embedded watermark, thereby enhancing durability. PECCAVI is model-agnostic. All relevant resources and codes will be open-sourced.

[92] ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Amir Aghdam,Vincent Tao Hu

Main category: cs.CV

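TL;DR: ActAlign is a zero-shot fine-grained video classification framework that uses language-guided sequence alignment to classify videos without video examples or temporal annotations. Unlike conventional contrastive vision-language models, it aligns sub-action sequences generated by a large language model with video frames via Dynamic Time Warping in a shared embedding space, improving fine-grained action recognition.

Details Motivation: Existing contrastive vision-language models such as SigLIP perform well on open-set recognition but fail to capture the temporal structure needed for fine-grained action recognition. A zero-shot method that requires no extra annotation or fine-tuning is therefore needed for fine-grained video classification.

Contribution: 1) Proposes the ActAlign framework, which formulates video classification as sequence alignment; 2) uses a large language model to generate sub-action sequences and matches them to video frames with Dynamic Time Warping (DTW); 3) reaches 30.5% accuracy on the extremely challenging ActionAtlas benchmark, outperforming billion-parameter video-language models.

Method: 1) A large language model generates an ordered sub-action sequence for each action class; 2) the sub-action sequences and video frames are projected into a shared embedding space; 3) DTW computes an alignment score, enabling zero-shot classification.

Result: On the ActionAtlas benchmark, ActAlign reaches 30.5% accuracy (human accuracy is 61.6%) while using approximately 8x fewer parameters than the video-language models it outperforms.

Insight: Structured language priors combined with classical alignment techniques offer a scalable and general zero-shot approach to fine-grained video understanding with vision-language models.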

Abstract: We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8x fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.
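The alignment step can be sketched with textbook DTW. The example assumes each class comes with an ordered list of sub-action embeddings and each video with a list of frame embeddings in the same space, uses cosine distance as the local cost, and picks the class with the lowest cumulative cost. The toy vectors and function names are hypothetical, and ActAlign's actual scoring (for example, any normalization of the alignment score) may differ.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dtw_cost(sub_actions, frames):
    """Minimal cumulative distance over monotonic alignments (classic DTW)."""
    n, m = len(sub_actions), len(frames)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(sub_actions[i - 1], frames[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def classify(frames, class_to_sub_actions):
    """Return the class whose sub-action sequence aligns most cheaply."""
    return min(
        class_to_sub_actions,
        key=lambda name: dtw_cost(class_to_sub_actions[name], frames),
    )


# Toy video: frames drift from direction A to direction B.
frames = [[1.0, 0.0], [1.0, 0.1], [0.1, 1.0], [0.0, 1.0]]
classes = {
    "a_then_b": [[1.0, 0.0], [0.0, 1.0]],
    "b_then_a": [[0.0, 1.0], [1.0, 0.0]],
}
predicted = classify(frames, classes)
```

Both toy classes contain the same two sub-actions, so a mean-pooled similarity would not separate them; only the order-aware alignment does, which is the point the abstract makes about temporal structure.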

[93] Probabilistic Prototype Calibration of Vision-Language Models for Generalized Few-shot Semantic Segmentation

Jie Liu,Jiayi Shen,Pan Zhou,Jan-Jakob Sonke,Efstratios Gavves

Main category: cs.CV

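TL;DR: The paper proposes FewCLIP, a probabilistic prototype calibration framework that improves multi-modal prototype learning for generalized few-shot semantic segmentation (GFSS). Through visual calibration prototypes and distribution regularization, FewCLIP adapts better to novel classes with scarce annotations.

Details Motivation: Existing prototype learning methods for few-shot semantic segmentation are deterministic and adapt poorly to diverse samples, especially for novel classes with scarce annotations. FewCLIP aims to improve adaptability and generalization through probabilistic prototype calibration.

Contribution: 1. Introduces a prototype calibration mechanism that refines frozen textual prototypes with learnable visual calibration prototypes; 2. Proposes distribution regularization, yielding structured and uncertainty-aware prototype learning.

Method: FewCLIP combines multi-modal prototype calibration with distribution regularization: visual calibration prototypes refine the textual prototypes, and the calibration prototypes are modeled probabilistically to reduce overfitting.

Result: On the PASCAL-5ⁱ and COCO-20ⁱ datasets, FewCLIP significantly outperforms existing methods in both the GFSS and class-incremental settings.

Insight: Probabilistic prototype calibration improves adaptation to novel classes in few-shot semantic segmentation, mitigating overfitting while strengthening generalization.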

Abstract: Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototype learning. However, existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, FewCLIP first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, FewCLIP introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty-aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate that our proposed FewCLIP significantly outperforms state-of-the-art approaches across both GFSS and class-incremental settings. The code is available at https://github.com/jliu4ai/FewCLIP.

[94] Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models

Atharv Mittal,Agam Pandey,Amritanshu Tiwari,Sukrit Jindal,Swadesh Swain

Main category: cs.CV

Abstract: Large Vision-Language Models (VLMs) have revolutionized computer vision, enabling tasks such as image classification, captioning, and visual question answering. However, they remain highly vulnerable to adversarial attacks, particularly in scenarios where both visual and textual modalities can be manipulated. In this study, we conduct a comprehensive reproducibility study of “An Image is Worth 1000 Lies: Adversarial Transferability Across Prompts on Vision-Language Models”, validating the Cross-Prompt Attack (CroPA) and confirming its superior cross-prompt transferability compared to existing baselines. Beyond replication, we propose several key improvements: (1) a novel initialization strategy that significantly improves Attack Success Rate (ASR); (2) an investigation of cross-image transferability by learning universal perturbations; and (3) a novel loss function targeting vision encoder attention mechanisms to improve generalization. Our evaluation across prominent VLMs, including Flamingo, BLIP-2, and InstructBLIP, as well as extended experiments on LLaVA, validates the original results and demonstrates that our improvements consistently boost adversarial effectiveness. Our work reinforces the importance of studying adversarial vulnerabilities in VLMs and provides a more robust framework for generating transferable adversarial examples, with significant implications for understanding the security of VLMs in real-world applications.

[95] A Novel Frame Identification and Synchronization Technique for Smartphone Visible Light Communication Systems Based on Convolutional Neural Networks

Vaigai Nayaki Yokar,Hoa Le-Minh,Xicong Li,Wai Lok Woo,Luis Nero Alves,Stanislav Zvanovec,Tran The Son,Zabih Ghassemlooy

Main category: cs.CV

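TL;DR: This paper proposes a lightweight frame identification and synchronization technique based on a convolutional neural network (CNN) for smartphone visible light communication systems; experiments show an accuracy of about 98.74%.

Details Motivation: Frame identification and synchronization in screen-to-camera (S2C) visible light communication (VLC) systems are complicated by blurred, cropped, and rotated images, and the authors propose a new solution to these problems.

Contribution: Proposes a lightweight CNN-based frame identification and synchronization technique, introduces overhead frames to improve system performance, and demonstrates high accuracy.

Method: The CNN model was built with Python and the TensorFlow Keras framework and trained through three real-time experimental investigations, using a dataset designed around practical S2C problems such as blurring and cropping.

Result: The model achieves an overall accuracy of approximately 98.74% on frame identification and synchronization.

Insight: The method offers a robust and efficient solution for S2C VLC systems, including in mobility scenarios.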

Abstract: This paper proposes a novel, robust, and lightweight supervised Convolutional Neural Network (CNN)-based technique for frame identification and synchronization, designed to enhance short-link communication performance in a screen-to-camera (S2C) based visible light communication (VLC) system. Developed using Python and the TensorFlow Keras framework, the proposed CNN model was trained through three real-time experimental investigations conducted in Jupyter Notebook. These experiments incorporated a dataset created from scratch to address various real-time challenges in S2C communication, including blurring, cropping, and rotated images in mobility scenarios. Overhead frames were introduced for synchronization, which leads to enhanced system performance. The experimental results demonstrate that the proposed model achieves an overall accuracy of approximately 98.74%, highlighting its effectiveness in identifying and synchronizing frames in S2C VLC systems.

[96] MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

Jian Chen,Wenye Ma,Penghang Liu,Wei Wang,Tengwei Song,Ming Li,Chenguang Wang,Ruiyi Zhang,Changyou Chen

Main category: cs.CV

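TL;DR: The paper introduces MusiXQA, the first comprehensive dataset for evaluating and advancing the music sheet understanding of multimodal large language models (MLLMs), and develops Phi-3-MusiX, a model fine-tuned on this dataset that significantly outperforms GPT-based methods.

Details Motivation: MLLMs perform well on natural images, text-rich documents, and graphic designs, but their ability to interpret music sheets remains underexplored. The paper aims to fill this gap.

Contribution: 1. Introduces the MusiXQA dataset, with high-quality synthetic music sheets and structured annotations; 2. Reveals the limitations of current MLLMs in music sheet understanding; 3. Develops the Phi-3-MusiX model, with significant performance gains.

Method: Synthetic music sheets are generated with MusiXTeX and annotated with structured information such as notes, chords, and clefs to build MusiXQA; an MLLM is then fine-tuned on the dataset to obtain Phi-3-MusiX.

Result: Experiments show that Phi-3-MusiX significantly outperforms GPT-based methods on music sheet understanding, laying a foundation for future research.

Insight: Music sheet understanding is a new challenge for MLLMs that requires combining visual and musical domain knowledge, and it may extend to more complex music tasks in the future.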

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we developed Phi-3-MusiX, an MLLM fine-tuned on our dataset, achieving significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.

[97] VisionScores – A system-segmented image score dataset for deep learning tasks

Alejandro Romero Amezcua,Mariano José Juan Rivera Meraz

Main category: cs.CV

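TL;DR: VisionScores is the first system-segmented music score image dataset, designed for deep learning and machine learning tasks. It contains 24.8k samples of two-handed piano pieces, organized by composer and composition type, and provides metadata and full-page scores.

Details Motivation: The goal is to provide a structure-rich, high information-density score image dataset for deep learning and machine learning tasks, filling a gap left by existing datasets.

Contribution: The main contribution is VisionScores, the first system-segmented score image dataset, covering two scenarios: the same composition type by different composers (14k samples) and different composition types by the same composer (10.8k samples).

Method: Scores of two-handed piano pieces are collected, organized by composer and composition type, and preprocessed into grayscale images of 128×512 pixels.

Result: The result is a score image dataset of 24.8k samples that includes metadata and full-page scores to support further analysis.

Insight: Deep learning and machine learning tasks on score images need to account for diversity in composers and composition types in order to improve model generalization.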

Abstract: VisionScores is the first system-segmented image score dataset, aiming to offer structure-rich, high information-density images for machine and deep learning tasks. Restricted to two-handed piano pieces, it was built to account not only for graphic similarity but also for composition patterns, since the creative process is highly instrument-dependent. It provides two scenarios defined by composer and composition type. The first, comprising 14k samples, covers works by different composers but of the same composition type, specifically Sonatinas. The second, comprising 10.8k samples, covers the opposite case: various composition types by the same composer, Franz Liszt. All 24.8k samples are formatted as grayscale JPG images of $128 \times 512$ pixels. VisionScores provides not only the formatted samples but also the systems’ order and the pieces’ metadata. Unsegmented full-page scores and the pre-formatted images are also included for further analysis.

[98] Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation

Xinrong Hu,Yiyu Shi

Main category: cs.CV

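TL;DR: The paper proposes AugPaint, a diffusion-based data augmentation method that uses inpainting to generate image-label pairs, addressing the scarcity of annotated data in medical image segmentation.

Details Motivation: Annotating data for medical image segmentation is costly and time-consuming, and the available labeled data are often insufficient, which limits model performance. A label-efficient way to improve segmentation performance is therefore needed.

Contribution: Proposes the AugPaint framework, which uses latent diffusion models for inpainting to generate high-quality image-label pairs from limited labeled data without retraining. The generated images match their labels accurately and provide effective supervision for downstream segmentation.

Method: AugPaint uses the reverse denoising process of a diffusion model, conditioning on the labeled foreground region and filling in the background to produce synthetic image-label pairs. The method needs no additional training.

Result: AugPaint is validated on four public medical image segmentation datasets (CT, MRI, and skin imaging), where it significantly improves segmentation performance over existing methods.

Insight: The inpainting ability of diffusion models can efficiently produce high-quality data, especially when annotations are scarce, offering a new direction for semi-supervised learning.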

Abstract: Collecting pixel-level labels for medical datasets can be a laborious and expensive process, and enhancing segmentation performance with a scarcity of labeled data is a crucial challenge. This work introduces AugPaint, a data augmentation framework that utilizes inpainting to generate image-label pairs from limited labeled data. AugPaint leverages latent diffusion models, known for their ability to generate high-quality in-domain images with low overhead, and adapts the sampling process for the inpainting task without the need for retraining. Specifically, given a pair of image and label mask, we crop the area labeled as foreground and condition on it during the reverse denoising process at every noise level. The masked background area is gradually filled in, and all generated images are paired with the label mask. This approach ensures an accurate match between synthetic images and label masks, setting it apart from existing dataset generation methods. The generated images serve as valuable supervision for training downstream segmentation models, effectively addressing the challenge of limited annotations. We conducted extensive evaluations of our data augmentation method on four public medical image segmentation datasets, including CT, MRI, and skin imaging. Results across all datasets demonstrate that AugPaint outperforms state-of-the-art label-efficient methodologies, significantly improving segmentation performance.
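Conditioning on the foreground at every noise level can be illustrated with a mask-and-replace sampling loop, a common way to inpaint with a pretrained diffusion model. The sketch below works on a flat list of values rather than a latent image, uses a placeholder `toy_denoiser` instead of a trained network, and pastes the clean foreground back at the end; all of these, along with the noise schedule, are assumptions for illustration and not AugPaint's actual sampler.

```python
import math
import random


def add_noise(clean, alpha_bar, rng):
    """Forward-diffuse clean values to the noise level given by alpha_bar."""
    return [
        math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for v in clean
    ]


def inpaint(clean, foreground, alpha_bars, denoise_step, rng):
    """Fill in background values while keeping the labeled foreground fixed."""
    x = [rng.gauss(0.0, 1.0) for _ in clean]  # start from pure noise
    for t in reversed(range(len(alpha_bars))):
        # Overwrite the foreground with a suitably noised copy of the real image.
        known = add_noise(clean, alpha_bars[t], rng)
        x = [k if fg else v for k, fg, v in zip(known, foreground, x)]
        # One reverse step; only the background is effectively generated.
        x = denoise_step(x, t)
    # Assumption of this sketch: paste the clean foreground into the final sample.
    return [c if fg else v for c, fg, v in zip(clean, foreground, x)]


def toy_denoiser(x, t):
    """Placeholder for a trained denoising network (hypothetical)."""
    return [0.5 * v for v in x]


image = [0.8, 0.9, 0.1, 0.2]
mask = [True, True, False, False]     # True marks labeled foreground
alpha_bars = [0.99, 0.7, 0.3, 0.05]   # index 0 is the least noisy level
sample = inpaint(image, mask, alpha_bars, toy_denoiser, random.Random(0))
```

Since the foreground is never generated, the original label mask remains valid for every synthetic sample, which is the property the abstract highlights.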

[99] From Coarse to Fine: Learnable Discrete Wavelet Transforms for Efficient 3D Gaussian Splatting

Hung Nguyen,An Le,Runfa Li,Truong Nguyen

Main category: cs.CV

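TL;DR: AutoOpti3DGS uses learnable discrete wavelet transforms to train 3D Gaussian Splatting from coarse to fine, reducing redundant Gaussian primitives while preserving visual quality.

Details Motivation: 3D Gaussian Splatting trains and renders quickly, but its ever-growing number of Gaussian primitives strains memory and bandwidth. A method is needed that restrains primitive growth without sacrificing visual quality.

Contribution: Proposes the AutoOpti3DGS framework, which uses learnable discrete wavelet transforms for coarse-to-fine training, reducing redundant Gaussian primitives and lowering memory and bandwidth consumption.

Method: Wavelet transforms with fixed low-pass filters and learnable high-pass filters are applied to the input images; an orthogonality loss gradually activates high-frequency detail, delaying the formation of redundant Gaussian primitives.

Result: Experiments show that AutoOpti3DGS needs only a single hyper-parameter, integrates seamlessly with existing frameworks, and produces sparser scene representations.

Insight: A coarse-to-fine wavelet strategy effectively controls the number of Gaussian primitives and is better suited to memory-constrained hardware.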

Abstract: 3D Gaussian Splatting has emerged as a powerful approach in novel view synthesis, delivering rapid training and rendering but at the cost of an ever-growing set of Gaussian primitives that strains memory and bandwidth. We introduce AutoOpti3DGS, a training-time framework that automatically restrains Gaussian proliferation without sacrificing visual fidelity. The key idea is to feed the input images to a sequence of learnable Forward and Inverse Discrete Wavelet Transforms, where low-pass filters are kept fixed, high-pass filters are learnable and initialized to zero, and an auxiliary orthogonality loss gradually activates fine frequencies. This wavelet-driven, coarse-to-fine process delays the formation of redundant fine Gaussians, allowing 3DGS to capture global structure first and refine detail only when necessary. Extensive experiments show that AutoOpti3DGS requires just a single filter learning-rate hyper-parameter, integrates seamlessly with existing efficient 3DGS frameworks, and consistently produces sparser scene representations more compatible with memory or storage-constrained hardware.
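The coarse-to-fine effect of a zero-initialized high-pass branch can be seen in a one-dimensional toy. The sketch uses an unnormalized (averaging) Haar transform and replaces the learnable high-pass filter with a single scalar `high_pass_gain`; both are simplifications chosen for this example, and the orthogonality loss that drives the gain during training is not modeled.

```python
def haar_forward(signal):
    """One-level Haar analysis: pairwise averages (low-pass) and half-differences (high-pass)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """One-level Haar synthesis, the exact inverse of haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out


def coarse_to_fine(signal, high_pass_gain):
    """Reconstruct the signal with the high-pass band scaled by a gain in [0, 1]."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, [high_pass_gain * d for d in detail])


row = [4.0, 2.0, 7.0, 1.0]           # one row of pixel intensities (toy data)
coarse = coarse_to_fine(row, 0.0)    # high-pass off: only coarse structure survives
halfway = coarse_to_fine(row, 0.5)   # partially activated detail
full = coarse_to_fine(row, 1.0)      # high-pass fully on: exact reconstruction
```

With the gain at zero the training target is a blurred version of the image, so there is little fine detail for additional Gaussians to fit; as the gain rises, detail reappears and refinement becomes worthwhile, which mirrors the schedule the abstract describes.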

[100] Ovis-U1 Technical Report

Guo-Hua Wang,Shanshan Zhao,Xinjie Zhang,Liangfu Cao,Pengxin Zhan,Lunhao Duan,Shiyin Lu,Minghao Fu,Xiaohao Chen,Jianshan Zhao,Yang Li,Qing-Guo Chen

Main category: cs.CV

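TL;DR: Ovis-U1 is a 3-billion-parameter unified multimodal model that combines multimodal understanding, text-to-image generation, and image editing, and outperforms some current state-of-the-art models.

Details Motivation: Aims to combine multimodal understanding and generation tasks through a unified training approach, improving performance on both.

Contribution: Proposes a diffusion-based visual decoder and a bidirectional token refiner, together with a unified training approach that starts from a language model.

Method: Pairs a diffusion-based visual decoder with a bidirectional token refiner and performs unified training starting from a language model.

Result: Scores 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing Ristretto-3B and SAIL-VL-1.5-2B; also performs strongly on text-to-image generation and image editing.

Insight: Unified training yields significant synergy between understanding and generation tasks; future multimodal model design may lean further toward unified architectures.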

Abstract: In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.

[101] Empowering Small VLMs to Think with Dynamic Memorization and Exploration

Jiazhen Liu,Yuchuan Deng,Long Chen

Main category: cs.CV

Abstract: Empowering Small-scale Vision-Language Models (SVLMs) with reliable thinking capabilities remains fundamentally challenging due to their limited parameter capacity and weak instruction-following abilities. Existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capabilities of SVLMs. Consequently, directly applying these paradigms to SVLMs often suffers from severe pseudo thinking traces and advantage collapse, ultimately undermining both thinking reliability and task performance. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. However, the widely adopted two-stage training paradigm still performs poorly on SVLMs, as their tendency toward sub-optimal convergence hinders the trade-off and limits the benefits of the combination. To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step, ensuring that every update contributes to the trade-off. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance, and thus delivers substantial performance improvements. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: https://github.com/HKUST-LongGroup/DyME

[102] Where, What, Why: Towards Explainable Driver Attention Prediction

Yuchen Zhou,Jiayu Tang,Xiaoyan Xiao,Yueyao Lin,Linkai Liu,Zipeng Guo,Hao Fei,Xiaobo Xia,Chao Gou

Main category: cs.CV

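TL;DR: This paper proposes an explainable driver attention prediction task that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). It also introduces W3DA, the first large-scale dataset for the task, and LLada, a large language model-based framework.

Details Motivation: Existing methods only predict the attention distribution during driving (where) and ignore the semantic interpretation (what) and cognitive reasons (why) of attention allocation, which limits deeper understanding of the underlying mechanisms. The authors therefore propose a new task paradigm to fill this gap.

Contribution: 1. Proposes the first explainable driver attention prediction task, jointly modeling where, what, and why; 2. Releases the W3DA dataset with detailed semantic and causal annotations; 3. Proposes the LLada framework, which unifies pixel modeling, semantic parsing, and cognitive reasoning.

Method: Proposes LLada, an LLM-based framework that jointly models the spatial distribution, semantic content, and cognitive reasons of driver attention in an end-to-end architecture.

Result: Experiments show that LLada generalizes robustly across datasets and driving conditions, confirming its effectiveness.

Insight: The work offers a deeper account of driver attention mechanisms and is expected to benefit autonomous driving, intelligent driver training, and human-computer interaction.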

Abstract: Modeling task-driven attention in driving is a fundamental challenge for both autonomous vehicles and cognitive science. Existing methods primarily predict where drivers look by generating spatial heatmaps, but fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. To bridge this gap, we introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). To support this, we present W3DA, the first large-scale explainable driver attention dataset. It enriches existing benchmarks with detailed semantic and causal annotations across diverse driving scenarios, including normal conditions, safety-critical situations, and traffic accidents. We further propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Extensive experiments demonstrate the effectiveness of LLada, exhibiting robust generalization across datasets and driving conditions. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction.

[103] DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation

Jihun Kim,Hoyong Kwon,Hyeokjun Kweon,Wooseong Jeong,Kuk-Jin Yoon

Main category: cs.CV

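TL;DR: DC-TTA is a novel test-time adaptation framework that applies a divide-and-conquer strategy to interactive segmentation, improving SAM's performance in complex scenarios.

Details Motivation: Although SAM performs well in interactive segmentation, it struggles in specialized domains and complex scenarios (e.g., camouflaged or multi-part objects). A method that adapts dynamically to these challenges is therefore needed.

Contribution: Proposes the DC-TTA framework, which uses a divide-and-conquer strategy to partition user interactions into more coherent subsets, performs test-time adaptation on each subset independently, and merges the results into a unified predictor, significantly improving segmentation on complex tasks.

Method: 1. Partition user interactions into multiple subsets; 2. Adapt a separate model to each subset independently via TTA; 3. Merge the per-subset models into a unified predictor that integrates their specialized knowledge.

Result: Experiments show that DC-TTA significantly outperforms SAM's zero-shot results and conventional TTA methods, achieving higher accuracy with fewer interactions on complex tasks such as camouflaged object segmentation.

Insight: The divide-and-conquer strategy reduces conflicts among diverse prompts and enables more localized model updates, which suits adaptation in complex scenarios.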

Abstract: Interactive segmentation (IS) allows users to iteratively refine object boundaries with minimal cues, such as positive and negative clicks. While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi-part objects). To overcome these challenges, we propose DC-TTA, a novel test-time adaptation (TTA) framework that adapts SAM on a per-sample basis by leveraging user interactions as supervision. Instead of forcing a single model to incorporate all user clicks at once, DC-TTA partitions the clicks into more coherent subsets, each processed independently via TTA with a separated model. This Divide-and-Conquer strategy reduces conflicts among diverse cues and enables more localized updates. Finally, we merge the adapted models to form a unified predictor that integrates the specialized knowledge from each subset. Experimental results across various benchmarks demonstrate that DC-TTA significantly outperforms SAM’s zero-shot results and conventional TTA methods, effectively handling complex tasks such as camouflaged object segmentation with fewer interactions and improved accuracy.

[104] Computer-Aided Multi-Stroke Character Simplification by Stroke Removal

Ryo Ishiyama,Shinnosuke Matsuo,Seiichi Uchida

Main category: cs.CV

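TL;DR: This paper proposes a stroke-removal framework for simplifying multi-stroke characters: it selectively removes strokes that do not affect legibility, lowering the learning difficulty for non-native learners.

Details Motivation: Multi-stroke characters (as in Chinese and Japanese) are highly complex, which challenges learners, especially non-native ones. Simplifying these characters could lower learning barriers and facilitate legible font design.

Contribution: Proposes a framework that simplifies multi-stroke characters systematically through stroke removal, using a highly accurate character recognition model to assess legibility.

Method: Uses a character recognition model to assess how stroke removal affects legibility and selectively removes the strokes with the least impact. Experiments cover 1,256 character classes.

Result: Experiments find that many characters remain distinguishable even after multiple strokes are removed, suggesting a potential direction for simplification strategies.

Insight: The study indicates that the legibility of multi-stroke characters is somewhat robust to stroke removal, which supports developing more formalized simplification strategies.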

Abstract: Multi-stroke characters in scripts such as Chinese and Japanese can be highly complex, posing significant challenges for both native speakers and, especially, non-native learners. If these characters can be simplified without degrading their legibility, it could reduce learning barriers for non-native speakers, facilitate simpler and legible font designs, and contribute to efficient character-based communication systems. In this paper, we propose a framework to systematically simplify multi-stroke characters by selectively removing strokes while preserving their overall legibility. More specifically, we use a highly accurate character recognition model to assess legibility and remove those strokes that minimally impact it. Experimental results on 1,256 character classes with 5, 10, 15, and 20 strokes reveal several key findings, including the observation that even after removing multiple strokes, many characters remain distinguishable. These findings suggest the potential for more formalized simplification strategies.
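
One way to read "remove those strokes that minimally impact" legibility is as a greedy loop, sketched below. This is an illustration under stated assumptions, not the authors' procedure: the greedy search order, the stopping threshold `min_score`, and the toy scorer `toy_legibility` (standing in for a character recognition model's confidence) are all hypothetical.

```python
def simplify(strokes, legibility, min_score):
    """Greedily drop the stroke whose removal hurts legibility least.

    `strokes` is a list of unique stroke ids and `legibility` maps a list
    of strokes to a score. Removal stops before the score would fall
    below `min_score`.
    """
    current = list(strokes)
    while len(current) > 1:
        # One candidate per stroke that could be removed next.
        candidates = [[s for s in current if s != r] for r in current]
        best = max(candidates, key=legibility)
        if legibility(best) < min_score:
            break
        current = best
    return current


# Toy stand-in for a recognition model: each stroke carries a fixed share
# of the character's recognizability (hypothetical numbers).
IMPORTANCE = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}


def toy_legibility(strokes):
    return sum(IMPORTANCE[s] for s in strokes)


kept = simplify(["a", "b", "c", "d"], toy_legibility, min_score=0.75)
```

In this toy run the two least important strokes are removed and the loop stops once a further removal would push the score under the threshold.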

[105] Hierarchical Corpus-View-Category Refinement for Carotid Plaque Risk Grading in Ultrasound

Zhiyuan Zhu,Jian Wang,Yong Jiang,Tong Han,Yuhao Huang,Ang Zhang,Kaiwen Yang,Mingyuan Luo,Zhe Liu,Yaofei Duan,Dong Ni,Tianhong Tang,Xin Yang

Main category: cs.CV

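TL;DR: The paper proposes CVC-RF, a new hierarchical framework for carotid plaque risk grading in ultrasound that improves model performance through multi-level refinement.

Details Motivation: Existing deep learning methods mostly focus on cross-view feature fusion and neglect the importance of representation learning and class differences. The paper proposes a multi-level framework to address this.

Contribution: 1. Proposes the first deep learning-based method for grading under Carotid Plaque-RADS; 2. Designs a center-memory contrastive loss that strengthens global modeling; 3. Uses a cascaded down-sampling attention module for view-level feature fusion; 4. Uses a parameter-free mixture-of-experts weighting strategy for category-level feature decoupling.

Method: Proposes the CVC-RF framework, which processes information at three levels (Corpus, View, Category): a contrastive loss strengthens global modeling at the Corpus level; an attention module fuses multi-scale information at the View level; a mixture-of-experts weighting strategy decouples features at the Category level.

Result: Experiments show that CVC-RF effectively models global features through multi-level refinement and achieves SOTA performance on carotid plaque grading.

Insight: Multi-level refinement (global, view, category) is crucial for tasks with small targets and high intra-class variability; combining contrastive learning with attention mechanisms can strengthen model robustness.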

Abstract: Accurate carotid plaque grading (CPG) is vital to assess the risk of cardiovascular and cerebrovascular diseases. Due to the small size and high intra-class variability of plaque, CPG is commonly evaluated using a combination of transverse and longitudinal ultrasound views in clinical practice. However, most existing deep learning-based multi-view classification methods focus on feature fusion across different views, neglecting the importance of representation learning and the difference in class features. To address these issues, we propose a novel Corpus-View-Category Refinement Framework (CVC-RF) that processes information at the Corpus, View, and Category levels, enhancing model performance. Our contribution is four-fold. First, to the best of our knowledge, ours is the first deep learning-based method for CPG according to the latest Carotid Plaque-RADS guidelines. Second, we propose a novel center-memory contrastive loss, which enhances the network’s global modeling capability by comparing with representative cluster centers and diverse negative samples at the Corpus level. Third, we design a cascaded down-sampling attention module to fuse multi-scale information and achieve implicit feature interaction at the View level. Finally, a parameter-free mixture-of-experts weighting strategy is introduced to leverage class clustering knowledge to weight different experts, enabling feature decoupling at the Category level. Experimental results indicate that CVC-RF effectively models global features via multi-level refinement, achieving state-of-the-art performance in the challenging CPG task.

[106] MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings

Haonan Chen,Hong Liu,Yuping Luo,Liang Wang,Nan Yang,Furu Wei,Zhicheng Dou

Main category: cs.CV

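TL;DR: MoCa is a two-stage framework that turns causal vision-language models into effective bidirectional multimodal embedding models through modality-aware continual pre-training and heterogeneous contrastive fine-tuning, addressing limitations of existing methods.

Details Motivation: Current multimodal embedding methods built on causal vision-language models (VLMs) have three main problems: causal attention is ill-suited to embedding tasks, reliance on high-quality labeled data limits scalability, and training objectives and data lack diversity.

Contribution: Proposes the MoCa framework with two core stages, Modality-aware Continual Pre-training and Heterogeneous Contrastive Fine-tuning, which significantly improve the performance and generalization of bidirectional multimodal embeddings.

Method: The first stage strengthens bidirectional context-aware reasoning through a joint reconstruction objective; the second stage uses diverse multimodal data to improve alignment and generalization.

Result: Experiments show that MoCa performs strongly on the MMEB and ViDoRe-v2 benchmarks, setting new SOTA results, and scales well with both model size and training data.

Insight: Introducing bidirectional attention and diverse data is key to improving multimodal embedding models, and effective training is possible without relying on high-quality labeled data.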

Abstract: Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.

[107] Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation

Zhenhua Ning,Zhuotao Tian,Shaoshuai Shi,Guangming Lu,Daojing He,Wenjie Pei,Li Jiang

Main category: cs.CV

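TL;DR: The paper proposes R²S, a reasoning-based segmentation framework, and the 3D ReasonSeg dataset. By decomposing spatial reasoning in a way that emulates human cognition, it improves the spatial reasoning of multimodal large language models in point cloud perception.

Details Motivation: Existing methods struggle with complex instructions that require accurate spatial reasoning, even though 3D point cloud data provides rich spatial cues such as size and position.

Contribution: 1. Proposes the R²S framework, which decomposes spatial reasoning into two stages: first identifying relevant elements, then processing the instruction guided by their visual priors; 2. Introduces the 3D ReasonSeg dataset to address the inadequacy of existing data for complex reasoning tasks.

Method: Adopts a two-stage reasoning pipeline: the first stage identifies relevant elements, and the second processes the instruction guided by visual priors. The framework operates on 3D point cloud data to strengthen spatial reasoning.

Result: Quantitative and qualitative experiments show that R²S and 3D ReasonSeg significantly improve spatial reasoning in 3D point cloud perception and provide a new benchmark for future research.

Insight: Staged reasoning that emulates human cognition can solve complex spatial reasoning problems more effectively, and high-quality datasets are crucial for model performance.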

Abstract: Recent advances in point cloud perception have demonstrated remarkable progress in scene understanding through vision-language alignment leveraging large language models (LLMs). However, existing methods may still encounter challenges in handling complex instructions that require accurate spatial reasoning, even if the 3D point cloud data provides detailed spatial cues such as size and position for identifying the targets. To tackle this issue, we propose Relevant Reasoning Segmentation (R$^2$S), a reasoning-based segmentation framework. The framework emulates human cognitive processes by decomposing spatial reasoning into two sequential stages: first identifying relevant elements, then processing instructions guided by their associated visual priors. Furthermore, acknowledging the inadequacy of existing datasets in complex reasoning tasks, we introduce 3D ReasonSeg, a reasoning-based segmentation dataset comprising 25,185 training samples and 3,966 validation samples with precise annotations. Both quantitative and qualitative experiments demonstrate that the R$^2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities, and we hope that they can serve as a new baseline and benchmark for future work.

[108] Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval

Sophie Zhou,Shu Kong

Main category: cs.CV

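TL;DR: This paper proposes recognizing and explaining painting plagiarism by retrieving visually similar authentic artworks, and builds a dataset using generative AI. A DINOv2-based baseline does well at recognition but poorly at retrieval. Fine-tuning with metric learning substantially improves retrieval at some cost to recognition accuracy.

Details Motivation: Art plagiarism detection is crucial for protecting artists' copyrights, yet it remains a challenging problem in forensic analysis. The paper aims to address it by retrieving similar authentic artworks.

Contribution: The main contributions are: a dataset of real paintings and AI-synthesized plagiarized versions; a DINOv2-based baseline; and fine-tuning with metric learning to improve retrieval.

Method: 1) Extract features with DINOv2 and classify plagiarism using a similarity threshold; 2) fine-tune DINOv2 with metric learning to improve retrieval.

Result: The baseline reaches 97.2% recognition accuracy but only 29.0% retrieval average precision. After fine-tuning, retrieval improves by 12% AP, while recognition accuracy drops to 92.7%.

Insight: The study shows a performance trade-off between recognition and retrieval that future work needs to balance. Synthetic data from generative AI offers a new approach to art plagiarism detection.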

Abstract: Art plagiarism detection plays a crucial role in protecting artists’ copyrights and intellectual property, yet it remains a challenging problem in forensic analysis. In this paper, we address the task of recognizing plagiarized paintings and explaining the detected plagiarisms by retrieving visually similar authentic artworks. To support this study, we construct a dataset by collecting painting photos and synthesizing plagiarized versions using generative AI, tailored to specific artists’ styles. We first establish a baseline approach using off-the-shelf features from the visual foundation model DINOv2 to retrieve the most similar images in the database and classify plagiarism based on a similarity threshold. Surprisingly, this non-learned method achieves a high recognition accuracy of 97.2% but suffers from low retrieval precision, with an average precision (AP) of 29.0%. To improve retrieval quality, we finetune DINOv2 with a metric learning loss using positive and negative sample pairs sampled from the database. The finetuned model greatly improves retrieval performance, by 12% AP over the baseline, though it unexpectedly results in a lower recognition accuracy (92.7%). We conclude with insightful discussions and outline directions for future research.
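
The baseline described above (retrieve the most similar database image, then classify by a similarity threshold) can be sketched as follows. The sketch assumes cosine similarity and uses tiny hand-written vectors in place of DINOv2 features; the similarity measure, the threshold value, and the names `cosine` and `check_painting` are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def check_painting(query, database, threshold):
    """Return (flagged, nearest_name, similarity) for a query feature.

    `database` maps artwork names to feature vectors. The query is flagged
    as a possible plagiarism when its nearest neighbour is at least
    `threshold` similar; the neighbour doubles as the explanation.
    """
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    nearest_name, nearest_vec = ranked[0]
    score = cosine(query, nearest_vec)
    return score >= threshold, nearest_name, score


# Toy features standing in for embeddings of authentic artworks.
database = {"sunrise": [1.0, 0.0, 0.0], "portrait": [0.0, 1.0, 0.0]}
flagged, match, score = check_painting([0.9, 0.1, 0.0], database, threshold=0.9)
clean, _, clean_score = check_painting([0.5, 0.5, 0.7], database, threshold=0.9)
```

Recognition depends only on the top similarity score, while retrieval quality depends on the whole ranking, which is one plausible reason the two metrics can move in different directions.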

[109] RoboScape: Physics-informed Embodied World Model

Yu Shang,Xin Zhang,Yinzhou Tang,Lei Jin,Chen Gao,Wei Wu,Yong Li

Main category: cs.CV

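TL;DR: RoboScape proposes a unified physics-informed world model that uses joint training tasks (temporal depth prediction and keypoint dynamics learning) to improve the physical plausibility and visual fidelity of generated videos, and shows practical value in robotic policy training and evaluation.

Details Motivation: Current world models have limited physical awareness, particularly in modeling 3D geometry and motion dynamics, which leads to unrealistic generation for contact-rich robotic scenarios. RoboScape aims to address this.

Contribution: 1) Proposes RoboScape, a unified physics-informed world model; 2) designs two joint training tasks that improve physical awareness; 3) validates the model's superiority and practical utility across diverse robotic scenarios.

Method: The model jointly learns RGB video generation and physics knowledge, including temporal depth prediction (enhancing geometric consistency) and keypoint dynamics learning (implicitly encoding physical properties and improving complex motion modeling).

Result: Experiments show that RoboScape generates videos with excellent visual fidelity and physical plausibility, and that it is practically useful in downstream tasks such as policy training and evaluation.

Insight: RoboScape offers new insights for building efficient physics-informed world models and advances embodied intelligence research.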

Abstract: World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. The code is available at: https://github.com/tsinghua-fib-lab/RoboScape.

[110] VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis

Shiyu Wu,Mingzhen Sun,Weining Wang,Yequan Wang,Jing Liu

Main category: cs.CV

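TL;DR: The paper proposes VisualPrompter, a training-free prompt engineering framework that optimizes user inputs with visual feedback to improve semantic alignment and generation quality in text-to-image synthesis.

Details Motivation: In current diffusion-based text-to-image generation, there is a notable gap between user-provided prompts and model-preferred prompts, so generated images may be visually appealing yet semantically inconsistent with the user's description.

Contribution: 1. Proposes an automatic self-reflection module that identifies concepts missing from generated images; 2. Designs a target-specific prompt optimization mechanism that revises prompts in a fine-grained manner; 3. The framework is plug-and-play and adapts to various generative models.

Method: VisualPrompter uses the self-reflection module and the prompt optimization mechanism to revise user prompts dynamically, using visual feedback to optimize both semantics and style.

Result: Achieves new state-of-the-art performance on multiple text-image alignment benchmarks.

Insight: Visual feedback and dynamic prompt optimization are key to improving semantic alignment in text-to-image synthesis, and the framework's flexibility makes it easy to extend to other generative tasks.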

Abstract: Since there exists a notable gap between user-provided and model-preferred prompts, generating high-quality and satisfactory images using diffusion models often requires prompt engineering to optimize user inputs. Current studies on text-to-image prompt engineering can effectively enhance the style and aesthetics of generated images. However, they often neglect the semantic alignment between generated images and user descriptions, resulting in visually appealing but content-wise unsatisfying outputs. In this work, we propose VisualPrompter, a novel training-free prompt engineering framework that refines user inputs to model-preferred sentences. In particular, VisualPrompter utilizes an automatic self-reflection module to identify the missing concepts in generated images and a target-specific prompt optimization mechanism to revise the prompts in a fine-grained manner. Extensive experiments demonstrate the effectiveness of our VisualPrompter, which achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation. Additionally, our framework features a plug-and-play design, making it highly adaptable to various generative models.

[111] MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation

Vladislav Bargatin,Egor Chistov,Alexander Yakovenko,Dmitriy Vatolin

Main category: cs.CV

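TL;DR: MEMFOF is a memory-efficient multi-frame optical flow estimation method that substantially reduces GPU memory consumption on high-resolution (FullHD) inputs while maintaining top-tier accuracy.

Details Motivation: Current optical flow methods pursue accuracy at the cost of sharply growing GPU memory consumption, especially on high-resolution inputs. MEMFOF aims to address this with a memory-efficient solution.

Contribution: 1. Proposes MEMFOF, which reduces the memory consumption of optical flow estimation; 2. Achieves SOTA performance on multiple benchmarks; 3. Supports native 1080p training without cropping or downsampling.

Method: 1. Revisits the design of RAFT-like architectures; 2. Adopts reduced correlation volumes and a high-resolution training protocol; 3. Combines these with multi-frame estimation to improve memory efficiency.

Result: Ranks first on Spring (1px outlier rate 3.289), Sintel (clean EPE 0.963), and KITTI-2015 (Fl-all error 2.94%).

Insight: Optimizing memory use and training at high resolution can significantly improve the efficiency of optical flow estimation without sacrificing accuracy.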

Abstract: Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at https://github.com/msu-video-group/memfof.
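
A rough sense of why correlation volumes dominate memory at 1080p comes from simple arithmetic. The sketch below assumes a RAFT-style all-pairs volume stored in float32, with one entry per pair of feature-map locations; the strides compared (8 and 16) and the helper name `corr_volume_bytes` are illustrative assumptions, not figures taken from the abstract.

```python
import math


def corr_volume_bytes(height, width, stride, bytes_per_value=4):
    """Size of an all-pairs correlation volume for one frame pair.

    Features live on a grid downsampled by `stride`, and the volume holds
    one value for every pair of grid locations.
    """
    grid_h = math.ceil(height / stride)
    grid_w = math.ceil(width / stride)
    locations = grid_h * grid_w
    return locations * locations * bytes_per_value


full_hd_stride8 = corr_volume_bytes(1080, 1920, stride=8)
full_hd_stride16 = corr_volume_bytes(1080, 1920, stride=16)
ratio = full_hd_stride8 / full_hd_stride16
```

Because the volume grows with the square of the number of grid locations, halving the feature resolution shrinks it by roughly a factor of sixteen, which is the kind of saving a reduced correlation volume targets.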

[112] Dynamic View Synthesis from Small Camera Motion Videos

Huiqiang Sun,Xingyi Li,Juewen Peng,Liao Shen,Zhiguo Cao,Ke Xian,Guosheng Lin

Main category: cs.CV

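TL;DR: The paper proposes a new method, DDR, for novel view synthesis of dynamic 3D scenes from small-camera-motion videos; improved depth regularization and camera parameter learning significantly improve performance.

Details Motivation: When the input images or videos have insufficient motion parallax (e.g., small camera motion), existing NeRF-based methods struggle to represent scene geometry accurately and to estimate camera parameters, leading to poor results or outright failure. The paper aims to solve this.

Contribution: 1. Proposes Distribution-based Depth Regularization (DDR), which improves depth regularization through Gumbel-softmax sampling and density constraints. 2. Introduces a visualization tool for observing the rendering weight distribution. 3. Learns camera parameters during training to improve robustness.

Method: 1. Uses DDR to compute the expectation of the error under the rendering weight distribution, rather than the error of the expectation used by conventional depth losses. 2. Constrains the volume density of spatial points along the ray before the object boundary to be near zero, ensuring correct geometry. 3. Incorporates camera parameter learning to improve adaptability to small-camera-motion inputs.

Result: Experiments show that the method performs well on small-camera-motion inputs and compares favorably with existing methods.

Insight: Combining improved depth regularization with camera parameter learning effectively improves novel view synthesis under small camera motion.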

Abstract: Novel view synthesis for dynamic 3D scenes poses a significant challenge. Many notable efforts use NeRF-based approaches to address this task and yield impressive results. However, these methods rely heavily on sufficient motion parallax in the input images or videos. When the camera motion range becomes limited or the camera is even stationary (i.e., small camera motion), existing methods encounter two primary challenges: incorrect representation of scene geometry and inaccurate estimation of camera parameters. These challenges cause prior methods to produce unsatisfactory results or to fail entirely. To address the first challenge, we propose a novel Distribution-based Depth Regularization (DDR) that ensures the rendering weight distribution aligns with the true distribution. Specifically, unlike previous methods that use depth loss to calculate the error of the expectation, we calculate the expectation of the error by using Gumbel-softmax to differentiably sample points from the discrete rendering weight distribution. Additionally, we introduce constraints that enforce the volume density of spatial points before the object boundary along the ray to be near zero, ensuring that our model learns the correct geometry of the scene. To demystify the DDR, we further propose a visualization tool that enables observing the scene geometry representation at the rendering weight level. For the second challenge, we incorporate camera parameter learning during training to enhance the robustness of our model to camera parameters. We conduct extensive experiments to demonstrate the effectiveness of our approach in representing scenes with small camera motion input, and our results compare favorably to state-of-the-art methods.
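
The distinction between the error of the expectation and the expectation of the error can be made concrete with a small sketch. It uses an absolute depth error and a textbook Gumbel-softmax relaxation over per-sample rendering weights; the exact loss, temperature, and sampling scheme of DDR are not given in the abstract, so the function names and the values below are assumptions.

```python
import math
import random


def error_of_expectation(weights, depths, target):
    """Conventional depth loss: compare the expected depth with the target."""
    expected = sum(w * t for w, t in zip(weights, depths))
    return abs(expected - target)


def expectation_of_error(weights, depths, target):
    """Average the per-sample depth error under the rendering weights."""
    return sum(w * abs(t - target) for w, t in zip(weights, depths))


def gumbel_softmax(weights, tau, rng):
    """Relaxed one-hot sample from a discrete weight distribution."""
    noisy = []
    for w in weights:
        u = max(rng.random(), 1e-12)         # avoid log(0)
        gumbel = -math.log(-math.log(u))     # standard Gumbel noise
        noisy.append((math.log(w + 1e-12) + gumbel) / tau)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


# Two equally weighted samples on either side of the true depth.
weights, depths, target = [0.5, 0.5], [1.0, 3.0], 2.0
loss_mean = error_of_expectation(weights, depths, target)
loss_dist = expectation_of_error(weights, depths, target)
sample = gumbel_softmax(weights, tau=0.5, rng=random.Random(0))
```

In this bimodal example the expected depth matches the target exactly, so the conventional loss is zero even though no weight sits at the true depth, while the expectation of the error stays at one and still penalizes the wrong distribution. In an autodiff framework, the relaxed sample would let gradients flow back to the weights.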

[113] STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene

Hanyu Zhou,Haonan Wang,Haoyue Liu,Yuxing Duan,Luxin Yan,Gim Hee Lee

Main category: cs.CV

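TL;DR: The paper proposes a spatiotemporal-disentangled Gaussian splatting framework (STD-GS) that combines frame and event cameras to address the spatiotemporal feature mismatch between background and dynamic objects in high-dynamic scene reconstruction.

Details Motivation: Existing methods adopt a unified representation model to directly match the spatiotemporal features of a dynamic scene, but cannot handle the discontinuous temporal features of dynamic objects or the heterogeneous spatial features between background and objects. Introducing an event camera can compensate for the frame camera's shortcomings.

Contribution: Proposes a spatiotemporal-disentangled Gaussian splatting framework that distinguishes the spatiotemporal features of background and objects via clustering, and uses the consistency between Gaussian representations and event data to guide the spatiotemporal disentanglement of objects.

Method: Introduces an event camera to complement the frame camera, separates background and dynamic object features through clustering and spatiotemporal disentanglement, and renders time-continuous dynamic scenes within the Gaussian splatting framework.

Result: Experiments verify the method's superiority in high-dynamic scene reconstruction; spatiotemporal disentanglement significantly improves discrimination between background and dynamic objects.

Insight: Gaussian representations and event data share consistent spatiotemporal characteristics, which can serve as a prior to guide the spatiotemporal disentanglement of dynamic objects and thus improve reconstruction quality.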

Abstract: High-dynamic scene reconstruction aims to represent the static background with rigid spatial features and dynamic objects with deformed continuous spatiotemporal features. Typically, existing methods adopt a unified representation model (e.g., Gaussian) to directly match the spatiotemporal features of a dynamic scene captured by a frame camera. However, this unified paradigm fails to handle the potentially discontinuous temporal features of objects caused by frame imaging and the heterogeneous spatial features between background and objects. To address this issue, we disentangle the spatiotemporal features into various latent representations to alleviate the spatiotemporal mismatch between background and objects. In this work, we introduce an event camera to compensate for the frame camera, and propose a spatiotemporal-disentangled Gaussian splatting framework for high-dynamic scene reconstruction. For the dynamic scene, we find that background and objects show an appearance discrepancy in frame-based spatial features and a motion discrepancy in event-based temporal features, which motivates us to distinguish the spatiotemporal features of background and objects via clustering. For dynamic objects, we find that Gaussian representations and event data share consistent spatiotemporal characteristics, which can serve as a prior to guide the spatiotemporal disentanglement of object Gaussians. Within the Gaussian splatting framework, the cumulative scene-object disentanglement can improve the spatiotemporal discrimination between background and objects to render the time-continuous dynamic scene. Extensive experiments have been performed to verify the superiority of the proposed method.

[114] UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Jie Feng,Shengyuan Wang,Tianhui Liu,Yanxin Xi,Yong Li

Main category: cs.CV

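TL;DR: UrbanLLaVA is a multi-modal large language model focused on urban intelligence tasks; it improves performance through enhanced spatial reasoning and a multi-stage training framework, and outperforms open-source and proprietary models in multi-city experiments.

Details Motivation: Current methods in urban research usually focus on specific data types and lack a unified multi-modal processing framework. The success of multi-modal large language models (MLLMs) offers an opportunity to address this.

Contribution: Proposes UrbanLLaVA, an MLLM that processes multi-modal urban data simultaneously, with performance improved by a multi-stage training framework and enhanced spatial reasoning. Also extends the existing benchmark for urban research.

Method: 1) Curates a diverse urban instruction dataset; 2) proposes a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning; 3) extends the urban task benchmark.

Result: In experiments on three cities, UrbanLLaVA outperforms other open-source and proprietary MLLMs on both single-modal and cross-modal tasks and generalizes robustly across cities.

Insight: Decoupling spatial reasoning from domain knowledge learning can significantly improve performance on urban tasks, providing a new framework and benchmark for urban intelligence research.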

Abstract: Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from location view to global view of urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we also extend existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.

[115] DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding

Mona Ahmadian,Amir Shirian,Frank Guerin,Andrew Gilbert

Main category: cs.CV

Abstract: Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization, aiming to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: the alignment of audio and visual features that leverage masked self-attention to enhance intra-mode consistency and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, enabling high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets, UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100, surpassing previous approaches with notable average mAP gains of +3.3%, +2.6%, +1.2%, +1.7% (verb), and +1.4% (noun), respectively.

[116] High-quality Pseudo-labeling for Point Cloud Segmentation with Scene-level Annotation

Lunhao Duan,Shanshan Zhao,Xingxing Weng,Jing Zhang,Gui-Song Xia

Main category: cs.CV

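TL;DR: This paper proposes a high-quality pseudo-label generation framework for point cloud segmentation under scene-level annotation; it improves pseudo-label accuracy through multi-modal information and region-point semantic consistency and significantly outperforms existing methods.

Details Motivation: Most current point cloud semantic segmentation methods rely on sparse point-level annotation, and generating accurate point-level pseudo-labels from scene-level annotation alone is challenging. The paper aims to address this.

Contribution: Proposes a high-quality pseudo-label generation framework that combines cross-modal feature guidance with region-point semantic consistency, significantly improving point cloud segmentation under scene-level annotation.

Method: 1. A cross-modal feature guidance module aligns 2D image features with 3D point cloud features; 2. A region-voting strategy produces regional semantics that guide point-level predictions.

Result: Significantly outperforms existing methods on the ScanNet v2 and S3DIS datasets, and ablation studies validate each module.

Insight: Multi-modal information and semantic consistency design can effectively improve pseudo-label quality, offering an effective solution for point cloud segmentation under scene-level annotation.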

Abstract: This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on the ScanNet v2 and S3DIS datasets under scene-level annotation demonstrate its effectiveness. Additionally, comprehensive ablation studies validate the contributions of our approach’s individual components. The code is available at https://github.com/LHDuan/WSegPC.
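The region-voting idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact voting rule, so simple majority voting is an assumption here, and the function name `region_vote` and the toy labels are hypothetical rather than taken from the authors' code.

```python
from collections import Counter, defaultdict


def region_vote(point_labels, region_ids):
    """Assign every point the majority label of the region it belongs to.

    Assumption: plain majority voting; the paper may weight votes differently.
    """
    votes = defaultdict(Counter)
    for label, region in zip(point_labels, region_ids):
        votes[region][label] += 1
    # Ties are resolved in favor of the label seen first (arbitrary choice).
    region_label = {r: c.most_common(1)[0][0] for r, c in votes.items()}
    return [region_label[r] for r in region_ids]


# Toy example: six points in two regions, one noisy prediction in region 0.
point_labels = ["chair", "chair", "table", "floor", "floor", "floor"]
region_ids = [0, 0, 0, 1, 1, 1]
refined = region_vote(point_labels, region_ids)
print(refined)
```

In this toy case the single "table" prediction inside region 0 is overruled by the two "chair" votes, which is the kind of correction regional semantics are meant to provide to point-level predictions.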

[117] VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions

Marko Mihajlovic,Siwei Zhang,Gen Li,Kaifeng Zhao,Lea Müller,Siyu Tang

Main category: cs.CV

TL;DR: VolumetricSMPL is a neural volumetric body model that improves computational efficiency and accuracy for human-environment interactions by dynamically blending neural weights, outperforming existing methods in speed and memory usage.

Details Motivation: Surface mesh-based parametric human body models struggle with efficient interactions and collisions in complex scenarios, while existing volumetric models are either computationally expensive or lack robustness for complex articulations.

Contribution: Introduces VolumetricSMPL, a neural volumetric body model using Neural Blend Weights (NBW) for efficient MLP decoding, achieving faster inference, lower memory usage, and enhanced accuracy with an SDF for contact modeling.

Method: Leverages NBW to dynamically blend learned weight matrices based on shape and pose, creating compact MLP decoders. This approach reduces computational costs while maintaining expressiveness.

Result: Outperforms COAP with 10x faster inference, 6x lower GPU memory usage, and improved accuracy. Successfully applied to tasks like human-object interaction reconstruction and scene-constrained motion synthesis.

Insight: Dynamic blending of neural weights (NBW) is a powerful technique for balancing efficiency and expressiveness in volumetric models, enabling broader real-world applications in human-environment interactions.

Abstract: Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms prior volumetric occupancy model COAP with 10x faster inference, 6x lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL’s strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained motion synthesis, and (4) resolving self-intersections. Our results highlight its broad applicability and significant performance and efficiency gains.
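The weight-blending step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the basis matrices and coefficients are fixed toy numbers (in the paper the coefficients are predicted from shape and pose), and the names `blend_weights` and `linear` are hypothetical.

```python
def blend_weights(bases, coeffs):
    """Return sum_k coeffs[k] * bases[k] for equally shaped weight matrices."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [
        [sum(c * b[i][j] for c, b in zip(coeffs, bases)) for j in range(cols)]
        for i in range(rows)
    ]


def linear(weight, x):
    """Apply one bias-free linear layer: y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Two small learned basis matrices (toy values, assumed for illustration).
bases = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# Stand-in for the shape- and pose-dependent coefficients.
coeffs = [0.75, 0.25]

W = blend_weights(bases, coeffs)
y = linear(W, [2.0, 4.0])
print(W, y)
```

The point of the design is that only the small coefficient vector changes per body, so the decoder applied to each query point stays compact.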

[118] Aggregating Local Saliency Maps for Semi-Global Explainable Image Classification

James Hinns,David Martens

Main category: cs.CV

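TL;DR: The paper proposes Segment Attribution Tables (SATs), a method that aggregates local saliency maps to give semi-global explanations of image classifiers, addressing the gap between overly detailed local explanations and oversimplified global ones.

Details Motivation: Deep learning dominates image classification, but explaining model predictions remains a challenge. Local explanation methods such as saliency maps make it hard to spot recurring patterns, while global methods may miss important local behaviours.

Contribution: The main contribution is SATs, which summarise local saliency maps to quantify the influence of image segments (such as "eyes"), providing semi-global model explanations that help identify both the concepts a model relies on and spurious correlations.

Method: SATs use saliency maps together with named image segmentation maps (such as "eyes") to quantify each segment's influence on the model's predictions, producing a semi-global explanation table.

Result: SATs reveal the image segments a model relies on (such as backgrounds or watermarks) and expose spurious correlations even when out-of-distribution test performance changes little.

Insight: SATs fill the gap between local and global explanations and serve as a practical tool for analysing and debugging image classifiers; they apply to any classifier for which saliency maps can be produced.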

Abstract: Deep learning dominates image classification tasks, yet understanding how models arrive at predictions remains a challenge. Much research focuses on local explanations of individual predictions, such as saliency maps, which visualise the influence of specific pixels on a model’s prediction. However, reviewing many of these explanations to identify recurring patterns is infeasible, while global methods often oversimplify and miss important local behaviours. To address this, we propose Segment Attribution Tables (SATs), a method for summarising local saliency explanations into (semi-)global insights. SATs take image segments (such as “eyes” in Chihuahuas) and leverage saliency maps to quantify their influence. These segments highlight concepts the model relies on across instances and reveal spurious correlations, such as reliance on backgrounds or watermarks, even when out-of-distribution test performance sees little change. SATs can explain any classifier for which a form of saliency map can be produced, using segmentation maps that provide named segments. SATs bridge the gap between oversimplified global summaries and overly detailed local explanations, offering a practical tool for analysing and debugging image classifiers.
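A minimal sketch of the aggregation idea follows. The abstract does not specify which statistic SATs compute, so the choice below (each segment's share of absolute saliency per image, averaged over the images containing that segment) is an assumption, and `segment_shares` and `aggregate` are hypothetical names.

```python
from collections import defaultdict


def segment_shares(saliency, segments):
    """Share of total absolute saliency that falls in each named segment."""
    totals = defaultdict(float)
    for sal_row, seg_row in zip(saliency, segments):
        for value, name in zip(sal_row, seg_row):
            totals[name] += abs(value)
    grand = sum(totals.values()) or 1.0  # guard against all-zero maps
    return {name: total / grand for name, total in totals.items()}


def aggregate(per_image):
    """Average each segment's share over the images in which it appears."""
    collected = defaultdict(list)
    for shares in per_image:
        for name, share in shares.items():
            collected[name].append(share)
    return {name: sum(v) / len(v) for name, v in collected.items()}


# Two toy images: a saliency map plus a same-shaped map of segment names.
img1 = segment_shares(
    [[0.5, 0.5], [0.0, 1.0]],
    [["eyes", "eyes"], ["background", "watermark"]],
)
img2 = segment_shares([[1.0, 3.0]], [["eyes", "background"]])
table = aggregate([img1, img2])
print(table)
```

A row such as "watermark" receiving a large average share would be the kind of signal that points to a spurious correlation worth inspecting.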

[119] DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection

Kunwei Lv,Ping Lan

Main category: cs.CV

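TL;DR: DGE-YOLO proposes a YOLO-based dual-branch architecture combined with an Efficient Multi-scale Attention mechanism for multi-modal UAV object detection, significantly improving small-object detection.

Details Motivation: Small-object detection from UAVs across varied scenes is challenged by complex conditions and multi-modal inputs, and existing methods struggle to balance speed and accuracy.

Contribution: 1. A dual-branch architecture for infrared and visible images; 2. An Efficient Multi-scale Attention (EMA) mechanism; 3. A Gather-and-Distribute module that replaces the conventional neck.

Method: 1. Dual branches extract modality-specific features; 2. EMA enhances multi-scale feature learning; 3. The new module improves feature aggregation.

Result: Outperforms existing methods on the Drone Vehicle dataset, validating its effectiveness for multi-modal object detection.

Insight: Multi-modal fusion and multi-scale attention offer clear advantages for small-object detection.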

Abstract: The rapid proliferation of unmanned aerial vehicles (UAVs) has highlighted the importance of robust and efficient object detection in diverse aerial scenarios. Detecting small objects under complex conditions, however, remains a significant challenge. Existing approaches often prioritize inference speed, leading to degraded performance when handling multi-modal inputs. To address this, we present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. Specifically, we introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales. Additionally, we replace the conventional neck with a Gather-and-Distribute module to mitigate information loss during feature aggregation. Extensive experiments on the Drone Vehicle dataset demonstrate that DGE-YOLO achieves superior performance over state-of-the-art methods, validating its effectiveness in multi-modal UAV object detection tasks.

[120] PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution

Aradhana Mishra,Bumshik Lee

Main category: cs.CV

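TL;DR: The paper proposes PixelBoost, a new diffusion model based on Brownian motion for image super-resolution, which introduces controlled stochasticity to improve realism, particularly in texture and edge definition.

Details Motivation: Existing diffusion models for image super-resolution face a trade-off between realism and computational efficiency, and tend to produce hazy images when the number of sampling steps is reduced.

Contribution: 1. Proposes PixelBoost, which exploits the stochasticity of Brownian motion to improve image realism; 2. Introduces a controlled-stochasticity training strategy that avoids local optima; 3. Proposes a sigmoidal noise sequencing method that speeds up training and inference.

Method: Integrates the stochasticity of Brownian motion into training and combines it with sigmoidal noise sequencing to improve texture and edge reconstruction.

Result: Superior results on LPIPS, LOE, PSNR, and SSIM, with higher visual quality and stronger edge reconstruction.

Insight: Controlled stochasticity can effectively capture the uncertainty of image textures, improving both the realism and the efficiency of super-resolution.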

Abstract: Diffusion-model-based image super-resolution techniques often face a trade-off between realistic image generation and computational efficiency. This issue is exacerbated when inference time is reduced by decreasing the number of sampling steps, resulting in less realistic and hazy images. To overcome this challenge, we introduce a novel diffusion model named PixelBoost that underscores the significance of embracing the stochastic nature of Brownian motion in advancing image super-resolution, resulting in a high degree of realism, particularly in texture and edge definition. By integrating controlled stochasticity into the training regimen, our proposed model avoids convergence to local optima, effectively capturing and reproducing the inherent uncertainty of image textures and patterns. Our proposed model demonstrates superior objective results in terms of learned perceptual image patch similarity (LPIPS), lightness order error (LOE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), as well as visual quality. To assess edge enhancement, we evaluated the gradient magnitude and pixel value, and our proposed model exhibited a better edge reconstruction capability. Additionally, our model demonstrates adaptive learning capabilities by effectively adjusting to Brownian noise patterns and introduces a sigmoidal noise sequencing method that simplifies training, resulting in faster inference speeds.

[121] Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization

Md Moinul Islam,Sofoklis Kakouros,Janne Heikkilä,Mourad Oussalah

Main category: cs.CV

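TL;DR: The paper proposes a behaviour-aware multimodal video summarization framework that combines textual, audio, and visual cues to produce timestamp-aligned summaries, and reports clear gains over a traditional extractive baseline.

Details Motivation: The growing volume of video content in educational, professional, and social domains calls for summarization techniques that go beyond traditional unimodal approaches.

Contribution: A framework that integrates prosodic, textual, and visual cues, together with the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries.

Method: Extracts prosodic features, textual cues, and visual indicators to identify semantically and emotionally important moments; evaluation uses pseudo-ground truth (pGT) summaries generated by an LLM-based extractive method.

Result: Compared with the Edmundson method, ROUGE-1 rises from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536; in video-based evaluation, F1-Score improves by almost 23%.

Insight: The findings point to the potential of multimodal integration for producing comprehensive and behaviourally informed video summaries.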

Abstract: The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using an LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive methods, such as the Edmundson method, in both text and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries.

[122] Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

Lei-lei Li,Jianwu Fang,Junbin Xiao,Shanmin Pang,Hongkai Yu,Chen Lv,Jianru Xue,Tat-Seng Chua

Main category: cs.CV

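TL;DR: The paper proposes Causal-VidSyn, a novel diffusion model for synthesizing causal-entity reflected egocentric traffic accident videos. It combines cause descriptions with driver fixations to precisely identify accident participants and their behaviors.

Details Motivation: Existing synthetic videos struggle to reflect real-world causal relations. In traffic accident scenarios in particular, precisely identifying accident participants and behaviors is critical to the safety of self-driving cars.

Contribution: 1) Proposes Causal-VidSyn, which grounds causal entities using cause descriptions and driver fixations; 2) Constructs Drive-Gaze, the largest driver gaze dataset; 3) Achieves strong results on accident video editing, normal-to-accident video diffusion, and text-to-video generation.

Method: Builds on a diffusion model framework with an accident reason answering module and a gaze-conditioned selection module to precisely identify accident participants and behaviors, so that causal entities are reflected in the synthesized video.

Result: Causal-VidSyn surpasses existing video diffusion models in frame quality and causal sensitivity and applies to a variety of tasks.

Insight: Precise modeling of causal relations and entity behaviors is critical to the realism and usefulness of synthetic videos, especially for safety testing in autonomous driving.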

Abstract: Egocentrically comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate capability tests for responding to accidents that are unaffordable in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.

[123] Token Activation Map to Visually Explain Multimodal LLMs

Yi Li,Hualiang Wang,Xinpeng Ding,Haonan Wang,Xiaomeng Li

Main category: cs.CV

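TL;DR: The paper proposes Token Activation Map (TAM), a method for visually explaining multimodal large language models (MLLMs) that addresses a problem existing methods overlook: interference from redundant activations in the context.

Details Motivation: Existing methods for explaining MLLMs do not effectively handle redundant activations among context tokens, which reduces the reliability of the explanations.

Contribution: Proposes TAM, which uses estimated causal inference and a rank Gaussian filter to reduce context interference and significantly improve the quality of MLLM explanations.

Method: TAM uses an estimated causal inference method to mitigate context interference and a rank Gaussian filter to reduce activation noise, focusing on the interactions between tokens.

Result: TAM significantly outperforms existing methods across scenarios such as object localization, failure case analysis, and video visualization, producing high-quality visual explanations.

Insight: Token activation interference in MLLMs needs dedicated handling; causal inference and noise filtering are key to more reliable explanations.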

Abstract: Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noise. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. TAM also indicates that it excels at explaining multiple tokens of MLLM, which is different from the Class Activation Map (CAM) for a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLMs visual comparison, and model understanding (e.g., color, shape, action, location, visual reasoning, multi-turn conversation, etc). The code is available at github.com/xmed-lab/TAM.

[124] Ella: Embodied Social Agents with Lifelong Memory

Hongxin Zhang,Zheyuan Zhang,Zeyuan Wang,Zunzhe Zhang,Lixing Fang,Qinhong Zhou,Chuang Gan

Main category: cs.CV

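TL;DR: Ella is an embodied social agent capable of lifelong learning. By integrating a multimodal memory system with foundation models, it learns and socializes in an open world.

Details Motivation: To develop an embodied agent that keeps learning and evolves autonomously in a dynamic 3D open world through visual observations and social interactions.

Contribution: Proposes a structured, lifelong multimodal memory system that effectively stores, updates, and retrieves information, and combines it with foundation models for autonomous decision-making and social interaction.

Method: Combines a name-centric semantic memory with a spatiotemporal episodic memory, achieving lifelong learning through accumulated multimodal experiences and social interactions.

Result: Experiments show that Ella can effectively influence and cooperate with other agents in social activities, demonstrating its ability to learn through observation and social interaction.

Insight: Combining structured memory systems with foundation models opens new possibilities for advancing embodied intelligence.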

Abstract: We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella’s capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. Experimental results show that Ella can influence, lead, and cooperate with other agents well to achieve goals, showcasing its ability to learn effectively through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence. More videos can be found at https://umass-embodied-agi.github.io/Ella/.

[125] Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation

Jinxing Zhou,Zhihui Li,Yongqiang Yu,Yanghao Zhou,Ruohao Guo,Guangyao Li,Yuxin Mao,Mingfei Han,Xiaojun Chang,Meng Wang

Main category: cs.CV

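TL;DR: Mettle proposes a lightweight method based on meta-token learning for efficiently adapting large-scale pretrained transformer models to audio-visual tasks, significantly reducing memory usage and training time.

Details Motivation: To address the high memory consumption and long training time of conventional methods for adapting pretrained transformer models, while preserving performance and parameter efficiency.

Contribution: 1) Proposes a Layer-Centric Distillation (LCD) module that distills the features of each transformer layer into compact meta-tokens in parallel; 2) Introduces a Meta-Token Injection (MTI) module to support fine-grained segmentation tasks; 3) Validates efficiency and accuracy on multiple audio-visual tasks.

Method: 1) The LCD module distills each layer's features into meta-tokens in parallel; 2) The MTI module injects meta-tokens from the top layer into earlier layers to guide feature adaptation; 3) Meta-tokens are applied directly to classification and segmentation tasks.

Result: Experiments show that Mettle significantly reduces memory usage and training time while maintaining competitive performance.

Insight: By distilling meta-tokens in parallel, Mettle balances the preservation of pretrained knowledge with task-specific adaptation, suggesting a new direction for lightweight transfer learning.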

Abstract: We present Meta-Token Learning (Mettle), a simple and memory-efficient method for adapting large-scale pretrained transformer models to downstream audio-visual tasks. Instead of sequentially modifying the output feature distribution of the transformer backbone, Mettle utilizes a lightweight Layer-Centric Distillation (LCD) module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens. This distillation process considers both pretrained knowledge preservation and task-specific adaptation. The obtained meta-tokens can be directly applied to classification tasks, such as audio-visual event localization and audio-visual video parsing. To further support fine-grained segmentation tasks, such as audio-visual segmentation, we introduce a Meta-Token Injection (MTI) module, which utilizes the audio and visual meta-tokens distilled from the top transformer layer to guide feature adaptation in earlier layers. Extensive experiments on multiple audio-visual benchmarks demonstrate that our method significantly reduces memory usage and training time while maintaining parameter efficiency and competitive accuracy.

[126] Why Settle for One? Text-to-ImageSet Generation and Evaluation

Chengyou Jia,Xin Shen,Zhuohang Dang,Zhuohang Dang,Changliang Xia,Weijia Wu,Xinyu Zhang,Hangwei Qian,Ivor W. Tsang,Minnan Luo

Main category: cs.CV

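TL;DR: The paper introduces the Text-to-ImageSet (T2IS) generation problem, together with the T2IS-Bench dataset and the T2IS-Eval evaluation framework, and proposes AutoT2IS, a training-free method that significantly outperforms existing approaches.

Details Motivation: Existing text-to-image models usually target specific consistency requirements in a single domain and cannot serve diverse applications, so a more general method is needed to generate image sets that meet varied consistency requirements.

Contribution: The main contributions are: (1) formulating the Text-to-ImageSet (T2IS) generation problem; (2) building T2IS-Bench, with 596 instructions across 26 subcategories; (3) developing the T2IS-Eval evaluation framework; (4) proposing the training-free AutoT2IS method.

Method: AutoT2IS leverages the in-context capabilities of pretrained Diffusion Transformers to harmonize visual elements so that both image-level prompt alignment and set-level visual consistency are satisfied.

Result: Experiments show that AutoT2IS significantly outperforms existing methods on T2IS-Bench and supports a range of underexplored real-world applications.

Insight: The paper shows that diverse consistency requirements challenge existing methods across the board, and demonstrates the potential of AutoT2IS to improve generation quality and practical usefulness.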

Abstract: Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistency-oriented methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce T2IS-Bench with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose T2IS-Eval, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose AutoT2IS, a training-free framework that maximally leverages pretrained Diffusion Transformers’ in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project at https://chengyou-jia.github.io/T2IS-Home.

[127] MotionGPT3: Human Motion as a Second Modality

Bingfan Zhu,Biao Jiang,Sunyi Wang,Shixiang Tang,Tao Chen,Linjie Luo,Youyi Zheng,Xin Chen

Main category: cs.CV

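TL;DR: MotionGPT3 proposes a bimodal motion-language model that treats human motion as a second modality. Separate parameters and a shared attention mechanism enable cross-modal interaction and efficient training, while language intelligence is preserved within an autoregressive framework.

Details Motivation: Existing multimodal models perform well at unified understanding and generation, but unified motion-language models remain underexplored. MotionGPT3 aims to close the gap between the continuous motion modality and discrete representations, and to avoid degrading language intelligence during joint training.

Contribution: 1) Proposes a bimodal model that treats human motion as a second modality; 2) Predicts motion latents directly with separate parameters and a diffusion head, avoiding discretization; 3) Preserves language intelligence within an autoregressive framework while training efficiently.

Method: 1) A motion Variational Autoencoder (VAE) encodes the motion data; 2) The text branch keeps the structure of the pretrained language model, and the motion branch is integrated through a shared attention mechanism; 3) A diffusion head predicts motion latents directly from intermediate hidden states.

Result: Experiments show that the model performs well on motion understanding and generation tasks while retaining strong language capabilities.

Insight: Treating motion as a separate modality and combining it with a diffusion model can bridge the gap between continuous motion and discrete representations, while avoiding the performance degradation seen in multimodal training.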

Abstract: Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and discrete representation in an autoregressive manner, and the second is the degradation of language intelligence during unified training. Inspired by the mixture of experts, we propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decoupling motion modeling via separate model parameters and enabling both effective cross-modal interaction and efficient multimodal scaling training. To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model, while a new motion branch is integrated via a shared attention mechanism, enabling bidirectional information flow between the two modalities. We first employ a motion Variational Autoencoder (VAE) to encode raw human motion into latent representations. Based on this continuous latent space, the motion branch predicts motion latents directly from intermediate hidden states using a diffusion head, bypassing discrete tokenization. Extensive experiments show that our approach achieves competitive performance on both motion understanding and generation tasks while preserving strong language capabilities, establishing a unified bimodal motion diffusion framework in an autoregressive manner.

[128] Autoregressive Denoising Score Matching is a Good Video Anomaly Detector

Hanwen Zhang,Congqi Cao,Qinyi Lv,Lingtong Min,Yanning Zhang

Main category: cs.CV

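TL;DR: This paper proposes a video anomaly detection method based on autoregressive denoising score matching. By addressing three gaps specific to the task (scene, motion, and appearance), it improves the detection of otherwise unseen anomalies.

Details Motivation: Current likelihood-based video anomaly detection methods are insensitive to anomalies located in local modes near the learned distribution, so a more comprehensive approach is needed to identify these "unseen" anomalies.

Contribution: Proposes a scene-conditioned, motion-aware score function and designs a novel autoregressive denoising score matching mechanism that strengthens the detection of appearance anomalies.

Method: 1. Builds a noise-conditioned score transformer. 2. Introduces a scene-dependent and motion-aware score function. 3. Autoregressively injects noise and compares the denoised data with the original data to accumulate abnormal context.

Result: Achieves state-of-the-art performance on three popular VAD benchmarks.

Insight: Combining scene, motion, and appearance information gives a more comprehensive anomaly indicator and improves the recognition of anomalies in local modes.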

Abstract: Video anomaly detection (VAD) is an important computer vision problem. Thanks to the mode coverage capabilities of generative models, the likelihood-based paradigm is attracting growing interest, as it can model the normal distribution and detect out-of-distribution anomalies. However, these likelihood-based methods are blind to the anomalies located in local modes near the learned distribution. To handle these "unseen" anomalies, we dive into three gaps uniquely existing in VAD regarding scene, motion and appearance. Specifically, we first build a noise-conditioned score transformer for denoising score matching. Then, we introduce a scene-dependent and motion-aware score function by embedding the scene condition of input sequences into our model and assigning motion weights based on the difference between key frames of input sequences. Next, to solve the problem of blindness in principle, we integrate unaffected visual information via a novel autoregressive denoising score matching mechanism for inference. By autoregressively injecting intensifying Gaussian noise into the denoised data and estimating the corresponding score function, we compare the denoised data with the original data to get a difference, which we aggregate with the score function to enhance appearance perception and accumulate the abnormal context. With all three gaps considered, we can compute a more comprehensive anomaly indicator. Experiments on three popular VAD benchmarks demonstrate the state-of-the-art performance of our method.

[129] MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

Yuhuan Yang,Chaofan Ma,Zhenjie Mao,Jiangchao Yao,Ya Zhang,Yanfeng Wang

Main category: cs.CV

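TL;DR: MoMa is an efficient adapter framework that integrates Mamba's selective state space modeling into image foundation models (IFMs) to achieve full spatial-temporal modeling and improve video understanding.

Details Motivation: Video understanding requires effective modeling of spatial-temporal dynamics. Although IFMs excel at image understanding, existing parameter-efficient fine-tuning (PEFT) methods usually process spatial and temporal information separately and may fail to capture the full complexity of video dynamics.

Contribution: Proposes the MoMa framework, which brings Mamba's selective state space modeling into IFMs through the SeqMod operation, achieving efficient spatial-temporal modeling while preserving the original features.

Method: Combines the SeqMod operation with a Divide-and-Modulate architecture to inject spatial-temporal information into pretrained IFMs, improving video understanding.

Result: MoMa performs well on multiple video benchmarks, outperforming existing methods at lower computational cost.

Insight: By integrating selective state space modeling, MoMa shows the potential of reusing pretrained image models efficiently for video tasks.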

Abstract: Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba’s selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost.

[130] Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification

Daqian Shi,Xiaolei Diao,Xu Chen,Cédric M. John

Main category: cs.CV

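TL;DR: The paper proposes competitive distillation, a training strategy in which a group of networks competes on a shared task and any of them may act as teacher based on its performance, with the aim of improving overall learning performance.

Details Motivation: Distillation-based optimization strategies such as deep mutual learning and self-distillation achieve limited improvements because the impact of learning directions among networks across different iterations is poorly understood.

Contribution: A competitive distillation strategy that lets each network in a group potentially act as a teacher based on its performance, together with competitive optimization of the parameter updating process and stochastic perturbation.

Method: Organizes a group of networks to perform a shared task and compete; competitive optimization improves parameter updates, and stochastic perturbation motivates the networks to induce mutations toward better visual representations and the global optimum.

Result: Competitive distillation achieves promising performance across diverse tasks and datasets.

Insight: The results suggest that choosing the learning direction dynamically, according to which network currently performs best, can be more effective than a fixed teacher-student direction.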

Abstract: Deep Neural Networks (DNNs) have significantly advanced the field of computer vision. To improve the DNN training process, knowledge distillation methods demonstrate their effectiveness in accelerating network training by introducing a fixed learning direction from the teacher network to student networks. In this context, several distillation-based optimization strategies have been proposed, e.g., deep mutual learning and self-distillation, as an attempt to achieve generic training performance enhancement through the cooperative training of multiple networks. However, such strategies achieve limited improvements due to the poor understanding of the impact of learning directions among networks across different iterations. In this paper, we propose a novel competitive distillation strategy that allows each network in a group to potentially act as a teacher based on its performance, enhancing the overall learning performance. Competitive distillation organizes a group of networks to perform a shared task and engage in competition, where competitive optimization is proposed to improve the parameter updating process. We further introduce stochastic perturbation in competitive distillation, aiming to motivate networks to induce mutations to achieve better visual representations and the global optimum. The experimental results show that competitive distillation achieves promising performance in diverse tasks and datasets.

[131] DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On

Xiang Xu

Main category: cs.CV

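TL;DR: DiffFit is a new two-stage latent diffusion framework for high-fidelity virtual try-on. By decoupling geometric alignment from appearance refinement, it significantly improves try-on quality and efficiency.

Details Motivation: Current virtual try-on methods still struggle with preserving garment details, aligning the garment precisely with the body, inference efficiency, and generalization.

Contribution: DiffFit proposes a two-stage latent diffusion framework: the first stage performs geometry-aware garment warping, and the second stage refines texture fidelity with a cross-modal conditional diffusion model.

Method: DiffFit adopts a progressive generation strategy, performing geometric alignment first and texture refinement second; decoupling the two tasks reduces complexity.

Result: On large-scale virtual try-on benchmarks, DiffFit outperforms existing methods in both quantitative metrics and perceptual evaluations.

Insight: Decoupling geometric alignment from appearance refinement can significantly improve generation stability and visual realism in virtual try-on.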

Abstract: Virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment, with broad applications in e-commerce and digital fashion. While recent advances in latent diffusion models have substantially improved visual quality, existing approaches still struggle with preserving fine-grained garment details, achieving precise garment-body alignment, maintaining inference efficiency, and generalizing to diverse poses and clothing styles. To address these challenges, we propose DiffFit, a novel two-stage latent diffusion framework for high-fidelity virtual try-on. DiffFit adopts a progressive generation strategy: the first stage performs geometry-aware garment warping, aligning the garment with the target body through fine-grained deformation and pose adaptation. The second stage refines texture fidelity via a cross-modal conditional diffusion model that integrates the warped garment, the original garment appearance, and the target person image for high-quality rendering. By decoupling geometric alignment and appearance refinement, DiffFit effectively reduces task complexity and enhances both generation stability and visual realism. It excels in preserving garment-specific attributes such as textures, wrinkles, and lighting, while ensuring accurate alignment with the human body. Extensive experiments on large-scale VTON benchmarks demonstrate that DiffFit achieves superior performance over existing state-of-the-art methods in both quantitative metrics and perceptual evaluations.

[132] FastSeg: Efficient Training-Free Open-Vocabulary Segmentation via Hierarchical Attention Refinement Method

Quang-Huy Che,Vinh-Tiep Nguyen

Main category: cs.CV

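TL;DR: FastSeg is an efficient, training-free open-vocabulary segmentation method. With hierarchical attention refinement, it needs only a (1+1)-step reverse process of a pretrained diffusion model to produce high-quality segmentations.

Details Motivation: Open-vocabulary semantic segmentation segments arbitrary text categories without densely annotated data. Among existing methods, contrastive-learning models lose precision at the pixel level, while diffusion models capture fine detail but struggle to balance the number of iterations against segmentation quality.

Contribution: Proposes FastSeg, an efficient training-free framework that uses the attention maps of a pretrained diffusion model to segment all classes at once, and introduces a dual-prompt mechanism, the Hierarchical Attention Refinement Method (HARD), and Test-Time Flipping (TTF) to improve segmentation quality.

Method: Built on the reverse process of a pretrained diffusion model (e.g., Stable Diffusion), using only (1+1) steps. The key components are a dual-prompt mechanism for discriminative, class-aware attention extraction; HARD, which enhances fused cross-attention with scale-aligned self-attention maps; and TTF, which improves spatial consistency.

Result: Achieves 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object while keeping inference efficient.

Insight: Without any training, FastSeg delivers high-quality open-vocabulary segmentation and narrows the gap between segmentation quality and inference efficiency, with room for extension.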

Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive-learning-based models enable zero-shot segmentation, they often lose fine spatial precision at the pixel level due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the number of iterations with the quality of the segmentation. In this work, we propose FastSeg, a novel and efficient training-free framework that uses only a (1+1)-step reverse process of a pretrained diffusion model (e.g., Stable Diffusion). Moreover, instead of running multiple times for different classes, FastSeg performs segmentation for all classes at once. To further enhance the segmentation quality, FastSeg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances fused cross-attention using scale-aligned self-attention maps, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FastSeg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FastSeg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency.
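
The abstract does not spell out how Test-Time Flipping is applied, so the following is only a minimal sketch of the generic flip-and-average idea, assuming TTF averages the class-score maps of an image and of its horizontal mirror. The names `score_fn`, `predict_with_flip`, and `segment` are hypothetical and not taken from the FastSeg code.

```python
import numpy as np

def predict_with_flip(score_fn, image):
    """Average per-class score maps from an image and its mirror image.

    score_fn is a hypothetical stand-in for the segmenter: it maps an
    (H, W, 3) image to a (C, H, W) array of class scores.
    """
    scores = score_fn(image)
    # Flip the image along its width axis, score it, then flip the
    # resulting maps back so both predictions are pixel-aligned.
    mirrored_scores = score_fn(image[:, ::-1, :])[:, :, ::-1]
    return (scores + mirrored_scores) / 2.0

def segment(score_fn, image):
    """Label each pixel with the class that has the highest averaged score."""
    return predict_with_flip(score_fn, image).argmax(axis=0)
```

For a segmenter that is already perfectly flip-equivariant the averaging changes nothing; any benefit would come from cancelling left-right asymmetries in the predicted maps.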

[133] IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Parker Liu,Chenxin Li,Zhengxin Li,Yipeng Wu,Wuyang Li,Zhiqin Yang,Zhenyuan Zhang,Yunlong Lin,Sirui Han,Brandon Y. Feng

Main category: cs.CV

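TL;DR: IR3D-Bench is a new benchmark that evaluates whether vision-language models (VLMs) understand scenes through active creation rather than passive recognition. It asks models to use programming and rendering tools to recreate the 3D structure underlying an input image, shifting evaluation from descriptive to generative tasks.

Details Motivation: VLMs excel at descriptive tasks, but whether they truly understand scenes remains uncertain. IR3D-Bench uses a generative task (inverse rendering) to probe understanding more deeply than traditional benchmarks allow.

Contribution: 1. Proposes IR3D-Bench, a benchmark focused on generative tasks for Vision-Language Agents (VLAs);
2. Evaluates scene understanding through the analysis-by-synthesis paradigm;
3. Provides a comprehensive suite of metrics covering geometric accuracy, spatial relations, appearance attributes, and overall plausibility.

Method: IR3D-Bench requires VLAs to actively use tools (programming and rendering tools) to recreate the 3D structure of an input image, i.e., to perform inverse rendering. This tests both tool use and generative capability.

Result: Initial experiments show that current state-of-the-art VLMs are limited mainly in visual precision rather than in basic tool usage. The data and evaluation protocols are released to support further research.

Insight: IR3D-Bench exposes the weaknesses of VLMs on generative tasks, visual precision in particular, and gives a concrete direction for future model development.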

Abstract: Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This “understanding-by-creating” approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.

[134] GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Shunsuke Yasuki,Taiki Miyanishi,Nakamasa Inoue,Shuhei Kurita,Koya Sakamoto,Daichi Azuma,Masato Taki,Yutaka Matsuo

Main category: cs.CV

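TL;DR: GeoProg3D is a visual programming framework for natural-language interaction with city-scale 3D scenes. It addresses the small-scale limits of existing methods with a geography-aware city-scale 3D language field and geographical vision APIs, uses LLMs for compositional reasoning, and performs strongly on the new GeoEval3D benchmark.

Details Motivation: Existing 3D language field methods are typically limited to small-scale environments and lack the scalability and compositional reasoning needed for large, complex urban scenes. GeoProg3D aims to overcome these limitations.

Contribution: 1. Proposes GCLF, a memory-efficient hierarchical 3D model integrated with geographic information; 2. Designs GV-APIs, specialized tools for geographic vision tasks; 3. Uses LLMs to combine tools dynamically; 4. Builds the GeoEval3D benchmark dataset.

Method: 1. GCLF uses geographic information (directional cues, distances, elevation, landmarks) to filter city-scale data efficiently; 2. GV-APIs provide geographic vision functions such as area segmentation and object detection; 3. LLMs act as reasoning engines that dynamically combine GV-APIs and operate GCLF.

Result: Experiments show that GeoProg3D significantly outperforms existing 3D language fields and vision-language models on tasks including grounding, spatial reasoning, comparison, counting, and measurement.

Insight: According to the authors, GeoProg3D is the first framework to enable compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language, which suggests a new way to interact with urban scenes.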

Abstract: The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language. The code is available at https://snskysk.github.io/GeoProg3D/.

[135] Layer Decomposition and Morphological Reconstruction for Task-Oriented Infrared Image Enhancement

Siyuan Chai,Xiaodong Guo,Tong Liu

Main category: cs.CV

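TL;DR: The paper proposes a task-oriented enhancement method for infrared images that uses layer decomposition and morphological reconstruction to improve downstream vision tasks.

Details Motivation: Infrared images help autonomous driving perception in complex weather (fog, rain, low light), but their low contrast and noise degrade high-level vision tasks.

Contribution: 1. A layer decomposition method for infrared images that enhances scene details while preserving dark-region features; 2. A saliency extraction method based on morphological reconstruction that enhances target information without amplifying noise.

Method: Two steps: layer decomposition extracts scene details, and morphological reconstruction then enhances target information.

Result: Experiments show that the method outperforms existing methods on object detection and semantic segmentation.

Insight: Combining layer decomposition with morphological reconstruction balances stable image enhancement against task-oriented performance gains.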

Abstract: Infrared imaging helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared images often suffer from low contrast, especially for non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying noise and losing important information remains a challenge. To address these challenges, we propose a task-oriented infrared image enhancement method. Our approach consists of two key components: layer decomposition and saliency information extraction. First, we design a layer decomposition method for infrared images, which enhances scene details while preserving dark region features, providing more features for subsequent saliency information extraction. Then, we propose a morphological reconstruction-based saliency extraction method that effectively extracts and enhances target information without amplifying noise. Our method improves the image quality for object detection and semantic segmentation tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods.

[136] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Yuanhao Cai,He Zhang,Xi Chen,Jinbo Xing,Yiwei Hu,Yuqian Zhou,Kai Zhang,Zhifei Zhang,Soo Ye Kim,Tianyu Wang,Yulun Zhang,Xiaokang Yang,Zhe Lin,Alan Yuille

Main category: cs.CV

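TL;DR: OmniVCus is a video customization method driven by multimodal control conditions. It tackles multi-subject video customization and control-signal conditioning, and comes with a data construction pipeline and a training scheme.

Details Motivation: Existing methods focus on single-subject video customization and offer little control through multimodal signals such as depth and masks. This paper addresses both multi-subject customization and signal control.

Contribution: 1. The VideoCus-Factory data construction pipeline, which generates multi-subject training data; 2. The IVTM training method, which mixes in image editing data; 3. The OmniVCus framework, with its Lottery Embedding and Temporally Aligned Embedding mechanisms.

Method: 1. Build multi-subject training data with VideoCus-Factory; 2. Train with IVTM, combining image data; 3. Use the OmniVCus framework, whose LE and TAE mechanisms shape the frame embeddings.

Result: Experiments show that OmniVCus significantly outperforms existing methods in both quantitative and qualitative evaluations.

Insight: Effective use of multimodal signals and construction of high-quality training data are key to better video customization.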

Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, remains underexplored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus

cs.AI [Back]

[137] MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

Yulun Jiang,Yekun Chai,Maria Brbić,Michael Moor

Main category: cs.AI

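TL;DR: MARBLE is a benchmark for multimodal spatial reasoning and planning that evaluates how well multimodal language models (MLLMs) reason step by step through complex multimodal tasks. Current MLLMs perform poorly on it, exposing limits in both complex reasoning and perception.

Details Motivation: Existing reasoning benchmarks focus on text-only reasoning or on simple multimodal questions, so complex multimodal reasoning has not been evaluated systematically. With MARBLE, the authors hope to drive improvements in step-by-step reasoning and planning.

Contribution: Proposes the MARBLE benchmark, with two highly challenging tasks (M-Portal and M-Cube) that test multi-step reasoning by MLLMs under spatial, visual, and physical constraints.

Method: MARBLE tasks are multi-step reasoning problems that require combining visual and textual information. The experiments test 12 advanced MLLMs and analyze their performance and limitations.

Result: All 12 models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Some models beat the random baseline only on simplified subtasks, so complex reasoning remains a challenge. Visual perception is one of the main bottlenecks.

Insight: Complex multimodal reasoning needs stronger step-by-step planning and perception; MARBLE gives next-generation models a clear target.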

Abstract: The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only in simplified subtasks do some models outperform the random baseline, indicating that complex reasoning is still a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck, where MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many multimodal reasoning steps.

[138] AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks

Leander Melroy Maben,Gayathri Ganesh Lakshmy,Srijith Radhakrishnan,Siddhant Arora,Shinji Watanabe

Main category: cs.AI

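TL;DR: AURA is the first open-source, speech-native assistant that completes complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. It combines ASR, TTS, and LLM components, supports a range of tools, and approaches GPT-4o on OpenBookQA.

Details Motivation: Despite progress in speech and language technology, no open-source system supports speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning.

Contribution: AURA, the first open-source voice assistant that supports multi-turn spoken dialogue and dynamic tool invocation, with a modular design that makes new tools easy to add.

Method: Combines open-weight ASR, TTS, and LLM components in a cascaded pipeline; new tools are integrated through natural language prompts and action classes.

Result: On VoiceBench, AURA scores 92.75% on OpenBookQA and 4.39 on AlpacaEval; human evaluation shows 90% task success.

Insight: Modular design and the integration of multiple technologies are key to building an effective voice assistant; open-weight models can approach the performance of closed commercial systems.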

Abstract: Despite advances in language and speech technologies, no open-source system enables full speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight ASR, TTS, and LLMs in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email. Its modular design allows easy integration of new tools using natural language prompts and action classes. On VoiceBench, AURA scores 92.75% on OpenBookQA, outperforming all open-weight systems and nearing GPT-4o, and 4.39 on AlpacaEval, competitive with other open-weight systems. Human evaluation shows 90% task success on complex, multi-turn speech tasks.

[139] Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

David Guzman Piedrahita,Yongjin Yang,Mrinmaya Sachan,Giorgia Ramponi,Bernhard Schölkopf,Zhijing Jin

Main category: cs.AI

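TL;DR: The paper studies how large language models (LLMs) cooperate in a multi-agent public goods game and finds that reasoning-enhanced LLMs struggle to sustain cooperation, while some traditional LLMs do better.

Details Motivation: As LLMs are deployed as autonomous agents, understanding their cooperation and social mechanisms is essential for safe deployment. The paper examines how LLMs trade off self-interest against collective well-being in multi-agent systems.

Contribution: Identifies four behavioral patterns of LLMs in public goods games and shows that reasoning-enhanced LLMs cooperate poorly, offering a new perspective on LLM design.

Method: Runs a public goods game with institutional choice, adapted from behavioral economics, and observes the strategies of different LLMs over repeated interactions, in particular how they cooperate under sanctioning mechanisms.

Result: Reasoning LLMs cooperate poorly, whereas some traditional LLMs consistently sustain high levels of cooperation. Improving reasoning ability does not necessarily promote cooperative behavior.

Insight: The current reasoning-centered approach to improving LLMs may not foster cooperation, which matters for LLMs deployed in collaborative settings.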

Abstract: As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim
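
For readers unfamiliar with the setup, the free-rider incentive can be illustrated with the textbook linear public goods game. This is only a sketch: the function name, the endowment of 20, and the multiplier of 1.6 are illustrative assumptions, and the sanctioning institution studied in the paper is not modeled here.

```python
def public_goods_payoffs(contributions, endowment=20.0, multiplier=1.6):
    """Payoffs in a standard linear public goods game.

    Each player keeps whatever they do not contribute; the pooled
    contributions are multiplied and split equally among all players.
    The parameter values are illustrative, not those used in the paper.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

# Three cooperators and one free-rider: the free-rider earns the most.
print(public_goods_payoffs([20, 20, 20, 0]))
# Full cooperation leaves everyone better off than full defection.
print(public_goods_payoffs([20, 20, 20, 20]))
print(public_goods_payoffs([0, 0, 0, 0]))
```

Because the multiplier is smaller than the number of players, contributing nothing is individually optimal even though full cooperation maximizes the group total; costly sanctioning is one mechanism for closing that gap.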

[140] MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

Huanjin Yao,Jiaxing Huang,Yawen Qiu,Michael K. Chen,Wenzheng Liu,Wei Zhang,Wenjie Zeng,Xikun Zhang,Jingyi Zhang,Yuxin Song,Wenhao Wu,Dacheng Tao

Main category: cs.AI

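TL;DR: MMReason is a new benchmark for multimodal large language models (MLLMs) that aims to evaluate long-chain reasoning precisely, using diverse, open-ended, and challenging questions to fill gaps in existing benchmarks.

Details Motivation: Existing MLLM benchmarks fall short in diversity, difficulty, and assessment of intermediate reasoning steps, so they cannot fully measure long-chain reasoning. MMReason aims to address these problems.

Contribution: 1) Open-ended questions spanning multiple disciplines and difficulty levels; 2) A multi-model voting technique that filters out cases solvable by guessing or memorization; 3) Detailed step-by-step solutions and a ternary scoring mechanism.

Method: 1) Curate questions from 6 disciplines and multiple difficulty levels; 2) Filter them with multi-model voting; 3) Annotate detailed step-by-step solutions and design a reference-based ternary scoring mechanism.

Result: Popular leading MLLMs are benchmarked on MMReason, with an in-depth analysis of their reasoning capabilities.

Insight: Through diverse and challenging question design, MMReason offers a more reliable tool for studying MLLM reasoning.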

Abstract: Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.

[141] SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Bo Liu,Leon Guertler,Simon Yu,Zichen Liu,Penghui Qi,Daniel Balcells,Mickel Liu,Cheston Tan,Weiyan Shi,Min Lin,Wee Sun Lee,Natasha Jaques

Main category: cs.AI

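TL;DR: SPIRAL is a self-play framework in which language models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves. Game training alone yields reasoning gains that transfer to math and general reasoning, without human supervision.

Details Motivation: Reinforcement learning with verifiable rewards can produce sophisticated reasoning, but it depends on human-curated problem-answer pairs and domain-specific reward engineering.

Contribution: 1. The SPIRAL self-play framework; 2. A fully online, multi-turn, multi-agent reinforcement learning system for LLMs; 3. Role-conditioned advantage estimation (RAE) to stabilize multi-agent training.

Method: Models play zero-sum games (TicTacToe, Kuhn Poker, Simple Negotiation) against themselves, which creates a curriculum of progressively stronger opponents; RAE stabilizes the multi-agent training.

Result: Training Qwen3-4B-Base on Kuhn Poker alone gives an 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Multi-game training improves performance further, and DeepSeek-R1-Distill-Qwen-7B still gains 2.0% on average.

Insight: Transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Zero-sum games appear to develop transferable reasoning capabilities.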

Abstract: Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
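
The abstract names role-conditioned advantage estimation (RAE) without giving its formula. The sketch below shows one plausible reading, assuming RAE subtracts a separate running baseline for each player role so that a systematic first-mover advantage is not mistaken for skill. The class name, the exponential moving average, and the decay value are assumptions, not the authors' implementation.

```python
from collections import defaultdict

class RoleBaseline:
    """Keeps one exponential-moving-average return baseline per role."""

    def __init__(self, decay=0.95):
        self.decay = decay
        self.baseline = defaultdict(float)  # role -> running mean return

    def advantage(self, role, episode_return):
        # Update this role's baseline, then measure the return against it.
        updated = self.decay * self.baseline[role] + (1 - self.decay) * episode_return
        self.baseline[role] = updated
        return episode_return - updated

rae = RoleBaseline(decay=0.5)
first = rae.advantage("player_0", 1.0)    # baseline for player_0 becomes 0.5
second = rae.advantage("player_1", -1.0)  # player_1 has its own baseline
```

Keeping the baselines separate means a role that tends to win is compared against its own typical return rather than against a pooled average.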

cs.LG [Back]

[142] Masked Gated Linear Unit

Yukito Tajima,Nakamasa Inoue,Yusuke Sekikawa,Ikuro Sato,Rio Yokota

Main category: cs.LG

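TL;DR: Masked Gated Linear Units (MGLUs) are a family of GLUs that share a single weight matrix between the gate and value streams through learned binary masks, reducing memory transfer. With the FlashMGLU kernel they run faster and use less memory than standard GLUs while matching SwiGLU accuracy.

Details Motivation: GLUs are essential in the feed-forward networks of state-of-the-art LLMs, but they need twice as many memory reads as ungated feed-forward layers because the gate and value streams use separate weight matrices.

Contribution: 1. The Mixture of Element-wise Gating (MoEG) architecture, which learns multiple binary masks over a single shared weight matrix; 2. FlashMGLU, a hardware-friendly kernel implementation.

Method: Each binary mask decides, element by element, whether an entry of the shared weight matrix belongs to the gate or the value stream; FlashMGLU implements this efficiently on GPU.

Result: FlashMGLU is up to 19.7$\times$ faster at inference than a naive PyTorch MGLU, and 47% more memory-efficient and 34% faster than standard GLUs on an RTX5090 GPU. The Swish-activated variant SwiMGLU matches or surpasses the downstream accuracy of the SwiGLU baseline.

Insight: Sharing weights between the gate and value streams can reduce memory traffic without hurting downstream accuracy.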

Abstract: Gated Linear Units (GLUs) have become essential components in the feed-forward networks of state-of-the-art Large Language Models (LLMs). However, they require twice as many memory reads compared to feed-forward layers without gating, due to the use of separate weight matrices for the gate and value streams. To address this bottleneck, we introduce Masked Gated Linear Units (MGLUs), a novel family of GLUs with an efficient kernel implementation. The core contributions of MGLUs include: (1) the Mixture of Element-wise Gating (MoEG) architecture that learns multiple binary masks, each determining gate or value assignments at the element level on a single shared weight matrix, resulting in reduced memory transfer, and (2) FlashMGLU, a hardware-friendly kernel that yields up to a 19.7$\times$ inference-time speed-up over a naive PyTorch MGLU and is 47% more memory-efficient and 34% faster than standard GLUs despite added architectural complexity on an RTX5090 GPU. In LLM experiments, the Swish-activated variant SwiMGLU preserves its memory advantages while matching, or even surpassing, the downstream accuracy of the SwiGLU baseline.
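
To make the masking idea concrete, here is a naive NumPy sketch of one way to read the abstract: each binary mask assigns every entry of a shared weight matrix to either the gate stream or the value stream, and the outputs for several masks are summed. The function names, the summation over masks, and the use of fixed rather than learned masks are assumptions; this is not the FlashMGLU kernel.

```python
import numpy as np

def swish(z):
    """Swish (SiLU) activation, z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def masked_glu(x, W, masks):
    """Naive masked GLU over one shared weight matrix W of shape (d_in, d_out).

    Each mask is a 0/1 array shaped like W: entries equal to 1 feed the
    gate stream and entries equal to 0 feed the value stream.
    """
    out = np.zeros((x.shape[0], W.shape[1]))
    for mask in masks:
        gate = x @ (W * mask)         # gate projection from masked-in weights
        value = x @ (W * (1 - mask))  # value projection from the complement
        out += swish(gate) * value
    return out
```

Only one weight matrix is read here, whereas a standard GLU reads separate gate and value matrices; an efficient kernel would fuse the masking into the matrix multiply instead of materializing `W * mask`.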

[143] Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

Kenny Peng,Rajiv Movva,Jon Kleinberg,Emma Pierson,Nikhil Garg

Main category: cs.LG

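TL;DR: The paper distinguishes two uses of sparse autoencoders (SAEs), acting on known concepts and discovering unknown ones, and argues that SAEs are better suited to discovery.

Details Motivation: SAEs have generated wide interest, but a series of negative results has raised doubts about their usefulness. The authors draw a conceptual distinction to reconcile the competing narratives around SAEs.

Contribution: Separates the role of SAEs for known concepts from their role for unknown concepts, and clarifies their value for discovering unknown concepts.

Method: Conceptual analysis and a classification of existing experimental results, used to argue where SAEs are applicable.

Result: SAEs have potential applications in ML interpretability, explainability, fairness, auditing, and safety, and in the social and health sciences.

Insight: The value of SAEs as a tool depends on the specific problem setting and should not be judged wholesale.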

Abstract: While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, SAEs are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.

cs.DB [Back]

[144] GaussMaster: An LLM-based Database Copilot System

Wei Zhou,Ji Sun,Xuanhe Zhou,Guoliang Li,Luyang Liu,Hao Wu,Tianyuan Wang

Main category: cs.DB

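TL;DR: GaussMaster is an LLM-based database copilot system that aims to reduce the workload of DBAs, particularly in the financial industry, by automating comprehensive database maintenance.

Details Motivation: Database administrators (DBAs) in the financial industry carry a heavy workload, and existing autonomous database platforms are limited in capability and still require manual intervention.

Contribution: Proposes GaussMaster, an LLM-based system that automates database maintenance tasks and achieves zero human intervention in the reported scenarios.

Method: Analyzes hundreds of metrics and logs, uses a Tree-of-thought approach to locate root causes, and invokes tools to resolve the issues.

Result: Deployed in real banking scenarios, where it achieved zero human intervention in over 34 database maintenance scenarios.

Insight: LLMs can extend traditional database autonomy to cover end-to-end maintenance tasks.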

Abstract: In the financial industry, data is the lifeblood of operations, and DBAs shoulder significant responsibilities for SQL tuning, database deployment, diagnosis, and service repair. In recent years, both database vendors and customers have increasingly turned to autonomous database platforms in an effort to alleviate the heavy workload of DBAs. However, existing autonomous database platforms are limited in their capabilities, primarily addressing single-point issues such as NL2SQL, anomaly detection, and SQL tuning. Manual intervention remains a necessity for comprehensive database maintenance. GaussMaster aims to revolutionize this landscape by introducing an LLM-based database copilot system. This innovative solution is designed not only to assist developers in writing efficient SQL queries but also to provide comprehensive care for database services. When database instances exhibit abnormal behavior, GaussMaster is capable of orchestrating the entire maintenance process automatically. It achieves this by analyzing hundreds of metrics and logs, employing a Tree-of-thought approach to identify root causes, and invoking appropriate tools to resolve issues. We have successfully implemented GaussMaster in real-world scenarios, such as the banking industry, where it has achieved zero human intervention for over 34 database maintenance scenarios. In this paper, we present significant improvements in these tasks with code at https://gitcode.com/opengauss/openGauss-GaussMaster.

cs.IR [Back]

[145] Teaching a Language Model to Speak the Language of Tools

Simeon Emanuilov

Main category: cs.IR

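TL;DR: The paper presents a method for making multilingual models use tools (such as function calling) reliably in non-English languages. Using Bulgarian as a case study, training on a bilingual dataset improves the model's tool-use ability.

Details Motivation: Most existing multilingual models use tools poorly in non-English languages, showing language confusion or producing unreliable structured output.

Contribution: Proposes continued training on a bilingual dataset, which improves both accuracy on tool-use tasks and the consistency of output formatting, and releases the models, evaluation framework, and dataset.

Method: Continues training the BgGPT model series on a bilingual dataset of 10,035 function-calling examples designed to support standardized protocols such as MCP.

Result: TUCAN improves function-calling accuracy by up to 28.75% over the base models while preserving core language understanding, and produces cleaner, parsable output.

Insight: Combining standardized protocols with a bilingual dataset is an effective way to improve tool use in non-English languages and gives a replicable recipe for other languages.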

Abstract: External tool integration through function-calling is essential for practical language model applications, yet most multilingual models lack reliable tool-use capabilities in non-English languages. Even state-of-the-art multilingual models struggle with determining when to use tools and generating the structured outputs required for function calls, often exhibiting language confusion when prompted in lower-resource languages. This work presents a methodology for adapting existing language models to enable robust tool use in any target language, using Bulgarian as a case study. The approach involves continued training of the BgGPT model series (2.6B, 9B, 27B parameters) on a novel bilingual dataset of 10,035 function-calling examples designed to support standardized protocols like MCP (Model Context Protocol). The research introduces TUCAN (Tool-Using Capable Assistant Navigator), which achieves up to 28.75% improvement in function-calling accuracy over base models while preserving core language understanding, as verified on established Bulgarian benchmarks. Beyond accuracy gains, TUCAN models demonstrate production-ready response formatting with clean, parsable function calls, contrasting with the verbose and inconsistent outputs of base models. The models, evaluation framework, and dataset are released to enable replication for other languages. This work demonstrates a practical approach for extending tool-augmented capabilities beyond English-centric systems.

cs.CY [Back]

[146] Computational Analysis of Climate Policy

Carolyn Hicks

Main category: cs.CY

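TL;DR: The thesis studies how the Climate Emergency movement affected local government climate policy. Using PALLM, a system built on GPT-4, it analyzes policy documents from local governments in Victoria and finds that councils which passed a Climate Emergency Declaration (CED) pay more attention to urgency and social equity in their climate policy.

Details Motivation: To assess the potential of current large language models (such as GPT-4) to answer complex policy questions, specifically the effect of the Climate Emergency movement on local government policy.

Contribution: 1. Builds PALLM, a system that uses GPT-4 to analyze climate policy documents; 2. Through large-scale analysis, finds that councils which passed a CED are more active on climate policy.

Method: 1. Build the PALLM system on GPT-4; 2. Validate it by analyzing the climate policies of 11 local governments in Victoria; 3. Extend the analysis to more policy documents, comparing councils that passed a CED with those that did not.

Result: Councils that passed a CED are more likely to have a recent, climate-specific policy and pay more attention to urgency, prioritisation, and equity and social justice.

Insight: GPT-4 makes it possible to assess policy documents at scale and gives policy researchers a new tool, but its limitations, such as the lack of reliable attribution, need to be kept in mind.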

Abstract: This thesis explores the impact of the Climate Emergency movement on local government climate policy, using computational methods. The Climate Emergency movement sought to accelerate climate action at local government level through the mechanism of Climate Emergency Declarations (CEDs), resulting in a series of commitments from councils to treat climate change as an emergency. With the aim of assessing the potential of current large language models to answer complex policy questions, I first built and configured a system named PALLM (Policy Analysis with a Large Language Model), using the OpenAI model GPT-4. This system is designed to apply a conceptual framework for climate emergency response plans to a dataset of climate policy documents. I validated the performance of this system with the help of local government policymakers, by generating analyses of the climate policies of 11 local governments in Victoria and assessing the policymakers’ level of agreement with PALLM’s responses. Having established that PALLM’s performance is satisfactory, I used it to conduct a large-scale analysis of current policy documents from local governments in the state of Victoria, Australia. This thesis presents the methodology and results of this analysis, comparing the results for councils which have passed a CED to those which did not. This study finds that GPT-4 is capable of high-level policy analysis, with limitations including a lack of reliable attribution, and can also enable more nuanced analysis by researchers. Its use in this research shows that councils which have passed a CED are more likely to have a recent and climate-specific policy, and show more attention to urgency, prioritisation, and equity and social justice, than councils which have not. It concludes that the ability to assess policy documents at scale opens up exciting new opportunities for policy researchers.

cs.DL [Back]

[147] Density, asymmetry and citation dynamics in scientific literature

Nathaniel Imel,Zachary Hafen

Main category: cs.DL

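TL;DR: The paper asks whether a paper's similarity to previous research predicts its eventual citation rate. Across about 53,000 publications, the density of a paper's semantic neighborhood modestly but consistently improves citation prediction, while asymmetry does not.

Details Motivation: Scientific behavior involves a tension between building on established knowledge and introducing novel ideas. The authors test whether this tension shows up in the relationship between a paper's similarity to prior work and its citation rate.

Contribution: Introduces two metrics for the local geometry of a publication's semantic neighborhood, density ($\rho$) and asymmetry ($\alpha$), and a scalable framework linking document embeddings to scientometric outcomes.

Method: Computes both metrics in semantic embedding spaces and relates them to subsequent citation rates with Bayesian hierarchical regression, covering nine academic disciplines and five document embeddings.

Result: The individual effects of $\rho$ on citation count are small and variable, but adding density-based predictors consistently improves out-of-sample prediction over baseline models. There is no evidence that asymmetry improves predictions.

Insight: The density of the literature surrounding a paper may carry a modest but informative signal about its eventual impact.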

Abstract: Scientific behavior is often characterized by a tension between building upon established knowledge and introducing novel ideas. Here, we investigate whether this tension is reflected in the relationship between the similarity of a scientific paper to previous research and its eventual citation rate. To operationalize similarity to previous research, we introduce two complementary metrics to characterize the local geometry of a publication’s semantic neighborhood: (1) density ($\rho$), defined as the ratio between a fixed number of previously-published papers and the minimum distance enclosing those papers in a semantic embedding space, and (2) asymmetry ($\alpha$), defined as the average directional difference between a paper and its nearest neighbors. We tested the predictive relationship between these two metrics and a paper’s subsequent citation rate using a Bayesian hierarchical regression approach, surveying approximately 53,000 publications across nine academic disciplines and five different document embeddings. While the individual effects of $\rho$ on citation count are small and variable, incorporating density-based predictors consistently improves out-of-sample prediction when added to baseline models. These results suggest that the density of a paper’s surrounding scientific literature may carry modest but informative signals about its eventual impact. Meanwhile, we find no evidence that publication asymmetry improves model predictions of citation rates. Our work provides a scalable framework for linking document embeddings to scientometric outcomes and highlights new questions regarding the role that semantic similarity plays in shaping the dynamics of scientific reward.
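
The density definition in the abstract can be sketched directly: take the k nearest previously published papers and divide k by the radius that encloses them. In the sketch below the function name, the default value of k, and the choice of Euclidean distance are assumptions (the paper may use another metric); asymmetry is left out because the abstract does not define it precisely enough to implement.

```python
import numpy as np

def density(target, prior, k=3):
    """Density of a paper's semantic neighborhood.

    target: (d,) embedding of the paper.
    prior:  (n, d) embeddings of previously published papers, n >= k.
    Returns k divided by the distance to the k-th nearest prior paper,
    i.e. the smallest radius that encloses k earlier papers.
    """
    distances = np.linalg.norm(prior - target, axis=1)
    radius = np.sort(distances)[k - 1]
    return k / radius
```

A paper in a crowded region has a small enclosing radius and therefore a high density, while an isolated paper scores low.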

cs.SD [Back]

[148] You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties

Paige Tuttösí,H. Henny Yeung,Yue Wang,Jean-Julien Aucouturier,Angelica Lim

Main category: cs.SD

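TL;DR: The paper proposes a TTS system designed for second-language (L2) learners that improves speech clarity by adjusting the duration of English vowels (the distinction between longer tense vowels and shorter lax vowels). Experiments show that this mode significantly reduced transcription errors for French-L1 listeners (by at least 9.15%) and was judged more respectful and encouraging than slowing speech overall. Listeners were unaware of the effect, however, and mistakenly believed that overall slowing was the clearest. The study also finds that Whisper-ASR relies on different speech cues from L2 learners and is not sufficient to evaluate TTS systems for them.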

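Details Motivation: L2 learners may struggle to understand English speech, particularly when the distinction between long and short English vowels is not clear. Existing TTS systems are not optimized for this group, and conventional approaches such as slowing speech overall have limited effect. A TTS system designed specifically for L2 learners is therefore needed to improve clarity and intelligibility.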

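Contribution: 1. Presents the first TTS system aimed at L2 learners, improving clarity by adjusting vowel duration.
2. Shows experimentally that this "clarity mode" significantly reduces transcription errors (by at least 9.15%).
3. Reveals a mismatch between the clarity L2 learners perceive and actual clarity.
4. Finds that Whisper-ASR is not suitable for evaluating TTS systems for L2 learners.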

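Method: 1. Uses the Matcha-TTS framework and adjusts the duration of long and short English vowels to generate "clarity mode" speech.
2. Runs perception experiments with French-L1 listeners, comparing transcription error rates and subjective ratings.
3. Analyzes Whisper-ASR recognition results to assess how well they match L2 learners.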

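Result: 1. Clarity mode significantly reduced transcription errors (by at least 9.15%).
2. Participants found clarity mode more respectful and encouraging than slowing speech overall.
3. Listeners mistakenly believed that overall slowing was the clearest, indicating that perceived and actual clarity are not correlated.
4. Whisper-ASR does not adequately capture the speech cues that L2 learners use.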

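Insight: 1. What L2 learners need for speech comprehension, and from speech recognition tools such as ASR, may differ from what native speakers need.
2. Improving clarity is not only a technical problem; users' subjective perception must also be considered.
3. TTS systems designed specifically for L2 learners can significantly improve the learning experience.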

Abstract: We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a “clarity mode” for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouraging and respectful than overall slowed down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.
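
The abstract reports results in terms of transcription errors and word error rate. As a reminder of how that metric is usually computed, here is a minimal sketch of the standard word-level edit-distance definition. It is not the authors' scoring script, and the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A listener who hears a lax vowel as its tense counterpart ("ship" as "sheep")
# produces one substitution in a three-word reference.
print(word_error_rate("the ship sails", "the sheep sails"))
```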

[149] Efficient Interleaved Speech Modeling through Knowledge Distillation

Mohammadmahdi Nouriborji,Morteza Rohanian

Main category: cs.SD

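TL;DR: The paper proposes a knowledge-distillation method for building compact, efficient speech generation models. It trains TinyWave, a family of small models that support speech-to-speech and interleaved speech-text generation, approach the performance of large models, and are suited to deployment on resource-constrained devices.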

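Details Motivation: Current speech language models exceed the size and latency constraints of many deployment environments, so more compact and efficient models are needed.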

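Contribution: Proposes layer-aligned distillation to compress large multimodal Transformer models and trains the TinyWave model family, which supports a variety of speech generation tasks.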

Method: Uses layer-aligned distillation, matching hidden states, attention maps, and softened logits, to compress large models by 3x.

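Result: On Libri-Light, TinyWave performs close to its teacher model (within 1.4 normalized perplexity points). On StoryCloze and SALMon, its accuracy reaches 93-97% of the teacher's, outperforming size-matched baselines.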

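Insight: Layer-aligned distillation can effectively compress speech generation models, enabling efficient deployment with minimal performance loss, and is suited to real-time conversational assistants and low-resource environments.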

Abstract: Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher’s performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
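
The three matched quantities named in the abstract (hidden states, attention maps, softened logits) can be combined into a single training loss. The sketch below shows one common way to do this for a single aligned teacher-student layer pair, under several assumptions that do not come from the paper: mean squared error for hidden states and attention maps, temperature-scaled KL divergence for logits, equal default weights, and flat Python lists standing in for tensors. All names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with nonzero q entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Mean squared error between two equal-length flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student, teacher, temperature=2.0,
                      w_hidden=1.0, w_attn=1.0, w_logits=1.0):
    """Weighted sum of hidden-state, attention-map, and soft-logit terms.

    `student` and `teacher` are dicts with flat lists under the keys
    "hidden", "attention", and "logits" for one aligned layer pair.
    """
    hidden_term = mse(student["hidden"], teacher["hidden"])
    attn_term = mse(student["attention"], teacher["attention"])
    p_teacher = softmax(teacher["logits"], temperature)
    p_student = softmax(student["logits"], temperature)
    # The T^2 factor is the usual rescaling so that gradient magnitudes
    # stay comparable across temperatures.
    logit_term = temperature ** 2 * kl_divergence(p_teacher, p_student)
    return w_hidden * hidden_term + w_attn * attn_term + w_logits * logit_term


teacher = {"hidden": [1.0, 1.0], "attention": [0.5, 0.5], "logits": [2.0, 0.0, -1.0]}
student = {"hidden": [0.0, 0.0], "attention": [0.5, 0.5], "logits": [2.0, 0.0, -1.0]}
print(distillation_loss(student, teacher))
```

In a real training setup the student typically has fewer layers than the teacher, so each student layer would be paired with a chosen teacher layer and the per-layer terms summed; how the paper performs that alignment is not specified in the abstract.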