Table of Contents
- cs.CL [Total: 28]
- cs.CV [Total: 64]
- cs.CR [Total: 5]
- eess.IV [Total: 3]
- cs.CY [Total: 2]
- cs.RO [Total: 2]
- cs.SD [Total: 1]
- astro-ph.IM [Total: 1]
- cs.AI [Total: 2]
- cs.SE [Total: 1]
- cs.HC [Total: 1]
- cs.LG [Total: 5]
cs.CL
[1] Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English
Nguyen Huu Nhat Minh,Tran Nguyen Anh,Truong Dinh Dung,Vo Van Nam,Le Pham Tuyen
Main category: cs.CL
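TL;DR: This paper proposes a Whisper-based cross-lingual phoneme recognition method that addresses phoneme alignment in mixed Vietnamese-English speech recognition. It builds a bilingual phoneme set and designs an end-to-end system to improve recognition accuracy.
Details
Motivation: Vietnamese and English differ markedly in their phoneme systems. Vietnamese relies on tone to distinguish word meanings, whereas English is characterized by stress patterns and non-standard pronunciations, which makes phoneme alignment difficult in mixed speech recognition.
Contribution: (1) A representative bilingual phoneme set for Vietnamese and English; (2) an end-to-end system built on the PhoWhisper pre-trained encoder to improve phoneme recognition.
Method: Uses the PhoWhisper pre-trained encoder to produce high-level features and combines them with the bilingual phoneme set in an end-to-end system for cross-lingual phoneme recognition.
Result: Experiments indicate that the method improves accuracy in bilingual speech recognition for Vietnamese and provides a robust framework for handling tonal and stress-based complexity.
Insight: The work offers a new approach to cross-lingual phoneme recognition and is a useful reference for mixed speech recognition involving tonal languages such as Vietnamese.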
Abstract: Cross-lingual phoneme recognition has emerged as a significant challenge for accurate automatic speech recognition (ASR) when mixing Vietnamese and English pronunciations. Unlike many languages, Vietnamese relies on tonal variations to distinguish word meanings, whereas English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. To address this challenge, we propose a novel bilingual speech recognition approach with two primary contributions: (1) constructing a representative bilingual phoneme set that bridges the differences between Vietnamese and English phonetic systems; (2) designing an end-to-end system that leverages the PhoWhisper pre-trained encoder for deep high-level representations to improve phoneme recognition. Our extensive experiments demonstrate that the proposed approach not only improves recognition accuracy in bilingual speech recognition for Vietnamese but also provides a robust framework for addressing the complexities of tonal and stress-based phoneme recognition.
[2] Rethinking Reasoning in LLMs: Neuro-Symbolic Local RetoMaton Beyond ICL and CoT
Rushitha Santhoshi Mamidala,Anshuman Chhabra,Ankur Mali
Main category: cs.CL
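TL;DR: This paper proposes local RetoMaton, a neuro-symbolic framework for improving reasoning in large language models (LLMs). Compared with prompting methods such as CoT and ICL, it offers more stable and verifiable retrieval behavior.
Details
Motivation: Prompting methods such as CoT and ICL rely on implicit mechanisms, which makes their outputs unstable and unreliable. The authors turn to a neuro-symbolic framework to improve the stability and interpretability of reasoning.
Contribution: Local RetoMaton, which builds a local Weighted Finite Automaton (WFA) from external domain corpora to provide a more robust and transparent retrieval mechanism.
Method: Replaces RetoMaton's global datastore with a local, task-adaptive WFA and uses its deterministic transitions and symbolic structure to support context-aware retrieval and traceability.
Result: On two pretrained LLMs (LLaMA-3.2-1B and Gemma-3-1B-PT), local RetoMaton outperforms the base model and prompting methods on three tasks (TriviaQA, GSM8K, and MMLU) while keeping inference overhead low.
Insight: The paper highlights the potential of lightweight, automaton-based symbolic reasoning in LLMs as a route to more trustworthy and interpretable models.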
Abstract: Prompt-based reasoning strategies such as Chain-of-Thought (CoT) and In-Context Learning (ICL) have become widely used for eliciting reasoning capabilities in large language models (LLMs). However, these methods rely on fragile, implicit mechanisms often yielding inconsistent outputs across seeds, formats, or minor prompt variations making them fundamentally unreliable for tasks requiring stable, interpretable reasoning. In contrast, automata-based neuro-symbolic frameworks like RetoMaton offer a more structured and trustworthy alternative by grounding retrieval in symbolic memory with deterministic transitions. In this work, we extend RetoMaton by replacing its global datastore with a local, task-adaptive Weighted Finite Automaton (WFA), constructed directly from external domain corpora. This local automaton structure promotes robust, context-aware retrieval while preserving symbolic traceability and low inference overhead. Unlike prompting, which entangles context and memory in opaque ways, our approach leverages the explicit structure of WFAs to provide verifiable and modular retrieval behavior, making it better suited for domain transfer and interoperability. We evaluate this local RetoMaton variant on two pretrained LLMs LLaMA-3.2-1B and Gemma-3-1B-PT across three reasoning tasks: TriviaQA (reading comprehension), GSM8K (multi-step math), and MMLU (domain knowledge). Compared to the base model and prompting-based methods, augmenting these setups with local RetoMaton consistently improves performance while enabling transparent and reproducible retrieval dynamics. Our results highlight a promising shift toward trustworthy, symbolic reasoning in modern LLMs via lightweight, automaton-guided memory.
[3] Leveraging Language Models and Machine Learning in Verbal Autopsy Analysis
Yue Chu
Main category: cs.CL
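TL;DR: This thesis investigates how pretrained language models and machine learning can automate cause-of-death classification from the narrative part of verbal autopsies, and shows gains at both the individual and population levels.
Details
Motivation: In countries without civil registration and vital statistics, verbal autopsy is an important tool for estimating cause of death and setting policy priorities. Existing automated classification methods use only the structured questions and ignore the information in the narratives.
Contribution: (1) Shows the advantage of narrative-based pretrained language models for cause-of-death classification; (2) explores multimodal fusion strategies; (3) analyzes physician-perceived information sufficiency and its effect on classification.
Method: Fine-tunes transformer-based pretrained language models for the task and compares narrative-only, question-only, and multimodal fusion approaches on empirical data from South Africa. Physician-perceived information sufficiency is also analyzed.
Result: Narrative-only transformer models outperform leading question-only algorithms, particularly for non-communicable diseases. Multimodal approaches improve classification performance further.
Insight: Narratives and questions each contribute unique information and should be combined. Information sufficiency affects classification accuracy; more high-quality data from diverse settings is needed, and the verbal autopsy instrument should be reconsidered and redesigned.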
Abstract: In countries without civil registration and vital statistics, verbal autopsy (VA) is a critical tool for estimating cause of death (COD) and informing policy priorities. In VA, interviewers ask proximal informants for details on the circumstances preceding a death, in the form of unstructured narratives and structured questions. Existing automated VA cause classification algorithms only use the questions and ignore the information in the narratives. In this thesis, we investigate how the VA narrative can be used for automated COD classification using pretrained language models (PLMs) and machine learning (ML) techniques. Using empirical data from South Africa, we demonstrate that with the narrative alone, transformer-based PLMs with task-specific fine-tuning outperform leading question-only algorithms at both the individual and population levels, particularly in identifying non-communicable diseases. We explore various multimodal fusion strategies combining narratives and questions in unified frameworks. Multimodal approaches further improve performance in COD classification, confirming that each modality has unique contributions and may capture valuable information that is not present in the other modality. We also characterize physician-perceived information sufficiency in VA. We describe variations in sufficiency levels by age and COD and demonstrate that classification accuracy is affected by sufficiency for both physicians and models. Overall, this thesis advances the growing body of knowledge at the intersection of natural language processing, epidemiology, and global health. It demonstrates the value of narrative in enhancing COD classification. Our findings underscore the need for more high-quality data from more diverse settings to use in training and fine-tuning PLM/ML methods, and offer valuable insights to guide the rethinking and redesign of the VA instrument and interview.
[4] CORE: Lossless Compression for Retrieval-Augmented LLMs via Reinforcement Learning
Ziqiang Cui,Yunpeng Weng,Xing Tang,Peiyang Liu,Shiwei Li,Bowei He,Jiamin Chen,Xiuqiang He,Chen Ma
Main category: cs.CL
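TL;DR: The paper proposes CORE, a method that uses reinforcement learning to optimize context compression for retrieval-augmented generation (RAG), achieving lossless compression while improving end-task performance.
Details
Motivation: RAG improves the timeliness of knowledge and the factual accuracy of large language models (LLMs), but long retrieved documents raise computational cost. Existing compression methods often sacrifice end-task performance and lack well-defined compression targets.
Contribution: CORE, which trains the compressor with reinforcement learning (GRPO) using end-task performance as the reward signal, achieving lossless compression and more accurate answers.
Method: Trains the compressor with Generalized Reinforcement Learning Policy Optimization (GRPO) in an end-to-end framework that directly optimizes the accuracy of the LLM's answers, without predefined compression labels.
Result: On four datasets, at a high compression ratio of 3%, the method avoids performance degradation relative to prepending full documents and improves the average Exact Match (EM) score by 3.3 points.
Insight: Using end-task performance directly as the reinforcement learning objective compresses context more effectively while maintaining or improving LLM performance.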
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the timeliness of knowledge and the factual accuracy of responses in Large Language Models (LLMs). However, the inclusion of excessive retrieved documents substantially increases the input length, leading to higher computational costs. Previous studies have attempted to compress retrieved documents into shorter texts before in-context integration, but such methods often compromise end-task performance. The lack of well-defined compression targets forces many approaches to rely on fixed heuristics, which cannot guarantee that the compressed content will effectively support the end task. To address these limitations, we propose CORE, a novel method designed to achieve lossless context compression for RAG. CORE employs reinforcement learning to optimize the compression process without relying on predefined compression labels. Specifically, it utilizes end-task performance as a reward signal and applies Generalized Reinforcement Learning Policy Optimization (GRPO) to train the compressor. This end-to-end training framework enables the compressor to generate summaries that maximize the accuracy of answers generated by the LLM. Extensive experiments on four datasets demonstrate the superiority of our approach. With a high compression ratio of 3%, our method not only avoids performance degradation compared to prepending full documents across all datasets but also improves the average Exact Match (EM) score by 3.3 points. The code will be released soon.
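The reward signal described above can be illustrated with a minimal sketch. It assumes a SQuAD-style normalization for Exact Match and reads "compression ratio" as compressed length divided by original length; neither detail is confirmed by the abstract, and the function names `normalize`, `exact_match_reward`, and `compression_ratio` are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: list) -> float:
    """Return 1.0 if the prediction matches any gold answer after normalization."""
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(gold) for gold in gold_answers) else 0.0


def compression_ratio(compressed_tokens: int, original_tokens: int) -> float:
    """Assumed definition: compressed length as a fraction of the original length."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return compressed_tokens / original_tokens


if __name__ == "__main__":
    # Hypothetical example: the reader LLM answered from a compressed context.
    print(exact_match_reward("The Eiffel Tower.", ["Eiffel Tower"]))
    print(compression_ratio(30, 1000))
```

In a training loop of the kind the abstract describes, a scalar like this would be computed on the reader LLM's answer and fed back to the compressor as its reward.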
[5] Context-Adaptive Synthesis and Compression for Enhanced Retrieval-Augmented Generation in Complex Domains
Peiran Zhou,Junnan Zhu,Yichen Shen,Ruoxi Yu
Main category: cs.CL
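TL;DR: Proposes CASC, a framework that intelligently synthesizes and compresses retrieved context to address information overload for RAG in complex domains.
Details
Motivation: Traditional RAG performs poorly in complex domains involving multiple, lengthy, or conflicting documents, leading to information overload and inaccurate answers.
Contribution: (1) The CASC framework, which combines context analysis and synthesis in one module; (2) SciDocs-QA, a new dataset.
Method: (1) A fine-tuned smaller LLM (the CAS module) extracts key information, checks cross-document consistency, and resolves conflicts; (2) it produces a structured, semantically dense context for the final Reader LLM.
Result: On SciDocs-QA, CASC outperforms the baselines.
Insight: Structured synthesis and compression of retrieved context reduces the token count passed to the reader and improves answer quality in complex domains.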
Abstract: Large Language Models (LLMs) excel in language tasks but are prone to hallucinations and outdated knowledge. Retrieval-Augmented Generation (RAG) mitigates these by grounding LLMs in external knowledge. However, in complex domains involving multiple, lengthy, or conflicting documents, traditional RAG suffers from information overload and inefficient synthesis, leading to inaccurate and untrustworthy answers. To address this, we propose CASC (Context-Adaptive Synthesis and Compression), a novel framework that intelligently processes retrieved contexts. CASC introduces a Context Analyzer & Synthesizer (CAS) module, powered by a fine-tuned smaller LLM, which performs key information extraction, cross-document consistency checking and conflict resolution, and question-oriented structured synthesis. This process transforms raw, scattered information into a highly condensed, structured, and semantically rich context, significantly reducing the token count and cognitive load for the final Reader LLM. We evaluate CASC on SciDocs-QA, a new challenging multi-document question answering dataset designed for complex scientific domains with inherent redundancies and conflicts. Our extensive experiments demonstrate that CASC consistently outperforms strong baselines.
[6] LongReasonArena: A Long Reasoning Benchmark for Large Language Models
Jiayu Ding,Shuming Ma,Lei Cui,Nanning Zheng,Furu Wei
Main category: cs.CL
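TL;DR: The paper introduces LongReasonArena, a benchmark that evaluates the long reasoning ability of large language models rather than long-input comprehension. Models are tested on multi-step algorithmic tasks involving retrieval and backtracking, with reasoning lengths of up to 1 million tokens. Existing models perform poorly, and accuracy declines linearly with the logarithm of the number of reasoning steps.
Details
Motivation: Existing benchmarks focus on comprehension of long inputs and overlook long reasoning. The authors design a benchmark dedicated to long reasoning to fill this gap.
Contribution: LongReasonArena, a benchmark focused on long reasoning, with flexible task design and reasoning lengths that scale up to 1 million tokens.
Method: Designs multi-step algorithmic tasks (such as retrieval and backtracking), scales the required reasoning length by controlling the inputs (up to 1 million tokens), and runs extensive evaluations.
Result: Existing models perform poorly (for example, Deepseek-R1 reaches only 7.5% accuracy), and accuracy declines linearly with the logarithm of the expected number of reasoning steps.
Insight: Long reasoning is a major weakness of current LLMs. Models struggle on complex, long-chain tasks, and future work should target this.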
Abstract: Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, Deepseek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data are available at https://github.com/LongReasonArena/LongReasonArena.
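The log-linear trend reported above corresponds to a model of the form accuracy ≈ intercept + slope · log10(steps). The following is a minimal sketch of how such a fit could be computed with ordinary least squares. The data points are synthetic and chosen only to illustrate the calculation; they are not results from the paper, and `fit_log_linear` is a hypothetical helper name.

```python
import math


def fit_log_linear(steps, accuracies):
    """Least-squares fit of accuracy = intercept + slope * log10(steps)."""
    if len(steps) != len(accuracies) or len(steps) < 2:
        raise ValueError("need at least two (steps, accuracy) pairs")
    xs = [math.log10(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic illustration only: accuracy drops by a fixed amount per decade of steps.
    synthetic_steps = [10, 100, 1000, 10000]
    synthetic_accuracy = [0.9, 0.7, 0.5, 0.3]
    slope, intercept = fit_log_linear(synthetic_steps, synthetic_accuracy)
    print(f"slope per decade: {slope:.3f}, intercept: {intercept:.3f}")
```

A negative slope here means each tenfold increase in expected reasoning steps costs a roughly constant amount of accuracy.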
[7] Improving Low-Resource Translation with Dictionary-Guided Fine-Tuning and RL: A Spanish-to-Wayuunaiki Study
Manuel Mosquera,Melissa Robles,Johan Rodriguez,Ruben Manrique
Main category: cs.CL
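TL;DR: This paper combines an external dictionary tool with reinforcement learning to improve machine translation for a low-resource language pair (Spanish to Wayuunaiki), reporting BLEU gains over previous work.
Details
Motivation: Low-resource machine translation remains an open challenge. Large language models (LLMs) see little of these languages during pretraining, and parallel data for fine-tuning is limited.
Contribution: An approach that integrates an external dictionary tool with reinforcement learning, training the model with supervised fine-tuning and Guided Reward Policy Optimization (GRPO) to improve translation quality.
Method: Frames translation as a tool-augmented decision-making problem in which the model can selectively consult a bilingual dictionary during generation. Training combines supervised instruction tuning with GRPO, using BLEU scores as the reward signal.
Result: On the Spanish-Wayuunaiki test set, the tool-augmented models achieve up to +3.37 BLEU over previous work and an 18% relative gain over a supervised baseline without dictionary access. The authors describe these results as preliminary.
Insight: Combining LLMs with external tools and reinforcement learning can improve low-resource machine translation and suggests a direction for future work.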
Abstract: Low-resource machine translation remains a significant challenge for large language models (LLMs), which often lack exposure to these languages during pretraining and have limited parallel data for fine-tuning. We propose a novel approach that enhances translation for low-resource languages by integrating an external dictionary tool and training models end-to-end using reinforcement learning, in addition to supervised fine-tuning. Focusing on the Spanish-Wayuunaiki language pair, we frame translation as a tool-augmented decision-making problem in which the model can selectively consult a bilingual dictionary during generation. Our method combines supervised instruction tuning with Guided Reward Policy Optimization (GRPO), enabling the model to learn both when and how to use the tool effectively. BLEU similarity scores are used as rewards to guide this learning process. Preliminary results show that our tool-augmented models achieve up to +3.37 BLEU improvement over previous work, and a 18% relative gain compared to a supervised baseline without dictionary access, on the Spanish-Wayuunaiki test set from the AmericasNLP 2025 Shared Task. We also conduct ablation studies to assess the effects of model architecture and training strategy, comparing Qwen2.5-0.5B-Instruct with other models such as LLaMA and a prior NLLB-based system. These findings highlight the promise of combining LLMs with external tools and the role of reinforcement learning in improving translation quality in low-resource language settings.
[8] Rule Synergy Analysis using LLMs: State of the Art and Implications
Bahar Bateni,Benjamin Pratt,Jim Whitehead
Main category: cs.CL
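TL;DR: This paper studies how well large language models (LLMs) understand complex rule interactions in dynamic environments such as card games, and builds a card synergy dataset from the game Slay the Spire. LLMs do well at identifying non-synergistic card pairs but struggle to detect positive and negative synergies; the paper also categorizes the error types.
Details
Motivation: To evaluate the reasoning ability of LLMs in environments with complex rule interactions, particularly dynamic settings such as card games, and to identify their limitations and directions for improvement.
Contribution: (1) A card synergy dataset; (2) a systematic evaluation of LLMs on identifying card synergies; (3) a categorization of common error types.
Method: (1) Builds a card synergy dataset from Slay the Spire; (2) evaluates LLMs on identifying positive, negative, and neutral interactions; (3) analyzes error types (timing, defining game states, and following game rules).
Result: LLMs perform well at identifying non-synergistic pairs but poorly at detecting positive and, particularly, negative synergies.
Insight: LLMs have limitations when handling complex rule interactions. Future work can target timing, game-state definition, and rule understanding.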
Abstract: Large language models (LLMs) have demonstrated strong performance across a variety of domains, including logical reasoning, mathematics, and more. In this paper, we investigate how well LLMs understand and reason about complex rule interactions in dynamic environments, such as card games. We introduce a dataset of card synergies from the game Slay the Spire, where pairs of cards are classified based on their positive, negative, or neutral interactions. Our evaluation shows that while LLMs excel at identifying non-synergistic pairs, they struggle with detecting positive and, particularly, negative synergies. We categorize common error types, including issues with timing, defining game states, and following game rules. Our findings suggest directions for future research to improve model performance in predicting the effect of rules and their interactions.
[9] Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding
Bowen Sun,Yujun Cai,Ming-Hsuan Yang,Yiwei Wang
Main category: cs.CL
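TL;DR: This paper proposes Blockwise SFT, which aligns training with the decoding procedure used at inference, resolving the mismatch of standard supervised fine-tuning in diffusion language models and improving text generation performance.
Details
Motivation: Standard supervised fine-tuning (SFT) for diffusion language models mismatches training and inference-time decoding, which biases gradients. The paper aims to supervise blockwise decoding more faithfully.
Contribution: Blockwise SFT, which partitions the response into fixed-size blocks, masks selectively, and freezes or hides the surrounding context so that training directly mirrors blockwise decoding.
Method: Partitions the response into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, fully hides future ones, and computes the loss only over the active block.
Result: On GSM8K, MATH, and MetaMathQA, Blockwise SFT outperforms classical SFT under equal compute or token budgets.
Insight: Matching supervision granularity to the decoding procedure matters for diffusion language models, and Blockwise SFT is an effective way to do so.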
Abstract: Discrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements stem from faithful training-inference alignment rather than incidental masking effects. Our results highlight the importance of matching supervision granularity to the decoding procedure in diffusion-based language models.
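The masking scheme described above can be sketched as a function that assigns each response position a state. This is a minimal illustration, not the authors' implementation. The state names, the per-token masking probability `mask_prob`, and the choice to compute loss only on masked positions inside the active block are assumptions; the abstract says only that loss is computed over the active block.

```python
import random


def blockwise_states(response_len, block_size, active_block, mask_prob, rng):
    """Assign a training-time state to every response position.

    'frozen'  : tokens before the active block, kept clean as context
    'masked'  : active-block tokens replaced by the mask token
    'visible' : active-block tokens left unmasked in this step
    'hidden'  : tokens after the active block, fully hidden
    """
    start = active_block * block_size
    if not 0 <= start < response_len:
        raise ValueError("active_block is out of range")
    end = min(start + block_size, response_len)

    states = []
    for pos in range(response_len):
        if pos < start:
            states.append("frozen")
        elif pos >= end:
            states.append("hidden")
        elif rng.random() < mask_prob:
            states.append("masked")
        else:
            states.append("visible")

    # Assumption: the loss is taken on masked positions of the active block only.
    loss_positions = [i for i, s in enumerate(states) if s == "masked"]
    return states, loss_positions


if __name__ == "__main__":
    rng = random.Random(0)
    states, loss_positions = blockwise_states(
        response_len=10, block_size=4, active_block=1, mask_prob=0.5, rng=rng
    )
    print(states)
    print(loss_positions)
```

By contrast, the classical SFT setup criticized in the abstract would draw masks across all positions of the response, which is where the noisy prefixes and leaky suffixes come from.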
[10] Language Models Identify Ambiguities and Exploit Loopholes
Jio Choi,Mohit Bansal,Elias Stengel-Eskin
Main category: cs.CL
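TL;DR: Studies how large language models (LLMs) identify and exploit loopholes, revealing potential safety risks and alignment problems.
Details
Motivation: Studying LLM responses to loopholes gives a lens on ambiguity and pragmatic reasoning in models, and exposes a potential AI alignment problem.
Contribution: Designs a range of scenarios showing that LLMs (both closed-source models and stronger open-source models) can identify ambiguities and exploit the resulting loopholes, which presents a potential safety risk.
Method: Designs scenarios with a goal and an ambiguous user instruction that conflicts with it, covering scalar implicature, structural ambiguities, and power dynamics, and measures the models' ability to exploit loopholes.
Result: Stronger models explicitly identify ambiguities and exploit loopholes, which may pose an AI safety problem.
Insight: Models that exploit loopholes explicitly identify and reason about both the ambiguity and the conflicting goals, which points to new directions for alignment and safety research.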
Abstract: Studying the responses of large language models (LLMs) to loopholes presents a two-fold opportunity. First, it affords us a lens through which to examine ambiguity and pragmatics in LLMs, since exploiting a loophole requires identifying ambiguity and performing sophisticated pragmatic reasoning. Second, loopholes pose an interesting and novel alignment problem where the model is presented with conflicting goals and can exploit ambiguities to its own advantage. To address these questions, we design scenarios where LLMs are given a goal and an ambiguous user instruction in conflict with the goal, with scenarios covering scalar implicature, structural ambiguities, and power dynamics. We then measure different models’ abilities to exploit loopholes to satisfy their given goals as opposed to the goals of the user. We find that both closed-source and stronger open-source models can identify ambiguities and exploit their resulting loopholes, presenting a potential AI safety risk. Our analysis indicates that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.
[11] Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs
Jun Bai,Minghao Tong,Yang Liu,Zixia Jia,Zilong Zheng
Main category: cs.CL
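TL;DR: This paper studies expert specialization for context faithfulness in Mixture-of-Experts (MoE) architectures. It proposes Router Lens to identify context-faithful experts and CEFT, a lightweight method that selectively fine-tunes them, matching or surpassing full fine-tuning.
Details
Motivation: Large language models often fail to ground their outputs in the provided context, which hurts reliability in context-dependent scenarios. The paper asks whether some experts in MoE architectures specialize in context utilization, so that they can be optimized in a targeted way.
Contribution: (1) Router Lens, a method that accurately identifies context-faithful experts; (2) an analysis showing that these experts progressively amplify attention to relevant contextual information; (3) CEFT, a lightweight optimization method that fine-tunes only the context-faithful experts.
Method: Uses Router Lens to analyze how well each expert in an MoE model uses context and finds experts that specialize in context faithfulness. CEFT then selectively fine-tunes those experts instead of the full model.
Result: Across a wide range of benchmarks and models, CEFT matches or surpasses full fine-tuning while being significantly more efficient.
Insight: MoE models contain experts that specialize in context faithfulness. Optimizing them in a targeted way improves performance at lower compute cost.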
Abstract: Context faithfulness is essential for reliable reasoning in context-dependent scenarios. However, large language models often struggle to ground their outputs in the provided context, resulting in irrelevant responses. Inspired by the emergent expert specialization observed in mixture-of-experts architectures, this work investigates whether certain experts exhibit specialization in context utilization, offering a potential pathway toward targeted optimization for improved context faithfulness. To explore this, we propose Router Lens, a method that accurately identifies context-faithful experts. Our analysis reveals that these experts progressively amplify attention to relevant contextual information, thereby enhancing context grounding. Building on this insight, we introduce Context-faithful Expert Fine-Tuning (CEFT), a lightweight optimization approach that selectively fine-tunes context-faithful experts. Experiments across a wide range of benchmarks and models demonstrate that CEFT matches or surpasses the performance of full fine-tuning while being significantly more efficient.
[12] LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation
Yang Sun,Lixin Zou,Dan Luo,Zhiyong Xie,Long Zhang,Liming Dong,Yunwei Zhao,Xixun Lin,Yanxiong Lu,Chenliang Li
Main category: cs.CL
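TL;DR: This paper proposes Layer Fused Decoding (LFD), which combines intermediate-layer and final-layer representations to make better use of external knowledge in retrieval-augmented generation (RAG), and selects the intermediate layer with an internal knowledge score (IKS). Experiments show that LFD improves RAG performance.
Details
Motivation: Recent evidence shows that injecting noise into retrieved documents can facilitate the use of external knowledge and improve generation quality. This phenomenon enables fine-grained control and analysis of how large language models (LLMs) integrate external knowledge.
Contribution: (1) LFD, a decoding strategy that improves the use of external knowledge; (2) the IKS criterion for selecting the intermediate layer; (3) experiments showing that LFD improves RAG performance across multiple benchmarks.
Method: (1) Establishes a functional division of layers in the LLM: shallow layers model local context, intermediate layers integrate long-range external knowledge, and deeper layers rely on internal parametric knowledge; (2) LFD fuses an intermediate layer's representation with the final-layer decoding output; (3) the layer with the lowest IKS value in the latter half of the layers is selected.
Result: LFD helps RAG systems surface retrieved context knowledge more effectively at minimal cost.
Insight: Different layers of an LLM play different roles in knowledge integration, and using intermediate layers well combines external and internal knowledge more efficiently.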
Abstract: Retrieval-augmented generation (RAG) incorporates external knowledge into large language models (LLMs), improving their adaptability to downstream tasks and enabling information updates. Surprisingly, recent empirical evidence demonstrates that injecting noise into retrieved relevant documents paradoxically facilitates exploitation of external knowledge and improves generation quality. Although counterintuitive and challenging to apply in practice, this phenomenon enables granular control and rigorous analysis of how LLMs integrate external knowledge. Therefore, in this paper, we intervene on noise injection and establish a layer-specific functional demarcation within the LLM: shallow layers specialize in local context modeling, intermediate layers focus on integrating long-range external factual knowledge, and deeper layers primarily rely on parametric internal knowledge. Building on this insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that directly combines representations from an intermediate layer with final-layer decoding outputs to fully exploit the external factual knowledge. To identify the optimal intermediate layer, we introduce an internal knowledge score (IKS) criterion that selects the layer with the lowest IKS value in the latter half of layers. Experimental results across multiple benchmarks demonstrate that LFD helps RAG systems more effectively surface retrieved context knowledge with minimal cost.
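The layer-selection rule stated above (pick the layer with the lowest IKS value in the latter half of the layers) can be written as a short sketch. How IKS is computed is not given in the abstract, so the scores are taken here as an already-computed list. The function name `select_fusion_layer`, the zero-based indexing, and the reading of "latter half" as indices from `num_layers // 2` onward are assumptions.

```python
def select_fusion_layer(iks_per_layer):
    """Return the index of the lowest-IKS layer within the latter half of the layers.

    iks_per_layer: one precomputed internal knowledge score per layer,
    ordered from the shallowest layer (index 0) to the deepest.
    """
    num_layers = len(iks_per_layer)
    if num_layers < 2:
        raise ValueError("need at least two layers")
    start = num_layers // 2  # assumed boundary of the 'latter half'
    return min(range(start, num_layers), key=lambda i: iks_per_layer[i])


if __name__ == "__main__":
    # Hypothetical scores for a 6-layer model; layer 0 has the lowest score
    # overall but is ignored because it lies in the first half.
    scores = [0.1, 0.2, 0.9, 0.4, 0.7, 0.8]
    print(select_fusion_layer(scores))
```

The selected layer's representation would then be combined with the final-layer decoding output; the abstract does not specify the fusion rule, so it is left out of this sketch.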
[13] A Symbolic Adversarial Learning Framework for Evolving Fake News Generation and Detection
Chong Tian,Qirong Ho,Xiuying Chen
Main category: cs.CL
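TL;DR: This paper proposes the Symbolic Adversarial Learning Framework (SALF), which uses an adversarial training paradigm to co-evolve fake news generation and detection without numerical updates. Experiments show that it generates sophisticated fake news and improves detection.
Details
Motivation: As large language models (LLMs) advance, fake news generation grows more sophisticated, and existing detection methods struggle with dynamically evolving misinformation. The authors propose a new adversarial learning framework to make detection more robust and adaptable.
Contribution: (1) SALF, which simulates back-propagation and gradient descent through symbolic representations and natural-language operations; (2) an adversarial interaction between a generation agent and a detection agent that iteratively refines both; (3) validation of SALF's generation ability and detection improvements on two multilingual benchmark datasets.
Method: In SALF, the generation agent crafts deceptive narratives and the detection agent uses structured debates to identify logical and factual flaws. Learnable weights are defined by agent prompts, and both agents are optimized by operating on natural-language representations of weights, loss, and gradients.
Result: Fake news generated by SALF degrades state-of-the-art detection performance by up to 53.4% in Chinese and 34.2% in English on average. SALF also refines detectors, improving detection of refined content by up to 7.7%.
Insight: Symbolic adversarial learning offers a new approach to fake news detection in dynamic adversarial settings. Operating in natural language avoids the constraints of neural weight updates and is more flexible and adaptable.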
Abstract: Rapid LLM advancements heighten fake news risks by enabling the automatic generation of increasingly sophisticated misinformation. Previous detection methods, including fine-tuned small models or LLM-based detectors, often struggle with its dynamically evolving nature. In this work, we propose a novel framework called the Symbolic Adversarial Learning Framework (SALF), which implements an adversarial training paradigm by an agent symbolic learning optimization process, rather than relying on numerical updates. SALF introduces a paradigm where the generation agent crafts deceptive narratives, and the detection agent uses structured debates to identify logical and factual flaws for detection, and they iteratively refine themselves through such adversarial interactions. Unlike traditional neural updates, we represent agents using agent symbolic learning, where learnable weights are defined by agent prompts, and simulate back-propagation and gradient descent by operating on natural language representations of weights, loss, and gradients. Experiments on two multilingual benchmark datasets demonstrate SALF’s effectiveness, showing it generates sophisticated fake news that degrades state-of-the-art detection performance by up to 53.4% in Chinese and 34.2% in English on average. SALF also refines detectors, improving detection of refined content by up to 7.7%. We hope our work inspires further exploration into more robust, adaptable fake news detection systems.
[14] Automatic integration of SystemC in the FMI standard for Software-defined Vehicle design
Giovanni Pollo,Andrei Mihai Albu,Alessio Burrello,Daniele Jahier Pagliari,Cristian Tesconi,Loris Panaro,Dario Soldi,Fabio Autieri,Sara Vinco
Main category: cs.CL
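TL;DR: The paper proposes a method for automatically integrating SystemC models into the FMI standard, addressing the lack of standardized interfaces for co-simulation in the automotive industry.
Details
Motivation: The automotive industry increasingly needs co-simulation, but the lack of standardized interfaces and the dominance of proprietary simulation platforms hinder collaboration, scalability, and IP protection.
Contribution: A method that automatically wraps SystemC models in FMI-compliant interfaces, combining SystemC's modeling accuracy and fast time-to-market with FMI's interoperability and encapsulation.
Method: Automatically wraps SystemC models to produce FMI-compliant interfaces that can be integrated into co-simulation workflows.
Result: The method is validated on real-world case studies and handles complex designs.
Insight: Combining the strengths of SystemC and FMI enables secure and portable integration of embedded components into co-simulation.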
Abstract: The recent advancements of the automotive sector demand robust co-simulation methodologies that enable early validation and seamless integration across hardware and software domains. However, the lack of standardized interfaces and the dominance of proprietary simulation platforms pose significant challenges to collaboration, scalability, and IP protection. To address these limitations, this paper presents an approach for automatically wrapping SystemC models by using the Functional Mock-up Interface (FMI) standard. This method combines the modeling accuracy and fast time-to-market of SystemC with the interoperability and encapsulation benefits of FMI, enabling secure and portable integration of embedded components into co-simulation workflows. We validate the proposed methodology on real-world case studies, demonstrating its effectiveness with complex designs.
[15] Survey of Specialized Large Language Model
Chenghan Yang,Ruiyu Zhao,Yang Liu,Ling Jiang
Main category: cs.CL
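TL;DR: This survey traces the development of specialized large language models (LLMs) from simple domain adaptation to sophisticated native architectures, focusing on healthcare, finance, legal, and technical domains, and summarizes innovations such as domain-native design, parameter efficiency, and multimodal capabilities.
Details
Motivation: Specialized LLMs are evolving rapidly, and the shift from general-purpose to domain-specific design has become an important trend in AI. The survey maps this evolution and the technical advances behind it.
Contribution: Reviews the evolution of specialized LLMs, summarizes key advances in domain-native design, parameter efficiency, and multimodal integration, and analyzes their performance in healthcare, finance, and other domains.
Method: A systematic literature survey of the technical development of specialized LLMs, covering domain-native architectures, sparse computation, quantization, and multimodal capabilities.
Result: Specialized LLMs show consistent performance gains over general-purpose models on domain-specific benchmarks.
Insight: Domain-native design and multimodal capabilities are key directions for specialized LLMs. The survey also discusses implications for the e-commerce field, where gaps remain.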
Abstract: The rapid evolution of specialized large language models (LLMs) has transitioned from simple domain adaptation to sophisticated native architectures, marking a paradigm shift in AI development. This survey systematically examines this progression across healthcare, finance, legal, and technical domains. Besides the wide use of specialized LLMs, technical breakthroughs such as the emergence of domain-native designs beyond fine-tuning, a growing emphasis on parameter efficiency through sparse computation and quantization, and the increasing integration of multimodal capabilities are applied to recent LLM agents. Our analysis reveals how these innovations address fundamental limitations of general-purpose LLMs in professional applications, with specialized models consistently showing performance gains on domain-specific benchmarks. The survey further highlights the implications for the E-Commerce field and the gaps to be filled there.
[16] Building Task Bots with Self-learning for Enhanced Adaptability, Extensibility, and Factuality
Xiaoying Zhang
Main category: cs.CL
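TL;DR: This thesis examines how to build adaptable, extensible, and accurate task bots with minimal human intervention, focusing on techniques that let bots learn and adapt autonomously in changing environments.
Details
Motivation: Building task bots for dialog systems faces challenges in adaptability, extensibility, and factual accuracy, and calls for less human intervention and more autonomy.
Contribution: Self-learning techniques that let task bots adapt and extend themselves autonomously in dynamic environments, with the aim of improving accuracy and usefulness.
Method: Self-learning techniques combined with continual learning and adaptation in constantly changing environments.
Result: The thesis targets task bots with better adaptability, extensibility, and factual accuracy and less need for human intervention. The abstract reports no quantitative results.
Insight: Self-learning is key to improving task bots. Future work can improve the efficiency and scope of autonomous learning.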
Abstract: Developing adaptable, extensible, and accurate task bots with minimal or zero human intervention is a significant challenge in dialog research. This thesis examines the obstacles and potential solutions for creating such bots, focusing on innovative techniques that enable bots to learn and adapt autonomously in constantly changing environments.
[17] NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
Aritra Dutta,Swapnanil Mukherjee,Deepanway Ghosal,Somak Aditya
Main category: cs.CL
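TL;DR: This paper presents NLKI, a lightweight natural language knowledge integration framework that retrieves natural language facts and prompts a large language model (LLM) to generate explanations, improving small vision-language models (sVLMs) on commonsense visual question answering.
Details
Motivation: Commonsense visual question answering often depends on knowledge that is missing from the image or the question, and sVLMs lag behind larger models. The paper studies how integrating external natural language knowledge improves sVLMs.
Contribution: (1) The NLKI framework, which combines knowledge retrieval with LLM-generated explanations; (2) accuracy gains of up to 7% across three datasets; (3) a study of how noise-robust losses affect model stability.
Method: (1) Retrieves natural language facts with a fine-tuned ColBERTv2; (2) prompts an LLM to generate explanations; (3) feeds the retrieved facts and generated explanations to sVLMs; (4) fine-tunes with noise-robust losses (symmetric cross entropy and generalised cross entropy).
Result: NLKI lifts sVLM accuracy by up to 7%, so that FLAVA and other models match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. Noise-robust losses add another 2.5% on CRIC and 5.5% on AOKVQA.
Insight: (1) The findings show when LLM-generated commonsense knowledge beats retrieval from commonsense knowledge bases; (2) noise-aware training stabilizes small models that use external knowledge; (3) parameter-efficient commonsense reasoning is within reach for 250M-parameter models.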
Abstract: Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.
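The two noise-robust losses named above have standard definitions in the literature, sketched here in plain Python for a single example. Symmetric cross entropy adds a reverse cross-entropy term in which log(0) is clipped to a constant, and generalised cross entropy is (1 - p_y^q) / q. The hyperparameter values (`alpha`, `beta`, `log_zero`, `q`) are illustrative defaults, not the settings used in the paper.

```python
import math


def cross_entropy(probs, label):
    """Standard cross entropy for one example: -log p_y."""
    return -math.log(max(probs[label], 1e-12))


def reverse_cross_entropy(probs, label, log_zero=-4.0):
    """Reverse cross entropy against a one-hot target.

    The target has probability 1 on the label (log 1 = 0) and 0 elsewhere,
    where log 0 is clipped to the constant log_zero.
    """
    return -sum(p * log_zero for k, p in enumerate(probs) if k != label)


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """alpha * CE + beta * reverse CE."""
    return alpha * cross_entropy(probs, label) + beta * reverse_cross_entropy(
        probs, label, log_zero
    )


def generalized_cross_entropy(probs, label, q=0.7):
    """(1 - p_y^q) / q; approaches CE as q -> 0 and equals 1 - p_y at q = 1."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return (1.0 - probs[label] ** q) / q


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]  # hypothetical softmax output
    print(symmetric_cross_entropy(probs, label=0))
    print(generalized_cross_entropy(probs, label=0))
```

Both losses bound the penalty on a confidently wrong prediction more tightly than plain cross entropy, which is why they are used when a share of the labels is expected to be wrong.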
[18] Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
Wenhao Li,Yuxin Zhang,Gen Luo,Haiyuan Wan,Ziyang Gong,Fei Chao,Rongrong Ji
Main category: cs.CL
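TL;DR: This paper proposes Spotlight Attention, an efficient LLM generation method that uses non-linear hashing to improve KV cache retrieval and substantially raise inference efficiency.
Details
Motivation: The key-value (KV) cache in large language models (LLMs) consumes substantial resources. Existing methods select important tokens with random linear hashing, which is inefficient because it is mismatched with the distribution of queries and keys. Contribution: Proposes non-linear hashing that optimizes the embedding distribution of queries and keys, designs a lightweight training framework based on a Bradley-Terry ranking loss, and implements efficient retrieval with CUDA kernels.
Method: Non-linear hashing functions drive KV cache retrieval, combined with a lightweight training framework and specialized CUDA kernels for efficient computation.
Result: Compared with traditional linear hashing, Spotlight Attention markedly improves retrieval precision while shortening the hash code by at least 5x. End-to-end throughput is up to 3x higher than vanilla decoding.
Insight: Non-linear hashing captures the distribution of queries and keys better than linear hashing, and combined with GPU optimization it can substantially improve LLM inference efficiency.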
Abstract: Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the length of the hash code at least 5$\times$ compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hashing retrieval for 512K tokens in under 100$\mu$s on a single A100 GPU, with end-to-end throughput up to 3$\times$ higher than vanilla decoding.
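To make the baseline concrete, the sketch below shows random linear hashing with Hamming-distance retrieval, the scheme the abstract says existing methods use. It is an illustrative pure-Python stand-in: the function names are hypothetical, and Spotlight Attention's learned non-linear hash and CUDA kernels are not reproduced here.

```python
import random

def make_linear_hasher(dim, n_bits, seed=0):
    """Build a sign-of-random-projection hasher that packs n_bits into one int."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_vec(v):
        code = 0
        for plane in planes:
            bit = 1 if sum(p * x for p, x in zip(plane, v)) >= 0.0 else 0
            code = (code << 1) | bit
        return code

    return hash_vec

def hamming(a, b):
    """Number of differing bits, computed with XOR plus a popcount."""
    return bin(a ^ b).count("1")

def retrieve_top_k(query_code, key_codes, k):
    """Indices of the k cached keys whose codes are closest to the query code."""
    order = sorted(range(len(key_codes)), key=lambda i: hamming(query_code, key_codes[i]))
    return order[:k]
```

Because codes are compared with XOR and a bit count, retrieval cost depends on code length, which is why a shorter hash code at equal precision translates directly into faster selection.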
[19] Uncovering the Bigger Picture: Comprehensive Event Understanding Via Diverse News Retrieval
Yixuan Tang,Yuanyuan Shi,Yiqun Sun,Anthony Kum Hoe Tung
Main category: cs.CL
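TL;DR: This paper proposes NEWSCOPE, a two-stage framework for diverse news retrieval that improves event coverage by explicitly modeling semantic variation at the sentence level. Experiments show that NEWSCOPE achieves significantly higher retrieval diversity without compromising relevance.
Details
Motivation: Existing news retrieval systems typically prioritize textual relevance alone, which yields redundant results and limited viewpoint exposure and makes it hard to understand an event comprehensively. The authors aim to give a fuller picture through diverse news retrieval. Contribution: 1. Proposes the NEWSCOPE framework, which improves diversity through sentence-level clustering and diversity-aware re-ranking. 2. Introduces three interpretable diversity metrics and constructs two benchmark datasets. 3. Shows that the framework raises diversity while preserving relevance.
Method: A two-stage framework: 1. Dense retrieval fetches topically relevant content. 2. Sentence-level clustering and diversity-aware re-ranking surface complementary information.
Result: NEWSCOPE consistently outperforms strong baselines, with significantly higher diversity and no loss in relevance.
Insight: Fine-grained semantic modeling and interpretable diversity metrics help mitigate redundancy and promote comprehensive event understanding.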
Abstract: Access to diverse perspectives is essential for understanding real-world events, yet most news retrieval systems prioritize textual relevance, leading to redundant results and limited viewpoint exposure. We propose NEWSCOPE, a two-stage framework for diverse news retrieval that enhances event coverage by explicitly modeling semantic variation at the sentence level. The first stage retrieves topically relevant content using dense retrieval, while the second stage applies sentence-level clustering and diversity-aware re-ranking to surface complementary information. To evaluate retrieval diversity, we introduce three interpretable metrics, namely Average Pairwise Distance, Positive Cluster Coverage, and Information Density Ratio, and construct two paragraph-level benchmarks: LocalNews and DSGlobal. Experiments show that NEWSCOPE consistently outperforms strong baselines, achieving significantly higher diversity without compromising relevance. Our results demonstrate the effectiveness of fine-grained, interpretable modeling in mitigating redundancy and promoting comprehensive event understanding. The data and code are available at https://github.com/tangyixuan/NEWSCOPE.
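The abstract names Average Pairwise Distance and diversity-aware re-ranking without giving formulas, so the sketch below rests on two assumptions: the metric is taken to be the mean cosine distance over all pairs of result embeddings, and the re-ranker is a generic Maximal Marginal Relevance (MMR) loop swapped in for illustration. Neither is confirmed to match the NEWSCOPE implementation.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def average_pairwise_distance(embeddings):
    """Assumed metric: mean cosine distance over all unordered pairs."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

def mmr_rerank(query, candidates, k, lam=0.5):
    """Greedy MMR: trade off relevance to the query against redundancy with picks so far."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = 1.0 - cosine_distance(query, candidates[i])
            redundancy = max(
                (1.0 - cosine_distance(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

A lower `lam` weights redundancy more heavily, so near-duplicate passages are skipped in favor of slightly less relevant but complementary ones.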
[20] Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance
Pedro Henrique Luz de Araujo,Paul Röttger,Dirk Hovy,Benjamin Roth
Main category: cs.CL
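TL;DR: This paper studies how expert persona prompting affects language-model task performance. It distills three desiderata (performance advantage, robustness, fidelity) and evaluates 9 state-of-the-art LLMs on 27 tasks. Expert personas usually lead to positive or non-significant performance changes, but models are highly sensitive to irrelevant persona details, and fidelity effects are inconsistent.
Details
Motivation: Prior work reports mixed results on persona prompting and does not consider when and why personas should improve performance. The paper aims to clarify the effect of persona prompting on task performance through a systematic analysis. Contribution: Proposes three desiderata for evaluating persona prompting (performance advantage, robustness, fidelity), validates them in a large-scale experiment, and reveals the negative effect of irrelevant persona details on performance.
Method: The three desiderata are distilled from a review of the literature, and the effect and sensitivity of persona prompting are then quantified on 9 LLMs across 27 tasks.
Result: Expert personas usually lead to positive or non-significant performance changes. Models are highly sensitive to irrelevant persona details, with performance drops of almost 30 percentage points. The effects of fidelity factors (education level, specialization, domain-relatedness) are often inconsistent or negligible across tasks.
Insight: Persona design needs more care, and evaluation schemes should reflect the intended effects of persona usage. The proposed mitigation strategies work only for the largest, most capable models.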
Abstract: Expert persona prompting – assigning roles such as expert in math to language models – is widely used for task improvement. However, prior work shows mixed results on its effectiveness, and does not consider when and why personas should improve performance. We analyze the literature on persona prompting for task improvement and distill three desiderata: 1) performance advantage of expert personas, 2) robustness to irrelevant persona attributes, and 3) fidelity to persona attributes. We then evaluate 9 state-of-the-art LLMs across 27 tasks with respect to these desiderata. We find that expert personas usually lead to positive or non-significant performance changes. Surprisingly, models are highly sensitive to irrelevant persona details, with performance drops of almost 30 percentage points. In terms of fidelity, we find that while higher education, specialization, and domain-relatedness can boost performance, their effects are often inconsistent or negligible across tasks. We propose mitigation strategies to improve robustness – but find they only work for the largest, most capable models. Our findings underscore the need for more careful persona design and for evaluation schemes that reflect the intended effects of persona usage.
[21] T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables
Jie Zhang,Changzai Pan,Kaiwen Wei,Sishi Xiong,Yu Zhao,Xiangyu Li,Jiaxin Peng,Xiaoyan Gu,Jian Yang,Wenhan Chang,Zhenhe Wu,Jiang Zhong,Shuangyong Song,Yongxiang Li,Xuelong Li
Main category: cs.CL
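TL;DR: This paper proposes T2R-bench, a bilingual benchmark for evaluating how well models turn industrial tables into article-level reports, and it exposes the limitations of current large language models on this task.
Details
Motivation: Table reasoning research has focused on understanding structured data, but transforming table information into reports for real industrial scenarios remains an open challenge. Existing benchmarks do not adequately assess this practical capability. Contribution: 1) Proposes the table-to-report (T2R) task; 2) constructs the bilingual benchmark T2R-bench, comprising 457 tables from real industrial scenarios; 3) proposes evaluation criteria for the quality of generated reports.
Method: 457 tables were collected from 19 industry domains, covering 4 types of industrial tables, and 25 widely used large language models were evaluated experimentally.
Result: Even state-of-the-art models such as Deepseek-R1 reach an overall score of only 62.71, indicating that LLMs still have room for improvement on this task.
Insight: Practical table-to-report generation needs further research; the shortfall of current models may stem from the complexity and diversity of the tables.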
Abstract: Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. The experiments on 25 widely-used LLMs reveal that even state-of-the-art models like Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.
[22] Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
Sikuan Yan,Xiufeng Yang,Zuchao Huang,Ercong Nie,Zifeng Ding,Zonggen Li,Xiaowen Ma,Hinrich Schütze,Volker Tresp,Yunpu Ma
Main category: cs.CL
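TL;DR: Memory-R1 is a reinforcement learning framework in which two specialized agents (a Memory Manager and an Answer Agent) let an LLM actively manage external memory. It substantially improves long-horizon reasoning, needs little training data, and generalizes well.
Details
Motivation: Large language models (LLMs) have limited context windows, which hinders long-horizon reasoning. Existing approaches usually manage external memory with static heuristics and lack any learned, adaptive mechanism. Contribution: Proposes the Memory-R1 framework, which manages memory dynamically through reinforcement learning (RL) with two agents, a Memory Manager and an Answer Agent, and substantially improves the LLM's memory management and reasoning.
Method: Both agents are trained with PPO and GRPO. The Memory Manager performs structured memory operations (ADD, UPDATE, DELETE, NOOP), and the Answer Agent selects the most relevant entries and reasons over them. Training uses as few as 152 question-answer pairs.
Result: With little training data, Memory-R1 outperforms the most competitive existing baseline and generalizes well across diverse question types and LLM backbones.
Insight: RL can turn LLMs into more agentic, memory-aware systems, pointing toward more persistent reasoning.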
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations {ADD, UPDATE, DELETE, NOOP}, and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and use with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the most competitive existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behaviors in LLMs, pointing toward richer, more persistent reasoning systems.
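The four memory operations listed in the abstract can be pictured as edits to a small key-value store. The sketch below is a hypothetical data structure for illustration only: the class and method names are assumptions, and the RL-trained policy that decides which operation to emit is not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum

class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"

@dataclass
class MemoryBank:
    """Hypothetical external memory: integer ids mapped to text entries."""
    entries: dict = field(default_factory=dict)
    next_id: int = 0

    def apply(self, op, text=None, entry_id=None):
        if op is Op.ADD:
            self.entries[self.next_id] = text
            self.next_id += 1
        elif op is Op.UPDATE:
            if entry_id not in self.entries:
                raise KeyError(entry_id)
            self.entries[entry_id] = text
        elif op is Op.DELETE:
            self.entries.pop(entry_id, None)
        # Op.NOOP leaves the bank unchanged.
        return self
```

In a learned setup the manager would emit one `(op, text, entry_id)` triple per incoming dialogue turn, and the answer agent would later read from `entries`.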
[23] Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM-Assisted Translation Refinement
Mohammed Rakibul Hasan,Rafi Majid,Ahanaf Tahmid
Main category: cs.CL
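TL;DR: This paper introduces Bangla-Bayanno, a Bangla visual question answering dataset of 52,650 question-answer pairs for open-ended tasks, whose quality is improved by an LLM-assisted translation refinement pipeline.
Details
Motivation: Most existing datasets are either manually annotated with a narrow domain focus or constrained by niche answer formats. Bangla is a low-resource language that lacks high-quality multimodal datasets, which holds back research. Contribution: Presents Bangla-Bayanno, a high-quality, open-ended Bangla VQA dataset that uses LLM-assisted translation refinement to overcome low-quality translations.
Method: A multilingual LLM-assisted translation refinement pipeline translates the source question-answer content into Bangla while preserving clarity and accuracy. Questions fall into three answer types: nominal, quantitative, and polar.
Result: The dataset contains 52,650 question-answer pairs across more than 4,750 images, making it the most comprehensive open-source VQA benchmark in Bangla.
Insight: LLM-assisted translation refinement can address annotation quality problems in low-resource languages and supports research on multimodal learning and more inclusive AI systems.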
Abstract: In this paper, we introduce Bangla-Bayanno, an open-ended Visual Question Answering (VQA) Dataset in Bangla, a widely used, low-resource language in multimodal AI research. The majority of existing datasets are either manually annotated with an emphasis on a specific domain, query type, or answer type or are constrained by niche answer formats. In order to mitigate human-induced errors and guarantee lucidity, we implemented a multilingual LLM-assisted translation refinement pipeline. This dataset overcomes the issues of low-quality translations from multilingual sources. The dataset comprises 52,650 question-answer pairs across 4750+ images. Questions are classified into three distinct answer types: nominal (short descriptive), quantitative (numeric), and polar (yes/no). Bangla-Bayanno provides the most comprehensive open-source, high-quality VQA benchmark in Bangla, aiming to advance research in low-resource multimodal learning and facilitate the development of more inclusive AI systems.
[24] Logical Reasoning with Outcome Reward Models for Test-Time Scaling
Ramya Keerthy Thatikonda,Wray Buntine,Ehsan Shareghi
Main category: cs.CL
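TL;DR: Targeting the logical reasoning performance of large language models (LLMs), this paper proposes an approach based on outcome reward models (ORMs) and improves them through training-data augmentation with Chain-of-Thought (CoT) sampling and an echo generation technique.
Details
Motivation: Logical reasoning is a key benchmark of LLM capability, but the combination of test-time scaling with reward models is under-explored for deductive logical reasoning and has not been studied systematically there. Contribution: 1. Proposes a set of ORMs for deductive reasoning; 2. introduces echo generation to widen the error types covered by the training data; 3. validates the approach on multiple datasets and LLMs.
Method: 1. Train the ORMs on data generated with CoT using single and multiple samples; 2. apply echo generation, which exploits LLMs' tendency to reflect incorrect assumptions made in prompts, to produce additional training data; 3. evaluate on the FOLIO, JustLogic, and ProverQA datasets.
Result: ORMs trained on CoT and echo-augmented data improve performance across four different LLMs.
Insight: Deliberately steering the model toward incorrect reasoning (the echo technique) covers more error types and thereby improves the reward model's generalization.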
Abstract: Logical reasoning is a critical benchmark for evaluating the capabilities of large language models (LLMs), as it reflects their ability to derive valid conclusions from given premises. While the combination of test-time scaling with dedicated outcome or process reward models has opened up new avenues to enhance LLMs performance in complex reasoning tasks, this space is under-explored in deductive logical reasoning. We present a set of Outcome Reward Models (ORMs) for deductive reasoning. To train the ORMs we mainly generate data using Chain-of-Thought (CoT) with single and multiple samples. Additionally, we propose a novel tactic to further expand the type of errors covered in the training dataset of the ORM. In particular, we propose an echo generation technique that leverages LLMs’ tendency to reflect incorrect assumptions made in prompts to extract additional training data, covering previously unexplored error types. While a standard CoT chain may contain errors likely to be made by the reasoner, the echo strategy deliberately steers the model toward incorrect reasoning. We show that ORMs trained on CoT and echo-augmented data demonstrate improved performance on the FOLIO, JustLogic, and ProverQA datasets across four different LLMs.
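One common way to combine an outcome reward model with test-time scaling is to sample several reasoning chains and let the ORM pick or weight the final answers. The sketch below illustrates best-of-N selection and score-weighted voting under that assumption; the abstract does not specify the selection rule, and `toy_orm` is a made-up stand-in for a trained ORM.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate (chain, answer) with the highest ORM score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)

def weighted_vote(candidates, reward_fn):
    """Sum ORM scores per final answer and return the answer with the largest total."""
    totals = {}
    for chain, answer in candidates:
        totals[answer] = totals.get(answer, 0.0) + reward_fn((chain, answer))
    return max(totals, key=totals.get)

def toy_orm(candidate):
    """Hypothetical scorer: penalize chains that introduce an unsupported assumption."""
    chain, _answer = candidate
    return 0.2 if "assume" in chain else 0.6
```

Increasing the number of sampled chains raises inference cost but gives the reward model more candidates to choose from, which is the scaling axis the title refers to.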
[25] AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios
Lisa Alazraki,Lihu Chen,Ana Brassard,Joe Stacey,Hossein A. Rahmani,Marek Rei
Main category: cs.CL
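TL;DR: This paper proposes AgentCoMa, a benchmark that combines commonsense and mathematical reasoning. Current large language models (LLMs) handle each reasoning type well in isolation, but their accuracy drops sharply on the combined tasks, revealing brittleness in mixed-type reasoning.
Details
Motivation: Existing compositional benchmarks focus on a single reasoning type (commonsense or math), whereas real-world tasks require both. The paper fills this gap by evaluating LLMs on mixed-type reasoning. Contribution: 1. Proposes the AgentCoMa benchmark, which combines commonsense and math reasoning; 2. tests 61 LLMs and finds a marked performance drop on compositional tasks; 3. analyzes the performance gap through interpretability studies.
Method: AgentCoMa is built from compositional tasks that each require a commonsense step and a math step. A range of LLMs is tested, and model behavior is analyzed through neuron patterns, attention maps, and membership inference.
Result: LLM accuracy drops by about 30% on average when the two steps are combined, a much larger gap than in benchmarks that compose steps of the same reasoning type. Non-expert human annotators solve the compositional questions and the individual steps with similarly high accuracy.
Insight: LLMs are brittle in mixed-type compositional reasoning and need further improvement; AgentCoMa offers a test bed for future work.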
Abstract: Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real-world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by ~30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed-type compositional reasoning and offers a test bed for future improvement.
[26] MathBuddy: A Multimodal System for Affective Math Tutoring
Debanjana Kar,Leopold Böss,Dacia Braca,Sebastian Maximilian Dennerlein,Nina Christine Hubig,Philipp Wintersberger,Yufang Hou
Main category: cs.CL
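TL;DR: MathBuddy is a multimodal math tutoring system that uses an LLM and emotion modeling to adapt its teaching strategy dynamically, improving both the student's affective experience and pedagogical effectiveness.
Details
Motivation: Current LLM-based educational technology does not take the student's affective state into account, although research shows that emotional states significantly affect learning. MathBuddy aims to fill this gap with an emotionally aware math tutor. Contribution: 1. Proposes a multimodal emotion modeling framework that combines conversational text with facial expression recognition; 2. maps emotional signals to pedagogical strategies to build an emotion-aware LLM tutor; 3. shows experimentally that modeling emotions improves tutoring quality.
Method: 1. Capture the student's emotions from two modalities, conversational text and facial expressions; 2. aggregate the multimodal emotion signals to adjust the LLM's responses dynamically; 3. evaluate with automatic metrics across eight pedagogical dimensions and with user studies.
Result: Modeling student emotions improves the system's pedagogical ability, with a 23-point gain in win rate and a 3-point gain in overall DAMR score.
Insight: Emotion modeling can markedly increase the usefulness of LLMs in education and user satisfaction, and the system serves as an example for future emotionally aware educational tools.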
Abstract: The rapid adoption of LLM-based conversational systems is already transforming the landscape of educational technology. However, the current state-of-the-art learning models do not take into account the student’s affective states. Multiple studies in educational psychology support the claim that positive or negative emotional states can impact a student’s learning capabilities. To bridge this gap, we present MathBuddy, an emotionally aware LLM-powered Math Tutor, which dynamically models the student’s emotions and maps them to relevant pedagogical strategies, making the tutor-student conversation a more empathetic one. The student’s emotions are captured from the conversational text as well as from their facial expressions. The student’s emotions are aggregated from both modalities to confidently prompt our LLM Tutor for an emotionally-aware response. We have effectively evaluated our model using automatic evaluation metrics across eight pedagogical dimensions and user studies. We report a massive 23 point performance gain using the win rate and a 3 point gain at an overall level using DAMR scores which strongly supports our hypothesis of improving LLM-based tutor’s pedagogical abilities by modeling students’ emotions.
[27] ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning
Yiming Du,Yifan Xiang,Bin Liang,Dahua Lin,Kam-Fai Wong,Fei Tan
Main category: cs.CL
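TL;DR: ReSURE is an adaptive learning method that dynamically adjusts the weight of supervision signals to handle unreliable supervision in multi-turn dialogue fine-tuning, avoiding the limitations of static prefiltering.
Details
Motivation: Fine-tuning multi-turn dialogue systems requires high-quality supervision, but low-quality data degrades performance, and errors in early turns propagate to later ones. Existing methods typically prefilter data statically and cannot mitigate error propagation during training. Contribution: Proposes ReSURE, which estimates per-turn loss distributions on the fly and reweights samples without explicit filtering, improving model robustness and response quality.
Method: ReSURE estimates the loss distribution of each turn with Welford's online statistics and reweights sample losses accordingly.
Result: ReSURE improves stability and response quality on both single-source and mixed-quality data, and response scores correlate positively with the number of samples (Spearman correlations of 0.21 to 1.0 across benchmarks).
Insight: Dynamically reweighting supervision makes it possible to use large-scale data effectively and offers a new approach to training multi-turn dialogue systems.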
Abstract: Fine-tuning multi-turn dialogue systems requires high-quality supervision but often suffers from degraded performance when exposed to low-quality data. Supervision errors in early turns can propagate across subsequent turns, undermining coherence and response quality. Existing methods typically address data quality via static prefiltering, which decouples quality control from training and fails to mitigate turn-level error propagation. In this context, we propose ReSURE (Regularizing Supervision UnREliability), an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering. ReSURE estimates per-turn loss distributions using Welford’s online statistics and reweights sample losses on the fly accordingly. Experiments on both single-source and mixed-quality datasets show improved stability and response quality. Notably, ReSURE enjoys positive Spearman correlations (0.21 ~ 1.0 across multiple benchmarks) between response scores and number of samples regardless of data quality, which potentially paves the way for utilizing large-scale data effectively. Code is publicly available at https://github.com/Elvin-Yiming-Du/ReSURE_Multi_Turn_Training.
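Welford's algorithm, which the abstract cites for tracking per-turn loss statistics, updates a running mean and variance in a single pass without storing past losses. The sketch below implements that standard algorithm; the `reweight` rule (exponential decay beyond a z-score threshold) is an assumption for illustration, since the abstract does not give ReSURE's exact weighting function.

```python
import math

class Welford:
    """Running mean and sample standard deviation in one pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def reweight(loss, stats, z_max=1.0):
    """Assumed rule: keep weight 1 for typical losses, decay it for outliers."""
    if stats.n < 2 or stats.std == 0.0:
        return 1.0
    z = (loss - stats.mean) / stats.std
    return 1.0 if z <= z_max else math.exp(-(z - z_max))
```

In a per-turn setup one `Welford` instance would be kept for each turn index, so that a loss is judged against other losses at the same position in the dialogue rather than against a global average.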
[28] 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
Chengzu Li,Wenshan Wu,Huanyu Zhang,Qingtao Li,Zeyu Gao,Yan Xia,José Hernández-Orallo,Ivan Vulić,Furu Wei
Main category: cs.CL
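TL;DR: This paper proposes 11Plus-Bench, a benchmark for systematically evaluating the spatial reasoning of multimodal large language models (MLLMs) against human cognition. MLLMs show early signs of spatial cognition, but a large performance gap remains and their instance-level performance is largely random.
Details
Motivation: Current MLLMs perform impressively on reasoning tasks, but whether their spatial cognition approaches that of humans remains unclear. To fill this gap, the paper proposes an evaluation framework grounded in human cognition. Contribution: The main contribution is 11Plus-Bench, a high-quality benchmark that combines realistic standardized spatial aptitude tests with fine-grained expert annotations for evaluating MLLM spatial reasoning.
Method: The paper compares 14 MLLMs with human participants and studies model behavior using annotations of perceptual complexity and reasoning process.
Result: Although the cognitive profiles of MLLMs resemble those of humans, their performance lags far behind, and their instance-level correctness is largely random.
Insight: The study exposes both the potential and the limits of MLLM spatial reasoning and offers actionable directions for model design.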
Abstract: In human cognitive processes, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs’ cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both emerging capabilities and limitations in current MLLMs’ spatial reasoning and provide actionable insights for advancing model design.
cs.CV [Back]
[29] Real-Time Intuitive AI Drawing System for Collaboration: Enhancing Human Creativity through Formal and Contextual Intent Integration
Jookyung Song,Mookyoung Kang,Nojun Kwak
Main category: cs.CV
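TL;DR: This paper presents a real-time generative drawing system that combines formal intent (line trajectories, proportions, and similar attributes) with contextual intent (semantic themes) to support multi-user collaborative drawing with a low-latency, two-stage transformation.
Details
Motivation: Conventional text-prompt-based generative systems capture only high-level contextual descriptions and miss the low-level geometric features of what the user draws. The paper fills this gap by analyzing both kinds of intent together for more intuitive AI-assisted drawing. Contribution: 1) A generation framework that analyzes formal and contextual intent simultaneously; 2) a low-latency touchscreen interface that supports multi-user collaboration; 3) a two-stage generation pipeline that combines structure preservation with style- and content-aware synthesis.
Method: A multi-stage generation pipeline first extracts geometric features such as line trajectories and proportions (formal intent), then extracts semantic cues with vision-language models (contextual intent), and finally conditions a generative model on both to produce the drawing.
Result: The system achieves real-time two-stage transformation and lets users without artistic experience create together synchronously, reframing human-AI interaction as co-creation.
Insight: Capturing both low-level geometric intent and high-level semantic intent can make AI-assisted drawing more natural and creative, and it suggests a direction for future collaborative AI tools.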
Abstract: This paper presents a real-time generative drawing system that interprets and integrates both formal intent - the structural, compositional, and stylistic attributes of a sketch - and contextual intent - the semantic and thematic meaning inferred from its visual content - into a unified transformation process. Unlike conventional text-prompt-based generative systems, which primarily capture high-level contextual descriptions, our approach simultaneously analyzes ground-level intuitive geometric features such as line trajectories, proportions, and spatial arrangement, and high-level semantic cues extracted via vision-language models. These dual intent signals are jointly conditioned in a multi-stage generation pipeline that combines contour-preserving structural control with style- and content-aware image synthesis. Implemented with a touchscreen-based interface and distributed inference architecture, the system achieves low-latency, two-stage transformation while supporting multi-user collaboration on shared canvases. The resulting platform enables participants, regardless of artistic expertise, to engage in synchronous, co-authored visual creation, redefining human-AI interaction as a process of co-creation and mutual enhancement.
[30] TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models
Chenghao Liu,Jiachen Zhang,Chengxuan Li,Zhimu Zhou,Shixin Wu,Songfang Huang,Huiling Duan
Main category: cs.CV
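TL;DR: TTF-VLA proposes a training-free temporal token fusion method that combines historical and current visual information to improve the inference quality of Vision-Language-Action models and raise task success rates.
Details
Motivation: Existing Vision-Language-Action models process visual input frame by frame, discarding temporal information and leaving them vulnerable to visual noise. The authors aim to improve performance by fusing temporal information. Contribution: 1. Proposes training-free Temporal Token Fusion (TTF); 2. achieves selective fusion by combining pixel difference analysis with attention-based semantic relevance assessment; 3. validates the gains on multiple benchmarks and real robot tasks.
Method: TTF uses dual-dimension detection (grayscale pixel difference and attention-based semantic relevance) to fuse historical and current visual tokens selectively, with a hard fusion strategy and keyframe anchoring to prevent error accumulation.
Result: TTF improves LIBERO by 4.0 percentage points on average (72.4% vs. 68.4% baseline), SimplerEnv by 4.8% (relative), and real robot tasks by 8.7% (relative), and it is model-agnostic across OpenVLA and VLA-Cache.
Insight: Selectively reusing the Query matrix in attention enhances rather than harms performance, which suggests that direct KQV matrix reuse could deliver computational speedups as well.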
Abstract: Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
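The selective fusion described above can be pictured as a per-patch switch between the cached token and the fresh one. The sketch below is one reading of the abstract, not the authors' implementation: the thresholds, the keyframe interval, and the rule that a patch is refreshed when either its pixel change or its attention relevance is high are all assumptions.

```python
def fuse_tokens(prev_tokens, curr_tokens, pixel_diff, attn_score, step,
                diff_thresh=0.03, attn_thresh=0.5, keyframe_interval=3):
    """Hard fusion: each output token is either the cached or the current one.

    pixel_diff[i]  mean absolute grayscale change of patch i between frames
    attn_score[i]  attention-based relevance of patch i to the task
    """
    # Keyframe anchoring: periodically refresh every token to stop error build-up.
    if prev_tokens is None or step % keyframe_interval == 0:
        return list(curr_tokens)
    fused = []
    for prev, curr, diff, attn in zip(prev_tokens, curr_tokens, pixel_diff, attn_score):
        needs_refresh = diff > diff_thresh or attn > attn_thresh
        fused.append(curr if needs_refresh else prev)
    return fused
```

Static, low-relevance patches keep their earlier representation, which is what filters frame-to-frame visual noise, while moving or task-relevant patches always use the current observation.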
[31] Seeing Like a Designer Without One: A Study on Unsupervised Slide Quality Assessment via Designer Cue Augmentation
Tai Inui,Steven Oh,Magdeline Kuan
Main category: cs.CV
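TL;DR: This paper proposes an unsupervised slide quality assessment method that combines expert-inspired design metrics with multimodal embeddings and uses anomaly scoring to produce quality scores that correlate strongly with human ratings.
Details
Motivation: Existing slide quality assessment relies on human experts or pretrained vision-language models and lacks scalability and objectivity. The paper aims to approximate human perception of slide quality with an unsupervised combination of design metrics and multimodal embeddings. Contribution: 1) A pipeline built on seven expert-inspired design metrics; 2) the use of CLIP-ViT embeddings with anomaly scoring; 3) validation on professional slide data, with correlations well above those of existing models.
Method: Seven design metrics (whitespace, colorfulness, edge density, brightness contrast, text density, color harmony, layout balance) are extracted from each slide, combined with CLIP-ViT embeddings, and scored with Isolation Forest-based anomaly scoring.
Result: Trained on 12k professional lecture slides and evaluated on 115 slides from six academic talks, the method reaches Pearson correlations of up to 0.83 with human ratings, 1.79x to 3.23x stronger than leading vision-language models.
Insight: Combining low-level design cues with multimodal embeddings can assess slide quality efficiently and objectively, which makes real-time feedback feasible.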
Abstract: We present an unsupervised slide-quality assessment pipeline that combines seven expert-inspired visual-design metrics (whitespace, colorfulness, edge density, brightness contrast, text density, color harmony, layout balance) with CLIP-ViT embeddings, using Isolation Forest-based anomaly scoring to evaluate presentation slides. Trained on 12k professional lecture slides and evaluated on six academic talks (115 slides), our method achieved Pearson correlations up to 0.83 with human visual-quality ratings, 1.79x to 3.23x stronger than scores from leading vision-language models (ChatGPT o4-mini-high, ChatGPT o3, Claude Sonnet 4, Gemini 2.5 Pro). We demonstrate convergent validity with visual ratings, discriminant validity against speaker-delivery scores, and exploratory alignment with overall impressions. Our results show that augmenting low-level design cues with multimodal embeddings closely approximates audience perceptions of slide quality, enabling scalable, objective feedback in real time.
[32] Object Detection with Multimodal Large Vision-Language Models: An In-depth Review
Ranjan Sapkota,Manoj Karkee
Main category: cs.CV
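TL;DR: This paper is an in-depth review of large vision-language models (LVLMs) for multimodal object detection, covering their architectures, training methods, and performance, and outlining future directions.
Details
Motivation: Traditional object detection methods are limited in contextual understanding and generalization, whereas LVLMs, which fuse natural language processing (NLP) and computer vision (CV), have brought major improvements to the field. Contribution: 1. Systematically surveys recent progress in LVLMs for object detection; 2. summarizes architectural innovations and training paradigms; 3. demonstrates the effectiveness of LVLMs in diverse scenarios through visualizations; 4. identifies current limitations and future research directions.
Method: 1. Explain how vision-language models (VLMs) work for object detection; 2. analyze the architectural innovations and training methods of LVLMs; 3. compare their performance with traditional deep learning systems.
Result: LVLMs show strong contextual understanding and generalization in object detection and are expected to meet or surpass conventional methods soon.
Insight: By fusing visual and linguistic information, LVLMs open new possibilities for object detection and robotic applications, but challenges such as computational complexity and real-time performance still need to be solved.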
Abstract: The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state-of-the-art in LVLMs, systematically organized through a three-step research review process. First, we discuss the functioning of vision language models (VLMs) for object detection, describing how these models harness natural language processing (NLP) and computer vision (CV) techniques to revolutionize object detection and localization. We then explain the architectural innovations, training paradigms, and output flexibility of recent LVLMs for object detection, highlighting how they achieve advanced contextual understanding for object detection. The review thoroughly examines the approaches used in integration of visual and textual information, demonstrating the progress made in object detection using VLMs that facilitate more sophisticated object detection and localization strategies. This review presents comprehensive visualizations demonstrating LVLMs’ effectiveness in diverse scenarios including localization and segmentation, and then compares their real-time performance, adaptability, and complexity to traditional deep learning systems. Based on the review, it is expected that LVLMs will soon meet or surpass the performance of conventional methods in object detection. The review also identifies a few major limitations of the current LVLM models, proposes solutions to address those challenges, and presents a clear roadmap for the future advancement in this field. We conclude, based on this study, that the recent advancements in LVLMs have made and will continue to make a transformative impact on object detection and robotic applications in the future.
[33] Large VLM-based Stylized Sports Captioning
Sauptik Dhar,Nicholas Buoncristiani,Joe Anakata,Haoyu Zhang,Michelle Munson
Main category: cs.CV
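TL;DR: This paper proposes a two-level fine-tuned pipeline based on a large vision-language model (LVLM) for generating stylized sports captions. It improves performance substantially and proves practical for live sports journalism.
Details
Motivation: Existing LLMs and LVLMs lack the domain-specific jargon and naturalness needed to generate stylized descriptions of game play, so they fall short of production-grade requirements. Contribution: Proposes a two-level fine-tuned pipeline that improves sports caption generation (more than 8-10% in F1 and more than 2-10% in BERT score) and demonstrates its real-time use during Super Bowl LIX.
Method: A two-level fine-tuned LVLM pipeline generates natural-language descriptions using domain-specific sports terminology in a desired stylized format.
Result: The pipeline outperforms alternative approaches in F1 and BERT score, has a small runtime memory footprint, and runs fast (6 images per 3-5 seconds).
Insight: With domain adaptation and style-oriented fine-tuning, large vision-language models can generate professional, natural sports captions efficiently enough for live sports coverage.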
Abstract: The advent of large (visual) language models (LLM / LVLM) has led to a deluge of automated human-like systems in several domains including social media content generation, search and recommendation, healthcare prognosis, AI assistants for cognitive tasks etc. Although these systems have been successfully integrated in production, very little focus has been placed on sports, particularly accurate identification and natural language description of the game play. Most existing LLM/LVLMs can explain generic sports activities, but lack sufficient domain-centric sports jargon to create natural (human-like) descriptions. This work highlights the limitations of existing SoTA LLM/LVLMs for generating production-grade sports captions from images in a desired stylized format, and proposes a two-level fine-tuned LVLM pipeline to address that. The proposed pipeline yields an improvement > 8-10% in the F1, and > 2-10% in BERT score compared to alternative approaches. In addition, it has a small runtime memory footprint and fast execution time. During Super Bowl LIX the pipeline proved its practical application for live professional sports journalism, generating highly accurate and stylized captions at the rate of 6 images per 3-5 seconds for over 1000 images during the game play.
[34] DemoBias: An Empirical Study to Trace Demographic Biases in Vision Foundation Models
Abu Sufian,Anirudha Ghosh,Debaditya Barman,Marco Leo,Cosimo Distante
Main category: cs.CV
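TL;DR: DemoBias is an empirical study of the biases that large vision-language models (LVLMs) show in face recognition tasks across demographic groups (ethnicity/race, gender, and age). It finds that PaliGemma and LLaVA perform inconsistently for some groups.
Details
Motivation: Although LVLMs perform well on many downstream tasks, they remain biased in face recognition across demographic groups. The study aims to quantify this bias and promote model fairness. Contribution: 1. Generates a demographically balanced dataset; 2. quantifies bias in LVLMs; 3. reveals disparities in PaliGemma and LLaVA for specific groups.
Method: LLaVA, BLIP-2, and PaliGemma are fine-tuned and evaluated, with performance disparities measured by group-specific BERTScores and the Fairness Discrepancy Rate.
Result: PaliGemma and LLaVA show higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, while BLIP-2 is comparatively consistent.
Insight: LVLMs can carry hidden unfairness in cross-group tasks and need further work to improve fairness.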
Abstract: Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities across various downstream tasks, including biometric face recognition (FR) with description. However, demographic biases remain a critical concern in FR, as these foundation models often fail to perform equitably across diverse demographic groups, considering ethnicity/race, gender, and age. Therefore, through our work DemoBias, we conduct an empirical evaluation to investigate the extent of demographic biases in LVLMs for biometric FR with textual token generation tasks. We fine-tuned and evaluated three widely used pre-trained LVLMs: LLaVA, BLIP-2, and PaliGemma on our own generated demographic-balanced dataset. We utilize several evaluation metrics, like group-specific BERTScores and the Fairness Discrepancy Rate, to quantify and trace the performance disparities. The experimental results deliver compelling insights into the fairness and reliability of LVLMs across diverse demographic groups. Our empirical study uncovered demographic biases in LVLMs, with PaliGemma and LLaVA exhibiting higher disparities for Hispanic/Latino, Caucasian, and South Asian groups, whereas BLIP-2 was comparatively consistent. Repository: https://github.com/Sufianlab/DemoBias.
[35] Geo2Vec: Shape- and Distance-Aware Neural Representation of Geospatial Entities
Chen Chu,Cyrus Shahabi
Main category: cs.CV
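TL;DR: Geo2Vec is a new spatial representation learning method that operates directly in the original space. By adaptively sampling points and encoding signed distance fields (SDF), it addresses the limited geo-entity type support and high computational cost of existing methods, and outperforms them at representing shape, location, and spatial relationships.
Details
Motivation: Existing geospatial representation learning methods either support only a single geo-entity type or rely on decomposition and Fourier transformation, which is computationally expensive and lacks geometric alignment, making fine-grained features hard to capture. Geo2Vec aims to solve these problems.
Contribution: 1. Proposes Geo2Vec, an adaptive-sampling and SDF-encoding method that works directly in the original space and supports all geo-entity types; 2. introduces a rotation-invariant positional encoding that strengthens modeling of high-frequency spatial variations; 3. outperforms existing methods on shape, location, and spatial-relationship representation tasks.
Method: Geo2Vec adaptively samples points and encodes their signed distances, avoiding entity decomposition. A neural network approximates the SDF to produce compact, geometry-aware, unified representations, and a rotation-invariant positional encoding improves robustness.
Result: Experiments show that Geo2Vec outperforms existing methods in shape and location representation, spatial relationship capture, and computational efficiency.
Insight: Representations built directly in the original space avoid the computational overhead and loss of geometric information caused by decomposition; adaptive sampling and SDF encoding are well suited to capturing fine geometric features.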
Abstract: Spatial representation learning is essential for GeoAI applications such as urban analytics, enabling the encoding of shapes, locations, and spatial relationships (topological and distance-based) of geo-entities like points, polylines, and polygons. Existing methods either target a single geo-entity type or, like Poly2Vec, decompose entities into simpler components to enable Fourier transformation, introducing high computational cost. Moreover, since the transformed space lacks geometric alignment, these methods rely on uniform, non-adaptive sampling, which blurs fine-grained features like edges and boundaries. To address these limitations, we introduce Geo2Vec, a novel method inspired by signed distance fields (SDF) that operates directly in the original space. Geo2Vec adaptively samples points and encodes their signed distances (positive outside, negative inside), capturing geometry without decomposition. A neural network trained to approximate the SDF produces compact, geometry-aware, and unified representations for all geo-entity types. Additionally, we propose a rotation-invariant positional encoding to model high-frequency spatial variations and construct a structured and robust embedding space for downstream GeoAI models. Empirical results show that Geo2Vec consistently outperforms existing methods in representing shape and location, capturing topological and distance relationships, and achieving greater efficiency in real-world GeoAI applications. Code and Data can be found at: https://github.com/chuchen2017/GeoNeuralRepresentation.
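The signed-distance convention in the abstract (positive outside, negative inside) can be illustrated with a minimal sketch for a single polygon. This is not the Geo2Vec implementation: the function names, the ray-casting inside test, and the unit-square example are assumptions chosen for illustration, and the neural network that approximates the SDF is omitted.

```python
import math

def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def _inside(p, polygon):
    """Even-odd ray-casting test (hypothetical helper)."""
    px, py = p
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal line through p
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def signed_distance(p, polygon):
    """Distance to the polygon boundary: negative inside, positive outside."""
    n = len(polygon)
    d = min(_point_segment_distance(p, polygon[i], polygon[(i + 1) % n])
            for i in range(n))
    return -d if _inside(p, polygon) else d

unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

Calling `signed_distance` on points sampled around an entity gives the kind of (point, signed distance) pairs a network could be trained to regress; how Geo2Vec chooses those sample points adaptively is not specified in the abstract.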
[36] Advancements in Crop Analysis through Deep Learning and Explainable AI
Hamza Khan
Main category: cs.CV
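TL;DR: The paper proposes an automated approach based on convolutional neural networks (CNN) and explainable AI (XAI) that classifies five rice varieties and diagnoses rice leaf diseases, with high accuracy and interpretability.
Details
Motivation: Rice is an important staple crop worldwide, but manual inspection of rice quality and disease is time-consuming, labor-intensive, and error-prone, so automated solutions are needed.
Contribution: Proposes a framework combining CNN and XAI that classifies rice varieties and diagnoses diseases, using SHAP and LIME to improve model interpretability.
Method: Trains a CNN model on a public dataset (75,000 images), uses deep learning models including VGG16, ResNet50, and MobileNetV2, and applies SHAP and LIME for explanation.
Result: The model achieves high classification accuracy with very few misclassifications and makes clear how features influence predictions.
Insight: Combining deep learning with XAI has great potential in agriculture and can improve the reliability and transparency of automated inspection systems.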
Abstract: Rice is a staple food of global importance in terms of trade, nutrition, and economic growth. Asian nations such as China, India, Pakistan, Thailand, Vietnam, and Indonesia are leading producers of both long- and short-grain varieties, including basmati, jasmine, arborio, ipsala, and kainat saila. To ensure consumer satisfaction and strengthen national reputations, monitoring rice crops and grain quality is essential. Manual inspection, however, is labour intensive, time consuming and error prone, highlighting the need for automated solutions for quality control and yield improvement. This study proposes an automated approach to classify five rice grain varieties using Convolutional Neural Networks (CNN). A publicly available dataset of 75,000 images was used for training and testing. Model evaluation employed accuracy, recall, precision, F1-score, ROC curves, and confusion matrices. Results demonstrated high classification accuracy with minimal misclassifications, confirming the model's effectiveness in distinguishing rice varieties. In addition, an accurate diagnostic method for rice leaf diseases such as Brown Spot, Blast, Bacterial Blight, and Tungro was developed. The framework combined explainable artificial intelligence (XAI) with deep learning models including CNN, VGG16, ResNet50, and MobileNetV2. Explainability techniques such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) revealed how specific grain and leaf features influenced predictions, enhancing model transparency and reliability. The findings demonstrate the strong potential of deep learning in agricultural applications, paving the way for robust, interpretable systems that can support automated crop quality inspection and disease diagnosis, ultimately benefiting farmers, consumers, and the agricultural economy.
[37] Sistema de Reconocimiento Facial Federado en Conjuntos Abiertos basado en OpenMax
Ander Galván,Marivi Higuero,Jorge Sasiain,Eduardo Jacob
Main category: cs.CV
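TL;DR: The paper proposes an open-set face recognition system based on federated learning that incorporates the OpenMax algorithm; by exchanging mean activation vectors and local distances it distinguishes known from unknown individuals, improving privacy protection and robustness.
Details
Motivation: AI-driven face recognition performs well in specific scenarios but faces challenges in privacy management and in recognizing unknown individuals. Federated learning offers a framework for the privacy problem, and the OpenMax algorithm handles the open-set problem.
Contribution: 1. Integrates the OpenMax algorithm into a federated learning framework to address open-set face recognition; 2. proposes exchanging mean activation vectors and local distances to distinguish known from unknown individuals.
Method: 1. Integrates OpenMax into the federated learning framework; 2. uses the exchange of mean activation vectors and local distances to identify unknown individuals.
Result: Experiments validate the proposed method and indicate that it can improve privacy protection and robustness in distributed environments.
Insight: Combining federated learning with OpenMax protects privacy while improving open-set recognition performance, offering a practical solution for real-world use.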
Abstract: Facial recognition powered by Artificial Intelligence has achieved high accuracy in specific scenarios and applications. Nevertheless, it faces significant challenges regarding privacy and identity management, particularly when unknown individuals appear in the operational context. This paper presents the design, implementation, and evaluation of a facial recognition system within a federated learning framework tailored to open-set scenarios. The proposed approach integrates the OpenMax algorithm into federated learning, leveraging the exchange of mean activation vectors and local distance measures to reliably distinguish between known and unknown subjects. Experimental results validate the effectiveness of the proposed solution, demonstrating its potential for enhancing privacy-aware and robust facial recognition in distributed environments.
[38] Automated classification of natural habitats using ground-level imagery
Mahdis Tourian,Sareh Rowlands,Remy Vandaele,Max Fancourt,Rebecca Mein,Hywel T. P. Williams
Main category: cs.CV
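TL;DR: The paper proposes an automated method for classifying natural habitats from ground-level imagery, using deep learning (DeepLabV3-ResNet101) to assign photographs to 18 habitat classes, and shows promising classification potential.
Details
Motivation: Accurate habitat classification is critical for biodiversity conservation and ecological monitoring. Traditional methods rely on satellite imagery with manual validation, whereas classification from ground-level imagery offers better validation and scalability.
Contribution: 1. Develops a habitat classification method based on ground-level imagery; 2. classifies 18 habitat classes with a DeepLabV3-ResNet101 model, reaching a mean F1-score of 0.61; 3. provides a practical web application.
Method: 1. Image pre-processing (resizing, normalisation, and augmentation); 2. re-sampling to balance the training data; 3. fine-tuning DeepLabV3-ResNet101, evaluated with five-fold cross-validation.
Result: The model performs well overall; some classes (such as bare soil and bare sand) exceed an F1-score of 0.90, while mixed classes score lower.
Insight: Ground-level image classification has the potential to scale and offers a new tool for ecological monitoring and citizen science.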
Abstract: Accurate classification of terrestrial habitats is critical for biodiversity conservation, ecological monitoring, and land-use planning. Several habitat classification schemes are in use, typically based on analysis of satellite imagery with validation by field ecologists. Here we present a methodology for classification of habitats based solely on ground-level imagery (photographs), offering improved validation and the ability to classify habitats at scale (for example using citizen-science imagery). In collaboration with Natural England, a public sector organisation responsible for nature conservation in England, this study develops a classification system that applies deep learning to ground-level habitat photographs, categorising each image into one of 18 classes defined by the ‘Living England’ framework. Images were pre-processed using resizing, normalisation, and augmentation; re-sampling was used to balance classes in the training data and enhance model robustness. We developed and fine-tuned a DeepLabV3-ResNet101 classifier to assign a habitat class label to each photograph. Using five-fold cross-validation, the model demonstrated strong overall performance across 18 habitat classes, with accuracy and F1-scores varying between classes. Across all folds, the model achieved a mean F1-score of 0.61, with visually distinct habitats such as Bare Soil, Silt and Peat (BSSP) and Bare Sand (BS) reaching values above 0.90, and mixed or ambiguous classes scoring lower. These findings demonstrate the potential of this approach for ecological monitoring. Ground-level imagery is readily obtained, and accurate computational methods for habitat classification based on such data have many potential applications. To support use by practitioners, we also provide a simple web application that classifies uploaded images using our model.
[39] MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation
Ming Chen,Liyuan Cui,Wenyuan Zhang,Haoxian Zhang,Yan Zhou,Xiaohan Li,Xiaoqiang Liu,Pengfei Wan
Main category: cs.CV
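TL;DR: MIDAS proposes a framework for multimodal interactive digital-human synthesis via real-time autoregressive video generation, addressing the high latency, high computational cost, and limited controllability of existing methods.
Details
Motivation: Existing interactive digital-human video generation methods suffer from high latency, heavy computational cost, and limited controllability, which limits practical use. MIDAS aims to solve these problems through multimodal control and low-latency inference.
Contribution: 1) Proposes an autoregressive video generation framework that supports multimodal inputs; 2) constructs a large-scale dialogue dataset; 3) introduces a deep compression autoencoder to reduce the inference burden.
Method: Builds on a large language model (LLM) that accepts multimodal inputs including audio, pose, and text, and generates video through the denoising process of a diffusion head. A deep compression autoencoder (up to 64$\times$ reduction ratio) lowers computational cost.
Result: Experiments on duplex conversation, multilingual human synthesis, and interactive world models show low latency, high efficiency, and fine-grained multimodal controllability.
Insight: By combining an LLM with a diffusion model, MIDAS shows the potential of multimodal interactive video generation, while deep compression offers a solution for long-horizon inference.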
Abstract: Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with high latency, heavy computational cost, and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64$\times$ reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.
[40] Deep Data Hiding for ICAO-Compliant Face Images: A Survey
Jefferson David Rodriguez Chivata,Davide Ghiani,Simone Maurizio La Cava,Marco Micheletto,Giulia Orrù,Federico Lama,Gian Luca Marcialis
Main category: cs.CV
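TL;DR: This survey examines digital watermarking and steganography as complementary solutions that embed tamper-evident signals in ICAO-compliant face images to enable persistent verification.
Details
Motivation: Standardization of ICAO-compliant face images enables global interoperability but also facilitates malicious practices such as morphing and deepfakes. Traditional countermeasures limited to real-time capture offer no post-capture protection, so persistent solutions are needed.
Contribution: Provides the first comprehensive analysis of the potential and limitations of digital watermarking and steganography for ICAO-compliant images, assessing their suitability under the standard's constraints.
Method: Through a literature review and analysis of the state of the art, examines how digital watermarking and steganography techniques are implemented for ICAO-compliant images and how well they work.
Result: Identifies key trade-offs and offers guidance for secure deployment in real-world identity systems.
Insight: Digital watermarking and steganography are effective means of post-capture image protection, but implementations must balance technical feasibility against compliance requirements.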
Abstract: ICAO-compliant facial images, initially designed for secure biometric passports, are increasingly becoming central to identity verification in a wide range of application contexts, including border control, digital travel credentials, and financial services. While their standardization enables global interoperability, it also facilitates practices such as morphing and deepfakes, which can be exploited for harmful purposes like identity theft and illegal sharing of identity documents. Traditional countermeasures like Presentation Attack Detection (PAD) are limited to real-time capture and offer no post-capture protection. This survey paper investigates digital watermarking and steganography as complementary solutions that embed tamper-evident signals directly into the image, enabling persistent verification without compromising ICAO compliance. We provide the first comprehensive analysis of state-of-the-art techniques to evaluate the potential and drawbacks of the underlying approaches concerning the applications involving ICAO-compliant images and their suitability under standard constraints. We highlight key trade-offs, offering guidance for secure deployment in real-world identity systems.
[41] PRISM: A Framework Harnessing Unsupervised Visual Representations and Textual Prompts for Explainable MACE Survival Prediction from Cardiac Cine MRI
Haoyang Su,Jin-Yi Xiang,Shaohao Rui,Yifan Gao,Xingyu Chen,Tingxuan Yin,Xiaosong Wang,Lian-Ming Wu
Main category: cs.CV
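TL;DR: PRISM is a self-supervised framework that combines visual representations from non-contrast cardiac cine magnetic resonance imaging (MRI) with electronic health records (EHR) for survival analysis, significantly improving prediction of major adverse cardiac events (MACE).
Details
Motivation: Accurately predicting MACE is a central challenge in cardiovascular prognosis. Existing models often fail to integrate multimodal data (such as imaging and EHR) effectively and struggle to provide explainable risk predictions.
Contribution: Proposes the PRISM framework, which for the first time combines visual representations extracted by motion-aware multi-view distillation with modulation by medical textual prompts for fine-grained risk prediction, and validates its superiority on four independent clinical cohorts.
Method: PRISM uses self-supervised learning to extract temporally synchronized features from cardiac MRI, modulates them with medical text prompts, and combines them with EHR data for survival analysis.
Result: PRISM surpasses classical survival prediction models and SOTA deep learning methods under both internal and external validation, and uncovers three imaging signatures associated with elevated MACE risk.
Insight: Beyond strong predictive performance, PRISM's prompt-guided attribution highlights hypertension, diabetes, and smoking as key clinical factors, giving explainable insight for clinical decision-making.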
Abstract: Accurate prediction of major adverse cardiac events (MACE) remains a central challenge in cardiovascular prognosis. We present PRISM (Prompt-guided Representation Integration for Survival Modeling), a self-supervised framework that integrates visual representations from non-contrast cardiac cine magnetic resonance imaging with structured electronic health records (EHRs) for survival analysis. PRISM extracts temporally synchronized imaging features through motion-aware multi-view distillation and modulates them using medically informed textual prompts to enable fine-grained risk prediction. Across four independent clinical cohorts, PRISM consistently surpasses classical survival prediction models and state-of-the-art (SOTA) deep learning baselines under internal and external validation. Further clinical findings demonstrate that the combined imaging and EHR representations derived from PRISM provide valuable insights into cardiac risk across diverse cohorts. Three distinct imaging signatures associated with elevated MACE risk are uncovered, including lateral wall dyssynchrony, inferior wall hypersensitivity, and anterior elevated focus during diastole. Prompt-guided attribution further identifies hypertension, diabetes, and smoking as dominant contributors among clinical and physiological EHR factors.
[42] EffNetViTLoRA: An Efficient Hybrid Deep Learning Approach for Alzheimer’s Disease Diagnosis
Mahdieh Behjat Khatooni,Mohsen Soryani
Main category: cs.CV
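TL;DR: EffNetViTLoRA is an efficient hybrid deep learning method that combines a CNN with a ViT for early diagnosis of Alzheimer’s disease and uses LoRA to adapt the pretrained model, improving classification performance.
Details
Motivation: Early diagnosis of Alzheimer’s disease is crucial, but mild cognitive impairment (MCI) is hard to diagnose, and existing methods underperform because of limited data and domain differences.
Contribution: Proposes EffNetViTLoRA, which combines a CNN and a ViT to extract local and global features from MRI images and uses LoRA to adapt efficiently to the target domain, improving classification performance.
Method: A CNN extracts local features and a ViT captures global features; the pretrained ViT is fine-tuned with LoRA, reducing the risk of overfitting.
Result: Achieves 92.52% classification accuracy and a 92.76% F1-score on the full ADNI dataset.
Insight: The hybrid CNN-ViT architecture markedly improves feature extraction, and LoRA effectively addresses the domain gap, improving the model's clinical reliability.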
Abstract: Alzheimer’s disease (AD) is one of the most prevalent neurodegenerative disorders worldwide. As it progresses, it leads to the deterioration of cognitive functions. Since AD is irreversible, early diagnosis is crucial for managing its progression. Mild Cognitive Impairment (MCI) represents an intermediate stage between Cognitively Normal (CN) individuals and those with AD, and is considered a transitional phase from normal cognition to Alzheimer’s disease. Diagnosing MCI is particularly challenging due to the subtle differences between adjacent diagnostic categories. In this study, we propose EffNetViTLoRA, a generalized end-to-end model for AD diagnosis using the whole Alzheimer’s Disease Neuroimaging Initiative (ADNI) Magnetic Resonance Imaging (MRI) dataset. Our model integrates a Convolutional Neural Network (CNN) with a Vision Transformer (ViT) to capture both local and global features from MRI images. Unlike previous studies that rely on limited subsets of data, our approach is trained on the full T1-weighted MRI dataset from ADNI, resulting in a more robust and unbiased model. This comprehensive methodology enhances the model’s clinical reliability. Furthermore, fine-tuning large pretrained models often yields suboptimal results when source and target dataset domains differ. To address this, we incorporate Low-Rank Adaptation (LoRA) to effectively adapt the pretrained ViT model to our target domain. This method enables efficient knowledge transfer and reduces the risk of overfitting. Our model achieves a classification accuracy of 92.52% and an F1-score of 92.76% across three diagnostic categories (AD, MCI, and CN) on the full ADNI dataset.
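As a rough illustration of the Low-Rank Adaptation step mentioned in the abstract, the sketch below shows the standard LoRA weight update, W + (alpha / r) * B A, with the pretrained weight W kept frozen and only the small matrices A and B trained. The matrix sizes, the `alpha` value, and the helper names are assumptions for illustration; the abstract does not state which ViT layers are adapted or which rank is used.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen pretrained weight, shape d_out x d_in
    a: trainable down-projection, shape r x d_in
    b: trainable up-projection, shape d_out x r (usually initialised to zero)
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Parameters trained by LoRA for one adapted d_out x d_in weight."""
    return r * (d_in + d_out)

# Toy example (hypothetical sizes): a 2 x 3 weight adapted with rank 1.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]
b_zero = [[0.0], [0.0]]     # zero init: the adapted layer starts equal to W
b_trained = [[1.0], [0.0]]  # stands in for B after some training
```

For a hypothetical 768 x 768 projection with rank 8, `trainable_params(768, 768, 8)` counts 12,288 trainable values against 589,824 in the full matrix, which is the sense in which the adaptation is parameter-efficient.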
[43] Concurrent validity of computer-vision artificial intelligence player tracking software using broadcast footage
Zachary L. Crang,Rich D. Johnston,Katie L. Mills,Johsan Billingham,Sam Robertson,Michael H. Cole,Jonathon Weakley,Adam Hewitt,Grant M. Duthie
Main category: cs.CV
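TL;DR: This study evaluates how accurately commercially available computer-vision and AI player tracking software measures player position, speed, and distance from broadcast footage, and examines the effect of camera feed and resolution. Compared against a high-precision multi-camera system, the software shows fair precision when players are detected, and a tactical feed at 720p or 1080p is the most suitable input.
Details
Motivation: Accurate player tracking data is essential for tactical analysis and performance evaluation in football, but the high cost of specialist equipment limits adoption. The study aims to verify whether commercial computer-vision and AI software can deliver acceptable tracking accuracy from broadcast footage.
Contribution: 1. Validates the feasibility of commercial computer-vision and AI player tracking software on broadcast footage; 2. recommends a tactical feed and 720p or 1080p resolution to optimise tracking accuracy; 3. quantifies the error ranges of the software for position, speed, and distance.
Method: Uses data from one match at the 2022 Qatar World Cup, comparing three commercial providers against the TRACAB Gen 5 high-precision system. Metrics cover player position coordinates, speed, and match distance, with root mean square error (RMSE) and mean bias calculated.
Result: Position RMSE ranges from 1.68 to 16.39 m and speed RMSE from 0.34 to 2.38 m/s. Mean bias in total match distance ranges from -1745 m (-21.8%) to 1945 m (24.3%). A tactical feed maximises player detection and so improves accuracy; both 720p and 1080p are suitable.
Insight: Computer vision and AI can provide reasonable player tracking data at lower cost, but the error ranges are wide. Future improvements include better detection algorithms and use of higher-resolution footage.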
Abstract: This study aimed to: (1) understand whether commercially available computer-vision and artificial intelligence (AI) player tracking software can accurately measure player position, speed and distance using broadcast footage and (2) determine the impact of camera feed and resolution on accuracy. Data were obtained from one match at the 2022 Qatar Federation Internationale de Football Association (FIFA) World Cup. Tactical, programme and camera 1 feeds were used. Three commercial tracking providers that use computer-vision and AI participated. Providers analysed instantaneous position (x, y coordinates) and speed (m$\cdot$s$^{-1}$) of each player. Their data were compared with a high-definition multi-camera tracking system (TRACAB Gen 5). Root mean square error (RMSE) and mean bias were calculated. Position RMSE ranged from 1.68 to 16.39 m, while speed RMSE ranged from 0.34 to 2.38 m$\cdot$s$^{-1}$. Total match distance mean bias ranged from -1745 m (-21.8%) to 1945 m (24.3%) across providers. Computer-vision and AI player tracking software offer the ability to track players with fair precision when players are detected by the software. Providers should use a tactical feed when tracking position and speed, which will maximise player detection, improving accuracy. Both 720p and 1080p resolutions are suitable, assuming appropriate computer-vision and AI models are implemented.
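The two agreement metrics named in the abstract, RMSE and mean bias, can be computed as in the sketch below. The function names and sample values are hypothetical, and treating position error as the Euclidean distance between estimated and reference (x, y) coordinates is an assumption; the abstract does not say how the study combined the two axes.

```python
import math

def rmse(estimates, references):
    """Root mean square error between paired scalar samples (e.g. speed)."""
    squared = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return math.sqrt(sum(squared) / len(squared))

def mean_bias(estimates, references):
    """Mean signed difference; negative means the estimate under-reads."""
    diffs = [e - r for e, r in zip(estimates, references)]
    return sum(diffs) / len(diffs)

def position_rmse(estimated_xy, reference_xy):
    """RMSE of the Euclidean distance between paired (x, y) positions."""
    squared = [(ex - rx) ** 2 + (ey - ry) ** 2
               for (ex, ey), (rx, ry) in zip(estimated_xy, reference_xy)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical speed samples: provider estimate vs. reference system.
provider_speed = [1.0, 2.0, 3.0]
reference_speed = [1.0, 2.0, 5.0]
```

With these toy samples, `mean_bias` is negative because the provider under-reads the last sample, while `rmse` reports the magnitude of the disagreement regardless of sign.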
[44] JVLGS: Joint Vision-Language Gas Leak Segmentation
Xinlong Zhao,Qixiang Pang,Shan Du
Main category: cs.CV
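TL;DR: The paper proposes JVLGS, a joint framework that combines visual and textual information to improve gas leak detection and segmentation, with a post-processing step that reduces false positives.
Details
Motivation: Gas leaks pose serious threats to health and the environment, and existing vision-based methods are limited by the blurry, non-rigid nature of gas clouds.
Contribution: Proposes the JVLGS framework, which combines visual and textual modalities and uses post-processing to reduce false positives, significantly outperforming existing methods.
Method: Jointly uses the visual and textual modalities to enhance gas leak representation, and adds a post-processing step to reduce the effect of noise and non-target objects.
Result: In experiments across diverse scenarios, JVLGS performs strongly under both supervised and few-shot learning settings.
Insight: Multimodal information and post-processing are key to making gas leak detection more robust and accurate.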
Abstract: Gas leaks pose serious threats to human health and contribute significantly to atmospheric pollution, drawing increasing public concern. However, the lack of effective detection methods hampers timely and accurate identification of gas leaks. While some vision-based techniques leverage infrared videos for leak detection, the blurry and non-rigid nature of gas clouds often limits their effectiveness. To address these challenges, we propose a novel framework called Joint Vision-Language Gas leak Segmentation (JVLGS), which integrates the complementary strengths of visual and textual modalities to enhance gas leak representation and segmentation. Recognizing that gas leaks are sporadic and many video frames may contain no leak at all, our method incorporates a post-processing step to reduce false positives caused by noise and non-target objects, an issue that affects many existing approaches. Extensive experiments conducted across diverse scenarios show that JVLGS significantly outperforms state-of-the-art gas leak segmentation methods. We evaluate our model under both supervised and few-shot learning settings, and it consistently achieves strong performance in both, whereas competing methods tend to perform well in only one setting or poorly in both. Code available at: https://github.com/GeekEagle/JVLGS
[45] UNIFORM: Unifying Knowledge from Large-scale and Diverse Pre-trained Models
Yimu Wang,Weiming Zhuang,Chen Chen,Jiabo Huang,Jingtao Li,Lingjuan Lyu
Main category: cs.CV
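TL;DR: The UNIFORM framework integrates knowledge from diverse pre-trained models through a voting mechanism, enabling large-scale knowledge fusion and improving unsupervised object recognition.
Details
Motivation: The diversity and heterogeneity of pre-trained models make knowledge integration challenging, and existing methods rely on strong assumptions about data distributions and network architectures. UNIFORM aims to remove these limits and make full use of knowledge from large-scale, diverse pre-trained models.
Contribution: Proposes the UNIFORM framework, which unifies knowledge from different pre-trained models through voting at the logit and feature levels without relying on such assumptions.
Method: Designs voting mechanisms at the logit level and the feature level to capture the consensus knowledge of teacher models, supporting learning from over one hundred teachers.
Result: Experiments show that UNIFORM outperforms baseline methods on unsupervised object recognition and scales well.
Insight: Voting can effectively integrate knowledge from heterogeneous models, supporting the view that the collective consensus of large-scale pre-trained models is universal and generalizable.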
Abstract: In the era of deep learning, the increasing number of pre-trained models available online presents a wealth of knowledge. These models, developed with diverse architectures and trained on varied datasets for different tasks, provide unique interpretations of the real world. Their collective consensus is likely universal and generalizable to unseen data. However, effectively harnessing this collective knowledge poses a fundamental challenge due to the heterogeneity of pre-trained models. Existing knowledge integration solutions typically rely on strong assumptions about training data distributions and network architectures, limiting them to learning only from specific types of models and resulting in data and/or inductive biases. In this work, we introduce a novel framework, namely UNIFORM, for knowledge transfer from a diverse set of off-the-shelf models into one student model without such constraints. Specifically, we propose a dedicated voting mechanism to capture the consensus of knowledge both at the logit level – incorporating teacher models that are capable of predicting target classes of interest – and at the feature level, utilizing visual representations learned on arbitrary label spaces. Extensive experiments demonstrate that UNIFORM effectively enhances unsupervised object recognition performance compared to strong knowledge transfer baselines. Notably, it exhibits remarkable scalability by benefiting from over one hundred teachers, while existing methods saturate at a much smaller scale.
[46] Sat2Flow: A Structure-Aware Diffusion Framework for Human Flow Generation from Satellite Imagery
Xiangxu Wang,Tianhong Zhao,Wei Tu,Bowen Zhang,Guanzhou Chen,Jinzhou Cao
Main category: cs.CV
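TL;DR: Sat2Flow is a new diffusion-based framework that generates structurally coherent human flows (OD flows) from satellite imagery alone, addressing existing methods' dependence on auxiliary features and sensitivity to spatial topology.
Details
Motivation: Existing OD flow generation methods depend on costly auxiliary features (such as POI and socioeconomic data) and are sensitive to spatial topology (for example, region reindexing breaks structural coherence), which limits their scalability and robustness.
Contribution: 1. Proposes Sat2Flow, an OD flow generation method that relies only on satellite imagery, removing the dependence on auxiliary data; 2. uses a multi-kernel encoder and a permutation-aware diffusion process so that generated flows stay structurally coherent under arbitrary region reindexing; 3. combines contrastive learning with equivariant diffusion training to achieve topological robustness and high accuracy.
Method: 1. A multi-kernel encoder captures diverse interactions between regions; 2. a permutation-aware diffusion process aligns latent representations across different region orderings; 3. a joint contrastive learning objective links satellite-derived features to OD patterns; 4. equivariant diffusion training enforces structural consistency.
Result: On real-world urban datasets, Sat2Flow outperforms physics-based and data-driven baselines in numerical accuracy while preserving the statistical distribution and spatial structure of flows under region reindexing.
Insight: Sat2Flow shows that satellite imagery alone can support OD flow generation, providing a scalable solution for data-scarce urban settings, while its structure-aware design improves robustness.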
Abstract: Origin-Destination (OD) flow matrices are essential for urban mobility analysis, underpinning applications in traffic forecasting, infrastructure planning, and policy design. However, existing methods suffer from two critical limitations: (1) reliance on auxiliary features (e.g., Points of Interest, socioeconomic statistics) that are costly to collect and have limited spatial coverage; and (2) sensitivity to spatial topology, where minor index reordering of urban regions (e.g., census tract relabeling) disrupts structural coherence in generated flows. To address these challenges, we propose Sat2Flow, a latent structure-aware diffusion-based framework that generates structurally coherent OD flows using solely satellite imagery as input. Our approach introduces a multi-kernel encoder to capture diverse regional interactions and employs a permutation-aware diffusion process that aligns latent representations across different regional orderings. Through a joint contrastive training objective that bridges satellite-derived features with OD patterns, combined with equivariant diffusion training that enforces structural consistency, Sat2Flow ensures topological robustness under arbitrary regional reindexing. Experimental results on real-world urban datasets demonstrate that Sat2Flow outperforms both physics-based and data-driven baselines in numerical accuracy while preserving empirical distributions and spatial structures under index permutations. Sat2Flow offers a globally scalable solution for OD flow generation in data-scarce urban environments, eliminating region-specific auxiliary data dependencies while maintaining structural invariance for robust mobility modeling.
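The reindexing property described in the abstract can be stated concretely: if the regions are reordered, the generated OD matrix should be the same matrix with its rows and columns reordered in the same way. The sketch below checks this for two toy generators. Both generators are hypothetical stand-ins and have nothing to do with the Sat2Flow model itself; they only show what passing and failing the check looks like.

```python
def permute_regions(features, perm):
    """Reorder per-region inputs so that new index i holds old region perm[i]."""
    return [features[k] for k in perm]

def permute_od(od, perm):
    """Apply the same reordering to both rows and columns of an OD matrix."""
    return [[od[i][j] for j in perm] for i in perm]

def is_equivariant(generator, features, perm):
    """True if generating after reindexing equals reindexing after generating."""
    return (generator(permute_regions(features, perm))
            == permute_od(generator(features), perm))

def toy_generator(features):
    """Hypothetical generator: flow depends only on the two regions' features."""
    return [[0.0 if i == j else fi * fj for j, fj in enumerate(features)]
            for i, fi in enumerate(features)]

def index_sensitive_generator(features):
    """Hypothetical generator that leaks the column index into the flow."""
    return [[fi * (j + 1) for j in range(len(features))] for fi in features]

region_features = [2.0, 3.0, 5.0]  # made-up per-region scalars
reindexing = [2, 0, 1]
```

The first generator passes the check and the second fails it, because its output depends on where a region sits in the list rather than on the region itself; that positional dependence is the failure mode the abstract attributes to existing methods.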
[47] Weed Detection in Challenging Field Conditions: A Semi-Supervised Framework for Overcoming Shadow Bias and Data Scarcity
Alzayat Saleh,Shunsuke Hatano,Mostafa Rahimi Azghadi
Main category: cs.CV
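TL;DR: The paper proposes a semi-supervised framework to address the shadow bias and data scarcity that hamper weed detection. Using a small amount of labeled data and a large amount of unlabeled data with pseudo-labeling, it improves model robustness and recall.
Details
Motivation: Weed detection in real fields faces complex environmental conditions (such as shadow interference) and the high cost of large-scale annotation; the paper aims to design an efficient, robust solution.
Contribution: 1) Reveals the shadow bias of deep learning models in weed detection; 2) proposes a pseudo-label-based semi-supervised framework that uses unlabeled data to improve performance; 3) validates the method on a public dataset and a self-collected dataset.
Method: 1) Builds baseline models with ResNet, YOLO, and RF-DETR; 2) diagnoses shadow bias with interpretability tools; 3) proposes a semi-supervised framework that trains on pseudo-labeled unlabeled data, increasing the visual diversity seen by the model.
Result: On the self-collected dataset, classification F1 reaches 0.90 and detection mAP50 exceeds 0.82; the semi-supervised framework improves recall and helps mitigate the shadow bias.
Insight: Semi-supervised learning not only addresses data scarcity but also introduces more visual diversity through pseudo-labels, improving robustness in complex environments.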
Abstract: The automated management of invasive weeds is critical for sustainable agriculture, yet the performance of deep learning models in real-world fields is often compromised by two factors: challenging environmental conditions and the high cost of data annotation. This study tackles both issues through a diagnostic-driven, semi-supervised framework. Using a unique dataset of approximately 975 labeled and 10,000 unlabeled images of Guinea Grass in sugarcane, we first establish strong supervised baselines for classification (ResNet) and detection (YOLO, RF-DETR), achieving F1 scores up to 0.90 and mAP50 scores exceeding 0.82. Crucially, this foundational analysis, aided by interpretability tools, uncovered a pervasive “shadow bias,” where models learned to misidentify shadows as vegetation. This diagnostic insight motivated our primary contribution: a semi-supervised pipeline that leverages unlabeled data to enhance model robustness. By training models on a more diverse set of visual information through pseudo-labeling, this framework not only helps mitigate the shadow bias but also provides a tangible boost in recall, a critical metric for minimizing weed escapes in automated spraying systems. To validate our methodology, we demonstrate its effectiveness in a low-data regime on a public crop-weed benchmark. Our work provides a clear and field-tested framework for developing, diagnosing, and improving robust computer vision systems for the complex realities of precision agriculture.
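A common way to implement the pseudo-labeling step mentioned in the abstract is to keep only the unlabeled images on which the current model is confident and treat its predictions as labels for the next training round. The sketch below shows that filtering step for classification. The 0.9 threshold, the data layout, and the function name are assumptions; the abstract does not describe the selection rule the authors used.

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Keep confident predictions on unlabeled images as pseudo-labels.

    predictions: list of (image_id, {class_name: probability}) pairs
    threshold:   minimum top-class probability to accept (assumed value)
    """
    selected = []
    for image_id, class_probs in predictions:
        label, confidence = max(class_probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            selected.append((image_id, label))
    return selected

# Made-up model outputs on three unlabeled images.
unlabeled_predictions = [
    ("img_001", {"weed": 0.97, "background": 0.03}),
    ("img_002", {"weed": 0.55, "background": 0.45}),  # too uncertain, dropped
    ("img_003", {"weed": 0.05, "background": 0.95}),
]
pseudo_labeled = select_pseudo_labels(unlabeled_predictions)
```

The accepted pairs would then be merged with the labeled set for retraining. A lower threshold admits more data but also more label noise, which matters here because a model with shadow bias can be confidently wrong.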
[48] MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
Zhiting Gao,Dan Song,Diqiong Jiang,Chao Xue,An-An Liu
Main category: cs.CV
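TL;DR: The paper proposes a new text-guided motion generation framework combining TAPO (aligned preference optimization) and MotionFlux (efficient flow matching), addressing the shortcomings of existing methods in semantic alignment and inference efficiency and achieving high-quality, real-time motion generation.
Details
Motivation: Existing text-driven motion generation methods are limited in semantic alignment and inference speed and struggle to meet the needs of real-time applications.
Contribution: Proposes the TAPO framework, which improves semantic alignment between text and motion, and MotionFlux, which achieves efficient real-time generation based on rectified flow matching.
Method: TAPO aligns textual modifiers with motion semantics through iterative adjustment; MotionFlux uses deterministic rectified flow matching to reduce the multi-step sampling required by traditional diffusion models.
Result: Experiments show that the method outperforms existing approaches in semantic consistency and motion quality while also substantially speeding up generation.
Insight: Combining preference optimization for semantic alignment with efficient flow matching can greatly improve inference efficiency without sacrificing generation quality, offering a new direction for real-time motion generation.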
Abstract: Motion generation is essential for animating virtual characters and embodied agents. While recent text-driven methods have made significant strides, they often struggle with achieving precise alignment between linguistic descriptions and motion semantics, as well as with the inefficiencies of slow, multi-step inference. To address these issues, we introduce TMR++ Aligned Preference Optimization (TAPO), an innovative framework that aligns subtle motion variations with textual modifiers and incorporates iterative adjustments to reinforce semantic grounding. To further enable real-time synthesis, we propose MotionFLUX, a high-speed generation framework based on deterministic rectified flow matching. Unlike traditional diffusion models, which require hundreds of denoising steps, MotionFLUX constructs optimal transport paths between noise distributions and motion spaces, facilitating real-time synthesis. The linearized probability paths reduce the need for multi-step sampling typical of sequential methods, significantly accelerating inference time without sacrificing motion quality. Experimental results demonstrate that, together, TAPO and MotionFLUX form a unified system that outperforms state-of-the-art approaches in both semantic consistency and motion quality, while also accelerating generation speed. The code and pretrained models will be released.
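The reason straight ("linearized") probability paths need few sampling steps can be seen in a small sketch of rectified flow. A noise sample x0 and a data sample x1 are joined by the line x_t = (1 - t) x0 + t x1, whose velocity x1 - x0 is constant, so Euler integration follows it exactly. The sketch uses an oracle velocity in place of a trained, text-conditioned network, and all names and values are assumptions rather than the MotionFLUX code.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network on a straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy 2-D example with made-up values standing in for a motion vector.
noise = [0.0, 0.0]
motion = [2.0, -4.0]

def oracle_velocity(x, t):
    """Stand-in for a learned network: returns the exact straight-path velocity."""
    return target_velocity(noise, motion)
```

With this oracle, one Euler step already lands on the target, whereas a curved diffusion trajectory would need many small steps. A learned velocity field is only approximately straight, so real systems still use a handful of steps.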
[49] CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
Nannan Zhu,Yonghao Dong,Teng Wang,Xueqian Li,Shengjun Deng,Yijia Wang,Zheng Hong,Tiantian Geng,Guo Niu,Hanyan Huang,Xiongfei Yao,Shuaiwei Jiao
Main category: cs.CV
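TL;DR: CVBench is the first comprehensive benchmark for evaluating cross-video relational reasoning. It covers three tiers of tasks and exposes the performance bottlenecks of current multimodal large language models on multi-video tasks.
Details
Motivation: Multimodal large language models (MLLMs) perform well on single-video tasks, but their abilities across multiple videos remain underexplored, even though this capability is essential for real-world applications such as multi-camera surveillance.
Contribution: Introduces the CVBench benchmark, with 1,000 cross-video question-answer pairs organized into three tiers (object association, event association, complex reasoning), and evaluates more than 10 leading models.
Method: Builds domain-diverse video clusters (e.g., sports, life records) and tiered tasks, then evaluates models under zero-shot or chain-of-thought prompting.
Result: Even top models such as GPT-4o reach only 60% accuracy on causal reasoning tasks, far below the 91% achieved by humans, which reveals weak inter-video context retention and poor entity disambiguation.
Insight: Current MLLMs face a significant bottleneck on multi-video tasks and need architectural improvements to reason across videos. CVBench offers a diagnostic framework and design guidance for that work.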
Abstract: While multimodal large language models (MLLMs) exhibit strong performance on single-video tasks (e.g., video question answering), their ability across multiple videos remains critically underexplored. However, this capability is essential for real-world applications, including multi-camera surveillance and cross-video procedural learning. To bridge this gap, we present CVBench, the first comprehensive benchmark designed to assess cross-video relational reasoning rigorously. CVBench comprises 1,000 question-answer pairs spanning three hierarchical tiers: cross-video object association (identifying shared entities), cross-video event association (linking temporal or causal event chains), and cross-video complex reasoning (integrating commonsense and domain knowledge). Built from five domain-diverse video clusters (e.g., sports, life records), the benchmark challenges models to synthesise information across dynamic visual contexts. We conduct an extensive evaluation of 10+ leading MLLMs (including GPT-4o, Gemini-2.0-flash, Qwen2.5-VL) under zero-shot or chain-of-thought prompting paradigms. Key findings reveal stark performance gaps: even top models, such as GPT-4o, achieve only 60% accuracy on causal reasoning tasks, compared to the 91% accuracy of human performance. Crucially, our analysis reveals fundamental bottlenecks inherent in current MLLM architectures, notably deficient inter-video context retention and poor disambiguation of overlapping entities. CVBench establishes a rigorous framework for diagnosing and advancing multi-video reasoning, offering architectural insights for next-generation MLLMs. The data and evaluation code are available at https://github.com/Hokhim2/CVBench.
[50] WEBEYETRACK: Scalable Eye-Tracking for the Browser via On-Device Few-Shot Personalization
Eduardo Davalos,Yike Zhang,Namrata Srivastava,Yashvitha Thatigotla,Jorge A. Salas,Sara McFadden,Sun-Joo Cho,Amanda Goodwin,Ashwin TS,Gautam Biswas
Main category: cs.CV
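TL;DR: WebEyeTrack is a lightweight eye-tracking framework that runs in the browser. It combines model-based head pose estimation with few-shot learning to achieve high accuracy and real-time performance.
Details
Motivation: Existing gaze estimation methods often leave model size, inference time, and privacy unaddressed, and webcam-based eye tracking loses accuracy, particularly under head movement.
Contribution: Proposes the WebEyeTrack framework, which integrates lightweight SOTA gaze estimation models, supports few-shot learning, and achieves high accuracy with real-time performance.
Method: Combines model-based head pose estimation with on-device few-shot learning from as few as nine calibration samples, which markedly improves the stability of gaze estimation.
Result: Achieves an error of 2.32 cm on the GazeCapture dataset and an inference time of 2.4 ms on an iPhone 14.
Insight: The paper shows how efficient, privacy-friendly eye tracking can be implemented in the browser, which makes practical deployment feasible.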
Abstract: With advancements in AI, new gaze estimation methods are exceeding state-of-the-art (SOTA) benchmarks, but their real-world application reveals a gap with commercial eye-tracking solutions. Factors like model size, inference time, and privacy often go unaddressed. Meanwhile, webcam-based eye-tracking methods lack sufficient accuracy, in particular due to head movement. To tackle these issues, we introduce WebEyeTrack, a framework that integrates lightweight SOTA gaze estimation models directly in the browser. It incorporates model-based head pose estimation and on-device few-shot learning with as few as nine calibration samples (k ≤ 9). WebEyeTrack adapts to new users, achieving SOTA performance with an error margin of 2.32 cm on GazeCapture and real-time inference speeds of 2.4 milliseconds on an iPhone 14. Our open-source code is available at https://github.com/RedForestAi/WebEyeTrack.
[51] Multimodal Prototype Alignment for Semi-supervised Pathology Image Segmentation
Mingxi Fu,Fanglei Fu,Xitong Ling,Huaitian Yuan,Tian Guan,Yonghong He,Lianghui Zhu
Main category: cs.CV
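TL;DR: MPAMatch is a semi-supervised pathology image segmentation framework that uses multimodal prototype alignment and dual contrastive learning to substantially improve semantic boundary modeling and structural modeling.
Details
Motivation: Pathology image segmentation suffers from ambiguous semantic boundaries and high annotation costs. Existing semi-supervised methods rely mainly on perturbation consistency within the image modality, which makes high-level semantic priors hard to capture.
Contribution: Proposes MPAMatch, which introduces text prototype supervision into segmentation for the first time and supervises at both structural and semantic levels through dual contrastive learning (image prototypes vs. pixel labels, and text prototypes vs. pixel labels).
Method: Applies a multimodal prototype alignment strategy and rebuilds the TransUNet architecture around a pathology-pretrained foundation model (Uni), yielding coarse-to-fine supervision.
Result: On multiple datasets (GLAS, EBHI-SEG-GLAND, and others), MPAMatch outperforms existing methods and shows advantages in both structural and semantic modeling.
Insight: Multimodal prototype alignment and dual contrastive learning are effective ways to improve pathology image segmentation, and text prototype supervision offers a new approach to semantic modeling.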
Abstract: Pathological image segmentation faces numerous challenges, particularly due to ambiguous semantic boundaries and the high cost of pixel-level annotations. Although recent semi-supervised methods based on consistency regularization (e.g., UniMatch) have made notable progress, they mainly rely on perturbation-based consistency within the image modality, making it difficult to capture high-level semantic priors, especially in structurally complex pathology images. To address these limitations, we propose MPAMatch - a novel segmentation framework that performs pixel-level contrastive learning under a multimodal prototype-guided supervision paradigm. The core innovation of MPAMatch lies in the dual contrastive learning scheme between image prototypes and pixel labels, and between text prototypes and pixel labels, providing supervision at both structural and semantic levels. This coarse-to-fine supervisory strategy not only enhances the discriminative capability on unlabeled samples but also introduces the text prototype supervision into segmentation for the first time, significantly improving semantic boundary modeling. In addition, we reconstruct the classic segmentation architecture (TransUNet) by replacing its ViT backbone with a pathology-pretrained foundation model (Uni), enabling more effective extraction of pathology-relevant features. Extensive experiments on GLAS, EBHI-SEG-GLAND, EBHI-SEG-CANCER, and KPI show MPAMatch’s superiority over state-of-the-art methods, validating its dual advantages in structural and semantic modeling.
[52] High-Speed FHD Full-Color Video Computer-Generated Holography
Haomiao Zhang,Miao Cao,Xuan Yu,Hui Luo,Yanling Piao,Mengjie Qin,Zhangyuan Li,Ping Wang,Xin Yuan
Main category: cs.CV
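TL;DR: The paper proposes a high-speed, full-color video computer-generated holography (CGH) scheme. Spectrum-Guided Depth Division Multiplexing (SGDDM) and HoloMamba, a lightweight Mamba-Unet architecture, address color crosstalk and computational inefficiency in high-frame-rate full-color display.
Details
Motivation: Existing learning-based models produce over-smoothed phases with narrow angular spectra in high-frame-rate full-color display, which causes color crosstalk. Frame-by-frame optimization methods ignore spatial-temporal correlations and are therefore computationally inefficient. The paper targets both problems.
Contribution: Proposes SGDDM, which optimizes phase distributions via frequency modulation, and designs HoloMamba, which explicitly models spatial-temporal correlations across video sequences to improve reconstruction quality and computational efficiency.
Method: Combines SGDDM (frequency modulation to optimize phase) with HoloMamba (a Mamba-Unet architecture that models spatial-temporal correlations) to generate high-speed full-color video CGH.
Result: SGDDM achieves high-fidelity full-color display, and HoloMamba generates 1080p full-color holographic video at over 260 FPS, more than 2.6$\times$ faster than the prior state of the art.
Insight: Pairing spectrum modulation with a lightweight architecture gives an efficient solution for high-frame-rate full-color holographic display, and explicitly modeling spatial-temporal correlations is key to better video reconstruction quality.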
Abstract: Computer-generated holography (CGH) is a promising technology for next-generation displays. However, generating high-speed, high-quality holographic video requires both high frame rate display and efficient computation, but is constrained by two key limitations: ($i$) Learning-based models often produce over-smoothed phases with narrow angular spectra, causing severe color crosstalk in high frame rate full-color displays such as depth-division multiplexing and thus resulting in a trade-off between frame rate and color fidelity. ($ii$) Existing frame-by-frame optimization methods typically optimize frames independently, neglecting spatial-temporal correlations between consecutive frames and leading to computationally inefficient solutions. To overcome these challenges, in this paper, we propose a novel high-speed full-color video CGH generation scheme. First, we introduce Spectrum-Guided Depth Division Multiplexing (SGDDM), which optimizes phase distributions via frequency modulation, enabling high-fidelity full-color display at high frame rates. Second, we present HoloMamba, a lightweight asymmetric Mamba-Unet architecture that explicitly models spatial-temporal correlations across video sequences to enhance reconstruction quality and computational efficiency. Extensive simulated and real-world experiments demonstrate that SGDDM achieves high-fidelity full-color display without compromise in frame rate, while HoloMamba generates FHD (1080p) full-color holographic video at over 260 FPS, more than 2.6$\times$ faster than the prior state-of-the-art Divide-Conquer-and-Merge Strategy.
[53] Guiding Noisy Label Conditional Diffusion Models with Score-based Discriminator Correction
Dat Nguyen Cong,Hieu Tran Bao,Hoang Thanh-Tung
Main category: cs.CV
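TL;DR: The paper proposes Score-based Discriminator Correction (SBDC), a guidance technique for aligning pre-trained conditional diffusion models that were trained on noisy labels. It relies on a discriminator trained with adversarial loss to correct for label noise, and experiments show it outperforms prior methods.
Details
Motivation: Large-scale datasets often contain manual labeling errors, but the effect of such noise on the generative capability and controllability of diffusion models has not been well studied, so an efficient correction method is needed.
Contribution: Proposes SBDC, which corrects for noisy labels through discriminator training with adversarial loss, improves generation quality, and does not require retraining the diffusion model.
Method: Uses score-based discriminator correction built on adversarial loss and prior noise detection techniques, and restricts the guidance to the early phase of generation for better performance.
Result: Across several noise settings, SBDC outperforms previous state-of-the-art methods while remaining computationally efficient, with only a marginal increase in inference time.
Insight: Guidance in the early phase of generation is what matters most; applying the correction sparingly balances generation quality against computational cost.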
Abstract: Diffusion models have gained prominence as state-of-the-art techniques for synthesizing images and videos, particularly due to their ability to scale effectively with large datasets. Recent studies have uncovered that these extensive datasets often contain mistakes from manual labeling processes. However, the extent to which such errors compromise the generative capabilities and controllability of diffusion models is not well studied. This paper introduces Score-based Discriminator Correction (SBDC), a guidance technique for aligning noisy pre-trained conditional diffusion models. The guidance is built on discriminator training using adversarial loss, drawing on prior noise detection techniques to assess the authenticity of each sample. We further show that limiting the usage of our guidance to the early phase of the generation process leads to better performance. Our method is computationally efficient, only marginally increases inference time, and does not require retraining diffusion models. Experiments on different noise settings demonstrate the superiority of our method over previous state-of-the-art methods.
[54] Generalizing Monocular 3D Object Detection
Abhinav Kumar
Main category: cs.CV
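TL;DR: The thesis proposes methods for improving the generalization of monocular 3D object detection (Mono3D), covering occlusion handling, adaptation to new datasets, large objects, and varying camera parameters.
Details
Motivation: Monocular 3D object detection is critical for applications such as autonomous driving, but its performance degrades under occlusion, on new datasets, on large objects, and under different camera parameters, so better generalization is needed.
Contribution: Proposes GrooMeD-NMS for occlusion robustness, the DEVIANT backbone for generalization to new datasets, and SeaBird for large object detection, and analyzes how models extrapolate to changes in camera height.
Method: Includes a mathematically differentiable NMS (GrooMeD-NMS), a depth-equivariant backbone (DEVIANT), the BEV segmentation-based SeaBird approach, and a mathematical analysis of extrapolation across camera parameters.
Result: The thesis reports improved performance under occlusion, across datasets, on large objects, and under camera parameter changes, which supports the effectiveness of the methods.
Insight: Occlusion can be addressed through mathematical optimization, noise sensitivity is a key factor in large object detection, and generalization across camera parameters benefits from explicit mathematical modeling.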
Abstract: Monocular 3D object detection (Mono3D) is a fundamental computer vision task that estimates an object’s class, 3D position, dimensions, and orientation from a single image. Its applications, including autonomous driving, augmented reality, and robotics, critically rely on accurate 3D environmental understanding. This thesis addresses the challenge of generalizing Mono3D models to diverse scenarios, including occlusions, datasets, object sizes, and camera parameters. To enhance occlusion robustness, we propose a mathematically differentiable NMS (GrooMeD-NMS). To improve generalization to new datasets, we explore depth equivariant (DEVIANT) backbones. We address the issue of large object detection, demonstrating that it’s not solely a data imbalance or receptive field problem but also a noise sensitivity issue. To mitigate this, we introduce a segmentation-based approach in bird’s-eye view with dice loss (SeaBird). Finally, we mathematically analyze the extrapolation of Mono3D models to unseen camera heights and improve Mono3D generalization in such out-of-distribution settings.
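The dice loss mentioned in the abstract can be sketched in a few lines. This is the standard soft dice formulation over flattened per-pixel probabilities, not the SeaBird code; the function name `soft_dice_loss` and the smoothing constant `eps` are assumptions made for illustration.

```python
from typing import Sequence


def soft_dice_loss(pred: Sequence[float], target: Sequence[float],
                   eps: float = 1e-6) -> float:
    """Soft dice loss between predicted probabilities and a binary mask.

    Both inputs are flattened per-pixel values in [0, 1].
    The loss is 0 for a perfect match and approaches 1 for no overlap.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


if __name__ == "__main__":
    mask = [1.0, 1.0, 0.0, 0.0]
    print(soft_dice_loss(mask, mask))
    print(soft_dice_loss([1.0, 0.0], [0.0, 1.0]))
```

The loss is normalized by the total predicted and true foreground area rather than computed independently per pixel, which is the property usually cited when dice loss is preferred for region-level objectives.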
[55] UTAL-GNN: Unsupervised Temporal Action Localization using Graph Neural Networks
Bikash Kumar Badatya,Vipul Baghel,Ravi Hegde
Main category: cs.CV
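TL;DR: The paper proposes UTAL-GNN, an unsupervised skeleton-based action localization method that uses graph neural networks and an Action Dynamics Metric (ADM) for efficient real-time action analysis, with performance comparable to supervised methods.
Details
Motivation: Existing action localization methods depend on large annotated datasets and complex models, which makes them computationally expensive and hard to adapt to real-world scenarios. The paper aims for a lightweight unsupervised method that needs no annotation.
Contribution: 1. Proposes UTAL-GNN, an unsupervised action localization method based on graph neural networks; 2. Designs the Action Dynamics Metric (ADM) to detect action boundaries directly; 3. Matches supervised methods in both performance and computational efficiency.
Method: 1. Pre-trains an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on pose-sequence denoising; 2. Computes the ADM on low-dimensional embeddings and detects action boundaries from inflection points in its curvature.
Result: Achieves 82.66% mAP on the DSV Diving dataset with an average localization latency of 29.09 ms, and generalizes to new data without retraining.
Insight: Unsupervised methods can capture action dynamics effectively with graph neural networks, and the ADM offers a new approach to unsupervised action boundary detection.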
Abstract: Fine-grained action localization in untrimmed sports videos presents a significant challenge due to rapid and subtle motion transitions over short durations. Existing supervised and weakly supervised solutions often rely on extensive annotated datasets and high-capacity models, making them computationally intensive and less adaptable to real-world scenarios. In this work, we introduce a lightweight and unsupervised skeleton-based action localization pipeline that leverages spatio-temporal graph neural representations. Our approach pre-trains an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on a pose-sequence denoising task with blockwise partitions, enabling it to learn intrinsic motion dynamics without any manual labeling. At inference, we define a novel Action Dynamics Metric (ADM), computed directly from low-dimensional ASTGCN embeddings, which detects motion boundaries by identifying inflection points in its curvature profile. Our method achieves a mean Average Precision (mAP) of 82.66% and average localization latency of 29.09 ms on the DSV Diving dataset, matching state-of-the-art supervised performance while maintaining computational efficiency. Furthermore, it generalizes robustly to unseen, in-the-wild diving footage without retraining, demonstrating its practical applicability for lightweight, real-time action analysis systems in embedded or dynamic environments.
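The boundary-detection step, finding inflection points in a curvature profile, can be sketched on a plain 1D signal. The abstract does not define how the ADM itself is computed from ASTGCN embeddings, so the input here is simply assumed to be one ADM value per frame, and `curvature_sign_flips` is a hypothetical helper that uses the discrete second difference as a curvature proxy.

```python
from typing import List, Sequence


def curvature_sign_flips(signal: Sequence[float],
                         eps: float = 1e-9) -> List[int]:
    """Return frame indices where the discrete curvature changes sign.

    Curvature at frame i is approximated by the second difference
    signal[i-1] - 2*signal[i] + signal[i+1]. Values within eps of zero
    are treated as flat and skipped, so a flip is reported at the first
    frame that carries the new sign.
    """
    flips: List[int] = []
    prev_sign = 0
    for i in range(1, len(signal) - 1):
        c = signal[i - 1] - 2.0 * signal[i] + signal[i + 1]
        sign = 1 if c > eps else (-1 if c < -eps else 0)
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                flips.append(i)
            prev_sign = sign
    return flips


if __name__ == "__main__":
    # Toy per-frame metric: accelerating rise, then decelerating rise.
    adm = [0.0, 1.0, 4.0, 9.0, 14.0, 17.0, 18.0]
    print(curvature_sign_flips(adm))
```

On real embeddings the signal would likely need smoothing before differencing, since second differences amplify frame-level noise.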
[56] IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising
Dongjin Kim,Jaekyun Ko,Muhammad Kashif Ali,Tae Hyun Kim
Main category: cs.CV
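TL;DR: The paper proposes IDF, a network that denoises images with dynamically generated kernels, addressing the strong dependence of existing methods on specific noise distributions and their weak generalization.
Details
Motivation: Existing deep learning methods for image denoising depend on specific noise distributions, generalize poorly, and overfit easily. A more efficient method with stronger generalization is needed.
Contribution: 1. Proposes the IDF network, which achieves generalizable denoising through dynamic kernel generation; 2. Designs feature extraction, global statistics, and local correlation modules to capture noise characteristics comprehensively; 3. Achieves efficient, high-quality denoising through iterative kernel prediction.
Method: 1. A feature extraction module extracts noise-invariant features; 2. Global statistics and local correlation modules analyze noise characteristics and structural correlations; 3. A kernel prediction module generates pixel-wise varying kernels that are applied iteratively for denoising.
Result: The model has only 0.04M parameters and performs well across many noise types and levels, even though it is trained only on single-level Gaussian noise.
Insight: Dynamic kernel generation combined with iterative denoising markedly improves adaptation to unseen noise while remaining efficient.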
Abstract: Image denoising is a fundamental challenge in computer vision, with applications in photography and medical imaging. While deep learning-based methods have shown remarkable success, their reliance on specific noise distributions limits generalization to unseen noise types and levels. Existing approaches attempt to address this with extensive training data and high computational resources, but they still suffer from overfitting. To address these issues, we conduct image denoising by utilizing dynamically generated kernels via efficient operations. This approach helps prevent overfitting and improves resilience to unseen noise. Specifically, our method leverages a Feature Extraction Module for robust noise-invariant features, and Global Statistics and Local Correlation Modules to capture comprehensive noise characteristics and structural correlations. The Kernel Prediction Module then employs these cues to produce pixel-wise varying kernels adapted to local structures, which are then applied iteratively for denoising. This ensures both efficiency and superior restoration quality. Despite being trained on single-level Gaussian noise, our compact model (~0.04M parameters) excels across diverse noise types and levels, demonstrating the promise of iterative dynamic filtering for practical image denoising.
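The core loop described above (predict a separate kernel for every pixel, apply it, repeat) can be sketched without any learned components. This is only an illustration of the data flow: `predict_kernels` is a hypothetical hand-written stand-in that weights 3x3 neighbors by intensity similarity, whereas the paper's Kernel Prediction Module is a trained network whose details are not given in the abstract.

```python
import math
from typing import List

Image = List[List[float]]
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _clamp(v: int, lo: int, hi: int) -> int:
    return max(lo, min(v, hi))


def predict_kernels(image: Image, sigma: float = 1.0) -> List[List[List[float]]]:
    """Toy stand-in for a learned predictor: one normalized 3x3 kernel per pixel."""
    h, w = len(image), len(image[0])
    kernels = []
    for y in range(h):
        row = []
        for x in range(w):
            weights = []
            for dy, dx in OFFSETS:
                ny, nx = _clamp(y + dy, 0, h - 1), _clamp(x + dx, 0, w - 1)
                diff = image[ny][nx] - image[y][x]
                weights.append(math.exp(-(diff * diff) / (sigma * sigma)))
            total = sum(weights)
            row.append([wt / total for wt in weights])
        kernels.append(row)
    return kernels


def apply_kernels(image: Image, kernels: List[List[List[float]]]) -> Image:
    """Filter each pixel with its own kernel (edges use clamped neighbors)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, (dy, dx) in zip(kernels[y][x], OFFSETS):
                ny, nx = _clamp(y + dy, 0, h - 1), _clamp(x + dx, 0, w - 1)
                acc += k * image[ny][nx]
            out[y][x] = acc
    return out


def iterative_dynamic_filter(image: Image, num_iters: int = 3) -> Image:
    """Re-predict kernels from the current estimate and apply them, num_iters times."""
    for _ in range(num_iters):
        image = apply_kernels(image, predict_kernels(image))
    return image
```

Because each kernel is normalized and non-negative, every output pixel is a convex combination of its neighbors, so the filtered values stay within the input range.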
[57] Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models
Hou Xia,Zheren Fu,Fangcan Ling,Jiajun Li,Yi Tu,Zhendong Mao,Yongdong Zhang
Main category: cs.CV
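TL;DR: Video-LevelGauge is a benchmark designed to evaluate contextual positional bias in large video language models (LVLMs). It uses standardized probes and customized contextual setups to simulate diverse scenarios, and reveals significant positional bias in many open-source models, whereas commercial models such as Gemini2.5-Pro behave more consistently.
Details
Motivation: Existing benchmarks usually assess overall performance across entire video sequences and overlook contextual positional bias in LVLMs, a critical but under-explored issue, so a systematic evaluation method is needed.
Contribution: Proposes the Video-LevelGauge benchmark, which combines standardized probes with customized contextual setups to assess positional bias in LVLMs comprehensively, and provides extensive analysis and practical insights.
Method: Uses standardized probes and customized contextual setups together with statistical measures and morphological pattern recognition, producing 1,177 high-quality multiple-choice questions and 120 open-ended questions to evaluate 27 state-of-the-art LVLMs.
Result: Many open-source models show significant head or neighbor-content preference biases, while commercial models perform more consistently. Further analysis yields practical guidance for reducing bias and improving models.
Insight: Contextual positional bias is an important factor in evaluating LVLMs. The consistency of commercial models shows that the bias can be reduced, and open-source models need further work to do so.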
Abstract: Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini2.5-Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement.
[58] Scalable Object Detection in the Car Interior With Vision Foundation Models
Bálint Mészáros,Ahmet Firintepe,Sebastian Schmidt,Stephan Günnemann
Main category: cs.CV
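TL;DR: The paper proposes ODAL, a framework for object detection and localization in the car interior that works around limited on-board resources by splitting computation between the vehicle and the cloud. Fine-tuning a lightweight model yields a large performance gain.
Details
Motivation: On-board systems have limited resources and cannot run large vision foundation models directly, which limits performance on in-cabin object detection and localization.
Contribution: 1. Proposes the ODAL framework, built on a distributed on-board and cloud architecture; 2. Introduces the ODALbench evaluation metric; 3. Substantially improves performance by fine-tuning a lightweight model (LLaVA 1.5 7B).
Method: Uses vision foundation models (such as GPT-4o and LLaVA 1.5 7B), distributes tasks through the distributed architecture, and fine-tunes the lightweight model.
Result: The fine-tuned ODAL-LLaVA model reaches an ODAL$_{score}$ of 89%, outperforming GPT-4o by nearly 20%, with far fewer hallucinations.
Insight: With fine-tuning, a lightweight model can improve markedly and even surpass a large model while requiring fewer computational resources.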
Abstract: AI tasks in the car interior, such as identifying and localizing externally introduced objects, are crucial for the response quality of personal assistants. However, computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board and cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework’s potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight model's performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL$_{SNR}$ three times higher than GPT-4o.
[59] Self-Rewarding Vision-Language Model via Reasoning Decomposition
Zongxia Li,Wenhao Yu,Chengsong Huang,Rui Liu,Zhenwen Liang,Fuxiao Liu,Jingxi Che,Dian Yu,Jordan Boyd-Graber,Haitao Mi,Dong Yu
Main category: cs.CV
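TL;DR: The paper proposes Vision-SR1, a self-rewarding method that decomposes vision-language model reasoning into two stages, visual perception and language reasoning, and improves visual reasoning without external supervision.
Details
Motivation: Existing vision-language models (VLMs) suffer from visual hallucinations and language shortcuts because training supervises only final outputs and gives no guidance on intermediate visual reasoning.
Contribution: Proposes Vision-SR1, which decomposes the reasoning process and uses a self-reward mechanism to provide a balanced training signal for both visual perception and language reasoning.
Method: Splits VLM reasoning into visual perception and language reasoning. The model first generates a self-contained visual perception, and the reward is then computed from reasoning over that perception alone, which strengthens the training signal.
Result: Experiments show that Vision-SR1 improves visual reasoning and reduces visual hallucinations and language shortcuts.
Insight: The self-reward mechanism offers a way to improve VLM reasoning without external annotation, and it adapts as the model's policy evolves.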
Abstract: Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision via reinforcement learning. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input to compute reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
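The two-stage reward described in the abstract can be sketched as follows. This is a rough illustration under several assumptions: `model` is a hypothetical callable `(prompt, image_or_None) -> str`, the prompts are invented, answers are compared by exact string match, and the `perception_weight` used to combine the two rewards is a placeholder rather than a value from the paper.

```python
from typing import Any, Callable, Optional

Model = Callable[[str, Optional[Any]], str]


def _matches(answer: str, gold: str) -> bool:
    return answer.strip().lower() == gold.strip().lower()


def self_reward(model: Model, image: Any, question: str, gold_answer: str,
                perception_weight: float = 0.5) -> float:
    """Combine a final-answer reward with a perception self-reward.

    Stage 1: the model describes the image in a self-contained way.
    Stage 2: the same model answers from that description only (no image).
    The perception is rewarded if the text-only answer is correct.
    """
    perception = model(
        f"Describe everything in the image needed to answer: {question}", image)
    text_only_answer = model(
        f"Context: {perception}\nQuestion: {question}", None)
    final_answer = model(f"Question: {question}", image)

    r_perception = 1.0 if _matches(text_only_answer, gold_answer) else 0.0
    r_answer = 1.0 if _matches(final_answer, gold_answer) else 0.0
    return r_answer + perception_weight * r_perception
```

In this sketch a model that answers correctly from the image but writes a description too vague to answer from scores lower than one whose description is sufficient, which is the behavior the self-containment check is meant to encourage.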
[60] Hardware-aware vs. Hardware-agnostic Energy Estimation for SNN in Space Applications
Matthias Höfflin,Jürgen Wassner
Main category: cs.CV
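TL;DR: The paper compares hardware-aware and hardware-agnostic energy estimation methods for SNNs in space applications and finds that SNNs deliver significant energy savings only on neuromorphic hardware and with high input sparsity.
Details
Motivation: SNNs have long been considered energy-efficient, but recent studies question that efficiency for digital implementations, which matters for resource-constrained space applications. The paper compares how SNNs fare under different energy estimation methods.
Contribution: Proposes an SNN trained on the membrane potential of LIF neurons that reaches an MSE comparable to a CNN on 3-D satellite position estimation, and shows that the energy advantage of SNNs under hardware-aware analysis holds only under specific conditions.
Method: Trains the SNN using the membrane potential of the LIF neurons in the final layer, and compares the energy consumption estimated by hardware-aware and hardware-agnostic methods.
Result: The energy advantage of SNNs is significant only on neuromorphic hardware and with high input sparsity; hardware-agnostic methods overestimate it.
Insight: Data characteristics and hardware assumptions are critical when assessing SNN energy efficiency, so evaluation methods should be transparent and assumptions explicit to keep comparisons fair.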
Abstract: Spiking Neural Networks (SNNs), inspired by biological intelligence, have long been considered inherently energy-efficient, making them attractive for resource-constrained domains such as space applications. However, recent comparative studies with conventional Artificial Neural Networks (ANNs) have begun to question this reputation, especially for digital implementations. This work investigates SNNs for multi-output regression, specifically 3-D satellite position estimation from monocular images, and compares hardware-aware and hardware-agnostic energy estimation methods. The proposed SNN, trained using the membrane potential of the Leaky Integrate-and-Fire (LIF) neuron in the final layer, achieves comparable Mean Squared Error (MSE) to a reference Convolutional Neural Network (CNN) on a photorealistic satellite dataset. Energy analysis shows that while hardware-agnostic methods predict a consistent 50-60% energy advantage for SNNs over CNNs, hardware-aware analysis reveals that significant energy savings are realized only on neuromorphic hardware and with high input sparsity. The influence of dark pixel ratio on energy consumption is quantified, emphasizing the impact of data characteristics and hardware assumptions. These findings highlight the need for transparent evaluation methods and explicit disclosure of underlying assumptions to ensure fair comparisons of neural network energy efficiency.
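A hardware-agnostic estimate of the kind discussed above typically counts operations and multiplies by a per-operation energy. The sketch below follows that common recipe and is not the paper's method: the per-operation energies are the frequently quoted 45 nm figures for 32-bit floating-point operations, used here purely as assumptions, and the layer sizes, spike rates, and timestep counts are made up.

```python
# Assumed per-operation energies in picojoules (commonly quoted 45 nm,
# 32-bit floating-point figures; treat them as placeholders).
E_MAC_PJ = 4.6  # multiply-accumulate, used by conventional ANNs/CNNs
E_AC_PJ = 0.9   # accumulate only, used by spike-driven SNN layers


def ann_energy_pj(num_macs: float) -> float:
    """Energy of a dense network: every MAC is executed once."""
    return num_macs * E_MAC_PJ


def snn_energy_pj(num_synaptic_ops: float, spike_rate: float,
                  timesteps: int) -> float:
    """Energy of a spiking network: an accumulate happens only on a spike.

    spike_rate is the average fraction of inputs that spike per timestep.
    """
    return num_synaptic_ops * spike_rate * timesteps * E_AC_PJ


if __name__ == "__main__":
    ops = 1_000_000
    print(ann_energy_pj(ops))
    print(snn_energy_pj(ops, spike_rate=0.1, timesteps=4))   # sparse input
    print(snn_energy_pj(ops, spike_rate=1.0, timesteps=8))   # dense input
```

Even in this simplified model the SNN advantage disappears once the product of spike rate and timesteps grows large enough. The model also ignores memory access and control overhead, which are exactly the terms a hardware-aware analysis adds.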
[61] SAT: Supervisor Regularization and Animation Augmentation for Two-process Monocular Texture 3D Human Reconstruction
Gangjian Zhang,Jian Shu,Nanjie Yao,Hao Wang
Main category: cs.CV
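TL;DR: The paper proposes SAT, a two-process method for monocular textured 3D human reconstruction that uses a multi-view supervisor network and online animation augmentation to address geometric ambiguity and data scarcity.
Details
Motivation: Reconstructing a 3D human from a single image suffers from geometric ambiguity and scarce 3D training data, and existing methods struggle to integrate multiple geometric priors, which leads to problems such as view inconsistency.
Contribution: Proposes the two-process framework SAT, with a Supervisor Feature Regularization module that fuses geometric priors more effectively and an Online Animation Augmentation module that mitigates data scarcity.
Method: Uses a two-process framework in which a multi-view network provides feature supervision and online animation augmentation generates large amounts of training data.
Result: Outperforms existing methods on two benchmarks.
Insight: Multi-view feature supervision and online data augmentation can markedly improve the quality and consistency of monocular 3D human reconstruction.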
Abstract: Monocular texture 3D human reconstruction aims to create a complete 3D digital avatar from just a single front-view human RGB image. However, the geometric ambiguity inherent in a single 2D image and the scarcity of 3D human training data are the main obstacles limiting progress in this field. To address these issues, current methods employ prior geometric estimation networks to derive various human geometric forms, such as the SMPL model and normal maps. However, they struggle to integrate these modalities effectively, leading to view inconsistencies, such as facial distortions. To this end, we propose a two-process 3D human reconstruction framework, SAT, which seamlessly learns various prior geometries in a unified manner and reconstructs high-quality textured 3D avatars as the final output. To further facilitate geometry learning, we introduce a Supervisor Feature Regularization module. By employing a multi-view network with the same structure to provide intermediate features as training supervision, these varied geometric priors can be better fused. To tackle data scarcity and further improve reconstruction quality, we also propose an Online Animation Augmentation module. By building a one-feed-forward animation network, we augment a massive number of samples from the original 3D human data online for model training. Extensive experiments on two benchmarks show the superiority of our approach compared to state-of-the-art methods.
[62] Synthetic Image Detection via Spectral Gaps of QC-RBIM Nishimori Bethe-Hessian Operators
V. S. Usatyuk,D. A. Sapozhnikov,S. I. Egorov
Main category: cs.CV
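TL;DR: Proposes an unsupervised method based on QC-RBIM Nishimori Bethe-Hessian operators that identifies synthetic images by detecting community structure in a graph built from image features. It needs no labeled data and no retraining of the feature extractor, and reaches over 94% accuracy.
Details
Motivation: Deep generative models such as GANs and diffusion networks now produce images that are nearly indistinguishable from real photographs, which threatens media forensics and biometric security. Supervised detectors fail on unseen generators or after adversarial post-processing, and existing unsupervised methods rely on low-level statistical cues and remain fragile.
Contribution: 1. Proposes a new QC-LDPC graph construction that embeds deep image features; 2. Establishes an analytical link between the Nishimori-temperature RBIM and the Bethe-Hessian spectrum, which provides a Bayes-optimal detection criterion; 3. Designs a practical unsupervised synthetic image detector that is robust to new generative architectures.
Method: 1. Extracts image features with pretrained CNNs and reduces their dimensionality; 2. Builds a Multi-Edge Type QC-LDPC graph whose nodes are feature vectors; 3. Converts pairwise similarities into RBIM couplings and uses the Bethe-Hessian spectrum at the Nishimori temperature to detect community structure (real images) or its absence (synthetic images).
Result: On real photos from Flickr-Faces-HQ and CelebA and on synthetic data generated by GANs and diffusion models, the detector reaches over 94% accuracy without labeled data or any adjustment of the feature extractor.
Insight: The Bethe-Hessian spectrum of real image sets shows multiple separated gaps, while that of synthetic images collapses, so the violation of Nishimori symmetry is an effective physical cue for synthetic image detection.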
Abstract: The rapid advance of deep generative models such as GANs and diffusion networks now produces images that are virtually indistinguishable from genuine photographs, undermining media forensics and biometric security. Supervised detectors quickly lose effectiveness on unseen generators or after adversarial post-processing, while existing unsupervised methods that rely on low-level statistical cues remain fragile. We introduce a physics-inspired, model-agnostic detector that treats synthetic-image identification as a community-detection problem on a sparse weighted graph. Image features are first extracted with pretrained CNNs and reduced to 32 dimensions; each feature vector then becomes a node of a Multi-Edge Type QC-LDPC graph. Pairwise similarities are transformed into edge couplings calibrated at the Nishimori temperature, producing a Random Bond Ising Model (RBIM) whose Bethe-Hessian spectrum exhibits a characteristic gap when genuine community structure (real images) is present. Synthetic images violate the Nishimori symmetry and therefore lack such gaps. We validate the approach on binary tasks (cat versus dog and male versus female) using real photos from Flickr-Faces-HQ and CelebA and synthetic counterparts generated by GANs and diffusion models. Without any labeled synthetic data or retraining of the feature extractor, the detector achieves over 94% accuracy. Spectral analysis shows multiple well-separated gaps for real image sets and a collapsed spectrum for generated ones. Our contributions are threefold: a novel LDPC graph construction that embeds deep image features; an analytical link between the Nishimori-temperature RBIM and the Bethe-Hessian spectrum that provides a Bayes-optimal detection criterion; and a practical, unsupervised synthetic image detector robust to new generative architectures. Future work will extend the framework to video streams and multi-class anomaly detection.
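For orientation, the Bethe-Hessian of an unweighted graph is usually written as below. This is the standard definition from the spectral clustering literature, given here as background only: the abstract does not spell out the weighted operator the authors use for the RBIM at the Nishimori temperature, which may differ in form. Here $A$ is the adjacency matrix, $D$ the diagonal degree matrix, $I$ the identity, and $r$ a real parameter often chosen near the square root of the average degree.

```latex
H(r) = (r^{2} - 1)\, I \;-\; r A \;+\; D
```

In that unweighted setting, eigenvalues of $H(r)$ that are negative and separated from the bulk of the spectrum indicate community structure, which is the kind of spectral gap the abstract refers to.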
[63] LabelGS: Label-Aware 3D Gaussian Splatting for 3D Scene Segmentation
Yupeng Zhang,Dezhi Zheng,Ping Lu,Han Zhang,Lei Wang,Liping Xiang,Cheng Luo,Kaijun Deng,Xiaowen Fu,Linlin Shen,Jinbao Wang
Main category: cs.CV
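TL;DR: LabelGS augments the 3D Gaussian Splatting (3DGS) representation with object labels to enable efficient 3D scene segmentation, and it trains substantially faster than existing methods.
Details
Motivation: 3D Gaussian Splatting (3DGS) excels at high-fidelity reconstruction and efficient rendering but lacks 3D segmentation ability, which limits its use in tasks that require scene understanding.
Contribution: Proposes LabelGS, which introduces cross-view consistent semantic masks and an Occlusion Analysis Model to lift 2D semantic priors into the 3D Gaussian representation, and improves the optimization process.
Method: Combines cross-view semantic masks, an Occlusion Analysis Model, a Main Gaussian Labeling model, and a Gaussian Projection Filter to decouple the Gaussian representation and make optimization more efficient.
Result: On 3D scene segmentation, LabelGS outperforms existing methods such as Feature-3DGS and trains 22$\times$ faster.
Insight: Optimizing the 3D Gaussian representation with semantic labels markedly improves both the performance and the efficiency of segmentation, which shows the potential of explicit representations for scene understanding.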
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a novel explicit representation for 3D scenes, offering both high-fidelity reconstruction and efficient rendering. However, 3DGS lacks 3D segmentation ability, which limits its applicability in tasks that require scene understanding. Identifying and isolating specific object components is crucial. To address this limitation, we propose Label-aware 3D Gaussian Splatting (LabelGS), a method that augments the Gaussian representation with object labels. LabelGS introduces cross-view consistent semantic masks for 3D Gaussians and employs a novel Occlusion Analysis Model to avoid overfitting to occlusion during optimization, a Main Gaussian Labeling model to lift 2D semantic priors to 3D Gaussians, and a Gaussian Projection Filter to avoid Gaussian label conflicts. Our approach achieves effective decoupling of Gaussian representations and refines the 3DGS optimization process through a random region sampling strategy, significantly improving efficiency. Extensive experiments demonstrate that LabelGS outperforms previous state-of-the-art methods, including Feature-3DGS, in the 3D scene segmentation task. Notably, LabelGS achieves a remarkable 22$\times$ speedup in training compared to Feature-3DGS, at a resolution of 1440$\times$1080. Our code will be available at https://github.com/garrisonz/LabelGS.
[64] FreeVPS: Repurposing Training-Free SAM2 for Generalizable Video Polyp Segmentation
Qiang Hu,Ying Zhou,Gepeng Ji,Nick Barnes,Qiang Li,Zhiwei Wang
Main category: cs.CV
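TL;DR: The paper proposes FreeVPS, which repurposes SAM2 without any training and combines it with a track-by-detect paradigm to achieve generalizable video polyp segmentation. Two training-free modules address SAM2's error accumulation during long-term tracking and deliver strong performance.
Details
Motivation: Existing video polyp segmentation methods struggle to balance spatiotemporal modeling and domain generalization, which limits their use in real clinical settings. The paper aims to improve both the stability and the generalization of segmentation.
Contribution: 1) Recasts video polyp segmentation as a track-by-detect paradigm; 2) Proposes two training-free modules (intra-association filtering and inter-association refinement) that stabilize SAM2's segmentation output; 3) Achieves state-of-the-art performance in both in-domain and out-of-domain scenarios.
Method: 1) Uses an image polyp segmentation (IPS) model to capture spatial context; 2) Integrates the temporal modeling capability of SAM2; 3) Reduces spatial errors with the intra-association filtering module; 4) Updates the memory bank with the inter-association refinement module to prevent error propagation.
Result: FreeVPS performs well in both in-domain and out-of-domain scenarios and shows robust tracking in long, untrimmed colonoscopy videos in particular.
Insight: 1) Training-free modules can effectively counter error accumulation; 2) The track-by-detect paradigm is an effective way to improve video polyp segmentation; 3) Repurposing SAM2 shows promise for medical image analysis.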
Abstract: Existing video polyp segmentation (VPS) paradigms usually struggle to balance spatiotemporal modeling and domain generalization, limiting their applicability in real clinical scenarios. To address this challenge, we recast the VPS task as a track-by-detect paradigm that leverages the spatial contexts captured by the image polyp segmentation (IPS) model while integrating the temporal modeling capabilities of segment anything model 2 (SAM2). However, during long-term polyp tracking in colonoscopy videos, SAM2 suffers from error accumulation, resulting in a snowball effect that compromises segmentation stability. We mitigate this issue by repurposing SAM2 as a video polyp segmenter with two training-free modules. In particular, the intra-association filtering module eliminates spatial inaccuracies originating from the detecting stage, reducing false positives. The inter-association refinement module adaptively updates the memory bank to prevent error propagation over time, enhancing temporal coherence. Both modules work synergistically to stabilize SAM2, achieving cutting-edge performance in both in-domain and out-of-domain scenarios. Furthermore, we demonstrate the robust tracking capabilities of FreeVPS in long, untrimmed colonoscopy videos, underscoring its potential for reliable clinical analysis.
[65] Improving Generalization in Deepfake Detection with Face Foundation Models and Metric Learning
Stelios Mylonas,Symeon Papadopoulos
Main category: cs.CV
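TL;DR: The paper proposes a deepfake detection method based on face foundation models and metric learning. It builds on the self-supervised model FSFM, fine-tunes on multiple datasets, and adds triplet loss to improve generalization.
Details
Motivation: Deepfakes are increasingly realistic and accessible, yet existing detectors generalize poorly to new scenarios, so a more robust solution is needed.
Contribution: 1. Exploits the rich representations of FSFM, a self-supervised face foundation model; 2. incorporates triplet loss to make features more discriminative; 3. explores attribution-based supervision as a way to improve generalization.
Method: 1. Extract facial features with FSFM; 2. fine-tune on multiple deepfake datasets; 3. add triplet loss variants to shape the embedding space; 4. experiment with attribution-based supervision, where deepfakes are categorized by manipulation type or source dataset.
Result: The approach is effective across diverse evaluation benchmarks, especially in challenging real-world scenarios.
Insight: Combining face foundation models with metric learning is an effective way to improve the generalization of deepfake detection.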
Abstract: The increasing realism and accessibility of deepfakes have raised critical concerns about media authenticity and information integrity. Despite recent advances, deepfake detection models often struggle to generalize beyond their training distributions, particularly when applied to media content found in the wild. In this work, we present a robust video deepfake detection framework with strong generalization that takes advantage of the rich facial representations learned by face foundation models. Our method is built on top of FSFM, a self-supervised model trained on real face data, and is further fine-tuned using an ensemble of deepfake datasets spanning both face-swapping and face-reenactment manipulations. To enhance discriminative power, we incorporate triplet loss variants during training, guiding the model to produce more separable embeddings between real and fake samples. Additionally, we explore attribution-based supervision schemes, where deepfakes are categorized by manipulation type or source dataset, to assess their impact on generalization. Extensive experiments across diverse evaluation benchmarks demonstrate the effectiveness of our approach, especially in challenging real-world scenarios.
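The abstract mentions "triplet loss variants" without naming them. As a minimal sketch of the standard triplet margin loss only, the following assumes Euclidean distance and a margin of 1.0; the function names, the distance, and the margin are illustrative assumptions, not the authors' configuration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed variant, not confirmed by the paper).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings: a real anchor, another real sample, and a fake.
real_anchor = [0.0, 0.0]
real_positive = [0.0, 1.0]
fake_negative = [0.0, 1.5]
loss = triplet_margin_loss(real_anchor, real_positive, fake_negative)
```

In this toy case the fake sample is only 0.5 farther from the anchor than the real one, so the loss stays positive and would push the two classes further apart during training.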
[66] FastAvatar: Towards Unified Fast High-Fidelity 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers
Yue Wu,Yufan Wu,Wen Li,Yuxi Lu,Kairui Feng,Xuanhong Chen
Main category: cs.CV
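TL;DR: FastAvatar is a fast, high-quality 3D avatar reconstruction framework. A single unified model turns diverse daily recordings (a single image, multi-view observations, or monocular video) into a 3D Gaussian Splatting model within seconds, with high data utilization.
Details
Motivation: Existing 3D avatar reconstruction methods suffer from high time complexity, sensitivity to data quality, and low data utilization, which limits practical use. FastAvatar aims to offer an efficient and flexible alternative.
Contribution: 1. Proposes FastAvatar, a unified model that quickly reconstructs a high-quality 3D Gaussian Splatting model from several input types (a single image, multi-view observations, or monocular video); 2. designs the Large Gaussian Reconstruction Transformer with three key components: a VGGT-style transformer architecture, multi-granular guidance encoding, and incremental Gaussian aggregation; 3. enables incremental reconstruction, where quality improves as more observations are added.
Method: The core of FastAvatar is the Large Gaussian Reconstruction Transformer, which comprises: 1. a VGGT-style transformer architecture that aggregates multi-frame cues and predicts an aggregatable canonical 3DGS representation; 2. multi-granular guidance encoding (camera pose, FLAME expression, head pose), which mitigates animation-induced misalignment for variable-length inputs; 3. incremental Gaussian aggregation via landmark tracking and sliced fusion losses.
Result: Experiments show that FastAvatar achieves higher quality and highly competitive speed compared with existing methods, reconstructing within seconds.
Insight: A unified model combined with incremental reconstruction improves data utilization and reconstruction efficiency, giving practical applications more flexibility.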
Abstract: Despite significant progress in 3D avatar reconstruction, it still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. FastAvatar’s core is a Large Gaussian Reconstruction Transformer featuring three key designs: First, a variant VGGT-style transformer architecture aggregating multi-frame cues while injecting initial 3D prompt to predict an aggregatable canonical 3DGS representation; Second, multi-granular guidance encoding (camera pose, FLAME expression, head pose) mitigating animation-induced misalignment for variable-length inputs; Third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., improving quality with more observations, unlike prior work wasting input data. This yields a quality-speed-tunable paradigm for highly usable avatar modeling. Extensive experiments show that FastAvatar has higher quality and highly competitive speed compared to existing methods.
[67] BuzzSet v1.0: A Dataset for Pollinator Detection in Field Conditions
Ahmed Emam,Mohamed Elbassiouny,Julius Miller,Patrick Donworth,Sabine Seidel,Ribana Roscher
Main category: cs.CV
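TL;DR: BuzzSet v1.0 is a new dataset for pollinator detection in field conditions. It contains 7856 high-resolution images with over 8000 annotated instances in three classes: honeybees, bumblebees, and unidentified insects. Annotations were pre-generated by a YOLOv12 model and verified by humans, and an RF-DETR detector reaches high F1-scores on the two bee classes.
Details
Motivation: Pollinators are vital to global food production and ecosystem stability, but their populations are declining under environmental and anthropogenic stressors. Automated monitoring tools need high-quality datasets, and BuzzSet is intended to fill that gap.
Contribution: 1. Releases BuzzSet, a new large-scale dataset of pollinator images collected in the field; 2. provides human-verified annotations and a preprocessing pipeline; 3. establishes strong baselines with the RF-DETR detector.
Method: 1. Generate initial annotations with YOLOv12 and refine them through human verification; 2. preprocess images into 256x256 tiles to improve small-object detection; 3. run object detection with RF-DETR.
Result: RF-DETR reaches F1-scores of 0.94 and 0.92 on the honeybee and bumblebee classes, respectively. The best mAP@0.50 is 0.559, and the unidentified class performs worse.
Insight: 1. Small-object detection and classification under label noise are the main challenges; 2. BuzzSet offers a useful benchmark for ecological computer vision and small-object detection.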
Abstract: Pollinator insects such as honeybees and bumblebees are vital to global food production and ecosystem stability, yet their populations are declining due to increasing anthropogenic and environmental stressors. To support scalable, automated pollinator monitoring, we introduce BuzzSet, a new large-scale dataset of high-resolution pollinator images collected in real agricultural field conditions. BuzzSet contains 7856 manually verified and labeled images, with over 8000 annotated instances across three classes: honeybees, bumblebees, and unidentified insects. Initial annotations were generated using a YOLOv12 model trained on external data and refined via human verification using open-source labeling tools. All images were preprocessed into 256$\times$256 tiles to improve the detection of small insects. We provide strong baselines using the RF-DETR transformer-based object detector. The model achieves high F1-scores of 0.94 and 0.92 for honeybee and bumblebee classes, respectively, with confusion matrix results showing minimal misclassification between these categories. The unidentified class remains more challenging due to label ambiguity and lower sample frequency, yet still contributes useful insights for robustness evaluation. Overall detection quality is strong, with a best mAP@0.50 of 0.559. BuzzSet offers a valuable benchmark for small object detection, class separation under label noise, and ecological computer vision.
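The tiling step can be sketched as follows. The abstract only states that images were preprocessed into 256x256 tiles; the non-overlapping grid, the clipping of edge tiles, and the function name `tile_boxes` are assumptions made for illustration.

```python
def tile_boxes(width, height, tile=256):
    """Return (left, top, right, bottom) crop boxes that cover an image.

    Assumes a non-overlapping grid; tiles on the right and bottom edges
    are clipped to the image bounds rather than padded.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


# A hypothetical 1024x768 field image splits into a 4x3 grid of tiles.
boxes = tile_boxes(1024, 768)
```

Each box uses the (left, top, right, bottom) convention common to image libraries, so the boxes can be passed to a crop routine of the reader's choice.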
[68] AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning
Shu Shen,C. L. Philip Chen,Tong Zhang
Main category: cs.CV
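TL;DR: The paper proposes Adaptive Intra-Network Modulation (AIM) to address imbalance in multimodal learning. Unlike existing methods that sacrifice the dominant modality, AIM decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and adaptively adjusts modulation strength at each network depth, achieving balanced multimodal learning.
Details
Motivation: Multimodal learning suffers from modality imbalance. Existing methods usually suppress the dominant modality to help weaker ones, which hurts overall performance. The paper identifies optimization bias within networks as a main cause and proposes AIM to address it.
Contribution: 1. Reveals the role of intra-network optimization bias in multimodal imbalance. 2. Proposes AIM, which the authors describe as the first method to achieve balanced learning without hindering either dominant or weak modalities. 3. Decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and combines this with adaptive modulation to improve performance.
Method: AIM decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks during joint training with weaker modalities. It also assesses the imbalance level at each network depth and adjusts modulation strength accordingly.
Result: AIM outperforms existing imbalanced modality learning methods on multiple benchmarks and generalizes well across different backbones, fusion strategies, and optimizers.
Insight: 1. Modality imbalance exists not only globally but also within individual network layers. 2. Modulation tailored to specific network depths balances modality learning more effectively.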
Abstract: Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality’s learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving balanced multimodal learning without hindering either dominant or weak modalities for the first time. Specifically, AIM decouples the dominant modality’s under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.
[69] The Return of Structural Handwritten Mathematical Expression Recognition
Jakob Seitz,Tobias Lengfeld,Radu Timofte
Main category: cs.CV
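TL;DR: The paper proposes a structural approach to handwritten mathematical expression recognition. An automatic annotation system and a modular recognition system optimize symbol segmentation, classification, and spatial relation prediction separately, improving interpretability and error analysis.
Details
Motivation: Encoder-decoder architectures with large language models excel at LaTeX generation but lack explicit symbol-to-trace alignment, which limits error analysis and interactive applications.
Contribution: 1. Proposes an automatic annotation system that uses a neural network to map LaTeX equations to raw traces; 2. designs a modular structural recognition system that independently optimizes segmentation, classification, and relation prediction.
Method: Combines graph-based trace sorting, a hybrid convolutional-recurrent network, and transformer-based correction to produce a complete graph structure that directly links handwritten traces to predicted symbols.
Result: Achieves competitive performance on the CROHME-2023 benchmark while enabling transparent error analysis and interpretable outputs.
Insight: Structural recognition offers a path to interpretable and interactive handwritten math recognition, covering a gap left by existing methods.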
Abstract: Handwritten Mathematical Expression Recognition is foundational for educational technologies, enabling applications like digital note-taking and automated grading. While modern encoder-decoder architectures with large language models excel at LaTeX generation, they lack explicit symbol-to-trace alignment, a critical limitation for error analysis, interpretability, and spatially aware interactive applications requiring selective content updates. This paper introduces a structural recognition approach with two innovations: (1) an automatic annotation system that uses a neural network to map LaTeX equations to raw traces, automatically generating annotations for symbol segmentation, classification, and spatial relations, and (2) a modular structural recognition system that independently optimizes segmentation, classification, and relation prediction. By leveraging a dataset enriched with structural annotations from our auto-labeling system, the proposed recognition system combines graph-based trace sorting, a hybrid convolutional-recurrent network, and transformer-based correction to achieve competitive performance on the CROHME-2023 benchmark. Crucially, our structural recognition system generates a complete graph structure that directly links handwritten traces to predicted symbols, enabling transparent error analysis and interpretable outputs.
[70] MAPo : Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction
Han Jiao,Jiakai Sun,Yexing Xu,Lei Zhao,Wei Xing,Huaizhong Lin
Main category: cs.CV
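TL;DR: A single deformation field struggles to capture complex motion details in dynamic scene reconstruction based on 3D Gaussian Splatting. MAPo proposes a motion-aware partitioning strategy that splits the Gaussians by dynamic score and assigns dedicated deformation networks to the highly dynamic ones, improving rendering quality.
Details
Motivation: Existing deformation-based dynamic 3D Gaussian Splatting methods tend to lose detail in highly dynamic regions because a single deformation field cannot capture diverse motion patterns. MAPo addresses this with a partitioning strategy.
Contribution: 1. Proposes a dynamic score-based partitioning strategy that distinguishes high- and low-dynamic 3D Gaussians; 2. recursively partitions high-dynamic Gaussians in time and gives each temporal segment its own deformation network to capture fine motion; 3. introduces a cross-frame consistency loss to address visual discontinuities at partition boundaries.
Method: 1. Partition Gaussians by dynamic score; 2. recursively partition high-dynamic Gaussians temporally, duplicating the deformation network for each new segment, while treating low-dynamic Gaussians as static; 3. apply a cross-frame consistency loss to maintain visual continuity.
Result: Experiments show that MAPo achieves better rendering quality than the baselines, particularly in regions with complex or rapid motion, at comparable computational cost.
Insight: Specialized modeling of dynamic partitions is key to reconstructing highly dynamic scenes, and the cross-frame consistency loss is an effective remedy for artifacts at partition boundaries.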
Abstract: 3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally and duplicate their deformation networks for each new temporal segment, enabling specialized modeling to capture intricate motion details. Concurrently, low-dynamic 3DGs are treated as static to reduce computational costs. However, this temporal partitioning strategy for high-dynamic 3DGs can introduce visual discontinuities across frames at the partition boundaries. To address this, we introduce a cross-frame consistency loss, which not only ensures visual continuity but also further enhances rendering quality. Extensive experiments demonstrate that MAPo achieves superior rendering quality compared to baselines while maintaining comparable computational costs, particularly in regions with complex or rapid motions.
[71] StableIntrinsic: Detail-preserving One-step Diffusion Model for Multi-view Material Estimation
Xiuchao Wu,Pengfei Zhu,Jiangjing Lyu,Xinguo Liu,Jie Guo,Yanwen Guo,Weiwei Xu,Chengfei Lyu
Main category: cs.CV
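TL;DR: StableIntrinsic is a one-step diffusion model for multi-view material estimation. With pixel-space losses and a Detail Injection Network (DIN), it addresses the high variance and long inference time of conventional multi-step diffusion models.
Details
Motivation: Diffusion-based material estimation methods use multi-step denoising, which makes inference slow and results highly variable. This conflicts with the deterministic nature of the material estimation task.
Contribution: 1. Proposes StableIntrinsic, a one-step diffusion model that lowers computational cost and improves result stability.
2. Designs pixel-space loss functions based on material properties to avoid the over-smoothing of one-step diffusion.
3. Introduces the Detail Injection Network (DIN) to compensate for detail lost in VAE encoding and sharpen predictions.
Method: 1. Replace multi-step diffusion with one-step diffusion for efficiency.
2. Apply targeted losses in pixel space to optimize material property prediction.
3. Use the DIN to improve detail recovery.
Result: Experiments show that StableIntrinsic outperforms existing methods on PSNR and MSE, for example:
- Albedo PSNR improves by 9.9%.
- Metallic and roughness MSE drop by 44.4% and 60.0%, respectively.
Insight: A one-step diffusion model combined with targeted losses and a detail enhancement module can improve the accuracy and stability of material estimation while remaining efficient.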
Abstract: Recovering material information from images has been extensively studied in computer graphics and vision. Recent works in material estimation leverage diffusion models and show promising results. However, these diffusion-based methods adopt a multi-step denoising strategy, which is time-consuming for each estimation. Such stochastic inference also conflicts with the deterministic material estimation task, leading to high variance in the estimated results. In this paper, we introduce StableIntrinsic, a one-step diffusion model for multi-view material estimation that can produce high-quality material parameters with low variance. To address the over-smoothing problem in one-step diffusion, StableIntrinsic applies losses in pixel space, with each loss designed based on the properties of the material. Additionally, StableIntrinsic introduces a Detail Injection Network (DIN) to eliminate the detail loss caused by VAE encoding, while further enhancing the sharpness of material prediction results. The experimental results indicate that our method surpasses the current state-of-the-art techniques by achieving a $9.9\%$ improvement in the Peak Signal-to-Noise Ratio (PSNR) of albedo, and by reducing the Mean Square Error (MSE) for metallic and roughness by $44.4\%$ and $60.0\%$, respectively.
[72] Context-aware Sparse Spatiotemporal Learning for Event-based Vision
Shenqi Wang,Guangzhi Tang
Main category: cs.CV
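TL;DR: The paper proposes CSSL, a new framework for sparse spatiotemporal learning on event camera data. Context-aware thresholding dynamically regulates neuron activations, which reduces activation density and improves efficiency, with strong results on event-based object detection and optical flow estimation.
Details
Motivation: The high temporal resolution and dynamic range of event cameras make them promising for robot perception. However, existing deep learning methods do not fully exploit event sparsity, and spiking neural networks underperform on complex tasks, so more efficient sparse learning is needed.
Contribution: Proposes the CSSL framework, which uses context-aware thresholding to regulate neuron activations dynamically and reaches high sparsity without explicit sparsity constraints, making event-based vision tasks more efficient.
Method: Context-aware thresholding regulates neuron activations according to the input distribution. This reduces activation density and supports efficient event-based object detection and optical flow estimation.
Result: On event-based object detection and optical flow estimation, CSSL matches or exceeds existing methods while maintaining extremely high neuronal sparsity.
Insight: By regulating activation thresholds dynamically, CSSL offers an efficient event-based vision approach for neuromorphic computing and a new option for resource-constrained edge applications.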
Abstract: Event-based cameras have emerged as a promising paradigm for robot perception, offering high temporal resolution, high dynamic range, and robustness to motion blur. However, existing deep learning-based event processing methods often fail to fully leverage the sparse nature of event data, complicating their integration into resource-constrained edge applications. While neuromorphic computing provides an energy-efficient alternative, spiking neural networks struggle to match the performance of state-of-the-art models in complex event-based vision tasks, like object detection and optical flow. Moreover, achieving high activation sparsity in neural networks is still difficult and often demands careful manual tuning of sparsity-inducing loss terms. Here, we propose Context-aware Sparse Spatiotemporal Learning (CSSL), a novel framework that introduces context-aware thresholding to dynamically regulate neuron activations based on the input distribution, naturally reducing activation density without explicit sparsity constraints. Applied to event-based object detection and optical flow estimation, CSSL achieves comparable or superior performance to state-of-the-art methods while maintaining extremely high neuronal sparsity. Our experimental results highlight CSSL’s crucial role in enabling efficient event-based vision for neuromorphic processing.
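The abstract does not give the thresholding rule, so the following is only a toy illustration of the general idea of an input-dependent activation threshold. It assumes the threshold is the mean plus `k` population standard deviations of the current activations; that formula, the parameter `k`, and the function names are hypothetical and are not taken from CSSL.

```python
import statistics


def context_aware_threshold(activations, k=1.0):
    """Zero out activations at or below an input-dependent threshold.

    Assumed rule (not from the paper): threshold = mean + k * std,
    computed from the activations themselves.
    """
    mean = statistics.fmean(activations)
    std = statistics.pstdev(activations)
    threshold = mean + k * std
    return [a if a > threshold else 0.0 for a in activations]


def sparsity(activations):
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)


# Hypothetical layer output with one strong response among weak ones.
dense = [0.2, 0.1, 0.3, 4.0]
sparse = context_aware_threshold(dense)
```

Because the threshold follows the statistics of each input, no separate sparsity-inducing loss term appears in this sketch, which mirrors the motivation stated in the abstract.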
[73] AutoQ-VIS: Improving Unsupervised Video Instance Segmentation via Automatic Quality Assessment
Kaixuan Lu,Mehmet Onurcan Kaya,Dim P. Papadopoulos
Main category: cs.CV
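TL;DR: AutoQ-VIS is an unsupervised video instance segmentation framework that uses quality-guided self-training to bridge the domain gap between synthetic and real data, reaching state-of-the-art performance.
Details
Motivation: Video instance segmentation (VIS) requires pixel-level masks and temporal consistency labels, which are costly to annotate. Existing unsupervised methods such as VideoCutLER rely on synthetic data and are constrained by the synthetic-to-real domain gap.
Contribution: Proposes AutoQ-VIS, which uses quality-guided self-training to adapt progressively from synthetic data to real videos without human annotations.
Method: A closed-loop system couples pseudo-label generation with automatic quality assessment to refine the model step by step.
Result: Reaches 52.6 AP50 on the YouTubeVIS-2019 val set, surpassing VideoCutLER by 4.4%, with no human annotations.
Insight: Quality-aware self-training is a viable approach to unsupervised VIS.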
Abstract: Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4$\%$, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. The source code of our method is available at https://github.com/wcbup/AutoQ-VIS.
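One round of the closed loop between pseudo-label generation and quality assessment might look like the sketch below. The callables `predict` and `assess_quality`, the 0.75 threshold, and the data shapes are hypothetical placeholders; the actual AutoQ-VIS components are in the linked repository.

```python
def self_training_round(predict, assess_quality, videos, threshold=0.75):
    """Generate pseudo-labels, score them, and keep only the confident ones.

    `predict(video)` returns candidate masks and `assess_quality(video, mask)`
    returns a score in [0, 1]; both are stand-ins for real models.
    """
    kept = []
    for video in videos:
        for mask in predict(video):
            if assess_quality(video, mask) >= threshold:
                kept.append((video, mask))
    return kept


# Stub models for illustration: every video yields one good and one poor mask.
def stub_predict(video):
    return [video + "_good", video + "_poor"]


def stub_quality(video, mask):
    return 0.9 if mask.endswith("_good") else 0.1


pseudo_labels = self_training_round(stub_predict, stub_quality, ["v1", "v2"])
```

In a full loop, the kept pairs would be added to the training set, the model retrained, and the round repeated on real videos.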
[74] Image Quality Assessment for Machines: Paradigm, Large-scale Database, and Models
Xiaoqi Wang,Yun Zhang,Weisi Lin
Main category: cs.CV
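TL;DR: The paper proposes a machine-centric image quality assessment (MIQA) framework for machine vision systems (MVS) and builds a large-scale database of 2.5 million samples (MIQD-2.5M). It also proposes a region-aware MIQA model (RA-MIQA) that outperforms metrics based on the human visual system (HVS) across multiple tasks.
Details
Motivation: Machine vision systems degrade under adverse visual conditions, and existing HVS-based metrics cannot predict that degradation well, so a dedicated machine-centric quality assessment framework is needed.
Contribution: 1. Proposes the MIQA framework, which focuses on quantifying MVS performance degradation; 2. builds the large-scale MIQD-2.5M database, spanning 75 vision models and 250 degradation types; 3. proposes the RA-MIQA model, which improves assessment through fine-grained spatial degradation analysis.
Method: RA-MIQA uses a region-aware mechanism to analyze image degradation spatially, with quality measured through consistency and accuracy metrics. Experiments compare it against seven HVS-based IQA metrics and five retrained classical backbones.
Result: RA-MIQA outperforms HVS-based methods on multiple tasks. For image classification, it achieves SRCC gains of 13.56% on consistency and 13.37% on accuracy. The results also reveal task-specific degradation sensitivities.
Insight: HVS-based metrics are inadequate for MVS quality prediction, and even specialized MIQA models still struggle with background degradations, accuracy-oriented estimation, and subtle distortions. The study lays groundwork for MVS reliability and machine-centric optimization.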
Abstract: Machine vision systems (MVS) are intrinsically vulnerable to performance degradation under adverse visual conditions. To address this, we propose a machine-centric image quality assessment (MIQA) framework that quantifies the impact of image degradations on MVS performance. We establish an MIQA paradigm encompassing the end-to-end assessment workflow. To support this, we construct a machine-centric image quality database (MIQD-2.5M), comprising 2.5 million samples that capture distinctive degradation responses in both consistency and accuracy metrics, spanning 75 vision models, 250 degradation types, and three representative vision tasks. We further propose a region-aware MIQA (RA-MIQA) model to evaluate MVS visual quality through fine-grained spatial degradation analysis. Extensive experiments benchmark the proposed RA-MIQA against seven human visual system (HVS)-based IQA metrics and five retrained classical backbones. Results demonstrate RA-MIQA’s superior performance in multiple dimensions, e.g., achieving SRCC gains of 13.56% on consistency and 13.37% on accuracy for image classification, while also revealing task-specific degradation sensitivities. Critically, HVS-based metrics prove inadequate for MVS quality prediction, while even specialized MIQA models struggle with background degradations, accuracy-oriented estimation, and subtle distortions. This study can advance MVS reliability and establish foundations for machine-centric image processing and optimization. The model and code are available at: https://github.com/XiaoqiWang/MIQA.
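The reported gains are in SRCC, the Spearman rank correlation coefficient between predicted and reference quality scores. A minimal pure-Python sketch is shown below, computed as the Pearson correlation of average ranks; the function names and example scores are illustrative and not taken from the MIQA code base.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srcc(predicted, reference):
    """Spearman rank correlation between two score lists."""
    return pearson(average_ranks(predicted), average_ranks(reference))


# Hypothetical quality scores: same ordering, different scale.
value = srcc([0.1, 0.4, 0.35, 0.8], [1.0, 3.0, 2.0, 4.0])
```

SRCC depends only on ordering, so a predictor that ranks images correctly scores close to 1 even when its raw values are on a different scale from the reference.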
[75] Ego-centric Predictive Model Conditioned on Hand Trajectories
Binjie Zhang,Mike Zheng Shou
Main category: cs.CV
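TL;DR: The paper proposes a unified two-stage predictive framework that jointly models actions and the visual future in egocentric scenarios, conditioned on hand trajectories, and improves both action prediction and future video synthesis.
Details
Motivation: Existing methods for egocentric scenarios do not jointly model action prediction and its effect on the visual scene, which leads to inaccurate or inconsistent predictions.
Contribution: 1. Proposes what the authors describe as the first unified model for both egocentric human activity understanding and robotic manipulation tasks; 2. models actions and the visual future in two stages, conditioned on hand trajectories; 3. outperforms existing methods on action prediction and future video synthesis.
Method: 1. The first stage performs consecutive state modeling over heterogeneous inputs (visual observations, language, and action history) and explicitly predicts future hand trajectories; 2. the second stage introduces causal cross-attention to fuse multimodal cues and guide an image-based Latent Diffusion Model (LDM) for frame-by-frame video generation.
Result: Experiments on Ego4D, BridgeData, and RLBench show that the method outperforms state-of-the-art baselines in both action prediction and future video synthesis.
Insight: Explicit hand trajectory modeling and multimodal fusion are key to better prediction in egocentric scenarios.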
Abstract: In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions and for enabling robotic planning. However, existing paradigms fall short of jointly modeling these aspects. Vision-Language-Action (VLA) models focus on action prediction but lack explicit modeling of how actions influence the visual scene, while video prediction models generate future frames without conditioning on specific actions, often resulting in implausible or contextually inconsistent outcomes. To bridge this gap, we propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios, conditioned on hand trajectories. In the first stage, we perform consecutive state modeling to process heterogeneous inputs (visual observations, language, and action history) and explicitly predict future hand trajectories. In the second stage, we introduce causal cross-attention to fuse multi-modal cues, leveraging inferred action signals to guide an image-based Latent Diffusion Model (LDM) for frame-by-frame future video generation. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks, providing explicit predictions of both upcoming actions and their visual consequences. Extensive experiments on Ego4D, BridgeData, and RLBench demonstrate that our method outperforms state-of-the-art baselines in both action prediction and future video synthesis.
[76] Multimodal Conditional MeshGAN for Personalized Aneurysm Growth Prediction
Long Chen,Ashiv Patel,Mengyun Qiao,Mohammad Yousuf Salmasi,Salah A. Hammouche,Vasilis Stavrinides,Jasleen Nagi,Soodeh Kalaie,Xiao Yun Xu,Wenjia Bai,Declan P. O’Regan
Main category: cs.CV
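TL;DR: MCMeshGAN is a multimodal conditional mesh generative adversarial network for personalized prediction of aortic aneurysm growth. It combines local and global features to overcome limitations of existing methods and is validated on clinical data.
Details
Motivation: Personalized prediction of aortic aneurysm progression is essential for timely intervention, but it is challenging because it requires modeling complex 3D geometry together with local and global changes.
Contribution: Proposes MCMeshGAN, described as the first multimodal conditional mesh-to-mesh generative adversarial network for this task, combining a local KNN-based convolutional network (KCN) with a global graph convolutional network (GCN) to improve prediction accuracy. The authors also curate TAAMesh, a new longitudinal dataset.
Method: A dual-branch architecture in which the KCN preserves fine-grained geometric detail and the GCN captures long-range structural context. A condition branch encodes clinical attributes and the target time interval to generate predictions.
Result: Experiments show that MCMeshGAN outperforms existing methods in geometric accuracy and clinically important diameter estimation.
Insight: MCMeshGAN is a step toward clinically deployable, personalized 3D disease trajectory modeling, and the public source code supports further research.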
Abstract: Personalized, accurate prediction of aortic aneurysm progression is essential for timely intervention but remains challenging due to the need to model both subtle local deformations and global anatomical changes within complex 3D geometries. We propose MCMeshGAN, the first multimodal conditional mesh-to-mesh generative adversarial network for 3D aneurysm growth prediction. MCMeshGAN introduces a dual-branch architecture combining a novel local KNN-based convolutional network (KCN) to preserve fine-grained geometric details and a global graph convolutional network (GCN) to capture long-range structural context, overcoming the over-smoothing limitations of deep GCNs. A dedicated condition branch encodes clinical attributes (age, sex) and the target time interval to generate anatomically plausible, temporally controlled predictions, enabling retrospective and prospective modeling. We curated TAAMesh, a new longitudinal thoracic aortic aneurysm mesh dataset consisting of 590 multimodal records (CT scans, 3D meshes, and clinical data) from 208 patients. Extensive experiments demonstrate that MCMeshGAN consistently outperforms state-of-the-art baselines in both geometric accuracy and clinically important diameter estimation. This framework offers a robust step toward clinically deployable, personalized 3D disease trajectory modeling. The source code for MCMeshGAN and the baseline methods is publicly available at https://github.com/ImperialCollegeLondon/MCMeshGAN.
[77] Self-supervised structured object representation learning
Oussama Hadjerci,Antoine Letienne,Mohamed Abbas Hedjazi,Adel Hafiane
Main category: cs.CV
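TL;DR: The paper proposes a self-supervised method built on a ProtoScale module that progressively builds structured visual representations through semantic grouping, instance-level separation, and hierarchical structuring, outperforming existing methods.
Details
Motivation: Existing self-supervised learning methods perform well at global image understanding but are limited in capturing structured representations of scenes.
Contribution: 1. Introduces the ProtoScale module, which captures visual elements across multiple spatial scales; 2. preserves full scene context to improve dense prediction tasks; 3. remains strong with limited annotated data and fewer fine-tuning epochs.
Method: Combines semantic grouping, instance-level separation, and hierarchical structuring, using the ProtoScale module to build structured visual representations in a self-supervised manner.
Result: On a combined subset of COCO and UA-DETRAC, the method improves supervised object detection and outperforms state-of-the-art methods.
Insight: Preserving scene context and structured representations helps self-supervised learning on dense prediction tasks.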
Abstract: Self-supervised learning (SSL) has emerged as a powerful technique for learning visual representations. While recent SSL approaches achieve strong results in global image understanding, they are limited in capturing the structured representation in scenes. In this work, we propose a self-supervised approach that progressively builds structured visual representations by combining semantic grouping, instance level separation, and hierarchical structuring. Our approach, based on a novel ProtoScale module, captures visual elements across multiple spatial scales. Unlike common strategies like DINO that rely on random cropping and global embeddings, we preserve full scene context across augmented views to improve performance in dense prediction tasks. We validate our method on downstream object detection tasks using a combined subset of multiple datasets (COCO and UA-DETRAC). Experimental results show that our method learns object centric representations that enhance supervised object detection and outperform the state-of-the-art methods, even when trained with limited annotated data and fewer fine-tuning epochs.
[78] KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
Taebaek Hwang,Minseo Kim,Gisang Lee,Seonuk Kim,Hyunjun Eun
Main category: cs.CV
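TL;DR: KRETA is a Korean benchmark for text-rich visual question answering (VQA). It evaluates reading and reasoning across diverse visual contexts and fills a gap for low-resource languages in VQA.
Details
Motivation: Current VQA datasets and benchmarks mainly target high-resource languages such as English. Low-resource languages such as Korean lack comprehensive evaluation tools, which hinders model development and comparison.
Contribution: Introduces the KRETA benchmark for text-rich VQA evaluation in Korean, covering 15 domains and 26 image types, along with a semi-automated VQA generation pipeline and a seven-metric evaluation protocol.
Method: A semi-automated VQA generation pipeline that combines stepwise image decomposition with a seven-metric protocol for data quality.
Result: KRETA provides a comprehensive evaluation tool for Korean VQA, and its pipeline is designed to extend to other languages in support of multilingual VLM research.
Insight: KRETA shows that a high-quality VQA benchmark can be built for a low-resource language and underlines the value of evaluating across many domains and image types.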
Abstract: Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.
[79] GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity
Seongheon Park,Yixuan Li
Main category: cs.CV
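TL;DR: GLSim is a training-free framework for detecting object hallucinations. By combining global and local embedding similarity signals, it improves the accuracy and reliability of object hallucination detection in large vision-language models.
Details
Motivation: Object hallucination in large vision-language models is an obstacle to safe real-world deployment. Existing methods take either a global or a local perspective in isolation, which limits reliability.
Contribution: Proposes the GLSim framework, which combines global and local embedding similarity signals for training-free and more accurate object hallucination detection.
Method: GLSim uses complementary global and local embedding similarity signals between the image and text modalities to detect object hallucinations without training.
Result: GLSim performs well across diverse scenarios and outperforms competitive baselines by a significant margin.
Insight: Combining global and local signals captures object hallucination more completely and improves detection performance.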
Abstract: Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.
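The abstract does not specify how the global and local signals are computed or combined. As a toy illustration of the general idea only, the sketch below assumes cosine similarity, takes the best-matching image patch as the local signal, and mixes the two with a fixed weight; all names, the max over patches, and the weighting are hypothetical rather than the GLSim formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def global_local_score(object_emb, image_emb, patch_embs, weight=0.5):
    """Blend a global and a local similarity signal (assumed combination).

    Global: object text embedding vs. whole-image embedding.
    Local: object text embedding vs. its best-matching image patch.
    A low score suggests the mentioned object may be hallucinated.
    """
    global_sim = cosine(object_emb, image_emb)
    local_sim = max(cosine(object_emb, patch) for patch in patch_embs)
    return weight * global_sim + (1 - weight) * local_sim


# Hypothetical 2-D embeddings: the object matches the image globally
# but no individual patch supports it.
score = global_local_score([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

Here the global signal alone would accept the object, while the local signal pulls the score down, which is the kind of complementarity the abstract describes.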
[80] PersonaAnimator: Personalized Motion Transfer from Unconstrained Videos
Ziyun Qian,Runyu Xiao,Shuyuan Tu,Wei Xue,Dingkang Yang,Mingcheng Li,Dongliang Kou,Minghao Han,Zizhi Chen,Lihua Zhang
Main category: cs.CV
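TL;DR: The paper introduces a new task, Video-to-Video Motion Personalization, and proposes the PersonaAnimator framework, which learns personalized motion patterns from unconstrained videos to enable personalized motion transfer.
Details
Motivation: Existing motion transfer methods replicate motion without capturing its style, rely on motion capture data, and sometimes generate motions that violate physical laws.
Contribution: Defines the Video-to-Video Motion Personalization task, creates PersonaVid, described as the first video-based personalized motion dataset, and introduces a Physics-aware Motion Style Regularization mechanism.
Method: PersonaAnimator learns personalized motion patterns directly from videos and applies physics-aware regularization to keep generated motions physically plausible.
Result: Experiments show that PersonaAnimator outperforms existing motion transfer methods and sets a new benchmark for the Video-to-Video Motion Personalization task.
Insight: Learning motion style directly from video removes the dependence on motion capture data, and physics-aware regularization makes generated motion more realistic.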
Abstract: Recent advances in motion generation show remarkable progress. However, several limitations remain: (1) Existing pose-guided character motion transfer methods merely replicate motion without learning its style characteristics, resulting in inexpressive characters. (2) Motion style transfer methods rely heavily on motion capture data, which is difficult to obtain. (3) Generated motions sometimes violate physical laws. To address these challenges, this paper pioneers a new task: Video-to-Video Motion Personalization. We propose a novel framework, PersonaAnimator, which learns personalized motion patterns directly from unconstrained videos. This enables personalized motion transfer. To support this task, we introduce PersonaVid, the first video-based personalized motion dataset. It contains 20 motion content categories and 120 motion style categories. We further propose a Physics-aware Motion Style Regularization mechanism to enforce physical plausibility in the generated motions. Extensive experiments show that PersonaAnimator outperforms state-of-the-art motion transfer methods and sets a new benchmark for the Video-to-Video Motion Personalization task.
[81] Hyperspectral Sensors and Autonomous Driving: Technologies, Limitations, and Opportunities
Imad Ali Shah,Jiarong Li,Roshan George,Tim Brophy,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan
Main category: cs.CV
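TL;DR: This paper presents the first comprehensive review of hyperspectral imaging (HSI) for autonomous driving, covering its potential, technical limitations, and commercial viability, and reveals a large gap between its research potential and its commercial readiness.
Details
Motivation: Autonomous driving needs perception beyond conventional RGB imaging. HSI offers material-level scene understanding through fine spectral resolution, but a systematic assessment of its technical status and commercial viability has been missing.
Contribution: Provides the first comprehensive analysis of the suitability of HSI for autonomous driving, evaluates 216 commercially available HSI and multispectral cameras, and identifies the shortcomings of current technology and directions for future research.
Method: Combines a qualitative review of existing HSI technologies with a quantitative analysis of commercial cameras, benchmarked against key criteria: frame rate, spatial resolution, spectral dimensionality, and compliance with AEC-Q100 temperature standards.
Result: Only four cameras meet the performance thresholds, and none comply with AEC-Q100. Existing HSI datasets are limited in scale, spectral consistency, number of spectral channels, and environmental diversity.
Insight: HSI has great potential for autonomous driving, but technical bottlenecks (such as real-time operation and environmental robustness) and dataset shortcomings must be overcome before industrial deployment.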
Abstract: Hyperspectral imaging (HSI) offers a transformative sensing modality for Advanced Driver Assistance Systems (ADAS) and autonomous driving (AD) applications, enabling material-level scene understanding through fine spectral resolution beyond the capabilities of traditional RGB imaging. This paper presents the first comprehensive review of HSI for automotive applications, examining the strengths, limitations, and suitability of current HSI technologies in the context of ADAS/AD. In addition to this qualitative review, we analyze 216 commercially available HSI and multispectral imaging cameras, benchmarking them against key automotive criteria: frame rate, spatial resolution, spectral dimensionality, and compliance with AEC-Q100 temperature standards. Our analysis reveals a significant gap between HSI’s demonstrated research potential and its commercial readiness. Only four cameras meet the defined performance thresholds, and none comply with AEC-Q100 requirements. In addition, the paper reviews recent HSI datasets and applications, including semantic segmentation for road surface classification, pedestrian separability, and adverse weather perception. Our review shows that current HSI datasets are limited in terms of scale, spectral consistency, the number of spectral channels, and environmental diversity, posing challenges for the development of perception algorithms and the adequate validation of HSI’s true potential in ADAS/AD applications. This review paper establishes the current state of HSI in automotive contexts as of 2025 and outlines key research directions toward practical integration of spectral imaging in ADAS and autonomous systems.
[82] Integrating SAM Supervision for 3D Weakly Supervised Point Cloud Segmentation
Lechun You,Zhonghua Wu,Weide Liu,Xulei Yang,Jun Cheng,Wei Zhou,Bharadwaj Veeravalli,Guosheng Lin
Main category: cs.CV
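TL;DR: Proposes a new method that improves 3D weakly supervised point cloud segmentation by incorporating semantic segmentation masks from 2D foundation models.
Details
Motivation: Annotating 3D point clouds is difficult, and existing methods focus on the 3D domain alone without exploiting the complementary nature of 2D and 3D data. Advances in 2D foundation models offer a new way to make use of sparse 3D annotations.
Contribution: 1. Extends sparse 3D annotations with segmentation masks generated by 2D foundation models; 2. Propagates 2D masks into 3D space through geometric correspondences; 3. Selects reliable pseudo labels with confidence- and uncertainty-based consistency regularization.
Method: 1. Generates semantic segmentation masks with a 2D foundation model; 2. Establishes geometric correspondences between 3D scenes and 2D views; 3. Selects reliable pseudo labels through consistency regularization.
Result: Combining 2D masks with consistency regularization improves the performance of 3D weakly supervised segmentation.
Insight: Exploiting the capabilities of 2D foundation models can compensate for scarce 3D annotations; geometric correspondences and consistency regularization are the key ingredients.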
Abstract: Current methods for 3D semantic segmentation propose training models with limited annotations to address the difficulty of annotating large, irregular, and unordered 3D point cloud data. They usually focus on the 3D domain only, without leveraging the complementary nature of 2D and 3D data. Besides, some methods extend original labels or generate pseudo labels to guide the training, but they often fail to fully use these labels or address the noise within them. Meanwhile, the emergence of comprehensive and adaptable foundation models has offered effective solutions for segmenting 2D data. Leveraging this advancement, we present a novel approach that maximizes the utility of sparsely available 3D annotations by incorporating segmentation masks generated by 2D foundation models. We further propagate the 2D segmentation masks into the 3D space by establishing geometric correspondences between 3D scenes and 2D views. We extend the highly sparse annotations to encompass the areas delineated by 3D masks, thereby substantially augmenting the pool of available labels. Furthermore, we apply confidence- and uncertainty-based consistency regularization on augmentations of the 3D point cloud and select the reliable pseudo labels, which are further spread on the 3D masks to generate more labels. This innovative strategy bridges the gap between limited 3D annotations and the powerful capabilities of 2D foundation models, ultimately improving the performance of 3D weakly supervised segmentation.
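The propagation step described in the abstract, lifting 2D masks into 3D through geometric correspondences and then spreading sparse labels over the lifted masks, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), points already expressed in camera coordinates, and no occlusion handling. All function names are hypothetical.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a 3D point (camera coordinates) to integer pixel coordinates."""
    x, y, z = point
    if z <= 0:
        return None  # the point lies behind the camera
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    return u, v


def lift_mask_ids(points, mask, fx, fy, cx, cy):
    """For each 3D point, return the ID of the 2D mask it projects into, or None."""
    height, width = len(mask), len(mask[0])
    ids = []
    for point in points:
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            ids.append(None)  # not visible in this view
        else:
            u, v = uv
            ids.append(mask[v][u])
    return ids


def spread_sparse_labels(mask_ids, sparse_labels):
    """Extend sparse point labels to every point that shares the same mask ID.

    sparse_labels maps a point index to its annotated class. If two labeled
    points fall in the same mask, the first one seen wins (a simplification).
    """
    mask_to_label = {}
    for index, label in sparse_labels.items():
        if mask_ids[index] is not None:
            mask_to_label.setdefault(mask_ids[index], label)
    return [mask_to_label.get(m) if m is not None else None for m in mask_ids]
```

With two labeled points, one per mask, every visible point inherits a label, while points outside the view stay unlabeled. A real pipeline would also need depth or visibility checks so that occluded points do not inherit labels from the surface in front of them.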
[83] WaveHiT-SR: Hierarchical Wavelet Network for Efficient Image Super-Resolution
Fayaz Ali,Muhammad Zawish,Steven Davy,Radu Timofte
Main category: cs.CV
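TL;DR: WaveHiT-SR is an efficient image super-resolution network built on a hierarchical wavelet transform. It uses adaptive windows and wavelet decomposition to address the computational cost of transformers in SR, balancing high performance against low compute.
Details
Motivation: In existing transformer-based SR methods, the quadratic complexity of self-attention limits window size and receptive field, making it hard to model long-range dependencies efficiently.
Contribution: 1. Proposes a hierarchical adaptive window mechanism that enlarges the receptive field; 2. Introduces a wavelet transform that decomposes images into multiple frequency subbands, capturing both global and local features; 3. Designs an efficient architecture that lowers computational cost while maintaining performance.
Method: 1. Embeds the wavelet transform in a hierarchical transformer; 2. Separates high- and low-frequency information through a multi-level decomposition strategy; 3. Progressively reconstructs the high-resolution image.
Result: The refined versions of SwinIR-Light, SwinIR-NG, and SRFormer-Light achieve state-of-the-art SR results with fewer parameters, lower FLOPs, and faster speeds.
Insight: Frequency-domain decomposition via wavelets combined with hierarchical windows can balance computational efficiency and SR performance, suggesting a direction for transformers in dense prediction tasks.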
Abstract: Transformers have demonstrated promising performance in computer vision tasks, including image super-resolution (SR). The quadratic computational complexity of window self-attention mechanisms in many transformer-based SR methods forces the use of small, fixed windows, limiting the receptive field. In this paper, we propose a new approach, called WaveHiT-SR, that embeds the wavelet transform within a hierarchical transformer framework. First, using adaptive hierarchical windows instead of static small windows allows the model to capture features across different levels and greatly improves its ability to model long-range dependencies. Second, the proposed model uses wavelet transforms to decompose images into multiple frequency subbands, allowing the network to focus on both global and local features while preserving structural details. By progressively reconstructing high-resolution images through hierarchical processing, the network reduces computational complexity without sacrificing performance. The multi-level decomposition strategy enables the network to capture fine-grained information in low-frequency components while enhancing high-frequency textures. Through extensive experimentation, we confirm the effectiveness and efficiency of our WaveHiT-SR. Our refined versions of SwinIR-Light, SwinIR-NG, and SRFormer-Light deliver cutting-edge SR results, achieving higher efficiency with fewer parameters, lower FLOPs, and faster speeds.
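To make the idea of decomposing an image into frequency subbands concrete, here is a minimal sketch of one level of a 2D Haar wavelet transform and its inverse in plain Python. The Haar wavelet is an assumption chosen for simplicity; the abstract does not say which wavelet WaveHiT-SR uses, and the function and subband names below are illustrative only.

```python
def haar_dwt2(image):
    """One level of an orthonormal 2D Haar transform on an even-sized grid.

    Returns four half-resolution subbands: a low-frequency average, the
    difference between rows, the difference between columns, and a diagonal
    difference.
    """
    low, row_diff, col_diff, diag = [], [], [], []
    for r in range(0, len(image), 2):
        bands = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            bands[0].append((a + b + d + e) / 2)
            bands[1].append((a + b - d - e) / 2)
            bands[2].append((a - b + d - e) / 2)
            bands[3].append((a - b - d + e) / 2)
        low.append(bands[0])
        row_diff.append(bands[1])
        col_diff.append(bands[2])
        diag.append(bands[3])
    return low, row_diff, col_diff, diag


def haar_idwt2(low, row_diff, col_diff, diag):
    """Invert haar_dwt2, rebuilding the full-resolution grid."""
    image = []
    for r in range(len(low)):
        top, bottom = [], []
        for c in range(len(low[0])):
            s, p, q, t = low[r][c], row_diff[r][c], col_diff[r][c], diag[r][c]
            top.extend([(s + p + q + t) / 2, (s + p - q - t) / 2])
            bottom.extend([(s - p + q - t) / 2, (s - p - q + t) / 2])
        image.extend([top, bottom])
    return image
```

Applying `haar_dwt2` again to the `low` subband gives a multi-level decomposition. Each level halves the spatial resolution, which is one reason wavelet-domain processing can lower the cost of attention over large windows.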
[84] Reimagining Image Segmentation using Active Contour: From Chan Vese Algorithm into a Proposal Novel Functional Loss Framework
Gianluca Guzzetta
Main category: cs.CV
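TL;DR: This paper proposes a functional segmentation loss framework based on the Chan-Vese algorithm and evaluates it against common computer vision segmentation methods.
Details
Motivation: The Chan-Vese algorithm is a classic in image segmentation, but its performance may be limited by traditional loss function design. The author aims to propose a better segmentation loss framework by drawing on modern computer vision methods.
Contribution: 1) Provides a discretized implementation of the Chan-Vese algorithm with a proof of the results; 2) Proposes a functional segmentation loss framework based on active contours; 3) Releases all code and experimental materials.
Method: 1) Discretizes the functional energy and partial differential equation of the Chan-Vese model; 2) Designs a functional loss using pytorch.nn.ModuleLoss; 3) Implements segmentation with a level set based method.
Result: Experiments on common segmentation datasets compare the proposed loss with classical loss functions.
Insight: Combining a modern deep learning framework such as PyTorch with a classical segmentation algorithm such as Chan-Vese can improve segmentation and suggests a direction for new algorithms.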
Abstract: In this paper, we present a comprehensive study and analysis of the Chan-Vese algorithm for image segmentation. We employ a discretized scheme derived from the empirical study of the Chan-Vese model’s functional energy and its partial differential equation based on its level set function. We provide a proof of the results and an implementation using MATLAB. Leveraging modern computer vision methodologies, we propose a functional segmentation loss based on active contours, utilizing pytorch.nn.ModuleLoss and a level set based on the Chan-Vese algorithm. We compare our results with common computer vision segmentation datasets and evaluate the performance of classical loss functions against our proposed method. All code and materials used are available at https://github.com/gguzzy/chan_vese_functional_loss.
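As a rough illustration of how the Chan-Vese energy can serve as a loss, the sketch below computes its two region-fitting terms for a soft mask in plain Python. It assumes the standard Chan-Vese formulation, in which a segmentation is scored by how well the mean intensities inside and outside the contour fit the image; the contour-length regularizer is omitted. This is not the author's implementation (which is linked in the abstract), and the function name and default weights are hypothetical.

```python
def chan_vese_region_loss(image, mask, lambda1=1.0, lambda2=1.0, eps=1e-8):
    """Region-fitting part of the Chan-Vese energy for a soft mask in [0, 1].

    image and mask are equally sized 2D lists. c1 and c2 are the mean
    intensities inside and outside the mask; the loss penalizes deviation
    of each region from its own mean.
    """
    pixels = [p for row in image for p in row]
    weights = [m for row in mask for m in row]

    inside = sum(weights)
    outside = sum(1.0 - m for m in weights)
    c1 = sum(m * p for m, p in zip(weights, pixels)) / (inside + eps)
    c2 = sum((1.0 - m) * p for m, p in zip(weights, pixels)) / (outside + eps)

    fit_inside = sum(m * (p - c1) ** 2 for m, p in zip(weights, pixels))
    fit_outside = sum((1.0 - m) * (p - c2) ** 2 for m, p in zip(weights, pixels))
    return lambda1 * fit_inside + lambda2 * fit_outside
```

A mask that separates the two intensity regions of an image yields a loss near zero, while a mask that mixes them is penalized. In a deep learning setting the same expression would be written with differentiable tensor operations so that gradients flow back to the network that predicts the mask.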
[85] Assessing the Geolocation Capabilities, Limitations and Societal Risks of Generative Vision-Language Models
Oliver Grainge,Sania Waheed,Jack Stilgoe,Michael Milford,Shoaib Ehsan
Main category: cs.CV
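TL;DR: This paper systematically evaluates the capabilities, limitations, and societal risks of 25 state-of-the-art vision-language models (VLMs) on geolocation. The models perform poorly on generic street-level images but reach 61% accuracy on images resembling social media content, raising privacy concerns.
Details
Motivation: Geolocation has many beneficial applications, but the improving precision of VLMs brings privacy risks such as stalking and surveillance, and there has been little systematic evaluation of their capabilities and limits.
Contribution: Presents a comprehensive assessment of the geolocation capabilities of 25 state-of-the-art VLMs, shedding light on their reasoning, performance boundaries, and societal risks.
Method: Tests the geolocation capabilities of the VLMs on four benchmark image datasets captured in diverse environments and analyzes performance across image types.
Result: Current VLMs perform poorly on generic street-level images but achieve 61% accuracy on images resembling social media content, highlighting the privacy risk.
Insight: VLM geolocation ability depends heavily on image type; future gains in precision need to be matched by privacy safeguards.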
Abstract: Geo-localization is the task of identifying the location of an image using visual cues alone. It has beneficial applications, such as improving disaster response, enhancing navigation, and geography education. Recently, Vision-Language Models (VLMs) are increasingly demonstrating capabilities as accurate image geo-locators. This brings significant privacy risks, including those related to stalking and surveillance, considering the widespread uses of AI models and sharing of photos on social media. The precision of these models is likely to improve in the future. Despite these risks, there is little work on systematically evaluating the geolocation precision of Generative VLMs, their limits and potential for unintended inferences. To bridge this gap, we conduct a comprehensive assessment of the geolocation capabilities of 25 state-of-the-art VLMs on four benchmark image datasets captured in diverse environments. Our results offer insight into the internal reasoning of VLMs and highlight their strengths, limitations, and potential societal risks. Our findings indicate that current VLMs perform poorly on generic street-level images yet achieve notably high accuracy (61%) on images resembling social media content, raising significant and urgent privacy concerns.
[86] GS: Generative Segmentation via Label Diffusion
Yuhao Chen,Shubin Chen,Liang Lin,Guangrun Wang
Main category: cs.CV
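TL;DR: The paper proposes GS, a new framework that recasts image segmentation as a generative task, generating segmentation masks directly via label diffusion and significantly improving language-driven image segmentation.
Details
Motivation: Traditional image segmentation methods are typically discriminative, and although diffusion models have been introduced to the field, existing approaches remain image-centric. GS uses generative label diffusion to make segmentation itself the primary modeling target and to generate segmentation masks directly.
Contribution: 1. Proposes the GS framework, which formulates segmentation as a generative task; 2. Generates segmentation masks directly via label diffusion, conditioned on the image and the language description; 3. Achieves a new state of the art on Panoptic Narrative Grounding.
Method: 1. Reverses the generative process by generating segmentation masks directly from noise; 2. Conditions on both the input image and the language description; 3. Trains end to end with explicit control over spatial and semantic fidelity.
Result: On the PNG benchmark, GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state of the art.
Insight: Recasting segmentation as a generative task gives more direct control over the semantic and spatial consistency of the output and offers a new approach to language-driven segmentation.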
Abstract: Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions. Traditional methods approach this as a discriminative problem, assigning each pixel to foreground or background based on semantic alignment. Recently, diffusion models have been introduced to this domain, but existing approaches remain image-centric: they either (i) use image diffusion models as visual feature extractors, (ii) synthesize segmentation data via image generation to train discriminative models, or (iii) perform diffusion inversion to extract attention cues from pre-trained image diffusion models, thereby treating segmentation as an auxiliary process. In this paper, we propose GS (Generative Segmentation), a novel framework that formulates segmentation itself as a generative task via label diffusion. Instead of generating images conditioned on label maps and text, GS reverses the generative process: it directly generates segmentation masks from noise, conditioned on both the input image and the accompanying language description. This paradigm makes label generation the primary modeling target, enabling end-to-end training with explicit control over spatial and semantic fidelity. To demonstrate the effectiveness of our approach, we evaluate GS on Panoptic Narrative Grounding (PNG), a representative and challenging benchmark for multimodal segmentation that requires panoptic-level reasoning guided by narrative captions. Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state-of-the-art for language-driven segmentation.
[87] Segmentation Assisted Incremental Test Time Adaptation in an Open World
Manogna Sreenivas,Soma Biswas
Main category: cs.CV
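TL;DR: The paper proposes an incremental test-time adaptation method for the open world that combines segmentation with active labeling so that vision-language models can continually adapt to new classes and domains.
Details
Motivation: Dynamic environments frequently present unfamiliar objects and distribution shifts, and traditional test-time adaptation methods cannot handle new classes and new domains that appear during testing. The paper proposes an incremental test-time adaptation framework to address this.
Contribution: 1. Establishes a new benchmark for Incremental Test Time Adaptation (ITTA); 2. Designs the SegAssist module, which uses segmentation capabilities and active labeling to refine sample selection.
Method: 1. Integrates single-image TTA methods with active labeling techniques; 2. SegAssist repurposes the segmentation capabilities of VLMs to prioritize samples likely to belong to unseen classes.
Result: Experiments on several benchmark datasets show that SegAssist can improve VLM performance in dynamic environments.
Insight: Models in dynamic environments must adapt continually to new data; combining segmentation with active labeling can improve adaptation in the open world.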
Abstract: In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of the deployed trained models. This work addresses Incremental Test Time Adaptation of Vision Language Models, tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation assisted active labeling module, termed SegAssist, which is training free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real world scenarios, where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/
[88] Patch Progression Masked Autoencoder with Fusion CNN Network for Classifying Evolution Between Two Pairs of 2D OCT Slices
Philippe Zhang,Weili Jiang,Yihao Li,Jing Zhang,Sarah Matta,Yubo Tan,Hui Lin,Haoshen Wang,Jiangtian Pan,Hui Xu,Laurent Borderie,Alexandre Le Guilcher,Béatrice Cochener,Chubin Ou,Gwenolé Quellec,Mathieu Lamard
Main category: cs.CV
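TL;DR: The paper proposes a fusion CNN network and a Patch Progression Masked Autoencoder for classifying the evolution between 2D OCT slices, placing in the Top 10 of the AMD progression monitoring challenge.
Details
Motivation: Timely diagnosis and monitoring of AMD are essential to the effectiveness of anti-VEGF treatment. Analyzing OCT scans enables more personalized treatment plans and better outcomes.
Contribution: 1. Proposes a fusion CNN network with model ensembling (Task 1); 2. Designs the Patch Progression Masked Autoencoder, which generates a future OCT and classifies the evolution (Task 2).
Method: 1. Task 1 uses a fusion CNN network with model ensembling; 2. Task 2 uses the Patch Progression Masked Autoencoder to generate the OCT for the next exam and then classifies the evolution with the Task 1 solution.
Result: Both tasks placed in the Top 10 of the MARIO challenge.
Insight: Combining deep learning methods such as CNNs and autoencoders shows promise in medical image analysis, particularly for monitoring disease progression and personalizing treatment.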
Abstract: Age-related Macular Degeneration (AMD) is a prevalent eye condition affecting visual acuity. Anti-vascular endothelial growth factor (anti-VEGF) treatments have been effective in slowing the progression of neovascular AMD, with better outcomes achieved through timely diagnosis and consistent monitoring. Tracking the progression of neovascular activity in OCT scans of patients with exudative AMD allows for the development of more personalized and effective treatment plans. This was the focus of the Monitoring Age-related Macular Degeneration Progression in Optical Coherence Tomography (MARIO) challenge, in which we participated. In Task 1, which involved classifying the evolution between two pairs of 2D slices from consecutive OCT acquisitions, we employed a fusion CNN network with model ensembling to further enhance the model’s performance. For Task 2, which focused on predicting progression over the next three months based on current exam data, we proposed the Patch Progression Masked Autoencoder that generates an OCT for the next exam and then classifies the evolution between the current OCT and the one generated using our solution from Task 1. The results we achieved allowed us to place in the Top 10 for both tasks. Some team members are part of the same organization as the challenge organizers; therefore, we are not eligible to compete for the prize.
[89] PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
Zheng Li,Yanming Guo,WenZhe Liu,Xueyi Zhang,Zhaoyun Ding,Long Xu,Mingrui Lao
Main category: cs.CV
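TL;DR: PAUL is a framework for the noisy correspondence problem in cross-view geo-localization. It uses uncertainty-guided data partitioning and augmentation to handle image pairs that are not perfectly aligned in practice.
Details
Motivation: Existing cross-view geo-localization methods assume perfectly aligned image pairs in the training data, whereas in practice factors such as GPS drift often leave only partial correspondences. PAUL targets this noisy correspondence problem.
Contribution: 1) Formalizes the Noisy Correspondence on Cross-View Geo-Localization (NC-CVGL) problem; 2) Proposes the PAUL framework, which suppresses noise through uncertainty learning and targeted augmentation; 3) Performs well across a range of noise ratios.
Method: PAUL partitions and augments the data using uncertainty estimates, including: 1) selective augmentation of regions with high correspondence confidence; 2) evidential co-training to refine feature learning.
Result: PAUL outperforms other noisy-correspondence methods across noise ratios, and experiments validate the effectiveness of its components.
Insight: Noisy correspondence is a key challenge in practical deployments; combining data partitioning and augmentation with uncertainty learning can improve model robustness.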
Abstract: Cross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, as it enables matching between drone-captured and satellite imagery. Most existing approaches embed multi-modal data into a joint feature space to maximize the similarity of paired images. However, these methods typically assume perfect alignment of image pairs during training, which rarely holds true in real-world scenarios. In practice, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic alignment shifts where only partial correspondences exist between pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research. In this paper, we formally introduce and address the Noisy Correspondence on Cross-View Geo-Localization (NC-CVGL) problem, aiming to bridge the gap between idealized benchmarks and practical applications. To this end, we propose PAUL (Partition and Augmentation by Uncertainty Learning), a novel framework that partitions and augments training data based on estimated data uncertainty through uncertainty-aware co-augmentation and evidential co-training. Specifically, PAUL selectively augments regions with high correspondence confidence and utilizes uncertainty estimation to refine feature learning, effectively suppressing noise from misaligned pairs. Distinct from traditional filtering or label correction, PAUL leverages both data uncertainty and loss discrepancy for targeted partitioning and augmentation, thus providing robust supervision for noisy samples. Comprehensive experiments validate the effectiveness of individual components in PAUL, which consistently achieves superior performance over other competitive noisy-correspondence-driven methods across various noise ratios.
[90] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Zhixuan Liang,Yizhuo Li,Tianshuo Yang,Chengyue Wu,Sitong Mao,Liuao Pei,Xiaokang Yang,Jiangmiao Pang,Yao Mu,Ping Luo
Main category: cs.CV
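TL;DR: This paper proposes Discrete Diffusion VLA, which brings discrete diffusion to vision-language-action (VLA) policies in place of autoregressive or continuous diffusion decoding, giving a more unified architecture and more efficient decoding.
Details
Motivation: Existing VLA decoders either generate actions autoregressively in a fixed order or require separately trained continuous diffusion heads, which limits the generality and scalability of the architecture.
Contribution: Proposes a single-transformer policy based on discrete diffusion that retains the progressive refinement paradigm of diffusion while staying compatible with the discrete token interface of VLMs, supporting parallel decoding and an adaptive decoding order.
Method: Models discretized action chunks with discrete diffusion, trains with the same cross-entropy objective as the VLM backbone, and introduces secondary remasking to improve prediction consistency.
Result: Achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal, and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines.
Insight: A discrete-diffusion action decoder supports precise action modeling without sacrificing training consistency, laying groundwork for scaling VLA models.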
Abstract: Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion’s progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that discrete-diffusion action decoder supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets.
[91] AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Yuxin Guo,Teng Wang,Yuying Ge,Shijie Ma,Yixiao Ge,Wei Zou,Ying Shan
Main category: cs.CV
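TL;DR: AudioStory integrates large language models (LLMs) with text-to-audio (TTA) systems to generate long-form narrative audio, addressing the limitations of existing techniques in temporal coherence and compositional reasoning.
Details
Motivation: Existing TTA techniques struggle to maintain temporal coherence and emotional consistency when generating long-form narrative audio, which calls for stronger reasoning and generation capabilities.
Contribution: 1. Proposes the AudioStory framework, which combines LLMs with TTA systems for long-form audio generation; 2. Designs a decoupled bridging mechanism and an end-to-end training framework; 3. Establishes the AudioStory-10K benchmark.
Method: 1. Uses LLMs to decompose complex narrative queries into temporally ordered sub-tasks; 2. Splits LLM-diffuser collaboration into a bridging query and a residual query; 3. Unifies instruction comprehension and audio generation in an end-to-end training framework.
Result: AudioStory performs well on both single-audio and long-form narrative audio generation, surpassing prior TTA baselines.
Insight: Combining the reasoning ability of LLMs with the generative ability of TTA can improve the coherence and quality of long-form audio.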
Abstract: Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory
[92] CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
Zeyi Sun,Yuhang Cao,Jianze Liang,Qiushi Sun,Ziyu Liu,Zhixiong Zhang,Yuhang Zang,Xiaoyi Dong,Kai Chen,Dahua Lin,Jiaqi Wang
Main category: cs.CV
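TL;DR: CODA is a new trainable compositional framework that integrates a generalist planner with a specialist executor and is trained in two stages. It addresses the planning and execution challenges that GUI agents face in scientific computing and significantly outperforms baselines.
Details
Motivation: Existing autonomous GUI agents in scientific computing face a trade-off between generality and specialization, and static compositional frameworks cannot learn and adapt from experience. CODA addresses this with a trainable compositional framework and two-stage training.
Contribution: 1. Proposes the CODA framework, which integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum);
2. Designs a two-stage training pipeline (Specialization and Generalization) targeting expert planning and cross-domain generalization, respectively;
3. Significantly outperforms baselines on the ScienceBoard benchmark, setting a new state of the art among open-source models.
Method: 1. Specialization stage: applies a decoupled GRPO approach to train an expert planner for each scientific application individually;
2. Generalization stage: aggregates successful trajectories into a consolidated dataset used for supervised fine-tuning of the final planner.
Result: Evaluated on four applications from the ScienceBoard benchmark, CODA significantly outperforms baselines and establishes a new state of the art among open-source models.
Insight: Combining generalist planning with specialist execution, and reaching cross-domain generalization through two-stage training, is an effective way to tackle GUI agent challenges in scientific computing.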
Abstract: Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains such as scientific computing, where both long-horizon planning and precise execution are required. Existing approaches suffer from a trade-off: generalist agents excel at planning but perform poorly in execution, while specialized agents demonstrate the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a planner and an actor, but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. In the first stage, Specialization, we apply a decoupled GRPO approach to train an expert planner for each scientific application individually, bootstrapping from a small set of task trajectories. In the second stage, Generalization, we aggregate all successful trajectories from the specialized experts to build a consolidated dataset, which is then used for supervised fine-tuning of the final planner. This equips CODA with both robust execution and cross-domain generalization. Evaluated on four challenging applications from the ScienceBoard benchmark, CODA significantly outperforms baselines and establishes a new state of the art among open-source models.
cs.CR
[93] An Investigation on Group Query Hallucination Attacks
Kehao Miao,Xiaolong Jin
Main category: cs.CR
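TL;DR: This paper proposes Group Query Attack, a technique that simulates users posing multiple questions to a large language model (LLM) in one interaction. The attack significantly degrades model performance and may trigger hidden backdoors.
Details
Motivation: As LLMs are widely deployed, understanding their potential failure modes during user interaction is essential. Users often pose multiple questions in a single conversation, so the effect of this scenario on model outputs needs study.
Contribution: 1. Proposes Group Query Attack to simulate multi-question interaction. 2. Shows that accumulated context from consecutive prompts significantly degrades the performance of models fine-tuned on specific tasks. 3. Finds that the attack risks triggering potential backdoors in LLMs. 4. Demonstrates that the attack is also effective on reasoning tasks such as mathematical reasoning and code generation.
Method: Presents groups of queries to LLMs simultaneously, studies how the accumulated context influences model outputs, and runs experiments on task-specific fine-tuned models as well as pre-trained and aligned models.
Result: Group Query Attack significantly degrades the performance of models fine-tuned on specific tasks and risks triggering potential backdoors. It also harms pre-trained and aligned models on mathematical reasoning and code generation.
Insight: The study exposes the fragility of LLMs in multi-question interactions, underlining the need to account for the cumulative effect of consecutive prompts in model design and to stay alert to backdoor risks.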
Abstract: With the widespread use of large language models (LLMs), understanding their potential failure modes during user interactions is essential. In practice, users often pose multiple questions in a single conversation with LLMs. Therefore, in this study, we propose Group Query Attack, a technique that simulates this scenario by presenting groups of queries to LLMs simultaneously. We investigate how the accumulated context from consecutive prompts influences the outputs of LLMs. Specifically, we observe that Group Query Attack significantly degrades the performance of models fine-tuned on specific tasks. Moreover, we demonstrate that Group Query Attack induces a risk of triggering potential backdoors of LLMs. Besides, Group Query Attack is also effective in tasks involving reasoning, such as mathematical reasoning and code generation for pre-trained and aligned models.
[94] SoK: Large Language Model Copyright Auditing via Fingerprinting
Shuo Shao,Yiming Li,Yu He,Hongwei Yao,Wenyuan Yang,Dacheng Tao,Zhan Qin
Main category: cs.CR
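TL;DR: This paper presents the first comprehensive study of fingerprinting for copyright auditing of large language models (LLMs), introduces a unified framework and taxonomy, and proposes the LeaFBench benchmark, revealing the strengths and weaknesses of existing methods.
Details
Motivation: LLMs are valuable intellectual property because of the substantial resources needed to train them, but they are vulnerable to copyright infringement. Fingerprinting is a promising non-intrusive solution, yet its reliability is uncertain given the diversity of model modifications and the lack of standardized evaluation.
Contribution: 1) Presents the first comprehensive study of LLM fingerprinting; 2) Introduces a unified framework and taxonomy; 3) Proposes the LeaFBench benchmark, which covers 13 post-development techniques.
Method: Categorizes existing methods into white-box and black-box approaches and evaluates them systematically on LeaFBench, which comprises 149 model instances and 13 post-development techniques.
Result: Experiments reveal the strengths and weaknesses of existing methods and outline directions for future research.
Insight: LLM copyright auditing must account for diverse model modifications, and standardized evaluation tools such as LeaFBench are important to progress in the field.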
Abstract: The broad capabilities and substantial resources required to train Large Language Models (LLMs) make them valuable intellectual property, yet they remain vulnerable to copyright infringement, such as unauthorized use and model theft. LLM fingerprinting, a non-intrusive technique that extracts and compares the distinctive features from LLMs to identify infringements, offers a promising solution to copyright auditing. However, its reliability remains uncertain due to the prevalence of diverse model modifications and the lack of standardized evaluation. In this SoK, we present the first comprehensive study of LLM fingerprinting. We introduce a unified framework and formal taxonomy that categorizes existing methods into white-box and black-box approaches, providing a structured overview of the state of the art. We further propose LeaFBench, the first systematic benchmark for evaluating LLM fingerprinting under realistic deployment scenarios. Built upon mainstream foundation models and comprising 149 distinct model instances, LeaFBench integrates 13 representative post-development techniques, spanning both parameter-altering methods (e.g., fine-tuning, quantization) and parameter-independent mechanisms (e.g., system prompts, RAG). Extensive experiments on LeaFBench reveal the strengths and weaknesses of existing methods, thereby outlining future research directions and critical open problems in this emerging field. The code is available at https://github.com/shaoshuo-ss/LeaFBench.
[95] A Technical Review on Comparison and Estimation of Steganographic Tools
Ms. Preeti P. Bhatt,Rakesh R. Savant
Main category: cs.CR
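TL;DR: This review classifies and compares image steganography tools, analyzes them on the basis of image features, and tests six commonly used tools.
Details
Motivation: Steganography hides data inside cover media, and the performance and efficiency of image steganography tools vary with the tool and the image features. The paper compares tools to help in choosing the best one.
Contribution: Provides a systematic comparison of six commonly used image steganography tools and analyzes their performance in terms of image features such as size, dimensions, pixel values, and histogram differences.
Method: Selects six frequently used steganography tools and tests them on the same input (host images with a specific text embedded), then compares the results.
Result: All six tools perform at roughly the same level, though some are more efficient than others; the comparison is based on image features.
Insight: Image features strongly affect the performance of steganography tools, so tool selection should take both the actual requirements and the image properties into account.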
Abstract: Steganography is the technique of hiding data in cover media using steganography tools. Image steganography hides data (text, image, audio, or video) in a cover image. This review paper presents a classification of image steganography and a comparison of various image steganography tools using different image formats. It analyzes numerous tools on the basis of image features in order to identify the best one. Some of the tools available on the market were selected based on how frequently they are used, and all of them were tested with the same input. A specific text was embedded within all host images for each of the six steganography tools selected. The results of the experiment reveal that all six tools performed at roughly the same level, though some performed better than others in terms of efficiency. The comparison was based on image features such as size, dimensions, pixel values, and histogram differences.
[96] Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents
Zhixin Lin,Jungang Li,Shidong Pan,Yibo Shi,Yue Yao,Dongliang Xu
Main category: cs.CR
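TL;DR: This paper presents SAPA-Bench, the first large-scale benchmark of its kind, and evaluates the privacy awareness of seven mainstream smartphone agents powered by multimodal large language models (MLLMs), finding it generally poor.
Details
Motivation: Smartphone agents improve convenience but also have broad access to sensitive user information, and their privacy awareness has not been studied systematically.
Contribution: 1) Presents the first privacy awareness benchmark, comprising 7,138 scenarios; 2) Annotates privacy type, sensitivity level, and location; 3) Benchmarks the privacy awareness of seven mainstream agents.
Method: Builds a large-scale scenario dataset with annotated privacy content, tests the ability of the agents to recognize it, and analyzes how that ability relates to sensitivity level.
Result: Almost all benchmarked agents score below 60% on privacy awareness (RA) even with explicit hints; closed-source agents outperform open-source ones, and Gemini 2.0-flash performs best with an RA of 67%.
Insight: Privacy detection capability is strongly related to scenario sensitivity; the findings call on the research community to rethink the balance between utility and privacy.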
Abstract: Smartphones bring significant convenience to users but also enable devices to extensively record various types of personal information. Existing smartphone agents powered by Multimodal Large Language Models (MLLMs) have achieved remarkable performance in automating different tasks. However, this comes at a cost: these agents are granted substantial access to users’ sensitive personal information during operation. To gain a thorough understanding of the privacy awareness of these agents, we present what is, to the best of our knowledge, the first large-scale benchmark, encompassing 7,138 scenarios. In addition, for the privacy context in each scenario, we annotate its type (e.g., Account Credentials), sensitivity level, and location. We then carefully benchmark seven available mainstream smartphone agents. Our results demonstrate that almost all benchmarked agents show unsatisfactory privacy awareness (RA), with performance remaining below 60% even with explicit hints. Overall, closed-source agents show better privacy ability than open-source ones, and Gemini 2.0-flash performs best, with an RA of 67%. We also find that the agents’ privacy detection capability is highly related to scenario sensitivity level, i.e., a scenario with a higher sensitivity level is typically more identifiable. We hope the findings prompt the research community to rethink the unbalanced utility-privacy tradeoff of smartphone agents. Our code and benchmark are available at https://zhixin-l.github.io/SAPA-Bench.
[97] Addressing Deepfake Issue in Selfie banking through camera based authentication
Subhrojyoti Mukherjee,Manoranjan Mohanty
Main category: cs.CR
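TL;DR: This paper explores using an established forensic recognition system (originally used for picture camera localization) to detect deepfakes in selfie banking.
Details
Motivation: Increasingly mature deepfake technology is used to create highly realistic fake identities, threatening selfie banking and other services that rely on biometrics such as facial recognition.
Contribution: Proposes using an existing forensic recognition system for deepfake detection, giving selfie banking a new anti-fraud measure.
Method: Applies the established forensic recognition system (originally used for camera localization) to deepfake detection, analyzing low-level image features to distinguish real images from forged ones.
Result: The approach shows potential for detecting deepfake images, but specific performance figures are not given.
Insight: Reusing a mature system for a new detection problem is an efficient and practical approach, especially given how quickly deepfake technology is advancing.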
Abstract: Fake images in selfie banking are increasingly becoming a threat. Previously, it was just Photoshop, but now deep learning technologies enable us to create highly realistic fake identities, which fraudsters exploit to bypass biometric systems such as facial recognition in online banking. This paper explores the use of an already established forensic recognition system, previously used for picture camera localization, in deepfake detection.
eess.IV [Back]
[98] 2D Ultrasound Elasticity Imaging of Abdominal Aortic Aneurysms Using Deep Neural Networks
Utsav Ratna Tuladhar,Richard Simon,Doran Mix,Michael Richards
Main category: eess.IV
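TL;DR: This paper proposes a deep-learning method for assessing abdominal aortic aneurysm (AAA) risk with 2D ultrasound elasticity imaging. It generates displacement fields with their corresponding modulus distributions and trains a U-Net model to predict tissue stiffness, offering a non-invasive assessment tool for clinical use.
Details
Motivation: AAA rupture risk is hard to assess accurately from the conventional diameter measurement alone, so analysis of the tissue's elastic properties is needed to improve risk assessment.
Contribution: A deep-learning framework for 2D ultrasound elasticity imaging that is trained on simulated data and validated on simulations, phantom experiments, and clinical data, providing a fast, non-invasive approach to AAA risk assessment.
Method: Finite element simulations generate diverse displacement fields and modulus distributions. A U-Net model is trained with normalized mean squared error (NMSE) to infer the spatial modulus distribution from the displacement field.
Result: The model reaches an NMSE of 0.73% on simulated data, generalizes well to phantom experiments and clinical data, and is more computationally efficient than an iterative method.
Insight: Deep learning can quickly predict tissue stiffness directly from ultrasound data, avoiding the computational burden of traditional methods and offering a new direction for clinical AAA risk assessment.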
Abstract: Abdominal aortic aneurysms (AAA) pose a significant clinical risk due to their potential for rupture, which is often asymptomatic but can be fatal. Although maximum diameter is commonly used for risk assessment, diameter alone is insufficient as it does not capture the properties of the underlying material of the vessel wall, which play a critical role in determining the risk of rupture. To overcome this limitation, we propose a deep learning-based framework for elasticity imaging of AAAs with 2D ultrasound. Leveraging finite element simulations, we generate a diverse dataset of displacement fields with their corresponding modulus distributions. We train a model with U-Net architecture and normalized mean squared error (NMSE) to infer the spatial modulus distribution from the axial and lateral components of the displacement fields. This model is evaluated across three experimental domains: digital phantom data from 3D COMSOL simulations, physical phantom experiments using biomechanically distinct vessel models, and clinical ultrasound exams from AAA patients. Our simulated results demonstrate that the proposed deep learning model is able to reconstruct modulus distributions, achieving an NMSE score of 0.73%. Similarly, in phantom data, the predicted modular ratio closely matches the expected values, affirming the model’s ability to generalize to phantom data. We compare our approach with an iterative method which shows comparable performance but higher computation time. In contrast, the deep learning method can provide quick and effective estimates of tissue stiffness from ultrasound images, which could help assess the risk of AAA rupture without invasive procedures.
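The abstract names normalized mean squared error (NMSE) as the training criterion but does not spell out the normalization. The following is a minimal sketch that assumes one common definition, the squared error divided by the energy of the target; the function name and the toy modulus values are hypothetical and not taken from the paper.

```python
def nmse(predicted, target):
    """Normalized MSE: squared error divided by the energy of the target.

    Assumption: this normalization is one common convention; the paper's
    exact definition is not given in the abstract.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    error = sum((p - t) ** 2 for p, t in zip(predicted, target))
    energy = sum(t ** 2 for t in target)
    if energy == 0:
        raise ValueError("target has zero energy; NMSE is undefined")
    return error / energy


# Flattened toy modulus maps in arbitrary units (illustrative values only).
target = [1.0, 1.0, 2.0, 2.0]
predicted = [1.0, 1.1, 2.0, 1.9]
score = nmse(predicted, target)
print(f"NMSE = {score:.4%}")
```

Because the error is scaled by the target's energy, the score is unitless and can be reported as a percentage, which is how the 0.73% figure above reads.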
[99] MedVQA-TREE: A Multimodal Reasoning and Retrieval Framework for Sarcopenia Prediction
Pardis Moradbeiki,Nasser Ghadiri,Sayed Jalal Zahabi,Uffe Kock Wiil,Kristoffer Kittelmann Brockhattingen,Ali Ebrahimi
Main category: eess.IV
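TL;DR: MedVQA-TREE is a multimodal framework that combines hierarchical image interpretation, gated feature fusion, and a multi-hop, multi-query retrieval strategy to substantially improve the accuracy of sarcopenia diagnosis.
Details
Motivation: Ultrasound diagnosis of sarcopenia is challenging because imaging cues are subtle, labeled data are limited, and clinical context is missing.
Contribution: A framework that combines multi-level visual interpretation with multimodal knowledge retrieval and substantially improves diagnostic accuracy.
Method: Hierarchical image interpretation (anatomical classification, region segmentation, and graph-based spatial reasoning), gated feature fusion, and a multi-hop, multi-query knowledge retrieval strategy.
Result: Up to 99% diagnostic accuracy on two public datasets and a custom dataset, exceeding previous methods by more than 10%.
Insight: Combining structured visual understanding with guided knowledge retrieval can effectively support AI-assisted medical diagnosis.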
Abstract: Accurate sarcopenia diagnosis via ultrasound remains challenging due to subtle imaging cues, limited labeled data, and the absence of clinical context in most models. We propose MedVQA-TREE, a multimodal framework that integrates a hierarchical image interpretation module, a gated feature-level fusion mechanism, and a novel multi-hop, multi-query retrieval strategy. The vision module includes anatomical classification, region segmentation, and graph-based spatial reasoning to capture coarse, mid-level, and fine-grained structures. A gated fusion mechanism selectively integrates visual features with textual queries, while clinical knowledge is retrieved through a UMLS-guided pipeline accessing PubMed and a sarcopenia-specific external knowledge base. MedVQA-TREE was trained and evaluated on two public MedVQA datasets (VQA-RAD and PathVQA) and a custom sarcopenia ultrasound dataset. The model achieved up to 99% diagnostic accuracy and outperformed previous state-of-the-art methods by over 10%. These results underscore the benefit of combining structured visual understanding with guided knowledge retrieval for effective AI-assisted diagnosis in sarcopenia.
[100] AT-CXR: Uncertainty-Aware Agentic Triage for Chest X-rays
Xueyang Li,Mingze Jiang,Gelei Xu,Jun Xia,Mengzhao Jia,Danny Chen,Yiyu Shi
Main category: eess.IV
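TL;DR: The paper proposes AT-CXR, an uncertainty-aware agentic triage system for chest X-rays. It estimates per-case confidence and distributional fit and applies either a rule-based or an LLM-based routing policy, achieving high accuracy at low latency.
Details
Motivation: Although agentic AI is advancing rapidly, truly autonomous medical-imaging triage (deciding when to stop, escalate, or defer) remains underexplored. This paper aims to fill that gap.
Contribution: 1) The AT-CXR system, which combines uncertainty estimation with a stepwise decision policy; 2) two router designs (rule-based and LLM-based) that outperform existing methods in accuracy and latency; 3) validation on the NIH ChestX-ray14 dataset.
Method: 1) Estimate confidence and distributional fit for each case; 2) follow a stepwise policy that either issues an automated decision or abstains with a suggested label for human intervention; 3) compare a rule-based router with an LLM router.
Result: On a balanced subset of NIH ChestX-ray14, both routers outperform existing methods, with higher full-coverage accuracy and lower AURC, while meeting clinical latency requirements.
Insight: The two routing policies provide complementary operating points, so a deployment can prioritize either maximal throughput or maximal accuracy.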
Abstract: Agentic AI is advancing rapidly, yet truly autonomous medical-imaging triage, where a system decides when to stop, escalate, or defer under real constraints, remains relatively underexplored. To address this gap, we introduce AT-CXR, an uncertainty-aware agent for chest X-rays. The system estimates per-case confidence and distributional fit, then follows a stepwise policy to issue an automated decision or abstain with a suggested label for human intervention. We evaluate two router designs that share the same inputs and actions: a deterministic rule-based router and an LLM-decided router. Across five-fold evaluation on a balanced subset of NIH ChestX-ray14 dataset, both variants outperform strong zero-shot vision-language models and state-of-the-art supervised classifiers, achieving higher full-coverage accuracy and superior selective-prediction performance, evidenced by a lower area under the risk-coverage curve (AURC) and a lower error rate at high coverage, while operating with lower latency that meets practical clinical constraints. The two routers provide complementary operating points, enabling deployments to prioritize maximal throughput or maximal accuracy. Our code is available at https://github.com/XLIAaron/uncertainty-aware-cxr-agent.
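The selective-prediction metric cited above, the area under the risk-coverage curve (AURC), can be illustrated with a small sketch. It assumes the standard construction: cases are ranked by confidence, and the error rate (risk) is measured on the most confident fraction (coverage) as that fraction grows. The function names and toy inputs are hypothetical, and this is not the AT-CXR evaluation code.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, risk) pairs, accepting cases from most to least confident."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one case is required")
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for accepted, index in enumerate(order, start=1):
        errors += 0 if correct[index] else 1
        curve.append((accepted / len(order), errors / accepted))
    return curve


def aurc(confidences, correct):
    """Approximate AURC as the mean risk over all coverage levels."""
    curve = risk_coverage_curve(confidences, correct)
    return sum(risk for _, risk in curve) / len(curve)


# Toy example: the single wrong prediction has low confidence,
# so it is only accepted at high coverage.
correct = [True, True, False, True]
well_ranked = aurc([0.95, 0.90, 0.60, 0.55], correct)
# Same predictions, but the wrong one is now the most confident.
badly_ranked = aurc([0.55, 0.60, 0.95, 0.90], correct)
print(well_ranked, badly_ranked)
```

A lower AURC means errors are concentrated among low-confidence cases, which is what makes abstaining and deferring those cases to a human worthwhile.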
cs.CY [Back]
[101] Should LLMs be WEIRD? Exploring WEIRDness and Human Rights in Large Language Models
Ke Zhou,Marios Constantinides,Daniele Quercia
Main category: cs.CY
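TL;DR: The paper examines how the WEIRD values (Western, Educated, Industrialized, Rich, and Democratic) reflected in the training data of large language models (LLMs) affect cultural bias and fairness, and analyzes how far model responses conflict with global human rights principles.
Details
Motivation: LLM training data mainly reflect WEIRD values, which may lead to cultural bias and even to violations of global human rights principles. The authors set out to expose this problem through empirical analysis.
Contribution: 1) An evaluation of how strongly five mainstream LLMs (such as GPT-3.5 and Llama-3) align with WEIRD values; 2) evidence of a trade-off in these models between cultural diversity and conflicts with human rights.
Method: Using World Values Survey data, the authors compare LLM responses against the values of WEIRD countries and against the Universal Declaration of Human Rights and three regional charters from Asia, the Middle East, and Africa.
Result: Models less aligned with WEIRD values (such as BLOOM and Qwen) produce more culturally varied responses but are 2% to 4% more likely to generate outputs that violate human rights. For example, some responses reinforce harmful gender norms.
Insight: Greater cultural diversity may come with greater human rights risk. Approaches such as Constitutional AI may not fully resolve this tension, and a more comprehensive framework is needed to balance cultural inclusiveness with human rights protection.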
Abstract: Large language models (LLMs) are often trained on data that reflect WEIRD values: Western, Educated, Industrialized, Rich, and Democratic. This raises concerns about cultural bias and fairness. Using responses to the World Values Survey, we evaluated five widely used LLMs: GPT-3.5, GPT-4, Llama-3, BLOOM, and Qwen. We measured how closely these responses aligned with the values of the WEIRD countries and whether they conflicted with human rights principles. To reflect global diversity, we compared the results with the Universal Declaration of Human Rights and three regional charters from Asia, the Middle East, and Africa. Models with lower alignment to WEIRD values, such as BLOOM and Qwen, produced more culturally varied responses but were 2% to 4% more likely to generate outputs that violated human rights, especially regarding gender and equality. For example, some models agreed with the statements "a man who cannot father children is not a real man" and "a husband should always know where his wife is", reflecting harmful gender norms. These findings suggest that as cultural representation in LLMs increases, so does the risk of reproducing discriminatory beliefs. Approaches such as Constitutional AI, which could embed human rights principles into model behavior, may only partly help resolve this tension.
[102] Geopolitical Parallax: Beyond Walter Lippmann Just After Large Language Models
Mehmet Can Yavuz,Humza Gohar Kabir,Aylin Özkan
Main category: cs.CY
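TL;DR: This paper studies the geopolitical divergence that large language models (LLMs) show when assessing news quality, revealing that a model's origin systematically affects its assessments.
Details
Motivation: As LLMs are increasingly used in journalism, their training data and design choices may embed cultural or ideological biases, leading to divergent assessments of news quality and subjectivity. The paper aims to quantify these differences.
Contribution: By comparing Chinese-origin and Western-origin model families on news quality assessment, the paper reveals non-random differences driven by geopolitical perspective and argues that cultural calibration is necessary.
Method: Logistic regression probes and matched-topic evaluation are used to compare Chinese and Western model families across multiple news quality dimensions, focusing on Palestine-related and China-United States news topics.
Result: In Palestine-related coverage, Western models tend to assign higher subjectivity and positive emotion scores, while Chinese models emphasize novelty and descriptiveness. In China-United States coverage, Chinese models give lower scores on dimensions such as fluency and technicality.
Insight: LLM-based news quality assessment carries geopolitical bias, so downstream tasks need to account for cultural differences to avoid model-induced bias. Semantic, emotional, and relational subjectivity are key to understanding these differences.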
Abstract: Objectivity in journalism has long been contested, oscillating between ideals of neutral, fact-based reporting and the inevitability of subjective framing. With the advent of large language models (LLMs), these tensions are now mediated by algorithmic systems whose training data and design choices may themselves embed cultural or ideological biases. This study investigates geopolitical parallax (systematic divergence in news quality and subjectivity assessments) by comparing article-level embeddings from Chinese-origin (Qwen, BGE, Jina) and Western-origin (Snowflake, Granite) model families. We evaluate both on a human-annotated news quality benchmark spanning fifteen stylistic, informational, and affective dimensions, and on parallel corpora covering politically sensitive topics, including Palestine and reciprocal China-United States coverage. Using logistic regression probes and matched-topic evaluation, we quantify per-metric differences in predicted positive-class probabilities between model families. Our findings reveal consistent, non-random divergences aligned with model origin. In Palestine-related coverage, Western models assign higher subjectivity and positive emotion scores, while Chinese models emphasize novelty and descriptiveness. Cross-topic analysis shows asymmetries in structural quality metrics, with Chinese-on-US scoring notably lower in fluency, conciseness, technicality, and overall quality, contrasted by higher negative emotion scores. These patterns align with media bias theory and our distinction between semantic, emotional, and relational subjectivity, and extend LLM bias literature by showing that geopolitical framing effects persist in downstream quality assessment tasks. We conclude that LLM-based media evaluation pipelines require cultural calibration to avoid conflating content differences with model-induced bias.
cs.RO [Back]
[103] DATR: Diffusion-based 3D Apple Tree Reconstruction Framework with Sparse-View
Tian Qiu,Alan Zoubi,Yiyuan Lin,Ruiming Du,Lailiang Cheng,Yu Jiang
Main category: cs.RO
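TL;DR: This paper proposes a two-stage framework (DATR) for reconstructing 3D models of apple trees from sparse views. It combines a diffusion model with a large reconstruction model (LRM), outperforms existing methods on field and synthetic datasets, and substantially increases throughput.
Details
Motivation: Digital twin applications need high-fidelity 3D reconstruction, but existing methods perform poorly with sparse and occluded views, and the complex canopy structure of trees in agricultural scenes is beyond what current techniques handle well.
Contribution: 1) The DATR framework, which combines a diffusion model with a large reconstruction model to reconstruct 3D trees from sparse views; 2) synthetic training data generated by a Real2Sim data generator; 3) strong results on both field and synthetic datasets, with about 360 times higher throughput.
Method: 1) The first stage uses foundation models and onboard sensors to generate tree masks and filter out background information; 2) the second stage uses a diffusion model to generate multiple views and an LRM to generate an implicit neural field 3D model.
Result: On both field and synthetic datasets, DATR outperforms other methods, achieves trait estimation comparable to industrial-grade stationary laser scanners, and improves throughput by about 360 times.
Insight: Combining diffusion models with an LRM is promising for sparse-view 3D reconstruction, and synthetic training data can offset the shortage of real data, offering a workable technical path for agricultural digital twins.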
Abstract: Digital twin applications offer transformative potential by enabling real-time monitoring and robotic simulation through accurate virtual replicas of physical assets. The key to these systems is 3D reconstruction with high geometrical fidelity. However, existing methods struggle under field conditions, especially with sparse and occluded views. This study developed a two-stage framework (DATR) for the reconstruction of apple trees from sparse views. The first stage leverages onboard sensors and foundation models to semi-automatically generate tree masks from complex field images. Tree masks are used to filter out background information in multi-modal data for the single-image-to-3D reconstruction at the second stage. This stage consists of a diffusion model and a large reconstruction model (LRM) for multi-view generation and implicit neural field generation, respectively. The diffusion model and LRM were trained on realistic synthetic apple trees generated by a Real2Sim data generator. The framework was evaluated on both field and synthetic datasets. The field dataset includes six apple trees with field-measured ground truth, while the synthetic dataset features structurally diverse trees. Evaluation results showed that our DATR framework outperformed existing 3D reconstruction methods across both datasets and achieved domain-trait estimation comparable to industrial-grade stationary laser scanners while improving the throughput by $\sim$360 times, demonstrating strong potential for scalable agricultural digital twin systems.
[104] Context-Aware Risk Estimation in Home Environments: A Probabilistic Framework for Service Robots
Sena Ishii,Akash Chikhalikar,Ankit A. Ravankar,Jose Victorio Salazar Luces,Yasuhisa Hirata
Main category: cs.RO
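TL;DR: The paper proposes a probabilistic framework for service robots that estimates risk regions in home environments with a semantic graph propagation algorithm, improving the robot's understanding of potential hazards.
Details
Motivation: As service robots enter homes, recognizing and responding to environmental risks in real time becomes essential for user safety and effective human-robot interaction.
Contribution: A risk propagation algorithm based on a semantic graph that can infer risk regions that are not explicitly labeled, while remaining lightweight and interpretable.
Method: A semantic graph represents each object as a node with an assigned risk score, and risk propagates asymmetrically based on spatial proximity and accident relationships.
Result: On a dataset with human-annotated risk regions, the system reaches 75% binary risk detection accuracy and aligns strongly with human perception.
Insight: Context-aware risk reasoning can substantially improve a robot's safety behavior in shared spaces and lays a foundation for future real-time alerting and autonomous assistance systems.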
Abstract: We present a novel framework for estimating accident-prone regions in everyday indoor scenes, aimed at improving real-time risk awareness in service robots operating in human-centric environments. As robots become integrated into daily life, particularly in homes, the ability to anticipate and respond to environmental hazards is crucial for ensuring user safety, trust, and effective human-robot interaction. Our approach models object-level risk and context through a semantic graph-based propagation algorithm. Each object is represented as a node with an associated risk score, and risk propagates asymmetrically from high-risk to low-risk objects based on spatial proximity and accident relationship. This enables the robot to infer potential hazards even when they are not explicitly visible or labeled. Designed for interpretability and lightweight onboard deployment, our method is validated on a dataset with human-annotated risk regions, achieving a binary risk detection accuracy of 75%. The system demonstrates strong alignment with human perception, particularly in scenes involving sharp or unstable objects. These results underline the potential of context-aware risk reasoning to enhance robotic scene understanding and proactive safety behaviors in shared human-robot spaces. This framework could serve as a foundation for future systems that make context-driven safety decisions, provide real-time alerts, or autonomously assist users in avoiding or mitigating hazards within home environments.
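The propagation rule described above (risk flowing asymmetrically from high-risk to low-risk objects according to spatial proximity) can be sketched as follows. This is a hypothetical single-pass illustration, not the authors' algorithm: the `decay` factor, the max-based update, the object names, and all numeric values are assumptions, and the accident relationship is folded into the single edge weight.

```python
def propagate_risk(base_risk, proximity, decay=0.5):
    """One propagation pass over an undirected proximity graph.

    Risk only flows from the riskier object of a pair to the less risky
    one, scaled by closeness (0..1) and an assumed decay factor.
    """
    updated = dict(base_risk)
    for (first, second), closeness in proximity.items():
        for source, sink in ((first, second), (second, first)):
            if base_risk[source] > base_risk[sink]:
                transferred = decay * closeness * base_risk[source]
                # Keep the larger of the object's own and inherited risk.
                updated[sink] = max(updated[sink], transferred)
    return updated


# Hypothetical kitchen scene: a knife lies on a cutting board, far from a sofa.
base_risk = {"knife": 0.9, "cutting_board": 0.1, "sofa": 0.05}
proximity = {("knife", "cutting_board"): 1.0, ("knife", "sofa"): 0.1}
scene_risk = propagate_risk(base_risk, proximity)
print(scene_risk)
```

In this toy scene the cutting board inherits part of the knife's risk because it is adjacent, while the distant sofa keeps its own score. That is the kind of inference the abstract describes for hazards that are not explicitly labeled.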
cs.SD [Back]
[105] Beat-Based Rhythm Quantization of MIDI Performances
Maximilian Wachter,Sebastian Murgul,Michael Heizmann
Main category: cs.SD
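TL;DR: This paper proposes a transformer-based rhythm quantization model that uses beat and downbeat information to quantize MIDI performances into metrically aligned, human-readable scores, and improves performance by optimizing the model architecture and data representation.
Details
Motivation: Existing rhythm quantization methods usually ignore beat and downbeat information, producing results that are unnatural or metrically misaligned. The paper aims to improve quantization quality by incorporating this information.
Contribution: 1) A beat-based preprocessing method that converts score and performance data into a unified token representation; 2) a transformer model that incorporates beat and downbeat information; 3) results on piano and guitar performance datasets that exceed the state of the art.
Method: 1) Preprocess the data using beat and downbeat information; 2) design a transformer-based rhythm quantization model; 3) improve performance by optimizing the data representation and model architecture.
Result: Experiments show that the model exceeds the state of the art on the MUSTER metric and produces quantized scores that are better aligned with the musical meter.
Insight: Incorporating beat and downbeat information can markedly improve the naturalness and accuracy of rhythm quantization, and transformer models handle time-series music data well.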
Abstract: We propose a transformer-based rhythm quantization model that incorporates beat and downbeat information to quantize MIDI performances into metrically-aligned, human-readable scores. We propose a beat-based preprocessing method that transfers score and performance data into a unified token representation. We optimize our model architecture and data representation and train on piano and guitar performances. Our model exceeds state-of-the-art performance based on the MUSTER metric.
astro-ph.IM [Back]
[106] Modeling spectral filtering effects on color-matching functions: Implications for observer variability
Luvin Munish Ragoo,Ivar Farup,Casper F. Andersen,Graham Finlayson
Main category: astro-ph.IM
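TL;DR: This study examines the effect of spectral filtering on color-matching functions (CMFs) to improve observer variability modeling. It proposes a computational method and transformation matrix for converting unfiltered CMFs into filtered CMFs and confirms their validity experimentally.
Details
Motivation: The study aims to understand how spectral filtering affects CMFs and to explore how a single filter function could simplify the characterization of observer variability and reduce experimental cost.
Contribution: A novel computational method that describes observer variability with a single spectral filter function, replacing the conventional approach that requires three separate functions.
Method: Color matching experiments are combined with computational modeling to estimate the filter transmittance and the transformation matrix that converts unfiltered CMFs into filtered CMFs.
Result: Estimated and measured filter characteristics agree well in the central wavelength regions, and the method applies to converting between the CMFs of observer groups of different ages.
Insight: Differences between observers' CMFs may stem from age-related lens yellowing, which supports the idea that a single filter function can effectively characterize this variability.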
Abstract: This study investigates the impact of spectral filtering on color-matching functions (CMFs) and its implications for observer variability modeling. We conducted color matching experiments with a single observer, both with and without a spectral filter in front of a bipartite field. Using a novel computational approach, we estimated the filter transmittance and transformation matrix necessary to convert unfiltered CMFs to filtered CMFs. Statistical analysis revealed good agreement between estimated and measured filter characteristics, particularly in central wavelength regions. Applying this methodology to compare between Stiles and Burch 1955 (SB1955) mean observer CMFs and our previously published “ICVIO” mean observer CMFs, we identified a “yellow” (short-wavelength suppressing) filter that effectively transforms between these datasets. This finding aligns with our hypothesis that observed differences between the CMF sets are attributable to age-related lens yellowing (average observer age: 49 years in ICVIO versus 30 years in SB1955). Our approach enables efficient representation of observer variability through a single filter rather than three separate functions, offering potentially reduced experimental overhead while maintaining accuracy in characterizing individual color vision differences.
cs.AI [Back]
[107] Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation?
Samuel Lewis-Lim,Xingwei Tan,Zhixue Zhao,Nikolaos Aletras
Main category: cs.AI
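TL;DR: The paper studies the dynamics and faithfulness of Chain-of-Thought (CoT) in soft-reasoning tasks, finding that its gains are limited and that it may not match the model's actual reasoning.
Details
Motivation: Recent work finds that CoT yields limited gains on soft-reasoning problems (such as analytical and commonsense reasoning) and may be unfaithful to the model's actual reasoning. To understand this better, the paper explores the dynamics and faithfulness of CoT across different types of models.
Contribution: An experimental analysis of CoT behavior in instruction-tuned, reasoning, and reasoning-distilled models on soft-reasoning tasks, showing how and how much these models rely on CoT and that CoT influence and faithfulness are not always aligned.
Method: For soft-reasoning tasks, the authors compare how different models (instruction-tuned, reasoning, and reasoning-distilled) behave under CoT guidance and analyze the differences in dynamics and faithfulness.
Result: CoT influence and faithfulness vary by model type and are not always aligned. In some cases CoT may be only a post-hoc rationalization rather than effective guidance.
Insight: CoT should be used with caution. Its effect and faithfulness depend heavily on model type and task, and future work should focus on making it more reliable.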
Abstract: Recent work has demonstrated that Chain-of-Thought (CoT) often yields limited gains for soft-reasoning problems such as analytical and commonsense reasoning. CoT can also be unfaithful to a model’s actual reasoning. We investigate the dynamics and faithfulness of CoT in soft-reasoning tasks across instruction-tuned, reasoning and reasoning-distilled models. Our findings reveal differences in how these models rely on CoT, and show that CoT influence and faithfulness are not always aligned.
[108] SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control
Quanfeng Lu,Zhantao Ma,Shuai Zhong,Jin Wang,Dahai Yu,Michael K. Ng,Ping Luo
Main category: cs.AI
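TL;DR: SWIRL proposes a staged reinforcement learning workflow that decomposes multi-agent reinforcement learning (MARL) into a sequence of single-agent tasks, improving training stability and coordination efficiency. It applies to mobile GUI control and multi-agent reasoning tasks.
Details
Motivation: Existing single-agent approaches to mobile GUI control are limited by structural constraints, while MARL methods are inefficient and incompatible with current large vision language models. SWIRL aims to address these problems.
Contribution: SWIRL reformulates MARL as a sequence of single-agent tasks, provides theoretical safety guarantees and an efficient coordination mechanism, and performs strongly in mobile GUI control and multi-agent reasoning.
Method: SWIRL uses a staged workflow that updates one agent at a time (for example, the Navigator or the Interactor) while keeping the others fixed, which keeps training stable and coordinated.
Result: Experiments show that SWIRL performs strongly on GUI tasks and on multi-agent mathematical reasoning, outperforming existing methods.
Insight: By training single-agent tasks in stages, SWIRL offers an efficient and theoretically grounded framework for developing multi-agent systems.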
Abstract: The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
cs.SE [Back]
[109] Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking
Zhuohao Li,Wenqing Chen,Jianxing Yu,Zhichao Lu
Main category: cs.SE
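TL;DR: This paper proposes a self-evolving data synthesis framework for evaluating the functional consistency of large language model (LLM) code embeddings and shows experimentally that the framework is effective across several downstream tasks.
Details
Motivation: Research on code embeddings has focused mainly on code clone detection and has overlooked the functional semantics of code. The paper aims to fill this gap by studying whether LLM code embeddings reflect functional consistency rather than only syntactic similarity.
Contribution: 1) A new data synthesis framework (Functionality-Oriented Code Self-Evolution) for generating diverse benchmark datasets for code functional consistency; 2) a definition of four semantic and syntactic categories of code, with the finding that existing datasets mainly capture syntactic properties; 3) experiments showing that embedding models trained with the framework perform better on code clone detection, functional consistency identification, and code retrieval.
Method: 1) A self-evolving data synthesis framework that generates four distinct variations from a single code instance, enriching the examples of functional difference; 2) extensive experiments on the generated benchmark data to evaluate the functional consistency of LLM code embeddings.
Result: Embedding models trained with the framework improve significantly on code clone detection, functional consistency identification, and code retrieval, which supports the effectiveness and generalization of the data synthesis framework.
Insight: Studying the functional consistency of code embeddings supports a deeper understanding of code semantics beyond syntactic similarity. The framework offers a new direction for evaluating and improving code embeddings.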
Abstract: Embedding models have demonstrated strong performance in tasks like clustering, retrieval, and feature extraction while offering computational advantages over generative models and cross-encoders. Benchmarks such as MTEB have shown that text embeddings from large language models (LLMs) capture rich semantic information, but their ability to reflect code-level functional semantics remains unclear. Existing studies largely focus on code clone detection, which emphasizes syntactic similarity and overlooks functional understanding. In this paper, we focus on the functional consistency of LLM code embeddings, which determines if two code snippets perform the same function regardless of syntactic differences. We propose a novel data synthesis framework called Functionality-Oriented Code Self-Evolution to construct diverse and challenging benchmarks. Specifically, we define code examples across four semantic and syntactic categories and find that existing datasets predominantly capture syntactic properties. Our framework generates four unique variations from a single code instance, providing a broader spectrum of code examples that better reflect functional differences. Extensive experiments on three downstream tasks (code clone detection, code functional consistency identification, and code retrieval) demonstrate that embedding models significantly improve their performance when trained on our evolved datasets. These results highlight the effectiveness and generalization of our data synthesis framework, advancing the functional understanding of code.
cs.HC [Back]
[110] Capabilities of GPT-5 across critical domains: Is it the next breakthrough?
Georgios P. Georgiou
Main category: cs.HC
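TL;DR: This paper systematically compares GPT-4 and GPT-5 across several critical domains and finds that GPT-5 significantly outperforms GPT-4 in lesson planning, clinical diagnosis, research generation, and ethical reasoning, showing its potential as a domain-specialized tool.
Details
Motivation: With large language models evolving quickly, evaluating their performance in practical applications has become especially important. GPT-4 showed potential across many domains but still had flaws, so it is necessary to study the performance gains and practical value of its successor, GPT-5.
Contribution: One of the first systematic comparisons of GPT-4 and GPT-5 across several critical domains, using human expert evaluation to confirm that GPT-5 has a significant advantage on multiple tasks.
Method: Twenty experts evaluated outputs generated by GPT-4 and GPT-5 in five domains (lesson planning, assignment evaluation, clinical diagnosis, research generation, and ethical reasoning) against predefined criteria, and the data were analyzed with mixed-effects models.
Result: GPT-5 significantly outperformed GPT-4 in lesson planning, clinical diagnosis, research generation, and ethical reasoning, while the two models performed comparably in assignment evaluation.
Insight: GPT-5's system-of-models architecture is designed for task-specific optimization. The results show its potential as a domain-specialized tool with practical value for education, clinical practice, and academic research.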
Abstract: The accelerated evolution of large language models has raised questions about their comparative performance across domains of practical importance. GPT-4 by OpenAI introduced advances in reasoning, multimodality, and task generalization, establishing itself as a valuable tool in education, clinical diagnosis, and academic writing, though it was accompanied by several flaws. Released in August 2025, GPT-5 incorporates a system-of-models architecture designed for task-specific optimization and, based on both anecdotal accounts and emerging evidence from the literature, demonstrates stronger performance than its predecessor in medical contexts. This study provides one of the first systematic comparisons of GPT-4 and GPT-5 using human raters from linguistics and clinical fields. Twenty experts evaluated model-generated outputs across five domains: lesson planning, assignment evaluation, clinical diagnosis, research generation, and ethical reasoning, based on predefined criteria. Mixed-effects models revealed that GPT-5 significantly outperformed GPT-4 in lesson planning, clinical diagnosis, research generation, and ethical reasoning, while both models performed comparably in assignment assessment. The findings highlight the potential of GPT-5 to serve as a context-sensitive and domain-specialized tool, offering tangible benefits for education, clinical practice, and academic research, while also advancing ethical reasoning. These results contribute to one of the earliest empirical evaluations of the evolving capabilities and practical promise of GPT-5.
cs.LG [Back]
[111] Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation
Ziniu Zhang,Zhenshuo Zhang,Dongyue Li,Lu Wang,Jennifer Dy,Hongyang R. Zhang
Main category: cs.LG
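TL;DR: This paper proposes a linear-time algorithm based on gradient estimation for quickly selecting the best demonstration examples for in-context learning. The method approximates model outputs, which greatly improves selection efficiency.
Details
Motivation: In in-context learning, efficiently selecting the best demonstration examples while model weights stay fixed is a key problem that directly affects prompt tuning and chain-of-thought reasoning.
Contribution: A linear-time algorithm based on gradient estimation that pre-computes model outputs and gradients once, substantially improving the efficiency and accuracy of demonstration selection.
Method: Model outputs are estimated through a first-order approximation using gradients. Randomly sampled subsets are scored and the results aggregated into an influence score for each demonstration, and the $k$ most relevant examples are selected.
Result: Validated across multiple models and datasets: the gradient estimation error is below 1% across six datasets, subset selection is sped up by up to 37.7 times, and the method outperforms existing embedding-based selection methods by 11% on average.
Insight: Gradient estimation provides an efficient approximation for example selection in in-context learning, improving efficiency on large models and datasets without sacrificing accuracy.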
Abstract: This paper introduces an algorithm to select demonstration examples for in-context learning of a query set. Given a set of $n$ examples, how can we quickly select $k$ out of $n$ to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. Since model weights remain fixed during in-context learning, previous work has sought to design methods based on the similarity of token embeddings. This work proposes a new approach based on gradients of the output taken in the input embedding space. Our approach estimates model outputs through a first-order approximation using the gradients. Then, we apply this estimation to multiple randomly sampled subsets. Finally, we aggregate the sampled subset outcomes to form an influence score for each demonstration, and select the $k$ most relevant examples. This procedure only requires pre-computing model outputs and gradients once, resulting in a linear-time algorithm relative to model and training set sizes. Extensive experiments across various models and datasets validate the efficiency of our approach. We show that the gradient estimation procedure yields approximations of full inference with less than $\mathbf{1}\%$ error across six datasets. This allows us to scale up subset selection that would otherwise run full inference by up to $\mathbf{37.7}\times$ on models with up to $34$ billion parameters, and outperform existing selection methods based on input embeddings by $\mathbf{11}\%$ on average.
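The three steps in the abstract (first-order estimation, scoring random subsets, aggregating into per-demonstration influence scores) can be sketched on toy data. Everything below is a hypothetical illustration rather than the paper's implementation: the model output is replaced by a linearization around a reference embedding `z0`, a subset is represented by the mean of its demonstration embeddings, and the gradient and embeddings are made-up values.

```python
import random


def first_order_estimate(f0, grad, z0, z):
    """Approximate f(z) by f(z0) + grad . (z - z0), with no new forward pass."""
    return f0 + sum(g * (zi - z0i) for g, z0i, zi in zip(grad, z0, z))


def mean_embedding(embeddings, subset):
    """Assumed subset representation: the mean of its members' embeddings."""
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in subset) / len(subset)
            for d in range(dim)]


def influence_scores(embeddings, f0, grad, z0, subset_size, num_subsets, seed=0):
    """Average the estimated output of every sampled subset containing each demo."""
    rng = random.Random(seed)
    n = len(embeddings)
    totals = [0.0] * n
    counts = [0] * n
    for _ in range(num_subsets):
        subset = rng.sample(range(n), subset_size)
        estimate = first_order_estimate(
            f0, grad, z0, mean_embedding(embeddings, subset))
        for i in subset:
            totals[i] += estimate
            counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def select_top_k(scores, k):
    """Indices of the k demonstrations with the highest influence score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy setup: output and gradient are computed once at z0 and then reused.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
scores = influence_scores(embeddings, f0=0.0, grad=[1.0, 0.0], z0=[0.0, 0.0],
                          subset_size=2, num_subsets=2000)
selected = select_top_k(scores, k=2)
print(selected)
```

The point of the design is that, once the output and gradient are pre-computed, scoring a subset costs only a dot product. This is what lets many subsets be scored where full inference on each would be too expensive.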
[112] Symphony: A Decentralized Multi-Agent Framework for Scalable Collective Intelligence
Ji Wang,Kashing Chen,Xinyuan Song,Ke Zhang,Lynn Ai,Eric Yang,Bill Shi
Main category: cs.LG
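TL;DR: Symphony is a decentralized multi-agent framework that uses a decentralized ledger, dynamic task allocation, and weighted voting to achieve low-cost, scalable, and fault-tolerant collective intelligence.
Details
Motivation: Most existing LLM-based agent frameworks are centralized, with high deployment costs, rigid communication topologies, and limited adaptability. Symphony aims to address these problems.
Contribution: The decentralized Symphony framework, which comprises a decentralized ledger, a Beacon-selection protocol, and a weighted voting mechanism, and which substantially improves scalability and fault tolerance.
Method: A decentralized ledger records capabilities, a Beacon protocol allocates tasks dynamically, and results are combined by weighted voting based on CoTs.
Result: Symphony outperforms existing baselines on reasoning tasks, with substantial accuracy gains and robustness across models of varying capacities.
Insight: The decentralized design lowers deployment cost while improving flexibility and adaptability, offering a workable approach to large-scale collective intelligence systems.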
Abstract: Most existing Large Language Model (LLM)-based agent frameworks rely on centralized orchestration, incurring high deployment costs, rigid communication topologies, and limited adaptability. To address these challenges, we introduce Symphony, a decentralized multi-agent system that enables lightweight LLMs on consumer-grade GPUs to coordinate. Symphony introduces three key mechanisms: (1) a decentralized ledger that records capabilities, (2) a Beacon-selection protocol for dynamic task allocation, and (3) weighted result voting based on CoTs. This design forms a privacy-preserving, scalable, and fault-tolerant orchestration with low overhead. Empirically, Symphony outperforms existing baselines on reasoning benchmarks, achieving substantial accuracy gains and demonstrating robustness across models of varying capacities.
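The third mechanism, weighted result voting, can be illustrated with a generic sketch. The abstract does not say how Symphony derives weights from the agents' CoTs, so the weights here are simply assumed inputs. The function name and the example values are hypothetical.

```python
from collections import defaultdict


def weighted_vote(proposals):
    """Return the answer with the largest total weight.

    proposals: list of (answer, weight) pairs, one per agent. How the
    weight is computed from an agent's chain of thought is left open.
    """
    if not proposals:
        raise ValueError("at least one proposal is required")
    totals = defaultdict(float)
    for answer, weight in proposals:
        totals[answer] += weight
    return max(totals.items(), key=lambda item: item[1])[0]


# Four hypothetical agents answer the same question with different weights.
proposals = [("42", 0.9), ("41", 0.5), ("41", 0.3), ("42", 0.2)]
winner = weighted_vote(proposals)
print(winner)
```

Unlike a plain majority vote, a single high-weight agent can outvote several low-weight ones. This is one plausible way for a system of mixed-capacity models to remain robust.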
[113] Pruning Strategies for Backdoor Defense in LLMs
Santosh Chapagain,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
Main category: cs.LG
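TL;DR: This paper studies how attention-head pruning can defend pre-trained language models against backdoor attacks. It proposes six pruning strategies and shows experimentally how gradient-based pruning and reinforcement-learning-guided pruning perform against different types of attacks.
Details
Motivation: Backdoor attacks seriously threaten the performance and integrity of pre-trained language models and can survive fine-tuning. Conventional defenses struggle when the trigger is unknown, so defenses that need neither knowledge of the trigger nor a clean reference model are worth exploring.
Contribution: Six attention-head pruning strategies, proposed and systematically evaluated, which provide a new way to defend against backdoor attacks without knowledge of the trigger.
Method: Six strategies built on attention-head pruning: gradient-based pruning, layer-wise variance pruning, gradient-based pruning with structured L1/L2 sparsification, randomized ensemble pruning, reinforcement-learning-guided pruning, and Bayesian uncertainty pruning.
Result: Gradient-based pruning defends best against syntactic triggers, while reinforcement-learning-guided and Bayesian pruning better withstand stylistic attacks.
Insight: Attention-head pruning is an effective backdoor defense that applies to different attack patterns and does not depend on prior knowledge of the trigger.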
Abstract: Backdoor attacks are a significant threat to the performance and integrity of pre-trained language models. Although such models are routinely fine-tuned for downstream NLP tasks, recent work shows they remain vulnerable to backdoor attacks that survive vanilla fine-tuning. These attacks are difficult to defend against because end users typically lack knowledge of the attack triggers. Such attacks consist of stealthy malicious triggers introduced through subtle syntactic or stylistic manipulations, which can bypass traditional detection and remain in the model, making post-hoc purification essential. In this study, we explore whether attention-head pruning can mitigate these threats without any knowledge of the trigger or access to a clean reference model. To this end, we design and implement six pruning-based strategies: (i) gradient-based pruning, (ii) layer-wise variance pruning, (iii) gradient-based pruning with structured L1/L2 sparsification, (iv) randomized ensemble pruning, (v) reinforcement-learning-guided pruning, and (vi) Bayesian uncertainty pruning. Each method iteratively removes the least informative heads while monitoring validation accuracy to avoid over-pruning. Experimental evaluation shows that gradient-based pruning performs best when defending against syntactic triggers, whereas reinforcement learning and Bayesian pruning better withstand stylistic attacks.
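The shared loop described in the abstract (iteratively remove the least informative heads while monitoring validation accuracy) can be sketched generically. This is a hypothetical illustration, not the authors' code: the importance scores are assumed to come from whichever strategy is in use, `evaluate` stands in for a validation run, and the accuracy-drop threshold is an assumed stopping rule.

```python
def prune_heads(importance, evaluate, max_accuracy_drop=0.02):
    """Greedily prune heads in order of increasing importance.

    importance: dict mapping head id to an importance score (assumed given).
    evaluate: callable taking the set of active heads and returning
        validation accuracy (a stand-in for a real evaluation run).
    Stops before the first removal that would cost more than
    max_accuracy_drop relative to the unpruned baseline.
    """
    active = set(importance)
    baseline = evaluate(active)
    pruned = []
    for head in sorted(importance, key=importance.get):
        candidate = active - {head}
        if not candidate:
            break  # never remove the last remaining head
        if baseline - evaluate(candidate) > max_accuracy_drop:
            break  # removing this head would over-prune
        active = candidate
        pruned.append(head)
    return active, pruned


# Toy validation function: accuracy collapses if an essential head is removed.
def toy_evaluate(active_heads):
    return 0.9 if {"h2", "h3"} <= active_heads else 0.5


importance = {"h0": 0.01, "h1": 0.02, "h2": 0.5, "h3": 0.9}
active, pruned = prune_heads(importance, toy_evaluate)
print(sorted(active), pruned)
```

In this sketch the six strategies would differ only in how `importance` is computed. The validation check is what guards against over-pruning in every case.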
[114] Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments
Dikshant Sagar,Kaiwen Yu,Alejandro Yankelevich,Jianming Bian,Pierre Baldi
Main category: cs.LG
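TL;DR: The paper explores using a fine-tuned vision-language model (VLM) to classify neutrino events in high-energy physics experiments. The VLM matches or exceeds conventional CNN methods and supports multimodal reasoning.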
Details
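Motivation: Motivated by the potential of large language models (LLMs) for multimodal reasoning, the work explores vision-language models for complex event classification in high-energy physics experiments, aiming to address the limitations of conventional CNN methods.
Contribution: The main contribution is a fine-tuned vision-language model (VLM) based on LLaMA 3.2 that matches or exceeds conventional CNN performance on neutrino event classification and demonstrates the advantages of multimodal reasoning.
Method: LLaMA 3.2 is fine-tuned to classify neutrino events from pixelated detector images combined with textual or semantic context, and its performance is compared against a CNN baseline.
Result: The VLM matches or exceeds the CNN on classification accuracy, precision, recall, and AUC-ROC, and integrates auxiliary information better.
Insight: The study shows the potential of vision-language models in high-energy physics experiments and opens the way for multimodal approaches in experimental neutrino physics.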
Abstract: Recent progress in large language models (LLMs) has shown strong potential for multimodal reasoning beyond natural language. In this work, we explore the use of a fine-tuned Vision-Language Model (VLM), based on LLaMA 3.2, for classifying neutrino interactions from pixelated detector images in high-energy physics (HEP) experiments. We benchmark its performance against an established CNN baseline used in experiments like NOvA and DUNE, evaluating metrics such as classification accuracy, precision, recall, and AUC-ROC. Our results show that the VLM not only matches or exceeds CNN performance but also enables richer reasoning and better integration of auxiliary textual or semantic context. These findings suggest that VLMs offer a promising general-purpose backbone for event classification in HEP, paving the way for multimodal approaches in experimental neutrino physics.
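For readers unfamiliar with the metrics named in the abstract, the sketch below computes accuracy, precision, recall, and AUC-ROC for a binary classifier in plain Python. It is an illustrative simplification: the paper's task may well be multi-class, and the function name `binary_metrics`, the 0.5 threshold, and the toy labels and scores are assumptions, not data from the paper.

```python
from typing import Dict, Sequence


def binary_metrics(
    labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and AUC-ROC for binary labels (0/1)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # AUC-ROC equals the probability that a random positive outscores
    # a random negative, with ties counted as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)

    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc_roc": wins / (len(pos) * len(neg)) if pos and neg else float("nan"),
    }


# Toy scores (made up for illustration only).
print(binary_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for a sketch. A rank-based formula or a library routine would be the usual choice at scale.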
[115] NM-Hebb: Coupling Local Hebbian Plasticity with Metric Learning for More Accurate and Interpretable CNNs
Davorin Miličević,Ratko Grbić
Main category: cs.LG
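TL;DR: NM-Hebb is a CNN training framework that combines local Hebbian plasticity with metric learning, using two-phase training (supervised learning, then metric learning) to improve model accuracy and interpretability.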
Details
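Motivation: Current CNNs rely on global gradient-based optimisation, which can lead to overfitting, redundant filters, and poor interpretability. The authors aim to address these problems with biologically inspired local plasticity and metric learning.
Contribution: 1. Proposes the NM-Hebb framework, which combines a Hebbian regulariser with a learnable neuromodulator; 2. Explicitly optimises the feature space through metric learning; 3. Validates the accuracy gains and improved interpretability across multiple datasets and models.
Method: 1. Phase 1: a Hebbian regulariser aligns the spatial mean of activations with the mean of the filter weights, and a neuromodulator gates an elastic-weight-style consolidation loss; 2. Phase 2: metric learning optimises intra-class and inter-class distances in the feature space.
Result: On CIFAR-10/100 and TinyImageNet, Top-1 accuracy improves by 2.0-10.0 pp and NMI by up to 0.15, with more structured and interpretable features.
Insight: Combining local plasticity with global metric learning improves both the accuracy and interpretability of CNNs, which suits resource-constrained and safety-critical settings.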
Abstract: Deep Convolutional Neural Networks (CNNs) achieve high accuracy but often rely on purely global, gradient-based optimisation, which can lead to overfitting, redundant filters, and reduced interpretability. To address these limitations, we propose NM-Hebb, a two-phase training framework that integrates neuro-inspired local plasticity with distance-aware supervision. Phase 1 extends standard supervised training by jointly optimising a cross-entropy objective with two biologically inspired mechanisms: (i) a Hebbian regulariser that aligns the spatial mean of activations with the mean of the corresponding convolutional filter weights, encouraging structured, reusable primitives; and (ii) a learnable neuromodulator that gates an elastic-weight-style consolidation loss, preserving beneficial parameters without freezing the network. Phase 2 fine-tunes the backbone with a pairwise metric-learning loss, explicitly compressing intra-class distances and enlarging inter-class margins in the embedding space. Evaluated on CIFAR-10, CIFAR-100, and TinyImageNet across five backbones (ResNet-18, VGG-11, MobileNet-v2, EfficientNet-V2, DenseNet-121), NM-Hebb achieves consistent gains over baseline and other methods: Top-1 accuracy improves by +2.0-10.0 pp (CIFAR-10), +2.0-9.0 pp (CIFAR-100), and up to +4.3-8.9 pp (TinyImageNet), with Normalised Mutual Information (NMI) increased by up to +0.15. Qualitative visualisations and filter-level analyses further confirm that NM-Hebb produces more structured and selective features, yielding tighter and more interpretable class clusters. Overall, coupling local Hebbian plasticity with metric-based fine-tuning yields CNNs that are not only more accurate but also more interpretable, offering practical benefits for resource-constrained and safety-critical AI deployments.
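Two of the loss terms described in the abstract can be illustrated with a small plain-Python sketch. The exact functional forms are not given in the abstract, so both are assumptions: the Hebbian term is modelled here as a mean squared gap between per-channel activation means and filter-weight means, and the Phase 2 term as a standard pairwise contrastive loss with a margin. The names `hebbian_regulariser`, `pairwise_contrastive_loss`, and `margin` are hypothetical.

```python
import math
from typing import Sequence


def hebbian_regulariser(
    activation_means: Sequence[float], filter_weight_means: Sequence[float]
) -> float:
    """Mean squared gap between per-channel activation and filter-weight means.

    Assumed form: the penalty is zero when each channel's spatial mean
    activation equals the mean of its convolutional filter weights.
    """
    if len(activation_means) != len(filter_weight_means):
        raise ValueError("one activation mean is required per filter")
    gaps = [(a - w) ** 2 for a, w in zip(activation_means, filter_weight_means)]
    return sum(gaps) / len(gaps)


def pairwise_contrastive_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    margin: float = 1.0,
) -> float:
    """Average contrastive loss over all pairs of embeddings.

    Same-class pairs are pulled together (squared distance); different-class
    pairs are pushed apart until they are at least `margin` away.
    """
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = math.dist(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += d ** 2
            else:
                total += max(0.0, margin - d) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0


# Toy values (made up for illustration only).
print(hebbian_regulariser([1.0, 2.0], [1.0, 0.0]))
print(pairwise_contrastive_loss([[0.0, 0.0], [3.0, 4.0]], [0, 1], margin=6.0))
```

In a real training loop these terms would be computed on tensors and added to the cross-entropy objective with weighting coefficients. The neuromodulator-gated consolidation loss is omitted here because the abstract gives too little detail to sketch it faithfully.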