cs.CL

[1] Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data

Ekaterina Borisova,Fabio Barth,Nils Feldhus,Raia Abu Ahmad,Malte Ostendorff,Pedro Ortiz Suarez,Georg Rehm,Sebastian Möller

Main category: cs.CL

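TL;DR: Through a cross-domain (scientific vs. non-scientific data) and cross-modality (text vs. image) evaluation, the paper studies how text-based and multimodal LLMs perform on table understanding tasks, finds that LLMs face significant challenges on scientific tables, and introduces the TableEval benchmark.

Motivation: Tables are widely used to represent structured data, but how effectively LLMs process tabular data remains underexplored. The paper examines LLM performance across domains and modalities.

Contribution: 1) A cross-domain and cross-modality evaluation of LLMs; 2) the TableEval benchmark; 3) evidence of the challenges LLMs face on scientific tables.

Method: 1) Compare text-based and multimodal LLMs on table understanding tasks; 2) contrast scientific and non-scientific tables; 3) run an interpretability analysis to measure context usage and input relevance.

Result: LLMs remain robust across table modalities but perform worse on scientific tables.

Insight: The complexity and domain specificity of scientific tables place higher demands on LLMs; dedicated methods for such tables are needed in future work.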

Abstract: Tables are among the most widely used tools for representing structured data in research, business, medicine, and education. Although LLMs demonstrate strong performance in downstream tasks, their efficiency in processing tabular data remains underexplored. In this paper, we investigate the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. Specifically, we compare their performance on tables from scientific vs. non-scientific contexts and examine their robustness on tables represented as images vs. text. Additionally, we conduct an interpretability analysis to measure context usage and input relevance. We also introduce the TableEval benchmark, comprising 3017 tables from scholarly publications, Wikipedia, and financial reports, where each table is provided in five different formats: Image, Dictionary, HTML, XML, and LaTeX. Our findings indicate that while LLMs maintain robustness across table modalities, they face significant challenges when processing scientific tables.

[2] LineRetriever: Planning-Aware Observation Reduction for Web Agents

Imene Kerboua,Sahar Omidi Shayegan,Megh Thakkar,Xing Han Lù,Massimo Caccia,Véronique Eglin,Alexandre Aussem,Jérémy Espinas,Alexandre Lacoste

Main category: cs.CL

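TL;DR: LineRetriever is a new observation-reduction method for web agents. It uses a language model with planning awareness to prioritize the observation content most relevant to future navigation steps, addressing the loss of critical information in similarity-based retrieval.

Motivation: Web agents often exceed model context limits because DOM or AxTree observations are too long. Conventional methods such as embedding-based retrieval fail to retain plan-relevant information, which hurts adaptive planning.

Contribution: Introduces LineRetriever, an observation-reduction method that combines a language model with planning awareness to retrieve the content most relevant to predicting future actions.

Method: A language model identifies and retrieves the observation lines most relevant to future navigation steps, taking the planning horizon into account rather than relying on semantic similarity alone.

Result: LineRetriever reduces the size of the observation at each step while maintaining performance.

Insight: Planning-aware retrieval is more effective than pure semantic similarity for web-agent tasks and offers a new approach to context management.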

Abstract: While large language models have demonstrated impressive capabilities in web navigation tasks, the extensive context of web pages, often represented as DOM or Accessibility Tree (AxTree) structures, frequently exceeds model context limits. Current approaches like bottom-up truncation or embedding-based retrieval lose critical information about page state and action history. This is particularly problematic for adaptive planning in web agents, where understanding the current state is essential for determining future actions. We hypothesize that embedding models lack sufficient capacity to capture plan-relevant information, especially when retrieving content that supports future action prediction. This raises a fundamental question: how can retrieval methods be optimized for adaptive planning in web navigation tasks? In response, we introduce LineRetriever, a novel approach that leverages a language model to identify and retrieve observation lines most relevant to future navigation steps. Unlike traditional retrieval methods that focus solely on semantic similarity, LineRetriever explicitly considers the planning horizon, prioritizing elements that contribute to action prediction. Our experiments demonstrate that LineRetriever can reduce the size of the observation at each step for the web agent while maintaining consistent performance within the context limitations.

[3] Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning

Mads Henrichsen,Rasmus Krebs

Main category: cs.CL

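TL;DR: Proposes a two-stage reasoning-infused learning framework that uses LLM-generated reasoning to improve text classification. Experiments show a clear accuracy gain on emotion classification.

Motivation: Standard classifiers map inputs directly to labels without explicit reasoning, which may limit performance and interpretability. LLM-generated reasoning can strengthen the generalization of downstream models.

Contribution: A two-stage framework: first generate reasoning data, then train a downstream model to output both reasoning and label. Experiments confirm the gain from reasoning generation (8.7 percentage points in accuracy).

Method: 1) A fine-tuned LLM (Llama-R-Gen) generates reasoning; 2) an augmented dataset of reasoning plus label is built offline and used to train a downstream generative model that predicts both.

Result: On emotion classification, the reasoning-augmented model (Q->RA) is 8.7 percentage points more accurate than the baseline (Q->A).

Insight: Training with explicit reasoning can improve model performance and provides richer training data and explanations for NLP tasks.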

Abstract: Standard classification models often map inputs directly to labels without explicit reasoning, potentially limiting their performance, robustness, and interpretability. This paper introduces a novel two-stage approach to enhance text classification by leveraging Large Language Model (LLM)-generated reasonings. In the first stage, we fine-tune a Llama-3.2-1B-Instruct model (henceforth Llama-R-Gen) on a general-purpose reasoning dataset (syvai/reasoning-gen) to generate textual reasoning (R) given a question and its answer. In the second stage, this generally trained Llama-R-Gen is used offline to create an augmented training dataset for a downstream generative model. This downstream model, based on Llama-3.2-1B-Instruct, takes only the input text (Q) and is trained to output the generated reasoning (R) immediately followed by the predicted emotion (A). We demonstrate this methodology on the dair-ai/emotion dataset for emotion classification. Our experiments show that the generative model trained to output reasoning and the emotion (Classifier Q->RA) achieves a significant improvement of 8.7 percentage points in accuracy (for emotion prediction) compared to a baseline generative model trained solely to output the emotion (Classifier Q->A), highlighting the strong generalization capabilities of the reasoning generation and the benefit of explicit reasoning training. This work underscores the potential of LLM-generated reasonings for creating richer training datasets, thereby improving the performance of diverse downstream NLP tasks and providing explicit explanations.
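
The difference between the two training setups comes down to how the target string is built. A minimal sketch follows, assuming a plain-text target format; the `Emotion:` separator, the function names, and the sample sentence are illustrative assumptions, not details taken from the paper.

```python
from typing import Optional


def build_example(text: str, label: str, reasoning: Optional[str] = None) -> dict:
    """Build one training example for a generative classifier.

    Without reasoning the target is just the label (Q->A); with reasoning
    the target is the reasoning followed by the label (Q->RA).
    """
    if reasoning is None:
        target = label
    else:
        # "Emotion:" is a hypothetical separator, not taken from the paper.
        target = f"{reasoning.strip()}\nEmotion: {label}"
    return {"input": text, "target": target}


def extract_label(generated: str) -> str:
    """Recover the predicted label from a generated Q->A or Q->RA output."""
    marker = "Emotion:"
    if marker in generated:
        return generated.rsplit(marker, 1)[1].strip()
    return generated.strip()


# Made-up example sentence and reasoning, for illustration only.
baseline = build_example("i feel like celebrating today", "joy")
augmented = build_example(
    "i feel like celebrating today",
    "joy",
    reasoning="The writer wants to celebrate, which signals a happy state.",
)
```

In this sketch accuracy would be scored on `extract_label(...)` only, so both variants are compared on the same label prediction.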

[4] Linearly Decoding Refused Knowledge in Aligned Language Models

Aryan Shrivastava,Ari Holtzman

Main category: cs.CL

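TL;DR: The paper shows that refused knowledge can be decoded from the hidden states of aligned language models with linear probes, indicating that instruction tuning does not fully eliminate or relocate harmful information.

Motivation: To probe the limits of refusal mechanisms in aligned language models (LMs): given that jailbreak prompts can bypass refusals, is the refused information also recoverable by linear decoding?

Contribution: (1) Shows that refused information is linearly decodable; (2) finds that probes trained on base models can transfer to instruction-tuned models; (3) shows that this information is still used indirectly in downstream behavior.

Method: Train linear probes on LM hidden states and evaluate how well they decode refused information. Experiments include Pearson correlation analysis and comparison with generative behavior.

Result: Linear probes predict refused information (e.g., a country's average IQ) with Pearson correlations above 0.8. Probes trained on base models sometimes remain effective on instruction-tuned models.

Insight: Instruction tuning does not wholly eliminate or relocate harmful information; it only suppresses its direct expression, leaving it linearly accessible and indirectly influential in downstream behavior.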

Abstract: Most commonly used language models (LMs) are instruction-tuned and aligned using a combination of fine-tuning and reinforcement learning, causing them to refuse user requests deemed harmful by the model. However, jailbreak prompts can often bypass these refusal mechanisms and elicit harmful responses. In this work, we study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on LM hidden states. We show that a great deal of initially refused information is linearly decodable. For example, across models, the response of a jailbroken LM for the average IQ of a country can be predicted by a linear probe with Pearson correlations exceeding 0.8. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions and are capable of revealing information that jailbreaks decode generatively, suggesting that the internal representations of many refused properties persist from base LMs through instruction-tuning. Importantly, we show that this information is not merely “leftover” in instruction-tuned models, but is actively used by them: we find that probe-predicted values correlate with LM generated pairwise comparisons, indicating that the information decoded by our probes aligns with suppressed generative behavior that may be expressed more subtly in other downstream tasks. Overall, our results suggest that instruction-tuning does not wholly eliminate or even relocate harmful information in representation space; it merely suppresses its direct expression, leaving it both linearly accessible and indirectly influential in downstream behavior.
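
To make "linear probe with Pearson correlation" concrete, here is a minimal sketch on synthetic data. It assumes a ridge-regression probe and random vectors standing in for hidden states; the abstract does not specify the probe's regularization, the layer used, or the data split, so all of those are assumptions here.

```python
import numpy as np


def fit_linear_probe(hidden, target, l2=1.0):
    """Fit a ridge-regression probe; returns (weights, bias)."""
    mean_h = hidden.mean(axis=0)
    mean_y = target.mean()
    X = hidden - mean_h
    y = target - mean_y
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + l2 * I)^-1 X^T y
    w = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ y)
    b = mean_y - mean_h @ w
    return w, b


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-ins: random "hidden states" and a scalar property that
# is a noisy linear function of them (an assumption for illustration).
rng = np.random.default_rng(0)
n, d = 400, 32
hidden = rng.normal(size=(n, d))
true_direction = rng.normal(size=d)
target = hidden @ true_direction + 0.5 * rng.normal(size=n)

train, test = slice(0, 300), slice(300, n)
w, b = fit_linear_probe(hidden[train], target[train])
pred = hidden[test] @ w + b
r = pearson(pred, target[test])  # held-out correlation
```

A high held-out `r` here only shows that the synthetic property was linear by construction; on real hidden states the correlation is the empirical question the paper studies.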

[5] EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning

Sanchit Ahuja,Praneetha Vaddamanu,Barun Patra

Main category: cs.CL

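TL;DR: The paper examines the efficiency of reasoning in non-English languages and finds that it reduces token usage while preserving accuracy, and that the change in reasoning behavior is genuine rather than superficial.

Motivation: Research on Language Reasoning Models (LRMs) focuses mainly on English and overlooks the potential of multilingually pretrained models. The paper asks whether non-English languages are more token-efficient.

Contribution: Shows that reasoning in non-English languages is more token-efficient with stable accuracy, that the shift in reasoning behavior is not a surface-level effect, and that a strong multilingual foundation matters.

Method: Evaluate three open-source LRMs (DeepSeek R1, Qwen 2.5, Qwen 3) on four math datasets in seven languages, and translate the reasoning traces into English to verify the behavioral change.

Result: Reasoning in non-English languages reduces token usage without hurting accuracy; the size of the gain depends on the model's multilingual strength.

Insight: Multilingual reasoning offers a broader view of reasoning in language models, and a strong multilingual foundation is key.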

Abstract: Despite recent advances in Language Reasoning Models (LRMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: Is English the most token-efficient language for reasoning? We evaluate three open-source LRMs: DeepSeek R1, Qwen 2.5 and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy. These gains persist even after translating the reasoning traces into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the model's multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. The code for our work can be found at https://github.com/microsoft/EfficientXLang.

[6] Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Daking Rai,Samuel Miller,Kevin Moran,Ziyu Yao

Main category: cs.CL

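TL;DR: The paper investigates why language models make errors when generating balanced parentheses and proposes RASteer, a method that increases the contribution of reliable components and substantially improves task accuracy.

Motivation: Despite major advances in coding ability, language models still struggle with simple syntactic tasks such as generating balanced parentheses. The paper aims to understand and mitigate these errors.

Contribution: Shows that errors arise when faulty mechanisms overshadow sound mechanisms, and proposes RASteer, which improves performance by increasing the contribution of reliable components.

Method: RASteer identifies reliable components (attention heads and feed-forward neurons) and increases their contribution, reducing the influence of faulty mechanisms.

Result: On balanced-parentheses tasks, RASteer raises the accuracy of some models from 0% to around 100%, and it yields gains of up to around 20% on arithmetic reasoning tasks.

Insight: Errors stem from unbalanced contributions across components; selectively strengthening reliable components can improve performance without impairing general coding ability.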

Abstract: Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing “sound mechanisms”), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing “faulty mechanisms”). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models’ general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.

[7] Causal Prompting for Implicit Sentiment Analysis with Large Language Models

Jing Ren,Wenhao Zhou,Bowen Li,Mujie Liu,Nguyen Linh Dan Le,Jiade Cen,Liping Chen,Ziqi Xu,Xiwei Xu,Xiaodong Li

Main category: cs.CL

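TL;DR: CAPITAL is a causal prompting framework that decomposes the causal effect and uses clustering and contrastive learning to improve the accuracy and robustness of implicit sentiment analysis.

Motivation: Existing prompting-based methods for implicit sentiment analysis rely on majority voting without assessing causal validity, which makes them susceptible to internal biases and spurious correlations.

Contribution: Proposes CAPITAL, which brings front-door adjustment into CoT reasoning, decomposes the causal effect, estimates it with clustering and the NWGM approximation, and uses contrastive learning to align representations.

Method: Decompose the causal effect into two components, estimate them with encoder-based clustering and the NWGM approximation, and align the encoder with the LLM's reasoning space via contrastive learning.

Result: Across benchmark datasets and LLMs, CAPITAL outperforms baselines in accuracy and robustness, particularly under adversarial conditions.

Insight: Improving LLM prompting with causal inference handles bias in implicit sentiment analysis more effectively and raises reasoning quality.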

Abstract: Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated, requiring models to perform deeper reasoning over subtle contextual cues. While recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA, they often rely on majority voting over chain-of-thought (CoT) reasoning paths without evaluating their causal validity, making them susceptible to internal biases and spurious correlations. To address this challenge, we propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning. CAPITAL decomposes the overall causal effect into two components: the influence of the input prompt on the reasoning chains, and the impact of those chains on the final output. These components are estimated using encoder-based clustering and the NWGM approximation, with a contrastive learning objective used to better align the encoder’s representation with the LLM’s reasoning space. Experiments on benchmark ISA datasets with three LLMs demonstrate that CAPITAL consistently outperforms strong prompting baselines in both accuracy and robustness, particularly under adversarial conditions. This work offers a principled approach to integrating causal inference into LLM prompting and highlights its benefits for bias-aware sentiment reasoning. The source code and case study are available at: https://github.com/whZ62/CAPITAL.
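
For reference, the two-component decomposition described above matches the standard front-door adjustment formula. The notation below is an assumption made for illustration (X for the input prompt, R for a reasoning chain, A for the final answer); the abstract itself does not give symbols, and the paper's exact estimator may differ.

```latex
P\bigl(A \mid do(X)\bigr)
  = \sum_{r}
    \underbrace{P(r \mid X)}_{\text{prompt} \,\to\, \text{reasoning chain}}
    \;
    \underbrace{\sum_{x'} P(A \mid r, x')\, P(x')}_{\text{reasoning chain} \,\to\, \text{answer}}
```

The first factor is the effect of the prompt on the reasoning chains; the inner sum is the effect of a chain on the answer, averaged over prompts x' so that confounding between prompt and answer is blocked.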

[8] Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions

Gauri Kambhatla,Sanjana Gautam,Angela Zhang,Alex Liu,Ravi Srinivasan,Junyi Jessy Li,Matthew Lease

Main category: cs.CL

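TL;DR: The paper shows that a simple supervised approach improves how well language models align with diverse population groups on subjective questions, validated on three datasets.

Motivation: To better predict how different population groups would answer subjective questions, improving the usefulness and fairness of models.

Contribution: A simple, general supervised approach that markedly improves alignment between language-model and group response distributions, together with a broad evaluation and open-source code.

Method: Supervised training to align language models on diverse datasets, evaluated across prompting strategies and models.

Result: The approach improves alignment across multiple datasets and models; alignment for specific groups is analyzed in detail.

Insight: Simple supervision can substantially improve alignment with human response distributions, but it is not suitable in every case and should be chosen according to the setting.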

Abstract: The ability to accurately predict how different population groups would answer subjective questions would have great value. In this work, we show that use of relatively simple supervision can greatly improve language model alignment with diverse population groups, as measured over three datasets spanning various topics. Beyond evaluating average performance, we also report how alignment varies across specific groups. The simplicity and generality of our approach promotes easy adoption, while our broad findings provide useful guidance for when to use or not use our approach in practice. By conducting evaluation over many LLMs and prompting strategies, along with open-sourcing our work, we provide a useful benchmark to stimulate future research.

[9] Pitfalls of Evaluating Language Models with Open Benchmarks

Md. Najib Hasan,Mohammad Fakhruddin Babar,Souvika Sarkar,Monowar Hasan,Santu Karmaker

Main category: cs.CL

Abstract: Open Large Language Model (LLM) benchmarks, such as HELM and BIG-bench, offer standardized, transparent protocols that facilitate the fair comparison, reproducibility, and iterative advancement of Language Models (LMs). However, their openness also introduces critical and underexplored pitfalls. This study exposes these weaknesses by systematically constructing “cheating” models – smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets – which achieve top rankings on a prominent open, holistic benchmark (HELM) despite poor generalization and limited practical utility. Our findings underscore three key insights: (a) high leaderboard performance on open benchmarks may not always reflect real-world effectiveness; (b) private or dynamic benchmarks must complement open evaluations to safeguard integrity; and (c) a fundamental reevaluation of current benchmarking practices is essential to ensure robust and trustworthy LM assessments.

To Eun Kim,João Coelho,Gbemileke Onilude,Jai Singh

Main category: cs.CL

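TL;DR: The paper proposes a modular pipeline for managing advertisement insertion and detection in RAG-based conversational systems, using synthetic data and an adversarial co-evolution framework to improve ad stealth and classifier performance.

Motivation: As generation-based conversational search engines spread, the boundary between advertisements and informational content blurs, raising concerns about transparency and user experience. A new approach is needed to balance ad insertion and detection.

Contribution: 1. A modular pipeline with an ad-rewriter and an ad-classifier; 2. high-performing classifiers trained with synthetic data and curriculum learning; 3. improved ad stealth via supervised fine-tuning and best-of-N sampling.

Method: 1. Train the ad-classifier on synthetic data; 2. use curriculum learning to improve classification; 3. optimize the ad-rewriter with supervised fine-tuning and best-of-N sampling.

Result: The ad-classifier detects diverse ad-integration strategies robustly, and classifier-guided optimization significantly improves ad stealth, enabling more seamless integration.

Insight: An adversarial co-evolution framework can improve both ad stealth and classifier robustness, offering a new approach to ad management in generative search systems.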

Abstract: As conversational search engines increasingly adopt generation-based paradigms powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the integration of advertisements into generated responses presents both commercial opportunities and challenges for user experience. Unlike traditional search, where advertisements are clearly delineated, generative systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. In this work, we propose a modular pipeline for advertisement management in RAG-based conversational systems, consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection. We leverage synthetic data to train high-performing classifiers, which are then used to guide two complementary ad-integration strategies: supervised fine-tuning of the ad-rewriter and a best-of-N sampling approach that selects the least detectable ad-integrated response among multiple candidates. Our evaluation focuses on two core questions: the effectiveness of ad classifiers in detecting diverse ad integration strategies, and the training methods that best support coherent, minimally intrusive ad insertion. Experimental results show that our ad-classifier, trained on synthetic advertisement data inspired by marketing strategies and enhanced through curriculum learning, achieves robust detection performance. Additionally, we demonstrate that classifier-guided optimization, through both fine-tuning and best-of-N sampling, significantly improves ad stealth, enabling more seamless integration. These findings contribute an adversarial co-evolution framework for developing more sophisticated ad-aware generative search systems and robust ad classifiers.
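
The best-of-N step can be sketched in a few lines: generate several candidate responses, score each with the classifier, and keep the one with the lowest ad score. In this sketch `toy_ad_score` is a hypothetical keyword heuristic standing in for a trained ad-classifier, and the candidate strings are made up; neither comes from the paper.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], ad_score: Callable[[str], float]) -> str:
    """Return the candidate that the scorer rates as least ad-like."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=ad_score)


def toy_ad_score(text: str) -> float:
    """Hypothetical stand-in for a classifier: share of promotional cue words."""
    cues = {"buy", "discount", "sale", "sponsored"}
    words = [w.strip(".,!:").lower() for w in text.split()]
    return sum(w in cues for w in words) / max(len(words), 1)


# Made-up candidate responses for illustration.
candidates = [
    "Try the new blender, buy now at a discount!",
    "A blender with a sealed lid makes smoothies less messy.",
    "Sponsored: this blender is on sale.",
]
chosen = best_of_n(candidates, toy_ad_score)
```

The same scorer can also serve the detection side of the pipeline, which is what makes the rewriter and classifier adversaries of each other.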

[11] Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

Tao Xiong,Xavier Hu,Wenyan Fan,Shengyu Zhang

Main category: cs.CL

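TL;DR: The paper proposes Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into large language models (LLMs) for autonomous, task-adaptive reasoning without external prompt engineering.

Motivation: Prompting techniques such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) work well with LLMs, but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency.

Contribution: Proposes the MoR framework, which combines diverse reasoning strategies to achieve task-adaptive reasoning without external prompt engineering.

Method: MoR has two phases: 1) Thought Generation, which creates reasoning-chain templates with models such as GPT-4o; 2) SFT Dataset Construction, which pairs the templates with benchmark datasets.

Result: Experiments show that MoR improves performance: MoR150 gains 2.2% with CoT prompting and 13.5% compared to baselines.

Insight: MoR offers a generalizable solution for robust reasoning across diverse tasks, removing the dependence on task-specific prompts.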

Abstract: Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised fine-tuning. Our experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.

[12] SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Sihang Li,Wei Shi,Ziyuan Xie,Tao Liang,Guojun Ma,Xiang Wang

Main category: cs.CL

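TL;DR: The paper proposes SAFER, a framework that uses sparse autoencoders (SAEs) to interpret and improve reward models. It uncovers safety-relevant features in their decision-making and uses them to design data poisoning and denoising strategies that degrade or enhance safety alignment.

Motivation: Reinforcement learning from human feedback (RLHF) is central to aligning large language models (LLMs), yet the reward models at its core remain opaque, with little interpretability or safety assurance.

Contribution: 1. The SAFER framework, which interprets reward-model activations with sparse autoencoders; 2. a measure of the salience of safety-relevant features; 3. targeted data poisoning and denoising strategies.

Method: Use SAEs to extract interpretable features from reward-model activations, quantify each feature's importance for safety decisions, and modify the data based on these feature-level signals.

Result: SAFER can precisely enhance or degrade safety alignment with minimal data modification, without hurting general chat performance.

Insight: Mechanistic analysis of reward models allows more transparent intervention in LLM safety alignment and provides an interpretability tool for high-stakes tasks.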

Abstract: Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present Sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at https://github.com/xzy-101/SAFER-code. This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.
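
The salience measure mentioned above (activation differences between chosen and rejected responses) can be illustrated on synthetic data. This is a minimal sketch that assumes salience is the mean per-feature difference over preference pairs; the paper's exact aggregation may differ, and the arrays below are random stand-ins for SAE feature activations.

```python
import numpy as np


def feature_salience(chosen_acts, rejected_acts):
    """Mean per-feature activation difference (chosen minus rejected)."""
    return (chosen_acts - rejected_acts).mean(axis=0)


def top_features(salience, k):
    """Indices of the k features with the largest absolute salience."""
    return np.argsort(-np.abs(salience))[:k]


# Synthetic stand-ins for SAE activations: 200 preference pairs and
# 16 non-negative features (shapes and values are assumptions).
rng = np.random.default_rng(0)
n_pairs, n_features = 200, 16
chosen = np.abs(rng.normal(size=(n_pairs, n_features)))
rejected = np.abs(rng.normal(size=(n_pairs, n_features)))
chosen[:, 3] += 2.0    # feature 3 fires more on chosen responses
rejected[:, 7] += 1.5  # feature 7 fires more on rejected responses

salience = feature_salience(chosen, rejected)
top = top_features(salience, k=2)
```

Features with large positive or negative salience are the candidates one would inspect when choosing which preference pairs to modify.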

[13] Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English

Ahmed Sabir,Azinovič Gasper,Mengsay Loem,Rajesh Sharma

Main category: cs.CL

Abstract: Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.

[14] AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation

Elizabeth Fons,Elena Kochkina,Rachneet Kaur,Zhen Zeng,Berowne Hlavaty,Charese Smiley,Svitlana Vyetrenko,Manuela Veloso

Main category: cs.CL

Abstract: This paper explores the potential of large language models (LLMs) to generate financial reports from time series data. We propose a framework encompassing prompt engineering, model selection, and evaluation. We introduce an automated highlighting system to categorize information within the generated reports, differentiating between insights derived directly from time series data, stemming from financial reasoning, and those reliant on external knowledge. This approach aids in evaluating the factual grounding and reasoning capabilities of the models. Our experiments, utilizing both data from the real stock market indices and synthetic time series, demonstrate the capability of LLMs to produce coherent and informative financial reports.

[15] LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Daniel Fein,Sebastian Russo,Violet Xiang,Kabir Jolly,Rafael Rafailov,Nick Haber

Main category: cs.CL

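TL;DR: LitBench is the first standardized benchmark and dataset for evaluating creative writing, addressing the lack of reliable automated evaluation. It contains human-labeled story comparisons and is used to assess zero-shot LLM judges and trained reward models.

Motivation: Creative writing is hard to evaluate because open-ended narratives lack ground truth. Current practice relies on zero-shot LLM judges of unverified reliability, so more reliable evaluation tools are needed.

Contribution: Introduces the LitBench benchmark and dataset with human preference labels; benchmarks zero-shot LLM judges; trains Bradley-Terry and generative reward models.

Method: Build a large-scale dataset to train and evaluate reward models, and validate them with an online human study. Claude-3.7-Sonnet is the strongest zero-shot judge, and the reward models reach 78% accuracy.

Result: Claude-3.7-Sonnet agrees with human preferences 73% of the time; the trained reward models outperform zero-shot judges, and a human study further confirms their validity.

Insight: LitBench provides a reliable automated evaluation resource for creative writing systems and shows the promise of trained reward models for open-ended tasks.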

Abstract: Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
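
For readers unfamiliar with Bradley-Terry reward models, the sketch below shows the standard pairwise objective and the pairwise accuracy metric on scalar rewards. It assumes the rewards are already computed by some model; the function names and the numbers in the example are illustrative and are not LitBench results.

```python
import math


def bt_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def bt_loss(pairs) -> float:
    """Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs."""
    return -sum(math.log(bt_preference_prob(c, r)) for c, r in pairs) / len(pairs)


def pairwise_accuracy(pairs) -> float:
    """Share of pairs where the chosen story gets the higher reward."""
    return sum(c > r for c, r in pairs) / len(pairs)


# Made-up reward pairs for illustration.
example_pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]
loss = bt_loss(example_pairs)
accuracy = pairwise_accuracy(example_pairs)
```

Training minimizes `bt_loss` with respect to the reward model's parameters, while agreement figures such as those reported above correspond to a pairwise accuracy on held-out comparisons.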

[16] Generative AI and the future of scientometrics: current topics and future questions

Benedetto Lepori,Jens Peter Andersen,Karsten Donnay

Main category: cs.CL

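TL;DR: The paper reviews the use of generative AI (GenAI) in scientometrics and discusses its potential impact on, and challenges for, the field.

Motivation: Scientometrics needs to understand GenAI's generative and probabilistic nature and the extent to which it can mimic human 'reasoning', and to assess the potentially disruptive impact of large-scale generation of scientific text.

Contribution: 1. Introduces GenAI's generative and probabilistic nature; 2. critically reviews experimental uses of GenAI in scientometrics; 3. discusses GenAI's potential impact on the indicators used to measure science.

Method: Reviews the applications of GenAI in scientometrics through theoretical reflection and a survey of empirical studies, and recommends systematically comparing the performance of different models.

Result: GenAI shows promise in tasks dominated by language generation (such as labelling) but is limited in tasks that require stable semantics and domain knowledge.

Insight: GenAI may change the textual characteristics used to measure science (such as authors, words, and references); theoretical reflection and empirical work are both needed to interpret evolving patterns of knowledge production.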

Abstract: The aim of this paper is to review the use of GenAI in scientometrics, and to begin a debate on the broader implications for the field. First, we provide an introduction on GenAI’s generative and probabilistic nature as rooted in distributional linguistics. And we relate this to the debate on the extent to which GenAI might be able to mimic human ‘reasoning’. Second, we leverage this distinction for a critical engagement with recent experiments using GenAI in scientometrics, including topic labelling, the analysis of citation contexts, predictive applications, scholars’ profiling, and research assessment. GenAI shows promise in tasks where language generation dominates, such as labelling, but faces limitations in tasks that require stable semantics, pragmatic reasoning, or structured domain knowledge. However, these results might become quickly outdated. Our recommendation is, therefore, to always strive to systematically compare the performance of different GenAI models for specific tasks. Third, we inquire whether, by generating large amounts of scientific language, GenAI might have a fundamental impact on our field by affecting textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production.

[17] Many LLMs Are More Utilitarian Than One

Anita Keshmirian,Razan Baltaji,Babak Hemmatian,Hadi Asghari,Lav R. Varshney

Main category: cs.CL

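TL;DR: Multi-agent LLM groups show a human-like utilitarian shift in moral dilemmas, but the underlying mechanism differs from that of humans.

Motivation: To study whether moral judgment by collaborating multi-agent LLMs resembles human group decision-making, and what mechanisms underlie it.

Contribution: Reveals a utilitarian tendency in LLM group decisions and shows how its mechanism differs from the human one.

Method: Test six models on moral dilemmas under solo and group-discussion conditions.

Result: LLM groups are more willing to accept moral norm violations that maximize utility, but through a different mechanism than humans.

Insight: LLM group behavior mimics humans on the surface, but the shift comes from reduced norm sensitivity or enhanced impartiality.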

Abstract: Moral judgment is integral to large language model (LLM) alignment and social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function collectively during collaboration, compared to individual agents. In human moral judgment, group deliberation leads to a utilitarian boost: a tendency to endorse norm violations that maximize benefits for the greatest number of people despite harms. We study whether a similar dynamic emerges in multi-agent LLM systems. We tested six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reasoned independently, and (2) Group, where they engaged in multi-turn discussions in pairs or triads. In personal moral dilemmas, where agents must decide to directly harm one individual to maximize the utility for others, all models found moral violations to be more acceptable when part of a group than individually, similar to human experiments. Some models endorsed actions that maximized overall well-being, even if they benefited strangers over familiar individuals. Others became more willing to violate moral norms in groups. However, while human groups show a similar action bias, the mechanism for their utilitarian boost differs from LLMs. Whereas the human shift comes from heightened sensitivity to decision outcomes, LLM groups show either reduced norm sensitivity or enhanced impartiality. This suggests that while the surface behavior of LLM collectives mimics human group reasoning, the underlying drivers differ. We discuss the implications for AI alignment, multi-agent design, and artificial moral reasoning.

[18] ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering

Alexander Hoyle,Lorena Calvo-Bartolomé,Jordan Boyd-Graber,Philip Resnik

Main category: cs.CL

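TL;DR: The paper proposes a scalable human evaluation protocol and an automated approximation for evaluating topic models and document clustering, and finds that the best LLM proxies are statistically indistinguishable from a human annotator.

Motivation: Existing evaluations of topic models and document clustering either use automated metrics that align poorly with human preferences or require expert labels that do not scale, so an evaluation closer to real-world use is needed.

Contribution: Designs a scalable human evaluation protocol and a corresponding automated approximation, and validates that LLM proxies can stand in for human annotators.

Method: Human annotators or an LLM proxy infer a category from the texts assigned to a topic or cluster and apply it to other documents; extensive annotations are collected to validate the automated proxies.

Result: The best LLM proxies are statistically indistinguishable from a human annotator and can serve as a reasonable substitute in automated evaluation.

Insight: Combining a human protocol with automated proxies makes evaluation of topic models and document clustering more efficient while staying aligned with human preferences.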

Abstract: Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners’ real-world usage of models. Annotators – or an LLM-based proxy – review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at https://github.com/ahoho/proxann

[19] Stylometry recognizes human and LLM-generated texts in short samples

Karol Przystalski,Jan K. Argasiński,Iwona Grabska-Gradzińska,Jeremi K. Ochab

Main category: cs.CL

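TL;DR: The paper uses stylometry to distinguish short human-written texts from texts generated by large language models (LLMs), reaching high accuracy in both multiclass and binary classification.

Details Motivation: The work addresses model attribution, intellectual property, and AI ethics by using stylistic analysis to identify the characteristics of LLM-generated text.

Contribution: Builds a Wikipedia-based benchmark dataset and shows that stylometric features (lexical, grammatical, syntactic, and punctuation patterns) are effective at separating human from LLM-generated text.

Method: Constructs a Wikipedia-based benchmark and classifies texts with decision trees and LightGBM, using hand-designed (StyloMetrix) and n-gram-based stylometric features.

Result: Reaches a Matthews correlation coefficient of up to 0.87 in the multiclass task and accuracy between 0.79 and 1.0 in binary classification, with Wikipedia vs. GPT-4 reaching up to 0.98 accuracy.

Insight: LLM texts show greater grammatical standardisation, while human texts are more individual. Stylometry works for distinguishing texts of a well-defined type.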

Abstract: The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) texts processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) texts processed through rephrasing methods (Dipper, T5). The 10-sentence long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1.0 in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show – crucially, in the context of the increasingly sophisticated LLMs – that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
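The abstract reports both accuracy and the Matthews correlation coefficient (MCC). As a minimal sketch of how the two relate, the snippet below computes MCC for the binary case from confusion-matrix counts. The counts are hypothetical and are not taken from the paper, whose .87 figure refers to the 7-class setting.

```python
import math

def matthews_corrcoef_binary(tp, tn, fp, fn):
    """Binary MCC from confusion-matrix counts; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts for a balanced 100-text test set (not from the paper).
tp, tn, fp, fn = 49, 49, 1, 1
accuracy = (tp + tn) / (tp + tn + fp + fn)
score = matthews_corrcoef_binary(tp, tn, fp, fn)
```

On a balanced set such as this one, MCC is stricter than accuracy: two errors out of 100 give an accuracy of 0.98 but an MCC of 0.96.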

[20] TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

Xi Xuan,King-kui Sin,Yufei Zhou,Chunyu Kit

Main category: cs.CL

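TL;DR: This paper presents TransLaw, a multi-agent framework for translating Hong Kong legal judgments in which Translator, Annotator, and Proofreader agents collaborate to improve translation quality. It costs far less than professional human translation and outperforms GPT-4o, but still trails human experts.

Details Motivation: Translating Hong Kong legal judgments involves intricate legal terminology and culturally embedded nuances, and the ability of existing LLMs on this task is unclear. The paper explores the potential of LLMs in multi-agent collaborative translation.

Contribution: Proposes the TransLaw framework, which supports customizable LLM configurations, substantially reduces cost, and surpasses GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity.

Method: Uses a collaborative framework of three agents (Translator, Annotator, Proofreader) responsible for translation, annotation, and proofreading respectively.

Result: Evaluated with 13 open-source and commercial LLMs as agents, TransLaw beats GPT-4o on several dimensions but trails human experts in contextualizing complex terminology and in stylistic naturalness.

Insight: Multi-agent collaboration can improve LLM translation in specialized domains, but handling complex cultural and terminological issues still requires human experts.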

Abstract: Multi-agent systems empowered by large language models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream applications, including machine translation. However, the potential of LLMs in translating Hong Kong legal judgments remains uncertain due to challenges such as intricate legal terminology, culturally embedded nuances, and strict linguistic structures. In this work, we introduce TransLaw, a novel multi-agent framework implemented for real-world Hong Kong case law translation. It employs three specialized agents, namely, Translator, Annotator, and Proofreader, to collaboratively produce translations for high accuracy in legal meaning, appropriateness in style, and adequate coherence and cohesion in structure. This framework supports customizable LLM configurations and achieves tremendous cost reduction compared to professional human translation services. We evaluated its performance using 13 open-source and commercial LLMs as agents and obtained interesting findings, including that it surpasses GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity, yet trails human experts in contextualizing complex terminology and stylistic naturalness. Our platform website is available at CityUHK, and our bilingual judgment corpus used for the evaluation is available at Hugging Face.

[21] Mathematics Isn’t Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations

Aditya Tomar,Nihar Ranjan Sahoo,Ashish Mittal,Rudra Murthy,Pushpak Bhattacharyya

Main category: cs.CL

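TL;DR: The paper examines how the cultural context in which math problems are presented affects large language models, evaluating them on culturally adapted datasets.

Details Motivation: Although mathematics is often regarded as culturally neutral, the way problems are presented can carry implicit cultural context. Existing benchmarks such as GSM8K are rooted mainly in Western norms and may overlook cultural differences in other regions.

Contribution: Creates five culturally adapted variants of the GSM8K test set (Africa, India, China, Korea, Japan) and evaluates the robustness of six large language models to these cultural changes.

Method: Generates the culturally adapted datasets through prompt-based transformations followed by manual verification, then evaluates LLMs of different parameter scales under several prompting strategies.

Result: Models perform best on the original US-centric dataset and worse on the culturally adapted versions. Models with reasoning capabilities are more robust to these shifts.

Insight: The cultural context of a math problem affects model performance, and reasoning ability helps bridge the gap.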

Abstract: Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks.

[22] Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check

Nicholas Lourie,Michael Y. Hu,Kyunghyun Cho

Main category: cs.CL

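TL;DR: Through a meta-analysis, the paper finds that downstream scaling laws fit a linear trend in only 39% of cases and that small changes to the experimental setup can completely change the scaling trend, underscoring the need to understand the conditions under which scaling laws succeed.

Details Motivation: Existing work on downstream scaling laws is contradictory: some studies report that task performance follows linear scaling trends, while others point to fundamental challenges such as emergence and inverse scaling. The paper tests the reliability of scaling laws empirically.

Contribution: Shows through a meta-analysis that downstream scaling laws are unreliable (only 39% of cases fit a linear trend) and sensitive to the experimental setup. It also argues that cases where scaling behavior deviates from linear trends must be studied in order to model the relationship between pretraining loss and downstream task performance.

Method: Conducts a meta-analysis of existing data on downstream scaling laws, measures how often linear scaling trends occur, and examines how changes to the experimental setting affect the trend.

Result: Downstream task performance follows a linear scaling trend in only 39% of cases, and seemingly minor changes to the experimental setup can completely change the trend.

Insight: Scaling laws have limits. Further study of the conditions under which they hold, and of non-linear scaling behavior, is needed to predict performance at larger scales more accurately.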

Abstract: Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that close fit to linear scaling laws only occurs in a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
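To make "close fit to linear scaling laws" concrete, here is a minimal sketch that fits a straight line to (pretraining loss, transformed task score) pairs and reports R². The data points, the use of R² as the fit criterion, and the function name are illustrative assumptions, not the paper's procedure.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return a, b, r2

# Hypothetical data: pretraining loss (x) vs. transformed task score (y).
losses = [3.0, 2.5, 2.0, 1.5]
clean = [0.2, 0.3, 0.4, 0.5]       # exactly linear in the loss
emergent = [0.0, 0.0, 0.0, 0.9]    # flat, then a sudden jump

slope, intercept, r2_clean = fit_line(losses, clean)
_, _, r2_emergent = fit_line(losses, emergent)
```

The linear series gives an R² of essentially 1, while the emergent-style series gives about 0.6, so a line fitted to the small-scale points would extrapolate poorly.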

[23] MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes

Yuheng Wang,Xianhe Tang,Pufeng Huang

Main category: cs.CL

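TL;DR: The paper introduces MemeCMD, an automatically generated Chinese multi-turn dialogue dataset with contextually retrieved multimodal memes, intended to make dialogues more vivid and emotionally expressive.

Details Motivation: Existing dialogue datasets are mostly pure-text or manually annotated and lack the expressiveness and contextual nuance of multimodal interaction. MemeCMD aims to fill this gap.

Contribution: 1) Introduces a Chinese multi-turn dialogue dataset with contextually retrieved memes; 2) designs a retrieval framework with an adaptive threshold so that memes are used naturally; 3) generates dialogues automatically with dual agents, balancing diversity and privacy protection.

Method: 1) Uses a large-scale meme library annotated by an MLLM; 2) generates dialogues across diverse scenarios with dual agents; 3) introduces a retrieval framework and adaptive threshold to improve meme matching.

Result: Experiments show that the approach generates contextually appropriate and diverse meme-incorporated dialogues, providing a scalable and privacy-preserving resource for multimodal conversational AI.

Insight: Automatically generating dialogues with context-matched memes is feasible and can improve emotional expressiveness while avoiding privacy issues.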

Abstract: Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions provide. To address these challenges, we introduce MemeCMD, an automatically generated Chinese Multi-turn Dialogue dataset with contextually retrieved memes. Our dataset combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents across diverse scenarios. We introduce a retrieval framework and adaptive threshold to ensure contextually relevant, naturally spaced meme usage. Experiments demonstrate the effectiveness of our approach in generating contextually appropriate and diverse meme-incorporated dialogues, offering a scalable and privacy-preserving resource for advancing multimodal conversational AI.
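The abstract does not spell out how the adaptive threshold works. One plausible reading, sketched below purely as an assumption, is that a meme is inserted when its similarity to the current turn clears a threshold, and the threshold rises after each insertion and relaxes afterwards so that memes stay spaced out. The function name `place_memes`, the parameters `base`, `bump`, and `decay`, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def place_memes(turn_vecs, meme_vecs, base=0.8, bump=0.2, decay=0.05):
    """Return one meme index (or None) per dialogue turn."""
    threshold = base
    placements = []
    for turn in turn_vecs:
        scores = [cosine(turn, m) for m in meme_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            placements.append(best)
            threshold = min(1.0, threshold + bump)    # discourage back-to-back memes
        else:
            placements.append(None)
            threshold = max(base, threshold - decay)  # relax toward the base level
    return placements

# Toy 2-D embeddings (hypothetical): two memes and four dialogue turns.
memes = [[1.0, 0.0], [0.0, 1.0]]
turns = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
result = place_memes(turns, memes)
```

Here the second turn matches the first meme just as well as the first turn did, but it is skipped because the threshold was raised by the preceding insertion.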

[24] The Cognate Data Bottleneck in Language Phylogenetics

Luise Häuser,Alexandros Stamatakis

Main category: cs.CL

Abstract: To fully exploit the potential of computational phylogenetic methods for cognate data, one needs to leverage specific (complex) models and machine learning-based techniques. However, both approaches require datasets that are substantially larger than the manually collected cognate data currently available. To the best of our knowledge, there exists no feasible approach to automatically generate larger cognate datasets. We substantiate this claim by automatically extracting datasets from BabelNet, a large multilingual encyclopedic dictionary. We demonstrate that phylogenetic inferences on the respective character matrices yield trees that are largely inconsistent with the established gold standard ground truth trees. We also discuss why we consider it unlikely that more suitable character matrices can be extracted from other multilingual resources. Phylogenetic data analysis approaches that require larger datasets can therefore not be applied to cognate data. Thus, it remains an open question how, and if, these computational approaches can be applied in historical linguistics.

cs.CV

[25] Moment Sampling in Video LLMs for Long-Form Video QA

Mustafa Chasmai,Gauri Jagatap,Gouthaman KV,Grant Van Horn,Subhransu Maji,Andrea Fanelli

Main category: cs.CV

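TL;DR: The paper proposes "moment sampling", a model-agnostic method that uses a text-to-video moment retrieval model to guide frame sampling and improve long-form video question answering.

Details Motivation: Existing video LLMs perform well on short-video question answering, but on long videos their reasoning degrades because frame sampling either misses key frames or includes redundant ones.

Contribution: Proposes a model-agnostic moment sampling method that uses a lightweight moment retrieval model to prioritize the frames most relevant to the question.

Method: A text-to-video moment retrieval model guides frame sampling so that the most relevant frames are selected. The method applies to a range of video LLMs.

Result: The method's effectiveness is demonstrated on four long-form VideoQA datasets using four state-of-the-art video LLMs.

Insight: Targeted frame sampling, rather than sampling at regular intervals, matters for long-form video QA, and a lightweight retrieval model can improve performance.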

Abstract: Recent advancements in video large language models (Video LLMs) have significantly advanced the field of video question answering (VideoQA). While existing methods perform well on short videos, they often struggle with long-range reasoning in longer videos. To scale Video LLMs for longer video content, frame sub-sampling (selecting frames at regular intervals) is commonly used. However, this approach is suboptimal, often leading to the loss of crucial frames or the inclusion of redundant information from multiple similar frames. Missing key frames impairs the model’s ability to answer questions accurately, while redundant frames lead the model to focus on irrelevant video segments and increase computational resource consumption. In this paper, we investigate the use of a general-purpose text-to-video moment retrieval model to guide the frame sampling process. We propose “moment sampling”, a novel, model-agnostic approach that enables the model to select the most relevant frames according to the context of the question. Specifically, we employ a lightweight moment retrieval model to prioritize frame selection. By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs. Through extensive experiments on four long-form VideoQA datasets, using four state-of-the-art Video LLMs, we demonstrate the effectiveness of the proposed approach.
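The contrast between regular-interval sub-sampling and question-guided selection can be sketched as follows. The per-frame relevance scores stand in for the output of a moment retrieval model and are hypothetical, as are the function names; the paper's actual selection rule may differ.

```python
def uniform_sample(num_frames, k):
    """Baseline: k frame indices at regular intervals."""
    step = num_frames / k
    return [int(i * step) for i in range(k)]

def moment_sample(relevance, k):
    """Pick the k frames with the highest question relevance, in temporal order."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-frame scores, as a moment retrieval model might produce.
scores = [0.1, 0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
uniform = uniform_sample(len(scores), 3)
moment = moment_sample(scores, 3)
```

With these scores, uniform sampling returns frames 0, 4, and 8 and captures only one frame of the relevant moment, whereas relevance-guided selection returns frames 3, 4, and 5.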

[26] Catastrophic Forgetting Mitigation via Discrepancy-Weighted Experience Replay

Xinrun Xu,Jianwen Yang,Qiuhong Zhang,Zhanbiao Lian,Zhiming Ding,Shan Jiang

Main category: cs.CV

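TL;DR: The paper proposes ER-EMU, a method based on adaptive experience replay that quantifies the discrepancy between target domains and prioritizes historical data most dissimilar to the current target domain, thereby mitigating catastrophic forgetting.

Details Motivation: In dynamic traffic environments, edge models tend to lose previously learned knowledge when adapting to new data distributions (catastrophic forgetting). Existing methods do not make effective use of the diversity of historical data, so knowledge retention suffers.

Contribution: Proposes the ER-EMU algorithm with Domain Distance Metric-based Experience Selection (DDM-ES), which quantifies inter-domain discrepancy using the multi-kernel maximum mean discrepancy (MK-MMD) to improve knowledge retention.

Method: Manages a limited-size experience buffer on a FIFO basis, uses DDM-ES to select the most dissimilar historical data, and updates the buffer by random sampling.

Result: On the Bellevue traffic video dataset, ER-EMU consistently improves the performance of several state-of-the-art cloud-edge collaborative object detection frameworks.

Insight: Selecting data that differs most from the current domain keeps training diverse, prevents overfitting to the new domain, and retains a wider range of past knowledge.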

Abstract: Continually adapting edge models in cloud-edge collaborative object detection for traffic monitoring suffers from catastrophic forgetting, where models lose previously learned knowledge when adapting to new data distributions. This is especially problematic in dynamic traffic environments characterised by periodic variations (e.g., day/night, peak hours), where past knowledge remains valuable. Existing approaches like experience replay and visual prompts offer some mitigation, but struggle to effectively prioritize and leverage historical data for optimal knowledge retention and adaptation. Specifically, simply storing and replaying all historical data can be inefficient, while treating all historical experiences as equally important overlooks their varying relevance to the current domain. This paper proposes ER-EMU, an edge model update algorithm based on adaptive experience replay, to address these limitations. ER-EMU utilizes a limited-size experience buffer managed using a First-In-First-Out (FIFO) principle, and a novel Domain Distance Metric-based Experience Selection (DDM-ES) algorithm. DDM-ES employs the multi-kernel maximum mean discrepancy (MK-MMD) to quantify the dissimilarity between target domains, prioritizing the selection of historical data that is most dissimilar to the current target domain. This ensures training diversity and facilitates the retention of knowledge from a wider range of past experiences, while also preventing overfitting to the new domain. The experience buffer is also updated using a simple random sampling strategy to maintain a balanced representation of previous domains. Experiments on the Bellevue traffic video dataset, involving repeated day/night cycles, demonstrate that ER-EMU consistently improves the performance of several state-of-the-art cloud-edge collaborative object detection frameworks.
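As a rough illustration of the selection step, the sketch below estimates a squared MMD with an equally weighted sum of RBF kernels and picks the stored domain that is most dissimilar to the current one. The kernel bandwidths, the equal weighting, the biased estimator, and the toy feature vectors are all assumptions; the paper's DDM-ES may weight kernels or select samples differently.

```python
import math

def rbf(x, y, bandwidth):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mk_mmd2(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased squared-MMD estimate with an equally weighted sum of RBF kernels."""
    def kernel(a, b):
        return sum(rbf(a, b, bw) for bw in bandwidths) / len(bandwidths)

    def mean_k(p, q):
        return sum(kernel(a, b) for a in p for b in q) / (len(p) * len(q))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

def most_dissimilar(current, history):
    """Name of the stored domain whose features differ most from the current one."""
    return max(history, key=lambda name: mk_mmd2(current, history[name]))

# Toy 2-D features (hypothetical): the current domain resembles "day_old".
current = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]]
history = {
    "day_old": [[0.1, 0.1], [0.0, 0.2], [0.1, 0.0]],
    "night": [[3.0, 3.1], [3.2, 2.9], [3.0, 3.0]],
}
picked = most_dissimilar(current, history)
```

The "night" domain is selected for replay because its features lie far from the current ones, which matches the abstract's stated preference for the most dissimilar historical data.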

[27] MR-CLIP: Efficient Metadata-Guided Learning of MRI Contrast Representations

Mehmet Yigit Avci,Pedro Borges,Paul Wright,Mehmet Yigitsoy,Sebastien Ourselin,Jorge Cardoso

Main category: cs.CV

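TL;DR: The paper proposes MR-CLIP, a multimodal contrastive learning framework that learns contrast-aware representations by aligning MR images with their DICOM metadata, without relying on manual labels.

Details Motivation: Image contrast in MRI scans is determined by acquisition parameters, but existing data often lacks standardized or reliable metadata, and manual labels are coarse or missing. This complicates image interpretation and retrieval and hinders clinical use.

Contribution: Proposes the MR-CLIP framework, which uses DICOM metadata to learn contrast-aware representations of MRI scans without manual labels and supports cross-modal retrieval and contrast classification.

Method: Uses multimodal contrastive learning to align MR images with DICOM metadata, trained on a diverse clinical dataset, capturing contrast variations across acquisitions and within scans.

Result: Performs well on cross-modal retrieval and contrast classification, showing scalability and potential for clinical applications. Code and weights are publicly available.

Insight: Metadata-guided representation learning can address missing or inconsistent labels in medical imaging and offers a route toward general representation learning and data harmonization.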

Abstract: Accurate interpretation of Magnetic Resonance Imaging scans in clinical systems is based on a precise understanding of image contrast. This contrast is primarily governed by acquisition parameters, such as echo time and repetition time, which are stored in the DICOM metadata. To simplify contrast identification, broad labels such as T1-weighted or T2-weighted are commonly used, but these offer only a coarse approximation of the underlying acquisition settings. In many real-world datasets, such labels are entirely missing, leaving raw acquisition parameters as the only indicators of contrast. Adding to this challenge, the available metadata is often incomplete, noisy, or inconsistent. The lack of reliable and standardized metadata complicates tasks such as image interpretation, retrieval, and integration into clinical workflows. Furthermore, robust contrast-aware representations are essential to enable more advanced clinical applications, such as achieving modality-invariant representations and data harmonization. To address these challenges, we propose MR-CLIP, a multimodal contrastive learning framework that aligns MR images with their DICOM metadata to learn contrast-aware representations, without relying on manual labels. Trained on a diverse clinical dataset that spans various scanners and protocols, MR-CLIP captures contrast variations across acquisitions and within scans, enabling anatomy-invariant representations. We demonstrate its effectiveness in cross-modal retrieval and contrast classification, highlighting its scalability and potential for further clinical applications. The code and weights are publicly available at https://github.com/myigitavci/MR-CLIP.
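For readers unfamiliar with CLIP-style alignment, the following is a minimal sketch of a symmetric contrastive loss over paired image and metadata embeddings. It assumes the standard CLIP formulation with a fixed temperature; the abstract does not state MR-CLIP's exact loss, and the toy embeddings are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_loss(image_embs, meta_embs, temperature=0.1):
    """Symmetric cross-entropy over image-to-metadata and metadata-to-image logits."""
    imgs = [normalize(v) for v in image_embs]
    metas = [normalize(v) for v in meta_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, m)) / temperature for m in metas] for i in imgs]

    def cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[k]  # the matching pair sits on the diagonal
        return total / n

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))

# Toy embeddings (hypothetical): metadata either matches or is swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = clip_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

The loss is lower when each image embedding sits closest to its own metadata embedding, which is the signal that pulls matching image and metadata pairs together during training.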

[28] HistoART: Histopathology Artifact Detection and Reporting Tool

Seyed Kahaki,Alexander R. Webber,Ghada Zamzmi,Adarsh Subbaswamy,Rucha Deshpande,Aldo Badano

Main category: cs.CV

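TL;DR: The paper proposes three methods for detecting artifacts in whole slide images (WSIs): a foundation model-based approach (FMA), a ResNet50-based deep learning approach (DLA), and a knowledge-based approach (KBA) using handcrafted features, with the FMA performing best. It also develops a quality report tool that visualizes artifact distributions.

Details Motivation: WSI is widely used in cancer diagnostics, but artifacts introduced during slide preparation and scanning can compromise downstream image analysis. The study addresses this problem.

Contribution: Proposes three artifact detection approaches (FMA, DLA, KBA) and develops a quality report tool that quantifies high-quality patches and visualizes artifact distributions.

Method: 1. FMA: a fine-tuned UNI architecture; 2. DLA: built on a ResNet50 backbone; 3. KBA: handcrafted features from texture, color, and frequency-based metrics. The methods target six common artifact types.

Result: The FMA performs best with an AUROC of 0.995 (95% CI [0.994, 0.995]), ahead of the DLA (0.977) and the KBA (0.940).

Insight: Foundation models perform strongly in artifact detection and may transfer to other medical imaging tasks.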

Abstract: In modern cancer diagnostics, Whole Slide Imaging (WSI) is widely used to digitize tissue specimens for detailed, high-resolution examination; however, other diagnostic approaches, such as liquid biopsy and molecular testing, are also utilized based on the cancer type and clinical context. While WSI has revolutionized digital histopathology by enabling automated, precise analysis, it remains vulnerable to artifacts introduced during slide preparation and scanning. These artifacts can compromise downstream image analysis. To address this challenge, we propose and compare three robust artifact detection approaches for WSIs: (1) a foundation model-based approach (FMA) using a fine-tuned Unified Neural Image (UNI) architecture, (2) a deep learning approach (DLA) built on a ResNet50 backbone, and (3) a knowledge-based approach (KBA) leveraging handcrafted features from texture, color, and frequency-based metrics. The methods target six common artifact types: tissue folds, out-of-focus regions, air bubbles, tissue damage, marker traces, and blood contamination. Evaluations were conducted on 50,000+ image patches from diverse scanners (Hamamatsu, Philips, Leica Aperio AT2) across multiple sites. The FMA achieved the highest patch-wise AUROC of 0.995 (95% CI [0.994, 0.995]), outperforming the ResNet50-based method (AUROC: 0.977, 95% CI [0.977, 0.978]) and the KBA (AUROC: 0.940, 95% CI [0.933, 0.946]). To translate detection into actionable insights, we developed a quality report scorecard that quantifies high-quality patches and visualizes artifact distributions.
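Since the comparison rests on patch-wise AUROC, a small sketch of the metric may help. It uses the pairwise definition, in which AUROC is the probability that a randomly chosen artifact patch receives a higher score than a randomly chosen clean patch. The labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical patch labels (1 = artifact) and detector scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
value = auroc(labels, scores)
```

In this toy case one artifact patch (score 0.4) ranks below one clean patch (score 0.5), so 8 of the 9 positive-negative pairs are ordered correctly.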

[29] CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning

Ming Li,Chenguang Wang,Yijun Liang,Xiyao Wang,Yuhang Zhou,Xiyang Wu,Yuqing Zhang,Ruiyi Zhang,Tianyi Zhou

Main category: cs.CV

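TL;DR: The paper examines the limits of multimodal large language models (MLLMs) in visual perception and reasoning, focusing on the challenging "CaughtCheating" task. Although MLLMs excel on existing benchmarks, they perform poorly on tasks that demand detective-level human perception and reasoning.

Details Motivation: Because MLLMs such as GPT-o3 have reached near-ceiling scores on existing benchmarks, more challenging tasks are needed to assess their abilities. The paper focuses on tasks that require fine-grained visual perception and reasoning at the level of a skilled human detective.

Contribution: Introduces the CaughtCheating task as a new class of visual perception and reasoning challenges that exposes the limits of MLLMs in complex real-world scenarios, and analyzes experimentally why MLLMs fall short.

Method: Designs the CaughtCheating task, modeled on social media requests that ask others to detect suspicious clues in photos. Extensive experiments and qualitative analysis assess how MLLMs perform and why they fail.

Result: Although MLLMs do well on existing tasks, their performance on CaughtCheating drops to nearly zero, indicating limited ability in complex visual perception and reasoning.

Insight: CaughtCheating offers a new direction for future research toward human-level detective abilities in MLLMs and highlights where MLLMs still need to improve on complex real-world tasks.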

Abstract: Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few tasks that are expert-level for humans, e.g., GeoGuesser, reflecting their potential as a detective who can notice minuscule cues in an image and weave them into coherent, situational explanations, leading to a reliable answer. But can they match the performance of excellent human detectives? To answer this question, we investigate some hard scenarios that GPT-o3 can still handle, and find a common scenario where o3's performance drops to nearly zero, which we name CaughtCheating. It is inspired by the social media requests that ask others to detect suspicious clues from photos shared by the poster's partner. We conduct extensive experiments and analysis to understand why existing MLLMs lack sufficient capability to solve this kind of task. CaughtCheating provides a class of challenging visual perception and reasoning tasks with great value and practical usage. Success in these tasks paves the way for MLLMs to acquire human-level detective perception and reasoning capabilities.

[30] Evolutionary computing-based image segmentation method to detect defects and features in Additive Friction Stir Deposition Process

Akshansh Mishra,Eyob Mesele Sefene,Shivraman Thapliyal

Main category: cs.CV

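TL;DR: Proposes an evolutionary computing-based image segmentation method for detecting defects and features in the Additive Friction Stir Deposition process, combining the PSO algorithm with several visualization techniques to characterize material interface quality.

Details Motivation: Conventional imaging struggles to reveal subtle defects and material transition zones in Additive Friction Stir Deposition, so an automated and precise method for assessing interface quality is needed.

Contribution: Develops an image segmentation method that combines PSO-optimized thresholds with multi-channel visualization, providing quantitative metrics for assessing the quality of additively manufactured components.

Method: Uses PSO to optimize segmentation thresholds and combines gradient magnitude analysis with distance transforms to produce attention-weighted visualizations. The multi-channel technique integrates boundary, spatial relationship, and material density information.

Result: PSO automatically identified optimal thresholds (156 to 173), regions of incomplete bonding and inhomogeneity were identified, and the multi-channel visualization gave a clear characterization of interface quality.

Insight: Attention-weighted visualization highlights critical interface regions and offers a new tool for process optimization and quality control in additive manufacturing.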

Abstract: This work proposes an evolutionary computing-based image segmentation approach for analyzing soundness in Additive Friction Stir Deposition (AFSD) processes. Particle Swarm Optimization (PSO) was employed to determine optimal segmentation thresholds for detecting defects and features in multilayer AFSD builds. The methodology integrates gradient magnitude analysis with distance transforms to create novel attention-weighted visualizations that highlight critical interface regions. Five AFSD samples processed under different conditions were analyzed using multiple visualization techniques, namely self-attention maps and multi-channel visualization. These complementary approaches reveal subtle material transition zones and potential defect regions that were not readily observable through conventional imaging. The PSO algorithm automatically identified optimal threshold values (ranging from 156 to 173) for each sample, enabling precise segmentation of material interfaces. The multi-channel visualization technique effectively combines boundary information (red channel), spatial relationships (green channel), and material density data (blue channel) into cohesive representations that quantify interface quality. The results demonstrate that attention-based analysis successfully identifies regions of incomplete bonding and inhomogeneities in AFSD joints, providing quantitative metrics for process optimization and quality assessment of additively manufactured components.
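The abstract does not say which objective the swarm optimizes. The sketch below assumes an Otsu-style between-class variance as the fitness function and runs a basic one-dimensional PSO over grey levels 0 to 255. The inertia and acceleration constants, the swarm size, and the synthetic bimodal pixel list are illustrative choices rather than the authors' settings.

```python
import random

def between_class_variance(pixels, t):
    """Otsu-style objective: separation of the two classes split at threshold t."""
    low = [p for p in pixels if p <= t]
    high = [p for p in pixels if p > t]
    if not low or not high:
        return 0.0
    w0, w1 = len(low) / len(pixels), len(high) / len(pixels)
    m0, m1 = sum(low) / len(low), sum(high) / len(high)
    return w0 * w1 * (m0 - m1) ** 2

def pso_threshold(pixels, particles=12, iters=30, seed=0):
    """Search grey levels 0..255 for the threshold with the highest fitness."""
    rng = random.Random(seed)
    pos = [rng.uniform(0, 255) for _ in range(particles)]
    vel = [0.0] * particles
    best_pos = pos[:]
    best_fit = [between_class_variance(pixels, round(p)) for p in pos]
    g = max(range(particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g], best_fit[g]
    for _ in range(iters):
        for i in range(particles):
            r1, r2 = rng.random(), rng.random()
            pull_own = 1.5 * r1 * (best_pos[i] - pos[i])   # toward the particle's best
            pull_swarm = 1.5 * r2 * (g_pos - pos[i])       # toward the swarm's best
            vel[i] = 0.7 * vel[i] + pull_own + pull_swarm
            pos[i] = min(255.0, max(0.0, pos[i] + vel[i]))
            fit = between_class_variance(pixels, round(pos[i]))
            if fit > best_fit[i]:
                best_pos[i], best_fit[i] = pos[i], fit
            if fit > g_fit:
                g_pos, g_fit = pos[i], fit
    return round(g_pos)

# Synthetic bimodal image: a dark phase (40..60) and a bright phase (190..210).
pixels = list(range(40, 61)) * 5 + list(range(190, 211)) * 5
threshold = pso_threshold(pixels)
```

For a one-dimensional threshold an exhaustive scan over 256 levels would also work; a swarm becomes more useful when several thresholds are optimized jointly.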

[31] AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training

Feiyang Kang,Nadine Chang,Maying Shen,Marc T. Law,Rafid Mahmood,Ruoxi Jia,Jose M. Alvarez

Main category: cs.CV

Abstract: The computational burden and inherent redundancy of large-scale datasets challenge the training of contemporary machine learning models. Data pruning offers a solution by selecting smaller, informative subsets, yet existing methods struggle: density-based approaches can be task-agnostic, while model-based techniques may introduce redundancy or prove computationally prohibitive. We introduce Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-informed feedback in a cluster-adaptive manner. AdaDeDup first partitions data and applies an initial density-based pruning. It then employs a proxy model to evaluate the impact of this initial pruning within each cluster by comparing losses on kept versus pruned samples. This task-aware signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive pruning in redundant clusters while preserving critical data in informative ones. Extensive experiments on large-scale object detection benchmarks (Waymo, COCO, nuScenes) using standard models (BEVFormer, Faster R-CNN) demonstrate AdaDeDup’s advantages. It significantly outperforms prominent baselines, substantially reduces performance degradation (e.g., over 54% versus random sampling on Waymo), and achieves near-original model performance while pruning 20% of data, highlighting its efficacy in enhancing data efficiency for large-scale model training. Code is open-sourced.
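The cluster-adaptive step can be pictured with the sketch below. It assumes a simple rule: if a proxy model's loss on the pruned samples of a cluster is clearly higher than on the kept samples, the pruned data carried information and that cluster's pruning ratio is lowered; otherwise the cluster looks redundant and the ratio is raised. The margin, step size, bounds, and cluster names are hypothetical, and the paper's actual update rule is not given in the abstract.

```python
def adapt_ratios(base_ratio, cluster_losses, margin=0.05, step=0.1, lo=0.0, hi=0.9):
    """Per-cluster prune ratio, nudged by the proxy-model loss gap.

    cluster_losses maps cluster id -> (mean loss on kept, mean loss on pruned).
    """
    ratios = {}
    for cid, (kept_loss, pruned_loss) in cluster_losses.items():
        gap = pruned_loss - kept_loss
        if gap > margin:
            ratio = base_ratio - step   # pruned samples were informative: prune less
        else:
            ratio = base_ratio + step   # cluster looks redundant: prune more
        ratios[cid] = min(hi, max(lo, ratio))
    return ratios

# Hypothetical proxy losses for two clusters of driving scenes.
losses = {"highway": (0.50, 0.51), "intersection": (0.50, 0.90)}
ratios = adapt_ratios(0.3, losses)
```

With these numbers the "highway" cluster is pruned more aggressively (ratio 0.4) while the "intersection" cluster is protected (ratio 0.2).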

[32] VSF-Med: A Vulnerability Scoring Framework for Medical Vision-Language Models

Binesh Sadanandan,Vahid Behzadan

Main category: cs.CV

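TL;DR: VSF-Med is an end-to-end vulnerability scoring framework for assessing the security of medical vision-language models (VLMs). It combines text-prompt attack templates, visual perturbations, and an eight-dimensional rubric to provide a systematic security evaluation of medical VLMs.

Details Motivation: Medical vision-language models hold great promise for medical imaging workflows, but systematic security evaluations in clinical settings remain scarce. VSF-Med aims to fill this gap.

Contribution: 1) Provides a library of text-prompt attack templates targeting emerging threat vectors; 2) designs visual perturbations calibrated by structural similarity (SSIM) thresholds; 3) introduces an eight-dimensional rubric scored by two independent judge LLMs.

Method: VSF-Med combines text-prompt attacks, visual perturbations, and the eight-dimensional rubric, consolidating raw scores via z-score normalization into a 0–32 composite risk metric. It synthesizes over 30,000 adversarial variants from publicly available datasets with open-source code.

Result: Across state-of-the-art VLMs, the analysis reports mean z-score shifts of 0.90σ for persistence of attack effects, 0.74σ for prompt-injection effectiveness, and 0.63σ for safety-bypass success. Llama-3.2-11B-Vision-Instruct shows the largest vulnerability increase (1.29σ).

Insight: Medical VLMs have significant security vulnerabilities, particularly in attack persistence and prompt injection. VSF-Med offers a practical tool for standardized evaluation and for improving model safety.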

Abstract: Vision Language Models (VLMs) hold great promise for streamlining labour-intensive medical imaging workflows, yet systematic security evaluations in clinical settings remain scarce. We introduce VSF–Med, an end-to-end vulnerability-scoring framework for medical VLMs that unites three novel components: (i) a rich library of sophisticated text-prompt attack templates targeting emerging threat vectors; (ii) imperceptible visual perturbations calibrated by structural similarity (SSIM) thresholds to preserve clinical realism; and (iii) an eight-dimensional rubric evaluated by two independent judge LLMs, whose raw scores are consolidated via z-score normalization to yield a 0–32 composite risk metric. Built entirely on publicly available datasets and accompanied by open-source code, VSF–Med synthesizes over 30,000 adversarial variants from 5,000 radiology images and enables reproducible benchmarking of any medical VLM with a single command. Our consolidated analysis reports mean z-score shifts of $0.90\sigma$ for persistence-of-attack-effects, $0.74\sigma$ for prompt-injection effectiveness, and $0.63\sigma$ for safety-bypass success across state-of-the-art VLMs. Notably, Llama-3.2-11B-Vision-Instruct exhibits a peak vulnerability increase of $1.29\sigma$ for persistence-of-attack-effects, while GPT-4o shows increases of $0.69\sigma$ for that same vector and $0.28\sigma$ for prompt-injection attacks.
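To illustrate why z-score normalization helps when two judges use a rubric differently, the sketch below standardizes each judge's scores separately and then averages them. The scores are hypothetical, and the way VSF-Med maps the consolidated value onto its 0–32 composite metric is not described in the abstract, so that step is not reproduced here.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of scores to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def consolidate(judge_a, judge_b):
    """Average two judges' scores after putting each on its own z-scale."""
    za, zb = zscores(judge_a), zscores(judge_b)
    return [(a + b) / 2.0 for a, b in zip(za, zb)]

# Hypothetical total rubric scores for four model responses.
strict_judge = [4, 8, 12, 16]
lenient_judge = [14, 18, 22, 26]   # same ranking, shifted upward by 10 points
combined = consolidate(strict_judge, lenient_judge)
```

Because the lenient judge's constant offset vanishes under standardization, both judges contribute identical z-scores here and neither dominates the average.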

[33] MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding

Ziqi Zhong,Daniel Tang

Main category: cs.CV

Abstract: While multi-modal learning has advanced significantly, current approaches often treat modalities separately, creating inconsistencies in representation and reasoning. We introduce MANTA (Multi-modal Abstraction and Normalization via Textual Alignment), a theoretically-grounded framework that unifies visual and auditory inputs into a structured textual space for seamless processing with large language models. MANTA addresses four key challenges: (1) semantic alignment across modalities with information-theoretic optimization, (2) adaptive temporal synchronization for varying information densities, (3) hierarchical content representation for multi-scale understanding, and (4) context-aware retrieval of sparse information from long sequences. We formalize our approach within a rigorous mathematical framework, proving its optimality for context selection under token constraints. Extensive experiments on the challenging task of Long Video Question Answering show that MANTA improves state-of-the-art models by up to 22.6% in overall accuracy, with particularly significant gains (27.3%) on videos exceeding 30 minutes. Additionally, we demonstrate MANTA’s superiority on temporal reasoning tasks (23.8% improvement) and cross-modal understanding (25.1% improvement). Our framework introduces novel density estimation techniques for redundancy minimization while preserving rare signals, establishing new foundations for unifying multimodal representations through structured text.
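Context selection under a token constraint can be pictured as a budgeted selection problem. The sketch below uses a simple greedy rule that ranks segments by relevance per token. This heuristic is a stand-in chosen for illustration and is not the procedure whose optimality the paper proves; the segment texts, relevance scores, and token counts are hypothetical.

```python
def select_context(segments, budget):
    """Greedy pick by relevance per token until the token budget is used up.

    segments: list of (text, relevance, token_count) tuples.
    """
    order = sorted(segments, key=lambda s: s[1] / s[2], reverse=True)
    chosen, used = [], 0
    for text, relevance, tokens in order:
        if used + tokens <= budget:
            chosen.append(text)
            used += tokens
    return chosen, used

# Hypothetical textual descriptions of video segments.
segments = [
    ("speaker introduces topic", 0.9, 30),
    ("long silence, static shot", 0.1, 50),
    ("key demonstration", 0.8, 40),
    ("closing credits", 0.2, 20),
]
picked, tokens_used = select_context(segments, budget=80)
```

With an 80-token budget, the two most relevant segments are kept (70 tokens in total) and the low-density segments are dropped.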

[34] An efficient plant disease detection using transfer learning approach

Bosubabu Sambana,Hillary Sunday Nnadi,Mohd Anas Wajid,Nwosu Ogochukwu Fidelia,Claudia Camacho-Zuñiga,Henry Dozie Ajuzie,Edeh Michael Onyema

Main category: cs.CV

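TL;DR: The paper proposes an efficient transfer learning-based method for plant disease detection, fine-tuning YOLOv7 and YOLOv8 on plant leaf images to detect several diseases, and reports superior performance for YOLOv8 across multiple metrics.

Details Motivation: Plant diseases severely affect agricultural yield and quality. An automated, efficient method for early detection is needed to limit losses and support sustainable agriculture.

Contribution: 1. Proposes a transfer learning-based plant disease detection system; 2. shows that YOLOv8 is efficient and accurate for disease detection; 3. provides a scalable, automated solution for agriculture.

Method: Fine-tunes YOLOv7 and YOLOv8 on a dataset of plant leaf images to detect diseases. Evaluation metrics include mAP, F1-score, Precision, and Recall.

Result: YOLOv8 performs best, with an mAP of 91.05, an F1-score of 89.40, Precision of 91.22, and Recall of 87.66.

Insight: YOLOv8 performs well on plant disease detection and could be used in real agricultural settings to help improve crop yield and reduce reliance on manual monitoring.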

Abstract: Plant diseases pose significant challenges to farmers and the agricultural sector at large. Early detection of plant diseases is crucial to mitigating their effects and preventing widespread damage, as outbreaks can severely impact the productivity and quality of crops. With advancements in technology, there are increasing opportunities for automating the monitoring and detection of disease outbreaks in plants. This study proposes a system designed to identify and monitor plant diseases using a transfer learning approach. Specifically, the study utilizes YOLOv7 and YOLOv8, two state-of-the-art models in the field of object detection. By fine-tuning these models on a dataset of plant leaf images, the system is able to accurately detect the presence of bacterial, fungal, and viral diseases such as Powdery Mildew, Angular Leaf Spot, Early blight and Tomato mosaic virus. The model's performance was evaluated using several metrics, including mean Average Precision (mAP), F1-score, Precision, and Recall, yielding values of 91.05, 89.40, 91.22, and 87.66, respectively. The results demonstrate the superior effectiveness and efficiency of YOLOv8 compared to other object detection methods, highlighting its potential for use in modern agricultural practices. The approach provides a scalable, automated solution for early detection of any plant disease, contributing to enhanced crop yield, reduced reliance on manual monitoring, and supporting sustainable agricultural practices.
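The reported metrics can be cross-checked, since the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the two values given in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f1 = f1_score(91.22, 87.66)
```

The result rounds to 89.40, which agrees with the F1-score stated in the abstract.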

[35] Diffusion-Based Image Augmentation for Semantic Segmentation in Outdoor Robotics

Peter Mortimer,Mirko Maehlisch

Main category: cs.CV

Abstract: The performance of learning-based perception algorithms suffers when deployed in out-of-distribution and underrepresented environments. Outdoor robots are particularly susceptible to rapid changes in visual scene appearance due to dynamic lighting, seasonality and weather effects that lead to scenes underrepresented in the training data of the learning-based perception system. In this conceptual paper, we focus on preparing our autonomous vehicle for deployment in snow-filled environments. We propose a novel method for diffusion-based image augmentation to more closely represent the deployment environment in our training data. Diffusion-based image augmentations rely on the public availability of vision foundation models learned on internet-scale datasets. The diffusion-based image augmentations allow us to take control over the semantic distribution of the ground surfaces in the training data and to fine-tune our model for its deployment environment. We employ open vocabulary semantic segmentation models to filter out augmentation candidates that contain hallucinations. We believe that diffusion-based image augmentations can be extended to many other environments apart from snow surfaces, like sandy environments and volcanic terrains.

[36] FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion

Yu Lu,Yi Yang

Main category: cs.CV

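TL;DR: FreeLong++ is a training-free framework for long video generation. It uses multi-band spectral fusion to counter the high-frequency distortion that short-video generation models exhibit on long sequences, substantially improving temporal consistency and visual fidelity.

Details Motivation: Existing short-video generation models lose quality when generating long videos because of high-frequency distortion, which degrades temporal consistency and visual fidelity.

Contribution: Proposes the FreeLong++ framework, whose multi-branch design fuses low-frequency global semantics with high-frequency local details and plugs into existing models without additional training.

Method: A multi-branch attention architecture in which each branch operates at a distinct temporal scale, enabling multi-band fusion from low to high frequencies.

Result: FreeLong++ outperforms previous methods on video generation at 4x and 8x the native length, and supports multi-prompt generation and control with long depth or pose sequences.

Insight: The key to long video generation is balancing global semantics against local detail; multi-scale spectral fusion is an effective remedy for high-frequency distortion.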

Abstract: Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend where high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g., Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g., 4x and 8x of native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.
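The blending idea described for FreeLong (low frequencies from a global branch, high frequencies from a local branch) can be illustrated with a toy one-dimensional example along the time axis. This is only a sketch under stated assumptions: the abstract does not specify the filter, so the hard frequency cutoff, the naive DFT, and the names `blend_frequencies` and `cutoff` are all hypothetical, and real video features are multi-dimensional tensors rather than short lists.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part (input is conjugate-symmetric here)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def blend_frequencies(global_feat, local_feat, cutoff):
    """Take temporal frequencies <= cutoff from global_feat, the rest from local_feat.

    Hypothetical hard low-pass/high-pass split; both inputs must have equal length.
    """
    n = len(global_feat)
    g_spec, l_spec = dft(global_feat), dft(local_feat)
    fused = []
    for k in range(n):
        freq = min(k, n - k)  # symmetric frequency index keeps the output real
        fused.append(g_spec[k] if freq <= cutoff else l_spec[k])
    return idft(fused)


# Toy signals: a constant "global" track and a rapidly alternating "local" track.
global_track = [1.0] * 8
local_track = [1.0, -1.0] * 4
fused_track = blend_frequencies(global_track, local_track, cutoff=0)
print([round(abs(v), 6) for v in fused_track])
```

With `cutoff=0` the fused track keeps the constant level of the global signal and the fast alternation of the local one, so it alternates between roughly 2 and 0. A multi-band variant in the spirit of FreeLong++ would use several cutoffs, one per temporal window size.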

[37] Computer Vision for Objects used in Group Work: Challenges and Opportunities

Changsoo Jung,Sheikh Mannan,Jack Fitzgerald,Nathaniel Blanchard

Main category: cs.CV

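TL;DR: This paper examines how computer vision, via 6D pose estimation, can capture students' interactions with physical objects in K-12 settings. It introduces the FiboSB dataset to evaluate existing methods, identifies their limitations, and obtains a marked improvement by fine-tuning YOLO11-x.

Details Motivation: Existing systems struggle to accurately capture interactions between students and physical objects during collaborative tasks, which limits deeper analysis of collaborative learning.

Contribution: Introduces the FiboSB dataset, exposes the limitations of existing 6D pose estimation methods in collaborative settings, and improves object detection by fine-tuning YOLO11-x.

Method: Evaluates four state-of-the-art 6D pose estimation methods on FiboSB, finds that failures of their object detection modules are the main problem, and then fine-tunes YOLO11-x.

Result: The fine-tuned YOLO11-x reaches an overall mAP_50 of 0.898 on FiboSB.

Insight: 6D pose estimation in collaborative settings needs more robust object detection; FiboSB provides a benchmark and foundation for future research.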

Abstract: Interactive and spatially aware technologies are transforming educational frameworks, particularly in K-12 settings where hands-on exploration fosters deeper conceptual understanding. However, during collaborative tasks, existing systems often lack the ability to accurately capture real-world interactions between students and physical objects. This issue could be addressed with automatic 6D pose estimation, i.e., estimation of an object’s position and orientation in 3D space from RGB images or videos. For collaborative groups that interact with physical objects, 6D pose estimates allow AI systems to relate objects and entities. As part of this work, we introduce FiboSB, a novel and challenging 6D pose video dataset featuring groups of three participants solving an interactive task featuring small hand-held cubes and a weight scale. This setup poses unique challenges for 6D pose because groups are holistically recorded from a distance in order to capture all participants – this, coupled with the small size of the cubes, makes 6D pose estimation inherently non-trivial. We evaluated four state-of-the-art 6D pose estimation methods on FiboSB, exposing the limitations of current algorithms on collaborative group work. An error analysis of these methods reveals that the 6D pose methods’ object detection modules fail. We address this by fine-tuning YOLO11-x for FiboSB, achieving an overall mAP_50 of 0.898. The dataset, benchmark results, and analysis of YOLO11-x errors presented here lay the groundwork for leveraging the estimation of 6D poses in difficult collaborative contexts.

[38] VOCAL: Visual Odometry via ContrAstive Learning

Chi-Yao Huang,Zeel Bhatt,Yezhou Yang

Main category: cs.CV

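TL;DR: VOCAL is a contrastive-learning framework for visual odometry (VO). It recasts VO as a label ranking task and combines Bayesian inference with representation learning to improve feature interpretability and compatibility with multimodal data.

Details Motivation: Existing learning-based VO methods rely on rigid geometric assumptions and lack interpretability and theoretical grounding; VOCAL aims to address this.

Contribution: Proposes the VOCAL framework, which recasts VO as a label ranking problem and combines Bayesian inference with representation learning to improve feature interpretability and multimodal compatibility.

Method: Contrastive learning organizes visual features to reflect camera states, and a ranking mechanism drives similar camera states toward consistent, spatially coherent representations in the latent space.

Result: Experiments on the KITTI dataset show that VOCAL offers improved interpretability and flexibility.

Insight: Through label ranking and feature alignment, VOCAL gives VO a solution that is better grounded in theory and easier to interpret.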

Abstract: Breakthroughs in visual odometry (VO) have fundamentally reshaped the landscape of robotics, enabling ultra-precise camera state estimation that is crucial for modern autonomous systems. Despite these advances, many learning-based VO techniques rely on rigid geometric assumptions, which often fall short in interpretability and lack a solid theoretical basis within fully data-driven frameworks. To overcome these limitations, we introduce VOCAL (Visual Odometry via ContrAstive Learning), a novel framework that reimagines VO as a label ranking challenge. By integrating Bayesian inference with a representation learning framework, VOCAL organizes visual features to mirror camera states. The ranking mechanism compels similar camera states to converge into consistent and spatially coherent representations within the latent space. This strategic alignment not only bolsters the interpretability of the learned features but also ensures compatibility with multimodal data sources. Extensive evaluations on the KITTI dataset highlight VOCAL’s enhanced interpretability and flexibility, pushing VO toward more general and explainable spatial intelligence.

[39] Developing Lightweight DNN Models With Limited Data For Real-Time Sign Language Recognition

Nikita Nikitin,Eugene Fomin

Main category: cs.CV

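TL;DR: This paper proposes a lightweight DNN framework for real-time sign language recognition that addresses data scarcity, high computational cost, and frame-rate discrepancies. By encoding sign-language-specific parameters and using MediaPipe to extract landmarks, it achieves accurate, low-latency classification.

Details Motivation: Key challenges in sign language recognition include data scarcity, high computational cost, and frame-rate discrepancies between training and inference environments. The paper aims to develop a lightweight DNN that achieves real-time recognition with limited data.

Contribution: 1. A lightweight DNN framework for real-time sign language recognition; 2. A data encoding method that vectorizes sign language parameters; 3. The ‘slait data’ annotation platform, which supports structured labeling; 4. High accuracy and low latency on edge devices.

Method: MediaPipe extracts landmarks; sign language parameters (such as handshape, palm orientation, movement, and location) are encoded as vectorized inputs; and the DNN architecture is optimized for lightweight deployment.

Result: The model classifies 343 signs, reaches 92% accuracy in isolated sign recognition with latency under 10 ms, and has been integrated into the ‘slait ai’ application.

Insight: Combining domain-specific knowledge (such as sign language parameters) with a lightweight design enables efficient real-time recognition with limited data.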

Abstract: We present a novel framework for real-time sign language recognition using lightweight DNNs trained on limited data. Our system addresses key challenges in sign language recognition, including data scarcity, high computational costs, and discrepancies in frame rates between training and inference environments. By encoding sign language specific parameters, such as handshape, palm orientation, movement, and location into vectorized inputs, and leveraging MediaPipe for landmark extraction, we achieve highly separable input data representations. Our DNN architecture, optimized for sub 10MB deployment, enables accurate classification of 343 signs with less than 10ms latency on edge devices. The data annotation platform ‘slait data’ facilitates structured labeling and vector extraction. Our model achieved 92% accuracy in isolated sign recognition and has been integrated into the ‘slait ai’ web application, where it demonstrates stable inference.

[40] GazeTarget360: Towards Gaze Target Estimation in 360-Degree for Robot Perception

Zhuangzhuang Dai,Vincent Gbouna Zakka,Luis J. Manso,Chen Li

Main category: cs.CV

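TL;DR: The paper proposes GazeTarget360, a system for estimating 360-degree gaze targets from an image. It combines an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder to improve robot perception.

Details Motivation: To improve robots' understanding of human gaze targets in real-world interaction, particularly the prediction problem that arises when the gaze falls outside the camera frame.

Contribution: Presents what the authors describe as the first efficient, deployable system for 360-degree gaze target estimation from realistic camera footage.

Method: Integrates conditional inference engines: an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder.

Result: Cross-validation results show that the system predicts gaze targets accurately and reliably in unseen scenarios.

Insight: Using background information and predicting out-of-frame gaze are important directions for robot perception, and multi-scale fusion helps the system generalize.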

Abstract: Enabling robots to understand human gaze targets is a crucial step toward downstream capabilities such as attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze targets in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross-validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes it a first-of-its-kind system for predicting gaze targets from realistic camera footage that is highly efficient and deployable. Our source code is made publicly available at: https://github.com/zdai257/DisengageNet.

[41] VirtualFencer: Generating Fencing Bouts based on Strategies Extracted from In-the-Wild Videos

Zhiyin Lin,Purvi Goel,Joy Yun,C. Karen Liu,Joao Pedro Araujo

Main category: cs.CV

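TL;DR: VirtualFencer is a system that extracts 3D fencing motion and strategy from in-the-wild video without supervision and uses that knowledge to generate realistic fencing behavior.

Details Motivation: In fencing, athletes' motions are diverse yet strategically logical, and execution varies widely (fast vs. slow, large vs. small, offensive vs. defensive). Actions are also informed by a strategy that responds to the opponent's behavior. This combination of motion diversity and two-player strategy motivates data-driven modeling of fencing.

Contribution: Proposes VirtualFencer, which extracts 3D fencing motion and strategy from in-the-wild video without supervision and uses them to generate realistic fencing behavior. The system demonstrates three capabilities: self-play, fencing against a real fencer's motion from video, and interactive fencing against a professional fencer.

Method: Data-driven modeling extracts 3D motion and strategy, which are then used to generate fencing behavior. The system supports self-play, bouts against motion from real video, and interactive bouts.

Result: The system demonstrates that motion and strategy can be extracted from video without supervision and used to generate realistic fencing behavior across all three settings.

Insight: Extracting 3D motion and strategy from in-the-wild video without supervision is feasible and can drive realistic interactive behavior, which may carry over to other sports or interactive scenarios.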

Abstract: Fencing is a sport where athletes engage in diverse yet strategically logical motions. While most motions fall into a few high-level actions (e.g., step, lunge, parry), the execution can vary widely: fast vs. slow, large vs. small, offensive vs. defensive. Moreover, a fencer’s actions are informed by a strategy that often comes in response to the opponent’s behavior. This combination of motion diversity with underlying two-player strategy motivates the application of data-driven modeling to fencing. We present VirtualFencer, a system capable of extracting 3D fencing motion and strategy from in-the-wild video without supervision, and then using that extracted knowledge to generate realistic fencing behavior. We demonstrate the versatile capabilities of our system by having it (i) fence against itself (self-play), (ii) fence against a real fencer’s motion from online video, and (iii) fence interactively against a professional fencer.

[42] Room Scene Discovery and Grouping in Unstructured Vacation Rental Image Collections

Vignesh Ram Nithin Kappagantula,Shayan Hassantabar

Main category: cs.CV

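TL;DR: The paper proposes an efficient machine learning pipeline that discovers and groups room scenes in unstructured vacation rental image collections and identifies the bed types in each bedroom group.

Details Motivation: The large volume of unstructured images on vacation rental platforms makes it hard for travelers to understand a property's spatial layout, especially when there are multiple rooms of the same type.

Contribution: Proposes a computationally efficient machine learning pipeline suited to real-time and data-scarce environments. It combines room-type detection, overlap detection, and a clustering algorithm, and uses a multimodal large language model (MLLM) to map bedroom groups to bed types.

Method: The pipeline consists of a supervised room-type detection model, a supervised overlap detection model that produces image similarity scores, a clustering algorithm that groups images, and an MLLM that associates each bedroom group with bed types.

Result: Experiments show that the pipeline significantly outperforms contrastive learning and clustering with pretrained embeddings.

Insight: Combining supervised learning with a multimodal model can efficiently solve the grouping and labeling of unstructured images and improve the user experience.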

Abstract: The rapid growth of vacation rental (VR) platforms has led to an increasing volume of property images, often uploaded without structured categorization. This lack of organization poses significant challenges for travelers attempting to understand the spatial layout of a property, particularly when multiple rooms of the same type are present. To address this issue, we introduce an effective approach for solving the room scene discovery and grouping problem, as well as identifying bed types within each bedroom group. This grouping is valuable for travelers to comprehend the spatial organization, layout, and the sleeping configuration of the property. We propose a computationally efficient machine learning pipeline characterized by low latency and the ability to perform effectively with sample-efficient learning, making it well-suited for real-time and data-scarce environments. The pipeline integrates a supervised room-type detection model, a supervised overlap detection model to identify the overlap similarity between two images, and a clustering algorithm to group the images of the same space together using the similarity scores. Additionally, the pipeline maps each bedroom group to the corresponding bed types specified in the property’s metadata, based on the visual content present in the group’s images using a Multi-modal Large Language Model (MLLM) model. We evaluate the aforementioned models individually and also assess the pipeline in its entirety, observing strong performance that significantly outperforms established approaches such as contrastive learning and clustering with pretrained embeddings.
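The grouping step (pairwise overlap scores followed by clustering) can be sketched as follows. The abstract does not name the clustering algorithm, so this example substitutes a simple technique, thresholding the scores and taking connected components with union-find, purely for illustration. The function name `group_by_similarity`, the score dictionary layout, and the 0.5 threshold are assumptions, not details from the paper.

```python
def group_by_similarity(n_images, scores, threshold=0.5):
    """Group image indices whose pairwise similarity reaches the threshold.

    scores maps (i, j) index pairs to a similarity in [0, 1]; images linked
    directly or transitively by high-scoring pairs end up in the same group.
    """
    parent = list(range(n_images))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), score in scores.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for i in range(n_images):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical overlap scores for five photos of one property.
pair_scores = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (3, 4): 0.7}
print(group_by_similarity(5, pair_scores))  # → [[0, 1, 2], [3, 4]]
```

Here photos 0 to 2 are grouped as one space and photos 3 and 4 as another, because the only link between the two sets scores below the threshold.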

[43] Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes

Chuyan Zhang,Kefan Wang,Yun Gu

Main category: cs.CV

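TL;DR: The paper proposes the SR-LoRA framework, which uses the stable rank of pre-trained weight matrices as a prior for allocating per-layer ranks in low-rank adaptation, improving performance in low-data, large-domain-gap settings.

Details Motivation: LoRA performs well, but its fixed low-rank structure limits adaptability when the domain gap is large. Current adaptive LoRA methods depend on computationally intensive techniques such as iterative pruning or rank search.

Contribution: Proposes SR-LoRA, which guides layer-wise rank allocation with the stable rank, avoids additional search costs, and improves performance on tasks with large domain gaps.

Method: SR-LoRA uses the stable rank of each pre-trained weight matrix as a prior to allocate ranks across layers, yielding efficient low-rank adaptation.

Result: On few-shot tasks with large domain gaps, SR-LoRA outperforms recent adaptive LoRA methods and offers a better trade-off between performance and efficiency.

Insight: The stable rank reflects the intrinsic dimensionality of a weight matrix and provides effective guidance for low-rank adaptation without costly computation.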

Abstract: Low-Rank Adaptation (LoRA) has proven effective in reducing computational costs while maintaining performance comparable to fully fine-tuned foundation models across various tasks. However, its fixed low-rank structure restricts its adaptability in scenarios with substantial domain gaps, where higher ranks are often required to capture domain-specific complexities. Current adaptive LoRA methods attempt to overcome this limitation by dynamically expanding or selectively allocating ranks, but these approaches frequently depend on computationally intensive techniques such as iterative pruning, rank searches, or additional regularization. To address these challenges, we introduce Stable Rank-Guided Low-Rank Adaptation (SR-LoRA), a novel framework that utilizes the stable rank of pre-trained weight matrices as a natural prior for layer-wise rank allocation. By leveraging the stable rank, which reflects the intrinsic dimensionality of the weights, SR-LoRA enables a principled and efficient redistribution of ranks across layers, enhancing adaptability without incurring additional search costs. Empirical evaluations on few-shot tasks with significant domain gaps show that SR-LoRA consistently outperforms recent adaptive LoRA variants, achieving a superior trade-off between performance and efficiency. Our code is available at https://github.com/EndoluminalSurgicalVision-IMR/SR-LoRA.
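For readers unfamiliar with the stable rank, it is commonly defined as the squared Frobenius norm divided by the squared spectral norm of a matrix. The sketch below computes it in plain Python, estimating the spectral norm by power iteration. The `allocate_ranks` helper is a hypothetical proportional rule added only to illustrate layer-wise allocation; the abstract does not describe SR-LoRA's actual allocation rule, and the function names, iteration count, and budget are assumptions.

```python
import math
import random


def stable_rank(W, iters=200, seed=0):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a matrix given as a list of rows.

    The squared spectral norm is estimated by power iteration on W^T W,
    started from a seeded random unit vector.
    """
    rows, cols = len(W), len(W[0])
    fro_sq = sum(v * v for row in W for v in row)
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    sigma_sq = 0.0
    for _ in range(iters):
        y = [sum(W[i][j] * x[j] for j in range(cols)) for i in range(rows)]  # y = W x
        sigma_sq = sum(v * v for v in y)  # ||W x||^2 for unit x
        z = [sum(W[i][j] * y[i] for i in range(rows)) for j in range(cols)]  # z = W^T W x
        z_norm = math.sqrt(sum(v * v for v in z))
        if z_norm == 0.0:
            break
        x = [v / z_norm for v in z]
    return fro_sq / sigma_sq if sigma_sq > 0.0 else 0.0


def allocate_ranks(stable_ranks, total_budget):
    """Hypothetical rule: split a rank budget in proportion to stable rank (min 1)."""
    total = sum(stable_ranks)
    if total <= 0:
        return [1 for _ in stable_ranks]
    return [max(1, round(total_budget * s / total)) for s in stable_ranks]


# Two toy "layers": an identity matrix (full stable rank) and a rank-one matrix.
layer_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
layer_b = [[1.0, 2.0], [2.0, 4.0]]
ranks = [stable_rank(layer_a), stable_rank(layer_b)]
print([round(r, 3) for r in ranks], allocate_ranks(ranks, total_budget=8))
```

The identity matrix has stable rank 3 and the rank-one matrix has stable rank 1, so the proportional rule gives the first layer most of the budget. Because rounding is applied per layer, the allocated ranks need not sum exactly to the budget in general.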

[44] Populate-A-Scene: Affordance-Aware Human Video Generation

Mengyi Shan,Zecheng He,Haoyu Ma,Felix Juefei-Xu,Peizhao Zhang,Tingbo Hou,Ching-Yao Chuang

Main category: cs.CV

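TL;DR: The paper explores how to repurpose a text-to-video generation model as an interactive world simulator by teaching it to predict human-environment interaction, inserting a person into a scene with coherent behavior, appearance, and scene affordance.

Details Motivation: To exploit the potential of text-to-video models by inferring human-environment interaction from a single scene image, without explicit conditions such as bounding boxes or body poses.

Contribution: Proposes a method that fine-tunes a video generation model to infer human affordance (where to insert a person and how they should behave) from a single scene image, without explicit bounding-box or pose conditions.

Method: Fine-tunes a pre-trained video model on a scene image plus a prompt describing human actions, and analyzes cross-attention heatmaps to study the model's affordance perception.

Result: The model inserts people with coherent behavior, harmonized appearance, and plausible scene affordance; the heatmap study indicates that the pre-trained model has inherent affordance perception.

Insight: Text-to-video models can perceive scene affordance without labeled affordance datasets, which suggests a path toward interactive world simulators.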

Abstract: Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.

[45] Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video

Alexander Moore,Amar Saini,Kylie Cancilla,Doug Poland,Carmen Carrano

Main category: cs.CV

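TL;DR: The paper introduces MOVi-MC-AC, the largest amodal segmentation dataset to date and the first amodal content completion dataset, which provides multi-view scene information through multi-camera video.

Details Motivation: Existing data for amodal segmentation and content completion lacks multi-camera views, which limits the object context available to models. MOVi-MC-AC is proposed to fill this gap.

Contribution: 1. The MOVi-MC-AC dataset, with amodal segmentation and content completion labels across multiple camera views; 2. The first ground-truth labels for amodal content.

Method: Cluttered scenes of generic household objects are simulated in multi-camera video, with consistent object IDs for detection, tracking, and segmentation across frames and cameras. The dataset contains about 5.8 million object instances.

Result: MOVi-MC-AC is the largest amodal segmentation dataset to date and provides the first ground-truth data for amodal content completion.

Insight: Multi-camera views add object context that may benefit amodal segmentation and content completion; view-invariant object representation learning is a natural direction for future work.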

Abstract: Amodal segmentation and amodal content completion require using object priors to estimate occluded masks and features of objects in complex scenes. Until now, no data has provided an additional dimension for object context: the possibility of multiple cameras sharing a view of a scene. We introduce MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content, the largest amodal segmentation and first amodal content dataset to date. Cluttered scenes of generic household objects are simulated in multi-camera video. MOVi-MC-AC contributes to the growing literature of object detection, tracking, and segmentation by including two new contributions to the deep learning for computer vision world. Multiple Camera (MC) settings where objects can be identified and tracked between various unique camera perspectives are rare in both synthetic and real-world video. We introduce a new complexity to synthetic video by providing consistent object IDs for detections and segmentations between both frames and multiple cameras, each with unique features and motion patterns, on a single scene. Amodal Content (AC) is a reconstructive task in which models predict the appearance of target objects through occlusions. In the amodal segmentation literature, some datasets have been released with amodal detection, tracking, and segmentation labels. While other methods rely on slow cut-and-paste schemes to generate amodal content pseudo-labels, they do not account for natural occlusions present in the modal masks. MOVi-MC-AC provides labels for ~5.8 million object instances, setting a new maximum in the amodal dataset literature, along with being the first to provide ground-truth amodal content. The full dataset is available at https://huggingface.co/datasets/Amar-S/MOVi-MC-AC.

[46] CGEarthEye: A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation

Zhiwei Yi,Xin Cheng,Jingyu Ma,Ruifei Zhu,Junwei Tian,Yuanxiu Zhou,Xinge Zhao,Hongzhe Li

Main category: cs.CV

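TL;DR: The paper proposes CGEarthEye, a high-resolution remote sensing vision foundation model (RSVFM) designed for the Jilin-1 satellites. Built on a large-scale pre-training paradigm, the multi-temporal self-supervised dataset JLSSD, and several contrastive strategies, it achieves SOTA performance across multiple remote sensing tasks.

Details Motivation: Limited acquisition channels for ultra-high-resolution optical remote sensing imagery have constrained high-resolution RSVFMs. Jilin-1, the world's largest sub-meter commercial remote sensing constellation, offers abundant imagery and motivates a foundation model tailored to its characteristics.

Contribution: 1. CGEarthEye, an RSVFM framework for Jilin-1 with five backbones of different parameter scales, totaling 2.1 billion parameters; 2. JLSSD, the first 15-million-scale multi-temporal self-supervised learning dataset; 3. Pre-training that integrates seasonal contrast, augmentation-based contrast, and masked patch token contrast.

Method: 1. JLSSD is constructed through multi-level representation clustering and sampling strategies; 2. Pre-training combines seasonal contrast, augmentation-based contrast, and masked patch token contrast; 3. The model is evaluated on 10 benchmark datasets.

Result: CGEarthEye achieves SOTA performance on 10 benchmark datasets covering four typical remote sensing tasks and shows strong characteristics in feature visualization, model convergence, parameter efficiency, and practical mapping applications.

Insight: CGEarthEye's representation capabilities are expected to enable broader and more efficient use of Jilin-1 data in traditional Earth observation applications and to inform research on high-resolution remote sensing foundation models.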

Abstract: Deep learning methods have significantly advanced the development of intelligent interpretation in remote sensing (RS), with foundational model research based on large-scale pre-training paradigms rapidly reshaping various domains of Earth Observation (EO). However, compared to the open accessibility and high spatiotemporal coverage of medium-resolution data, the limited acquisition channels for ultra-high-resolution optical RS imagery have constrained the progress of high-resolution remote sensing vision foundation models (RSVFM). As the world’s largest sub-meter-level commercial RS satellite constellation, the Jilin-1 constellation possesses abundant sub-meter-level image resources. This study proposes CGEarthEye, an RSVFM framework specifically designed for Jilin-1 satellite characteristics, comprising five backbones with different parameter scales, totaling 2.1 billion parameters. To enhance the representational capacity of the foundation model, we developed JLSSD, the first 15-million-scale multi-temporal self-supervised learning (SSL) dataset featuring global coverage with quarterly temporal sampling within a single year, constructed through multi-level representation clustering and sampling strategies. The framework integrates seasonal contrast, augmentation-based contrast, and masked patch token contrastive strategies for pre-training. Comprehensive evaluations across 10 benchmark datasets covering four typical RS tasks demonstrate that CGEarthEye consistently achieves state-of-the-art (SOTA) performance. Further analysis reveals CGEarthEye’s superior characteristics in feature visualization, model convergence, parameter efficiency, and practical mapping applications. This study anticipates that the exceptional representation capabilities of CGEarthEye will facilitate broader and more efficient applications of Jilin-1 data in traditional EO applications.

[47] Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space

Yingping Liang,Yutao Hu,Wenqi Shao,Ying Fu

Main category: cs.CV

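TL;DR: Proposes Lift to Match (L2M), a two-stage framework that lifts 2D images to 3D space and uses large-scale synthesis from single-view images to achieve robust feature matching across domains.

Details Motivation: Existing feature matching methods rely on multi-view image collections, which limits generalization; conventional feature encoders are trained on single-view 2D images and struggle to capture 3D-aware correspondences.

Contribution: Proposes the two-stage L2M framework, which combines 3D geometry knowledge with large-scale single-view image synthesis to obtain 3D-aware feature encoding and cross-domain feature matching.

Method: Stage one learns a 3D-aware feature encoder through multi-view image synthesis and a 3D feature Gaussian representation. Stage two learns a feature decoder using novel-view rendering and large-scale synthetic data.

Result: The method performs strongly on zero-shot evaluation benchmarks, demonstrating good generalization.

Insight: Synthesizing 3D information from single-view images and injecting 3D geometry knowledge can improve the robustness and cross-domain generalization of feature matching.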

Abstract: Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named Lift to Match (L2M), taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.

[48] Few-shot Classification as Multi-instance Verification: Effective Backbone-agnostic Transfer across Domains

Xin Xu,Eibe Frank,Geoffrey Holmes

Main category: cs.CV

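TL;DR: The paper proposes MIV-head, a new approach to cross-domain few-shot learning that recasts few-shot classification as multiple instance verification. It requires no fine-tuning of the pretrained backbone, substantially lowers adaptation cost, and achieves accuracy highly competitive with state-of-the-art adapter methods.

Details Motivation: In cross-domain few-shot learning, fine-tuning the backbone is often impossible or too costly, and existing methods are of limited use with static, low-quality embeddings. An adaptation method that does not fine-tune the backbone is therefore needed.

Contribution: Proposes the MIV-head, which treats few-shot classification as a multiple instance verification problem and provides a backbone-agnostic, computationally efficient classification head with substantially lower adaptation cost.

Method: Represents few-shot classification as a series of multiple instance verification (MIV) tasks and designs the core components of the MIV-head, which are trained on few-shot data from the target domain without fine-tuning the backbone.

Result: Across various settings and on an extension of the Meta-dataset benchmark, the MIV-head achieves accuracy highly competitive with adapter (partial fine-tuning) methods at substantially lower cost, while well-known classification head approaches lag far behind.

Insight: Recasting few-shot classification as multiple instance verification is an efficient, backbone-agnostic adaptation strategy for practical settings in which the backbone cannot be fine-tuned.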

Abstract: We investigate cross-domain few-shot learning under the constraint that fine-tuning of backbones (i.e., feature extractors) is impossible or infeasible – a scenario that is increasingly common in practical use cases. Handling the low-quality and static embeddings produced by frozen, “black-box” backbones leads to a problem representation of few-shot classification as a series of multiple instance verification (MIV) tasks. Inspired by this representation, we introduce a novel approach to few-shot domain adaptation, named the “MIV-head”, akin to a classification head that is agnostic to any pretrained backbone and computationally efficient. The core components designed for the MIV-head, when trained on few-shot data from a target domain, collectively yield strong performance on test data from that domain. Importantly, it does so without fine-tuning the backbone, and within the “meta-testing” phase. Experimenting under various settings and on an extension of the Meta-dataset benchmark for cross-domain few-shot image classification, using representative off-the-shelf convolutional neural network and vision transformer backbones pretrained on ImageNet1K, we show that the MIV-head achieves highly competitive accuracy when compared to state-of-the-art “adapter” (or partially fine-tuning) methods applied to the same backbones, while incurring substantially lower adaptation cost. We also find well-known “classification head” approaches lag far behind in terms of accuracy. Ablation study empirically justifies the core components of our approach. We share our code at https://github.com/xxweka/MIV-head.

[49] DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting

Jingyi Pan,Dan Xu,Qiong Luo

Main category: cs.CV

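TL;DR: DiGA3D is a coarse-to-fine, diffusion-based 3D inpainting method. Through multi-reference view selection, attention feature propagation, and an improved geometric consistency loss, it addresses the view-dependence and consistency problems of existing methods.

Details Motivation: Existing 3D inpainting methods face view dependence, appearance inconsistency, and geometry inconsistency when handling multiple tasks in a unified framework, so a more robust and consistent solution is needed.

Contribution: 1) A strategy for selecting multiple reference views; 2) An Attention Feature Propagation (AFP) mechanism; 3) A Texture-Geometry Score Distillation Sampling (TG-SDS) loss.

Method: 1) Select multiple reference views; 2) Propagate attention features from the reference views to other views via diffusion models (AFP); 3) Use TG-SDS to improve geometric consistency.

Result: Experiments on multiple 3D inpainting tasks demonstrate the effectiveness of DiGA3D.

Insight: By combining diffusion models with a coarse-to-fine strategy, DiGA3D propagates appearance and geometry consistently and offers a unified framework for 3D inpainting.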

Abstract: Developing a unified pipeline that enables users to remove, re-texture, or replace objects in a versatile manner is crucial for text-guided 3D inpainting. However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) single-reference inpainting methods lack robustness when dealing with views that are far from the reference view; 2) appearance inconsistency arises when independently inpainting multi-view images with 2D diffusion priors; 3) geometry inconsistency limits performance when there are significant geometric changes in the inpainting regions. To tackle these challenges, we introduce DiGA3D, a novel and versatile 3D inpainting pipeline that leverages diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. First, DiGA3D develops a robust strategy for selecting multiple reference views to reduce errors during propagation. Next, DiGA3D designs an Attention Feature Propagation (AFP) mechanism that propagates attention features from the selected reference views to other views via diffusion models to maintain appearance consistency. Furthermore, DiGA3D introduces a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to further improve the geometric consistency of inpainted 3D scenes. Extensive experiments on multiple 3D inpainting tasks demonstrate the effectiveness of our method. The project page is available at https://rorisis.github.io/DiGA3D/.

[50] MFH: Marrying Frequency Domain with Handwritten Mathematical Expression Recognition

Huanxin Yang,Qiwen Wang

Main category: cs.CV

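TL;DR: The paper proposes MFH, a handwritten mathematical expression recognition (HMER) method that incorporates frequency domain analysis. It uses the discrete cosine transform (DCT) to exploit frequency information and reports consistent improvements on several datasets.

Details Motivation: HMER is challenging because of complex formula structures and character layouts; the authors aim to strengthen structural recognition by incorporating frequency domain analysis.

Contribution: Introduces frequency domain analysis (DCT) into HMER, proposes the MFH method, and verifies experimentally that frequency information improves HMER performance.

Method: Extracts frequency-domain features with the DCT and combines them with sequence prediction models to better recognize the structure of complex mathematical expressions.

Result: MFH-CoMER achieves accuracy rates of 61.66%/62.07%/63.72% on the CROHME 2014/2016/2019 test sets.

Insight: Frequency information can assist the structural analysis in HMER and offers a new direction for recognizing complex formulas.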

Abstract: Handwritten mathematical expression recognition (HMER) suffers from complex formula structures and character layouts in sequence prediction. In this paper, we incorporate frequency domain analysis into HMER and propose a method that marries frequency domain with HMER (MFH), leveraging the discrete cosine transform (DCT). We emphasize the structural analysis assistance of frequency information for recognizing mathematical formulas. When implemented on various baseline models, our network exhibits a consistent performance enhancement, demonstrating the efficacy of frequency domain information. Experiments show that our MFH-CoMER achieves noteworthy accuracy rates of 61.66%/62.07%/63.72% on the CROHME 2014/2016/2019 test sets. The source code is available at https://github.com/Hryxyhe/MFH.
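As background on the transform MFH builds on, the sketch below implements an unnormalized 2D DCT-II in plain Python and applies it to a tiny image. It only illustrates how the DCT separates smooth content (low-frequency coefficients) from sharp structure (high-frequency coefficients). How MFH feeds such coefficients into a recognition network is not described in the abstract, and the function names `dct_1d` and `dct_2d` are illustrative.

```python
import math


def dct_1d(x):
    """Unnormalized DCT-II: X[k] = sum_t x[t] * cos(pi * (t + 0.5) * k / N)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def dct_2d(img):
    """Separable 2D DCT-II: transform every row, then every column."""
    height, width = len(img), len(img[0])
    row_pass = [dct_1d(row) for row in img]
    col_pass = [dct_1d([row_pass[i][j] for i in range(height)])
                for j in range(width)]
    # col_pass is indexed [column][vertical frequency]; transpose back.
    return [[col_pass[j][u] for j in range(width)] for u in range(height)]


# A flat 4x4 patch: all energy should land in the DC coefficient [0][0].
flat_patch = [[1.0] * 4 for _ in range(4)]
coeffs = dct_2d(flat_patch)
print(round(coeffs[0][0], 6))  # → 16.0
```

A patch containing a pen stroke would instead spread energy into higher-frequency coefficients, which is the kind of structural cue a frequency-aware model can exploit.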

[51] Latent Posterior-Mean Rectified Flow for Higher-Fidelity Perceptual Face Restoration

Xin Luo,Menglin Zhang,Yunwei Lan,Tianyu Zhang,Rui Li,Chang Liu,Dong Liu

Main category: cs.CV

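TL;DR: Latent-PMRF reformulates Posterior-Mean Rectified Flow (PMRF) in the latent space of a variational autoencoder to align better with human perception, yielding higher-fidelity face restoration.

Details Motivation: PMRF models in pixel space, which limits its alignment with human perception; a method optimized in latent space is therefore needed.

Contribution: Proposes Latent-PMRF, which operates in the VAE latent space and improves both the perception-distortion tradeoff and convergence efficiency for face restoration.

Method: Reformulates PMRF in the VAE latent space, where the minimum distortion is bounded by the VAE's reconstruction error, and designs a VAE that improves reconstruction and restoration.

Result: On blind face restoration, Latent-PMRF outperforms existing methods and achieves a 5.79x convergence speedup over PMRF in terms of FID.

Insight: Optimizing in latent space aligns better with human perception, and the design of the VAE is crucial to performance.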

Abstract: The Perception-Distortion tradeoff (PD-tradeoff) theory suggests that face restoration algorithms must balance perceptual quality and fidelity. To achieve minimal distortion while maintaining perfect perceptual quality, Posterior-Mean Rectified Flow (PMRF) proposes a flow-based approach in which the source distribution consists of minimum-distortion estimates. Although PMRF is shown to be effective, its pixel-space modeling approach limits its ability to align with human perception, where human perception is defined as how humans distinguish between two image distributions. In this work, we propose Latent-PMRF, which reformulates PMRF in the latent space of a variational autoencoder (VAE), facilitating better alignment with human perception during optimization. By defining the source distribution on latent representations of the minimum-distortion estimate, we bound the minimum distortion by the VAE’s reconstruction error. Moreover, we show that the design of the VAE is crucial, and our proposed VAE significantly outperforms existing VAEs in both reconstruction and restoration. Extensive experiments on blind face restoration demonstrate the superiority of Latent-PMRF, offering an improved PD-tradeoff compared to existing methods, along with remarkable convergence efficiency, achieving a 5.79X speedup over PMRF in terms of FID. Our code will be available as open-source.

[52] ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales

Yihao Zhen,Qiang Wang,Yu Qiao,Liangqiong Qu,Huijie Fan

Main category: cs.CV

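TL;DR: ATSTrack improves visual-language tracking by aligning the temporal and spatial scales of the visual and language inputs, decomposing language descriptions into phrases and introducing a Visual-Language token.

Details Motivation: In visual-language tracking, the mismatch in temporal and spatial scale between the target and its language description is a major challenge that existing methods have not adequately addressed.

Contribution: Proposes ATSTrack, a visual-language tracker that targets temporal and spatial scale alignment by decomposing language descriptions and introducing a Visual-Language token.

Method: Decomposes each language description into phrases with different temporal and spatial attributes and modifies their features in a fine-grained manner; introduces a Visual-Language token carrying information from the previous frame to guide feature extraction.

Result: ATSTrack achieves performance comparable to existing methods.

Insight: Aligning temporal and spatial scales matters for visual-language tracking; fine-grained feature modification and guidance from context help reduce the impact of scale differences.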

Abstract: A main challenge of Visual-Language Tracking (VLT) is the misalignment between visual inputs and language descriptions caused by target movement. Previous trackers have explored many effective feature modification methods to preserve more aligned features. However, an important yet unexplored factor ultimately hinders their capability, which is the inherent differences in the temporal and spatial scale of information between visual and language inputs. To address this issue, we propose a novel visual-language tracker that enhances the effect of feature modification by Aligning Temporal and Spatial scale of different input components, named as ATSTrack. Specifically, we decompose each language description into phrases with different attributes based on their temporal and spatial correspondence with visual inputs, and modify their features in a fine-grained manner. Moreover, we introduce a Visual-Language token that comprises modified linguistic information from the previous frame to guide the model to extract visual features that are more relevant to language description, thereby reducing the impact caused by the differences in spatial scale. Experimental results show that our proposed ATSTrack achieves performance comparable to existing methods. Our code will be released.

[53] Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation

Jizhou Han,Chenhao Ding,SongLin Dong,Yuhang He,Xinyuan Gao,Yihong Gong

Main category: cs.CV

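TL;DR: The paper proposes MS-TTA, a training-free, Mean-Shift-based test-time adaptation method that refines the feature representations of all test samples to improve CLIP's performance under distribution shift.

Details Motivation: Existing test-time adaptation methods typically rely only on high-confidence samples and overlook the potential of low-confidence ones, which limits adaptability. The authors aim to improve generalization by enhancing the feature representations of all test samples.

Contribution: Proposes MS-TTA, which uses Mean-Shift to refine the features of all test samples, improving feature compactness and class separability. A cache of refined embeddings further improves inference.

Method: Applies a single-step k-nearest-neighbor (kNN) Mean-Shift to refine feature representations beyond CLIP's original feature space, and uses a cache of refined embeddings to provide Mean-Shift-enhanced logits at inference.

Result: On OOD and cross-dataset benchmarks, MS-TTA outperforms existing training-free test-time adaptation methods and achieves robust adaptation.

Insight: Mean-Shift can significantly improve adaptation to distribution shift without additional training, and refining low-confidence samples is one of the key factors.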

Abstract: Visual-language models (VLMs) like CLIP exhibit strong generalization but struggle with distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate strictly within CLIP’s original feature space, relying on high-confidence samples while overlooking the potential of low-confidence ones. We propose MS-TTA, a training-free approach that enhances feature representations beyond CLIP’s space using a single-step k-nearest neighbors (kNN) Mean-Shift. By refining all test samples, MS-TTA improves feature compactness and class separability, leading to more stable adaptation. Additionally, a cache of refined embeddings further enhances inference by providing Mean Shift enhanced logits. Extensive evaluations on OOD and cross-dataset benchmarks demonstrate that MS-TTA consistently outperforms state-of-the-art training-free TTA methods, achieving robust adaptation without requiring additional training.
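The single-step kNN Mean-Shift mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity neighbor search, and the blending factor `alpha` are assumptions made for illustration, and the abstract does not specify how MS-TTA combines a feature with its neighborhood mean.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit-norm."""
    return sum(x * y for x, y in zip(a, b))


def knn_mean_shift_step(features, k=3, alpha=0.5):
    """One mean-shift step: move each feature toward the mean of its k nearest neighbors.

    `alpha` (hypothetical parameter) blends the original feature with the
    neighborhood mean; alpha=1.0 replaces the feature with the mean.
    """
    feats = [l2_normalize(f) for f in features]
    refined = []
    for i, f in enumerate(feats):
        # Rank all other samples by cosine similarity and keep the top k.
        neighbours = sorted(
            (j for j in range(len(feats)) if j != i),
            key=lambda j: cosine(f, feats[j]),
            reverse=True,
        )[:k]
        if not neighbours:
            refined.append(f)
            continue
        mean = [
            sum(feats[j][d] for j in neighbours) / len(neighbours)
            for d in range(len(f))
        ]
        shifted = [(1 - alpha) * x + alpha * m for x, m in zip(f, mean)]
        refined.append(l2_normalize(shifted))
    return refined


# Two loose clusters in 2D; one step pulls each cluster's members together.
toy_features = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.0],
                [0.1, 1.0], [-0.1, 1.0], [0.0, 0.9]]
refined_features = knn_mean_shift_step(toy_features, k=2, alpha=0.5)
```

In this toy setting every sample is refined, regardless of how confident a classifier would be about it, which is the property the abstract emphasizes.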

[54] Bisecle: Binding and Separation in Continual Learning for Video Language Understanding

Yue Tan,Xiaoqian Hu,Hao Xue,Celso De Melo,Flora D. Salim

Main category: cs.CV

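TL;DR: The paper proposes Bisecle, which mimics the rapid binding and pattern separation mechanisms of the human hippocampus to address continual learning in video-language understanding, using multi-directional supervision and contrastive prompt learning to reduce catastrophic forgetting and update conflict.

Details Motivation: Real-world videos typically arrive as continuously evolving data streams, and existing large vision-language models (VLMs) suffer from catastrophic forgetting and update conflict during continual learning. The efficient memory mechanisms of the human hippocampus provide the inspiration.

Contribution: Proposes Bisecle, which combines a multi-directional supervision module with a contrastive prompt learning scheme, reducing forgetting in continual learning and improving cross-task generalization.

Method: Designs a multi-directional supervision module to capture cross-modal relationships, uses contrastive prompt learning to isolate task-specific knowledge, and relies on binding and separation processes to strengthen the model's ability to retain complex experiences.

Result: Evaluations on several VideoQA benchmarks show that Bisecle mitigates forgetting and enhances cross-task generalization.

Insight: Mimicking the memory mechanisms of the human hippocampus can improve the efficiency and robustness of vision-language models in continual learning.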

Abstract: Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid Binding and pattern separation mechanisms in the hippocampus, in this work, we propose Bisecle for video-language continual learning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.

[55] ADAptation: Reconstruction-based Unsupervised Active Learning for Breast Ultrasound Diagnosis

Yaofei Duan,Yuhao Huang,Xin Yang,Luyi Han,Xinyu Xie,Zhiyuan Zhu,Ping He,Ka-Hou Chan,Ligang Cui,Sio-Kei Im,Dong Ni,Tao Tan

Main category: cs.CV

Abstract: Deep learning-based diagnostic models often suffer performance drops due to distribution shifts between training (source) and test (target) domains. Collecting and labeling sufficient target domain data for model retraining represents an optimal solution, yet is limited by time and scarce resources. Active learning (AL) offers an efficient approach to reduce annotation costs while maintaining performance, but struggles to handle the challenge posed by distribution variations across different datasets. In this study, we propose a novel unsupervised Active learning framework for Domain Adaptation, named ADAptation, which efficiently selects informative samples from multi-domain data pools under limited annotation budget. As a fundamental step, our method first utilizes the distribution homogenization capabilities of diffusion models to bridge cross-dataset gaps by translating target images into source-domain style. We then introduce two key innovations: (a) a hypersphere-constrained contrastive learning network for compact feature clustering, and (b) a dual-scoring mechanism that quantifies and balances sample uncertainty and representativeness. Extensive experiments on four breast ultrasound datasets (three public and one in-house/multi-center) across five common deep classifiers demonstrate that our method surpasses existing strong AL-based competitors, validating its effectiveness and generalization for clinical domain adaptation. The code is available at the anonymized link: https://github.com/miccai25-966/ADAptation.

[56] Just Noticeable Difference for Large Multimodal Models

Zijian Chen,Yuan Tian,Yuze Sun,Wei Sun,Zicheng Zhang,Weisi Lin,Guangtao Zhai,Wenjun Zhang

Main category: cs.CV

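TL;DR: The paper introduces the concept of LMM-JND (just noticeable difference for large multimodal models), together with a dataset and a determination pipeline, and reveals visual blind spots in current LMMs.

Details Motivation: To study the visual perception defects of LMMs, in particular their just noticeable difference (JND), filling a gap in existing research and offering a new perspective on model security.

Contribution: 1. Proposes the LMM-JND concept and its determination pipeline; 2. Builds the VPA-JND dataset (21.5k reference images, over 489k stimuli); 3. Reveals that LMMs such as GPT-4o and InternVL2.5 fall short on visual tasks.

Method: Systematically quantifies the JND characteristics of LMMs, evaluates models on VPA-JND across 12 distortion types, and analyzes the effects of vision and language backbones.

Result: Current state-of-the-art LMMs fall significantly short of human-level performance on basic visual comparison tasks, and the design of the vision and language backbones is notably correlated with LMM-JND behavior.

Insight: LMM-JND is a new perspective for studying the visual perception of LMMs; predictable LMM-JND matters for security, and future designs should balance multimodal fusion with visual acuity.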

Abstract: Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Although recent work has extended this line of research into machine vision, there has been a scarcity of studies systematically exploring its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large multimodal models (LMMs), where studying the multifaceted capabilities of models has become a mainstream focus. Moreover, the perceptual defects of LMMs are not investigated thoroughly, resulting in potential security issues and suboptimal response efficiency. In this paper, we make an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs. To systematically quantify this characteristic, we propose a new concept, LMM-JND, together with its determination pipeline. Targeting uncovering the behavior commonalities in HVS-aligned visual perception tasks, we delve into several LMM families and construct a large-scale dataset, named VPA-JND, which contains 21.5k reference images with over 489k stimuli across 12 distortion types, to facilitate LMM-JND studies. VPA-JND exposes areas where state-of-the-art LMMs, including GPT-4o and the InternVL2.5 series, struggle with basic comparison queries and fall significantly short of human-level visual performance. We further explore the effects of vision and language backbones and find a notable correlation between their design philosophy that may instruct the future refinement of LMMs for their visual acuity. Together, our research underscores the significance of LMM-JND as a unique perspective for studying LMMs, and predictable LMM-JND is crucial for security concerns. This work will be available at https://github.com/zijianchen98/LMM-JND.

[57] Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models

Fenil R. Doshi,Thomas Fel,Talia Konkle,George Alvarez

Main category: cs.CV

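TL;DR: The paper proposes the Configural Shape Score (CSS) to measure the absolute configural shape competence of vision models, reveals differences across models in shape processing, and shows that high-CSS models rely on long-range interactions, pointing toward more robust, human-like vision systems.

Details Motivation: Contemporary vision models rely heavily on local texture cues and neglect the configuration of object parts, whereas humans can rely on both texture and shape. The paper aims to quantify the absolute configural competence of models to inform model design.

Contribution: 1. Proposes the Configural Shape Score (CSS) to quantify absolute configural competence; 2. Reveals a broad spectrum of CSS across models, with fully self-supervised and language-aligned transformers performing best; 3. Finds that high-CSS models depend on long-range interactions and exhibit a mid-depth transition from local to global coding.

Method: Uses Object-Anagram image pairs (which preserve local texture while permuting the global arrangement of parts) to evaluate configural competence, defines the CSS metric, and uses attention masks and representational-similarity analysis to probe how high-CSS models work.

Result: High-CSS models (such as DINOv2, SigLIP2 and EVA-CLIP) depend on long-range interactions; a BagNet control stays at chance, ruling out "border-hacking" strategies. CSS also predicts performance on other shape-dependent evaluations.

Insight: Building more robust, generalizable and human-like vision systems requires integrating both local texture and global configural shape rather than forcing models to choose between the two.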

Abstract: Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity with fully self-supervised and language-aligned transformers – exemplified by DINOv2, SigLIP2 and EVA-CLIP – occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. A BagNet control remains at chance (iv), ruling out “border-hacking” strategies. Finally, (v) we show that configural shape score also predicts other shape-dependent evals. Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local-texture and global configural shape.
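The abstract describes CSS as measuring the ability to recognize both images in an Object-Anagram pair. A minimal sketch of such a pair-level score follows, assuming it is the fraction of pairs in which both images are classified correctly; the function name and input layout are hypothetical, and the paper's exact definition (for example, any normalization) may differ.

```python
def configural_shape_score(pairs):
    """Fraction of anagram pairs in which BOTH images are classified correctly.

    `pairs` is a sequence of ((pred_a, label_a), (pred_b, label_b)) tuples,
    one per Object-Anagram pair (hypothetical input layout).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    both_correct = sum(
        1 for (pred_a, label_a), (pred_b, label_b) in pairs
        if pred_a == label_a and pred_b == label_b
    )
    return both_correct / len(pairs)


# Four toy pairs: only the first and last have both images correct.
toy_pairs = [
    (("wolf", "wolf"), ("elephant", "elephant")),
    (("cat", "cat"), ("dog", "rabbit")),
    (("bird", "horse"), ("car", "truck")),
    (("chair", "chair"), ("lamp", "lamp")),
]
toy_css = configural_shape_score(toy_pairs)
```

Scoring at the pair level means a model that keys on the shared local texture can get at most one image of each pair right, so it cannot score well by texture alone.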

[58] LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

Haoran Lou,Chunxiao Fan,Ziyan Liu,Yuexin Wu,Xinxiang Wang

Main category: cs.CV

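TL;DR: LLaVA-SP adds six spatial visual tokens to enhance the visual representation of multimodal large language models, introducing a new projector and two model variants that improve performance.

Details Motivation: CLIP-ViT captures global image features well but struggles to model local relationships between adjacent patches, which weakens the detailed understanding ability of MLLMs. LLaVA-SP aims to address this.

Contribution: 1) A new projector that uses convolutional kernels to derive spatial visual tokens from ViT patch features; 2) two model variants (Cropping and Pooling); 3) experiments showing significant performance gains.

Method: 1) Generates spatial visual tokens with convolutional kernels; 2) fuses fine-grained visual information through cross-attention; 3) offers two variants, LLaVA-SP-Cropping and LLaVA-SP-Pooling.

Result: LLaVA-SP performs well on multiple multimodal benchmarks, outperforming LLaVA-1.5 on multiple tasks with nearly identical inference latency.

Insight: Adding a small number of spatial tokens can improve visual representation without affecting inference efficiency, offering a new approach to visual enhancement for MLLMs.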

Abstract: The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: “from central region to global” and “from abstract to specific”. Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and models are available at https://github.com/CnFaker/LLaVA-SP.

[59] SCING:Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning

Yunfei Xie,Yuxuan Cheng,Juncheng Wu,Haoyu Zhang,Yuyin Zhou,Shoudong Han

Main category: cs.CV

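TL;DR: The paper proposes SCING, a framework that uses selective cross-modal prompt tuning to improve the efficiency and robustness of person re-identification while avoiding the computational overhead of heavy adapters.

Details Motivation: Current person re-identification methods built on vision-language pre-trained models such as CLIP often rely on complex adapter designs or modality-specific tuning and neglect cross-modal interaction, leading to high computational cost or suboptimal alignment.

Contribution: 1. Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism; 2. Perturbation-Driven Consistency Alignment (PDCA), which improves robustness by regularizing consistency between original and augmented cross-modal embeddings.

Method: SCING combines SVIP and PDCA: SVIP dynamically fuses visual and textual features, and PDCA uses a dual-path training strategy to strengthen robustness to image perturbations.

Result: The method shows strong performance on several benchmarks (Market1501, DukeMTMC-ReID and others) while maintaining efficient inference.

Insight: Cross-modal prompt tuning can improve the efficiency and robustness of person re-identification without complex adapter designs.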

Abstract: Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. Our method introduces two key innovations. First, we propose Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Second, the proposed Perturbation-Driven Consistency Alignment (PDCA) is a dual-path training strategy that enforces invariant feature alignment under random image perturbations by regularizing consistency between original and augmented cross-modal embeddings. Extensive experiments are conducted on several popular benchmarks covering Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC, which demonstrate the impressive performance of the proposed method. Notably, our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead. The code will be released upon acceptance.

[60] Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving

Djamahl Etchegaray,Yuxia Fu,Zi Huang,Yadan Luo

Main category: cs.CV

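TL;DR: Box-QAymo is a box-referring visual question answering dataset and benchmark for autonomous driving, designed to evaluate and fine-tune vision-language models (VLMs) on spatial and temporal reasoning over user-specified objects.

Details Motivation: Current VLMs struggle to capture user intent in real-world scenarios, and existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, so they cannot assess whether models respond to localized, user-driven queries.

Contribution: 1) Introduces the Box-QAymo dataset, in which users express intent by drawing bounding boxes; 2) designs a hierarchical evaluation protocol covering basic capabilities, attribute prediction, motion understanding and spatiotemporal reasoning; 3) applies quality control to ensure dataset diversity and robustness.

Method: 1) Crowd-sources fine-grained object classes and visual attributes; 2) extracts object trajectories to build temporally grounded QA pairs; 3) applies quality control through negative sampling, temporal consistency checks and difficulty-aware balancing.

Result: Current VLMs show significant limitations on perception questions, highlighting the gap to real-world requirements.

Insight: Letting users mark intent directly with bounding boxes is fast and intuitive, and Box-QAymo lays a foundation for more robust and interpretable autonomous driving systems.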

Abstract: Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries. We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and finetune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes. Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames. To support this, we crowd-sourced fine-grained object classes and visual attributes that reflect the complexity drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing guarantee dataset robustness and diversity. Our comprehensive evaluation reveals significant limitations in current VLMs when queried about perception questions, highlighting the gap in achieving real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions. Project page and dataset are available at https://djamahl99.github.io/qaymo-pages/.

[61] Not All Attention Heads Are What You Need: Refining CLIP’s Image Representation with Attention Ablation

Feng Lin,Marco Chen,Haokui Zhang,Xiaotian Yu,Guangming Lu,Rong Xiao

Main category: cs.CV

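TL;DR: The paper analyzes the role of attention heads in CLIP's image encoder and proposes the Attention Ablation Technique (AAT), which suppresses the contribution of specific heads to improve downstream performance. Experiments show that AAT improves tasks such as cross-modal retrieval with virtually no increase in inference cost.

Details Motivation: CLIP performs well across many applications, but certain attention heads may negatively affect the final representation. Ablating these detrimental heads can further improve performance.

Contribution: Proposes AAT, which identifies and ablates detrimental attention heads to improve the quality of CLIP's image representations.

Method: AAT suppresses the contribution of specific heads by manipulating attention weights, with two alternative strategies tailored to different application scenarios.

Result: AAT improves performance on tasks such as cross-modal retrieval (recall improves by up to 11.1%) with virtually no increase in inference cost.

Insight: Large vision-language models can be refined through targeted ablation of attention heads with virtually no added computational cost.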

Abstract: This paper studies the role of attention heads in CLIP’s image encoder. While CLIP has exhibited robust performance across diverse applications, we hypothesize that certain attention heads negatively affect final representations and that ablating them can improve performance in downstream tasks. To capitalize on this insight, we propose a simple yet effective method, called Attention Ablation Technique (AAT), to suppress the contribution of specific heads by manipulating attention weights. By integrating two alternative strategies tailored for different application scenarios, AAT systematically identifies and ablates detrimental attention heads to enhance representation quality. Experiments demonstrate that AAT consistently improves downstream task performance across various domains, boosting recall rate by up to 11.1% on CLIP-family models for cross-modal retrieval. The results highlight the potential of AAT to effectively refine large-scale vision-language models with virtually no increase in inference cost.

[62] LOD-GS: Level-of-Detail-Sensitive 3D Gaussian Splatting for Detail Conserved Anti-Aliasing

Zhenya Yang,Bingchen Gong,Kai Chen,Qi Dou

Main category: cs.CV

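TL;DR: LOD-GS is a detail-preserving anti-aliasing method for 3D Gaussian Splatting that dynamically predicts the filtering strength for each 3D Gaussian, addressing the under-filtering or over-smoothing caused by existing methods' insensitivity to the sampling rate.

Details Motivation: Existing 3D Gaussian Splatting methods rely on low-pass filtering for anti-aliasing but are insensitive to the sampling rate, which leads to under-filtered or over-smoothed renderings. A sampling-rate-sensitive, dynamic filtering method is needed.

Contribution: 1. Proposes the LOD-GS framework, which dynamically predicts filtering strength; 2. introduces basis functions that model appearance variations and are optimized jointly with the Gaussian parameters; 3. releases a new synthetic dataset for more comprehensive anti-aliasing evaluation.

Method: LOD-GS attaches a set of basis functions to each Gaussian; they take the sampling rate as input to adjust filtering strength and are trained end to end jointly with the 3D Gaussian parameters. The effects of both focal length and camera distance are considered.

Result: Experiments show that LOD-GS achieves SOTA rendering quality on public datasets and the new dataset while effectively eliminating aliasing.

Insight: 1. Sampling-rate-sensitive filtering is essential for high-quality rendering; 2. the effect of camera distance is often overlooked by existing methods, so more comprehensive evaluation is needed; 3. dynamic filtering optimization is an important direction for future anti-aliasing research.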

Abstract: Despite the advancements in quality and efficiency achieved by 3D Gaussian Splatting (3DGS) in 3D scene rendering, aliasing artifacts remain a persistent challenge. Existing approaches primarily rely on low-pass filtering to mitigate aliasing. However, these methods are not sensitive to the sampling rate, often resulting in under-filtering and over-smoothing renderings. To address this limitation, we propose LOD-GS, a Level-of-Detail-sensitive filtering framework for Gaussian Splatting, which dynamically predicts the optimal filtering strength for each 3D Gaussian primitive. Specifically, we introduce a set of basis functions to each Gaussian, which take the sampling rate as input to model appearance variations, enabling sampling-rate-sensitive filtering. These basis function parameters are jointly optimized with the 3D Gaussian in an end-to-end manner. The sampling rate is influenced by both focal length and camera distance. However, existing methods and datasets rely solely on down-sampling to simulate focal length changes for anti-aliasing evaluation, overlooking the impact of camera distance. To enable a more comprehensive assessment, we introduce a new synthetic dataset featuring objects rendered at varying camera distances. Extensive experiments on both public datasets and our newly collected dataset demonstrate that our method achieves SOTA rendering quality while effectively eliminating aliasing. The code and dataset have been open-sourced.

[63] Out-of-distribution detection in 3D applications: a review

Zizhao Li,Xueyang Kang,Joseph West,Kourosh Khoshelham

Main category: cs.CV

Abstract: The ability to detect objects that are not prevalent in the training set is a critical capability in many 3D applications, including autonomous driving. Machine learning methods for object recognition often assume that all object categories encountered during inference belong to a closed set of classes present in the training data. This assumption limits generalization to the real world, as objects not seen during training may be misclassified or entirely ignored. As part of reliable AI, OOD detection identifies inputs that deviate significantly from the training distribution. This paper provides a comprehensive overview of OOD detection within the broader scope of trustworthy and uncertain AI. We begin with key use cases across diverse domains, introduce benchmark datasets spanning multiple modalities, and discuss evaluation metrics. Next, we present a comparative analysis of OOD detection methodologies, exploring model structures, uncertainty indicators, and distributional distance taxonomies, alongside uncertainty calibration techniques. Finally, we highlight promising research directions, including adversarially robust OOD detection and failure identification, particularly relevant to 3D applications. The paper offers both theoretical and practical insights into OOD detection, showcasing emerging research opportunities such as 3D vision integration. These insights help new researchers navigate the field more effectively, contributing to the development of reliable, safe, and robust AI systems.

[64] AI-Generated Video Detection via Perceptual Straightening

Christian Internò,Robert Geirhos,Markus Olhofer,Sunny Liu,Barbara Hammer,David Klindt

Main category: cs.CV

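TL;DR: Proposes ReStraV, a method that distinguishes natural from AI-generated videos by analyzing temporal curvature and stepwise distance in a neural representation domain. It builds on the "perceptual straightening" hypothesis and substantially outperforms existing detection methods.

Details Motivation: Videos produced by generative AI are increasingly realistic, which makes content authentication harder and raises the risk of misuse. Existing detection methods fall short in generalization and in capturing temporal inconsistencies.

Contribution: 1. Proposes ReStraV, an AI-generated video detection method based on the "perceptual straightening" hypothesis. 2. Quantifies differences in temporal curvature and stepwise distance of videos in the neural representation domain. 3. Achieves 97.17% accuracy and 98.63% AUROC on the VidProM benchmark, substantially outperforming existing methods.

Method: 1. Extracts video representations with a pre-trained self-supervised vision transformer (DINOv2). 2. Measures temporal curvature and stepwise distance in the representation domain. 3. Aggregates these statistics per video and trains a lightweight classifier.

Result: ReStraV performs strongly on the VidProM benchmark (97.17% accuracy, 98.63% AUROC), surpassing existing image- and video-based detection methods.

Insight: The geometry of neural representations can effectively separate AI-generated from real videos, offering a new angle for video detection research.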

Abstract: The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV (Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the “perceptual straightening” hypothesis, which suggests that real-world video trajectories become straighter in the neural representation domain, we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model’s representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
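The two quantities named in the abstract, stepwise distance and temporal curvature, can be sketched for a sequence of per-frame embeddings. This is an illustrative sketch only: it assumes curvature is the angle between consecutive displacement vectors, a common convention in perceptual-straightening work, and the function name and toy data are hypothetical. The paper's exact formulation and its DINOv2 feature extraction are not reproduced here.

```python
import math


def trajectory_stats(embeddings):
    """Return (stepwise distances, curvature angles in degrees) for a trajectory.

    `embeddings` is a list of equal-length vectors, one per frame.
    Distance t is the norm of z[t+1] - z[t]; curvature t is the angle
    between consecutive displacement vectors (assumed definition).
    """
    steps = [[b - a for a, b in zip(z0, z1)]
             for z0, z1 in zip(embeddings, embeddings[1:])]
    distances = [math.sqrt(sum(x * x for x in s)) for s in steps]
    curvatures = []
    for s0, s1, n0, n1 in zip(steps, steps[1:], distances, distances[1:]):
        if n0 == 0 or n1 == 0:
            continue  # the angle is undefined when a step has zero length
        cos_angle = sum(x * y for x, y in zip(s0, s1)) / (n0 * n1)
        cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
        curvatures.append(math.degrees(math.acos(cos_angle)))
    return distances, curvatures


# A perfectly straight trajectory has zero curvature at every step.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
straight_dist, straight_curv = trajectory_stats(straight)

# A trajectory with one right-angle turn.
turning = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
turning_dist, turning_curv = trajectory_stats(turning)
```

Per-video summaries of these two lists (for example, their mean and variance) would then serve as input features for a lightweight classifier, in the spirit of the aggregation step the abstract describes.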

[65] Context-Aware Academic Emotion Dataset and Benchmark

Luming Zhao,Jingwen Xuan,Jiamin Lou,Yonghui Yu,Wenwu Yang

Main category: cs.CV

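TL;DR: The paper introduces the RAER dataset and the CLIP-CAER method, filling a data gap in academic emotion recognition and using context information to improve recognition.

Details Motivation: Academic emotion recognition is important for assessing students' learning states, but datasets are scarce and existing methods mainly target basic emotions, which falls short of practical needs.

Contribution: 1) Releases RAER, a dataset covering diverse natural learning scenarios; 2) proposes CLIP-CAER, which combines facial expression and context cues through learnable text prompts.

Method: Built on CLIP, the method uses learnable text prompts to integrate facial expression and context cues from videos for academic emotion recognition.

Result: CLIP-CAER substantially outperforms existing video-based facial expression recognition methods on RAER, confirming the importance of context.

Insight: Context is crucial for academic emotion recognition, and combining it with facial expression cues substantially improves recognition.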

Abstract: Academic emotion analysis plays a crucial role in evaluating students’ engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. Each clip was annotated independently by approximately ten annotators using two distinct sets of academic emotion labels with varying granularity, enhancing annotation consistency and reliability. To our knowledge, RAER is the first dataset capturing diverse natural learning scenarios. Observing that annotators naturally consider context cues-such as whether a student is looking at a phone or reading a book-alongside facial expressions, we propose CLIP-CAER (CLIP-based Context-aware Academic Emotion Recognition). Our method utilizes learnable text prompts within the vision-language model CLIP to effectively integrate facial expression and context cues from videos. Experimental results demonstrate that CLIP-CAER substantially outperforms state-of-the-art video-based facial expression recognition methods, which are primarily designed for basic emotions, emphasizing the crucial role of context in accurately recognizing academic emotions. Project page: https://zgsfer.github.io/CAER

[66] Overtake Detection in Trucks Using CAN Bus Signals: A Comparative Study of Machine Learning Methods

Fernando Alonso-Fernandez,Talha Hanif Butt,Prayag Tiwari

Main category: cs.CV

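TL;DR: The paper compares three machine learning methods (ANN, RF and SVM) for overtake detection in trucks and finds that training on data from multiple vehicles and score-level fusion improve classification performance, ultimately achieving high true negative and true positive rates.

Details Motivation: Safe overtaking by trucks is vital for reducing accidents and improving traffic efficiency, and accurate overtake detection supports decision-making in ADAS. The study uses real CAN bus data to examine how different methods perform under varied real-world traffic conditions.

Contribution: Proposes a multi-vehicle training strategy and a score-level fusion approach that improve overtake detection performance. The results are relevant to ADAS deployed in practice.

Method: Evaluates ANN, RF and SVM classifiers under different preprocessing configurations, and applies score-level fusion to improve per-truck classification.

Result: With fusion, the study reaches TNR=93% and TPR=86.5%, demonstrating the benefit of multi-vehicle data and fusion.

Insight: The diversity of training data and the number of vehicles strongly affect generalization; score-level fusion can compensate for limited per-vehicle data.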

Abstract: Safe overtaking manoeuvres in trucks are vital for preventing accidents and ensuring efficient traffic flow. Accurate prediction of such manoeuvres is essential for Advanced Driver Assistance Systems (ADAS) to make timely and informed decisions. In this study, we focus on overtake detection using Controller Area Network (CAN) bus data collected from five in-service trucks provided by the Volvo Group. We evaluate three common classifiers for vehicle manoeuvre detection, Artificial Neural Networks (ANN), Random Forest (RF), and Support Vector Machines (SVM), and analyse how different preprocessing configurations affect performance. We find that variability in traffic conditions strongly influences the signal patterns, particularly in the no-overtake class, affecting classification performance if training data lacks adequate diversity. Since the data were collected under unconstrained, real-world conditions, class diversity cannot be guaranteed a priori. However, training with data from multiple vehicles improves generalisation and reduces condition-specific bias. Our per-truck analysis also reveals that classification accuracy, especially for overtakes, depends on the amount of training data per vehicle. To address this, we apply a score-level fusion strategy, which yields the best per-truck performance across most cases. Overall, with fusion we achieve TNR=93% (True Negative Rate) and TPR=86.5% (True Positive Rate). This research has been part of the BIG FUN project, which explores how Artificial Intelligence can be applied to logged vehicle data to understand and predict driver behaviour, particularly in relation to Camera Monitor Systems (CMS), being introduced as digital replacements for traditional exterior mirrors.
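Score-level fusion and the two reported metrics can be illustrated with a small sketch. The abstract does not state which fusion rule the study uses, so the weighted mean below is an assumption, and the function names, the 0.5 threshold and the toy scores are all hypothetical.

```python
def fuse_scores(score_lists, weights=None):
    """Combine per-classifier scores into one score per sample (weighted mean).

    `score_lists` holds one list of scores per classifier, all the same length.
    Equal weights are used when `weights` is omitted (assumed fusion rule).
    """
    n = len(score_lists)
    weights = weights or [1.0 / n] * n
    return [sum(w * s for w, s in zip(weights, sample))
            for sample in zip(*score_lists)]


def tpr_tnr(scores, labels, threshold=0.5):
    """True positive rate and true negative rate at a given score threshold."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if label == 1:
            tp += predicted
            fn += 1 - predicted
        else:
            fp += predicted
            tn += 1 - predicted
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr


# Toy scores from three classifiers on four samples (1 = overtake, 0 = no overtake).
ann_scores = [0.9, 0.4, 0.2, 0.6]
rf_scores = [0.8, 0.7, 0.1, 0.3]
svm_scores = [0.7, 0.6, 0.3, 0.2]
toy_labels = [1, 1, 0, 0]

fused = fuse_scores([ann_scores, rf_scores, svm_scores])
fused_rates = tpr_tnr(fused, toy_labels)
ann_rates = tpr_tnr(ann_scores, toy_labels)
```

In this toy case the ANN alone misclassifies two samples, while the fused score corrects both because the other two classifiers disagree with it. Real data will not be this tidy.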

[67] World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model

Yupeng Zheng,Pengxuan Yang,Zebin Xing,Qichao Zhang,Yuhang Zheng,Yinfeng Gao,Pengfei Li,Teng Zhang,Zhongpu Xia,Peng Jia,Dongbin Zhao

Main category: cs.CV

Abstract: End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, a 46.7% lower collision rate, and 3.75x faster training convergence. Code will be available at https://github.com/ucaszyp/World4Drive.

[68] De-Simplifying Pseudo Labels to Enhancing Domain Adaptive Object Detection

Zehua Fu,Chenguang Liu,Yuyu Chen,Jiaqi Zhou,Qingjie Liu,Yunhong Wang

Main category: cs.CV

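TL;DR: The paper proposes DeSimPL, which addresses the excessive proportion of simple samples in self-labeling detectors (the simple-label bias) and thereby improves unsupervised domain adaptive object detection.

Details Motivation: Unsupervised domain adaptation (UDA) for object detection is motivated by the high cost of labeled data. Self-labeling methods are simple and efficient but underperform domain alignment methods. The study aims to resolve the simple-label bias in self-labeling methods.

Contribution: Proposes DeSimPL, which updates pseudo labels through an instance-level memory bank, introduces adversarial samples to increase sample diversity, and adds an adaptive weighted loss to reduce the impact of false-positive pseudo labels late in training.

Method: Uses an instance-level memory bank to implement the pseudo-label updating strategy, introduces adversarial samples to adjust the sample proportion, and designs an adaptive weighted loss to optimize training.

Result: Experiments show that DeSimPL significantly reduces the proportion of simple samples and improves the performance of self-labeling detectors, as validated on four benchmarks.

Insight: The simple-label bias is the main factor limiting self-labeling methods; diversifying and dynamically adjusting the pseudo-label strategy can effectively improve model performance.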

Abstract: Despite its significant success, object detection in traffic and transportation scenarios requires time-consuming and laborious efforts in acquiring high-quality labeled data. Therefore, Unsupervised Domain Adaptation (UDA) for object detection has recently gained increasing research attention. UDA for object detection has been dominated by domain alignment methods, which achieve top performance. Recently, self-labeling methods have gained popularity due to their simplicity and efficiency. In this paper, we investigate the limitations that prevent self-labeling detectors from achieving commensurate performance with domain alignment methods. Specifically, we identify the high proportion of simple samples during training, i.e., the simple-label bias, as the central cause. We propose a novel approach called De-Simplifying Pseudo Labels (DeSimPL) to mitigate the issue. DeSimPL utilizes an instance-level memory bank to implement an innovative pseudo label updating strategy. Then, adversarial samples are introduced during training to enhance the proportion. Furthermore, we propose an adaptive weighted loss to avoid the model suffering from an abundance of false positive pseudo labels in the late training period. Experimental results demonstrate that DeSimPL effectively reduces the proportion of simple samples during training, leading to a significant performance improvement for self-labeling detectors. Extensive experiments conducted on four benchmarks validate our analysis and conclusions.

[69] UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions

Siyuan Yao,Rui Zhu,Ziqi Wang,Wenqi Ren,Yanyang Yan,Xiaochun Cao

Main category: cs.CV

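TL;DR: UMDATrack proposes a unified domain adaptation framework that maintains high-quality object tracking under adverse weather conditions. Using a controllable scenario generator and a domain-customized adapter, it substantially improves tracker adaptability and robustness.

Details Motivation: Existing trackers degrade significantly under adverse weather because domain shift invalidates the target representation. UMDATrack aims to solve this problem and improve adaptation across multiple weather conditions.

Contribution: 1. A controllable scenario generator that synthesizes a small amount of unlabeled video under multiple weather conditions. 2. A domain-customized adapter (DCA) that adapts quickly to different weather conditions. 3. A target-aware confidence alignment module (TCA) that improves localization consistency between the source and target domains.

Method: 1. Generate videos under multiple weather conditions with the controllable scenario generator. 2. Rapidly adapt the target representation through DCA. 3. Align cross-domain localization confidence with the TCA module, based on optimal transport theory.

Result: Experiments show that UMDATrack significantly surpasses existing advanced trackers and sets a new state of the art.

Insight: Small-scale synthetic data combined with a lightweight adapter design enables efficient cross-domain adaptation and improves tracker robustness under adverse conditions.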

Abstract: Visual object tracking has made promising progress over the past decades. Most existing approaches focus on learning target representations from well-conditioned daytime data, while in unconstrained real-world scenarios with adverse weather conditions, e.g., nighttime or foggy environments, the large domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled video (fewer than 2% of the frames in the source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects’ representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) based on optimal transport theory. Extensive experiments demonstrate that UMDATrack surpasses existing advanced visual trackers and sets a new state of the art by a significant margin. Our code is available at https://github.com/Z-Z188/UMDATrack.

[70] LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment

Juelin Zhu,Shuaibang Peng,Long Wang,Hanlin Tan,Yu Liu,Maojun Zhang,Shen Yan

Main category: cs.CV

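TL;DR: LoD-Loc v2 proposes an aerial visual localization method for drones based on explicit silhouette alignment. It is the first to support low level-of-detail (LoD1) city models and achieves high-accuracy localization through a coarse-to-fine strategy.

Details Motivation: Existing localization methods rely mainly on high level-of-detail (LoD3 or LoD2) city models, but most models available in practice are low level-of-detail (LoD1). Supporting LoD1 models could unlock the potential of drones for global urban localization.

Contribution: 1. Proposes LoD-Loc v2, the first method to localize over low-LoD models; 2. uses explicit silhouette alignment and a coarse-to-fine strategy; 3. releases two new datasets; 4. outperforms the current best methods in accuracy and robustness.

Method: 1. Extract silhouettes with a building segmentation network; 2. in the coarse pose selection stage, build a cost volume and sample pose hypotheses; 3. in the fine pose estimation stage, apply particle filtering with multi-beam tracking; 4. support low-LoD models.

Result: Experiments show that LoD-Loc v2 performs well on both high- and low-LoD models, surpasses existing baselines, and broadens the convergence basin to accommodate larger initial errors.

Insight: Combining explicit silhouette alignment with a coarse-to-fine strategy is the key to localization on low-LoD models, and the released datasets should advance related research.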

Abstract: We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. The previous wireframe-alignment-based method, LoD-Loc, has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, whereas the majority of available models, and those that many countries plan to construct nationwide, are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones’ potential for global urban localization. To address these issues, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to shape building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost of the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with maximum value as the coarse pose. In the fine pose estimation stage, a particle filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimation. To further facilitate research in this field, we release two datasets with LoD1 city models covering 10.7 km², along with real RGB queries and ground-truth pose annotations. Experimental results show that LoD-Loc v2 improves estimation accuracy with high-LoD models and enables localization with low-LoD models for the first time. Moreover, it outperforms state-of-the-art baselines by large margins, even surpassing texture-model-based methods, and broadens the convergence basin to accommodate larger prior errors.
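
The coarse pose selection stage described in this abstract can be sketched in a heavily simplified form. The toy below assumes a pose is just a 2D pixel translation, a silhouette is a set of pixel coordinates, and alignment is scored by intersection-over-union; the actual method works with full camera poses and projected city models, and its alignment measure is not specified in the abstract. All function names are hypothetical.

```python
from itertools import product


def iou(a, b):
    """Intersection-over-union of two pixel sets (assumed alignment score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def project(model_pixels, pose):
    """Toy 'projection': shift the model silhouette by a 2D translation pose."""
    dx, dy = pose
    return {(x + dx, y + dy) for x, y in model_pixels}


def coarse_pose(model_pixels, predicted, prior, radius=2):
    """Uniformly sample pose hypotheses around the prior and keep the best-aligned one."""
    offsets = range(-radius, radius + 1)
    hypotheses = [(prior[0] + ox, prior[1] + oy) for ox, oy in product(offsets, offsets)]
    # The 'cost volume': one alignment score per sampled hypothesis.
    costs = {pose: iou(project(model_pixels, pose), predicted) for pose in hypotheses}
    best = max(costs, key=costs.get)
    return best, costs


# A 4x3 rectangular 'building' and the silhouette predicted at the unknown true pose.
model = {(x, y) for x in range(4) for y in range(3)}
true_pose = (3, 1)
predicted = project(model, true_pose)

best, costs = coarse_pose(model, predicted, prior=(2, 2), radius=2)
print("coarse pose:", best, "score:", costs[best])
```

A fine stage would then refine around `best`; the abstract describes particle filtering with multi-beam tracking for that step, which this sketch does not attempt.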

[71] A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation

Edward Effendy,Kuan-Wei Tseng,Rei Kawakami

Main category: cs.CV

Abstract: Accepted at ICIP 2025. We present a novel transformer-based framework for whole-body grasping that addresses both pose generation and motion infilling, enabling realistic and stable object interactions. Our pipeline comprises three stages: Grasp Pose Generation for full-body grasp generation, Temporal Infilling for smooth motion continuity, and a LiftUp Transformer that refines downsampled joints back to high-resolution markers. To overcome the scarcity of hand-object interaction data, we introduce a data-efficient Generalized Pretraining stage on large, diverse motion datasets, yielding robust spatio-temporal representations transferable to grasping tasks. Experiments on the GRAB dataset show that our method outperforms state-of-the-art baselines in terms of coherence, stability, and visual realism. The modular design also supports easy adaptation to other human-motion applications.

[72] ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

Zifu Wan,Ce Zhang,Silong Yong,Martin Q. Ma,Simon Stepputtis,Louis-Philippe Morency,Deva Ramanan,Katia Sycara,Yaqi Xie

Main category: cs.CV

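TL;DR: ONLY is a training-free decoding method that effectively reduces hallucination in large vision-language models (LVLMs) with a single query and a one-layer intervention, making it suitable for real-time applications.

Details Motivation: Existing contrastive decoding methods require multiple queries, which hurts real-time performance. ONLY aims to mitigate hallucination efficiently with a single query and a one-layer intervention.

Contribution: Proposes ONLY, the first method to significantly reduce LVLM hallucination with only a single query and a one-layer intervention while remaining efficient and lightweight.

Method: Uses a text-to-visual entropy ratio to selectively amplify crucial textual information, requiring only one intervention during decoding.

Result: Outperforms existing methods on multiple benchmarks, with low computational cost and a simple implementation.

Insight: A localized intervention that optimizes textual information can improve the reliability of model outputs without adding computational burden.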

Abstract: Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.

[73] Rectifying Magnitude Neglect in Linear Attention

Qihang Fan,Huaibo Huang,Yuang Ai,Ran He

Main category: cs.CV

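TL;DR: The paper analyzes why Linear Attention lags behind standard Softmax Attention and proposes an improved variant, Magnitude-Aware Linear Attention (MALA), which incorporates the Query's magnitude information and significantly improves the performance of Linear Attention.

Details Motivation: Linear Attention has low computational complexity and suits vision tasks, but it performs significantly worse than standard Softmax Attention. The paper aims to analyze the cause and propose a remedy.

Contribution: 1. Identifies the core problem that Linear Attention ignores the Query's magnitude information; 2. proposes MALA, which resolves the problem by adjusting the computation; 3. validates MALA on multiple tasks.

Method: Proposes Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude so that it produces an attention distribution similar to that of Softmax Attention.

Result: MALA achieves strong results on image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation.

Insight: The Query's magnitude has an important effect on the attention distribution, and Linear Attention can improve significantly by incorporating it.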

Abstract: As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query’s magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at https://github.com/qhfan/MALA
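
The observation that Linear Attention disregards the Query's magnitude can be checked numerically with a small sketch. It assumes a ReLU feature map for Linear Attention, which is a common choice but not stated in the abstract; it does not implement MALA itself, and the function names are hypothetical. With any positively homogeneous feature map, scaling the Query by a positive constant cancels in the normalization, whereas Softmax Attention becomes sharper.

```python
import math


def softmax_scores(q, keys):
    """Softmax attention weights of one query over a list of keys."""
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu(v):
    return [max(0.0, x) for x in v]


def linear_scores(q, keys):
    """Normalized linear-attention weights with an assumed ReLU feature map."""
    fq = relu(q)
    sims = [sum(a * b for a, b in zip(fq, relu(k))) for k in keys]
    total = sum(sims)  # assumed non-zero in this toy example
    return [s / total for s in sims]


q = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_scaled = [4 * x for x in q]  # same direction, four times the magnitude

print("linear :", linear_scores(q, keys), "->", linear_scores(q_scaled, keys))
print("softmax:", softmax_scores(q, keys), "->", softmax_scores(q_scaled, keys))
```

The linear weights are identical before and after scaling, while the softmax weights concentrate on the best-matching key as the Query grows.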

[74] TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving

Yiming Yang,Yueru Luo,Bingkun He,Hongbin Lin,Suzhong Fu,Chao Yan,Kun Tang,Xinrui Yan,Chao Zheng,Shuguang Cui,Zhen Li

Main category: cs.CV

Abstract: Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.4% mAP in lane segment perception and +2.1% OLS in centerline perception tasks.

[75] UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Xiao Zhang,Fei Wei,Yong Wang,Wenda Zhao,Feiyi Li,Xiangxiang Chu

Main category: cs.CV

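TL;DR: The paper proposes the UPRE framework, which jointly optimizes textual prompts and visual representations to resolve the misalignment between the detection task and vision-language models (VLMs) in zero-shot domain adaptation (ZSDA).

Details Motivation: Existing methods focus mainly on domain distribution shift and overlook the misalignment between the detection task and VLMs that rely on manually designed prompts.

Contribution: Proposes the UPRE framework, comprising a multi-view domain prompt and a visual representation enhancement module, and introduces multi-level enhancement strategies.

Method: Jointly optimizes textual prompts and visual representations, combines linguistic domain priors with detection-specific knowledge, and generates domain style variations.

Result: Experiments on nine benchmark datasets show that UPRE performs strongly in ZSDA detection scenarios.

Insight: Multi-modal alignment and multi-level enhancement strategies can effectively improve performance on zero-shot domain adaptation tasks.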

Abstract: Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at https://github.com/AMAP-ML/UPRE.

[76] Holmes: Towards Effective and Harmless Model Ownership Verification to Personalized Large Vision Models via Decoupling Common Features

Linghui Zhu,Yiming Li,Haiqin Weng,Yan Liu,Tianwei Zhang,Shu-Tao Xia,Zhi Wang

Main category: cs.CV

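TL;DR: The paper proposes Holmes, a harmless model ownership verification method that protects personalized large vision models against model stealing attacks by decoupling common features.

Details Motivation: Existing defenses mainly target models trained from scratch. They defend poorly against stealing attacks on fine-tuned models and may even introduce additional security risks or misjudgments. A harmless verification method tailored to personalized models is therefore needed.

Contribution: Proposes a staged model ownership verification method that creates shadow models, trains a meta-classifier, and applies hypothesis testing to detect stealing effectively while avoiding additional risks.

Method: The method has three stages: 1) create shadow models that retain common features but disrupt dataset-specific features; 2) train a meta-classifier to detect dataset-specific features in suspicious models; 3) strengthen robustness through hypothesis testing.

Result: Experiments on benchmark datasets show that the method effectively detects multiple types of model stealing simultaneously.

Insight: Decoupling common features from dataset-specific features makes it possible to identify stolen models more accurately without interfering with normal model use.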

Abstract: Large vision models achieve remarkable performance in various downstream tasks, primarily by personalizing pre-trained models through fine-tuning with private and valuable local data, which makes the personalized model a valuable intellectual property for its owner. Similar to the era of traditional DNNs, model stealing attacks also pose significant risks to these personalized models. However, in this paper, we reveal that most existing defense methods (developed for traditional DNNs), typically designed for models trained from scratch, either introduce additional security risks, are prone to misjudgment, or are even ineffective for fine-tuned models. To alleviate these problems, this paper proposes a harmless model ownership verification method for personalized models by decoupling similar common features. In general, our method consists of three main stages. In the first stage, we create shadow models that retain common features of the victim model while disrupting dataset-specific features. We represent the dataset-specific features of the victim model by the output differences between the shadow and victim models. After that, a meta-classifier is trained to identify stolen models by determining whether suspicious models contain the dataset-specific features of the victim. In the third stage, we conduct model ownership verification by hypothesis test to mitigate randomness and enhance robustness. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method in detecting different types of model stealing simultaneously.

[77] Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

Bob Zhang,Haoran Li,Tao Zhang,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Yanbin Hao

Main category: cs.CV

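TL;DR: The paper proposes a reinforcement learning (RL) based post-training strategy to improve the reasoning performance of multimodal large language models (MLLMs) on multi-image grounding tasks. It synthesizes high-quality chain-of-thought (CoT) data for cold-start initialization and combines supervised fine-tuning (SFT) with rule-based RL, significantly improving performance in multi-image scenarios.

Details Motivation: Existing MLLMs perform well in single-image scenarios but poorly in real-world applications involving multi-image compositions and multimodal instructions, which exposes limitations in cross-image reasoning and generalization.

Contribution: The main contribution is an RL-driven post-training strategy that significantly improves the reasoning ability of MLLMs in multi-image grounding tasks, with experiments validating its effectiveness.

Method: The method comprises: (1) synthesizing high-quality CoT data for cold-start initialization; (2) supervised fine-tuning with LoRA; (3) optimizing reasoning paths through rejection sampling and rule-based RL.

Result: Experiments show gains of 9.04% on MIG-Bench and 4.98% on several out-of-domain benchmarks, along with gains of 3.1% and 2.4% on subsets of BLINK and MMIU, respectively.

Insight: The study shows the potential of reinforcement learning for improving cross-image reasoning in multimodal models, and the synthesized CoT data offers an effective data augmentation approach for similar tasks.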

Abstract: Recently, Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, achieving +9.04% improvements on MIG-Bench and +4.98% improvements on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on subsets of the BLINK and MMIU benchmarks, respectively.

[78] Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Hao Xing,Kai Zhe Boey,Yuankai Wu,Darius Burschka,Gordon Cheng

Main category: cs.CV

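TL;DR: The paper proposes a multi-modal graph convolutional network (MMGCN) that combines low-frame-rate visual data with high-frame-rate motion data and uses sinusoidal encoding and a temporal graph fusion module to reduce over-segmentation in action segmentation. The SmoothLabelMix data augmentation technique improves the temporal consistency of predictions. Experiments show it outperforms existing methods.

Details Motivation: In collaborative robot settings, accurate temporal segmentation of actions is essential for understanding sub-activity labels and their temporal structure. However, noise in human pose estimation and object detection often causes over-segmentation errors that disrupt the coherence of action sequences.

Contribution: 1. A sinusoidal encoding strategy that makes spatial representations more robust; 2. a temporal graph fusion module that aligns multi-modal inputs of differing resolutions; 3. the SmoothLabelMix data augmentation technique, which improves temporal consistency.

Method: Uses a multi-modal graph convolutional network (MMGCN) that combines low-frame-rate visual data with high-frame-rate motion data, maps 3D skeleton coordinates through sinusoidal encoding, and aggregates multi-modal features with a temporal graph fusion module. SmoothLabelMix generates synthetic training data.

Result: On the Bimanual Actions Dataset, the method reaches F1@10 of 94.5% and F1@25 of 92.8%, outperforming existing methods.

Insight: Sinusoidal encoding and the temporal graph fusion module effectively mitigate resolution differences across modalities, while SmoothLabelMix improves temporally consistent prediction by simulating smooth action transitions.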

Abstract: Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.
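
The idea of mapping 3D skeleton coordinates into a sin-cos space can be sketched as follows. The frequency schedule (powers of two times pi) and the number of frequencies are assumptions borrowed from common positional-encoding practice, not details given in the abstract, and `sincos_encode` is a hypothetical name.

```python
import math


def sincos_encode(coord, num_freqs=4):
    """Encode each coordinate as [sin(2^k * pi * v), cos(2^k * pi * v)] for k < num_freqs.

    Assumption: power-of-two frequencies; the paper's exact encoding is not specified here.
    """
    features = []
    for value in coord:
        for k in range(num_freqs):
            angle = (2 ** k) * math.pi * value
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


# One 3D joint position (x, y, z), assumed to be normalized beforehand.
joint = (0.25, -0.5, 1.0)
enc = sincos_encode(joint)
print(len(enc), "features, all bounded in [-1, 1]")
```

Because every output lies in [-1, 1], a noisy coordinate produces a bounded change in the encoded features, which is one plausible reading of the robustness claim.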

[79] Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs

Selim Kuzucu,Muhammad Ferjad Naeem,Anna Kukleva,Federico Tombari,Bernt Schiele

Main category: cs.CV

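TL;DR: LUViT combines Vision Transformers with large language models (LLMs) and proposes a synergistic pre-training strategy that resolves the modality mismatch between them, significantly improving performance on vision tasks.

Details Motivation: Existing approaches that directly fuse LLMs with Vision Transformers (ViTs) are limited by modality mismatch and unstable training. LUViT aims to solve this through a synergistic pre-training strategy and fully exploit the potential of LLMs for vision tasks.

Contribution: Proposes Language-Unlocked Vision Transformers (LUViT), which bridge the modality mismatch between ViTs and LLMs through a synergistic pre-training strategy and significantly improve performance on vision tasks.

Method: Pre-trains the ViT with Masked Auto-Encoding (MAE) to strengthen visual representations, while jointly optimizing Low-Rank Adaptation (LoRA) layers in the LLM block with the MAE objective to achieve modality alignment.

Result: Experiments show that LUViT significantly improves performance on a range of downstream vision tasks, demonstrating its effectiveness in using LLM knowledge for visual understanding.

Insight: Through synergistic pre-training, LUViT resolves the modality mismatch in fusing ViTs and LLMs and offers a new way to use language model knowledge in vision tasks.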

Abstract: The integration of Large Language Model (LLM) blocks with Vision Transformers (ViTs) holds immense promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between text-centric pretraining of LLMs and vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM’s potential and suffers from unstable finetuning. As a result, LLM blocks are kept frozen while only the vision components are learned. As a remedy to these challenges, we introduce Language-Unlocked Vision Transformers (LUViT), a novel approach that bridges this modality mismatch through a synergistic pre-training strategy. LUViT co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block using the MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM to effectively interpret visual information. We demonstrate through extensive experiments that LUViT significantly improves performance on various downstream vision tasks, showcasing a more effective and efficient pathway to harness LLM knowledge for visual understanding.
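
The Low-Rank Adaptation (LoRA) layers mentioned in this abstract follow a standard form: a frozen weight matrix W plus a trainable low-rank update, h = W x + (alpha / r) * B A x, with B initialized to zero. The sketch below shows only that generic forward pass with toy 2x2 matrices; how LUViT places these layers inside the LLM block is not described in the abstract, and the function names are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lora_forward(W, A, B, x, alpha=1.0):
    """Generic LoRA forward pass: frozen W plus scaled low-rank update B @ A.

    A has shape (r, d_in), B has shape (d_out, r); only A and B would be trained.
    """
    r = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (toy identity)
A = [[1.0, 1.0]]               # rank r = 1 down-projection
B_init = [[0.0], [0.0]]        # zero-initialized up-projection
B_trained = [[0.5], [-0.5]]    # made-up values standing in for a trained B
x = [2.0, 3.0]

print(lora_forward(W, A, B_init, x))     # identical to W @ x at initialization
print(lora_forward(W, A, B_trained, x))  # shifted by the low-rank update
```

Zero-initializing B means the adapted layer starts out exactly equal to the frozen layer, so training begins from the pretrained behaviour.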

[80] OptiPrune: Boosting Prompt-Image Consistency with Attention-Guided Noise and Dynamic Token Selection

Ziji Lu

Main category: cs.CV

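TL;DR: OptiPrune proposes a unified framework that combines distribution-aware noise optimization with dynamic token pruning to improve the efficiency and semantic consistency of text-to-image diffusion models on resource-constrained hardware.

Details Motivation: Existing methods tend to sacrifice either computational efficiency or semantic fidelity when optimizing noise or pruning tokens. OptiPrune aims to address both problems at once.

Contribution: 1. A distribution-aware noise optimization module; 2. an efficient dynamic token pruning strategy; 3. improved consistency while preserving the Gaussian prior.

Method: 1. Guide noise optimization with attention scores; 2. prune tokens by similarity and recover them afterwards.

Result: Achieves state-of-the-art consistency on benchmarks such as Animal-Animal while significantly reducing computational cost.

Insight: Designing noise optimization and token pruning together can improve both generation quality and efficiency.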

Abstract: Text-to-image diffusion models often struggle to achieve accurate semantic alignment between generated images and text prompts while maintaining efficiency for deployment on resource-constrained hardware. Existing approaches either incur substantial computational overhead through noise optimization or compromise semantic fidelity by aggressively pruning tokens. In this work, we propose OptiPrune, a unified framework that combines distribution-aware initial noise optimization with similarity-based token pruning to address both challenges simultaneously. Specifically, (1) we introduce a distribution-aware noise optimization module guided by attention scores to steer the initial latent noise toward semantically meaningful regions, mitigating issues such as subject neglect and feature entanglement; (2) we design a hardware-efficient token pruning strategy that selects representative base tokens via patch-wise similarity, injects randomness to enhance generalization, and recovers pruned tokens using maximum similarity copying before attention operations. Our method preserves the Gaussian prior during noise optimization and enables efficient inference without sacrificing alignment quality. Experiments on benchmark datasets, including Animal-Animal, demonstrate that OptiPrune achieves state-of-the-art prompt-image consistency with significantly reduced computational cost.
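
The prune-then-recover step described in this abstract (keep a few base tokens, process only those, then restore pruned tokens by copying from the most similar base token) can be sketched in simplified form. The sketch assumes one randomly chosen base token per patch of consecutive tokens and cosine similarity for recovery; the paper's actual similarity-based base selection is replaced by this random stand-in, and all names are hypothetical.

```python
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def select_base(tokens, patch_size, rng):
    """Pick one random base-token index per patch of consecutive tokens (simplified)."""
    base = []
    for start in range(0, len(tokens), patch_size):
        stop = min(start + patch_size, len(tokens))
        base.append(rng.randrange(start, stop))
    return base


def recover(tokens, base_idx, processed):
    """Restore the full sequence: pruned tokens copy the output of their most similar base token."""
    out = []
    for i, tok in enumerate(tokens):
        if i in base_idx:
            out.append(processed[base_idx.index(i)])
        else:
            best = max(range(len(base_idx)),
                       key=lambda j: cosine(tok, tokens[base_idx[j]]))
            out.append(processed[best])
    return out


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
rng = random.Random(0)  # fixed seed so the toy run is repeatable
base_idx = select_base(tokens, patch_size=2, rng=rng)

# Stand-in for the expensive layer, applied to the base tokens only.
processed = [[2 * v for v in tokens[i]] for i in base_idx]
full = recover(tokens, base_idx, processed)
print(base_idx, full)
```

Only half of the tokens pass through the stand-in layer here, yet the recovered sequence has the original length, which is where the computational saving would come from.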

[81] LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Huaqiu Li,Yong Wang,Tongwen Huang,Hailang Huang,Haoqian Wang,Xiangxiang Chu

Main category: cs.CV

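TL;DR: The paper proposes a zero-shot unified image restoration method based on a pretrained latent diffusion model. Recurrent posterior sampling lets it generalize without paired datasets, and it outperforms existing methods.

Details Motivation: Conventional methods are either designed for specific tasks and generalize poorly, or depend on paired datasets and are confined to closed-set settings. The paper aims to address both limitations.

Contribution: Proposes LD-RPS, which uses a pretrained latent diffusion model for zero-shot unified image restoration without paired data and achieves superior performance.

Method: Uses a multimodal understanding model to provide semantic priors, a lightweight module to align the degraded input with the generative preference of the diffusion model, and recurrent posterior sampling for refinement.

Result: Experiments show the method outperforms existing approaches on unified image restoration, validating its effectiveness and robustness.

Insight: Combining a pretrained diffusion model with multimodal priors offers a new route to image restoration without paired data; the lightweight module and recurrent refinement are the key innovations.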

Abstract: Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates a multimodal understanding model to provide semantic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data will be available at https://github.com/AMAP-ML/LD-RPS.

[82] TRACE: Temporally Reliable Anatomically-Conditioned 3D CT Generation with Enhanced Efficiency

Minye Shao,Xingyu Miao,Haoran Duan,Zeyu Wang,Jingkun Chen,Yawen Huang,Xian Wu,Jingjing Deng,Yang Long,Yefeng Zheng

Main category: cs.CV

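TL;DR: TRACE proposes an efficient and reliable framework for 3D medical image generation. It achieves spatiotemporal alignment with a 2D multimodal-conditioned diffusion approach, addressing the limits of current methods in anatomical fidelity, axial length, and computational cost.

Details Motivation: Current 3D medical image generation methods suffer from low anatomical fidelity, restricted axial length, and high computational cost, which limits their use in regions with limited resources. TRACE aims to address these problems with an efficient and reliable approach.

Contribution: 1. Proposes the TRACE framework, which combines segmentation priors and radiology reports for anatomical alignment; 2. uses optical flow to maintain temporal coherence; 3. adopts an overlapping-frame strategy to generate 3D sequences of flexible length.

Method: 1. A 2D multimodal-conditioned diffusion approach; 2. modeling 2D slices as video frame pairs; 3. optical flow for temporal coherence; 4. an overlapping-frame strategy to generate the 3D sequence.

Result: Experiments show that TRACE balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency.

Insight: TRACE solves a 3D problem with a 2D method, which lowers computational cost while preserving image quality and makes it suitable for resource-limited settings.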

Abstract: 3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment, incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible length sequence, reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency. Code is available at: https://github.com/VinyehShaw/TRACE.
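
The overlapping-frame strategy at inference time can be pictured with a toy sketch: each generated frame pair shares its first frame with the last frame already produced, so pairs chain into a sequence of any requested length. Slices are represented by integers and `generate_pair` is a hypothetical stand-in for the conditioned diffusion model; the real conditioning on segmentation priors, reports, and optical flow is not modeled.

```python
def generate_pair(first_slice, step):
    """Hypothetical stand-in for the generator: returns (first frame, next frame)."""
    return first_slice, first_slice + step


def build_volume(start_slice, num_slices, step=1):
    """Chain frame pairs into a sequence by overlapping on the shared frame."""
    slices = [start_slice]
    while len(slices) < num_slices:
        first, nxt = generate_pair(slices[-1], step)
        # `first` overlaps the last stored slice, so only the new frame is appended.
        slices.append(nxt)
    return slices


volume = build_volume(start_slice=0, num_slices=5)
print(volume)
```

Because the stopping condition depends only on the requested number of slices, the same loop yields short or long volumes without retraining, which matches the flexible-length claim in the abstract.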

[83] CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs

Jiaming Zhang,Rui Hu,Qing Guo,Wei Yang Bryan Lim

Main category: cs.CV

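TL;DR: CAVALRY-V is an adversarial attack framework targeting video multimodal large language models (V-MLLMs). A dual-objective semantic-visual loss function and an efficient two-stage generator design significantly improve attack performance.

Details Motivation: Video multimodal large language models excel at cross-modal understanding and temporal reasoning, but their vulnerability to adversarial attacks remains underexplored, particularly given the challenges posed by complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints.

Contribution: Proposes the CAVALRY-V framework, which introduces a dual-objective semantic-visual loss function and an efficient two-stage generator, significantly improving adversarial attack effectiveness.

Method: 1. A dual-objective loss function that simultaneously disrupts text generation and visual representations; 2. a two-stage generator that combines large-scale pre-training with fine-tuning to ensure spatiotemporal coherence.

Result: Across multiple benchmarks, CAVALRY-V improves average performance by 22.8% and also performs strongly on image understanding tasks (34.4% average gain).

Insight: Through implicit temporal coherence modeling, CAVALRY-V offers a new direction for adversarial research on cross-modal systems.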

Abstract: Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model’s text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability with specialized fine-tuning for spatiotemporal coherence. Empirical evaluation on comprehensive video understanding benchmarks demonstrates that CAVALRY-V significantly outperforms existing attack methods, achieving 22.8% average improvement over the best baseline attacks on both commercial systems (GPT-4.1, Gemini 2.0) and open-source models (QwenVL-2.5, InternVL-2.5, Llava-Video, Aria, MiniCPM-o-2.6). Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding (34.4% average gain). This capability demonstrates CAVALRY-V’s potential as a foundational approach for adversarial research across multimodal systems.

[84] High-Frequency Semantics and Geometric Priors for End-to-End Detection Transformers in Challenging UAV Imagery

Hongxing Peng,Lide Chen,Hui Zhu,Yan Chen

Main category: cs.CV

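TL;DR: HEGS-DETR is an enhanced Detection Transformer framework for object detection in UAV imagery. A high-frequency enhanced semantics network, an efficient small object pyramid, and a selective query recollection module significantly improve detection of small and densely packed objects.

Details Motivation: Object detection in UAV imagery faces small targets, high-density distributions, and cluttered backgrounds. Existing methods depend on hand-crafted components such as anchor boxes and NMS, which generalize poorly and limit performance. HEGS-DETR aims to address these problems with an end-to-end framework.

Contribution: 1. Proposes the High-Frequency Enhanced Semantics Network (HFESNet) to preserve critical spatial details; 2. Designs the Efficient Small Object Pyramid (ESOP) strategy; 3. Introduces the Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE) modules.

Method: 1. HFESNet extracts high-frequency semantics; 2. ESOP fuses high-resolution features; 3. The SQR and GAPE modules improve decoder stability and localization accuracy.

Result: On the VisDrone dataset, AP50 and AP improve by 5.1% and 3.8% respectively over the baseline, while real-time speed is maintained and the parameter count is reduced by 4M.

Insight: High-frequency semantics and geometric priors are crucial for object detection in UAV imagery, and an end-to-end Transformer framework can effectively handle small objects and dense scenes.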

Abstract: Unmanned Aerial Vehicle-based Object Detection (UAV-OD) faces substantial challenges, including small target sizes, high-density distributions, and cluttered backgrounds in UAV imagery. Current algorithms often depend on hand-crafted components like anchor boxes, which demand fine-tuning and exhibit limited generalization, and Non-Maximum Suppression (NMS), which is threshold-sensitive and prone to misclassifying dense objects. These generic architectures thus struggle to adapt to aerial imaging characteristics, resulting in performance limitations. Moreover, emerging end-to-end frameworks have yet to effectively mitigate these aerial-specific challenges. To address these issues, we propose HEGS-DETR, a comprehensively enhanced, real-time Detection Transformer framework tailored for UAVs. First, we introduce the High-Frequency Enhanced Semantics Network (HFESNet) as a novel backbone. HFESNet preserves critical high-frequency spatial details to extract robust semantic features, thereby improving discriminative capability for small and occluded targets in complex backgrounds. Second, our Efficient Small Object Pyramid (ESOP) strategy fuses high-resolution feature maps with minimal computational overhead, significantly boosting small object detection. Finally, the proposed Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE) modules enhance the detector’s decoder stability and localization accuracy, effectively optimizing bounding boxes and providing explicit spatial priors for dense scenes. Experiments on the VisDrone dataset demonstrate that HEGS-DETR achieves a 5.1% AP$_{50}$ and 3.8% AP increase over the baseline, while maintaining real-time speed and reducing parameter count by 4M.

[85] UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection

Wei Li,Jiaman Tang,Yang Li,Beihao Xia,Ligang Tan,Hongmao Qin

Main category: cs.CV

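TL;DR: UAVD-Mamba is a multimodal UAV object detection framework based on the Mamba architecture. Its Deformable Token Mamba Block (DTMB) and multiscale feature fusion significantly improve detection under occlusion and for small or irregularly shaped objects.

Details Motivation: UAV object detection is widely used in traffic management, agriculture, and emergency rescue, but occlusions, small objects, and irregular shapes challenge existing methods. A robust and efficient multimodal detection method is therefore needed.

Contribution: 1. Proposes the UAVD-Mamba framework, which integrates the RGB and infrared modalities; 2. Designs the DTMB module, which combines features from deformable and normal convolutions; 3. Proposes multiscale feature fusion and attention mechanisms to improve detection performance.

Method: 1. The DTMB module generates deformable and normal features as inputs to the Mamba Block; 2. Separate DTMB modules are designed for the RGB and infrared modalities, with their outputs integrated into the Mamba Block for feature extraction and the Fusion Mamba Block for feature fusion; 3. Four DTMB modules are stacked to extract multiscale features, and the SPPF and C3K2 structures of YOLOv11 are modified.

Result: On the DroneVehicle dataset, the method outperforms the baseline OAFA method by 3.6% in mAP.

Insight: Combining deformable convolutions with the feature extraction capability of the Mamba architecture can effectively improve multimodal UAV detection, especially for small objects in complex scenes.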

Abstract: Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at https://github.com/GreatPlum-hnu/UAVD-Mamba.git.

[86] Robust Component Detection for Flexible Manufacturing: A Deep Learning Approach to Tray-Free Object Recognition under Variable Lighting

Fatemeh Sadat Daneshmand

Main category: cs.CV

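TL;DR: This paper presents a Mask R-CNN-based computer vision system for detecting pen components in an industrial setting without structured trays. It addresses the absence of positional constraints, extreme lighting variations, and low-cost cameras, and achieves 95% detection accuracy.

Details Motivation: Flexible manufacturing systems in Industry 4.0 need robots that can handle objects in unstructured environments, whereas traditional approaches rely on trays with fixed positions, which limits flexibility. This paper aims to remove that limitation and improve manufacturing efficiency.

Contribution: The main contributions are: 1) an object detection system that does not require structured trays; 2) an approach that is robust to extreme lighting variations; 3) validation that low-cost cameras are reliable in a real industrial environment.

Method: A Mask R-CNN-based deep learning model combines segmentation and detection to localize and recognize objects precisely. The system was tested and tuned under multiple lighting conditions.

Result: In tests under four different lighting scenarios, the system reached 95% detection accuracy and reduced setup time by 30%, significantly improving manufacturing flexibility.

Insight: The approach demonstrates the potential of deep learning in industrial settings, offering a practical solution for flexible manufacturing by combining robustness with cost-effectiveness.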

Abstract: Flexible manufacturing systems in Industry 4.0 require robots capable of handling objects in unstructured environments without rigid positioning constraints. This paper presents a computer vision system that enables industrial robots to detect and grasp pen components in arbitrary orientations without requiring structured trays, while maintaining robust performance under varying lighting conditions. We implement and evaluate a Mask R-CNN-based approach on a complete pen manufacturing line at ZHAW, addressing three critical challenges: object detection without positional constraints, robustness to extreme lighting variations, and reliable performance with cost-effective cameras. Our system achieves 95% detection accuracy across diverse lighting conditions while eliminating the need for structured component placement, demonstrating a 30% reduction in setup time and significant improvement in manufacturing flexibility. The approach is validated through extensive testing under four distinct lighting scenarios, showing practical applicability for real-world industrial deployment.

[87] Is Visual in-Context Learning for Compositional Medical Tasks within Reach?

Simon Reiß,Zdravko Marinov,Alexander Jaus,Constantin Seibold,M. Saquib Sarfraz,Erik Rodner,Rainer Stiefelhagen

Main category: cs.CV

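TL;DR: This paper explores the potential of visual in-context learning to let a single model handle multiple tasks and adapt to new ones without re-training. The authors propose a new method that trains in-context learners with a synthetic task generation engine, aimed in particular at complex medical vision tasks.

Details Motivation: Traditional approaches train a separate model for each task. This paper instead uses visual in-context learning so that a single model can handle multiple tasks and adapt flexibly at test time, particularly for complex sequences of medical tasks.

Contribution: Proposes a new training method that uses a synthetic task generation engine and masking-based training objectives so that the model can adapt to complex compositional tasks, with a particular focus on multi-modal tasks in the medical domain.

Method: Introduces a synthetic compositional task generation engine that bootstraps task sequences from arbitrary segmentation datasets, and studies different masking-based training objectives to improve the model's handling of complex compositional tasks.

Result: Experiments show that the method can effectively train models to handle complex compositional tasks, especially in the medical domain, while also revealing challenges that still need to be addressed.

Insight: Visual in-context learning shows promise for multi-modal medical tasks, but task generation and training objectives need further refinement to improve model performance.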

Abstract: In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights especially for multi-modal medical task sequences but also highlights challenges that need to be addressed.

[88] GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

Anna-Maria Halacheva,Jan-Nico Zaech,Xi Wang,Danda Pani Paudel,Luc Van Gool

Main category: cs.CV

Abstract: As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images and demonstrating strong generalization: it improves on the performance of prior 3D VLMs fivefold in out-of-domain settings.

[89] MVP: Winning Solution to SMP Challenge 2025 Video Track

Liliang Ye,Yunyao Zhang,Yafeng Wu,Yi-Ping Phoebe Chen,Junqing Yu,Wei Yang,Zikai Song

Main category: cs.CV

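TL;DR: MVP is a multimodal video popularity prediction model that combines video features, user metadata, and contextual information. It won first place in the Video Track of the SMP Challenge 2025.

Details Motivation: Predicting the popularity of social media videos is valuable for content recommendation and trend detection, but the complexity and noise of multimodal data must be dealt with.

Contribution: Proposes the MVP framework, which integrates deep video features, user metadata, and contextual information, and improves prediction performance through preprocessing and a gradient-boosted regression model.

Method: Video features are extracted with pretrained models and combined with user metadata and contextual information. Preprocessing techniques such as log-transformations and outlier removal are applied, and a gradient-boosted regression model is then trained.

Result: Ranked first in the official evaluation of the SMP Challenge 2025 Video Track, demonstrating its effectiveness and reliability.

Insight: Integrating multimodal data and applying systematic preprocessing are key to improving video popularity prediction, especially given the complexity of social media data.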

Abstract: Social media platforms serve as central hubs for content dissemination, opinion expression, and public engagement across diverse modalities. Accurately predicting the popularity of social media videos enables valuable applications in content recommendation, trend detection, and audience engagement. In this paper, we present Multimodal Video Predictor (MVP), our winning solution to the Video Track of the SMP Challenge 2025. MVP constructs expressive post representations by integrating deep video features extracted from pretrained models with user metadata and contextual information. The framework applies systematic preprocessing techniques, including log-transformations and outlier removal, to improve model robustness. A gradient-boosted regression model is trained to capture complex patterns across modalities. Our approach ranked first in the official evaluation of the Video Track, demonstrating its effectiveness and reliability for multimodal video popularity prediction on social platforms. The source code is available at https://anonymous.4open.science/r/SMPDVideo.
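The two preprocessing steps named in the abstract can be sketched with the standard library alone. This is an illustration under stated assumptions, not the MVP pipeline: the abstract does not say which log transform or outlier rule is used, so `log1p` and a 1.5×IQR rule are assumed here, the function names are hypothetical, and the view counts are made-up sample values. The gradient-boosted regressor itself is omitted.

```python
import math
import statistics
from typing import List


def log_transform(values: List[float]) -> List[float]:
    """Compress a heavy-tailed target with log(1 + x) (assumed variant)."""
    return [math.log1p(v) for v in values]


def remove_outliers_iqr(values: List[float], k: float = 1.5) -> List[float]:
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up popularity counts with one viral outlier.
view_counts = [10, 12, 15, 11, 13, 14, 1_000_000]
cleaned = remove_outliers_iqr(log_transform(view_counts))
```

Applying the log transform first keeps moderately popular posts inside the IQR fence while still dropping extreme values; the opposite order is equally plausible and would need to be checked against the released code.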

[90] RTMap: Real-Time Recursive Mapping with Change Detection and Localization

Yuheng Du,Sheng Yang,Lingxuan Wang,Zhenghua Hou,Chengying Cai,Zhitao Tan,Mingxia Chen,Shi-Sheng Huang,Qiang Li

Main category: cs.CV

Abstract: While recent online HD mapping methods relieve the burden of offline pipelines and improve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior map, and (3) real-time detection of possible road structural changes. Experiments on several public autonomous driving datasets show solid performance in both prior-aided map quality and localization accuracy, and demonstrate that RTMap robustly serves downstream prediction and planning modules while gradually and asynchronously improving the accuracy and freshness of the crowdsourced prior map. Our source code will be made publicly available at https://github.com/CN-ADLab/RTMap (Camera ready version incorporating reviewer suggestions will be updated soon).

[91] UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

Yuanrui Wang,Cong Han,Yafei Li,Zhipeng Jin,Xiawei Li,SiNan Du,Wen Tao,Yi Yang,Shuanglong Li,Chun Yuan,Liu Lin

Main category: cs.CV

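TL;DR: UniGlyph is a segmentation-guided framework for visual text generation. It uses pixel-level visual text masks as a unified conditional input, addressing the blurred glyphs, semantic drift, and limited style control of existing methods.

Details Motivation: Existing text-to-image methods render visual text with blurred glyphs, semantic drift, and limited style control. Their reliance on pre-rendered glyph images also increases model complexity and reduces flexibility.

Contribution: 1. Proposes a unified conditional input framework based on pixel-level text masks; 2. Designs a fine-tuned bilingual segmentation model and an augmented diffusion model that improve the accuracy and style preservation of generated text; 3. Introduces two new benchmarks, GlyphMM-benchmark and MiniText-benchmark.

Method: 1. A fine-tuned bilingual segmentation model extracts precise text masks; 2. A streamlined diffusion model is combined with adaptive glyph conditioning and a region-specific loss.

Result: Achieves state-of-the-art performance on the AnyText benchmark, significantly surpassing prior methods. On the new benchmarks it stands out in small text rendering and complex layout preservation.

Insight: Pixel-level text masks effectively improve conditional control in text generation while simplifying model design and offering greater flexibility.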

Abstract: Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead and reduce flexibility. To address these issues, we propose a segmentation-guided framework that uses pixel-level visual text masks – rich in glyph shape, color, and spatial detail – as unified conditional inputs. Our method introduces two core components: (1) a fine-tuned bilingual segmentation model for precise text mask extraction, and (2) a streamlined diffusion model augmented with adaptive glyph conditioning and a region-specific loss to preserve textual fidelity in both content and style. Our approach achieves state-of-the-art performance on the AnyText benchmark, significantly surpassing prior methods in both Chinese and English settings. To enable more rigorous evaluation, we also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions. Experimental results show that our model outperforms existing methods by a large margin in both scenarios, particularly excelling at small text rendering and complex layout preservation, validating its strong generalization and deployment readiness.

[92] GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong,Wenmeng Yu,Xiaotao Gu,Guo Wang,Guobing Gan,Haomiao Tang,Jiale Cheng,Ji Qi,Junhui Ji,Lihang Pan,Shuaiqi Duan,Weihan Wang,Yan Wang,Yean Cheng,Zehai He,Zhe Su,Zhen Yang,Ziyang Pan,Aohan Zeng,Baoxu Wang,Boyan Shi,Changyu Pang,Chenhui Zhang,Da Yin,Fan Yang,Guoqing Chen,Jiazheng Xu,Jiali Chen,Jing Chen,Jinhao Chen,Jinghao Lin,Jinjiang Wang,Junjie Chen,Leqi Lei,Leyi Pan,Mingzhi Zhang,Qinkai Zheng,Sheng Yang,Shi Zhong,Shiyu Huang,Shuyuan Zhao,Siyan Xue,Shangqin Tu,Shengbiao Meng,Tianshu Zhang,Tianwei Luo,Tianxiang Hao,Tianle Gong,Wenkai Li,Wei Jia,Xin Lyu,Xuancheng Huang,Yanling Wang,Yadong Xue,Yanfeng Wang,Yifan An,Yifan Du,Yiming Shi,Yiheng Huang,Yilin Niu,Yuan Wang,Yuanchang Yue,Yuchen Li,Yutao Zhang,Yuxuan Zhang,Zhanxiao Du,Zhenyu Hou,Zhao Xue,Zhengxiao Du,Zihan Wang,Peng Zhang,Debing Liu,Bin Xu,Juanzi Li,Minlie Huang,Yuxiao Dong,Jie Tang

Main category: cs.CV

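TL;DR: GLM-4.1V-Thinking is a vision-language model that achieves versatile multimodal reasoning through large-scale pre-training and an enhanced reinforcement learning framework, outperforming models of comparable size on a wide range of tasks.

Details Motivation: To advance general-purpose multimodal reasoning and address the limitations of existing models on complex tasks such as STEM problem solving and video understanding.

Contribution: 1) Develops a reasoning-centric training framework; 2) Combines large-scale pre-training with Reinforcement Learning with Curriculum Sampling (RLCS) to improve model performance; 3) Open-sources the GLM-4.1V-9B-Thinking model.

Method: 1) Large-scale pre-training to build a vision foundation model; 2) Reinforcement Learning with Curriculum Sampling (RLCS) to enhance the model's capabilities across tasks.

Result: Evaluated on 28 public benchmarks, the model outperforms Qwen2.5-VL-7B on nearly all tasks, matches or exceeds the much larger Qwen2.5-VL-72B on 18 benchmarks, and is competitive with the closed-source GPT-4o on some challenging tasks.

Insight: Combining reinforcement learning with pre-training can significantly improve the reasoning ability of multimodal models, especially on complex tasks such as long document understanding.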

Abstract: We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding, among others. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.

[93] ShapeEmbed: a self-supervised learning framework for 2D contour quantification

Anna Foix Romero,Craig Russell,Alexander Krull,Virginie Uhlmann

Main category: cs.CV

Abstract: The shape of objects is an important source of visual information in a wide range of applications. One of the core challenges of shape quantification is to ensure that the extracted measurements remain invariant to transformations that preserve an object’s intrinsic geometry, such as changing its size, orientation, and position in the image. In this work, we introduce ShapeEmbed, a self-supervised representation learning framework designed to encode the contour of objects in 2D images, represented as a Euclidean distance matrix, into a shape descriptor that is invariant to translation, scaling, rotation, reflection, and point indexing. Our approach overcomes the limitations of traditional shape descriptors while improving upon existing state-of-the-art autoencoder-based approaches. We demonstrate that the descriptors learned by our framework outperform their competitors in shape classification tasks on natural and biological images. We envision our approach to be of particular relevance to biological imaging applications.
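A small sketch shows why a Euclidean distance matrix is a convenient contour representation: pairwise distances are unchanged by translation, rotation, and reflection. Scale invariance is obtained below by dividing by the largest distance, which is an assumption made for illustration and not necessarily the normalization used by ShapeEmbed. Invariance to point indexing is not provided by the matrix alone and is not covered here. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def distance_matrix(contour: List[Point]) -> List[List[float]]:
    """Pairwise Euclidean distances between contour points."""
    return [[math.dist(p, q) for q in contour] for p in contour]


def normalize(matrix: List[List[float]]) -> List[List[float]]:
    """Divide by the largest distance (assumed scale normalization)."""
    peak = max(max(row) for row in matrix)
    return [[d / peak for d in row] for row in matrix]


def transform(contour: List[Point], angle: float, scale: float,
              shift: Point, reflect: bool = False) -> List[Point]:
    """Optionally reflect, then rotate, scale, and translate a contour."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for x, y in contour:
        if reflect:
            x = -x
        out.append((scale * (x * cos_a - y * sin_a) + shift[0],
                    scale * (x * sin_a + y * cos_a) + shift[1]))
    return out


contour = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
moved = transform(contour, angle=0.7, scale=3.0, shift=(5.0, -2.0), reflect=True)

original_edm = normalize(distance_matrix(contour))
moved_edm = normalize(distance_matrix(moved))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(original_edm, moved_edm)
               for a, b in zip(row_a, row_b))
```

Here `max_diff` stays at floating-point noise level, so the two normalized matrices are numerically identical even though the contour was reflected, rotated, scaled, and shifted.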

[94] DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution

Zhe Kong,Le Li,Yong Zhang,Feng Gao,Shaoshu Yang,Tao Wang,Kaihao Zhang,Zhuoliang Kang,Xiaoming Wei,Guanying Chen,Wenhan Luo

Main category: cs.CV

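TL;DR: DAM-VSR is a video super-resolution framework that disentangles appearance and motion. By combining reference image super-resolution with a video ControlNet, it addresses the weak temporal consistency of existing methods and achieves state-of-the-art performance on real-world and AIGC data.

Details Motivation: Existing video super-resolution methods based on image diffusion models generate better detail but struggle with temporal consistency, especially under complex and unpredictable degradations.

Contribution: Proposes the DAM-VSR framework, which disentangles video super-resolution into two separate problems, appearance enhancement and motion control, and introduces a motion-aligned bidirectional sampling strategy to support longer input videos.

Method: Combines reference image super-resolution (appearance enhancement) with a video ControlNet (motion control), and uses a motion-aligned bidirectional sampling strategy to handle long videos.

Result: Achieves state-of-the-art performance on real-world and AIGC data, demonstrating strong detail generation.

Insight: By disentangling appearance from motion, DAM-VSR leverages both the generative prior of video diffusion models and the detail generation of image super-resolution models, improving video super-resolution quality and temporal consistency.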

Abstract: Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.

eess.IV [Back]

[95] Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)

Yang Zhou,Chrystie Wan Ning Quek,Jun Zhou,Yan Wang,Yang Bai,Yuhe Ke,Jie Yao,Laura Gutierrez,Zhen Ling Teo,Darren Shu Jeng Ting,Brian T. Soetikno,Christopher S. Nielsen,Tobias Elze,Zengxiang Li,Linh Le Dinh,Lionel Tim-Ee Cheng,Tran Nguyen Tuan Anh,Chee Leong Cheng,Tien Yin Wong,Nan Liu,Iain Beehuat Tan,Tony Kiat Hon Lim,Rick Siow Mong Goh,Yong Liu,Daniel Shu Wei Ting

Main category: eess.IV

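TL;DR: MerMED-FM is a multimodal, multi-disease medical imaging foundation model trained with self-supervised learning and a memory module on 3.3 million images spanning seven modalities and more than ten specialties, and it achieves strong performance.

Details Motivation: To overcome the restriction of current medical imaging AI models to a single modality and a single disease, and to reduce the dependence on large labelled datasets.

Contribution: Proposes MerMED-FM, a cross-specialty, multimodal medical imaging foundation model that uses self-supervised learning and a memory module for efficient training and strong performance.

Method: Self-supervised learning with a memory module, trained on large-scale multimodal, multi-specialty medical imaging data.

Result: Strong performance across multiple diseases and modalities, with a highest AUROC of 0.988 (OCT).

Insight: A general-purpose medical imaging model that spans modalities and diseases is promising, and self-supervised learning can reduce the dependence on labelled data.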

Abstract: Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.
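For readers unfamiliar with the AUROC figures quoted above, the metric equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the scores and labels are made-up values, not MerMED-FM outputs, and the function name is hypothetical.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores and ground-truth labels.
scores = [0.9, 0.4, 0.35, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
area = auroc(scores, labels)
```

In this sample, five of the six positive/negative pairs are ordered correctly. The pairwise loop is quadratic and meant only to make the definition concrete; a sort-based computation is preferable for large evaluation sets.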

[96] Towards 3D Semantic Image Synthesis for Medical Imaging

Wenwu Tang,Khaled Seyam,Bin Yang

Main category: eess.IV

Abstract: In the medical domain, acquiring large datasets is challenging due to both accessibility issues and stringent privacy regulations. Consequently, data availability and privacy protection are major obstacles to applying machine learning in medical imaging. To address this, our study proposes the Med-LSDM (Latent Semantic Diffusion Model), which operates directly in the 3D domain and leverages de-identified semantic maps to generate synthetic data as a method of privacy preservation and data augmentation. Unlike many existing methods that focus on generating 2D slices, Med-LSDM is designed specifically for 3D semantic image synthesis, making it well-suited for applications requiring full volumetric data. Med-LSDM incorporates a guiding mechanism that controls the 3D image generation process by applying a diffusion model within the latent space of a pre-trained VQ-GAN. By operating in the compressed latent space, the model significantly reduces computational complexity while still preserving critical 3D spatial details. Our approach demonstrates strong performance in 3D semantic medical image synthesis, achieving a 3D-FID score of 0.0054 on the conditional Duke Breast dataset and similar Dice scores (0.70964) to those of real images (0.71496). These results demonstrate that the synthetic data from our model have a small domain gap with real data and are useful for data augmentation.
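The Dice score used for the comparison above is defined as 2|A∩B| / (|A| + |B|) for two binary masks A and B. A minimal sketch on flattened masks follows; the masks are made-up values, the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here rather than something stated in the abstract.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Made-up flattened masks (1 = foreground voxel).
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
score = dice_score(pred_mask, true_mask)
```

Two of the three foreground voxels in each mask overlap, so the score is 2·2 / (3 + 3) = 2/3.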

[97] SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures

Fengyi Jiang,Xiaorui Zhang,Lingbo Jin,Ruixing Liang,Yuxin Chen,Adi Chola Venkatesh,Jason Culman,Tiantian Wu,Lirong Shao,Wenqing Sun,Cong Gao,Hallie McNamara,Jingpei Lu,Omid Mohareri

Main category: eess.IV

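TL;DR: SurgiSR4K is the first publicly available high-resolution (4K) endoscopic video dataset designed for robotic-assisted minimally invasive surgery. It covers a range of common challenging scenarios and supports research on computer vision tasks.

Details Motivation: Public datasets lack high-resolution (4K) data for robotic-assisted minimally invasive surgery, which holds back the related computer vision work. SurgiSR4K fills this gap and aims to advance research on high-resolution surgical imaging.

Contribution: Presents SurgiSR4K, the first public endoscopic surgery dataset at 4K resolution, covering common surgical scenarios such as reflections, tool occlusions, and bleeding, and providing data for a variety of computer vision tasks.

Method: The dataset was captured under realistic robotic-assisted surgical conditions and was designed to include diverse visual challenges such as specular reflections and soft tissue deformations.

Result: SurgiSR4K provides high-quality data for tasks such as super-resolution, smoke removal, and surgical instrument detection, supporting research on high-resolution surgical imaging.

Insight: High-resolution datasets are important for computer vision tasks in surgery, and the public release of SurgiSR4K should foster intelligent imaging technologies that improve surgical safety and efficiency.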

Abstract: High-resolution imaging is crucial for enhancing visual clarity and enabling precise computer-assisted guidance in minimally invasive surgery (MIS). Despite the increasing adoption of 4K endoscopic systems, there remains a significant gap in publicly available native 4K datasets tailored specifically for robotic-assisted MIS. We introduce SurgiSR4K, the first publicly accessible surgical imaging and video dataset captured at a native 4K resolution, representing realistic conditions of robotic-assisted procedures. SurgiSR4K comprises diverse visual scenarios including specular reflections, tool occlusions, bleeding, and soft tissue deformations, meticulously designed to reflect common challenges faced during laparoscopic and robotic surgeries. This dataset opens up possibilities for a broad range of computer vision tasks that might benefit from high resolution data, such as super resolution (SR), smoke removal, surgical instrument detection, 3D tissue reconstruction, monocular depth estimation, instance segmentation, novel view synthesis, and vision-language model (VLM) development. SurgiSR4K provides a robust foundation for advancing research in high-resolution surgical imaging and fosters the development of intelligent imaging technologies aimed at enhancing performance, safety, and usability in image-guided robotic surgeries.

[98] Accurate and Efficient Fetal Birth Weight Estimation from 3D Ultrasound

Jian Wang,Qiongying Ni,Hongkui Yu,Ruixuan Yao,Jinqiao Ying,Bin Zhang,Xingyi Yang,Jin Peng,Jiongquan Chen,Junxuan Yu,Wenlong Shi,Chaoyu Chen,Zhongnuo Yan,Mingyuan Luo,Gaocheng Cai,Dong Ni,Jing Lu,Xin Yang

Main category: eess.IV

Abstract: Accurate fetal birth weight (FBW) estimation is essential for optimizing delivery decisions and reducing perinatal mortality. However, clinical methods for FBW estimation are inefficient, operator-dependent, and challenging to apply in cases of complex fetal anatomy. Existing deep learning methods are based on 2D standard ultrasound (US) images or videos that lack spatial information, limiting their prediction accuracy. In this study, we propose the first method for directly estimating FBW from 3D fetal US volumes. Our approach integrates a multi-scale feature fusion network (MFFN) and a synthetic sample-based learning framework (SSLF). The MFFN effectively extracts and fuses multi-scale features under sparse supervision by incorporating channel attention, spatial attention, and a ranking-based loss function. SSLF generates synthetic samples by simply combining fetal head and abdomen data from different fetuses, utilizing semi-supervised learning to improve prediction performance. Experimental results demonstrate that our method achieves superior performance, with a mean absolute error of $166.4\pm155.9$ $g$ and a mean absolute percentage error of $5.1\pm4.6$%, outperforming existing methods and approaching the accuracy of a senior doctor. Code is available at: https://github.com/Qioy-i/EFW.
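The two error metrics reported above can be written out explicitly: mean absolute error averages |prediction − truth| in grams, and mean absolute percentage error averages |prediction − truth| / truth, expressed as a percentage. The sketch below uses made-up birth weights, not data from the paper, and hypothetical function names.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average absolute difference, in the unit of the inputs (grams here)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mean_absolute_percentage_error(pred: Sequence[float],
                                   true: Sequence[float]) -> float:
    """Average absolute difference relative to the true value, in percent."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(pred, true)) / len(true)


# Made-up true and predicted birth weights in grams.
true_weights = [3000.0, 3500.0, 2800.0]
pred_weights = [3150.0, 3400.0, 2900.0]

mae_g = mean_absolute_error(pred_weights, true_weights)
mape_pct = mean_absolute_percentage_error(pred_weights, true_weights)
```

The percentage metric weights an error of the same size in grams more heavily for lighter fetuses, which is why both figures are usually reported together.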

[99] MTCNet: Motion and Topology Consistency Guided Learning for Mitral Valve Segmentation in 4D Ultrasound

Rusi Chen,Yuanting Yang,Jiezhi Yao,Hongning Song,Ji Zhang,Yongsong Zhou,Yuhao Huang,Ronghao Yang,Dan Jia,Yuhan Zhang,Xing Tao,Haoran Dou,Qing Zhou,Xin Yang,Dong Ni

Main category: eess.IV

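TL;DR: The paper proposes MTCNet, a semi-supervised learning method for mitral valve segmentation in 4D ultrasound. Motion- and topology-guided consistency learning addresses the lack of inter-phase dependency in existing methods.

Details Motivation: Mitral regurgitation is a common cardiac disorder, and 4D ultrasound is the primary imaging modality for assessing it. Existing methods perform poorly because they ignore inter-phase dependency and must cope with poor image quality.

Contribution: 1. Designs a cross-phase motion-guided consistency learning strategy; 2. Proposes a topology-guided correlation regularization that uses anatomical prior knowledge to improve segmentation accuracy.

Method: A bi-directional attention memory bank propagates spatio-temporal features, and a topology-guided correlation regularization maintains anatomical plausibility.

Result: Strong performance on a 4D dataset of 160 patients (Dice: 87.30%, HD: 1.75mm).

Insight: Semi-supervised learning that combines motion and topology consistency can effectively handle the sparse annotations and poor image quality of 4D ultrasound segmentation.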

Abstract: Mitral regurgitation is one of the most prevalent cardiac disorders. Four-dimensional (4D) ultrasound has emerged as the primary imaging modality for assessing dynamic valvular morphology. However, 4D mitral valve (MV) analysis remains challenging due to limited phase annotations, severe motion artifacts, and poor imaging quality. Moreover, the absence of inter-phase dependency in existing methods hinders 4D MV analysis. To bridge this gap, we propose a Motion-Topology guided consistency network (MTCNet) for accurate 4D MV ultrasound segmentation in semi-supervised learning (SSL). MTCNet requires only sparse end-diastolic and end-systolic annotations. First, we design a cross-phase motion-guided consistency learning strategy, utilizing a bi-directional attention memory bank to propagate spatio-temporal features. This enables MTCNet to achieve excellent performance both per- and inter-phase. Second, we devise a novel topology-guided correlation regularization that explores physical prior knowledge to maintain anatomical plausibility. Therefore, MTCNet can effectively leverage structural correspondence between labeled and unlabeled phases. Extensive evaluations on the first and largest 4D MV dataset, with 1408 phases from 160 patients, show that MTCNet achieves superior cross-phase consistency compared to other advanced methods (Dice: 87.30%, HD: 1.75 mm). Both the code and the dataset are available at https://github.com/crs524/MTCNet.
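
For readers unfamiliar with the Dice figure quoted above, the sketch below computes the Dice coefficient between a predicted and a reference binary mask. It is a generic definition written for illustration, with masks flattened to lists of 0/1, and is not code from the MTCNet repository.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for flat binary masks of 0/1 values."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 4-voxel masks: one voxel overlaps out of two foreground voxels each.
example_dice = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

A Dice of 87.30% corresponds to a return value of 0.873 from a function like this one, averaged over the evaluated phases.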

[100] Prompt2SegCXR: Prompt to Segment All Organs and Diseases in Chest X-rays

Abduz Zami,Shadman Sobhan,Rounaq Hossain,Md. Sawran Sorker,Mohiuddin Ahmed,Md. Redwan Hossain

Main category: eess.IV

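TL;DR: The paper presents Prompt2SegCXR, a lightweight prompt-based model for segmenting multiple organs and diseases in chest X-rays, together with expert-drawn doodle prompts covering 23 classes.

Details Motivation: Segmentation models are traditionally trained for a single organ or disease, and the authors report no dedicated prior work on prompt-based interactive multi-organ and multi-disease segmentation for chest X-rays.

Contribution: 1. Doodle prompts created by medical experts for a multi-source collection of datasets with 23 classes (6 organs and 17 diseases); 2. Prompt2SegCXR, a lightweight segmentation model driven by user prompts.

Method: Multi-stage feature fusion combines features from several network layers to improve spatial and semantic understanding.

Result: The abstract reports that the model scores well against existing pre-trained models for prompt-based image segmentation; it gives no figures.

Insight: A lightweight model with multi-stage feature fusion can offer prompt-driven segmentation across many organ and disease classes in chest X-rays.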

Abstract: Image segmentation plays a vital role in the medical field by isolating organs or regions of interest from surrounding areas. Traditionally, segmentation models are trained on a specific organ or a disease, limiting their ability to handle other organs and diseases. At present, few advanced models can perform multi-organ or multi-disease segmentation, offering greater flexibility. Also, recently, prompt-based image segmentation has gained attention as a more flexible approach. It allows models to segment areas based on user-provided prompts. Despite these advances, there has been no dedicated work on prompt-based interactive multi-organ and multi-disease segmentation, especially for Chest X-rays. This work presents two main contributions. First, we provide doodle prompts drawn by medical experts for a collection of datasets from multiple sources, covering 23 classes (6 organs and 17 diseases) and designed specifically for prompt-based Chest X-ray segmentation. Second, we introduce Prompt2SegCXR, a lightweight model for accurately segmenting multiple organs and diseases from Chest X-rays. The model incorporates multi-stage feature fusion, enabling it to combine features from various network layers for better spatial and semantic understanding, enhancing segmentation accuracy. Compared to existing pre-trained models for prompt-based image segmentation, our model scores well, providing a reliable solution for segmenting Chest X-rays based on user prompts.

[101] Tunable Wavelet Unit based Convolutional Neural Network in Optical Coherence Tomography Analysis Enhancement for Classifying Type of Epiretinal Membrane Surgery

An Le,Nehal Mehta,William Freeman,Ines Nagel,Melanie Tran,Anna Heinke,Akshay Agnihotri,Lingyun Cheng,Dirk-Uwe Bartsch,Hung Nguyen,Truong Nguyen,Cheolhong An

Main category: eess.IV

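TL;DR: The study proposes a ResNet18-based CNN with tunable wavelet units that classifies the type of epiretinal membrane (ERM) surgery (ILM removal or ERM-alone removal) from postoperative OCT scans. Wavelet denoising and the tunable wavelet units both improve model performance.

Details Motivation: Knowing which type of ERM surgery was performed matters for clinical decisions, but conventional assessment relies on human judgment and has limited accuracy. Deep learning combined with wavelet transforms may provide more reliable classification.

Contribution: 1. First use of tunable wavelet units for classifying ERM surgery type; 2. two wavelet units, OrthLatt-UwU and PR-Relax-UwU, that improve model performance; 3. accuracy above that of a trained human grader.

Method: 1. ResNet18 as the base architecture; 2. wavelet units whose filter coefficients adapt during training; 3. preprocessing with energy crop and wavelet denoising.

Result: The model reaches 78% accuracy on preprocessed data (PR-Relax-UwU), compared with 66% on original scans and 50% for a trained human grader.

Insight: Tunable wavelet units can improve CNN performance on medical image classification, particularly on noisy data.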

Abstract: In this study, we developed a deep learning-based method to classify the type of surgery performed for epiretinal membrane (ERM) removal, either internal limiting membrane (ILM) removal or ERM-alone removal. Our model, based on the ResNet18 convolutional neural network (CNN) architecture, utilizes postoperative optical coherence tomography (OCT) center scans as inputs. We evaluated the model using both original scans and scans preprocessed with energy crop and wavelet denoising, achieving 72% accuracy on preprocessed inputs, outperforming the 66% accuracy achieved on original scans. To further improve accuracy, we integrated tunable wavelet units with two key adaptations: Orthogonal Lattice-based Wavelet Units (OrthLatt-UwU) and Perfect Reconstruction Relaxation-based Wavelet Units (PR-Relax-UwU). These units allowed the model to automatically adjust filter coefficients during training and were incorporated into downsampling, stride-two convolution, and pooling layers, enhancing its ability to distinguish between ERM-ILM removal and ERM-alone removal, with OrthLatt-UwU boosting accuracy to 76% and PR-Relax-UwU increasing performance to 78%. Performance comparisons showed that our AI model outperformed a trained human grader, who achieved only 50% accuracy in classifying the removal surgery types from postoperative OCT scans. These findings highlight the potential of CNN-based models to improve clinical decision-making by providing more accurate and reliable classifications. To the best of our knowledge, this is the first work to employ tunable wavelets for classifying different types of ERM removal surgery.
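
To make the wavelet denoising step concrete, the sketch below runs one level of a fixed Haar transform on a 1D signal, soft-thresholds the detail coefficients, and reconstructs. This is a deliberately simplified stand-in: the abstract does not say which wavelet the authors use for denoising, and their tunable units learn filter coefficients during training, which this fixed-filter example does not do. All function names are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_forward(signal):
    """One-level Haar transform of an even-length signal: (approximation, detail)."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * _S for a, b in pairs]
    detail = [(a - b) * _S for a, b in pairs]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward; exact up to floating-point rounding."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * _S, (a - d) * _S])
    return out


def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero by `threshold`."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def haar_denoise(signal, threshold):
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, threshold))
```

With a threshold of zero the signal is reconstructed unchanged (the perfect reconstruction property). With a large threshold every pair of samples collapses to its mean, which illustrates how thresholding detail coefficients suppresses high-frequency noise.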

[102] Automated anatomy-based post-processing reduces false positives and improved interpretability of deep learning intracranial aneurysm detection

Jisoo Kim,Chu-Hsuan Lin,Alberto Ceballos-Arroyo,Ping Liu,Huaizu Jiang,Shrikanth Yadav,Qi Wan,Lei Qin,Geoffrey S Young

Main category: eess.IV

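TL;DR: The paper proposes an anatomy-based post-processing method, a hybrid of heuristics and learned segmentation, that reduces false positives in deep learning intracranial aneurysm detection and improves model performance and interpretability.

Details Motivation: Although deep learning models have made progress in detecting intracranial aneurysms, high false positive rates remain the main barrier to clinical translation. The authors therefore explore post-processing informed by anatomical knowledge to further improve detection performance.

Contribution: An automated, anatomy-based post-processing technique that substantially reduces the false positives of deep learning models without lowering true positives, improving clinical applicability and interpretability.

Method: Two deep learning models (CPM-Net and 3D-CNN-TR) perform detection, and brain and vessel segmentation masks are applied as a post-processing step. Detections that overlap brain tissue, veins, and similar structures are removed to reduce false positives.

Result: The best variant (Method 5) reduced false positives by 70.6% for CPM-Net and by 51.6% for 3D-CNN-TR without reducing true positives.

Insight: Post-processing that incorporates domain knowledge can substantially improve deep learning models, particularly in medical image analysis. The approach may extend beyond aneurysm detection to other medical imaging tasks.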

Abstract: Introduction: Deep learning (DL) models can help detect intracranial aneurysms on CTA, but high false positive (FP) rates remain a barrier to clinical translation, despite improvement in model architectures and strategies like detection threshold tuning. We employed an automated, anatomy-based, heuristic-learning hybrid artery-vein segmentation post-processing method to further reduce FPs. Methods: Two DL models, CPM-Net and a deformable 3D convolutional neural network-transformer hybrid (3D-CNN-TR), were trained with 1,186 open-source CTAs (1,373 annotated aneurysms), and evaluated with 143 held-out private CTAs (218 annotated aneurysms). Brain, artery, vein, and cavernous venous sinus (CVS) segmentation masks were applied to remove possible FPs in the DL outputs that overlapped with: (1) brain mask; (2) vein mask; (3) vein more than artery masks; (4) brain plus vein mask; (5) brain plus vein more than artery masks. Results: CPM-Net yielded 139 true positives (TP); 79 false negatives (FN); 126 FP. 3D-CNN-TR yielded 179 TP; 39 FN; 182 FP. FPs were commonly extracranial (CPM-Net 27.3%; 3D-CNN-TR 42.3%), venous (CPM-Net 56.3%; 3D-CNN-TR 29.1%), arterial (CPM-Net 11.9%; 3D-CNN-TR 53.3%), and non-vascular (CPM-Net 25.4%; 3D-CNN-TR 9.3%) structures. Method 5 performed best, reducing CPM-Net FP by 70.6% (89/126) and 3D-CNN-TR FP by 51.6% (94/182), without reducing TP, lowering the FP/case rate from 0.88 to 0.26 for CPM-Net, and from 1.27 to 0.62 for the 3D-CNN-TR. Conclusion: Anatomy-based, interpretable post-processing can improve DL-based aneurysm detection model performance. More broadly, automated, domain-informed, hybrid heuristic-learning processing holds promise for improving the performance and clinical acceptance of aneurysm detection models.
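
The headline numbers can be re-derived from the counts in the abstract, as the sketch below shows. It also includes one possible reading of the Method 5 rule (drop a candidate that overlaps the brain mask, or that overlaps the vein mask more than the artery mask). That rule is an interpretation of the abstract's wording, and both function names are hypothetical.

```python
def fp_summary(fp_before, fp_removed, n_cases):
    """Percentage FP reduction and FP-per-case rates before and after filtering."""
    fp_after = fp_before - fp_removed
    return {
        "reduction_pct": round(100.0 * fp_removed / fp_before, 1),
        "fp_per_case_before": round(fp_before / n_cases, 2),
        "fp_per_case_after": round(fp_after / n_cases, 2),
    }


def keep_detection(brain_overlap, vein_overlap, artery_overlap):
    """Assumed Method 5 rule: reject if inside brain tissue or more venous than arterial."""
    return not (brain_overlap > 0 or vein_overlap > artery_overlap)


# Counts taken from the abstract: 143 held-out CTAs.
cpm_net = fp_summary(fp_before=126, fp_removed=89, n_cases=143)
cnn_tr = fp_summary(fp_before=182, fp_removed=94, n_cases=143)
```

Both dictionaries reproduce the reported values: 70.6% and 0.88 to 0.26 for CPM-Net, 51.6% and 1.27 to 0.62 for 3D-CNN-TR.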

[103] DMCIE: Diffusion Model with Concatenation of Inputs and Errors to Improve the Accuracy of the Segmentation of Brain Tumors in MRI Images

Sara Yavari,Rahul Nitin Pandya,Jacob Furst

Main category: eess.IV

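TL;DR: The paper proposes DMCIE, a diffusion model that takes the concatenation of inputs and errors to improve brain tumor segmentation accuracy in MRI images, and reports better results than existing diffusion-based methods.

Details Motivation: Precise brain tumor segmentation is essential for clinical diagnosis and treatment planning. Diffusion models perform well on image generation and segmentation, but how to use error information to further improve segmentation accuracy remains a challenge.

Contribution: The DMCIE framework, which concatenates an error map with the original MRI images as input to a diffusion model to improve brain tumor segmentation accuracy.

Method: 1. A 3D U-Net generates an initial segmentation mask; 2. an error map is derived from it; 3. the error map is concatenated with the original MRI images and fed to a diffusion model for corrective segmentation.

Result: On the BraTS2020 dataset, DMCIE reaches a Dice Score of 93.46 and an HD95 of 5.94 mm, outperforming several state-of-the-art diffusion-based segmentation methods.

Insight: Error information can guide a diffusion model to focus on misclassified regions, which improves segmentation accuracy.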

Abstract: Accurate segmentation of brain tumors in MRI scans is essential for reliable clinical diagnosis and effective treatment planning. Recently, diffusion models have demonstrated remarkable effectiveness in image generation and segmentation tasks. This paper introduces a novel approach to corrective segmentation based on diffusion models. We propose DMCIE (Diffusion Model with Concatenation of Inputs and Errors), a novel framework for accurate brain tumor segmentation in multi-modal MRI scans. We employ a 3D U-Net to generate an initial segmentation mask, from which an error map is generated by identifying the differences between the prediction and the ground truth. The error map, concatenated with the original MRI images, is used to guide a diffusion model. Using multimodal MRI inputs (T1, T1ce, T2, FLAIR), DMCIE effectively enhances segmentation accuracy by focusing on misclassified regions, guided by the original inputs. Evaluated on the BraTS2020 dataset, DMCIE outperforms several state-of-the-art diffusion-based segmentation methods, achieving a Dice Score of 93.46 and an HD95 of 5.94 mm. These results highlight the effectiveness of error-guided diffusion in producing precise and reliable brain tumor segmentations.
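
The error-map construction and channel concatenation described above can be sketched on tiny 2D slices as follows. The helper names and toy arrays are hypothetical, the real inputs are 3D volumes, and the diffusion model itself is not shown. Note that the abstract defines the error map against the ground truth, which exists only for labeled data.

```python
def error_map(pred_mask, true_mask):
    """1 where the initial prediction disagrees with the ground truth, else 0."""
    return [
        [int(p != t) for p, t in zip(pred_row, true_row)]
        for pred_row, true_row in zip(pred_mask, true_mask)
    ]


def concat_channels(mri_channels, extra_channel):
    """Append one channel to a channel-first stack of MRI modalities."""
    return list(mri_channels) + [extra_channel]


# Toy 2x2 slices: four modalities (T1, T1ce, T2, FLAIR) filled with placeholder values.
modalities = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]
initial_prediction = [[1, 0], [1, 1]]
ground_truth = [[1, 1], [0, 1]]

errors = error_map(initial_prediction, ground_truth)
diffusion_input = concat_channels(modalities, errors)
```

The resulting input has five channels: the four MRI modalities plus the error map that marks misclassified voxels.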

cs.HC

[104] Scope Meets Screen: Lessons Learned in Designing Composite Visualizations for Marksmanship Training Across Skill Levels

Emin Zerman,Jonas Carlsson,Mårten Sjöström

Main category: cs.HC

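TL;DR: The paper designs a shooting visualization system that combines first-person shooting video with data overlays to support training for shooters of different skill levels, and evaluates its perceived effectiveness.

Details Motivation: Current marksmanship training relies mainly on repetition. The coach cannot see through the shooter's eyes, and analysis is limited to stance and post-session accuracy.

Contribution: A composite visualization system that combines first-person video with graphical data summaries to give coaches and shooters a multi-faceted analysis tool, well received by shooters across skill levels.

Method: Five composite visualization views were built from first-person shooting video with overlaid metrics and evaluated in a mixed-methods study (interpretation tasks, pairwise preference comparisons, and semi-structured interviews).

Result: A dashboard-style composite view (raw video with a polar plot and selected graphs) was preferred in 9 of 10 cases and supported understanding across skill levels.

Insight: Integrating first-person video with visual analytics has broad value for coaching and could extend to other precision-based sports.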

Abstract: Marksmanship practice is required in various professions, including police and military service, as well as among hunters and sports shooters in disciplines such as Olympic shooting, biathlon, and modern pentathlon. The current form of training and coaching is mostly based on repetition, where the coach does not see through the eyes of the shooter, and analysis is limited to stance and accuracy post-session. In this study, we present a shooting visualization system and evaluate its perceived effectiveness for both novice and expert shooters. To achieve this, five composite visualizations were developed using first-person shooting video recordings enriched with overlaid metrics and graphical summaries. These views were evaluated with 10 participants (5 expert marksmen, 5 novices) through a mixed-methods study including shot-count and aiming interpretation tasks, pairwise preference comparisons, and semi-structured interviews. The results show that a dashboard-style composite view, combining raw video with a polar plot and selected graphs, was preferred in 9 of 10 cases and supported understanding across skill levels. The insights gained from this design study point to the broader value of integrating first-person video with visual analytics for coaching, and we suggest directions for applying this approach to other precision-based sports.

cs.AI

[105] Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation

Shreyansh Padarha

Main category: cs.AI

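TL;DR: The paper proposes AdvDistill, a reward-guided dataset distillation framework that uses rewards to improve small language models (SLMs) on mathematical and complex reasoning tasks.

Details Motivation: Conventional knowledge distillation (KD) often has the student merely copy the teacher's in-distribution responses, which limits generalisability. The limitation is amplified on reasoning tasks, and distillation can be computationally expensive.

Contribution: The AdvDistill framework, which uses rule-based verifiers to assign rewards to multiple teacher generations and applies those rewards as weights when training the student model.

Method: The teacher generates several responses to each prompt, rule-based verifiers assign a reward to each response, and the rewards serve as weights during student training.

Result: Experiments show a significant improvement in SLM performance on mathematical and complex reasoning tasks, supporting the use of a reward mechanism in dataset distillation.

Insight: Adding a reward mechanism to dataset distillation can improve a student's reasoning beyond what copying teacher responses achieves.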

Abstract: The push to compress and impart the proficiency of Large Language Models (LLMs) into more deployable and efficient Small Language Models (SLMs) has benefited from improvements in knowledge distillation (KD) techniques. These techniques allow a smaller student model to learn from a more capable and larger teacher model’s responses. However, distillation often revolves around the student model merely copying the teacher’s in-distribution responses, limiting its generalisability. This limitation is amplified on reasoning tasks and can be computationally expensive. In this study, we propose AdvDistill, a reward-guided dataset distillation framework. We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers. These varying and normally distributed rewards serve as weights when training student models. Our methods and their subsequent behavioural analysis demonstrate a significant improvement in student model performance for mathematical and complex reasoning tasks, showcasing the efficacy and benefits of incorporating a rewarding mechanism in dataset distillation processes.
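
One way rewards could act as training weights is sketched below: the rewards of all teacher responses to one prompt are normalised with a softmax and used to weight each response's negative log-likelihood under the student. The softmax normalisation and both function names are assumptions made for illustration; the abstract does not specify how AdvDistill turns rewards into weights.

```python
import math


def reward_weights(rewards):
    """Softmax over the rewards of one prompt's responses (assumed normalisation)."""
    top = max(rewards)
    exps = [math.exp(r - top) for r in rewards]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def weighted_nll(response_log_probs, rewards):
    """Reward-weighted negative log-likelihood across several teacher responses.

    response_log_probs[i] is the student's total log-probability of response i.
    """
    weights = reward_weights(rewards)
    return -sum(w * lp for w, lp in zip(weights, response_log_probs))
```

With equal rewards this reduces to the plain average loss over responses, while a response with a higher verifier reward contributes more to the gradient.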

[106] Thinking About Thinking: SAGE-nano’s Inverse Reasoning for Self-Aware Language Models

Basab Jha,Firoj Paudel,Ujjwal Puri,Zhang Yuting,Choi Donghyuk,Wang Junhao

Main category: cs.AI

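TL;DR: The paper introduces inverse reasoning, a paradigm in which an LLM decomposes and explains its own reasoning chains post hoc, implemented in SAGE-nano, a 4-billion-parameter reasoning model.

Details Motivation: Chain-of-Thought (CoT) prompting yields strong reasoning, but the decision-making process remains largely a black box. Typical CoT approaches generate reasoning forward and do not explain why one chain was selected over others.

Contribution: (i) A framework for LLM self-reflection via inverse reasoning; (ii) a metalearning framework that reverses the attention flow; (iii) evaluation frameworks for reasoning transparency; (iv) evidence that inverse reasoning improves interpretability along with reasoning performance.

Method: A metacognitive structure reflects back through attention processes to identify major decision points and generate explanations of reasoning choices.

Result: 74.6% accuracy on AQUA-RAT and a 92.1% human preference score for explanation quality. The authors report performance almost on par with models such as Claude-3.5 Sonnet and GPT-4o.

Insight: According to the authors, explaining reasoning chains after the fact can improve interpretability without sacrificing reasoning performance.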

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex reasoning tasks with Chain-of-Thought (CoT) prompting, but their decision-making processes remain somewhat of a black box. We introduce inverse reasoning, a novel paradigm enabling LLMs to decompose and explain their own reasoning chains post-hoc. Our approach, used in SAGE-nano, a 4-billion-parameter reasoning model, employs a metacognitive structure that reflects back via attention processes to identify major decision points and generate explanations of reasoning choices. While typical CoT approaches are directed towards forward reasoning generation, inverse reasoning provides insight into why specific reasoning chains were selected over others. Through thorough testing of logical reasoning puzzles, math problems and ethical dilemmas from AQUA-RAT, CommonsenseQA, and customized benchmarks, we demonstrate that SAGE-nano is at the cutting edge both on reasoning accuracy (74.6% on AQUA-RAT) and explanation quality (92.1% human preference score) for its task, and offers performance almost on par with models like Claude-3.5 Sonnet or GPT-4o. Our contributions are: (i) the first rigorous framework for LLM self-reflection via inverse reasoning, (ii) a novel metalearning framework to reverse the attention flow, (iii) comprehensive evaluation frameworks for reasoning transparency, and (iv) evidence that augmenting reasoning with inverse reasoning improves interpretability along with reasoning performance. Our work creates new avenues for transparent AI systems and closes significant gaps in AI safety, education, and scientific discovery.

[107] ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context

Joongwon Kim,Anirudh Goyal,Liang Tan,Hannaneh Hajishirzi,Srinivasan Iyer,Tianlu Wang

Main category: cs.AI

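TL;DR: The ASTRO framework trains language models to reason like search algorithms, using self-reflection, backtracking, and exploration, and substantially improves the reasoning of non-reasoner models such as Llama 3.

Details Motivation: Successful open-source reasoning models build on base models that already reason well, and it remains unclear how to improve the reasoning of other, non-reasoner models. ASTRO aims to fill this gap by teaching models structured search behavior.

Contribution: 1. The ASTRO framework, which trains models on a natural-language dataset derived from Monte Carlo Tree Search (MCTS); 2. reinforcement learning (RL) to further improve performance; 3. substantial gains on Llama 3.

Method: 1. Build a synthetic dataset from MCTS over mathematical problem-solving trajectories; 2. convert the search traces into natural-language chains of thought; 3. fine-tune on these traces and improve further via RL with verifiable rewards.

Result: Absolute gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, with the largest improvements on problems that require iterative correction.

Insight: Training on structured search behavior offers a principled way to instill robust reasoning in open LLMs on complex tasks.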

Abstract: We introduce ASTRO, the “Autoregressive Search-Taught Reasoner”, a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. ASTRO teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, ASTRO bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.
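
The idea of turning a search trace into a chain of thought that records failures and recoveries can be sketched as a depth-first walk over a small tree. The tree below is hand-written rather than produced by MCTS, and the node format, the backtracking phrase, and the function name are all hypothetical.

```python
def linearize(node, lines=None):
    """Depth-first walk that writes visited steps and backtracks as text.

    Returns (lines, solved). A node is a dict with a "step" string, an optional
    "solved" flag, and an optional list of "children".
    """
    if lines is None:
        lines = []
    lines.append(node["step"])
    if node.get("solved"):
        return lines, True
    for child in node.get("children", []):
        _, solved = linearize(child, lines)
        if solved:
            return lines, True
        # The branch was a dead end: record the reflection and the backtrack.
        lines.append(f"Wait, that does not work. Let me go back to: {node['step']}")
    return lines, False


# Hand-written toy search tree (not MCTS output).
tree = {
    "step": "Solve x^2 = 9 with x > 0.",
    "children": [
        {"step": "Try x = -3."},
        {"step": "Try x = 3.", "solved": True},
    ],
}
trace, solved = linearize(tree)
```

The resulting trace keeps the failed attempt and the explicit backtrack, which is the kind of signal the abstract says ASTRO preserves instead of training only on clean solutions.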

[108] Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Maggie Huan,Yuetai Li,Tuney Zheng,Xiaoyu Xu,Seungone Kim,Minxin Du,Radha Poovendran,Graham Neubig,Xiang Yue

Main category: cs.AI

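TL;DR: Do gains in math reasoning generalize to other domains? The study finds that most large language models (LLMs) that succeed in math fail to transfer those gains to other tasks.

Details Motivation: Math reasoning has become a key yardstick of LLM progress, but it is unclear whether these gains reflect broader problem-solving ability or narrow overfitting to math.

Contribution: 1. Evaluation of more than 20 open-weight reasoning-tuned models across multiple domains. 2. Evidence of how the tuning method (RL vs. SFT) affects generalization. 3. Latent-space and output-distribution analyses that explain why SFT harms generalization.

Method: 1. Evaluate models on tasks spanning math, scientific QA, agent planning, coding, and instruction following. 2. Compare RL and SFT tuning on Qwen3-14B models using math-only data. 3. Analyze model behavior through latent-space representations and token-space distribution shift.

Result: RL-tuned models generalize well across domains, while SFT-tuned models often forget general capabilities. SFT induces substantial drift in both representations and outputs.

Insight: Relying on SFT-distilled data alone may not improve generalization, so standard post-training recipes for reasoning models need rethinking.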

Abstract: Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

[109] Enhancing LLM Agent Safety via Causal Influence Prompting

Dongyoon Hahm,Woogyeol Jin,June Suk Choi,Sungsoo Ahn,Kimin Lee

Main category: cs.AI

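TL;DR: The paper proposes CIP, a technique that uses causal influence diagrams (CIDs) to improve the safety of large language model (LLM) agents by representing cause-and-effect relationships in a structured form, which improves agent decisions and reduces potential risks.

Details Motivation: As LLM-powered autonomous agents show promise on assistive tasks, ensuring safe and reliable behavior becomes essential. Current approaches lack a systematic way to identify and mitigate potential risks.

Contribution: The CIP technique, which brings causal influence diagrams into agent decision-making to identify and mitigate risks in a structured way.

Method: Three steps: 1) initialize a CID from the task specification; 2) use the CID to guide the agent's interactions with the environment; 3) iteratively refine the CID based on observations.

Result: Experiments show that the method effectively improves safety in code execution and mobile device control tasks.

Insight: Causal influence diagrams give agents an interpretable, structured way to anticipate and avoid potential harm.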

Abstract: As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.
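
The three steps can be sketched with a CID stored as a plain dictionary of typed nodes and directed edges. In the paper an LLM builds and refines the diagram; here the initial graph and the refinement rule are hard-coded, and every name (`init_cid`, `cid_to_prompt`, `refine_cid`, the node labels) is a hypothetical illustration.

```python
def init_cid(task):
    """Step 1: start from a minimal diagram derived from the task description."""
    return {
        "task": task,
        "nodes": {"task": "chance", "action": "decision", "safety": "utility"},
        "edges": [("task", "action"), ("action", "safety")],
    }


def cid_to_prompt(cid):
    """Step 2: render the diagram as text that can be placed in the agent prompt."""
    edges = "; ".join(f"{src} -> {dst}" for src, dst in cid["edges"])
    return (
        f"Task: {cid['task']}\n"
        f"Causal influence diagram edges: {edges}\n"
        "Choose an action that protects the 'safety' node."
    )


def refine_cid(cid, observed_risk):
    """Step 3: add an observed risk as a chance node between action and safety."""
    if observed_risk not in cid["nodes"]:
        cid["nodes"][observed_risk] = "chance"
        cid["edges"] += [("action", observed_risk), (observed_risk, "safety")]
    return cid
```

A typical loop would call `cid_to_prompt` before each agent step and `refine_cid` after each observation, so that later prompts mention risks discovered earlier in the episode.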

[110] DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

Hang Wu,Hongkai Chen,Yujun Cai,Chang Liu,Qingwen Ye,Ming-Hsuan Yang,Yiwei Wang

Main category: cs.AI

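TL;DR: DiMo-GUI is a training-free GUI grounding framework that handles textual and iconic elements separately and progressively zooms into candidate regions to refine its predictions.

Details Motivation: Grounding natural language queries in GUIs is difficult because of diverse visual elements, spatial clutter, and ambiguous language.

Contribution: A training-free framework built on two strategies, dynamic visual grounding and modality-aware optimization, that works with general-purpose vision-language models.

Method: The input is split into textual and iconic elements that the model reasons over independently. When predictions are ambiguous or incorrect, the framework generates candidate focal regions centered on the initial predictions and incrementally zooms into subregions.

Result: The abstract reports consistent improvements over baseline inference pipelines on standard GUI grounding benchmarks; it gives no figures.

Insight: Combining modality separation with region-focused reasoning helps disambiguate visually crowded layouts without additional training or annotation.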

Abstract: Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model’s initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
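
The zoom-in refinement can be sketched as repeatedly shrinking a crop box around the current prediction and asking the model again. In this sketch `predict` is a hypothetical stand-in for a vision-language model call that receives a box and returns a point in global image coordinates; the fixed number of steps and the 0.5 scale are arbitrary choices, and the text/icon split is omitted.

```python
def zoom_box(center, box, scale):
    """Shrink `box` (left, top, right, bottom) by `scale` around `center`,
    clamped so that the new box stays inside the old one."""
    left, top, right, bottom = box
    width, height = (right - left) * scale, (bottom - top) * scale
    new_left = min(max(center[0] - width / 2, left), right - width)
    new_top = min(max(center[1] - height / 2, top), bottom - height)
    return (new_left, new_top, new_left + width, new_top + height)


def refine_grounding(predict, image_size, steps=2, scale=0.5):
    """Predict on the full image, then re-predict on progressively smaller crops."""
    box = (0.0, 0.0, float(image_size[0]), float(image_size[1]))
    point = predict(box)
    for _ in range(steps):
        box = zoom_box(point, box, scale)
        point = predict(box)
    return point, box


def toy_predict(box, target=(900, 50)):
    """Toy predictor: the point inside `box` that is closest to a fixed target."""
    return (min(max(target[0], box[0]), box[2]), min(max(target[1], box[1]), box[3]))


final_point, final_box = refine_grounding(toy_predict, image_size=(1000, 1000))
```

Each step quarters the area the model has to inspect, which is one way to read the abstract's claim that zooming helps in crowded layouts.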

[111] TalentMine: LLM-Based Extraction and Question-Answering from Multimodal Talent Tables

Varun Mannam,Fang Wang,Chaochun Liu,Xin Chen

Main category: cs.AI

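TL;DR: The paper proposes TalentMine, an LLM-based framework for extracting information from multimodal talent tables and answering questions over them, addressing the weak semantic understanding of existing table extraction methods.

Details Motivation: In talent management systems, key information is often stored in complex tables that conventional language models struggle to extract and use accurately, causing retrieval and question-answering failures.

Contribution: 1. A systematic analysis of semantic information loss in current table extraction pipelines. 2. An LLM-based method for semantically enriched table representation. 3. An efficient framework for integration into end-to-end retrieval-augmented systems. 4. Comprehensive benchmarks on talent analytics tasks.

Method: Multimodal reasoning converts extracted tables into semantically enriched representations that preserve both the structure and the semantics of the table.

Result: TalentMine achieves 100% accuracy on query answering, compared with 0% for standard AWS Textract extraction and 40% for AWS Textract Visual Q&A.

Insight: The Claude v3 Haiku model performed best for talent management applications, which highlights the potential of LLMs for semantic enrichment.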

Abstract: In talent management systems, critical information often resides in complex tabular formats, presenting significant retrieval challenges for conventional language models. These challenges are pronounced when processing talent documentation that requires precise interpretation of tabular relationships for accurate information retrieval and downstream decision-making. Current table extraction methods struggle with semantic understanding, resulting in poor performance when integrated into retrieval-augmented chat applications. This paper identifies a key bottleneck: while structural table information can be extracted, the semantic relationships between tabular elements are lost, causing downstream query failures. To address this, we introduce TalentMine, a novel LLM-enhanced framework that transforms extracted tables into semantically enriched representations. Unlike conventional approaches relying on CSV or text linearization, our method employs specialized multimodal reasoning to preserve both structural and semantic dimensions of tabular data. Experimental evaluation across employee benefits document collections demonstrates TalentMine’s superior performance, achieving 100% accuracy in query answering tasks compared to 0% for standard AWS Textract extraction and 40% for AWS Textract Visual Q&A capabilities. Our comparative analysis also reveals that the Claude v3 Haiku model achieves optimal performance for talent management applications. The key contributions of this work include (1) a systematic analysis of semantic information loss in current table extraction pipelines, (2) a novel LLM-based method for semantically enriched table representation, (3) an efficient framework for integration into end-to-end retrieval-augmented systems, and (4) comprehensive benchmarks on talent analytics tasks showing substantial improvements across multiple categories.
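
The difference between CSV linearization and a semantically enriched representation can be illustrated with a template that restates every cell together with its column header and table title. In TalentMine an LLM produces the enriched text; the fixed template below is only a stand-in, and the table contents and function names are invented for the example.

```python
def linearize_csv(header, rows):
    """Baseline: plain CSV text, where a cell is only implicitly tied to its header."""
    return "\n".join(",".join(str(cell) for cell in row) for row in [header] + rows)


def enrich_rows(table_title, header, rows):
    """One self-describing sentence per row, keyed by the first column."""
    sentences = []
    for row in rows:
        facts = ", ".join(f"{col} is {val}" for col, val in zip(header[1:], row[1:]))
        sentences.append(f"In '{table_title}', for {header[0]} '{row[0]}': {facts}.")
    return sentences


# Hypothetical benefits table (values invented for illustration).
header = ["Plan", "Deductible", "Annual maximum"]
rows = [["Basic", "$50", "$1,000"]]
enriched = enrich_rows("Dental plan benefits", header, rows)
```

A retrieved chunk containing only the CSV line `Basic,$50,$1,000` gives a retriever little to match against a question about deductibles, whereas the enriched sentence names the plan, the column, and the table in one passage.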

cs.CR

[112] BadViM: Backdoor Attack against Vision Mamba

Yinghao Wu,Liyan Zhang

Main category: cs.CR

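TL;DR: The paper "BadViM: Backdoor Attack against Vision Mamba" is the first to study the vulnerability of Vision Mamba (ViM) to backdoor attacks. It proposes BadViM, an attack framework that uses a Resonant Frequency Trigger and a Hidden State Alignment loss to achieve a high attack success rate with strong stealth.

Details Motivation: Vision Mamba is an emerging vision state space model whose security has not been adequately studied. The paper aims to fill this gap by examining its vulnerability to backdoor attacks.

Contribution: BadViM, the first backdoor attack framework targeting ViM, with a Resonant Frequency Trigger (RFT) and a Hidden State Alignment loss, exposing potential security risks in ViM.

Method: BadViM uses the RFT to generate stealthy, distributed triggers and applies the Hidden State Alignment loss to manipulate the model's internal representations and raise the attack success rate.

Result: Experiments show that BadViM attains a high attack success rate while maintaining clean-data accuracy, and remains robust against common defenses such as PatchDrop, PatchShuffle, and JPEG compression.

Insight: Emerging architectures such as ViM may carry security vulnerabilities, and their defense mechanisms need further study.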

Abstract: Vision State Space Models (SSMs), particularly architectures like Vision Mamba (ViM), have emerged as promising alternatives to Vision Transformers (ViTs). However, the security implications of this novel architecture, especially its vulnerability to backdoor attacks, remain critically underexplored. Backdoor attacks aim to embed hidden triggers into victim models, causing the model to misclassify inputs containing these triggers while maintaining normal behavior on clean inputs. This paper investigates the susceptibility of ViM to backdoor attacks by introducing BadViM, a novel backdoor attack framework specifically designed for Vision Mamba. The proposed BadViM leverages a Resonant Frequency Trigger (RFT) that exploits the frequency sensitivity patterns of the victim model to create stealthy, distributed triggers. To maximize attack efficacy, we propose a Hidden State Alignment loss that strategically manipulates the internal representations of the model by aligning the hidden states of backdoor images with those of target classes. Extensive experimental results demonstrate that BadViM achieves superior attack success rates while maintaining clean data accuracy. Meanwhile, BadViM exhibits remarkable resilience against common defensive measures, including PatchDrop, PatchShuffle and JPEG compression, which typically neutralize normal backdoor attacks.

cs.LG

[113] GLU Attention Improve Transformer

Zehao Wang

Main category: cs.LG

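TL;DR: The paper proposes GLU Attention, an attention mechanism that introduces nonlinearity into the values of attention, improving model performance and convergence speed with no additional parameters and negligible computational cost.

Details Motivation: Standard attention lacks a nonlinearity on its values, which limits the room for performance gains. Bringing the nonlinearity of Gated Linear Units (GLU) into attention can give the model greater expressive power.

Contribution: 1. The GLU Attention mechanism, which applies a nonlinearity to the attention values and improves Transformer performance. 2. No additional parameters, negligible computational cost, and applicability to both text and vision. 3. An open-source implementation.

Method: 1. Apply a GLU to the values of attention to add a nonlinear transformation. 2. Integrate seamlessly with other techniques such as Flash Attention, RoPE, and GQA.

Result: Experiments show that GLU Attention improves model performance and convergence speed on both text and vision tasks.

Insight: Nonlinear attention may be an important direction for improving Transformers, and a lightweight design of this kind is broadly compatible with existing techniques.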

Abstract: Gated Linear Units (GLU) have shown great potential in enhancing neural network performance. In this paper, I introduce a novel attention mechanism called GLU Attention, which introduces nonlinearity into the values of Attention. My experiments demonstrate that GLU Attention improves both model performance and convergence speed across text and vision modalities with zero additional parameters and negligible computational costs. GLU Attention is lightweight and can seamlessly integrate with other technologies, such as Flash Attention, Rotary Position Embedding (RoPE), and various Multi-Head Attention (MHA) variants such as Grouped-Query Attention (GQA). This project is open-sourced at GitHub.
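
One plausible reading of "nonlinearity in the values with zero additional parameters" is to split each value vector in half and gate one half with the sigmoid of the other before the usual weighted sum. The sketch below implements that reading for a single head in plain Python. The split-and-gate placement, the choice of sigmoid as the gate, and the function names are assumptions, and the author's released code may differ.

```python
import math


def glu(vector):
    """Gated Linear Unit: first half gated by the sigmoid of the second half."""
    half = len(vector) // 2
    return [a / (1.0 + math.exp(-b)) for a, b in zip(vector[:half], vector[half:])]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def glu_attention(queries, keys, values):
    """Single-head scaled dot-product attention over GLU-gated values."""
    gated = [glu(v) for v in values]  # halves the value width, adds no weights
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append(
            [sum(w * g[i] for w, g in zip(weights, gated)) for i in range(len(gated[0]))]
        )
    return outputs
```

Because the gate reuses half of the existing value projection, no new weights are introduced, which is consistent with the zero-parameter claim. The output width per head is halved under this reading.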

[114] ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models

Jiale Ding,Xiang Zheng,Cong Wang,Wei-Bin Lee,Xingjun Ma,Yu-Gang Jiang

Main category: cs.LG

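TL;DR: The paper proposes ROSE, a framework that uses multi-objective reinforcement learning to generate topically diverse and contextually rich adversarial prompts for a more comprehensive evaluation of LLM safety vulnerabilities.

Details Motivation: As large language models (LLMs) are widely deployed in real-world applications, safety evaluation becomes critical. Manual benchmarks are static and costly to update, so they cannot keep pace with LLM development, while automated methods suffer from narrow topic coverage and unrealistic contexts.

Contribution: The ROSE framework, which optimizes adversarial prompt generation with multi-objective reinforcement learning to achieve topic diversity and contextual richness, enabling a more comprehensive safety evaluation of LLMs.

Method: Multi-objective reinforcement learning fine-tunes an adversarial LLM to generate diverse, realistic adversarial prompts that cover more harmful topics and real-world scenarios.

Result: Experiments show that ROSE outperforms existing methods at uncovering LLM safety vulnerabilities, with notable improvements in integrated evaluation metrics.

Insight: ROSE offers a practical path toward adaptive, reality-oriented safety evaluation, addressing the narrow topic coverage and weak contextualization of existing methods.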

Abstract: As Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications, evaluating their safety, especially under adversarial prompting, has become critical. Arguably, effective safety evaluations should be adaptive, evolving with LLM capabilities, and also cover a broad spectrum of harmful topics and real-world scenarios to fully expose potential vulnerabilities. Existing manual safety benchmarks, built on handcrafted adversarial prompts, are limited by their static nature and the intensive labor required to update them, making it difficult to keep pace with rapidly advancing LLMs. In contrast, automated adversarial prompt generation offers a promising path toward adaptive evaluation. However, current methods often suffer from insufficient adversarial topic coverage (topic-level diversity) and weak alignment with real-world contexts. These shortcomings stem from the exploration-exploitation dilemma in black-box optimization and a lack of real-world contextualization, resulting in adversarial prompts that are both topically narrow and scenario-repetitive. To address these issues, we propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM for generating topically diverse and contextually rich adversarial prompts. Experiments show that ROSE outperforms existing methods in uncovering safety vulnerabilities in state-of-the-art LLMs, with notable improvements in integrated evaluation metrics. We hope ROSE represents a step toward more practical and reality-oriented safety evaluation of LLMs. WARNING: This paper contains examples of potentially harmful text.

[115] The language of time: a language model perspective on time-series foundation models

Yi Xie,Yun Xiong,Zejian Shi,Hao Niu,Zhengfu Liu

Main category: cs.LG

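TL;DR: The paper examines why time-series foundation models succeed and argues that, by discretizing time-series data into a vocabulary resembling that of natural language, they inherit the strong representation and transfer abilities of language models.

Details Motivation: To determine whether the paradigm behind large language models extends to time series, and to explain why time-series models transfer across domains unexpectedly well.

Contribution: A theoretical demonstration that time-series data can inherit the representational power of language models through a discretized vocabulary, providing theoretical support for time-series foundation models.

Method: Theoretical and experimental analysis of the representation learning mechanisms and generalization capabilities of patch-based time-series models.

Result: Time-series data can be quantized into a discrete vocabulary whose statistical properties are consistent with those of natural language, which explains the models' strong performance.

Insight: The core of time-series foundation models is the generalization of deterministic representations to latent probabilistic distributional forms, which lets them inherit the strengths of language models.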

Abstract: With the rise of large language models, the paradigm of training foundation models with massive parameter counts on vast datasets has been adopted in multiple domains to achieve remarkable success. Time series foundation models represent a significant extension of this paradigm, demonstrating exceptional expressive power, generalization, and cross-domain transferability. However, this gives rise to a fundamental paradox: time series data reflect distinct dynamical systems, making cross-domain transfer intuitively implausible, yet this is contradicted by the models’ empirical success. To resolve this paradox, this paper investigates, from both theoretical and experimental perspectives, the representation learning mechanisms and generalization capabilities of patch-based time series foundation models. We argue that such models are not merely applying a new architecture but are fundamentally generalizing the representation paradigm of language models by extending deterministic vector-based representations to latent probabilistic distributional forms. Our theoretical analysis supports this framework by demonstrating that continuous time-series patches can be faithfully quantized into a discrete vocabulary whose key statistical properties are highly consistent with those of natural language. This generalization allows time series models to inherit the robust representation and transfer abilities of large language models, thereby explaining their superior performance in temporal tasks. Ultimately, our work provides a rigorous theoretical cornerstone for understanding, evaluating, and improving the safety and reliability of large-scale time series foundation models.
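
The quantization step described in the abstract above can be illustrated with a minimal sketch: a series is cut into fixed-length patches, and each patch is replaced by the index of its nearest codebook entry, so the series becomes a sequence of discrete tokens. The codebook, the patch length, and the helper names `to_patches` and `quantize` are assumptions made for illustration; the abstract does not specify how the paper builds its vocabulary.

```python
import math

def to_patches(series, patch_len):
    """Split a series into non-overlapping patches, dropping any ragged tail."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def quantize(patch, codebook):
    """Return the index of the codebook entry closest to the patch (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: math.dist(patch, codebook[k]))

# Hypothetical 3-entry codebook of length-2 patch shapes: flat, rising, falling.
CODEBOOK = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]

series = [0.1, 0.0, 0.1, 0.9, 1.0, 0.1, 0.0, 0.1]
tokens = [quantize(p, CODEBOOK) for p in to_patches(series, 2)]
print(tokens)  # [0, 1, 2, 0]
```

In practice a codebook would be learned from data rather than hand-picked; the fixed one here only keeps the example easy to verify by hand.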

[116] Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations

Jiztom Kavalakkatt Francis,Matthew J Darr

Main category: cs.LG

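TL;DR: This paper proposes a framework that improves the interpretability of time-series models by fusing heatmaps produced by a ResNet and a restructured 2D Transformer with globally attention-weighted input saliency. The method addresses the spatial-temporal alignment limitations of existing approaches and reports significant performance gains on healthcare and industrial datasets.

Details Motivation: Existing interpretability methods are limited in spatial-temporal alignment: convolutional networks lack global context, while Transformers lack localized precision. This impedes actionable insights in safety-critical domains such as healthcare and industrial monitoring.

Contribution: 1. A unified visualization framework that fuses heatmaps generated by a ResNet and a Transformer; 2. improved interpretability and performance through global attention and NLP-generated explanations; 3. validation of the method on healthcare and industrial datasets.

Method: 1. Fuse gradient-weighted activation maps (ResNet) and Transformer attention rollout into a unified heatmap; 2. use an NLP module to translate heatmaps into domain-specific narratives; 3. evaluate empirically on clinical and industrial datasets.

Result: The framework reaches 94.1% accuracy (F1 0.93) on the PhysioNet dataset and reduces regression error to RMSE = 0.28 kWh ($R^2$ = 0.95) on the UCI Energy Appliance dataset, outperforming the baselines by 3.8-12.4%. The NLP-generated explanations score 0.586 on BLEU-4 and 0.650 on ROUGE-L.

Insight: By emphasizing causal fidelity and spatial-temporal alignment, the method bridges the gap between technical outputs and stakeholder understanding, offering a scalable solution for transparent, time-aware decision-making.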

Abstract: In this paper, we present a novel framework for enhancing model interpretability by integrating heatmaps produced separately by ResNet and a restructured 2D Transformer with globally weighted input saliency. We address the critical problem of spatial-temporal misalignment in existing interpretability methods, where convolutional networks fail to capture global context and Transformers lack localized precision, a limitation that impedes actionable insights in safety-critical domains like healthcare and industrial monitoring. Our method merges gradient-weighted activation maps (ResNet) and Transformer attention rollout into a unified visualization, achieving full spatial-temporal alignment while preserving real-time performance. Empirical evaluations on clinical (ECG arrhythmia detection) and industrial (energy consumption prediction) datasets demonstrate significant improvements: the hybrid framework achieves 94.1% accuracy (F1 0.93) on the PhysioNet dataset and reduces regression error to RMSE = 0.28 kWh ($R^2$ = 0.95) on the UCI Energy Appliance dataset, outperforming standalone ResNet, Transformer, and InceptionTime baselines by 3.8-12.4%. An NLP module translates fused heatmaps into domain-specific narratives (e.g., “Elevated ST-segment between 2-4 seconds suggests myocardial ischemia”), validated via BLEU-4 (0.586) and ROUGE-L (0.650) scores. By formalizing interpretability as causal fidelity and spatial-temporal alignment, our approach bridges the gap between technical outputs and stakeholder understanding, offering a scalable solution for transparent, time-aware decision-making.
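
As a rough illustration of merging two saliency maps into one visualization, the sketch below rescales two per-timestep score lists to a common range and blends them with a single weight. The min-max normalization, the convex combination, the weight `alpha`, and the sample scores are all assumptions; the abstract does not state the fusion rule the authors use.

```python
def min_max(values):
    """Rescale a list of scores to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fuse(cam, rollout, alpha=0.5):
    """Convex combination of two normalized per-timestep saliency maps."""
    if len(cam) != len(rollout):
        raise ValueError("maps must cover the same timesteps")
    a, b = min_max(cam), min_max(rollout)
    return [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]

cam = [0.0, 2.0, 4.0, 2.0]       # hypothetical CNN activation-map scores
rollout = [1.0, 1.0, 3.0, 5.0]   # hypothetical attention-rollout scores
fused = fuse(cam, rollout, alpha=0.5)
print(fused)  # [0.0, 0.25, 0.75, 0.75]
```

Normalizing before blending keeps one map from dominating merely because its raw scores live on a larger scale.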

[117] $μ^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation

Siyou Li,Pengyao Qin,Huanan Wu,Dong Nie,Arun J. Thirunavukarasu,Juntao Yu,Le Zhang

Main category: cs.LG

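TL;DR: The paper proposes $\mu^2$Tokenizer, a differentiable multi-scale multi-modal tokenizer for radiology report generation that improves report quality by combining visual and textual features.

Details Motivation: Radiology report generation (RRG) faces two main challenges: extracting information efficiently from imaging data, and objectively evaluating discrepancies between generated and expert-written reports. The authors propose a solution to both.

Contribution: The main contribution is $\mu^2$Tokenizer and $\mu^2$LLM, which improve RRG performance through multi-scale multi-modal feature integration and direct preference optimization (DPO).

Method: Integrates a multi-scale visual tokenizer with a text tokenizer, and uses DPO guided by GREEN-RedLlama to improve generation quality.

Result: Experiments on four large CT image-report datasets show that the method outperforms existing approaches and indicate the potential of fine-tuning $\mu^2$LLM on limited data.

Insight: Integrating multi-modal features with preference optimization can substantially improve the accuracy and efficiency of radiology report generation when data is limited.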

Abstract: Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and provision of management advice. RRG is complicated by two key challenges: (1) inherent complexity in extracting relevant information from imaging data under resource constraints, and (2) difficulty in objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose $\mu^2$LLM, a $\underline{\textbf{mu}}$ltiscale $\underline{\textbf{mu}}$ltimodal large language model for RRG tasks. The novel ${\mu}^2$Tokenizer, as an intermediate layer, integrates multi-modal features from the multiscale visual tokenizer and the text tokenizer, then enhances report generation quality through direct preference optimization (DPO), guided by GREEN-RedLlama. Experimental results on four large CT image-report medical datasets demonstrate that our method outperforms existing approaches, highlighting the potential of our fine-tuned $\mu^2$LLMs on limited data for RRG tasks.

[118] Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention

Zhihao Zhan,Jianan Zhao,Zhaocheng Zhu,Jian Tang

Main category: cs.LG

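TL;DR: The paper analyzes the limitations of state-space models (SSMs) in long-context modeling, proposes combining them with context-dependent sparse attention (CDSA), and supports the approach with theoretical proofs and experiments.

Details Motivation: The predominant Transformer architecture is limited by quadratic time complexity in long-context modeling. SSMs offer sub-quadratic alternatives but struggle to capture long-range dependencies, so improving their long-context capabilities is an important problem.

Contribution: (1) Identifies the shortcomings of the conventional associative recall task and proposes a new task, joint recall, to assess long-context modeling more fully; (2) proves that SSMs cannot solve multi-query joint recall in sub-quadratic complexity; (3) proposes a CDSA-based solution (HAX) and validates its performance advantage experimentally.

Method: The paper combines SSMs with CDSA and instantiates the idea as HAX (locality-sensitive Hashing Attention with sparse Key Selection). By design, the approach can solve multi-query joint recall with sub-quadratic computation, and it is further tailored to natural language domains.

Result: HAX outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA) on both synthetic tasks and real-world long-context benchmarks.

Insight: The paper exposes an inherent limitation of SSMs in long-context modeling and shows that adding a context-dependent attention mechanism is an effective remedy, suggesting a direction for future long-context methods.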

Abstract: Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominant Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend associative recall to a novel synthetic task, joint recall, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).
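
The difference between the two synthetic tasks in the abstract above can be made concrete with a toy instance. In associative recall a key has one value; in joint recall the same key can carry a different value in each context, so the answer depends on both. The data layout, the token format produced by `flatten`, and all function names are assumptions for illustration, not the paper's benchmark format.

```python
def associative_recall(pairs, key):
    """Classic task: one global key -> value table, no context."""
    return dict(pairs)[key]

def joint_recall(blocks, context, key):
    """Joint recall: the same key may map to different values per context."""
    return blocks[context][key]

def flatten(blocks):
    """Serialize the instance into the flat token sequence a model would read."""
    seq = []
    for ctx, table in blocks.items():
        seq.append(ctx)
        for k, v in table.items():
            seq.extend([k, v])
    return seq

# Hypothetical toy instance: key "b" has a different value in each context.
blocks = {
    "ctx1": {"a": 1, "b": 2},
    "ctx2": {"a": 3, "b": 4},
}

print(flatten(blocks))                    # ['ctx1', 'a', 1, 'b', 2, 'ctx2', 'a', 3, 'b', 4]
print(joint_recall(blocks, "ctx1", "b"))  # 2
print(joint_recall(blocks, "ctx2", "b"))  # 4
```

A model that only remembers the most recent value for "b" answers the second query correctly and the first one incorrectly, which is the kind of failure the context-free task cannot reveal.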

[119] HiT-JEPA: A Hierarchical Self-supervised Trajectory Embedding Framework for Similarity Computation

Lihuan Li,Hao Xue,Shuang Ao,Yang Song,Flora Salim

Main category: cs.LG

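TL;DR: HiT-JEPA is a hierarchical self-supervised trajectory embedding framework that addresses multi-scale information fusion in urban trajectory representation.

Details Motivation: Existing methods struggle to capture fine-grained trajectory details and high-level summaries at the same time, which limits performance on both long-range dependencies and local detail. The paper aims to design a unified framework that fuses multi-scale trajectory semantics.

Contribution: Proposes HiT-JEPA, a hierarchical self-supervised trajectory embedding framework whose three-layer structure progressively captures point-level details, intermediate patterns, and high-level abstractions, integrating local dynamics and global semantics in one coherent model.

Method: A three-layer hierarchical architecture handles point-level fine-grained details, intermediate patterns, and high-level trajectory abstractions, with multi-scale representation learning via a Joint Embedding Predictive Architecture (JEPA).

Result: Trajectory similarity computation experiments on multiple real-world datasets show that HiT-JEPA's hierarchical design yields richer multi-scale representations.

Insight: The hierarchical design effectively fuses multi-scale trajectory information, providing a more complete representation for urban mobility pattern analysis.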

Abstract: The representation of urban trajectory data plays a critical role in effectively analyzing spatial movement patterns. Despite considerable progress, the challenge of designing trajectory representations that can capture diverse and complementary information remains an open research problem. Existing methods struggle to incorporate fine-grained trajectory details and high-level summaries in a single model, limiting their ability to attend to long-term dependencies while preserving local nuances. To address this, we propose HiT-JEPA (Hierarchical Interactions of Trajectory Semantics via a Joint Embedding Predictive Architecture), a unified framework for learning multi-scale urban trajectory representations across semantic abstraction levels. HiT-JEPA adopts a three-layer hierarchy that progressively captures point-level fine-grained details, intermediate patterns, and high-level trajectory abstractions, enabling the model to integrate both local dynamics and global semantics in one coherent structure. Extensive experiments on multiple real-world datasets for trajectory similarity computation show that HiT-JEPA’s hierarchical design yields richer, multi-scale representations. Code is available at: https://anonymous.4open.science/r/HiT-JEPA.

[120] Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding

Duc Cao-Dinh,Khai Le-Duc,Anh Dao,Bach Phan Tat,Chris Ngo,Duy M. H. Nguyen,Nguyen X. Khanh,Thanh Nguyen-Tang

Main category: cs.LG

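TL;DR: Audio-3DVG is a framework for audio-based 3D visual grounding that integrates spoken descriptions with point-cloud spatial information, reporting state-of-the-art results in audio-based grounding and performance competitive with text-based methods.

Details Motivation: Prior 3D visual grounding work relies mainly on textual descriptions. Grounding from spoken language remains underexplored and challenging, while advances in automatic speech recognition (ASR) and speech representation learning motivate revisiting it.

Contribution: 1. Object Mention Detection, a multi-label classification task that identifies which objects are referred to in the audio; 2. an Audio-Guided Attention module; 3. synthesized audio descriptions for ScanRefer, Sr3D, and Nr3D to support benchmarking.

Method: Rather than treating speech as a monolithic input, the task is decomposed into two complementary components: Object Mention Detection for more structured audio-scene reasoning, and Audio-Guided Attention to capture interactions between candidate objects and relational speech cues.

Result: Audio-3DVG achieves new state-of-the-art performance in audio-based grounding and competes with text-based methods.

Insight: Spoken language is a promising input modality for 3D vision tasks.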

Abstract: 3D Visual Grounding (3DVG) involves localizing target objects in 3D point clouds based on natural language. While prior work has made strides using textual descriptions, leveraging spoken language, known as Audio-based 3D Visual Grounding, remains underexplored and challenging. Motivated by advances in automatic speech recognition (ASR) and speech representation learning, we propose Audio-3DVG, a simple yet effective framework that integrates audio and spatial information for enhanced grounding. Rather than treating speech as a monolithic input, we decompose the task into two complementary components. First, we introduce Object Mention Detection, a multi-label classification task that explicitly identifies which objects are referred to in the audio, enabling more structured audio-scene reasoning. Second, we propose an Audio-Guided Attention module that captures interactions between candidate objects and relational speech cues, improving target discrimination in cluttered scenes. To support benchmarking, we synthesize audio descriptions for standard 3DVG datasets, including ScanRefer, Sr3D, and Nr3D. Experimental results demonstrate that Audio-3DVG not only achieves new state-of-the-art performance in audio-based grounding, but also competes with text-based methods, highlighting the promise of integrating spoken language into 3D vision tasks.

cs.SD [Back]

[121] MuteSwap: Silent Face-based Voice Conversion

Yifan Liu,Yu Fang,Zhouhan Lin

Main category: cs.SD

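TL;DR: MuteSwap is a method for Silent Face-based Voice Conversion (SFVC) that generates the target speaker's speech from visual input alone, removing conventional voice conversion's dependence on audio input.

Details Motivation: Conventional voice conversion relies on clean audio input and is infeasible when audio is missing or noisy, as in silent videos. A solution that relies only on visual input is needed.

Contribution: Introduces the SFVC task, which performs voice conversion from visual input alone, and designs the MuteSwap framework, which uses contrastive learning to align cross-modal identities and minimizes mutual information to separate shared visual features.

Method: Contrastive learning aligns visual and speech identity features, and mutual information minimization separates speech content from identity features. The framework relies entirely on visual input and therefore suits audio-free settings.

Result: Experiments show that MuteSwap performs well in both speech synthesis and identity conversion, and that under noisy conditions it outperforms baselines that depend on audio input.

Insight: Visual information is sufficient to support speech content generation and identity conversion, which opens a direction for speech processing in audio-free settings. Cross-modal contrastive learning is an effective tool for tasks of this kind.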

Abstract: Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which performs voice conversion entirely from visual inputs: given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech that matches the identity of the target speaker while preserving the speech content in the source silent video. As this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and minimizes mutual information to separate shared visual features. Experimental results show that MuteSwap achieves impressive performance in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC.
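
To make the idea of aligning identities across modalities concrete, here is a minimal sketch of a contrastive objective of the InfoNCE family: each face embedding should be more similar to its own speaker's voice embedding than to the other voices in the batch. The choice of InfoNCE, the cosine similarity, the temperature, and the two-dimensional toy embeddings are assumptions; the abstract does not name the exact loss MuteSwap uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(face_embs, voice_embs, temperature=0.1):
    """Mean cross-entropy of picking the matching voice for each face."""
    losses = []
    for i, face in enumerate(face_embs):
        logits = [cosine(face, voice) / temperature for voice in voice_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

faces = [(1.0, 0.0), (0.0, 1.0)]      # hypothetical face identity embeddings
aligned = [(0.9, 0.1), (0.1, 0.9)]    # voices that match the faces in order
swapped = [(0.1, 0.9), (0.9, 0.1)]    # the same voices paired with the wrong faces

print(info_nce(faces, aligned) < info_nce(faces, swapped))  # True
```

The loss is small when matching pairs are the most similar ones in the batch and large when they are not, which is the pressure that pulls the two modalities into a shared identity space.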

cs.GR [Back]

[122] FreNBRDF: A Frequency-Rectified Neural Material Representation

Chenliang Zhou,Zheyuan Hu,Cengiz Oztireli

Main category: cs.GR

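TL;DR: FreNBRDF is a frequency-rectified neural material representation that brings frequency-domain analysis into neural BRDF modeling via spherical harmonics, improving the accuracy and robustness of material reconstruction and editing.

Details Motivation: Traditional approaches rely on tabulated BRDF data, and recent work has shifted to implicit neural representations, whose frequency-domain behavior remains poorly understood. The authors propose FreNBRDF to address this gap and improve the accuracy and adaptability of material modeling.

Contribution: 1. FreNBRDF, a frequency-rectified neural material representation; 2. a new loss function based on frequency-domain analysis; 3. a generalizable reconstruction and editing pipeline that improves the fidelity and efficiency of material modeling.

Method: 1. Use spherical harmonics to bring frequency-domain analysis into neural BRDF modeling; 2. introduce a frequency-rectified loss; 3. build a generalizable and adaptive reconstruction and editing framework.

Result: Experiments show that FreNBRDF outperforms existing methods on material appearance reconstruction and editing, improving accuracy and robustness and supporting more structured, interpretable downstream applications.

Insight: Frequency-domain analysis matters for neural material modeling, and frequency rectification can markedly improve model performance and interpretability.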

Abstract: Accurate material modeling is crucial for achieving photorealistic rendering, bridging the gap between computer-generated imagery and real-world photographs. While traditional approaches rely on tabulated BRDF data, recent work has shifted towards implicit neural representations, which offer compact and flexible frameworks for a range of tasks. However, their behavior in the frequency domain remains poorly understood. To address this, we introduce FreNBRDF, a frequency-rectified neural material representation. By leveraging spherical harmonics, we integrate frequency-domain considerations into neural BRDF modeling. We propose a novel frequency-rectified loss, derived from a frequency analysis of neural materials, and incorporate it into a generalizable and adaptive reconstruction and editing pipeline. This framework enhances fidelity, adaptability, and efficiency. Extensive experiments demonstrate that FreNBRDF improves the accuracy and robustness of material appearance reconstruction and editing compared to state-of-the-art baselines, enabling more structured and interpretable downstream tasks and applications.

cs.MA [Back]

[123] State and Memory is All You Need for Robust and Reliable AI Agents

Matthew Muhoberac,Atharva Parikh,Nirvi Vakharia,Saniya Virani,Aco Radujevic,Savannah Wood,Meghav Verma,Dimitri Metaxotos,Jeyaraman Soundararajan,Thierry Masquelin,Alexander G. Godfrey,Sean Gardner,Dobrila Rudnicki,Sam Michael,Gaurav Chopra

Main category: cs.MA

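TL;DR: The paper proposes SciBORG, a framework that constructs LLM agents dynamically and equips them with finite-state automata (FSA) memory, addressing memory, planning, and tool-integration problems in complex scientific workflows and enabling reliable task execution.

Details Motivation: The use of LLMs in complex scientific workflows is limited by challenges in memory, planning, and tool integration, so a robust and reliable agent framework is needed.

Contribution: Proposes SciBORG, which combines dynamic agent construction with FSA memory to provide context-aware decision-making and state tracking, offering a generalizable foundation for AI agents in complex environments.

Method: Agents are built dynamically on top of LLMs and augmented with FSA memory for persistent state tracking and context-aware decision-making. The approach requires no manual prompt engineering and supports multi-step task execution and error recovery.

Result: Experiments show that SciBORG achieves reliable execution, adaptive planning, and interpretable state transitions on physical and virtual hardware. Multi-step bioassay tasks and synthesis reactions validate its effectiveness.

Insight: Memory and state awareness are key to agent planning and reliability, and SciBORG offers a scalable solution for deploying AI agents in complex environments.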

Abstract: Large language models (LLMs) have enabled powerful advances in natural language understanding and generation. Yet their application to complex, real-world scientific workflows remains limited by challenges in memory, planning, and tool integration. Here, we introduce SciBORG (Scientific Bespoke Artificial Intelligence Agents Optimized for Research Goals), a modular agentic framework that allows LLM-based agents to autonomously plan, reason, and achieve robust and reliable domain-specific task execution. Agents are constructed dynamically from source code documentation and augmented with finite-state automata (FSA) memory, enabling persistent state tracking and context-aware decision-making. This approach eliminates the need for manual prompt engineering and allows for robust, scalable deployment across diverse applications by maintaining context across extended workflows and recovering from tool or execution failures. We validate SciBORG through integration with both physical and virtual hardware, such as microwave synthesizers that execute user-specified reactions with context-aware decision-making, and demonstrate its use in autonomous multi-step bioassay retrieval from the PubChem database, which draws on multi-step planning, reasoning, and agent-to-agent communication and coordination to execute exploratory tasks. Systematic benchmarking shows that SciBORG agents achieve reliable execution, adaptive planning, and interpretable state transitions. Our results show that memory and state awareness are critical enablers of agentic planning and reliability, offering a generalizable foundation for deploying AI agents in complex environments.
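
The role a finite-state automaton can play as agent memory is sketched below: the automaton stores the current state, lists which actions are valid next, and rejects anything else, so an agent cannot, for example, start heating before a vial is loaded. The class `FsaMemory`, the state names, and the action names are hypothetical and are not taken from SciBORG.

```python
class FsaMemory:
    """Tiny finite-state automaton that records state and rejects bad actions."""

    def __init__(self, transitions, start):
        self.transitions = transitions  # {(state, action): next_state}
        self.state = start
        self.history = [start]          # persistent trace of visited states

    def allowed(self):
        """Actions that are valid in the current state."""
        return sorted(a for (s, a) in self.transitions if s == self.state)

    def step(self, action):
        """Apply an action, or raise if it is invalid in the current state."""
        key = (self.state, action)
        if key not in self.transitions:
            raise ValueError(f"{action!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        self.history.append(self.state)
        return self.state

# Hypothetical instrument states and actions, for illustration only.
TRANSITIONS = {
    ("idle", "load_vial"): "loaded",
    ("loaded", "start_heating"): "heating",
    ("heating", "stop"): "loaded",
    ("loaded", "unload_vial"): "idle",
}

memory = FsaMemory(TRANSITIONS, "idle")
memory.step("load_vial")
memory.step("start_heating")
print(memory.state, memory.allowed())  # heating ['stop']
```

Exposing `allowed()` to the planner is one way to keep decisions context-aware, and the recorded `history` gives a readable trace of state transitions after a failure.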

cs.RO [Back]

[124] Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding

Tao Lin,Gen Li,Yilei Zhong,Yanwen Zou,Bo Zhao

Main category: cs.RO

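TL;DR: Evo-0 adds a plug-and-play module that implicitly injects 3D geometry features into Vision-Language-Action (VLA) models using an off-the-shelf visual geometry foundation model, improving performance on spatially demanding tasks without explicit 3D inputs.

Details Motivation: VLA models build on pretrained Vision-Language Models (VLMs), which are tuned mainly on 2D image-text pairs without 3D supervision and so lack precise spatial understanding. Explicit 3D inputs such as point clouds or depth maps require additional depth sensors or error-prone estimation.

Contribution: 1. A plug-and-play module that implicitly injects 3D geometry features into VLA models; 2. five spatially challenging tasks designed to test precise spatial understanding.

Method: Features from an off-the-shelf visual geometry foundation model are injected into the VLA model, avoiding explicit 3D inputs.

Result: Extensive evaluations show that the method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.

Insight: Implicit geometry features can supply spatial understanding to VLA models without depth sensors or explicit 3D inputs.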

Abstract: Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models (VLMs), which excel at semantic understanding due to large-scale text pretraining. However, VLMs typically lack precise spatial understanding capabilities, as they are primarily tuned on 2D image-text pairs without 3D supervision. To address this limitation, recent approaches have incorporated explicit 3D inputs such as point clouds or depth maps, but this requires additional depth sensors or error-prone estimation. In contrast, our work introduces a plug-and-play module that implicitly injects 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation model. We design five spatially challenging tasks that require precise spatial understanding to validate the effectiveness of our method. Extensive evaluations show that our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.

[125] Stable Tracking of Eye Gaze Direction During Ophthalmic Surgery

Tinghe Hong,Shenlin Cai,Boyang Li,Kai Huang

Main category: cs.RO

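TL;DR: The paper proposes an eye localization and tracking method for ophthalmic surgical robots that combines machine learning with traditional algorithms, giving stable iris detection and gaze estimation without landmarks under varying lighting and shadow conditions.

Details Motivation: Preoperative navigation for ophthalmic surgical robots still relies heavily on manual operation, which limits consistency and increases uncertainty. Existing gaze estimation techniques depend on additional sensors, suffer from occlusion in surgical environments, and require facial detection.

Contribution: An eye localization and tracking method that removes the need for landmarks and keeps iris detection and gaze estimation stable under varying lighting and shadow conditions.

Method: Combines machine learning with traditional algorithms for eye localization, iris detection, and gaze estimation; the estimated orientation then drives the movement of the robotic arm.

Result: Real-world experiments show an average error of 0.58 degrees for eye orientation estimation and an average control error of 2.08 degrees for the robotic arm's movement based on the calculated orientation.

Insight: Combining learned and traditional components can give stable gaze tracking in surgical settings without additional sensors or facial detection.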

Abstract: Ophthalmic surgical robots offer superior stability and precision by reducing the natural hand tremors of human surgeons, enabling delicate operations in confined surgical spaces. Despite the advancements in developing vision- and force-based control methods for surgical robots, preoperative navigation remains heavily reliant on manual operation, limiting consistency and increasing uncertainty. Existing eye gaze estimation techniques in surgery, whether traditional or deep learning-based, face challenges including dependence on additional sensors, occlusion issues in surgical environments, and the requirement for facial detection. To address these limitations, this study proposes an innovative eye localization and tracking method that combines machine learning with traditional algorithms, eliminating the requirements of landmarks and maintaining stable iris detection and gaze estimation under varying lighting and shadow conditions. Extensive real-world experiments show that our proposed method has an average estimation error of 0.58 degrees for eye orientation estimation and an average control error of 2.08 degrees for the robotic arm’s movement based on the calculated orientation.

[126] RaGNNarok: A Light-Weight Graph Neural Network for Enhancing Radar Point Clouds on Unmanned Ground Vehicles

David Hunt,Shaocheng Luo,Spencer Hallyburton,Shafii Nillongo,Yi Li,Tingjun Chen,Miroslav Pajic

Main category: cs.RO

TL;DR: RaGNNarok proposes a lightweight GNN-based framework to enhance radar point clouds for low-cost indoor robots, addressing sparse data and noise issues, with efficient real-time performance on resource-constrained devices.

Details Motivation: Existing lidar and camera-based solutions for indoor robots have limitations in obscured environments and high costs, while radar sensors are cost-effective but suffer from noisy and sparse data.

Contribution: A real-time, lightweight GNN (RaGNNarok) is introduced to enhance radar point clouds, achieving robustness and generalizability in dynamic environments with minimal computational overhead.

Method: RaGNNarok uses a graph neural network (GNN) to process radar data, improving point cloud quality by filtering noise and false detections while maintaining low inference time (7.3 ms).

Result: The framework performs reliably in localization, SLAM, and navigation across diverse environments, demonstrating strong generalizability and efficiency on devices like Raspberry Pi 5.

Insight: GNNs can effectively enhance radar data for robotics tasks, opening new possibilities for low-cost, radar-based solutions in resource-constrained settings.

Abstract: Low-cost indoor mobile robots have gained popularity with the increasing adoption of automation in homes and commercial spaces. However, existing lidar and camera-based solutions have limitations such as poor performance in visually obscured environments, high computational overhead for data processing, and high costs for lidars. In contrast, mmWave radar sensors offer a cost-effective and lightweight alternative, providing accurate ranging regardless of visibility. However, existing radar-based localization suffers from sparse point cloud generation, noise, and false detections. Thus, in this work, we introduce RaGNNarok, a real-time, lightweight, and generalizable graph neural network (GNN)-based framework to enhance radar point clouds, even in complex and dynamic environments. With an inference time of just 7.3 ms on the low-cost Raspberry Pi 5, RaGNNarok runs efficiently even on such resource-constrained devices, requiring no additional computational resources. We evaluate its performance across key tasks, including localization, SLAM, and autonomous navigation, in three different environments. Our results demonstrate strong reliability and generalizability, making RaGNNarok a robust solution for low-cost indoor mobile robots.

[127] Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Shivansh Patel,Shraddhaa Mohan,Hanlin Mai,Unnat Jain,Svetlana Lazebnik,Yunzhu Li

Main category: cs.RO

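TL;DR: The paper proposes RIGVid, a system that performs complex robotic manipulation tasks by imitating AI-generated videos, without physical demonstrations or robot-specific training. It combines a video diffusion model with a vision-language model (VLM) to generate and filter videos, extracts object trajectories with 6D pose tracking, and executes them on the robot. Experiments show that generated videos are as effective as real demonstrations and that performance improves with generation quality.

Details Motivation: Robotic manipulation usually depends on large numbers of physical demonstrations or dedicated robot training data, which are costly and hard to scale. The paper explores how off-the-shelf AI video generation can provide an efficient supervision signal for robots and remove the dependence on physical demonstrations.

Contribution: 1. The RIGVid system, which completes robotic manipulation tasks by imitating AI-generated videos; 2. video generation and filtering that combine a video diffusion model with a VLM; 3. evidence that 6D pose tracking is the stronger way to extract object trajectories; 4. experiments showing that supervision from generated videos is as effective as real demonstrations.

Method: 1. Generate task demonstration videos with a video diffusion model; 2. filter out videos that do not follow the language command using a VLM; 3. extract object trajectories with 6D pose tracking; 4. retarget the trajectories to the robot for execution.

Result: Generated videos are as effective as real demonstrations for robotic manipulation, and performance improves with generation quality. Trajectory extraction by 6D pose tracking outperforms alternatives such as dense feature point tracking.

Insight: Off-the-shelf video generation models can provide an effective supervision signal for robots and reduce reliance on physical demonstrations, and 6D pose tracking is a key enabling component.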

Abstract: This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks, such as pouring, wiping, and mixing, purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.

[128] VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

Yating Wang,Haoyi Zhu,Mingyu Liu,Jiange Yang,Hao-Shu Fang,Tong He

Main category: cs.RO

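TL;DR: This paper proposes a vector-quantized action tokenizer (VQ-VLA) trained on a large-scale action trajectory dataset, with over 100 times more data than previous approaches, to capture rich spatiotemporal dynamics and produce smoother, more coherent action outputs. The tokenizer adapts to downstream tasks in a zero-shot manner, and experiments show that synthetic data improves performance on real-world tasks.

Details Motivation: Existing action tokenizers are trained on small-scale data and cannot fully capture complex spatiotemporal dynamics. Using a much larger dataset, the paper aims to improve the efficiency and quality of action generation and to test how well synthetic data generalizes to real-world tasks.

Contribution: 1. The VQ-VLA action tokenizer, built on a large-scale action trajectory dataset and supporting efficient zero-shot adaptation; 2. the finding that the domain gap between synthetic and real action trajectories is marginal, so synthetic data can be used extensively to improve performance; 3. experiments showing that scaling the data significantly raises downstream success rates (up to 30% on real-world tasks).

Method: 1. Train a vector-quantized action tokenizer on a very large action trajectory dataset; 2. generate smooth, coherent actions by capturing spatiotemporal dynamics; 3. adapt zero-shot to a range of downstream tasks, from short-horizon reactive behaviors to long-horizon planning.

Result: Performance improves significantly as the volume of synthetic data increases, most notably with an up to 30% higher success rate on real-world tasks in long-horizon scenarios.

Insight: Synthetic data has strong potential for action generation: it can compensate for scarce real data while preserving generalization to real-world tasks.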

Abstract: In this paper, we introduce an innovative vector-quantization-based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly, most notably achieving up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application domains. Project website: https://xiaoxiao0406.github.io/vqvla.github.io