
cs.CL

[1] Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models

Lionel Wong,Katherine M. Collins,Lance Ying,Cedegao E. Zhang,Adrian Weller,Tobias Gersternberg,Timothy O’Donnell,Alexander K. Lew,Jacob D. Andreas,Joshua B. Tenenbaum,Tyler Brooke-Wilson

Main category: cs.CL

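TL;DR: The paper proposes a "Model Synthesis Architecture" (MSA) that combines distributed and symbolic representations, using language models and probabilistic programs to model human-like open-world reasoning.

Motivation: People facing novel situations flexibly draw on background knowledge to make inferences and predictions, a capacity existing models lack. The work explores how distributed and symbolic representations can be combined to achieve comparable open-world reasoning.

Contribution: Proposes the MSA, which pairs the global, relevance-based retrieval of language models with the bespoke modeling of probabilistic programs.

Method: Language models perform global relevance-based retrieval and model synthesis; probabilistic programs implement bespoke, coherent world models. The approach is evaluated against human judgments on a novel reasoning dataset built around a "Model Olympics" domain of sports vignettes.

Result: The MSA captures human judgments better than language-model-only baselines, under both direct and chain-of-thought generation from the LM that supports model synthesis.

Insight: MSAs offer a viable path toward understanding and replicating human open-world reasoning, and underline the value of combining distributed and symbolic representations.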

Abstract: When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea – a “Model Synthesis Architecture” (MSA) – using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset – built around a Model Olympics domain of sports vignettes – tests models’ capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people’s ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.
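To make the idea of a bespoke probabilistic world model more concrete, the sketch below shows the kind of small program such an architecture might synthesize for a tug-of-war style vignette. Everything in it is an illustrative assumption rather than the authors' model: the athlete names, the Gaussian strength prior, the 20% laziness rate, and the use of plain rejection sampling for inference.

```python
import random

def sample_world(rng):
    """Draw latent strengths for two hypothetical athletes (assumed prior)."""
    return {"alice": rng.gauss(50, 10), "bob": rng.gauss(50, 10)}

def match_winner(world, rng):
    """Simulate one match; each athlete slacks off with an assumed 20% chance."""
    def effort(name):
        lazy = rng.random() < 0.2
        return world[name] * (0.5 if lazy else 1.0)
    return "alice" if effort("alice") > effort("bob") else "bob"

def posterior_mean_strength(name, observed_winners, n_samples=20000, seed=0):
    """Rejection sampling: keep only worlds whose simulated matches
    reproduce every observed winner, then average the kept strengths."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        world = sample_world(rng)
        if all(match_winner(world, rng) == w for w in observed_winners):
            kept.append(world[name])
    return sum(kept) / len(kept)

# With no observations the estimate stays near the prior mean of 50;
# observing two wins should pull the estimate for "alice" upward.
prior_mean = posterior_mean_strength("alice", [])
after_wins = posterior_mean_strength("alice", ["alice", "alice"])
```

The point of the sketch is the division of labor: a program like this is cheap to query coherently once it exists, and in the architecture described in the abstract a language model would be responsible for writing it on demand.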

[2] The first open machine translation system for the Chechen language

Abu-Viskhan A. Umishov,Vladislav A. Grigorian

Main category: cs.CL

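TL;DR: Introduces the first open-source machine translation system between Chechen and Russian, together with the dataset used to train and evaluate it. Chechen is added to the multilingual translation model NLLB-200 by fine-tuning.

Motivation: Chechen is a vulnerable language with no open-source translation tools. The work fills this gap by building the first Chechen-Russian machine translation system, supporting the language's preservation and digital development.

Contribution: 1. Releases the first open-source Chechen-Russian machine translation model; 2. Provides parallel corpora for training and evaluation; 3. Explores how fine-tuning can add a new language to a large multilingual translation model.

Method: Fine-tunes NLLB-200 to include Chechen, training and evaluating on the collected parallel corpora.

Result: The model scores 8.34 / 34.69 BLEU / ChrF++ for Russian to Chechen and 20.89 / 44.55 for the reverse direction. Parallel corpora and a multilingual sentence encoder adapted to Chechen are released alongside it.

Insight: Fine-tuning a large multilingual model is an effective way to support translation for a vulnerable language, and open-source tools play an important role in language preservation and digitization.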

Abstract: We introduce the first open-source model for translation between the vulnerable Chechen language and Russian, and the dataset collected to train and evaluate it. We explore fine-tuning as a way to add a new language to NLLB-200, a large multilingual translation model. The BLEU / ChrF++ scores for our model are 8.34 / 34.69 and 20.89 / 44.55 for translation from Russian to Chechen and reverse direction, respectively. The release of the translation models is accompanied by parallel corpora of words, phrases, and sentences and by a multilingual sentence encoder adapted to the Chechen language.

[3] AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis

S M Rafiuddin,Sadia Kamal,Mohammed Rakib,Arunkumar Bagavathi,Atriya Sen

Main category: cs.CL

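TL;DR: AdaptiSent is a new framework for multimodal aspect-based sentiment analysis that uses adaptive cross-modal attention to improve sentiment classification and aspect term extraction. It weights the text and image modalities dynamically and adds context-adaptive attention.

Motivation: Existing multimodal sentiment analysis methods often pay little attention to the dynamic interaction between textual and visual context, so complex information is handled inadequately. AdaptiSent addresses this with adaptive attention.

Contribution: Proposes the AdaptiSent framework, which combines dynamic modality weighting with context-adaptive attention to improve multimodal sentiment analysis and aspect term extraction.

Method: The model uses adaptive cross-modal attention that adjusts weights according to context and extracts interaction information between the text and visual modalities.

Result: On standard Twitter datasets, AdaptiSent outperforms baseline models in precision, recall, and F1 score, and is particularly effective on complex inter-modal relationships.

Insight: Adaptive attention captures interactions in multimodal data more flexibly, giving sentiment analysis a more accurate view of context.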

Abstract: We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA) that uses adaptive cross-modal attention mechanisms to improve sentiment classification and aspect term extraction from both text and images. Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information by focusing on how textual cues and visual context interact. We tested our approach against several baselines, including traditional text-based models and other multimodal methods. Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score, and is particularly effective in identifying nuanced inter-modal relationships that are crucial for accurate sentiment and aspect term extraction. This effectiveness comes from the model’s ability to adjust its focus dynamically based on the context’s relevance, improving the depth and accuracy of sentiment analysis across various multimodal data sets. AdaptiSent sets a new standard for MABSA, significantly outperforming current methods, especially in understanding complex multimodal information.

[4] TransEvalnia: Reasoning-based Evaluation and Ranking of Translations

Richard Sproat,Tianyu Zhao,Llion Jones

Main category: cs.CL

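TL;DR: TransEvalnia is a prompting-based translation evaluation and ranking system that reasons about its judgments, produces fine-grained evaluations, and performs as well as or better than the state-of-the-art MT-Ranker.

Motivation: Existing translation evaluation methods lack fine-grained analysis and reasoning; TransEvalnia uses the reasoning ability of LLMs to improve the process.

Contribution: Proposes TransEvalnia, which combines a subset of the Multidimensional Quality Metrics with LLM reasoning to produce evaluations that human raters find highly acceptable.

Method: Uses LLMs such as Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as evaluators, and proposes methods to address position bias.

Result: Performs as well as or better than MT-Ranker on the authors' English-Japanese data and on several WMT language pairs; the assigned scores correlate well with those of human raters.

Insight: Translation evaluation systems are sensitive to the order in which translations are presented, so position bias must be handled; LLM reasoning can improve evaluation quality.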

Abstract: We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provides numerical scores for the various dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al. 2024) on our own English-Japanese data as well as several language pairs from various WMT shared tasks. Using Anthropic’s Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable to human raters, and that the scores assigned to the translations by Sonnet, as well as other LLMs, correlate well with scores assigned by the human raters. We also note the sensitivity of our system – as well as MT-Ranker – to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system’s evaluation and reasoning, human assessments, as well as code is released.
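The abstract does not say which methods the authors use against position bias. One common generic mitigation, shown here purely as an illustrative sketch, is to query the judge with both presentation orders and average the scores each translation receives. The `Judge` signature, the toy judge, and the quality values are all hypothetical stand-ins for an LLM call.

```python
from typing import Callable, Tuple

# Hypothetical judge: takes (source, first_shown, second_shown) and returns
# one score per translation, in presentation order.
Judge = Callable[[str, str, str], Tuple[float, float]]

def debiased_compare(judge: Judge, source: str, trans_a: str, trans_b: str):
    """Score both presentation orders and average per translation."""
    a1, b1 = judge(source, trans_a, trans_b)   # A shown first
    b2, a2 = judge(source, trans_b, trans_a)   # B shown first
    score_a = (a1 + a2) / 2
    score_b = (b1 + b2) / 2
    if score_a > score_b:
        winner = "A"
    elif score_b > score_a:
        winner = "B"
    else:
        winner = "tie"
    return winner, score_a, score_b

# Toy judge with a built-in position bias: whatever is shown first
# gets a +1.0 bonus on top of its (made-up) true quality.
TRUE_QUALITY = {"good translation": 4.0, "weak translation": 3.5}

def biased_toy_judge(source, first, second):
    return TRUE_QUALITY[first] + 1.0, TRUE_QUALITY[second]

# A single query with the weak translation shown first prefers it ...
weak_first, good_second = biased_toy_judge("src", "weak translation", "good translation")
# ... while the order-averaged comparison recovers the better one.
winner, score_a, score_b = debiased_compare(
    biased_toy_judge, "src", "good translation", "weak translation"
)
```

Averaging over both orders doubles the number of judge calls, which is the usual price of this kind of mitigation.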

[5] Strategy Adaptation in Large Language Model Werewolf Agents

Fuya Nakamori,Yin Jou Huang,Fei Cheng

Main category: cs.CL

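TL;DR: The study improves Werewolf agents by switching between predefined strategies based on other players' attitudes and the conversation context, and verifies the method's effectiveness.

Motivation: Existing prompt-engineered Werewolf agents define effective strategies only implicitly and cannot adapt to changing game situations. A method is needed that explicitly selects a strategy from the game context and the estimated roles of other players.

Contribution: An explicit strategy selection method that lets Werewolf agents adapt their strategy to the game context and the estimated roles of other players.

Method: The agent analyzes other players' attitudes and the conversation context, explicitly selects a predefined strategy, and is compared with baseline agents that use implicit or fixed strategies.

Result: Experiments show that the strategy-adaptation agents outperform the baseline agents.

Insight: In dynamic conversational settings, explicit strategy selection adapts better than implicit or fixed strategies, which may inform similar multi-agent interaction scenarios.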

Abstract: This study proposes a method to improve the performance of Werewolf agents by switching between predefined strategies based on the attitudes of other players and the context of conversations. While prior works of Werewolf agents using prompt engineering have employed methods where effective strategies are implicitly defined, they cannot adapt to changing situations. In this research, we propose a method that explicitly selects an appropriate strategy based on the game context and the estimated roles of other players. We compare the strategy adaptation Werewolf agents with baseline agents using implicit or fixed strategies and verify the effectiveness of our proposed method.

[6] Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Yunxiang Zhang,Muhammad Khalifa,Lechen Zhang,Xin Liu,Ayoung Lee,Xinliang Frederick Zhang,Farima Fatahi Bayat,Lu Wang

Main category: cs.CL

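TL;DR: Proposes ThinkLogit, a training-free decoding-time method that uses logit arithmetic to steer a large model toward long reasoning, with substantial pass@1 gains.

Motivation: Recent studies suggest that some models inherently possess long reasoning abilities that can be unlocked with extra training; whether this behavior can be elicited without any training has been unclear.

Contribution: 1. ThinkLogit, in which a small guider model steers a large target model toward long reasoning; 2. ThinkLogit-DPO, which further improves performance by training the guider with preference optimization; 3. Experiments showing relative pass@1 gains of 26% and 29%.

Method: Applies logit arithmetic at decoding time, with a small model (R1-Distill-Qwen-1.5B) guiding a large one (Qwen2.5-32B). ThinkLogit-DPO trains the guider with preference optimization over correct/incorrect reasoning pairs sampled from both the target and the guider model.

Result: ThinkLogit and ThinkLogit-DPO improve pass@1 by 26% and 29% relative, respectively, over four mathematical datasets. ThinkLogit can also transfer long reasoning skills acquired through reinforcement learning, a 13% relative pass@1 gain over the Qwen2.5-32B base model.

Insight: Long reasoning can be elicited in large models with minimal or no additional training, which offers a computationally efficient alternative.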

Abstract: Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work first investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logits arithmetic (Liu et al., 2024) to tune a target large LM for long reasoning using a substantially smaller model as guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model – a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in pass@1 by 26% and 29%, respectively, over four mathematical datasets using the Qwen2.5-32B when guided by R1-Distill-Qwen-1.5B – a model 21x smaller. Lastly, we show that ThinkLogit can transfer long reasoning skills acquired through reinforcement learning, improving pass@1 by 13% relative compared to the Qwen2.5-32B base model. Our work presents a computationally-efficient method to elicit long reasoning in large models with minimal or no additional training.
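As a rough illustration of decoding-time logit arithmetic, the sketch below adds to the target model's next-token logits a scaled difference between a small guider and a small base counterpart. This specific form, the `alpha` weight, the presence of a separate small base model, and the three-token vocabulary with made-up logits are all assumptions for illustration; the abstract only states that a much smaller guider steers the target.

```python
import math

def combine_logits(target, guider, guider_base, alpha=1.0):
    """Assumed form: target + alpha * (guider - guider_base), per token."""
    return [t + alpha * (g - b) for t, g, b in zip(target, guider, guider_base)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy next-token logits over a hypothetical three-token vocabulary.
vocab = ["Answer", "Wait", "So"]
target_logits = [2.0, 0.5, 1.0]        # large model alone prefers "Answer"
guider_logits = [0.5, 2.5, 1.0]        # small long-reasoning model prefers "Wait"
guider_base_logits = [1.5, 0.0, 1.0]   # small model without long reasoning

steered = combine_logits(target_logits, guider_logits, guider_base_logits)
probs = softmax(steered)
next_token = vocab[probs.index(max(probs))]  # greedy pick after steering
```

In this toy case the steering shifts the greedy choice from "Answer" to "Wait", the kind of backtracking cue associated with long chain-of-thought. In a real system the three logit vectors would come from three forward passes sharing one vocabulary at every decoding step.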

[7] Synergy: End-to-end Concept Model

Keli Zheng,Zerong Xie

Main category: cs.CL

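TL;DR: Synergy is an end-to-end language model that bridges levels of abstraction through a learned routing mechanism. It spontaneously learns to tokenize bytes, performs comparably to BBPE tokenizers, and shows an advantage over Llama3 at the same model scale and training dataset size.

Motivation: The rigidity and limitations of conventional tokenizers such as BBPE motivate tokenizer-free architectures that allow more flexible and robust language models.

Contribution: 1. Proposes Synergy, trained end-to-end as a byte-level language model; 2. Shows that the model spontaneously learns to tokenize, producing fewer concept tokens than BBPE tokenizers; 3. Finds that the middle (higher-abstraction) part of the model performs better without positional encodings, suggesting that position-independent concepts emerge.

Method: 1. Trains Synergy as a byte-level language model; 2. Uses a learned routing mechanism to connect levels of abstraction; 3. Compares Synergy with BBPE tokenizers and Llama3.

Result: Synergy shows an advantage over Llama3 under the same conditions and tokenizes without an explicit tokenizer; the higher-abstraction part of the model performs better when positional encodings are removed.

Insight: End-to-end tokenizer-free architectures are feasible, and the emergence of position-independent concepts suggests new directions for more flexible model design.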

Abstract: In this paper, we present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion through a learned routing mechanism. Focusing on low-level linguistic abstraction, we trained our model as a byte-level language model. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers while keeping comparable performance. By comparing with Llama3, we observed an advantage of Synergy under the same model scale and training dataset size. Further studies show that the middle part (the higher abstraction part) of our model performs better when positional encodings are removed, suggesting the emergence of position-independent concepts. These findings demonstrate the feasibility of tokenizer-free architectures, paving the way for more robust and flexible pipelines.

[8] Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?

Xi Ai,Mahardika Krisna Ihsani,Min-Yen Kan

Main category: cs.CL

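TL;DR: Studies the cross-lingual consistency of factual knowledge in multilingual models, finds that consistency varies across languages, and shows that code-switching training and cross-lingual word alignment objectives are the most promising ways to improve both consistency and performance.

Motivation: Cross-lingual consistency matters for assessing cross-lingual transferability, maintaining the factuality of model knowledge across languages, and preserving parity of language model performance.

Contribution: Analyzes cross-lingual knowledge consistency in multilingual models, identifies factors that affect it, and evaluates strategies for improving it.

Method: Uses code-mixed coreferential statements and interpretability methods to analyze model behavior, and evaluates common strategies aimed at improving multilingual performance.

Result: Multilingual models show different levels of knowledge consistency across languages; code-switching training and cross-lingual word alignment objectives give the most promising results.

Insight: Cross-lingual alignment supervision and code-switching training improve both multilingual performance and knowledge consistency.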

Abstract: Cross-lingual consistency should be considered to assess cross-lingual transferability, maintain the factuality of the model knowledge across languages, and preserve the parity of language model performance. We are thus interested in analyzing, evaluating, and interpreting cross-lingual consistency for factual knowledge. We examine code-mixed coreferential statements that convey identical knowledge across languages to study cross-lingual knowledge consistency. We use interpretability approaches to analyze the behavior of a model in cross-lingual contexts, discovering that multilingual models show different levels of consistency, subject to language families and linguistic factors, and that a particular layer acts as a bottleneck for cross-lingual consistency. In addition, we evaluate common strategies aimed at improving multilingual performance to observe whether these strategies can improve knowledge consistency at the same time. While knowledge is not cross-lingually consistent in many cases, code-switching training and cross-lingual word alignment objectives show the most promising results, emphasizing the value of cross-lingual alignment supervision and code-switching training for enhancing both multilingual performance and cross-lingual consistency.

[9] Making Language Model a Hierarchical Classifier and Generator

Yihong Wang,Zhonglin Jiang,Ningyuan Xi,Yue Zhao,Qingqing Gu,Xiyuan Chen,Hao Wu,Sheng Xu,Hange Zhou,Yong Chen,Luo Ji

Main category: cs.CL

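TL;DR: Proposes a hierarchical decoder architecture that decodes text at several intermediate layers simultaneously, inspired by human hierarchical thinking, and validates that a pretrained language model can be adapted into this form for classification and generation tasks.

Motivation: Inspired by humans' capacity for hierarchical thinking, the authors build a hierarchical decoder architecture to improve language models on classification and generation tasks.

Contribution: 1) A hierarchical decoder architecture; 2) Experiments showing that intermediate layers can generate meaningful content; 3) State-of-the-art performance on hierarchical text classification, classification-guided generation, and hierarchical text generation.

Method: The language heads of the last layer are copied to selected intermediate layers and fine-tuned with different task inputs.

Result: Intermediate layers can produce meaningful and reasonable content, and the approach reaches state-of-the-art performance on multiple tasks.

Insight: The study suggests that a generalized hierarchical reasoner pretrained from scratch may be possible.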

Abstract: Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by humans’ hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers, and fine-tuned with different task inputs. Through thorough experiments, we validate that these selected intermediate layers can be adapted to produce meaningful and reasonable content, and that this hierarchical decoder paradigm obtains state-of-the-art performance on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner pretrained from scratch.

[10] MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps

Maximiliano Hormazábal Lagos,Álvaro Bueno Sáez,Héctor Cerezo-Costas,Pedro Alonso Doval,Jorge Alcalde Vesteiro

Main category: cs.CL

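TL;DR: Presents the authors' solution for the IberLEF 2025 PRESTA task: a multi-step approach in which LLMs generate Python code to extract answers from tables, reaching 85% accuracy.

Motivation: PRESTA requires answering questions about tables in Spanish; the authors aim to extract answers efficiently and accurately through automated code generation and data processing.

Contribution: A multi-step LLM-based approach covering table analysis, column selection, natural-language instruction generation, translation to code and execution, and error handling.

Method: Uses open-source LLMs with fine-grained optimized prompts for each step, processing the table in stages: understanding its content, generating code, and running it.

Result: Achieves 85% accuracy on the PRESTA task.

Insight: Multi-step LLM code generation can handle question answering over complex tables, and prompt optimization matters for performance.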

Abstract: This paper presents our approach for the IberLEF 2025 Task PRESTA: Preguntas y Respuestas sobre Tablas en Español (Questions and Answers about Tables in Spanish). Our solution obtains answers to the questions by implementing Python code generation with LLMs that is used to filter and process the table. This solution evolves from the MRT implementation for the Semeval 2025 related task. The process consists of multiple steps: analyzing and understanding the content of the table, selecting the useful columns, generating instructions in natural language, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained optimized prompts for each step. With this approach, we achieved an accuracy score of 85% in the task.
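The multi-step process listed in the abstract can be sketched as a small control loop. This is a hedged illustration, not the MRT implementation: the `llm(step, prompt)` interface, the prompt strings, the retry policy, and the stub model that returns canned responses are all hypothetical, and a real system would sandbox the generated code instead of calling `exec` directly.

```python
from typing import Callable, Dict, List

Table = List[Dict[str, object]]

def answer_question(table: Table, question: str,
                    llm: Callable[[str, str], str], max_retries: int = 2):
    """Select columns, plan in natural language, generate code, run it, retry on errors."""
    columns = list(table[0].keys())
    # Steps 1-2: show the table schema and ask which columns are useful.
    selected = llm("select_columns", f"columns={columns}; question={question}").split(",")
    # Step 3: natural-language instructions.
    plan = llm("plan", f"columns={selected}; question={question}")
    error = ""
    for _ in range(max_retries + 1):
        # Step 4: translate the instructions to Python code.
        code = llm("code", f"plan={plan}; previous_error={error}")
        scope = {"table": table}
        try:
            exec(code, scope)        # Step 5: run it (unsandboxed in this sketch)
            return scope["answer"]
        except Exception as exc:     # Step 6: feed the error back and retry
            error = repr(exc)
    return None

def stub_llm(step: str, prompt: str) -> str:
    """Canned responses standing in for an open-source LLM."""
    if step == "select_columns":
        return "ciudad,poblacion"
    if step == "plan":
        return "Return the ciudad with the largest poblacion."
    if prompt.endswith("previous_error="):
        # First attempt: deliberately wrong column name to exercise the retry.
        return "answer = max(table, key=lambda r: r['population'])['ciudad']"
    return "answer = max(table, key=lambda r: r['poblacion'])['ciudad']"

table = [
    {"ciudad": "Madrid", "poblacion": 3300000},
    {"ciudad": "Vigo", "poblacion": 295000},
]
result = answer_question(table, "¿Qué ciudad tiene más población?", stub_llm)
```

Here the first generated snippet fails with a `KeyError`, the error text is passed back into the next code-generation prompt, and the second attempt succeeds.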

[11] Formalizing Attack Scenario Description: A Proposed Model

Quentin Goux,Nadira Lammari

Main category: cs.CL

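TL;DR: Proposes a formal model for describing attack scenarios, intended to support automated cybersecurity processes such as generating attack simulation scripts and analyzing attacks.

Motivation: Facing an ever-changing threat landscape, organizations need to automate cybersecurity processes, and automation requires formalized input data. The paper addresses this need for processes that take attack scenarios as input.

Contribution: 1. A novel formal model that describes an attack scenario and its context, abstracted as a UML class model; 2. Two use cases showing the model applied to attack analysis and to automatic generation of attack scripts.

Method: Abstracts and formalizes the attack scenario and its context as a UML class model, and demonstrates its practical use through two use cases.

Result: The proposed model supports attack analysis and automatic script generation, making it useful for cybersecurity training and analysis.

Insight: Formalizing attack scenario descriptions is a key step toward cybersecurity automation, and an abstract UML model is a workable way to do it.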

Abstract: Organizations face an ever-changing threat landscape. They must continuously dedicate significant efforts to protect their assets, making their adoption of increased cybersecurity automation inevitable. However, process automation requires formalization of input data. Through this paper, we address this need for processes that use attack scenarios as input. Among these processes, one can mention both the generation of scripts for attack simulation and training purposes, as well as the analysis of attacks. Therefore, the paper’s main research contribution is a novel formal model that encompasses the attack’s context description and its scenario. It is abstracted using a UML class model. Once our model has been described, we show how it could serve an upstream attack analysis process. We also show its use for the automatic generation of attack scripts in the context of cybersecurity training. These two use cases constitute the second contribution of this research work.

[12] GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems

Jisoo Lee,Raeyoung Chang,Dongwook Kwon,Harmanpreet Singh,Nikhil Verma

Main category: cs.CL

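TL;DR: GEMMAS is a graph-based evaluation framework that models interactions in multi-agent systems as a directed acyclic graph and proposes two process-level metrics, Information Diversity Score and Unnecessary Path Ratio, to assess collaboration quality, exposing the limits of outcome-only evaluation.

Motivation: Existing evaluations of multi-agent systems look only at the correctness of the final output and overlook how inefficient communication and poor coordination lead to redundant reasoning and higher computational costs.

Contribution: Proposes the GEMMAS framework, which uses graph modeling and two new metrics (IDS and UPR) to quantify semantic diversity and redundant paths in the collaboration process.

Method: Models agent interactions as a directed acyclic graph and defines the Information Diversity Score (IDS) and the Unnecessary Path Ratio (UPR) to evaluate the collaboration process.

Result: Across five benchmarks including GSM8K, GEMMAS reveals differences in collaboration quality: on GSM8K, systems only 2.1% apart in accuracy differ by 12.8% in IDS and 80% in UPR.

Insight: Process-level evaluation is important for designing efficient, interpretable collaborative AI systems; outcome-only evaluation can hide problems in internal collaboration.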

Abstract: Multi-agent systems built on language models have shown strong performance on collaborative reasoning tasks. However, existing evaluations focus only on the correctness of the final output, overlooking how inefficient communication and poor coordination contribute to redundant reasoning and higher computational costs. We introduce GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process-level metrics: Information Diversity Score (IDS) to measure semantic variation in inter-agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration. These findings demonstrate that outcome-only metrics are insufficient for evaluating multi-agent performance and highlight the importance of process-level diagnostics in designing more interpretable and resource-efficient collaborative AI systems.
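The abstract names the two metrics but does not give their formulas, so the sketch below uses simplified stand-ins to show what process-level measurement on an interaction graph can look like. Both definitions are assumptions: diversity is approximated by mean pairwise Jaccard distance over word sets (a real system would more likely compare embeddings), and a path counts as unnecessary if it visits any node outside a caller-supplied set of useful nodes. The agent names and messages are invented.

```python
from itertools import combinations

def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(sa & sb) / len(sa | sb)

def diversity_score(messages):
    """Toy proxy for message diversity: mean pairwise Jaccard distance."""
    pairs = list(combinations(messages, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def all_paths(graph, start, goal):
    """Enumerate every path from start to goal in a DAG given as adjacency lists."""
    if start == goal:
        return [[start]]
    return [[start] + rest
            for nxt in graph.get(start, [])
            for rest in all_paths(graph, nxt, goal)]

def unnecessary_path_ratio(graph, start, goal, useful_nodes):
    """Toy proxy: share of start-to-goal paths that visit a non-useful node."""
    paths = all_paths(graph, start, goal)
    wasted = [p for p in paths if not all(n in useful_nodes for n in p)]
    return len(wasted) / len(paths)

# Hypothetical three-agent run: the critic's branch added nothing.
graph = {"task": ["solver", "critic"], "solver": ["judge"], "critic": ["judge"]}
messages = ["x is 4", "x is 4", "check the units"]

ids_proxy = diversity_score(messages)
upr_proxy = unnecessary_path_ratio(graph, "task", "judge", {"task", "solver", "judge"})
```

Two systems could reach the same final answer on this task while differing sharply on numbers like these, which is the gap between outcome-only and process-level evaluation that the abstract highlights.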

[13] HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

Ashray Gupta,Rohan Joseph,Sunny Rai

Main category: cs.CL

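TL;DR: Introduces HATS (Hindi Analogy Test Set), a set of 405 multiple-choice questions for evaluating analogical reasoning by large language models in Hindi. The authors evaluate multilingual models with several prompting strategies and propose a grounded Chain of Thought approach based on cognitive theories, which improves performance. Models perform best with English prompts.

Motivation: LLM reasoning has been widely evaluated in English, but abilities in Indic languages remain understudied. HATS addresses this gap with a resource for evaluating analogical reasoning in Hindi.

Contribution: 1) Introduces HATS, a new Hindi analogy test set; 2) Proposes a grounded Chain of Thought approach based on cognitive theories of analogical reasoning that improves model performance; 3) Shows that English prompts work best regardless of prompting strategy.

Method: 1) Collects 405 Hindi analogy questions from Indian government exams; 2) Benchmarks multilingual models with various prompting strategies; 3) Proposes a grounded Chain of Thought approach to improve model reasoning.

Result: Models perform best with English prompts, irrespective of the prompting strategy. The proposed Chain of Thought approach improves performance on Hindi analogy questions.

Insight: Reasoning ability does not transfer evenly across languages: English prompts retain an advantage, while cognitively grounded prompting can improve performance in a specific language. HATS provides a benchmark for future work.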

Abstract: Analogies test a model’s ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

[14] Automating Steering for Safe Multimodal Large Language Models

Lyucheng Wu,Mengru Wang,Ziwen Xu,Tri Cao,Nay Oo,Bryan Hooi,Shumin Deng

Main category: cs.CL

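TL;DR: Proposes AutoSteer, a modular and adaptive inference-time intervention technique that improves the safety of multimodal large language models (MLLMs) without fine-tuning the underlying model. AutoSteer uses a Safety Awareness Score and selective intervention to reduce attack success rates.

Motivation: MLLMs have strong cross-modal reasoning abilities but are exposed to adversarial multimodal inputs; an intervention that improves safety without fine-tuning is needed.

Contribution: Proposes AutoSteer, which consists of a Safety Awareness Score (SAS), an adaptive safety prober, and a lightweight Refusal Head, and significantly reduces the success rate of multimodal attacks.

Method: AutoSteer identifies the most safety-relevant internal layers, estimates the likelihood of toxic outputs from intermediate representations, and selectively intervenes in generation when a risk is detected.

Result: On LLaVA-OV and Chameleon across diverse safety-critical benchmarks, AutoSteer significantly reduces the Attack Success Rate for textual, visual, and cross-modal threats while maintaining general abilities.

Insight: AutoSteer offers a practical, interpretable, and effective framework for safer deployment of multimodal AI systems and shows the potential of inference-time intervention.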

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model’s internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.

[15] QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

Jiazheng Li,Hong Lu,Kaiyue Wen,Zaiwen Yang,Jiaxuan Gao,Hongzhou Lin,Yi Wu,Jingzhao Zhang

Main category: cs.CL

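TL;DR: Proposes QuestA, a question augmentation method that expands the reasoning capacity of large language models and yields clear gains on math reasoning tasks.

Motivation: Existing reinforcement learning methods have limited effect on multi-step reasoning, especially for hard problems, so a more effective way of improving the training signal is needed.

Contribution: Proposes the question augmentation method QuestA, which introduces partial solutions to reduce problem difficulty and provide more informative learning signals.

Method: During RL training, questions are augmented with partial solutions, which lowers reasoning difficulty and improves training efficiency.

Result: Sets new state-of-the-art results on math benchmarks with 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25.

Insight: Question augmentation improves sample efficiency as well as reasoning capability, offering a generalizable pathway for applying RL to reasoning tasks.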

Abstract: Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, not only improves pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
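The core idea of introducing partial solutions can be illustrated with a small prompt-building helper. This is a sketch under stated assumptions: the prompt template, the `hint_fraction` parameter, and the representation of a reference solution as a list of steps are hypothetical, and the abstract does not describe how the authors format or schedule their hints.

```python
def augment_question(question: str, reference_solution_steps, hint_fraction: float) -> str:
    """Prepend the first part of a reference solution to the question.

    hint_fraction is the assumed share of solution steps revealed as a hint;
    0.0 leaves the original question unchanged.
    """
    n_hint = int(len(reference_solution_steps) * hint_fraction)
    if n_hint == 0:
        return question
    hint = "\n".join(reference_solution_steps[:n_hint])
    return f"{question}\n\nPartial solution:\n{hint}\n\nContinue from here."

question = "Find the sum of all positive integers n such that n^2 + 3n + 2 is prime."
steps = [
    "Step 1: Factor n^2 + 3n + 2 = (n + 1)(n + 2).",
    "Step 2: For n >= 1 both factors exceed 1.",
    "Step 3: So the product is never prime.",
    "Step 4: The sum over the empty set is 0.",
]

easy_prompt = augment_question(question, steps, hint_fraction=0.5)  # reveals steps 1-2
hard_prompt = augment_question(question, steps, hint_fraction=0.0)  # original question
```

Revealing part of the solution makes a correct completion more likely, so a hard problem that would otherwise yield no reward can still produce a learning signal during RL.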

[16] HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals

Guimin Hu,Daniel Hershcovich,Hasti Seifi

Main category: cs.CL

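TL;DR: Introduces HapticCap, the first fully human-annotated multimodal haptic dataset for matching user descriptions to vibration haptic signals, and proposes a haptic-caption retrieval task on which a contrastive learning framework gives the best results.

Motivation: Designing haptic signals that resonate meaningfully with users is hard because large text-annotated haptic datasets are lacking and existing tasks and models have limited ability to describe vibration signals in text.

Contribution: 1. Creates HapticCap, the first fully human-annotated haptic-captioned dataset (92,070 haptic-text pairs); 2. Proposes the haptic-caption retrieval task; 3. Reports model performance under a contrastive learning framework.

Method: Uses a supervised contrastive learning framework that combines the language model T5 with the audio model AST, trained separately for each description category.

Result: The combination of T5 and AST performs best on haptic-caption retrieval, especially when trained separately per description category.

Insight: Multimodal models (language plus audio) show promise for understanding haptic signals, and per-category training further improves performance.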

Abstract: Haptic signals, from smartphone vibrations to virtual reality touch feedback, can effectively convey information and enhance realism, but designing signals that resonate meaningfully with users is challenging. To facilitate this, we introduce a multimodal dataset and task, of matching user descriptions to vibration haptic signals, and highlight two primary challenges: (1) lack of large haptic vibration datasets annotated with textual descriptions as collecting haptic descriptions is time-consuming, and (2) limited capability of existing tasks and models to describe vibration signals in text. To advance this area, we create HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs for user descriptions of sensory, emotional, and associative attributes of vibrations. Based on HapticCap, we propose the haptic-caption retrieval task and present the results of this task from a supervised contrastive learning framework that brings together text representations within specific categories and vibrations. Overall, the combination of language model T5 and audio model AST yields the best performance in the haptic-caption retrieval task, especially when separately trained for each description category.

[17] Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

Yulu Qin,Dheeraj Varghese,Adam Dahlgren Lindström,Lucia Donatelli,Kanishka Misra,Najoung Kim

Main category: cs.CL

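TL;DR: Examines whether vision-and-language (VL) training meaningfully changes the linguistic representations of language models, and finds that VL training leaves taxonomic knowledge itself largely unchanged but improves the model's ability to deploy it in text-only tasks.

Motivation: Investigates whether joint vision-and-language training fundamentally alters a language model's taxonomic knowledge, and in particular how it affects the deployment of that knowledge in text-only tasks.

Contribution: Finds that VL training changes the taxonomic knowledge itself very little but improves the model's ability to use it in text-only tasks, notably question answering.

Method: Compares minimal pairs of text-only models and their VL-trained counterparts, analyzing behavioral and representational differences on questions involving taxonomic versus non-taxonomic relations.

Result: VL models often outperform text-only models on a question-answering task that requires taxonomic understanding, yet the two do not differ significantly in taxonomic knowledge itself; they differ in how they represent questions containing taxonomically related concepts.

Insight: VL training improves how taxonomic knowledge is deployed for a task rather than the knowledge itself, so multimodal training can benefit performance even when the task is presented purely in text.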

Abstract: Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.

[18] The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Zhouqi Hua,Wenwei Zhang,Chengqi Lyu,Yuzhe Gu,Songyang Gao,Kuikun Liu,Kai Chen

Main category: cs.CL

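TL;DR: Proposes Turing MAchine Imitation Learning (TAIL), which generates chain-of-thought (CoT) data by imitating the execution process of a Turing Machine and significantly improves the length generalization of Transformer-based large language models (LLMs).

Motivation: Existing methods are mostly data-driven and task-specific, with limited performance and generality. The paper starts from computable problems, that is, problems a Turing Machine can solve, to pursue a more general solution.

Contribution: 1. Proposes TAIL, which uses programs to generate CoT data that imitate Turing Machine execution; 2. Builds a synthetic dataset covering 8 classes of algorithms and 18 tasks; 3. Significantly improves length generalization and task performance, surpassing previous methods and DeepSeek-R1.

Method: TAIL synthesizes CoT data with programs, linearly expanding the reasoning steps into atomic states to alleviate shortcut learning and using an explicit memory fetch mechanism to ease dynamic, long-range data access.

Result: Using only synthetic data, TAIL significantly improves the length generalization and task performance of Qwen2.5-7B, and the model's attention layers exhibit read-and-write behaviors consistent with a Turing Machine.

Insight: The key concepts of the Turing Machine, rather than thinking styles, are what matter for length generalization, which points to a promising direction for learning LLM reasoning from synthetic data.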

Abstract: Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can thus be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine, linearly expanding the reasoning steps into atomic states to alleviate shortcut learning and adding an explicit memory fetch mechanism to reduce the difficulty of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, instead of the thinking styles, are indispensable for TAIL for length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in its attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
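To illustrate what program-synthesized reasoning traces that imitate a Turing Machine might look like, the sketch below simulates a tiny machine that increments a binary number and records one line per atomic step, with the symbol read from the tape stated explicitly. The task, the transition table, and the trace format are illustrative assumptions and are not taken from the TAIL dataset.

```python
BLANK = "_"

# (state, symbol_read) -> (symbol_to_write, head_move, next_state)
RULES = {
    ("carry", "1"): ("0", -1, "carry"),   # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("1", 0, "halt"),     # 0 + carry = 1, done
    ("carry", BLANK): ("1", 0, "halt"),   # ran off the left edge: new leading 1
}

def run_and_trace(bits: str):
    """Increment a binary string and return (result, step-by-step trace)."""
    tape = dict(enumerate(bits))
    head, state = len(bits) - 1, "carry"   # start at the least significant bit
    trace = []
    while state != "halt":
        read = tape.get(head, BLANK)       # explicit fetch of the current cell
        write, move, next_state = RULES[(state, read)]
        trace.append(
            f"state={state} head={head} read={read} "
            f"-> write={write} move={move} next={next_state}"
        )
        tape[head] = write
        head += move
        state = next_state
    result = "".join(tape[i] for i in sorted(tape))
    return result, trace

result, trace = run_and_trace("1011")   # 11 + 1 = 12 -> "1100"
cot_text = "\n".join(trace)             # one atomic step per line
```

Because every step has the same shape regardless of input length, a longer input only produces a longer trace of identical atomic steps, which is the property that makes this style of data a natural fit for studying length generalization.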

[19] A Survey of Context Engineering for Large Language Models

Lingrui Mei,Jiayu Yao,Yuyao Ge,Yiwei Wang,Baolong Bi,Yujun Cai,Jiazhi Liu,Mingyu Li,Zhong-Zhi Li,Duzhen Zhang,Chenlin Zhou,Jiayi Mao,Tianze Xia,Jiafeng Guo,Shenghua Liu

Main category: cs.CL

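TL;DR: This survey introduces Context Engineering as an emerging discipline, systematically reviews its role in large language models (LLMs), and presents a taxonomy built from over 1,300 papers. It finds that current models understand complex contexts well but fall short at generating equally sophisticated long-form output.

Details Motivation: LLM performance depends heavily on the contextual information supplied at inference time, yet context optimization has lacked systematic study. The survey aims to fill this gap with a unified framework for Context Engineering research and practice.

Contribution: 1. Gives a formal definition and taxonomy of Context Engineering; 2. Analyzes its three foundational components (context retrieval and generation, context processing, context management) and their system implementations (such as RAG and multi-agent systems); 3. Identifies an asymmetry between models' ability to understand complex contexts and their ability to generate comparably sophisticated output.

Method: Through a systematic analysis of over 1,300 papers, the authors decompose Context Engineering into two levels, foundational components and system implementations, build a taxonomy, and summarize the limitations of current techniques.

Result: Modern LLMs are highly proficient at understanding complex contexts but show pronounced limitations when generating sophisticated long-form content. This asymmetry is identified as a key direction for future research.

Insight: Context Engineering is central to LLM performance, but current techniques favor understanding over generation. Progress on generating complex long-form output is needed for more balanced model capabilities.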

Abstract: The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

[20] Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

Tyler Loakman,William Thorne,Chenghua Lin

Main category: cs.CL

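TL;DR: This paper examines how well large language models (LLMs) understand different forms of humour and finds that current models cannot reliably explain every joke type, especially more complex humour that requires external world knowledge.

Details Motivation: Humour is a complex form of language, yet existing work focuses almost exclusively on short pun-based jokes and neglects more complex humour types. This paper aims to fill that gap.

Contribution: 1. A dataset of 600 jokes across 4 joke types, with manually written high-quality explanations; 2. A comparison of the zero-shot ability of a range of LLMs to explain different types of humour.

Method: 1. Collect and annotate jokes of different humour forms; 2. Test the zero-shot explanation ability of a range of LLMs; 3. Manually evaluate the quality of the generated explanations.

Result: None of the tested models (including reasoning models) reliably produce adequate explanations for all joke types, particularly jokes that depend on external world knowledge.

Insight: Computational humour research needs to extend to more complex humour forms; current LLMs underperform on humour that requires world knowledge.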

Abstract: Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond “common sense”, rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.

cs.CV [Back]

[21] Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering

Maximiliano Hormazábal Lagos,Héctor Cerezo-Costas,Dimosthenis Karatzas

Main category: cs.CV

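TL;DR: EaGERS is a training-free, model-agnostic pipeline that generates natural language rationales and grounds them spatially, improving both accuracy and transparency on DocVQA.

Details Motivation: To improve the explainability and performance of vision language models on Document Visual Question Answering (DocVQA) without additional model fine-tuning.

Contribution: Proposes the EaGERS pipeline, which generates natural language rationales, grounds them to spatial sub-regions, and generates responses from the masked image, improving the model's accuracy and transparency.

Method: A vision language model generates rationales, which are grounded to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting; response generation is then restricted to the selected regions of the masked image.

Result: Outperforms the base model on the DocVQA dataset in both exact match accuracy and Average Normalized Levenshtein Similarity (ANLS).

Insight: Spatially grounded explanations improve not only performance but also transparency and reproducibility.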

Abstract: We introduce EaGERS, a fully training-free and model-agnostic pipeline that (1) generates natural language rationales via a vision language model, (2) grounds these rationales to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting, and (3) restricts the generation of responses only from the relevant regions selected in the masked image. Experiments on the DocVQA dataset demonstrate that our best configuration not only outperforms the base model on exact match accuracy and Average Normalized Levenshtein Similarity metrics but also enhances transparency and reproducibility in DocVQA without additional model fine-tuning.
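
For readers unfamiliar with the Average Normalized Levenshtein Similarity metric named above, here is a minimal sketch of ANLS as it is commonly defined for DocVQA-style evaluation. The 0.5 threshold and the lowercase/strip normalization are assumptions based on the usual convention, not details taken from this paper, and the function names are hypothetical.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            # Similarity counts only when the normalized distance is below threshold.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


print(anls(["Paris", "kitten"], [["paris"], ["sitting"]]))
```

Unlike exact match, this score gives partial credit for near misses such as OCR slips, while answers that are mostly wrong score zero.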

[22] MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Yuncong Yang,Jiageng Liu,Zheyuan Zhang,Siyuan Zhou,Reuben Tan,Jianwei Yang,Yilun Du,Chuang Gan

Main category: cs.CV

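TL;DR: MindJourney couples a vision-language model (VLM) with a video-diffusion-based world model to address VLMs' weakness in 3D spatial reasoning, improving performance without any fine-tuning.

Details Motivation: Current VLMs lack an internal model of 3D dynamics, so they struggle even with simple tasks such as anticipating how a scene will look after a viewpoint change. A way to strengthen their 3D reasoning is needed.

Contribution: Proposes the MindJourney framework, which pairs a VLM with a world model to improve 3D spatial reasoning without fine-tuning.

Method: The VLM iteratively sketches a camera trajectory, the world model synthesizes the corresponding view at each step, and the VLM then reasons over the resulting multi-view evidence.

Result: An average performance gain of over 8% on the spatial reasoning benchmark SAT; the method also improves upon test-time inference VLMs trained with reinforcement learning.

Insight: Pairing a VLM with a world model at test time is a simple, plug-and-play way to strengthen 3D reasoning.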

Abstract: Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.
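
The explore-then-reason loop described in the abstract can be sketched as plain control flow. This is an illustrative outline only: `explore_then_answer` and the three callables are hypothetical stand-ins for the VLM and the world model, images are represented as strings, and the stopping rule is an assumption.

```python
from typing import Callable, List, Optional


def explore_then_answer(
    image: str,
    question: str,
    propose_move: Callable[[str, List[str], List[str]], Optional[str]],
    imagine_view: Callable[[str, List[str]], str],
    answer: Callable[[str, List[str]], str],
    max_steps: int = 3,
) -> str:
    """Gather imagined views along a short camera trajectory, then answer."""
    views = [image]             # evidence starts with the real observation
    trajectory: List[str] = []  # camera moves chosen so far
    for _ in range(max_steps):
        move = propose_move(question, views, trajectory)  # VLM role
        if move is None:        # assumed convention: None means stop exploring
            break
        trajectory.append(move)
        views.append(imagine_view(image, list(trajectory)))  # world model role
    return answer(question, views)  # VLM reasons over all views


# Toy stand-ins so the sketch runs without any model.
def toy_propose(question: str, views: List[str], trajectory: List[str]) -> Optional[str]:
    return "turn_left" if len(trajectory) < 2 else None


def toy_imagine(image: str, trajectory: List[str]) -> str:
    return f"{image} after {'+'.join(trajectory)}"


def toy_answer(question: str, views: List[str]) -> str:
    return f"answered from {len(views)} views"


result = explore_then_answer(
    "kitchen.png", "What is left of the fridge?",
    toy_propose, toy_imagine, toy_answer,
)
print(result)
```

Because both models are passed in as callables, neither needs to be retrained, which mirrors the plug-and-play framing of the abstract.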

[23] Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Gen Luo,Wenhan Dou,Wenhao Li,Zhaokai Wang,Xue Yang,Changyao Tian,Hao Li,Weiyun Wang,Wenhai Wang,Xizhou Zhu,Yu Qiao,Jifeng Dai

Main category: cs.CV

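TL;DR: This paper presents the monolithic multimodal large language model (MLLM) Mono-InternVL and its improved version, Mono-InternVL-1.5. By embedding a visual parameter space into a pre-trained LLM and using a new pre-training strategy, they address the unstable optimization and catastrophic forgetting of existing monolithic MLLMs while reducing training and inference costs.

Details Motivation: Existing monolithic MLLMs suffer from unstable optimization and catastrophic forgetting during pre-training, which limits performance and raises cost. The authors address this by embedding a visual parameter space and improving the pre-training strategy to raise both performance and efficiency.

Contribution: 1. Proposes Mono-InternVL and its improved version Mono-InternVL-1.5, which learn visual knowledge stably through delta tuning and a visual-expert architecture. 2. Designs Endogenous Visual Pre-training (EViP) and its improved version EViP++, which build visual capability through progressive learning. 3. Evaluates on 15 benchmarks: Mono-InternVL outperforms existing monolithic MLLMs on 12 of them, and Mono-InternVL-1.5 keeps competitive performance while significantly reducing training and inference costs.

Method: 1. Embed a new visual parameter space into a pre-trained LLM and learn visual knowledge stably from noisy data via delta tuning. 2. Incorporate visual experts through a multimodal mixture-of-experts (MoE) architecture. 3. Design the EViP and EViP++ pre-training strategies, using progressive learning and, in EViP++, additional visual attention experts. 4. Speed up inference with a fused CUDA kernel for MoE operations.

Result: 1. Mono-InternVL outperforms existing monolithic MLLMs on 12 of 15 benchmarks, for example by 114 points over Emu3 on OCRBench. 2. Mono-InternVL-1.5 significantly reduces training and inference costs while maintaining competitive performance, cutting first-token latency by up to 69% compared with its modular counterpart, InternVL-1.5.

Insight: 1. Monolithic MLLMs can overcome unstable optimization and catastrophic forgetting by embedding a visual parameter space and improving the pre-training strategy. 2. Progressive learning and efficiency-oriented design can significantly reduce computational cost.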

Abstract: This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.

[24] Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows

Judy Long,Tao Liu,Sean Alexander Woznicki,Miljana Marković,Oskar Marko,Molly Sears

Main category: cs.CV

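TL;DR: This paper comprehensively evaluates workflows for large-scale, pixel-wise crop mapping, covering both conventional supervised learning and transfer learning, and distills best practices for preprocessing, model selection, training sample size, and transfer strategy.

Details Motivation: Crop mapping is a core task in agricultural remote sensing, but large-scale pixel-wise methods have lacked systematic evaluation, and the applicability and effectiveness of transfer learning in particular remain unclear.

Contribution: 1. Systematically compares preprocessing methods and supervised models; 2. Evaluates transfer learning under different magnitudes of domain shift; 3. Offers guidance for choosing a workflow based on sample size and domain shift.

Method: Experiments compare six preprocessing methods and eleven supervised classification models, and study transfer learning techniques (such as domain adaptation and fine-tuning) under varying sample sizes and domain shifts.

Result: 1. Transformer models paired with fine-scale interval preprocessing perform best; 2. Among transfer learning techniques, UDA is effective for homogeneous crop classes, while fine-tuning is robust across diverse scenarios; 3. With enough labeled samples, supervised training is more accurate; below a certain threshold, transfer learning is a viable alternative.

Insight: The choice of crop mapping workflow should account for both sample size and domain shift. Transfer learning can compensate for scarce labels, but the technique must match the scenario.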

Abstract: Crop mapping involves identifying and classifying crop types using spatial data, primarily derived from remote sensing imagery. This study presents the first comprehensive review of large-scale, pixel-wise crop mapping workflows, encompassing both conventional supervised methods and emerging transfer learning approaches. To identify the optimal supervised crop mapping workflows, we conducted systematic experiments, comparing six widely adopted satellite image-based preprocessing methods, alongside eleven supervised pixel-wise classification models. Additionally, we assessed the synergistic impact of varied training sample sizes and variable combinations. Moreover, we identified optimal transfer learning techniques for different magnitudes of domain shift. The evaluation of best methods was conducted across five diverse agricultural sites. Landsat 8 served as the primary satellite data source. Labels come from CDL trusted pixels and field surveys. Our findings reveal three key insights. First, fine-scale interval preprocessing paired with Transformer models consistently delivered optimal performance for both supervised and transferable workflows. RF offered rapid training and competitive performance in conventional supervised learning and direct transfer to similar domains. Second, transfer learning techniques enhanced workflow adaptability, with UDA being effective for homogeneous crop classes while fine-tuning remains robust across diverse scenarios. Finally, workflow choice depends heavily on the availability of labeled samples. With a sufficient sample size, supervised training typically delivers more accurate and generalizable results. Below a certain threshold, transfer learning that matches the level of domain shift is a viable alternative to achieve crop mapping. Repository: Best-Practices-for-Large-Scale-Pixel-Wise-Crop-Mapping-and-Transfer-Learning-Workflows

[25] Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection

Sandipan Sarma,Agney Talwarr,Arijit Sur

Main category: cs.CV

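TL;DR: Funnel-HOI proposes a top-down framework that mines human-object interaction (HOI)-specific cues at the encoder stage and incorporates zero-shot capabilities, improving interaction detection.

Details Motivation: Existing methods focus mainly on decoder design and overlook the value of capturing HOI-specific cues at the encoder stage. Observing that people grasp well-defined concepts (such as objects) first and then associate them with abstract concepts (such as actions) when understanding a scene, the authors propose an analogous top-down framework.

Contribution: 1. The Funnel-HOI framework, which mines HOI cues at the encoder stage; 2. A novel asymmetric co-attention mechanism that uses multimodal information to strengthen interaction representations; 3. A novel loss that accounts for object-action relatedness and regulates the misclassification penalty better than existing losses.

Method: Works top-down: first probe the image for objects, then for the actions associated with them. An asymmetric co-attention mechanism combines multimodal information to yield stronger interaction representations at the encoder level, and the new loss guides training of the interaction classifier.

Result: On HICO-DET and V-COCO, Funnel-HOI achieves state-of-the-art performance in fully supervised and zero-shot settings, with gains of up to 12.4% and 8.4% for unseen and rare HOI categories, respectively.

Insight: Explicitly mining HOI cues at the encoder stage improves interaction detection, especially in zero-shot scenarios, and the top-down design more closely mirrors human cognition.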

Abstract: Human-object interaction detection (HOID) refers to localizing interactive human-object pairs in images and identifying the interactions. Since there could be an exponential number of object-action combinations, labeled data is limited, leading to a long-tail distribution problem. Recently, zero-shot learning emerged as a solution, with end-to-end transformer-based object detectors adapted for HOID becoming successful frameworks. However, their primary focus is designing improved decoders for learning entangled or disentangled interpretations of interactions. We advocate that HOI-specific cues must be anticipated at the encoder stage itself to obtain a stronger scene interpretation. Consequently, we build a top-down framework named Funnel-HOI inspired by the human tendency to grasp well-defined concepts first and then associate them with abstract concepts during scene understanding. We first probe an image for the presence of objects (well-defined concepts) and then probe for actions (abstract concepts) associated with them. A novel asymmetric co-attention mechanism mines these cues utilizing multimodal information (incorporating zero-shot capabilities) and yields stronger interaction representations at the encoder level. Furthermore, a novel loss is devised that considers object-action relatedness and regulates misclassification penalty better than existing loss functions for guiding the interaction classifier. Extensive experiments on the HICO-DET and V-COCO datasets across fully-supervised and six zero-shot settings reveal our state-of-the-art performance, with up to 12.4% and 8.4% gains for unseen and rare HOI categories, respectively.

[26] Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos

Kaihua Chen,Tarasha Khurana,Deva Ramanan

Main category: cs.CV

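TL;DR: This paper proposes a novel-view synthesis method for dynamic scenes that combines 3D reconstruction with a 2D video diffusion model, which is trained with self-supervision and applied zero-shot to new videos via test-time finetuning.

Details Motivation: Prior approaches either rely on costly test-time optimization of 4D representations or fail to preserve scene geometry when trained in a feed-forward manner. This paper seeks a more efficient approach to dynamic novel-view synthesis.

Contribution: 1) Renders covisible pixels in novel views through 3D reconstruction; 2) Proposes a self-supervised 2D video diffusion model (CogNVS) that inpaints hidden regions; 3) Applies the model zero-shot to new videos via test-time finetuning.

Method: Combines 3D reconstruction (for covisible pixels) with a 2D video diffusion model (for hidden pixels) trained with self-supervision on a large corpus of videos, and adapts to each new video by finetuning at test time.

Result: Experiments show that the method outperforms almost all prior approaches for novel-view synthesis of dynamic scenes from monocular videos.

Insight: The self-supervised 2D video diffusion model is key to handling hidden pixels, and test-time finetuning further improves adaptation to unseen videos.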

Abstract: We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (that are visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and rendering the reconstruction from the novel-views and (2) hidden pixels in novel views can be “inpainted” with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows for (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.

[27] Integrated Oculomics and Lipidomics Reveal Microvascular Metabolic Signatures Associated with Cardiovascular Health in a Healthy Cohort

Inamullah,Ernesto Elias Vidal Rosas,Imran Razzak,Shoaib Jameel

Main category: cs.CV

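TL;DR: This paper combines retinal microvascular traits with serum lipidomic data to reveal microvascular metabolic signatures associated with cardiovascular health, and introduces an imaging-omics framework aimed at early detection of cardiovascular disease risk.

Details Motivation: Cardiovascular disease (CVD) is the leading global cause of death, yet current risk stratification methods often miss early, subclinical changes. By integrating retinal microvascular traits with lipidomic data, the study aims to fill a gap in understanding early CVD pathogenesis.

Contribution: Presents the first large-scale, covariate-adjusted and stratified correlation analysis between retinal microvascular traits and serum lipidomic data in a healthy population, suggests a potential metabolic mechanism behind microvascular remodeling, and highlights candidate non-invasive biomarkers.

Method: Combines deep-learning-based image processing with serum lipidomic data: retinal phenotypes are quantified with automated image analysis tools, and lipid profiling is performed by UHPLC ESI HRMS.

Result: Strong correlations, independent of age and sex, were found between average artery width, vessel density, and lipid subclasses such as TAGs, DAGs, and Cers, suggesting a mechanism of microvascular remodeling under metabolic stress.

Insight: Multi-omics integration offers a new route to early detection and personalized prevention of CVD and underscores the role of microvascular and metabolic status in disease development.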

Abstract: Cardiovascular disease (CVD) remains the leading global cause of mortality, yet current risk stratification methods often fail to detect early, subclinical changes. Previous studies have generally not integrated retinal microvasculature characteristics with comprehensive serum lipidomic profiles as potential indicators of CVD risk. In this study, an innovative imaging omics framework was introduced, combining retinal microvascular traits derived through deep learning based image processing with serum lipidomic data to highlight asymptomatic biomarkers of cardiovascular risk beyond the conventional lipid panel. This represents the first large scale, covariate adjusted and stratified correlation analysis conducted in a healthy population, which is essential for identifying early indicators of disease. Retinal phenotypes were quantified using automated image analysis tools, while serum lipid profiling was performed by Ultra High Performance Liquid Chromatography Electrospray ionization High resolution mass spectrometry (UHPLC ESI HRMS). Strong, age- and sex-independent correlations were established, particularly between average artery width, vessel density, and lipid subclasses such as triacylglycerols (TAGs), diacylglycerols (DAGs), and ceramides (Cers). These associations suggest a converging mechanism of microvascular remodeling under metabolic stress. By linking detailed vascular structural phenotypes to specific lipid species, this study fills a critical gap in the understanding of early CVD pathogenesis. This integration not only offers a novel perspective on microvascular metabolic associations but also presents a significant opportunity for the identification of robust, non-invasive biomarkers. Ultimately, these findings may support improved early detection, targeted prevention, and personalized approaches in cardiovascular healthcare.

[28] A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique

Homare Sueyoshi,Kiyoshi Nishikawa,Hitoshi Kiya

Main category: cs.CV

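TL;DR: Proposes a privacy-preserving semantic segmentation method that applies a domain-adaptation technique to the ViT embedding structure, achieving almost the same accuracy on encrypted images as models without encryption.

Details Motivation: To protect image privacy through perceptual encryption without significantly degrading semantic segmentation performance.

Contribution: A privacy-preserving semantic segmentation method that combines a domain-adaptation technique with ViT and maintains high accuracy on encrypted images.

Method: Perceptual encryption is applied to both training and test images, and domain adaptation is performed on the embedding structure of ViT.

Result: Experiments show that segmentation accuracy on encrypted images is almost the same as that of models without encryption.

Insight: Domain adaptation can effectively handle the domain shift between encrypted and natural images.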

Abstract: We propose a privacy-preserving semantic-segmentation method for applying perceptual encryption to images used for model training in addition to test images. This method also provides almost the same accuracy as models without any encryption. The above performance is achieved using a domain-adaptation technique on the embedding structure of the Vision Transformer (ViT). The effectiveness of the proposed method was experimentally confirmed in terms of the accuracy of semantic segmentation when using a powerful semantic-segmentation model with ViT called Segmentation Transformer.

[29] Transformer-based Spatial Grounding: A Comprehensive Survey

Ijazul Haq,Muhammad Saqib,Yingjie Zhang

Main category: cs.CV

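TL;DR: A systematic literature review of transformer-based spatial grounding methods from 2018 to 2025, summarizing model architectures, datasets, evaluation metrics, and methodological trends to guide research and practice.

Details Motivation: Despite rapid progress in transformer-based spatial grounding, the field lacks a comprehensive synthesis of current methods, dataset usage, evaluation metrics, and industrial applicability.

Contribution: A systematic review of transformer-based spatial grounding approaches that summarizes the main architectures, datasets, and evaluation metrics and provides structured guidance for researchers and practitioners.

Method: Literature analysis of work published from 2018 to 2025, identifying the main methodologies, datasets, and evaluation metrics.

Result: Identifies the dominant model architectures, prevalent datasets, and widely adopted evaluation metrics, and highlights methodological trends and best practices.

Insight: The review serves as a reference for developing robust, reliable, industry-ready transformer-based spatial grounding models.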

Abstract: Spatial grounding, the process of associating natural language expressions with corresponding image regions, has rapidly advanced due to the introduction of transformer-based models, significantly enhancing multimodal representation and cross-modal alignment. Despite this progress, the field lacks a comprehensive synthesis of current methodologies, dataset usage, evaluation metrics, and industrial applicability. This paper presents a systematic literature review of transformer-based spatial grounding approaches from 2018 to 2025. Our analysis identifies dominant model architectures, prevalent datasets, and widely adopted evaluation metrics, alongside highlighting key methodological trends and best practices. This study provides essential insights and structured guidance for researchers and practitioners, facilitating the development of robust, reliable, and industry-ready transformer-based spatial grounding models.

[30] Domain-Enhanced Dual-Branch Model for Efficient and Interpretable Accident Anticipation

Yanchen Guan,Haicheng Liao,Chengyue Wang,Bonan Wang,Jiaxun Zhang,Jia Hu,Zhenning Li

Main category: cs.CV

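TL;DR: This paper proposes a dual-branch traffic accident anticipation framework that integrates dashcam video with structured text derived from accident reports, using multimodal feature aggregation through large models (GPT-4o, Long-CLIP) to improve predictive accuracy and interpretability.

Details Motivation: Precise and efficient accident anticipation is crucial for autonomous driving because it enables timely intervention and loss prevention. Existing methods tend to neglect multimodal data integration or lack computational efficiency.

Contribution: 1. A dual-branch architecture that integrates visual and textual data; 2. A multimodal feature aggregation method based on large models; 3. Targeted prompt engineering strategies that improve interpretability and standardize outputs.

Method: 1. Process video and text in separate branches of the dual-branch architecture; 2. Aggregate features using GPT-4o and Long-CLIP; 3. Use prompt engineering to produce standardized accident archives.

Result: On the DAD, CCD, and A3D benchmarks, the method shows higher predictive accuracy, responsiveness, and computational efficiency, along with improved interpretability.

Insight: Integrating multimodal data with large models can improve accident anticipation, and prompt engineering offers a new route to standardized, interpretable outputs.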

Abstract: Developing a precise and computationally efficient traffic accident anticipation system is crucial for contemporary autonomous driving technologies, enabling timely intervention and loss prevention. In this paper, we propose an accident anticipation framework employing a dual-branch architecture that effectively integrates visual information from dashcam videos with structured textual data derived from accident reports. Furthermore, we introduce a feature aggregation method that facilitates seamless integration of multimodal inputs through large models (GPT-4o, Long-CLIP), complemented by targeted prompt engineering strategies to produce actionable feedback and standardized accident archives. Comprehensive evaluations conducted on benchmark datasets (DAD, CCD, and A3D) validate the superior predictive accuracy, enhanced responsiveness, reduced computational overhead, and improved interpretability of our approach, thus establishing a new benchmark for state-of-the-art performance in traffic accident anticipation.

[31] HairShifter: Consistent and High-Fidelity Video Hair Transfer via Anchor-Guided Animation

Wangzheng Shi,Yinglin Zheng,Yuxin Lin,Jianmin Bao,Ming Zeng,Dong Chen

Main category: cs.CV

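TL;DR: HairShifter is a video hair transfer method whose "Anchor Frame + Animation" framework unifies high-quality image hair transfer with coherent video animation, producing high-fidelity, temporally consistent results.

Details Motivation: Video hair transfer has broad applications in social media, gaming, advertising, and entertainment, but existing methods struggle with temporal consistency, spatial fidelity, and dynamic adaptability.

Contribution: The HairShifter framework, which combines an Image Hair Transfer (IHT) module with a Multi-Scale Gated SPADE Decoder to achieve high-fidelity, temporally consistent video hair transfer.

Method: Within the "Anchor Frame + Animation" framework, the IHT module performs precise per-frame transformation and the Multi-Scale Gated SPADE Decoder ensures spatial blending and temporal coherence.

Result: Experiments show that HairShifter achieves state-of-the-art performance in video hairstyle transfer, with high visual quality and temporal consistency.

Insight: The method opens new directions for video hairstyle transfer and provides a strong baseline for the field.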

Abstract: Hair transfer is increasingly valuable across domains such as social media, gaming, advertising, and entertainment. While significant progress has been made in single-image hair transfer, video-based hair transfer remains challenging due to the need for temporal consistency, spatial fidelity, and dynamic adaptability. In this work, we propose HairShifter, a novel “Anchor Frame + Animation” framework that unifies high-quality image hair transfer with smooth and coherent video animation. At its core, HairShifter integrates an Image Hair Transfer (IHT) module for precise per-frame transformation and a Multi-Scale Gated SPADE Decoder to ensure seamless spatial blending and temporal coherence. Our method maintains hairstyle fidelity across frames while preserving non-hair regions. Extensive experiments demonstrate that HairShifter achieves state-of-the-art performance in video hairstyle transfer, combining superior visual quality, temporal consistency, and scalability. The code will be publicly available. We believe this work will open new avenues for video-based hairstyle transfer and establish a robust baseline in this field.

[32] Unified Medical Image Segmentation with State Space Modeling Snake

Ruicheng Zhang,Haowei Guo,Kanghui Tian,Jun Zhou,Mingliang Yan,Zeyu Zhang,Shen Zhao

Main category: cs.CV

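TL;DR: Proposes Mamba Snake, a framework for Unified Medical Image Segmentation (UMIS) that improves segmentation through state space modeling and a hierarchical formulation.

Details Motivation: UMIS is important for anatomical assessment, but multi-scale structural heterogeneity and the limitations of conventional pixel-based approaches make it hard for existing methods to handle morphological complexity and feature conflicts.

Contribution: The Mamba Snake framework, which combines state space modeling with a deep snake (contour evolution) approach to model inter-organ topological relationships and refine fine contours, together with energy map shape priors and a dual-classification synergy mechanism.

Method: Uses hierarchical state space modeling, designs the Mamba Evolution Block (MEB) for spatiotemporal information aggregation, and uses energy maps and dual classification to optimize detection and segmentation.

Result: Experiments on five clinical datasets show that Mamba Snake outperforms existing methods, with an average Dice improvement of 3%.

Insight: Hierarchical modeling and energy map priors handle multi-scale heterogeneity and complex morphology in medical images better, and the dual-classification mechanism reduces under-segmentation of microstructures.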

Abstract: Unified Medical Image Segmentation (UMIS) is critical for comprehensive anatomical assessment but faces challenges due to multi-scale structural heterogeneity. Conventional pixel-based approaches, lacking object-level anatomical insight and inter-organ relational modeling, struggle with morphological complexity and feature conflicts, limiting their efficacy in UMIS. We propose Mamba Snake, a novel deep snake framework enhanced by state space modeling for UMIS. Mamba Snake frames multi-contour evolution as a hierarchical state space atlas, effectively modeling macroscopic inter-organ topological relationships and microscopic contour refinements. We introduce a snake-specific vision state space module, the Mamba Evolution Block (MEB), which leverages effective spatiotemporal information aggregation for adaptive refinement of complex morphologies. Energy map shape priors further ensure robust long-range contour evolution in heterogeneous data. Additionally, a dual-classification synergy mechanism is incorporated to concurrently optimize detection and segmentation, mitigating under-segmentation of microstructures in UMIS. Extensive evaluations across five clinical datasets reveal Mamba Snake’s superior performance, with an average Dice improvement of 3% over state-of-the-art methods.
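
Since the headline result is reported as a Dice improvement, a short sketch of the standard Dice coefficient for binary masks may help. The function name `dice_score` and the choice to return 1.0 when both masks are empty are assumptions for illustration; the paper's evaluation code may handle multi-class masks and empty cases differently.

```python
from typing import Sequence


def dice_score(pred: Sequence[Sequence[int]],
               target: Sequence[Sequence[int]]) -> float:
    """Dice = 2 * |P intersect T| / (|P| + |T|) for two binary 2D masks."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred_mask = [[1, 1],
             [0, 0]]
true_mask = [[1, 0],
             [0, 0]]
print(dice_score(pred_mask, true_mask))
```

The score ranges from 0 (no overlap) to 1 (identical masks), and it penalizes over-segmentation and under-segmentation symmetrically.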

[33] Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation

Hanlei Shi,Leyuan Qu,Yu Liu,Di Gao,Yuhua Zheng,Taihao Li

Main category: cs.CV

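TL;DR: This paper proposes the Think-Before-Draw framework, which uses Chain-of-Thought (CoT) to decompose emotion semantics and a progressive guidance denoising strategy to achieve fine-grained, controllable emotional talking-head generation. The method performs strongly on the MEAD and HDTF datasets.

Details Motivation: Existing text-driven emotional talking-head methods rely on discrete emotion labels and ignore the dynamic complexity of facial muscle movements, so the generated expressions look unnatural.

Contribution: 1. Introduces CoT to turn abstract emotion labels into physiologically grounded descriptions of facial muscle movement; 2. Proposes a progressive guidance denoising strategy that refines micro-expression dynamics.

Method: 1. Parse emotion semantics with CoT; 2. Apply fine-grained optimization through a "global emotion localization, local muscle control" mechanism.

Result: Achieves state-of-the-art performance on MEAD and HDTF, and is also evaluated for zero-shot generation on a collected set of portrait images.

Insight: Decomposing emotion semantics into physiological action descriptions is key to natural emotional expression, and progressive generation effectively refines micro-expression dynamics.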

Abstract: Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence, with its core value lying in enhancing human-computer interaction through immersive and empathetic engagement. With the advancement of multimodal large language models, the driving signals for emotional talking-head generation have shifted from audio and video to more flexible text. However, current text-driven methods rely on predefined discrete emotion label texts, oversimplifying the dynamic complexity of real facial muscle movements and thus failing to achieve natural emotional expressiveness. This study proposes the Think-Before-Draw framework to address two key challenges: (1) In-depth semantic parsing of emotions: by innovatively introducing Chain-of-Thought (CoT), abstract emotion labels are transformed into physiologically grounded facial muscle movement descriptions, enabling the mapping from high-level semantics to actionable motion features; and (2) Fine-grained expressiveness optimization: inspired by artists’ portrait painting process, a progressive guidance denoising strategy is proposed, employing a “global emotion localization–local muscle control” mechanism to refine micro-expression dynamics in generated videos. Our experiments demonstrate that our approach achieves state-of-the-art performance on widely-used benchmarks, including MEAD and HDTF. Additionally, we collected a set of portrait images to evaluate our model’s zero-shot generation capability.

[34] World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving

Yanchen Guan,Haicheng Liao,Chengyue Wang,Xingcheng Liu,Jiaxun Zhang,Zhenning Li

Main category: cs.CV

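TL;DR: This paper proposes a framework that combines generative scene augmentation with adaptive temporal reasoning to improve accident anticipation in autonomous driving, generating high-quality training data and addressing incomplete data.

Details Motivation: Accident anticipation for autonomous driving faces two challenges: diverse, high-quality training data is scarce, and crucial object-level cues are often missing because of environmental disruptions or sensor deficiencies.

Contribution: 1. A comprehensive framework combining generative scene augmentation with adaptive temporal reasoning; 2. A world-model-based video generation pipeline that enriches coverage of edge cases and complex interactions; 3. A new benchmark dataset that better captures diverse driving risks.

Method: 1. Use a world model guided by domain-informed prompts to generate high-resolution driving scenarios; 2. Build a dynamic prediction model that encodes spatio-temporal relationships with graph convolutions and dilated temporal operators; 3. Use the new dataset to support assessment of diverse risks.

Result: Experiments on public and newly released datasets confirm the framework's effectiveness, improving both the accuracy and the lead time of accident anticipation.

Insight: Generative methods can compensate for scarce real-world data, and dynamic modeling copes with incomplete and noisy data, yielding a more robust solution for autonomous driving safety.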

Abstract: Reliable anticipation of traffic accidents is essential for advancing autonomous driving systems. However, this objective is limited by two fundamental challenges: the scarcity of diverse, high-quality training data and the frequent absence of crucial object-level cues due to environmental disruptions or sensor deficiencies. To tackle these issues, we propose a comprehensive framework combining generative scene augmentation with adaptive temporal reasoning. Specifically, we develop a video generation pipeline that utilizes a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios, particularly enriching the coverage of edge cases and complex interactions. In parallel, we construct a dynamic prediction model that encodes spatio-temporal relationships through strengthened graph convolutions and dilated temporal operators, effectively addressing data incompleteness and transient visual noise. Furthermore, we release a new benchmark dataset designed to better capture diverse real-world driving risks. Extensive experiments on public and newly released datasets confirm that our framework enhances both the accuracy and lead time of accident anticipation, offering a robust solution to current data and modeling limitations in safety-critical autonomous driving applications.

[35] Continuous Marine Tracking via Autonomous UAV Handoff

Heegyeong Kim,Alice James,Avishkar Seth,Endrowednes Kuantama,Jane Williamson,Yimeng Feng,Richard Han

Main category: cs.CV

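TL;DR: This paper presents an autonomous UAV vision system for continuous, real-time tracking of marine animals such as sharks. An inter-UAV handoff protocol extends tracking coverage beyond single-drone battery limits, and the system achieves a high tracking success rate under challenging conditions.

Details Motivation: Real-time tracking of marine animals in dynamic marine environments must cope with illumination changes, occlusion, and background clutter. Existing systems are typically limited by single-drone battery life and cannot track continuously for long periods.

Contribution: 1. An autonomous UAV vision system that integrates a stabilised RGB-D camera with a custom-trained OSTrack pipeline. 2. An inter-UAV handoff protocol that transfers tracking responsibility seamlessly between drones.

Method: The system uses a stabilised RGB-D camera and a custom-trained OSTrack pipeline for visual identification, and hands the target over between UAVs via high-confidence feature matching.

Result: On a shark dataset of 5,200 frames, the system achieves an 81.9% tracking success rate during real-time flight control and 82.9% target coverage with handoff, and is robust to occlusion, illumination variation, and background clutter.

Insight: Coordinated UAV operation can extend the reach of real-time tracking and offers a viable approach to autonomous monitoring of marine animals.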

Abstract: This paper introduces an autonomous UAV vision system for continuous, real-time tracking of marine animals, specifically sharks, in dynamic marine environments. The system integrates an onboard computer with a stabilised RGB-D camera and a custom-trained OSTrack pipeline, enabling visual identification under challenging lighting, occlusion, and sea-state conditions. A key innovation is the inter-UAV handoff protocol, which enables seamless transfer of tracking responsibilities between drones, extending operational coverage beyond single-drone battery limitations. Performance is evaluated on a curated shark dataset of 5,200 frames, achieving a tracking success rate of 81.9% during real-time flight control at 100 Hz, and robustness to occlusion, illumination variation, and background clutter. We present a seamless UAV handoff framework, where target transfer is attempted via high-confidence feature matching, achieving 82.9% target coverage. These results confirm the viability of coordinated UAV operations for extended marine tracking and lay the groundwork for scalable, autonomous monitoring.
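
The abstract says target transfer is attempted via high-confidence feature matching but gives no further detail. The sketch below shows one way such a gate could look, assuming cosine similarity between appearance feature vectors and a fixed confidence threshold. The names `attempt_handoff` and `min_confidence`, the 0.8 value, and the similarity measure are all assumptions rather than the authors' protocol.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attempt_handoff(
    target_feature: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    min_confidence: float = 0.8,  # hypothetical threshold
) -> Tuple[Optional[str], float]:
    """Pick the incoming drone's detection that best matches the tracked target.

    Returns (detection_id, score) when the best match clears the threshold,
    otherwise (None, best_score) so the outgoing drone keeps tracking.
    """
    best_id: Optional[str] = None
    best_score = 0.0
    for det_id, feature in candidates.items():
        score = cosine(target_feature, feature)
        if score > best_score:
            best_id, best_score = det_id, score
    if best_id is not None and best_score >= min_confidence:
        return best_id, best_score
    return None, best_score


# Toy features: detection "a" resembles the target, "b" does not.
chosen, score = attempt_handoff([1.0, 0.0], {"a": [0.9, 0.1], "b": [0.0, 1.0]})
print(chosen, round(score, 3))
```

Rejecting low-confidence matches trades some coverage for fewer wrong-target handoffs, which is one plausible reason a handoff would be described as attempted rather than guaranteed.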

[36] AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

Hengkai Tan,Yao Feng,Xinyi Mao,Shuhe Huang,Guodong Liu,Zhongkai Hao,Hang Su,Jun Zhu

Main category: cs.CV

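TL;DR: The paper proposes a task-agnostic action paradigm that decouples action execution from task-specific conditioning. It pairs the ATARA framework for efficient data collection with AnyPos, an improved inverse dynamics model, and reports substantial gains on several downstream tasks.

Details Motivation: Existing vision-language-action (VLA) models perform well on task-conditioned control but rely on task-specific human demonstrations, which limits generalization and is costly. The authors therefore propose a task-agnostic action paradigm.

Contribution: 1. A task-agnostic action paradigm; 2. The ATARA framework, which significantly improves data collection efficiency; 3. AnyPos, an inverse dynamics model with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD); 4. A video-conditioned action validation module.

Method: 1. Use ATARA to collect task-agnostic data automatically; 2. Design the AnyPos model with Arm-Decoupled Estimation and DAD modules; 3. Add a video validation module to check the feasibility of learned policies.

Result: Experiments show that the AnyPos-ATARA pipeline improves test accuracy by 51% and raises success rates by 30-40% on downstream tasks such as lifting, pick-and-place, and clicking.

Insight: A task-agnostic action paradigm can substantially reduce data costs and improve generalization; combining self-supervised data collection with a validation module further improves policy effectiveness.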

Abstract: Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm – such as low coverage density, behavioral redundancy, and safety risks – we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over $ 30\times $ compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: https://embodiedfoundation.github.io/vidar_anypos

[37] Compact Vision Transformer by Reduction of Kernel Complexity

Yancheng Wang,Yingzhen Yang

Main category: cs.CV

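TL;DR: The paper proposes KCR-Transformer, a compact vision transformer block that lowers computational cost by reducing kernel complexity while preserving generalization.

Details Motivation: Self-attention and transformer architectures are computationally demanding. To address this, the paper proposes a compact transformer block that reduces computation through channel selection while maintaining generalization performance.

Contribution: 1. KCR-Transformer, which performs input/output channel selection in the MLP layers of transformer blocks to reduce computational cost. 2. A rigorous theoretical analysis establishing a tight generalization bound, so that the pruned network retains a small generalization error. 3. Compatibility with popular transformer architectures such as ViT and Swin, reducing FLOPs while maintaining or improving accuracy.

Method: 1. Introduce a differentiable channel selection mechanism in the MLP layers of transformer blocks. 2. Prune channels under the guidance of the theoretical generalization bound. 3. Replace the transformer blocks of existing architectures with KCR-Transformer blocks.

Result: Experiments show that KCR-Transformer networks perform well on various computer vision tasks, outperforming the original models with fewer FLOPs and parameters.

Insight: 1. The theoretical generalization bound gives channel pruning a principled basis. 2. Channel selection can significantly reduce computation without sacrificing performance.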

Abstract: Self-attention and transformer architectures have become foundational components in modern deep learning. Recent efforts have integrated transformer blocks into compact neural architectures for computer vision, giving rise to various efficient vision transformers. In this work, we introduce Transformer with Kernel Complexity Reduction, or KCR-Transformer, a compact transformer block equipped with differentiable channel selection, guided by a novel and sharp theoretical generalization bound. KCR-Transformer performs input/output channel selection in the MLP layers of transformer blocks to reduce the computational cost. Furthermore, we provide a rigorous theoretical analysis establishing a tight generalization bound for networks equipped with KCR-Transformer blocks. Leveraging such strong theoretical results, the channel pruning by KCR-Transformer is conducted in a generalization-aware manner, ensuring that the resulting network retains a provably small generalization error. Our KCR-Transformer is compatible with many popular and compact transformer networks, such as ViT and Swin, and it reduces the FLOPs of the vision transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in the vision transformers with KCR-Transformer blocks, leading to KCR-Transformer networks with different backbones. The resulting KCR-Transformers achieve superior performance on various computer vision tasks, outperforming the original models with fewer FLOPs and parameters.
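
To make the idea of differentiable channel selection concrete, here is a minimal sketch using sigmoid gates over the hidden channels of an MLP layer. It is a generic relaxation chosen for illustration, not the KCR-Transformer mechanism: the abstract does not specify how selection is parameterised, so the gate form, the `temperature` parameter, and the zero-logit pruning rule are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_gates(logits, temperature=1.0):
    """Continuous gates in (0, 1); smooth in the logits, so they can be
    trained by gradient descent together with the network weights."""
    return [sigmoid(logit / temperature) for logit in logits]


def gated_hidden(hidden, logits, temperature=1.0):
    """Scale each hidden channel of an MLP layer by its gate (training time)."""
    gates = soft_gates(logits, temperature)
    return [h * g for h, g in zip(hidden, gates)]


def selected_channels(logits):
    """Hypothetical pruning rule: keep channels whose gate exceeds 0.5,
    which is equivalent to a positive logit."""
    return [index for index, logit in enumerate(logits) if logit > 0.0]


kept = selected_channels([2.0, -1.0, 0.5])
```

After training, only the kept channels need to be computed, which is where the FLOP savings of this kind of scheme would come from.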

[38] City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning

Penglei Sun,Yaoxian Song,Xiangru Zhu,Xiang Liu,Qiang Wang,Yue Liu,Changqun Xia,Tiefeng Li,Yang Yang,Xiaowen Chu

Main category: cs.CV

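TL;DR: The paper proposes City-VLM, a large vision-language model built on incomplete multimodal learning for multidomain perception in large-scale outdoor scene understanding. Using the new SVM-City dataset and multimodal fusion in a joint probabilistic distribution space, City-VLM outperforms existing LVLMs by 18.14% on average across three typical outdoor scene understanding tasks.

Details Motivation: Existing LVLMs focus mainly on indoor tasks and struggle with the multi-view, multimodal data fusion that large-scale outdoor scenes require. Datasets and methods for outdoor multidomain perception are also lacking.

Contribution: 1) SVM-City, the first multidomain perception dataset for outdoor scene understanding; 2) City-VLM, which fuses modalities through incomplete multimodal learning and a joint probabilistic distribution space.

Method: Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than by explicit fusion operations such as concatenation, and the model is trained on SVM-City.

Result: Across three outdoor scene understanding tasks, City-VLM improves performance by 18.14% on average and generalizes across multiple outdoor scenes.

Insight: Outdoor scene understanding needs effective multimodal fusion, especially when one modality is missing. A joint probabilistic distribution space offers a flexible way to fuse modalities and outperforms conventional explicit fusion.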

Abstract: Scene understanding enables intelligent agents to interpret and comprehend their environment. While existing large vision-language models (LVLMs) for scene understanding have primarily focused on indoor household tasks, they face two significant limitations when applied to outdoor large-scale scene understanding. First, outdoor scenarios typically encompass larger-scale environments observed through various sensors from multiple viewpoints (e.g., bird view and terrestrial view), while existing indoor LVLMs mainly analyze single visual modalities within building-scale contexts from humanoid viewpoints. Second, existing LVLMs suffer from missing multidomain perception outdoor data and struggle to effectively integrate 2D and 3D visual information. To address the aforementioned limitations, we build the first multidomain perception outdoor scene understanding dataset, named SVM-City, deriving from multiScale scenarios with multiView and multiModal instruction tuning data. It contains 420k images and 4,811M point clouds with 567k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellites. To effectively fuse the multimodal data in the absence of one modality, we introduce incomplete multimodal learning to model outdoor scene understanding and design the LVLM named City-VLM. Multimodal fusion is realized by constructing a joint probabilistic distribution space rather than directly implementing explicit fusion operations (e.g., concatenation). Experimental results on three typical outdoor scene understanding tasks show that City-VLM surpasses existing LVLMs by 18.14% on average in question-answering tasks. Our method demonstrates practical and generalizable performance across multiple outdoor scenes.

[39] DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment

Junjie Gao,Runze Liu,Yingzhe Peng,Shujian Yang,Jin Zhang,Kai Yang,Zhiyuan You

Main category: cs.CV

TL;DR: DeQA-Doc adapts DeQA-Score, an MLLM-based image quality scorer, to document quality assessment, achieving superior performance over existing methods.

Details Motivation: Existing document quality assessment methods lack accuracy and robustness, while MLLMs show promise in image quality assessment, motivating their adaptation to the document domain.

Contribution: Proposes DeQA-Doc, a framework combining MLLM capabilities and soft labels for continuous document quality scoring, with relaxed resolution constraints and ensemble methods.

Method: Adapts DeQA-Score for document images using soft labels and relaxed resolution constraints, enhanced by ensemble techniques.

Result: DeQA-Doc outperforms baselines, providing accurate and generalizable quality assessment for diverse document degradations.

Insight: Leveraging MLLMs and soft labels effectively transfers image quality assessment success to the document domain, highlighting the versatility of MLLMs.

Abstract: Document quality assessment is critical for a wide range of applications including document digitization, OCR, and archival. However, existing approaches often struggle to provide accurate and robust quality scores, limiting their applicability in practical scenarios. With the rapid progress in Multi-modal Large Language Models (MLLMs), recent MLLM-based methods have achieved remarkable performance in image quality assessment. In this work, we extend this success to the document domain by adapting DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment. We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores. To adapt DeQA-Score to DeQA-Doc, we adopt two complementary solutions to construct soft labels without the variance information. We also relax the resolution constraints to support the large resolution of document images. Finally, we introduce ensemble methods to further enhance the performance. Extensive experiments demonstrate that DeQA-Doc significantly outperforms existing baselines, offering accurate and generalizable document quality assessment across diverse degradation types. Code and model weights are available at https://github.com/Junjie-Gao19/DeQA-Doc.

[40] ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion

Hoang-Son Vo,Quang-Vinh Nguyen,Seungwon Kim,Hyung-Jeong Yang,Soonja Yeom,Soo-Hyung Kim

Main category: cs.CV

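TL;DR: ATL-Diff uses early landmark-guided noise diffusion to address synchronization in audio-driven talking head generation, reducing noise and computational cost while producing high-quality animations efficiently.

Details Motivation: Synchronizing audio with facial animation is the core challenge of talking head generation; existing methods are limited in noise handling, computational efficiency, and identity preservation.

Contribution: The ATL-Diff framework, comprising a Landmark Generation Module, a Landmarks-Guide Noise approach, and a 3D Identity Diffusion network, which improves synchronization and computational efficiency.

Method: 1. The Landmark Generation Module converts audio into facial landmarks; 2. The Landmarks-Guide Noise approach distributes noise according to the landmarks; 3. The 3D Identity Diffusion network preserves identity characteristics.

Result: ATL-Diff outperforms existing methods on the MEAD and CREMA-D datasets and supports near real-time processing while preserving animation quality and facial detail.

Insight: Distributing noise under landmark guidance while preserving identity features is key to efficient talking head generation. Potential applications include virtual assistants and education.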

Abstract: Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at: https://github.com/sonvth/ATL-Diff

[41] Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition

Yufei Peng,Yonggang Zhang,Yiu-ming Cheung

Main category: cs.CV

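TL;DR: The paper proposes Sage, a semantic-guided fine-tuning method for foundation models. An SG-Adapter uses class descriptions as semantic guidance to strengthen visual-textual alignment, and a distribution mismatch-aware compensation factor corrects prediction bias, improving long-tailed visual recognition.

Details Motivation: Imbalanced class sample sizes in long-tailed settings degrade performance on less frequent classes. Existing fine-tuning methods neglect the semantics available from the frozen text encoder, and existing loss functions ignore inconsistent class-conditional distributions.

Contribution: 1. Sage, a semantic-guided fine-tuning strategy that strengthens the alignment between visual and textual modalities; 2. The SG-Adapter, which uses class descriptions as semantic guidance; 3. A distribution mismatch-aware compensation factor that rectifies prediction bias.

Method: 1. Introduce the SG-Adapter, which injects class descriptions into visual encoder fine-tuning through the attention mechanism; 2. Propose a compensation factor, grounded in theoretical analysis, that is integrated into the loss function to address distribution mismatch.

Result: Experiments on several benchmark datasets show that Sage improves long-tailed learning performance, particularly on less frequent classes.

Insight: Combining semantic information from the text modality with visual fine-tuning improves long-tailed recognition, and inconsistent class-conditional distributions must be accounted for because they bias predictions.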

Abstract: The variance in class-wise sample sizes within long-tailed scenarios often results in degraded performance in less frequent classes. Fortunately, foundation models, pre-trained on vast open-world datasets, demonstrate strong potential for this task due to their generalizable representation, which promotes the development of adaptive strategies on pre-trained models in long-tailed learning. Advanced fine-tuning methods typically adjust visual encoders while neglecting the semantics derived from the frozen text encoder, overlooking the visual and textual alignment. To strengthen this alignment, we propose a novel approach, Semantic-guided fine-tuning of foundation model for long-tailed visual recognition (Sage), which incorporates semantic guidance derived from textual modality into the visual fine-tuning process. Specifically, we introduce an SG-Adapter that integrates class descriptions as semantic guidance to guide the fine-tuning of the visual encoder. The introduced guidance is passed through the attention mechanism and enables the model to focus more on semantically relevant content, strengthening the alignment between the visual and textual modalities. Because existing loss functions neglect the inconsistent class-conditional distributions, the resulting prediction bias yields smaller performance improvements for tail classes than for head classes, even when the multi-modal alignment is enhanced. To address this challenge, we propose a novel distribution mismatch-aware compensation factor, which is specifically designed to rectify the prediction bias caused by the ignored inconsistent distribution based on our theoretical analysis, and is seamlessly integrated into the loss function. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed Sage in enhancing performance in long-tailed learning.

[42] FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering

Ju-Young Oh,Ho-Joong Kim,Seong-Whan Lee

Main category: cs.CV

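TL;DR: FIQ strengthens model reasoning in video question answering by generating fundamental question-answer pairs. By combining question embeddings with visual features, it achieves state-of-the-art performance on SUTD-TrafficQA.

Details Motivation: Existing video question answering methods rely mainly on event-centric Q&A pairs and lack an understanding of fundamental scene information, which limits generalization and reasoning.

Contribution: FIQ, which generates fundamental Q&A pairs to improve scene understanding, and the VQ-CAlign module, which combines visual features with question embeddings to improve adaptability to downstream tasks.

Method: Generate Q&A pairs from descriptions extracted from videos, and use the VQ-CAlign module to align question embeddings with visual features.

Result: State-of-the-art performance on the SUTD-TrafficQA dataset.

Insight: Generating fundamental scene information and aligning question embeddings with visual features are key to improving reasoning in video question answering.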

Abstract: Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model’s capacity for generalization and higher-level reasoning. In this paper, we propose a fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. Generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that assists task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase the adaptability of downstream tasks. Experiments on SUTD-TrafficQA demonstrate that our FIQ achieves state-of-the-art performance compared to existing baseline methods.

[43] MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval

Jeong-Woo Park,Seong-Whan Lee

Main category: cs.CV

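TL;DR: MCoT-RE is a training-free zero-shot composed image retrieval framework that uses multi-faceted Chain-of-Thought and re-ranking to balance textual modifications against visual context, improving retrieval performance.

Details Motivation: Existing zero-shot composed image retrieval methods either process modalities independently, which loses information, or focus too heavily on the text modification and underuse the visual context.

Contribution: The MCoT-RE framework, which combines multi-faceted Chain-of-Thought with two-stage re-ranking to balance textual modification and visual context.

Method: An MLLM generates two captions from different perspectives (one modification-focused, one integrating context). The first caption filters candidate images; both captions and the reference image then drive multi-grained re-ranking.

Result: State-of-the-art results among training-free methods, with improvements of up to 6.24% in Recall@10 on FashionIQ and 8.58% in Recall@1 on CIRR.

Insight: Multi-faceted Chain-of-Thought and two-stage re-ranking address the limited cross-modal interaction in zero-shot CIR while preserving the visual context of the reference image.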

Abstract: Composed Image Retrieval (CIR) is the task of retrieving a target image from a gallery using a composed query consisting of a reference image and a modification text. Among various CIR approaches, training-free zero-shot methods based on pre-trained models are cost-effective but still face notable limitations. For example, sequential VLM-LLM pipelines process each modality independently, which often results in information loss and limits cross-modal interaction. In contrast, methods based on multimodal large language models (MLLMs) often focus exclusively on applying changes indicated by the text, without fully utilizing the contextual visual information from the reference image. To address these issues, we propose multi-faceted Chain-of-Thought with re-ranking (MCoT-RE), a training-free zero-shot CIR framework. MCoT-RE utilizes multi-faceted Chain-of-Thought to guide the MLLM to balance explicit modifications and contextual visual cues, generating two distinct captions: one focused on modification and the other integrating comprehensive visual-textual context. The first caption is used to filter candidate images. Subsequently, we combine these two captions and the reference image to perform multi-grained re-ranking. This two-stage approach facilitates precise retrieval by aligning with the textual modification instructions while preserving the visual context of the reference image. Through extensive experiments, MCoT-RE achieves state-of-the-art results among training-free methods, yielding improvements of up to 6.24% in Recall@10 on FashionIQ and 8.58% in Recall@1 on CIRR.
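
The two-stage procedure in the abstract (filter with the first caption, then re-rank with both captions and the reference image) can be sketched as follows. This is an illustrative outline only: it assumes the similarity scores are already computed by some encoder, and the weighted sum used for re-ranking, the `weights` parameter, and the function name are assumptions rather than details from the paper.

```python
def two_stage_retrieval(mod_scores, ctx_scores, ref_scores,
                        top_k=2, weights=(1.0, 1.0, 1.0)):
    """Rank gallery indices in two stages.

    mod_scores: similarity of each gallery image to the modification-focused caption
    ctx_scores: similarity to the context-integrating caption
    ref_scores: similarity to the reference image
    """
    # Stage 1: keep the top_k candidates under the modification-focused caption.
    candidates = sorted(range(len(mod_scores)),
                        key=lambda i: mod_scores[i], reverse=True)[:top_k]

    # Stage 2: re-rank the survivors with a (hypothetical) weighted sum
    # of all three similarity signals.
    w_mod, w_ctx, w_ref = weights

    def combined(i):
        return w_mod * mod_scores[i] + w_ctx * ctx_scores[i] + w_ref * ref_scores[i]

    return sorted(candidates, key=combined, reverse=True)


ranking = two_stage_retrieval(
    mod_scores=[0.9, 0.8, 0.1, 0.7],
    ctx_scores=[0.1, 0.9, 0.9, 0.2],
    ref_scores=[0.2, 0.8, 0.9, 0.1],
)
```

In this toy input, image 2 matches the context well but is filtered out in stage 1 because it ignores the requested modification, while image 1 overtakes image 0 in stage 2 because it also preserves the reference context.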

[44] FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval

Jeong-Woo Park,Young-Eun Kim,Seong-Whan Lee

Main category: cs.CV

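TL;DR: FAR-Net is a multi-stage fusion framework that addresses modality fusion in composed image retrieval through enhanced semantic alignment and adaptive reconciliation, improving performance.

Details Motivation: Existing composed image retrieval methods use either early or late fusion, and each has limits: early fusion neglects visual context, while late fusion struggles to capture fine-grained semantic alignment.

Contribution: FAR-Net, comprising an enhanced semantic alignment module (ESAM) and an adaptive reconciliation module (ARM), which combines the strengths of late and early fusion.

Method: ESAM uses late fusion with cross-attention to capture fine-grained semantic relationships; ARM uses early fusion with uncertainty embeddings to improve robustness.

Result: On the CIRR and FashionIQ datasets, Recall@1 improves by up to 2.4% and Recall@50 by 1.04% over existing state-of-the-art methods.

Insight: Multi-stage fusion that combines the strengths of late and early fusion aligns the visual and language modalities more effectively and improves retrieval performance.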

Abstract: Composed image retrieval (CIR) is a vision-language task that retrieves a target image using a reference image and modification text, enabling intuitive specification of desired changes. While effectively fusing visual and textual modalities is crucial, existing methods typically adopt either early or late fusion. Early fusion tends to excessively focus on explicitly mentioned textual details and neglect visual context, whereas late fusion struggles to capture fine-grained semantic alignments between image regions and textual tokens. To address these issues, we propose FAR-Net, a multi-stage fusion framework designed with enhanced semantic alignment and adaptive reconciliation, integrating two complementary modules. The enhanced semantic alignment module (ESAM) employs late fusion with cross-attention to capture fine-grained semantic relationships, while the adaptive reconciliation module (ARM) applies early fusion with uncertainty embeddings to enhance robustness and adaptability. Experiments on CIRR and FashionIQ show consistent performance gains, improving Recall@1 by up to 2.4% and Recall@50 by 1.04% over existing state-of-the-art methods, empirically demonstrating that FAR-Net provides a robust and scalable solution to CIR tasks.

[45] MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results

Yuki Kondo,Norimichi Ukita,Riku Kanayama,Yuki Yoshida,Takayuki Yamaguchi,Xiang Yu,Guang Liang,Xinyao Liu,Guan-Zhang Wang,Wei-Ta Chu,Bing-Cheng Chuang,Jia-Hua Lee,Pin-Tseng Kuo,I-Hsuan Chu,Yi-Shein Hsiao,Cheng-Han Wu,Po-Yi Wu,Jui-Chien Tsou,Hsuan-Chi Liu,Chun-Yi Lee,Yuan-Fu Yang,Kosuke Shigematsu,Asuka Shin,Ba Tran

Main category: cs.CV

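TL;DR: The paper presents the SMOT4SB challenge, including its dataset, the new SO-HOTA evaluation metric, and the competition results, targeting the difficulties of small multi-object tracking.

Details Motivation: Small targets such as birds are hard to track, especially when they occupy only a few dozen pixels and detection and appearance-based association become unreliable. The paper introduces a new challenge and methods to address this.

Contribution: 1. The SMOT4SB dataset of 211 UAV video sequences; 2. SO-HOTA, an evaluation metric combining Dot Distance with HOTA; 3. The MVA2025 challenge results, where the winning method achieved a 5.1x improvement over the baseline.

Method: Leverage temporal information to overcome the limits of single-frame detection, supported by a new dataset and the SO-HOTA metric.

Result: The winning method improved substantially on the baseline (5.1x), validating the approach.

Insight: The work offers new directions for small multi-object tracking in UAV scenarios, with applications in bird strike avoidance, agriculture, and ecological monitoring.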

Abstract: Small Multi-Object Tracking (SMOT) is particularly challenging when targets occupy only a few dozen pixels, rendering detection and appearance-based association unreliable. Building on the success of the MVA2023 SOD4SB challenge, this paper introduces the SMOT4SB challenge, which leverages temporal information to address limitations of single-frame detection. Our three main contributions are: (1) the SMOT4SB dataset, consisting of 211 UAV video sequences with 108,192 annotated frames under diverse real-world conditions, designed to capture motion entanglement where both camera and targets move freely in 3D; (2) SO-HOTA, a novel metric combining Dot Distance with HOTA to mitigate the sensitivity of IoU-based metrics to small displacements; and (3) a competitive MVA2025 challenge with 78 participants and 308 submissions, where the winning method achieved a 5.1x improvement over the baseline. This work lays a foundation for advancing SMOT in UAV scenarios with applications in bird strike avoidance, agriculture, fisheries, and ecological monitoring.
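
The sensitivity of IoU-based metrics to small displacements, which motivates SO-HOTA, is easy to see with a worked example. The sketch below computes plain IoU only; it does not implement Dot Distance or SO-HOTA, and the box sizes are arbitrary values chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# The same 5-pixel horizontal shift applied to a small and a large target.
small_iou = iou((0, 0, 10, 10), (5, 0, 15, 10))      # 50 / 150, about 0.33
large_iou = iou((0, 0, 100, 100), (5, 0, 105, 100))  # 9500 / 10500, about 0.90
```

A 5-pixel localisation error would fail a typical 0.5 IoU match threshold for the 10-pixel target while barely affecting the 100-pixel one, which is why a distance-based term is attractive for bird-sized objects.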

[46] AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Yiming Ren,Zhiqiang Lin,Yu Li,Gao Meng,Weiyun Wang,Junjie Wang,Zicheng Lin,Jifeng Dai,Yujiu Yang,Wenhai Wang,Ruihang Chu

Main category: cs.CV

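TL;DR: The paper presents the AnyCap Project, comprising the AnyCapModel (ACM) framework, the AnyCapDataset (ACD), and the AnyCapEval benchmark, to improve the quality and evaluation reliability of controllable omni-modal captioning.

Details Motivation: Existing models lack fine-grained control and reliable evaluation protocols for controllable captioning; AnyCap addresses this gap.

Contribution: 1. ACM: a lightweight plug-and-play framework that enhances the controllability of base models; 2. ACD: a high-quality dataset covering multiple modalities and instruction types; 3. AnyCapEval: a new benchmark that decouples content accuracy from stylistic fidelity.

Method: ACM reuses the base model's original captions and combines them with user instructions and modality features to generate improved captions, without retraining the base model.

Result: ACM markedly improves caption quality; for example, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%.

Insight: AnyCap shows that a lightweight framework paired with a high-quality dataset can substantially improve the controllability and performance of omni-modal captioning without retraining the base model.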

Abstract: Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.

[47] SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning

Khang Truong,Lam Pham,Hieu Tang,Jasmin Lampert,Martin Boyer,Son Phan,Truong Nguyen

Main category: cs.CV

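TL;DR: The paper proposes SEMT, a transformer-based network architecture that combines Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer for remote sensing image captioning, and reports improved performance.

Details Motivation: Remote sensing image captioning (RSIC) plays an important role in environmental monitoring, disaster assessment, and urban planning, but current methods leave room for improvement on complex satellite imagery.

Contribution: 1) The SEMT network architecture; 2) An integration of Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer techniques; 3) Results that surpass existing methods on most evaluation metrics on the UCM-Caption and NWPU-Caption datasets.

Method: 1) Static Expansion for feature extraction; 2) Memory-Augmented Self-Attention to better model long-range dependencies; 3) Mesh Transformer to improve context modeling.

Result: The best model outperforms state-of-the-art systems on most evaluation metrics, showing potential for real-world use.

Insight: A transformer architecture that combines several techniques performs well on complex remote sensing image captioning and offers a practical solution for real-world scenarios.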

Abstract: Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us, in this paper, to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which multiple techniques (Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer) are evaluated and integrated. We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms the state-of-the-art systems on most evaluation metrics, which demonstrates its potential for real-life remote sensing image systems.

[48] WhoFi: Deep Person Re-Identification via Wi-Fi Channel Signal Encoding

Danilo Avola,Daniele Pannone,Dario Montagnini,Emad Emam

Main category: cs.CV

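TL;DR: WhoFi is a person re-identification method based on Wi-Fi signals. It extracts biometric features from Channel State Information (CSI) and learns them with a Transformer-based encoder, avoiding the lighting and occlusion limits of visual methods.

Details Motivation: Visual re-identification methods degrade under poor lighting, occlusion, or suboptimal angles. WhoFi improves re-identification by exploiting Wi-Fi signals, which are not subject to these visual constraints.

Contribution: 1) A pipeline that uses CSI from Wi-Fi signals for person re-identification; 2) A modular deep neural network with a Transformer-based encoder; 3) Training with an in-batch negative loss function to learn robust biometric signatures.

Method: 1) Extract biometric features from CSI; 2) Process the features with a Transformer-based encoder; 3) Train the network with an in-batch negative loss function to learn robust features.

Result: On the NTU-Fi dataset, WhoFi achieves results competitive with state-of-the-art methods, supporting the potential of Wi-Fi signals for re-identification.

Insight: Wi-Fi signals can complement or replace visual data for biometric identification, especially where visual conditions are constrained.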

Abstract: Person Re-Identification is a key and challenging task in video surveillance. While traditional methods rely on visual data, issues like poor lighting, occlusion, and suboptimal angles often hinder performance. To address these challenges, we introduce WhoFi, a novel pipeline that utilizes Wi-Fi signals for person re-identification. Biometric features are extracted from Channel State Information (CSI) and processed through a modular Deep Neural Network (DNN) featuring a Transformer-based encoder. The network is trained using an in-batch negative loss function to learn robust and generalizable biometric signatures. Experiments on the NTU-Fi dataset show that our approach achieves competitive results compared to state-of-the-art methods, confirming its effectiveness in identifying individuals via Wi-Fi signals.
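
An in-batch negative loss treats every other sample in the batch as a negative for a given query. The sketch below shows one common formulation, a softmax cross-entropy over dot-product similarities with a temperature. The abstract does not give WhoFi's exact loss, so the dot-product similarity, the `temperature` value, and the function name are assumptions.

```python
import math


def in_batch_negative_loss(queries, keys, temperature=0.1):
    """Mean cross-entropy where keys[i] is the positive for queries[i]
    and every other key in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(queries):
        logits = [sum(q * k for q, k in zip(query, key)) / temperature
                  for key in keys]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Two identities with orthogonal signatures: matched pairs give a low loss,
# swapped pairs give a high one.
signatures = [[1.0, 0.0], [0.0, 1.0]]
matched = in_batch_negative_loss(signatures, signatures)
swapped = in_batch_negative_loss(signatures, signatures[::-1])
```

Minimising this loss pulls signatures of the same person together and pushes signatures of different people in the batch apart, without requiring explicitly mined negative pairs.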

[49] HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation

Weihuang Lin,Yiwei Ma,Xiaoshuai Sun,Shuting He,Jiayi Ji,Liujuan Cao,Rongrong Ji

Main category: cs.CV

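TL;DR: HRSeg is an efficient high-resolution perception and enhancement method for reasoning segmentation, using HRP and HRE modules for multi-granularity perception and feature refinement.

Details Motivation: Existing reasoning segmentation methods are constrained by the low pre-training resolution of visual encoders. Simply interpolating positional embeddings yields marginal gains at substantial computational cost, so a more efficient solution is needed.

Contribution: HRSeg, with High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE) modules, which improves segmentation accuracy while remaining computationally efficient.

Method: HRP processes high-resolution images through cropping and integrates local and global features; HRE integrates fine-grained information from high-resolution images into mask features and refines their alignment with text features.

Result: HRSeg outperforms existing methods on multiple benchmark datasets, and ablation studies confirm the effectiveness of each module.

Insight: Multi-granularity processing of high-resolution images (combining local and global views) and feature refinement are essential for reasoning segmentation.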

Abstract: The reasoning segmentation task involves segmenting objects within an image by interpreting implicit user instructions, which may encompass subtleties such as contextual cues and open-world knowledge. Despite significant advancements made by existing approaches, they remain constrained by low perceptual resolution, as visual encoders are typically pre-trained at lower resolutions. Furthermore, simply interpolating the positional embeddings of visual encoders to enhance perceptual resolution yields only marginal performance improvements while incurring substantial computational costs. To address this, we propose HRSeg, an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE). The HRP module processes high-resolution images through cropping, integrating local and global features for multi-granularity quality. The HRE module enhances mask features by integrating fine-grained information from high-resolution images, refining their alignment with text features for precise segmentation. Extensive ablation studies validate the effectiveness of our modules, while comprehensive experiments on multiple benchmark datasets demonstrate HRSeg’s superior performance.
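
Processing a high-resolution image "through cropping" usually means splitting it into tiles that each fit the encoder's native input size, alongside one downsampled global view. The sketch below computes such tile boxes. It is a generic tiling scheme for illustration, not HRSeg's HRP module: the non-overlapping grid, the function name, and the 512-pixel tile size are assumptions.

```python
def crop_boxes(width, height, tile):
    """Return (left, top, right, bottom) boxes that tile a width x height
    image with non-overlapping crops of at most tile x tile pixels."""
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes


# A 1024 x 768 image with a hypothetical 512-pixel encoder input size
# yields four local crops; a resized copy of the full image would supply
# the global view.
boxes = crop_boxes(1024, 768, 512)
```

Each crop keeps fine detail at the encoder's native resolution, while the global view preserves the layout needed to relate the crops to one another.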

[50] From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation

Mengxi Liu,Lala Shakti Swarup Ray,Sizhen Bian,Ko Watanabe,Ankur Bhatt,Joanna Sorysz,Russel Torah,Bo Zhou,Paul Lukowicz

Main category: cs.CV

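TL;DR: The paper presents NeckSense, a wearable system that estimates head pose from multi-channel bio-impedance sensing. It needs no line of sight and performs close to state-of-the-art vision-based methods.

Details Motivation: Existing head pose tracking systems typically depend on vision or complex sensors, which limits portability and practicality. NeckSense offers a lightweight, line-of-sight-free alternative based on bio-impedance sensing.

Contribution: The NeckSense system, which combines bio-impedance sensing with a deep learning framework and anatomical priors for head pose tracking.

Method: Multi-channel bio-impedance sensing with soft, dry electrodes, combined with a deep learning framework whose loss function incorporates anatomical priors. Performance is evaluated with leave-one-person-out cross-validation.

Result: Validated on 7 participants, the system achieves a mean per-vertex error of 25.9 mm, comparable to state-of-the-art vision-based methods.

Insight: Bio-impedance sensing is a lightweight, line-of-sight-free alternative to complex vision systems and opens new options for wearable devices.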

Abstract: We present NeckSense, a novel wearable system for head pose tracking that leverages multi-channel bio-impedance sensing with soft, dry electrodes embedded in a lightweight, necklace-style form factor. NeckSense captures dynamic changes in tissue impedance around the neck, which are modulated by head rotations and subtle muscle activations. To robustly estimate head pose, we propose a deep learning framework that integrates anatomical priors, including joint constraints and natural head rotation ranges, into the loss function design. We validate NeckSense on 7 participants using the current SOTA pose estimation model as ground truth. Our system achieves a mean per-vertex error of 25.9 mm across various head movements with a leave-one-person-out cross-validation method, demonstrating that a compact, line-of-sight-free bio-impedance wearable can deliver head-tracking performance comparable to SOTA vision-based methods.
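
The headline metric, mean per-vertex error, is the average Euclidean distance between corresponding predicted and reference mesh vertices. A minimal sketch, assuming both meshes share the same vertex ordering and units (millimetres here), with made-up coordinates:

```python
import math


def mean_per_vertex_error(predicted, reference):
    """Average Euclidean distance between corresponding 3D vertices."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("vertex lists must be non-empty and equally long")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances)


# Two vertices, displaced by 3 mm and 4 mm respectively.
error_mm = mean_per_vertex_error(
    predicted=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    reference=[(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)],
)
```

In the paper's setting the reference vertices come from a vision-based pose estimation model rather than from motion capture, so the reported error is relative to that model's output.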

[51] Camera-based implicit mind reading by capturing higher-order semantic dynamics of human gaze within environmental context

Mengke Song,Yuge Xie,Qi Cui,Luming Li,Xinyu Liu,Guotao Wang,Chenglizhao Chen,Shanchen Pang

Main category: cs.CV

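TL;DR: The paper proposes a camera-based, user-unaware emotion recognition method that combines gaze fixation patterns with environmental semantics and temporal dynamics, capturing higher-order semantic dynamics between gaze and environment to infer internal emotional states.

Details Motivation: Most existing emotion recognition methods rely on explicit signals such as facial expressions, speech, or gestures, which ignore environmental context and are easy to mask; physiological-signal methods are more direct but require complex sensors. The paper aims to reveal the deeper link between emotion and behavior by analyzing the dynamic interaction between gaze and environment.

Contribution: A camera-based emotion recognition method that needs no specialized hardware or active user participation. It integrates the environmental semantics and temporal dynamics of gaze behavior, enabling user-unaware, real-time, continuous emotion recognition.

Method: Standard HD cameras capture eye appearance and head movements; the system estimates gaze trajectories over time and space, models the spatial, semantic, and temporal dimensions of gaze behavior, and analyzes the dynamic interplay between visual attention and the environment.

Result: The approach enables user-unaware, real-time, continuous emotion recognition with high generalizability and low deployment cost.

Insight: Emotions are not merely physiological responses but outcomes of complex human-environment interaction; the environmental context and dynamics of gaze are important cues to deeper emotions.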

Abstract: Emotion recognition, as a step toward mind reading, seeks to infer internal states from external cues. Most existing methods rely on explicit signals (such as facial expressions, speech, or gestures) that reflect only bodily responses and overlook the influence of environmental context. These cues are often voluntary, easy to mask, and insufficient for capturing deeper, implicit emotions. Physiological signal-based approaches offer more direct access to internal states but require complex sensors that compromise natural behavior and limit scalability. Gaze-based methods typically rely on static fixation analysis and fail to capture the rich, dynamic interactions between gaze and the environment, and thus cannot uncover the deep connection between emotion and implicit behavior. To address these limitations, we propose a novel camera-based, user-unaware emotion recognition approach that integrates gaze fixation patterns with environmental semantics and temporal dynamics. Leveraging standard HD cameras, our method unobtrusively captures users’ eye appearance and head movements in natural settings, without the need for specialized hardware or active user participation. From these visual cues, the system estimates gaze trajectories over time and space, providing the basis for modeling the spatial, semantic, and temporal dimensions of gaze behavior. This allows us to capture the dynamic interplay between visual attention and the surrounding environment, revealing that emotions are not merely physiological responses but complex outcomes of human-environment interactions. The proposed approach enables user-unaware, real-time, and continuous emotion recognition, offering high generalizability and low deployment cost.

[52] LanePerf: a Performance Estimation Framework for Lane Detection

Yin Wu,Daniel Slieter,Ahmed Abouelazm,Christian Hubschneider,J. Marius Zöllner

Main category: cs.CV

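TL;DR: The paper proposes LanePerf, a framework that estimates the performance of lane detection models without labeled data; by combining image and lane features, it clearly outperforms existing methods.

Details Motivation: Lane detection is critical in ADAS and ADS, but domain shift reduces model reliability. The traditional remedy of annotating target-domain data is resource-intensive, so a label-free performance estimation method is needed.

Contribution: Proposes the LanePerf framework. It adapts performance estimation methods from image classification to lane detection for the first time, and further integrates image and lane features to improve estimation accuracy substantially.

Method: LanePerf combines a pretrained image encoder with a DeepSets-based architecture to handle zero-lane detections and large domain shifts. The baselines are five performance estimation methods adapted from image classification.

Result: On the OpenLane dataset, LanePerf achieves an MAE of 0.117 and a Spearman's rank correlation coefficient of 0.727, outperforming all baselines.

Insight: Combining image and lane features makes performance estimation reliable under domain shift, giving ADAS an efficient way to assess models with less annotation effort.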

Abstract: Lane detection is a critical component of Advanced Driver-Assistance Systems (ADAS) and Automated Driving System (ADS), providing essential spatial information for lateral control. However, domain shifts often undermine model reliability when deployed in new environments. Ensuring the robustness and safety of lane detection models typically requires collecting and annotating target domain data, which is resource-intensive. Estimating model performance without ground-truth labels offers a promising alternative for efficient robustness assessment, yet remains underexplored in lane detection. While previous work has addressed performance estimation in image classification, these methods are not directly applicable to lane detection tasks. This paper first adapts five well-performing performance estimation methods from image classification to lane detection, building a baseline. Addressing the limitations of prior approaches that solely rely on softmax scores or lane features, we further propose a new Lane Performance Estimation Framework (LanePerf), which integrates image and lane features using a pretrained image encoder and a DeepSets-based architecture, effectively handling zero-lane detection scenarios and large domain-shift cases. Extensive experiments on the OpenLane dataset, covering diverse domain shifts (scenes, weather, hours), demonstrate that our LanePerf outperforms all baselines, achieving a lower MAE of 0.117 and a higher Spearman’s rank correlation coefficient of 0.727. These findings pave the way for robust, label-free performance estimation in ADAS, supporting more efficient testing and improved safety in challenging driving scenarios.
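The two figures quoted in this entry, MAE and Spearman's rank correlation, both compare a label-free performance estimate against the true performance of the model on each test domain. The following is a minimal sketch of how such a comparison could be computed; the function names and the per-domain scores are illustrative assumptions, not values or code from the paper.

```python
def mean_absolute_error(estimated, actual):
    """Average absolute gap between estimated and true scores."""
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-domain scores (not results from the paper).
estimated = [0.62, 0.48, 0.71, 0.35]
actual = [0.70, 0.40, 0.75, 0.30]
print(mean_absolute_error(estimated, actual))
print(spearman(estimated, actual))
```

A low MAE says the estimates are close in absolute terms, while a high Spearman coefficient says the estimator orders the domains correctly even if its absolute values are off. The sketch assumes the scores are not all identical, since the correlation is undefined in that case.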

[53] Federated Learning for Commercial Image Sources

Shreyansh Jain,Koteswar Rao Jerripothula

Main category: cs.CV

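TL;DR: The paper introduces the first image classification dataset designed specifically for Federated Learning, together with two new algorithms, Fed-Cyclic and Fed-Star, which outperform existing baselines on that dataset.

Details Motivation: Federated Learning enables collaborative learning while preserving privacy, but no image classification dataset has been designed specifically for it. The paper aims to fill this gap and to provide new algorithms that improve Federated Learning performance.

Contribution: 1) Introduces the first image classification dataset aimed at Federated Learning; 2) designs two new algorithms, Fed-Cyclic and Fed-Star; 3) shows that both algorithms outperform baseline methods on the new dataset.

Method: 1) Fed-Cyclic: clients form a cyclic topology, passing and updating the weights in turn; 2) Fed-Star: clients form a star-like topology, updating their weights through pre-aggregation and local training.

Result: Experiments show that Fed-Cyclic and Fed-Star outperform existing baselines on the proposed dataset.

Insight: In Federated Learning, the communication topology (cyclic or star-like) has a notable effect on performance, and pre-aggregation can mitigate statistical heterogeneity.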

Abstract: Federated Learning is a collaborative machine learning paradigm that enables multiple clients to learn a global model without exposing their data to each other. Consequently, it provides a secure learning platform with privacy-preserving capabilities. This paper introduces a new dataset containing 23,326 images collected from eight different commercial sources and classified into 31 categories, similar to the Office-31 dataset. To the best of our knowledge, this is the first image classification dataset specifically designed for Federated Learning. We also propose two new Federated Learning algorithms, namely Fed-Cyclic and Fed-Star. In Fed-Cyclic, a client receives weights from its previous client, updates them through local training, and passes them to the next client, thus forming a cyclic topology. In Fed-Star, a client receives weights from all other clients, updates its local weights through pre-aggregation (to address statistical heterogeneity) and local training, and sends its updated local weights to all other clients, thus forming a star-like topology. Our experiments reveal that both algorithms perform better than existing baselines on our newly introduced dataset.
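The two topologies described above can be sketched in a few lines. This is a toy illustration only: `local_train` is a hypothetical stand-in for real local training (one gradient step toward the mean of the client's data), and the pre-aggregation rule in `fed_star_round` is assumed to be a plain mean over all clients' weights, which the abstract does not specify.

```python
def average(weight_sets):
    """Element-wise mean of several weight vectors."""
    return [sum(ws) / len(ws) for ws in zip(*weight_sets)]


def local_train(weights, data, lr=0.5):
    """Toy local step: one gradient step on 0.5 * ||w - mean(data)||^2."""
    target = average(data)
    return [w - lr * (w - t) for w, t in zip(weights, target)]


def fed_cyclic_round(weights, client_data):
    """Pass one weight vector around the ring, training at each client."""
    for data in client_data:
        weights = local_train(weights, data)
    return weights


def fed_star_round(client_weights, client_data):
    """Each client pre-aggregates all clients' weights, then trains locally."""
    # Assumption: pre-aggregation is a plain mean over every client's
    # weights, the receiving client's own weights included.
    shared = average(client_weights)
    return [local_train(shared, data) for data in client_data]


# Two clients, each holding a single one-dimensional sample.
client_data = [[[0.0]], [[2.0]]]
print(fed_cyclic_round([0.0], client_data))           # [1.0]
print(fed_star_round([[0.0], [2.0]], client_data))    # [[0.5], [1.5]]
```

In the cyclic variant a single model travels from client to client, so the result depends on visiting order. In the star-like variant every client keeps its own model and exchanges weights with all others each round.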

[54] Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models

Yifan Xu,Chao Zhang,Hanqi Jiang,Xiaoyan Wang,Ruifei Ma,Yiwei Li,Zihao Wu,Zeju Li,Xiangde Liu

Main category: cs.CV

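TL;DR: Argus uses multi-view images to strengthen LLMs on 3D scene understanding tasks, combining text, image, and point cloud data to compensate for the information lost in conventional point cloud reconstruction.

Details Motivation: Conventional 3D point cloud reconstruction of indoor scenes tends to lose information (for example textureless planes or the details of structurally complex objects), whereas multi-view images are visually consistent with the point cloud and offer more detail, which can make up for these deficiencies.

Contribution: Proposes Argus, a 3D multimodal framework that fuses multi-view images, camera poses, and point cloud data into a more complete 3D scene representation, extending the capability of LLMs on 3D tasks.

Method: Multi-view images and camera poses are fused into scene features, which interact with the 3D point cloud features to produce 3D-aware embeddings that compensate for the information lost during point cloud reconstruction.

Result: Experiments show that Argus outperforms existing 3D-LMMs on a variety of downstream tasks.

Insight: Multi-view images effectively fill in the information missing from point cloud reconstruction, improving the completeness and detail of 3D scene understanding.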

Abstract: Advancements in foundation models have enabled applications across various downstream tasks. In particular, recent work has shown a remarkable capability to extend Large Language Models (LLMs) to 3D scene understanding tasks. Current methods rely heavily on 3D point clouds, but the 3D point cloud reconstruction of an indoor scene often results in information loss. Some textureless planes or repetitive patterns are prone to omission and manifest as voids within the reconstructed 3D point clouds. Besides, objects with complex structures tend to introduce distortion of details caused by misalignments between the captured images and the dense reconstructed point clouds. 2D multi-view images present visual consistency with 3D point clouds and provide more detailed representations of scene components, which can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3D multimodal framework that leverages multi-view images for enhanced 3D scene understanding with LLMs. In general, Argus can be treated as a 3D Large Multimodal Foundation Model (3D-LMM) since it takes various modalities as input (text instructions, 2D multi-view images, and 3D point clouds) and expands the capability of LLMs to tackle 3D tasks. Argus involves fusing and integrating multi-view images and camera poses into view-as-scene features, which interact with the 3D features to create comprehensive and detailed 3D-aware scene embeddings. Our approach compensates for the information loss while reconstructing 3D point clouds and helps LLMs better understand the 3D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs in various downstream tasks.

[55] DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization

Dongyeun Lee,Jiwan Hur,Hyounguk Shon,Jae Young Lee,Junmo Kim

Main category: cs.CV

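TL;DR: The paper proposes DMQ, which combines LES and PTS to handle outliers in diffusion model quantization and substantially improves performance at low bit-widths.

Details Motivation: Diffusion models are computationally expensive, and existing post-training quantization methods overlook outliers, which degrades performance at low bit-widths.

Contribution: 1. Proposes DMQ, which combines LES and PTS; 2. introduces an adaptive timestep weighting scheme; 3. proposes channel-wise PTS and a voting algorithm.

Method: 1. LES optimizes channel-wise scaling factors to redistribute quantization difficulty; 2. PTS handles layers with high inter-channel variance; 3. critical timesteps are weighted adaptively.

Result: DMQ significantly outperforms existing methods at low bit-widths such as W4A6 and W4A8 while maintaining high image generation quality.

Insight: Outliers and quantization errors in early timesteps strongly affect the output of diffusion models and need targeted treatment.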

Abstract: Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at https://github.com/LeeDongYeun/dmq.
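The idea behind equivalent scaling is that dividing an activation channel by a factor and multiplying the matching weight by the same factor leaves the layer output unchanged while moving quantization difficulty from one side to the other. The sketch below illustrates that identity and a power-of-two factor chosen by majority vote over calibration batches. The helper names, the rule for choosing the exponent, and the majority vote are all assumptions for illustration; the abstract does not give the paper's learning procedure or its exact voting algorithm.

```python
import math
from collections import Counter


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def equivalent_scaling(x, w, s):
    """Divide activations by s and multiply weights by s, per channel."""
    x_scaled = [xi / si for xi, si in zip(x, s)]
    w_scaled = [wi * si for wi, si in zip(w, s)]
    return x_scaled, w_scaled


def pot_exponent(channel_values, target=1.0):
    """Smallest non-negative k with max|v| / 2**k <= target (assumed rule)."""
    peak = max(abs(v) for v in channel_values)
    if peak <= target:
        return 0
    return math.ceil(math.log2(peak / target))


def vote_exponent(calibration_batches):
    """Hypothetical vote: the exponent most batches agree on wins."""
    votes = Counter(pot_exponent(batch) for batch in calibration_batches)
    return votes.most_common(1)[0][0]


x = [8.0, 0.5]          # channel 0 carries an outlier
w = [0.25, 2.0]
s = [4.0, 1.0]          # illustrative scaling factors, not learned here
x_s, w_s = equivalent_scaling(x, w, s)
print(dot(x, w), dot(x_s, w_s))                # same output before and after
print(vote_exponent([[8.0], [7.0], [20.0]]))   # one unusual batch is outvoted
```

Power-of-two factors are attractive because dividing by 2**k can be implemented as a bit shift. Voting across batches is one simple way to keep a single unusual calibration sample from dictating the factor.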

[56] Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning

Yafei Zhang,Lingqi Kong,Huafeng Li,Jie Wen

Main category: cs.CV

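TL;DR: The paper proposes a weakly supervised visible-infrared person re-identification method. Its heterogeneous expert collaborative consistency learning framework addresses the lack of cross-modal labels, so the model can be trained with single-modal identity labels only.

Details Motivation: In visible-infrared person re-identification, cross-modal annotations are costly or unavailable, which limits model training. The paper aims to reduce the dependence on cross-modal labels through weak supervision.

Contribution: Proposes a heterogeneous expert collaborative consistency learning framework that trains experts with single-modal labels and achieves weakly supervised learning by having them predict the identities of cross-modal samples. A cross-modal relationship fusion mechanism is designed to improve prediction accuracy.

Method: The framework contains several modality-specific classification experts, which establish cross-modal associations by predicting the identities of samples from the other modality. A relationship fusion mechanism integrates the predictions and drives collaborative consistency learning.

Result: Experiments on two challenging datasets validate the method; the model extracts modality-invariant features and improves cross-modal identity recognition.

Insight: Collaborative learning among heterogeneous experts can model cross-modal identity correspondences even without cross-modal labels, suggesting a new direction for weakly supervised cross-modal learning.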

Abstract: To reduce the reliance of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, this paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels, addressing scenarios where cross-modal identity labels are unavailable. To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. This framework leverages labeled data from each modality to independently train dedicated classification experts. To associate cross-modal samples, these classification experts act as heterogeneous predictors, predicting the identities of samples from the other modality. To improve prediction accuracy, we design a cross-modal relationship fusion mechanism that effectively integrates predictions from different experts. Under the implicit supervision provided by cross-modal identity correspondences, collaborative and consistent learning among the experts is encouraged, significantly enhancing the model’s ability to extract modality-invariant features and improve cross-modal identity recognition. Experimental results on two challenging datasets validate the effectiveness of the proposed method.

[57] Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications

Yucheng Tang,Yunguan Fu,Weixi Yi,Yipei Wang,Daniel C. Alexander,Rhodri Davies,Yipeng Hu

Main category: cs.CV

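TL;DR: The paper proposes a multimodal uncertainty propagation model (MUPM) to quantify and analyze image, text, and joint image-text uncertainty in multimodal large language models (MLLMs), and validates its robustness and generalizability on cardiac MR scans and digital health records.

Details Motivation: To study how uncertainty from image and text inputs propagates and interrelates in multimodal large language models (MLLMs), in support of clinical applications, especially cardiac disease prediction.

Contribution: Proposes a multimodal uncertainty propagation model (MUPM) based on uncertainty propagation that quantifies and analyzes image, text, and joint-modality uncertainty, and shows that it generalizes across data distributions and tasks.

Method: Models uncertainty propagation, fits the MUPM with a small number of samples, and evaluates its robustness and generalizability on cardiac MR and digital health record data.

Result: The MUPM is robust in the few-sample regime and generalizes across data distributions and tasks, supporting clinical applications such as uncertainty estimation and the identification of redundant factors.

Insight: Shared pretraining and light fine-tuning allow the MUPM to capture the relationship between modality uncertainties and to transfer efficiently to new tasks, making it a practical tool for clinical decision-making.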

Abstract: Multimodal large language models (MLLMs) can process and integrate information from multimodality sources, such as text and images. However, the interrelationship among input modalities, the uncertainties due to individual uni-modal data, and the potential clinical applications following such an uncertainty decomposition are yet to be fully understood in the context of large-scale MLLMs. In this work, we propose a multimodal uncertainty propagation model (MUPM) based on uncertainty propagation, to characterise the relationship among the uncertainties arising from image-only, text-only, and joint image-text variations in MLLM inputs. Using real clinical data consisting of cardiac MR scans and digital health records, we show that MUPMs can be optimised robustly with a few samples. We then show that the fitted MUPMs are generalisable across different input data distributions and, perhaps surprisingly, across different downstream tasks. Such transferability may be explained by the shared pretraining, comparatively light MLLM fine-tuning, along with the low-dimensional nature of the MUPMs. More importantly, this learned transferability, quantifying the relationship between these uncertainties, led to direct clinical applications in which uncertainties may be estimated and thus analysed robustly for varying data or even a novel set of cardiac disease prediction tasks. In addition, we show experimentally the efficiency in multimodal data required for estimating the overall uncertainty and its ability to identify redundant factors, both of which are considered practical yet clinically useful applications with the proposed MUPMs. Code is available at https://github.com/yucheng722/MUPM.

[58] LoViC: Efficient Long Video Generation with Context Compression

Jiaxiu Jiang,Wenbo Li,Jingjing Ren,Yuping Qiu,Yong Guo,Xiaogang Xu,Han Wu,Wangmeng Zuo

Main category: cs.CV

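TL;DR: LoViC is a diffusion transformer (DiT) based framework that generates long videos efficiently through a segment-wise generation process. At its core is FlexFormer, a compression module that supports variable-length inputs and produces unified latent representations.

Details Motivation: Existing text-to-video methods struggle with long videos because of the quadratic complexity of self-attention, and approaches such as sparse attention or temporally autoregressive models often sacrifice temporal coherence or scalability.

Contribution: 1. Proposes LoViC, a DiT framework that supports segment-wise long video generation; 2. designs FlexFormer, which compresses video and text into unified latent representations; 3. encodes temporal context through position-aware mechanisms to support diverse tasks.

Method: LoViC combines a diffusion transformer with FlexFormer, an autoencoder that supports variable-length compression; its single query token design, based on the Q-Former architecture, gives linearly adjustable compression rates.

Result: Experiments show that LoViC is effective and versatile across diverse tasks and can generate long, temporally coherent videos.

Insight: With unified latent representations and segment-wise generation, LoViC addresses the complexity of long video generation while preserving temporal coherence.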

Abstract: Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts, such as sparse attention and temporally autoregressive models, offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.

[59] VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Senqiao Yang,Junyi Li,Xin Lai,Bei Yu,Hengshuang Zhao,Jiaya Jia

Main category: cs.CV

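TL;DR: VisionThink proposes a dynamic visual token compression method that uses reinforcement learning to decide when a higher image resolution is needed, saving substantial computation.

Details Motivation: Existing vision-language models typically use a large number of visual tokens, which most real-world tasks do not need, leading to inefficiency.

Contribution: 1. Proposes processing different samples at different resolutions dynamically; 2. introduces a reinforcement learning framework and the LLM-as-Judge strategy; 3. designs a reward function and penalty mechanism.

Method: The model starts from a downsampled image and emits a special token to request the higher-resolution image when needed; reinforcement learning is used to learn when compression is acceptable.

Result: The model performs strongly on OCR-related tasks while saving a large share of visual tokens on simpler tasks.

Insight: Combining dynamic resolution with reinforcement learning can improve both the efficiency and the performance of the model.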

Abstract: Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
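The inference-time control flow described above (try the downsampled image first, fall back to full resolution only when the model asks for it) can be sketched as follows. This is an illustration under assumptions: the token string `<request_high_res>`, the `answer` helper, and the stub model are hypothetical, and a real system would call a trained VLM rather than a hand-written rule.

```python
REQUEST_HIGH_RES = "<request_high_res>"  # hypothetical special token


def downsample(image, factor=2):
    """Keep every `factor`-th pixel per axis (factor 2 -> 1/4 of the pixels)."""
    return [row[::factor] for row in image[::factor]]


def answer(model, image, question):
    """Return (reply, used_high_res)."""
    reply = model(downsample(image), question)
    if reply.strip() == REQUEST_HIGH_RES:
        # The model judged the low-resolution view insufficient.
        return model(image, question), True
    return reply, False


def stub_model(image, question):
    """Toy stand-in for a VLM: text-reading questions need at least 4 rows."""
    if "read" in question and len(image) < 4:
        return REQUEST_HIGH_RES
    return f"answer from {len(image)}x{len(image[0])} image"


image = [[0] * 4 for _ in range(4)]
print(answer(stub_model, image, "what color is the car?"))
print(answer(stub_model, image, "read the sign"))
```

Only the second question triggers a second, full-resolution pass, so the extra visual tokens are spent only on the samples that need them.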

[60] FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

Qiang Wang,Mengchao Wang,Fan Jiang,Yaqi Fan,Yonggang Qi,Mu Xu

Main category: cs.CV

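TL;DR: FantasyPortrait proposes a diffusion transformer based method for generating high-fidelity, emotion-rich portrait animations for single and multiple characters. By introducing implicit representations and a masked cross-attention mechanism, it resolves the feature interference that existing methods suffer in multi-character animation and performs well in cross reenactment.

Details Motivation: Existing methods rely on explicit geometric priors (such as facial landmarks or 3DMM), which cause interference in multi-character animation and struggle to capture subtle emotions. FantasyPortrait aims to address these problems.

Contribution: 1. Proposes a diffusion transformer based framework that supports single- and multi-character animation; 2. designs an expression-augmented learning strategy that captures identity-agnostic facial dynamics; 3. proposes a masked cross-attention mechanism for independent multi-character expression; 4. releases the Multi-Expr dataset and the ExprBench benchmark.

Method: A diffusion transformer generates the animation, implicit representations capture expression dynamics, and masked cross-attention prevents feature interference between characters.

Result: The method significantly outperforms existing methods in quantitative and qualitative evaluations, particularly in cross reenactment and multi-character scenarios.

Insight: Combining implicit representations with attention mechanisms can improve the quality and diversity of generated animation, especially in complex scenes.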

Abstract: Producing expressive facial animations from static images is a challenging task. Prior methods relying on explicit geometric priors (e.g., facial landmarks or 3DMM) often suffer from artifacts in cross reenactment and struggle to capture subtle emotions. Furthermore, existing approaches lack support for multi-character animation, as driving features from different individuals frequently interfere with one another, complicating the task. To address these challenges, we propose FantasyPortrait, a diffusion transformer based framework capable of generating high-fidelity and emotion-rich animations for both single- and multi-character scenarios. Our method introduces an expression-augmented learning strategy that utilizes implicit representations to capture identity-agnostic facial dynamics, enhancing the model’s ability to render fine-grained emotions. For multi-character control, we design a masked cross-attention mechanism that ensures independent yet coordinated expression generation, effectively preventing feature interference. To advance research in this area, we propose the Multi-Expr dataset and ExprBench, which are specifically designed datasets and benchmarks for training and evaluating multi-character portrait animations. Extensive experiments demonstrate that FantasyPortrait significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, excelling particularly in challenging cross reenactment and multi-character contexts. Our project page is https://fantasy-amap.github.io/fantasy-portrait/.
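One way to picture how masked cross-attention prevents feature interference: each image token is allowed to attend only to the driving features of the character it belongs to. The sketch below implements plain scaled dot-product cross-attention with a boolean mask. The per-character masking rule is an assumption made for illustration, since the abstract does not describe how the mask is built, and every query is assumed to have at least one allowed key.

```python
import math


def masked_cross_attention(queries, keys, values, mask):
    """mask[i][j] is True when query i may attend to key j."""
    outputs = []
    for query, allowed in zip(queries, mask):
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale if ok else -math.inf
            for key, ok in zip(keys, allowed)
        ]
        peak = max(scores)  # finite because at least one key is allowed
        weights = [math.exp(s - peak) for s in scores]  # masked keys get 0.0
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


# Two image tokens (one per character) and two driving features.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
mask = [[True, False], [False, True]]  # token i sees only character i's feature
print(masked_cross_attention(queries, keys, values, mask))
```

With the mask in place each token's output depends solely on its own character's driving feature, so changing one character's expression cannot leak into the other's region.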

[61] Demographic-aware fine-grained classification of pediatric wrist fractures

Ammar Ahmed,Ali Shariq Imran,Zenun Kastrati,Sher Muhammad Daudpota

Main category: cs.CV

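TL;DR: The paper proposes a fine-grained classification method that combines multimodal data (X-ray images and patient metadata) to diagnose pediatric wrist fractures, addressing the small-dataset challenge and improving diagnostic accuracy.

Details Motivation: Diagnosing pediatric wrist fractures is time-consuming and depends on expert experience, and conventional computer vision methods are limited by data scarcity and by not making full use of multimodal data.

Contribution: 1. Frames the problem as a fine-grained recognition task; 2. improves performance by fusing patient metadata with X-ray images; 3. uses weights pretrained on a fine-grained dataset rather than a conventional coarse-grained dataset such as ImageNet.

Method: Adopts a fine-grained classification strategy that combines X-ray images with patient metadata (such as age and sex) and initializes the model with weights pretrained on a fine-grained dataset.

Result: Diagnostic accuracy improves by 2% on the limited dataset and by over 10% on a larger fracture-focused dataset.

Insight: Multimodal data fusion and fine-grained pretraining are clearly beneficial for small-sample medical imaging tasks.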

Abstract: Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. However, diagnosing these conditions is time-consuming and requires specialized expertise. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. In this study, we employ a multifaceted approach to address the challenge of recognizing wrist pathologies using an extremely limited dataset. Initially, we approach the problem as a fine-grained recognition task, aiming to identify subtle X-ray pathologies that conventional CNNs overlook. Secondly, we enhance network performance by fusing patient metadata with X-ray images. Thirdly, rather than pre-training on a coarse-grained dataset like ImageNet, we utilize weights trained on a fine-grained dataset. While metadata integration has been used in other medical domains, this is a novel application for wrist pathologies. Our results show that a fine-grained strategy and metadata integration improve diagnostic accuracy by 2% with a limited dataset and by over 10% with a larger fracture-focused dataset.

[62] Variance-Based Pruning for Accelerating and Compressing Trained Networks

Uranik Berisha,Jens Mehnert,Alexandru Paul Condurache

Main category: cs.CV

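TL;DR: The paper proposes a variance-based pruning method that quickly compresses and accelerates already trained networks while preserving high performance with minimal fine-tuning.

Details Motivation: Existing structured pruning methods usually need costly retraining, or even training from scratch, to recover accuracy. The paper aims to avoid this with a simple one-shot pruning technique.

Contribution: Proposes Variance-Based Pruning, which selects neurons to prune from activation statistics and integrates the mean activations back into the model to preserve performance, so that only a small amount of fine-tuning is needed to recover the original accuracy.

Method: The method gathers activation statistics to select the neurons to prune and, at the same time, folds the mean activations back into the model to maintain performance and avoid extensive retraining.

Result: On ImageNet-1k, DeiT-Base retains over 70% of its original performance directly after pruning and regains 99% of the original accuracy after only 10 epochs of fine-tuning, while MACs drop by 35% and model size by 36%, for a 1.44x speedup.

Insight: Selecting pruning targets by activation variance and compensating with mean activations compresses models efficiently without complex retraining, which is useful for deployment on resource-constrained devices.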

Abstract: The increasingly expensive training of ever larger models such as Vision Transformers motivates reusing the vast library of already trained state-of-the-art networks. However, their latency, high computational costs and memory demands pose significant challenges for deployment, especially on resource-constrained hardware. While structured pruning methods can reduce these factors, they often require costly retraining, sometimes for up to hundreds of epochs, or even training from scratch to recover the lost accuracy resulting from the structural modifications. Maintaining the provided performance of trained models after structured pruning and thereby avoiding extensive retraining remains a challenge. To solve this, we introduce Variance-Based Pruning, a simple and structured one-shot pruning technique for efficiently compressing networks, with minimal finetuning. Our approach first gathers activation statistics, which are used to select neurons for pruning. Simultaneously, the mean activations are integrated back into the model to preserve a high degree of performance. On ImageNet-1k recognition tasks, we demonstrate that directly after pruning DeiT-Base retains over 70% of its original performance and requires only 10 epochs of fine-tuning to regain 99% of the original accuracy while simultaneously reducing MACs by 35% and model size by 36%, thus speeding up the model by 1.44x.
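A minimal sketch of the pruning step for a single hidden layer, under stated assumptions: hidden neurons with the lowest activation variance are removed, and each removed neuron's mean activation is folded into the bias of the following linear layer so that its average contribution is kept. The function name, the data layout, and the choice to compensate through the next layer's bias are illustrative assumptions; the abstract only says that mean activations are integrated back into the model.

```python
from statistics import fmean, pvariance


def variance_based_prune(hidden_acts, w_out, b_out, n_prune):
    """hidden_acts[n][j]: activation of hidden neuron j on calibration sample n.
    w_out[o][j]: weight from hidden neuron j to output o; b_out[o]: output bias.
    Returns (kept neuron indices, pruned weight matrix, compensated bias)."""
    columns = list(zip(*hidden_acts))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    # Neurons whose output barely changes across inputs carry little information.
    pruned = sorted(range(len(columns)), key=lambda j: variances[j])[:n_prune]
    kept = [j for j in range(len(columns)) if j not in pruned]
    # Replace each pruned neuron by its mean activation, absorbed into the bias.
    new_b = [b + sum(row[j] * means[j] for j in pruned)
             for row, b in zip(w_out, b_out)]
    new_w = [[row[j] for j in kept] for row in w_out]
    return kept, new_w, new_b


# Neuron 0 is constant (variance 0); neuron 1 varies.
hidden_acts = [[1.0, 0.0], [1.0, 4.0]]
kept, new_w, new_b = variance_based_prune(hidden_acts, [[2.0, 3.0]], [0.5], 1)
print(kept, new_w, new_b)  # [1] [[3.0]] [2.5]
```

In this toy case the pruned neuron is exactly constant, so the compensated layer reproduces the original output: for the second sample, 2*1 + 3*4 + 0.5 equals 3*4 + 2.5. For neurons with small but non-zero variance the compensation is only approximate, which is where a short fine-tuning phase would come in.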

[63] Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Zihua Zhao,Feng Hong,Mengxi Chen,Pengyi Chen,Benyuan Liu,Jiangchao Yao,Ya Zhang,Yanfeng Wang

Main category: cs.CV

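TL;DR: The paper proposes Differential-Informed Sample Selection (DISSect), which uses differential information between models to identify noisy correspondence and thereby accelerates multimodal contrastive learning.

Details Motivation: The success of multimodal contrastive learning depends on large-scale datasets and expensive compute, and sample selection is an efficient way to accelerate training. Existing methods, however, either rely on an offline oracle model (unsuitable for cold-start scenarios) or fail to handle noisy correspondence sufficiently and efficiently.

Contribution: Proposes DISSect, which uses the difference between the predicted correlations of the current model and a historical model to judge sample quality more accurately, giving multimodal contrastive learning an efficient sample selection strategy.

Method: Rethinks the impact of noisy correspondence on contrastive learning, proposes a differential-based sample selection framework, and supports it with theoretical analysis.

Result: Experiments on three benchmark datasets and various downstream tasks show that DISSect outperforms existing methods in both training acceleration and performance.

Insight: Differential information can separate noisy samples effectively, and adjusting sample selection dynamically improves the efficiency and robustness of multimodal contrastive learning.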

Abstract: The remarkable success of contrastive-learning-based multimodal models has been greatly driven by training on ever-larger datasets with expensive compute consumption. Sample selection, as an efficient alternative paradigm, is an important direction for accelerating the training process. However, recent advances in sample selection either rely mostly on an oracle model to select a high-quality coreset offline, which is limiting in cold-start scenarios, or focus on online selection based on real-time model predictions, which does not sufficiently or efficiently account for noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative to characterize sample quality. Based on this, we construct a robust differential-based sample selection and analyze its theoretical insights. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods. Source code is available at: https://github.com/MediaBrain-SJTU/DISSect.
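The core quantity here is a per-sample differential: the image-text correlation predicted by the current model minus the one predicted by a historical checkpoint. The sketch below computes that differential with cosine similarity and keeps a fraction of the batch ranked by it. Using cosine similarity as the predicted correlation, the `keep_ratio` parameter, and above all the direction of the ranking are assumptions: the abstract does not say whether large or small differentials mark clean pairs, so the direction is left as a flag.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def differential_scores(cur_img, cur_txt, hist_img, hist_txt):
    """Current-model similarity minus historical-model similarity, per pair."""
    return [
        cosine(ci, ct) - cosine(hi, ht)
        for ci, ct, hi, ht in zip(cur_img, cur_txt, hist_img, hist_txt)
    ]


def select(scores, keep_ratio, keep_largest=False):
    """Indices of the kept samples; the ranking direction is an assumption."""
    k = max(1, int(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=keep_largest)
    return sorted(order[:k])


# Hypothetical 2-D embeddings for two image-text pairs.
cur_img, cur_txt = [[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]
hist_img, hist_txt = [[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]
scores = differential_scores(cur_img, cur_txt, hist_img, hist_txt)
print(scores)               # [1.0, 0.0]
print(select(scores, 0.5))  # [1]
```

Because the score compares two checkpoints of the same model, it needs no external oracle, which matches the cold-start motivation stated in the abstract.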

[64] Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization

Songlin Li,Guofeng Yu,Zhiqing Guo,Yunfeng Diao,Dan Ma,Gaobo Yang,Liejun Wang

Main category: cs.CV

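TL;DR: The paper proposes a scribble-annotation-based weakly supervised framework for image manipulation localization. It improves performance through self-supervised training and feature modulation modules, and on average outperforms fully supervised methods in experiments.

Details Motivation: Current deep-learning-based image manipulation localization methods depend on large amounts of pixel-level annotation, and existing weakly supervised methods (for example those using image-level labels) perform poorly, so the paper explores scribble annotation as a more efficient weak supervision signal.

Contribution: 1. The first scribble-annotated manipulation localization dataset (Sc-IML); 2. a scribble-based weakly supervised framework comprising self-supervised training, a prior-aware feature modulation module (PFMM), a gated adaptive fusion module (GAFM), and a confidence-aware entropy minimization loss ($\mathcal{L}_{CEM}$).

Method: 1. Self-supervised training with a structural consistency loss over multi-scale and augmented inputs; 2. PFMM adaptively integrates prior information; 3. GAFM guides feature fusion through gating mechanisms; 4. $\mathcal{L}_{CEM}$ dynamically regularizes predictions based on model uncertainty.

Result: The proposed method outperforms existing fully supervised methods on average in both in-distribution and out-of-distribution tests.

Insight: Scribble annotation offers an efficient weak supervision signal, and dynamic feature adjustment together with an uncertainty-driven loss markedly improves manipulation localization.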

Abstract: Deep learning-based image manipulation localization (IML) methods have achieved remarkable performance in recent years, but typically rely on large-scale pixel-level annotated datasets. To address the challenge of acquiring high-quality annotations, some recent weakly supervised methods utilize image-level labels to segment manipulated regions. However, the performance is still limited due to insufficient supervision signals. In this study, we explore a form of weak supervision that improves the annotation efficiency and detection performance, namely scribble annotation supervision. We re-annotate mainstream IML datasets with scribble labels and propose the first scribble-based IML (Sc-IML) dataset. Additionally, we propose the first scribble-based weakly supervised IML framework. Specifically, we employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions under multi-scale and augmented inputs. In addition, we propose a prior-aware feature modulation module (PFMM) that adaptively integrates prior information from both manipulated and authentic regions for dynamic feature adjustment, further enhancing feature discriminability and prediction consistency in complex scenes. We also propose a gated adaptive fusion module (GAFM) that utilizes gating mechanisms to regulate information flow during feature fusion, guiding the model toward emphasizing potential tampered regions. Finally, we propose a confidence-aware entropy minimization loss ($\mathcal{L}_{CEM}$). This loss dynamically regularizes predictions in weakly annotated or unlabeled regions based on model uncertainty, effectively suppressing unreliable predictions. Experimental results show that our method outperforms existing fully supervised approaches in terms of average performance both in-distribution and out-of-distribution.

[65] Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection

Jingyao Wang,Yiming Chen,Lingyu Si,Changwen Zheng

Main category: cs.CV

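TL;DR: Proposes a Hierarchical Coresets Selection (HCS) mechanism that improves the adaptation of vision-language models (VLMs) to complex wide-area scene understanding, enabling rapid understanding of scenes at any scale without additional fine-tuning.

Details Motivation: Existing vision-language models struggle to adapt to unseen complex wide-area scenes, with problems such as insufficient feature density and limited generalization, so a method is needed that improves adaptation without additional training.

Contribution: Proposes the HCS mechanism, which progressively selects key regions using a theoretically guaranteed importance function that accounts for utility, representativeness, robustness, and synergy, improving the generalization and efficiency of the model.

Method: HCS builds on hierarchical coreset selection: it progressively refines the selected regions, using the importance function to assess each region along several dimensions, which yields efficient and interpretable region selection.

Result: Experiments show that HCS performs strongly across a range of tasks, applies universally, and improves the adaptation speed and performance of the model.

Insight: HCS gives vision-language models an efficient, fine-tuning-free route to scene understanding, and its hierarchical selection and importance assessment have broad application potential.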

Abstract: Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advancements in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still face challenges in adaptation to unseen complex wide-area scenes. To address these challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understanding of unseen scenes at any scale using minimal interpretable regions while mitigating insufficient feature density. HCS is a plug-and-play method that is compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality in various tasks.

[66] Channel-wise Motion Features for Efficient Motion Segmentation

Riku Inoue,Masamitsu Tsuchiya,Yuji Yasui

Main category: cs.CV

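TL;DR: The paper proposes an efficient motion segmentation method based on Channel-wise Motion Features that needs only a Pose Network and is markedly better than existing methods in speed and parameter efficiency.

Details Motivation: Existing motion segmentation models usually combine several subnetworks (depth, pose, optical flow, and so on), which makes them computationally expensive and hard to run in real time. The authors aim for efficient motion segmentation through a simpler model structure.

Contribution: Proposes Channel-wise Motion Features, which capture the 3D motion information of the scene using only a Pose Network and substantially reduce computational cost.

Method: Depth features are extracted for each instance in the feature map and combined with the Pose Network to build Channel-wise Motion Features, with no other subnetworks required.

Result: On the KITTI and Cityscapes datasets, the method reaches about 4 times the FPS of state-of-the-art models and reduces the parameters to about 25%, at equivalent accuracy.

Insight: A Pose Network alone is enough to capture motion information efficiently, offering a lightweight solution for real-time motion segmentation.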

Abstract: For safety-critical robotics applications such as autonomous driving, it is important to detect all required objects accurately in real-time. Motion segmentation offers a solution by identifying dynamic objects from the scene in a class-agnostic manner. Recently, various motion segmentation models have been proposed, most of which jointly use subnetworks to estimate Depth, Pose, Optical Flow, and Scene Flow. As a result, the overall computational cost of the model increases, hindering real-time performance. In this paper, we propose a novel cost-volume-based motion feature representation, Channel-wise Motion Features. By extracting depth features of each instance in the feature map and capturing the scene’s 3D motion information, it offers enhanced efficiency. The only subnetwork used to build Channel-wise Motion Features is the Pose Network, and no others are required. Our method not only achieves about 4 times the FPS of state-of-the-art models in the KITTI Dataset and Cityscapes of the VCAS-Motion Dataset, but also demonstrates equivalent accuracy while reducing the parameters to about 25%.

[67] Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection

Riku Inoue,Masamitsu Tsuchiya,Yuji Yasui

Main category: cs.CV

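TL;DR: Proposes Decoupled PROB, a new model that decouples query initialization tasks and objectness-class learning to resolve the conflict between objectness and class prediction in Open World Object Detection, significantly improving performance.

Details Motivation: In Open World Object Detection (OWOD), detecting and learning unknown objects is challenging, in particular because objectness and class predictions conflict. The existing PROB method needs no pseudo-labels, yet it still suffers from this conflict between objectness and class learning.

Contribution: 1. Proposes Early Termination of Objectness Prediction (ETOP), which stops objectness prediction at an appropriate decoder layer to avoid conflict with class prediction. 2. Proposes Task-Decoupled Query Initialization (TDQI), which efficiently extracts features of known and unknown objects. TDQI is a module that is easy to integrate into DETR-based OWOD models.

Method: 1. The ETOP module stops objectness prediction at an appropriate decoder layer. 2. The TDQI module combines query selection with learnable queries to improve feature extraction.

Result: Decoupled PROB significantly outperforms existing methods on several OWOD benchmarks.

Insight: Decoupling objectness learning from class learning is key to improving OWOD performance, and the designs of ETOP and TDQI suggest new ways to improve DETR-based methods.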

Abstract: Open World Object Detection (OWOD) is a challenging computer vision task that extends standard object detection by (1) detecting and classifying unknown objects without supervision, and (2) incrementally learning new object classes without forgetting previously learned ones. The absence of ground truths for unknown objects makes OWOD tasks particularly challenging. Many methods have addressed this by using pseudo-labels for unknown objects. The recently proposed Probabilistic Objectness transformer-based open-world detector (PROB) is a state-of-the-art model that does not require pseudo-labels for unknown objects, as it predicts probabilistic objectness. However, this method faces issues with learning conflicts between objectness and class predictions. To address this issue and further enhance performance, we propose a novel model, Decoupled PROB. Decoupled PROB introduces Early Termination of Objectness Prediction (ETOP) to stop objectness predictions at appropriate layers in the decoder, resolving the learning conflicts between class and objectness predictions in PROB. Additionally, we introduce Task-Decoupled Query Initialization (TDQI), which efficiently extracts features of known and unknown objects, thereby improving performance. TDQI is a query initialization method that combines query selection and learnable queries, and it is a module that can be easily integrated into existing DETR-based OWOD models. Extensive experiments on OWOD benchmarks demonstrate that Decoupled PROB surpasses all existing methods across several metrics, significantly improving performance.

[68] GLAD: Generalizable Tuning for Vision-Language Models

Yuqi Peng,Pengfei Wang,Jianzhuang Liu,Shifeng Chen

Main category: cs.CV

TL;DR: GLAD proposes a generalizable tuning framework for vision-language models, using LoRA and gradient regularization to mitigate overfitting in few-shot learning, outperforming existing methods on multiple benchmarks.

Details Motivation: Existing prompt tuning methods in vision-language models often overfit in few-shot scenarios and require complex architectures, limiting their general applicability.

Contribution: GLAD introduces a simple yet effective framework combining LoRA tuning and gradient regularization, enhancing robustness and performance across diverse tasks.

Method: GLAD employs Low-Rank Adaptation (LoRA) for efficient tuning and adds gradient-based regularization to stabilize optimization and prevent overfitting.

Result: GLAD achieves superior performance in base-to-novel class generalization, domain generalization, and cross-dataset tasks on 15 benchmarks.

Insight: The success of GLAD highlights the effectiveness of gradient regularization in few-shot learning, offering a simpler alternative to complex prompt-based methods.

Abstract: Pre-trained vision-language models, such as CLIP, show impressive zero-shot recognition ability and can be easily transferred to specific downstream tasks via prompt tuning, even with limited training data. However, existing prompt tuning methods face two main challenges: (1) In few-shot scenarios, data scarcity often leads to overfitting, making the model sensitive to changes in the input domain. (2) To mitigate overfitting, these methods typically rely on complex task-specific model architectures and sensitive hyperparameter tuning, severely restricting their general applicability. To address these issues, we propose a simpler and more general framework called GLAD (Generalizable LoRA tuning with RegulArized GraDient). We show that merely applying LoRA achieves performance in downstream tasks comparable to current state-of-the-art prompt-based methods. While LoRA is effective and easy to use, it remains susceptible to overfitting in few-shot learning scenarios. To mitigate this risk, we introduce a gradient-based regularization technique. This technique effectively steers the optimization trajectory, encouraging the model to find a more stable parameter region that is robust to variations in data distribution. Through extensive experiments conducted on 15 benchmark datasets, we demonstrate that GLAD outperforms previous tuning approaches in terms of base-to-novel class generalization, image domain generalization, and cross-dataset generalization. The code will be publicly available.
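
The LoRA tuning that GLAD builds on can be illustrated with a minimal sketch. It assumes the standard LoRA formulation, in which a frozen weight W is augmented by a scaled low-rank product; the toy sizes, the `alpha` value, and all function names are illustrative and not taken from the paper. GLAD's gradient-based regularizer is not specified in the abstract, so it is not shown.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w, down, up, alpha):
    """Return W + (alpha / r) * up @ down, the standard LoRA reparameterization."""
    rank = len(down)               # down is (r x in_dim), up is (out_dim x r)
    delta = matmul(up, down)       # low-rank update with shape (out_dim x in_dim)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


random.seed(0)
out_dim, in_dim, rank = 4, 6, 2    # toy sizes chosen for illustration only

# Frozen pre-trained weight.
w = [[random.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
# Trainable factors: small random down-projection, zero-initialized up-projection.
down = [[random.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
up = [[0.0] * rank for _ in range(out_dim)]

w_eff = lora_weight(w, down, up, alpha=4.0)
trainable_params = rank * (in_dim + out_dim)
full_params = out_dim * in_dim
```

Because the up-projection starts at zero, `w_eff` equals the pre-trained weight before any training step. The parameter saving is small at these toy sizes and grows as the layer dimensions grow relative to the rank.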

[69] R^2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning

Xiaohan Guo,Yusong Cai,Zejia Liu,Zhengning Wang,Lili Pan,Hongliang Li

Main category: cs.CV

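TL;DR: R^2MoE is a parameter-efficient framework for lifelong visual concept learning. It mitigates catastrophic forgetting with a mixture of experts and a routing distillation mechanism, and it reduces parameters and interference through redundant-expert elimination and hierarchical attention.

Details Motivation: Existing lifelong learning models face catastrophic forgetting and parameter expansion. R^2MoE aims to learn new concepts efficiently with minimal parameter overhead.

Contribution: 1. A mixture-of-experts framework with routing distillation; 2. a strategy for eliminating redundant experts; 3. a hierarchical local attention-guided inference approach.

Method: Routing distillation lets experts acquire concept-specific knowledge while preserving the gating network's routing capability; redundant layer-wise experts are eliminated; and hierarchical local attention reduces interference between concepts.

Result: On the CustomConcept 101 dataset, forgetting rates drop by 87.8% and parameters by 63.3%, and the generated images show better conceptual fidelity than the SOTA method.

Insight: Combining dynamic routing with redundancy removal can support lifelong learning efficiently, and hierarchical attention helps disentangle concepts.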

Abstract: Enabling large-scale generative models to continuously learn new visual concepts is essential for personalizing pre-trained models to meet individual user preferences. Existing approaches for continual visual concept learning are constrained by two fundamental challenges: catastrophic forgetting and parameter expansion. In this paper, we propose Redundancy-Removal Mixture of Experts (R^2MoE), a parameter-efficient framework for lifelong visual concept learning that effectively learns new concepts while incurring minimal parameter overhead. Our framework includes three key innovative contributions: First, we propose a mixture-of-experts framework with a routing distillation mechanism that enables experts to acquire concept-specific knowledge while preserving the gating network’s routing capability, thereby effectively mitigating catastrophic forgetting. Second, we propose a strategy for eliminating redundant layer-wise experts that reduces the number of expert parameters by fully utilizing previously learned experts. Third, we employ a hierarchical local attention-guided inference approach to mitigate interference between generated visual concepts. Extensive experiments have demonstrated that our method generates images with superior conceptual fidelity compared to the state-of-the-art (SOTA) method, achieving an impressive 87.8% reduction in forgetting rates and 63.3% fewer parameters on the CustomConcept 101 dataset. Our code is available at https://github.com/learninginvision/R2MoE

[70] Leveraging Language Prior for Infrared Small Target Detection

Pranav Singh,Pravendra Singh

Main category: cs.CV

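TL;DR: The paper proposes a multimodal infrared small target detection (IRSTD) framework that incorporates language priors, using textual information to guide detection in images and significantly improving detection performance.

Details Motivation: Existing IRSTD methods rely solely on the image modality, which limits detection performance. The success of language models in visual tasks motivates combining modalities.

Contribution: 1. A multimodal IRSTD framework that incorporates language priors; 2. LangIR, a multimodal infrared dataset with both text and images; 3. experiments that validate the approach.

Method: Uses the GPT-4 vision model to generate text descriptions of target locations and enhances detection with language-guided attention weights.

Result: On the NUAA-SIRST and IRSTD-1k subsets of LangIR, the method improves significantly over the state-of-the-art method in IoU, nIoU, Pd, and Fa.

Insight: Language priors can compensate for the limits of image-only methods, and multimodal data can further improve small target detection.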

Abstract: IRSTD (InfraRed Small Target Detection) detects small targets in infrared blurry backgrounds and is essential for various applications. The detection task is challenging due to the small size of the targets and their sparse distribution in infrared small target datasets. Although existing IRSTD methods and datasets have led to significant advancements, they are limited by their reliance solely on the image modality. Recent advances in deep learning and large vision-language models have shown remarkable performance in various visual recognition tasks. In this work, we propose a novel multimodal IRSTD framework that incorporates language priors to guide small target detection. We leverage language-guided attention weights derived from the language prior to enhance the model’s ability for IRSTD, presenting a novel approach that combines textual information with image data to improve IRSTD capabilities. Utilizing the state-of-the-art GPT-4 vision model, we generate text descriptions that provide the locations of small targets in infrared images, employing careful prompt engineering to ensure improved accuracy. Due to the absence of multimodal IR datasets, existing IRSTD methods rely solely on image data. To address this shortcoming, we have curated a multimodal infrared dataset that includes both image and text modalities for small target detection, expanding upon the popular IRSTD-1k and NUDT-SIRST datasets. We validate the effectiveness of our approach through extensive experiments and comprehensive ablation studies. The results demonstrate significant improvements over the state-of-the-art method, with relative percentage differences of 9.74%, 13.02%, 1.25%, and 67.87% in IoU, nIoU, Pd, and Fa on the NUAA-SIRST subset, and 4.41%, 2.04%, 2.01%, and 113.43% on the IRSTD-1k subset of the LangIR dataset, respectively.

[71] DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

Maulana Bisyir Azhari,David Hyunchul Shim

Main category: cs.CV

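TL;DR: DINO-VO is a feature-based visual odometry system that uses the DINOv2 visual foundation model for sparse feature matching. A salient keypoint detector and added geometric features improve localization accuracy and robustness.

Details Motivation: Learning-based monocular visual odometry faces challenges in robustness, generalization, and efficiency. Visual foundation models such as DINOv2 perform well across vision tasks, but their coarse feature granularity has limited their use in VO.

Contribution: 1) DINO-VO, built on sparse feature matching with DINOv2; 2) a salient keypoint detector tailored to DINOv2 features; 3) geometric features that improve localizability; 4) a transformer-based matcher and a differentiable pose estimation layer.

Method: 1) Extract robust semantic features with DINOv2; 2) complement them with fine-grained geometric features; 3) perform transformer-based feature matching and pose estimation.

Result: Outperforms prior frame-to-frame VO methods on TartanAir and KITTI and is competitive on EuRoC, running at 72 FPS with less than 1GB of memory. It is also competitive with Visual SLAM systems in outdoor driving scenarios.

Insight: Features from visual foundation models can improve the robustness and generalization of VO, but they need to be combined with geometric features to suit fine-grained localization.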

Abstract: Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoints detector tailored to DINOv2’s coarse features. Furthermore, we complement DINOv2’s robust-semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on EuRoC dataset, while running efficiently at 72 FPS with less than 1GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.

[72] SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Xiangyu Dong,Haoran Zhao,Jiang Gao,Haozhou Li,Xiaoguang Ma,Yaoming Zhou,Fuhai Chen,Juan Liu

Main category: cs.CV

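TL;DR: The paper proposes SE-VLN, a self-evolving vision-language navigation framework based on multimodal large language models. With hierarchical memory, retrieval-augmented reasoning, and a reflection module, the navigation agent keeps evolving during testing, which significantly raises navigation success rates in unseen environments.

Details Motivation: Existing vision-language navigation (VLN) methods benefit from large language models (LLMs) but are constrained by the LLMs' fixed knowledge bases and reasoning abilities, so they cannot fully incorporate experiential knowledge and lack the capacity to self-evolve.

Contribution: Proposes SE-VLN, described by the authors as the first self-evolving VLN framework powered by a multimodal LLM. It evolves continually through hierarchical memory, retrieval-augmented reasoning, and a reflection module, significantly improving navigation performance.

Method: 1. Hierarchical memory module: turns successful and failed cases into reusable knowledge; 2. retrieval-augmented reasoning module: retrieves experience and supports multi-step decision-making; 3. reflection module: realizes continual evolution.

Result: On the R2R and REVERSE datasets, navigation success rates reach 57% and 35.2%, absolute improvements of 23.9% and 15.0% over current state-of-the-art methods. Performance also improves as the experience repository grows.

Insight: SE-VLN shows the potential of self-evolving agents for VLN and suggests a direction for dynamic knowledge integration and continual learning in multimodal tasks.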

Abstract: Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, which prevented them from fully incorporating experiential knowledge and thus resulted in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents, and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that a multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and failed cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that the SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on R2R and REVERSE datasets, respectively. Moreover, the SE-VLN showed performance improvement with increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.

[73] Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models

Arian Mousakhan,Sudhanshu Mittal,Silvio Galesso,Karim Farid,Thomas Brox

Main category: cs.CV

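TL;DR: Orbis addresses long-horizon prediction in driving world models. It reaches state-of-the-art performance with simple design choices and without additional supervision or extra sensors.

Details Motivation: Existing world models for autonomous driving struggle with long-horizon generation and with generalization to challenging scenarios. The authors aim to improve on this with simple design choices.

Contribution: A lightweight world model, Orbis, with only 469M parameters and trained on 280h of video data, that stands out in difficult scenarios such as turning maneuvers and urban traffic.

Method: Sets up a hybrid tokenizer that is compatible with both discrete-token models and continuous models based on flow matching, compares the two side by side, and adopts the continuous autoregressive model, which proves more powerful and less brittle.

Result: Orbis achieves state-of-the-art performance in long-horizon prediction and in difficult scenarios.

Insight: In driving world models, the continuous autoregressive model is less sensitive to individual design choices and more powerful than the model built on discrete tokens.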

Abstract: Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.

[74] Leveraging Pre-Trained Visual Models for AI-Generated Video Detection

Keerthi Veeramachaneni,Praveen Tirupattur,Amrit Singh Bedi,Mubarak Shah

Main category: cs.CV

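TL;DR: The paper proposes detecting AI-generated videos with pre-trained visual models. Training a simple linear classification layer on the extracted features gives efficient detection.

Details Motivation: As AI-generated video quality improves, telling real from generated content becomes increasingly important. Existing methods mostly target DeepFakes; this work addresses the need to detect generated videos with generic content.

Contribution: A detection method that uses features extracted from pre-trained models and needs no additional model training, together with VID-AID, a compiled dataset of around 10,000 AI-generated videos and 4,000 real videos.

Method: Extracts features with pre-trained visual models and trains a linear classification layer to separate real from generated videos.

Result: On VID-AID, average detection accuracy exceeds 90%.

Insight: Features from pre-trained models carry signals that distinguish real from generated content, which simplifies the detection task.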

Abstract: Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, the challenge of detecting the generated content becomes critical in combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection are largely focused on deepfakes, which primarily involve human faces. However, the field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. The features extracted from these pre-trained models, which have been trained on extensive real visual content, contain inherent signals that can help distinguish real from generated videos. Using these extracted features, we achieve high detection performance without requiring additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validated our method on a dataset we compiled (VID-AID), which includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that our approach achieves high detection accuracy, above 90% on average, underscoring its effectiveness. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this critical area.
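
The "simple linear classification layer on top of the extracted features" can be sketched as a logistic-regression probe trained on frozen features. This is a minimal illustration under stated assumptions: the two-dimensional feature vectors are toy stand-ins for encoder outputs, and the learning rate, step count, and function names are hypothetical rather than the authors' settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, lr=0.5, steps=200):
    """Fit logistic regression on fixed features with plain gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            # Gradient of the logistic loss: (prediction - label) * input.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return 1 for 'real' and 0 for 'generated'."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Toy stand-ins for features from a frozen pre-trained encoder (label 1 = real).
real = [(2.0, 1.0), (1.5, 2.0), (2.5, 1.5)]
generated = [(-2.0, -1.0), (-1.5, -2.0), (-2.5, -1.5)]
features = real + generated
labels = [1, 1, 1, 0, 0, 0]

w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
```

Only `w` and `b` are trained; the encoder that would produce the features stays frozen, which is what keeps this kind of probe cheap.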

[75] VITA: Vision-to-Action Flow Matching Policy

Dechen Gao,Boqi Zhao,Andrew Lee,Ian Chuang,Hanchu Zhou,Hang Wang,Zhe Zhao,Junshan Zhang,Iman Soltani

Main category: cs.CV

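TL;DR: VITA is a vision-to-action flow matching policy. It evolves latent visual representations directly into latent actions, removing the complex conditioning mechanisms of conventional methods and substantially improving inference efficiency.

Details Motivation: Conventional flow matching and diffusion policies sample from standard source distributions such as Gaussian noise and need additional mechanisms such as cross-attention to condition action generation on visual information, which adds time and space overhead. VITA aims to remove these modules and learn the vision-to-action mapping directly.

Contribution: 1. VITA, which treats latent images as the flow source and learns action generation directly from vision; 2. a structured action latent space that resolves the dimensional mismatch between the vision and action modalities; 3. supervision of flow matching through flow latent decoding, which enables efficient end-to-end learning.

Method: 1. An autoencoder builds the action latent space used as the flow matching target; 2. flow latent decoding backpropagates the action reconstruction loss through the flow matching ODE solving steps; 3. a simple MLP architecture avoids complex conditioning mechanisms.

Result: On 5 simulation and 2 real-world tasks on the ALOHA platform, VITA outperforms or matches SOTA generative policies and reduces inference latency by 50-130%. According to the authors, it is the first MLP-only flow matching policy to solve complex bi-manual manipulation tasks.

Insight: Using visual representations directly as the flow source and structuring the action space can simplify generative policy architectures and improve efficiency while keeping generative capability. The approach may extend to other cross-modal tasks.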

Abstract: We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.
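
The idea of a flow whose source is a vision latent rather than Gaussian noise can be sketched with the common straight-path flow matching formulation. This is an assumption for illustration: the abstract does not state the interpolation path or the ODE solver, the latent values are made up, and the oracle velocity function stands in for a trained MLP.

```python
def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network on the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, num_steps):
    """Solve dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


vision_latent = [0.2, -1.0, 0.5, 3.0]   # hypothetical source latent (flow start)
action_latent = [1.0, 0.0, -0.5, 2.0]   # hypothetical target latent (flow end)


def oracle(x, t):
    """Stand-in for a trained network: returns the exact target velocity."""
    return target_velocity(vision_latent, action_latent)


predicted = euler_integrate(vision_latent, oracle, num_steps=4)
midpoint = interpolate(vision_latent, action_latent, 0.5)
```

Both latents share one shape here, which mirrors the abstract's point that raw actions are up-sampled into a latent space matching the visual representation. With a learned velocity field the Euler steps would only approximate the target, and the action reconstruction loss would be backpropagated through them.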

[76] Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

Yiting Yang,Hao Luo,Yuan Sun,Qingsen Yan,Haokui Zhang,Wei Dong,Guoqing Wang,Peng Wang,Yang Yang,Hengtao Shen

Main category: cs.CV

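TL;DR: The paper proposes an Approximately Orthogonal Fine-Tuning (AOFT) strategy that generates approximately orthogonal down-projection and up-projection matrices to improve the generalization of pre-trained Vision Transformers (ViT) on downstream tasks.

Details Motivation: In current parameter-efficient fine-tuning (PEFT) methods, the down/up-projection matrices lack approximate orthogonality, whereas the weight matrices of pre-trained ViTs exhibit it. The authors argue that preserving this property can further improve generalization.

Contribution: AOFT, which generates approximately orthogonal down/up-projection matrices from a single learnable vector so that their properties match those of the pre-trained backbone weight matrices, improving generalization.

Method: A single learnable vector generates a set of approximately orthogonal vectors that form the down/up-projection matrices, aligning them with the approximate orthogonality of the pre-trained weights.

Result: Experiments show that AOFT achieves competitive performance across a range of downstream image classification tasks, supporting its effectiveness in enhancing generalization.

Insight: Approximate orthogonality lowers the upper bound on the model's generalization error and thereby improves generalization. This observation suggests a new direction for efficient ViT fine-tuning.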

Abstract: A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality among any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model’s generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
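
The observation about approximate orthogonality among row vectors can be made concrete with a small diagnostic. The sketch below measures the largest absolute cosine similarity between any two rows of a matrix; the metric and the toy matrices are illustrative assumptions, and the abstract does not say how the authors quantify the property or how AOFT builds its vectors.

```python
import math


def max_abs_cosine(rows):
    """Largest |cosine similarity| between any two distinct rows (0 = orthogonal)."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in rows]
    worst = 0.0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            worst = max(worst, abs(dot) / (norms[i] * norms[j]))
    return worst


# Toy matrices: exactly orthogonal rows, slightly perturbed rows, parallel rows.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, -3.0]]
nearly_orthogonal = [[1.0, 0.05, 0.0], [0.0, 1.0, 0.05], [0.05, 0.0, 1.0]]
collinear = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]

scores = {
    "orthogonal": max_abs_cosine(orthogonal),
    "nearly_orthogonal": max_abs_cosine(nearly_orthogonal),
    "collinear": max_abs_cosine(collinear),
}
```

A score near 0 indicates approximately orthogonal rows and a score near 1 indicates nearly parallel rows, so the same function could be applied to a backbone weight matrix and to a down/up-projection matrix to compare them.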

[77] FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization

Chuancheng Shi,Yixiang Chen,Burong Lei,Jichao Chen

Main category: cs.CV

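TL;DR: FashionPose is presented as the first unified text-to-pose-to-relighting generation framework for personalized fashion visualization. It predicts a 2D human pose from a text description, generates high-fidelity person images with a diffusion model, and adjusts illumination with a lightweight relighting module, combining semantic flexibility with illumination adaptability.

Details Motivation: Fashion e-commerce needs realistic and controllable garment visualization so that users can preview garments under different poses and lighting conditions. Existing methods often rely on predefined poses and lack semantic flexibility and illumination adaptability.

Contribution: 1. FashionPose, the first text-to-pose-to-relighting generation framework;
2. unified pose alignment, garment rendering, and lighting control driven by text input;
3. experiments demonstrating fine-grained pose synthesis and efficient, consistent relighting.

Method: 1. Predict a 2D human pose from the text description;
2. generate high-fidelity person images with a diffusion model;
3. adjust illumination with a lightweight relighting module, with every stage guided by the same text input.

Result: Experiments show that FashionPose achieves fine-grained pose synthesis and efficient, consistent relighting, providing a practical solution for personalized virtual fashion display.

Insight: FashionPose unifies text-driven conditional generation with pose and lighting control, which addresses the limited semantic flexibility and illumination adaptability of existing methods.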

Abstract: Realistic and controllable garment visualization is critical for fashion e-commerce, where users expect personalized previews under diverse poses and lighting conditions. Existing methods often rely on predefined poses, limiting semantic flexibility and illumination adaptability. To address this, we introduce FashionPose, the first unified text-to-pose-to-relighting generation framework. Given a natural language description, our method first predicts a 2D human pose, then employs a diffusion model to generate high-fidelity person images, and finally applies a lightweight relighting module, all guided by the same textual input. By replacing explicit pose annotations with text-driven conditioning, FashionPose enables accurate pose alignment, faithful garment rendering, and flexible lighting control. Experiments demonstrate fine-grained pose synthesis and efficient, consistent relighting, providing a practical solution for personalized virtual fashion display.

[78] Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark

Junsu Kim,Naeun Kim,Jaeho Lee,Incheol Park,Dongyoon Han,Seungryul Baek

Main category: cs.CV

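TL;DR: The paper identifies reproducibility and quality problems in the current reasoning-based pose estimation (RPE) benchmark and releases refined annotations to make evaluation more reliable.

Details Motivation: The RPE benchmark uses image indices that differ from those of the original 3DPW dataset and suffers from image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, all of which hinder fair evaluation of pose-aware multimodal large language models (MLLMs).

Contribution: The authors refine the ground-truth (GT) annotations through meticulous visual matching and release them publicly to support research on pose-aware multimodal reasoning.

Method: Visual matching is used to refine the GT annotations and resolve the image index mismatch.

Result: More accurate GT annotations that remove the need for tedious, error-prone manual matching.

Insight: High-quality benchmark annotations are essential for reliable research on pose-aware multimodal reasoning.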

Abstract: The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluations. Most notably, the benchmark utilizes different image indices from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching processes to obtain accurate ground-truth (GT) annotations for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, collectively undermining reliable evaluations across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release these refined annotations as an open-source resource, thereby promoting consistent quantitative evaluations and facilitating future advancements in human pose-aware multimodal reasoning.

[79] A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains

Antonio Finocchiaro,Alessandro Sebastiano Catinello,Michele Mazzamuto,Rosario Leonardi,Antonino Furnari,Giovanni Maria Farinella

Main category: cs.CV

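TL;DR: The paper proposes a real-time system for detecting egocentric hand-object interactions in industrial domains. It combines an action recognition module with an object detection module for efficient and accurate interaction detection.

Details Motivation: In real-time applications, hand-object interaction detection is key to intuitive user experiences but remains an open challenge. The paper addresses this problem, focusing on real-time requirements in industrial settings.

Contribution: A cascaded architecture that combines action recognition (a Mamba model) with object detection (YOLOWorld) for efficient real-time interaction detection, evaluated on the ENIGMA-51 benchmark.

Method: Cascaded architecture: the action recognition module (Mamba with an EfficientNetV2 backbone) triggers the object detection module (YOLOWorld). Once action recognition predicts a contact state, the object detection module is activated and runs inference on the relevant frame.

Result: The action recognition module reaches 38.52% p-AP on ENIGMA-51 at 30fps, and the fine-tuned YOLOWorld reaches 85.13% AP for hand and object.

Insight: A cascaded design can improve detection accuracy while preserving real-time operation, offering a workable approach to interaction detection in industrial domains.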

Abstract: Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-object interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with EfficientNetV2 as backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.
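
The cascaded control flow described in the abstract can be sketched as a gate in front of the detector. The two model functions below are hypothetical stand-ins (the real system uses a Mamba-based action recognizer and a fine-tuned YOLOWorld), and the frame dictionaries are made up for illustration.

```python
def run_cascade(frames, predict_contact, detect_objects):
    """Run the detector only on frames where the action model predicts contact."""
    detections = {}
    for index, frame in enumerate(frames):
        if predict_contact(frame):                     # stage 1: runs on every frame
            detections[index] = detect_objects(frame)  # stage 2: runs on demand
    return detections


# Hypothetical stand-ins for the two models; each frame is a plain dict here.
detector_calls = []


def predict_contact(frame):
    return frame["hand_touching"]


def detect_objects(frame):
    detector_calls.append(frame["id"])   # record which frames reach the detector
    return frame["visible_objects"]


frames = [
    {"id": 0, "hand_touching": False, "visible_objects": ["screwdriver"]},
    {"id": 1, "hand_touching": True, "visible_objects": ["screwdriver"]},
    {"id": 2, "hand_touching": True, "visible_objects": ["pliers"]},
    {"id": 3, "hand_touching": False, "visible_objects": []},
]
detections = run_cascade(frames, predict_contact, detect_objects)
```

In this toy run the detector is called on two of the four frames, which shows how gating the second stage can save computation on frames without contact.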

[80] Taming Diffusion Transformer for Real-Time Mobile Video Generation

Yushu Wu,Yanyu Li,Anil Kag,Ivan Skorokhodov,Willi Menapace,Ke Ma,Arpit Sahni,Ju Hu,Aliaksandr Siarohin,Dhritiman Sagar,Yanzhi Wang,Sergey Tulyakov

Main category: cs.CV

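TL;DR: The paper proposes a set of optimizations that let Diffusion Transformers (DiT) generate high-quality video in real time on mobile devices, including data compression, model pruning, and inference step distillation.

Details Motivation: DiT performs strongly in video generation, but its high computational cost makes it impractical on resource-constrained mobile devices and rules out real-time generation.

Contribution: Three optimizations: 1) a highly compressed variational autoencoder (VAE) that reduces input dimensionality; 2) a knowledge distillation (KD)-guided tri-level pruning strategy that shrinks the model; 3) an adversarial step distillation technique tailored for DiT that reduces inference to four steps.

Method: Combines the highly compressed VAE, sensitivity-aware pruning, and the tailored adversarial step distillation to significantly accelerate DiT video generation.

Result: Achieves over 10 FPS generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.

Insight: With compression applied to the input, the model, and the sampling steps, DiT can run efficiently on mobile devices while preserving generation quality.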

Abstract: Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platform while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.

[81] Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Yudong Jin,Sida Peng,Xuan Wang,Tao Xie,Zhen Xu,Yifan Yang,Yujun Shen,Hujun Bao,Xiaowei Zhou

Main category: cs.CV

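TL;DR: The paper proposes Diffuman4D, a method based on 4D diffusion models whose sliding iterative denoising process improves high-fidelity human view synthesis from sparse-view videos and addresses the limited spatio-temporal consistency of existing methods.

Details Motivation: High-fidelity human view synthesis from sparse-view video input is challenging, and existing methods lose synthesis quality because their outputs lack spatio-temporal consistency.

Contribution: A sliding iterative denoising process that alternately denoises the latent grid along the spatial and temporal dimensions, enhancing the spatio-temporal consistency of the 4D diffusion model.

Method: Defines a latent grid in which each latent encodes the image, camera pose, and human pose; alternately denoises the grid along the spatial and temporal dimensions with a sliding window; and finally decodes target-viewpoint videos from the denoised latents.

Result: On the DNA-Rendering and ActorsHQ datasets, the method synthesizes high-quality, consistent novel-view videos and significantly outperforms existing approaches.

Insight: The sliding window lets information flow across the latent grid, giving the diffusion model a large receptive field and improving 4D consistency while keeping GPU memory consumption affordable.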

Abstract: This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/ .
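
The traversal order of the sliding iterative denoising process can be sketched on a toy latent grid. Everything below is an illustrative assumption: each latent is a single float, `denoise_window` is a stand-in that averages values instead of running a diffusion model, and the window size, stride, and round count are arbitrary. The sketch shows only how alternating spatial and temporal passes let information spread across the grid.

```python
def denoise_window(latents):
    """Stand-in denoiser: pull each latent halfway toward the window mean."""
    mean = sum(latents) / len(latents)
    return [0.5 * (z + mean) for z in latents]


def slide(sequence, window, stride):
    """Denoise a 1D sequence window by window, writing results back as it goes."""
    out = list(sequence)
    for start in range(0, len(out) - window + 1, stride):
        out[start:start + window] = denoise_window(out[start:start + window])
    return out


def alternating_denoise(grid, window=2, stride=1, rounds=3):
    """grid[v][t] is the latent for view v at time t (a float in this toy)."""
    views, times = len(grid), len(grid[0])
    for _ in range(rounds):
        # Spatial pass: slide across views at each timestamp.
        for t in range(times):
            column = slide([grid[v][t] for v in range(views)], window, stride)
            for v in range(views):
                grid[v][t] = column[v]
        # Temporal pass: slide across timestamps within each view.
        for v in range(views):
            grid[v] = slide(grid[v], window, stride)
    return grid


def spread(grid):
    """Range of values in the grid, used here as a crude inconsistency measure."""
    flat = [z for row in grid for z in row]
    return max(flat) - min(flat)


noisy = [[0.0, 4.0, 1.0], [3.0, -2.0, 5.0], [1.0, 2.0, -1.0]]
before = spread(noisy)
after = spread(alternating_denoise([row[:] for row in noisy]))
```

Each call touches only `window` latents at a time, which corresponds to the abstract's point that the sliding scheme keeps memory consumption affordable while still covering the whole grid.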

[82] AutoPartGen: Autogressive 3D Part Generation and Discovery

Minghao Chen,Jianyuan Wang,Roman Shapovalov,Tom Monnier,Hyunyoung Jung,Dilin Wang,Rakesh Ranjan,Iro Laina,Andrea Vedaldi

Main category: cs.CV

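TL;DR: AutoPartGen is a model that generates 3D parts autoregressively. Given an image, 2D masks, or an existing 3D object, it produces a compositional 3D reconstruction.

Details Motivation: Existing methods lack flexibility and efficiency in 3D part generation and composition. AutoPartGen aims to address this with automated part generation and assembly.

Contribution: 1. AutoPartGen, a model that generates 3D parts autoregressively; 2. evidence that the compositional properties of the 3DShape2VecSet latent space suit part-based generation; 3. a mechanism that automatically determines the type and number of parts.

Method: Built on the 3DShape2VecSet latent representation, the model generates one part per step, conditioning on previously generated parts and on additional inputs such as images or 3D objects.

Result: AutoPartGen achieves state-of-the-art performance in 3D part generation, and the generated parts can be assembled seamlessly.

Insight: The 3DShape2VecSet latent space has strong compositional properties that suit part-based generation, and the autoregressive formulation models dependencies between parts.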

Abstract: We introduce AutoPartGen, a model that generates objects composed of 3D parts in an autoregressive manner. This model can take as input an image of an object, 2D masks of the object’s parts, or an existing 3D object, and generate a corresponding compositional 3D reconstruction. Our approach builds upon 3DShape2VecSet, a recent latent 3D representation with powerful geometric expressiveness. We observe that this latent space exhibits strong compositional properties, making it particularly well-suited for part-based generation tasks. Specifically, AutoPartGen generates object parts autoregressively, predicting one part at a time while conditioning on previously generated parts and additional inputs, such as 2D images, masks, or 3D objects. This process continues until the model decides that all parts have been generated, thus determining automatically the type and number of parts. The resulting parts can be seamlessly assembled into coherent objects or scenes without requiring additional optimization. We evaluate both the overall 3D generation capabilities and the part-level generation quality of AutoPartGen, demonstrating that it achieves state-of-the-art performance in 3D part generation.
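
The autoregressive loop with a learned stopping decision can be sketched as follows. The generator here is a hypothetical stand-in that returns part names from a fixed list; the real model predicts latent 3D parts, and the `None` stop signal and `max_parts` cap are assumptions made for illustration.

```python
def generate_parts(next_part, condition, max_parts=16):
    """Generate parts one at a time until the model emits a stop signal (None)."""
    parts = []
    while len(parts) < max_parts:
        part = next_part(condition, parts)   # conditions on input and earlier parts
        if part is None:                     # the model decides the object is complete
            break
        parts.append(part)
    return parts


# Hypothetical stand-in for the learned generator: it proposes the first part
# named in the condition that has not been generated yet, then stops.
def toy_next_part(condition, previous_parts):
    remaining = [name for name in condition["part_names"]
                 if name not in previous_parts]
    return remaining[0] if remaining else None


chair = {"part_names": ["seat", "back", "leg_1", "leg_2", "leg_3", "leg_4"]}
parts = generate_parts(toy_next_part, chair)
truncated = generate_parts(toy_next_part, chair, max_parts=2)
```

Because the loop ends when the generator signals completion, the number of parts is an output of the process rather than a fixed input, which matches the abstract's description of automatically determining the type and number of parts.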

[83] $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning

Yifan Wang,Jianjun Zhou,Haoyi Zhu,Wenzheng Chang,Yang Zhou,Zizun Li,Junyi Chen,Jiangmiao Pang,Chunhua Shen,Tong He

Main category: cs.CV

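TL;DR: π³ proposes a visual geometry learning method that needs no fixed reference view; its permutation-equivariant network is robust to input ordering, highly scalable, and reaches SOTA performance on multiple tasks.

Details Motivation: Conventional methods rely on a fixed reference view, which can make reconstruction unstable or cause it to fail. This paper aims to remove that dependence and achieve more robust geometric reconstruction with a permutation-equivariant architecture.

Contribution: Proposes π³, a fully permutation-equivariant network architecture that predicts affine-invariant camera poses and scale-invariant point maps without a reference frame.

Method: Uses a permutation-equivariant feed-forward neural network that predicts camera poses and point maps, avoiding dependence on input order and on a reference view.

Result: Achieves state-of-the-art performance on camera pose estimation, monocular/video depth estimation, and dense point map reconstruction.

Insight: Permutation equivariance makes the model insensitive to input order, improving stability and scalability and offering a new direction for visual geometry learning.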

Abstract: We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.
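
To make the term concrete, permutation equivariance means that reordering the input views only reorders the outputs in the same way. The sketch below checks this property on a toy layer over scalar "views". The layer and its coefficients are assumptions for illustration only and have nothing to do with the actual $\pi^3$ architecture.

```python
def equivariant_layer(xs, a=2.0, b=-1.0):
    # Each element is combined with an order-independent summary (the mean),
    # so reordering the inputs only reorders the outputs.
    mean = sum(xs) / len(xs)
    return [a * x + b * mean for x in xs]


def permute(seq, order):
    # Reorder seq so that position k holds seq[order[k]].
    return [seq[i] for i in order]


views = [1.0, 4.0, 7.0]
order = [2, 0, 1]

# Equivariance: layer(permute(x)) equals permute(layer(x)).
permute_then_apply = equivariant_layer(permute(views, order))
apply_then_permute = permute(equivariant_layer(views), order)
```

A model that treated the first view as a special reference frame would generally fail this check, because moving a different view into first position would change every output.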

[84] Hierarchical Rectified Flow Matching with Mini-Batch Couplings

Yichi Zhang,Yici Yan,Alex Schwing,Zhizhen Zhao

Main category: cs.CV

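TL;DR: The paper uses mini-batch couplings in hierarchical rectified flow matching to gradually adjust distribution complexity across levels of the hierarchy, better modeling multi-modal velocity fields, and shows the benefits on synthetic and image data.

Details Motivation: Existing hierarchical flow matching can model multi-modal velocity distributions, but the complexity of the modeled distribution stays the same at every level. The paper aims to adjust that complexity gradually across levels via mini-batch couplings, so that complex data distributions can be modeled more flexibly.

Contribution: Proposes hierarchical rectified flow matching with mini-batch couplings, which adjusts distribution complexity across levels and improves the generative model's ability to capture multi-modal data.

Method: Introduces mini-batch couplings into the hierarchical rectified flow matching framework and uses them to adjust the complexity of the velocity distribution at different levels, better capturing the multi-modal nature of the data.

Result: Experiments on synthetic and real image data show that the method improves generative performance and is better at modeling complex data distributions.

Insight: Adjusting distribution complexity across hierarchy levels allows more flexible modeling of multi-modal data and suggests a new direction for flow matching in multi-modal settings.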

Abstract: Flow matching has emerged as a compelling generative modeling approach that is widely used across domains. To generate data via a flow matching model, an ordinary differential equation (ODE) is numerically solved via forward integration of the modeled velocity field. To better capture the multi-modality that is inherent in typical velocity fields, hierarchical flow matching was recently introduced. It uses a hierarchy of ODEs that are numerically integrated when generating data. This hierarchy of ODEs captures the multi-modal velocity distribution, just as vanilla flow matching can model a multi-modal data distribution. While this hierarchy enables modeling of multi-modal velocity distributions, the complexity of the modeled distribution remains identical across levels of the hierarchy. In this paper, we study how to gradually adjust the complexity of the distributions across different levels of the hierarchy via mini-batch couplings. We show the benefits of mini-batch couplings in hierarchical rectified flow matching via compelling results on synthetic and imaging data. Code is available at https://riccizz.github.io/HRF_coupling.
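
As a rough illustration of what a mini-batch coupling is, the sketch below pairs noise and data samples within one batch so that the total squared distance is minimal, then forms the straight-line velocity targets used in rectified flow. It is a minimal sketch under stated assumptions: samples are 1D floats, the brute-force search over permutations is only viable for tiny batches, and the function names are hypothetical. How the paper applies couplings at each level of the hierarchy is not reproduced here.

```python
from itertools import permutations


def minibatch_coupling(noise, data):
    # Brute-force optimal assignment inside one mini-batch: choose the pairing
    # of noise and data samples with the smallest total squared distance.
    # Real code would use an assignment or optimal transport solver instead.
    assert len(noise) == len(data)
    best_cost, best_perm = None, None
    for perm in permutations(range(len(data))):
        cost = sum((noise[i] - data[j]) ** 2 for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(noise[i], data[j]) for i, j in enumerate(best_perm)]


def velocity_targets(pairs):
    # Regression target for the straight path x_t = (1 - t) * x0 + t * x1,
    # whose time derivative is the constant x1 - x0.
    return [x1 - x0 for x0, x1 in pairs]
```

With independent random pairing, a noise sample may be matched to a far-away data sample, so the velocity at a given point is spread over many directions. Coupling within the batch shortens and straightens the paths, which is one way to reduce the complexity of the velocity distribution a model has to learn.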

[85] VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Shihao Wang,Guo Chen,De-an Huang,Zhiqi Li,Minghan Li,Guilin Li,Jose M. Alvarez,Lei Zhang,Zhiding Yu

Main category: cs.CV

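TL;DR: VideoITG proposes customized video frame sampling driven by user instructions. It generates annotations automatically with the VidThinker pipeline, builds the VideoITG-40K dataset, and designs a plug-and-play VideoITG model that improves multimodal video understanding.

Details Motivation: Current frame selection methods largely follow unsupervised paradigms and perform poorly on complex scenarios in long video understanding; a more flexible method is needed that selects key frames according to user instructions.

Contribution: 1) Proposes the VidThinker pipeline, which mimics the human annotation process; 2) Builds the VideoITG-40K dataset with 40K videos and 500K annotations; 3) Designs a plug-and-play VideoITG model that improves video understanding performance.

Method: 1) Generate detailed clip-level captions conditioned on the instruction; 2) Retrieve relevant video segments through instruction-guided reasoning; 3) Select frames at fine granularity to pinpoint key visual evidence.

Result: Combined with Video-LLMs, VideoITG delivers consistent gains across multimodal video understanding benchmarks.

Insight: Instruction-driven frame selection handles video understanding tasks more flexibly and is especially effective for long videos.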

Abstract: Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, largely adopt unsupervised learning paradigms and struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of the visual-language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potential for video understanding.

cs.GR [Back]

[86] HairFormer: Transformer-Based Dynamic Neural Hair Simulation

Joy Xiaoji Zhang,Jingsen Zhu,Hanyu Chen,Steve Marschner

Main category: cs.GR

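TL;DR: The paper proposes HairFormer, a two-stage Transformer-based neural method for dynamic hair simulation that generalizes across hairstyles, body shapes, and motions.

Details Motivation: The key challenge in dynamic hair simulation is generalizing across arbitrary hairstyles, body shapes, and motions. The authors address it with a Transformer architecture.

Contribution: 1. Proposes the first Transformer-based method for dynamic hair simulation; 2. A static network resolves hair-body penetration; 3. A dynamic network uses cross-attention to generate complex secondary motion; 4. Supports real-time inference and efficient fine-tuning.

Method: 1. A static network predicts the static draped shape for any hairstyle; 2. A dynamic network fuses static features with kinematic input through cross-attention to generate dynamics.

Result: The method simulates a variety of hairstyles with high fidelity and resolves penetrations even for unseen long hairstyles, showing strong generalization.

Insight: Transformer architectures show promise for dynamic hair simulation; fusing static and dynamic features through cross-attention in particular offers a new approach to generalization.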

Abstract: Simulating hair dynamics that generalize across arbitrary hairstyles, body shapes, and motions is a critical challenge. Our novel two-stage neural solution is the first to leverage Transformer-based architectures for such a broad generalization. We propose a Transformer-powered static network that predicts static draped shapes for any hairstyle, effectively resolving hair-body penetrations and preserving hair fidelity. Subsequently, a dynamic network with a novel cross-attention mechanism fuses static hair features with kinematic input to generate expressive dynamics and complex secondary motions. This dynamic network also allows for efficient fine-tuning of challenging motion sequences, such as abrupt head movements. Our method offers real-time inference for both static single-frame drapes and dynamic drapes over pose sequences. Our method demonstrates high-fidelity and generalizable dynamic hair across various styles, guided by physics-informed losses, and can resolve penetrations even for complex, unseen long hairstyles, highlighting its broad generalization.

eess.SY [Back]

[87] Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis

Saswat Priyadarshi Nayak,Guoyuan Wu,Kanok Boriboonsomsin,Matthew Barth

Main category: eess.SY

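TL;DR: The paper presents a dual-LiDAR system for estimating traffic movement counts at a signalized intersection, addressing the poor performance of camera-based methods in harsh weather and at night.

Details Motivation: Traditional counting methods such as cameras perform poorly under bad lighting, while LiDAR is becoming cheaper and more widely used, opening new possibilities for traffic movement counting.

Contribution: Develops and deploys a dual-LiDAR system for traffic movement counting, providing vehicle classification and count estimation based on 3D bounding box detections.

Method: Uses 3D bounding box detections from two LiDARs to classify and count vehicles by traffic direction, vehicle movement, and vehicle class.

Result: The paper presents the estimated counts, analyzes trends and irregularities, and discusses directions for improvement.

Insight: The improvements identified for the dual-LiDAR system could benefit not only movement count estimation but also trajectory forecasting and intent prediction at intersections.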

Abstract: Traffic Movement Count (TMC) at intersections is crucial for optimizing signal timings, assessing the performance of existing traffic control measures, and proposing efficient lane configurations to minimize delays, reduce congestion, and promote safety. Traditionally, methods such as manual counting, loop detectors, pneumatic road tubes, and camera-based recognition have been used for TMC estimation. Although generally reliable, camera-based TMC estimation is prone to inaccuracies under poor lighting conditions during harsh weather and nighttime. In contrast, Light Detection and Ranging (LiDAR) technology has recently been gaining popularity due to reduced costs and its expanding use in 3D object detection, tracking, and related applications. This paper presents the authors’ endeavor to develop, deploy, and evaluate a dual-LiDAR system at an intersection in the city of Rialto, California, for TMC estimation. The 3D bounding box detections from the two LiDARs are used to classify vehicle counts based on traffic directions, vehicle movements, and vehicle classes. This work discusses the estimated TMC results and provides insights into the observed trends and irregularities. Potential improvements are also discussed that could enhance not only TMC estimation, but also trajectory forecasting and intent prediction at intersections.
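
The counting step (by direction, movement, and class) can be sketched as follows. This is a minimal sketch, assuming a four-leg intersection and assuming that an upstream tracker has already reduced each vehicle's 3D bounding box trajectory to an entry leg, an exit leg, and a class label. The leg names, tuple layout, and function names are hypothetical and not taken from the paper.

```python
from collections import Counter

LEGS = ["N", "E", "S", "W"]  # intersection legs in clockwise order
MOVEMENTS = {0: "u-turn", 1: "left", 2: "through", 3: "right"}


def classify_movement(entry_leg, exit_leg):
    # A vehicle entering from the north leg travels southbound, so leaving by
    # the east leg is a left turn, the south leg is through, the west leg is right.
    step = (LEGS.index(exit_leg) - LEGS.index(entry_leg)) % 4
    return MOVEMENTS[step]


def count_movements(tracks):
    # tracks: iterable of (entry_leg, exit_leg, vehicle_class) tuples.
    counts = Counter()
    for entry_leg, exit_leg, vehicle_class in tracks:
        movement = classify_movement(entry_leg, exit_leg)
        counts[(entry_leg, movement, vehicle_class)] += 1
    return counts
```

For example, `count_movements([("N", "S", "car"), ("N", "S", "car"), ("W", "N", "truck")])` counts two southbound through cars and one left-turning truck entering from the west leg.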

cs.RO [Back]

[88] Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

Liuyi Wang,Xinyuan Xia,Hui Zhao,Hanqing Wang,Tai Wang,Yilun Chen,Chengju Liu,Qijun Chen,Jiangmiao Pang

Main category: cs.RO

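TL;DR: The paper introduces VLN-PE, a platform for studying physical and visual disparities in vision-and-language navigation (VLN), and evaluates several methods in physical robotic settings.

Details Motivation: Existing VLN research makes idealized assumptions about robot movement and control that do not reflect the physical challenges of real deployment; a more realistic platform is needed.

Contribution: Proposes the VLN-PE platform, which supports multiple robot embodiments, and presents the first systematic evaluation of several VLN methods in physical settings.

Method: Introduces VLN-PE with support for humanoid, quadruped, and wheeled robots, and evaluates classification models, a diffusion model, and a map-based large language model with path planning.

Result: Performance degrades significantly because of limited observation space, lighting variations, and physical challenges such as collisions and falls.

Insight: Current models generalize poorly in physical deployment, but VLN-PE offers a new route to improving cross-embodiment adaptability.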

Abstract: Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving overall cross-embodiment adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.

eess.AS [Back]

[89] UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets

Zhichao Sheng,Shilin Zhou,Chen Gong,Zhenghua Li

Main category: eess.AS

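TL;DR: UniSLU is a unified framework that jointly models multiple SLU tasks in a single architecture and exploits heterogeneous cross-task datasets to improve performance.

Details Motivation: Existing SLU methods usually design a separate model for each task, which makes systems complex and makes it hard to exploit cross-task interaction and data.

Contribution: Proposes the UniSLU framework and a unified representation that support joint modeling of ASR, NER, and SA, and integrates the generative capabilities of large language models.

Method: Uses a unified representation and generative modeling to train ASR, NER, and SA jointly, strengthening task interaction and dataset utilization.

Result: Outperforms benchmark methods on public SLU datasets and suits real-world speech-based multimedia scenarios.

Insight: Unified representation and joint modeling effectively improve SLU performance and confirm the potential of large language models for SLU.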

Abstract: Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, achieving superior SLU performance compared to several benchmark methods, making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models on GitHub to facilitate future research.

cs.AI [Back]

[90] From Roots to Rewards: Dynamic Tree Reasoning with RL

Ahmed Bahloul,Simon Malberg

Main category: cs.AI

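TL;DR: The paper proposes a dynamic reinforcement learning framework that builds the reasoning tree incrementally from real-time confidence estimates and learns a policy for choosing among decomposition, retrieval, and aggregation, addressing the static design and computational inefficiency of ProbTree.

Details Motivation: Modern language models struggle with error propagation and knowledge integration on complex questions. Tree-structured reasoning methods such as ProbTree mitigate this through hierarchical decomposition and confidence-weighted aggregation, but the static implementation means the tree cannot adapt to intermediate results and is computationally inefficient.

Contribution: A dynamic reinforcement learning framework that turns tree-structured reasoning into an adaptive process; selective expansion and focused resource allocation improve solution quality and computational efficiency while preserving probabilistic rigor.

Method: Combines tree-structured reasoning with reinforcement learning: the tree is built dynamically from real-time confidence estimates, and a policy is learned for selecting actions (decomposition, retrieval, or aggregation).

Result: The framework improves solution quality and computational efficiency while preserving probabilistic rigor, offering a new paradigm for tree-structured reasoning.

Insight: Dynamically adaptive reasoning trees can balance the reliability of probabilistic frameworks with the flexibility real-world problems demand, suggesting a way to optimize complex question answering systems.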

Abstract: Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.

cs.LG [Back]

[91] Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training

Mingjie Liu,Shizhe Diao,Jian Hu,Ximing Lu,Xin Dong,Hao Zhang,Alexander Bukharin,Shaokun Zhang,Jiaqi Zeng,Makesh Narsimhan Sreedhar,Gerald Shen,David Mosallanezhad,Di Zhang,Jonas Yang,June Yang,Oleksii Kuchaiev,Guilin Liu,Zhiding Yu,Pavlo Molchanov,Yejin Choi,Jan Kautz,Yi Dong

Main category: cs.LG

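TL;DR: The paper studies how prolonged reinforcement learning (RL) training affects a small language model across diverse reasoning tasks. It identifies key ingredients, including verifiable reward tasks, an improved GRPO, and stability techniques, and reports substantial gains in several domains.

Details Motivation: Prior work has focused on improving reasoning by scaling test-time computation (such as chain-of-thought reasoning and iterative exploration). This study instead examines prolonged RL training to reveal how much long-horizon training can improve a small language model.

Contribution: 1) Identifies key ingredients for prolonged RL training across diverse reasoning tasks; 2) Enhances Group Relative Policy Optimization (GRPO); 3) Introduces controlled KL regularization, clipping ratio, and periodic reference policy resets, which yield substantial gains.

Method: 1) Use verifiable reward tasks to provide a grounded supervision signal; 2) Enhance GRPO to improve policy updates; 3) Add stability techniques such as KL regularization and periodic reference policy resets.

Result: The model substantially outperforms baselines on math (+14.7%), coding (+13.9%), and logic puzzles (+54.8%).

Insight: Prolonged RL training combined with effective stability techniques can substantially improve a small language model on diverse reasoning tasks, rather than relying only on test-time compute scaling.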

Abstract: Recent advancements in reasoning-focused language models such as OpenAI’s O1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.
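
For readers unfamiliar with GRPO, the two pieces most relevant to the abstract are the group-relative advantage (rewards normalized within a group of responses to the same prompt) and the clipped policy ratio. The sketch below shows both on plain floats. It is a minimal sketch of the general technique, not the paper's modified GRPO: the separate `clip_low` and `clip_high` bounds and the default values are assumptions, and the KL regularization and reference policy resets mentioned in the abstract are omitted.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward by the mean and standard deviation of its group,
    # so no learned value function is needed.
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_objective(ratio, advantage, clip_low=0.2, clip_high=0.2):
    # PPO-style clipped surrogate for one token or sample.
    # ratio = pi_new(action) / pi_old(action).
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)
```

With verifiable 0/1 rewards, a group such as `[1, 0, 0, 1]` yields advantages close to `[1, -1, -1, 1]`: correct responses are pushed up and incorrect ones down, relative to the group average.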

[92] A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

Weijieying Ren,Jingxi Zhu,Zehao Liu,Tianxiang Zhao,Vasant Honavar

Main category: cs.LG

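TL;DR: A survey of electronic health record (EHR) modeling, from deep learning to large language models (LLMs). It discusses challenges specific to EHR data, such as heterogeneity and temporal irregularity, and proposes a unified taxonomy spanning five key design dimensions.

Details Motivation: The complexity and domain specificity of EHR data pose unique modeling challenges, so a systematic review of progress in deep learning and LLMs for this field is needed to advance AI in healthcare.

Contribution: Proposes a unified taxonomy covering five key design dimensions, systematically reviews representative methods, and highlights emerging trends such as foundation models and LLM-driven clinical agents.

Method: Reviews methods along five design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems, with attention to data quality enhancement, self-supervised learning, and related topics.

Result: Provides a structured roadmap for EHR modeling, summarizes strengths and limitations of existing methods, and discusses future research directions.

Insight: LLMs hold great potential for EHR modeling, but challenges in benchmarking, explainability, and clinical alignment must be addressed before broader clinical use.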

Abstract: Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to https://survey-on-tabular-data.github.io/.

[93] Probabilistic Soundness Guarantees in LLM Reasoning Chains

Weiqiu You,Anton Xue,Shreya Havaldar,Delip Rao,Helen Jin,Chris Callison-Burch,Eric Wong

Main category: cs.LG

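TL;DR: The paper proposes ARES, a probabilistic framework for detecting and preventing error propagation in LLM reasoning chains. It judges each step using only premises already assessed as sound and provides a probabilistic guarantee per step rather than a binary label.

Details Motivation: In reasoning chains generated by large language models (LLMs), early errors tend to propagate and undermine the reliability of the final conclusion. Existing methods do not adequately capture how earlier errors affect later reasoning, so a more effective way to detect and prevent such propagation is needed.

Contribution: Proposes Autoregressive Reasoning Entailment Stability (ARES), which gives a probabilistic soundness guarantee for every step of a reasoning chain and avoids error propagation.

Method: ARES works autoregressively: it evaluates each reasoning step using only premises previously assessed as sound and produces a nuanced score for each step instead of a binary label.

Result: Reaches 72.1% Macro-F1 across four benchmarks (+8.2 points) and is robust on very long synthetic reasoning chains, where it reaches 90.3% F1 on detecting propagated errors (+27.6 points).

Insight: Probabilistic scoring rather than binary judgment yields more robust evaluation of reasoning chains and captures error propagation more effectively in long chains.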

Abstract: In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors might corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim based only on previously-assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
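
One way to read "judging each claim based only on previously-assessed sound premises" is as an expectation over which earlier claims are kept. The sketch below is an interpretation built on that reading, not the paper's algorithm: `entail_prob` is a hypothetical judge (for example an LLM call returning a probability), earlier claims are assumed to be kept independently with probability equal to their own score, and the expectation is computed by exhaustive enumeration, which is exponential in chain length. The certified statistical guarantees mentioned in the abstract are not reproduced.

```python
from itertools import product


def soundness_scores(premises, claims, entail_prob):
    # entail_prob(context, claim) -> probability in [0, 1] that the context
    # entails the claim. Hypothetical judge supplied by the caller.
    scores = []
    for i, claim in enumerate(claims):
        total = 0.0
        # Enumerate which earlier claims are kept (1) or dropped (0).
        for mask in product([0, 1], repeat=i):
            weight = 1.0
            context = list(premises)
            for keep, prior_claim, prior_score in zip(mask, claims[:i], scores):
                weight *= prior_score if keep else 1.0 - prior_score
                if keep:
                    context.append(prior_claim)
            if weight > 0.0:
                total += weight * entail_prob(context, claim)
        scores.append(total)
    return scores
```

Under this reading, a step that only follows from a doubtful earlier step inherits that doubt instead of being judged in a context that silently assumes the earlier step was correct. A practical version would presumably sample kept subsets rather than enumerate them.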

[94] Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

Hao Sun,Mihaela van der Schaar

Main category: cs.LG

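TL;DR: The paper reviews how inverse reinforcement learning (IRL) can address alignment in the era of large language models (LLMs), discusses how RL techniques for LLM alignment differ from those in conventional RL tasks, and outlines future research directions.

Details Motivation: With the rise of LLMs, alignment has become a core challenge for making them more reliable, controllable, and capable. RL plays an increasingly prominent role in enhancing these systems, driving research interest at the intersection of RL and LLM alignment.

Contribution: 1) A systematic review of recent advances in LLM alignment from the IRL perspective; 2) Emphasis on the necessity of building neural reward models from human data; 3) Open questions and future directions for the field.

Method: Reviews fundamental RL concepts and, drawing on IRL techniques, discusses methods for LLM alignment, including neural reward model construction, datasets, evaluation metrics, and efficient training techniques.

Result: The paper synthesizes findings from many studies, points out limitations of current methods, and proposes directions for improvement, for example insights drawn from sparse-reward RL.

Insight: LLM alignment needs closer integration of formal analysis and practice. IRL techniques are promising here but still face challenges in data, computational efficiency, and evaluation standards.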

Abstract: In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.

[95] Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning

Suorong Yang,Peijia Li,Yujie Liu,Zhiming Xu,Peng Ye,Wanli Ouyang,Furao Shen,Dongzhan Zhou

Main category: cs.LG

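TL;DR: The paper proposes a dynamic dataset pruning framework that adaptively selects training samples by combining task-driven difficulty with multimodal semantic consistency, improving data efficiency and model robustness.

Details Motivation: Modern deep models are usually trained on large real-world datasets whose quality and redundancy vary widely. Current pruning methods mostly rely on static heuristics or task-specific metrics and lack robustness and generality across domains.

Contribution: A dynamic dataset pruning framework that uses supervision from pretrained multimodal foundation models and combines task-driven difficulty with cross-modality semantic consistency to select samples adaptively.

Method: Under supervision from multimodal foundation models, the framework dynamically assesses each sample's difficulty and semantic consistency and filters out uninformative samples.

Result: The method performs well in improving training efficiency and model performance, showing the potential of cross-modality alignment for robust sample selection.

Insight: Cross-modality information can guide data pruning more effectively, offering an efficient and robust approach to data-centric learning.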

Abstract: Modern deep models are trained on large real-world datasets, where data quality varies and redundancy is common. Data-centric approaches such as dataset pruning have shown promise in improving training efficiency and model performance. However, most existing methods rely on static heuristics or task-specific metrics, limiting their robustness and generalizability across domains. In this work, we introduce a dynamic dataset pruning framework that adaptively selects training samples based on both task-driven difficulty and cross-modality semantic consistency. By incorporating supervision from pretrained multimodal foundation models, our approach captures training dynamics while effectively filtering out uninformative samples. Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.

[96] WaveletInception Networks for Drive-by Vibration-Based Infrastructure Health Monitoring

Reza Riahi Samani,Alfredo Nunez,Bart De Schutter

Main category: cs.LG

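TL;DR: The paper proposes WaveletInception-BiLSTM, a deep learning framework for infrastructure health monitoring. It combines a Learnable Wavelet Packet Transform (LWPT) with Inception networks to extract multi-scale features from vibration signals and uses BiLSTM to capture temporal dependencies. It outperforms existing methods on railway track stiffness estimation.

Details Motivation: Infrastructure health monitoring needs efficient, automated methods. Conventional vibration signal analysis usually requires preprocessing and fails to exploit spectral and temporal information together, so a method is needed that extracts multi-scale features and captures temporal dependencies at the same time.

Contribution: 1. Proposes the WaveletInception-BiLSTM network, which combines LWPT and Inception modules to extract multi-scale features; 2. Uses BiLSTM to capture bidirectional temporal dependencies; 3. Significantly outperforms existing methods on railway track stiffness estimation.

Method: 1. A Learnable Wavelet Packet Transform (LWPT) serves as the stem for early feature extraction; 2. 1D Inception networks extract high-level multi-scale features; 3. An LSTM layer integrates these features with operational conditions and captures temporal dependencies; 4. A BiLSTM estimator head models bidirectional temporal relationships.

Result: On simulated drive-by vibration signals for track stiffness estimation, the model significantly outperforms existing methods and demonstrates high-resolution, automated infrastructure health monitoring.

Insight: Combining wavelet transforms with Inception modules effectively extracts spectral and multi-scale features from vibration signals, and BiLSTM further strengthens temporal modeling, offering a new approach to infrastructure health monitoring.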

Abstract: This paper presents a novel deep learning-based framework for infrastructure health monitoring using drive-by vibration response signals. Recognizing the importance of spectral and temporal information, we introduce the WaveletInception-BiLSTM network. The WaveletInception feature extractor utilizes a Learnable Wavelet Packet Transform (LWPT) as the stem for extracting vibration signal features, incorporating spectral information in the early network layers. This is followed by 1D Inception networks that extract multi-scale, high-level features at deeper layers. The extracted vibration signal features are then integrated with operational conditions via a Long Short-term Memory (LSTM) layer. The resulting feature extraction network effectively analyzes drive-by vibration signals across various measurement speeds without preprocessing and uses LSTM to capture interrelated temporal dependencies among different modes of information and to create feature vectors for health condition estimation. The estimator head is designed with a sequential modeling architecture using bidirectional LSTM (BiLSTM) networks, capturing bi-directional temporal relationships from drive-by measurements. This architecture allows for a high-resolution, beam-level assessment of infrastructure health conditions. A case study focusing on railway track stiffness estimation with simulated drive-by vibration signals shows that the model significantly outperforms state-of-the-art methods in estimating railway ballast and railpad stiffness parameters. Results underscore the potential of this approach for accurate, localized, and fully automated drive-by infrastructure health monitoring.

[97] DASViT: Differentiable Architecture Search for Vision Transformer

Pengjin Wu,Ferrante Neri,Zhenhua Feng

Main category: cs.LG

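TL;DR: DASViT proposes a differentiable architecture search method for Vision Transformer design, filling the gap in differentiable search for ViTs and discovering novel designs. Experiments show that the resulting models beat the standard ViT-B/16 in both accuracy and efficiency.

Details Motivation: Existing ViT architecture search methods mostly rely on discrete methods such as evolutionary algorithms, which are computationally expensive and struggle to discover innovative designs. DASViT aims to solve these problems with differentiable search.

Contribution: 1. Applies differentiable architecture search to ViT design for the first time; 2. Proposes an efficient, novel search space that breaks with traditional Transformer encoder designs.

Method: DASViT builds on the DARTS framework, performing differentiable optimization over the ViT search space and supporting joint search of macro-level and micro-level architecture.

Result: On multiple datasets, models produced by DASViT are more accurate than ViT-B/16 with fewer parameters and FLOPs.

Insight: Differentiable search has great potential for ViT design and can produce high-performing architectures at lower cost.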

Abstract: Designing effective neural networks is a cornerstone of deep learning, and Neural Architecture Search (NAS) has emerged as a powerful tool for automating this process. Among the existing NAS approaches, Differentiable Architecture Search (DARTS) has gained prominence for its efficiency and ease of use, inspiring numerous advancements. Since the rise of Vision Transformers (ViT), researchers have applied NAS to explore ViT architectures, often focusing on macro-level search spaces and relying on discrete methods like evolutionary algorithms. While these methods ensure reliability, they face challenges in discovering innovative architectural designs, demand extensive computational resources, and are time-intensive. To address these limitations, we introduce Differentiable Architecture Search for Vision Transformer (DASViT), which bridges the gap in differentiable search for ViTs and uncovers novel designs. Experiments show that DASViT delivers architectures that break traditional Transformer encoder designs, outperform ViT-B/16 on multiple datasets, and achieve superior efficiency with fewer parameters and FLOPs.

eess.IV [Back]

[98] InSight: AI Mobile Screening Tool for Multiple Eye Disease Detection using Multimodal Fusion

Ananya Raghu,Anisha Raghu,Alice S. Tang,Yannis M. Paulus,Tyson N. Kim,Tomiko T. Oskotsky

Main category: eess.IV

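TL;DR: InSight is an AI-based mobile screening tool that uses multimodal fusion (clinical metadata plus fundus images) to diagnose five common eye diseases accurately, with the goal of making screening more accessible.

Details Motivation: Because medical resources are scarce in low- and middle-income countries and in resource-limited settings, millions of patients cannot get timely eye disease screening, so an efficient, accurate, portable screening tool is needed.

Contribution: 1) Proposes the MetaFusion multimodal fusion technique; 2) Combines supervised and self-supervised pretraining; 3) Develops a lightweight multitask model that predicts five diseases simultaneously and is five times more computationally efficient.

Method: A three-stage pipeline: real-time image quality assessment, a multimodal disease diagnosis model, and a diabetic retinopathy grading model. Innovations include multimodal fusion, hybrid pretraining, and multitask learning.

Result: Image quality assessment reaches nearly 100% accuracy; the multimodal model improves balanced accuracy over image-only models by 6% on BRSET and 4% on mBRSET, with high computational efficiency.

Insight: Multimodal fusion markedly improves model performance, and the lightweight multitask design suits mobile deployment, offering a practical option for eye disease screening in resource-limited settings.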

Abstract: Background/Objectives: Age-related macular degeneration, glaucoma, diabetic retinopathy (DR), diabetic macular edema, and pathological myopia affect hundreds of millions of people worldwide. Early screening for these diseases is essential, yet access to medical care remains limited in low- and middle-income countries as well as in resource-limited settings. We develop InSight, an AI-based app that combines patient metadata with fundus images for accurate diagnosis of five common eye diseases to improve accessibility of screenings. Methods: InSight features a three-stage pipeline: real-time image quality assessment, a disease diagnosis model, and a DR grading model to assess severity. Our disease diagnosis model incorporates three key innovations: (a) a multimodal fusion technique (MetaFusion) combining clinical metadata and images; (b) a pretraining method leveraging supervised and self-supervised loss functions; and (c) a multitask model to simultaneously predict 5 diseases. We make use of the BRSET (lab-captured images) and mBRSET (smartphone-captured images) datasets, both of which also contain clinical metadata for model training/evaluation. Results: Trained on a dataset of BRSET and mBRSET images, the image quality checker achieves near-100% accuracy in filtering out low-quality fundus images. The multimodal pretrained disease diagnosis model outperforms models using only images by 6% in balanced accuracy for BRSET and 4% for mBRSET. Conclusions: The InSight pipeline demonstrates robustness across varied image conditions and has high diagnostic accuracy across all five diseases, generalizing to both smartphone- and lab-captured images. The multitask model contributes to the lightweight nature of the pipeline, making it five times more computationally efficient than five individual models, one per disease.
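
Balanced accuracy, the metric reported above, is the mean of per-class recall. It matters for screening data because diseased cases are usually a small minority, so plain accuracy can look high for a model that never flags disease. The sketch below is a generic implementation of the standard definition; the label encoding in the usage note is an assumption for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall over the classes present in y_true.
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)
```

For instance, with four healthy eyes (label 0) and one diseased eye (label 1), a model that predicts "healthy" for everything has 80% plain accuracy but only 50% balanced accuracy, since its recall on the diseased class is zero.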

[99] Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images

Zahra TehraniNasab,Amar Kumar,Tal Arbel

Main category: eess.IV

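TL;DR: Pixel Perfect MegaMed is a megapixel-scale medical image generation model built on vision-language alignment. It generates medical images at 1024x1024 resolution, addressing the difficulty traditional generative models have in preserving fine medical image detail.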

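Details Motivation: Medical image synthesis demands high resolution and fine detail. Traditional generative models such as GANs or VAEs struggle to meet this need, particularly in preserving the fine-grained details that are critical for diagnosis.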

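Contribution: Presents the first vision-language foundation model to support megapixel-scale medical image generation, built on a multi-scale transformer architecture with vision-language alignment tailored to medical terminology and imaging modalities.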

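Method: Uses a multi-scale transformer architecture and vision-language alignment, optimized specifically for ultra-high-resolution medical image generation, to preserve both global anatomical structure and local image detail.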

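Result: Evaluated on the CheXpert dataset, the model generates clinically faithful chest X-rays, and the synthesized images perform well as data augmentation for downstream classification, especially in low-data settings.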

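Insight: Vision-language alignment shows great promise for medical image synthesis, and high-resolution synthetic images can measurably improve downstream task performance.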

Abstract: Medical image synthesis presents unique challenges due to the inherent complexity and high-resolution details required in clinical contexts. Traditional generative architectures such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) have shown great promise for high-resolution image generation but struggle with preserving fine-grained details that are key for accurate diagnosis. To address this issue, we introduce Pixel Perfect MegaMed, the first vision-language foundation model to synthesize images at resolutions of 1024x1024. Our method deploys a multi-scale transformer architecture designed specifically for ultra-high resolution medical image generation, enabling the preservation of both global anatomical context and local image-level details. By leveraging vision-language alignment techniques tailored to medical terminology and imaging modalities, Pixel Perfect MegaMed bridges the gap between textual descriptions and visual representations at unprecedented resolution levels. We apply our model to the CheXpert dataset and demonstrate its ability to generate clinically faithful chest X-rays from text prompts. Beyond visual quality, these high-resolution synthetic images prove valuable for downstream tasks such as classification, showing measurable performance gains when used for data augmentation, particularly in low-data regimes. Our code is accessible through the project website at https://tehraninasab.github.io/pixelperfect-megamed.

[100] Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion

Caixia Dong,Duwei Dai,Xinyi Han,Fan Liu,Xu Yang,Zongfang Li,Songhua Xu

Main category: eess.IV

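TL;DR: This paper proposes a coronary artery segmentation framework based on vision foundation models (VFMs) that improves segmentation accuracy through a parallel ViT-CNN encoding architecture and cross-branch variational fusion (CVF).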

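Details Motivation: Coronary artery segmentation is critical for diagnosing coronary artery disease (CAD), but it is challenging because the vessels are small, morphologically complex, and have low contrast with surrounding tissue.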

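Contribution: (1) A parallel ViT-CNN encoding architecture; (2) a cross-branch variational fusion (CVF) module that adaptively fuses features; (3) an evidential-learning uncertainty refinement (EUR) module that improves robustness.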

Method: The ViT captures global features, the CNN extracts local details, CVF fuses the two, and EUR uses evidence theory to refine uncertain regions.

Result: Significantly outperforms state-of-the-art methods on one in-house and two public datasets, showing strong generalization.

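Insight: Parallel encoding combined with variational fusion effectively merges the VFM's global features with CNN-derived local detail, and evidence theory offers a novel way to refine uncertain regions.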

Abstract: Accurate coronary artery segmentation is critical for computer-aided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Specifically, a vision transformer (ViT) encoder within the VFM captures global structural features, enhanced by the activation of the final two ViT blocks and the integration of an attention-guided enhancement (AGE) module, while a convolutional neural network (CNN) encoder extracts local details. These complementary features are adaptively fused using a cross-branch variational fusion (CVF) module, which models latent distributions and applies variational attention to assign modality-specific weights. Additionally, we introduce an evidential-learning uncertainty refinement (EUR) module, which quantifies uncertainty using evidence theory and refines uncertain regions by incorporating multi-scale feature aggregation and attention mechanisms, further enhancing segmentation accuracy. Extensive evaluations on one in-house and two public datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods, achieving superior performance in accurate coronary artery segmentation and showcasing strong generalization across multiple datasets. The code is available at https://github.com/d1c2x3/CAseg.
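The abstract says the EUR module quantifies uncertainty with evidence theory but does not give the formula. As a rough illustration only, the sketch below uses the common Dirichlet-based formulation from evidential deep learning, where non-negative per-class evidence e_k gives alpha_k = e_k + 1 and an uncertainty mass u = K / sum(alpha). Whether the paper uses exactly this form is an assumption, and `evidential_uncertainty` is a hypothetical helper, not a function from the CAseg repository.

```python
def evidential_uncertainty(evidence):
    """Map non-negative per-class evidence to beliefs, probabilities, and uncertainty.

    Assumed formulation (subjective logic over a Dirichlet distribution):
      alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S,
    so that sum(b_k) + u == 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    expected_prob = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    return belief, expected_prob, uncertainty


# Hypothetical per-pixel evidence for (background, vessel).
confident_pixel = evidential_uncertainty([18.0, 0.0])
ambiguous_pixel = evidential_uncertainty([0.0, 0.0])

print("confident pixel uncertainty:", confident_pixel[2])
print("ambiguous pixel uncertainty:", ambiguous_pixel[2])
```

Under this formulation a pixel with no evidence for any class gets the maximum uncertainty of 1, which is the kind of region a refinement stage would target.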

[101] fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting

Alicia Durrer,Florentin Bieder,Paul Friedrich,Bjoern Menze,Philippe C. Cattin,Florian Kofler

Main category: eess.IV

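TL;DR: The paper proposes fastWDM3D, a method for 3D healthy tissue inpainting. It takes the variance-preserving noise schedule and reconstruction losses from an approach that combines DDPMs with generative adversarial networks (GANs) and applies them to a 3D wavelet diffusion model that has no GAN component, achieving fast, high-quality 3D inpainting. The method needs only two time steps, is up to 800 times faster than other DDPM methods, and still achieves superior performance metrics.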

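Details Motivation: Healthy tissue inpainting has important applications in tumor growth modeling and image registration. Existing DDPM methods inpaint well but sample slowly, which limits their practical use. The paper therefore aims for a fast, high-quality 3D inpainting method.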

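Contribution: The main contribution is fastWDM3D, which pairs a diffusion model with a variance-preserving noise schedule to achieve fast 3D inpainting in only two time steps. It requires no adversarial training and outperforms other DDPM methods.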

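Method: The paper uses a 3D wavelet diffusion model (WDM3D) with a variance-preserving noise schedule and selected reconstruction losses, which removes the need for adversarial training. Tuning the noise schedule and the losses is what makes the inpainting efficient.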

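Result: On the BraTS inpainting test set, fastWDM3D achieves an SSIM of 0.8571, an MSE of 0.0079, and a PSNR of 22.26, with an inpainting time of just 1.81 s per image, up to 800 times faster than other DDPM methods.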

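Insight: The paper shows that a well-designed noise schedule and loss function can greatly improve the efficiency of diffusion models while matching or exceeding their performance, and that models trained without adversarial training can still inpaint with high quality.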

Abstract: Healthy tissue inpainting has significant applications, including the generation of pseudo-healthy baselines for tumor growth models and the facilitation of image registration. In previous editions of the BraTS Local Synthesis of Healthy Brain Tissue via Inpainting Challenge, denoising diffusion probabilistic models (DDPMs) demonstrated qualitatively convincing results but suffered from low sampling speed. To mitigate this limitation, we adapted a 2D image generation approach, combining DDPMs with generative adversarial networks (GANs) and employing a variance-preserving noise schedule, for the task of 3D inpainting. Our experiments showed that the variance-preserving noise schedule and the selected reconstruction losses can be effectively utilized for high-quality 3D inpainting in a few time steps without requiring adversarial training. We applied our findings to a different architecture, a 3D wavelet diffusion model (WDM3D) that does not include a GAN component. The resulting model, denoted as fastWDM3D, obtained an SSIM of 0.8571, an MSE of 0.0079, and a PSNR of 22.26 on the BraTS inpainting test set. Remarkably, it achieved these scores using only two time steps, completing the 3D inpainting process in 1.81 s per image. When compared to other DDPMs used for healthy brain tissue inpainting, our model is up to 800x faster while still achieving superior performance metrics. Our proposed method, fastWDM3D, represents a promising approach for fast and accurate healthy tissue inpainting. Our code is available at https://github.com/AliciaDurrer/fastWDM3D.
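For readers unfamiliar with the term, a variance-preserving noise schedule mixes signal and noise as x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, so the squared coefficients always sum to 1. The sketch below illustrates only that property with a two-step schedule; the beta values, the linear spacing, and the helper names are assumptions chosen for illustration and are not the schedule used by fastWDM3D.

```python
import math


def linear_betas(num_steps, beta_start, beta_end):
    """Return num_steps noise variances spaced linearly from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def vp_coefficients(betas):
    """Return (signal_scale, noise_scale) per step for a variance-preserving process."""
    coefficients = []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta  # cumulative product of (1 - beta_t)
        coefficients.append((math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)))
    return coefficients


def noise_sample(x0, eps, signal_scale, noise_scale):
    """Mix clean values x0 with noise eps using the given scales."""
    return [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]


# Hypothetical two-step schedule with large betas, so two steps reach near-pure noise.
betas = linear_betas(num_steps=2, beta_start=0.5, beta_end=0.99)
for t, (signal_scale, noise_scale) in enumerate(vp_coefficients(betas), start=1):
    total_variance = signal_scale**2 + noise_scale**2
    print(f"step {t}: signal={signal_scale:.4f} noise={noise_scale:.4f} "
          f"variance={total_variance:.4f}")
```

With so few steps, each step has to remove a large share of the noise, which is why the abstract stresses the choice of schedule and reconstruction losses.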