cs.CL

[1] On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

Wenlong Deng,Yushu Li,Boying Gong,Yi Ren,Christos Thrampoulidis,Xiaoxiao Li

Main category: cs.CL

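TL;DR: The paper shows that GRPO's training collapse in Search-R1 stems from a mechanism called Lazy Likelihood Displacement (LLD) and proposes LLDS, a lightweight regularization method that stabilizes training.

Details Motivation: GRPO offers fast convergence and a value-free formulation for tool-integrated reinforcement learning (TI-RL), but it consistently suffers from training collapse, which needs to be addressed.

Contribution: 1. Identifies LLD as the core mechanism behind GRPO collapse; 2. Proposes the LLDS regularization method, which substantially improves training stability and model performance.

Method: Proposes LLDS, a regularizer that activates only when a trajectory's likelihood decreases and regularizes only the responsible tokens, minimizing interference with optimization.

Result: Experiments on seven open-domain and multi-hop QA benchmarks show that LLDS substantially improves performance (e.g., +37.8% on Qwen2.5-3B) while preventing gradient explosion.

Insight: LLD is a fundamental bottleneck for GRPO in TI-RL, and fine-grained regularization can effectively mitigate it.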

Abstract: Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization LLDS for GRPO that activates only when a trajectory’s likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLM.
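
The two-level gating described in the abstract (activate only when a trajectory's likelihood drops, then penalize only the tokens responsible) can be illustrated with a minimal sketch. The function name `llds_penalty`, the use of per-token log-probabilities from the policy before and after an update, and the hinge-style penalty are all assumptions made for illustration; the paper's exact formulation may differ.

```python
from typing import Sequence


def llds_penalty(old_logps: Sequence[float], new_logps: Sequence[float]) -> float:
    """Hypothetical likelihood-preserving penalty for a single trajectory.

    old_logps / new_logps: per-token log-probabilities of the same response
    under the policy before and after the update (an assumed interface).
    """
    if len(old_logps) != len(new_logps):
        raise ValueError("log-prob sequences must have the same length")
    # Gate 1 (trajectory level): stay inactive unless total likelihood fell.
    if sum(new_logps) >= sum(old_logps):
        return 0.0
    # Gate 2 (token level): penalize only tokens whose log-prob decreased.
    return sum(max(0.0, o - n) for o, n in zip(old_logps, new_logps))
```

In this sketch, `llds_penalty([-1.0, -2.0], [-0.5, -3.0])` returns 1.0: the trajectory's total log-likelihood fell, and only the second token contributed to the drop. A trajectory whose total likelihood did not fall incurs no penalty even if individual tokens moved down.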

[2] SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

Chinmay Gondhalekar,Urjitkumar Patel,Fang-Chun Yeh

Main category: cs.CL

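TL;DR: SQuARE is a hybrid retrieval framework that handles complex spreadsheet questions through adaptive routing, combining structure-preserving chunk retrieval with SQL queries to improve retrieval precision and answer accuracy.

Details Motivation: Real spreadsheets contain multirow headers, merged cells, and unit annotations that traditional methods handle poorly, while SQL views cannot cope with inconsistent schemas, so a more flexible and accurate solution is needed.

Contribution: Proposes the SQuARE framework, which uses complexity-aware routing to combine structure-preserving retrieval with SQL queries for efficient question answering over complex tables.

Method: 1. Computes a continuous score based on header depth and merge density; 2. Adaptively routes each query to a retrieval strategy (chunk retrieval or SQL); 3. A lightweight agent supervises retrieval results when confidence is low.

Result: On multi-header corporate balance sheets, a World Bank workbook, and other datasets, SQuARE outperforms single-strategy baselines and ChatGPT-4o on retrieval precision and answer accuracy.

Insight: By decoupling retrieval from model choice, SQuARE stays compatible with emerging tabular foundation models and offers a more robust bridge toward table understanding.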

Abstract: Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding.
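
A rough sketch of sheet-level, complexity-aware routing follows. The weights, the saturation at four header rows, the threshold, and the choice to send complex sheets to chunk retrieval and flat sheets to SQL are assumptions for illustration only; the abstract states just that the score depends on header depth and merge density.

```python
def complexity_score(header_rows: int, merged_cells: int, total_cells: int,
                     w_header: float = 0.5, w_merge: float = 0.5) -> float:
    """Hypothetical continuous complexity score in [0, 1]."""
    if total_cells <= 0:
        raise ValueError("total_cells must be positive")
    # Header depth: a single header row counts as flat; saturate at 4 rows.
    depth = min(max(header_rows - 1, 0), 3) / 3
    # Merge density: fraction of cells that belong to merged regions.
    merge_density = min(merged_cells / total_cells, 1.0)
    return w_header * depth + w_merge * merge_density


def route(score: float, threshold: float = 0.3) -> str:
    """Send structurally complex sheets to chunk retrieval, flat ones to SQL."""
    return "chunk_retrieval" if score >= threshold else "sql"
```

With these illustrative defaults, a flat sheet with one header row and no merged cells scores 0.0 and is routed to SQL, while a sheet with three header rows and 20% merged cells is routed to chunk retrieval.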

[3] DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

Fangyu Lei,Jinxiang Meng,Yiming Huang,Junjie Zhao,Yitong Zhang,Jianwen Luo,Xin Zou,Ruiyi Yang,Wenbo Shi,Yan Gao,Shizhu He,Zuo Wang,Qian Liu,Yang Wang,Ke Wang,Jun Zhao,Kang Liu

Main category: cs.CL

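TL;DR: DAComp is a benchmark of 210 tasks that simulates enterprise data intelligence workflows and exposes key bottlenecks of current data agents on data engineering and data analysis tasks.

Details Motivation: Enterprise data intelligence workflows are complex and diverse, and existing data agents perform poorly on such end-to-end tasks, so a comprehensive benchmark is needed to evaluate them and drive progress.

Contribution: 1. Introduces the DAComp benchmark, covering data engineering (DE) and data analysis (DA) tasks; 2. Develops multi-stage SQL pipeline design tasks and an evaluation method for open-ended questions; 3. Reveals significant shortcomings of current agents on both DE and DA tasks.

Method: 1. DE tasks are scored through execution-based, multi-metric evaluation; 2. DA tasks are assessed by an LLM judge guided by hierarchical rubrics; 3. Experiments validate the reliability of the evaluation method.

Result: State-of-the-art agents perform poorly on DAComp: DE success rates are under 20% and DA scores average below 40%, indicating serious deficiencies in holistic pipeline work and open-ended reasoning.

Insight: Data engineering and data analysis are distinct capabilities, and current agents still need major improvement in pipeline orchestration and open-ended reasoning; DAComp provides a rigorous testbed to drive that progress.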

Abstract: Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analytical-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io

[4] ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation

Yiming Xu,Yuan Yuan,Vijay Viswanathan,Graham Neubig

Main category: cs.CL

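TL;DR: ClusterFusion is a hybrid clustering framework that combines lightweight embedding methods with the contextual reasoning of large language models (LLMs), using a three-stage pipeline for efficient domain-specific text clustering.

Details Motivation: Traditional clustering algorithms and pre-trained embeddings perform poorly on domain-specific tasks, and although LLMs offer strong contextual reasoning, prior work used them only as auxiliary modules. The authors instead make the LLM the clustering core, guided by embedding methods, to improve domain adaptability.

Contribution: 1. Proposes the ClusterFusion framework, which performs clustering with the LLM at its core; 2. Introduces a three-stage pipeline: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment; 3. Validates its advantage on public and domain-specific datasets.

Method: 1. Embedding methods guide subset partition; 2. The LLM summarizes topics for each subset; 3. The LLM performs the final topic assignment. The design combines the efficiency of embeddings with the domain adaptability of LLMs.

Result: ClusterFusion achieves state-of-the-art performance on both public benchmarks and domain-specific datasets, with especially strong results in specialized domains.

Insight: Treating the LLM as the core rather than as an auxiliary module, combined with lightweight embeddings, can substantially improve domain adaptability and user customization in clustering.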

Abstract: Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that instead treats the LLM as the clustering core, guided by lightweight embedding methods. The framework proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design enables direct incorporation of domain knowledge and user preferences, fully leveraging the contextual adaptability of LLMs. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion not only achieves state-of-the-art performance on standard tasks but also delivers substantial gains in specialized domains. To support future work, we release our newly constructed dataset and results on all benchmarks.

[5] LangSAT: A Novel Framework Combining NLP and Reinforcement Learning for SAT Solving

Muyu Pan,Matthew Walter,Dheeraj Kodakandla,Mahfuza Farooque

Main category: cs.CL

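TL;DR: The paper proposes LangSAT, a framework combining natural language processing (NLP) and reinforcement learning (RL) that optimizes heuristic selection in SAT solving and makes SAT solving more accessible by converting English descriptions into CNF expressions.

Details Motivation: Traditional SAT solving requires users to supply CNF expressions, which is unfriendly to non-experts. LangSAT aims to lower the barrier through natural language input while using RL to improve solving efficiency.

Contribution: 1. Proposes LangSAT, presented as the first SAT-solving framework to combine NLP and RL; 2. Designs the Lang2Logic module, which converts English descriptions into CNF expressions; 3. Develops the SmartSAT module, which uses RL to optimize heuristic selection in a conventional CDCL solver.

Method: 1. The Lang2Logic module translates English descriptions into CNF expressions; 2. The SmartSAT module represents clause-variable relationships as a graph, extracts global features, and trains an RL agent to optimize heuristic selection.

Result: Lang2Logic handles English descriptions of up to 450 words, and SmartSAT performs comparably to traditional CDCL heuristics in solving time.

Insight: Combining NLP and RL opens a new direction for SAT solving: natural language input improves usability, while learned heuristic selection keeps solving time comparable to traditional CDCL heuristics. The overall framework improves the accessibility and scalability of SAT solving.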

Abstract: Our work presents a novel reinforcement learning (RL) based framework to optimize heuristic selection within the conflict-driven clause learning (CDCL) process, improving the efficiency of Boolean satisfiability (SAT) solving. The proposed system, LangSAT, bridges the gap between natural language inputs and propositional logic by converting English descriptions into Conjunctive Normal Form (CNF) expressions and solving them using an RL-enhanced CDCL SAT solver. Unlike existing SAT-solving platforms that require CNF as input, LangSAT enables users to input standard English descriptions, making SAT-solving more accessible. The framework comprises two key components: Lang2Logic, which translates English sentences into CNF expressions, and SmartSAT, an RL-based SAT solver. SmartSAT encodes clause-variable relationships as structured graph representations and extracts global features specific to the SAT problem. This implementation provides the RL agent with deeper contextual information, enabling SAT problems to be solved more efficiently. Lang2Logic was evaluated on diverse natural language inputs, processing descriptions up to 450 words. The generated CNFs were solved by SmartSAT, which demonstrated comparable performance to traditional CDCL heuristics with respect to solving time. The combined LangSAT framework offers a more accessible and scalable solution for SAT-solving tasks across reasoning, formal verification, and debugging.

[6] MASE: Interpretable NLP Models via Model-Agnostic Saliency Estimation

Zhou Yang,Shunyan Luo,Jiazhen Zhu,Fang Jin

Main category: cs.CL

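TL;DR: The paper proposes MASE, a model-agnostic saliency estimation framework that explains NLP model decisions by generating saliency maps from Gaussian perturbations on the embedding layer, and it outperforms other interpretation methods.

Details Motivation: The decision-making of deep neural networks in NLP lacks interpretability, and traditional methods are hard to apply directly to discrete text data. MASE aims to provide a general and efficient local explanation method.

Contribution: Proposes the MASE framework, which estimates input saliency with a model-agnostic Gaussian perturbation method, applies to a variety of NLP models, and improves interpretability and accuracy.

Method: Uses Normalized Linear Gaussian Perturbations (NLGP) to perturb the embedding layer rather than raw word inputs and produce saliency estimates.

Result: MASE outperforms other model-agnostic interpretation methods on metrics such as Delta Accuracy, showing its strength in explaining NLP model decisions.

Insight: Perturbing the embedding layer is more efficient than perturbing raw word inputs, which illustrates the potential of model-agnostic interpretation methods in NLP.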

Abstract: Deep neural networks (DNNs) have made significant strides in Natural Language Processing (NLP), yet their interpretability remains elusive, particularly when evaluating their intricate decision-making processes. Traditional methods often rely on post-hoc interpretations, such as saliency maps or feature visualization, which might not be directly applicable to the discrete nature of word data in NLP. Addressing this, we introduce the Model-agnostic Saliency Estimation (MASE) framework. MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model’s internal architecture. By leveraging Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer instead of raw word inputs, MASE efficiently estimates input saliency. Our results indicate MASE’s superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy, positioning it as a promising tool for elucidating the operations of text-based models in NLP.
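
To make the idea of embedding-level perturbation concrete, here is a generic perturbation-based saliency sketch. It is not the NLGP estimator itself, whose details the abstract does not give: the function name, the black-box `model` callable that maps a list of token embeddings to a scalar score, and the mean-absolute-change estimator are all assumptions.

```python
import random
from typing import Callable, List

Embedding = List[List[float]]  # one vector per token


def perturbation_saliency(model: Callable[[Embedding], float],
                          embeddings: Embedding,
                          sigma: float = 0.1,
                          n_samples: int = 200,
                          seed: int = 0) -> List[float]:
    """Estimate per-token saliency by adding Gaussian noise to one token's
    embedding at a time and averaging the absolute change in model output."""
    rng = random.Random(seed)
    base = model(embeddings)
    scores = []
    for i, vec in enumerate(embeddings):
        total = 0.0
        for _ in range(n_samples):
            noisy = [row[:] for row in embeddings]  # copy, then perturb token i
            noisy[i] = [x + rng.gauss(0.0, sigma) for x in vec]
            total += abs(model(noisy) - base)
        scores.append(total / n_samples)
    norm = sum(scores)
    # Normalize so the scores sum to 1 (left unchanged if all are zero).
    return [s / norm for s in scores] if norm > 0 else scores
```

Because the model is only called as a black box, the sketch needs no access to gradients or internal architecture, which is the model-agnostic property the abstract emphasizes.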

[7] MSME: A Multi-Stage Multi-Expert Framework for Zero-Shot Stance Detection

Yuanshuo Zhang,Aohua Li,Bo Chen,Jingbo Sun,Xiaobing Zhao

Main category: cs.CL

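TL;DR: MSME is a multi-stage, multi-expert framework for zero-shot stance detection that handles complex scenarios through three stages (knowledge preparation, expert reasoning, and decision aggregation) and achieves state-of-the-art results.

Details Motivation: Existing LLM-based methods perform well on zero-shot stance detection but still struggle in complex real-world scenarios that require dynamic background knowledge, explicit target definitions, and handling of rhetorical devices.

Contribution: Proposes the MSME framework, which combines a multi-stage design with multiple experts and substantially improves stance detection in complex scenarios.

Method: A three-stage framework: knowledge preparation (retrieving background, clarifying labels), expert reasoning (Knowledge Expert, Label Expert, Pragmatic Expert), and decision aggregation (integration by a Meta-Judge).

Result: Achieves state-of-the-art performance on three public datasets.

Insight: The expert modules have a clear division of labor and effectively address dynamic knowledge, compound target definitions, and rhetorical interference.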

Abstract: LLM-based approaches have recently achieved impressive results in zero-shot stance detection. However, they still struggle in complex real-world scenarios, where stance understanding requires dynamic background knowledge, target definitions involve compound entities or events that must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author’s actual intent. To address these challenges, we propose MSME, a Multi-Stage, Multi-Expert framework for zero-shot stance detection. MSME consists of three stages: (1) Knowledge Preparation, where relevant background knowledge is retrieved and stance labels are clarified; (2) Expert Reasoning, involving three specialized modules: the Knowledge Expert distills salient facts and reasons from a knowledge perspective, the Label Expert refines stance labels and reasons accordingly, and the Pragmatic Expert detects rhetorical cues such as irony to infer intent from a pragmatic angle; (3) Decision Aggregation, where a Meta-Judge integrates all expert analyses to produce the final stance prediction. Experiments on three public datasets show that MSME achieves state-of-the-art performance across the board.

[8] EvoEdit: Lifelong Free-Text Knowledge Editing through Latent Perturbation Augmentation and Knowledge-driven Parameter Fusion

Pengfei Cao,Zeao Ji,Daojian Zeng,Jun Zhao,Kang Liu

Main category: cs.CL

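TL;DR: The paper proposes a new task, LF-Edit, and a solution, EvoEdit, which uses latent perturbation augmentation and knowledge-driven parameter fusion to support lifelong editing of LLM knowledge expressed in natural language; its advantage is validated on the newly built MRLF-Bench.

Details Motivation: Updating the knowledge of large language models after deployment is difficult. Existing methods depend on structured triplets and do not support lifelong editing, so they cannot handle natural language input or continual updates.

Contribution: 1. Proposes the LF-Edit task, which supports lifelong editing of knowledge expressed in natural language; 2. Builds the MRLF-Bench benchmark; 3. Designs the EvoEdit method, which combines latent perturbation augmentation with knowledge-driven parameter fusion to address both knowledge injection and forgetting.

Method: EvoEdit: 1. Latent Perturbation Augmentation strengthens knowledge injection; 2. Knowledge-driven Parameter Fusion preserves prior knowledge.

Result: Experiments show that EvoEdit substantially outperforms existing knowledge editing methods on the LF-Edit task.

Insight: Free-text knowledge editing must balance knowledge injection against forgetting, and a multi-rank evaluation framework (memorization, understanding, constrained comprehension, reasoning) measures model performance more comprehensively.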

Abstract: Adjusting the outdated knowledge of large language models (LLMs) after deployment remains a major challenge. This difficulty has spurred the development of knowledge editing, which seeks to accurately and efficiently modify a model’s internal (parametric) knowledge without retraining it from scratch. However, existing methods suffer from two limitations. First, they depend on structured triplets that are misaligned with the free-text nature of LLM pretraining and fail to capture the nuanced relationships among facts. Second, they typically support one-time knowledge updates, with relatively limited research on the problem of sequential or lifelong editing. To address these gaps, we propose a new task, Lifelong Free-text Knowledge Editing (LF-Edit), which enables models to incorporate updates expressed in natural language and supports continual editing over time. Despite its promise, LF-Edit faces the dual challenge of integrating new knowledge while mitigating the forgetting of prior information. To foster research on this new task, we construct a large-scale benchmark, Multi-Rank Lifelong Free-text Editing Benchmark (MRLF-Bench), containing 16,835 free-text edit requests. We further design a cognitively inspired multi-rank evaluation framework encompassing four levels: memorization, understanding, constrained comprehension, and reasoning. To tackle the challenges inherent in LF-Edit, we introduce a novel approach named EvoEdit that enhances knowledge injection through Latent Perturbation Augmentation and preserves prior information via Knowledge-driven Parameter Fusion. Experimental results demonstrate that EvoEdit substantially outperforms existing knowledge editing methods on the proposed LF-Edit task.

[9] ADAPT: Learning Task Mixtures for Budget-Constrained Instruction Tuning

Pritam Kadasi,Abhishek Upperwal,Mayank Singh

Main category: cs.CL

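TL;DR: ADAPT is a meta-learning algorithm that learns task sampling proportions for multi-task instruction tuning under an explicit token budget. It updates the task distribution via meta-gradients, yielding an adaptive curriculum that is more efficient than static mixtures.

Details Motivation: In conventional multi-task instruction tuning, task weights are usually fixed by hand, with no dynamic adjustment for task importance. ADAPT aims to learn the task distribution so as to optimize the allocation of the token budget and improve generalization.

Contribution: 1. Proposes the ADAPT algorithm, which dynamically learns task sampling proportions; 2. Validates it on three roughly 1B-parameter models; 3. Matches or slightly improves on the best static mixture under limited budgets.

Method: ADAPT updates a continuous task distribution via meta-gradients of a smooth worst-case validation objective, forming an adaptive curriculum that avoids collapse. Experiments use three LLMs with budgets of 1%, 5%, and 10% of the available supervised tokens.

Result: On 11 out-of-domain benchmarks, ADAPT matches or slightly improves on the best static mixture while using fewer effective training tokens and reallocating budget toward harder tasks.

Insight: Dynamically adjusting task weights is more advantageous under limited budgets, and an adaptive curriculum allocates resources effectively and avoids unbalanced task weighting.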

Abstract: We propose ADAPT, a meta-learning algorithm that learns task sampling proportions under an explicit token budget for multi-task instruction tuning. Instead of fixing task weights by hand, ADAPT maintains a continuous distribution over tasks and updates it via meta-gradients of a smooth worst-case validation objective, inducing an adaptive curriculum that allocates more tokens to useful tasks while avoiding collapse. We instantiate ADAPT on three ~1B-parameter open-weight LLMs (Gemma-3-1B, LLaMA-3.2-1B, Qwen-0.6B), training on 20 Natural Instructions task types under budgets of 1%, 5%, and 10% of the available supervised tokens, and compare against strong supervised fine-tuning baselines with uniform and size-proportional mixing. Evaluating on 11 out-of-domain benchmarks spanning reasoning, reading comprehension, code generation, and instruction following, we find that ADAPT matches or slightly improves average downstream performance relative to the best static mixture, while using fewer effective training tokens and reallocating budget toward harder, benchmark-aligned tasks.
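
The two ingredients named in the abstract, a smooth worst-case objective and a continuous task distribution, can be sketched as follows. This is a hedged illustration: the log-sum-exp form of the smooth maximum, the softmax parameterization of the task distribution, and the plain gradient step are assumptions, and `meta_grad` is taken as given because computing a real meta-gradient requires differentiating through training.

```python
import math
from typing import List, Sequence


def smooth_max(losses: Sequence[float], tau: float = 1.0) -> float:
    """Log-sum-exp relaxation of max(losses); approaches the max as tau -> 0."""
    m = max(losses)
    return m + tau * math.log(sum(math.exp((l - m) / tau) for l in losses))


def softmax(logits: Sequence[float]) -> List[float]:
    """Map unconstrained logits to task sampling proportions."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_task_logits(logits: Sequence[float], meta_grad: Sequence[float],
                       lr: float = 0.5) -> List[float]:
    """One descent step on the logits, given a (hypothetical) meta-gradient
    of the validation objective with respect to each task's logit."""
    return [z - lr * g for z, g in zip(logits, meta_grad)]
```

Tasks whose meta-gradient is negative (sampling them more would lower the validation objective) gain probability mass after the step, while the softmax keeps every task's proportion strictly positive, which is one simple way to avoid collapsing onto a single task.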

Wenjin Liu,Haoran Luo,Xin Feng,Xiang Ji,Lijuan Zhou,Rui Mao,Jiapu Wang,Shirui Pan,Erik Cambria

Main category: cs.CL

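TL;DR: This paper proposes LexGenius, an expert-level Chinese legal benchmark for evaluating large language models (LLMs) on legal general intelligence (legal GI). It covers 7 dimensions, 11 tasks, and 20 abilities, and uses combined manual and LLM review to ensure data reliability.

Details Motivation: Existing legal intelligence benchmarks are mostly result-oriented and do not systematically evaluate the legal understanding and reasoning of LLMs, which hinders the development of legal GI.

Contribution: Proposes the LexGenius benchmark framework, presented as the first systematic evaluation of LLM abilities in the Chinese legal domain, and reveals the gaps in LLMs' legal intelligence.

Method: Follows a Dimension-Task-Ability framework, creates multiple-choice questions reviewed by both humans and LLMs, and evaluates 12 state-of-the-art LLMs.

Result: Experiments show significant gaps in the legal intelligence of LLMs, with even the best model lagging behind human legal professionals.

Insight: LexGenius provides a quantitative standard and directions for improvement for legal GI, highlighting both the potential and the shortcomings of LLMs in the legal domain.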

Abstract: Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use the recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.

[11] Geschlechtsübergreifende Maskulina im Sprachgebrauch Eine korpusbasierte Untersuchung zu lexemspezifischen Unterschieden

Carolin Mueller-Spitzer,Samira Ochs,Jan Oliver Ruediger,Sascha Wolfer

Main category: cs.CL

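TL;DR: The study analyzes the distribution and linguistic characteristics of generic masculines (GM) in contemporary German press texts, revealing considerable differences between lexical items and offering empirical insight into actual usage.

Details Motivation: Generic masculines are widely debated in academia and the public, but corpus-based analyses of their actual use remain scarce.

Contribution: Through a large-scale corpus analysis, reveals considerable differences between lexical items and documents the forms and manifestations of GM in authentic language use.

Method: Manually annotates the whole inflectional paradigm of 21 personal nouns, yielding 6,195 annotated tokens.

Result: GM occurs predominantly in the plural and in indefinite noun phrases, and it is not primarily used to denote entire classes of people.

Insight: The findings provide a basis for aligning the stimuli of psycholinguistic experiments more closely with real-world language use.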

Abstract: This study examines the distribution and linguistic characteristics of generic masculines (GM) in contemporary German press texts. The use of masculine personal nouns to refer to mixed-gender groups or unspecified individuals has been widely debated in academia and the public, with conflicting perspectives on its gender-neutrality. While psycholinguistic studies suggest that GM is more readily associated with male referents, corpus-based analyses of its actual use remain scarce. We investigate GM in a large corpus of press texts, focusing on lexeme-specific differences across different types of personal nouns. We conducted manual annotations of the whole inflectional paradigm of 21 personal nouns, resulting in 6,195 annotated tokens. Our findings reveal considerable differences between lexical items, especially between passive role nouns and prestige-related personal nouns. On a grammatical level, we find that GM occurs predominantly in the plural and in indefinite noun phrases. Furthermore, our data shows that GM is not primarily used to denote entire classes of people, as has been previously claimed. By providing an empirical insight into the use of GM in authentic written language, we contribute to a more nuanced understanding of its forms and manifestations. These findings provide a solid basis for aligning linguistic stimuli in psycholinguistic studies more closely with real-world language use.

[12] OsmT: Bridging OpenStreetMap Queries and Natural Language with Open-source Tag-aware Language Models

Zhuoyue Wan,Wentao Hu,Chen Jason Zhang,Yuanfeng Song,Shuaimin Li,Ruiqiang Xiao,Xiao-Yong Wei,Raymond Chi-Wing Wong

Main category: cs.CL

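TL;DR: OsmT is an open-source model that bridges natural language and OverpassQL (a structured query language for OpenStreetMap). Its Tag Retrieval Augmentation (TRA) mechanism improves the structural validity of generated queries, and it reaches accuracy competitive with much larger models.

Details Motivation: Existing solutions rely on large closed-source models, which leads to high inference costs, limited transparency, and difficulty of lightweight deployment. The goal is an open-source model that improves conversion between natural language and structured query languages.

Contribution: 1. Proposes OsmT, an open-source model designed for converting between natural language and OverpassQL; 2. Introduces the TRA mechanism to improve the accuracy and structural validity of generated queries; 3. Defines the reverse task (OverpassQL-to-Text) to improve user accessibility.

Method: 1. The TRA mechanism injects contextually relevant tag knowledge; 2. It captures the hierarchical and relational dependencies in the OSM database; 3. Training also includes the reverse task of generating natural language from queries.

Result: On a public benchmark, OsmT performs strongly on both query generation and interpretation, reaching competitive accuracy with fewer parameters.

Insight: Open-source pre-trained language models are competitive in schema-rich geospatial settings, and the TRA design offers an effective approach to generating complex structured queries.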

Abstract: Bridging natural language and structured query languages is a long-standing challenge in the database community. While recent advances in language models have shown promise in this direction, existing solutions often rely on large-scale closed-source models that suffer from high inference costs, limited transparency, and lack of adaptability for lightweight deployment. In this paper, we present OsmT, an open-source tag-aware language model specifically designed to bridge natural language and Overpass Query Language (OverpassQL), a structured query language for accessing large-scale OpenStreetMap (OSM) data. To enhance the accuracy and structural validity of generated queries, we introduce a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge into the generation process. This mechanism is designed to capture the hierarchical and relational dependencies present in the OSM database, addressing the topological complexity inherent in geospatial query formulation. In addition, we define a reverse task, OverpassQL-to-Text, which translates structured queries into natural language explanations to support query interpretation and improve user accessibility. We evaluate OsmT on a public benchmark against strong baselines and observe consistent improvements in both query generation and interpretation. Despite using significantly fewer parameters, our model achieves competitive accuracy, demonstrating the effectiveness of open-source pre-trained language models in bridging natural language and structured query languages within schema-rich geospatial environments.

[13] Model Whisper: Steering Vectors Unlock Large Language Models’ Potential in Test-time

Xinyue Kang,Diwei Shi,Li Chen

Main category: cs.CL

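TL;DR: This paper proposes Test-Time Steering Vectors (TTSV), a lightweight method that optimizes a vector prepended to the input to activate the latent abilities of large language models without tuning model parameters, substantially improving performance on specific tasks.

Details Motivation: Existing test-time adaptation methods usually require tuning model parameters, which is computationally expensive and may degrade the model's pre-trained abilities. A lightweight, efficient alternative is therefore needed.

Contribution: Proposes TTSV, which optimizes an input-side steering vector instead of model parameters to activate the model's latent abilities, achieving lightweight and efficient test-time adaptation.

Method: Optimizes the prepended TTSV on test data to minimize the model's output entropy, steering the model toward an internal state of higher confidence.

Result: On MATH500, TTSV yields relative performance gains of 45.88% on Qwen2.5-Math-7B and 16.22% on Qwen3-4B.

Insight: Beyond being lightweight and efficient, TTSV generalizes well across tasks, and its steering vectors are highly transferable between them.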

Abstract: It is a critical challenge to efficiently unlock the powerful reasoning potential of Large Language Models (LLMs) for specific tasks or new distributions. Existing test-time adaptation methods often require tuning model parameters, which is not only computationally expensive but also risks degrading the model’s pre-existing abilities. To address this, we introduce a lightweight component, Test-Time Steering Vectors (TTSV), which is prepended to the input while keeping the LLM’s parameters entirely frozen. By optimizing the TTSV on test data to minimize the model’s output entropy, we steer the model towards an internal state of higher confidence, activating its inherent abilities most relevant to the current task. TTSV is both lightweight and highly efficient to optimize, making it a true plug-and-play enhancement. Extensive experiments validate our approach’s effectiveness on both base models and reasoning-enhanced models. For instance, on the MATH500 task, TTSV achieves a 45.88% relative performance gain on the Qwen2.5-Math-7B model and a 16.22% relative gain on the Qwen3-4B model. Furthermore, our approach exhibits robust generalization, with its steering vectors proving highly transferable across diverse tasks.
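
The core optimization loop, minimizing output entropy over an input-side vector while the model's weights stay frozen, can be shown on a toy problem. Everything here is an illustrative assumption: the "model" is a fixed linear map to logits, the steering vector is added to the input features rather than prepended as extra tokens, and gradients come from finite differences instead of backpropagation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)


def frozen_model(steer: Sequence[float], x: Sequence[float],
                 weights: Sequence[Sequence[float]]) -> List[float]:
    """Toy frozen model: logits are a fixed linear map of (x + steer)."""
    return [sum(w * (xi + si) for w, xi, si in zip(row, x, steer))
            for row in weights]


def tune_steering(x: Sequence[float], weights: Sequence[Sequence[float]],
                  steps: int = 100, lr: float = 0.5,
                  eps: float = 1e-4) -> List[float]:
    """Gradient descent on the steering vector only; weights are never modified."""
    steer = [0.0] * len(x)

    def loss(s: Sequence[float]) -> float:
        return entropy(softmax(frozen_model(s, x, weights)))

    for _ in range(steps):
        base = loss(steer)
        grad = []
        for j in range(len(steer)):
            bumped = list(steer)
            bumped[j] += eps  # forward finite difference along coordinate j
            grad.append((loss(bumped) - base) / eps)
        steer = [s - lr * g for s, g in zip(steer, grad)]
    return steer
```

After tuning, the toy model's output distribution is sharper than before, which mirrors the abstract's description of steering toward a higher-confidence state. Whether lower entropy translates into higher accuracy is an empirical question that the toy cannot answer.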

[14] Challenging the Abilities of Large Language Models in Italian: a Community Initiative

Malvina Nissim,Danilo Croce,Viviana Patti,Pierpaolo Basile,Giuseppe Attanasio,Elio Musacchio,Matteo Rinaldi,Federico Borazio,Maria Francis,Jacopo Gili,Daniel Scalena,Begoña Altuna,Ekhi Azurmendi,Valerio Basile,Luisa Bentivogli,Arianna Bisazza,Marianna Bolognesi,Dominique Brunato,Tommaso Caselli,Silvia Casola,Maria Cassese,Mauro Cettolo,Claudia Collacciani,Leonardo De Cosmo,Maria Pia Di Buono,Andrea Esuli,Julen Etxaniz,Chiara Ferrando,Alessia Fidelangeli,Simona Frenda,Achille Fusco,Marco Gaido,Andrea Galassi,Federico Galli,Luca Giordano,Mattia Goffetti,Itziar Gonzalez-Dios,Lorenzo Gregori,Giulia Grundler,Sandro Iannaccone,Chunyang Jiang,Moreno La Quatra,Francesca Lagioia,Soda Marem Lo,Marco Madeddu,Bernardo Magnini,Raffaele Manna,Fabio Mercorio,Paola Merlo,Arianna Muti,Vivi Nastase,Matteo Negri,Dario Onorati,Elena Palmieri,Sara Papi,Lucia Passaro,Giulia Pensa,Andrea Piergentili,Daniele Potertì,Giovanni Puccetti,Federico Ranaldi,Leonardo Ranaldi,Andrea Amelio Ravelli,Martina Rosola,Elena Sofia Ruzzetti,Giuseppe Samo,Andrea Santilli,Piera Santin,Gabriele Sarti,Giovanni Sartor,Beatrice Savoldi,Antonio Serino,Andrea Seveso,Lucia Siciliani,Paolo Torroni,Rossella Varvara,Andrea Zaninello,Asya Zanollo,Fabio Massimo Zanzotto,Kamyar Zeinalipour,Andrea Zugarini

Main category: cs.CL

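TL;DR: CALAMITA is a large-scale collaborative benchmarking initiative for Italian that systematically evaluates large language models (LLMs) across many tasks and promotes a community-driven evaluation framework.

Details Motivation: LLM evaluation currently focuses mainly on English, and systematic evaluation for other languages is lacking. CALAMITA aims to fill this gap for Italian.

Contribution: 1. Builds an Italian benchmark covering more than 20 tasks and almost 100 subtasks; 2. Establishes a centralized evaluation pipeline that supports heterogeneous datasets and metrics; 3. Provides a sustainable, community-driven evaluation framework.

Method: Brings together more than 80 contributors to design diverse tasks and uses a centralized evaluation pipeline to assess model abilities, including linguistic competence, commonsense reasoning, and factual consistency.

Result: Evaluates four open-weight LLMs and reveals systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation.

Insight: 1. Fine-grained, task-representative metrics are essential; 2. Harmonized evaluation pipelines matter; 3. Broad community engagement has both benefits and limitations that must be weighed.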

Abstract: The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. “Challenging the Abilities of LAnguage Models in ITAlian” (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource – the most comprehensive and diverse benchmark for Italian to date – and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.

[15] SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

Hao Wang,Jialun Zhong,Changcheng Wang,Zhujun Nie,Zheng Li,Shunyu Yao,Yanzeng Li,Xinchi Li

Main category: cs.CL

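TL;DR: SEAL is a novel two-stage semantic parsing framework based on self-evolving agentic learning for conversational question answering over knowledge graphs. It extracts a core semantic expression, corrects it with an agentic calibration module, and then applies template-based completion and a self-evolving mechanism, achieving notable gains in structural accuracy and computational efficiency.

Details Motivation: Existing methods suffer from structural inaccuracies and high computational cost when handling complex queries over knowledge graphs. SEAL aims to overcome these limitations through self-evolving agentic learning.

Contribution: Proposes the SEAL framework, which combines two-stage semantic parsing with agentic calibration and a self-evolving mechanism to substantially improve structural accuracy and computational efficiency.

Method: SEAL works in two stages: 1) extract a core semantic expression (S-expression) and apply agentic calibration; 2) use template-based completion to build the full executable expression, supported by the self-evolving mechanism.

Result: On the SPICE benchmark, SEAL achieves the best performance on multi-hop reasoning, comparison, and aggregation tasks, validating its gains in structural accuracy and computational efficiency.

Insight: SEAL's self-evolving mechanism lets it keep learning from dialog history and execution feedback without explicit retraining, which gives it strong adaptability and scalability.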

Abstract: Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches, whether end-to-end semantic parsing or stepwise agent-based reasoning, often suffer from structural inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, a large language model (LLM) extracts a minimal S-expression core that captures the essential semantics of the input query. This core is then refined by an agentic calibration module, which corrects syntactic inconsistencies and aligns entities and relations precisely with the underlying knowledge graph. The second stage employs template-based completion, guided by question-type prediction and placeholder instantiation, to construct a fully executable S-expression. This decomposition not only simplifies logical form generation but also significantly enhances structural fidelity and linking efficiency. Crucially, SEAL incorporates a self-evolving mechanism that integrates local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. The results validate notable gains in both structural accuracy and computational efficiency, underscoring the framework’s capacity for robust and scalable conversational reasoning.

[16] Factuality and Transparency Are All RAG Needs! Self-Explaining Contrastive Evidence Re-ranking

Francielle Vargas,Daniel Pedronette

Main category: cs.CL

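TL;DR: The paper proposes CER, a method that fine-tunes embeddings with contrastive learning and generates token-level attribution rationales for retrieved passages, improving retrieval accuracy and transparency.

Details Motivation: In retrieval-augmented generation (RAG) systems, retrieved evidence must be factual and transparent; avoiding hallucination is especially important in safety-critical domains such as clinical trial reports.

Contribution: The main contribution is CER, a self-explaining contrastive evidence re-ranking method that restructures retrieval around factual evidence and explicitly aligns the embedding space with evidential reasoning.

Method: The method 1) fine-tunes embeddings with contrastive learning; 2) automatically selects hard negatives using a subjectivity-based criterion; and 3) generates token-level attribution rationales, so that the model learns to separate factual rationales from subjective or misleading explanations.

Result: Initial experiments show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval.

Insight: By aligning factual evidence in the embedding space, CER offers a more reliable and transparent retrieval method for RAG systems, particularly in domains that demand high trustworthiness.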

Abstract: This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER), a novel method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. Hard negatives are automatically selected using a subjectivity-based criterion, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart. As a result, the method creates an embedding space explicitly aligned with evidential reasoning. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
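The abstract does not state which contrastive objective CER uses. As a minimal sketch only, assuming a triplet-style margin loss over cosine similarity (the function names and the margin value are illustrative, not taken from the paper), pulling a factual rationale closer while pushing a subjective hard negative away could look like this:

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor: List[float], factual: List[float],
                 hard_negative: List[float], margin: float = 0.2) -> float:
    """Hypothetical margin loss: zero once the factual passage is closer to
    the query than the subjective hard negative by at least `margin`."""
    return max(0.0, margin - cosine(anchor, factual) + cosine(anchor, hard_negative))


# Toy 2-D embeddings: the query matches the factual passage exactly.
query = [1.0, 0.0]
well_separated = triplet_loss(query, factual=[1.0, 0.0], hard_negative=[0.0, 1.0])
confused = triplet_loss(query, factual=[0.0, 1.0], hard_negative=[1.0, 0.0])
```

In this toy case the well-separated triplet incurs no loss, while the confused one is penalized, which is the gradient signal a contrastive fine-tuning step would act on.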

[17] Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Monishwaran Maheswaran,Rishabh Tiwari,Yuezhou Hu,Kerem Dilmen,Coleman Hooper,Haocheng Xi,Nicholas Lee,Mehrdad Farajtabar,Michael W. Mahoney,Kurt Keutzer,Amir Gholami

Main category: cs.CL

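TL;DR: The paper proposes Arbitrage, a framework that dynamically routes between draft-model and target-model outputs to improve the efficiency-accuracy trade-off in reasoning, substantially reducing inference latency.

Details Motivation: Existing Speculative Decoding methods for reasoning tasks are inefficient because they reject semantically equivalent steps unnecessarily, and conventional methods make poor use of the target model's compute.

Contribution: Proposes the Arbitrage framework, in which a lightweight router dynamically chooses between the draft model's and the target model's output for efficient reasoning.

Method: A trained router predicts whether the target model is likely to produce a meaningfully better step and selects the output source accordingly.

Result: On multiple mathematical reasoning benchmarks, Arbitrage reduces inference latency by up to about 2x at matched accuracy.

Insight: Dynamic routing uses compute more efficiently and offers a new direction for improving the efficiency of reasoning tasks.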

Abstract: Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improves efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
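The routing idea can be sketched as a simple control loop. This is a rough illustration under stated assumptions, not the authors' implementation: `draft_step`, `target_step`, `router_score`, and the 0.5 threshold are hypothetical stand-ins for the draft model, the target model, and the trained router.

```python
from typing import Callable, List


def routed_generation(
    draft_step: Callable[[List[str]], str],
    target_step: Callable[[List[str]], str],
    router_score: Callable[[List[str], str], float],
    max_steps: int,
    threshold: float = 0.5,
) -> List[str]:
    """Build a reasoning chain step by step, calling the expensive target
    model only when the router predicts a meaningfully better step."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidate = draft_step(steps)  # cheap proposal from the draft model
        if router_score(steps, candidate) > threshold:
            candidate = target_step(steps)  # escalate this step to the target
        steps.append(candidate)
    return steps


# Toy stand-ins: the router escalates every odd-numbered step.
chain = routed_generation(
    draft_step=lambda s: f"draft-{len(s)}",
    target_step=lambda s: f"target-{len(s)}",
    router_score=lambda s, c: 1.0 if len(s) % 2 else 0.0,
    max_steps=4,
)
```

The point of the design is that the target model is invoked per step on the router's prediction rather than after a fixed acceptance test, so steps the draft already handles well cost no target compute.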

[18] Structured Document Translation via Format Reinforcement Learning

Haiyue Song,Johannes Eschbach-Dymanus,Hour Kaing,Sumire Honda,Hideki Tanaka,Bianka Buschbeck,Masao Utiyama

Main category: cs.CL

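TL;DR: The paper proposes FormatRL, which uses reinforcement learning to optimize document-level structured translation and introduces two reward functions, TreeSim and Node-chrF, improving both translation quality and structural accuracy.

Details Motivation: Existing structured text translation methods are limited to the sentence level and struggle with complex document-level XML/HTML structures; a method that jointly optimizes translation quality and structural consistency is needed.

Contribution: Proposes FormatRL, which combines Group Relative Policy Optimization with two structure-aware rewards (TreeSim and Node-chrF) and applies the fine-grained StrucAUC metric to optimize document-level structured translation.

Method: Starting from a supervised fine-tuned model, reinforcement learning optimizes the TreeSim (structural similarity) and Node-chrF (node-level translation quality) rewards; StrucAUC is used to distinguish minor errors from major structural failures.

Result: On the SAP software-documentation benchmark, improvements are observed across six metrics, in both translation quality and structural accuracy.

Insight: Structure-aware rewards can balance translation and structure optimization effectively, and StrucAUC offers a finer-grained tool for evaluating document-level translation.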

Abstract: Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
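The abstract does not define TreeSim precisely. As a simplified stand-in (an assumption, not the paper's formula), a structure-aware reward could compare the multisets of root-to-node tag paths in the predicted and reference XML, and give malformed output the minimum score:

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_paths(xml_text: str) -> Counter:
    """Count every root-to-node tag path, e.g. '/p/b', in an XML string."""
    paths: Counter = Counter()

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}"
        paths[path] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def structure_similarity(pred_xml: str, ref_xml: str) -> float:
    """Hypothetical TreeSim-like reward in [0, 1]: multiset overlap of tag
    paths; unparseable predictions score 0."""
    try:
        pred = tag_paths(pred_xml)
    except ET.ParseError:
        return 0.0
    ref = tag_paths(ref_xml)
    overlap = sum((pred & ref).values())
    total = sum((pred | ref).values())
    return overlap / total if total else 1.0


reference = "<p>Hello <b>world</b></p>"
same_structure = structure_similarity("<p>Hallo <b>Welt</b></p>", reference)
lost_markup = structure_similarity("<p>Hallo Welt</p>", reference)
malformed = structure_similarity("<p><b>Hallo</p>", reference)
```

A reward of this shape ignores the translated text entirely, which is why the paper pairs a structural reward with a separate node-level translation-quality reward.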

[19] Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning

Purbesh Mitra,Sennur Ulukus

Main category: cs.CL

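TL;DR: The paper proposes Semantic Soft Bootstrapping (SSB), a self-distillation technique that improves long-context reasoning in large language models (LLMs) without the heavy resource cost of reinforcement learning, and reports notable accuracy gains.

Details Motivation: Reinforcement learning with verifiable rewards (RLVR) suffers from poor sample efficiency, sparse rewards, and high compute cost in long-context reasoning tasks, so a more efficient training scheme is needed.

Contribution: Proposes SSB, a self-distillation method that automatically builds teacher-student training pairs without human intervention and improves the model's reasoning accuracy.

Method: The same base model serves as both teacher and student. Several rollouts are generated per problem, and the correct response and the most common incorrect response are selected and given to the teacher in context to build the training data; the student learns to match the teacher's output from the bare question alone.

Result: After training on GSM8K, SSB improves accuracy over GRPO (an RLVR algorithm) by 10.6% on MATH500 and 10% on AIME2024.

Insight: Self-distillation shows promise for long-context reasoning, avoiding the complexity and resource cost of reinforcement learning while still improving performance.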

Abstract: Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) in reasoning-based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as lack of dense reward and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, in this work, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct response and the most common incorrect response are filtered and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase from the bare question alone. In our experiments, we train Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then test its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show improvements in accuracy of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
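The rollout-filtering step can be illustrated with a small sketch. It assumes each rollout is a `(reasoning_text, final_answer)` pair and that answers are compared by exact string match; both are simplifications, and `pick_contrast_pair` is a hypothetical helper name rather than anything from the released code.

```python
from collections import Counter
from typing import List, Optional, Tuple

Rollout = Tuple[str, str]  # (reasoning text, final answer)


def pick_contrast_pair(rollouts: List[Rollout], gold: str) -> Optional[Tuple[str, str]]:
    """Return (a correct rollout, a rollout with the most common wrong answer),
    or None when the rollouts are all correct or all wrong."""
    correct = [text for text, answer in rollouts if answer == gold]
    wrong = [(text, answer) for text, answer in rollouts if answer != gold]
    if not correct or not wrong:
        return None
    common_wrong_answer, _ = Counter(answer for _, answer in wrong).most_common(1)[0]
    wrong_text = next(text for text, answer in wrong if answer == common_wrong_answer)
    return correct[0], wrong_text


rollouts = [
    ("adds correctly", "12"),
    ("drops a carry", "10"),
    ("drops a carry again", "10"),
    ("misreads the question", "7"),
]
pair = pick_contrast_pair(rollouts, gold="12")
```

Problems where every rollout agrees yield no pair here, since there is no contrast to place in the teacher's context.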

cs.CV

[20] Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

Alejandro Cobo,Roberto Valle,José Miguel Buenaposada,Luis Baumela

Main category: cs.CV

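TL;DR: The paper proposes a new method for generalizable deepfake video detection: it synthesizes training data with kinematic inconsistencies so that a model learns to detect sophisticated biomechanical flaws, yielding strong generalization.

Details Motivation: Existing deepfake detectors generalize poorly to unseen manipulations in the video domain, and in particular overlook violations of the natural motion dependencies between facial regions.

Contribution: Proposes a synthetic video generation method that manipulates motion bases to break the natural correlations in facial movements, producing training data with kinematic inconsistencies.

Method: An autoencoder decomposes facial landmark configurations into motion bases; the natural correlations in facial movements are selectively broken, and the resulting artifacts are introduced into pristine videos via face morphing.

Result: Achieves state-of-the-art generalization results on several popular benchmarks.

Insight: Kinematic inconsistencies between facial regions are an effective cue for detecting deepfake videos, and simulating them improves generalization.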

Abstract: Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.

[21] OnSight Pathology: A real-time platform-agnostic computational pathology companion for histopathology

Jinzhen Hu,Kevin Faust,Parsa Babaei Zadeh,Adrienn Bourkas,Shane Eaton,Andrew Young,Anzar Alvi,Dimitrios George Oreopoulos,Ameesha Paliwal,Assem Saleh Alrumeh,Evelyn Rose Kamski-Hennekam,Phedias Diamandis

Main category: cs.CV

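TL;DR: OnSight Pathology is platform-agnostic computer vision software that provides real-time AI inferences via continuous screen capture, removing barriers to deploying AI tools in digital pathology.

Details Motivation: Conventional histopathological examination relies on subjective interpretation and highly specialized experts, which can compromise accuracy and clinical care. Existing AI solutions are mostly proprietary and hard to deploy.

Contribution: Introduces OnSight Pathology, which supports real-time AI inference across platforms without complex integration and suits both research and clinical workflows.

Method: Uses continuous screen capture and runs AI inference locally, working with different digital slide viewers and live microscope camera feeds.

Result: Robustness is demonstrated on over 2,500 publicly available whole slide images and clinical cases, across tasks including brain tumor classification, mitosis detection, and immunohistochemical stain quantification.

Insight: Features such as a multi-modal chat assistant and compatibility with smartphone cameras extend the potential of AI tools to more analog settings and telepathology.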

Abstract: The microscopic examination of surgical tissue remains a cornerstone of disease classification but relies on subjective interpretations and access to highly specialized experts, which can compromise accuracy and clinical care. While emerging breakthroughs in artificial intelligence (AI) offer promise for automated histological analysis, the growing number of proprietary digital pathology solutions has created barriers to real-world deployment. To address these challenges, we introduce OnSight Pathology, a platform-agnostic computer vision software that uses continuous custom screen captures to provide real-time AI inferences to users as they review digital slide images. Accessible as a single, self-contained executable file (https://onsightpathology.github.io/), OnSight Pathology operates locally on consumer-grade personal computers without complex software integration, enabling cost-effective and secure deployment in research and clinical workflows. Here we demonstrate the utility of OnSight Pathology using over 2,500 publicly available whole slide images across different slide viewers, as well as cases from our clinical digital pathology setup. The software’s robustness is highlighted across routine histopathological tasks, including the classification of common brain tumor types, mitosis detection, and the quantification of immunohistochemical stains. A built-in multi-modal chat assistant provides verifiable descriptions of images, free of rigid class labels, for added quality control. Lastly, we show compatibility with live microscope camera feeds, including from personal smartphones, offering potential for deployment in more analog, inter-operative, and telepathology settings. Together, we highlight how OnSight Pathology can deliver real-time AI inferences across a broad range of pathology pipelines, removing key barriers to the adoption of AI tools in histopathology.

[22] Generalized Event Partonomy Inference with Structured Hierarchical Predictive Learning

Zhou Chen,Joe Lin,Sathyanarayanan N. Aakur

Main category: cs.CV

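TL;DR: PARSE is a framework that learns multiscale event structure from streaming video without supervision, using a hierarchy of predictive models; it outperforms existing streaming methods and rivals offline baselines.

Details Motivation: Humans naturally perceive continuous experience as a hierarchy of temporally nested events, and computer vision models need to segment video predictively and hierarchically.

Contribution: Proposes PARSE, a framework that learns multiscale event structure from streaming video without supervision through a hierarchy of recurrent predictors with attention-based feedback.

Method: PARSE uses a multiscale hierarchy of recurrent predictors: lower layers model short-term dynamics, while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error.

Result: On three benchmarks (Breakfast Actions, 50 Salads, and Assembly 101), PARSE performs best among streaming methods and rivals offline baselines on the H-GEBD, TED, and hF1 metrics.

Insight: Predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.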

Abstract: Humans naturally perceive continuous experience as a hierarchy of temporally nested events, fine-grained actions embedded within coarser routines. Replicating this structure in computer vision requires models that can segment video not just retrospectively, but predictively and hierarchically. We introduce PARSE, a unified framework that learns multiscale event structure directly from streaming video without supervision. PARSE organizes perception into a hierarchy of recurrent predictors, each operating at its own temporal granularity: lower layers model short-term dynamics while higher layers integrate longer-term context through attention-based feedback. Event boundaries emerge naturally as transient peaks in prediction error, yielding temporally coherent, nested partonomies that mirror the containment relations observed in human event perception. Evaluated across three benchmarks, Breakfast Actions, 50 Salads, and Assembly 101, PARSE achieves state-of-the-art performance among streaming methods and rivals offline baselines in both temporal alignment (H-GEBD) and structural consistency (TED, hF1). The results demonstrate that predictive learning under uncertainty provides a scalable path toward human-like temporal abstraction and compositional event understanding.
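To make "event boundaries as transient peaks in prediction error" concrete, here is a minimal sketch of one possible detector. The trailing-window threshold (mean plus `k` standard deviations) and the one-frame lookahead used to confirm a local maximum are assumptions for illustration; the abstract does not specify PARSE's actual boundary rule.

```python
from statistics import mean, pstdev
from typing import List


def detect_boundaries(errors: List[float], window: int = 5, k: float = 2.0) -> List[int]:
    """Flag time steps where the prediction error is a local maximum that
    exceeds mean + k * std of the preceding `window` errors."""
    boundaries: List[int] = []
    for t in range(window, len(errors) - 1):
        history = errors[t - window:t]
        threshold = mean(history) + k * pstdev(history)
        is_peak = errors[t] > errors[t - 1] and errors[t] >= errors[t + 1]
        if is_peak and errors[t] > threshold:
            boundaries.append(t)
    return boundaries


# Toy per-frame prediction errors with two clear spikes.
errors = [0.1, 0.1, 0.12, 0.1, 0.11, 0.9, 0.1, 0.1, 0.12, 0.1, 0.1, 0.8, 0.1, 0.1]
boundaries = detect_boundaries(errors)  # [5, 11]
```

Note that the small bump at index 8 is not flagged: the earlier spike is still inside its trailing window and raises the threshold, so only transient peaks that stand out from recent error levels count as boundaries.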

[23] MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

Xiangyu Bai,He Liang,Bishoy Galoaa,Utsav Nandi,Shayda Moezzi,Yuhang He,Sarah Ostadabbas

Main category: cs.CV

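TL;DR: MoReGen is a motion-aware text-to-video generation framework built on multi-agent LLMs and physics simulation that targets the weak physical plausibility of existing methods; the MoReSet benchmark and MoRe metrics are used to quantify physical validity.

Details Motivation: Existing text-to-video (T2V) methods have advanced in photorealism but still struggle to generate videos that obey physics principles.

Contribution: 1. Proposes the MoReGen framework, which integrates multi-agent LLMs, physics simulators, and renderers to generate physically accurate videos; 2. Introduces the MoReSet benchmark of 1,275 human-annotated videos for evaluating physical validity; 3. Proposes MoRe, an evaluation metric based on object-trajectory correspondence.

Method: MoReGen parses the text prompt with multi-agent LLMs, generates motion trajectories with physics simulators, and synthesizes the video with renderers.

Result: Experiments show that existing state-of-the-art models struggle to maintain physical validity, whereas MoReGen performs well on physical consistency.

Insight: MoReGen offers a systematic approach to physical plausibility in video synthesis and shows the potential of code-domain generation.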

Abstract: While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.

[24] ReasonX: MLLM-Guided Intrinsic Image Decomposition

Alara Dirik,Tuanfeng Wang,Duygu Ceylan,Stefanos Zafeiriou,Anna Frühstück

Main category: cs.CV

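TL;DR: ReasonX uses a multimodal large language model (MLLM) as a perceptual judge that compares predicted intrinsic properties, and uses these comparisons to fine-tune decomposition models. The approach improves performance markedly without labeled data.

Details Motivation: Intrinsic image decomposition in real-world scenes typically lacks sufficient labeled data, and models trained on synthetic data generalize poorly. The perceptual judgment of an MLLM can help close this gap.

Contribution: Proposes ReasonX, a novel framework that uses an MLLM as a perceptual judge and, through comparative analysis of predicted intrinsic properties, fine-tunes decomposition models without labeled data.

Method: The MLLM produces relative comparison signals that serve as rewards for Group Relative Policy Optimization (GRPO) when fine-tuning intrinsic decomposition models. The method is model-agnostic and applies across base architectures and modalities.

Result: ReasonX yields significant improvements, including a 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D.

Insight: MLLM-guided comparative judgment can bridge low- and high-level vision reasoning and suggests new directions for unsupervised or weakly supervised learning.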

Abstract: Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge’s relational assessments and analytically derived relations from the model’s outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.
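A rough sketch of how judge agreement could become a GRPO training signal is given below. The reward definition (fraction of pairwise relations on which the prediction agrees with the judge) and the data layout are assumptions for illustration; only the group-relative normalization of rewards is the standard GRPO ingredient.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def agreement_reward(pred: List[float], judge: Dict[Pair, int]) -> float:
    """Fraction of judged pairs where the relation derived from the prediction
    matches the judge. judge[(i, j)] is +1 if point i ranks above j, else -1."""
    hits = 0
    for (i, j), verdict in judge.items():
        derived = 1 if pred[i] > pred[j] else -1
        hits += int(derived == verdict)
    return hits / len(judge)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: each reward normalized by its group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge verdicts on two point pairs, and three sampled predictions.
judge = {(0, 1): 1, (1, 2): -1}
group = [[0.9, 0.2, 0.5], [0.1, 0.2, 0.5], [0.1, 0.6, 0.5]]
rewards = [agreement_reward(p, judge) for p in group]
advantages = group_relative_advantages(rewards)
```

Because the advantages are relative within the sampled group, the judge only needs to rank predictions consistently; no ground-truth albedo or depth values enter the update.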

[25] 6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

Leon Mayer,Piotr Kalinowski,Caroline Ebersbach,Marcel Knopp,Tim Rädsch,Evangelia Christodoulou,Annika Reinke,Fiona R. Kolbinger,Lena Maier-Hein

Main category: cs.CV

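TL;DR: The paper introduces AdversarialAnatomyBench, the first benchmark of natural adversarial medical images covering rare anatomical variants, and reveals sharp performance drops and limitations of current vision-language models (VLMs) on rare anatomy.

Details Motivation: Existing medical imaging benchmarks focus on common anatomical presentations and miss the challenges posed by rare variants. The authors build a benchmark of rare variants spanning multiple imaging modalities and anatomical regions to quantify VLM performance in these cases.

Contribution: 1. Introduces the AdversarialAnatomyBench benchmark, filling the gap in evaluating rare anatomical variants; 2. Reveals the performance drop and limitations of VLMs on rare anatomy; 3. Shows that neither model scaling nor interventions resolve the problem.

Method: The authors collect a benchmark of naturally occurring rare anatomical variants across multiple imaging modalities and anatomical regions, then evaluate 22 state-of-the-art VLMs on basic medical perception tasks.

Result: Mean VLM accuracy drops from 74% on typical anatomy to 29% on atypical anatomy. Even the best-performing models (GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick) show performance drops of 41-51%. Model errors closely mirror expected anatomical biases.

Insight: Current VLMs have significant limitations on rare anatomical structures that existing techniques do not resolve, a cautionary finding for real-world deployment of medical AI systems.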

Abstract: Vision-language models are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare variants. To address this gap, we introduce AdversarialAnatomyBench, the first benchmark comprising naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions. We call such variants that violate learned priors about “typical” human anatomy natural adversarial anatomy. Benchmarking 22 state-of-the-art VLMs with AdversarialAnatomyBench yielded three key insights. First, when queried with basic medical perception tasks, mean accuracy dropped from 74% on typical to 29% on atypical anatomy. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Second, model errors closely mirrored expected anatomical biases. Third, neither model scaling nor interventions, including bias-aware prompting and test-time reasoning, resolved these issues. These findings highlight a critical and previously unquantified limitation in current VLMs: their poor generalization to rare anatomical presentations. AdversarialAnatomyBench provides a foundation for systematically measuring and mitigating anatomical bias in multimodal medical AI systems.

[26] MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

Shaoheng Fang,Chaohui Yu,Fan Wang,Qixing Huang

Main category: cs.CV

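TL;DR: MVRoom is a controllable 3D indoor scene generation method that combines multi-view diffusion with 3D layout conditioning, uses a two-stage design to enforce multi-view consistency, and supports text-to-scene generation.

Details Motivation: Existing methods fall short in multi-view consistency and controllability; MVRoom addresses both by combining a 3D layout with multi-view diffusion models.

Contribution: 1. Proposes MVRoom, a controllable NVS method that combines a 3D layout with multi-view diffusion models; 2. Designs a two-stage pipeline in which the 3D layout is used throughout multi-view generation; 3. Introduces a layout-aware epipolar attention mechanism; 4. Supports iterative generation and text-to-scene generation.

Method: 1. The first stage conditions on the 3D layout and uses novel representations to bridge the 3D layout and image-based condition signals; 2. The second stage performs image-conditioned multi-view generation with the epipolar attention mechanism; 3. Complex scenes are generated recursively.

Result: Outperforms existing methods in both quantitative and qualitative evaluation, confirming multi-view consistency and generation quality.

Insight: Combining 3D layouts with diffusion models is an effective route to controllable multi-view generation, and recursive generation extends it to more complex scenes.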

Abstract: We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.

[27] UniLight: A Unified Representation for Lighting

Zitian Zhang,Iliyan Georgiev,Michael Fischer,Yannick Hold-Geoffroy,Jean-François Lalonde,Valentin Deschaintre

Main category: cs.CV

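TL;DR: UniLight proposes a unified lighting representation: a joint latent space that integrates multiple modalities and supports cross-modal lighting understanding and generation tasks.

Details Motivation: Existing lighting representations (such as environment maps and irradiance) are mutually incompatible, which limits cross-modal transfer and learning. A unified representation is needed to align and share information across modalities.

Contribution: UniLight's core contribution is a joint latent space that unifies text, images, irradiance, and environment maps, with contrastive learning and an auxiliary task that strengthens directional understanding.

Method: Modality-specific encoders are trained contrastively to align their representations, and an auxiliary spherical-harmonics prediction task reinforces directional features. A multi-modal data pipeline supports large-scale training and evaluation across tasks.

Result: Experiments show that UniLight's representation captures consistent and transferable lighting features, supporting lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis.

Insight: Joint multi-modal representations offer a new approach to hard vision problems such as lighting understanding and control, and the work underscores the importance of directional modeling for lighting tasks.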

Abstract: Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.
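The abstract says the encoders are "trained contrastively" without naming the loss. A common choice for aligning two modalities is an InfoNCE-style objective, sketched below as an assumption; the one-directional form, the temperature of 0.07, and the toy 2-D embeddings are all illustrative.

```python
import math
from typing import List


def info_nce(a_embs: List[List[float]], b_embs: List[List[float]],
             temperature: float = 0.07) -> float:
    """One-directional InfoNCE: row i of `a_embs` should score highest
    against row i of `b_embs` among all rows in the batch."""
    loss = 0.0
    for i, a in enumerate(a_embs):
        logits = [sum(x * y for x, y in zip(a, b)) / temperature for b in b_embs]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a_embs)


# Toy batch: text embeddings vs. environment-map embeddings of two lighting setups.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(text_embs, [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce(text_embs, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls embeddings of the same lighting condition together across modalities, which is what makes cross-modal retrieval in a shared space possible.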

[28] Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer

Tasmiah Haque,Srinjoy Das

Main category: cs.CV

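TL;DR: The paper proposes a novel inference-time refinement technique that combines GRU-Normalizing Flows (GRU-NF) with stochastic sampling to improve the diversity of predictions in real-time video motion transfer without sacrificing accuracy. Introducing Markov Chain Monte Carlo (MCMC) steps extends the model's expressivity beyond that of the original GRU-NF.

Details Motivation: Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection need accurate yet diverse predictions to support realistic synthesis and robust decision making under uncertainty. The deterministic transformation structure of GRU-NF can limit its expressivity.

Contribution: Proposes GRU-Stochastic Normalizing Flows (GRU-SNF), which introduce MCMC steps at inference time to explore a richer output space and better approximate the true data distribution without retraining.

Method: GRU-NF is combined with stochastic sampling by adding MCMC steps at inference time to increase output diversity. The approach is validated on a keypoint-based video motion transfer task that requires temporally coherent and perceptually diverse trajectories.

Result: GRU-SNF generates more diverse outputs than GRU-NF without sacrificing accuracy, even under longer prediction horizons. Injecting stochasticity at inference captures multimodal behavior more effectively.

Insight: Integrating stochastic dynamics with flow-based sequence models is a promising route to generative time series forecasting.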

Abstract: Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit-Stochastic Normalizing Flows (GRU-SNF), outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.
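The abstract does not say which MCMC kernel is used or where in the flow it is applied. As a generic illustration only, a random-walk Metropolis-Hastings refinement of a deterministic output might look like the sketch below; the 1-D bimodal density is a toy stand-in for the model's learned density, and all names are hypothetical.

```python
import math
import random
from typing import Callable, List


def mh_refine(x: float, log_density: Callable[[float], float], steps: int,
              step_size: float, rng: random.Random) -> float:
    """Random-walk Metropolis-Hastings: perturb x and accept each move with
    probability min(1, p(proposal) / p(x))."""
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_accept = log_density(proposal) - log_density(x)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal
    return x


def bimodal_log_density(x: float) -> float:
    """Toy target: unnormalized mixture of two Gaussians centered at -2 and +2."""
    a = -((x - 2.0) ** 2) / 0.5
    b = -((x + 2.0) ** 2) / 0.5
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))


rng = random.Random(0)
# Every chain starts from the same deterministic output (0.0), which lies
# between the two modes of the toy target.
samples: List[float] = [
    mh_refine(0.0, bimodal_log_density, steps=50, step_size=0.5, rng=rng)
    for _ in range(200)
]
```

The chains start from a single point yet end up spread across both modes, which is the kind of added diversity that stochastic steps at inference are meant to provide without retraining the underlying model.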

[29] How (Mis)calibrated is Your Federated CLIP and What To Do About It?

Mainak Singha,Masih Aminbeidokhti,Paolo Casari,Elisa Ricci,Subhankar Roy

Main category: cs.CV

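TL;DR: The paper studies how federated learning (FL) affects the calibration of CLIP and proposes a LoRA-based approach, FL²oRA, that consistently improves calibration in the FL setting.

Details Motivation: The calibration of CLIP under distributed training such as federated learning has not been studied, even though calibration is crucial for reliable predictions. The paper fills this gap.

Contribution: 1. Analyzes the effect of FL on CLIP calibration; 2. Proposes FL²oRA, which improves calibration without an explicit calibration step.

Method: After analyzing the limitations of Textual Prompt Tuning and existing in-training calibration techniques, the paper proposes the LoRA-based FL²oRA, which focuses on which model components are fine-tuned.

Result: Across multiple benchmarks, FL²oRA consistently produces well-calibrated CLIP models under FL, reducing the need for explicit calibration.

Insight: The key to calibration lies not only in the aggregation method or calibration technique but in which model components are chosen for fine-tuning.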

Abstract: While vision-language models like CLIP have been extensively studied, their calibration, crucial for reliable predictions, has received limited attention. Although a few prior works have examined CLIP calibration in offline settings, the impact of fine-tuning CLIP in a federated learning (FL) setup remains unexplored. In this work, we investigate how FL affects CLIP calibration and propose strategies to improve reliability in this distributed setting. We first analyze Textual Prompt Tuning approaches and show that they degrade calibration metrics when operating under FL. We also evaluate existing in-training calibration techniques across four global aggregation methods, finding that they provide limited improvements. Our results suggest that the key challenge lies not only in how we aggregate or calibrate, but in which components we choose to fine-tune. Motivated by this insight, we propose $\text{FL}^2\text{oRA}$, a straightforward LoRA-based approach that naturally improves calibration in FL, and we analyze the factors behind its effectiveness. Experiments on multiple benchmarks demonstrate that $\text{FL}^2\text{oRA}$ consistently produces well-calibrated models, reducing the need for explicit calibration procedures. Codes are available at https://github.com/mainaksingha01/FL2oRA.
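The abstract refers to "calibration metrics" without naming them. Expected Calibration Error (ECE) is one widely used metric, shown here as an assumption about what such an evaluation might compute: predictions are binned by confidence, and the gap between average confidence and accuracy is averaged across bins, weighted by bin size.

```python
from typing import List, Tuple


def expected_calibration_error(confidences: List[float], correct: List[bool],
                               n_bins: int = 10) -> float:
    """ECE with equal-width confidence bins over [0, 1]."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# A model that claims 95% confidence but is right only half the time.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
# A model whose 75% confidence matches its 75% accuracy.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

A lower value means confidence tracks accuracy more closely; a well-calibrated model in the paper's sense would score near zero on a metric of this kind.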

[30] Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

Rui Fonseca,Bruno Martins,Gil Rocha

Main category: cs.CV

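TL;DR: The paper proposes TOMCap, a text-only training method for image captioning that needs no aligned image-text pairs and improves performance through retrieval augmentation and modality gap correction.

Details Motivation: To reduce reliance on annotated data, the paper explores text-only training for image captioning, aiming to narrow the gap to fully supervised approaches.

Contribution: 1) Proposes TOMCap, which improves performance through retrieval augmentation and modality gap correction; 2) Outperforms other training-free and text-only methods.

Method: 1) Derives the image representation from CLIP and uses it to prompt a pre-trained language model decoder; 2) retrieves similar captions to guide generation; 3) applies a modality gap reduction step to the representation before prompting.

Result: TOMCap outperforms other training-free and text-only methods, and experiments confirm the effectiveness of retrieval augmentation and modality gap correction.

Insight: Retrieving similar captions and correcting the modality gap are key to text-only training performance; high-quality captions can be generated without aligned data.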

Abstract: Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any humanly-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, i.e., an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after undergoing a process to reduce the modality gap. We specifically tested the combined use of retrieved examples of captions, and latent vector representations, to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.

[31] Real-time Cricket Sorting By Sex

Juan Manuel Cantarero Angulo,Matthew Smith

Main category: cs.CV

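TL;DR: The paper presents a low-cost, real-time system for automated sex-based sorting of house crickets (Acheta domesticus). It combines computer vision with physical actuation and runs a lightweight deep learning model on a resource-constrained device to support efficient, sustainable cricket production.

Details Motivation: Global demand for sustainable protein is driving interest in edible insects, but cricket farming currently lacks automated sex sorting, which forgoes potential benefits such as selective breeding, optimized reproduction ratios, and nutritional differentiation.

Contribution: Presents a real-time sex-sorting system based on computer vision and physical actuation, demonstrating that lightweight deep learning models are feasible on resource-constrained devices.

Method: The system integrates a Raspberry Pi 5 with the official Raspberry AI Camera, a custom YOLOv8 nano object detection model, and a servo-actuated sorting arm.

Result: The model reached an mAP@0.5 of 0.977 during testing, and real-world sorting of crickets achieved 86.8% accuracy.

Insight: The work offers an efficient, sustainable automation solution for insect farming and illustrates the potential of deep learning in agricultural applications.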

Abstract: The global demand for sustainable protein sources is driving increasing interest in edible insects, with Acheta domesticus (house cricket) identified as one of the most suitable species for industrial production. Current farming practices typically rear crickets in mixed-sex populations without automated sex sorting, despite potential benefits such as selective breeding, optimized reproduction ratios, and nutritional differentiation. This work presents a low-cost, real-time system for automated sex-based sorting of Acheta domesticus, combining computer vision and physical actuation. The device integrates a Raspberry Pi 5 with the official Raspberry AI Camera and a custom YOLOv8 nano object detection model, together with a servo-actuated sorting arm. The model reached a mean Average Precision at IoU 0.5 (mAP@0.5) of 0.977 during testing, and real-world experiments with groups of crickets achieved an overall sorting accuracy of 86.8%. These results demonstrate the feasibility of deploying lightweight deep learning models on resource-constrained devices for insect farming applications, offering a practical solution to improve efficiency and sustainability in cricket production.
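The "0.5" in mAP@0.5 is an Intersection-over-Union (IoU) threshold: a detection counts as a true positive only if its box overlaps a ground-truth box with IoU of at least 0.5. The sketch below shows that matching criterion only, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form; it is not the paper's evaluation code, and the full mAP computation (ranking by confidence and averaging precision) is omitted.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection matches a ground-truth box when IoU >= threshold."""
    return iou(pred, truth) >= threshold


ground_truth = (0.0, 0.0, 10.0, 10.0)
tight = is_true_positive((1.0, 0.0, 11.0, 10.0), ground_truth)    # IoU = 90 / 110
shifted = is_true_positive((5.0, 0.0, 15.0, 10.0), ground_truth)  # IoU = 50 / 150
```

A box shifted by half its width already falls below the threshold, which gives a sense of how much localization error the reported 0.977 tolerates.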

[32] Mind-to-Face: Neural-Driven Photorealistic Avatar Synthesis via EEG Decoding

Haolin Xiong,Tianwen Fu,Pratusha Bhuvana Prasad,Yunxuan Cai,Haiwei Chen,Wenbin Teng,Hanyuan Xiao,Yajie Zhao

Main category: cs.CV

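TL;DR: Mind-to-Face is the first framework that decodes non-invasive electroencephalogram (EEG) signals directly into high-fidelity facial expressions, overcoming the reliance of conventional expression systems on visual cues and establishing a new paradigm for neural-driven avatars.

Details Motivation: Existing expressive avatar systems rely heavily on visual cues and fail when faces are occluded or emotions remain internal. EEG signals may carry richer affective and geometric information that has so far been underused.

Contribution: 1. Proposes Mind-to-Face, the first framework that decodes EEG signals into high-fidelity facial expressions; 2. Builds a dual-modality recording setup that captures synchronized EEG and multi-view facial video, enabling precise supervision for neural-to-visual learning; 3. Proposes a CNN-Transformer encoder and a modified 3D Gaussian Splatting rendering pipeline that produce photorealistic, view-consistent expressions.

Method: 1. Build a dual-modality dataset; 2. Map EEG to dense 3D position maps with a CNN-Transformer encoder; 3. Render the results through a modified 3D Gaussian Splatting pipeline.

Result: EEG alone can reliably predict dynamic, subject-specific facial expressions, including subtle emotional responses, showing that neural signals contain far richer affective and geometric information than previously assumed.

Insight: Neural signals can drive high-fidelity expression synthesis, opening a direction for personalized, emotion-aware telepresence and immersive cognitive interaction.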

Abstract: Current expressive avatar systems rely heavily on visual cues, failing when faces are occluded or when emotions remain internal. We present Mind-to-Face, the first framework that decodes non-invasive electroencephalogram (EEG) signals directly into high-fidelity facial expressions. We build a dual-modality recording setup to obtain synchronized EEG and multi-view facial video during emotion-eliciting stimuli, enabling precise supervision for neural-to-visual learning. Our model uses a CNN-Transformer encoder to map EEG signals into dense 3D position maps, capable of sampling over 65k vertices, capturing fine-scale geometry and subtle emotional dynamics, and renders them through a modified 3D Gaussian Splatting pipeline for photorealistic, view-consistent results. Through extensive evaluation, we show that EEG alone can reliably predict dynamic, subject-specific facial expressions, including subtle emotional responses, demonstrating that neural signals contain far richer affective and geometric information than previously assumed. Mind-to-Face establishes a new paradigm for neural-driven avatars, enabling personalized, emotion-aware telepresence and cognitive interaction in immersive environments.

[33] DisentangleFormer: Spatial-Channel Decoupling for Multi-Channel Vision

Jiashu Liao,Pietro Liò,Marc de Kamps,Duygu Sarikaya

Main category: cs.CV

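TL;DR: DisentangleFormer is a vision Transformer architecture with spatial-channel decoupling. By modeling structural and semantic dependencies independently, it addresses representation entanglement in multi-channel vision tasks and achieves state-of-the-art performance on several benchmark datasets.

Details Motivation: Standard vision Transformers process spatial and channel dimensions jointly, which entangles representations and prevents independent modeling of structural and semantic dependencies. The problem is especially pronounced in multi-channel vision tasks such as hyperspectral imaging.

Contribution: 1. Proposes a parallel spatial-channel decoupled design that models structural and semantic dependencies independently; 2. Designs an adaptive fusion module (Squeezed Token Enhancer) and a Multi-Scale FFN to strengthen feature representation; 3. Achieves the best performance on multiple hyperspectral and general vision tasks.

Method: 1. Parallel, decoupled spatial and channel streams; 2. A Squeezed Token Enhancer that dynamically fuses spatial and channel features; 3. A Multi-Scale FFN that supplements local context.

Result: Achieves SOTA performance on Indian Pine, Pavia University, Houston, and BigEarthNet, and remains competitive on ImageNet while reducing computational cost by 17.8% in FLOPs.

Insight: A decoupled design motivated by information-theoretic principles models the complex dependencies in multi-channel vision tasks more effectively while also improving computational efficiency.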

Abstract: Vision Transformers face a fundamental limitation: standard self-attention jointly processes spatial and channel dimensions, leading to entangled representations that prevent independent modeling of structural and semantic dependencies. This problem is especially pronounced in hyperspectral imaging, from satellite hyperspectral remote sensing to infrared pathology imaging, where channels capture distinct biophysical or biochemical cues. We propose DisentangleFormer, an architecture that achieves robust multi-channel vision representation through principled spatial-channel decoupling. Motivated by information-theoretic principles of decorrelated representation learning, our parallel design enables independent modeling of structural and semantic cues while minimizing redundancy between spatial and channel streams. Our design integrates three core components: (1) Parallel Disentanglement: Independently processes spatial-token and channel-token streams, enabling decorrelated feature learning across spatial and spectral dimensions, (2) Squeezed Token Enhancer: An adaptive calibration module that dynamically fuses spatial and channel streams, and (3) Multi-Scale FFN: complementing global attention with multi-scale local context to capture fine-grained structural and semantic dependencies. Extensive experiments on hyperspectral benchmarks demonstrate that DisentangleFormer achieves state-of-the-art performance, consistently outperforming existing models on Indian Pine, Pavia University, and Houston, the large-scale BigEarthNet remote sensing dataset, as well as an infrared pathology dataset. Moreover, it retains competitive accuracy on ImageNet while reducing computational cost by 17.8% in FLOPs. The code will be made publicly available upon acceptance.

[34] SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization for Multi-Video 4D Gaussian Splatting

Yonghan Lee,Tsung-Wei Huang,Shiv Gehlot,Jaehoon Choi,Guan-Ming Su,Dinesh Manocha

Main category: cs.CV

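TL;DR: SyncTrack4D is a novel 4D Gaussian Splatting method for unsynchronized multi-video data. It uses a dense 4D track representation to achieve cross-video synchronization and high-fidelity 4D reconstruction at the same time.

Details Motivation: Modeling dynamic 3D scenes is challenging because of their high dimensionality, and existing methods typically assume predefined scene objects or prior models. SyncTrack4D aims to provide a general 4D reconstruction approach for unsynchronized multi-video sets.

Contribution: 1. Proposes the first general 4D Gaussian Splatting method for unsynchronized multi-video sets; 2. Uses dense 4D tracks as cues for both synchronization and reconstruction; 3. Achieves sub-frame synchronization accuracy (error below 0.26 frames).

Method: 1. Compute dense per-video 4D feature tracks; 2. Match tracks across videos with Fused Gromov-Wasserstein optimal transport; 3. Perform global frame-level temporal alignment; 4. Run multi-video 4D Gaussian Splatting built on a motion-spline scaffold.

Result: On Panoptic Studio and SyncNeRF Blender, the average temporal error is below 0.26 frames; reconstruction quality reaches 26.3 PSNR on Panoptic Studio.

Insight: Dense 4D tracks serve as effective synchronization cues, and the motion-spline scaffold supports high-quality 4D Gaussian Splatting reconstruction.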

Abstract: Modeling dynamic 3D scenes is challenging due to their high-dimensional nature, which requires aggregating information from multiple views to reconstruct time-evolving 3D geometry and motion. We present a novel multi-video 4D Gaussian Splatting (4DGS) approach designed to handle real-world, unsynchronized video sets. Our approach, SyncTrack4D, directly leverages a dense 4D track representation of dynamic scene parts as a cue for simultaneous cross-video synchronization and 4DGS reconstruction. We first compute dense per-video 4D feature tracks and cross-video track correspondences using a Fused Gromov-Wasserstein optimal transport approach. Next, we perform global frame-level temporal alignment to maximize the overlapping motion of matched 4D tracks. Finally, we achieve sub-frame synchronization through our multi-video 4D Gaussian Splatting built upon a motion-spline scaffold representation. The final output is a synchronized 4DGS representation with dense, explicit 3D trajectories and temporal offsets for each video. We evaluate our approach on the Panoptic Studio and SyncNeRF Blender datasets, demonstrating sub-frame synchronization accuracy with an average temporal error below 0.26 frames, and high-fidelity 4D reconstruction reaching a PSNR of 26.3 on the Panoptic Studio dataset. To the best of our knowledge, our work is the first general 4D Gaussian Splatting approach for unsynchronized video sets, without assuming the existence of predefined scene objects or prior models.
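
The frame-level alignment step can be illustrated with a heavily simplified sketch. It assumes each video has already been reduced to a 1D per-frame motion-magnitude signal (the paper works with matched 4D tracks instead), and it scores each candidate delay by the mean product over the overlapping frames. The function name `best_frame_delay` and the scoring rule are illustrative assumptions, not the authors' implementation.

```python
def best_frame_delay(ref, other, max_shift):
    """Return the integer delay d that best aligns other[t + d] with ref[t].

    Assumption: `ref` and `other` are per-frame motion magnitudes; the score
    is the mean product over the overlapping frames (a stand-in for the
    "overlapping motion" objective described in the abstract).
    """
    best_delay, best_score = 0, float("-inf")
    for d in range(-max_shift, max_shift + 1):
        pairs = [
            (ref[t], other[t + d])
            for t in range(len(ref))
            if 0 <= t + d < len(other)
        ]
        if not pairs:
            continue  # no overlapping frames for this delay
        score = sum(a * b for a, b in pairs) / len(pairs)
        if score > best_score:
            best_delay, best_score = d, score
    return best_delay


# Toy usage: `other` shows the same motion burst two frames later than `ref`.
ref = [0, 0, 1, 3, 1, 0, 0, 0]
other = [0, 0, 0, 0, 1, 3, 1, 0]
delay = best_frame_delay(ref, other, max_shift=3)
```

This only recovers whole-frame offsets; the sub-frame refinement in the paper comes from the later 4DGS optimization and is not modeled here.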

[35] Bayes-DIC Net: Estimating Digital Image Correlation Uncertainty with Bayesian Neural Networks

Biao Chen,Zhenhua Lei,Yahui Zhang,Tongzhi Niu

Main category: cs.CV

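TL;DR: This paper proposes a DIC dataset generation method based on non-uniform B-spline surfaces and uses the Bayes-DIC Net architecture to estimate the uncertainty of displacement-field predictions, improving the practicality and reliability of deep-learning DIC algorithms.

Details Motivation: Existing DIC datasets rarely cover the variety of real displacement fields, and predictions come without confidence estimates, which limits the practical use of deep-learning DIC algorithms.

Contribution: 1. Proposes a dataset generation method based on non-uniform B-spline surfaces; 2. Designs the Bayes-DIC Net architecture, which supports multi-level information extraction and uncertainty estimation; 3. Turns Bayes-DIC Net into a Bayesian neural network that outputs confidence levels for its predictions.

Method: 1. Construct diverse displacement fields from randomly generated control-point coordinates; 2. Design lightweight convolutional blocks that expand the receptive field; 3. Add dropout modules, kept active at inference, to enable Bayesian inference.

Result: The generated dataset captures the characteristics of real displacement fields, and Bayes-DIC Net provides high-accuracy predictions together with confidence levels, which strengthens the reliability of the algorithm.

Insight: Bayesian methods can improve the interpretability and practicality of deep learning for DIC, and the lightweight design balances performance and efficiency.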

Abstract: This paper introduces a novel method for generating high-quality Digital Image Correlation (DIC) datasets based on non-uniform B-spline surfaces. By randomly generating control point coordinates, we construct displacement fields that encompass a variety of realistic displacement scenarios, which are subsequently used to generate speckle pattern datasets. This approach enables the generation of a large-scale dataset that captures real-world displacement field situations, thereby enhancing the training and generalization capabilities of deep learning-based DIC algorithms. Additionally, we propose a novel network architecture, termed Bayes-DIC Net, which extracts information at multiple levels during the down-sampling phase and facilitates the aggregation of information across various levels through a single skip connection during the up-sampling phase. Bayes-DIC Net incorporates a series of lightweight convolutional blocks designed to expand the receptive field and capture rich contextual information while minimizing computational costs. Furthermore, by integrating appropriate dropout modules into Bayes-DIC Net and activating them during the network inference stage, Bayes-DIC Net is transformed into a Bayesian neural network. This transformation allows the network to provide not only predictive results but also confidence levels in these predictions when processing real unlabeled datasets. This feature significantly enhances the practicality and reliability of our network in real-world displacement field prediction tasks. Through these innovations, this paper offers new perspectives and methods for dataset generation and algorithm performance enhancement in the field of DIC.
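
Keeping dropout active at inference is commonly known as Monte Carlo dropout: the same input is passed through the network several times, and the spread of the outputs serves as a confidence estimate. The sketch below shows the idea on a toy one-layer regressor; `toy_displacement_net`, the weights, and the dropout rate are hypothetical stand-ins, not the Bayes-DIC Net architecture.

```python
import random
import statistics


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def toy_displacement_net(features, weights, rate, rng):
    """Hypothetical one-layer regressor whose dropout stays on at inference."""
    dropped = dropout(features, rate, rng)
    return sum(w * x for w, x in zip(weights, dropped))


def mc_dropout_predict(features, weights, rate=0.2, passes=200, seed=0):
    """Run several stochastic forward passes; return (mean, standard deviation)."""
    rng = random.Random(seed)
    samples = [
        toy_displacement_net(features, weights, rate, rng) for _ in range(passes)
    ]
    return statistics.mean(samples), statistics.stdev(samples)


# Toy usage: the standard deviation acts as a per-prediction uncertainty.
mean, std = mc_dropout_predict([1.0, 2.0, 3.0], [0.5, -0.25, 0.1])
```

A larger standard deviation flags predictions the model is less sure about, which is the kind of confidence signal the abstract describes for unlabeled real data.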

[36] A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

Waleed Khalid,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

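TL;DR: NN-RAG is a retrieval-augmented generation system that extracts, validates, and reuses neural network modules from large PyTorch codebases, substantially improving research efficiency and architectural diversity.

Details Motivation: Reuse of existing neural network components is inefficient, and tools for discovering and validating modules across repositories are lacking, which hampers research efficiency.

Contribution: 1. Proposes the NN-RAG system, which supports retrieval, validation, and regeneration of neural network modules across codebases; 2. Provides, for the first time, dependency-complete module migration; 3. Substantially expands the architectural diversity of the LEMUR dataset (supplying approximately 72% of its novel structures).

Method: Uses retrieval-augmented generation combined with scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion, ensuring that every retrieved block is scope-closed, compilable, and runnable.

Result: Extracted 1,289 candidate blocks from 19 repositories and validated 941 of them (73.0%); over 80% are structurally unique.

Insight: NN-RAG offers a scalable solution for cross-project reuse of neural network components and supports optional integration with language models, advancing the standardization and efficiency of algorithmic discovery.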

Abstract: Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion – ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework’s neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code. Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
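
The abstract mentions multi-level de-duplication (exact, lexical, structural) without giving details. One plausible way to approximate the structural level, shown here purely as an assumption, is to parse each block with Python's `ast` module, erase identifiers and literal values, and hash what remains. The helper names below are hypothetical and not part of NN-RAG.

```python
import ast
import hashlib


class _StructureOnly(ast.NodeTransformer):
    """Erase names and literal values so that only the code structure remains."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_Attribute(self, node):
        self.generic_visit(node)
        node.attr = "_"
        return node

    def visit_Constant(self, node):
        node.value = 0
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = "_"
        return node

    def visit_ClassDef(self, node):
        self.generic_visit(node)
        node.name = "_"
        return node


def structural_fingerprint(source):
    """Hash of the normalized AST; equal hashes suggest structural duplicates."""
    tree = _StructureOnly().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


# Toy usage: block_a and block_b differ only in names and constants.
block_a = "def f(x):\n    y = x + 1\n    return y * 2\n"
block_b = "def g(inp):\n    out = inp + 3\n    return out * 7\n"
block_c = "def h(x):\n    return x - 1\n"
same = structural_fingerprint(block_a) == structural_fingerprint(block_b)
different = structural_fingerprint(block_a) != structural_fingerprint(block_c)
```

Exact and lexical de-duplication could be layered in front of this with a hash of the raw text and a hash of the token stream, respectively.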

[37] Open Set Face Forgery Detection via Dual-Level Evidence Collection

Zhongyi Cai,Bryce Gernon,Wentao Bao,Yifan Li,Matthew Wright,Yu Kong

Main category: cs.CV

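TL;DR: This paper proposes the Dual-Level Evidential face forgery Detection (DLED) approach for Open Set Face Forgery Detection (OSFFD). It fuses category-specific evidence at the spatial and frequency levels to estimate prediction uncertainty, significantly improving the detection of novel forgery categories.

Details Motivation: Face forgery generation algorithms are developing rapidly and new forgery categories keep appearing. Existing methods handle only known forgery categories or binary classification and cannot meet the open-set detection needs of real-world scenarios.

Contribution: 1. Reformulates the open-set face forgery detection problem; 2. Proposes DLED, which estimates uncertainty by collecting and fusing evidence at two levels; 3. Improves performance on novel forgery category detection by an average of 20%.

Method: DLED collects category-specific evidence at the spatial and frequency levels separately and fuses it to estimate prediction uncertainty, which is then used to identify novel forgery categories.

Result: DLED performs best across diverse experimental settings, improving detection of novel forgery categories by an average of 20% over baselines, while remaining competitive on the traditional binary Real-versus-Fake task.

Insight: Uncertainty estimation is an effective route to open-set problems, and fusing evidence from two levels can improve model robustness and generalization.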

Abstract: The proliferation of face forgeries has increasingly undermined confidence in the authenticity of online content. Given the rapid development of face forgery generation algorithms, new fake categories are likely to keep appearing, posing a major challenge to existing face forgery detection methods. Despite recent advances in face forgery detection, existing methods are typically limited to binary Real-vs-Fake classification or the identification of known fake categories, and are incapable of detecting the emergence of novel types of forgeries. In this work, we study the Open Set Face Forgery Detection (OSFFD) problem, which demands that the detection model recognize novel fake categories. We reformulate the OSFFD problem and address it through uncertainty estimation, enhancing its applicability to real-world scenarios. Specifically, we propose the Dual-Level Evidential face forgery Detection (DLED) approach, which collects and fuses category-specific evidence on the spatial and frequency levels to estimate prediction uncertainty. Extensive evaluations conducted across diverse experimental settings demonstrate that the proposed DLED method achieves state-of-the-art performance, outperforming various baseline models by an average of 20% in detecting forgeries from novel fake categories. Moreover, on the traditional Real-versus-Fake face forgery detection task, our DLED method concurrently exhibits competitive performance.
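
Evidence-based uncertainty is often formulated with a Dirichlet distribution, as in standard evidential deep learning: non-negative per-class evidence e_k gives alpha_k = e_k + 1, and the uncertainty mass is u = K / sum(alpha). The sketch below follows that standard formulation and fuses the two levels by simple summation; the summation rule and both function names are assumptions for illustration, since the abstract does not specify how DLED fuses evidence.

```python
def dirichlet_uncertainty(evidence):
    """Map non-negative class evidence to (expected probabilities, uncertainty)."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet parameters
    strength = sum(alpha)                    # Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    return probs, num_classes / strength     # u = K / S, in (0, 1]


def fuse_by_sum(spatial_evidence, frequency_evidence):
    """Assumed fusion rule: add the evidence collected at the two levels."""
    return [s + f for s, f in zip(spatial_evidence, frequency_evidence)]


# Toy usage with three known categories.
confident = fuse_by_sum([9.0, 0.0, 0.0], [8.0, 0.0, 0.0])
probs, u_known = dirichlet_uncertainty(confident)

no_evidence = fuse_by_sum([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
_, u_novel = dirichlet_uncertainty(no_evidence)
```

With no supporting evidence for any known class the uncertainty reaches its maximum of 1, which is the signal an open-set detector can threshold to flag a novel forgery category.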

[38] Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Kai-Po Chang,Wei-Yuan Cheng,Chi-Pin Huang,Fu-En Yang,Yu-Chiang Frank Wang

Main category: cs.CV

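TL;DR: The paper proposes SANTA, a self-augmented contrastive alignment framework that reduces object and action hallucinations in video captions generated by multimodal large language models (MLLMs).

Details Motivation: Existing MLLMs hallucinate heavily when captioning videos (for example, describing objects and actions inaccurately). The problem is more complex for dynamic videos and remains unsolved.

Contribution: Proposes the SANTA framework, which uses self-augmented contrastive alignment to identify and correct potential hallucinations in MLLMs and to improve the accuracy of object and action descriptions.

Method: SANTA generates negative samples with a hallucinative self-augmentation scheme and applies tracklet-phrase contrastive alignment to match regional objects and relation-guided actions with their corresponding visual and temporal phrases.

Result: Experiments show that SANTA outperforms existing methods at reducing object and action hallucinations and performs strongly on hallucination examination benchmarks.

Insight: Contrastive alignment combined with self-augmentation can effectively reduce hallucinations in dynamic video captioning, suggesting a new approach to improving the faithfulness of multimodal models.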

Abstract: Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.

[39] MAFNet: Multi-frequency Adaptive Fusion Network for Real-time Stereo Matching

Ao Xu,Rujin Zhao,Xiong Xu,Boceng Huang,Yujia Jia,Hongfeng Long,Fuxuan Chen,Zilong Cao,Fangyuan Chen

Main category: cs.CV

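TL;DR: MAFNet is a multi-frequency adaptive fusion network that achieves real-time stereo matching with efficient 2D convolutions and a frequency-domain attention module, balancing accuracy and speed.

Details Motivation: Existing stereo matching networks either incur heavy computational overhead or lack non-local context modeling, making them hard to deploy in real time on resource-constrained devices. An efficient yet accurate method is needed.

Contribution: 1. Proposes the Multi-frequency Adaptive Fusion Network (MAFNet); 2. Designs an adaptive frequency-domain attention module that decomposes the full cost volume; 3. Introduces a low-rank attention mechanism to fuse high- and low-frequency information.

Method: Frequency-domain filtering decomposes the full cost volume into high-frequency and low-frequency parts, features are aggregated separately for each, and a Linformer-based low-rank attention mechanism fuses the information adaptively.

Result: Outperforms existing real-time methods on Scene Flow and KITTI 2015, with a favorable balance between accuracy and real-time performance.

Insight: Frequency-domain decomposition and low-rank attention are effective ways to improve both the efficiency and the performance of stereo matching.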

Abstract: Existing stereo matching networks typically rely on either cost-volume construction based on 3D convolutions or deformation methods based on iterative optimization. The former incurs significant computational overhead during cost aggregation, whereas the latter often lacks the ability to model non-local contextual information. These methods exhibit poor compatibility on resource-constrained mobile devices, limiting their deployment in real-time applications. To address this, we propose a Multi-frequency Adaptive Fusion Network (MAFNet), which can produce high-quality disparity maps using only efficient 2D convolutions. Specifically, we design an adaptive frequency-domain filtering attention module that decomposes the full cost volume into high-frequency and low-frequency volumes, performing frequency-aware feature aggregation separately. Subsequently, we introduce a Linformer-based low-rank attention mechanism to adaptively fuse high- and low-frequency information, yielding more robust disparity estimation. Extensive experiments demonstrate that the proposed MAFNet significantly outperforms existing real-time methods on public datasets such as Scene Flow and KITTI 2015, showing a favorable balance between accuracy and real-time performance.
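
The decomposition of a cost volume into low- and high-frequency parts can be illustrated on a single 1D slice. The sketch below uses a fixed cutoff in a plain discrete Fourier transform, whereas the paper's filter is adaptive and operates on the full cost volume; the function names and the cutoff are illustrative assumptions.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a short slice)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_bands(costs, cutoff):
    """Split a 1D cost slice into low- and high-frequency components.

    Bins whose symmetric frequency index min(k, n - k) is <= cutoff form the
    low band; everything else forms the high band. The mask is symmetric, so
    both outputs are real up to rounding error.
    """
    n = len(costs)
    spectrum = dft(costs)
    low_spec = [c if min(k, n - k) <= cutoff else 0j for k, c in enumerate(spectrum)]
    high_spec = [c - l for c, l in zip(spectrum, low_spec)]
    low = [v.real for v in idft(low_spec)]
    high = [v.real for v in idft(high_spec)]
    return low, high


# Toy usage: matching costs along one disparity axis with a sharp dip.
costs = [4.0, 4.5, 5.0, 1.0, 5.5, 6.0, 6.5, 7.0]
low, high = split_bands(costs, cutoff=1)
```

The two bands sum back to the original slice, so separate aggregation followed by fusion loses no information at the decomposition stage; the smooth trend lands in `low` and the sharp dip mostly in `high`.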

[40] FMA-Net++: Motion- and Exposure-Aware Real-World Joint Video Super-Resolution and Deblurring

Geunhyuk Youk,Jihyong Oh,Munchurl Kim

Main category: cs.CV

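TL;DR: FMA-Net++ is a framework for real-world video restoration that performs super-resolution and deblurring jointly and explicitly models the coupled effect of motion and dynamically varying exposure. Its Hierarchical Refinement with Bidirectional Propagation blocks enable parallel, long-range temporal modeling, while an Exposure Time-aware Modulation layer and a Flow-Guided Dynamic Filtering module improve restoration quality and inference speed.

Details Motivation: Real-world video restoration commonly faces complex degradations from motion coupled with dynamically varying exposure, a challenge that existing methods largely overlook. FMA-Net++ models this coupled effect to improve restoration quality and temporal consistency.

Contribution: 1) Proposes the first joint video super-resolution and deblurring framework that explicitly models the coupled effect of dynamic exposure and motion; 2) Designs a sequence-level architecture built from Hierarchical Refinement with Bidirectional Propagation blocks; 3) Introduces two new benchmarks, REDS-ME (multi-exposure) and REDS-RE (random-exposure).

Method: FMA-Net++ adopts a sequence-level architecture whose Hierarchical Refinement with Bidirectional Propagation blocks perform parallel, long-range temporal modeling. Its Exposure Time-aware Modulation layer conditions features on per-frame exposure, and the Flow-Guided Dynamic Filtering module uses them to infer motion- and exposure-aware degradation kernels.

Result: On the REDS-ME, REDS-RE, and GoPro benchmarks, FMA-Net++ achieves the best restoration quality and temporal consistency, and it also improves inference speed.

Insight: Explicitly modeling the coupling between dynamic exposure and motion can markedly improve real-world video restoration, and combining a hierarchical architecture with exposure-aware mechanisms offers a new way to handle complex degradations.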

Abstract: Real-world video restoration is plagued by complex degradations from motion coupled with dynamically varying exposure - a key challenge largely overlooked by prior works and a common artifact of auto-exposure or low-light capture. We present FMA-Net++, a framework for joint video super-resolution and deblurring that explicitly models this coupled effect of motion and dynamically varying exposure. FMA-Net++ adopts a sequence-level architecture built from Hierarchical Refinement with Bidirectional Propagation blocks, enabling parallel, long-range temporal modeling. Within each block, an Exposure Time-aware Modulation layer conditions features on per-frame exposure, which in turn drives an exposure-aware Flow-Guided Dynamic Filtering module to infer motion- and exposure-aware degradation kernels. FMA-Net++ decouples degradation learning from restoration: the former predicts exposure- and motion-aware priors to guide the latter, improving both accuracy and efficiency. To evaluate under realistic capture conditions, we introduce REDS-ME (multi-exposure) and REDS-RE (random-exposure) benchmarks. Trained solely on synthetic data, FMA-Net++ achieves state-of-the-art accuracy and temporal consistency on our new benchmarks and GoPro, outperforming recent methods in both restoration quality and inference speed, and generalizes well to challenging real-world videos.

[41] Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models

Hieu Dinh Trung Pham,Huy Minh Nhat Nguyen,Cuong Tuan Nguyen

Main category: cs.CV

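TL;DR: The paper proposes FARL, a framework that explicitly disentangles visual representations using Fourier analysis to strengthen the few-shot generalization of vision-language models.

Details Motivation: Large-scale pre-trained vision-language models (VLMs) perform well in few-shot learning, but their representations implicitly entangle an image's domain-invariant structure with its domain-specific style, which limits generalization.

Contribution: Proposes the FARL framework, which explicitly disentangles visual representations through Fourier analysis, and designs a dual cross-attention mechanism and an asymmetric injection strategy to strengthen generalization.

Method: Fourier analysis decomposes an image into structural features (phase spectrum) and stylistic features (amplitude spectrum). A dual cross-attention mechanism learns disentangled representations, and an asymmetric injection strategy guides model adaptation.

Result: Experiments on 15 datasets demonstrate the effectiveness of the method.

Insight: Disentangling domain-invariant structure from domain-specific style in visual representations can improve generalization, particularly in few-shot learning.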

Abstract: Large-scale pre-trained Vision-Language Models (VLMs) have demonstrated strong few-shot learning capabilities. However, these methods typically learn holistic representations where an image’s domain-invariant structure is implicitly entangled with its domain-specific style. This presents an opportunity to further enhance generalization by disentangling these visual cues. In this paper, we propose Fourier-Attentive Representation Learning (FARL), a novel framework that addresses this by explicitly disentangling visual representations using Fourier analysis. The core of our method is a dual cross-attention mechanism, where learnable representation tokens separately query an image’s structural features (from the phase spectrum) and stylistic features (from the amplitude spectrum). This process yields enriched, disentangled tokens that are then injected deep into the VLM encoders to guide adaptation. Our design, which includes an asymmetric injection strategy, forces the model to learn a more robust vision-language alignment. Extensive experiments on 15 datasets demonstrate the effectiveness of our approach.
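
The split into amplitude (style) and phase (structure) follows directly from writing each Fourier coefficient in polar form. The sketch below demonstrates it on a 1D signal with a naive DFT; real images would use a 2D FFT, and the function names here are illustrative assumptions rather than FARL's code.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a short signal)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def amplitude_phase(signal):
    """Return the amplitude spectrum and phase spectrum of a real signal."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(amplitude, phase):
    """Rebuild a signal from an amplitude spectrum and a phase spectrum."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [v.real for v in idft(spectrum)]


# Toy usage: decompose, then rebuild from the two spectra.
signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
amp, pha = amplitude_phase(signal)
rebuilt = recombine(amp, pha)

# Pairing one signal's phase with another's amplitude gives the familiar
# "keep structure, change style" mix.
other_amp, _ = amplitude_phase([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
restyled = recombine(other_amp, pha)
```

In FARL the two spectra are not recombined like this; they are queried separately by learnable tokens through cross-attention. The sketch only shows why the two spectra carry separable information.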

[42] Explainable Parkinsons Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models

Manar Alnaasan,Md Selim Sarowar,Sungho Kim

Main category: cs.CV

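TL;DR: An explainable multimodal recognition framework that combines RGB-D data with a large language model (LLM) for Parkinson's disease gait recognition, improving accuracy and clinical transparency.

Details Motivation: Most existing Parkinson's disease gait analysis methods use single-modality inputs and lack robustness and clinical interpretability. This paper aims to address these shortcomings with multimodal data and an LLM.

Contribution: Proposes an RGB-D fusion framework with an MLGE module to enhance spatial-temporal representation, and introduces an LLM that generates clinically meaningful explanations, yielding higher accuracy and interpretability.

Method: Dual YOLOv11-based encoders extract modality-specific features, the MLGE module and a cross-spatial fusion mechanism enhance the representation, and an LLM translates the result into textual explanations.

Result: Outperforms single-modality baselines on multimodal gait datasets, with higher recognition accuracy and better robustness to environmental variations.

Insight: Combining multimodal data fusion with language models offers a new route to both interpretability and accuracy in medical vision tasks.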

Abstract: Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: https://github.com/manaralnaasan/RGB-D_parkinson-LLM

[43] Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

Sidan Zhu,Hongteng Xu,Dixin Luo

Main category: cs.CV

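TL;DR: This paper proposes SSMP, a self-paced and self-corrective masked prediction method for movie trailer generation. Using bi-directional contextual modeling and progressive self-correction, it outperforms existing methods.

Details Motivation: Existing automatic trailer generation methods follow a "selection-then-ranking" paradigm that suffers from error propagation and limits generation quality. This paper moves beyond that paradigm.

Contribution: Proposes SSMP, which improves trailer generation quality through self-paced masked training and a progressive self-correction mechanism.

Method: A Transformer encoder is trained with self-paced masked prediction, and trailers are generated with a progressive self-correction mechanism.

Result: Both quantitative results and user studies show that SSMP outperforms existing methods.

Insight: Self-paced masked training and progressive self-correction resemble how human editors work and are an effective way to improve generation quality.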

Abstract: As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a “selection-then-ranking” paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo is available at: https://github.com/Dixin-Lab/SSMP.
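
The generation loop described in the abstract (commit the high-confidence positions, re-mask the rest, predict again) can be sketched independently of the model. In the sketch below, `predict` is a hypothetical callable standing in for the trained Transformer encoder, and the commit schedule, which spreads the remaining positions evenly over the remaining steps, is an assumption rather than the paper's schedule.

```python
MASK = None  # placeholder for a masked shot position


def progressive_decode(length, predict, steps):
    """Fill a masked sequence over several steps, most confident positions first.

    `predict(seq)` must return {position: (token, confidence)} for every
    position of `seq` that is still masked.
    """
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        proposals = predict(seq)
        remaining_steps = steps - step
        # Ceiling division, so everything is committed by the final step.
        n_commit = -(-len(masked) // remaining_steps)
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            seq[i] = proposals[i][0]
        # Positions not committed stay masked and are predicted again next step.
    return seq


def toy_predict(seq):
    """Hypothetical predictor: earlier positions get higher confidence."""
    return {
        i: (f"shot{i}", 1.0 / (1 + i)) for i, tok in enumerate(seq) if tok is MASK
    }


trailer = progressive_decode(length=5, predict=toy_predict, steps=3)
```

Because uncommitted positions are predicted again with more context filled in, early low-confidence guesses never become final, which is the self-correction effect the abstract describes.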

[44] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

Bin Sun,Yaoguang Cao,Yan Wang,Rui Wang,Jiachen Shang,Xiejie Feng,Jiayi Lu,Jia Shi,Shichun Yang,Xiaoyu Yan,Ziying Song

Main category: cs.CV

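TL;DR: MindDrive is an end-to-end autonomous driving framework that combines high-quality trajectory generation with comprehensive decision reasoning. A future-aware trajectory generator and a vision-language model evaluator together produce safe, human-aligned driving decisions.

Details Motivation: Existing end-to-end autonomous driving research is unbalanced between trajectory generation and trajectory selection. MindDrive aims to close this gap by unifying high-quality generation with comprehensive decision making.

Contribution: Proposes the MindDrive framework, which integrates a Future-aware Trajectory Generator (FaTG) based on a World Action Model with a VLM-oriented Evaluator (VLoE), so that generation and decision making work together.

Method: Under a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off", FaTG runs "what-if" simulations to generate foresighted trajectories, and VLoE uses a vision-language model to evaluate them along multiple dimensions.

Result: Performs strongly on the NAVSIM-v1 and NAVSIM-v2 benchmarks, significantly improving safety, compliance, and generalization.

Insight: MindDrive points toward interpretable, cognitively guided autonomous driving, achieving more well-rounded driving performance by combining generative and decision-making capabilities.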

Abstract: End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of “context simulation - candidate generation - multi-objective trade-off”. In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned “what-if” simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.

[45] StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

Yifei Wang,Zhenkai Li,Tianwen Qian,Huanran Zheng,Zheng Wang,Yuqian Fu,Xiaoling Wang

Main category: cs.CV

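TL;DR: StreamEQA is the first benchmark for streaming video question answering in embodied scenarios. It evaluates how well multimodal large language models understand streaming video.

Details Motivation: As embodied intelligence advances, models must continuously perceive streaming video input and reason in real time to cope with dynamic real-world environments.

Contribution: Proposes the StreamEQA benchmark, the first to combine streaming video understanding with embodied scenarios, with questions organized into hierarchical levels (perception, interaction, planning) and temporal reasoning modes (backward, real-time, forward).

Method: Builds 42 tasks and approximately 21K timestamped question-answer pairs from 156 long videos, using a hybrid pipeline that combines automated generation with human refinement.

Result: An evaluation of 13 state-of-the-art video-LLMs shows that these models still struggle with embodied streaming video understanding, despite strong performance on conventional benchmarks.

Insight: Streaming video understanding poses distinctive challenges in embodied scenarios and requires models to combine temporal context with dynamic reasoning.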

Abstract: As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model’s ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.

[46] GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis

Changjin Kim,HyeokJun Lee,YoungJoon Yoo

Main category: cs.CV

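TL;DR: GuidNoise is a guided diffusion method that uses a single noisy/clean image pair to synthesize high-quality, generalized noise without additional camera metadata or large numbers of image pairs.

Details Motivation: Existing noise synthesis methods often depend on many image pairs and on camera metadata, which limits their generality and practicality.

Contribution: 1. Proposes GuidNoise, a method guided by a single image pair; 2. Introduces guidance-aware affine feature modification (GAFM) and a noise-aware refine loss; 3. Achieves high-quality noise synthesis without metadata.

Method: A single image pair serves as guidance, and GAFM together with the noise-aware refine loss refines the backward process of the diffusion model to generate realistic noise distributions.

Result: GuidNoise generates high-quality synthetic noise under diverse noise environments and significantly improves the performance of lightweight denoising models.

Insight: Noise synthesis guided by a single image pair is an efficient, general form of data augmentation that suits practical scenarios with limited data.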

Abstract: Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to relax these prerequisites, we propose GuidNoise, a Single-Pair Guided Diffusion for generalized noise synthesis, which uses a single noisy/clean pair as the guidance; such a pair is often easily obtained from the training set itself. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model’s backward process, making the model more adept at generating realistic noise distributions. GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at https://github.com/chjinny/GuidNoise.

[47] dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Yingzi Ma,Yulong Cao,Wenhao Ding,Shuibai Zhang,Yan Wang,Boris Ivanovic,Ming Jiang,Marco Pavone,Chaowei Xiao

Main category: cs.CV

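TL;DR: The paper proposes dVLM-AD, a diffusion-based vision-language model that improves the controllability and consistency of end-to-end autonomous driving systems.

Details Motivation: Existing autoregressive vision-language models struggle to keep high-level reasoning and low-level planning consistent and controllable in autonomous driving.

Contribution: Proposes dVLM-AD, which uses the iterative denoising and bidirectional attention of diffusion models to make reasoning and planning more controllable and reliable.

Method: Uses the bidirectional attention of a diffusion model to unify perception, structured reasoning, and low-level planning.

Result: On nuScenes and WOD-E2E, dVLM-AD outperforms autoregressive baselines in behavior-trajectory consistency and in long-tail scenarios.

Insight: Diffusion models offer a reliable path toward controllable and consistent autonomous driving.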

Abstract: The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs – limited by causal attention and sequential token generation – often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.

[48] UniTS: Unified Time Series Generative Model for Remote Sensing

Yuxiang Zhang,Shunlin Liang,Wenyuan Li,Han Ma,Jianglei Xu,Yichuan Ma,Jiangwei Xie,Wei Li,Mengmeng Zhang,Ran Tao,Xiang-Gen Xia

Main category: cs.CV

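TL;DR: UniTS is a unified time series generative model for multiple remote sensing time series tasks. It uses the flow matching generative paradigm to model spatiotemporal features across tasks in one framework and significantly outperforms existing methods.

Details Motivation: Existing methods for remote sensing time series tasks usually need a specialized model per task and lack unified modeling of spatiotemporal features across tasks.

Contribution: Proposes UniTS, a general time series generative framework that supports reconstruction, cloud removal, semantic change detection, and forecasting; designs the Adaptive Condition Injector (ACor) and the Spatiotemporal-aware Modulator (STM) to strengthen conditional perception and dependency capture; constructs two high-quality multimodal time series datasets, TS-S12 and TS-S12CR.

Method: Based on the flow matching generative paradigm, UniTS uses a diffusion transformer with spatio-temporal blocks. ACor enhances conditional perception of multimodal inputs, and STM improves the capture of spatiotemporal dependencies.

Result: Experiments show that UniTS performs strongly on both low-level and high-level time series tasks, and significantly outperforms existing methods under severe cloud contamination, modality absence, and forecasting of phenological variations.

Insight: A unified generative modeling paradigm can integrate spatiotemporal features across tasks, giving a general solution for remote sensing time series analysis.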

Abstract: One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model’s conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
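
The "deterministic evolution path from noise to targets" can be illustrated with the simplest flow matching setup. The following is a minimal sketch, assuming a straight-line path x_t = (1 - t) * x0 + t * x1 with a constant velocity target; the abstract does not state which path or conditioning UniTS actually uses, and all function names here are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: an oracle that returns the exact target velocity carries the
# noise sample to the target. A trained network would replace `oracle`.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
target = [0.2, 0.4, 0.6, 0.8]
oracle = lambda x, t: target_velocity(noise, target)
sample = euler_sample(oracle, noise, steps=10)
```

In training, a network would be regressed onto `target_velocity` at points produced by `interpolate`; task-specific conditions would enter as extra inputs to that network.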

[49] DeRA: Decoupled Representation Alignment for Video Tokenization

Pengbo Guo,Junke Wang,Zhen Xing,Chengxu Liu,Daoguo Dong,Xueming Qian,Zuxuan Wu

Main category: cs.CV

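TL;DR: DeRA is a video tokenizer that decouples spatial and temporal representation learning. It separates appearance and motion streams and uses the SACP module to resolve gradient conflicts caused by heterogeneous supervision, significantly improving video tokenization performance.

Details Motivation: Existing video tokenizers typically learn the spatial and temporal dimensions in a coupled way, which lowers training efficiency and limits performance. DeRA decouples representation learning along these two dimensions to improve both.

Contribution: 1. Proposes DeRA, a 1D video tokenizer with decoupled spatial-temporal representations; 2. Introduces the SACP module to resolve gradient conflicts from heterogeneous supervision; 3. Achieves new SOTA results on UCF-101 and K600.

Method: 1. Factorizes video encoding into appearance and motion streams, each aligned with pretrained vision foundation models; 2. Uses the SACP module to suppress gradient components along conflicting directions.

Result: DeRA outperforms LARP by 25% in rFVD on UCF-101 and reaches new SOTA results on UCF-101 class-conditional generation and K600 frame prediction.

Insight: Decoupling spatial and temporal representation learning, and proactively suppressing gradient conflicts, can improve both the efficiency and the performance of video tokenization.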

Abstract: This paper presents DeRA, a novel 1D video tokenizer that decouples the spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics in videos separately. To address the gradient conflicts introduced by the heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module that proactively reformulates gradients by suppressing the components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer, by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we also achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
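
The idea of "suppressing the components along conflicting directions" can be sketched with a generic symmetric gradient projection. This is not the SACP module itself, whose exact formulation the abstract does not give; it is a PCGrad-style illustration under the assumption that two gradients conflict when their dot product is negative. The names `remove_conflict` and `symmetric_projection` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_conflict(g, other):
    """Drop the part of g that points against `other`; no-op if they agree."""
    d = dot(g, other)
    if d >= 0.0:
        return list(g)
    scale = d / dot(other, other)  # negative when the gradients conflict
    return [gi - scale * oi for gi, oi in zip(g, other)]

def symmetric_projection(g_app, g_mot):
    """Apply the same projection to both gradients, each against the other."""
    return remove_conflict(g_app, g_mot), remove_conflict(g_mot, g_app)

# Toy gradients for the appearance and motion alignment losses.
g_app = [1.0, 0.0]
g_mot = [-1.0, 1.0]
p_app, p_mot = symmetric_projection(g_app, g_mot)
# After projection, neither gradient has a component against the other.
```

Both projections are computed from the original gradients, so the result does not depend on which stream is processed first.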

[50] Not All Birds Look The Same: Identity-Preserving Generation For Birds

Aaron Sun,Oindrila Saha,Subhransu Maji

Main category: cs.CV

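TL;DR: The paper introduces NABirds Look-Alikes (NABLA), a dataset for evaluating identity-preserving generation of bird images. Existing methods perform poorly on it, while training on images grouped by species, age, and sex substantially improves performance.

Details Motivation: Existing controllable image generation methods work well for humans and rigid objects but poorly for non-rigid or fine-grained categories such as birds. These domains lack high-quality data, especially multi-view or video data of the same subject, which limits both evaluation and improvement.

Contribution: (1) Introduces the NABLA dataset of expert-curated bird image pairs; (2) shows that existing methods fall short on identity-preserving generation of birds; (3) proposes training on images grouped by species, age, and sex, which substantially improves performance.

Method: Builds a benchmark from NABLA together with multi-image observations from iNaturalist and a small set of videos. For training, images are grouped by species, age, and sex as a proxy for identity.

Result: State-of-the-art baselines fail to maintain identity on NABLA, while the grouped training strategy performs better on both seen and unseen species.

Insight: Grouped training can capture fine-grained identity cues, which matter for identity-preserving generation. Rich and diverse data is also key, especially in domains that lack multi-view data.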

Abstract: Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data – especially videos or multi-view observations of the same subject – making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex – used as a proxy for identity – substantially improves performance on both seen and unseen species.

[51] Controllable Long-term Motion Generation with Extended Joint Targets

Eunjong Lee,Eunhee Kim,Sanghoon Hong,Eunho Jung,Jihoon Kim

Main category: cs.CV

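TL;DR: The paper proposes COMET, an autoregressive framework that generates stable, controllable character animation in real time, addressing the insufficient control and motion degradation that existing methods show over long sequences.

Details Motivation: Current character animation methods often fail to provide fine-grained control or become unstable over long sequences, which limits their use in interactive applications.

Contribution: 1) An efficient Transformer-based conditional VAE that supports precise, interactive control over arbitrary joints; 2) a reference-guided feedback mechanism that ensures long-term temporal stability and also serves as a plug-and-play stylization module.

Method: COMET is an autoregressive framework that combines a Transformer-based conditional VAE with a reference-guided feedback mechanism to achieve stable long-sequence motion generation and real-time control.

Result: Experiments show that COMET generates high-quality motion at real-time speeds and significantly outperforms existing methods on complex motion control tasks.

Insight: The reference-guided feedback mechanism not only prevents error accumulation but also enables real-time style transfer, showing the framework's versatility.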

Abstract: Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.

[52] SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

Chang-Hsun Wu,Kai-Po Chang,Yu-Yang Sheng,Hung-Kai Chung,Kuei-Chun Wang,Yu-Chiang Frank Wang

Main category: cs.CV

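TL;DR: The paper proposes SEASON, a self-diagnostic contrastive decoding method that addresses temporally inconsistent and causally implausible outputs in VideoLLMs. It significantly reduces temporal hallucination and performs well on multiple benchmarks.

Details Motivation: Existing VideoLLMs struggle to exploit temporal information in videos, so they describe events that are temporally inconsistent or causally implausible, causing severe hallucination. Prior work has focused mostly on spatial hallucination, and temporal reasoning remains underexplored.

Contribution: Proposes SEASON, a training-free adaptive decoding method that dynamically diagnoses each token's hallucination tendency and applies contrastive decoding, improving the temporal and spatial faithfulness of VideoLLMs.

Method: SEASON uses self-diagnostic contrastive decoding: it dynamically assesses the hallucination risk of each output token and applies adaptive contrastive decoding against the corresponding temporal and spatial negatives.

Result: SEASON outperforms existing training-free hallucination mitigation methods on three hallucination examination benchmarks and further improves VideoLLMs on four general video understanding benchmarks.

Insight: Adaptive contrastive decoding against temporal and spatial negatives can improve the temporal and spatial faithfulness of VideoLLMs, suggesting a direction for future video understanding research.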

Abstract: Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token’s hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
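
Contrastive decoding against temporal and spatial negatives can be sketched as a logit adjustment. This is a generic illustration, not SEASON's actual rule: the abstract does not specify how the per-token weights are diagnosed or how the negatives are built, so the weights `w_t` and `w_s`, the combination formula, and the toy logits below are all assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(orig, temporal_neg, spatial_neg, w_t, w_s):
    """Boost tokens that the original video supports more than the negatives do.

    w_t and w_s stand in for per-token weights that a diagnosis step would
    supply; a weight of 0 switches the corresponding contrast off.
    """
    return [
        (1.0 + w_t + w_s) * o - w_t * t - w_s * s
        for o, t, s in zip(orig, temporal_neg, spatial_neg)
    ]

# Toy next-token logits over a 3-token vocabulary.
orig = [2.0, 1.9, 0.1]          # from the real video
temporal_neg = [0.5, 1.9, 0.1]  # e.g. from a temporally corrupted input
spatial_neg = [2.0, 1.9, 0.1]   # unused here because w_s = 0
adjusted = contrastive_logits(orig, temporal_neg, spatial_neg, w_t=1.0, w_s=0.0)
probs = softmax(adjusted)
```

In this toy case, token 1 scores the same with and without correct temporal order, so the contrast widens the margin in favor of token 0, which depends on the real temporal evidence.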

[53] EgoLCD: Egocentric Video Generation with Long Context Diffusion

Liuzhou Zhang,Jiarui Ye,Yuanlei Wang,Ming Zhong,Mingju Cao,Wanke Xia,Bowen Zeng,Zeyu Zhang,Hao Tang

Main category: cs.CV

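TL;DR: EgoLCD is an end-to-end framework for long-context egocentric video generation. It tackles content drift through efficient, stable memory management, combining long- and short-term memory with LoRA-based local adaptation, and achieves significant performance gains.

Details Motivation: Existing autoregressive models are prone to content drift in long video generation, where object identity and scene semantics degrade over time. EgoLCD addresses this by using stable memory management to improve coherence.

Contribution: 1. Proposes EgoLCD, a memory management framework for long video generation; 2. Combines a long-term sparse KV cache with a short-term memory extended by LoRA for local adaptation; 3. Introduces a Memory Regulation Loss and Structured Narrative Prompting.

Method: 1. A Long-Term Sparse KV Cache stabilizes global context; 2. An attention-based short-term memory is extended by LoRA for local adaptation; 3. A Memory Regulation Loss and Structured Narrative Prompting guide generation.

Result: On the EgoVid-5M benchmark, EgoLCD reaches SOTA in perceptual quality and temporal consistency and effectively mitigates generative forgetting.

Insight: Efficient memory management is key to generating long, coherent videos; combining global stability with local adaptability can significantly improve generation quality.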

Abstract: Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.

[54] VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Yifei Yu,Xiaoshan Wu,Xinting Hu,Tao Hu,Yangtian Sun,Xiaoyang Lyu,Bo Wang,Lin Ma,Yuewen Ma,Zhongrui Wang,Xiaojuan Qi

Main category: cs.CV

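TL;DR: VideoSSM is a long video generation method that combines autoregressive diffusion with a hybrid state-space memory, using global and local memory together to address consistency and diversity in long video generation.

Details Motivation: Long video generation suffers from accumulated errors, motion drift, and content repetition, and existing methods struggle to stay coherent at minute scale. The authors aim to improve coherence and diversity by combining global and local memory mechanisms.

Contribution: 1. Proposes VideoSSM, which unifies autoregressive diffusion with a hybrid state-space memory; 2. Manages global memory with a state-space model and local memory with a context window; 3. Achieves state-of-the-art temporal consistency and motion stability in minute-scale video generation.

Method: 1. Generates video frames with autoregressive diffusion; 2. Uses a state-space model as global memory of scene dynamics; 3. Uses a context window as local memory for motion cues and fine details; 4. Scales in linear time with sequence length.

Result: On short- and long-range benchmarks, VideoSSM achieves state-of-the-art temporal consistency and motion stability among autoregressive video generators, and supports content diversity and interactive prompt-based control.

Insight: A hybrid of global and local memory is key to improving long video generation quality, and a state-space model can effectively manage dynamic information over long sequences.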

Abstract: Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
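
The split between an evolving global memory and a local context window can be illustrated with a toy data structure. This is only a sketch under strong assumptions: the global state is a scalar exponential moving average rather than a learned state-space model, frame features are single floats, and the class name `HybridMemory` is hypothetical.

```python
from collections import deque

class HybridMemory:
    """Toy hybrid memory: a recurrent global state plus a sliding window."""

    def __init__(self, decay=0.9, window=3):
        self.decay = decay
        self.state = 0.0                    # compressed summary of all frames
        self.window = deque(maxlen=window)  # most recent frames, kept exactly

    def update(self, frame_feature):
        # Linear recurrence: constant work per frame, so total cost is
        # linear in the number of frames.
        self.state = self.decay * self.state + (1.0 - self.decay) * frame_feature
        self.window.append(frame_feature)

    def context(self):
        return self.state, list(self.window)

mem = HybridMemory(decay=0.5, window=2)
for feature in [1.0, 2.0, 3.0]:
    mem.update(feature)
global_state, local_frames = mem.context()
```

The window drops the oldest frame once it is full, while the recurrent state still carries a decayed trace of every frame seen so far.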

[55] Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation

Chenlin Xu,Lei Zhang,Lituan Wang,Xinyu Pu,Pengfei Ma,Guangwu Qian,Zizhou Wang,Yan Wang

Main category: cs.CV

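TL;DR: The paper proposes BA-TTA-SAM, a framework that uses test-time adaptation (TTA) to significantly improve SAM's zero-shot medical image segmentation without any source-domain training data.

Details Motivation: In medical image segmentation, annotated data is scarce and computation is costly. Existing adaptation methods rely on task-specific training, and SAM underperforms on medical datasets because of domain shift, so its zero-shot segmentation ability needs improvement.

Contribution: Proposes BA-TTA-SAM, which significantly improves SAM's zero-shot segmentation through Gaussian prompt injection and boundary-aware attention alignment.

Method: 1) Encoder-level Gaussian prompt injection; 2) cross-layer boundary-aware attention alignment.

Result: Across four medical datasets (ISIC, Kvasir, BUSI, and REFUGE), the DICE score improves by 12.4% on average over SAM's zero-shot segmentation, and the method outperforms existing approaches.

Insight: Test-time adaptation can effectively mitigate domain shift, and incorporating boundary information can significantly improve zero-shot medical image segmentation.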

Abstract: Due to the scarcity of annotated data and the substantial computational costs of models, conventional tuning methods in medical image segmentation face critical challenges. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training on downstream tasks. Therefore, zero-shot segmentation has gained increasing attention, especially with foundation models such as SAM demonstrating promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shifts, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM. This framework integrates two key mechanisms: (1) The encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) The cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four datasets, including ISIC, Kvasir, BUSI, and REFUGE, show an average improvement of 12.4% in the DICE score compared with SAM’s zero-shot segmentation performance. The results demonstrate that our method consistently outperforms state-of-the-art models in medical image segmentation. Our framework significantly enhances the generalization ability of SAM, without requiring any source-domain training data. Extensive experiments on publicly available medical datasets strongly demonstrate the superiority of our framework. Our code is available at https://github.com/Emilychenlin/BA-TTA-SAM.
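
For readers unfamiliar with the reported metric, the DICE score measures overlap between a predicted mask and a ground-truth mask. The sketch below uses the standard definition 2|A∩B| / (|A| + |B|) on flattened binary masks; the smoothing term `eps` and the toy masks are assumptions for illustration, not values from the paper.

```python
def dice_score(pred, target, eps=1e-7):
    """DICE overlap for flattened binary masks (lists of 0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # eps keeps the ratio defined when both masks are empty.
    return (2.0 * intersection + eps) / (total + eps)

pred = [1, 1, 0, 0]    # model predicts two foreground pixels
target = [1, 0, 0, 0]  # ground truth has one foreground pixel
score = dice_score(pred, target)  # 2 * 1 / (2 + 1), roughly 0.667
```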

[56] DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Dongzhi Jiang,Renrui Zhang,Haodong Li,Zhuofan Zong,Ziyu Guo,Jun He,Claire Guo,Junyan Ye,Rongyao Fang,Weijia Li,Rui Liu,Hongsheng Li

Main category: cs.CV

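TL;DR: The paper proposes DraCo, which generates a low-resolution draft image and applies selective corrections, using visual content to improve text-to-image generation. It targets the coarse granularity of textual planning and the difficulty of generating rare attribute combinations.

Details Motivation: Existing text-to-image approaches either treat the model as a standalone generator or rely on abstract textual planning, which limits their results. DraCo aims to improve generation quality through visual draft planning and verification.

Contribution: 1) Proposes Draft-as-CoT, which uses both textual and visual content for planning and verification; 2) introduces the DraCo-240K dataset to support training; 3) designs DraCo-CFG, a guidance strategy for interleaved reasoning.

Method: 1) Generates a low-resolution draft image as a preview; 2) verifies the semantic alignment between the draft and the input prompt; 3) refines the result through selective corrections with super-resolution.

Result: DraCo improves results on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other CoT-based methods.

Insight: A visual draft gives more concrete planning guidance, and combining it with selective corrections can significantly improve generation quality, especially for rare attribute combinations.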

Abstract: Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model’s inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
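
As background for the guidance strategy mentioned above, standard classifier-free guidance extrapolates from an unconditional prediction toward a conditional one. The sketch below shows only this textbook form; how DraCo-CFG specializes it for interleaved reasoning is not described in the abstract, and the function name and toy values are assumptions.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Toy model outputs with and without the conditioning signal.
uncond = [0.0, 1.0]
cond = [1.0, 1.0]
guided = cfg_combine(uncond, cond, scale=3.0)
```

With `scale=1.0` the result equals the conditional prediction, and larger scales push further along the direction in which conditioning changes the output.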

[57] PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

Yu-Wei Zhan,Xin Wang,Hong Chen,Tongtong Feng,Wei Feng,Ren Wang,Guangyao Li,Qing Li,Wenwu Zhu

Main category: cs.CV

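TL;DR: The paper proposes PhyVLLM, a physics-guided video language model. A dual-branch encoder disentangles visual appearance from object motion, a Neural ODE module models physical dynamics, and the resulting representations enable physics reasoning in a pretrained LLM. It significantly outperforms existing Video LLMs.

Details Motivation: Existing Video LLMs perform poorly in scenarios that require a deeper understanding of physical dynamics because they rely mainly on appearance-based matching. Addressing this means handling the entanglement of motion signals with appearance variations, the need for continuous-time motion modeling, and the high cost of physical annotations.

Contribution: Proposes the PhyVLLM framework, which explicitly incorporates physical motion into Video LLMs; disentangles appearance and motion with a dual-branch encoder; introduces a Neural ODE module to model continuous physical dynamics; uses self-supervision to avoid the need for physical labels.

Method: A dual-branch encoder disentangles appearance and motion; a Neural ODE module generates differentiable representations of physical dynamics; the motion-aware representations are projected into the token space of a pretrained LLM to enable physics reasoning.

Result: Experiments show that PhyVLLM significantly outperforms existing Video LLMs on both physical reasoning and general video understanding tasks, demonstrating the benefit of explicit physical modeling.

Insight: Disentangling appearance from motion and modeling physical dynamics can significantly improve Video LLMs on physical understanding tasks without compromising their original multimodal capabilities.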

Abstract: Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model’s original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM models the continuous evolution of object motion in a self-supervised manner. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.

[58] Refaçade: Editing Object with Given Reference Texture

Youze Huang,Penghui Ruan,Bojia Zi,Xianbiao Qi,Jianan Wang,Rong Xiao

Main category: cs.CV

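TL;DR: Refaçade introduces a new Object Retexture task and achieves precise, controllable texture transfer by removing texture from the source and disrupting the global layout of the reference.

Details Motivation: Diffusion models have brought remarkable progress in image and video editing, but object retexturing remains underexplored. A straightforward ControlNet-based approach suffers from unwanted structural information from the reference and from entangled texture and structure in the source; Refaçade aims to solve these problems.

Contribution: 1. Proposes the Object Retexture task; 2. Designs a texture remover and a global layout disruption method to achieve precise, controllable texture transfer.

Method: 1. A texture remover preserves the geometry and motion of source videos while removing appearance information; 2. A jigsaw permutation disrupts the global layout of the reference object so the model focuses on local texture statistics.

Result: Experiments show that Refaçade outperforms baselines in visual quality, editing precision, and controllability.

Insight: Texture transfer requires separating structure from texture information, and disrupting the global layout helps the model attend to local detail.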

Abstract: Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaçade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at https://github.com/fishZe233/Refacade.
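
A jigsaw permutation of the kind described above can be sketched on a toy image. This is a minimal illustration assuming non-overlapping square tiles shuffled uniformly at random; the tile size, the seeding, and the function name `jigsaw_permute` are assumptions, since the abstract does not give these details.

```python
import random

def jigsaw_permute(image, patch, seed=0):
    """Shuffle non-overlapping patch x patch tiles of a 2D image (list of rows)."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # Cut the image into tiles in row-major order.
    tiles = [
        [row[c:c + patch] for row in image[r:r + patch]]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]
    random.Random(seed).shuffle(tiles)
    # Reassemble the shuffled tiles into an image of the original size.
    cols = w // patch
    out = []
    for tile_row in range(h // patch):
        for i in range(patch):
            row = []
            for tile_col in range(cols):
                row.extend(tiles[tile_row * cols + tile_col][i])
            out.append(row)
    return out

image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled = jigsaw_permute(image, patch=2, seed=0)
```

Each tile keeps its local pixel arrangement, so local texture statistics survive while the global layout of the object is scrambled.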

[59] Detection of Intoxicated Individuals from Facial Video Sequences via a Recurrent Fusion Model

Bita Baroutian,Atefe Aghaei,Mohsen Ebrahimi Moghaddam

Main category: cs.CV

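TL;DR: The paper proposes a video-based method for detecting alcohol intoxication that fuses features from a Graph Attention Network (GAT) and a 3D ResNet, achieving high classification accuracy and outperforming baselines on a new dataset.

Details Motivation: Alcohol intoxication is a major global public health and safety concern, and non-invasive, efficient detection methods are needed.

Contribution: 1. Proposes a dynamic feature fusion model combining a GAT and a 3D ResNet; 2. Introduces a new dataset of 3,542 video segments; 3. Significantly outperforms baseline methods.

Method: 1. A GAT processes facial landmarks; 2. A 3D ResNet extracts spatiotemporal features; 3. The features are dynamically fused with adaptive prioritization.

Result: The model achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming the baselines.

Insight: Dynamic feature fusion with adaptive prioritization helps improve the accuracy and robustness of alcohol intoxication detection.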

Abstract: Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model’s potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
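
To relate the reported precision and recall, the F1 score (their harmonic mean) can be computed directly. The abstract does not report F1, so the value below is derived arithmetic that assumes the two figures describe the same class and evaluation split; it is not a result from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Precision and recall as quoted in the abstract.
f1 = f1_score(0.977, 0.97)
```

Under that assumption the derived F1 is about 0.973, which sits between the two reported figures as expected for a harmonic mean.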

[60] X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Pei Yang,Hai Ci,Yiren Song,Mike Zheng Shou

Main category: cs.CV

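TL;DR: X-Humanoid uses generative video editing to turn human videos into humanoid robot videos, filling the gap in large-scale training data.

Details Motivation: Progress in VLA models and world models is limited by the scarcity of large-scale, diverse training data. "Robotizing" human videos is effective, but existing methods cannot handle complex full-body motion and occlusion.

Contribution: 1. Proposes X-Humanoid, a generative video editing approach that translates human videos into humanoid robot videos. 2. Designs a scalable data creation pipeline that produces 17+ hours of paired synthetic videos. 3. Releases a new dataset of over 3.6 million frames.

Method: Adapts the Wan 2.2 model into a video-to-video structure and finetunes it for human-to-humanoid translation. The paired synthetic training videos are generated with Unreal Engine.

Result: In quantitative analysis and user studies, X-Humanoid outperforms baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.

Insight: Synthetic data generation and the release of a large-scale dataset can substantially advance embodied AI.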

Abstract: The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to “robotize” web-scale human videos, which has been proven effective for policy training. However, these solutions mainly “overlay” robot arms onto egocentric videos and cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million “robotized” humanoid video frames. Quantitative analysis and user studies confirm our method’s superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.

[61] VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

Hongbo Jin,Qingyuan Wang,Wenhao Zhang,Yang Liu,Sijie Cheng

Main category: cs.CV

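TL;DR: VideoMem is an adaptive memory management framework that dynamically updates a global memory buffer, significantly improving performance on ultra-long video understanding tasks.

Details Motivation: Ultra-long video understanding remains an open challenge: existing vision language models (VLMs) falter because of limited context length and inefficient long-term memory retention. Prior work has tried external knowledge bases and retrieval augmented generation (RAG) systems, but these incur enormous storage and computational overhead.

Contribution: VideoMem is a new framework that models ultra-long video understanding as a sequential generation task via adaptive memory management. It also proposes the PRPO algorithm, with PSP and TCR modules, to optimize model training.

Method: VideoMem dynamically updates a global memory buffer, retaining critical information and discarding redundant content. The PRPO algorithm combines PSP (Progressive State Propagation) and TCR (Temporal Cascading Reward) to improve training efficiency and convergence speed.

Result: Experiments show that VideoMem significantly outperforms existing open-source models across diverse ultra-long video understanding benchmarks.

Insight: Combining adaptive memory management with the PRPO algorithm addresses long-term dependency and reward sparsity, offering a new approach to ultra-long video understanding.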

Abstract: Ultra long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that models ultra-long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules: Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
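
The "Grouped Relative" part of PRPO's name suggests a relation to group-relative policy optimization, where each rollout's reward is normalized against the other rollouts sampled for the same input. The sketch below shows only that generic normalization step; the PSP and TCR modules are not modeled, and the use of the population standard deviation and the `eps` value are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when all rollouts receive the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for four rollouts sampled from the same video and question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group gets the same reward, all advantages are zero and the group gives no learning signal, which is one way reward sparsity hurts sample utilization.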

[62] Counterfeit Answers: Adversarial Forgery against OCR-Free Document Visual Question Answering

Marco Pintore,Maura Pintor,Dimosthenis Karatzas,Battista Biggio

Main category: cs.CV

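TL;DR: The paper proposes a new adversarial attack on Document Visual Question Answering (DocVQA) systems that forges document content in a visually imperceptible way to induce incorrect answers from the model.

Details Motivation: Current DocVQA models perform well on document question answering, but their vulnerability to adversarial attacks has not been studied enough. The paper aims to expose this vulnerability and proposes targeted attack methods.

Contribution: The main contribution is a new adversarial attack on DocVQA that forges document content in a visually imperceptible way, with attack algorithms tailored to different attacker goals.

Method: The paper develops specialized adversarial attack algorithms and applies them to two state-of-the-art DocVQA models, Pix2Struct and Donut. The attacks cover two scenarios: targeted misinformation and systematic model failure.

Result: Experiments show that the method induces specific or generally incorrect answers from both models, revealing critical vulnerabilities in current DocVQA systems.

Insight: The study highlights how sensitive DocVQA systems are to adversarial attacks and calls for more robust defenses, especially against visually imperceptible attacks.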

Abstract: Document Visual Question Answering (DocVQA) enables end-to-end reasoning grounded on information present in a document input. While recent models have shown impressive capabilities, they remain vulnerable to adversarial attacks. In this work, we introduce a novel attack scenario that aims to forge document content in a visually imperceptible yet semantically targeted manner, allowing an adversary to induce specific or generally incorrect answers from a DocVQA model. We develop specialized attack algorithms that can produce adversarially forged documents tailored to different attackers’ goals, ranging from targeted misinformation to systematic model failure scenarios. We demonstrate the effectiveness of our approach against two end-to-end state-of-the-art models: Pix2Struct, a vision-language transformer that jointly processes image and text through sequence-to-sequence modeling, and Donut, a transformer-based model that directly extracts text and answers questions from document images. Our findings highlight critical vulnerabilities in current DocVQA systems and call for the development of more robust defenses.

[63] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Zefeng Zhang,Xiangzhao Hao,Hengzhu Tang,Zhenyu Zhang,Jiawei Sheng,Xiaodong Li,Zhenyang Li,Li Gao,Daiting Shi,Dawei Yin,Tingwen Liu

Main category: cs.CV

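TL;DR: COOPER is a unified multimodal large language model that uses depth and segmentation as auxiliary modalities. It is trained in two stages to acquire auxiliary modality generation and adaptive interleaved reasoning, significantly improving spatial reasoning.

Details Motivation: Current multimodal large language models struggle with 3D-aware reasoning. Existing methods usually enhance perception or reasoning in isolation, so spatial intelligence is not optimized in a unified way.

Contribution: Proposes COOPER, a unified multimodal large language model that strengthens spatial perception and reasoning through auxiliary modality generation and adaptive interleaved reasoning.

Method: COOPER uses depth and segmentation as auxiliary modalities and is trained in two stages, learning auxiliary modality generation and adaptive interleaved reasoning.

Result: COOPER improves spatial reasoning by 6.91% on average while maintaining general performance. A variant trained only for auxiliary modality generation gains 7.92% on distance and size estimation.

Insight: Learning to generate auxiliary modalities helps the model internalize spatial knowledge and thereby strengthens spatial understanding.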

Abstract: Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.

[64] Prompt2Craft: Generating Functional Craft Assemblies with LLMs

Vitor Hideyo Isume,Takuya Kiyokawa,Natsuki Yamanobe,Yukiyasu Domae,Weiwei Wan,Kensuke Harada

Main category: cs.CV

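TL;DR: The paper presents Prompt2Craft, an LLM-based method for generating functional craft assemblies that uses visual input and template optimization to assemble a target object.

Details Motivation: Inspired by traditional handmade crafts, the goal is to assemble a target object from available objects that do not directly correspond to the target's parts.

Contribution: Formally introduces the Craft Assembly Task and proposes a method combining visual segmentation, template retrieval, and optimization that simplifies modeling of the target object.

Method: A mask segmentation network identifies the target's parts; template meshes are retrieved and their poses optimized; parts are simplified to primitive shapes; and a search algorithm matches them to objects in the scene.

Result: The method performs comparably to the baselines in two scenes, and qualitative results are shown in a real-world setting.

Insight: Visual input combined with template optimization can effectively support assembly of complex objects, suggesting a new direction for robotic manipulation.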

Abstract: Inspired by traditional handmade crafts, where a person improvises assemblies based on the available objects, we formally introduce the Craft Assembly Task. It is a robotic assembly task that involves building an accurate representation of a given target object using the available objects, which do not directly correspond to its parts. In this work, we focus on selecting the subset of available objects for the final craft, when the given input is an RGB image of the target in the wild. We use a mask segmentation neural network to identify visible parts, followed by retrieving labeled template meshes. These meshes undergo pose optimization to determine the most suitable template. Then, we propose to simplify the parts of the transformed template mesh to primitive shapes like cuboids or cylinders. Finally, we design a search algorithm to find correspondences in the scene based on local and global proportions. We develop baselines for comparison that consider all possible combinations, and choose the highest scoring combination for common metrics used in foreground maps and mask accuracy. Our approach achieves comparable results to the baselines for two different scenes, and we show qualitative results for an implementation in a real-world scenario.

[65] TARDis: Time Attenuated Representation Disentanglement for Incomplete Multi-Modal Tumor Segmentation and Classification

Zishuo Wan,Qinqin Kang,Yi Huang,Yun Bian,Dawei Ding,Ke Yan

Main category: cs.CV

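TL;DR: TARDis is a physics-aware framework for multi-modal tumor segmentation and classification. It recasts missing modalities as missing sample points on a time-attenuation curve and explicitly disentangles the feature space into static anatomy and dynamic perfusion components, substantially improving performance with incomplete modalities.

Details Motivation: Tumor segmentation and diagnosis in multi-phase CT depend on the physiological dynamics of contrast agents, but complete multi-phase scans are often infeasible because of radiation or scanning constraints. Existing methods treat missing modalities as absent independent channels and ignore the temporal continuity of hemodynamics.

Contribution: 1) Proposes the TARDis framework, which models missing modalities as missing points on a time-attenuation curve; 2) explicitly disentangles features into time-invariant anatomy and time-dependent perfusion; 3) generates missing hemodynamic features through a dual-path architecture (a quantization path and a probabilistic path).

Method: 1) The quantization path uses a learnable embedding dictionary to extract anatomical structure; 2) the probabilistic path uses a conditional variational autoencoder to model dynamic enhancement; 3) missing features are generated by sampling from the learned latent distribution.

Result: On a private abdominal CT dataset (2,282 cases) and two public datasets, TARDis significantly outperforms existing methods and maintains robust diagnostic performance even with extremely sparse data.

Insight: Explicitly modeling temporal dynamics and disentangling anatomy effectively addresses the missing-modality problem while reducing the risk of radiation exposure.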

Abstract: Tumor segmentation and diagnosis in contrast-enhanced Computed Tomography (CT) rely heavily on the physiological dynamics of contrast agents. However, obtaining a complete multi-phase series is often clinically unfeasible due to radiation concerns or scanning limitations, leading to the “missing modality” problem. Existing deep learning approaches typically treat missing phases as absent independent channels, ignoring the inherent temporal continuity of hemodynamics. In this work, we propose Time Attenuated Representation Disentanglement (TARDis), a novel physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. TARDis explicitly disentangles the latent feature space into a time-invariant static component (anatomy) and a time-dependent dynamic component (perfusion). We achieve this via a dual-path architecture: a quantization-based path using a learnable embedding dictionary to extract consistent anatomical structures, and a probabilistic path using a Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time. This design allows the network to hallucinate missing hemodynamic features by sampling from the learned latent distribution. Extensive experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets demonstrate that TARDis significantly outperforms state-of-the-art incomplete modality frameworks. Notably, our method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.

[66] SAM3-I: Segment Anything with Instructions

Jingjing Li,Yue Feng,Yuchen Guo,Jincai Huang,Yongri Piao,Qi Bi,Miao Zhang,Xiaoqi Zhao,Qiang Chen,Shihao Zou,Wei Ji,Huchuan Lu,Li Cheng

Main category: cs.CV

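TL;DR: SAM3-I is an enhanced framework built on Segment Anything Model 3 (SAM3). It introduces an instruction-aware cascaded adaptation mechanism that aligns natural-language instructions directly with vision-language representations, enabling precise segmentation without external agents.

Details Motivation: SAM3 currently relies on external multi-modal agents to convert complex instructions into noun phrases (NPs), but these concepts are too coarse to segment a specific instance precisely. SAM3-I aims to unify concept-level understanding with instruction-level reasoning and improve segmentation precision.

Contribution: 1. Proposes an instruction-aware cascaded adaptation mechanism that aligns instruction semantics with SAM3's vision-language representations; 2. designs a structured instruction taxonomy (concept, simple, and complex levels); 3. builds a dataset of diverse instruction-mask pairs; 4. open-sources the framework and provides fine-tuning workflows.

Method: SAM3-I progressively aligns instruction semantics with SAM3's vision-language representations to support instruction-driven segmentation directly. It also designs a structured instruction taxonomy and a data engine to generate training data.

Result: Experiments show that SAM3-I follows natural-language instructions effectively while retaining its original concept-driven segmentation ability.

Insight: SAM3-I's instruction-aware mechanism shows how high-level semantics can be combined with low-level visual representations, pointing to a new research direction for open-vocabulary segmentation.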

Abstract: Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3’s existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.

[67] When Robots Should Say “I Don’t Know”: Benchmarking Abstention in Embodied Question Answering

Tao Wu,Chuhao Zhou,Guangyu Zhao,Haozhi Cao,Yewen Pu,Jianfei Yang

Main category: cs.CV

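TL;DR: The paper raises the question of when a robot should say "I don't know" in Embodied Question Answering (EQA). It builds a new dataset, AbstainEQA, covering five representative categories of uncertainty, and exposes the limitations of existing models.

Details Motivation: Existing EQA benchmarks assume every question must be answered, but in practice a robot should recognize when information is insufficient. An analysis of human queries finds that 32.4% contain missing or underspecified context, underscoring the need to study abstention.

Contribution: 1. Establishes the importance of abstention in EQA; 2. drawing on cognitive theories, derives five categories that require abstention; 3. builds the AbstainEQA dataset with 1,636 annotated cases; 4. evaluates frontier models and exposes their limitations.

Method: 1. Analyzes 500 human queries and finds that 32.4% have missing information; 2. defines five abstention categories based on cognitive theory; 3. transforms well-posed OpenEQA questions into ambiguous ones to build AbstainEQA; 4. evaluates models on the dataset.

Result: The best frontier model reaches only 42.79% abstention recall, far below the 91.17% achieved by humans. Scaling, prompting, and reasoning yield only marginal gains, and fine-tuned models overfit to textual cues.

Insight: Abstention is a foundation for reliable robot interaction and a prerequisite for later clarification. The models' limitations suggest that scale or prompting alone will not solve the problem and that more fundamental approaches are needed.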

Abstract: Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
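
For concreteness, abstention recall can be read as the fraction of abstention-required cases on which the agent actually withholds an answer. The sketch below is a minimal illustration under that assumption; the function name `abstention_recall` and the boolean encoding of labels are hypothetical and are not taken from the AbstainEQA release.

```python
def abstention_recall(should_abstain, did_abstain):
    """Fraction of cases requiring abstention where the model abstained.

    Both arguments are equal-length lists of booleans (hypothetical encoding).
    """
    if len(should_abstain) != len(did_abstain):
        raise ValueError("label and prediction lists must have equal length")
    # Keep only the predictions for cases where abstention was required.
    required = [pred for gold, pred in zip(should_abstain, did_abstain) if gold]
    if not required:
        raise ValueError("no abstention-required cases in the evaluation set")
    return sum(required) / len(required)


# Toy data: 4 cases require abstention and the model abstains on 3 of them.
gold = [True, True, True, True, False, False]
pred = [True, True, True, False, False, True]
recall = abstention_recall(gold, pred)
```

Pairing each abstention case with its original well-posed question, as the dataset does, also lets one check that a model does not simply abstain everywhere; recall alone would not reveal that.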

[68] Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot

Sheng Hang,Chaoxiang He,Hongsheng Hu,Hanqing Hu,Bin Benjamin Zhu,Shi-Feng Sun,Dawu Gu,Shuo Wang

Main category: cs.CV

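TL;DR: The paper proposes a zero-shot vision-language segmentation fusion method that, in a single pass, detects malicious image content, identifies the critical elements, and localizes them. It combines a foundation segmentation model (SAM) with a vision-language model and improves robustness through an ensemble of segmenters.

Details Motivation: Existing content moderation methods usually provide only image-level NSFW flags and lack fine-grained, localized information about malicious elements. The authors therefore develop a single-pass solution that detects, identifies, and localizes malicious content at once.

Contribution: Proposes a zero-shot pipeline that performs malicious-image detection, element identification, and localization in one pass; designs a method that fuses segmentation with a vision-language model; and hardens it against adversarial attacks by ensembling multiple segmenters.

Method: A foundation segmentation model (SAM) first generates candidate object masks, which are refined into independent regions; a vision-language model then scores each region for malicious relevance; finally, a weighted fusion step produces a consolidated malicious object map.

Result: On a 790-image dataset, the method reaches 85.8% element-level recall and 78.1% precision; under adversarial attack, performance drops by no more than 10%.

Insight: The method is efficient (seconds per image) and plugs into existing vision-language model workflows, offering a practical tool for fine-grained, explainable malicious-image moderation.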

Abstract: Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects if an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks - all in one pass. The system first applies a foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly-annotated 790-image dataset spanning drug, sexual, violent and extremist content, our method attains 85.8% element-level recall, 78.1% precision and a 92.1% segment-success rate - exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and VLM, our method’s precision and recall decrease by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
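
The abstract says that region scores "weight a fusion step" but does not give the rule. As one possible reading, the sketch below takes the per-pixel maximum of score-weighted binary masks and then thresholds the result. The function name `fuse_masks`, the max rule, and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
def fuse_masks(masks, scores, threshold=0.5):
    """Fuse binary region masks into one map, weighting each by its score.

    masks: list of H x W nested lists with 0/1 entries, one per region.
    scores: list of relevance scores in [0, 1], one per region.
    Returns (heatmap, binary_map). The max-based rule is an assumption.
    """
    if not masks or len(masks) != len(scores):
        raise ValueError("need one score per mask and at least one mask")
    height, width = len(masks[0]), len(masks[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for mask, score in zip(masks, scores):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    # A pixel keeps the highest score of any region covering it.
                    heatmap[y][x] = max(heatmap[y][x], score)
    binary = [[1 if value >= threshold else 0 for value in row] for row in heatmap]
    return heatmap, binary


# Two overlapping toy regions on a 2 x 3 image.
region_a = [[1, 1, 0], [0, 0, 0]]  # scored as highly relevant
region_b = [[0, 1, 1], [0, 0, 1]]  # scored as mostly benign
heatmap, binary = fuse_masks([region_a, region_b], [0.9, 0.2])
```

Under the same assumption, an ensemble over several segmenters could be handled by passing the masks from all segmenters into a single call.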

[69] Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

Tianyu Yuan,Yuanbo Yang,Lin-Zhuo Chen,Yao Yao,Zhuzhong Qian

Main category: cs.CV

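TL;DR: HeFT exploits the visual priors of pretrained video diffusion models. By analyzing the internal representations of VDiT, it proposes a zero-shot point tracking framework with a head- and frequency-aware feature selection strategy that markedly improves tracking performance.

Details Motivation: Video diffusion models show promise for vision tasks, but the properties of their internal representations are not well understood. The paper explores how they encode spatiotemporal information and uses these priors to make tracking more robust.

Contribution: 1. Reveals the functional specialization of attention heads in VDiT; 2. proposes a head- and frequency-aware feature selection strategy based on low-frequency components; 3. designs HeFT, a zero-shot tracking framework that combines single-step denoising with soft-argmax localization.

Method: 1. Analyzes the functions and frequency characteristics of VDiT attention heads; 2. extracts features through single-step denoising; 3. combines head-frequency selection with forward-backward consistency checks for robust tracking.

Result: On the TAP-Vid benchmarks, HeFT achieves SOTA zero-shot tracking performance, approaching the accuracy of supervised methods.

Insight: Low-frequency components of video diffusion features carry stable spatiotemporal information, while high-frequency components tend to introduce noise, a finding that suggests new directions for designing visual foundation models.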

Abstract: In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
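
Soft-argmax localization, named in the abstract, replaces a hard argmax over a similarity map with the expected coordinate under a softmax, which gives sub-pixel estimates. The sketch below shows the generic technique in plain Python; the function name, the temperature parameter, and the toy similarity maps are assumptions for illustration and do not reproduce HeFT's feature extraction.

```python
import math


def soft_argmax_2d(similarity, temperature=1.0):
    """Return the expected (row, col) under a softmax over a 2D similarity map."""
    width = len(similarity[0])
    flat = [value / temperature for row in similarity for value in row]
    peak = max(flat)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in flat]
    total = sum(weights)
    # Expected row and column indices under the softmax distribution.
    row = sum(w * (i // width) for i, w in enumerate(weights)) / total
    col = sum(w * (i % width) for i, w in enumerate(weights)) / total
    return row, col


# A sharp peak at (1, 2): a low temperature makes the estimate approach argmax.
peaked = [[0.0, 0.0, 0.0], [0.0, 0.0, 10.0], [0.0, 0.0, 0.0]]
peak_row, peak_col = soft_argmax_2d(peaked, temperature=0.1)

# A flat map carries no preference, so the estimate is the grid centre.
flat_row, flat_col = soft_argmax_2d([[1.0] * 3 for _ in range(3)])
```

A forward-backward consistency check would then track the estimated point back to the source frame and discard it if it lands too far from where it started.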

[70] I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models

Juntong Wang,Jiarui Wang,Huiyu Duan,Jiaxiang Kang,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

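TL;DR: The paper presents I2I-Bench, a comprehensive benchmark suite for image-to-image editing models that covers diverse tasks and evaluation dimensions and uses automated hybrid evaluation.

Details Motivation: Existing benchmarks for image editing models suffer from limited task scope, insufficient evaluation dimensions, and heavy reliance on manual annotation, which limits their scalability and practicality.

Contribution: I2I-Bench offers a diverse task set (10 task categories), comprehensive evaluation dimensions (30 dimensions), and automated hybrid evaluation methods, and validates that its results are consistent with human preferences.

Method: An automated hybrid evaluation that combines specialized tools with large multimodal models (LMMs) scores 30 fine-grained dimensions.

Result: I2I-Bench is used to benchmark multiple mainstream image editing models, revealing gaps and trade-offs across dimensions.

Insight: Automated hybrid evaluation can efficiently replace manual annotation, improving scalability and practicality.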

Abstract: Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.

[71] Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Yubo Huang,Hailong Guo,Fangtai Wu,Shifeng Zhang,Shijie Huang,Qijun Gan,Lin Liu,Sirui Zhao,Enhong Chen,Jiaming Liu,Steven Hoi

Main category: cs.CV

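TL;DR: Live Avatar is an algorithm-system co-designed framework that uses a 14B-parameter diffusion model for efficient, high-fidelity, infinite-length real-time avatar generation, overcoming the computational limits of existing methods.

Details Motivation: Existing diffusion-based video generation methods are constrained by sequential computation and long-sequence inconsistency, and struggle to meet the demands of real-time audio-driven avatar synthesis.

Contribution: 1) Proposes TPP, a distributed inference paradigm that breaks the autoregressive bottleneck with a multi-GPU pipeline; 2) introduces RSFM to improve temporal consistency; 3) uses Self-Forcing Distribution Matching Distillation to adapt large models for streaming.

Method: Parallelizes diffusion inference with TPP, dynamically recalibrates appearance with RSFM, and applies distillation to optimize streaming generation.

Result: Reaches 20 FPS end-to-end generation on 5 H800 GPUs and is reported as the first to achieve industrial-grade, real-time, high-fidelity avatar generation.

Insight: Through algorithm-system co-design, large-scale diffusion models can be deployed efficiently for long-form video synthesis, opening up a new application paradigm.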

Abstract: Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.

[72] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Yunhong Lu,Yanhong Zeng,Haobo Li,Hao Ouyang,Qiuyu Wang,Ka Leong Cheng,Jiapeng Zhu,Hengyuan Cao,Zhipeng Zhang,Xing Zhu,Yujun Shen,Min Zhang

Main category: cs.CV

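TL;DR: The paper proposes the Reward Forcing framework, which uses EMA-Sink and Re-DMD to address over-reliance on initial frames and weak motion dynamics in existing video generation models, enabling efficient streaming video generation.

Details Motivation: Under sliding-window attention, existing video generation models depend too heavily on initial frames as fixed tokens, which weakens motion dynamics and long-horizon consistency. The paper sets out to fix this.

Contribution: 1. Proposes EMA-Sink, which updates tokens via exponential moving average to balance long-term context and recent dynamics; 2. proposes Rewarded Distribution Matching Distillation (Re-DMD), which uses a reward mechanism to prioritize learning dynamic content.

Method: 1. EMA-Sink: fixed-size tokens are updated via exponential moving average, avoiding initial-frame copying; 2. Re-DMD: a reward mechanism based on a vision-language model improves generation of dynamic content.

Result: Reward Forcing achieves SOTA performance on standard benchmarks and high-quality streaming video generation at 23.1 FPS on a single H100 GPU.

Insight: Prioritizing dynamic content during learning and balancing long-term context are key to efficient video generation.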

Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model’s ability to prioritize dynamic content. Instead, Re-DMD biases the model’s output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
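
The EMA-Sink idea described above can be illustrated with a plain exponential moving average: a fixed-size sink token starts from the initial frame and absorbs each token evicted from the sliding window. The sketch below is only a toy version under stated assumptions; the function name `ema_sink_update`, the decay value, and the one-to-one pairing of sink and evicted tokens are hypothetical, since the abstract does not specify them.

```python
def ema_sink_update(sink, evicted, decay=0.9):
    """Fuse an evicted token into a sink token with an exponential moving average.

    sink, evicted: equal-length lists of floats (toy token embeddings).
    decay: weight kept on the existing sink state (assumed value).
    """
    if len(sink) != len(evicted):
        raise ValueError("sink and evicted tokens must have the same width")
    return [decay * s + (1.0 - decay) * e for s, e in zip(sink, evicted)]


# The sink starts as the initial-frame token and drifts toward recent content
# as tokens leave the window, instead of staying a frozen copy of frame 0.
sink = [1.0, 0.0]
for evicted in ([0.0, 1.0], [0.0, 1.0]):
    sink = ema_sink_update(sink, evicted, decay=0.5)
```

Because the sink keeps a fixed size, the update adds no extra tokens to the attention window, which is consistent with the abstract's claim of no additional computation cost.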

[73] Towards Cross-View Point Correspondence in Vision-Language Models

Yipu Wang,Yuheng Ji,Yuyang Liu,Enshen Zhou,Ziqiang Yang,Yuxuan Tian,Ziheng Qin,Yue Liu,Huajie Tan,Cheng Chi,Zhiyuan Ma,Daniel Dajun Zeng,Xiaolong Zheng

Main category: cs.CV

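TL;DR: The paper proposes the Cross-View Point Correspondence (CVPC) task and the CrossPoint-Bench benchmark, shows that current VLMs fall short on precise point-level correspondence, and introduces the CrossPoint-378K dataset and the CroPond model, which substantially improve performance.

Details Motivation: Existing vision-language models (VLMs) are clearly deficient in cross-view point-level correspondence, which limits their use in spatial understanding and embodied AI. The paper aims to address this and advance VLMs on fine-grained correspondence tasks.

Contribution: 1. Proposes the CVPC task and the CrossPoint-Bench benchmark; 2. builds the large-scale CrossPoint-378K dataset; 3. proposes the CroPond model, which substantially improves performance.

Method: The paper builds CrossPoint-378K, a dataset of 378K question-answer pairs, and designs the CroPond model, drawing on the human cognitive process of perceiving, reasoning, and corresponding to achieve cross-view point-level correspondence.

Result: Experiments show that CroPond surpasses Gemini-2.5-Pro by 39.7% on CrossPoint-Bench, demonstrating a clear advantage on cross-view correspondence.

Insight: The paper exposes the limitations of VLMs on point-level correspondence and offers a data-driven approach as a direction for future research.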

Abstract: Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. We therefore propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with hierarchical design, inspired by the human cognitive process of “perceive”, “reason”, and “correspond”. Our evaluation shows that state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond, a model trained on the CrossPoint-378K dataset. CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.

[74] Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

Yigui Feng,Qinglin Wang,Haotian Mo,Yang Liu,Ke Liu,Gencheng Liu,Xinhai Chen,Siqi Shen,Songzhu Mei,Jie Liu

Main category: cs.CV

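TL;DR: The paper proposes a complete ecosystem addressing two challenges that vision-language models face in generative psychological analysis: a new model (MIND), a dataset (ConvoInsight-DB), and an evaluation metric (PRISM). MIND achieves visual disentanglement through a hierarchical visual encoder and a Status Judgment module, markedly improving performance.

Details Motivation: Generative psychological analysis of natural conversations faces two problems: vision-language models cannot resolve articulatory-affective ambiguity, and there are no verifiable evaluation metrics for visual grounding and reasoning depth.

Contribution: 1. Proposes MIND, a multilevel disentanglement network whose Status Judgment module suppresses ambiguous lip features. 2. Builds ConvoInsight-DB, a large-scale annotated dataset. 3. Designs PRISM, an automated evaluation framework that uses an expert-guided LLM to measure model performance.

Method: MIND uses a hierarchical visual encoder and a Status Judgment module that suppresses ambiguous lip features based on their temporal feature variance, achieving visual disentanglement.

Result: On the PRISM benchmark, MIND significantly outperforms existing methods, with an 86.95% gain in micro-expression detection.

Insight: The Status Judgment module is the key to the performance gain, and visual disentanglement is essential for accurate psychological analysis.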

Abstract: Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been released.

[75] E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

Yihong Tang,Haicheng Liao,Tong Nie,Junlin He,Ao Qu,Kehua Chen,Wei Ma,Zhenning Li,Lijun Sun,Chengzhong Xu

Main category: cs.CV

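TL;DR: The paper proposes E3AD, a framework that combines emotion awareness with a vision-language-action model for end-to-end autonomous driving, improving passenger comfort and human-centric behavior.

Details Motivation: Existing end-to-end autonomous driving systems ignore the passenger's emotional state, although emotion is central to comfort and acceptance; the paper therefore proposes an improved, emotion-aware framework.

Contribution: 1. Introduces the concept of open-domain end-to-end autonomous driving; 2. proposes the E3AD framework, combining a VAD emotion model with a dual-pathway spatial reasoning module; 3. designs a consistency-oriented training scheme that keeps emotional intent and driving behavior aligned.

Method: 1. A VAD emotion model captures emotional features in language; 2. a dual-pathway spatial reasoning module fuses egocentric and allocentric views; 3. consistency training combines modality pretraining with preference-based alignment.

Result: On real-world datasets, E3AD improves visual grounding and waypoint planning and achieves SOTA VAD correlation for emotion estimation; injecting emotion also improves human-centric feedback.

Insight: Emotion awareness is key to more human-centric autonomous driving, and multimodal fusion with consistency training can markedly improve system behavior and passenger experience.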

Abstract: End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger’s emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.

[76] Order Matters: 3D Shape Generation from Sequential VR Sketches

Yizi Chen,Sidi Wu,Tianyi Xiao,Nina Wiedemann,Loic Landrieu

Main category: cs.CV

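TL;DR: The paper proposes VRSketch2Shape, the first framework to account for the temporal order of VR sketch strokes when generating high-fidelity 3D shapes from sequential strokes, and provides a dataset of synthetic and hand-drawn sketches.

Details Motivation: Conventional sketch-to-shape models ignore stroke order, losing key information about structure and design intent. VR sketching offers a more intuitive way to design in 3D, so a method that exploits temporal information is needed.

Contribution: 1. An automated pipeline for generating sequential VR sketches; 2. a dataset of over 20k synthetic and 900 hand-drawn sketches; 3. an order-aware sketch encoder and a diffusion-based 3D generator.

Method: Combines an order-aware sketch encoder with a diffusion model to generate 3D shapes, using stroke order to improve geometric fidelity.

Result: The method achieves higher geometric fidelity than prior work, generalizes from synthetic to real sketches with minimal supervision, and works even on partial sketches.

Insight: Temporal information is crucial for 3D shape generation, and diffusion models show promise for sketch-to-shape tasks.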

Abstract: VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at https://chenyizi086.github.io/VRSketch2Shape_website.

[77] PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Bowen Ping,Chengyou Jia,Minnan Luo,Changliang Xia,Xin Shen,Zhuohang Dang,Hangwei Qian

Main category: cs.CV

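TL;DR: PaCo-RL is a reinforcement learning framework for consistent image generation that combines a consistency reward model with an efficient RL algorithm, markedly improving the visual consistency of generated images.

Details Motivation: Supervised approaches to consistent image generation struggle with the lack of suitable datasets and the complexity of human perceptual preferences, whereas reinforcement learning can learn complex, subjective visual criteria in a data-free manner; this motivates the PaCo-RL framework.

Contribution: 1. Proposes PaCo-Reward, a pairwise consistency evaluator trained on large-scale data built by automated sub-figure pairing. 2. Proposes PaCo-GRPO, an efficient RL algorithm that lowers training cost through a resolution-decoupled optimization strategy and a log-tamed multi-reward aggregation mechanism.

Method: 1. PaCo-Reward evaluates consistency through a generative autoregressive scoring mechanism, enhanced by task-aware instructions and CoT reasoning. 2. PaCo-GRPO uses a resolution-decoupled optimization strategy and a log-tamed multi-reward aggregation mechanism to improve training efficiency and stability.

Result: Experiments show that PaCo-Reward significantly improves alignment with human perception of visual consistency, and PaCo-GRPO reaches SOTA consistency with better training efficiency.

Insight: PaCo-RL demonstrates the potential of reinforcement learning for consistent image generation, addressing dataset scarcity and the difficulty of modeling subjective criteria.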

Abstract: Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at https://x-gengroup.github.io/HomePage_PaCo-RL/.

[78] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Xin He,Longhui Wei,Jianbo Ouyang,Lingxi Xie,Qi Tian

Main category: cs.CV

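TL;DR: EMMA is an efficient, unified architecture for multimodal understanding, generation, and editing. Using an efficient autoencoder, channel-wise concatenation, a shared-and-decoupled network, and a mixture-of-experts mechanism, it outperforms existing methods in both performance and efficiency.

Details Motivation: Existing unified multimodal architectures suffer from low efficiency and limited performance on understanding, generation, and editing tasks; EMMA aims to address this through new design choices and efficient compression.

Contribution: 1) An efficient autoencoder with a 32x compression ratio; 2) channel-wise concatenation to reduce visual tokens; 3) a shared-and-decoupled network that lets tasks improve one another; 4) a mixture-of-experts mechanism in the visual understanding encoder.

Method: Combines efficient compression, channel-wise concatenation, a shared-and-decoupled network, and a mixture-of-experts mechanism for unified, efficient multimodal processing.

Result: EMMA-4B significantly outperforms existing unified methods such as BAGEL-7B in efficiency and performance, and is competitive with specialist models such as Qwen3-VL and Qwen-Image.

Insight: EMMA lays a foundation for future unified multimodal architectures and shows the potential of efficient compression and shared task design.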

Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
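
The difference between token-wise and channel-wise concatenation is easiest to see in the resulting shapes: the former lengthens the sequence, while the latter keeps the token count and widens each token. The sketch below uses toy nested lists; the function names and the assumption that both streams hold the same number of tokens are illustrative and are not taken from the EMMA implementation.

```python
def token_wise_concat(a, b):
    """Append sequence b after sequence a: N + M tokens, each of width C."""
    return a + b


def channel_wise_concat(a, b):
    """Join features token by token: N tokens, each of width C_a + C_b."""
    if len(a) != len(b):
        raise ValueError("channel-wise concatenation needs equal token counts")
    return [token_a + token_b for token_a, token_b in zip(a, b)]


# Two toy streams of 2 tokens with 2 channels each (hypothetical values).
understanding = [[1.0, 2.0], [3.0, 4.0]]
generation = [[5.0, 6.0], [7.0, 8.0]]

by_token = token_wise_concat(understanding, generation)      # 4 tokens x 2 channels
by_channel = channel_wise_concat(understanding, generation)  # 2 tokens x 4 channels
```

Since self-attention cost grows with the number of tokens, halving the token count in this way is one plausible source of the efficiency gain the abstract attributes to this design.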

[79] RobustSplat++: Decoupling Densification, Dynamics, and Illumination for In-the-Wild 3DGS

Chuanyu Fu,Guanying Chen,Yuqi Zhang,Kunbin Yao,Yuan Xiong,Chuan Huang,Shuguang Cui,Yasuyuki Matsushita,Xiaochun Cao

Main category: cs.CV

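TL;DR: RobustSplat++ is an improved 3D Gaussian Splatting (3DGS) method that decouples Gaussian densification from dynamic objects and illumination changes, making rendering more robust in complex in-the-wild scenes.

Details Motivation: Existing 3DGS methods tend to produce rendering artifacts in scenes with dynamic objects and illumination changes, mainly because Gaussian densification unintentionally models transient disturbances and illumination variations.

Contribution: 1. A delayed Gaussian growth strategy that prioritizes optimizing static scene structure; 2. a scale-cascaded mask bootstrapping approach that refines mask prediction progressively from low to high resolution; 3. integration with appearance modeling to handle transient objects and illumination changes in complex scenes.

Method: Combines three designs (delayed Gaussian densification, scale-cascaded mask bootstrapping, and appearance modeling) to improve 3DGS in complex scenes.

Result: Experiments on multiple challenging datasets show that the method outperforms existing methods, demonstrating its robustness and effectiveness.

Insight: Decoupling the modeling of static structure from transient disturbances can markedly improve 3DGS rendering quality in complex scenes while avoiding overfitting to transients.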

Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling in-the-wild scenes affected by transient objects and illuminations, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances and illumination variations. To address this, we propose RobustSplat++, a robust solution based on several critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Third, we incorporate the delayed Gaussian growth strategy and mask bootstrapping with appearance modeling to handle in-the-wild scenes including transients and illuminations. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method.

[80] FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis

Shijie Chen,Peixi Peng

Main category: cs.CV

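TL;DR: FreeGen is a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis that addresses the shortcomings of existing methods in interpolation consistency and extrapolation realism.

Details Motivation: Closed-loop simulation and large-scale pre-training for autonomous driving require synthesizing free-viewpoint driving scenes, but existing datasets and generative pipelines rarely provide consistent off-trajectory observations, which limits the scale of evaluation and training.

Contribution: A co-training framework that distills generative priors into the reconstruction model while using the refined geometry to guide the generation model, balancing interpolation consistency and extrapolation realism.

Method: FreeGen co-trains a reconstruction model and a generation model. The former provides stable geometric representations to ensure interpolation consistency; the latter performs geometry-aware enhancement to improve realism at unseen viewpoints.

Result: Experiments show that FreeGen achieves state-of-the-art performance on free-viewpoint driving scene synthesis.

Insight: Co-training exploits the complementarity of reconstruction and generation models, easing the tension between geometric consistency and realism.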

Abstract: Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.

[81] A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World

Jikang Cheng,Renye Yan,Zhiyuan Yan,Yaozhong Gan,Xueyi Zhang,Zhongyuan Wang,Wei Peng,Ling Liang

Main category: cs.CV

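TL;DR: The paper proposes a new research paradigm, MID-FFD, focused on multi-domain real-world face forgery detection, and uses the DevDet framework to address the domain-dominant issue and improve the accuracy of independent single-frame detection.

Details Motivation: With limited training data, existing deepfake detectors struggle to cover the diversity of real-world deepfakes, and so have difficulty making accurate real/fake judgments in domain-unspecified conditions.

Contribution: 1. Defines the Multi-In-Domain Face Forgery Detection (MID-FFD) research paradigm; 2. develops the DevDet framework, comprising FFDev and the DAFT strategy, to amplify real/fake differences; 3. demonstrates superior performance under the MID-FFD scenario.

Method: 1. Defines the MID-FFD paradigm, which introduces large-scale multi-domain training data; 2. designs the DevDet framework, consisting of FFDev (Face Forgery Developer) and DAFT (Dose-Adaptive detector Fine-Tuning), to make real/fake differences dominant in the feature space.

Result: Experiments show that DevDet markedly improves real/fake discrimination under the MID-FFD scenario while maintaining generalization to unseen data.

Insight: Introducing multi-domain training data is key to real-world deepfake detection; optimizing the feature space can effectively address the domain-dominant issue and improve detection performance.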

Abstract: Existing methods for deepfake detection aim to develop generalizable detectors. Although “generalizable” is the ultimate target once and for all, with limited training forgeries and domains, it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications. However, within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, despite detectors being able to relatively separate real and fake within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC). In this paper, we first define a new research paradigm named Multi-In-Domain Face Forgery Detection (MID-FFD), which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments to the domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a model-agnostic framework termed DevDet (Developer for Detector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in predicting real-fake under the MID-FFD scenario while maintaining original generalization ability to unseen data.

[82] Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

Ziran Qin,Youru Lv,Mingbao Lin,Zeren Zhang,Chanfan Gan,Tieyuan Chen,Weiyao Lin

Main category: cs.CV

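TL;DR: LineAR is a training-free, progressive KV cache compression method that substantially reduces the memory footprint of autoregressive image generation and raises throughput while maintaining or improving generation quality.

Details Motivation: Existing autoregressive image generation methods must cache all previously generated visual tokens, which leads to high memory requirements and low throughput, so an efficient cache management method is needed.

Contribution: Proposes LineAR, which achieves memory savings and throughput gains through line-level KV cache management and progressive token eviction while preserving generation quality.

Method: LineAR exploits the intrinsic characteristics of visual attention, manages the cache using a 2D view, and progressively evicts tokens that are not useful for subsequent generation.

Result: Validated on multiple models, with substantially lower memory requirements (up to 67.61%), faster generation (up to 7.57x), and improved generation quality (e.g., FID).

Insight: Visual dependency regions can be managed efficiently; evicting redundant tokens has limited impact on generation quality while greatly improving resource usage.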

Abstract: Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient autoregressive (AR) image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.

[83] Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing

Maria-Paola Forte,Nikos Athanasiou,Giulia Ballardini,Jan Ulrich Bartels,Katherine J. Kuchenbecker,Michael J. Black

Main category: cs.CV

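TL;DR: The paper proposes BioTUCH, a framework that combines visual pose estimators with bioimpedance sensing to capture 3D human pose more accurately, particularly during self-contact. Contact-aware pose optimization improves reconstruction accuracy by 11.7% on average.

Details Motivation: Existing video-based pose estimation methods perform poorly in self-contact scenarios (such as a hand touching the face), whereas wearable bioimpedance sensing can provide ground-truth skin-to-skin contact data. Combining the strengths of both is therefore a natural choice.

Contribution: 1. Proposes the BioTUCH framework, which combines visual pose estimation with bioimpedance sensing to refine 3D pose; 2. introduces contact-aware pose optimization; 3. releases a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture; 4. designs a miniature wearable bioimpedance sensor that facilitates large-scale data collection.

Method: 1. Initializes the pose with an off-the-shelf pose estimator; 2. applies contact-aware optimization during measured self-contact, minimizing reprojection error and deviation from the input estimate while enforcing vertex proximity constraints; 3. uses the synchronized bioimpedance data to refine the pose in contact regions.

Result: Tested with three input pose estimators, BioTUCH improves reconstruction accuracy by 11.7% on average.

Insight: 1. Bioimpedance sensing is an effective way to address self-contact in pose estimation; 2. contact-aware optimization can markedly improve pose reconstruction accuracy; 3. the publicly available dataset helps advance related research.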

Abstract: Capturing accurate 3D human pose in the wild would provide valuable data for training pose estimation and motion generation methods. While video-based estimation approaches have become increasingly accurate, they often fail in common scenarios involving self-contact, such as a hand touching the face. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel framework that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization during measured self-contact: reprojection error and deviations from the input estimate are minimized while enforcing vertex proximity constraints. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture. Testing with three input pose estimators, we demonstrate an average of 11.7% improvement in reconstruction accuracy. We also present a miniature wearable bioimpedance sensor that enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation using BioTUCH. Code and data are available at biotuch.is.tue.mpg.de

[84] SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

Jiawen Wen,Yu Hu,Suixuan Qiu,Jinshan Huang,Xiaowen Chu

Main category: cs.CV

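TL;DR: SDG-Track is a heterogeneous Observer-Follower framework for real-time, high-resolution UAV tracking on embedded platforms. By separating high-accuracy detection (Observer) from high-frequency trajectory interpolation (Follower), it resolves the conflict between resolution and speed and achieves efficient tracking on resource-constrained devices.

Details Motivation: Real-time tracking of small UAVs faces a speed-resolution conflict between high-resolution image processing and resource-constrained edge devices. Downsampling loses the features of small targets, while processing high-resolution frames directly cannot meet real-time requirements.

Contribution: 1. A heterogeneous Observer-Follower architecture that resolves the conflict between high resolution and real-time operation. 2. The Dual-Space Recovery mechanism, which handles tracking failures without training. 3. Efficient UAV tracking on an embedded platform at 35.1 FPS while retaining 97.2% of frame-by-frame detection precision.

Method: 1. Observer stream: a high-accuracy detector runs at low frequency on the GPU to generate position anchors. 2. Follower stream: high-frequency sparse optical flow on the CPU interpolates the trajectory. 3. Dual-Space Recovery: combines color histogram matching with geometric consistency constraints to re-acquire the target.

Result: On an NVIDIA Jetson Orin Nano, the system reaches 35.1 FPS throughput, retains 97.2% of frame-by-frame detection precision, and successfully tracks FPV drones.

Insight: Heterogeneous computing and task decomposition make real-time tracking of high-resolution targets possible on resource-constrained platforms, suggesting a new direction for computer vision on edge devices.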

Abstract: Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our paper code is publicly available at https://github.com/Jeffry-wen/SDG-Track
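
The abstract describes Dual-Space Recovery only as color histogram matching combined with geometric consistency constraints. The following is a rough sketch of how two such checks might be combined, not the authors' implementation: the histogram-intersection score, the `max_jump` distance gate, and every function name here are assumptions made for illustration.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalised per-channel histogram of an (H, W, 3) uint8 patch."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalised histograms."""
    return float(np.minimum(h1, h2).sum())

def reacquire(template_hist, candidates, last_center, max_jump, min_sim):
    """Return the index of the best candidate passing both checks, or None.

    candidates: list of (center_xy, patch) pairs (hypothetical format).
    """
    best, best_sim = None, min_sim
    for i, (center, patch) in enumerate(candidates):
        # Geometric check (assumed form): reject implausibly large jumps.
        if np.linalg.norm(np.subtract(center, last_center)) > max_jump:
            continue
        # Appearance check: compare colour histograms with the template.
        sim = histogram_intersection(template_hist, color_histogram(patch))
        if sim >= best_sim:
            best, best_sim = i, sim
    return best

def solid(color):
    """Helper for the toy example: an 8x8 patch of one colour."""
    patch = np.zeros((8, 8, 3), dtype=np.uint8)
    patch[...] = color
    return patch

template = color_histogram(solid((200, 0, 0)))
candidates = [
    ((500, 500), solid((200, 0, 0))),  # right colour, too far away
    ((105, 100), solid((0, 0, 200))),  # nearby, wrong colour
    ((110, 95), solid((200, 0, 0))),   # nearby and right colour
]
chosen = reacquire(template, candidates, last_center=(100, 100),
                   max_jump=50.0, min_sim=0.8)
print(chosen)
```

In this toy setup only the third candidate satisfies both the distance gate and the appearance threshold, so neither check alone decides the outcome.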

[85] You Only Train Once (YOTO): A Retraining-Free Object Detection Framework

Priyanto Hidayatullah,Nurjannah Syakrani,Yudi Widhiyasana,Muhammad Rizqi Sholahuddin,Refdinal Tubagus,Zahri Al Adzani Hidayat,Hanri Fajar Ramadhan,Dafa Alfarizki Pratama,Farhan Muhammad Yasin

Main category: cs.CV

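TL;DR: The paper proposes the YOTO framework to address catastrophic forgetting in object detection. By combining YOLO11n, DeIT, and Proxy Anchor Loss, it detects both new and existing products without retraining.

Details Motivation: Object detection is widely used in computer vision, but catastrophic forgetting forces frequent, costly retraining. The problem is especially acute in settings such as retail checkout, where new products are added frequently.

Contribution: Proposes the YOTO framework, which combines YOLO11n, DeIT, and Proxy Anchor Loss to perform object detection without retraining, improving training efficiency and practicality.

Method: YOLO11n localizes objects; DeIT with Proxy Anchor Loss handles feature extraction and metric learning; classification uses cosine similarity between the target's embedding and those stored in a Qdrant vector database.

Result: In a retail case study with 140 products, YOTO achieves encouraging accuracy on both new and existing products, with almost 3 times the training time efficiency of classical approaches and an average inference time of 580 ms per image on an edge device.

Insight: By avoiding retraining, YOTO substantially reduces cost and time, is particularly suited to settings where products change frequently, and shows that practical deployment on edge devices is feasible.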

Abstract: Object detection constitutes the primary task within the domain of computer vision. It is utilized in numerous domains. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increasing model training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework’s feasibility for practical use.
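
The classification step described above, cosine similarity between a query embedding and stored product embeddings, can be sketched with plain NumPy. This is an illustrative stand-in under stated assumptions: an in-memory array replaces the Qdrant vector database, the three-dimensional embeddings and product labels are made up, and the `threshold` for rejecting unknown items is a hypothetical addition not mentioned in the abstract.

```python
import numpy as np

def cosine_similarity(query, gallery):
    """Cosine similarity between one query vector and every gallery row."""
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ query

def classify(query, gallery, labels, threshold=0.5):
    """Return the label of the most similar stored embedding and its score."""
    sims = cosine_similarity(query, gallery)
    best = int(np.argmax(sims))
    if sims[best] < threshold:  # hypothetical rejection rule
        return "unknown", float(sims[best])
    return labels[best], float(sims[best])

# Toy gallery standing in for the vector database (made-up embeddings).
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
labels = ["cola", "chips"]

# Registering a new product is an append to the gallery; the embedding
# model itself is left untouched.
gallery = np.vstack([gallery, [0.0, 0.0, 1.0]])
labels.append("soap")

label, score = classify(np.array([0.1, 0.05, 0.9]), gallery, labels)
print(label, round(score, 3))
```

The design point this illustrates is that the set of recognizable classes lives in the gallery rather than in the network weights, which is why adding a product need not trigger retraining.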

[86] ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching

Guanbo Huang,Jingjia Mao,Fanding Huang,Fengkai Liu,Xiangyang Luo,Yaoyuan Liang,Jiasheng Lu,Xiaoe Wang,Pei Liu,Ruiliu Fu,Shao-Lun Huang

Main category: cs.CV

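TL;DR: The paper studies exposure bias in Flow Matching and proposes ReflexFlow, a dynamic correction method whose two components, Anti-Drift Rectification and Frequency Compensation, mitigate the bias and markedly improve generation quality.

Details Motivation: Flow Matching methods suffer from exposure bias between training and inference, which limits generalization and generation quality. The paper aims to identify the root causes of this bias and propose a remedy.

Contribution: Proposes ReflexFlow, consisting of Anti-Drift Rectification (ADR) and Frequency Compensation (FC), which dynamically corrects exposure bias and improves Flow Matching performance.

Method: ADR adjusts the prediction targets for biased inputs during training; FC compensates for missing low-frequency content by reweighting the loss. The method is compatible with all Flow Matching frameworks.

Result: On CIFAR-10, CelebA-64, and ImageNet-256, ReflexFlow outperforms prior approaches in mitigating exposure bias, with a 35.65% reduction in FID on CelebA-64.

Insight: Identifies two root causes of exposure bias: the model generalizes poorly to biased inputs, and too little low-frequency content is captured during early denoising. Dynamically correcting the learning objective is an effective way to address this.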

Abstract: Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies in training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.

[87] Virtually Unrolling the Herculaneum Papyri by Diffeomorphic Spiral Fitting

Paul Henderson

Main category: cs.CV

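TL;DR: Proposes a new method for automatically and virtually unrolling the Herculaneum papyri. It globally fits an explicit parametric model to neural network predictions of where the rolled papyrus passes, guarantees a single continuous surface, and outperforms the only existing automated method suitable for this data.

Details Motivation: The Herculaneum papyri were charred by a volcanic eruption and are too fragile to unroll physically. Manual tracing for virtual unrolling is very laborious, so automated solutions are needed.

Contribution: Presents the first top-down automated method, which fits an explicit parametric model to neural network predictions of the papyrus path and guarantees a single continuous surface.

Method: Combines neural network predictions with a globally fitted explicit parametric model to produce a continuous, flattened 2D representation.

Result: Validated on high-resolution CT scans of two scrolls; the method successfully unrolls large regions and exceeds the performance of the only existing automated unrolling method suitable for this data.

Insight: Combining deep learning predictions with global modeling makes it possible to reconstruct the surface even through damaged regions, offering a new approach for digitizing cultural heritage.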

Abstract: The Herculaneum Papyri are a collection of rolled papyrus documents that were charred and buried by the famous eruption of Mount Vesuvius. They promise to contain a wealth of previously unseen Greek and Latin texts, but are extremely fragile and thus most cannot be unrolled physically. A solution to access these texts is virtual unrolling, where the papyrus surface is digitally traced out in a CT scan of the scroll, to create a flattened representation. This tracing is very laborious to do manually in gigavoxel-sized scans, so automated approaches are desirable. We present the first top-down method that automatically fits a surface model to a CT scan of a severely damaged scroll. We take a novel approach that globally fits an explicit parametric model of the deformed scroll to existing neural network predictions of where the rolled papyrus likely passes. Our method guarantees the resulting surface is a single continuous 2D sheet, even passing through regions where the surface is not detectable in the CT scan. We conduct comprehensive experiments on high-resolution CT scans of two scrolls, showing that our approach successfully unrolls large regions, and exceeds the performance of the only existing automated unrolling method suitable for this data.

[88] LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

Zhijian Shu,Cheng Lin,Tao Xie,Wei Yin,Ben Li,Zhiyuan Pu,Weize Li,Yao Yao,Xun Cao,Xiaoyang Guo,Xiao-Xiao Long

Main category: cs.CV

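TL;DR: LiteVGGT uses geometry-aware cached token merging to make VGGT substantially more efficient, achieving up to 10x speedup and reduced memory use, and making large-scale 3D scenes tractable.

Details Motivation: Vanilla VGGT incurs heavy compute and memory costs on long image sequences, which limits its use in large-scale scenes.

Contribution: Proposes a geometry-aware cached token merging strategy that markedly improves efficiency, and analyzes token correlation and computational redundancy in 3D reconstruction.

Method: Building on the observations that tokens from local image regions are geometrically correlated and that token similarity is stable across layers, designs a token merging strategy and reuses the merge indices.

Result: Experiments show that LiteVGGT retains VGGT's core performance while being efficient and scalable.

Insight: The geometric correlation of local tokens and the stability of token similarity across layers are the keys to more efficient 3D reconstruction.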

Abstract: 3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token’s geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT’s core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT’s effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/
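
The idea of caching merge decisions and reusing them across layers can be sketched as follows. This is a simplified illustration, not LiteVGGT's algorithm: anchors are chosen here by a fixed stride, whereas the paper selects them by geometric importance, and grouping by cosine similarity followed by averaging is an assumption about how a merge might look.

```python
import numpy as np

def compute_merge_indices(tokens, anchor_ids):
    """Assign every token to its most similar anchor (cosine similarity)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = normed @ normed[anchor_ids].T             # shape (N, A)
    assign = np.argmax(sims, axis=1)
    assign[anchor_ids] = np.arange(len(anchor_ids))  # anchors keep their own group
    return assign

def merge(tokens, assign, n_anchors):
    """Average the tokens in each group, giving one merged token per anchor."""
    merged = np.zeros((n_anchors, tokens.shape[1]))
    np.add.at(merged, assign, tokens)                # sum tokens per group
    counts = np.bincount(assign, minlength=n_anchors)
    return merged / counts[:, None]

rng = np.random.default_rng(0)
layer1 = rng.standard_normal((16, 8))                # 16 tokens, 8 channels
anchor_ids = np.arange(0, 16, 4)                     # hypothetical: every 4th token

# The similarity search is done once and the result is cached.
cached = compute_merge_indices(layer1, anchor_ids)
merged1 = merge(layer1, cached, len(anchor_ids))

# A later layer reuses the cached indices instead of recomputing similarity,
# relying on the observation that token similarity is stable across layers.
layer2 = layer1 + 0.01 * rng.standard_normal((16, 8))
merged2 = merge(layer2, cached, len(anchor_ids))
print(merged1.shape, merged2.shape)
```

The saving in this sketch comes from skipping the `(N, A)` similarity computation in the second call; only the cheap grouped average is repeated.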

[89] Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition

Novanto Yudistira

Main category: cs.CV

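TL;DR: The study proposes a multimodal action recognition method based on deep neural networks and adaptive fusion. By selectively integrating RGB, optical flow, audio, and depth information, it improves the accuracy and robustness of action recognition.

Details Motivation: Traditional unimodal action recognition methods are limited and cannot fully exploit multimodal data, so an adaptive fusion strategy is needed to integrate information across modalities.

Contribution: Proposes a gating-based adaptive multimodal fusion framework that selectively integrates key features and improves action recognition performance.

Method: Uses gating mechanisms and adaptive weighting-based fusion architectures to selectively fuse information from different modalities, and validates the approach experimentally.

Result: Experiments on benchmark datasets show accuracy gains on human action recognition, violence action detection, and self-supervised learning tasks.

Insight: Combining multimodal fusion with gating yields a more holistic feature representation for action recognition, with broad potential in areas such as active assisted living and surveillance.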

Abstract: This study introduces a pioneering methodology for human action recognition by harnessing deep neural network techniques and adaptive fusion strategies across multiple modalities, including RGB, optical flows, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We meticulously examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violence action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to revolutionize action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.
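
Gated fusion in its simplest form computes one weight per modality and takes a weighted sum of the modality features. The sketch below shows that generic pattern with a softmax gate. It is not the paper's architecture, which the abstract does not specify: the gate parameters are random rather than learned, and the feature sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(features, gate_w, gate_b):
    """Fuse per-modality features with input-dependent weights.

    features: (M, D) array, one D-dimensional vector per modality.
    gate_w:   (M * D, M) gate weights; gate_b: (M,) gate bias.
    """
    logits = features.reshape(-1) @ gate_w + gate_b  # one logit per modality
    weights = softmax(logits)                        # non-negative, sums to 1
    fused = weights @ features                       # weighted sum, shape (D,)
    return fused, weights

modalities = ["rgb", "optical_flow", "audio", "depth"]
n_mod, dim = len(modalities), 8                      # hypothetical sizes

rng = np.random.default_rng(0)
features = rng.standard_normal((n_mod, dim))
gate_w = 0.1 * rng.standard_normal((n_mod * dim, n_mod))  # would be learned
gate_b = np.zeros(n_mod)

fused, weights = gated_fusion(features, gate_w, gate_b)
for name, w in zip(modalities, weights):
    print(f"{name}: {w:.3f}")
```

Because the gate looks at the features themselves, the weights can differ per sample, which is what lets a fusion model lean on, say, audio when the visual streams are uninformative.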

[90] FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization

Yicheng Liu,Shiduo Zhang,Zibin Dong,Baijun Ye,Tianyuan Yuan,Xiaopeng Yu,Linqi Yin,Chenhao Lu,Junhao Shi,Luca Jiang-Tao Yu,Liangtao Zheng,Tao Jiang,Jingjing Gong,Xipeng Qiu,Hang Zhao

Main category: cs.CV

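TL;DR: The paper proposes FASTer, an efficient framework for autoregressive vision-language-action (VLA) modeling. With a learnable action tokenizer (FASTerVQ) and an autoregressive policy built on it (FASTerVLA), it improves inference efficiency and task performance while maintaining a high compression ratio.

Details Motivation: Action tokenization in existing autoregressive VLA models involves a trade-off between reconstruction fidelity and inference efficiency. The authors aim to resolve this with a unified framework for efficient and generalizable robot learning.

Contribution: 1) FASTerVQ, which encodes action chunks as single-channel images to capture spatio-temporal dependencies; 2) FASTerVLA, which combines block-wise autoregressive decoding with a lightweight action expert to improve inference speed and task performance.

Method: 1) FASTerVQ uses a learnable tokenizer to encode actions efficiently; 2) FASTerVLA uses block-wise autoregressive decoding and a lightweight expert network to streamline inference.

Result: Experiments show that FASTerVQ delivers strong reconstruction quality, high token utilization, and cross-task and cross-embodiment generalization, while FASTerVLA surpasses existing VLA models in inference speed and task performance.

Insight: The design of the action tokenizer (such as the single-channel image representation) is key to the efficiency and performance of VLA models; combining block-wise decoding with an expert network can effectively speed up autoregressive inference.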

Abstract: Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

[91] GeoPE: A Unified Geometric Positional Embedding for Structured Tensors

Yupu Yao,Bowen Yang

Main category: cs.CV

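TL;DR: GeoPE is a geometric positional embedding that uses quaternions to extend rotations to 3D Euclidean space. It addresses the disruption of spatial topology in standard Vision Transformers and improves performance on 2D and 3D tasks.

Details Motivation: Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. Existing 2D approaches fail to decouple false sequential proximity from true spatial distance; GeoPE aims to restore the 2D spatial manifold.

Contribution: Proposes the GeoPE framework, which uses quaternions and the Lie algebra to construct a unified rotational operator, yielding a geometrically coupled positional encoding that effectively separates spatial dimensions.

Method: Extends rotations to 3D space with quaternions, computes the geometric mean in the Lie algebra to overcome non-commutativity, and constructs a symmetric rotational operator.

Result: On image classification, object detection, and 3D semantic segmentation, GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias.

Insight: GeoPE restores spatial structure by geometric means, confirming its ability to capture true geometric structure; the summary also suggests it is applicable to higher-dimensional tasks.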

Abstract: Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
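
The "false sequential proximity" that motivates GeoPE follows directly from row-major flattening. The short example below illustrates only that problem, not GeoPE itself; the grid width of 14 patches is a hypothetical value.

```python
def flat_index(row, col, width):
    """Position of patch (row, col) after row-major flattening."""
    return row * width + col

width = 14            # hypothetical patch-grid width
a = (0, width - 1)    # last patch of row 0 (right edge)
b = (1, 0)            # first patch of row 1 (left edge)

# Distance along the flattened 1D sequence.
seq_gap = abs(flat_index(*a, width) - flat_index(*b, width))

# Euclidean distance on the 2D patch grid.
spatial_gap = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

print(seq_gap)      # adjacent in the sequence
print(spatial_gap)  # far apart on the grid
```

A purely 1D positional encoding sees these two patches as neighbors even though they sit on opposite sides of the image, which is the mismatch the abstract describes.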

[92] Rethinking the Use of Vision Transformers for AI-Generated Image Detection

NaHyeon Park,Kunhee Kim,Junsuk Choe,Hyunjung Shim

Main category: cs.CV

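TL;DR: The paper examines how features from different CLIP-ViT layers affect AI-generated image detection and proposes MoLD, an adaptive method that dynamically integrates them, improving detection performance and generalization.

Details Motivation: Existing methods mainly rely on final-layer CLIP-ViT features, and a systematic analysis of what other layers contribute to detection has been lacking. The authors find that earlier layers provide more localized and generalizable features.

Contribution: 1. A systematic analysis of how features from different ViT layers contribute to AI-generated image detection; 2. MoLD, a method that dynamically integrates multi-layer features; 3. validation on both GAN-generated and diffusion-generated images.

Method: MoLD adaptively integrates features from multiple ViT layers through a gating-based mechanism, making use of the distinct information each layer captures.

Result: MoLD shows strong detection performance and robustness on images from a variety of generative models, and extends to other pre-trained ViTs such as DINOv2.

Insight: Earlier-layer features can be more effective than final-layer features for AI-generated image detection, and a dynamic integration strategy can improve adaptability and performance.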

Abstract: Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.

[93] Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models

NaHyeon Park,Namin An,Kunhee Kim,Soyeon Yoon,Jiahao Huo,Hyunjung Shim

Main category: cs.CV

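TL;DR: The paper investigates the sources of social bias in text-to-image (T2I) systems based on large vision-language models (LVLMs), finds that system prompts are a primary driver of bias, and proposes FairPro, a training-free meta-prompting framework to reduce it.

Details Motivation: As LVLM-based T2I models become the dominant paradigm in image generation, whether they amplify social biases remains insufficiently understood. The paper aims to identify where the bias comes from and to propose a remedy.

Contribution: 1) A benchmark of 1,024 prompts for evaluating social bias across multiple attributes; 2) evidence that system prompts are a primary driver of bias; 3) the training-free FairPro framework for reducing bias.

Method: 1) Reveals bias in system prompts through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses; 2) designs FairPro, which lets the model self-audit and construct fairness-aware system prompts at test time.

Result: Experiments on two models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment.

Insight: System prompts play a central role in bias propagation, and fairness methods that do not modify model parameters are valuable for practical deployment.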

Abstract: Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.

[94] Reflection Removal through Efficient Adaptation of Diffusion Transformers

Daniyar Zakarin,Thiemo Wandel,Anton Obukhov,Dengxin Dai

Main category: cs.CV

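TL;DR: The paper proposes a diffusion transformer (DiT) framework for single-image reflection removal. It leverages the generalization of pre-trained diffusion models in restoration, combined with efficient LoRA adaptation and physically realistic synthetic data, to achieve state-of-the-art results on in-domain and zero-shot benchmarks.

Details Motivation: Existing reflection removal methods typically rely on task-specific architectures and lack generality and scalability. The paper aims to use the strong generalization of pre-trained diffusion models, together with physically realistic data synthesis, to address the shortage of suitable, realistic data for reflection removal.

Contribution: 1. A diffusion-transformer framework for reflection removal; 2. a physically based rendering (PBR) pipeline for generating realistic synthetic data; 3. efficient LoRA adaptation of the pre-trained model for high-performance reflection removal.

Method: 1. Repurposes a pre-trained DiT model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers; 2. builds a physically based rendering (PBR) pipeline in Blender to generate synthetic data; 3. adapts the pre-trained model efficiently with LoRA.

Result: Achieves state-of-the-art performance on in-domain and zero-shot benchmarks, supporting the method's effectiveness and generalization.

Insight: Pre-trained diffusion transformers, paired with physically grounded data synthesis and efficient adaptation, offer a scalable, high-fidelity solution for reflection removal.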

Abstract: We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web

[95] Self-Supervised Learning for Transparent Object Depth Completion Using Depth from Non-Transparent Objects

Xianghui Fan,Zhaoyu Chen,Mengyang Pan,Anping Deng,Hang Yang

Main category: cs.CV

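TL;DR: The paper proposes a self-supervised method for transparent-object depth completion that trains the network by simulating depth loss in non-transparent regions, without requiring large amounts of annotated data.

Details Motivation: Depth perception of transparent objects is difficult because of refraction and reflection. Conventional methods depend on large amounts of annotated data, which is costly to label. The paper aims to solve this with self-supervised learning.

Contribution: A new self-supervised method that uses depth in non-transparent regions as the supervision signal, reducing dependence on annotated data.

Method: The method simulates the depth deficits of transparent objects and trains a depth completion network with the original depth map as supervision.

Result: Experiments show performance close to supervised methods, and pre-training with the method improves model performance when training samples are few.

Insight: Self-supervision can reduce annotation requirements while maintaining performance, which suits data-scarce settings.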

Abstract: The perception of transparent objects is one of the well-known challenges in computer vision. Conventional depth sensors have difficulty sensing the depth of transparent objects because of the refraction and reflection of light. Previous research has typically trained a neural network to complete the depth acquired by the sensor, which can quickly produce accurate depth maps of transparent objects. However, such training relies on a large amount of annotated data for supervision, and labeling depth maps is costly. To tackle this challenge, we propose a new self-supervised method for training depth completion networks. Our method simulates the depth deficits of transparent objects within non-transparent regions and utilizes the original depth map as ground truth for supervision. Experiments demonstrate that our method achieves performance comparable to supervised approaches, and pre-training with our method can improve model performance when training samples are scarce.
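
The self-supervision idea described above (erase depth in regions where the sensor reading is valid, then supervise with the original map) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the function names, the convention that `0.0` marks missing depth, and the rectangular hole generator are all hypothetical, and a realistic generator would presumably mimic the shapes of transparent objects.

```python
import random


def make_training_pair(depth, hole_mask):
    """Build a self-supervised (input, target, loss_mask) triple.

    depth: 2D list of sensor depth values, where 0.0 marks missing depth
    hole_mask: 2D list of booleans marking pixels to erase artificially
    """
    corrupted, loss_mask = [], []
    for d_row, h_row in zip(depth, hole_mask):
        corrupted.append([0.0 if h else d for d, h in zip(d_row, h_row)])
        # Supervise only where depth was erased by us AND was originally valid.
        loss_mask.append([h and d > 0.0 for d, h in zip(d_row, h_row)])
    return corrupted, depth, loss_mask


def random_rect_mask(height, width, rng):
    """Hypothetical hole generator: one random axis-aligned rectangle."""
    top, left = rng.randrange(height), rng.randrange(width)
    bottom = rng.randrange(top, height) + 1
    right = rng.randrange(left, width) + 1
    return [[top <= i < bottom and left <= j < right for j in range(width)]
            for i in range(height)]
```

A completion network would take `corrupted` as input and be penalized against `depth` only where `loss_mask` is true, so no manual annotation is involved.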

[96] Generative Neural Video Compression via Video Diffusion Prior

Qi Mao,Hao Cheng,Tinghan Yang,Libiao Jin,Siwei Ma

Main category: cs.CV

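TL;DR: GNVC-VD is a DiT-based generative neural video compression framework that uses a video diffusion prior to unify spatio-temporal latent compression with sequence-level generative refinement, significantly reducing flickering artifacts and improving perceptual quality.

Details Motivation: Existing perceptual codecs mainly rely on pretrained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and leads to perceptual flickering. GNVC-VD introduces a video diffusion prior to address this.

Contribution: 1. The first DiT-based framework for generative neural video compression; 2. A unified flow-matching latent refinement module that uses a video diffusion transformer to jointly enhance intra- and inter-frame latents; 3. A conditioning adaptor that injects compression-aware cues.

Method: 1. Sequence-level denoising with a video diffusion transformer; 2. Refinement initialized from decoded spatio-temporal latents, with a learned correction term that adapts to compression-induced degradation; 3. Compression-aware cues injected through a conditioning adaptor.

Result: GNVC-VD surpasses traditional and learned codecs in perceptual quality, significantly reduces flickering artifacts, and maintains temporal coherence even at extremely low bitrates (<0.01 bpp).

Insight: Video-native generative priors hold great promise for neural codecs and could substantially improve next-generation perceptual video compression.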

Abstract: We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.

[97] RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Nicolas Houdré,Diego Marcos,Hugo Riffaud de Turckheim,Dino Ienco,Laurent Wendling,Camille Kurtz,Sylvain Lobry

Main category: cs.CV

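TL;DR: RAMEN is a resolution-adjustable multimodal encoder for Earth observation data that learns a shared visual representation across modalities.

Details Motivation: Existing foundation models usually require fixed input resolutions or rely on sensor-specific encoders, which limits generalization across heterogeneous Earth observation modalities.

Contribution: RAMEN, a resolution-adjustable multimodal encoder that lets users directly control the output resolution and trade off spatial precision against computational cost.

Method: A single unified transformer encoder is trained to reconstruct masked multimodal data, with spatial resolution treated as a controllable output parameter.

Result: On the community-standard PANGAEA benchmark, RAMEN outperforms larger state-of-the-art models and generalizes to both known and unseen sensor configurations.

Insight: RAMEN's sensor-agnostic design provides a unified latent space for multimodal analysis of Earth observation data.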

Abstract: Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.

[98] Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding

Abhigyan Bhattacharya,Hiranmoy Roy

Main category: cs.CV

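TL;DR: The paper proposes a semantic-guided two-stage GAN for face inpainting that combines hybrid perceptual encoding with hierarchical synthesis to address the problems existing methods have with large irregular masks.

Details Motivation: When handling large irregular masks, existing face inpainting methods often produce blurry textures, semantic inconsistencies, or unconvincing facial structures, so a more effective inpainting technique is needed.

Contribution: A new semantic-guided two-stage GAN architecture that achieves high-quality inpainting through hierarchical synthesis and a multi-scale texture generator, and handles arbitrary masks through dynamic attention.

Method: The first stage combines CNNs and Vision Transformers to produce clear semantic layouts; the second stage refines textures with a Multi-Modal Texture Generator that uses multi-scale information to improve consistency.

Result: Experiments on CelebA-HQ and FFHQ show the model outperforms existing methods on LPIPS, PSNR, and SSIM, especially when inpainting large missing areas.

Insight: Hierarchical synthesis and semantic guidance effectively address structural consistency and texture detail in inpainting, and hybrid perceptual encoding further improves quality.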

Abstract: Face image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task specifically intended for photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures because of their direct pixel-level synthesis and limited exploitation of facial priors. In this paper, we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information at the semantic level and then refines the texture, which establishes a clear facial structure before detailed image synthesis. In the first stage, we combine local features from CNNs with global features from Vision Transformers to produce clear and detailed semantic layouts. In the second stage, a Multi-Modal Texture Generator refines these layouts by drawing on information from different scales, keeping the result cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods on metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.

[99] Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Yanran Zhang,Ziyi Wang,Wenzhao Zheng,Zheng Zhu,Jie Zhou,Jiwen Lu

Main category: cs.CV

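TL;DR: The paper proposes MoRe4D, which synthesizes dynamic 4D scenes from a single static image by jointly performing geometric reconstruction and motion generation, resolving the spatiotemporal inconsistencies caused by existing methods that decouple geometry from motion.

Details Motivation: Existing methods decouple geometric reconstruction from motion generation, which yields spatiotemporally inconsistent 4D scenes and poor generalization. MoRe4D aims to solve both problems jointly and improves results through a new dataset and module design.

Contribution: 1. The MoRe4D framework for joint motion generation and geometric reconstruction; 2. The TrajScene-60K dataset; 3. The diffusion-based 4D-STraG generator and the 4D-ViSM module.

Method: 1. A diffusion-based 4D Scene Trajectory Generator (4D-STraG) produces geometrically consistent motion trajectories; 2. Depth-guided motion normalization and a motion-aware module integrate geometry and dynamics; 3. The 4D-ViSM module renders videos from point trajectories.

Result: Experiments show MoRe4D generates high-quality, multi-view-consistent 4D scenes with rich dynamic details from a single image.

Insight: Jointly modeling geometry and motion is key, and a high-quality dataset (TrajScene-60K) and new modules (such as 4D-STraG) significantly improve 4D synthesis.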

Abstract: Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.

[100] BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Yiming Wang,Qihang Zhang,Shengqu Cai,Tong Wu,Jan Ackermann,Zhengfei Kuang,Yang Zheng,Frano Rajič,Siyu Tang,Gordon Wetzstein

Main category: cs.CV

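TL;DR: The paper proposes BulletTime, a 4D-controllable video generation framework that decouples control of scene dynamics from camera pose, enabling fine-grained adjustment of both time and camera viewpoint.

Details Motivation: Existing video diffusion models cannot control scene dynamics and camera motion independently, which limits their flexibility in practice. The paper aims to solve this.

Contribution: A 4D-controllable video diffusion framework that supports decoupled control of time and camera pose and achieves high-quality generation through 4D positional encoding and adaptive normalization.

Method: Continuous world-time sequences and camera trajectories serve as conditioning inputs and are injected into the video diffusion model through 4D positional encoding and adaptive normalization.

Result: Experiments show robust 4D control across diverse timing patterns and camera trajectories while preserving high generation quality and outperforming prior work in controllability.

Insight: Decoupling control of time and camera pose enables more flexible, higher-quality video generation, which matters for applications such as video editing and virtual reality.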

Abstract: Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/

[101] Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints

Minghan Zhu,Zhiyi Wang,Qihang Sun,Maani Ghaffari,Michael Posa

Main category: cs.CV

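TL;DR: The paper proposes a method for reconstructing object geometry under occlusion by combining generative priors with physical contact constraints.

Details Motivation: Object geometry is essential for robot manipulation, but occlusion leaves visual signals incomplete. Priors from generative models and contact information can reduce this ambiguity.

Contribution: A method for object geometry reconstruction that combines generative-model priors with contact-induced constraints.

Method: Contact-guided 3D generation, with a guidance formulation inspired by drag-based editing in generative models.

Result: Experiments on synthetic and real-world data show the method improves on both pure 3D generation and contact-based optimization.

Insight: Combining generative priors with contact constraints effectively improves geometry reconstruction under occlusion.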

Abstract: Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.

[102] Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Jung Yi,Wooseok Jang,Paul Hyunbin Cho,Jisu Nam,Heeji Yoon,Seungryong Kim

Main category: cs.CV

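TL;DR: Deep Forcing is a training-free method for long video generation that uses Deep Sink and Participative Compression to address temporal repetition, drift, and motion deceleration, enabling high-quality real-time long video generation.

Details Motivation: Existing autoregressive video diffusion methods suffer from temporal repetition, drift, and motion deceleration, and StreamingLLM-style attention sinks lead to quality degradation and motion stagnation.

Contribution: 1) Deep Sink stabilizes global context with persistent sink tokens and a re-aligned RoPE phase; 2) Participative Compression removes redundant history through importance-aware KV cache pruning, minimizing error accumulation.

Method: Combines two training-free mechanisms, Deep Sink and Participative Compression, to improve KV cache management.

Result: Over 12x extrapolation (e.g., 5s-trained to 60s+ generation), with better imaging and aesthetic quality than existing methods while maintaining real-time generation.

Insight: Training-free KV-cache management can match or exceed training-based approaches, offering a new direction for long video generation.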

Abstract: Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that require no fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g., 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly preserved overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
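
The cache policy described above (always keep a block of sink tokens, then keep only the most actively attended remaining tokens) can be illustrated with a small sketch. It is an assumption-laden simplification rather than the authors' implementation: `prune_kv_cache`, its parameters, and the idea of a single scalar importance score per token are hypothetical, and the RoPE phase re-alignment of Deep Sink is not modeled at all.

```python
def prune_kv_cache(positions, importance, n_sink, budget):
    """Return the cache positions to keep, in their original order.

    positions: cached token positions, oldest first (assumed unique)
    importance: one score per position, e.g. attention mass from recent queries
    n_sink: number of leading tokens always kept as persistent sinks
    budget: maximum number of positions kept in total (assumed >= n_sink)
    """
    if len(positions) <= budget:
        return list(positions)
    sinks = list(positions[:n_sink])
    rest = list(zip(positions[n_sink:], importance[n_sink:]))
    # Keep the highest-scoring non-sink tokens that still fit in the budget.
    top = sorted(rest, key=lambda pair: pair[1], reverse=True)[:budget - n_sink]
    kept = {pos for pos, _ in top}
    return sinks + [pos for pos, _ in rest if pos in kept]
```

For example, with eight cached positions, two sinks, and a budget of five, the two oldest tokens survive unconditionally and the three highest-scoring later tokens fill the remaining slots.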

[103] Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Haobo Yuan,Yueyi Sun,Yanwei Li,Tao Zhang,Xueqing Deng,Henghui Ding,Lu Qi,Anran Wang,Xiangtai Li,Ming-Hsuan Yang

Main category: cs.CV

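TL;DR: The paper introduces the Visual Reasoning Tracer (VRT) task, which requires models not only to localize the target object but also to predict the intermediate objects along the reasoning path, and contributes the VRT-Bench benchmark, a new evaluation metric, and the VRT-80k dataset. Experiments show that existing models can output correct results but struggle to ground their intermediate reasoning, while models trained on VRT-80k improve substantially at tracing the reasoning path.

Details Motivation: Current multimodal large language models (MLLMs) perform well on tasks such as visual question answering, but their reasoning lacks transparency: they output only final results without showing intermediate steps or fine-grained evidence. This contrasts with how humans complete tasks through a chain of visual reasoning, so a task and evaluation method that exposes a model's reasoning path is needed.

Contribution: 1. The VRT task, which requires models to explicitly predict the intermediate objects in the reasoning path; 2. The human-annotated VRT-Bench benchmark and a new evaluation metric; 3. VRT-80k, a large-scale dataset for training reasoning models.

Method: The VRT task requires models to predict the intermediate objects in the reasoning path while localizing the target object. VRT-Bench and VRT-80k are used to evaluate and train models, and a new metric quantifies the quality of reasoning traces.

Result: Existing models often produce correct outputs but ground their intermediate reasoning poorly, whereas models trained on VRT-80k trace the reasoning path markedly better.

Insight: Explicitly requiring models to predict intermediate objects in the reasoning path better exposes their reasoning and improves interpretability and performance, which suggests that transparent reasoning paths matter for multimodal tasks.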

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.

[104] SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards

Yuan Gao,Jin Song

Main category: cs.CV

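TL;DR: The paper proposes SA-IQA, an image quality assessment framework based on spatial aesthetics. It builds SA-BENCH, the first aesthetics benchmark for interior scenes, and uses multi-dimensional reward fusion to markedly improve quality assessment of AI-generated images.

Details Motivation: Existing image quality assessment methods mainly target portraits and artistic images and lack systematic aesthetic evaluation of interior scenes, so the paper proposes a new spatial aesthetics assessment paradigm.

Contribution: 1. The concept of spatial aesthetics, defined along four dimensions: layout, harmony, lighting, and distortion; 2. SA-BENCH, the first aesthetics benchmark for interior scenes; 3. The SA-IQA framework, which combines an MLLM with multi-dimensional reward fusion.

Method: 1. Build the SA-BENCH dataset (18,000 images, 50,000 annotations); 2. Develop SA-IQA through MLLM fine-tuning and multi-dimensional fusion; 3. Apply it to GRPO reinforcement learning and Best-of-N selection.

Result: Experiments show SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation.

Insight: Systematically evaluating the multi-dimensional features of spatial aesthetics allows more comprehensive optimization of AI-generated image quality and provides a benchmark and tools for follow-up research.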

Abstract: In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
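
To make the Best-of-N use of a multi-dimensional reward concrete, here is a small sketch. It rests on assumptions that are not in the abstract: the fusion is shown as a plain weighted mean over the four named dimensions, whereas the paper's fusion is obtained through MLLM fine-tuning, and the names `fuse_scores` and `best_of_n` and the score ranges are hypothetical.

```python
DIMENSIONS = ("layout", "harmony", "lighting", "distortion")


def fuse_scores(scores, weights=None):
    """Combine per-dimension scores into one scalar reward (weighted mean)."""
    weights = weights or {name: 1.0 for name in DIMENSIONS}
    total = sum(weights[name] for name in DIMENSIONS)
    return sum(weights[name] * scores[name] for name in DIMENSIONS) / total


def best_of_n(candidates):
    """Pick the candidate whose fused reward is highest.

    candidates: list of (image_id, scores_dict) pairs
    """
    return max(candidates, key=lambda item: fuse_scores(item[1]))[0]
```

The same scalar reward could in principle be passed to a reinforcement learning loop, which is the other downstream use the abstract mentions.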

[105] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Yu Zeng,Charles Ochoa,Mingyuan Zhou,Vishal M. Patel,Vitor Guizilini,Rowan McAllister

Main category: cs.CV

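TL;DR: The paper proposes Phase-Preserving Diffusion (φ-PD), a diffusion method that preserves input phase while randomizing magnitude. It suits generation tasks that require geometric consistency, such as re-rendering and image-to-image translation, and adds no inference cost.

Details Motivation: Standard diffusion corrupts both the phase and magnitude of data with Gaussian noise, destroying spatial structure, so it is ill-suited to tasks requiring geometric consistency.

Contribution: φ-PD and Frequency-Selective Structured (FSS) noise, which enable structure-aligned generation without additional parameters or architectural changes.

Method: Preserve phase while randomizing magnitude, and control structural rigidity with FSS noise.

Result: Strong results in re-rendering and simulation enhancement, with a 50% improvement in CARLA-to-Waymo planner performance.

Insight: Phase information is crucial for preserving spatial structure, and φ-PD offers an efficient, flexible approach for image and video generation.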

Abstract: Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/
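
The core idea of keeping the input's Fourier phase while randomizing magnitude can be illustrated on a 1D signal. This is only a toy sketch under assumptions, not the φ-PD formulation itself: it uses a naive DFT from the standard library, takes the magnitudes from a Gaussian noise draw, ignores the diffusion schedule and FSS noise entirely, and all function names are hypothetical.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def phase_preserving_noise(signal, rng):
    """Return a real signal with Gaussian-noise magnitudes and `signal`'s phases."""
    spectrum = dft(signal)
    noise_spectrum = dft([rng.gauss(0.0, 1.0) for _ in signal])
    # Magnitude from the noise draw, phase from the input signal.
    mixed = [cmath.rect(abs(nz), cmath.phase(s))
             for s, nz in zip(spectrum, noise_spectrum)]
    # Both inputs are real, so `mixed` is conjugate-symmetric and the
    # imaginary part of the inverse transform is only rounding error.
    return [value.real for value in idft(mixed)]
```

Because phase carries most of the positional information of edges and contours, a sample built this way stays spatially aligned with the input even though its energy spectrum is random.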

[106] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Shengyuan Ding,Xinyu Fang,Ziyu Liu,Yuhang Zang,Yuhang Cao,Xiangyu Zhao,Haodong Duan,Xiaoyi Dong,Jianze Liang,Bin Wang,Conghui He,Dahua Lin,Jiaqi Wang

Main category: cs.CV

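TL;DR: ARM-Thinker is a multimodal reward model that autonomously invokes external tools (e.g., image cropping, document retrieval) to ground its judgments in verifiable evidence, addressing the hallucination, weak visual grounding, and lack of tool use that limit existing reward models.

Details Motivation: Current multimodal reward models suffer from hallucination, weak visual grounding, and an inability to use tools for verification, which limits their reliability on complex multimodal reasoning tasks.

Contribution: ARM-Thinker, an agentic multimodal reward model that can invoke external tools to ground its judgments, significantly improving reward model accuracy and interpretability.

Method: Multi-stage reinforcement learning jointly optimizes tool-calling decisions and judgment accuracy, and the ARMBench-VL benchmark suite is introduced for evaluation.

Result: ARM-Thinker achieves a 16.2% average improvement on reward modeling benchmarks and 9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks.

Insight: Agentic tool calling improves both the accuracy and the interpretability of reward models, offering a new approach to verification and reasoning in multimodal tasks.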

Abstract: Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

[107] Splannequin: Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting

Hao-Jen Chien,Yi-Chuan Huang,Chung-Ho Wu,Wei-Lun Chao,Yu-Lun Liu

Main category: cs.CV

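TL;DR: Splannequin synthesizes high-quality frozen 3D scenes from monocular Mannequin-Challenge videos. It uses dynamic Gaussian splatting to retain subtle dynamics and introduces dual detection with temporal anchoring to reduce artifacts.

Details Motivation: Conventional dynamic scene reconstruction does not fit the goal of Mannequin-Challenge video, which is to synthesize a frozen scene while preserving user-selectable dynamics. A new method is needed to overcome the problems caused by monocular capture and sparse temporal supervision.

Contribution: Splannequin, an improvement to dynamic Gaussian splatting that detects hidden and defective states of Gaussian primitives and applies temporal anchoring, markedly improving visual quality.

Method: Model the scene with dynamic Gaussian splatting and introduce dual detection and anchoring. Hidden states are anchored to well-observed past states, and defective states are anchored to future states with stronger supervision.

Result: Without architectural changes or added inference overhead, the method markedly reduces artifacts and produces high-quality frozen renderings, with a 96% user preference rate.

Insight: Dynamic Gaussian splatting can retain subtle scene dynamics while temporal anchoring effectively improves the quality of frozen renderings.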

Abstract: Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model’s time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/
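
The temporal anchoring rule stated above (hidden states look to the past, defective states look to the future) can be sketched as an anchor-selection helper plus a simple penalty. Everything concrete here is an assumption for illustration: the per-timestamp `well_observed` flags, the choice of the nearest qualifying timestamp, and the squared-distance penalty are hypothetical stand-ins for the paper's detection and loss terms.

```python
def anchor_timestamp(well_observed, t, state):
    """Choose the timestamp whose Gaussian state serves as the anchor.

    well_observed: list of booleans, one per timestamp, True where the
        Gaussian is well supervised (hypothetical detection output)
    t: current timestamp index
    state: "hidden" (anchor to the past) or "defective" (anchor to the future)
    Returns an index, or None if no suitable anchor exists.
    """
    if state == "hidden":
        candidates = range(t - 1, -1, -1)              # most recent past first
    elif state == "defective":
        candidates = range(t + 1, len(well_observed))  # nearest future first
    else:
        raise ValueError(f"unknown state: {state!r}")
    return next((i for i in candidates if well_observed[i]), None)


def anchoring_loss(params_t, params_anchor):
    """Squared distance pulling the current parameters toward the anchor."""
    return sum((p - q) ** 2 for p, q in zip(params_t, params_anchor))
```

Since a term like this only enters the training objective, it is consistent with the abstract's claim of zero inference overhead.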

[108] Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Tianqi Liu,Zhaoxi Chen,Zihao Huang,Shaocong Xu,Saining Zhang,Chongjie Ye,Bohan Li,Zhiguo Cao,Wei Li,Hao Zhao,Ziwei Liu

Main category: cs.CV

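TL;DR: Light-X is a video generation framework that decouples geometry and lighting signals to render monocular videos with viewpoint and illumination control, and improves training with data synthesized by Light-Syn.

Details Motivation: Existing illumination control methods for video still face a trade-off between lighting fidelity and temporal consistency. Generative modeling of real-world scenes requires joint control of camera trajectory and illumination.

Contribution: 1) A disentangled design that decouples geometry and lighting signals; 2) Light-Syn, a strategy that synthesizes training pairs through inverse mapping, addressing the lack of multi-view, multi-illumination video data.

Method: Capture geometry and motion with dynamic point clouds, provide illumination cues with a relit frame, and synthesize training data with Light-Syn.

Result: Experiments show Light-X outperforms baselines in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

Insight: Decoupling geometry and lighting signals is key to controllable video generation, and synthetic training data can compensate for the lack of real data.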

Abstract: Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

cs.RO

[109] From Generated Human Videos to Physically Plausible Robot Trajectories

James Ni,Zekai Wang,Wei Lin,Amir Bar,Yann LeCun,Trevor Darrell,Jitendra Malik,Roei Herzig

Main category: cs.RO

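TL;DR: The paper proposes a two-stage method for turning generated human videos into physically plausible robot trajectories, using a 4D human representation and a reinforcement learning policy to imitate noisy videos zero-shot.

Details Motivation: Video generation models have great potential as high-level policies for robot control, but the noise and morphological distortions in generated videos make direct imitation difficult.

Contribution: 1. A two-stage pipeline that converts generated videos into robot actions; 2. GenMimic, a physics-aware reinforcement learning policy; 3. GenMimicBench, a synthetic dataset for evaluating zero-shot generalization.

Method: 1. Lift video pixels into a 4D human representation and retarget it to the humanoid morphology; 2. Train a reinforcement learning policy conditioned on 3D keypoints with symmetry regularization and keypoint-weighted tracking rewards.

Result: The method improves over strong baselines in simulation and achieves coherent, physically stable motion tracking on a real Unitree G1 humanoid.

Insight: Video generation models can serve as high-level policies for robot control, and physics-aware reinforcement learning can handle the noise in generated videos.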

Abstract: Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.

cs.AI

[110] Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective

Jae Hee Lee,Anne Lauscher,Stefano V. Albrecht

Main category: cs.AI

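TL;DR: This position paper proposes a research agenda for ensuring the ethical behavior of multi-agent systems of large language models (MALMs) from the perspective of mechanistic interpretability, focusing on evaluation frameworks, mechanistic analysis, and efficient alignment techniques.

Details Motivation: As large language models are widely deployed in multi-agent systems, ethical concerns are becoming more prominent. The authors aim to address them through mechanistic interpretability to ensure these systems are safe and ethical.

Contribution: The paper identifies three research challenges: (1) developing evaluation frameworks that assess the ethical behavior of MALMs at the individual, interactional, and systemic levels; (2) using mechanistic interpretability to explain the internal mechanisms behind emergent behaviors; (3) implementing parameter-efficient alignment techniques that steer MALMs toward ethical behavior without compromising performance.

Method: The paper takes a mechanistic interpretability approach, proposing to analyze models' internal mechanisms and behavior patterns in order to develop evaluation frameworks and alignment techniques.

Result: The paper reports no experimental results; it lays out a systematic research agenda whose methods can be validated experimentally in future work.

Insight: The paper highlights the importance of ethical issues in multi-agent systems and offers mechanistic interpretability as a new angle on solutions, pointing to directions for future research.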

Abstract: Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.

[111] Are LLMs Truly Multilingual? Exploring Zero-Shot Multilingual Capability of LLMs for Information Retrieval: An Italian Healthcare Use Case

Vignesh Kumar Kembu,Pierandrea Morandini,Marta Bianca Maria Ranzini,Antonino Nocera

Main category: cs.AI

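TL;DR: The paper examines the zero-shot ability of open-source multilingual large language models (LLMs) to retrieve information from Italian electronic health records (EHRs), finding that some models perform poorly in real-world settings and that performance varies significantly.

Details Motivation: Extracting information from electronic health records is a key task in digital healthcare. Traditional NLP methods fall short because of the complexity and variability of clinical language, and LLMs offer a new way to tackle the problem.

Contribution: The paper experimentally evaluates the zero-shot ability of LLMs to extract information from Italian EHRs, revealing the models' practical limitations and performance differences.

Method: A detailed experimental campaign tests open-source multilingual LLMs on real-time extraction of information from Italian EHRs and compares them with conventional pattern matching and manual annotation.

Result: Some LLMs perform poorly in the zero-shot setting, and models generalize poorly across different diseases.

Insight: Although LLMs are strong at text understanding and generation, they still need further improvement for practical use in specific languages (such as Italian) and domains (such as healthcare).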

Abstract: Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors like healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Although traditional NLP techniques have been used for this in the past, they often fall short due to the complexity and variability of clinical language and the high inner semantics of free clinical text. Recently, LLMs have become a powerful tool for better understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand EHRs (Electronic Health Records) in Italian and help extract information from them in real time. Our detailed experimental campaign on comorbidity extraction from EHRs reveals that some LLMs struggle in zero-shot, on-premises settings, and others show significant variation in performance, struggling to generalize across various diseases when compared to native pattern matching and manual annotations.

[112] STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions

Junjie Fan,Hongye Zhao,Linduo Wei,Jiayu Rao,Guijia Li,Jiaxin Yuan,Wenqi Xu,Yong Qi

Main category: cs.AI

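TL;DR: STELLA is a framework that uses a dynamic semantic abstraction mechanism to decompose a time series into trend, seasonality, and residual components and translate them into hierarchical semantic anchors, improving LLM performance on time series forecasting.

Details Motivation: Existing LLM adaptations for time series forecasting underuse the models' reasoning capabilities and lack effective modeling of dynamic behavior and global context.

Contribution: The STELLA framework, which combines dynamic semantic abstraction with Hierarchical Semantic Anchors (a Corpus-level Semantic Prior, CSP, for global context and a Fine-grained Behavioral Prompt, FBP, for instance-level patterns) to improve LLM forecasting performance.

Method: (1) Decompose the input series into trend, seasonality, and residual components; (2) generate hierarchical semantic anchors (CSP and FBP) from those components; (3) use the anchors as prefix prompts to guide the LLM.

Result: On eight benchmark datasets, STELLA outperforms state-of-the-art methods in long- and short-term forecasting and in zero-shot and few-shot settings.

Insight: Dynamic semantic abstraction and hierarchical semantic anchors can capture the dynamic behavior of a time series and improve the LLM's forecasting accuracy and generalization.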

Abstract: Recent adaptations of Large Language Models (LLMs) for time series forecasting often fail to effectively enhance information for raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, lacking critical global and instance-specific context. To address this, we propose STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework that systematically mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. It then translates intrinsic behavioral features of these components into Hierarchical Semantic Anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. Using these anchors as prefix-prompts, STELLA guides the LLM to model intrinsic dynamics. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting, showing superior generalization in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of our dynamically generated semantic anchors.
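
The decomposition step can be illustrated with a classical additive decomposition. The sketch below is not STELLA's mechanism, which the abstract does not specify: it assumes a centered moving average for the trend and per-phase means for the seasonality, and the function name `decompose`, the `period` parameter, and the sample series are all hypothetical.

```python
from statistics import mean

def decompose(series, period):
    """Classical additive decomposition: series = trend + seasonal + residual."""
    n = len(series)
    half = period // 2
    # Trend: centered moving average, with the window shrunk at the edges.
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(mean(series[lo:hi]))
    detrended = [x - t for x, t in zip(series, trend)]
    # Seasonality: mean of the detrended values at each phase of the period.
    phase_means = [mean(detrended[p::period]) for p in range(period)]
    seasonal = [phase_means[i % period] for i in range(n)]
    # Residual: whatever the trend and seasonal components do not explain.
    residual = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual

# Made-up series with an upward drift and a repeating 4-step pattern.
series = [10, 12, 14, 12, 11, 13, 15, 13, 12, 14, 16, 14]
trend, seasonal, residual = decompose(series, period=4)
```

In a STELLA-style pipeline, summary statistics of each component (for example, the slope of `trend` or the amplitude of `seasonal`) would then be verbalized into prompt text; how the paper does this is not described in the abstract.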

[113] Algorithmic Thinking Theory

MohammadHossein Bateni,Vincent Cohen-Addad,Yuzhou Gu,Silvio Lattanzi,Simon Meierhans,Christopher Mohri

Main category: cs.AI

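TL;DR: The paper proposes a theoretical framework for analyzing the algorithmic principles by which LLMs iteratively improve their reasoning, laying a foundation for designing more powerful reasoning methods.

Details Motivation: LLMs perform well on complex reasoning tasks, and their capabilities can often be improved further by iterating on previously generated solutions. A theoretical framework explaining this iterative improvement has been missing.

Contribution: (1) A theoretical framework that formalizes the principles underlying iterative improvement and answer aggregation; (2) a general model grounded in experimental evidence rather than architectural specifics, which may extend to current and future reasoning models.

Method: Grounded in experimental evidence, the paper formalizes a reasoning plan as an algorithm that uses a probabilistic oracle and provides tools for analyzing such algorithms.

Result: The framework provides a foundation for designing more powerful reasoning methods and may apply to a wide range of reasoning models.

Insight: (1) Reasoning ability can be improved through algorithmic design; (2) the framework's generality comes from not depending on a specific model architecture.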

Abstract: Large language models (LLMs) have proven to be highly effective for solving complex reasoning tasks. Surprisingly, their capabilities can often be improved by iterating on previously generated solutions. In this context, a reasoning plan for generating and combining a set of solutions can be thought of as an algorithm for reasoning using a probabilistic oracle. We introduce a theoretical framework for analyzing such reasoning algorithms. This framework formalizes the principles underlying popular techniques for iterative improvement and answer aggregation, providing a foundation for designing a new generation of more powerful reasoning methods. Unlike approaches for understanding models that rely on architectural specifics, our model is grounded in experimental evidence. As a result, it offers a general perspective that may extend to a wide range of current and future reasoning oracles.

cs.LG

[114] Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Joey Hong,Kang Liu,Zhan Ling,Jiecao Chen,Sergey Levine

Main category: cs.LG

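TL;DR: The paper proposes Natural Language Actor-Critic (NLAC), an algorithm that trains LLM policies with a generative LLM critic that produces natural language rather than scalar values, targeting the sparse-reward and exploration difficulties of long-horizon tasks.

Details Motivation: In long-horizon tasks, sparse rewards lead to unstable training and high sample complexity, and exploration is difficult when actions lie in natural language space. Existing policy gradient methods learn from noisy trajectory-level rewards.

Contribution: The NLAC algorithm, whose generative LLM critic supplies the training signal in natural language, reducing reliance on random exploration and scalar rewards and aiming for better data efficiency and stability.

Method: An actor-critic algorithm in which the critic generates natural language explanations instead of scalar values and the actor improves its policy from them. Training is off-policy and does not use policy gradients.

Result: On a mixture of reasoning, web browsing, and tool-use with dialogue tasks, NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm.

Insight: Natural-language critic signals can guide LLM policy improvement more effectively, especially in open-ended action spaces, where an explanation of why an action is suboptimal indicates how to improve it.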

Abstract: Large language model (LLM) agents – LLMs that dynamically interact with an environment over long horizons – have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.

[115] MemLoRA: Distilling Expert Adapters for On-Device Memory Systems

Massimo Bini,Ondrej Bohdal,Umberto Michieli,Zeynep Akata,Mete Ozay,Taha Ceritli

Main category: cs.LG

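TL;DR: MemLoRA is a memory system that equips small language models (SLMs) with specialized memory adapters for local deployment. Its extension MemLoRA-V adds visual capabilities, and the two perform strongly on text and visual tasks.

Details Motivation: Memory systems built on LLMs are too costly for local on-device deployment and lack native visual capabilities, while SLMs suit on-device inference but do not perform well enough.

Contribution: (1) MemLoRA, which enables on-device memory operations on SLMs through specialized adapters; (2) MemLoRA-V, which integrates small Vision-Language Models (SVLMs) for native visual understanding.

Method: Following knowledge distillation principles, a separate adapter is trained for each memory operation (knowledge extraction, memory update, and memory-augmented generation). MemLoRA-V further integrates SVLMs.

Result: On text-only operations, MemLoRA outperforms baseline models 10× larger and is comparable to models 60× larger on the LoCoMo benchmark. On visual tasks, MemLoRA-V far outperforms caption-based approaches (81.3 vs. 23.7 accuracy).

Insight: Adapters can substantially improve the memory-operation ability of small models, and combining them with multimodal models extends that ability to visual tasks.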

Abstract: Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) into memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for specific memory operations: knowledge extraction, memory update, and memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10× larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60× larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.

[116] CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent

Leyang Shen,Yang Zhang,Chun Kai Ling,Xiaoyan Zhao,Tat-Seng Chua

Main category: cs.LG

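TL;DR: CARL is a reinforcement learning algorithm for multi-step agents that focuses training on critical actions: it provides optimization signals for high-criticality actions and excludes low-criticality ones, improving both performance and efficiency.

Details Motivation: Conventional group-level policy optimization for multi-step settings assumes that every action contributes equally, which deviates from reality: only a small fraction of actions determine the final outcome. CARL targets this mismatch.

Contribution: The CARL algorithm, which distinguishes critical from non-critical actions to make training and inference more efficient and to improve the performance of multi-step agents.

Method: CARL estimates the criticality of each action, provides action-level optimization signals for high-criticality actions, and excludes low-criticality actions from model updates.

Result: Across diverse evaluation settings, CARL achieves stronger performance and higher efficiency during training and inference.

Insight: For multi-step agents, separating critical from non-critical actions is central to performance, and optimization signals should concentrate on the critical ones.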

Abstract: Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action contributes equally, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training by providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from model updates. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.

[117] Multi-LLM Collaboration for Medication Recommendation

Huascar Sanchez,Briland Hitaj,Jules Bergmann,Linda Briesemeister

Main category: cs.LG

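TL;DR: The paper proposes a multi-LLM collaboration framework based on LLM Chemistry to improve the reliability and consistency of clinical medication recommendations.

Details Motivation: Individual LLMs are prone to hallucination and inconsistency, and naive model ensembles often fail to deliver stable, credible recommendations. Clinical decision support needs a more reliable approach.

Contribution: (1) A multi-LLM collaboration framework based on LLM Chemistry; (2) use of quantified collaborative compatibility among LLMs to improve reliability and consistency; (3) an evaluation on real-world clinical scenarios.

Method: Chemistry-inspired interaction modeling guides multi-LLM collaboration to build ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification).

Result: Preliminary results suggest that LLM Chemistry-guided collaboration may produce credible, patient-specific medication recommendations for clinical practice.

Insight: Modeling collaborative compatibility among LLMs is a promising direction toward reliable AI assistants in clinical practice.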

Abstract: As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability in medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based Multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.

[118] Studying Various Activation Functions and Non-IID Data for Machine Learning Model Robustness

Long Dang,Thushari Hapuarachchi,Kaiqi Xiong,Jing Lin

Main category: cs.LG

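TL;DR: The paper studies how ten activation functions affect model robustness under adversarial training, proposes an improved adversarial training approach, and shows the value of data sharing in federated learning.

Details Motivation: Most existing studies consider the ReLU activation function and centralized training. Few examine other activation functions or model robustness in federated learning. This paper aims to fill that gap.

Contribution: (1) An adversarial training approach that combines model architecture changes, soft labeling, simplified data augmentation, and varying learning rates; (2) a systematic comparison of ten activation functions; (3) evidence that data sharing is effective for non-IID data in federated learning.

Method: (1) The improved adversarial training approach; (2) experiments on ten activation functions in a centralized environment; (3) a data sharing mechanism in the federated environment to handle non-IID data.

Result: (1) In the centralized environment, the approach achieves natural and robust accuracy of 77.08% and 67.96% on CIFAR-10; (2) ReLU usually performs best; (3) in federated learning, 40% data sharing raises natural and robust accuracy to 70.09% and 54.79%.

Insight: (1) Data sharing can significantly improve robustness on non-IID data in federated learning; (2) ReLU remains the best choice in most settings; (3) techniques such as varying learning rates matter for adversarial training.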

Abstract: Adversarial training is an effective method to improve machine learning (ML) model robustness. Most existing studies consider the Rectified Linear Unit (ReLU) activation function and centralized training environments. In this paper, we study ML model robustness using ten different activation functions through adversarial training in centralized environments and explore ML model robustness in federated learning environments. In the centralized environment, we first propose an advanced adversarial training approach to improve ML model robustness by incorporating model architecture change, soft labeling, simplified data augmentation, and varying learning rates. Then, we conduct extensive experiments on ten well-known activation functions in addition to ReLU to better understand how they impact ML model robustness. Furthermore, we extend the proposed adversarial training approach to the federated learning environment, where both independent and identically distributed (IID) and non-IID data settings are considered. Our proposed centralized adversarial training approach achieves a natural and robust accuracy of 77.08% and 67.96%, respectively, on CIFAR-10 against fast gradient sign attacks. Experiments on ten activation functions reveal that ReLU usually performs best. In the federated learning environment, however, the robust accuracy decreases significantly, especially on non-IID data. To address the significant performance drop in the non-IID data case, we introduce data sharing and achieve a natural and robust accuracy of 70.09% and 54.79%, respectively, surpassing the CalFAT algorithm, when 40% data sharing is used. That is, a proper percentage of data sharing can significantly improve ML model robustness, which is useful in some real-world applications.

[119] Feature Engineering vs. Deep Learning for Automated Coin Grading: A Comparative Study on Saint-Gaudens Double Eagles

Tanmay Dogra,Eric Ngo,Mohammad Alam,Jean-Paul Talavera,Asim Dahal

Main category: cs.LG

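TL;DR: By comparing feature engineering with deep learning on automated coin grading, the paper shows that feature engineering informed by domain knowledge outperforms deep learning when samples are few and classes are imbalanced.

Details Motivation: The authors challenge the common assumption that deep learning always beats older techniques, using automated coin grading as a case where data is scarce and classes are imbalanced.

Contribution: Experimental evidence that, with few samples, a feature-engineered approach built on domain knowledge (an ANN using Sobel edge detection and HSV color analysis) outperforms both a deep learning model (CNN) and a traditional baseline (SVM).

Method: Three approaches are compared: (1) an ANN built on 192 custom features; (2) a hybrid CNN that incorporates EfficientNetV2; (3) an SVM as the control. The ANN's features come from Sobel edge detection and HSV color analysis.

Result: On 1,785 expert-graded coins, the ANN reaches 86% exact matches and 98% within a 3-grade tolerance, whereas the CNN and SVM reach only 31% and 30% exact matches. The CNN looks good on broader tolerance metrics but fails at picking out specific grades.

Insight: For tasks with few samples and imbalanced classes, feature engineering informed by domain knowledge can be more effective than end-to-end deep learning, which is relevant to other data-scarce tasks that depend on expert knowledge.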

Abstract: We challenge the common belief that deep learning always trumps older techniques, using the example of grading Saint-Gaudens Double Eagle gold coins automatically. In our work, we put a feature-based Artificial Neural Network built around 192 custom features pulled from Sobel edge detection and HSV color analysis up against a hybrid Convolutional Neural Network that blends in EfficientNetV2, plus a straightforward Support Vector Machine as the control. Testing 1,785 coins graded by experts, the ANN nailed 86% exact matches and hit 98% when allowing a 3-grade leeway. On the flip side, CNN and SVM mostly just guessed the most common grade, scraping by with 31% and 30% exact hits. Sure, the CNN looked good on broader tolerance metrics, but that is because of some averaging trick in regression that hides how it totally flops at picking out specific grades. All told, when you are stuck with under 2,000 examples and lopsided classes, baking in real coin-expert knowledge through feature design beats out those inscrutable, all-in-one deep learning setups. This rings true for other niche quality checks where data’s thin and know-how matters more than raw compute.
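
The two headline metrics, exact-match rate and accuracy within a 3-grade leeway, can be computed as in the sketch below. It assumes grades are numeric and that "leeway" means absolute difference in grade points. The helper name `tolerance_accuracy` is hypothetical, and the grades are made-up illustrative values, not data from the study.

```python
def tolerance_accuracy(predicted, actual, tolerance=0):
    """Fraction of predictions within `tolerance` grade points of the expert grade."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(actual)

# Made-up numeric grades for illustration only.
expert = [63, 64, 65, 62, 66]
model = [63, 65, 65, 58, 66]

exact = tolerance_accuracy(model, expert)        # tolerance 0: exact matches only
within3 = tolerance_accuracy(model, expert, 3)   # allow a 3-grade leeway
```

Reporting both numbers matters for the reason the abstract gives: a model whose predictions cluster around the most common grade can score well under a wide tolerance while rarely matching the exact grade.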

[120] Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective

Bowen Zheng,Ran Cheng

Main category: cs.LG

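TL;DR: The paper revisits Decoupled Knowledge Distillation (DKD) from a predictive distribution perspective, proposes a Generalized Decoupled Knowledge Distillation (GDKD) loss, and reports superior performance on several benchmarks.

Details Motivation: DKD marked a significant advance in knowledge distillation, but its underlying mechanisms merit deeper exploration. Starting from the predictive distribution, the paper aims to improve DKD's decoupling and weighting strategies.

Contribution: (1) The GDKD loss, which generalizes DKD's way of decoupling logits; (2) an analysis of how the teacher's predictive distribution affects the gradients of the GDKD loss; (3) an efficient partition strategy for handling the multimodality of the teacher's predictive distribution.

Method: (1) Introduce the GDKD loss as a more versatile way to decouple logits; (2) analyze the teacher model's predictive distribution and its effect on the loss gradients; (3) design an efficient partition strategy for multimodal predictive distributions.

Result: On CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, GDKD outperforms the original DKD and other leading knowledge distillation methods.

Insight: (1) Partitioning by the top logit considerably improves the interrelationship of non-top logits; (2) amplifying the focus on the distillation loss of non-top logits improves knowledge extraction among them.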

Abstract: In the history of knowledge distillation, the focus once shifted from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. In response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. Then we pay particular attention to the teacher model’s predictive distribution and its impact on the gradients of the GDKD loss, uncovering two critical insights often overlooked: (1) the partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Utilizing these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models’ predictive distribution. Our comprehensive experiments conducted on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD’s superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.

[121] TV2TV: A Unified Framework for Interleaved Language and Video Generation

Xiaochuang Han,Youssef Emad,Melissa Hall,John Nguyen,Karthik Padthe,Liam Robbins,Amir Bar,Delong Chen,Michal Drozdzal,Maha Elbayad,Yushi Hu,Shang-Wen Li,Sreya Dutta Roy,Jakob Verbeek,XuDong Wang,Marjan Ghazvininejad,Luke Zettlemoyer,Emily Dinan

Main category: cs.LG

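TL;DR: TV2TV is a unified video generation framework that interleaves text and video-frame generation, using the reasoning ability of language models to improve the quality and controllability of generated video.

Details Motivation: Existing video generation models struggle with outputs that require significant semantic branching or repeated high-level reasoning. TV2TV addresses this by bringing in language-model reasoning.

Contribution: The TV2TV framework, which decomposes video generation into an interleaved text and video generation process and jointly trains language modeling and video flow matching on a Mixture-of-Transformers (MoT) architecture.

Method: A Mixture-of-Transformers architecture alternates between generating text (language modeling, next-token prediction) and video frames (flow matching, next-frame prediction), letting the model reason in text about upcoming content before generating frames.

Result: On video game data, TV2TV substantially improves visual quality and controllability. It also scales to natural videos such as sports footage, yielding strong visual quality and prompt alignment.

Insight: TV2TV shows the potential of combining language-model reasoning with video generation and points toward video generation with open-ended textual reasoning and control.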

Abstract: Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to “think in words” about subsequent content before “acting in pixels” to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model’s ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

[122] Value Gradient Guidance for Flow Matching Alignment

Zhen Liu,Tim Z. Xiao,Carles Domingo-Enrich,Weiyang Liu,Dinghuai Zhang

Main category: cs.LG

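TL;DR: The paper proposes VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. Drawing on optimal control theory, it matches the difference between the finetuned and pretrained velocity fields to the gradient field of a value function, achieving efficient, prior-preserving alignment.

Details Motivation: Existing alignment methods for flow matching models fail to achieve both adaptation efficiency and probabilistically sound prior preservation. The authors address this with an approach based on optimal control theory.

Contribution: The VGG-Flow method, which finetunes efficiently while preserving the prior by matching the velocity-field difference to the gradient field of a value function.

Method: Using optimal control theory, the difference between the finetuned velocity field and the pretrained one is matched to the gradient field of a value function, and the value function is initialized heuristically to speed up adaptation.

Result: Experiments show that the method can finetune Stable Diffusion 3 under limited computational budgets while achieving effective, prior-preserving alignment.

Insight: Combining optimal control theory with heuristic initialization of the value function can improve both the adaptation efficiency and the prior preservation of flow matching models.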

Abstract: While methods exist for aligning flow matching models (a popular and effective class of generative models) with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.

[123] The Universal Weight Subspace Hypothesis

Prakhar Kaushik,Shravan Chaudhari,Ankit Vaidya,Rama Chellappa,Alan Yuille

Main category: cs.LG

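TL;DR: The paper proposes a new hypothesis, that deep neural networks trained on diverse tasks converge to similar low-dimensional parameter subspaces, and supports it with large-scale experiments.

Details Motivation: Deep neural networks have highly complex parameter spaces whose intrinsic structure is poorly understood. The paper asks whether universal low-dimensional subspaces exist that capture what different tasks and models have in common.

Contribution: The Universal Weight Subspace Hypothesis, supported by large-scale experiments that find low-dimensional universal subspaces shared across tasks and models.

Method: Mode-wise spectral analysis of the weight matrices of over 1100 models (including Mistral-7B LoRAs, Vision Transformers, and LLaMA-8B models), using spectral decomposition to identify shared low-dimensional subspaces.

Result: Networks converge to similar subspaces regardless of task and initialization, and a few principal directions capture the majority of the variance in these subspaces.

Insight: The finding offers new insight into the intrinsic structure of deep networks and has implications for model reuse, multi-task learning, and efficient training algorithms, potentially reducing the carbon footprint of large-scale models.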

Abstract: We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models (including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models), we identify universal subspaces capturing the majority of the variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.

cs.CR

[124] Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Jinbo Liu,Defu Cao,Yifei Wei,Tianyao Su,Yuan Liang,Yushun Dong,Yue Zhao,Xiyang Hu

Main category: cs.CR

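TL;DR: The paper introduces MAMA, a framework for quantifying how graph topology affects memory leakage in multi-agent LLM systems, showing how network structure shapes privacy risk.

Details Motivation: Graph topology is a fundamental determinant of privacy leakage in multi-agent LLM systems, yet its effects have been poorly quantified. The paper fills this gap.

Contribution: (1) The MAMA framework, which systematically measures the effect of network topology on memory leakage; (2) a characterization of how different topologies affect privacy leakage.

Method: A two-phase protocol (Engram and Resonance) runs on synthetic documents with labeled PII and measures leakage over multi-round interaction. Six common network topologies are evaluated.

Result: Fully connected graphs leak the most and chains protect best. Leakage increases as the attacker gets closer to the target and as the target becomes more central.

Insight: Sparse or hierarchical connectivity, maximizing attacker-target separation, limiting node degree and network radius, and topology-aware access controls can reduce privacy risk.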

Abstract: Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent’s memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over up to 10 interaction rounds, we quantify leakage as the fraction of ground-truth PII recovered from attacking agent outputs via exact matching. We systematically evaluate six common network topologies (fully connected, ring, chain, binary tree, star, and star-ring), varying agent counts $n\in\{4,5,6\}$, attacker-target placements, and base models. Our findings reveal consistent patterns: fully connected graphs exhibit maximum leakage while chains provide strongest protection; shorter attacker-target graph distance and higher target centrality significantly increase vulnerability; leakage rises sharply in early rounds before plateauing; model choice shifts absolute leakage rates but preserves topology rankings; temporal/locational PII attributes leak more readily than identity credentials or regulated identifiers. These results provide the first systematic mapping from architectural choices to measurable privacy risk, yielding actionable guidance: prefer sparse or hierarchical connectivity, maximize attacker-target separation, limit node degree and network radius, avoid shortcuts bypassing hubs, and implement topology-aware access controls.
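
The leakage metric and the attacker-target distance mentioned in the abstract can be sketched as follows. This is an illustration under assumptions, not MAMA's implementation: "exact matching" is taken here to mean verbatim substring match, the helper names `leakage_fraction` and `graph_distance` are hypothetical, and the PII strings and outputs are made up.

```python
from collections import deque

def leakage_fraction(pii_values, attacker_outputs):
    """Fraction of ground-truth PII strings that appear verbatim in any output."""
    if not pii_values:
        return 0.0
    recovered = sum(
        1 for value in pii_values
        if any(value in output for output in attacker_outputs)
    )
    return recovered / len(pii_values)

def graph_distance(edges, source, target):
    """Shortest hop count between two agents, found by breadth-first search."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # target is unreachable from source

# Two of the six topologies, built for n = 5 agents numbered 0..4.
n = 5
chain = [(i, i + 1) for i in range(n - 1)]
full = [(i, j) for i in range(n) for j in range(i + 1, n)]

# Made-up PII and attacker outputs for illustration only.
pii = ["Alice Example", "2024-03-01", "Springfield"]
outputs = ["The meeting is on 2024-03-01.", "It takes place in Springfield."]
leak = leakage_fraction(pii, outputs)
```

In the chain, agents 0 and 4 are four hops apart, whereas in the fully connected graph every pair is one hop apart, which is consistent with the reported finding that shorter attacker-target distance increases leakage.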

q-bio.NC

[125] Human-Centred Evaluation of Text-to-Image Generation Models for Self-expression of Mental Distress: A Dataset Based on GPT-4o

Sui He,Shenbin Qian

Main category: q-bio.NC

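TL;DR: The study evaluates how well GPT-4o-generated images help Chinese international students express mental distress, finds that prompt design substantially affects perceived helpfulness, and releases a public dataset of textual descriptions, generated images, and human ratings.

Details Motivation: International students often face linguistic and cultural barriers when communicating about mental health. The study explores whether AI-generated images can support their self-expression of mental distress.

Contribution: The first publicly available text-to-image evaluation dataset with human judgment scores in the mental health domain, a resource for multi-modal research on mental health communication.

Method: Twenty Chinese international students described their experiences of mental distress. GPT-4o generated images from these descriptions using four persona-based prompt templates, and participants rated how helpful the images were.

Result: Prompt design substantially affects the perceived helpfulness of the images, with the illustrator persona receiving the highest ratings.

Insight: Well-designed AI-generated content can support self-expression in mental health contexts, and prompt design is a key factor.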

Abstract: Effective communication is central to achieving positive healthcare outcomes in mental health contexts, yet international students often face linguistic and cultural barriers that hinder their communication of mental distress. In this study, we evaluate the effectiveness of AI-generated images in supporting self-expression of mental distress. To achieve this, twenty Chinese international students studying at UK universities were invited to describe their personal experiences of mental distress. These descriptions were elaborated using GPT-4o with four persona-based prompt templates rooted in contemporary counselling practice to generate corresponding images. Participants then evaluated the helpfulness of generated images in facilitating the expression of their feelings based on their original descriptions. The resulting dataset comprises 100 textual descriptions of mental distress, 400 generated images, and corresponding human evaluation scores. Findings indicate that prompt design substantially affects perceived helpfulness, with the illustrator persona achieving the highest ratings. This work introduces the first publicly available text-to-image evaluation dataset with human judgment scores in the mental health domain, offering valuable resources for image evaluation, reinforcement learning with human feedback, and multi-modal research on mental health communication.

cs.SD

[126] Shared Multi-modal Embedding Space for Face-Voice Association

Christopher Simic,Korbinian Riedhammer,Tobias Bocklet

Main category: cs.SD

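TL;DR: The paper proposes a shared multi-modal embedding space for associating faces and voices and achieves top results in a multilingual setting.

Details Motivation: Face-voice association is challenging, especially in multilingual settings that require generalization to languages not seen in training.

Contribution: A training framework built on a shared embedding space that combines general feature extraction with age-gender feature extraction and uses an Adaptive Angular Margin (AAM) loss.

Method: Separate uni-modal pipelines extract features, which are projected into a shared embedding space and trained with the AAM loss.

Result: First place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.

Insight: A shared multi-modal embedding space can improve generalization to unseen languages, and combining general features with auxiliary age-gender features further improves performance.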

Abstract: The FAME 2026 challenge comprises two demanding tasks: training face-voice associations combined with a multilingual setting that includes testing on languages on which the model was not trained. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting single-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
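
The Equal-Error Rate reported above is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping thresholds over the observed scores. It is not the challenge's official scoring code: it assumes that a higher score means a face-voice pair is more likely the same identity, the function name `equal_error_rate` is hypothetical, and the scores are made up.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False rejection: matching pairs scored below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # False acceptance: non-matching pairs scored at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Made-up similarity scores for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]    # same-identity face-voice pairs
impostor = [0.6, 0.3, 0.2, 0.1]   # different-identity pairs
eer = equal_error_rate(genuine, impostor)
```

With finite score sets the two error curves may not cross exactly, so the sketch reports the midpoint of the two rates at the closest threshold. Production toolkits typically interpolate the ROC curve instead.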