cs.CL

[1] PHRASED: Phrase Dictionary Biasing for Speech Translation

Peidong Wang,Jian Xue,Rui Zhao,Junkun Chen,Aswin Shanmugam Subramanian,Jinyu Li

Main category: cs.CL

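TL;DR: The paper proposes a phrase dictionary biasing method that improves phrase translation accuracy in speech translation, and validates it on a streaming speech translation model and a multimodal large language model.

Motivation: Phrases occur rarely in training data, which makes translating them correctly in speech translation tasks difficult. The paper proposes phrase dictionary biasing to address this.

Contribution: A new phrase dictionary biasing method that exploits pairs of phrases mapped from the source language to the target language to improve translation quality.

Method: The method is applied to two model types, a transducer-based streaming speech translation model and a multimodal large language model, and improves translation by supplying external phrase information.

Result: Phrase dictionary biasing outperforms phrase list biasing by 21% relative on the streaming speech translation model and yields an 85% relative improvement in phrase recall for the multimodal large language model.

Insight: Phrase dictionary biasing is not limited to conventional speech translation models. It also lets large language models use external phrase information, which suggests a route to optimizing models with external knowledge.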

Abstract: Phrases are essential to understand the core concepts in conversations.
However, due to their rare occurrence in training data, correct translation of
phrases is challenging in speech translation tasks. In this paper, we propose a
phrase dictionary biasing method to leverage pairs of phrases mapping from the
source language to the target language. We apply the phrase dictionary biasing
method to two types of widely adopted models, a transducer-based streaming
speech translation model and a multimodal large language model. Experimental
results show that the phrase dictionary biasing method outperforms phrase list
biasing by 21% relatively for the streaming speech translation model. In
addition, phrase dictionary biasing enables multimodal large language models to
use external phrase information, achieving 85% relative improvement in phrase
recall.
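
The two headline numbers above are relative gains in a phrase-level metric. The sketch below shows one plausible way to compute phrase recall and a relative improvement. The metric definition (case-insensitive verbatim match), the function names, and the toy English-to-German data are assumptions for illustration, not the paper's evaluation code or results.

```python
def phrase_recall(hypotheses, expected_phrases):
    """Fraction of expected target phrases found verbatim in the paired hypothesis.

    Assumed definition for illustration; the paper may score recall differently.
    """
    hits = total = 0
    for hyp, phrases in zip(hypotheses, expected_phrases):
        for phrase in phrases:
            total += 1
            hits += phrase.lower() in hyp.lower()
    return hits / total if total else 0.0


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, e.g. 0.85 means 85% relative."""
    return (improved - baseline) / baseline


# Hypothetical phrase dictionary: source phrase -> target phrase.
phrase_dictionary = {"heat pump": "Wärmepumpe", "board meeting": "Vorstandssitzung"}
expected = [[phrase_dictionary["heat pump"]], [phrase_dictionary["board meeting"]]]

# Toy system outputs without and with biasing (invented for this example).
baseline_output = ["Die Hitzepumpe ist kaputt.", "Die Vorstandssitzung beginnt um neun."]
biased_output = ["Die Wärmepumpe ist kaputt.", "Die Vorstandssitzung beginnt um neun."]

baseline_recall = phrase_recall(baseline_output, expected)
biased_recall = phrase_recall(biased_output, expected)
gain = relative_improvement(baseline_recall, biased_recall)
print(baseline_recall, biased_recall, gain)
```

A relative figure depends on the baseline: the same absolute gain in recall looks larger when the unbiased model starts from a low recall.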

[2] Extrapolation by Association: Length Generalization Transfer in Transformers

Ziyang Cai,Nayoung Lee,Avi Schwarzschild,Samet Oymak,Dimitris Papailiopoulos

Main category: cs.CL

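TL;DR: The paper studies length generalization in Transformer language models and shows, through task association, that a model can inherit generalization ability from a related task and thereby extrapolate from shorter to longer inputs.

Motivation: Transformers generalize well in natural language domains, but how they achieve length generalization (extrapolating from short inputs to long ones) is not well understood. The paper examines the question through the lens of task association.

Contribution: 1. Proposes and verifies that length generalization can transfer between related tasks; 2. Demonstrates the effect across diverse algorithmic tasks; 3. Shows similar transfer effects in pretrained language models; 4. Gives an initial mechanistic explanation in terms of attention heads.

Method: Experiments cover several algorithmic tasks (arithmetic operations, string transformations, maze navigation). Models are trained jointly with a longer, related auxiliary task, and the effect on generalization in the target task is measured. The analysis also examines attention-head reuse between tasks.

Result: Models inherit length generalization from related tasks when trained jointly, and pretrained models show similar transfer effects. Length generalization transfer correlates with reuse of the same attention heads across tasks.

Insight: Transformer generalization can arise through associations between tasks, which suggests that models hold reusable computational structure and compose it across tasks. This offers a new perspective on how models extrapolate.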

Abstract: Transformer language models have demonstrated impressive generalization
capabilities in natural language domains, yet we lack a fine-grained
understanding of how such generalization arises. In this paper, we investigate
length generalization (the ability to extrapolate from shorter to longer
inputs) through the lens of task association. We find that length
generalization can be transferred across related tasks. That is,
training a model with a longer and related auxiliary task can lead it to
generalize to unseen and longer inputs from some other target task. We
demonstrate this length generalization transfer across diverse algorithmic
tasks, including arithmetic operations, string transformations, and maze
navigation. Our results show that transformer models can inherit generalization
capabilities from similar tasks when trained jointly. Moreover, we observe
similar transfer effects in pretrained language models, suggesting that
pretraining equips models with reusable computational scaffolding that
facilitates extrapolation in downstream settings. Finally, we provide initial
mechanistic evidence that length generalization transfer correlates with the
re-use of the same attention heads between the tasks. Together, our findings
deepen our understanding of how transformers generalize to out-of-distribution
inputs and highlight the compositional reuse of inductive structure across
tasks.

[3] Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat

Zhuofang Li,Rafal Kocielnik,Fereshteh Soltani,Penphob Boonyarungsrit,Animashree Anandkumar,R. Michael Alvarez

Main category: cs.CL

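TL;DR: The paper proposes a novel Self-Anchored Attention Model (SAAM) for sample-efficient classification of prosocial text in game chat under low-resource conditions, improving on the best existing technique by 7.9%.

Motivation: Online game chat contains a great deal of prosocial content, yet prior research has focused mainly on detecting relatively small volumes of toxic content. Identifying and promoting prosocial behavior matters for encouraging positive interaction, but datasets and models for it are scarce.

Contribution: 1) Working with game domain experts, identifies and categorizes prosocial behaviors in game chat for the first time; 2) Proposes SAAM, which improves performance when training data is scarce; 3) Develops the first automated system for classifying prosocial behaviors in game chat.

Method: Prosocial behaviors are identified through unsupervised discovery combined with expert collaboration. SAAM then uses the entire training set as "anchors" to improve performance under data scarcity.

Result: SAAM improves on the best existing technique by 7.9% on prosocial behavior classification and was applied to Call of Duty: Modern Warfare II.

Insight: The work points to a shift in moderation from solely penalizing toxicity to actively encouraging positive interactions, and shows that NLP can be applied in low-resource settings.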

Abstract: Millions of players engage daily in competitive online games, communicating
through in-game chat. Prior research has focused on detecting relatively small
volumes of toxic content using various Natural Language Processing (NLP)
techniques for the purpose of moderation. However, recent studies emphasize the
importance of detecting prosocial communication, which can be as crucial as
identifying toxic interactions. Recognizing prosocial behavior allows for its
analysis, rewarding, and promotion. Unlike toxicity, there are limited
datasets, models, and resources for identifying prosocial behaviors in
game-chat text. In this work, we employed unsupervised discovery combined with
game domain expert collaboration to identify and categorize prosocial player
behaviors from game chat. We further propose a novel Self-Anchored Attention
Model (SAAM) which gives 7.9% improvement compared to the best existing
technique. The approach utilizes the entire training set as “anchors” to help
improve model performance under the scarcity of training data. This approach
led to the development of the first automated system for classifying prosocial
behaviors in in-game chats, particularly given the low-resource settings where
large-scale labeled data is not available. Our methodology was applied to one
of the most popular online gaming titles, Call of Duty®: Modern
Warfare® II, showcasing its effectiveness. This research is novel in applying
NLP techniques to discover and classify prosocial behaviors in player in-game
chat communication. It can help shift the focus of moderation from solely
penalizing toxicity to actively encouraging positive interactions on online
platforms.

[4] Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Milan Bhan,Jean-Noel Vittaut,Nicolas Chesneau,Sarath Chandar,Marie-Jeanne Lesot

Main category: cs.CL

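TL;DR: The paper proposes a novel framework that quantitatively measures the faithfulness of LLM-generated self explanations (self-NLE) by comparing them with interpretations of the model's internal hidden states, establishing a direct connection between self-NLE and the model's reasoning.

Motivation: Existing methods measure self-NLE faithfulness mostly through behavioral tests or computational block identification, and none of them examines the model's neural activity. The paper fills this gap by analyzing hidden states directly.

Contribution: A flexible framework that quantifies self-NLE faithfulness by comparing self explanations with hidden-state interpretations, providing building blocks for generating more faithful explanations.

Method: The model's internal hidden states are interpreted to obtain an explanation grounded in neural activity, which is then compared with the model's own self-NLE to quantify faithfulness.

Result: The framework establishes a direct connection between self-NLE and model reasoning and gives a new view of self-NLE faithfulness.

Insight: A self explanation can look logically sound without reflecting the model's actual decision-making process. Analyzing neural activity directly helps expose this unfaithfulness.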

Abstract: Large Language Models (LLMs) have demonstrated the capability of generating
free text self Natural Language Explanation (self-NLE) to justify their
answers. Despite their logical appearance, self-NLE do not necessarily reflect
the LLM’s actual decision-making process, making such explanations unfaithful.
While existing methods for measuring self-NLE faithfulness mostly rely on
behavioral tests or computational block identification, none of them examines
the neural activity underlying the model’s reasoning. This work introduces a
novel flexible framework for quantitatively measuring the faithfulness of
LLM-generated self-NLE by directly comparing the latter with interpretations of
the model’s internal hidden states. The proposed framework is versatile and
provides deep insights into self-NLE faithfulness by establishing a direct
connection between self-NLE and model reasoning. This approach advances the
understanding of self-NLE faithfulness and provides building blocks for
generating more faithful self-NLE.

[5] $(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding

Cesare Spinoso-Di Piano,David Austin,Pablo Piantanida,Jackie Chi Kit Cheung

Main category: cs.CL

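TL;DR: The paper proposes $(RSA)^2$, a new RSA framework that interprets figurative language (such as irony and hyperbole) by considering the speaker's rhetorical strategy, without modeling the speaker's motivations for being non-literal, and achieves state-of-the-art performance on the new PragMega+ dataset.

Motivation: Figurative language (such as irony and hyperbole) is ubiquitous in human communication, and its literal meaning does not match the intended one. Existing RSA implementations either cannot handle it or require modeling the speaker's motivations in a setting-specific way, so a more general approach is needed.

Contribution: Proposes the $(RSA)^2$ framework, which introduces the notion of rhetorical strategy to overcome the limits of existing RSA implementations on figurative language, and demonstrates its performance on irony interpretation.

Method: $(RSA)^2$ extends the standard RSA model with a model of the speaker's rhetorical strategy, so it can capture non-literal intent without setting-specific accounts of motivation.

Result: Combined with large language models (LLMs), $(RSA)^2$ achieves state-of-the-art performance on the ironic split of PragMega+.

Insight: Modeling rhetorical strategy rather than the speaker's specific motivations gives a more general way to interpret figurative language, which matters for capturing complex intent in natural language processing tasks.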

Abstract: Figurative language (e.g., irony, hyperbole, understatement) is ubiquitous in
human communication, resulting in utterances where the literal and the intended
meanings do not match. The Rational Speech Act (RSA) framework, which
explicitly models speaker intentions, is the most widespread theory of
probabilistic pragmatics, but existing implementations are either unable to
account for figurative expressions or require modeling the implicit motivations
for using figurative language (e.g., to express joy or annoyance) in a
setting-specific way. In this paper, we introduce the Rhetorical-Strategy-Aware
RSA $(RSA)^2$ framework which models figurative language use by considering a
speaker’s employed rhetorical strategy. We show that $(RSA)^2$ enables
human-compatible interpretations of non-literal utterances without modeling a
speaker’s motivations for being non-literal. Combined with LLMs, it achieves
state-of-the-art performance on the ironic split of PragMega+, a new irony
interpretation dataset introduced in this study.
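
For readers unfamiliar with the base framework, the sketch below runs the standard RSA recursion (literal listener, pragmatic speaker, pragmatic listener) on the textbook "some"/"all" example. It is not the $(RSA)^2$ extension, whose exact formulation the abstract does not give. The lexicon, prior, and rationality parameter `ALPHA` are illustrative assumptions.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1 (all zeros stay zero)."""
    total = sum(weights.values())
    return {k: (v / total if total else 0.0) for k, v in weights.items()}


MEANINGS = ["some_not_all", "all"]
UTTERANCES = ["some", "all"]
# Truth-conditional lexicon: 1.0 if the utterance is literally true of the meaning.
LEXICON = {
    "some": {"some_not_all": 1.0, "all": 1.0},
    "all": {"some_not_all": 0.0, "all": 1.0},
}
PRIOR = {"some_not_all": 0.5, "all": 0.5}
ALPHA = 1.0  # speaker rationality (assumed value); utterance costs are omitted


def literal_listener(utterance):
    """L0(m | u) is proportional to [[u]](m) * P(m)."""
    return normalize({m: LEXICON[utterance][m] * PRIOR[m] for m in MEANINGS})


def pragmatic_speaker(meaning):
    """S1(u | m) is proportional to L0(m | u) ** alpha."""
    return normalize({u: literal_listener(u)[meaning] ** ALPHA for u in UTTERANCES})


def pragmatic_listener(utterance):
    """L1(m | u) is proportional to S1(u | m) * P(m)."""
    return normalize({m: pragmatic_speaker(m)[utterance] * PRIOR[m] for m in MEANINGS})


posterior = pragmatic_listener("some")
print(posterior)
```

The pragmatic listener shifts probability toward "some but not all" on hearing "some", because a speaker who meant "all" would more likely have said so. A rhetorical-strategy-aware variant would add a further variable to this recursion.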

[6] Towards Efficient and Effective Alignment of Large Language Models

Yuxin Jiang

Main category: cs.CL

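TL;DR: The thesis presents a set of methods for aligning large language models efficiently and effectively, with new techniques for data collection, training, and evaluation.

Motivation: Large language models (LLMs) perform well across diverse tasks, but aligning them precisely with human expectations remains a key challenge. Existing approaches to data collection, training, and evaluation each have limitations.

Contribution: 1. Proposes Lion, an adversarial distillation framework that iteratively refines training data; 2. Develops WebR, which generates diverse instruction-tuning data directly from web documents; 3. Designs LTE for efficient knowledge updates; 4. Refines DPO into BMC, which explicitly models token-level correlations in preference data; 5. Introduces the FollowBench benchmark for evaluating adherence to complex constraints.

Method: 1. Data collection: Lion generates high-quality data through adversarial distillation, and WebR synthesizes data automatically from web documents; 2. Training: LTE uses meta-learning for efficient knowledge updates, and BMC refines DPO to capture token-level correlations; 3. Evaluation: FollowBench provides a multi-level, fine-grained assessment of constraint following.

Result: The methods improve zero-shot reasoning, data diversity, and alignment across QA and mathematical reasoning tasks. FollowBench exposes key weaknesses in current models' constraint adherence.

Insight: Aligning large language models requires progress on several fronts at once, from data generation through training to evaluation. Evaluating constraint adherence is an important direction for future improvement.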

Abstract: Large language models (LLMs) exhibit remarkable capabilities across diverse
tasks, yet aligning them efficiently and effectively with human expectations
remains a critical challenge. This thesis advances LLM alignment by introducing
novel methodologies in data collection, training, and evaluation. We first
address alignment data collection. Existing approaches rely heavily on manually
curated datasets or proprietary models. To overcome these limitations, we
propose Lion, an adversarial distillation framework that iteratively refines
training data by identifying and generating challenging instructions, enabling
state-of-the-art zero-shot reasoning. Additionally, we introduce Web
Reconstruction (WebR), a fully automated framework that synthesizes
instruction-tuning data directly from raw web documents, significantly
improving data diversity and scalability over existing synthetic data methods.
Next, we enhance alignment training through novel optimization techniques. We
develop Learning to Edit (LTE), a framework that enables LLMs to efficiently
integrate new knowledge while preserving existing information. LTE leverages
meta-learning to improve both real-time and batch knowledge updates.
Furthermore, we introduce Bridging and Modeling Correlations (BMC), a
refinement of Direct Preference Optimization (DPO) that explicitly captures
token-level correlations in preference data, leading to superior alignment
across QA and mathematical reasoning tasks. Finally, we tackle the challenge of
evaluating alignment. Existing benchmarks emphasize response quality but
overlook adherence to specific constraints. To bridge this gap, we introduce
FollowBench, a multi-level, fine-grained benchmark assessing LLMs’ ability to
follow complex constraints across diverse instruction types. Our results expose
key weaknesses in current models’ constraint adherence, offering insights for
future improvements.

[7] Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Arjun Vaithilingam Sudhakar

Main category: cs.CL

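TL;DR: The thesis asks whether large language models (LLMs) possess a form of theory of mind, that is, whether they can reason about the intentions of others, and studies the question through cooperative multi-agent reinforcement learning (MARL) with the aim of improving cooperation between AI agents and with humans.

Motivation: Modern LLMs generalize strongly in zero-shot and few-shot settings, but whether they can model and reason about others' intentions (theory of mind) is unclear. This ability is crucial for multi-agent cooperation, especially in human-AI collaboration.

Contribution: Investigates theory of mind in LLMs within a cooperative MARL framework and uses LLM-based agents capable of natural language interaction to improve agents' ability to cooperate and adapt.

Method: LLM-based agents form a multi-agent system in which they learn to collaborate through repeated interactions, mirroring human social reasoning. The study focuses on whether agents can infer others' intentions and cooperate effectively.

Result: The abstract reports no quantitative findings. It presents LLM-based agents with natural language interaction as a step towards hybrid human-AI systems that collaborate seamlessly.

Insight: LLMs are relevant beyond single-agent tasks and may play a role in complex social interaction, with implications for future human-AI collaboration.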

Abstract: Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot
generalization capabilities across complex natural language tasks, enabling
their widespread use as virtual assistants for diverse applications such as
translation and summarization. Despite being trained solely on large corpora of
text without explicit supervision on author intent, LLMs appear to infer the
underlying meaning of textual interactions. This raises a fundamental question:
can LLMs model and reason about the intentions of others, i.e., do they possess
a form of theory of mind? Understanding others’ intentions is crucial for
effective collaboration, which underpins human societal success and is
essential for cooperative interactions among multiple agents, including humans
and autonomous systems. In this work, we investigate the theory of mind in LLMs
through the lens of cooperative multi-agent reinforcement learning (MARL),
where agents learn to collaborate via repeated interactions, mirroring human
social reasoning. Our approach aims to enhance artificial agents’ ability to
adapt and cooperate with both artificial and human partners. By leveraging
LLM-based agents capable of natural language interaction, we move towards
creating hybrid human-AI systems that can foster seamless collaboration, with
broad implications for the future of human-artificial interaction.

[8] RePO: Replay-Enhanced Policy Optimization

Siheng Li,Zhanhui Zhou,Wai Lam,Chao Yang,Chaochao Lu

Main category: cs.CL

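TL;DR: RePO retrieves off-policy samples from a replay buffer using diverse replay strategies, which improves policy optimization for large language models in both data efficiency and performance.

Motivation: GRPO estimates advantages from multiple on-policy samples per prompt, which is computationally expensive and data-inefficient. A more efficient method is needed.

Contribution: Proposes RePO, which uses replay strategies to retrieve off-policy samples from a buffer so that each prompt is optimized on a broader and more diverse set of samples.

Method: Diverse replay strategies retrieve off-policy samples from the replay buffer, and on-policy and off-policy data are combined for optimization.

Result: Across seven mathematical reasoning benchmarks, RePO outperforms GRPO, with absolute average gains of 18.4 points for Qwen2.5-Math-1.5B and 4.1 points for Qwen3-1.7B.

Insight: By combining on-policy and off-policy samples, RePO raises the number of effective optimization steps by 48% for Qwen3-1.7B at the price of a 15% increase in computational cost.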

Abstract: Reinforcement learning (RL) is vital for optimizing large language models
(LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages
using multiple on-policy outputs per prompt, leading to high computational
costs and low data efficiency. To address this, we introduce Replay-Enhanced
Policy Optimization (RePO), which leverages diverse replay strategies to
retrieve off-policy samples from a replay buffer, allowing policy optimization
based on a broader and more diverse set of samples for each prompt. Experiments
on five LLMs across seven mathematical reasoning benchmarks demonstrate that
RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for
Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further
analysis indicates that RePO increases computational cost by $15\%$ while
raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B,
with both on-policy and off-policy sample numbers set to $8$. The repository
can be accessed at https://github.com/SihengLi99/RePO.
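
The idea of mixing on-policy outputs with replayed off-policy ones can be sketched as follows. Group-relative advantages are computed in the usual GRPO style (reward minus group mean, divided by group standard deviation). The `ReplayBuffer` class, its "most recent k" retrieval rule, and the toy rewards are hypothetical; the paper describes several replay strategies that this sketch does not reproduce.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class ReplayBuffer:
    """Per-prompt store of past (response, reward) pairs (hypothetical helper)."""

    def __init__(self, capacity_per_prompt=16):
        self._store = defaultdict(lambda: deque(maxlen=capacity_per_prompt))

    def add(self, prompt, response, reward):
        self._store[prompt].append((response, reward))

    def recent(self, prompt, k):
        """One possible replay strategy: return the k most recent samples."""
        if k <= 0:
            return []
        return list(self._store[prompt])[-k:]


def group_advantages(rewards):
    """Group-relative advantage: (reward - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # no learning signal when all rewards agree
    return [(r - mu) / sigma for r in rewards]


buffer = ReplayBuffer()
prompt = "What is 2 + 2?"
# Samples produced by earlier policies (off-policy from the current policy's view).
buffer.add(prompt, "5", 0.0)
buffer.add(prompt, "4", 1.0)

# Fresh samples from the current policy.
on_policy = [("4", 1.0), ("3", 0.0)]
group = on_policy + buffer.recent(prompt, k=2)
advantages = group_advantages([reward for _, reward in group])
print(advantages)
```

Reusing buffered samples enlarges the group without extra generation, which is where the reported gain in effective optimization steps would come from. A full implementation would also need to correct for the policy mismatch of replayed samples, which is left out here.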

[9] Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models

Jui-Ming Yao,Hao-Yuan Chen,Zi-Xian Tang,Bing-Jia Tan,Sheng-Wei Peng,Bing-Cheng Xie,Shun-Feng Su

Main category: cs.CL

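TL;DR: The paper proposes Token Constraint Decoding (TCD), an inference-time algorithm that makes large language models more robust in noisy settings, with the largest effect on multiple-choice question answering, where it yields gains of up to 39% absolute.

Motivation: Large language models perform well on multiple-choice question answering but are highly sensitive to input perturbations, which degrades performance. The paper aims to improve robustness under noise.

Contribution: Proposes the Token Constraint Decoding (TCD) algorithm, which enforces alignment between token-level predictions to improve robustness. The method is model-agnostic and applies to language models of different sizes.

Method: TCD is an inference-time algorithm that constrains token-level predictions to reduce sensitivity to noise. It can be paired with prompt engineering fixes for further gains.

Result: Experiments on CommonsenseQA, MMLU, and MMLU-Pro show that TCD substantially restores performance degraded by input noise, with gains of up to 39% absolute for weaker models such as Gemma3 1B.

Insight: TCD implicitly regularizes overconfident outputs, which stabilizes models under noise. Different models need different penalty schedules to maximize robustness, which is useful for future research and deployment.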

Abstract: Large Language Models (LLMs) have demonstrated impressive performance on
multiple-choice question answering (MCQA) benchmarks, yet they remain highly
vulnerable to minor input perturbations. In this paper, we introduce and
evaluate Token Constraint Decoding (TCD). This simple yet effective
inference-time algorithm enforces alignment between token-level predictions to
enhance robustness in noisy settings. Through extensive experiments on
CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired
with prompt engineering (PE) fixes, significantly restores performance degraded
by input noise, yielding up to +39% absolute gains for weaker models like
Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly
regularizes overconfident outputs, with different models requiring distinct
penalty schedules to maximize resilience. Our findings establish TCD as a
practical, model-agnostic approach for improving reasoning stability under
real-world imperfections and pave the way for more reliable deployment of LLMs
in safety-critical or user-facing applications.
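
The abstract does not spell out how TCD applies its penalty, so the following is only a generic sketch of penalty-based constrained decoding for multiple-choice answers: tokens outside an allowed answer set have a fixed penalty subtracted from their logits before the argmax. The function name, the flat penalty, and the toy logits are assumptions and may differ from the paper's algorithm.

```python
def constrained_choice(logits, allowed, penalty):
    """Pick the highest-scoring token after penalizing tokens outside `allowed`.

    logits:  dict mapping token string -> raw score
    allowed: set of tokens that count as valid answers
    penalty: amount subtracted from every disallowed token's score
    """
    adjusted = {
        token: (score if token in allowed else score - penalty)
        for token, score in logits.items()
    }
    return max(adjusted, key=adjusted.get)


# Toy next-token scores after a noisy prompt: the model prefers a filler token.
logits = {"A": 1.0, "B": 2.0, "Hmm": 3.0}
allowed = {"A", "B", "C", "D"}

unconstrained = constrained_choice(logits, allowed, penalty=0.0)
constrained = constrained_choice(logits, allowed, penalty=5.0)
print(unconstrained, constrained)
```

Sweeping `penalty` over a range of values is the kind of analysis the abstract calls a penalty sweep: a small penalty leaves the model's preference intact, while a large one forces an answer from the allowed set.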

[10] PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering

Xiujun Zhou,Pingjian Zhang,Deyou Tang

Main category: cs.CL

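TL;DR: The paper proposes PGDA-KGQA, a framework that improves knowledge graph question answering (KGQA) through multiple data augmentation strategies, addressing data scarcity and the shortage of multi-hop reasoning samples and outperforming existing methods.

Motivation: KGQA methods are limited by scarce annotated data and too few multi-hop reasoning samples. Traditional data augmentation is prone to semantic distortion, while LLM-based methods usually neglect multi-hop reasoning. An augmentation method that preserves semantics and adds diversity is needed.

Contribution: PGDA-KGQA uses prompt design to guide an LLM in generating diverse training data through single-hop pseudo question generation, semantic-preserving question rewriting, and answer-guided reverse path exploration, improving the accuracy of logical form generation and answer retrieval.

Method: The framework follows a prompt-guided generative paradigm in which carefully designed prompts make an LLM produce large-scale (question, logical form) pairs. It covers single-hop pseudo question generation, semantic-preserving rewriting, and multi-hop question generation, and uses an augment-generate-retrieve pipeline to improve logical form generation.

Result: On WebQSP, PGDA-KGQA improves F1, Hits@1, and Accuracy by 2.8%, 1.2%, and 3.1%; on ComplexWebQuestions the gains are 1.8%, 1.1%, and 2.4%. Both sets of results exceed state-of-the-art methods.

Insight: Prompt design is key to using LLMs effectively, multi-strategy augmentation balances semantic alignment with diversity, and reverse path exploration yields more realistic multi-hop training samples.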

Abstract: Knowledge Graph Question Answering (KGQA) is a crucial task in natural
language processing that requires reasoning over knowledge graphs (KGs) to
answer natural language questions. Recent methods utilizing large language
models (LLMs) have shown remarkable semantic parsing capabilities but are
limited by the scarcity of diverse annotated data and multi-hop reasoning
samples. Traditional data augmentation approaches focus mainly on
single-hop questions and are prone to semantic distortion, while LLM-based methods
primarily address semantic distortion but usually neglect multi-hop reasoning,
thus limiting data diversity. The scarcity of multi-hop samples further weakens
models’ generalization. To address these issues, we propose PGDA-KGQA, a
prompt-guided generative framework with multiple data augmentation strategies
for KGQA. At its core, PGDA-KGQA employs a unified prompt-design paradigm: by
crafting meticulously engineered prompts that integrate the provided textual
content, it leverages LLMs to generate large-scale (question, logical form)
pairs for model training. Specifically, PGDA-KGQA enriches its training set by:
(1) generating single-hop pseudo questions to improve the alignment of question
semantics with KG relations; (2) applying semantic-preserving question
rewriting to improve robustness against linguistic variations; (3) employing
answer-guided reverse path exploration to create realistic multi-hop questions.
By adopting an augment-generate-retrieve semantic parsing pipeline, PGDA-KGQA
utilizes the augmented data to enhance the accuracy of logical form generation
and thus improve answer retrieval performance. Experiments demonstrate that
PGDA-KGQA outperforms state-of-the-art methods on standard KGQA datasets, achieving
improvements on WebQSP by 2.8%, 1.2%, and 3.1% and on ComplexWebQuestions by
1.8%, 1.1%, and 2.4% in F1, Hits@1, and Accuracy, respectively.
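
Strategy (3), answer-guided reverse path exploration, can be pictured as walking knowledge-graph edges backwards from a known answer to find a topic entity and the relation chain that connects them. The sketch below does this on a toy triple list. The function name, the triples, and the fixed hop count are illustrative assumptions, and the step that turns a path into a natural language question with an LLM is omitted.

```python
from collections import defaultdict


def reverse_paths(triples, answer, hops):
    """Find (topic_entity, relation_path) pairs that reach `answer` in `hops` steps.

    triples: iterable of (subject, relation, object)
    Returns relation paths in forward order, from topic entity to answer.
    """
    incoming = defaultdict(list)  # object -> [(subject, relation), ...]
    for subject, relation, obj in triples:
        incoming[obj].append((subject, relation))

    paths = []

    def walk(node, relations, visited):
        if len(relations) == hops:
            paths.append((node, list(reversed(relations))))
            return
        for subject, relation in incoming.get(node, []):
            if subject not in visited:  # avoid cycles
                walk(subject, relations + [relation], visited | {subject})

    walk(answer, [], {answer})
    return paths


# Toy knowledge graph (invented for this example).
toy_kg = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
    ("Big Ben", "located_in", "London"),
]

two_hop = reverse_paths(toy_kg, answer="London", hops=2)
print(two_hop)
```

Each returned pair is a seed for a multi-hop training question, for example a question about the birthplace of the director of Inception, paired with the logical form built from the relation path.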

[11] Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings

Md Messal Monem Miah,Adrita Anika,Xi Shi,Ruihong Huang

Main category: cs.CL

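TL;DR: The paper evaluates Large Language Models (LLMs) and Large Multimodal Models (LMMs) on deception detection and finds that fine-tuned LLMs perform best on textual deception detection, while LMMs struggle to exploit multimodal cues.

Motivation: Detecting deception in an increasingly digital world is both critical and difficult, so the performance and potential of current models in multimodal settings need to be assessed.

Contribution: 1. A systematic evaluation of LLMs and LMMs on multimodal deception detection; 2. An analysis of different experimental setups and prompting strategies; 3. Evidence of the models' limits in handling cross-modal deception cues.

Method: Three datasets (RLTD, MU3D, OpSpam) are used to evaluate zero-shot and few-shot approaches with random or similarity-based in-context example selection, and to test direct label generation and chain-of-thought prompting.

Result: Fine-tuned LLMs achieve state-of-the-art performance on textual deception detection, while LMMs fail to make full use of cross-modal cues. The study also analyzes the impact of auxiliary features such as non-verbal gestures and video summaries.

Insight: LLMs show both potential and limitations in real-world deception detection, and combining information across modalities needs further work. Prompting strategy is one of the factors examined.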

Abstract: Detecting deception in an increasingly digital world is both a critical and
challenging task. In this study, we present a comprehensive evaluation of the
automated deception detection capabilities of Large Language Models (LLMs) and
Large Multimodal Models (LMMs) across diverse domains. We assess the
performance of both open-source and commercial LLMs on three distinct datasets:
real life trial interviews (RLTD), instructed deception in interpersonal
scenarios (MU3D), and deceptive reviews (OpSpam). We systematically analyze the
effectiveness of different experimental setups for deception detection,
including zero-shot and few-shot approaches with random or similarity-based
in-context example selection. Our results show that fine-tuned LLMs achieve
state-of-the-art performance on textual deception detection tasks, while LMMs
struggle to fully leverage cross-modal cues. Additionally, we analyze the
impact of auxiliary features, such as non-verbal gestures and video summaries,
and examine the effectiveness of different prompting strategies, including
direct label generation and chain-of-thought reasoning. Our findings provide
key insights into how LLMs process and interpret deceptive cues across
modalities, highlighting their potential and limitations in real-world
deception detection applications.
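
The difference between random and similarity-based in-context example selection can be sketched in a few lines. Here similarity is Jaccard overlap of lowercase word sets, which is an assumption chosen to keep the example dependency-free; the paper may use embedding-based similarity instead. The function names and the toy labeled pool are hypothetical.

```python
import random


def jaccard(text_a, text_b):
    """Word-set overlap between two texts, in [0, 1]."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def select_examples(query, pool, k, strategy="similarity", seed=0):
    """Choose k labeled examples from `pool` to place in a few-shot prompt.

    pool: list of (text, label) pairs
    strategy: "similarity" ranks by Jaccard overlap with the query,
              "random" samples uniformly with a fixed seed.
    """
    if strategy == "random":
        return random.Random(seed).sample(pool, k)
    ranked = sorted(pool, key=lambda example: jaccard(query, example[0]), reverse=True)
    return ranked[:k]


# Toy labeled pool (invented for this example).
pool = [
    ("the room was clean and quiet", "truthful"),
    ("best hotel ever amazing amazing", "deceptive"),
    ("I was at home all night", "truthful"),
]
query = "the hotel room was quiet"

chosen = select_examples(query, pool, k=1)
random_pick = select_examples(query, pool, k=2, strategy="random")
print(chosen, random_pick)
```

The selected pairs would then be formatted as demonstrations ahead of the query, followed by either a direct label request or a chain-of-thought instruction.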

[12] Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Zeguan Xiao,Yun Chen,Guanhua Chen

Main category: cs.CL

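TL;DR: The paper proposes POET, a method that addresses the "reward-generation gap" in Direct Alignment Algorithms (DAAs) by truncating the preferred and dispreferred responses to the same length, improving on existing methods.

Motivation: The authors identify a key problem in DAAs: a "reward-generation gap" between the optimization objective during training and generation performance during inference. One contributor is a mismatch between how important prefix tokens are during generation and how that importance is reflected in the implicit reward function.

Contribution: 1) Identifies the reward-generation gap in DAAs; 2) Proposes POET, which addresses it by truncating responses to equal length; 3) Achieves substantial gains on AlpacaEval 2 and other tasks.

Method: POET truncates both the preferred and the dispreferred response to the length of the shorter one, so truncated lengths vary across training samples. This implicitly constrains the DAA objective to converge at all positions and puts more weight on prefix tokens.

Result: POET improves over the standard implementations of DPO and SimPO, with gains of up to 15.6 points on AlpacaEval 2 and overall improvements across downstream tasks.

Insight: The paper exposes the mismatch between reward optimization and generation performance and resolves it with a simple, effective method, suggesting a direction for further improving DAAs.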

Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization
(DPO) and Simple Preference Optimization (SimPO), have emerged as efficient
alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms
for aligning large language models (LLMs) with human preferences. However, DAAs
suffer from a fundamental limitation we identify as the “reward-generation gap”
– a misalignment between optimization objectives during training and actual
generation performance during inference. In this paper, we find a contributor
to the reward-generation gap is the mismatch between the inherent importance of
prefix tokens during the LLM generation process and how this importance is
reflected in the implicit reward functions of DAAs. To bridge the gap, we
introduce a simple yet effective approach called Prefix-Oriented Equal-length
Training (POET), which truncates both preferred and dispreferred responses to
match the shorter one’s length. When training with POET, both responses in
each sample are truncated to equal length, which yields diverse truncated
lengths across samples. The optimization of the DAA objective is thereby
implicitly constrained to converge across all positions, paying more attention
to prefix tokens than standard DAAs do. We conduct experiments with DPO and
SimPO, two representative DAAs, demonstrating that POET improves over their
standard implementations, achieving gains of up to 15.6 points in AlpacaEval 2 and
overall improvements across downstream tasks. Our results highlight the
importance of addressing the misalignment between reward optimization and
generation performance in DAAs.
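
The truncation step described in the abstract is simple enough to sketch directly: each preference pair is cut to the length of its shorter response before the DPO or SimPO loss is computed. The function name and the toy token lists are assumptions; tokenization and the loss itself are left out.

```python
def poet_truncate(chosen_tokens, rejected_tokens):
    """Truncate both responses of a preference pair to the shorter one's length."""
    length = min(len(chosen_tokens), len(rejected_tokens))
    return chosen_tokens[:length], rejected_tokens[:length]


# Toy preference pairs as (chosen, rejected) token lists.
pairs = [
    (["a", "b", "c", "d", "e"], ["x", "y", "z"]),
    (["p", "q"], ["r", "s", "t", "u"]),
]

truncated = [poet_truncate(chosen, rejected) for chosen, rejected in pairs]
lengths = [len(chosen) for chosen, _ in truncated]
print(truncated, lengths)
```

Because the shorter response differs from pair to pair, the truncation point varies across the dataset, which is the "diverse truncated lengths across samples" the abstract refers to.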

[13] Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers

Ilanit Sobol,Shir Lissak,Refael Tikochinski,Tal Nakash,Anat Brunstein Klomek,Eyal Fruchter,Roi Reichart

Main category: cs.CL

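TL;DR: The paper combines computational methods with expert knowledge to study digital markers of suicidality on YouTube, revealing new behavioral patterns and clinical insights.

Motivation: Suicide is a leading cause of death in Western countries, and social media offers a new source of data on suicidal behavior. The paper investigates how signs of suicidal behavior can be identified from digital footprints on YouTube.

Contribution: Analyzes digital markers associated with suicidal behavior on YouTube using three complementary approaches (bottom-up, hybrid, and top-down) and finds a platform-specific behavioral indicator, YouTube Engagement.

Method: Three approaches are used: 1) bottom-up LLM-based topic modeling to identify behavioral indicators; 2) a hybrid approach in which a clinical expert reviews the LLM-derived topics; 3) top-down psychological assessment of suicide attempt narratives to analyze differences in motivation.

Result: Among the topics associated with suicide attempts, Mental Health Struggles and YouTube Engagement show significant temporal changes. YouTube Engagement was not flagged by the expert, which shows the distinct value of the bottom-up approach. Motivations also differ: individuals who attempted before their upload period aimed to help others, while those who attempted during it framed sharing as part of their personal recovery.

Insight: The paper stresses the value of combining computational methods with expert knowledge and shows that platform-specific behavior may serve as a new marker of suicide risk, offering a new research perspective for suicide prevention.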

Abstract: Suicide remains a leading cause of death in Western countries, underscoring
the need for new research approaches. As social media becomes central to daily
life, digital footprints offer valuable insight into suicidal behavior.
Focusing on individuals who attempted suicide while uploading videos to their
channels, we investigate: How do suicidal behaviors manifest on YouTube, and
how do they differ from expert knowledge? We applied complementary approaches:
computational bottom-up, hybrid, and expert-driven top-down, on a novel
longitudinal dataset of 181 YouTube channels from individuals with
life-threatening attempts, alongside 134 control channels. In the bottom-up
approach, we applied LLM-based topic modeling to identify behavioral
indicators. Of 166 topics, five were associated with suicide-attempt, with two
also showing temporal attempt-related changes ($p<.01$) - Mental Health
Struggles ($+0.08$)* and YouTube Engagement ($+0.1$)*. In the hybrid approach,
a clinical expert reviewed LLM-derived topics and flagged 19 as
suicide-related. However, none showed significant attempt-related temporal
effects beyond those identified bottom-up. Notably, YouTube Engagement, a
platform-specific indicator, was not flagged by the expert, underscoring the
value of bottom-up discovery. In the top-down approach, psychological
assessment of suicide attempt narratives revealed that the only significant
difference between individuals who attempted before and those who attempted during
their upload period was the motivation to share this experience: the former
aimed to Help Others ($\beta=-1.69$, $p<.01$), while the latter framed it as
part of their Personal Recovery ($\beta=1.08$, $p<.01$). By integrating these
approaches, we offer a nuanced understanding of suicidality, bridging digital
behavior and clinical insights.

* Within-group changes in relation to the suicide attempt.

[14] Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

Jiayi Yuan,Hao Li,Xinheng Ding,Wenya Xie,Yu-Jhe Li,Wentian Zhao,Kun Wan,Jing Shi,Xia Hu,Zirui Liu

Main category: cs.CL

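TL;DR: The paper examines how the outputs of Large Language Models (LLMs) become irreproducible under changes in system configuration and numerical precision (for example GPU type and batch size) and proposes LayerCast, a lightweight inference pipeline that balances memory efficiency with numerical stability.

Motivation: LLM outputs differ noticeably across hardware configurations and numerical precisions. In reasoning tasks in particular, small floating-point rounding differences can make outputs diverge, which undermines the credibility of current evaluations.

Contribution: 1. The first systematic study of how numerical precision affects the reproducibility of LLM inference; 2. LayerCast, which stores weights in 16-bit precision and performs computation in FP32 to balance efficiency and stability.

Method: Controlled experiments across hardware, software, and precision settings quantify when and how model outputs diverge. LayerCast stores weights in 16 bits but computes in FP32.

Result: Under bfloat16 with greedy decoding, differences in GPU count, GPU type, and batch size change the accuracy of DeepSeek-R1-Distill-Qwen-7B by up to 9% and its response length by up to 9,000 tokens. LayerCast is designed to restore numerical stability while keeping memory use low.

Insight: Floating-point arithmetic is non-associative, and limited precision amplifies the resulting errors in LLM inference. Ignoring numerical precision in evaluation practice can lead to misleading conclusions.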

Abstract: Large Language Models (LLMs) are now integral across various domains and have
demonstrated impressive performance. Progress, however, rests on the premise
that benchmark scores are both accurate and reproducible. We demonstrate that
the reproducibility of LLM performance is fragile: changing system
configuration such as evaluation batch size, GPU count, and GPU version can
introduce significant differences in the generated responses. This issue is
especially pronounced in reasoning models, where minor rounding differences in
early tokens can cascade into divergent chains of thought, ultimately affecting
accuracy. For instance, under bfloat16 precision with greedy decoding, a
reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation
in accuracy and 9,000 tokens difference in response length due to differences
in GPU count, type, and evaluation batch size. We trace the root cause of this
variability to the non-associative nature of floating-point arithmetic under
limited numerical precision. This work presents the first systematic
investigation into how numerical precision affects reproducibility in LLM
inference. Through carefully controlled experiments across various hardware,
software, and precision settings, we quantify when and how model outputs
diverge. Our analysis reveals that floating-point precision – while critical
for reproducibility – is often neglected in evaluation practices. Inspired by
this, we develop a lightweight inference pipeline, dubbed LayerCast, that
stores weights in 16-bit precision but performs all computations in FP32,
balancing memory efficiency with numerical stability. Code is available at
https://github.com/nanomaoli/llm_reproducibility.
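
The root cause named in the abstract, non-associative floating-point arithmetic, is easy to reproduce without a GPU. The sketch below first regroups three additions, then sums one list with different chunk sizes as a stand-in for the different reduction orders that batch size or GPU count can induce. It uses Python's 64-bit floats; at bfloat16 precision the discrepancies are far larger. The `chunked_sum` helper is hypothetical, and LayerCast itself is not reproduced here.

```python
import math


def chunked_sum(values, chunk_size):
    """Sum each chunk, then add the partial sums (mimics a split reduction)."""
    partials = [
        sum(values[start:start + chunk_size])
        for start in range(0, len(values), chunk_size)
    ]
    return sum(partials)


# Same three numbers, two groupings, two different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right, left == right)

# Same list, different reduction orders.
values = [0.1 * (i % 7 + 1) for i in range(1000)]
results = {size: chunked_sum(values, size) for size in (1, 8, 64)}
reference = math.fsum(values)  # correctly rounded sum for comparison
for size, total in results.items():
    print(size, total, total - reference)
```

With greedy decoding, a difference of this kind in an early logit can flip the chosen token, after which the chain of thought diverges. Computing in FP32 narrows the rounding gap, which is the motivation the abstract gives for storing weights in 16 bits while performing the arithmetic in higher precision.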

[15] TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

Bingheng Wu,Jingze Shi,Yifan Wu,Nan Tang,Yuyu Luo

Main category: cs.CL

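TL;DR: The paper proposes a unified rotary position embedding (RoPE) method that joins Transformer and State Space Model (SSM) layers in a hybrid architecture, TransXSSM, improving both the efficiency and the accuracy of long-sequence modeling.

Motivation: Transformers capture long-range dependencies well and SSMs support linear-time sequence modeling, but their positional encoding mechanisms are inconsistent (explicit RoPE versus implicit representations via convolutions), which hurts performance. A unified solution is needed.

Contribution: 1. A unified rotary position embedding method that resolves the positional encoding incompatibility between Transformers and SSMs; 2. The hybrid architecture TransXSSM, which combines the strengths of both.

Method: Under the unified RoPE framework, Transformer self-attention layers and SSM state-space layers are integrated into TransXSSM, with the same positional encoding scheme used by both components.

Result: At a 4K sequence length, TransXSSM trains 42.3% faster and runs inference 29.5% faster than standard Transformer models, surpasses a Transformer baseline by over 4% on language modeling benchmarks, and its 1.3B version gains 7.22% in average accuracy over the 320M version.

Insight: Unified positional encoding is key to efficient long-context modeling in hybrid models, because it avoids the performance loss caused by inconsistent positional mechanisms.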

Abstract: Transformers exhibit proficiency in capturing long-range dependencies,
whereas State Space Models (SSMs) facilitate linear-time sequence modeling.
Notwithstanding their synergistic potential, the integration of these
architectures presents a significant challenge, primarily attributable to a
fundamental incongruity in their respective positional encoding mechanisms:
Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs
leverage implicit positional representations via convolutions. This divergence
often precipitates discontinuities and suboptimal performance. To address this
impediment, we propose a unified rotary position embedding (unified RoPE)
methodology, thereby establishing a consistent positional encoding framework
for both self-attention and state-space components. Using this unified RoPE, we
introduce TransXSSM, a hybrid architecture that coherently integrates the
Transformer and SSM layers under this unified positional encoding scheme. At a
4K sequence length, TransXSSM exhibits training and inference speeds that are
42.3% and 29.5% faster, respectively, relative to standard
Transformer models. It also delivers higher accuracy: under comparable
settings, it surpasses a Transformer baseline by over 4% on language modeling
benchmarks. TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains
7.22% in average accuracy over its 320M version (versus about 6%
gains for equivalent Transformers or SSMs). Our results show that unified
positional encoding resolves positional incompatibility in hybrid models,
enabling efficient, high-performance long-context modeling.
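
For readers unfamiliar with rotary embeddings, the following is a minimal NumPy sketch of standard RoPE, not the paper's unified formulation, which the abstract does not spell out. The function names `rope_angles` and `apply_rope` are hypothetical. The sketch only illustrates the property that makes a shared scheme attractive: once two feature streams are rotated with the same position-dependent angles, their dot product depends only on the relative offset between positions.

```python
import numpy as np

def rope_angles(seq_len, dim, base=10000.0):
    """Rotation angles of shape (seq_len, dim // 2), as in standard RoPE."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    positions = np.arange(seq_len)
    return np.outer(positions, inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (seq_len, dim) by the given angles."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Illustration: the same vectors placed at every position, then rotated.
rng = np.random.default_rng(0)
q0, k0 = rng.normal(size=8), rng.normal(size=8)
angles = rope_angles(seq_len=6, dim=8)
q = apply_rope(np.tile(q0, (6, 1)), angles)
k = apply_rope(np.tile(k0, (6, 1)), angles)
# Positions (3, 1) and (5, 3) share the offset 2, so the scores match.
same_offset = np.isclose(q[3] @ k[1], q[5] @ k[3])
```

Whether the state-space branch consumes the rotated features in exactly this way is an assumption; the sketch only shows why reusing one set of angles gives both components a consistent notion of relative position.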

[16] ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Yu Sun,Xingyu Qian,Weiwen Xu,Hao Zhang,Chenghao Xiao,Long Li,Yu Rong,Wenbing Huang,Qifeng Bai,Tingyang Xu

Main category: cs.CL

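TL;DR: The paper introduces ReasonMed, a dataset of 370K high-quality medical reasoning examples produced through a multi-agent verification and refinement process, and investigates the most effective fine-tuning strategy for medical reasoning models.

Details Motivation: Reasoning-based large language models perform well in mathematics and programming, but their capabilities in knowledge-intensive medical question answering remain underexplored. The paper aims to fill this gap.

Contribution: 1) ReasonMed, the largest medical reasoning dataset to date, built through multi-agent verification and refinement; 2) an Error Refiner that corrects erroneous reasoning steps; 3) evidence that fine-tuning on detailed chain-of-thought (CoT) reasoning combined with concise answer summaries is the most effective strategy.

Method: 1) Generate a high-quality dataset through a multi-agent verification and refinement process; 2) use the Error Refiner to correct errors in reasoning paths; 3) fine-tune models on detailed CoT combined with concise answer summaries.

Result: ReasonMed-7B achieves the best performance among sub-10B models, outperforming the prior best by 4.17%, and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%.

Insight: Multi-agent verification and refinement substantially improves dataset quality; combining detailed CoT with concise answer summaries is a well-suited fine-tuning strategy for medical question answering.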

Abstract: Though reasoning-based large language models (LLMs) have excelled in
mathematics and programming, their capabilities in knowledge-intensive medical
question answering remain underexplored. To address this, we introduce
ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality
examples distilled from 1.7 million initial reasoning paths generated by
various LLMs. ReasonMed is constructed through a multi-agent
verification and refinement process, where we design an Error Refiner
to enhance the reasoning paths by identifying and correcting error-prone steps
flagged by a verifier. Leveraging ReasonMed, we systematically investigate best
practices for training medical reasoning models and find that combining
detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields
the most effective fine-tuning strategy. Based on this strategy, we train
ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the
prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.

[17] MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

Georgios Chatzichristodoulou,Despoina Kosmopoulou,Antonios Kritikos,Anastasia Poulopoulou,Efthymios Georgiou,Athanasios Katsamanis,Vassilis Katsouros,Alexandros Potamianos

Main category: cs.CL

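TL;DR: The paper proposes MEDUSA, a multimodal deep-fusion, multi-stage training framework for speech emotion recognition (SER) in naturalistic conditions. Its four-stage training pipeline addresses class imbalance and emotion ambiguity, and it ranked first in the Interspeech 2025 challenge.

Details Motivation: SER in naturalistic conditions is challenging because emotions are subjective and the data are imbalanced; conventional methods struggle to handle both.

Contribution: 1. The MEDUSA framework, combining multi-stage training with deep cross-modal fusion; 2. DeepSER, a novel cross-modal transformer fusion mechanism; 3. Manifold MixUp for regularization, together with multitask learning.

Method: 1. The first two stages train an ensemble of DeepSER-based classifiers; 2. the last two stages optimize a trainable meta-classifier; 3. training combines soft labels, balanced sampling, and multitask learning.

Result: MEDUSA ranked 1st in Task 1 of the Interspeech 2025 challenge.

Insight: Multimodal fusion and multi-stage training can substantially improve SER performance, particularly under the data imbalance and ambiguity of naturalistic conditions.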

Abstract: Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and
their uneven representation under naturalistic conditions. We propose MEDUSA, a
multimodal framework with a four-stage training pipeline, which effectively
handles class imbalance and emotion ambiguity. The first two stages train an
ensemble of classifiers that utilize DeepSER, a novel extension of a deep
cross-modal transformer fusion mechanism from pretrained self-supervised
acoustic and linguistic representations. Manifold MixUp is employed for further
regularization. The last two stages optimize a trainable meta-classifier that
combines the ensemble predictions. Our training approach incorporates human
annotation scores as soft targets, coupled with balanced data sampling and
multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion
Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic
Conditions Challenge.

[18] Gender Bias in English-to-Greek Machine Translation

Eleni Gkovedarou,Joke Daems,Luna De Bruyne

Main category: cs.CL

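TL;DR: This study investigates gender bias in commercial machine translation systems (Google Translate and DeepL) for English-to-Greek translation, documenting male bias, occupational stereotyping, and errors in anti-stereotypical translations, and explores GPT-4o as a bias mitigation tool.

Details Motivation: As demand for inclusive language grows, concern has risen that machine translation systems may reinforce gender stereotypes. The study focuses on the understudied English-to-Greek pair and evaluates gender bias in commercial systems.

Contribution: Introduces the GendEL dataset, evaluates Google Translate, DeepL, and GPT-4o for gender bias, and examines GPT-4o as a possible mitigation tool.

Method: Using the manually crafted GendEL dataset (240 bilingual sentences), the study analyzes male bias, occupational stereotyping, and errors in anti-stereotypical translations, and tests GPT-4o's ability to produce gender-explicit or gender-neutral alternatives.

Result: The MT systems perform well when gender is explicit (DeepL outperforms Google Translate and GPT-4o on feminine gender-unambiguous sentences) but fail to produce inclusive or neutral translations when gender is unspecified. GPT-4o generates alternatives for most ambiguous cases, but residual bias remains.

Insight: Commercial MT systems still fall short on gender bias. GPT-4o shows promise as a generation tool but needs further work to achieve genuine gender inclusivity.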

Abstract: As the demand for inclusive language increases, concern has grown over the
susceptibility of machine translation (MT) systems to reinforce gender
stereotypes. This study investigates gender bias in two commercial MT systems,
Google Translate and DeepL, focusing on the understudied English-to-Greek
language pair. We address three aspects of gender bias: i) male bias, ii)
occupational stereotyping, and iii) errors in anti-stereotypical translations.
Additionally, we explore the potential of prompted GPT-4o as a bias mitigation
tool that provides both gender-explicit and gender-neutral alternatives when
necessary. To achieve this, we introduce GendEL, a manually crafted bilingual
dataset of 240 gender-ambiguous and unambiguous sentences that feature
stereotypical occupational nouns and adjectives. We find persistent gender bias
in translations by both MT systems; while they perform well in cases where
gender is explicitly defined, with DeepL outperforming both Google Translate
and GPT-4o in feminine gender-unambiguous sentences, they are far from
producing gender-inclusive or neutral translations when the gender is
unspecified. GPT-4o shows promise, generating appropriate gendered and neutral
alternatives for most ambiguous cases, though residual biases remain evident.

[19] From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies

Blaž Škrlj,Boshko Koloski,Senja Pollak,Nada Lavrač

Main category: cs.CL

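TL;DR: This survey examines the synergy between knowledge graphs (KGs) and large language models (LLMs), dividing existing work into KG-enhanced LLMs and LLM-augmented KGs and highlighting their mutual benefits and future research directions.

Details Motivation: The goal is to integrate the structured knowledge of KGs with the language abilities of LLMs in order to strengthen LLM reasoning, reduce hallucinations, and improve the efficiency of KG construction and querying.

Contribution: A systematic categorization and analysis of KG-LLM synergies, together with proposed future directions including neuro-symbolic integration and dynamic KG updating.

Method: A literature review that classifies existing work into KG-enhanced LLMs and LLM-augmented KGs and analyzes the strengths and weaknesses of each.

Result: The survey shows the potential of integrating KGs with LLMs, such as improved reasoning and fewer hallucinations, and its role in advancing intelligent systems for complex tasks.

Insight: Key takeaways include the benefits of bidirectional synergy, the central role of computational efficiency and data quality, and open challenges in neuro-symbolic integration and ethics.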

Abstract: Integrating structured knowledge from Knowledge Graphs (KGs) into Large
Language Models (LLMs) enhances factual grounding and reasoning capabilities.
This survey paper systematically examines the synergy between KGs and LLMs,
categorizing existing approaches into two main groups: KG-enhanced LLMs, which
improve reasoning, reduce hallucinations, and enable complex question
answering; and LLM-augmented KGs, which facilitate KG construction, completion,
and querying. Through comprehensive analysis, we identify critical gaps and
highlight the mutual benefits of structured knowledge integration. Compared to
existing surveys, our study uniquely emphasizes scalability, computational
efficiency, and data quality. Finally, we propose future research directions,
including neuro-symbolic integration, dynamic KG updating, data reliability,
and ethical considerations, paving the way for intelligent systems capable of
managing more complex real-world knowledge tasks.

[20] Using Sign Language Production as Data Augmentation to enhance Sign Language Translation

Harry Walsh,Maksym Ivashechkin,Richard Bowden

Main category: cs.CL

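TL;DR: The paper proposes using Sign Language Production to improve Sign Language Translation models, applying it as data augmentation for low-resource sign language datasets.

Details Motivation: Sign language data are scarce and costly to collect, which limits translation model performance. Sign Language Production offers a new route to data augmentation.

Contribution: Augments sign language data with Sign Language Production techniques (skeleton-based production, sign stitching, and generative models), improving translation performance by up to 19%.

Method: Skeleton-based production, sign stitching, and two generative models (SignGAN and SignSplat) generate varied data for training translation models.

Result: Data augmentation improves translation performance by up to 19%, offering a viable approach to sign language translation in low-resource settings.

Insight: Advances in Sign Language Production can ease data scarcity and may inform other low-resource language tasks.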

Abstract: Machine learning models fundamentally rely on large quantities of
high-quality data. Collecting the necessary data for these models can be
challenging due to cost, scarcity, and privacy restrictions. Signed languages
are visual languages used by the deaf community and are considered low-resource
languages. Sign language datasets are often orders of magnitude smaller than
their spoken language counterparts. Sign Language Production is the task of
generating sign language videos from spoken language sentences, while Sign
Language Translation is the reverse translation task. Here, we propose
leveraging recent advancements in Sign Language Production to augment existing
sign language datasets and enhance the performance of Sign Language Translation
models. For this, we utilize three techniques: a skeleton-based approach to
production, sign stitching, and two photo-realistic generative models, SignGAN
and SignSplat. We evaluate the effectiveness of these techniques in enhancing
the performance of Sign Language Translation models by generating variation in
the signer’s appearance and the motion of the skeletal data. Our results
demonstrate that the proposed methods can effectively augment existing datasets
and enhance the performance of Sign Language Translation models by up to 19%,
paving the way for more robust and accurate Sign Language Translation systems,
even in resource-constrained environments.

[21] Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering

Tianjun Yao,Haoxuan Li,Zhiqiang Shen,Pan Li,Tongliang Liu,Kun Zhang

Main category: cs.CL

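TL;DR: The paper proposes RAPL, a framework for improving the efficiency and generalization of graph retrieval in knowledge graph question answering (KGQA). RAPL combines a two-stage labeling strategy, a model-agnostic graph transformation, and a path-based reasoning strategy to improve retrieval performance and generalization.

Details Motivation: Most retrieval-augmented generation (RAG) pipelines rely on unstructured text, which limits interpretability and structured reasoning. Knowledge graphs offer a more structured alternative, but existing graph retrievers generalize poorly. The paper addresses this limitation.

Contribution: 1. A two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; 2. a model-agnostic graph transformation that captures intra- and inter-triple interactions; 3. a path-based reasoning strategy that supports structured inputs.

Method: RAPL has three parts: 1. two-stage labeling (heuristic plus parametric); 2. model-agnostic graph transformation; 3. path-based reasoning, which strengthens the downstream reasoner.

Result: RAPL outperforms existing methods by 2.66% to 20.34%, significantly narrows the performance gap between smaller and larger models, and performs well in cross-dataset settings.

Insight: Through structured supervision and path-based reasoning, RAPL shows how to improve the generalization of graph retrieval, offering a new direction for combining knowledge graphs with LLMs.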

Abstract: Large Language Models (LLMs) have shown strong inductive reasoning ability
across various domains, but their reliability is hindered by outdated
knowledge and hallucinations. Retrieval-Augmented Generation mitigates these
issues by grounding LLMs with external knowledge; however, most existing RAG
pipelines rely on unstructured text, limiting interpretability and structured
reasoning. Knowledge graphs, which represent facts as relational triples, offer
a more structured and compact alternative. Recent studies have explored
integrating knowledge graphs with LLMs for knowledge graph question answering
(KGQA), with a significant proportion adopting the retrieve-then-reasoning
paradigm. In this framework, graph-based retrievers have demonstrated strong
empirical performance, yet they still face challenges in generalization
ability. In this work, we propose RAPL, a novel framework for efficient and
effective graph retrieval in KGQA. RAPL addresses these limitations through
three aspects: (1) a two-stage labeling strategy that combines heuristic
signals with parametric models to provide causally grounded supervision; (2) a
model-agnostic graph transformation approach to capture both intra- and
inter-triple interactions, thereby enhancing representational capacity; and (3)
a path-based reasoning strategy that facilitates learning from the injected
rational knowledge and supports the downstream reasoner through structured inputs.
Empirically, RAPL outperforms state-of-the-art methods by 2.66% to 20.34%, and
significantly reduces the performance gap between smaller and more powerful
LLM-based reasoners, as well as the gap under cross-dataset settings,
highlighting its superior retrieval capability and generalizability. Codes are
available at: https://github.com/tianyao-aka/RAPL.

[22] Query-Level Uncertainty in Large Language Models

Lihu Chen,Gaël Varoquaux

Main category: cs.CL

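TL;DR: The paper studies query-level uncertainty in large language models and proposes Internal Confidence, a training-free method that uses self-evaluation to judge whether the model can answer a query. It outperforms several baselines and can be used for efficient RAG and model cascading.

Details Motivation: To make large language models more efficient and trustworthy, models need to recognize the boundary of their own knowledge, which enables adaptive inference such as invoking RAG or abstaining.

Contribution: 1. Internal Confidence, a training-free method for detecting a model's knowledge boundary; 2. validation on factual QA and mathematical reasoning tasks; 3. applications to efficient RAG and model cascading.

Method: Internal Confidence uses self-evaluations across layers and tokens, without any training, to determine whether the model can address a query.

Result: Internal Confidence outperforms baselines on multiple tasks and reduces inference cost.

Insight: A model's internal confidence is an effective indicator of its knowledge boundary, and training-free methods are more flexible in practice.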

Abstract: It is important for Large Language Models to be aware of the boundary of
their knowledge, that is, to be able to distinguish known from unknown queries. This
type of awareness can help models perform adaptive inference, such as invoking
RAG, engaging in slow and deep thinking, or adopting the abstention mechanism,
which is beneficial to the development of efficient and trustworthy AI. In this
work, we propose a method to detect knowledge boundaries via Query-Level
Uncertainty, which aims to determine if the model is able to address a given
query without generating any tokens. To this end, we introduce a novel and
training-free method called Internal Confidence, which leverages
self-evaluations across layers and tokens. Empirical results on both factual QA
and mathematical reasoning tasks demonstrate that our internal confidence can
outperform several baselines. Furthermore, we showcase that our proposed method
can be used for efficient RAG and model cascading, which is able to reduce
inference costs while maintaining performance.
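
The model-cascading use case can be pictured with a small routing sketch. This is a hedged illustration, not the authors' implementation: `confidence_fn` stands in for any query-level confidence estimate (how Internal Confidence is computed from layers and tokens is not described in the abstract), and the model callables and the threshold value are hypothetical.

```python
from typing import Callable

def cascade_answer(
    query: str,
    confidence_fn: Callable[[str], float],
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
    threshold: float = 0.5,
) -> tuple[str, str]:
    """Pick a model from a query-level score, before generating any answer."""
    score = confidence_fn(query)
    if score >= threshold:
        # The small model is judged likely to know the answer: keep it cheap.
        return "small", small_model(query)
    # Otherwise escalate to the more expensive model.
    return "large", large_model(query)

# Toy stand-ins so the sketch runs end to end (all hypothetical).
toy_confidence = lambda q: 0.9 if "capital" in q else 0.1
toy_small = lambda q: "small-model answer"
toy_large = lambda q: "large-model answer"

easy = cascade_answer("What is the capital of France?", toy_confidence, toy_small, toy_large)
hard = cascade_answer("Prove this inequality.", toy_confidence, toy_small, toy_large)
```

The same routing shape would apply to the RAG case, with the low-confidence branch calling a retrieval step instead of a larger model.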

[23] Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data

Hao Xiong,Chuanyuan Tan,Wenliang Chen

Main category: cs.CL

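TL;DR: The paper addresses two issues in Unstructured Knowledge Editing (UKE): the lack of Locality evaluation and the abnormal failure of fine-tuning (FT) based methods. It constructs two extended datasets, UnKEBench-Loc and AKEW-Loc, and proposes an optimized fine-tuning method, FT-UKE. Experiments show that FT-UKE outperforms existing SOTA methods, including in batch editing.

Details Motivation: UKE is crucial for updating the knowledge of large language models (LLMs), but existing work lacks Locality evaluation and FT-based methods fail abnormally. The paper aims to resolve both issues and improve fine-tuning.

Contribution: 1) Extended datasets UnKEBench-Loc and AKEW-Loc that include locality test data; 2) an analysis of four factors that affect FT-based methods, leading to the optimized FT-UKE; 3) experiments showing that FT-UKE outperforms SOTA in both standard and batch editing.

Method: 1) Extend existing UKE datasets to support Locality evaluation; 2) analyze the factors that affect FT performance; 3) propose and validate the optimized FT-UKE.

Result: FT-UKE outperforms existing SOTA methods. In batch editing its advantage grows with batch size, with the average metric lead rising from +6.78% to +10.80%.

Insight: Properly configured fine-tuning is strong for unstructured knowledge editing, Locality evaluation is key to improving model editing, and the method scales well in batch editing.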

Abstract: Unstructured Knowledge Editing (UKE) is crucial for updating the relevant
knowledge of large language models (LLMs). It focuses on unstructured inputs,
such as long or free-form texts, which are common forms of real-world
knowledge. Although previous studies have proposed effective methods and tested
them, some issues exist: (1) Lack of Locality evaluation for UKE, and (2)
Abnormal failure of fine-tuning (FT) based methods for UKE. To address these
issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by
extending two existing UKE datasets with locality test data from the
unstructured and structured views. This enables a systematic evaluation of the
Locality of post-edited models. Furthermore, we identify four factors that may
affect the performance of FT-based methods. Based on these factors, we conduct
experiments to determine how the well-performing FT-based methods should be
trained for the UKE task, providing a training recipe for future research. Our
experimental results indicate that the FT-based method with the optimal setting
(FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art
(SOTA). In batch editing scenarios, FT-UKE shows strong performance as well,
with its advantage over SOTA methods increasing as the batch size grows,
expanding the average metric lead from +6.78% to +10.80%.

[24] ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Zhenran Xu,Yiyu Wang,Xue Yang,Longyue Wang,Weihua Luo,Kaifu Zhang,Baotian Hu,Min Zhang

Main category: cs.CL

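TL;DR: ComfyUI-R1 is the first large reasoning model for automated workflow generation. Trained with a two-stage framework (CoT fine-tuning followed by reinforcement learning), it significantly outperforms existing methods.

Details Motivation: AI-generated content has moved from monolithic models to modular workflows (for example on the ComfyUI platform), but designing workflows requires expertise and presents a steep learning curve. ComfyUI-R1 is proposed to simplify this process.

Contribution: 1. ComfyUI-R1, the first reasoning model for workflow generation; 2. a chain-of-thought reasoning dataset built from 4K workflows; 3. a training framework combining CoT fine-tuning with reinforcement learning.

Method: 1. Chain-of-thought (CoT) fine-tuning for cold start, adapting the model to the ComfyUI domain; 2. reinforcement learning guided by a rule-metric hybrid reward to strengthen reasoning.

Result: The 7B-parameter model reaches a 97% format validity rate and significantly surpasses leading closed-source models such as GPT-4o and Claude on node-level and graph-level F1.

Insight: Chain-of-thought reasoning and the transformation of workflows into code are both critical; the model is especially strong at synthesizing intricate workflows with diverse nodes for AI art creation.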

Abstract: AI-generated content has evolved from monolithic models to modular workflows,
particularly on platforms like ComfyUI, enabling customization in creative
pipelines. However, crafting effective workflows requires great expertise to
orchestrate numerous specialized components, presenting a steep learning curve
for users. To address this challenge, we introduce ComfyUI-R1, the first large
reasoning model for automated workflow generation. Starting with our curated
dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning
data, including node selection, workflow planning, and code-level workflow
representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT
fine-tuning for cold start, adapting models to the ComfyUI domain; (2)
reinforcement learning for incentivizing reasoning capability, guided by a
fine-grained rule-metric hybrid reward, ensuring format validity, structural
integrity, and node-level fidelity. Experiments show that our 7B-parameter
model achieves a 97% format validity rate, along with a high pass rate and high
node-level and graph-level F1 scores, significantly surpassing prior
state-of-the-art methods that employ leading closed-source models such as
GPT-4o and Claude series. Further analysis highlights the critical role of the
reasoning process and the advantage of transforming workflows into code.
Qualitative comparison reveals our strength in synthesizing intricate workflows
with diverse nodes, underscoring the potential of long CoT reasoning in AI art
creation.

[25] CoRT: Code-integrated Reasoning within Thinking

Chengpeng Li,Zhengyang Tang,Ziniu Li,Mingfeng Xue,Keqin Bao,Tian Ding,Ruoyu Sun,Benyou Wang,Xiang Wang,Junyang Lin,Dayiheng Liu

Main category: cs.CL

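TL;DR: CoRT is a post-training framework that uses code-integrated reasoning to improve the efficiency and accuracy of large reasoning models (LRMs) on complex mathematical operations, and it addresses data scarcity through Hint-Engineering.

Details Motivation: LRMs such as o1 and DeepSeek-R1 excel at natural language reasoning but are inefficient or inaccurate on complex mathematical operations. Directly combining them with computational tools such as a code interpreter introduces external knowledge beyond the model's internal text representations, so the direct combination is inefficient.

Contribution: 1. CoRT, a post-training framework that teaches LRMs to use a code interpreter effectively; 2. Hint-Engineering, which addresses data scarcity by synthesizing code-integrated reasoning data; 3. experiments showing improved performance and fewer reasoning tokens for models from 1.5B to 32B parameters.

Method: 1. Synthesize code-integrated reasoning data via Hint-Engineering, which strategically inserts hints to optimize the interaction between the LRM and the code interpreter; 2. manually create 30 high-quality samples and use them for supervised fine-tuning, rejection fine-tuning, and reinforcement learning; 3. apply the procedure to models from 1.5B to 32B parameters.

Result: Across five mathematical reasoning datasets, Hint-Engineering models achieve absolute improvements of 4% and 8% on the 32B and 1.5B models respectively, and use about 30% and 50% fewer tokens respectively.

Insight: 1. Hint-Engineering effectively addresses data scarcity in code-integrated reasoning; 2. efficient use of a code interpreter improves both performance and efficiency on mathematical reasoning; 3. a small amount of high-quality data can yield substantial gains.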

Abstract: Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable
progress in natural language reasoning with long chain-of-thought (CoT), yet
they remain inefficient or inaccurate when handling complex mathematical
operations. Addressing these limitations through computational tools (e.g.,
computation libraries and symbolic solvers) is promising, but it introduces a
technical challenge: Code Interpreter (CI) brings external knowledge beyond the
model’s internal text representations, thus the direct combination is not
efficient. This paper introduces CoRT, a post-training framework for teaching
LRMs to leverage CI effectively and efficiently. As a first step, we address
the data scarcity issue by synthesizing code-integrated reasoning data through
Hint-Engineering, which strategically inserts different hints at appropriate
positions to optimize LRM-CI interaction. We manually create 30 high-quality
samples, upon which we post-train models ranging from 1.5B to 32B parameters,
with supervised fine-tuning, rejection fine-tuning and reinforcement learning.
Our experimental results demonstrate that Hint-Engineering models achieve 4%
and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and
DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging
mathematical reasoning datasets. Furthermore, Hint-Engineering models use about
30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model
compared with the natural language models. The models and code are available at
https://github.com/ChengpengLi1003/CoRT.

[26] EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Christoph Schuhmann,Robert Kaczmarczyk,Gollam Rabby,Felix Friedrich,Maurice Kraus,Kourosh Nadi,Huu Nguyen,Kristian Kersting,Sören Auer

Main category: cs.CL

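TL;DR: EmoNet-Voice is a fine-grained, expert-verified benchmark for speech emotion detection. It comprises a large-scale pre-training dataset and an expert-annotated benchmark dataset, is designed to evaluate speech emotion recognition models on 40 emotion categories, and is validated by psychology experts.

Details Motivation: Current speech emotion recognition datasets are limited in emotional granularity, raise privacy concerns, or rely on acted portrayals. A more comprehensive, privacy-preserving benchmark is needed to advance the emotional understanding of AI systems.

Contribution: 1. EmoNet-Voice, comprising a large-scale pre-training dataset (4,500 hours of speech, 11 voices, 40 emotions, 4 languages) and an expert-annotated benchmark dataset. 2. Privacy preservation and fine-grained emotion annotation through synthetic speech and expert validation. 3. Empathic Insight Voice models that set a new standard in speech emotion recognition.

Method: State-of-the-art voice generation is used to synthesize audio snippets that simulate specific emotions, and psychology experts annotate emotion intensity and category.

Result: Evaluations show that high-arousal emotions such as anger are easier to detect than low-arousal states such as concentration, and that the models agree closely with expert annotations.

Insight: Synthetic data combined with expert validation is an effective privacy-preserving approach, and fine-grained emotion categories and intensity labels support more comprehensive evaluation of speech emotion recognition.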

Abstract: The advancement of text-to-speech and audio generation models necessitates
robust benchmarks for evaluating the emotional understanding capabilities of AI
systems. Current speech emotion recognition (SER) datasets often exhibit
limitations in emotional granularity, privacy concerns, or reliance on acted
portrayals. This paper introduces EmoNet-Voice, a new resource for speech
emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training
dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions,
and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human
expert annotations. EmoNet-Voice is designed to evaluate SER models on a
fine-grained spectrum of 40 emotion categories with different levels of
intensities. Leveraging state-of-the-art voice generation, we curated synthetic
audio snippets simulating actors portraying scenes designed to evoke specific
emotions. Crucially, we conducted rigorous validation by psychology experts who
assigned perceived intensity labels. This synthetic, privacy-preserving
approach allows for the inclusion of sensitive emotional states often absent in
existing datasets. Lastly, we introduce Empathic Insight Voice models that set
a new standard in speech emotion recognition with high agreement with human
experts. Our evaluations across the current model landscape exhibit valuable
findings, such as high-arousal emotions like anger being much easier to detect
than low-arousal states like concentration.

[27] Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Xiangning Yu,Zhuohan Wang,Linyi Yang,Haoxuan Li,Anjie Liu,Xiao Xue,Jun Wang,Mengyue Yang

Main category: cs.CL

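TL;DR: The paper proposes a framework based on causal sufficiency and necessity to improve chain-of-thought (CoT) reasoning. By quantifying the influence of individual reasoning steps and adding or pruning steps accordingly, it improves reasoning efficiency and cost-effectiveness.

Details Motivation: CoT is essential for giving large language models (LLMs) complex reasoning capabilities, but reasoning steps may be insufficient or unnecessary, which harms accuracy and efficiency.

Contribution: A causal framework that improves CoT reasoning through the dual lenses of sufficiency and necessity, quantifies the influence of reasoning steps, and automatically adds missing steps or prunes redundant ones.

Method: The causal Probability of Sufficiency and Necessity measures whether reasoning steps are logically sufficient and necessary, and quantitative analysis under intervention scenarios is used to optimize the reasoning chain.

Result: On multiple mathematical and commonsense reasoning benchmarks, the approach substantially improves reasoning efficiency and reduces token usage without sacrificing accuracy.

Insight: Quantifying causal sufficiency and necessity is an effective way to optimize reasoning chains and offers a new direction for improving LLM reasoning performance and controlling cost.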

Abstract: Chain-of-Thought (CoT) prompting plays an indispensable role in endowing
large language models (LLMs) with complex reasoning capabilities. However, CoT
currently faces two fundamental challenges: (1) Sufficiency, which ensures that
the generated intermediate inference steps comprehensively cover and
substantiate the final conclusion; and (2) Necessity, which identifies the
inference steps that are truly indispensable for the soundness of the resulting
answer. We propose a causal framework that characterizes CoT reasoning through
the dual lenses of sufficiency and necessity. Incorporating causal Probability
of Sufficiency and Necessity allows us not only to determine which steps are
logically sufficient or necessary to the prediction outcome, but also to
quantify their actual influence on the final reasoning outcome under different
intervention scenarios, thereby enabling the automated addition of missing
steps and the pruning of redundant ones. Extensive experimental results on
various mathematical and commonsense reasoning benchmarks confirm substantial
improvements in reasoning efficiency and reduced token usage without
sacrificing accuracy. Our work provides a promising direction for improving LLM
reasoning performance and cost-effectiveness.
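
As background, the quantities named in the abstract have standard definitions in Pearl's counterfactual notation, reproduced below. Reading $X = x$ as "a given reasoning step is present" and $Y = y$ as "the final answer is correct" is an illustrative assumption; the abstract does not state how the paper instantiates these variables.

```latex
\begin{align}
\mathrm{PN}  &= P\left(Y_{x'} = y' \mid X = x,\; Y = y\right), \\
\mathrm{PS}  &= P\left(Y_{x} = y \mid X = x',\; Y = y'\right), \\
\mathrm{PNS} &= P\left(Y_{x} = y,\; Y_{x'} = y'\right).
\end{align}
% Under exogeneity and monotonicity, PNS is identifiable from observed conditionals:
\begin{equation}
\mathrm{PNS} = P(y \mid x) - P(y \mid x').
\end{equation}
```

Under that reading, a step with low PN is a candidate for pruning (the answer would likely survive without it), while a chain with low PS suggests that a step is missing.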

[28] Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs

Rodion Oblovatny,Alexandra Bazarova,Alexey Zaytsev

Main category: cs.CL

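TL;DR: The paper proposes detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between the hidden-state distributions of the prompt and the response, using trainable deep kernels to increase sensitivity, and reports strong performance.

Details Motivation: LLM hallucination is an increasingly prominent problem, and conventional methods depend on external knowledge or auxiliary models rather than a model-intrinsic mechanism. The authors use distributional distances as principled hallucination scores.

Contribution: 1. The finding that hallucinated responses deviate less from their prompts than grounded responses; 2. a model-intrinsic detection method based on distributional distances; 3. trainable deep kernels that increase detection sensitivity.

Method: The method compares the probability distributions of prompt and response hidden states, uses deep learnable kernels to capture geometric differences between the distributions, and produces a hallucination score.

Result: The approach outperforms existing baselines on several benchmarks and remains competitive even without kernel training.

Insight: Hallucinations may arise from superficial rephrasing rather than substantive reasoning, and distributional distance is an effective signal for detecting them.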

Abstract: We present a novel approach for detecting hallucinations in large language
models (LLMs) by analyzing the probabilistic divergence between prompt and
response hidden-state distributions. Counterintuitively, we find that
hallucinated responses exhibit smaller deviations from their prompts compared
to grounded responses, suggesting that hallucinations often arise from
superficial rephrasing rather than substantive reasoning. Leveraging this
insight, we propose a model-intrinsic detection method that uses distributional
distances as principled hallucination scores, eliminating the need for external
knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable
kernels that automatically adapt to capture nuanced geometric differences
between distributions. Our approach outperforms existing baselines,
demonstrating state-of-the-art performance on several benchmarks. The method
remains competitive even without kernel training, offering a robust, scalable
solution for hallucination detection.
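
One way to picture a kernel-based distributional distance between two sets of hidden states is the maximum mean discrepancy (MMD). The sketch below is an assumption-laden illustration: the abstract does not name the distance used, and a fixed RBF kernel with a hypothetical `bandwidth` parameter is swapped in for the paper's learnable deep kernel. The arrays stand in for prompt and response hidden states.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Pairwise RBF kernel values between rows of a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point error.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
    )

# Toy stand-ins for hidden states: 50 tokens, 4 dimensions (hypothetical sizes).
rng = np.random.default_rng(0)
prompt_states = rng.normal(size=(50, 4))
near_response = prompt_states + 0.1   # stays close to the prompt distribution
far_response = prompt_states + 3.0    # moves far from the prompt distribution

near_score = mmd_squared(prompt_states, near_response)
far_score = mmd_squared(prompt_states, far_response)
```

Given the abstract's counterintuitive finding, a smaller distance would be read as more suspicious under this scheme, so any threshold would flag low scores rather than high ones.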

[29] The Emergence of Abstract Thought in Large Language Models Beyond Any Language

Yuxin Chen,Yiran Zhao,Yang Zhang,An Zhang,Kenji Kawaguchi,Shafiq Joty,Junnan Li,Tat-Seng Chua,Michael Qizhe Shieh,Wenxuan Zhang

Main category: cs.CL

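TL;DR: The study finds that large language models (LLMs) progressively develop a core language-agnostic parameter space during training, which supports abstract thought across languages. The proportion and importance of shared neurons increase as models develop.

Details Motivation: Preliminary studies suggest that the hidden activations of LLMs are dominated by English, but improved multilingual performance challenges this view. The authors investigate whether LLMs really depend on a particular language to think.

Contribution: Identifies a core language-agnostic parameter space and shows that the growth of shared neurons supports abstract thought across languages. It also proposes neuron-specific training strategies for LLMs at different development stages.

Method: The authors analyze language-related neurons (shared and exclusive), track how they change as models develop, and design neuron-specific training strategies.

Result: The proportion and functional importance of shared neurons increase over time, supporting the emergence of cross-lingual abstract thought. The proposed training strategies are effective across different LLM families.

Insight: Abstract thought in LLMs does not depend on a specific language but is realized through shared neurons. A model's development stage matters for the design of training strategies.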

Abstract: As large language models (LLMs) continue to advance, their capacity to
function effectively across a diverse range of languages has shown marked
improvement. Preliminary studies observe that the hidden activations of LLMs
often resemble English, even when responding to non-English prompts. This has
led to the widespread assumption that LLMs may “think” in English. However,
more recent results showing strong multilingual performance, even surpassing
English performance on specific tasks in other languages, challenge this view.
In this work, we find that LLMs progressively develop a core language-agnostic
parameter space, a remarkably small subset of parameters whose deactivation
results in significant performance degradation across all languages. This
compact yet critical set of parameters underlies the model’s ability to
generalize beyond individual languages, supporting the emergence of abstract
thought that is not tied to any specific linguistic system. Specifically, we
identify language-related neurons, those that are consistently activated during the
processing of particular languages, and categorize them as either shared
(active across multiple languages) or exclusive (specific to one). As LLMs
undergo continued development over time, we observe a marked increase in both
the proportion and functional importance of shared neurons, while exclusive
neurons progressively diminish in influence. These shared neurons constitute
the backbone of the core language-agnostic parameter space, supporting the
emergence of abstract thought. Motivated by these insights, we propose
neuron-specific training strategies tailored to LLMs’ language-agnostic levels
at different development stages. Experiments across diverse LLM families
support our approach.

[30] VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Hao Peng,Yunjia Qi,Xiaozhi Wang,Bin Xu,Lei Hou,Juanzi Li

Main category: cs.CL

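TL;DR: The paper proposes VerIF, which combines rule-based code verification with verification by a large reasoning model (e.g., QwQ-32B) to improve reinforcement learning with verifiable rewards (RLVR) for instruction following.

Details Motivation: RLVR has become a key technique for enhancing large language models (LLMs), but best practices for instruction following remain underexplored.

Contribution: 1. VerIF, which combines rule-based and LLM-based verification; 2. VerInstruct, a high-quality instruction dataset of approximately 22,000 instances; 3. significant improvements on several instruction-following benchmarks.

Method: VerIF combines rule-based code verification with LLM-based verification from a large reasoning model, and reinforcement learning is trained on the high-quality VerInstruct dataset.

Result: The trained models reach state-of-the-art performance among comparable models, generalize well to unseen constraints, and retain their general capabilities.

Insight: VerIF can be integrated into existing reinforcement learning frameworks to improve model performance while preserving general capabilities.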

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key
technique for enhancing large language models (LLMs), with verification
engineering playing a central role. However, best practices for RL in
instruction following remain underexplored. In this work, we explore the
verification challenge in RL for instruction following and propose VerIF, a
verification method that combines rule-based code verification with LLM-based
verification from a large reasoning model (e.g., QwQ-32B). To support this
approach, we construct a high-quality instruction-following dataset,
VerInstruct, containing approximately 22,000 instances with associated
verification signals. We apply RL training with VerIF to two models, achieving
significant improvements across several representative instruction-following
benchmarks. The trained models reach state-of-the-art performance among models
of comparable size and generalize well to unseen constraints. We further
observe that their general capabilities remain unaffected, suggesting that RL
with VerIF can be integrated into existing RL recipes to enhance overall model
performance. We have released our datasets, codes, and models to facilitate
future research at https://github.com/THU-KEG/VerIF.
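
The split between hard, code-checkable constraints and soft, model-judged constraints can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`max_words`, `contains_keyword`, `hybrid_reward`) are hypothetical, the judge is a plain callable standing in for a reasoning-model verifier, and the all-or-nothing way the two signals are combined is an assumption, since the abstract does not say how VerIF aggregates them.

```python
from typing import Callable, Sequence

def max_words(limit: int) -> Callable[[str], bool]:
    """Rule check: the response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit

def contains_keyword(keyword: str) -> Callable[[str], bool]:
    """Rule check: the response mentions `keyword`, case-insensitively."""
    return lambda response: keyword.lower() in response.lower()

def hybrid_reward(
    response: str,
    rule_checks: Sequence[Callable[[str], bool]],
    soft_judge: Callable[[str], bool],
) -> float:
    """Return 1.0 only if every rule check and the soft judge accept the response."""
    if not all(check(response) for check in rule_checks):
        return 0.0
    return 1.0 if soft_judge(response) else 0.0

# Toy usage with a stub judge (a real judge would call a verifier model).
rules = [max_words(5), contains_keyword("hello")]
accepted = hybrid_reward("hello there friend", rules, lambda r: True)
too_long = hybrid_reward("hello " * 10, rules, lambda r: True)
judged_bad = hybrid_reward("hello there", rules, lambda r: False)
```

Running the cheap rule checks first means the expensive judge is only consulted for responses that already satisfy the hard constraints.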

[31] Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking

Wuwei Zhang,Fangcong Yin,Howard Yen,Danqi Chen,Xi Ye

Main category: cs.CL

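TL;DR: The paper introduces QRHEAD (Query-Focused Retrieval Head), a set of attention heads selected to improve retrieval from long context, and builds QR-RETRIEVER, an efficient retriever that performs strongly on long-context reasoning tasks.

Details Motivation: Prior work identified retrieval heads in long-context language models, but how to further improve these heads to strengthen retrieval and reasoning remains an open question.

Contribution: 1) QRHEAD, retrieval heads identified using query-focused attention scores; 2) QR-RETRIEVER, a retriever built on QRHEAD that performs well on long-context reasoning and re-ranking; 3) interpretability analysis of the long-context capabilities of LMs.

Method: 1) Identify QRHEAD by aggregating attention scores with respect to the input query; 2) use the attention mass of QRHEAD as retrieval scores to build QR-RETRIEVER; 3) for long-context reasoning, select the most relevant parts with the highest retrieval scores.

Result: On long-context reasoning tasks (LongMemEval and CLIPPER), QR-RETRIEVER yields over 10% performance gains over full context and outperforms other dense retrievers. On the BEIR benchmark it performs strongly as a re-ranker, outperforming LLM-based re-rankers such as RankGPT.

Insight: Both query-context attention scoring and task selection are crucial for identifying QRHEAD with downstream utility, and the work offers a new perspective on the long-context capabilities of LMs.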

Abstract: Recent work has identified retrieval heads (Wu et al., 2025b), a subset of
attention heads responsible for retrieving salient information in long-context
language models (LMs), as measured by their copy-paste behavior in
Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused
Retrieval Head), an improved set of attention heads that enhance retrieval from
long context. We identify QRHEAD by aggregating attention scores with respect
to the input query, using a handful of examples from real-world tasks (e.g.,
long-context QA). We further introduce QR- RETRIEVER, an efficient and
effective retriever that uses the accumulated attention mass of QRHEAD as
retrieval scores. We use QR- RETRIEVER for long-context reasoning by selecting
the most relevant parts with the highest retrieval scores. On multi-hop
reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains
over full context and outperforms strong dense retrievers. We also evaluate
QR-RETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves
strong zero-shot performance, outperforming other LLM-based re-rankers such as
RankGPT. Further analysis shows that both the query-context attention scoring
and task selection are crucial for identifying QRHEAD with strong downstream
utility. Overall, our work contributes a general-purpose retriever and offers
interpretability insights into the long-context capabilities of LMs.
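
The scoring idea in this abstract (attention mass from query tokens to each context chunk, accumulated over a chosen set of heads) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nested-list layout `attn[h][q][k]`, the averaging over heads and query tokens, and the names `score_documents` and `top_k` are all assumptions made for the example.

```python
def score_documents(attn, head_ids, query_positions, doc_spans):
    """Score each document span by the attention mass it receives.

    attn[h][q][k] is the attention weight from token q to token k in head h
    (a hypothetical layout chosen for this sketch).
    doc_spans is a list of (start, end) token ranges, end exclusive.
    """
    scores = []
    for start, end in doc_spans:
        mass = 0.0
        for h in head_ids:
            for q in query_positions:
                # Attention that query token q pays to this span in head h.
                mass += sum(attn[h][q][start:end])
        # Average over heads and query tokens (an assumed normalization).
        scores.append(mass / (len(head_ids) * len(query_positions)))
    return scores


def top_k(scores, k):
    """Return indices of the k highest-scoring spans, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In practice the attention weights would come from a real model and the head set from a selection step on a few task examples; here both are simply passed in.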

[32] Resa: Transparent Reasoning Models via SAEs

Shangshang Wang,Julian Asilis,Ömer Faruk Akgül,Enes Burak Bilgin,Ollie Liu,Deqing Fu,Willie Neiswanger

Main category: cs.CL

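TL;DR: Resa introduces an efficient sparse autoencoder tuning procedure (SAE-Tuning) that captures reasoning abilities from a source model and uses them to guide training of a target model, sharply cutting training cost and time while keeping strong performance.

Details Motivation: How can reasoning be elicited in language models cheaply and efficiently? Existing approaches usually rely on expensive reinforcement learning (RL) training; Resa aims to extract and transfer reasoning abilities with sparse autoencoders (SAEs), greatly reducing cost and training time.

Contribution: 1. SAE-Tuning, which extracts reasoning abilities with a sparse autoencoder and uses them to guide target-model training; 2. reasoning performance close to that of RL training at low cost (about $1) and in little time (about 20 minutes); 3. evidence that the extracted abilities are generalizable and modular.

Method: 1. Train a sparse autoencoder (SAE) to capture reasoning abilities from a source model; 2. use the SAE to guide a standard supervised fine-tuning process that elicits reasoning in the target model; 3. use only question-answer data, with no reasoning traces.

Result: 1. Strong results on AIME24 and AMC23 (43.33% Pass@1 and 90% Pass@1, respectively); 2. training cost reduced by more than 2000x and training time by more than 450x; 3. the extracted abilities are generalizable and modular.

Insight: 1. Sparse autoencoders can effectively extract and transfer the reasoning abilities of language models; 2. the generality and modularity of these abilities make flexible model reuse possible; 3. low-cost, efficient training offers a new option for resource-constrained settings.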

Abstract: How cost-effectively can we elicit strong reasoning in language models by
leveraging their underlying representations? We answer this question with Resa,
a family of 1.5B reasoning models trained via a novel and efficient sparse
autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to
capture reasoning abilities from a source model, and then uses the trained SAE
to guide a standard supervised fine-tuning process to elicit such abilities in
a target model, all using verified question-answer data without any reasoning
traces. Notably, when applied to certain base models before further RL
post-training, SAE-Tuning retains >97% of its RL-trained counterpart’s
reasoning performance while reducing training costs by >2000x to roughly $1
and training time by >450x to around 20 minutes. Furthermore, when applied to
lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning
performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only
around $1 additional cost. Surprisingly, the reasoning abilities extracted via
SAEs are potentially both generalizable and modular. Generality means abilities
extracted from one dataset still elevate performance on a larger and
overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math
can be attached to the R1-Distill model at test time, without any retraining,
and yield comparable gains. Extensive ablations validate these findings and all
artifacts are fully open-sourced.

[33] Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs

Hiroshi Matsuda,Chunpeng Ma,Masayuki Asahara

Main category: cs.CL

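TL;DR: The paper proposes a step-by-step instruction strategy and a simplified output format that markedly improve LLM accuracy on dependency parsing, reaching state-of-the-art performance on Universal Dependencies datasets across 17 languages.

Details Motivation: Although large language models (LLMs) perform well on many tasks, standard prompting struggles to produce structurally valid and accurate output for dependency parsing. The paper addresses this by improving the instruction strategy and the output format.

Contribution: 1. A step-by-step instruction strategy (universal part-of-speech tagging first, then prediction of syntactic heads and dependency labels). 2. A simplified CoNLL-U-like output format, with results free of hallucination or contamination. 3. Evidence that multilingual fine-tuning improves cross-language generalization.

Method: 1. Step-by-step instructions: tag parts of speech first, then predict syntactic heads and dependency labels. 2. Emit results in a simplified CoNLL-U-like format. 3. Fine-tune on multilingual datasets.

Result: State-of-the-art performance on Universal Dependencies datasets across 17 languages, with output free of hallucination or contamination and improved cross-language generalization.

Insight: Explicit reasoning steps and a consistent format are critical to LLM performance on dependency parsing, and multilingual fine-tuning is an effective way to improve cross-language generalization.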

Abstract: Recent advances in large language models (LLMs) have enabled impressive
performance in various tasks. However, standard prompting often struggles to
produce structurally valid and accurate outputs, especially in dependency
parsing. We propose a novel step-by-step instruction strategy, where universal
part-of-speech tagging precedes the prediction of syntactic heads and
dependency labels, and a simplified CoNLL-U-like output format. Our method
achieves state-of-the-art accuracy on Universal Dependencies datasets across 17
languages without hallucination or contamination. We further show that
multilingual fine-tuning simultaneously improves cross-language generalization
performance. Our results highlight the effectiveness of explicit reasoning
steps in LLM-based parsing and offer a scalable, format-consistent alternative
to bracket-based approaches.
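
To make "structurally valid" concrete, the sketch below parses a tab-separated table and checks that it encodes a well-formed dependency tree. The five-column layout (ID, FORM, UPOS, HEAD, DEPREL) is an assumption for illustration; the abstract does not spell out the exact columns of the simplified format, and `parse_rows` and `is_valid_tree` are hypothetical helper names.

```python
def parse_rows(text):
    """Parse tab-separated lines with assumed columns: ID, FORM, UPOS, HEAD, DEPREL."""
    rows = []
    for line in text.strip().splitlines():
        idx, form, upos, head, deprel = line.split("\t")
        rows.append({"id": int(idx), "form": form, "upos": upos,
                     "head": int(head), "deprel": deprel})
    return rows


def is_valid_tree(rows):
    """Check IDs, head ranges, a single root, and the absence of cycles."""
    n = len(rows)
    # Token IDs must be 1..n in order.
    if [r["id"] for r in rows] != list(range(1, n + 1)):
        return False
    heads = {r["id"]: r["head"] for r in rows}
    # Each head is 0 (root) or another token's ID, never the token itself.
    if any(h < 0 or h > n or h == i for i, h in heads.items()):
        return False
    # Exactly one token attaches to the root.
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    # Following heads from any token must reach the root without looping.
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True
```

A check like this can be run on every model output, so malformed parses are detected before scoring rather than silently counted as errors.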

[34] Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages

Amel Muminovic,Amela Kadric Muminovic

Main category: cs.CL

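TL;DR: The study evaluates how well large language models detect toxic language in the low-resource languages Serbian, Croatian, and Bosnian, and shows that adding a short context snippet and improving prompt design can noticeably improve performance.

Details Motivation: Online toxic language causes real harm, especially in regions whose low-resource languages lack labeled data, so it matters how well large language models can detect toxic content in these languages.

Contribution: 1) A manually labeled dataset of 4,500 YouTube and TikTok comments drawn from multiple categories; 2) context augmentation and prompt optimization as ways to improve toxicity detection in low-resource languages; 3) evidence that a context snippet improves recall and F1 score.

Method: Four large language models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in zero-shot and context-augmented modes; the metrics were precision, recall, F1 score, accuracy, and false positive rate.

Result: Context augmentation raised recall by about 0.12 on average and F1 score by up to 0.10. Gemini performed best in context-augmented mode (F1 = 0.82, accuracy = 0.82), while zero-shot GPT-4.1 led on precision and had the lowest false positive rate.

Insight: In low-resource languages, simple context augmentation and prompt design can noticeably improve toxicity detection, giving a practical strategy for real deployments.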

Abstract: Online toxic language causes real harm, especially in regions with limited
moderation tools. In this study, we evaluate how large language models handle
toxic comments in Serbian, Croatian, and Bosnian, languages with limited
labeled data. We built and manually labeled a dataset of 4,500 YouTube and
TikTok comments drawn from videos across diverse categories, including music,
politics, sports, modeling, influencer content, discussions of sexism, and
general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude
3 Opus) were tested in two modes: zero-shot and context-augmented. We measured
precision, recall, F1 score, accuracy and false positive rates. Including a
short context snippet raised recall by about 0.12 on average and improved F1
score by up to 0.10, though it sometimes increased false positives. The best
balance came from Gemini in context-augmented mode, reaching an F1 score of
0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the
lowest false alarms. We show how adding minimal context can improve toxic
language detection in low-resource settings and suggest practical strategies
such as improved prompt design and threshold calibration. These results show
that prompt design alone can yield meaningful gains in toxicity detection for
underserved Balkan language communities.
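
For reference, the five reported metrics follow directly from the confusion counts of a binary toxic/non-toxic labeling. The sketch below uses the standard definitions; the function name `binary_metrics` and the 1 = toxic encoding are choices made for this example, not details taken from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1, accuracy, and false positive rate.

    Labels are 1 (toxic) or 0 (non-toxic); this encoding is an assumption.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Share of non-toxic comments that were wrongly flagged.
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "false_positive_rate": false_positive_rate}
```

The trade-off described in the abstract is visible in these formulas: flagging more comments raises `tp` (and recall) but can also raise `fp`, which lowers precision and increases the false positive rate.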

[35] From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Yang Li,Qiang Sheng,Yehan Yang,Xueyao Zhang,Juan Cao

Main category: cs.CL

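TL;DR: The paper proposes a streaming content monitor (SCM) that detects and stops harmful output early during LLM generation, addressing the high latency of conventional full detection. Using the fine-grained FineHarm dataset and dual-supervision training, SCM matches full-detection performance while seeing only the first 18% of tokens.

Details Motivation: Existing LLM safety moderation relies on detection over the complete output, which causes high latency, while partial-detection approaches directly apply moderators trained for full detection and thus suffer from a training-inference gap. The paper aims for a solution that natively supports partial detection.

Contribution: 1. FineHarm, a dataset with fine-grained annotations that supports token-level training; 2. the streaming content monitor (SCM), trained with dual supervision (response- and token-level labels) for efficient early detection.

Method: 1. Build the FineHarm dataset of 29K prompt-response pairs; 2. train the SCM model with supervision from both response- and token-level labels so that it can monitor the LLM output stream in real time.

Result: SCM reaches a macro F1 score of 0.95+ while seeing only the first 18% of tokens, close to full detection, and can also improve safety alignment.

Insight: Fine-grained annotation and dual-supervision training are key to early detection; SCM not only intercepts harmful content efficiently but can also serve as a pseudo-annotator for improving LLM safety alignment.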

Abstract: Though safety alignment has been applied to most large language models
(LLMs), LLM service providers generally deploy a subsequent moderation as the
external safety guardrail in real-world products. Existing moderators mainly
practice a conventional full detection, which determines the harmfulness based
on the complete LLM output, causing high service latency. Recent works pay more
attention to partial detection where moderators oversee the generation midway
and early stop the output if harmfulness is detected, but they directly apply
moderators trained with the full detection paradigm to incomplete outputs,
introducing a training-inference gap that lowers the performance. In this
paper, we explore how to form a data-and-model solution that natively supports
partial detection. For the data, we construct FineHarm, a dataset consisting of
29K prompt-response pairs with fine-grained annotations to provide reasonable
supervision for token-level training. Then, we propose the streaming content
monitor (SCM), which is trained with dual supervision of response- and token-level
labels and can follow the output stream of LLM to make a timely judgment of
harmfulness. Experiments show that SCM gains 0.95+ in macro F1 score that is
comparable to full detection, by only seeing the first 18% of tokens in
responses on average. Moreover, the SCM can serve as a pseudo-harmfulness
annotator for improving safety alignment and lead to a higher harmlessness
score than DPO.
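
The partial-detection loop described here can be sketched as a monitor that scores the output prefix after every token and stops generation once harm is detected. This is only an illustration under stated assumptions: `score_prefix` stands in for a trained token-level scorer, and the `threshold` and `patience` rule is a hypothetical stopping criterion, not the decision rule used in the paper.

```python
def monitor_stream(tokens, score_prefix, threshold=0.5, patience=2):
    """Follow a token stream and stop early when harm is detected.

    score_prefix(prefix) returns a harmfulness score in [0, 1] for the
    tokens seen so far (a stand-in for a trained monitor).
    Generation stops after `patience` consecutive scores >= threshold.
    Returns (tokens emitted so far, whether the stream was stopped).
    """
    emitted = []
    consecutive_hits = 0
    for token in tokens:
        emitted.append(token)
        if score_prefix(emitted) >= threshold:
            consecutive_hits += 1
        else:
            consecutive_hits = 0  # an isolated spike does not stop the stream
        if consecutive_hits >= patience:
            return emitted, True
    return emitted, False
```

Because the scorer only ever sees prefixes, training it on prefix-level or token-level labels avoids the training-inference gap that arises when a full-output classifier is applied to incomplete text.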

cs.CV

[36] Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations

Zhiyu Xue,Reza Abbasi-Asl,Ramtin Pedarsani

Main category: cs.CV

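TL;DR: The paper proposes a novel inference-time defense strategy that uses synthetic clinical demonstrations to make medical vision-language models (Med-VLMs) safer while avoiding the performance loss caused by over-defense.

Details Motivation: Generative Med-VLMs that produce complex text (e.g., diagnostic reports) have security vulnerabilities: they must reject harmful queries (e.g., instructions for insurance fraud) without rejecting benign clinical queries through over-defense.

Contribution: 1) An inference-time defense strategy that enhances model safety through synthetic clinical demonstrations; 2) evidence of its effectiveness on multimodal medical datasets; 3) a mixed demonstration strategy that balances safety and performance.

Method: Using diverse medical imaging datasets from nine modalities, the authors build a defense strategy on synthetic clinical demonstrations and address over-defense by increasing the demonstration budget and by mixing demonstrations.

Result: Experiments show that the synthetic demonstration strategy improves model safety without significantly hurting performance, and that the mixed demonstration strategy balances safety and performance under a few-shot budget.

Insight: Synthetic demonstrations are an effective and flexible way to enhance safety; a larger demonstration budget alleviates over-defense, and the mixed demonstration strategy is a workable option when the budget is limited.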

Abstract: Generative medical vision-language models (Med-VLMs) are primarily designed
to generate complex textual information (e.g., diagnostic reports) from
multimodal inputs including vision modality (e.g., medical images) and language
modality (e.g., clinical queries). However, their security vulnerabilities
remain underexplored. Med-VLMs should be capable of rejecting harmful queries,
such as "Provide detailed instructions for using this CT scan for
insurance fraud". At the same time, addressing security concerns introduces the
risk of over-defense, where safety-enhancing mechanisms may degrade general
performance, causing Med-VLMs to reject benign clinical queries. In this paper,
we propose a novel inference-time defense strategy to mitigate harmful queries,
enabling defense against visual and textual jailbreak attacks. Using diverse
medical imaging datasets collected from nine modalities, we demonstrate that
our defense strategy based on synthetic clinical demonstrations enhances model
safety without significantly compromising performance. Additionally, we find
that increasing the demonstration budget alleviates the over-defense issue. We
then introduce a mixed demonstration strategy as a trade-off solution for
balancing security and performance under few-shot demonstration budget
constraints.

[37] Segment Any Architectural Facades (SAAF): An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance

Peilin Li,Jun Yin,Jing Zhong,Ran Luo,Pengyu Zeng,Miao Zhang

Main category: cs.CV

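TL;DR: SAAF is an automatic segmentation model for building facades, walls, and windows based on multimodal semantic guidance. It incorporates natural language processing and an end-to-end training framework to improve automation and robustness, and experiments show that it outperforms existing methods on diverse datasets.

Details Motivation: In the digital development of architecture, automatic segmentation of walls and windows is a key step in improving the efficiency of building information models and computer-aided design.

Contribution: 1. A multimodal semantic collaborative feature extraction mechanism; 2. an end-to-end training framework that reduces the influence of manual intervention; 3. validation of the model's high-precision segmentation on diverse datasets.

Method: The model combines natural language processing with vision, fusing the semantics of text descriptions with image features, and uses an end-to-end training framework to learn the mapping from text descriptions to image segmentation autonomously.

Result: On multiple facade datasets, SAAF outperforms existing methods in mIoU, improving the accuracy and generalization of the segmentation task.

Insight: Applying multimodal learning to architecture offers new ideas and technical paths for architectural computer vision.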

Abstract: In the context of the digital development of architecture, the automatic
segmentation of walls and windows is a key step in improving the efficiency of
building information models and computer-aided design. This study proposes an
automatic segmentation model for building facade walls and windows based on
multimodal semantic guidance, called Segment Any Architectural Facades (SAAF).
First, SAAF has a multimodal semantic collaborative feature extraction
mechanism. By combining natural language processing technology, it can fuse the
semantic information in text descriptions with image features, enhancing the
semantic understanding of building facade components. Second, we developed an
end-to-end training framework that enables the model to autonomously learn the
mapping relationship from text descriptions to image segmentation, reducing the
influence of manual intervention on the segmentation results and improving the
automation and robustness of the model. Finally, we conducted extensive
experiments on multiple facade datasets. The segmentation results of SAAF
outperformed existing methods in the mIoU metric, indicating that the SAAF
model can maintain high-precision segmentation ability when faced with diverse
datasets. Our model has made certain progress in improving the accuracy and
generalization ability of the wall and window segmentation task. It is expected
to provide a reference for the development of architectural computer vision
technology and also explore new ideas and technical paths for the application
of multimodal learning in the architectural field.
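
The mIoU metric used for the comparison averages the per-class intersection-over-union between predicted and ground-truth label maps. The sketch below follows that standard definition; skipping classes absent from both maps is one common convention and an assumption here, as is the function name `mean_iou`.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes for 2D integer label maps.

    pred and target are equally sized lists of rows of class IDs.
    Classes that appear in neither map are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        intersection = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                in_pred, in_target = p == c, t == c
                intersection += in_pred and in_target
                union += in_pred or in_target
        if union:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For a facade task the class IDs might stand for background, wall, and window, so a model that finds walls well but misses small windows is penalized on the window class even if most pixels are correct.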

[38] VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks

Xinlong Chen,Yuanxing Zhang,Yushuo Guan,Bohan Zeng,Yang Shi,Sihan Yang,Pengfei Wan,Qiang Liu,Liang Wang,Tieniu Tan

Main category: cs.CV

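TL;DR: The paper proposes VersaVid-R1. Using two new datasets, DarkEventInfer and MixVidQA, together with reinforcement learning, it is the first model to extend the Reason-Then-Respond paradigm to video understanding and reasoning, and it significantly outperforms existing models on multiple tasks.

Details Motivation: Multimodal large language models have successfully applied the Reason-Then-Respond paradigm to image reasoning, but video reasoning remains underdeveloped because high-quality data and effective training methods are scarce. The paper aims to fill this gap.

Contribution: 1. Two new datasets, DarkEventInfer and MixVidQA, focused on video understanding and reasoning; 2. VersaVid-R1, the first model to extend the Reason-Then-Respond paradigm to video tasks.

Method: 1. Train on the DarkEventInfer and MixVidQA datasets; 2. apply reinforcement learning guided by diverse reward functions.

Result: VersaVid-R1 significantly outperforms existing models on general video understanding, cognitive reasoning, and captioning tasks.

Insight: Purpose-built datasets combined with reinforcement learning can effectively improve video understanding and reasoning, offering a new approach to multimodal video tasks.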

Abstract: Recent advancements in multimodal large language models have successfully
extended the Reason-Then-Respond paradigm to image-based reasoning, yet
video-based reasoning remains an underdeveloped frontier, primarily due to the
scarcity of high-quality reasoning-oriented data and effective training
methodologies. To bridge this gap, we introduce DarkEventInfer and MixVidQA,
two novel datasets specifically designed to stimulate the model’s advanced
video understanding and reasoning abilities. DarkEventInfer presents videos
with masked event segments, requiring models to infer the obscured content
based on contextual video cues. MixVidQA, on the other hand, presents
interleaved video sequences composed of two distinct clips, challenging models
to isolate and reason about one while disregarding the other. Leveraging these
carefully curated training samples together with reinforcement learning guided
by diverse reward functions, we develop VersaVid-R1, the first versatile video
understanding and reasoning model under the Reason-Then-Respond paradigm
capable of handling multiple-choice and open-ended question answering, as well
as video captioning tasks. Extensive experiments demonstrate that VersaVid-R1
significantly outperforms existing models across a broad spectrum of
benchmarks, covering video general understanding, cognitive reasoning, and
captioning tasks.

[39] FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

Zheqi He,Yesheng Liu,Jing-shu Zheng,Xuejing Li,Richeng Xuan,Jin-Ge Yao,Xi Yang

Main category: cs.CV

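TL;DR: FlagEvalMM is an open-source multimodal evaluation framework for comprehensively assessing models on vision-language tasks, with support for flexible resource allocation and easy addition of new tasks.

Details Motivation: Existing evaluation tools for multimodal models often lack flexibility and efficiency; FlagEvalMM aims to address these problems and provide more comprehensive evaluation.

Contribution: 1. An independent service framework that decouples inference from evaluation; 2. efficient inference tools and asynchronous data loading to improve efficiency; 3. support for a wide range of multimodal tasks.

Method: Inference and evaluation are separated through an independent evaluation service; inference is accelerated with tools such as vLLM and SGLang, and data is loaded asynchronously.

Result: Experiments show that FlagEvalMM evaluates models efficiently and accurately, revealing their strengths and limitations.

Insight: The framework gives multimodal research a standardized and efficient evaluation tool that makes models easier to compare and improve.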

Abstract: We present FlagEvalMM, an open-source evaluation framework designed to
comprehensively assess multimodal models across a diverse range of
vision-language understanding and generation tasks, such as visual question
answering, text-to-image/video generation, and image-text retrieval. We
decouple model inference from evaluation through an independent evaluation
service, thus enabling flexible resource allocation and seamless integration of
new tasks and models. Moreover, FlagEvalMM utilizes advanced inference
acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to
significantly enhance evaluation efficiency. Extensive experiments show that
FlagEvalMM offers accurate and efficient insights into model strengths and
limitations, making it a valuable tool for advancing multimodal research. The
framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.

[40] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Zheda Mai,Arpita Chowdhury,Zihe Wang,Sooyoung Jeon,Lemeng Wang,Jiacheng Hou,Jihyung Kil,Wei-Lun Chao

Main category: cs.CV

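TL;DR: AVA-Bench is a benchmark of atomic visual abilities designed for vision foundation models (VFMs). By disentangling 14 atomic visual abilities (such as localization and depth estimation), it addresses the data mismatch and ability entanglement of conventional evaluation.

Details Motivation: Conventional evaluation (such as VQA benchmarks) has two blind spots, data mismatch and the entanglement of multiple abilities, so it cannot pinpoint the specific weaknesses of a vision foundation model.

Contribution: AVA-Bench, the first benchmark that explicitly disentangles 14 atomic visual abilities, which makes it possible to locate precisely where a vision foundation model is strong or weak.

Method: The benchmark matches training and test distributions within each ability, disentangles the 14 atomic visual abilities, and is applied to leading vision foundation models to produce distinctive "ability fingerprints".

Result: Experiments show that a smaller language model (0.5B) yields VFM rankings similar to those of a larger one (7B) while cutting GPU hours by 8x.

Insight: AVA-Bench provides a transparent and efficient framework for precise evaluation and targeted improvement of vision foundation models, and may help drive the next generation of models.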

Abstract: The rise of vision foundation models (VFMs) calls for systematic evaluation.
A common approach pairs VFMs with large language models (LLMs) as
general-purpose heads, followed by evaluation on broad Visual Question
Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i)
the instruction tuning data may not align with VQA test distributions, meaning
a wrong prediction can stem from such data mismatch rather than a VFM’s visual
shortcomings; (ii) VQA benchmarks often require multiple visual abilities,
making it hard to tell whether errors stem from lacking all required abilities
or just a single critical one. To address these gaps, we introduce AVA-Bench,
the first benchmark that explicitly disentangles 14 Atomic Visual Abilities
(AVAs) – foundational skills like localization, depth estimation, and spatial
understanding that collectively support complex visual reasoning tasks. By
decoupling AVAs and matching training and test distributions within each,
AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench
to leading VFMs thus reveals distinctive “ability fingerprints,” turning VFM
selection from educated guesswork into principled engineering. Notably, we find
that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours
by 8x, enabling more efficient evaluation. By offering a comprehensive and
transparent benchmark, we hope AVA-Bench lays the foundation for the next
generation of VFMs.

[41] BakuFlow: A Streamlining Semi-Automatic Label Generation Tool

Jerry Lin,Partick P. W. Chen

Main category: cs.CV

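TL;DR: BakuFlow is a semi-automatic image labeling tool that combines manual correction, data augmentation, label propagation, and an automatic labeling module to improve labeling efficiency and user experience.

Details Motivation: Manually labeling large-scale datasets is time-consuming and error-prone, and existing tools such as LabelImg still require each image to be labeled by hand, so a more efficient solution is needed.

Contribution: BakuFlow offers four features: 1) a live adjustable magnifier for manual correction; 2) interactive data augmentation; 3) label propagation; 4) an automatic labeling framework based on a modified YOLOE that supports flexibly adding classes and visual prompts.

Method: The tool combines manual and automatic labeling through a live magnifier, a data augmentation module, label propagation, and a modified YOLOE framework that supports adding classes dynamically.

Result: Labeling workload is substantially reduced, especially for video data and dynamic datasets, improving efficiency in practical computer vision tasks.

Insight: By combining the strengths of manual and automatic labeling, semi-automatic tools can relieve the bottleneck of large-scale data labeling, particularly for the changing requirements of industrial scenarios.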

Abstract: Accurate data labeling (or annotation) is still a bottleneck in computer
vision, especially for large-scale tasks where manual labeling is
time-consuming and error-prone. While tools like LabelImg can handle the
labeling task, some of them still require annotators to manually label each
image. In this paper, we introduce BakuFlow, a streamlining semi-automatic
label generation tool. Key features include (1) a live adjustable magnifier for
pixel-precise manual corrections, improving user experience; (2) an interactive
data augmentation module to diversify training datasets; (3) label propagation
for rapidly copying labeled objects between consecutive frames, greatly
accelerating annotation of video data; and (4) an automatic labeling module
powered by a modified YOLOE framework. Unlike the original YOLOE, our extension
supports adding new object classes and any number of visual prompts per class
during annotation, enabling flexible and scalable labeling for dynamic,
real-world datasets. These innovations make BakuFlow especially effective for
object detection and tracking, substantially reducing labeling workload and
improving efficiency in practical computer vision and industrial scenarios.

[42] Bias Analysis in Unconditional Image Generative Models

Xiaofeng Zhang,Michelle Lin,Simon Lacoste-Julien,Aaron Courville,Yash Goyal

Main category: cs.CV

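TL;DR: The paper analyzes how bias arises in unconditional image generative models. It defines bias as the difference between the probability of an attribute's presence in the observed distribution and its expected proportion in an ideal reference distribution; experiments show that the detected attribute shifts are small but markedly sensitive to the classifier.

Details Motivation: To study the mechanisms of bias in unconditional image generative models, reveal the attribute shift between training and generated distributions, and examine how much the evaluation framework depends on the classifier.

Contribution: A definition of bias, evidence that classifier sensitivity affects the detection of attribute shifts, and a discussion of the limitations of the evaluation framework.

Method: Train a set of unconditional image generative models and use a common bias evaluation framework to analyze the attribute shift between the training and generated distributions.

Result: The attribute shifts are small, but their detection is sensitive to the classifier, especially when attribute values lie on a spectrum rather than being binary.

Insight: The study calls for more representative labeling, closer scrutiny of the limitations of evaluation frameworks, and recognition of the socially complex nature of attributes.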

Abstract: The widespread adoption of generative AI models has raised growing concerns
about representational harm and potential discriminatory outcomes. Yet, despite
growing literature on this topic, the mechanisms by which bias emerges -
especially in unconditional generation - remain poorly understood. We define the
bias of an attribute as the difference between the probability of its presence
in the observed distribution and its expected proportion in an ideal reference
distribution. In our analysis, we train a set of unconditional image generative
models and adopt a commonly used bias evaluation framework to study bias shift
between training and generated distributions. Our experiments reveal that the
detected attribute shifts are small. We find that the attribute shifts are
sensitive to the attribute classifier used to label generated images in the
evaluation framework, particularly when its decision boundaries fall in
high-density regions. Our empirical analysis indicates that this classifier
sensitivity is often observed in attribute values that lie on a spectrum, as
opposed to exhibiting a binary nature. This highlights the need for more
representative labeling practices, understanding the shortcomings through
greater scrutiny of evaluation frameworks, and recognizing the socially complex
nature of attributes when evaluating bias.
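
The bias definition and the classifier-sensitivity finding can be illustrated with a few lines of code. This is a sketch under stated assumptions: attribute labels are 0/1, the classifier is reduced to a score threshold, and the helper names (`attribute_rate`, `bias`, `bias_shift`, `label_with_threshold`) are hypothetical.

```python
def attribute_rate(labels):
    """Fraction of samples in which the attribute is present (labels are 0/1)."""
    return sum(labels) / len(labels)


def bias(labels, reference_proportion):
    """Observed probability of the attribute minus its reference proportion."""
    return attribute_rate(labels) - reference_proportion


def bias_shift(train_labels, generated_labels):
    """Change in attribute rate from the training to the generated distribution."""
    return attribute_rate(generated_labels) - attribute_rate(train_labels)


def label_with_threshold(scores, threshold):
    """Turn classifier scores into 0/1 labels; the threshold is the decision boundary."""
    return [1 if s >= threshold else 0 for s in scores]
```

If many samples have scores close to the decision boundary, a small change in `threshold` flips many labels, so the measured shift reflects the classifier as much as the generator. This is the sensitivity the abstract reports for attributes that lie on a spectrum.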

[43] CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation

Arnav Yayavaram,Siddharth Yayavaram,Simran Khanuja,Michael Saxon,Graham Neubig

Main category: cs.CV

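TL;DR: CAIRe is a novel evaluation metric for the cultural relevance of an image, addressing the problem of measuring cross-cultural bias in text-to-image models.

Details Motivation: As text-to-image models become widespread, ensuring equitable performance across diverse cultural contexts is critical, yet the difficulty of measuring cross-cultural bias has stalled progress.

Contribution: The CAIRe framework grounds entities and concepts in an image to a knowledge base and uses factual information to give an independent graded judgment for each culture label, substantially improving measurement accuracy.

Method: CAIRe links the entities and concepts in an image to a knowledge base and scores cultural relevance through retrieval-augmented evaluation.

Result: On a manually curated dataset, CAIRe surpasses the baselines by 28% F1, and its correlations with human ratings reach 0.56 and 0.66.

Insight: CAIRe's retrieval-augmented evaluation is an effective tool for quantifying cross-cultural bias and lays a foundation for future work.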

Abstract: As text-to-image models become increasingly prevalent, ensuring their
equitable performance across diverse cultural contexts is critical. Efforts to
mitigate cross-cultural biases have been hampered by trade-offs, including a
loss in performance, factual inaccuracies, or offensive outputs. Despite
widespread recognition of these challenges, an inability to reliably measure
these biases has stalled progress. To address this gap, we introduce CAIRe, a
novel evaluation metric that assesses the degree of cultural relevance of an
image, given a user-defined set of labels. Our framework grounds entities and
concepts in the image to a knowledge base and uses factual information to give
independent graded judgments for each culture label. On a manually curated
dataset of culturally salient but rare items built using language models, CAIRe
surpasses all baselines by 28% F1 points. Additionally, we construct two
datasets for culturally universal concepts, one comprising T2I-generated
outputs and another retrieved from naturally occurring data. CAIRe achieves
Pearson’s correlations of 0.56 and 0.66 with human ratings on these sets, based
on a 5-point Likert scale of cultural relevance. This demonstrates its strong
alignment with human judgment across diverse image sources.

[44] Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao,Haoyuan Guo,Tuyen Hoang,Weilin Huang,Lu Jiang,Fangyuan Kong,Huixia Li,Jiashi Li,Liang Li,Xiaojie Li,Xunsong Li,Yifu Li,Shanchuan Lin,Zhijie Lin,Jiawei Liu,Shu Liu,Xiaonan Nie,Zhiwu Qing,Yuxi Ren,Li Sun,Zhi Tian,Rui Wang,Sen Wang,Guoqiang Wei,Guohong Wu,Jie Wu,Ruiqi Xia,Fei Xiao,Xuefeng Xiao,Jiangqiao Yan,Ceyuan Yang,Jianchao Yang,Runkai Yang,Tao Yang,Yihang Yang,Zilyu Ye,Xuejiao Zeng,Yan Zeng,Heng Zhang,Yang Zhao,Xiaozheng Zheng,Peihao Zhu,Jiaxin Zou,Feilong Zuo

Main category: cs.CV

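TL;DR: Seedance 1.0 is a high-performance video generation foundation model that achieves high-quality, fast, and instruction-following video generation through multi-source data curation, an efficient architecture, and optimized training.

Details Motivation: Current video generation models struggle to balance prompt following, motion plausibility, and visual quality at the same time; Seedance 1.0 aims to address this.

Contribution: 1) Multi-source data curation combined with precise video captioning; 2) an efficient architecture that supports multi-shot generation and joint learning of two tasks; 3) optimized post-training methods; 4) about 10x inference speedup.

Method: Multi-source data curation, an efficient architecture design, multi-task learning, fine-grained supervised fine-tuning, and video-specific RLHF.

Result: A 5-second video at 1080p resolution is generated in only 41.4 seconds, with high quality, spatiotemporal fluidity, and precise instruction following.

Insight: Data curation and architecture design are critical to the performance of video generation models, and multi-task learning can further improve generality and efficiency.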

Abstract: Notable breakthroughs in diffusion modeling have propelled rapid improvements
in video generation, yet current foundational models still face critical
challenges in simultaneously balancing prompt following, motion plausibility,
and visual quality. In this report, we introduce Seedance 1.0, a
high-performance and inference-efficient video foundation generation model that
integrates several core technical improvements: (i) multi-source data curation
augmented with precise and meaningful video captioning, enabling
comprehensive learning across diverse scenarios; (ii) an efficient architecture
design with a proposed training paradigm, which natively supports
multi-shot generation and joint learning of both text-to-video and
image-to-video tasks; (iii) carefully optimized post-training approaches
leveraging fine-grained supervised fine-tuning, and video-specific RLHF with
multi-dimensional reward mechanisms for comprehensive performance improvements;
(iv) excellent model acceleration achieving ~10x inference speedup through
multi-stage distillation strategies and system-level optimizations. Seedance
1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds
(NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance
1.0 stands out with high-quality and fast video generation, offering superior
spatiotemporal fluidity with structural stability, precise instruction
adherence in complex multi-subject contexts, and native multi-shot narrative
coherence with consistent subject representation.

[45] Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

Sungwon Hwang,Hyojin Jang,Kinam Kim,Minho Park,Jaegul Choo

Main category: cs.CV

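TL;DR: The paper proposes CREPA, a new regularization technique for fine-tuning video diffusion models (VDMs) that aligns hidden states with external features across frames, improving visual fidelity and cross-frame semantic consistency.

Details Motivation: Fine-tuning of video diffusion models falls short in preserving semantic consistency across frames, and existing representation alignment methods such as REPA were designed for image diffusion models and do not transfer directly to video.

Contribution: Cross-frame Representation Alignment (CREPA), which aligns the hidden states of the current frame with external features from neighboring frames, improving VDM fine-tuning in both visual quality and semantic consistency.

Method: CREPA is a regularization technique that aligns the hidden states of the current frame with external features from neighboring frames, and it works with parameter-efficient fine-tuning methods such as LoRA.

Result: CREPA is validated on large-scale VDMs such as CogVideoX-5B and Hunyuan Video, where it improves visual fidelity and cross-frame semantic consistency.

Insight: Cross-frame feature alignment is key to better fine-tuning of video diffusion models and, especially with parameter-efficient fine-tuning, noticeably improves the quality and consistency of generated videos.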

Abstract: Fine-tuning Video Diffusion Models (VDMs) at the user level to generate
videos that reflect specific attributes of training data presents notable
challenges, yet remains underexplored despite its practical importance.
Meanwhile, recent work such as Representation Alignment (REPA) has shown
promise in improving the convergence and quality of DiT-based image diffusion
models by aligning, or assimilating, their internal hidden states with external
pretrained visual features, suggesting its potential for VDM fine-tuning. In
this work, we first propose a straightforward adaptation of REPA for VDMs and
empirically show that, while effective for convergence, it is suboptimal in
preserving semantic consistency across frames. To address this limitation, we
introduce Cross-frame Representation Alignment (CREPA), a novel regularization
technique that aligns hidden states of a frame with external features from
neighboring frames. Empirical evaluations on large-scale VDMs, including
CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual
fidelity and cross-frame semantic coherence when fine-tuned with
parameter-efficient methods such as LoRA. We further validate CREPA across
diverse datasets with varying attributes, confirming its broad applicability.
Project page: https://crepavideo.github.io
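
The difference between per-frame alignment and cross-frame alignment can be sketched as a regularization term. Everything specific below is an assumption made for illustration: hidden states are taken to be already projected into the feature space of the external encoder, similarity is cosine similarity, neighbors are the frames at offsets -1 and +1, and `neighbor_weight` is a hypothetical hyperparameter. The paper's actual loss may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cross_frame_alignment_loss(hidden, external, offsets=(-1, 1), neighbor_weight=0.5):
    """Negative mean similarity between each frame's hidden state and the
    external features of the same frame and of its neighboring frames.

    hidden[f] and external[f] are per-frame feature vectors.
    """
    num_frames = len(hidden)
    total = 0.0
    for f in range(num_frames):
        # Same-frame term, as in per-frame representation alignment.
        similarity = cosine(hidden[f], external[f])
        # Cross-frame terms pull the hidden state toward neighboring frames.
        for d in offsets:
            g = f + d
            if 0 <= g < num_frames:
                similarity += neighbor_weight * cosine(hidden[f], external[g])
        total += similarity
    return -total / num_frames
```

Setting `offsets=()` reduces the term to per-frame alignment, which makes the role of the neighboring-frame terms easy to isolate in an ablation.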

[46] PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies

Mojtaba Nafez,Amirhossein Koochakian,Arad Maleki,Jafar Habibi,Mohammad Hossein Rohban

Main category: cs.CV

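TL;DR: PatchGuard is an adversarially robust anomaly detection (AD) and anomaly localization (AL) method based on Vision Transformers (ViT) and pseudo anomalies. By introducing pseudo anomalies with localization masks, it makes the model more robust to adversarial attacks and significantly outperforms existing methods.

Details Motivation: Current AD and AL methods are trained only on normal samples and are therefore vulnerable to adversarial attacks, which limits their use in high-reliability fields such as medical imaging and industrial monitoring. PatchGuard aims to address this.

Contribution: 1. PatchGuard, the first method to combine pseudo anomalies with ViT to improve the adversarial robustness of AD and AL; 2. a new loss function and adversarial training strategy; 3. significant performance gains on industrial and medical datasets.

Method: 1. Analyze the essential properties of pseudo anomalies; 2. use theoretical analysis to guide the design of the ViT attention mechanism; 3. propose a method for generating Foreground-Aware Pseudo-Anomalies; 4. optimize the model with adversarial training and the new loss function.

Result: In adversarial settings, PatchGuard improves AD and AL performance by 53.2% and 68.5%, respectively, while remaining competitive in non-adversarial settings.

Insight: Combining pseudo anomalies with ViT and improving adversarial training is an effective route to adversarially robust AD and AL.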

Abstract: Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields
that demand high reliability, such as medical imaging and industrial
monitoring. However, current AD and AL approaches are often susceptible to
adversarial attacks due to limitations in training data, which typically
include only normal, unlabeled samples. This study introduces PatchGuard, an
adversarially robust AD and AL method that incorporates pseudo anomalies with
localization masks within a Vision Transformer (ViT)-based architecture to
address these vulnerabilities. We begin by examining the essential properties
of pseudo anomalies, and follow it by providing theoretical insights into the
attention mechanisms required to enhance the adversarial robustness of AD and
AL systems. We then present our approach, which leverages Foreground-Aware
Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware
methods. Our method incorporates these crafted pseudo-anomaly samples into a
ViT-based framework, with adversarial training guided by a novel loss function
designed to improve model robustness, as supported by our theoretical analysis.
Experimental results on well-established industrial and medical datasets
demonstrate that PatchGuard significantly outperforms previous methods in
adversarial settings, achieving performance gains of 53.2% in AD and
68.5% in AL, while also maintaining competitive accuracy in non-adversarial
settings. The code repository is available at
https://github.com/rohban-lab/PatchGuard.

[47] UFM: A Simple Path towards Unified Dense Correspondence with Flow

Yuchen Zhang,Nikhil Keetha,Chenwei Lyu,Bhuvan Jhamb,Yutian Chen,Yuheng Qiu,Jay Karhade,Shreyas Jha,Yaoyu Hu,Deva Ramanan,Sebastian Scherer,Wenshan Wang

Main category: cs.CV

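TL;DR: UFM is a unified dense correspondence model that directly regresses the (u,v) flow with a simple transformer architecture. It is easier to train and more accurate than conventional methods, and it is the first to show that unified training can outperform specialized approaches.

Details Motivation: Dense correspondence has traditionally been handled separately for wide-baseline scenarios and optical flow estimation, although the two share the same goal. UFM addresses this through unified training.

Contribution: 1. The UFM model, which handles dense correspondence tasks in a unified way; 2. direct flow regression with a simple transformer architecture, which is easier to train and more accurate for large flows; 3. better results than specialized methods on both optical flow and wide-baseline matching.

Method: UFM uses unified training data (co-visible pixels) and a generic transformer architecture to regress the (u,v) flow directly, avoiding the complexity of conventional coarse-to-fine cost volumes.

Result: UFM is 28% more accurate than the best optical flow method, and compared with wide-baseline matchers it has 62% less error and is 6.7x faster, the first demonstration of the advantage of unified training across both tasks.

Insight: Unified training can outperform task-specific methods and opens new directions for multi-modal, long-range, and real-time correspondence tasks.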

Abstract: Dense image correspondence is central to many applications, such as visual
odometry, 3D reconstruction, object association, and re-identification.
Historically, dense correspondence has been tackled separately for
wide-baseline scenarios and optical flow estimation, despite the common goal of
matching content between two images. In this paper, we develop a Unified Flow &
Matching model (UFM), which is trained on unified data for pixels that are
co-visible in both source and target images. UFM uses a simple, generic
transformer architecture that directly regresses the (u,v) flow. It is easier
to train and more accurate for large flows compared to the typical
coarse-to-fine cost volumes in prior work. UFM is 28% more accurate than
state-of-the-art flow methods (Unimatch), while also having 62% less error and
being 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to
demonstrate that unified training can outperform specialized approaches across
both domains. This result enables fast, general-purpose correspondence and
opens new directions for multi-modal, long-range, and real-time correspondence
tasks.
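
As a minimal sketch of what a directly regressed (u,v) flow field provides, the snippet below turns a flow field plus a co-visibility mask into pixel correspondences and scores it with mean end-point error (EPE). EPE is a common optical flow metric, but the abstract does not say which metric underlies its reported numbers, so its use here is an assumption. The function names and the nested-list data layout are hypothetical and are not taken from the UFM code.

```python
from math import hypot

def flow_to_matches(flow, covisible):
    """Turn a dense (u, v) flow field into pixel correspondences.

    flow[y][x] is a (u, v) displacement; covisible[y][x] is True when the
    source pixel is also visible in the target image.
    """
    matches = []
    for y, row in enumerate(flow):
        for x, (u, v) in enumerate(row):
            if covisible[y][x]:
                matches.append(((x, y), (x + u, y + v)))
    return matches

def mean_epe(pred, target, covisible):
    """Mean end-point error, computed over co-visible pixels only."""
    errors = [
        hypot(pu - tu, pv - tv)
        for prow, trow, mrow in zip(pred, target, covisible)
        for (pu, pv), (tu, tv), m in zip(prow, trow, mrow)
        if m
    ]
    return sum(errors) / len(errors) if errors else 0.0

# A 1x2 image: only the first pixel is co-visible.
pred = [[(1.0, 0.0), (0.0, 0.0)]]
target = [[(4.0, 4.0), (0.0, 0.0)]]
mask = [[True, False]]
print(flow_to_matches(pred, mask))
print(mean_epe(pred, target, mask))
```

Restricting both functions to co-visible pixels mirrors the abstract's description of training on pixels that are visible in both the source and target images.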

[48] Lightweight Object Detection Using Quantized YOLOv4-Tiny for Emergency Response in Aerial Imagery

Sindhu Boddu,Arindam Mukherjee

Main category: cs.CV

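TL;DR: This paper presents a lightweight object detection approach that quantizes the YOLOv4-Tiny model for aerial imagery in emergency response, substantially shrinking the model and speeding up inference.

Details Motivation: Public datasets lack drone-view imagery of emergency scenes, which limits research in this area. The paper aims to provide an efficient, lightweight object detection solution for emergency response.

Contribution: 1. Builds a custom-curated drone-view emergency dataset of 10,820 annotated images; 2. Optimizes YOLOv4-Tiny with INT8 quantization, reducing model size and increasing inference speed.

Method: YOLOv4-Tiny is trained on the custom dataset, quantized to INT8 precision with post-training quantization, and evaluated against YOLOv5-small.

Result: The quantized model shrinks from 22.5 MB to 6.4 MB and runs inference 44% faster, while keeping detection performance (mAP and F1 score) comparable to YOLOv5-small.

Insight: INT8 quantization works well for lightweight models and suits deployment on low-power edge devices, offering an efficient option for real-time emergency detection.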

Abstract: This paper presents a lightweight and energy-efficient object detection
solution for aerial imagery captured during emergency response situations. We
focus on deploying the YOLOv4-Tiny model, a compact convolutional neural
network, optimized through post-training quantization to INT8 precision. The
model is trained on a custom-curated aerial emergency dataset, consisting of
10,820 annotated images covering critical emergency scenarios. Unlike prior
works that rely on publicly available datasets, we created this dataset
ourselves due to the lack of publicly available drone-view emergency imagery,
making the dataset itself a key contribution of this work. The quantized model
is evaluated against YOLOv5-small across multiple metrics, including mean
Average Precision (mAP), F1 score, inference time, and model size. Experimental
results demonstrate that the quantized YOLOv4-Tiny achieves comparable
detection performance while reducing the model size from 22.5 MB to 6.4 MB and
improving inference speed by 44%. With a 71% reduction in model size and a
44% increase in inference speed, the quantized YOLOv4-Tiny model proves highly
suitable for real-time emergency detection on low-power edge devices.
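
To make the INT8 idea concrete, here is a minimal sketch of affine (asymmetric) quantization of a list of floats, together with the size-reduction arithmetic from the abstract. It is an illustration under simple assumptions (per-tensor scale, range taken from the data) and does not reproduce the authors' toolchain; all function names are hypothetical.

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of floats to the INT8 range [-128, 127]."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 inside the representable range
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]

def relative_reduction(before, after):
    """Fractional reduction, e.g. in model size."""
    return (before - after) / before

weights = [-1.0, 0.0, 0.5, 0.9921875]
q, scale, zero_point = quantize_int8(weights)
print(q, scale, zero_point)
print(dequantize_int8(q, scale, zero_point))
print(relative_reduction(22.5, 6.4))  # about 0.716
```

Going from 22.5 MB to 6.4 MB is a reduction of roughly 71.6%, which is consistent with the 71% figure quoted in the abstract.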

[49] Efficient Edge Deployment of Quantized YOLOv4-Tiny for Aerial Emergency Object Detection on Raspberry Pi 5

Sindhu Boddu,Arindam Mukherjee

Main category: cs.CV

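TL;DR: The paper deploys and evaluates a quantized YOLOv4-Tiny on the Raspberry Pi 5 for real-time object detection in aerial emergency imagery. The quantized model runs efficiently under embedded conditions.

Details Motivation: The study addresses the challenge of real-time object detection on resource-constrained edge devices such as the Raspberry Pi 5, particularly the need for low power consumption and high efficiency in emergency response applications.

Contribution: An INT8 YOLOv4-Tiny model obtained with TensorFlow Lite post-training quantization, which significantly reduces power consumption while preserving detection accuracy.

Method: YOLOv4-Tiny is converted from FP32 to INT8 precision using TensorFlow Lite post-training quantization and evaluated on the Raspberry Pi 5 for speed, power consumption, and thermal feasibility.

Result: The quantized model takes 28.2 ms per image at an average power consumption of 13.85 W, a significant reduction in power usage compared with the FP32 version, and detection accuracy remains robust across key emergency classes.

Insight: Low-power embedded AI systems have potential for real-time deployment in safety-critical emergency response applications, which demonstrates the practical value of quantization.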

Abstract: This paper presents the deployment and performance evaluation of a quantized
YOLOv4-Tiny model for real-time object detection in aerial emergency imagery on
a resource-constrained edge device, the Raspberry Pi 5. The YOLOv4-Tiny model
was quantized to INT8 precision using TensorFlow Lite post-training
quantization techniques and evaluated for detection speed, power consumption,
and thermal feasibility under embedded deployment conditions. The quantized
model achieved an inference time of 28.2 ms per image with an average power
consumption of 13.85 W, demonstrating a significant reduction in power usage
compared to its FP32 counterpart. Detection accuracy remained robust across key
emergency classes such as Ambulance, Police, Fire Engine, and Car Crash. These
results highlight the potential of low-power embedded AI systems for real-time
deployment in safety-critical emergency response applications.
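
The reported latency and power figures can be turned into throughput and energy per inference with simple arithmetic. The sketch below assumes back-to-back inference and that the 13.85 W average applies for the full 28.2 ms of each inference; the abstract does not state how power was measured, so the energy value is only a rough estimate. The helper names are hypothetical.

```python
def frames_per_second(latency_ms):
    """Throughput implied by a per-image latency, assuming no idle time."""
    return 1000.0 / latency_ms

def energy_per_inference_joules(avg_power_w, latency_ms):
    """Energy = power x time, with latency converted from ms to seconds."""
    return avg_power_w * latency_ms / 1000.0

print(frames_per_second(28.2))                   # about 35.5 images per second
print(energy_per_inference_joules(13.85, 28.2))  # about 0.39 J per image
```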

[50] MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning

Tong Wang,Guanzhou Chen,Xiaodong Zhang,Chenxi Liu,Jiaqi Wang,Xiaoliang Tan,Wenchao Guo,Qingyuan Yang,Kaiqi Zhang

Main category: cs.CV

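TL;DR: The paper proposes MSSDF, a multi-modal self-supervised learning framework that pre-trains on RGB images, multi-spectral data, and digital surface models. With an adaptive masking strategy and multi-task objectives, it improves performance on downstream remote sensing tasks.

Details Motivation: Labeled remote sensing data is costly and time-consuming to acquire, so an efficient self-supervised method is needed that exploits the information in multi-modal data for pre-training.

Contribution: 1. Proposes the multi-modal self-supervised learning framework MSSDF; 2. Designs an information-aware adaptive masking strategy and a cross-modal masking mechanism; 3. Validates the method on multiple downstream tasks.

Method: An adaptive masking strategy, a cross-modal masking mechanism, and multi-task self-supervised objectives capture both the correlations across modalities and the feature structures unique to each modality.

Result: Evaluated on 15 datasets, the method performs strongly on many tasks, for example semantic segmentation (mIoU 78.30%), depth estimation (RMSE 0.182), and change detection (mIoU 47.51%).

Insight: Combining multi-modal data with self-supervised learning can effectively improve remote sensing tasks, especially when labeled data is limited.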

Abstract: Remote sensing image interpretation plays a critical role in environmental
monitoring, urban planning, and disaster assessment. However, acquiring
high-quality labeled data is often costly and time-consuming. To address this
challenge, we propose a multi-modal self-supervised learning framework that
leverages high-resolution RGB images, multi-spectral data, and digital surface
models (DSM) for pre-training. By designing an information-aware adaptive
masking strategy, cross-modal masking mechanism, and multi-task self-supervised
objectives, the framework effectively captures both the correlations across
different modalities and the unique feature structures within each modality. We
evaluated the proposed method on multiple downstream tasks, covering typical
remote sensing applications such as scene classification, semantic
segmentation, change detection, object detection, and depth estimation.
Experiments are conducted on 15 remote sensing datasets, encompassing 26 tasks.
The results demonstrate that the proposed method outperforms existing
pretraining approaches in most tasks. Specifically, on the Potsdam and
Vaihingen semantic segmentation tasks, our method achieved mIoU scores of
78.30% and 76.50% using only 50% of the training set. For the US3D depth
estimation task, the RMSE is reduced to 0.182, and for the binary change
detection task on the SECOND dataset, our method achieved an mIoU of 47.51%,
surpassing the second-best method, CS-MAE, by 3 percentage points. Our
pre-training code, checkpoints, and HR-Pairs dataset can be found at
https://github.com/CVEO/MSSDF.
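
Since several of the reported results are mIoU scores, a short sketch of how mean Intersection-over-Union is commonly computed from a confusion matrix may help. This is the standard definition, not code from the MSSDF repository; the function name and the convention of skipping classes that never appear are assumptions.

```python
def mean_iou(confusion):
    """Mean IoU, where confusion[i][j] counts pixels of true class i predicted as j."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Two classes: IoU is 3/5 for class 0 and 5/7 for class 1.
print(mean_iou([[3, 1], [1, 5]]))
```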

[51] An Effective End-to-End Solution for Multimodal Action Recognition

Songping Wang,Xiantao Hu,Yueming Lyu,Caifeng Shan

Main category: cs.CV

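TL;DR: The paper presents an end-to-end solution for multimodal action recognition that combines data augmentation, transfer learning, multimodal feature extraction, and prediction enhancement methods to improve recognition performance.

Details Motivation: Multimodal action recognition faces many challenges because tri-modal data is scarce. The paper addresses this by making comprehensive use of multimodal information together with optimization techniques.

Contribution: 1) A comprehensive multimodal action recognition solution; 2) Improved training through data augmentation and transfer learning; 3) Spatial-temporal feature extraction that combines 2D CNNs with TSM; 4) Several prediction enhancement methods to integrate knowledge across models.

Method: 1) Data augmentation enlarges the training set; 2) The backbone network is pre-trained on additional RGB datasets; 3) 2D CNNs are combined with TSM to extract spatial-temporal features; 4) Prediction enhancement methods such as SWA, Ensemble, and TTA are applied.

Result: The solution reaches 99% Top-1 accuracy and 100% Top-5 accuracy on the competition leaderboard.

Insight: Combining comprehensive use of multimodal information with prediction enhancement methods is effective for action recognition, especially when data is scarce.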

Abstract: Recently, multimodal tasks have strongly advanced the field of action
recognition with their rich multimodal information. However, due to the
scarcity of tri-modal data, research on tri-modal action recognition tasks
faces many challenges. To this end, we have proposed a comprehensive multimodal
action recognition solution that effectively utilizes multimodal information.
First, the existing data are transformed and expanded by optimizing data
enhancement techniques to enlarge the training scale. At the same time, more
RGB datasets are used to pre-train the backbone network, which is better
adapted to the new task by means of transfer learning. Secondly, multimodal
spatial features are extracted with the help of 2D CNNs and combined with the
Temporal Shift Module (TSM) to achieve multimodal spatial-temporal feature
extraction comparable to 3D CNNs and improve the computational efficiency. In
addition, common prediction enhancement methods, such as Stochastic Weight
Averaging (SWA), Ensemble, and Test-Time Augmentation (TTA), are used to
integrate the knowledge of models from different training periods of the same
architecture and different architectures, so as to predict the actions from
different perspectives and fully exploit the target information. Ultimately, we
achieved the Top-1 accuracy of 99% and the Top-5 accuracy of 100% on the
competition leaderboard, demonstrating the superiority of our solution.
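
The Temporal Shift Module mentioned above lets a 2D CNN mix information across frames by shifting part of the channels along the time axis. The sketch below shows that shift on a toy per-frame feature table with spatial dimensions omitted. The `fold_div` parameter, the channel ordering, and the zero padding are assumptions of this illustration, not details confirmed by the abstract.

```python
def temporal_shift(features, fold_div=8):
    """Shift a fraction of channels along time; features[t][c] is one value.

    The first C // fold_div channels take their value from the next frame,
    the following C // fold_div channels from the previous frame, and the
    remaining channels stay in place. Missing neighbours are zero-padded.
    """
    num_frames = len(features)
    num_channels = len(features[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1   # pull from the following frame
            elif c < 2 * fold:
                src = t - 1   # pull from the preceding frame
            else:
                src = t       # unshifted channels
            if 0 <= src < num_frames:
                out[t][c] = features[src][c]
    return out

clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # 3 frames, 4 channels
for frame in temporal_shift(clip, fold_div=4):
    print(frame)
```

Because the shift only moves existing values, it adds temporal mixing without extra multiplications, which fits the abstract's point about computational efficiency relative to 3D CNNs.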

[52] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

Shanchuan Lin,Ceyuan Yang,Hao He,Jianwen Jiang,Yuxi Ren,Xin Xia,Yang Zhao,Xuefeng Xiao,Lu Jiang

Main category: cs.CV

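TL;DR: The paper proposes autoregressive adversarial post-training (AAPT), which turns a pre-trained latent video diffusion model into a real-time, interactive video generator that supports one-step generation and interactive control.

Details Motivation: Existing large-scale video generation models are too computationally intensive for real-time and interactive applications.

Contribution: Proposes AAPT, which converts a pre-trained model into a real-time interactive video generator using adversarial training and an architecture that fully utilizes the KV cache.

Method: The model is trained adversarially in an autoregressive setting and generates one latent frame per step with a single network evaluation. It supports KV caching and interactive controls, and student-forcing training reduces error accumulation in long video generation.

Result: The 8B model achieves real-time 24fps streaming video generation at 736x416 on a single H100, or at 1280x720 on 8xH100, for videos up to a minute long (1440 frames).

Insight: Adversarial training is an effective paradigm for autoregressive generation, and one-step generation with full KV cache use significantly improves efficiency.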

Abstract: Existing large-scale video generation models are computationally intensive,
preventing adoption in real-time and interactive applications. In this work, we
propose autoregressive adversarial post-training (AAPT) to transform a
pre-trained latent video diffusion model into a real-time, interactive video
generator. Our model autoregressively generates a latent frame at a time using
a single neural function evaluation (1NFE). The model can stream the result to
the user in real time and receive interactive responses as controls to generate
the next latent frame. Unlike existing approaches, our method explores
adversarial training as an effective paradigm for autoregressive generation.
This not only allows us to design an architecture that is more efficient for
one-step generation while fully utilizing the KV cache, but also enables
training the model in a student-forcing manner that proves to be effective in
reducing error accumulation during long video generation. Our experiments
demonstrate that our 8B model achieves real-time, 24fps, streaming video
generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to
a minute long (1440 frames). Visit our research website at
https://seaweed-apt.com/2
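
The numbers in the abstract imply a tight per-frame budget: at 24fps each latent frame must be produced in under about 41.7 ms, and one minute corresponds to 24 x 60 = 1440 frames. The toy loop below sketches the streaming pattern described above, with one generation call per frame, a growing cache, and a control signal read between frames. `generate_frame` and `get_control` are hypothetical callables standing in for the model and the user, and the list-based cache is only a stand-in for a real KV cache.

```python
def frame_budget_ms(fps):
    """Maximum time per frame for real-time playback."""
    return 1000.0 / fps

def total_frames(fps, seconds):
    """Number of frames in a clip of the given duration."""
    return fps * seconds

def stream(generate_frame, get_control, num_frames):
    """Toy autoregressive loop: one call per frame, reusing a growing cache."""
    cache, frames = [], []
    frame = None
    for _ in range(num_frames):
        control = get_control(frame)                  # user input reacting to the last frame
        frame = generate_frame(frame, control, cache)
        cache.append(frame)                           # stand-in for appending keys and values
        frames.append(frame)
    return frames

print(frame_budget_ms(24))   # about 41.7 ms
print(total_frames(24, 60))  # 1440
print(stream(lambda prev, ctrl, cache: (prev or 0) + ctrl, lambda prev: 1, 3))
```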

[53] A new approach for image segmentation based on diffeomorphic registration and gradient fields

Junchao Zhou

Main category: cs.CV

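TL;DR: The paper proposes a new image segmentation method based on diffeomorphic registration and gradient fields. It segments an image by deforming a template curve and comparing it with the image gradient field, without relying on large datasets.

Details Motivation: Traditional segmentation methods rely either on large amounts of training data or on limited edge detection techniques. The paper combines shape analysis with diffeomorphic transformations to obtain a more flexible and theoretically grounded method.

Contribution: A new variational framework that models segmentation as the deformation of a template curve via a diffeomorphic transformation, guided by the image gradient field, using the LDDMM framework and the varifold representation of geometric shapes.

Method: The LDDMM framework provides diffeomorphic transformations of the image domain. The deformed template curve is compared with the image gradient field through a varifold representation of shapes. The method is implemented in Python with GPU acceleration using PyKeops.

Result: The method achieves accurate segmentation without relying on large datasets, and it is flexible and theoretically grounded.

Insight: Combining diffeomorphic transformations with geometric shape representations can improve segmentation accuracy and is particularly useful when data is scarce.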

Abstract: Image segmentation is a fundamental task in computer vision aimed at
delineating object boundaries within images. Traditional approaches, such as
edge detection and variational methods, have been widely explored, while recent
advances in deep learning have shown promising results but often require
extensive training data. In this work, we propose a novel variational framework
for 2D image segmentation that integrates concepts from shape analysis and
diffeomorphic transformations. Our method models segmentation as the
deformation of a template curve via a diffeomorphic transformation of the image
domain, using the Large Deformation Diffeomorphic Metric Mapping (LDDMM)
framework. The curve evolution is guided by a loss function that compares the
deformed curve to the image gradient field, formulated through the varifold
representation of geometric shapes. The approach is implemented in Python with
GPU acceleration using the PyKeops library. This framework allows for accurate
segmentation with a flexible and theoretically grounded methodology that does
not rely on large datasets.

[54] SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing

Hongguang Zhu,Yunchao Wei,Mengyu Wang,Siyu Jiao,Yan Fang,Jiannan Huang,Yao Zhao

Main category: cs.CV

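TL;DR: SAGE proposes semantic-augment erasing to explore the boundaries of the unsafe concept domain. Cyclic self-check and self-erasure generalize erasure from a concept word to a concept domain, and a global-local collaborative retention mechanism prevents irrelevant concepts from degrading.

Details Motivation: Diffusion models have made significant progress in text-to-image generation, but sensitive information included during pre-training poses safety risks such as unsafe content generation and copyright infringement. Existing methods treat an unsafe concept as a fixed word and erase it repeatedly, trapping the model in a "word concept abyss" that limits generalized concept-related erasing.

Contribution: 1. Introduces semantic-augment erasing, which turns concept word erasure into concept domain erasure; 2. Proposes a global-local collaborative retention mechanism that reduces the degradation of irrelevant concepts while unsafe concepts are erased.

Method: Cyclic self-check and self-erasure explore the boundary representation of the concept domain, using the semantic spatial relationships between the original and the training diffusion models, without additional preprocessed data. The global-local mechanism combines global semantic relationship alignment with local predicted noise preservation.

Result: Experiments show that SAGE is comprehensively superior to other methods for the safe generation of diffusion models.

Insight: Combining semantic-augment erasing with global-local retention offers a new approach to erasing unsafe concepts from diffusion models and improves both generalization and safety.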

Abstract: Diffusion models (DMs) have achieved significant progress in text-to-image
generation. However, the inevitable inclusion of sensitive information during
pre-training poses safety risks, such as unsafe content generation and
copyright infringement. Concept erasing finetunes weights to unlearn
undesirable concepts, and has emerged as a promising solution. However,
existing methods treat an unsafe concept as a fixed word and repeatedly erase it,
trapping DMs in a "word concept abyss", which prevents generalized
concept-related erasing. To escape this abyss, we introduce semantic-augment
erasing which transforms concept word erasure into concept domain erasure by
the cyclic self-check and self-erasure. It efficiently explores and unlearns
the boundary representation of concept domain through semantic spatial
relationships between original and training DMs, without requiring additional
preprocessed data. Meanwhile, to mitigate the retention degradation of
irrelevant concepts while erasing unsafe concepts, we further propose the
global-local collaborative retention mechanism that combines global semantic
relationship alignment with local predicted noise preservation, effectively
expanding the retentive receptive field for irrelevant concepts. We name our
method SAGE, and extensive experiments demonstrate the comprehensive
superiority of SAGE compared with other methods in the safe generation of DMs.
The code and weights will be open-sourced at
https://github.com/KevinLight831/SAGE.

[55] ScaleLSD: Scalable Deep Line Segment Detection Streamlined

Zeran Ke,Bin Tan,Xianwei Zheng,Yujun Shen,Tianfu Wu,Nan Xue

Main category: cs.CV

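TL;DR: ScaleLSD is a scalable self-supervised approach to detecting line segments in images. It outperforms the classic non-deep approach and performs well across a range of tasks.

Details Motivation: The goal is to learn a domain-agnostic, robust line segment detection model that works for any natural image, using self-supervised learning to overcome the limited scalability of earlier approaches.

Contribution: 1. Proposes ScaleLSD, a high-performing and efficient line segment detection model; 2. Learns line geometry from over 10M unlabeled images with self-supervised learning; 3. Surpasses the classic non-deep approach in detection performance, 3D geometry estimation, and multi-view line segment matching.

Method: The paper revisits and streamlines the fundamental designs of deep and non-deep line segment detection approaches and combines them with self-supervised learning on large-scale unlabeled images.

Result: Under zero-shot protocols, ScaleLSD performs very well across tasks and is the first deep approach to outperform the pioneering non-deep LSD in every aspect tested.

Insight: Self-supervised learning on data at scale can substantially improve the robustness and generalization of line geometry detection, offering a new direction for geometric characterization of images.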

Abstract: This paper studies the problem of Line Segment Detection (LSD) for the
characterization of line geometry in images, with the aim of learning a
domain-agnostic robust LSD model that works well for any natural images. With
the focus of scalable self-supervised learning of LSD, we revisit and
streamline the fundamental designs of (deep and non-deep) LSD approaches to
have a high-performing and efficient LSD learner, dubbed as ScaleLSD, for the
curation of line geometry at scale from over 10M unlabeled real-world images.
ScaleLSD detects substantially more line segments in natural images than even
the pioneering non-deep LSD approach, yielding a more complete and accurate
geometric characterization of images using line segments.
Experimentally, our proposed ScaleLSD is comprehensively tested under
zero-shot protocols in detection performance, single-view 3D geometry
estimation, two-view line segment matching, and multiview 3D line mapping, all
with excellent performance obtained. Based on the thorough evaluation, our
ScaleLSD is observed to be the first deep approach that outperforms the
pioneering non-deep LSD in all aspects we have tested, significantly expanding
and reinforcing the versatility of the line geometry of images. Code and Models
are available at https://github.com/ant-research/scalelsd

[56] ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model

Jialong Zuo,Yongtai Deng,Mengdan Tan,Rui Jin,Dongyue Wu,Nong Sang,Liang Pan,Changxin Gao

Main category: cs.CV

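TL;DR: The paper introduces a new task, Omni Multi-modal Person Re-identification (OM-ReID), and builds ORBench, the first high-quality multi-modal dataset for it. It also proposes ReID5o, a multi-modal learning framework that fuses and aligns multiple modalities in a single model. Experiments verify its advancement and practicality.

Details Motivation: Existing person re-identification methods and datasets support only a limited set of modalities and cannot handle the multi-modal queries (such as RGB, infrared, and textual description) that arise in real-world scenarios.

Contribution: 1) Proposes the OM-ReID task; 2) Builds ORBench, the first high-quality multi-modal dataset; 3) Designs the ReID5o framework, which supports synergistic learning and effective alignment of multiple modalities.

Method: ReID5o uses a unified encoding and a multi-expert routing mechanism to achieve synergistic fusion and alignment of arbitrary modality combinations.

Result: Experiments demonstrate the diversity of ORBench and the strong performance of ReID5o, which gives the best results among the models evaluated on the benchmark.

Insight: Multi-modal person re-identification is an important practical need, and a unified model design together with a high-quality dataset is key to meeting it.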

Abstract: In real-world scenarios, person re-identification (ReID) expects to identify a
person-of-interest via the descriptive query, regardless of whether the query
is a single modality or a combination of multiple modalities. However, existing
methods and datasets remain constrained to limited modalities, failing to meet
this requirement. Therefore, we investigate a new challenging problem called
Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve
effective retrieval with varying multi-modal queries. To address dataset
scarcity, we construct ORBench, the first high-quality multi-modal dataset
comprising 1,000 unique identities across five modalities: RGB, infrared, color
pencil, sketch, and textual description. This dataset also has significant
superiority in terms of diversity, such as the painting perspectives and
textual information. It could serve as an ideal platform for follow-up
investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal
learning framework for person ReID. It enables synergistic fusion and
cross-modal alignment of arbitrary modality combinations in a single model,
with a unified encoding and multi-expert routing mechanism proposed. Extensive
experiments verify the advancement and practicality of our ORBench. A wide
range of possible models have been evaluated and compared on it, and our
proposed ReID5o model gives the best performance. The dataset and code will be
made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.

[57] SRPL-SFDA: SAM-Guided Reliable Pseudo-Labels for Source-Free Domain Adaptation in Medical Image Segmentation

Xinya Liu,Jianghao Wu,Tao Lu,Shaoting Zhang,Guotai Wang

Main category: cs.CV

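TL;DR: SRPL-SFDA is a SAM-guided reliable pseudo-label method for source-free domain adaptation (SFDA) in medical image segmentation. It combines Test-Time Tri-branch Intensity Enhancement (T3IE), a pseudo-label selection module, and reliability-aware training to improve segmentation in the target domain.

Details Motivation: Medical image segmentation models face significant domain shift when deployed at new clinical centers. SFDA protects privacy but must cope with insufficient supervision from unlabeled target-domain data.

Contribution: 1) Proposes the T3IE module, which improves pseudo-label quality and produces SAM-compatible inputs; 2) Designs reliable pseudo-label selection based on the consistency of multiple outputs; 3) Introduces a reliability-aware training strategy.

Method: T3IE improves pseudo-label quality, SAM's zero-shot ability refines the pseudo-labels, a consistency check selects reliable labels, and the model is then trained with reliability-aware supervision.

Result: On two medical image datasets, SRPL-SFDA outperforms existing SFDA methods and approaches the performance of supervised training.

Insight: SAM's zero-shot ability can effectively improve pseudo-label quality, and a reliability-based selection strategy is especially important when training without labels.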

Abstract: Domain Adaptation (DA) is crucial for robust deployment of medical image
segmentation models when applied to new clinical centers with significant
domain shifts. Source-Free Domain Adaptation (SFDA) is appealing as it can deal
with privacy concerns and access constraints on source-domain data during
adaptation to target-domain data. However, SFDA faces challenges such as
insufficient supervision in the target domain with unlabeled images. In this
work, we propose a Segment Anything Model (SAM)-guided Reliable Pseudo-Labels
method for SFDA (SRPL-SFDA) with three key components: 1) Test-Time Tri-branch
Intensity Enhancement (T3IE) that not only improves quality of raw
pseudo-labels in the target domain, but also leads to SAM-compatible inputs
with three channels to better leverage SAM’s zero-shot inference ability for
refining the pseudo-labels; 2) A reliable pseudo-label selection module that
rejects low-quality pseudo-labels based on Consistency of Multiple SAM Outputs
(CMSO) under input perturbations with T3IE; and 3) A reliability-aware training
procedure in the unlabeled target domain where reliable pseudo-labels are used
for supervision and unreliable parts are regularized by entropy minimization.
Experiments conducted on two multi-domain medical image segmentation datasets
for fetal brain and the prostate respectively demonstrate that: 1) SRPL-SFDA
effectively enhances pseudo-label quality in the unlabeled target domain, and
improves SFDA performance by leveraging the reliability-aware training; 2)
SRPL-SFDA outperformed state-of-the-art SFDA methods, and its performance is
close to that of supervised training in the target domain. The code of this
work is available online: https://github.com/HiLab-git/SRPL-SFDA.
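
The second component, rejecting pseudo-labels whose SAM outputs disagree under input perturbations, can be illustrated with a simple agreement test. The sketch below scores agreement with pairwise Dice overlap and accepts a pseudo-label only if every pair exceeds a threshold. Both the Dice-based score and the 0.9 threshold are assumptions made for this illustration; the abstract names the criterion (CMSO) but does not define it, and the function names are hypothetical.

```python
def dice(a, b):
    """Dice overlap between two flat binary masks given as lists of 0/1."""
    inter = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 2.0 * inter / total

def is_reliable(masks, threshold=0.9):
    """Accept a pseudo-label only if every pair of masks agrees closely."""
    scores = [
        dice(masks[i], masks[j])
        for i in range(len(masks))
        for j in range(i + 1, len(masks))
    ]
    return min(scores) >= threshold if scores else False

# Hypothetical outputs for one image under three input perturbations.
out_a = [1, 1, 0, 0]
out_b = [1, 1, 0, 0]
out_c = [1, 0, 0, 0]
print(is_reliable([out_a, out_b]))         # the two outputs are identical
print(is_reliable([out_a, out_b, out_c]))  # out_c disagrees with the others
```

Pseudo-labels that pass such a check would be used for supervision, while the rest would be handled by a regularizer such as the entropy minimization described in the abstract.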

[58] Synthetic Human Action Video Data Generation with Pose Transfer

Vaclav Knapp,Matyas Bohacek

Main category: cs.CV

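TL;DR: The paper proposes generating synthetic human action video data with pose transfer. This addresses the uncanny features that limit conventional synthetic data in video understanding, and the approach proves effective for action recognition and for scaling few-shot datasets.

Details Motivation: Existing synthetic human action data often suffers from uncanny features that make it ineffective, limiting its use in tasks such as gesture recognition and autonomous driving. The paper proposes a new method to generate more natural synthetic data.

Contribution: 1. Proposes a pose-transfer-based method for generating synthetic human action videos, using controllable 3D Gaussian avatar models. 2. Open-sources the method together with the RANDOM People dataset. 3. Validates the method on the Toyota Smarthome and NTU RGB+D datasets.

Method: Controllable 3D Gaussian avatar models perform pose transfer to generate diverse synthetic action videos. The method also supports data augmentation, supplying additional training data for few-shot tasks.

Result: On the Toyota Smarthome and NTU RGB+D datasets, the method improves action recognition performance and can effectively scale few-shot datasets.

Insight: More natural pose transfer can overcome the uncanny quality of conventional synthetic data and offers an effective solution for data-scarce tasks.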

Abstract: In video understanding tasks, particularly those involving human motion,
synthetic data generation often suffers from uncanny features, diminishing its
effectiveness for training. Tasks such as sign language translation, gesture
recognition, and human motion understanding in autonomous driving have thus
been unable to exploit the full potential of synthetic data. This paper
proposes a method for generating synthetic human action video data using pose
transfer (specifically, controllable 3D Gaussian avatar models). We evaluate
this method on the Toyota Smarthome and NTU RGB+D datasets and show that it
improves performance in action recognition tasks. Moreover, we demonstrate that
the method can effectively scale few-shot datasets, making up for groups
underrepresented in the real training data and adding diverse backgrounds. We
open-source the method along with RANDOM People, a dataset with videos and
avatars of novel human identities for pose transfer crowd-sourced from the
internet.

[59] Noise Conditional Variational Score Distillation

Xinyu Peng,Ziyang Zheng,Yaoming Wang,Han Li,Nuowen Kan,Wenrui Dai,Chenglin Li,Junni Zou,Hongkai Xiong

Main category: cs.CV

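TL;DR: NCVSD is a new method for distilling pretrained diffusion models into generative denoisers. It builds on the observation that the unconditional score function implicitly characterizes the score function of denoising posterior distributions, enabling efficient learning and flexible sampling.

Details Motivation: Existing diffusion models generate samples inefficiently. NCVSD aims to improve generation speed and sample quality by distilling them into denoisers, while keeping the benefits of iterative refinement.

Contribution: 1. Reveals the implicit link between the unconditional score function and the score function of denoising posterior distributions; 2. Proposes the NCVSD framework, which enables scalable learning of denoisers across noise levels; 3. Performs well in fast generation, sample quality, and zero-shot inference.

Method: The insight about the unconditional score function is integrated into the VSD framework, and generative denoisers are learned through noise conditional variational score distillation. These denoisers support both one-step generation from Gaussian noise and multi-step iterative refinement.

Result: Experiments on class-conditional image generation and inverse problem solving show that, by scaling test-time compute, NCVSD outperforms teacher diffusion models and is on par with larger consistency models. With significantly fewer NFEs than diffusion-based methods, it achieves record-breaking LPIPS on inverse problems.

Insight: Noise conditioning and the implicit link between score functions let NCVSD balance efficiency and quality, offering a new direction for making generative models practical.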

Abstract: We propose Noise Conditional Variational Score Distillation (NCVSD), a novel
method for distilling pretrained diffusion models into generative denoisers. We
achieve this by revealing that the unconditional score function implicitly
characterizes the score function of denoising posterior distributions. By
integrating this insight into the Variational Score Distillation (VSD)
framework, we enable scalable learning of generative denoisers capable of
approximating samples from the denoising posterior distribution across a wide
range of noise levels. The proposed generative denoisers exhibit desirable
properties that allow fast generation while preserving the benefit of iterative
refinement: (1) fast one-step generation through sampling from pure Gaussian
noise at high noise levels; (2) improved sample quality by scaling the
test-time compute with multi-step sampling; and (3) zero-shot probabilistic
inference for flexible and controllable sampling. We evaluate NCVSD through
extensive experiments, including class-conditional image generation and inverse
problem solving. By scaling the test-time compute, our method outperforms
teacher diffusion models and is on par with consistency models of larger sizes.
Additionally, with significantly fewer NFEs than diffusion-based methods, we
achieve record-breaking LPIPS on inverse problems.

[60] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Yukang Feng,Jianwen Sun,Chuanhao Li,Zizhen Li,Jiaxin Ai,Fanrui Zhang,Yifan Chang,Sizhuo Zhou,Shenglin Zhang,Yu Dai,Kaipeng Zhang

Main category: cs.CV

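TL;DR: The paper introduces InterSyn, a high-quality dataset, and SynJudge, an automatic evaluation model, to improve the training and evaluation of interleaved image-text generation in multimodal models.

Details Motivation: Large Multimodal Models (LMMs) have made notable progress in multimodal understanding and generation, but they still struggle to generate tightly interleaved image-text outputs, mainly because current training datasets are limited in scale, quality, and instructional richness.

Contribution: 1) Builds the InterSyn dataset, using the Self-Evaluation with Iterative Refinement (SEIR) method to obtain high-quality, instruction-driven, multi-turn dialogue data; 2) Proposes the SynJudge evaluation model for quantitatively assessing multimodal outputs along several dimensions.

Method: SEIR iteratively refines the dataset to ensure data quality. SynJudge assesses multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy.

Result: Experiments show that SEIR substantially improves dataset quality, and LMMs trained on InterSyn improve on all evaluation metrics.

Insight: A high-quality dataset and a reliable evaluation tool are key to improving interleaved generation in multimodal models and provide a foundation for future instruction-following LMMs.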

Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly
improved multimodal understanding and generation. However, these models still
struggle to generate tightly interleaved image-text outputs, primarily due to
the limited scale, quality and instructional richness of current training
datasets. To address this, we introduce InterSyn, a large-scale multimodal
dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR)
method. InterSyn features multi-turn, instruction-driven dialogues with tightly
interleaved image-text responses, providing rich object diversity and rigorous
automated quality refinement, making it well-suited for training
next-generation instruction-following LMMs. Furthermore, to address the lack of
reliable evaluation tools capable of assessing interleaved multimodal outputs,
we introduce SynJudge, an automatic evaluation model designed to quantitatively
assess multimodal outputs along four dimensions: text content, image content,
image quality, and image-text synergy.
Experimental studies show that the SEIR method leads to substantially higher
dataset quality compared to an otherwise identical process without refinement.
Moreover, LMMs trained on InterSyn achieve uniform performance gains across
all evaluation metrics, confirming InterSyn’s utility for advancing multimodal
systems.

[61] A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning

Swadhin Das,Divyansh Mundra,Priyanshu Dayal,Raksha Sharma

Main category: cs.CV

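TL;DR: The paper proposes a lightweight transformer architecture with an edge-aware fusion strategy for remote sensing image captioning, improving computational efficiency and the ability to capture fine-grained spatial details.

Details Motivation: Existing transformer models for remote sensing image captioning are computationally expensive, and multi-modal frameworks often overlook fine-grained structural features such as edges and boundaries.

Contribution: A lightweight transformer architecture, a decoder trained with knowledge distillation, and an edge-aware enhancement strategy.

Method: The dimensionality of the encoder layers is reduced, a distilled version of GPT-2 serves as the decoder, knowledge is transferred from a more complex teacher model, and an edge-aware technique enhances the image representation.

Result: Experiments show that the approach significantly improves caption quality compared with state-of-the-art methods while using a lighter architecture.

Insight: Combining lightweight design with fine-grained feature extraction can improve both the performance and the practicality of remote sensing image captioning.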

Abstract: Transformer-based models have achieved strong performance in remote sensing
image captioning by capturing long-range dependencies and contextual
information. However, their practical deployment is hindered by high
computational costs, especially in multi-modal frameworks that employ separate
transformer-based encoders and decoders. In addition, existing remote sensing
image captioning models primarily focus on high-level semantic extraction while
often overlooking fine-grained structural features such as edges, contours, and
object boundaries. To address these challenges, a lightweight transformer
architecture is proposed by reducing the dimensionality of the encoder layers
and employing a distilled version of GPT-2 as the decoder. A knowledge
distillation strategy is used to transfer knowledge from a more complex teacher
model to improve the performance of the lightweight network. Furthermore, an
edge-aware enhancement strategy is incorporated to enhance image representation
and object boundary understanding, enabling the model to capture fine-grained
spatial details in remote sensing images. Experimental results demonstrate that
the proposed approach significantly improves caption quality compared to
state-of-the-art methods.
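
The abstract does not specify the distillation objective, so the following is only a sketch of one widely used choice: the Kullback-Leibler divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The temperature value, the function names, and the use of this particular loss are assumptions, not details taken from the paper.

```python
from math import exp, log

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical logits give zero loss
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))  # mismatched logits give a positive loss
```

In practice such a term is usually added to the ordinary captioning loss, so that the lightweight student learns from both the ground-truth captions and the teacher's output distribution.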

[62] TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Ayush Gupta,Anirban Roy,Rama Chellappa,Nathaniel D. Bastian,Alvaro Velasquez,Susmit Jha

Main category: cs.CV

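TL;DR: The paper proposes TOGA, a model for temporally grounded open-ended video QA under weak supervision. It jointly generates the answer and its temporal grounding, and it creates pseudo labels because no temporal annotations are available.

Details Motivation: The work tackles temporal grounding in video question answering while supporting open-ended answers in a weakly supervised setup without temporal annotations.

Contribution: Proposes TOGA, which jointly generates answers and temporal grounding, and achieves grounding under weak supervision through pseudo labels and a consistency constraint.

Method: TOGA is instruction-tuned to generate the answer and the temporal grounding together. Pseudo labels and a consistency constraint ensure that the temporal grounding is valid.

Result: TOGA achieves state-of-the-art performance on the NExT-GQA, MSVD-QA, and ActivityNet-QA benchmarks.

Insight: Jointly generating answers and temporal grounding improves both question answering and grounding, and weak supervision can work effectively without temporal annotations.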

Abstract: We address the problem of video question answering (video QA) with temporal
grounding in a weakly supervised setup, without any temporal annotations. Given
a video and a question, we generate an open-ended answer grounded with the
start and end time. For this task, we propose TOGA: a vision-language model for
Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune
TOGA to jointly generate the answer and the temporal grounding. We operate in a
weakly supervised setup where the temporal grounding annotations are not
available. We generate pseudo labels for temporal grounding and ensure the
validity of these labels by imposing a consistency constraint between the
question of a grounding response and the response generated by a question
referring to the same temporal segment. We notice that jointly generating the
answers with the grounding improves performance on question answering as well
as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For
grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate
weakly supervised grounded question answering. For open-ended QA, we consider
the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art
performance for both tasks on these benchmarks.
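
Temporal grounding predictions are intervals with a start and an end time, and a common way to compare a predicted interval with a reference one is temporal Intersection-over-Union. The sketch below shows that computation; whether the benchmarks above use exactly this measure is not stated in the abstract, so treat it as an illustrative assumption. The function name is hypothetical.

```python
def temporal_iou(pred, ref):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

# Overlap of 2 s out of a 6 s union.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
# Disjoint intervals do not overlap at all.
print(temporal_iou((0.0, 1.0), (5.0, 6.0)))
```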

[63] Harmonizing and Merging Source Models for CLIP-based Domain Generalization

Yuhe Ding,Jian Liang,Bo Jiang,Zi Wang,Aihua Zheng,Bin Luo

Main category: cs.CV

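TL;DR: The paper proposes HAM, a new framework that resolves the sample conflicts and optimization conflicts of multi-source training in CLIP-based domain generalization, improving generalization through sample enrichment and model merging.

Details Motivation: Existing CLIP-based domain generalization methods suffer from sample conflicts and optimization conflicts during multi-source training, which limits generalization. HAM resolves these conflicts to improve performance on unseen domains.

Contribution: 1) Proposes the HAM framework, which resolves conflicts in multi-source training through sample enrichment and model merging; 2) Introduces a redundancy-aware historical model merging method that effectively integrates knowledge from multiple sources; 3) Achieves state-of-the-art performance on five benchmark datasets.

Method: During training, HAM enriches the source samples while excluding conflicting samples and harmonizes the update directions of all source models. A redundancy-aware historical model merging method then combines the models into a single stronger model that consolidates information from all source domains.

Result: HAM achieves state-of-the-art performance on five widely used benchmark datasets.

Insight: Sample enrichment and model merging can effectively resolve conflicts in multi-source training and improve generalization, which makes the approach well suited to CLIP-based domain generalization.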

Abstract: CLIP-based domain generalization aims to improve model generalization to
unseen domains by leveraging the powerful zero-shot classification capabilities
of CLIP and multiple source datasets. Existing methods typically train a single
model across multiple source domains to capture domain-shared information.
However, this paradigm inherently suffers from two types of conflicts: 1)
sample conflicts, arising from noisy samples and extreme domain shifts among
sources; and 2) optimization conflicts, stemming from competition and
trade-offs during multi-source training. Both hinder the generalization and
lead to suboptimal solutions. Recent studies have shown that model merging can
effectively mitigate the competition of multi-objective optimization and
improve generalization performance. Inspired by these findings, we propose
Harmonizing and Merging (HAM), a novel source model merging framework for
CLIP-based domain generalization. During the training process of the source
models, HAM enriches the source samples without conflicting samples, and
harmonizes the update directions of all models. Then, a redundancy-aware
historical model merging method is introduced to effectively integrate
knowledge across all source models. HAM comprehensively consolidates source
domain information while enabling mutual enhancement among source models,
ultimately yielding a final model with optimal generalization capabilities.
Extensive experiments on five widely used benchmark datasets demonstrate the
effectiveness of our approach, achieving state-of-the-art performance.

[64] Evidential Deep Learning with Spectral-Spatial Uncertainty Disentanglement for Open-Set Hyperspectral Domain Generalization

Amirreza Khoshbakht,Erchan Aptoula

Main category: cs.CV

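TL;DR: The paper proposes a novel open-set domain generalization framework for hyperspectral image classification that combines Spectrum-Invariant Frequency Disentanglement (SIFD), a Dual-Channel Residual Network (DCRN), Evidential Deep Learning (EDL), and Spectral-Spatial Uncertainty Disentanglement (SSUD) to handle unknown classes and domain shift.

Details Motivation: Open-set domain generalization (OSDG) for hyperspectral image classification must cope with unknown classes and generalize across multiple domains. Existing domain adaptation methods depend on target-domain data and cannot handle unknown classes, leading to negative transfer and degraded performance.

Contribution: Proposes a framework combining SIFD, DCRN, EDL, and SSUD that provides domain-invariant feature extraction, robust feature learning, uncertainty quantification, and reliable open-set classification.

Method: SIFD extracts domain-invariant features in the frequency domain, DCRN learns complementary spectral and spatial features, EDL quantifies uncertainty, and SSUD makes reliable open-set decisions.

Result: On three cross-scene hyperspectral classification tasks, the approach achieves performance comparable to state-of-the-art domain adaptation methods without requiring target-domain training data.

Insight: Frequency-domain analysis and uncertainty disentanglement are effective tools for open-set domain generalization and offer a new direction for multi-domain hyperspectral classification.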

Abstract: Open-set domain generalization (OSDG) for hyperspectral image classification
presents significant challenges due to the presence of unknown classes in
target domains and the need for models to generalize across multiple unseen
domains without target-specific adaptation. Existing domain adaptation methods
assume access to target domain data during training and fail to address the
fundamental issue of domain shift when unknown classes are present, leading to
negative transfer and reduced classification performance. To address these
limitations, we propose a novel open-set domain generalization framework that
combines four key components: Spectrum-Invariant Frequency Disentanglement
(SIFD) for domain-agnostic feature extraction, Dual-Channel Residual Network
(DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning
(EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty
Disentanglement (SSUD) for reliable open-set classification. The SIFD module
extracts domain-invariant spectral features in the frequency domain through
attention-weighted frequency analysis and domain-agnostic regularization, while
DCRN captures complementary spectral and spatial information via parallel
pathways with adaptive fusion. EDL provides principled uncertainty estimation
using Dirichlet distributions, enabling the SSUD module to make reliable
open-set decisions through uncertainty-aware pathway weighting and adaptive
rejection thresholding. Experimental results on three cross-scene hyperspectral
classification tasks show that our approach achieves performance comparable to
state-of-the-art domain adaptation methods while requiring no access to the
target domain during training. The implementation will be made available at
https://github.com/amir-khb/SSUDOSDG upon acceptance.
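
The abstract states that EDL estimates uncertainty with Dirichlet distributions and that open-set decisions use rejection thresholding. The sketch below shows the commonly used evidential formulation (Dirichlet parameters alpha_k = evidence_k + 1, uncertainty u = K / sum(alpha)) as an illustration only. The function name edl_predict and the fixed reject_threshold are assumptions made here; the paper's adaptive thresholding and pathway weighting in SSUD are not reproduced.

```python
from typing import List, Tuple


def edl_predict(
    evidence: List[float], reject_threshold: float = 0.5
) -> Tuple[int, float, List[float]]:
    """Return (class index or -1 for 'unknown', uncertainty, expected probabilities).

    evidence: non-negative per-class evidence, e.g. the output of a ReLU head.
    reject_threshold: hypothetical fixed threshold (the paper adapts it).
    """
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]    # Dirichlet concentration parameters
    strength = sum(alpha)                  # Dirichlet strength S
    probs = [a / strength for a in alpha]  # expected class probabilities
    uncertainty = num_classes / strength   # u = K / S, lies in (0, 1]
    if uncertainty > reject_threshold:
        return -1, uncertainty, probs      # too uncertain: treat as unknown class
    best = max(range(num_classes), key=lambda k: probs[k])
    return best, uncertainty, probs
```

With zero evidence for every class the uncertainty is exactly 1 and the sample is rejected as unknown; strong evidence for one class drives the uncertainty toward 0.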

[65] Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing

Maria Damanaki,Nikos Piperigkos,Alexandros Gkillas,Aris S. Lalos

Main category: cs.CV

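TL;DR: The paper proposes a cooperative multi-object tracking framework based on graph signal processing that fuses information from multiple vehicles to improve object tracking in 3D LiDAR scenes.

Details Motivation: Single-agent multi-object tracking (MOT) has limited perception due to occlusions, sensor failures, and similar issues, whereas multi-agent cooperation provides a more comprehensive understanding of the environment.

Contribution: Proposes a novel cooperative MOT framework that fuses multi-vehicle information through graph topology-aware optimization to improve tracking accuracy.

Method: Uses a fully connected graph topology with a Graph Laplacian optimization technique to smooth bounding-box position errors, and a two-stage association procedure to optimize localization and tracking.

Result: Experiments on the real-world V2V4Real dataset show that the method significantly outperforms baseline frameworks such as DMSTrack and V2V4Real.

Insight: Graph signal processing reveals the inherent coherence among multi-agent detections, offering a new optimization approach for MOT.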

Abstract: Multi-Object Tracking (MOT) plays a crucial role in autonomous driving
systems, as it lays the foundations for advanced perception and precise path
planning modules. Nonetheless, single-agent MOT has limited perception of its
surroundings due to occlusions, sensor failures, etc. Hence, the integration
of multi-agent information is essential for a comprehensive understanding of the
environment. This paper proposes a novel Cooperative MOT framework for tracking
objects in 3D LiDAR scenes by formulating and solving a graph topology-aware
optimization problem so as to fuse information coming from multiple vehicles.
By exploiting a fully connected graph topology defined by the detected bounding
boxes, we employ the Graph Laplacian processing optimization technique to
smooth the position error of bounding boxes and effectively combine them. In
that manner, we reveal and leverage inherent coherences of diverse multi-agent
detections, and associate the refined bounding boxes to tracked objects at two
stages, optimizing localization and tracking accuracies. An extensive
evaluation study has been conducted, using the real-world V2V4Real dataset,
where the proposed method significantly outperforms the baseline frameworks,
including the state-of-the-art deep-learning DMSTrack and V2V4Real, in various
testing sequences.
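
The abstract does not give the exact optimization problem, so the following is only a sketch of standard graph Laplacian regularization under stated assumptions: detections of the same object from several vehicles form a fully connected graph with unit edge weights, and the smoothed centers x minimize ||x - y||^2 + lam * x^T L x. For that graph L = N*I - 1*1^T, which gives the closed form x_i = (y_i + lam*N*mean(y)) / (1 + lam*N). The function name and the parameter lam are illustrative.

```python
from typing import List, Sequence


def laplacian_smooth_fully_connected(
    centers: List[Sequence[float]], lam: float
) -> List[List[float]]:
    """Smooth box centers over a fully connected, unit-weight graph.

    Solves (I + lam * L) x = y per coordinate, with L = N*I - ones*ones^T.
    lam = 0 returns the input unchanged; a large lam pulls every center
    toward the mean of all detections.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    n = len(centers)
    if n == 0:
        return []
    dim = len(centers[0])
    mean = [sum(c[d] for c in centers) / n for d in range(dim)]
    scale = 1.0 + lam * n
    return [[(c[d] + lam * n * mean[d]) / scale for d in range(dim)] for c in centers]
```

For example, two detections at x = 0 and x = 2 with lam = 0.5 move to x = 0.5 and x = 1.5: the disagreement shrinks while the mean position is preserved.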

[66] Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning

Cheng Chen,Yunpeng Zhai,Yifan Zhao,Jinyang Gao,Bolin Ding,Jia Li

Main category: cs.CV

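TL;DR: The paper proposes an exploration-exploitation reinforcement learning framework to optimize in-context learning (ICL) for multi-modal few-shot Large Vision-Language Models (LVLMs), adaptively selecting combinations of multi-modal demonstrations to improve task understanding and execution.

Details Motivation: Current ICL methods rely on pre-defined or heuristically selected demonstrations, which cannot cover diverse task requirements and ignore interactions between demonstrations, leading to suboptimal performance. A method that dynamically optimizes the demonstration selection policy is therefore needed.

Contribution: Proposes an exploration-exploitation reinforcement learning framework that dynamically selects and optimizes combinations of multi-modal demonstrations, improving LVLM generalization in few-shot settings.

Method: A reinforcement learning framework explores policies for fusing multi-modal information, adaptively selects demonstrations as a combined set, and continually refines the selection policy through self-exploration.

Result: The method is validated on four visual question answering (VQA) datasets, showing its advantage in improving few-shot LVLM performance.

Insight: Reinforcement learning can capture interactions between demonstrations and dynamically optimize the selection policy, making in-context learning more flexible.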

Abstract: In-context learning (ICL), a predominant trend in instruction learning, aims
at enhancing the performance of large language models by providing clear task
guidance and examples, improving their capability in task understanding and
execution. This paper investigates ICL on Large Vision-Language Models (LVLMs)
and explores the policies of multi-modal demonstration selection. Existing
research efforts in ICL face significant challenges: First, they rely on
pre-defined demonstrations or heuristic selecting strategies based on human
intuition, which are usually inadequate for covering diverse task requirements,
leading to sub-optimal solutions; Second, individually selecting each
demonstration fails in modeling the interactions between them, resulting in
information redundancy. Unlike these prevailing efforts, we propose a new
exploration-exploitation reinforcement learning framework, which explores
policies to fuse multi-modal information and adaptively select adequate
demonstrations as an integrated whole. The framework allows LVLMs to optimize
themselves by continually refining their demonstrations through
self-exploration, enabling the ability to autonomously identify and generate
the most effective selection policies for in-context learning. Experimental
results verify the superior performance of our approach on four Visual
Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing
the generalization capability of few-shot LVLMs.

[67] Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imageries

Tianxiang Hao,Lixian Zhang,Yingjia Zhang,Mengxuan Chen,Jinxiao Zhang,Haohuan Fu

Main category: cs.CV

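TL;DR: The paper introduces Urban1960SatBench, a semantic segmentation dataset built on mid-20th-century historical satellite imagery with the earliest observation time among existing segmentation datasets, together with Urban1960SatUSM, an unsupervised segmentation framework, for studying early urban development.

Details Motivation: Historical satellite imagery (such as Keyhole data) is valuable for understanding early urban development and long-term change, but semantic segmentation has long been hindered by quality degradation (distortion, misalignment, and spectral scarcity) and the absence of annotations.

Contribution: (1) Creates Urban1960SatBench, the first segmentation dataset of mid-20th-century historical satellite imagery; (2) proposes Urban1960SatUSM, an unsupervised segmentation framework that improves segmentation with a confidence-aware alignment mechanism and a focal-confidence loss.

Method: Built on a self-supervised learning architecture, the method uses a confidence-aware alignment mechanism and a focal-confidence loss to generate robust pseudo-labels and to adaptively prioritize prediction difficulty and label reliability.

Result: Experiments show that Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on Urban1960SatSeg.

Insight: The work opens a path to quantifying long-term urban change with modern computer vision.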

Abstract: Historical satellite imagery, such as mid-20$^{th}$ century Keyhole data,
offers rare insights into understanding early urban development and long-term
transformation. However, severe quality degradation (e.g., distortion,
misalignment, and spectral scarcity) and annotation absence have long hindered
semantic segmentation on such historical RS imagery. To bridge this gap and
enhance understanding of urban development, we introduce
$\textbf{Urban1960SatBench}$, an annotated segmentation dataset based on
historical satellite imagery with the earliest observation time among all
existing segmentation datasets, along with a benchmark framework for
unsupervised segmentation tasks, $\textbf{Urban1960SatUSM}$. First,
$\textbf{Urban1960SatBench}$ serves as a novel, expertly annotated semantic
segmentation dataset built on mid-20$^{th}$ century Keyhole imagery, covering
1,240 km$^2$ and key urban classes (buildings, roads, farmland, water). As the
earliest segmentation dataset of its kind, it provides a pioneering benchmark
for historical urban understanding. Second,
$\textbf{Urban1960SatUSM}$ (Unsupervised Segmentation Model) is a novel
unsupervised semantic segmentation framework for historical RS imagery. It
employs a confidence-aware alignment mechanism and focal-confidence loss based
on a self-supervised learning architecture, which generates robust
pseudo-labels and adaptively prioritizes prediction difficulty and label
reliability to improve unsupervised segmentation on noisy historical data
without manual supervision. Experiments show Urban1960SatUSM significantly
outperforms existing unsupervised segmentation methods on Urban1960SatSeg for
segmenting historical urban scenes, showing promise in paving the way for
quantitative studies of long-term urban change using modern computer vision.
Our benchmark and supplementary material are available at
https://github.com/Tianxiang-Hao/Urban1960SatSeg.

[68] TinySplat: Feedforward Approach for Generating Compact 3D Scene Representation

Zetian Song,Jiaye Fu,Jiaqi Zhang,Xiaohan Lu,Chuanmin Jia,Siwei Ma,Wen Gao

Main category: cs.CV

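TL;DR: TinySplat is a feedforward approach for generating compact 3D scene representations. By eliminating geometric, perceptual, and spatial redundancy, it compresses 3D Gaussian data by more than 100x and needs only 6% of the storage of the state-of-the-art compression approach, with much shorter encoding and decoding times.

Details Motivation: Existing feedforward 3D Gaussian Splatting (3DGS) methods reconstruct quickly but have a high storage cost, and existing compression methods are incompatible with feedforward architectures. TinySplat targets this storage bottleneck.

Contribution: 1. Proposes TinySplat, a fully feedforward approach to compact 3D scene representation. 2. Introduces View-Projection Transformation (VPT) and Visibility-Aware Basis Reduction (VABR) to reduce geometric and perceptual redundancy. 3. Uses an off-the-shelf video codec to handle spatial redundancy, achieving substantial compression.

Method: 1. VPT projects geometric parameters into a more compact space to reduce geometric redundancy. 2. VABR aligns feature energy via basis transformation to reduce perceptual redundancy. 3. A video codec handles spatial redundancy.

Result: On multiple benchmark datasets, TinySplat compresses 3D Gaussian data by more than 100x, uses only 6% of the storage of the state-of-the-art compression approach at comparable quality, and cuts encoding time by 75% and decoding time by 99%.

Insight: TinySplat's compression framework shows that 3D scene representations can be handled efficiently within a feedforward architecture while keeping quality high and storage cost low.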

Abstract: The recent development of feedforward 3D Gaussian Splatting (3DGS) presents a
new paradigm to reconstruct 3D scenes. Using neural networks trained on
large-scale multi-view datasets, it can directly infer 3DGS representations
from sparse input views. Although the feedforward approach achieves high
reconstruction speed, it still suffers from the substantial storage cost of 3D
Gaussians. Existing 3DGS compression methods relying on scene-wise optimization
are not applicable due to architectural incompatibilities. To overcome this
limitation, we propose TinySplat, a complete feedforward approach for
generating compact 3D scene representations. Built upon standard feedforward
3DGS methods, TinySplat integrates a training-free compression framework that
systematically eliminates key sources of redundancy. Specifically, we introduce
View-Projection Transformation (VPT) to reduce geometric redundancy by
projecting geometric parameters into a more compact space. We further present
Visibility-Aware Basis Reduction (VABR), which mitigates perceptual redundancy
by aligning feature energy along dominant viewing directions via basis
transformation. Lastly, spatial redundancy is addressed through an
off-the-shelf video codec. Comprehensive experimental results on multiple
benchmark datasets demonstrate that TinySplat achieves over 100x compression
for 3D Gaussian data generated by feedforward methods. Compared to the
state-of-the-art compression approach, we achieve comparable quality with only
6% of the storage size. Meanwhile, our compression framework requires only 25%
of the encoding time and 1% of the decoding time.

[69] Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals

Changhao Peng,Yuqi Ye,Wei Gao

Main category: cs.CV

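TL;DR: The paper proposes a generalized Gaussian entropy model and a dynamic likelihood interval method to improve probability estimation and arithmetic coding in point cloud attribute compression.

Details Motivation: The Gaussian and Laplacian entropy models used by existing methods do not fully exploit the information in the entropy parameters estimated by neural networks, and fixed likelihood intervals limit model performance. A more flexible entropy model and a dynamic adjustment technique are therefore needed.

Contribution: 1. Proposes a generalized Gaussian entropy model whose shape parameter controls the tail shape, giving more accurate probability estimates for latents; 2. introduces the Mean Error Discriminator (MED), which dynamically adjusts the likelihood intervals used in arithmetic coding.

Method: 1. Replaces the conventional Gaussian or Laplacian distribution with a generalized Gaussian distribution as the entropy model; 2. uses MED to assess whether the entropy parameter estimate is accurate and adjusts the likelihood intervals accordingly.

Result: Experiments show that the method significantly improves rate-distortion (RD) performance on three VAE-based point cloud attribute compression models and can be applied to image and video compression tasks.

Insight: A more flexible entropy model combined with dynamic interval adjustment can significantly improve compression performance, which suggests that making efficient use of latent information is key.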

Abstract: Gaussian and Laplacian entropy models have proved effective in learned point
cloud attribute compression, as they assist in arithmetic coding of latents.
However, we demonstrate through experiments that there is still unutilized
information in entropy parameters estimated by neural networks in current
methods, which can be used for more accurate probability estimation. Thus we
introduce a generalized Gaussian entropy model, which controls the tail shape
through a shape parameter to more accurately estimate the probability of latents.
Meanwhile, to the best of our knowledge, existing methods use fixed likelihood
intervals for each integer during arithmetic coding, which limits model
performance. We propose a Mean Error Discriminator (MED) to determine whether the
entropy parameter estimation is accurate and then dynamically adjust likelihood
intervals. Experiments show that our method significantly improves
rate-distortion (RD) performance on three VAE-based models for point cloud
attribute compression, and our method can be applied to other compression
tasks, such as image and video compression.
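
To make the two ideas in this abstract concrete, the sketch below evaluates the likelihood of a quantized latent under a generalized Gaussian density f(x) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|x - mu| / alpha)^beta), integrated over an interval [q - half_width, q + half_width]. Setting beta = 2 recovers a Gaussian and beta = 1 a Laplacian. The function names, the midpoint integration, and the half_width parameter are assumptions for illustration; how MED actually chooses the interval is not described in the abstract.

```python
import math


def gg_pdf(x: float, mu: float, alpha: float, beta: float) -> float:
    """Generalized Gaussian density with location mu, scale alpha, shape beta."""
    coeff = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return coeff * math.exp(-((abs(x - mu) / alpha) ** beta))


def interval_likelihood(
    q: int,
    mu: float,
    alpha: float,
    beta: float,
    half_width: float = 0.5,
    steps: int = 2000,
) -> float:
    """Probability mass of [q - half_width, q + half_width] (midpoint rule).

    half_width = 0.5 corresponds to the fixed interval used for integer
    latents; other values stand in for a dynamically adjusted interval.
    """
    lo = q - half_width
    h = 2.0 * half_width / steps
    return h * sum(gg_pdf(lo + (i + 0.5) * h, mu, alpha, beta) for i in range(steps))
```

A smaller beta gives heavier tails than the Gaussian case, so latents far from mu receive more probability mass and cost fewer bits (-log2 of the likelihood) when they occur.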

[70] HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene

Jianing Chen,Zehao Li,Yujun Cai,Hao Jiang,Chengxuan Qian,Juyuan Kang,Shuqin Gao,Honglong Zhao,Tianlu Mao,Yucheng Zhang

Main category: cs.CV

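TL;DR: HAIF-GS proposes a hierarchical, induced-flow-guided Gaussian Splatting method for dynamic scene reconstruction, addressing the limitations of existing methods in motion consistency and non-rigid deformation modeling.

Details Motivation: Reconstructing dynamic 3D scenes is a fundamental challenge. Existing 3D Gaussian Splatting (3DGS) methods struggle to learn structured, temporally consistent motion representations and suffer from redundant updates, insufficient motion supervision, and weak modeling of non-rigid deformations.

Contribution: Within a sparse anchor-driven deformation framework, proposes an Anchor Filter, a self-supervised Induced Flow-Guided Deformation module, and a Hierarchical Anchor Propagation mechanism, significantly improving the quality and efficiency of dynamic reconstruction.

Method: An Anchor Filter suppresses redundant updates in static regions, multi-frame feature aggregation induces anchor motion, and hierarchical anchor propagation handles fine-grained deformations.

Result: On synthetic and real-world benchmarks, HAIF-GS significantly outperforms existing dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.

Insight: Sparse anchor-driven, hierarchical deformation helps with complex non-rigid motion in dynamic scene reconstruction, and the self-supervised design removes the dependence on explicit flow labels.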

Abstract: Reconstructing dynamic 3D scenes from monocular videos remains a fundamental
challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time
rendering in static settings, extending it to dynamic scenes is challenging due
to the difficulty of learning structured and temporally consistent motion
representations. This challenge often manifests as three limitations in
existing methods: redundant Gaussian updates, insufficient motion supervision,
and weak modeling of complex non-rigid deformations. These issues collectively
hinder coherent and efficient dynamic reconstruction. To address these
limitations, we propose HAIF-GS, a unified framework that enables structured
and consistent dynamic modeling through sparse anchor-driven deformation. It
first identifies motion-relevant regions via an Anchor Filter to suppress
redundant updates in static areas. A self-supervised Induced Flow-Guided
Deformation module induces anchor motion using multi-frame feature aggregation,
eliminating the need for explicit flow labels. To further handle fine-grained
deformations, a Hierarchical Anchor Propagation mechanism increases anchor
resolution based on motion complexity and propagates multi-level
transformations. Extensive experiments on synthetic and real-world benchmarks
validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in
rendering quality, temporal coherence, and reconstruction efficiency.

[71] Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Beomsik Cho,Jaehyung Kim

Main category: cs.CV

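TL;DR: The paper proposes ReVisiT, a decoding method that references vision tokens to guide text generation in LVLMs, making better use of visual information while reducing computational overhead.

Details Motivation: Conventional LVLM decoding strategies fail to make full use of visual information, so generated content drifts away from the visual input. Existing remedies usually require additional training or multi-step inference, which limits efficiency. ReVisiT addresses this with a simple decoding-time mechanism.

Contribution: The main contribution is ReVisiT, a decoding method that needs no additional training or external dependencies. It dynamically selects and references vision tokens to strengthen the visual grounding of LVLMs, improving both the faithfulness and the efficiency of generation.

Method: ReVisiT projects vision tokens into the text token distribution space and dynamically selects the most relevant vision token through constrained divergence minimization. The selected token is then used to refine the output distribution so that it better incorporates visual semantics.

Result: On three LVLM hallucination benchmarks, ReVisiT consistently improves visual grounding and matches or exceeds state-of-the-art baselines while reducing computational cost by up to 2x.

Insight: The paper shows that the language prior hidden in vision tokens is valuable for LVLM decoding, offering a new direction for designing efficient multimodal models.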

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance
across various multimodal tasks by integrating visual perception with language
understanding. However, conventional decoding strategies of LVLMs often fail to
successfully utilize visual information, leading to visually ungrounded
responses. While various approaches have been proposed to address this
limitation, they typically require additional training, multi-step inference
procedures, or external model dependencies. This paper introduces ReVisiT, a
simple yet effective decoding method that references vision tokens to guide the
text generation process in LVLMs. Our approach leverages the semantic
information embedded within vision tokens by projecting them into the text
token distribution space, and dynamically selecting the most relevant vision
token at each decoding step through constrained divergence minimization. This
selected vision token is then used to refine the output distribution to better
incorporate visual semantics. Experiments on three LVLM hallucination
benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances
visual grounding with minimal computational overhead. Moreover, our method
achieves competitive or superior results relative to state-of-the-art baselines
while reducing computational costs by up to $2\times$.

[72] Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS

Tao Wang,Mengyu Li,Geduo Zeng,Cheng Meng,Qiong Zhang

Main category: cs.CV

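TL;DR: The paper proposes a global Gaussian mixture reduction method, framed from an optimal transport perspective, to remove redundant Gaussian primitives in 3D Gaussian Splatting (3DGS), substantially reducing memory and rendering cost while preserving rendering quality.

Details Motivation: 3DGS is a powerful radiance field rendering technique, but it typically requires millions of redundant Gaussian primitives, which leads to heavy memory and rendering costs. Existing compaction methods prune Gaussians using heuristic importance scores and offer no global fidelity guarantee.

Contribution: Proposes a new optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction, minimizes the composite transport divergence over a KD-tree partition, and decouples appearance optimization from geometry.

Method: 1. Minimizes the composite transport divergence over a KD-tree partition to obtain a compact geometric representation; 2. decouples appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives.

Result: Experiments show that with only 10% of the Gaussian primitives the method achieves rendering quality (PSNR, SSIM, LPIPS) comparable to vanilla 3DGS and outperforms existing compaction techniques.

Insight: The method applies to both vanilla and accelerated 3DGS pipelines, providing an efficient and general route to lightweight neural rendering.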

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance
field rendering, but it typically requires millions of redundant Gaussian
primitives, overwhelming memory and rendering budgets. Existing compaction
approaches address this by pruning Gaussians based on heuristic importance
scores, without a global fidelity guarantee. To bridge this gap, we propose a
novel optimal transport perspective that casts 3DGS compaction as global
Gaussian mixture reduction. Specifically, we first minimize the composite
transport divergence over a KD-tree partition to produce a compact geometric
representation, and then decouple appearance from geometry by fine-tuning color
and opacity attributes with far fewer Gaussian primitives. Experiments on
benchmark datasets show that our method (i) yields negligible loss in rendering
quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10% of the Gaussians;
and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques.
Notably, our method is applicable to any stage of vanilla or accelerated 3DGS
pipelines, providing an efficient and agnostic pathway to lightweight neural
rendering.
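
As background for the term "Gaussian mixture reduction", the sketch below shows the classical moment-preserving merge of several weighted Gaussians into one, as might be applied to the primitives that fall into a single partition cell. This is a standard textbook technique swapped in for illustration; it is not claimed to be the paper's composite transport divergence minimization, and the function name is hypothetical.

```python
from typing import List, Tuple

Vec = List[float]
Mat = List[List[float]]


def merge_gaussians(
    weights: List[float], means: List[Vec], covs: List[Mat]
) -> Tuple[float, Vec, Mat]:
    """Merge weighted Gaussians into one, preserving total weight, mean, covariance.

    mean = sum_i w_i * mu_i / W
    cov  = sum_i w_i * (cov_i + (mu_i - mean)(mu_i - mean)^T) / W
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    dim = len(means[0])
    mean = [sum(w * m[d] for w, m in zip(weights, means)) / total for d in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for w, m, c in zip(weights, means, covs):
        diff = [m[d] - mean[d] for d in range(dim)]
        for i in range(dim):
            for j in range(dim):
                # within-component covariance plus spread of the component means
                cov[i][j] += w * (c[i][j] + diff[i] * diff[j]) / total
    return total, mean, cov
```

Merging two equally weighted unit-variance 1D Gaussians centered at 0 and 2 gives a single Gaussian with weight 2, mean 1, and variance 2: the extra variance accounts for the distance between the original means.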

[73] 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection

Yi Zhang,Yi Wang,Yawen Cui,Lap-Pui Chau

Main category: cs.CV

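TL;DR: The paper proposes 3DGeoDet, a general-purpose geometry-aware 3D object detection method that builds 3D geometric representations in both explicit and implicit forms and performs well on single- and multi-view RGB images in indoor and outdoor environments.

Details Motivation: Image-based 3D object detection lacks 3D geometric cues, which makes the correspondence between images and 3D representations ambiguous. To address this, the paper combines explicit and implicit 3D geometric representations to improve the model's understanding of 3D geometry.

Contribution: 1. Proposes 3DGeoDet, a general-purpose, geometry-aware image-based 3D object detection method; 2. strengthens the 3D geometric representation by combining explicit voxel occupancy attention with an implicit TSDF; 3. trains end to end without supervision from 3D signals and performs strongly on multiple benchmark datasets.

Method: 1. Uses predicted depth to learn an explicit voxel occupancy representation; 2. refines the 3D feature volume with voxel occupancy attention; 3. integrates an implicit TSDF to further improve 3D awareness; 4. trains end to end.

Result: Improves mAP@0.5 by 9.3 on SUN RGB-D and by 3.3 on ScanNetV2, and AP3D@0.7 by 0.19 on KITTI, surpassing state-of-the-art image-based methods.

Insight: Combining explicit and implicit 3D geometric representations effectively reduces the geometric ambiguity of image-based 3D detection and improves generalization and accuracy.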

Abstract: This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection
approach that effectively handles single- and multi-view RGB images in indoor
and outdoor environments, showcasing its general-purpose applicability. The key
challenge for image-based 3D object detection tasks is the lack of 3D geometric
cues, which leads to ambiguity in establishing correspondences between images
and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D
geometric representations in both explicit and implicit manners based on
predicted depth information. Specifically, we utilize the predicted depth to
learn voxel occupancy and optimize the voxelized 3D feature volume explicitly
through the proposed voxel occupancy attention. To further enhance 3D
awareness, the feature volume is integrated with an implicit 3D representation,
the truncated signed distance function (TSDF). Without requiring supervision
from 3D signals, we significantly improve the model’s comprehension of 3D
geometry by leveraging intermediate 3D representations and achieve end-to-end
training. Our approach surpasses the performance of state-of-the-art
image-based methods on both single- and multi-view benchmark datasets across
diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D
dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19
AP3D@0.7 improvement on the KITTI dataset. The project page is available at:
https://cindy0725.github.io/3DGeoDet/.
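
For readers unfamiliar with the truncated signed distance function (TSDF) mentioned in this abstract, the sketch below computes the usual projective TSDF value of a voxel from a depth estimate along the camera ray. It illustrates the general definition only; the function name, the normalization to [-1, 1], and the way depth is supplied are assumptions, not details taken from 3DGeoDet.

```python
def tsdf_value(surface_depth: float, voxel_depth: float, truncation: float) -> float:
    """Truncated signed distance of a voxel along a camera ray.

    surface_depth: (predicted) depth of the surface on this ray.
    voxel_depth:   depth of the voxel center on the same ray.
    truncation:    truncation distance, in the same units as the depths.
    Positive values lie in front of the surface, negative values behind it,
    and the result is clamped to [-1, 1].
    """
    if truncation <= 0:
        raise ValueError("truncation must be positive")
    signed_distance = surface_depth - voxel_depth
    return max(-1.0, min(1.0, signed_distance / truncation))
```

Voxels far in front of the surface saturate at 1, voxels far behind it saturate at -1, and the zero crossing marks the surface itself.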

[74] GLD-Road: A global-local decoding road network extraction model for remote sensing images

Ligao Deng,Yupeng Deng,Yu Meng,Jingbo Chen,Zhihao Xi,Diyou Liu,Qifeng Chu

Main category: cs.CV

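TL;DR: The paper proposes GLD-Road, a two-stage model that combines global efficiency with local precision to significantly improve the accuracy and efficiency of road network extraction.

Details Motivation: Road network extraction is essential for mapping, autonomous driving, and disaster response, but existing methods are limited in either accuracy or efficiency.

Contribution: Proposes GLD-Road, which combines global parallel and local iterative approaches and improves both the accuracy of road extraction (APLS gains of 1.9% and 0.67%) and its efficiency (retrieval time reduced by 40% to 92%).

Method: A two-stage approach: 1) detect road nodes and connect them globally with a Connect Module; 2) iteratively refine broken roads using local searches.

Result: APLS improves by 1.9% on City-Scale and 0.67% on SpaceNet3; retrieval time is 40% lower than Sat2Graph and 92% lower than RNGDet++.

Insight: Combining the strengths of global and local methods keeps extraction efficient while improving accuracy, offering a new solution for road network extraction.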

Abstract: Road networks are crucial for mapping, autonomous driving, and disaster
response. While manual annotation is costly, deep learning offers efficient
extraction. Current methods include postprocessing (prone to errors), global
parallel (fast but misses nodes), and local iterative (accurate but slow). We
propose GLD-Road, a two-stage model combining global efficiency and local
precision. First, it detects road nodes and connects them via a Connect Module.
Then, it iteratively refines broken roads using local searches, drastically
reducing computation. Experiments show GLD-Road outperforms state-of-the-art
methods, improving APLS by 1.9% (City-Scale) and 0.67% (SpaceNet3). It also
reduces retrieval time by 40% vs. Sat2Graph (global) and 92% vs. RNGDet++
(local). The experimental results are available at
https://github.com/ucas-dlg/GLD-Road.

[75] AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions

Zhaoyang Wei,Chenhui Qiang,Bowen Jiang,Xumeng Han,Xuehui Yu,Zhenjun Han

Main category: cs.CV

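TL;DR: AD^2-Bench is the first Chain-of-Thought (CoT) benchmark for multi-modal large models (MLLMs) in autonomous driving under adverse weather and complex scenes. It fills a gap in existing benchmarks and supports multi-step reasoning and fine-grained analysis with over 5.4k high-quality annotated instances.

Details Motivation: Existing benchmarks do not adequately evaluate CoT reasoning for autonomous driving in adverse weather and complex environments. AD^2-Bench aims to fill this gap and advance the robustness and interpretability of MLLMs in autonomous driving.

Contribution: 1) The first CoT benchmark for adverse weather and complex scenes in autonomous driving; 2) over 5.4k high-quality annotated instances supporting text-level, point-level, and region-level visual prompts; 3) an evaluation framework that exposes the weak reasoning performance of existing MLLMs (accuracy below 60%).

Method: AD^2-Bench is built through: 1) data collection covering diverse adverse environments; 2) fine-grained annotation of multi-step reasoning; 3) a dedicated evaluation framework for analyzing MLLM reasoning performance.

Result: Existing MLLMs reach accuracy below 60% on AD^2-Bench, which indicates both the difficulty of the benchmark and the limitations of current models.

Insight: AD^2-Bench shows that adverse environments significantly affect MLLM reasoning in autonomous driving and underlines the need for more robust and interpretable models.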

Abstract: Chain-of-Thought (CoT) reasoning has emerged as a powerful approach to
enhance the structured, multi-step decision-making capabilities of Multi-Modal
Large Models (MLLMs), and is particularly crucial for autonomous driving with
adverse weather conditions and complex traffic environments. However, existing
benchmarks have largely overlooked the need for rigorous evaluation of CoT
processes in these specific and challenging scenarios. To address this critical
gap, we introduce AD^2-Bench, the first Chain-of-Thought benchmark specifically
designed for autonomous driving with adverse weather and complex scenes.
AD^2-Bench is meticulously constructed to fulfill three key criteria:
comprehensive data coverage across diverse adverse environments, fine-grained
annotations that support multi-step reasoning, and a dedicated evaluation
framework tailored for assessing CoT performance. The core contribution of
AD^2-Bench is its extensive collection of over 5.4k high-quality, manually
annotated CoT instances. Each intermediate reasoning step in these annotations
is treated as an atomic unit with explicit ground truth, enabling unprecedented
fine-grained analysis of MLLMs’ inferential processes under text-level,
point-level, and region-level visual prompts. Our comprehensive evaluation of
state-of-the-art MLLMs on AD^2-Bench reveals accuracy below 60%, highlighting
the benchmark’s difficulty and the need to advance robust, interpretable
end-to-end autonomous driving systems. AD^2-Bench thus provides a standardized
evaluation platform, driving research forward by improving MLLMs’ reasoning in
autonomous driving, making it an invaluable resource.

[76] SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields

Qijing Li,Jingxiang Sun,Liang An,Zhaoqi Su,Hongwen Zhang,Yebin Liu

Main category: cs.CV

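TL;DR: SemanticSplat proposes a feed-forward, semantic-aware 3D reconstruction method that unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling, significantly improving scene understanding from sparse views.

Details Motivation: Existing feed-forward 3D scene understanding methods (e.g., LSM) extract only language-based semantics and suffer from low-quality geometry reconstruction and noise, while per-scene optimization methods depend on dense input views and are impractical.

Contribution: 1. Proposes SemanticSplat, which is described as the first method to model 3D Gaussians and semantic attributes in a unified way; 2. proposes a two-stage distillation framework that reconstructs a multi-modal semantic feature field from sparse views; 3. supports promptable and open-vocabulary segmentation.

Method: Fuses feature fields such as LSeg and SAM with a cost volume representation to predict semantic anisotropic Gaussians, and reconstructs a multi-modal semantic feature field from sparse views through two-stage distillation.

Result: Experiments validate its effectiveness on promptable and open-vocabulary segmentation, where it is reported to significantly outperform existing methods.

Insight: Jointly modeling semantics and geometry can significantly improve the robustness of scene understanding from sparse inputs, offering a new direction for AR and robotic interaction.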

Abstract: Holistic 3D scene understanding, which jointly models geometry, appearance,
and semantics, is crucial for applications like augmented reality and robotic
interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM)
are limited to extracting language-based semantics from scenes, failing to
achieve holistic scene comprehension. Additionally, they suffer from
low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene
optimization methods rely on dense input views, which reduces practicality and
increases complexity during deployment. In this paper, we propose
SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which
unifies 3D Gaussians with latent semantic attributes for joint
geometry-appearance-semantics modeling. To predict the semantic anisotropic
Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a
cost volume representation that stores cross-view feature similarities,
enhancing coherent and accurate scene comprehension. Leveraging a two-stage
distillation framework, SemanticSplat reconstructs a holistic multi-modal
semantic feature field from sparse-view images. Experiments demonstrate the
effectiveness of our method for 3D scene understanding tasks like promptable
and open-vocabulary segmentation. Video results are available at
https://semanticsplat.github.io.

[77] ECAM: A Contrastive Learning Approach to Avoid Environmental Collision in Trajectory Forecasting

Giacomo Rosin,Muhammad Rameez Ur Rahman,Sebastiano Vascon

Main category: cs.CV

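TL;DR: The paper proposes ECAM, a contrastive learning-based module that strengthens the environmental collision avoidance ability of trajectory forecasting models and significantly reduces the collision rate.

Details Motivation: Existing trajectory forecasting methods often overlook the impact of the environment, so predicted trajectories collide with obstacles. ECAM is designed to address this problem.

Contribution: The ECAM module uses contrastive learning to improve environmental collision avoidance in trajectory forecasting and can be integrated seamlessly into existing models.

Method: ECAM is trained with contrastive learning so that predicted trajectories avoid collisions with environmental obstacles; experiments use the ETH/UCY dataset.

Result: Experiments show that state-of-the-art models integrated with ECAM have a significantly lower collision rate (a reduction of 40-50%).

Insight: Environmental collision avoidance is a key factor in trajectory forecasting, and contrastive learning can effectively improve the physical plausibility of predictions.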

Abstract: Human trajectory forecasting is crucial in applications such as autonomous
driving, robotics and surveillance. Accurate forecasting requires models to
consider various factors, including social interactions, multi-modal
predictions, pedestrian intention and environmental context. While existing
methods account for these factors, they often overlook the impact of the
environment, which leads to collisions with obstacles. This paper introduces
ECAM (Environmental Collision Avoidance Module), a contrastive learning-based
module to enhance collision avoidance ability with the environment. The
proposed module can be integrated into existing trajectory forecasting models,
improving their ability to generate collision-free predictions. We evaluate our
method on the ETH/UCY dataset and quantitatively and qualitatively demonstrate
its collision avoidance capabilities. Our experiments show that
state-of-the-art methods significantly reduce the collision rate (by 40-50%) when
integrated with the proposed module. The code is available at
https://github.com/CVML-CFU/ECAM.

[78] HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding

Yanzhao Shi,Xiaodan Zhang,Junzhong Ji,Haoning Jiang,Chengxin Zheng,Yinong Wang,Liangqiong Qu

Main category: cs.CV

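TL;DR: HSENet proposes a hybrid spatial encoding network for 3D medical vision-language understanding that uses dual 3D vision encoders and spatial compression to improve diagnostic accuracy and efficiency.

Details Motivation: Existing multimodal large language models (MLLMs) mainly target 2D medical images, which limits their ability to capture complex 3D anatomical structures and leads to diagnostic errors. HSENet aims to solve this problem.

Contribution: 1. Proposes the HSENet framework, which combines dual 3D vision encoders with spatial compression; 2. introduces Spatial Packer, which efficiently compresses high-resolution 3D spatial information; 3. achieves state-of-the-art performance on 3D language-visual retrieval, report generation, and visual question answering.

Method: HSENet uses dual 3D vision encoders to capture global volumetric context and fine-grained anatomical detail, pre-trained by dual-stage alignment with diagnostic reports. Spatial Packer converts 3D spatial regions into a compact set of visual tokens via centroid-based compression.

Result: Strong results on 3D language-visual retrieval (R@100 +5.96%), medical report generation (BLEU-4 +8.01%), and visual question answering (Major Class Accuracy +1.99%).

Insight: Combining 3D visual information with language models can significantly improve performance on medical diagnostic tasks, and efficient spatial compression is key.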

Abstract: Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based
decisions by enhancing diagnostic accuracy and workflow efficiency. While
multimodal large language models (MLLMs) exhibit promising performance in
visual-language understanding, existing methods mainly focus on 2D medical
images, which fundamentally limits their ability to capture complex 3D
anatomical structures. This limitation often leads to misinterpretation of
subtle pathologies and causes diagnostic hallucinations. In this paper, we
present Hybrid Spatial Encoding Network (HSENet), a framework that exploits
enriched 3D medical visual cues by effective visual perception and projection
for accurate and robust vision-language understanding. Specifically, HSENet
employs dual-3D vision encoders to perceive both global volumetric contexts and
fine-grained anatomical details, which are pre-trained by dual-stage alignment
with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient
multimodal projector that condenses high-resolution 3D spatial regions into a
compact set of informative visual tokens via centroid-based compression. By
assigning spatial packers with dual-3D vision encoders, HSENet can seamlessly
perceive and transfer hybrid visual representations to LLM’s semantic space,
facilitating accurate diagnostic text generation. Experimental results
demonstrate that our method achieves state-of-the-art performance in 3D
language-visual retrieval (39.85% of R@100, +5.96% gain), 3D medical report
generation (24.01% of BLEU-4, +8.01% gain), and 3D visual question answering
(73.60% of Major Class Accuracy, +1.99% gain), confirming its effectiveness.
Our code is available at https://github.com/YanzhaoShi/HSENet.

[79] DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

Dongxu Liu,Yuang Peng,Haomiao Tang,Yuwei Chen,Chunrui Han,Zheng Ge,Daxin Jiang,Mingxue Liao

Main category: cs.CV

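TL;DR: DGAE is an autoencoder whose decoder is guided by a diffusion model. It targets the performance degradation seen at high compression ratios while achieving a more compact latent representation.

Details Motivation: Autoencoders degrade at high compression ratios and are unstable to train, particularly because of GAN-related challenges. The paper aims to learn latent representations more efficiently by improving the decoder's expressiveness.

Contribution: Proposes DGAE, which uses a diffusion model to guide the decoder in recovering information that is not fully decoded from the latent representation, effectively mitigating performance degradation at high compression ratios while achieving a 2x smaller latent space.

Method: A diffusion model guides the decoder and increases its expressiveness, which improves recovery from the latent representation and reduces information loss at high compression ratios.

Result: DGAE performs competitively on ImageNet-1K image generation with a 2x smaller latent space and speeds up convergence of the diffusion model.

Insight: Diffusion guidance can substantially improve decoder expressiveness and the efficiency of latent representations, offering a new direction for generative models at high compression ratios.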

Abstract: Autoencoders empower state-of-the-art image and video generative models by
compressing pixels into a latent space through visual tokenization. Although
recent advances have alleviated the performance degradation of autoencoders
under high compression ratios, addressing the training instability caused by
GAN remains an open challenge. While improving spatial compression, we also aim
to minimize the latent space dimensionality, enabling more efficient and
compact representations. To tackle these challenges, we focus on improving the
decoder’s expressiveness. Concretely, we propose DGAE, which employs a
diffusion model to guide the decoder in recovering informative signals that are
not fully decoded from the latent representation. With this design, DGAE
effectively mitigates the performance degradation under high spatial
compression rates. At the same time, DGAE achieves state-of-the-art performance
with a 2x smaller latent space. When integrated with Diffusion Models, DGAE
demonstrates competitive performance on image generation for ImageNet-1K and
shows that this compact latent representation facilitates faster convergence of
the diffusion model.

[80] HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng,Junchao Huang,Xiangsheng Huang,Di Wen,Junwei Zheng,Yufan Chen,Kailun Yang,Jiamin Wu,Chongqing Hao,Rainer Stiefelhagen

Main category: cs.CV

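TL;DR: HopaDIFF is a new diffusion framework for text-referred human action segmentation in multi-person scenes. The paper introduces RHAS133, the first dataset for this task, and reports state-of-the-art segmentation results using holistic-partial aware Fourier conditioning.

Details Motivation: Existing action segmentation methods mainly address single-person activities and overlook multi-person scenes. This work fills that gap by using a textual description to specify the target person whose actions are segmented.

Contribution: 1. Introduces RHAS133, the first dataset for Referring Human Action Segmentation; 2. Designs the HopaDIFF framework, which combines holistic-partial awareness with Fourier-conditioned diffusion for action segmentation.

Method: 1. Uses a cross-input gate attentional xLSTM to strengthen long-range reasoning; 2. Introduces a Fourier condition that gives finer-grained control over the diffusion-based generation.

Result: HopaDIFF achieves state-of-the-art results on RHAS133 across diverse evaluation settings.

Insight: Multi-person action segmentation benefits from combining global and local information, and Fourier conditioning offers a new way to control diffusion-based generation.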

Abstract: Action segmentation is a core challenge in high-level video understanding,
aiming to partition untrimmed videos into segments and assign each a label from
a predefined action set. Existing methods primarily address single-person
activities with fixed action sequences, overlooking multi-person scenarios. In
this work, we pioneer textual reference-guided human action segmentation in
multi-person settings, where a textual description specifies the target person
for segmentation. We introduce the first dataset for Referring Human Action
Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137
fine-grained actions with 33h video data, together with textual descriptions
for this new task. Benchmarking existing action recognition methods on RHAS133
using VLM-based feature extractors reveals limited performance and poor
aggregation of visual cues for the target person. To address this, we propose a
holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF,
leveraging a novel cross-input gate attentional xLSTM to enhance
holistic-partial long-range reasoning and a novel Fourier condition to
introduce more fine-grained control to improve the action segmentation
generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse
evaluation settings. The code is available at
https://github.com/KPeng9510/HopaDIFF.git.

[81] CINeMA: Conditional Implicit Neural Multi-Modal Atlas for a Spatio-Temporal Representation of the Perinatal Brain

Maik Dannecker,Vasiliki Sideri-Lampretsa,Sophie Starck,Angeline Mihailov,Mathieu Milh,Nadine Girard,Guillaume Auzias,Daniel Rueckert

Main category: cs.CV

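TL;DR: CINeMA is a conditional implicit neural multi-modal atlas framework that builds high-resolution, spatio-temporal, multimodal brain atlases in low-data settings, with large gains in efficiency and flexibility.

Details Motivation: Studying the rapid development of fetal and neonatal brains requires atlases with high spatial and temporal resolution, but existing methods depend on large datasets and cannot cope with the scarce data available for pathological cases.

Contribution: CINeMA uses implicit neural representations and operates in latent space, which avoids compute-intensive image registration, greatly shortens atlas construction, and supports flexible conditioning (for example on age and pathology).

Method: The framework models the spatio-temporal variation of the brain atlas directly in latent space, supports downstream tasks such as tissue segmentation and age prediction, and provides generative capabilities.

Result: CINeMA surpasses existing methods in accuracy, efficiency, and versatility, suits low-data settings, and can generate synthetic data for data augmentation.

Insight: Implicit representations combined with conditional generation give brain-development research an efficient tool, particularly for pathologies where data are scarce.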

Abstract: Magnetic resonance imaging of fetal and neonatal brains reveals rapid
neurodevelopment marked by substantial anatomical changes unfolding within
days. Studying this critical stage of the developing human brain, therefore,
requires accurate brain models, referred to as atlases, of high spatial and
temporal resolution. To meet these demands, established traditional atlases and
recently proposed deep learning-based methods rely on large and comprehensive
datasets. This poses a major challenge for studying brains in the presence of
pathologies for which data remains scarce. We address this limitation with
CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for
creating high-resolution, spatio-temporal, multimodal brain atlases, suitable
for low-data settings. Unlike established methods, CINeMA operates in latent
space, avoiding compute-intensive image registration and reducing atlas
construction times from days to minutes. Furthermore, it enables flexible
conditioning on anatomical features including gestational age (GA), birth age, and pathologies
like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA
supports downstream tasks such as tissue segmentation and age prediction
whereas its generative properties enable synthetic data creation and
anatomically informed data augmentation. Surpassing state-of-the-art methods in
accuracy, efficiency, and versatility, CINeMA represents a powerful tool for
advancing brain research. We release the code and atlases at
https://github.com/m-dannecker/CINeMA.

[82] Reasoning Models Are More Easily Gaslighted Than You Think

Bin Zhu,Hailong Yin,Jingjing Chen,Yu-Gang Jiang

Main category: cs.CV

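TL;DR: A systematic evaluation shows that state-of-the-art reasoning models (OpenAI’s o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash) are fragile under misleading user input, with accuracy dropping by 25-29% on average. The authors introduce the GaslightingBench-R benchmark, on which accuracy drops exceed 53% on average.

Details Motivation: The ability of reasoning models to withstand misleading user input is underexplored; the paper sets out to measure it.

Contribution: 1) A systematic evaluation of three state-of-the-art reasoning models under gaslighting negation prompts; 2) GaslightingBench-R, a benchmark designed to test whether models keep correct answers under manipulative feedback; 3) evidence of a fundamental limitation of reasoning models in belief persistence.

Method: 1) Test the reasoning models on three multimodal benchmarks (MMMU, MathVista, CharXiv); 2) build GaslightingBench-R by filtering and curating 1,025 challenging samples from these benchmarks; 3) use gaslighting negation prompts to evaluate robustness.

Result: Gaslighting negation prompts cause significant accuracy drops: 25-29% on average on the original benchmarks and more than 53% on average on GaslightingBench-R.

Insight: Step-by-step reasoning does not guarantee belief persistence, and the models are easily misled by users. The finding points to a direction for improving robustness.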

Abstract: Recent advances in reasoning-centric models promise improved robustness
through mechanisms such as chain-of-thought prompting and test-time scaling.
However, their ability to withstand misleading user input remains
underexplored. In this paper, we conduct a systematic evaluation of three
state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet
and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and
CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average)
following gaslighting negation prompts, indicating that even top-tier reasoning
models struggle to preserve correct answers under manipulative user feedback.
Built upon the insights of the evaluation and to further probe this
vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark
specifically designed to evaluate reasoning models’ ability to defend
their beliefs under gaslighting negation prompts. Constructed by filtering and
curating 1,025 challenging samples from the existing benchmarks,
GaslightingBench-R induces even more dramatic failures, with accuracy drops
exceeding 53% on average. Our findings reveal fundamental limitations in the
robustness of reasoning models, highlighting the gap between step-by-step
reasoning and belief persistence.
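
Illustrative sketch (not from the paper): the reported accuracy drops compare accuracy before and after a gaslighting negation prompt. The snippet below shows that arithmetic on per-sample correctness flags. The function name `accuracy_drop` and the choice to report the drop in absolute percentage points are assumptions; the abstract does not say whether its figures are absolute or relative.

```python
def accuracy_drop(correct_before, correct_after):
    """Return (accuracy before, accuracy after, drop in percentage points).

    Both arguments are equal-length lists of booleans, one per sample
    (hypothetical inputs: whether the model's answer was correct before
    and after the negation prompt).
    """
    if len(correct_before) != len(correct_after) or not correct_before:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_before)
    acc_before = sum(correct_before) / n
    acc_after = sum(correct_after) / n
    # Assumption: the drop is an absolute difference in percentage points.
    return acc_before, acc_after, (acc_before - acc_after) * 100.0
```

For example, a model that answers three of four samples correctly and keeps only one of them after the negation prompt shows a 50-point drop under this definition.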

[83] Adding simple structure at inference improves Vision-Language Compositionality

Imanol Miranda,Ander Salaberria,Eneko Agirre,Gorka Azkune

Main category: cs.CV

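TL;DR: The paper adds simple structure at inference time to improve the compositionality of vision-language models (VLMs). Without any retraining, it matches image crops to text segments and aggregates their similarities, which consistently improves performance.

Details Motivation: Dual-encoder VLMs such as CLIP struggle with compositionality (for example attribute-object binding) and show a bag-of-words-like behavior that limits retrieval performance. Many training approaches have been proposed to address this, but inference-time techniques have received little attention.

Contribution: A training-free, inference-only method that matches image crops with text segments and aggregates the match similarities, improving VLMs on compositionality tasks, especially attribute-object binding.

Method: Four steps: 1) divide the image into smaller crops; 2) extract text segments (objects, attributes, and relations); 3) use a VLM to find the image crops that best match each text segment; 4) compute the final image-text similarity by aggregating the similarities of the individual matches.

Result: The method consistently improves compositionality performance across various dual-encoder VLMs, with the strongest gains on attribute-object binding. The analysis also shows that processing image crops is essential to the gains.

Insight: Inference-time techniques have underexplored potential. Image crops and text-segment matching are key to improving vision-language compositionality, and the authors identify specific areas where inference-time approaches can improve further.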

Abstract: Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for
image-text retrieval tasks. However, those models struggle with
compositionality, showing a bag-of-words-like behavior that limits their
retrieval performance. Many different training approaches have been proposed to
improve the vision-language compositionality capabilities of those models. In
comparison, inference-time techniques have received little attention. In this
paper, we propose to add simple structure at inference, where, given an image
and a caption: i) we divide the image into different smaller crops, ii) we
extract text segments, capturing objects, attributes and relations, iii) using
a VLM, we find the image crops that better align with text segments obtaining
matches, and iv) we compute the final image-text similarity aggregating the
individual similarities of the matches. Based on various popular dual encoder
VLMs, we evaluate our approach in controlled and natural datasets for VL
compositionality. We find that our approach consistently improves the
performance of evaluated VLMs without any training, which shows the potential
of inference-time techniques. The results are especially good for
attribute-object binding as shown in the controlled dataset. As a result of an
extensive analysis: i) we show that processing image crops is actually
essential for the observed gains in performance, and ii) we identify specific
areas to further improve inference-time approaches.
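
Illustrative sketch (not from the paper): steps iii and iv above can be pictured as a matching over a segment-by-crop similarity matrix. The code assumes crop and text-segment embeddings have already been produced by some dual encoder. The function name `structured_similarity`, the max-over-crops matching rule, and the mean aggregation are assumptions chosen for simplicity, not the authors' exact procedure.

```python
import numpy as np

def structured_similarity(crop_emb, seg_emb):
    """Score an image-caption pair from crop and text-segment embeddings.

    crop_emb: (n_crops, d) array, one row per image crop (hypothetical input).
    seg_emb:  (n_segs, d) array, one row per text segment (hypothetical input).
    Returns the aggregated score and the index of the matched crop per segment.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    crops = crop_emb / np.linalg.norm(crop_emb, axis=1, keepdims=True)
    segs = seg_emb / np.linalg.norm(seg_emb, axis=1, keepdims=True)
    sim = segs @ crops.T                 # shape (n_segs, n_crops)
    best_crop = sim.argmax(axis=1)       # step iii: best crop for each segment
    match_scores = sim.max(axis=1)       # similarity of each match
    # Step iv (assumed rule): average the per-match similarities.
    return float(match_scores.mean()), best_crop
```

Matching each segment to its own crop is what lets a phrase such as an attribute-object pair be scored against the image region that contains it, rather than against the whole image at once.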

[84] Towards Practical Alzheimer’s Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model

Changwei Wu,Yifei Chen,Yuxin Du,Jinying Zong,Jie Dong,Mingxuan Liu,Yong Peng,Jin Fan,Feiwei Qin,Changmiao Wang

Main category: cs.CV

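TL;DR: The paper proposes FasterSNN, a lightweight and interpretable spiking neural network for early Alzheimer’s disease (AD) diagnosis. It addresses the high energy cost of conventional deep learning, and its hybrid architecture improves expressiveness and training stability.

Details Motivation: Current AD diagnosis relies on subjective assessments and multimodal imaging, which is costly and inefficient. Deep learning can automate diagnosis but consumes too much energy, and SNNs, though promising, lack expressiveness and training stability on complex medical tasks.

Contribution: Proposes FasterSNN, which combines LIF neurons, region-adaptive convolution, and multi-scale spiking attention to process 3D MRI sparsely and efficiently, improving energy efficiency while preserving diagnostic accuracy.

Method: A hybrid neural architecture that integrates LIF neurons with region-adaptive convolution and multi-scale spiking attention to improve expressiveness and training stability.

Result: On benchmark datasets, FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening.

Insight: SNNs are promising for medical diagnosis; careful architecture design can overcome their expressiveness and stability problems and enable low-power automated diagnosis.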

Abstract: Early diagnosis of Alzheimer’s Disease (AD), especially at the mild cognitive
impairment (MCI) stage, is vital yet hindered by subjective assessments and the
high cost of multimodal imaging modalities. Although deep learning methods
offer automated alternatives, their energy inefficiency and computational
demands limit real-world deployment, particularly in resource-constrained
settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are
inherently well-suited for modeling the sparse, event-driven patterns of neural
degeneration in AD, offering a promising foundation for interpretable and
low-power medical diagnostics. However, existing SNNs often suffer from weak
expressiveness and unstable training, which restrict their effectiveness in
complex medical tasks. To address these limitations, we propose FasterSNN, a
hybrid neural architecture that integrates biologically inspired LIF neurons
with region-adaptive convolution and multi-scale spiking attention. This design
enables sparse, efficient processing of 3D MRI while preserving diagnostic
accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves
competitive performance with substantially improved efficiency and stability,
supporting its potential for practical AD screening. Our source code is
available at https://github.com/wuchangw/FasterSNN.
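
Illustrative sketch (not from the paper): the abstract names leaky integrate-and-fire (LIF) neurons as a building block. The snippet below is a common discrete-time LIF update with a hard reset, shown only to make the spiking behavior concrete. The function name `lif_simulate` and the values of `tau`, `v_threshold`, and `v_reset` are assumptions; FasterSNN's actual neuron configuration is not given in the abstract.

```python
def lif_simulate(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents.

    Returns a list of 0/1 spikes, one per time step.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        # Leaky integration: the membrane potential moves toward the input
        # while decaying toward the reset potential with time constant tau.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset          # hard reset after firing
        else:
            spikes.append(0)
    return spikes
```

Because the output is binary and often zero, downstream layers only need to do work on time steps where a spike occurs, which is the usual source of the efficiency that SNNs aim for.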

[85] Non-Contact Health Monitoring During Daily Personal Care Routines

Xulin Ma,Jiankai Tang,Zhang Jiang,Songqin Cheng,Yuanchun Shi,Dong LI,Xin Liu,Daniel McDuff,Xiaojing Liu,Yuntao Wang

Main category: cs.CV

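TL;DR: The paper presents LADH, the first long-term remote photoplethysmography (rPPG) dataset. Combining RGB and infrared video inputs improves the accuracy and robustness of non-contact physiological monitoring, reaching an MAE of 4.99 BPM in heart rate estimation.

Details Motivation: Applying rPPG in long-term personal care scenarios (such as daily routines in high-altitude environments) is hard because of ambient lighting variations, occlusions from hand movements, and dynamic facial postures.

Contribution: 1) LADH, the first long-term rPPG dataset, with 240 synchronized RGB and infrared facial videos; 2) evidence that combining RGB and infrared inputs improves monitoring; 3) evidence that multi-task learning improves performance across multiple physiological indicators.

Method: 1) Build the LADH dataset of RGB and infrared videos; 2) combine RGB and infrared inputs for physiological signal monitoring; 3) apply multi-task learning to estimate multiple physiological indicators.

Result: With combined RGB and infrared inputs, heart rate estimation reaches an MAE of 4.99 BPM, and multi-task learning improves performance across multiple physiological indicators.

Insight: Combining multimodal data with multi-task learning improves the accuracy and robustness of non-contact physiological monitoring, particularly in challenging environments.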

Abstract: Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring
of physiological signals and offers a practical alternative to traditional
health sensing methods. Although rPPG is promising for daily health monitoring,
its application in long-term personal care scenarios, such as mirror-facing
routines in high-altitude environments, remains challenging due to ambient
lighting variations, frequent occlusions from hand movements, and dynamic
facial postures. To address these challenges, we present LADH (Long-term
Altitude Daily Health), the first long-term rPPG dataset containing 240
synchronized RGB and infrared (IR) facial videos from 21 participants across
five common personal care scenarios, along with ground-truth PPG, respiration,
and blood oxygen signals. Our experiments demonstrate that combining RGB and IR
video inputs improves the accuracy and robustness of non-contact physiological
monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate
estimation. Furthermore, we find that multi-task learning enhances performance
across multiple physiological indicators simultaneously. Dataset and code are
open at https://github.com/McJackTang/FusionVitals.
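
Illustrative sketch (not from the paper): the headline metric is the mean absolute error (MAE) between predicted and reference heart rates in beats per minute. The snippet shows the computation on made-up numbers. The function name and the sample values are hypothetical and are not taken from LADH.

```python
def mean_absolute_error(predicted_bpm, reference_bpm):
    """Mean absolute error between two equal-length lists of heart rates (BPM)."""
    if len(predicted_bpm) != len(reference_bpm) or not predicted_bpm:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted_bpm, reference_bpm))
    return total / len(predicted_bpm)

# Made-up example values, for illustration only.
example_mae = mean_absolute_error([70, 82, 65], [72, 80, 60])
```

An MAE of 4.99 BPM therefore means that, averaged over the evaluated clips, the estimate differs from the reference by about five beats per minute.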

[86] Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

Yuting Li,Lai Wei,Kaipeng Zheng,Jingyuan Huang,Linghe Kong,Lichao Sun,Weiran Huang

Main category: cs.CV

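TL;DR: The paper argues that multimodal large language models (MLLMs) have overlooked the importance of visual processing and proposes a simple visual perturbation framework that consistently improves mathematical reasoning.

Details Motivation: Despite rapid progress in MLLMs, visual processing has been overlooked. In the authors’ experiment, language-only models given image captions perform comparably to or better than MLLMs that consume raw visual inputs, suggesting that MLLMs fail to integrate visual information effectively during reasoning.

Contribution: A visual perturbation framework that needs no algorithmic modifications or additional training data. It comprises three perturbation strategies (distractor concatenation, dominance-preserving mixup, and random rotation) and improves mathematical reasoning performance.

Method: The three perturbations (distractor concatenation, dominance-preserving mixup, and random rotation) are applied within existing post-training pipelines such as SFT, DPO, and GRPO to strengthen perceptual robustness.

Result: Across multiple datasets, mathematical reasoning improves consistently, with gains comparable to those from algorithmic changes. Qwen2.5-VL-7B trained with visual perturbation is competitive among open-source 7B RL-tuned models.

Insight: Visual perturbation plays a critical role in multimodal mathematical reasoning. Each perturbation type contributes to a different aspect of visual reasoning, supporting the claim that "better reasoning begins with better seeing".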

Abstract: Despite the rapid progress of multimodal large language models (MLLMs), they
have largely overlooked the importance of visual processing. In a simple yet
revealing experiment, we interestingly find that language-only models, when
provided with image captions, can achieve comparable or even better performance
than MLLMs that consume raw visual inputs. This suggests that current MLLMs may
generate accurate visual descriptions but fail to effectively integrate them
during reasoning. Motivated by this, we propose a simple visual perturbation
framework that enhances perceptual robustness without requiring algorithmic
modifications or additional training data. Our approach introduces three
targeted perturbations: distractor concatenation, dominance-preserving mixup,
and random rotation, that can be easily integrated into existing post-training
pipelines including SFT, DPO, and GRPO. Through extensive experiments across
multiple datasets, we demonstrate consistent improvements in mathematical
reasoning performance, with gains comparable to those achieved through
algorithmic changes. Additionally, we achieve competitive performance among
open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual
perturbation. Through comprehensive ablation studies, we analyze the
effectiveness of different perturbation strategies, revealing that each
perturbation type contributes uniquely to different aspects of visual
reasoning. Our findings highlight the critical role of visual perturbation in
multimodal mathematical reasoning: better reasoning begins with better seeing.
Our code is available at https://github.com/YutingLi0606/Vision-Matters.
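
Illustrative sketch (not from the paper): the three perturbations named above can be pictured as simple array operations on images. The versions below are assumptions made for illustration: the distractor is concatenated side by side, the mixup keeps the original dominant by requiring a weight above 0.5, and the rotation is restricted to multiples of 90 degrees. The authors' repository defines the actual operations and parameters.

```python
import numpy as np

def distractor_concat(image, distractor):
    """Place an unrelated distractor image to the right of the original."""
    if image.shape[0] != distractor.shape[0]:
        raise ValueError("images must share the same height")
    return np.concatenate([image, distractor], axis=1)

def dominance_preserving_mixup(image, other, lam=0.8):
    """Blend two same-shaped images while keeping the original dominant."""
    if not 0.5 < lam <= 1.0:
        raise ValueError("lam must be in (0.5, 1] so the original dominates")
    return lam * image + (1.0 - lam) * other

def random_rotation(image, rng):
    """Rotate by a random multiple of 90 degrees (a simplifying assumption)."""
    k = int(rng.integers(0, 4))
    return np.rot90(image, k=k, axes=(0, 1))
```

In a post-training pipeline, one of these functions would be applied to each training image before it reaches the model, leaving the question and answer text unchanged.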

[87] Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets

Yangrui Zhu,Junhua Bao,Yipan Wei,Yapeng Li,Bo Du

Main category: cs.CV

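TL;DR: The paper introduces a practical setting, Multi-Modal Heterogeneous Category-set Learning (MMHCL), and proposes a Class Similarity-based Cross-modal Fusion model (CSCF). Using semantic-space alignment and uncertainty estimation, CSCF transfers knowledge and fuses decisions across modalities, and it outperforms existing methods on multiple benchmark datasets.

Details Motivation: In real applications the category distributions of different modalities are often inconsistent, yet existing methods assume all modalities share one category set, which limits the model’s ability to exploit cross-modal information.

Contribution: Defines the MMHCL task and proposes the CSCF model, which tackles multimodal classification under heterogeneous category sets through semantic-space alignment and class-similarity-based information fusion.

Method: CSCF aligns modality-specific features to a shared semantic space to transfer knowledge, selects the most discriminative modality through uncertainty estimation, and integrates cross-modal information based on class similarity.

Result: CSCF significantly outperforms existing SOTA methods on multiple benchmark datasets.

Insight: Under heterogeneous category sets, modality alignment and information fusion are key to better multimodal classification.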

Abstract: Existing multimodal methods typically assume that different modalities share
the same category set. However, in real-world applications, the category
distributions in multimodal data exhibit inconsistencies, which can hinder the
model’s ability to effectively utilize cross-modal information for recognizing
all categories. In this work, we propose the practical setting termed
Multi-Modal Heterogeneous Category-set Learning (MMHCL), where models are
trained on heterogeneous category sets of multi-modal data and aim to recognize
the complete class set of all modalities at test time. To effectively address this
task, we propose a Class Similarity-based Cross-modal Fusion model (CSCF).
Specifically, CSCF aligns modality-specific features to a shared semantic space
to enable knowledge transfer between seen and unseen classes. It then selects
the most discriminative modality for decision fusion through uncertainty
estimation. Finally, it integrates cross-modal information based on class
similarity, where the auxiliary modality refines the prediction of the dominant
one. Experimental results show that our method significantly outperforms
existing state-of-the-art (SOTA) approaches on multiple benchmark datasets,
effectively addressing the MMHCL task.

[88] Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints

Xiangkai Zhang,Xiang Zhou,Mao Chen,Yuchen Lu,Xu Yang,Zhiyong Liu

Main category: cs.CV

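TL;DR: The paper proposes a hierarchical cross-source image matching method for UAV absolute visual localization that uses semantic and structural constraints to improve accuracy and robustness.

Details Motivation: When GNSS signals are unavailable, traditional vision-based localization struggles with matching because of cross-source discrepancies and temporal variations.

Contribution: 1. A hierarchical matching method that combines semantic awareness with structural constraints; 2. A UAV absolute visual localization pipeline that does not rely on relative localization techniques.

Method: Coarse matching comes first, using semantic features under semantic and structural constraints; a lightweight fine-grained matching module then establishes pixel-level correspondences.

Result: Experiments on public benchmark datasets and the new CS-UAV dataset show superior accuracy and robustness.

Insight: Semantic and structural constraints mitigate cross-source and temporal differences and improve matching accuracy.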

Abstract: Absolute localization, aiming to determine an agent’s location with respect
to a global reference, is crucial for unmanned aerial vehicles (UAVs) in
various applications, but it becomes challenging when global navigation
satellite system (GNSS) signals are unavailable. Vision-based absolute
localization methods, which locate the current view of the UAV in a reference
satellite map to estimate its position, have become popular in GNSS-denied
scenarios. However, existing methods mostly rely on traditional and low-level
image matching, suffering from difficulties due to significant differences
introduced by cross-source discrepancies and temporal variations. To overcome
these limitations, in this paper, we introduce a hierarchical cross-source
image matching method designed for UAV absolute localization, which integrates
a semantic-aware and structure-constrained coarse matching module with a
lightweight fine-grained matching module. Specifically, in the coarse matching
module, semantic features derived from a vision foundation model first
establish region-level correspondences under semantic and structural
constraints. Then, the fine-grained matching module is applied to extract fine
features and establish pixel-level correspondences. Building upon this, a UAV
absolute visual localization pipeline is constructed without any reliance on
relative localization techniques, mainly by employing an image retrieval module
before the proposed hierarchical image matching modules. Experimental
evaluations on public benchmark datasets and a newly introduced CS-UAV dataset
demonstrate superior accuracy and robustness of the proposed method under
various challenging conditions, confirming its effectiveness.

[89] Q-SAM2: Accurate Quantization for Segment Anything Model 2

Nicola Farronato,Florian Scheidegger,Mattia Rigotti,Cristiano Malossi,Michele Magno,Haotong Qin

Main category: cs.CV

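TL;DR: The paper proposes Q-SAM2, which uses linear layer calibration and quantization-aware training (QAT) to counter the performance degradation of SAM2 under low-bit quantization, yielding accurate inference with substantially better efficiency.

Details Motivation: SAM2’s high compute and memory costs limit its use in resource-constrained scenarios. The authors therefore propose an accurate low-bit quantization method that addresses the performance loss caused by quantization.

Contribution: 1. A linear layer calibration method that minimizes a Frobenius norm to reposition weight distributions; 2. A QAT pipeline that applies clipping to suppress outliers and lets the network adapt to quantization thresholds.

Method: Combines linear layer calibration with QAT to improve SAM2’s low-bit quantization performance.

Result: Q-SAM2 surpasses existing general quantization schemes, especially at ultra-low 2-bit quantization. In post-training quantization, the calibration technique achieves up to a 66% mIoU improvement over non-calibrated models.

Insight: The calibration technique, although designed for quantization-aware training, also proves effective in post-training quantization.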

Abstract: The Segment Anything Model 2 (SAM2) has gained significant attention as a
foundational approach for promptable image and video segmentation. However, its
expensive computational and memory consumption poses a severe challenge for its
application in resource-constrained scenarios. In this paper, we propose an
accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To
address the performance degradation caused by the singularities in weight and
activation distributions during quantization, Q-SAM2 introduces two novel
technical contributions. We first introduce a linear layer calibration method
for low-bit initialization of SAM2, which minimizes the Frobenius norm over a
small image batch to reposition weight distributions for improved quantization.
We then propose a Quantization-Aware Training (QAT) pipeline that applies
clipping to suppress outliers and allows the network to adapt to quantization
thresholds during training. Our comprehensive experiments demonstrate that
Q-SAM2 allows for highly accurate inference while substantially improving
efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses
existing state-of-the-art general quantization schemes, especially for
ultra-low 2-bit quantization. While designed for quantization-aware training,
our proposed calibration technique also proves effective in post-training
quantization, achieving up to a 66% mIoU accuracy improvement over
non-calibrated models.
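
Illustrative sketch (not from the paper): the abstract describes clipping to suppress outliers before low-bit quantization. The snippet below is a generic uniform quantizer with a symmetric clipping threshold, shown to make the effect of clipping at 2 bits concrete. The function name `fake_quantize` and the symmetric grid are assumptions; Q-SAM2's actual quantizer and its calibration step are not specified in the abstract.

```python
import numpy as np

def fake_quantize(x, clip, bits=2):
    """Clip x to [-clip, clip], then round onto a uniform grid of 2**bits values.

    Returns de-quantized floats, so the result can be compared with x directly.
    """
    steps = 2 ** bits - 1                     # intervals between grid points
    x_clipped = np.clip(x, -clip, clip)       # outliers saturate at the threshold
    scale = 2.0 * clip / steps                # width of one quantization step
    q = np.round((x_clipped + clip) / scale)  # integer code in [0, steps]
    return q * scale - clip

# With bits=2 and clip=1.5 the grid is {-1.5, -0.5, 0.5, 1.5}.
example = fake_quantize(np.array([-5.0, -0.2, 0.2, 5.0]), clip=1.5, bits=2)
```

A smaller threshold gives finer resolution near zero at the cost of saturating more outliers, which is why the threshold matters so much when only four levels are available. In quantization-aware training the rounding step is commonly paired with a straight-through gradient estimator, which this sketch omits.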

[90] Accurate and efficient zero-shot 6D pose estimation with frozen foundation models

Andrea Caraffa,Davide Boscaini,Fabio Poiesi

Main category: cs.CV

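TL;DR: FreeZeV2 is a training-free 6D pose estimation method that uses pre-trained foundation models to estimate the pose of novel objects accurately and efficiently, improving both speed and accuracy over FreeZe.

Details Motivation: Most approaches to generalizing 6D pose estimation to novel objects scale up training on task-specific data, which is computationally expensive. FreeZeV2 asks whether accurate pose estimation is possible without task-specific training, using pre-trained models instead.

Contribution: 1) Sparse feature extraction that reduces computation; 2) a feature-aware scoring mechanism that improves pose selection; 3) a modular design that supports ensembles of instance segmentation models.

Method: Uses pre-trained geometric and vision foundation models together with sparse feature extraction, feature-aware scoring, and a modular design to estimate 6D poses accurately and efficiently.

Result: Sets a new state of the art on the seven core datasets of the BOP Benchmark. With the same segmentation masks it is 8x faster than FreeZe and 5% more accurate; with ensembles of segmentation models it gains a further 8% in accuracy while still running 2.5x faster.

Insight: Pre-trained foundation models can remove the need for expensive task-specific training, and efficient design plus ensembling can further improve accuracy and speed.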

Abstract: Estimating the 6D pose of objects from RGBD data is a fundamental problem in
computer vision, with applications in robotics and augmented reality. A key
challenge is achieving generalization to novel objects that were not seen
during training. Most existing approaches address this by scaling up training
on synthetic data tailored to the task, a process that demands substantial
computational resources. But is task-specific training really necessary for
accurate and efficient 6D pose estimation of novel objects? To answer No!, we
introduce FreeZeV2, the second generation of FreeZe: a training-free method
that achieves strong generalization to unseen objects by leveraging geometric
and vision foundation models pre-trained on unrelated data. FreeZeV2 improves
both accuracy and efficiency over FreeZe through three key contributions: (i) a
sparse feature extraction strategy that reduces inference-time computation
without sacrificing accuracy; (ii) a feature-aware scoring mechanism that
improves both pose selection during RANSAC-based 3D registration and the final
ranking of pose candidates; and (iii) a modular design that supports ensembles
of instance segmentation models, increasing robustness to segmentation mask
errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark,
where it establishes a new state-of-the-art in 6D pose estimation of unseen
objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable
8x speedup over FreeZe while also improving accuracy by 5%. When using
ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy
while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall
Method at the BOP Challenge 2024.

[91] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision

Xiandong Zou,Ruihao Xia,Hongsong Wang,Pan Zhou

Main category: cs.CV

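TL;DR: The paper proposes DreamCS, a framework for geometry-aware text-to-3D generation that better matches human preferences. It builds 3D-MeshPref, the first large-scale unpaired 3D preference dataset, and trains a reward model (RewardCS) with a novel Cauchy-Schwarz divergence objective.

Details Motivation: Existing preference alignment for text-to-3D generation typically relies on hard-to-collect paired multi-view 2D images to train reward models, and the resulting 2D bias causes geometric artifacts. The paper addresses this by learning human preferences directly from unpaired 3D data.

Contribution: 1) 3D-MeshPref, the first large-scale unpaired 3D preference dataset; 2) RewardCS, a reward model trained with a Cauchy-Schwarz divergence objective; 3) DreamCS, a unified framework that integrates 3D preference feedback into text-to-3D pipelines.

Method: 1) Build 3D-MeshPref from 3D meshes annotated by a large language model and refined by human evaluators; 2) train RewardCS with a Cauchy-Schwarz divergence objective to learn geometric preferences from unpaired 3D data; 3) integrate RewardCS into text-to-3D pipelines to improve both implicit and explicit 3D generation.

Result: Experiments show DreamCS outperforms prior methods, producing 3D assets that are both geometrically faithful and human-preferred.

Insight: Learning geometric preferences directly from 3D data, rather than from paired 2D images, avoids 2D bias and improves generation quality.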

Abstract: While text-to-3D generation has attracted growing interest, existing methods
often struggle to produce 3D assets that align well with human preferences.
Current preference alignment techniques for 3D content typically rely on
hard-to-collect preference-paired multi-view 2D images to train 2D reward
models, which then guide 3D generation, leading to geometric artifacts due to
their inherent 2D bias. To address these limitations, we construct 3D-MeshPref,
the first large-scale unpaired 3D preference dataset, featuring diverse 3D
meshes annotated by a large language model and refined by human evaluators. We
then develop RewardCS, the first reward model trained directly on unpaired
3D-MeshPref data using a novel Cauchy-Schwarz divergence objective, enabling
effective learning of human-aligned 3D geometric preferences without requiring
paired comparisons. Building on this, we propose DreamCS, a unified framework
that integrates RewardCS into text-to-3D pipelines – enhancing both implicit
and explicit 3D generation with human preference feedback. Extensive
experiments show DreamCS outperforms prior methods, producing 3D assets that
are both geometrically faithful and human-preferred. Code and models will be
released publicly.
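
Illustrative sketch (not from the paper): a Cauchy-Schwarz divergence compares two distributions without pairing individual samples, which is what makes it a natural fit for unpaired preference data. The snippet estimates D_CS(p, q) = -log( <p, q> / sqrt(<p, p> <q, q>) ) from two sample sets with a Gaussian kernel. The kernel choice, the bandwidth `sigma`, and the function names are assumptions; how RewardCS builds its training objective from this quantity is not described in the abstract.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def cs_divergence(x, y, sigma=1.0):
    """Kernel estimate of the Cauchy-Schwarz divergence between two sample sets.

    The value is zero when the sets coincide and is non-negative by the
    Cauchy-Schwarz inequality.
    """
    k_xy = gaussian_gram(x, y, sigma).mean()   # estimates <p, q>
    k_xx = gaussian_gram(x, x, sigma).mean()   # estimates <p, p>
    k_yy = gaussian_gram(y, y, sigma).mean()   # estimates <q, q>
    return float(-np.log(k_xy) + 0.5 * np.log(k_xx) + 0.5 * np.log(k_yy))
```

Here `x` and `y` could be feature vectors of preferred and non-preferred meshes; no mesh in one set needs a counterpart in the other, and the two sets may differ in size.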

[92] MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion

Chuang Ma,Yu Pei,Jianhang Zhang,Shaokai Zhao,Bowen Ji,Liang Xie,Ye Yan,Erwei Yin

Main category: cs.CV

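TL;DR: The paper introduces MMME, the first multimodal micro-expression dataset with synchronized facial action signals, central nervous system signals, and peripheral physiological signals. It fills the gap in multimodal data for micro-expression research, and fusing the modalities significantly improves recognition and spotting.

Details Motivation: Existing micro-expression research considers only the visual modality and overlooks the rich emotional information in physiological modalities, so performance falls far short of practical needs. Exploring the cross-modal association between ME visual features and physiological signals is therefore key to advancing ME analysis.

Contribution: MMME, the first multimodal ME dataset with synchronized collection of facial action signals (MEs), central nervous system signals (EEG), and peripheral physiological signals (PPG and others), with experiments validating its reliability and the gains from multimodal fusion.

Method: Multimodal data are collected synchronously to build MMME, which contains 634 MEs, 2,841 macro-expressions (MaEs), and 2,890 trials of synchronized multimodal physiological signals. Experiments validate the dataset’s reliability and benchmark ME recognition and spotting with multimodal fusion.

Result: Integrating MEs with physiological signals significantly improves recognition and spotting performance. MMME is, to the authors’ knowledge, the most comprehensive ME dataset to date in terms of modality diversity.

Insight: Adding physiological signals gives ME analysis a new dimension, exposes visual-physiological synergistic effects, and pushes the field from single-modality visual analysis toward multimodal fusion.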

Abstract: Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an
individual’s genuine emotional state. Their analysis has attracted considerable
interest due to its promising applications in fields such as healthcare,
criminal investigation, and human-computer interaction. However, existing ME
research is limited to a single visual modality, overlooking the rich emotional
information conveyed by other physiological modalities, resulting in ME
recognition and spotting performance far below practical application needs.
Therefore, exploring the cross-modal association mechanism between ME visual
features and physiological signals (PS), and developing a multimodal fusion
framework, represents a pivotal step toward advancing ME analysis. This study
introduces a novel ME dataset, MMME, which, for the first time, enables
synchronized collection of facial action signals (MEs), central nervous system
signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). By overcoming
the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841
macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS,
establishing a robust foundation for investigating ME neural mechanisms and
conducting multimodal fusion-based analyses. Extensive experiments validate the
dataset’s reliability and provide benchmarks for ME analysis, demonstrating
that integrating MEs with PS significantly enhances recognition and spotting
performance. To the best of our knowledge, MMME is the most comprehensive ME
dataset to date in terms of modality diversity. It provides critical data
support for exploring the neural mechanisms of MEs and uncovering the
visual-physiological synergistic effects, driving a paradigm shift in ME
research from single-modality visual analysis to multimodal fusion. The dataset
will be publicly available upon acceptance of this paper.

[93] DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction

Junli Deng,Ping Shi,Qipei Li,Jinyang Guo

Main category: cs.CV

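TL;DR: DynaSplat extends Gaussian Splatting to dynamic scene reconstruction by combining dynamic-static separation with hierarchical motion modeling, improving reconstruction of complex dynamic scenes.

Details Motivation: Existing methods reconstruct complex dynamic scenes poorly. DynaSplat addresses this with dynamic-static separation and hierarchical motion modeling.

Contribution: 1. Dynamic-static classification from deformation offset statistics and 2D motion flow consistency; 2. A hierarchical motion modeling strategy that captures coarse global transformations and fine-grained local movements; 3. Physically-based opacity estimation that improves visual coherence.

Method: 1. Dynamic-static separation; 2. Hierarchical motion modeling; 3. Physically-based opacity estimation.

Result: On challenging datasets, DynaSplat surpasses existing methods in accuracy and realism while being more compact and efficient.

Insight: Combining dynamic-static separation with hierarchical modeling improves dynamic scene reconstruction, and physically-based opacity estimation keeps the results visually coherent.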

Abstract: Reconstructing intricate, ever-changing environments remains a central
ambition in computer vision, yet existing solutions often crumble before the
complexity of real-world dynamics. We present DynaSplat, an approach that
extends Gaussian Splatting to dynamic scenes by integrating dynamic-static
separation and hierarchical motion modeling. First, we classify scene elements
as static or dynamic through a novel fusion of deformation offset statistics
and 2D motion flow consistency, refining our spatial representation to focus
precisely where motion matters. We then introduce a hierarchical motion
modeling strategy that captures both coarse global transformations and
fine-grained local movements, enabling accurate handling of intricate,
non-rigid motions. Finally, we integrate physically-based opacity estimation to
ensure visually coherent reconstructions, even under challenging occlusions and
perspective shifts. Extensive experiments on challenging datasets reveal that
DynaSplat not only surpasses state-of-the-art alternatives in accuracy and
realism but also provides a more intuitive, compact, and efficient route to
dynamic scene reconstruction.

[94] OctoNav: Towards Generalist Embodied Navigation

Chen Gao,Liankai Jin,Xingyu Peng,Jiazhao Zhang,Yue Deng,Annan Li,He Wang,Si Liu

Main category: cs.CV

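TL;DR: OctoNav works toward generalist navigation agents that follow free-form instructions. It contributes a multimodal benchmark, OctoNav-Bench, and a method, OctoNav-R1, trained with a hybrid training paradigm.

Details Motivation: Existing navigation research is split into separate tasks (such as ObjNav and ImgNav) and lacks generality. The paper aims to build a generalist navigation agent that handles free-form instructions combining multiple modalities and capabilities.

Contribution: 1) OctoNav-Bench, a large-scale multimodal benchmark with free-form instructions and continuous environments; 2) a Hybrid Training Paradigm (HTP) that combines Action-/TBA-SFT, Nav-GPRO, and online reinforcement learning; 3) the TBA-CoT dataset, which strengthens the model’s reasoning ability.

Method: 1) Build the OctoNav-Bench benchmark with diverse instruction-trajectory pairs; 2) build the OctoNav-R1 model on MLLMs and train it with the three-stage HTP; 3) use TBA-SFT and Nav-GPRO to improve reasoning ability.

Result: OctoNav-R1 outperforms previous methods, supporting the feasibility of generalist navigation agents.

Insight: Thinking before acting, in the style of chain-of-thought (CoT) reasoning, improves the model’s reasoning ability and offers a new direction for generalist navigation.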

Abstract: Embodied navigation stands as a foundational pillar within the broader pursuit
of embodied AI. However, previous navigation research is divided into different
tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task
objectives and modalities, so datasets and methods are designed
individually. In this work, we take steps toward generalist navigation agents,
which can follow free-form instructions that include arbitrary combinations of
modalities and capabilities. To achieve this, we propose a large-scale
benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1.
Specifically, OctoNav-Bench features continuous environments and is constructed
via a designed annotation pipeline. We thoroughly craft instruction-trajectory
pairs, where instructions are diverse in free-form with arbitrary modality and
capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within
OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1,
we build it upon MLLMs and adapt it to a VLA-type model, which can produce
low-level actions solely based on 2D visual observations. Moreover, we design a
Hybrid Training Paradigm (HTP) that consists of three stages, i.e.,
Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains
specifically designed learning policies and rewards. Importantly, for the TBA-SFT
and Nav-GRPO designs, we are inspired by OpenAI-o1 and DeepSeek-R1, which
show impressive reasoning ability via thinking-before-answer. Thus, we aim to
investigate how to achieve thinking-before-action in the embodied navigation
field, to improve the model's reasoning ability toward generalists. Specifically,
we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a
cold-start phase and then leverage Nav-GRPO to improve its thinking ability.
Finally, OctoNav-R1 shows superior performance compared with previous methods.
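
Note: The abstract does not spell out the Nav-GRPO stage, but its name and the reference to DeepSeek-R1 suggest a GRPO-style objective, in which each sampled response is scored relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative advantage step only, assuming scalar rewards per rollout; the function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per rollout sampled for the same
    instruction. `eps` (an assumed constant) avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four navigation rollouts: two succeed, two fail.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.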

[95] Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition

Panagiotis Kaliosis,John Pavlopoulos

Main category: cs.CV

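TL;DR: The paper proposes a new loss function that uses the Wasserstein distance to align the character frequency distribution of predicted text with a target distribution, improving the accuracy and robustness of handwritten text recognition. It also shows an inference-time improvement that needs no retraining.

Details Motivation: Handwritten text recognition degrades when character frequency distributions shift across time periods or regions, and existing methods struggle with this kind of distribution shift.

Contribution: 1. A new loss function that aligns character frequency distributions using the Wasserstein distance; 2. A demonstration that guided decoding can improve existing models at inference time.

Method: The Wasserstein distance is used as part of the loss to penalize differences between the predicted and target distributions, and it is combined with a guided decoding strategy at inference.

Result: Experiments confirm the method's effectiveness across multiple datasets and architectures, improving generalization and performance.

Insight: Aligning character frequency distributions is key to more robust handwritten text recognition, and it can improve models without retraining.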

Abstract: Handwritten text recognition aims to convert visual input into
machine-readable text, and it remains challenging due to the evolving and
context-dependent nature of handwriting. Character sets change over time, and
character frequency distributions shift across historical periods or regions,
often causing models trained on broad, heterogeneous corpora to underperform on
specific subsets. To tackle this, we propose a novel loss function that
incorporates the Wasserstein distance between the character frequency
distribution of the predicted text and a target distribution empirically
derived from training data. By penalizing divergence from expected
distributions, our approach enhances both accuracy and robustness under
temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that
character distribution alignment can also improve existing models at inference
time without requiring retraining by integrating it as a scoring function in a
guided decoding scheme. Experimental results across multiple datasets and
architectures confirm the effectiveness of our method in boosting
generalization and performance. We open source our code at
https://github.com/pkaliosis/fada.
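
Note: To make the distance in this abstract concrete, here is a small sketch of a 1D Wasserstein-1 distance between two character frequency histograms. It assumes that characters are placed on a line in a fixed order with unit spacing, which is an illustrative choice and not necessarily the paper's formulation. It also works on hard character counts, whereas a training loss would need differentiable (soft) frequencies from model outputs. All function names are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def char_frequencies(text, alphabet):
    """Relative frequency of each alphabet character in `text`."""
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[ch] / total for ch in alphabet]


def wasserstein_1d(p, q):
    """W1 between two histograms on the same ordered, unit-spaced support.

    In one dimension this equals the summed absolute difference of the
    two cumulative distributions.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


alphabet = "abc"  # assumed ordering of the character set
target = char_frequencies("aabbcc", alphabet)     # uniform over a, b, c
predicted = char_frequencies("aaaabc", alphabet)  # over-predicts 'a'
distance = wasserstein_1d(predicted, target)
```

In a guided decoding setting, a score of this kind could be used to rank candidate transcriptions by how closely their character statistics match the target distribution.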

[96] IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

Florian Bordes,Quentin Garrido,Justine T Kao,Adina Williams,Michael Rabbat,Emmanuel Dupoux

Main category: cs.CV

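TL;DR: IntPhys 2 is a video benchmark for evaluating the intuitive physics understanding of deep learning models. It focuses on four core principles and uses a violation-of-expectation framework to challenge models in complex virtual environments.

Details Motivation: Current deep learning models fall far short of humans in intuitive physics understanding in complex scenes, and benchmarks are needed to drive improvements in model architectures and training methods.

Contribution: Introduces the IntPhys 2 benchmark, which extends the original IntPhys, focuses on four physical principles (Permanence, Immutability, Spatio-Temporal Continuity, and Solidity), and reports evaluations of several state-of-the-art models.

Method: Based on the violation-of-expectation framework, a comprehensive suite of tests evaluates models in complex virtual environments by asking them to distinguish possible from impossible events.

Result: Existing models perform near chance level (50%) on the four principles, in stark contrast to near-perfect human accuracy.

Insight: The study exposes a large gap in current models' intuitive physics understanding and points to directions for future model design and training.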

Abstract: We present IntPhys 2, a video benchmark designed to evaluate the intuitive
physics understanding of deep learning models. Building on the original IntPhys
benchmark, IntPhys 2 focuses on four core principles related to macroscopic
objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity.
These conditions are inspired by research into intuitive physical understanding
emerging during early childhood. IntPhys 2 offers a comprehensive suite of
tests, based on the violation of expectation framework, that challenge models
to differentiate between possible and impossible events within controlled and
diverse virtual environments. Alongside the benchmark, we provide performance
evaluations of several state-of-the-art models. Our findings indicate that
while these models demonstrate basic visual understanding, they face
significant challenges in grasping intuitive physics across the four principles
in complex scenes, with most models performing at chance levels (50%), in stark
contrast to human performance, which achieves near-perfect accuracy. This
underscores the gap between current models and human-like intuitive physics
understanding, highlighting the need for advancements in model architectures
and training methodologies.
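
Note: The 50% chance level mentioned above follows naturally from a pairwise protocol, where a model sees a possible and an impossible version of an event and should be more "surprised" by the impossible one. The sketch below shows one common way to score such pairs. The abstract does not specify the exact scoring used by IntPhys 2, so the surprise scores, the tie handling, and the function name are assumptions.

```python
def pairwise_accuracy(pairs):
    """Fraction of pairs where the impossible clip is rated more surprising.

    `pairs` is a list of (surprise_possible, surprise_impossible) scores.
    Ties count as half credit, so a model that cannot tell the two clips
    apart lands at 0.5, the chance level.
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    credit = 0.0
    for surprise_possible, surprise_impossible in pairs:
        if surprise_impossible > surprise_possible:
            credit += 1.0
        elif surprise_impossible == surprise_possible:
            credit += 0.5
    return credit / len(pairs)


# Hypothetical surprise scores for four matched video pairs.
scores = [(0.2, 0.9), (0.5, 0.4), (0.3, 0.3), (0.1, 0.8)]
accuracy = pairwise_accuracy(scores)
```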

[97] 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

Seonho Lee,Jiho Choi,Inha Kang,Jiwook Kim,Junsung Park,Hyunjung Shim

Main category: cs.CV

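TL;DR: This paper proposes a lightweight geometric distillation method that extracts geometric cues (sparse correspondences, relative depth relations, and dense cost volumes) from off-the-shelf 3D foundation models to strengthen the 3D spatial understanding of pretrained vision-language models (VLMs).

Details Motivation: Existing VLMs perform well on diverse visual and linguistic tasks but remain limited in their understanding of 3D spatial structure.

Contribution: A lightweight, annotation-free fine-tuning framework (Geometric Distillation) that injects geometric cues to improve the 3D spatial understanding of pretrained VLMs while remaining compatible with natural image-text inputs.

Method: Sparse correspondences, relative depth relations, and dense cost volumes are extracted from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT) and distilled into pretrained VLMs without modifying their architecture.

Result: On 3D vision-language reasoning and 3D perception benchmarks, the method significantly outperforms existing approaches at lower computational cost.

Insight: Geometric distillation offers an efficient and scalable path from 2D-trained VLMs to 3D understanding, widening their use in spatially grounded multimodal tasks.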

Abstract: Vision-Language Models (VLMs) have shown remarkable performance on diverse
visual and linguistic tasks, yet they remain fundamentally limited in their
understanding of 3D spatial structures. We propose Geometric Distillation, a
lightweight, annotation-free fine-tuning framework that injects human-inspired
geometric cues into pretrained VLMs without modifying their architecture. By
distilling (1) sparse correspondences, (2) relative depth relations, and (3)
dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R,
VGGT), our method shapes representations to be geometry-aware while remaining
compatible with natural image-text inputs. Through extensive evaluations on 3D
vision-language reasoning and 3D perception benchmarks, our method consistently
outperforms prior approaches, achieving improved 3D spatial reasoning with
significantly lower computational cost. Our work demonstrates a scalable and
efficient path to bridge 2D-trained VLMs with 3D understanding, opening up
wider use in spatially grounded multimodal tasks.

[98] The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge

Haoru Wang,Kai Ye,Yangyan Li,Wenzheng Chen,Baoquan Chen

Main category: cs.CV

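TL;DR: The paper proposes a generalizable novel view synthesis framework that reduces reliance on 3D knowledge. By minimizing 3D inductive bias and pose dependence, it learns implicit 3D awareness directly from sparse 2D images and generates high-quality novel views.

Details Motivation: Current novel view synthesis methods usually rely on strong 3D knowledge (such as explicit 3D representations or known camera poses), which limits their generalization and their ability to exploit data. The paper examines the potential of reducing this reliance and finds that, as data scales, methods relying on less 3D knowledge improve faster.

Contribution: 1. A systematic analysis of the role of 3D knowledge in novel view synthesis; 2. A framework that minimizes 3D inductive bias and pose dependence, learning implicit 3D awareness from sparse 2D images without pose annotations; 3. A demonstration of strong performance with large-scale data.

Method: A data-centric paradigm that avoids explicit 3D representations and pose annotations entirely, learning an implicit 3D representation directly from sparse 2D images and synthesizing novel views without per-scene optimization.

Result: Experiments show that the method generates photorealistic, 3D-consistent novel views without 3D knowledge or pose annotations, with performance comparable even to methods that rely on posed inputs.

Insight: As data scales up, reducing reliance on 3D knowledge allows implicit 3D awareness to be learned more efficiently, giving a more flexible and scalable approach to novel view synthesis.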

Abstract: We consider the problem of generalizable novel view synthesis (NVS), which
aims to generate photorealistic novel views from sparse or even unposed 2D
images without per-scene optimization. This task remains fundamentally
challenging, as it requires inferring 3D structure from incomplete and
ambiguous 2D observations. Early approaches typically rely on strong 3D
knowledge, including architectural 3D inductive biases (e.g., embedding
explicit 3D representations, such as NeRF or 3DGS, into network design) and
ground-truth camera poses for both input and target views. While recent efforts
have sought to reduce the 3D inductive bias or the dependence on known camera
poses of input views, critical questions regarding the role of 3D knowledge and
the necessity of circumventing its use remain under-explored. In this work, we
conduct a systematic analysis of 3D knowledge and uncover a critical trend:
the performance of methods that require less 3D knowledge accelerates more as
data scales, eventually achieving performance on par with their 3D
knowledge-driven counterparts, which highlights the increasing importance of
reducing dependence on 3D knowledge in the era of large-scale data. Motivated
by and following this trend, we propose a novel NVS framework that minimizes 3D
inductive bias and pose dependence for both input and target views. By
eliminating this 3D knowledge, our method fully leverages data scaling and
learns implicit 3D awareness directly from sparse 2D images, without any 3D
inductive bias or pose annotation during training. Extensive experiments
demonstrate that our model generates photorealistic and 3D-consistent novel
views, achieving performance comparable even to methods that rely on posed
inputs, thereby validating the feasibility and effectiveness of our
data-centric paradigm. Project page:
https://pku-vcl-geometry.github.io/Less3Depend/ .

[99] EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks

Athinoulla Konstantinou,Georgios Leontidis,Mamatha Thota,Aiden Durrant

Main category: cs.CV

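TL;DR: EquiCaps is a predictor-free capsule network approach that exploits the inherent pose-awareness of capsules for pose-aware self-supervised learning, improving pose estimation without a dedicated predictor.

Details Motivation: The study explores self-supervised representation learning that achieves equivariance without a dedicated predictor, and tests the inherent advantages of capsule networks for pose-aware representations.

Contribution: Proposes EquiCaps, a capsule-based pose-aware self-supervised method; introduces the 3DIEBench-T dataset, which raises task complexity to test the method; shows experimentally that EquiCaps outperforms existing methods on rotation prediction.

Method: Uses the inherent pose-awareness of capsule networks to design a predictor-free equivariance learning method, and evaluates invariance and equivariance under multi-geometric transformations.

Result: On the 3DIEBench rotation prediction benchmark, EquiCaps reaches an $R^2$ of 0.78, outperforming SIE and CapsIE; it maintains robust equivariant performance under combined geometric transformations.

Insight: Capsule networks are inherently pose-aware, which removes the need for a dedicated predictor; the predictor-free design generalizes better on complex tasks.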

Abstract: Learning self-supervised representations that are invariant and equivariant
to transformations is crucial for advancing beyond traditional visual
classification tasks. However, many methods rely on predictor architectures to
encode equivariance, despite evidence that architectural choices, such as
capsule networks, inherently excel at learning interpretable pose-aware
representations. To explore this, we introduce EquiCaps (Equivariant Capsule
Network), a capsule-based approach to pose-aware self-supervision that
eliminates the need for a specialised predictor for enforcing equivariance.
Instead, we leverage the intrinsic pose-awareness capabilities of capsules to
improve performance in pose estimation tasks. To further challenge our
assumptions, we increase task complexity via multi-geometric transformations to
enable a more thorough evaluation of invariance and equivariance by introducing
3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical
results demonstrate that EquiCaps outperforms prior state-of-the-art
equivariant methods on rotation prediction, achieving a supervised-level $R^2$
of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE
and CapsIE by 0.05 and 0.04 $R^2$, respectively. Moreover, in contrast to
non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant
performance under combined geometric transformations, underscoring its
generalisation capabilities and the promise of predictor-free capsule
architectures.
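
Note: The $R^2$ figures quoted above are coefficients of determination, which compare a predictor's squared error with that of always predicting the mean target. The sketch below computes it for scalar targets as a reminder of how to read a value such as 0.78. Rotation targets in the benchmark are multi-dimensional, and the averaging over dimensions is not described in the abstract, so this scalar version and its names are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions for one rotation component.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.0, 2.0, 2.0]
score = r_squared(y_true, y_pred)
```

A score of 1 means perfect prediction, 0 means no better than predicting the mean, and negative values mean worse than the mean.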

[100] Only-Style: Stylistic Consistency in Image Generation without Content Leakage

Tilemachos Aravanis,Panagiotis Filntisis,Petros Maragos,George Retsinas

Main category: cs.CV

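TL;DR: This paper proposes Only-Style, a method that addresses style consistency and content leakage in image generation. By localizing leakage and adaptively tuning the style-alignment parameter, it separates semantic content from stylistic elements, and it shows significant improvements under a new evaluation framework.

Details Motivation: Existing style-consistent image generation methods struggle to separate semantic content from stylistic elements, which causes content leakage. The paper aims for a method that preserves stylistic consistency while avoiding content leakage.

Contribution: 1. The Only-Style method, which eliminates content leakage by localizing it and adaptively tuning a parameter; 2. A new evaluation framework that quantifies the success of style-consistent generation; 3. Significant improvements demonstrated across diverse instances.

Method: The core is to localize content leakage and, during inference, adaptively tune the parameter that controls style alignment. The tuning targets the local regions of the reference image that contain the subject, balancing stylistic consistency against leakage elimination.

Result: Experiments show that Only-Style significantly outperforms existing methods in stylistic consistency and content-leakage elimination, with robust generation results.

Insight: Dynamically tuning the style-alignment parameter avoids the semantic content leakage caused by style alignment while keeping the visual style consistent.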

Abstract: Generating images in a consistent reference visual style remains a
challenging computer vision task. State-of-the-art methods aiming for
style-consistent generation struggle to effectively separate semantic content
from stylistic elements, leading to content leakage from the image provided as
a reference to the targets. To address this challenge, we propose Only-Style: a
method designed to mitigate content leakage in a semantically coherent manner
while preserving stylistic consistency. Only-Style works by localizing content
leakage during inference, allowing the adaptive tuning of a parameter that
controls the style alignment process, specifically within the image patches
containing the subject in the reference image. This adaptive process best
balances stylistic consistency with leakage elimination. Moreover, the
localization of content leakage can function as a standalone component, given a
reference-target image pair, allowing the adaptive tuning of any
method-specific parameter that provides control over the impact of the
stylistic reference. In addition, we propose a novel evaluation framework to
quantify the success of style-consistent generations in avoiding undesired
content leakage. Our approach demonstrates a significant improvement over
state-of-the-art methods through extensive evaluation across diverse instances,
consistently achieving robust stylistic consistency without undesired content
leakage.

[101] MetricHMR: Metric Human Mesh Recovery from Monocular Images

He Zhang,Chentao Song,Hongwen Zhang,Tao Yu

Main category: cs.CV

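TL;DR: MetricHMR proposes a new approach that recovers human mesh and global translation from monocular images, addressing the scale and depth ambiguity of existing methods.

Details Motivation: Existing human mesh recovery (HMR) methods suffer from severe scale and depth ambiguity, producing inaccurate global translation and shape that fall short of practical needs.

Contribution: 1. A systematic analysis of how previous HMR methods use camera models, emphasizing the critical role of the standard perspective projection model for metric-scale HMR; 2. A ray map approach based on standard perspective projection that enables end-to-end metric-scale HMR.

Method: Three parts: 1. Analyze the camera models used by HMR methods; 2. Validate the acceptable ambiguity range of metric HMR under standard perspective projection; 3. Propose a ray-map-based approach that jointly encodes bounding-box information, camera parameters, and geometric cues, with no additional metric-regularization modules.

Result: In both indoor and in-the-wild scenarios, MetricHMR achieves state-of-the-art performance in metric pose, shape, and global translation estimation.

Insight: The standard perspective projection model is key to metric-scale HMR, and the ray map effectively combines multiple sources of information to improve reconstruction accuracy.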

Abstract: We introduce MetricHMR (Metric Human Mesh Recovery), an approach for metric
human mesh recovery with accurate global translation from monocular images. In
contrast to existing HMR methods that suffer from severe scale and depth
ambiguity, MetricHMR is able to produce geometrically reasonable body shape and
global translation in the reconstruction results. To this end, we first
systematically analyze previous HMR methods on camera models to emphasize the
critical role of the standard perspective projection model in enabling
metric-scale HMR. We then validate the acceptable ambiguity range of metric HMR
under the standard perspective projection model. Finally, we contribute a novel
approach that introduces a ray map based on the standard perspective projection
to jointly encode bounding-box information, camera parameters, and geometric
cues for End2End metric HMR without any additional metric-regularization
modules. Extensive experiments demonstrate that our method achieves
state-of-the-art performance, even compared with sequential HMR methods, in
metric pose, shape, and global translation estimation across both indoor and
in-the-wild scenarios.

[102] Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering

Jianhan Qi,Yuheng Jia,Hui Liu,Junhui Hou

Main category: cs.CV

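TL;DR: Proposes a hyperspectral image clustering method based on structural-spectral graph convolution and evidential edge learning, integrated into a contrastive learning framework, with notable gains in clustering accuracy.

Details Motivation: Hyperspectral image clustering is limited by the absence of annotations, by existing graph neural networks that do not fully exploit spectral information, and by inaccurate superpixel topological graphs.

Contribution: 1) A structural-spectral graph convolutional operator (SSGCO) that jointly extracts spatial and spectral features; 2) An evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel graph.

Method: SSGCO and EGAEL are integrated into a contrastive learning framework so that representation learning and clustering are optimized simultaneously.

Result: On four datasets, clustering accuracy improves by 2.61%, 6.06%, 4.96%, and 3.15% over the best compared methods.

Insight: Joint spatial-spectral features and dynamic edge refinement are key to better hyperspectral image clustering.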

Abstract: Hyperspectral image (HSI) clustering assigns similar pixels to the same class
without any annotations, which is an important yet challenging task. For
large-scale HSIs, most methods rely on superpixel segmentation and perform
superpixel-level clustering based on graph neural networks (GNNs). However,
existing GNNs cannot fully exploit the spectral information of the input HSI,
and the inaccurate superpixel topological graph may lead to the confusion of
different class semantics during information aggregation. To address these
challenges, we first propose a structural-spectral graph convolutional operator
(SSGCO) tailored for graph-structured HSI superpixels to improve their
representation quality through the co-extraction of spatial and spectral
features. Second, we propose an evidence-guided adaptive edge learning (EGAEL)
module that adaptively predicts and refines edge weights in the superpixel
topological graph. We integrate the proposed method into a contrastive learning
framework to achieve clustering, where representation learning and clustering
are simultaneously conducted. Experiments demonstrate that the proposed method
improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best
compared methods on four HSI datasets. Our code is available at
https://github.com/jhqi/SSGCO-EGAEL.

[103] Outside Knowledge Conversational Video (OKCV) Dataset – Dialoguing over Videos

Benjamin Reichman,Constantin Patsch,Jack Truxal,Atishay Jain,Larry Heck

Main category: cs.CV

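TL;DR: The paper presents OKCV, a video-based multi-turn dialogue dataset that requires models to combine video content with outside knowledge to answer questions, and reports the task's challenges and baseline results.

Details Motivation: Extending OK-VQA to video dialogue requires models not only to recognize visual information in videos but also to bring in external knowledge during conversation. The dataset supports research on this setting.

Contribution: The OKCV dataset, with 2,017 videos and 5,986 human-annotated dialogues, supporting research on multi-turn dialogue combined with external knowledge.

Method: Dataset construction: video segments and dialogues are annotated so that answers depend on external knowledge; baseline models are evaluated.

Result: Baseline performance on the dataset reveals the task's challenges, such as combining video content with external knowledge.

Insight: Video dialogue requires handling temporal visual information and external knowledge together, which opens a new direction for multimodal dialogue systems.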

Abstract: In outside knowledge visual question answering (OK-VQA), the model must
identify relevant visual information within an image and incorporate external
knowledge to accurately respond to a question. Extending this task to a
visually grounded dialogue setting based on videos, a conversational model must
both recognize pertinent visual details over time and answer questions where
the required information is not necessarily present in the visual information.
Moreover, the context of the overall conversation must be considered for the
subsequent dialogue. To explore this task, we introduce a dataset comprised of
2,017 videos with 5,986 human-annotated dialogues consisting of 40,954
interleaved dialogue turns. While the dialogue context is visually grounded in
specific video segments, the questions further require external knowledge that
is not visually present. Thus, the model not only has to identify relevant
video parts but also leverage external knowledge to converse within the
dialogue. We further provide several baselines evaluated on our dataset and
show future challenges associated with this task. The dataset is made publicly
available here: https://github.com/c-patsch/OKCV.

[104] LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation

Jiangyong Huang,Xiaojian Ma,Xiongkun Linghu,Yue Fan,Junchao He,Wenxin Tan,Qing Li,Song-Chun Zhu,Yixin Chen,Baoxiong Jia,Siyuan Huang

Main category: cs.CV

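TL;DR: The paper proposes LEO-VL, which uses an efficient condensed feature grid (CFG) representation and large-scale 3D vision-language training data to overcome the data scalability obstacle for 3D-VL generalists, achieving state-of-the-art performance on multiple benchmarks.

Details Motivation: Current 3D-VL models lag behind 2D models in capability and robustness, mainly because of limited data scalability and high token overhead. The goal is a generalist model that understands 3D scenes and performs many tasks.

Contribution: 1. The LEO-VL model and the efficient CFG representation; 2. A curated collection of over 700k high-quality 3D-VL data samples; 3. The SceneDPO training objective for improved robustness; 4. SOTA performance on multiple benchmarks.

Method: 1. Represent 3D scenes with a CFG (Condensed Feature Grid) that bridges 2D perception and 3D spatial structure; 2. Train on data that is diverse in tasks and scenes; 3. Apply SceneDPO as a post-training objective.

Result: SOTA performance on benchmarks including SQA3D, MSQA, and Beacon3D, confirming the efficiency of CFG and the value of data diversity.

Insight: Efficient scene representation and diverse data are key to 3D-VL generalists; SceneDPO effectively improves model robustness.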

Abstract: Developing 3D-VL generalists capable of understanding 3D scenes and following
natural language instructions to perform a wide range of tasks has been a
long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL
models still lag behind their 2D counterparts in capability and robustness,
falling short of the generalist standard. A key obstacle to developing 3D-VL
generalists lies in data scalability, hindered by the lack of an efficient
scene representation. We propose LEO-VL, a 3D-VL model built upon condensed
feature grid (CFG), an efficient scene representation that bridges 2D
perception and 3D spatial structure while significantly reducing token
overhead. This efficiency unlocks large-scale training toward 3D-VL
generalists, for which we curate over 700k high-quality 3D-VL data samples spanning four
domains of real-world indoor scenes and five tasks such as captioning and
dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA
benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the
efficiency of our representation, the importance of task and scene diversity,
and the validity of our data curation principle. Furthermore, we introduce
SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL
models. We hope our findings contribute to the advancement of scalable and
robust 3D-VL generalists.

[105] CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models

Aaron Foss,Chloe Evans,Sasha Mitts,Koustuv Sinha,Ammar Rizvi,Justine T. Kao

Main category: cs.CV

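TL;DR: CausalVQA is a causal reasoning benchmark dataset built on real-world videos that tests models' understanding of causality in the physical world through five question types. Frontier multimodal models fall substantially below human performance on it, especially on anticipation and hypothetical questions.

Details Motivation: Existing VQA benchmarks either focus on surface perceptual understanding or are confined to narrow physical reasoning questions in simulated environments. CausalVQA fills this gap with more challenging real-world causal reasoning questions that test models' ability to predict the outcomes of actions and events.

Contribution: The CausalVQA benchmark dataset, with five question types (counterfactual, hypothetical, anticipation, planning, and descriptive) and quality control mechanisms that force models to rely on deep visual understanding rather than linguistic cues.

Method: Question-answer pairs are designed around real-world videos with a focus on causal reasoning, and rigorous quality control prevents models from exploiting trivial shortcuts.

Result: Experiments show that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions.

Insight: Current systems remain weak in spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives; further research is needed to improve prediction in real-world settings.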

Abstract: We introduce CausalVQA, a benchmark dataset for video question answering
(VQA) composed of question-answer pairs that probe models’ understanding of
causality in the physical world. Existing VQA benchmarks either tend to focus
on surface perceptual understanding of real-world videos, or on narrow physical
reasoning questions created using simulation environments. CausalVQA fills an
important gap by presenting challenging questions that are grounded in
real-world scenarios, while focusing on models’ ability to predict the likely
outcomes of different actions and events through five question types:
counterfactual, hypothetical, anticipation, planning and descriptive. We
designed quality control mechanisms that prevent models from exploiting trivial
shortcuts, requiring models to base their answers on deep visual understanding
instead of linguistic cues. We find that current frontier multimodal models
fall substantially below human performance on the benchmark, especially on
anticipation and hypothetical questions. This highlights a challenge for
current systems to leverage spatial-temporal reasoning, understanding of
physical principles, and comprehension of possible alternatives to make
accurate predictions in real-world settings.

[106] UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting

Ziyi Wang,Yanran Zhang,Jie Zhou,Jiwen Lu

Main category: cs.CV

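TL;DR: UniPre3D proposes a unified pre-training method for 3D point clouds that uses cross-modal Gaussian splatting to learn effectively from both object- and scene-level point clouds.

Details Motivation: Existing 3D point cloud pre-training methods do not apply uniformly across scales and architectures, so no method is equally effective on object- and scene-level tasks. UniPre3D aims to address this.

Contribution: UniPre3D, presented as the first unified 3D point cloud pre-training method, applicable to point clouds of any scale and models of any architecture, with end-to-end optimization through Gaussian splatting rendering.

Method: Predicting Gaussian primitives serves as the pre-training task, images are rendered with differentiable Gaussian splatting, and features from pre-trained 2D image models are integrated to strengthen the learning of geometric structure.

Result: Experiments across a variety of object- and scene-level tasks confirm the method's generality and effectiveness.

Insight: Bringing in cross-modal information (such as 2D features) helps 3D geometric structure learning, and Gaussian splatting provides an efficient form of pixel-level supervision.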

Abstract: The scale diversity of point cloud data presents significant challenges in
developing unified representation learning techniques for 3D vision. Currently,
there are few unified 3D models, and no existing pre-training method is equally
effective for both object- and scene-level point clouds. In this paper, we
introduce UniPre3D, the first unified pre-training method that can be
seamlessly applied to point clouds of any scale and 3D models of any
architecture. Our approach predicts Gaussian primitives as the pre-training
task and employs differentiable Gaussian splatting to render images, enabling
precise pixel-level supervision and end-to-end optimization. To further
regulate the complexity of the pre-training task and direct the model’s focus
toward geometric structures, we integrate 2D features from pre-trained image
models to incorporate well-established texture knowledge. We validate the
universal effectiveness of our proposed method through extensive experiments
across a variety of object- and scene-level tasks, using diverse point cloud
models as backbones. Code is available at https://github.com/wangzy22/UniPre3D.

[107] Vision Generalist Model: A Survey

Ziyi Wang,Yongming Rao,Shuofeng Sun,Xinrun Liu,Yi Wei,Xumin Yu,Zuyan Liu,Yanbo Wang,Hongmin Liu,Jie Zhou,Jiwen Lu

Main category: cs.CV

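TL;DR: This paper is a comprehensive survey of vision generalist models, covering their background, framework design, and performance-enhancing techniques, along with application scenarios and future research directions.

Details Motivation: Inspired by the success of generalist models in natural language processing, researchers are applying them to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and finding a unified representation is a major challenge.

Contribution: The paper systematically summarizes the background, framework design, and performance-enhancing techniques of vision generalist models, and discusses connections to related domains as well as potential applications and challenges.

Method: By surveying existing research, the paper reviews the datasets, tasks, and benchmarks for vision generalist models, along with their framework designs and techniques.

Result: The paper illustrates the potential of vision generalist models and their prospects across diverse tasks.

Insight: Advancing vision generalist models requires addressing the diversity of inputs and outputs; future work could combine multimodal techniques and task-adaptive methods to further improve performance.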

Abstract: Recently, we have witnessed the great success of the generalist model in
natural language processing. The generalist model is a general framework
trained with massive data and is able to process various downstream tasks
simultaneously. Encouraged by their impressive performance, an increasing
number of researchers are venturing into the realm of applying these models to
computer vision tasks. However, the inputs and outputs of vision tasks are more
diverse, and it is difficult to summarize them as a unified representation. In
this paper, we provide a comprehensive overview of the vision generalist
models, delving into their characteristics and capabilities within the field.
First, we review the background, including the datasets, tasks, and benchmarks.
Then, we dig into the design of frameworks that have been proposed in existing
research, while also introducing the techniques employed to enhance their
performance. To better help the researchers comprehend the area, we take a
brief excursion into related domains, shedding light on their interconnections
and potential synergies. To conclude, we provide some real-world application
scenarios, undertake a thorough examination of the persistent challenges, and
offer insights into possible directions for future research endeavors.

[108] Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy

Sushant Gautam,Michael A. Riegler,Pål Halvorsen

Main category: cs.CV

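TL;DR: Introduces Kvasir-VQA-x1, a large-scale gastrointestinal endoscopy dataset for medical visual question answering (MedVQA). It extends the original dataset with 159,549 new question-answer pairs and adds visual augmentations that mimic real clinical conditions.

Details Motivation: Existing MedVQA datasets lack clinical complexity and visual diversity, which limits the development of clinical decision support systems. Kvasir-VQA-x1 aims to fill this gap.

Contribution: 1. The Kvasir-VQA-x1 dataset, which expands the scale and complexity of the original; 2. A question generation method stratified by complexity; 3. Visual perturbations aimed at model robustness.

Method: Large language models generate question-answer pairs stratified by complexity, and visual augmentations mimic common clinical imaging artifacts.

Result: The new dataset supports both standard VQA evaluation and robustness testing, providing a more challenging and clinically relevant benchmark.

Insight: With stratified questions and visual perturbations, Kvasir-VQA-x1 allows a fuller assessment of models under realistic clinical conditions, supporting the development of more reliable multimodal AI systems.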

Abstract: Medical Visual Question Answering (MedVQA) is a promising field for
developing clinical decision support systems, yet progress is often limited by
the available datasets, which can lack clinical complexity and visual
diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new,
large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly
expands upon the original Kvasir-VQA by incorporating 159,549 new
question-answer pairs that are designed to test deeper clinical reasoning. We
developed a systematic method using large language models to generate these
questions, which are stratified by complexity to better assess a model’s
inference capabilities. To ensure our dataset prepares models for real-world
clinical scenarios, we have also introduced a variety of visual augmentations
that mimic common imaging artifacts. The dataset is structured to support two
main evaluation tracks: one for standard VQA performance and another to test
model robustness against these visual perturbations. By providing a more
challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate
the development of more reliable and effective multimodal AI systems for use in
clinical settings. The dataset is fully accessible and adheres to FAIR data
principles, making it a valuable resource for the wider research community.
Code and data: https://github.com/Simula/Kvasir-VQA-x1 and
https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1

[109] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Junfei Wu,Jian Guan,Kaituo Feng,Qiang Liu,Shu Wu,Liang Wang,Wei Wu,Tieniu Tan

Main category: cs.CV

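TL;DR: The paper proposes a new method that strengthens the spatial reasoning of vision-language models through visual drawing, trained with a three-stage framework. Experiments show significant gains across multiple spatial reasoning tasks.

Details Motivation: Existing vision-language models underperform on spatial reasoning tasks because they rely mainly on text-only reasoning, which cannot handle geometric and spatial relations precisely. The paper proposes drawing operations to give the model a direct means of spatial expression.

Contribution: 1) A new paradigm of reasoning through visual drawing; 2) A three-stage training framework (cold-start training, reflective rejection sampling, reinforcement learning); 3) An average improvement of 18.4% across multiple spatial reasoning benchmarks.

Method: Three stages: 1) cold-start training on synthetic data to establish basic drawing abilities; 2) reflective rejection sampling to enhance self-reflection behaviors; 3) reinforcement learning to directly optimize target rewards. The model expresses and reasons in the visual space through operations such as drawing bounding boxes and auxiliary lines.

Result: Experiments show that the proposed model, VILASR, consistently outperforms existing methods on maze navigation, static spatial reasoning, video-based reasoning, and multi-view reasoning, with an average improvement of 18.4%.

Insight: Expressing spatial relations directly through drawing operations is closer to human thinking than text-only reasoning, and so overcomes the performance bottleneck of existing methods. The three-stage framework also offers a scalable training recipe for spatial reasoning in vision-language models.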

Abstract: As textual reasoning with large language models (LLMs) has advanced
significantly, there has been growing interest in enhancing the multimodal
reasoning capabilities of large vision-language models (LVLMs). However,
existing methods primarily approach multimodal reasoning in a straightforward,
text-centric manner, where both reasoning and answer derivation are conducted
purely through text, with the only difference being the presence of multimodal
input. As a result, these methods often encounter fundamental limitations in
spatial reasoning tasks that demand precise geometric understanding and
continuous spatial tracking, capabilities that humans achieve through mental
visualization and manipulation. To address the limitations, we propose drawing
to reason in space, a novel paradigm that enables LVLMs to reason through
elementary drawing operations in the visual space. By equipping models with
basic drawing operations, including annotating bounding boxes and drawing
auxiliary lines, we empower them to express and analyze spatial relationships
through direct visual manipulation, while avoiding the performance ceiling
imposed by specialized perception tools in previous tool-integrated reasoning
approaches. To cultivate this capability, we develop a three-stage training
framework: cold-start training with synthetic data to establish basic drawing
abilities, reflective rejection sampling to enhance self-reflection behaviors,
and reinforcement learning to directly optimize for target rewards. Extensive
experiments demonstrate that our model, named VILASR, consistently outperforms
existing methods across diverse spatial reasoning benchmarks, involving maze
navigation, static spatial reasoning, video-based reasoning, and
multi-view-based reasoning tasks, with an average improvement of 18.4%.

[110] Efficient Part-level 3D Object Generation via Dual Volume Packing

Jiaxiang Tang,Ruijie Lu,Zhaoshuo Li,Zekun Hao,Xuan Li,Fangyin Wei,Shuran Song,Gang Zeng,Ming-Yu Liu,Tsung-Yi Lin

Main category: cs.CV

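TL;DR: The paper proposes an end-to-end framework built on a dual volume packing strategy that generates high-quality, part-level 3D objects from a single image, with an arbitrary number of semantically meaningful, individually editable parts.

Details Motivation: Existing 3D object generation methods usually produce a single fused mesh, which limits part-level editing. Objects also vary widely in their number of parts, and there has been no flexible way to handle this.

Contribution: A dual volume packing strategy that organizes all parts into two complementary volumes, enabling high-quality, semantically meaningful part-level 3D object generation.

Method: Parts are distributed across two complementary volumes so that each part stays complete, and the parts are then assembled into the final object. The framework is end-to-end and takes a single input image.

Result: Experiments show better quality, diversity, and generalization than previous image-based part-level generation methods.

Insight: Dual volume packing handles a variable number of parts while preserving the semantic completeness and editability of each part.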

Abstract: Recent progress in 3D object generation has greatly improved both the quality
and efficiency. However, most existing methods generate a single mesh with all
parts fused together, which limits the ability to edit or manipulate individual
parts. A key challenge is that different objects may have a varying number of
parts. To address this, we propose a new end-to-end framework for part-level 3D
object generation. Given a single input image, our method generates
high-quality 3D objects with an arbitrary number of complete and semantically
meaningful parts. We introduce a dual volume packing strategy that organizes
all parts into two complementary volumes, allowing for the creation of complete
and interleaved parts that assemble into the final object. Experiments show
that our model achieves better quality, diversity, and generalization than
previous image-based part-level generation methods.

[111] ReSim: Reliable World Simulation for Autonomous Driving

Jiazhi Yang,Kashyap Chitta,Shenyuan Gao,Long Chen,Yuqian Shao,Xiaosong Jia,Hongyang Li,Andreas Geiger,Xiangyu Yue,Li Chen

Main category: cs.CV

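TL;DR: ReSim is a reliable world simulation method that combines real-world and simulated driving data. It uses a diffusion transformer architecture to improve the diversity and controllability of simulated driving scenarios, and introduces a Video2Reward module to estimate rewards for actions. Results show clear gains in both visual fidelity and downstream task performance.

Details Motivation: Current driving world models rely mainly on expert driving data and struggle to simulate hazardous or non-expert behaviors, which limits their use in tasks such as policy evaluation.

Contribution: 1. A world model trained on a heterogeneous mix of real and simulated data; 2. Strategies for better integrating conditioning signals and improving prediction controllability; 3. A Video2Reward module that estimates rewards from simulated videos.

Method: A video generator with a diffusion transformer architecture is trained on real-world data together with non-expert driving data from a simulator; conditioning signals are integrated to improve controllability and fidelity; a Video2Reward module estimates action rewards.

Result: ReSim achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and improves planning and policy selection on NAVSIM by 2% and 25%, respectively.

Insight: Combining heterogeneous data with a diffusion transformer architecture can substantially improve the diversity and reliability of driving scenario simulation, which suits the evaluation of complex driving behaviors.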

Abstract: How can we reliably simulate future driving scenarios under a wide range of
ego driving behaviors? Recent driving world models, developed exclusively on
real-world driving data composed mainly of safe expert trajectories, struggle
to follow hazardous or non-expert behaviors, which are rare in such data. This
limitation restricts their applicability to tasks such as policy evaluation. In
this work, we address this challenge by enriching real-world human
demonstrations with diverse non-expert data collected from a driving simulator
(e.g., CARLA), and building a controllable world model trained on this
heterogeneous corpus. Starting with a video generator featuring a diffusion
transformer architecture, we devise several strategies to effectively integrate
conditioning signals and improve prediction controllability and fidelity. The
resulting model, ReSim, enables Reliable Simulation of diverse open-world
driving scenarios under various actions, including hazardous non-expert ones.
To close the gap between high-fidelity simulation and applications that require
reward signals to judge different actions, we introduce a Video2Reward module
that estimates a reward from ReSim’s simulated future. Our ReSim paradigm
achieves up to 44% higher visual fidelity, improves controllability for both
expert and non-expert actions by over 50%, and boosts planning and policy
selection performance on NAVSIM by 2% and 25%, respectively.

[112] InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

Zhenzhi Wang,Jiaqi Yang,Jianwen Jiang,Chao Liang,Gaojie Lin,Zerong Zheng,Ceyuan Yang,Dahua Lin

Main category: cs.CV

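TL;DR: The paper proposes InterActHuman, a framework for highly controllable human animation in multi-concept scenes. Explicit layout control and region-specific matching of multimodal conditions address the inability of existing methods to precisely control interactions among multiple entities.

Details Motivation: Existing end-to-end human animation methods typically animate a single subject and inject multimodal conditions globally, so they cannot handle scenes with multiple interacting entities, which limits practical applications.

Contribution: A framework with layout-aligned conditioning for multiple concepts, which binds modality conditions to specific regions through explicit layout control and so enables highly controllable multi-entity interaction animation.

Method: 1) A mask predictor automatically infers layout information for each concept; 2) local audio conditions are injected iteratively into their corresponding regions so that modalities stay aligned with the layout; 3) condition binding relies on matching appearance cues between the denoised video and each reference appearance.

Result: Experiments and ablation studies show stronger layout control and generation quality under multimodal conditions than implicit counterparts and other existing methods.

Insight: Explicit layout control matters for multi-concept interaction scenes, and the framework offers an effective approach to multi-entity animation generation.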

Abstract: End-to-end human animation with rich multi-modal conditions, e.g., text,
image, and audio, has achieved remarkable advancements in recent years. However,
most existing methods can only animate a single subject and inject conditions
in a global manner, ignoring scenarios in which multiple concepts appear in
the same video with rich human-human interactions and human-object
interactions. Such a global assumption prevents precise, per-identity control
of multiple concepts including humans and objects, and therefore hinders
applications. In this work, we discard the single-entity assumption and
introduce a novel framework that enforces strong, region-specific binding of
conditions from modalities to each identity’s spatiotemporal footprint. Given
reference images of multiple concepts, our method could automatically infer
layout information by leveraging a mask predictor to match appearance cues
between the denoised video and each reference appearance. Furthermore, we
inject each local audio condition into its corresponding region in an iterative
manner to ensure layout-aligned modality matching. This design enables the
high-quality generation of controllable multi-concept human-centric videos.
Empirical results and ablation studies validate the effectiveness of our
explicit layout control for multi-modal conditions compared to implicit
counterparts and other existing methods.

[113] A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

Benno Krojer,Mojtaba Komeili,Candace Ross,Quentin Garrido,Koustuv Sinha,Nicolas Ballas,Mahmoud Assran

Main category: cs.CV

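TL;DR: The paper introduces MVP (Minimal Video Pairs), a benchmark for evaluating the physical understanding of video language models. It uses minimal-change pairs to keep models from relying on shortcut solutions based on superficial visual or textual cues.

Details Motivation: Existing video QA benchmarks can be gamed through shortcuts based on superficial visual or textual cues, which inflates scores and makes evaluation inaccurate.

Contribution: The MVP benchmark of 55K high-quality multiple-choice video QA examples, each with a minimal-change pair (a visually similar video with an opposing answer), which effectively suppresses shortcut solutions.

Method: MVP is built from nine video data sources, covering egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. A question counts as answered correctly only if the model is right on both examples in the pair, which forces genuine understanding of the physical world.

Result: Human performance on MVP is 92.9%, the best open-source model reaches 40.2%, and random performance is 25%, which shows how challenging the benchmark is.

Insight: With minimal-change pairs, models that rely solely on shortcuts score below random performance, so MVP gives a more accurate measure of physical understanding.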

Abstract: Existing benchmarks for assessing the spatio-temporal understanding and
reasoning abilities of video language models are susceptible to score inflation
due to the presence of shortcut solutions based on superficial visual or
textual cues. This paper mitigates the challenges in accurately assessing model
performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple
shortcut-aware video QA benchmark for assessing the physical understanding of
video language models. The benchmark is comprised of 55K high-quality
multiple-choice video QA examples focusing on physical world understanding.
Examples are curated from nine video data sources, spanning first-person
egocentric and exocentric videos, robotic interaction data, and cognitive
science intuitive physics benchmarks. To mitigate shortcut solutions that rely
on superficial visual or textual cues and biases, each sample in MVP has a
minimal-change pair – a visually similar video accompanied by an identical
question but an opposing answer. To answer a question correctly, a model must
provide correct answers for both examples in the minimal-change pair; as such,
models that solely rely on visual or textual biases would achieve below random
performance. Human performance on MVP is 92.9%, while the best open-source
state-of-the-art video-language model achieves 40.2% compared to random
performance at 25%.
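
The pair-level scoring rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official evaluation code: the function name `paired_accuracy`, the `(pair_id, is_correct)` input format, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict


def paired_accuracy(results):
    """Score a list of (pair_id, is_correct) tuples at the pair level.

    Hypothetical input format: every example carries the id of its
    minimal-change pair and a flag saying whether the model was right.
    """
    by_pair = defaultdict(list)
    for pair_id, is_correct in results:
        by_pair[pair_id].append(bool(is_correct))
    if not by_pair:
        return 0.0
    # A pair is credited only when every example in it is answered correctly.
    solved = sum(1 for answers in by_pair.values() if all(answers))
    return solved / len(by_pair)


# Toy data: the two videos in each pair share a question but have opposing answers.
gold = {"p1": ("A", "B"), "p2": ("B", "A"), "p3": ("A", "B")}

# A shortcut "model" that ignores the video and always answers "A".
constant_results = [
    (pair_id, answer == "A")
    for pair_id, answers in gold.items()
    for answer in answers
]
constant_score = paired_accuracy(constant_results)
print(constant_score)
```

In this toy setup the constant-answer model gets half of the individual examples right but solves no pair, which is the effect the abstract describes: a purely bias-driven model falls below random performance under pair-level scoring.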

[114] EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits

Ron Yosef,Moran Yanuka,Yonatan Bitton,Dani Lischinski

Main category: cs.CV

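TL;DR: The paper introduces EditInspector, a human-annotated benchmark for evaluating text-guided image edits. It is used to test how well existing models verify edits, and the authors propose two new methods that outperform current state-of-the-art models at artifact detection and difference caption generation.

Details Motivation: Rapid progress in generative AI has made text-guided image editing widespread, but there is no systematic way to verify the quality and accuracy of such edits.

Contribution: 1. EditInspector, a novel benchmark for text-guided image edits based on human annotations; 2. An evaluation of existing models on assessing edits across multiple dimensions; 3. Two new methods that surpass SoTA models on artifact detection and difference captioning.

Method: 1. Build the benchmark from human annotations collected with an extensive template; 2. Evaluate existing models across multiple dimensions (accuracy, artifacts, visual quality, and others); 3. Propose new methods to address the shortcomings of existing models.

Result: Current models struggle to evaluate edits comprehensively and frequently hallucinate when describing changes; the proposed methods outperform SoTA models on artifact detection and difference caption generation.

Insight: 1. Evaluating text-guided edits calls for a more systematic approach; 2. Current models still need to improve on complex evaluation tasks; 3. Human-annotated data is important for evaluating generative tasks.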

Abstract: Text-guided image editing, fueled by recent advancements in generative AI, is
becoming increasingly widespread. This trend highlights the need for a
comprehensive framework to verify text-guided edits and assess their quality.
To address this need, we introduce EditInspector, a novel benchmark for
evaluation of text-guided image edits, based on human annotations collected
using an extensive template for edit verification. We leverage EditInspector to
evaluate the performance of state-of-the-art (SoTA) vision and language models
in assessing edits across various dimensions, including accuracy, artifact
detection, visual quality, seamless integration with the image scene, adherence
to common sense, and the ability to describe edit-induced changes. Our findings
indicate that current models struggle to evaluate edits comprehensively and
frequently hallucinate when describing the changes. To address these
challenges, we propose two novel methods that outperform SoTA models in both
artifact detection and difference caption generation.

[115] Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes

Yiming Dou,Wonseok Oh,Yuqing Luo,Antonio Loquercio,Andrew Owens

Main category: cs.CV

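TL;DR: The paper studies how to make 3D scene reconstructions interactive and proposes a method that predicts the sounds of physical interactions from 3D hand trajectories. Experiments show that the generated sounds accurately convey material properties and actions.

Details Motivation: The goal is to make 3D scenes more interactive, so that they can be not only visualized but can also give audio feedback about physical interactions, such as materials and actions.

Contribution: A method based on a rectified flow model that maps 3D hand trajectories to their corresponding audio.

Method: A rectified flow model is trained on action-sound pairs to predict sound from 3D hand trajectories; at test time, users can query the model for the sounds of other actions.

Result: The generated sounds accurately convey material properties and actions, and human observers often cannot distinguish them from real sounds.

Insight: Sound can be an effective complement to 3D scene reconstructions for physical interaction, with potential uses in more interactive virtual reality or games.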

Abstract: We study the problem of making 3D scene reconstructions interactive by asking
the following question: can we predict the sounds of human hands physically
interacting with a scene? First, we record a video of a human manipulating
objects within a 3D scene using their hands. We then use these action-sound
pairs to train a rectified flow model to map 3D hand trajectories to their
corresponding audio. At test time, a user can query the model for other
actions, parameterized as sequences of hand poses, to estimate their
corresponding sounds. In our experiments, we find that our generated sounds
accurately convey material properties and actions, and that they are often
indistinguishable to human observers from real sounds. Project page:
https://www.yimingdou.com/hearing_hands/
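
For readers unfamiliar with rectified flow, the standard conditional training objective can be written as below. This is the generic textbook formulation, given only as a sketch: the symbols are assumptions for illustration (x_1 is some audio representation, c is an encoding of the 3D hand trajectory, v_θ is the learned velocity network), and the abstract does not state which audio representation or conditioning mechanism the authors actually use.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1],\\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,x_0,\,(x_1, c)}\,\bigl\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\rVert_2^2 .
\end{aligned}
```

Under this formulation, a sound for a new trajectory c would be generated by drawing x_0 from the Gaussian and integrating dx_t/dt = v_θ(x_t, t, c) from t = 0 to t = 1.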

[116] PlayerOne: Egocentric World Simulator

Yuanpeng Tu,Hao Luo,Xi Chen,Xiang Bai,Fan Wang,Hengshuang Zhao

Main category: cs.CV

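TL;DR: PlayerOne is presented as the first egocentric realistic world simulator. It generates immersive videos that are strictly aligned with the user's real motion, with precise control achieved through a coarse-to-fine training pipeline and a part-disentangled motion injection scheme.

Details Motivation: There has been no simulator that can dynamically generate egocentric videos aligned with a real user's motion, a capability that matters for virtual reality, human-computer interaction, and related areas.

Contribution: 1) PlayerOne, described as the first egocentric realistic world simulator; 2) a coarse-to-fine training pipeline that pretrains on large-scale egocentric text-video pairs and finetunes on synchronous motion-video data; 3) a part-disentangled motion injection scheme and a joint reconstruction framework for precise control and scene consistency.

Method: 1) Coarse-to-fine training: pretraining on large-scale egocentric text-video pairs, then finetuning on synchronous motion-video data built with an automatic construction pipeline; 2) a part-disentangled motion injection scheme that controls motion part by part; 3) a joint reconstruction framework that progressively models both the 4D scene and the video frames.

Result: Experiments show that PlayerOne precisely controls varying human movements and models diverse scenarios in a world-consistent way, demonstrating strong generalization.

Insight: Egocentric video simulation needs precise part-level control and scene-consistent modeling; the work opens new directions for world modeling and its applications.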

Abstract: We introduce PlayerOne, the first egocentric realistic world simulator,
facilitating immersive and unrestricted exploration within vividly dynamic
environments. Given an egocentric scene image from the user, PlayerOne can
accurately construct the corresponding world and generate egocentric videos
that are strictly aligned with the real-scene human motion of the user captured
by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that
first performs pretraining on large-scale egocentric text-video pairs for
coarse-level egocentric understanding, followed by finetuning on synchronous
motion-video data extracted from egocentric-exocentric video datasets with our
automatic construction pipeline. Besides, considering the varying importance of
different components, we design a part-disentangled motion injection scheme,
enabling precise control of part-level movements. In addition, we devise a
joint reconstruction framework that progressively models both the 4D scene and
video frames, ensuring scene consistency in the long-form video generation.
Experimental results demonstrate its great generalization ability in precise
control of varying human movements and world-consistent modeling of diverse
scenarios. It marks the first endeavor into egocentric real-world simulation
and can pave the way for the community to delve into fresh frontiers of world
modeling and its diverse applications.

cs.SD [Back]

[117] SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research

Ahmed Adel Attia,Jing Liu,Carl Espy-Wilson

Main category: cs.SD

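TL;DR: The paper proposes a method for synthesizing classroom noise and speech data with game engines, addressing the scarcity of speech data in education, and presents the resulting dataset, SimClass.

Details Motivation: Public classroom speech data is scarce and there is no dedicated classroom noise corpus, which has held back speech recognition models for education.

Contribution: 1. A game-engine-based method for synthesizing classroom noise; 2. The SimClass dataset, which contains a synthesized classroom noise corpus and simulated classroom speech data.

Method: 1. Synthesize classroom noise with a game engine; 2. Generate simulated classroom speech by pairing a public children's speech corpus with YouTube lecture videos.

Result: Experiments show that SimClass closely approximates real classroom speech, making it a resource for developing robust speech recognition and enhancement models.

Insight: Synthesizing data with game engines is scalable and extends to other domains, offering a new approach to data scarcity.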

Abstract: The scarcity of large-scale classroom speech data has hindered the
development of AI-driven speech models for education. Public classroom datasets
remain limited, and the lack of a dedicated classroom noise corpus prevents the
use of standard data augmentation techniques.
In this paper, we introduce a scalable methodology for synthesizing classroom
noise using game engines, a framework that extends to other domains. Using this
methodology, we present SimClass, a dataset that includes both a synthesized
classroom noise corpus and a simulated classroom speech dataset. The speech
data is generated by pairing a public children’s speech corpus with YouTube
lecture videos to approximate real classroom interactions in clean conditions.
Our experiments on clean and noisy speech demonstrate that SimClass closely
approximates real classroom speech, making it a valuable resource for
developing robust speech recognition and enhancement models.

[118] Training-Free Voice Conversion with Factorized Optimal Transport

Alexander Lobashev,Assel Yermekova,Maria Larchenko

Main category: cs.SD

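TL;DR: The paper proposes MKL-VC, a training-free voice conversion method that applies a factorized optimal transport map in WavLM embedding subspaces to achieve high-quality any-to-any cross-lingual voice conversion with only 5 seconds of reference audio.

Details Motivation: The original kNN-VC pipeline performs poorly in cross-lingual settings and with short reference audio. MKL-VC uses a factorized optimal transport map to improve content preservation and robustness when reference audio is short.

Contribution: MKL-VC, a training-free voice conversion method whose factorized optimal transport map addresses non-uniform variance across dimensions and enables high-quality cross-lingual voice conversion.

Method: MKL-VC replaces kNN regression with a factorized Monge-Kantorovich linear optimal transport map applied in WavLM embedding subspaces; the factorization handles non-uniform variance across dimensions.

Result: On LibriSpeech and FLEURS, MKL-VC significantly improves content preservation and robustness with short reference audio, outperforming kNN-VC, and achieves performance comparable to FACodec, especially in cross-lingual voice conversion.

Insight: A factorized optimal transport map copes with non-uniform variance in high-dimensional embedding spaces and gives an efficient, training-free approach to voice conversion.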

Abstract: This paper introduces Factorized MKL-VC, a training-free modification of the
kNN-VC pipeline. In contrast with the original pipeline, our algorithm performs
high-quality any-to-any cross-lingual voice conversion with only 5 seconds of
reference audio. MKL-VC replaces kNN regression with a factorized optimal
transport map in WavLM embedding subspaces, derived from the Monge-Kantorovich
Linear solution. Factorization addresses non-uniform variance across
dimensions, ensuring effective feature transformation. Experiments on
LibriSpeech and FLEURS datasets show MKL-VC significantly improves content
preservation and robustness with short reference audio, outperforming kNN-VC.
MKL-VC achieves performance comparable to FACodec, especially in cross-lingual
voice conversion domain.
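
To make the idea of a linear optimal transport map concrete: between two Gaussians the optimal map is affine, and in one dimension it reduces to T(x) = mu_t + (sigma_t / sigma_s) * (x - mu_s). The sketch below applies that 1-D map independently to each feature dimension. It is a simplified illustration under stated assumptions, not the authors' algorithm: the per-dimension factorization is an assumption (the abstract only says the map is factorized over embedding subspaces), plain Python lists stand in for WavLM embeddings, and `fit_diagonal_gaussian_map` is a hypothetical name.

```python
from statistics import fmean, pstdev


def fit_diagonal_gaussian_map(source, target):
    """Fit a per-dimension affine map from source statistics to target statistics.

    source, target: lists of equal-length feature vectors (lists of floats).
    Returns a function that transports one source-like vector.
    """
    dims = len(source[0])
    params = []
    for d in range(dims):
        s_col = [vec[d] for vec in source]
        t_col = [vec[d] for vec in target]
        mu_s, mu_t = fmean(s_col), fmean(t_col)
        sd_s, sd_t = pstdev(s_col), pstdev(t_col)
        # Guard against a constant source dimension (zero variance).
        scale = sd_t / sd_s if sd_s > 0 else 1.0
        params.append((mu_s, mu_t, scale))

    def transport(vec):
        # 1-D Gaussian optimal transport applied to each dimension separately.
        return [mu_t + scale * (x - mu_s)
                for x, (mu_s, mu_t, scale) in zip(vec, params)]

    return transport


# Toy "embeddings": three source frames and three target (reference) frames.
source_frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
target_frames = [[10.0, 0.0], [10.0, 2.0], [16.0, 4.0]]

transport = fit_diagonal_gaussian_map(source_frames, target_frames)
converted = [transport(frame) for frame in source_frames]
```

After the map is applied, each dimension of the converted frames has the same mean and standard deviation as the target frames, which is the sense in which the source distribution is transported onto the reference speaker's statistics.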

cs.IR [Back]

[119] ThinkQE: Query Expansion via an Evolving Thinking Process

Yibin Lei,Tao Shen,Andrew Yates

Main category: cs.IR

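TL;DR: ThinkQE is a query expansion framework that combines a thinking process with a corpus-interaction strategy, improving retrieval performance, particularly in terms of exploration and result diversity.

Details Motivation: LLM-based query expansion methods perform well but often generate narrowly focused expansions that neglect exploration and diversity. ThinkQE addresses this with deeper semantic exploration and iterative refinement.

Contribution: 1. A thinking-based expansion process that encourages more comprehensive semantic exploration; 2. A corpus-interaction strategy that iteratively refines expansions using retrieval feedback.

Method: ThinkQE combines the two components: the thinking process generates diverse expansions through deeper semantic exploration, and the corpus-interaction strategy repeatedly refines them using retrieval feedback.

Result: On DL19, DL20, and BRIGHT, ThinkQE consistently outperforms prior approaches, including training-intensive dense retrievers and rerankers.

Insight: Iterative feedback and deeper semantic exploration can improve both the diversity of query expansions and retrieval performance, especially in settings that call for broad exploration.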

Abstract: Effective query expansion for web search benefits from promoting both
exploration and result diversity to capture multiple interpretations and facets
of a query. While recent LLM-based methods have improved retrieval performance
and demonstrate strong domain generalization without additional training, they
often generate narrowly focused expansions that overlook these desiderata. We
propose ThinkQE, a test-time query expansion framework addressing this
limitation through two key components: a thinking-based expansion process that
encourages deeper and comprehensive semantic exploration, and a
corpus-interaction strategy that iteratively refines expansions using retrieval
feedback from the corpus. Experiments on diverse web search benchmarks (DL19,
DL20, and BRIGHT) show ThinkQE consistently outperforms prior approaches,
including training-intensive dense retrievers and rerankers.
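
The corpus-interaction loop described above can be sketched as follows. This is a rough illustration under assumptions, not the ThinkQE implementation: `expand` stands in for an LLM call and `retrieve` for a search backend, both hypothetical callables supplied by the caller, and the number of rounds, `top_k`, and the way expansions are concatenated are illustrative choices.

```python
def iterative_expansion(query, expand, retrieve, rounds=3, top_k=5):
    """Refine a query expansion over several retrieval rounds.

    expand(query, docs) -> str     hypothetical LLM-backed expansion step
    retrieve(text, k) -> list[str] hypothetical retriever returning k documents
    """
    expansions = []
    for _ in range(rounds):
        # Search with the original query plus everything generated so far.
        search_text = " ".join([query] + expansions)
        docs = retrieve(search_text, top_k)
        # Feed the retrieved documents back into the next expansion.
        expansions.append(expand(query, docs))
    return " ".join([query] + expansions)


# Toy stand-ins so the sketch runs without an LLM or a search index.
corpus = ["jaguar car dealership", "jaguar big cat habitat", "python snake habitat"]


def toy_retrieve(text, k):
    words = set(text.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(words & set(doc.split())))
    return ranked[:k]


def toy_expand(query, docs):
    # Pretend the "LLM" picks the last word of the top document as new context.
    return docs[0].split()[-1] if docs else ""


expanded_query = iterative_expansion("jaguar", toy_expand, toy_retrieve, rounds=2, top_k=1)
print(expanded_query)
```

Passing `expand` and `retrieve` in as callables keeps the loop independent of any particular model or index, so the same skeleton can be tried with different backends.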

eess.IV [Back]

[120] Exploring Image Transforms derived from Eye Gaze Variables for Progressive Autism Diagnosis

Abigail Copiaco,Christian Ritz,Yassine Himeur,Valsamma Eapen,Ammar Albanna,Wathiq Mansoor

Main category: eess.IV

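TL;DR: The paper proposes an AI-based assistive technology that diagnoses Autism Spectrum Disorder (ASD) by applying transfer learning to image transforms derived from eye gaze variables, aiming to streamline diagnosis while protecting user privacy.

Details Motivation: ASD prevalence has risen rapidly, while current diagnostic methods are time-consuming and costly, so a more convenient and efficient approach is needed.

Contribution: 1. An ASD diagnosis method combining transfer learning with image transforms derived from eye gaze variables. 2. The possibility of periodic in-home diagnosis, reducing stress for individuals and caregivers. 3. Privacy protection together with regular progress updates for guardians and therapists.

Method: Image transforms are generated from eye gaze variables and combined with transfer learning to train a model for automated ASD diagnosis.

Result: The approach is reported to enable efficient, privacy-preserving ASD diagnosis, offering a convenient option for families and healthcare systems.

Insight: 1. Image transforms can improve diagnostic efficiency while preserving privacy. 2. Transfer learning makes effective use of limited medical data. 3. In-home diagnosis may become an important direction for assistive medical technology.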

Abstract: The prevalence of Autism Spectrum Disorder (ASD) has surged rapidly over the
past decade, posing significant challenges in communication, behavior, and
focus for affected individuals. Current diagnostic techniques, though
effective, are time-intensive, leading to high social and economic costs. This
work introduces an AI-powered assistive technology designed to streamline ASD
diagnosis and management, enhancing convenience for individuals with ASD and
efficiency for caregivers and therapists. The system integrates transfer
learning with image transforms derived from eye gaze variables to diagnose ASD.
This facilitates and opens opportunities for in-home periodical diagnosis,
reducing stress for individuals and caregivers, while also preserving user
privacy through the use of image transforms. The accessibility of the proposed
method also offers opportunities for improved communication between guardians
and therapists, ensuring regular updates on progress and evolving support
needs. Overall, the approach proposed in this work ensures timely, accessible
diagnosis while protecting the subjects’ privacy, improving outcomes for
individuals with ASD.

[121] Foundation Models in Medical Imaging – A Review and Outlook

Vivien van Veldhuizen,Vanessa Botha,Chunyao Lu,Melis Erdal Cesur,Kevin Groot Lipman,Edwin D. de Jong,Hugo Horlings,Clárisa Sanchez,Cees Snoek,Ritse Mann,Eric Marcus,Jonas Teuwen

Main category: eess.IV

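TL;DR: This review examines the development and application of foundation models (FMs) in medical imaging, covering pathology, radiology, and ophthalmology. Drawing on over 150 studies, it explains the core components, self-supervised learning methods, and downstream adaptation strategies, and outlines directions for future research.

Details Motivation: Medical image analysis usually depends on large amounts of annotated data. FMs learn general-purpose visual features from large collections of unlabeled data, which reduces the annotation burden and opens new possibilities for medical image analysis.

Contribution: 1. A review of how FMs are developed and applied in medical imaging; 2. A summary of model architectures, self-supervised learning methods, and downstream adaptation strategies; 3. A comparison of design choices across domains and a discussion of challenges.

Method: Based on over 150 studies, the review analyzes how FMs are built for medical imaging, including self-supervised pretraining, architecture choices, and task-specific downstream adaptation.

Result: The surveyed work indicates that FMs perform well in pathology, radiology, and ophthalmology and can be adapted to specific clinical tasks with little additional supervision.

Insight: FMs give medical image analysis a new tool, but open problems remain, such as generalization, data privacy, and compute cost.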

Abstract: Foundation models (FMs) are changing the way medical images are analyzed by
learning from large collections of unlabeled data. Instead of relying on
manually annotated examples, FMs are pre-trained to learn general-purpose
visual features that can later be adapted to specific clinical tasks with
little additional supervision. In this review, we examine how FMs are being
developed and applied in pathology, radiology, and ophthalmology, drawing on
evidence from over 150 studies. We explain the core components of FM pipelines,
including model architectures, self-supervised learning methods, and strategies
for downstream adaptation. We also review how FMs are being used in each
imaging domain and compare design choices across applications. Finally, we
discuss key challenges and open questions to guide future research.

[122] Low-Rank Augmented Implicit Neural Representation for Unsupervised High-Dimensional Quantitative MRI Reconstruction

Haonan Zhang,Guoyan Lao,Yuyao Zhang,Hongjiang Wei

Main category: eess.IV

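TL;DR: The paper proposes LoREIN, an unsupervised dual-prior-integrated framework for accelerated 3D multi-parametric quantitative MRI reconstruction, which combines a low-rank prior with a continuity prior to improve reconstruction fidelity.

Details Motivation: Current reconstruction methods typically rely on a single prior or physics-informed model to solve a highly ill-posed inverse problem, which leads to suboptimal results. The paper aims to improve reconstruction quality by combining two priors, low-rank and continuity.

Contribution: 1. The LoREIN framework, which combines two priors through low-rank representation (LRR) and implicit neural representation (INR). 2. A zero-shot learning paradigm applicable to complex spatiotemporal and high-dimensional image reconstruction tasks.

Method: 1. Model the low-rank prior with LRR. 2. Model the continuity prior with INR. 3. Use the continuous representation of INR to estimate optimal spatial bases within the low-rank subspace, improving the reconstruction of weighted images.

Result: LoREIN reconstructs weighted images with high fidelity, and uses the structural and quantitative information in the multi-contrast weighted images to improve the accuracy of the quantitative parameter maps.

Insight: Combining low-rank and continuity priors is a promising way to tackle high-dimensional medical image reconstruction, and the zero-shot learning paradigm may carry over to other complex image reconstruction tasks.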

Abstract: Quantitative magnetic resonance imaging (qMRI) provides tissue-specific
parameters vital for clinical diagnosis. Although simultaneous multi-parametric
qMRI (MP-qMRI) technologies enhance imaging efficiency, robustly reconstructing
qMRI from highly undersampled, high-dimensional measurements remains a
significant challenge. This difficulty arises primarily because current
reconstruction methods rely solely on a single prior or physics-informed
model to solve the highly ill-posed inverse problem, which often leads to
suboptimal results. To overcome this limitation, we propose LoREIN, a novel
unsupervised and dual-prior-integrated framework for accelerated 3D MP-qMRI
reconstruction. Technically, LoREIN incorporates both low-rank prior and
continuity prior via low-rank representation (LRR) and implicit neural
representation (INR), respectively, to enhance reconstruction fidelity. The
powerful continuous representation of INR enables the estimation of optimal
spatial bases within the low-rank subspace, facilitating high-fidelity
reconstruction of weighted images. Simultaneously, the predicted multi-contrast
weighted images provide essential structural and quantitative guidance, further
enhancing the reconstruction accuracy of quantitative parameter maps.
Furthermore, our work introduces a zero-shot learning paradigm with broad
potential in complex spatiotemporal and high-dimensional image reconstruction
tasks, further advancing the field of medical imaging.

[123] The RSNA Lumbar Degenerative Imaging Spine Classification (LumbarDISC) Dataset

Tyler J. Richards,Adam E. Flanders,Errol Colak,Luciano M. Prevedello,Robyn L. Ball,Felipe Kitamura,John Mongan,Maryam Vazirabad,Hui-Ming Lin,Anne Kendell,Thanat Kanthawang,Salita Angkurawaranon,Emre Altinmakas,Hakan Dogan,Paulo Eduardo de Aguiar Kuriki,Arjuna Somasundaram,Christopher Ruston,Deniz Bulja,Naida Spahovic,Jennifer Sommer,Sirui Jiang,Eduardo Moreno Judice de Mattos Farina,Eduardo Caminha Nunes,Michael Brassil,Megan McNamara,Johanna Ortiz,Jacob Peoples,Vinson L. Uytana,Anthony Kam,Venkata N. S. Dola,Daniel Murphy,David Vu,Dataset Contributor Group,Dataset Annotator Group,Competition Data Notebook Group,Jason F. Talbott

Main category: eess.IV

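TL;DR: The RSNA LumbarDISC dataset is the largest publicly available dataset of adult lumbar spine MRI examinations annotated for degenerative changes. It includes 2,697 patients and 8,593 image series from 8 institutions and is free for non-commercial use.

Details Motivation: Public datasets for studying lumbar degenerative disease are scarce, which hinders machine learning and image analysis research.

Contribution: The largest public lumbar spine MRI dataset annotated for degenerative changes, supporting the development of machine learning and clinical applications.

Method: Data were collected through a multinational, multi-institutional collaboration, and expert radiologists graded the degree of lumbar degeneration.

Result: The dataset is publicly available and was used for the RSNA 2024 competition, in which participants developed deep learning models to grade lumbar degenerative changes.

Insight: The dataset fills a data gap in lumbar degeneration research and provides a resource for improving clinical efficiency and patient care.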

Abstract: The Radiological Society of North America (RSNA) Lumbar Degenerative Imaging
Spine Classification (LumbarDISC) dataset is the largest publicly available
dataset of adult MRI lumbar spine examinations annotated for degenerative
changes. The dataset includes 2,697 patients with a total of 8,593 image series
from 8 institutions across 6 countries and 5 continents. The dataset is
available for free for non-commercial use via Kaggle and RSNA Medical Imaging
Resource of AI (MIRA). The dataset was created for the RSNA 2024 Lumbar Spine
Degenerative Classification competition where competitors developed deep
learning models to grade degenerative changes in the lumbar spine. The degree
of spinal canal, subarticular recess, and neural foraminal stenosis was graded
at each intervertebral disc level in the lumbar spine. The images were
annotated by expert volunteer neuroradiologists and musculoskeletal
radiologists from the RSNA, American Society of Neuroradiology, and the
American Society of Spine Radiology. This dataset aims to facilitate research
and development in machine learning and lumbar spine imaging to lead to
improved patient care and clinical efficiency.

[124] Sampling Theory for Super-Resolution with Implicit Neural Representations

Mahrokh Najaf,Gregory Ongie

Main category: eess.IV

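TL;DR: The paper studies the sampling theory of recovering continuous-domain images from low-pass Fourier samples with implicit neural representations (INRs). It relates the non-convex parameter-space optimization problem to a convex penalty over an infinite-dimensional space and shows when exact recovery is possible.

Details Motivation: INRs have shown strong potential for inverse problems in computer vision and computational imaging, but little is known about their sample complexity, particularly for linear inverse problems. The paper aims to fill this gap.

Contribution: A sampling theory for continuous-domain image recovery with INRs: a link between the non-convex training problem and a convex penalty defined over an infinite-dimensional space of measures, and a sufficient number of samples for exact recovery.

Method: The paper studies the sampling requirements for recovering an image from low-pass Fourier samples by fitting a single hidden-layer INR with ReLU activation and a Fourier features layer, using a generalized form of weight decay regularization.

Result: The theory identifies conditions under which an INR achieves exact recovery, and experiments assess exact recovery with low-width INRs and illustrate super-resolution recovery of continuous-domain phantom images.

Insight: Viewing the non-convex training problem through an infinite-dimensional convex formulation offers a new angle for the theoretical analysis of INRs in inverse problems.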

Abstract: Implicit neural representations (INRs) have emerged as a powerful tool for
solving inverse problems in computer vision and computational imaging. INRs
represent images as continuous domain functions realized by a neural network
taking spatial coordinates as inputs. However, unlike traditional pixel
representations, little is known about the sample complexity of estimating
images using INRs in the context of linear inverse problems. Towards this end,
we study the sampling requirements for recovery of a continuous domain image
from its low-pass Fourier samples by fitting a single hidden-layer INR with
ReLU activation and a Fourier features layer using a generalized form of weight
decay regularization. Our key insight is to relate minimizers of this
non-convex parameter space optimization problem to minimizers of a convex
penalty defined over an infinite-dimensional space of measures. We identify a
sufficient number of Fourier samples for which an image realized by an INR is
exactly recoverable by solving the INR training problem. To validate our
theory, we empirically assess the probability of achieving exact recovery of
images realized by low-width single hidden-layer INRs, and illustrate the
performance of INRs on super-resolution recovery of continuous domain phantom
images.
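
The setting described in the abstract can be written schematically as below. This is an illustrative sketch only; the notation is assumed rather than taken from the paper: γ(x) denotes a Fourier features map, [·]_+ the ReLU, W the hidden width, y[k] the observed low-pass Fourier coefficients up to an assumed cutoff K, and R(θ) the generalized weight decay penalty, whose exact form the abstract does not give (ordinary weight decay would be half the sum of squared weights). An equality-constrained variant, minimizing R(θ) subject to matching the samples exactly, is an equally plausible reading.

```latex
\begin{aligned}
f_\theta(x) &= \sum_{i=1}^{W} a_i \,\bigl[\, w_i^{\top} \gamma(x) \,\bigr]_{+}, \qquad \theta = \{(a_i, w_i)\}_{i=1}^{W},\\
\min_{\theta}\;& \tfrac{1}{2} \sum_{\lVert k \rVert_\infty \le K} \bigl| \widehat{f_\theta}[k] - y[k] \bigr|^2 \;+\; \lambda\, R(\theta).
\end{aligned}
```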

cs.AI [Back]

[125] Ming-Omni: A Unified Multimodal Model for Perception and Generation

Inclusion AI,Biao Gong,Cheng Zou,Chuanyang Zheng,Chunluan Zhou,Canxiang Yan,Chunxiang Jin,Chunjie Shen,Dandan Zheng,Fudong Wang,Furong Xu,GuangMing Yao,Jun Zhou,Jingdong Chen,Jianxin Sun,Jiajia Liu,Jianjiang Zhu,Jun Peng,Kaixiang Ji,Kaiyou Song,Kaimeng Ren,Libin Wang,Lixiang Ru,Lele Xie,Longhua Tan,Lyuxin Xue,Lan Wang,Mochen Bai,Ning Gao,Pei Chen,Qingpei Guo,Qinglong Zhang,Qiang Xu,Rui Liu,Ruijie Xiong,Sirui Gao,Tinghao Liu,Taisong Li,Weilong Chai,Xinyu Xiao,Xiaomei Wang,Xiaoxue Chen,Xiao Lu,Xiaoyu Li,Xingning Dong,Xuzheng Yu,Yi Yuan,Yuting Gao,Yunxiao Sun,Yipeng Chen,Yifei Wu,Yongjie Lyu,Ziping Ma,Zipeng Feng,Zhijiang Fang,Zhihao Qiu,Ziyuan Huang,Zhengyu He

Main category: cs.AI

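TL;DR: Ming-Omni is a unified multimodal model that processes images, text, audio, and video and shows strong proficiency in speech and image generation. Dedicated encoders and an MoE architecture provide efficient multimodal processing and fusion, and the model supports audio and image generation.

Details Motivation: Current multimodal models often need separate task-specific models or structural redesign, which limits flexibility and efficiency. Ming-Omni aims to provide a single framework for perception and generation across modalities.

Contribution: Ming-Omni, which the authors describe as the first open-source model they are aware of to match GPT-4o in modality support, handling text, images, audio, and video; newly proposed modality-specific routers within an MoE architecture for efficient multimodal fusion; an integrated audio decoder and image generation module that extend the model to generation.

Method: Dedicated encoders extract tokens from each modality; the tokens are processed by Ling, an MoE architecture with modality-specific routers; an audio decoder and an image generation module (Ming-Lite-Uni) are integrated.

Result: Experiments show that Ming-Omni performs well on perception and generation tasks, supports tasks such as context-aware chatting, text-to-speech, and image editing, and matches GPT-4o in modality support.

Insight: A unified multimodal model reduces the need for task-specific models and improves flexibility and efficiency; the MoE architecture and modality-specific routers are key to efficient multimodal processing.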

Abstract: We propose Ming-Omni, a unified multimodal model capable of processing
images, text, audio, and video, while demonstrating strong proficiency in both
speech and image generation. Ming-Omni employs dedicated encoders to extract
tokens from different modalities, which are then processed by Ling, an MoE
architecture equipped with newly proposed modality-specific routers. This
design enables a single model to efficiently process and fuse multimodal inputs
within a unified framework, thereby facilitating diverse tasks without
requiring separate models, task-specific fine-tuning, or structural redesign.
Importantly, Ming-Omni extends beyond conventional multimodal models by
supporting audio and image generation. This is achieved through the integration
of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for
high-quality image generation, which also allow the model to engage in
context-aware chatting, perform text-to-speech conversion, and conduct
versatile image editing. Our experimental results showcase Ming-Omni offers a
powerful solution for unified perception and generation across all modalities.
Notably, our proposed Ming-Omni is the first open-source model we are aware of
to match GPT-4o in modality support, and we release all code and model weights
to encourage further research and development in the community.

[126] Intent Factored Generation: Unleashing the Diversity in Your Language Model

Eltayeb Ahmed,Uljad Berdica,Martha Elliott,Danijela Horak,Jakob N. Foerster

Main category: cs.AI

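TL;DR: The paper proposes Intent Factored Generation (IFG), which samples a semantically dense intent before the final response in order to increase the diversity of language model samples while maintaining quality.

Details Motivation: When generating diverse samples for a fixed prompt, current methods often operate only at the token level, which leads to poor exploration on reasoning problems and to repetitive conversational agents. IFG targets this problem.

Contribution: IFG, a two-stage sampling method (first sample an intent, then generate the final response conditioned on it) that improves diversity while maintaining quality.

Method: IFG splits sampling into an intent stage and a final response stage, using a higher temperature for the former to promote diversity and a lower temperature for the latter to keep outputs coherent.

Result: Experiments show that IFG improves performance on maths and code tasks and on conversational generation, and achieves higher diversity while maintaining generation quality on a general language modelling task.

Insight: By modelling intent explicitly, IFG gives better control over diversity while keeping outputs self-consistent. The method is simple, easy to integrate, and applicable to many settings.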

Abstract: Obtaining multiple meaningfully diverse, high quality samples from Large
Language Models for a fixed prompt remains an open challenge. Current methods
for increasing diversity often only operate at the token-level, paraphrasing
the same response. This is problematic because it leads to poor exploration on
reasoning problems and to unengaging, repetitive conversational agents. To
address this we propose Intent Factored Generation (IFG), factorising the
sampling process into two stages. First, we sample a semantically dense intent,
e.g., a summary or keywords. Second, we sample the final response conditioning
on both the original prompt and the intent from the first stage. This allows us
to use a higher temperature during the intent step to promote conceptual
diversity, and a lower temperature during the final generation to ensure the
outputs are coherent and self-consistent. Additionally, we find that prompting
the model to explicitly state its intent for each step of the chain-of-thought
before generating the step is beneficial for reasoning tasks. We demonstrate
our method’s effectiveness across a diverse set of tasks. We show this method
improves both pass@k and Reinforcement Learning from Verifier Feedback on maths
and code tasks. For instruction-tuning, we combine IFG with Direct Preference
Optimisation to increase conversational diversity without sacrificing reward.
Finally, we achieve higher diversity while maintaining the quality of
generations on a general language modelling task, using a new dataset of reader
comments and news articles that we collect and open-source. In summary, we
present a simple method of increasing the sample diversity of LLMs while
maintaining performance. This method can be implemented by changing the prompt
and varying the temperature during generation, making it easy to integrate into
many algorithms for gains across various applications.
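
The two-stage procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `generate(prompt, temperature)` callable, the prompt wording, and the default temperatures are all assumptions made for the example.

```python
from typing import Callable, Tuple

# Hypothetical interface: generate(prompt, temperature) -> generated text.
Generate = Callable[[str, float], str]


def ifg_sample(
    generate: Generate,
    prompt: str,
    intent_temperature: float = 1.2,    # assumed value, higher for diversity
    response_temperature: float = 0.5,  # assumed value, lower for coherence
) -> Tuple[str, str]:
    """Sample an intent first, then a response conditioned on prompt and intent."""
    # Stage 1: sample a short, semantically dense intent at a higher temperature.
    intent_prompt = f"{prompt}\n\nIntent (keywords or a short summary):"
    intent = generate(intent_prompt, intent_temperature).strip()

    # Stage 2: sample the final response at a lower temperature, conditioning
    # on both the original prompt and the intent from stage 1.
    response_prompt = f"{prompt}\n\nIntent: {intent}\n\nResponse:"
    response = generate(response_prompt, response_temperature).strip()
    return intent, response
```

Any text-generation backend that accepts a temperature could be wrapped to fit the `generate` signature; calling `ifg_sample` several times on the same prompt then yields samples whose variation comes mainly from the intent stage.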

[127] V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran,Adrien Bardes,David Fan,Quentin Garrido,Russell Howes,Mojtaba Komeili,Matthew Muckley,Ammar Rizvi,Claire Roberts,Koustuv Sinha,Artem Zholus,Sergio Arnaud,Abha Gejji,Ada Martin,Francois Robert Hogan,Daniel Dugas,Piotr Bojanowski,Vasil Khalidov,Patrick Labatut,Francisco Massa,Marc Szafraniec,Kapil Krishnakumar,Yong Li,Xiaodong Ma,Sarath Chandar,Franziska Meier,Yann LeCun,Michael Rabbat,Nicolas Ballas

Main category: cs.AI

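TL;DR: V-JEPA 2 is a self-supervised video model, pre-trained on internet-scale video and a small amount of robot interaction data, that can understand, predict, and plan in the physical world.

Details Motivation: A major challenge for modern AI is learning to understand the world and to act largely by observation. This paper explores a self-supervised approach that combines internet video with a small amount of robot data to develop models that can understand and plan in the physical world.

Contribution: 1. Proposes V-JEPA 2, which achieves strong performance on motion understanding and SOTA performance on human action anticipation; 2. after alignment with a large language model, performs strongly on video question-answering tasks; 3. after post-training on robot data, enables zero-shot planning.

Method: 1. Pre-train an action-free joint-embedding-predictive architecture (V-JEPA 2); 2. align it with a large language model; 3. post-train it on a small amount of robot video to obtain an action-conditioned world model (V-JEPA 2-AC).

Result: 1. Motion understanding: 77.3 top-1 accuracy on Something-Something v2; 2. video question answering: 84.0 on PerceptionTest; 3. successful zero-shot robot planning.

Insight: Self-supervised learning that combines large-scale video data with a small amount of robot data can efficiently build a general world model that supports applications across tasks and scenarios.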

Abstract: A major challenge for modern AI is to learn to understand the world and learn
to act largely by observation. This paper explores a self-supervised approach
that combines internet-scale video data with a small amount of interaction data
(robot trajectories), to develop models capable of understanding, predicting,
and planning in the physical world. We first pre-train an action-free
joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset
comprising over 1 million hours of internet video. V-JEPA 2 achieves strong
performance on motion understanding (77.3 top-1 accuracy on Something-Something
v2) and state-of-the-art performance on human action anticipation (39.7
recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models.
Additionally, after aligning V-JEPA 2 with a large language model, we
demonstrate state-of-the-art performance on multiple video question-answering
tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on
TempCompass). Finally, we show how self-supervised learning can be applied to
robotic planning tasks by post-training a latent action-conditioned world
model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the
Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different
labs and enable picking and placing of objects using planning with image goals.
Notably, this is achieved without collecting any data from the robots in these
environments, and without any task-specific training or reward. This work
demonstrates how self-supervised learning from web-scale data and a small
amount of robot interaction data can yield a world model capable of planning in
the physical world.

cs.CR [Back]

[128] Adversarial Text Generation with Dynamic Contextual Perturbation

Hetvi Waghela,Jaydip Sen,Sneha Rakshit,Subhasis Dasgupta

Main category: cs.CR

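TL;DR: Proposes Dynamic Contextual Perturbation (DCP), a new adversarial text attack that dynamically generates context-aware perturbations, improving the semantic consistency and fluency of adversarial examples and challenging the robustness of state-of-the-art NLP models.

Details Motivation: Existing adversarial text attacks mostly modify individual words or local text segments and overlook the broader context, so their perturbations are easy to detect or semantically inconsistent.

Contribution: 1. Proposes DCP, which dynamically generates context-aware perturbations across sentences, paragraphs, and documents; 2. leverages pre-trained language models and iteratively refines perturbations through an adversarial objective function; 3. validates experimentally that DCP makes adversarial attacks subtler and more effective.

Method: DCP uses pre-trained language models to generate context-aware perturbations dynamically and refines them iteratively through an adversarial objective function that balances inducing model misclassification against preserving the naturalness of the text.

Result: Experiments show that DCP produces more natural and more effective adversarial examples that challenge the robustness of existing NLP models.

Insight: Context plays a critical role in adversarial attacks; future work needs robust NLP methods that can withstand such sophisticated attacks.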

Abstract: Adversarial attacks on Natural Language Processing (NLP) models expose
vulnerabilities by introducing subtle perturbations to input text, often
leading to misclassification while maintaining human readability. Existing
methods typically focus on word-level or local text segment alterations,
overlooking the broader context, which results in detectable or semantically
inconsistent perturbations. We propose a novel adversarial text attack scheme
named Dynamic Contextual Perturbation (DCP). DCP dynamically generates
context-aware perturbations across sentences, paragraphs, and documents,
ensuring semantic fidelity and fluency. Leveraging the capabilities of
pre-trained language models, DCP iteratively refines perturbations through an
adversarial objective function that balances the dual objectives of inducing
model misclassification and preserving the naturalness of the text. This
comprehensive approach allows DCP to produce more sophisticated and effective
adversarial examples that better mimic natural language patterns. Our
experimental results, conducted on various NLP models and datasets, demonstrate
the efficacy of DCP in challenging the robustness of state-of-the-art NLP
systems. By integrating dynamic contextual analysis, DCP significantly enhances
the subtlety and impact of adversarial attacks. This study highlights the
critical role of context in adversarial attacks and lays the groundwork for
creating more robust NLP systems capable of withstanding sophisticated
adversarial strategies.

[129] DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt

Yitong Zhang,Jia Li,Liyi Cai,Ge Li

Main category: cs.CR

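TL;DR: The paper proposes DAVSP, which combines a visual safety prompt with deep alignment to strengthen large vision-language models against malicious queries while preserving utility on benign inputs.

Details Motivation: Existing alignment approaches struggle to resist malicious queries while effectively preserving utility on benign inputs; this work aims to solve that problem.

Contribution: Proposes the Visual Safety Prompt and Deep Alignment, which together improve both the safety and the utility of the model.

Method: A trainable visual safety prompt, added as a padding region around the input image, expands the optimization space; Deep Alignment then trains this prompt through supervision in the model's activation space.

Result: Across multiple benchmarks, DAVSP resists malicious queries while preserving utility on benign inputs, and it shows cross-model generation ability.

Insight: Combining the Visual Safety Prompt with Deep Alignment is the key to achieving safety alignment of the model.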

Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress across
various applications but remain vulnerable to malicious queries that exploit
the visual modality. Existing alignment approaches typically fail to resist
malicious queries while preserving utility on benign ones effectively. To
address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP),
which is built upon two key innovations. First, we introduce the Visual Safety
Prompt, which appends a trainable padding region around the input image. It
preserves visual features and expands the optimization space. Second, we
propose Deep Alignment, a novel approach to train the visual safety prompt
through supervision in the model’s activation space. It enhances the inherent
ability of LVLMs to perceive malicious queries, achieving deeper alignment than
prior works. Extensive experiments across five benchmarks on two representative
LVLMs demonstrate that DAVSP effectively resists malicious queries while
preserving benign input utility. Furthermore, DAVSP exhibits great cross-model
generation ability. Ablation studies further reveal that both the Visual Safety
Prompt and Deep Alignment are essential components, jointly contributing to its
overall effectiveness. The code is publicly available at
https://github.com/zhangyitonggg/DAVSP.

cs.GR [Back]

[130] SILK: Smooth InterpoLation frameworK for motion in-betweening A Simplified Computational Approach

Elly Akhoundi,Hung Yu Ling,Anup Anand Deshmukh,Judith Butepage

Main category: cs.GR

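TL;DR: This paper proposes SILK, a simple Transformer-based framework for motion in-betweening that achieves high-quality animation through data modeling choices and a single Transformer encoder, challenging the assumption that model complexity determines animation quality.

Details Motivation: Motion in-betweening is crucial for animators, but existing methods rely on complex models or multiple training steps, so a simpler yet effective solution is needed.

Contribution: 1. Proposes SILK, a simple and effective framework built on a single Transformer encoder; 2. shows that data modeling choices (such as data volume, pose representation, and velocity features) have a key effect on performance.

Method: Uses a single Transformer encoder and focuses on data modeling (pose representation, velocity input features) to generate high-quality in-between motion.

Result: Experiments show that increasing data volume and choosing a suitable pose representation improve result quality, and that velocity features enhance animation performance.

Insight: In-betweening performance depends more on data modeling than on model complexity, which offers a data-centric perspective for animation research.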

Abstract: Motion in-betweening is a crucial tool for animators, enabling intricate
control over pose-level details in each keyframe. Recent machine learning
solutions for motion in-betweening rely on complex models, incorporating
skeleton-aware architectures or requiring multiple modules and training steps.
In this work, we introduce a simple yet effective Transformer-based framework,
employing a single Transformer encoder to synthesize realistic motions for
motion in-betweening tasks. We find that data modeling choices play a
significant role in improving in-betweening performance. Among others, we show
that increasing data volume can yield equivalent or improved motion
transitions, that the choice of pose representation is vital for achieving
high-quality results, and that incorporating velocity input features enhances
animation performance. These findings challenge the assumption that model
complexity is the primary determinant of animation quality and provide insights
into a more data-centric approach to motion interpolation. Additional videos
and supplementary material are available at https://silk-paper.github.io.

[131] VideoMat: Extracting PBR Materials from Video Diffusion Models

Jacob Munkberg,Zian Wang,Ruofan Liang,Tianchang Shen,Jon Hasselgren

Main category: cs.GR

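TL;DR: VideoMat uses video diffusion models and physically based rendering to generate high-quality materials for 3D models from a text prompt or a single image. It generates material properties that are consistent across views and combines intrinsic decomposition with differentiable rendering to output PBR materials compatible with common content creation tools.

Details Motivation: Existing methods for generating high-quality PBR materials from text or images are limited, particularly in multi-view consistency and physical realism. This paper aims for higher-quality material generation by combining diffusion models with physically based rendering.

Contribution: 1. Proposes a material generation pipeline that combines video diffusion models with physically based rendering. 2. Extracts high-quality PBR materials through intrinsic decomposition and differentiable rendering. 3. Produces materials that are consistent across views and compatible with existing tools.

Method: 1. Fine-tune a video diffusion model to generate multi-view-consistent video. 2. Use an intrinsic decomposition model to extract material properties (base color, roughness, metallic). 3. Optimize the PBR materials with a differentiable path tracer.

Result: The generated materials outperform existing methods in visual quality and physical consistency and can be used directly in common 3D content creation tools.

Insight: Video diffusion models show promise for multi-view generation and material consistency, and combining intrinsic decomposition with physically based rendering can further improve the quality and usability of generated materials.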

Abstract: We leverage finetuned video diffusion models, intrinsic decomposition of
videos, and physically-based differentiable rendering to generate high quality
materials for 3D models given a text prompt or a single image. We condition a
video diffusion model to respect the input geometry and lighting condition.
This model produces multiple views of a given 3D model with coherent material
properties. Secondly, we use a recent model to extract intrinsics (base color,
roughness, metallic) from the generated video. Finally, we use the intrinsics
alongside the generated video in a differentiable path tracer to robustly
extract PBR materials directly compatible with common content creation tools.

[132] DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

Chieh Hubert Lin,Zhaoyang Lv,Songyin Wu,Zhen Xu,Thu Nguyen-Phuoc,Hung-Yu Tseng,Julian Straub,Numair Khan,Lei Xiao,Ming-Hsuan Yang,Yuheng Ren,Richard Newcombe,Zhao Dong,Zhengqin Li

Main category: cs.GR

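TL;DR: DGS-LRM is the first feed-forward model that predicts deformable 3D Gaussian splats from monocular video, targeting dynamic scene reconstruction.

Details Motivation: Real-time reconstruction of dynamic scenes is held back by scarce training data and the lack of suitable 3D representations; this work aims to fill that gap.

Contribution: 1) Introduces a large-scale synthetic dataset; 2) designs a per-pixel deformable 3D Gaussian representation; 3) builds a Transformer network for real-time, generalizable dynamic reconstruction.

Method: Trains on an enhanced synthetic dataset, adopts a deformable 3D Gaussian representation, and uses a Transformer network for feed-forward prediction.

Result: Dynamic reconstruction quality is comparable to optimization-based methods, and the model performs well on long-range 3D tracking tasks.

Insight: The deformable 3D Gaussian representation and the synthetic-data-driven training paradigm offer a new direction for dynamic scene reconstruction.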

Abstract: We introduce the Deformable Gaussian Splats Large Reconstruction Model
(DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian
splats from a monocular posed video of any dynamic scene. Feed-forward scene
reconstruction has gained significant attention for its ability to rapidly
create digital replicas of real-world environments. However, most existing
models are limited to static scenes and fail to reconstruct the motion of
moving objects. Developing a feed-forward model for dynamic scene
reconstruction poses significant challenges, including the scarcity of training
data and the need for appropriate 3D representations and training paradigms. To
address these challenges, we introduce several key technical contributions: an
enhanced large-scale synthetic dataset with ground-truth multi-view videos and
dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian
representation that is easy to learn, supports high-quality dynamic view
synthesis, and enables long-range 3D tracking; and a large transformer network
that achieves real-time, generalizable dynamic scene reconstruction. Extensive
qualitative and quantitative experiments demonstrate that DGS-LRM achieves
dynamic scene reconstruction quality comparable to optimization-based methods,
while significantly outperforming the state-of-the-art predictive dynamic
reconstruction method on real-world examples. Its predicted physically grounded
3D deformation is accurate and can readily adapt for long-range 3D tracking
tasks, achieving performance on par with state-of-the-art monocular video 3D
tracking methods.

cs.LG [Back]

[133] An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks

Valentyn Boreiko,Alexander Panfilov,Vaclav Voracek,Matthias Hein,Jonas Geiping

Main category: cs.LG

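TL;DR: The paper proposes a unified threat model that evaluates LLM jailbreak attacks using the perplexity of an N-gram language model, finds that discrete-optimization attacks outperform LLM-based attacks, and reveals key patterns behind successful attacks.

Details Motivation: Existing jailbreak attacks vary widely in fluency and computational cost and lack a unified evaluation standard, so an interpretable, LLM-agnostic threat model is needed to compare them fairly.

Contribution: Proposes an N-gram-based threat model, benchmarks multiple jailbreak attacks within a unified framework for the first time, and reveals the key characteristics of successful attacks.

Method: Builds an N-gram language model on 1T tokens, uses the perplexity of the attack text to check whether a jailbreak is likely under the distribution of text, and adapts several popular attacks to this threat model.

Result: Attack success rates against modern safety-tuned LLMs are lower than previously reported, and discrete-optimization attacks outperform LLM-based attacks; successful attacks often exploit rare or domain-specific bigrams.

Insight: The interpretable N-gram model exposes the patterns underlying attacks and points defense design toward detecting rare or anomalous text fragments.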

Abstract: A plethora of jailbreaking attacks have been proposed to obtain harmful
responses from safety-tuned LLMs. These methods largely succeed in coercing the
target output in their original settings, but their attacks vary substantially
in fluency and computational effort. In this work, we propose a unified threat
model for the principled comparison of these methods. Our threat model checks
if a given jailbreak is likely to occur in the distribution of text. For this,
we build an N-gram language model on 1T tokens, which, unlike model-based
perplexity, allows for an LLM-agnostic, nonparametric, and inherently
interpretable evaluation. We adapt popular attacks to this threat model, and,
for the first time, benchmark these attacks on equal footing with it. After an
extensive comparison, we find attack success rates against safety-tuned modern
models to be lower than previously presented and that attacks based on discrete
optimization significantly outperform recent LLM-based attacks. Being
inherently interpretable, our threat model allows for a comprehensive analysis
and comparison of jailbreak attacks. We find that effective attacks exploit and
abuse infrequent bigrams, either selecting the ones absent from real-world text
or rare ones, e.g., specific to Reddit or code datasets.
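
To make the idea of N-gram perplexity concrete, here is a toy sketch using a bigram model with add-alpha smoothing. It is only an illustration under stated assumptions: the paper's model is built on 1T tokens, whereas the function name, the smoothing scheme, and the tiny corpus below are hypothetical choices for the example.

```python
import math
from collections import Counter
from typing import List


def bigram_perplexity(tokens: List[str], corpus: List[str], alpha: float = 1.0) -> float:
    """Perplexity of `tokens` under an add-alpha smoothed bigram model of `corpus`."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a bigram")
    # Count how often each token appears as a context (i.e. has a successor).
    context_counts = Counter(corpus[:-1])
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    vocab_size = len(set(corpus) | set(tokens))

    pairs = list(zip(tokens, tokens[1:]))
    log_prob = 0.0
    for prev, cur in pairs:
        # Smoothed conditional probability P(cur | prev).
        p = (bigram_counts[(prev, cur)] + alpha) / (
            context_counts[prev] + alpha * vocab_size
        )
        log_prob += math.log(p)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(pairs))
```

Text built from bigrams that are rare or absent in the corpus receives a higher perplexity, so a threshold on this value is one simple way to flag unnatural inputs; each bigram's contribution can be inspected directly, which is what makes the measure interpretable.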

[134] Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers

Joshua Barron,Devin White

Main category: cs.LG

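TL;DR: By pre-training capacity-limited Transformer models, the paper studies the relationship between memorization and generalization. Small models generalize well but memorize poorly, larger models do the opposite, and no model extrapolates when trained on both tasks jointly.

Details Motivation: To examine the relationship between memorization and generalization in large language models and to understand how model capacity affects these two learning modes.

Contribution: Reveals, on controlled synthetic tasks, a capacity-driven trade-off between memorization and generalization, and finds that pre-training may favor one learning mode over the other.

Method: Pre-trains Transformer models of different capacities and tests them separately on arithmetic extrapolation (generalization) and factual recall (memorization).

Result: Small models extrapolate but memorize poorly, while larger models do the reverse; under joint training, no model extrapolates.

Insight: Model capacity is a key factor in determining the learning mode, which may have implications for the design and deployment of small language models.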

Abstract: The relationship between memorization and generalization in large language
models (LLMs) remains an open area of research, with growing evidence that the
two are deeply intertwined. In this work, we investigate this relationship by
pre-training a series of capacity-limited Transformer models from scratch on
two synthetic character-level tasks designed to separately probe generalization
(via arithmetic extrapolation) and memorization (via factual recall). We
observe a consistent trade-off: small models extrapolate to unseen arithmetic
cases but fail to memorize facts, while larger models memorize but fail to
extrapolate. An intermediate-capacity model exhibits a similar shift toward
memorization. When trained on both tasks jointly, no model (regardless of size)
succeeds at extrapolation. These findings suggest that pre-training may
intrinsically favor one learning mode over the other. By isolating these
dynamics in a controlled setting, our study offers insight into how model
capacity shapes learning behavior and offers broader implications for the
design and deployment of small language models.

[135] SensorLM: Learning the Language of Wearable Sensors

Yuwei Zhang,Kumar Ayush,Siyuan Qiao,A. Ali Heydari,Girish Narayanswamy,Maxwell A. Xu,Ahmed A. Metwally,Shawn Xu,Jake Garrison,Xuhai Xu,Tim Althoff,Yun Liu,Pushmeet Kohli,Jiening Zhan,Mark Malhotra,Shwetak Patel,Cecilia Mascolo,Xin Liu,Daniel McDuff,Yuzhe Yang

Main category: cs.LG

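TL;DR: SensorLM is a family of sensor-language foundation models for understanding wearable sensor data through natural language. It tackles the challenge of aligning sensor data with language and builds the largest sensor-language dataset to date.

Details Motivation: Wearable sensor data is pervasive, but the lack of richly annotated paired data makes it hard to align sensor data with natural language and to interpret it.

Contribution: 1) Proposes a hierarchical caption generation pipeline; 2) builds the largest sensor-language dataset to date; 3) extends multimodal pretraining architectures (e.g., CLIP, CoCa).

Method: A hierarchical caption generation pipeline captures statistical, structural, and semantic information from sensor data, and the model design builds on multimodal pretraining architectures.

Result: SensorLM outperforms existing methods in zero-shot recognition, few-shot learning, and cross-modal retrieval, and shows scaling behavior, label efficiency, and zero-shot generalization.

Insight: SensorLM demonstrates the potential of aligning sensor data with natural language and offers a new paradigm for understanding sensor data.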

Abstract: We present SensorLM, a family of sensor-language foundation models that
enable wearable sensor data understanding with natural language. Despite its
pervasive nature, aligning and interpreting sensor data with language remains
challenging due to the lack of paired, richly annotated sensor-text
descriptions in uncurated, real-world wearable data. We introduce a
hierarchical caption generation pipeline designed to capture statistical,
structural, and semantic information from sensor data. This approach enabled
the curation of the largest sensor-language dataset to date, comprising over
59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM
extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and
recovers them as specific variants within a generic architecture. Extensive
experiments on real-world tasks in human activity analysis and healthcare
verify the superior performance of SensorLM over state-of-the-art in zero-shot
recognition, few-shot learning, and cross-modal retrieval. SensorLM also
demonstrates intriguing capabilities including scaling behaviors, label
efficiency, sensor captioning, and zero-shot generalization to unseen tasks.

Samuel Holt,Max Ruiz Luyten,Thomas Pouplin,Mihaela van der Schaar

Main category: cs.LG

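TL;DR: This paper proposes an in-context learning approach based on atomic fact augmentation and lookahead search that improves the planning ability of LLM agents, enabling more efficient multi-step reasoning in complex interactive environments.

Details Motivation: Existing large language models (LLMs) need substantial guidance or long interaction histories to work well in complex interactive environments, and they struggle to adapt to new information or to use past experience efficiently for multi-step reasoning. This paper aims to improve the planning ability of LLM agents through in-context learning, without fine-tuning.

Contribution: Proposes an LLM agent framework that combines atomic fact augmentation with recursive lookahead search. The agent extracts task-critical atomic facts from its interaction trajectories, uses them to augment the prompts of its LLM components dynamically, and plans via lookahead search.

Method: The agent extracts atomic facts from interaction trajectories and uses them to augment the prompts of the LLM components responsible for action proposal, latent world model simulation, and state-value estimation. Planning uses a depth-limited lookahead search in which the LLM simulates potential trajectories and evaluates their outcomes.

Result: On challenging interactive tasks such as TextFrozenLake and ALFWorld, the agent shows improved performance and adaptability, and its behavior improves as it accumulates experience.

Insight: With atomic fact augmentation and lookahead search, the agent can improve its planning through in-context learning without weight updates, which suggests a new way to apply LLMs to interactive tasks.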

Abstract: Large Language Models (LLMs) are increasingly capable but often require
significant guidance or extensive interaction history to perform effectively in
complex, interactive environments. Existing methods may struggle with adapting
to new information or efficiently utilizing past experiences for multi-step
reasoning without fine-tuning. We introduce a novel LLM agent framework that
enhances planning capabilities through in-context learning, facilitated by
atomic fact augmentation and a recursive lookahead search. Our agent learns to
extract task-critical "atomic facts" from its interaction trajectories. These
facts dynamically augment the prompts provided to LLM-based components
responsible for action proposal, latent world model simulation, and state-value
estimation. Planning is performed via a depth-limited lookahead search, where
the LLM simulates potential trajectories and evaluates their outcomes, guided
by the accumulated facts and interaction history. This approach allows the
agent to improve its understanding and decision-making online, leveraging its
experience to refine its behavior without weight updates. We provide a
theoretical motivation linking performance to the quality of fact-based
abstraction and LLM simulation accuracy. Empirically, our agent demonstrates
improved performance and adaptability on challenging interactive tasks,
achieving more optimal behavior as it accumulates experience, showcased in
tasks such as TextFrozenLake and ALFWorld.
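
The depth-limited lookahead described above can be sketched in a few lines. This is a hedged illustration rather than the paper's code: in the paper the three components are LLM calls whose prompts are augmented with atomic facts, whereas here `propose`, `simulate`, and `evaluate` are hypothetical plain-Python callables, and the discount factor is an assumed parameter.

```python
from typing import Any, Callable, List, Optional, Tuple

Propose = Callable[[Any], List[Any]]                      # state -> candidate actions
Simulate = Callable[[Any, Any], Tuple[Any, float, bool]]  # -> (next_state, reward, done)
Evaluate = Callable[[Any], float]                         # state -> estimated value


def lookahead_value(state: Any, depth: int, propose: Propose, simulate: Simulate,
                    evaluate: Evaluate, discount: float = 0.95) -> float:
    """Best discounted return reachable from `state` within `depth` simulated steps."""
    actions = propose(state) if depth > 0 else []
    if not actions:
        # At the depth limit (or with no actions), fall back to the value estimate.
        return evaluate(state)
    best = float("-inf")
    for action in actions:
        next_state, reward, done = simulate(state, action)
        future = 0.0 if done else lookahead_value(
            next_state, depth - 1, propose, simulate, evaluate, discount)
        best = max(best, reward + discount * future)
    return best


def plan(state: Any, depth: int, propose: Propose, simulate: Simulate,
         evaluate: Evaluate, discount: float = 0.95) -> Optional[Any]:
    """Return the proposed action with the highest lookahead value, if any."""
    best_action, best_value = None, float("-inf")
    for action in propose(state):
        next_state, reward, done = simulate(state, action)
        future = 0.0 if done else lookahead_value(
            next_state, depth - 1, propose, simulate, evaluate, discount)
        value = reward + discount * future
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```

Because the search only touches the environment through the three callables, better facts in the prompts would improve planning by improving those callables' outputs, with no change to the search itself.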

[137] Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Shuai Wang,Zhenhua Liu,Jiaheng Wei,Xuanwu Yin,Dong Li,Emad Barsoum

Main category: cs.LG

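TL;DR: Athena-PRM is a multimodal process reward model that scores each step in solving complex reasoning problems. It generates high-quality process-labeled data in a data-efficient way and substantially improves performance.

Details Motivation: Developing process reward models (PRMs) traditionally demands significant time and money, and automated labeling methods such as Monte Carlo estimation often produce noisy labels at high computational cost. A more efficient method is proposed to address these problems.

Contribution: Uses prediction consistency between weak and strong completers as the criterion for identifying reliable process labels, develops two strategies to improve PRM performance (ORM initialization and up-sampling of negative data), and validates the approach across multiple scenarios and benchmarks.

Method: Generates high-quality process-labeled data through prediction consistency, and improves PRM performance with ORM initialization and up-sampling of negative data.

Result: Athena-PRM performs strongly on multiple benchmarks: it improves WeMath and MathVista by 10.2 and 7.1 points respectively and outperforms the previous SoTA on VisualProcessBench by 3.9 F1-score.

Insight: Combining prediction consistency with data-efficient strategies improves the accuracy of multimodal reasoning evaluation and offers a new approach to optimizing complex reasoning tasks.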

Abstract: We present Athena-PRM, a multimodal process reward model (PRM) designed to
evaluate the reward score for each step in solving complex reasoning problems.
Developing high-performance PRMs typically demands significant time and
financial investment, primarily due to the necessity for step-level annotations
of reasoning steps. Conventional automated labeling methods, such as Monte
Carlo estimation, often produce noisy labels and incur substantial
computational costs. To efficiently generate high-quality process-labeled data,
we propose leveraging prediction consistency between weak and strong completers
as a criterion for identifying reliable process labels. Remarkably, Athena-PRM
demonstrates outstanding effectiveness across various scenarios and benchmarks
with just 5,000 samples. Furthermore, we also develop two effective strategies
to improve the performance of PRMs: ORM initialization and up-sampling for
negative data. We validate our approach in three specific scenarios:
verification for test time scaling, direct evaluation of reasoning step
correctness, and reward ranked fine-tuning. Our Athena-PRM consistently
achieves superior performance across multiple benchmarks and scenarios.
Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances
performance by 10.2 points on WeMath and 7.1 points on MathVista for test time
scaling. Furthermore, Athena-PRM sets the state-of-the-art (SoTA) results in
VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score,
showcasing its robust capability to accurately assess the correctness of the
reasoning step. Additionally, utilizing Athena-PRM as the reward model, we
develop Athena-7B with reward ranked fine-tuning, which outperforms the baseline
by a significant margin on five benchmarks.
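
The consistency criterion for process labels can be illustrated with a very small sketch. It assumes, hypothetically, that a weak and a strong completer have each already produced a correct/incorrect label for every reasoning step; how those labels are estimated is not shown, and the function name is invented for the example.

```python
from typing import List, Optional


def consistent_labels(weak: List[bool], strong: List[bool]) -> List[Optional[bool]]:
    """Keep a step label only where both completers agree; otherwise mark it None."""
    if len(weak) != len(strong):
        raise ValueError("both completers must label the same number of steps")
    # Agreement is treated as evidence that the label is reliable;
    # disagreeing steps are marked None so they can be dropped from training.
    return [w if w == s else None for w, s in zip(weak, strong)]
```

Steps marked `None` would simply be excluded from the process-labeled training set, trading dataset size for label reliability.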

[138] Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling

Tim Z. Xiao,Johannes Zenn,Zhen Liu,Weiyang Liu,Robert Bamler,Bernhard Schölkopf

Main category: cs.LG

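TL;DR: This paper proposes Verbalized Rejection Sampling (VRS), a natural-language method for reducing the sampling bias of large language models (LLMs), studied on Bernoulli distributions.

Details Motivation: LLMs can describe probability distributions accurately, yet they struggle to generate faithful samples from them, which limits their use in tasks that require reliable randomness.

Contribution: Proposes VRS, a natural-language adaptation of classical rejection sampling that reduces LLM sampling bias without access to model internals or heavy prompt engineering.

Method: Natural-language prompts ask the LLM to reason about proposed samples and to accept or reject them, adapting the classical rejection sampling algorithm.

Result: Experiments show that VRS substantially reduces sampling bias, and the theoretical analysis supports its effectiveness.

Insight: Classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability without requiring access to model internals.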

Abstract: Large language models (LLMs) can often accurately describe probability
distributions using natural language, yet they still struggle to generate
faithful samples from them. This mismatch limits their use in tasks requiring
reliable stochasticity, such as Monte Carlo methods, agent-based simulations,
and randomized decision-making. We investigate this gap between knowledge and
sampling in the context of Bernoulli distributions. We introduce Verbalized
Rejection Sampling (VRS), a natural-language adaptation of classical rejection
sampling that prompts the LLM to reason about and accept or reject proposed
samples. Despite relying on the same Bernoulli mechanism internally, VRS
substantially reduces sampling bias across models. We provide theoretical
analysis showing that, under mild assumptions, VRS improves over direct
sampling, with gains attributable to both the algorithm and prompt design. More
broadly, our results show how classical probabilistic tools can be verbalized
and embedded into LLM workflows to improve reliability, without requiring
access to model internals or heavy prompt engineering.
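
For reference, the classical rejection sampling that VRS verbalizes can be sketched for a Bernoulli target. This is the textbook algorithm in ordinary code, not the paper's prompting procedure: a biased coin with parameter `q` stands in for an imperfect proposal source, and the function name and parameters are assumptions for the example.

```python
import random


def rejection_sample_bernoulli(p: float, q: float, rng: random.Random) -> int:
    """Draw one sample from Bernoulli(p) using proposals drawn from Bernoulli(q)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    # Envelope constant M with target(x) <= M * proposal(x) for x in {0, 1}.
    m = max(p / q, (1.0 - p) / (1.0 - q))
    while True:
        x = 1 if rng.random() < q else 0          # propose from the biased coin
        target = p if x == 1 else 1.0 - p
        proposal = q if x == 1 else 1.0 - q
        # Accept with probability target(x) / (M * proposal(x)).
        if rng.random() < target / (m * proposal):
            return x
```

Even with a strongly biased proposal (for example `q = 0.7` when the target is `p = 0.3`), the accepted samples follow the target distribution; the cost is that on average only a fraction `1 / M` of proposals is accepted.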

[139] MultiNet: An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models

Pranav Guruprasad,Yangyue Wang,Harshvardhan Sikka

Main category: cs.LG

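TL;DR: MultiNet is an open-source software toolkit and benchmark suite for evaluating and adapting multimodal action models across vision, language, and action domains.

Details Motivation: Multimodal action models are promising for general-purpose agentic systems, but standardized evaluation tools and datasets are lacking.

Contribution: Proposes MultiNet, a fully open-source benchmark and software ecosystem that provides standardized evaluation protocols and a composite dataset.

Method: Develops an open-source toolkit for downloading data, evaluating models, and running standardized tests across a wide range of tasks.

Result: MultiNet has been used in downstream research that reveals the generalization limits of vision-language-action models.

Insight: Evaluating multimodal models requires standardized cross-domain tools and rich datasets to drive further research.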

Abstract: Recent innovations in multimodal action models represent a promising
direction for developing general-purpose agentic systems, combining visual
understanding, language comprehension, and action generation. We introduce
MultiNet - a novel, fully open-source benchmark and surrounding software
ecosystem designed to rigorously evaluate and adapt models across vision,
language, and action domains. We establish standardized evaluation protocols
for assessing vision-language models (VLMs) and vision-language-action models
(VLAs), and provide open source software to download relevant data, models, and
evaluations. Additionally, we provide a composite dataset with over 1.3
trillion tokens of image captioning, visual question answering, commonsense
reasoning, robotic control, digital game-play, simulated
locomotion/manipulation, and many more tasks. The MultiNet benchmark,
framework, toolkit, and evaluation harness have been used in downstream
research on the limitations of VLA generalization.

[140] LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

Jiaqi Tang,Yu Xia,Yi-Feng Wu,Yuwei Hu,Yuhui Chen,Qing-Guo Chen,Xiaogang Xu,Xiangyu Wu,Hao Lu,Yanqing Ma,Shiyin Lu,Qifeng Chen

Main category: cs.LG

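TL;DR: The paper proposes LPO, a method that improves the interaction accuracy of GUI agents by optimizing location preferences. It uses information entropy and a dynamic location reward function to raise interaction precision and achieves SOTA results in experiments.

Details Motivation: Current GUI agents rely mainly on SFT for spatial localization, but these methods are limited in how accurately they perceive positional data, while approaches such as reinforcement learning fail to assess positional accuracy effectively. A more effective solution is needed.

Contribution: Proposes Location Preference Optimization (LPO), which uses information entropy to predict information-rich regions, optimizes interaction preferences with a dynamic location reward function, and relies on GRPO to support broader exploration of GUI environments.

Method: LPO identifies regions of high information density via information entropy, introduces a dynamic location reward function based on physical distance, and optimizes the interaction policy with GRPO.

Result: Experiments show that LPO achieves SOTA performance on both offline benchmarks and online evaluations.

Insight: Optimizing with positional data is essential to the interaction precision of GUI agents, and combining information entropy with a dynamic reward mechanism can markedly improve task completion quality.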

Abstract: The advent of autonomous agents is transforming interactions with Graphical
User Interfaces (GUIs) by employing natural language as a powerful
intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods
in current GUI agents for achieving spatial localization, these methods face
substantial challenges due to their limited capacity to accurately perceive
positional data. Existing strategies, such as reinforcement learning, often
fail to assess positional accuracy effectively, thereby restricting their
utility. In response, we introduce Location Preference Optimization (LPO), a
novel approach that leverages locational data to optimize interaction
preferences. LPO uses information entropy to predict interaction positions by
focusing on zones rich in information. Besides, it further introduces a dynamic
location reward function based on physical distance, reflecting the varying
importance of interaction positions. Supported by Group Relative Preference
Optimization (GRPO), LPO facilitates an extensive exploration of GUI
environments and significantly enhances interaction precision. Comprehensive
experiments demonstrate LPO’s superior performance, achieving SOTA results
across both offline benchmarks and real-world online evaluations. Our code will
be made publicly available soon, at https://github.com/AIDC-AI/LPO.

[141] FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models

Weiying Zheng,Ziyue Lin,Pengxin Guo,Yuyin Zhou,Feifei Wang,Liangqiong Qu

Main category: cs.LG

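TL;DR: The paper introduces FedVLMBench, the first systematic benchmark for federated fine-tuning of vision-language models (VLMs). It covers multiple architectures, strategies, algorithms, and datasets, and reports key findings such as how data heterogeneity and task type affect FL methods.

Details Motivation: Most existing VLM fine-tuning methods rely on centralized training, which is unsuitable for domains with strict privacy requirements. Federated Learning (FL) has been introduced, but no systematic benchmark exists to evaluate it comprehensively.

Contribution: Proposes FedVLMBench, the first systematic benchmark for federated VLM fine-tuning, covering multiple architectures, strategies, datasets, and tasks, and provides key findings and tools.

Method: Integrates two mainstream VLM architectures (encoder-based and encoder-free), four fine-tuning strategies, five FL algorithms, and six multimodal datasets, covering a range of scenarios and task categories.

Result: For encoder-based VLMs in FL, a 2-layer MLP connector with concurrent tuning of the connector and the LLM works best; FL methods are more sensitive to data heterogeneity in vision-centric tasks.

Insight: Data heterogeneity and task type strongly affect the performance of FL methods, which gives empirical guidance for privacy-preserving training of multimodal foundation models.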

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in
cross-modal understanding and generation by integrating visual and textual
information. While instruction tuning and parameter-efficient fine-tuning
methods have substantially improved the generalization of VLMs, most existing
approaches rely on centralized training, posing challenges for deployment in
domains with strict privacy requirements like healthcare. Recent efforts have
introduced Federated Learning (FL) into VLM fine-tuning to address these
privacy concerns, yet comprehensive benchmarks for evaluating federated
fine-tuning strategies, model architectures, and task generalization remain
lacking. In this work, we present FedVLMBench, the first systematic
benchmark for federated fine-tuning of VLMs. FedVLMBench integrates two
mainstream VLM architectures (encoder-based and encoder-free), four fine-tuning
strategies, five FL algorithms, six multimodal datasets spanning four
cross-domain single-task scenarios and two cross-domain multitask settings,
covering four distinct downstream task categories. Through extensive
experiments, we uncover key insights into the interplay between VLM
architectures, fine-tuning strategies, data heterogeneity, and multi-task
federated optimization. Notably, we find that a 2-layer multilayer perceptron
(MLP) connector with concurrent connector and LLM tuning emerges as the optimal
configuration for encoder-based VLMs in FL. Furthermore, current FL methods
exhibit significantly higher sensitivity to data heterogeneity in
vision-centric tasks than text-centric ones, across both encoder-free and
encoder-based VLM architectures. Our benchmark provides essential tools,
datasets, and empirical guidance for the research community, offering a
standardized platform to advance privacy-preserving, federated training of
multimodal foundation models.

cs.RO [Back]

[142] UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation

Yihe Tang,Wenlong Huang,Yingke Wang,Chengshu Li,Roy Yuan,Ruohan Zhang,Jiajun Wu,Li Fei-Fei

Main category: cs.RO

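TL;DR: The paper proposes UAD, an unsupervised method that extracts object affordance knowledge from foundation models and distills it into a task-conditioned affordance model without manual annotation, achieving broad generalization under open-ended task instructions.

Details Motivation: Existing visual affordance prediction methods depend on manual annotations or a predefined set of tasks, which limits their generalization to open-ended tasks and unstructured environments. UAD aims to solve this in an unsupervised way.

Contribution: 1. Proposes UAD, which needs no manual annotation and generates affordance annotations automatically with foundation models; 2. trains a lightweight task-conditioned decoder that generalizes from large-scale simulated data to real scenes; 3. shows that, in imitation learning, a policy generalizes to new objects and tasks from only a few demonstrations.

Method: Leverages the complementary strengths of large vision models and vision-language models to annotate a large number of <instruction, visual affordance> pairs automatically, then trains a lightweight task-conditioned decoder on top of frozen features.

Result: After training on simulated data, UAD generalizes well to real-world scenes and human activities; an imitation learning policy generalizes to new object categories and task variations after as few as 10 demonstrations.

Insight: Combining complementary foundation models is an efficient way to solve unsupervised affordance annotation, and a lightweight task-conditioned model performs well in real-world scenes.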

Abstract: Understanding fine-grained object affordances is imperative for robots to
manipulate objects in unstructured environments given open-ended task
instructions. However, existing methods of visual affordance predictions often
rely on manually annotated data or conditions only on a predefined set of
tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for
distilling affordance knowledge from foundation models into a task-conditioned
affordance model without any manual annotations. By leveraging the
complementary strengths of large vision models and vision-language models, UAD
automatically annotates a large-scale dataset with detailed <instruction,
visual affordance> pairs. Training only a lightweight task-conditioned
decoder atop frozen features, UAD exhibits notable generalization to
in-the-wild robotic scenes and to various human activities, despite only being
trained on rendered objects in simulation. Using affordance provided by UAD as
the observation space, we show an imitation learning policy that demonstrates
promising generalization to unseen object instances, object categories, and
even variations in task instructions after training on as few as 10
demonstrations. Project website: https://unsup-affordance.github.io/

[143] DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects

Guanghu Xie,Zhiduo Jiang,Yonglong Zhang,Yang Liu,Zongwu Xie,Baoshi Cao,Hong Liu

Main category: cs.RO

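TL;DR: The paper proposes DCIRNet, a multimodal depth completion network that addresses missing depth information for transparent and reflective objects and achieves a 44% improvement in grasp success rate.

Details Motivation: The unique visual properties of transparent and reflective objects (such as specular reflection and light transmission) make it hard for depth sensors to estimate depth accurately, which degrades downstream tasks.

Contribution: 1. Proposes DCIRNet, a multimodal network that integrates RGB images and depth maps; 2. Designs a novel multimodal feature fusion module; 3. Introduces a multi-stage supervision and depth refinement strategy to improve depth completion.

Method: 1. Takes RGB images and depth maps as input; 2. Extracts complementary information through multimodal feature fusion; 3. Progressively refines the depth estimate with multi-stage supervision and iterative refinement.

Result: DCIRNet performs strongly on public datasets, and integrating it into dexterous grasping frameworks improves the grasp success rate by 44%, validating the method's effectiveness.

Insight: Integrating multimodal data with iterative refinement can markedly improve depth estimation for transparent and reflective objects, and in turn the performance of robotic grasping.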

Abstract: Transparent and reflective objects in everyday environments pose significant
challenges for depth sensors due to their unique visual properties, such as
specular reflections and light transmission. These characteristics often lead
to incomplete or inaccurate depth estimation, which severely impacts downstream
geometry-based vision tasks, including object recognition, scene
reconstruction, and robotic manipulation. To address the issue of missing depth
information in transparent and reflective objects, we propose DCIRNet, a novel
multimodal depth completion network that effectively integrates RGB images and
depth maps to enhance depth estimation quality. Our approach incorporates an
innovative multimodal feature fusion module designed to extract complementary
information between RGB images and incomplete depth maps. Furthermore, we
introduce a multi-stage supervision and depth refinement strategy that
progressively improves depth completion and effectively mitigates the issue of
blurred object boundaries. We integrate our depth completion model into
dexterous grasping frameworks and achieve a 44% improvement in the grasp
success rate for transparent and reflective objects. We conduct extensive
experiments on public datasets, where DCIRNet demonstrates superior
performance. The experimental results validate the effectiveness of our
approach and confirm its strong generalization capability across various
transparent and reflective objects.

[144] From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

Irving Fang,Juexiao Zhang,Shengbang Tong,Chen Feng

Main category: cs.RO

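TL;DR: The paper presents a unified probing suite of 50 simulation-based tasks for systematically evaluating the generalization of Vision-Language-Action (VLA) models, and finds that VLM pretraining gives models strong perception and high-level planning but unreliable action execution.

Details Motivation: Current evaluation of VLA models is insufficient: traditional imitation learning benchmarks lack language instructions, and existing VLA benchmarks cover few tasks, which makes it hard to quantify how much VLM pretraining contributes to the generalization of downstream robot policies.

Contribution: Introduces a unified probing suite of 50 simulation-based tasks, systematically evaluates several VLA architectures for generalization, and analyzes how VLM pretraining affects action execution.

Method: Designs a simulation task suite covering 10 subcategories that span language instruction, vision, and objects, and evaluates several VLA models on it.

Result: VLM backbones provide strong perception and planning ("intention"), but action execution is unreliable; fine-tuning on action data can erode the original VLM's generalist reasoning abilities.

Insight: There is a clear gap between perception and action execution in VLA models, and closing it needs further research.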

Abstract: One promise that Vision-Language-Action (VLA) models hold over traditional
imitation learning for robotics is to leverage the broad generalization
capabilities of large Vision-Language Models (VLMs) to produce versatile,
“generalist” robot policies. However, current evaluations of VLAs remain
insufficient. Traditional imitation learning benchmarks are unsuitable due to
the lack of language instructions. Emerging benchmarks for VLAs that
incorporate language often come with limited evaluation tasks and do not intend
to investigate how much VLM pretraining truly contributes to the generalization
capabilities of the downstream robotic policy. Meanwhile, much research relies
on real-world robot setups designed in isolation by different institutions,
which creates a barrier for reproducibility and accessibility. To address this
gap, we introduce a unified probing suite of 50 simulation-based tasks across
10 subcategories spanning language instruction, vision, and objects. We
systematically evaluate several state-of-the-art VLA architectures on this
suite to understand their generalization capability. Our results show that
while VLM backbones endow VLAs with robust perceptual understanding and
high-level planning, which we refer to as good intentions, this does not reliably
translate into precise motor execution: when faced with out-of-distribution
observations, policies often exhibit coherent intentions, but falter in action
execution. Moreover, finetuning on action data can erode the original VLM’s
generalist reasoning abilities. We release our task suite and evaluation code
to serve as a standardized benchmark for future VLAs and to drive research on
closing the perception-to-action gap. More information, including the source
code, can be found at https://ai4ce.github.io/INT-ACT/

[145] Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

Wenbo Zhang,Tianrun Hu,Yanyuan Qiao,Hanbo Zhang,Yuchu Qin,Yang Li,Jiajun Liu,Tao Kong,Lingqiao Liu,Xiao Ma

Main category: cs.RO

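TL;DR: CoA is a new visuo-motor policy paradigm that generates a full trajectory by backward reasoning from task-specific goals, imposing global-to-local constraints on actions and improving robot performance on simulated and real-world tasks.

Details Motivation: Conventional methods predict the next action forward and may lack a global view; CoA uses backward reasoning and an action-level Chain-of-Thought (CoT) process to generate goal-driven trajectories and improve generalization.

Contribution: Proposes the CoA paradigm, which generates trajectories backward from the task goal; designs mechanisms such as continuous action representation and dynamic stopping for variable-length trajectory generation; achieves state-of-the-art performance on simulated and real-world tasks.

Method: Uses backward autoregressive modeling: it first generates a keyframe action (the task goal), then generates the subsequent actions autoregressively; it adds continuous action token representation, dynamic stopping, reverse temporal ensemble, and multi-token prediction.

Result: Achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks, with strong spatial generalization.

Insight: Backward reasoning integrates the task goal into action generation more effectively, and the global-to-local structure significantly improves action planning and generalization.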

Abstract: We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built
upon Trajectory Autoregressive Modeling. Unlike conventional approaches that
predict the next-step action(s) forward, CoA generates an entire trajectory by
explicit backward reasoning with task-specific goals through an action-level
Chain-of-Thought (CoT) process. This process is unified within a single
autoregressive structure: (1) the first token corresponds to a stable keyframe
action that encodes the task-specific goals; and (2) subsequent action tokens
are generated autoregressively, conditioned on the initial keyframe and
previously predicted actions. This backward action reasoning enforces a
global-to-local structure, allowing each local action to be tightly constrained
by the final goal. To further realize the action reasoning structure, CoA
incorporates four complementary designs: continuous action token
representation; dynamic stopping for variable-length trajectory generation;
reverse temporal ensemble; and multi-token prediction to balance action chunk
modeling with global structure. As a result, CoA gives strong spatial
generalization capabilities while preserving the flexibility and simplicity of
a visuo-motor policy. Empirically, we observe that CoA achieves state-of-the-art
performance across 60 RLBench tasks and 8 real-world manipulation tasks.
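
The backward, goal-first generation order described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the 2D action type, the `step_fn` interface, the `toy_step` stand-in for the learned model, and the distance-threshold stopping rule are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[float, float]  # hypothetical 2D action, for illustration only


def generate_backward(
    keyframe: Action,
    current: Action,
    step_fn: Callable[[Sequence[Action], Action], Action],
    stop_radius: float = 0.05,
    max_len: int = 100,
) -> List[Action]:
    """Generate actions from the goal keyframe back toward the current pose,
    then return them in execution (forward) order."""
    # The first token is the keyframe action that encodes the goal.
    reverse_traj: List[Action] = [keyframe]
    while len(reverse_traj) < max_len:
        last = reverse_traj[-1]
        dist = ((last[0] - current[0]) ** 2 + (last[1] - current[1]) ** 2) ** 0.5
        # Assumed stopping rule: stop once the generated action reaches the
        # current pose, which yields variable-length trajectories.
        if dist <= stop_radius:
            break
        # Each new action is conditioned on the keyframe and all earlier actions.
        reverse_traj.append(step_fn(reverse_traj, current))
    return reverse_traj[::-1]


def toy_step(prefix: Sequence[Action], current: Action, step: float = 0.25) -> Action:
    """Stand-in for the learned model: move a fixed distance toward `current`."""
    x, y = prefix[-1]
    dx, dy = current[0] - x, current[1] - y
    dist = (dx * dx + dy * dy) ** 0.5
    if dist <= step:
        return current
    return (x + step * dx / dist, y + step * dy / dist)


trajectory = generate_backward(keyframe=(1.0, 0.0), current=(0.0, 0.0), step_fn=toy_step)
```

Because every generated action extends a sequence that starts at the keyframe, the goal is fixed before any intermediate action is produced, which is the global-to-local constraint the abstract refers to.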

eess.AS

[146] Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

Suhas BN,Andrew M. Sherrill,Jyoti Alaparthi,Dominik Mattioli,Rosa I. Arriaga,Chris W. Wiese,Saeed Abdullah

Main category: eess.AS

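TL;DR: The paper proposes a LoRA-fine-tuned audio-language model for automatically localizing the temporal boundaries of key phases in Prolonged Exposure (PE) therapy, reaching a mean absolute error (MAE) of 5.3 seconds on a dataset of real PE sessions.

Details Motivation: In PE therapy, evaluating therapist fidelity relies on manual review of session recordings, which is labor-intensive. The paper aims to localize the start and stop times of the core PE phases automatically, in support of clinical supervision and training.

Contribution: 1. Proposes a PE phase localization method that combines LoRA fine-tuning with an audio-language model; 2. Uses soft supervision guided by task-specific prompts to refine boundary prediction; 3. Validates the method on a dataset of real sessions.

Method: 1. Fine-tunes the pretrained Qwen2-Audio model with lightweight LoRA adaptation; 2. Feeds the model 30-second audio-transcript windows; 3. Generates labels via LLM-based prompting, verified by trained raters; 4. Predicts normalized boundary offsets under soft supervision.

Result: On 313 real PE sessions, the best configuration (LoRA rank 8, 30-second windows) achieves an MAE of 5.3 seconds; window size and LoRA rank both noticeably affect performance.

Insight: 1. Context granularity (window size) is critical for temporal localization; 2. Lightweight LoRA adaptation suits small task-specific datasets; 3. Soft supervision is effective for learning fuzzy boundaries.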

Abstract: Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic
stress disorder (PTSD), but evaluating therapist fidelity remains
labor-intensive due to the need for manual review of session recordings. We
present a method for the automatic temporal localization of key PE fidelity
elements – identifying their start and stop times – directly from session
audio and transcripts. Our approach fine-tunes a large pre-trained
audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process
focused 30-second windows of audio-transcript input. Fidelity labels for three
core protocol phases – therapist orientation (P1), imaginal exposure (P2), and
post-imaginal processing (P3) – are generated via LLM-based prompting and
verified by trained raters. The model is trained to predict normalized boundary
offsets using soft supervision guided by task-specific prompts. On a dataset of
313 real PE sessions, our best configuration (LoRA rank 8, 30s windows)
achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further
analyze the effects of window size and LoRA rank, highlighting the importance
of context granularity and model adaptation. This work introduces a scalable
framework for fidelity tracking in PE therapy, with potential to support
clinician training, supervision, and quality assurance.
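
To make the reported metric concrete, the sketch below shows one way a normalized boundary offset inside a 30-second window could be mapped back to a session timestamp and scored with MAE. The abstract does not give the offset convention, so the mapping (offset 0 is the window start, offset 1 is the window end) is an assumption, and the function names and example values are hypothetical.

```python
from typing import Sequence


def offset_to_seconds(window_start_s: float, offset: float, window_len_s: float = 30.0) -> float:
    """Map a normalized offset in [0, 1] to an absolute session time in seconds.

    Assumed convention: 0 is the start of the window and 1 is its end.
    """
    if not 0.0 <= offset <= 1.0:
        raise ValueError("offset must lie in [0, 1]")
    return window_start_s + offset * window_len_s


def mean_absolute_error(pred_s: Sequence[float], true_s: Sequence[float]) -> float:
    """Average absolute difference between predicted and reference times."""
    if len(pred_s) != len(true_s) or not pred_s:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred_s, true_s)) / len(pred_s)


# Hypothetical example: two predicted boundaries and their reference times.
predicted = [offset_to_seconds(120.0, 0.5), offset_to_seconds(600.0, 0.25)]
reference = [130.0, 610.0]
mae = mean_absolute_error(predicted, reference)  # 3.75 seconds
```

Under this convention, an MAE of 5.3 seconds corresponds to an average error of roughly 0.18 in normalized units for a 30-second window.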