cs.CL

[1] TaskCraft: Automated Generation of Agentic Tasks

Dingfeng Shi,Jingyi Cao,Qianben Chen,Weichen Sun,Weizhen Li,Hongxuan Lu,Fangchen Dong,Tianrui Qin,King Zhu,Minghao Yang,Jian Yang,Ge Zhang,Jiaheng Liu,Changwang Zhang,Jun Wang,Yuchen Eleanor Jiang,Wangchunshu Zhou

Main category: cs.CL

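TL;DR: TaskCraft is an automated workflow for generating difficulty-scalable, multi-tool, verifiable agentic tasks. Using depth-based and width-based extensions, it addresses two gaps in existing data: the lack of tool interaction and the reliance on human annotation.

Motivation: Research on agentic tasks faces two problems: existing instruction data lacks tool interaction, and agentic benchmarks rely on costly human annotation. TaskCraft aims to address both by generating tasks automatically.

Contribution: 1. Proposes TaskCraft, a framework for automatically generating agentic tasks. 2. Generates structurally and hierarchically complex tasks through depth-based and width-based extensions. 3. Provides a synthetic dataset of approximately 36,000 tasks.

Method: TaskCraft expands atomic (basic) tasks with depth-based and width-based extensions to produce complex tasks. Depth-based extension adds task steps, and width-based extension introduces multi-tool interaction. The generated tasks support difficulty adjustment and verification.

Result: Experiments show that the generated tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models.

Insight: By generating tasks automatically, TaskCraft offers a new route to scalable and diverse agentic tasks while reducing the cost of human annotation.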

Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool
use, and adaptive reasoning, are becoming increasingly central to the
advancement of NLP and AI. However, existing instruction data lacks tool
interaction, and current agentic benchmarks rely on costly human annotation,
limiting their scalability. We introduce TaskCraft, an automated
workflow for generating difficulty-scalable, multi-tool, and verifiable agentic
tasks with execution trajectories. TaskCraft expands atomic tasks using
depth-based and width-based extensions to create structurally and
hierarchically complex challenges. Empirical results show that these tasks
improve prompt optimization in the generation workflow and enhance supervised
fine-tuning of agentic foundation models. We present a large-scale synthetic
dataset of approximately 36,000 tasks with varying difficulty to support future
research on agent tuning and evaluation.

[2] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

Christodoulos Constantinides,Shuxin Lin,Nianjun Zhou,Dhaval Patel

Main category: cs.CL

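TL;DR: This paper proposes Chat-of-Thought, a multi-agent system for generating Failure Modes and Effects Analysis (FMEA) documents for industrial assets, which uses multi-role collaboration and dynamic task routing to optimize generation and validation.

Motivation: Generating FMEA documents in industrial asset management poses efficiency and quality challenges. Conventional single-agent approaches struggle with complex requirements, which calls for a collaborative, dynamic system.

Contribution: Proposes the Chat-of-Thought system, which uses multi-agent collaboration, dynamic task routing, and the Chat of Thought mechanism to improve the efficiency and accuracy of FMEA document generation.

Method: Multiple role-specific LLM agents collaborate, with dynamic task routing and a Chat of Thought mechanism; FMEA generation and validation proceed through template-driven, context-aware collaboration.

Result: In the industrial equipment monitoring domain, the system demonstrates the potential to generate and validate FMEA documents and to handle collaboration in complex scenarios.

Insight: Multi-agent collaboration and dynamic role assignment can improve performance on complex tasks, particularly in domains such as FMEA that need validation from multiple perspectives.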

Abstract: This paper presents a novel multi-agent system called Chat-of-Thought,
designed to facilitate the generation of Failure Modes and Effects Analysis
(FMEA) documents for industrial assets. Chat-of-Thought employs multiple
collaborative Large Language Model (LLM)-based agents with specific roles,
leveraging advanced AI techniques and dynamic task routing to optimize the
generation and validation of FMEA tables. A key innovation in this system is
the introduction of a Chat of Thought, where dynamic, multi-persona-driven
discussions enable iterative refinement of content. This research explores the
application domain of industrial equipment monitoring, highlights key
challenges, and demonstrates the potential of Chat-of-Thought in addressing
these challenges through interactive, template-driven workflows and
context-aware agent collaboration.

[3] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering

Caijun Jia,Nan Xu,Jingxuan Wei,Qingli Wang,Lei Wang,Bihui Yu,Junnan Zhu

Main category: cs.CL

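TL;DR: Proposes ChartReasoner, a code-driven two-stage framework that converts charts into structured ECharts code with high fidelity and automatically generates reasoning trajectories, improving long-chain reasoning in chart question answering.

Motivation: Conventional multimodal reasoning methods lose critical visual detail when converting visual tasks into textual ones, especially in chart question answering. The core challenge is to preserve a chart's structural and semantic information while still reasoning efficiently.

Contribution: 1. Proposes ChartReasoner, a code-driven two-stage framework; 2. Designs a high-fidelity chart conversion model and an automated pipeline for generating reasoning trajectories; 3. Performs comparably with state-of-the-art open-source models on four public benchmarks and approaches GPT-4o in out-of-domain settings.

Method: 1. Train a model to convert chart images into ECharts code; 2. Use a synthesis pipeline, with a code validator that filters out low-quality samples, to generate reasoning data; 3. Train the multimodal model with a combination of supervised fine-tuning and reinforcement learning.

Result: On four public benchmarks, ChartReasoner preserves chart detail, performs comparably with state-of-the-art open-source models while using fewer parameters, and approaches GPT-4o in out-of-domain settings.

Insight: A code-driven approach preserves visual detail effectively, and automated data synthesis is key to improving multimodal reasoning performance.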

Abstract: Recently, large language models have shown remarkable reasoning capabilities
through long-chain reasoning before responding. However, how to extend this
capability to visual reasoning tasks remains an open challenge. Existing
multimodal reasoning approaches convert such visual reasoning tasks into
textual reasoning tasks via several image-to-text conversions, which often lose
critical structural and semantic information embedded in visualizations,
especially for tasks like chart question answering that require a large amount
of visual details. To bridge this gap, we propose ChartReasoner, a code-driven
novel two-stage framework designed to enable precise, interpretable reasoning
over charts. We first train a high-fidelity model to convert diverse chart
images into structured ECharts code, preserving both layout and data semantics
as losslessly as possible. Then, we design a general chart reasoning data
synthesis pipeline, which leverages this pretrained transport model to
automatically and scalably generate chart reasoning trajectories and utilizes a
code validator to filter out low-quality samples. Finally, we train the final
multimodal model using a combination of supervised fine-tuning and
reinforcement learning on our synthesized chart reasoning dataset.
Experimental results on four public benchmarks clearly demonstrate the
effectiveness of our proposed ChartReasoner. It can preserve the original
details of the charts as much as possible and perform comparably with
state-of-the-art open-source models while using fewer parameters, approaching
the performance of proprietary systems like GPT-4o in out-of-domain settings.

[4] Unsupervised Elicitation of Language Models

Jiaxin Wen,Zachary Ankner,Arushi Somani,Peter Hase,Samuel Marks,Jacob Goldman-Wetzler,Linda Petrini,Henry Sleight,Collin Burns,He He,Shi Feng,Ethan Perez,Jan Leike

Main category: cs.CL

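TL;DR: Proposes Internal Coherence Maximization (ICM), an unsupervised algorithm that fine-tunes pretrained language models on their own generated labels without external supervision. It matches training on golden supervision and outperforms training on crowdsourced human supervision.

Motivation: When language model capabilities exceed human ability, high-quality human supervision is hard to obtain, so unsupervised methods are needed to steer models toward downstream tasks.

Contribution: Proposes ICM, which fine-tunes language models without external supervision, and shows that on several tasks it outperforms training on crowdsourced human labels.

Method: ICM fine-tunes a model on labels that the model generates itself, chosen to maximize internal coherence, and needs no external human annotation.

Result: On GSM8k-verification, TruthfulQA, and Alpaca reward modeling, ICM matches golden supervision and outperforms crowdsourced human supervision. Where models are strongly superhuman, it elicits those capabilities better than training on human labels.

Insight: Unsupervised methods have an advantage where model capabilities exceed human ability and may be a viable path for training frontier models.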

Abstract: To steer pretrained language models for downstream tasks, today’s
post-training paradigm relies on humans to specify desired behaviors. However,
for models with superhuman capabilities, it is difficult or impossible to get
high-quality human supervision. To address this challenge, we introduce a new
unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune
pretrained language models on their own generated labels, without
external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward
modeling tasks, our method matches the performance of training on golden
supervision and outperforms training on crowdsourced human supervision. On
tasks where LMs’ capabilities are strongly superhuman, our method can elicit
those capabilities significantly better than training on human labels. Finally,
we show that our method can improve the training of frontier LMs: we use our
method to train an unsupervised reward model and use reinforcement learning to
train a Claude 3.5 Haiku-based assistant. Both the reward model and the
assistant outperform their human-supervised counterparts.

[5] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective

Yi Wang,Max Kreminski

Main category: cs.CL

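TL;DR: This paper examines LLM story generation from a narrative planning perspective, analyzing the potential and challenges of generating high-quality stories. Using a benchmark based on literature examples, it finds that GPT-4 tier LLMs can maintain causal soundness at small scales but still struggle with character intentionality and dramatic conflict.

Motivation: LLMs are widely used for story generation, but automatic evaluation methods are limited and manual evaluation is costly and subjective. Computational narratology offers a theoretical account of what makes a good story, and this work uses narrative planning problems to better understand LLMs' generation capabilities.

Contribution: Presents a narrative planning benchmark based on literature examples for systematically evaluating LLMs on causal soundness, character intentionality, and dramatic conflict. LLMs do well on small-scale stories, but the harder planning problems require LLMs trained with reinforcement learning for complex reasoning.

Method: Builds a narrative planning benchmark from literature examples and evaluates LLMs along three narrative dimensions (causality, character intentionality, and conflict). Experiments analyze GPT-4 tier and other models.

Result: GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging and requires LLMs trained with reinforcement learning for complex reasoning.

Insight: LLMs show some potential in story generation, but their ability is limited by narrative complexity. Advanced narrative requirements such as character intentionality and dramatic conflict need further work, particularly for applications in game environments.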

Abstract: Story generation has been a prominent application of Large Language Models
(LLMs). However, understanding LLMs’ ability to produce high-quality stories
remains limited due to challenges in automatic evaluation methods and the high
cost and subjectivity of manual evaluation. Computational narratology offers
valuable insights into what constitutes a good story, which has been applied in
the symbolic narrative planning approach to story generation. This work aims to
deepen the understanding of LLMs’ story generation capabilities by using them
to solve narrative planning problems. We present a benchmark for evaluating
LLMs on narrative planning based on literature examples, focusing on causal
soundness, character intentionality, and dramatic conflict. Our experiments
show that GPT-4 tier LLMs can generate causally sound stories at small scales,
but planning with character intentionality and dramatic conflict remains
challenging, requiring LLMs trained with reinforcement learning for complex
reasoning. The results offer insights on the scale of stories that LLMs can
generate while maintaining quality from different aspects. Our findings also
highlight interesting problem-solving behaviors and shed light on challenges
and considerations for applying LLM narrative planning in game environments.

[6] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Shubhashis Roy Dipta,Francis Ferraro

Main category: cs.CL

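TL;DR: Q2E is a query-to-event decomposition method, built on large language models (LLMs) and vision-language models (VLMs), for zero-shot multilingual text-to-video retrieval. It outperforms several state-of-the-art baselines.

Motivation: Human queries about complex real-world events are often overly simplified, which hurts video retrieval. Q2E aims to improve retrieval by decomposing the query and drawing on the latent knowledge in LLMs and VLMs.

Contribution: 1. Proposes Q2E, which decomposes queries to improve video retrieval and is adaptable across datasets, domains, LLMs, and VLMs; 2. Shows how to apply the approach to both visual and speech-based inputs; 3. Adopts entropy-based fusion scoring for zero-shot fusion.

Method: Q2E uses the knowledge embedded in LLMs and VLMs to decompose the user query, then combines the multimodal knowledge with entropy-based fusion scoring for zero-shot video retrieval.

Result: Across two diverse datasets and multiple retrieval metrics, Q2E outperforms several state-of-the-art baselines, and integrating audio information significantly improves retrieval.

Insight: Decomposing complex queries and combining multimodal information such as audio can significantly improve video retrieval, especially in zero-shot and multilingual settings.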

Abstract: Recent approaches have shown impressive proficiency in extracting and
leveraging parametric knowledge from Large-Language Models (LLMs) and
Vision-Language Models (VLMs). In this work, we consider how we can improve the
identification and retrieval of videos related to complex real-world events by
automatically extracting latent parametric knowledge about those events. We
present Q2E: a Query-to-Event decomposition method for zero-shot multilingual
text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our
approach demonstrates that we can enhance the understanding of otherwise overly
simplified human queries by decomposing the query using the knowledge embedded
in LLMs and VLMs. We additionally show how to apply our approach to both visual
and speech-based inputs. To combine this varied multimodal knowledge, we adopt
entropy-based fusion scoring for zero-shot fusion. Through evaluations on two
diverse datasets and multiple retrieval metrics, we demonstrate that Q2E
outperforms several state-of-the-art baselines. Our evaluation also shows that
integrating audio information can significantly improve text-to-video
retrieval. We have released code and data for future research.
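
The abstract names entropy-based fusion scoring but does not give its formula. The sketch below shows one plausible reading, under stated assumptions: each modality's similarity scores are turned into a distribution with a softmax, and a modality is weighted by how far its entropy falls below the maximum, so a confident (peaked) modality counts for more. The function names, the weighting rule, and the example scores are all hypothetical and are not taken from the Q2E release.

```python
import math


def softmax(scores):
    """Convert raw similarity scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_fusion(modality_scores):
    """Fuse per-modality scores over the same candidate videos.

    Assumed rule (not from the paper): weight = 1 - H / log(n), so a
    uniform, uninformative modality gets weight 0.
    """
    names = list(modality_scores)
    n = len(modality_scores[names[0]])
    if n < 2:
        raise ValueError("need at least two candidate videos")
    max_entropy = math.log(n)
    dists, weights = {}, {}
    for name in names:
        dists[name] = softmax(modality_scores[name])
        weights[name] = max(0.0, 1.0 - entropy(dists[name]) / max_entropy)
    total_weight = sum(weights.values())
    if total_weight == 0.0:
        # Every modality is uniform: fall back to a plain average.
        weights = {name: 1.0 for name in names}
        total_weight = float(len(names))
    fused = [
        sum(weights[m] * dists[m][i] for m in names) / total_weight
        for i in range(n)
    ]
    return fused, weights
```

With a peaked visual score list and a flat audio score list, the fused ranking follows the visual modality, which is the behavior an entropy-based weighting is meant to give without any fusion training.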

[7] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games

Prakamya Mishra,Jiang Liu,Jialian Wu,Xiaodong Yu,Zicheng Liu,Emad Barsoum

Main category: cs.CL

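TL;DR: Introduces TTT-Bench, a benchmark that evaluates basic strategic, spatial, and logical reasoning in large reasoning models (LRMs) through simple Tic-Tac-Toe-style games. The games are trivial for humans, yet models perform poorly.

Motivation: LRMs perform well in STEM domains, but their reasoning in broader task domains, particularly strategic and spatial reasoning, remains underexplored.

Contribution: Proposes the TTT-Bench benchmark, which generates verifiable two-player game problems to evaluate reasoning in simple games.

Method: Uses a scalable programmatic approach to generate game problems and evaluates a diverse set of state-of-the-art LRMs.

Result: Models that excel at hard math problems frequently fail at these simple games, especially in long-term strategic reasoning. On average they score 41% lower than on MATH 500 and 5% lower than on AIME 2024.

Insight: Models do well on complex math problems but poorly on simple strategic reasoning tasks, which highlights a limitation of current models.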

Abstract: Large reasoning models (LRMs) have demonstrated impressive reasoning
capabilities across a broad range of tasks including Olympiad-level
mathematical problems, indicating evidence of their complex reasoning
abilities. While many reasoning benchmarks focus on the STEM domain, the
ability of LRMs to reason correctly in broader task domains remains
underexplored. In this work, we introduce TTT-Bench, a new benchmark
that is designed to evaluate basic strategic, spatial, and logical reasoning
abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games
that humans can effortlessly solve from a young age. We propose a simple yet
scalable programmatic approach for generating verifiable two-player game
problems for TTT-Bench. Although these games are trivial for humans, they
require reasoning about the intentions of the opponent, as well as the game
board’s spatial configurations, to ensure a win. We evaluate a diverse set of
state-of-the-art LRMs, and discover that the models that excel at hard
math problems frequently fail at these simple reasoning games. Further testing
reveals that our evaluated reasoning models score on average 41% and 5% lower
on TTT-Bench compared to MATH 500 and AIME 2024,
respectively, with larger models achieving higher performance using shorter
reasoning traces, where most of the models struggle on long-term strategic
reasoning situations on simple and new TTT-Bench tasks.
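
To make "programmatic generation of verifiable game problems" concrete, here is a minimal sketch. It assumes classic 3x3 Tic-Tac-Toe rather than the four game variants in TTT-Bench, and every name in it (`generate_problem`, `winning_moves`, the board encoding) is hypothetical. It samples a random position and keeps it only if X has an immediate winning move, so the answer key comes from exhaustive checking rather than from annotation.

```python
import random

# The eight winning lines of a 3x3 board, cells indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def winning_moves(board, player):
    """List every empty cell where `player` wins immediately."""
    moves = []
    for i, cell in enumerate(board):
        if cell == ".":
            trial = board[:i] + [player] + board[i + 1:]
            if winner(trial) == player:
                moves.append(i)
    return moves


def generate_problem(rng, plies=4):
    """Play `plies` random moves (X first; keep `plies` even so X moves
    next) and retry until X has an immediate win. Returns (board, answers)."""
    while True:
        board = ["."] * 9
        player = "X"
        for _ in range(plies):
            empty = [i for i, c in enumerate(board) if c == "."]
            board[rng.choice(empty)] = player
            player = "O" if player == "X" else "X"
        if winner(board) is None:
            answers = winning_moves(board, "X")
            if answers:
                return board, answers
```

Calling `generate_problem(random.Random(0))` yields a board plus the full list of correct cells, so a model's reply can be graded by simple membership in that list.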

[8] Classifying Unreliable Narrators with Large Language Models

Anneliese Brei,Katharine Henry,Abhisheik Sharma,Shashank Srivastava,Snigdha Chaturvedi

Main category: cs.CL

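TL;DR: Proposes using large language models (LLMs) to identify unreliable narrators and releases the TUNa dataset, evaluating models in few-shot, fine-tuning, and curriculum learning settings.

Motivation: Readers of first-person accounts often consider whether the narrator is reliable, but existing methods lack a standardized way to identify unreliable narrators. The work aims to fill this gap with computational methods and LLMs.

Contribution: 1. Defines unreliable narrator classification tasks grounded in narratology; 2. Releases TUNa, a human-annotated dataset of narratives from multiple domains; 3. Evaluates several LLMs in few-shot, fine-tuning, and curriculum learning settings.

Method: 1. Uses narratology to define types of unreliable narrators; 2. Builds the TUNa dataset and defines classification tasks; 3. Experiments with few-shot learning, fine-tuning, and curriculum learning.

Result: The task is very challenging, but LLMs show potential for identifying unreliable narrators.

Insight: Learning from literature can transfer to real-world text, which suggests a new direction for LLMs in narrative analysis.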

Abstract: Often when we interact with a first-person account of events, we consider
whether or not the narrator, the primary speaker of the text, is reliable. In
this paper, we propose using computational methods to identify unreliable
narrators, i.e. those who unintentionally misrepresent information. Borrowing
literary theory from narratology to define different types of unreliable
narrators based on a variety of textual phenomena, we present TUNa, a
human-annotated dataset of narratives from multiple domains, including blog
posts, subreddit posts, hotel reviews, and works of literature. We define
classification tasks for intra-narrational, inter-narrational, and
inter-textual unreliabilities and analyze the performance of popular
open-weight and proprietary LLMs for each. We propose learning from literature
to perform unreliable narrator classification on real-world text data. To this
end, we experiment with few-shot, fine-tuning, and curriculum learning
settings. Our results show that this task is very challenging, and there is
potential for using LLMs to identify unreliable narrators. We release our
expert-annotated dataset and code and invite future research in this area.

[9] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages

Ali Almutairi,Abdullah Alsuhaibani,Shoaib Jameel,Usman Naseem,Gelareh Mohammadi,Imran Razzak

Main category: cs.CL

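TL;DR: Proposes Flick, a method for few-label text classification in low-resource languages that improves performance by refining how pseudo-labels are generated and selected.

Motivation: Existing few-label text classification methods struggle with noisy pseudo-labels and domain adaptation in low-resource language settings, particularly where linguistic diversity is high.

Contribution: Introduces a novel pseudo-label refinement component that improves pseudo-label quality by leveraging high-confidence pseudo-labels and an adaptive top-k selection mechanism.

Method: Flick uses a multi-task learning framework, combining single-cluster cohesion with an adaptive top-k selection mechanism to distil high-confidence pseudo-labels from a broad set of initial clusters.

Result: Flick's performance and adaptability are demonstrated on 14 diverse datasets, covering low-resource languages such as Arabic, Urdu, and Setswana alongside English.

Insight: By focusing on high-confidence pseudo-labels and simplifying pseudo-label generation, Flick is more robust and generalizes better in low-resource language settings.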

Abstract: Training deep learning networks with minimal supervision has gained
significant research attention due to its potential to reduce reliance on
extensive labelled data. While self-training methods have proven effective in
semi-supervised learning, they remain vulnerable to errors from noisy pseudo
labels. Moreover, most recent approaches to the few-label classification
problem are either designed for resource-rich languages such as English or
involve complex cascading models that are prone to overfitting. To address the
persistent challenge of few-label text classification in truly low-resource
linguistic contexts, where existing methods often struggle with noisy
pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods
that rely on generic multi-cluster pseudo-labelling or complex cascading
architectures, Flick leverages the fundamental insight that distilling
high-confidence pseudo-labels from a broader set of initial clusters can
dramatically improve pseudo-label quality, particularly for linguistically
diverse, low-resource settings. Flick introduces a novel pseudo-label
refinement component that departs from traditional pseudo-labelling strategies
by identifying and leveraging top-performing pseudo-label clusters. This
component specifically learns to distil highly reliable pseudo-labels from an
initial broad set by focusing on single-cluster cohesion and leveraging an
adaptive top-k selection mechanism. This targeted refinement process is crucial
for mitigating the propagation of errors inherent in low-resource data,
allowing for robust fine-tuning of pre-trained language models with only a
handful of true labels. We demonstrate Flick’s efficacy across 14 diverse
datasets, encompassing challenging low-resource languages such as Arabic, Urdu,
and Setswana, alongside English, showcasing its superior performance and
adaptability.

[10] “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context

Chuck Arvin

Main category: cs.CL

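TL;DR: Studies sycophancy in large language models (LLMs) responding to student prompts in a simulated educational context. The model's answer is strongly influenced by what the student mentions, and smaller models are more susceptible.

Motivation: In education, LLM sycophancy may benefit students unevenly depending on their knowledge and may even reinforce misunderstanding, so its mechanism and mitigation need study.

Contribution: 1. Demonstrates LLM sycophancy in a simulated educational context; 2. Shows the relationship between model size and the strength of sycophancy; 3. Supports the sycophancy hypothesis through analysis of answer flips and token-level probabilities.

Method: Tests five LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions to study how query framing affects response quality.

Result: When the student mentions an incorrect answer, correctness drops by up to 15 percentage points; mentioning the correct answer raises accuracy by the same margin. The bias is stronger in smaller models: up to 30% for GPT-4.1-nano versus 8% for GPT-4o.

Insight: LLM sycophancy may widen educational inequity; its mechanism and mitigation need further study.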

Abstract: This study examines how user-provided suggestions affect Large Language
Models (LLMs) in a simulated educational context, where sycophancy poses
significant risks. Testing five different LLMs from the OpenAI GPT-4o and
GPT-4.1 model classes across five experimental conditions, we show that
response quality varies dramatically based on query framing. In cases where the
student mentions an incorrect answer, the LLM correctness can degrade by as
much as 15 percentage points, while mentioning the correct answer boosts
accuracy by the same margin. Our results also show that this bias is stronger
in smaller models, with an effect of up to 30% for the GPT-4.1-nano model,
versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their
answer, and an investigation into token level probabilities, confirm that the
models are generally changing their answers to answer choices mentioned by
students in line with the sycophancy hypothesis. This sycophantic behavior has
important implications for educational equity, as LLMs may accelerate learning
for knowledgeable students while the same tools may reinforce misunderstanding
for less knowledgeable students. Our results highlight the need to better
understand the mechanism of such bias, and ways to mitigate it, in the
educational context.

[11] Code Execution as Grounded Supervision for LLM Reasoning

Dongwon Jung,Wenxuan Zhou,Muhao Chen

Main category: cs.CL

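TL;DR: Proposes using the determinism of code execution to generate high-quality chain-of-thought (CoT) supervision data, improving the reasoning abilities of large language models (LLMs).

Motivation: Existing methods for generating reasoning data rely either on costly human annotation or on error-prone LLM-generated CoT, so reliability and accuracy are hard to guarantee. The authors instead extract verifiable reasoning traces from deterministic code execution.

Contribution: A scalable method for generating high-quality CoT supervision data from code execution, which avoids human annotation and the inaccuracy of LLM-generated CoT and yields transferable reasoning abilities.

Method: Extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural-language CoT reasoning. Experiments on reasoning benchmarks across various domains validate the approach.

Result: The generated reasoning data is highly accurate, and the method reduces meaningless repetition and overthinking, which lowers overall token length during inference.

Insight: Using code execution as the supervision signal provides reliable reasoning steps, which improves LLMs' generalization and reasoning efficiency.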

Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision
has proven effective for enhancing their reasoning abilities. However,
obtaining reliable and accurate reasoning supervision remains a significant
challenge. We propose a scalable method for generating a high-quality CoT
supervision dataset by leveraging the determinism of program execution. Unlike
existing reasoning dataset generation methods that rely on costly human
annotations or error-prone LLM-generated CoT, our approach extracts verifiable,
step-by-step reasoning traces from code execution and transforms them into
natural language CoT reasoning. Experiments on reasoning benchmarks across
various domains show that our method effectively equips LLMs with transferable
reasoning abilities across diverse tasks. Furthermore, the ablation studies
validate that our method produces highly accurate reasoning data and reduces
overall token length during inference by reducing meaningless repetition and
overthinking.
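
As an illustration of extracting a step-by-step trace from program execution, the sketch below uses Python's standard `sys.settrace` hook to snapshot local variables as a function runs, then renders each newly bound value as a sentence. This is only a minimal sketch of the general idea: the authors' actual pipeline and their trace-to-text conversion are not described in the abstract, and `trace_execution`, `to_cot`, and `average_speed` are hypothetical names.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args) and record its local variables before each line
    and at return. Only frames that execute func's own code are traced."""
    snapshots = []

    def tracer(frame, event, arg):
        if frame.f_code is not func.__code__:
            return None  # do not trace other functions
        if event in ("line", "return"):
            snapshots.append(dict(frame.f_locals))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier tracer
    return result, snapshots


def to_cot(snapshots, result):
    """Render each new or changed variable as one reasoning sentence."""
    sentences, known = [], {}
    for index, snapshot in enumerate(snapshots):
        for name, value in snapshot.items():
            if name not in known or known[name] != value:
                prefix = "Given" if index == 0 else "Then"
                sentences.append(f"{prefix} {name} = {value!r}.")
                known[name] = value
    sentences.append(f"So the answer is {result!r}.")
    return sentences


def average_speed(distance_km, hours):
    # Toy program whose execution is turned into a reasoning trace.
    speed = distance_km / hours
    rounded = round(speed, 1)
    return rounded
```

Because every sentence is derived from a value the interpreter actually computed, the trace cannot contain an arithmetic slip, which is the property the abstract attributes to execution-grounded supervision. A real pipeline would still need a step that rewrites these terse sentences into fluent reasoning.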

[12] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning

Xiaohan Yu,Pu Jian,Chong Chen

Main category: cs.CL

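TL;DR: TableRAG is a retrieval-augmented generation (RAG) framework for heterogeneous documents that unifies textual understanding and table manipulation, improving performance on multi-hop and global queries.

Motivation: When existing RAG methods handle heterogeneous documents that mix text and tables, flattening and chunking disrupt table structure, which causes information loss and limits multi-hop reasoning.

Contribution: 1. Proposes the TableRAG framework, which iterates between text retrieval and table operations; 2. Develops HeteQA, a benchmark for evaluating heterogeneous reasoning; 3. Achieves state-of-the-art results on public datasets and HeteQA.

Method: TableRAG iterates over four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation.

Result: Experiments show that TableRAG consistently outperforms existing baselines on heterogeneous document question answering, setting a new state of the art.

Insight: Preserving table structure and iteratively combining text and table operations is key to better reasoning over heterogeneous documents.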

Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable
effectiveness in open-domain question answering. However, when applied to
heterogeneous documents, comprising both textual and tabular components,
existing RAG approaches exhibit critical limitations. The prevailing practice
of flattening tables and chunking strategies disrupts the intrinsic tabular
structure, leads to information loss, and undermines the reasoning capabilities
of LLMs in multi-hop, global queries. To address these challenges, we propose
TableRAG, a hybrid framework that unifies textual understanding and complex
manipulations over tabular data. TableRAG iteratively operates in four steps:
context-sensitive query decomposition, text retrieval, SQL programming and
execution, and compositional intermediate answer generation. We also develop
HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous
reasoning capabilities. Experimental results demonstrate that TableRAG
consistently outperforms existing baselines on both public datasets and our
HeteQA, establishing a new state-of-the-art for heterogeneous document question
answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
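
The "SQL programming and execution" step can be pictured with a small sketch. It assumes the table is loaded intact into an in-memory SQLite database and that a query string (hand-written here, presumably model-generated in the real system) is executed against it. The `sales` table, its columns, the rows, and the helper `run_table_step` are all hypothetical and do not come from the TableRAG repository.

```python
import sqlite3


def run_table_step(rows, sql):
    """Load rows into an in-memory table and execute one SQL query.

    The schema is fixed for this illustration; a real system would
    derive it from the document's table.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Hypothetical table content extracted from a heterogeneous document.
ROWS = [
    ("North", 2023, 120.0),
    ("South", 2023, 95.5),
    ("North", 2024, 140.0),
    ("South", 2024, 150.0),
]

# A global query ("which region earned the most overall?") needs every
# row, which is what chunked text retrieval tends to lose.
SQL = (
    "SELECT region, SUM(revenue) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC LIMIT 1"
)

intermediate_answer = run_table_step(ROWS, SQL)
```

The aggregate is computed by the database over the whole table, so the intermediate answer does not depend on which chunks a retriever happened to surface.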

[13] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

Yuhua Jiang,Yuwen Xiong,Yufeng Yuan,Chao Xin,Wenyuan Xu,Yu Yue,Qianchuan Zhao,Lin Yan

Main category: cs.CL

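TL;DR: Proposes PAG, a framework in which a large language model (LLM) alternates between policy and verifier roles under multi-turn reinforcement learning to self-correct, avoiding the redundant second attempts of earlier methods.

Motivation: LLMs perform well on complex reasoning tasks but still struggle to verify their own outputs reliably. Existing solutions depend on separate verifier models or multi-stage training pipelines, which limits scalability, so a more efficient self-verification mechanism is needed.

Contribution: Proposes the PAG framework, which achieves LLM self-verification and self-correction within a unified reinforcement learning paradigm, and introduces a selective revision mechanism that generates a new answer only when an error is detected, avoiding redundant computation.

Method: Uses multi-turn reinforcement learning in which the model alternates between the policy and verifier roles, revising its answer selectively based on a generative verification step.

Result: Experiments across diverse reasoning benchmarks show that PAG improves both direct generation and self-correction accuracy, and that its self-verification outperforms self-consistency.

Insight: By jointly optimizing generation and verification in a single framework, PAG shows the potential of self-verification for LLMs while alleviating model collapse.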

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in
complex reasoning tasks, yet they still struggle to reliably verify the
correctness of their own outputs. Existing solutions to this verification
challenge often depend on separate verifier models or require multi-stage
self-correction training pipelines, which limit scalability. In this paper, we
propose Policy as Generative Verifier (PAG), a simple and effective framework
that empowers LLMs to self-correct by alternating between policy and verifier
roles within a unified multi-turn reinforcement learning (RL) paradigm.
Distinct from prior approaches that always generate a second attempt regardless
of model confidence, PAG introduces a selective revision mechanism: the model
revises its answer only when its own generative verification step detects an
error. This verify-then-revise workflow not only alleviates model collapse but
also jointly enhances both reasoning and verification abilities. Extensive
experiments across diverse reasoning benchmarks highlight PAG’s dual
advancements: as a policy, it enhances direct generation and self-correction
accuracy; as a verifier, its self-verification outperforms self-consistency.
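
The verify-then-revise control flow described above can be sketched in a few lines. This shows inference-time logic only and says nothing about the reinforcement learning that trains the model. In PAG a single model plays both roles; here three plain callables stand in for it, and all names (`verify_then_revise`, the `toy_*` helpers) are hypothetical.

```python
def verify_then_revise(question, generate, verify, revise, max_revisions=1):
    """Generate an answer, then revise it only if verification fails.

    Returns the final answer and a trace of (event, answer) pairs.
    """
    answer = generate(question)
    trace = [("attempt", answer)]
    for _ in range(max_revisions):
        if verify(question, answer):
            break  # verifier accepts: no second attempt is generated
        answer = revise(question, answer)
        trace.append(("revision", answer))
    return answer, trace


# Toy stand-ins for the policy and verifier roles, for addition questions
# written as "a + b".
def toy_generate(question):
    return "41"  # a fixed first attempt, wrong for most questions


def toy_verify(question, answer):
    left, right = question.split("+")
    return int(left) + int(right) == int(answer)


def toy_revise(question, answer):
    left, right = question.split("+")
    return str(int(left) + int(right))
```

The contrast with always generating a second attempt is visible in the trace: a question whose first answer passes verification produces a single `("attempt", ...)` entry and no extra generation.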

[14] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Yingjin Song,Yupei Du,Denis Paperno,Albert Gatt

Main category: cs.CL

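TL;DR: Introduces the TempVS benchmark for evaluating how well multimodal large language models (MLLMs) understand the temporal order of events in image sequences. Experiments show that current MLLMs perform poorly, with a substantial gap to human capabilities.

Motivation: To examine the weaknesses of MLLMs in temporal reasoning, particularly in understanding the order of events in image sequences.

Contribution: Proposes the TempVS benchmark, comprising three main tests (event relation inference, sentence ordering, and image ordering), and provides fine-grained analysis.

Method: Evaluates 38 state-of-the-art MLLMs on TempVS, testing temporal reasoning that relies on both visual and linguistic modalities.

Result: MLLMs struggle to understand the temporal order of events, with a substantial gap to human performance.

Insight: The study exposes weaknesses in MLLM temporal reasoning and points to directions for improvement.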

Abstract: This paper introduces the TempVS benchmark, which focuses on temporal
grounding and reasoning capabilities of Multimodal Large Language Models
(MLLMs) in image sequences. TempVS consists of three main tests (i.e., event
relation inference, sentence ordering and image ordering), each accompanied
by a basic grounding test. TempVS requires MLLMs to rely on both visual and
linguistic modalities to understand the temporal order of events. We evaluate
38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS,
with a substantial performance gap compared to human capabilities. We also
provide fine-grained insights that suggest promising directions for future
research. Our TempVS benchmark data and code are available at
https://github.com/yjsong22/TempVS.

[15] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty

Zehui Ling,Deshu Chen,Hongwei Zhang,Yifeng Jiao,Xin Guo,Yuan Cheng

Main category: cs.CL

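TL;DR: Proposes a method that adjusts the output-length penalty according to problem difficulty to make LLM reasoning more efficient: concise outputs for simple problems, sufficient reasoning steps for complex ones, and better overall performance.

Motivation: LLMs reason well, but techniques such as Chain-of-Thought prompting often produce long outputs that increase computational latency. Existing reinforcement learning methods for shortening reasoning apply uniform penalties regardless of problem complexity, which leads to suboptimal outcomes. The authors aim to make reasoning concise on simple problems and accurate on complex ones.

Contribution: A reward function that adjusts the output-length penalty according to problem complexity. Experiments show shorter outputs with preserved or improved accuracy on the simpler datasets (GSM8K, MATH500) and improved accuracy on the harder dataset (AIME2024).

Method: 1) Divide the reward function according to problem complexity; 2) Introduce a novel length penalty (the Powered Length Penalty); 3) Evaluate on three datasets.

Result: On GSM8K and MATH500 (simpler datasets), outputs are shorter while accuracy is preserved or improved; on AIME2024 (harder dataset), accuracy improves.

Insight: Adjusting the length penalty to problem difficulty balances reasoning efficiency and accuracy, which suggests that LLMs need different optimization strategies for tasks of different complexity.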

Abstract: Large language models (LLMs) have demonstrated significant advancements in
reasoning capabilities, performing well on various challenging benchmarks.
Techniques like Chain-of-Thought prompting have been introduced to further
improve reasoning. However, these approaches frequently generate longer
outputs, which in turn increase computational latency. Although some methods
use reinforcement learning to shorten reasoning, they often apply uniform
penalties without considering the problem’s complexity, leading to suboptimal
outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by
promoting conciseness for simpler problems while preserving sufficient
reasoning for more complex ones for accuracy, thus improving the model’s
overall performance. Specifically, we manage the model’s reasoning efficiency
by dividing the reward function and including a novel penalty for output
length. Our approach has yielded impressive outcomes in benchmark evaluations
across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively
simpler datasets GSM8K and MATH500, our method has effectively shortened output
lengths while preserving or enhancing accuracy. On the more demanding AIME2024
dataset, our approach has resulted in improved accuracy.

[16] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Xanh Ho,Sunisth Kumar,Yun-Ang Wu,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa

Main category: cs.CL

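TL;DR: Reframes table-text alignment as an explanation task: models must not only predict a label but also identify the table cells essential to it. The authors extend the SciTab benchmark with cell-level rationales and propose a taxonomy for ambiguous cases. Experiments show that alignment information improves verification performance, but most LLMs fail to recover human-aligned rationales.

Motivation: Conventional scientific claim verification predicts only a label, which reveals little about the model's reasoning. The study aims to improve interpretability by identifying the key table cells.

Contribution: 1. Reframes table-text alignment as an explanation task; 2. Builds a new dataset with cell-level rationales; 3. Proposes a taxonomy for handling ambiguous cases.

Method: 1. Extend the SciTab dataset with human-annotated cell-level rationales; 2. Design a taxonomy for ambiguous cases; 3. Test experimentally how alignment information affects performance.

Result: 1. Alignment information improves claim verification performance; 2. Most LLMs often predict the correct label but fail to recover human-aligned rationales.

Insight: A correct prediction does not necessarily reflect faithful reasoning, which underlines the importance of interpretability in scientific claim verification.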

Abstract: Scientific claim verification against tables typically requires predicting
whether a claim is supported or refuted given a table. However, we argue that
predicting the final label alone is insufficient: it reveals little about the
model’s reasoning and offers limited interpretability. To address this, we
reframe table-text alignment as an explanation task, requiring models to
identify the table cells essential for claim verification. We build a new
dataset by extending the SciTab benchmark with human-annotated cell-level
rationales. Annotators verify the claim label and highlight the minimal set of
cells needed to support their decision. After the annotation process, we
utilize the collected information and propose a taxonomy for handling ambiguous
cases. Our experiments show that (i) incorporating table alignment information
improves claim verification performance, and (ii) most LLMs, while often
predicting correct labels, fail to recover human-aligned rationales, suggesting
that their predictions do not stem from faithful reasoning.

[17] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs

Yilin Xiao,Chuang Zhou,Qinggang Zhang,Bo Li,Qing Li,Xiao Huang

Main category: cs.CL

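TL;DR: Proposes the RRP framework, which combines the semantic and structural information of knowledge graphs (KGs) to generate high-quality reasoning paths for LLMs, addressing the weakness of existing methods on complex questions and achieving state-of-the-art performance in experiments.

Motivation: LLMs struggle with knowledge-intensive tasks because they lack background knowledge and tend to hallucinate. Existing KG-enhanced methods supply factual knowledge but still struggle with complex questions. The paper argues that organizing facts into a reliable, logically consistent reasoning path is equally important.

Contribution: 1. Proposes the RRP framework, which combines the semantic strengths of LLMs with the structural information of KGs to generate high-quality reasoning paths.
2. Introduces a rethinking module that evaluates and refines reasoning paths according to their significance.
3. Validates RRP on public datasets and shows that it works in a plug-and-play manner.

Method: 1. Obtain KG structural information through relation embedding and bidirectional distribution learning.
2. Use the semantic strengths of LLMs to mine the KG.
3. Filter and refine reasoning paths with the rethinking module.

Result: On two public datasets, RRP outperforms existing baseline methods, and it can be integrated into various LLMs to enhance their reasoning.

Insight: High-quality reasoning paths not only supplement factual knowledge but also provide logically consistent guidance, which matters for LLM performance on complex tasks.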

Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks
due to a lack of background knowledge and a tendency to hallucinate. To address
these limitations, integrating knowledge graphs (KGs) with LLMs has been
intensively studied. Existing KG-enhanced LLMs focus on supplementary factual
knowledge, but still struggle with solving complex questions. We argue that
refining the relationships among facts and organizing them into a logically
consistent reasoning path is as important as factual knowledge itself.
Despite their potential, extracting reliable reasoning paths from KGs poses the
following challenges: the complexity of graph structures and the existence of
multiple generated paths, making it difficult to distinguish between useful and
redundant ones. To tackle these challenges, we propose the RRP framework to
mine the knowledge graph, which combines the semantic strengths of LLMs with
structural information obtained through relation embedding and bidirectional
distribution learning. Additionally, we introduce a rethinking module that
evaluates and refines reasoning paths according to their significance.
Experimental results on two public datasets show that RRP achieves
state-of-the-art performance compared to existing baseline methods. Moreover,
RRP can be easily integrated into various LLMs to enhance their reasoning
abilities in a plug-and-play manner. By generating high-quality reasoning paths
tailored to specific questions, RRP distills effective guidance for LLM
reasoning.

[18] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors

Numaan Naeem,Sarfraz Ahmad,Momina Ahsan,Hasan Iqbal

Main category: cs.CL

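TL;DR: The paper presents four approaches to mistake identification for AI tutors in the BEA 2025 Shared Task. A retrieval-augmented few-shot prompting system built on a large language model performs best.

Details Motivation: To evaluate whether an AI tutor's response correctly identifies a mistake in a student's mathematical reasoning, improving the accuracy and interpretability of pedagogical feedback assessment.

Contribution: Proposes combining retrieval-augmented prompting with a large language model, a system that outperforms all baselines on mistake identification.

Method: Four approaches are explored: an ensemble of machine learning models, a frozen sentence-transformer with an MLP classifier, a history-aware model, and a retrieval-augmented prompting system.

Result: The retrieval-augmented prompting system outperforms all baselines.

Insight: Combining example-driven prompting with LLM reasoning is effective for pedagogical feedback assessment.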

Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA
2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The
task involves evaluating whether a tutor’s response correctly identifies a
mistake in a student’s mathematical reasoning. We explore four approaches: (1)
an ensemble of machine learning models over pooled token embeddings from
multiple pretrained language models (LMs); (2) a frozen sentence-transformer
using [CLS] embeddings with an MLP classifier; (3) a history-aware model with
multi-head attention between token-level history and response embeddings; and
(4) a retrieval-augmented few-shot prompting system with a large language model
(LLM), namely GPT-4o. Our final system retrieves semantically similar examples,
constructs structured prompts, and uses schema-guided output parsing to produce
interpretable predictions. It outperforms all baselines, demonstrating the
effectiveness of combining example-driven prompting with LLM reasoning for
pedagogical feedback assessment. Our code is available at
https://github.com/NaumanNaeem/BEA_2025.
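
The retrieve-then-prompt step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the bag-of-words cosine similarity stands in for a real semantic embedding model, the labeled pool and the prompt wording are hypothetical, and no LLM call or output parsing is shown. It is not the authors' implementation.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a crude stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, pool, k=2):
    """Return the k labeled examples most similar to the query."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex["response"])), reverse=True)[:k]


def build_prompt(query, examples):
    """Assemble a few-shot prompt: instruction, retrieved examples, then the query."""
    lines = ["Does the tutor response identify the student's mistake? Answer Yes or No.", ""]
    for ex in examples:
        lines += [f"Tutor: {ex['response']}", f"Label: {ex['label']}", ""]
    lines += [f"Tutor: {query}", "Label:"]
    return "\n".join(lines)


# Hypothetical labeled pool (illustrative only, not shared-task data).
POOL = [
    {"response": "Check your subtraction in step two, 15 minus 7 is not 9.", "label": "Yes"},
    {"response": "Great job, keep going!", "label": "No"},
    {"response": "Look again at how you added the fractions.", "label": "Yes"},
]

query = "Take another look at your subtraction in the second step."
shots = retrieve(query, POOL, k=2)
prompt = build_prompt(query, shots)
print(prompt)
```

In a real system the similarity function would be replaced by an embedding model, and the prompt would be sent to the LLM with a structured output schema.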

[19] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models

Ye Yu,Yaoning Yu,Haohan Wang

Main category: cs.CL

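TL;DR: PREMISE is a prompt-optimization framework that reduces redundant computation by large reasoning models on mathematical reasoning tasks, substantially lowering token usage and cost while preserving accuracy.

Details Motivation: Long chain-of-thought (CoT) reasoning performs well but is verbose and token-heavy, which raises deployment cost. PREMISE addresses this through prompt optimization, without modifying model weights.

Contribution: Proposes the PREMISE framework, which combines trace-level diagnostics with gradient-inspired prompt optimization to reduce redundant computation and cost, and which can be applied directly to commercial LLMs.

Method: A multi-objective textual search balances token length against answer validity, optimizing prompts to minimize redundant computation.

Result: On GSM8K, SVAMP, and Math500, accuracy is matched or exceeded (96% to 96% with Claude, 91% to 92% with Gemini) while reasoning tokens fall by up to 87.5% and dollar cost by 69% to 82%.

Insight: Prompt-level optimization is a scalable path to efficient inference that does not sacrifice reasoning quality.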

Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve
strong performance on mathematical benchmarks using lengthy chain-of-thought
(CoT) reasoning, but the resulting traces are often unnecessarily verbose. This
inflates token usage and cost, limiting deployment in latency-sensitive or
API-constrained settings. We introduce PREMISE (PRompt-based Efficient
Mathematical Inference with Strategic Evaluation), a prompt-only framework that
reduces reasoning overhead without modifying model weights. PREMISE combines
trace-level diagnostics with gradient-inspired prompt optimization to minimize
redundant computation while preserving answer accuracy. The approach jointly
optimizes brevity and correctness through a multi-objective textual search that
balances token length and answer validity. Unlike prior work, PREMISE runs in a
single-pass black-box interface, so it can be applied directly to commercial
LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy
($96\%\rightarrow96\%$ with Claude, $91\%\rightarrow92\%$ with Gemini) while
reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by
$69$–$82\%$. These results show that prompt-level optimization is a practical
and scalable path to efficient LRM inference without compromising reasoning
quality.

[20] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims

Priyanka Kargupta,Runchu Tian,Jiawei Han

Main category: cs.CL

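TL;DR: ClaimSpect is a retrieval-augmented generation framework that automatically builds a hierarchy of aspects for a nuanced claim and enriches it with perspectives retrieved from a corpus, yielding a more comprehensive response.

Details Motivation: Real-world claims, such as scientific or political ones, are often nuanced and cannot simply be labeled "true" or "false". A method is needed to break them into sub-aspects that are easier to validate and to analyze them from multiple perspectives.

Contribution: Proposes the ClaimSpect framework, which automatically constructs a hierarchy of aspects for a claim and uses corpus retrieval to discover sub-aspects and differing perspectives.

Method: Uses retrieval-augmented generation to decompose a claim into sub-aspects, hierarchically partitioning the corpus to retrieve relevant segments that reveal new sub-aspects, differing perspectives, and the prevalence of each.

Result: Applied to a constructed dataset of real-world scientific and political claims, ClaimSpect proves robust and accurate. Case studies and human evaluation show it is more effective than multiple baselines.

Insight: ClaimSpect offers a new way to handle nuanced claims: hierarchical, multi-perspective analysis makes the resulting information more interpretable and useful.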

Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be
clearly labeled as entirely “true” or “false” – as is frequently the case with
scientific and political claims. However, a claim (e.g., “vaccine A is better
than vaccine B”) can be dissected into its integral aspects and sub-aspects
(e.g., efficacy, safety, distribution), which are individually easier to
validate. This enables a more comprehensive, structured response that provides
a well-rounded perspective on a given problem while also allowing the reader to
prioritize specific angles of interest within the claim (e.g., safety towards
children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based
framework for automatically constructing a hierarchy of aspects typically
considered when addressing a claim and enriching them with corpus-specific
perspectives. This structure hierarchically partitions an input corpus to
retrieve relevant segments, which assist in discovering new sub-aspects.
Moreover, these segments enable the discovery of varying perspectives towards
an aspect of the claim (e.g., support, neutral, or oppose) and their respective
prevalence (e.g., “how many biomedical papers believe vaccine A is more
transportable than B?”). We apply ClaimSpect to a wide variety of real-world
scientific and political claims featured in our constructed dataset, showcasing
its robustness and accuracy in deconstructing a nuanced claim and representing
perspectives within a corpus. Through real-world case studies and human
evaluation, we validate its effectiveness over multiple baselines.

[21] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs

Alberto Testoni,Iacer Calixto

Main category: cs.CL

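TL;DR: This paper presents a fine-grained evaluation of uncertainty estimation methods for large language models (LLMs) on clinical question answering, comparing models and methods across medical specialties and question types, and explores lightweight single-pass estimators.

Details Motivation: In high-stakes domains such as clinical decision support, accurate and well-calibrated uncertainty estimates are essential for deploying LLMs. The study evaluates how well different LLMs estimate uncertainty on clinical QA.

Contribution: Systematically evaluates uncertainty estimation for ten open-source LLMs on clinical QA, and explores lightweight single-pass estimators based on behavioral signals in reasoning traces, which approach the performance of Semantic Entropy.

Method: Compares standard single-generation and sampling-based methods, and explores lightweight single-pass estimators based on behavioral signals in reasoning traces. The evaluation covers two datasets, eleven medical specialties, and six question types.

Result: Performance varies substantially across specialties and question types. The lightweight single-pass estimators approach Semantic Entropy while requiring only one generation.

Insight: Models should be selected according to both the nature of the question and model-specific strengths. Lightweight estimators are attractive when repeated sampling is too costly.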

Abstract: Accurate and well-calibrated uncertainty estimates are essential for
deploying large language models (LLMs) in high-stakes domains such as clinical
decision support. We present a fine-grained evaluation of uncertainty
estimation methods for clinical multiple-choice question answering, covering
ten open-source LLMs (general-purpose, biomedical, and reasoning models) across
two datasets, eleven medical specialties, and six question types. We compare
standard single-generation and sampling-based methods, and present a case study
exploring simple, single-pass estimators based on behavioral signals in
reasoning traces. These lightweight methods approach the performance of
Semantic Entropy while requiring only one generation. Our results reveal
substantial variation across specialties and question types, underscoring the
importance of selecting models based on both the nature of the question and
model-specific strengths.
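
As background for the sampling-based baseline named above, Semantic Entropy samples several generations, groups them by meaning, and computes the entropy over the groups. The sketch below is a simplification under stated assumptions: for multiple-choice questions it treats each chosen option letter as one meaning group, and it uses empirical sample frequencies rather than sequence likelihoods. The sampled answers are invented for illustration.

```python
import math
from collections import Counter


def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers."""
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples: ten generations that all pick option B,
# and eight generations spread evenly over four options.
confident = ["B"] * 10
uncertain = ["A", "B", "C", "D", "A", "B", "C", "D"]

print(answer_entropy(confident))  # zero entropy: every sample agrees
print(answer_entropy(uncertain))  # log(4): maximal for four options
```

The cost of this estimator is one generation per sample, which is the overhead the single-pass estimators in the paper aim to avoid.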

[22] Improving Named Entity Transcription with Contextual LLM-based Revision

Viet Anh Trinh,Xinlu He,Jacob Whitehill

Main category: cs.CL

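TL;DR: This paper proposes a large language model (LLM) revision mechanism that corrects misrecognized named entities in ASR predictions by combining the LLM's reasoning ability with local context (e.g., lecture notes) containing the correct named entities. On the new NER-MIT-OpenCourseWare dataset, it achieves up to 30% relative WER reduction for named entities.

Details Motivation: ASR systems perform well on general speech, but their error rate on named entities remains high. Named entities are often the most critical keywords, so misrecognizing them harms downstream applications. An effective way to correct named-entity errors in ASR output is therefore needed.

Contribution: 1) Proposes an LLM-based named-entity revision mechanism; 2) introduces the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses.

Method: Uses the LLM's reasoning ability together with the correct named entities found in local context (e.g., lecture notes) to revise named entities in ASR predictions.

Result: On NER-MIT-OpenCourseWare, the technique achieves up to 30% relative WER reduction for named entities.

Insight: Combining an LLM with local context can effectively correct named-entity errors in ASR output, particularly in specific domains such as education.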

Abstract: With recent advances in modeling and the increasing amount of supervised
training data, automatic speech recognition (ASR) systems have achieved
remarkable performance on general speech. However, the word error rate (WER) of
state-of-the-art ASR remains high for named entities. Since named entities are
often the most critical keywords, misrecognizing them can affect all downstream
applications, especially when the ASR system functions as the front end of a
complex system. In this paper, we introduce a large language model (LLM)
revision mechanism to revise incorrect named entities in ASR predictions by
leveraging the LLM’s reasoning ability as well as local context (e.g., lecture
notes) containing a set of correct named entities. Finally, we introduce the
NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses
for development and testing. On this dataset, our proposed technique achieves
up to 30% relative WER reduction for named entities.

[23] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints

Wei Sun,Tingyu Qu,Mingxiao Li,Jesse Davis,Marie-Francine Moens

Main category: cs.CL

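TL;DR: LangEdit is a novel null-space constrained framework that mitigates negative interference in multilingual sequential knowledge editing. It projects each parameter update onto the orthogonal complement of the subspaces of previous updates, isolating language-specific knowledge updates.

Details Motivation: Updating knowledge in multilingual large language models (LLMs) while keeping facts consistent across languages remains a long-standing, unresolved challenge. Maintaining a separate model per language is costly, while sequentially editing a unified model causes parameter interference.

Contribution: LangEdit uses a null-space constrained framework to orthogonally project language-specific updates, guaranteeing update independence while preserving multilingual generalization.

Method: Uses orthogonal projection: each language-specific parameter update is constrained to the orthogonal complement of the subspaces of previous updates, which avoids parameter interference.

Result: Evaluations across three model architectures, six languages, and four downstream tasks show that LangEdit mitigates parameter interference and outperforms existing editing methods.

Insight: LangEdit offers an efficient and accurate way to update knowledge in multilingual LLMs by addressing interference between sequential cross-lingual edits.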

Abstract: Efficiently updating multilingual knowledge in large language models (LLMs),
while preserving consistent factual representations across languages, remains a
long-standing and unresolved challenge. While deploying separate editing
systems for each language might seem viable, this approach incurs substantial
costs due to the need to manage multiple models. A more efficient solution
involves integrating knowledge updates across all languages into a unified
model. However, performing sequential edits across languages often leads to
destructive parameter interference, significantly degrading multilingual
generalization and the accuracy of injected knowledge. To address this
challenge, we propose LangEdit, a novel null-space constrained framework
designed to precisely isolate language-specific knowledge updates. The core
innovation of LangEdit lies in its ability to project parameter updates for
each language onto the orthogonal complement of previous updated subspaces.
This approach mathematically guarantees update independence while preserving
multilingual generalization capabilities. We conduct a comprehensive evaluation
across three model architectures, six languages, and four downstream tasks,
demonstrating that LangEdit effectively mitigates parameter interference and
outperforms existing state-of-the-art editing methods. Our results highlight
its potential for enabling efficient and accurate multilingual knowledge
updates in LLMs. The code is available at
https://github.com/VRCMF/LangEdit.git.
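
The core operation, projecting a new update onto the orthogonal complement of the subspace spanned by previous updates, can be shown on toy vectors. This is a sketch under assumptions: updates are treated as flat vectors, the basis is built with plain Gram-Schmidt, and the numbers are invented. How LangEdit actually constructs and stores the subspaces is not specified in the abstract.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(update, previous_updates):
    """Remove from `update` every component lying in the span of previous updates."""
    out = list(update)
    for q in orthonormal_basis(previous_updates):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# Hypothetical flattened updates from two earlier edits, and a new one.
previous = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [0.5, -2.0, 3.0]
projected = project_to_complement(new_update, previous)
print(projected)
```

Because the projected update has zero inner product with every earlier update direction, applying it cannot move the parameters along those directions, which is the sense in which the edits stay independent.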

[24] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

Zhensheng Jin,Xinze Li,Yifan Ji,Chunyi Peng,Zhenghao Liu,Qi Shi,Yukun Yan,Shuo Wang,Furong Peng,Ge Yu

Main category: cs.CL

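TL;DR: The paper proposes ReCUT, which uses stepwise exploration and a long-short switched sampling strategy to balance reasoning length and accuracy in LLMs, cutting reasoning length by roughly 30-50% while maintaining or improving accuracy.

Details Motivation: Existing CoT prompting methods suffer from overthinking, producing unnecessarily long or redundant reasoning traces. Existing remedies are constrained by the quality of the generated data and are prone to overfitting.

Contribution: Proposes ReCUT, which uses a stepwise exploration mechanism and a long-short switched sampling strategy to train two specialized models and then merges them through parameter interpolation, balancing reasoning length and accuracy.

Method: 1) Stepwise exploration generates diverse reasoning paths; 2) long-short switched sampling builds preference pairs; 3) two specialized models are trained (one for accuracy, one for brevity) and merged by parameter interpolation.

Result: On math reasoning datasets, reasoning length falls by roughly 30-50% while accuracy is maintained or improved.

Insight: Preference optimization over long and short reasoning paths, followed by model merging, improves LLM reasoning efficiency and suggests a direction for complex tasks.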

Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially
improved the reasoning capabilities of Large Language Models (LLMs). However,
these methods often suffer from overthinking, leading to unnecessarily lengthy
or redundant reasoning traces. Existing approaches attempt to mitigate this
issue through curating multiple reasoning chains for training LLMs, but their
effectiveness is often constrained by the quality of the generated data and
prone to overfitting. To address the challenge, we propose Reasoning
Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing
the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a
stepwise exploration mechanism and a long-short switched sampling strategy,
enabling LLMs to incrementally generate diverse reasoning paths. These paths
are evaluated and used to construct preference pairs to train two specialized
models (Gemini LLMs): one optimized for reasoning accuracy, the other for
shorter reasoning. A final integrated model is obtained by interpolating the
parameters of these two models. Experimental results across multiple math
reasoning datasets and backbone models demonstrate that ReCUT significantly
reduces reasoning lengths by approximately 30-50%, while maintaining or
improving reasoning accuracy compared to various baselines. All code and data
will be released via https://github.com/NEUIR/ReCUT.
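
The final merging step, interpolating the parameters of the two specialized models, can be sketched as a weighted average. The parameter names, values, and the coefficient `alpha` below are hypothetical. The abstract does not say how the interpolation weight is chosen.

```python
def interpolate(params_a, params_b, alpha):
    """Return alpha * params_a + (1 - alpha) * params_b, parameter by parameter."""
    if params_a.keys() != params_b.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [alpha * a + (1.0 - alpha) * b for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }


# Toy parameter dictionaries standing in for the two specialized models.
accuracy_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
concise_model = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}

merged = interpolate(accuracy_model, concise_model, alpha=0.5)
print(merged)
```

Setting `alpha` closer to 1 keeps the merged model nearer the accuracy-optimized weights, and closer to 0 nearer the brevity-optimized ones.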

[25] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training

Alireza Salemi,Mukta Maddipatla,Hamed Zamani

Main category: cs.CL

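TL;DR: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework that optimizes inter-agent collaboration through self-training with reward-guided trajectory sampling and outperforms conventional RAG baselines.

Details Motivation: To address the limits of conventional RAG on complex tasks by dividing work among specialized agents and improving their collaboration through self-training.

Contribution: 1. Proposes mRAG, a multi-agent RAG framework; 2. introduces self-training with reward-guided trajectory sampling to optimize collaboration; 3. evaluates the system in the SIGIR 2025 LiveRAG competition.

Method: Specialized agents handle planning, searching, reasoning, and coordination. Self-training with reward-guided trajectory sampling optimizes their collaboration.

Result: On DataMorgana-derived datasets, mRAG outperforms conventional RAG baselines, and case studies illustrate its strengths.

Insight: Dividing work among agents and self-training them can improve RAG on complex tasks, which is promising for real-world applications.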

Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG)
framework composed of specialized agents for subtasks such as planning,
searching, reasoning, and coordination. Our system uses a self-training
paradigm with reward-guided trajectory sampling to optimize inter-agent
collaboration and enhance response generation. Evaluated on DataMorgana-derived
datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms
conventional RAG baselines. We further analyze competition outcomes and
showcase the framework’s strengths with case studies, demonstrating its
efficacy for complex, real-world RAG tasks.

[26] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles

Qingyan Wei,Yaojie Zhang,Zhiyuan Liu,Dongrui Liu,Linfeng Zhang

Main category: cs.CL

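TL;DR: This paper proposes SlowFast Sampling, a dynamic sampling strategy for accelerating diffusion-based language models (dLLMs). Three golden principles (certainty, convergence, and positional) guide the sampling process, and combining the strategy with caching yields large speedups, with throughput exceeding strong autoregressive baselines.

Details Motivation: Existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, behave statically, which limits efficiency and flexibility. A dynamic strategy is needed to realize the parallel-generation potential of dLLMs.

Contribution: 1. Proposes SlowFast Sampling, a dynamic sampling strategy guided by three golden principles (certainty, convergence, and positional). 2. Integrates dLLM-Cache to reduce redundant computation. 3. Achieves up to 15.63× speedup on LLaDA, and up to 34.22× with caching, with higher throughput than autoregressive baselines such as LLaMA3 8B.

Method: 1. Adaptively alternates between exploratory and accelerated decoding stages. 2. Uses three principles to decide when and where tokens are decoded: the certainty principle (decode when confidence is high), the convergence principle (accelerate once predictions stabilize), and the positional principle (prioritize particular positions). 3. Uses dLLM-Cache to cache intermediate results.

Result: SlowFast Sampling achieves up to 15.63× speedup on LLaDA (up to 34.22× with caching) with minimal accuracy drop, and its throughput exceeds that of autoregressive baselines such as LLaMA3 8B.

Insight: 1. Dynamic sampling can unlock the parallel-generation potential of dLLMs. 2. Combining caching with the sampling strategy improves efficiency further. 3. The three golden principles offer general guidance for future work.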

Abstract: Diffusion-based language models (dLLMs) have emerged as a promising
alternative to traditional autoregressive LLMs by enabling parallel token
generation and significantly reducing inference latency. However, existing
sampling strategies for dLLMs, such as confidence-based or semi-autoregressive
decoding, often suffer from static behavior, leading to suboptimal efficiency
and limited flexibility. In this paper, we propose SlowFast Sampling, a novel
dynamic sampling strategy that adaptively alternates between exploratory and
accelerated decoding stages. Our method is guided by three golden principles:
certainty principle, convergence principle, and positional principle, which
govern when and where tokens can be confidently and efficiently decoded. We
further integrate our strategy with dLLM-Cache to reduce redundant computation.
Extensive experiments across benchmarks and models show that SlowFast Sampling
achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and
up to 34.22$\times$ when combined with caching. Notably, our approach
outperforms strong autoregressive baselines like LLaMA3 8B in throughput,
demonstrating that well-designed sampling can unlock the full potential of
dLLMs for fast and high-quality generation.

[27] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Michele Gubian,Ioana Krehan,Oli Liu,James Kirby,Sharon Goldwater

Main category: cs.CL

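TL;DR: This paper studies how wav2vec2 models pretrained on different languages encode phone, tone, and speaker information. Probing classifiers and geometric analyses show that the subspaces encoding these are largely orthogonal and that the representational structure is largely independent of the pretraining language.

Details Motivation: Almost all analyses of self-supervised speech models have focused on English. This paper examines whether wav2vec2 models pretrained on different languages encode phone, tone, and speaker information in similar ways.

Contribution: Shows that wav2vec2 models pretrained on different languages encode phones, tones, and speakers in largely orthogonal subspaces, and that this structure is largely independent of the pretraining language.

Method: Applies probing classifiers and geometric analyses to wav2vec2 models pretrained on four different languages, comparing language-matched and non-matched conditions.

Result: For all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal and layerwise patterns of probing accuracy are similar, with a relatively small matched-language advantage for phone and tone (but not speaker) probes in the later layers.

Insight: The structure of the representations learned by wav2vec2 is largely independent of the pretraining language, suggesting that self-supervised training captures phone, tone, and speaker information in a similar way across languages.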

Abstract: Analyses of self-supervised speech models have begun to reveal where and how
they represent different types of information. However, almost all analyses
have focused on English. Here, we examine how wav2vec2 models trained on four
different languages encode both language-matched and non-matched speech. We use
probing classifiers and geometric analyses to examine how phones, lexical
tones, and speaker information are represented. We show that for all
pretraining and test languages, the subspaces encoding phones, tones, and
speakers are largely orthogonal, and that layerwise patterns of probing
accuracy are similar, with a relatively small advantage for matched-language
phone and tone (but not speaker) probes in the later layers. Our findings
suggest that the structure of representations learned by wav2vec2 is largely
independent of the speech material used during pretraining.

[28] Slimming Down LLMs Without Losing Their Minds

Qingda,Mai

Main category: cs.CL

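TL;DR: The paper studies how parameter-efficient fine-tuning methods (LoRA and QLoRA) affect large language model performance. It finds that LoRA-based methods improve task performance while remaining computationally efficient, and that performance depends strongly on how well the fine-tuning dataset aligns with the benchmark task.

Details Motivation: To investigate how to fine-tune large language models efficiently with limited resources while maintaining performance.

Contribution: Validates the effectiveness of parameter-efficient methods such as LoRA and QLoRA for improving task performance, and shows how performance relates to dataset-task alignment.

Method: Evaluates LoRA and QLoRA in three domains: commonsense reasoning (HellaSwag), mathematical reasoning (GSM8K), and multi-domain knowledge (MMLU-CS).

Result: LoRA-based methods improve task-specific performance while maintaining computational efficiency, and performance depends strongly on the alignment between the fine-tuning dataset and the benchmark task.

Insight: The study offers theoretical insight and practical guidance for fine-tuning LLMs in resource-constrained settings.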

Abstract: This paper investigates and validates the impact of fine-tuning on large
language model performance, focusing on parameter-efficient methods (LoRA and
QLoRA). We evaluate model capabilities across three key domains: (1)
commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3)
multi-domain knowledge (MMLU-CS).
Our findings demonstrate that: (1) LoRA-based methods effectively improve
task-specific performance while maintaining computational efficiency, and (2)
performance strongly depends on alignment between fine-tuning dataset and
benchmark tasks. The study provides both theoretical insights into
parameter-efficient mechanisms and practical guidance for developers
implementing efficient LLM adaptation with limited resources.
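
The computational efficiency of LoRA comes from training a low-rank update instead of the full weight matrix: a frozen weight of shape d_out × d_in is adapted by the product of two trainable factors of shapes d_out × r and r × d_in. The worked arithmetic below compares trainable parameter counts for one layer. The layer size and rank are illustrative assumptions, not values from the paper.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters for a rank-`rank` LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative layer: a 4096 x 4096 projection with a rank-8 adapter.
d_out, d_in, rank = 4096, 4096, 8
full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
fraction = lora / full

print(full, lora, f"{fraction:.4%}")
```

For this layer the adapter trains under half a percent of the parameters that full fine-tuning would update. QLoRA applies the same adapter idea on top of a quantized frozen base model.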

[29] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Yixiao Huang,Hanlin Zhu,Tianyu Guo,Jiantao Jiao,Somayeh Sojoudi,Michael I. Jordan,Stuart Russell,Song Mei

Main category: cs.CL

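TL;DR: The paper examines the duality that large language models show during fine-tuning (generalization versus hallucination) and explains it through a mechanism called out-of-context reasoning (OCR). Experiments and theory link OCR to matrix factorization and the implicit bias of gradient descent.

Details Motivation: When fine-tuned, large language models can generalize from new facts yet are also prone to hallucination, and the reasons remain poorly understood. The paper studies the OCR mechanism to explain this behavior.

Contribution: 1. Argues that OCR is the single mechanism behind both generalization and hallucination; 2. confirms the role of OCR across five prominent LLMs; 3. shows theoretically that matrix factorization and the implicit bias of gradient descent are key to OCR.

Method: The paper formalizes OCR as a synthetic factual recall task and empirically compares a one-layer single-head attention-only transformer with factorized output and value matrices against a model with combined weights. The theoretical part analyzes how the implicit bias of gradient descent leads the model to learn associations.

Result: Experiments confirm that OCR drives both generalization and hallucination, depending on whether the associated concepts are causally related. The factorized model learns the task while the combined-weight model does not, and the analysis shows that gradient descent favors solutions that minimize the nuclear norm of the combined output-value matrix, which explains the high sample efficiency.

Insight: The work offers a new lens on model reasoning behavior, highlighting the role of matrix structure and optimization bias, and provides a theoretical foundation for mitigating undesirable behaviors from knowledge injection.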

Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning,
but this process exhibits a puzzling duality: models can generalize remarkably
from new facts, yet are also prone to hallucinating incorrect information.
However, the reasons for this phenomenon remain poorly understood. In this
work, we argue that both behaviors stem from a single mechanism known as
out-of-context reasoning (OCR): the ability to deduce implications by
associating concepts, even those without a causal link. Our experiments across
five prominent LLMs confirm that OCR indeed drives both generalization and
hallucination, depending on whether the associated concepts are causally
related. To build a rigorous theoretical understanding of this phenomenon, we
then formalize OCR as a synthetic factual recall task. We empirically show that
a one-layer single-head attention-only transformer with factorized output and
value matrices can learn to solve this task, while a model with combined
weights cannot, highlighting the crucial role of matrix factorization. Our
theoretical analysis shows that the OCR capability can be attributed to the
implicit bias of gradient descent, which favors solutions that minimize the
nuclear norm of the combined output-value matrix. This mathematical structure
explains why the model learns to associate facts and implications with high
sample efficiency, regardless of whether the correlation is causal or merely
spurious. Ultimately, our work provides a theoretical foundation for
understanding the OCR phenomenon, offering a new lens for analyzing and
mitigating undesirable behaviors from knowledge injection.

[30] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

Thomas Sounack,Joshua Davis,Brigitte Durieux,Antoine Chaffin,Tom J. Pollard,Eric Lehman,Alistair E. W. Johnson,Matthew McDermott,Tristan Naumann,Charlotta Lindvall

Main category: cs.CL

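TL;DR: BioClinical ModernBERT is a long-context encoder for biomedical and clinical NLP, built through large-scale domain-adaptive continued pretraining on multi-source data. It outperforms existing biomedical and clinical encoders.

Details Motivation: Encoders have developed more slowly than decoder models, which has limited domain adaptation in biomedical and clinical settings, and prior clinical encoders typically relied on data from a single source.

Contribution: Introduces BioClinical ModernBERT, which is pretrained on a large multi-source corpus, improves performance on biomedical and clinical NLP tasks, and supports long-context processing.

Method: Continues pretraining ModernBERT on a biomedical and clinical corpus of over 53.5 billion tokens, drawing on 20 datasets from diverse institutions, domains, and geographic regions.

Result: Outperforms existing biomedical and clinical encoders on four downstream tasks. Base (150M parameters) and large (396M parameters) versions are released along with training checkpoints.

Insight: Multi-source data and long-context support are key to improving biomedical and clinical NLP models.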

Abstract: Encoder-based transformer models are central to biomedical and clinical
Natural Language Processing (NLP), as their bidirectional self-attention makes
them well-suited for efficiently extracting structured information from
unstructured text through discriminative tasks. However, encoders have seen
slower development compared to decoder models, leading to limited domain
adaptation in biomedical and clinical settings. We introduce BioClinical
ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT
release, incorporating long-context processing and substantial improvements in
speed and performance for biomedical and clinical NLP. BioClinical ModernBERT
is developed through continued pretraining on the largest biomedical and
clinical corpus to date, with over 53.5 billion tokens, and addresses a key
limitation of prior clinical encoders by leveraging 20 datasets from diverse
institutions, domains, and geographic regions, rather than relying on data from
a single source. It outperforms existing biomedical and clinical encoders on
four downstream tasks spanning a broad range of use cases. We release both base
(150M parameters) and large (396M parameters) versions of BioClinical
ModernBERT, along with training checkpoints to support further research.

[31] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

Lan Zhang,Marco Valentino,Andre Freitas

Main category: cs.CL

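TL;DR: The paper proposes an ensemble of LLM judges (EFG) for automatically evaluating autoformalization. The judges score criteria such as logical preservation and mathematical consistency, giving a more transparent assessment, and experiments show a stronger correlation with human assessments than a coarse-grained model.

Details Motivation: Human evaluation of autoformalization takes significant time and domain expertise. Existing LLM-as-a-judge methods use coarse-grained, generic criteria, which limits their effectiveness for advanced formal mathematical reasoning.

Contribution: Introduces the EFG ensemble, which evaluates autoformalization systematically along four criteria: logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ). The ensemble correlates more strongly with human assessments than a coarse-grained model.

Method: An ensemble of LLM judges (EFG) is defined over the LP, MC, FV, and FQ criteria and yields a transparent, automated assessment that accounts for the different contributing factors.

Result: The EFG ensemble correlates more strongly with human assessments than a coarse-grained model, especially when assessing formal qualities.

Insight: LLM-as-judges, when guided by a well-defined set of fine-grained criteria, could offer scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.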

Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by
enabling the automatic translation of natural language statements into formal
languages. While recent advances using large language models (LLMs) have shown
promising results, methods for automatically evaluating autoformalization
remain underexplored. As one moves to more complex domains (e.g., advanced
mathematics), human evaluation requires significant time and domain expertise,
especially as the complexity of the underlying statements and background
knowledge increases. LLM-as-a-judge presents a promising approach for
automating such evaluation. However, existing methods typically employ
coarse-grained and generic evaluation criteria, which limit their effectiveness
for advanced formal mathematical reasoning, where quality hinges on nuanced,
multi-granular dimensions. In this work, we take a step toward addressing this
gap by introducing a systematic, automatic method to evaluate autoformalization
tasks. The proposed method is based on an epistemically and formally grounded
ensemble (EFG) of LLM judges, defined on criteria encompassing logical
preservation (LP), mathematical consistency (MC), formal validity (FV), and
formal quality (FQ), resulting in a transparent assessment that accounts for
different contributing factors. We validate the proposed framework to serve as
a proxy for autoformalization assessment within the domain of formal
mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM
judges is a suitable emerging proxy for evaluation, more strongly correlating
with human assessments than a coarse-grained model, especially when assessing
formal qualities. These findings suggest that LLM-as-judges, especially when
guided by a well-defined set of atomic properties, could offer a scalable,
interpretable, and reliable support for evaluating formal mathematical
reasoning.

[32] Magistral

Mistral-AI,Abhinav Rastogi,Albert Q. Jiang,Andy Lo,Gabrielle Berrada,Guillaume Lample,Jason Rute,Joep Barmentlo,Karmesh Yadav,Kartik Khandelwal,Khyathi Raghavi Chandu,Léonard Blier,Lucile Saulnier,Matthieu Dinot,Maxime Darrin,Neha Gupta,Roman Soletskyi,Sagar Vaze,Teven Le Scao,Yihan Wang,Adam Yang,Alexander H. Liu,Alexandre Sablayrolles,Amélie Héliou,Amélie Martin,Andy Ehrenberg,Anmol Agarwal,Antoine Roux,Arthur Darcet,Arthur Mensch,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Chris Bamford,Christian Wallenwein,Christophe Renaudin,Clémence Lanfranchi,Darius Dabert,Devon Mizelle,Diego de las Casas,Elliot Chane-Sane,Emilien Fugier,Emma Bou Hanna,Gauthier Delerce,Gauthier Guinet,Georgii Novikov,Guillaume Martin,Himanshu Jaju,Jan Ludziejewski,Jean-Hadrien Chabran,Jean-Malo Delignon,Joachim Studnia,Jonas Amar,Josselin Somerville Roberts,Julien Denize,Karan Saxena,Kush Jain,Lingxiao Zhao,Louis Martin,Luyu Gao,Lélio Renard Lavaud,Marie Pellat,Mathilde Guillaumin,Mathis Felardos,Maximilian Augustin,Mickaël Seznec,Nikhil Raghuraman,Olivier Duchenne,Patricia Wang,Patrick von Platen,Patryk Saffer,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Pavankumar Reddy Muddireddy,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Romain Sauvestre,Rémi Delacourt,Sanchit Gandhi,Sandeep Subramanian,Shashwat Dalal,Siddharth Gandhi,Soham Ghosh,Srijan Mishra,Sumukh Aithal,Szymon Antoniak,Thibault Schueller,Thibaut Lavril,Thomas Robert,Thomas Wang,Timothée Lacroix,Valeriia Nemychnikova,Victor Paltz,Virgile Richard,Wen-Ding Li,William Marshall,Xuanyu Zhang,Yunhao Tang

Main category: cs.CL

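TL;DR: Magistral is Mistral's first reasoning model, trained with a reinforcement learning (RL) pipeline built from the ground up that relies solely on Mistral's own models and infrastructure and explores the limits of pure RL training of LLMs.

Details Motivation: Rather than relying on existing implementations and RL traces distilled from prior models, Magistral builds its RL pipeline from the ground up to explore the potential of pure RL training of LLMs.

Contribution: 1. Demonstrates a stack that supports pure RL training of LLMs; 2. presents a simple method to force the model's reasoning language; 3. shows that RL on text data alone maintains most of the initial checkpoint's capabilities.

Method: Trains with RL alone, without existing implementations or RL traces distilled from prior models, using only Mistral's own models and infrastructure.

Result: Magistral Medium is trained for reasoning on top of Mistral Medium 3 with RL alone. RL on text maintains or improves multimodal understanding, instruction following, and function calling. Magistral Small is open-sourced under Apache 2.0.

Insight: RL on text data alone can preserve most of the initial model's capabilities and improve some of them, which shows the potential of RL in LLM training.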

Abstract: We introduce Magistral, Mistral’s first reasoning model and our own scalable
reinforcement learning (RL) pipeline. Instead of relying on existing
implementations and RL traces distilled from prior models, we follow a
ground-up approach, relying solely on our own models and infrastructure. Notably, we
demonstrate a stack that enabled us to explore the limits of pure RL training
of LLMs, present a simple method to force the reasoning language of the model,
and show that RL on text data alone maintains most of the initial checkpoint’s
capabilities. We find that RL on text maintains or improves multimodal
understanding, instruction following and function calling. We present Magistral
Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we
open-source Magistral Small (Apache 2.0) which further includes cold-start data
from Magistral Medium.

[33] Dynamic Epistemic Friction in Dialogue

Timothy Obiso,Kenneth Lai,Abhijnan Nath,Nikhil Krishnaswamy,James Pustejovsky

Main category: cs.CL

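TL;DR: The paper examines "dynamic epistemic friction", the resistance encountered when beliefs are updated in dialogue, and proposes a model grounded in Dynamic Epistemic Logic for predicting belief updates and informing belief alignment in dialogue.

Details Motivation: Approaches that align large language models for human-AI collaboration often neglect "epistemic friction", the resistance encountered when updating beliefs, which limits their effectiveness in complex dialogue scenarios.

Contribution: The paper defines dynamic epistemic friction, positions it within the framework of Dynamic Epistemic Logic, and presents a model that can predict belief updates in dialogue.

Method: Uses the Dynamic Epistemic Logic framework to analyze belief revision during dialogue and proposes a model that measures epistemic resistance (friction).

Result: Analyses from a situated collaborative task show that the model can effectively predict belief updates in dialogue and provide a basis for more sophisticated models of belief alignment.

Insight: Dynamic epistemic friction is a key factor in belief alignment during dialogue, and quantifying it suggests ways to make language models more adaptable in complex scenarios.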

Abstract: Recent developments in aligning Large Language Models (LLMs) with human
preferences have significantly enhanced their utility in human-AI collaborative
scenarios. However, such approaches often neglect the critical role of
“epistemic friction,” or the inherent resistance encountered when updating
beliefs in response to new, conflicting, or ambiguous information. In this
paper, we define dynamic epistemic friction as the resistance to epistemic
integration, characterized by the misalignment between an agent’s current
belief state and new propositions supported by external evidence. We position
this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit,
2011), where friction emerges as nontrivial belief-revision during the
interaction. We then present analyses from a situated collaborative task that
demonstrate how this model of epistemic friction can effectively predict belief
updates in dialogues, and we subsequently discuss how the model of belief
alignment as a measure of epistemic resistance or friction can naturally be
made more sophisticated to accommodate the complexities of real-world dialogue
scenarios.

[34] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Mozhi Zhang,Howe Tissue,Lu Wang,Xipeng Qiu

Main category: cs.CL

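TL;DR: Domain2Vec is a training-free approach that vectorizes datasets to find the optimal data mixture, improving downstream task performance.

Details Motivation: Existing methods for optimizing data mixtures usually require substantial training compute. Domain2Vec aims to cut this overhead by modeling the relationship between data distribution and model performance, so that good mixtures can be found efficiently.

Contribution: 1. Proposes Domain2Vec, which represents each dataset as a domain vector, a distribution over meta-domains; 2. introduces the Distribution Alignment Assumption (DA²), which allows data mixtures to be optimized without training; 3. substantially reduces compute while improving downstream task performance.

Method: A classifier decomposes each dataset into a domain vector over meta-domains. The DA² assumption then guides mixture optimization, and the domain vectors can be plugged into prior methods to model the relationship between domain vectors and model performance.

Result: Domain2Vec reaches the same validation loss on Pile-CC with only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, it improves downstream performance by an average of 2.83%.

Insight: The alignment between dataset distribution and model performance can be modeled efficiently through vectorization, giving a low-overhead, scalable approach to data mixture optimization.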

Abstract: We introduce~\textsc{Domain2Vec}, a novel approach that decomposes any
dataset into a linear combination of several \emph{meta-domains}, a new concept
designed to capture the key underlying features of datasets.
\textsc{Domain2Vec} maintains a vocabulary of meta-domains and uses a
classifier to decompose any given dataset into a domain vector that corresponds
to a distribution over this vocabulary. These domain vectors enable the
identification of the optimal data mixture for language model (LM) pretraining
in a training-free manner under the \emph{\textbf{D}istribution
\textbf{A}lignment \textbf{A}ssumption} (DA$^{2}$), which suggests that when
the data distributions of the training set and the validation set are better
aligned, a lower validation loss is achieved. Moreover, \textsc{Domain2Vec} can
be seamlessly integrated into previous works to model the relationship between
domain vectors and LM performance, greatly enhancing the efficiency and
scalability of previous methods. Extensive experiments demonstrate that
\textsc{Domain2Vec} helps find the data mixture that enhances downstream task
performance with minimal computational overhead. Specifically,
\textsc{Domain2Vec} achieves the same validation loss on Pile-CC using only
$51.5\%$ of the computation required when training on the original mixture of
The Pile dataset. Under an equivalent compute budget, \textsc{Domain2Vec} improves
downstream performance by an average of $2.83\%$.
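
The alignment idea behind DA² can be illustrated with a minimal sketch. It assumes every training dataset already has a domain vector (a distribution over meta-domains) and searches a coarse grid of mixture weights for the mixture whose combined domain vector is closest to the validation domain vector. The function names, the L1 distance, the grid search, and the example vectors are all illustrative assumptions, not the authors' implementation.

```python
from itertools import product


def mix_vectors(weights, domain_vectors):
    """Return the weighted average of the per-dataset domain vectors."""
    dim = len(domain_vectors[0])
    return [
        sum(w * vec[j] for w, vec in zip(weights, domain_vectors))
        for j in range(dim)
    ]


def l1_distance(p, q):
    """L1 distance between two distributions (an assumed alignment measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def best_mixture(domain_vectors, target, steps=10):
    """Grid-search mixture weights on the simplex that best align with target."""
    k = len(domain_vectors)
    best_w, best_d = None, float("inf")
    for combo in product(range(steps + 1), repeat=k):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to 1
        weights = [c / steps for c in combo]
        d = l1_distance(mix_vectors(weights, domain_vectors), target)
        if d < best_d:
            best_w, best_d = weights, d
    return best_w, best_d


# Hypothetical domain vectors over three meta-domains.
train_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
validation_vector = [0.5, 0.3, 0.2]
weights, distance = best_mixture(train_vectors, validation_vector)
```

No model is trained anywhere in this search, which is the point of a training-free criterion; a real system would likely replace the grid search with a continuous optimizer.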

[35] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?

Sohee Yang,Sang-Woo Lee,Nora Kassner,Daniela Gottesman,Sebastian Riedel,Mor Geva

Main category: cs.CL

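TL;DR: This paper studies how well reasoning models identify and recover from unhelpful thoughts, such as irrelevant or misleading ones. Models can identify these thoughts but struggle to recover from them, and larger models struggle more than smaller ones. The findings call for better self-reevaluation in reasoning models.

Details Motivation: Recent work shows that reasoning models can reflect on and self-validate their reasoning, but how they behave when they actually encounter unhelpful thoughts is unclear. This paper aims to fill that gap.

Contribution: The paper systematically studies the effect of four types of unhelpful thoughts on reasoning models, exposes the gap between identifying such thoughts and recovering from them, and finds that larger models recover worse.

Method: Four types of unhelpful thoughts (uninformative, irrelevant, misdirecting, and leading to incorrect answers) are injected into the model's thinking process. The models' ability to identify and recover from them is then evaluated across model sizes.

Result: Models identify most unhelpful thoughts but recover poorly once the thoughts are injected, and larger models recover worse from short irrelevant thoughts. The smallest models are the least distracted by harmful-response-triggering thoughts.

Insight: Self-reevaluation in reasoning models still needs improvement, especially under distraction. Scaling does not necessarily help here and can even hurt.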

Abstract: Recent reasoning models show the ability to reflect, backtrack, and
self-validate their reasoning, which is crucial in spotting mistakes and
arriving at accurate solutions. A natural question that arises is how
effectively models can perform such self-reevaluation. We tackle this question
by investigating how well reasoning models identify and recover from four types
of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to
the question, thoughts misdirecting the question as a slightly different
question, and thoughts that lead to incorrect answers. We show that models are
effective at identifying most unhelpful thoughts but struggle to recover from
the same thoughts when these are injected into their thinking process, causing
significant performance drops. Models tend to naively continue the line of
reasoning of the injected irrelevant thoughts, which showcases that their
self-reevaluation abilities are far from a general “meta-cognitive” awareness.
Moreover, we observe non/inverse-scaling trends, where larger models struggle
more than smaller ones to recover from short irrelevant thoughts, even when
instructed to reevaluate their reasoning. We demonstrate the implications of
these findings with a jailbreak experiment using irrelevant thought injection,
showing that the smallest models are the least distracted by
harmful-response-triggering thoughts. Overall, our findings call for
improvement in self-reevaluation of reasoning models to develop better
reasoning and safer systems.

cs.CV [Back]

[36] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Sridhar S,Nithin A,Shakeel Rifath,Vasantha Raj

Main category: cs.CV

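TL;DR: This paper presents a multimodal cinematic video synthesis method that combines text-to-image and audio generation models, using Stable Diffusion, GPT-2, and a hybrid audio pipeline to produce high-fidelity video.

Details Motivation: As generative AI advances, efficiently synthesizing cinematic videos with narrative coherence and professional quality has become a research focus.

Contribution: Proposes a multimodal video synthesis framework that combines text-to-image generation (Stable Diffusion), narrative structuring (GPT-2), and hybrid audio, supporting the generation of 60-second movies.

Method: Uses a five-scene framework with linear frame interpolation, cinematic post-processing (such as sharpening), and audio-video synchronization, implemented in a GPU-accelerated Python environment.

Result: The experiments report strong visual quality, narrative coherence, and efficiency, with applications in creative, educational, and industrial settings.

Insight: Combining generative models across text, image, and audio offers a new route to cinematic video synthesis, and optimizations such as CUDA memory management improve reliability.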

Abstract: Advances in generative artificial intelligence have altered multimedia
creation, allowing for automatic cinematic video synthesis from text inputs.
This work describes a method for creating 60-second cinematic movies
incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for
narrative structuring, and a hybrid audio pipeline using gTTS and
YouTube-sourced music. It uses a five-scene framework, which is augmented by
linear frame interpolation, cinematic post-processing (e.g., sharpening), and
audio-video synchronization to provide professional-quality results. It was
created in a GPU-accelerated Google Colab environment using Python 3.11. It has
a dual-mode Gradio interface (Simple and Advanced), which supports resolutions
of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA
memory management and error handling ensure reliability. The experiments
demonstrate outstanding visual quality, narrative coherence, and efficiency,
furthering text-to-video synthesis for creative, educational, and industrial
applications.
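
Linear frame interpolation, one of the steps named above, can be sketched in a few lines. The example below blends two small grayscale frames represented as nested lists; the frame representation and the function name are assumptions made for illustration, and the paper's pipeline presumably operates on full image arrays.

```python
def interpolate_frames(frame_a, frame_b, num_intermediate):
    """Linearly blend two equally sized grayscale frames.

    Frames are nested lists of pixel intensities. Only the intermediate
    frames are returned; the two endpoint frames are excluded.
    """
    frames = []
    for i in range(1, num_intermediate + 1):
        t = i / (num_intermediate + 1)  # blend factor strictly between 0 and 1
        frames.append([
            [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ])
    return frames


# Two hypothetical 2x2 keyframes.
start = [[0.0, 0.0], [100.0, 100.0]]
end = [[100.0, 200.0], [100.0, 0.0]]
middle = interpolate_frames(start, end, num_intermediate=1)[0]
```

With a single intermediate frame the blend factor is 0.5, so each pixel is the average of the two keyframes.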

[37] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

Chenjian Gao,Lihe Ding,Xin Cai,Zhanpeng Huang,Zibin Wang,Tianfan Xue

Main category: cs.CV

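TL;DR: The paper proposes a mask-aware LoRA fine-tuning method that enables controllable, first-frame-guided video editing, addressing the reliance of earlier methods on large-scale pretraining and their limited editing flexibility.

Details Motivation: Current diffusion-based video editing methods rely on large-scale pretraining, and first-frame-guided editing lacks flexible control over subsequent frames.

Contribution: Proposes a mask-driven LoRA fine-tuning method that combines the input video with reference images to achieve efficient region-specific learning and controllable edit propagation.

Method: A spatial mask dynamically modulates which regions the model attends to while LoRA fine-tunes a pretrained I2V model, preserving the background and guiding the edited content.

Result: Experiments show that the method outperforms state-of-the-art approaches in video editing performance.

Insight: Combining masks with LoRA gives an efficient solution for controllable video editing, enabling flexible edits without changing the model architecture.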

Abstract: Video editing using diffusion models has achieved remarkable results in
generating high-quality edits for videos. However, current methods often rely
on large-scale pretraining, limiting flexibility for specific edits.
First-frame-guided editing provides control over the first frame, but lacks
flexibility over subsequent frames. To address this, we propose a mask-based
LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video
(I2V) models for flexible video editing. Our approach preserves background
regions while enabling controllable edit propagation. This solution offers
efficient and adaptable video editing without altering the model architecture.
To better steer this process, we incorporate additional references, such as
alternate viewpoints or representative scene states, which serve as visual
anchors for how content should unfold. We address the control challenge using a
mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model
to the editing context. The model must learn from two distinct sources: the
input video provides spatial structure and motion cues, while reference images
offer appearance guidance. A spatial mask enables region-specific learning by
dynamically modulating what the model attends to, ensuring that each area draws
from the appropriate source. Experimental results show our method achieves
superior video editing performance compared to state-of-the-art methods.
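
One way to picture region-specific learning from two sources is a masked reconstruction loss. The sketch below is only an assumption about how such a mask could route supervision: pixels inside the edit mask are compared with the reference image and the remaining pixels with the input video. The loss form, the function names, and the flattened pixel lists are hypothetical and are not taken from the paper.

```python
def masked_mse(pred, target, mask):
    """Mean squared error restricted to positions where mask is 1."""
    num = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    den = sum(mask)
    return num / den if den else 0.0


def editing_loss(pred, video, reference, edit_mask):
    """Supervise edited pixels with the reference and the rest with the video."""
    keep_mask = [1 - m for m in edit_mask]
    return masked_mse(pred, video, keep_mask) + masked_mse(pred, reference, edit_mask)


# Hypothetical flattened frames with four pixels; the last two are edited.
pred = [1.0, 2.0, 3.0, 4.0]
video = [1.0, 2.0, 0.0, 0.0]
reference = [0.0, 0.0, 3.0, 5.0]
edit_mask = [0, 0, 1, 1]
loss = editing_loss(pred, video, reference, edit_mask)
```

In this toy case the background matches the video exactly, so the whole loss comes from the one edited pixel that differs from the reference.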

[38] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding

Bin Guo,John H. L. Hansen

Main category: cs.CV

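TL;DR: Proposes DeepTraverse, a vision architecture inspired by depth-first search that uses recursive exploration and dynamic calibration modules for adaptive, iterative feature refinement, and performs strongly on image classification.

Details Motivation: Feature extraction in conventional vision models lacks explicit adaptive iteration and reasoning. The authors ask whether the logic of classical search algorithms can yield a more structured and interpretable feature extraction process.

Contribution: Proposes the DeepTraverse architecture, which combines recursive exploration modules with adaptive calibration modules for dynamic feature refinement and efficient learning.

Method: 1. Recursive exploration modules: deepen the analysis along promising feature paths, with parameter sharing; 2. Adaptive calibration modules: dynamically adjust feature salience based on global context.

Result: On several image classification benchmarks, DeepTraverse is highly competitive and often outperforms conventional models with similar or larger parameter counts.

Insight: Building algorithmic priors (such as depth-first search) into vision model design can improve efficiency, performance, and interpretability.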

Abstract: Conventional vision backbones, despite their success, often construct
features through a largely uniform cascade of operations, offering limited
explicit pathways for adaptive, iterative refinement. This raises a compelling
question: can principles from classical search algorithms instill a more
algorithmic, structured, and logical processing flow within these networks,
leading to representations built through more interpretable, perhaps
reasoning-like decision processes? We introduce DeepTraverse, a novel vision
architecture directly inspired by algorithmic search strategies, enabling it to
learn features through a process of systematic elucidation and adaptive
refinement distinct from conventional approaches. DeepTraverse operationalizes
this via two key synergistic components: recursive exploration modules that
methodically deepen feature analysis along promising representational paths
with parameter sharing for efficiency, and adaptive calibration modules that
dynamically adjust feature salience based on evolving global context. The
resulting algorithmic interplay allows DeepTraverse to intelligently construct
and refine feature patterns. Comprehensive evaluations across a diverse suite
of image classification benchmarks show that DeepTraverse achieves highly
competitive classification accuracy and robust feature discrimination, often
outperforming conventional models with similar or larger parameter counts. Our
work demonstrates that integrating such algorithmic priors provides a
principled and effective strategy for building more efficient, performant, and
structured vision backbones.

[39] Test-Time Adaptation for Generalizable Task Progress Estimation

Christos Ziakas,Alessandra Russo

Main category: cs.CV

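TL;DR: The paper proposes a test-time adaptation method in which a progress estimation model adapts online to the visual and temporal context of test trajectories by optimizing a self-supervised objective.

Details Motivation: Existing task progress estimation methods generalize poorly to out-of-distribution tasks and environments, which calls for a method that adapts dynamically at test time.

Contribution: Proposes a gradient-based meta-learning strategy that enables test-time adaptation through a self-supervised objective, improving generalization across diverse tasks and environments.

Method: The model is trained on expert visual trajectories and their natural language task descriptions so that test-time adaptation improves progress estimation by relying on semantic content rather than temporal order.

Result: The method performs well on out-of-distribution tasks, environments, and embodiments, outperforming the in-context learning approach based on autoregressive vision-language models.

Insight: Test-time adaptation is an effective way to improve generalization, particularly for dynamic and varied real-world tasks.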

Abstract: We propose a test-time adaptation method that enables a progress estimation
model to adapt online to the visual and temporal context of test trajectories
by optimizing a learned self-supervised objective. To this end, we introduce a
gradient-based meta-learning strategy to train the model on expert visual
trajectories and their natural language task descriptions, such that test-time
adaptation improves progress estimation relying on semantic content over
temporal order. Our test-time adaptation method generalizes from a single
training environment to diverse out-of-distribution tasks, environments, and
embodiments, outperforming the state-of-the-art in-context learning approach
using autoregressive vision-language models.

[40] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

Yantai Yang,Yuhao Wang,Zichen Wen,Luo Zhongwei,Chang Zou,Zhipeng Zhang,Chuan Wen,Linfeng Zhang

Main category: cs.CV

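TL;DR: EfficientVLA is a training-free inference acceleration framework that jointly exploits several kinds of redundancy to substantially improve the efficiency and deployability of Vision-Language-Action models.

Details Motivation: VLA models, especially diffusion-based ones, hold great promise for embodied intelligence, but high compute and memory demands limit practical use. Existing methods tend to target isolated inefficiencies rather than the whole pipeline.

Contribution: Proposes the EfficientVLA framework, which systematically addresses redundancy in VLA models through language-module pruning, visual-pathway optimization, and removal of temporal redundancy in the action head.

Method: Combines three strategies: (1) pruning guided by an analysis of inter-layer redundancy in the language module; (2) task-aware visual token selection; (3) caching and reusing intermediate features in the action head.

Result: On CogACT, achieves a 1.93x inference speedup and reduces FLOPs to 28.9%, with only a 0.6% drop in task success rate.

Insight: Addressing redundancy across the whole pipeline is more effective than isolated fixes, and substantial efficiency gains are possible without additional training.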

Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based
architectures, demonstrate transformative potential for embodied intelligence
but are severely hampered by high computational and memory demands stemming
from extensive inherent and inference-time redundancies. While existing
acceleration efforts often target isolated inefficiencies, such piecemeal
solutions typically fail to holistically address the varied computational and
memory bottlenecks across the entire VLA pipeline, thereby limiting practical
deployability. We introduce EfficientVLA, a structured and training-free
inference acceleration framework that systematically eliminates these barriers
by cohesively exploiting multifaceted redundancies. EfficientVLA
synergistically integrates three targeted strategies: (1) pruning of
functionally inconsequential layers from the language module, guided by an
analysis of inter-layer redundancies; (2) optimizing the visual processing
pathway through a task-aware strategy that selects a compact, diverse set of
visual tokens, balancing task-criticality with informational coverage; and (3)
alleviating temporal computational redundancy within the iterative
diffusion-based action head by strategically caching and reusing key
intermediate features. We apply our method to a standard VLA model, CogACT,
yielding a 1.93x inference speedup and reducing FLOPs to 28.9%, with only a
0.6% drop in success rate on the SIMPLER benchmark.
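
The third strategy, caching and reusing intermediate features across denoising steps, can be sketched as a loop that refreshes an expensive feature only every few iterations. Everything here is a toy assumption: `compute_features`, `denoise`, and the fixed `cache_interval` are hypothetical stand-ins, and the abstract does not specify how EfficientVLA decides which features to cache or when to refresh them.

```python
def run_action_head(x, num_steps, compute_features, denoise, cache_interval=2):
    """Iterative denoising loop that recomputes features only every few steps.

    compute_features and denoise are hypothetical callables standing in for
    the expensive and cheap parts of a diffusion-based action head.
    """
    cached = None
    recomputed = 0
    for step in range(num_steps):
        if step % cache_interval == 0:
            cached = compute_features(x, step)  # expensive: refresh the cache
            recomputed += 1
        x = denoise(x, cached, step)  # cheap: reuse the cached features
    return x, recomputed


# Toy stand-ins so the sketch runs end to end.
def toy_features(x, step):
    return 0.5 * x


def toy_denoise(x, features, step):
    return x - 0.1 * features


action, feature_calls = run_action_head(
    1.0, num_steps=4, compute_features=toy_features, denoise=toy_denoise
)
```

With four steps and an interval of two, the expensive function runs twice instead of four times; the trade-off is that odd-numbered steps use slightly stale features.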

[41] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild

Klim Kireev,Ana-Maria Creţu,Raphael Meier,Sarah Adel Bargal,Elissa Redmiles,Carmela Troncoso

Main category: cs.CV

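TL;DR: The paper introduces ICCWD, a multimodal dataset for detecting minors in images, filling a gap in existing research, and uses benchmarks to show the limitations of current methods.

Details Motivation: Digital content depicting minors is subject to specific regulation, yet no dataset exists for detecting minors in a multimodal setting. The authors release ICCWD to support such research.

Contribution: 1. Releases ICCWD, to the authors' knowledge the first multimodal image-caption dataset for detecting minors; 2. Covers a rich variety of image contexts, including fictional depictions and partially visible bodies; 3. Benchmarks existing detectors on the dataset (best true positive rate: 75.3%).

Method: 1. Manually labels 10,000 image-caption pairs for the presence or absence of a child; 2. Uses the dataset to benchmark three detectors, including a commercial age estimation system.

Result: The experiments suggest that detecting minors is a challenging task: the best method achieves a 75.3% true positive rate.

Insight: Detection of minors in multimodal settings still needs improvement, and a public dataset should help drive the development of better methods.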

Abstract: Platforms and the law regulate digital content depicting minors (defined as
individuals under 18 years of age) differently from other types of content.
Given the sheer amount of content that needs to be assessed, machine
learning-based automation tools are commonly used to detect content depicting
minors. To our knowledge, no dataset or benchmark currently exists for
evaluating these detection tools in a multi-modal environment. To fill
this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an
image-caption dataset aimed at benchmarking tools that detect depictions of
minors. Our dataset is richer than previous child image datasets, containing
images of children in a variety of contexts, including fictional depictions and
partially visible bodies. ICCWD contains 10,000 image-caption pairs manually
labeled to indicate the presence or absence of a child in the image. To
demonstrate the possible utility of our dataset, we use it to benchmark three
different detectors, including a commercial age estimation system applied to
images. Our results suggest that child detection is a challenging task, with
the best method achieving a 75.3% true positive rate. We hope the release of
our dataset will aid in the design of better minor detection methods in a wide
range of scenarios.

[42] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers

Natanael Lucena,Fábio S. da Silva,Ricardo Rios

Main category: cs.CV

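TL;DR: The paper compares convolutional neural networks (CNNs) and Vision Transformers (ViTs) on multi-class classification of images of psoriasis and similar diseases. ViTs perform better with smaller models, and DaViT-B reaches an F1-score of 96.4%.

Details Motivation: To explore the potential of Vision Transformers for medical image classification tasks such as psoriasis detection, and to compare them with conventional CNN approaches.

Contribution: Shows that ViTs, and DaViT-B in particular, are effective for psoriasis detection, reaching an F1-score of 96.4% and outperforming CNNs.

Method: ImageNet-pretrained CNN and ViT models are fine-tuned on a specific medical image dataset and their performance is compared.

Result: ViTs perform better with smaller models. DaViT-B obtains an F1-score of 96.4% and is recommended as the most efficient architecture for automated psoriasis detection.

Insight: ViTs show strong potential for medical image classification, maintaining high performance even with smaller models.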

Abstract: This paper presents a comparison of the performance of Convolutional Neural
Networks (CNNs) and Vision Transformers (ViTs) on the multi-class classification
of images containing lesions of psoriasis and similar diseases. Models
pre-trained on ImageNet were adapted to a specific data set. Both achieved high
predictive metrics, but the ViTs stood out for their superior performance with
smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the
best results, with an f1-score of 96.4%, and is recommended as the most
efficient architecture for automated psoriasis detection. This article
reinforces the potential of ViTs for medical image classification tasks.
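
For readers unfamiliar with the metric, the sketch below computes a macro-averaged F1-score for a multi-class problem. Macro averaging is an assumption, since the abstract does not say which averaging the reported 96.4% uses, and the labels are made-up examples.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four lesion images.
y_true = ["psoriasis", "psoriasis", "eczema", "eczema"]
y_pred = ["psoriasis", "eczema", "eczema", "eczema"]
score = macro_f1(y_true, y_pred)
```

Here the per-class scores are 2/3 for psoriasis and 0.8 for eczema, and the macro F1 is their mean.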

[43] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

Xiyao Wang,Zhengyuan Yang,Chao Feng,Yongyuan Liang,Yuhang Zhou,Xiaoyu Liu,Ziyi Zang,Ming Li,Chung-Ching Lin,Kevin Lin,Linjie Li,Furong Huang,Lijuan Wang

Main category: cs.CV

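TL;DR: The paper proposes ViCrit, a task in which vision-language models (VLMs) localize subtle visual hallucinations, to strengthen their perception, and reports substantial gains across a range of vision-language benchmarks.

Details Motivation: Visual tasks are often hard to verify unambiguously, which has limited the extension of reinforcement learning (RL) to VLMs. The paper aims to design a visual task that is both challenging and verifiable.

Contribution: 1. Proposes the ViCrit task, which trains VLMs to localize synthetically injected visual hallucinations; 2. Introduces ViCrit-Bench, a benchmark that systematically evaluates models across image domains and error types.

Method: A subtle visual description error (a change to an object, attribute, count, or spatial relation) is injected into a human-written image caption. The model must locate the corrupted span and receives a binary, exact-match reward.

Result: Models trained with ViCrit improve substantially on a variety of vision-language benchmarks, and the gains transfer to abstract image reasoning and visual math.

Insight: Fine-grained hallucination criticism effectively improves visual perception rather than relying on memorization of seen objects.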

Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning
large language models (LLMs) using tasks that are challenging yet easily
verifiable, such as math reasoning or code generation. However, extending this
success to visual perception in vision-language models (VLMs) has been impeded
by the scarcity of vision-centric tasks that are simultaneously challenging and
unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption
Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle,
synthetic visual hallucination injected into paragraphs of human-written image
captions. Starting from a 200-word caption, we inject a single, subtle visual
description error (altering a few words describing objects, attributes, counts,
or spatial relations) and task the model with pinpointing the corrupted span
given the image and the modified caption. This formulation preserves the full perceptual
difficulty while providing a binary, exact-match reward that is easy to compute
and unambiguous. Models trained with the ViCrit Task exhibit substantial gains
across a variety of VL benchmarks. Crucially, the improvements transfer beyond
natural-image training data to abstract image reasoning and visual math,
showing promise of learning to perceive rather than merely memorizing seen
objects. To facilitate evaluation, we further introduce ViCrit-Bench, a
category-balanced diagnostic benchmark that systematically probes perception
errors across diverse image domains and error types. Together, our results
demonstrate that fine-grained hallucination criticism is an effective and
generalizable objective for enhancing visual perception in VLMs.
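
The binary, exact-match reward is simple enough to sketch directly. The example below injects one error into a caption and scores a predicted span against it. The helper names, the example caption, and the light normalization (case and whitespace) are assumptions of this sketch rather than details from the paper.

```python
def inject_error(caption, original, replacement):
    """Swap one phrase in a caption to create a synthetic hallucination."""
    if caption.count(original) != 1:
        raise ValueError("the phrase must occur exactly once in the caption")
    return caption.replace(original, replacement)


def exact_match_reward(predicted_span, corrupted_span):
    """Return 1.0 when the predicted span matches the injected error, else 0.0.

    Normalizing case and whitespace is an assumption of this sketch.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    return 1.0 if normalize(predicted_span) == normalize(corrupted_span) else 0.0


caption = "A brown dog sits beside two red chairs on the porch."
modified = inject_error(caption, "two red chairs", "three red chairs")
reward_hit = exact_match_reward("Three red chairs", "three red chairs")
reward_miss = exact_match_reward("brown dog", "three red chairs")
```

Because the injected span is known by construction, the reward needs no learned judge, which is what makes the task easy to verify.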

[44] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context

Yael Frischholz,Devis Tuia,Michael Lehning

Main category: cs.CV

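TL;DR: The paper proposes an attention-based method for retrieving surface solar radiation (SSR) that implicitly learns to infer clear-sky surface reflectance from satellite image sequences, without hand-crafted features.

Details Motivation: Conventional SSR retrieval methods perform poorly in mountainous regions because of dynamic snow cover and need to be improved.

Contribution: Proposes an attention-based SSR retrieval method that implicitly learns surface reflectance dynamics and substantially improves retrieval in mountainous regions.

Method: Uses a Temporo-Spatial Vision Transformer that takes multi-spectral satellite image sequences and static topographic features as input and is trained on SSR estimates from HelioMont.

Result: Given a sufficiently long temporal context, the model matches the performance of albedo-informed models, with the largest effect in mountainous regions.

Insight: A long temporal context captures surface reflectance dynamics and improves generalization.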

Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery
critically depends on estimating the background reflectance that a spaceborne
sensor would observe under clear-sky conditions. Deviations from this baseline
can then be used to detect cloud presence and guide radiative transfer models
in inferring atmospheric attenuation. Operational retrieval algorithms
typically approximate background reflectance using monthly statistics, assuming
surface properties vary slowly relative to atmospheric conditions. However,
this approach fails in mountainous regions where intermittent snow cover and
changing snow surfaces are frequent. We propose an attention-based emulator for
SSR retrieval that implicitly learns to infer clear-sky surface reflectance
from raw satellite image sequences. Built on the Temporo-Spatial Vision
Transformer, our approach eliminates the need for hand-crafted features such as
explicit albedo maps or cloud masks. The emulator is trained on instantaneous
SSR estimates from the HelioMont algorithm over Switzerland, a region
characterized by complex terrain and dynamic snow cover. Inputs include
multi-spectral SEVIRI imagery from the Meteosat Second Generation platform,
augmented with static topographic features and solar geometry. The target
variable is HelioMont’s SSR, computed as the sum of its direct and diffuse
horizontal irradiance components, given at a spatial resolution of 1.7 km. We
show that, when provided a sufficiently long temporal context, the model
matches the performance of albedo-informed models, highlighting the model’s
ability to internally learn and exploit latent surface reflectance dynamics.
Our geospatial analysis shows this effect is most powerful in mountainous
regions and improves generalization in both simple and complex topographic
settings. Code and datasets are publicly available at
https://github.com/frischwood/HeMu-dev.git

[45] Attention, Please! Revisiting Attentive Probing for Masked Image Modeling

Bill Psomas,Dionysis Christopoulos,Eirini Baltzi,Ioannis Kakogeorgiou,Tilemachos Aravanis,Nikos Komodakis,Konstantinos Karantzalos,Yannis Avrithis,Giorgos Tolias

Main category: cs.CV

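TL;DR: The paper revisits attentive probing for Masked Image Modeling (MIM) and proposes efficient probing (EP), a multi-query cross-attention mechanism that improves both efficiency and accuracy.

Details Motivation: Because patch tokens are distributed in nature, standard linear probing (LP) does not adequately reflect the potential of MIM models, which motivates a more efficient attentive probing method.

Contribution: Proposes efficient probing (EP), which removes redundant projections and reduces the number of trainable parameters, substantially improving computational efficiency and outperforming existing methods on several benchmarks.

Method: Uses a multi-query cross-attention mechanism to streamline the design of the attentive probe and reduce computational cost.

Result: EP outperforms LP and prior attentive probing methods on seven benchmarks and performs well in low-shot and layer-wise settings.

Insight: An efficient attention mechanism improves performance, produces interpretable attention maps, and applies across pre-training paradigms.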

Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is
emerging as the preferred evaluation protocol for self-supervised learning
(SSL). Yet, the standard linear probing (LP) fails to adequately reflect the
potential of models trained with Masked Image Modeling (MIM), due to the
distributed nature of patch tokens. This motivates the need for attentive
probing, an alternative that uses attention to selectively aggregate
patch-level features. Despite its growing adoption, attentive probing remains
under-explored, with existing methods suffering from excessive parameterization
and poor computational efficiency.
In this work, we revisit attentive probing through the lens of the
accuracy-efficiency trade-off. We conduct a systematic study of existing
methods, analyzing their mechanisms and benchmarking their performance. We
introduce efficient probing (EP), a multi-query cross-attention mechanism that
eliminates redundant projections, reduces the number of trainable parameters,
and achieves up to a 10$\times$ speed-up over conventional multi-head
attention. Despite its simplicity, EP outperforms LP and prior attentive
probing approaches across seven benchmarks, generalizes well beyond MIM to
diverse pre-training paradigms, produces interpretable attention maps, and
achieves strong gains in low-shot and layer-wise settings. Code available at
https://github.com/billpsomas/efficient-probing.
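
Attentive pooling with several learnable queries can be sketched in plain Python. In the sketch below each query attends over all patch tokens and pools one chunk of the feature channels, and the chunks are concatenated into a single descriptor. The absence of key and value projections and the per-query channel chunks are assumptions made to keep the example small; they should not be read as the exact EP design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(patches, queries):
    """Pool patch features with learnable queries.

    patches: list of N feature vectors of dimension D.
    queries: list of M query vectors of dimension D (M must divide D).
    Query i attends over all patches and pools the i-th chunk of channels;
    the chunks are concatenated into one D-dimensional descriptor.
    """
    dim = len(patches[0])
    chunk = dim // len(queries)
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for i, q in enumerate(queries):
        scores = [scale * sum(a * b for a, b in zip(q, p)) for p in patches]
        attn = softmax(scores)
        for j in range(i * chunk, (i + 1) * chunk):
            pooled.append(sum(w * p[j] for w, p in zip(attn, patches)))
    return pooled


# Two hypothetical patch tokens and two all-zero queries.
patches = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
queries = [[0.0] * 4, [0.0] * 4]
descriptor = attentive_pool(patches, queries)
```

All-zero queries give uniform attention, so the descriptor reduces to average pooling; trained queries would instead weight informative patches more heavily before a linear classifier is applied.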

[46] Improving Personalized Search with Regularized Low-Rank Parameter Updates

Fiona Ryan,Josef Sivic,Fabian Caba Heilbron,Judy Hoffman,James M. Rehg,Bryan Russell

Main category: cs.CV

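TL;DR: The paper improves personalized vision-language retrieval with regularized low-rank parameter updates, with clear gains in the few-shot setting.

Details Motivation: Personalized vision-language retrieval must learn a new concept (such as "my dog Fido") from a few examples and combine it with general knowledge to recognize the concept in different contexts, which makes the task challenging.

Contribution: 1. Shows that regularized low-rank adaptation of a small set of parameters in the final layer of the language encoder recognizes personal concepts while preserving general knowledge; 2. Explores strategies for combining the parameters of multiple personal concepts; 3. Introduces an image retrieval metric based on VLM-generated captions to evaluate how well general knowledge is preserved.

Method: Fine-tunes the final layer of the language encoder with regularized low-rank adaptation and combines multiple personal concepts by parameter addition.

Result: On the DeepFashion2 and ConCon-Chi benchmarks, personalized retrieval accuracy improves by 4%-22% over prior methods.

Insight: Regularized low-rank adaptation is an effective way to learn personal concepts, and captions generated by a VLM can be used to evaluate how well general knowledge is preserved.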

Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g.
“my dog Fido”) from only a few examples. This task is challenging because it
requires not only learning a new concept from a few images, but also
integrating the personal and general knowledge together to recognize the
concept in different contexts. In this paper, we show how to effectively adapt
the internal representation of a vision-language dual encoder model for
personalized vision-language retrieval. We find that regularized low-rank
adaptation of a small set of parameters in the language encoder’s final layer
serves as a highly effective alternative to textual inversion for recognizing
the personal concept while preserving general knowledge. Additionally, we
explore strategies for combining parameters of multiple learned personal
concepts, finding that parameter addition is effective. To evaluate how well
general knowledge is preserved in a finetuned representation, we introduce a
metric that measures image retrieval accuracy based on captions generated by a
vision language model (VLM). Our approach achieves state-of-the-art accuracy on
two benchmarks for personalized image retrieval with natural language queries -
DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal
retrievals.
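
Parameter addition for multiple concepts can be illustrated with tiny matrices. The sketch assumes each personal concept contributes a low-rank update of the form B A to one weight matrix and that the updates are simply summed onto the base weight. The shapes, values, and helper names are hypothetical, and the regularization used during training is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def low_rank_delta(b_factor, a_factor):
    """Low-rank update B A, with B of shape (d_out, r) and A of shape (r, d_in)."""
    return matmul(b_factor, a_factor)


def combine_concepts(base_weight, deltas):
    """Add the low-rank updates of several personal concepts to the base weight."""
    combined = base_weight
    for delta in deltas:
        combined = add(combined, delta)
    return combined


# Hypothetical 2x2 base weight and two rank-1 concept updates.
base = [[1.0, 0.0], [0.0, 1.0]]
concept_fido = low_rank_delta([[1.0], [0.0]], [[0.0, 0.5]])
concept_mug = low_rank_delta([[0.0], [1.0]], [[0.25, 0.0]])
combined = combine_concepts(base, [concept_fido, concept_mug])
```

Since each update stores only the two thin factors, keeping one update per concept is cheap, and adding or dropping a concept does not require retraining the others.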

[47] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators

Parsa Rahimi,Sebastien Marcel

Main category: cs.CV

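TL;DR: ScoreMix is a data augmentation method based on score composition in diffusion models. Mixing the scores of different class-conditioned diffusion trajectories generates challenging synthetic samples that significantly improve discriminative models.

Details Motivation: When labeled data is limited, how to use generative models to strengthen discriminative models is a key question. ScoreMix exploits the score compositional properties of diffusion models to generate synthetic samples for this purpose.

Contribution: Proposes ScoreMix, a simple yet effective data augmentation method that mixes the scores of different class-conditioned diffusion trajectories to generate challenging samples and significantly improve discriminator performance.

Method: Synthetic samples are generated by taking convex combinations of the scores of different class-conditioned diffusion trajectories. Mixing classes that are far apart in the discriminator's embedding space yields larger performance gains.

Result: ScoreMix significantly improves discriminative performance on all studied benchmarks, without extensive hyperparameter searches.

Insight: The generator's condition space and the discriminator's embedding space are only weakly correlated, and mixing classes that are distant in the discriminator's embedding space helps more than mixing classes that are close in the generator's condition space.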

Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation
strategy leveraging the score compositional properties of diffusion models to
enhance discriminator performance, particularly under scenarios with limited
labeled data. By convexly mixing the scores from different class-conditioned
trajectories during diffusion sampling, we generate challenging synthetic
samples that significantly improve discriminative capabilities in all studied
benchmarks. We systematically investigate class-selection strategies for mixing
and discover that greater performance gains arise when combining classes
distant in the discriminator’s embedding space, rather than close in the
generator’s condition space. Moreover, we empirically show that, under standard
metrics, the correlation between the generator’s learned condition space and
the discriminator’s embedding space is minimal. Our approach achieves notable
performance improvements without extensive parameter searches, demonstrating
practical advantages for training discriminative models while effectively
mitigating the problems of collecting large datasets. Paper website:
https://parsa-ra.github.io/scoremix
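
Convex score mixing can be demonstrated on a toy problem where the class-conditional scores are known in closed form. The sketch below uses two one-dimensional unit-variance Gaussians as stand-ins for class-conditioned diffusion scores and follows the mixed score with noise-free gradient ascent. The Gaussian setting, the step size, and the function names are illustrative assumptions, not the paper's sampler.

```python
def gaussian_score(x, mean, var=1.0):
    """Score (gradient of the log density) of a 1-D Gaussian at x."""
    return (mean - x) / var


def mixed_score(x, mean_a, mean_b, lam):
    """Convex combination of two class-conditional scores."""
    return lam * gaussian_score(x, mean_a) + (1 - lam) * gaussian_score(x, mean_b)


def follow_score(x, mean_a, mean_b, lam, step=0.1, iters=500):
    """Noise-free gradient ascent along the mixed score (a toy stand-in for sampling)."""
    for _ in range(iters):
        x += step * mixed_score(x, mean_a, mean_b, lam)
    return x


# Two hypothetical classes centred at -2 and 4, mixed with equal weight.
sample = follow_score(x=5.0, mean_a=-2.0, mean_b=4.0, lam=0.5)
```

With equal variances the mixed score is that of a Gaussian centred at the interpolated mean, so the trajectory settles between the two classes, which is the kind of in-between sample the augmentation relies on.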

[48] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops

Hamid Kamangir,Mona Hajiesmaeeli,Mason Earles

Main category: cs.CV

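TL;DR: The paper presents a deep learning model that combines multiple data sources (satellite imagery, climate, evapotranspiration, and soil data) for county-level yield forecasting of more than 70 crops in California, reaching an overall R² score of 0.76.

Details Motivation: California is a global leader in agricultural production, but the complex interplay of environmental, climatic, and soil factors makes yield forecasting difficult. Existing data have not been fully combined across sources for accurate forecasting.

Contribution: 1) Builds a comprehensive yield dataset covering more than 70 crops across all California counties; 2) Develops a multi-modal deep learning model for county-level crop yield forecasting.

Method: Uses stratified feature extraction and a time-series encoder to capture spatial and temporal dynamics during the growing season. Static inputs (such as soil characteristics and crop identity) inform long-term variability.

Result: The model achieves an overall R² score of 0.76 on unseen test data.

Insight: Integrating multiple data sources and modeling spatial and temporal dynamics are central to yield forecasting, and the benchmark offers a new tool for climate adaptation and precision farming.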

Abstract: California is a global leader in agricultural production, contributing 12.5%
of the United States' total output and ranking as the fifth-largest food and
cotton supplier in the world. Despite the availability of extensive historical
yield data from the USDA National Agricultural Statistics Service, accurate and
timely crop yield forecasting remains a challenge due to the complex interplay
of environmental, climatic, and soil-related factors. In this study, we
introduce a comprehensive crop yield benchmark dataset covering over 70 crops
across all California counties from 2008 to 2022. The benchmark integrates
diverse data sources, including Landsat satellite imagery, daily climate
records, monthly evapotranspiration, and high-resolution soil properties. To
effectively learn from these heterogeneous inputs, we develop a multi-modal
deep learning model tailored for county-level, crop-specific yield forecasting.
The model employs stratified feature extraction and a timeseries encoder to
capture spatial and temporal dynamics during the growing season. Static inputs
such as soil characteristics and crop identity inform long-term variability.
Our approach achieves an overall R² score of 0.76 across all crops on an unseen
test dataset, highlighting strong predictive performance across California's
diverse agricultural regions. This benchmark and modeling framework offer a
valuable foundation for advancing agricultural forecasting, climate adaptation,
and precision farming. The full dataset and codebase are publicly available at
our GitHub repository.
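
The reported R² score is the coefficient of determination. A minimal sketch of how it is computed follows; the yield values are made-up numbers, not data from the benchmark.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted county yields.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 6.0, 7.0]
score = r2_score(observed, predicted)
```

A score of 1 means perfect predictions and 0 means no better than predicting the mean yield, so 0.76 indicates that the model explains about three quarters of the variance in the test yields.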

[49] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos

Rajeev Yasarla,Shizhong Han,Hong Cai,Fatih Porikli

Main category: cs.CV

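TL;DR: DySS is an efficient method for 3D object detection from multi-camera videos based on dynamic queries and state-space learning. A state-space model (SSM) and dynamic query updates deliver strong detection performance at real-time inference speed.

Details Motivation: Earlier methods rely on dense BEV features, which are costly to compute. Sparse query-based methods improve on this but still need many queries and become expensive as more video frames are used. DySS aims to improve both efficiency and accuracy through dynamic queries and state-space learning.

Contribution: 1. Introduces an SSM for temporal feature processing, with auxiliary tasks (future prediction and masked reconstruction) that help the model capture motion and correspondence information. 2. Proposes a dynamic query update mechanism (merge, remove, split) that maintains a lean set of detection queries.

Method: 1. An SSM sequentially processes the sampled features over time steps. 2. Future prediction and masked reconstruction tasks improve SSM training. 3. Queries are updated dynamically based on the state-space learned features, reducing computation.

Result: On the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, it achieves 56.2 NDS and 46.2 mAP at a real-time speed of 33 FPS.

Insight: By combining temporal modeling with dynamic query optimization, DySS improves the efficiency of 3D object detection while maintaining high accuracy, which is promising for real-time perception in autonomous driving.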

Abstract: Camera-based 3D object detection in Bird’s Eye View (BEV) is one of the most
important perception tasks in autonomous driving. Earlier methods rely on dense
BEV features, which are costly to construct. More recent works explore sparse
query-based detection. However, they still require a large number of queries
and can become expensive to run when more video frames are used. In this paper,
we propose DySS, a novel method that employs state-space learning and dynamic
queries. More specifically, DySS leverages a state-space model (SSM) to
sequentially process the sampled features over time steps. In order to
encourage the model to better capture the underlying motion and correspondence
information, we introduce auxiliary tasks of future prediction and masked
reconstruction to better train the SSM. The state of the SSM then provides an
informative yet efficient summarization of the scene. Based on the state-space
learned features, we dynamically update the queries via merge, remove, and
split operations, which help maintain a useful, lean set of detection queries
throughout the network. Our proposed DySS achieves both superior detection
performance and efficient inference. Specifically, on the nuScenes test split,
DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the
art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a
real-time inference speed of 33 FPS.
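
The bookkeeping behind merge, remove, and split operations can be sketched on a toy set of queries. The example treats each query as a (position, score) pair on a line and applies fixed thresholds. In DySS the updates are driven by state-space learned features, so the thresholds, the 1-D positions, and the averaging rule here are purely illustrative assumptions.

```python
def update_queries(queries, remove_below=0.2, merge_within=1.0,
                   split_above=0.9, offset=0.5):
    """Apply remove, merge, and split to a list of (position, score) queries."""
    # Remove: drop low-confidence queries.
    kept = sorted(q for q in queries if q[1] >= remove_below)

    # Merge: fuse neighbouring queries whose positions are close.
    merged = []
    for pos, score in kept:
        if merged and pos - merged[-1][0] <= merge_within:
            prev_pos, prev_score = merged[-1]
            merged[-1] = ((prev_pos + pos) / 2, max(prev_score, score))
        else:
            merged.append((pos, score))

    # Split: duplicate very confident queries around their position.
    result = []
    for pos, score in merged:
        if score >= split_above:
            result.extend([(pos - offset, score), (pos + offset, score)])
        else:
            result.append((pos, score))
    return result


# Hypothetical queries: two near-duplicates, one weak, one very confident.
queries = [(0.0, 0.5), (1.0, 0.7), (5.0, 0.1), (10.0, 0.95)]
updated = update_queries(queries)
```

The near-duplicates collapse into one query, the weak query disappears, and the confident one is split in two, so the set stays small while still covering the regions that matter.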

[50] HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Eunkyu Park,Minyeong Kim,Gunhee Kim

Main category: cs.CV

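TL;DR: The paper presents the HalLoc dataset and a baseline model for efficient, probabilistic hallucination detection, with the goal of improving the reliability of vision-language models in critical applications.

Details Motivation: Hallucinations seriously undermine the reliability of large vision-language models, while existing detection methods are computationally expensive and cannot handle ambiguous cases.

Contribution: 1. Presents HalLoc, a dataset of 150K samples with token-level annotations; 2. Designs a low-overhead baseline model that detects hallucinations concurrently during generation.

Method: Builds a dataset with token-level annotations and trains a baseline model on it for low-latency hallucination detection.

Result: The HalLoc dataset and baseline model are publicly available, providing practical tools for improving the reliability of vision-language models.

Insight: The work moves hallucination detection from binary decisions toward graded, probabilistic scores, improving transparency and practicality.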

Abstract: Hallucinations pose a significant challenge to the reliability of large
vision-language models, making their detection essential for ensuring accuracy
in critical applications. Current detection methods often rely on
computationally intensive models, leading to high latency and resource demands.
Their definitive outcomes also fail to account for real-world scenarios where
the line between hallucinated and truthful information is unclear. To address
these issues, we propose HalLoc, a dataset designed for efficient,
probabilistic hallucination detection. It features 150K token-level annotated
samples, including hallucination types, across Visual Question Answering (VQA),
instruction-following, and image captioning tasks. This dataset facilitates the
development of models that detect hallucinations with graded confidence,
enabling more informed user interactions. Additionally, we introduce a baseline
model trained on HalLoc, offering low-overhead, concurrent hallucination
detection during generation. The model can be seamlessly integrated into
existing VLMs, improving reliability while preserving efficiency. The prospect
of a robust plug-and-play hallucination detection module opens new avenues for
enhancing the trustworthiness of vision-language models in real-world
applications. The HalLoc dataset and code are publicly available at:
https://github.com/dbsltm/cvpr25_halloc.

[51] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation

Hamzeh Asgharnezhad,Pegah Tabarisaadi,Abbas Khosravi,Roohallah Alizadehsani,U. Rajendra Acharya

Main category: cs.CV

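TL;DR: Through a comprehensive evaluation on the HAM10000 dataset, this paper studies deep-learning-based skin cancer classification that combines transfer learning with uncertainty quantification (UQ). It shows the superior performance of CLIP-based vision transformers and the balance that ensemble methods strike between accuracy and uncertainty handling.

Details Motivation: Automated skin cancer classification is critical for early treatment and improved patient outcomes, but existing deep learning methods are limited by data scarcity and a lack of uncertainty awareness. The paper aims to improve model performance and trustworthiness through transfer learning and UQ.

Contribution: 1. Benchmarks multiple pre-trained feature extractors and classifiers on HAM10000; 2. Introduces UQ methods (MCD, Ensemble, and EMCD) to assess model uncertainty; 3. Demonstrates the advantages of CLIP-based vision transformers and ensemble methods.

Method: 1. Benchmarks pre-trained models such as CLIP variants, ResNet50, DenseNet121, VGG16, and EfficientNet-V2-Large combined with SVM, XGBoost, and logistic regression classifiers; 2. Applies MCD, Ensemble, and EMCD for uncertainty quantification; 3. Evaluates uncertainty with metrics such as UAcc, USen, USpe, and UPre.

Result: CLIP-based vision transformers (e.g., LAION CLIP ViT-H/14) combined with SVM perform best; ensemble methods balance accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions.

Insight: In medical diagnosis, integrating uncertainty quantification can improve the trustworthiness and practical value of deep learning and provide a more reliable basis for clinical decisions.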

Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment
and improved patient outcomes. Deep learning (DL) models have shown promise in
automating skin cancer classification, but their performance can be limited by
data scarcity and a lack of uncertainty awareness. In this study, we present a
comprehensive evaluation of DL-based skin lesion classification using transfer
learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the
first phase, we benchmarked several pre-trained feature extractors, including
Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50
(ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual
Geometry Group network (VGG16), and EfficientNet-V2-Large, combined with a range
of traditional classifiers such as Support Vector Machine (SVM), eXtreme
Gradient Boosting (XGBoost), and logistic regression. Our results show that
CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM,
deliver the highest classification performance. In the second phase, we
incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte
Carlo Dropout (EMCD) to assess not only prediction accuracy but also the
reliability of model outputs. We evaluated these models using uncertainty-aware
metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen),
uncertainty specificity (USpe), and uncertainty precision (UPre). The results
demonstrate that ensemble methods offer a good trade-off between accuracy and
uncertainty handling, while EMCD is more sensitive to uncertain predictions.
This study highlights the importance of integrating UQ into DL-based medical
diagnosis to enhance both performance and trustworthiness in real-world
clinical applications.
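
As a rough illustration of the Monte Carlo Dropout idea mentioned in the abstract, the sketch below averages class probabilities from several stochastic forward passes and uses the entropy of the averaged distribution as an uncertainty score. The function name `predictive_entropy` and the toy probabilities are assumptions made for illustration; the paper's actual UQ pipeline and its UAcc/USen/USpe/UPre computations are not reproduced here.

```python
import math

def predictive_entropy(prob_samples):
    """Average class probabilities over stochastic passes.

    prob_samples: list of per-pass probability vectors (one per dropout pass).
    Returns (mean_probs, entropy_in_nats) of the averaged distribution.
    """
    n_passes = len(prob_samples)
    n_classes = len(prob_samples[0])
    mean_probs = [
        sum(sample[c] for sample in prob_samples) / n_passes
        for c in range(n_classes)
    ]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, entropy

# Hypothetical two-class outputs from two dropout passes each.
agreeing_passes = [[0.9, 0.1], [0.9, 0.1]]      # passes agree
disagreeing_passes = [[0.9, 0.1], [0.1, 0.9]]   # passes disagree

_, entropy_agree = predictive_entropy(agreeing_passes)
mean_disagree, entropy_disagree = predictive_entropy(disagreeing_passes)
```

When the passes disagree, the averaged distribution flattens and its entropy rises, which is the signal a downstream rule could use to flag a prediction as uncertain.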

[52] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework

Sadia Kamal,Tim Oates,Joy Wan

Main category: cs.CV

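TL;DR: The paper proposes a weakly supervised multimodal framework that automatically generates clinically structured SOAP notes from limited inputs (lesion images and sparse clinical text), aiming to reduce clinician burden and reliance on large annotated datasets.

Details Motivation: Skin cancer is among the most common cancers worldwide, and manually writing SOAP notes is time-consuming and contributes to clinician burnout, so an automated solution is needed.

Contribution: 1. Proposes a weakly supervised multimodal framework for generating SOAP notes; 2. Introduces two new metrics, MedConceptEval and CCS, to evaluate clinical quality; 3. Achieves performance close to advanced models such as GPT-4o on key clinical metrics.

Method: Combines lesion images with sparse clinical text and uses weak supervision to reduce reliance on annotated data when generating structured SOAP notes.

Result: Performance on clinical relevance metrics is comparable to GPT-4o, Claude, and DeepSeek Janus Pro.

Insight: Combining weak supervision with multimodal inputs can reduce annotation requirements and improve the efficiency of clinical documentation.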

Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for
over $8 billion in annual healthcare expenditures. In clinical settings,
physicians document patient visits using detailed SOAP (Subjective, Objective,
Assessment, and Plan) notes. However, manually generating these notes is
labor-intensive and contributes to clinician burnout. In this work, we propose
a weakly supervised multimodal framework to generate clinically structured SOAP
notes from limited inputs, including lesion images and sparse clinical text.
Our approach reduces reliance on manual annotations, enabling scalable,
clinically grounded documentation while alleviating clinician burden and
reducing the need for large annotated data. Our method achieves performance
comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical
relevance metrics. To evaluate clinical quality, we introduce two novel metrics,
MedConceptEval and Clinical Coherence Score (CCS), which assess semantic
alignment with expert medical concepts and input features, respectively.

[53] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video

Fei Zhao,Da Pan,Zelu Qi,Ping Shi

Main category: cs.CV

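TL;DR: The paper addresses audio-visual quality assessment for user-generated omnidirectional video (UGC-ODV) in the Metaverse by building a dataset and proposing a baseline model that combines video feature extraction, audio feature extraction, and audio-visual fusion modules. Experiments show that the model performs well.

Details Motivation: With the rise of the Metaverse, UGC-ODV is increasingly important, but research on its audio-visual quality assessment remains limited and needs supporting datasets and methods.

Contribution: Builds an audio-visual quality assessment dataset for UGC-ODV and proposes an effective baseline model, helping to fill a research gap in this area.

Method: Five participants capture 300 videos with two types of omnidirectional cameras, covering 10 scene types, followed by a subjective rating experiment. The model comprises video feature extraction, audio feature extraction, and audio-visual fusion modules.

Result: The baseline model achieves the best performance on the proposed dataset, supporting its effectiveness.

Insight: The study provides a data and model foundation for audio-visual quality assessment of UGC-ODV and supports the development of Metaverse-related technology.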

Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos
(ODVs) have garnered notable interest, gradually shifting from
professional-generated content (PGC) to user-generated content (UGC). However,
the study of audio-visual quality assessment (AVQA) within ODVs remains
limited. To address this, we construct a dataset of UGC omnidirectional audio
and video (A/V) content. The videos are captured by five individuals using two
different types of omnidirectional cameras, shooting 300 videos covering 10
different scene types. A subjective AVQA experiment is conducted on the dataset
to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to
facilitate the development of the UGC-ODV AVQA field, we construct an effective
AVQA baseline model on the proposed dataset, consisting of a video feature
extraction module, an audio feature extraction module, and an audio-visual
fusion module. The experimental results demonstrate that our model
achieves optimal performance on the proposed dataset.

[54] Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions

Deliang Wang,Chao Yang,Gaowei Chen

Main category: cs.CV

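TL;DR: The study explores using vision-language models (VLMs) with zero-shot prompting to detect students' academic emotions. Qwen2.5-VL-7B-Instruct performs relatively well at recognizing confused expressions, but the models detect distracted behavior poorly.

Details Motivation: Students' academic emotions strongly affect their learning performance and behavior, while traditional supervised methods generalize poorly and require large amounts of annotated data. VLMs offer a new way to address this problem.

Contribution: Evaluates two VLMs (Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct) on recognizing students' academic emotions under zero-shot prompting.

Method: Uses the two VLMs with zero-shot prompting to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions.

Result: Qwen2.5-VL-7B-Instruct performs better at recognizing confused expressions, but neither model detects distracted behavior effectively. Both models identify happy emotions well.

Insight: VLMs show moderate promise for academic emotion recognition and are best suited to detecting student confusion, but their ability to recognize distracted behavior needs further improvement.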

Abstract: Students’ academic emotions significantly influence their social behavior and
learning performance. Traditional approaches to automatically and accurately
analyze these emotions have predominantly relied on supervised machine learning
algorithms. However, these models often struggle to generalize across different
contexts, necessitating repeated cycles of data collection, annotation, and
training. The emergence of Vision-Language Models (VLMs) offers a promising
alternative, enabling generalization across visual recognition tasks through
zero-shot prompting without requiring fine-tuning. This study investigates the
potential of VLMs to analyze students’ academic emotions via facial expressions
in an online learning environment. We employed two VLMs,
Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000
images depicting confused, distracted, happy, neutral, and tired expressions
using zero-shot prompting. Preliminary results indicate that both models
demonstrate moderate performance in academic facial expression recognition,
with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct.
Notably, both models excel in identifying students’ happy emotions but fail to
detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits
relatively high performance in recognizing students’ confused expressions,
highlighting its potential for practical applications in identifying content
that causes student confusion.

[55] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting

Lintao Xiang,Hongpei Zheng,Yating Huang,Qijun Yang,Hujun Yin

Main category: cs.CV

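TL;DR: PointGS proposes a point attention-aware sparse view synthesis framework based on Gaussian splatting that enables high-quality real-time rendering from sparse training views.

Details Motivation: Existing 3D Gaussian splatting (3DGS) methods need many calibrated views to produce a complete scene representation, and sparse inputs lead to overfitting and degraded rendering quality. PointGS aims to remove this limitation.

Contribution: 1) Uses a stereo foundation model to estimate accurate camera poses and reconstruct a dense point cloud for Gaussian initialization; 2) Encodes Gaussian colour attributes by sampling and aggregating multiscale 2D features; 3) Designs a self-attention-based point interaction network to enhance point-wise appearance representation.

Method: 1) Obtains camera poses and a dense point cloud from a stereo foundation model; 2) Samples multiscale 2D features from the sparse inputs and aggregates them to encode Gaussian colour; 3) Refines point-wise features with a self-attention point interaction network, then decodes Gaussian parameters with MLPs for rendering.

Result: Across diverse benchmarks, PointGS significantly outperforms NeRF-based methods and is competitive with state-of-the-art 3DGS methods under few-shot settings.

Insight: PointGS shows the potential of combining point-wise feature enhancement with Gaussian splatting for high-quality rendering from sparse views, offering a new direction for 3D reconstruction and rendering.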

Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that
surpasses the neural radiance field (NeRF) in both rendering speed and visual
quality by leveraging an explicit 3D scene representation. Existing 3DGS
approaches require a large number of calibrated views to generate a consistent
and complete scene representation. When input views are limited, 3DGS tends to
overfit the training views, leading to noticeable degradation in rendering
quality. To address this limitation, we propose a Point-wise Feature-Aware
Gaussian Splatting framework that enables real-time, high-quality rendering
from sparse training views. Specifically, we first employ the latest stereo
foundation model to estimate accurate camera poses and reconstruct a dense
point cloud for Gaussian initialization. We then encode the colour attributes
of each 3D Gaussian by sampling and aggregating multiscale 2D appearance
features from sparse inputs. To enhance point-wise appearance representation,
we design a point interaction network based on a self-attention mechanism,
allowing each Gaussian point to interact with its nearest neighbors. These
enriched features are subsequently decoded into Gaussian parameters through two
lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive
experiments on diverse benchmarks demonstrate that our method significantly
outperforms NeRF-based approaches and achieves competitive performance under
few-shot settings compared to the state-of-the-art 3DGS methods.

[56] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models

Jun Yin,Jing Zhong,Peilin Li,Pengyu Zeng,Miao Zhang,Ran Luo,Shuai Lu

Main category: cs.CV

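TL;DR: The paper proposes UrbanSense, a framework built on vision large language models that uses a multimodal approach for automated, scalable analysis of urban streetscape styles, and demonstrates its effectiveness at quantifying differences in urban style.

Details Motivation: Urban cultures and architectural styles differ markedly because of geographical, historical, and socio-political factors. Traditional research relies on expert interpretation and is hard to standardize, so an objective, data-driven method is needed to quantify urban streetscape style.

Contribution: 1. Constructs the UrbanDiffBench dataset; 2. Develops the vision-language-model-based UrbanSense framework; 3. Experimentally validates its ability to quantify differences in urban style.

Method: Adopts a multimodal research framework with vision-language models to automatically analyze differences in urban streetscape style, evaluated with the dataset and quantitative metrics.

Result: Over 80% of generated descriptions pass the t-test, and subjective evaluations yield high Phi scores (0.912 for cities, 0.833 for periods), indicating that subtle stylistic differences are captured.

Insight: UrbanSense offers a quantitative tool for studying the evolution of urban style and data support for future design, showing the potential of multimodal methods in urban research.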

Abstract: Urban cultures and architectural styles vary significantly across cities due
to geographical, chronological, historical, and socio-political factors.
Understanding these differences is essential for anticipating how cities may
evolve in the future. As representative cases of historical continuity and
modern innovation in China, Beijing and Shenzhen offer valuable perspectives
for exploring the transformation of urban streetscapes. However, conventional
approaches to urban cultural studies often rely on expert interpretation and
historical documentation, which are difficult to standardize across different
contexts. To address this, we propose a multimodal research framework based on
vision-language models, enabling automated and scalable analysis of urban
streetscape style differences. This approach enhances the objectivity and
data-driven nature of urban form research. The contributions of this study are
as follows: First, we construct UrbanDiffBench, a curated dataset of urban
streetscapes containing architectural images from different periods and
regions. Second, we develop UrbanSense, the first vision-language-model-based
framework for urban streetscape analysis, enabling the quantitative generation
and comparison of urban style representations. Third, experimental results show
that over 80% of generated descriptions pass the t-test (p < 0.05).
High Phi scores (0.912 for cities, 0.833 for periods) from subjective
evaluations confirm the method’s ability to capture subtle stylistic
differences. These results highlight the method’s potential to quantify and
interpret urban style evolution, offering a scientifically grounded lens for
future design.
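
The abstract reports Phi scores without saying how they are computed. For readers unfamiliar with the statistic, here is a minimal sketch of the standard Phi coefficient for a 2x2 contingency table. The table layout and the counts are hypothetical, and this is not necessarily the exact procedure used in the paper.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient for the 2x2 table [[a, b], [c, d]].

    Rows could be, say, the true city and columns the city a rater
    picked from a generated description (a hypothetical setup).
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return 0.0  # degenerate table: a row or column is empty
    return (a * d - b * c) / denom

# Hypothetical counts: 45 + 45 agreements, 5 + 5 disagreements.
phi = phi_coefficient(45, 5, 5, 45)
```

Values near 1 indicate that the two binary variables almost always agree, which is how a score such as 0.912 would typically be read.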

[57] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration

Mina C. Moghadam,Alan Q. Wang,Omer Taub,Martin R. Prince,Mert R. Sabuncu

Main category: cs.CV

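TL;DR: RealKeyMorph (RKM) is a resolution-agnostic medical image registration method that trains a network to learn keypoints for an image pair and operates in real-world coordinates, avoiding the artifacts that resampling introduces in conventional methods.

Details Motivation: In medical image registration, differences in resolution (such as pixel spacing and slice thickness) force conventional methods to resample, which introduces artifacts. RKM aims to remove this limitation and operate directly on the raw data.

Contribution: RKM extends the KeyMorph framework by outputting keypoints in real-world coordinates, enabling resolution-agnostic registration without resampling.

Method: RKM uses the affine matrix produced by the scanner (e.g., an MRI machine) to transform keypoints into real-world coordinates and integrates this transform into training, making keypoint extraction independent of resolution.

Result: Experiments show advantages for RKM when registering orthogonal 2D stacks of abdominal MRIs and 3D brain volumes with varying resolutions.

Insight: By operating in real-world coordinates, RKM avoids the problems that resampling causes in conventional registration and offers a more robust approach to medical image processing.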

Abstract: Many real-world settings require registration of a pair of medical images
that differ in spatial resolution, which may arise from differences in image
acquisition parameters like pixel spacing, slice thickness, and field-of-view.
However, all previous machine learning-based registration techniques resample
images onto a fixed resolution. This is suboptimal because resampling can
introduce artifacts due to interpolation. To address this, we present
RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is
an extension of KeyMorph, a registration framework which works by training a
network to learn corresponding keypoints for a given pair of images, after
which a closed-form keypoint matching step is used to derive the transformation
that aligns them. To avoid resampling and enable operating on the raw data, RKM
outputs keypoints in real-world coordinates of the scanner. To do this, we
leverage the affine matrix produced by the scanner (e.g., MRI machine) that
encodes the mapping from voxel coordinates to real world coordinates. By
transforming keypoints into real-world space and integrating this into the
training process, RKM effectively enables the extracted keypoints to be
resolution-agnostic. In our experiments, we demonstrate the advantages of RKM
on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as
3D volumes with varying resolutions in brain datasets.
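
To make the voxel-to-world mapping concrete, the sketch below applies a 4x4 affine matrix to a voxel index in homogeneous coordinates. The two affines are hypothetical (same origin, different voxel spacing) and the helper name `voxel_to_world` is an assumption; this illustrates only the coordinate transform, not RKM's network or training procedure.

```python
def voxel_to_world(affine, ijk):
    """Map a voxel index (i, j, k) to scanner coordinates via a 4x4 affine."""
    homogeneous = (ijk[0], ijk[1], ijk[2], 1.0)
    return tuple(
        sum(affine[row][col] * homogeneous[col] for col in range(4))
        for row in range(3)
    )

# Hypothetical affines for two scans of the same anatomy:
# coarse has 2 x 2 x 5 mm voxels, fine has 1 x 1 x 1 mm voxels.
coarse_affine = [
    [2.0, 0.0, 0.0, -10.0],
    [0.0, 2.0, 0.0, -10.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
fine_affine = [
    [1.0, 0.0, 0.0, -10.0],
    [0.0, 1.0, 0.0, -10.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

point_from_coarse = voxel_to_world(coarse_affine, (5, 5, 2))
point_from_fine = voxel_to_world(fine_affine, (10, 10, 10))
```

The two voxel indices differ, yet both land on the same physical location, which is why keypoints expressed in world space can be compared across scans without resampling either image.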

[58] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation

Runqi Ouyang,Haoyun Li,Zhenyuan Zhang,Xiaofeng Wang,Zheng Zhu,Guan Huang,Xingang Wang

Main category: cs.CV

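TL;DR: Motion-R1 combines Chain-of-Thought reasoning with reinforcement learning, decomposing complex text instructions into logical action paths to improve semantic understanding and consistency in text-to-motion generation.

Details Motivation: Existing text-to-motion methods mostly rely on end-to-end mapping and fail to capture deep linguistic structure and logical reasoning, which limits the diversity, controllability, and consistency of generated motions.

Contribution: Proposes the Motion-R1 framework, which integrates a Chain-of-Thought mechanism to explicitly decompose text instructions into logical action paths and uses reinforcement learning (Group Relative Policy Optimization) to jointly optimize reasoning chains and motion synthesis.

Method: 1. A Chain-of-Thought mechanism decomposes complex instructions; 2. Group Relative Policy Optimization jointly optimizes reasoning and motion generation.

Result: Performs strongly on multiple benchmark datasets, outperforming existing methods particularly in scenarios that require fine-grained semantic understanding and long-term temporal coherence.

Insight: Combining explicit logical decomposition with reinforcement learning can significantly improve semantic understanding and execution in text-to-motion generation, offering a new approach to complex instruction following.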

Abstract: Recent advances in large language models, especially in natural language
understanding and reasoning, have opened new possibilities for text-to-motion
generation. Although existing approaches have made notable progress in semantic
alignment and motion synthesis, they often rely on end-to-end mapping
strategies that fail to capture deep linguistic structures and logical
reasoning. Consequently, generated motions tend to lack controllability,
consistency, and diversity. To address these limitations, we propose Motion-R1,
a unified motion-language modeling framework that integrates a Chain-of-Thought
mechanism. By explicitly decomposing complex textual instructions into
logically structured action paths, Motion-R1 provides high-level semantic
guidance for motion generation, significantly enhancing the model’s ability to
interpret and execute multi-step, long-horizon, and compositionally rich
commands. To train our model, we adopt Group Relative Policy Optimization, a
reinforcement learning algorithm designed for large models, which leverages
motion quality feedback to optimize reasoning chains and motion synthesis
jointly. Extensive experiments across multiple benchmark datasets demonstrate
that Motion-R1 achieves competitive or superior performance compared to
state-of-the-art methods, particularly in scenarios requiring nuanced semantic
understanding and long-term temporal coherence. The code, model and data will
be publicly available.
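
Group Relative Policy Optimization, as it is commonly described, scores each sampled output relative to the other outputs sampled for the same prompt by normalizing rewards with the group mean and standard deviation. The sketch below shows only that normalization step. The reward values are made up, the `eps` term is an assumption for numerical safety, and the abstract does not specify how Motion-R1 defines its motion-quality reward.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical motion-quality rewards for three sampled outputs.
advantages = group_relative_advantages([1.0, 2.0, 3.0])

# If every sample gets the same reward, no sample is preferred.
flat_advantages = group_relative_advantages([0.5, 0.5, 0.5])
```

Because advantages are centred within the group, the policy update favours outputs that beat their siblings rather than outputs with a high absolute reward.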

[59] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device

Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh

Main category: cs.CV

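TL;DR: FaceLiVT is a lightweight yet powerful face recognition model that combines a CNN-Transformer architecture with a novel Multi-Head Linear Attention mechanism, substantially reducing computational complexity while maintaining high accuracy.

Details Motivation: Face recognition on mobile devices must be lightweight and efficient, yet existing models struggle to balance computational complexity and latency. The authors aim to design a solution better suited to mobile devices by combining the strengths of CNNs and Transformers.

Contribution: Proposes the FaceLiVT model, which integrates Multi-Head Linear Attention (MHLA) with a structurally reparameterized token mixer, significantly improving inference speed and accuracy on mobile devices.

Method: Uses a hybrid CNN-Transformer architecture with MHLA and structural reparameterization to optimize computational efficiency and model performance.

Result: On benchmarks such as LFW and CFP-FP, FaceLiVT outperforms existing lightweight models, with inference 8.6× faster than EdgeFace and 21.2× faster than a pure ViT model.

Insight: Combining the local feature extraction of CNNs with the global modeling of Transformers, together with an optimized attention mechanism, can significantly improve face recognition efficiency on mobile devices.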

Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition
model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer
architecture with an innovative and lightweight Multi-Head Linear Attention
(MHLA) mechanism. By combining MHLA alongside a reparameterized token mixer,
FaceLiVT effectively reduces computational complexity while preserving
competitive accuracy. Extensive evaluations on challenging benchmarks,
including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior
performance compared to state-of-the-art lightweight models. MHLA notably
improves inference speed, allowing FaceLiVT to deliver high accuracy with lower
latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace,
a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2×
faster than a pure ViT-based model. With its balanced design, FaceLiVT offers
an efficient and practical solution for real-time face recognition on
resource-constrained platforms.
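
Structural reparameterization generally means training with several parallel branches and then folding them into one equivalent operator for inference. The toy sketch below folds a linear branch plus an identity skip, y = Wx + x, into a single matrix (W + I). This is a generic illustration under that assumption; the abstract does not describe the branches in FaceLiVT's token mixer, and all names and values here are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [
        sum(matrix[r][c] * vector[c] for c in range(len(vector)))
        for r in range(len(matrix))
    ]

def merge_linear_with_skip(weight):
    """Fold y = W x + x into one matrix, W + I, for inference."""
    n = len(weight)
    return [
        [weight[r][c] + (1.0 if r == c else 0.0) for c in range(n)]
        for r in range(n)
    ]

W = [[0.5, -1.0], [2.0, 0.25]]
x = [3.0, 4.0]

# Training-time form: two branches evaluated separately, then summed.
two_branch_output = [a + b for a, b in zip(matvec(W, x), x)]

# Inference-time form: one fused matrix, one multiplication.
fused_output = matvec(merge_linear_with_skip(W), x)
```

Both forms give the same output, but the fused form needs one operator instead of two, which is where the latency saving comes from.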

[60] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion

Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui,Yuhan Lyu

Main category: cs.CV

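TL;DR: FSATFusion is an infrared and visible image fusion network based on a frequency-spatial attention Transformer; its improved Transformer module and attention mechanism significantly boost fusion performance.

Details Motivation: In infrared and visible image fusion, existing deep learning methods lose information because convolution struggles to capture global context, which limits fusion performance.

Contribution: 1. Proposes the FSAT module, which uses a frequency-spatial attention mechanism to extract discriminative features; 2. Designs an improved Transformer module (ITM) that strengthens global context extraction; 3. Demonstrates the superior performance of FSATFusion in fusion quality and downstream tasks.

Method: 1. Extracts features with the frequency-spatial attention Transformer (FSAT) module; 2. Improves global information capture with the improved Transformer module (ITM); 3. Trains the network end to end.

Result: Experiments show that FSATFusion outperforms existing methods in fusion quality and efficiency, with good generalization and downstream task performance.

Insight: A Transformer with frequency-spatial attention can effectively address information loss in image fusion and improve global feature extraction.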

Abstract: Infrared and visible image fusion (IVIF) is receiving increasing
attention from both the research community and industry due to its excellent
results in downstream applications. Existing deep learning approaches often
utilize convolutional neural networks to extract image features. However, the
inherently limited capacity of convolution operations to capture global context can
lead to information loss, thereby restricting fusion performance. To address
this limitation, we propose an end-to-end fusion network named the
Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). The
FSATFusion contains a frequency-spatial attention Transformer (FSAT) module
designed to effectively capture discriminative features from source images. This
FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of
extracting significant features from feature maps. Additionally, we propose an
improved Transformer module (ITM) to enhance the vanilla Transformer's ability
to extract global context information. We conducted both qualitative and
quantitative comparative experiments, demonstrating the superior fusion quality
and efficiency of FSATFusion compared to other state-of-the-art methods.
Furthermore, our network was tested on two additional tasks without any
modifications, to verify the excellent generalization capability of FSATFusion.
Finally, the object detection experiment demonstrated the superiority of
FSATFusion in downstream visual tasks. Our code is available at
https://github.com/Lmmh058/FSATFusion.

[61] Revisiting Transformers with Insights from Image Filtering

Laziz U. Abdullaev,Maksim Tkachenko,Tan M. Nguyen

Main category: cs.CV

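TL;DR: The paper reinterprets the Transformer self-attention mechanism through an image processing framework, improving its interpretability and, via image-processing-inspired modifications, the model's performance and robustness.

Details Motivation: The success of self-attention lacks a solid theoretical foundation, especially regarding the role of its various architectural components. The paper aims to fill this gap with an image processing framework.

Contribution: Proposes a unifying image processing framework that explains self-attention and the role of components such as positional encoding and residual connections, and introduces two architectural modifications.

Method: Analyzes self-attention from the perspective of image filtering, develops a theoretical framework, and proposes two image-processing-based architectural modifications.

Result: The modified models show higher accuracy and robustness on language and vision tasks, along with better long-sequence understanding.

Insight: Connecting image processing theory with self-attention can improve both model performance and interpretability, suggesting new directions for future research.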

Abstract: The self-attention mechanism, a cornerstone of Transformer-based
state-of-the-art deep learning architectures, is largely heuristic-driven and
fundamentally challenging to interpret. Establishing a robust theoretical
foundation to explain its remarkable success and limitations has therefore
become an increasingly prominent focus in recent research. Some notable
directions have explored understanding self-attention through the lens of image
denoising and nonparametric regression. While promising, existing frameworks
still lack a deeper mechanistic interpretation of various architectural
components that enhance self-attention, both in its original formulation and
subsequent variants. In this work, we aim to advance this understanding by
developing a unifying image processing framework, capable of explaining not
only the self-attention computation itself but also the role of components such
as positional encoding and residual connections, including numerous later
variants. We also pinpoint potential distinctions between the two concepts
building upon our framework, and make an effort to close this gap. We introduce
two independent architectural modifications within transformers. While our
primary objective is interpretability, we empirically observe that image
processing-inspired modifications can also lead to notably improved accuracy
and robustness against data contamination and adversaries across language and
vision tasks as well as better long sequence understanding.

[62] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial

Jerry Yan,Chinmay Talegaonkar,Nicholas Antipa,Eric Terrill,Sophia Merrifield

Main category: cs.CV

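TL;DR: The paper presents PoseIDON, a computer vision pipeline that combines deep foundation model features with multiview photogrammetry to estimate the six-degrees-of-freedom pose of seafloor objects and the orientation of the surrounding seafloor from ROV video, inferring burial depth through CAD model alignment.

Details Motivation: The burial state of anthropogenic objects on the seafloor matters for studying local sedimentation dynamics, ecological risk, and pollutant transport. Accurately estimating burial depth from remote imagery remains difficult because of partial occlusion, poor visibility, and object degradation.

Contribution: Proposes the PoseIDON pipeline, which combines foundation models with multiview techniques for non-invasive mapping of seafloor burial in support of environmental assessment, and validates the method at a historic ocean dumpsite.

Method: Extracts features with a deep foundation model and combines them with multiview photogrammetry to estimate 6DoF object pose and seafloor orientation; infers burial depth by aligning CAD models and fitting a local plane.

Result: In validation on 54 objects, the mean burial depth error is about 10 centimeters, and the results reflect spatial patterns of sediment transport.

Insight: PoseIDON offers a scalable, non-invasive approach to mapping seafloor burial and supports rapid assessment of contaminated sites.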

Abstract: The burial state of anthropogenic objects on the seafloor provides insight
into localized sedimentation dynamics and is also critical for assessing
ecological risks, potential pollutant transport, and the viability of recovery
or mitigation strategies for hazardous materials such as munitions. Accurate
burial depth estimation from remote imagery remains difficult due to partial
occlusion, poor visibility, and object degradation. This work introduces a
computer vision pipeline, called PoseIDON, which combines deep foundation model
features with multiview photogrammetry to estimate the six-degrees-of-freedom
(6DoF) object pose and the orientation of the surrounding seafloor from ROV video.
Burial depth is inferred by aligning CAD models of the objects with observed
imagery and fitting a local planar approximation of the seafloor. The method is
validated using footage of 54 objects, including barrels and munitions,
recorded at a historic ocean dumpsite in the San Pedro Basin. The model
achieves a mean burial depth error of approximately 10 centimeters and resolves
spatial burial patterns that reflect underlying sediment transport processes.
This approach enables scalable, non-invasive mapping of seafloor burial and
supports environmental assessment at contaminated sites.
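
The geometric step described in the abstract, comparing an aligned object model against a local planar approximation of the seafloor, can be sketched as follows. This is a simplified illustration under stated assumptions: the plane is defined by three seafloor points rather than a least-squares fit, +z is taken as up, burial depth is taken as the deepest model point below the plane, and all names and coordinates are hypothetical.

```python
import math

def plane_from_points(p0, p1, p2):
    """Return (unit_normal, offset) of the plane through three points.

    The plane satisfies dot(normal, p) + offset == 0, with the normal
    oriented upward (+z is assumed to be up).
    """
    u = [p1[i] - p0[i] for i in range(3)]
    v = [p2[i] - p0[i] for i in range(3)]
    normal = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    length = math.sqrt(sum(c * c for c in normal))
    normal = [c / length for c in normal]
    if normal[2] < 0:
        normal = [-c for c in normal]
    offset = -sum(normal[i] * p0[i] for i in range(3))
    return normal, offset

def burial_depth(model_points, plane):
    """Depth of the lowest posed model point below the seafloor plane."""
    normal, offset = plane
    signed = [
        sum(normal[i] * p[i] for i in range(3)) + offset for p in model_points
    ]
    return max(0.0, -min(signed))

# Hypothetical flat seafloor at z = 0 and a barrel model (metres).
seafloor = plane_from_points((0, 0, 0), (1, 0, 0), (0, 1, 0))
barrel_points = [(0.2, 0.2, -0.1), (0.2, 0.2, 0.3), (0.4, 0.2, 0.1)]
depth = burial_depth(barrel_points, seafloor)
```

A real pipeline would fit the plane to many noisy seafloor points and use the full CAD mesh, but the signed-distance comparison stays the same.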

[63] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba

Shicheng Yin,Kaixuan Yin,Yang Liu,Weixing Chen,Liang Lin

Main category: cs.CV

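TL;DR: DART is a differentiable dynamic adaptive region tokenizer that replaces fixed-size patches with content-dependent patches of varying size, significantly improving ViT and Mamba models while reducing computational cost.

Details Motivation: Existing ViT and Mamba models rely on fixed-size image patches, which over-encode background regions and lose critical local details, especially when informative content is sparsely distributed.

Contribution: Proposes DART, a fully differentiable dynamic adaptive region tokenizer that adapts patch size to image content and significantly improves model performance.

Method: Combines learnable region scores with piecewise differentiable quantile operations to allocate more tokens to information-rich regions.

Result: Improves accuracy by 2.1% on DeiT and achieves a 45% FLOPs reduction, with effectiveness confirmed on multiple models.

Insight: Dynamically adjusting patch size is more efficient than uniformly increasing token density, improving performance while reducing computational overhead.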

Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and
Vision Mamba (Vim) have achieved remarkable performance in computer vision
tasks. However, their reliance on fixed-size patches often results in excessive
encoding of background regions and omission of critical local details,
especially when informative objects are sparsely distributed. To address this,
we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART),
which adaptively partitions images into content-dependent patches of varying
sizes. DART combines learnable region scores with piecewise differentiable
quantile operations to allocate denser tokens to information-rich areas.
Despite introducing only approximately 1 million (1M) additional parameters,
DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that
uniformly increase token density to capture fine-grained details, DART offers a
more efficient alternative, achieving 45% FLOPs reduction with superior
performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that
DART consistently enhances accuracy while incurring minimal or even reduced
computational overhead. Code is available at
https://github.com/HCPLab-SYSU/DART.
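
To give some intuition for quantile-based partitioning, the 1D sketch below places region boundaries at equal steps of cumulative score, so high-scoring positions receive narrower regions and therefore more tokens. It is a plain, non-differentiable illustration with made-up scores; DART's learnable scoring and its piecewise differentiable quantile operation are not reproduced, and the function name is an assumption.

```python
def quantile_boundaries(scores, num_regions):
    """Split positions into regions holding roughly equal total score.

    Returns region edges [0, ..., len(scores)]; region k spans
    positions edges[k] up to (not including) edges[k + 1].
    A single very large score can cross several targets at once and
    yield empty regions; this sketch does not handle that case.
    """
    total = sum(scores)
    targets = [total * k / num_regions for k in range(1, num_regions)]
    edges = [0]
    cumulative = 0.0
    next_target = 0
    for index, score in enumerate(scores):
        cumulative += score
        while next_target < len(targets) and cumulative >= targets[next_target]:
            edges.append(index + 1)
            next_target += 1
    edges.append(len(scores))
    return edges

# Hypothetical scores: low on background (left), high on the object (right).
region_edges = quantile_boundaries([1, 1, 1, 1, 4, 4, 4, 4], num_regions=4)
```

In this example the four low-scoring positions and one high-scoring position are merged into a single wide region, while the remaining high-scoring positions each get their own region.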

[64] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion

Yuanyi Song,Pumeng Lyu,Ben Fei,Fenghua Ling,Wanli Ouyang,Lei Bai

Main category: cs.CV

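TL;DR: ReconMOST is a data-driven diffusion model framework for multi-layer sea temperature reconstruction. It pre-trains on historical simulation data and is guided by observations, addressing the sparse data and high computational cost that limit conventional methods.

Details Motivation: Conventional sea temperature reconstruction is limited by data sparsity, algorithmic complexity, and high computational cost, while existing machine learning methods focus on the sea surface or local regions and struggle with issues such as cloud occlusion.

Contribution: Proposes the ReconMOST framework, which pre-trains a diffusion model to learn physically consistent ocean temperature distributions and guides the reverse diffusion process with observations to reconstruct global multi-layer sea temperature accurately.

Method: 1. Pre-trains an unconditional diffusion model on historical numerical simulation data to learn physical distributions; 2. Uses sparse but high-accuracy in-situ observations as guidance points for the reverse diffusion process; 3. In regions without observations, relies on the distribution patterns learned during pre-training for implicit guidance.

Result: In experiments on CMIP6 and EN4 data, MSE is 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, indicating strong accuracy and generalization.

Insight: By combining a data-driven diffusion model with physically consistent pre-trained patterns, ReconMOST shows promise for complex ocean data reconstruction, especially when data are sparse or missing.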

Abstract: Accurate reconstruction of the ocean is essential for reflecting global climate
dynamics and supporting marine meteorological research. Conventional methods
face challenges due to sparse data, algorithmic complexity, and high
computational costs, while the growing use of machine learning (ML) methods
remains limited to reconstruction problems at the sea surface and in local
regions, struggling with issues like cloud occlusion. To address these
limitations, this paper proposes ReconMOST, a data-driven guided diffusion
model framework for multi-layer sea temperature reconstruction. Specifically,
we first pre-train an unconditional diffusion model using a large collection of
historical numerical simulation data, enabling the model to attain physically
consistent distribution patterns of ocean temperature fields. During the
generation phase, sparse yet high-accuracy in-situ observational data are
utilized as guidance points for the reverse diffusion process, generating
accurate reconstruction results. Importantly, in regions lacking direct
observational data, the physically consistent spatial distribution patterns
learned during pre-training enable implicitly guided and physically plausible
reconstructions. Our method extends ML-based sea surface temperature (SST) reconstruction to a global,
multi-layer setting, handling over 92.5% missing data while maintaining
reconstruction accuracy, spatial resolution, and superior generalization
capability. We pre-train our model on CMIP6 numerical simulation data and
conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The
mean squared error (MSE) values reach 0.049 on guidance, 0.680 on
reconstruction, and 0.633 in total, respectively, demonstrating the
effectiveness and robustness of the proposed framework. Our source code is
available at https://github.com/norsheep/ReconMOST.
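
The three MSE figures can be read as errors over observed (guidance) points, unobserved (reconstruction) points, and all points together. The sketch below computes that split for a toy field; the function name, the mask convention, and the data are assumptions for illustration. As a rough sanity check, if about 92.5% of points were unobserved in that experiment (an assumption, since the abstract does not tie the two numbers together), the weighted average 0.075 × 0.049 + 0.925 × 0.680 ≈ 0.633 matches the reported total.

```python
def split_mse(predicted, truth, observed_mask):
    """Return (guidance_mse, reconstruction_mse, total_mse).

    observed_mask[i] is True where an observation guided the model.
    """
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    guided = [e for e, m in zip(squared, observed_mask) if m]
    unguided = [e for e, m in zip(squared, observed_mask) if not m]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(guided), mean(unguided), mean(squared)

# Hypothetical temperatures at four grid points, two of them observed.
predicted = [1.0, 2.0, 3.0, 4.0]
truth = [1.0, 2.5, 3.0, 6.0]
observed = [True, True, False, False]

guidance_mse, reconstruction_mse, total_mse = split_mse(predicted, truth, observed)
```

The total is the mask-weighted average of the other two, so a low guidance error alone says little about quality in unobserved regions.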

[65] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Zhiyang Xu,Jiuhai Chen,Zhaojiang Lin,Xichen Pan,Lifu Huang,Tianyi Zhou,Madian Khabsa,Qifan Wang,Di Jin,Michihiro Yasunaga,Lili Yu,Xi Victoria Lin,Shaoliang Nie

Main category: cs.CV

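TL;DR: Pisces is an auto-regressive multimodal foundation model that unifies image understanding and generation through a decoupled visual encoding architecture and tailored training techniques, and performs strongly on public benchmarks.

Details Motivation: Current multimodal foundation models unify image understanding and generation, but they often underperform models specialized for a single task. The main causes are the different visual features each task needs and the different training processes involved.

Contribution: Proposes Pisces, which achieves competitive performance in both image understanding and generation within a unified framework, using a decoupled visual encoding architecture and training techniques optimized for multimodal generation.

Method: Uses a novel decoupled visual encoding architecture together with careful data curation, pretraining, and finetuning to improve multimodal generation.

Result: Performs strongly on more than 20 public image understanding benchmarks and shows robust generative capability on the GenEval image generation benchmark.

Insight: The study reveals a synergistic relationship between image understanding and generation and shows the benefit of using separate visual encoders in unified multimodal models.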

Abstract: Recent advances in large language models (LLMs) have enabled multimodal
foundation models to tackle both image understanding and generation within a
unified framework. Despite these gains, unified models often underperform
compared to specialized models in either task. A key challenge in developing
unified models lies in the inherent differences between the visual features
needed for image understanding versus generation, as well as the distinct
training processes required for each modality. In this work, we introduce
Pisces, an auto-regressive multimodal foundation model that addresses this
challenge through a novel decoupled visual encoding architecture and tailored
training techniques optimized for multimodal generation. Combined with
meticulous data curation, pretraining, and finetuning, Pisces achieves
competitive performance in both image understanding and image generation. We
evaluate Pisces on over 20 public benchmarks for image understanding, where it
demonstrates strong performance across a wide range of tasks. Additionally, on
GenEval, a widely adopted benchmark for image generation, Pisces exhibits
robust generative capabilities. Our extensive analysis reveals the synergistic
relationship between image understanding and generation, and the benefits of
using separate visual encoders, advancing the field of unified multimodal
models.

[66] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment

Shuo wang,Jihao Zhang

Main category: cs.CV

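TL;DR: MF2Summ is a video summarization model based on multimodal fusion that combines visual and auditory information and uses a cross-modal Transformer with alignment-guided attention, outperforming existing methods on the SumMe and TVSum datasets.

Details Motivation: Existing video summarization methods usually rely on a single modality (typically visual) and struggle to capture the full semantic richness of videos. The paper therefore proposes multimodal fusion of visual and auditory information to improve summarization.

Contribution: 1. Proposes MF2Summ, a multimodal fusion model for video summarization; 2. Designs a cross-modal Transformer and an alignment-guided self-attention mechanism to better model inter-modal dependencies and temporal relations; 3. Achieves performance gains on SumMe and TVSum.

Method: Follows a five-stage process: feature extraction (GoogLeNet and SoundNet), cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. The core components are the cross-modal Transformer and the alignment-guided self-attention mechanism. Key shot selection uses NMS and the KTS algorithm.

Result: On SumMe and TVSum, MF2Summ improves F1-scores over DSNet by 1.9% and 0.6% respectively and compares favorably with other state-of-the-art methods.

Insight: Multimodal fusion can improve video summarization; cross-modal attention and temporal alignment are key to modeling inter-modal dependencies; NMS and KTS are effective for selecting key shots.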

Abstract: The rapid proliferation of online video content necessitates effective video
summarization techniques. Traditional methods, often relying on a single
modality (typically visual), struggle to capture the full semantic richness of
videos. This paper introduces MF2Summ, a novel video summarization model based
on multimodal content understanding, integrating both visual and auditory
information. MF2Summ employs a five-stage process: feature extraction,
cross-modal attention interaction, feature fusion, segment prediction, and key
shot selection. Visual features are extracted using a pre-trained GoogLeNet
model, while auditory features are derived using SoundNet. The core of our
fusion mechanism involves a cross-modal Transformer and an alignment-guided
self-attention Transformer, designed to effectively model inter-modal
dependencies and temporal correspondences. Segment importance, location, and
center-ness are predicted, followed by key shot selection using Non-Maximum
Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm.
Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ
achieves competitive performance, notably improving F1-scores by 1.9% and
0.6% respectively over the DSNet model, and performing favorably against other
state-of-the-art methods.
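
As a rough illustration of the NMS step in key shot selection, the sketch below applies greedy non-maximum suppression to scored temporal segments. The function names, the (start, end) frame-interval representation, and the 0.5 IoU threshold are assumptions made for illustration and are not taken from the paper; the KTS step is not shown.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two temporal intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring segments, dropping heavy overlaps.

    Returns the indices of the kept segments, ordered by descending score.
    The threshold value is a hypothetical default, not the paper's setting.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep segment i only if it does not overlap a kept segment too much.
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, with segments `(0, 10)` and `(2, 12)` overlapping at IoU 8/12, only the higher-scoring one survives, while a disjoint segment such as `(20, 30)` is kept.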

[67] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts

Guowei Zhong,Ruohong Huan,Mingzhen Wu,Ronghua Liang,Peng Chen

Main category: cs.CV

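TL;DR: The paper proposes CIDer, a robust multimodal emotion recognition framework that uses a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module to handle missing modalities and out-of-distribution (OOD) data, with fewer parameters and faster training than state-of-the-art methods.

Details Motivation: Multimodal Emotion Recognition (MER) in practice often faces missing modalities and distribution shifts. Existing methods typically rely on specific models or introduce excessive parameters, which limits their practicality.

Contribution: 1) Proposes the CIDer framework, which combines the MSSD and MACI modules to address missing modalities and OOD data; 2) Defines a new task, Random Modality Feature Missing (RMFM), that generalizes the definition of modality missing; 3) Introduces new repartitioned MER OOD datasets.

Method: 1) MSSD improves robustness under the RMFM task through weight-sharing self-distillation applied to low-level features, attention maps, and high-level representations; 2) MACI uses a tailored causal graph to mitigate label and language biases; 3) A Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, and a Multimodal Composite Transformer (MCT) performs efficient multimodal fusion.

Result: Experiments show that CIDer performs robustly in both RMFM and OOD scenarios, with fewer parameters and faster training.

Insight: 1) Combining self-distillation with causal inference improves the robustness of multimodal tasks; 2) Lightweight design offers practical advantages in complex tasks.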

Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges
in addressing both modality missing and Out-Of-Distribution (OOD) data
simultaneously. Existing methods often rely on specific models or introduce
excessive parameters, which limits their practicality. To address these issues,
we propose a novel robust MER framework, Causal Inference Distiller (CIDer),
and introduce a new task, Random Modality Feature Missing (RMFM), to generalize
the definition of modality missing. CIDer integrates two key components: a
Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal
Inference (MACI) module. MSSD enhances robustness under the RMFM task through a
weight-sharing self-distillation approach applied across low-level features,
attention maps, and high-level representations. Additionally, a Word-level
Self-aligned Attention Module (WSAM) reduces computational complexity, while a
Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion.
To tackle OOD challenges, MACI employs a tailored causal graph to mitigate
label and language biases using a Multimodal Causal Module (MCM) and
fine-grained counterfactual texts. Notably, MACI can independently enhance OOD
generalization with minimal additional parameters. Furthermore, we
introduce new repartitioned MER OOD datasets. Experimental results
demonstrate that CIDer achieves robust performance in both RMFM and OOD
scenarios, with fewer parameters and faster training compared to
state-of-the-art methods. The implementation of this work is publicly
accessible at https://github.com/gw-zhong/CIDer.
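
The abstract does not spell out how RMFM corrupts its inputs. The sketch below shows one plausible reading, assumed here purely for illustration: each time step of each modality's feature sequence is zeroed independently with a per-modality missing rate. The function names, the dictionary layout, and the zero-fill choice are all hypothetical and may differ from the released implementation.

```python
import random


def apply_random_feature_missing(sequence, missing_rate, rng):
    """Zero out each time step's feature vector with probability missing_rate.

    Assumption: missing features are represented by zero vectors.
    """
    return [
        [0.0] * len(step) if rng.random() < missing_rate else list(step)
        for step in sequence
    ]


def corrupt_sample(sample, missing_rates, seed=0):
    """Apply random feature missing to every modality of one sample.

    sample maps a modality name to a list of per-time-step feature vectors;
    missing_rates maps a modality name to its (hypothetical) missing rate.
    The input sample is left unmodified.
    """
    rng = random.Random(seed)
    return {
        name: apply_random_feature_missing(seq, missing_rates.get(name, 0.0), rng)
        for name, seq in sample.items()
    }
```

Setting one modality's rate to 1.0 and the others to 0.0 reproduces the classic "whole modality missing" case, which is one way to see why a random per-feature formulation generalizes it.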

[68] Rethinking Generative Human Video Coding with Implicit Motion Transformation

Bolin Chen,Ru-Ling Liao,Jie Chen,Yan Ye

Main category: cs.CV

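TL;DR: The paper improves Generative Human Video Coding (GHVC) with Implicit Motion Transformation (IMT), addressing the distortions and inaccurate motion caused by explicit motion guidance and achieving high-efficiency compression and high-fidelity synthesis.

Details Motivation: Generative video coding traditionally relies on explicit motion fields as intermediate supervision, which yields poor reconstruction quality and inaccurate motion for the complex, diverse motion patterns of the human body. The paper investigates how implicit motion transformation can improve GHVC performance.

Contribution: Proposes an IMT-based GHVC approach that characterizes complex human body signals as compact visual features and transforms them into implicit motion guidance to improve reconstruction quality.

Method: With IMT, human body signals are encoded as compact features, which are then transformed into implicit motion guidance for signal reconstruction.

Result: Experiments demonstrate that IMT enables GHVC to achieve high-efficiency compression and high-fidelity synthesis.

Insight: Implicit motion transformation is better suited than explicit motion fields to human body video with complex motion patterns, offering a new direction for generative video coding.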

Abstract: Beyond traditional hybrid-based video codecs, generative video codecs can
achieve promising compression performance by evolving high-dimensional signals
into compact feature representations for bitstream compactness at the encoder
side and developing explicit motion fields as intermediate supervision for
high-quality reconstruction at the decoder side. This paradigm has achieved
significant success in face video compression. However, compared to facial
videos, human body videos pose greater challenges due to their more complex and
diverse motion patterns: when explicit motion guidance is used for
Generative Human Video Coding (GHVC), the reconstruction results can suffer
severe distortions and inaccurate motion. As such, this paper highlights the
limitations of explicit motion-based approaches for human body video
compression and investigates the GHVC performance improvement with the aid of
Implicit Motion Transformation (IMT). In particular, we propose to
characterize complex human body signals as compact visual features and
transform these features into implicit motion guidance for signal
reconstruction. Experimental results demonstrate the effectiveness of the
proposed IMT paradigm, which can facilitate GHVC to achieve high-efficiency
compression and high-fidelity synthesis.

[69] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models

Yu Huang,Zelin Peng,Yichen Zhao,Piao Yang,Xiaokang Yang,Wei Shen

Main category: cs.CV

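TL;DR: MedSeg-R introduces a new task, medical image reasoning segmentation, and an end-to-end framework that uses the reasoning abilities of multimodal large language models (MLLMs) to generate precise segmentation masks through a global context understanding module and a pixel-level grounding module.

Details Motivation: Existing medical image segmentation models rely on explicit human instructions and lack active reasoning capabilities, which limits their use in automatic diagnosis. MLLMs have improved medical question answering but struggle to generate precise segmentation masks.

Contribution: 1. Introduces the medical image reasoning segmentation task; 2. Develops the MedSeg-R framework, which uses MLLM reasoning to generate segmentation masks; 3. Builds the MedSeg-QA dataset.

Method: MedSeg-R consists of a global context understanding module (interprets images and instructions to generate multi-modal intermediate tokens) and a pixel-level grounding module (decodes these tokens into segmentation masks and textual responses).

Result: Superior performance across several benchmarks, with high segmentation accuracy and interpretable textual analysis of medical images.

Insight: Combining MLLM reasoning with pixel-level grounding gives a more flexible and intelligent approach to medical image segmentation.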

Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing
models are limited by their reliance on explicit human instructions and lack
the active reasoning capabilities to understand complex clinical questions.
While recent advancements in multimodal large language models (MLLMs) have
improved medical question-answering (QA) tasks, most methods struggle to
generate precise segmentation masks, limiting their application in automatic
medical diagnosis. In this paper, we introduce medical image reasoning
segmentation, a novel task that aims to generate segmentation masks based on
complex and implicit medical instructions. To address this, we propose
MedSeg-R, an end-to-end framework that leverages the reasoning abilities of
MLLMs to interpret clinical questions while also producing the
corresponding precise segmentation masks for medical images. It is built on two
core components: 1) a global context understanding module that interprets
images and comprehends complex medical instructions to generate multi-modal
intermediate tokens, and 2) a pixel-level grounding module that decodes these
tokens to produce precise segmentation masks and textual responses.
Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the
medical image reasoning segmentation task. It includes over 10,000 image-mask
pairs and multi-turn conversations, automatically annotated using large
language models and refined through physician reviews. Experiments show
MedSeg-R’s superior performance across several benchmarks, achieving high
segmentation accuracy and enabling interpretable textual analysis of medical
images.

[70] LLMs Are Not Yet Ready for Deepfake Image Detection

Shahroz Tariq,David Nguyen,M. A. P. Chamikara,Tingmin Wu,Alsharif Abuadbba,Kristen Moore

Main category: cs.CV

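TL;DR: The paper runs a zero-shot evaluation of four prominent vision-language models (VLMs) on deepfake image detection and finds that, although they produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems.

Details Motivation: Increasingly sophisticated deepfakes pose serious challenges to media integrity and public trust. VLMs have attracted attention for their potential across many domains, but their applicability to deepfake detection remains unclear.

Contribution: Systematically evaluates four VLMs (ChatGPT, Claude, Gemini, and Grok) on three deepfake types (faceswap, reenactment, and synthetic generation), revealing their strengths and limitations.

Method: A structured zero-shot evaluation on a carefully assembled benchmark of authentic and manipulated images, assessing each model's classification accuracy and reasoning depth.

Result: VLMs generate plausible explanations and detect surface-level anomalies, but they overemphasize stylistic elements, are vulnerable to misleading visual patterns such as vintage aesthetics, and cannot reliably detect deepfakes on their own.

Insight: General-purpose models are not currently suited to autonomous deepfake detection, but their strengths in interpretability and contextual analysis suggest potential in hybrid or human-in-the-loop detection frameworks.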

Abstract: The growing sophistication of deepfakes presents substantial challenges to
the integrity of media and the preservation of public trust. Concurrently,
vision-language models (VLMs), large language models enhanced with visual
reasoning capabilities, have emerged as promising tools across various domains,
sparking interest in their applicability to deepfake detection. This study
conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT,
Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap,
reenactment, and synthetic generation. Leveraging a meticulously assembled
benchmark comprising authentic and manipulated images from diverse sources, we
evaluate each model’s classification accuracy and reasoning depth. Our analysis
indicates that while VLMs can produce coherent explanations and detect
surface-level anomalies, they are not yet dependable as standalone detection
systems. We highlight critical failure modes, such as an overemphasis on
stylistic elements and vulnerability to misleading visual patterns like vintage
aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and
contextual analysis, suggesting their potential to augment human expertise in
forensic workflows. These insights imply that although general-purpose models
currently lack the reliability needed for autonomous deepfake detection, they
hold promise as integral components in hybrid or human-in-the-loop detection
frameworks.

[71] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation

Shuyang Li,Shuang Wang,Zhuangzhuang Sun,Jing Xiao

Main category: cs.CV

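TL;DR: PSLG-SAM addresses the dense annotation requirements and complex scenes of the RRSIS task with a two-stage approach (coarse localization, then fine segmentation), reducing the annotation burden and improving performance.

Details Motivation: RRSIS segments specified objects in remote sensing images from textual descriptions. Existing methods rely on multi-modal fusion backbones and semantic segmentation heads, and they face dense annotation requirements and difficulty interpreting complex scenes.

Contribution: Proposes the PSLG-SAM framework, which decomposes RRSIS into coarse localization and fine segmentation; contributes a high-quality, multi-category manually annotated dataset; shows experimentally that the method surpasses existing state-of-the-art models.

Method: 1. Coarse localization: a visual grounding network roughly locates the text-described object; 2. Fine segmentation: the coordinates from the first stage guide SAM, enhanced by a clustering-based foreground point generator and a mask boundary iterative optimization strategy, to produce a precise segmentation. This stage can be training-free.

Result: On the RRSIS-D and RRSIS-M datasets, PSLG-SAM surpasses existing state-of-the-art models.

Insight: Task decomposition and modular design can substantially reduce annotation needs and improve robustness to complex scenes.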

Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates
segmentation masks for specified objects in images based on textual
descriptions, which has attracted widespread attention and research interest.
Current RRSIS methods rely on multi-modal fusion backbones and semantic
segmentation heads but face challenges like dense annotation requirements and
complex scene interpretation. To address these issues, we propose a framework
named prompt-generated semantic localization guiding Segment Anything
Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse
localization and fine segmentation. In the coarse localization stage, a visual
grounding network roughly locates the text-described object. In the fine
segmentation stage, the coordinates from the first stage guide the Segment
Anything Model (SAM), enhanced by a clustering-based foreground point generator
and a mask boundary iterative optimization strategy for precise segmentation.
Notably, the second stage can be training-free, significantly reducing the
annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS
task into two stages allows for focusing on specific region segmentation,
avoiding interference from complex scenes. We further contribute a high-quality,
multi-category manually annotated dataset. Experimental validation on two
datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant
performance improvements and surpasses existing state-of-the-art models. Our
code will be made publicly available.
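
To make the idea of a clustering-based foreground point generator more concrete, the sketch below runs plain k-means over candidate foreground pixel coordinates and returns the cluster centers as point prompts. This is an assumption-laden illustration: the abstract does not say which clustering algorithm or features the authors use, and `kmeans_points`, its deterministic initialization, and the use of raw (x, y) coordinates are all hypothetical. Passing the resulting points to SAM is not shown.

```python
def kmeans_points(points, k, iters=20):
    """Cluster 2D points with plain k-means and return the k cluster centers.

    points: list of (x, y) candidate foreground pixels (hypothetical input,
    e.g. pixels inside the coarse box that pass some foreground test).
    Initialization is deterministic: evenly spaced picks from the input order.
    """
    if not points or k <= 0:
        return []
    k = min(k, len(points))
    step = max(1, len(points) // k)
    centers = [points[min(i * step, len(points) - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x, y in points:
            # Assign each point to its nearest center (squared distance).
            nearest = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
            groups[nearest].append((x, y))
        new_centers = []
        for c, group in enumerate(groups):
            if group:
                mean_x = sum(p[0] for p in group) / len(group)
                mean_y = sum(p[1] for p in group) / len(group)
                new_centers.append((mean_x, mean_y))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Using several cluster centers instead of a single box center is one way to spread prompts over an elongated or irregular object, which is the kind of benefit a foreground point generator is meant to provide.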

[72] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft

Jin Huang,Mingqiang Wei,Zikuan Li,Hangyu Qu,Wei Zhao,Xinyu Bai

Main category: cs.CV

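TL;DR: J-DDL is a smart system for detecting and localizing surface damage on fighter aircraft. It combines 2D images with 3D point clouds and uses an optimized YOLO-based architecture with a new loss function for accurate detection.

Details Motivation: Manual inspection of fighter aircraft surfaces is limited in scalability, efficiency, and consistency, so an automated solution is needed.

Contribution: Proposes the J-DDL system, which combines 2D and 3D data for damage detection and localization; designs a YOLO-based detection network with lightweight Fasternet blocks for feature extraction, Efficient Multiscale Attention (EMA) modules in the neck, and a new loss function, Inner-CIOU; releases the first publicly available dataset focused on aircraft damage.

Method: Laser scanners and cameras capture 2D images and 3D point clouds; the optimized YOLO network detects damage in the 2D images, and the detections are then mapped onto the 3D point clouds for localization.

Result: Experiments validate the effectiveness of J-DDL and show its potential for automated aircraft inspection.

Insight: Combining 2D and 3D data supports precise damage detection and localization; lightweight design and attention mechanisms matter for detection in complex scenes.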

Abstract: Ensuring the safety and extended operational life of fighter aircraft
necessitates frequent and exhaustive inspections. While surface defect
detection is feasible for human inspectors, manual methods face critical
limitations in scalability, efficiency, and consistency due to the vast surface
area, structural complexity, and operational demands of aircraft maintenance.
We propose a smart surface damage detection and localization system for fighter
aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the
entire aircraft surface, captured using a combined system of laser scanners and
cameras, to achieve precise damage detection and localization. Central to our
system is a novel damage detection network built on the YOLO architecture,
specifically optimized for identifying surface defects in 2D aircraft images.
Key innovations include lightweight Fasternet blocks for efficient feature
extraction, an optimized neck architecture incorporating Efficient Multiscale
Attention (EMA) modules for superior feature aggregation, and the introduction
of a novel loss function, Inner-CIOU, to enhance detection accuracy. After
detecting damage in 2D images, the system maps the identified anomalies onto
corresponding 3D point clouds, enabling accurate 3D localization of defects
across the aircraft surface. Our J-DDL not only streamlines the inspection
process but also ensures more comprehensive and detailed coverage of large and
complex aircraft exteriors. To facilitate further advancements in this domain,
we have developed the first publicly available dataset specifically focused on
aircraft damage. Experimental evaluations validate the effectiveness of our
framework, underscoring its potential to significantly advance automated
aircraft inspection technologies.
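
One simple way to picture the 2D-to-3D mapping step is to project each 3D point into the image and keep the points that land inside a detected damage box. The sketch below assumes an ideal pinhole camera with known intrinsics and a point cloud already expressed in the camera frame; these assumptions, the function name, and the parameter names are illustrative and are not taken from the paper, which does not describe its calibration or mapping procedure in the abstract.

```python
def points_in_detection(points_cam, fx, fy, cx, cy, box):
    """Return indices of 3D points whose projection falls inside a 2D box.

    points_cam: list of (X, Y, Z) in the camera frame (assumed, in meters).
    fx, fy, cx, cy: pinhole intrinsics in pixels (assumed known).
    box: (x_min, y_min, x_max, y_max) of a detected damage region in pixels.
    """
    x_min, y_min, x_max, y_max = box
    hits = []
    for index, (X, Y, Z) in enumerate(points_cam):
        if Z <= 0:
            continue  # point is behind the camera and cannot be imaged
        u = fx * X / Z + cx  # pinhole projection to pixel coordinates
        v = fy * Y / Z + cy
        if x_min <= u <= x_max and y_min <= v <= y_max:
            hits.append(index)
    return hits
```

A real system would also need occlusion handling, since points on the far side of the aircraft can project into the same box; that is omitted here.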

[73] CogStream: Context-guided Streaming Video Question Answering

Zicheng Zhao,Kangyu Wang,Shijie Li,Rui Qian,Weiyao Lin,Huabin Liu

Main category: cs.CV

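TL;DR: CogStream introduces a challenging task, Context-guided Streaming Video Reasoning, along with a densely annotated dataset and a baseline model, CogReasoner, which handles the task efficiently through visual stream compression and historical dialogue retrieval.

Details Motivation: Existing approaches to streaming video reasoning feed all historical context into the model, which incurs a heavy computational burden and lets irrelevant context distract from key details. CogStream simulates real-world scenarios and requires models to identify the most relevant historical context to answer questions.

Contribution: 1. Proposes the CogStream task; 2. Contributes a densely annotated dataset; 3. Proposes the baseline model CogReasoner.

Method: CogReasoner handles streaming video reasoning efficiently through visual stream compression and historical dialogue retrieval.

Result: Experiments demonstrate the effectiveness of the method.

Insight: Streaming video reasoning requires efficient filtering of irrelevant context, and CogReasoner's design offers a workable approach.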

Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving
multimodal understanding, challenges persist in streaming video reasoning due
to its reliance on contextual information. Existing paradigms feed all
available historical contextual information into Vid-LLMs, resulting in a
significant computational burden for visual data processing. Furthermore, the
inclusion of irrelevant context distracts models from key details. This paper
introduces a challenging task called Context-guided Streaming Video Reasoning
(CogStream), which simulates real-world streaming video scenarios, requiring
models to identify the most relevant historical contextual information to
deduce answers for questions about the current stream. To support CogStream, we
present a densely annotated dataset featuring extensive and hierarchical
question-answer pairs, generated by a semi-automatic pipeline. Additionally, we
present CogReasoner as a baseline model. It efficiently tackles this task by
leveraging visual stream compression and historical dialogue retrieval.
Extensive experiments prove the effectiveness of this method. Code will be
released soon.
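
Historical dialogue retrieval can be pictured as ranking past question-answer turns by similarity to the current question and keeping only the top few. The sketch below does this with cosine similarity over precomputed embedding vectors. It is a generic retrieval sketch under stated assumptions, not CogReasoner's actual mechanism: the embeddings, the `retrieve_history` name, and the `top_k` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_history(query_vec, history, top_k=2):
    """Return the texts of the top_k past turns most similar to the query.

    history: list of (text, embedding) pairs; embeddings are assumed to come
    from some encoder that is outside the scope of this sketch.
    """
    ranked = sorted(history, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

Only the retrieved turns would then be passed to the model alongside the current stream, which is how retrieval limits both computation and distraction from irrelevant context.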

[74] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations

Yutong Zhou,Masahiro Ryo

Main category: cs.CV

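TL;DR: The paper proposes an end-to-end visual-to-causal framework that turns a species image into interpretable causal insights about its habitat preference and uses large language models to generate human-readable explanations.

Details Motivation: Understanding why a species lives at a particular location is important for understanding ecological systems and conserving biodiversity, but existing ecological workflows are fragmented and often inaccessible to non-specialists.

Contribution: Proposes a full pipeline that integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction, and combines causal inference methods with large language models to generate human-readable explanations.

Method: 1. Integrates multimodal data (images, occurrences, climate); 2. Uses causal inference methods to discover the causal structure among environmental features and estimate their influence on species occurrence; 3. Generates explanations from structured templates and large language models.

Result: Demonstrates the framework on a bee and a flower species, with early results showing that it can generate statistically grounded, human-readable explanations of species habitat.

Insight: A multimodal AI assistant backed by established ecological modeling practice can give non-specialists an accessible tool for ecological insight.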

Abstract: Explaining why a species lives at a particular location is important for
understanding ecological systems and conserving biodiversity. However, existing
ecological workflows are fragmented and often inaccessible to non-specialists.
We propose an end-to-end visual-to-causal framework that transforms a species
image into interpretable causal insights about its habitat preference. The
system integrates species recognition, global occurrence retrieval,
pseudo-absence sampling, and climate data extraction. We then discover causal
structures among environmental features and estimate their influence on species
occurrence using modern causal inference methods. Finally, we generate
statistically grounded, human-readable causal explanations from structured
templates and large language models. We demonstrate the framework on a bee and
a flower species and report early results as part of an ongoing project,
showing the potential of the multimodal AI assistant backed up by a recommended
ecological modeling practice for describing species habitat in
human-understandable language.
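
Pseudo-absence sampling is commonly done by drawing random background locations that are not too close to known occurrences. The sketch below shows a minimal version of that idea. It assumes a rectangular study area, a distance measured in plain degrees rather than on the sphere, and a hypothetical `min_dist` buffer; none of these choices are stated in the abstract, and the authors' sampling scheme may differ.

```python
import math
import random


def sample_pseudo_absences(presences, bbox, n, min_dist, seed=0, max_tries=10000):
    """Draw up to n random points in bbox at least min_dist from every presence.

    presences: list of (lon, lat) occurrence records.
    bbox: (lon_min, lat_min, lon_max, lat_max) study area (assumed rectangular).
    min_dist: buffer in degrees (a simplification; real workflows often use km).
    Returns fewer than n points if max_tries is exhausted.
    """
    rng = random.Random(seed)
    lon_min, lat_min, lon_max, lat_max = bbox
    absences = []
    tries = 0
    while len(absences) < n and tries < max_tries:
        tries += 1
        lon = rng.uniform(lon_min, lon_max)
        lat = rng.uniform(lat_min, lat_max)
        far_enough = all(
            math.hypot(lon - p_lon, lat - p_lat) >= min_dist
            for p_lon, p_lat in presences
        )
        if far_enough:
            absences.append((lon, lat))
    return absences
```

The sampled points would then be paired with climate variables in the same way as the true occurrences, giving the presence/absence contrast that downstream models need.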

[75] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics

Imanol Solano,Julian Fierrez,Aythami Morales,Alejandro Peña,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin

Main category: cs.CV

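TL;DR: The paper proposes a new metric, CEI, for detecting demographic bias in face recognition systems, particularly subtle disparities in the distribution tails. CEI analyzes genuine and impostor score distributions separately with a configurable focus on tail probabilities, and outperforms existing methods.

Details Motivation: Existing metrics struggle to detect subtle demographic bias in high-performance face recognition systems, especially in the tails of the score distribution.

Contribution: Proposes the Comprehensive Equity Index (CEI) and its automated version, CEI^A, which detect tail bias more effectively than previous metrics.

Method: CEI analyzes genuine and impostor score distributions separately, with a configurable focus on tail probabilities while also considering overall distribution shapes; the automated version CEI^A improves objectivity.

Result: Experiments confirm CEI's superior ability to detect subtle bias, with particular sensitivity in the tails.

Insight: CEI applies beyond face recognition to any problem that compares statistical distributions with a focus on their tails.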

Abstract: Demographic bias in high-performance face recognition (FR) systems often
eludes detection by existing metrics, especially with respect to subtle
disparities in the tails of the score distribution. We introduce the
Comprehensive Equity Index (CEI), a novel metric designed to address this
limitation. CEI uniquely analyzes genuine and impostor score distributions
separately, enabling a configurable focus on tail probabilities while also
considering overall distribution shapes. Our extensive experiments (evaluating
state-of-the-art FR systems, intentionally biased models, and diverse datasets)
confirm CEI’s superior ability to detect nuanced biases where previous methods
fall short. Furthermore, we present CEI^A, an automated version of the metric
that enhances objectivity and simplifies practical application. CEI provides a
robust and sensitive tool for operational FR fairness assessment. The proposed
methods have been developed particularly for bias evaluation in face biometrics
but, in general, they are applicable for comparing statistical distributions in
any problem where one is interested in analyzing the distribution tails.

[76] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

Lizhen Wang,Zhurong Xia,Tianshu Hu,Pengrui Wang,Pengfei Wang,Zerong Zheng,Ming Zhou

Main category: cs.CV

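TL;DR: DreamActor-H1 is a Diffusion Transformer (DiT)-based framework for generating high-fidelity human-product demonstration videos, addressing identity preservation and understanding of spatial relationships.

Details Motivation: In e-commerce and digital marketing, high-fidelity human-product demonstration videos are important for product presentation, but existing methods either fail to preserve the identities of both humans and products or lack an understanding of their spatial relationships.

Contribution: Proposes a DiT-based framework that preserves identities and details by injecting paired human-product reference information and using a masked cross-attention mechanism, and that uses a 3D body mesh and product bounding boxes for precise motion alignment.

Method: A DiT combined with a 3D body mesh template and product bounding boxes for motion guidance; structured text encoding to improve 3D consistency; training on a hybrid dataset with extensive data augmentation.

Result: Outperforms state-of-the-art techniques in preserving the identity of both humans and products and in generating realistic demonstration motions.

Insight: Combining 3D geometric information with semantic encoding helps solve identity preservation and spatial alignment in human-product interaction.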

Abstract: In e-commerce and digital marketing, generating high-fidelity human-product
demonstration videos is important for effective product presentation. However,
most existing frameworks either fail to preserve the identities of both humans
and products or lack an understanding of human-product spatial relationships,
leading to unrealistic representations and unnatural interactions. To address
these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our
method simultaneously preserves human identities and product-specific details,
such as logos and textures, by injecting paired human-product reference
information and utilizing an additional masked cross-attention mechanism. We
employ a 3D body mesh template and product bounding boxes to provide precise
motion guidance, enabling intuitive alignment of hand gestures with product
placements. Additionally, structured text encoding is used to incorporate
category-level semantics, enhancing 3D consistency during small rotational
changes across frames. Trained on a hybrid dataset with extensive data
augmentation strategies, our approach outperforms state-of-the-art techniques
in maintaining the identity integrity of both humans and products and
generating realistic demonstration motions. Project page:
https://submit2025-dream.github.io/DreamActor-H1/.

[77] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration

Jun Wang,Lixing Zhu,Xiaohan Yu,Abhir Bhalerao,Yulan He

Main category: cs.CV

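TL;DR: The paper proposes PLACE, a framework that improves medical visual representation learning through pathological-level cross-modal alignment and correlation exploration, without additional human annotations.

Details Motivation: Data scarcity is a serious problem in the medical domain. Existing methods mostly focus on instance-wise or token-wise cross-modal alignment and neglect pathological-level consistency. This work aims to fill that gap.

Contribution: 1. Proposes a pathological-level cross-modal alignment (PCMA) approach; 2. Designs a Visual Pathology Observation Extractor and a correlation exploration task; 3. The framework needs no external disease annotations, which improves generalizability and robustness.

Method: 1. The PCMA module maximizes the consistency of pathology observations between images and reports; 2. Visual pathological observation representations are extracted from localized tokens; 3. A proxy task requires the model to identify correlations among image patches.

Result: Achieves new state-of-the-art performance on classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.

Insight: Pathological-level alignment and correlation exploration improve medical visual representation learning, particularly under data scarcity, with good generalizability and robustness.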

Abstract: Learning medical visual representations from image-report pairs through joint
learning has garnered increasing research attention due to its potential to
alleviate the data scarcity problem in the medical domain. The primary
challenges stem from the lengthy reports that feature complex discourse
relations and semantic pathologies. Previous works have predominantly focused
on instance-wise or token-wise cross-modal alignment, often neglecting the
importance of pathological-level consistency. This paper presents a novel
framework PLACE that promotes the Pathological-Level Alignment and enriches the
fine-grained details via Correlation Exploration without additional human
annotations. Specifically, we propose a novel pathological-level cross-modal
alignment (PCMA) approach to maximize the consistency of pathology observations
from both images and reports. To facilitate this, a Visual Pathology
Observation Extractor is introduced to extract visual pathological observation
representations from localized tokens. The PCMA module operates independently
of any external disease annotations, enhancing the generalizability and
robustness of our methods. Furthermore, we design a proxy task that enforces
the model to identify correlations among image patches, thereby enriching the
fine-grained details crucial for various downstream tasks. Experimental results
demonstrate that our proposed framework achieves new state-of-the-art
performance on multiple downstream tasks, including classification,
image-to-text retrieval, semantic segmentation, object detection and report
generation.

[78] DanceChat: Large Language Model-Guided Music-to-Dance Generation

Qing Wang,Xiaohang Yang,Yilan Dong,Naveen Raj Govindaraj,Gregory Slabaugh,Shanxin Yuan

Main category: cs.CV

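TL;DR: DanceChat is a large language model (LLM)-guided approach to music-to-dance generation in which the LLM provides textual guidance, producing dance motion that is diverse and aligned with musical style.

Details Motivation: Existing music-to-dance methods struggle to generate diverse and accurate dance motion because of the semantic gap between music and motion and the scarcity of paired data.

Contribution: Proposes DanceChat, which uses an LLM to generate textual guidance and combines multi-modal feature fusion with a diffusion model to improve the diversity and musical alignment of generated dance.

Method: 1) An LLM generates pseudo dance instructions; 2) Multi-modal feature extraction and fusion; 3) Diffusion-based motion synthesis with a multi-modal alignment loss.

Result: On AIST++ and in human evaluations, DanceChat outperforms existing methods both qualitatively and quantitatively.

Insight: High-level textual guidance from an LLM helps bridge the semantic gap between music and motion, improving diversity and style alignment.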

Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned
on musical input. Despite recent progress, significant challenges remain due to
the semantic gap between music and dance motion, as music offers only abstract
cues, such as melody, groove, and emotion, without explicitly specifying the
physical movements. Moreover, a single piece of music can produce multiple
plausible dance interpretations. This one-to-many mapping demands additional
guidance, as music alone provides limited information for generating diverse
dance movements. The challenge is further amplified by the scarcity of paired
music and dance data, which restricts the model's ability to learn
diverse dance patterns. In this paper, we introduce DanceChat, a Large Language
Model (LLM)-guided music-to-dance generation approach. We use an LLM as a
choreographer that provides textual motion instructions, offering explicit,
high-level guidance for dance generation. This approach goes beyond implicit
learning from music alone, enabling the model to generate dance that is both
more diverse and better aligned with musical styles. Our approach consists of
three components: (1) an LLM-based pseudo instruction generation module that
produces textual dance guidance based on music style and structure, (2) a
multi-modal feature extraction and fusion module that integrates music, rhythm,
and textual guidance into a shared representation, and (3) a diffusion-based
motion synthesis module together with a multi-modal alignment loss, which
ensures that the generated dance is aligned with both musical and textual cues.
Extensive experiments on AIST++ and human evaluations show that DanceChat
outperforms state-of-the-art methods both qualitatively and quantitatively.

[79] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning

Chun-Mei Feng,Kai Yu,Xinxing Xu,Salman Khan,Rick Siow Mong Goh,Wangmeng Zuo,Yong Liu

Main category: cs.CV

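TL;DR: The paper proposes T2I-PAL, which combines text-to-image generation models with CLIP to reduce the modality gap in multi-label image recognition and improves recognition performance.

Details Motivation: Vision-language pre-trained models such as CLIP align image and text features through contrastive learning, but the modality gap still limits their use in multi-label image recognition. The paper aims to reduce this gap while also reducing reliance on semantically annotated training data.

Contribution: 1. Proposes T2I-PAL, which uses text-to-image generation models to produce photo-realistic images and thereby reduce the modality gap. 2. Incorporates a class-wise heatmap and learnable prototypes to make local features more robust. 3. Combines prompt tuning and adapter learning to improve classification performance.

Method: 1. Generate diverse, photo-realistic images from text captions with a text-to-image model. 2. Introduce a class-wise heatmap and learnable prototypes to aggregate local similarities. 3. Combine prompt tuning and adapter learning for parameter-efficient fine-tuning.

Result: On benchmarks including MS-COCO, VOC2007, and NUS-WIDE, T2I-PAL improves recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.

Insight: 1. Text-to-image generation models can help close the modality gap. 2. Enhancing local features is important for multi-label recognition. 3. Joint prompt and adapter learning offers a new approach to fine-tuning CLIP.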

Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language
models such as CLIP allow texts to be used directly as images (TaI) for
parameter-efficient fine-tuning (PEFT). While CLIP can make image
features similar to the corresponding text features, the modality gap
remains a nontrivial issue and limits the image recognition performance of TaI.
Using multi-label image recognition (MLR) as an example, we present a novel
method, called T2I-PAL, to tackle the modality gap issue when using only text
captions for PEFT. The core design of T2I-PAL is to leverage pre-trained
text-to-image generation models to generate photo-realistic and diverse images
from text captions, thereby reducing the modality gap. To further enhance MLR,
T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This
aggregates local similarities, making the representation of local visual
features more robust and informative for multi-label recognition. For better
PEFT, we further combine both prompt tuning and adapter learning to enhance
classification performance. T2I-PAL offers significant advantages: it
eliminates the need for fully semantically annotated training images, thereby
reducing the manual annotation workload, and it preserves the intrinsic mode of
the CLIP model, allowing for seamless integration with any existing CLIP
framework. Extensive experiments on multiple benchmarks, including MS-COCO,
VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance
by 3.47% on average over the top-ranked state-of-the-art methods.

[80] Rethinking Random Masking in Self Distillation on ViT

Jihyeon Seong,Hyunkyung Han

Main category: cs.CV

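TL;DR: The paper examines the role of random masking in self-distillation frameworks such as DINO and proposes an asymmetric masking strategy that masks only the student's global view, preserving key semantic information and improving performance.

Details Motivation: Random masking in self-distillation frameworks such as DINO may inadvertently eliminate critical semantic information, which motivates more informed masking strategies.

Contribution: Proposes an asymmetric masking strategy that applies random masking only to the student's global view while leaving the teacher's global view and the student's local views unmasked, yielding more robust and fine-grained attention maps.

Method: Within DINO, random masking is applied only to the student's global view; the teacher's global view and the student's local views stay in their original form, so DINO's multi-view augmentation scheme retains a clean supervision signal.

Result: Evaluated with DINO-Tiny on mini-ImageNet, the approach yields more robust and fine-grained attention maps and improves downstream performance.

Insight: In self-distillation, a well-chosen masking strategy can improve performance by preserving key semantic information, and asymmetric masking is one effective way to do so.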

Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a
wide range of vision tasks. In particular, self-distillation frameworks such as
DINO have contributed significantly to these advances. Within such frameworks,
random masking is often utilized to improve training efficiency and introduce
regularization. However, recent studies have raised concerns that
indiscriminate random masking may inadvertently eliminate critical semantic
information, motivating the development of more informed masking strategies. In
this study, we explore the role of random masking in the self-distillation
setting, focusing on the DINO framework. Specifically, we apply random masking
exclusively to the student’s global view, while preserving the student’s local
views and the teacher’s global view in their original, unmasked forms. This
design leverages DINO’s multi-view augmentation scheme to retain clean
supervision while inducing robustness through masked inputs. We evaluate our
approach using DINO-Tiny on the mini-ImageNet dataset and show that random
masking under this asymmetric setup yields more robust and fine-grained
attention maps, ultimately enhancing downstream performance.
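
The asymmetric setup can be summarized as a rule for which views receive a patch mask. The sketch below builds boolean patch masks for each view and masks only the student's global view, as the abstract describes. The patch counts, the 30% mask ratio, and the function names are assumptions chosen for illustration; the abstract does not give these values, and the actual DINO training loop is not shown.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean list of length num_patches; True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def build_asymmetric_masks(num_global_patches, num_local_patches,
                           num_local_views, mask_ratio, seed=0):
    """Mask only the student's global view; every other view stays unmasked.

    All sizes and the mask ratio are hypothetical parameters.
    """
    rng = random.Random(seed)
    return {
        "teacher_global": [False] * num_global_patches,
        "student_global": random_mask(num_global_patches, mask_ratio, rng),
        "student_local": [
            [False] * num_local_patches for _ in range(num_local_views)
        ],
    }
```

Because the teacher always sees the full global view, its output remains a clean target while the student must match it from a partially masked input.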

[81] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement

Jin Huang,Honghua Chen,Mingqiang Wei

Main category: cs.CV

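TL;DR: The paper proposes HEA-MM, a hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, which analyzes error at three levels: global, part, and feature.

Details Motivation: Aviation equipment demands high quality (high performance, high stability, and high reliability), which calls for a systematic way to assess the error of manufactured workpieces against their CAD models.

Contribution: 1. Proposes the hierarchical error assessment framework HEA-MM; 2. Proposes an optimization-based primitive refinement method; 3. Develops a two-stage algorithm for detecting circular holes.

Method: 1. Structured light scanners acquire 3D measurements; 2. Error is analyzed at the global, part, and feature levels; 3. Coarse primitives are refined through splitting and merging operations in an optimization-based method; 4. Circular features are detected with tensor voting followed by a hypothesize-and-clusterize framework.

Result: Experiments on various aircraft CAD models demonstrate the effectiveness of HEA-MM.

Insight: Hierarchical analysis captures manufacturing error more comprehensively, especially for complex geometry, and combining optimization with feature detection improves assessment precision.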

Abstract: The most essential feature of aviation equipment is high quality, including
high performance, high stability and high reliability. In this paper, we
propose a novel hierarchical error assessment framework for aircraft CAD models
within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs
structured light scanners to obtain comprehensive 3D measurements of
manufactured workpieces. The measured point cloud is registered with the
reference CAD model, followed by an error analysis conducted at three
hierarchical levels: global, part, and feature. At the global level, the error
analysis evaluates the overall deviation of the scanned point cloud from the
reference CAD model. At the part level, error analysis is performed on the
patches underlying the point clouds. We propose a novel optimization-based
primitive refinement method to obtain a set of meaningful patches of point
clouds. Two basic operations, splitting and merging, are introduced to refine
the coarse primitives. At the feature level, error analysis is performed on
circular holes, which are commonly found in CAD models. To facilitate it, a
two-stage algorithm is introduced for the detection of circular holes. First,
edge points are identified using a tensor-voting algorithm. Then, multiple
circles are fitted through a hypothesize-and-clusterize framework, ensuring
accurate detection and analysis of the circular features. Experimental results
on various aircraft CAD models demonstrate the effectiveness of our proposed
method.
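
To make the circle-fitting stage more concrete, here is a minimal sketch of one "hypothesize" step: proposing a circle from three edge points and counting how many other edge points support it. This is a generic hypothesize-and-verify illustration under our own assumptions, not the paper's hypothesize-and-clusterize algorithm; the function names and the tolerance value are hypothetical.

```python
import math

def circle_from_three_points(p1, p2, p3):
    """Return (cx, cy, r) of the circle through three 2D points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None  # collinear points define no circle
    s1 = x1 * x1 + y1 * y1
    s2 = x2 * x2 + y2 * y2
    s3 = x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)

def count_inliers(circle, points, tol):
    """Count points whose distance to the circle boundary is within tol."""
    cx, cy, r = circle
    return sum(abs(math.hypot(x - cx, y - cy) - r) <= tol for x, y in points)

# Toy edge points: four lie on a circle centered at (2, 3) with radius 5.
edge_points = [(7, 3), (2, 8), (-3, 3), (2, -2), (0, 0)]
hypothesis = circle_from_three_points(*edge_points[:3])
support = count_inliers(hypothesis, edge_points, tol=0.01)
```

In a full pipeline, many such hypotheses would be generated from sampled edge points and then grouped; how the paper clusters them is not specified in the abstract.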

[82] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

Xinyuan Liu,Hang Xu,Yike Ma,Yucheng Zhang,Feng Dai

Main category: cs.CV

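TL;DR: The paper proposes SSP (Semantic-decoupled Spatial Partition), a unified framework that addresses inadequate sample assignment and instance confusion in point-supervised object detection, substantially improving detection in dense scenes.

Details Motivation: High-density scenes in remote sensing imagery require large amounts of manual annotation. Point-supervised oriented object detection is cheaper, but existing methods suffer from inadequate sample assignment and instance confusion. The paper proposes the SSP framework to address these problems.

Contribution: 1) A pixel-level spatial-partition-based sample assignment method that estimates object scale and mines high-quality samples. 2) A semantic-spatial-partition-based box extraction method that produces pseudo-labels for supervising detector training. 3) Validation of SSP's advantage on multiple datasets.

Method: 1) Estimate object scale and mine positive and negative samples through spatial partitioning. 2) Generate bounding-box pseudo-labels from spatial partitions modulated by semantic maps. 3) Integrate with the ORCNN and ReDet architectures for end-to-end training.

Result: On DOTA-v1.0, SSP reaches 45.78% mAP under point supervision, 4.10% higher than the previous best method (PointOBB-v2). Combined with ORCNN and ReDet, mAP reaches 47.86% and 48.50%, respectively.

Insight: By combining rule-driven and data-driven components, SSP tackles the core problems of point-supervised object detection and offers a new direction for efficient annotation in dense scenes.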

Abstract: Recent advances in remote sensing technology have driven rapid growth
in imagery and rapid development of oriented object detection, which is
nevertheless hindered by labor-intensive annotation for high-density scenes.
Oriented object detection with point supervision
offers a cost-effective solution for densely packed scenes in remote sensing,
yet existing methods suffer from inadequate sample assignment and instance
confusion due to rigid rule-based designs. To address this, we propose SSP
(Semantic-decoupled Spatial Partition), a unified framework that synergizes
rule-driven prior injection and data-driven label purification. Specifically,
SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based
Sample Assignment, which compactly estimates the upper and lower bounds of
object scales and mines high-quality positive samples and hard negative samples
through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based
Box Extraction, which derives instances from spatial partitions modulated by
semantic maps and reliably converts them into bounding boxes to form
pseudo-labels for supervising the learning of downstream detectors. Experiments
on DOTA-v1.0 and other datasets demonstrate SSP's superiority: it achieves 45.78% mAP
under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%.
Furthermore, when integrated with ORCNN and ReDet architectures, the SSP
framework achieves mAP values of 47.86% and 48.50%, respectively. The code is
available at https://github.com/antxinyuan/ssp.

[83] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model

Eshan Ramesh,Nishio Takayuki

Main category: cs.CV

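TL;DR: LatentCSI is a new method for generating images of the environment from WiFi CSI measurements, using a pretrained latent diffusion model for efficient, high-resolution image synthesis.

Details Motivation: Prior approaches rely on complex techniques such as GANs, which are computationally expensive and limited in quality. This work aims to simplify the pipeline with a lightweight network and a latent diffusion model, improving both generation efficiency and quality.

Contribution: Proposes LatentCSI, which combines a lightweight network with a pretrained latent diffusion model to generate high-quality images directly from CSI, with support for text-guided control.

Method: 1. Map CSI amplitudes into the latent space with a lightweight network; 2. Apply the denoising diffusion model in latent space with text guidance; 3. Decode to an image with the pretrained decoder.

Result: Validated on a self-collected dataset and on the MM-Fi dataset, LatentCSI outperforms baselines in computational efficiency and perceptual quality and supports text guidance.

Insight: Generating directly in latent space avoids the complexity of pixel-level encoding. Combined with a pretrained model, it yields high-quality results efficiently, and text guidance adds practical value.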

Abstract: We present LatentCSI, a novel method for generating images of the physical
environment from WiFi CSI measurements that leverages a pretrained latent
diffusion model (LDM). Unlike prior approaches that rely on complex and
computationally intensive techniques such as GANs, our method employs a
lightweight neural network to map CSI amplitudes directly into the latent space
of an LDM. We then apply the LDM’s denoising diffusion model to the latent
representation with text-based guidance before decoding using the LDM’s
pretrained decoder to obtain a high-resolution image. This design bypasses the
challenges of pixel-space image generation and avoids the explicit image
encoding stage typically required in conventional image-to-image pipelines,
enabling efficient and high-quality image synthesis. We validate our approach
on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi
devices and cameras; and a subset of the publicly available MM-Fi dataset. The
results demonstrate that LatentCSI outperforms baselines of comparable
complexity trained directly on ground-truth images in both computational
efficiency and perceptual quality, while additionally providing practical
advantages through its unique capacity for text-guided controllability.

[84] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

Liang Yin,Xudong Xie,Zhang Li,Xiang Bai,Yuliang Liu

Main category: cs.CV

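TL;DR: MSTAR is a multi-query scene text retrieval method that needs no bounding-box annotations. It dynamically captures multi-grained text representations and incorporates style-aware instructions, substantially improving retrieval performance.

Details Motivation: Existing scene text retrieval methods depend on costly bounding-box annotations and struggle to unify multiple query types. MSTAR aims to solve both problems.

Contribution: 1. A box-free approach that removes the need for bounding-box annotation and greatly reduces labeling cost; 2. Unification of free-style text queries through dynamic multi-grained representations and style-aware instructions; 3. MQTR, the first benchmark for multi-query scene text retrieval.

Method: 1. Progressive vision embedding dynamically captures multi-grained text representations; 2. Style-aware instructions harmonize free-style text queries; 3. A multi-instance matching module strengthens vision-language alignment.

Result: Exceeds the previous SOTA by 6.4% in MAP on Total-Text and by an average of 8.5% on MQTR.

Insight: 1. The box-free design markedly lowers annotation cost; 2. A unified multi-query strategy accommodates diverse retrieval needs; 3. Dynamic multi-grained representations improve text understanding.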

Abstract: Scene text retrieval has made significant progress with the assistance of
accurate text localization. However, existing approaches typically require
costly bounding box annotations for training. Besides, they mostly adopt a
customized retrieval strategy but struggle to unify various types of queries to
meet diverse retrieval needs. To address these issues, we introduce Multi-query
Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for
scene text retrieval. It incorporates progressive vision embedding to
dynamically capture the multi-grained representation of texts and harmonizes
free-style text queries with style-aware instructions. Additionally, a
multi-instance matching module is integrated to enhance vision-language
alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset,
the first benchmark designed to evaluate the multi-query scene text retrieval
capability of models, comprising four query types and 16k images. Extensive
experiments demonstrate the superiority of our method across seven public
datasets and the MQTR dataset. Notably, MSTAR surpasses the previous
state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box
annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly
outperforms the previous models by an average of 8.5%. The code and datasets
are available at https://github.com/yingift/MSTAR.

[85] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models

Konstantinos Vilouras,Ilias Stogiannidis,Junyu Yan,Alison Q. O’Neil,Sotirios A. Tsaftaris

Main category: cs.CV

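TL;DR: This paper proposes an anatomy-grounded, weakly supervised prompt tuning framework that improves the multimodal alignment of a pretrained chest X-ray latent diffusion model, enabling strong performance on downstream tasks such as phrase grounding.

Details Motivation: In medical imaging, the multimodal alignment of latent diffusion models is limited, in part because data is scarce owing to privacy concerns. This work addresses the insufficient alignment between clinically relevant free text in chest X-ray reports and the corresponding image regions.

Contribution: A weakly supervised prompt tuning framework for improving a pretrained latent diffusion model. It substantially improves multimodal alignment and sets a new state of the art on the standard MS-CXR dataset, with robust results on the external VinDr-CXR dataset.

Method: Anatomy-guided, weakly supervised prompt tuning improves the pretrained model's text-image alignment so that it can be adapted to downstream tasks such as phrase grounding.

Result: Sets a new SOTA on MS-CXR and shows robust performance on the external VinDr-CXR dataset.

Insight: Introducing anatomical information opens a new direction for improving multimodal alignment in medical imaging, raising model performance without large amounts of annotated data.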

Abstract: Latent Diffusion Models have shown remarkable results in text-guided image
synthesis in recent years. In the domain of natural (RGB) images, recent works
have shown that such models can be adapted to various vision-language
downstream tasks with little to no supervision involved. On the contrary,
text-to-image Latent Diffusion Models remain relatively underexplored in the
field of medical imaging, primarily due to limited data availability (e.g., due
to privacy concerns). In this work, focusing on the chest X-ray modality, we
first demonstrate that a standard text-conditioned Latent Diffusion Model has
not learned to align clinically relevant information in free-text radiology
reports with the corresponding areas of the given scan. Then, to alleviate this
issue, we propose a fine-tuning framework to improve multi-modal alignment in a
pre-trained model such that it can be efficiently repurposed for downstream
tasks such as phrase grounding. Our method sets a new state-of-the-art on a
standard benchmark dataset (MS-CXR), while also exhibiting robust performance
on out-of-distribution data (VinDr-CXR). Our code will be made publicly
available.

[86] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models

Francisco Caetano,Christiaan Viviers,Peter H. N. De With,Fons van der Sommen

Main category: cs.CV

TL;DR: Symmetrical Flow Matching (SymmFlow) is a novel framework unifying image generation, segmentation, and classification through a symmetric learning objective, ensuring bi-directional consistency and preserving semantic information.

Details Motivation: Existing methods often separate generative modeling, segmentation, and classification tasks. SymmFlow aims to unify these tasks within a single model, leveraging flow matching for improved consistency and efficiency.

Contribution: Introduces SymmFlow, a symmetric learning framework for joint modeling of forward and reverse transformations, enabling one-step segmentation and classification without iterative refinement. It supports flexible conditioning with pixel- and image-level labels.

Method: SymmFlow uses a symmetric learning objective to model bi-directional flows, preserving entropy for diversity and explicitly retaining semantic information. It introduces a new training objective for efficient sampling.

Result: Achieves state-of-the-art FID scores (11.9 on CelebAMask-HQ, 7.0 on COCO-Stuff) with only 25 inference steps. It also shows competitive segmentation and promising classification performance.

Insight: SymmFlow demonstrates that unifying generative, segmentation, and classification tasks is feasible through symmetric flow matching, offering a more efficient and consistent framework compared to task-specific models.

Abstract: Flow Matching has emerged as a powerful framework for learning continuous
transformations between distributions, enabling high-fidelity generative
modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new
formulation that unifies semantic segmentation, classification, and image
generation within a single model. Using a symmetric learning objective,
SymmFlow models forward and reverse transformations jointly, ensuring
bi-directional consistency, while preserving sufficient entropy for generative
diversity. A new training objective is introduced to explicitly retain semantic
information across flows, featuring efficient sampling while preserving
semantic structure, allowing for one-step segmentation and classification
without iterative refinement. Unlike previous approaches that impose strict
one-to-one mapping between masks and images, SymmFlow generalizes to flexible
conditioning, supporting both pixel-level and image-level class labels.
Experimental results on various benchmarks demonstrate that SymmFlow achieves
state-of-the-art performance on semantic image synthesis, obtaining FID scores
of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps.
Additionally, it delivers competitive results on semantic segmentation and
shows promising capabilities in classification tasks. The code will be publicly
available.

[87] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning

Xiaoyi Bao,Jindi Lv,Xiaofeng Wang,Zheng Zhu,Xinze Chen,YuKun Zhou,Jiancheng Lv,Xingang Wang,Guan Huang

Main category: cs.CV

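TL;DR: GigaVideo-1 is an efficient fine-tuning framework for video generation that improves quality through automatic feedback. It needs no human annotation or large-scale compute, and improves results across 17 evaluation dimensions with only 4 GPU-hours.

Details Motivation: Existing fine-tuning of video diffusion models depends on human annotation and large computational resources, which limits its practicality. The authors aim to improve video generation quality through automatic feedback and efficient optimization.

Contribution: Proposes the GigaVideo-1 framework, comprising a prompt-driven data engine and a reward-guided training strategy, which improves generation quality without additional human supervision.

Method: A prompt-driven data engine generates diverse training samples, and a reward-guided training strategy uses feedback from pretrained VLMs.

Result: On the VBench-2.0 benchmark, GigaVideo-1 improves performance by about 4% on average while consuming only 4 GPU-hours.

Insight: Automatic feedback can effectively replace human annotation and efficiently unlock the potential of pretrained models, offering a low-cost optimization route for video generation.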

Abstract: Recent progress in diffusion models has greatly enhanced video generation
quality, yet these models still require fine-tuning to improve specific
dimensions like instance preservation, motion rationality, composition, and
physical plausibility. Existing fine-tuning approaches often rely on human
annotations and large-scale computational resources, limiting their
practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning
framework that advances video generation without additional human supervision.
Rather than injecting large volumes of high-quality data from external sources,
GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models
through automatic feedback. Specifically, we focus on two key aspects of the
fine-tuning process: data and optimization. To improve fine-tuning data, we
design a prompt-driven data engine that constructs diverse, weakness-oriented
training samples. On the optimization side, we introduce a reward-guided
training strategy, which adaptively weights samples using feedback from
pre-trained vision-language models with a realism constraint. We evaluate
GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17
evaluation dimensions. Experiments show that GigaVideo-1 consistently improves
performance on almost all the dimensions with an average gain of about 4% using
only 4 GPU-hours. Requiring no manual annotations and minimal real data,
GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and
data will be publicly available.

[88] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis

Marzieh Oghbaie,Teresa Araújo,Hrvoje Bogunović

Main category: cs.CV

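TL;DR: PiPViT is a prototype learning method built on a vision transformer (ViT). Using contrastive learning and multi-resolution input processing, it learns interpretable lesion prototypes for retinal image analysis.

Details Motivation: In medical images, existing prototype methods struggle to produce prototype visualizations that match human-understandable biomarkers, and their prototypes are too fine-grained, whereas both the presence and the extent of lesions matter in medical imaging.

Contribution: Proposes PiPViT, a ViT-based interpretable prototype model that learns human-understandable prototypes from image-level labels and localizes biomarkers across scales through contrastive learning and multi-resolution processing.

Method: A ViT captures long-range dependencies among image patches. Contrastive learning and multi-resolution input processing are combined to learn clinically relevant prototypes.

Result: Performs well on four retinal OCT image datasets, matching SOTA performance while providing more intuitive explanations. The prototypes are also semantically and clinically relevant.

Insight: By combining a ViT with prototype learning, PiPViT offers a transparent, interpretable approach to medical image diagnosis that helps clinicians understand diagnostic outcomes.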

Abstract: Background and Objective: Prototype-based methods improve interpretability by
learning fine-grained part-prototypes; however, their visualization in the
input pixel space is not always consistent with human-understandable
biomarkers. In addition, well-known prototype-based approaches typically learn
extremely granular prototypes that are less interpretable in medical imaging,
where both the presence and extent of biomarkers and lesions are critical.
Methods: To address these challenges, we propose PiPViT (Patch-based Visual
Interpretable Prototypes), an inherently interpretable prototypical model for
image recognition. Leveraging a vision transformer (ViT), PiPViT captures
long-range dependencies among patches to learn robust, human-interpretable
prototypes that approximate lesion extent only using image-level labels.
Additionally, PiPViT benefits from contrastive learning and multi-resolution
input processing, which enables effective localization of biomarkers across
scales.
Results: We evaluated PiPViT on retinal OCT image classification across four
datasets, where it achieved competitive quantitative performance compared to
state-of-the-art methods while delivering more meaningful explanations.
Moreover, quantitative evaluation on a hold-out test set confirms that the
learned prototypes are semantically and clinically relevant. We believe PiPViT
can transparently explain its decisions and assist clinicians in understanding
diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT

[89] Enhancing Deepfake Detection using SE Block Attention with CNN

Subhram Dasgupta,Janelle Mason,Xiaohong Yuan,Olusola Odeyomi,Kaushik Roy

Main category: cs.CV

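TL;DR: The paper proposes a lightweight deepfake detection model based on SE block attention and a CNN. Dynamic channel-wise feature recalibration improves efficiency and accuracy, reaching 94.14% classification accuracy and an AUC-ROC score of 0.985 on the Style GAN dataset.

Details Motivation: Deepfake technology produces highly realistic synthetic content that threatens information authenticity and security. Traditional detection methods struggle to cope, and most existing models are large networks with heavy computational overhead.

Contribution: A lightweight CNN combined with SE block attention that improves efficiency through dynamic feature recalibration while maintaining high detection performance.

Method: A lightweight CNN with an integrated SE block. The SE block dynamically adjusts channel feature weights, emphasizing useful features and suppressing uninformative ones.

Result: Strong results on the Style GAN dataset: 94.14% classification accuracy and an AUC-ROC score of 0.985.

Insight: The SE block's dynamic feature recalibration can effectively improve lightweight models, providing an efficient solution where computational resources are limited.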

Abstract: In the digital age, deepfakes present a formidable challenge by using advanced
artificial intelligence to create highly convincing manipulated content,
undermining information authenticity and security. These sophisticated
fabrications surpass traditional detection methods in complexity and realism.
To address this issue, we aim to harness cutting-edge deep learning
methodologies to engineer an innovative deepfake detection model. However, most
of the models designed for deepfake detection are large, causing heavy storage
and memory consumption. In this research, we propose a lightweight convolutional
neural network (CNN) with squeeze-and-excitation (SE) block attention for
deepfake detection. The SE block module is designed to perform dynamic
channel-wise feature recalibration. The SE block allows the network to
emphasize informative features and suppress less useful ones, which leads to a
more efficient and effective learning module. This module is integrated with a
simple sequential model to perform deepfake detection. The model is smaller in
size and achieves accuracy competitive with existing models for deepfake
detection tasks. The model achieved an overall classification accuracy of
94.14% and AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse
Fake Face Dataset. Our proposed approach presents a promising avenue for
combating the Deepfake challenge with minimal computational resources,
developing efficient and scalable solutions for digital content verification.
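
The channel-wise recalibration performed by an SE block can be illustrated with a short, dependency-free sketch. It follows the standard squeeze-and-excitation recipe (global average pooling, a bottleneck of two fully connected layers, sigmoid gating); it is not the authors' network. The weights `w1` and `w2` are made-up toy values, and bias terms are omitted for brevity.

```python
import math

def se_block(feature_maps, w1, w2):
    """Apply squeeze-and-excitation to a list of C channels (each an H x W nested list).

    w1 has shape (C // r) x C and w2 has shape C x (C // r); biases are omitted.
    Returns the rescaled feature maps and the per-channel scales.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps
    ]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    out = [[[v * s for v in row] for row in ch] for ch, s in zip(feature_maps, scales)]
    return out, scales

# Toy input: two 2x2 channels, reduced to one hidden unit.
maps = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[1.0, 1.0]]        # hypothetical weights
w2 = [[1.0], [-1.0]]     # hypothetical weights
recalibrated, gates = se_block(maps, w1, w2)
```

With these toy weights the first channel is emphasized and the second is suppressed, which is the "emphasize informative features, suppress less useful ones" behavior described above.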

[90] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework

Xia Du,Xiaoyuan Liu,Jizhe Zhou,Zheng Lin,Chi-man Pun,Zhe Chen,Wei Ni,Jun Luo

Main category: cs.CV

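TL;DR: The paper proposes the Unsourced Adversarial CAPTCHA (UAC) framework, which generates high-fidelity adversarial examples from text prompts, increases CAPTCHA diversity, supports both targeted and untargeted attacks, and effectively resists DNN-based automated attacks.

Details Motivation: With the rapid progress of deep learning, traditional CAPTCHAs are increasingly vulnerable to DNN-driven automated attacks. Existing adversarial attack methods rely on features of the original image, producing distortions that hinder human interpretation, and they are of limited use when no initial input image is available.

Contribution: Proposes the UAC framework, which generates high-fidelity adversarial examples from attacker-specified text prompts and supports targeted and untargeted attacks. For untargeted attacks, proposes BP-UAC, which uses multimodal gradients and a bi-path optimization strategy.

Method: 1. Targeted attacks use the EDICT method to optimize dual latent variables in a diffusion model; 2. Untargeted attacks use BP-UAC, combining multimodal gradients with bi-path optimization.

Result: Experiments show that BP-UAC achieves high attack success rates across diverse systems, and the generated CAPTCHAs are indistinguishable to both humans and DNNs.

Insight: By combining text prompts with multimodal optimization, the UAC framework offers a new approach to CAPTCHA design that balances adversarial effectiveness against human readability.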

Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are
increasingly vulnerable to automated attacks powered by deep neural networks
(DNNs). Existing adversarial attack methods often rely on original image
characteristics, resulting in distortions that hinder human interpretation and
limit applicability in scenarios lacking initial input images. To address these
challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel
framework generating high-fidelity adversarial examples guided by
attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC
enhances CAPTCHA diversity and supports both targeted and untargeted attacks.
For targeted attacks, the EDICT method optimizes dual latent variables in a
diffusion model for superior image quality. In untargeted attacks, especially
for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA
(BP-UAC), a two-step optimization strategy employing multimodal gradients and
bi-path optimization for efficient misclassification. Experiments show BP-UAC
achieves high attack success rates across diverse systems, generating natural
CAPTCHAs indistinguishable to humans and DNNs.

[91] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery

Christopher Gaul,Eduardo Fidalgo,Enrique Alegre,Rocío Alaiz Rodríguez,Eri Pérez Corral

Main category: cs.CV

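TL;DR: This study proposes a multi-task, multi-age framework that combines a reweighted loss with age-balanced sampling, substantially improving the accuracy and robustness of underage detection in unconstrained images.

Details Motivation: Minors are under-represented in public data and distribution shift is severe, so a robust model is needed to handle both issues.

Contribution: 1) A multi-task architecture that jointly performs age regression and several under-age classification tasks; 2) A reweighted loss and age-balanced sampling; 3) A new evaluation benchmark.

Method: Built on a frozen FaRL vision-language backbone with a two-layer MLP that shares features, the model has one age-regression head and four under-age classification heads, trained with a reweighted loss and an optimized sampling strategy.

Result: The model substantially improves F2 scores on several under-age classification tasks and maintains high recall under distribution shift.

Insight: Joint multi-task optimization and balanced sampling are key, and the new benchmark provides a stricter test setting for real-world use.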

Abstract: Accurate automatic screening of minors in unconstrained images demands models
that are robust to distribution shift and resilient to the under-representation
of children in publicly available data. To overcome these issues, we
propose a multi-task architecture with dedicated under/over-age discrimination
tasks based on a frozen FaRL vision-language backbone joined with a compact
two-layer MLP that shares features across one age-regression head and four
binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing
on the legally critical age range. To address the severe class imbalance, we
introduce an $\alpha$-reweighted focal-style loss and age-balanced mini-batch
sampling, which equalizes twelve age bins during stochastic optimization.
Further improvement is achieved with an age gap that removes edge cases from
the loss.
Moreover, we set a rigorous evaluation by proposing the Overall Under-Age
Benchmark, with 303k cleaned training images and 110k test images, defining
both the “ASORES-39k” restricted overall test, which removes the noisiest
domains, and the age estimation wild shifts test “ASWIFT-20k” of 20k images,
stressing extreme pose ($>45^\circ$), expression, and low image quality to
emulate real-world shifts.
Trained on the cleaned overall set with resampling and age gap, our multiage
model “F” lowers the root-mean-square-error on the ASORES-39k restricted test
from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from
F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to
the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall
while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline,
demonstrating strong generalization under distribution shift. For the under-12
and under-15 tasks, F2 rises from 0.666 to 0.955 and from 0.689 to 0.916,
respectively.
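
Since the results above are reported as F2 scores, a small worked example may help show why that metric suits screening tasks: it weights recall more heavily than precision. The sketch below implements the standard F-beta formula; the confusion counts are invented for illustration and are not taken from the paper.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """Standard F-beta score from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 90 minors caught, 30 adults flagged, 10 minors missed.
f2 = f_beta(tp=90, fp=30, fn=10, beta=2.0)
f1 = f_beta(tp=90, fp=30, fn=10, beta=1.0)
```

Here precision is 0.75 and recall is 0.90, so F2 comes out higher than F1 because missed minors (false negatives) are penalized more than false alarms.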

[92] Continual Hyperbolic Learning of Instances and Classes

Melika Ayoughi,Mina Ghadimi Atigh,Mohammad Mahdi Derakhshani,Cees G. M. Snoek,Pascal Mettes,Paul Groth

Main category: cs.CV

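TL;DR: The paper introduces a new continual learning task that handles instance and class classification together, models the hierarchy in hyperbolic space, and proposes the HyperCLIC algorithm, which combines hyperbolic classification and distillation objectives to embed hierarchical relations continually.

Details Motivation: Real-world applications such as robotics and self-driving cars require models to classify both instances and classes, whereas traditional continual learning focuses on only one of the two. The paper therefore proposes learning instances and classes together and modeling their hierarchical structure.

Contribution: 1. A new continual learning task: learning instances and classes at the same time; 2. The HyperCLIC algorithm, which models hierarchy in hyperbolic space; 3. Continual hierarchical evaluation metrics.

Method: 1. Represent hierarchical relations in hyperbolic space; 2. Combine hyperbolic classification and distillation objectives for continual learning; 3. Validate on the EgoObjects dataset.

Result: Experiments show that HyperCLIC handles multi-granularity tasks effectively and improves hierarchical generalization.

Insight: Hyperbolic space is well suited to modeling hierarchy and shows promise in continual learning. Joint learning of instances and classes is closer to real application needs.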

Abstract: Continual learning has traditionally focused on classifying either instances
or classes, but real-world applications, such as robotics and self-driving
cars, require models to handle both simultaneously. To mirror real-life
scenarios, we introduce the task of continual learning of instances and
classes, at the same time. This task challenges models to adapt to multiple
levels of granularity over time, which requires balancing fine-grained instance
recognition with coarse-grained class generalization. In this paper, we
identify that classes and instances naturally form a hierarchical structure. To
model these hierarchical relationships, we propose HyperCLIC, a continual
learning algorithm that leverages hyperbolic space, which is uniquely suited
for hierarchical data due to its ability to represent tree-like structures with
low distortion and compact embeddings. Our framework incorporates hyperbolic
classification and distillation objectives, enabling the continual embedding of
hierarchical relations. To evaluate performance across multiple granularities,
we introduce continual hierarchical metrics. We validate our approach on
EgoObjects, the only dataset that captures the complexity of hierarchical
object recognition in dynamic real-world environments. Empirical results show
that HyperCLIC operates effectively at multiple granularities with improved
hierarchical generalization.
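
To illustrate why hyperbolic space suits tree-like data, the sketch below computes geodesic distance in the Poincaré ball, one common model of hyperbolic space. Whether HyperCLIC uses this particular model is an assumption on our part; the abstract only says "hyperbolic space". The function name and sample points are hypothetical.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    def sq_norm(x):
        return sum(a * a for a in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# The same Euclidean gap of 0.1 is much longer near the boundary than near the origin.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.8, 0.0), (0.9, 0.0))
```

Because distances grow rapidly toward the boundary, there is room to place many fine-grained leaves (instances) far apart while keeping coarse nodes (classes) near the origin, which is the low-distortion property mentioned in the abstract.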

[93] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement

Yuqi Shen,Fengyang Xiao,Sujie Hu,Youwei Pang,Yifan Pu,Chengyu Fang,Xiu Li,Chunming He

Main category: cs.CV

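TL;DR: The paper proposes an uncertainty-masked Bernoulli diffusion model (UMBD) that selectively refines regions with poor segmentation quality, substantially improving camouflaged object detection.

Details Motivation: In camouflaged object detection (COD), targets differ only subtly from their backgrounds. Segmentation results from existing methods leave considerable room for refinement, yet generative post-processing has not been fully explored.

Contribution: Proposes UMBD, the first generative refinement framework for COD, and designs an uncertainty masking mechanism and the Hybrid Uncertainty Quantification Network (HUQNet) to enable targeted refinement.

Method: UMBD applies Bernoulli diffusion selectively through an uncertainty-guided masking mechanism. HUQNet's multi-branch architecture fuses uncertainty from multiple sources to improve estimation accuracy.

Result: Across multiple COD benchmarks, average gains are 5.5% in MAE and 3.2% in weighted F-measure, with modest computational overhead.

Insight: Combining generative methods with discriminative models can substantially improve COD through targeted refinement, and uncertainty estimation plays a key role in the refinement process.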

Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the
subtle visual differences between targets and their backgrounds. While existing
methods have made notable progress, there remains significant potential for
post-processing refinement that has yet to be fully explored. To address this
limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model,
the first generative refinement framework specifically designed for COD. UMBD
introduces an uncertainty-guided masking mechanism that selectively applies
Bernoulli diffusion to residual regions with poor segmentation quality,
enabling targeted refinement while preserving correctly segmented areas. To
support this process, we design the Hybrid Uncertainty Quantification Network
(HUQNet), which employs a multi-branch architecture and fuses uncertainty from
multiple sources to improve estimation accuracy. This enables adaptive guidance
during the generative sampling process. The proposed UMBD framework can be
seamlessly integrated with a wide range of existing Encoder-Decoder-based COD
models, combining their discriminative capabilities with the generative
advantages of diffusion-based refinement. Extensive experiments across multiple
COD benchmarks demonstrate consistent performance improvements, achieving
average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest
computational overhead. Code will be released.
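
The idea of refining only uncertain regions while preserving confident ones can be shown with a toy masking step. This sketch covers just the selective-merge logic and assumes a refined mask is already available; the Bernoulli diffusion process and HUQNet themselves are not modeled. The function name and the 0.5 threshold are illustrative assumptions.

```python
def masked_refine(coarse, refined, uncertainty, threshold=0.5):
    """Take the refined value where uncertainty exceeds threshold, else keep the coarse value.

    All three inputs are H x W nested lists of the same shape.
    """
    return [
        [r if u > threshold else c for c, r, u in zip(c_row, r_row, u_row)]
        for c_row, r_row, u_row in zip(coarse, refined, uncertainty)
    ]

# Toy 2x2 example: only the top-right pixel is uncertain enough to be replaced.
coarse_mask = [[1, 0], [0, 1]]
refined_mask = [[0, 1], [1, 0]]
uncertainty_map = [[0.1, 0.9], [0.2, 0.3]]
merged = masked_refine(coarse_mask, refined_mask, uncertainty_map)
```

Pixels below the threshold pass through unchanged, so correctly segmented areas are not disturbed by the refinement stage.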

[94] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain

Hong Huang,Weixiang Sun,Zhijian Wu,Jingwen Niu,Donghuan Lu,Xian Wu,Yefeng Zheng

Main category: cs.CV

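TL;DR: IQE-CLIP proposes a query embedding method that combines textual and visual information for zero-/few-shot anomaly detection in the medical domain. Class-based and learnable prompting tokens, together with an instance-aware query module, substantially improve performance.

Details Motivation: Existing CLIP-based methods for zero-/few-shot anomaly detection depend on scenario-specific prompt design and mainly target industrial domains, with little exploration of medical tasks. IQE-CLIP aims to address these limitations.

Contribution: 1. A query embedding method that combines textual and visual information; 2. Class-based and learnable prompting tokens; 3. An instance-aware query module; 4. State-of-the-art performance in the medical domain.

Method: 1. Adapt CLIP to the medical setting with class-based and learnable prompting tokens; 2. Extract region-level contextual information with the instance-aware query module to generate anomaly-sensitive embeddings.

Result: Experiments on six medical datasets show that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings.

Insight: Query embeddings that combine textual and visual information capture anomaly features more effectively, especially in the medical domain. The instance-aware module suggests a new approach to cross-modal information fusion.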

Abstract: Recent advances in vision-language models, such as CLIP, have significantly
improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks.
However, most existing CLIP-based methods assume prior knowledge of categories
and rely on carefully designed prompts tailored to specific scenarios. While
these text prompts capture semantic information in the textual space, they
often fail to distinguish normal and anomalous instances in the joint embedding
space. Moreover, most ZFSAD approaches focus on industrial domains, with
limited exploration in medical tasks. To address these limitations, we propose
IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query
embeddings integrating both textual and instance-aware visual information serve
as more effective indicators of anomalies. Specifically, we introduce
class-based and learnable prompting tokens to better adapt CLIP to the medical
setting. Furthermore, we design an instance-aware query module that extracts
region-level contextual information from both modalities, enabling the
generation of anomaly-sensitive embeddings. Extensive experiments on six
medical datasets demonstrate that IQE-CLIP achieves state-of-the-art
performance in both zero-shot and few-shot settings. Code and data are
available at https://github.com/hongh0/IQE-CLIP/.

[95] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework

SiXiang Chen,Jianyu Lai,Jialin Gao,Tian Ye,Haoyu Chen,Hengyu Shi,Shitong Shao,Yunlong Lin,Song Fei,Zhaohu Xing,Yeying Jin,Junfeng Luo,Xiaoming Wei,Lei Zhu

Main category: cs.CV

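TL;DR: PosterCraft is a unified framework for generating high-quality aesthetic posters. A multi-stage optimization workflow substantially improves text rendering and layout quality.

Details Motivation: Generating aesthetic posters is harder than generating simple design images: it must balance text rendering, integration of artistic content, and layout harmony. Existing methods are usually modular or rely on predefined layouts, which limits creativity.

Contribution: Proposes the PosterCraft unified framework, which abandons modular pipelines and predefined layouts and achieves high-quality poster generation through a multi-stage optimization process (text-rendering optimization, region-aware fine-tuning, aesthetic reinforcement learning, and feedback refinement).

Method: A cascaded workflow consisting of text-rendering optimization (Text-Render-2M dataset), region-aware fine-tuning (HQ-Poster100K), aesthetic-text reinforcement learning, and joint vision-language feedback refinement.

Result: Across multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and visual appeal, approaching the level of commercial systems.

Insight: Multi-stage optimization and automated data construction enable high-quality poster generation without complex architectural modifications, demonstrating the potential of a unified framework.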

Abstract: Generating aesthetic posters is more challenging than simple design images:
it requires not only precise text rendering but also the seamless integration
of abstract artistic content, striking layouts, and overall stylistic harmony.
To address this, we propose PosterCraft, a unified framework that abandons
prior modular pipelines and rigid, predefined layouts, allowing the model to
freely explore coherent, visually compelling compositions. PosterCraft employs
a carefully designed, cascaded workflow to optimize the generation of
high-aesthetic posters: (i) large-scale text-rendering optimization on our
newly introduced Text-Render-2M dataset; (ii) region-aware supervised
fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via
best-of-n preference optimization; and (iv) joint vision-language feedback
refinement. Each stage is supported by a fully automated data-construction
pipeline tailored to its specific needs, enabling robust training without
complex architectural modifications. Across multiple experiments,
PosterCraft significantly outperforms open-source baselines in rendering
accuracy, layout coherence, and overall visual appeal, approaching the quality
of SOTA commercial systems. Our code, models, and datasets can be found on the
project page: https://ephemeral182.github.io/PosterCraft

[96] SlotPi: Physics-informed Object-centric Reasoning Models

Jian Li,Wan Han,Ning Lin,Yu-Liang Zhan,Ruizhi Chengze,Haining Wang,Yi Zhang,Hongsheng Liu,Zidong Wang,Fan Yu,Hao Sun

Main category: cs.CV

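TL;DR: SlotPi is a physics-informed object-centric reasoning model. By combining Hamiltonian principles with a spatio-temporal prediction module, it addresses the integration of physical knowledge and the question of model adaptability in dynamic scene simulation.

Details Motivation: Current object-centric dynamic simulation methods do not integrate physical knowledge, and their adaptability across diverse scenarios is insufficiently validated, particularly for dynamic scenes involving interactions between fluids and objects.

Contribution: Proposes the SlotPi model, which combines a physical module with a spatio-temporal prediction module, and creates a real-world dataset covering object and fluid interactions, demonstrating the model's strong adaptability and performance.

Method: SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting.

Result: Performs well on prediction and VQA tasks on benchmark and fluid datasets, validating the model's adaptability and performance.

Insight: Integrating physical knowledge can substantially improve a model's reasoning in complex dynamic scenes, laying a foundation for more advanced world models.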

Abstract: Understanding and reasoning about dynamics governed by physical laws through
visual observation, akin to human capabilities in the real world, poses
significant challenges. Currently, object-centric dynamic simulation methods,
which emulate human behavior, have achieved notable progress but overlook two
critical aspects: 1) the integration of physical knowledge into models. Humans
gain physical insights by observing the world and apply this knowledge to
accurately reason about various dynamic scenarios; 2) the validation of model
adaptability across diverse scenarios. Real-world dynamics, especially those
involving fluids and objects, demand models that not only capture object
interactions but also simulate fluid flow characteristics. To address these
gaps, we introduce SlotPi, a slot-based physics-informed object-centric
reasoning model. SlotPi integrates a physical module based on Hamiltonian
principles with a spatio-temporal prediction module for dynamic forecasting.
Our experiments highlight the model’s strengths in tasks such as prediction and
Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore,
we have created a real-world dataset encompassing object interactions, fluid
dynamics, and fluid-object interactions, on which we validated our model’s
capabilities. The model’s robust performance across all datasets underscores
its strong adaptability, laying a foundation for developing more advanced world
models.

[97] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning

Ignacio Bugueno-Cordova,Javier Ruiz-del-Solar,Rodrigo Verschae

Main category: cs.CV

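TL;DR: This paper proposes a robot navigation controller that combines event cameras with reinforcement learning for real-time human-centered navigation and obstacle avoidance, overcoming the fixed frame rate and motion blur limitations of conventional image-based controllers.

Details Motivation: Conventional image-based navigation controllers suffer from fixed frame rates, motion blur, and high latency, whereas the asynchronous nature of event cameras allows visual information to be processed flexibly, opening new possibilities for robot navigation.

Contribution: The main contribution is a framework that combines event cameras, other sensors, and reinforcement learning to achieve adaptive human-centered navigation and obstacle avoidance, with imitation learning used to improve sample efficiency.

Method: The method comprises asynchronous event-based visual processing, policy optimization via Deep Deterministic Policy Gradient (DDPG), and an initial imitation learning phase.

Result: Robust navigation is demonstrated in simulated environments, including pedestrian following and obstacle avoidance.

Insight: Combining event cameras with reinforcement learning offers an efficient approach to real-time robot navigation, and asynchronous processing notably improves the system's adaptability.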

Abstract: This work introduces a robot navigation controller that combines event
cameras and other sensors with reinforcement learning to enable real-time
human-centered navigation and obstacle avoidance. Unlike conventional
image-based controllers, which operate at fixed rates and suffer from motion
blur and latency, this approach leverages the asynchronous nature of event
cameras to process visual information over flexible time intervals, enabling
adaptive inference and control. The framework integrates event-based
perception, additional range sensing, and policy optimization via Deep
Deterministic Policy Gradient, with an initial imitation learning phase to
improve sample efficiency. Promising results are achieved in simulated
environments, demonstrating robust navigation, pedestrian following, and
obstacle avoidance. A demo video is available at the project website.

[98] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization

Mario Barbara,Alaa Maalouf

Main category: cs.CV

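TL;DR: This paper proposes Prompts-to-Summaries, a zero-shot video summarization method driven by natural-language queries. It uses off-the-shelf video-language models (VidLMs) and large language models (LLMs) to produce user-guided video summaries without any training data, outperforming unsupervised methods and matching supervised ones.

Details Motivation: The explosive growth of video data has created demand for summarization tools that need no domain-specific training data and respond flexibly to user intent expressed in natural language. Existing methods either rely on datasets, which limits generalization, or cannot incorporate such intent.

Contribution: 1. Proposes the first zero-shot, text-queryable video summarization framework, requiring no training data. 2. Designs an efficient batch-style VidLM prompting scheme that supports long videos. 3. Uses an LLM as a judge to produce importance scores and introduces two new metrics (consistency and uniqueness) for fine-grained importance scoring. 4. Introduces a new dataset, VidSum-Reason, to advance query-driven video summarization research.

Method: 1. Segment the raw video into coherent scenes. 2. Generate scene-level descriptions with a VidLM. 3. Use an LLM to assign importance scores to scenes under a prompt. 4. Propagate the scores to the short-segment level via the consistency and uniqueness metrics.

Result: On SumMe and TVSum the method surpasses all unsupervised methods and is on par with supervised ones. It is also competitive on the QFVS benchmark despite using no training data.

Insight: With carefully designed prompting and score propagation, pretrained multimodal models already provide strong general-purpose video summarization capability without additional training data.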

Abstract: The explosive growth of video data has intensified the need for flexible
user-controllable summarization tools that can operate without domain-specific
training data. Existing methods either rely on datasets, limiting
generalization, or cannot incorporate user intent expressed in natural
language. We introduce Prompts-to-Summaries: the first zero-shot,
text-queryable video summarizer that converts captions from off-the-shelf
video-language models (VidLMs) into user-guided skims via large language model
(LLM) judging, without using any training data, beating all
unsupervised methods and matching supervised ones. Our pipeline (i) segments raw
video footage into coherent scenes, (ii) generates rich scene-level
descriptions through a memory-efficient, batch-style VidLM prompting scheme
that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a
judge to assign scene-level importance scores under a carefully crafted prompt,
and finally, (iv) propagates those scores to the short-segment level via two new
metrics: consistency (temporal coherency) and uniqueness (novelty), yielding
fine-grained frame importance. On SumMe and TVSum, our data-free approach
surpasses all prior data-hungry unsupervised methods. It also performs
competitively on the Query-Focused Video Summarization (QFVS) benchmark,
despite using no training data and the competing methods requiring supervised
frame-level importance. To spur further research, we release VidSum-Reason, a
new query-driven dataset featuring long-tailed concepts and multi-step
reasoning; our framework attains robust F1 scores and serves as the first
challenging baseline. Overall, our results demonstrate that pretrained
multimodal models, when orchestrated with principled prompting and score
propagation, already provide a powerful foundation for universal,
text-queryable video summarization.

[99] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing

Hang Zhang,Xiang Chen,Renjiu Hu,Rongguang Wang,Jinwei Zhang,Min Liu,Yaonan Wang,Gaolei Li,Xinxing Cheng,Jinming Duan

Main category: cs.CV

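TL;DR: The paper proposes SmoothProper, an unsupervised deformable image registration method that uses structural nonparametric smoothing to handle sparse features and large displacements without label supervision, substantially reducing registration error.

Details Motivation: Existing unsupervised deformable image registration methods struggle with sparse features and large displacements. The SmoothProper module is proposed to address the smoothness and structural-consistency challenges in network predictions.

Contribution: Proposes SmoothProper, a model-agnostic, plug-and-play neural module that improves smoothness and structural consistency through a duality-based optimization layer and interaction terms, and removes the need for regularizer hyperparameter tuning.

Method: Combines a duality-based optimization layer with tailored interaction terms to enforce smoothness and message passing within the network's forward pass, improving the propagation of flow signals and structural consistency.

Result: On a retinal vessel dataset, SmoothProper reduces registration error to 1.88 pixels on 2912x2912 images, and is reported as the first unsupervised approach to effectively address both sparse features and large displacements.

Insight: Through structural nonparametric smoothing, SmoothProper shows potential for handling complex image features in unsupervised registration and offers a new direction for similar tasks.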

Abstract: Learning-based deformable image registration (DIR) accelerates alignment by
amortizing traditional optimization via neural networks. Label supervision
further enhances accuracy, enabling efficient and precise nonlinear alignment
of unseen scans. However, images with sparse features amid large smooth
regions, such as retinal vessels, introduce aperture and large-displacement
challenges that unsupervised DIR methods struggle to address. This limitation
occurs because neural networks predict deformation fields in a single forward
pass, leaving fields unconstrained post-training and shifting the
regularization burden entirely to network weights. To address these issues, we
introduce SmoothProper, a plug-and-play neural module enforcing smoothness and
promoting message passing within the network’s forward pass. By integrating a
duality-based optimization layer with tailored interaction terms, SmoothProper
efficiently propagates flow signals across spatial locations, enforces
smoothness, and preserves structural consistency. It is model-agnostic,
seamlessly integrates into existing registration frameworks with minimal
parameter overhead, and eliminates regularizer hyperparameter tuning.
Preliminary results on a retinal vessel dataset exhibiting aperture and
large-displacement challenges demonstrate our method reduces registration error
to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach
to effectively address both challenges. The source code will be available at
https://github.com/tinymilky/SmoothProper.

[100] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders

Hui Yang,Wei Sun,Jian Liu,Jin Zheng,Jian Xiao,Ajmal Mian

Main category: cs.CV

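TL;DR: The paper proposes HOMAE, an occlusion-aware hand-object pose estimation method based on masked autoencoders. It uses a target-focused masking strategy and multi-scale feature fusion, and combines an implicit SDF with an explicit point cloud, markedly improving pose estimation under occlusion.

Details Motivation: Existing methods lack global structural perception and reasoning when handling occlusion in hand-object interactions, which hurts pose estimation accuracy. This paper aims to strengthen the model's context awareness with masked autoencoders.

Contribution: 1. A target-focused masking strategy; 2. SDF prediction from multi-scale features to capture both global and fine-grained information; 3. Fusion of the implicit SDF with an explicit point cloud to improve geometric perception.

Method: Within a masked autoencoder framework, a target-focused masking strategy simulates occlusion, multi-scale features are used to predict an SDF, and the complementary fusion of the SDF and a point cloud improves handling of occluded regions.

Result: Achieves SOTA performance on the DexYCB and HO3Dv2 benchmarks.

Insight: Simulating occlusion with structured masking strengthens the model's contextual reasoning, while combining the SDF with a point cloud provides complementary global and local geometric information.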

Abstract: Hand-object pose estimation from monocular RGB images remains a significant
challenge mainly due to the severe occlusions inherent in hand-object
interactions. Existing methods do not sufficiently explore global structural
perception and reasoning, which limits their effectiveness in handling occluded
hand-object interactions. To address this challenge, we propose an
occlusion-aware hand-object pose estimation method based on masked
autoencoders, termed HOMAE. Specifically, we propose a target-focused
masking strategy that imposes structured occlusion on regions of hand-object
interaction, encouraging the model to learn context-aware features and reason
about the occluded structures. We further integrate multi-scale features
extracted from the decoder to predict a signed distance field (SDF), capturing
both global context and fine-grained geometry. To enhance geometric perception,
we combine the implicit SDF with an explicit point cloud derived from the SDF,
leveraging the complementary strengths of both representations. This fusion
enables more robust handling of occluded regions by combining the global
context from the SDF with the precise local geometry provided by the point
cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks
demonstrate that HOMAE achieves state-of-the-art performance in hand-object
pose estimation. We will release our code and model.

[101] VideoDeepResearch: Long Video Understanding With Agentic Tool Using

Huaying Yuan,Zheng Liu,Junjie Zhou,Ji-Rong Wen,Zhicheng Dou

Main category: cs.CV

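TL;DR: VideoDeepResearch proposes an agentic framework built on a text-only reasoning model and a multi-modal toolkit. It tackles long video understanding by selectively accessing video content and delivers substantial performance gains.

Details Motivation: Existing multi-modal large language models (MLLMs) struggle with long video understanding (LVU) because of context-window limits and task complexity. This paper challenges the conventional reliance on extended context windows and strong visual capabilities and adopts an agentic tool-use architecture instead.

Contribution: 1. Proposes VideoDeepResearch, an agentic framework built on a text-only reasoning model and a multi-modal toolkit; 2. Substantially outperforms existing MLLM baselines on multiple LVU benchmarks.

Method: Combines a text-only large reasoning model (LRM) with a modular multi-modal toolkit (such as multimodal retrievers and visual perceivers), solving each task through task-driven selection of video content and tool use.

Result: Surpasses the previous best results by 9.6%, 6.6%, and 3.9% on MLVU, LVBench, and LongVideoBench, respectively, supporting the effectiveness of agentic systems.

Insight: An agentic tool-use architecture with modular design and task-driven strategies can handle the complexity of long video understanding without relying on extended context windows or strong visual models.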

Abstract: Long video understanding (LVU) presents a significant challenge for current
multi-modal large language models (MLLMs) due to the task’s inherent complexity
and context window constraints. It is widely assumed that addressing LVU tasks
requires foundation MLLMs with extended context windows, strong visual
perception capabilities, and proficient domain expertise. In this work, we
challenge this common belief by introducing VideoDeepResearch, a novel agentic
framework for long video understanding. Our approach relies solely on a
text-only large reasoning model (LRM) combined with a modular multi-modal
toolkit, including multimodal retrievers and visual perceivers, all of which
are readily available in practice. For each LVU task, the system formulates a
problem-solving strategy through reasoning, while selectively accessing and
utilizing essential video content via tool use. We conduct extensive
experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench.
Our results demonstrate that VideoDeepResearch achieves substantial
improvements over existing MLLM baselines, surpassing the previous
state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and
LongVideoBench, respectively. These findings highlight the promise of agentic
systems in overcoming key challenges in LVU problems.

[102] Post-Training Quantization for Video Matting

Tianrui Zhu,Houyuan Chen,Ruihao Gong,Michele Magno,Haotong Qin,Kai Zhang

Main category: cs.CV

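TL;DR: The paper proposes PTQ4VM, a post-training quantization framework designed for video matting. With a two-stage quantization strategy, global affine calibration, and an optical-flow assistance component, it substantially reduces quantization error while preserving temporal coherence.

Details Motivation: Deploying computationally intensive video matting models on resource-constrained devices is challenging, and post-training quantization (PTQ) has not yet been studied systematically in this domain.

Contribution: 1) A two-stage PTQ strategy; 2) a statistically driven Global Affine Calibration (GAC); 3) an Optical Flow Assistance (OFA) component.

Method: A two-stage PTQ that combines block-reconstruction-based optimization with global calibration, uses GAC to compensate for statistical distortions, and uses OFA to strengthen the model's ability to distinguish foregrounds.

Result: At 4-bit quantization the model approaches full-precision performance with 8x FLOP savings, outperforming existing quantization methods.

Insight: Combining statistical correction with temporal information can substantially improve quantization of video matting models, offering an efficient solution for practical deployment.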

Abstract: Video matting is crucial for applications such as film production and virtual
reality, yet deploying its computationally intensive models on
resource-constrained devices presents challenges. Quantization is a key
technique for model compression and acceleration. As an efficient approach,
Post-Training Quantization (PTQ) is still in its nascent stages for video
matting, facing significant hurdles in maintaining accuracy and temporal
coherence. To address these challenges, this paper proposes a novel and general
PTQ framework specifically designed for video matting models, marking, to the
best of our knowledge, the first systematic attempt in this domain. Our
contributions include: (1) A two-stage PTQ strategy that combines
block-reconstruction-based optimization for fast, stable initial quantization
and local dependency capture, followed by a global calibration of quantization
parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine
Calibration (GAC) method that enables the network to compensate for cumulative
statistical distortions arising from factors such as neglected BN layer
effects, even reducing the error of existing PTQ methods on video matting tasks
by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages
temporal and semantic priors from frames to guide the PTQ process, enhancing
the model’s ability to distinguish moving foregrounds in complex scenes and
ultimately achieving near full-precision performance even under ultra-low-bit
quantization. Comprehensive quantitative and visual results show that our
PTQ4VM achieves state-of-the-art accuracy across different
bit-widths compared to the existing quantization methods. We highlight that the
4-bit PTQ4VM even achieves performance close to the full-precision counterpart
while enjoying 8x FLOP savings.
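
For readers unfamiliar with post-training quantization, the two ideas named in the abstract (low-bit quantization and an affine correction of output statistics) can be illustrated with a minimal NumPy sketch. This is not the PTQ4VM implementation: `fake_quantize` is a generic uniform min-max quantizer, and `affine_calibration` is a hypothetical stand-in that matches per-channel mean and standard deviation, assumed here only for illustration.

```python
import numpy as np

def fake_quantize(x, num_bits=4):
    """Uniform min-max quantization followed by dequantization (generic sketch)."""
    levels = 2 ** num_bits - 1                      # 15 steps for 4 bits
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def affine_calibration(fp_out, q_out, eps=1e-8):
    """Per-channel (gamma, beta) aligning q_out's statistics with fp_out's.

    Hypothetical illustration of a statistics-driven affine correction;
    tensors are assumed to be laid out as (batch, channel, height, width).
    """
    axes = tuple(i for i in range(fp_out.ndim) if i != 1)  # reduce all but channel
    mu_fp = fp_out.mean(axis=axes, keepdims=True)
    sd_fp = fp_out.std(axis=axes, keepdims=True)
    mu_q = q_out.mean(axis=axes, keepdims=True)
    sd_q = q_out.std(axis=axes, keepdims=True)
    gamma = sd_fp / (sd_q + eps)
    beta = mu_fp - gamma * mu_q
    return gamma, beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8, 8))        # stand-in for a full-precision activation
xq = fake_quantize(x, num_bits=4)        # 4-bit version of the same activation
gamma, beta = affine_calibration(x, xq)
corrected = gamma * xq + beta            # per-channel mean now matches x
```

The correction is applied after quantization, so it adds only two parameters per channel; whether GAC works this way in detail is not stated in the abstract.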

[103] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Jiashuo Yu,Yue Wu,Meng Chu,Zhifei Ren,Zizheng Huang,Pei Chu,Ruijie Zhang,Yinan He,Qirui Li,Songze Li,Zhenxiang Li,Zhongying Tu,Conghui He,Yu Qiao,Yali Wang,Yi Wang,Limin Wang

Main category: cs.CV

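TL;DR: VRBench is the first benchmark for multi-step reasoning over long narrative videos. It contains 1,010 long videos and 9,468 human-labeled multi-step question-answering pairs, and targets the neglect of temporal reasoning and procedural validity in existing evaluations.

Details Motivation: Existing evaluation methods are limited for multi-step reasoning over long videos; in particular, temporal reasoning and procedural validity are not adequately assessed. VRBench aims to fill this gap.

Contribution: 1. Creates VRBench, the first benchmark for multi-step reasoning over long videos; 2. Proposes a human-AI collaborative framework that generates coherent reasoning chains; 3. Designs a multi-phase evaluation pipeline that assesses both outcomes and the reasoning process along multiple dimensions.

Method: 1. Select high-quality videos through multi-stage filtering and expert rating; 2. Develop a human-AI collaborative framework to generate multi-step reasoning chains; 3. Evaluate models with multiple-choice questions and a progress-level LLM-guided scoring metric.

Result: Extensive evaluation of 12 LLMs and 16 VLMs shows that VRBench supports a thorough analysis of models' multi-step reasoning and yields useful insights for the field.

Insight: In multi-step reasoning tasks, temporal context and procedural validity are critical to model performance, and progress-level evaluation reflects reasoning quality more completely.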

Abstract: We present VRBench, the first long narrative video benchmark crafted for
evaluating large models’ multi-step reasoning capabilities, addressing
limitations in existing evaluations that overlook temporal reasoning and
procedural validity. It comprises 1,010 long videos (with an average duration
of 1.6 hours), along with 9,468 human-labeled multi-step question-answering
pairs and 30,292 reasoning steps with timestamps. These videos are curated via
a multi-stage filtering process including expert inter-rater reviewing to
prioritize plot coherence. We develop a human-AI collaborative framework that
generates coherent reasoning chains, each requiring multiple temporally
grounded steps, spanning seven types (e.g., event attribution, implicit
inference). VRBench designs a multi-phase evaluation pipeline that assesses
models at both the outcome and process levels. Apart from the MCQs for the
final results, we propose a progress-level LLM-guided scoring metric to
evaluate the quality of the reasoning chain from multiple dimensions
comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on
VRBench, we undertake a thorough analysis and provide valuable insights that
advance the field of multi-step reasoning.

[104] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation

Zhao Zhang,Yutao Cheng,Dexiang Hong,Maoke Yang,Gonglei Shi,Lei Ma,Hui Zhang,Jie Shao,Xinglong Wu

Main category: cs.CV

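TL;DR: CreatiPoster is a framework for generating editable, multi-layer graphic designs from natural-language instructions or user-supplied assets, producing professional-grade designs that remain editable. By combining a protocol model with a conditional background model, it surpasses existing open-source and commercial systems, and the authors release a copyright-free corpus of 100,000 multi-layer designs.

Details Motivation: Current AI tools for graphic design struggle to combine accurate integration of user assets, editability, and professional visual appeal, and commercial systems that depend on template libraries are inflexible. Solving these problems can help democratize AI-assisted graphic design.

Contribution: Proposes the CreatiPoster framework, in which a protocol model generates a JSON specification and a conditional background model synthesizes the design; builds a benchmark for graphic-design generation; and releases a copyright-free corpus of 100,000 multi-layer designs.

Method: A protocol model first generates a JSON specification describing the layout, content, and style of each layer; a conditional background model then synthesizes the background conditioned on the foreground layers.

Result: Experiments show that CreatiPoster surpasses existing open-source and commercial systems on graphic-design generation and supports diverse applications such as editing and multilingual adaptation.

Insight: By pairing structured generation with multimodal models, AI can generate professional, editable graphic designs more flexibly, giving users an efficient tool.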

Abstract: Graphic design plays a crucial role in both commercial and personal contexts,
yet creating high-quality, editable, and aesthetically pleasing graphic
compositions remains a time-consuming and skill-intensive task, especially for
beginners. Current AI tools automate parts of the workflow, but struggle to
accurately incorporate user-supplied assets, maintain editability, and achieve
professional visual appeal. Commercial systems, like Canva Magic Design, rely
on vast template libraries, which are impractical to replicate. In this paper,
we introduce CreatiPoster, a framework that generates editable, multi-layer
compositions from optional natural-language instructions or assets. A protocol
model, an RGBA large multimodal model, first produces a JSON specification
detailing every layer (text or asset) with precise layout, hierarchy, content
and style, plus a concise background prompt. A conditional background model
then synthesizes a coherent background conditioned on the rendered foreground
layers. We construct a benchmark with automated metrics for graphic-design
generation and show that CreatiPoster surpasses leading open-source approaches
and proprietary commercial systems. To catalyze further research, we release a
copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports
diverse applications such as canvas editing, text overlay, responsive resizing,
multilingual adaptation, and animated posters, advancing the democratization of
AI-assisted graphic design. Project homepage:
https://github.com/graphic-design-ai/creatiposter

[105] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement

Guimeng Liu,Milad Abdollahzadeh,Ngai-Man Cheung

Main category: cs.CV

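TL;DR: The paper proposes AIR, a zero-shot generative model adaptation method that uses iterative refinement to address the misalignment between text offsets and image offsets in the CLIP embedding space, improving image quality in the target domain.

Details Motivation: Existing zero-shot generative model adaptation methods assume that the text offset and the image offset are perfectly aligned, which degrades generated image quality. Inspired by offset-misalignment studies in NLP, this paper analyzes the misalignment between the two in the CLIP embedding space.

Contribution: 1. An empirical study of the misalignment between text and image offsets in the CLIP embedding space and its relation to concept distance; 2. The AIR method, which improves generation quality through iterative refinement.

Method: AIR progressively aligns text and image offsets through iterative refinement, using concept-distance information to reduce the effect of misalignment.

Result: Experiments show that AIR outperforms existing methods across 26 experimental setups, with markedly better generated image quality.

Insight: Offset misalignment correlates with concept distance: closer concepts exhibit less misalignment, which suggests a new direction for model optimization.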

Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained
generator to a target domain using only text guidance and without any samples
from the target domain. Central to recent ZSGM approaches are directional losses,
which use the text guidance by aligning the image offset with the text
offset in the embedding space of a vision-language model like CLIP. This is
similar to the analogical reasoning in NLP where the offset between one pair of
words is used to identify a missing element in another pair by aligning the
offset between these two pairs. However, a major limitation of existing ZSGM
methods is that the learning objective assumes complete alignment between
image offset and text offset in the CLIP embedding space, which degrades the
quality of generated images. Our work makes two main contributions. Inspired by
the offset misalignment studies in NLP, as our first contribution, we perform
an empirical study to analyze the misalignment between text offset and image
offset in CLIP embedding space for various large publicly available datasets.
Our important finding is that offset misalignment in CLIP embedding space is
correlated with concept distance, i.e., close concepts have less offset
misalignment. To address the limitations of the current approaches, as our
second contribution, we propose Adaptation with Iterative Refinement (AIR)
which is the first ZSGM approach to focus on improving target domain image
quality based on our new insight on offset misalignment. Qualitative and
quantitative evaluations and a user study across 26 experimental setups
consistently demonstrate that the proposed AIR approach achieves SOTA
performance. Additional experiments are in the supplementary material.
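
The directional loss described above, which aligns the image offset with the text offset in an embedding space, can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the loss takes the common form of one minus the cosine similarity between the two offsets, and it uses plain arrays in place of real CLIP embeddings. The function name `directional_loss` is hypothetical.

```python
import numpy as np

def directional_loss(img_src, img_tgt, txt_src, txt_tgt, eps=1e-8):
    """1 - cos(image offset, text offset), computed per sample.

    Inputs are embedding arrays of shape (batch, dim); in practice they
    would come from a vision-language model such as CLIP (assumed here).
    """
    d_img = img_tgt - img_src   # how the image moved in embedding space
    d_txt = txt_tgt - txt_src   # how the text prompt moved
    dot = np.sum(d_img * d_txt, axis=-1)
    norms = np.linalg.norm(d_img, axis=-1) * np.linalg.norm(d_txt, axis=-1)
    return 1.0 - dot / (norms + eps)

# Toy embeddings: the first sample's offsets agree, the second's are opposed.
img_src = np.zeros((2, 4))
txt_src = np.zeros((2, 4))
txt_tgt = np.array([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
img_tgt = np.array([[2.0, 0.0, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0]])
loss = directional_loss(img_src, img_tgt, txt_src, txt_tgt)
```

A loss of this form is minimized only when the two offsets are parallel, which is the complete-alignment assumption the abstract identifies as a limitation.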

[106] M4V: Multi-Modal Mamba for Text-to-Video Generation

Jiancheng Huang,Gengwei Zhang,Zequn Jie,Siyu Jiao,Yinlong Qian,Ling Chen,Yunchao Wei,Lin Ma

Main category: cs.CV

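TL;DR: The paper proposes M4V, a framework that combines a multi-modal Mamba architecture with diffusion models to address the computational inefficiency of Transformers in text-to-video generation. It substantially lowers computational cost and uses a reward learning strategy to improve video quality.

Details Motivation: Text-to-video generation is currently expensive because the quadratic complexity of Transformers makes processing spatiotemporal sequences costly, which limits practical use. A more efficient sequence modeling approach that also supports multi-modal fusion is therefore needed.

Contribution: 1. Proposes M4V, a multi-modal Mamba framework, and introduces the MM-DiM block for seamless integration of multi-modal information and spatiotemporal modeling. 2. Designs a reward learning strategy to improve visual quality in long-context autoregressive generation. 3. Shows experimentally that M4V produces high-quality videos at substantially lower computational cost (45% fewer FLOPs).

Method: 1. The MM-DiM block combines multi-modal information with spatiotemporal modeling through a multi-modal token re-composition design. 2. A reward learning strategy improves per-frame visual realism.

Result: On text-to-video benchmarks, M4V generates high-quality videos while substantially lowering computational cost (45% fewer FLOPs at 768×1280 resolution).

Insight: The Mamba architecture shows promise for video generation, and combining multi-modal token re-composition with reward learning can improve both generation quality and efficiency.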

Abstract: Text-to-video generation has significantly enriched content creation and
holds the potential to evolve into powerful world simulators. However, modeling
the vast spatiotemporal space remains computationally demanding, particularly
when employing Transformers, which incur quadratic complexity in sequence
processing and thus limit practical applications. Recent advancements in
linear-time sequence modeling, particularly the Mamba architecture, offer a
more efficient alternative. Nevertheless, its plain design limits its direct
applicability to multi-modal and spatiotemporal video generation tasks. To
address these challenges, we introduce M4V, a Multi-Modal Mamba framework for
text-to-video generation. Specifically, we propose a multi-modal diffusion
Mamba (MM-DiM) block that enables seamless integration of multi-modal
information and spatiotemporal modeling through a multi-modal token
re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45%
compared to the attention-based alternative when generating videos at
768×1280 resolution. Additionally, to mitigate the visual quality
degradation in long-context autoregressive generation processes, we introduce a
reward learning strategy that further enhances per-frame visual realism.
Extensive experiments on text-to-video benchmarks demonstrate M4V’s ability to
produce high-quality videos while significantly lowering computational costs.
Code and models will be publicly available at
https://huangjch526.github.io/M4V_project.

[107] VINCIE: Unlocking In-context Image Editing from Video

Leigang Qu,Feng Cheng,Ziyan Yang,Qi Zhao,Shanchuan Lin,Yichun Shi,Yicong Li,Wenjie Wang,Tat-Seng Chua,Lu Jiang

Main category: cs.CV

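TL;DR: VINCIE proposes a video-based approach to in-context image editing. With a block-causal diffusion transformer and multi-task learning, it learns directly from video data without relying on task-specific pipelines or expert models.

Details Motivation: Current in-context image editing methods depend on task-specific pipelines and expert models (such as segmentation and inpainting) to curate training data, which limits their generality and scalability. The study explores whether a more general image editing model can be learned directly from video data.

Contribution: 1. A scalable approach for annotating videos as interleaved multimodal sequences; 2. A block-causal diffusion transformer trained with multi-task learning; 3. A novel multi-turn image editing benchmark.

Method: A block-causal diffusion transformer is trained on three tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction.

Result: The model shows strong in-context image editing capabilities and achieves SOTA results on multi-turn image editing benchmarks. It also shows promise in multi-concept composition, story generation, and chain-of-editing tasks.

Insight: Learning directly from video data avoids the dependence on task-specific pipelines, and the resulting model generalizes to tasks it was not trained on.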

Abstract: In-context image editing aims to modify images based on a contextual sequence
comprising text and previously generated images. Existing methods typically
depend on task-specific pipelines and expert models (e.g., segmentation and
inpainting) to curate training data. In this work, we explore whether an
in-context image editing model can be learned directly from videos. We
introduce a scalable approach to annotate videos as interleaved multimodal
sequences. To effectively learn from this data, we design a block-causal
diffusion transformer trained on three proxy tasks: next-image prediction,
current segmentation prediction, and next-segmentation prediction.
Additionally, we propose a novel multi-turn image editing benchmark to advance
research in this area. Extensive experiments demonstrate that our model
exhibits strong in-context image editing capabilities and achieves
state-of-the-art results on two multi-turn image editing benchmarks. Despite
being trained exclusively on videos, our model also shows promising abilities
in multi-concept composition, story generation, and chain-of-editing
applications.
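
The term "block-causal" is not defined in the abstract. The sketch below shows the usual meaning, stated here as an assumption: tokens attend to every token in their own block (for example, one image or one text segment) and in all earlier blocks, but never to later blocks. The function name `block_causal_mask` and the example block sizes are illustrative only.

```python
import numpy as np

def block_causal_mask(block_sizes):
    """Boolean mask M where M[i, j] is True if token i may attend to token j.

    Attention is bidirectional inside a block and causal across blocks.
    """
    # Assign each token the index of the block it belongs to.
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    # Token i sees token j when j's block is not later than i's block.
    return block_ids[:, None] >= block_ids[None, :]

# Two blocks: a 2-token text segment followed by a 3-token image.
mask = block_causal_mask([2, 3])
```

Compared with a token-level causal mask, this keeps full attention within each image, which matters because image tokens have no natural left-to-right order.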

[108] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

Yuxuan Luo,Yuhui Yuan,Junwen Chen,Haonan Cai,Ziyi Yue,Yuwei Yang,Fatima Zohra Daha,Ji Li,Zhouhui Lian

Main category: cs.CV

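TL;DR: The paper introduces a new task, knowledge image generation, and releases the MMMG benchmark to evaluate the multimodal reasoning of image generation models. Using an expert-validated dataset and a unified knowledge graph representation, it exposes reasoning deficits in current models and provides an open baseline model.

Details Motivation: Knowledge images play an important role in human civilization and in the mechanisms of learning, yet existing image generation models lack the multimodal reasoning needed to generate them. The authors therefore propose the MMMG benchmark to drive progress in knowledge image generation.

Contribution: 1. Introduces the knowledge image generation task and the MMMG benchmark; 2. Provides 4,456 expert-validated pairs covering 10 disciplines and 6 educational levels; 3. Introduces a unified knowledge graph representation and the MMMG-Score metric; 4. Releases the FLUX-Reason baseline model.

Method: 1. Use a Knowledge Graph (KG) to explicitly annotate an image's core entities and their dependencies; 2. Design MMMG-Score by combining graph-edit distance with visual clarity; 3. Build the FLUX-Reason baseline from a reasoning LLM and diffusion models.

Result: An evaluation of 16 SOTA text-to-image generation models reveals serious deficits in entity fidelity, relation strength, and image clarity. GPT-4o achieves an MMMG-Score of only 50.20, and the baseline FLUX-Reason scores 34.45.

Insight: 1. Knowledge image generation requires stronger multimodal reasoning; 2. A unified knowledge graph representation simplifies evaluation; 3. Current models still have substantial room for improvement in generating explanatory images.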

Abstract: In this paper, we introduce knowledge image generation as a new task,
alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation
Benchmark (MMMG) to probe the reasoning capability of image generation models.
Knowledge images have been central to human civilization and to the mechanisms
of human learning, a fact underscored by dual-coding theory and the
picture-superiority effect. Generating such images is challenging, demanding
multimodal reasoning that fuses world knowledge with pixel-level grounding into
clear explanatory visuals. To enable comprehensive evaluation, MMMG offers
4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines,
6 educational levels, and diverse knowledge formats such as charts, diagrams,
and mind maps. To eliminate confounding complexity during evaluation, we adopt
a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a
target image’s core entities and their dependencies. We further introduce
MMMG-Score to evaluate generated knowledge images. This metric combines factual
fidelity, measured by graph-edit distance between KGs, with visual clarity
assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image
generation models expose serious reasoning deficits (low entity fidelity, weak
relations, and clutter), with GPT-4o achieving an MMMG-Score of only 50.20,
underscoring the benchmark’s difficulty. To spur further progress, we release
FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines
a reasoning LLM with diffusion models and is trained on 16,000 curated
knowledge image-prompt pairs.

[109] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

Qizhe Zhang,Mengzhen Liu,Lichen Li,Ming Lu,Yuan Zhang,Junwen Pan,Qi She,Shanghang Zhang

Main category: cs.CV

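TL;DR: The paper proposes CDPruner, a new visual token pruning method that improves the inference efficiency of multimodal large language models (MLLMs) by maximizing the conditional diversity of retained tokens.

Details Motivation: Visual tokens far outnumber text tokens, which makes MLLM inference expensive. Existing attention-based pruning retains many duplicate tokens, and similarity-based pruning overlooks relevance to the instruction.

Contribution: Proposes CDPruner, which prunes tokens by maximizing conditional diversity, improving efficiency while preserving performance.

Method: Defines a similarity between visual tokens conditioned on the instruction and reformulates token pruning with a determinantal point process (DPP); the method is training-free and model-agnostic.

Result: Experiments show that CDPruner performs strongly across a range of MLLMs, substantially reducing FLOPs and CUDA latency while retaining 94% of the original accuracy.

Insight: Maximizing conditional diversity balances faithful image representation with instruction following, enabling pruning that is both efficient and accurate.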

Abstract: In multimodal large language models (MLLMs), the length of input visual
tokens is often significantly greater than that of their textual counterparts,
leading to a high inference cost. Many works aim to address this issue by
removing redundant visual tokens. However, current approaches either rely on
attention-based pruning, which retains numerous duplicate tokens, or use
similarity-based pruning, overlooking the instruction relevance, consequently
causing suboptimal performance. In this paper, we go beyond attention or
similarity by proposing a novel visual token pruning method named CDPruner,
which maximizes the conditional diversity of retained tokens. We first define
the conditional similarity between visual tokens conditioned on the
instruction, and then reformulate the token pruning problem with determinantal
point process (DPP) to maximize the conditional diversity of the selected
subset. The proposed CDPruner is training-free and model-agnostic, allowing
easy application to various MLLMs. Extensive experiments across diverse MLLMs
show that CDPruner establishes a new state of the art on various vision-language
benchmarks. By maximizing conditional diversity through DPP, the selected
subset better represents the input images while closely adhering to user
instructions, thereby preserving strong performance even with high reduction
ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency
by 78%, while maintaining 94% of the original accuracy. Our code is available
at https://github.com/Theia-4869/CDPruner.
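
To make the DPP formulation concrete, the sketch below selects a diverse, relevant subset of tokens by greedily maximizing the log-determinant of a kernel submatrix. It is an illustration under stated assumptions, not the CDPruner implementation: the kernel uses the standard quality-diversity form (relevance × cosine similarity × relevance), the relevance scores are assumed to be given, and the naive greedy search is chosen for readability rather than speed. The name `greedy_dpp_select` is hypothetical.

```python
import numpy as np

def greedy_dpp_select(features, relevance, k):
    """Greedily pick up to k token indices that maximize log det of the kernel.

    features:  (n, d) visual token features
    relevance: (n,) positive scores, assumed to measure relevance to the instruction
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                         # cosine similarity
    kernel = relevance[:, None] * sim * relevance[None, :]
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(len(features)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:          # every remaining token is redundant
            break
        selected.append(best)
    return selected

# Tokens 0 and 1 are duplicates; token 2 is different but less relevant.
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
relevance = np.array([1.0, 0.9, 0.5])
chosen = greedy_dpp_select(features, relevance, k=2)
```

In this toy case a pure relevance ranking would keep both duplicates, whereas the determinant collapses toward zero for near-identical tokens, so the second pick goes to the distinct token.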

[110] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos

Weiliang Chen,Wenzhao Zheng,Yu Zheng,Lei Chen,Jie Zhou,Jiwen Lu,Yueqi Duan

Main category: cs.CV

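TL;DR: GenWorld introduces a large-scale, high-quality real-world simulation dataset for detecting AI-generated videos, together with the SpannDetector model, which uses multi-view consistency to improve detection.

Details Motivation: With the rapid progress of video generation, the credibility of AI-generated videos is an increasingly pressing issue, and existing detection methods are limited by the lack of high-quality datasets.

Contribution: 1. Builds the GenWorld dataset, focused on real-world simulation videos; 2. Proposes the SpannDetector model, which uses multi-view consistency to detect AI-generated videos.

Method: Uses multiple state-of-the-art video generation models to produce high-quality forged videos and proposes SpannDetector, a method based on multi-view consistency.

Result: Experiments show that SpannDetector performs strongly on high-quality videos, supporting the effectiveness of the method.

Insight: Ignoring real-world clues is a weakness of existing methods; physical plausibility and multi-view consistency are key to improving AI-generated video detection.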

Abstract: The flourishing of video generation technologies has endangered the
credibility of real-world information and intensified the demand for
AI-generated video detectors. Despite some progress, the lack of high-quality
real-world datasets hinders the development of trustworthy detectors. In this
paper, we propose GenWorld, a large-scale, high-quality, and real-world
simulation dataset for AI-generated video detection. GenWorld features the
following characteristics: (1) Real-world Simulation: GenWorld focuses on
videos that replicate real-world scenarios, which have a significant impact due
to their realism and potential influence; (2) High Quality: GenWorld employs
multiple state-of-the-art video generation models to provide realistic and
high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes
videos generated from diverse generators and various prompt modalities (e.g.,
text, image, video), offering the potential to learn more generalizable
forensic features. We analyze existing methods and find they fail to detect
high-quality videos generated by world models (i.e., Cosmos), revealing
potential drawbacks of ignoring real-world clues. To address this, we propose a
simple yet effective model, SpannDetector, to leverage multi-view consistency
as a strong criterion for real-world AI-generated video detection. Experiments
show that our method achieves superior results, highlighting a promising
direction for explainable AI-generated video detection based on physical
plausibility. We believe that GenWorld will advance the field of AI-generated
video detection. Project Page: https://chen-wl20.github.io/GenWorld

[111] Fine-Grained Perturbation Guidance via Attention Head Selection

Donghoon Ahn,Jiwon Kang,Sanghyun Lee,Minjae Kim,Jaewon Min,Wooseok Jang,Saungwu Lee,Sayak Paul,Susung Hong,Seungryong Kim

Main category: cs.CV

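TL;DR: This paper studies fine-grained attention perturbation in diffusion models. It proposes HeadHunter, a framework that selects attention heads to give fine-grained control over generation quality and visual attributes, and introduces SoftPAG to tune perturbation strength.

Details Motivation: Existing attention perturbation methods lack principled guidance on where to apply perturbations, especially in Diffusion Transformer (DiT) architectures, where quality-relevant computations are distributed across layers.

Contribution: 1. Presents the first head-level analysis of attention perturbation in diffusion models; 2. Designs the HeadHunter framework for iteratively selecting attention heads aligned with user objectives; 3. Proposes SoftPAG, which tunes perturbation strength by linear interpolation and reduces artifacts.

Method: 1. Fine-grained perturbation analysis from the layer level down to individual attention heads; 2. A head selection framework driven by user objectives (HeadHunter); 3. Perturbation strength control via SoftPAG, which interpolates the attention map toward an identity matrix.

Result: The method is validated on large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, with superior results in both quality enhancement and style control.

Insight: Specific attention heads govern distinct visual concepts (such as structure, style, and texture quality), and composing selected heads enables targeted style control.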

Abstract: Recent guidance methods in diffusion models steer reverse sampling by
perturbing the model to construct an implicit weak model and guide generation
away from it. Among these approaches, attention perturbation has demonstrated
strong empirical performance in unconditional scenarios where classifier-free
guidance is not applicable. However, existing attention perturbation methods
lack principled approaches for determining where perturbations should be
applied, particularly in Diffusion Transformer (DiT) architectures where
quality-relevant computations are distributed across layers. In this paper, we
investigate the granularity of attention perturbations, ranging from the layer
level down to individual attention heads, and discover that specific heads
govern distinct visual concepts such as structure, style, and texture quality.
Building on this insight, we propose “HeadHunter”, a systematic framework for
iteratively selecting attention heads that align with user-centric objectives,
enabling fine-grained control over generation quality and visual attributes. In
addition, we introduce SoftPAG, which linearly interpolates each selected
head’s attention map toward an identity matrix, providing a continuous knob to
tune perturbation strength and suppress artifacts. Our approach not only
mitigates the oversmoothing issues of existing layer-level perturbation but
also enables targeted manipulation of specific visual styles through
compositional head selection. We validate our method on modern large-scale
DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1,
demonstrating superior performance in both general quality enhancement and
style-specific guidance. Our work provides the first head-level analysis of
attention perturbation in diffusion models, uncovering interpretable
specialization within attention layers and enabling practical design of
effective perturbation strategies.
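
The SoftPAG interpolation described above can be sketched in a few lines. This is a minimal illustration, assuming a square, row-stochastic attention map for a single head; the function name `soft_pag` and the parameter name `strength` are hypothetical and not taken from the paper.

```python
def soft_pag(attn, strength):
    """Linearly interpolate a square attention map toward the identity matrix.

    strength = 0 leaves the head untouched; strength = 1 replaces the map
    with the identity (each token attends only to itself).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    n = len(attn)
    return [
        [(1.0 - strength) * attn[i][j] + strength * (1.0 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


# Toy 3-token attention map for one selected head (each row sums to 1).
attn = [
    [0.70, 0.20, 0.10],
    [0.30, 0.40, 0.30],
    [0.25, 0.25, 0.50],
]
half = soft_pag(attn, 0.5)
```

Because both the attention map and the identity are row-stochastic, every interpolated row still sums to 1, which is what makes the strength a continuous knob rather than an on/off switch.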

[112] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model

Junqi You,Chieh Hubert Lin,Weijie Lyu,Zhengbo Zhang,Ming-Hsuan Yang

Main category: cs.CV

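TL;DR: InstaInpaint is a fast 3D scene inpainting framework that completes inpainting within 0.4 seconds, a 1000x speed-up over prior methods, while maintaining strong performance.

Details Motivation: Existing 3D scene inpainting methods rely on time-consuming optimization, which makes them impractical for real-time or online applications.

Contribution: 1. Proposes the InstaInpaint framework for fast 3D scene inpainting; 2. Designs a self-supervised masked-finetuning strategy for training a custom large reconstruction model (LRM); 3. Achieves state-of-the-art performance on standard benchmarks and generalizes well to downstream tasks.

Method: Uses a reference-based feed-forward framework that combines a 2D inpainting proposal with a self-supervised masked-finetuning strategy to train the LRM for efficient inpainting.

Result: On standard benchmarks, InstaInpaint is 1000x faster than prior methods with state-of-the-art performance, and it generalizes to downstream tasks such as object insertion and multi-region inpainting.

Insight: The key designs, including self-supervised finetuning and the LRM, improve generalization, textural consistency, and geometric correctness.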

Abstract: Recent advances in 3D scene reconstruction enable real-time viewing in
virtual and augmented reality. To support interactive operations for better
immersiveness, such as moving or editing objects, 3D scene inpainting methods
are proposed to repair or complete the altered geometry. However, current
approaches rely on lengthy and computationally intensive optimization, making
them impractical for real-time or online applications. We propose InstaInpaint,
a reference-based feed-forward framework that produces 3D-scene inpainting from
a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised
masked-finetuning strategy to enable training of our custom large
reconstruction model (LRM) on the large-scale dataset. Through extensive
experiments, we analyze and identify several key designs that improve
generalization, textural consistency, and geometric correctness. InstaInpaint
achieves a 1000x speed-up over prior methods while maintaining
state-of-the-art performance across two standard benchmarks. Moreover, we show
that InstaInpaint generalizes well to flexible downstream applications such as
object insertion and multi-region inpainting. More video results are available
at our project page: https://dhmbb2.github.io/InstaInpaint_page/.

cs.LG [Back]

[113] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs

Shangpin Peng,Weinong Wang,Zhuotao Tian,Senqiao Yang,Xing Wu,Haotian Xu,Chengquan Zhang,Takashi Isobe,Baotian Hu,Min Zhang

Main category: cs.LG

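TL;DR: Omni-DPO is a dual-perspective optimization framework that adaptively weights samples using both data quality and the model’s learning dynamics, improving preference learning in reinforcement learning from human feedback (RLHF).

Details Motivation: Existing DPO methods treat all preference pairs uniformly, ignoring differences in their inherent quality and learning utility, which leads to suboptimal data utilization and performance.

Contribution: Proposes Omni-DPO, which jointly optimizes preference learning from two perspectives, data quality and model dynamics, improving training data utilization and model performance.

Method: Adaptively weights samples according to both data quality and the model’s learning dynamics during training, making preference optimization more effective.

Result: On textual understanding and mathematical reasoning tasks, Omni-DPO outperforms the baseline methods; Gemma-2-9b-it finetuned with Omni-DPO beats Claude 3 Opus by 6.7 points on the Arena-Hard benchmark.

Insight: Jointly accounting for the inherent quality of the data and the model’s learning dynamics is key to more effective RLHF.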

Abstract: Direct Preference Optimization (DPO) has become a cornerstone of
reinforcement learning from human feedback (RLHF) due to its simplicity and
efficiency. However, existing DPO-based approaches typically treat all
preference pairs uniformly, ignoring critical variations in their inherent
quality and learning utility, leading to suboptimal data utilization and
performance. To address this challenge, we propose Omni-DPO, a dual-perspective
optimization framework that jointly accounts for (1) the inherent quality of
each preference pair and (2) the model’s evolving performance on those pairs.
By adaptively weighting samples according to both data quality and the model’s
learning dynamics during training, Omni-DPO enables more effective training
data utilization and achieves better performance. Experimental results on
various models and benchmarks demonstrate the superiority and generalization
capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it
finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant
margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning
tasks, Omni-DPO consistently outperforms the baseline methods across all
benchmarks, providing strong empirical evidence for the effectiveness and
robustness of our approach. Code and models will be available at
https://github.com/pspdada/Omni-DPO.
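
To make the idea of dual-perspective weighting concrete, here is a minimal sketch of a per-pair weighted DPO loss. The standard DPO term is well established; the two weights are assumptions for illustration only: `quality` is a hypothetical externally supplied score per pair, and `dynamics_weight` is a hypothetical focal-style factor that down-weights pairs the model already separates well. The abstract does not give Omni-DPO's actual weighting functions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Scaled implicit-reward margin between chosen (w) and rejected (l)."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def dynamics_weight(margin, gamma=2.0):
    """Hypothetical weight: small when the pair is already well fit."""
    return (1.0 - sigmoid(margin)) ** gamma


def weighted_dpo_loss(pairs, beta=0.1, gamma=2.0):
    """Mean of quality * dynamics_weight * standard DPO loss over pairs."""
    total = 0.0
    for p in pairs:
        margin = dpo_margin(p["logp_w"], p["logp_l"],
                            p["ref_logp_w"], p["ref_logp_l"], beta)
        loss = -math.log(sigmoid(margin))  # standard DPO term
        total += p["quality"] * dynamics_weight(margin, gamma) * loss
    return total / len(pairs)


# One pair on which the policy equals the reference (margin is 0).
pairs = [{"logp_w": -5.0, "logp_l": -5.0,
          "ref_logp_w": -5.0, "ref_logp_l": -5.0, "quality": 1.0}]
loss = weighted_dpo_loss(pairs)
```

With uniform weights of 1 this reduces to ordinary DPO, which is the "treat all preference pairs uniformly" baseline the abstract argues against.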

[114] Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

Jikai Jin,Vasilis Syrgkanis,Sham Kakade,Hanlin Zhang

Main category: cs.LG

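TL;DR: The paper proposes a causal representation learning framework that models benchmark performance as a linear function of latent capability factors and uncovers a hierarchical causal structure among language model capabilities.

Details Motivation: Evaluating language model capabilities faces methodological challenges such as confounding effects and high computational cost, which calls for more rigorous causal analysis.

Contribution: Develops a causal representation learning framework, identifies a hierarchical causal structure among language model capabilities, and highlights the need to control for base model variation.

Method: Models benchmark performance as a linear transformation of latent capability factors and identifies the causal structure while controlling for the base model as a common confounder.

Result: On data covering over 1500 models and six benchmarks, the analysis finds a three-node linear causal structure that reveals a hierarchical direction of capability development.

Insight: Capabilities develop along a causal direction from general problem solving, through instruction following, to mathematical reasoning; base model variation is a key factor affecting evaluation.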

Abstract: Faithful evaluation of language model capabilities is crucial for deriving
actionable insights that can inform model development. However, rigorous causal
evaluations in this domain face significant methodological challenges,
including complex confounding effects and prohibitive computational costs
associated with extensive retraining. To tackle these challenges, we propose a
causal representation learning framework wherein observed benchmark performance
is modeled as a linear transformation of a few latent capability factors.
Crucially, these latent factors are identified as causally interrelated after
appropriately controlling for the base model as a common confounder. Applying
this approach to a comprehensive dataset encompassing over 1500 models
evaluated across six benchmarks from the Open LLM Leaderboard, we identify a
concise three-node linear causal structure that reliably explains the observed
performance variations. Further interpretation of this causal structure
provides substantial scientific insights beyond simple numerical rankings:
specifically, we reveal a clear causal direction starting from general
problem-solving capabilities, advancing through instruction-following
proficiency, and culminating in mathematical reasoning ability. Our results
underscore the essential role of carefully controlling base model variations
during evaluation, a step critical to accurately uncovering the underlying
causal relationships among latent model capabilities.

[115] Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

Ching Chang,Jeehyun Hwang,Yidan Shi,Haixin Wang,Wen-Chih Peng,Tien-Fu Chen,Wei Wang

Main category: cs.LG

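TL;DR: The paper introduces the Time-IMM dataset and the IMM-TSF benchmark library to address irregular, multimodal real-world time series, and shows experimentally that multimodal modeling improves forecasting performance.

Details Motivation: Real-world time series are often irregular, multimodal, and messy, but existing benchmarks typically assume clean, regularly sampled, unimodal data, leaving a gap between research and real-world deployment.

Contribution: Proposes the Time-IMM dataset, which covers nine types of irregularity in multimodal time series; develops the IMM-TSF benchmark library, which supports asynchronous integration and realistic evaluation.

Method: Categorizes irregularities into trigger-based, constraint-based, and artifact-based mechanisms; introduces a timestamp-to-text fusion module and a multimodality fusion module that support recency-aware averaging and attention-based integration strategies.

Result: Experiments show that explicitly modeling multimodality on irregular time series yields substantial gains in forecasting performance.

Insight: Explicitly modeling irregularity and multimodality is crucial for time series analysis, and the benchmark offers evaluation closer to real-world conditions.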

Abstract: Time series data in real-world applications such as healthcare, climate
modeling, and finance are often irregular, multimodal, and messy, with varying
sampling rates, asynchronous modalities, and pervasive missingness. However,
existing benchmarks typically assume clean, regularly sampled, unimodal data,
creating a significant gap between research and real-world deployment. We
introduce Time-IMM, a dataset specifically designed to capture cause-driven
irregularity in multimodal multivariate time series. Time-IMM represents nine
distinct types of time series irregularity, categorized into trigger-based,
constraint-based, and artifact-based mechanisms. Complementing the dataset, we
introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal
time series, enabling asynchronous integration and realistic evaluation.
IMM-TSF includes specialized fusion modules, including a timestamp-to-text
fusion module and a multimodality fusion module, which support both
recency-aware averaging and attention-based integration strategies. Empirical
results demonstrate that explicitly modeling multimodality on irregular time
series data leads to substantial gains in forecasting performance. Time-IMM and
IMM-TSF provide a foundation for advancing time series analysis under
real-world conditions. The dataset is publicly available at
https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the
benchmark library can be accessed at
https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
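
One plausible reading of "recency-aware averaging" is a weighted mean of past text embeddings in which more recent observations count more. The sketch below assumes exponential decay with a time constant `tau`; the decay form, the function name, and the parameters are assumptions for illustration, not the IMM-TSF API.

```python
import math


def recency_weighted_average(observations, query_time, tau=1.0):
    """Average vectors observed at or before query_time, favoring recent ones.

    observations: list of (timestamp, vector) pairs, possibly irregular.
    Returns None when nothing has been observed yet.
    """
    past = [(t, v) for t, v in observations if t <= query_time]
    if not past:
        return None
    # Assumed weighting: exp(-age / tau), so older items fade smoothly.
    weights = [math.exp(-(query_time - t) / tau) for t, _ in past]
    total = sum(weights)
    dim = len(past[0][1])
    return [
        sum(w * v[k] for w, (_, v) in zip(weights, past)) / total
        for k in range(dim)
    ]


# Irregularly timed text embeddings; the last one lies in the future.
notes = [(0.0, [1.0, 0.0]), (2.0, [0.0, 1.0]), (5.0, [9.0, 9.0])]
fused = recency_weighted_average(notes, query_time=2.0, tau=1.0)
```

Filtering on `t <= query_time` keeps the fusion causal: text that arrives after the forecast time never leaks into the representation.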

[116] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering

Sai Prasanna Teja Reddy Bogireddy,Abrar Majeedi,Viswanatha Reddy Gajjala,Zhuoyan Xu,Siddhant Rai,Vaishnav Potlapalli

Main category: cs.LG

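TL;DR: The paper presents an agentic prompt optimization approach to evidence-grounded clinical question answering. It splits the task into two stages (evidence identification and answer generation), tunes each with a prompt optimizer, and finishes second in the ArchEHR-QA shared task.

Details Motivation: Automated question answering over electronic health records (EHRs) can provide critical information to clinicians and patients, but it requires precise evidence retrieval and faithful answer generation under limited supervision.

Contribution: Decouples the task into evidence identification and answer generation, and automatically optimizes prompts with DSPy’s MIPROv2 optimizer; proposes a self-consistency voting scheme that improves evidence recall.

Method: 1. Sentence-level evidence identification; 2. Answer generation with explicit citations. The MIPROv2 optimizer jointly tunes instructions and few-shot demonstrations, and a self-consistency voting scheme is added.

Result: Scores 51.5 on the hidden test set, placing second and outperforming zero-shot and few-shot prompting by over 20 and 10 points, respectively.

Insight: Data-driven prompt optimization is a cost-effective alternative to model fine-tuning and can improve reliability for high-stakes clinical QA.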

Abstract: Automated question answering (QA) over electronic health records (EHRs) can
bridge critical information gaps for clinicians and patients, yet it demands
both precise evidence retrieval and faithful answer generation under limited
supervision. In this work, we present Neural, the runner-up in the BioNLP 2025
ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method
decouples the task into (1) sentence-level evidence identification and (2)
answer synthesis with explicit citations. For each stage, we automatically
explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning
instructions and few-shot demonstrations on the development set. A
self-consistency voting scheme further improves evidence recall without
sacrificing precision. On the hidden test set, our method attains an overall
score of 51.5, placing second while outperforming standard zero-shot and
few-shot prompting by over 20 and 10 points, respectively. These results
indicate that data-driven prompt optimization is a cost-effective alternative
to model fine-tuning for high-stakes clinical QA, advancing the reliability of
AI assistants in healthcare.
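
A self-consistency voting scheme for evidence selection can be sketched as follows: sample the evidence identifier several times and keep the sentences chosen in a sufficient fraction of runs. The threshold `min_fraction` and the function name are assumptions for illustration; the abstract does not specify the voting rule the authors use.

```python
from collections import Counter


def vote_evidence(runs, min_fraction=0.5):
    """Keep sentence ids selected in at least min_fraction of the runs.

    runs: list of lists of sentence ids, one list per sampled model run.
    """
    # set(run) ensures a sentence is counted once per run.
    counts = Counter(sid for run in runs for sid in set(run))
    needed = min_fraction * len(runs)
    return sorted(sid for sid, c in counts.items() if c >= needed)


# Four sampled runs of a (hypothetical) evidence identifier.
runs = [[1, 3, 4], [1, 4], [1, 2, 4], [4, 5]]
selected = vote_evidence(runs)
```

Lowering `min_fraction` trades precision for recall: a sentence found in only one run survives a low threshold but is filtered out by a majority rule.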

[117] Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Adam Karvonen,Samuel Marks

Main category: cs.LG

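TL;DR: The paper proposes internal bias mitigation to reduce LLM bias in realistic settings: it identifies and neutralizes sensitive attribute directions in model activations, achieving robust bias reduction.

Details Motivation: LLMs are increasingly deployed in high-stakes hiring applications, but simple anti-bias prompts fail in realistic settings, so more robust mitigation is needed.

Contribution: Proposes internal bias mitigation, which identifies sensitive attribute directions and applies affine concept editing, reducing bias to low levels (typically under 1%) while largely maintaining model performance.

Method: Identifies race- and gender-correlated directions from a synthetic dataset and applies affine concept editing at inference time.

Result: Realistic context induces significant racial and gender biases across leading commercial and open-source models; the intervention reduces bias to typically under 1% (always below 2.5%) while largely maintaining model performance.

Insight: LLM bias is harder to eliminate in realistic settings and calls for internal interventions rather than simple prompts, along with more realistic evaluation methods.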

Abstract: Large language models (LLMs) are increasingly deployed in high-stakes hiring
applications, making decisions that directly impact people’s careers and
livelihoods. While prior studies suggest simple anti-bias prompts can eliminate
demographic biases in controlled evaluations, we find these mitigations fail
when realistic contextual details are introduced. We address these failures
through internal bias mitigation: by identifying and neutralizing sensitive
attribute directions within model activations, we achieve robust bias reduction
across all tested scenarios. Across leading commercial (GPT-4o, Claude 4
Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3,
Mistral-24B), we find that adding realistic context such as company names,
culture descriptions from public careers pages, and selective hiring
constraints (e.g., “only accept candidates in the top 10%”) induces
significant racial and gender biases (up to 12% differences in interview
rates). When these biases emerge, they consistently favor Black over White
candidates and female over male candidates across all tested models and
scenarios. Moreover, models can infer demographics and become biased from
subtle cues like college affiliations, with these biases remaining invisible
even when inspecting the model’s chain-of-thought reasoning. To address these
limitations, our internal bias mitigation identifies race and gender-correlated
directions and applies affine concept editing at inference time. Despite using
directions from a simple synthetic dataset, the intervention generalizes
robustly, consistently reducing bias to very low levels (typically under 1%,
always below 2.5%) while largely maintaining model performance. Our findings
suggest that practitioners deploying LLMs for hiring should adopt more
realistic evaluation methodologies and consider internal mitigation strategies
for equitable outcomes.
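
The intervention can be illustrated with a small sketch of one common form of affine concept editing: estimate an attribute direction as the normalized difference of group means, then move every activation so its component along that direction equals a fixed target value. The choice of direction estimator, the midpoint target, and all names below are assumptions for illustration; the abstract does not spell out the exact procedure.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(group_a, group_b):
    """Normalized difference of the two group mean activations."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("group means coincide; no direction to edit")
    return [d / norm for d in diff]


def affine_edit(activation, direction, target):
    """Set the component along `direction` to `target`, leave the rest as is."""
    shift = target - dot(activation, direction)
    return [x + shift * d for x, d in zip(activation, direction)]


# Toy 2-D activations from a synthetic dataset with two groups.
group_a = [[1.0, 0.5], [3.0, 0.5]]
group_b = [[-1.0, 0.5], [-3.0, 0.5]]
direction = unit_direction(group_a, group_b)
# Assumed target: midpoint of the two groups' mean projections.
target = 0.5 * (dot(mean_vector(group_a), direction)
                + dot(mean_vector(group_b), direction))
edited = affine_edit([5.0, 7.0], direction, target)
```

After the edit, activations from either group carry the same value along the attribute direction, while everything orthogonal to it passes through unchanged, which is consistent with the claim that task performance is largely preserved.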

[118] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models

Evelyn Ma,Duo Zhou,Peizhi Niu,Huiting Zhou,Huan Zhang,Olgica Milenkovic,S. Rasoul Etesami

Main category: cs.LG

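TL;DR: GUARD is a guided unlearning and retention framework for large language models (LLMs) that uses data attribution to reduce unintended forgetting and better retain valuable information.

Details Motivation: LLM unlearning is increasingly important because of regulatory compliance, copyright protection, and privacy concerns, but existing methods often degrade model utility when forgetting high-impact data.

Contribution: GUARD introduces a lightweight proxy data attribution metric and an unlearning objective with adaptive, nonuniform unlearning weights, improving retention.

Method: GUARD uses a proxy data attribution metric to quantify the alignment between the forget and retain sets and assigns adaptive per-sample weights in the unlearning objective.

Result: On the TOFU benchmark, GUARD substantially improves utility preservation on the Retain Set (reducing utility sacrifice by up to 194.92% in terms of Truth Ratio) while maintaining effective unlearning.

Insight: Data-level factors strongly influence LLM unlearning performance, and GUARD offers an efficient way to balance forgetting and retention.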

Abstract: Unlearning in large language models (LLMs) is becoming increasingly important
due to regulatory compliance, copyright protection, and privacy concerns.
However, a key challenge in LLM unlearning is unintended forgetting, where the
removal of specific data inadvertently impairs the utility of the model and its
retention of valuable, desired information. While prior work has primarily
focused on architectural innovations, the influence of data-level factors on
unlearning performance remains underexplored. As a result, existing methods
often suffer from degraded retention when forgetting high-impact data. To
address this, we propose GUARD, a novel framework for Guided Unlearning And
Retention via Data attribution. At its core, GUARD introduces a lightweight
proxy data attribution metric tailored for LLM unlearning, which quantifies the
“alignment” between the forget and retain sets while remaining computationally
efficient. Building on this, we design a novel unlearning objective that
assigns adaptive, nonuniform unlearning weights to samples, inversely
proportional to their proxy attribution scores. Through such a reallocation of
unlearning power, GUARD mitigates unintended losses in retention. We provide
rigorous theoretical guarantees that GUARD significantly enhances retention
while maintaining forgetting metrics comparable to prior methods. Extensive
experiments on the TOFU benchmark across multiple LLM architectures demonstrate
that GUARD substantially improves utility preservation while ensuring effective
unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to
194.92% in terms of Truth Ratio when forgetting 10% of the training data.
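
The weighting rule in the abstract (unlearning weights inversely proportional to proxy attribution scores) can be sketched as below. This is a minimal illustration assuming non-negative scores and a plain reciprocal with a small `eps` for stability; the normalization to a mean weight of 1 and all names are assumptions, and GUARD's actual objective may differ.

```python
def unlearning_weights(scores, eps=1e-8):
    """Per-sample weights proportional to 1 / score, with mean weight 1.

    scores: non-negative proxy attribution scores for the forget samples.
    A high score (strong alignment with the retain set) yields a low weight.
    """
    if any(s < 0 for s in scores):
        raise ValueError("this sketch assumes non-negative scores")
    inverse = [1.0 / (s + eps) for s in scores]
    total = sum(inverse)
    return [len(scores) * v / total for v in inverse]


# Hypothetical attribution scores for three forget-set samples.
weights = unlearning_weights([1.0, 2.0, 4.0])
```

Samples whose removal would hurt retention most receive the least unlearning pressure, while the total unlearning budget (the sum of the weights) stays the same as in the uniform case.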

[119] Build the web for agents, not agents for the web

Xing Han Lù,Gaurav Kamath,Marius Mosbach,Siva Reddy

Main category: cs.LG

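TL;DR: This position paper argues for a paradigm shift: develop web interfaces designed for agents (AWIs) rather than forcing agents to adapt to interfaces built for humans.

Details Motivation: Current web agent approaches struggle because of the mismatch between human-designed interfaces and LLM capabilities, which makes processing complex web inputs inefficient.

Contribution: Introduces the concept of the Agentic Web Interface (AWI) and establishes six guiding principles for its design.

Method: Proposes a new web interaction paradigm optimized for agents, intended to overcome the limitations of existing approaches.

Result: AWIs aim to make web agents more efficient, reliable, and transparent, and to lay the groundwork for future collaborative development.

Insight: Progress in web agents requires redesigning the interface rather than simply adapting agents to existing human interfaces.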

Abstract: Recent advancements in Large Language Models (LLMs) and multimodal
counterparts have spurred significant interest in developing web agents – AI
systems capable of autonomously navigating and completing tasks within web
environments. While holding tremendous promise for automating complex web
interactions, current approaches face substantial challenges due to the
fundamental mismatch between human-designed interfaces and LLM capabilities.
Current methods struggle with the inherent complexity of web inputs, whether
processing massive DOM trees, relying on screenshots augmented with additional
information, or bypassing the user interface entirely through API interactions.
This position paper advocates for a paradigm shift in web agent research:
rather than forcing web agents to adapt to interfaces designed for humans, we
should develop a new interaction paradigm specifically optimized for agentic
capabilities. To this end, we introduce the concept of an Agentic Web Interface
(AWI), an interface specifically designed for agents to navigate a website. We
establish six guiding principles for AWI design, emphasizing safety,
efficiency, and standardization, to account for the interests of all primary
stakeholders. This reframing aims to overcome fundamental limitations of
existing interfaces, paving the way for more efficient, reliable, and
transparent web agent design, which will be a collaborative effort involving
the broader ML community.

[120] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems

Aayush Karan,Kulin Shah,Sitan Chen

Main category: cs.LG

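TL;DR: The paper proposes ReGuidance, a simple yet effective diffusion model wrapper that boosts sample quality and reward on hard inverse problems.

Details Motivation: When the reward is not informative enough (for example, in hard inverse problems with low signal-to-noise ratio), methods such as DPS veer off the data manifold and produce unrealistic outputs. ReGuidance addresses this by re-initializing DPS from a latent obtained through ODE inversion.

Contribution: Proposes the ReGuidance wrapper, which inverts a candidate solution with the reverse ODE and uses the result to initialize DPS, significantly improving sample quality and measurement consistency; provides theory that, to the authors’ knowledge, constitutes the first rigorous algorithmic guarantee for DPS.

Method: Runs the unconditional probability flow ODE in reverse from a candidate solution to obtain a latent, then uses that latent as the initialization for DPS, improving sample realism and reward.

Result: On hard tasks such as large box in-painting and super-resolution with high upscaling, ReGuidance substantially improves the sample quality and measurement consistency of existing baselines.

Insight: Initializing from the inverted latent can simultaneously boost the reward and bring the solution closer to the data manifold on certain multimodal data distributions, giving DPS theoretical support.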

Abstract: There has been a flurry of activity around using pretrained diffusion models
as informed data priors for solving inverse problems, and more generally around
steering these models using reward models. Training-free methods like diffusion
posterior sampling (DPS) and its many variants have offered flexible heuristic
algorithms for these tasks, but when the reward is not informative enough,
e.g., in hard inverse problems with low signal-to-noise ratio, these techniques
veer off the data manifold, failing to produce realistic outputs. In this work,
we devise a simple wrapper, ReGuidance, for boosting both the sample realism
and reward achieved by these methods. Given a candidate solution $\hat{x}$
produced by an algorithm of the user’s choice, we propose inverting the
solution by running the unconditional probability flow ODE in reverse starting
from $\hat{x}$, and then using the resulting latent as an initialization for
DPS. We evaluate our wrapper on hard inverse problems like large box
in-painting and super-resolution with high upscaling. Whereas state-of-the-art
baselines visibly fail, we find that applying our wrapper on top of these
baselines significantly boosts sample quality and measurement consistency. We
complement these findings with theory proving that on certain multimodal data
distributions, ReGuidance simultaneously boosts the reward and brings the
candidate solution closer to the data manifold. To our knowledge, this
constitutes the first rigorous algorithmic guarantee for DPS.

eess.IV [Back]

[121] Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective

Minye Shao,Zeyu Wang,Haoran Duan,Yawen Huang,Bing Zhai,Shizheng Wang,Yang Long,Yefeng Zheng

Main category: eess.IV

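TL;DR: HFF-Net rethinks brain tumor segmentation from a frequency-domain perspective, combining Frequency Domain Decomposition (FDD), Adaptive Laplacian Convolution (ALC), and Frequency Domain Cross-Attention (FDCA) to substantially improve segmentation of enhancing tumor regions.

Details Motivation: Existing methods degrade when segmenting contrast-enhancing brain tumor regions in MRI, largely because they give insufficient consideration to tumor features such as complex textures and directional variations.

Contribution: Proposes HFF-Net with an FDD module that decomposes frequency information, an ALC module that adaptively enhances boundary sensitivity, and an FDCA module that fuses multi-scale features.

Method: FDD separates low- and high-frequency components; ALC emphasizes high-frequency details with dynamically updated convolution kernels; FDCA integrates semantic, positional, and slice-specific information.

Result: On four public datasets, mean Dice scores improve by an average of 4.48% (relative) and contrast-enhancing tumor segmentation by an average of 7.33% (relative), with favorable computational efficiency and clinical applicability.

Insight: A frequency-domain view captures the complex features of tumor regions, and adaptively emphasizing high-frequency details is the key source of improvement.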

Abstract: Precise segmentation of brain tumors, particularly contrast-enhancing regions
visible in post-contrast MRI (areas highlighted by contrast agent injection),
is crucial for accurate clinical diagnosis and treatment planning but remains
challenging. However, current methods exhibit notable performance degradation
in segmenting these enhancing brain tumor areas, largely due to insufficient
consideration of MRI-specific tumor features such as complex textures and
directional variations. To address this, we propose the Harmonized Frequency
Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a
frequency-domain perspective. To comprehensively characterize tumor regions, we
develop a Frequency Domain Decomposition (FDD) module that separates MRI images
into low-frequency components, which capture smooth tumor contours, and
high-frequency components, which highlight detailed textures and directional
edges. To further enhance sensitivity to tumor boundaries, we introduce an
Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical
high-frequency details using dynamically updated convolution kernels. To
effectively fuse tumor features across multiple scales, we design a Frequency
Domain Cross-Attention (FDCA) module integrating semantic, positional, and
slice-specific information. We further validate and interpret frequency-domain
improvements through visualization, theoretical reasoning, and experimental
analyses. Extensive experiments on four public datasets demonstrate that
HFF-Net achieves an average relative improvement of 4.48% (ranging from 2.39%
to 7.72%) in the mean Dice scores across the three major subregions, and an
average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the
segmentation of contrast-enhancing tumor regions, while maintaining favorable
computational efficiency and clinical applicability. Code:
https://github.com/VinyehShaw/HFF.
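
Since the results are reported as relative improvements in Dice score, a short sketch of both quantities may help. The Dice formula is standard; the masks and the scores 0.80 and 0.88 below are made-up illustration values, not numbers from the paper.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention: two empty masks agree perfectly.
    return 1.0 if size == 0 else 2.0 * intersection / size


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, e.g. 0.10 means 10% better."""
    return (new - old) / old


# Flattened toy masks: the prediction covers one extra voxel.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])

# Hypothetical Dice scores for a baseline and an improved model.
gain = relative_improvement(0.88, 0.80)
```

A relative improvement of 10% on a baseline of 0.80 corresponds to an absolute gain of only 0.08 Dice, which is worth keeping in mind when comparing against papers that report absolute differences.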

[122] Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation

Emerson P. Grabke,Masoom A. Haider,Babak Taati

Main category: eess.IV

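TL;DR: The paper proposes CCELLA, a novel dual-head conditioning approach that combines text features from a non-medical large language model with pathology classification to train latent diffusion models (LDMs) for high-quality medical image synthesis, especially with limited data. The approach improves both synthetic image quality and downstream classifier performance.

Details Motivation: Medical LDM training typically relies on short-prompt text encoders, reuse of non-medical LDMs, or fine-tuning with large data volumes, all of which limit performance or scientific accessibility. The authors aim to address this with a data-efficient LDM training framework.

Contribution: 1. Proposes the CCELLA dual-head conditioning approach, combining text features and pathology classification; 2. Designs a joint loss function and a data-efficient training framework; 3. Demonstrates high-quality image generation and classifier improvement on a size-limited 3D prostate MRI dataset.

Method: The dual-head conditioning approach (CCELLA) conditions the LDM U-Net simultaneously on non-medical LLM-encoded text features through cross-attention and on pathology classification through the timestep embedding. A joint loss function enables efficient training.

Result: On a size-limited prostate MRI dataset, the method reaches a 3D FID of 0.025, outperforming a recent foundation model (FID 0.071). Adding synthetic images to the training set raises classifier accuracy from 69% to 74%, and a classifier trained solely on synthetic images performs comparably to one trained on real images alone.

Insight: Combining non-medical text features with medical pathology classification can improve LDM performance in medical image generation, especially when data is scarce.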

Abstract: Latent diffusion models (LDM) could alleviate data scarcity challenges
affecting machine learning development for medical imaging. However, medical
LDM training typically relies on performance- or scientific
accessibility-limiting strategies including a reliance on short-prompt text
encoders, the reuse of non-medical LDMs, or a requirement for fine-tuning with
large data volumes. We propose a Class-Conditioned Efficient Large Language
model Adapter (CCELLA) to address these limitations. CCELLA is a novel
dual-head conditioning approach that simultaneously conditions the LDM U-Net
with non-medical large language model-encoded text features through
cross-attention and with pathology classification through the timestep
embedding. We also propose a joint loss function and a data-efficient LDM
training framework. In combination, these strategies enable
pathology-conditioned LDM training for high-quality medical image synthesis
given limited data volume and human data annotation, improving LDM performance
and scientific accessibility. Our method achieves a 3D FID score of 0.025 on a
size-limited prostate MRI dataset, significantly outperforming a recent
foundation model with FID 0.071. When training a classifier for prostate cancer
prediction, adding synthetic images generated by our method to the training
dataset improves classifier accuracy from 69% to 74%. Training a classifier
solely on our method’s synthetic images achieved comparable performance to
training on real images alone.

[123] DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction

Yuliang Zhu,Jing Cheng,Qi Xie,Zhuo-Xu Cui,Qingyong Zhu,Yuanyuan Liu,Xin Liu,Jianfeng Ren,Chengbo Wang,Dong Liang

Main category: eess.IV

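TL;DR: The paper proposes DUN-SRE, a deep unrolling network with spatiotemporal rotation equivariance for dynamic MRI reconstruction, improving image quality.

Details Motivation: Dynamic MRI exhibits spatiotemporal symmetries (spatial rotation symmetry and temporal symmetry), but existing methods fail to fully exploit these priors, especially temporal symmetry, which limits reconstruction quality.

Contribution: Proposes DUN-SRE, which integrates spatiotemporal rotation equivariance into a deep unrolling network and uses a (2+1)D equivariant convolutional architecture to propagate symmetry constraints rigorously.

Method: Uses a (2+1)D equivariant convolutional architecture with data consistency and proximal mapping modules, and develops a high-fidelity group filter parameterization mechanism.

Result: Achieves state-of-the-art performance on cardiac CINE MRI datasets, particularly in preserving rotation-symmetric structures.

Insight: Explicitly modeling symmetry priors matters for dynamic MRI reconstruction, and a spatiotemporally equivariant architecture models cardiac motion dynamics more accurately.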

Abstract: Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries,
including spatial rotation symmetry within individual frames and temporal
symmetry along the time dimension. Explicit incorporation of these symmetry
priors in the reconstruction model can significantly improve image quality,
especially under aggressive undersampling scenarios. Recently, equivariant
convolutional neural networks (ECNNs) have shown great promise in exploiting
spatial symmetry priors. However, existing ECNNs critically fail to model
temporal symmetry, arguably the most universal and informative structural prior
in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep
Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for
Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance
through a (2+1)D equivariant convolutional architecture. In particular, it
integrates both the data consistency and proximal mapping modules into a unified
deep unrolling framework. This architecture ensures rigorous propagation of
spatiotemporal rotation symmetry constraints throughout the reconstruction
process, enabling more physically accurate modeling of cardiac motion dynamics
in cine MRI. In addition, a high-fidelity group filter parameterization
mechanism is developed to maintain representation precision while enforcing
symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets
demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in
preserving rotation-symmetric structures, offering strong generalization
capability to a broad range of dynamic MRI reconstruction tasks.

[124] ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation

Xi Chen,Zhiqiang Shen,Peng Cao,Jinzhu Yang,Osmar R. Zaiane

Main category: eess.IV

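TL;DR: The paper proposes ConStyX, which augments both the content and the style of images to address domain generalization in medical image segmentation, overcoming two weaknesses of prior methods: exclusive reliance on style perturbation and neglect of the adverse effects of over-augmentation.

Details Motivation: Medical images usually come from multiple domains, and the resulting domain shifts impair segmentation models. Existing domain randomization methods rely solely on style perturbation and neglect the adverse effects of over-augmented images.

Contribution: Proposes ConStyX, which augments both image content and style and leverages well-augmented features while mitigating the negative effects of over-augmented ones, achieving better domain generalization.

Method: A content and style augmentation method that widens the range of data domains covered by the training data and reduces the negative effects of over-augmented features during training.

Result: Experiments across multiple domains show that ConStyX achieves superior generalization performance in medical image segmentation.

Insight: Augmenting both content and style simulates domain variation more comprehensively, and mitigating the effects of over-augmentation is key to better generalization.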

Abstract: Medical images are usually collected from multiple domains, leading to domain
shifts that impair the performance of medical image segmentation models. Domain
Generalization (DG) aims to address this issue by training a robust model with
strong generalizability. Recently, numerous domain randomization-based DG
methods have been proposed. However, these methods suffer from the following
limitations: 1) constrained efficiency of domain randomization due to their
exclusive dependence on image style perturbation, and 2) neglect of the adverse
effects of over-augmented images on model training. To address these issues, we
propose a novel domain randomization-based DG method, called content style
augmentation (ConStyX), for generalizable medical image segmentation.
Specifically, ConStyX 1) augments the content and style of training data,
allowing the augmented training data to better cover a wider range of data
domains, and 2) leverages well-augmented features while mitigating the negative
effects of over-augmented features during model training. Extensive experiments
across multiple domains demonstrate that our ConStyX achieves superior
generalization performance. The code is available at
https://github.com/jwxsp1/ConStyX.

[125] Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches

Andrea Moglia,Matteo Leccardi,Matteo Cavicchioli,Alice Maccarini,Marco Marcon,Luca Mainardi,Pietro Cerveri

Main category: eess.IV

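TL;DR: This survey reviews generalist models for medical image segmentation, compares their performance with task-specific models, and discusses future directions and challenges.

Details Motivation: Inspired by the success of large language models and the Segment Anything Model (SAM), researchers are exploring generalist models for medical image segmentation to improve generalization and reduce the need for task-specific models.

Contribution: Provides a comprehensive survey of generalist models for medical image segmentation, including a taxonomy of SAM and its variants, a performance analysis, a comparison with task-specific models, and future research directions.

Method: Categorizes and compares generalist models (such as SAM variants, SAM 2, and models trained on both images and text) for medical image segmentation, drawing on results reported in the literature.

Result: Generalist models show promise in medical image segmentation, but challenges such as regulatory compliance, privacy, and security remain, and on some tasks they may trail task-specific models.

Insight: Future directions include synthetic data, early fusion, lessons learnt from generalist models in natural language processing, and clinical translation.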

Abstract: Following the successful paradigm shift of large language models, leveraging
pre-training on a massive corpus of data and fine-tuning on different
downstream tasks, generalist models have made their foray into computer vision.
The introduction of Segment Anything Model (SAM) set a milestone on
segmentation of natural images, inspiring the design of a multitude of
architectures for medical image segmentation. In this survey we offer a
comprehensive and in-depth investigation on generalist models for medical image
segmentation. We start with an introduction to the fundamental concepts
underpinning their development. Then, we provide a taxonomy of the different
variants of SAM in terms of zero-shot, few-shot, fine-tuning, adapters, on
the recent SAM 2, on other innovative models trained on images alone, and
others trained on both text and images. We thoroughly analyze their
performances at the level of both primary research and best-in-literature,
followed by a rigorous comparison with the state-of-the-art task-specific
models. We emphasize the need to address challenges in terms of compliance with
regulatory frameworks, privacy and security laws, budget, and trustworthy
artificial intelligence (AI). Finally, we share our perspective on future
directions concerning synthetic data, early fusion, lessons learnt from
generalist models in natural language processing, agentic AI and physical AI,
and clinical translation.

[126] Med-URWKV: Pure RWKV With ImageNet Pre-training For Medical Image Segmentation

Zhenhuan Zhou

Main category: eess.IV

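TL;DR: Med-URWKV is a pure RWKV-based medical image segmentation model that, for the first time, reuses an ImageNet-pretrained VRWKV encoder, and it performs well on multiple datasets.

Details Motivation: Existing medical image segmentation methods (CNN-based, Transformer-based, or hybrid) suffer from restricted receptive fields or high computational complexity. RWKV, with linear complexity and long-range modeling ability, is a promising alternative, but the benefits of pre-training have not been fully exploited.

Contribution: Proposes Med-URWKV, described as the first pure RWKV medical image segmentation model that directly reuses a large-scale pre-trained VRWKV encoder, and validates that pre-training improves performance.

Method: Built on the U-Net framework with a pre-trained VRWKV encoder, keeping a pure RWKV structure without additional modifications.

Result: On seven datasets, it matches or outperforms carefully optimized RWKV models trained from scratch, demonstrating the value of pre-training.

Insight: A pre-trained VRWKV encoder provides stronger feature representations for medical image segmentation, and a pure RWKV architecture stays competitive while remaining efficient.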

Abstract: Medical image segmentation is a fundamental and key technology in
computer-aided diagnosis and treatment. Previous methods can be broadly
classified into three categories: convolutional neural network (CNN) based,
Transformer based, and hybrid architectures that combine both. However, each of
them has its own limitations, such as restricted receptive fields in CNNs or
the computational overhead caused by the quadratic complexity of Transformers.
Recently, the Receptance Weighted Key Value (RWKV) model has emerged as a
promising alternative for various vision tasks, offering strong long-range
modeling capabilities with linear computational complexity. Some studies have
also adapted RWKV to medical image segmentation tasks, achieving competitive
performance. However, most of these studies focus on modifications to the
Vision-RWKV (VRWKV) mechanism and train models from scratch, without exploring
the potential advantages of leveraging pre-trained VRWKV models for medical
image segmentation tasks. In this paper, we propose Med-URWKV, a pure
RWKV-based architecture built upon the U-Net framework, which incorporates
ImageNet-based pretraining to further explore the potential of RWKV in medical
image segmentation tasks. To the best of our knowledge, Med-URWKV is the first
pure RWKV segmentation model in the medical field that can directly reuse a
large-scale pre-trained VRWKV encoder. Experimental results on seven datasets
demonstrate that Med-URWKV achieves comparable or even superior segmentation
performance compared to other carefully optimized RWKV models trained from
scratch. This validates the effectiveness of using a pretrained VRWKV encoder
in enhancing model performance. The codes will be released.

cs.GR [Back]

[127] Edit360: 2D Image Edits to 3D Assets from Any Angle

Junchao Huang,Xinting Hu,Zhuotao Tian,Shaoshuai Shi,Li Jiang

Main category: cs.GR

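TL;DR: Edit360 is a tuning-free framework that extends 2D image edits to multi-view consistent 3D editing. Its Anchor-View Editing Propagation mechanism supports editing from arbitrary viewpoints and the reconstruction of high-quality 3D assets.

Details Motivation: Existing methods typically restrict editing to predetermined viewing angles, which limits flexibility and makes it hard to deliver the multi-view consistency that practical applications require.

Contribution: Proposes the Edit360 framework, which extends 2D edits to 3D assets, and introduces the Anchor-View Editing Propagation mechanism to ensure multi-view consistency.

Method: Built on video diffusion models, it selects anchor views for 2D editing, then aligns and merges multi-view information in the latent and attention spaces.

Result: Enables the reconstruction of high-quality 3D assets and supports customizable 3D content creation.

Insight: The main challenge in extending 2D editing to 3D is multi-view consistency; Edit360 addresses it by merging multi-view information inside the diffusion model.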

Abstract: Recent advances in diffusion models have significantly improved image
generation and editing, but extending these capabilities to 3D assets remains
challenging, especially for fine-grained edits that require multi-view
consistency. Existing methods typically restrict editing to predetermined
viewing angles, severely limiting their flexibility and practical applications.
We introduce Edit360, a tuning-free framework that extends 2D modifications to
multi-view consistent 3D editing. Built upon video diffusion models, Edit360
enables user-specific editing from arbitrary viewpoints while ensuring
structural coherence across all views. The framework selects anchor views for
2D modifications and propagates edits across the entire 360-degree range. To
achieve this, Edit360 introduces a novel Anchor-View Editing Propagation
mechanism, which effectively aligns and merges multi-view information within
the latent and attention spaces of diffusion models. The resulting edited
multi-view sequences facilitate the reconstruction of high-quality 3D assets,
enabling customizable 3D content creation.

cs.RO [Back]

[128] A Navigation Framework Utilizing Vision-Language Models

Yicheng Duan,Kaiyu Tang

Main category: cs.RO

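TL;DR: The paper proposes a modular framework for Vision-and-Language Navigation (VLN) built on a vision-language model. By decoupling vision-language understanding from action planning, it aims for fast, adaptable navigation without extensive fine-tuning.

Details Motivation: Large vision-language models such as CLIP and Flamingo excel at multimodal understanding but pose challenges in real-time deployment and computational cost. The paper aims to address these issues while improving the flexibility and efficiency of navigation.

Contribution: 1. Proposes a modular navigation framework that combines a frozen vision-language model with lightweight planning logic; 2. uses prompt engineering, structured history management, and a two-frame visual input strategy to improve the continuity of navigation decisions.

Method: The framework integrates the Qwen2.5-VL-7B-Instruct model, uses prompt engineering to improve instruction understanding, and uses two-frame visual input with structured history management to keep navigation coherent across steps.

Result: The system is evaluated on the Room-to-Room benchmark in the VLN-CE setting with the Matterport3D dataset. Initial results reveal difficulty generalizing to unseen environments, but the modular design provides a foundation for future improvement.

Insight: Modular design is a viable route to addressing computational cost and real-time deployment; performance could be improved further through stronger environmental priors and expanded multimodal input.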

Abstract: Vision-and-Language Navigation (VLN) presents a complex challenge in embodied
AI, requiring agents to interpret natural language instructions and navigate
through visually rich, unfamiliar environments. Recent advances in large
vision-language models (LVLMs), such as CLIP and Flamingo, have significantly
improved multimodal understanding but introduced new challenges related to
computational cost and real-time deployment. In this project, we propose a
modular, plug-and-play navigation framework that decouples vision-language
understanding from action planning. By integrating a frozen vision-language
model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to
achieve flexible, fast, and adaptable navigation without extensive model
fine-tuning. Our framework leverages prompt engineering, structured history
management, and a two-frame visual input strategy to enhance decision-making
continuity across navigation steps. We evaluate our system on the Room-to-Room
benchmark within the VLN-CE setting using the Matterport3D dataset and
Habitat-Lab simulation environment. Although our initial results reveal
challenges in generalizing to unseen environments under strict evaluation
settings, our modular approach lays a foundation for scalable and efficient
navigation systems, highlighting promising directions for future improvement
through enhanced environmental priors and expanded multimodal input
integration.
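
The abstract names structured history management and a two-frame visual input strategy but gives no implementation details. The following is a minimal sketch of how such a request could be assembled, assuming a bounded step history and a prompt that pairs the previous and current frames. The names `NavHistory` and `build_request`, the prompt wording, and the action vocabulary are all hypothetical and are not taken from the paper.

```python
from collections import deque


class NavHistory:
    """Bounded, structured record of past navigation steps (hypothetical design)."""

    def __init__(self, max_steps=5):
        # A deque with maxlen drops the oldest step automatically.
        self.steps = deque(maxlen=max_steps)

    def add(self, step, action, note):
        self.steps.append({"step": step, "action": action, "note": note})

    def render(self):
        if not self.steps:
            return "(no previous steps)"
        return "\n".join(
            f"step {s['step']}: {s['action']} ({s['note']})" for s in self.steps
        )


def build_request(instruction, prev_frame, curr_frame, history):
    """Assemble one model request: two frames plus a text prompt."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"History:\n{history.render()}\n"
        "Image 1 is the previous view; image 2 is the current view.\n"
        # Assumed action vocabulary, for illustration only.
        "Reply with one action: FORWARD, TURN_LEFT, TURN_RIGHT, or STOP."
    )
    return {"images": [prev_frame, curr_frame], "prompt": prompt}
```

Keeping the history bounded is one way to hold the prompt length steady over long episodes; the paper may manage history differently.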

[129] EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence

Wang Xinjie,Liu Liu,Cao Yu,Wu Ruiqi,Qin Wenkang,Wang Dehui,Sui Wei,Su Zhizhong

Main category: cs.RO

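TL;DR: EmbodiedGen is a generative 3D world engine platform for embodied intelligence. It aims to generate high-quality, controllable, and realistic 3D assets at low cost, addressing the high cost and limited realism of manually created 3D assets.

Details Motivation: Current embodied intelligence tasks rely on manually created 3D graphics assets, which are costly and of limited realism, constraining the scalability of data-driven approaches. EmbodiedGen aims to address this with generative AI.

Contribution: Proposes the EmbodiedGen platform, composed of six key modules, which generates diverse, interactive 3D worlds and supports low-cost generation of high-quality 3D assets for physics simulation.

Method: EmbodiedGen uses generative AI modules such as Image-to-3D and Text-to-3D to produce interactive 3D assets and exports them in URDF format for use in physics simulation.

Result: The generated 3D assets are high-quality, controllable, and realistic, can be imported directly into physics simulation engines, and support training and evaluation for downstream tasks.

Insight: Generative AI can substantially reduce the cost of 3D data assets and increase their diversity, improving the scalability and generality of embodied intelligence research.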

Abstract: Constructing a physically realistic and accurately scaled simulated 3D world
is crucial for the training and evaluation of embodied intelligence tasks. The
diversity, realism, low-cost accessibility, and affordability of 3D data assets
are critical for achieving generalization and scalability in embodied AI.
However, most current embodied intelligence tasks still rely heavily on
traditional 3D computer graphics assets manually created and annotated, which
suffer from high production costs and limited realism. These limitations
significantly hinder the scalability of data-driven approaches. We present
EmbodiedGen, a foundational platform for interactive 3D world generation. It
enables the scalable generation of high-quality, controllable and
photorealistic 3D assets with accurate physical properties and real-world scale
in the Unified Robotics Description Format (URDF) at low cost. These assets can
be directly imported into various physics simulation engines for fine-grained
physical control, supporting downstream tasks in training and evaluation.
EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key
modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object
Generation, Scene Generation and Layout Generation. EmbodiedGen generates
diverse and interactive 3D worlds composed of generative 3D assets, leveraging
generative AI to address the challenges of generalization and evaluation to the
needs of embodied intelligence related research. Code is available at
https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.

[130] Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop

Justin Kerr,Kush Hari,Ethan Weber,Chung Min Kim,Brent Yi,Tyler Bonnen,Ken Goldberg,Angjoo Kanazawa

Main category: cs.RO

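TL;DR: The paper presents EyeRobot, a system that combines behavior cloning (BC) and reinforcement learning (RL) in a BC-RL loop to train a robot's eye gaze behavior for real-world tasks, yielding hand-eye coordination.

Details Motivation: Inspired by how humans actively look in order to act, the paper aims to design a system whose gaze behavior helps the robot complete tasks.

Contribution: 1. Develops a freely rotating mechanical eyeball; 2. proposes a BC-RL loop that jointly trains the hand and eye policies; 3. designs a foveal-inspired policy architecture that provides high resolution with a small compute budget.

Method: 1. Collect teleoperated demonstrations and import them into a simulation environment; 2. train the hand policy with BC and the eye policy with RL; 3. train hand and eye jointly through the BC-RL loop.

Result: On five panoramic workspace manipulation tasks, EyeRobot shows effective hand-eye coordination and can manipulate over a large workspace.

Insight: Active gaze can improve robot performance on complex tasks, especially those that require manipulation across a large workspace.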

Abstract: Humans do not passively observe the visual world – we actively look in order
to act. Motivated by this principle, we introduce EyeRobot, a robotic system
with gaze behavior that emerges from the need to complete real-world tasks. We
develop a mechanical eyeball that can freely rotate to observe its surroundings
and train a gaze policy to control it using reinforcement learning. We
accomplish this by first collecting teleoperated demonstrations paired with a
360 camera. This data is imported into a simulation environment that supports
rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze
on top of robot demonstrations. We then introduce a BC-RL loop to train the
hand and eye jointly: the hand (BC) agent is trained from rendered eye
observations, and the eye (RL) agent is rewarded when the hand produces correct
action predictions. In this way, hand-eye coordination emerges as the eye looks
towards regions which allow the hand to complete the task. EyeRobot implements
a foveal-inspired policy architecture allowing high resolution with a small
compute budget, which we find also leads to the emergence of more stable
fixation as well as improved ability to track objects and ignore distractors.
We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring
manipulation in an arc surrounding the robot arm. Our experiments suggest
EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate
manipulation over large workspaces with a single camera. See project site for
videos: https://www.eyerobot.net/

cs.MA [Back]

[131] AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

Haoyuan Shi,Yunxin Li,Xinyu Chen,Longyue Wang,Baotian Hu,Min Zhang

Main category: cs.MA

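TL;DR: AniMaker is a multi-agent framework for automatically generating coherent multi-scene storytelling animation. It improves animation quality and consistency through MCTS-driven clip generation and storytelling-aware clip selection.

Details Motivation: Existing video generation methods struggle to produce coherent multi-scene, multi-character storytelling animation; they suffer from disjointed narratives, pacing issues, and model instability.

Contribution: 1. Proposes AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware selection; 2. designs the MCTS-Gen strategy for clip generation; 3. develops AniEval, described as the first framework designed specifically for multi-shot animation evaluation.

Method: 1. Specialized agents (Director, Photography, Reviewer, Post-Production) divide the work; 2. the Photography Agent uses MCTS-Gen to generate high-potential clips; 3. the Reviewer Agent uses AniEval to assess story-level consistency, action completion, and related aspects of each clip.

Result: Experiments show that AniMaker achieves superior quality on VBench and AniEval while significantly improving the efficiency of multi-candidate generation, moving closer to production standards.

Insight: Dividing work among agents and using an MCTS-driven generation strategy can address coherence and quality problems in multi-scene animation, and AniEval offers a new reference point for multi-shot animation evaluation.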

Abstract: Despite rapid advancements in video generation models, generating coherent
storytelling videos that span multiple scenes and characters remains
challenging. Current methods often rigidly convert pre-generated keyframes into
fixed-length clips, resulting in disjointed narratives and pacing issues.
Furthermore, the inherent instability of video generation models means that
even a single low-quality clip can significantly degrade the entire output
animation’s logical coherence and visual continuity. To overcome these
obstacles, we introduce AniMaker, a multi-agent framework enabling efficient
multi-candidate clip generation and storytelling-aware clip selection, thus
creating globally consistent and story-coherent animation solely from text
input. The framework is structured around specialized agents, including the
Director Agent for storyboard generation, the Photography Agent for video clip
generation, the Reviewer Agent for evaluation, and the Post-Production Agent
for editing and voiceover. Central to AniMaker’s approach are two key technical
components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search
(MCTS)-inspired strategy that intelligently navigates the candidate space to
generate high-potential clips while optimizing resource usage; and AniEval in
Reviewer Agent, the first framework specifically designed for multi-shot
animation evaluation, which assesses critical aspects such as story-level
consistency, action completion, and animation-specific features by considering
each clip in the context of its preceding and succeeding clips. Experiments
demonstrate that AniMaker achieves superior quality as measured by popular
metrics including VBench and our proposed AniEval framework, while
significantly improving the efficiency of multi-candidate generation, pushing
AI-generated storytelling animation closer to production standards.
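
The abstract describes MCTS-Gen only as an MCTS-inspired strategy for navigating the candidate clip space; its actual selection rule is not given. As an illustration of the general idea, the sketch below uses the standard UCB1 score from Monte Carlo Tree Search to decide which candidate branch to expand next. The dictionary layout, the exploration constant `c`, and the use of a quality score as the reward are assumptions and not the authors' method.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=1.4):
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited candidates are tried first
    mean = total_reward / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus


def select_child(children, parent_visits, c=1.4):
    """Return the index of the candidate branch with the highest UCB1 score.

    children: list of dicts with 'total_reward' and 'visits'
    (a hypothetical layout, e.g. summed reviewer scores per clip branch).
    """
    scores = [
        ucb1(ch["total_reward"], ch["visits"], parent_visits, c) for ch in children
    ]
    return max(range(len(children)), key=scores.__getitem__)
```

With `c=0` the rule is purely greedy on the mean score; a larger `c` spends more of the generation budget on rarely tried candidates.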

cs.SD [Back]

[132] PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs

Tony Alex,Wish Suharitdamrong,Sara Atito,Armin Mustafa,Philip J. B. Jackson,Imran Razzak,Muhammad Awais

Main category: cs.SD

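TL;DR: Through a systematic study of how information is transferred from audio encoders to large language models (LLMs), the paper proposes and validates several architectural modifications that substantially improve audio-LLM performance.

Details Motivation: Although audio-LLMs have advanced rapidly on the application side, the underlying mechanisms, particularly how audio encoders efficiently transfer rich semantic information to the LLM, remain under-explored. The paper investigates and optimizes this interaction.

Contribution: 1) Proposes delayed audio integration to strengthen the LLM's ability to probe audio information; 2) shows that audio representations can be probed effectively through the LLM's attention submodule alone; 3) shows that an ensemble of multiple audio encoders provides richer representations.

Method: Tests three hypotheses experimentally: delayed audio integration, attention-only probing, and multi-encoder ensembling. The experiments use a dataset of 5.6 million audio-text pairs and a three-stage training curriculum.

Result: The final architecture achieves relative improvements of 10% to 60% over the baseline, supporting the approach to optimizing cross-modal information transfer.

Insight: The study identifies key factors in audio-to-LLM information transfer, including delayed integration and the complementary role of multiple encoders, which can inform the design of future cross-modal models.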

Abstract: The integration of audio perception capabilities into Large Language Models
(LLMs) has enabled significant advances in Audio-LLMs. Although
application-focused developments, particularly in curating training data for
specific capabilities (e.g., audio reasoning), have progressed rapidly, the
underlying mechanisms that govern efficient transfer of rich semantic
representations from audio encoders to LLMs remain under-explored. We
conceptualize effective audio-LLM interaction as the LLM’s ability to
proficiently probe the audio encoder representations to satisfy textual
queries. This paper presents a systematic investigation on how architectural
design choices can affect that. Beginning with a standard Pengi/LLaVA-style
audio-LLM architecture, we propose and evaluate several modifications guided by
hypotheses derived from mechanistic interpretability studies and LLM
operational principles. Our experiments demonstrate that: (1) delaying audio
integration until the LLM’s initial layers have established textual context
enhances its ability to probe the audio representations for relevant
information; (2) the LLM can proficiently probe audio representations
exclusively through the LLM layers’ attention submodule, without requiring
propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently
integrated ensemble of diverse audio encoders provides richer, complementary
representations, thereby broadening the LLM’s capacity to probe a wider
spectrum of audio information. All hypotheses are evaluated using an identical
three-stage training curriculum on a dataset of 5.6 million audio-text pairs,
ensuring controlled comparisons. Our final architecture, which incorporates all
proposed modifications, achieves relative improvements from 10% to 60% over
the baseline, validating our approach to optimizing cross-modal information
transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/

cs.CR [Back]

[133] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models

Zilong Wang,Xiang Zheng,Xiaosen Wang,Bo Wang,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CR

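TL;DR: GenBreak is a framework that fine-tunes a large language model (LLM) to systematically probe safety vulnerabilities in text-to-image (T2I) generators. Combining supervised fine-tuning with reinforcement learning, it produces adversarial prompts that both evade safety mechanisms and elicit harmful outputs.

Details Motivation: Existing T2I models can be misused to generate harmful content, and current safety testing methods are limited: their prompts are either easily detected or fail to produce genuinely harmful outputs. A reliable tool for evaluating the safety of T2I models is therefore needed.

Contribution: Proposes the GenBreak framework, which fine-tunes an LLM with reinforcement learning to systematically generate effective and stealthy adversarial prompts, exposing serious safety weaknesses in T2I models.

Method: Combines supervised fine-tuning on curated datasets with reinforcement learning through interaction with a surrogate T2I model, using multiple reward signals to guide the LLM toward adversarial prompts that both evade safety mechanisms and are highly toxic.

Result: The generated adversarial prompts are highly effective in black-box attacks against commercial T2I generators, revealing practical safety weaknesses.

Insight: Generating adversarial prompts requires balancing stealth and harmfulness, and the design of the multiple reward signals is key; the safety defenses of T2I models still need improvement.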

Abstract: Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and
are now widely used in content creation. However, these models can be misused
to generate harmful content, including nudity or violence, posing significant
safety risks. While most platforms employ content moderation systems,
underlying vulnerabilities can still be exploited by determined adversaries.
Recent research on red-teaming and adversarial attacks against T2I models has
notable limitations: some studies successfully generate highly toxic images but
use adversarial prompts that are easily detected and blocked by safety filters,
while others focus on bypassing safety mechanisms but fail to produce genuinely
harmful outputs, neglecting the discovery of truly high-risk prompts.
Consequently, there remains a lack of reliable tools for evaluating the safety
of defended T2I models. To address this gap, we propose GenBreak, a framework
that fine-tunes a red-team large language model (LLM) to systematically explore
underlying vulnerabilities in T2I generators. Our approach combines supervised
fine-tuning on curated datasets with reinforcement learning via interaction
with a surrogate T2I model. By integrating multiple reward signals, we guide
the LLM to craft adversarial prompts that enhance both evasion capability and
image toxicity, while maintaining semantic coherence and diversity. These
prompts demonstrate strong effectiveness in black-box attacks against
commercial T2I generators, revealing practical and concerning safety
weaknesses.

[134] Secure Data Access in Cloud Environments Using Quantum Cryptography

S. Vasavi Venkata Lakshmi,Ziaul Haque Choudhury

Main category: cs.CR

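TL;DR: The paper proposes combining quantum key distribution (QKD) with the Quantum One-Time Pad (QOTP), using the BB84 protocol, to secure data transfer in cloud environments against the future threat of quantum computers.

Details Motivation: As quantum computers develop, traditional encryption methods may no longer withstand future security threats. Data security in cloud environments is therefore a pressing problem that calls for new techniques.

Contribution: Applies quantum cryptography (QKD and QOTP) to cloud environments and proposes a secure data transfer scheme intended to resist attacks from quantum computers.

Method: Uses the BB84 protocol for QKD and QOTP to encrypt and decrypt data. The combination is intended to keep the data completely private.

Result: The study shows how quantum cryptography can be applied in cloud environments to protect stored and shared data, even against attackers with quantum computers.

Insight: Quantum cryptography offers a forward-looking answer to future data security problems, particularly in distributed environments such as the cloud.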

Abstract: Cloud computing has made storing and accessing data easier, but keeping
it secure remains a major challenge. Traditional methods of securing data may
not be strong enough in the future when powerful quantum computers become
available. To solve this problem, this study uses quantum cryptography to
protect data in the cloud environment. Quantum Key Distribution (QKD) creates
secure keys by sending information using quantum particles like photons.
Specifically, we use the BB84 protocol, a simple and reliable way to make
secure keys that cannot be stolen without detection. To protect the data, we
use the Quantum One-Time Pad (QOTP) for encryption and decryption, ensuring the
data stays completely private. This study shows how these quantum methods can
be applied in cloud systems to provide a strong defense against hackers, even
if they have access to quantum computers. The combination of BB84-based QKD and
QOTP creates a safe and reliable way to keep data secure when it is stored or
shared in the cloud. Using quantum cryptography, this paper provides a way to
ensure data security now and in the future, making cloud computing safer for
everyone who stores data there.
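
The key-sifting step of BB84 can be illustrated with a small classical simulation. This is a sketch under stated assumptions: an ideal, noise-free channel with no eavesdropper, and a classical XOR one-time pad standing in for the quantum one-time pad named in the abstract. The function names are hypothetical, and the sketch omits the error estimation and privacy amplification that a real QKD system needs.

```python
import secrets


def random_bits(n):
    """Return n cryptographically random bits."""
    return [secrets.randbelow(2) for _ in range(n)]


def bb84_sift(n_qubits):
    """Simulate BB84 sifting on an ideal channel (no noise, no eavesdropper)."""
    alice_bits = random_bits(n_qubits)
    alice_bases = random_bits(n_qubits)  # 0 = rectilinear, 1 = diagonal
    bob_bases = random_bits(n_qubits)
    # Matching basis: Bob reads Alice's bit. Mismatched basis: random outcome.
    bob_bits = [
        bit if a_basis == b_basis else secrets.randbelow(2)
        for bit, a_basis, b_basis in zip(alice_bits, alice_bases, bob_bases)
    ]
    # Bases are compared publicly; only matching positions are kept.
    keep = [i for i in range(n_qubits) if alice_bases[i] == bob_bases[i]]
    return [alice_bits[i] for i in keep], [bob_bits[i] for i in keep]


def xor_pad(bits, key):
    """Classical one-time pad: XOR each bit with a fresh key bit."""
    if len(key) < len(bits):
        raise ValueError("key must be at least as long as the message")
    return [b ^ k for b, k in zip(bits, key)]
```

On average about half of the transmitted positions survive sifting, so a 256-qubit exchange yields a shared key of roughly 128 bits. Applying `xor_pad` twice with the same key recovers the original message.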

physics.med-ph [Back]

[135] Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation

Nicholas Summerfield,Qisheng He,Alex Kuo,Ahmed I. Ghanem,Simeng Zhu,Chase Ruff,Joshua Pan,Anudeep Kumar,Prashant Nagpal,Jiwei Zhao,Ming Dong,Carri K. Glide-Hurst

Main category: physics.med-ph

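TL;DR: MAGIC is a multi-modality cardiac substructure segmentation method that segments across modalities with a single model. It outperforms the comparison models in 57% of cases and is computationally lightweight.

Details Motivation: Cardiac substructure segmentation is essential in radiation therapy planning, but existing deep learning methods generalize poorly across modalities and overlapping structures.

Contribution: Proposes MAGIC, which performs multi-modality cardiac substructure segmentation with a single model, simplifying computational requirements and increasing clinical flexibility.

Method: Uses replicated encoding and decoding branches on an nnU-Net-based, U-shaped backbone to segment multiple modalities, supporting simulation CT, low-field MR-Linac, and cardiac CT angiography inputs.

Result: Average Dice scores were 0.75, 0.68, and 0.80 on Sim-CT, MR-Linac, and CCTA, respectively; MAGIC outperformed the comparison models in 57% of cases.

Insight: MAGIC shows the potential of a single model for multi-modality tasks and offers a lightweight, flexible option for clinical use.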

Abstract: Cardiac substructures are essential in thoracic radiation therapy planning to
minimize risk of radiation-induced heart disease. Deep learning (DL) offers
efficient methods to reduce contouring burden but lacks generalizability across
different modalities and overlapping structures. This work introduces and
validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and
multi-modal cardiac substructure segmentation. MAGIC is implemented through
replicated encoding and decoding branches of an nnU-Net-based, U-shaped
backbone conserving the function of a single model. Twenty cardiac
substructures (heart, chambers, great vessels (GVs), valves, coronary arteries
(CAs), and conduction nodes) from simulation CT (Sim-CT), low-field MR-Linac,
and cardiac CT angiography (CCTA) modalities were manually delineated and used
to train (n=76), validate (n=15), and test (n=30) MAGIC. Twelve comparison
models (four segmentation subgroups across three modalities) were equivalently
trained. All methods were compared for training efficiency and against
reference contours using the Dice Similarity Coefficient (DSC) and two-tailed
Wilcoxon Signed-Rank test (threshold, p<0.05). Average DSC scores were
0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. MAGIC
outperforms the comparison in 57% of cases, with limited statistical
differences. MAGIC offers an effective and accurate segmentation solution that
is lightweight and capable of segmenting multiple modalities and overlapping
structures in a single model. MAGIC further enables clinical implementation by
simplifying the computational requirements and offering unparalleled
flexibility for clinical settings.
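
The Dice Similarity Coefficient used for the comparison is DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The following minimal sketch computes it on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details from the paper.

```python
def dice_coefficient(pred, ref):
    """Dice Similarity Coefficient for two flattened binary masks."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total
```

For example, a prediction with two foreground voxels, one of which overlaps a single-voxel reference, scores 2·1 / (2 + 1) ≈ 0.67.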

eess.SY [Back]

[136] Energy Aware Camera Location Search Algorithm for Increasing Precision of Observation in Automated Manufacturing

Rongfei Li,Francis Assadian

Main category: eess.SY

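TL;DR: The paper proposes a camera location search algorithm for visual servoing in automated manufacturing. By optimizing the camera's moving policy and learning from the environment, it improves observation precision under an energy constraint.

Details Motivation: In automated manufacturing, the camera's location is critical to the precision of visual servoing. The paper addresses how the choice of location affects image noise level and observation precision, and optimizes the camera's moving policy to reduce energy use.

Contribution: (1) Proposes a camera location search algorithm that learns the environment to search more efficiently; (2) uses an image averaging technique to improve observation accuracy without filtering out high-frequency information; (3) ensures that, with limited energy, the camera ends at a suboptimal location if the optimal one is unreachable.

Method: (1) Explores the workspace with a camera moving policy; (2) uses an image averaging technique when assessing image noise level; (3) adapts the search path through learning while accounting for the energy limit.

Result: In a simulated automated manufacturing application, the algorithm improves observation precision and selects a suboptimal location when energy is limited.

Insight: The paper highlights the effect of camera location on visual servoing precision and offers an efficient search strategy, based on learning the environment, for camera placement in automated manufacturing.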

Abstract: Visual servoing technology has been well developed and applied in many
automated manufacturing tasks, especially in tools’ pose alignment. To access a
full global view of tools, most applications adopt eye-to-hand configuration or
eye-to-hand/eye-in-hand cooperation configuration in an automated manufacturing
environment. Most research papers mainly put efforts into developing control
and observation architectures in various scenarios, but few of them have
discussed the importance of the camera’s location in eye-to-hand configuration.
In a manufacturing environment, the quality of camera estimations may vary
significantly from one observation location to another, as the combined effects
of environmental conditions result in different noise levels of a single image
shot at different locations. In this paper, we propose an algorithm for the
camera’s moving policy so that it explores the camera workspace and searches
for the optimal location where the images’ noise level is minimized. Also, this
algorithm ensures the camera ends up at a suboptimal (if the optimal one is
unreachable) location among the locations already searched, with limited energy
available for moving the camera. Unlike a simple brute-force approach, the
algorithm lets the camera explore the space more efficiently by adapting the
search policy as it learns the environment. With the aid of an image averaging
technique, the algorithm, using a single camera, achieves the desired
observation accuracy in eye-to-hand configurations without filtering out
high-frequency information in the original image. An automated manufacturing
application has been simulated, and the results show that the algorithm
improves observation precision under a limited energy budget.
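
The abstract does not detail its image averaging technique. The sketch below shows the general principle, assuming repeated shots of a static scene with independent zero-mean noise: averaging N frames lowers the noise standard deviation by a factor of about √N without applying any spatial filter to the image. The function names, the Gaussian noise model, and the flat test scene are assumptions for illustration.

```python
import random
import statistics


def noisy_frame(true_pixels, sigma, rng):
    """One simulated shot: true pixel values plus zero-mean Gaussian noise."""
    return [p + rng.gauss(0.0, sigma) for p in true_pixels]


def average_frames(frames):
    """Pixel-wise mean over several shots of the same static scene."""
    n = len(frames)
    return [sum(pixel_values) / n for pixel_values in zip(*frames)]


def residual_std(frame, true_pixels):
    """Standard deviation of the remaining per-pixel error."""
    return statistics.pstdev([f - t for f, t in zip(frame, true_pixels)])


rng = random.Random(0)             # fixed seed so the run is repeatable
truth = [100.0] * 2000             # hypothetical flat scene
single = noisy_frame(truth, 10.0, rng)
averaged = average_frames([noisy_frame(truth, 10.0, rng) for _ in range(16)])
print(residual_std(single, truth), residual_std(averaged, truth))
```

With 16 frames the residual noise should come out near a quarter of the single-shot value, since √16 = 4.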

[137] Semi-Tensor-Product Based Convolutional Neural Networks

Daizhan Cheng

Main category: eess.SY

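TL;DR: The paper proposes a new convolutional product (CP) based on the semi-tensor product (STP). Combining a domain-based CP with the STP of vectors avoids the junk information introduced by padding in conventional convolution, and the result is used to build an STP-based CNN applied to image and third-order signal identification.

Details Motivation: Padding in conventional convolution (such as zero padding) can introduce junk information and hurt model performance. The study uses the STP's ability to handle vectors of different dimensions to design a convolution that needs no padding.

Contribution: 1. Proposes a domain-based convolutional product (CP); 2. combines the STP with the CP to design a new convolution operation; 3. builds a padding-free STP-based CNN.

Method: Proposes a domain-based CP and combines it with the STP of vectors to obtain a convolution that needs no padding, then builds an STP-based CNN on top of it.

Result: The method is applied to image and third-order signal identification and avoids the junk information that conventional padding introduces.

Insight: The STP's flexibility with dimensions opens a new way to design convolution operations; removing padding simplifies the model and avoids contaminating the input.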

Abstract: The semi-tensor product (STP) of vectors is a generalization of the
conventional inner product of vectors, which allows the factor vectors to be of
different dimensions. This paper proposes a domain-based convolutional product (CP).
Combining domain-based CP with STP of vectors, a new CP is proposed. Since
there is no zero or any other padding, it can avoid the junk information caused
by padding. Using it, the STP-based convolutional neural network (CNN) is
developed. Its application to image and third-order signal identification is
considered.

cs.MM [Back]

[138] Multimodal Large Language Models: A Survey

Longzhen Han,Awes Mubarak,Almas Baimagambetov,Nikolaos Polatidis,Thar Baker

Main category: cs.MM

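TL;DR: This survey, "Multimodal Large Language Models: A Survey", systematically reviews the development of multimodal large language models (MLLMs), covering models that extend beyond text generation to multiple sensory modalities.

Details Motivation: As multimodal techniques advance rapidly, researchers need to integrate language with other sensory modalities in order to build more general-purpose and adaptive multimodal systems.

Contribution: Categorizes six primary generative modalities and examines how key techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities.

Method: Analyzes key models, architectural trends, and cross-modal synergies, and summarizes how architectural innovations such as transformers and diffusion models support cross-modal transfer and modular specialization.

Result: Identifies open challenges in evaluation, modularity, and structured reasoning, and offers a unified perspective on the future development of MLLMs.

Insight: Cross-modal synergy and technique transfer are central to MLLM development; future work should focus on making systems more general-purpose, adaptive, and interpretable.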

Abstract: Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text
generation, now spanning diverse output modalities including images, music,
video, human motion, and 3D objects, by integrating language with other sensory
modalities under unified architectures. This survey categorises six primary
generative modalities and examines how foundational techniques, namely
Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement
Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting,
enable cross-modal capabilities. We analyze key models, architectural trends,
and emergent cross-modal synergies, while highlighting transferable techniques
and unresolved challenges. Architectural innovations like transformers and
diffusion models underpin this convergence, enabling cross-modal transfer and
modular specialization. We highlight emerging patterns of synergy, and identify
open challenges in evaluation, modularity, and structured reasoning. This
survey offers a unified perspective on MLLM development and identifies critical
paths toward more general-purpose, adaptive, and interpretable multimodal
systems.

[139] EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis

Jianwu Fang,Lei-Lei Li,Zhedong Zheng,Hongkai Yu,Jianru Xue,Zhengguo Li,Tat-Seng Chua

Main category: cs.MM

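TL;DR: The paper proposes an Attentive Video Diffusion (AVD) model that synthesizes accident video clips to counter data bias in Traffic Accident Anticipation (TAA), and pairs it with an equivariant triple loss (EQ-TAA) to improve model performance.

Details Motivation: Current TAA methods depend on laborious annotation, yet the causal parts of accidents are hard to identify and are easily dominated by data bias. The paper aims to address this by generating the causal part of accident clips.

Contribution: 1. Proposes the AVD model, which generates causal video frames from text prompts; 2. proposes EQ-TAA, which uses an equivariant triple loss to improve anticipation; 3. can be trained without extra annotations.

Method: Uses a diffusion model to generate accident video clips and optimizes the anticipation model with an equivariant triple loss, modeling the causal part of accidents.

Result: Experiments show that AVD and EQ-TAA achieve competitive performance compared with state-of-the-art methods.

Insight: Generating the causal part of video clips can mitigate data bias, and the equivariant loss design is intended to make the model more robust.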

Abstract: Traffic Accident Anticipation (TAA) in traffic scenes is a challenging
problem for achieving zero fatalities in the future. Current approaches
typically treat TAA as a supervised learning task needing the laborious
annotation of accident occurrence duration. However, because traffic scenes are
inherently long-tailed, uncertain, and fast-evolving, the real causal parts of
accidents are difficult to identify and are easily dominated by data bias,
resulting in a background confounding issue. Thus, we propose an
Attentive Video Diffusion (AVD) model that synthesizes additional accident
video clips by generating the causal part in dashcam videos, i.e., from normal
clips to accident clips. AVD aims to generate causal video frames based on
accident or accident-free text prompts while preserving the style and content
of frames for TAA after video generation. This approach can be trained using
datasets collected from various driving scenes without any extra annotations.
Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant
triple loss for an anchor accident-free video clip, along with the generated
pair of contrastive pseudo-normal and pseudo-accident clips. Extensive
experiments have been conducted to evaluate the performance of AVD and EQ-TAA,
and competitive performance compared to state-of-the-art methods has been
obtained.
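
The abstract does not give the form of the equivariant triple loss. As a rough illustration of how an anchor clip and a contrastive pair can be compared, the sketch below uses the standard triplet margin loss, with the accident-free anchor pulled toward the pseudo-normal clip and pushed away from the pseudo-accident clip. This is a swapped-in, generic loss and not the paper's EQ-TAA loss. The feature vectors, the margin value, and the role assignment are assumptions.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on feature vectors (Euclidean distance)."""
    d_pos = math.dist(anchor, positive)  # anchor to pseudo-normal features
    d_neg = math.dist(anchor, negative)  # anchor to pseudo-accident features
    # Zero once the negative is at least `margin` farther away than the positive.
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero whenever the pseudo-accident features sit at least `margin` farther from the anchor than the pseudo-normal features, so only hard triplets contribute gradient in training.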

[140] HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction

Jie Qin,Wei Yang,Yan Su,Yiran Zhu,Weizhen Li,Yunyue Pan,Chengchang Pan,Honggang Qi

Main category: cs.MM

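TL;DR: An adaptive bimodal framework uses a dynamic branch selector, a bidirectional cross-modal GAN, and a hybrid training protocol to enable flexible single- or dual-modality HER2 prediction, substantially improving accuracy in both settings.

Details Motivation: Current HER2 assessment models usually analyze H&E or IHC images in isolation, although clinical practice relies on interpreting both together. Workflow complexity and cost often make it difficult to acquire both modalities at once.

Contribution: Proposes an adaptive bimodal framework that, through a dynamic branch selector, a bidirectional cross-modal GAN, and a hybrid training protocol, enables flexible single- or dual-modality prediction and improves accuracy and resource efficiency.

Method: 1) A dynamic branch selector activates single-modality reconstruction or dual-modality joint inference depending on input completeness; 2) a bidirectional cross-modal GAN performs context-aware feature-space reconstruction of the missing modality; 3) a hybrid training protocol combines adversarial learning and multi-task optimization.

Result: Single-modality H&E accuracy rises from 71.44% to 94.25%, dual-modality accuracy reaches 95.09%, and IHC-only input retains 90.28% reliability.

Insight: By routing inputs dynamically and reconstructing the missing modality, the framework mitigates the performance loss caused by missing data while preserving computational efficiency, which suits resource-limited settings.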

Abstract: Current HER2 assessment models for breast cancer predominantly analyze H&E or
IHC images in isolation, despite clinical reliance on their synergistic
interpretation. However, concurrent acquisition of both modalities is often
hindered by workflow complexity and cost constraints. We propose an adaptive
bimodal framework enabling flexible single-/dual-modality HER2 prediction
through three innovations: 1) A dynamic branch selector that activates either
single-modality reconstruction or dual-modality joint inference based on input
completeness; 2) A bidirectional cross-modal GAN performing context-aware
feature-space reconstruction of missing modalities; 3) A hybrid training
protocol integrating adversarial learning and multi-task optimization. This
architecture elevates single-modality H&E prediction accuracy from 71.44% to
94.25% while achieving 95.09% dual-modality accuracy, maintaining 90.28%
reliability with sole IHC inputs. The framework’s “dual-preferred,
single-compatible” design delivers near-bimodal performance without requiring
synchronized acquisition, particularly benefiting resource-limited settings
through IHC infrastructure cost reduction. Experimental validation confirms
22.81%/12.90% accuracy improvements over H&E/IHC baselines respectively, with
cross-modal reconstruction enhancing F1-scores to 0.9609 (HE to IHC) and 0.9251
(IHC to HE). By dynamically routing inputs through reconstruction-enhanced or
native fusion pathways, the system mitigates performance degradation from
missing data while preserving computational efficiency (78.55% parameter
reduction in lightweight variant). This elastic architecture demonstrates
significant potential for democratizing precise HER2 assessment across diverse
healthcare settings.
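
The routing behavior attributed to the dynamic branch selector can be sketched as a simple dispatch on which inputs are present. This is a minimal illustration and not the authors' implementation: the function name, the branch labels, and the rule of reconstructing the missing modality before fusion are assumptions drawn from the abstract's description of "reconstruction-enhanced or native fusion pathways".

```python
def select_branch(he_image=None, ihc_image=None):
    """Pick an inference path from input completeness (hypothetical labels)."""
    if he_image is not None and ihc_image is not None:
        return "dual_modality_fusion"       # both inputs present: native fusion
    if he_image is not None:
        return "reconstruct_ihc_then_fuse"  # H&E only: reconstruct IHC features
    if ihc_image is not None:
        return "reconstruct_he_then_fuse"   # IHC only: reconstruct H&E features
    raise ValueError("at least one of H&E or IHC input is required")
```

For instance, `select_branch(he_image="slide_he.png")` would route to the IHC reconstruction path, which corresponds to the single-modality H&E setting reported in the abstract.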

[141] Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space

Kangwei Liu,Junwu Liu,Xiaowei Yi,Jinlin Guo,Yun Cao

Main category: cs.MM

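TL;DR: The paper proposes a diffusion-based method for generating 3D facial animation that uses multimodal emotion binding and attention mechanisms to provide flexible emotion control and rich motion diversity.

Details Motivation: Existing audio-driven emotional 3D facial animation methods have two main problems: they rely on single-modal control signals and do not exploit the complementary strengths of multiple modalities, and their deterministic regression-based mapping constrains the stochastic nature of emotional expressions and non-verbal behaviors.

Contribution: 1. Proposes a FLAME-centered multimodal emotion binding strategy that aligns text, audio, and emotion labels through contrastive learning; 2. designs a latent diffusion model with content-aware attention and emotion-guided layers that increases motion diversity while maintaining temporal coherence and natural facial dynamics.

Method: 1. Aligns the different modalities with the multimodal emotion binding strategy; 2. uses attention mechanisms to strengthen the latent diffusion model.

Result: Experiments show a 21.6% improvement in emotion similarity over existing methods while preserving physiologically plausible facial dynamics.

Insight: Using multiple modalities together and introducing a diffusion model improve the emotional expressiveness and diversity of 3D facial animation.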

Abstract: Audio-driven emotional 3D facial animation encounters two significant
challenges: (1) reliance on single-modal control signals (videos, text, or
emotion labels) without leveraging their complementary strengths for
comprehensive emotion manipulation, and (2) deterministic regression-based
mapping that constrains the stochastic nature of emotional expressions and
non-verbal behaviors, limiting the expressiveness of synthesized animations. To
address these challenges, we present a diffusion-based framework for
controllable expressive 3D facial animation. Our approach introduces two key
innovations: (1) a FLAME-centered multimodal emotion binding strategy that
aligns diverse modalities (text, audio, and emotion labels) through contrastive
learning, enabling flexible emotion control from multiple signal sources, and
(2) an attention-based latent diffusion model with content-aware attention and
emotion-guided layers, which enriches motion diversity while maintaining
temporal coherence and natural facial dynamics. Extensive experiments
demonstrate that our method outperforms existing approaches across most
metrics, achieving a 21.6% improvement in emotion similarity while preserving
physiologically plausible facial dynamics. Project Page:
https://kangweiiliu.github.io/Control_3D_Animation.
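
The abstract says only that the modalities are aligned "through contrastive learning". As an illustration of what such an objective can look like, here is a minimal InfoNCE-style loss in pure Python. The choice of InfoNCE, the cosine similarity, and the temperature value are assumptions made for this sketch, not details confirmed by the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] is paired with positives[i].

    Every other entry of `positives` acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the logits.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy embeddings: e.g. audio embeddings as anchors, text embeddings as positives.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]
```

With these toy vectors, `info_nce(audio, text_aligned)` is close to zero while `info_nce(audio, text_swapped)` is large, which is the behaviour a contrastive alignment objective is meant to reward.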

[142] Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics

Yi-Chun Chen

Main category: cs.MM

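TL;DR: This paper proposes a hierarchical knowledge graph framework for understanding visual narratives in multimodal media such as comics. The method decomposes narrative content into multiple levels and uses integrated knowledge graphs to capture semantic, spatial, and temporal relationships, supporting reasoning across diverse narrative tasks.

Details Motivation: The aim is to address the difficulty of understanding complex multimodal relationships in visual narratives such as comics, using structured representations to support reasoning tasks.

Contribution: The main contribution is a hierarchical knowledge graph framework that integrates visual and textual information and supports narrative reasoning at multiple levels.

Method: The method builds multimodal graphs that link visual elements (characters, objects, actions) with textual components (dialogue, captions), and integrates these graphs across narrative levels.

Result: Evaluated on a manually annotated subset of the Manga109 dataset, the framework shows high precision and recall across diverse narrative tasks, including action retrieval, dialogue tracing, and character appearance mapping.

Insight: The study demonstrates the effectiveness of structured graph representations for multimodal narrative analysis and provides a scalable foundation for interactive storytelling and multimodal reasoning.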

Abstract: This paper presents a hierarchical knowledge graph framework for the
structured understanding of visual narratives, focusing on multimodal media
such as comics. The proposed method decomposes narrative content into multiple
levels, from macro-level story arcs to fine-grained event segments. It
represents them through integrated knowledge graphs that capture semantic,
spatial, and temporal relationships. At the panel level, we construct
multimodal graphs that link visual elements such as characters, objects, and
actions with corresponding textual components, including dialogue and captions.
These graphs are integrated across narrative levels to support reasoning over
story structure, character continuity, and event progression.
We apply our approach to a manually annotated subset of the Manga109 dataset
and demonstrate its ability to support symbolic reasoning across diverse
narrative tasks, including action retrieval, dialogue tracing, character
appearance mapping, and panel timeline reconstruction. Evaluation results show
high precision and recall across tasks, validating the coherence and
interpretability of the framework. This work contributes a scalable foundation
for narrative-based content analysis, interactive storytelling, and multimodal
reasoning in visual media.
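
The hierarchy described in this abstract (story arcs, event segments, panel-level graphs) can be illustrated with a small data-structure sketch. The class names, the relation labels `performs` and `says`, and the sample characters are all hypothetical; the paper's actual schema is not given in the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class Panel:
    panel_id: str
    order: int  # reading order, used for timeline reconstruction
    triples: List[Triple] = field(default_factory=list)


@dataclass
class Event:
    name: str
    panels: List[Panel] = field(default_factory=list)


@dataclass
class StoryArc:
    title: str
    events: List[Event] = field(default_factory=list)


def all_panels(arc: StoryArc) -> List[Panel]:
    """Flatten the hierarchy into panels sorted by reading order."""
    return sorted((p for e in arc.events for p in e.panels), key=lambda p: p.order)


def panels_with_action(arc: StoryArc, action: str) -> List[str]:
    """Action retrieval: ids of panels where any character performs `action`."""
    return [
        p.panel_id
        for p in all_panels(arc)
        if any(rel == "performs" and obj == action for _, rel, obj in p.triples)
    ]


def dialogue_trace(arc: StoryArc, speaker: str) -> List[Tuple[str, str]]:
    """Dialogue tracing: (panel id, utterance) pairs for one speaker, in order."""
    return [
        (p.panel_id, obj)
        for p in all_panels(arc)
        for subj, rel, obj in p.triples
        if subj == speaker and rel == "says"
    ]


arc = StoryArc("demo arc", [
    Event("chase", [
        Panel("p1", 1, [("Aya", "performs", "run"), ("Aya", "says", "Wait!")]),
        Panel("p2", 2, [("Ken", "performs", "jump")]),
    ]),
    Event("reunion", [
        Panel("p3", 3, [("Aya", "performs", "run"), ("Aya", "says", "Found you.")]),
    ]),
])
```

Queries such as `panels_with_action(arc, "run")` or `dialogue_trace(arc, "Aya")` then correspond to the symbolic reasoning tasks listed in the abstract, here reduced to simple filters over triples.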

[143] WDMIR: Wavelet-Driven Multimodal Intent Recognition

Weiyin Gong,Kai Zhang,Yanghai Zhang,Qi Liu,Xinjie Sun,Junyu Lu,Linbo Zhu

Main category: cs.MM

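TL;DR: WDMIR is a wavelet-based multimodal intent recognition framework that uses frequency-domain analysis to improve the extraction of semantics from non-verbal information, yielding better performance.

Details Motivation: Existing methods rely too heavily on text analysis and overlook the rich semantic content of non-verbal information; WDMIR aims to close this gap through frequency-domain analysis.

Contribution: 1. A wavelet-driven multimodal fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain; 2. A cross-modal interaction mechanism that progressively enhances features from bimodal to trimodal fusion.

Method: Wavelet transforms are applied to video-audio features for frequency-domain analysis, combined with a cross-modal interaction mechanism that fuses the multimodal features progressively.

Result: State-of-the-art performance on the MIntRec dataset, with accuracy improved by 1.13%; ablations attribute a 0.41% gain in recognition accuracy on subtle emotional cues to the wavelet-driven fusion module.

Insight: Frequency-domain analysis such as the wavelet transform can effectively capture the dynamic semantics of non-verbal information, and progressive cross-modal fusion is important for intent recognition.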

Abstract: Multimodal intent recognition (MIR) seeks to accurately interpret user
intentions by integrating verbal and non-verbal information across video, audio
and text modalities. While existing approaches prioritize text analysis, they
often overlook the rich semantic content embedded in non-verbal cues. This
paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR)
framework that enhances intent understanding through frequency-domain analysis
of non-verbal information. To be more specific, we propose: (1) a
wavelet-driven fusion module that performs synchronized decomposition and
integration of video-audio features in the frequency domain, enabling
fine-grained analysis of temporal dynamics; (2) a cross-modal interaction
mechanism that facilitates progressive feature enhancement from bimodal to
trimodal integration, effectively bridging the semantic gap between verbal and
non-verbal information. Extensive experiments on MIntRec demonstrate that our
approach achieves state-of-the-art performance, surpassing previous methods by
1.13% on accuracy. Ablation studies further verify that the wavelet-driven
fusion module significantly improves the extraction of semantic information
from non-verbal sources, with a 0.41% increase in recognition accuracy when
analyzing subtle emotional cues.
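
To make "synchronized decomposition and integration in the frequency domain" concrete, the following sketch applies a one-level Haar wavelet transform to two feature sequences, averages the matching sub-bands, and inverts the transform. The Haar wavelet and the plain averaging rule are assumptions chosen for brevity; the abstract does not state which wavelet or fusion rule WDMIR uses.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]  # low-frequency band
    detail = [(a - b) / SQRT2 for a, b in pairs]  # high-frequency band
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_step, recovering the original sequence."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def fuse(video, audio):
    """Placeholder fusion: average matching sub-bands, then reconstruct."""
    v_approx, v_detail = haar_step(video)
    a_approx, a_detail = haar_step(audio)
    fused_approx = [(x + y) / 2 for x, y in zip(v_approx, a_approx)]
    fused_detail = [(x + y) / 2 for x, y in zip(v_detail, a_detail)]
    return haar_inverse(fused_approx, fused_detail)
```

Because plain averaging is linear, this placeholder reduces to the time-domain mean of the two sequences. A learned module would presumably weight the low- and high-frequency bands differently, which is where a frequency-domain formulation becomes useful.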

cs.IR

[144] Conversational Search: From Fundamentals to Frontiers in the LLM Era

Fengran Mo,Chuan Meng,Mohammad Aliannejadi,Jian-Yun Nie

Main category: cs.IR

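TL;DR: This tutorial covers the fundamentals of conversational search and the frontier research driven by large language models (LLMs), aiming to give researchers and practitioners in academia and industry a comprehensive grounding.

Details Motivation: Conversational search meets complex information needs through multi-turn interaction. The arrival of LLMs brings new opportunities and challenges, which calls for revisiting how the field should develop.

Contribution: 1. Shows how LLM capabilities (instruction following, content generation, reasoning) are reshaping conversational search; 2. Connects the fundamentals with frontier research in a comprehensive way.

Method: Tutorial format, combining theoretical introduction with analysis of frontier case studies.

Result: Participants will learn the core principles and emerging techniques needed to build next-generation conversational search systems.

Insight: Applying LLMs to conversational search raises the level of system intelligence and also opens new research challenges, such as context understanding and dynamic interaction optimization.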

Abstract: Conversational search enables multi-turn interactions between users and
systems to fulfill users’ complex information needs. During this interaction,
the system should understand the users’ search intent within the conversational
context and then return the relevant information through a flexible,
dialogue-based interface. Recent powerful large language models (LLMs), with
capabilities in instruction following, content generation, and reasoning, have
attracted significant attention and driven rapid advances, providing new
opportunities and challenges for building intelligent conversational search systems. This
tutorial aims to introduce the connection between fundamentals and the emerging
topics revolutionized by LLMs in the context of conversational search. It is
designed for students, researchers, and practitioners from both academia and
industry. Participants will gain a comprehensive understanding of both the core
principles and cutting-edge developments driven by LLMs in conversational
search, equipping them with the knowledge needed to contribute to the
development of next-generation conversational search systems.

cs.AI

[145] One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence

Michelle M. Li,Ben Y. Reis,Adam Rodman,Tianxi Cai,Noa Dagan,Ran D. Balicer,Joseph Loscalzo,Isaac S. Kohane,Marinka Zitnik

Main category: cs.AI

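TL;DR: This paper proposes the concept of context-switching in medical AI: dynamically adjusting model behavior to fit different medical settings and to avoid the errors caused by fixed training.

Details Motivation: Current medical AI models need fine-tuning or prompting to adapt to new settings, populations, or specialties, and struggle to respond dynamically to complex and changing clinical situations, which leads to contextual errors.

Contribution: Outlines a vision for context-switching AI in which models adapt their reasoning across specialties, populations, and clinical settings without retraining.

Method: Adaptive reasoning across medical settings through dynamic behavior adjustment and context awareness.

Result: The long-term goal is AI that can diagnose and treat across specialties and regions, expanding access to medical care.

Insight: Medical AI needs stronger context adaptation to overcome the limitations of fixed training and serve diverse clinical needs.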

Abstract: Medical foundation models, including language models trained on clinical
notes, vision-language models on medical images, and multimodal models on
electronic health records, can summarize clinical notes, answer medical
questions, and assist in decision-making. Adapting these models to new
populations, specialties, or settings typically requires fine-tuning, careful
prompting, or retrieval from knowledge bases. This can be impractical, and
limits their ability to interpret unfamiliar inputs and adjust to clinical
situations not represented during training. As a result, models are prone to
contextual errors, where predictions appear reasonable but fail to account for
critical patient-specific or contextual information. These errors stem from a
fundamental limitation that current models struggle with: dynamically adjusting
their behavior across evolving contexts of medical care. In this Perspective,
we outline a vision for context-switching in medical AI: models that
dynamically adapt their reasoning without retraining to new specialties,
populations, workflows, and clinical roles. We envision context-switching AI to
diagnose, manage, and treat a wide range of diseases across specialties and
regions, and expand access to medical care.

[146] Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou,Yiheng Wang,Xuming He,Ruoyao Xiao,Zhiwei Li,Qiantai Feng,Zijie Guo,Yuejin Yang,Hao Wu,Wenxuan Huang,Jiaqi Wei,Dan Si,Xiuqi Yao,Jia Bu,Haiwen Huang,Tianfan Fu,Shixiang Tang,Ben Fei,Dongzhan Zhou,Fenghua Ling,Yan Lu,Siqi Sun,Chenhui Li,Guanjie Zheng,Jiancheng Lv,Wenlong Zhang,Lei Bai

Main category: cs.AI

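TL;DR: This paper presents the Scientists’ First Exam (SFE) benchmark for evaluating the perception, understanding, and reasoning abilities of multimodal large language models (MLLMs) in scientific domains, filling a gap in existing evaluations.

Details Motivation: Scientific discovery depends on complex multimodal reasoning, but current benchmarks for MLLMs focus mainly on knowledge understanding and do not adequately assess perception and reasoning.

Contribution: 1. The SFE benchmark, covering three levels: scientific signal perception, attribute understanding, and comparative reasoning; 2. 830 expert-verified VQA pairs spanning five high-value disciplines; 3. Evidence that current state-of-the-art models (such as GPT-o3 and InternVL-3) still have significant room for improvement in scientific domains.

Method: SFE evaluates the scientific cognitive abilities of MLLMs through three levels (perception, understanding, and reasoning). The test items are multimodal question-answering problems spanning 66 tasks.

Result: GPT-o3 and InternVL-3 score 34.08% and 26.52% on SFE respectively, showing that MLLMs still have considerable room for improvement in scientific domains.

Insight: MLLMs for science need stronger perception and reasoning abilities; SFE offers a reference point for future research on AI-supported scientific discovery.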

Abstract: Scientific discoveries increasingly rely on complex multimodal reasoning
based on information-intensive scientific data and domain-specific expertise.
Empowered by expert-level scientific benchmarks, scientific Multimodal Large
Language Models (MLLMs) hold the potential to significantly enhance this
discovery process in realistic workflows. However, current scientific
benchmarks mostly focus on evaluating the knowledge understanding capabilities
of MLLMs, leading to an inadequate assessment of their perception and reasoning
abilities. To address this gap, we present the Scientists’ First Exam (SFE)
benchmark, designed to evaluate the scientific cognitive capacities of MLLMs
through three interconnected levels: scientific signal perception, scientific
attribute understanding, and scientific comparative reasoning. Specifically, SFE
comprises 830 expert-verified VQA pairs across three question types, spanning
66 multimodal tasks across five high-value disciplines. Extensive experiments
reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08%
and 26.52% on SFE, highlighting significant room for MLLMs to improve in
scientific realms. We hope the insights obtained in SFE will facilitate further
developments in AI-enhanced scientific discoveries.

[147] TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving

Vincenzo Colle,Mohamed Sana,Nicola Piovesan,Antonio De Domenico,Fadhel Ayed,Merouane Debbah

Main category: cs.AI

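TL;DR: This paper introduces TeleMath, a benchmark dataset built specifically to evaluate how well large language models (LLMs) solve mathematical problems in the telecommunications domain, covering topics such as signal processing and network optimization. The evaluation finds that models designed for mathematical or logical reasoning perform best, while general-purpose models struggle even with large parameter counts.

Details Motivation: Demand for mathematically intensive tasks in telecommunications is growing, but the mathematical reasoning ability of existing LLMs in specialized domains has not been fully explored. The authors aim to fill this gap with TeleMath.

Contribution: Releases TeleMath, the first benchmark dataset for mathematical problem solving in the telecommunications domain, containing 500 QnA pairs, and shows the advantage of models designed for mathematical reasoning.

Method: Starting from seed problems crafted by experts, the authors build a QnA generation pipeline and evaluate a range of open-source LLMs.

Result: Models designed for mathematical or logical reasoning perform best on TeleMath, while general-purpose models perform poorly.

Insight: Mathematical problem solving in specialized domains such as telecommunications calls for purpose-built LLMs rather than general-purpose models that simply have more parameters.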

Abstract: The increasing adoption of artificial intelligence in telecommunications has
raised interest in the capability of Large Language Models (LLMs) to address
domain-specific, mathematically intensive tasks. Although recent advancements
have improved the performance of LLMs in general mathematical reasoning, their
effectiveness within specialized domains, such as signal processing, network
optimization, and performance analysis, remains largely unexplored. To address
this gap, we introduce TeleMath, the first benchmark dataset specifically
designed to evaluate LLM performance in solving mathematical problems with
numerical solutions in the telecommunications domain. Comprising 500
question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the
telecommunications field. This paper outlines the proposed QnA generation
pipeline, starting from a selected seed of problems crafted by Subject Matter
Experts. The evaluation of a wide range of open-source LLMs reveals that the best
performance on TeleMath is achieved by recent models explicitly designed for
mathematical or logical reasoning. In contrast, general-purpose models, even
those with a large number of parameters, often struggle with these challenges.
We have released the dataset and the evaluation code to ease result
reproducibility and support future research.
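
Since TeleMath targets problems with numerical solutions, grading can in principle be done by comparing numbers within a tolerance. The sketch below shows one such scheme. The 1% relative tolerance and the function names are assumptions made for illustration; the abstract does not describe the benchmark's actual scoring rule.

```python
import math


def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """True if the prediction matches the reference within a relative tolerance.

    A tiny absolute tolerance is added so that a reference of 0.0 can be matched.
    """
    return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=1e-12)


def accuracy(pairs, rel_tol: float = 1e-2) -> float:
    """Fraction of (predicted, reference) pairs judged correct."""
    if not pairs:
        raise ValueError("no answers to grade")
    return sum(is_correct(p, r, rel_tol) for p, r in pairs) / len(pairs)
```

For example, `accuracy([(3.0, 3.01), (10.0, 12.0)])` counts the first answer as correct (within 1%) and the second as wrong. A tolerance-based comparison avoids penalising rounding differences, at the cost of having to choose the tolerance.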

[148] Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Fei Lin,Ziyang Gong,Cong Wang,Yonglin Tian,Tengchao Zhang,Xue Yang,Gen Luo,Fei-Yue Wang

Main category: cs.AI

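TL;DR: This paper introduces ToxiMol, the first benchmark task focused on molecular toxicity repair, builds a standardized dataset, and proposes ToxiEval, an automated evaluation framework. Experiments show that current multimodal large language models (MLLMs) still face challenges on this task but show promise in toxicity understanding, semantic constraint adherence, and structure-aware editing.

Details Motivation: Toxicity is a leading cause of failure in early-stage drug development, yet molecular toxicity repair has lacked a systematic definition and a benchmark task.

Contribution: 1. ToxiMol, the first benchmark task for molecular toxicity repair; 2. A standardized dataset covering 11 primary tasks; 3. ToxiEval, an automated evaluation framework.

Method: 1. A prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert knowledge; 2. An evaluation chain that integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity.

Result: The experiments assess nearly 30 mainstream MLLMs and show that toxicity repair remains challenging for them, although they already show promise in some respects.

Insight: Applying MLLMs to molecular toxicity repair needs further research, but their ability to understand toxicity and edit structures points to a direction for future work.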

Abstract: Toxicity remains a leading cause of early-stage drug development failure.
Despite advances in molecular design and property prediction, the task of
molecular toxicity repair - generating structurally valid molecular
alternatives with reduced toxicity - has not yet been systematically defined or
benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task
for general-purpose Multimodal Large Language Models (MLLMs) focused on
molecular toxicity repair. We construct a standardized dataset covering 11
primary tasks and 560 representative toxic molecules spanning diverse
mechanisms and granularities. We design a prompt annotation pipeline with
mechanism-aware and task-adaptive capabilities, informed by expert
toxicological knowledge. In parallel, we propose an automated evaluation
framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic
accessibility, drug-likeness, and structural similarity into a high-throughput
evaluation chain for repair success. We systematically assess nearly 30
mainstream general-purpose MLLMs and design multiple ablation studies to
analyze key factors such as evaluation criteria, candidate diversity, and
failure attribution. Experimental results show that although current MLLMs
still face significant challenges on this task, they begin to demonstrate
promising capabilities in toxicity understanding, semantic constraint
adherence, and structure-aware molecule editing.
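
The evaluation chain described for ToxiEval combines four criteria into one success decision. A minimal sketch of such a chain is given below, assuming that a candidate counts as a successful repair only if it passes every check. The field names, score scales, and thresholds are placeholders invented for this sketch and are not the paper's values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CandidateScores:
    toxicity_prob: float            # predicted probability of the toxic endpoint
    synthetic_accessibility: float  # assumed scale where lower means easier to make
    drug_likeness: float            # assumed score in [0, 1], higher is better
    similarity: float               # structural similarity to the original, in [0, 1]


# Placeholder thresholds, for illustration only.
THRESHOLDS: Dict[str, float] = {
    "toxicity_prob": 0.5,
    "synthetic_accessibility": 6.0,
    "drug_likeness": 0.5,
    "similarity": 0.4,
}


def failed_checks(c: CandidateScores, t: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Names of the criteria a candidate fails, in chain order."""
    failures = []
    if c.toxicity_prob >= t["toxicity_prob"]:
        failures.append("toxicity")
    if c.synthetic_accessibility > t["synthetic_accessibility"]:
        failures.append("synthetic_accessibility")
    if c.drug_likeness < t["drug_likeness"]:
        failures.append("drug_likeness")
    if c.similarity < t["similarity"]:
        failures.append("similarity")
    return failures


def repair_succeeds(c: CandidateScores) -> bool:
    """A repair succeeds only if no criterion fails."""
    return not failed_checks(c)
```

Returning the list of failed criteria, rather than a single boolean, also supports the kind of failure attribution analysis the abstract mentions.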