Table of Contents
- cs.CL [Total: 35]
- cs.CV [Total: 77]
- cs.MM [Total: 6]
- cs.GR [Total: 2]
- cs.LG [Total: 8]
- eess.SY [Total: 2]
- cs.AI [Total: 4]
- cs.MA [Total: 1]
- eess.IV [Total: 6]
- cs.CR [Total: 1]
- cs.SD [Total: 1]
- physics.med-ph [Total: 1]
- cs.RO [Total: 3]
- cs.IR [Total: 1]
cs.CL
[1] TaskCraft: Automated Generation of Agentic Tasks
Dingfeng Shi,Jingyi Cao,Qianben Chen,Weichen Sun,Weizhen Li,Hongxuan Lu,Fangchen Dong,Tianrui Qin,King Zhu,Minghao Yang,Jian Yang,Ge Zhang,Jiaheng Liu,Changwang Zhang,Jun Wang,Yuchen Eleanor Jiang,Wangchunshu Zhou
Main category: cs.CL
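TL;DR: TaskCraft is a framework that automatically generates agentic tasks with multi-tool interaction, scalable difficulty, and verifiable execution trajectories, addressing the lack of tool interaction in existing datasets and the high cost of human annotation.
Details
Motivation: Existing instruction data lacks tool interaction, and agentic task benchmarks rely mainly on human annotation, which is costly and hard to scale. An automated method is therefore needed to generate diverse agentic tasks with controllable difficulty.
Contribution: 1. Proposes the TaskCraft framework, which automatically generates multi-tool agentic tasks with scalable difficulty; 2. Generates structurally and hierarchically complex tasks through depth-based and width-based extensions; 3. Releases a synthetic dataset of about 36,000 tasks.
Method: 1. Expand atomic tasks using depth-based and width-based extensions; 2. Generate multi-tool tasks with execution trajectories; 3. Optimize prompt generation and the supervised fine-tuning pipeline.
Result: Experiments show that the generated tasks improve prompt optimization and supervised fine-tuning, supporting performance gains for agentic models.
Insight: Automated task generation is an effective way to address the scarcity of agentic task data and the high cost of annotation, and task difficulty can be controlled by adjusting the extension method.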
Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool
use, and adaptive reasoning, are becoming increasingly central to the
advancement of NLP and AI. However, existing instruction data lacks tool
interaction, and current agentic benchmarks rely on costly human annotation,
limiting their scalability. We introduce TaskCraft, an automated
workflow for generating difficulty-scalable, multi-tool, and verifiable agentic
tasks with execution trajectories. TaskCraft expands atomic tasks using
depth-based and width-based extensions to create structurally and
hierarchically complex challenges. Empirical results show that these tasks
improve prompt optimization in the generation workflow and enhance supervised
fine-tuning of agentic foundation models. We present a large-scale synthetic
dataset of approximately 36,000 tasks with varying difficulty to support future
research on agent tuning and evaluation.
[2] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information
Christodoulos Constantinides,Shuxin Lin,Nianjun Zhou,Dhaval Patel
Main category: cs.CL
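TL;DR: This paper proposes Chat-of-Thought, a multi-agent system for generating FMEA documents for industrial assets. The system optimizes content generation and validation through collaborating LLM agents with distinct roles and dynamic task routing.
Details
Motivation: Generating FMEA documents for industrial equipment monitoring poses efficiency and accuracy challenges that traditional methods struggle to meet.
Contribution: Proposes the Chat-of-Thought system, which incorporates dynamic multi-persona discussions to improve the generation and iterative refinement of FMEA documents.
Method: Uses collaborating multi-role LLM agents and dynamic task routing, combined with template-driven workflows and context-aware agent collaboration.
Result: Demonstrates the potential of Chat-of-Thought in industrial equipment monitoring, where it can efficiently generate and validate FMEA documents.
Insight: Multi-agent collaboration and dynamic discussion can substantially improve the quality and efficiency of domain-specific information generation.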
Abstract: This paper presents a novel multi-agent system called Chat-of-Thought,
designed to facilitate the generation of Failure Modes and Effects Analysis
(FMEA) documents for industrial assets. Chat-of-Thought employs multiple
collaborative Large Language Model (LLM)-based agents with specific roles,
leveraging advanced AI techniques and dynamic task routing to optimize the
generation and validation of FMEA tables. A key innovation in this system is
the introduction of a Chat of Thought, where dynamic, multi-persona-driven
discussions enable iterative refinement of content. This research explores the
application domain of industrial equipment monitoring, highlights key
challenges, and demonstrates the potential of Chat-of-Thought in addressing
these challenges through interactive, template-driven workflows and
context-aware agent collaboration.
[3] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering
Caijun Jia,Nan Xu,Jingxuan Wei,Qingli Wang,Lei Wang,Bihui Yu,Junnan Zhu
Main category: cs.CL
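TL;DR: ChartReasoner is a two-stage, code-driven framework for long-chain reasoning in chart question answering; high-fidelity chart conversion and automatically synthesized data improve the precision and interpretability of multimodal reasoning.
Details
Motivation: Current visual reasoning approaches typically convert visual information into text before reasoning, which loses the structural and semantic information in charts. In chart question answering in particular, this causes key details to be missed.
Contribution: Proposes the ChartReasoner framework: 1. Converts charts into structured code with high fidelity; 2. Designs a method for automatically synthesizing chart reasoning data; 3. Trains a multimodal model with a combination of supervised learning and reinforcement learning.
Method: 1. Train a model to convert chart images into ECharts code; 2. Use a code validator to automatically generate and filter high-quality reasoning data; 3. Train the final model with supervised fine-tuning and reinforcement learning.
Result: On four public benchmarks, ChartReasoner performs well at both preserving chart details and reasoning, approaching the performance of GPT-4o with fewer parameters.
Insight: Code-driven modality conversion and automatic data synthesis enable efficient multimodal reasoning while preserving visual details, offering a new approach to visual reasoning tasks.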
Abstract: Recently, large language models have shown remarkable reasoning capabilities
through long-chain reasoning before responding. However, how to extend this
capability to visual reasoning tasks remains an open challenge. Existing
multimodal reasoning approaches convert such visual reasoning tasks into
textual reasoning tasks via several image-to-text conversions, which often lose
critical structural and semantic information embedded in visualizations,
especially for tasks like chart question answering that require a large amount
of visual details. To bridge this gap, we propose ChartReasoner, a code-driven
novel two-stage framework designed to enable precise, interpretable reasoning
over charts. We first train a high-fidelity model to convert diverse chart
images into structured ECharts code, preserving both layout and data semantics
as losslessly as possible. Then, we design a general chart reasoning data
synthesis pipeline, which leverages this pretrained transport model to
automatically and scalably generate chart reasoning trajectories and utilizes a
code validator to filter out low-quality samples. Finally, we train the final
multimodal model using a combination of supervised fine-tuning and
reinforcement learning on our synthesized chart reasoning dataset.
Experimental results on four public benchmarks clearly demonstrate the
effectiveness of our proposed ChartReasoner. It can preserve the original
details of the charts as much as possible and perform comparably with
state-of-the-art open-source models while using fewer parameters, approaching
the performance of proprietary systems like GPT-4o in out-of-domain settings.
[4] Unsupervised Elicitation of Language Models
Jiaxin Wen,Zachary Ankner,Arushi Somani,Peter Hase,Samuel Marks,Jacob Goldman-Wetzler,Linda Petrini,Henry Sleight,Collin Burns,He He,Shi Feng,Ethan Perez,Jan Leike
Main category: cs.CL
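TL;DR: The paper proposes ICM, an unsupervised algorithm that fine-tunes pretrained language models on their own generated labels without external supervision and outperforms training on crowdsourced human supervision.
Details
Motivation: For language models with superhuman capabilities, high-quality human supervision is hard to obtain; the paper aims to address this problem.
Contribution: Proposes the unsupervised ICM algorithm, which fine-tunes language models effectively without external supervision.
Method: Uses the Internal Coherence Maximization (ICM) algorithm, fine-tuning the model by maximizing its internal coherence.
Result: Outperforms human-supervised training on several tasks and can improve the training of frontier language models.
Insight: Unsupervised methods can surpass human supervision, especially on tasks where model capabilities far exceed human ability.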
Abstract: To steer pretrained language models for downstream tasks, today’s
post-training paradigm relies on humans to specify desired behaviors. However,
for models with superhuman capabilities, it is difficult or impossible to get
high-quality human supervision. To address this challenge, we introduce a new
unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune
pretrained language models on their own generated labels, without
external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward
modeling tasks, our method matches the performance of training on golden
supervision and outperforms training on crowdsourced human supervision. On
tasks where LMs’ capabilities are strongly superhuman, our method can elicit
those capabilities significantly better than training on human labels. Finally,
we show that our method can improve the training of frontier LMs: we use our
method to train an unsupervised reward model and use reinforcement learning to
train a Claude 3.5 Haiku-based assistant. Both the reward model and the
assistant outperform their human-supervised counterparts.
[5] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective
Yi Wang,Max Kreminski
Main category: cs.CL
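TL;DR: The paper studies the story generation ability of large language models (LLMs), focusing on narrative planning problems. Using an evaluation benchmark based on literature examples, it finds that GPT-4 tier LLMs can generate causally sound stories at small scales but still struggle with character intentionality and dramatic conflict.
Details
Motivation: Story generation is an important application of LLMs, but understanding of their ability remains limited, mainly because of shortcomings in automatic evaluation and the high cost and subjectivity of manual evaluation.
Contribution: Proposes a narrative planning benchmark based on literature examples, characterizes LLM performance on causal soundness, character intentionality, and dramatic conflict, and reveals limitations in complex reasoning.
Method: Evaluates LLM story generation by posing narrative planning tasks and analyzing causal soundness, character intentionality, and dramatic conflict.
Result: GPT-4 tier LLMs perform well on small-scale stories, but planning with character intentionality and dramatic conflict requires LLMs trained with reinforcement learning for complex reasoning.
Insight: LLM narrative planning ability is limited by story scale and complexity; techniques such as reinforcement learning are needed to improve complex reasoning.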
Abstract: Story generation has been a prominent application of Large Language Models
(LLMs). However, understanding LLMs’ ability to produce high-quality stories
remains limited due to challenges in automatic evaluation methods and the high
cost and subjectivity of manual evaluation. Computational narratology offers
valuable insights into what constitutes a good story, which has been applied in
the symbolic narrative planning approach to story generation. This work aims to
deepen the understanding of LLMs’ story generation capabilities by using them
to solve narrative planning problems. We present a benchmark for evaluating
LLMs on narrative planning based on literature examples, focusing on causal
soundness, character intentionality, and dramatic conflict. Our experiments
show that GPT-4 tier LLMs can generate causally sound stories at small scales,
but planning with character intentionality and dramatic conflict remains
challenging, requiring LLMs trained with reinforcement learning for complex
reasoning. The results offer insights on the scale of stories that LLMs can
generate while maintaining quality from different aspects. Our findings also
highlight interesting problem-solving behaviors and shed light on challenges
and considerations for applying LLM narrative planning in game environments.
[6] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
Shubhashis Roy Dipta,Francis Ferraro
Main category: cs.CL
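TL;DR: Q2E is a method for zero-shot multilingual text-to-video retrieval that decomposes queries using the latent knowledge in LLMs and VLMs, improving video retrieval for complex events.
Details
Motivation: Use the latent parametric knowledge of LLMs and VLMs to improve video retrieval for complex events and to compensate for overly simplified human queries.
Contribution: Proposes Q2E, which supports zero-shot multilingual video retrieval and adapts across datasets, domains, and LLMs/VLMs; shows that audio information significantly improves retrieval.
Method: Decomposes the query into event-related subtasks, using LLM/VLM knowledge to enhance query understanding; applies entropy-based fusion scoring for zero-shot multimodal fusion.
Result: Outperforms existing methods on two datasets and multiple retrieval metrics; integrating audio information significantly improves performance.
Insight: Complex event retrieval can be improved through query decomposition and multimodal fusion; audio information is an important signal in multimodal retrieval.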
Abstract: Recent approaches have shown impressive proficiency in extracting and
leveraging parametric knowledge from Large Language Models (LLMs) and
Vision-Language Models (VLMs). In this work, we consider how we can improve the
identification and retrieval of videos related to complex real-world events by
automatically extracting latent parametric knowledge about those events. We
present Q2E: a Query-to-Event decomposition method for zero-shot multilingual
text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our
approach demonstrates that we can enhance the understanding of otherwise overly
simplified human queries by decomposing the query using the knowledge embedded
in LLMs and VLMs. We additionally show how to apply our approach to both visual
and speech-based inputs. To combine this varied multimodal knowledge, we adopt
entropy-based fusion scoring for zero-shot fusion. Through evaluations on two
diverse datasets and multiple retrieval metrics, we demonstrate that Q2E
outperforms several state-of-the-art baselines. Our evaluation also shows that
integrating audio information can significantly improve text-to-video
retrieval. We have released code and data for future research.
[7] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
Prakamya Mishra,Jiang Liu,Jialian Wu,Xiaodong Yu,Zicheng Liu,Emad Barsoum
Main category: cs.CL
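TL;DR: TTT-Bench is a new benchmark that evaluates basic strategic, spatial, and logical reasoning in large reasoning models (LRMs) through four simple Tic-Tac-Toe-style games. It finds that models that excel at hard math problems perform poorly on these simple reasoning games.
Details
Motivation: Existing benchmarks focus mainly on the STEM domain, and the reasoning ability of LRMs in broader task domains remains underexplored. Evaluating basic reasoning through simple games fills this gap.
Contribution: 1. Proposes the TTT-Bench benchmark, with four Tic-Tac-Toe-style games that are simple but require strategic reasoning; 2. Finds that LRMs perform markedly worse on simple reasoning tasks than on hard math tasks, revealing their limitations.
Method: Generates verifiable two-player game problems programmatically and evaluates a range of state-of-the-art LRMs, with particular attention to reasoning about opponent intentions and spatial configurations.
Result: LRMs score on average 41% and 5% lower on TTT-Bench than on MATH 500 and AIME 2024 respectively, and do particularly poorly on long-term strategic reasoning.
Insight: Strong performance on complex tasks may mask weaknesses in basic reasoning, which points to directions for further model improvement.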
Abstract: Large reasoning models (LRMs) have demonstrated impressive reasoning
capabilities across a broad range of tasks including Olympiad-level
mathematical problems, indicating evidence of their complex reasoning
abilities. While many reasoning benchmarks focus on the STEM domain, the
ability of LRMs to reason correctly in broader task domains remains
underexplored. In this work, we introduce TTT-Bench, a new benchmark
that is designed to evaluate basic strategic, spatial, and logical reasoning
abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games
that humans can effortlessly solve from a young age. We propose a simple yet
scalable programmatic approach for generating verifiable two-player game
problems for TTT-Bench. Although these games are trivial for humans, they
require reasoning about the intentions of the opponent, as well as the game
board’s spatial configurations, to ensure a win. We evaluate a diverse set of
state-of-the-art LRMs, and discover that the models that excel at hard
math problems frequently fail at these simple reasoning games. Further testing
reveals that our evaluated reasoning models score on average 41% and 5% lower
on TTT-Bench compared to MATH 500 and AIME 2024
respectively, with larger models achieving higher performance using shorter
reasoning traces, where most of the models struggle on long-term strategic
reasoning situations on simple and new TTT-Bench tasks.
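The abstract describes programmatically generated, verifiable two-player game problems. As a rough illustration of what "verifiable" can mean, the Python sketch below finds the moves that win immediately on a standard 3x3 Tic-Tac-Toe board. Standard Tic-Tac-Toe, the string board encoding, and the function names `winner` and `winning_moves` are assumptions made for this example; the four TTT-Bench game variants are not specified in this digest.

```python
# Hypothetical sketch: standard 3x3 Tic-Tac-Toe, not the TTT-Bench variants.
# A board is a 9-character string, row by row, using "X", "O", or "." (empty).

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return "X" or "O" if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def winning_moves(board, player):
    """Return the empty cell indices where `player` wins immediately."""
    moves = []
    for i, cell in enumerate(board):
        if cell == ".":
            candidate = board[:i] + player + board[i + 1:]
            if winner(candidate) == player:
                moves.append(i)
    return moves


# X holds cells 0 and 1, so cell 2 completes the top row;
# O holds cells 3 and 4, so cell 5 completes the middle row.
puzzle = "XX.OO...."
print(winning_moves(puzzle, "X"))
print(winning_moves(puzzle, "O"))
```

Because the set of winning moves is computed by exhaustive checking, a model's answer to such a puzzle can be graded automatically, without human annotation.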
[8] Classifying Unreliable Narrators with Large Language Models
Anneliese Brei,Katharine Henry,Abhisheik Sharma,Shashank Srivastava,Snigdha Chaturvedi
Main category: cs.CL
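TL;DR: The paper uses large language models (LLMs) to identify unreliable narrators, introducing the TUNa dataset and classification tasks. Experiments show that the task is very challenging but that LLMs show potential.
Details
Motivation: The work aims to identify computationally whether a narrator is reliable, bridging literary theory and real-world text data.
Contribution: 1. Presents the TUNa dataset, covering texts from multiple domains; 2. Defines classification tasks for three types of narrator unreliability; 3. Analyzes LLM performance under different learning settings.
Method: Defines the classification tasks based on literary theory and experiments with LLMs in few-shot, fine-tuning, and curriculum learning settings.
Result: The task is very challenging, but LLMs show potential for identifying unreliable narrators.
Insight: Literary theory can inform real-world text classification; future work can further improve models and datasets.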
Abstract: Often when we interact with a first-person account of events, we consider
whether or not the narrator, the primary speaker of the text, is reliable. In
this paper, we propose using computational methods to identify unreliable
narrators, i.e. those who unintentionally misrepresent information. Borrowing
literary theory from narratology to define different types of unreliable
narrators based on a variety of textual phenomena, we present TUNa, a
human-annotated dataset of narratives from multiple domains, including blog
posts, subreddit posts, hotel reviews, and works of literature. We define
classification tasks for intra-narrational, inter-narrational, and
inter-textual unreliabilities and analyze the performance of popular
open-weight and proprietary LLMs for each. We propose learning from literature
to perform unreliable narrator classification on real-world text data. To this
end, we experiment with few-shot, fine-tuning, and curriculum learning
settings. Our results show that this task is very challenging, and there is
potential for using LLMs to identify unreliable narrators. We release our
expert-annotated dataset and code and invite future research in this area.
[9] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages
Ali Almutairi,Abdullah Alsuhaibani,Shoaib Jameel,Usman Naseem,Gelareh Mohammadi,Imran Razzak
Main category: cs.CL
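TL;DR: Flick is a few-label text classification method for low-resource languages that significantly improves pseudo-label reliability through high-quality pseudo-label distillation and an adaptive selection mechanism.
Details
Motivation: Address the difficulty of few-label text classification in low-resource language settings, particularly noisy pseudo-labels and domain adaptation.
Contribution: Proposes Flick, whose pseudo-label refinement component and adaptive top-k selection mechanism significantly improve pseudo-label quality for low-resource languages.
Method: Distills high-confidence pseudo-labels using single-cluster cohesion, combined with an adaptive top-k selection mechanism, to improve pseudo-label generation.
Result: Demonstrates strong performance on 14 diverse datasets, including low-resource languages such as Arabic and Urdu.
Insight: By focusing on generating high-quality pseudo-labels, Flick enables more robust model fine-tuning in low-resource settings with only a few true labels.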
Abstract: Training deep learning networks with minimal supervision has gained
significant research attention due to its potential to reduce reliance on
extensive labelled data. While self-training methods have proven effective in
semi-supervised learning, they remain vulnerable to errors from noisy pseudo
labels. Moreover, most recent approaches to the few-label classification
problem are either designed for resource-rich languages such as English or
involve complex cascading models that are prone to overfitting. To address the
persistent challenge of few-label text classification in truly low-resource
linguistic contexts, where existing methods often struggle with noisy
pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods
that rely on generic multi-cluster pseudo-labelling or complex cascading
architectures, Flick leverages the fundamental insight that distilling
high-confidence pseudo-labels from a broader set of initial clusters can
dramatically improve pseudo-label quality, particularly for linguistically
diverse, low-resource settings. Flick introduces a novel pseudo-label
refinement component that departs from traditional pseudo-labelling strategies
by identifying and leveraging top-performing pseudo-label clusters. This
component specifically learns to distil highly reliable pseudo-labels from an
initial broad set by focusing on single-cluster cohesion and leveraging an
adaptive top-k selection mechanism. This targeted refinement process is crucial
for mitigating the propagation of errors inherent in low-resource data,
allowing for robust fine-tuning of pre-trained language models with only a
handful of true labels. We demonstrate Flick’s efficacy across 14 diverse
datasets, encompassing challenging low-resource languages such as Arabic, Urdu,
and Setswana, alongside English, showcasing its superior performance and
adaptability.
[10] “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context
Chuck Arvin
Main category: cs.CL
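TL;DR: The paper studies how user-provided suggestions affect large language models (LLMs) in a simulated educational context, in particular how the models' sycophantic behavior may harm educational equity.
Details
Motivation: LLMs are increasingly used in education, but their sensitivity to user input can lead to sycophancy, which may widen educational inequality.
Contribution: Quantifies LLM sycophancy, shows how response quality changes across conditions, and finds that smaller models are more prone to sycophancy.
Method: Tests five LLMs under five experimental conditions, analyzes changes in response quality, and examines token-level probabilities to confirm sycophantic behavior.
Result: Model accuracy shifts by up to 15 percentage points in either direction depending on the answer the student mentions, and the effect is stronger in smaller models (up to 30% for GPT-4.1-nano vs. 8% for GPT-4o).
Insight: LLMs in education may widen knowledge gaps, which underscores the need to understand and mitigate this bias.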
Abstract: This study examines how user-provided suggestions affect Large Language
Models (LLMs) in a simulated educational context, where sycophancy poses
significant risks. Testing five different LLMs from the OpenAI GPT-4o and
GPT-4.1 model classes across five experimental conditions, we show that
response quality varies dramatically based on query framing. In cases where the
student mentions an incorrect answer, the LLM correctness can degrade by as
much as 15 percentage points, while mentioning the correct answer boosts
accuracy by the same margin. Our results also show that this bias is stronger
in smaller models, with an effect of up to 30% for the GPT-4.1-nano model,
versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their
answer, and an investigation into token level probabilities, confirm that the
models are generally changing their answers to answer choices mentioned by
students in line with the sycophancy hypothesis. This sycophantic behavior has
important implications for educational equity, as LLMs may accelerate learning
for knowledgeable students while the same tools may reinforce misunderstanding
for less knowledgeable students. Our results highlight the need to better
understand the mechanism of such bias, and ways to mitigate it, in the
educational context.
[11] Code Execution as Grounded Supervision for LLM Reasoning
Dongwon Jung,Wenxuan Zhou,Muhao Chen
Main category: cs.CL
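TL;DR: The paper proposes generating high-quality chain-of-thought (CoT) supervision data from the determinism of code execution, replacing reliance on human annotation or error-prone LLM-generated supervision, and effectively improves LLM reasoning.
Details
Motivation: Existing methods for generating CoT supervision data rely on costly human annotation or error-prone LLM generation, which makes reliability and accuracy hard to guarantee. This paper uses the determinism of code execution to propose a scalable method for generating high-quality supervision data.
Contribution: 1. Proposes a method that uses the determinism of program execution to generate high-quality CoT supervision data; 2. The generated reasoning data can be verified for accuracy by execution; 3. Experiments show the method effectively improves the cross-domain reasoning ability of LLMs.
Method: Extracts verifiable step-by-step reasoning traces from code execution and converts them into natural-language CoT reasoning data.
Result: On reasoning benchmarks across multiple domains, the method improves LLM reasoning; ablation studies validate the accuracy of the generated data and show improved reasoning efficiency.
Insight: Using the determinism of code execution to generate reasoning supervision is an efficient and scalable approach that reduces reliance on human annotation and improves reasoning accuracy.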
Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision
has proven effective for enhancing their reasoning abilities. However,
obtaining reliable and accurate reasoning supervision remains a significant
challenge. We propose a scalable method for generating a high-quality CoT
supervision dataset by leveraging the determinism of program execution. Unlike
existing reasoning dataset generation methods that rely on costly human
annotations or error-prone LLM-generated CoT, our approach extracts verifiable,
step-by-step reasoning traces from code execution and transforms them into
natural language CoT reasoning. Experiments on reasoning benchmarks across
various domains show that our method effectively equips LLMs with transferable
reasoning abilities across diverse tasks. Furthermore, the ablation studies
validate that our method produces highly accurate reasoning data and reduces
overall token length during inference by reducing meaningless repetition and
overthinking.
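To make the idea of extracting step-by-step traces from code execution more concrete, here is a minimal Python sketch that records a function's local variables as it runs and renders them as short sentences. It uses the standard `sys.settrace` hook; the helper names `trace_steps` and `to_sentences`, the example function `discounted_total`, and the sentence template are all assumptions for illustration and are not the paper's pipeline.

```python
import sys


def trace_steps(func, *args):
    """Run func(*args) and snapshot its local variables before each line."""
    steps = []

    def tracer(frame, event, arg):
        # Only follow the frame of the function we were asked to trace.
        if frame.f_code is not func.__code__:
            return None
        if event == "line":
            steps.append(dict(frame.f_locals))
        elif event == "return":
            steps.append({"return": arg})
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, steps


def discounted_total(quantity, unit_price):
    # Hypothetical example program whose execution we want to narrate.
    subtotal = quantity * unit_price
    discount = subtotal // 10
    return subtotal - discount


def to_sentences(steps):
    """Emit one sentence each time a variable first appears or changes."""
    sentences, seen = [], {}
    for snapshot in steps:
        for name, value in snapshot.items():
            if name not in seen or seen[name] != value:
                seen[name] = value
                if name == "return":
                    sentences.append(f"The result is {value}.")
                else:
                    sentences.append(f"{name} is {value}.")
    return sentences


result, steps = trace_steps(discounted_total, 3, 20)
for sentence in to_sentences(steps):
    print(sentence)
```

Every sentence here is backed by a value the interpreter actually computed, which is the property that makes execution-derived traces checkable in a way that free-form generated reasoning is not.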
[12] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
Xiaohan Yu,Pu Jian,Chong Chen
Main category: cs.CL
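TL;DR: TableRAG is a retrieval-augmented generation framework for reasoning over heterogeneous documents that contain both text and tables. By combining text retrieval with table operations, it addresses the limitations of existing methods in handling table structure and multi-hop reasoning.
Details
Motivation: Existing retrieval-augmented generation (RAG) methods have limitations on heterogeneous documents that contain text and tables, such as disrupting table structure and losing information, which leads to poor performance on multi-hop and global reasoning tasks.
Contribution: Proposes the TableRAG framework, which unifies text understanding and table operations; develops the HeteQA benchmark for evaluating multi-hop reasoning over heterogeneous documents; shows experimentally that TableRAG outperforms baselines on multiple tasks.
Method: Iterates over four steps: query decomposition, text retrieval, SQL programming and execution, and intermediate answer generation, combining dynamic operations over text and tables.
Result: TableRAG outperforms existing methods on public datasets and on HeteQA, setting a new state of the art for heterogeneous document question answering.
Insight: Reasoning over heterogeneous documents requires structured operations over both text and tables rather than simple concatenation; dynamic multi-step reasoning can significantly improve performance.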
Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable
effectiveness in open-domain question answering. However, when applied to
heterogeneous documents, comprising both textual and tabular components,
existing RAG approaches exhibit critical limitations. The prevailing practice
of flattening tables and chunking strategies disrupts the intrinsic tabular
structure, leads to information loss, and undermines the reasoning capabilities
of LLMs in multi-hop, global queries. To address these challenges, we propose
TableRAG, a hybrid framework that unifies textual understanding and complex
manipulations over tabular data. TableRAG iteratively operates in four steps:
context-sensitive query decomposition, text retrieval, SQL programming and
execution, and compositional intermediate answer generation. We also develop
HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous
reasoning capabilities. Experimental results demonstrate that TableRAG
consistently outperforms existing baselines on both public datasets and our
HeteQA, establishing a new state-of-the-art for heterogeneous document question
answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
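The "SQL programming and execution" step can be illustrated with a small Python sketch using the standard `sqlite3` module. It shows why a global question over a table is better answered by executing a query on the intact table than by reading flattened text chunks. The `sales` table, its rows, the sub-question, and the helper `run_sql` are all invented for this example and are not taken from TableRAG or HeteQA.

```python
import sqlite3

# Hypothetical table that might accompany a report's text; the rows are
# invented for illustration only.
ROWS = [
    ("North", 2023, 120),
    ("North", 2024, 150),
    ("South", 2023, 200),
    ("South", 2024, 180),
]


def run_sql(query):
    """Load the table into an in-memory database and execute one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (region TEXT, year INTEGER, revenue INTEGER)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Sub-question a (hypothetical) decomposition step might produce:
# "Which region had the highest total revenue across all years?"
SQL = """
    SELECT region, SUM(revenue) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
    LIMIT 1
"""

print(run_sql(SQL))
```

The aggregation touches every row of the table, which is exactly the kind of global query that suffers when a table is split across chunks; the query result would then feed the intermediate answer generation step.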
[13] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Yuhua Jiang,Yuwen Xiong,Yufeng Yuan,Chao Xin,Wenyuan Xu,Yu Yue,Qianchuan Zhao,Lin Yan
Main category: cs.CL
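TL;DR: PAG is a multi-turn reinforcement learning framework that unifies generation and verification: the model switches between generator and verifier roles and selectively revises its answers, improving self-correction.
Details
Motivation: Large language models (LLMs) perform well on complex reasoning tasks but struggle to reliably verify the correctness of their own outputs. Existing solutions rely on separate verifier modules or multi-stage training, which limits scalability.
Contribution: Proposes the Policy as Generative Verifier (PAG) framework, which unifies generation and verification in multi-turn reinforcement learning, alleviates model collapse through selective revision, and improves both reasoning and verification.
Method: Within a unified multi-turn reinforcement learning paradigm, the model alternates between policy and verifier roles and revises its answer only when its generative verification detects an error (a verify-then-revise workflow).
Result: On diverse reasoning benchmarks, PAG as a policy improves generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
Insight: Unifying verification and generation in a single framework, with selective revision, alleviates the model collapse common with unnecessary repeated revision and jointly improves reasoning and verification.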
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in
complex reasoning tasks, yet they still struggle to reliably verify the
correctness of their own outputs. Existing solutions to this verification
challenge often depend on separate verifier models or require multi-stage
self-correction training pipelines, which limit scalability. In this paper, we
propose Policy as Generative Verifier (PAG), a simple and effective framework
that empowers LLMs to self-correct by alternating between policy and verifier
roles within a unified multi-turn reinforcement learning (RL) paradigm.
Distinct from prior approaches that always generate a second attempt regardless
of model confidence, PAG introduces a selective revision mechanism: the model
revises its answer only when its own generative verification step detects an
error. This verify-then-revise workflow not only alleviates model collapse but
also jointly enhances both reasoning and verification abilities. Extensive
experiments across diverse reasoning benchmarks highlight PAG’s dual
advancements: as a policy, it enhances direct generation and self-correction
accuracy; as a verifier, its self-verification outperforms self-consistency.
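The verify-then-revise control flow described above can be sketched in a few lines of Python. This shows only the inference-time branching, not the reinforcement learning training, and it stands in toy arithmetic functions for the model's policy and verifier roles. The names `verify_then_revise`, `generate`, `verify`, `revise`, and the `FIRST_ATTEMPTS` table are assumptions for illustration.

```python
def verify_then_revise(question, generate, verify, revise):
    """Return (answer, was_revised); revise only if verification fails."""
    answer = generate(question)
    if verify(question, answer):
        return answer, False  # verifier accepts: keep the first attempt
    return revise(question, answer), True


# Toy stand-ins for the two roles. In PAG both roles are played by the same
# LLM; here they are plain functions over "a op b" arithmetic strings.
FIRST_ATTEMPTS = {"2 + 3": 5, "7 * 6": 40}  # second entry is deliberately wrong


def compute(question):
    left, op, right = question.split()
    return int(left) + int(right) if op == "+" else int(left) * int(right)


def generate(question):
    return FIRST_ATTEMPTS[question]


def verify(question, answer):
    return answer == compute(question)


def revise(question, previous_answer):
    return compute(question)


for q in FIRST_ATTEMPTS:
    print(q, verify_then_revise(q, generate, verify, revise))
```

The point of the branch is that a correct first attempt is never overwritten, in contrast to approaches that always produce a second attempt regardless of confidence.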
[14] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Yingjin Song,Yupei Du,Denis Paperno,Albert Gatt
Main category: cs.CL
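TL;DR: This paper introduces the TempVS benchmark for evaluating how well multimodal large language models (MLLMs) understand temporal order in image sequences, and finds a substantial gap between existing models and human performance.
Details
Motivation: Test whether MLLMs truly understand the order of events in image sequences, and expose weaknesses in their temporal reasoning and grounding abilities.
Contribution: Proposes the TempVS benchmark, with three tests (event relation inference, sentence ordering, and image ordering), and evaluates 38 state-of-the-art MLLMs.
Method: Designs the TempVS benchmark, which combines visual and linguistic modalities, to evaluate models' understanding of event order.
Result: Existing MLLMs perform poorly on TempVS tasks, with a large gap compared to human capabilities.
Insight: The study points to directions for future research, including improving models' temporal reasoning and multimodal fusion mechanisms.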
Abstract: This paper introduces the TempVS benchmark, which focuses on temporal
grounding and reasoning capabilities of Multimodal Large Language Models
(MLLMs) in image sequences. TempVS consists of three main tests (i.e., event
relation inference, sentence ordering and image ordering), each accompanied
by a basic grounding test. TempVS requires MLLMs to rely on both visual and
linguistic modalities to understand the temporal order of events. We evaluate
38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS,
with a substantial performance gap compared to human capabilities. We also
provide fine-grained insights that suggest promising directions for future
research. Our TempVS benchmark data and code are available at
https://github.com/yjsong22/TempVS.
[15] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty
Zehui Ling,Deshu Chen,Hongwei Zhang,Yifeng Jiao,Xin Guo,Yuan Cheng
Main category: cs.CL
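TL;DR: This paper proposes improving the reasoning efficiency of large language models (LLMs) by dynamically adjusting the output length penalty: outputs are shortened on simple problems to reduce computational latency, while sufficient reasoning is retained on complex problems to improve accuracy.
Details
Motivation: Methods such as Chain-of-Thought prompting improve LLM reasoning but often produce overly long outputs, which increases computational latency. Uniform length penalties ignore problem complexity and hurt performance.
Contribution: Introduces a dynamic length penalty mechanism based on problem complexity that balances output length and reasoning ability, improving efficiency while maintaining or improving accuracy.
Method: Divides the reward function and adds a novel output length penalty, adjusting the model's reasoning behavior across problems of different complexity.
Result: Validated on three datasets (GSM8K, MATH500, AIME2024): on the simpler datasets, output length is shortened without loss of accuracy; on the harder dataset, accuracy improves.
Insight: Dynamically adjusting the reasoning strategy can significantly improve LLM efficiency and performance, showing that problem complexity is central to the design of reasoning behavior.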
Abstract: Large language models (LLMs) have demonstrated significant advancements in
reasoning capabilities, performing well on various challenging benchmarks.
Techniques like Chain-of-Thought prompting have been introduced to further
improve reasoning. However, these approaches frequently generate longer
outputs, which in turn increase computational latency. Although some methods
use reinforcement learning to shorten reasoning, they often apply uniform
penalties without considering the problem’s complexity, leading to suboptimal
outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by
promoting conciseness for simpler problems while preserving sufficient
reasoning for more complex ones for accuracy, thus improving the model’s
overall performance. Specifically, we manage the model’s reasoning efficiency
by dividing the reward function and including a novel penalty for output
length. Our approach has yielded impressive outcomes in benchmark evaluations
across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively
simpler datasets GSM8K and MATH500, our method has effectively shortened output
lengths while preserving or enhancing accuracy. On the more demanding AIME2024
dataset, our approach has resulted in improved accuracy.
[16] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers
Xanh Ho,Sunisth Kumar,Yun-Ang Wu,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa
Main category: cs.CL
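TL;DR: The paper reframes table-text alignment as an explanation task that requires models to identify the table cells essential to supporting or refuting a scientific claim, and builds a new dataset with human-annotated cell-level rationales to improve the interpretability and performance of scientific claim verification.
Details
Motivation: Models that only predict a claim verification label lack transparency and interpretability and reveal little about the model's reasoning. By introducing a table cell alignment task, the paper aims to improve both interpretability and performance.
Contribution: 1. Reframes table-text alignment as an explanation task; 2. Builds a new dataset with human-annotated cell-level rationales; 3. Proposes a taxonomy for handling ambiguous cases; 4. Verifies that table alignment information improves claim verification performance.
Method: 1. Extend the SciTab benchmark with human-annotated cell-level rationales; 2. Propose a taxonomy for handling ambiguous cases; 3. Verify experimentally that table alignment information improves performance.
Result: Experiments show that incorporating table alignment information improves claim verification performance. However, most language models, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
Insight: Correct predictions do not imply faithful reasoning; existing language models still fall short on explanation tasks and need further improvement to become more interpretable.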
Abstract: Scientific claim verification against tables typically requires predicting
whether a claim is supported or refuted given a table. However, we argue that
predicting the final label alone is insufficient: it reveals little about the
model’s reasoning and offers limited interpretability. To address this, we
reframe table-text alignment as an explanation task, requiring models to
identify the table cells essential for claim verification. We build a new
dataset by extending the SciTab benchmark with human-annotated cell-level
rationales. Annotators verify the claim label and highlight the minimal set of
cells needed to support their decision. After the annotation process, we
utilize the collected information and propose a taxonomy for handling ambiguous
cases. Our experiments show that (i) incorporating table alignment information
improves claim verification performance, and (ii) most LLMs, while often
predicting correct labels, fail to recover human-aligned rationales, suggesting
that their predictions do not stem from faithful reasoning.
[17] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs
Yilin Xiao,Chuang Zhou,Qinggang Zhang,Bo Li,Qing Li,Xiao Huang
Main category: cs.CL
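TL;DR: The paper proposes RRP, a framework that combines knowledge graphs and large language models to address the reliability and redundancy of reasoning paths in complex reasoning tasks, improving reasoning ability.
Details
Motivation: Large language models perform poorly on knowledge-intensive tasks, mainly because they lack background knowledge and tend to hallucinate. Introducing knowledge graphs supplies facts, but it remains difficult to produce logically consistent reasoning paths.
Contribution: Proposes the RRP framework, which uses the semantic strengths of LLMs and the structural information of knowledge graphs to generate high-quality reasoning paths through relation embedding and bidirectional distribution learning, and introduces a rethinking module to refine the paths.
Method: Combines LLMs with relation embedding and bidirectional distribution learning over the knowledge graph to generate reasoning paths; a rethinking module evaluates the significance of each path to improve path quality.
Result: Achieves state-of-the-art performance on two public datasets and can be flexibly integrated into different LLMs.
Insight: The reliability and logical consistency of reasoning paths are crucial to LLM reasoning; combining structural information with semantic capability offers an effective approach to complex questions.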
Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks
due to a lack of background knowledge and a tendency to hallucinate. To address
these limitations, integrating knowledge graphs (KGs) with LLMs has been
intensively studied. Existing KG-enhanced LLMs focus on supplementary factual
knowledge, but still struggle with solving complex questions. We argue that
refining the relationships among facts and organizing them into a logically
consistent reasoning path is equally important as factual knowledge itself.
Extracting reliable reasoning paths from KGs, however, poses two
challenges: the complexity of graph structures and the existence of
multiple generated paths, which makes it difficult to distinguish useful
paths from redundant ones. To tackle these challenges, we propose the RRP framework to
mine the knowledge graph, which combines the semantic strengths of LLMs with
structural information obtained through relation embedding and bidirectional
distribution learning. Additionally, we introduce a rethinking module that
evaluates and refines reasoning paths according to their significance.
Experimental results on two public datasets show that RRP achieves
state-of-the-art performance compared to existing baseline methods. Moreover,
RRP can be easily integrated into various LLMs to enhance their reasoning
abilities in a plug-and-play manner. By generating high-quality reasoning paths
tailored to specific questions, RRP distills effective guidance for LLM
reasoning.
[18] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
Numaan Naeem,Sarfraz Ahmad,Momina Ahsan,Hasan Iqbal
Main category: cs.CL
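TL;DR: The paper explores four approaches for assessing whether an AI tutor's response correctly identifies a mistake in a student's mathematical reasoning; the retrieval-augmented few-shot prompting system combined with LLM reasoning performs best.
Details
Motivation: Improve AI tutors' ability to identify mistakes in pedagogical feedback, and thereby their teaching effectiveness.
Contribution: 1. Proposes four approaches to mistake identification; 2. Demonstrates the effectiveness of the retrieval-augmented prompting system on the task.
Method: Explores four approaches: an ensemble of machine learning models, a sentence-transformer, a history-aware model, and a retrieval-augmented few-shot prompting system.
Result: The retrieval-augmented prompting system outperforms all baselines, demonstrating its strength in pedagogical feedback assessment.
Insight: Combining example-driven prompting with LLM reasoning effectively improves mistake identification for AI tutors.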
Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA
2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The
task involves evaluating whether a tutor’s response correctly identifies a
mistake in a student’s mathematical reasoning. We explore four approaches: (1)
an ensemble of machine learning models over pooled token embeddings from
multiple pretrained language models (LMs); (2) a frozen sentence-transformer
using [CLS] embeddings with an MLP classifier; (3) a history-aware model with
multi-head attention between token-level history and response embeddings; and
(4) a retrieval-augmented few-shot prompting system with a large language model
(LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples,
constructs structured prompts, and uses schema-guided output parsing to produce
interpretable predictions. It outperforms all baselines, demonstrating the
effectiveness of combining example-driven prompting with LLM reasoning for
pedagogical feedback assessment. Our code is available at
https://github.com/NaumanNaeem/BEA_2025.
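The first two steps of the final system (retrieving similar examples and building a structured prompt) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the bag-of-words similarity stands in for a real semantic embedding, and the pool, labels, and function names are all hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector; a crude stand-in for a sentence embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, pool, k=2):
    """Return the k labeled examples most similar to the query response."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex["response"])), reverse=True)[:k]

def build_prompt(query, shots):
    """Assemble a few-shot prompt from retrieved examples plus the new response."""
    parts = ["Does the tutor response identify the student's mistake?"]
    for ex in shots:
        parts.append(f"Response: {ex['response']}\nLabel: {ex['label']}")
    parts.append(f"Response: {query}\nLabel:")
    return "\n\n".join(parts)

# Hypothetical labeled pool and query, for illustration only.
pool = [
    {"response": "You added instead of multiplying the two numbers", "label": "Yes"},
    {"response": "Great job, keep going", "label": "No"},
    {"response": "Check the multiplying step again", "label": "Yes"},
]
query = "You should be multiplying the numbers here"
shots = retrieve(query, pool, k=2)
prompt = build_prompt(query, shots)
```

The resulting prompt would then be sent to the LLM, and the reply parsed against a fixed output schema; that step is omitted here because it depends on the specific API in use.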
[19] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models
Ye Yu,Yaoning Yu,Haohan Wang
Main category: cs.CL
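TL;DR: PREMISE is a prompt-optimization framework that requires no changes to model weights. Using multi-objective textual search and gradient-inspired methods, it substantially reduces redundant computation and token usage in large reasoning models while preserving high accuracy.
Details
Motivation: Existing large reasoning models (LRMs) use lengthy chain-of-thought (CoT) reasoning on mathematical tasks, which drives up token usage and cost and limits deployment in latency-sensitive or API-constrained settings. PREMISE aims to cut this reasoning overhead by optimizing prompts.
Contribution: Proposes the PREMISE framework, presented as the first to apply prompt-based optimization to LRMs for efficient inference; balances brevity and accuracy through multi-objective optimization, substantially lowering token usage and cost.
Method: Combines trace-level diagnostics with gradient-inspired prompt optimization, jointly optimizing token length and answer validity through multi-objective textual search; works with commercial LLMs through a single-pass black-box interface.
Result: On GSM8K, SVAMP, and Math500, PREMISE matches or improves accuracy (Claude 96%→96%, Gemini 91%→92%) while reducing reasoning tokens by up to 87.5% and cost by 69%–82%.
Insight: Prompt-level optimization is a practical route to efficient LRM inference: it needs no model modification and applies to commercial LLMs. It could be extended to other reasoning tasks and settings.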
Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve
strong performance on mathematical benchmarks using lengthy chain-of-thought
(CoT) reasoning, but the resulting traces are often unnecessarily verbose. This
inflates token usage and cost, limiting deployment in latency-sensitive or
API-constrained settings. We introduce PREMISE (PRompt-based Efficient
Mathematical Inference with Strategic Evaluation), a prompt-only framework that
reduces reasoning overhead without modifying model weights. PREMISE combines
trace-level diagnostics with gradient-inspired prompt optimization to minimize
redundant computation while preserving answer accuracy. The approach jointly
optimizes brevity and correctness through a multi-objective textual search that
balances token length and answer validity. Unlike prior work, PREMISE runs in a
single-pass black-box interface, so it can be applied directly to commercial
LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy
($96\%\rightarrow96\%$ with Claude, $91\%\rightarrow92\%$ with Gemini) while
reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by
$69$–$82\%$. These results show that prompt-level optimization is a practical
and scalable path to efficient LRM inference without compromising reasoning
quality.
[20] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims
Priyanka Kargupta,Runchu Tian,Jiawei Han
Main category: cs.CL
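TL;DR: The paper proposes ClaimSpect, a framework that uses retrieval-augmented generation to automatically build a hierarchical analysis of a nuanced claim, breaking the claim into aspects and surfacing the different perspectives found in a corpus.
Details
Motivation: Many real-world claims (for example, scientific or political ones) are nuanced and cannot be fully assessed as simply "true" or "false". The paper aims to give a more complete picture through hierarchical analysis and multi-perspective validation.
Contribution: Proposes ClaimSpect, which automatically generates a hierarchy of claim aspects and uses corpus retrieval to discover sub-aspects and diverse perspectives.
Method: Uses retrieval-augmented generation, hierarchically partitioning the corpus to retrieve relevant segments that reveal the claim's sub-aspects and the perspectives toward them (support, neutral, or oppose).
Result: Experiments on a dataset of scientific and political claims show that ClaimSpect deconstructs nuanced claims effectively and outperforms multiple baselines.
Insight: Hierarchical analysis with multi-perspective retrieval offers a more complete and structured way to assess nuanced claims and lets readers focus on the specific aspects they care about.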
Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be
clearly labeled as entirely “true” or “false” – as is frequently the case with
scientific and political claims. However, a claim (e.g., “vaccine A is better
than vaccine B”) can be dissected into its integral aspects and sub-aspects
(e.g., efficacy, safety, distribution), which are individually easier to
validate. This enables a more comprehensive, structured response that provides
a well-rounded perspective on a given problem while also allowing the reader to
prioritize specific angles of interest within the claim (e.g., safety towards
children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based
framework for automatically constructing a hierarchy of aspects typically
considered when addressing a claim and enriching them with corpus-specific
perspectives. This structure hierarchically partitions an input corpus to
retrieve relevant segments, which assist in discovering new sub-aspects.
Moreover, these segments enable the discovery of varying perspectives towards
an aspect of the claim (e.g., support, neutral, or oppose) and their respective
prevalence (e.g., “how many biomedical papers believe vaccine A is more
transportable than B?”). We apply ClaimSpect to a wide variety of real-world
scientific and political claims featured in our constructed dataset, showcasing
its robustness and accuracy in deconstructing a nuanced claim and representing
perspectives within a corpus. Through real-world case studies and human
evaluation, we validate its effectiveness over multiple baselines.
[21] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs
Alberto Testoni,Iacer Calixto
Main category: cs.CL
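TL;DR: This paper presents a fine-grained evaluation of uncertainty estimation methods for large language models (LLMs) in clinical question answering, comparing models and methods across medical specialties and question types, and explores lightweight single-pass estimators.
Details
Motivation: In high-stakes domains such as clinical decision support, accurate and well-calibrated uncertainty estimates from LLMs are essential. Existing work lacks a detailed analysis of how LLM performance varies across medical specialties and question types, which calls for a systematic evaluation.
Contribution: 1. A comprehensive evaluation of uncertainty estimation for ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types;
2. Lightweight single-pass uncertainty estimators whose performance approaches that of Semantic Entropy.
Method: 1. Compares standard single-generation and sampling-based uncertainty estimation methods;
2. Designs simple single-pass estimators based on behavioral signals in reasoning traces, with no need for repeated sampling.
Result: Performance varies substantially across medical specialties and question types, which underscores the importance of choosing a model based on both the nature of the question and the model's strengths. The lightweight estimators approach the performance of Semantic Entropy while requiring only one generation.
Insight: 1. The diversity of medical specialties and question types has a marked effect on how well LLM uncertainty estimation works;
2. The lightweight estimators cut computational cost while retaining performance, which suits practical deployment.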
Abstract: Accurate and well-calibrated uncertainty estimates are essential for
deploying large language models (LLMs) in high-stakes domains such as clinical
decision support. We present a fine-grained evaluation of uncertainty
estimation methods for clinical multiple-choice question answering, covering
ten open-source LLMs (general-purpose, biomedical, and reasoning models) across
two datasets, eleven medical specialties, and six question types. We compare
standard single-generation and sampling-based methods, and present a case study
exploring simple, single-pass estimators based on behavioral signals in
reasoning traces. These lightweight methods approach the performance of
Semantic Entropy while requiring only one generation. Our results reveal
substantial variation across specialties and question types, underscoring the
importance of selecting models based on both the nature of the question and
model-specific strengths.
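Semantic Entropy, the sampling-based reference point named above, groups several sampled generations by meaning and takes the entropy over the groups. For multiple-choice questions, a rough sketch can treat each selected option as its own group. That simplification is an assumption made here for illustration (the general method clusters free text by meaning), and the function name is hypothetical.

```python
import math
from collections import Counter

def answer_entropy(sampled_answers):
    """Entropy (in nats) of the empirical distribution over sampled answers.

    Each distinct answer option is treated as one semantic cluster, which
    is only a reasonable shortcut for multiple-choice outputs.
    """
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical samples: five generations for one question.
confident = answer_entropy(["B", "B", "B", "B", "B"])   # all samples agree
uncertain = answer_entropy(["A", "B", "C", "B", "D"])   # samples disagree
```

Higher entropy signals lower confidence. The cost is one generation per sample, which is the overhead the single-pass estimators in the abstract are meant to avoid.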
[22] Improving Named Entity Transcription with Contextual LLM-based Revision
Viet Anh Trinh,Xinlu He,Jacob Whitehill
Main category: cs.CL
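TL;DR: The paper proposes a large language model (LLM) revision method that uses the LLM's reasoning ability and contextual information (such as lecture notes) to correct named-entity transcription errors in ASR output, achieving up to a 30% relative WER reduction on a newly built dataset.
Details
Motivation: Current ASR systems perform well on general speech, but their error rate on named entities remains high, which hurts downstream applications. A method aimed specifically at named-entity transcription accuracy is therefore needed.
Contribution: 1. An LLM-based revision mechanism; 2. A new dataset, NER-MIT-OpenCourseWare (45 hours); 3. Up to a 30% relative WER reduction on named entities.
Method: Uses the LLM's reasoning ability together with correct named entities found in local context (such as lecture notes) to revise named entities in the ASR output.
Result: On the NER-MIT-OpenCourseWare dataset, the WER on named entities drops by up to 30% in relative terms.
Insight: An LLM's contextual reasoning can effectively correct named-entity errors in ASR output, and combining it with domain knowledge (such as lecture notes) improves the result further.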
Abstract: With recent advances in modeling and the increasing amount of supervised
training data, automatic speech recognition (ASR) systems have achieved
remarkable performance on general speech. However, the word error rate (WER) of
state-of-the-art ASR remains high for named entities. Since named entities are
often the most critical keywords, misrecognizing them can affect all downstream
applications, especially when the ASR system functions as the front end of a
complex system. In this paper, we introduce a large language model (LLM)
revision mechanism to revise incorrect named entities in ASR predictions by
leveraging the LLM’s reasoning ability as well as local context (e.g., lecture
notes) containing a set of correct named entities. Finally, we introduce the
NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses
for development and testing. On this dataset, our proposed technique achieves
up to 30% relative WER reduction for named entities.
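To make the metric concrete, the sketch below computes word error rate as word-level edit distance divided by reference length, plus the relative reduction between two WER values. The sentences and the 0.20 and 0.14 figures are invented for illustration; they are not results from the paper.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(before, after):
    """Relative improvement of `after` over `before` (0.3 means 30%)."""
    return (before - after) / before

# Hypothetical example: the named entity "fourier" is misrecognized.
reference = "the fourier transform of a gaussian"
asr_output = "the for year transform of a gaussian"
revised = "the fourier transform of a gaussian"

wer_before = wer(reference, asr_output)   # one substitution + one insertion
wer_after = wer(reference, revised)
```

A relative reduction compares the two error rates rather than subtracting them: going from a WER of 0.20 to 0.14 is a 30% relative reduction, even though the absolute drop is 6 points.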
[23] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints
Wei Sun,Tingyu Qu,Mingxiao Li,Jesse Davis,Marie-Francine Moens
Main category: cs.CL
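TL;DR: LangEdit is a new framework that uses null-space constraints to mitigate negative interference in multilingual sequential knowledge editing, keeping language-specific knowledge updates independent while preserving multilingual generalization.
Details
Motivation: When multilingual large language models (LLMs) are updated with new knowledge, sequential edits across languages cause parameter interference that degrades multilingual generalization and the accuracy of the injected knowledge. This challenge needs to be addressed.
Contribution: Proposes the LangEdit framework, which uses null-space projection to isolate language-specific knowledge updates, ensuring edit independence while preserving multilingual generalization.
Method: Projects each parameter update onto the orthogonal complement (null space) of the subspace spanned by previous updates, which mathematically guarantees update independence.
Result: Evaluated on three model architectures, six languages, and four downstream tasks, LangEdit markedly reduces parameter interference and outperforms existing editing methods.
Insight: Null-space constraints offer an efficient and mathematically interpretable solution for multilingual knowledge updates and open a new direction for multilingual editing of LLMs.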
Abstract: Efficiently updating multilingual knowledge in large language models (LLMs),
while preserving consistent factual representations across languages, remains a
long-standing and unresolved challenge. While deploying separate editing
systems for each language might seem viable, this approach incurs substantial
costs due to the need to manage multiple models. A more efficient solution
involves integrating knowledge updates across all languages into a unified
model. However, performing sequential edits across languages often leads to
destructive parameter interference, significantly degrading multilingual
generalization and the accuracy of injected knowledge. To address this
challenge, we propose LangEdit, a novel null-space constrained framework
designed to precisely isolate language-specific knowledge updates. The core
innovation of LangEdit lies in its ability to project parameter updates for
each language onto the orthogonal complement of previously updated subspaces.
This approach mathematically guarantees update independence while preserving
multilingual generalization capabilities. We conduct a comprehensive evaluation
across three model architectures, six languages, and four downstream tasks,
demonstrating that LangEdit effectively mitigates parameter interference and
outperforms existing state-of-the-art editing methods. Our results highlight
its potential for enabling efficient and accurate multilingual knowledge
updates in LLMs. The code is available at
https://github.com/VRCMF/LangEdit.git.
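The projection idea can be illustrated on flattened vectors. The sketch below is a simplified stand-in, not the released LangEdit code: it builds an orthonormal basis for the span of earlier updates with Gram-Schmidt and removes that component from a new update, so the result is orthogonal to every earlier update. All names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt basis for the span of `vectors` (dependent ones dropped)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def project_to_complement(update, previous_updates):
    """Remove from `update` every component lying in span(previous_updates)."""
    out = list(update)
    for b in orthonormal_basis(previous_updates):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

# Hypothetical flattened updates from two earlier language edits.
previous = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [3.0, 4.0, 5.0]
projected = project_to_complement(new_update, previous)
```

Because the projected update has zero inner product with each earlier update direction, applying it cannot move the parameters along those directions, which is the sense in which the edits stay independent in this toy setting.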
[24] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization
Zhensheng Jin,Xinze Li,Yifan Ji,Chunyi Peng,Zhenghao Liu,Qi Shi,Yukun Yan,Shuo Wang,Furong Peng,Ge Yu
Main category: cs.CL
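TL;DR: ReCUT balances reasoning length and accuracy in LLMs through stepwise exploration and a long-short switched sampling strategy, substantially shortening reasoning while maintaining accuracy.
Details
Motivation: Existing CoT prompting methods suffer from overthinking, producing lengthy or redundant reasoning traces; existing remedies are limited by data quality and overfitting.
Contribution: Proposes ReCUT, which generates diverse reasoning paths via stepwise exploration and long-short switched sampling, trains two specialized models (one optimized for accuracy, the other for reasoning length), and obtains the final integrated model by parameter interpolation.
Method: Generates reasoning paths with a stepwise exploration mechanism and a long-short switched sampling strategy, trains two specialized models on preference pairs, and merges them by parameter interpolation.
Result: On multiple math reasoning datasets, reasoning length drops by about 30-50% while accuracy is maintained or improved.
Insight: By balancing reasoning length and accuracy, ReCUT reduces computational overhead while preserving reasoning quality, offering a new approach to efficient LLM reasoning.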
Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially
improved the reasoning capabilities of Large Language Models (LLMs). However,
these methods often suffer from overthinking, leading to unnecessarily lengthy
or redundant reasoning traces. Existing approaches attempt to mitigate this
issue through curating multiple reasoning chains for training LLMs, but their
effectiveness is often constrained by the quality of the generated data and
prone to overfitting. To address the challenge, we propose Reasoning
Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing
the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a
stepwise exploration mechanism and a long-short switched sampling strategy,
enabling LLMs to incrementally generate diverse reasoning paths. These paths
are evaluated and used to construct preference pairs to train two specialized
models, one optimized for reasoning accuracy and the other for
shorter reasoning. A final integrated model is obtained by interpolating the
parameters of these two models. Experimental results across multiple math
reasoning datasets and backbone models demonstrate that ReCUT significantly
reduces reasoning lengths by approximately 30-50%, while maintaining or
improving reasoning accuracy compared to various baselines. All code and data
will be released via https://github.com/NEUIR/ReCUT.
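The final merging step can be pictured as a weighted average of two checkpoints with identical parameter names and shapes. The abstract says only that the parameters are interpolated, so the linear form, the coefficient `alpha`, and the toy parameter dictionaries below are assumptions for illustration.

```python
def interpolate(params_acc, params_short, alpha):
    """Linear interpolation: alpha * accuracy model + (1 - alpha) * short model.

    Both arguments map parameter names to flat lists of floats.
    """
    if params_acc.keys() != params_short.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [alpha * a + (1 - alpha) * s
               for a, s in zip(params_acc[name], params_short[name])]
        for name in params_acc
    }

# Hypothetical two-parameter checkpoints.
accuracy_model = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
short_model = {"layer.weight": [3.0, 1.0], "layer.bias": [2.0]}
merged = interpolate(accuracy_model, short_model, alpha=0.5)
```

Setting `alpha` closer to 1 keeps the merged model nearer the accuracy-optimized checkpoint, while values closer to 0 favor the shorter-reasoning one.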
[25] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training
Alireza Salemi,Mukta Maddipatla,Hamed Zamani
Main category: cs.CL
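TL;DR: The paper proposes mRAG, a multi-agent retrieval-augmented generation framework that optimizes agent collaboration through self-training and reward-guided trajectory sampling, improving generation on complex tasks and performing well in the competition.
Details
Motivation: Conventional retrieval-augmented generation (RAG) methods are limited on complex tasks and lack multi-agent collaboration. To overcome this, the authors propose the multi-agent framework mRAG to optimize task decomposition and collaboration.
Contribution: 1. Proposes mRAG, a multi-agent RAG framework in which agents for subtasks such as planning, searching, reasoning, and coordination work together to complete a task.
2. Uses self-training with reward-guided trajectory sampling to optimize multi-agent collaboration.
3. Validates the approach in the SIGIR 2025 LiveRAG competition.
Method: 1. Designs a multi-agent framework in which each agent handles a specific subtask (such as planning or searching).
2. Uses a self-training paradigm with reward-guided trajectory sampling to optimize the collaboration strategy among agents.
3. Trains and evaluates on DataMorgana-derived datasets.
Result: mRAG outperforms conventional RAG baselines in the SIGIR 2025 LiveRAG competition, demonstrating its generation ability on complex tasks.
Insight: 1. Multi-agent collaboration can markedly improve generation quality on complex tasks.
2. Self-training with reward-guided trajectory sampling is an effective way to optimize agent collaboration.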
Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG)
framework composed of specialized agents for subtasks such as planning,
searching, reasoning, and coordination. Our system uses a self-training
paradigm with reward-guided trajectory sampling to optimize inter-agent
collaboration and enhance response generation. Evaluated on DataMorgana-derived
datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms
conventional RAG baselines. We further analyze competition outcomes and
showcase the framework’s strengths with case studies, demonstrating its
efficacy for complex, real-world RAG tasks.
[26] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles
Qingyan Wei,Yaojie Zhang,Zhiyuan Liu,Dongrui Liu,Linfeng Zhang
Main category: cs.CL
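TL;DR: The paper proposes SlowFast Sampling, a dynamic sampling strategy that alternates between exploratory and accelerated decoding stages to substantially improve the inference efficiency of diffusion language models, and combines it with dLLM-Cache to reduce redundant computation.
Details
Motivation: Existing sampling strategies for diffusion language models (such as confidence-based or semi-autoregressive decoding) behave statically, which leads to suboptimal efficiency and limited flexibility. A more efficient dynamic sampling method is needed.
Contribution: Proposes the SlowFast Sampling strategy and its three golden principles (the certainty, convergence, and positional principles), and integrates it with dLLM-Cache to improve computational efficiency.
Method: Uses the dynamic SlowFast Sampling strategy, alternating between exploratory and accelerated decoding, with the three principles governing when and where tokens are decoded; caching is added to reduce redundant computation.
Result: Experiments show that SlowFast Sampling achieves up to a 15.63x speedup on LLaDA, and up to 34.22x when combined with caching, with throughput exceeding the autoregressive baseline LLaMA3 8B.
Insight: A well-designed sampling strategy can unlock the full potential of diffusion language models for fast, high-quality text generation.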
Abstract: Diffusion-based language models (dLLMs) have emerged as a promising
alternative to traditional autoregressive LLMs by enabling parallel token
generation and significantly reducing inference latency. However, existing
sampling strategies for dLLMs, such as confidence-based or semi-autoregressive
decoding, often suffer from static behavior, leading to suboptimal efficiency
and limited flexibility. In this paper, we propose SlowFast Sampling, a novel
dynamic sampling strategy that adaptively alternates between exploratory and
accelerated decoding stages. Our method is guided by three golden principles:
certainty principle, convergence principle, and positional principle, which
govern when and where tokens can be confidently and efficiently decoded. We
further integrate our strategy with dLLM-Cache to reduce redundant computation.
Extensive experiments across benchmarks and models show that SlowFast Sampling
achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and
up to 34.22$\times$ when combined with caching. Notably, our approach
outperforms strong autoregressive baselines like LLaMA3 8B in throughput,
demonstrating that well-designed sampling can unlock the full potential of
dLLMs for fast and high-quality generation.
[27] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models
Michele Gubian,Ioana Krehan,Oli Liu,James Kirby,Sharon Goldwater
Main category: cs.CL
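TL;DR: The paper studies how the self-supervised speech model wav2vec2, pretrained on different languages, represents phonetic, tonal, and speaker information, and finds that these are encoded in largely orthogonal subspaces with patterns shared across languages.
Details
Motivation: Analyses of self-supervised speech models have focused mainly on English. The paper explores how wav2vec2 models pretrained on different languages encode phones, tones, and speaker information, filling a gap in multilingual research.
Contribution: By analyzing multilingual data, the paper finds that phone, tone, and speaker information is generally orthogonal in wav2vec2 models and that the representational structure is consistent across languages.
Method: Uses probing classifiers and geometric analyses to study how pretrained models encode phones, tones, and speaker information, and compares performance on language-matched and non-matched speech.
Result: For all pretraining languages, the subspaces encoding phones, tones, and speakers are largely orthogonal and the layerwise performance patterns are similar, with only a small matched-language advantage for phone and tone probes in the later layers.
Insight: The structure of the representations learned by wav2vec2 is largely independent of the speech material used in pretraining and may generalize across languages.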
Abstract: Analyses of self-supervised speech models have begun to reveal where and how
they represent different types of information. However, almost all analyses
have focused on English. Here, we examine how wav2vec2 models trained on four
different languages encode both language-matched and non-matched speech. We use
probing classifiers and geometric analyses to examine how phones, lexical
tones, and speaker information are represented. We show that for all
pretraining and test languages, the subspaces encoding phones, tones, and
speakers are largely orthogonal, and that layerwise patterns of probing
accuracy are similar, with a relatively small advantage for matched-language
phone and tone (but not speaker) probes in the later layers. Our findings
suggest that the structure of representations learned by wav2vec2 is largely
independent of the speech material used during pretraining.
[28] Slimming Down LLMs Without Losing Their Minds
Qingda,Mai
Main category: cs.CL
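TL;DR: This paper studies how parameter-efficient fine-tuning methods (LoRA and QLoRA) affect large language model (LLM) performance, validates them on commonsense reasoning, mathematical reasoning, and multi-domain knowledge tasks, and stresses the importance of aligning the fine-tuning dataset with the target task.
Details
Motivation: As large language models keep growing, demand for parameter-efficient fine-tuning methods is rising. The author aims to validate how well these methods improve task-specific performance and how computationally efficient they are.
Contribution: The study shows that (1) LoRA-based methods improve task performance while maintaining computational efficiency, and (2) alignment between the fine-tuning dataset and the target task has a strong effect on performance.
Method: Applies two parameter-efficient fine-tuning methods, LoRA and QLoRA, and evaluates them on three benchmarks: HellaSwag, GSM8K, and MMLU-CS.
Result: LoRA-based methods deliver strong and efficient task-specific performance, and target-task results depend closely on the choice of fine-tuning dataset.
Insight: The study offers theoretical grounding and practical guidance for developers adapting LLMs with limited resources, and highlights the key role of data alignment in fine-tuning.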
Abstract: This paper investigates and validates the impact of fine-tuning on large
language model performance, focusing on parameter-efficient methods (LoRA and
QLoRA). We evaluate model capabilities across three key domains: (1)
commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3)
multi-domain knowledge (MMLU-CS).
Our findings demonstrate that: (1) LoRA-based methods effectively improve
task-specific performance while maintaining computational efficiency, and (2)
performance strongly depends on alignment between fine-tuning dataset and
benchmark tasks. The study provides both theoretical insights into
parameter-efficient mechanisms and practical guidance for developers
implementing efficient LLM adaptation with limited resources.
[29] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang,Hanlin Zhu,Tianyu Guo,Jiantao Jiao,Somayeh Sojoudi,Michael I. Jordan,Stuart Russell,Song Mei
Main category: cs.CL
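TL;DR: This paper examines the source of the "generalization" and "hallucination" behaviors that large language models (LLMs) show when fine-tuned on new knowledge, attributes both to an out-of-context reasoning (OCR) mechanism, and supports this with experiments and theoretical analysis.
Details
Motivation: LLMs can acquire new knowledge through fine-tuning, yet the seemingly contradictory generalization and hallucination behaviors that result have not been adequately explained. The paper aims to uncover the mechanism common to both.
Contribution: 1) Identifies out-of-context reasoning (OCR) as the mechanism behind both generalization and hallucination in LLMs; 2) Confirms experimentally that OCR exists and shapes model behavior; 3) Analyzes the mathematical basis of OCR, namely the implicit bias of gradient descent.
Method: 1) Designs experiments on five prominent LLMs to verify the role of OCR; 2) Formalizes OCR as a synthetic factual recall task; 3) Analyzes what a one-layer single-head attention model can learn, highlighting the importance of matrix factorization.
Result: Experiments show that OCR drives both generalization and hallucination. The theoretical analysis shows that gradient descent favors solutions minimizing the nuclear norm of the output-value matrix, which explains the sample-efficient learning.
Insight: Generalization and hallucination during knowledge injection can be traced to the same underlying mechanism (OCR), and the implicit bias of gradient descent is the core reason the model learns to associate facts.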
Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning,
but this process exhibits a puzzling duality: models can generalize remarkably
from new facts, yet are also prone to hallucinating incorrect information.
However, the reasons for this phenomenon remain poorly understood. In this
work, we argue that both behaviors stem from a single mechanism known as
out-of-context reasoning (OCR): the ability to deduce implications by
associating concepts, even those without a causal link. Our experiments across
five prominent LLMs confirm that OCR indeed drives both generalization and
hallucination, depending on whether the associated concepts are causally
related. To build a rigorous theoretical understanding of this phenomenon, we
then formalize OCR as a synthetic factual recall task. We empirically show that
a one-layer single-head attention-only transformer with factorized output and
value matrices can learn to solve this task, while a model with combined
weights cannot, highlighting the crucial role of matrix factorization. Our
theoretical analysis shows that the OCR capability can be attributed to the
implicit bias of gradient descent, which favors solutions that minimize the
nuclear norm of the combined output-value matrix. This mathematical structure
explains why the model learns to associate facts and implications with high
sample efficiency, regardless of whether the correlation is causal or merely
spurious. Ultimately, our work provides a theoretical foundation for
understanding the OCR phenomenon, offering a new lens for analyzing and
mitigating undesirable behaviors from knowledge injection.
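For readers unfamiliar with the term, the nuclear norm mentioned above is the sum of a matrix's singular values, and it has a standard variational form in terms of factorizations. The block below states both; the symbols $W_O$, $W_V$, and $W_{OV}$ are notation assumed here for the output, value, and combined matrices, and the paper's own statement may be phrased differently.

```latex
% Nuclear norm of the combined output-value matrix W_{OV} = W_O W_V.
% Notation (W_O, W_V, W_{OV}) is assumed for illustration.
\[
  \|W_{OV}\|_{*}
  \;=\; \sum_{i} \sigma_i(W_{OV})
  \;=\; \min_{A,\,B \,:\, AB = W_{OV}}
        \tfrac{1}{2}\left( \|A\|_F^{2} + \|B\|_F^{2} \right)
\]
% The minimum ranges over factorizations whose inner dimension is at
% least rank(W_{OV}); \sigma_i denotes the i-th singular value.
```

The second equality gives some intuition for why factorization matters: keeping the Frobenius norms of two separate factors small is equivalent to keeping the nuclear norm of their product small, a low-rank-favoring quantity that penalizing the combined matrix directly would not yield.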
[30] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
Thomas Sounack,Joshua Davis,Brigitte Durieux,Antoine Chaffin,Tom J. Pollard,Eric Lehman,Alistair E. W. Johnson,Matthew McDermott,Tristan Naumann,Charlotta Lindvall
Main category: cs.CL
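TL;DR: BioClinical ModernBERT is a long-context encoder for biomedical and clinical NLP tasks that improves speed and performance through large-scale domain adaptation and long-context processing.
Details
Motivation: Existing encoders are limited in biomedical and clinical NLP, and their development lags behind decoder models, so an efficient, domain-adapted solution is needed.
Contribution: Introduces BioClinical ModernBERT, which improves domain adaptation and task performance through continued pretraining on the largest biomedical and clinical corpus to date.
Method: Adapts ModernBERT to the domain, combining long-context processing with large-scale pretraining on 53.5B tokens and drawing on 20 datasets from multiple sources to improve generalization.
Result: Outperforms existing encoders on four downstream tasks; base and large models are released along with training checkpoints to support research.
Insight: Multi-source data and large-scale pretraining are key to improving biomedical and clinical NLP models.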
Abstract: Encoder-based transformer models are central to biomedical and clinical
Natural Language Processing (NLP), as their bidirectional self-attention makes
them well-suited for efficiently extracting structured information from
unstructured text through discriminative tasks. However, encoders have seen
slower development compared to decoder models, leading to limited domain
adaptation in biomedical and clinical settings. We introduce BioClinical
ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT
release, incorporating long-context processing and substantial improvements in
speed and performance for biomedical and clinical NLP. BioClinical ModernBERT
is developed through continued pretraining on the largest biomedical and
clinical corpus to date, with over 53.5 billion tokens, and addresses a key
limitation of prior clinical encoders by leveraging 20 datasets from diverse
institutions, domains, and geographic regions, rather than relying on data from
a single source. It outperforms existing biomedical and clinical encoders on
four downstream tasks spanning a broad range of use cases. We release both base
(150M parameters) and large (396M parameters) versions of BioClinical
ModernBERT, along with training checkpoints to support further research.
[31] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning
Lan Zhang,Marco Valentino,Andre Freitas
Main category: cs.CL
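TL;DR: The paper proposes EFG, a systematic automatic evaluation method based on an ensemble of LLM judges, for autoformalization in formal mathematical reasoning; multi-dimensional criteria make the evaluation more transparent and effective.
Details
Motivation: Current LLM-judge evaluation of autoformalization is too coarse-grained to capture the nuanced, multi-dimensional quality requirements of advanced mathematical reasoning, so a more systematic evaluation framework is needed.
Contribution: Proposes an epistemically and formally grounded ensemble (EFG) of LLM judges that achieves more transparent and reliable evaluation through criteria covering logical preservation, mathematical consistency, formal validity, and formal quality.
Method: Uses an ensemble of LLM judges (EFG) with criteria covering logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), validated against human evaluation.
Result: Experiments show that the EFG ensemble agrees more closely with human assessments than a coarse-grained model when evaluating formal mathematical reasoning, especially on formal quality.
Insight: Guiding LLM judges with a well-defined set of atomic properties can yield a scalable, interpretable, and reliable automatic evaluation framework, particularly for complex formal mathematical reasoning.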
Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by
enabling the automatic translation of natural language statements into formal
languages. While recent advances using large language models (LLMs) have shown
promising results, methods for automatically evaluating autoformalization
remain underexplored. As one moves to more complex domains (e.g., advanced
mathematics), human evaluation requires significant time and domain expertise,
especially as the complexity of the underlying statements and background
knowledge increases. LLM-as-a-judge presents a promising approach for
automating such evaluation. However, existing methods typically employ
coarse-grained and generic evaluation criteria, which limit their effectiveness
for advanced formal mathematical reasoning, where quality hinges on nuanced,
multi-granular dimensions. In this work, we take a step toward addressing this
gap by introducing a systematic, automatic method to evaluate autoformalization
tasks. The proposed method is based on an epistemically and formally grounded
ensemble (EFG) of LLM judges, defined on criteria encompassing logical
preservation (LP), mathematical consistency (MC), formal validity (FV), and
formal quality (FQ), resulting in a transparent assessment that accounts for
different contributing factors. We validate the proposed framework to serve as
a proxy for autoformalization assessment within the domain of formal
mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM
judges is a suitable emerging proxy for evaluation, more strongly correlating
with human assessments than a coarse-grained model, especially when assessing
formal qualities. These findings suggest that LLM-as-judges, especially when
guided by a well-defined set of atomic properties, could offer a scalable,
interpretable, and reliable support for evaluating formal mathematical
reasoning.
[32] Magistral
Mistral-AI,Abhinav Rastogi,Albert Q. Jiang,Andy Lo,Gabrielle Berrada,Guillaume Lample,Jason Rute,Joep Barmentlo,Karmesh Yadav,Kartik Khandelwal,Khyathi Raghavi Chandu,Léonard Blier,Lucile Saulnier,Matthieu Dinot,Maxime Darrin,Neha Gupta,Roman Soletskyi,Sagar Vaze,Teven Le Scao,Yihan Wang,Adam Yang,Alexander H. Liu,Alexandre Sablayrolles,Amélie Héliou,Amélie Martin,Andy Ehrenberg,Anmol Agarwal,Antoine Roux,Arthur Darcet,Arthur Mensch,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Chris Bamford,Christian Wallenwein,Christophe Renaudin,Clémence Lanfranchi,Darius Dabert,Devon Mizelle,Diego de las Casas,Elliot Chane-Sane,Emilien Fugier,Emma Bou Hanna,Gauthier Delerce,Gauthier Guinet,Georgii Novikov,Guillaume Martin,Himanshu Jaju,Jan Ludziejewski,Jean-Hadrien Chabran,Jean-Malo Delignon,Joachim Studnia,Jonas Amar,Josselin Somerville Roberts,Julien Denize,Karan Saxena,Kush Jain,Lingxiao Zhao,Louis Martin,Luyu Gao,Lélio Renard Lavaud,Marie Pellat,Mathilde Guillaumin,Mathis Felardos,Maximilian Augustin,Mickaël Seznec,Nikhil Raghuraman,Olivier Duchenne,Patricia Wang,Patrick von Platen,Patryk Saffer,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Pavankumar Reddy Muddireddy,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Romain Sauvestre,Rémi Delacourt,Sanchit Gandhi,Sandeep Subramanian,Shashwat Dalal,Siddharth Gandhi,Soham Ghosh,Srijan Mishra,Sumukh Aithal,Szymon Antoniak,Thibault Schueller,Thibaut Lavril,Thomas Robert,Thomas Wang,Timothée Lacroix,Valeriia Nemychnikova,Victor Paltz,Virgile Richard,Wen-Ding Li,William Marshall,Xuanyu Zhang,Yunhao Tang
Main category: cs.CL
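TL;DR: Magistral is Mistral's first reasoning model, built on a scalable reinforcement learning (RL) pipeline. The work explores the limits of pure RL training of large language models (LLMs) and offers a simple method to force the model's reasoning language.
Details
Motivation: Existing approaches typically rely on existing implementations or on RL traces distilled from prior models. Magistral takes a ground-up approach, using only Mistral's own models and infrastructure, to explore what pure RL training can achieve.
Contribution: 1. Explores the limits of pure RL training of LLMs; 2. Presents a simple method to force the reasoning language of the model; 3. Shows that RL on text data alone preserves the initial checkpoint's multimodal understanding, instruction following, and function calling.
Method: Magistral Medium is trained on top of Mistral Medium 3 with RL alone; Magistral Small is open-sourced and additionally includes cold-start data from Magistral Medium.
Result: RL training on text maintains or improves multimodal understanding, instruction following, and function calling relative to the initial checkpoint.
Insight: Pure RL training is a viable strategy for training LLMs and can improve model performance without relying on external data such as traces distilled from prior models.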
Abstract: We introduce Magistral, Mistral’s first reasoning model and our own scalable
reinforcement learning (RL) pipeline. Instead of relying on existing
implementations and RL traces distilled from prior models, we follow a
ground-up approach, relying solely on our own models and infrastructure. Notably, we
demonstrate a stack that enabled us to explore the limits of pure RL training
of LLMs, present a simple method to force the reasoning language of the model,
and show that RL on text data alone maintains most of the initial checkpoint’s
capabilities. We find that RL on text maintains or improves multimodal
understanding, instruction following and function calling. We present Magistral
Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we
open-source Magistral Small (Apache 2.0) which further includes cold-start data
from Magistral Medium.
[33] Dynamic Epistemic Friction in Dialogue
Timothy Obiso,Kenneth Lai,Abhijnan Nath,Nikhil Krishnaswamy,James Pustejovsky
Main category: cs.CL
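TL;DR: The paper examines "dynamic epistemic friction" in collaboration between large language models (LLMs) and humans, that is, the resistance that arises when new information conflicts with existing beliefs, and proposes a model based on Dynamic Epistemic Logic to predict belief updates in dialogue.
Details
Motivation: Despite progress in aligning LLMs with human preferences, existing approaches overlook the epistemic friction involved in updating beliefs, which leads to weak performance when information is conflicting or ambiguous.
Contribution: Introduces the concept of dynamic epistemic friction and builds a model based on Dynamic Epistemic Logic for analyzing and predicting belief updates in dialogue during collaborative tasks.
Method: Uses the Dynamic Epistemic Logic framework to model epistemic friction as resistance in belief revision, and validates the model through analysis of a situated collaborative task.
Result: The model effectively predicts belief updates in dialogue and provides a finer-grained measure of belief alignment for complex real-world dialogue scenarios.
Insight: Dynamic epistemic friction offers a new perspective on belief conflict and belief updating in human-AI collaboration and may help improve LLM behavior when information is conflicting or ambiguous.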
Abstract: Recent developments in aligning Large Language Models (LLMs) with human
preferences have significantly enhanced their utility in human-AI collaborative
scenarios. However, such approaches often neglect the critical role of
“epistemic friction,” or the inherent resistance encountered when updating
beliefs in response to new, conflicting, or ambiguous information. In this
paper, we define dynamic epistemic friction as the resistance to epistemic
integration, characterized by the misalignment between an agent’s current
belief state and new propositions supported by external evidence. We position
this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit,
2011), where friction emerges as nontrivial belief-revision during the
interaction. We then present analyses from a situated collaborative task that
demonstrate how this model of epistemic friction can effectively predict belief
updates in dialogues, and we subsequently discuss how the model of belief
alignment as a measure of epistemic resistance or friction can naturally be
made more sophisticated to accommodate the complexities of real-world dialogue
scenarios.
[34] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Mozhi Zhang,Howe Tissue,Lu Wang,Xipeng Qiu
Main category: cs.CL
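TL;DR: Domain2Vec is a training-free method that optimizes the data mixture by decomposing datasets into linear combinations of meta-domains, substantially reducing computational overhead.
Details
Motivation: Choosing the optimal data mixture is an important problem in language model pretraining. Conventional methods require many training runs and are computationally expensive. The paper proposes a training-free method that finds the best mixture by vectorizing datasets.
Contribution: 1. Proposes Domain2Vec, a new method for vectorizing domains; 2. Introduces the concept of meta-domains and the Distribution Alignment Assumption (DA$^{2}$); 3. Enables training-free selection of the optimal data mixture.
Method: Domain2Vec decomposes a dataset into a linear combination of meta-domains, producing a domain vector that represents a distribution; a classifier performs the decomposition; the data mixture is optimized under the Distribution Alignment Assumption.
Result: On Pile-CC, the method reaches the same validation loss as the original mixture using only 51.5% of the computation; under the same compute budget, downstream task performance improves by 2.83% on average.
Insight: The Distribution Alignment Assumption gives data-mixture optimization a theoretical basis, and Domain2Vec's vectorization improves the efficiency and scalability of earlier methods.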
Abstract: We introduce~\textsc{Domain2Vec}, a novel approach that decomposes any
dataset into a linear combination of several \emph{meta-domains}, a new concept
designed to capture the key underlying features of datasets.
\textsc{Domain2Vec} maintains a vocabulary of meta-domains and uses a
classifier to decompose any given dataset into a domain vector that corresponds
to a distribution over this vocabulary. These domain vectors enable the
identification of the optimal data mixture for language model (LM) pretraining
in a training-free manner under the \emph{\textbf{D}istribution
\textbf{A}lignment \textbf{A}ssumption} (DA$^{2}$), which suggests that when
the data distributions of the training set and the validation set are better
aligned, a lower validation loss is achieved. Moreover, \textsc{Domain2Vec} can
be seamlessly integrated into previous works to model the relationship between
domain vectors and LM performance, greatly enhancing the efficiency and
scalability of previous methods. Extensive experiments demonstrate that
\textsc{Domain2Vec} helps find the data mixture that enhances downstream task
performance with minimal computational overhead. Specifically,
\textsc{Domain2Vec} achieves the same validation loss on Pile-CC using only
$51.5%$ of the computation required when training on the original mixture of
The Pile dataset. Under equivalent compute budget, \textsc{Domain2Vec} improves
downstream performance by an average of $2.83%$.
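The Distribution Alignment Assumption described above can be illustrated with a small sketch: given domain vectors for candidate training datasets and for a validation set, look for mixture weights whose mixed vector lies closest to the validation vector. Everything here is an assumption for illustration, not the paper's procedure: the function names are hypothetical, the distance is plain L2, and the search is a coarse grid over the weight simplex.

```python
from itertools import product


def mix(domain_vectors, weights):
    """Weighted sum of domain vectors (each a distribution over meta-domains)."""
    dim = len(domain_vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, domain_vectors)) for i in range(dim)]


def l2_distance(p, q):
    """Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def best_mixture(domain_vectors, target, steps=10):
    """Grid-search mixture weights (multiples of 1/steps that sum to 1)
    minimizing the distance between the mixed vector and the target."""
    best_weights, best_dist = None, float("inf")
    for counts in product(range(steps + 1), repeat=len(domain_vectors)):
        if sum(counts) != steps:
            continue  # keep only points on the weight simplex
        weights = [c / steps for c in counts]
        dist = l2_distance(mix(domain_vectors, weights), target)
        if dist < best_dist:
            best_weights, best_dist = weights, dist
    return best_weights, best_dist
```

For example, with two training datasets that each cover a single meta-domain, `best_mixture([[1.0, 0.0], [0.0, 1.0]], [0.3, 0.7])` selects the weights that reproduce the validation distribution. The grid search grows exponentially with the number of datasets, so a real system would need a proper optimizer.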
[35] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang,Sang-Woo Lee,Nora Kassner,Daniela Gottesman,Sebastian Riedel,Mor Geva
Main category: cs.CL
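TL;DR: This paper examines how well reasoning models identify and recover from four types of unhelpful thoughts. Models identify most of them effectively but struggle to recover once the thoughts are injected into their reasoning, with larger models struggling more on short irrelevant thoughts. The authors call for better self-reevaluation.
Details
Motivation: The study asks whether reasoning models have effective self-reevaluation abilities, that is, whether they can identify and correct unhelpful thoughts, which matters for both reasoning accuracy and safety.
Contribution: 1) A study of four categories of unhelpful thoughts; 2) evidence of the limits of models in identifying and recovering from such thoughts; 3) an inverse-scaling trend in which larger models recover worse from short irrelevant thoughts; 4) experiments showing the safety risk posed by injecting harmful thoughts.
Method: 1) Experiments that evaluate how well models identify and recover from unhelpful thoughts; 2) injection of the four thought types into the thinking process and observation of model behavior; 3) comparison across model sizes; 4) a jailbreak experiment to assess the safety implications.
Result: 1) Models identify unhelpful thoughts effectively but struggle to recover from them; 2) larger models perform worse when short irrelevant thoughts are injected; 3) the smallest models are the least distracted by harmful-response-triggering thoughts.
Insight: The self-reevaluation abilities of current models are immature and need improvement before reasoning systems can be reliable and safe; scale is not always the key to better performance.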
Abstract: Recent reasoning models show the ability to reflect, backtrack, and
self-validate their reasoning, which is crucial in spotting mistakes and
arriving at accurate solutions. A natural question that arises is how
effectively models can perform such self-reevaluation. We tackle this question
by investigating how well reasoning models identify and recover from four types
of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to
the question, thoughts misdirecting the question as a slightly different
question, and thoughts that lead to incorrect answers. We show that models are
effective at identifying most unhelpful thoughts but struggle to recover from
the same thoughts when these are injected into their thinking process, causing
significant performance drops. Models tend to naively continue the line of
reasoning of the injected irrelevant thoughts, which showcases that their
self-reevaluation abilities are far from a general “meta-cognitive” awareness.
Moreover, we observe non/inverse-scaling trends, where larger models struggle
more than smaller ones to recover from short irrelevant thoughts, even when
instructed to reevaluate their reasoning. We demonstrate the implications of
these findings with a jailbreak experiment using irrelevant thought injection,
showing that the smallest models are the least distracted by
harmful-response-triggering thoughts. Overall, our findings call for
improvement in self-reevaluation of reasoning models to develop better
reasoning and safer systems.
cs.CV [Back]
[36] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models
Sridhar S,Nithin A,Shakeel Rifath,Vasantha Raj
Main category: cs.CV
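TL;DR: This paper presents a multimodal cinematic video synthesis method that combines text-to-image and audio generation models to automatically produce 60-second cinematic videos, supporting resolutions of up to 1024x768 and frame rates of 15-30 FPS.
Details
Motivation: Advances in generative AI have pushed multimedia creation toward automation, but an end-to-end method for cinematic video generation that integrates text, image, and audio synthesis is still lacking.
Contribution: A five-scene framework that combines Stable Diffusion, GPT-2, and a hybrid audio pipeline to synthesize audio-video synchronized cinematic videos, with optimizations for efficiency and reliability.
Method: 1) Stable Diffusion for high-fidelity image generation; 2) GPT-2 for narrative structuring; 3) a hybrid audio pipeline (gTTS and YouTube-sourced music); 4) linear frame interpolation and cinematic post-processing to improve quality; 5) CUDA memory management and error handling.
Result: The experiments report strong visual quality, narrative coherence, and efficiency, with applications in creative, educational, and industrial settings.
Insight: The work illustrates the potential of generative AI for multimodal video synthesis: integrating several models and optimizing the pipeline yields end-to-end generation of cinematic content.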
Abstract: Advances in generative artificial intelligence have altered multimedia
creation, allowing for automatic cinematic video synthesis from text inputs.
This work describes a method for creating 60-second cinematic movies
incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for
narrative structuring, and a hybrid audio pipeline using gTTS and
YouTube-sourced music. It uses a five-scene framework, which is augmented by
linear frame interpolation, cinematic post-processing (e.g., sharpening), and
audio-video synchronization to provide professional-quality results. It was
created in a GPU-accelerated Google Colab environment using Python 3.11. It has
a dual-mode Gradio interface (Simple and Advanced), which supports resolutions
of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA
memory management and error handling ensure reliability. The experiments
demonstrate outstanding visual quality, narrative coherence, and efficiency,
furthering text-to-video synthesis for creative, educational, and industrial
applications.
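The linear frame interpolation step mentioned in the abstract can be sketched as a per-pixel blend between two keyframes. This is a minimal illustration under stated assumptions: frames are nested lists of grayscale values, and the function name `interpolate_frames` is hypothetical rather than taken from the paper's code.

```python
def interpolate_frames(frame_a, frame_b, num_inbetween):
    """Return `num_inbetween` frames linearly blended between two keyframes.

    Each frame is a list of rows, each row a list of pixel intensities.
    The keyframes themselves are not included in the result.
    """
    frames = []
    for k in range(1, num_inbetween + 1):
        t = k / (num_inbetween + 1)  # blend factor strictly between 0 and 1
        frames.append([
            [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ])
    return frames
```

Generating one in-between frame per keyframe pair doubles the effective frame rate, which is one plausible way to reach the 15-30 FPS range from a small number of generated images.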
[37] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
Chenjian Gao,Lihe Ding,Xin Cai,Zhanpeng Huang,Zibin Wang,Tianfan Xue
Main category: cs.CV
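TL;DR: LoRA-Edit is a video editing method based on LoRA (Low-Rank Adaptation) fine-tuning. Mask-aware LoRA tuning enables controllable first-frame-guided editing, avoids the limitations of large-scale pretraining, preserves the background, and supports flexible edit propagation.
Details
Motivation: Current diffusion-based video editing methods rely on large-scale pretraining and lack flexibility for specific edits. First-frame-guided editing controls the first frame but offers little control over subsequent frames. The paper aims to close this gap.
Contribution: A mask-driven LoRA tuning method that efficiently adapts a pretrained Image-to-Video model without changing its architecture, supporting flexible edit propagation and background preservation.
Method: Mask-aware LoRA tuning combines spatial structure and motion cues from the input video with appearance guidance from reference images; a spatial mask dynamically modulates which regions the model attends to.
Result: Experiments show that the method outperforms existing methods on video editing tasks and produces high-quality edits.
Insight: Combining masks with LoRA gives flexible control over video editing without modifying the model architecture, suggesting a new direction for efficient video editing.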
Abstract: Video editing using diffusion models has achieved remarkable results in
generating high-quality edits for videos. However, current methods often rely
on large-scale pretraining, limiting flexibility for specific edits.
First-frame-guided editing provides control over the first frame, but lacks
flexibility over subsequent frames. To address this, we propose a mask-based
LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video
(I2V) models for flexible video editing. Our approach preserves background
regions while enabling controllable edit propagation. This solution offers
efficient and adaptable video editing without altering the model architecture.
To better steer this process, we incorporate additional references, such as
alternate viewpoints or representative scene states, which serve as visual
anchors for how content should unfold. We address the control challenge using a
mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model
to the editing context. The model must learn from two distinct sources: the
input video provides spatial structure and motion cues, while reference images
offer appearance guidance. A spatial mask enables region-specific learning by
dynamically modulating what the model attends to, ensuring that each area draws
from the appropriate source. Experimental results show our method achieves
superior video editing performance compared to state-of-the-art methods.
[38] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding
Bin Guo,John H. L. Hansen
Main category: cs.CV
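TL;DR: DeepTraverse is a vision architecture inspired by depth-first search. Recursive exploration modules and adaptive calibration modules build and refine features systematically, improving classification accuracy and feature discrimination.
Details
Motivation: Conventional vision backbones construct features through a largely uniform cascade of operations and lack pathways for adaptive, iterative refinement. Inspired by classical search algorithms, the authors explore whether an algorithmic, structured processing flow can make learned features more interpretable and reasoning-like.
Contribution: 1) The DeepTraverse architecture, which combines recursive exploration modules with adaptive calibration modules for efficient, structured feature construction. 2) Experiments showing competitive classification performance, often better than conventional models.
Method: 1) Recursive exploration modules deepen feature analysis along promising representational paths, with parameter sharing for efficiency. 2) Adaptive calibration modules dynamically adjust feature salience based on the global context. Together the two components construct and refine features systematically.
Result: Across several image classification benchmarks, DeepTraverse achieves highly competitive accuracy and robust feature discrimination, often outperforming conventional models with similar or larger parameter counts.
Insight: Bringing algorithmic priors such as search strategies into network design can yield more efficient, better-performing, and more structured vision backbones, and suggests a route toward more interpretable, reasoning-like networks.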
Abstract: Conventional vision backbones, despite their success, often construct
features through a largely uniform cascade of operations, offering limited
explicit pathways for adaptive, iterative refinement. This raises a compelling
question: can principles from classical search algorithms instill a more
algorithmic, structured, and logical processing flow within these networks,
leading to representations built through more interpretable, perhaps
reasoning-like decision processes? We introduce DeepTraverse, a novel vision
architecture directly inspired by algorithmic search strategies, enabling it to
learn features through a process of systematic elucidation and adaptive
refinement distinct from conventional approaches. DeepTraverse operationalizes
this via two key synergistic components: recursive exploration modules that
methodically deepen feature analysis along promising representational paths
with parameter sharing for efficiency, and adaptive calibration modules that
dynamically adjust feature salience based on evolving global context. The
resulting algorithmic interplay allows DeepTraverse to intelligently construct
and refine feature patterns. Comprehensive evaluations across a diverse suite
of image classification benchmarks show that DeepTraverse achieves highly
competitive classification accuracy and robust feature discrimination, often
outperforming conventional models with similar or larger parameter counts. Our
work demonstrates that integrating such algorithmic priors provides a
principled and effective strategy for building more efficient, performant, and
structured vision backbones.
[39] Test-Time Adaptation for Generalizable Task Progress Estimation
Christos Ziakas,Alessandra Russo
Main category: cs.CV
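TL;DR: This paper proposes a test-time adaptation method in which a progress estimation model adapts online to the visual and temporal context of test trajectories by optimizing a self-supervised objective. A gradient-based meta-learning strategy trains the model on expert visual trajectories and natural language task descriptions, so that adaptation improves progress estimation that relies on semantic content rather than temporal order.
Details
Motivation: Conventional progress estimation methods perform poorly on out-of-distribution tasks, environments, and embodiments. The paper aims for a general adaptive method that lets the model adjust dynamically to new scenarios at test time.
Contribution: 1. A test-time adaptation method that optimizes the model online for new scenarios. 2. A gradient-based meta-learning strategy that trains on expert visual trajectories and natural language task descriptions. 3. Evidence of generalization to out-of-distribution tasks, environments, and embodiments.
Method: The model is trained with gradient-based meta-learning on expert visual trajectories and natural language task descriptions, together with a learned self-supervised objective. At test time, the model adapts its parameters by optimizing that objective.
Result: On out-of-distribution tasks, environments, and embodiments, the method outperforms the state-of-the-art in-context learning approach using autoregressive vision-language models.
Insight: Combining meta-learning with test-time adaptation lets the model capture semantic content without relying on temporal order, improving generality and adaptability.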
Abstract: We propose a test-time adaptation method that enables a progress estimation
model to adapt online to the visual and temporal context of test trajectories
by optimizing a learned self-supervised objective. To this end, we introduce a
gradient-based meta-learning strategy to train the model on expert visual
trajectories and their natural language task descriptions, such that test-time
adaptation improves progress estimation relying on semantic content over
temporal order. Our test-time adaptation method generalizes from a single
training environment to diverse out-of-distribution tasks, environments, and
embodiments, outperforming the state-of-the-art in-context learning approach
using autoregressive vision-language models.
[40] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Yantai Yang,Yuhao Wang,Zichen Wen,Luo Zhongwei,Chang Zou,Zhipeng Zhang,Chuan Wen,Linfeng Zhang
Main category: cs.CV
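TL;DR: EfficientVLA is a training-free inference acceleration framework that reduces the compute and memory overhead of VLA models through layer pruning, visual token selection, and feature caching, substantially speeding up inference while largely preserving performance.
Details
Motivation: Current VLA models, particularly diffusion-based architectures, carry redundant compute and memory costs that limit practical deployment. Existing acceleration methods usually target isolated inefficiencies and fail to optimize the whole VLA pipeline.
Contribution: The EfficientVLA framework, which combines layer pruning, visual token selection, and feature caching to systematically remove compute and memory redundancy in VLA models.
Method: 1) Prune functionally inconsequential layers from the language module; 2) select visual tokens with a task-aware strategy; 3) cache and reuse intermediate features in the diffusion-based action head.
Result: On CogACT, EfficientVLA achieves a 1.93x inference speedup and reduces FLOPs to 28.9%, with only a 0.6% drop in success rate on the SIMPLER benchmark.
Insight: Structured, training-free methods can markedly improve the efficiency of VLA models and make practical deployment more feasible.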
Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based
architectures, demonstrate transformative potential for embodied intelligence
but are severely hampered by high computational and memory demands stemming
from extensive inherent and inference-time redundancies. While existing
acceleration efforts often target isolated inefficiencies, such piecemeal
solutions typically fail to holistically address the varied computational and
memory bottlenecks across the entire VLA pipeline, thereby limiting practical
deployability. We introduce EfficientVLA, a structured and training-free
inference acceleration framework that systematically eliminates these barriers
by cohesively exploiting multifaceted redundancies. EfficientVLA
synergistically integrates three targeted strategies: (1) pruning of
functionally inconsequential layers from the language module, guided by an
analysis of inter-layer redundancies; (2) optimizing the visual processing
pathway through a task-aware strategy that selects a compact, diverse set of
visual tokens, balancing task-criticality with informational coverage; and (3)
alleviating temporal computational redundancy within the iterative
diffusion-based action head by strategically caching and reusing key
intermediate features. We apply our method to CogACT, a standard VLA model,
yielding a 1.93x inference speedup and reducing FLOPs to 28.9%, with only a
0.6% drop in success rate on the SIMPLER benchmark.
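The third strategy, caching and reusing intermediate features across iterative denoising steps, can be sketched generically as follows. This is only an illustration of the caching idea: the fixed refresh interval, the function name `run_denoising`, and the `compute_features` callback are assumptions, and the abstract does not say how EfficientVLA decides which features to cache or when to refresh them.

```python
def run_denoising(num_steps, compute_features, refresh_interval):
    """Run `num_steps` iterations, recomputing the expensive features only
    every `refresh_interval` steps and reusing the cached value otherwise.

    `compute_features(step)` stands in for a costly forward pass (hypothetical).
    Returns the feature value used at each step.
    """
    if refresh_interval < 1:
        raise ValueError("refresh_interval must be at least 1")
    cached = None
    used = []
    for step in range(num_steps):
        if cached is None or step % refresh_interval == 0:
            cached = compute_features(step)  # expensive path
        used.append(cached)                  # cheap path reuses the cache
    return used
```

With 10 steps and a refresh interval of 5, the expensive function runs twice instead of ten times; the trade-off is that steps between refreshes operate on slightly stale features.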
[41] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild
Klim Kireev,Ana-Maria Creţu,Raphael Meier,Sarah Adel Bargal,Elissa Redmiles,Carmela Troncoso
Main category: cs.CV
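TL;DR: The paper releases ICCWD, the first image-caption dataset for detecting minors in a multi-modal setting, with 10,000 manually labeled image-caption pairs for benchmarking detection tools. Experiments show that the task remains challenging for existing methods.
Details
Motivation: No dataset exists for detecting minors in a multi-modal setting, while laws and platforms have a pressing need to regulate content depicting minors.
Contribution: Releases ICCWD, the first multi-modal dataset for detecting minors, with rich contexts and manual labels.
Method: The authors manually label 10,000 image-caption pairs and benchmark three detectors, including a commercial age estimation system.
Result: The best detector achieves a 75.3% true positive rate, indicating that the task is challenging.
Insight: A multi-modal dataset supports the design of better methods for detecting minors and fills a gap in existing research.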
Abstract: Platforms and the law regulate digital content depicting minors (defined as
individuals under 18 years of age) differently from other types of content.
Given the sheer amount of content that needs to be assessed, machine
learning-based automation tools are commonly used to detect content depicting
minors. To our knowledge, no dataset or benchmark currently exists for
evaluating these detection methods in a multi-modal environment. To fill
this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an
image-caption dataset aimed at benchmarking tools that detect depictions of
minors. Our dataset is richer than previous child image datasets, containing
images of children in a variety of contexts, including fictional depictions and
partially visible bodies. ICCWD contains 10,000 image-caption pairs manually
labeled to indicate the presence or absence of a child in the image. To
demonstrate the possible utility of our dataset, we use it to benchmark three
different detectors, including a commercial age estimation system applied to
images. Our results suggest that child detection is a challenging task, with
the best method achieving a 75.3% true positive rate. We hope the release of
our dataset will aid in the design of better minor detection methods in a wide
range of scenarios.
[42] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers
Natanael Lucena,Fábio S. da Silva,Ricardo Rios
Main category: cs.CV
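TL;DR: This paper compares Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on multi-class classification of images of psoriasis and similar lesions. ViTs perform better with smaller models, and the DaViT-B architecture is recommended for automated detection.
Details
Motivation: The study explores the performance gap between CNNs and ViTs in medical image classification, specifically psoriasis recognition, to support efficient automated diagnostic tools.
Contribution: Comparative experiments showing the advantage of ViTs for psoriasis detection; DaViT-B in particular reaches an F1-score of 96.4%.
Method: CNN and ViT models pre-trained on ImageNet are fine-tuned on a task-specific dataset and evaluated with predictive metrics.
Result: ViTs outperform CNNs; DaViT-B performs best, with an F1-score of 96.4%.
Insight: Vision Transformers perform strongly even at small model sizes, indicating their potential for medical image classification.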
Abstract: This paper presents a comparison of the performance of Convolutional Neural
Networks (CNNs) and Vision Transformers (ViTs) in the multi-class classification
of images containing lesions of psoriasis and similar diseases. Models
pre-trained on ImageNet were adapted to a specific data set. Both achieved high
predictive metrics, but the ViTs stood out for their superior performance with
smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the
best results, with an f1-score of 96.4%, and is recommended as the most
efficient architecture for automated psoriasis detection. This article
reinforces the potential of ViTs for medical image classification tasks.
[43] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Xiyao Wang,Zhengyuan Yang,Chao Feng,Yongyuan Liang,Yuhang Zhou,Xiaoyu Liu,Ziyi Zang,Ming Li,Chung-Ching Lin,Kevin Lin,Linjie Li,Furong Huang,Lijuan Wang
Main category: cs.CV
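TL;DR: The paper proposes ViCrit, a task for fine-tuning vision-language models (VLMs) with reinforcement learning (RL) in which the model must localize an artificially injected error in a visual description, thereby improving visual perception. ViCrit balances task difficulty with verifiability, and models trained on it improve across multiple VL benchmarks.
Details
Motivation: Existing RL tasks mostly target pure language models (e.g. math reasoning or code generation). VLMs lack proxy tasks for visual perception that are both challenging and unambiguously verifiable, which has held back RL in the visual domain.
Contribution: Proposes the ViCrit task, which asks the model to localize a synthetic visual hallucination (an incorrect visual description), giving VLMs a verifiable RL proxy task, and demonstrates generalization across a range of VL tasks.
Method: A single visual error (object, attribute, count, or spatial relation) is injected into a 200-word human-written image caption, and the model must localize it given the image and the modified caption. The reward is binary and exact-match, which makes it easy to compute.
Result: Models trained with ViCrit improve substantially on multiple VL benchmarks, and the gains transfer beyond natural images to abstract image reasoning and visual math.
Insight: ViCrit appears to foster genuine visual perception rather than mere memorization of seen objects, suggesting that fine-grained hallucination criticism is an effective objective for improving visual perception in VLMs.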
Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning
large language models (LLMs) using tasks that are challenging yet easily
verifiable, such as math reasoning or code generation. However, extending this
success to visual perception in vision-language models (VLMs) has been impeded
by the scarcity of vision-centric tasks that are simultaneously challenging and
unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption
Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle,
synthetic visual hallucination injected into paragraphs of human-written image
captions. Starting from 200-word captions, we inject a single, subtle visual
description error (altering a few words on objects, attributes, counts, or
spatial relations) and task the model with pinpointing the corrupted span given the
image and the modified caption. This formulation preserves the full perceptual
difficulty while providing a binary, exact-match reward that is easy to compute
and unambiguous. Models trained with the ViCrit Task exhibit substantial gains
across a variety of VL benchmarks. Crucially, the improvements transfer beyond
natural-image training data to abstract image reasoning and visual math,
showing promise of learning to perceive rather than merely memorizing seen
objects. To facilitate evaluation, we further introduce ViCrit-Bench, a
category-balanced diagnostic benchmark that systematically probes perception
errors across diverse image domains and error types. Together, our results
demonstrate that fine-grained hallucination criticism is an effective and
generalizable objective for enhancing visual perception in VLMs.
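The binary, exact-match reward described in the abstract can be sketched in a few lines. This is an illustration under assumptions: the helper names are hypothetical, the error injection is a simple phrase substitution, and the whitespace and case normalization is a choice made here, not something the abstract specifies.

```python
def normalize_span(text):
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(text.lower().split())


def inject_error(caption, original_phrase, corrupted_phrase):
    """Replace exactly one occurrence of a phrase to create the corrupted caption."""
    if caption.count(original_phrase) != 1:
        raise ValueError("original_phrase must occur exactly once in the caption")
    return caption.replace(original_phrase, corrupted_phrase)


def exact_match_reward(predicted_span, corrupted_span):
    """Return 1.0 if the model pinpointed the corrupted span, else 0.0."""
    return 1.0 if normalize_span(predicted_span) == normalize_span(corrupted_span) else 0.0
```

Because the corrupted span is known at injection time, the reward needs no learned judge, which is what makes the task verifiable.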
[44] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context
Yael Frischholz,Devis Tuia,Michael Lehning
Main category: cs.CV
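TL;DR: This paper proposes an attention-based approach to surface solar radiation retrieval that implicitly learns to infer background reflectance from satellite image sequences, addressing the limitations of conventional methods in mountainous regions with dynamic snow cover.
Details
Motivation: Conventional retrieval algorithms estimate background reflectance from monthly statistics, which degrades in mountainous regions with dynamic snow cover. The paper aims to improve retrieval accuracy by learning background reflectance implicitly.
Contribution: An attention-based emulator built on the Temporo-Spatial Vision Transformer that needs no hand-crafted features (such as albedo maps or cloud masks) and learns background reflectance dynamics directly from satellite image sequences.
Method: Multi-spectral SEVIRI imagery, static topographic features, and solar geometry serve as inputs. The model is trained on surface solar radiation estimates from the HelioMont algorithm and infers background reflectance implicitly from temporal context.
Result: Given a sufficiently long temporal context, the model matches the performance of albedo-informed models, with the strongest effect in mountainous regions.
Insight: Temporal context is essential for implicitly learning background reflectance, especially where snow cover is dynamic. The model captures and exploits latent surface reflectance dynamics, which improves generalization.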
Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery
critically depends on estimating the background reflectance that a spaceborne
sensor would observe under clear-sky conditions. Deviations from this baseline
can then be used to detect cloud presence and guide radiative transfer models
in inferring atmospheric attenuation. Operational retrieval algorithms
typically approximate background reflectance using monthly statistics, assuming
surface properties vary slowly relative to atmospheric conditions. However,
this approach fails in mountainous regions where intermittent snow cover and
changing snow surfaces are frequent. We propose an attention-based emulator for
SSR retrieval that implicitly learns to infer clear-sky surface reflectance
from raw satellite image sequences. Built on the Temporo-Spatial Vision
Transformer, our approach eliminates the need for hand-crafted features such as
explicit albedo maps or cloud masks. The emulator is trained on instantaneous
SSR estimates from the HelioMont algorithm over Switzerland, a region
characterized by complex terrain and dynamic snow cover. Inputs include
multi-spectral SEVIRI imagery from the Meteosat Second Generation platform,
augmented with static topographic features and solar geometry. The target
variable is HelioMont’s SSR, computed as the sum of its direct and diffuse
horizontal irradiance components, given at a spatial resolution of 1.7 km. We
show that, when provided a sufficiently long temporal context, the model
matches the performance of albedo-informed models, highlighting the model’s
ability to internally learn and exploit latent surface reflectance dynamics.
Our geospatial analysis shows this effect is most powerful in mountainous
regions and improves generalization in both simple and complex topographic
settings. Code and datasets are publicly available at
https://github.com/frischwood/HeMu-dev.git
[45] Attention, Please! Revisiting Attentive Probing for Masked Image Modeling
Bill Psomas,Dionysis Christopoulos,Eirini Baltzi,Ioannis Kakogeorgiou,Tilemachos Aravanis,Nikos Komodakis,Konstantinos Karantzalos,Yannis Avrithis,Giorgos Tolias
Main category: cs.CV
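TL;DR: This paper proposes efficient probing (EP), a multi-query cross-attention mechanism that removes redundant parameters, improves computational efficiency, and outperforms linear probing and prior attentive probing methods on multiple benchmarks.
Details
Motivation: As fine-tuning becomes impractical at scale, probing has become an important evaluation protocol for self-supervised learning. Standard linear probing (LP), however, does not adequately reflect the potential of Masked Image Modeling (MIM) because of the distributed nature of patch tokens. This motivates attentive probing, but existing methods suffer from excessive parameterization and poor computational efficiency.
Contribution: Introduces efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, achieves up to a 10x speed-up, and outperforms existing methods on multiple benchmarks.
Method: EP uses multi-query cross-attention to selectively aggregate patch-level features, removing redundant projections and reducing parameter count and compute.
Result: EP outperforms LP and prior attentive probing methods across seven benchmarks, performs well in low-shot and layer-wise settings, and produces interpretable attention maps.
Insight: For models such as those trained with MIM, attentive probing reveals more than linear probing, and an efficient design can improve both accuracy and computational cost.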
Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is
emerging as the preferred evaluation protocol for self-supervised learning
(SSL). Yet, the standard linear probing (LP) fails to adequately reflect the
potential of models trained with Masked Image Modeling (MIM), due to the
distributed nature of patch tokens. This motivates the need for attentive
probing, an alternative that uses attention to selectively aggregate
patch-level features. Despite its growing adoption, attentive probing remains
under-explored, with existing methods suffering from excessive parameterization
and poor computational efficiency.
In this work, we revisit attentive probing through the lens of the
accuracy-efficiency trade-off. We conduct a systematic study of existing
methods, analyzing their mechanisms and benchmarking their performance. We
introduce efficient probing (EP), a multi-query cross-attention mechanism that
eliminates redundant projections, reduces the number of trainable parameters,
and achieves up to a 10× speed-up over conventional multi-head
attention. Despite its simplicity, EP outperforms LP and prior attentive
probing approaches across seven benchmarks, generalizes well beyond MIM to
diverse pre-training paradigms, produces interpretable attention maps, and
achieves strong gains in low-shot and layer-wise settings. Code available at
https://github.com/billpsomas/efficient-probing.
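The general idea of attentive pooling with several learnable queries can be sketched as below. This is a simplified illustration, not the EP implementation from the linked repository: it assumes the queries score patch tokens directly with no key or value projections, and the names `softmax` and `attentive_pool` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(patches, queries):
    """Pool patch tokens with one attention distribution per query.

    patches: list of N feature vectors (each of length D).
    queries: list of M query vectors (each of length D), learnable in practice.
    Returns M pooled vectors, each a weighted average of the patches.
    """
    dim = len(patches[0])
    pooled = []
    for query in queries:
        scores = [
            sum(q * p for q, p in zip(query, patch)) / math.sqrt(dim)
            for patch in patches
        ]
        weights = softmax(scores)
        pooled.append([
            sum(w * patch[i] for w, patch in zip(weights, patches))
            for i in range(dim)
        ])
    return pooled
```

A zero query gives uniform weights, so the pooled vector reduces to the mean of the patch tokens; a trained query concentrates weight on the patches most useful for the downstream classifier.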
[46] Improving Personalized Search with Regularized Low-Rank Parameter Updates
Fiona Ryan,Josef Sivic,Fabian Caba Heilbron,Judy Hoffman,James M. Rehg,Bryan Russell
Main category: cs.CV
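TL;DR: This paper improves personalized vision-language retrieval through regularized low-rank parameter updates, learning new concepts from a few examples while integrating personal and general knowledge. It achieves state-of-the-art results on the DeepFashion2 and ConCon-Chi benchmarks.
Details
Motivation: Personalized vision-language retrieval must learn new concepts from a few examples while combining personal and general knowledge. Existing methods struggle to balance the two.
Contribution: 1. A regularized low-rank parameter update that effectively adapts the language encoder's representation; 2. a study of strategies for combining the parameters of multiple personal concepts; 3. a new metric for how well general knowledge is preserved.
Method: Regularized low-rank updates are applied to a small set of parameters in the final layer of the language encoder, and multiple learned personal concepts are combined by parameter addition.
Result: On DeepFashion2 and ConCon-Chi, personalized retrieval accuracy improves over prior methods by 4%-22%.
Insight: Low-rank parameter updates balance learning new concepts against retaining general knowledge, and parameter addition is a workable way to combine multiple concepts.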
Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g.
“my dog Fido”) from only a few examples. This task is challenging because it
requires not only learning a new concept from a few images, but also
integrating the personal and general knowledge together to recognize the
concept in different contexts. In this paper, we show how to effectively adapt
the internal representation of a vision-language dual encoder model for
personalized vision-language retrieval. We find that regularized low-rank
adaptation of a small set of parameters in the language encoder’s final layer
serves as a highly effective alternative to textual inversion for recognizing
the personal concept while preserving general knowledge. Additionally, we
explore strategies for combining parameters of multiple learned personal
concepts, finding that parameter addition is effective. To evaluate how well
general knowledge is preserved in a finetuned representation, we introduce a
metric that measures image retrieval accuracy based on captions generated by a
vision language model (VLM). Our approach achieves state-of-the-art accuracy on
two benchmarks for personalized image retrieval with natural language queries -
DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal
retrievals.
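The combination of low-rank updates by parameter addition can be illustrated with plain matrices. In this sketch each personal concept contributes an update `B @ A` to a base weight matrix and several concepts are combined by summing their updates. The function names are hypothetical, and the regularization used during training is not shown.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def personalize(base_weight, concept_updates):
    """Add each concept's low-rank update (B, A) to the base weight matrix."""
    weight = base_weight
    for b_factor, a_factor in concept_updates:
        weight = add(weight, matmul(b_factor, a_factor))
    return weight
```

For a `d x d` weight and rank `r`, each concept stores only `2 * d * r` numbers instead of `d * d`, which is what makes keeping one update per personal concept cheap.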
[47] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators
Parsa Rahimi,Sebastien Marcel
Main category: cs.CV
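TL;DR: ScoreMix is a diffusion-based data augmentation method that mixes the scores of different classes to generate challenging samples, significantly improving discriminative models, especially when labeled data is limited.
Details
Motivation: The paper addresses the weak performance of discriminative models in low-data regimes by exploiting the score compositional properties of diffusion models to generate challenging synthetic data.
Contribution: 1) ScoreMix, which augments data through score composition in diffusion models; 2) the finding that mixing classes that are distant in the discriminator's embedding space works better than mixing classes that are close in the generator's condition space; 3) evidence that the generator's condition space and the discriminator's embedding space are only weakly correlated.
Method: Scores from different class-conditioned trajectories are convexly combined during diffusion sampling to generate challenging synthetic samples.
Result: Discriminative performance improves significantly on all studied benchmarks, particularly with limited data.
Insight: The condition space of the generator and the embedding space of the discriminator are weakly correlated, and mixing distant classes is the more effective strategy.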
Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation
strategy leveraging the score compositional properties of diffusion models to
enhance discriminator performance, particularly under scenarios with limited
labeled data. By convexly mixing the scores from different class-conditioned
trajectories during diffusion sampling, we generate challenging synthetic
samples that significantly improve discriminative capabilities in all studied
benchmarks. We systematically investigate class-selection strategies for mixing
and discover that greater performance gains arise when combining classes
distant in the discriminator’s embedding space, rather than close in the
generator’s condition space. Moreover, we empirically show that, under standard
metrics, the correlation between the generator’s learned condition space and
the discriminator’s embedding space is minimal. Our approach achieves notable
performance improvements without extensive parameter searches, demonstrating
practical advantages for training discriminative models while effectively
mitigating problems regarding collections of large datasets. Paper website:
https://parsa-ra.github.io/scoremix
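The convex mixing of scores can be written as s_mix = λ·s_a + (1 − λ)·s_b with λ in [0, 1]. The sketch below shows only that arithmetic on two score vectors; the name `mix_scores` is hypothetical, and how the scores are produced and consumed inside the sampler is outside what the abstract describes.

```python
def mix_scores(score_a, score_b, lam):
    """Convex combination of two class-conditioned score vectors.

    lam = 1.0 returns score_a, lam = 0.0 returns score_b.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")
    if len(score_a) != len(score_b):
        raise ValueError("score vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(score_a, score_b)]
```

Applying the mixed score at each sampling step steers the sample toward a blend of the two class conditions, which is the sense in which the generated samples are "challenging" for a discriminator.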
[48] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops
Hamid Kamangir,Mona Hajiesmaeeli,Mason Earles
Main category: cs.CV
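TL;DR: The paper presents a multi-modal deep learning model that combines satellite imagery, climate, evapotranspiration, and soil data for county-level yield forecasting of over 70 crops in California. The model achieves an R² score of 0.76 on the test dataset.
Details
Motivation: California is a global leader in agricultural production, yet accurate crop yield forecasting remains difficult because of the complex interplay of environmental, climatic, and soil factors. Data is plentiful, but effective ways to integrate it for forecasting are lacking.
Contribution: 1. A comprehensive yield benchmark dataset covering over 70 crops across all California counties (2008-2022). 2. A multi-modal deep learning model that combines satellite imagery and other data sources for accurate county-level yield forecasting.
Method: The model uses stratified feature extraction and a time-series encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability.
Result: On the test dataset, the model achieves an overall R² score of 0.76, indicating strong predictive performance.
Insight: The study offers a useful tool for agricultural forecasting, climate adaptation, and precision farming, and the public dataset and code are valuable for follow-up research.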
Abstract: California is a global leader in agricultural production, contributing 12.5%
of the United States' total output and ranking as the fifth-largest food and
cotton supplier in the world. Despite the availability of extensive historical
yield data from the USDA National Agricultural Statistics Service, accurate and
timely crop yield forecasting remains a challenge due to the complex interplay
of environmental, climatic, and soil-related factors. In this study, we
introduce a comprehensive crop yield benchmark dataset covering over 70 crops
across all California counties from 2008 to 2022. The benchmark integrates
diverse data sources, including Landsat satellite imagery, daily climate
records, monthly evapotranspiration, and high-resolution soil properties. To
effectively learn from these heterogeneous inputs, we develop a multi-modal
deep learning model tailored for county-level, crop-specific yield forecasting.
The model employs stratified feature extraction and a timeseries encoder to
capture spatial and temporal dynamics during the growing season. Static inputs
such as soil characteristics and crop identity inform long-term variability.
Our approach achieves an overall R² score of 0.76 across all crops on an unseen
test dataset, highlighting strong predictive performance across California's
diverse agricultural regions. This benchmark and modeling framework offer a
valuable foundation for advancing agricultural forecasting, climate adaptation,
and precision farming. The full dataset and codebase are publicly available at
our GitHub repository.
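For readers unfamiliar with the reported metric, the coefficient of determination is R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the observations from their mean. A minimal sketch follows; the function name is chosen here for illustration, and the abstract does not state how the paper aggregates the score across crops.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for paired observations and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observations are equal")
    return 1.0 - ss_res / ss_tot
```

A score of 1.0 means perfect predictions and 0.0 means no better than always predicting the mean yield, so 0.76 indicates that the model explains roughly three quarters of the variance in the held-out yields.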
[49] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos
Rajeev Yasarla,Shizhong Han,Hong Cai,Fatih Porikli
Main category: cs.CV
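TL;DR: DySS is an efficient 3D object detection method based on state-space learning and dynamic queries. A state-space model combined with a dynamic query update mechanism delivers strong detection performance and efficient inference.
Details
Motivation: Methods based on dense BEV features are computationally expensive, while sparse query-based methods still need a large number of queries when processing multi-frame video, which makes them inefficient. DySS addresses both problems with state-space learning and dynamic queries.
Contribution: 1. A state-space model (SSM) for processing temporal features, trained with auxiliary future prediction and masked reconstruction tasks. 2. A dynamic query update mechanism (merge, remove, and split) that keeps the query set lean. 3. State-of-the-art results on nuScenes (65.31 NDS and 57.4 mAP on the test split) with real-time inference at 33 FPS.
Method: 1. The SSM processes the sampled features sequentially over time steps. 2. Future prediction and masked reconstruction serve as auxiliary tasks to train the SSM. 3. The query set is updated dynamically (merge, remove, and split) based on the learned state-space features.
Result: DySS achieves 65.31 NDS and 57.4 mAP on the nuScenes test split, and 56.2 NDS and 46.2 mAP on the val split, with an inference speed of 33 FPS.
Insight: State-space learning combined with dynamic queries improves 3D detection performance while cutting compute cost, offering an efficient option for real-time perception.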
Abstract: Camera-based 3D object detection in Bird’s Eye View (BEV) is one of the most
important perception tasks in autonomous driving. Earlier methods rely on dense
BEV features, which are costly to construct. More recent works explore sparse
query-based detection. However, they still require a large number of queries
and can become expensive to run when more video frames are used. In this paper,
we propose DySS, a novel method that employs state-space learning and dynamic
queries. More specifically, DySS leverages a state-space model (SSM) to
sequentially process the sampled features over time steps. In order to
encourage the model to better capture the underlying motion and correspondence
information, we introduce auxiliary tasks of future prediction and masked
reconstruction to better train the SSM. The state of the SSM then provides an
informative yet efficient summarization of the scene. Based on the state-space
learned features, we dynamically update the queries via merge, remove, and
split operations, which help maintain a useful, lean set of detection queries
throughout the network. Our proposed DySS achieves both superior detection
performance and efficient inference. Specifically, on the nuScenes test split,
DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the
art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a
real-time inference speed of 33 FPS.
[50] HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Eunkyu Park,Minyeong Kim,Gunhee Kim
Main category: cs.CV
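TL;DR: HalLoc is a token-level localization dataset for hallucinations in vision-language models (VLMs). It supports efficient, probabilistic hallucination detection, contains 150K annotated samples, and comes with a low-overhead baseline model.
Details
Motivation: Existing hallucination detection methods are computationally heavy and cannot handle cases where the line between truthful and hallucinated information is unclear. HalLoc is designed to address both issues.
Contribution: Provides HalLoc, the first large-scale token-level hallucination localization dataset, and a low-overhead detection model that can be integrated directly into existing VLMs.
Method: The authors build a dataset of 150K annotated samples spanning VQA, instruction-following, and image captioning tasks, and train an efficient baseline model on it.
Result: HalLoc supports the development of probabilistic detection models, and the baseline model integrates seamlessly into VLMs to improve reliability.
Insight: HalLoc points to a new direction for making VLMs more trustworthy in real-world settings: efficient hallucination localization through probabilistic detection and a low-overhead model.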
Abstract: Hallucinations pose a significant challenge to the reliability of large
vision-language models, making their detection essential for ensuring accuracy
in critical applications. Current detection methods often rely on
computationally intensive models, leading to high latency and resource demands.
Their definitive outcomes also fail to account for real-world scenarios where
the line between hallucinated and truthful information is unclear. To address
these issues, we propose HalLoc, a dataset designed for efficient,
probabilistic hallucination detection. It features 150K token-level annotated
samples, including hallucination types, across Visual Question Answering (VQA),
instruction-following, and image captioning tasks. This dataset facilitates the
development of models that detect hallucinations with graded confidence,
enabling more informed user interactions. Additionally, we introduce a baseline
model trained on HalLoc, offering low-overhead, concurrent hallucination
detection during generation. The model can be seamlessly integrated into
existing VLMs, improving reliability while preserving efficiency. The prospect
of a robust plug-and-play hallucination detection module opens new avenues for
enhancing the trustworthiness of vision-language models in real-world
applications. The HalLoc dataset and code are publicly available at:
https://github.com/dbsltm/cvpr25_halloc.
[51] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation
Hamzeh Asgharnezhad,Pegah Tabarisaadi,Abbas Khosravi,Roohallah Alizadehsani,U. Rajendra Acharya
Main category: cs.CV
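TL;DR: The paper combines transfer learning with uncertainty quantification (UQ) in a comprehensive evaluation of skin cancer classification, aiming to improve both classification accuracy and the reliability of model outputs.
Details
Motivation: Early, accurate diagnosis of skin cancer is critical for patient treatment, but existing deep learning models are limited by data scarcity and a lack of uncertainty awareness, so both performance and trustworthiness need to improve.
Contribution: 1) Compares the performance of multiple pre-trained feature extractors paired with traditional classifiers; 2) introduces UQ methods (MCD, Ensemble, and EMCD) and evaluates them with uncertainty-aware metrics; 3) shows that ensemble methods offer a good balance between accuracy and uncertainty handling.
Method: Two phases: 1) on the HAM10000 dataset, benchmarks combinations of pre-trained models (CLIP variants, ResNet50, DenseNet121, and others) with classifiers such as SVM and XGBoost; 2) applies UQ methods (MCD, Ensemble, EMCD) to assess the reliability of model predictions.
Result: CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM, perform best; ensemble methods balance accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions.
Insight: Integrating UQ into medical diagnosis matters: it can improve performance and increase trust in the model in real clinical use.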
Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment
and improved patient outcomes. Deep learning (DL) models have shown promise in
automating skin cancer classification, but their performance can be limited by
data scarcity and a lack of uncertainty awareness. In this study, we present a
comprehensive evaluation of DL-based skin lesion classification using transfer
learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the
first phase, we benchmarked several pre-trained feature extractors, including
Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50
(ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual
Geometry Group network (VGG16), and EfficientNet-V2-Large, combined with a range
of traditional classifiers such as Support Vector Machine (SVM), eXtreme
Gradient Boosting (XGBoost), and logistic regression. Our results show that
CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM,
deliver the highest classification performance. In the second phase, we
incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte
Carlo Dropout (EMCD) to assess not only prediction accuracy but also the
reliability of model outputs. We evaluated these models using uncertainty-aware
metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen),
uncertainty specificity (USpe), and uncertainty precision (UPre). The results
demonstrate that ensemble methods offer a good trade-off between accuracy and
uncertainty handling, while EMCD is more sensitive to uncertain predictions.
This study highlights the importance of integrating UQ into DL-based medical
diagnosis to enhance both performance and trustworthiness in real-world
clinical applications.
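The abstract names four uncertainty-aware metrics without defining them. The sketch below assumes the uncertainty confusion matrix commonly used in this line of work: a prediction is "true certain" (correct and certain), "true uncertain" (incorrect and uncertain), "false certain" (incorrect but certain), or "false uncertain" (correct but uncertain). The function name and the boolean inputs are hypothetical; how a prediction is flagged as uncertain (for example, thresholding predictive entropy) is left out.

```python
from typing import Dict, Sequence


def uncertainty_metrics(correct: Sequence[bool],
                        uncertain: Sequence[bool]) -> Dict[str, float]:
    """Compute UAcc, USen, USpe and UPre from per-sample flags.

    Assumed definitions (not spelled out in the abstract):
      TC: correct and certain      TU: incorrect and uncertain
      FC: incorrect and certain    FU: correct and uncertain
    """
    tc = tu = fc = fu = 0
    for is_correct, is_uncertain in zip(correct, uncertain):
        if is_correct and not is_uncertain:
            tc += 1
        elif not is_correct and is_uncertain:
            tu += 1
        elif not is_correct and not is_uncertain:
            fc += 1
        else:
            fu += 1

    def ratio(num: int, den: int) -> float:
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    return {
        "UAcc": ratio(tu + tc, tu + tc + fu + fc),
        "USen": ratio(tu, tu + fc),   # wrong predictions that were flagged
        "USpe": ratio(tc, tc + fu),   # right predictions that were trusted
        "UPre": ratio(tu, tu + fu),   # flagged predictions that were wrong
    }


if __name__ == "__main__":
    correct = [True, True, True, False, False, True]
    uncertain = [False, False, True, True, False, False]
    print(uncertainty_metrics(correct, uncertain))
```

Under these definitions, a model with high USen flags most of its mistakes as uncertain, which is the behaviour a clinical reviewer would want.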
[52] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework
Sadia Kamal,Tim Oates,Joy Wan
Main category: cs.CV
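TL;DR: Proposes a weakly supervised multimodal framework that generates SOAP notes from lesion images and sparse clinical text, easing physician workload and reducing reliance on large annotated datasets, with performance comparable to advanced models.
Details
Motivation: Skin cancer is the most common cancer worldwide; writing SOAP notes by hand is time-consuming and contributes to clinician burnout, so an automated solution is needed.
Contribution: 1. Designs a weakly supervised multimodal framework that reduces dependence on annotated data; 2. proposes two new metrics, MedConceptEval and CCS, for evaluating the quality of generated clinical notes.
Method: Generates structured SOAP notes from lesion images and sparse clinical text through weakly supervised learning, without requiring large amounts of annotated data.
Result: Performs on par with GPT-4o, Claude, and DeepSeek Janus Pro on clinical relevance metrics.
Insight: Combining weak supervision with multimodal input suits structured text generation in medicine and effectively reduces annotation cost.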
Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for
over $8 billion in annual healthcare expenditures. In clinical settings,
physicians document patient visits using detailed SOAP (Subjective, Objective,
Assessment, and Plan) notes. However, manually generating these notes is
labor-intensive and contributes to clinician burnout. In this work, we propose
a weakly supervised multimodal framework to generate clinically structured SOAP
notes from limited inputs, including lesion images and sparse clinical text.
Our approach reduces reliance on manual annotations, enabling scalable,
clinically grounded documentation while alleviating clinician burden and
reducing the need for large annotated data. Our method achieves performance
comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical
relevance metrics. To evaluate clinical quality, we introduce two novel metrics,
MedConceptEval and Clinical Coherence Score (CCS), which assess semantic
alignment with expert medical concepts and input features, respectively.
[53] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video
Fei Zhao,Da Pan,Zelu Qi,Ping Shi
Main category: cs.CV
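TL;DR: Addresses audio-visual quality assessment (AVQA) for user-generated omnidirectional (360-degree) video (UGC-ODV) in the Metaverse context by building a dataset and proposing a baseline model.
Details
Motivation: With the rise of the Metaverse, 360-degree video is shifting from professional-generated content (PGC) to user-generated content (UGC), yet research on AVQA for UGC-ODV remains limited.
Contribution: 1. Builds a UGC-ODV audio-visual dataset of 300 videos; 2. designs an AVQA baseline model consisting of video feature extraction, audio feature extraction, and audio-visual fusion modules.
Method: 1. Five people shoot videos across 10 scene types using two kinds of 360-degree cameras; 2. a subjective AVQA experiment collects MOS ratings; 3. a baseline model extracts and fuses video and audio features to assess quality.
Result: The proposed baseline model achieves the best performance on the constructed dataset.
Insight: AVQA for UGC-ODV needs multimodal features, and quality assessment for user-generated content differs from that for professional content.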
Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos
(ODVs) have garnered notable interest, gradually shifting from
professional-generated content (PGC) to user-generated content (UGC). However,
the study of audio-visual quality assessment (AVQA) within ODVs remains
limited. To address this, we construct a dataset of UGC omnidirectional audio
and video (A/V) content. The videos are captured by five individuals using two
different types of omnidirectional cameras, shooting 300 videos covering 10
different scene types. A subjective AVQA experiment is conducted on the dataset
to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. To facilitate
research on UGC-ODV AVQA, we then construct an effective AVQA baseline model on
the proposed dataset. The baseline model consists of a video feature extraction
module, an audio feature extraction module, and an audio-visual fusion module.
The experimental results demonstrate that our model achieves the best
performance on the proposed dataset.
[54] Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions
Deliang Wang,Chao Yang,Gaowei Chen
Main category: cs.CV
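TL;DR: Investigates using vision-language models (VLMs) with zero-shot prompting to detect students' academic emotions in an online learning environment; Qwen2.5-VL-7B-Instruct recognizes confused expressions relatively well, but detection of distracted behavior is poor.
Details
Motivation: Traditional approaches to analyzing students' academic emotions rely on supervised learning and generalize poorly. VLMs offer a new way to tackle this.
Contribution: Verifies the potential of VLMs to detect academic emotions under zero-shot prompting and compares two models, providing a reference for practical use.
Method: Uses two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, for zero-shot analysis of 5,000 student facial images covering five expressions.
Result: Qwen2.5-VL-7B-Instruct performs relatively well on confused expressions, while both models do poorly on distracted behavior. Happy emotions are recognized best.
Insight: VLMs show some promise for academic emotion detection, but recognition of certain emotions (such as distraction) needs further improvement.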
Abstract: Students’ academic emotions significantly influence their social behavior and
learning performance. Traditional approaches to automatically and accurately
analyze these emotions have predominantly relied on supervised machine learning
algorithms. However, these models often struggle to generalize across different
contexts, necessitating repeated cycles of data collection, annotation, and
training. The emergence of Vision-Language Models (VLMs) offers a promising
alternative, enabling generalization across visual recognition tasks through
zero-shot prompting without requiring fine-tuning. This study investigates the
potential of VLMs to analyze students’ academic emotions via facial expressions
in an online learning environment. We employed two VLMs,
Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000
images depicting confused, distracted, happy, neutral, and tired expressions
using zero-shot prompting. Preliminary results indicate that both models
demonstrate moderate performance in academic facial expression recognition,
with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct.
Notably, both models excel in identifying students’ happy emotions but fail to
detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits
relatively high performance in recognizing students’ confused expressions,
highlighting its potential for practical applications in identifying content
that causes student confusion.
[55] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting
Lintao Xiang,Hongpei Zheng,Yating Huang,Qijun Yang,Hujun Yin
Main category: cs.CV
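TL;DR: PointGS uses point-attention-aware Gaussian splatting to achieve high-quality real-time rendering from sparse views.
Details
Motivation: Existing 3DGS methods need many calibrated views to produce a consistent scene representation; with sparse inputs they tend to overfit the training views, degrading rendering quality.
Contribution: Proposes a point-wise feature-aware Gaussian splatting framework that achieves high-quality rendering from sparse views via multiscale 2D feature sampling, a point interaction network, and lightweight MLP decoding.
Method: 1. Uses a stereo foundation model to estimate camera poses and reconstruct a dense point cloud; 2. encodes Gaussian colour attributes by sampling and aggregating multiscale 2D features; 3. designs a self-attention-based point interaction network to enrich point representations; 4. decodes the features into Gaussian parameters with lightweight MLPs for rendering.
Result: Significantly outperforms NeRF-based methods on diverse benchmarks and is competitive with state-of-the-art 3DGS methods in few-shot settings.
Insight: Point attention and multiscale feature fusion are key to improving sparse-view rendering quality.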
Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that
surpasses the neural radiance field (NeRF) in both rendering speed and visual
quality by leveraging an explicit 3D scene representation. Existing 3DGS
approaches require a large number of calibrated views to generate a consistent
and complete scene representation. When input views are limited, 3DGS tends to
overfit the training views, leading to noticeable degradation in rendering
quality. To address this limitation, we propose a Point-wise Feature-Aware
Gaussian Splatting framework that enables real-time, high-quality rendering
from sparse training views. Specifically, we first employ the latest stereo
foundation model to estimate accurate camera poses and reconstruct a dense
point cloud for Gaussian initialization. We then encode the colour attributes
of each 3D Gaussian by sampling and aggregating multiscale 2D appearance
features from sparse inputs. To enhance point-wise appearance representation,
we design a point interaction network based on a self-attention mechanism,
allowing each Gaussian point to interact with its nearest neighbors. These
enriched features are subsequently decoded into Gaussian parameters through two
lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive
experiments on diverse benchmarks demonstrate that our method significantly
outperforms NeRF-based approaches and achieves competitive performance under
few-shot settings compared to the state-of-the-art 3DGS methods.
[56] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models
Jun Yin,Jing Zhong,Peilin Li,Pengyu Zeng,Miao Zhang,Ran Luo,Shuai Lu
Main category: cs.CV
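TL;DR: Proposes UrbanSense, a multimodal research framework based on vision-language models for automated, scalable analysis of stylistic differences in urban streetscapes, and demonstrates its potential for quantifying the evolution of urban style.
Details
Motivation: Urban culture and architectural style vary markedly with geographical, historical, and socio-political factors, and traditional research methods are hard to standardize. The study aims to make urban form research more objective through a data-driven approach.
Contribution: 1. Constructs the UrbanDiffBench dataset; 2. develops UrbanSense, the first vision-language-model-based framework for urban streetscape analysis; 3. shows experimentally that the method is effective at quantifying style differences.
Method: A multimodal framework based on vision-language models that supports quantitative generation and comparison of urban style representations.
Result: Over 80% of generated descriptions pass the t-test (p < 0.05), and high Phi scores in subjective evaluation (0.912 for cities, 0.833 for periods) confirm the method's ability to capture style differences.
Insight: The framework provides a scientifically grounded basis for quantifying and interpreting urban style, and data support for future design.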
Abstract: Urban cultures and architectural styles vary significantly across cities due
to geographical, chronological, historical, and socio-political factors.
Understanding these differences is essential for anticipating how cities may
evolve in the future. As representative cases of historical continuity and
modern innovation in China, Beijing and Shenzhen offer valuable perspectives
for exploring the transformation of urban streetscapes. However, conventional
approaches to urban cultural studies often rely on expert interpretation and
historical documentation, which are difficult to standardize across different
contexts. To address this, we propose a multimodal research framework based on
vision-language models, enabling automated and scalable analysis of urban
streetscape style differences. This approach enhances the objectivity and
data-driven nature of urban form research. The contributions of this study are
as follows: First, we construct UrbanDiffBench, a curated dataset of urban
streetscapes containing architectural images from different periods and
regions. Second, we develop UrbanSense, the first vision-language-model-based
framework for urban streetscape analysis, enabling the quantitative generation
and comparison of urban style representations. Third, experimental results show
that over 80% of generated descriptions pass the t-test (p < 0.05).
High Phi scores (0.912 for cities, 0.833 for periods) from subjective
evaluations confirm the method’s ability to capture subtle stylistic
differences. These results highlight the method’s potential to quantify and
interpret urban style evolution, offering a scientifically grounded lens for
future design.
[57] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration
Mina C. Moghadam,Alan Q. Wang,Omer Taub,Martin R. Prince,Mert R. Sabuncu
Main category: cs.CV
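TL;DR: RealKeyMorph (RKM) is a resolution-agnostic image registration method that outputs keypoints in real-world coordinates, avoiding the resampling required by conventional methods.
Details
Motivation: In medical image registration, images may differ in resolution because of differing acquisition parameters (pixel spacing, slice thickness, and so on). Conventional machine learning methods resample to a fixed resolution, which can introduce interpolation artifacts and hurt registration quality.
Contribution: RKM extends the KeyMorph framework by extracting keypoints in real-world coordinates, achieving resolution-agnostic registration and avoiding the drawbacks of resampling.
Method: RKM uses the affine matrix produced by the scanner (e.g., an MRI machine) to transform keypoints from voxel coordinates to real-world coordinates and integrates this transform into training, making the keypoints resolution-agnostic.
Result: Experiments show that RKM performs well on registration of orthogonal 2D stacks of abdominal MRIs and of 3D volumes in brain datasets.
Insight: By operating directly on raw data (no resampling), RKM improves registration accuracy and robustness, especially when resolutions are inconsistent.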
Abstract: Many real-world settings require registration of a pair of medical images
that differ in spatial resolution, which may arise from differences in image
acquisition parameters like pixel spacing, slice thickness, and field-of-view.
However, all previous machine learning-based registration techniques resample
images onto a fixed resolution. This is suboptimal because resampling can
introduce artifacts due to interpolation. To address this, we present
RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is
an extension of KeyMorph, a registration framework which works by training a
network to learn corresponding keypoints for a given pair of images, after
which a closed-form keypoint matching step is used to derive the transformation
that aligns them. To avoid resampling and enable operating on the raw data, RKM
outputs keypoints in real-world coordinates of the scanner. To do this, we
leverage the affine matrix produced by the scanner (e.g., MRI machine) that
encodes the mapping from voxel coordinates to real world coordinates. By
transforming keypoints into real-world space and integrating this into the
training process, RKM effectively enables the extracted keypoints to be
resolution-agnostic. In our experiments, we demonstrate the advantages of RKM
on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as
3D volumes with varying resolutions in brain datasets.
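The voxel-to-world mapping the abstract relies on can be illustrated with a small sketch. It is not the RKM implementation: the two affine matrices, the voxel indices, and the function name are hypothetical, chosen so that two scans of the same anatomy with different voxel spacing land on the same physical point.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def voxel_to_world(affine: Matrix, voxel: Sequence[float]) -> List[float]:
    """Apply a 4x4 homogeneous affine to a voxel index (i, j, k)."""
    homogeneous = list(voxel) + [1.0]
    return [
        sum(affine[row][col] * homogeneous[col] for col in range(4))
        for row in range(3)
    ]


# Hypothetical scan A: 1 mm isotropic voxels, origin at (-10, -10, 0) mm.
affine_fine: Matrix = [
    [1.0, 0.0, 0.0, -10.0],
    [0.0, 1.0, 0.0, -10.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Hypothetical scan B: 2 x 2 x 5 mm voxels, same origin.
affine_coarse: Matrix = [
    [2.0, 0.0, 0.0, -10.0],
    [0.0, 2.0, 0.0, -10.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    # Different voxel indices, same physical location in millimetres.
    print(voxel_to_world(affine_fine, [20, 30, 10]))
    print(voxel_to_world(affine_coarse, [10, 15, 2]))
```

Because both keypoints map to the same world coordinate, a matching step that works in world space does not need either image to be resampled onto the other's grid.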
[58] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation
Runqi Ouyang,Haoyun Li,Zhenyuan Zhang,Xiaofeng Wang,Zheng Zhu,Guan Huang,Xingang Wang
Main category: cs.CV
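TL;DR: Motion-R1 combines Chain-of-Thought reasoning with reinforcement learning to improve text-to-motion generation, addressing shortcomings of existing methods in semantic alignment and motion synthesis.
Details
Motivation: Existing methods fail to capture deep linguistic structure and logical reasoning, so generated motions lack controllability, consistency, and diversity. Motion-R1 aims to address these issues.
Contribution: Proposes the Motion-R1 framework, which combines a Chain-of-Thought mechanism with reinforcement learning and significantly improves the interpretation of complex textual instructions and the generation of multi-step motions.
Method: Decomposes textual instructions with Chain-of-Thought and uses the Group Relative Policy Optimization reinforcement learning algorithm to jointly optimize reasoning chains and motion synthesis.
Result: Achieves competitive or superior performance on multiple benchmark datasets, particularly in semantic understanding and long-term temporal coherence.
Insight: Combining a Chain-of-Thought mechanism with reinforcement learning effectively improves semantic understanding and execution in complex motion generation.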
Abstract: Recent advances in large language models, especially in natural language
understanding and reasoning, have opened new possibilities for text-to-motion
generation. Although existing approaches have made notable progress in semantic
alignment and motion synthesis, they often rely on end-to-end mapping
strategies that fail to capture deep linguistic structures and logical
reasoning. Consequently, generated motions tend to lack controllability,
consistency, and diversity. To address these limitations, we propose Motion-R1,
a unified motion-language modeling framework that integrates a Chain-of-Thought
mechanism. By explicitly decomposing complex textual instructions into
logically structured action paths, Motion-R1 provides high-level semantic
guidance for motion generation, significantly enhancing the model’s ability to
interpret and execute multi-step, long-horizon, and compositionally rich
commands. To train our model, we adopt Group Relative Policy Optimization, a
reinforcement learning algorithm designed for large models, which leverages
motion quality feedback to optimize reasoning chains and motion synthesis
jointly. Extensive experiments across multiple benchmark datasets demonstrate
that Motion-R1 achieves competitive or superior performance compared to
state-of-the-art methods, particularly in scenarios requiring nuanced semantic
understanding and long-term temporal coherence. The code, model and data will
be publicly available.
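Group Relative Policy Optimization scores each sampled output relative to the other outputs drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step follows; the reward values are made up, the function name is hypothetical, and the use of the population standard deviation plus a small epsilon is an assumption, not a detail taken from the paper.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical motion-quality rewards for four sampled generations.
    rewards = [0.2, 0.5, 0.9, 0.4]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

Samples that beat the group average get a positive advantage and are reinforced; those below it are discouraged, so only the relative ranking within a group matters.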
[59] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device
Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh
Main category: cs.CV
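TL;DR: FaceLiVT is a lightweight yet powerful face recognition model that combines a hybrid CNN-Transformer architecture with a novel lightweight Multi-Head Linear Attention (MHLA) mechanism, reducing computational complexity while maintaining high accuracy.
Details
Motivation: To achieve efficient, low-latency face recognition on mobile devices while balancing compute resources against model performance.
Contribution: Proposes FaceLiVT, which combines MHLA with structural reparameterization to significantly speed up inference on resource-constrained platforms.
Method: Adopts a hybrid CNN-Transformer architecture, designs a lightweight MHLA mechanism, and uses structural reparameterization to improve model efficiency.
Result: Performs strongly on benchmarks such as LFW, with inference 8.6× faster than EdgeFace and 21.2× faster than a pure ViT-based model.
Insight: A hybrid architecture combined with lightweight attention can significantly improve real-time face recognition on mobile devices.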
Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition
model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer
architecture with an innovative and lightweight Multi-Head Linear Attention
(MHLA) mechanism. By combining MHLA alongside a reparameterized token mixer,
FaceLiVT effectively reduces computational complexity while preserving
competitive accuracy. Extensive evaluations on challenging benchmarks,
including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior
performance compared to state-of-the-art lightweight models. MHLA notably
improves inference speed, allowing FaceLiVT to deliver high accuracy with lower
latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace,
a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2×
faster than a pure ViT-based model. With its balanced design, FaceLiVT offers
an efficient and practical solution for real-time face recognition on
resource-constrained platforms.
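Structural reparameterization, in general, trains a block with several parallel linear branches and then folds them into one equivalent operator for inference. The sketch below shows the generic idea on small matrices; it is not FaceLiVT's token mixer, and the two branch weights, the identity skip, and all function names are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def multi_branch(w_a: Matrix, w_b: Matrix, x: Vector) -> Vector:
    """Training-time form: two linear branches plus an identity skip."""
    a, b = matvec(w_a, x), matvec(w_b, x)
    return [ai + bi + xi for ai, bi, xi in zip(a, b, x)]


def reparameterize(w_a: Matrix, w_b: Matrix) -> Matrix:
    """Fold both branches and the skip into one weight: W_a + W_b + I."""
    n = len(w_a)
    return [
        [w_a[i][j] + w_b[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    w_a = [[0.5, -1.0], [2.0, 0.0]]
    w_b = [[1.5, 0.25], [-0.5, 1.0]]
    x = [3.0, -2.0]
    print(multi_branch(w_a, w_b, x))
    print(matvec(reparameterize(w_a, w_b), x))  # same values, one matmul
```

The merged form produces the same output with a single matrix product, which is why the technique reduces inference latency without changing accuracy.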
[60] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion
Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui,Yuhan Lyu
Main category: cs.CV
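TL;DR: Proposes FSATFusion, an end-to-end network for infrared and visible image fusion built around a frequency-spatial attention Transformer module, which significantly improves fusion quality and downstream task performance.
Details
Motivation: Most existing methods are CNN-based and limited in capturing global context, which causes information loss and hurts fusion quality. The authors aim to improve this with Transformers and attention mechanisms.
Contribution: 1. Proposes the FSAT module, which incorporates a frequency-spatial attention mechanism; 2. proposes an improved Transformer module (ITM) to strengthen global context extraction; 3. demonstrates strong fusion quality and generalization.
Method: Designs a frequency-spatial attention Transformer (FSAT) module that uses FSAM to extract salient features, together with an improved Transformer module (ITM) to better capture global information.
Result: Experiments show FSATFusion outperforms existing methods in fusion quality and efficiency and performs well on downstream tasks such as object detection.
Insight: Attention that combines the frequency and spatial domains extracts features more effectively, and the improved Transformer module further strengthens global information capture.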
Abstract: Infrared and visible image fusion (IVIF) is receiving increasing
attention from both the research community and industry due to its excellent
results in downstream applications. Existing deep learning approaches often
utilize convolutional neural networks to extract image features. However, the
inherently limited capacity of convolution operations to capture global context
can lead to information loss, thereby restricting fusion performance. To address
this limitation, we propose an end-to-end fusion network named the
Frequency-Spatial Attention Transformer Fusion Network (FSATFusion).
FSATFusion contains a frequency-spatial attention Transformer (FSAT) module
designed to effectively capture discriminative features from source images. This
FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of
extracting significant features from feature maps. Additionally, we propose an
improved Transformer module (ITM) to enhance the vanilla Transformer's ability
to extract global context information. We conducted both qualitative and
quantitative comparative experiments, demonstrating the superior fusion quality
and efficiency of FSATFusion compared to other state-of-the-art methods.
Furthermore, our network was tested on two additional tasks without any
modifications, to verify the excellent generalization capability of FSATFusion.
Finally, the object detection experiment demonstrated the superiority of
FSATFusion in downstream visual tasks. Our code is available at
https://github.com/Lmmh058/FSATFusion.
[61] Revisiting Transformers with Insights from Image Filtering
Laziz U. Abdullaev,Maksim Tkachenko,Tan M. Nguyen
Main category: cs.CV
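TL;DR: Reinterprets self-attention in Transformers through the lens of image processing, proposing a unifying framework that explains its computation and the role of its components, along with two architectural modifications that improve both interpretability and task performance.
Details
Motivation: The success of self-attention lacks a theoretical explanation. Existing work has tried to understand it via image denoising and nonparametric regression but has not analysed in depth the mechanisms of the components that enhance it. This paper aims to close that gap with an image processing framework.
Contribution: 1. Proposes a unifying image-processing framework that explains self-attention and the role of components such as positional encoding and residual connections. 2. Uses the framework to identify potential distinctions between self-attention and image processing and attempts to bridge them. 3. Proposes two independent architectural modifications that improve model performance.
Method: 1. Builds an image processing framework to analyse self-attention. 2. Uses the framework to explain components such as positional encoding and residual connections. 3. Proposes two architectural modifications and validates them.
Result: Experiments show that the image-processing-inspired modifications not only improve interpretability but also raise accuracy and robustness on language and vision tasks, with better long-sequence understanding in particular.
Insight: Some aspects of self-attention design may have roots in image processing; this cross-domain perspective helps in understanding and improving Transformer architectures.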
Abstract: The self-attention mechanism, a cornerstone of Transformer-based
state-of-the-art deep learning architectures, is largely heuristic-driven and
fundamentally challenging to interpret. Establishing a robust theoretical
foundation to explain its remarkable success and limitations has therefore
become an increasingly prominent focus in recent research. Some notable
directions have explored understanding self-attention through the lens of image
denoising and nonparametric regression. While promising, existing frameworks
still lack a deeper mechanistic interpretation of various architectural
components that enhance self-attention, both in its original formulation and
subsequent variants. In this work, we aim to advance this understanding by
developing a unifying image processing framework, capable of explaining not
only the self-attention computation itself but also the role of components such
as positional encoding and residual connections, including numerous later
variants. Building upon our framework, we also pinpoint potential distinctions
between self-attention and image processing and make an effort to close this
gap. We introduce two independent architectural modifications within
transformers. While our
primary objective is interpretability, we empirically observe that image
processing-inspired modifications can also lead to notably improved accuracy
and robustness against data contamination and adversaries across language and
vision tasks as well as better long sequence understanding.
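The nonparametric-regression reading of self-attention mentioned in the abstract can be made concrete: each output is a normalized, kernel-weighted average of the values, much like a non-local denoising filter averages similar pixels. The sketch below shows only this standard view of softmax attention, not the paper's framework or its two proposed modifications; the function name and inputs are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def attention_as_filter(queries: List[Vector],
                        keys: List[Vector],
                        values: List[Vector]) -> List[Vector]:
    """Softmax attention written as a kernel-weighted average of values."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        peak = max(logits)  # subtract the max for numerical stability
        kernel = [math.exp(l - peak) for l in logits]
        total = sum(kernel)
        weights = [w / total for w in kernel]  # weights sum to 1
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    # Identical keys give a uniform kernel, so the filter is a plain mean.
    keys = [[1.0, 0.0]] * 3
    values = [[0.0], [3.0], [6.0]]
    print(attention_as_filter([[1.0, 0.0]], keys, values))
```

When the keys differ, the kernel concentrates on the keys most similar to the query, which corresponds to averaging only over similar patches in a denoising filter.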
[62] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial
Jerry Yan,Chinmay Talegaonkar,Nicholas Antipa,Eric Terrill,Sophia Merrifield
Main category: cs.CV
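TL;DR: PoseIDON is a computer vision pipeline that combines deep foundation model features with multiview photogrammetry to estimate the 6DoF pose of seafloor objects and infers burial depth by aligning CAD models.
Details
Motivation: Accurately estimating the burial state of anthropogenic objects on the seafloor is critical for studying local sedimentation dynamics and assessing ecological risk and pollutant transport, but it is difficult with conventional methods because of occlusion, poor visibility, and object degradation.
Contribution: Proposes PoseIDON, which for the first time combines deep foundation model features with multiview photogrammetry to accurately estimate the pose and burial depth of seafloor objects.
Method: From multiview ROV video, extracts features with a deep foundation model, then combines CAD model alignment with a local planar fit of the seafloor to compute 6DoF object pose and burial depth.
Result: On 54 test objects, the mean burial depth error is approximately 10 centimeters, and the method captures spatial burial patterns that reflect underlying sediment transport.
Insight: Offers a non-invasive, scalable approach to mapping seafloor burial, applicable to contamination assessment and to planning the recovery of hazardous objects.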
Abstract: The burial state of anthropogenic objects on the seafloor provides insight
into localized sedimentation dynamics and is also critical for assessing
ecological risks, potential pollutant transport, and the viability of recovery
or mitigation strategies for hazardous materials such as munitions. Accurate
burial depth estimation from remote imagery remains difficult due to partial
occlusion, poor visibility, and object degradation. This work introduces a
computer vision pipeline, called PoseIDON, which combines deep foundation model
features with multiview photogrammetry to estimate six degrees of freedom
object pose and the orientation of the surrounding seafloor from ROV video.
Burial depth is inferred by aligning CAD models of the objects with observed
imagery and fitting a local planar approximation of the seafloor. The method is
validated using footage of 54 objects, including barrels and munitions,
recorded at a historic ocean dumpsite in the San Pedro Basin. The model
achieves a mean burial depth error of approximately 10 centimeters and resolves
spatial burial patterns that reflect underlying sediment transport processes.
This approach enables scalable, non-invasive mapping of seafloor burial and
supports environmental assessment at contaminated sites.
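Once a CAD model has been posed in the scene and a local seafloor plane has been fitted, burial depth reduces to a point-to-plane computation. The sketch below is one plausible way to do that final step, not the PoseIDON code: it assumes the model vertices are already in world coordinates, that the plane normal points up out of the sediment, and that burial depth is the deepest vertex below the plane. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def burial_depth(vertices: List[Point],
                 plane_point: Point,
                 plane_normal: Point) -> float:
    """Depth of the deepest posed model vertex below the seafloor plane."""
    norm = math.sqrt(sum(c * c for c in plane_normal))
    unit = [c / norm for c in plane_normal]
    # Signed height of each vertex above the plane (negative means buried).
    heights = [
        sum((v[i] - plane_point[i]) * unit[i] for i in range(3))
        for v in vertices
    ]
    return max(0.0, -min(heights))


if __name__ == "__main__":
    # Hypothetical barrel bounding box: 0.3 m below and 0.6 m above z = 0.
    box = [
        (x, y, z)
        for x in (-0.3, 0.3)
        for y in (-0.3, 0.3)
        for z in (-0.3, 0.6)
    ]
    depth = burial_depth(box, plane_point=(0.0, 0.0, 0.0),
                         plane_normal=(0.0, 0.0, 2.0))
    print(f"burial depth: {depth:.2f} m")
```

An object resting entirely above the plane returns zero, so the same function distinguishes exposed from partially buried objects.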
[63] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba
Shicheng Yin,Kaixuan Yin,Yang Liu,Weixing Chen,Liang Lin
Main category: cs.CV
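TL;DR: Proposes DART, a dynamic adaptive region tokenizer that adaptively partitions images into patches of varying sizes, improving Vision Transformer and Mamba performance while reducing compute.
Details
Motivation: Non-convolutional models such as Vision Transformer and Mamba rely on fixed-size patches, which over-encodes background regions and misses critical local detail, especially when information is sparsely distributed.
Contribution: Proposes DART (Dynamic Adaptive Region Tokenizer), a fully differentiable tokenizer that partitions images adaptively by content, encoding information-rich regions more efficiently.
Method: DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich regions. It adds only about 1M parameters while significantly improving performance.
Result: Improves accuracy by 2.1% on DeiT (ImageNet-1K), and reaches superior performance with 45% fewer FLOPs than methods that uniformly increase token density. Experiments on DeiT, Vim, and VideoMamba confirm consistent gains and efficiency.
Insight: DART is a more efficient alternative to uniformly increasing token density: it improves performance while reducing compute and suits scenes where information is unevenly distributed.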
Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and
Vision Mamba (Vim) have achieved remarkable performance in computer vision
tasks. However, their reliance on fixed-size patches often results in excessive
encoding of background regions and omission of critical local details,
especially when informative objects are sparsely distributed. To address this,
we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART),
which adaptively partitions images into content-dependent patches of varying
sizes. DART combines learnable region scores with piecewise differentiable
quantile operations to allocate denser tokens to information-rich areas.
Despite introducing only approximately 1 million (1M) additional parameters,
DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that
uniformly increase token density to capture fine-grained details, DART offers a
more efficient alternative, achieving 45% FLOPs reduction with superior
performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that
DART consistently enhances accuracy while incurring minimal or even reduced
computational overhead. Code is available at
https://github.com/HCPLab-SYSU/DART.
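The idea of placing token boundaries at quantiles of a region-score distribution can be illustrated in one dimension. The sketch below is a simplified stand-in, not the DART implementation: it treats positive scores along one image axis as a density, builds a cumulative distribution, and inverts it by linear interpolation so that each token covers an equal share of total score. The function name and the example scores are hypothetical.

```python
from typing import List, Sequence


def adaptive_boundaries(scores: Sequence[float], num_tokens: int) -> List[float]:
    """Place num_tokens + 1 boundaries at equal-mass quantiles of the scores."""
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be strictly positive")
    total = sum(scores)
    cdf = [0.0]
    for s in scores:
        cdf.append(cdf[-1] + s / total)

    boundaries = []
    cell = 0
    for k in range(num_tokens + 1):
        q = k / num_tokens
        # Advance to the cell whose cumulative range contains quantile q.
        while cell < len(scores) - 1 and cdf[cell + 1] < q:
            cell += 1
        # Linear interpolation inside the cell: piecewise linear in q.
        frac = (q - cdf[cell]) / (cdf[cell + 1] - cdf[cell])
        boundaries.append(cell + frac)
    return boundaries


if __name__ == "__main__":
    # High scores in the middle columns mark an information-rich region.
    scores = [1, 1, 6, 6, 1, 1]
    edges = adaptive_boundaries(scores, num_tokens=4)
    widths = [b - a for a, b in zip(edges, edges[1:])]
    print("boundaries:", [round(e, 3) for e in edges])
    print("widths:", [round(w, 3) for w in widths])
```

In this toy input the two middle tokens come out much narrower than the outer ones, so the high-score region receives denser tokens while the total token count stays fixed.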
[64] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion
Yuanyi Song,Pumeng Lyu,Ben Fei,Fenghua Ling,Wanli Ouyang,Lei Bai
Main category: cs.CV
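TL;DR: ReconMOST is a data-driven guided diffusion framework for multi-layer sea temperature reconstruction: the model is pre-trained on historical numerical simulation data, and sparse in-situ observations guide the reverse diffusion process, yielding accurate and physically consistent reconstructions.
Details
Motivation: Conventional ocean temperature reconstruction is limited by sparse data, algorithmic complexity, and high computational cost, while existing machine learning methods usually cover only the sea surface or local regions and struggle with issues such as cloud occlusion. A more efficient, global, multi-layer method is needed.
Contribution: Proposes the ReconMOST framework, extending ML methods to global, multi-layer sea temperature reconstruction for the first time; a pre-trained diffusion model guided by observations achieves accurate, physically consistent reconstructions.
Method: 1. Pre-trains an unconditional diffusion model on CMIP6 historical numerical simulation data; 2. guides the reverse diffusion process with sparse, high-accuracy in-situ observations; 3. in regions without direct observations, relies on the physical distribution patterns learned in pre-training for implicitly guided reconstruction.
Result: On CMIP6 and EN4 analysis data, MSE reaches 0.049 (guidance), 0.680 (reconstruction), and 0.633 (total), and reconstruction accuracy and resolution are maintained even with 92.5% of the data missing.
Insight: Diffusion models have potential in ocean science: combining physically consistent pre-training with sparse observations enables accurate global reconstruction and addresses limitations of earlier ML methods.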
Abstract: Accurate reconstruction of the ocean is essential for reflecting global
climate dynamics and supporting marine meteorological research. Conventional
methods face challenges due to sparse data, algorithmic complexity, and high
computational costs, while the growing use of machine learning (ML) methods
remains limited to reconstruction problems at the sea surface and in local
regions and struggles with issues like cloud occlusion. To address these
limitations, this paper proposes ReconMOST, a data-driven guided diffusion
model framework for multi-layer sea temperature reconstruction. Specifically,
we first pre-train an unconditional diffusion model using a large collection of
historical numerical simulation data, enabling the model to attain physically
consistent distribution patterns of ocean temperature fields. During the
generation phase, sparse yet high-accuracy in-situ observational data are
utilized as guidance points for the reverse diffusion process, generating
accurate reconstruction results. Importantly, in regions lacking direct
observational data, the physically consistent spatial distribution patterns
learned during pre-training enable implicitly guided and physically plausible
reconstructions. Our method extends ML-based sea surface temperature (SST) reconstruction to a global,
multi-layer setting, handling over 92.5% missing data while maintaining
reconstruction accuracy, spatial resolution, and superior generalization
capability. We pre-train our model on CMIP6 numerical simulation data and
conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The
mean squared error (MSE) reaches 0.049 on guidance, 0.680 on reconstruction,
and 0.633 in total, demonstrating the
effectiveness and robustness of the proposed framework. Our source code is
available at https://github.com/norsheep/ReconMOST.
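The three reported MSE values can be related with a short sketch. It assumes, as an interpretation rather than something the abstract states, that "guidance" is the error at observed grid points, "reconstruction" is the error at unobserved points, and "total" covers all points, with observed points making up 7.5% of the grid (the complement of the 92.5% missing rate). The function names and toy values are hypothetical.

```python
from typing import Sequence


def masked_mse(pred: Sequence[float],
               target: Sequence[float],
               mask: Sequence[bool]) -> float:
    """Mean squared error over the positions where mask is True."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not errors:
        raise ValueError("mask selects no points")
    return sum(errors) / len(errors)


def combine_mse(frac_guided: float, mse_guided: float, mse_rest: float) -> float:
    """Total MSE as the point-count-weighted mix of the two partial MSEs."""
    return frac_guided * mse_guided + (1.0 - frac_guided) * mse_rest


if __name__ == "__main__":
    # Toy field: positions 0 and 3 are observed (guided), 1 and 2 are not.
    pred = [1.0, 2.0, 3.0, 4.0]
    target = [1.0, 2.5, 2.0, 4.0]
    guided = [True, False, False, True]
    rest = [not g for g in guided]
    everything = [True] * len(pred)

    mse_g = masked_mse(pred, target, guided)
    mse_r = masked_mse(pred, target, rest)
    print(masked_mse(pred, target, everything), combine_mse(0.5, mse_g, mse_r))

    # Plugging in the abstract's figures under the 7.5% / 92.5% assumption.
    print(round(combine_mse(0.075, 0.049, 0.680), 3))
```

Under that assumption the weighted mix of 0.049 and 0.680 rounds to 0.633, which matches the reported total and suggests the three figures are mutually consistent.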
[65] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Zhiyang Xu,Jiuhai Chen,Zhaojiang Lin,Xichen Pan,Lifu Huang,Tianyi Zhou,Madian Khabsa,Qifan Wang,Di Jin,Michihiro Yasunaga,Lili Yu,Xi Victoria Lin,Shaoliang Nie
Main category: cs.CV
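TL;DR: Pisces is an auto-regressive multimodal foundation model that uses a decoupled visual encoding architecture and training techniques tailored to multimodal generation to close the performance gap between image understanding and generation, performing well on both.
Details
Motivation: Unified multimodal models often underperform specialized models on image understanding and generation, mainly because the two tasks need different visual features and training processes. Pisces aims to provide a unified framework that resolves this.
Contribution: 1. Proposes Pisces, an auto-regressive multimodal foundation model; 2. designs a decoupled visual encoding architecture; 3. optimizes training techniques for multimodal generation; 4. validates its competitiveness on multiple public benchmarks.
Method: Uses a decoupled visual encoding architecture that optimizes feature extraction separately for image understanding and generation, combined with data curation, pretraining, and finetuning to improve multimodal performance.
Result: Performs strongly on more than 20 image understanding benchmarks and on the GenEval image generation benchmark, showing the model is competitive on both tasks.
Insight: Image understanding and generation are synergistic, and a decoupled visual encoding architecture effectively improves unified multimodal models.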
Abstract: Recent advances in large language models (LLMs) have enabled multimodal
foundation models to tackle both image understanding and generation within a
unified framework. Despite these gains, unified models often underperform
compared to specialized models in either task. A key challenge in developing
unified models lies in the inherent differences between the visual features
needed for image understanding versus generation, as well as the distinct
training processes required for each modality. In this work, we introduce
Pisces, an auto-regressive multimodal foundation model that addresses this
challenge through a novel decoupled visual encoding architecture and tailored
training techniques optimized for multimodal generation. Combined with
meticulous data curation, pretraining, and finetuning, Pisces achieves
competitive performance in both image understanding and image generation. We
evaluate Pisces on over 20 public benchmarks for image understanding, where it
demonstrates strong performance across a wide range of tasks. Additionally, on
GenEval, a widely adopted benchmark for image generation, Pisces exhibits
robust generative capabilities. Our extensive analysis reveals the synergistic
relationship between image understanding and generation, and the benefits of
using separate visual encoders, advancing the field of unified multimodal
models.
[66] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment
Shuo Wang,Jihao Zhang
Main category: cs.CV
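TL;DR: MF2Summ is a multimodal fusion approach to video summarization that combines visual and auditory information and improves summarization through a cross-modal Transformer and a temporal alignment mechanism. It outperforms existing methods on the SumMe and TVSum datasets.
Details
Motivation: Traditional video summarization usually relies on visual information alone and cannot fully exploit the multimodal semantics of video. The paper aims to improve summarization quality and semantic richness by fusing visual and auditory information.
Contribution: 1) Proposes the MF2Summ model, the first to combine visual and auditory information for video summarization; 2) designs a cross-modal Transformer and a temporal alignment mechanism; 3) outperforms existing methods on the SumMe and TVSum datasets.
Method: 1) Extracts visual and auditory features with GoogLeNet and SoundNet; 2) models inter-modal dependencies and temporal alignment with a cross-modal Transformer and a self-attention Transformer; 3) predicts segment importance and selects key shots using the NMS and KTS algorithms.
Result: On SumMe and TVSum, F1-scores improve by 1.9% and 0.6% respectively over DSNet, and the model compares favorably with other state-of-the-art methods.
Insight: Multimodal fusion and temporal alignment are key to better video summarization; auditory information supplies complementary semantics to the visual stream and makes summaries more complete.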
Abstract: The rapid proliferation of online video content necessitates effective video
summarization techniques. Traditional methods, often relying on a single
modality (typically visual), struggle to capture the full semantic richness of
videos. This paper introduces MF2Summ, a novel video summarization model based
on multimodal content understanding, integrating both visual and auditory
information. MF2Summ employs a five-stage process: feature extraction,
cross-modal attention interaction, feature fusion, segment prediction, and key
shot selection. Visual features are extracted using a pre-trained GoogLeNet
model, while auditory features are derived using SoundNet. The core of our
fusion mechanism involves a cross-modal Transformer and an alignment-guided
self-attention Transformer, designed to effectively model inter-modal
dependencies and temporal correspondences. Segment importance, location, and
center-ness are predicted, followed by key shot selection using Non-Maximum
Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm.
Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ
achieves competitive performance, notably improving F1-scores by 1.9% and
0.6% respectively over the DSNet model, and performing favorably against other
state-of-the-art methods.
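The key shot selection step names Non-Maximum Suppression over predicted segments. The abstract does not give the exact variant, so the following is only a minimal sketch of generic greedy 1D NMS; the `Segment` tuple layout, the function names, and the 0.5 IoU threshold are assumptions made for illustration, not details from the paper.

```python
from typing import List, Tuple

# Hypothetical layout: (start_time, end_time, importance_score).
Segment = Tuple[float, float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment], iou_threshold: float = 0.5) -> List[Segment]:
    """Greedy NMS: visit segments by descending score and keep a segment
    only if it does not overlap an already kept one too strongly."""
    kept: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) <= iou_threshold for k in kept):
            kept.append(seg)
    return kept


proposals = [(0.0, 10.0, 0.9), (1.0, 11.0, 0.8), (20.0, 30.0, 0.7)]
selected = temporal_nms(proposals, iou_threshold=0.5)
```

Here the second proposal overlaps the first with an IoU of about 0.82 and is suppressed, while the third is disjoint and survives.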
[67] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts
Guowei Zhong,Ruohong Huan,Mingzhen Wu,Ronghua Liang,Peng Chen
Main category: cs.CV
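TL;DR: The paper proposes CIDer, a robust multimodal emotion recognition framework that handles missing modalities and distribution shift through a model-specific self-distillation module and a model-agnostic causal inference module.
Details
Motivation: Multimodal emotion recognition degrades when modalities are missing or the data is out-of-distribution, and existing methods either depend on specific models or add too many parameters. CIDer aims to address both problems. Contribution: 1) Proposes the CIDer framework, combining self-distillation and causal inference modules; 2) Introduces the RMFM task and new repartitioned OOD datasets; 3) Outperforms existing methods in parameter efficiency and robustness.
Method: CIDer consists of a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD improves robustness through weight-sharing self-distillation, and MACI uses a tailored causal graph to mitigate label and language biases.
Result: CIDer performs robustly in both RMFM and OOD scenarios, with fewer parameters and faster training than state-of-the-art methods.
Insight: Combining self-distillation with causal inference can address missing modalities and distribution shift together without a large parameter overhead.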
Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges
in addressing both modality missing and Out-Of-Distribution (OOD) data
simultaneously. Existing methods often rely on specific models or introduce
excessive parameters, which limits their practicality. To address these issues,
we propose a novel robust MER framework, Causal Inference Distiller (CIDer),
and introduce a new task, Random Modality Feature Missing (RMFM), to generalize
the definition of modality missing. CIDer integrates two key components: a
Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal
Inference (MACI) module. MSSD enhances robustness under the RMFM task through a
weight-sharing self-distillation approach applied across low-level features,
attention maps, and high-level representations. Additionally, a Word-level
Self-aligned Attention Module (WSAM) reduces computational complexity, while a
Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion.
To tackle OOD challenges, MACI employs a tailored causal graph to mitigate
label and language biases using a Multimodal Causal Module (MCM) and
fine-grained counterfactual texts. Notably, MACI can independently enhance OOD
generalization with minimal additional parameters. Furthermore, we also
introduce the new repartitioned MER OOD datasets. Experimental results
demonstrate that CIDer achieves robust performance in both RMFM and OOD
scenarios, with fewer parameters and faster training compared to
state-of-the-art methods. The implementation of this work is publicly
accessible at https://github.com/gw-zhong/CIDer.
[68] Rethinking Generative Human Video Coding with Implicit Motion Transformation
Bolin Chen,Ru-Ling Liao,Jie Chen,Yan Ye
Main category: cs.CV
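TL;DR: The paper proposes a generative human video coding method based on Implicit Motion Transformation (IMT), addressing the distortion and inaccurate motion that explicit motion fields produce on complex human body motion.
Details
Motivation: Generative human video coding based on explicit motion performs poorly on complex and diverse motion patterns, leading to distorted reconstructions and inaccurate motion. Contribution: Proposes Implicit Motion Transformation (IMT), which characterizes complex human body signals as compact visual features and transforms those features into implicit motion guidance, improving generative human video coding performance.
Method: Compresses the human body signal into compact visual features and uses IMT to turn them into implicit motion guidance for high-quality reconstruction.
Result: Experiments show that the IMT paradigm achieves high-efficiency compression and high-fidelity synthesis.
Insight: Implicit motion transformation handles complex human body motion more effectively and avoids the limitations of explicit motion fields.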
Abstract: Beyond traditional hybrid-based video codecs, generative video codecs can
achieve promising compression performance by evolving high-dimensional signals
into compact feature representations for bitstream compactness at the encoder
side and developing explicit motion fields as intermediate supervision for
high-quality reconstruction at the decoder side. This paradigm has achieved
significant success in face video compression. However, compared to facial
videos, human body videos pose greater challenges due to their more complex and
diverse motion patterns, i.e., when using explicit motion guidance for
Generative Human Video Coding (GHVC), the reconstruction results could suffer
severe distortions and inaccurate motion. As such, this paper highlights the
limitations of explicit motion-based approaches for human body video
compression and investigates the GHVC performance improvement with the aid of
Implicit Motion Transformation, namely IMT. In particular, we propose to
characterize complex human body signals as compact visual features and
transform these features into implicit motion guidance for signal
reconstruction. Experimental results demonstrate the effectiveness of the
proposed IMT paradigm, which can facilitate GHVC to achieve high-efficiency
compression and high-fidelity synthesis.
[69] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models
Yu Huang,Zelin Peng,Yichen Zhao,Piao Yang,Xiaokang Yang,Wei Shen
Main category: cs.CV
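TL;DR: MedSeg-R is an end-to-end framework built on multimodal large language models (MLLMs) for reasoning segmentation in medical images. It produces precise segmentation masks while interpreting complex clinical instructions.
Details
Motivation: Existing medical image segmentation models depend on explicit human instructions and lack the active reasoning needed for complex clinical questions. MLLMs do well on medical QA tasks but struggle to generate precise segmentation masks. Contribution: 1) Introduces the medical image reasoning segmentation task; 2) Designs the MedSeg-R framework, which combines MLLM reasoning with pixel-level segmentation; 3) Releases the MedSeg-QA dataset, with over 10,000 image-mask pairs and multi-turn conversations.
Method: The framework has a global context understanding module, which generates multi-modal intermediate tokens, and a pixel-level grounding module, which decodes those tokens into segmentation masks and textual responses.
Result: Experiments show that MedSeg-R performs strongly across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.
Insight: Combining the reasoning ability of large language models with pixel-level segmentation makes it possible to follow complex medical instructions and segment precisely.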
Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing
models are limited by their reliance on explicit human instructions and lack
the active reasoning capabilities to understand complex clinical questions.
While recent advancements in multimodal large language models (MLLMs) have
improved medical question-answering (QA) tasks, most methods struggle to
generate precise segmentation masks, limiting their application in automatic
medical diagnosis. In this paper, we introduce medical image reasoning
segmentation, a novel task that aims to generate segmentation masks based on
complex and implicit medical instructions. To address this, we propose
MedSeg-R, an end-to-end framework that leverages the reasoning abilities of
MLLMs to interpret clinical questions while also producing the
corresponding precise segmentation masks for medical images. It is built on two
core components: 1) a global context understanding module that interprets
images and comprehends complex medical instructions to generate multi-modal
intermediate tokens, and 2) a pixel-level grounding module that decodes these
tokens to produce precise segmentation masks and textual responses.
Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the
medical image reasoning segmentation task. It includes over 10,000 image-mask
pairs and multi-turn conversations, automatically annotated using large
language models and refined through physician reviews. Experiments show
MedSeg-R’s superior performance across several benchmarks, achieving high
segmentation accuracy and enabling interpretable textual analysis of medical
images.
[70] LLMs Are Not Yet Ready for Deepfake Image Detection
Shahroz Tariq,David Nguyen,M. A. P. Chamikara,Tingmin Wu,Alsharif Abuadbba,Kristen Moore
Main category: cs.CV
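TL;DR: The paper runs a zero-shot evaluation of four large vision-language models (VLMs) on deepfake image detection. The models produce coherent explanations and spot surface-level anomalies but are not yet dependable as standalone detectors. They are easily misled by style, yet show promise in interpretability and contextual analysis, which suits them to hybrid or human-in-the-loop frameworks.
Details
Motivation: The rapid advance of deepfake technology threatens media integrity and public trust. VLMs are seen as a possible solution because of their potential across domains, but their actual performance on deepfake detection is unclear; this study aims to fill that gap. Contribution: 1. A zero-shot deepfake detection evaluation of four prominent VLMs (ChatGPT, Claude, Gemini, Grok); 2. A benchmark covering several deepfake types; 3. An account of the models' failure modes (such as overemphasis on style) and strengths (such as contextual analysis).
Method: 1. Zero-shot evaluation on three deepfake types (faceswap, reenactment, synthetic generation); 2. Analysis of model capability through classification accuracy and reasoning depth; 3. Identification of critical failure modes, such as vulnerability to misleading patterns like vintage aesthetics.
Result: 1. VLMs produce coherent explanations and detect surface-level anomalies but cannot serve as standalone detection tools; 2. The models perform poorly when misled by style; 3. They are strong in interpretability and contextual analysis, which makes them suitable for assisting human experts.
Insight: General-purpose models cannot yet handle deepfake detection on their own, but their strengths in interpretability and contextual analysis make them suitable components of hybrid or human-in-the-loop frameworks; combining them with domain expertise may improve performance in the future.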
Abstract: The growing sophistication of deepfakes presents substantial challenges to
the integrity of media and the preservation of public trust. Concurrently,
vision-language models (VLMs), large language models enhanced with visual
reasoning capabilities, have emerged as promising tools across various domains,
sparking interest in their applicability to deepfake detection. This study
conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT,
Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap,
reenactment, and synthetic generation. Leveraging a meticulously assembled
benchmark comprising authentic and manipulated images from diverse sources, we
evaluate each model’s classification accuracy and reasoning depth. Our analysis
indicates that while VLMs can produce coherent explanations and detect
surface-level anomalies, they are not yet dependable as standalone detection
systems. We highlight critical failure modes, such as an overemphasis on
stylistic elements and vulnerability to misleading visual patterns like vintage
aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and
contextual analysis, suggesting their potential to augment human expertise in
forensic workflows. These insights imply that although general-purpose models
currently lack the reliability needed for autonomous deepfake detection, they
hold promise as integral components in hybrid or human-in-the-loop detection
frameworks.
[71] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation
Shuyang Li,Shuang Wang,Zhuangzhuang Sun,Jing Xiao
Main category: cs.CV
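TL;DR: The paper proposes PSLG-SAM, a framework that splits reference remote sensing image segmentation into a coarse localization stage and a fine segmentation stage, using a visual grounding network and the SAM model to improve performance and reduce the annotation burden.
Details
Motivation: Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads, but face dense annotation requirements and difficulty interpreting complex scenes. The paper addresses these problems with a two-stage decomposition. Contribution: 1. Proposes the PSLG-SAM framework, which decomposes the task into coarse localization and fine segmentation; 2. Introduces a clustering-based foreground point generator and a mask boundary iterative optimization strategy; 3. Contributes a high-quality, multi-category manually annotated dataset.
Method: 1. The coarse localization stage uses a visual grounding network; 2. The fine segmentation stage uses SAM, with clustering-generated foreground points and mask boundary optimization.
Result: On two datasets (RRSIS-D and RRSIS-M), PSLG-SAM surpasses existing state-of-the-art models.
Insight: Decomposing the task and combining it with SAM can reduce the annotation burden and improve segmentation accuracy in complex scenes.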
Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates
segmentation masks for specified objects in images based on textual
descriptions, which has attracted widespread attention and research interest.
Current RRSIS methods rely on multi-modal fusion backbones and semantic
segmentation heads but face challenges like dense annotation requirements and
complex scene interpretation. To address these issues, we propose a framework
named prompt-generated semantic localization guiding Segment Anything
Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse
localization and fine segmentation. In the coarse localization stage, a visual
grounding network roughly locates the text-described object. In the fine
segmentation stage, the coordinates from the first stage guide the Segment
Anything Model (SAM), enhanced by a clustering-based foreground point generator
and a mask boundary iterative optimization strategy for precise segmentation.
Notably, the second stage can be training-free, significantly reducing the
annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS
task into two stages allows for focusing on specific region segmentation,
avoiding interference from complex scenes. We further contribute a high-quality,
multi-category manually annotated dataset. Experimental validation on two
datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant
performance improvements and surpasses existing state-of-the-art models. Our
code will be made publicly available.
[72] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft
Jin Huang,Mingqiang Wei,Zikuan Li,Hangyu Qu,Wei Zhao,Xinyu Bai
Main category: cs.CV
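TL;DR: J-DDL is a smart system for detecting and localizing surface damage on fighter aircraft. It combines 2D images with 3D point clouds, optimizes the YOLO architecture, and adds new modules to improve detection efficiency and accuracy.
Details
Motivation: The scale and complexity of fighter aircraft surface inspection make manual inspection inefficient and inconsistent, so an automated solution is needed. Contribution: 1) Proposes the J-DDL system, which integrates 2D images and 3D point clouds; 2) Optimizes the YOLO architecture with lightweight Fasternet blocks and EMA modules; 3) Introduces a new loss function, Inner-CIOU; 4) Releases the first publicly available dataset focused on aircraft damage.
Method: A YOLO-based detection network uses Fasternet blocks for feature extraction and EMA modules for feature aggregation, applies the Inner-CIOU loss to improve accuracy, and maps 2D detections onto the 3D point cloud.
Result: Experiments validate the effectiveness of J-DDL for damage detection and localization.
Insight: Combining 2D and 3D data enables more comprehensive surface inspection; lightweight modules and a new loss function are key to detection efficiency.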
Abstract: Ensuring the safety and extended operational life of fighter aircraft
necessitates frequent and exhaustive inspections. While surface defect
detection is feasible for human inspectors, manual methods face critical
limitations in scalability, efficiency, and consistency due to the vast surface
area, structural complexity, and operational demands of aircraft maintenance.
We propose a smart surface damage detection and localization system for fighter
aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the
entire aircraft surface, captured using a combined system of laser scanners and
cameras, to achieve precise damage detection and localization. Central to our
system is a novel damage detection network built on the YOLO architecture,
specifically optimized for identifying surface defects in 2D aircraft images.
Key innovations include lightweight Fasternet blocks for efficient feature
extraction, an optimized neck architecture incorporating Efficient Multiscale
Attention (EMA) modules for superior feature aggregation, and the introduction
of a novel loss function, Inner-CIOU, to enhance detection accuracy. After
detecting damage in 2D images, the system maps the identified anomalies onto
corresponding 3D point clouds, enabling accurate 3D localization of defects
across the aircraft surface. Our J-DDL not only streamlines the inspection
process but also ensures more comprehensive and detailed coverage of large and
complex aircraft exteriors. To facilitate further advancements in this domain,
we have developed the first publicly available dataset specifically focused on
aircraft damage. Experimental evaluations validate the effectiveness of our
framework, underscoring its potential to significantly advance automated
aircraft inspection technologies.
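The abstract says detections in 2D images are mapped onto the corresponding 3D point clouds but does not describe how. One plausible way to picture the step, offered purely as a sketch, is to project each 3D point into the image with a pinhole camera model and keep the points that land inside a detected box. The intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the assumption that points are already in the camera frame are all hypothetical, and occlusion is ignored.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]      # (x, y, z) in the camera frame
Box2D = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def project(point: Point3D, fx: float, fy: float,
            cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection; returns None for points behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def points_in_detection(points: List[Point3D], box: Box2D, fx: float,
                        fy: float, cx: float, cy: float) -> List[int]:
    """Indices of 3D points whose projection falls inside the 2D box.
    Occlusion is not handled in this sketch."""
    x_min, y_min, x_max, y_max = box
    hits: List[int] = []
    for i, p in enumerate(points):
        uv = project(p, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        if x_min <= u <= x_max and y_min <= v <= y_max:
            hits.append(i)
    return hits


cloud = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
damage_box = (300.0, 220.0, 340.0, 260.0)
indices = points_in_detection(cloud, damage_box,
                              fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Only the first point projects to pixel (320, 240), inside the box; the second projects to (570, 240) and the third lies behind the camera.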
[73] CogStream: Context-guided Streaming Video Question Answering
Zicheng Zhao,Kangyu Wang,Shijie Li,Rui Qian,Weiyao Lin,Huabin Liu
Main category: cs.CV
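TL;DR: The paper introduces CogStream, a new task for multimodal reasoning over streaming video, together with an efficient baseline method and a supporting dataset.
Details
Motivation: Conventional video large language models face a heavy computational burden and distraction from irrelevant context when processing streaming video. Contribution: Introduces a new task (CogStream), a densely annotated dataset with hierarchical question-answer pairs, and a baseline model, CogReasoner.
Method: CogReasoner handles the task efficiently through visual stream compression and historical dialogue retrieval.
Result: Experiments demonstrate the effectiveness of the method.
Insight: Multimodal reasoning over streaming video requires using the relevant context efficiently and avoiding interference from irrelevant data.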
Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving
multimodal understanding, challenges persist in streaming video reasoning due
to its reliance on contextual information. Existing paradigms feed all
available historical contextual information into Vid-LLMs, resulting in a
significant computational burden for visual data processing. Furthermore, the
inclusion of irrelevant context distracts models from key details. This paper
introduces a challenging task called Context-guided Streaming Video Reasoning
(CogStream), which simulates real-world streaming video scenarios, requiring
models to identify the most relevant historical contextual information to
deduce answers for questions about the current stream. To support CogStream, we
present a densely annotated dataset featuring extensive and hierarchical
question-answer pairs, generated by a semi-automatic pipeline. Additionally, we
present CogReasoner as a baseline model. It efficiently tackles this task by
leveraging visual stream compression and historical dialogue retrieval.
Extensive experiments prove the effectiveness of this method. Code will be
released soon.
[74] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations
Yutong Zhou,Masahiro Ryo
Main category: cs.CV
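TL;DR: The paper proposes an end-to-end visual-to-causal framework that turns a species image into interpretable causal insights about its habitat preference, combining species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction.
Details
Motivation: Understanding why a species lives at a particular location matters for ecological research and biodiversity conservation, but existing ecological workflows are fragmented and hard for non-specialists to use. Contribution: The main contribution is a framework that integrates multimodal data and causal inference methods to generate human-readable explanations of habitat preference.
Method: The steps are: (1) species recognition, (2) global occurrence retrieval, (3) pseudo-absence sampling, (4) climate data extraction, (5) discovery of causal structure among environmental features, and (6) estimation of their influence on species occurrence. Explanations are then generated from structured templates and large language models.
Result: Case studies on a bee species and a flower species show the framework's potential to produce statistically grounded, human-readable habitat explanations.
Insight: Combining multimodal AI with causal inference can give ecological research more intuitive and interpretable tools, especially for helping non-specialists understand species habitat preferences.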
Abstract: Explaining why a species lives at a particular location is important for
understanding ecological systems and conserving biodiversity. However, existing
ecological workflows are fragmented and often inaccessible to non-specialists.
We propose an end-to-end visual-to-causal framework that transforms a species
image into interpretable causal insights about its habitat preference. The
system integrates species recognition, global occurrence retrieval,
pseudo-absence sampling, and climate data extraction. We then discover causal
structures among environmental features and estimate their influence on species
occurrence using modern causal inference methods. Finally, we generate
statistically grounded, human-readable causal explanations from structured
templates and large language models. We demonstrate the framework on a bee and
a flower species and report early results as part of an ongoing project,
showing the potential of the multimodal AI assistant backed up by a recommended
ecological modeling practice for describing species habitat in
human-understandable language.
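Pseudo-absence sampling, one of the pipeline steps listed above, generates background locations to stand in for places where the species was not observed. The abstract does not specify the sampling scheme, so the sketch below shows one common variant under stated assumptions: candidates are drawn uniformly in latitude and longitude inside a bounding box and rejected if they fall within `min_km` of any presence record. The function names, the bounding box layout, and the 50 km buffer are illustrative only.

```python
import math
import random
from typing import List, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def sample_pseudo_absences(presences: List[LatLon], n: int, min_km: float,
                           bbox: Tuple[float, float, float, float],
                           seed: int = 0, max_tries: int = 10000) -> List[LatLon]:
    """Rejection-sample up to n points at least min_km from every presence.
    bbox is (lat_min, lat_max, lon_min, lon_max). Sampling uniformly in
    degrees is not area-uniform; that simplification is accepted here."""
    rng = random.Random(seed)
    lat_min, lat_max, lon_min, lon_max = bbox
    absences: List[LatLon] = []
    tries = 0
    while len(absences) < n and tries < max_tries:
        tries += 1
        cand = (rng.uniform(lat_min, lat_max), rng.uniform(lon_min, lon_max))
        if all(haversine_km(cand, p) >= min_km for p in presences):
            absences.append(cand)
    return absences


presence_records = [(52.5, 13.4)]
pseudo_absences = sample_pseudo_absences(
    presence_records, n=5, min_km=50.0, bbox=(50.0, 55.0, 10.0, 15.0))
```

The presence and pseudo-absence points can then be paired with climate variables to form the table on which causal discovery is run.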
[75] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics
Imanol Solano,Julian Fierrez,Aythami Morales,Alejandro Peña,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin
Main category: cs.CV
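TL;DR: The paper proposes the Comprehensive Equity Index (CEI), a new metric for detecting demographic bias in face recognition systems. CEI analyzes genuine and impostor score distributions separately and focuses on the distribution tails, which improves its ability to detect subtle bias.
Details
Motivation: Existing metrics struggle to detect subtle demographic bias in high-performance face recognition systems, especially in the tails of the score distribution, so a more sensitive measure is needed. Contribution: 1. Proposes CEI, a novel fairness metric that considers both tail probabilities and overall distribution shape. 2. Designs an automated version, CEI^A, that improves objectivity and practicality. 3. Shows that CEI is better at detecting subtle bias.
Method: CEI analyzes the genuine and impostor score distributions separately, allows a configurable focus on tail probabilities, and also takes overall distribution shape into account. Experiments cover state-of-the-art FR systems, intentionally biased models, and diverse datasets.
Result: Experiments show that CEI detects subtle biases that previous methods miss, especially in the distribution tails. CEI^A further improves practicality and objectivity.
Insight: CEI's core idea is the separate analysis of distribution tails, which is critical for detecting subtle bias. The approach applies beyond face recognition to any problem that requires analyzing distribution tails.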
Abstract: Demographic bias in high-performance face recognition (FR) systems often
eludes detection by existing metrics, especially with respect to subtle
disparities in the tails of the score distribution. We introduce the
Comprehensive Equity Index (CEI), a novel metric designed to address this
limitation. CEI uniquely analyzes genuine and impostor score distributions
separately, enabling a configurable focus on tail probabilities while also
considering overall distribution shapes. Our extensive experiments (evaluating
state-of-the-art FR systems, intentionally biased models, and diverse datasets)
confirm CEI’s superior ability to detect nuanced biases where previous methods
fall short. Furthermore, we present CEI^A, an automated version of the metric
that enhances objectivity and simplifies practical application. CEI provides a
robust and sensitive tool for operational FR fairness assessment. The proposed
methods have been developed particularly for bias evaluation in face biometrics
but, in general, they are applicable for comparing statistical distributions in
any problem where one is interested in analyzing the distribution tails.
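The abstract does not give the CEI formula, so the following is not CEI. It is only a toy illustration of why tails matter when comparing two score distributions: two groups can share most of their distribution yet differ in how much mass lies above a high threshold. The function names, the pooled-quantile threshold, and the example scores are all assumptions made for this sketch.

```python
from typing import List


def quantile(values: List[float], q: float) -> float:
    """Simple empirical quantile: the element at index floor(q * n),
    clamped to the last element."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def tail_mass(scores: List[float], threshold: float) -> float:
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def tail_gap(group_a: List[float], group_b: List[float], q: float = 0.8) -> float:
    """Difference in upper-tail mass between two groups, with the
    threshold set at the q-quantile of the pooled scores."""
    threshold = quantile(group_a + group_b, q)
    return tail_mass(group_a, threshold) - tail_mass(group_b, threshold)


# Hypothetical impostor scores for two demographic groups.
impostor_a = [0.1, 0.2, 0.3, 0.4, 0.9]
impostor_b = [0.1, 0.2, 0.3, 0.4, 0.5]
gap = tail_gap(impostor_a, impostor_b, q=0.8)
```

Four of the five scores are identical across the two groups, but group A carries one high-scoring impostor above the pooled threshold of 0.5, so the tail gap is 0.2 even though the bulk of the distributions match.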
[76] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
Lizhen Wang,Zhurong Xia,Tianshu Hu,Pengrui Wang,Pengfei Wang,Zerong Zheng,Ming Zhou
Main category: cs.CV
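TL;DR: The paper proposes DreamActor-H1, a Diffusion Transformer (DiT)-based framework for generating high-fidelity human-product demonstration videos, addressing the failure of existing methods to preserve human and product identities and their spatial relationships.
Details
Motivation: In e-commerce and digital marketing, high-quality human-product demonstration videos are important for product presentation. Existing methods often fail to preserve the identities of both humans and products, or lack an understanding of their spatial relationships. Contribution: 1) Proposes a Diffusion Transformer-based framework that preserves both human identity and product details; 2) Introduces a masked cross-attention mechanism and a 3D body mesh template to improve motion guidance; 3) Uses structured text encoding to incorporate category-level semantics.
Method: A Diffusion Transformer (DiT) with masked cross-attention takes paired human-product reference information as input, uses a 3D body mesh and product bounding boxes for motion guidance, and applies structured text encoding to enhance 3D consistency.
Result: Trained on a hybrid dataset, the method outperforms state-of-the-art techniques in identity preservation and in generating realistic demonstration motions.
Insight: Combining a 3D template with semantic encoding can improve the fidelity of generated videos and the naturalness of interactions, offering a practical solution for e-commerce applications.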
Abstract: In e-commerce and digital marketing, generating high-fidelity human-product
demonstration videos is important for effective product presentation. However,
most existing frameworks either fail to preserve the identities of both humans
and products or lack an understanding of human-product spatial relationships,
leading to unrealistic representations and unnatural interactions. To address
these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our
method simultaneously preserves human identities and product-specific details,
such as logos and textures, by injecting paired human-product reference
information and utilizing an additional masked cross-attention mechanism. We
employ a 3D body mesh template and product bounding boxes to provide precise
motion guidance, enabling intuitive alignment of hand gestures with product
placements. Additionally, structured text encoding is used to incorporate
category-level semantics, enhancing 3D consistency during small rotational
changes across frames. Trained on a hybrid dataset with extensive data
augmentation strategies, our approach outperforms state-of-the-art techniques
in maintaining the identity integrity of both humans and products and
generating realistic demonstration motions. Project page:
https://submit2025-dream.github.io/DreamActor-H1/.
[77] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration
Jun Wang,Lixing Zhu,Xiaohan Yu,Abhir Bhalerao,Yulan He
Main category: cs.CV
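TL;DR: The paper proposes PLACE, a framework that improves medical visual representation learning through pathological-level cross-modal alignment and correlation exploration, without additional human annotations.
Details
Motivation: Learning from image-report pairs in the medical domain is challenged by complex semantics and lengthy reports. Existing methods focus on instance-wise or token-wise alignment and neglect pathological-level semantic consistency. Contribution: Proposes the PLACE framework, which combines pathological-level cross-modal alignment (PCMA) with correlation exploration to improve the robustness and generalizability of medical visual representations.
Method: 1. A PCMA module, in which a Visual Pathology Observation Extractor derives pathological observation representations from localized tokens; 2. A proxy task that explores correlations among image patches to enrich fine-grained details.
Result: Achieves state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.
Insight: Pathological-level alignment and correlation exploration are key to better medical visual representation learning, and neither depends on external disease annotations.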
Abstract: Learning medical visual representations from image-report pairs through joint
learning has garnered increasing research attention due to its potential to
alleviate the data scarcity problem in the medical domain. The primary
challenges stem from the lengthy reports that feature complex discourse
relations and semantic pathologies. Previous works have predominantly focused
on instance-wise or token-wise cross-modal alignment, often neglecting the
importance of pathological-level consistency. This paper presents a novel
framework PLACE that promotes the Pathological-Level Alignment and enriches the
fine-grained details via Correlation Exploration without additional human
annotations. Specifically, we propose a novel pathological-level cross-modal
alignment (PCMA) approach to maximize the consistency of pathology observations
from both images and reports. To facilitate this, a Visual Pathology
Observation Extractor is introduced to extract visual pathological observation
representations from localized tokens. The PCMA module operates independently
of any external disease annotations, enhancing the generalizability and
robustness of our methods. Furthermore, we design a proxy task that enforces
the model to identify correlations among image patches, thereby enriching the
fine-grained details crucial for various downstream tasks. Experimental results
demonstrate that our proposed framework achieves new state-of-the-art
performance on multiple downstream tasks, including classification,
image-to-text retrieval, semantic segmentation, object detection and report
generation.
[78] DanceChat: Large Language Model-Guided Music-to-Dance Generation
Qing Wang,Xiaohang Yang,Yilan Dong,Naveen Raj Govindaraj,Gregory Slabaugh,Shanxin Yuan
Main category: cs.CV
TL;DR: DanceChat 是一种基于大语言模型(LLM)的音乐到舞蹈生成方法,通过文本指令提供明确的舞蹈指导,解决了音乐与舞蹈之间的语义鸿沟问题。
Details
Motivation: 音乐仅提供抽象的线索(如旋律、节奏和情感),难以直接映射到具体舞蹈动作,同时音乐和舞蹈的配对数据稀缺,限制了模型的多样性学习能力。Contribution: 提出了 DanceChat,利用 LLM 生成文本舞蹈指令,提供高层次指导;设计了多模态特征提取与融合模块,以及扩散模型与多模态对齐损失,确保生成的舞蹈与音乐和文本一致。
Method: 1) LLM 生成伪指令;2) 多模态特征提取与融合;3) 扩散模型结合多模态对齐损失进行运动合成。
Result: 在 AIST++ 数据集和人类评估中,DanceChat 在定性和定量上均优于现有方法。
Insight: 通过大语言模型提供的文本指令,可以显式填补音乐到舞蹈的语义鸿沟,显著提升生成的多样性和对齐性。
Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned
on musical input. Despite recent progress, significant challenges remain due to
the semantic gap between music and dance motion, as music offers only abstract
cues, such as melody, groove, and emotion, without explicitly specifying the
physical movements. Moreover, a single piece of music can produce multiple
plausible dance interpretations. This one-to-many mapping demands additional
guidance, as music alone provides limited information for generating diverse
dance movements. The challenge is further amplified by the scarcity of paired
music and dance data, which restricts the model's ability to learn
diverse dance patterns. In this paper, we introduce DanceChat, a Large Language
Model (LLM)-guided music-to-dance generation approach. We use an LLM as a
choreographer that provides textual motion instructions, offering explicit,
high-level guidance for dance generation. This approach goes beyond implicit
learning from music alone, enabling the model to generate dance that is both
more diverse and better aligned with musical styles. Our approach consists of
three components: (1) an LLM-based pseudo instruction generation module that
produces textual dance guidance based on music style and structure, (2) a
multi-modal feature extraction and fusion module that integrates music, rhythm,
and textual guidance into a shared representation, and (3) a diffusion-based
motion synthesis module together with a multi-modal alignment loss, which
ensures that the generated dance is aligned with both musical and textual cues.
Extensive experiments on AIST++ and human evaluations show that DanceChat
outperforms state-of-the-art methods both qualitatively and quantitatively.
[79] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning
Chun-Mei Feng,Kai Yu,Xinxing Xu,Salman Khan,Rick Siow Mong Goh,Wangmeng Zuo,Yong Liu
Main category: cs.CV
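TL;DR: The paper proposes T2I-PAL, a method that uses text-to-image generation models to reduce the modality gap and combines prompt tuning with adapter learning to improve multi-label image recognition.
Details
Motivation: Existing CLIP-based image-text contrastive learning methods suffer from a modality gap that limits multi-label image recognition performance. T2I-PAL addresses this by generating diverse, realistic images and aggregating local features. Contribution: 1) Proposes T2I-PAL, which generates images from text to reduce the modality gap; 2) Incorporates a class-wise heatmap and learnable prototypes to strengthen local feature representation; 3) Combines prompt tuning and adapter learning to improve classification performance.
Method: 1) Generates diverse images with a pre-trained text-to-image model; 2) Uses a class-wise heatmap and learnable prototypes to refine local features; 3) Combines prompt tuning and adapter learning for parameter-efficient fine-tuning.
Result: On multiple benchmarks (including MS-COCO, VOC2007, and NUS-WIDE), T2I-PAL improves recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.
Insight: Generating realistic images and improving local feature representation can reduce the modality gap and improve multi-label image recognition, while lowering the dependence on fully annotated training data.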
Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language
models such as CLIP make it possible to directly leverage texts as images (TaI) for
parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image
features similar to the corresponding text features, the modality gap
remains a nontrivial issue and limits image recognition performance of TaI.
Using multi-label image recognition (MLR) as an example, we present a novel
method, called T2I-PAL to tackle the modality gap issue when using only text
captions for PEFT. The core design of T2I-PAL is to leverage pre-trained
text-to-image generation models to generate photo-realistic and diverse images
from text captions, thereby reducing the modality gap. To further enhance MLR,
T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This
aggregates local similarities, making the representation of local visual
features more robust and informative for multi-label recognition. For better
PEFT, we further combine both prompt tuning and adapter learning to enhance
classification performance. T2I-PAL offers significant advantages: it
eliminates the need for fully semantically annotated training images, thereby
reducing the manual annotation workload, and it preserves the intrinsic mode of
the CLIP model, allowing for seamless integration with any existing CLIP
framework. Extensive experiments on multiple benchmarks, including MS-COCO,
VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance
by 3.47% on average over the top-ranked state-of-the-art methods.
[80] Rethinking Random Masking in Self Distillation on ViT
Jihyeon Seong,Hyunkyung Han
Main category: cs.CV
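TL;DR: The paper examines the role of random masking in self-distillation frameworks such as DINO and proposes an asymmetric masking strategy that masks only the student's global view while leaving the student's local views and the teacher's global view unmasked.
Details
Motivation: Recent studies suggest that random masking can inadvertently remove critical semantic information, which calls for more informed masking strategies. Contribution: Proposes an asymmetric random masking strategy that applies masking only to the student's global view, retaining clean supervision while improving robustness.
Method: Within the DINO framework, applies random masking to the student's global view and keeps the local views and the teacher's global view intact.
Result: Experiments on the mini-ImageNet dataset show that the method yields more robust and fine-grained attention maps and improves downstream performance.
Insight: An asymmetric masking strategy can balance training efficiency against preservation of semantic information, improving self-distillation frameworks.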
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a
wide range of vision tasks. In particular, self-distillation frameworks such as
DINO have contributed significantly to these advances. Within such frameworks,
random masking is often utilized to improve training efficiency and introduce
regularization. However, recent studies have raised concerns that
indiscriminate random masking may inadvertently eliminate critical semantic
information, motivating the development of more informed masking strategies. In
this study, we explore the role of random masking in the self-distillation
setting, focusing on the DINO framework. Specifically, we apply random masking
exclusively to the student’s global view, while preserving the student’s local
views and the teacher’s global view in their original, unmasked forms. This
design leverages DINO’s multi-view augmentation scheme to retain clean
supervision while inducing robustness through masked inputs. We evaluate our
approach using DINO-Tiny on the mini-ImageNet dataset and show that random
masking under this asymmetric setup yields more robust and fine-grained
attention maps, ultimately enhancing downstream performance.
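The asymmetric setup described above can be pictured with a small sketch: only the student's global view is masked, while the teacher's global view and the student's local views pass through unchanged. This is a toy illustration on lists of patch ids rather than DINO itself; the function names, the use of `None` as a mask marker, and the 0.3 mask ratio are assumptions, since the abstract gives none of these details.

```python
import random
from typing import Dict, List, Optional

Patch = Optional[int]  # a patch id, or None once the patch is masked


def random_mask(patches: List[int], ratio: float, rng: random.Random) -> List[Patch]:
    """Replace a random subset of patches (floor(ratio * n) of them) with None."""
    n_mask = int(len(patches) * ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    return [None if i in masked_idx else p for i, p in enumerate(patches)]


def build_views(global_view: List[int], local_views: List[List[int]],
                ratio: float = 0.3, seed: int = 0) -> Dict[str, object]:
    """Asymmetric masking: mask only the student's global view."""
    rng = random.Random(seed)
    return {
        "teacher_global": list(global_view),                   # unmasked
        "student_global": random_mask(global_view, ratio, rng),  # masked
        "student_local": [list(v) for v in local_views],       # unmasked
    }


views = build_views(list(range(10)), [list(range(4))], ratio=0.3)
```

With ten patches and a ratio of 0.3, exactly three entries of `views["student_global"]` are `None`, while the teacher still sees all ten patches and therefore provides a clean target.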
[81] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement
Jin Huang,Honghua Chen,Mingqiang Wei
Main category: cs.CV
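TL;DR: The paper proposes HEA-MM, a hierarchical error assessment framework for CAD models within an aircraft manufacturing-and-measurement platform. It analyzes error at the global, part, and feature levels, using structured light scanners and optimization-based methods to assess aircraft workpieces.
Details
Motivation: Aviation equipment demands very high quality (high performance, high stability, and high reliability). Existing methods for assessing manufacturing error against CAD models lack hierarchical analysis, so a more comprehensive error assessment framework is needed. Contribution: Proposes the HEA-MM framework, which analyzes error at the global, part, and feature levels; proposes an optimization-based primitive refinement method; designs a two-stage algorithm for detecting and analyzing circular holes.
Method: 1. Global level: evaluates the overall deviation of the scanned point cloud from the reference CAD model; 2. Part level: refines coarse primitives with splitting and merging operations to obtain meaningful point cloud patches; 3. Feature level: detects and analyzes circular holes using tensor voting and a hypothesize-and-clusterize framework.
Result: Experiments on various aircraft CAD models demonstrate the effectiveness of HEA-MM.
Insight: Hierarchical error analysis captures manufacturing error more comprehensively; the optimization-based refinement offers a new way to obtain point cloud patches, and the two-stage circular hole detection improves the accuracy of feature analysis.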
Abstract: The most essential feature of aviation equipment is high quality, including
high performance, high stability and high reliability. In this paper, we
propose a novel hierarchical error assessment framework for aircraft CAD models
within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs
structured light scanners to obtain comprehensive 3D measurements of
manufactured workpieces. The measured point cloud is registered with the
reference CAD model, followed by an error analysis conducted at three
hierarchical levels: global, part, and feature. At the global level, the error
analysis evaluates the overall deviation of the scanned point cloud from the
reference CAD model. At the part level, error analysis is performed on the
patches underlying the point clouds. We propose a novel optimization-based
primitive refinement method to obtain a set of meaningful patches of point
clouds. Two basic operations, splitting and merging, are introduced to refine
the coarse primitives. At the feature level, error analysis is performed on
circular holes, which are commonly found in CAD models. To facilitate it, a
two-stage algorithm is introduced for the detection of circular holes. First,
edge points are identified using a tensor-voting algorithm. Then, multiple
circles are fitted through a hypothesize-and-clusterize framework, ensuring
accurate detection and analysis of the circular features. Experimental results
on various aircraft CAD models demonstrate the effectiveness of our proposed
method.
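To make the circle-hypothesis idea concrete, here is a minimal sketch of one way a single circle hypothesis could be generated and scored from 2D edge points. It is an assumption-laden illustration, not the paper's algorithm: the abstract does not specify how hypotheses are built or clustered, so the three-point circumcircle, the inlier tolerance, the sample points, and the nominal radius are all hypothetical.

```python
import math


def circle_from_three_points(p1, p2, p3):
    """Circumcircle (cx, cy, r) of three 2D points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None  # collinear points define no circle
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)


def count_inliers(points, circle, tol):
    """Count points whose distance to the circle is within tol."""
    cx, cy, r = circle
    return sum(1 for x, y in points
               if abs(math.hypot(x - cx, y - cy) - r) <= tol)


# Hypothetical edge points: four on a unit circle plus one outlier.
edge_points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (5.0, 5.0)]
circle = circle_from_three_points(*edge_points[:3])
inliers = count_inliers(edge_points, circle, tol=0.05)

nominal_radius = 1.0  # hypothetical value read from the CAD model
radius_error = circle[2] - nominal_radius  # feature-level deviation
```

In a full pipeline, many such hypotheses would be drawn from different point triples and then grouped, which is presumably where the clustering stage comes in.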
[82] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection
Xinyuan Liu,Hang Xu,Yike Ma,Yucheng Zhang,Feng Dai
Main category: cs.CV
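TL;DR: The paper proposes SSP (Semantic-decoupled Spatial Partition), a unified framework that addresses inadequate sample assignment and instance confusion in point-supervised oriented object detection and improves detection performance substantially.
Details
Motivation: Advances in remote sensing have driven rapid growth in imagery, but oriented object detection in high-density scenes is limited by the heavy manual annotation it requires. Point supervision is cheap, yet existing methods suffer from inadequate sample assignment and instance confusion because of their rigid rule-based designs.
Contribution: 1) Proposes the SSP framework, which combines rule-driven prior injection with data-driven label purification; 2) designs pixel-level spatial partition-based sample assignment and semantic spatial partition-based box extraction.
Method: 1) Estimate upper and lower bounds of object scale and mine high-quality samples through pixel-level spatial partitioning; 2) generate pseudo-labels from semantic spatial partitions to supervise downstream detectors.
Result: On DOTA-v1.0 and other datasets, SSP reaches 45.78% mAP under point supervision, 4.10% above the SOTA method PointOBB-v2. Combined with ORCNN and ReDet, it reaches 47.86% and 48.50% mAP, respectively.
Insight: By combining rule-driven and data-driven components, SSP effectively addresses sample assignment under point supervision and offers an efficient approach to oriented object detection in high-density scenes.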
Abstract: Recent advances in remote sensing technology have driven rapid growth
in imagery and rapid development of oriented object detection, yet progress is
hindered by labor-intensive annotation for high-density scenes. Oriented object
detection with point supervision
offers a cost-effective solution for densely packed scenes in remote sensing,
yet existing methods suffer from inadequate sample assignment and instance
confusion due to rigid rule-based designs. To address this, we propose SSP
(Semantic-decoupled Spatial Partition), a unified framework that synergizes
rule-driven prior injection and data-driven label purification. Specifically,
SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based
Sample Assignment, which compactly estimates the upper and lower bounds of
object scales and mines high-quality positive samples and hard negative samples
through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based
Box Extraction, which derives instances from spatial partitions modulated by
semantic maps and reliably converts them into bounding boxes to form
pseudo-labels for supervising the learning of downstream detectors. Experiments
on DOTA-v1.0 and other datasets demonstrate SSP's superiority: it achieves 45.78% mAP
under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%.
Furthermore, when integrated with ORCNN and ReDet architectures, the SSP
framework achieves mAP values of 47.86% and 48.50%, respectively. The code is
available at https://github.com/antxinyuan/ssp.
[83] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model
Eshan Ramesh,Nishio Takayuki
Main category: cs.CV
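TL;DR: LatentCSI is a new method that uses a pretrained latent diffusion model (LDM) to generate images of the physical environment from WiFi CSI measurements. It maps CSI amplitudes directly into the latent space, improving computational efficiency and image quality.
Details
Motivation: Prior approaches to generating images from WiFi CSI, such as GANs, are computationally expensive and yield limited image quality. LatentCSI aims to address both by building on a pretrained LDM.
Contribution: Proposes LatentCSI, a novel method that maps WiFi CSI data directly into the latent space of a pretrained LDM, giving efficient, high-quality image generation with the flexibility of text guidance.
Method: A lightweight neural network maps CSI amplitudes into the LDM's latent space; the LDM's denoising diffusion model is then applied with text-based guidance, and the pretrained decoder produces a high-resolution image.
Result: Validated on a self-collected dataset (off-the-shelf WiFi devices and cameras) and a subset of MM-Fi, LatentCSI outperforms baselines of comparable complexity in computational efficiency and perceptual quality, and supports text-guided control.
Insight: By bypassing pixel-space generation and the explicit image encoding stage, LatentCSI shows the potential of latent diffusion models for cross-modal tasks such as WiFi CSI to image.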
Abstract: We present LatentCSI, a novel method for generating images of the physical
environment from WiFi CSI measurements that leverages a pretrained latent
diffusion model (LDM). Unlike prior approaches that rely on complex and
computationally intensive techniques such as GANs, our method employs a
lightweight neural network to map CSI amplitudes directly into the latent space
of an LDM. We then apply the LDM’s denoising diffusion model to the latent
representation with text-based guidance before decoding using the LDM’s
pretrained decoder to obtain a high-resolution image. This design bypasses the
challenges of pixel-space image generation and avoids the explicit image
encoding stage typically required in conventional image-to-image pipelines,
enabling efficient and high-quality image synthesis. We validate our approach
on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi
devices and cameras; and a subset of the publicly available MM-Fi dataset. The
results demonstrate that LatentCSI outperforms baselines of comparable
complexity trained directly on ground-truth images in both computational
efficiency and perceptual quality, while additionally providing practical
advantages through its unique capacity for text-guided controllability.
[84] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin,Xudong Xie,Zhang Li,Xiang Bai,Yuliang Liu
Main category: cs.CV
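TL;DR: MSTAR is a box-free scene text retrieval method that supports multiple query types. It captures multi-grained text representations dynamically and unifies free-style text queries through style-aware instructions, improving performance while removing box annotation costs.
Details
Motivation: Existing scene text retrieval methods depend on costly bounding box annotations and struggle to unify diverse query types. MSTAR targets both problems.
Contribution: 1. Proposes MSTAR, which needs no bounding box annotations. 2. Integrates a multi-instance matching module to strengthen vision-language alignment. 3. Builds MQTR, the first benchmark for multi-query scene text retrieval.
Method: 1. Progressive vision embedding dynamically captures multi-grained text representations. 2. Style-aware instructions harmonize free-style text queries. 3. A multi-instance matching module improves alignment.
Result: Across seven public datasets and the MQTR benchmark, MSTAR outperforms previous methods (for example, +6.4% MAP on Total-Text), with an average gain of 8.5% on MQTR.
Insight: Retrieval without bounding box annotations is feasible and effective; a unified multi-query design better serves diverse retrieval needs.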
Abstract: Scene text retrieval has made significant progress with the assistance of
accurate text localization. However, existing approaches typically require
costly bounding box annotations for training. Besides, they mostly adopt a
customized retrieval strategy but struggle to unify various types of queries to
meet diverse retrieval needs. To address these issues, we introduce Multi-query
Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for
scene text retrieval. It incorporates progressive vision embedding to
dynamically capture the multi-grained representation of texts and harmonizes
free-style text queries with style-aware instructions. Additionally, a
multi-instance matching module is integrated to enhance vision-language
alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset,
the first benchmark designed to evaluate the multi-query scene text retrieval
capability of models, comprising four query types and 16k images. Extensive
experiments demonstrate the superiority of our method across seven public
datasets and the MQTR dataset. Notably, MSTAR surpasses the previous
state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box
annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly
outperforms the previous models by an average of 8.5%. The code and datasets
are available at https://github.com/yingift/MSTAR.
[85] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models
Konstantinos Vilouras,Ilias Stogiannidis,Junyu Yan,Alison Q. O’Neil,Sotirios A. Tsaftaris
Main category: cs.CV
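TL;DR: The paper proposes a fine-tuning framework for chest X-ray latent diffusion models that uses weakly supervised prompt tuning to improve multi-modal alignment. It addresses weak text-image alignment in medical imaging and performs well on both a standard benchmark and out-of-distribution data.
Details
Motivation: Data in medical imaging is limited, partly because of privacy concerns, and latent diffusion models consequently align text and images poorly there, which restricts their use in multi-modal medical imaging tasks.
Contribution: Proposes a weakly supervised prompt tuning framework that improves multi-modal alignment in a pre-trained model; sets a new SOTA on a standard dataset and shows robustness on out-of-distribution data.
Method: Anatomy-grounded, weakly supervised prompt tuning efficiently fine-tunes a pre-trained latent diffusion model so that text aligns more precisely with image regions.
Result: New SOTA on MS-CXR, with robust performance on out-of-distribution data (VinDr-CXR).
Insight: Even in data-limited medical imaging, weak supervision can effectively improve multi-modal models, suggesting a new direction for medical image analysis.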
Abstract: Latent Diffusion Models have shown remarkable results in text-guided image
synthesis in recent years. In the domain of natural (RGB) images, recent works
have shown that such models can be adapted to various vision-language
downstream tasks with little to no supervision involved. On the contrary,
text-to-image Latent Diffusion Models remain relatively underexplored in the
field of medical imaging, primarily due to limited data availability (e.g., due
to privacy concerns). In this work, focusing on the chest X-ray modality, we
first demonstrate that a standard text-conditioned Latent Diffusion Model has
not learned to align clinically relevant information in free-text radiology
reports with the corresponding areas of the given scan. Then, to alleviate this
issue, we propose a fine-tuning framework to improve multi-modal alignment in a
pre-trained model such that it can be efficiently repurposed for downstream
tasks such as phrase grounding. Our method sets a new state-of-the-art on a
standard benchmark dataset (MS-CXR), while also exhibiting robust performance
on out-of-distribution data (VinDr-CXR). Our code will be made publicly
available.
[86] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models
Francisco Caetano,Christiaan Viviers,Peter H. N. De With,Fons van der Sommen
Main category: cs.CV
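TL;DR: The paper proposes Symmetrical Flow Matching (SymmFlow), which uses a symmetric learning objective to unify image generation, semantic segmentation, and classification in a single model.
Details
Motivation: Flow Matching performs well in generative modeling but has been limited as a way to unify multiple tasks. The authors aim to model image generation, segmentation, and classification jointly through a symmetric learning objective.
Contribution: Proposes SymmFlow, which unifies generation, segmentation, and classification through a bi-directional consistency objective and a mechanism that retains semantic information across flows.
Method: A symmetric learning objective models the forward and reverse transformations jointly, and a new training objective retains semantic information; conditioning supports both pixel-level and image-level class labels.
Result: FID of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff for semantic image synthesis with only 25 inference steps; competitive results on semantic segmentation and promising results on classification.
Insight: With symmetry and semantic retention, unified multi-task modeling is not only feasible but can also benefit performance on individual tasks, suggesting a new direction for multimodal generative learning.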
Abstract: Flow Matching has emerged as a powerful framework for learning continuous
transformations between distributions, enabling high-fidelity generative
modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new
formulation that unifies semantic segmentation, classification, and image
generation within a single model. Using a symmetric learning objective,
SymmFlow models forward and reverse transformations jointly, ensuring
bi-directional consistency, while preserving sufficient entropy for generative
diversity. A new training objective is introduced to explicitly retain semantic
information across flows, featuring efficient sampling while preserving
semantic structure, allowing for one-step segmentation and classification
without iterative refinement. Unlike previous approaches that impose strict
one-to-one mapping between masks and images, SymmFlow generalizes to flexible
conditioning, supporting both pixel-level and image-level class labels.
Experimental results on various benchmarks demonstrate that SymmFlow achieves
state-of-the-art performance on semantic image synthesis, obtaining FID scores
of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps.
Additionally, it delivers competitive results on semantic segmentation and
shows promising capabilities in classification tasks. The code will be publicly
available.
[87] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
Xiaoyi Bao,Jindi Lv,Xiaofeng Wang,Zheng Zhu,Xinze Chen,YuKun Zhou,Jiancheng Lv,Xingang Wang,Guan Huang
Main category: cs.CV
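TL;DR: GigaVideo-1 is an efficient fine-tuning framework for video generation that relies on automatic feedback instead of large amounts of human annotation and data, improving generation quality with only 4 GPU-hours.
Details
Motivation: Video generation models still need fine-tuning to improve specific dimensions such as instance preservation and motion rationality, but existing approaches depend on human annotations and large-scale compute, which limits their practicality.
Contribution: 1. An efficient fine-tuning framework that needs no additional human supervision; 2. a prompt-driven data engine and a reward-guided optimization strategy; 3. an average gain of about 4% on VBench-2.0 using only 4 GPU-hours.
Method: 1. A prompt-driven data engine constructs diverse, weakness-oriented training samples; 2. feedback from pre-trained vision-language models, with a realism constraint, adaptively weights the samples.
Result: Improves almost all of the 17 evaluation dimensions, with an average gain of about 4% at very low compute cost.
Insight: Automatic feedback can stand in for human annotation, and low-cost fine-tuning has practical potential.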
Abstract: Recent progress in diffusion models has greatly enhanced video generation
quality, yet these models still require fine-tuning to improve specific
dimensions like instance preservation, motion rationality, composition, and
physical plausibility. Existing fine-tuning approaches often rely on human
annotations and large-scale computational resources, limiting their
practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning
framework that advances video generation without additional human supervision.
Rather than injecting large volumes of high-quality data from external sources,
GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models
through automatic feedback. Specifically, we focus on two key aspects of the
fine-tuning process: data and optimization. To improve fine-tuning data, we
design a prompt-driven data engine that constructs diverse, weakness-oriented
training samples. On the optimization side, we introduce a reward-guided
training strategy, which adaptively weights samples using feedback from
pre-trained vision-language models with a realism constraint. We evaluate
GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17
evaluation dimensions. Experiments show that GigaVideo-1 consistently improves
performance on almost all the dimensions with an average gain of about 4% using
only 4 GPU-hours. Requiring no manual annotations and minimal real data,
GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and
data will be publicly available.
[88] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis
Marzieh Oghbaie,Teresa Araújo,Hrvoje Bogunović
Main category: cs.CV
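TL;DR: PiPViT is a vision transformer-based prototype method that learns interpretable prototypes through contrastive learning and multi-resolution input processing for retinal image analysis, combining competitive performance with meaningful explanations.
Details
Motivation: Prototype visualizations in medical imaging are often inconsistent with human-understandable biomarkers, and learned prototypes tend to be too granular to convey the presence and extent of biomarkers.
Contribution: Proposes PiPViT, an inherently interpretable ViT-based prototypical model that uses contrastive learning and multi-resolution processing to learn robust, interpretable prototypes.
Method: A ViT captures long-range dependencies among patches; prototypes are learned with contrastive learning and multi-resolution input processing.
Result: Competitive with state-of-the-art methods on retinal OCT image classification across four datasets; the learned prototypes are semantically and clinically relevant.
Insight: PiPViT provides competitive classification while offering transparent prototypes that can help clinicians interpret diagnostic outcomes.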
Abstract: Background and Objective: Prototype-based methods improve interpretability by
learning fine-grained part-prototypes; however, their visualization in the
input pixel space is not always consistent with human-understandable
biomarkers. In addition, well-known prototype-based approaches typically learn
extremely granular prototypes that are less interpretable in medical imaging,
where both the presence and extent of biomarkers and lesions are critical.
Methods: To address these challenges, we propose PiPViT (Patch-based Visual
Interpretable Prototypes), an inherently interpretable prototypical model for
image recognition. Leveraging a vision transformer (ViT), PiPViT captures
long-range dependencies among patches to learn robust, human-interpretable
prototypes that approximate lesion extent only using image-level labels.
Additionally, PiPViT benefits from contrastive learning and multi-resolution
input processing, which enables effective localization of biomarkers across
scales.
Results: We evaluated PiPViT on retinal OCT image classification across four
datasets, where it achieved competitive quantitative performance compared to
state-of-the-art methods while delivering more meaningful explanations.
Moreover, quantitative evaluation on a hold-out test set confirms that the
learned prototypes are semantically and clinically relevant. We believe PiPViT
can transparently explain its decisions and assist clinicians in understanding
diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT
[89] Enhancing Deepfake Detection using SE Block Attention with CNN
Subhram Dasgupta,Janelle Mason,Xiaohong Yuan,Olusola Odeyomi,Kaushik Roy
Main category: cs.CV
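TL;DR: The paper proposes a lightweight CNN with a squeeze-and-excitation (SE) attention block for Deepfake detection, reducing model size and compute while reaching competitive accuracy.
Details
Motivation: Deepfake techniques are advancing quickly and forged content is increasingly realistic, which traditional detection methods struggle to handle. Existing deep detection models are usually large, with high storage and memory costs, so efficient lightweight solutions are needed.
Contribution: Proposes a lightweight CNN with an SE attention block that dynamically reweights channel features, improving detection efficiency and reducing resource consumption.
Method: A CNN with an SE block performs dynamic channel-wise feature recalibration, emphasizing informative features and suppressing less useful ones within a simple sequential model.
Result: 94.14% classification accuracy and 0.985 AUC-ROC on the Style GAN dataset from the Diverse Fake Face Dataset, competitive with existing models.
Insight: SE attention can effectively improve a lightweight detector, offering a practical option for Deepfake detection under resource constraints.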
Abstract: In the digital age, Deepfakes present a formidable challenge by using advanced
artificial intelligence to create highly convincing manipulated content,
undermining information authenticity and security. These sophisticated
fabrications surpass traditional detection methods in complexity and realism.
To address this issue, we aim to harness cutting-edge deep learning
methodologies to engineer an innovative deepfake detection model. However, most
of the models designed for deepfake detection are large, causing heavy storage
and memory consumption. In this research, we propose a lightweight convolutional
neural network (CNN) with squeeze-and-excitation (SE) block attention for
Deepfake detection. The SE block module is designed to perform dynamic
channel-wise feature recalibration. The SE block allows the network to
emphasize informative features and suppress less useful ones, which leads to a
more efficient and effective learning module. This module is integrated with a
simple sequential model to perform Deepfake detection. The model is smaller in
size and achieves accuracy competitive with existing models for deepfake
detection tasks. The model achieved an overall classification accuracy of
94.14% and an AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse
Fake Face Dataset. Our proposed approach presents a promising avenue for
combating the Deepfake challenge with minimal computational resources,
developing efficient and scalable solutions for digital content verification.
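The channel-wise recalibration performed by an SE block can be sketched without any deep learning library. The following is a minimal pure-Python illustration of the standard squeeze, excitation, and scale steps; it is not the authors' model, and the tiny feature map, the weights, and the single-unit bottleneck are hypothetical values chosen for readability.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def dense(weights, bias, x):
    """Fully connected layer: weights is a list of rows, one per output."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def excite(z, w1, b1, w2, b2):
    """Bottleneck FC -> ReLU -> FC -> sigmoid, giving one gate per channel."""
    hidden = [max(0.0, h) for h in dense(w1, b1, z)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(w2, b2, hidden)]


def se_block(feature_map, w1, b1, w2, b2):
    """Scale every channel by its learned gate in (0, 1)."""
    gates = excite(squeeze(feature_map), w1, b1, w2, b2)
    return [[[g * v for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]


# Hypothetical input: 2 channels of size 2x2, bottleneck of 1 unit.
fmap = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1, b1 = [[0.5, 0.5]], [0.0]          # 2 channels -> 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]  # 1 hidden unit -> 2 gates
recalibrated = se_block(fmap, w1, b1, w2, b2)
```

With these illustrative weights the first channel receives a gate close to 1 and the second a gate close to 0, which is the "emphasize informative features, suppress less useful ones" behavior in miniature.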
[90] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework
Xia Du,Xiaoyuan Liu,Jizhe Zhou,Zheng Lin,Chi-man Pun,Zhe Chen,Wei Ni,Jun Luo
Main category: cs.CV
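TL;DR: The paper proposes UAC, an adversarial CAPTCHA framework that uses a large language model (LLM) and a diffusion model to generate high-quality adversarial examples, supporting targeted and untargeted (including black-box) attacks. Experiments show high attack success rates and natural CAPTCHAs that are indistinguishable to humans and DNNs.
Details
Motivation: Advances in deep learning make traditional CAPTCHA schemes vulnerable to automated attacks. Existing adversarial attack methods rely on original image characteristics, which distorts the image and limits use where no initial input image exists.
Contribution: 1. Proposes the UAC framework, which generates high-fidelity adversarial examples guided by text prompts and uses an LLM to increase CAPTCHA diversity; 2. applies the EDICT method to optimize dual latent variables in a diffusion model for targeted attacks; 3. proposes the BP-UAC strategy for untargeted attacks in black-box scenarios.
Method: 1. An LLM increases CAPTCHA diversity; 2. targeted attacks use EDICT; 3. untargeted black-box attacks use BP-UAC, a two-step strategy with multimodal gradients and bi-path optimization.
Result: BP-UAC achieves high attack success rates across diverse systems and generates natural, hard-to-distinguish CAPTCHAs.
Insight: Combining an LLM with a diffusion model improves the diversity and quality of adversarial examples; multimodal gradients with bi-path optimization are effective for black-box attacks.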
Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are
increasingly vulnerable to automated attacks powered by deep neural networks
(DNNs). Existing adversarial attack methods often rely on original image
characteristics, resulting in distortions that hinder human interpretation and
limit applicability in scenarios lacking initial input images. To address these
challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel
framework generating high-fidelity adversarial examples guided by
attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC
enhances CAPTCHA diversity and supports both targeted and untargeted attacks.
For targeted attacks, the EDICT method optimizes dual latent variables in a
diffusion model for superior image quality. In untargeted attacks, especially
for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA
(BP-UAC), a two-step optimization strategy employing multimodal gradients and
bi-path optimization for efficient misclassification. Experiments show BP-UAC
achieves high attack success rates across diverse systems, generating natural
CAPTCHAs indistinguishable to humans and DNNs.
[91] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery
Christopher Gaul,Eduardo Fidalgo,Enrique Alegre,Rocío Alaiz Rodríguez,Eri Pérez Corral
Main category: cs.CV
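TL;DR: The paper proposes a multi-task, multi-age approach to detecting minors in unconstrained images. A shared-feature architecture, a reweighted loss, and age-balanced sampling improve both age estimation and under-age detection.
Details
Motivation: Automatic detection of minors is hampered by the under-representation of children in public data and by distribution shift, so robust methods are needed.
Contribution: 1. A multi-task architecture combining age regression with under-age detection at several binary thresholds; 2. an α-reweighted focal-style loss and age-balanced sampling to counter class imbalance; 3. a large, rigorous evaluation benchmark (with the ASORES-39k and ASWIFT-20k tests).
Method: 1. A frozen FaRL vision-language backbone with a two-layer MLP that shares features across heads; 2. α-reweighted focal-style loss and age-balanced mini-batch sampling; 3. an age gap that removes edge cases from the loss.
Result: On ASORES-39k, age estimation RMSE drops from 5.733 to 5.656 and under-18 F2 rises from 0.801 to 0.857; on ASWIFT-20k, F2 rises from 0.742 to 0.833.
Insight: Multi-task learning and age-balanced sampling matter for under-age detection, and the model stays robust under distribution shift.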
Abstract: Accurate automatic screening of minors in unconstrained images demands models
that are robust to distribution shift and resilient to the
under-representation of children in publicly available data. To overcome these issues, we
propose a multi-task architecture with dedicated under/over-age discrimination
tasks based on a frozen FaRL vision-language backbone joined with a compact
two-layer MLP that shares features across one age-regression head and four
binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing
on the legally critical age range. To address the severe class imbalance, we
introduce an α-reweighted focal-style loss and age-balanced mini-batch
sampling, which equalizes twelve age bins during stochastic optimization.
Further improvement is achieved with an age gap that removes edge cases from
the loss.
Moreover, we set a rigorous evaluation by proposing the Overall Under-Age
Benchmark, with 303k cleaned training images and 110k test images, defining
both the “ASORES-39k” restricted overall test, which removes the noisiest
domains, and the “ASWIFT-20k” age estimation wild shifts test of 20k images,
stressing extreme pose (>45°), expression, and low image quality to
emulate real-world shifts.
Trained on the cleaned overall set with resampling and age gap, our multiage
model “F” lowers the root-mean-square-error on the ASORES-39k restricted test
from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from
F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to
the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall
while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline,
demonstrating strong generalization under distribution shift. For the under-12
and under-15 tasks, F2 rises from 0.666 to 0.955 and
from 0.689 to 0.916, respectively.
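The F2 score reported above is the F-beta measure with beta = 2, which weights recall more heavily than precision. That fits a screening task where missing a minor is costlier than a false alarm. A minimal sketch of the standard formula follows; the confusion counts in the example are hypothetical and are not taken from the paper.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta > 1 favors recall over precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical detector: high recall (0.9) but modest precision (0.5).
f2 = f_beta(tp=9, fp=9, fn=1, beta=2.0)
f1 = f_beta(tp=9, fp=9, fn=1, beta=1.0)
```

For this recall-heavy detector F2 comes out higher than F1, which shows how the metric rewards catching most positives even at some cost in precision.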
[92] Continual Hyperbolic Learning of Instances and Classes
Melika Ayoughi,Mina Ghadimi Atigh,Mohammad Mahdi Derakhshani,Cees G. M. Snoek,Pascal Mettes,Paul Groth
Main category: cs.CV
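TL;DR: The paper proposes HyperCLIC, a continual learning framework that learns instances and classes at the same time, models their hierarchy in hyperbolic space, and is validated on the EgoObjects dataset.
Details
Motivation: Real-world applications such as robotics and self-driving cars need models that continually learn both instances and classes, whereas traditional continual learning focuses on one or the other.
Contribution: 1) Introduces the task of continual learning of instances and classes together; 2) proposes HyperCLIC, which models hierarchical relations in hyperbolic space; 3) designs continual hierarchical evaluation metrics.
Method: Hyperbolic classification and distillation objectives embed the tree-like hierarchy with low distortion and compact embeddings, balancing fine-grained instance recognition with coarse-grained class generalization.
Result: On EgoObjects, HyperCLIC operates effectively at multiple granularities with improved hierarchical generalization.
Insight: Hyperbolic space suits hierarchical relations and offers a new approach to multi-granularity continual learning.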
Abstract: Continual learning has traditionally focused on classifying either instances
or classes, but real-world applications, such as robotics and self-driving
cars, require models to handle both simultaneously. To mirror real-life
scenarios, we introduce the task of continual learning of instances and
classes, at the same time. This task challenges models to adapt to multiple
levels of granularity over time, which requires balancing fine-grained instance
recognition with coarse-grained class generalization. In this paper, we
identify that classes and instances naturally form a hierarchical structure. To
model these hierarchical relationships, we propose HyperCLIC, a continual
learning algorithm that leverages hyperbolic space, which is uniquely suited
for hierarchical data due to its ability to represent tree-like structures with
low distortion and compact embeddings. Our framework incorporates hyperbolic
classification and distillation objectives, enabling the continual embedding of
hierarchical relations. To evaluate performance across multiple granularities,
we introduce continual hierarchical metrics. We validate our approach on
EgoObjects, the only dataset that captures the complexity of hierarchical
object recognition in dynamic real-world environments. Empirical results show
that HyperCLIC operates effectively at multiple granularities with improved
hierarchical generalization.
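To illustrate why hyperbolic space suits tree-like data, the sketch below computes geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model HyperCLIC uses, so the choice of the Poincaré ball and the sample points are assumptions made for illustration only.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Same Euclidean gap (0.1), measured near the origin and near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
```

The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary. That growing room toward the edge is what lets many leaf nodes (instances) spread out around a few nodes near the center (classes) with low distortion.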
[93] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement
Yuqi Shen,Fengyang Xiao,Sujie Hu,Youwei Pang,Yifan Pu,Chengyu Fang,Xiu Li,Chunming He
Main category: cs.CV
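TL;DR: The paper proposes Uncertainty-Masked Bernoulli Diffusion (UMBD), the first generative refinement framework designed specifically for camouflaged object detection (COD), which improves segmentation by selectively refining low-quality regions.
Details
Motivation: Existing COD methods leave room for refinement in poorly segmented regions, but a dedicated post-processing framework has been missing.
Contribution: 1. Proposes UMBD, the first generative refinement framework for COD; 2. designs HUQNet, which fuses uncertainty from multiple sources to guide refinement; 3. integrates with existing COD models at modest cost and improves their performance.
Method: 1. UMBD selectively applies Bernoulli diffusion to low-quality regions; 2. HUQNet quantifies uncertainty with a multi-branch architecture; 3. the uncertainty estimate adaptively guides the diffusion sampling process.
Result: Across multiple COD benchmarks, average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead.
Insight: Combining generative refinement with discriminative models helps with the subtle target-background differences in COD; uncertainty-guided local refinement is the key ingredient.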
Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the
subtle visual differences between targets and their backgrounds. While existing
methods have made notable progress, there remains significant potential for
post-processing refinement that has yet to be fully explored. To address this
limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model,
the first generative refinement framework specifically designed for COD. UMBD
introduces an uncertainty-guided masking mechanism that selectively applies
Bernoulli diffusion to residual regions with poor segmentation quality,
enabling targeted refinement while preserving correctly segmented areas. To
support this process, we design the Hybrid Uncertainty Quantification Network
(HUQNet), which employs a multi-branch architecture and fuses uncertainty from
multiple sources to improve estimation accuracy. This enables adaptive guidance
during the generative sampling process. The proposed UMBD framework can be
seamlessly integrated with a wide range of existing Encoder-Decoder-based COD
models, combining their discriminative capabilities with the generative
advantages of diffusion-based refinement. Extensive experiments across multiple
COD benchmarks demonstrate consistent performance improvements, achieving
average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest
computational overhead. Code will be released.
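The masking idea, refining only where the initial segmentation is uncertain, can be shown with a toy example. This sketch covers the selection step alone and is not the Bernoulli diffusion process or HUQNet: the threshold, the per-pixel refined probabilities, and the single sampling pass are hypothetical simplifications.

```python
import random


def masked_refine(coarse, refined_prob, uncertainty, threshold=0.5, seed=0):
    """Resample uncertain pixels from a Bernoulli; keep confident ones."""
    rng = random.Random(seed)
    output = []
    for label, prob, unc in zip(coarse, refined_prob, uncertainty):
        if unc > threshold:
            # Uncertain pixel: draw a new binary label with P(1) = prob.
            output.append(1 if rng.random() < prob else 0)
        else:
            # Confident pixel: the coarse prediction is preserved.
            output.append(label)
    return output


# Hypothetical flattened 4-pixel mask; the first two pixels are uncertain.
coarse = [1, 0, 1, 0]
refined_prob = [0.0, 1.0, 0.0, 1.0]
uncertainty = [0.9, 0.9, 0.1, 0.1]
refined = masked_refine(coarse, refined_prob, uncertainty)
```

Only the first two pixels are resampled here; the last two keep their coarse labels even though the refined probabilities disagree, which mirrors the goal of preserving correctly segmented areas.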
[94] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain
Hong Huang,Weixiang Sun,Zhijian Wu,Jingwen Niu,Donghuan Lu,Xian Wu,Yefeng Zheng
Main category: cs.CV
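TL;DR: IQE-CLIP is a CLIP-based framework for zero- and few-shot anomaly detection in the medical domain that builds anomaly-sensitive embeddings from textual and instance-aware visual information.
Details
Motivation: Existing CLIP-based methods rely on carefully designed, scenario-specific prompts and often fail to separate normal from anomalous instances in the joint embedding space; exploration in the medical domain is also limited.
Contribution: Proposes IQE-CLIP, which uses class-based and learnable prompting tokens together with an instance-aware query module to generate anomaly-sensitive embeddings, addressing the gap in medical ZFSAD.
Method: 1. Introduce class-based and learnable prompting tokens; 2. design an instance-aware query module that extracts region-level contextual information from both modalities.
Result: State-of-the-art performance in zero-shot and few-shot settings on six medical datasets.
Insight: Query embeddings that combine textual and instance-aware visual information are more effective anomaly indicators; medical ZFSAD benefits from finer-grained embedding generation.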
Abstract: Recent advances in vision-language models, such as CLIP, have significantly
improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks.
However, most existing CLIP-based methods assume prior knowledge of categories
and rely on carefully designed prompts tailored to specific scenarios. While
these text prompts capture semantic information in the textual space, they
often fail to distinguish normal and anomalous instances in the joint embedding
space. Moreover, most ZFSAD approaches focus on industrial domains, with
limited exploration in medical tasks. To address these limitations, we propose
IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query
embeddings integrating both textual and instance-aware visual information serve
as more effective indicators of anomalies. Specifically, we introduce
class-based and learnable prompting tokens to better adapt CLIP to the medical
setting. Furthermore, we design an instance-aware query module that extracts
region-level contextual information from both modalities, enabling the
generation of anomaly-sensitive embeddings. Extensive experiments on six
medical datasets demonstrate that IQE-CLIP achieves state-of-the-art
performance in both zero-shot and few-shot settings. Code and data are
available at https://github.com/hongh0/IQE-CLIP/.
[95] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
SiXiang Chen,Jianyu Lai,Jialin Gao,Tian Ye,Haoyu Chen,Hengyu Shi,Shitong Shao,Yunlong Lin,Song Fei,Zhaohu Xing,Yeying Jin,Junfeng Luo,Xiaoming Wei,Lei Zhu
Main category: cs.CV
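TL;DR: PosterCraft is a unified framework for generating high-quality aesthetic posters. A multi-stage optimization workflow and automated data construction improve rendering quality and visual appeal.
Details
Motivation: Generating aesthetic posters is harder than generating simple design images: it requires accurate text rendering, integration of artistic content, and harmonious layout. Existing methods mostly use modular pipelines, which limits generative freedom.
Contribution: 1) Proposes PosterCraft, a unified framework free of modular constraints; 2) introduces the Text-Render-2M dataset and HQ-Poster100K; 3) a multi-stage optimization method (text rendering, region-aware fine-tuning, aesthetic reinforcement learning, feedback refinement).
Method: 1) Large-scale text-rendering optimization; 2) region-aware supervised fine-tuning; 3) aesthetic-text reinforcement learning via best-of-n preference optimization; 4) joint vision-language feedback refinement.
Result: Significantly outperforms open-source baselines in rendering accuracy, layout coherence, and visual appeal, approaching the quality of SOTA commercial systems.
Insight: Automated data construction and multi-stage optimization are key to generation quality; a unified framework suits complex aesthetic tasks better.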
Abstract: Generating aesthetic posters is more challenging than simple design images:
it requires not only precise text rendering but also the seamless integration
of abstract artistic content, striking layouts, and overall stylistic harmony.
To address this, we propose PosterCraft, a unified framework that abandons
prior modular pipelines and rigid, predefined layouts, allowing the model to
freely explore coherent, visually compelling compositions. PosterCraft employs
a carefully designed, cascaded workflow to optimize the generation of
high-aesthetic posters: (i) large-scale text-rendering optimization on our
newly introduced Text-Render-2M dataset; (ii) region-aware supervised
fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via
best-of-n preference optimization; and (iv) joint vision-language feedback
refinement. Each stage is supported by a fully automated data-construction
pipeline tailored to its specific needs, enabling robust training without
complex architectural modifications. Across multiple experiments,
PosterCraft significantly outperforms open-source baselines in rendering
accuracy, layout coherence, and overall visual appeal, approaching the quality
of SOTA commercial systems. Our code, models, and datasets can be found on the
project page: https://ephemeral182.github.io/PosterCraft
[96] SlotPi: Physics-informed Object-centric Reasoning Models
Jian Li,Wan Han,Ning Lin,Yu-Liang Zhan,Ruizhi Chengze,Haining Wang,Yi Zhang,Hongsheng Liu,Zidong Wang,Fan Yu,Hao Sun
Main category: cs.CV
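TL;DR: SlotPi is a physics-informed object-centric reasoning model that combines a module based on Hamiltonian principles with a spatio-temporal prediction module, addressing the lack of physical knowledge and of cross-scenario validation in existing methods.
Details
Motivation: Existing object-centric dynamic simulation methods overlook the integration of physical knowledge and the validation of model adaptability across diverse scenarios, whereas humans acquire physical knowledge by observing the world and use it to reason about dynamics.
Contribution: Proposes SlotPi, which combines a Hamiltonian physical module with a spatio-temporal prediction module, and demonstrates its strengths on prediction and visual question answering over benchmark and fluid datasets.
Method: SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module, covering object-centric dynamic reasoning as well as fluid flow characteristics.
Result: SlotPi performs strongly on prediction and VQA tasks and shows strong adaptability on a newly created real-world dataset.
Insight: Integrating physical knowledge improves dynamic reasoning and cross-scenario adaptability, laying a foundation for more advanced world models.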
Abstract: Understanding and reasoning about dynamics governed by physical laws through
visual observation, akin to human capabilities in the real world, poses
significant challenges. Currently, object-centric dynamic simulation methods,
which emulate human behavior, have achieved notable progress but overlook two
critical aspects: 1) the integration of physical knowledge into models. Humans
gain physical insights by observing the world and apply this knowledge to
accurately reason about various dynamic scenarios; 2) the validation of model
adaptability across diverse scenarios. Real-world dynamics, especially those
involving fluids and objects, demand models that not only capture object
interactions but also simulate fluid flow characteristics. To address these
gaps, we introduce SlotPi, a slot-based physics-informed object-centric
reasoning model. SlotPi integrates a physical module based on Hamiltonian
principles with a spatio-temporal prediction module for dynamic forecasting.
Our experiments highlight the model’s strengths in tasks such as prediction and
Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore,
we have created a real-world dataset encompassing object interactions, fluid
dynamics, and fluid-object interactions, on which we validated our model’s
capabilities. The model’s robust performance across all datasets underscores
its strong adaptability, laying a foundation for developing more advanced world
models.
[97] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning
Ignacio Bugueno-Cordova,Javier Ruiz-del-Solar,Rodrigo Verschae
Main category: cs.CV
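TL;DR: The paper presents a robot navigation controller that combines event cameras and other sensors with reinforcement learning for real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, it exploits the asynchronous nature of event cameras for adaptive inference and control.
Details
Motivation: Conventional image-based controllers operate at fixed rates and suffer from motion blur and latency, whereas event cameras capture visual information asynchronously and offer a way around these problems.
Contribution: 1. A robot navigation framework combining event cameras with reinforcement learning; 2. an imitation learning phase that improves sample efficiency; 3. robust navigation and obstacle avoidance demonstrated in simulation.
Method: 1. Perception from event cameras plus additional range sensing; 2. policy optimization with Deep Deterministic Policy Gradient (DDPG); 3. an initial imitation learning phase to improve sample efficiency.
Result: Robust navigation, pedestrian following, and obstacle avoidance in simulated environments.
Insight: The asynchronous nature of event cameras offers a new sensing modality for robot navigation, and pairing it with reinforcement learning can improve adaptability in dynamic environments.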
Abstract: This work introduces a robot navigation controller that combines event
cameras and other sensors with reinforcement learning to enable real-time
human-centered navigation and obstacle avoidance. Unlike conventional
image-based controllers, which operate at fixed rates and suffer from motion
blur and latency, this approach leverages the asynchronous nature of event
cameras to process visual information over flexible time intervals, enabling
adaptive inference and control. The framework integrates event-based
perception, additional range sensing, and policy optimization via Deep
Deterministic Policy Gradient, with an initial imitation learning phase to
improve sample efficiency. Promising results are achieved in simulated
environments, demonstrating robust navigation, pedestrian following, and
obstacle avoidance. A demo video is available at the project website.
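To illustrate what "processing visual information over flexible time intervals" can mean in practice, here is a minimal sketch that accumulates events into a signed count frame over an arbitrary time window. The event tuple layout and the function name `accumulate_events` are assumptions made for illustration; the abstract does not specify the paper's actual event representation.

```python
from typing import List, Tuple

# Hypothetical event layout: (timestamp_seconds, x, y, polarity), polarity in {+1, -1}.
Event = Tuple[float, int, int, int]


def accumulate_events(events: List[Event], t_start: float, t_end: float,
                      width: int, height: int) -> List[List[int]]:
    """Sum event polarities per pixel for events with t_start <= t < t_end."""
    frame = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        if t_start <= t < t_end:
            frame[y][x] += polarity
    return frame


events = [(0.001, 0, 0, 1), (0.004, 1, 0, -1), (0.012, 1, 1, 1), (0.013, 1, 1, 1)]

# The same event stream can be read out over a short or a long window,
# unlike a fixed-rate camera that always delivers one frame per period.
short_window = accumulate_events(events, 0.0, 0.005, width=2, height=2)
long_window = accumulate_events(events, 0.0, 0.020, width=2, height=2)
print(short_window)  # [[1, -1], [0, 0]]
print(long_window)   # [[1, -1], [0, 2]]
```

A controller could pick the window length per step, for example shortening it when the scene changes quickly, which is one way to read the "adaptive inference" claim.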
[98] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization
Mario Barbara,Alaa Maalouf
Main category: cs.CV
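TL;DR: This paper proposes a zero-shot, text-queryable video summarization method that uses video-language models (VidLMs) and large language models (LLMs) to produce user-guided video summaries. It needs no training data, outperforms unsupervised methods, and matches supervised ones.
Details
Motivation: The explosive growth of video data calls for flexible, user-controllable summarization tools, but existing methods either depend on domain-specific data and fail to generalize, or cannot incorporate user intent expressed in natural language.
Contribution: 1. The first zero-shot, text-queryable video summarization method; 2. A pipeline built on VidLMs and LLMs that requires no training data; 3. Results that surpass unsupervised methods and match supervised methods on several benchmarks; 4. A new dataset, VidSum-Reason.
Method: 1. Segment the video into scenes; 2. Generate scene-level descriptions with a VidLM; 3. Use an LLM to score scene importance; 4. Propagate the scores to the frame level using consistency and uniqueness metrics.
Result: The method surpasses unsupervised methods on SumMe and TVSum and is competitive on the QFVS benchmark, all without training data. The proposed VidSum-Reason dataset provides a challenging baseline for future work.
Insight: Pretrained multimodal models, combined with carefully designed prompting and score propagation, can serve as a strong foundation for general-purpose, text-queryable video summarization.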
Abstract: The explosive growth of video data has intensified the need for flexible
user-controllable summarization tools that can operate without domain-specific
training data. Existing methods either rely on datasets, limiting
generalization, or cannot incorporate user intent expressed in natural
language. We introduce Prompts-to-Summaries, the first zero-shot,
text-queryable video summarizer. It converts captions from off-the-shelf
video-language models (VidLMs) into user-guided skims using large language
models (LLMs) as judges, requires no training data at all, and outperforms all
unsupervised methods while matching supervised ones. Our pipeline (i) segments raw
video footage into coherent scenes, (ii) generates rich scene-level
descriptions through a memory-efficient, batch-style VidLM prompting scheme
that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a
judge to assign scene-level importance scores under a carefully crafted prompt,
and finally, (iv) propagates those scores to the short-segment level via two new
metrics: consistency (temporal coherency) and uniqueness (novelty), yielding
fine-grained frame importance. On SumMe and TVSum, our data-free approach
surpasses all prior data-hungry unsupervised methods. It also performs
competitively on the Query-Focused Video Summarization (QFVS) benchmark,
despite using no training data and the competing methods requiring supervised
frame-level importance. To spur further research, we release VidSum-Reason, a
new query-driven dataset featuring long-tailed concepts and multi-step
reasoning; our framework attains robust F1 scores and serves as the first
challenging baseline. Overall, our results demonstrate that pretrained
multimodal models, when orchestrated with principled prompting and score
propagation, already provide a powerful foundation for universal,
text-queryable video summarization.
[99] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing
Hang Zhang,Xiang Chen,Renjiu Hu,Rongguang Wang,Jinwei Zhang,Min Liu,Yaonan Wang,Gaolei Li,Xinxing Cheng,Jinming Duan
Main category: cs.CV
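TL;DR: The paper proposes SmoothProper, a module for unsupervised registration that incorporates a nonparametric smoothing optimization layer to handle deformable image registration under sparse features and large displacements.
Details
Motivation: Sparse features and large displacements are difficult for conventional unsupervised deformable image registration (DIR) methods. Because neural networks predict the deformation field in a single forward pass, the field is left unconstrained, which makes these cases hard to handle.
Contribution: Proposes the SmoothProper module, which uses an optimization layer to enforce smoothness and structural consistency in the deformation field, requires no additional hyperparameter tuning, and markedly improves registration accuracy.
Method: A duality-based nonparametric smoothing layer is inserted into the network's forward pass to propagate signals across spatial locations and smooth the deformation field.
Result: On a retinal vessel dataset, registration error is reduced to 1.88 pixels on 2912x2912 images, making this the first unsupervised approach to effectively handle both sparse features and large displacements.
Insight: Combining an optimization layer with a neural network can compensate for the weak structural consistency of unsupervised DIR and offers a new approach to registration in difficult scenes.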
Abstract: Learning-based deformable image registration (DIR) accelerates alignment by
amortizing traditional optimization via neural networks. Label supervision
further enhances accuracy, enabling efficient and precise nonlinear alignment
of unseen scans. However, images with sparse features amid large smooth
regions, such as retinal vessels, introduce aperture and large-displacement
challenges that unsupervised DIR methods struggle to address. This limitation
occurs because neural networks predict deformation fields in a single forward
pass, leaving fields unconstrained post-training and shifting the
regularization burden entirely to network weights. To address these issues, we
introduce SmoothProper, a plug-and-play neural module enforcing smoothness and
promoting message passing within the network’s forward pass. By integrating a
duality-based optimization layer with tailored interaction terms, SmoothProper
efficiently propagates flow signals across spatial locations, enforces
smoothness, and preserves structural consistency. It is model-agnostic,
seamlessly integrates into existing registration frameworks with minimal
parameter overhead, and eliminates regularizer hyperparameter tuning.
Preliminary results on a retinal vessel dataset exhibiting aperture and
large-displacement challenges demonstrate our method reduces registration error
to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach
to effectively address both challenges. The source code will be available at
https://github.com/tinymilky/SmoothProper.
[100] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders
Hui Yang,Wei Sun,Jian Liu,Jin Zheng,Jian Xiao,Ajmal Mian
Main category: cs.CV
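TL;DR: The paper proposes HOMAE, an occlusion-aware 3D hand-object pose estimation method based on masked autoencoders. Using a target-focused masking strategy and multi-scale feature fusion, it addresses occlusion in hand-object interaction and achieves state-of-the-art results on the DexYCB and HO3Dv2 benchmarks.
Details
Motivation: Existing hand-object pose estimation methods perform poorly under occlusion and lack perception of and reasoning about global structure. The authors aim to improve performance in occluded scenes by introducing masked autoencoders and multi-scale feature fusion.
Contribution: 1) A target-focused masking strategy that guides the model to learn context-aware features; 2) Prediction of a signed distance field (SDF) and a point cloud from multi-scale features to strengthen geometric perception; 3) State-of-the-art performance on the DexYCB and HO3Dv2 datasets.
Method: HOMAE builds on a masked autoencoder. It uses target-focused masking to create structured occlusion and predicts an SDF from multi-scale decoder features. Combining the global context of the SDF with the local geometry of the point cloud improves robustness in occluded regions.
Result: HOMAE achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks, supporting the effectiveness of its occlusion awareness and geometric fusion.
Insight: Combining the strengths of implicit (SDF) and explicit (point cloud) representations handles occlusion better, and multi-scale feature extraction and global reasoning are key to improving pose estimation.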
Abstract: Hand-object pose estimation from monocular RGB images remains a significant
challenge mainly due to the severe occlusions inherent in hand-object
interactions. Existing methods do not sufficiently explore global structural
perception and reasoning, which limits their effectiveness in handling occluded
hand-object interactions. To address this challenge, we propose an
occlusion-aware hand-object pose estimation method based on masked
autoencoders, termed HOMAE. Specifically, we propose a target-focused
masking strategy that imposes structured occlusion on regions of hand-object
interaction, encouraging the model to learn context-aware features and reason
about the occluded structures. We further integrate multi-scale features
extracted from the decoder to predict a signed distance field (SDF), capturing
both global context and fine-grained geometry. To enhance geometric perception,
we combine the implicit SDF with an explicit point cloud derived from the SDF,
leveraging the complementary strengths of both representations. This fusion
enables more robust handling of occluded regions by combining the global
context from the SDF with the precise local geometry provided by the point
cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks
demonstrate that HOMAE achieves state-of-the-art performance in hand-object
pose estimation. We will release our code and model.
[101] VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Huaying Yuan,Zheng Liu,Junjie Zhou,Ji-Rong Wen,Zhicheng Dou
Main category: cs.CV
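TL;DR: The paper proposes VideoDeepResearch, a new agentic framework for long video understanding (LVU) that relies only on a text-only reasoning model and a multimodal toolkit, and substantially outperforms existing MLLM baselines.
Details
Motivation: Current multimodal large language models (MLLMs) struggle with long video understanding (LVU) because of task complexity and context window limits. The paper aims to overcome these limits with an agentic system.
Contribution: 1. The VideoDeepResearch framework, which does not require a multimodal large model with an extended context window; 2. A combination of a text-only reasoning model and a multimodal toolkit for efficient selection and use of video content.
Method: A text-only large reasoning model (LRM) and a multimodal toolkit (such as retrievers and visual perceivers) are used, with an agentic strategy that dynamically selects and processes key video content.
Result: On benchmarks including MLVU, Video-MME, and LVBench, VideoDeepResearch substantially outperforms existing MLLM baselines, with gains of up to 9.6%.
Insight: Through dynamic tool calls and multimodal collaboration, agentic systems can address the complexity and context limits of long video understanding.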
Abstract: Long video understanding (LVU) presents a significant challenge for current
multi-modal large language models (MLLMs) due to the task’s inherent complexity
and context window constraint. It is widely assumed that addressing LVU tasks
requires foundation MLLMs with extended context windows, strong visual
perception capabilities, and proficient domain expertise. In this work, we
challenge this common belief by introducing VideoDeepResearch, a novel agentic
framework for long video understanding. Our approach relies solely on a
text-only large reasoning model (LRM) combined with a modular multi-modal
toolkit, including multimodal retrievers and visual perceivers, all of which
are readily available in practice. For each LVU task, the system formulates a
problem-solving strategy through reasoning, while selectively accessing and
utilizing essential video content via tool use. We conduct extensive
experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench.
Our results demonstrate that VideoDeepResearch achieves substantial
improvements over existing MLLM baselines, surpassing the previous
state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and
LongVideoBench, respectively. These findings highlight the promise of agentic
systems in overcoming key challenges in LVU problems.
[102] Post-Training Quantization for Video Matting
Tianrui Zhu,Houyuan Chen,Ruihao Gong,Michele Magno,Haotong Qin,Kai Zhang
Main category: cs.CV
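TL;DR: This paper proposes PTQ4VM, a post-training quantization (PTQ) framework designed for video matting. With a two-stage strategy, a Global Affine Calibration (GAC) method, and an Optical Flow Assistance (OFA) component, it markedly improves the accuracy and temporal coherence of quantized models while sharply reducing compute, reaching near full-precision performance at 4 bits.
Details
Motivation: Video matting models are hard to deploy on resource-constrained devices, and the accuracy and temporal-coherence shortcomings of existing PTQ methods limit their use in this domain.
Contribution: 1. A two-stage PTQ strategy; 2. A statistically driven Global Affine Calibration (GAC) method; 3. An Optical Flow Assistance (OFA) component.
Method: A PTQ strategy that combines block-reconstruction optimization with global calibration and uses optical flow information to guide quantization.
Result: PTQ4VM achieves the best performance across multiple bit-widths. The 4-bit model approaches full-precision performance with 8x FLOP savings.
Insight: Capturing local dependencies and global statistics, together with temporal information, can markedly improve quantization of video matting models.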
Abstract: Video matting is crucial for applications such as film production and virtual
reality, yet deploying its computationally intensive models on
resource-constrained devices presents challenges. Quantization is a key
technique for model compression and acceleration. As an efficient approach,
Post-Training Quantization (PTQ) is still in its nascent stages for video
matting, facing significant hurdles in maintaining accuracy and temporal
coherence. To address these challenges, this paper proposes a novel and general
PTQ framework specifically designed for video matting models, marking, to the
best of our knowledge, the first systematic attempt in this domain. Our
contributions include: (1) A two-stage PTQ strategy that combines
block-reconstruction-based optimization for fast, stable initial quantization
and local dependency capture, followed by a global calibration of quantization
parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine
Calibration (GAC) method that enables the network to compensate for cumulative
statistical distortions arising from factors such as neglected BN layer
effects, even reducing the error of existing PTQ methods on video matting tasks
by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages
temporal and semantic priors from frames to guide the PTQ process, enhancing
the model’s ability to distinguish moving foregrounds in complex scenes and
ultimately achieving near full-precision performance even under ultra-low-bit
quantization. Comprehensive quantitative and visual results show that
PTQ4VM achieves state-of-the-art accuracy across different bit-widths
compared with existing quantization methods. We highlight that the
4-bit PTQ4VM even achieves performance close to its full-precision counterpart
while enjoying 8x FLOP savings.
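For readers unfamiliar with what bit-width means here, the following is a minimal sketch of generic uniform (min-max, asymmetric) quantization, the basic operation that PTQ methods calibrate. It is not PTQ4VM's block reconstruction, GAC, or OFA; the function name `quantize_dequantize` and the sample weights are assumptions for illustration only.

```python
def quantize_dequantize(values, num_bits):
    """Map floats onto 2**num_bits evenly spaced levels and back (min-max calibration)."""
    levels = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    dequantized = []
    for v in values:
        q = round((v - lo) / scale)      # integer code
        q = max(0, min(levels, q))       # clamp to the representable range
        dequantized.append(lo + q * scale)
    return dequantized, scale


weights = [-1.0, -0.37, 0.02, 0.33, 0.5]  # toy values, not from any real model
deq4, scale4 = quantize_dequantize(weights, 4)
deq8, scale8 = quantize_dequantize(weights, 8)

max_err4 = max(abs(w - d) for w, d in zip(weights, deq4))
max_err8 = max(abs(w - d) for w, d in zip(weights, deq8))
print(f"4-bit: step={scale4:.4f}, max error={max_err4:.4f}")
print(f"8-bit: step={scale8:.4f}, max error={max_err8:.4f}")
```

The rounding error of each value is bounded by half a quantization step, so going from 8 bits to 4 bits makes the step, and hence the worst-case error, 17 times larger (255/15). This is why low-bit PTQ needs the kind of calibration the abstract describes.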
[103] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu,Yue Wu,Meng Chu,Zhifei Ren,Zizheng Huang,Pei Chu,Ruijie Zhang,Yinan He,Qirui Li,Songze Li,Zhenxiang Li,Zhongying Tu,Conghui He,Yu Qiao,Yali Wang,Yi Wang,Limin Wang
Main category: cs.CV
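TL;DR: VRBench is the first benchmark for multi-step reasoning over long narrative videos. It contains 1,010 long videos and 9,468 human-labeled multi-step question-answer pairs, providing a standardized tool for evaluating temporal reasoning and procedural validity in large models.
Details
Motivation: Existing evaluations overlook temporal reasoning and procedural validity and lack systematic testing of multi-step reasoning over long videos. VRBench fills this gap.
Contribution: 1. VRBench, the first multi-step reasoning benchmark for long narrative videos; 2. A human-AI collaborative framework for generating temporally grounded reasoning chains; 3. A multi-phase evaluation pipeline that quantifies both outcomes and reasoning-chain quality.
Method: 1. Multi-stage filtering and expert review to ensure plot coherence; 2. A human-AI collaborative framework that generates seven types of multi-step reasoning questions; 3. A multi-phase evaluation pipeline covering final answers and LLM-guided scoring of reasoning chains.
Result: Extensive evaluation of 12 LLMs and 16 VLMs shows that VRBench effectively separates models by their performance on multi-step reasoning tasks.
Insight: Multi-step reasoning over long videos requires stronger temporal modeling, and the quality of a model's reasoning chain should be assessed along multiple dimensions.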
Abstract: We present VRBench, the first long narrative video benchmark crafted for
evaluating large models’ multi-step reasoning capabilities, addressing
limitations in existing evaluations that overlook temporal reasoning and
procedural validity. It comprises 1,010 long videos (with an average duration
of 1.6 hours), along with 9,468 human-labeled multi-step question-answering
pairs and 30,292 reasoning steps with timestamps. These videos are curated via
a multi-stage filtering process including expert inter-rater reviewing to
prioritize plot coherence. We develop a human-AI collaborative framework that
generates coherent reasoning chains, each requiring multiple temporally
grounded steps, spanning seven types (e.g., event attribution, implicit
inference). VRBench designs a multi-phase evaluation pipeline that assesses
models at both the outcome and process levels. Apart from the MCQs for the
final results, we propose a progress-level LLM-guided scoring metric to
evaluate the quality of the reasoning chain from multiple dimensions
comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on
VRBench, we undertake a thorough analysis and provide valuable insights that
advance the field of multi-step reasoning.
[104] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation
Zhao Zhang,Yutao Cheng,Dexiang Hong,Maoke Yang,Gonglei Shi,Lei Ma,Hui Zhang,Jie Shao,Xinglong Wu
Main category: cs.CV
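TL;DR: CreatiPoster is a framework for generating editable, controllable multi-layer graphic designs. From natural-language instructions or user-provided assets, it produces a high-quality JSON specification and a multi-layer design, and surpasses existing open-source and commercial tools.
Details
Motivation: Current AI tools struggle to incorporate user-provided assets accurately while preserving editability and visual appeal, and commercial systems depend on template libraries, which limits creativity and practicality.
Contribution: Proposes the CreatiPoster framework, which combines a protocol model with a conditional background model to generate editable multi-layer designs, and releases a copyright-free dataset of 100,000 multi-layer designs.
Method: A protocol model generates a JSON specification describing each layer's layout, content, and style. A conditional background model then generates a coherent background based on the foreground layers.
Result: CreatiPoster surpasses existing open-source methods and commercial systems and supports a range of applications, such as canvas editing and multilingual adaptation.
Insight: Separating multi-layer design generation from background synthesis improves editability and visual consistency and advances the democratization of AI-assisted graphic design.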
Abstract: Graphic design plays a crucial role in both commercial and personal contexts,
yet creating high-quality, editable, and aesthetically pleasing graphic
compositions remains a time-consuming and skill-intensive task, especially for
beginners. Current AI tools automate parts of the workflow, but struggle to
accurately incorporate user-supplied assets, maintain editability, and achieve
professional visual appeal. Commercial systems, like Canva Magic Design, rely
on vast template libraries, which are impractical to replicate. In this paper,
we introduce CreatiPoster, a framework that generates editable, multi-layer
compositions from optional natural-language instructions or assets. A protocol
model, an RGBA large multimodal model, first produces a JSON specification
detailing every layer (text or asset) with precise layout, hierarchy, content
and style, plus a concise background prompt. A conditional background model
then synthesizes a coherent background conditioned on these rendered foreground
layers. We construct a benchmark with automated metrics for graphic-design
generation and show that CreatiPoster surpasses leading open-source approaches
and proprietary commercial systems. To catalyze further research, we release a
copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports
diverse applications such as canvas editing, text overlay, responsive resizing,
multilingual adaptation, and animated posters, advancing the democratization of
AI-assisted graphic design. Project homepage:
https://github.com/graphic-design-ai/creatiposter
[105] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement
Guimeng Liu,Milad Abdollahzadeh,Ngai-Man Cheung
Main category: cs.CV
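TL;DR: This paper proposes AIR, a zero-shot generative model adaptation method that uses iterative refinement to address the misalignment between image offsets and text offsets in existing methods. Experiments show that it achieves the best performance across 26 experimental setups.
Details
Motivation: Existing zero-shot generative model adaptation methods assume that image offsets and text offsets are fully aligned in CLIP embedding space, which degrades generated image quality. Through an empirical study, the paper finds that offset misalignment correlates with concept distance and proposes an improved method.
Contribution: 1. An empirical analysis showing that misalignment between text offsets and image offsets in CLIP embedding space correlates with concept distance. 2. AIR, the first method to improve target-domain image quality based on this new insight into offset misalignment.
Method: Proposes Adaptation with Iterative Refinement (AIR), which iteratively refines the target-domain image generation process to address offset misalignment.
Result: Across 26 experimental setups, AIR achieves the best performance in qualitative and quantitative evaluations and in a user study.
Insight: Offset misalignment in CLIP embedding space correlates with concept distance: closer concepts show less misalignment. This finding can guide the optimization of generative models.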
Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained
generator to a target domain using only text guidance and without any samples
from the target domain. Central to recent ZSGM approaches are directional losses,
which use the text guidance in the form of aligning the image offset with the text
offset in the embedding space of a vision-language model like CLIP. This is
similar to the analogical reasoning in NLP where the offset between one pair of
words is used to identify a missing element in another pair by aligning the
offset between these two pairs. However, a major limitation of existing ZSGM
methods is that the learning objective assumes the complete alignment between
image offset and text offset in the CLIP embedding space, resulting in quality
degradation in generated images. Our work makes two main contributions. Inspired by
the offset misalignment studies in NLP, as our first contribution, we perform
an empirical study to analyze the misalignment between text offset and image
offset in CLIP embedding space for various large publicly available datasets.
Our important finding is that offset misalignment in CLIP embedding space is
correlated with concept distance, i.e., close concepts have less offset
misalignment. To address the limitations of the current approaches, as our
second contribution, we propose Adaptation with Iterative Refinement (AIR)
which is the first ZSGM approach to focus on improving target domain image
quality based on our new insight on offset misalignment. Qualitative and
quantitative evaluations and a user study across 26 experimental setups
consistently demonstrate that the proposed AIR approach achieves SOTA
performance. Additional experiments are in Supp.
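The directional loss mentioned in the abstract can be sketched as one minus the cosine similarity between the image offset and the text offset. The sketch below uses toy 2-D vectors in place of real CLIP embeddings; the function names and example values are assumptions for illustration, and the code shows the generic directional loss used by prior ZSGM methods, not AIR's refinement procedure.

```python
import math


def offset(source, target):
    """Element-wise difference target - source between two embeddings."""
    return [t - s for s, t in zip(source, target)]


def cosine(u, v):
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """1 - cos(image offset, text offset); 0 when the offsets are parallel."""
    return 1.0 - cosine(offset(img_src, img_tgt), offset(txt_src, txt_tgt))


# Toy stand-ins for embeddings of source/target text prompts and generated images.
txt_src, txt_tgt = [0.0, 0.0], [1.0, 0.0]
aligned = directional_loss([0.0, 0.0], [2.0, 0.0], txt_src, txt_tgt)
orthogonal = directional_loss([0.0, 0.0], [0.0, 1.0], txt_src, txt_tgt)
print(aligned, orthogonal)  # 0.0 1.0
```

Minimizing this loss forces the image offset to be parallel to the text offset. The paper's observation is that such perfect alignment is not a safe target when the source and target concepts are far apart.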
[106] M4V: Multi-Modal Mamba for Text-to-Video Generation
Jiancheng Huang,Gengwei Zhang,Zequn Jie,Siyu Jiao,Yinlong Qian,Ling Chen,Yunchao Wei,Lin Ma
Main category: cs.CV
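TL;DR: M4V is a multi-modal text-to-video generation framework built on the Mamba architecture. With a multi-modal diffusion Mamba block and a reward learning strategy, it substantially lowers computational cost and improves the quality of generated videos.
Details
Motivation: Transformer-based video generation is limited in practice by its quadratic computational complexity. Mamba is efficient, but its plain design does not directly suit multi-modal and spatiotemporal modeling. M4V aims to address these issues.
Contribution: 1. A multi-modal diffusion Mamba (MM-DiM) block that integrates multi-modal information and supports spatiotemporal modeling. 2. A 45% reduction in FLOPs at 768×1280 resolution. 3. A reward learning strategy that improves visual quality in long-context generation.
Method: 1. Design the MM-DiM block with multi-modal token re-composition. 2. Replace Transformer blocks with Mamba blocks to lower computational complexity. 3. Use reward learning to improve per-frame visual realism.
Result: On text-to-video benchmarks, M4V generates high-quality videos while substantially lowering computational cost.
Insight: 1. With suitable adaptation, the Mamba architecture performs well on multi-modal tasks. 2. Reward learning is an effective way to improve quality in long-sequence generation.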
Abstract: Text-to-video generation has significantly enriched content creation and
holds the potential to evolve into powerful world simulators. However, modeling
the vast spatiotemporal space remains computationally demanding, particularly
when employing Transformers, which incur quadratic complexity in sequence
processing and thus limit practical applications. Recent advancements in
linear-time sequence modeling, particularly the Mamba architecture, offer a
more efficient alternative. Nevertheless, its plain design limits its direct
applicability to multi-modal and spatiotemporal video generation tasks. To
address these challenges, we introduce M4V, a Multi-Modal Mamba framework for
text-to-video generation. Specifically, we propose a multi-modal diffusion
Mamba (MM-DiM) block that enables seamless integration of multi-modal
information and spatiotemporal modeling through a multi-modal token
re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45%
compared to the attention-based alternative when generating videos at
768×1280 resolution. Additionally, to mitigate the visual quality
degradation in long-context autoregressive generation processes, we introduce a
reward learning strategy that further enhances per-frame visual realism.
Extensive experiments on text-to-video benchmarks demonstrate M4V’s ability to
produce high-quality videos while significantly lowering computational costs.
Code and models will be publicly available at
https://huangjch526.github.io/M4V_project.
[107] VINCIE: Unlocking In-context Image Editing from Video
Leigang Qu,Feng Cheng,Ziyan Yang,Qi Zhao,Shanchuan Lin,Yichun Shi,Yicong Li,Wenjie Wang,Tat-Seng Chua,Lu Jiang
Main category: cs.CV
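TL;DR: VINCIE proposes learning in-context image editing directly from videos. Using interleaved multimodal sequence annotation and a block-causal diffusion transformer, it supports multi-task learning and performs well across several tasks.
Details
Motivation: Existing methods depend on task-specific pipelines and expert models. VINCIE explores whether in-context image editing can be learned directly from videos to simplify the process and increase flexibility.
Contribution: Proposes a method for annotating videos as multimodal sequences, designs a block-causal diffusion transformer for multi-task learning, and introduces a new multi-turn image editing benchmark.
Method: A block-causal diffusion transformer learns from multimodal sequences derived from videos and is trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction.
Result: The model performs well on several tasks, including multi-concept composition, story generation, and chain-of-editing applications, and reaches SOTA on multi-turn image editing benchmarks.
Insight: Learning in-context image editing from videos is feasible, and multi-task learning can effectively improve generalization and performance.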
Abstract: In-context image editing aims to modify images based on a contextual sequence
comprising text and previously generated images. Existing methods typically
depend on task-specific pipelines and expert models (e.g., segmentation and
inpainting) to curate training data. In this work, we explore whether an
in-context image editing model can be learned directly from videos. We
introduce a scalable approach to annotate videos as interleaved multimodal
sequences. To effectively learn from this data, we design a block-causal
diffusion transformer trained on three proxy tasks: next-image prediction,
current segmentation prediction, and next-segmentation prediction.
Additionally, we propose a novel multi-turn image editing benchmark to advance
research in this area. Extensive experiments demonstrate that our model
exhibits strong in-context image editing capabilities and achieves
state-of-the-art results on two multi-turn image editing benchmarks. Despite
being trained exclusively on videos, our model also shows promising abilities
in multi-concept composition, story generation, and chain-of-editing
applications.
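As a rough illustration of the term "block-causal", the sketch below builds an attention mask in which tokens attend to every token in their own block and in all earlier blocks, but not to later blocks. This is the common meaning of the term; the abstract does not give VINCIE's exact masking scheme, and the function name and block sizes are assumptions.

```python
def block_causal_mask(block_sizes):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens are grouped into consecutive blocks (for example, one block per text
    segment or image in an interleaved sequence). Attention is bidirectional
    inside a block and causal across blocks.
    """
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


# Three blocks holding 2, 1, and 2 tokens respectively.
mask = block_causal_mask([2, 1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# xxx..
# xxxxx
# xxxxx
```

Compared with a token-level causal mask, the first token of each block can already see the rest of its own block, which suits image tokens that have no natural left-to-right order.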
[108] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Yuxuan Luo,Yuhui Yuan,Junwen Chen,Haonan Cai,Ziyi Yue,Yuwei Yang,Fatima Zohra Daha,Ji Li,Zhouhui Lian
Main category: cs.CV
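TL;DR: The paper introduces knowledge image generation as a new task and releases the MMMG benchmark for evaluating the reasoning ability of image generation models. MMMG contains 4,456 expert-validated knowledge image-prompt pairs spanning multiple disciplines, educational levels, and knowledge formats. Using a unified KG representation and the MMMG-Score metric, it exposes reasoning deficits in current models and provides an open baseline, FLUX-Reason.
Details
Motivation: Knowledge images play an important role in human civilization and learning, but generating them requires multimodal reasoning, and there has been no dedicated task or benchmark for evaluating this ability.
Contribution: 1. The knowledge image generation task; 2. The MMMG benchmark with diverse knowledge image-prompt pairs; 3. The MMMG-Score metric, which combines factual fidelity and visual clarity; 4. An open baseline, FLUX-Reason.
Method: 1. Represent each image's core entities and dependencies with a unified knowledge graph (KG); 2. Assess generated images using graph edit distance and visual clarity; 3. Combine a reasoning LLM with diffusion models to form FLUX-Reason.
Result: An evaluation of 16 SOTA text-to-image models reveals widespread reasoning deficits. GPT-4o reaches an MMMG-Score of only 50.20, and the FLUX-Reason baseline scores 34.45.
Insight: Current models still fall well short in multimodal reasoning, and future work needs to integrate factual knowledge more deeply with visual generation.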
Abstract: In this paper, we introduce knowledge image generation as a new task,
alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation
Benchmark (MMMG) to probe the reasoning capability of image generation models.
Knowledge images have been central to human civilization and to the mechanisms
of human learning, a fact underscored by dual-coding theory and the
picture-superiority effect. Generating such images is challenging, demanding
multimodal reasoning that fuses world knowledge with pixel-level grounding into
clear explanatory visuals. To enable comprehensive evaluation, MMMG offers
4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines,
6 educational levels, and diverse knowledge formats such as charts, diagrams,
and mind maps. To eliminate confounding complexity during evaluation, we adopt
a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a
target image’s core entities and their dependencies. We further introduce
MMMG-Score to evaluate generated knowledge images. This metric combines factual
fidelity, measured by graph-edit distance between KGs, with visual clarity
assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image
generation models expose serious reasoning deficits (low entity fidelity, weak
relations, and clutter), with GPT-4o achieving an MMMG-Score of only 50.20,
underscoring the benchmark’s difficulty. To spur further progress, we release
FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines
a reasoning LLM with diffusion models and is trained on 16,000 curated
knowledge image-prompt pairs.
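To make "graph-edit distance between KGs" concrete, here is a deliberately simplified sketch. It assumes entities are matched by exact label and counts only unit-cost insertions and deletions, in which case the edit distance reduces to the size of the symmetric differences of the entity and relation sets. The function name and example graphs are hypothetical, and how MMMG-Score normalizes this distance or combines it with visual clarity is not stated in the abstract, so that part is not shown.

```python
def kg_edit_distance(ref_entities, ref_relations, gen_entities, gen_relations):
    """Unit-cost insert/delete edit distance between two label-matched graphs.

    Entities are strings; relations are (head, relation, tail) triples.
    """
    entity_edits = len(set(ref_entities) ^ set(gen_entities))
    relation_edits = len(set(ref_relations) ^ set(gen_relations))
    return entity_edits + relation_edits


# Reference KG for a hypothetical "Earth-Moon-Sun" diagram prompt.
ref_entities = {"sun", "earth", "moon"}
ref_relations = {("earth", "orbits", "sun"), ("moon", "orbits", "earth")}

# KG extracted from a generated image that omits the moon.
gen_entities = {"sun", "earth"}
gen_relations = {("earth", "orbits", "sun")}

distance = kg_edit_distance(ref_entities, ref_relations, gen_entities, gen_relations)
print(distance)  # 2: one missing entity plus one missing relation
```

A general graph edit distance also allows relabeling and searches over node matchings, which is far more expensive; the label-matched shortcut above is only reasonable when entity names are canonicalized first.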
[109] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang,Mengzhen Liu,Lichen Li,Ming Lu,Yuan Zhang,Junwen Pan,Qi She,Shanghang Zhang
Main category: cs.CV
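TL;DR: The paper proposes CDPruner, a new visual token pruning method for multimodal large language models (MLLMs) that maximizes conditional diversity, markedly improving performance while lowering compute.
Details
Motivation: In MLLMs, visual tokens typically far outnumber text tokens, which makes inference expensive. Existing pruning methods fall short in one of two ways: attention-based methods retain many duplicate tokens, and similarity-based methods overlook instruction relevance. Both hurt model performance.
Contribution: 1. CDPruner, a new token pruning method that maximizes the conditional diversity of retained tokens;
2. A reformulation of pruning using conditional similarity and a determinantal point process (DPP);
3. SOTA performance on a range of MLLMs with substantially lower compute.
Method: 1. Define an instruction-conditioned similarity between visual tokens;
2. Recast token pruning as a DPP that maximizes conditional diversity;
3. CDPruner is training-free and model-agnostic, so it applies readily to different MLLMs.
Result: On LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78% while retaining 94% of the original accuracy. It reaches SOTA on multiple vision-language benchmarks.
Insight: Maximizing conditional diversity reduces redundant tokens, keeps the selected subset representative of the input image, and stays close to the user's instruction, so performance holds up even at high pruning ratios.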
Abstract: In multimodal large language models (MLLMs), the length of input visual
tokens is often significantly greater than that of their textual counterparts,
leading to a high inference cost. Many works aim to address this issue by
removing redundant visual tokens. However, current approaches either rely on
attention-based pruning, which retains numerous duplicate tokens, or use
similarity-based pruning, overlooking the instruction relevance, consequently
causing suboptimal performance. In this paper, we go beyond attention or
similarity by proposing a novel visual token pruning method named CDPruner,
which maximizes the conditional diversity of retained tokens. We first define
the conditional similarity between visual tokens conditioned on the
instruction, and then reformulate the token pruning problem with determinantal
point process (DPP) to maximize the conditional diversity of the selected
subset. The proposed CDPruner is training-free and model-agnostic, allowing
easy application to various MLLMs. Extensive experiments across diverse MLLMs
show that CDPruner establishes new state-of-the-art on various vision-language
benchmarks. By maximizing conditional diversity through DPP, the selected
subset better represents the input images while closely adhering to user
instructions, thereby preserving strong performance even with high reduction
ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency
by 78%, while maintaining 94% of the original accuracy. Our code is available
at https://github.com/Theia-4869/CDPruner.
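The idea of selecting a diverse yet instruction-relevant subset with a DPP can be sketched with greedy MAP inference over a quality-diversity kernel L[i][j] = r_i * sim(i, j) * r_j, where r_i is a token's relevance to the instruction. This kernel form is the standard DPP decomposition and is an assumption here; the abstract does not spell out CDPruner's exact conditional similarity. All names and toy values are hypothetical, and the brute-force determinant is chosen for clarity, not speed.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det


def greedy_dpp(tokens, relevance, k):
    """Greedily add the token that maximizes det(L restricted to the selected set)."""
    n = len(tokens)
    kernel = [[relevance[i] * cosine(tokens[i], tokens[j]) * relevance[j]
               for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = determinant([[kernel[p][q] for q in idx] for p in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining token is redundant
            break
        selected.append(best)
    return selected


tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # token 1 nearly duplicates token 0
relevance = [0.9, 0.85, 0.5]                     # toy instruction-relevance scores
kept = greedy_dpp(tokens, relevance, k=2)
print(kept)  # [0, 2]
```

Ranking by relevance alone would keep tokens 0 and 1, which are near-duplicates. The determinant penalizes that redundancy, so the less relevant but distinct token 2 is kept instead.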
[110] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos
Weiliang Chen,Wenzhao Zheng,Yu Zheng,Lei Chen,Jie Zhou,Jiwen Lu,Yueqi Duan
Main category: cs.CV
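TL;DR: GenWorld is a large-scale, high-quality real-world simulation dataset for detecting AI-generated videos. The paper also proposes the SpannDetector model, which uses multi-view consistency to improve detection.
Details
Motivation: As video generation advances, AI-generated videos threaten the credibility of real-world information, and reliable detection methods are urgently needed. The lack of high-quality real-world simulation datasets, however, has held back detector development.
Contribution: 1) The GenWorld dataset, focused on real-world simulation scenarios and high-quality AI-generated videos; 2) The finding that most existing methods fail to detect high-quality videos; 3) The SpannDetector model, which uses multi-view consistency to improve detection.
Method: Builds the GenWorld dataset with videos generated from multimodal prompts and proposes SpannDetector, which uses multi-view consistency as a criterion for detecting AI-generated videos.
Result: Experiments show that SpannDetector performs strongly on high-quality generated videos and points to a new direction for explainable detection based on physical plausibility.
Insight: Real-world cues are crucial for detecting AI-generated videos, and multi-view consistency is an effective detection criterion.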
Abstract: The flourishing of video generation technologies has endangered the
credibility of real-world information and intensified the demand for
AI-generated video detectors. Despite some progress, the lack of high-quality
real-world datasets hinders the development of trustworthy detectors. In this
paper, we propose GenWorld, a large-scale, high-quality, and real-world
simulation dataset for AI-generated video detection. GenWorld features the
following characteristics: (1) Real-world Simulation: GenWorld focuses on
videos that replicate real-world scenarios, which have a significant impact due
to their realism and potential influence; (2) High Quality: GenWorld employs
multiple state-of-the-art video generation models to provide realistic and
high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes
videos generated from diverse generators and various prompt modalities (e.g.,
text, image, video), offering the potential to learn more generalizable
forensic features. We analyze existing methods and find they fail to detect
high-quality videos generated by world models (i.e., Cosmos), revealing
potential drawbacks of ignoring real-world clues. To address this, we propose a
simple yet effective model, SpannDetector, to leverage multi-view consistency
as a strong criterion for real-world AI-generated video detection. Experiments
show that our method achieves superior results, highlighting a promising
direction for explainable AI-generated video detection based on physical
plausibility. We believe that GenWorld will advance the field of AI-generated
video detection. Project Page: https://chen-wl20.github.io/GenWorld
[111] Fine-Grained Perturbation Guidance via Attention Head Selection
Donghoon Ahn,Jiwon Kang,Sanghyun Lee,Minjae Kim,Jaewon Min,Wooseok Jang,Saungwu Lee,Sayak Paul,Susung Hong,Seungryong Kim
Main category: cs.CV
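TL;DR: The paper proposes HeadHunter and SoftPAG, which select and perturb attention heads at fine granularity to improve the visual quality and controllability of images generated by diffusion models.
Details
Motivation: Existing attention perturbation methods lack a systematic way to decide where to perturb, especially in DiT architectures, where quality-relevant computation is spread across layers.
Contribution: Proposes the HeadHunter framework and the SoftPAG method, providing the first fine-grained analysis and perturbation of individual attention heads and improving visual quality and style control.
Method: Analyze the functional roles of attention heads, iteratively select heads that match the objective, and use SoftPAG to linearly interpolate the perturbation strength.
Result: The method is validated on Stable Diffusion 3 and FLUX.1, where it improves generation quality and enables style-specific control.
Insight: Attention heads specialize in distinct visual concepts (such as structure and style), which can be exploited for precise control of generation.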
Abstract: Recent guidance methods in diffusion models steer reverse sampling by
perturbing the model to construct an implicit weak model and guide generation
away from it. Among these approaches, attention perturbation has demonstrated
strong empirical performance in unconditional scenarios where classifier-free
guidance is not applicable. However, existing attention perturbation methods
lack principled approaches for determining where perturbations should be
applied, particularly in Diffusion Transformer (DiT) architectures where
quality-relevant computations are distributed across layers. In this paper, we
investigate the granularity of attention perturbations, ranging from the layer
level down to individual attention heads, and discover that specific heads
govern distinct visual concepts such as structure, style, and texture quality.
Building on this insight, we propose “HeadHunter”, a systematic framework for
iteratively selecting attention heads that align with user-centric objectives,
enabling fine-grained control over generation quality and visual attributes. In
addition, we introduce SoftPAG, which linearly interpolates each selected
head’s attention map toward an identity matrix, providing a continuous knob to
tune perturbation strength and suppress artifacts. Our approach not only
mitigates the oversmoothing issues of existing layer-level perturbation but
also enables targeted manipulation of specific visual styles through
compositional head selection. We validate our method on modern large-scale
DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1,
demonstrating superior performance in both general quality enhancement and
style-specific guidance. Our work provides the first head-level analysis of
attention perturbation in diffusion models, uncovering interpretable
specialization within attention layers and enabling practical design of
effective perturbation strategies.
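The SoftPAG interpolation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `soft_pag`, the strength parameter `u`, and the list-of-rows representation are assumptions made for the example.

```python
def soft_pag(attn, u):
    """Linearly interpolate a square attention map toward the identity matrix.

    attn: list of rows; each row holds attention weights that sum to 1.
    u:    hypothetical perturbation strength in [0, 1];
          u = 0 returns attn unchanged, u = 1 returns the identity matrix.
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    n = len(attn)
    return [
        [(1.0 - u) * attn[i][j] + (u if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


attn = [[0.5, 0.5], [0.25, 0.75]]
half = soft_pag(attn, 0.5)  # halfway between attn and the identity matrix
```

Because both the attention map and the identity matrix have rows that sum to 1, every interpolated row also sums to 1, so the perturbed map remains a valid attention distribution for any `u` in the range.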
[112] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model
Junqi You,Chieh Hubert Lin,Weijie Lyu,Zhengbo Zhang,Ming-Hsuan Yang
Main category: cs.CV
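TL;DR: InstaInpaint is a reference-based feed-forward framework that produces a 3D scene inpainting from a 2D inpainting proposal within 0.4 seconds. A custom large reconstruction model (LRM) is trained with a self-supervised masked-finetuning strategy, giving a 1000x speed-up over prior methods while maintaining state-of-the-art performance on two standard benchmarks.
Details
Motivation: Current 3D scene inpainting methods rely on lengthy, computationally intensive optimization, which makes them impractical for real-time or online applications; interactive operations need an efficient alternative.
Contribution: Proposes the InstaInpaint framework for fast 3D scene inpainting; develops a self-supervised masked-finetuning strategy; reports a 1000x speed-up over prior methods while maintaining state-of-the-art performance.
Method: Uses a reference-based feed-forward framework, with a large reconstruction model (LRM) trained via masked finetuning, to generate the 3D inpainting result from a 2D inpainting proposal.
Result: Inpainting completes within 0.4 seconds, a 1000x speed-up; performance is state of the art on two standard benchmarks; the method extends to downstream applications such as object insertion and multi-region inpainting.
Insight: The self-supervised masked-finetuning strategy and large-scale training are key to generalization, textural consistency, and geometric correctness; a fast feed-forward framework broadens the practical use of 3D inpainting.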
Abstract: Recent advances in 3D scene reconstruction enable real-time viewing in
virtual and augmented reality. To support interactive operations for better
immersiveness, such as moving or editing objects, 3D scene inpainting methods
are proposed to repair or complete the altered geometry. However, current
approaches rely on lengthy and computationally intensive optimization, making
them impractical for real-time or online applications. We propose InstaInpaint,
a reference-based feed-forward framework that produces 3D-scene inpainting from
a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised
masked-finetuning strategy to enable training of our custom large
reconstruction model (LRM) on a large-scale dataset. Through extensive
experiments, we analyze and identify several key designs that improve
generalization, textural consistency, and geometric correctness. InstaInpaint
achieves a 1000x speed-up over prior methods while maintaining
state-of-the-art performance across two standard benchmarks. Moreover, we show
that InstaInpaint generalizes well to flexible downstream applications such as
object insertion and multi-region inpainting. More video results are available
at our project page: https://dhmbb2.github.io/InstaInpaint_page/.
cs.MM [Back]
[113] Multimodal Large Language Models: A Survey
Longzhen Han,Awes Mubarak,Almas Baimagambetov,Nikolaos Polatidis,Thar Baker
Main category: cs.MM
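TL;DR: This survey traces the development of multimodal large language models (MLLMs) from text-only generation to diverse output modalities such as images, music, and video. It examines how self-supervised learning, mixture of experts, reinforcement learning from human feedback, and chain-of-thought prompting enable cross-modal capabilities, and summarizes architectural trends, cross-modal synergies, and open challenges.
Details
Motivation: MLLMs have rapidly moved well beyond text generation into images, audio, and other modalities. Understanding their technical foundations and future directions requires a systematic categorization and analysis of existing work.
Contribution: 1. Categorizes multimodal generation into six primary modalities; 2. Analyzes the role of foundational techniques such as self-supervised learning and mixture of experts in cross-modal capabilities; 3. Summarizes architectural trends and cross-modal synergies; 4. Identifies open challenges in evaluation, modularity, and structured reasoning.
Method: Through a literature review and categorization, the survey systematically analyzes the generative modalities, foundational techniques, and architectural trends of MLLMs, focusing on how cross-modal capabilities are realized and how models are optimized.
Result: Summarizes the current state of MLLM development, highlights the potential of cross-modal synergy, and points to directions for future research such as more general-purpose and more interpretable models.
Insight: Cross-modal capability depends on combining foundational techniques, for example self-supervised learning together with reinforcement learning. The remaining challenge is to further improve scalability and structured reasoning.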
Abstract: Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text
generation, now spanning diverse output modalities including images, music,
video, human motion, and 3D objects, by integrating language with other sensory
modalities under unified architectures. This survey categorises six primary
generative modalities and examines how foundational techniques, namely
Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement
Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting,
enable cross-modal capabilities. We analyze key models, architectural trends,
and emergent cross-modal synergies, while highlighting transferable techniques
and unresolved challenges. Architectural innovations like transformers and
diffusion models underpin this convergence, enabling cross-modal transfer and
modular specialization. We highlight emerging patterns of synergy, and identify
open challenges in evaluation, modularity, and structured reasoning. This
survey offers a unified perspective on MLLM development and identifies critical
paths toward more general-purpose, adaptive, and interpretable multimodal
systems.
[114] EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis
Jianwu Fang,Lei-Lei Li,Zhedong Zheng,Hongkai Yu,Jianru Xue,Zhengguo Li,Tat-Seng Chua
Main category: cs.MM
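TL;DR: EQ-TAA proposes diffusion-based accident video synthesis, generating causal video frames to improve accident anticipation while mitigating data bias, and trains without extra annotations.
Details
Motivation: Current traffic accident anticipation methods depend on laborious annotation and are easily dominated by data bias. To address this, the authors synthesize the causal part of accidents through video generation.
Contribution: 1. Proposes an Attentive Video Diffusion (AVD) model that synthesizes accident video clips;
2. Designs the Equivariant TAA (EQ-TAA) framework, which improves robustness through contrastive learning;
3. Trains without extra annotations, improving generalization.
Method: 1. Uses the AVD model to generate causal video frames from text prompts;
2. Designs an equivariant triple loss that contrasts the synthesized accident and accident-free clips;
3. Trains on data from various driving scenes, removing the dependence on annotation.
Result: Experiments show that EQ-TAA performs competitively with state-of-the-art methods and mitigates data bias.
Insight: Synthesizing the causal part of a video can substantially reduce annotation needs while improving the model's ability to anticipate real accident scenes.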
Abstract: Traffic Accident Anticipation (TAA) in traffic scenes is a challenging
problem for achieving zero fatalities in the future. Current approaches
typically treat TAA as a supervised learning task needing the laborious
annotation of accident occurrence duration. However, because traffic scenes
are inherently long-tailed, uncertain, and fast-evolving, the real causal
parts of accidents are difficult to identify and are easily dominated by
data bias, resulting in a background confounding issue. Thus, we propose an
Attentive Video Diffusion (AVD) model that synthesizes additional accident
video clips by generating the causal part in dashcam videos, i.e., from normal
clips to accident clips. AVD aims to generate causal video frames based on
accident or accident-free text prompts while preserving the style and content
of frames for TAA after video generation. This approach can be trained using
datasets collected from various driving scenes without any extra annotations.
Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant
triple loss for an anchor accident-free video clip, along with the generated
pair of contrastive pseudo-normal and pseudo-accident clips. Extensive
experiments evaluating AVD and EQ-TAA show competitive performance compared
to state-of-the-art methods.
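The contrastive setup in the abstract (an accident-free anchor clip paired with generated pseudo-normal and pseudo-accident clips) resembles a triplet objective. The abstract does not give the exact form of the equivariant triple loss, so the sketch below substitutes a standard triplet margin loss on clip embeddings; the function names, the margin value, and the toy embeddings are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (a stand-in, not the paper's loss).

    anchor:   embedding of the accident-free anchor clip
    positive: embedding of the pseudo-normal clip (should stay close)
    negative: embedding of the pseudo-accident clip (should be pushed away)
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
pseudo_normal = [0.0, 1.0]
pseudo_accident = [3.0, 4.0]
loss = triplet_margin_loss(anchor, pseudo_normal, pseudo_accident)
```

The loss is zero once the pseudo-accident embedding is farther from the anchor than the pseudo-normal embedding by at least the margin, which is the case for the toy vectors above.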
[115] HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction
Jie Qin,Wei Yang,Yan Su,Yiran Zhu,Weizhen Li,Yunyue Pan,Chengchang Pan,Honggang Qi
Main category: cs.MM
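TL;DR: The paper proposes a flexible multi-modal input framework based on dynamic bidirectional reconstruction for HER2 expression prediction, markedly improving accuracy and adaptability for both single- and dual-modality inputs.
Details
Motivation: Existing HER2 assessment models usually analyze H&E or IHC images in isolation, although clinical practice relies on interpreting the two together; acquiring both modalities concurrently is hindered by workflow complexity and cost.
Contribution: 1) A dynamic branch selector that activates single-modality reconstruction or dual-modality joint inference depending on input completeness; 2) A bidirectional cross-modal GAN for context-aware feature-space reconstruction of the missing modality; 3) A hybrid training protocol combining adversarial learning and multi-task optimization.
Method: Combines the dynamic branch selector, the bidirectional cross-modal GAN, and the hybrid training protocol to support prediction from flexible single- or dual-modality inputs.
Result: Single-modality H&E prediction accuracy rises from 71.44% to 94.25%, dual-modality accuracy reaches 95.09%, and reliability with IHC-only input is 90.28%.
Insight: The framework's "dual-preferred, single-compatible" design achieves near-bimodal performance without synchronized acquisition, which particularly suits resource-limited healthcare settings.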
Abstract: Current HER2 assessment models for breast cancer predominantly analyze H&E or
IHC images in isolation, despite clinical reliance on their synergistic
interpretation. However, concurrent acquisition of both modalities is often
hindered by workflow complexity and cost constraints. We propose an adaptive
bimodal framework enabling flexible single-/dual-modality HER2 prediction
through three innovations: 1) A dynamic branch selector that activates either
single-modality reconstruction or dual-modality joint inference based on input
completeness; 2) A bidirectional cross-modal GAN performing context-aware
feature-space reconstruction of missing modalities; 3) A hybrid training
protocol integrating adversarial learning and multi-task optimization. This
architecture elevates single-modality H&E prediction accuracy from 71.44% to
94.25% while achieving 95.09% dual-modality accuracy, maintaining 90.28%
reliability with sole IHC inputs. The framework’s “dual-preferred,
single-compatible” design delivers near-bimodal performance without requiring
synchronized acquisition, particularly benefiting resource-limited settings
through IHC infrastructure cost reduction. Experimental validation confirms
22.81%/12.90% accuracy improvements over H&E/IHC baselines respectively, with
cross-modal reconstruction enhancing F1-scores to 0.9609 (HE to IHC) and 0.9251
(IHC to HE). By dynamically routing inputs through reconstruction-enhanced or
native fusion pathways, the system mitigates performance degradation from
missing data while preserving computational efficiency (78.55% parameter
reduction in lightweight variant). This elastic architecture demonstrates
significant potential for democratizing precise HER2 assessment across diverse
healthcare settings.
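The routing performed by the dynamic branch selector can be pictured with a small sketch. The abstract only states that the selector chooses between single-modality reconstruction and dual-modality joint inference based on input completeness; the function name and the branch labels below are hypothetical, and the real selector may operate on features rather than raw inputs.

```python
def select_branch(he_image, ihc_image):
    """Pick a processing branch from which modalities are present.

    Inputs are treated as opaque objects; None marks a missing modality.
    The returned labels are illustrative names, not identifiers from the paper.
    """
    if he_image is not None and ihc_image is not None:
        return "dual_modality_joint_inference"
    if he_image is not None:
        return "reconstruct_ihc_features_from_he"
    if ihc_image is not None:
        return "reconstruct_he_features_from_ihc"
    raise ValueError("at least one of the H&E or IHC inputs is required")


branch = select_branch("he_slide.png", None)
```

In this reading, a missing modality is filled in by the cross-modal reconstruction path before prediction, while complete inputs go straight to joint inference.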
[116] Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space
Kangwei Liu,Junwu Liu,Xiaowei Yi,Jinlin Guo,Yun Cao
Main category: cs.MM
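TL;DR: The paper proposes a diffusion-based method for 3D facial animation that uses a unified representation of multimodal signals (text, audio, emotion labels) and a diffusion model to increase the diversity and controllability of emotional expression.
Details
Motivation: Current audio-driven 3D facial animation methods mostly rely on single-modal control signals and deterministic mappings, which limits the diversity and flexibility of emotional expression.
Contribution: 1) A FLAME-centered alignment strategy that unifies multiple modalities (text, audio, emotion labels) through contrastive learning; 2) An attention-based latent diffusion model that improves motion diversity and temporal coherence.
Method: A multimodal emotion binding strategy aligns the signals through contrastive learning, and an attention-based latent diffusion model generates diverse animations.
Result: Experiments show a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics.
Insight: A unified multimodal representation combined with a diffusion model markedly improves the diversity and controllability of emotional expression, offering a new direction for 3D facial animation.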
Abstract: Audio-driven emotional 3D facial animation encounters two significant
challenges: (1) reliance on single-modal control signals (videos, text, or
emotion labels) without leveraging their complementary strengths for
comprehensive emotion manipulation, and (2) deterministic regression-based
mapping that constrains the stochastic nature of emotional expressions and
non-verbal behaviors, limiting the expressiveness of synthesized animations. To
address these challenges, we present a diffusion-based framework for
controllable expressive 3D facial animation. Our approach introduces two key
innovations: (1) a FLAME-centered multimodal emotion binding strategy that
aligns diverse modalities (text, audio, and emotion labels) through contrastive
learning, enabling flexible emotion control from multiple signal sources, and
(2) an attention-based latent diffusion model with content-aware attention and
emotion-guided layers, which enriches motion diversity while maintaining
temporal coherence and natural facial dynamics. Extensive experiments
demonstrate that our method outperforms existing approaches across most
metrics, achieving a 21.6% improvement in emotion similarity while preserving
physiologically plausible facial dynamics. Project Page:
https://kangweiiliu.github.io/Control_3D_Animation.
[117] Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics
Yi-Chun Chen
Main category: cs.MM
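TL;DR: The paper proposes a hierarchical knowledge graph framework for structured understanding of multimodal media such as comics. It decomposes narrative content into multiple levels and builds integrated knowledge graphs that support symbolic reasoning tasks.
Details
Motivation: Visual narratives such as comics carry complex multimodal information (images and text) and call for a structured way to capture their semantic, spatial, and temporal relationships.
Contribution: Proposes a hierarchical knowledge graph framework that spans macro-level story arcs down to fine-grained event segments, integrates semantic, spatial, and temporal relationships, and supports a range of narrative reasoning tasks.
Method: Decomposes narrative content into multiple levels, builds multimodal graphs at the panel level that link visual elements with textual components, and integrates the graphs across levels to support reasoning.
Result: Evaluated on a manually annotated subset of the Manga109 dataset, the framework shows high precision and recall on tasks such as action retrieval and dialogue tracing.
Insight: Hierarchical knowledge graphs can model the complexity of visual narratives and provide a scalable foundation for narrative-based content analysis, interactive storytelling, and multimodal reasoning.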
Abstract: This paper presents a hierarchical knowledge graph framework for the
structured understanding of visual narratives, focusing on multimodal media
such as comics. The proposed method decomposes narrative content into multiple
levels, from macro-level story arcs to fine-grained event segments. It
represents them through integrated knowledge graphs that capture semantic,
spatial, and temporal relationships. At the panel level, we construct
multimodal graphs that link visual elements such as characters, objects, and
actions with corresponding textual components, including dialogue and captions.
These graphs are integrated across narrative levels to support reasoning over
story structure, character continuity, and event progression.
We apply our approach to a manually annotated subset of the Manga109 dataset
and demonstrate its ability to support symbolic reasoning across diverse
narrative tasks, including action retrieval, dialogue tracing, character
appearance mapping, and panel timeline reconstruction. Evaluation results show
high precision and recall across tasks, validating the coherence and
interpretability of the framework. This work contributes a scalable foundation
for narrative-based content analysis, interactive storytelling, and multimodal
reasoning in visual media.
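A panel-level graph of the kind the abstract describes can be sketched as a set of typed edges with a simple lookup for tasks such as action retrieval. The class, the relation names (`performs`, `says`, `precedes`), and the node identifiers below are all hypothetical; the paper's actual schema is not given in the abstract.

```python
from collections import defaultdict


class PanelGraph:
    """Tiny typed-edge graph linking panel elements (illustrative schema)."""

    def __init__(self):
        # source node -> list of (relation, target) pairs
        self.edges = defaultdict(list)

    def add(self, source, relation, target):
        self.edges[source].append((relation, target))

    def query(self, relation, target=None):
        """Return sorted sources that have the relation (optionally to a target)."""
        return sorted({
            src
            for src, outgoing in self.edges.items()
            for rel, tgt in outgoing
            if rel == relation and (target is None or tgt == target)
        })


g = PanelGraph()
g.add("panel_1:char_A", "performs", "run")
g.add("panel_1:char_A", "says", "Wait!")
g.add("panel_2:char_B", "performs", "run")
g.add("panel_1", "precedes", "panel_2")

runners = g.query("performs", "run")  # action retrieval across panels
```

Edges such as `precedes` connect panels to one another, which is one way the panel-level graphs could be tied into a timeline at a higher narrative level.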
[118] WDMIR: Wavelet-Driven Multimodal Intent Recognition
Weiyin Gong,Kai Zhang,Yanghai Zhang,Qi Liu,Xinjie Sun,Junyu Lu,Linbo Zhu
Main category: cs.MM
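TL;DR: The paper proposes a wavelet-driven multimodal intent recognition framework (WDMIR) that uses frequency-domain analysis to improve semantic extraction from non-verbal information, raising intent recognition accuracy.
Details
Motivation: Existing methods lean heavily on text and overlook the rich semantic content in non-verbal signals such as video and audio, so intent recognition is incomplete.
Contribution: 1. A wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain; 2. A cross-modal interaction mechanism that progressively strengthens multimodal feature integration.
Method: Applies wavelet transforms for frequency-domain analysis of video and audio features, combined with a cross-modal interaction mechanism for progressive feature enhancement from bimodal to trimodal integration.
Result: Achieves state-of-the-art performance on MIntRec, improving accuracy by 1.13%; ablations attribute a 0.41% accuracy gain to the wavelet-driven fusion module.
Insight: Frequency-domain analysis offers a new perspective for multimodal intent recognition; wavelet transforms can capture subtle dynamics in non-verbal information.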
Abstract: Multimodal intent recognition (MIR) seeks to accurately interpret user
intentions by integrating verbal and non-verbal information across video, audio
and text modalities. While existing approaches prioritize text analysis, they
often overlook the rich semantic content embedded in non-verbal cues. This
paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR)
framework that enhances intent understanding through frequency-domain analysis
of non-verbal information. To be more specific, we propose: (1) a
wavelet-driven fusion module that performs synchronized decomposition and
integration of video-audio features in the frequency domain, enabling
fine-grained analysis of temporal dynamics; (2) a cross-modal interaction
mechanism that facilitates progressive feature enhancement from bimodal to
trimodal integration, effectively bridging the semantic gap between verbal and
non-verbal information. Extensive experiments on MIntRec demonstrate that our
approach achieves state-of-the-art performance, surpassing previous methods by
1.13% on accuracy. Ablation studies further verify that the wavelet-driven
fusion module significantly improves the extraction of semantic information
from non-verbal sources, with a 0.41% increase in recognition accuracy when
analyzing subtle emotional cues.
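To make the idea of wavelet decomposition of a feature sequence concrete, the sketch below applies a single-level Haar transform to a 1-D signal. The abstract does not say which wavelet family or decomposition depth WDMIR uses, so the Haar choice and the function names are assumptions for illustration.

```python
import math


def haar_dwt(signal):
    """Single-level Haar transform: returns (approximation, detail) coefficients.

    The approximation holds the low-frequency trend of each adjacent pair,
    the detail holds the high-frequency change within the pair.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


features = [4.0, 2.0, 5.0, 5.0]  # toy per-frame feature values
approx, detail = haar_dwt(features)
restored = haar_idwt(approx, detail)
```

A fusion module in this spirit could decompose the video and audio feature sequences with the same transform, combine matching frequency bands, and invert the transform; how WDMIR actually integrates the bands is not specified in the abstract.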
cs.GR [Back]
[119] Learning-based density-equalizing map
Yanwen Huang,Lok Ming Lui,Gary P. T. Choi
Main category: cs.GR
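TL;DR: The paper proposes a learning-based density-equalizing mapping framework (LDEM) that uses deep neural networks to address the limitations of traditional density-equalizing map methods in accuracy, overlaps, and extension from 2D to 3D.
Details
Motivation: Traditional density-equalizing map methods rely on numerical solvers or handcrafted energy functionals; they can have limited accuracy, produce overlaps in extreme cases, and are hard to extend from 2D to 3D. The paper aims to solve these problems with a learning-based approach.
Contribution: 1. Proposes the deep-learning-based LDEM framework with a new loss function that enforces density uniformity and geometric regularity. 2. Uses a hierarchical approach to predict transformations at both coarse and dense levels. 3. Extends seamlessly from 2D to 3D without changes to the model architecture or loss formulation.
Method: 1. Designs a loss function that combines density uniformity and geometric regularity. 2. Uses a hierarchical neural network to predict transformations at different granularities. 3. Learns the mapping directly with deep learning, avoiding iterative solvers.
Result: LDEM shows better density-equalizing and bijectivity properties than traditional methods on both simple and complex density distributions, and applies directly to 3D domains.
Insight: Deep learning can replace traditional numerical methods for geometric problems, especially when it removes the need for an explicitly designed energy functional, yielding more efficient and robust solutions.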
Abstract: Density-equalizing map (DEM) serves as a powerful technique for creating
shape deformations with the area changes reflecting an underlying density
function. In recent decades, DEM has found widespread applications in fields
such as data visualization, geometry processing, and medical imaging.
Traditional approaches to DEM primarily rely on iterative numerical solvers for
diffusion equations or optimization-based methods that minimize handcrafted
energy functionals. However, these conventional techniques often face several
challenges: they may suffer from limited accuracy, produce overlapping
artifacts in extreme cases, and require substantial algorithmic redesign when
extended from 2D to 3D, due to the derivative-dependent nature of their energy
formulations. In this work, we propose a novel learning-based
density-equalizing mapping framework (LDEM) using deep neural networks.
Specifically, we introduce a loss function that enforces density uniformity and
geometric regularity, and utilize a hierarchical approach to predict the
transformations at both the coarse and dense levels. Our method demonstrates
superior density-equalizing and bijectivity properties compared to prior
methods for a wide range of simple and complex density distributions, and can
be easily applied to surface remeshing with different effects. Also, it
generalizes seamlessly from 2D to 3D domains without structural changes to the
model architecture or loss formulation. Altogether, our work opens up new
possibilities for scalable and robust computation of density-equalizing maps
for practical applications.
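One way to read "density uniformity" is that, after mapping, each mesh triangle should hold the same quantity per unit area. The sketch below measures how far a mapped 2D triangle mesh is from that goal. It is an assumed, simplified stand-in rather than the loss used in the paper, and the function names and toy mesh are hypothetical.

```python
def triangle_area(p, q, r):
    """Area of a 2D triangle from its three vertices."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2.0


def density_uniformity(vertices, triangles, populations):
    """Mean squared relative deviation of per-triangle density from its mean.

    vertices:    mapped vertex positions, one (x, y) tuple per vertex
    triangles:   vertex index triples
    populations: quantity assigned to each triangle (the prescribed density
                 times the original area); triangles must be non-degenerate
    Returns 0.0 when the mapped mesh is perfectly density-equalized.
    """
    densities = [
        pop / triangle_area(*(vertices[i] for i in tri))
        for tri, pop in zip(triangles, populations)
    ]
    mean = sum(densities) / len(densities)
    return sum((d / mean - 1.0) ** 2 for d in densities) / len(densities)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
faces = [(0, 1, 2), (0, 2, 3)]
balanced = density_uniformity(square, faces, [1.0, 1.0])
unbalanced = density_uniformity(square, faces, [1.0, 3.0])
```

A learned map would move the vertices so that this kind of measure shrinks, while a separate regularity term (not sketched here) would discourage folded or badly shaped triangles.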
[120] Edit360: 2D Image Edits to 3D Assets from Any Angle
Junchao Huang,Xinting Hu,Zhuotao Tian,Shaoshuai Shi,Li Jiang
Main category: cs.GR
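TL;DR: Edit360 is a tuning-free framework that extends 2D image edits to multi-view consistent 3D editing. It removes the restricted-viewpoint limitation of existing methods and, using video diffusion models and an Anchor-View Editing Propagation mechanism, enables reconstruction of high-quality 3D content.
Details
Motivation: Existing multi-view 3D editing methods are restricted to predetermined viewing angles and lack consistency, which limits their flexibility in practice.
Contribution: 1) Proposes the Edit360 framework, which supports editing 3D assets from arbitrary viewpoints; 2) Introduces the Anchor-View Editing Propagation mechanism to ensure multi-view consistency; 3) Builds on video diffusion models for high-quality 3D reconstruction.
Method: 1) Builds the framework on video diffusion models; 2) Selects anchor views for 2D editing; 3) Aligns and merges multi-view information in the latent and attention spaces; 4) Generates multi-view sequences from which the 3D asset is reconstructed.
Result: Edit360 produces multi-view consistent 3D edits and supports editing from arbitrary viewpoints.
Insight: Video diffusion models show promise for 3D content generation, and the anchor-view mechanism offers a new approach to multi-view editing.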
Abstract: Recent advances in diffusion models have significantly improved image
generation and editing, but extending these capabilities to 3D assets remains
challenging, especially for fine-grained edits that require multi-view
consistency. Existing methods typically restrict editing to predetermined
viewing angles, severely limiting their flexibility and practical applications.
We introduce Edit360, a tuning-free framework that extends 2D modifications to
multi-view consistent 3D editing. Built upon video diffusion models, Edit360
enables user-specific editing from arbitrary viewpoints while ensuring
structural coherence across all views. The framework selects anchor views for
2D modifications and propagates edits across the entire 360-degree range. To
achieve this, Edit360 introduces a novel Anchor-View Editing Propagation
mechanism, which effectively aligns and merges multi-view information within
the latent and attention spaces of diffusion models. The resulting edited
multi-view sequences facilitate the reconstruction of high-quality 3D assets,
enabling customizable 3D content creation.
cs.LG [Back]
[121] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs
Shangpin Peng,Weinong Wang,Zhuotao Tian,Senqiao Yang,Xing Wu,Haotian Xu,Chengquan Zhang,Takashi Isobe,Baotian Hu,Min Zhang
Main category: cs.LG
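TL;DR: Omni-DPO is a dual-perspective framework for dynamic preference learning that improves DPO by accounting for both the data quality of each preference pair and the model's learning dynamics.
Details
Motivation: Existing DPO methods treat all preference pairs as equally important, ignoring differences in their quality and learning utility, which leads to suboptimal data utilization and performance.
Contribution: Proposes Omni-DPO, a dual-perspective optimization method that combines the inherent quality of preference pairs with the model's learning dynamics.
Method: Adaptively weights samples according to both data quality and the model's current learning state, making more effective use of the training data.
Result: On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats Claude 3 Opus by 6.7 points on Arena-Hard; on mathematical reasoning tasks, Omni-DPO outperforms the baseline methods across all benchmarks.
Insight: Dynamically adjusting the weights of preference pairs makes better use of the data and improves model performance.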
Abstract: Direct Preference Optimization (DPO) has become a cornerstone of
reinforcement learning from human feedback (RLHF) due to its simplicity and
efficiency. However, existing DPO-based approaches typically treat all
preference pairs uniformly, ignoring critical variations in their inherent
quality and learning utility, leading to suboptimal data utilization and
performance. To address this challenge, we propose Omni-DPO, a dual-perspective
optimization framework that jointly accounts for (1) the inherent quality of
each preference pair and (2) the model’s evolving performance on those pairs.
By adaptively weighting samples according to both data quality and the model’s
learning dynamics during training, Omni-DPO enables more effective training
data utilization and achieves better performance. Experimental results on
various models and benchmarks demonstrate the superiority and generalization
capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it
finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant
margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning
tasks, Omni-DPO consistently outperforms the baseline methods across all
benchmarks, providing strong empirical evidence for the effectiveness and
robustness of our approach. Code and models will be available at
https://github.com/pspdada/Omni-DPO.
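The idea of weighting preference pairs non-uniformly can be sketched on top of the standard DPO loss. The per-pair loss below is the usual DPO formulation; the weights are supplied by the caller and are purely hypothetical, since the abstract does not give Omni-DPO's weighting formulas for data quality or learning dynamics.

```python
import math


def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def weighted_dpo_loss(pairs, weights, beta=0.1):
    """Weighted mean of per-pair DPO losses.

    pairs:   list of (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
    weights: hypothetical non-negative per-pair weights (not the paper's formula)
    """
    if len(pairs) != len(weights):
        raise ValueError("pairs and weights must have the same length")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * dpo_pair_loss(*p, beta=beta) for p, w in zip(pairs, weights)) / total


pairs = [
    (-1.0, -1.0, -1.0, -1.0),  # policy matches the reference: margin is 0
    (-1.0, -5.0, -1.0, -1.0),  # policy already prefers the chosen response
]
uniform = weighted_dpo_loss(pairs, [1.0, 1.0])
reweighted = weighted_dpo_loss(pairs, [1.0, 0.2])  # down-weight the easy pair
```

Setting all weights equal recovers plain DPO, which makes the uniform treatment criticized in the abstract a special case of this form.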
[122] Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning
Jikai Jin,Vasilis Syrgkanis,Sham Kakade,Hanlin Zhang
Main category: cs.LG
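TL;DR: The paper proposes a causal representation learning framework that controls for the base model as a confounder to identify the latent capability factors of language models and their causal structure, giving a better basis for evaluating and understanding model performance.
Details
Motivation: Existing evaluation of language models faces complex confounding effects and high computational cost, making it hard to uncover the causal relationships among model capabilities.
Contribution: Proposes a causal representation learning framework that identifies latent capability factors and their causal structure, offering a new scientific perspective on model evaluation.
Method: Models benchmark performance as a linear transformation of latent capability factors, controls for the base model as a common confounder, and identifies the causal structure on a dataset of more than 1500 models.
Result: Identifies a concise three-node linear causal structure with a causal direction from general problem-solving capability, through instruction-following proficiency, to mathematical reasoning ability.
Insight: Differences among base models strongly affect evaluation results; controlling for them helps reveal the true causal relationships among latent capabilities.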
Abstract: Faithful evaluation of language model capabilities is crucial for deriving
actionable insights that can inform model development. However, rigorous causal
evaluations in this domain face significant methodological challenges,
including complex confounding effects and prohibitive computational costs
associated with extensive retraining. To tackle these challenges, we propose a
causal representation learning framework wherein observed benchmark performance
is modeled as a linear transformation of a few latent capability factors.
Crucially, these latent factors are identified as causally interrelated after
appropriately controlling for the base model as a common confounder. Applying
this approach to a comprehensive dataset encompassing over 1500 models
evaluated across six benchmarks from the Open LLM Leaderboard, we identify a
concise three-node linear causal structure that reliably explains the observed
performance variations. Further interpretation of this causal structure
provides substantial scientific insights beyond simple numerical rankings:
specifically, we reveal a clear causal direction starting from general
problem-solving capabilities, advancing through instruction-following
proficiency, and culminating in mathematical reasoning ability. Our results
underscore the essential role of carefully controlling base model variations
during evaluation, a step critical to accurately uncovering the underlying
causal relationships among latent model capabilities.
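The modeling assumption in the abstract, benchmark scores as a linear transformation of a few causally linked latent factors, can be written out as a toy generative model. The chain order follows the abstract (general problem solving, then instruction following, then mathematical reasoning), but every coefficient, the mixing matrix, and the function names are made-up values for illustration; the sketch also omits the base-model confounder.

```python
def latent_chain(e_general, e_instruct, e_math, a=0.8, b=0.5):
    """Three-node linear causal chain over latent capabilities.

    e_*: independent noise terms for each node.
    a, b: hypothetical edge strengths (not estimated values from the paper).
    """
    z_general = e_general
    z_instruct = a * z_general + e_instruct
    z_math = b * z_instruct + e_math
    return [z_general, z_instruct, z_math]


def benchmark_scores(latents, mixing):
    """Observed scores as a linear transformation of the latent factors."""
    return [sum(w * z for w, z in zip(row, latents)) for row in mixing]


# Hypothetical mixing matrix: one row per benchmark, one column per latent factor.
mixing = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.2, 1.0],
]
latents = latent_chain(e_general=1.0, e_instruct=0.0, e_math=0.0)
scores = benchmark_scores(latents, mixing)
```

In this toy version, raising the general factor propagates down the chain and lifts every benchmark, whereas a shock to the math factor alone changes only the benchmarks that load on it.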
[123] Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series
Ching Chang,Jeehyun Hwang,Yidan Shi,Haixin Wang,Wen-Chih Peng,Tien-Fu Chen,Wei Wang
Main category: cs.LG
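TL;DR: The paper introduces the Time-IMM dataset and the IMM-TSF benchmark library for irregular multimodal multivariate time series, narrowing the gap between current research and real-world data.
Details
Motivation: Real-world time series (for example in healthcare, climate modeling, and finance) are often irregular, multimodal, and messy, whereas existing benchmarks mostly assume clean, regularly sampled, unimodal data.
Contribution: Proposes the Time-IMM dataset and the IMM-TSF benchmark library, which support asynchronous multimodal integration and evaluation of time series forecasting.
Method: Time-IMM covers nine types of time series irregularity; IMM-TSF provides a timestamp-to-text fusion module and a multimodality fusion module, supporting both recency-aware averaging and attention-based integration strategies.
Result: Experiments show that explicitly modeling multimodality yields substantial gains in forecasting performance on irregular time series.
Insight: Multimodality and asynchronous fusion strategies are key to improving forecasting on irregular time series.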
Abstract: Time series data in real-world applications such as healthcare, climate
modeling, and finance are often irregular, multimodal, and messy, with varying
sampling rates, asynchronous modalities, and pervasive missingness. However,
existing benchmarks typically assume clean, regularly sampled, unimodal data,
creating a significant gap between research and real-world deployment. We
introduce Time-IMM, a dataset specifically designed to capture cause-driven
irregularity in multimodal multivariate time series. Time-IMM represents nine
distinct types of time series irregularity, categorized into trigger-based,
constraint-based, and artifact-based mechanisms. Complementing the dataset, we
introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal
time series, enabling asynchronous integration and realistic evaluation.
IMM-TSF includes specialized fusion modules, including a timestamp-to-text
fusion module and a multimodality fusion module, which support both
recency-aware averaging and attention-based integration strategies. Empirical
results demonstrate that explicitly modeling multimodality on irregular time
series data leads to substantial gains in forecasting performance. Time-IMM and
IMM-TSF provide a foundation for advancing time series analysis under
real-world conditions. The dataset is publicly available at
https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the
benchmark library can be accessed at
https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
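Recency-aware averaging of asynchronous text observations can be sketched as a time-decayed weighted mean. The abstract names the strategy but not its formula, so the exponential decay, the time constant `tau`, and the function name below are assumptions rather than the IMM-TSF implementation.

```python
import math


def recency_weighted_average(timestamps, embeddings, query_time, tau=1.0):
    """Average past embeddings, weighting recent observations more heavily.

    timestamps: irregular observation times of the text embeddings
    embeddings: one equal-length vector per timestamp
    query_time: time of the numeric sample being fused with text
    tau:        hypothetical decay time constant (larger = slower decay)
    Observations after query_time are ignored to avoid look-ahead.
    """
    past = [(t, e) for t, e in zip(timestamps, embeddings) if t <= query_time]
    if not past:
        raise ValueError("no observations at or before query_time")
    weights = [math.exp(-(query_time - t) / tau) for t, _ in past]
    total = sum(weights)
    dim = len(past[0][1])
    return [
        sum(w * e[k] for w, (_, e) in zip(weights, past)) / total
        for k in range(dim)
    ]


times = [0.0, 10.0, 12.0]
texts = [[0.0], [1.0], [5.0]]
fused = recency_weighted_average(times, texts, query_time=10.0)
```

With `tau=1.0`, the observation at time 0 is ten time constants old and contributes almost nothing, so the fused value sits very close to the most recent embedding.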
[124] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering
Sai Prasanna Teja Reddy Bogireddy,Abrar Majeedi,Viswanatha Reddy Gajjala,Zhuoyan Xu,Siddhant Rai,Vaishnav Potlapalli
Main category: cs.LG
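TL;DR: The paper presents a prompt optimization method based on DSPy's MIPROv2 optimizer to improve evidence retrieval and answer generation for question answering over clinical electronic health records (EHRs).
Details
Motivation: EHR question answering requires precise evidence retrieval and faithful answer generation, but existing methods perform poorly under limited supervision. The paper addresses this through prompt optimization.
Contribution: 1. Decouples the task into sentence-level evidence identification and answer generation; 2. Optimizes prompts automatically with DSPy's MIPROv2 optimizer; 3. Introduces a self-consistency voting scheme that improves evidence recall.
Method: 1. Uses DSPy's MIPROv2 optimizer to jointly tune instructions and few-shot demonstrations; 2. Applies self-consistency voting to improve evidence recall.
Result: On the hidden test set the method attains an overall score of 51.5, placing second and outperforming zero-shot and few-shot prompting by over 20 and 10 points, respectively.
Insight: Data-driven prompt optimization is a cost-effective alternative to model fine-tuning and can improve reliability in high-stakes clinical QA.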
Abstract: Automated question answering (QA) over electronic health records (EHRs) can
bridge critical information gaps for clinicians and patients, yet it demands
both precise evidence retrieval and faithful answer generation under limited
supervision. In this work, we present Neural, the runner-up in the BioNLP 2025
ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method
decouples the task into (1) sentence-level evidence identification and (2)
answer synthesis with explicit citations. For each stage, we automatically
explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning
instructions and few-shot demonstrations on the development set. A
self-consistency voting scheme further improves evidence recall without
sacrificing precision. On the hidden test set, our method attains an overall
score of 51.5, placing second while outperforming standard zero-shot and
few-shot prompting by over 20 and 10 points, respectively. These results
indicate that data-driven prompt optimization is a cost-effective alternative
to model fine-tuning for high-stakes clinical QA, advancing the reliability of
AI assistants in healthcare.
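A self-consistency vote over evidence sentences can be sketched as follows: the evidence-identification prompt is sampled several times, and a sentence is kept if enough runs selected it. The abstract does not specify the voting rule, so the threshold `min_fraction` and the function name are assumptions.

```python
from collections import Counter


def vote_evidence(runs, min_fraction=0.5):
    """Keep sentence IDs selected by at least min_fraction of the runs.

    runs: list of runs, each a list of sentence IDs one sampled output marked
          as evidence (duplicates within a run count once).
    min_fraction: hypothetical agreement threshold in (0, 1].
    """
    counts = Counter(sid for run in runs for sid in set(run))
    needed = min_fraction * len(runs)
    return sorted(sid for sid, c in counts.items() if c >= needed)


runs = [[1, 2], [2, 3], [2, 1, 1]]
evidence = vote_evidence(runs)
```

Lowering the threshold admits sentences that only some runs found, which favors recall; raising it keeps only sentences the runs agree on, which favors precision.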
[125] Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Adam Karvonen,Samuel Marks
Main category: cs.LG
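TL;DR: The paper proposes internal bias mitigation, a strategy that reduces racial and gender bias of large language models (LLMs) in realistic deployment settings.
Details
Motivation: Simple anti-bias prompts can eliminate demographic bias in controlled evaluations but fail once realistic context is introduced. The paper aims to fix this so that LLMs make fair decisions in high-stakes applications such as hiring.
Contribution: Proposes internal bias mitigation, which neutralizes sensitive-attribute directions in model activations and reduces bias to very low levels (typically under 1%) while largely maintaining model performance.
Method: Identifies race- and gender-correlated directions in model activations and neutralizes them by applying affine concept editing at inference time.
Result: Across all tested scenarios, the intervention consistently reduces bias to very low levels (typically under 1%, always below 2.5%).
Insight: Realistic context can induce significant bias in LLMs; internal bias mitigation is an effective, well-generalizing remedy suited to real deployments.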
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes hiring
applications, making decisions that directly impact people’s careers and
livelihoods. While prior studies suggest simple anti-bias prompts can eliminate
demographic biases in controlled evaluations, we find these mitigations fail
when realistic contextual details are introduced. We address these failures
through internal bias mitigation: by identifying and neutralizing sensitive
attribute directions within model activations, we achieve robust bias reduction
across all tested scenarios. Across leading commercial (GPT-4o, Claude 4
Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3,
Mistral-24B), we find that adding realistic context such as company names,
culture descriptions from public careers pages, and selective hiring
constraints (e.g., "only accept candidates in the top 10%") induces
significant racial and gender biases (up to 12% differences in interview
rates). When these biases emerge, they consistently favor Black over White
candidates and female over male candidates across all tested models and
scenarios. Moreover, models can infer demographics and become biased from
subtle cues like college affiliations, with these biases remaining invisible
even when inspecting the model’s chain-of-thought reasoning. To address these
limitations, our internal bias mitigation identifies race and gender-correlated
directions and applies affine concept editing at inference time. Despite using
directions from a simple synthetic dataset, the intervention generalizes
robustly, consistently reducing bias to very low levels (typically under 1%,
always below 2.5%) while largely maintaining model performance. Our findings
suggest that practitioners deploying LLMs for hiring should adopt more
realistic evaluation methodologies and consider internal mitigation strategies
for equitable outcomes.
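Neutralizing an attribute direction in an activation vector can be sketched as replacing the activation's component along that direction with a fixed reference value. This is a simplified reading of affine concept editing under stated assumptions: the function name, the `target_projection` parameter, and the toy vectors are illustrative, and the paper's procedure for finding directions and reference values is not reproduced here.

```python
import math


def affine_edit(activation, direction, target_projection=0.0):
    """Set the activation's component along `direction` to a fixed value.

    activation:        hidden-state vector at some layer
    direction:         attribute-correlated direction (need not be unit length)
    target_projection: hypothetical reference value for the component, e.g. a
                       mean projection measured over a neutral dataset
    Components orthogonal to the direction are left unchanged.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    projection = sum(a * u for a, u in zip(activation, unit))
    return [a + (target_projection - projection) * u for a, u in zip(activation, unit)]


hidden = [3.0, 4.0]
attribute_direction = [2.0, 0.0]
edited = affine_edit(hidden, attribute_direction)
```

After the edit, activations that differ only along the attribute direction become identical, so downstream layers can no longer read the attribute from that direction.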
[126] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models
Evelyn Ma,Duo Zhou,Peizhi Niu,Huiting Zhou,Huan Zhang,Olgica Milenkovic,S. Rasoul Etesami
Main category: cs.LG
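TL;DR: The paper proposes GUARD, a framework that uses data attribution to address unintended forgetting when large language models (LLMs) unlearn specific data, substantially improving retention of valuable information.
Details
Motivation: Unlearning in LLMs is increasingly important for regulatory compliance, copyright protection, and privacy. Existing methods, however, often degrade retention when forgetting high-impact data, so a more effective data-level approach is needed.
Contribution: 1. Proposes a lightweight proxy data attribution metric tailored for LLM unlearning; 2. Designs an adaptive, nonuniform unlearning objective based on attribution scores; 3. Shows through theoretical guarantees and experiments that GUARD unlearns effectively while preserving valuable information.
Method: GUARD has two core parts: 1. A proxy data attribution metric that quantifies the "alignment" between the forget and retain sets; 2. Adaptive unlearning weights assigned to samples in inverse proportion to their proxy attribution scores. Reallocating unlearning power in this way reduces unintended losses on the retain set.
Result: On the TOFU benchmark across multiple LLM architectures, GUARD reduces the utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio when forgetting 10% of the training data.
Insight: Data-level attribution matters for unlearning; adaptive weight assignment can balance forgetting and retention, offering a new approach to LLM compliance and practicality.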
Abstract: Unlearning in large language models (LLMs) is becoming increasingly important
due to regulatory compliance, copyright protection, and privacy concerns.
However, a key challenge in LLM unlearning is unintended forgetting, where the
removal of specific data inadvertently impairs the utility of the model and its
retention of valuable, desired information. While prior work has primarily
focused on architectural innovations, the influence of data-level factors on
unlearning performance remains underexplored. As a result, existing methods
often suffer from degraded retention when forgetting high-impact data. To
address this, we propose GUARD, a novel framework for Guided Unlearning And
Retention via Data attribution. At its core, GUARD introduces a lightweight
proxy data attribution metric tailored for LLM unlearning, which quantifies the
“alignment” between the forget and retain sets while remaining computationally
efficient. Building on this, we design a novel unlearning objective that
assigns adaptive, nonuniform unlearning weights to samples, inversely
proportional to their proxy attribution scores. Through such a reallocation of
unlearning power, GUARD mitigates unintended losses in retention. We provide
rigorous theoretical guarantees that GUARD significantly enhances retention
while maintaining forgetting metrics comparable to prior methods. Extensive
experiments on the TOFU benchmark across multiple LLM architectures demonstrate
that GUARD substantially improves utility preservation while ensuring effective
unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to
194.92% in terms of Truth Ratio when forgetting 10% of the training data.
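The weighting rule in the abstract, unlearning weights inversely proportional to proxy attribution scores, can be sketched as a simple normalization. The abstract does not define the proxy metric or the exact normalization, so the scores are taken as given non-negative numbers, and the function name, the `eps` guard, and the sum-to-one scaling are assumptions.

```python
def unlearning_weights(attribution_scores, eps=1e-8):
    """Per-sample unlearning weights, inversely proportional to attribution.

    attribution_scores: hypothetical non-negative proxy scores for the forget
                        samples (higher = more aligned with the retain set)
    eps:                small constant guarding against division by zero
    Returns weights that sum to 1.
    """
    if any(s < 0 for s in attribution_scores):
        raise ValueError("attribution scores must be non-negative")
    inverse = [1.0 / (s + eps) for s in attribution_scores]
    total = sum(inverse)
    return [v / total for v in inverse]


scores = [1.0, 3.0]
weights = unlearning_weights(scores)
```

Forget samples that are strongly tied to retained knowledge receive smaller weights, so the unlearning objective pushes less on exactly the samples most likely to damage retention.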
[127] Build the web for agents, not agents for the web
Xing Han Lù,Gaurav Kamath,Marius Mosbach,Siva Reddy
Main category: cs.LG
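TL;DR: This position paper argues for a new paradigm in web agent research: develop interaction interfaces optimized for agent capabilities (AWIs) rather than forcing agents to adapt to interfaces designed for humans.
Details
Motivation: Current web agent approaches face substantial challenges because of the mismatch between human-designed interfaces and LLM capabilities. The paper responds with the idea of interfaces designed specifically for agents.
Contribution: Introduces the concept of an Agentic Web Interface (AWI) and six guiding principles for its design.
Method: Presents the AWI design concept and establishes six guiding principles emphasizing safety, efficiency, and standardization.
Result: Envisions more efficient, reliable, and transparent web agent design enabled by AWIs.
Insight: Web agent research needs a collaborative effort to redesign interfaces to match agent capabilities, rather than making agents adapt to existing human interfaces.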
Abstract: Recent advancements in Large Language Models (LLMs) and multimodal
counterparts have spurred significant interest in developing web agents – AI
systems capable of autonomously navigating and completing tasks within web
environments. While holding tremendous promise for automating complex web
interactions, current approaches face substantial challenges due to the
fundamental mismatch between human-designed interfaces and LLM capabilities.
Current methods struggle with the inherent complexity of web inputs, whether
processing massive DOM trees, relying on screenshots augmented with additional
information, or bypassing the user interface entirely through API interactions.
This position paper advocates for a paradigm shift in web agent research:
rather than forcing web agents to adapt to interfaces designed for humans, we
should develop a new interaction paradigm specifically optimized for agentic
capabilities. To this end, we introduce the concept of an Agentic Web Interface
(AWI), an interface specifically designed for agents to navigate a website. We
establish six guiding principles for AWI design, emphasizing safety,
efficiency, and standardization, to account for the interests of all primary
stakeholders. This reframing aims to overcome fundamental limitations of
existing interfaces, paving the way for more efficient, reliable, and
transparent web agent design, which will be a collaborative effort involving
the broader ML community.
[128] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems
Aayush Karan,Kulin Shah,Sitan Chen
Main category: cs.LG
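TL;DR: ReGuidance is a simple diffusion-model wrapper that improves sample quality on hard inverse problems by inverting a candidate solution and re-initializing the sampler from the resulting latent.
Details
Motivation: On hard inverse problems with low signal-to-noise ratio, existing methods tend to drift off the data manifold and produce unrealistic outputs. ReGuidance aims to improve both sample quality and reward consistency with a simple procedure.
Contribution: Proposes the ReGuidance wrapper, which inverts a candidate solution through the probability flow ODE and re-initializes sampling, significantly improving existing techniques on hard inverse problems.
Method: First run the unconditional probability flow ODE in reverse to map the candidate solution back to latent space, then rerun DPS using that latent as the initialization.
Result: On tasks such as large box in-painting and high-factor super-resolution, ReGuidance significantly improves sample quality. Theory shows that on certain multimodal data distributions it simultaneously increases the reward and moves the candidate closer to the data manifold.
Insight: To the authors' knowledge, ReGuidance provides the first rigorous algorithmic guarantee for DPS, and it shows that a simple operation can substantially improve sample quality.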
Abstract: There has been a flurry of activity around using pretrained diffusion models
as informed data priors for solving inverse problems, and more generally around
steering these models using reward models. Training-free methods like diffusion
posterior sampling (DPS) and its many variants have offered flexible heuristic
algorithms for these tasks, but when the reward is not informative enough,
e.g., in hard inverse problems with low signal-to-noise ratio, these techniques
veer off the data manifold, failing to produce realistic outputs. In this work,
we devise a simple wrapper, ReGuidance, for boosting both the sample realism
and reward achieved by these methods. Given a candidate solution $\hat{x}$
produced by an algorithm of the user’s choice, we propose inverting the
solution by running the unconditional probability flow ODE in reverse starting
from $\hat{x}$, and then using the resulting latent as an initialization for
DPS. We evaluate our wrapper on hard inverse problems like large box
in-painting and super-resolution with high upscaling. Whereas state-of-the-art
baselines visibly fail, we find that applying our wrapper on top of these
baselines significantly boosts sample quality and measurement consistency. We
complement these findings with theory proving that on certain multimodal data
distributions, ReGuidance simultaneously boosts the reward and brings the
candidate solution closer to the data manifold. To our knowledge, this
constitutes the first rigorous algorithmic guarantee for DPS.
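The inversion step in the abstract (running the unconditional probability flow ODE from a candidate solution to a latent) can be sketched on a toy problem. Everything here is an assumption made for illustration: a 1-D Gaussian stands in for the data distribution so that the score is available in closed form, the diffusion is a variance-preserving process with constant rate `BETA`, and plain Euler steps are used. The guided DPS stage that would start from the latent is not implemented.

```python
import math

BETA = 8.0             # constant noise rate of the toy VP diffusion (assumption)
T_END = 1.0            # final diffusion time (assumption)
MEAN, STD = 2.0, 0.5   # toy 1-D Gaussian data distribution (assumption)


def alpha(t: float) -> float:
    """Signal scaling of the VP process at time t."""
    return math.exp(-0.5 * BETA * t)


def marginal_var(t: float) -> float:
    """Variance of the noised data distribution at time t."""
    a2 = alpha(t) ** 2
    return a2 * STD ** 2 + 1.0 - a2


def score(x: float, t: float) -> float:
    """Closed-form score of the Gaussian marginal at time t."""
    return -(x - alpha(t) * MEAN) / marginal_var(t)


def pf_ode_drift(x: float, t: float) -> float:
    """Drift of the unconditional probability flow ODE for the VP process."""
    return -0.5 * BETA * (x + score(x, t))


def invert(x_hat: float, steps: int = 2000) -> float:
    """Euler-integrate the ODE from t=0 to t=T_END (candidate -> latent)."""
    dt = T_END / steps
    x = x_hat
    for k in range(steps):
        x += dt * pf_ode_drift(x, k * dt)
    return x


def decode(z: float, steps: int = 2000) -> float:
    """Euler-integrate the same ODE from t=T_END back to t=0 (latent -> sample)."""
    dt = T_END / steps
    x = z
    for k in range(steps, 0, -1):
        x -= dt * pf_ode_drift(x, k * dt)
    return x
```

In this toy setting, `decode(invert(x_hat))` returns approximately `x_hat`, which only confirms that the latent encodes the candidate. In the method described above, that latent would instead initialize a guided sampler such as DPS.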
eess.SY
[129] Energy Aware Camera Location Search Algorithm for Increasing Precision of Observation in Automated Manufacturing
Rongfei Li,Francis Assadian
Main category: eess.SY
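TL;DR: The paper proposes a camera location search algorithm for visual servoing tasks in automated manufacturing, which improves observation precision by moving the camera to locations where image noise is minimized.
Details
Motivation: Existing work focuses mostly on control and observation architectures and rarely discusses how camera location affects observation quality. In manufacturing environments, camera location significantly affects the image noise level and therefore observation precision.
Contribution: An energy-aware camera movement policy that efficiently searches for an optimal or suboptimal observation location and maximizes observation precision under a limited energy budget, without discarding high-frequency information.
Method: The algorithm learns the environment through an adaptive exploration strategy and, combined with image averaging, dynamically adjusts the camera movement policy to minimize image noise.
Result: Simulations show that the algorithm improves observation precision under limited energy.
Insight: Dynamically optimizing camera location matters for the observation precision of visual servoing systems, especially in manufacturing environments where noise varies from one location to another.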
Abstract: Visual servoing technology has been well developed and applied in many
automated manufacturing tasks, especially in tools’ pose alignment. To access a
full global view of tools, most applications adopt eye-to-hand configuration or
eye-to-hand/eye-in-hand cooperation configuration in an automated manufacturing
environment. Most research papers mainly put efforts into developing control
and observation architectures in various scenarios, but few of them have
discussed the importance of the camera’s location in eye-to-hand configuration.
In a manufacturing environment, the quality of camera estimations may vary
significantly from one observation location to another, as the combined effects
of environmental conditions result in different noise levels of a single image
shot at different locations. In this paper, we propose an algorithm for the
camera’s moving policy so that it explores the camera workspace and searches
for the optimal location where the images’ noise level is minimized. Also, this
algorithm ensures the camera ends up at a suboptimal (if the optimal one is
unreachable) location among the locations already searched, with limited energy
available for moving the camera. Unlike a simple brute force approach, the
algorithm enables the camera to explore space more efficiently by adapting the
search policy as it learns the environment. With the aid of an image averaging
technique, this algorithm, using a single camera, achieves the desired
observation accuracy in eye-to-hand configurations without filtering
out high-frequency information in the original image. An automated
manufacturing application has been simulated, and the results show that the
algorithm improves observation precision with limited energy.
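The image averaging technique mentioned in the abstract can be illustrated with a small simulation. It is a sketch under stated assumptions, not the paper's pipeline: images are flat lists of pixel values, noise is zero-mean Gaussian and independent across shots, and all function names are hypothetical. Averaging N shots of a static scene reduces the noise standard deviation by roughly a factor of sqrt(N) without any spatial low-pass filtering.

```python
import math
import random
from typing import List, Sequence


def noisy_capture(true_pixels: Sequence[float], noise_std: float,
                  rng: random.Random) -> List[float]:
    """Simulate one image shot: the true scene plus independent Gaussian noise."""
    return [p + rng.gauss(0.0, noise_std) for p in true_pixels]


def average_images(images: Sequence[Sequence[float]]) -> List[float]:
    """Pixel-wise mean over several shots of the same static scene."""
    n = len(images)
    return [sum(pixel_stack) / n for pixel_stack in zip(*images)]


def rms_error(estimate: Sequence[float], truth: Sequence[float]) -> float:
    """Root-mean-square difference between an estimate and the true scene."""
    squared = [(e - t) ** 2 for e, t in zip(estimate, truth)]
    return math.sqrt(sum(squared) / len(squared))
```

Because the average is taken across shots rather than across neighboring pixels, sharp edges in the scene are preserved, which matches the abstract's point about not filtering out high-frequency information.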
[130] Semi-Tensor-Product Based Convolutional Neural Networks
Daizhan Cheng
Main category: eess.SY
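TL;DR: The paper proposes a convolutional neural network (CNN) based on the semi-tensor product (STP). Its vector convolutional product (CP) avoids the junk information introduced by padding, and its application to image and third-order signal identification is considered.
Details
Motivation: In conventional convolution, padding can introduce useless information that hurts model performance. The paper addresses this by combining STP with a domain-based CP.
Contribution: A new STP-based convolutional product (CP) that needs no padding, which reduces interference from junk information, and an STP-CNN built on it.
Method: Combines the semi-tensor product (STP) with a domain-based convolutional product (CP) to avoid the padding used in conventional convolution, and designs STP-CNN on top of it.
Result: The proposed STP-CNN is applied to image and third-order signal identification tasks.
Insight: Generalizing vector operations through STP handles vectors of different dimensions and also avoids the negative effects of padding in convolution.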
Abstract: The semi-tensor product (STP) of vectors is a generalization of the conventional
inner product of vectors, which allows the factor vectors to be of different
dimensions. This paper proposes a domain-based convolutional product (CP).
Combining domain-based CP with STP of vectors, a new CP is proposed. Since
there is no zero or any other padding, it can avoid the junk information caused
by padding. Using it, the STP-based convolutional neural network (CNN) is
developed. Its application to image and third-order signal identification is
considered.
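One way to see how an inner product can accept vectors of different dimensions is the least-common-multiple construction used in the dimension-free matrix literature: each vector is expanded by a Kronecker product with an all-ones vector until both have length lcm(m, n). The sketch below is an assumption about the general idea, not the paper's definition: the function names are hypothetical, and any normalization factor the paper may use is omitted.

```python
from math import gcd
from typing import List, Sequence


def expand(v: Sequence[float], t: int) -> List[float]:
    """Kronecker product of v with an all-ones vector of length t / len(v).

    Each entry of v is repeated t / len(v) times, giving a vector of length t.
    """
    reps = t // len(v)
    return [x for x in v for _ in range(reps)]


def cross_dim_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product of vectors whose dimensions may differ (illustrative).

    Both vectors are expanded to the least common multiple of their lengths,
    then the ordinary inner product is taken.
    """
    t = len(x) * len(y) // gcd(len(x), len(y))
    return sum(a * b for a, b in zip(expand(x, t), expand(y, t)))
```

When both vectors have the same length, no expansion happens and the function reduces to the conventional inner product, which is the sense in which the construction is a generalization.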
cs.AI
[131] One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence
Michelle M. Li,Ben Y. Reis,Adam Rodman,Tianxi Cai,Noa Dagan,Ran D. Balicer,Joseph Loscalzo,Isaac S. Kohane,Marinka Zitnik
Main category: cs.AI
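TL;DR: This Perspective argues that medical AI must adapt dynamically as clinical context changes, in order to avoid predictions that fail because of contextual errors.
Details
Motivation: Current medical foundation models (language models on clinical text, vision-language models on medical images, and multimodal models) need fine-tuning, careful prompting, or retrieval from knowledge bases to handle new settings. This can be impractical and limits their ability to adapt dynamically to clinical situations.
Contribution: A vision for context-switching medical AI in which models adapt dynamically to different specialties, populations, and clinical roles without retraining.
Method: No specific method is given. The paper emphasizes the importance of dynamic context adaptation, which might be achieved through multimodal reasoning or stronger contextual awareness in models.
Result: The envisioned goal is broader access to medical care, with AI that can diagnose, manage, and treat diseases across specialties and regions.
Insight: Future medical AI needs to solve dynamic context adaptation to improve generalization and reliability in clinical use.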
Abstract: Medical foundation models, including language models trained on clinical
notes, vision-language models on medical images, and multimodal models on
electronic health records, can summarize clinical notes, answer medical
questions, and assist in decision-making. Adapting these models to new
populations, specialties, or settings typically requires fine-tuning, careful
prompting, or retrieval from knowledge bases. This can be impractical, and
limits their ability to interpret unfamiliar inputs and adjust to clinical
situations not represented during training. As a result, models are prone to
contextual errors, where predictions appear reasonable but fail to account for
critical patient-specific or contextual information. These errors stem from a
fundamental limitation that current models struggle with: dynamically adjusting
their behavior across evolving contexts of medical care. In this Perspective,
we outline a vision for context-switching in medical AI: models that
dynamically adapt their reasoning without retraining to new specialties,
populations, workflows, and clinical roles. We envision context-switching AI to
diagnose, manage, and treat a wide range of diseases across specialties and
regions, and expand access to medical care.
[132] Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Yuhao Zhou,Yiheng Wang,Xuming He,Ruoyao Xiao,Zhiwei Li,Qiantai Feng,Zijie Guo,Yuejin Yang,Hao Wu,Wenxuan Huang,Jiaqi Wei,Dan Si,Xiuqi Yao,Jia Bu,Haiwen Huang,Tianfan Fu,Shixiang Tang,Ben Fei,Dongzhan Zhou,Fenghua Ling,Yan Lu,Siqi Sun,Chenhui Li,Guanjie Zheng,Jiancheng Lv,Wenlong Zhang,Lei Bai
Main category: cs.AI
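TL;DR: The paper presents the Scientists’ First Exam (SFE) benchmark, which evaluates the perception, understanding, and reasoning abilities of multimodal large language models (MLLMs) in scientific domains. Experiments show that current state-of-the-art models still have substantial room for improvement on these tasks.
Details
Motivation: Existing scientific benchmarks mainly evaluate knowledge understanding in MLLMs and neglect perception and reasoning, so a new benchmark is needed to assess scientific cognitive abilities comprehensively.
Contribution: The SFE benchmark, comprising 830 expert-verified VQA pairs that span 66 multimodal tasks across five high-value disciplines, filling a gap in the evaluation of scientific cognitive abilities.
Method: SFE evaluates models at three levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning, using multimodal tasks and expert-verified questions.
Result: The current state-of-the-art GPT-o3 and InternVL-3 score only 34.08% and 26.52% on SFE, indicating large room for improvement for MLLMs in scientific domains.
Insight: SFE exposes the weaknesses of MLLMs in scientific cognition and points to directions for further development and use of AI in scientific discovery.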
Abstract: Scientific discoveries increasingly rely on complex multimodal reasoning
based on information-intensive scientific data and domain-specific expertise.
Empowered by expert-level scientific benchmarks, scientific Multimodal Large
Language Models (MLLMs) hold the potential to significantly enhance this
discovery process in realistic workflows. However, current scientific
benchmarks mostly focus on evaluating the knowledge understanding capabilities
of MLLMs, leading to an inadequate assessment of their perception and reasoning
abilities. To address this gap, we present the Scientists’ First Exam (SFE)
benchmark, designed to evaluate the scientific cognitive capacities of MLLMs
through three interconnected levels: scientific signal perception, scientific
attribute understanding, and scientific comparative reasoning. Specifically, SFE
comprises 830 expert-verified VQA pairs across three question types, spanning
66 multimodal tasks across five high-value disciplines. Extensive experiments
reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08%
and 26.52% on SFE, highlighting significant room for MLLMs to improve in
scientific realms. We hope the insights obtained in SFE will facilitate further
developments in AI-enhanced scientific discoveries.
[133] TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving
Vincenzo Colle,Mohamed Sana,Nicola Piovesan,Antonio De Domenico,Fadhel Ayed,Merouane Debbah
Main category: cs.AI
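TL;DR: The paper introduces TeleMath, the first benchmark dataset designed to evaluate LLMs on mathematical problem solving in the telecommunications domain, with 500 question-answer pairs covering a wide range of telecom topics. Experiments show that LLMs designed for mathematical or logical reasoning perform best, while general-purpose models often struggle even when they have many parameters.
Details
Motivation: LLMs have improved at general mathematical reasoning, but their ability to solve mathematical problems in specialized domains such as telecommunications remains largely unexplored. TeleMath is proposed to fill this gap.
Contribution: 1. TeleMath, the first benchmark dataset for mathematical problem solving in the telecom domain; 2. A pipeline for generating the questions; 3. An evaluation of a range of LLMs that reveals the advantage of reasoning-specialized models.
Method: Generates 500 QnA pairs from seed problems crafted by subject matter experts, covering many telecom topics, and evaluates a wide range of open-source LLMs on TeleMath, with attention to models specialized for mathematical or logical reasoning.
Result: LLMs designed for mathematical or logical reasoning perform best on TeleMath, while general-purpose models often struggle even with many parameters. The dataset and evaluation code have been released.
Insight: For domain-specific mathematical problem solving, general-purpose LLMs may fall behind specialized models, which shows the importance of domain adaptation.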
Abstract: The increasing adoption of artificial intelligence in telecommunications has
raised interest in the capability of Large Language Models (LLMs) to address
domain-specific, mathematically intensive tasks. Although recent advancements
have improved the performance of LLMs in general mathematical reasoning, their
effectiveness within specialized domains, such as signal processing, network
optimization, and performance analysis, remains largely unexplored. To address
this gap, we introduce TeleMath, the first benchmark dataset specifically
designed to evaluate LLM performance in solving mathematical problems with
numerical solutions in the telecommunications domain. Comprising 500
question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the
telecommunications field. This paper outlines the proposed QnA generation
pipeline, starting from a selected seed of problems crafted by Subject Matter
Experts. The evaluation of a wide range of open-source LLMs reveals that the best
performance on TeleMath is achieved by recent models explicitly designed for
mathematical or logical reasoning. In contrast, general-purpose models, even
those with a large number of parameters, often struggle with these challenges.
We have released the dataset and the evaluation code to ease result
reproducibility and support future research.
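Since TeleMath targets problems with numerical solutions, scoring a model amounts to comparing predicted numbers with reference values. The following is a minimal sketch of such a check, not the released evaluation code: the function names and the 1% relative tolerance are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def is_correct(predicted: float, reference: float,
               rel_tol: float = 1e-2) -> bool:
    """Treat a prediction as correct if it is within rel_tol of the reference.

    The small abs_tol keeps the comparison meaningful when the reference is 0.
    """
    return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=1e-9)


def accuracy(predictions: Sequence[float], references: Sequence[float],
             rel_tol: float = 1e-2) -> float:
    """Fraction of numerical answers that match their reference values."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    hits = sum(is_correct(p, r, rel_tol)
               for p, r in zip(predictions, references))
    return hits / len(references)
```

A tolerance-based comparison avoids penalizing rounding differences, at the cost of having to choose a tolerance that suits the units and magnitudes in the benchmark.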
[134] Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
Fei Lin,Ziyang Gong,Cong Wang,Yonglin Tian,Tengchao Zhang,Xue Yang,Gen Luo,Fei-Yue Wang
Main category: cs.AI
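TL;DR: The paper introduces ToxiMol, the first benchmark task for general-purpose multimodal large language models (MLLMs) on molecular toxicity repair, together with the ToxiEval evaluation framework. It systematically evaluates nearly 30 mainstream MLLMs and finds promising capabilities in toxicity understanding and molecule editing.
Details
Motivation: Toxicity is a leading cause of failure in early-stage drug development, yet the task of molecular toxicity repair has lacked a systematic definition and benchmark.
Contribution: 1. ToxiMol, the first benchmark task for molecular toxicity repair; 2. ToxiEval, an automated evaluation framework; 3. A systematic evaluation of nearly 30 MLLMs.
Method: Builds a standardized dataset, designs a prompt annotation pipeline informed by expert toxicological knowledge, and proposes an evaluation chain that integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity.
Result: Current MLLMs still face significant challenges on this task, but they show promise in toxicity understanding, semantic constraint adherence, and molecule editing.
Insight: MLLM capability for molecular toxicity repair is still immature, but the initial results suggest potential for further development in this area.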
Abstract: Toxicity remains a leading cause of early-stage drug development failure.
Despite advances in molecular design and property prediction, the task of
molecular toxicity repair - generating structurally valid molecular
alternatives with reduced toxicity - has not yet been systematically defined or
benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task
for general-purpose Multimodal Large Language Models (MLLMs) focused on
molecular toxicity repair. We construct a standardized dataset covering 11
primary tasks and 560 representative toxic molecules spanning diverse
mechanisms and granularities. We design a prompt annotation pipeline with
mechanism-aware and task-adaptive capabilities, informed by expert
toxicological knowledge. In parallel, we propose an automated evaluation
framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic
accessibility, drug-likeness, and structural similarity into a high-throughput
evaluation chain for repair success. We systematically assess nearly 30
mainstream general-purpose MLLMs and design multiple ablation studies to
analyze key factors such as evaluation criteria, candidate diversity, and
failure attribution. Experimental results show that although current MLLMs
still face significant challenges on this task, they begin to demonstrate
promising capabilities in toxicity understanding, semantic constraint
adherence, and structure-aware molecule editing.
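The evaluation chain described in the abstract combines several criteria into one repair-success decision. A minimal sketch of that kind of conjunction is shown below. It is not the ToxiEval implementation: the concrete metrics assumed for each criterion (a toxicity probability, an SA score on a 1 to 10 scale, a QED value, a Tanimoto similarity), the class and function names, and every threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateScores:
    """Precomputed scores for one repaired molecule (all fields assumed)."""
    toxicity_prob: float            # predicted probability of toxicity, 0..1
    synthetic_accessibility: float  # SA score, 1 (easy) .. 10 (hard)
    drug_likeness: float            # QED, 0..1
    similarity: float               # Tanimoto similarity to the original, 0..1


def repair_succeeds(c: CandidateScores,
                    max_toxicity: float = 0.5,
                    max_sa: float = 6.0,
                    min_qed: float = 0.5,
                    min_similarity: float = 0.4) -> bool:
    """A candidate passes only if it satisfies every criterion at once.

    All thresholds are illustrative placeholders, not values from the paper.
    """
    return (c.toxicity_prob <= max_toxicity
            and c.synthetic_accessibility <= max_sa
            and c.drug_likeness >= min_qed
            and c.similarity >= min_similarity)
```

Requiring all criteria together reflects the task definition: a repaired molecule is useful only if it is less toxic while remaining synthesizable, drug-like, and structurally close to the original.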
cs.MA
[135] AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
Haoyuan Shi,Yunxin Li,Xinyu Chen,Longyue Wang,Baotian Hu,Min Zhang
Main category: cs.MA
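TL;DR: AniMaker is a multi-agent framework that uses Monte Carlo Tree Search (MCTS)-driven clip generation and storytelling-aware clip selection to produce globally consistent, story-coherent animation from text input.
Details
Motivation: Current video generation methods struggle to produce coherent storytelling videos that span multiple scenes and characters. Existing methods often rigidly convert pre-generated keyframes into fixed-length clips, which causes disjointed narratives and pacing issues, and the instability of video generation models means a single low-quality clip can degrade the whole animation.
Contribution: 1. The AniMaker framework, with four agents for directing, photography, review, and post-production; 2. MCTS-Gen for optimizing clip generation; 3. AniEval, a framework for evaluating multi-shot animation.
Method: 1. Multi-agent collaboration (Director Agent, Photography Agent, and others); 2. MCTS-Gen in the Photography Agent to generate high-quality candidate clips efficiently; 3. AniEval in the Reviewer Agent to assess story-level consistency, action completion, and related aspects.
Result: Experiments show that AniMaker performs well on metrics including VBench and AniEval and significantly improves the efficiency of multi-candidate clip generation.
Insight: With agent specialization and MCTS-based optimization, AniMaker shows the importance of global consistency and story coherence in text-to-video generation.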
Abstract: Despite rapid advancements in video generation models, generating coherent
storytelling videos that span multiple scenes and characters remains
challenging. Current methods often rigidly convert pre-generated keyframes into
fixed-length clips, resulting in disjointed narratives and pacing issues.
Furthermore, the inherent instability of video generation models means that
even a single low-quality clip can significantly degrade the entire output
animation’s logical coherence and visual continuity. To overcome these
obstacles, we introduce AniMaker, a multi-agent framework enabling efficient
multi-candidate clip generation and storytelling-aware clip selection, thus
creating globally consistent and story-coherent animation solely from text
input. The framework is structured around specialized agents, including the
Director Agent for storyboard generation, the Photography Agent for video clip
generation, the Reviewer Agent for evaluation, and the Post-Production Agent
for editing and voiceover. Central to AniMaker’s approach are two key technical
components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search
(MCTS)-inspired strategy that intelligently navigates the candidate space to
generate high-potential clips while optimizing resource usage; and AniEval in
Reviewer Agent, the first framework specifically designed for multi-shot
animation evaluation, which assesses critical aspects such as story-level
consistency, action completion, and animation-specific features by considering
each clip in the context of its preceding and succeeding clips. Experiments
demonstrate that AniMaker achieves superior quality as measured by popular
metrics including VBench and our proposed AniEval framework, while
significantly improving the efficiency of multi-candidate generation, pushing
AI-generated storytelling animation closer to production standards.
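MCTS-style strategies balance exploiting candidates that already scored well against exploring candidates that have been tried less. The standard UCB1 selection rule, a common ingredient of MCTS, is sketched below to illustrate that trade-off. It is not the MCTS-Gen algorithm from the paper: the data layout, the function names, and the exploration constant are assumptions.

```python
import math
from typing import Dict, Tuple


def ucb1(total_reward: float, visits: int, parent_visits: int,
         c: float = 1.4) -> float:
    """UCB1 score: mean reward plus an exploration bonus.

    Unvisited candidates get an infinite score so each is tried at least once.
    """
    if visits == 0:
        return math.inf
    mean = total_reward / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)


def select_clip(stats: Dict[str, Tuple[float, int]], c: float = 1.4) -> str:
    """Pick the candidate clip id with the highest UCB1 score.

    stats maps a clip id to (total_reward, visits); both are hypothetical
    bookkeeping values, for example accumulated reviewer scores.
    """
    parent_visits = max(1, sum(visits for _, visits in stats.values()))
    return max(stats,
               key=lambda cid: ucb1(stats[cid][0], stats[cid][1],
                                    parent_visits, c))
```

With equal visit counts the rule prefers the higher mean reward, while a rarely visited candidate can still win through its exploration bonus, which is how a search budget gets spent on high-potential clips without ignoring alternatives.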
eess.IV
[136] Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective
Minye Shao,Zeyu Wang,Haoran Duan,Yawen Huang,Bing Zhai,Shizheng Wang,Yang Long,Yefeng Zheng
Main category: eess.IV
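TL;DR: The paper rethinks brain tumor segmentation from a frequency-domain perspective and proposes HFF-Net, which combines frequency-domain decomposition with adaptive Laplacian convolution and substantially improves segmentation of contrast-enhancing regions.
Details
Motivation: Current methods degrade when segmenting enhancing brain tumor regions, largely because they do not sufficiently account for MRI-specific features such as complex textures and directional variations. A more comprehensive way to capture these features is needed.
Contribution: 1. The HFF-Net network; 2. A Frequency Domain Decomposition (FDD) module and an Adaptive Laplacian Convolution (ALC) module; 3. A Frequency Domain Cross-Attention (FDCA) module for multi-scale feature fusion.
Method: 1. The FDD module decomposes MRI images into low- and high-frequency components; 2. The ALC module adaptively emphasizes high-frequency details; 3. The FDCA module fuses semantic, positional, and slice-specific information.
Result: On four public datasets, HFF-Net achieves an average relative improvement of 4.48% in mean Dice score across the three major subregions and an average relative improvement of 7.33% in segmentation of contrast-enhancing tumor regions.
Insight: A frequency-domain view characterizes tumor regions more completely and is especially helpful for complex textures and boundaries.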
Abstract: Precise segmentation of brain tumors, particularly contrast-enhancing regions
visible in post-contrast MRI (areas highlighted by contrast agent injection),
is crucial for accurate clinical diagnosis and treatment planning but remains
challenging. However, current methods exhibit notable performance degradation
in segmenting these enhancing brain tumor areas, largely due to insufficient
consideration of MRI-specific tumor features such as complex textures and
directional variations. To address this, we propose the Harmonized Frequency
Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a
frequency-domain perspective. To comprehensively characterize tumor regions, we
develop a Frequency Domain Decomposition (FDD) module that separates MRI images
into low-frequency components, which capture smooth tumor contours, and
high-frequency components, which highlight detailed textures and directional
edges. To further enhance sensitivity to tumor boundaries, we introduce an
Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical
high-frequency details using dynamically updated convolution kernels. To
effectively fuse tumor features across multiple scales, we design a Frequency
Domain Cross-Attention (FDCA) module integrating semantic, positional, and
slice-specific information. We further validate and interpret frequency-domain
improvements through visualization, theoretical reasoning, and experimental
analyses. Extensive experiments on four public datasets demonstrate that
HFF-Net achieves an average relative improvement of 4.48% (ranging from 2.39%
to 7.72%) in the mean Dice scores across the three major subregions, and an
average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the
segmentation of contrast-enhancing tumor regions, while maintaining favorable
computational efficiency and clinical applicability. Code:
https://github.com/VinyehShaw/HFF.
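A Laplacian kernel responds to high-frequency content such as edges and gives zero on flat regions, which is why Laplacian-style filters are a natural tool for emphasizing boundaries. The sketch below applies a fixed 3x3 Laplacian to a small 2-D image. It only illustrates the basic operator: the ALC module in the paper uses dynamically updated kernels, and the function name and the "valid" border handling are assumptions.

```python
from typing import List, Sequence

# Standard 4-neighbor discrete Laplacian kernel.
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def filter3x3_valid(image: Sequence[Sequence[float]],
                    kernel: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply a 3x3 kernel to the interior pixels of a 2-D image.

    No padding is used, so the output is 2 pixels smaller in each dimension.
    """
    height, width = len(image), len(image[0])
    output = []
    for i in range(1, height - 1):
        row = []
        for j in range(1, width - 1):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += kernel[di + 1][dj + 1] * image[i + di][j + dj]
            row.append(acc)
        output.append(row)
    return output
```

On a constant image the response is zero everywhere, while a vertical step edge produces a positive and negative pair on either side of the boundary.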
[137] Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation
Emerson P. Grabke,Masoom A. Haider,Babak Taati
Main category: eess.IV
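TL;DR: The paper proposes CCELLA, a dual conditioning approach that combines large language model text features with pathology classification to address data scarcity in medical LDM training, improving synthetic image quality and downstream classifier performance.
Details
Motivation: Scarce medical imaging data limits machine learning development. Existing LDM training relies on short-prompt text encoders or non-medical pretrained models, or needs fine-tuning with large data volumes, which limits performance and scientific accessibility.
Contribution: 1. CCELLA, a dual conditioning approach that combines text features and pathology classification; 2. A joint loss function and a data-efficient LDM training framework; 3. High-quality medical image synthesis with limited data, which improves classifier performance.
Method: Uses CCELLA dual-head conditioning: text features enter through cross-attention, and pathology classification enters through the timestep embedding. A joint loss function and a data-efficient training framework improve LDM performance with limited data.
Result: On a size-limited prostate MRI dataset, the method reaches a 3D FID of 0.025, compared with 0.071 for a recent foundation model. Adding synthetic images raises prostate cancer classifier accuracy from 69% to 74%, and a classifier trained only on synthetic images performs comparably to one trained on real images alone.
Insight: Dual conditioning on text and pathology makes it possible to train LDMs efficiently when data is scarce, generate high-quality medical images, and improve downstream tasks.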
Abstract: Latent diffusion models (LDM) could alleviate data scarcity challenges
affecting machine learning development for medical imaging. However, medical
LDM training typically relies on strategies that limit performance or scientific
accessibility, including a reliance on short-prompt text
encoders, the reuse of non-medical LDMs, or a requirement for fine-tuning with
large data volumes. We propose a Class-Conditioned Efficient Large Language
model Adapter (CCELLA) to address these limitations. CCELLA is a novel
dual-head conditioning approach that simultaneously conditions the LDM U-Net
with non-medical large language model-encoded text features through
cross-attention and with pathology classification through the timestep
embedding. We also propose a joint loss function and a data-efficient LDM
training framework. In combination, these strategies enable
pathology-conditioned LDM training for high-quality medical image synthesis
given limited data volume and human data annotation, improving LDM performance
and scientific accessibility. Our method achieves a 3D FID score of 0.025 on a
size-limited prostate MRI dataset, significantly outperforming a recent
foundation model with FID 0.071. When training a classifier for prostate cancer
prediction, adding synthetic images generated by our method to the training
dataset improves classifier accuracy from 69% to 74%. Training a classifier
solely on our method’s synthetic images achieved comparable performance to
training on real images alone.
[138] DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction
Yuliang Zhu,Jing Cheng,Qi Xie,Zhuo-Xu Cui,Qingyong Zhu,Yuanyuan Liu,Xin Liu,Jianfeng Ren,Chengbo Wang,Dong Liang
Main category: eess.IV
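TL;DR: The paper proposes DUN-SRE, a deep unrolling network with spatiotemporal rotation equivariance that exploits symmetry in dynamic MRI reconstruction and performs especially well on cardiac CINE MRI.
Details
Motivation: Dynamic MRI has transformation symmetries in both space and time, but existing methods do not model temporal symmetry effectively, which limits reconstruction quality.
Contribution: DUN-SRE, a deep unrolling network with spatiotemporal rotation equivariance, and a high-fidelity group filter parameterization mechanism.
Method: Uses a (2+1)D equivariant convolutional architecture and integrates the data consistency and proximal mapping modules into a unified deep unrolling framework, so that spatiotemporal symmetry constraints propagate rigorously through the reconstruction.
Result: Achieves state-of-the-art performance on cardiac CINE MRI datasets, particularly in preserving rotation-symmetric structures, and shows strong generalization.
Insight: Explicitly modeling spatiotemporal symmetry constraints improves dynamic MRI reconstruction quality, especially under aggressive undersampling.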
Abstract: Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries,
including spatial rotation symmetry within individual frames and temporal
symmetry along the time dimension. Explicit incorporation of these symmetry
priors in the reconstruction model can significantly improve image quality,
especially under aggressive undersampling scenarios. Recently, equivariant
convolutional neural networks (ECNNs) have shown great promise in exploiting
spatial symmetry priors. However, existing ECNNs critically fail to model
temporal symmetry, arguably the most universal and informative structural prior
in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep
Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for
Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance
through a (2+1)D equivariant convolutional architecture. In particular, it
integrates both the data consistency and proximal mapping modules into a unified
deep unrolling framework. This architecture ensures rigorous propagation of
spatiotemporal rotation symmetry constraints throughout the reconstruction
process, enabling more physically accurate modeling of cardiac motion dynamics
in cine MRI. In addition, a high-fidelity group filter parameterization
mechanism is developed to maintain representation precision while enforcing
symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets
demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in
preserving rotation-symmetric structures, offering strong generalization
capability to a broad range of dynamic MRI reconstruction tasks.
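Rotation equivariance means that rotating the input and then applying an operator gives the same result as applying the operator and then rotating the output. The simplest instance can be checked numerically for 90-degree rotations: a convolution whose kernel is itself unchanged by 90-degree rotation commutes with the rotation, while a directional kernel does not. This sketch is only that basic check, under the assumptions of square images and circular boundary handling; it is not the group-equivariant (2+1)D architecture of the paper, and all names are hypothetical.

```python
from typing import List, Sequence

Image = List[List[float]]


def rot90(img: Sequence[Sequence[float]]) -> Image:
    """Rotate a square image by 90 degrees counter-clockwise."""
    n = len(img)
    return [[img[j][n - 1 - i] for j in range(n)] for i in range(n)]


def circular_filter(img: Sequence[Sequence[float]],
                    kernel: Sequence[Sequence[float]]) -> Image:
    """Apply a 3x3 kernel with wrap-around borders (output size = input size)."""
    n = len(img)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = sum(kernel[di + 1][dj + 1]
                            * img[(i + di) % n][(j + dj) % n]
                            for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return out


def is_rot90_equivariant(img: Sequence[Sequence[float]],
                         kernel: Sequence[Sequence[float]]) -> bool:
    """Check filter(rot90(img)) == rot90(filter(img)) on one image."""
    return circular_filter(rot90(img), kernel) == rot90(
        circular_filter(img, kernel))


# A kernel unchanged by 90-degree rotation, and a directional one that is not.
SYMMETRIC_KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
HORIZONTAL_GRADIENT = [[0, 0, 0], [-1, 0, 1], [0, 0, 0]]
```

Equivariant networks generalize this idea by constraining or sharing filters across a rotation group, so the symmetry holds for learned filters rather than only for hand-picked symmetric ones.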
[139] ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation
Xi Chen,Zhiqiang Shen,Peng Cao,Jinzhu Yang,Osmar R. Zaiane
Main category: eess.IV
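TL;DR: The paper proposes ConStyX, a domain generalization method that augments both content and style to improve the generalization of medical image segmentation models.
Details
Motivation: Medical images usually come from multiple domains, which causes domain shifts that hurt segmentation performance. Existing domain randomization methods rely only on image style perturbation, which limits their efficiency, and they ignore the negative effect of over-augmented images on training.
Contribution: 1) A domain generalization method that augments both content and style (ConStyX); 2) Effective use of well-augmented features during training while mitigating the negative effects of over-augmentation.
Method: ConStyX augments both the content and the style of the training data to cover a wider range of data domains, and it controls how augmented features are used during model training.
Result: Experiments show that ConStyX achieves superior generalization performance across multiple domains.
Insight: Augmenting content and style together is key to improving domain generalization in medical image segmentation.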
Abstract: Medical images are usually collected from multiple domains, leading to domain
shifts that impair the performance of medical image segmentation models. Domain
Generalization (DG) aims to address this issue by training a robust model with
strong generalizability. Recently, numerous domain randomization-based DG
methods have been proposed. However, these methods suffer from the following
limitations: 1) constrained efficiency of domain randomization due to their
exclusive dependence on image style perturbation, and 2) neglect of the adverse
effects of over-augmented images on model training. To address these issues, we
propose a novel domain randomization-based DG method, called content style
augmentation (ConStyX), for generalizable medical image segmentation.
Specifically, ConStyX 1) augments the content and style of training data,
allowing the augmented training data to better cover a wider range of data
domains, and 2) leverages well-augmented features while mitigating the negative
effects of over-augmented features during model training. Extensive experiments
across multiple domains demonstrate that our ConStyX achieves superior
generalization performance. The code is available at
https://github.com/jwxsp1/ConStyX.
[140] Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches
Andrea Moglia,Matteo Leccardi,Matteo Cavicchioli,Alice Maccarini,Marco Marcon,Luca Mainardi,Pietro Cerveri
Main category: eess.IV
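TL;DR: This is a comprehensive survey of generalist models for medical image segmentation (such as SAM and its variants), which compares their performance and examines how they differ from task-specific models.
Details
Motivation: Inspired by large language models, generalist models are emerging in computer vision, particularly in medical image segmentation. The survey investigates their performance and the challenges they face.
Contribution: Provides a taxonomy and performance analysis of generalist models, compares them with recent task-specific models, and discusses future directions.
Method: Categorizes generalist model variants (zero-shot, few-shot, fine-tuning, and others) and compares their performance.
Result: The survey analyzes generalist model performance against state-of-the-art task-specific models and stresses open challenges in regulatory compliance, privacy and security, budget, and trustworthy AI.
Insight: Future directions include synthetic data, early fusion, agentic AI, and clinical translation, drawing on lessons from natural language processing.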
Abstract: Following the successful paradigm shift of large language models, leveraging
pre-training on a massive corpus of data and fine-tuning on different
downstream tasks, generalist models have made their foray into computer vision.
The introduction of the Segment Anything Model (SAM) set a milestone in the
segmentation of natural images, inspiring the design of a multitude of
architectures for medical image segmentation. In this survey we offer a
comprehensive and in-depth investigation on generalist models for medical image
segmentation. We start with an introduction to the fundamental concepts
underpinning their development. Then, we provide a taxonomy on the different
variants of SAM in terms of zero-shot, few-shot, fine-tuning, adapters, on
the recent SAM 2, on other innovative models trained on images alone, and
others trained on both text and images. We thoroughly analyze their
performances at the level of both primary research and best-in-literature,
followed by a rigorous comparison with the state-of-the-art task-specific
models. We emphasize the need to address challenges in terms of compliance with
regulatory frameworks, privacy and security laws, budget, and trustworthy
artificial intelligence (AI). Finally, we share our perspective on future
directions concerning synthetic data, early fusion, lessons learnt from
generalist models in natural language processing, agentic AI and physical AI,
and clinical translation.
[141] Med-URWKV: Pure RWKV With ImageNet Pre-training For Medical Image Segmentation
Zhenhuan Zhou
Main category: eess.IV
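TL;DR: Med-URWKV is a medical image segmentation model built on a pure RWKV architecture. It is the first to reuse an ImageNet-pretrained VRWKV encoder and matches or outperforms RWKV models trained from scratch.
Details
Motivation: Existing medical image segmentation methods (CNN, Transformer, or hybrid architectures) have limitations, such as the restricted receptive fields of CNNs or the high computational complexity of Transformers. RWKV offers long-range modeling with linear complexity, but its pretraining potential in the medical domain has not been explored.
Contribution: 1. Med-URWKV, the first medical image segmentation model based on a pure RWKV architecture that can directly reuse a large-scale pretrained VRWKV encoder; 2. The first use of an ImageNet-pretrained VRWKV encoder in this setting, confirming that pretraining improves model performance.
Method: Builds a pure RWKV architecture on the U-Net framework and incorporates an ImageNet-pretrained VRWKV encoder. Experiments on seven datasets compare it with RWKV models trained from scratch.
Result: Med-URWKV achieves segmentation performance comparable or superior to RWKV models trained from scratch across multiple datasets, which confirms the effectiveness of pretraining.
Insight: Reusing an ImageNet-pretrained RWKV encoder can improve medical image segmentation performance and opens a direction for future work.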
Abstract: Medical image segmentation is a fundamental and key technology in
computer-aided diagnosis and treatment. Previous methods can be broadly
classified into three categories: convolutional neural network (CNN) based,
Transformer based, and hybrid architectures that combine both. However, each of
them has its own limitations, such as restricted receptive fields in CNNs or
the computational overhead caused by the quadratic complexity of Transformers.
Recently, the Receptance Weighted Key Value (RWKV) model has emerged as a
promising alternative for various vision tasks, offering strong long-range
modeling capabilities with linear computational complexity. Some studies have
also adapted RWKV to medical image segmentation tasks, achieving competitive
performance. However, most of these studies focus on modifications to the
Vision-RWKV (VRWKV) mechanism and train models from scratch, without exploring
the potential advantages of leveraging pre-trained VRWKV models for medical
image segmentation tasks. In this paper, we propose Med-URWKV, a pure
RWKV-based architecture built upon the U-Net framework, which incorporates
ImageNet-based pretraining to further explore the potential of RWKV in medical
image segmentation tasks. To the best of our knowledge, Med-URWKV is the first
pure RWKV segmentation model in the medical field that can directly reuse a
large-scale pre-trained VRWKV encoder. Experimental results on seven datasets
demonstrate that Med-URWKV achieves comparable or even superior segmentation
performance compared to other carefully optimized RWKV models trained from
scratch. This validates the effectiveness of using a pretrained VRWKV encoder
in enhancing model performance. The codes will be released.
cs.CR
[142] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models
Zilong Wang,Xiang Zheng,Xiaosen Wang,Bo Wang,Xingjun Ma,Yu-Gang Jiang
Main category: cs.CR
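TL;DR: GenBreak is a framework that fine-tunes a large language model (LLM) to systematically explore vulnerabilities in text-to-image (T2I) generators. It combines supervised fine-tuning and reinforcement learning to produce adversarial prompts that both bypass safety mechanisms and yield highly toxic images.
Details
Motivation: Current T2I models (such as Stable Diffusion) can be misused to generate harmful content, but existing adversarial attack research has limitations: prompts are either easily detected or fail to produce genuinely harmful outputs. GenBreak aims to fill this gap with a reliable tool for evaluating T2I model safety.
Contribution: A framework that combines supervised fine-tuning and reinforcement learning and uses multiple reward signals to guide an LLM toward adversarial prompts that bypass safety mechanisms while remaining highly toxic, revealing serious safety vulnerabilities in commercial T2I models.
Method: 1) Supervised fine-tuning of the LLM on curated datasets; 2) Reinforcement learning through interaction with a surrogate T2I model; 3) Integration of multiple reward signals to optimize both the evasion capability and the toxicity of the prompts.
Result: The generated adversarial prompts are highly effective in black-box attacks against commercial T2I models and reveal serious safety weaknesses.
Insight: Generating adversarial prompts with an LLM is an efficient way to find vulnerabilities in T2I models, and reinforcement learning with multiple reward signals can markedly improve the effectiveness and diversity of the prompts.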
Abstract: Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and
are now widely used in content creation. However, these models can be misused
to generate harmful content, including nudity or violence, posing significant
safety risks. While most platforms employ content moderation systems,
underlying vulnerabilities can still be exploited by determined adversaries.
Recent research on red-teaming and adversarial attacks against T2I models has
notable limitations: some studies successfully generate highly toxic images but
use adversarial prompts that are easily detected and blocked by safety filters,
while others focus on bypassing safety mechanisms but fail to produce genuinely
harmful outputs, neglecting the discovery of truly high-risk prompts.
Consequently, there remains a lack of reliable tools for evaluating the safety
of defended T2I models. To address this gap, we propose GenBreak, a framework
that fine-tunes a red-team large language model (LLM) to systematically explore
underlying vulnerabilities in T2I generators. Our approach combines supervised
fine-tuning on curated datasets with reinforcement learning via interaction
with a surrogate T2I model. By integrating multiple reward signals, we guide
the LLM to craft adversarial prompts that enhance both evasion capability and
image toxicity, while maintaining semantic coherence and diversity. These
prompts demonstrate strong effectiveness in black-box attacks against
commercial T2I generators, revealing practical and concerning safety
weaknesses.
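The idea of integrating multiple reward signals can be illustrated with a minimal sketch. The abstract does not say how the signals are combined, so the weighted sum, the weight values, and the names `RewardSignals` and `combined_reward` below are assumptions for illustration only; the scores are placeholder numbers, not outputs of real scorers.

```python
from dataclasses import dataclass


@dataclass
class RewardSignals:
    """Per-prompt scores in [0, 1]; in practice each would come from a separate scorer."""
    evasion: float     # how likely the prompt is to pass the safety filter
    toxicity: float    # how harmful the surrogate model's image is judged to be
    coherence: float   # semantic coherence of the prompt
    diversity: float   # dissimilarity from previously generated prompts


# Hypothetical weights; the paper's actual reward formulation is not given in the abstract.
DEFAULT_WEIGHTS = {"evasion": 0.4, "toxicity": 0.4, "coherence": 0.1, "diversity": 0.1}


def combined_reward(s: RewardSignals, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the individual reward signals."""
    return (
        weights["evasion"] * s.evasion
        + weights["toxicity"] * s.toxicity
        + weights["coherence"] * s.coherence
        + weights["diversity"] * s.diversity
    )
```

A scalar of this kind is what a policy-gradient update would consume; other combination rules (for example, gating toxicity on successful evasion) are equally plausible readings of the abstract.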
cs.SD
[143] PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs
Tony Alex,Wish Suharitdamrong,Sara Atito,Armin Mustafa,Philip J. B. Jackson,Imran Razzak,Muhammad Awais
Main category: cs.SD
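TL;DR: The paper studies how audio encoders interact with LLMs and proposes delayed audio integration, attention-only probing, and an ensemble of diverse encoders to optimize cross-modal information transfer, yielding substantial performance gains.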
Details
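Motivation: Although audio-LLM applications have advanced rapidly, the mechanisms by which semantic representations transfer from audio encoders to LLMs remain unclear. The study aims to optimize this interaction and improve the LLM's ability to probe audio information.
Contribution: (1) a delayed audio integration strategy; (2) evidence that audio representations can be probed effectively through the attention submodule alone; (3) an ensemble of diverse audio encoders that enriches information transfer.
Method: Starting from a Pengi/LLaVA-style architecture, the authors propose delayed integration, attention-only probing, and encoder ensembling, and evaluate them on a dataset of 5.6 million audio-text pairs.
Result: The final architecture achieves relative improvements of 10% to 60% over the baseline, validating the cross-modal information transfer optimizations.
Insight: Delayed audio integration and the simplified attention-only design are key; a diverse encoder ensemble broadens the range of audio information the LLM can process.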
Abstract: The integration of audio perception capabilities into Large Language Models
(LLMs) has enabled significant advances in Audio-LLMs. Although
application-focused developments, particularly in curating training data for
specific capabilities (e.g., audio reasoning), have progressed rapidly, the
underlying mechanisms that govern efficient transfer of rich semantic
representations from audio encoders to LLMs remain under-explored. We
conceptualize effective audio-LLM interaction as the LLM’s ability to
proficiently probe the audio encoder representations to satisfy textual
queries. This paper presents a systematic investigation of how architectural
design choices can affect that. Beginning with a standard Pengi/LLaVA-style
audio-LLM architecture, we propose and evaluate several modifications guided by
hypotheses derived from mechanistic interpretability studies and LLM
operational principles. Our experiments demonstrate that: (1) delaying audio
integration until the LLM’s initial layers have established textual context
enhances its ability to probe the audio representations for relevant
information; (2) the LLM can proficiently probe audio representations
exclusively through the attention submodule of each LLM layer, without requiring
propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently
integrated ensemble of diverse audio encoders provides richer, complementary
representations, thereby broadening the LLM’s capacity to probe a wider
spectrum of audio information. All hypotheses are evaluated using an identical
three-stage training curriculum on a dataset of 5.6 million audio-text pairs,
ensuring controlled comparisons. Our final architecture, which incorporates all
proposed modifications, achieves relative improvements from 10% to 60% over
the baseline, validating our approach to optimizing cross-modal information
transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/
physics.med-ph
[144] Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation
Nicholas Summerfield,Qisheng He,Alex Kuo,Ahmed I. Ghanem,Simeng Zhu,Chase Ruff,Joshua Pan,Anudeep Kumar,Prashant Nagpal,Jiwei Zhao,Ming Dong,Carri K. Glide-Hurst
Main category: physics.med-ph
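TL;DR: The paper proposes MAGIC, a multi-modality cardiac substructure segmentation method that handles several imaging modalities with a single nnU-Net-based model and outperforms the comparison models in 57% of cases across three modalities.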
Details
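Motivation: Cardiac substructure segmentation is essential in radiation therapy planning, but existing deep learning methods generalize poorly across modalities and overlapping structures.
Contribution: MAGIC, a lightweight, modality-agnostic cardiac substructure segmentation framework that handles multiple modalities in a single model.
Method: An nnU-Net-based, U-shaped backbone with replicated encoding and decoding branches provides multi-modality segmentation.
Result: Average DSC scores were 0.75 for Sim-CT, 0.68 for MR-Linac, and 0.80 for CCTA; MAGIC outperformed the comparison models in 57% of cases.
Insight: MAGIC simplifies computational requirements and increases flexibility for clinical use, offering an efficient solution for multi-modality cardiac substructure segmentation.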
Abstract: Cardiac substructures are essential in thoracic radiation therapy planning to
minimize risk of radiation-induced heart disease. Deep learning (DL) offers
efficient methods to reduce contouring burden but lacks generalizability across
different modalities and overlapping structures. This work introduces and
validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and
multi-modal cardiac substructure segmentation. MAGIC is implemented through
replicated encoding and decoding branches of an nnU-Net-based, U-shaped
backbone, conserving the function of a single model. Twenty cardiac
substructures (heart, chambers, great vessels (GVs), valves, coronary arteries
(CAs), and conduction nodes) from simulation CT (Sim-CT), low-field MR-Linac,
and cardiac CT angiography (CCTA) modalities were manually delineated and used
to train (n=76), validate (n=15), and test (n=30) MAGIC. Twelve comparison
models (four segmentation subgroups across three modalities) were equivalently
trained. All methods were compared for training efficiency and against
reference contours using the Dice Similarity Coefficient (DSC) and two-tailed
Wilcoxon Signed-Rank test (threshold, p<0.05). Average DSC scores were
0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. MAGIC
outperforms the comparison models in 57% of cases, with limited statistical
differences. MAGIC offers an effective and accurate segmentation solution that
is lightweight and capable of segmenting multiple modalities and overlapping
structures in a single model. MAGIC further enables clinical implementation by
simplifying the computational requirements and offering unparalleled
flexibility for clinical settings.
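For readers unfamiliar with the evaluation metric, the Dice Similarity Coefficient is DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it on flattened binary masks; the function name and the convention of returning 1.0 for two empty masks are assumptions, not details from the paper.

```python
def dice_coefficient(pred, ref):
    """Dice Similarity Coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * intersection / total


# One overlapping voxel out of two foreground voxels in each mask gives 0.5.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```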
cs.RO
[145] A Navigation Framework Utilizing Vision-Language Models
Yicheng Duan,Kaiyu Tang
Main category: cs.RO
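TL;DR: The paper proposes a modular navigation framework that decouples vision-language understanding from action planning, pairing a frozen vision-language model with lightweight planning logic to achieve flexible, fast, and adaptable navigation.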
Details
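Motivation: Vision-and-Language Navigation (VLN) requires an agent to interpret natural language instructions and navigate unfamiliar environments. Large vision-language models excel at multimodal understanding but are computationally expensive and hard to deploy in real time.
Contribution: A modular navigation framework that decouples vision-language understanding from action planning and combines a frozen model with lightweight planning logic to improve navigation efficiency and flexibility.
Method: A frozen vision-language model, Qwen2.5-VL-7B-Instruct, is combined with prompt engineering, structured history management, and a two-frame visual input strategy to improve the continuity of navigation decisions.
Result: Generalization to unseen environments remains challenging, but the modular design lays a foundation for efficient, scalable navigation systems.
Insight: Stronger environmental priors and broader multimodal input integration are promising directions for improving navigation performance.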
Abstract: Vision-and-Language Navigation (VLN) presents a complex challenge in embodied
AI, requiring agents to interpret natural language instructions and navigate
through visually rich, unfamiliar environments. Recent advances in large
vision-language models (LVLMs), such as CLIP and Flamingo, have significantly
improved multimodal understanding but introduced new challenges related to
computational cost and real-time deployment. In this project, we propose a
modular, plug-and-play navigation framework that decouples vision-language
understanding from action planning. By integrating a frozen vision-language
model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to
achieve flexible, fast, and adaptable navigation without extensive model
fine-tuning. Our framework leverages prompt engineering, structured history
management, and a two-frame visual input strategy to enhance decision-making
continuity across navigation steps. We evaluate our system on the Room-to-Room
benchmark within the VLN-CE setting using the Matterport3D dataset and
Habitat-Lab simulation environment. Although our initial results reveal
challenges in generalizing to unseen environments under strict evaluation
settings, our modular approach lays a foundation for scalable and efficient
navigation systems, highlighting promising directions for future improvement
through enhanced environmental priors and expanded multimodal input
integration.
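The combination of structured history management and a two-frame visual input might look roughly like the sketch below. Everything here is hypothetical: the class and function names, the prompt wording, the history length, and the action vocabulary are illustrative assumptions, and no model call is shown.

```python
from collections import deque


class NavigationHistory:
    """Bounded record of past steps, rendered as text for the prompt."""

    def __init__(self, max_steps=5):
        self.steps = deque(maxlen=max_steps)  # oldest steps are dropped automatically

    def add(self, action, observation_summary):
        self.steps.append({"action": action, "observation": observation_summary})

    def render(self):
        if not self.steps:
            return "No previous steps."
        return "\n".join(
            f"{i}. action={s['action']}; saw: {s['observation']}"
            for i, s in enumerate(self.steps, start=1)
        )


def build_prompt(instruction, history, prev_frame_id, curr_frame_id):
    """Assemble a text prompt that references the previous and current frames."""
    return (
        f"Instruction: {instruction}\n"
        f"History:\n{history.render()}\n"
        f"Previous frame: {prev_frame_id}\n"
        f"Current frame: {curr_frame_id}\n"
        "Choose one action: FORWARD, TURN_LEFT, TURN_RIGHT, STOP."
    )
```

Keeping the history bounded limits prompt length, and passing both the previous and the current frame gives the frozen model a reference point for judging how the last action changed the view.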
[146] EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence
Wang Xinjie,Liu Liu,Cao Yu,Wu Ruiqi,Qin Wenkang,Wang Dehui,Sui Wei,Su Zhizhong
Main category: cs.RO
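TL;DR: EmbodiedGen is a platform for generating high-quality, controllable, and photorealistic 3D assets for embodied intelligence tasks. It uses generative AI to address the high cost and limited realism of traditional 3D assets.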
Details
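Motivation: The high production cost and limited realism of traditional 3D assets hinder the scalability that embodied intelligence tasks require. EmbodiedGen aims to provide low-cost, diverse 3D world generation through generative AI.
Contribution: The EmbodiedGen platform, which generates diverse 3D assets through six key modules (such as Image-to-3D and Text-to-3D) and exports them in URDF format for direct use in physics simulation engines.
Method: Generative AI techniques take image and text inputs and produce high-quality, controllable 3D assets. The platform's modules cover the full pipeline from texture generation to scene layout.
Result: The generated 3D assets are highly realistic, have accurate physical properties, and can be used directly for training and evaluating embodied intelligence tasks. Code is available.
Insight: EmbodiedGen demonstrates the potential of generative AI to address the scarcity of 3D assets, offering a new tool for scalability and generalization in embodied intelligence research.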
Abstract: Constructing a physically realistic and accurately scaled simulated 3D world
is crucial for the training and evaluation of embodied intelligence tasks. The
diversity, realism, low-cost accessibility, and affordability of 3D data assets
are critical for achieving generalization and scalability in embodied AI.
However, most current embodied intelligence tasks still rely heavily on
traditional, manually created and annotated 3D computer graphics assets, which
suffer from high production costs and limited realism. These limitations
significantly hinder the scalability of data-driven approaches. We present
EmbodiedGen, a foundational platform for interactive 3D world generation. It
enables the scalable generation of high-quality, controllable and
photorealistic 3D assets with accurate physical properties and real-world scale
in the Unified Robotics Description Format (URDF) at low cost. These assets can
be directly imported into various physics simulation engines for fine-grained
physical control, supporting downstream tasks in training and evaluation.
EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key
modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object
Generation, Scene Generation and Layout Generation. EmbodiedGen generates
diverse and interactive 3D worlds composed of generative 3D assets, leveraging
generative AI to address the generalization and evaluation challenges of
embodied intelligence research. Code is available at
https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.
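To make the URDF output concrete, the sketch below writes a minimal single-link description carrying a mesh, a mass, and a uniform scale. It is not EmbodiedGen's exporter: the function name `make_urdf`, the example file path, and the placeholder inertia values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def make_urdf(name, mesh_path, mass_kg, scale=1.0):
    """Build a one-link URDF string with a mesh, a mass, and a uniform scale."""
    robot = ET.Element("robot", name=name)
    link = ET.SubElement(robot, "link", name=f"{name}_link")

    inertial = ET.SubElement(link, "inertial")
    ET.SubElement(inertial, "mass", value=str(mass_kg))
    # Placeholder inertia tensor; a real asset needs values derived from its geometry.
    ET.SubElement(inertial, "inertia", ixx="0.001", ixy="0", ixz="0",
                  iyy="0.001", iyz="0", izz="0.001")

    scale_attr = f"{scale} {scale} {scale}"
    for tag in ("visual", "collision"):
        geometry = ET.SubElement(ET.SubElement(link, tag), "geometry")
        ET.SubElement(geometry, "mesh", filename=mesh_path, scale=scale_attr)

    return ET.tostring(robot, encoding="unicode")


# Hypothetical asset: a 0.25 kg mug whose mesh is scaled to one tenth of its size.
print(make_urdf("mug", "meshes/mug.obj", mass_kg=0.25, scale=0.1))
```

The mass and mesh scale are the fields that carry the physical properties and real-world scale mentioned in the abstract, and a simulator reads them when the file is imported.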
[147] Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop
Justin Kerr,Kush Hari,Ethan Weber,Chung Min Kim,Brent Yi,Tyler Bonnen,Ken Goldberg,Angjoo Kanazawa
Main category: cs.RO
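TL;DR: EyeRobot is a robotic system that combines a mechanical eyeball with reinforcement learning, jointly training hand and eye actions to complete real-world tasks.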
Details
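Motivation: Humans actively look at their environment in order to act. Inspired by this, the authors built EyeRobot to mimic active gaze behavior and improve manipulation over large workspaces.
Contribution: (1) a mechanical eyeball that rotates freely to observe its surroundings; (2) a BC-RL joint training framework that produces hand-eye coordination; (3) a gaze policy architecture with high resolution and low compute cost.
Method: (1) collect teleoperated demonstrations paired with a 360° camera; (2) train a gaze policy with reinforcement learning in a simulation environment that renders arbitrary eyeball viewpoints; (3) jointly optimize hand and eye in a BC-RL loop, where the hand is trained by behavior cloning and the eye by reinforcement learning.
Result: Experiments on five panoramic workspace tasks suggest that EyeRobot tracks targets, ignores distractors, and manipulates effectively over large workspaces.
Insight: (1) a high-resolution gaze policy helps stabilize fixation and object tracking; (2) hand-eye coordination emerges naturally and can substantially improve task completion.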
Abstract: Humans do not passively observe the visual world – we actively look in order
to act. Motivated by this principle, we introduce EyeRobot, a robotic system
with gaze behavior that emerges from the need to complete real-world tasks. We
develop a mechanical eyeball that can freely rotate to observe its surroundings
and train a gaze policy to control it using reinforcement learning. We
accomplish this by first collecting teleoperated demonstrations paired with a
360° camera. This data is imported into a simulation environment that supports
rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze
on top of robot demonstrations. We then introduce a BC-RL loop to train the
hand and eye jointly: the hand (BC) agent is trained from rendered eye
observations, and the eye (RL) agent is rewarded when the hand produces correct
action predictions. In this way, hand-eye coordination emerges as the eye looks
towards regions which allow the hand to complete the task. EyeRobot implements
a foveal-inspired policy architecture allowing high resolution with a small
compute budget, which we find also leads to the emergence of more stable
fixation as well as improved ability to track objects and ignore distractors.
We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring
manipulation in an arc surrounding the robot arm. Our experiments suggest
EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate
manipulation over large workspaces with a single camera. See project site for
videos: https://www.eyerobot.net/
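The reward coupling in the BC-RL loop can be sketched with a toy one-dimensional example in which the eye is rewarded only when its gaze lets the hand predict the demonstrated action. The visibility rule, the 60-degree field of view, the negative squared error, and all function names are assumptions made for illustration; the actual reward is not specified in the abstract.

```python
def target_visible(gaze_deg, target_deg, fov_deg=60.0):
    """True if the target lies within the eye's field of view (angles in degrees)."""
    diff = (target_deg - gaze_deg + 180.0) % 360.0 - 180.0  # signed angular difference
    return abs(diff) <= fov_deg / 2.0


def hand_prediction(gaze_deg, target_deg):
    """Toy stand-in for the BC hand agent: it can aim at the target only if it is seen."""
    return target_deg if target_visible(gaze_deg, target_deg) else 0.0


def eye_reward(gaze_deg, target_deg, demo_action):
    """Reward the eye by how closely the hand's prediction matches the demonstration."""
    error = hand_prediction(gaze_deg, target_deg) - demo_action
    return -(error ** 2)
```

With a target at 90 degrees and a demonstrated action of 90, a gaze of 80 degrees earns a higher reward than a gaze of 0 degrees, which is the pressure that pushes the eye toward task-relevant regions.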
cs.IR
[148] Conversational Search: From Fundamentals to Frontiers in the LLM Era
Fengran Mo,Chuan Meng,Mohammad Aliannejadi,Jian-Yun Nie
Main category: cs.IR
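TL;DR: This tutorial covers the fundamentals of conversational search and its emerging developments, especially the changes brought by large language models (LLMs), to help researchers and practitioners master the core techniques and advance next-generation conversational search systems.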
Details
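Motivation: The marked improvement of LLMs in instruction following, content generation, and reasoning brings new opportunities and challenges for building intelligent conversational search systems, prompting researchers to revisit and advance the field.
Contribution: The tutorial systematically connects the fundamentals of conversational search with advanced LLM-driven techniques, giving researchers in academia and industry a comprehensive knowledge framework.
Method: The tutorial combines core principles with recent advances to show how LLMs are reshaping conversational search systems, including key techniques such as intent understanding, context handling, and response generation.
Result: Participants will learn both the fundamentals of conversational search and the LLM-driven frontier, equipping them to develop next-generation intelligent conversational search systems.
Insight: The capabilities of LLMs bring new energy to conversational search, but also raise the challenge of using and fine-tuning models efficiently for specific needs.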
Abstract: Conversational search enables multi-turn interactions between users and
systems to fulfill users’ complex information needs. During this interaction,
the system should understand the users’ search intent within the conversational
context and then return the relevant information through a flexible,
dialogue-based interface. Recent powerful large language models (LLMs), with
their capacity for instruction following, content generation, and reasoning,
have attracted significant attention and spurred advances, providing new
opportunities and challenges for building intelligent conversational search systems. This
tutorial aims to introduce the connection between fundamentals and the emerging
topics revolutionized by LLMs in the context of conversational search. It is
designed for students, researchers, and practitioners from both academia and
industry. Participants will gain a comprehensive understanding of both the core
principles and cutting-edge developments driven by LLMs in conversational
search, equipping them with the knowledge needed to contribute to the
development of next-generation conversational search systems.