Table of Contents
- cs.CL [Total: 35]
- cs.CV [Total: 77]
- cs.MM [Total: 6]
- cs.SD [Total: 1]
- cs.MA [Total: 1]
- cs.IR [Total: 1]
- eess.SY [Total: 2]
- cs.GR [Total: 1]
- cs.LG [Total: 8]
- cs.RO [Total: 3]
- cs.AI [Total: 4]
- cs.CR [Total: 2]
- eess.IV [Total: 6]
- physics.med-ph [Total: 1]
cs.CL
[1] TaskCraft: Automated Generation of Agentic Tasks
Dingfeng Shi,Jingyi Cao,Qianben Chen,Weichen Sun,Weizhen Li,Hongxuan Lu,Fangchen Dong,Tianrui Qin,King Zhu,Minghao Yang,Jian Yang,Ge Zhang,Jiaheng Liu,Changwang Zhang,Jun Wang,Yuchen Eleanor Jiang,Wangchunshu Zhou
Main category: cs.CL
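TL;DR: TaskCraft is an automated workflow for generating multi-tool, verifiable agentic tasks. It addresses two gaps: existing instruction data lacks tool interaction, and human annotation is costly.
Details
Motivation: Existing agentic task data lacks tool interaction and relies on human annotation, which limits scalability; an automated task-generation method is therefore needed.
Contribution: 1. TaskCraft, an automated workflow for generating multi-tool, verifiable agentic tasks; 2. depth-based and width-based extensions that produce structurally and hierarchically complex tasks; 3. a large-scale synthetic dataset of approximately 36,000 tasks.
Method: TaskCraft expands atomic tasks through depth-based and width-based extensions to produce complex tasks. The generated tasks are then used for prompt optimization within the generation workflow and for supervised fine-tuning.
Result: Experiments show that the generated tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of foundation models.
Insight: Automated task generation can relieve the data-annotation bottleneck and opens a new direction for agent tuning and evaluation.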
Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool
use, and adaptive reasoning, are becoming increasingly central to the
advancement of NLP and AI. However, existing instruction data lacks tool
interaction, and current agentic benchmarks rely on costly human annotation,
limiting their scalability. We introduce TaskCraft, an automated
workflow for generating difficulty-scalable, multi-tool, and verifiable agentic
tasks with execution trajectories. TaskCraft expands atomic tasks using
depth-based and width-based extensions to create structurally and
hierarchically complex challenges. Empirical results show that these tasks
improve prompt optimization in the generation workflow and enhance supervised
fine-tuning of agentic foundation models. We present a large-scale synthetic
dataset of approximately 36,000 tasks with varying difficulty to support future
research on agent tuning and evaluation.
[2] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information
Christodoulos Constantinides,Shuxin Lin,Nianjun Zhou,Dhaval Patel
Main category: cs.CL
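TL;DR: This paper presents Chat-of-Thought, a multi-agent system of collaborating LLM agents for generating Failure Modes and Effects Analysis (FMEA) documents for industrial assets.
Details
Motivation: Generating FMEA documents for industrial equipment monitoring is complex and time-consuming. It requires multi-role collaboration and iterative refinement, and traditional approaches are inefficient.
Contribution: 1. The Chat-of-Thought framework, based on multi-agent collaboration; 2. dynamic task routing and multi-persona-driven iterative refinement; 3. a template-driven workflow design for FMEA in the industrial domain.
Method: 1. Multiple role-specific LLM agents; 2. dynamic task assignment and context-aware collaboration; 3. iterative content refinement through a Chat of Thought.
Result: The paper demonstrates the system's potential for generating FMEA documents in industrial equipment monitoring through multi-agent collaboration.
Insight: Multi-agent collaboration can significantly improve the efficiency and quality of generation for complex domain-specific tasks; dynamic routing and iterative refinement are the key ingredients.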
Abstract: This paper presents a novel multi-agent system called Chat-of-Thought,
designed to facilitate the generation of Failure Modes and Effects Analysis
(FMEA) documents for industrial assets. Chat-of-Thought employs multiple
collaborative Large Language Model (LLM)-based agents with specific roles,
leveraging advanced AI techniques and dynamic task routing to optimize the
generation and validation of FMEA tables. A key innovation in this system is
the introduction of a Chat of Thought, where dynamic, multi-persona-driven
discussions enable iterative refinement of content. This research explores the
application domain of industrial equipment monitoring, highlights key
challenges, and demonstrates the potential of Chat-of-Thought in addressing
these challenges through interactive, template-driven workflows and
context-aware agent collaboration.
[3] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering
Caijun Jia,Nan Xu,Jingxuan Wei,Qingli Wang,Lei Wang,Bihui Yu,Junnan Zhu
Main category: cs.CL
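TL;DR: ChartReasoner is a two-stage, code-driven framework for long-chain reasoning in chart question answering that preserves the original details of charts and enables precise reasoning. It comprises a high-fidelity chart-to-code model, a data synthesis pipeline that automatically generates reasoning trajectories, and a training strategy that combines supervised fine-tuning with reinforcement learning. In experiments it performs strongly, approaching proprietary models such as GPT-4o.
Details
Motivation: Large language models excel at long-chain reasoning, but extending this capability to visual reasoning tasks such as chart question answering remains challenging. Existing methods rely on image-to-text conversion, which tends to lose the structural and semantic details of the visual information.
Contribution: 1. ChartReasoner, a two-stage code-driven framework that preserves the original details of charts; 2. a high-fidelity chart-to-code model; 3. a data synthesis pipeline that automatically generates high-quality reasoning trajectories; 4. strong results on multiple benchmarks.
Method: 1. Train a high-fidelity model that converts chart images into structured ECharts code; 2. build a data synthesis pipeline that automatically generates reasoning trajectories and filters low-quality samples with a code validator; 3. train the final multimodal model with a combination of supervised fine-tuning and reinforcement learning.
Result: On four public benchmarks, ChartReasoner performs comparably with state-of-the-art open-source models while using fewer parameters, and approaches proprietary systems such as GPT-4o in out-of-domain settings.
Insight: A code-driven representation preserves the structure and semantic details of visual information more precisely, offering a new direction for visual reasoning tasks.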
Abstract: Recently, large language models have shown remarkable reasoning capabilities
through long-chain reasoning before responding. However, how to extend this
capability to visual reasoning tasks remains an open challenge. Existing
multimodal reasoning approaches transform such visual reasoning tasks into
textual reasoning tasks via several image-to-text conversions, which often lose
critical structural and semantic information embedded in visualizations,
especially for tasks like chart question answering that require a large amount
of visual details. To bridge this gap, we propose ChartReasoner, a code-driven
novel two-stage framework designed to enable precise, interpretable reasoning
over charts. We first train a high-fidelity model to convert diverse chart
images into structured ECharts code, preserving both layout and data semantics
as losslessly as possible. Then, we design a general chart reasoning data
synthesis pipeline, which leverages this pretrained transport model to
automatically and scalably generate chart reasoning trajectories and utilizes a
code validator to filter out low-quality samples. Finally, we train the final
multimodal model using a combination of supervised fine-tuning and
reinforcement learning on our synthesized chart reasoning dataset.
Experimental results on four public benchmarks clearly demonstrate the
effectiveness of our proposed ChartReasoner. It can preserve the original
details of the charts as much as possible and perform comparably with
state-of-the-art open-source models while using fewer parameters, approaching
the performance of proprietary systems like GPT-4o in out-of-domain settings.
[4] Unsupervised Elicitation of Language Models
Jiaxin Wen,Zachary Ankner,Arushi Somani,Peter Hase,Samuel Marks,Jacob Goldman-Wetzler,Linda Petrini,Henry Sleight,Collin Burns,He He,Shi Feng,Ethan Perez,Jan Leike
Main category: cs.CL
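TL;DR: The paper proposes Internal Coherence Maximization (ICM), an unsupervised algorithm that fine-tunes pretrained language models on their own generated labels without external supervision. On several tasks it matches golden supervision and outperforms crowdsourced human supervision.
Details
Motivation: The current post-training paradigm relies on humans to specify desired behaviors, but high-quality human supervision is hard to obtain for models with superhuman capabilities. An unsupervised way to elicit model capabilities is therefore needed.
Contribution: The ICM algorithm, which fine-tunes language models without supervision and matches or exceeds human supervision on several tasks, with especially large gains on tasks where model capabilities are superhuman.
Method: ICM fine-tunes the model by maximizing the internal coherence of its own generated labels, with no external supervision. Experiments cover GSM8k-verification, TruthfulQA, and Alpaca reward modeling.
Result: On these tasks ICM matches training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where model capabilities are strongly superhuman, ICM is significantly better than training on human labels. A reward model and an assistant trained without supervision also outperform their human-supervised counterparts.
Insight: Unsupervised methods have a clear advantage once model capabilities exceed human ability, and ICM offers an efficient alternative for training frontier language models.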
Abstract: To steer pretrained language models for downstream tasks, today’s
post-training paradigm relies on humans to specify desired behaviors. However,
for models with superhuman capabilities, it is difficult or impossible to get
high-quality human supervision. To address this challenge, we introduce a new
unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune
pretrained language models on their own generated labels, without
external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward
modeling tasks, our method matches the performance of training on golden
supervision and outperforms training on crowdsourced human supervision. On
tasks where LMs’ capabilities are strongly superhuman, our method can elicit
those capabilities significantly better than training on human labels. Finally,
we show that our method can improve the training of frontier LMs: we use our
method to train an unsupervised reward model and use reinforcement learning to
train a Claude 3.5 Haiku-based assistant. Both the reward model and the
assistant outperform their human-supervised counterparts.
[5] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective
Yi Wang,Max Kreminski
Main category: cs.CL
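TL;DR: This paper examines the story generation ability of LLMs from the perspective of narrative planning. GPT-4 tier LLMs can generate causally sound stories at small scales, but character intentionality and dramatic conflict remain challenging.
Details
Motivation: Story generation is a prominent application of LLMs, yet their ability to produce high-quality stories is poorly understood, partly because automatic evaluation methods are limited and manual evaluation is costly and subjective.
Contribution: A narrative planning benchmark based on literature examples that evaluates LLMs on causal soundness, character intentionality, and dramatic conflict, together with an account of the limitations and potential of LLMs in this area.
Method: Experiments evaluate GPT-4 tier LLMs on narrative planning tasks, focusing on causal soundness, character intentionality, and dramatic conflict. The paper also examines the need for LLMs trained with reinforcement learning for complex reasoning.
Result: LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains difficult and requires LLMs trained with reinforcement learning for complex reasoning.
Insight: The study shows both the potential and the challenges of LLM narrative planning and offers considerations for applying it in game environments.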
Abstract: Story generation has been a prominent application of Large Language Models
(LLMs). However, understanding LLMs’ ability to produce high-quality stories
remains limited due to challenges in automatic evaluation methods and the high
cost and subjectivity of manual evaluation. Computational narratology offers
valuable insights into what constitutes a good story, which has been applied in
the symbolic narrative planning approach to story generation. This work aims to
deepen the understanding of LLMs’ story generation capabilities by using them
to solve narrative planning problems. We present a benchmark for evaluating
LLMs on narrative planning based on literature examples, focusing on causal
soundness, character intentionality, and dramatic conflict. Our experiments
show that GPT-4 tier LLMs can generate causally sound stories at small scales,
but planning with character intentionality and dramatic conflict remains
challenging, requiring LLMs trained with reinforcement learning for complex
reasoning. The results offer insights on the scale of stories that LLMs can
generate while maintaining quality from different aspects. Our findings also
highlight interesting problem-solving behaviors and shed light on challenges
and considerations for applying LLM narrative planning in game environments.
[6] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
Shubhashis Roy Dipta,Francis Ferraro
Main category: cs.CL
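TL;DR: Q2E is a zero-shot multilingual text-to-video retrieval method that decomposes queries and draws on the latent knowledge of LLMs and VLMs, improving video retrieval for complex real-world events.
Details
Motivation: Existing methods lack a bridge between complex queries and video content, especially in multilingual and multimodal settings. Q2E addresses this by decomposing queries and fusing multimodal knowledge.
Contribution: 1. The Q2E method, which decomposes queries into events using the latent knowledge of LLMs and VLMs; 2. support for multilingual queries and multiple modalities (text, visual, speech); 3. entropy-based fusion scoring for zero-shot fusion, which improves retrieval.
Method: 1. Query decomposition: LLMs and VLMs break a complex query into simpler event descriptions. 2. Multimodal fusion: visual, speech, and text information are combined with an entropy-based scoring mechanism for zero-shot fusion.
Result: On two diverse datasets and multiple retrieval metrics, Q2E outperforms several state-of-the-art baselines, and integrating audio information further improves text-to-video retrieval.
Insight: 1. Query decomposition can substantially improve retrieval accuracy for complex events. 2. Integrating multimodal information, audio in particular, matters for video retrieval.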
Abstract: Recent approaches have shown impressive proficiency in extracting and
leveraging parametric knowledge from Large Language Models (LLMs) and
Vision-Language Models (VLMs). In this work, we consider how we can improve the
identification and retrieval of videos related to complex real-world events by
automatically extracting latent parametric knowledge about those events. We
present Q2E: a Query-to-Event decomposition method for zero-shot multilingual
text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our
approach demonstrates that we can enhance the understanding of otherwise overly
simplified human queries by decomposing the query using the knowledge embedded
in LLMs and VLMs. We additionally show how to apply our approach to both visual
and speech-based inputs. To combine this varied multimodal knowledge, we adopt
entropy-based fusion scoring for zero-shot fusion. Through evaluations on two
diverse datasets and multiple retrieval metrics, we demonstrate that Q2E
outperforms several state-of-the-art baselines. Our evaluation also shows that
integrating audio information can significantly improve text-to-video
retrieval. We have released code and data for future research.
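The abstract names entropy-based fusion scoring but does not give its formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: each modality's similarity scores over the candidate videos are turned into a distribution, and modalities whose distribution has lower entropy (that is, a more decisive ranking) receive more weight. The function names, the weighting rule `1 - normalized entropy`, and the toy scores are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Convert raw similarity scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (natural log) of a distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weighted_fusion(scores_by_modality):
    """Fuse per-modality scores over the same candidate list.

    Assumed rule: weight = 1 - H(p) / log(n), so a near-uniform
    (uninformative) modality contributes almost nothing.
    """
    n = len(next(iter(scores_by_modality.values())))
    max_entropy = math.log(n) if n > 1 else 1.0
    probs, weights = {}, {}
    for name, scores in scores_by_modality.items():
        p = softmax(scores)
        probs[name] = p
        weights[name] = max(0.0, 1.0 - entropy(p) / max_entropy)
    total_w = sum(weights.values())
    if total_w == 0:  # every modality is uniform: fall back to equal weights
        weights = {name: 1.0 for name in probs}
        total_w = float(len(probs))
    fused = [
        sum(weights[name] * probs[name][i] for name in probs) / total_w
        for i in range(n)
    ]
    return fused, weights

# Toy scores for three candidate videos (invented for illustration).
scores = {"visual": [5.0, 0.0, 0.0], "speech": [0.1, 0.0, 0.1]}
fused, weights = entropy_weighted_fusion(scores)
```

Because the weights are derived from the scores themselves, no training data is needed, which is consistent with the zero-shot setting the abstract describes.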
[7] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
Prakamya Mishra,Jiang Liu,Jialian Wu,Xiaodong Yu,Zicheng Liu,Emad Barsoum
Main category: cs.CL
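TL;DR: The paper introduces TTT-Bench, a benchmark that uses simple Tic-Tac-Toe-style games to evaluate the basic strategic, spatial, and logical reasoning of large reasoning models (LRMs). LRMs that excel at hard math problems often perform poorly on these simple tasks.
Details
Motivation: Most current reasoning benchmarks focus on the STEM domain, while the reasoning ability of LRMs in broader task domains remains underexplored. The paper aims to fill this gap with a simple yet challenging benchmark.
Contribution: The TTT-Bench benchmark, which evaluates the strategic and spatial reasoning of LRMs through four simple Tic-Tac-Toe-style games and shows that models perform poorly on simple tasks.
Method: A scalable programmatic approach generates verifiable two-player game problems, which are then used to test a range of LRMs.
Result: LRMs score well below their math-benchmark performance on these simple reasoning games and struggle most with long-term strategic reasoning. Larger models achieve higher scores with shorter reasoning traces.
Insight: Large reasoning models may rely on data rather than genuine reasoning ability on complex tasks; their failures on simple tasks expose limits in strategic and spatial reasoning.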
Abstract: Large reasoning models (LRMs) have demonstrated impressive reasoning
capabilities across a broad range of tasks including Olympiad-level
mathematical problems, indicating evidence of their complex reasoning
abilities. While many reasoning benchmarks focus on the STEM domain, the
ability of LRMs to reason correctly in broader task domains remains
underexplored. In this work, we introduce TTT-Bench, a new benchmark
that is designed to evaluate basic strategic, spatial, and logical reasoning
abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games
that humans can effortlessly solve from a young age. We propose a simple yet
scalable programmatic approach for generating verifiable two-player game
problems for TTT-Bench. Although these games are trivial for humans, they
require reasoning about the intentions of the opponent, as well as the game
board’s spatial configurations, to ensure a win. We evaluate a diverse set of
state-of-the-art LRMs, and discover that the models that excel at hard
math problems frequently fail at these simple reasoning games. Further testing
reveals that our evaluated reasoning models score on average 41% and 5% lower
on TTT-Bench compared to MATH 500 and AIME 2024,
respectively, with larger models achieving higher performance using shorter
reasoning traces, where most of the models struggle on long-term strategic
reasoning situations on simple and new TTT-Bench tasks.
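To make the idea of programmatically generated, verifiable game problems concrete, here is a small sketch. It uses plain 3x3 Tic-Tac-Toe rather than the four novel variants in TTT-Bench, whose rules the abstract does not describe, so the board encoding, the function names, and the "find an immediate winning move" question are assumptions for illustration only. The point is that the answer key comes from exhaustive checking, so every generated problem is verifiable.

```python
import random

# The eight winning lines of a 3x3 board, indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def winning_moves(board, player):
    """List every empty square where `player` wins immediately."""
    moves = []
    for i, cell in enumerate(board):
        if cell == ".":
            nxt = board[:i] + player + board[i + 1:]
            if winner(nxt) == player:
                moves.append(i)
    return moves

def generate_problem(rng):
    """Sample a position with X to move, no winner yet, and at least one
    immediate win for X. Returns (board, list of correct squares)."""
    while True:
        cells = ["."] * 9
        marks_each = rng.choice([2, 3])
        squares = rng.sample(range(9), 2 * marks_each)
        for idx, sq in enumerate(squares):
            cells[sq] = "X" if idx % 2 == 0 else "O"
        board = "".join(cells)
        if winner(board) is None:
            answer = winning_moves(board, "X")
            if answer:
                return board, answer

board, answer = generate_problem(random.Random(0))
```

A model's reply can then be graded by checking whether the square it names appears in `answer`, with no human annotation involved.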
[8] Classifying Unreliable Narrators with Large Language Models
Anneliese Brei,Katharine Henry,Abhisheik Sharma,Shashank Srivastava,Snigdha Chaturvedi
Main category: cs.CL
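TL;DR: This paper proposes using large language models (LLMs) to identify unreliable narrators and releases the TUNa dataset. Classification tasks over multi-domain text show both the potential and the difficulty of the task for LLMs.
Details
Motivation: In literature, social media, and other text, narrators may unintentionally provide inaccurate information. Identifying such unreliability automatically by computational means is valuable.
Contribution: 1. The TUNa dataset, covering blog posts, hotel reviews, literature, and other domains; 2. classification tasks for unreliable narrators; 3. an analysis of LLM performance on these tasks under several learning settings.
Method: LLMs classify unreliable narrators under few-shot, fine-tuning, and curriculum learning settings.
Result: The task proves very challenging, but LLMs show potential in this area.
Insight: Drawing on literary theory, the work extends unreliable-narrator identification to real-world text and provides new ideas and resources for follow-up research.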
Abstract: Often when we interact with a first-person account of events, we consider
whether or not the narrator, the primary speaker of the text, is reliable. In
this paper, we propose using computational methods to identify unreliable
narrators, i.e. those who unintentionally misrepresent information. Borrowing
literary theory from narratology to define different types of unreliable
narrators based on a variety of textual phenomena, we present TUNa, a
human-annotated dataset of narratives from multiple domains, including blog
posts, subreddit posts, hotel reviews, and works of literature. We define
classification tasks for intra-narrational, inter-narrational, and
inter-textual unreliabilities and analyze the performance of popular
open-weight and proprietary LLMs for each. We propose learning from literature
to perform unreliable narrator classification on real-world text data. To this
end, we experiment with few-shot, fine-tuning, and curriculum learning
settings. Our results show that this task is very challenging, and there is
potential for using LLMs to identify unreliable narrators. We release our
expert-annotated dataset and code and invite future research in this area.
[9] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages
Ali Almutairi,Abdullah Alsuhaibani,Shoaib Jameel,Usman Naseem,Gelareh Mohammadi,Imran Razzak
Main category: cs.CL
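TL;DR: Flick is a method for few-label text classification in low-resource languages that improves model performance through pseudo-label refinement and multi-task learning.
Details
Motivation: Few-label text classification in low-resource languages remains difficult; existing methods are vulnerable to noisy pseudo-labels and struggle with domain adaptation.
Contribution: The Flick method, whose pseudo-label refinement and adaptive top-k selection mechanism markedly improve pseudo-label quality across diverse low-resource languages.
Method: A pseudo-label refinement component distils high-confidence pseudo-labels from an initial set of multiple clusters, using single-cluster cohesion and an adaptive top-k selection mechanism.
Result: Effectiveness and adaptability are shown on 14 diverse datasets, including low-resource languages such as Arabic and Urdu.
Insight: Distilling high-confidence pseudo-labels from multiple clusters is key to improving few-label classification, especially in low-resource languages.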
Abstract: Training deep learning networks with minimal supervision has gained
significant research attention due to its potential to reduce reliance on
extensive labelled data. While self-training methods have proven effective in
semi-supervised learning, they remain vulnerable to errors from noisy pseudo
labels. Moreover, most recent approaches to the few-label classification
problem are either designed for resource-rich languages such as English or
involve complex cascading models that are prone to overfitting. To address the
persistent challenge of few-label text classification in truly low-resource
linguistic contexts, where existing methods often struggle with noisy
pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods
that rely on generic multi-cluster pseudo-labelling or complex cascading
architectures, Flick leverages the fundamental insight that distilling
high-confidence pseudo-labels from a broader set of initial clusters can
dramatically improve pseudo-label quality, particularly for linguistically
diverse, low-resource settings. Flick introduces a novel pseudo-label
refinement component that departs from traditional pseudo-labelling strategies
by identifying and leveraging top-performing pseudo-label clusters. This
component specifically learns to distil highly reliable pseudo-labels from an
initial broad set by focusing on single-cluster cohesion and leveraging an
adaptive top-k selection mechanism. This targeted refinement process is crucial
for mitigating the propagation of errors inherent in low-resource data,
allowing for robust fine-tuning of pre-trained language models with only a
handful of true labels. We demonstrate Flick’s efficacy across 14 diverse
datasets, encompassing challenging low-resource languages such as Arabic, Urdu,
and Setswana, alongside English, showcasing its superior performance and
adaptability.
[10] “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context
Chuck Arvin
Main category: cs.CL
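TL;DR: The paper studies how user-provided suggestions affect large language models (LLMs) in a simulated educational context, where sycophancy is a particular risk. Across five LLMs, response quality varies markedly with query framing. When the student mentions an incorrect answer, model accuracy can drop by as much as 15 percentage points; mentioning the correct answer raises accuracy by the same margin. The bias is stronger in smaller models.
Details
Motivation: The study examines sycophancy in LLMs in educational settings, which can reinforce misunderstanding among less knowledgeable students and harm educational equity.
Contribution: The paper quantifies LLM sycophancy, shows how response quality changes with the way a student's answer is mentioned, and shows how the bias varies with model size.
Method: Five LLMs are tested under five experimental conditions. The analysis compares responses when a correct or incorrect answer is mentioned and uses token-level probabilities to confirm the behavior pattern.
Result: Response quality varies markedly with how the student's answer is mentioned, with accuracy shifting by up to 15 percentage points. The effect is larger in smaller models (up to 30%).
Insight: LLMs in educational applications can behave sycophantically, which may widen educational inequality; further study and mitigation of this bias are needed.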
Abstract: This study examines how user-provided suggestions affect Large Language
Models (LLMs) in a simulated educational context, where sycophancy poses
significant risks. Testing five different LLMs from the OpenAI GPT-4o and
GPT-4.1 model classes across five experimental conditions, we show that
response quality varies dramatically based on query framing. In cases where the
student mentions an incorrect answer, the LLM correctness can degrade by as
much as 15 percentage points, while mentioning the correct answer boosts
accuracy by the same margin. Our results also show that this bias is stronger
in smaller models, with an effect of up to 30% for the GPT-4.1-nano model,
versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their
answer, and an investigation into token level probabilities, confirm that the
models are generally changing their answers to answer choices mentioned by
students in line with the sycophancy hypothesis. This sycophantic behavior has
important implications for educational equity, as LLMs may accelerate learning
for knowledgeable students while the same tools may reinforce misunderstanding
for less knowledgeable students. Our results highlight the need to better
understand the mechanism of such bias, and ways to mitigate it, in the
educational context.
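The two quantities discussed above, the accuracy shift in percentage points and how often a model "flips" to the option a student mentioned, can be computed as in the sketch below. The record fields and the four example records are invented for illustration and are not data from the paper; the paper's exact definition of a flip may differ from the one assumed here.

```python
def accuracy(records, field):
    """Share of records whose answer in `field` equals the correct option."""
    return sum(r[field] == r["correct"] for r in records) / len(records)

def flip_rate_toward_suggestion(records):
    """Share of records where the model changed its baseline answer
    and landed on the option the student mentioned (assumed definition)."""
    flips = sum(
        r["with_suggestion"] != r["baseline"]
        and r["with_suggestion"] == r["suggested"]
        for r in records
    )
    return flips / len(records)

# Synthetic records: the student always mentions an incorrect option.
records = [
    {"correct": "A", "baseline": "A", "suggested": "B", "with_suggestion": "B"},
    {"correct": "C", "baseline": "C", "suggested": "D", "with_suggestion": "C"},
    {"correct": "B", "baseline": "B", "suggested": "A", "with_suggestion": "B"},
    {"correct": "D", "baseline": "A", "suggested": "B", "with_suggestion": "B"},
]

baseline_acc = accuracy(records, "baseline")
suggested_acc = accuracy(records, "with_suggestion")
delta_pp = (suggested_acc - baseline_acc) * 100  # shift in percentage points
flip_rate = flip_rate_toward_suggestion(records)
```

Reporting the shift in percentage points rather than as a relative change keeps the figure comparable across models with different baseline accuracy.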
[11] Code Execution as Grounded Supervision for LLM Reasoning
Dongwon Jung,Wenxuan Zhou,Muhao Chen
Main category: cs.CL
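TL;DR: The paper proposes generating high-quality chain-of-thought (CoT) supervision data from code execution, replacing approaches that depend on human annotation or LLM-generated reasoning, and reports improved LLM reasoning ability.
Details
Motivation: Existing ways of producing CoT supervision data depend on human annotation or LLM generation, which are costly and error-prone. Code execution is deterministic and yields a verifiable reasoning process, which makes it well suited to generating high-quality supervision data.
Contribution: 1. A method that uses the determinism of code execution to generate high-quality CoT supervision data; 2. evidence that the method is effective on reasoning tasks across domains; 3. less meaningless repetition and overthinking at inference time.
Method: Verifiable step-by-step reasoning traces are extracted from code execution and converted into natural language chain-of-thought data for training LLMs.
Result: Experiments show that the method improves LLM reasoning on tasks across domains and reduces token length during inference.
Insight: Code execution is a reliable source of high-quality supervision data; its determinism is a clear advantage over human-written or LLM-generated reasoning.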
Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision
has proven effective for enhancing their reasoning abilities. However,
obtaining reliable and accurate reasoning supervision remains a significant
challenge. We propose a scalable method for generating a high-quality CoT
supervision dataset by leveraging the determinism of program execution. Unlike
existing reasoning dataset generation methods that rely on costly human
annotations or error-prone LLM-generated CoT, our approach extracts verifiable,
step-by-step reasoning traces from code execution and transforms them into
natural language CoT reasoning. Experiments on reasoning benchmarks across
various domains show that our method effectively equips LLMs with transferable
reasoning abilities across diverse tasks. Furthermore, the ablation studies
validate that our method produces highly accurate reasoning data and reduces
overall token length during inference by reducing meaningless repetition and
overthinking.
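As a rough illustration of extracting a step-by-step trace from program execution, the sketch below uses Python's standard `sys.settrace` hook to record local variables as a function runs, then renders each new value as one line of text. This is an assumption about how such a trace could be captured, not the authors' pipeline; `trace_execution`, `to_cot`, and the toy `total_price` function are hypothetical names, and the paper's conversion to natural language is presumably far richer than these templated lines.

```python
import sys

def trace_execution(func, *args):
    """Run func(*args), recording the local variables seen at each executed
    line (the state just before that line runs)."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore any nested calls
        if event == "line":
            steps.append(dict(frame.f_locals))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, steps

def to_cot(steps, result):
    """Turn the recorded states into one templated sentence per new value."""
    lines, seen = [], {}
    for local_vars in steps:
        for name, value in local_vars.items():
            if name not in seen or seen[name] != value:
                lines.append(f"{name} = {value!r}")
                seen[name] = value
    lines.append(f"answer = {result!r}")
    return lines

def total_price(quantity, unit_price):
    subtotal = quantity * unit_price
    discount = subtotal // 10
    return subtotal - discount

result, steps = trace_execution(total_price, 3, 20)
cot = to_cot(steps, result)
```

Every intermediate value in `cot` was produced by the interpreter rather than written by a person or a model, which is the property that makes such traces verifiable.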
[12] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
Xiaohan Yu,Pu Jian,Chong Chen
Main category: cs.CL
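TL;DR: TableRAG is a retrieval-augmented generation framework designed for heterogeneous documents that contain both text and tables. Its iterative four-step procedure addresses the limits of existing methods on tabular data, and it achieves the best performance on the new HeteQA benchmark.
Details
Motivation: Existing RAG methods handle heterogeneous documents (mixed text and tables) poorly. Flattening tables and chunking break the table structure, causing information loss and weaker multi-hop reasoning.
Contribution: The TableRAG framework, which unifies textual understanding and table manipulation, and the HeteQA benchmark for evaluating multi-hop reasoning over heterogeneous documents.
Method: TableRAG iterates over four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation.
Result: TableRAG outperforms existing baselines on both public datasets and the new HeteQA benchmark, setting a new state of the art.
Insight: Preserving table structure and introducing SQL operations are key to better reasoning over heterogeneous documents.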
Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable
effectiveness in open-domain question answering. However, when applied to
heterogeneous documents, comprising both textual and tabular components,
existing RAG approaches exhibit critical limitations. The prevailing practice
of flattening tables and chunking strategies disrupts the intrinsic tabular
structure, leads to information loss, and undermines the reasoning capabilities
of LLMs in multi-hop, global queries. To address these challenges, we propose
TableRAG, a hybrid framework that unifies textual understanding and complex
manipulations over tabular data. TableRAG iteratively operates in four steps:
context-sensitive query decomposition, text retrieval, SQL programming and
execution, and compositional intermediate answer generation. We also develop
HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous
reasoning capabilities. Experimental results demonstrate that TableRAG
consistently outperforms existing baselines on both public datasets and our
HeteQA, establishing a new state-of-the-art for heterogeneous document question
answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
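The SQL programming and execution step can be illustrated with Python's standard `sqlite3` module: the table keeps its rows and columns, and a global question ("which product sold the most units overall?") becomes an aggregate query instead of a search over flattened text chunks. The table name, schema, rows, and query below are invented for illustration and are not taken from TableRAG or HeteQA; in the framework itself the query would be written by an LLM.

```python
import sqlite3

# Toy sales table (invented data).
rows = [
    ("Widget", "EU", 120),
    ("Widget", "US", 80),
    ("Gadget", "EU", 50),
    ("Gadget", "US", 200),
]

def run_sql(rows, query):
    """Load the rows into an in-memory SQLite table and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (product TEXT, region TEXT, units INTEGER)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()

# A global, multi-row question expressed as an aggregate query.
query = (
    "SELECT product, SUM(units) AS total "
    "FROM sales GROUP BY product ORDER BY total DESC LIMIT 1"
)
top = run_sql(rows, query)
```

Answering the same question from chunked text would require every row of both products to land in the retrieved context, which is the information-loss problem the abstract attributes to flattening.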
[13] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Yuhua Jiang,Yuwen Xiong,Yufeng Yuan,Chao Xin,Wenyuan Xu,Yu Yue,Qianchuan Zhao,Lin Yan
Main category: cs.CL
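TL;DR: PAG is a multi-turn self-correction method that unifies the policy and verifier roles in a single reinforcement learning framework. The model revises its answer only when it detects an error, which improves both reasoning and verification.
Details
Motivation: Existing LLM self-verification methods often depend on a separate verifier or on multi-stage training pipelines, which limits scalability. PAG aims for more efficient self-correction through a unified reinforcement learning framework.
Contribution: 1. The PAG framework, which unifies the policy and verifier roles in multi-turn RL; 2. a selective revision mechanism that revises only when verification finds an error; 3. gains in both reasoning and verification ability.
Method: The model alternates between policy and verifier roles during multi-turn RL training, combined with the selective revision mechanism.
Result: Across diverse reasoning benchmarks, PAG improves direct generation and self-correction accuracy as a policy, and its self-verification outperforms self-consistency as a verifier.
Insight: Unifying the policy and verifier roles is an efficient route to better LLM self-correction, and selective revision avoids unnecessary revisions.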
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in
complex reasoning tasks, yet they still struggle to reliably verify the
correctness of their own outputs. Existing solutions to this verification
challenge often depend on separate verifier models or require multi-stage
self-correction training pipelines, which limit scalability. In this paper, we
propose Policy as Generative Verifier (PAG), a simple and effective framework
that empowers LLMs to self-correct by alternating between policy and verifier
roles within a unified multi-turn reinforcement learning (RL) paradigm.
Distinct from prior approaches that always generate a second attempt regardless
of model confidence, PAG introduces a selective revision mechanism: the model
revises its answer only when its own generative verification step detects an
error. This verify-then-revise workflow not only alleviates model collapse but
also jointly enhances both reasoning and verification abilities. Extensive
experiments across diverse reasoning benchmarks highlight PAG’s dual
advancements: as a policy, it enhances direct generation and self-correction
accuracy; as a verifier, its self-verification outperforms self-consistency.
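The verify-then-revise control flow can be sketched as a short loop. In PAG a single LLM plays both the policy and verifier roles and is trained with reinforcement learning; the sketch below replaces those roles with plain Python callables so that only the selective-revision logic is shown. All function names and the toy arithmetic task are hypothetical.

```python
def verify_then_revise(question, generate, verify, revise, max_turns=2):
    """Generate an answer, then revise only while the verifier rejects it.

    Unlike always producing a second attempt, the loop stops as soon as
    verification passes, so accepted answers are never rewritten.
    """
    answer = generate(question)
    history = [answer]
    for _ in range(max_turns):
        if verify(question, answer):
            break  # verifier found no error: keep the current answer
        answer = revise(question, answer)
        history.append(answer)
    return answer, history

# Stand-in roles for a toy addition task (not a real model).
def check_sum(question, answer):
    return int(answer) == sum(int(term) for term in question.split("+"))

def careful_sum(question, previous_answer):
    return str(sum(int(term) for term in question.split("+")))

wrong_first = lambda question: "32"   # first attempt contains an error
right_first = lambda question: "42"   # first attempt is already correct

fixed, fixed_history = verify_then_revise(
    "17 + 25", wrong_first, check_sum, careful_sum
)
kept, kept_history = verify_then_revise(
    "17 + 25", right_first, check_sum, careful_sum
)
```

In the second call the history contains a single entry, which reflects the design choice described above: no revision is attempted when verification succeeds.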
[14] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Yingjin Song,Yupei Du,Denis Paperno,Albert Gatt
Main category: cs.CL
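TL;DR: This paper introduces the TempVS benchmark, which evaluates the temporal understanding and reasoning of multimodal large language models (MLLMs) over image sequences. Existing models fall well below human performance.
Details
Motivation: To examine whether MLLMs truly capture the temporal order of events in image sequences, filling a gap in existing benchmarks.
Contribution: 1) The TempVS benchmark; 2) an evaluation of the temporal reasoning of 38 state-of-the-art MLLMs; 3) fine-grained analysis.
Method: TempVS consists of three tests (event relation inference, sentence ordering, and image ordering), each with a basic grounding test, and requires models to combine visual and linguistic modalities.
Result: Existing MLLMs perform poorly on TempVS, with a substantial gap to human capabilities.
Insight: Future work can focus on improving multimodal temporal reasoning, especially the combination of visual and linguistic modalities.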
Abstract: This paper introduces the TempVS benchmark, which focuses on temporal
grounding and reasoning capabilities of Multimodal Large Language Models
(MLLMs) in image sequences. TempVS consists of three main tests (i.e., event
relation inference, sentence ordering and image ordering), each accompanied
by a basic grounding test. TempVS requires MLLMs to rely on both visual and
linguistic modalities to understand the temporal order of events. We evaluate
38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS,
with a substantial performance gap compared to human capabilities. We also
provide fine-grained insights that suggest promising directions for future
research. Our TempVS benchmark data and code are available at
https://github.com/yjsong22/TempVS.
[15] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty
Zehui Ling,Deshu Chen,Hongwei Zhang,Yifeng Jiao,Xin Guo,Yuan Cheng
Main category: cs.CL
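TL;DR: The paper proposes a dynamic length penalty that improves the reasoning efficiency of language models: outputs are shortened on simple problems while reasoning depth is preserved on complex ones.
Details
Motivation: Large language models perform well on reasoning tasks, but common techniques such as Chain-of-Thought prompting produce long outputs that increase computational latency. Existing shortening methods ignore problem complexity and give suboptimal results.
Contribution: A dynamic length penalty mechanism that adjusts output length to problem complexity, giving shorter outputs on simple tasks and more accurate reasoning on complex ones.
Method: The reward function is divided and a new penalty term on output length is added to control the model's reasoning efficiency.
Result: Strong results on three datasets, GSM8K, MATH500, and AIME2024: on the simpler tasks output length is reduced while accuracy is preserved or improved, and on the harder task accuracy improves.
Insight: A dynamic length penalty can balance reasoning efficiency against accuracy across tasks of different complexity.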
Abstract: Large language models (LLMs) have demonstrated significant advancements in
reasoning capabilities, performing well on various challenging benchmarks.
Techniques like Chain-of-Thought prompting have been introduced to further
improve reasoning. However, these approaches frequently generate longer
outputs, which in turn increase computational latency. Although some methods
use reinforcement learning to shorten reasoning, they often apply uniform
penalties without considering the problem’s complexity, leading to suboptimal
outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by
promoting conciseness for simpler problems while preserving sufficient
reasoning for more complex ones for accuracy, thus improving the model’s
overall performance. Specifically, we manage the model’s reasoning efficiency
by dividing the reward function and including a novel penalty for output
length. Our approach has yielded impressive outcomes in benchmark evaluations
across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively
simpler datasets GSM8K and MATH500, our method has effectively shortened output
lengths while preserving or enhancing accuracy. On the more demanding AIME2024
dataset, our approach has resulted in improved accuracy.
[16] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers
Xanh Ho,Sunisth Kumar,Yun-Ang Wu,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa
Main category: cs.CL
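TL;DR: The paper reframes table-text alignment as an explanation task in which models must identify the table cells essential for verifying a scientific claim, and presents a newly built dataset with supporting experiments.
Details
Motivation: Conventional scientific claim verification predicts only the final label. This offers little interpretability and reveals little about the model's reasoning, so finer-grained analysis of alignment with table cells is needed.
Contribution: Table-text alignment framed as an explanation task, a new dataset with cell-level annotations, and a taxonomy for handling ambiguous cases.
Method: The SciTab benchmark is extended with human-annotated cell-level rationales, and the effect of alignment information on claim verification performance is analyzed.
Result: Adding table alignment information improves claim verification performance, but most large language models fail to recover the human-annotated rationales.
Insight: The predictions of large language models may not stem from faithful reasoning; future work should attend to model explainability and alignment ability.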
Abstract: Scientific claim verification against tables typically requires predicting
whether a claim is supported or refuted given a table. However, we argue that
predicting the final label alone is insufficient: it reveals little about the
model’s reasoning and offers limited interpretability. To address this, we
reframe table-text alignment as an explanation task, requiring models to
identify the table cells essential for claim verification. We build a new
dataset by extending the SciTab benchmark with human-annotated cell-level
rationales. Annotators verify the claim label and highlight the minimal set of
cells needed to support their decision. After the annotation process, we
utilize the collected information and propose a taxonomy for handling ambiguous
cases. Our experiments show that (i) incorporating table alignment information
improves claim verification performance, and (ii) most LLMs, while often
predicting correct labels, fail to recover human-aligned rationales, suggesting
that their predictions do not stem from faithful reasoning.
[17] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs
Yilin Xiao,Chuang Zhou,Qinggang Zhang,Bo Li,Qing Li,Xiao Huang
Main category: cs.CL
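TL;DR: The paper proposes the RRP framework, which combines the semantic strengths of LLMs with the structural information of knowledge graphs to extract high-quality reasoning paths and improve LLM reasoning.
Details
Motivation: Existing knowledge-graph-enhanced LLMs perform poorly on complex questions, mainly because they do not make effective use of the relationships among facts or of logically consistent reasoning paths.
Contribution: The RRP framework, comprising relation embedding and bidirectional distribution learning, plus a rethinking module that evaluates and refines reasoning paths.
Method: RRP combines the semantic strengths of LLMs with structural information from the knowledge graph (relation embedding and bidirectional distribution learning) and adds a rethinking module to refine reasoning paths.
Result: On two public datasets RRP achieves state-of-the-art performance and can enhance the reasoning of various LLMs in a plug-and-play manner.
Insight: High-quality reasoning paths do more than supplement factual knowledge: they give LLMs more effective guidance and so improve their ability to solve complex questions.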
Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks
due to a lack of background knowledge and a tendency to hallucinate. To address
these limitations, integrating knowledge graphs (KGs) with LLMs has been
intensively studied. Existing KG-enhanced LLMs focus on supplementary factual
knowledge, but still struggle with solving complex questions. We argue that
refining the relationships among facts and organizing them into a logically
consistent reasoning path is as important as factual knowledge itself.
Despite their potential, extracting reliable reasoning paths from KGs poses the
following challenges: the complexity of graph structures and the existence of
multiple generated paths, making it difficult to distinguish between useful and
redundant ones. To tackle these challenges, we propose the RRP framework to
mine the knowledge graph, which combines the semantic strengths of LLMs with
structural information obtained through relation embedding and bidirectional
distribution learning. Additionally, we introduce a rethinking module that
evaluates and refines reasoning paths according to their significance.
Experimental results on two public datasets show that RRP achieves
state-of-the-art performance compared to existing baseline methods. Moreover,
RRP can be easily integrated into various LLMs to enhance their reasoning
abilities in a plug-and-play manner. By generating high-quality reasoning paths
tailored to specific questions, RRP distills effective guidance for LLM
reasoning.
[18] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
Numaan Naeem,Sarfraz Ahmad,Momina Ahsan,Hasan Iqbal
Main category: cs.CL
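TL;DR: The paper presents a retrieval-augmented prompting system for assessing whether AI tutors correctly identify mistakes in students' mathematical reasoning. It compares several approaches and shows the effectiveness of LLMs for this task.
Details
Motivation: The work seeks an efficient solution for evaluating AI tutors, in particular for judging whether they identify mistakes in students' mathematical reasoning.
Contribution: Four approaches are explored, of which the retrieval-augmented few-shot prompting system performs best on the mistake identification task.
Method: The approaches are a multi-model ensemble, a sentence-transformer classifier, a history-aware model, and a retrieval-augmented prompting system that uses an LLM for reasoning.
Result: The final system outperforms all baselines, showing the potential of LLMs for assessing pedagogical feedback.
Insight: Combining retrieval-augmented prompting with LLM reasoning is an effective approach to complex tasks in education.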
Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA
2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The
task involves evaluating whether a tutor’s response correctly identifies a
mistake in a student’s mathematical reasoning. We explore four approaches: (1)
an ensemble of machine learning models over pooled token embeddings from
multiple pretrained language models (LMs); (2) a frozen sentence-transformer
using [CLS] embeddings with an MLP classifier; (3) a history-aware model with
multi-head attention between token-level history and response embeddings; and
(4) a retrieval-augmented few-shot prompting system with a large language model
(LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples,
constructs structured prompts, and uses schema-guided output parsing to produce
interpretable predictions. It outperforms all baselines, demonstrating the
effectiveness of combining example-driven prompting with LLM reasoning for
pedagogical feedback assessment. Our code is available at
https://github.com/NaumanNaeem/BEA_2025.
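The retrieve, prompt, and parse steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words similarity stands in for a real sentence embedding, and the label set, prompt wording, and all function names are hypothetical.

```python
import json
import math
from collections import Counter

ALLOWED_LABELS = {"Yes", "No"}  # hypothetical label set, for illustration only


def bow(text):
    """Lowercased bag-of-words vector (a stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pool, k=2):
    """Return the k labeled examples most similar to the query response."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex["response"])), reverse=True)[:k]


def build_prompt(query, examples):
    """Assemble a structured few-shot prompt from the retrieved examples."""
    parts = [
        "Does the tutor response identify the student's mistake?",
        'Reply with JSON of the form {"label": "Yes"} or {"label": "No"}.',
    ]
    for ex in examples:
        answer = json.dumps({"label": ex["label"]})
        parts.append("Response: " + ex["response"] + "\nAnswer: " + answer)
    parts.append("Response: " + query + "\nAnswer:")
    return "\n\n".join(parts)


def parse_label(raw_output):
    """Schema-guided parsing: accept only JSON carrying an allowed label."""
    label = json.loads(raw_output).get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label
```

The prompt returned by `build_prompt` would be sent to the LLM, and the raw reply passed to `parse_label`, which rejects anything outside the expected schema instead of guessing.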
[19] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models
Ye Yu,Yaoning Yu,Haohan Wang
Main category: cs.CL
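TL;DR: PREMISE is a prompt-optimization framework that leaves model weights untouched. By combining diagnostics with gradient-inspired prompt optimization, it cuts redundant computation in mathematical reasoning, substantially lowering token usage and cost while matching or improving accuracy.
Details
Motivation: Large reasoning models such as Claude 3.7 Sonnet and OpenAI o1 perform well on mathematical tasks, but their verbose reasoning inflates token usage and cost, which limits deployment in latency-sensitive or API-constrained settings.
Contribution: Proposes the PREMISE framework, which diagnoses and optimizes prompting strategies to reduce reasoning tokens and cost without lowering accuracy, and which applies directly to commercial LLMs.
Method: Combines trace-level diagnostics with gradient-inspired prompt optimization, using a multi-objective textual search that balances brevity and correctness.
Result: On GSM8K, SVAMP, and Math500, PREMISE matches or exceeds baseline accuracy (e.g., 96%→96% with Claude) while reducing reasoning tokens by up to 87.5% and cutting cost by 69%-82%.
Insight: Prompt-level optimization is a practical route to efficient reasoning with large models: it improves efficiency without modifying model weights and suits real commercial deployments.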
Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve
strong performance on mathematical benchmarks using lengthy chain-of-thought
(CoT) reasoning, but the resulting traces are often unnecessarily verbose. This
inflates token usage and cost, limiting deployment in latency-sensitive or
API-constrained settings. We introduce PREMISE (PRompt-based Efficient
Mathematical Inference with Strategic Evaluation), a prompt-only framework that
reduces reasoning overhead without modifying model weights. PREMISE combines
trace-level diagnostics with gradient-inspired prompt optimization to minimize
redundant computation while preserving answer accuracy. The approach jointly
optimizes brevity and correctness through a multi-objective textual search that
balances token length and answer validity. Unlike prior work, PREMISE runs in a
single-pass black-box interface, so it can be applied directly to commercial
LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy
($96\%\rightarrow96\%$ with Claude, $91\%\rightarrow92\%$ with Gemini) while
reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by
$69$–$82\%$. These results show that prompt-level optimization is a practical
and scalable path to efficient LRM inference without compromising reasoning
quality.
[20] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims
Priyanka Kargupta,Runchu Tian,Jiawei Han
Main category: cs.CL
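TL;DR: The paper proposes ClaimSpect, a retrieval-augmented generation framework that analyzes nuanced claims hierarchically: it breaks a claim into verifiable sub-aspects and gathers the differing perspectives found in a corpus, giving a fuller reading of scientific and political claims.
Details
Motivation: Many claims, for example in science or politics, cannot simply be labeled "true" or "false" and call for finer-grained analysis.
Contribution: Proposes the ClaimSpect framework, which automatically builds a hierarchy of aspects for a claim and uses retrieval to enrich it with the perspectives present in a corpus.
Method: Uses retrieval-augmented generation to split a claim into sub-aspects and hierarchically retrieves relevant corpus segments, which help discover new sub-aspects and differing viewpoints.
Result: On a constructed dataset of real-world claims, ClaimSpect is shown to be robust and accurate and to outperform multiple baselines.
Insight: Hierarchical analysis combined with multiple perspectives gives a more complete understanding of nuanced claims and avoids the limits of binary classification.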
Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be
clearly labeled as entirely “true” or “false” – as is frequently the case with
scientific and political claims. However, a claim (e.g., “vaccine A is better
than vaccine B”) can be dissected into its integral aspects and sub-aspects
(e.g., efficacy, safety, distribution), which are individually easier to
validate. This enables a more comprehensive, structured response that provides
a well-rounded perspective on a given problem while also allowing the reader to
prioritize specific angles of interest within the claim (e.g., safety towards
children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based
framework for automatically constructing a hierarchy of aspects typically
considered when addressing a claim and enriching them with corpus-specific
perspectives. This structure hierarchically partitions an input corpus to
retrieve relevant segments, which assist in discovering new sub-aspects.
Moreover, these segments enable the discovery of varying perspectives towards
an aspect of the claim (e.g., support, neutral, or oppose) and their respective
prevalence (e.g., “how many biomedical papers believe vaccine A is more
transportable than B?”). We apply ClaimSpect to a wide variety of real-world
scientific and political claims featured in our constructed dataset, showcasing
its robustness and accuracy in deconstructing a nuanced claim and representing
perspectives within a corpus. Through real-world case studies and human
evaluation, we validate its effectiveness over multiple baselines.
[21] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs
Alberto Testoni,Iacer Calixto
Main category: cs.CL
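TL;DR: The paper presents a fine-grained evaluation of uncertainty and calibration for LLMs in clinical QA, comparing ten open-source LLMs across medical specialties and question types, and explores lightweight single-pass uncertainty estimators.
Details
Motivation: In high-stakes domains such as clinical decision support, accurate and well-calibrated uncertainty estimates are essential, yet fine-grained evaluation across question types and model types has been lacking.
Contribution: A fine-grained uncertainty evaluation for LLMs in clinical QA that spans several dimensions (medical specialty, question type, and others), together with a case study of lightweight single-pass uncertainty estimators.
Method: Compares standard single-generation and sampling-based methods, and explores lightweight single-pass estimators based on behavioral signals in reasoning traces.
Result: Performance varies substantially across medical specialties and question types; the lightweight estimators approach the performance of Semantic Entropy while requiring only one generation.
Insight: Model selection should depend on both the nature of the question and each model's specific strengths.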
Abstract: Accurate and well-calibrated uncertainty estimates are essential for
deploying large language models (LLMs) in high-stakes domains such as clinical
decision support. We present a fine-grained evaluation of uncertainty
estimation methods for clinical multiple-choice question answering, covering
ten open-source LLMs (general-purpose, biomedical, and reasoning models) across
two datasets, eleven medical specialties, and six question types. We compare
standard single-generation and sampling-based methods, and present a case study
exploring simple, single-pass estimators based on behavioral signals in
reasoning traces. These lightweight methods approach the performance of
Semantic Entropy while requiring only one generation. Our results reveal
substantial variation across specialties and question types, underscoring the
importance of selecting models based on both the nature of the question and
model-specific strengths.
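For intuition about the sampling-based baselines, the following sketch computes the entropy of repeatedly sampled answers to one multiple-choice question. It is a simplification made for illustration: it treats each answer option as its own meaning cluster, which is only a rough analogue of Semantic Entropy, and the function name is hypothetical.

```python
import math
from collections import Counter


def answer_entropy(sampled_options):
    """Shannon entropy (in nats) of the empirical distribution over sampled options.

    0.0 means every sample picked the same option (low uncertainty); the value
    grows as the samples spread over more options.
    """
    total = len(sampled_options)
    counts = Counter(sampled_options)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy
```

A sampling-based estimator of this kind needs several generations per question, which is the cost that the single-pass estimators discussed in the abstract avoid.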
[22] Improving Named Entity Transcription with Contextual LLM-based Revision
Viet Anh Trinh,Xinlu He,Jacob Whitehill
Main category: cs.CL
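TL;DR: The paper proposes an LLM-based revision mechanism that uses the LLM's reasoning ability and local context (e.g., lecture notes) to correct misrecognized named entities in ASR output, substantially reducing the word error rate (WER) on named entities.
Details
Motivation: Current ASR systems perform well on general speech but still have a high error rate on named entities, which hurts downstream applications.
Contribution: 1. Introduces an LLM-based revision mechanism; 2. Introduces the NER-MIT-OpenCourseWare dataset; 3. Achieves up to 30% relative WER reduction on named entities.
Method: Uses the LLM's reasoning ability together with local context that contains the correct named entities (e.g., lecture notes) to revise ASR predictions.
Result: On the NER-MIT-OpenCourseWare dataset, WER on named entities drops by up to 30% relative.
Insight: Combining LLM reasoning with contextual information can effectively improve named-entity accuracy in ASR systems.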
Abstract: With recent advances in modeling and the increasing amount of supervised
training data, automatic speech recognition (ASR) systems have achieved
remarkable performance on general speech. However, the word error rate (WER) of
state-of-the-art ASR remains high for named entities. Since named entities are
often the most critical keywords, misrecognizing them can affect all downstream
applications, especially when the ASR system functions as the front end of a
complex system. In this paper, we introduce a large language model (LLM)
revision mechanism to revise incorrect named entities in ASR predictions by
leveraging the LLM’s reasoning ability as well as local context (e.g., lecture
notes) containing a set of correct named entities. Finally, we introduce the
NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses
for development and testing. On this dataset, our proposed technique achieves
up to 30% relative WER reduction for named entities.
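To make the metric concrete, the sketch below computes word error rate with a word-level edit distance and then a relative reduction. The example strings and the 0.20 and 0.14 values used in the usage note are made up for illustration and are not results from the paper; restricting the computation to named-entity spans is left out.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the reference length."""
    return word_errors(reference, hypothesis) / len(reference.split())


def relative_reduction(before, after):
    """Relative improvement of `after` over `before`."""
    return (before - after) / before
```

For example, `wer("the fourier transform of x", "the for yay transform of x")` counts one substitution and one insertion over five reference words, and `relative_reduction(0.20, 0.14)` corresponds to a 30% relative reduction.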
[23] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints
Wei Sun,Tingyu Qu,Mingxiao Li,Jesse Davis,Marie-Francine Moens
Main category: cs.CL
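TL;DR: The paper proposes LangEdit, a framework that uses null-space constraints to isolate knowledge updates for each language, avoiding parameter interference and improving the efficiency and consistency of knowledge editing in multilingual LLMs.
Details
Motivation: Sequential knowledge updates across languages in multilingual LLMs cause parameter interference, which degrades knowledge consistency and generalization. Existing alternatives, such as editing a separate model per language, are costly, so an efficient unified solution is needed.
Contribution: Proposes the LangEdit framework, which isolates language-specific knowledge updates through null-space projection, ensuring update independence while preserving multilingual generalization.
Method: The core idea is to project each language's parameter update onto the orthogonal complement of the previously updated subspaces, which mathematically guarantees update independence.
Result: Experiments across three model architectures, six languages, and four tasks show that LangEdit mitigates interference and yields more accurate edited knowledge than existing methods.
Insight: Isolating parameters through a mathematical constraint is an effective way to address interference in multilingual knowledge editing and offers a new approach to efficiently updating multilingual models.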
Abstract: Efficiently updating multilingual knowledge in large language models (LLMs),
while preserving consistent factual representations across languages, remains a
long-standing and unresolved challenge. While deploying separate editing
systems for each language might seem viable, this approach incurs substantial
costs due to the need to manage multiple models. A more efficient solution
involves integrating knowledge updates across all languages into a unified
model. However, performing sequential edits across languages often leads to
destructive parameter interference, significantly degrading multilingual
generalization and the accuracy of injected knowledge. To address this
challenge, we propose LangEdit, a novel null-space constrained framework
designed to precisely isolate language-specific knowledge updates. The core
innovation of LangEdit lies in its ability to project parameter updates for
each language onto the orthogonal complement of previously updated subspaces.
This approach mathematically guarantees update independence while preserving
multilingual generalization capabilities. We conduct a comprehensive evaluation
across three model architectures, six languages, and four downstream tasks,
demonstrating that LangEdit effectively mitigates parameter interference and
outperforms existing state-of-the-art editing methods. Our results highlight
its potential for enabling efficient and accurate multilingual knowledge
updates in LLMs. The code is available at
https://github.com/VRCMF/LangEdit.git.
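The projection step can be illustrated on flattened update vectors. This is a minimal sketch under simplifying assumptions: updates are plain vectors, the "previously updated subspace" is the span of earlier applied updates, and the basis is kept orthonormal by Gram-Schmidt. LangEdit's actual construction may differ, and all names here are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(update, basis):
    """Project `update` onto the orthogonal complement of span(basis).

    `basis` must hold orthonormal vectors; each pass removes the component
    of the update along one of them.
    """
    result = list(update)
    for direction in basis:
        coeff = dot(result, direction)
        result = [r - coeff * d for r, d in zip(result, direction)]
    return result


def extend_basis(basis, vector, tol=1e-10):
    """Gram-Schmidt step: add the part of `vector` not yet covered by the basis."""
    residual = project_out(vector, basis)
    norm = dot(residual, residual) ** 0.5
    if norm > tol:
        basis.append([r / norm for r in residual])
    return basis


def sequential_edits(updates):
    """Apply per-language updates in order, each projected away from earlier ones."""
    basis, applied = [], []
    for update in updates:
        projected = project_out(update, basis)
        applied.append(projected)
        extend_basis(basis, projected)
    return applied
```

Because each applied update is orthogonal to all earlier ones, a later edit cannot change the model along a direction that an earlier edit relied on, which is the independence property the abstract refers to.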
[24] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization
Zhensheng Jin,Xinze Li,Yifan Ji,Chunyi Peng,Zhenghao Liu,Qi Shi,Yukun Yan,Shuo Wang,Furong Peng,Ge Yu
Main category: cs.CL
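TL;DR: ReCUT balances the reasoning length and accuracy of LLMs through stepwise exploration and preference optimization, shortening reasoning by approximately 30-50% while maintaining or improving accuracy.
Details
Motivation: Existing CoT prompting methods often overthink and produce redundant reasoning traces, and existing remedies are limited by the quality of the generated data and are prone to overfitting.
Contribution: Proposes ReCUT, which combines a stepwise exploration mechanism with a long-short switched sampling strategy, trains two specialized models (one optimized for accuracy, one for length), and obtains the final model by parameter interpolation.
Method: Uses stepwise exploration to generate diverse reasoning paths, builds preference pairs to train the two specialized models, and merges them by parameter interpolation.
Result: Across multiple math reasoning datasets and backbone models, ReCUT reduces reasoning length by approximately 30-50% while maintaining or improving accuracy.
Insight: By balancing reasoning length against accuracy, ReCUT offers a new approach to efficient LLM reasoning, particularly for tasks that need concise yet accurate reasoning.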
Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially
improved the reasoning capabilities of Large Language Models (LLMs). However,
these methods often suffer from overthinking, leading to unnecessarily lengthy
or redundant reasoning traces. Existing approaches attempt to mitigate this
issue through curating multiple reasoning chains for training LLMs, but their
effectiveness is often constrained by the quality of the generated data and
prone to overfitting. To address the challenge, we propose Reasoning
Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing
the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a
stepwise exploration mechanism and a long-short switched sampling strategy,
enabling LLMs to incrementally generate diverse reasoning paths. These paths
are evaluated and used to construct preference pairs to train two specialized
models (Gemini LLMs), one optimized for reasoning accuracy and the other for
shorter reasoning. A final integrated model is obtained by interpolating the
parameters of these two models. Experimental results across multiple math
reasoning datasets and backbone models demonstrate that ReCUT significantly
reduces reasoning lengths by approximately 30-50%, while maintaining or
improving reasoning accuracy compared to various baselines. All code and data
will be released via https://github.com/NEUIR/ReCUT.
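The final merging step, interpolating the parameters of the two specialized models, can be sketched as a per-parameter weighted average. The linear form, the mixing weight `alpha`, and the dictionary-of-lists parameter layout are assumptions made for illustration; the abstract does not state the interpolation scheme.

```python
def interpolate_parameters(accuracy_model, short_model, alpha):
    """Return alpha * accuracy_model + (1 - alpha) * short_model, per parameter.

    Both arguments map parameter names to flat lists of floats and must
    share the same names and shapes.
    """
    if accuracy_model.keys() != short_model.keys():
        raise ValueError("models must have identical parameter names")
    merged = {}
    for name, values_a in accuracy_model.items():
        values_b = short_model[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * a + (1.0 - alpha) * b
                        for a, b in zip(values_a, values_b)]
    return merged
```

With `alpha = 1.0` the result is the accuracy-oriented model, with `alpha = 0.0` the length-oriented one, and intermediate values trade one objective against the other.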
[25] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training
Alireza Salemi,Mukta Maddipatla,Hamed Zamani
Main category: cs.CL
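TL;DR: The paper proposes mRAG, a multi-agent retrieval-augmented generation framework that optimizes agent collaboration through self-training, and reports that it outperformed conventional RAG baselines in the LiveRAG 2025 competition.
Details
Motivation: Conventional retrieval-augmented generation (RAG) methods are limited on complex tasks; the authors aim to improve performance through multi-agent collaboration and a self-training paradigm.
Contribution: 1. Proposes the mRAG framework, with specialized agents for subtasks such as planning, searching, reasoning, and coordination; 2. Introduces self-training with reward-guided trajectory sampling to optimize collaboration.
Method: A multi-agent collaboration framework (mRAG) combined with self-training and reward-guided trajectory sampling to optimize task execution.
Result: In the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines, and case studies illustrate its effectiveness on complex tasks.
Insight: Multi-agent collaboration and self-training can improve performance on retrieval-augmented generation tasks.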
Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG)
framework composed of specialized agents for subtasks such as planning,
searching, reasoning, and coordination. Our system uses a self-training
paradigm with reward-guided trajectory sampling to optimize inter-agent
collaboration and enhance response generation. Evaluated on DataMorgana-derived
datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms
conventional RAG baselines. We further analyze competition outcomes and
showcase the framework’s strengths with case studies, demonstrating its
efficacy for complex, real-world RAG tasks.
[26] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles
Qingyan Wei,Yaojie Zhang,Zhiyuan Liu,Dongrui Liu,Linfeng Zhang
Main category: cs.CL
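TL;DR: The paper proposes SlowFast Sampling, a dynamic sampling strategy that alternates between exploratory and accelerated decoding stages to speed up inference in diffusion-based language models; combined with dLLM-Cache to reduce redundant computation, it reaches up to 34.22× speedup.
Details
Motivation: Existing sampling strategies for diffusion-based language models behave statically, which limits efficiency and flexibility. SlowFast Sampling aims to improve speed and performance by adjusting decoding stages dynamically.
Contribution: Proposes the SlowFast Sampling strategy, guided by three golden principles (certainty, convergence, and positional), and integrates dLLM-Cache to improve computational efficiency.
Method: Dynamically alternates between exploratory and accelerated decoding stages, with the three principles governing when and where tokens are decoded; dLLM-Cache reduces redundant computation.
Result: Achieves up to 15.63× speedup on LLaDA and up to 34.22× when combined with caching, and exceeds autoregressive baselines such as LLaMA3 8B in throughput.
Insight: A well-designed dynamic sampling strategy can unlock the parallel-generation potential of diffusion-based language models for fast, high-quality generation.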
Abstract: Diffusion-based language models (dLLMs) have emerged as a promising
alternative to traditional autoregressive LLMs by enabling parallel token
generation and significantly reducing inference latency. However, existing
sampling strategies for dLLMs, such as confidence-based or semi-autoregressive
decoding, often suffer from static behavior, leading to suboptimal efficiency
and limited flexibility. In this paper, we propose SlowFast Sampling, a novel
dynamic sampling strategy that adaptively alternates between exploratory and
accelerated decoding stages. Our method is guided by three golden principles:
certainty principle, convergence principle, and positional principle, which
govern when and where tokens can be confidently and efficiently decoded. We
further integrate our strategy with dLLM-Cache to reduce redundant computation.
Extensive experiments across benchmarks and models show that SlowFast Sampling
achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and
up to 34.22$\times$ when combined with caching. Notably, our approach
outperforms strong autoregressive baselines like LLaMA3 8B in throughput,
demonstrating that well-designed sampling can unlock the full potential of
dLLMs for fast and high-quality generation.
[27] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models
Michele Gubian,Ioana Krehan,Oli Liu,James Kirby,Sharon Goldwater
Main category: cs.CL
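TL;DR: The paper examines how the self-supervised speech model wav2vec2, pretrained on different languages, encodes phonetic, tonal, and speaker information, and finds that the structure of its representations is largely independent of the pretraining language.
Details
Motivation: Existing analyses have focused almost entirely on English; this paper studies how wav2vec2 models pretrained on several languages encode phones, lexical tones, and speaker information.
Contribution: Shows that the representational structure learned by wav2vec2 is largely language-independent across pretraining languages, and that the subspaces encoding phones, tones, and speakers are largely orthogonal.
Method: Uses probing classifiers and geometric analyses to study how models pretrained on different languages encode language-matched and non-matched speech.
Result: The subspaces for phones, tones, and speakers are largely orthogonal, and layerwise probing-accuracy patterns are similar, with only a small advantage for matched-language phone and tone probes in the later layers.
Insight: The structure of the representations learned by self-supervised speech models may be largely unaffected by the language of the pretraining speech, suggesting a high degree of generality.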
Abstract: Analyses of self-supervised speech models have begun to reveal where and how
they represent different types of information. However, almost all analyses
have focused on English. Here, we examine how wav2vec2 models trained on four
different languages encode both language-matched and non-matched speech. We use
probing classifiers and geometric analyses to examine how phones, lexical
tones, and speaker information are represented. We show that for all
pretraining and test languages, the subspaces encoding phones, tones, and
speakers are largely orthogonal, and that layerwise patterns of probing
accuracy are similar, with a relatively small advantage for matched-language
phone and tone (but not speaker) probes in the later layers. Our findings
suggest that the structure of representations learned by wav2vec2 is largely
independent of the speech material used during pretraining.
[28] Slimming Down LLMs Without Losing Their Minds
Qingda,Mai
Main category: cs.CL
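TL;DR: The paper studies how parameter-efficient fine-tuning methods (LoRA and QLoRA) affect large language model (LLM) performance, evaluating them on commonsense reasoning, mathematical reasoning, and multi-domain knowledge tasks.
Details
Motivation: As LLMs grow, efficient fine-tuning becomes increasingly necessary. The paper tests whether parameter-efficient methods such as LoRA and QLoRA can improve performance on practical tasks while remaining computationally efficient.
Contribution: 1. Validates the effectiveness of LoRA-based methods for task-specific performance; 2. Shows that alignment between the fine-tuning dataset and the benchmark task is a key factor in performance; 3. Offers practical guidance for developers adapting LLMs efficiently with limited resources.
Method: Applies two parameter-efficient methods, LoRA and QLoRA, and evaluates model performance on three benchmarks: HellaSwag (commonsense reasoning), GSM8K (mathematical reasoning), and MMLU-CS (multi-domain knowledge).
Result: LoRA-based methods improve task-specific performance while remaining computationally efficient; performance depends strongly on how well the fine-tuning dataset aligns with the task.
Insight: Under suitable conditions, parameter-efficient methods can stand in for full-parameter fine-tuning, giving developers with limited resources a workable option.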
Abstract: This paper investigates and validates the impact of fine-tuning on large
language model performance, focusing on parameter-efficient methods (LoRA and
QLoRA). We evaluate model capabilities across three key domains: (1)
commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3)
multi-domain knowledge (MMLU-CS).
Our findings demonstrate that: (1) LoRA-based methods effectively improve
task-specific performance while maintaining computational efficiency, and (2)
performance strongly depends on alignment between fine-tuning dataset and
benchmark tasks. The study provides both theoretical insights into
parameter-efficient mechanisms and practical guidance for developers
implementing efficient LLM adaptation with limited resources.
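As background on why LoRA is parameter-efficient, the sketch below forms the effective weight W + (alpha / r) * B A from a frozen weight W and two small trainable factors, and counts trainable parameters. This follows the standard LoRA formulation rather than anything specific to this paper; the helper names are hypothetical, and QLoRA's quantization of the frozen base weights is not shown.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen d x k weight, a: trainable r x k factor, b: trainable d x r factor.
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_params(d, k, rank):
    """Trainable parameters in the two low-rank factors."""
    return rank * (d + k)
```

For a 4096 x 4096 layer with rank 8, `lora_trainable_params(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix. Initializing B to zeros leaves the effective weight equal to W before training starts.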
[29] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang,Hanlin Zhu,Tianyu Guo,Jiantao Jiao,Somayeh Sojoudi,Michael I. Jordan,Stuart Russell,Song Mei
Main category: cs.CL
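TL;DR: The paper studies two behaviors that large language models (LLMs) show when they learn new knowledge through fine-tuning: generalization and hallucination. The authors argue that both stem from a single mechanism, out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, whether or not those concepts are causally linked.
Details
Motivation: During fine-tuning, LLMs show a puzzling duality of generalization and hallucination, and the mechanism behind it is poorly understood. The authors aim to uncover its root cause and provide a theoretical basis for this behavior.
Contribution: (1) Argues that out-of-context reasoning (OCR) is the shared mechanism behind generalization and hallucination; (2) shows, through experiments and theoretical analysis, the role of the implicit bias of gradient descent in OCR; (3) shows that a simple attention model can learn the OCR task.
Method: The authors formalize OCR as a synthetic factual recall task and show empirically that a one-layer, single-head attention-only transformer with factorized output and value matrices can solve it. The theoretical analysis shows that gradient descent favors solutions that minimize the nuclear norm of the combined output-value matrix, which supports OCR.
Result: Experiments across five prominent LLMs confirm that OCR drives both generalization and hallucination. The theoretical analysis highlights the crucial role of matrix factorization in OCR.
Insight: Generalization and hallucination are not distinct behaviors but the same mechanism under different conditions. The implicit bias of gradient descent is what lets the model learn associations sample-efficiently, regardless of whether the association is causal.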
Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning,
but this process exhibits a puzzling duality: models can generalize remarkably
from new facts, yet are also prone to hallucinating incorrect information.
However, the reasons for this phenomenon remain poorly understood. In this
work, we argue that both behaviors stem from a single mechanism known as
out-of-context reasoning (OCR): the ability to deduce implications by
associating concepts, even those without a causal link. Our experiments across
five prominent LLMs confirm that OCR indeed drives both generalization and
hallucination, depending on whether the associated concepts are causally
related. To build a rigorous theoretical understanding of this phenomenon, we
then formalize OCR as a synthetic factual recall task. We empirically show that
a one-layer single-head attention-only transformer with factorized output and
value matrices can learn to solve this task, while a model with combined
weights cannot, highlighting the crucial role of matrix factorization. Our
theoretical analysis shows that the OCR capability can be attributed to the
implicit bias of gradient descent, which favors solutions that minimize the
nuclear norm of the combined output-value matrix. This mathematical structure
explains why the model learns to associate facts and implications with high
sample efficiency, regardless of whether the correlation is causal or merely
spurious. Ultimately, our work provides a theoretical foundation for
understanding the OCR phenomenon, offering a new lens for analyzing and
mitigating undesirable behaviors from knowledge injection.
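For readers unfamiliar with the nuclear norm, the following records its definition together with a standard variational identity that helps explain why factorized parameterizations are associated with low-nuclear-norm solutions. The symbols $W_O$, $W_V$, and $W_{OV}$ are notation assumed here, and the identity is general background rather than the paper's own theorem.

```latex
% Assumed notation: W_O is the output matrix, W_V the value matrix,
% and W_{OV} = W_O W_V is the combined output-value matrix.
\[
  \|W_{OV}\|_{*}
  \;=\; \sum_{i} \sigma_i(W_{OV})
  \;=\; \min_{A B \,=\, W_{OV}} \tfrac{1}{2}\bigl(\|A\|_F^{2} + \|B\|_F^{2}\bigr),
\]
% where \sigma_i are the singular values and the minimum ranges over
% factorizations whose inner dimension is at least rank(W_{OV}).
```

The right-hand side shows that keeping both factors small in Frobenius norm amounts to keeping the product small in nuclear norm, a penalty that has no direct counterpart when a single combined matrix is trained.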
[30] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
Thomas Sounack,Joshua Davis,Brigitte Durieux,Antoine Chaffin,Tom J. Pollard,Eric Lehman,Alistair E. W. Johnson,Matthew McDermott,Tristan Naumann,Charlotta Lindvall
Main category: cs.CL
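TL;DR: BioClinical ModernBERT is a domain-adapted encoder built on ModernBERT for biomedical and clinical NLP; large-scale continued pretraining and long-context processing improve its task performance.
Details
Motivation: Encoders for biomedical and clinical NLP have developed more slowly than decoder models, which has limited domain adaptation. The authors propose an improved encoder to address this.
Contribution: 1) Introduces BioClinical ModernBERT, which supports long-context processing; 2) pretrains on the largest biomedical and clinical corpus to date (53.5B tokens); 3) outperforms existing encoders on four downstream tasks.
Method: Starts from ModernBERT and performs continued pretraining on multi-source data (20 datasets from different institutions, domains, and geographic regions), improving both performance and speed.
Result: BioClinical ModernBERT outperforms existing biomedical and clinical encoders on four downstream tasks covering a broad range of use cases.
Insight: Multi-source training data and long-context processing are key to improving performance on biomedical and clinical NLP tasks.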
Abstract: Encoder-based transformer models are central to biomedical and clinical
Natural Language Processing (NLP), as their bidirectional self-attention makes
them well-suited for efficiently extracting structured information from
unstructured text through discriminative tasks. However, encoders have seen
slower development compared to decoder models, leading to limited domain
adaptation in biomedical and clinical settings. We introduce BioClinical
ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT
release, incorporating long-context processing and substantial improvements in
speed and performance for biomedical and clinical NLP. BioClinical ModernBERT
is developed through continued pretraining on the largest biomedical and
clinical corpus to date, with over 53.5 billion tokens, and addresses a key
limitation of prior clinical encoders by leveraging 20 datasets from diverse
institutions, domains, and geographic regions, rather than relying on data from
a single source. It outperforms existing biomedical and clinical encoders on
four downstream tasks spanning a broad range of use cases. We release both base
(150M parameters) and large (396M parameters) versions of BioClinical
ModernBERT, along with training checkpoints to support further research.
[31] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning
Lan Zhang,Marco Valentino,Andre Freitas
Main category: cs.CL
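TL;DR: The paper proposes a systematic, automatic LLM-based method for evaluating autoformalization. It introduces multi-dimensional criteria (logical preservation, mathematical consistency, formal validity, and formal quality) that make the assessment more transparent and reliable.
Details
Motivation: In advanced mathematics, evaluating autoformalization requires domain experts and considerable time. Existing LLM-as-a-judge methods usually rely on coarse-grained, generic criteria that miss the nuances of complex mathematical reasoning.
Contribution: Proposes an epistemically and formally grounded ensemble (EFG) of LLM judges with multi-dimensional evaluation criteria, which correlates more strongly with human assessments, especially on formal quality.
Method: Builds an ensemble of LLM judges (EFG) that applies four criteria, logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), to produce a transparent multi-dimensional assessment.
Result: Experiments show that the EFG ensemble agrees with human assessments more closely than a coarse-grained model, with the strongest gains when assessing formal quality.
Insight: When guided by a well-defined set of atomic properties, LLM judges can offer scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.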
Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by
enabling the automatic translation of natural language statements into formal
languages. While recent advances using large language models (LLMs) have shown
promising results, methods for automatically evaluating autoformalization
remain underexplored. As one moves to more complex domains (e.g., advanced
mathematics), human evaluation requires significant time and domain expertise,
especially as the complexity of the underlying statements and background
knowledge increases. LLM-as-a-judge presents a promising approach for
automating such evaluation. However, existing methods typically employ
coarse-grained and generic evaluation criteria, which limit their effectiveness
for advanced formal mathematical reasoning, where quality hinges on nuanced,
multi-granular dimensions. In this work, we take a step toward addressing this
gap by introducing a systematic, automatic method to evaluate autoformalization
tasks. The proposed method is based on an epistemically and formally grounded
ensemble (EFG) of LLM judges, defined on criteria encompassing logical
preservation (LP), mathematical consistency (MC), formal validity (FV), and
formal quality (FQ), resulting in a transparent assessment that accounts for
different contributing factors. We validate the proposed framework to serve as
a proxy for autoformalization assessment within the domain of formal
mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM
judges is a suitable emerging proxy for evaluation, more strongly correlating
with human assessments than a coarse-grained model, especially when assessing
formal qualities. These findings suggest that LLM-as-judges, especially when
guided by a well-defined set of atomic properties, could offer a scalable,
interpretable, and reliable support for evaluating formal mathematical
reasoning.
[32] Magistral
Mistral-AI: Abhinav Rastogi,Albert Q. Jiang,Andy Lo,Gabrielle Berrada,Guillaume Lample,Jason Rute,Joep Barmentlo,Karmesh Yadav,Kartik Khandelwal,Khyathi Raghavi Chandu,Léonard Blier,Lucile Saulnier,Matthieu Dinot,Maxime Darrin,Neha Gupta,Roman Soletskyi,Sagar Vaze,Teven Le Scao,Yihan Wang,Adam Yang,Alexander H. Liu,Alexandre Sablayrolles,Amélie Héliou,Amélie Martin,Andy Ehrenberg,Anmol Agarwal,Antoine Roux,Arthur Darcet,Arthur Mensch,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Chris Bamford,Christian Wallenwein,Christophe Renaudin,Clémence Lanfranchi,Darius Dabert,Devon Mizelle,Diego de las Casas,Elliot Chane-Sane,Emilien Fugier,Emma Bou Hanna,Gauthier Delerce,Gauthier Guinet,Georgii Novikov,Guillaume Martin,Himanshu Jaju,Jan Ludziejewski,Jean-Hadrien Chabran,Jean-Malo Delignon,Joachim Studnia,Jonas Amar,Josselin Somerville Roberts,Julien Denize,Karan Saxena,Kush Jain,Lingxiao Zhao,Louis Martin,Luyu Gao,Lélio Renard Lavaud,Marie Pellat,Mathilde Guillaumin,Mathis Felardos,Maximilian Augustin,Mickaël Seznec,Nikhil Raghuraman,Olivier Duchenne,Patricia Wang,Patrick von Platen,Patryk Saffer,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Pavankumar Reddy Muddireddy,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Romain Sauvestre,Rémi Delacourt,Sanchit Gandhi,Sandeep Subramanian,Shashwat Dalal,Siddharth Gandhi,Soham Ghosh,Srijan Mishra,Sumukh Aithal,Szymon Antoniak,Thibault Schueller,Thibaut Lavril,Thomas Robert,Thomas Wang,Timothée Lacroix,Valeriia Nemychnikova,Victor Paltz,Virgile Richard,Wen-Ding Li,William Marshall,Xuanyu Zhang,Yunhao Tang
Main category: cs.CL
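TL;DR: Magistral is Mistral's first reasoning model, trained with the company's own reinforcement learning (RL) pipeline. The work explores the potential of pure RL training and presents a simple method to force the language in which the model reasons.
Details
Motivation: The goal is to explore the limits of pure reinforcement learning for training large language models (LLMs) without relying on existing implementations or on RL traces distilled from prior models, and to test what RL on text data alone can achieve.
Contribution: 1. Builds a fully in-house, scalable RL training pipeline; 2. Presents a method to force the model's reasoning language; 3. Shows that RL on text alone maintains or improves multimodal understanding, instruction following, and function calling.
Method: Uses a ground-up RL pipeline built solely on the authors' own models and infrastructure to train Magistral Medium (on top of Mistral Medium 3) and the open-source Magistral Small.
Result: RL on text data alone maintains or improves the model's multimodal understanding, instruction following, and function calling.
Insight: Pure RL training is promising: it can maintain or improve existing capabilities without relying on RL traces distilled from prior models.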
Abstract: We introduce Magistral, Mistral’s first reasoning model and our own scalable
reinforcement learning (RL) pipeline. Instead of relying on existing
implementations and RL traces distilled from prior models, we follow a
ground-up approach, relying solely on our own models and infrastructure. Notably, we
demonstrate a stack that enabled us to explore the limits of pure RL training
of LLMs, present a simple method to force the reasoning language of the model,
and show that RL on text data alone maintains most of the initial checkpoint’s
capabilities. We find that RL on text maintains or improves multimodal
understanding, instruction following and function calling. We present Magistral
Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we
open-source Magistral Small (Apache 2.0) which further includes cold-start data
from Magistral Medium.
[33] Dynamic Epistemic Friction in Dialogue
Timothy Obiso,Kenneth Lai,Abhijnan Nath,Nikhil Krishnaswamy,James Pustejovsky
Main category: cs.CL
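TL;DR: The paper addresses "epistemic friction", which work on aligning large language models (LLMs) with humans tends to overlook. It defines dynamic epistemic friction, models it within Dynamic Epistemic Logic, and applies the model to a real dialogue task to predict belief updates.
Details
Motivation: Despite progress in aligning LLMs with humans, neglecting epistemic friction (the resistance to updating beliefs in response to new information) limits model behavior in real dialogue settings.
Contribution: Defines dynamic epistemic friction and places it within the framework of Dynamic Epistemic Logic; presents a model that predicts belief updates in dialogue and demonstrates it empirically.
Method: Models dynamic epistemic friction within Dynamic Epistemic Logic, analyzes it on a situated collaborative task, and discusses how the model can be extended to more complex dialogue scenarios.
Result: The model effectively predicts belief updates in dialogue and can be made more sophisticated to better match the complexity of real-world dialogue.
Insight: Epistemic friction is a key factor in dialogue, and modeling it can help improve how LLMs behave in real interactions.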
Abstract: Recent developments in aligning Large Language Models (LLMs) with human
preferences have significantly enhanced their utility in human-AI collaborative
scenarios. However, such approaches often neglect the critical role of
“epistemic friction,” or the inherent resistance encountered when updating
beliefs in response to new, conflicting, or ambiguous information. In this
paper, we define dynamic epistemic friction as the resistance to epistemic
integration, characterized by the misalignment between an agent’s current
belief state and new propositions supported by external evidence. We position
this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit,
2011), where friction emerges as nontrivial belief-revision during the
interaction. We then present analyses from a situated collaborative task that
demonstrate how this model of epistemic friction can effectively predict belief
updates in dialogues, and we subsequently discuss how the model of belief
alignment as a measure of epistemic resistance or friction can naturally be
made more sophisticated to accommodate the complexities of real-world dialogue
scenarios.
[34] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Mozhi Zhang,Howe Tissue,Lu Wang,Xipeng Qiu
Main category: cs.CL
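TL;DR: Domain2Vec vectorizes datasets in terms of meta-domains and finds the optimal data mixture without training, relying on the Distribution Alignment Assumption; it reduces computational overhead and improves downstream task performance.
Details
Motivation: Existing methods for finding the optimal data mixture require extensive training compute. Domain2Vec offers a training-free alternative that improves efficiency through dataset vectorization and the Distribution Alignment Assumption.
Contribution: Proposes Domain2Vec, which decomposes a dataset into a linear combination of meta-domains, and introduces the Distribution Alignment Assumption (DA²), substantially reducing computational cost.
Method: Maintains a vocabulary of meta-domains and uses a classifier to decompose each dataset into a distribution vector, then finds the optimal data mixture under the Distribution Alignment Assumption without additional training.
Result: Reaches the same validation loss on Pile-CC with only 51.5% of the computation, and improves downstream performance by an average of 2.83% under an equivalent compute budget.
Insight: The Distribution Alignment Assumption gives data-mixture optimization a principled basis, and vectorization improves efficiency and scalability.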
Abstract: We introduce~\textsc{Domain2Vec}, a novel approach that decomposes any
dataset into a linear combination of several \emph{meta-domains}, a new concept
designed to capture the key underlying features of datasets.
\textsc{Domain2Vec} maintains a vocabulary of meta-domains and uses a
classifier to decompose any given dataset into a domain vector that corresponds
to a distribution over this vocabulary. These domain vectors enable the
identification of the optimal data mixture for language model (LM) pretraining
in a training-free manner under the \emph{\textbf{D}istribution
\textbf{A}lignment \textbf{A}ssumption} (DA$^{2}$), which suggests that when
the data distributions of the training set and the validation set are better
aligned, a lower validation loss is achieved. Moreover, \textsc{Domain2Vec} can
be seamlessly integrated into previous works to model the relationship between
domain vectors and LM performance, greatly enhancing the efficiency and
scalability of previous methods. Extensive experiments demonstrate that
\textsc{Domain2Vec} helps find the data mixture that enhances downstream task
performance with minimal computational overhead. Specifically,
\textsc{Domain2Vec} achieves the same validation loss on Pile-CC using only
$51.5\%$ of the computation required when training on the original mixture of
The Pile dataset. Under equivalent compute budget, \textsc{Domain2Vec} improves
downstream performance by an average of $2.83\%$.
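The idea of choosing a mixture whose domain vector aligns with the validation set can be illustrated with a small sketch. This is not the authors' implementation: averaging per-sample classifier probabilities to obtain a domain vector, using L2 distance as the alignment measure, and the grid search over mixture weights are all assumptions made for illustration, as are the function names.

```python
from itertools import product


def domain_vector(per_sample_probs):
    """Average per-sample meta-domain probabilities into one dataset-level vector.

    Assumption: the classifier outputs a probability distribution over the
    meta-domain vocabulary for each sample.
    """
    n = len(per_sample_probs)
    dim = len(per_sample_probs[0])
    return [sum(p[i] for p in per_sample_probs) / n for i in range(dim)]


def mix(vectors, weights):
    """Domain vector of a mixture: weighted sum of per-dataset domain vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def l2_distance(p, q):
    """Euclidean distance between two vectors (an assumed alignment measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def best_mixture(train_vectors, val_vector, steps=10):
    """Grid-search mixture weights on the simplex, minimizing distance to val_vector."""
    best_w, best_d = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(train_vectors)):
        if sum(combo) != steps:
            continue  # keep only weight tuples that sum to 1
        weights = [c / steps for c in combo]
        d = l2_distance(mix(train_vectors, weights), val_vector)
        if d < best_d:
            best_w, best_d = weights, d
    return best_w, best_d


# Toy example: three training sets, each concentrated on one meta-domain.
train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, dist = best_mixture(train, val_vector=[0.5, 0.3, 0.2])
```

No model is trained anywhere in the search; only domain vectors are compared, which is the property the abstract describes as training-free.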
[35] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang,Sang-Woo Lee,Nora Kassner,Daniela Gottesman,Sebastian Riedel,Mor Geva
Main category: cs.CL
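TL;DR: This paper studies how well reasoning models identify and recover from unhelpful thoughts (irrelevant, misleading, or incorrect content). Models can identify such thoughts but recover from them poorly, and larger models struggle more than smaller ones with short irrelevant thoughts. The authors call for better self-reevaluation in reasoning models.
Details
Motivation: To examine the self-reflection ability of reasoning models, in particular how well they identify and correct unhelpful thoughts, with the aim of improving reasoning ability and safety.
Contribution: Reveals the limits of reasoning models in identifying and recovering from unhelpful thoughts, notably an inverse-scaling trend in larger models, and shows the potential risks this poses in practice.
Method: Four types of unhelpful thoughts (irrelevant content, misdirected questions, uninformative rambling, and incorrect answers) are injected into the model's thinking process to evaluate identification and recovery; a "jailbreak" experiment then demonstrates the implications.
Result: Models identify most unhelpful thoughts but recover poorly; larger models perform worse when short irrelevant thoughts are injected; the smallest models are the least distracted by harmful-response-triggering thoughts.
Insight: The self-evaluation ability of current reasoning models is still insufficient, and larger models can fare worse under this kind of interference. Stronger "meta-cognitive" ability is needed for safety and robustness.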
Abstract: Recent reasoning models show the ability to reflect, backtrack, and
self-validate their reasoning, which is crucial in spotting mistakes and
arriving at accurate solutions. A natural question that arises is how
effectively models can perform such self-reevaluation. We tackle this question
by investigating how well reasoning models identify and recover from four types
of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to
the question, thoughts misdirecting the question as a slightly different
question, and thoughts that lead to incorrect answers. We show that models are
effective at identifying most unhelpful thoughts but struggle to recover from
the same thoughts when these are injected into their thinking process, causing
significant performance drops. Models tend to naively continue the line of
reasoning of the injected irrelevant thoughts, which showcases that their
self-reevaluation abilities are far from a general “meta-cognitive” awareness.
Moreover, we observe non/inverse-scaling trends, where larger models struggle
more than smaller ones to recover from short irrelevant thoughts, even when
instructed to reevaluate their reasoning. We demonstrate the implications of
these findings with a jailbreak experiment using irrelevant thought injection,
showing that the smallest models are the least distracted by
harmful-response-triggering thoughts. Overall, our findings call for
improvement in self-reevaluation of reasoning models to develop better
reasoning and safer systems.
cs.CV
[36] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models
Sridhar S,Nithin A,Shakeel Rifath,Vasantha Raj
Main category: cs.CV
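TL;DR: This paper proposes a multimodal cinematic video synthesis method that combines text-to-image and audio generation models to produce 60-second cinematic videos.
Details
Motivation: As generative AI advances, demand for automated multimedia creation is growing, in particular for generating professional-quality cinematic video from text input.
Contribution: A multimodal framework that combines Stable Diffusion, GPT-2, and a hybrid audio pipeline, supporting high-fidelity image synthesis, narrative structuring, and audio-video synchronization, with an optimized interface and performance.
Method: A five-scene framework augmented with linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization, implemented in a GPU-accelerated Google Colab environment with a dual-mode Gradio interface.
Result: The experiments report strong visual quality, narrative coherence, and efficiency, with applications in creative, educational, and industrial settings.
Insight: Combining several generative models with post-processing can yield efficient, high-quality multimodal video synthesis and offers a new approach to automated text-to-video creation.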
Abstract: Advances in generative artificial intelligence have altered multimedia
creation, allowing for automatic cinematic video synthesis from text inputs.
This work describes a method for creating 60-second cinematic movies
incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for
narrative structuring, and a hybrid audio pipeline using gTTS and
YouTube-sourced music. It uses a five-scene framework, which is augmented by
linear frame interpolation, cinematic post-processing (e.g., sharpening), and
audio-video synchronization to provide professional-quality results. It was
created in a GPU-accelerated Google Colab environment using Python 3.11. It has
a dual-mode Gradio interface (Simple and Advanced), which supports resolutions
of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA
memory management and error handling ensure reliability. The experiments
demonstrate outstanding visual quality, narrative coherence, and efficiency,
furthering text-to-video synthesis for creative, educational, and industrial
applications.
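Linear frame interpolation, mentioned above as one of the pipeline's augmentation steps, amounts to blending two keyframes pixel by pixel. The following is a minimal sketch under stated assumptions: frames are represented as nested lists of grayscale values, and the function name `interpolate_frames` is hypothetical; the paper's actual implementation is not described in the abstract.

```python
def interpolate_frames(frame_a, frame_b, num_inbetween):
    """Return num_inbetween frames linearly blended between frame_a and frame_b.

    Each in-between frame is (1 - t) * frame_a + t * frame_b, with t evenly
    spaced strictly between 0 and 1.
    """
    frames = []
    for k in range(1, num_inbetween + 1):
        t = k / (num_inbetween + 1)
        frames.append([
            [(1 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ])
    return frames


# Two 1x2 "frames"; three in-between frames at t = 0.25, 0.5, 0.75.
start = [[0.0, 100.0]]
end = [[100.0, 0.0]]
blended = interpolate_frames(start, end, num_inbetween=3)
```

Raising `num_inbetween` increases the effective frame rate between generated keyframes without invoking the image model again.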
[37] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
Chenjian Gao,Lihe Ding,Xin Cai,Zhanpeng Huang,Zibin Wang,Tianfan Xue
Main category: cs.CV
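TL;DR: LoRA-Edit proposes a mask-driven LoRA fine-tuning method that gives flexible control over video editing, avoiding large-scale pretraining while preserving background regions and controlling how edits propagate.
Details
Motivation: Existing video editing methods rely on large-scale pretraining, which limits flexibility, and first-frame-guided editing offers little control over subsequent frames. The authors propose mask-aware LoRA fine-tuning to address both issues.
Contribution: 1. A mask-driven LoRA fine-tuning method that adapts pretrained I2V models; 2. Additional references (such as alternate viewpoints) that improve editing controllability; 3. Experiments showing better results than existing methods.
Method: 1. A spatial mask dynamically modulates what the model attends to, separating the regions learned from the input video from those learned from the reference images; 2. LoRA fine-tunes the pretrained model without any architectural change.
Result: Experiments show that the method outperforms existing techniques on video editing tasks, achieving high-quality, flexible edits.
Insight: Combining masks with LoRA gives an efficient and flexible approach to video editing and shows the value of multiple reference inputs.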
Abstract: Video editing using diffusion models has achieved remarkable results in
generating high-quality edits for videos. However, current methods often rely
on large-scale pretraining, limiting flexibility for specific edits.
First-frame-guided editing provides control over the first frame, but lacks
flexibility over subsequent frames. To address this, we propose a mask-based
LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video
(I2V) models for flexible video editing. Our approach preserves background
regions while enabling controllable edit propagation. This solution offers
efficient and adaptable video editing without altering the model architecture.
To better steer this process, we incorporate additional references, such as
alternate viewpoints or representative scene states, which serve as visual
anchors for how content should unfold. We address the control challenge using a
mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model
to the editing context. The model must learn from two distinct sources: the
input video provides spatial structure and motion cues, while reference images
offer appearance guidance. A spatial mask enables region-specific learning by
dynamically modulating what the model attends to, ensuring that each area draws
from the appropriate source. Experimental results show our method achieves
superior video editing performance compared to state-of-the-art methods.
[38] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding
Bin Guo,John H. L. Hansen
Main category: cs.CV
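TL;DR: DeepTraverse is a new vision architecture inspired by depth-first search. It builds more structured and interpretable feature representations through recursive exploration and adaptive calibration modules, and performs strongly on image classification.
Details
Motivation: Conventional vision models lack explicit pathways for adaptive, iterative refinement when building features. Can principles from classical search algorithms produce a more structured and logical processing flow?
Contribution: Proposes DeepTraverse, a search-inspired vision architecture that constructs and refines features efficiently through recursive exploration and adaptive calibration modules.
Method: Recursive exploration modules deepen feature analysis along promising paths, and adaptive calibration modules dynamically adjust feature salience, yielding structured feature learning.
Result: Across diverse image classification tasks, DeepTraverse often outperforms conventional models while being more parameter-efficient.
Insight: Building algorithmic priors into vision model design can improve efficiency, performance, and structure.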
Abstract: Conventional vision backbones, despite their success, often construct
features through a largely uniform cascade of operations, offering limited
explicit pathways for adaptive, iterative refinement. This raises a compelling
question: can principles from classical search algorithms instill a more
algorithmic, structured, and logical processing flow within these networks,
leading to representations built through more interpretable, perhaps
reasoning-like decision processes? We introduce DeepTraverse, a novel vision
architecture directly inspired by algorithmic search strategies, enabling it to
learn features through a process of systematic elucidation and adaptive
refinement distinct from conventional approaches. DeepTraverse operationalizes
this via two key synergistic components: recursive exploration modules that
methodically deepen feature analysis along promising representational paths
with parameter sharing for efficiency, and adaptive calibration modules that
dynamically adjust feature salience based on evolving global context. The
resulting algorithmic interplay allows DeepTraverse to intelligently construct
and refine feature patterns. Comprehensive evaluations across a diverse suite
of image classification benchmarks show that DeepTraverse achieves highly
competitive classification accuracy and robust feature discrimination, often
outperforming conventional models with similar or larger parameter counts. Our
work demonstrates that integrating such algorithmic priors provides a
principled and effective strategy for building more efficient, performant, and
structured vision backbones.
[39] Test-Time Adaptation for Generalizable Task Progress Estimation
Christos Ziakas,Alessandra Russo
Main category: cs.CV
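TL;DR: Proposes a test-time adaptation method in which a progress estimation model adapts online to the visual and temporal context of test trajectories by optimizing a self-supervised objective.
Details
Motivation: Progress estimation models generalize poorly to out-of-distribution tasks, environments, and embodiments. The authors propose test-time adaptation that uses expert visual trajectories and natural language task descriptions to make the model adaptable.
Contribution: 1. A gradient-based meta-learning strategy that trains the model to adapt at test time; 2. Progress estimation that relies on semantic content over temporal order; 3. Better results than existing methods across diverse out-of-distribution scenarios.
Method: The model is trained with gradient-based meta-learning on expert visual trajectories and their natural language task descriptions; at test time it adapts by optimizing a self-supervised objective, emphasizing semantic content rather than temporal order.
Result: On out-of-distribution tasks, environments, and embodiments, the method outperforms the state-of-the-art in-context learning approach based on autoregressive vision-language models.
Insight: Test-time adaptation combined with a semantics-first strategy markedly improves the generalization of progress estimation models, especially in out-of-distribution settings.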
Abstract: We propose a test-time adaptation method that enables a progress estimation
model to adapt online to the visual and temporal context of test trajectories
by optimizing a learned self-supervised objective. To this end, we introduce a
gradient-based meta-learning strategy to train the model on expert visual
trajectories and their natural language task descriptions, such that test-time
adaptation improves progress estimation relying on semantic content over
temporal order. Our test-time adaptation method generalizes from a single
training environment to diverse out-of-distribution tasks, environments, and
embodiments, outperforming the state-of-the-art in-context learning approach
using autoregressive vision-language models.
[40] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Yantai Yang,Yuhao Wang,Zichen Wen,Luo Zhongwei,Chang Zou,Zhipeng Zhang,Chuan Wen,Linfeng Zhang
Main category: cs.CV
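TL;DR: EfficientVLA is a training-free acceleration framework that speeds up VLA model inference through three strategies (layer pruning, visual token selection, and caching of intermediate features) while largely preserving performance.
Details
Motivation: Existing VLA models, such as diffusion-based architectures, have high compute and memory demands that limit practical deployment. Existing acceleration methods usually target isolated inefficiencies and do not address redundancy across the whole pipeline.
Contribution: Proposes the EfficientVLA framework, which removes redundancy systematically through three strategies: 1) pruning the language module; 2) optimizing visual tokens; 3) caching intermediate features in the diffusion action head.
Method: Combines pruning of redundant layers in the language module, task-aware visual token selection to streamline visual processing, and feature caching in the diffusion action head to reduce temporal redundancy.
Result: On the CogACT model, it achieves a 1.93x speedup and reduces FLOPs to 28.9%, with only a 0.6% drop in task success rate.
Insight: Training-free acceleration can improve efficiency substantially while preserving model performance, making practical deployment of VLA models more feasible.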
Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based
architectures, demonstrate transformative potential for embodied intelligence
but are severely hampered by high computational and memory demands stemming
from extensive inherent and inference-time redundancies. While existing
acceleration efforts often target isolated inefficiencies, such piecemeal
solutions typically fail to holistically address the varied computational and
memory bottlenecks across the entire VLA pipeline, thereby limiting practical
deployability. We introduce EfficientVLA, a structured and training-free
inference acceleration framework that systematically eliminates these barriers
by cohesively exploiting multifaceted redundancies. EfficientVLA
synergistically integrates three targeted strategies: (1) pruning of
functionally inconsequential layers from the language module, guided by an
analysis of inter-layer redundancies; (2) optimizing the visual processing
pathway through a task-aware strategy that selects a compact, diverse set of
visual tokens, balancing task-criticality with informational coverage; and (3)
alleviating temporal computational redundancy within the iterative
diffusion-based action head by strategically caching and reusing key
intermediate features. We apply our method to a standard VLA model, CogACT,
yielding a 1.93x inference speedup and reducing FLOPs to 28.9%, with only a 0.6%
success rate drop on the SIMPLER benchmark.
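The third strategy, caching and reusing intermediate features across denoising steps, can be illustrated roughly as follows. This is a minimal sketch and not EfficientVLA's implementation: the abstract does not say which features are cached or when they are refreshed, so the fixed refresh interval and the callables `compute_features` and `apply_step` are hypothetical stand-ins.

```python
def denoise_with_cache(x, num_steps, refresh_every, compute_features, apply_step):
    """Run num_steps denoising steps with periodic feature caching.

    Expensive features are recomputed only every refresh_every steps
    (an assumed policy) and the cached value is reused in between.
    Returns the final state and the number of feature recomputations.
    """
    cached = None
    recomputations = 0
    for step in range(num_steps):
        if step % refresh_every == 0:
            cached = compute_features(x, step)
            recomputations += 1
        x = apply_step(x, cached, step)
    return x, recomputations


# Toy stand-ins (hypothetical): the "feature" is the current value itself,
# and each step subtracts 10% of the cached feature from x.
final_x, n_recomputed = denoise_with_cache(
    x=8.0,
    num_steps=10,
    refresh_every=3,
    compute_features=lambda x, step: x,
    apply_step=lambda x, feats, step: x - 0.1 * feats,
)
```

With ten steps and a refresh interval of three, features are recomputed at steps 0, 3, 6, and 9, so four of the ten feature computations remain; the trade-off is that the intermediate steps act on slightly stale features.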
[41] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild
Klim Kireev,Ana-Maria Creţu,Raphael Meier,Sarah Adel Bargal,Elissa Redmiles,Carmela Troncoso
Main category: cs.CV
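TL;DR: The paper releases ICCWD, a multimodal image-caption dataset for detecting image content that depicts minors, and demonstrates its utility by benchmarking three detectors.
Details
Motivation: No dataset currently exists for detecting content depicting minors in a multimodal setting. The paper aims to fill this gap and support the development and evaluation of machine learning tools.
Contribution: The main contribution is the release of ICCWD, a manually annotated dataset of 10,000 image-caption pairs for benchmarking the detection of content depicting minors.
Method: The ICCWD dataset is built by manually labeling image-caption pairs, and three detectors, including a commercial age estimation system, are benchmarked on it.
Result: The experiments indicate that detecting minors is a challenging task; the best method achieves a true positive rate of 75.3%.
Insight: ICCWD supports the design of better methods for detecting minors, particularly in multimodal settings.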
Abstract: Platforms and the law regulate digital content depicting minors (defined as
individuals under 18 years of age) differently from other types of content.
Given the sheer amount of content that needs to be assessed, machine
learning-based automation tools are commonly used to detect content depicting
minors. To our knowledge, no dataset or benchmark currently exists for
detecting these identification methods in a multi-modal environment. To fill
this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an
image-caption dataset aimed at benchmarking tools that detect depictions of
minors. Our dataset is richer than previous child image datasets, containing
images of children in a variety of contexts, including fictional depictions and
partially visible bodies. ICCWD contains 10,000 image-caption pairs manually
labeled to indicate the presence or absence of a child in the image. To
demonstrate the possible utility of our dataset, we use it to benchmark three
different detectors, including a commercial age estimation system applied to
images. Our results suggest that child detection is a challenging task, with
the best method achieving a 75.3% true positive rate. We hope the release of
our dataset will aid in the design of better minor detection methods in a wide
range of scenarios.
[42] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers
Natanael Lucena,Fábio S. da Silva,Ricardo Rios
Main category: cs.CV
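TL;DR: The paper compares CNNs and Vision Transformers (ViTs) on classifying images of psoriasis lesions and finds that ViTs perform better at smaller model sizes. DaViT-B, with an F1-score of 96.4%, is recommended as the most efficient architecture for automated psoriasis detection.
Details
Motivation: To explore the potential of different deep learning architectures, ViTs in particular, for medical image classification, in order to improve the efficiency and accuracy of automated psoriasis detection.
Contribution: Shows that ViTs perform better on this medical image classification task, especially at smaller model sizes, and recommends DaViT-B as the best architecture for psoriasis detection.
Method: Pretrained CNN and ViT models are adapted to a task-specific dataset and compared, with a focus on the trade-off between model performance and size.
Result: ViTs, DaViT-B in particular, achieve the best F1-score (96.4%), ahead of the CNN models.
Insight: The results point to significant potential for ViTs in medical image classification, especially when lightweight, efficient models are needed.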
Abstract: This paper presents a comparison of the performance of Convolutional Neural
Networks (CNNs) and Vision Transformers (ViTs) in the task of multi-classifying
images containing lesions of psoriasis and diseases similar to it. Models
pre-trained on ImageNet were adapted to a specific data set. Both achieved high
predictive metrics, but the ViTs stood out for their superior performance with
smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the
best results, with an f1-score of 96.4%, and is recommended as the most
efficient architecture for automated psoriasis detection. This article
reinforces the potential of ViTs for medical image classification tasks.
[43] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Xiyao Wang,Zhengyuan Yang,Chao Feng,Yongyuan Liang,Yuhang Zhou,Xiaoyu Liu,Ziyi Zang,Ming Li,Chung-Ching Lin,Kevin Lin,Linjie Li,Furong Huang,Lijuan Wang
Main category: cs.CV
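TL;DR: The paper proposes ViCrit, a verifiable reinforcement learning proxy task that improves the visual perception of vision-language models (VLMs) by training them to localize visual hallucination errors in text, and validates it on multiple benchmarks.
Details
Motivation: Reinforcement learning (RL) works well for large language models (LLMs), but VLMs lack vision-centric tasks that are both verifiable and challenging. ViCrit aims to fill this gap.
Contribution: 1. The ViCrit task, which trains VLMs to localize a visual hallucination error injected into text. 2. ViCrit-Bench, a benchmark that systematically evaluates perception across multiple domains. 3. Results showing that ViCrit substantially improves performance on a range of tasks and generalizes.
Method: A single visual description error (for example, in an object, attribute, count, or spatial relation) is injected into a 200-word human-written image caption. Given the image and the modified caption, the model must pinpoint the corrupted span, and receives a binary reward.
Result: Models trained with ViCrit improve substantially on a variety of VL tasks and generalize to abstract image reasoning and visual math.
Insight: The gains suggest that ViCrit strengthens genuine visual perception rather than mere memorization of seen objects, offering a new direction for improving VLMs.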
Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning
large language models (LLMs) using tasks that are challenging yet easily
verifiable, such as math reasoning or code generation. However, extending this
success to visual perception in vision-language models (VLMs) has been impeded
by the scarcity of vision-centric tasks that are simultaneously challenging and
unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption
Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle,
synthetic visual hallucination injected into paragraphs of human-written image
captions. Starting from 200-word captions, we inject a single, subtle visual
description error, altering a few words on objects, attributes, counts, or
spatial relations, and task the model to pinpoint the corrupted span given the
image and the modified caption. This formulation preserves the full perceptual
difficulty while providing a binary, exact-match reward that is easy to compute
and unambiguous. Models trained with the ViCrit Task exhibit substantial gains
across a variety of VL benchmarks. Crucially, the improvements transfer beyond
natural-image training data to abstract image reasoning and visual math,
showing promise of learning to perceive rather than merely memorizing seen
objects. To facilitate evaluation, we further introduce ViCrit-Bench, a
category-balanced diagnostic benchmark that systematically probes perception
errors across diverse image domains and error types. Together, our results
demonstrate that fine-grained hallucination criticism is an effective and
generalizable objective for enhancing visual perception in VLMs.
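The binary, exact-match reward described in the abstract is simple to compute. The sketch below is illustrative only: the whitespace and case normalization is an assumption (the abstract does not say how spans are compared), and the function names are hypothetical.

```python
def normalize(span):
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(span.lower().split())


def exact_match_reward(predicted_span, corrupted_span):
    """Return 1.0 if the predicted span matches the injected error exactly, else 0.0."""
    return 1.0 if normalize(predicted_span) == normalize(corrupted_span) else 0.0


# The injected error is "red umbrella"; only an exact hit is rewarded.
hit = exact_match_reward("Red  Umbrella", "red umbrella")
miss = exact_match_reward("blue umbrella", "red umbrella")
```

Because the corrupted span is known at injection time, the reward needs no learned judge, which is what makes the task verifiable.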
[44] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context
Yael Frischholz,Devis Tuia,Michael Lehning
Main category: cs.CV
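TL;DR: The paper proposes an attention-based model that implicitly learns clear-sky surface reflectance from temporal sequences of satellite images, enabling accurate retrieval of surface solar radiation (SSR) without hand-crafted features such as albedo maps or cloud masks.
Details
Motivation: Conventional SSR retrieval algorithms estimate background reflectance from monthly statistics, which works poorly in mountainous regions with complex terrain and dynamic snow cover. The paper aims to address this problem.
Contribution: Proposes an attention-based model built on the Temporo-Spatial Vision Transformer that implicitly learns the dynamics of surface reflectance, improving SSR retrieval accuracy in complex terrain.
Method: The model takes multi-spectral satellite imagery (SEVIRI), static topographic features, and solar geometry as input, implicitly learns surface reflectance from temporal context, and outputs SSR estimates. It is trained on outputs of the HelioMont algorithm over Switzerland.
Result: Experiments show that, given a sufficiently long temporal context, the model matches the performance of albedo-informed models, and the effect is strongest in mountainous regions.
Insight: Temporal information is essential for implicitly learning surface reflectance dynamics and improves generalization, especially in complex terrain.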
Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery
critically depends on estimating the background reflectance that a spaceborne
sensor would observe under clear-sky conditions. Deviations from this baseline
can then be used to detect cloud presence and guide radiative transfer models
in inferring atmospheric attenuation. Operational retrieval algorithms
typically approximate background reflectance using monthly statistics, assuming
surface properties vary slowly relative to atmospheric conditions. However,
this approach fails in mountainous regions where intermittent snow cover and
changing snow surfaces are frequent. We propose an attention-based emulator for
SSR retrieval that implicitly learns to infer clear-sky surface reflectance
from raw satellite image sequences. Built on the Temporo-Spatial Vision
Transformer, our approach eliminates the need for hand-crafted features such as
explicit albedo maps or cloud masks. The emulator is trained on instantaneous
SSR estimates from the HelioMont algorithm over Switzerland, a region
characterized by complex terrain and dynamic snow cover. Inputs include
multi-spectral SEVIRI imagery from the Meteosat Second Generation platform,
augmented with static topographic features and solar geometry. The target
variable is HelioMont’s SSR, computed as the sum of its direct and diffuse
horizontal irradiance components, given at a spatial resolution of 1.7 km. We
show that, when provided a sufficiently long temporal context, the model
matches the performance of albedo-informed models, highlighting the model’s
ability to internally learn and exploit latent surface reflectance dynamics.
Our geospatial analysis shows this effect is most powerful in mountainous
regions and improves generalization in both simple and complex topographic
settings. Code and datasets are publicly available at
https://github.com/frischwood/HeMu-dev.git
[45] Attention, Please! Revisiting Attentive Probing for Masked Image Modeling
Bill Psomas,Dionysis Christopoulos,Eirini Baltzi,Ioannis Kakogeorgiou,Tilemachos Aravanis,Nikos Komodakis,Konstantinos Karantzalos,Yannis Avrithis,Giorgos Tolias
Main category: cs.CV
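TL;DR: The paper proposes efficient probing (EP), an attentive probing method based on multi-query cross-attention that substantially reduces trainable parameters and compute while outperforming existing methods.
Details
Motivation: As self-supervised learning (SSL) becomes widespread, standard linear probing (LP) fails to reflect the full potential of models trained with masked image modeling (MIM), so a more efficient attentive probing method is needed.
Contribution: The main contribution is efficient probing (EP), a multi-query cross-attention mechanism that improves both computational efficiency and accuracy and outperforms existing methods on multiple benchmarks.
Method: A multi-query cross-attention mechanism (EP) removes redundant projections and reduces the number of trainable parameters, improving efficiency.
Result: EP outperforms LP and other attentive probing methods on seven benchmarks and performs well in low-shot and layer-wise settings.
Insight: Efficient attentive probing can improve both the quality and the cost of model evaluation, and applies to masked image modeling as well as other pre-training paradigms.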
Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is
emerging as the preferred evaluation protocol for self-supervised learning
(SSL). Yet, the standard linear probing (LP) fails to adequately reflect the
potential of models trained with Masked Image Modeling (MIM), due to the
distributed nature of patch tokens. This motivates the need for attentive
probing, an alternative that uses attention to selectively aggregate
patch-level features. Despite its growing adoption, attentive probing remains
under-explored, with existing methods suffering from excessive parameterization
and poor computational efficiency.
In this work, we revisit attentive probing through the lens of the
accuracy-efficiency trade-off. We conduct a systematic study of existing
methods, analyzing their mechanisms and benchmarking their performance. We
introduce efficient probing (EP), a multi-query cross-attention mechanism that
eliminates redundant projections, reduces the number of trainable parameters,
and achieves up to a 10$\times$ speed-up over conventional multi-head
attention. Despite its simplicity, EP outperforms LP and prior attentive
probing approaches across seven benchmarks, generalizes well beyond MIM to
diverse pre-training paradigms, produces interpretable attention maps, and
achieves strong gains in low-shot and layer-wise settings. Code available at
https://github.com/billpsomas/efficient-probing.
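To make the idea of attention-based pooling with several learnable queries concrete, here is a small sketch. It is not the EP implementation from the linked repository: using the raw patch tokens as both keys and values, and letting each query pool its own slice of the channel dimension, are assumptions chosen to illustrate how projections can be avoided.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_pool(tokens, queries):
    """Pool N x D patch tokens with M query vectors (D must be divisible by M).

    Query j attends over all tokens and aggregates the j-th slice of D/M
    channels; the slices are concatenated into one D-dimensional descriptor.
    """
    dim = len(tokens[0])
    chunk = dim // len(queries)
    pooled = []
    for j, q in enumerate(queries):
        scores = [
            sum(qi * ti for qi, ti in zip(q, tok)) / math.sqrt(dim)
            for tok in tokens
        ]
        attn = softmax(scores)
        for c in range(j * chunk, (j + 1) * chunk):
            pooled.append(sum(a * tok[c] for a, tok in zip(attn, tokens)))
    return pooled


# Two patch tokens with four channels; zero queries give uniform attention.
patch_tokens = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0]]
zero_queries = [[0.0] * 4, [0.0] * 4]
descriptor = multi_query_pool(patch_tokens, zero_queries)
```

In a real probe the queries would be trained and the pooled descriptor fed to a linear classifier; here the only parameters are the query vectors themselves.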
[46] Improving Personalized Search with Regularized Low-Rank Parameter Updates
Fiona Ryan,Josef Sivic,Fabian Caba Heilbron,Judy Hoffman,James M. Rehg,Bryan Russell
Main category: cs.CV
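TL;DR: The paper improves personalized vision-language retrieval through regularized low-rank parameter updates. Adapting parameters of the language encoder balances personal and general knowledge and achieves state-of-the-art results on the DeepFashion2 and ConCon-Chi datasets.
Details
Motivation: Personalized vision-language retrieval must learn new concepts (such as "my dog Fido") from a few examples while combining personal and general knowledge. Existing methods such as textual inversion have limitations, and the paper explores a more effective alternative.
Contribution: 1. A regularized low-rank update that adapts a small set of parameters in the final layer of the language encoder, balancing personal and general knowledge; 2. A study of strategies for combining the parameters of multiple concepts; 3. A metric for general-knowledge preservation based on VLM-generated captions.
Method: Regularized low-rank adaptation is applied to parameters in the final layer of the language encoder; parameter addition is explored as a strategy for combining multiple concepts; retrieval accuracy on VLM-generated captions measures how well general knowledge is preserved.
Result: On two personalized image retrieval benchmarks (DeepFashion2 and ConCon-Chi), the method outperforms prior work by 4%-22%.
Insight: Low-rank parameter updates are an efficient route to personalized retrieval; parameter addition is an effective way to combine several personal concepts; VLM-generated captions can serve as a tool for evaluating general-knowledge preservation.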
Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g.
“my dog Fido”) from only a few examples. This task is challenging because it
requires not only learning a new concept from a few images, but also
integrating the personal and general knowledge together to recognize the
concept in different contexts. In this paper, we show how to effectively adapt
the internal representation of a vision-language dual encoder model for
personalized vision-language retrieval. We find that regularized low-rank
adaptation of a small set of parameters in the language encoder’s final layer
serves as a highly effective alternative to textual inversion for recognizing
the personal concept while preserving general knowledge. Additionally, we
explore strategies for combining parameters of multiple learned personal
concepts, finding that parameter addition is effective. To evaluate how well
general knowledge is preserved in a finetuned representation, we introduce a
metric that measures image retrieval accuracy based on captions generated by a
vision language model (VLM). Our approach achieves state-of-the-art accuracy on
two benchmarks for personalized image retrieval with natural language queries,
DeepFashion2 and ConCon-Chi, outperforming the prior art by 4%-22% on personal
retrievals.
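The two mechanisms named in the abstract, a low-rank update to a weight matrix and parameter addition to combine several concepts, can be sketched with plain matrices. This is a toy illustration under stated assumptions: the L2 penalty on the update is one possible regularizer (the abstract does not specify the form), and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def low_rank_update(B, A):
    """Rank-r update delta = B @ A, with B of shape d_out x r and A of shape r x d_in."""
    return matmul(B, A)


def combine_concepts(W, updates):
    """Combine several personal concepts by adding their updates to the base weights."""
    out = W
    for delta in updates:
        out = add(out, delta)
    return out


def l2_penalty(delta, strength):
    """Assumed regularizer: penalize the squared Frobenius norm of the update."""
    return strength * sum(v * v for row in delta for v in row)


# Base weights plus two rank-1 updates, one per personal concept.
W = [[1, 0], [0, 1]]
delta_fido = low_rank_update([[1], [0]], [[0, 2]])   # [[0, 2], [0, 0]]
delta_mug = low_rank_update([[0], [1]], [[3, 0]])    # [[0, 0], [3, 0]]
W_personal = combine_concepts(W, [delta_fido, delta_mug])
```

Keeping each update small, which is what the penalty encourages, is one plausible way to keep the adapted encoder close to the original and so preserve general knowledge.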
[47] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators
Parsa Rahimi,Sebastien Marcel
Main category: cs.CV
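TL;DR: ScoreMix uses a score-mixing strategy in diffusion models to generate challenging synthetic samples, substantially improving discriminator performance, especially when labeled data is limited.
Details
Motivation: Data augmentation falls short when training discriminative models with limited labeled data. The paper uses the score-composition property of diffusion models to generate more effective synthetic samples.
Contribution: Proposes ScoreMix, which exploits score mixing in diffusion models to generate high-quality synthetic samples that substantially improve discriminator performance.
Method: During diffusion sampling, scores from different classes are combined convexly to generate synthetic samples; class-selection strategies are studied experimentally.
Result: ScoreMix substantially improves discriminator performance on multiple benchmarks, especially when data is limited.
Insight: Mixing classes that are far apart in the discriminator's embedding space is more effective than mixing classes that are close in the generator's condition space, and the spaces learned by the generator and the discriminator are only weakly correlated.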
Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation
strategy leveraging the score compositional properties of diffusion models to
enhance discriminator performance, particularly under scenarios with limited
labeled data. By convexly mixing the scores from different class-conditioned
trajectories during diffusion sampling, we generate challenging synthetic
samples that significantly improve discriminative capabilities in all studied
benchmarks. We systematically investigate class-selection strategies for mixing
and discover that greater performance gains arise when combining classes
distant in the discriminator’s embedding space, rather than close in the
generator’s condition space. Moreover, we empirically show that, under standard
metrics, the correlation between the generator’s learned condition space and
the discriminator’s embedding space is minimal. Our approach achieves notable
performance improvements without extensive parameter searches, demonstrating
practical advantages for training discriminative models while effectively
mitigating problems regarding collections of large datasets. Paper website:
https://parsa-ra.github.io/scoremix
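Convexly mixing two class-conditioned scores can be written as $s_{\text{mix}} = \lambda s_a + (1 - \lambda) s_b$ with $\lambda \in [0, 1]$. The sketch below illustrates only that combination step; the unit-variance Gaussian scores are toy stand-ins for a diffusion model's class-conditional score network, and the function names are hypothetical.

```python
def mix_scores(score_a, score_b, lam):
    """Convex combination lam * score_a + (1 - lam) * score_b of two score vectors."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")
    return [lam * a + (1.0 - lam) * b for a, b in zip(score_a, score_b)]


def gaussian_score(x, mean):
    """Score (gradient of the log-density) of a unit-variance Gaussian centred at mean."""
    return [m - xi for xi, m in zip(x, mean)]


# Toy stand-in for two class-conditioned scores evaluated at the same point x.
x = [0.0, 0.0]
score_class_a = gaussian_score(x, mean=[2.0, 0.0])
score_class_b = gaussian_score(x, mean=[0.0, 4.0])
mixed = mix_scores(score_class_a, score_class_b, lam=0.5)
```

In an actual sampler this combination would be applied at every denoising step, so the trajectory is pulled toward both classes at once.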
[48] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops
Hamid Kamangir,Mona Hajiesmaeeli,Mason Earles
Main category: cs.CV
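TL;DR: The paper presents a comprehensive California crop yield benchmark dataset that combines satellite imagery, climate, evapotranspiration, and soil data, and develops a multi-modal deep learning model for county-level yield forecasting of over 70 crops, reaching an overall R² of 0.76.
Details
Motivation: California is a global leader in agricultural production, but accurate and timely crop yield forecasting remains difficult because of the complex interplay of environmental, climatic, and soil factors.
Contribution: The main contributions are a benchmark dataset covering over 70 crops in California and a multi-modal deep learning model for county-level crop yield forecasting.
Method: The approach integrates multiple data sources (Landsat satellite imagery, climate records, evapotranspiration, and soil data) and uses stratified feature extraction with a time-series encoder to capture spatial and temporal dynamics during the growing season.
Result: The model achieves an overall R² of 0.76 on an unseen test set, indicating strong predictive performance.
Insight: The work offers a useful framework for agricultural forecasting, climate adaptation, and precision farming.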
Abstract: California is a global leader in agricultural production, contributing 12.5%
of the United States' total output and ranking as the fifth-largest food and
cotton supplier in the world. Despite the availability of extensive historical
yield data from the USDA National Agricultural Statistics Service, accurate and
timely crop yield forecasting remains a challenge due to the complex interplay
of environmental, climatic, and soil-related factors. In this study, we
introduce a comprehensive crop yield benchmark dataset covering over 70 crops
across all California counties from 2008 to 2022. The benchmark integrates
diverse data sources, including Landsat satellite imagery, daily climate
records, monthly evapotranspiration, and high-resolution soil properties. To
effectively learn from these heterogeneous inputs, we develop a multi-modal
deep learning model tailored for county-level, crop-specific yield forecasting.
The model employs stratified feature extraction and a timeseries encoder to
capture spatial and temporal dynamics during the growing season. Static inputs
such as soil characteristics and crop identity inform long-term variability.
Our approach achieves an overall R² score of 0.76 across all crops on an unseen
test dataset, highlighting strong predictive performance across California's
diverse agricultural regions. This benchmark and modeling framework offer a
valuable foundation for advancing agricultural forecasting, climate adaptation,
and precision farming. The full dataset and codebase are publicly available at
our GitHub repository.
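For reference, the R² score reported above is the coefficient of determination, $R^2 = 1 - \mathrm{SS}_{\text{res}} / \mathrm{SS}_{\text{tot}}$. The sketch below shows the standard definition only; how the paper pools predictions across crops and counties before computing the overall score is not stated in the abstract, and the yield values are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up yields: a perfect forecast scores 1.0, and always predicting
# the mean of the observations scores 0.0.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(observed, [1.0, 2.0, 3.0, 4.0])
mean_only = r2_score(observed, [2.5, 2.5, 2.5, 2.5])
```

A score of 0.76 therefore means the model's squared error is 24% of what a constant mean prediction would incur on the same data.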
[49] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos
Rajeev Yasarla,Shizhong Han,Hong Cai,Fatih Porikli
Main category: cs.CV
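TL;DR: DySS is an efficient method for 3D object detection from multi-camera videos based on dynamic queries and state-space learning, using sparse queries and a state-space model to improve both accuracy and efficiency.
Details
Motivation: Existing multi-camera 3D detection methods rely on dense BEV features or large numbers of queries, which is computationally expensive and hard to scale to many video frames. DySS addresses this with dynamic queries and state-space learning.
Contribution: 1. The DySS method, which combines a state-space model (SSM) with dynamic queries for 3D detection; 2. Auxiliary tasks of future prediction and masked reconstruction that improve SSM training; 3. Dynamic query operations (merge, remove, split) that reduce redundant queries.
Method: 1. An SSM processes features sequentially over time steps and learns a state of the scene; 2. Queries are updated dynamically to reduce redundancy; 3. Auxiliary tasks strengthen the SSM's spatio-temporal modeling.
Result: On the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming existing methods; on the val split it achieves 56.2 NDS and 46.2 mAP with a real-time inference speed of 33 FPS.
Insight: Sparse queries and state-space learning can markedly improve the efficiency and accuracy of multi-camera video 3D detection, and the dynamic query mechanism helps reduce the computational load.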
Abstract: Camera-based 3D object detection in Bird’s Eye View (BEV) is one of the most
important perception tasks in autonomous driving. Earlier methods rely on dense
BEV features, which are costly to construct. More recent works explore sparse
query-based detection. However, they still require a large number of queries
and can become expensive to run when more video frames are used. In this paper,
we propose DySS, a novel method that employs state-space learning and dynamic
queries. More specifically, DySS leverages a state-space model (SSM) to
sequentially process the sampled features over time steps. In order to
encourage the model to better capture the underlying motion and correspondence
information, we introduce auxiliary tasks of future prediction and masked
reconstruction to better train the SSM. The state of the SSM then provides an
informative yet efficient summarization of the scene. Based on the state-space
learned features, we dynamically update the queries via merge, remove, and
split operations, which help maintain a useful, lean set of detection queries
throughout the network. Our proposed DySS achieves both superior detection
performance and efficient inference. Specifically, on the nuScenes test split,
DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the
art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a
real-time inference speed of 33 FPS.
[50] HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Eunkyu Park,Minyeong Kim,Gunhee Kim
Main category: cs.CV
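TL;DR: HalLoc introduces a new dataset and a baseline model for efficient, probabilistic hallucination detection, improving the reliability of vision-language models.
Details
Motivation: Current hallucination detection methods are computationally expensive and cannot handle real-world cases where the boundary between hallucinated and truthful content is unclear.
Contribution: 1) The HalLoc dataset with 150K token-level annotated samples; 2) A low-overhead baseline model that detects hallucinations concurrently with generation.
Method: A baseline model is trained on the HalLoc dataset to detect hallucinations with graded confidence, and it can be integrated seamlessly into existing VLMs.
Result: The HalLoc dataset and model are publicly released, providing a new tool for improving the reliability of vision-language models.
Insight: A probabilistic hallucination detection module could serve as a practical plug-in for improving model trustworthiness in real-world settings.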
Abstract: Hallucinations pose a significant challenge to the reliability of large
vision-language models, making their detection essential for ensuring accuracy
in critical applications. Current detection methods often rely on
computationally intensive models, leading to high latency and resource demands.
Their definitive outcomes also fail to account for real-world scenarios where
the line between hallucinated and truthful information is unclear. To address
these issues, we propose HalLoc, a dataset designed for efficient,
probabilistic hallucination detection. It features 150K token-level annotated
samples, including hallucination types, across Visual Question Answering (VQA),
instruction-following, and image captioning tasks. This dataset facilitates the
development of models that detect hallucinations with graded confidence,
enabling more informed user interactions. Additionally, we introduce a baseline
model trained on HalLoc, offering low-overhead, concurrent hallucination
detection during generation. The model can be seamlessly integrated into
existing VLMs, improving reliability while preserving efficiency. The prospect
of a robust plug-and-play hallucination detection module opens new avenues for
enhancing the trustworthiness of vision-language models in real-world
applications. The HalLoc dataset and code are publicly available at:
https://github.com/dbsltm/cvpr25_halloc.
[51] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation
Hamzeh Asgharnezhad,Pegah Tabarisaadi,Abbas Khosravi,Roohallah Alizadehsani,U. Rajendra Acharya
Main category: cs.CV
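TL;DR: This paper presents a comprehensive evaluation of skin cancer classification using transfer learning and uncertainty quantification (UQ). CLIP-based vision transformers combined with SVM perform best, and ensemble methods strike a good balance between accuracy and uncertainty handling.
Details
Motivation: Accurate skin cancer diagnosis is critical for early treatment, but existing deep learning methods are limited by data scarcity and a lack of uncertainty awareness. Contribution: A comprehensive evaluation of multiple pre-trained feature extractors and classifiers for skin cancer classification, together with uncertainty quantification (UQ) methods that improve model reliability and clinical utility.
Method: The study has two phases: 1) compare pre-trained feature extractors (e.g., CLIP, ResNet50) paired with traditional classifiers (e.g., SVM, XGBoost); 2) apply Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) for uncertainty quantification.
Result: LAION CLIP ViT-H/14 with SVM performs best. Ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions.
Insight: Uncertainty quantification is essential in deep-learning-based medical diagnosis, improving both trust in the model and its practical value.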
Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment
and improved patient outcomes. Deep learning (DL) models have shown promise in
automating skin cancer classification, but their performance can be limited by
data scarcity and a lack of uncertainty awareness. In this study, we present a
comprehensive evaluation of DL-based skin lesion classification using transfer
learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the
first phase, we benchmarked several pre-trained feature extractors, including
Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50
(ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual
Geometry Group network (VGG16), and EfficientNet-V2-Large, combined with a range
of traditional classifiers such as Support Vector Machine (SVM), eXtreme
Gradient Boosting (XGBoost), and logistic regression. Our results show that
CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM,
deliver the highest classification performance. In the second phase, we
incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte
Carlo Dropout (EMCD) to assess not only prediction accuracy but also the
reliability of model outputs. We evaluated these models using uncertainty-aware
metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen),
uncertainty specificity (USpe), and uncertainty precision (UPre). The results
demonstrate that ensemble methods offer a good trade-off between accuracy and
uncertainty handling, while EMCD is more sensitive to uncertain predictions.
This study highlights the importance of integrating UQ into DL-based medical
diagnosis to enhance both performance and trustworthiness in real-world
clinical applications.
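As a rough illustration of the Monte Carlo Dropout idea mentioned in the abstract, the sketch below averages the class probabilities from several stochastic forward passes and uses the entropy of the averaged distribution as an uncertainty score. The function name and the sample probabilities are hypothetical; the paper's own models and its UAcc/USen/USpe/UPre metrics are not reproduced here.

```python
import math


def predictive_entropy(prob_samples):
    """Average T stochastic softmax outputs; return (mean_probs, entropy)."""
    n_passes = len(prob_samples)
    n_classes = len(prob_samples[0])
    # Mean class probability across the stochastic forward passes.
    mean_probs = [
        sum(sample[c] for sample in prob_samples) / n_passes
        for c in range(n_classes)
    ]
    # Entropy of the averaged distribution (in nats); higher means less certain.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, entropy


# Hypothetical softmax outputs from three dropout-enabled passes (2 classes).
confident = [[0.90, 0.10], [0.95, 0.05], [0.85, 0.15]]
uncertain = [[0.90, 0.10], [0.20, 0.80], [0.50, 0.50]]

if __name__ == "__main__":
    print(predictive_entropy(confident))
    print(predictive_entropy(uncertain))
```

When the passes disagree, the averaged distribution flattens and the entropy rises, which is the signal a downstream threshold could use to flag a prediction as uncertain.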
[52] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework
Sadia Kamal,Tim Oates,Joy Wan
Main category: cs.CV
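TL;DR: This paper proposes a weakly supervised multimodal framework that generates structured SOAP (Subjective, Objective, Assessment, and Plan) notes from limited inputs such as lesion images and sparse clinical text, aiming to reduce clinician burden and reliance on large annotated datasets.
Details
Motivation: Skin carcinoma is the most prevalent cancer globally and incurs high annual healthcare costs. Clinicians must write detailed SOAP notes by hand, which is time-consuming and adds to their workload. The paper addresses this with a weakly supervised approach. Contribution: 1. A weakly supervised multimodal framework for generating clinically structured SOAP notes; 2. two new clinical quality metrics, MedConceptEval and Clinical Coherence Score (CCS).
Method: A weakly supervised learning framework takes lesion images and sparse clinical text as input and generates SOAP notes, reducing reliance on manually annotated data.
Result: The method performs comparably to GPT-4o, Claude, and DeepSeek Janus Pro on key clinical relevance metrics, supporting its clinical utility and scalability.
Insight: Weak supervision can produce high-quality clinical notes when annotated data is limited, while also easing the workload of clinicians.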
Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for
over $8 billion in annual healthcare expenditures. In clinical settings,
physicians document patient visits using detailed SOAP (Subjective, Objective,
Assessment, and Plan) notes. However, manually generating these notes is
labor-intensive and contributes to clinician burnout. In this work, we propose
a weakly supervised multimodal framework to generate clinically structured SOAP
notes from limited inputs, including lesion images and sparse clinical text.
Our approach reduces reliance on manual annotations, enabling scalable,
clinically grounded documentation while alleviating clinician burden and
reducing the need for large annotated data. Our method achieves performance
comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical
relevance metrics. To evaluate clinical quality, we introduce two novel metrics,
MedConceptEval and Clinical Coherence Score (CCS), which assess semantic
alignment with expert medical concepts and input features, respectively.
[53] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video
Fei Zhao,Da Pan,Zelu Qi,Ping Shi
Main category: cs.CV
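TL;DR: For audio-visual quality assessment (AVQA) of user-generated omnidirectional videos (ODVs), the paper builds a dataset and proposes a baseline model based on feature extraction and fusion.
Details
Motivation: With the rise of the Metaverse, omnidirectional video is shifting from professionally generated content to user-generated content (UGC), yet AVQA for UGC omnidirectional video remains little studied. Contribution: 1. A user-generated omnidirectional audio-visual dataset; 2. an effective AVQA baseline model comprising video feature extraction, audio feature extraction, and audio-visual fusion modules.
Method: 1. Capture 300 videos covering 10 scene types with two omnidirectional cameras; 2. obtain Mean Opinion Scores (MOS) for the audio-visual sequences through a subjective experiment; 3. design a baseline model that extracts video and audio features separately and then fuses them.
Result: Experiments show that the model achieves the best performance on the proposed dataset.
Insight: 1. AVQA for user-generated omnidirectional video is an emerging research direction; 2. feature fusion is key to improving AVQA model performance.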
Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos
(ODVs) have garnered notable interest, gradually shifting from
professional-generated content (PGC) to user-generated content (UGC). However,
the study of audio-visual quality assessment (AVQA) within ODVs remains
limited. To address this, we construct a dataset of UGC omnidirectional audio
and video (A/V) content. The videos are captured by five individuals using two
different types of omnidirectional cameras, shooting 300 videos covering 10
different scene types. A subjective AVQA experiment is conducted on the dataset
to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. To
facilitate the development of the UGC-ODV AVQA field, we then construct an
effective AVQA baseline model on the proposed dataset. The baseline consists of
a video feature extraction module, an audio feature extraction module, and an
audio-visual fusion module. The experimental results demonstrate that our model
achieves optimal performance on the proposed dataset.
[54] Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions
Deliang Wang,Chao Yang,Gaowei Chen
Main category: cs.CV
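TL;DR: This paper explores using vision-language models (VLMs) with zero-shot prompting to detect students’ academic emotions, as an alternative to traditional supervised learning. Qwen2.5-VL-7B-Instruct performs relatively well at recognizing confusion, but detection of distracted behavior remains poor.
Details
Motivation: Students’ academic emotions strongly affect learning performance. Traditional supervised methods generalize poorly, whereas VLMs open up the possibility of generalizing across tasks, which motivates studying them for emotion recognition. Contribution: Applies VLMs to academic emotion detection, evaluates the effectiveness of zero-shot prompting, and compares two models across different emotion types.
Method: Two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, analyze 5,000 student facial images showing confused, distracted, happy, neutral, and tired expressions using zero-shot prompting.
Result: Qwen2.5-VL-7B-Instruct outperforms Llama-3.2-11B-Vision-Instruct and is notably better at recognizing confusion, but both models fail to detect distracted behavior.
Insight: VLMs show promise for academic emotion recognition, and zero-shot prompting avoids data annotation and fine-tuning. Performance still has room to improve, especially for specific emotions such as distraction.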
Abstract: Students’ academic emotions significantly influence their social behavior and
learning performance. Traditional approaches to automatically and accurately
analyze these emotions have predominantly relied on supervised machine learning
algorithms. However, these models often struggle to generalize across different
contexts, necessitating repeated cycles of data collection, annotation, and
training. The emergence of Vision-Language Models (VLMs) offers a promising
alternative, enabling generalization across visual recognition tasks through
zero-shot prompting without requiring fine-tuning. This study investigates the
potential of VLMs to analyze students’ academic emotions via facial expressions
in an online learning environment. We employed two VLMs,
Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000
images depicting confused, distracted, happy, neutral, and tired expressions
using zero-shot prompting. Preliminary results indicate that both models
demonstrate moderate performance in academic facial expression recognition,
with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct.
Notably, both models excel in identifying students’ happy emotions but fail to
detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits
relatively high performance in recognizing students’ confused expressions,
highlighting its potential for practical applications in identifying content
that causes student confusion.
[55] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting
Lintao Xiang,Hongpei Zheng,Yating Huang,Qijun Yang,Hujun Yin
Main category: cs.CV
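TL;DR: PointGS combines a point attention mechanism with Gaussian splatting to achieve real-time, high-quality rendering from sparse views, addressing the overfitting of 3DGS under sparse inputs.
Details
Motivation: Existing 3DGS methods need many calibrated views to produce a consistent scene representation. With sparse views they tend to overfit the training views, which degrades rendering quality. The authors aim to improve 3DGS so that it renders efficiently from sparse inputs as well. Contribution: 1. A point-wise feature-aware Gaussian splatting framework that supports real-time, high-quality rendering from sparse views. 2. A point interaction network based on self-attention that strengthens point-wise appearance representation. 3. Lightweight MLPs that decode Gaussian parameters, improving rendering efficiency.
Method: 1. Use a stereo foundation model to estimate camera poses and reconstruct a dense point cloud for Gaussian initialization. 2. Sample multiscale 2D appearance features from the sparse inputs to encode the colour attributes of each Gaussian. 3. Apply a self-attention-based point interaction network to enrich the features. 4. Decode Gaussian parameters with lightweight MLPs for final rendering.
Result: Experiments show that PointGS significantly outperforms NeRF-based methods on diverse benchmarks and is competitive with state-of-the-art 3DGS methods in few-shot settings.
Insight: Point attention effectively improves the generalization of Gaussian splatting under sparse views, which suggests that local feature interaction matters for 3D rendering quality.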
Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that
surpasses the neural radiance field (NeRF) in both rendering speed and visual
quality by leveraging an explicit 3D scene representation. Existing 3DGS
approaches require a large number of calibrated views to generate a consistent
and complete scene representation. When input views are limited, 3DGS tends to
overfit the training views, leading to noticeable degradation in rendering
quality. To address this limitation, we propose a Point-wise Feature-Aware
Gaussian Splatting framework that enables real-time, high-quality rendering
from sparse training views. Specifically, we first employ the latest stereo
foundation model to estimate accurate camera poses and reconstruct a dense
point cloud for Gaussian initialization. We then encode the colour attributes
of each 3D Gaussian by sampling and aggregating multiscale 2D appearance
features from sparse inputs. To enhance point-wise appearance representation,
we design a point interaction network based on a self-attention mechanism,
allowing each Gaussian point to interact with its nearest neighbors. These
enriched features are subsequently decoded into Gaussian parameters through two
lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive
experiments on diverse benchmarks demonstrate that our method significantly
outperforms NeRF-based approaches and achieves competitive performance under
few-shot settings compared to the state-of-the-art 3DGS methods.
[56] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models
Jun Yin,Jing Zhong,Peilin Li,Pengyu Zeng,Miao Zhang,Ran Luo,Shuai Lu
Main category: cs.CV
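TL;DR: This paper proposes UrbanSense, a vision-language-model-based framework for quantitatively analyzing differences in urban streetscape style, and builds the UrbanDiffBench dataset. Experiments show that it captures style differences effectively.
Details
Motivation: Urban streetscape styles vary with geographical, historical, and socio-political factors. Traditional approaches that rely on expert interpretation and historical documentation are hard to standardize, so an objective, data-driven, automated analysis framework is needed. Contribution: 1. The UrbanDiffBench streetscape dataset; 2. UrbanSense, the first vision-language-model-based framework for streetscape analysis; 3. experiments that validate the framework’s ability to capture style differences.
Method: A multimodal vision-language model automatically generates and quantifies representations of urban streetscape style, and its effectiveness is validated through statistical and subjective evaluation.
Result: Over 80% of generated descriptions pass the t-test (p < 0.05), and subjective evaluation yields high Phi scores (0.912 for cities, 0.833 for periods), demonstrating the framework’s ability to capture style differences.
Insight: The framework provides a scientific basis for quantifying the evolution of urban style and can support objective evaluation of future designs.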
Abstract: Urban cultures and architectural styles vary significantly across cities due
to geographical, chronological, historical, and socio-political factors.
Understanding these differences is essential for anticipating how cities may
evolve in the future. As representative cases of historical continuity and
modern innovation in China, Beijing and Shenzhen offer valuable perspectives
for exploring the transformation of urban streetscapes. However, conventional
approaches to urban cultural studies often rely on expert interpretation and
historical documentation, which are difficult to standardize across different
contexts. To address this, we propose a multimodal research framework based on
vision-language models, enabling automated and scalable analysis of urban
streetscape style differences. This approach enhances the objectivity and
data-driven nature of urban form research. The contributions of this study are
as follows: First, we construct UrbanDiffBench, a curated dataset of urban
streetscapes containing architectural images from different periods and
regions. Second, we develop UrbanSense, the first vision-language-model-based
framework for urban streetscape analysis, enabling the quantitative generation
and comparison of urban style representations. Third, experimental results show
that over 80% of generated descriptions pass the t-test (p < 0.05), and
high Phi scores (0.912 for cities, 0.833 for periods) from subjective
evaluations confirm the method’s ability to capture subtle stylistic
differences. These results highlight the method’s potential to quantify and
interpret urban style evolution, offering a scientifically grounded lens for
future design.
[57] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration
Mina C. Moghadam,Alan Q. Wang,Omer Taub,Martin R. Prince,Mert R. Sabuncu
Main category: cs.CV
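TL;DR: RealKeyMorph (RKM) is a resolution-agnostic image registration method that outputs keypoints in real-world coordinates, avoiding the artifacts that resampling introduces in conventional methods.
Details
Motivation: In medical image registration, differences in acquisition parameters lead to differences in resolution. Conventional methods resample to a fixed resolution, which introduces interpolation artifacts. RKM aims to solve this problem. Contribution: RKM extends the KeyMorph framework to output keypoints in real-world coordinates, which avoids resampling and makes registration resolution-agnostic.
Method: RKM uses the affine matrix provided by the scanner to map keypoints into real-world coordinates and integrates this mapping into training, so the model can operate directly on the raw data.
Result: Experiments show the advantages of RKM when registering orthogonal 2D stacks of abdominal MRIs and 3D brain volumes with varying resolutions.
Insight: Handling keypoints in real-world coordinates sidesteps resolution constraints and improves registration quality, which suits multi-resolution medical imaging scenarios.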
Abstract: Many real-world settings require registration of a pair of medical images
that differ in spatial resolution, which may arise from differences in image
acquisition parameters like pixel spacing, slice thickness, and field-of-view.
However, all previous machine learning-based registration techniques resample
images onto a fixed resolution. This is suboptimal because resampling can
introduce artifacts due to interpolation. To address this, we present
RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is
an extension of KeyMorph, a registration framework which works by training a
network to learn corresponding keypoints for a given pair of images, after
which a closed-form keypoint matching step is used to derive the transformation
that aligns them. To avoid resampling and enable operating on the raw data, RKM
outputs keypoints in real-world coordinates of the scanner. To do this, we
leverage the affine matrix produced by the scanner (e.g., MRI machine) that
encodes the mapping from voxel coordinates to real-world coordinates. By
transforming keypoints into real-world space and integrating this into the
training process, RKM effectively enables the extracted keypoints to be
resolution-agnostic. In our experiments, we demonstrate the advantages of RKM
on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as
3D volumes with varying resolutions in brain datasets.
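The voxel-to-world mapping described in the abstract can be sketched as applying a 4x4 affine matrix to a keypoint in homogeneous coordinates. The following minimal Python sketch is only an illustration under assumed values: the function name and the two affines (same origin, 1 mm versus 2 mm isotropic spacing) are hypothetical and are not taken from the RKM code.

```python
def voxel_to_world(affine, voxel):
    """Map an (i, j, k) voxel coordinate to world space with a 4x4 affine."""
    i, j, k = voxel
    homogeneous = (i, j, k, 1.0)
    # Only the first three rows are needed; the last row is (0, 0, 0, 1).
    return tuple(
        sum(affine[row][col] * homogeneous[col] for col in range(4))
        for row in range(3)
    )


# Hypothetical scanner affines: identical origin, different voxel spacing.
affine_fine = [
    [1.0, 0.0, 0.0, -10.0],
    [0.0, 1.0, 0.0, -10.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
affine_coarse = [
    [2.0, 0.0, 0.0, -10.0],
    [0.0, 2.0, 0.0, -10.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    # The same anatomical location has different voxel indices in each image.
    print(voxel_to_world(affine_fine, (20, 30, 8)))
    print(voxel_to_world(affine_coarse, (10, 15, 4)))
```

Both calls land on the same world-space point even though the voxel indices differ, which is why keypoints expressed in world coordinates can be compared across images without resampling either one.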
[58] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation
Runqi Ouyang,Haoyun Li,Zhenyuan Zhang,Xiaofeng Wang,Zheng Zhu,Guan Huang,Xingang Wang
Main category: cs.CV
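TL;DR: This paper proposes Motion-R1, a framework that combines Chain-of-Thought reasoning with reinforcement learning to improve semantic understanding and motion quality in text-to-motion generation.
Details
Motivation: Existing text-to-motion methods usually rely on end-to-end mapping strategies that fail to capture deep linguistic structure and logical reasoning, so the generated motions lack controllability, consistency, and diversity. Contribution: Motion-R1, a unified motion-language modeling framework that integrates Chain-of-Thought reasoning, trained with a reinforcement learning algorithm that jointly optimizes reasoning chains and motion synthesis.
Method: Chain-of-Thought reasoning decomposes complex textual instructions into logically structured action paths, and Group Relative Policy Optimization is used for joint optimization.
Result: The method performs strongly on multiple benchmark datasets and outperforms existing methods particularly in scenarios that require nuanced semantic understanding and long-term temporal coherence.
Insight: Chain-of-Thought reasoning can substantially improve semantic guidance and logical consistency in text-to-motion generation, and reinforcement learning further improves motion quality.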
Abstract: Recent advances in large language models, especially in natural language
understanding and reasoning, have opened new possibilities for text-to-motion
generation. Although existing approaches have made notable progress in semantic
alignment and motion synthesis, they often rely on end-to-end mapping
strategies that fail to capture deep linguistic structures and logical
reasoning. Consequently, generated motions tend to lack controllability,
consistency, and diversity. To address these limitations, we propose Motion-R1,
a unified motion-language modeling framework that integrates a Chain-of-Thought
mechanism. By explicitly decomposing complex textual instructions into
logically structured action paths, Motion-R1 provides high-level semantic
guidance for motion generation, significantly enhancing the model’s ability to
interpret and execute multi-step, long-horizon, and compositionally rich
commands. To train our model, we adopt Group Relative Policy Optimization, a
reinforcement learning algorithm designed for large models, which leverages
motion quality feedback to optimize reasoning chains and motion synthesis
jointly. Extensive experiments across multiple benchmark datasets demonstrate
that Motion-R1 achieves competitive or superior performance compared to
state-of-the-art methods, particularly in scenarios requiring nuanced semantic
understanding and long-term temporal coherence. The code, model and data will
be publicly available.
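The core of Group Relative Policy Optimization is that each sampled output is scored relative to the other outputs sampled for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step, assuming scalar rewards and population standard deviation; the function name and `eps` value are illustrative, and the reward design used by Motion-R1 is not specified in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical motion-quality rewards for four sampled generations.
    print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Samples that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. A group with identical rewards yields zero advantages and therefore no gradient signal.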
[59] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device
Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh
Main category: cs.CV
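TL;DR: FaceLiVT is a lightweight yet powerful face recognition model that combines a hybrid CNN-Transformer architecture with a Multi-Head Linear Attention mechanism, reducing computational complexity while maintaining high accuracy.
Details
Motivation: Achieve efficient, real-time face recognition on mobile devices while reducing computational cost. Contribution: A new Multi-Head Linear Attention (MHLA) mechanism and a reparameterized token mixer that significantly improve inference speed while maintaining high accuracy.
Method: A hybrid CNN-Transformer architecture that introduces Multi-Head Linear Attention and structural reparameterization.
Result: The model performs strongly on multiple benchmarks, with inference 8.6 times faster than EdgeFace and 21.2 times faster than a pure ViT-based model.
Insight: Hybrid architectures with lightweight attention mechanisms are an effective solution for efficient face recognition on mobile devices.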
Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition
model that integrates a hybrid Convolution Neural Network (CNN)-Transformer
architecture with an innovative and lightweight Multi-Head Linear Attention
(MHLA) mechanism. By combining MHLA alongside a reparameterized token mixer,
FaceLiVT effectively reduces computational complexity while preserving
competitive accuracy. Extensive evaluations on challenging benchmarks,
including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior
performance compared to state-of-the-art lightweight models. MHLA notably
improves inference speed, allowing FaceLiVT to deliver high accuracy with lower
latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace,
a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2×
faster than a pure ViT-based model. With its balanced design, FaceLiVT offers
an efficient and practical solution for real-time face recognition on
resource-constrained platforms.
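Structural reparameterization, in general, means training with a richer set of operations and then folding them into a single equivalent operation for inference. The abstract does not describe FaceLiVT's reparameterized token mixer in detail, so the sketch below shows only the generic idea on an assumed toy case: a linear layer followed by a per-channel scale and shift (as a normalization layer behaves at inference) is fused into one linear layer. All names and values are hypothetical.

```python
def linear(weight, bias, x):
    """Compute y = W x + b for a small dense layer."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def channel_affine(scale, shift, y):
    """Apply a per-channel scale and shift."""
    return [s * v + t for s, t, v in zip(scale, shift, y)]


def fuse_linear_affine(weight, bias, scale, shift):
    """Fold the per-channel affine into the linear layer's parameters."""
    fused_weight = [[s * w for w in row] for row, s in zip(weight, scale)]
    fused_bias = [s * b + t for b, s, t in zip(bias, scale, shift)]
    return fused_weight, fused_bias


if __name__ == "__main__":
    weight, bias = [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5]
    scale, shift = [2.0, 0.5], [1.0, -1.0]
    x = [1.0, -2.0]
    two_step = channel_affine(scale, shift, linear(weight, bias, x))
    fused_w, fused_b = fuse_linear_affine(weight, bias, scale, shift)
    print(two_step, linear(fused_w, fused_b, x))
```

The fused layer produces the same output as the two-step version with one fewer operation at inference time, which is the kind of saving that reparameterized designs aim for on mobile hardware.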
[60] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion
Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui,Yuhan Lyu
Main category: cs.CV
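TL;DR: FSATFusion is an infrared and visible image fusion network based on a frequency-spatial attention Transformer. It improves fusion performance through an improved Transformer module and an attention mechanism.
Details
Motivation: Existing deep learning methods for infrared and visible image fusion (IVIF) rely on convolutional neural networks, but convolution struggles to capture global context, which causes information loss and limits fusion performance. Contribution: 1. An end-to-end frequency-spatial attention Transformer fusion network (FSATFusion); 2. a frequency-spatial attention mechanism (FSAM) and an improved Transformer module (ITM).
Method: 1. Use the frequency-spatial attention mechanism (FSAM) to extract salient features; 2. use the improved Transformer module (ITM) to strengthen the extraction of global context information.
Result: Experiments show that FSATFusion outperforms existing methods in fusion quality and efficiency, and generalizes well to downstream tasks such as object detection.
Insight: Transformer architectures that combine frequency and spatial attention are promising for image fusion, as they better preserve and merge the key information in multimodal images.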
Abstract: Infrared and visible image fusion (IVIF) is receiving increasing
attention from both the research community and industry due to its excellent
results in downstream applications. Existing deep learning approaches often
utilize convolutional neural networks to extract image features. However, the
inherently limited capacity of convolution operations to capture global context
can lead to information loss, thereby restricting fusion performance. To address
this limitation, we propose an end-to-end fusion network named the
Frequency-Spatial Attention Transformer Fusion Network (FSATFusion).
FSATFusion contains a frequency-spatial attention Transformer (FSAT) module
designed to effectively capture discriminative features from source images. This
FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of
extracting significant features from feature maps. Additionally, we propose an
improved Transformer module (ITM) to enhance the vanilla Transformer’s ability
to extract global context information. We conducted both qualitative and
quantitative comparative experiments, demonstrating the superior fusion quality
and efficiency of FSATFusion compared to other state-of-the-art methods.
Furthermore, our network was tested on two additional tasks without any
modifications, to verify the excellent generalization capability of FSATFusion.
Finally, the object detection experiment demonstrated the superiority of
FSATFusion in downstream visual tasks. Our code is available at
https://github.com/Lmmh058/FSATFusion.
[61] Revisiting Transformers with Insights from Image Filtering
Laziz U. Abdullaev,Maksim Tkachenko,Tan M. Nguyen
Main category: cs.CV
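TL;DR: This paper revisits the self-attention mechanism of Transformers through an image processing framework and proposes a unified theoretical interpretation that explains not only the self-attention computation but also the roles of components such as positional encoding and residual connections. Two proposed architectural modifications improve interpretability and also notably improve task accuracy and robustness.
Details
Motivation: Self-attention is highly effective, but its theoretical explanation remains incomplete. Prior work has tried to understand it through image denoising and nonparametric regression, but lacks a deeper mechanistic interpretation of the architectural components. This paper aims to fill that gap. Contribution: 1. A unified image-processing-based framework that explains self-attention and the roles of key components such as positional encoding and residual connections. 2. Two architectural modifications that improve interpretability and are shown empirically to improve accuracy and robustness.
Method: The paper draws an analogy between self-attention and image filtering to establish the theoretical connection, and designs two independent architectural modifications on that basis. Experiments cover language and vision tasks.
Result: Experiments show that the proposed modifications notably improve accuracy and robustness against data contamination and adversaries across multiple tasks, and in particular improve long-sequence understanding.
Insight: Linking self-attention to image processing theory both provides theoretical grounding and suggests new architectural directions, indicating that transferring theory across fields can advance deep learning.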
Abstract: The self-attention mechanism, a cornerstone of Transformer-based
state-of-the-art deep learning architectures, is largely heuristic-driven and
fundamentally challenging to interpret. Establishing a robust theoretical
foundation to explain its remarkable success and limitations has therefore
become an increasingly prominent focus in recent research. Some notable
directions have explored understanding self-attention through the lens of image
denoising and nonparametric regression. While promising, existing frameworks
still lack a deeper mechanistic interpretation of various architectural
components that enhance self-attention, both in its original formulation and
subsequent variants. In this work, we aim to advance this understanding by
developing a unifying image processing framework, capable of explaining not
only the self-attention computation itself but also the role of components such
as positional encoding and residual connections, including numerous later
variants. We also pinpoint potential distinctions between the two concepts
building upon our framework, and make an effort to close this gap. We introduce
two independent architectural modifications within transformers. While our
primary objective is interpretability, we empirically observe that image
processing-inspired modifications can also lead to notably improved accuracy
and robustness against data contamination and adversaries across language and
vision tasks as well as better long sequence understanding.
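The link between self-attention and image filtering can be illustrated with a small similarity-weighted filter: each output is a weighted average of all values, with weights given by a softmax over pairwise similarity scores. The sketch below uses a negative squared distance as the score, which makes the softmax a normalized Gaussian kernel. This is a generic illustration of the analogy, with hypothetical names and an assumed 1D signal, and it is not the framework or the modifications proposed in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_filter(features, values, bandwidth=1.0):
    """Attention-style filter: softmax over similarities, then average values."""
    outputs = []
    for query in features:
        # Gaussian-kernel scores play the role of query-key logits.
        scores = [-((query - key) ** 2) / (2 * bandwidth ** 2) for key in features]
        weights = softmax(scores)
        outputs.append(sum(w * v for w, v in zip(weights, values)))
    return outputs


if __name__ == "__main__":
    # Hypothetical noisy step signal; features and values coincide here.
    signal = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
    print(similarity_filter(signal, signal, bandwidth=0.5))
```

With a narrow bandwidth, samples on each side of the step are averaged mostly with their own side, so noise is smoothed while the edge is preserved. This is the behaviour that edge-aware image filters share with attention weights concentrated on similar tokens.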
[62] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial
Jerry Yan,Chinmay Talegaonkar,Nicholas Antipa,Eric Terrill,Sophia Merrifield
Main category: cs.CV
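TL;DR: This paper proposes PoseIDON, a computer vision method that combines deep foundation model features with multiview photogrammetry to estimate the six-degrees-of-freedom pose of seafloor objects and the orientation of the surrounding seafloor. Burial depth is inferred from these estimates, enabling accurate mapping of object burial state.
Details
Motivation: Accurately estimating the burial state of anthropogenic objects on the seafloor matters for studying sedimentation dynamics and for assessing ecological risk and pollutant transport. Partial occlusion, poor visibility, and object degradation make accurate measurement from remote imagery difficult. Contribution: PoseIDON, a pipeline that combines deep foundation model features with multiview photogrammetry to estimate six-degrees-of-freedom object pose and burial depth with high accuracy.
Method: Multiview imagery is captured from ROV video and features are extracted with a deep foundation model. Burial depth is inferred by aligning CAD models with the imagery and fitting a local plane to the seafloor.
Result: In experiments at a historic ocean dumpsite in the San Pedro Basin, the mean burial depth error is approximately 10 centimeters, and the recovered spatial patterns reflect sediment transport processes.
Insight: The method offers a non-invasive, scalable route to mapping seafloor burial and supports environmental assessment at contaminated sites.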
Abstract: The burial state of anthropogenic objects on the seafloor provides insight
into localized sedimentation dynamics and is also critical for assessing
ecological risks, potential pollutant transport, and the viability of recovery
or mitigation strategies for hazardous materials such as munitions. Accurate
burial depth estimation from remote imagery remains difficult due to partial
occlusion, poor visibility, and object degradation. This work introduces a
computer vision pipeline, called PoseIDON, which combines deep foundation model
features with multiview photogrammetry to estimate six degrees of freedom
object pose and the orientation of the surrounding seafloor from ROV video.
Burial depth is inferred by aligning CAD models of the objects with observed
imagery and fitting a local planar approximation of the seafloor. The method is
validated using footage of 54 objects, including barrels and munitions,
recorded at a historic ocean dumpsite in the San Pedro Basin. The model
achieves a mean burial depth error of approximately 10 centimeters and resolves
spatial burial patterns that reflect underlying sediment transport processes.
This approach enables scalable, non-invasive mapping of seafloor burial and
supports environmental assessment at contaminated sites.
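Once an object's CAD model has been posed and a local seafloor plane has been fitted, burial depth can be read off as how far the model extends below that plane. The sketch below covers only this last geometric step, under the assumption that the posed model vertices and the plane (a point on it plus an upward normal) are already available. The function names and the box-shaped example are hypothetical and are not taken from PoseIDON.

```python
import math


def signed_distance(point, plane_point, normal):
    """Signed distance from a 3D point to a plane; positive is above."""
    length = math.sqrt(sum(n * n for n in normal))
    offset = sum((p - q) * n for p, q, n in zip(point, plane_point, normal))
    return offset / length


def burial_depth(vertices, plane_point, normal):
    """Depth of the lowest model vertex below the plane (0 if none is below)."""
    lowest = min(signed_distance(v, plane_point, normal) for v in vertices)
    return max(0.0, -lowest)


if __name__ == "__main__":
    # Hypothetical posed model: a box spanning z from -0.3 m to 0.5 m.
    box = [
        (x, y, z)
        for x in (0.0, 1.0)
        for y in (0.0, 1.0)
        for z in (-0.3, 0.5)
    ]
    # Hypothetical fitted seafloor: the plane z = 0 with an upward normal.
    print(burial_depth(box, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)))
```

Using the lowest vertex is one simple convention. Other summaries, such as the buried fraction of the object's height, would follow from the same signed distances.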
[63] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba
Shicheng Yin,Kaixuan Yin,Yang Liu,Weixing Chen,Liang Lin
Main category: cs.CV
TL;DR: DART提出了一种动态自适应区域标记器,为Vision Transformer和Mamba提供内容相关的可变大小图像分区,显著提升性能并减少计算开销。
Details
Motivation: 现有非卷积模型(如ViT和Vim)依赖固定大小的图像分区,导致背景区域编码冗余或关键局部细节缺失。需要一种自适应方法来解决这一问题。Contribution: 提出DART,一种完全可微的动态自适应区域标记器,通过学习区域分数和分位数操作,动态分配更密集的标记到信息丰富区域。
Method: DART通过可学习的区域分数和分段可微分分位数操作实现图像的自适应分区。
Result: 在DeiT上准确率提升2.1%,FLOPs减少45%,并在DeiT、Vim和VideoMamba上一致表现优越。
Insight: 动态自适应标记分配优于均匀增加标记密度的方法,显著提升效率与性能。
Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and
Vision Mamba (Vim) have achieved remarkable performance in computer vision
tasks. However, their reliance on fixed-size patches often results in excessive
encoding of background regions and omission of critical local details,
especially when informative objects are sparsely distributed. To address this,
we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART),
which adaptively partitions images into content-dependent patches of varying
sizes. DART combines learnable region scores with piecewise differentiable
quantile operations to allocate denser tokens to information-rich areas.
Despite introducing only approximately 1 million (1M) additional parameters,
DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that
uniformly increase token density to capture fine-grained details, DART offers a
more efficient alternative, achieving 45% FLOPs reduction with superior
performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that
DART consistently enhances accuracy while incurring minimal or even reduced
computational overhead. Code is available at
https://github.com/HCPLab-SYSU/DART.
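The idea of allocating denser tokens to information-rich areas via quantiles can be illustrated in one dimension: place token boundaries where the cumulative region score crosses equally spaced fractions of the total, so each token covers roughly the same amount of score. The sketch below is a hard, non-differentiable 1D illustration with hypothetical names and scores. DART itself uses learnable scores and piecewise differentiable quantile operations on images, which this sketch does not reproduce.

```python
def quantile_boundaries(scores, num_tokens):
    """Return segment boundaries so each segment holds a similar total score.

    May return fewer than num_tokens segments if the score is concentrated
    in very few positions.
    """
    total = sum(scores)
    boundaries = [0]
    cumulative = 0.0
    # The final position is excluded so no boundary equals len(scores) early.
    for idx, score in enumerate(scores[:-1]):
        cumulative += score
        target = total * len(boundaries) / num_tokens
        if len(boundaries) < num_tokens and cumulative >= target:
            boundaries.append(idx + 1)
    boundaries.append(len(scores))
    return boundaries


if __name__ == "__main__":
    # Hypothetical scores: low-information left half, high-information right.
    print(quantile_boundaries([1, 1, 1, 1, 4, 4, 4, 4], num_tokens=4))
    # Uniform scores fall back to equal-width segments.
    print(quantile_boundaries([1] * 8, num_tokens=4))
```

In the first call the low-score positions are merged into one wide segment while the high-score positions each get their own narrow segment, so the same token budget is spent where the content is.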
[64] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion
Yuanyi Song,Pumeng Lyu,Ben Fei,Fenghua Ling,Wanli Ouyang,Lei Bai
Main category: cs.CV
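TL;DR: ReconMOST is a data-driven guided diffusion framework for multi-layer sea temperature reconstruction that addresses the challenges conventional methods face from sparse data, algorithmic complexity, and high computational cost.
Details
Motivation: Conventional ocean temperature reconstruction is limited by data sparsity and algorithmic complexity, while existing machine learning methods are mostly confined to the sea surface and local regions and struggle with issues such as cloud occlusion. ReconMOST aims to address these problems with a diffusion model and achieve accurate global, multi-layer reconstruction. Contribution: 1. ReconMOST, the first diffusion-based framework for multi-layer sea temperature reconstruction; 2. an unconditional diffusion model pre-trained on historical numerical simulation data to learn physically consistent temperature distribution patterns; 3. sparse, high-accuracy in-situ observations used as guidance points during generation; 4. implicitly guided reconstruction in regions without direct observations, based on the patterns learned during pre-training.
Method: 1. Pre-train an unconditional diffusion model on CMIP6 numerical simulation data; 2. use sparse, high-accuracy observations as guidance points during reverse diffusion; 3. in regions without direct observations, rely on the physically consistent distribution patterns learned during pre-training for implicit guidance.
Result: On CMIP6 and EN4 analysis data, ReconMOST reaches mean squared error (MSE) values of 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, and handles over 92.5% missing data while maintaining reconstruction accuracy and spatial resolution.
Insight: 1. Diffusion models can combine the strengths of numerical simulation and observational data for accurate ocean temperature reconstruction; 2. the physically consistent patterns learned in pre-training are essential for regions without observations; 3. the method offers a new approach to global multi-layer ocean temperature reconstruction.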
Abstract: Accurate ocean reconstruction is essential for reflecting global climate
dynamics and supporting marine meteorological research. Conventional methods
face challenges due to sparse data, algorithmic complexity, and high
computational costs, while the growing use of machine learning (ML) methods
remains limited to reconstruction problems at the sea surface and in local
regions, struggling with issues like cloud occlusion. To address these
limitations, this paper proposes ReconMOST, a data-driven guided diffusion
model framework for multi-layer sea temperature reconstruction. Specifically,
we first pre-train an unconditional diffusion model using a large collection of
historical numerical simulation data, enabling the model to attain physically
consistent distribution patterns of ocean temperature fields. During the
generation phase, sparse yet high-accuracy in-situ observational data are
utilized as guidance points for the reverse diffusion process, generating
accurate reconstruction results. Importantly, in regions lacking direct
observational data, the physically consistent spatial distribution patterns
learned during pre-training enable implicitly guided and physically plausible
reconstructions. Our method extends ML-based sea surface temperature (SST)
reconstruction to a global, multi-layer setting, handling over 92.5% missing data
while maintaining reconstruction accuracy, spatial resolution, and superior
generalization capability. We pre-train our model on CMIP6 numerical simulation
data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis
data. The mean squared error (MSE) reaches 0.049 on guidance, 0.680 on
reconstruction, and 0.633 in total, demonstrating the
effectiveness and robustness of the proposed framework. Our source code is
available at https://github.com/norsheep/ReconMOST.
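The three MSE figures can be read as errors over different subsets of grid points. The sketch below assumes, as an interpretation not stated in the abstract, that "guidance" is the error at observed points, "reconstruction" is the error at unobserved points, and "total" is the error over all points. Function names and sample values are hypothetical.

```python
def masked_mse(prediction, target, observed_mask, select_observed):
    """MSE over the points whose mask value equals select_observed."""
    errors = [
        (p - t) ** 2
        for p, t, observed in zip(prediction, target, observed_mask)
        if observed == select_observed
    ]
    return sum(errors) / len(errors)


def reconstruction_report(prediction, target, observed_mask):
    """Split the error into observed, unobserved, and all grid points."""
    total = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
    return {
        "guidance": masked_mse(prediction, target, observed_mask, True),
        "reconstruction": masked_mse(prediction, target, observed_mask, False),
        "total": total,
    }


if __name__ == "__main__":
    # Hypothetical temperature field with two observed and two missing points.
    prediction = [1.0, 2.0, 3.0, 5.0]
    target = [1.0, 2.0, 4.0, 3.0]
    observed_mask = [True, True, False, False]
    print(reconstruction_report(prediction, target, observed_mask))
```

Under this reading, the total is the point-weighted average of the other two. With 7.5% observed and 92.5% missing points, 0.075 × 0.049 + 0.925 × 0.680 ≈ 0.633, which agrees with the reported total, although the abstract does not confirm that this is how the figures were computed.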
[65] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Zhiyang Xu,Jiuhai Chen,Zhaojiang Lin,Xichen Pan,Lifu Huang,Tianyi Zhou,Madian Khabsa,Qifan Wang,Di Jin,Michihiro Yasunaga,Lili Yu,Xi Victoria Lin,Shaoliang Nie
Main category: cs.CV
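TL;DR: Pisces is an auto-regressive multimodal foundation model that performs well on both image understanding and image generation through a decoupled visual encoding architecture and tailored training techniques.
Details
Motivation: Although multimodal foundation models have advanced in both image understanding and generation, unified models still lag behind specialized ones. The main challenges are the different visual features each task needs and the distinct training processes involved. Contribution: A decoupled visual encoding architecture and training techniques optimized for multimodal generation, combined with data curation and training strategies, yielding competitive performance on both image understanding and generation.
Method: An auto-regressive framework with decoupled visual encoders and tailored training methods, supported by careful data curation and a carefully designed training pipeline.
Result: Pisces performs strongly on more than 20 image understanding benchmarks and on the GenEval image generation benchmark, demonstrating its multi-task capability.
Insight: The analysis shows a synergistic relationship between image understanding and generation, and the use of separate visual encoders further benefits unified multimodal models.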
Abstract: Recent advances in large language models (LLMs) have enabled multimodal
foundation models to tackle both image understanding and generation within a
unified framework. Despite these gains, unified models often underperform
compared to specialized models in either task. A key challenge in developing
unified models lies in the inherent differences between the visual features
needed for image understanding versus generation, as well as the distinct
training processes required for each modality. In this work, we introduce
Pisces, an auto-regressive multimodal foundation model that addresses this
challenge through a novel decoupled visual encoding architecture and tailored
training techniques optimized for multimodal generation. Combined with
meticulous data curation, pretraining, and finetuning, Pisces achieves
competitive performance in both image understanding and image generation. We
evaluate Pisces on over 20 public benchmarks for image understanding, where it
demonstrates strong performance across a wide range of tasks. Additionally, on
GenEval, a widely adopted benchmark for image generation, Pisces exhibits
robust generative capabilities. Our extensive analysis reveals the synergistic
relationship between image understanding and generation, and the benefits of
using separate visual encoders, advancing the field of unified multimodal
models.
[66] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment
Shuo Wang,Jihao Zhang
Main category: cs.CV
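TL;DR: MF2Summ is a video summarization method based on multimodal fusion. It combines visual and auditory information with a cross-modal Transformer to overcome the limits of single-modality methods and improve performance.
Details
Motivation: The rapid growth of online video calls for effective video summarization. Traditional methods usually rely on a single modality (typically visual) and struggle to capture the full semantic richness of videos. The paper therefore proposes a multimodal fusion method that combines vision and audio. Contribution: 1. MF2Summ, which improves video summarization by fusing visual and auditory information. 2. A cross-modal Transformer and an alignment-guided self-attention Transformer that model inter-modal dependencies and temporal correspondences. 3. Experiments showing that the model outperforms existing methods on the SumMe and TVSum datasets.
Method: 1. Feature extraction: GoogLeNet for visual features and SoundNet for auditory features. 2. Cross-modal attention interaction: a cross-modal Transformer models relationships between modalities. 3. Alignment-guided self-attention: a Transformer captures temporally aligned features. 4. Segment prediction: predict segment importance, location, and center-ness. 5. Key shot selection: combine NMS and the KTS algorithm to select key shots.
Result: On SumMe and TVSum, F1-scores improve by 1.9% and 0.6% respectively over DSNet, and the model compares favorably with other existing methods.
Insight: Multimodal fusion effectively improves video summarization, with the cross-modal Transformer and temporal alignment as the key ingredients.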
Abstract: The rapid proliferation of online video content necessitates effective video
summarization techniques. Traditional methods, often relying on a single
modality (typically visual), struggle to capture the full semantic richness of
videos. This paper introduces MF2Summ, a novel video summarization model based
on multimodal content understanding, integrating both visual and auditory
information. MF2Summ employs a five-stage process: feature extraction,
cross-modal attention interaction, feature fusion, segment prediction, and key
shot selection. Visual features are extracted using a pre-trained GoogLeNet
model, while auditory features are derived using SoundNet. The core of our
fusion mechanism involves a cross-modal Transformer and an alignment-guided
self-attention Transformer, designed to effectively model inter-modal
dependencies and temporal correspondences. Segment importance, location, and
center-ness are predicted, followed by key shot selection using Non-Maximum
Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm.
Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ
achieves competitive performance, notably improving F1-scores by 1.9% and
0.6% respectively over the DSNet model, and performing favorably against other
state-of-the-art methods.
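As a rough illustration of the NMS part of key shot selection, the sketch below applies greedy non-maximum suppression to 1D temporal segments. The function names, the (start, end, score) tuple layout, and the 0.5 IoU threshold are assumptions made for this example and are not taken from the paper.

```python
def temporal_iou(a, b):
    """IoU of two 1D segments; only the first two fields (start, end) are used."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, iou_threshold=0.5):
    """Greedy NMS over (start, end, score) segments, highest score first."""
    remaining = sorted(segments, key=lambda s: s[2], reverse=True)
    kept = []
    for seg in remaining:
        # Keep a segment only if it does not overlap too much with any kept one.
        if all(temporal_iou(seg, k) <= iou_threshold for k in kept):
            kept.append(seg)
    return kept


# Hypothetical segment proposals in frame units.
proposals = [(0, 10, 0.9), (2, 11, 0.8), (20, 30, 0.7)]
selected = temporal_nms(proposals)
```

Here the second proposal overlaps the first with an IoU of 8/11 and is suppressed, while the third does not overlap and is kept.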
[67] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts
Guowei Zhong,Ruohong Huan,Mingzhen Wu,Ronghua Liang,Peng Chen
Main category: cs.CV
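TL;DR: The paper proposes CIDer, a robust multimodal emotion recognition framework that tackles missing modalities and distribution shift together through self-distillation and causal inference modules, and introduces a new task and new datasets.
Details
Motivation: Multimodal emotion recognition (MER) faces the challenges of missing modalities and out-of-distribution (OOD) data; existing methods depend on specific models or add too many parameters, which limits their practicality. Contribution: 1. Proposes the CIDer framework, which integrates a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module; 2. Defines a new task, RMFM; 3. Introduces new datasets.
Method: 1. MSSD improves robustness through weight-sharing self-distillation; 2. MACI uses a causal graph and counterfactual texts to reduce bias; 3. WSAM and MCT reduce computation and improve fusion.
Result: CIDer is robust in both RMFM and OOD scenarios, with fewer parameters and faster training than existing methods.
Insight: Combining self-distillation with causal inference effectively addresses missing modalities and distribution shift, giving MER a practical and efficient solution.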
Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges
in addressing both modality missing and Out-Of-Distribution (OOD) data
simultaneously. Existing methods often rely on specific models or introduce
excessive parameters, which limits their practicality. To address these issues,
we propose a novel robust MER framework, Causal Inference Distiller (CIDer),
and introduce a new task, Random Modality Feature Missing (RMFM), to generalize
the definition of modality missing. CIDer integrates two key components: a
Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal
Inference (MACI) module. MSSD enhances robustness under the RMFM task through a
weight-sharing self-distillation approach applied across low-level features,
attention maps, and high-level representations. Additionally, a Word-level
Self-aligned Attention Module (WSAM) reduces computational complexity, while a
Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion.
To tackle OOD challenges, MACI employs a tailored causal graph to mitigate
label and language biases using a Multimodal Causal Module (MCM) and
fine-grained counterfactual texts. Notably, MACI can independently enhance OOD
generalization with minimal additional parameters. Furthermore, we also
introduce the new repartitioned MER OOD datasets. Experimental results
demonstrate that CIDer achieves robust performance in both RMFM and OOD
scenarios, with fewer parameters and faster training compared to
state-of-the-art methods. The implementation of this work is publicly
accessible at https://github.com/gw-zhong/CIDer.
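The abstract does not spell out how Random Modality Feature Missing is simulated. One plausible reading, sketched below, drops individual feature vectors at random in every modality rather than removing a whole modality. The function name, the dictionary layout, and the zero-fill convention are assumptions for illustration only.

```python
import random


def random_feature_missing(modalities, missing_rate, seed=None):
    """Zero out individual feature vectors at random in every modality.

    modalities: dict mapping a modality name to a list of feature vectors.
    missing_rate: probability that any single feature vector is dropped.
    """
    rng = random.Random(seed)
    masked = {}
    for name, sequence in modalities.items():
        masked[name] = [
            [0.0] * len(vec) if rng.random() < missing_rate else list(vec)
            for vec in sequence
        ]
    return masked


# Hypothetical toy features for three modalities.
sample = {
    "text": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "audio": [[1.0, 1.1], [1.2, 1.3]],
    "vision": [[2.0, 2.1], [2.2, 2.3]],
}
corrupted = random_feature_missing(sample, missing_rate=0.5, seed=0)
```

Under this reading, a missing rate of 1.0 on a single modality would recover the classical whole-modality-missing setting as a special case.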
[68] Rethinking Generative Human Video Coding with Implicit Motion Transformation
Bolin Chen,Ru-Ling Liao,Jie Chen,Yan Ye
Main category: cs.CV
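TL;DR: The paper proposes a generative human video coding method based on Implicit Motion Transformation (IMT), addressing the distortion and inaccurate motion that traditional explicit motion guidance causes in complex human body videos.
Details
Motivation: When handling complex human motion, traditional generative video coding relies on explicit motion fields that lead to distorted reconstructions and inaccurate motion, so a new approach is needed. Contribution: Proposes the Implicit Motion Transformation (IMT) framework, which transforms compact visual features into implicit motion guidance, improving the compression efficiency and reconstruction quality of human video coding.
Method: Complex human body signals are characterized as compact visual features, and IMT transforms these features into implicit motion guidance for signal reconstruction at the decoder.
Result: Experiments show that IMT enables high-efficiency compression and high-fidelity synthesis in generative human video coding.
Insight: Implicit motion transformation captures complex human motion patterns more flexibly and avoids the limitations of explicit motion fields, offering a new direction for generative video coding.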
Abstract: Beyond traditional hybrid-based video codecs, generative video codecs could
achieve promising compression performance by evolving high-dimensional signals
into compact feature representations for bitstream compactness at the encoder
side and developing explicit motion fields as intermediate supervision for
high-quality reconstruction at the decoder side. This paradigm has achieved
significant success in face video compression. However, compared to facial
videos, human body videos pose greater challenges due to their more complex and
diverse motion patterns, i.e., when using explicit motion guidance for
Generative Human Video Coding (GHVC), the reconstruction results could suffer
severe distortions and inaccurate motion. As such, this paper highlights the
limitations of explicit motion-based approaches for human body video
compression and investigates the GHVC performance improvement with the aid of
Implicit Motion Transformation, namely IMT. In particular, we propose to
characterize complex human body signal into compact visual features and
transform these features into implicit motion guidance for signal
reconstruction. Experimental results demonstrate the effectiveness of the
proposed IMT paradigm, which can facilitate GHVC to achieve high-efficiency
compression and high-fidelity synthesis.
[69] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models
Yu Huang,Zelin Peng,Yichen Zhao,Piao Yang,Xiaokang Yang,Wei Shen
Main category: cs.CV
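TL;DR: The paper proposes a new medical image segmentation task, medical image reasoning segmentation, and introduces the MedSeg-R framework, which uses multimodal large language models (MLLMs) to produce precise segmentations from complex clinical questions. The authors also introduce the MedSeg-QA dataset to support the task.
Details
Motivation: Existing medical image segmentation models rely on explicit instructions and lack the ability to reason about complex clinical questions, while MLLMs perform well on medical question answering but struggle to generate precise segmentation masks. The paper aims to close this gap. Contribution: 1) Proposes the medical image reasoning segmentation task; 2) Develops the MedSeg-R framework, which uses the reasoning ability of MLLMs to achieve precise segmentation; 3) Introduces the MedSeg-QA dataset.
Method: MedSeg-R has two core modules: 1) a global context understanding module that generates multimodal intermediate tokens; 2) a pixel-level grounding module that decodes these tokens into segmentation masks and textual responses.
Result: Experiments show that MedSeg-R performs strongly across several benchmarks, achieving high segmentation accuracy and interpretable textual analysis.
Insight: Drawing on the reasoning ability of MLLMs makes it possible to segment from complex medical instructions while producing interpretable results, advancing automatic medical diagnosis.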
Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing
models are limited by their reliance on explicit human instructions and lack
the active reasoning capabilities to understand complex clinical questions.
While recent advancements in multimodal large language models (MLLMs) have
improved medical question-answering (QA) tasks, most methods struggle to
generate precise segmentation masks, limiting their application in automatic
medical diagnosis. In this paper, we introduce medical image reasoning
segmentation, a novel task that aims to generate segmentation masks based on
complex and implicit medical instructions. To address this, we propose
MedSeg-R, an end-to-end framework that leverages the reasoning abilities of
MLLMs to interpret clinical questions while also being capable of producing
corresponding precise segmentation masks for medical images. It is built on two
core components: 1) a global context understanding module that interprets
images and comprehends complex medical instructions to generate multi-modal
intermediate tokens, and 2) a pixel-level grounding module that decodes these
tokens to produce precise segmentation masks and textual responses.
Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the
medical image reasoning segmentation task. It includes over 10,000 image-mask
pairs and multi-turn conversations, automatically annotated using large
language models and refined through physician reviews. Experiments show
MedSeg-R’s superior performance across several benchmarks, achieving high
segmentation accuracy and enabling interpretable textual analysis of medical
images.
[70] LLMs Are Not Yet Ready for Deepfake Image Detection
Shahroz Tariq,David Nguyen,M. A. P. Chamikara,Tingmin Wu,Alsharif Abuadbba,Kristen Moore
Main category: cs.CV
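TL;DR: Through zero-shot evaluation, the paper tests four mainstream vision-language models (VLMs) on deepfake image detection and finds that, although they can produce plausible explanations and spot surface-level anomalies, they are not yet suitable as standalone detection systems.
Details
Motivation: Advances in deepfake technology threaten media integrity and public trust, and VLMs are seen as potentially useful for deepfake detection because of their multimodal capabilities. The study evaluates how VLMs actually perform on this task. Contribution: Systematically evaluates the zero-shot performance of four VLMs (ChatGPT, Claude, Gemini, Grok) on deepfake detection, revealing their limitations (such as overemphasis on stylistic elements) and their potential (such as interpretability).
Method: The study uses zero-shot evaluation on a benchmark of authentic and manipulated images, testing classification accuracy and reasoning on three deepfake types (faceswap, reenactment, synthetic generation).
Result: VLMs show significant limitations as standalone detectors, but their strengths in contextual analysis and interpretability make them useful complements in hybrid or human-in-the-loop detection frameworks.
Insight: Although general-purpose models cannot yet perform deepfake detection autonomously, they have potential to support human expert review, particularly by providing explanations and contextual analysis.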
Abstract: The growing sophistication of deepfakes presents substantial challenges to
the integrity of media and the preservation of public trust. Concurrently,
vision-language models (VLMs), large language models enhanced with visual
reasoning capabilities, have emerged as promising tools across various domains,
sparking interest in their applicability to deepfake detection. This study
conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT,
Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap,
reenactment, and synthetic generation. Leveraging a meticulously assembled
benchmark comprising authentic and manipulated images from diverse sources, we
evaluate each model’s classification accuracy and reasoning depth. Our analysis
indicates that while VLMs can produce coherent explanations and detect
surface-level anomalies, they are not yet dependable as standalone detection
systems. We highlight critical failure modes, such as an overemphasis on
stylistic elements and vulnerability to misleading visual patterns like vintage
aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and
contextual analysis, suggesting their potential to augment human expertise in
forensic workflows. These insights imply that although general-purpose models
currently lack the reliability needed for autonomous deepfake detection, they
hold promise as integral components in hybrid or human-in-the-loop detection
frameworks.
[71] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation
Shuyang Li,Shuang Wang,Zhuangzhuang Sun,Jing Xiao
Main category: cs.CV
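TL;DR: PSLG-SAM is a two-stage framework that handles reference remote sensing image segmentation through coarse localization followed by fine segmentation, substantially reducing annotation requirements and outperforming current state-of-the-art models.
Details
Motivation: Current RRSIS methods rely on multimodal fusion backbones and semantic segmentation heads, but face dense annotation requirements and difficulty interpreting complex scenes. Contribution: Proposes the PSLG-SAM framework, which decomposes the task into coarse localization and fine segmentation; contributes a high-quality, multi-category annotated dataset.
Method: In the coarse localization stage, a visual grounding network locates the object described by the text; in the fine segmentation stage, a clustering-based foreground point generator and a mask boundary optimization strategy guide the SAM model.
Result: Validation on two datasets shows that PSLG-SAM significantly outperforms existing state-of-the-art models.
Insight: Decomposing the task avoids interference from complex scenes, and the second stage can be training-free, which substantially reduces the annotation burden.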
Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates
segmentation masks for specified objects in images based on textual
descriptions, which has attracted widespread attention and research interest.
Current RRSIS methods rely on multi-modal fusion backbones and semantic
segmentation heads but face challenges like dense annotation requirements and
complex scene interpretation. To address these issues, we propose a framework
named prompt-generated semantic localization guiding Segment Anything
Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse
localization and fine segmentation. In the coarse localization stage, a visual
grounding network roughly locates the text-described object. In the fine
segmentation stage, the coordinates from the first stage guide the Segment
Anything Model (SAM), enhanced by a clustering-based foreground point generator
and a mask boundary iterative optimization strategy for precise segmentation.
Notably, the second stage can be train-free, significantly reducing the
annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS
task into two stages allows for focusing on specific region segmentation,
avoiding interference from complex scenes. We further contribute a high-quality,
multi-category manually annotated dataset. Experimental validation on two
datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant
performance improvements and surpasses existing state-of-the-art models. Our
code will be made publicly available.
[72] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft
Jin Huang,Mingqiang Wei,Zikuan Li,Hangyu Qu,Wei Zhao,Xinyu Bai
Main category: cs.CV
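TL;DR: J-DDL is a smart system for detecting and localizing surface damage on fighter aircraft; it combines 2D images and 3D point clouds and uses an optimized YOLO architecture with a new loss function to improve detection accuracy and efficiency.
Details
Motivation: Traditional manual inspection of fighter aircraft surfaces suffers from limited scalability, efficiency, and consistency; J-DDL aims to address these challenges through automation. Contribution: Proposes a new YOLO-based damage detection network that incorporates lightweight modules and a multi-scale attention mechanism; develops the first publicly available aircraft damage dataset.
Method: Laser scanners and cameras capture 2D images and 3D point clouds; an optimized YOLO architecture (with Fasternet blocks and EMA modules) detects damage, and the Inner-CIOU loss function improves accuracy.
Result: Experiments validate the effectiveness of J-DDL in accurately detecting and localizing surface damage on fighter aircraft.
Insight: Combining 2D and 3D data is promising for damage detection on complex surfaces; lightweight design and attention mechanisms are key to model efficiency.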
Abstract: Ensuring the safety and extended operational life of fighter aircraft
necessitates frequent and exhaustive inspections. While surface defect
detection is feasible for human inspectors, manual methods face critical
limitations in scalability, efficiency, and consistency due to the vast surface
area, structural complexity, and operational demands of aircraft maintenance.
We propose a smart surface damage detection and localization system for fighter
aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the
entire aircraft surface, captured using a combined system of laser scanners and
cameras, to achieve precise damage detection and localization. Central to our
system is a novel damage detection network built on the YOLO architecture,
specifically optimized for identifying surface defects in 2D aircraft images.
Key innovations include lightweight Fasternet blocks for efficient feature
extraction, an optimized neck architecture incorporating Efficient Multiscale
Attention (EMA) modules for superior feature aggregation, and the introduction
of a novel loss function, Inner-CIOU, to enhance detection accuracy. After
detecting damage in 2D images, the system maps the identified anomalies onto
corresponding 3D point clouds, enabling accurate 3D localization of defects
across the aircraft surface. Our J-DDL not only streamlines the inspection
process but also ensures more comprehensive and detailed coverage of large and
complex aircraft exteriors. To facilitate further advancements in this domain,
we have developed the first publicly available dataset specifically focused on
aircraft damage. Experimental evaluations validate the effectiveness of our
framework, underscoring its potential to significantly advance automated
aircraft inspection technologies.
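The abstract names the Inner-CIOU loss without defining it. One plausible reading combines a CIoU-style loss with an overlap term computed on auxiliary boxes scaled about each box center. The sketch below shows only that auxiliary-box overlap; the `ratio` parameter, its default value, and all function names are assumptions, and the authors' exact formulation may differ.

```python
def scale_box(box, ratio):
    """Scale an (x1, y1, x2, y2) box about its center by `ratio`."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    half_w = (box[2] - box[0]) * ratio / 2.0
    half_h = (box[3] - box[1]) * ratio / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def iou(a, b):
    """Plain intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def inner_iou(pred, target, ratio=0.75):
    """IoU computed on center-scaled auxiliary boxes (assumed form)."""
    return iou(scale_box(pred, ratio), scale_box(target, ratio))


# Hypothetical predicted and ground-truth boxes.
pred, target = (0, 0, 4, 4), (2, 0, 6, 4)
plain = iou(pred, target)
inner_half = inner_iou(pred, target, ratio=0.5)
```

With a ratio of 1.0 the auxiliary boxes coincide with the originals and the measure reduces to the ordinary IoU. In this example, shrinking both boxes to half size removes their overlap entirely, so the auxiliary measure penalizes the center offset more strongly than the plain IoU does.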
[73] CogStream: Context-guided Streaming Video Question Answering
Zicheng Zhao,Kangyu Wang,Shijie Li,Rui Qian,Weiyao Lin,Huabin Liu
Main category: cs.CV
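TL;DR: The paper proposes a new task, CogStream, focused on context-guided reasoning over streaming video, along with a densely annotated dataset and a baseline model, CogReasoner, to address the computational burden and irrelevant-context distraction of existing methods.
Details
Motivation: When processing streaming video, existing Video Large Language Models (Vid-LLMs) take in all available historical context, which is computationally heavy and can distract the model with irrelevant information. The paper aims to solve this problem. Contribution: 1. Proposes the new CogStream task; 2. Builds a densely annotated dataset; 3. Proposes the baseline model CogReasoner, which tackles the task efficiently through visual stream compression and historical dialogue retrieval.
Method: CogReasoner reduces computation through visual stream compression and uses a historical dialogue retrieval mechanism to select the most relevant context dynamically.
Result: Experiments show the method is efficient and effective on streaming video reasoning.
Insight: Relying only on relevant context can improve model performance and reduce computational cost, offering a new direction for streaming video reasoning.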
Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving
multimodal understanding, challenges persist in streaming video reasoning due
to its reliance on contextual information. Existing paradigms feed all
available historical contextual information into Vid-LLMs, resulting in a
significant computational burden for visual data processing. Furthermore, the
inclusion of irrelevant context distracts models from key details. This paper
introduces a challenging task called Context-guided Streaming Video Reasoning
(CogStream), which simulates real-world streaming video scenarios, requiring
models to identify the most relevant historical contextual information to
deduce answers for questions about the current stream. To support CogStream, we
present a densely annotated dataset featuring extensive and hierarchical
question-answer pairs, generated by a semi-automatic pipeline. Additionally, we
present CogReasoner as a baseline model. It efficiently tackles this task by
leveraging visual stream compression and historical dialogue retrieval.
Extensive experiments prove the effectiveness of this method. Code will be
released soon.
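The idea of historical dialogue retrieval can be sketched as a top-k similarity search over embeddings of earlier question-answer turns. This is a minimal sketch under assumptions: the cosine-similarity ranking, the toy 3-dimensional embeddings, and the function names are illustrative and do not describe CogReasoner's actual retrieval mechanism.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_relevant_turns(query_vec, history, top_k=2):
    """Return the texts of the top_k past turns most similar to the query.

    history: list of (turn_text, embedding) pairs.
    """
    ranked = sorted(history, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Hypothetical dialogue history with hand-made embeddings.
history = [
    ("Q1: What is the chef holding?", [1.0, 0.0, 0.0]),
    ("Q2: What colour is the car outside?", [0.0, 1.0, 0.0]),
    ("Q3: What did the chef add to the pan?", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.05, 0.0]
context = retrieve_relevant_turns(query, history, top_k=2)
```

Only the two cooking-related turns are passed on as context, so the unrelated question about the car never reaches the model.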
[74] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations
Yutong Zhou,Masahiro Ryo
Main category: cs.CV
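TL;DR: The paper proposes an end-to-end visual-to-causal framework that turns a species image into interpretable causal insights about habitat preference, combining species recognition, global occurrence retrieval, and related steps to generate human-readable explanations.
Details
Motivation: Existing ecological workflows are fragmented and unfriendly to non-experts, so a more intuitive way to explain why a species prefers its habitat is needed. Contribution: Proposes a complete image-to-causal-insight framework that integrates multimodal AI with ecological modeling practice and generates human-readable habitat explanations.
Method: The framework comprises species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction, and then uses causal inference methods to discover the causal structure among environmental features.
Result: The framework's potential is demonstrated on a bee and a flower species, producing statistically grounded, human-readable explanations.
Insight: A multimodal AI assistant combined with ecological modeling practice gives non-experts a new way to understand species habitat preferences.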
Abstract: Explaining why a species lives at a particular location is important for
understanding ecological systems and conserving biodiversity. However, existing
ecological workflows are fragmented and often inaccessible to non-specialists.
We propose an end-to-end visual-to-causal framework that transforms a species
image into interpretable causal insights about its habitat preference. The
system integrates species recognition, global occurrence retrieval,
pseudo-absence sampling, and climate data extraction. We then discover causal
structures among environmental features and estimate their influence on species
occurrence using modern causal inference methods. Finally, we generate
statistically grounded, human-readable causal explanations from structured
templates and large language models. We demonstrate the framework on a bee and
a flower species and report early results as part of an ongoing project,
showing the potential of the multimodal AI assistant backed up by a recommended
ecological modeling practice for describing species habitat in
human-understandable language.
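Pseudo-absence sampling is commonly done by drawing random background points that keep a minimum distance from known occurrences. The sketch below follows that generic recipe. The bounding box, the distance threshold in degrees, the plain Euclidean distance on longitude and latitude, and the function name are all simplifying assumptions and are not details from the paper.

```python
import math
import random


def sample_pseudo_absences(presences, bbox, n, min_dist_deg=1.0, seed=0, max_tries=10000):
    """Draw up to n random points inside bbox that lie at least min_dist_deg
    away from every presence point.

    presences: list of (lon, lat); bbox: (min_lon, min_lat, max_lon, max_lat).
    """
    rng = random.Random(seed)
    absences = []
    tries = 0
    while len(absences) < n and tries < max_tries:
        tries += 1
        point = (rng.uniform(bbox[0], bbox[2]), rng.uniform(bbox[1], bbox[3]))
        # Reject candidates that fall too close to a recorded occurrence.
        if all(math.dist(point, p) >= min_dist_deg for p in presences):
            absences.append(point)
    return absences


# Hypothetical occurrence records and study area.
presences = [(10.0, 50.0), (11.0, 51.0)]
study_area = (5.0, 45.0, 15.0, 55.0)
absences = sample_pseudo_absences(presences, study_area, n=20)
```

A real workflow would use geodesic distances and an ecologically motivated background region; the `max_tries` cap only guards against an endless loop when the constraints cannot be met.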
[75] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics
Imanol Solano,Julian Fierrez,Aythami Morales,Alejandro Peña,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin
Main category: cs.CV
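TL;DR: The paper proposes the Comprehensive Equity Index (CEI) and its automated version CEI^A to detect tail-distribution biases in face recognition systems that traditional metrics struggle to capture; experiments show it outperforms existing methods.
Details
Motivation: Existing metrics struggle to detect biases in the distribution tails of high-performance face recognition systems, especially subtle disparities. Contribution: 1. Proposes the CEI metric, which analyzes genuine and impostor score distributions separately with a configurable focus on the tails; 2. Proposes the automated version CEI^A, which enhances objectivity; 3. Validates effectiveness across a range of data and models.
Method: CEI quantifies bias by analyzing the tail probabilities and overall shapes of the genuine and impostor score distributions; CEI^A optimizes the parameters automatically.
Result: Experiments show that CEI outperforms existing methods, detects subtle biases, and applies to a variety of distribution-comparison problems.
Insight: Focusing on distribution tails captures bias more sensitively, and the automated version simplifies practical use.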
Abstract: Demographic bias in high-performance face recognition (FR) systems often
eludes detection by existing metrics, especially with respect to subtle
disparities in the tails of the score distribution. We introduce the
Comprehensive Equity Index (CEI), a novel metric designed to address this
limitation. CEI uniquely analyzes genuine and impostor score distributions
separately, enabling a configurable focus on tail probabilities while also
considering overall distribution shapes. Our extensive experiments (evaluating
state-of-the-art FR systems, intentionally biased models, and diverse datasets)
confirm CEI’s superior ability to detect nuanced biases where previous methods
fall short. Furthermore, we present CEI^A, an automated version of the metric
that enhances objectivity and simplifies practical application. CEI provides a
robust and sensitive tool for operational FR fairness assessment. The proposed
methods have been developed particularly for bias evaluation in face biometrics
but, in general, they are applicable for comparing statistical distributions in
any problem where one is interested in analyzing the distribution tails.
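The abstract does not give the CEI formula, so the sketch below is not CEI. It only illustrates the general idea of comparing two score distributions in their tails: how much of one group's scores fall below a low quantile of a reference group. The function names, the quantile convention, and the toy scores are assumptions.

```python
def quantile(scores, q):
    """Simple empirical quantile: the element at index floor(q * n) after sorting."""
    ordered = sorted(scores)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def lower_tail_mass(scores, threshold):
    """Fraction of scores strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)


def tail_gap(reference, other, q=0.1):
    """Excess lower-tail mass of `other` below the reference q-quantile."""
    threshold = quantile(reference, q)
    return lower_tail_mass(other, threshold) - lower_tail_mass(reference, threshold)


# Hypothetical genuine-score samples for two demographic groups.
reference = list(range(100))
group_b = [s - 20 if s < 30 else s for s in range(100)]
gap = tail_gap(reference, group_b, q=0.1)
```

In this toy data the upper 70% of the two groups is identical, yet 30% of group B falls below the reference 10% quantile, a difference confined to the tail.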
[76] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
Lizhen Wang,Zhurong Xia,Tianshu Hu,Pengrui Wang,Pengfei Wang,Zerong Zheng,Ming Zhou
Main category: cs.CV
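TL;DR: Proposes DreamActor-H1, a Diffusion Transformer (DiT)-based framework for generating high-fidelity human-product demonstration videos that preserves both human identity and product details, with precise motion guidance from a 3D body mesh and product bounding boxes.
Details
Motivation: In e-commerce, high-quality human-product demonstration videos are important for product presentation, but existing methods struggle to preserve the identity details of both the person and the product and lack an understanding of human-product spatial relationships. Contribution: 1. Proposes a DiT framework that preserves identity and detail through paired human-product reference information and a masked cross-attention mechanism; 2. Uses a 3D body mesh and product bounding boxes for motion guidance; 3. Introduces structured text encoding to enhance 3D consistency.
Method: Built on a Diffusion Transformer, the method combines masked cross-attention, a 3D body mesh template, and product bounding boxes, injecting reference information and providing motion guidance during video generation.
Result: Trained on a hybrid dataset, the approach outperforms existing techniques, preserving identity better and generating more natural motion.
Insight: Combining 3D spatial information with structured text encoding can markedly improve the realism and consistency of human-product interaction videos.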
Abstract: In e-commerce and digital marketing, generating high-fidelity human-product
demonstration videos is important for effective product presentation. However,
most existing frameworks either fail to preserve the identities of both humans
and products or lack an understanding of human-product spatial relationships,
leading to unrealistic representations and unnatural interactions. To address
these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our
method simultaneously preserves human identities and product-specific details,
such as logos and textures, by injecting paired human-product reference
information and utilizing an additional masked cross-attention mechanism. We
employ a 3D body mesh template and product bounding boxes to provide precise
motion guidance, enabling intuitive alignment of hand gestures with product
placements. Additionally, structured text encoding is used to incorporate
category-level semantics, enhancing 3D consistency during small rotational
changes across frames. Trained on a hybrid dataset with extensive data
augmentation strategies, our approach outperforms state-of-the-art techniques
in maintaining the identity integrity of both humans and products and
generating realistic demonstration motions. Project page:
https://submit2025-dream.github.io/DreamActor-H1/.
[77] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration
Jun Wang,Lixing Zhu,Xiaohan Yu,Abhir Bhalerao,Yulan He
Main category: cs.CV
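TL;DR: The PLACE framework improves medical visual representation learning through pathological-level cross-modal alignment and correlation exploration, requires no additional human annotation, and achieves state-of-the-art performance on multiple downstream tasks.
Details
Motivation: The work targets data scarcity in the medical domain and the complex semantics and pathological relations found in long reports, whereas existing methods mostly overlook pathological-level consistency. Contribution: Proposes the PLACE framework, comprising pathological-level cross-modal alignment (PCMA) and a correlation exploration task, which improves generalizability and robustness without external annotations.
Method: 1. Designs a Visual Pathology Observation Extractor; 2. Proposes the PCMA module for pathological-level alignment; 3. Introduces an image-patch correlation proxy task to enrich fine-grained details.
Result: Achieves state-of-the-art results on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.
Insight: Pathological-level alignment effectively improves medical visual representation learning, and the unsupervised correlation exploration task further enriches detail.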
Abstract: Learning medical visual representations from image-report pairs through joint
learning has garnered increasing research attention due to its potential to
alleviate the data scarcity problem in the medical domain. The primary
challenges stem from the lengthy reports that feature complex discourse
relations and semantic pathologies. Previous works have predominantly focused
on instance-wise or token-wise cross-modal alignment, often neglecting the
importance of pathological-level consistency. This paper presents a novel
framework PLACE that promotes the Pathological-Level Alignment and enriches the
fine-grained details via Correlation Exploration without additional human
annotations. Specifically, we propose a novel pathological-level cross-modal
alignment (PCMA) approach to maximize the consistency of pathology observations
from both images and reports. To facilitate this, a Visual Pathology
Observation Extractor is introduced to extract visual pathological observation
representations from localized tokens. The PCMA module operates independently
of any external disease annotations, enhancing the generalizability and
robustness of our methods. Furthermore, we design a proxy task that enforces
the model to identify correlations among image patches, thereby enriching the
fine-grained details crucial for various downstream tasks. Experimental results
demonstrate that our proposed framework achieves new state-of-the-art
performance on multiple downstream tasks, including classification,
image-to-text retrieval, semantic segmentation, object detection and report
generation.
[78] DanceChat: Large Language Model-Guided Music-to-Dance Generation
Qing Wang,Xiaohang Yang,Yilan Dong,Naveen Raj Govindaraj,Gregory Slabaugh,Shanxin Yuan
Main category: cs.CV
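TL;DR: DanceChat is a large language model (LLM)-guided music-to-dance generation method that combines textual motion instructions with music features to generate dance motion that is diverse and consistent with the musical style.
Details
Motivation: Music-to-dance generation faces a semantic gap and a one-to-many mapping; existing methods learn dance motion from music alone, which limits diversity and style alignment. Contribution: 1. Introduces an LLM as a choreographer that provides textual motion instructions; 2. Proposes a multimodal feature extraction and fusion module; 3. Combines a diffusion model with a multimodal alignment loss to generate dance motion.
Method: 1. The LLM generates pseudo instructions; 2. Multimodal feature fusion combines music, rhythm, and textual guidance; 3. A diffusion model synthesizes motion and is optimized with an alignment loss.
Result: On the AIST++ dataset and in human evaluations, DanceChat outperforms existing methods both qualitatively and quantitatively.
Insight: Textual guidance from an LLM explicitly improves the diversity and musical-style alignment of generated dance, compensating for the limits of purely music-driven generation.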
Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned
on musical input. Despite recent progress, significant challenges remain due to
the semantic gap between music and dance motion, as music offers only abstract
cues, such as melody, groove, and emotion, without explicitly specifying the
physical movements. Moreover, a single piece of music can produce multiple
plausible dance interpretations. This one-to-many mapping demands additional
guidance, as music alone provides limited information for generating diverse
dance movements. The challenge is further amplified by the scarcity of paired
music and dance data, which restricts the model's ability to learn
diverse dance patterns. In this paper, we introduce DanceChat, a Large Language
Model (LLM)-guided music-to-dance generation approach. We use an LLM as a
choreographer that provides textual motion instructions, offering explicit,
high-level guidance for dance generation. This approach goes beyond implicit
learning from music alone, enabling the model to generate dance that is both
more diverse and better aligned with musical styles. Our approach consists of
three components: (1) an LLM-based pseudo instruction generation module that
produces textual dance guidance based on music style and structure, (2) a
multi-modal feature extraction and fusion module that integrates music, rhythm,
and textual guidance into a shared representation, and (3) a diffusion-based
motion synthesis module together with a multi-modal alignment loss, which
ensures that the generated dance is aligned with both musical and textual cues.
Extensive experiments on AIST++ and human evaluations show that DanceChat
outperforms state-of-the-art methods both qualitatively and quantitatively.
[79] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning
Chun-Mei Feng,Kai Yu,Xinxing Xu,Salman Khan,Rick Siow Mong Goh,Wangmeng Zuo,Yong Liu
Main category: cs.CV
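TL;DR: The paper proposes T2I-PAL, a new method that uses text-to-image generation models to reduce the modality gap and combines prompt tuning with adapter learning, improving multi-label image recognition by 3.47% on average over prior state-of-the-art methods.
Details
Motivation: Pre-trained vision-language models such as CLIP can be fine-tuned efficiently thanks to image-text contrastive learning, but the modality gap limits their performance, especially in multi-label image recognition. Contribution: 1) Proposes T2I-PAL, which reduces the modality gap by generating photo-realistic and diverse images; 2) Combines a class-wise heatmap and learnable prototypes to strengthen local visual feature representations; 3) Jointly uses prompt tuning and adapter learning to improve classification performance.
Method: 1) Generates images with a text-to-image generation model; 2) Designs a class-wise heatmap and learnable prototypes to aggregate local similarities; 3) Fine-tunes with combined prompt tuning and adapter learning.
Result: On benchmarks including MS-COCO, VOC2007, and NUS-WIDE, T2I-PAL improves recognition performance by 3.47% on average over the best existing methods.
Insight: Generating images to bridge the modality gap is effective, and combining it with local feature enhancement and efficient fine-tuning further improves model performance.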
Abstract: Thanks to image-text contrastive learning, pre-trained vision-language
models such as CLIP make it possible to directly leverage texts as images (TaI) for
parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image
features similar to the corresponding text features, the modality gap
remains a nontrivial issue and limits the image recognition performance of TaI.
Using multi-label image recognition (MLR) as an example, we present a novel
method, called T2I-PAL to tackle the modality gap issue when using only text
captions for PEFT. The core design of T2I-PAL is to leverage pre-trained
text-to-image generation models to generate photo-realistic and diverse images
from text captions, thereby reducing the modality gap. To further enhance MLR,
T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This
aggregates local similarities, making the representation of local visual
features more robust and informative for multi-label recognition. For better
PEFT, we further combine both prompt tuning and adapter learning to enhance
classification performance. T2I-PAL offers significant advantages: it
eliminates the need for fully semantically annotated training images, thereby
reducing the manual annotation workload, and it preserves the intrinsic mode of
the CLIP model, allowing for seamless integration with any existing CLIP
framework. Extensive experiments on multiple benchmarks, including MS-COCO,
VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance
by 3.47% on average over the top-ranked state-of-the-art methods.
[80] Rethinking Random Masking in Self Distillation on ViT
Jihyeon Seong,Hyunkyung Han
Main category: cs.CV
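TL;DR: The paper examines the role of random masking in ViT self-distillation frameworks such as DINO and proposes an asymmetric masking strategy that masks only the student's global view while keeping the local views and the teacher's view intact, improving attention robustness and downstream performance.
Details
Motivation: In ViT self-distillation frameworks, random masking may inadvertently destroy critical semantic information, so the paper explores a more sensible masking strategy that balances training efficiency with semantic preservation. Contribution: Proposes an asymmetric random masking strategy that masks only the student's global view while preserving the full information of the local views and the teacher's view within the self-distillation framework, thereby improving the attention mechanism.
Method: Within the DINO framework, random masking is applied only to the student's global view, while the local views and the teacher's view are kept intact. This asymmetric design builds on DINO's multi-view augmentation scheme to achieve robust training.
Result: Experiments show that on the mini-ImageNet dataset the strategy yields more fine-grained and robust attention and improves downstream performance.
Insight: The effectiveness of random masking in self-distillation depends on how it is applied; an asymmetric masking design improves robustness and performance without destroying critical information.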
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a
wide range of vision tasks. In particular, self-distillation frameworks such as
DINO have contributed significantly to these advances. Within such frameworks,
random masking is often utilized to improve training efficiency and introduce
regularization. However, recent studies have raised concerns that
indiscriminate random masking may inadvertently eliminate critical semantic
information, motivating the development of more informed masking strategies. In
this study, we explore the role of random masking in the self-distillation
setting, focusing on the DINO framework. Specifically, we apply random masking
exclusively to the student’s global view, while preserving the student’s local
views and the teacher’s global view in their original, unmasked forms. This
design leverages DINO’s multi-view augmentation scheme to retain clean
supervision while inducing robustness through masked inputs. We evaluate our
approach using DINO-Tiny on the mini-ImageNet dataset and show that random
masking under this asymmetric setup yields more robust and fine-grained
attention maps, ultimately enhancing downstream performance.
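The asymmetric setup can be sketched as follows: random masking touches only the student's global view, while the teacher's global view and the student's local views pass through unchanged. This is a simplified illustration with a single global view, lists standing in for patch tokens, and `None` standing in for a mask token; the function names and the 0.3 mask ratio are assumptions.

```python
import random


def random_mask(patches, mask_ratio, rng):
    """Replace a random subset of patch tokens with None (a mask-token stand-in)."""
    n_mask = int(len(patches) * mask_ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    return [None if i in masked_idx else p for i, p in enumerate(patches)]


def build_views(global_patches, local_patches_list, mask_ratio=0.3, seed=0):
    """Assemble the inputs for one training step under asymmetric masking."""
    rng = random.Random(seed)
    return {
        # Teacher sees the clean global view, so supervision stays intact.
        "teacher_global": list(global_patches),
        # Only the student's global view is randomly masked.
        "student_global": random_mask(global_patches, mask_ratio, rng),
        # Student local views are left unmasked.
        "student_local": [list(view) for view in local_patches_list],
    }


# Hypothetical patch token ids for one global crop and two local crops.
global_view = list(range(10))
local_views = [[0, 1, 2], [3, 4]]
views = build_views(global_view, local_views, mask_ratio=0.3)
```

Keeping the teacher input clean means the distillation target is computed from full information, while the student has to match it from a partially hidden view.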
[81] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement
Jin Huang,Honghua Chen,Mingqiang Wei
Main category: cs.CV
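TL;DR: The paper proposes HEA-MM, a hierarchical error assessment framework for evaluating the quality of aircraft CAD models within a manufacturing-and-measurement platform.
Details
Motivation: High quality is essential for aviation equipment, and manufacturing errors must be assessed with high precision to ensure performance, stability, and reliability. Contribution: Proposes a hierarchical error assessment framework (HEA-MM) with error analysis at the global, part, and feature levels, and introduces an optimization-based primitive refinement method and a two-stage circular feature detection algorithm.
Method: Structured light scanners capture 3D point clouds of workpieces, and errors are assessed at three hierarchical levels (global, part, feature). At the part level, splitting and merging operations refine the point cloud primitives; at the feature level, a two-stage algorithm detects circular holes.
Result: Experiments on various aircraft CAD models demonstrate the effectiveness of the method and its ability to assess manufacturing errors accurately.
Insight: Hierarchical error assessment captures manufacturing errors more comprehensively, and primitive refinement together with two-stage feature detection improves the accuracy and efficiency of the analysis.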
Abstract: The most essential feature of aviation equipment is high quality, including
high performance, high stability and high reliability. In this paper, we
propose a novel hierarchical error assessment framework for aircraft CAD models
within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs
structured light scanners to obtain comprehensive 3D measurements of
manufactured workpieces. The measured point cloud is registered with the
reference CAD model, followed by an error analysis conducted at three
hierarchical levels: global, part, and feature. At the global level, the error
analysis evaluates the overall deviation of the scanned point cloud from the
reference CAD model. At the part level, error analysis is performed on these
patches underlying the point clouds. We propose a novel optimization-based
primitive refinement method to obtain a set of meaningful patches of point
clouds. Two basic operations, splitting and merging, are introduced to refine
the coarse primitives. At the feature level, error analysis is performed on
circular holes, which are commonly found in CAD models. To facilitate it, a
two-stage algorithm is introduced for the detection of circular holes. First,
edge points are identified using a tensor-voting algorithm. Then, multiple
circles are fitted through a hypothesize-and-clusterize framework, ensuring
accurate detection and analysis of the circular features. Experimental results
on various aircraft CAD models demonstrate the effectiveness of our proposed
method.
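The "hypothesize" half of a hypothesize-and-cluster circle detector can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything here is an assumption for illustration: the function names `circle_from_three_points` and `count_inliers`, the tolerance `tol`, and the use of 2D edge points already projected onto the hole's plane. It is a generic three-point circle hypothesis with inlier counting, not the authors' implementation.

```python
import math

def circle_from_three_points(p1, p2, p3):
    """Return (cx, cy, r) of the circle through three 2D points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None  # collinear points define no finite circle
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)

def count_inliers(circle, points, tol=0.05):
    """Count edge points whose distance to the circle is within tol (hypothetical tolerance)."""
    cx, cy, r = circle
    return sum(1 for x, y in points if abs(math.hypot(x - cx, y - cy) - r) <= tol)
```

In a full pipeline, many such hypotheses would be generated from sampled edge-point triples and then grouped, so that each cluster of similar circles corresponds to one detected hole.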
[82] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection
Xinyuan Liu,Hang Xu,Yike Ma,Yucheng Zhang,Feng Dai
Main category: cs.CV
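TL;DR: The paper proposes SSP (Semantic-decoupled Spatial Partition), a unified framework that addresses sample assignment and instance confusion in point-supervised oriented object detection. By combining rule-driven prior injection with data-driven label purification, SSP substantially improves detection performance.
Details
Motivation: The rapid growth of remote sensing imagery calls for efficient oriented object detection, but manual annotation of high-density scenes is expensive. Existing point-supervised methods suffer from inadequate sample assignment and instance confusion. Contribution: 1) The SSP framework, which combines rule-driven and data-driven approaches to improve sample assignment and label generation; 2) novel sample assignment and bounding box extraction methods based on pixel-level and semantic spatial partitioning.
Method: 1) Pixel-level spatial partition-based sample assignment: spatial partitioning estimates the range of object scales and mines high-quality positive samples and hard negative samples; 2) semantic spatial partition-based box extraction: semantic maps modulate the spatial partitions to produce pseudo-labels.
Result: On DOTA-v1.0 and other datasets, SSP reaches 45.78% mAP, outperforming the SOTA method PointOBB-v2 by 4.10%; combined with ORCNN and ReDet, it reaches 47.86% and 48.50% mAP, respectively.
Insight: By decoupling semantic and spatial information, SSP mitigates the sample assignment and instance extraction problems of point supervision and offers an efficient, low-cost solution for remote sensing object detection in high-density scenes.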
Abstract: Recent advances in remote sensing technology have driven rapid growth
in imagery and rapid development of oriented object detection, yet progress is
hindered by labor-intensive annotation for high-density scenes. Oriented object
detection with point supervision
offers a cost-effective solution for densely packed scenes in remote sensing,
yet existing methods suffer from inadequate sample assignment and instance
confusion due to rigid rule-based designs. To address this, we propose SSP
(Semantic-decoupled Spatial Partition), a unified framework that synergizes
rule-driven prior injection and data-driven label purification. Specifically,
SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based
Sample Assignment, which compactly estimates the upper and lower bounds of
object scales and mines high-quality positive samples and hard negative samples
through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based
Box Extraction, which derives instances from spatial partitions modulated by
semantic maps and reliably converts them into bounding boxes to form
pseudo-labels for supervising the learning of downstream detectors. Experiments
on DOTA-v1.0 and other datasets demonstrate SSP's superiority: it achieves 45.78% mAP
under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%.
Furthermore, when integrated with ORCNN and ReDet architectures, the SSP
framework achieves mAP values of 47.86% and 48.50%, respectively. The code is
available at https://github.com/antxinyuan/ssp.
[83] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model
Eshan Ramesh,Nishio Takayuki
Main category: cs.CV
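TL;DR: LatentCSI is a new method for generating high-resolution images from WiFi channel state information (CSI). It leverages a pretrained latent diffusion model (LDM) and bypasses the complexity of pixel-space image generation.
Details
Motivation: Existing methods rely on complex, computationally intensive techniques such as GANs. LatentCSI uses a lightweight network to map CSI into the latent space of an LDM, improving both efficiency and generation quality. Contribution: LatentCSI, which uses a pretrained latent diffusion model for efficient, high-quality image generation with text-guided control.
Method: A lightweight network maps CSI amplitudes into the LDM latent space; the pretrained LDM then applies denoising diffusion with text guidance, and its decoder produces the final image.
Result: Validated on two datasets, LatentCSI outperforms baselines in computational efficiency and perceptual quality while also offering the flexibility of text guidance.
Insight: By combining a pretrained LDM with a lightweight mapping, LatentCSI balances efficiency and quality in image generation and supports user-controlled generation.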
Abstract: We present LatentCSI, a novel method for generating images of the physical
environment from WiFi CSI measurements that leverages a pretrained latent
diffusion model (LDM). Unlike prior approaches that rely on complex and
computationally intensive techniques such as GANs, our method employs a
lightweight neural network to map CSI amplitudes directly into the latent space
of an LDM. We then apply the LDM’s denoising diffusion model to the latent
representation with text-based guidance before decoding using the LDM’s
pretrained decoder to obtain a high-resolution image. This design bypasses the
challenges of pixel-space image generation and avoids the explicit image
encoding stage typically required in conventional image-to-image pipelines,
enabling efficient and high-quality image synthesis. We validate our approach
on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi
devices and cameras; and a subset of the publicly available MM-Fi dataset. The
results demonstrate that LatentCSI outperforms baselines of comparable
complexity trained directly on ground-truth images in both computational
efficiency and perceptual quality, while additionally providing practical
advantages through its unique capacity for text-guided controllability.
[84] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin,Xudong Xie,Zhang Li,Xiang Bai,Yuliang Liu
Main category: cs.CV
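TL;DR: MSTAR is a box-free, attention-recycling method for multi-query scene text retrieval. Progressive vision embedding and a multi-instance matching module improve text representation and alignment, and the method clearly outperforms prior work on the new MQTR benchmark.
Details
Motivation: Existing scene text retrieval methods depend on costly bounding box annotations and struggle to unify different query types to meet diverse retrieval needs. Contribution: 1. The MSTAR method, which requires no bounding box annotations; 2. progressive vision embedding and a multi-instance matching module; 3. the MQTR dataset for evaluating multi-query capability.
Method: Progressive vision embedding and attention recycling dynamically capture multi-grained text representations, and style-aware instructions harmonize free-style text queries.
Result: MAP improves by 6.4% on Total-Text, and the average improvement on the MQTR dataset is 8.5%.
Insight: Box-free methods show promise for text retrieval, and multi-query support and vision-language alignment are directions for future work.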
Abstract: Scene text retrieval has made significant progress with the assistance of
accurate text localization. However, existing approaches typically require
costly bounding box annotations for training. Besides, they mostly adopt a
customized retrieval strategy but struggle to unify various types of queries to
meet diverse retrieval needs. To address these issues, we introduce Multi-query
Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for
scene text retrieval. It incorporates progressive vision embedding to
dynamically capture the multi-grained representation of texts and harmonizes
free-style text queries with style-aware instructions. Additionally, a
multi-instance matching module is integrated to enhance vision-language
alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset,
the first benchmark designed to evaluate the multi-query scene text retrieval
capability of models, comprising four query types and 16k images. Extensive
experiments demonstrate the superiority of our method across seven public
datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous
state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box
annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly
outperforms the previous models by an average of 8.5%. The code and datasets
are available at https://github.com/yingift/MSTAR.
[85] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models
Konstantinos Vilouras,Ilias Stogiannidis,Junyu Yan,Alison Q. O’Neil,Sotirios A. Tsaftaris
Main category: cs.CV
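TL;DR: The paper proposes an anatomy-grounded, weakly supervised prompt tuning method that improves the multi-modal alignment of chest X-ray latent diffusion models so that they adapt better to downstream tasks.
Details
Motivation: Text-to-image latent diffusion models show insufficient multi-modal alignment in medical imaging (for example, chest X-rays), mainly because of limited data. This work aims to address that problem. Contribution: A fine-tuning framework for pre-trained models that markedly improves alignment between clinical text and image regions and achieves state-of-the-art performance on a standard dataset along with robust performance on out-of-distribution data.
Method: A pre-trained model is fine-tuned through weakly supervised prompt tuning grounded in anatomical information, improving the correspondence between text and image regions.
Result: The method sets a new state of the art on MS-CXR and shows robust performance on VinDr-CXR.
Insight: Weakly supervised prompt tuning can effectively improve multi-modal alignment in data-limited medical imaging, supporting downstream tasks.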
Abstract: Latent Diffusion Models have shown remarkable results in text-guided image
synthesis in recent years. In the domain of natural (RGB) images, recent works
have shown that such models can be adapted to various vision-language
downstream tasks with little to no supervision involved. On the contrary,
text-to-image Latent Diffusion Models remain relatively underexplored in the
field of medical imaging, primarily due to limited data availability (e.g., due
to privacy concerns). In this work, focusing on the chest X-ray modality, we
first demonstrate that a standard text-conditioned Latent Diffusion Model has
not learned to align clinically relevant information in free-text radiology
reports with the corresponding areas of the given scan. Then, to alleviate this
issue, we propose a fine-tuning framework to improve multi-modal alignment in a
pre-trained model such that it can be efficiently repurposed for downstream
tasks such as phrase grounding. Our method sets a new state-of-the-art on a
standard benchmark dataset (MS-CXR), while also exhibiting robust performance
on out-of-distribution data (VinDr-CXR). Our code will be made publicly
available.
[86] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models
Francisco Caetano,Christiaan Viviers,Peter H. N. De With,Fons van der Sommen
Main category: cs.CV
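TL;DR: The paper proposes Symmetrical Flow Matching (SymmFlow), which unifies image generation, semantic segmentation, and classification in a single model. A symmetric learning objective and a new training objective enable efficient sampling while preserving semantic structure.
Details
Motivation: Flow Matching performs well for high-fidelity generative modeling, but existing formulations do not handle generation, segmentation, and classification in a unified way. SymmFlow addresses this through a symmetric learning objective and bi-directional consistency. Contribution: 1. SymmFlow, which unifies generation, segmentation, and classification; 2. a symmetric learning objective and a new training objective; 3. support for both pixel-level and image-level labels; 4. SOTA performance on multiple benchmarks.
Method: A symmetric learning objective jointly models the forward and reverse transformations, ensuring bi-directional consistency and generative diversity; a new training objective explicitly retains semantic information, enabling one-step segmentation and classification.
Result: SymmFlow obtains FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps, delivers competitive results on semantic segmentation, and shows promising capabilities in classification.
Insight: Through a unified framework and a symmetric learning objective, SymmFlow demonstrates the potential of multi-task modeling, and its efficient sampling is convenient for practical use.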
Abstract: Flow Matching has emerged as a powerful framework for learning continuous
transformations between distributions, enabling high-fidelity generative
modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new
formulation that unifies semantic segmentation, classification, and image
generation within a single model. Using a symmetric learning objective,
SymmFlow models forward and reverse transformations jointly, ensuring
bi-directional consistency, while preserving sufficient entropy for generative
diversity. A new training objective is introduced to explicitly retain semantic
information across flows, featuring efficient sampling while preserving
semantic structure, allowing for one-step segmentation and classification
without iterative refinement. Unlike previous approaches that impose strict
one-to-one mapping between masks and images, SymmFlow generalizes to flexible
conditioning, supporting both pixel-level and image-level class labels.
Experimental results on various benchmarks demonstrate that SymmFlow achieves
state-of-the-art performance on semantic image synthesis, obtaining FID scores
of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps.
Additionally, it delivers competitive results on semantic segmentation and
shows promising capabilities in classification tasks. The code will be publicly
available.
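For readers unfamiliar with Flow Matching, the following minimal sketch shows the basic ingredients the abstract builds on: a straight interpolation path between two samples, the velocity regression target along that path, and fixed-step Euler sampling (the abstract mentions 25 inference steps). It is a generic illustration and does not implement SymmFlow's symmetric objective; the function names and the straight-path choice are assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, steps=25):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In practice `velocity_fn` would be a trained network; with the exact constant velocity of a straight path, Euler integration lands on the endpoint regardless of the step count.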
[87] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
Xiaoyi Bao,Jindi Lv,Xiaofeng Wang,Zheng Zhu,Xinze Chen,YuKun Zhou,Jiancheng Lv,Xingang Wang,Guan Huang
Main category: cs.CV
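TL;DR: GigaVideo-1 is an efficient fine-tuning framework for video generation that improves pre-trained video diffusion models through automatic feedback. It needs no human annotation or large-scale compute and improves generation quality across multiple dimensions with only 4 GPU-hours.
Details
Motivation: Fine-tuning video generation models usually relies on human annotations and large-scale computational resources, which limits practicality. GigaVideo-1 aims to improve video generation quality efficiently through an automatic feedback mechanism. Contribution: 1. The GigaVideo-1 framework, which improves video generation through automatic feedback; 2. a prompt-driven data engine and a reward-guided training strategy; 3. consistent performance gains with only 4 GPU-hours.
Method: 1. A prompt-driven data engine constructs diverse training samples; 2. pre-trained vision-language models provide automatic feedback that serves as the optimization reward; 3. sample weights are adapted under a realism constraint.
Result: On the VBench-2.0 benchmark, GigaVideo-1 achieves an average gain of about 4% across 17 dimensions using only 4 GPU-hours.
Insight: Automatic feedback and efficient data use can greatly reduce the dependence on human labor and large-scale compute, suggesting a practical path for video generation models.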
Abstract: Recent progress in diffusion models has greatly enhanced video generation
quality, yet these models still require fine-tuning to improve specific
dimensions like instance preservation, motion rationality, composition, and
physical plausibility. Existing fine-tuning approaches often rely on human
annotations and large-scale computational resources, limiting their
practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning
framework that advances video generation without additional human supervision.
Rather than injecting large volumes of high-quality data from external sources,
GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models
through automatic feedback. Specifically, we focus on two key aspects of the
fine-tuning process: data and optimization. To improve fine-tuning data, we
design a prompt-driven data engine that constructs diverse, weakness-oriented
training samples. On the optimization side, we introduce a reward-guided
training strategy, which adaptively weights samples using feedback from
pre-trained vision-language models with a realism constraint. We evaluate
GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17
evaluation dimensions. Experiments show that GigaVideo-1 consistently improves
performance on almost all the dimensions with an average gain of about 4% using
only 4 GPU-hours. Requiring no manual annotations and minimal real data,
GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and
data will be publicly available.
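One simple way to picture reward-guided sample weighting is a softmax over per-sample rewards that rescales each sample's loss. The abstract does not specify the weighting rule or the realism constraint, so the softmax form, the `temperature` parameter, and the function names below are all assumptions made for illustration only.

```python
import math

def reward_weights(rewards, temperature=1.0):
    """Softmax over rewards (hypothetical rule): higher reward gives larger weight."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_loss(losses, rewards, temperature=1.0):
    """Combine per-sample losses using reward-derived weights that sum to 1."""
    weights = reward_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, losses))
```

With equal rewards the result reduces to the plain mean of the losses; as rewards diverge, the samples the feedback model prefers dominate the objective.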
[88] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis
Marzieh Oghbaie,Teresa Araújo,Hrvoje Bogunović
Main category: cs.CV
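TL;DR: The paper proposes PiPViT, a prototype model based on a vision transformer that learns long-range dependencies among image patches to produce interpretable prototypes for retinal image analysis.
Details
Motivation: In medical imaging, visualizations from existing prototype-based methods are not always consistent with human-understandable biomarkers, and the prototypes are too fine-grained to convey the extent of lesions. PiPViT addresses these issues by learning interpretable prototypes that provide a transparent basis for decisions. Contribution: 1. The PiPViT model, which learns long-range dependencies among patches to produce interpretable prototypes. 2. Contrastive learning and multi-resolution input processing for localizing biomarkers across scales. 3. Competitive performance on retinal OCT image classification together with clinically meaningful explanations.
Method: 1. A vision transformer (ViT) captures long-range dependencies among patches to learn prototypes. 2. Contrastive learning strengthens prototype learning. 3. Multi-resolution input processing localizes biomarkers at different scales.
Result: On four retinal OCT datasets, PiPViT achieves performance comparable to SOTA methods while providing more clinically meaningful explanations. Quantitative evaluation confirms that the learned prototypes are semantically and clinically relevant.
Insight: By combining a ViT with multi-resolution processing, PiPViT produces interpretable prototypes that better match medical needs and offers transparent support for clinical decisions.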
Abstract: Background and Objective: Prototype-based methods improve interpretability by
learning fine-grained part-prototypes; however, their visualization in the
input pixel space is not always consistent with human-understandable
biomarkers. In addition, well-known prototype-based approaches typically learn
extremely granular prototypes that are less interpretable in medical imaging,
where both the presence and extent of biomarkers and lesions are critical.
Methods: To address these challenges, we propose PiPViT (Patch-based Visual
Interpretable Prototypes), an inherently interpretable prototypical model for
image recognition. Leveraging a vision transformer (ViT), PiPViT captures
long-range dependencies among patches to learn robust, human-interpretable
prototypes that approximate lesion extent only using image-level labels.
Additionally, PiPViT benefits from contrastive learning and multi-resolution
input processing, which enables effective localization of biomarkers across
scales.
Results: We evaluated PiPViT on retinal OCT image classification across four
datasets, where it achieved competitive quantitative performance compared to
state-of-the-art methods while delivering more meaningful explanations.
Moreover, quantitative evaluation on a hold-out test set confirms that the
learned prototypes are semantically and clinically relevant. We believe PiPViT
can transparently explain its decisions and assist clinicians in understanding
diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT
[89] Enhancing Deepfake Detection using SE Block Attention with CNN
Subhram Dasgupta,Janelle Mason,Xiaohong Yuan,Olusola Odeyomi,Kaushik Roy
Main category: cs.CV
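TL;DR: The paper proposes a lightweight Deepfake detection model that combines a CNN with squeeze-and-excitation (SE) attention. Dynamic channel-wise feature recalibration improves detection efficiency, and the model reaches high accuracy with low computational resources.
Details
Motivation: The realism of Deepfakes threatens information security and authenticity, and traditional detection methods struggle to keep up. Existing detection models are usually large and consume substantial compute and memory. Contribution: A lightweight CNN model with SE attention that enables efficient, low-resource Deepfake detection.
Method: An SE attention module dynamically recalibrates channel-wise features, emphasizing informative features and suppressing less useful ones; it is integrated with a simple sequential model for lightweight detection.
Result: The model achieves 94.14% classification accuracy and an AUC-ROC score of 0.985 on the Style GAN dataset.
Insight: SE attention performs well in lightweight models and offers an efficient, scalable solution for Deepfake detection.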
Abstract: In the digital age, Deepfakes present a formidable challenge by using advanced
artificial intelligence to create highly convincing manipulated content,
undermining information authenticity and security. These sophisticated
fabrications surpass traditional detection methods in complexity and realism.
To address this issue, we aim to harness cutting-edge deep learning
methodologies to engineer an innovative deepfake detection model. However, most
of the models designed for deepfake detection are large, causing heavy storage
and memory consumption. In this research, we propose a lightweight convolutional
neural network (CNN) with squeeze-and-excitation (SE) block attention for
Deepfake detection. The SE block module is designed to perform dynamic
channel-wise feature recalibration. The SE block allows the network to
emphasize informative features and suppress less useful ones, which leads to a
more efficient and effective learning module. This module is integrated with a
simple sequential model to perform Deepfake detection. The model is smaller in
size and achieves accuracy competitive with existing models for deepfake
detection tasks. The model achieved an overall classification accuracy of
94.14% and AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse
Fake Face Dataset. Our proposed approach presents a promising avenue for
combating the Deepfake challenge with minimal computational resources,
developing efficient and scalable solutions for digital content verification.
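The channel-wise recalibration that an SE block performs can be sketched in a few lines of plain Python. This is a minimal illustration of the standard squeeze-and-excitation pattern (global average pooling, a two-layer bottleneck, sigmoid gating), not the authors' network: the weight matrices `w1` and `w2` are hypothetical, biases are omitted, and nested lists stand in for tensors.

```python
import math

def se_block(feature_map, w1, w2):
    """Apply squeeze-and-excitation to a C x H x W feature map given as nested lists.

    w1 has shape (C // r) x C and w2 has shape C x (C // r); both are hypothetical weights.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: bottleneck layer with ReLU, then a layer with sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feature_map, gates)]
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a gate near 0 are suppressed, which is the "emphasize informative features" behavior the abstract describes.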
[90] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework
Xia Du,Xiaoyuan Liu,Jizhe Zhou,Zheng Lin,Chi-man Pun,Zhe Chen,Wei Ni,Jun Luo
Main category: cs.CV
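TL;DR: The paper proposes Unsourced Adversarial CAPTCHA (UAC), a new framework that generates high-quality adversarial examples from text prompts, increases CAPTCHA diversity, and supports both targeted and untargeted attacks.
Details
Motivation: With the rapid progress of deep learning, traditional CAPTCHA schemes are increasingly vulnerable to automated DNN-based attacks. Existing adversarial attack methods rely on original image characteristics, which distorts the output and leaves scenarios without an initial image unsupported. Contribution: The UAC framework, which generates high-fidelity adversarial examples guided by text prompts and leverages an LLM to enhance CAPTCHA diversity; targeted attacks use the EDICT method, and the proposed BP-UAC strategy handles untargeted attacks.
Method: 1. Targeted attacks: the EDICT method optimizes dual latent variables in a diffusion model. 2. Untargeted attacks: the BP-UAC strategy combines multimodal gradients with bi-path optimization for efficient misclassification.
Result: Experiments show that BP-UAC achieves high attack success rates across diverse systems and generates natural CAPTCHAs that are indistinguishable to humans and DNNs.
Insight: Text prompts can effectively guide adversarial example generation, and the bi-path optimization strategy improves attack efficiency in black-box scenarios.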
Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are
increasingly vulnerable to automated attacks powered by deep neural networks
(DNNs). Existing adversarial attack methods often rely on original image
characteristics, resulting in distortions that hinder human interpretation and
limit applicability in scenarios lacking initial input images. To address these
challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel
framework generating high-fidelity adversarial examples guided by
attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC
enhances CAPTCHA diversity and supports both targeted and untargeted attacks.
For targeted attacks, the EDICT method optimizes dual latent variables in a
diffusion model for superior image quality. In untargeted attacks, especially
for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA
(BP-UAC), a two-step optimization strategy employing multimodal gradients and
bi-path optimization for efficient misclassification. Experiments show BP-UAC
achieves high attack success rates across diverse systems, generating natural
CAPTCHAs indistinguishable to humans and DNNs.
[91] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery
Christopher Gaul,Eduardo Fidalgo,Enrique Alegre,Rocío Alaiz Rodríguez,Eri Pérez Corral
Main category: cs.CV
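TL;DR: The paper proposes a multi-task architecture that combines age regression with binary classification tasks to detect minors in unconstrained images. An improved loss function and sampling scheme raise performance, and new benchmarks confirm the model's generalization.
Details
Motivation: Under-representation of minors in public data and distribution shift cause underage detection models to perform poorly on unconstrained images. Contribution: 1. A multi-task architecture combining age regression with four binary classification tasks (age thresholds of 12, 15, 18, and 21); 2. an improved loss function and sampling method to address class imbalance; 3. new benchmark test sets (ASORES-39k and ASWIFT-20k).
Method: A frozen FaRL vision-language backbone feeds a multilayer perceptron (MLP) with an age-regression head and binary classification heads, trained with an α-reweighted focal-style loss and age-balanced mini-batch sampling.
Result: On the ASORES-39k test, the model lowers the root-mean-square error from 5.733 to 5.656 years; on the ASWIFT-20k test, it nearly sustains 0.99 recall while raising F2 from 0.742 to 0.833.
Insight: Multi-task learning and an improved loss design effectively improve performance and generalization in underage detection.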
Abstract: Accurate automatic screening of minors in unconstrained images demands models
that are robust to distribution shift and resilient to the
under-representation of children in publicly available data. To overcome these issues, we
propose a multi-task architecture with dedicated under/over-age discrimination
tasks based on a frozen FaRL vision-language backbone joined with a compact
two-layer MLP that shares features across one age-regression head and four
binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing
on the legally critical age range. To address the severe class imbalance, we
introduce an $\alpha$-reweighted focal-style loss and age-balanced mini-batch
sampling, which equalizes twelve age bins during stochastic optimization.
Further improvement is achieved with an age gap that removes edge cases from
the loss.
Moreover, we set a rigorous evaluation by proposing the Overall Under-Age
Benchmark, with 303k cleaned training images and 110k test images, defining
both the “ASORES-39k” restricted overall test, which removes the noisiest
domains, and the age estimation wild shifts test “ASWIFT-20k” of 20k images,
stressing extreme pose (>45°), expression, and low image quality to
emulate real-world shifts.
Trained on the cleaned overall set with resampling and age gap, our multiage
model “F” lowers the root-mean-square-error on the ASORES-39k restricted test
from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from
F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to
the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall
while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline,
demonstrating strong generalization under distribution shift. For the under-12
and under-15 tasks, F2 rises from 0.666 to 0.955 and
from 0.689 to 0.916, respectively.
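Two quantities in this abstract are easy to make concrete: an α-weighted focal loss and the F2 score. The sketch below uses the standard binary focal loss and the standard F-beta formula; the abstract only says the loss is "focal-style", so the exact form, the default `alpha` and `gamma` values, and the function names are assumptions rather than the authors' definition.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Standard binary focal loss for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

The focusing term `(1 - p_t) ** gamma` shrinks the loss of well-classified examples, so rare or hard examples dominate training; F2 is a natural metric here because missing a minor (low recall) is costlier than a false alarm.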
[92] Continual Hyperbolic Learning of Instances and Classes
Melika Ayoughi,Mina Ghadimi Atigh,Mohammad Mahdi Derakhshani,Cees G. M. Snoek,Pascal Mettes,Paul Groth
Main category: cs.CV
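TL;DR: The paper proposes HyperCLIC, a continual learning method for instances and classes together. It models hierarchical relations in hyperbolic space, embeds them continually through hyperbolic classification and distillation objectives, and is validated on the EgoObjects dataset.
Details
Motivation: Real-world applications such as robotics and self-driving cars require models that continually learn both instances and classes, which traditional methods do not address. The authors observe that instances and classes naturally form a hierarchy and propose modeling this relationship in hyperbolic space. Contribution: 1. The task of continual learning of instances and classes at the same time; 2. the HyperCLIC algorithm, which models hierarchical relations in hyperbolic space; 3. continual hierarchical evaluation metrics.
Method: HyperCLIC exploits the low-distortion, compact embeddings of hyperbolic space and combines hyperbolic classification and distillation objectives to embed hierarchical relations continually.
Result: Experiments on EgoObjects show that HyperCLIC operates effectively at multiple granularities and improves hierarchical generalization.
Insight: Hyperbolic space is well suited to hierarchical data and offers a new approach to modeling hierarchical relations in continual learning.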
Abstract: Continual learning has traditionally focused on classifying either instances
or classes, but real-world applications, such as robotics and self-driving
cars, require models to handle both simultaneously. To mirror real-life
scenarios, we introduce the task of continual learning of instances and
classes, at the same time. This task challenges models to adapt to multiple
levels of granularity over time, which requires balancing fine-grained instance
recognition with coarse-grained class generalization. In this paper, we
identify that classes and instances naturally form a hierarchical structure. To
model these hierarchical relationships, we propose HyperCLIC, a continual
learning algorithm that leverages hyperbolic space, which is uniquely suited
for hierarchical data due to its ability to represent tree-like structures with
low distortion and compact embeddings. Our framework incorporates hyperbolic
classification and distillation objectives, enabling the continual embedding of
hierarchical relations. To evaluate performance across multiple granularities,
we introduce continual hierarchical metrics. We validate our approach on
EgoObjects, the only dataset that captures the complexity of hierarchical
object recognition in dynamic real-world environments. Empirical results show
that HyperCLIC operates effectively at multiple granularities with improved
hierarchical generalization.
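The claim that hyperbolic space suits tree-like data can be illustrated with the distance function of the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model HyperCLIC uses, so the choice of the Poincaré ball and the function name are assumptions for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_u = sum(c * c for c in u)
    sq_v = sum(c * c for c in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

Distances grow rapidly near the boundary, so general concepts (classes) can sit near the origin while many specific ones (instances) spread out near the edge with large mutual distances, which is what lets a tree embed with low distortion.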
[93] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement
Yuqi Shen,Fengyang Xiao,Sujie Hu,Youwei Pang,Yifan Pu,Chengyu Fang,Xiu Li,Chunming He
Main category: cs.CV
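TL;DR: The paper proposes the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, a generative framework designed for post-processing refinement in camouflaged object detection (COD). An uncertainty-guided masking mechanism targets regions with poor segmentation quality while preserving correctly segmented areas. Experiments show clear gains on multiple COD benchmarks.
Details
Motivation: COD is challenging because targets differ only subtly from their backgrounds, and existing methods leave room for improvement in refinement. The paper fills this gap with a generative refinement framework. Contribution: 1. UMBD, the first generative refinement framework designed specifically for COD; 2. the Hybrid Uncertainty Quantification Network (HUQNet), which improves the accuracy of uncertainty estimation; 3. a lightweight method that integrates seamlessly with existing COD models.
Method: UMBD uses uncertainty-guided masking to apply Bernoulli diffusion selectively to poorly segmented regions; HUQNet uses a multi-branch architecture to fuse uncertainty from multiple sources.
Result: Across multiple COD benchmarks, UMBD achieves average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead.
Insight: Generative methods can be used for post-processing refinement in COD, and uncertainty-guided masking improves segmentation quality while keeping the design lightweight.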
Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the
subtle visual differences between targets and their backgrounds. While existing
methods have made notable progress, there remains significant potential for
post-processing refinement that has yet to be fully explored. To address this
limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model,
the first generative refinement framework specifically designed for COD. UMBD
introduces an uncertainty-guided masking mechanism that selectively applies
Bernoulli diffusion to residual regions with poor segmentation quality,
enabling targeted refinement while preserving correctly segmented areas. To
support this process, we design the Hybrid Uncertainty Quantification Network
(HUQNet), which employs a multi-branch architecture and fuses uncertainty from
multiple sources to improve estimation accuracy. This enables adaptive guidance
during the generative sampling process. The proposed UMBD framework can be
seamlessly integrated with a wide range of existing Encoder-Decoder-based COD
models, combining their discriminative capabilities with the generative
advantages of diffusion-based refinement. Extensive experiments across multiple
COD benchmarks demonstrate consistent performance improvements, achieving
average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest
computational overhead. Code will be released.
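The idea of refining only uncertain regions while preserving confident ones can be sketched as a masked blend. This is an illustrative simplification, not UMBD itself: the uncertainty map would come from a learned network such as HUQNet and the refined prediction from the diffusion sampler, whereas here both are plain inputs, and the fixed `threshold` and function names are assumptions.

```python
def uncertainty_mask(uncertainty, threshold=0.5):
    """Binary mask: 1 where the per-pixel uncertainty exceeds a (hypothetical) threshold."""
    return [[1 if u > threshold else 0 for u in row] for row in uncertainty]

def masked_refine(coarse, refined, mask):
    """Take the refined value where mask is 1 and keep the coarse value elsewhere."""
    return [
        [r if m else c for c, r, m in zip(c_row, r_row, m_row)]
        for c_row, r_row, m_row in zip(coarse, refined, mask)
    ]
```

Because confident pixels are copied through unchanged, the refinement step cannot degrade regions the base detector already segments correctly.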
[94] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain
Hong Huang,Weixiang Sun,Zhijian Wu,Jingwen Niu,Donghuan Lu,Xian Wu,Yefeng Zheng
Main category: cs.CV
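TL;DR: IQE-CLIP is a CLIP-based framework for zero- and few-shot anomaly detection. It combines textual and instance-aware visual information to produce more effective anomaly-indicating embeddings, is tailored to the medical domain, and achieves SOTA performance.
Details
Motivation: Existing CLIP-based anomaly detection methods assume prior knowledge of categories and rely on scenario-specific text prompts, yet they struggle to separate normal from anomalous instances in the joint embedding space, and the medical domain remains little explored. Contribution: The IQE-CLIP framework, which adapts CLIP to the medical domain with class-based and learnable prompting tokens, and an instance-aware query module that extracts region-level information from both modalities to generate anomaly-sensitive embeddings.
Method: Class-based and learnable prompting tokens are combined with an instance-aware query module that extracts region-level contextual information from the text and visual modalities.
Result: Experiments on six medical datasets show that IQE-CLIP achieves SOTA performance in both zero-shot and few-shot settings.
Insight: Fusing textual and instance-aware visual information separates anomalies more effectively and outperforms existing methods in the medical domain.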
Abstract: Recent advances in vision-language models, such as CLIP, have significantly
improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks.
However, most existing CLIP-based methods assume prior knowledge of categories
and rely on carefully designed prompts tailored to specific scenarios. While
these text prompts capture semantic information in the textual space, they
often fail to distinguish normal and anomalous instances in the joint embedding
space. Moreover, most ZFSAD approaches focus on industrial domains, with
limited exploration in medical tasks. To address these limitations, we propose
IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query
embeddings integrating both textual and instance-aware visual information serve
as more effective indicators of anomalies. Specifically, we introduce
class-based and learnable prompting tokens to better adapt CLIP to the medical
setting. Furthermore, we design an instance-aware query module that extracts
region-level contextual information from both modalities, enabling the
generation of anomaly-sensitive embeddings. Extensive experiments on six
medical datasets demonstrate that IQE-CLIP achieves state-of-the-art
performance in both zero-shot and few-shot settings. Code and data are
available at https://github.com/hongh0/IQE-CLIP/.
[95] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
SiXiang Chen,Jianyu Lai,Jialin Gao,Tian Ye,Haoyu Chen,Hengyu Shi,Shitong Shao,Yunlong Lin,Song Fei,Zhaohu Xing,Yeying Jin,Junfeng Luo,Xiaoming Wei,Lei Zhu
Main category: cs.CV
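TL;DR: PosterCraft is a unified framework for generating high-quality posters. Multi-stage optimization integrates text rendering seamlessly with artistic content, and the framework significantly outperforms open-source baselines.
Details
Motivation: Existing poster generation techniques typically use modular pipelines and fixed layouts, which makes it hard to harmonize artistic content with text. PosterCraft addresses this with a framework that allows freer composition. Contribution: A unified framework design; the new Text-Render-2M dataset and large-scale text-rendering optimization; region-aware fine-tuning and aesthetic reinforcement learning; fully automated data-construction pipelines.
Method: A staged optimization workflow: large-scale text-rendering training, region-aware fine-tuning, aesthetic reinforcement learning, and joint vision-language feedback refinement.
Result: Experiments show that PosterCraft significantly outperforms open-source baselines in text rendering accuracy, layout coherence, and overall visual appeal, approaching commercial SOTA systems.
Insight: A unified framework with multi-stage optimization raises both the freedom and the quality of poster generation, and the fully automated data-construction pipelines enable robust training without complex architectural modifications.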
Abstract: Generating aesthetic posters is more challenging than simple design images:
it requires not only precise text rendering but also the seamless integration
of abstract artistic content, striking layouts, and overall stylistic harmony.
To address this, we propose PosterCraft, a unified framework that abandons
prior modular pipelines and rigid, predefined layouts, allowing the model to
freely explore coherent, visually compelling compositions. PosterCraft employs
a carefully designed, cascaded workflow to optimize the generation of
high-aesthetic posters: (i) large-scale text-rendering optimization on our
newly introduced Text-Render-2M dataset; (ii) region-aware supervised
fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via
best-of-n preference optimization; and (iv) joint vision-language feedback
refinement. Each stage is supported by a fully automated data-construction
pipeline tailored to its specific needs, enabling robust training without
complex architectural modifications. Across multiple experiments,
PosterCraft significantly outperforms open-source baselines in rendering
accuracy, layout coherence, and overall visual appeal, approaching the quality
of SOTA commercial systems. Our code, models, and datasets can be found on the
project page: https://ephemeral182.github.io/PosterCraft
[96] SlotPi: Physics-informed Object-centric Reasoning Models
Jian Li,Wan Han,Ning Lin,Yu-Liang Zhan,Ruizhi Chengze,Haining Wang,Yi Zhang,Hongsheng Liu,Zidong Wang,Fan Yu,Hao Sun
Main category: cs.CV
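TL;DR: SlotPi is a model that combines physical knowledge with object-centric reasoning. It uses a module based on Hamiltonian principles and a spatio-temporal prediction module to address challenges in dynamic simulation, and it performs well across a range of tasks and datasets.
Details
Motivation: Existing object-centric dynamic simulation methods make little use of physical knowledge and are not validated across diverse scenarios. Humans gain physical knowledge by observing the world and apply it to reason about dynamics, and SlotPi aims to close this gap. Contribution: 1) The SlotPi model, which combines physical knowledge with object-centric reasoning; 2) a real-world dataset covering object and fluid interactions; 3) validation of the model on prediction and VQA tasks.
Method: SlotPi combines a physical module based on Hamiltonian principles with a spatio-temporal prediction module to simulate the dynamics of object and fluid interactions.
Result: The model performs strongly on prediction and VQA tasks across benchmark and fluid datasets, confirming its adaptability.
Insight: Integrating physical knowledge and validating across diverse scenarios are essential for dynamic simulation, and SlotPi lays a foundation for more advanced world models.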
Abstract: Understanding and reasoning about dynamics governed by physical laws through
visual observation, akin to human capabilities in the real world, poses
significant challenges. Currently, object-centric dynamic simulation methods,
which emulate human behavior, have achieved notable progress but overlook two
critical aspects: 1) the integration of physical knowledge into models. Humans
gain physical insights by observing the world and apply this knowledge to
accurately reason about various dynamic scenarios; 2) the validation of model
adaptability across diverse scenarios. Real-world dynamics, especially those
involving fluids and objects, demand models that not only capture object
interactions but also simulate fluid flow characteristics. To address these
gaps, we introduce SlotPi, a slot-based physics-informed object-centric
reasoning model. SlotPi integrates a physical module based on Hamiltonian
principles with a spatio-temporal prediction module for dynamic forecasting.
Our experiments highlight the model’s strengths in tasks such as prediction and
Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore,
we have created a real-world dataset encompassing object interactions, fluid
dynamics, and fluid-object interactions, on which we validated our model’s
capabilities. The model’s robust performance across all datasets underscores
its strong adaptability, laying a foundation for developing more advanced world
models.
[97] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning
Ignacio Bugueno-Cordova,Javier Ruiz-del-Solar,Rodrigo Verschae
Main category: cs.CV
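TL;DR: The paper proposes a robot navigation controller that combines event cameras with reinforcement learning for real-time human-centered navigation and obstacle avoidance, addressing the latency and motion blur of conventional image-based controllers.
Details
Motivation: Conventional image-based controllers operate at fixed frame rates and suffer from latency, which makes real-time navigation difficult; the asynchronous nature of event cameras offers a new solution. Contribution: A navigation framework that combines event cameras, additional sensors, and reinforcement learning to enable adaptive inference and control with better real-time behavior and robustness.
Method: The framework integrates event-based perception, range sensing, and policy optimization via Deep Deterministic Policy Gradient (DDPG), with an imitation learning phase to improve sample efficiency.
Result: Robust navigation, pedestrian following, and obstacle avoidance are achieved in simulated environments, demonstrating the effectiveness of the approach.
Insight: The asynchronous nature of event cameras opens a new path for real-time navigation, and combining reinforcement learning with imitation learning improves sample efficiency.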
Abstract: This work introduces a robot navigation controller that combines event
cameras and other sensors with reinforcement learning to enable real-time
human-centered navigation and obstacle avoidance. Unlike conventional
image-based controllers, which operate at fixed rates and suffer from motion
blur and latency, this approach leverages the asynchronous nature of event
cameras to process visual information over flexible time intervals, enabling
adaptive inference and control. The framework integrates event-based
perception, additional range sensing, and policy optimization via Deep
Deterministic Policy Gradient, with an initial imitation learning phase to
improve sample efficiency. Promising results are achieved in simulated
environments, demonstrating robust navigation, pedestrian following, and
obstacle avoidance. A demo video is available at the project website.
[98] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization
Mario Barbara,Alaa Maalouf
Main category: cs.CV
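TL;DR: The paper proposes a zero-shot video summarization method driven by natural-language queries; by combining a pretrained video-language model with a large language model, it produces user-controllable summaries without domain-specific training data.
Details
Motivation: With the explosive growth of video data, there is a pressing need for summarization tools that require no domain-specific training data and respond flexibly to user intent expressed in natural language. Existing methods either depend on datasets, which limits generalization, or cannot incorporate user intent.
Contribution: The first zero-shot, text-queryable video summarization framework, which combines a pretrained video-language model with LLM-based scoring to generate user-guided summaries without any training data.
Method: 1) Segment the video into coherent scenes; 2) generate scene descriptions with an efficient, batch-style video-language model prompting scheme; 3) use an LLM to assign importance scores to scenes; 4) propagate the scores to the frame level via consistency and uniqueness metrics.
Result: On SumMe and TVSum the method surpasses all unsupervised methods and matches supervised ones; it also performs competitively on the QFVS benchmark. The authors release the VidSum-Reason dataset as a new benchmark.
Insight: With well-designed prompts and a score propagation mechanism, pretrained multimodal models already provide a strong foundation for general, text-queryable video summarization, which illustrates the potential of zero-shot approaches.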
Abstract: The explosive growth of video data has intensified the need for flexible,
user-controllable summarization tools that can operate without domain-specific
training data. Existing methods either rely on datasets, limiting
generalization, or cannot incorporate user intent expressed in natural
language. We introduce Prompts-to-Summaries: the first zero-shot,
text-queryable video summarizer that converts captions from off-the-shelf
video-language models (VidLMs) into user-guided skims via large language model
(LLM) judging, without any training data, beating all unsupervised methods
and matching supervised ones. Our pipeline (i) segments raw
video footage into coherent scenes, (ii) generates rich scene-level
descriptions through a memory-efficient, batch-style VidLM prompting scheme
that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a
judge to assign scene-level importance scores under a carefully crafted prompt,
and finally, (iv) propagates those scores to the short-segment level via two new
metrics: consistency (temporal coherency) and uniqueness (novelty), yielding
fine-grained frame importance. On SumMe and TVSum, our data-free approach
surpasses all prior data-hungry unsupervised methods. It also performs
competitively on the Query-Focused Video Summarization (QFVS) benchmark,
despite using no training data, whereas the competing methods require supervised
frame-level importance. To spur further research, we release VidSum-Reason, a
new query-driven dataset featuring long-tailed concepts and multi-step
reasoning; our framework attains robust F1 scores and serves as the first
challenging baseline. Overall, our results demonstrate that pretrained
multimodal models, when orchestrated with principled prompting and score
propagation, already provide a powerful foundation for universal,
text-queryable video summarization.
[99] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing
Hang Zhang,Xiang Chen,Renjiu Hu,Rongguang Wang,Jinwei Zhang,Min Liu,Yaonan Wang,Gaolei Li,Xinxing Cheng,Jinming Duan
Main category: cs.CV
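TL;DR: The paper proposes SmoothProper, a module that addresses sparse features and large displacements in unsupervised deformable image registration by integrating an optimization layer that enforces smoothness; it is validated on a retinal vessel dataset.
Details
Motivation: Unsupervised deformable image registration methods perform poorly on images with sparse features and large displacements, so a module that enforces smoothness and structural consistency is needed.
Contribution: SmoothProper, a plug-and-play module that enforces smoothness and promotes message passing within the network's forward pass, addressing the sparse-feature and large-displacement problems.
Method: A duality-based optimization layer with tailored interaction terms propagates flow signals across spatial locations and enforces smoothness.
Result: On a retinal vessel dataset with 2912x2912 images, the registration error is reduced to 1.88 pixels.
Insight: Enforcing smoothness inside the network's forward pass can address key challenges of unsupervised registration without tuning regularizer hyperparameters.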
Abstract: Learning-based deformable image registration (DIR) accelerates alignment by
amortizing traditional optimization via neural networks. Label supervision
further enhances accuracy, enabling efficient and precise nonlinear alignment
of unseen scans. However, images with sparse features amid large smooth
regions, such as retinal vessels, introduce aperture and large-displacement
challenges that unsupervised DIR methods struggle to address. This limitation
occurs because neural networks predict deformation fields in a single forward
pass, leaving fields unconstrained post-training and shifting the
regularization burden entirely to network weights. To address these issues, we
introduce SmoothProper, a plug-and-play neural module enforcing smoothness and
promoting message passing within the network’s forward pass. By integrating a
duality-based optimization layer with tailored interaction terms, SmoothProper
efficiently propagates flow signals across spatial locations, enforces
smoothness, and preserves structural consistency. It is model-agnostic,
seamlessly integrates into existing registration frameworks with minimal
parameter overhead, and eliminates regularizer hyperparameter tuning.
Preliminary results on a retinal vessel dataset exhibiting aperture and
large-displacement challenges demonstrate our method reduces registration error
to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach
to effectively address both challenges. The source code will be available at
https://github.com/tinymilky/SmoothProper.
[100] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders
Hui Yang,Wei Sun,Jian Liu,Jin Zheng,Jian Xiao,Ajmal Mian
Main category: cs.CV
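TL;DR: The paper proposes HOMAE, an occlusion-aware 3D hand-object pose estimation method based on masked autoencoders; a target-focused masking strategy and multi-scale feature fusion improve pose estimation under occlusion.
Details
Motivation: When estimating hand-object pose from monocular RGB images, existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness under the severe occlusions typical of hand-object interaction.
Contribution: 1) A target-focused masking strategy that imposes structured occlusion on interaction regions to guide the model toward context-aware features; 2) a combination of an implicit SDF with an explicit point cloud that exploits the complementary strengths of both.
Method: HOMAE combines the target-focused masking strategy with SDF and point-cloud fusion; multi-scale features capture both global context and fine-grained geometry.
Result: HOMAE achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks.
Insight: Structured occlusion during training and fusion of complementary representations can improve pose estimation under occlusion.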
Abstract: Hand-object pose estimation from monocular RGB images remains a significant
challenge mainly due to the severe occlusions inherent in hand-object
interactions. Existing methods do not sufficiently explore global structural
perception and reasoning, which limits their effectiveness in handling occluded
hand-object interactions. To address this challenge, we propose an
occlusion-aware hand-object pose estimation method based on masked
autoencoders, termed HOMAE. Specifically, we propose a target-focused
masking strategy that imposes structured occlusion on regions of hand-object
interaction, encouraging the model to learn context-aware features and reason
about the occluded structures. We further integrate multi-scale features
extracted from the decoder to predict a signed distance field (SDF), capturing
both global context and fine-grained geometry. To enhance geometric perception,
we combine the implicit SDF with an explicit point cloud derived from the SDF,
leveraging the complementary strengths of both representations. This fusion
enables more robust handling of occluded regions by combining the global
context from the SDF with the precise local geometry provided by the point
cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks
demonstrate that HOMAE achieves state-of-the-art performance in hand-object
pose estimation. We will release our code and model.
[101] VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Huaying Yuan,Zheng Liu,Junjie Zhou,Ji-Rong Wen,Zhicheng Dou
Main category: cs.CV
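TL;DR: VideoDeepResearch is a new agentic framework that relies only on a text-only large reasoning model (LRM) and a multimodal toolkit; by selectively accessing video content it tackles long video understanding (LVU) and substantially outperforms existing methods.
Details
Motivation: Current multimodal large language models (MLLMs) struggle with long video understanding (LVU) because of task complexity and context-window limits. The paper questions the assumption that LVU requires a powerful MLLM with an extended context window.
Contribution: VideoDeepResearch, an agentic framework built on a text-only large reasoning model and a multimodal toolkit, which solves LVU tasks without relying on complex MLLMs.
Method: A text-only large reasoning model (LRM) is combined with a modular multimodal toolkit (such as multimodal retrievers and visual perceivers); the system formulates a problem-solving strategy and selectively accesses video content.
Result: On MLVU, LVBench, and LongVideoBench, the method improves over the previous state of the art by 9.6%, 6.6%, and 3.9%, respectively.
Insight: Agentic systems that use modular tools and selective content access can cope with the complexity and context-window limits of LVU tasks, suggesting a direction for future research.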
Abstract: Long video understanding (LVU) presents a significant challenge for current
multi-modal large language models (MLLMs) due to the task’s inherent complexity
and context window constraints. It is widely assumed that addressing LVU tasks
requires foundation MLLMs with extended context windows, strong visual
perception capabilities, and proficient domain expertise. In this work, we
challenge this common belief by introducing VideoDeepResearch, a novel agentic
framework for long video understanding. Our approach relies solely on a
text-only large reasoning model (LRM) combined with a modular multi-modal
toolkit, including multimodal retrievers and visual perceivers, all of which
are readily available in practice. For each LVU task, the system formulates a
problem-solving strategy through reasoning, while selectively accessing and
utilizing essential video content via tool use. We conduct extensive
experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench.
Our results demonstrate that VideoDeepResearch achieves substantial
improvements over existing MLLM baselines, surpassing the previous
state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and
LongVideoBench, respectively. These findings highlight the promise of agentic
systems in overcoming key challenges in LVU problems.
[102] Post-Training Quantization for Video Matting
Tianrui Zhu,Houyuan Chen,Ruihao Gong,Michele Magno,Haotong Qin,Kai Zhang
Main category: cs.CV
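TL;DR: The paper proposes a post-training quantization (PTQ) framework designed for video matting; a two-stage strategy, a statistically driven global affine calibration, and an optical-flow assistance component improve accuracy and temporal coherence under low-bit quantization.
Details
Motivation: Video matting is crucial in film production and virtual reality, but deploying its computationally intensive models on resource-constrained devices is challenging. Quantization is a key technique for model compression and acceleration, yet its application to video matting is still at an early stage.
Contribution: 1. A two-stage PTQ strategy that combines block-reconstruction-based optimization with global calibration; 2. a Statistically-Driven Global Affine Calibration (GAC) method that reduces statistical distortion; 3. an Optical Flow Assistance (OFA) component that uses temporal and semantic priors to improve performance.
Method: A two-stage PTQ framework that combines block-reconstruction-based optimization, GAC, and OFA to guide quantization and reduce accuracy loss.
Result: With 4-bit quantization, PTQ4VM performs close to the full-precision model while saving 8x FLOPs, and achieves state-of-the-art accuracy among existing quantization methods.
Insight: Combining local optimization with global calibration, and introducing temporal and semantic priors, can substantially improve video matting models under low-bit quantization.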
Abstract: Video matting is crucial for applications such as film production and virtual
reality, yet deploying its computationally intensive models on
resource-constrained devices presents challenges. Quantization is a key
technique for model compression and acceleration. As an efficient approach,
Post-Training Quantization (PTQ) is still in its nascent stages for video
matting, facing significant hurdles in maintaining accuracy and temporal
coherence. To address these challenges, this paper proposes a novel and general
PTQ framework specifically designed for video matting models, marking, to the
best of our knowledge, the first systematic attempt in this domain. Our
contributions include: (1) A two-stage PTQ strategy that combines
block-reconstruction-based optimization for fast, stable initial quantization
and local dependency capture, followed by a global calibration of quantization
parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine
Calibration (GAC) method that enables the network to compensate for cumulative
statistical distortions arising from factors such as neglected BN layer
effects, even reducing the error of existing PTQ methods on video matting tasks
by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages
temporal and semantic priors from frames to guide the PTQ process, enhancing
the model’s ability to distinguish moving foregrounds in complex scenes and
ultimately achieving near full-precision performance even under ultra-low-bit
quantization. Comprehensive quantitative and visual results show that our
PTQ4VM achieves state-of-the-art accuracy across different
bit-widths compared with existing quantization methods. We highlight that the
4-bit PTQ4VM even achieves performance close to the full-precision counterpart
while enjoying 8x FLOP savings.
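Illustrative sketch (not from the paper): to make "4-bit quantization" concrete, the snippet below shows a generic uniform affine quantizer, the basic building block that PTQ methods calibrate. It is not the PTQ4VM procedure; the function names `quantize` and `dequantize` and the min/max calibration rule are assumptions made for illustration only.

```python
def quantize(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] (generic uniform affine scheme)."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    # Hypothetical calibration: scale from the observed min/max range.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


weights = [-1.0, 0.0, 0.55, 2.0]
codes, scale, lo = quantize(weights, bits=4)
restored = dequantize(codes, scale, lo)
# Each restored value differs from the original by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With 4 bits there are only 16 representable levels, which is why calibrating the scale and offset well matters so much for accuracy.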
[103] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu,Yue Wu,Meng Chu,Zhifei Ren,Zizheng Huang,Pei Chu,Ruijie Zhang,Yinan He,Qirui Li,Songze Li,Zhenxiang Li,Zhongying Tu,Conghui He,Yu Qiao,Yali Wang,Yi Wang,Limin Wang
Main category: cs.CV
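TL;DR: VRBench is a benchmark for evaluating multi-step reasoning over long narrative videos; it fills a gap left by existing evaluations that overlook temporal reasoning and procedural validity, and includes large-scale annotations and a comprehensive evaluation method.
Details
Motivation: Existing video benchmarks fall short on multi-step reasoning and temporal modeling, so a more comprehensive evaluation tool is needed to drive progress in large models.
Contribution: VRBench, the first long narrative video benchmark for multi-step reasoning, with high-quality annotations and a multi-phase evaluation pipeline.
Method: The dataset is built through multi-stage filtering; a human-AI collaborative framework generates reasoning chains; and a multi-dimensional, progress-level scoring metric is proposed.
Result: Evaluations of 12 LLMs and 16 VLMs show that VRBench can assess multi-step reasoning ability and yield useful analysis.
Insight: Temporal and multi-step reasoning are key challenges in long video understanding, and VRBench provides a tool and reference point for further research.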
Abstract: We present VRBench, the first long narrative video benchmark crafted for
evaluating large models’ multi-step reasoning capabilities, addressing
limitations in existing evaluations that overlook temporal reasoning and
procedural validity. It comprises 1,010 long videos (with an average duration
of 1.6 hours), along with 9,468 human-labeled multi-step question-answering
pairs and 30,292 reasoning steps with timestamps. These videos are curated via
a multi-stage filtering process including expert inter-rater reviewing to
prioritize plot coherence. We develop a human-AI collaborative framework that
generates coherent reasoning chains, each requiring multiple temporally
grounded steps, spanning seven types (e.g., event attribution, implicit
inference). VRBench designs a multi-phase evaluation pipeline that assesses
models at both the outcome and process levels. Apart from the MCQs for the
final results, we propose a progress-level LLM-guided scoring metric to
evaluate the quality of the reasoning chain from multiple dimensions
comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on
VRBench, we undertake a thorough analysis and provide valuable insights that
advance the field of multi-step reasoning.
[104] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation
Zhao Zhang,Yutao Cheng,Dexiang Hong,Maoke Yang,Gonglei Shi,Lei Ma,Hui Zhang,Jie Shao,Xinglong Wu
Main category: cs.CV
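TL;DR: CreatiPoster is a framework for generating editable, multi-layer graphic designs from natural-language instructions or user-supplied assets, producing designs that are both professional-looking and editable.
Details
Motivation: Graphic design matters in both commercial and personal contexts, but high-quality, editable designs take considerable time and skill. Existing AI tools struggle to meet user needs, especially in incorporating user assets and preserving editability.
Contribution: The CreatiPoster framework, which uses a protocol model and a conditional background model to generate an editable JSON specification and a coherent background, plus a released copyright-free corpus of 100,000 multi-layer designs.
Method: A protocol model generates a JSON specification (layers, layout, style, and so on); a conditional background model then synthesizes the background, yielding an editable multi-layer design.
Result: CreatiPoster surpasses open-source and commercial systems on graphic design generation and supports applications such as canvas editing and multilingual adaptation.
Insight: Separating foreground and background generation and incorporating user input enables high-quality, editable designs and helps democratize AI-assisted graphic design.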
Abstract: Graphic design plays a crucial role in both commercial and personal contexts,
yet creating high-quality, editable, and aesthetically pleasing graphic
compositions remains a time-consuming and skill-intensive task, especially for
beginners. Current AI tools automate parts of the workflow, but struggle to
accurately incorporate user-supplied assets, maintain editability, and achieve
professional visual appeal. Commercial systems, like Canva Magic Design, rely
on vast template libraries, which are impractical to replicate. In this paper,
we introduce CreatiPoster, a framework that generates editable, multi-layer
compositions from optional natural-language instructions or assets. A protocol
model, an RGBA large multimodal model, first produces a JSON specification
detailing every layer (text or asset) with precise layout, hierarchy, content
and style, plus a concise background prompt. A conditional background model
then synthesizes a coherent background conditioned on these rendered foreground
layers. We construct a benchmark with automated metrics for graphic-design
generation and show that CreatiPoster surpasses leading open-source approaches
and proprietary commercial systems. To catalyze further research, we release a
copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports
diverse applications such as canvas editing, text overlay, responsive resizing,
multilingual adaptation, and animated posters, advancing the democratization of
AI-assisted graphic design. Project homepage:
https://github.com/graphic-design-ai/creatiposter
[105] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement
Guimeng Liu,Milad Abdollahzadeh,Ngai-Man Cheung
Main category: cs.CV
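TL;DR: The paper proposes AIR, a zero-shot generative model adaptation method that uses iterative refinement to address the misalignment between image and text offsets in CLIP embedding space, improving generated image quality.
Details
Motivation: Existing zero-shot generative model adaptation methods assume that image and text offsets are perfectly aligned in CLIP embedding space, which degrades generated image quality. The paper analyzes this offset misalignment and proposes an improved method.
Contribution: 1. An empirical analysis of the misalignment between image and text offsets in CLIP embedding space, finding that misalignment is correlated with concept distance; 2. AIR, an iterative refinement method that improves generation quality.
Method: AIR iteratively refines the adaptation to account for offset misalignment in CLIP embedding space, progressively correcting generated images toward the target domain.
Result: Across 26 experimental setups, AIR achieves state-of-the-art performance in qualitative, quantitative, and user-study evaluations.
Insight: Offset misalignment is correlated with concept distance; iterative refinement can mitigate it and improve the adaptation of generative models.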
Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained
generator to a target domain using only text guidance and without any samples
from the target domain. Central to recent ZSGM approaches are directional losses,
which use the text guidance by aligning the image offset with the text
offset in the embedding space of a vision-language model like CLIP. This is
similar to the analogical reasoning in NLP where the offset between one pair of
words is used to identify a missing element in another pair by aligning the
offset between these two pairs. However, a major limitation of existing ZSGM
methods is that the learning objective assumes the complete alignment between
image offset and text offset in the CLIP embedding space, resulting in quality
degradation in generated images. Our work makes two main contributions. Inspired by
the offset misalignment studies in NLP, as our first contribution, we perform
an empirical study to analyze the misalignment between text offset and image
offset in CLIP embedding space for various large publicly available datasets.
Our important finding is that offset misalignment in CLIP embedding space is
correlated with concept distance, i.e., close concepts have less offset
misalignment. To address the limitations of the current approaches, as our
second contribution, we propose Adaptation with Iterative Refinement (AIR)
which is the first ZSGM approach to focus on improving target domain image
quality based on our new insight on offset misalignment. Qualitative,
quantitative, and user-study results in 26 experimental setups consistently
demonstrate that the proposed AIR approach achieves SOTA performance. Additional
experiments are in the supplementary material.
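Illustrative sketch (not from the paper): the directional loss mentioned in the abstract aligns the image offset with the text offset in embedding space. One common form is one minus the cosine similarity between the two offsets; that form, the function names, and the toy 2-D vectors standing in for CLIP embeddings are all assumptions made here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """1 - cos(image offset, text offset); 0 means the offsets are perfectly aligned."""
    img_offset = [t - s for s, t in zip(img_src, img_tgt)]
    txt_offset = [t - s for s, t in zip(txt_src, txt_tgt)]
    return 1.0 - cosine(img_offset, txt_offset)


# Toy embeddings (hypothetical): the image offset points along the first axis.
aligned = directional_loss([0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [2.0, 1.0])
misaligned = directional_loss([0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, 3.0])
```

The same quantity can be read as a simple measure of offset misalignment: the further it is from zero, the less the text offset tells the generator about the direction the image should move.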
[106] M4V: Multi-Modal Mamba for Text-to-Video Generation
Jiancheng Huang,Gengwei Zhang,Zequn Jie,Siyu Jiao,Yinlong Qian,Ling Chen,Yunchao Wei,Lin Ma
Main category: cs.CV
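TL;DR: M4V is a multi-modal text-to-video generation framework built on the Mamba architecture; its multi-modal diffusion Mamba (MM-DiM) block integrates multi-modal information and spatiotemporal modeling efficiently, lowering computational cost while producing high-quality video.
Details
Motivation: Text-to-video generation must model a vast spatiotemporal space, and the quadratic complexity of Transformers limits practical use. Mamba, a linear-time sequence model, is a more efficient alternative, but its plain design is hard to apply directly to multi-modal video generation.
Contribution: The M4V framework, which introduces the multi-modal diffusion Mamba (MM-DiM) block and uses a multi-modal token re-composition design for efficient multi-modal integration and spatiotemporal modeling; in addition, a reward learning strategy that improves visual quality in long-context autoregressive generation.
Method: Built on Mamba, M4V uses MM-DiM blocks for multi-modal integration and spatiotemporal modeling, and a reward learning strategy to improve generation quality.
Result: When generating 768×1280 video, M4V's Mamba blocks use 45% fewer FLOPs than the attention-based alternative, and the model produces high-quality results on text-to-video benchmarks.
Insight: The work shows the potential of Mamba for multi-modal video generation: the design lowers computational cost, and reward learning mitigates visual degradation in long-sequence generation.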
Abstract: Text-to-video generation has significantly enriched content creation and
holds the potential to evolve into powerful world simulators. However, modeling
the vast spatiotemporal space remains computationally demanding, particularly
when employing Transformers, which incur quadratic complexity in sequence
processing and thus limit practical applications. Recent advancements in
linear-time sequence modeling, particularly the Mamba architecture, offer a
more efficient alternative. Nevertheless, its plain design limits its direct
applicability to multi-modal and spatiotemporal video generation tasks. To
address these challenges, we introduce M4V, a Multi-Modal Mamba framework for
text-to-video generation. Specifically, we propose a multi-modal diffusion
Mamba (MM-DiM) block that enables seamless integration of multi-modal
information and spatiotemporal modeling through a multi-modal token
re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45%
compared to the attention-based alternative when generating videos at
768×1280 resolution. Additionally, to mitigate the visual quality
degradation in long-context autoregressive generation processes, we introduce a
reward learning strategy that further enhances per-frame visual realism.
Extensive experiments on text-to-video benchmarks demonstrate M4V’s ability to
produce high-quality videos while significantly lowering computational costs.
Code and models will be publicly available at
https://huangjch526.github.io/M4V_project.
[107] VINCIE: Unlocking In-context Image Editing from Video
Leigang Qu,Feng Cheng,Ziyan Yang,Qi Zhao,Shanchuan Lin,Yichun Shi,Yicong Li,Wenjie Wang,Tat-Seng Chua,Lu Jiang
Main category: cs.CV
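TL;DR: The paper proposes VINCIE, which learns in-context image editing directly from videos; with a block-causal diffusion transformer trained on multiple proxy tasks, it shows strong image editing ability and achieves leading results on multi-turn editing benchmarks.
Details
Motivation: Existing in-context image editing methods depend on task-specific pipelines and expert models to curate training data, which limits scalability and flexibility. The paper explores whether in-context image editing can be learned directly from videos to overcome this data bottleneck.
Contribution: 1) A scalable approach to annotating videos as interleaved multimodal sequences; 2) a block-causal diffusion transformer trained on multiple proxy tasks; 3) a new multi-turn image editing benchmark; 4) evidence that a model trained on videos is promising for multi-concept composition, story generation, and related applications.
Method: Videos are annotated into multimodal sequences, and a block-causal diffusion transformer is trained on three proxy tasks (next-image prediction, current segmentation prediction, and next-segmentation prediction) to learn in-context image editing.
Result: The model achieves state-of-the-art results on two multi-turn image editing benchmarks and shows promising abilities in multi-concept composition, story generation, and chain-of-editing.
Insight: Learning in-context editing directly from videos is feasible, and video data provides rich multimodal signal that supports diverse editing applications.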
Abstract: In-context image editing aims to modify images based on a contextual sequence
comprising text and previously generated images. Existing methods typically
depend on task-specific pipelines and expert models (e.g., segmentation and
inpainting) to curate training data. In this work, we explore whether an
in-context image editing model can be learned directly from videos. We
introduce a scalable approach to annotate videos as interleaved multimodal
sequences. To effectively learn from this data, we design a block-causal
diffusion transformer trained on three proxy tasks: next-image prediction,
current segmentation prediction, and next-segmentation prediction.
Additionally, we propose a novel multi-turn image editing benchmark to advance
research in this area. Extensive experiments demonstrate that our model
exhibits strong in-context image editing capabilities and achieves
state-of-the-art results on two multi-turn image editing benchmarks. Despite
being trained exclusively on videos, our model also shows promising abilities
in multi-concept composition, story generation, and chain-of-editing
applications.
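Illustrative sketch (not from the paper): the abstract does not spell out the attention pattern of the block-causal diffusion transformer. A common reading of "block-causal", assumed here, is that tokens attend freely within their own block (for example one text segment or one image) and to all earlier blocks, but never to later ones. The function name `block_causal_mask` and the block sizes are hypothetical.

```python
def block_causal_mask(block_sizes):
    """Return mask[q][k] == True where query token q may attend to key token k.

    block_sizes lists the number of tokens in each block, in sequence order.
    """
    # Assign every token position the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # A query sees a key if the key's block is the same or an earlier one.
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


# Two tokens in the first block, one token in the second block.
mask = block_causal_mask([2, 1])
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Compared with a token-level causal mask, this keeps full bidirectional attention inside each image or text block while preserving the left-to-right order of the sequence.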
[108] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Yuxuan Luo,Yuhui Yuan,Junwen Chen,Haonan Cai,Ziyi Yue,Yuwei Yang,Fatima Zohra Daha,Ji Li,Zhouhui Lian
Main category: cs.CV
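TL;DR: The paper introduces a new task, knowledge image generation, together with the MMMG benchmark for probing the reasoning ability of image generation models. MMMG contains an expert-validated dataset spanning many disciplines and educational levels, and proposes MMMG-Score as the metric. Experiments show that current models reason poorly, and the authors release FLUX-Reason, an open baseline.
Details
Motivation: Knowledge images are central to human learning and civilization, yet the performance of text-to-image models on such images has not been adequately evaluated. The authors aim to fill this gap with the MMMG benchmark.
Contribution: 1. Knowledge image generation as a new task; 2. the MMMG benchmark, with data covering multiple disciplines and educational levels; 3. the MMMG-Score metric; 4. FLUX-Reason, an open baseline model.
Method: 1. A unified Knowledge Graph (KG) representation captures each knowledge image's core entities and dependencies; 2. generated images are evaluated by graph-edit distance between KGs combined with a visual clarity assessment.
Result: An evaluation of 16 state-of-the-art models shows limited reasoning ability: GPT-4o reaches an MMMG-Score of only 50.20, and the proposed FLUX-Reason scores 34.45.
Insight: Current text-to-image models still have serious reasoning deficits on knowledge image generation; multimodal reasoning and knowledge fusion are the key challenges.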
Abstract: In this paper, we introduce knowledge image generation as a new task,
alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation
Benchmark (MMMG) to probe the reasoning capability of image generation models.
Knowledge images have been central to human civilization and to the mechanisms
of human learning–a fact underscored by dual-coding theory and the
picture-superiority effect. Generating such images is challenging, demanding
multimodal reasoning that fuses world knowledge with pixel-level grounding into
clear explanatory visuals. To enable comprehensive evaluation, MMMG offers
4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines,
6 educational levels, and diverse knowledge formats such as charts, diagrams,
and mind maps. To eliminate confounding complexity during evaluation, we adopt
a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a
target image’s core entities and their dependencies. We further introduce
MMMG-Score to evaluate generated knowledge images. This metric combines factual
fidelity, measured by graph-edit distance between KGs, with visual clarity
assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image
generation models expose serious reasoning deficits–low entity fidelity, weak
relations, and clutter–with GPT-4o achieving an MMMG-Score of only 50.20,
underscoring the benchmark’s difficulty. To spur further progress, we release
FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines
a reasoning LLM with diffusion models and is trained on 16,000 curated
knowledge image-prompt pairs.
[109] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang,Mengzhen Liu,Lichen Li,Ming Lu,Yuan Zhang,Junwen Pan,Qi She,Shanghang Zhang
Main category: cs.CV
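TL;DR: The paper proposes CDPruner, a visual token pruning method that maximizes the conditional diversity of retained tokens to remove redundancy in MLLMs, substantially lowering computational cost while keeping accuracy high.
Details
Motivation: In MLLMs, visual tokens are usually far more numerous than text tokens, which makes inference expensive. Existing pruning methods either rely on attention, retaining many duplicate tokens, or on similarity, overlooking instruction relevance, and so perform suboptimally.
Contribution: CDPruner, a token pruning method based on conditional diversity, which combines instruction-conditioned similarity with a determinantal point process (DPP) to avoid redundant tokens while staying close to the instruction.
Method: The conditional similarity between visual tokens is defined given the instruction, and a DPP is used to select the subset with maximum conditional diversity. The method is training-free and model-agnostic, so it applies to a range of MLLMs.
Result: CDPruner sets a new state of the art across diverse MLLMs; when applied to LLaVA, it reduces FLOPs by 95% and CUDA latency by 78% while maintaining 94% of the original accuracy.
Insight: Maximizing conditional diversity yields a token subset that represents the image better and adheres closely to the user instruction, pointing to a new approach to efficient MLLM inference.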
Abstract: In multimodal large language models (MLLMs), the length of input visual
tokens is often significantly greater than that of their textual counterparts,
leading to a high inference cost. Many works aim to address this issue by
removing redundant visual tokens. However, current approaches either rely on
attention-based pruning, which retains numerous duplicate tokens, or use
similarity-based pruning, overlooking the instruction relevance, consequently
causing suboptimal performance. In this paper, we go beyond attention or
similarity by proposing a novel visual token pruning method named CDPruner,
which maximizes the conditional diversity of retained tokens. We first define
the conditional similarity between visual tokens conditioned on the
instruction, and then reformulate the token pruning problem with determinantal
point process (DPP) to maximize the conditional diversity of the selected
subset. The proposed CDPruner is training-free and model-agnostic, allowing
easy application to various MLLMs. Extensive experiments across diverse MLLMs
show that CDPruner establishes a new state of the art on various vision-language
benchmarks. By maximizing conditional diversity through DPP, the selected
subset better represents the input images while closely adhering to user
instructions, thereby preserving strong performance even with high reduction
ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency
by 78%, while maintaining 94% of the original accuracy. Our code is available
at https://github.com/Theia-4869/CDPruner.
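Illustrative sketch (not from the paper): a determinantal point process favors subsets whose kernel submatrix has a large determinant, which rewards items that are individually relevant and mutually dissimilar. The snippet below uses the standard quality-diversity kernel L[i][j] = r[i] * sim(i, j) * r[j] with naive greedy selection. The kernel form, the per-token relevance scores, and all function names are assumptions; the paper's actual conditional similarity and inference procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, result = len(m), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            result = -result
        result *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return result


def greedy_dpp(tokens, relevance, k):
    """Greedily pick up to k token indices that maximize det of the kernel submatrix."""
    n = len(tokens)
    kernel = [[relevance[i] * cosine(tokens[i], tokens[j]) * relevance[j]
               for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(min(k, n)):
        best, best_det = None, float("-inf")
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            value = det([[kernel[a][b] for b in idx] for a in idx])
            if value > best_det:
                best, best_det = i, value
        selected.append(best)
    return selected


# Tokens 0 and 1 are near-duplicates; token 2 is different but less relevant.
kept = greedy_dpp([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]], [1.0, 0.9, 0.8], k=2)
```

In this toy case the selection keeps tokens 0 and 2: the near-duplicate token 1 adds almost nothing to the determinant even though its relevance is higher than that of token 2. The naive loop recomputes determinants from scratch and is meant only to show the objective, not an efficient implementation.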
[110] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos
Weiliang Chen,Wenzhao Zheng,Yu Zheng,Lei Chen,Jie Zhou,Jiwen Lu,Yueqi Duan
Main category: cs.CV
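TL;DR: GenWorld is a large-scale, high-quality, real-world simulation dataset for AI-generated video detection. Its diversity across prompt modalities and generators supports learning more generalizable forensic features. The study also finds that existing methods fail on high-quality videos, and proposes SpannDetector, a simple model based on multi-view consistency, which achieves superior results in experiments.
Details
Motivation: As video generation advances, the credibility of real-world information is threatened and reliable detectors for AI-generated video are urgently needed. Existing datasets are of insufficient quality and lack real-world simulation scenarios, which holds back detector development.
Contribution: 1) The GenWorld dataset, featuring high quality, real-world simulation, and diversity; 2) the finding that existing methods fall short on high-quality video detection; 3) SpannDetector, a model based on multi-view consistency with superior performance.
Method: 1) High-quality real-world simulation videos are built with multiple state-of-the-art video generation models; 2) SpannDetector uses multi-view consistency as the detection criterion.
Result: Experiments show that SpannDetector performs strongly on GenWorld, particularly on high-quality AI-generated videos.
Insight: Physical plausibility, such as multi-view consistency, can serve as a key cue for detecting AI-generated video and points toward explainable detection methods.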
Abstract: The flourishing of video generation technologies has endangered the
credibility of real-world information and intensified the demand for
AI-generated video detectors. Despite some progress, the lack of high-quality
real-world datasets hinders the development of trustworthy detectors. In this
paper, we propose GenWorld, a large-scale, high-quality, and real-world
simulation dataset for AI-generated video detection. GenWorld features the
following characteristics: (1) Real-world Simulation: GenWorld focuses on
videos that replicate real-world scenarios, which have a significant impact due
to their realism and potential influence; (2) High Quality: GenWorld employs
multiple state-of-the-art video generation models to provide realistic and
high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes
videos generated from diverse generators and various prompt modalities (e.g.,
text, image, video), offering the potential to learn more generalizable
forensic features. We analyze existing methods and find they fail to detect
high-quality videos generated by world models (i.e., Cosmos), revealing
potential drawbacks of ignoring real-world clues. To address this, we propose a
simple yet effective model, SpannDetector, to leverage multi-view consistency
as a strong criterion for real-world AI-generated video detection. Experiments
show that our method achieves superior results, highlighting a promising
direction for explainable AI-generated video detection based on physical
plausibility. We believe that GenWorld will advance the field of AI-generated
video detection. Project Page: https://chen-wl20.github.io/GenWorld
[111] Fine-Grained Perturbation Guidance via Attention Head Selection
Donghoon Ahn,Jiwon Kang,Sanghyun Lee,Minjae Kim,Jaewon Min,Wooseok Jang,Saungwu Lee,Sayak Paul,Susung Hong,Seungryong Kim
Main category: cs.CV
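TL;DR: The paper studies fine-grained perturbation guidance in diffusion models through attention head selection, proposing the HeadHunter framework and the SoftPAG technique for fine-grained control over generation quality and visual attributes.
Details
Motivation: Existing attention perturbation methods lack a principled way to decide where to perturb, particularly in Diffusion Transformer architectures, where quality-relevant computation is distributed across layers.
Contribution: 1. HeadHunter, a framework that iteratively selects attention heads aligned with user objectives; 2. SoftPAG, which tunes perturbation strength by linear interpolation; 3. the first head-level analysis of attention perturbation, revealing interpretable specialization within attention layers.
Method: Analysis at the granularity of attention heads shows that specific heads govern distinct visual concepts (such as structure, style, and texture); HeadHunter and SoftPAG are built on this finding.
Result: The method is validated on models including Stable Diffusion 3 and FLUX.1, improving general generation quality and enabling style-specific guidance.
Insight: Attention heads in diffusion models specialize in distinct visual concepts, so targeted perturbation allows fine-grained control over the generated result.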
Abstract: Recent guidance methods in diffusion models steer reverse sampling by
perturbing the model to construct an implicit weak model and guide generation
away from it. Among these approaches, attention perturbation has demonstrated
strong empirical performance in unconditional scenarios where classifier-free
guidance is not applicable. However, existing attention perturbation methods
lack principled approaches for determining where perturbations should be
applied, particularly in Diffusion Transformer (DiT) architectures where
quality-relevant computations are distributed across layers. In this paper, we
investigate the granularity of attention perturbations, ranging from the layer
level down to individual attention heads, and discover that specific heads
govern distinct visual concepts such as structure, style, and texture quality.
Building on this insight, we propose “HeadHunter”, a systematic framework for
iteratively selecting attention heads that align with user-centric objectives,
enabling fine-grained control over generation quality and visual attributes. In
addition, we introduce SoftPAG, which linearly interpolates each selected
head’s attention map toward an identity matrix, providing a continuous knob to
tune perturbation strength and suppress artifacts. Our approach not only
mitigates the oversmoothing issues of existing layer-level perturbation but
also enables targeted manipulation of specific visual styles through
compositional head selection. We validate our method on modern large-scale
DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1,
demonstrating superior performance in both general quality enhancement and
style-specific guidance. Our work provides the first head-level analysis of
attention perturbation in diffusion models, uncovering interpretable
specialization within attention layers and enabling practical design of
effective perturbation strategies.
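Illustrative sketch (not from the paper): the abstract describes SoftPAG as linearly interpolating a selected head's attention map toward an identity matrix. For a square self-attention map A and a strength u in [0, 1], that is A' = (1 - u) * A + u * I. The function name `soften_attention` and the toy 2x2 map are assumptions for illustration.

```python
def soften_attention(attn, strength):
    """Interpolate a square attention map toward the identity matrix.

    strength = 0 leaves the map unchanged; strength = 1 yields the identity,
    i.e. every token attends only to itself.
    """
    n = len(attn)
    return [
        [(1.0 - strength) * attn[i][j] + (strength if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


# Toy row-stochastic attention map for two tokens.
attn = [[0.5, 0.5], [0.2, 0.8]]
half = soften_attention(attn, 0.5)
```

Because both the original map and the identity have rows that sum to one, every interpolated map is still a valid attention distribution, which is what makes the strength a continuous knob rather than an on/off switch.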
[112] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model
Junqi You,Chieh Hubert Lin,Weijie Lyu,Zhengbo Zhang,Ming-Hsuan Yang
Main category: cs.CV
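TL;DR: InstaInpaint is a reference-based feed-forward framework that completes 3D scene inpainting within 0.4 seconds, a 1000x speed-up, while maintaining state-of-the-art performance.
Details
Motivation: Existing 3D scene inpainting methods rely on time-consuming optimization and cannot meet real-time interaction needs, so a fast and efficient alternative is required.
Contribution: Proposes the InstaInpaint framework, which combines a self-supervised masked-finetuning strategy with training on a large-scale dataset to achieve fast, high-quality 3D scene inpainting.
Method: A reference-based feed-forward framework that starts from a 2D inpainting proposal, uses a custom large reconstruction model (LRM), and is trained with self-supervised masked finetuning.
Result: Achieves state-of-the-art performance on standard benchmarks with a 1000x speed-up, and applies flexibly to object insertion and multi-region inpainting.
Insight: Key designs (such as masked finetuning and the model architecture) markedly improve generalization, textural consistency, and geometric correctness.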
Abstract: Recent advances in 3D scene reconstruction enable real-time viewing in
virtual and augmented reality. To support interactive operations for better
immersiveness, such as moving or editing objects, 3D scene inpainting methods
are proposed to repair or complete the altered geometry. However, current
approaches rely on lengthy and computationally intensive optimization, making
them impractical for real-time or online applications. We propose InstaInpaint,
a reference-based feed-forward framework that produces 3D-scene inpainting from
a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised
masked-finetuning strategy to enable training of our custom large
reconstruction model (LRM) on a large-scale dataset. Through extensive
experiments, we analyze and identify several key designs that improve
generalization, textural consistency, and geometric correctness. InstaInpaint
achieves a 1000x speed-up over prior methods while maintaining
state-of-the-art performance across two standard benchmarks. Moreover, we show
that InstaInpaint generalizes well to flexible downstream applications such as
object insertion and multi-region inpainting. More video results are available
at our project page: https://dhmbb2.github.io/InstaInpaint_page/.
cs.MM
[113] Multimodal Large Language Models: A Survey
Longzhen Han,Awes Mubarak,Almas Baimagambetov,Nikolaos Polatidis,Thar Baker
Main category: cs.MM
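TL;DR: This survey reviews the development of multimodal large language models (MLLMs), covering diverse output modalities such as text, images, and music, analyzing key techniques (e.g., SSL, MoE) and architectural innovations (e.g., transformers, diffusion models), and outlining open challenges.
Details
Motivation: With the rapid development of MLLMs, unifying architectures and achieving cross-modal capabilities has become a key question. The survey aims to systematically review MLLM progress, techniques, and challenges to guide future research.
Contribution: 1. Categorizes six primary generative modalities; 2. Analyzes key techniques including SSL, MoE, RLHF, and CoT; 3. Summarizes architectural innovations such as transformers and diffusion models; 4. Identifies open challenges in evaluation, modularity, and structured reasoning.
Method: Through a literature review, systematically analyzes the key techniques (SSL, MoE, etc.) and architectural designs (transformers, diffusion models) behind MLLMs, and examines multimodal generation capabilities through case studies.
Result: Summarizes success cases and synergies of MLLMs in cross-modal generation, and clarifies the roles and limitations of the key techniques.
Insight: Future MLLM development needs to focus on standardized evaluation, modular design, and stronger structured reasoning in order to reach more general-purpose, adaptive, and interpretable multimodal systems.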
Abstract: Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text
generation, now spanning diverse output modalities including images, music,
video, human motion, and 3D objects, by integrating language with other sensory
modalities under unified architectures. This survey categorises six primary
generative modalities and examines how foundational techniques, namely
Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement
Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting,
enable cross-modal capabilities. We analyze key models, architectural trends,
and emergent cross-modal synergies, while highlighting transferable techniques
and unresolved challenges. Architectural innovations like transformers and
diffusion models underpin this convergence, enabling cross-modal transfer and
modular specialization. We highlight emerging patterns of synergy, and identify
open challenges in evaluation, modularity, and structured reasoning. This
survey offers a unified perspective on MLLM development and identifies critical
paths toward more general-purpose, adaptive, and interpretable multimodal
systems.
[114] EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis
Jianwu Fang,Lei-Lei Li,Zhedong Zheng,Hongkai Yu,Jianru Xue,Zhengguo Li,Tat-Seng Chua
Main category: cs.MM
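TL;DR: The paper proposes EQ-TAA, which uses a diffusion-based accident video synthesis model (AVD) to generate accident video clips and an equivariant triple loss to improve traffic accident anticipation, addressing background confounding and annotation difficulty.
Details
Motivation: Current traffic accident anticipation (TAA) methods require laborious annotation of accident duration, and the long-tailed, uncertain, and fast-evolving nature of traffic scenes makes the causal parts of accidents hard to identify and easily dominated by data bias.
Contribution: 1) Proposes the AVD model, which generates causal video clips with a diffusion model; 2) Designs the EQ-TAA framework, which uses an equivariant triple loss to improve anticipation performance; 3) Training requires no extra annotations.
Method: AVD generates causal video frames (from normal to accident) from text prompts while preserving the style and content of the video. EQ-TAA applies the equivariant triple loss to an anchor accident-free clip together with the generated pair of contrastive pseudo-normal and pseudo-accident clips.
Result: Experiments show that AVD and EQ-TAA achieve performance competitive with state-of-the-art methods while addressing the background confounding issue.
Insight: Synthesizing causal video clips and training contrastively can reduce the influence of data bias on the model and improve the robustness of traffic accident anticipation.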
Abstract: Traffic Accident Anticipation (TAA) in traffic scenes is a challenging
problem for achieving zero fatalities in the future. Current approaches
typically treat TAA as a supervised learning task needing the laborious
annotation of accident occurrence duration. However, the inherently long-tailed,
uncertain, and fast-evolving nature of traffic scenes makes the real causal
parts of accidents difficult to identify and easily dominated by data bias,
resulting in a background confounding issue. Thus, we propose an
Attentive Video Diffusion (AVD) model that synthesizes additional accident
video clips by generating the causal part in dashcam videos, i.e., from normal
clips to accident clips. AVD aims to generate causal video frames based on
accident or accident-free text prompts while preserving the style and content
of frames for TAA after video generation. This approach can be trained using
datasets collected from various driving scenes without any extra annotations.
Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant
triple loss for an anchor accident-free video clip, along with the generated
pair of contrastive pseudo-normal and pseudo-accident clips. Extensive
experiments have been conducted to evaluate the performance of AVD and EQ-TAA,
and competitive performance compared to state-of-the-art methods has been
obtained.
[115] HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction
Jie Qin,Wei Yang,Yan Su,Yiran Zhu,Weizhen Li,Yunyue Pan,Chengchang Pan,Honggang Qi
Main category: cs.MM
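TL;DR: The paper proposes a multimodal HER2 expression prediction framework based on dynamic bidirectional reconstruction, which improves accuracy through flexible choice of input modalities and reduces IHC costs in resource-limited settings.
Details
Motivation: Existing HER2 assessment models usually analyze H&E or IHC images in isolation, whereas clinical practice relies on interpreting both together. The cost and workflow complexity of acquiring both modalities at once limit this in practice.
Contribution: 1) A dynamic branch selector that activates single-modality reconstruction or dual-modality joint inference depending on input completeness; 2) A bidirectional cross-modal GAN that reconstructs missing modalities in feature space; 3) A hybrid training protocol that combines adversarial learning and multi-task optimization.
Method: Uses the dynamic branch selector, the bidirectional cross-modal GAN, and the hybrid training protocol to support single- and dual-modality inputs flexibly, reconstructing the missing modality in feature space.
Result: Single-modality H&E prediction accuracy rises from 71.44% to 94.25%, dual-modality accuracy reaches 95.09%, and reliability with IHC-only input is 90.28%. Cross-modal reconstruction also raises F1-scores to 0.9609 (H&E to IHC) and 0.9251 (IHC to H&E).
Insight: The dynamic, elastic architecture is well suited to resource-limited settings, reducing IHC infrastructure cost while delivering near-bimodal performance. The reconstruction pathway mitigates the performance degradation caused by missing data.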
Abstract: Current HER2 assessment models for breast cancer predominantly analyze H&E or
IHC images in isolation, despite clinical reliance on their synergistic
interpretation. However, concurrent acquisition of both modalities is often
hindered by workflow complexity and cost constraints. We propose an adaptive
bimodal framework enabling flexible single-/dual-modality HER2 prediction
through three innovations: 1) A dynamic branch selector that activates either
single-modality reconstruction or dual-modality joint inference based on input
completeness; 2) A bidirectional cross-modal GAN performing context-aware
feature-space reconstruction of missing modalities; 3) A hybrid training
protocol integrating adversarial learning and multi-task optimization. This
architecture elevates single-modality H&E prediction accuracy from 71.44% to
94.25% while achieving 95.09% dual-modality accuracy, maintaining 90.28%
reliability with sole IHC inputs. The framework’s “dual-preferred,
single-compatible” design delivers near-bimodal performance without requiring
synchronized acquisition, particularly benefiting resource-limited settings
through IHC infrastructure cost reduction. Experimental validation confirms
22.81%/12.90% accuracy improvements over H&E/IHC baselines respectively, with
cross-modal reconstruction enhancing F1-scores to 0.9609 (HE to IHC) and 0.9251
(IHC to HE). By dynamically routing inputs through reconstruction-enhanced or
native fusion pathways, the system mitigates performance degradation from
missing data while preserving computational efficiency (78.55% parameter
reduction in lightweight variant). This elastic architecture demonstrates
significant potential for democratizing precise HER2 assessment across diverse
healthcare settings.
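The routing behaviour of the dynamic branch selector can be sketched as plain control flow. Everything below is a hypothetical illustration: `route_inputs`, the toy reconstruction functions, and the concatenation-based `fuse` are stand-ins chosen for this example, not the paper's GAN or fusion network.

```python
from typing import Callable, List, Optional, Tuple

Features = List[float]


def route_inputs(
    he: Optional[Features],
    ihc: Optional[Features],
    reconstruct_ihc: Callable[[Features], Features],
    reconstruct_he: Callable[[Features], Features],
    fuse: Callable[[Features, Features], Features],
) -> Tuple[str, Features]:
    """Pick a pathway based on which modalities are present."""
    if he is None and ihc is None:
        raise ValueError("at least one modality is required")
    if he is not None and ihc is not None:
        # Both modalities available: native dual-modality fusion.
        return "dual", fuse(he, ihc)
    if he is not None:
        # IHC missing: reconstruct its features from H&E, then fuse.
        return "he_only", fuse(he, reconstruct_ihc(he))
    # H&E missing: reconstruct its features from IHC, then fuse.
    return "ihc_only", fuse(reconstruct_he(ihc), ihc)


# Toy stand-ins so the sketch runs end to end.
def toy_reconstruct(features: Features) -> Features:
    return [0.5 * x for x in features]


def toy_fuse(a: Features, b: Features) -> Features:
    return a + b  # list concatenation


branch, fused = route_inputs([1.0, 2.0], None, toy_reconstruct, toy_reconstruct, toy_fuse)
```

The point of the design is that the downstream predictor always receives a fused representation of the same shape, whether the second modality was observed or reconstructed.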
[116] Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space
Kangwei Liu,Junwu Liu,Xiaowei Yi,Jinlin Guo,Yun Cao
Main category: cs.MM
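TL;DR: The paper proposes a diffusion-based framework for 3D facial animation that combines a multimodal emotion binding strategy with an attention-based latent diffusion model, addressing the limitations of single-modal control and deterministic regression.
Details
Motivation: Existing audio-driven 3D facial animation methods rely on single-modal control signals, and their deterministic regression limits the diversity of emotional expressions and behaviors, so the complementary strengths of multimodal signals go unused.
Contribution: 1. A FLAME-centered multimodal emotion binding strategy that aligns text, audio, and emotion labels through contrastive learning. 2. An attention-based latent diffusion model that enriches motion diversity while maintaining temporal coherence.
Method: 1. Binds multimodal signals into a unified space centered on the FLAME model. 2. Uses a diffusion model with content-aware attention and emotion-guided layers.
Result: Experiments show a 21.6% improvement in emotion similarity while preserving natural facial dynamics.
Insight: Combining multimodal signals with diffusion models can markedly improve the expressiveness and controllability of 3D facial animation.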
Abstract: Audio-driven emotional 3D facial animation encounters two significant
challenges: (1) reliance on single-modal control signals (videos, text, or
emotion labels) without leveraging their complementary strengths for
comprehensive emotion manipulation, and (2) deterministic regression-based
mapping that constrains the stochastic nature of emotional expressions and
non-verbal behaviors, limiting the expressiveness of synthesized animations. To
address these challenges, we present a diffusion-based framework for
controllable expressive 3D facial animation. Our approach introduces two key
innovations: (1) a FLAME-centered multimodal emotion binding strategy that
aligns diverse modalities (text, audio, and emotion labels) through contrastive
learning, enabling flexible emotion control from multiple signal sources, and
(2) an attention-based latent diffusion model with content-aware attention and
emotion-guided layers, which enriches motion diversity while maintaining
temporal coherence and natural facial dynamics. Extensive experiments
demonstrate that our method outperforms existing approaches across most
metrics, achieving a 21.6% improvement in emotion similarity while preserving
physiologically plausible facial dynamics. Project Page:
https://kangweiiliu.github.io/Control_3D_Animation.
[117] Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics
Yi-Chun Chen
Main category: cs.MM
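TL;DR: The paper proposes a hierarchical knowledge graph framework for the structured understanding of visual narratives such as comics, supporting multimodal reasoning.
Details
Motivation: Visual narratives such as comics carry complex multimodal information (visual and textual), and conventional single-level analysis struggles to capture their semantic, spatial, and temporal relationships.
Contribution: A hierarchical knowledge graph framework that decomposes narrative content from macro-level story arcs down to fine-grained event segments and integrates semantic, spatial, and temporal relationships.
Method: Builds multimodal graphs that link visual elements (characters, objects, actions) with textual components (dialogue and captions), and integrates them across narrative levels to support reasoning.
Result: On a manually annotated subset of the Manga109 dataset, the method supports action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction with high precision and recall.
Insight: The work offers a scalable foundation for narrative analysis, interactive storytelling, and multimodal reasoning in visual media, and highlights the value of hierarchical graphs for understanding complex narratives.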
Abstract: This paper presents a hierarchical knowledge graph framework for the
structured understanding of visual narratives, focusing on multimodal media
such as comics. The proposed method decomposes narrative content into multiple
levels, from macro-level story arcs to fine-grained event segments. It
represents them through integrated knowledge graphs that capture semantic,
spatial, and temporal relationships. At the panel level, we construct
multimodal graphs that link visual elements such as characters, objects, and
actions with corresponding textual components, including dialogue and captions.
These graphs are integrated across narrative levels to support reasoning over
story structure, character continuity, and event progression.
We apply our approach to a manually annotated subset of the Manga109 dataset
and demonstrate its ability to support symbolic reasoning across diverse
narrative tasks, including action retrieval, dialogue tracing, character
appearance mapping, and panel timeline reconstruction. Evaluation results show
high precision and recall across tasks, validating the coherence and
interpretability of the framework. This work contributes a scalable foundation
for narrative-based content analysis, interactive storytelling, and multimodal
reasoning in visual media.
[118] WDMIR: Wavelet-Driven Multimodal Intent Recognition
Weiyin Gong,Kai Zhang,Yanghai Zhang,Qi Liu,Xinjie Sun,Junyu Lu,Linbo Zhu
Main category: cs.MM
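TL;DR: The paper proposes WDMIR, a wavelet-driven multimodal intent recognition framework that improves the understanding of non-verbal information through frequency-domain analysis and achieves state-of-the-art performance on the MIntRec dataset.
Details
Motivation: Existing methods focus mostly on text analysis and overlook the rich semantics of non-verbal information. WDMIR aims to fill this gap through frequency-domain analysis.
Contribution: 1) A wavelet-driven module for synchronized decomposition and fusion of video-audio features; 2) A cross-modal interaction mechanism that enhances features progressively from bimodal to trimodal integration.
Method: Decomposes signals in the frequency domain with wavelet transforms, then progressively integrates video, audio, and text through the cross-modal interaction mechanism.
Result: Accuracy on MIntRec improves by 1.13%, and the wavelet-driven fusion module raises recognition accuracy on subtle emotional cues by 0.41%.
Insight: Frequency-domain analysis captures the temporal dynamics of non-verbal information at a finer granularity, and cross-modal interaction helps bridge the semantic gap between verbal and non-verbal information.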
Abstract: Multimodal intent recognition (MIR) seeks to accurately interpret user
intentions by integrating verbal and non-verbal information across video, audio
and text modalities. While existing approaches prioritize text analysis, they
often overlook the rich semantic content embedded in non-verbal cues. This
paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR)
framework that enhances intent understanding through frequency-domain analysis
of non-verbal information. To be more specific, we propose: (1) a
wavelet-driven fusion module that performs synchronized decomposition and
integration of video-audio features in the frequency domain, enabling
fine-grained analysis of temporal dynamics; (2) a cross-modal interaction
mechanism that facilitates progressive feature enhancement from bimodal to
trimodal integration, effectively bridging the semantic gap between verbal and
non-verbal information. Extensive experiments on MIntRec demonstrate that our
approach achieves state-of-the-art performance, surpassing previous methods by
1.13% on accuracy. Ablation studies further verify that the wavelet-driven
fusion module significantly improves the extraction of semantic information
from non-verbal sources, with a 0.41% increase in recognition accuracy when
analyzing subtle emotional cues.
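To make the idea of wavelet decomposition concrete, here is a minimal one-level Haar transform in pure Python. It is a sketch under assumptions: the abstract does not say which wavelet family WDMIR uses, so Haar is chosen only because it is the simplest, and the function names are illustrative.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approx, detail): low-frequency averages and high-frequency
    differences of adjacent sample pairs.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the original samples from one level of Haar coefficients."""
    scale = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * scale, (a - d) * scale])
    return out


feature_track = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0]
low, high = haar_step(feature_track)
restored = haar_inverse(low, high)
```

The low band keeps the slow trend of a feature track while the high band isolates rapid changes, and the transform is invertible, so a fusion module can operate on each band separately without losing information.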
cs.SD
[119] PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs
Tony Alex,Wish Suharitdamrong,Sara Atito,Armin Mustafa,Philip J. B. Jackson,Imran Razzak,Muhammad Awais
Main category: cs.SD
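TL;DR: This paper systematically studies how information is transferred from audio encoders to large language models (LLMs), and improves audio-LLM performance through architectural design changes.
Details
Motivation: Although application-focused audio-LLM development has progressed rapidly, the underlying information transfer mechanisms remain under-explored, in particular how audio encoders can pass rich semantic information to LLMs efficiently.
Contribution: Proposes delayed audio integration, probing audio representations through the attention submodule only, and an ensemble of multiple audio encoders, which together improve information transfer in audio-LLMs.
Method: Starting from a Pengi/LLaVA-style architecture, proposes and evaluates three modifications: delayed audio integration, attention-only probing, and a multi-encoder ensemble.
Result: The final architecture, trained on a dataset of 5.6 million audio-text pairs, achieves relative improvements of 10% to 60% over the baseline.
Insight: Textual context established by the LLM's initial layers strengthens its ability to probe audio representations, the attention submodule alone is sufficient for probing, and a multi-encoder ensemble supplies richer audio information.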
Abstract: The integration of audio perception capabilities into Large Language Models
(LLMs) has enabled significant advances in Audio-LLMs. Although
application-focused developments, particularly in curating training data for
specific capabilities e.g., audio reasoning, have progressed rapidly, the
underlying mechanisms that govern efficient transfer of rich semantic
representations from audio encoders to LLMs remain under-explored. We
conceptualize effective audio-LLM interaction as the LLM’s ability to
proficiently probe the audio encoder representations to satisfy textual
queries. This paper presents a systematic investigation into how architectural
design choices affect this ability. Beginning with a standard Pengi/LLaVA-style
audio-LLM architecture, we propose and evaluate several modifications guided by
hypotheses derived from mechanistic interpretability studies and LLM
operational principles. Our experiments demonstrate that: (1) delaying audio
integration until the LLM’s initial layers have established textual context
enhances its ability to probe the audio representations for relevant
information; (2) the LLM can proficiently probe audio representations
exclusively through the LLM layers’ attention submodule, without requiring
propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently
integrated ensemble of diverse audio encoders provides richer, complementary
representations, thereby broadening the LLM’s capacity to probe a wider
spectrum of audio information. All hypotheses are evaluated using an identical
three-stage training curriculum on a dataset of 5.6 million audio-text pairs,
ensuring controlled comparisons. Our final architecture, which incorporates all
proposed modifications, achieves relative improvements from 10% to 60% over
the baseline, validating our approach to optimizing cross-modal information
transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/
cs.MA
[120] AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
Haoyuan Shi,Yunxin Li,Xinyu Chen,Longyue Wang,Baotian Hu,Min Zhang
Main category: cs.MA
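TL;DR: AniMaker is a multi-agent framework that generates globally consistent, story-coherent animation from text input through MCTS-driven clip generation and storytelling-aware clip selection.
Details
Motivation: Existing video generation methods struggle to produce coherent storytelling videos spanning multiple scenes and characters, resulting in disjointed narratives and pacing issues. AniMaker aims to address these problems.
Contribution: 1. Proposes the AniMaker framework, which generates and evaluates clips efficiently through multi-agent collaboration; 2. Introduces the MCTS-Gen strategy to optimize exploration of the candidate space during clip generation; 3. Proposes AniEval, the first framework designed specifically for multi-shot animation evaluation.
Method: 1. A multi-agent architecture (Director, Photography, Reviewer, and Post-Production agents); 2. The MCTS-Gen strategy in the Photography Agent for generating high-potential clips; 3. The AniEval framework for assessing clip consistency and animation-specific features.
Result: AniMaker achieves superior quality on VBench and AniEval metrics and significantly improves the efficiency of multi-candidate generation.
Insight: Through collaboration among specialized agents and MCTS-based optimization, AniMaker shows how text-to-animation generation can balance global consistency with resource efficiency.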
Abstract: Despite rapid advancements in video generation models, generating coherent
storytelling videos that span multiple scenes and characters remains
challenging. Current methods often rigidly convert pre-generated keyframes into
fixed-length clips, resulting in disjointed narratives and pacing issues.
Furthermore, the inherent instability of video generation models means that
even a single low-quality clip can significantly degrade the entire output
animation’s logical coherence and visual continuity. To overcome these
obstacles, we introduce AniMaker, a multi-agent framework enabling efficient
multi-candidate clip generation and storytelling-aware clip selection, thus
creating globally consistent and story-coherent animation solely from text
input. The framework is structured around specialized agents, including the
Director Agent for storyboard generation, the Photography Agent for video clip
generation, the Reviewer Agent for evaluation, and the Post-Production Agent
for editing and voiceover. Central to AniMaker’s approach are two key technical
components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search
(MCTS)-inspired strategy that intelligently navigates the candidate space to
generate high-potential clips while optimizing resource usage; and AniEval in
Reviewer Agent, the first framework specifically designed for multi-shot
animation evaluation, which assesses critical aspects such as story-level
consistency, action completion, and animation-specific features by considering
each clip in the context of its preceding and succeeding clips. Experiments
demonstrate that AniMaker achieves superior quality as measured by popular
metrics including VBench and our proposed AniEval framework, while
significantly improving the efficiency of multi-candidate generation, pushing
AI-generated storytelling animation closer to production standards.
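Standard MCTS balances exploring untried candidates against exploiting good ones, typically with the UCB1 rule. The sketch below shows that generic rule applied to candidate clips. It is an assumption-laden illustration: the abstract only says MCTS-Gen is "MCTS-inspired", so the scoring rule, the `ucb1_select` name, and the reward values here are hypothetical and not the authors' algorithm.

```python
import math


def ucb1_select(stats, exploration=1.4):
    """Pick the next candidate to expand using the UCB1 rule.

    stats: dict mapping candidate name -> (total_reward, visit_count).
    Unvisited candidates are always tried first.
    """
    total_visits = sum(visits for _, visits in stats.values())
    best_name, best_score = None, float("-inf")
    for name, (reward, visits) in stats.items():
        if visits == 0:
            return name
        mean_reward = reward / visits
        bonus = exploration * math.sqrt(math.log(total_visits) / visits)
        if mean_reward + bonus > best_score:
            best_name, best_score = name, mean_reward + bonus
    return best_name


# Hypothetical reviewer scores accumulated so far for three candidate clips.
clip_stats = {"clip_a": (2.4, 3), "clip_b": (0.9, 1), "clip_c": (0.0, 0)}
next_clip = ucb1_select(clip_stats)
```

The exploration bonus shrinks as a candidate is visited more often, so generation budget drifts toward clips that keep scoring well while rarely sampled ones still get an occasional look.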
cs.IR
[121] Conversational Search: From Fundamentals to Frontiers in the LLM Era
Fengran Mo,Chuan Meng,Mohammad Aliannejadi,Jian-Yun Nie
Main category: cs.IR
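TL;DR: The tutorial "Conversational Search: From Fundamentals to Frontiers in the LLM Era" covers the fundamentals and frontier techniques of conversational search in the era of large language models (LLMs), describing how multi-turn interaction serves complex information needs and the opportunities and challenges that LLMs bring.
Details
Motivation: Conversational search satisfies users' complex information needs through multi-turn interaction, and LLMs, with their instruction-following, content generation, and reasoning abilities, create new opportunities and challenges for building intelligent conversational search systems.
Contribution: Systematically introduces the fundamentals of conversational search and examines how LLMs are reshaping the field.
Method: Takes a survey-style approach, analyzing the core principles and frontier developments of conversational search against the background of LLMs.
Result: Gives participants from academia and industry a comprehensive body of knowledge for understanding and advancing next-generation conversational search systems.
Insight: LLMs bring stronger context understanding and dynamic interaction to conversational search, but deployment still has to address challenges such as intent understanding and information accuracy.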
Abstract: Conversational search enables multi-turn interactions between users and
systems to fulfill users’ complex information needs. During this interaction,
the system should understand the users’ search intent within the conversational
context and then return the relevant information through a flexible,
dialogue-based interface. Recent powerful large language models (LLMs), with
capabilities in instruction following, content generation, and reasoning, have
attracted significant attention and advanced rapidly, providing new opportunities
and challenges for building intelligent conversational search systems. This
tutorial aims to introduce the connection between fundamentals and the emerging
topics revolutionized by LLMs in the context of conversational search. It is
designed for students, researchers, and practitioners from both academia and
industry. Participants will gain a comprehensive understanding of both the core
principles and cutting-edge developments driven by LLMs in conversational
search, equipping them with the knowledge needed to contribute to the
development of next-generation conversational search systems.
eess.SY
[122] Energy Aware Camera Location Search Algorithm for Increasing Precision of Observation in Automated Manufacturing
Rongfei Li,Francis Assadian
Main category: eess.SY
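TL;DR: This paper proposes an energy-aware camera location search algorithm that increases observation precision in automated manufacturing by optimizing the camera location to reduce image noise under an energy constraint.
Details
Motivation: In automated manufacturing environments, camera observation quality varies significantly with location, yet little research addresses the effect of camera location. The paper targets this gap by optimizing the camera location to improve observation precision.
Contribution: 1. An energy-aware camera moving policy that searches for the optimal or a suboptimal observation location with limited energy; 2. Combined with image averaging, high-precision observation using a single camera.
Method: 1. Explores the camera workspace efficiently with an adaptive search policy; 2. Refines the moving policy by learning the environment; 3. Reduces noise with an image averaging technique.
Result: Simulation shows that the algorithm improves observation precision with limited energy.
Insight: Optimizing camera location matters for observation precision in automated manufacturing, and an energy-aware search policy allows efficient optimization under resource constraints.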
Abstract: Visual servoing technology has been well developed and applied in many
automated manufacturing tasks, especially in tools’ pose alignment. To access a
full global view of tools, most applications adopt eye-to-hand configuration or
eye-to-hand/eye-in-hand cooperation configuration in an automated manufacturing
environment. Most research focuses on developing control
and observation architectures in various scenarios, but few of them have
discussed the importance of the camera’s location in eye-to-hand configuration.
In a manufacturing environment, the quality of camera estimations may vary
significantly from one observation location to another, as the combined effects
of environmental conditions result in different noise levels of a single image
shot at different locations. In this paper, we propose an algorithm for the
camera’s moving policy so that it explores the camera workspace and searches
for the optimal location where the images’ noise level is minimized. Also, this
algorithm ensures the camera ends up at a suboptimal (if the optimal one is
unreachable) location among the locations already searched, with limited energy
available for moving the camera. Unlike a simple brute force approach, the
algorithm enables the camera to explore space more efficiently by adapting the
search policy as it learns the environment. With the aid of an image averaging
technique, this algorithm, using a single camera, achieves the desired
observation accuracy in eye-to-hand configurations without filtering out
high-frequency information in the original image. An automated manufacturing
application has been simulated, and the results show that the algorithm improves
observation precision with limited energy.
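The effect of image averaging can be checked with a small simulation. Averaging N frames with independent zero-mean noise reduces the noise standard deviation by a factor of about sqrt(N), without any spatial filtering. The sketch below assumes Gaussian noise, a flat "image" of identical pixels, and illustrative parameter values; none of these come from the paper.

```python
import math
import random


def rmse(values, truth):
    """Root-mean-square error of pixel values against a known true intensity."""
    return math.sqrt(sum((v - truth) ** 2 for v in values) / len(values))


def average_frames(frames):
    """Pixel-wise mean over a list of equally sized frames."""
    count = len(frames)
    return [sum(pixel_stack) / count for pixel_stack in zip(*frames)]


rng = random.Random(0)  # fixed seed so the run is repeatable
truth, sigma = 100.0, 1.0          # assumed true intensity and noise level
num_pixels, num_frames = 2000, 16  # illustrative sizes

frames = [
    [truth + rng.gauss(0.0, sigma) for _ in range(num_pixels)]
    for _ in range(num_frames)
]
single_rmse = rmse(frames[0], truth)
averaged_rmse = rmse(average_frames(frames), truth)
```

With 16 frames the averaged error should land near a quarter of the single-frame error. Because each pixel is averaged only with itself across frames, edges and other high-frequency detail in the scene are preserved.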
[123] Semi-Tensor-Product Based Convolutional Neural Networks
Daizhan Cheng
Main category: eess.SY
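TL;DR: This paper proposes a new convolutional neural network (CNN) based on the semi-tensor product (STP), combining a domain-based convolutional product (CP) with the STP to avoid the junk information introduced by padding in conventional CNNs.
Details
Motivation: Convolution in conventional CNNs requires matching input vector dimensions, and padding introduces useless information. The paper uses the flexibility of the STP to avoid padding.
Contribution: Proposes a new convolutional product that combines the domain-based CP with the STP of vectors, and builds an STP-based CNN for image and third-order signal identification.
Method: Combines the domain-based CP with the STP to design a padding-free convolution, then builds the STP-based CNN on top of it.
Result: Applications to image and third-order signal identification are considered, with the padding-free design avoiding interference from padded values.
Insight: The flexibility of the STP offers a new route for convolution, handling inputs of different dimensions without padding and widening the design space for CNNs.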
Abstract: The semi-tensor product (STP) of vectors is a generalization of the
conventional inner product of vectors, which allows the factor vectors to be of
different dimensions. This paper proposes a domain-based convolutional product (CP).
Combining domain-based CP with STP of vectors, a new CP is proposed. Since
there is no zero or any other padding, it can avoid the junk information caused
by padding. Using it, the STP-based convolutional neural network (CNN) is
developed. Its application to image and third order signal identifications is
considered.
cs.GR
[124] Edit360: 2D Image Edits to 3D Assets from Any Angle
Junchao Huang,Xinting Hu,Zhuotao Tian,Shaoshuai Shi,Li Jiang
Main category: cs.GR
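TL;DR: Edit360 is a tuning-free framework that extends 2D image edits to multi-view consistent 3D editing, using video diffusion models and an Anchor-View Editing Propagation mechanism to enable high-quality 3D asset reconstruction.
Details
Motivation: Existing methods typically restrict editing to predetermined viewing angles, which limits flexibility and makes fine-grained, multi-view consistent edits difficult.
Contribution: Proposes the Edit360 framework, which supports user-specific 3D editing from arbitrary viewpoints and ensures multi-view consistency through the Anchor-View Editing Propagation mechanism.
Method: Built on video diffusion models, it aligns and merges multi-view information within the latent and attention spaces using the Anchor-View Editing Propagation mechanism.
Result: Reconstructs high-quality 3D assets and enables customizable 3D content creation.
Insight: Aligning multi-view information within the latent and attention spaces of diffusion models offers an efficient and flexible approach to 3D editing.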
Abstract: Recent advances in diffusion models have significantly improved image
generation and editing, but extending these capabilities to 3D assets remains
challenging, especially for fine-grained edits that require multi-view
consistency. Existing methods typically restrict editing to predetermined
viewing angles, severely limiting their flexibility and practical applications.
We introduce Edit360, a tuning-free framework that extends 2D modifications to
multi-view consistent 3D editing. Built upon video diffusion models, Edit360
enables user-specific editing from arbitrary viewpoints while ensuring
structural coherence across all views. The framework selects anchor views for
2D modifications and propagates edits across the entire 360-degree range. To
achieve this, Edit360 introduces a novel Anchor-View Editing Propagation
mechanism, which effectively aligns and merges multi-view information within
the latent and attention spaces of diffusion models. The resulting edited
multi-view sequences facilitate the reconstruction of high-quality 3D assets,
enabling customizable 3D content creation.
cs.LG
[125] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs
Shangpin Peng,Weinong Wang,Zhuotao Tian,Senqiao Yang,Xing Wu,Haotian Xu,Chengquan Zhang,Takashi Isobe,Baotian Hu,Min Zhang
Main category: cs.LG
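TL;DR: Omni-DPO is a dual-perspective optimization framework that dynamically adjusts the learning weight of each preference pair, significantly improving the performance of DPO in RLHF.
Details
Motivation: Existing DPO methods treat all preference pairs uniformly, ignoring differences in their quality and learning utility, which leads to suboptimal data utilization and performance.
Contribution: Proposes the Omni-DPO framework, which adaptively weights samples according to both data quality and the model's learning dynamics, making better use of the data and improving performance.
Method: Jointly optimizes over two perspectives (data quality and the model's learning dynamics), assigning dynamic weights to preference pairs for more effective use of training data.
Result: On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats Claude 3 Opus by 6.7 points on the Arena-Hard benchmark; on mathematical reasoning tasks, Omni-DPO outperforms the baseline methods across all benchmarks.
Insight: Dynamically adjusting the learning weights of preference pairs is key to improving DPO, with data quality and model learning dynamics as the two core perspectives.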
Abstract: Direct Preference Optimization (DPO) has become a cornerstone of
reinforcement learning from human feedback (RLHF) due to its simplicity and
efficiency. However, existing DPO-based approaches typically treat all
preference pairs uniformly, ignoring critical variations in their inherent
quality and learning utility, leading to suboptimal data utilization and
performance. To address this challenge, we propose Omni-DPO, a dual-perspective
optimization framework that jointly accounts for (1) the inherent quality of
each preference pair and (2) the model’s evolving performance on those pairs.
By adaptively weighting samples according to both data quality and the model’s
learning dynamics during training, Omni-DPO enables more effective training
data utilization and achieves better performance. Experimental results on
various models and benchmarks demonstrate the superiority and generalization
capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it
finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant
margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning
tasks, Omni-DPO consistently outperforms the baseline methods across all
benchmarks, providing strong empirical evidence for the effectiveness and
robustness of our approach. Code and models will be available at
https://github.com/pspdada/Omni-DPO.
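The idea of weighting preference pairs can be illustrated on top of the standard DPO loss, which for one pair is -log sigmoid(beta * ((log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l)))). The sketch below multiplies each pair's loss by a weight. How Omni-DPO actually computes its quality and learning-dynamics weights is not described in the abstract, so the `weights` argument here is a hypothetical placeholder supplied by the caller.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(batch, weights, beta=0.1):
    """Mean of per-pair DPO losses, each scaled by a per-pair weight.

    batch: list of (policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp) tuples.
    weights: one non-negative weight per pair (assumed to be given).
    """
    if len(batch) != len(weights):
        raise ValueError("batch and weights must have the same length")
    total = 0.0
    for (pol_w, pol_l, ref_w, ref_l), weight in zip(batch, weights):
        margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
        total += weight * (-log_sigmoid(margin))
    return total / len(batch)


pairs = [(-10.0, -12.0, -10.0, -12.0),   # policy equals reference: zero margin
         (-9.0, -14.0, -10.0, -12.0)]    # policy already prefers the chosen answer
uniform = weighted_dpo_loss(pairs, [1.0, 1.0])
reweighted = weighted_dpo_loss(pairs, [1.0, 0.2])  # down-weight the easy pair
```

Setting all weights to 1 recovers plain DPO, so a weighting scheme of this shape changes only how much each pair contributes to the gradient.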
[126] Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning
Jikai Jin,Vasilis Syrgkanis,Sham Kakade,Hanlin Zhang
Main category: cs.LG
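TL;DR: The paper proposes a causal representation learning framework that explains benchmark performance by modeling the latent capability factors of language models, reveals the causal structure among capabilities, and underscores the importance of controlling for base model variations.
Details
Motivation: Evaluating language model capabilities involves complex confounding effects and prohibitive computational costs, and conventional approaches struggle to reveal the causal relationships among latent capabilities.
Contribution: Proposes a causal representation learning framework that identifies a linear causal structure among latent capability factors and draws out its scientific implications.
Method: Models benchmark performance as a linear transformation of latent capability factors, controls for the base model as a common confounder, and identifies the causal structure.
Result: On a dataset of over 1500 models, identifies a three-node linear causal structure that reveals the causal direction among capabilities.
Insight: Capabilities develop from general problem solving, through instruction following, to mathematical reasoning, and controlling for base model variations is critical to uncovering this structure.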
Abstract: Faithful evaluation of language model capabilities is crucial for deriving
actionable insights that can inform model development. However, rigorous causal
evaluations in this domain face significant methodological challenges,
including complex confounding effects and prohibitive computational costs
associated with extensive retraining. To tackle these challenges, we propose a
causal representation learning framework wherein observed benchmark performance
is modeled as a linear transformation of a few latent capability factors.
Crucially, these latent factors are identified as causally interrelated after
appropriately controlling for the base model as a common confounder. Applying
this approach to a comprehensive dataset encompassing over 1500 models
evaluated across six benchmarks from the Open LLM Leaderboard, we identify a
concise three-node linear causal structure that reliably explains the observed
performance variations. Further interpretation of this causal structure
provides substantial scientific insights beyond simple numerical rankings:
specifically, we reveal a clear causal direction starting from general
problem-solving capabilities, advancing through instruction-following
proficiency, and culminating in mathematical reasoning ability. Our results
underscore the essential role of carefully controlling base model variations
during evaluation, a step critical to accurately uncovering the underlying
causal relationships among latent model capabilities.
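The model class described above, a three-node linear causal chain over latent capabilities that is mapped linearly to six benchmark scores, can be written down as a tiny generative sketch. All numbers are hypothetical: the chain coefficients and the loading matrix are made up for illustration, and the base-model confounder control that the paper emphasizes is not modeled here.

```python
def simulate_scores(noise, chain=(0.8, 0.6), loadings=None):
    """Generate latent capabilities and benchmark scores for one model.

    noise: (e1, e2, e3), independent shocks for each latent factor.
    chain: assumed linear effects z1 -> z2 and z2 -> z3.
    loadings: 6x3 matrix mapping latent factors to benchmark scores.
    """
    if loadings is None:
        # Hypothetical loadings: six benchmarks, three latent factors.
        loadings = [
            [1.0, 0.0, 0.0],
            [0.7, 0.3, 0.0],
            [0.0, 1.0, 0.0],
            [0.2, 0.8, 0.0],
            [0.0, 0.0, 1.0],
            [0.1, 0.2, 0.7],
        ]
    e1, e2, e3 = noise
    z1 = e1                      # general problem solving
    z2 = chain[0] * z1 + e2      # instruction following
    z3 = chain[1] * z2 + e3      # mathematical reasoning
    latent = [z1, z2, z3]
    scores = [sum(w * z for w, z in zip(row, latent)) for row in loadings]
    return latent, scores


latent, scores = simulate_scores((1.0, 0.0, 0.0))
```

In this chain a shock to the first factor propagates to all three, while a shock to the last factor moves only the benchmarks that load on it, which is the kind of asymmetry a causal structure can express and a plain ranking cannot.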
[127] Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series
Ching Chang,Jeehyun Hwang,Yidan Shi,Haixin Wang,Wen-Chih Peng,Tien-Fu Chen,Wei Wang
Main category: cs.LG
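TL;DR: Time-IMM is a dataset designed for irregular multimodal multivariate time series; together with the IMM-TSF benchmark library, it narrows the gap between research and real-world deployment.
Details
Motivation: Real-world time series data (e.g., in healthcare, climate modeling, and finance) are often irregular, multimodal, and messy, whereas existing benchmarks typically assume clean, regularly sampled, unimodal data that do not match practical needs.
Contribution: 1) The Time-IMM dataset captures nine types of irregularity in multimodal multivariate time series; 2) The IMM-TSF benchmark library supports asynchronous multimodal integration and realistic evaluation; 3) Two specialized fusion modules are proposed.
Method: 1) Time-IMM categorizes irregularity into trigger-based, constraint-based, and artifact-based mechanisms; 2) IMM-TSF includes a timestamp-to-text fusion module and a multimodality fusion module, supporting recency-aware averaging and attention-based integration strategies.
Result: Experiments show that explicitly modeling multimodality on irregular time series substantially improves forecasting performance.
Insight: The work provides a foundation and practical tools for analyzing irregular multimodal time series under real-world conditions.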
Abstract: Time series data in real-world applications such as healthcare, climate
modeling, and finance are often irregular, multimodal, and messy, with varying
sampling rates, asynchronous modalities, and pervasive missingness. However,
existing benchmarks typically assume clean, regularly sampled, unimodal data,
creating a significant gap between research and real-world deployment. We
introduce Time-IMM, a dataset specifically designed to capture cause-driven
irregularity in multimodal multivariate time series. Time-IMM represents nine
distinct types of time series irregularity, categorized into trigger-based,
constraint-based, and artifact-based mechanisms. Complementing the dataset, we
introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal
time series, enabling asynchronous integration and realistic evaluation.
IMM-TSF includes specialized fusion modules, including a timestamp-to-text
fusion module and a multimodality fusion module, which support both
recency-aware averaging and attention-based integration strategies. Empirical
results demonstrate that explicitly modeling multimodality on irregular time
series data leads to substantial gains in forecasting performance. Time-IMM and
IMM-TSF provide a foundation for advancing time series analysis under
real-world conditions. The dataset is publicly available at
https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the
benchmark library can be accessed at
https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
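One simple way to realize recency-aware averaging over asynchronous observations is to weight each past observation by an exponential decay in its age. The sketch below is an assumption: the abstract names the strategy but not its formula, so the exponential kernel, the `decay` parameter, and the scalar values (the library presumably works on embeddings) are illustrative choices.

```python
import math


def recency_weighted_average(observations, query_time, decay=1.0):
    """Average past values, weighting recent ones more heavily.

    observations: list of (timestamp, value) pairs, possibly irregularly spaced.
    query_time: the time at which a fused value is needed.
    decay: larger values forget old observations faster (assumed parameter).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for timestamp, value in observations:
        if timestamp > query_time:
            continue  # never look into the future
        weight = math.exp(-decay * (query_time - timestamp))
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no observations at or before query_time")
    return weighted_sum / weight_total


# Irregularly timed text-derived signals (timestamps in hours, made up).
notes = [(0.0, 1.0), (0.5, 3.0), (4.0, 10.0), (9.0, 50.0)]
fused = recency_weighted_average(notes, query_time=5.0, decay=1.0)
```

With `decay=0` this reduces to a plain mean over all past observations, and the observation at time 9.0 is ignored because it lies after the query time.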
[128] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering
Sai Prasanna Teja Reddy Bogireddy,Abrar Majeedi,Viswanatha Reddy Gajjala,Zhuoyan Xu,Siddhant Rai,Vaishnav Potlapalli
Main category: cs.LG
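TL;DR: The paper presents Neural, a data-driven prompt optimization approach for question answering over clinical electronic health records (EHRs). It decouples evidence retrieval from answer generation and adds self-consistency voting, which substantially improves performance.
Details
Motivation: Automated question answering (QA) over clinical EHRs requires precise evidence retrieval and faithful answer generation, but standard prompting performs poorly when supervision is limited. The paper therefore proposes an efficient solution based on prompt optimization. Contribution: 1) decouples the task into evidence identification and answer generation; 2) uses DSPy's MIPROv2 optimizer to explore the prompt space automatically; 3) improves evidence recall through a self-consistency voting scheme.
Method: The method has two stages: 1) sentence-level evidence identification; 2) answer synthesis with explicit citations. For each stage, the MIPROv2 optimizer jointly tunes instructions and few-shot demonstrations on the development set, and self-consistency voting further improves evidence recall.
Result: On the hidden test set, the method scores 51.5 and places second, outperforming zero-shot and few-shot prompting by over 20 and 10 points, respectively.
Insight: Data-driven prompt optimization is a cost-effective alternative to fine-tuning, particularly in high-stakes domains such as clinical QA, and improves the reliability of AI assistants.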
Abstract: Automated question answering (QA) over electronic health records (EHRs) can
bridge critical information gaps for clinicians and patients, yet it demands
both precise evidence retrieval and faithful answer generation under limited
supervision. In this work, we present Neural, the runner-up in the BioNLP 2025
ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method
decouples the task into (1) sentence-level evidence identification and (2)
answer synthesis with explicit citations. For each stage, we automatically
explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning
instructions and few-shot demonstrations on the development set. A
self-consistency voting scheme further improves evidence recall without
sacrificing precision. On the hidden test set, our method attains an overall
score of 51.5, placing second while outperforming standard zero-shot and
few-shot prompting by over 20 and 10 points, respectively. These results
indicate that data-driven prompt optimization is a cost-effective alternative
to model fine-tuning for high-stakes clinical QA, advancing the reliability of
AI assistants in healthcare.
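The abstract does not spell out how the self-consistency voting is carried out. As a rough sketch, assuming that the evidence-identification prompt is sampled several times and that a sentence is kept when enough runs select it, the vote could look as follows; `vote_evidence` and `min_fraction` are hypothetical names chosen for illustration.

```python
from collections import Counter

def vote_evidence(runs, min_fraction=0.5):
    """Keep sentence ids chosen by at least min_fraction of the sampled runs.

    Assumption: each run is a list of sentence ids that one sampled
    model response marked as evidence.
    """
    counts = Counter(sid for run in runs for sid in set(run))
    needed = min_fraction * len(runs)
    return sorted(sid for sid, count in counts.items() if count >= needed)

# Four sampled responses over the same clinical note.
runs = [[1, 3, 4], [1, 4], [1, 2, 4], [4, 5]]
selected = vote_evidence(runs, min_fraction=0.5)  # -> [1, 4]
```

Lowering `min_fraction` trades precision for recall, which is the knob such a scheme would use to raise evidence recall without admitting sentences picked by only one stray sample.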
[129] Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Adam Karvonen,Samuel Marks
Main category: cs.LG
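TL;DR: The paper proposes an interpretability-based technique for robustly improving LLM fairness in realistic settings. It finds that simple anti-bias prompts fail once realistic context is introduced and proposes an internal bias mitigation strategy instead.
Details
Motivation: Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, yet the biases they exhibit under realistic, detailed context are understudied, which calls for an effective internal intervention. Contribution: The paper shows that realistic context can induce significant bias in LLMs and proposes neutralizing sensitive-attribute directions in model activations, which achieves robust bias reduction across all tested scenarios.
Method: The method identifies directions in model activations that correlate with race and gender and applies affine concept editing at inference time. Directions derived from a simple synthetic dataset still generalize to realistic settings.
Result: The intervention reduces bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. The biases themselves were observed across leading commercial and open-source models.
Insight: Realistic context can induce biases that controlled evaluations miss, which underlines the value of internal interventions and of more realistic evaluation when deploying LLMs for hiring.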
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes hiring
applications, making decisions that directly impact people’s careers and
livelihoods. While prior studies suggest simple anti-bias prompts can eliminate
demographic biases in controlled evaluations, we find these mitigations fail
when realistic contextual details are introduced. We address these failures
through internal bias mitigation: by identifying and neutralizing sensitive
attribute directions within model activations, we achieve robust bias reduction
across all tested scenarios. Across leading commercial (GPT-4o, Claude 4
Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3,
Mistral-24B), we find that adding realistic context such as company names,
culture descriptions from public careers pages, and selective hiring
constraints (e.g., "only accept candidates in the top 10%") induces
significant racial and gender biases (up to 12% differences in interview
rates). When these biases emerge, they consistently favor Black over White
candidates and female over male candidates across all tested models and
scenarios. Moreover, models can infer demographics and become biased from
subtle cues like college affiliations, with these biases remaining invisible
even when inspecting the model’s chain-of-thought reasoning. To address these
limitations, our internal bias mitigation identifies race and gender-correlated
directions and applies affine concept editing at inference time. Despite using
directions from a simple synthetic dataset, the intervention generalizes
robustly, consistently reducing bias to very low levels (typically under 1%,
always below 2.5%) while largely maintaining model performance. Our findings
suggest that practitioners deploying LLMs for hiring should adopt more
realistic evaluation methodologies and consider internal mitigation strategies
for equitable outcomes.
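To make the inference-time edit concrete, here is a minimal sketch on toy 2-D "activations". It assumes the attribute direction is estimated as a normalized difference of group means and that the edit sets each activation's projection onto that direction to a fixed reference value; both choices are illustrative assumptions, and the helper names are hypothetical rather than the authors' code.

```python
import math

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attribute_direction(group_a, group_b):
    """Unit vector along the difference of group means (assumed estimator)."""
    diff = [a - b for a, b in zip(mean(group_a), mean(group_b))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def affine_edit(h, direction, reference):
    """Move h along direction so its projection equals the reference value."""
    shift = dot(h, direction) - reference
    return [x - shift * d for x, d in zip(h, direction)]

# Toy activations for two demographic groups (synthetic numbers).
group_a = [[2.0, 1.0], [2.2, 0.8]]
group_b = [[0.0, 1.1], [0.2, 0.9]]
d = attribute_direction(group_a, group_b)
reference = dot(mean(group_a + group_b), d)
edited = [affine_edit(h, d, reference) for h in group_a + group_b]
```

After the edit every activation has the same component along `d`, so a downstream readout can no longer separate the groups along that direction, while components orthogonal to `d` are left unchanged.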
[130] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models
Evelyn Ma,Duo Zhou,Peizhi Niu,Huiting Zhou,Huan Zhang,Olgica Milenkovic,S. Rasoul Etesami
Main category: cs.LG
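TL;DR: GUARD is a lightweight data attribution framework for unlearning in large language models (LLMs). It uses adaptive weight allocation to counter unintended forgetting and substantially improves how much information the model retains after unlearning.
Details
Motivation: Unlearning in LLMs is becoming increasingly important because of regulatory compliance, copyright protection, and privacy concerns. Existing methods, however, are prone to unintended forgetting when removing high-impact data, which degrades model utility. GUARD addresses the problem at the data level. Contribution: 1. a lightweight proxy data attribution metric tailored to LLM unlearning; 2. an unlearning objective with adaptive, nonuniform weights; 3. theoretical guarantees and experiments showing that GUARD substantially improves retention while keeping forgetting effective.
Method: At its core, GUARD introduces a lightweight proxy data attribution metric that quantifies the "alignment" between the forget and retain sets. Building on this, it defines an unlearning objective that assigns each sample an adaptive, nonuniform weight inversely proportional to its proxy attribution score, which reduces unintended losses in retention.
Result: On the TOFU benchmark, when forgetting 10% of the training data, GUARD reduces the utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio, while keeping forgetting metrics comparable to prior methods.
Insight: Data-level factors matter for LLM unlearning: quantifying data attribution and reallocating unlearning weights accordingly substantially improves utility preservation.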
Abstract: Unlearning in large language models (LLMs) is becoming increasingly important
due to regulatory compliance, copyright protection, and privacy concerns.
However, a key challenge in LLM unlearning is unintended forgetting, where the
removal of specific data inadvertently impairs the utility of the model and its
retention of valuable, desired information. While prior work has primarily
focused on architectural innovations, the influence of data-level factors on
unlearning performance remains underexplored. As a result, existing methods
often suffer from degraded retention when forgetting high-impact data. To
address this, we propose GUARD, a novel framework for Guided Unlearning And
Retention via Data attribution. At its core, GUARD introduces a lightweight
proxy data attribution metric tailored for LLM unlearning, which quantifies the
“alignment” between the forget and retain sets while remaining computationally
efficient. Building on this, we design a novel unlearning objective that
assigns adaptive, nonuniform unlearning weights to samples, inversely
proportional to their proxy attribution scores. Through such a reallocation of
unlearning power, GUARD mitigates unintended losses in retention. We provide
rigorous theoretical guarantees that GUARD significantly enhances retention
while maintaining forgetting metrics comparable to prior methods. Extensive
experiments on the TOFU benchmark across multiple LLM architectures demonstrate
that GUARD substantially improves utility preservation while ensuring effective
unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to
194.92% in terms of Truth Ratio when forgetting 10% of the training data.
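The abstract states only that the unlearning weights are inversely proportional to the proxy attribution scores. A minimal sketch of that reweighting, assuming positive scores and a normalization that keeps the average weight at 1, is shown below; the function names and the `eps` guard are hypothetical, and the exact functional form used by GUARD may differ.

```python
def unlearning_weights(scores, eps=1e-8):
    """Weights proportional to 1 / score, rescaled so they average to 1.

    Assumption: scores are positive proxy attribution values, where a
    high score means the forget sample is strongly aligned with the
    retain set and should therefore be unlearned more gently.
    """
    inverse = [1.0 / (s + eps) for s in scores]
    total = sum(inverse)
    return [len(scores) * x / total for x in inverse]

def weighted_forget_loss(per_sample_losses, weights):
    """Weighted mean of per-sample forget losses; how it is optimized
    (ascent or descent) depends on the base unlearning method."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)

scores = [0.1, 0.5, 2.0]
weights = unlearning_weights(scores)
loss = weighted_forget_loss([1.0, 1.0, 1.0], weights)
```

With uniform scores the weights collapse to 1 for every sample, so the sketch reduces to an unweighted objective in that case.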
[131] Build the web for agents, not agents for the web
Xing Han Lù,Gaurav Kamath,Marius Mosbach,Siva Reddy
Main category: cs.LG
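TL;DR: This position paper argues for a paradigm shift: build web interfaces designed specifically for AI agents (Agentic Web Interfaces, AWI) rather than forcing agents to adapt to interfaces designed for humans, with the goal of more efficient and reliable agents.
Details
Motivation: Current AI agents face major challenges on web tasks because web interfaces are designed for humans rather than optimized for agents. This mismatch limits agent capability and efficiency. Contribution: The paper introduces the concept of an Agentic Web Interface (AWI) and establishes six guiding principles for its design, emphasizing safety, efficiency, and standardization.
Method: By proposing the AWI concept, the paper advocates redesigning the web interaction paradigm around agent capabilities instead of relying on existing human-oriented interfaces.
Result: As a position paper, its outcome is a proposal: AWI offers a new direction for web agent research that aims to overcome the limitations of existing interfaces.
Insight: The paper frames agent-friendly interface design as a collaborative effort that involves the broader ML community and pushes toward standardization.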
Abstract: Recent advancements in Large Language Models (LLMs) and multimodal
counterparts have spurred significant interest in developing web agents – AI
systems capable of autonomously navigating and completing tasks within web
environments. While holding tremendous promise for automating complex web
interactions, current approaches face substantial challenges due to the
fundamental mismatch between human-designed interfaces and LLM capabilities.
Current methods struggle with the inherent complexity of web inputs, whether
processing massive DOM trees, relying on screenshots augmented with additional
information, or bypassing the user interface entirely through API interactions.
This position paper advocates for a paradigm shift in web agent research:
rather than forcing web agents to adapt to interfaces designed for humans, we
should develop a new interaction paradigm specifically optimized for agentic
capabilities. To this end, we introduce the concept of an Agentic Web Interface
(AWI), an interface specifically designed for agents to navigate a website. We
establish six guiding principles for AWI design, emphasizing safety,
efficiency, and standardization, to account for the interests of all primary
stakeholders. This reframing aims to overcome fundamental limitations of
existing interfaces, paving the way for more efficient, reliable, and
transparent web agent design, which will be a collaborative effort involving
the broader ML community.
[132] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems
Aayush Karan,Kulin Shah,Sitan Chen
Main category: cs.LG
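TL;DR: ReGuidance is a simple wrapper for diffusion-based solvers that boosts sample quality on hard inverse problems. It runs the unconditional probability flow ODE in reverse from a candidate solution and uses the resulting latent to reinitialize DPS, which substantially improves sample realism and reward.
Details
Motivation: Existing methods such as DPS and its variants tend to veer off the data manifold on hard inverse problems with low signal-to-noise ratio, which produces unrealistic outputs. ReGuidance targets this failure mode. Contribution: The paper proposes the ReGuidance wrapper, which inverts the ODE and reinitializes DPS to improve both sample realism and reward, and provides what the authors describe as the first rigorous algorithmic guarantee for DPS.
Method: Starting from a candidate solution $\hat{x}$, run the unconditional probability flow ODE in reverse to obtain a latent, then use that latent as the initialization for DPS.
Result: On hard inverse problems such as large-box in-painting and super-resolution with high upscaling, applying ReGuidance on top of state-of-the-art baselines significantly improves sample quality and measurement consistency.
Insight: The accompanying theory proves that, on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold.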
Abstract: There has been a flurry of activity around using pretrained diffusion models
as informed data priors for solving inverse problems, and more generally around
steering these models using reward models. Training-free methods like diffusion
posterior sampling (DPS) and its many variants have offered flexible heuristic
algorithms for these tasks, but when the reward is not informative enough,
e.g., in hard inverse problems with low signal-to-noise ratio, these techniques
veer off the data manifold, failing to produce realistic outputs. In this work,
we devise a simple wrapper, ReGuidance, for boosting both the sample realism
and reward achieved by these methods. Given a candidate solution $\hat{x}$
produced by an algorithm of the user’s choice, we propose inverting the
solution by running the unconditional probability flow ODE in reverse starting
from $\hat{x}$, and then using the resulting latent as an initialization for
DPS. We evaluate our wrapper on hard inverse problems like large box
in-painting and super-resolution with high upscaling. Whereas state-of-the-art
baselines visibly fail, we find that applying our wrapper on top of these
baselines significantly boosts sample quality and measurement consistency. We
complement these findings with theory proving that on certain multimodal data
distributions, ReGuidance simultaneously boosts the reward and brings the
candidate solution closer to the data manifold. To our knowledge, this
constitutes the first rigorous algorithmic guarantee for DPS.
cs.RO [Back]
[133] A Navigation Framework Utilizing Vision-Language Models
Yicheng Duan,Kaiyu tang
Main category: cs.RO
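TL;DR: The paper proposes a modular navigation framework that decouples vision-language understanding from action planning, pairing a frozen vision-language model with lightweight planning logic to aim for efficient and flexible navigation. Initial results reveal difficulties in unseen environments but point to directions for improvement.
Details
Motivation: Vision-and-Language Navigation (VLN) is a key problem in embodied AI. Large vision-language models such as CLIP and Flamingo have improved multimodal understanding, but their computational cost and real-time deployment remain open problems. The paper addresses these through a modular design. Contribution: 1. a modular navigation framework that decouples vision-language understanding from action planning; 2. a frozen vision-language model combined with lightweight planning logic for efficiency; 3. prompt engineering, structured history management, and a two-frame input strategy to improve decision-making continuity.
Method: The framework keeps the vision-language model Qwen2.5-VL-7B-Instruct frozen and combines it with lightweight planning logic and structured history management, using two-frame visual input and prompt engineering to guide navigation decisions.
Result: On the Room-to-Room benchmark in the VLN-CE setting, the system struggles to generalize to unseen environments under strict evaluation, but the modular approach lays a foundation for future improvements such as enhanced environmental priors and expanded multimodal input.
Insight: 1. The modular design is intended to avoid extensive model fine-tuning and reduce computational cost; 2. the two-frame input strategy is intended to improve decision-making continuity across steps; 3. stronger environmental priors and broader multimodal input integration are promising next steps.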
Abstract: Vision-and-Language Navigation (VLN) presents a complex challenge in embodied
AI, requiring agents to interpret natural language instructions and navigate
through visually rich, unfamiliar environments. Recent advances in large
vision-language models (LVLMs), such as CLIP and Flamingo, have significantly
improved multimodal understanding but introduced new challenges related to
computational cost and real-time deployment. In this project, we propose a
modular, plug-and-play navigation framework that decouples vision-language
understanding from action planning. By integrating a frozen vision-language
model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to
achieve flexible, fast, and adaptable navigation without extensive model
fine-tuning. Our framework leverages prompt engineering, structured history
management, and a two-frame visual input strategy to enhance decision-making
continuity across navigation steps. We evaluate our system on the Room-to-Room
benchmark within the VLN-CE setting using the Matterport3D dataset and
Habitat-Lab simulation environment. Although our initial results reveal
challenges in generalizing to unseen environments under strict evaluation
settings, our modular approach lays a foundation for scalable and efficient
navigation systems, highlighting promising directions for future improvement
through enhanced environmental priors and expanded multimodal input
integration.
[134] EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence
Wang Xinjie,Liu Liu,Cao Yu,Wu Ruiqi,Qin Wenkang,Wang Dehui,Sui Wei,Su Zhizhong
Main category: cs.RO
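TL;DR: EmbodiedGen is a generative 3D world engine platform that produces high-quality, controllable, photorealistic 3D assets to support training and evaluation of embodied intelligence tasks, improving data diversity and scalability.
Details
Motivation: Current embodied intelligence tasks rely on manually created and annotated traditional 3D assets, which are costly to produce and limited in realism, and this hinders the scalability of data-driven approaches. Contribution: The paper presents the EmbodiedGen platform, which offers six modular generation tools (such as Image-to-3D and Text-to-3D) for low-cost generation of high-quality 3D assets with accurate physical properties.
Method: The platform uses generative AI to produce controllable, photorealistic 3D assets and exports them in the Unified Robotics Description Format (URDF) for compatibility with physics simulation engines.
Result: The generated 3D assets can be imported directly into physics simulation engines and support training and evaluation of embodied intelligence tasks.
Insight: Generative AI can help address the scarcity and limited diversity of 3D data while improving the realism and interactivity of simulated environments.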
Abstract: Constructing a physically realistic and accurately scaled simulated 3D world
is crucial for the training and evaluation of embodied intelligence tasks. The
diversity, realism, low-cost accessibility, and affordability of 3D data assets
are critical for achieving generalization and scalability in embodied AI.
However, most current embodied intelligence tasks still rely heavily on
traditional 3D computer graphics assets manually created and annotated, which
suffer from high production costs and limited realism. These limitations
significantly hinder the scalability of data-driven approaches. We present
EmbodiedGen, a foundational platform for interactive 3D world generation. It
enables the scalable generation of high-quality, controllable and
photorealistic 3D assets with accurate physical properties and real-world scale
in the Unified Robotics Description Format (URDF) at low cost. These assets can
be directly imported into various physics simulation engines for fine-grained
physical control, supporting downstream tasks in training and evaluation.
EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key
modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object
Generation, Scene Generation and Layout Generation. EmbodiedGen generates
diverse and interactive 3D worlds composed of generative 3D assets, leveraging
generative AI to address the generalization and evaluation challenges of
embodied intelligence research. Code is available at
https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.
[135] Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop
Justin Kerr,Kush Hari,Ethan Weber,Chung Min Kim,Brent Yi,Tyler Bonnen,Ken Goldberg,Angjoo Kanazawa
Main category: cs.RO
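TL;DR: EyeRobot is a robotic system that combines behavior cloning (BC) and reinforcement learning (RL). By training the hand and the eye jointly, it obtains task-driven active gaze behavior.
Details
Motivation: Humans actively look at their surroundings in order to act, whereas conventional robot systems lack this kind of active visual perception. The authors train a robot to control its gaze so that manipulation tasks become easier to complete. Contribution: 1. a BC-RL perception-action loop; 2. a mechanical eyeball and a foveal-inspired policy architecture; 3. an evaluation of the resulting hand-eye coordination on five panoramic workspace manipulation tasks.
Method: 1. Collect teleoperated demonstrations paired with a 360 camera; 2. import the data into a simulation environment that renders arbitrary eyeball viewpoints, and train the hand (BC) and eye (RL) policies jointly, rewarding the eye when the hand predicts correct actions; 3. use a foveal-inspired policy architecture that provides high resolution with a small compute budget.
Result: On five panoramic workspace manipulation tasks, EyeRobot exhibits hand-eye coordination that facilitates manipulation over large workspaces with a single camera; the foveal architecture also leads to more stable fixation and a better ability to track objects and ignore distractors.
Insight: Active gaze control can help a robot manipulate over large workspaces, and a foveal design keeps the compute budget small while preserving high resolution where it matters.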
Abstract: Humans do not passively observe the visual world – we actively look in order
to act. Motivated by this principle, we introduce EyeRobot, a robotic system
with gaze behavior that emerges from the need to complete real-world tasks. We
develop a mechanical eyeball that can freely rotate to observe its surroundings
and train a gaze policy to control it using reinforcement learning. We
accomplish this by first collecting teleoperated demonstrations paired with a
360 camera. This data is imported into a simulation environment that supports
rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze
on top of robot demonstrations. We then introduce a BC-RL loop to train the
hand and eye jointly: the hand (BC) agent is trained from rendered eye
observations, and the eye (RL) agent is rewarded when the hand produces correct
action predictions. In this way, hand-eye coordination emerges as the eye looks
towards regions which allow the hand to complete the task. EyeRobot implements
a foveal-inspired policy architecture allowing high resolution with a small
compute budget, which we find also leads to the emergence of more stable
fixation as well as improved ability to track objects and ignore distractors.
We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring
manipulation in an arc surrounding the robot arm. Our experiments suggest
EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate
manipulation over large workspaces with a single camera. See project site for
videos: https://www.eyerobot.net/
cs.AI [Back]
[136] One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence
Michelle M. Li,Ben Y. Reis,Adam Rodman,Tianxi Cai,Noa Dagan,Ran D. Balicer,Joseph Loscalzo,Isaac S. Kohane,Marinka Zitnik
Main category: cs.AI
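TL;DR: This Perspective outlines a vision for context-switching medical AI, which targets the limited ability of current medical foundation models to adapt dynamically to new settings and thereby aims to reduce errors caused by overlooked contextual information.
Details
Motivation: Adapting current medical AI models to new populations, specialties, or settings requires fine-tuning or careful prompting, so the models cannot adjust dynamically to different contexts and may overlook critical information at prediction time, which leads to contextual errors. Contribution: The paper outlines a vision for context-switching medical AI, in which models adapt dynamically to new specialties, populations, workflows, and clinical roles without retraining.
Method: Rather than a concrete algorithm, the Perspective describes context-switching models that adjust their reasoning dynamically without being retrained for each new setting.
Result: The authors envision context-switching AI that can diagnose, manage, and treat a wide range of diseases across specialties and regions, and expand access to medical care.
Insight: The ability to adjust to context dynamically is key to making medical AI practical and generalizable.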
Abstract: Medical foundation models, including language models trained on clinical
notes, vision-language models on medical images, and multimodal models on
electronic health records, can summarize clinical notes, answer medical
questions, and assist in decision-making. Adapting these models to new
populations, specialties, or settings typically requires fine-tuning, careful
prompting, or retrieval from knowledge bases. This can be impractical, and
limits their ability to interpret unfamiliar inputs and adjust to clinical
situations not represented during training. As a result, models are prone to
contextual errors, where predictions appear reasonable but fail to account for
critical patient-specific or contextual information. These errors stem from a
fundamental limitation that current models struggle with: dynamically adjusting
their behavior across evolving contexts of medical care. In this Perspective,
we outline a vision for context-switching in medical AI: models that
dynamically adapt their reasoning without retraining to new specialties,
populations, workflows, and clinical roles. We envision context-switching AI to
diagnose, manage, and treat a wide range of diseases across specialties and
regions, and expand access to medical care.
[137] Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Yuhao Zhou,Yiheng Wang,Xuming He,Ruoyao Xiao,Zhiwei Li,Qiantai Feng,Zijie Guo,Yuejin Yang,Hao Wu,Wenxuan Huang,Jiaqi Wei,Dan Si,Xiuqi Yao,Jia Bu,Haiwen Huang,Tianfan Fu,Shixiang Tang,Ben Fei,Dongzhan Zhou,Fenghua Ling,Yan Lu,Siqi Sun,Chenhui Li,Guanjie Zheng,Jiancheng Lv,Wenlong Zhang,Lei Bai
Main category: cs.AI
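TL;DR: The paper introduces the Scientists’ First Exam (SFE) benchmark for evaluating the perception, understanding, and reasoning abilities of Multimodal Large Language Models (MLLMs) in scientific domains. With 830 expert-verified VQA pairs spanning 66 multimodal tasks, SFE reveals shortcomings in the scientific cognitive abilities of current state-of-the-art models such as GPT-o3 and InternVL-3.
Details
Motivation: Current scientific benchmarks mostly evaluate the knowledge understanding of MLLMs and neglect their perception and reasoning abilities. SFE is designed to assess the scientific cognitive capacities of MLLMs more completely. Contribution: The paper presents the SFE benchmark, which covers three levels (scientific signal perception, scientific attribute understanding, and scientific comparative reasoning) and complements existing benchmarks.
Method: The benchmark comprises 830 expert-verified VQA pairs covering 66 tasks across five high-value disciplines and evaluates MLLMs at the three levels above.
Result: Experiments show that GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, which indicates considerable room for MLLMs to improve in scientific domains.
Insight: SFE offers a new evaluation standard for AI in science, highlights the importance of multimodal perception and reasoning, and may help advance AI-enhanced scientific discovery.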
Abstract: Scientific discoveries increasingly rely on complex multimodal reasoning
based on information-intensive scientific data and domain-specific expertise.
Empowered by expert-level scientific benchmarks, scientific Multimodal Large
Language Models (MLLMs) hold the potential to significantly enhance this
discovery process in realistic workflows. However, current scientific
benchmarks mostly focus on evaluating the knowledge understanding capabilities
of MLLMs, leading to an inadequate assessment of their perception and reasoning
abilities. To address this gap, we present the Scientists’ First Exam (SFE)
benchmark, designed to evaluate the scientific cognitive capacities of MLLMs
through three interconnected levels: scientific signal perception, scientific
attribute understanding, and scientific comparative reasoning. Specifically, SFE
comprises 830 expert-verified VQA pairs across three question types, spanning
66 multimodal tasks across five high-value disciplines. Extensive experiments
reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08%
and 26.52% on SFE, highlighting significant room for MLLMs to improve in
scientific realms. We hope the insights obtained in SFE will facilitate further
developments in AI-enhanced scientific discoveries.
[138] TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving
Vincenzo Colle,Mohamed Sana,Nicola Piovesan,Antonio De Domenico,Fadhel Ayed,Merouane Debbah
Main category: cs.AI
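TL;DR: TeleMath is the first benchmark dataset specifically designed to evaluate large language models on mathematical problem solving in the telecommunications domain. It contains 500 question-answer pairs covering a wide range of topics. The evaluation finds that models explicitly designed for mathematical or logical reasoning perform best, while general-purpose models often struggle even when they have many parameters.
Details
Motivation: AI adoption in telecommunications is growing, but the ability of large language models to handle domain-specific, mathematically intensive tasks, for example in signal processing and network optimization, remains largely unexplored. Contribution: The paper introduces the TeleMath benchmark dataset, which fills the gap in evaluating telecom mathematical problem solving, and releases the dataset and evaluation code to support further research.
Method: Starting from seed problems crafted by subject matter experts, a generation pipeline produces 500 question-answer pairs, which are then used to evaluate a wide range of open-source large language models.
Result: Models explicitly designed for mathematical or logical reasoning perform best; general-purpose models perform worse.
Insight: Reasoning-oriented models outperform general-purpose ones on these mathematically intensive tasks, which suggests that specialization for reasoning matters more than parameter count alone.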
Abstract: The increasing adoption of artificial intelligence in telecommunications has
raised interest in the capability of Large Language Models (LLMs) to address
domain-specific, mathematically intensive tasks. Although recent advancements
have improved the performance of LLMs in general mathematical reasoning, their
effectiveness within specialized domains, such as signal processing, network
optimization, and performance analysis, remains largely unexplored. To address
this gap, we introduce TeleMath, the first benchmark dataset specifically
designed to evaluate LLM performance in solving mathematical problems with
numerical solutions in the telecommunications domain. Comprising 500
question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the
telecommunications field. This paper outlines the proposed QnAs generation
pipeline, starting from a selected seed of problems crafted by Subject Matter
Experts. The evaluation of a wide range of open-source LLMs reveals that best
performance on TeleMath is achieved by recent models explicitly designed for
mathematical or logical reasoning. In contrast, general-purpose models, even
those with a large number of parameters, often struggle with these challenges.
We have released the dataset and the evaluation code to ease result
reproducibility and support future research.
[139] Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
Fei Lin,Ziyang Gong,Cong Wang,Yonglin Tian,Tengchao Zhang,Xue Yang,Gen Luo,Fei-Yue Wang
Main category: cs.AI
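TL;DR: The paper introduces ToxiMol, the first benchmark task for molecular toxicity repair aimed at general-purpose Multimodal Large Language Models (MLLMs), together with an automated evaluation framework, ToxiEval. Experiments show that current MLLMs still face significant challenges on the task but begin to show promise in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.
Details
Motivation: Toxicity is a leading cause of early-stage drug development failure, yet the molecular toxicity repair task has lacked a systematic definition and benchmark. The study fills this gap and assesses how well general-purpose MLLMs handle the task. Contribution: 1. the ToxiMol benchmark task for molecular toxicity repair; 2. the automated evaluation framework ToxiEval; 3. a systematic evaluation of nearly 30 mainstream general-purpose MLLMs on the task.
Method: 1. Build a standardized dataset covering 11 primary tasks and 560 representative toxic molecules; 2. design a prompt annotation pipeline informed by expert toxicological knowledge; 3. use ToxiEval to integrate toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into one evaluation chain.
Result: Current MLLMs still face significant challenges on molecular toxicity repair but show promise in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.
Insight: MLLMs show potential for molecular toxicity repair; the ablation studies point to evaluation criteria, candidate diversity, and failure attribution as the key factors for further work.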
Abstract: Toxicity remains a leading cause of early-stage drug development failure.
Despite advances in molecular design and property prediction, the task of
molecular toxicity repair - generating structurally valid molecular
alternatives with reduced toxicity - has not yet been systematically defined or
benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task
for general-purpose Multimodal Large Language Models (MLLMs) focused on
molecular toxicity repair. We construct a standardized dataset covering 11
primary tasks and 560 representative toxic molecules spanning diverse
mechanisms and granularities. We design a prompt annotation pipeline with
mechanism-aware and task-adaptive capabilities, informed by expert
toxicological knowledge. In parallel, we propose an automated evaluation
framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic
accessibility, drug-likeness, and structural similarity into a high-throughput
evaluation chain for repair success. We systematically assess nearly 30
mainstream general-purpose MLLMs and design multiple ablation studies to
analyze key factors such as evaluation criteria, candidate diversity, and
failure attribution. Experimental results show that although current MLLMs
still face significant challenges on this task, they begin to demonstrate
promising capabilities in toxicity understanding, semantic constraint
adherence, and structure-aware molecule editing.
cs.CR [Back]
[140] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models
Zilong Wang,Xiang Zheng,Xiaosen Wang,Bo Wang,Xingjun Ma,Yu-Gang Jiang
Main category: cs.CR
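TL;DR: GenBreak fine-tunes a red-team large language model (LLM) to generate adversarial prompts that systematically probe safety vulnerabilities in text-to-image (T2I) models. The prompts both evade safety mechanisms and lead to highly toxic images.
Details
Motivation: Safety problems in T2I models are increasingly prominent. Existing red-teaming methods either produce toxic images with prompts that filters block easily, or bypass safety mechanisms without producing genuinely harmful outputs, so reliable tools for evaluating defended T2I models are lacking. Contribution: The paper proposes the GenBreak framework, which combines supervised fine-tuning and reinforcement learning to generate adversarial prompts that are both evasive and highly toxic, revealing safety risks in commercial T2I generators.
Method: Supervised fine-tuning on curated datasets is followed by reinforcement learning through interaction with a surrogate T2I model. Multiple reward signals guide the LLM toward prompts with stronger evasion and image toxicity while maintaining semantic coherence and diversity.
Result: The generated adversarial prompts are highly effective in black-box attacks and expose practical safety weaknesses in commercial T2I generators.
Insight: GenBreak demonstrates the potential of LLM-driven red teaming for safety evaluation and can inform the safety design of future T2I models.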
Abstract: Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and
are now widely used in content creation. However, these models can be misused
to generate harmful content, including nudity or violence, posing significant
safety risks. While most platforms employ content moderation systems,
underlying vulnerabilities can still be exploited by determined adversaries.
Recent research on red-teaming and adversarial attacks against T2I models has
notable limitations: some studies successfully generate highly toxic images but
use adversarial prompts that are easily detected and blocked by safety filters,
while others focus on bypassing safety mechanisms but fail to produce genuinely
harmful outputs, neglecting the discovery of truly high-risk prompts.
Consequently, there remains a lack of reliable tools for evaluating the safety
of defended T2I models. To address this gap, we propose GenBreak, a framework
that fine-tunes a red-team large language model (LLM) to systematically explore
underlying vulnerabilities in T2I generators. Our approach combines supervised
fine-tuning on curated datasets with reinforcement learning via interaction
with a surrogate T2I model. By integrating multiple reward signals, we guide
the LLM to craft adversarial prompts that enhance both evasion capability and
image toxicity, while maintaining semantic coherence and diversity. These
prompts demonstrate strong effectiveness in black-box attacks against
commercial T2I generators, revealing practical and concerning safety
weaknesses.
[141] Secure Data Access in Cloud Environments Using Quantum Cryptography
S. Vasavi Venkata Lakshmi,Ziaul Haque Choudhury
Main category: cs.CR
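TL;DR: The paper proposes securing data in cloud environments with quantum cryptography, specifically the BB84 protocol and the Quantum One-Time Pad (QOTP), to counter the threat posed by future quantum computers.
Details
Motivation: Cloud computing makes data storage and access convenient, but traditional encryption methods may not remain secure once powerful quantum computers are available. Quantum cryptography offers a way to address this. Contribution: The core contribution is a quantum encryption scheme for secure data access in the cloud that combines Quantum Key Distribution (QKD) with the Quantum One-Time Pad (QOTP).
Method: The scheme uses the BB84 protocol for quantum key distribution and QOTP to encrypt and decrypt cloud data, protecting data both in transit and at rest.
Result: The study argues that the approach resists attacks from quantum computers and provides strong protection for cloud data in a future quantum computing environment.
Insight: Quantum cryptography is an important direction for future data security, especially in cloud computing, where combining QKD and QOTP could offer long-term protection.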
Abstract: Cloud computing has made storing and accessing data easier but keeping it
secure is a big challenge nowadays. Traditional methods of ensuring data may
not be strong enough in the future when powerful quantum computers become
available. To solve this problem, this study uses quantum cryptography to
protect data in the cloud environment. Quantum Key Distribution (QKD) creates
secure keys by sending information using quantum particles like photons.
Specifically, we use the BB84 protocol, a simple and reliable way to make
secure keys that cannot be stolen without detection. To protect the data, we
use the Quantum One-Time Pad (QOTP) for encryption and decryption, ensuring the
data stays completely private. This study shows how these quantum methods can
be applied in cloud systems to provide a strong defense against hackers, even
if they have access to quantum computers. The combination of QKD, BB84, and
QOTP creates a safe and reliable way to keep data secure when it is stored or
shared in the cloud. Using quantum cryptography, this paper provides a way to
ensure data security now and in the future, making cloud computing a safer
place for everyone to store their data.
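The key-agreement step can be illustrated with a small classical simulation of BB84 basis sifting. This is a sketch under simplifying assumptions: there is no eavesdropper and no channel noise, a seeded pseudo-random generator stands in for quantum randomness, and a classical XOR one-time pad replaces the quantum one-time pad named in the abstract. All function names are hypothetical, and the code is not secure for real use.

```python
import random

def bb84_sift(n_qubits, rng):
    """Simulate BB84 without noise or eavesdropping and return both sifted keys."""
    alice_bits = [rng.randint(0, 1) for _ in range(n_qubits)]
    alice_bases = [rng.choice("+x") for _ in range(n_qubits)]
    bob_bases = [rng.choice("+x") for _ in range(n_qubits)]
    # Matching basis: Bob reads Alice's bit. Mismatched basis: random outcome.
    bob_bits = [
        bit if a_basis == b_basis else rng.randint(0, 1)
        for bit, a_basis, b_basis in zip(alice_bits, alice_bases, bob_bases)
    ]
    # Bases are compared publicly; only positions with equal bases are kept.
    keep = [i for i in range(n_qubits) if alice_bases[i] == bob_bases[i]]
    return [alice_bits[i] for i in keep], [bob_bits[i] for i in keep]

def xor_bits(message_bits, key_bits):
    """Classical one-time pad (stand-in for QOTP): XOR each bit with a key bit."""
    if len(key_bits) < len(message_bits):
        raise ValueError("one-time pad key must be at least as long as the message")
    return [m ^ k for m, k in zip(message_bits, key_bits)]

rng = random.Random(0)  # seeded only to make the demo repeatable
alice_key, bob_key = bb84_sift(256, rng)
message = [1, 0, 1, 1, 0, 0, 1, 0]
cipher = xor_bits(message, alice_key)
recovered = xor_bits(cipher, bob_key)
```

In a real run, Alice and Bob would also sacrifice part of the sifted key to estimate the error rate, since an eavesdropper measuring in the wrong basis introduces detectable errors; that check is omitted here.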
eess.IV [Back]
[142] Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective
Minye Shao,Zeyu Wang,Haoran Duan,Yawen Huang,Bing Zhai,Shizheng Wang,Yang Long,Yefeng Zheng
Main category: eess.IV
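TL;DR: The paper proposes HFF-Net, a brain tumor segmentation method built on a frequency-domain perspective. Frequency-domain decomposition and an adaptive Laplacian convolution module markedly improve segmentation of contrast-enhancing tumor regions.
Details
Motivation: Current brain tumor segmentation methods underperform on contrast-enhancing regions, largely because they give insufficient consideration to MRI-specific features such as complex textures and directional variations. Contribution: The paper proposes HFF-Net, which comprises a Frequency Domain Decomposition (FDD) module, an Adaptive Laplacian Convolution (ALC) module, and a Frequency Domain Cross-Attention (FDCA) module, and improves segmentation accuracy.
Method: FDD separates MRI images into low- and high-frequency components, ALC uses dynamically updated convolution kernels to emphasize high-frequency details, and FDCA fuses features across multiple scales.
Result: On four public datasets, HFF-Net achieves an average relative improvement of 4.48% in mean Dice score across the three major tumor subregions and 7.33% on contrast-enhancing regions.
Insight: Frequency-domain analysis captures the texture and directional features of MRI images more effectively, and dynamically updated convolution kernels increase sensitivity to boundary details.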
Abstract: Precise segmentation of brain tumors, particularly contrast-enhancing regions
visible in post-contrast MRI (areas highlighted by contrast agent injection),
is crucial for accurate clinical diagnosis and treatment planning but remains
challenging. However, current methods exhibit notable performance degradation
in segmenting these enhancing brain tumor areas, largely due to insufficient
consideration of MRI-specific tumor features such as complex textures and
directional variations. To address this, we propose the Harmonized Frequency
Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a
frequency-domain perspective. To comprehensively characterize tumor regions, we
develop a Frequency Domain Decomposition (FDD) module that separates MRI images
into low-frequency components, capturing smooth tumor contours, and
high-frequency components, highlighting detailed textures and directional
edges. To further enhance sensitivity to tumor boundaries, we introduce an
Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical
high-frequency details using dynamically updated convolution kernels. To
effectively fuse tumor features across multiple scales, we design a Frequency
Domain Cross-Attention (FDCA) module integrating semantic, positional, and
slice-specific information. We further validate and interpret frequency-domain
improvements through visualization, theoretical reasoning, and experimental
analyses. Extensive experiments on four public datasets demonstrate that
HFF-Net achieves an average relative improvement of 4.48% (ranging from 2.39%
to 7.72%) in the mean Dice scores across the three major subregions, and an
average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the
segmentation of contrast-enhancing tumor regions, while maintaining favorable
computational efficiency and clinical applicability. Code:
https://github.com/VinyehShaw/HFF.
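The idea of separating low- and high-frequency content can be shown on a 1-D intensity profile with a plain discrete Fourier transform. This is only a sketch: the abstract does not say which transform the FDD module uses, so the DFT, the hard cutoff, and the names `split_frequencies` and `cutoff` are assumptions made for illustration.

```python
import cmath
import math

def dft(signal):
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]

def split_frequencies(signal, cutoff):
    """Split a real signal into low- and high-frequency parts that sum to it."""
    n = len(signal)
    spectrum = dft(signal)
    # Bins k and n - k carry the same frequency, so both are kept or dropped together.
    low = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high = [c - l for c, l in zip(spectrum, low)]
    return idft(low), idft(high)

# A smooth "contour" (1 cycle) plus fine "texture" (10 cycles) along one image row.
n = 32
profile = [
    math.sin(2 * math.pi * t / n) + 0.3 * math.sin(2 * math.pi * 10 * t / n)
    for t in range(n)
]
low, high = split_frequencies(profile, cutoff=3)
```

Here `low` recovers the smooth one-cycle component and `high` the fine ten-cycle texture; a segmentation network along these lines would process the two branches separately before fusing them.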
[143] Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation
Emerson P. Grabke,Masoom A. Haider,Babak Taati
Main category: eess.IV
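TL;DR: The paper proposes CCELLA, a dual-head conditioning method that, combined with a joint loss function and a data-efficient LDM training framework, substantially improves 3D prostate MRI generation and eases data scarcity in medical image synthesis.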
Details
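Motivation: In medical image synthesis, latent diffusion models (LDMs) typically rely on short-prompt text encoders or non-medical pre-trained models and need large data volumes for fine-tuning, which limits performance and scientific accessibility. The paper aims to address these limitations. Contribution: 1. Proposes CCELLA, a dual-head conditioning approach that combines non-medical large language model text features with pathology-classification conditioning; 2. Designs a joint loss function and a data-efficient LDM training framework; 3. Achieves high-quality medical image synthesis with limited data.
Method: The CCELLA dual-head conditioning strategy injects text features (via cross-attention) and pathology classification (via the timestep embedding) into the LDM U-Net simultaneously; a joint loss function and a data-efficient training framework are also proposed.
Result: On a size-limited 3D prostate MRI dataset, the method reaches a 3D FID of 0.025, significantly outperforming a recent foundation model (FID 0.071); adding synthetic images to the training set raises classifier accuracy from 69% to 74%, and a classifier trained only on synthetic images performs comparably to one trained on real images alone.
Insight: A suitable conditioning strategy and an efficient training framework enable high-quality medical image synthesis in small-data settings, easing data scarcity while improving the model's practicality and accessibility.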
Abstract: Latent diffusion models (LDM) could alleviate data scarcity challenges
affecting machine learning development for medical imaging. However, medical
LDM training typically relies on performance- or scientific
accessibility-limiting strategies including a reliance on short-prompt text
encoders, the reuse of non-medical LDMs, or a requirement for fine-tuning with
large data volumes. We propose a Class-Conditioned Efficient Large Language
model Adapter (CCELLA) to address these limitations. CCELLA is a novel
dual-head conditioning approach that simultaneously conditions the LDM U-Net
with non-medical large language model-encoded text features through
cross-attention and with pathology classification through the timestep
embedding. We also propose a joint loss function and a data-efficient LDM
training framework. In combination, these strategies enable
pathology-conditioned LDM training for high-quality medical image synthesis
given limited data volume and human data annotation, improving LDM performance
and scientific accessibility. Our method achieves a 3D FID score of 0.025 on a
size-limited prostate MRI dataset, significantly outperforming a recent
foundation model with FID 0.071. When training a classifier for prostate cancer
prediction, adding synthetic images generated by our method to the training
dataset improves classifier accuracy from 69% to 74%. Training a classifier
solely on our method’s synthetic images achieved comparable performance to
training on real images alone.
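The dual-head idea (text through cross-attention, class label through the timestep embedding) can be sketched in a few lines of plain Python. This is an illustrative toy under several assumptions, not CCELLA itself: the class embedding is simply added to a sinusoidal timestep embedding, attention is single-query with no learned projections, and all names (`timestep_embedding`, `cross_attention`, `condition`, `class_table`) are hypothetical.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of diffusion timestep t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, text_tokens):
    """Single-query attention over text token features (no learned projections)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in text_tokens]
    weights = softmax(scores)
    dim = len(text_tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, text_tokens)) for d in range(dim)]


def condition(feature, t, class_id, class_table, text_tokens):
    """Apply both conditioning heads to one feature vector."""
    # Head 1 (assumption): the class embedding is summed with the timestep embedding.
    t_emb = timestep_embedding(t, len(feature))
    emb = [a + b for a, b in zip(t_emb, class_table[class_id])]
    # Head 2: text features enter through cross-attention.
    attended = cross_attention(feature, text_tokens)
    return [f + e + a for f, e, a in zip(feature, emb, attended)]


rng = random.Random(0)
dim = 4
feature = [rng.uniform(-1, 1) for _ in range(dim)]
class_table = {0: [0.0] * dim, 1: [0.5] * dim}  # hypothetical class embeddings
text_tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]

out_class0 = condition(feature, 10, 0, class_table, text_tokens)
out_class1 = condition(feature, 10, 1, class_table, text_tokens)
```

With the same text and timestep, changing only `class_id` shifts the output by the difference between the two class embeddings, which shows the class signal travelling on a separate path from the text signal.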
[144] DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction
Yuliang Zhu,Jing Cheng,Qi Xie,Zhuo-Xu Cui,Qingyong Zhu,Yuanyuan Liu,Xin Liu,Jianfeng Ren,Chengbo Wang,Dong Liang
Main category: eess.IV
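TL;DR: The paper proposes DUN-SRE, a deep unrolling network with spatiotemporal rotation equivariance for dynamic MRI reconstruction, which markedly improves image quality by incorporating spatiotemporal symmetry priors.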
Details
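Motivation: Dynamic MRI exhibits symmetry priors in both the spatial and temporal dimensions, but existing methods fail to model them effectively. DUN-SRE aims to fill this gap, particularly for temporal symmetry. Contribution: 1. Proposes a new (2+1)D equivariant convolutional architecture that unifies the data consistency and proximal mapping modules; 2. Develops a high-fidelity group filter parameterization mechanism; 3. Achieves state-of-the-art performance on Cardiac CINE MRI datasets.
Method: Uses a deep unrolling network framework with (2+1)D equivariant convolutions and a group filter parameterization mechanism that balances symmetry constraints against representation precision.
Result: On Cardiac CINE MRI datasets, DUN-SRE excels at preserving rotation-symmetric structures and shows broad generalization capability.
Insight: Spatiotemporal symmetry priors are crucial for dynamic MRI reconstruction, and DUN-SRE's equivariant design enables more physically accurate modeling.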
Abstract: Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries,
including spatial rotation symmetry within individual frames and temporal
symmetry along the time dimension. Explicit incorporation of these symmetry
priors in the reconstruction model can significantly improve image quality,
especially under aggressive undersampling scenarios. Recently, equivariant
convolutional neural networks (ECNNs) have shown great promise in exploiting
spatial symmetry priors. However, existing ECNNs critically fail to model
temporal symmetry, arguably the most universal and informative structural prior
in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep
Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for
Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance
through a (2+1)D equivariant convolutional architecture. In particular, it
integrates both the data consistency and proximal mapping modules into a unified
deep unrolling framework. This architecture ensures rigorous propagation of
spatiotemporal rotation symmetry constraints throughout the reconstruction
process, enabling more physically accurate modeling of cardiac motion dynamics
in cine MRI. In addition, a high-fidelity group filter parameterization
mechanism is developed to maintain representation precision while enforcing
symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets
demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in
preserving rotation-symmetric structures, offering strong generalization
capability to a broad range of dynamic MRI reconstruction tasks.
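Rotation equivariance means that rotating the input and then applying a layer gives the same result as applying the layer and then rotating the output. The sketch below checks this property for 90-degree rotations only, using a zero-padded 3x3 filter on a small integer image. It is a toy illustration of spatial equivariance, not the paper's (2+1)D architecture, and the helper names are hypothetical.

```python
def rot90(img):
    """Rotate a square 2D list by 90 degrees counter-clockwise."""
    n = len(img)
    return [[img[x][n - 1 - y] for x in range(n)] for y in range(n)]


def correlate3x3(img, kernel):
    """Zero-padded 3x3 cross-correlation of a square integer image."""
    n = len(img)
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < n and 0 <= xx < n:
                        acc += kernel[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc
    return out


def is_rot90_equivariant(kernel, img):
    """True if filtering commutes with a 90-degree rotation on this image."""
    return correlate3x3(rot90(img), kernel) == rot90(correlate3x3(img, kernel))


# A kernel that is unchanged by 90-degree rotation, and one that is not.
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
HORIZONTAL_EDGE = [[0, 0, 0], [-1, 0, 1], [0, 0, 0]]

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 1, 2, 3],
    [4, 5, 6, 7],
]

laplacian_ok = is_rot90_equivariant(LAPLACIAN, image)
edge_ok = is_rot90_equivariant(HORIZONTAL_EDGE, image)
```

The rotation-symmetric Laplacian passes the check while the directional edge filter does not. Equivariant networks constrain their filters so that the property holds by construction rather than having to be learned from data.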
[145] ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation
Xi Chen,Zhiqiang Shen,Peng Cao,Jinzhu Yang,Osmar R. Zaiane
Main category: eess.IV
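TL;DR: The paper proposes ConStyX, a new domain randomization method that improves the generalization of medical image segmentation models by augmenting both content and style, addressing existing methods' exclusive reliance on style perturbation and their neglect of over-augmentation.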
Details
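Motivation: Domain shifts caused by multi-domain acquisition of medical images degrade model performance, and existing domain generalization methods rely solely on style perturbation while ignoring the adverse effects of over-augmentation. Contribution: Proposes ConStyX, which augments both the content and style of training data and adjusts the training process to mitigate the negative effects of over-augmentation, thereby improving generalization.
Method: Combines content and style augmentation and uses an optimization mechanism to balance the influence of augmented features, improving generalization across multi-domain data.
Result: Experiments show that ConStyX outperforms existing methods across multiple domains and generalizes better.
Insight: Augmenting content and style together covers multi-domain data more fully, while optimizing the training process effectively avoids the negative effects of over-augmentation.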
Abstract: Medical images are usually collected from multiple domains, leading to domain
shifts that impair the performance of medical image segmentation models. Domain
Generalization (DG) aims to address this issue by training a robust model with
strong generalizability. Recently, numerous domain randomization-based DG
methods have been proposed. However, these methods suffer from the following
limitations: 1) constrained efficiency of domain randomization due to their
exclusive dependence on image style perturbation, and 2) neglect of the adverse
effects of over-augmented images on model training. To address these issues, we
propose a novel domain randomization-based DG method, called content style
augmentation (ConStyX), for generalizable medical image segmentation.
Specifically, ConStyX 1) augments the content and style of training data,
allowing the augmented training data to better cover a wider range of data
domains, and 2) leverages well-augmented features while mitigating the negative
effects of over-augmented features during model training. Extensive experiments
across multiple domains demonstrate that our ConStyX achieves superior
generalization performance. The code is available at
https://github.com/jwxsp1/ConStyX.
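For context, the style-only perturbation that the abstract says earlier methods depend on is often realized by randomizing global intensity statistics. The sketch below assumes that reading of "style" (mean and standard deviation of pixel intensities); it does not reproduce ConStyX's content augmentation or its handling of over-augmented features, and `normalize` and `perturb_style` are hypothetical names.

```python
import random
import statistics


def normalize(pixels):
    """Zero-mean, unit-variance version of a flat list of intensities."""
    mu = statistics.fmean(pixels)
    sigma = statistics.pstdev(pixels) or 1.0  # guard against constant images
    return [(p - mu) / sigma for p in pixels]


def perturb_style(pixels, rng, mean_shift=0.2, scale_range=(0.7, 1.3)):
    """Re-scale normalized intensities with randomly perturbed mean and std."""
    mu = statistics.fmean(pixels)
    sigma = statistics.pstdev(pixels) or 1.0
    new_mu = mu + rng.uniform(-mean_shift, mean_shift)
    new_sigma = sigma * rng.uniform(*scale_range)
    return [z * new_sigma + new_mu for z in normalize(pixels)]


rng = random.Random(0)
pixels = [0.1, 0.4, 0.35, 0.8, 0.05, 0.6]  # toy flattened image
augmented = perturb_style(pixels, rng)
```

Because only the mean and scale change, the normalized image is untouched: the spatial structure (the "content") is identical before and after. That is the limitation the abstract points to when it argues for augmenting content as well as style.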
[146] Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches
Andrea Moglia,Matteo Leccardi,Matteo Cavicchioli,Alice Maccarini,Marco Marcon,Luca Mainardi,Pietro Cerveri
Main category: eess.IV
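TL;DR: This survey systematically examines generalist models (particularly SAM and its variants) for medical image segmentation, compares their performance with task-specific models, and discusses future directions and open challenges.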
Details
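Motivation: Inspired by the success of large language models and natural-image segmentation models such as SAM, the survey studies the potential of generalist models in medical image segmentation and asks whether they can surpass task-specific models. Contribution: Provides a comprehensive survey of generalist models for medical image segmentation, covering the SAM variants (zero-shot, few-shot, fine-tuning, and others) and other innovative models, with a rigorous performance comparison against task-specific models.
Method: Through taxonomy and comparative analysis, the survey summarizes the architectures and implementations of different generalist models (such as SAM 2 and models trained on text and images) and evaluates their performance and applicability.
Result: The survey finds that generalist models perform well on some tasks but still lag behind task-specific models, particularly given the complexity and diversity of medical imaging.
Insight: Future research directions should include synthetic data, multimodal fusion, lessons from generalist models in natural language processing, trustworthy AI, and practical issues in clinical translation.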
Abstract: Following the successful paradigm shift of large language models, leveraging
pre-training on a massive corpus of data and fine-tuning on different
downstream tasks, generalist models have made their foray into computer vision.
The introduction of Segment Anything Model (SAM) set a milestone on
segmentation of natural images, inspiring the design of a multitude of
architectures for medical image segmentation. In this survey we offer a
comprehensive and in-depth investigation on generalist models for medical image
segmentation. We start with an introduction on the fundamental concepts
underpinning their development. Then, we provide a taxonomy on the different
variants of SAM in terms of zero-shot, few-shot, fine-tuning, adapters, on
the recent SAM 2, on other innovative models trained on images alone, and
others trained on both text and images. We thoroughly analyze their
performances at the level of both primary research and best-in-literature,
followed by a rigorous comparison with the state-of-the-art task-specific
models. We emphasize the need to address challenges in terms of compliance with
regulatory frameworks, privacy and security laws, budget, and trustworthy
artificial intelligence (AI). Finally, we share our perspective on future
directions concerning synthetic data, early fusion, lessons learnt from
generalist models in natural language processing, agentic AI and physical AI,
and clinical translation.
[147] Med-URWKV: Pure RWKV With ImageNet Pre-training For Medical Image Segmentation
Zhenhuan Zhou
Main category: eess.IV
TL;DR: Med-URWKV is the first pure-RWKV model for medical image segmentation to leverage ImageNet pre-training, and it performs well across multiple datasets.
Details
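Motivation: Existing medical image segmentation methods (CNN, Transformer, or hybrid architectures) suffer from restricted receptive fields or high computational complexity. RWKV, which combines linear complexity with long-range modeling, is a new alternative, but the benefits of pre-training it have not been explored. Contribution: Proposes Med-URWKV, a pure-RWKV segmentation model built on the U-Net framework and the first to use an ImageNet-pre-trained VRWKV encoder rather than training from scratch.
Method: Adopts a U-Net architecture with a pure-RWKV design and directly reuses a pre-trained VRWKV encoder to improve performance.
Result: Evaluated on seven datasets, Med-URWKV matches or exceeds RWKV models trained from scratch, demonstrating the effectiveness of pre-training.
Insight: A pre-trained RWKV encoder can markedly improve medical image segmentation, pointing to a new direction for lightweight, efficient long-range modeling.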
Abstract: Medical image segmentation is a fundamental and key technology in
computer-aided diagnosis and treatment. Previous methods can be broadly
classified into three categories: convolutional neural network (CNN) based,
Transformer based, and hybrid architectures that combine both. However, each of
them has its own limitations, such as restricted receptive fields in CNNs or
the computational overhead caused by the quadratic complexity of Transformers.
Recently, the Receptance Weighted Key Value (RWKV) model has emerged as a
promising alternative for various vision tasks, offering strong long-range
modeling capabilities with linear computational complexity. Some studies have
also adapted RWKV to medical image segmentation tasks, achieving competitive
performance. However, most of these studies focus on modifications to the
Vision-RWKV (VRWKV) mechanism and train models from scratch, without exploring
the potential advantages of leveraging pre-trained VRWKV models for medical
image segmentation tasks. In this paper, we propose Med-URWKV, a pure
RWKV-based architecture built upon the U-Net framework, which incorporates
ImageNet-based pretraining to further explore the potential of RWKV in medical
image segmentation tasks. To the best of our knowledge, Med-URWKV is the first
pure RWKV segmentation model in the medical field that can directly reuse a
large-scale pre-trained VRWKV encoder. Experimental results on seven datasets
demonstrate that Med-URWKV achieves comparable or even superior segmentation
performance compared to other carefully optimized RWKV models trained from
scratch. This validates the effectiveness of using a pretrained VRWKV encoder
in enhancing model performance. The code will be released.
physics.med-ph [Back]
[148] Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation
Nicholas Summerfield,Qisheng He,Alex Kuo,Ahmed I. Ghanem,Simeng Zhu,Chase Ruff,Joshua Pan,Anudeep Kumar,Prashant Nagpal,Jiwei Zhao,Ming Dong,Carri K. Glide-Hurst
Main category: physics.med-ph
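TL;DR: The paper proposes MAGIC, a multi-modality cardiac substructure segmentation method built on the nnU-Net framework that achieves modality adaptability through replicated encoding and decoding branches and performs well on CT, MR-Linac, and CCTA.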
Details
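Motivation: Cardiac substructure segmentation is essential for radiation therapy planning, but existing deep learning methods generalize poorly across modalities and overlapping structures. Contribution: Proposes MAGIC, the first method to segment multiple modalities (CT, MR-Linac, CCTA) and overlapping structures in a single model, simplifying the computational requirements of clinical deployment.
Method: Built on an nnU-Net-based U-shaped architecture, with replicated encoding and decoding branches for multi-modality adaptation; 76 cases are used for training, 15 for validation, and 30 for testing.
Result: Evaluated with the Dice Similarity Coefficient (DSC), MAGIC outperforms the comparison models in 57% of cases while remaining computationally lightweight.
Insight: MAGIC demonstrates the potential of a single model for multi-modality tasks, but statistical differences are limited and further optimization is needed.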
Abstract: Cardiac substructures are essential in thoracic radiation therapy planning to
minimize risk of radiation-induced heart disease. Deep learning (DL) offers
efficient methods to reduce contouring burden but lacks generalizability across
different modalities and overlapping structures. This work introduces and
validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and
multi-modal cardiac substructure segmentation. MAGIC is implemented through
replicated encoding and decoding branches of an nnU-Net-based, U-shaped
backbone conserving the function of a single model. Twenty cardiac
substructures (heart, chambers, great vessels (GVs), valves, coronary arteries
(CAs), and conduction nodes) from simulation CT (Sim-CT), low-field MR-Linac,
and cardiac CT angiography (CCTA) modalities were manually delineated and used
to train (n=76), validate (n=15), and test (n=30) MAGIC. Twelve comparison
models (four segmentation subgroups across three modalities) were equivalently
trained. All methods were compared for training efficiency and against
reference contours using the Dice Similarity Coefficient (DSC) and two-tailed
Wilcoxon Signed-Rank test (threshold, p<0.05). Average DSC scores were
0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. MAGIC
outperforms the comparison models in 57% of cases, with limited statistical
differences. MAGIC offers an effective and accurate segmentation solution that
is lightweight and capable of segmenting multiple modalities and overlapping
structures in a single model. MAGIC further enables clinical implementation by
simplifying the computational requirements and offering unparalleled
flexibility for clinical settings.
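The Dice Similarity Coefficient used for evaluation above is defined as DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch follows, with two assumptions that the abstract does not state: the parenthesized value in figures such as 0.75(0.16) is taken to be a sample standard deviation, and two empty masks are scored as 1.0 by convention. The masks and per-case scores are made-up toy values.

```python
import statistics


def dice(pred, ref):
    """Dice Similarity Coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * r for p, r in zip(pred, ref))
    total = sum(pred) + sum(ref)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def summarize(scores):
    """Format per-case scores as mean(std), using the sample standard deviation."""
    return f"{statistics.fmean(scores):.2f}({statistics.stdev(scores):.2f})"


reference = [0, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 0, 1, 0]
score = dice(predicted, reference)  # 2 overlapping voxels, 3 + 3 labeled in total

# Hypothetical per-case DSC values for one modality.
case_scores = [0.9, 0.8, 0.6, 0.7]
summary = summarize(case_scores)
```

DSC ranges from 0 (no overlap) to 1 (identical masks). Small structures such as coronary arteries tend to score lower than large ones for the same absolute boundary error, which is worth keeping in mind when reading averaged figures.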