cs.CL

[1] TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection

Tarık Saraç,Selin Mergen,Mucahid Kutlu

Main category: cs.CL

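TL;DR: The paper proposes a council debate method based on large language models (LLMs) for detecting scientific claims, references to scientific studies, and mentions of scientific entities in tweets. It performed modestly on the claim and entity tasks but ranked first at detecting references to scientific studies.

Motivation: Detecting scientific web discourse is important but challenging, and the need to quickly identify science-related content on social media is growing. A conventional single-model approach may not handle complex semantic and contextual relationships well.

Contribution: Explores three debating methods (single debate, team debate, council debate) and shows that council debate is effective at detecting references to scientific studies.

Method: Simulates structured academic discussion among multiple LLMs: single debate (two LLMs argue opposing positions while a third acts as judge), team debate (multiple models collaborate within each side), and council debate (multiple expert models deliberate toward a consensus, moderated by a chairperson model).

Result: Council debate ranked first at detecting references to scientific studies but placed 8th out of 10 for scientific claims and 9th out of 10 for mentions of scientific entities.

Insight: Debate among multiple expert models can help on specific tasks, but the benefit varies with the task. Future work could refine the debate structure and the choice of expert models.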

Abstract: In this paper, we present our work developed for the scientific web discourse detection task (Task 4a) of CheckThat! 2025. We propose a novel council debate method that simulates structured academic discussions among multiple large language models (LLMs) to identify whether a given tweet contains (i) a scientific claim, (ii) a reference to a scientific study, or (iii) mentions of scientific entities. We explore three debating methods: i) single debate, where two LLMs argue for opposing positions while a third acts as a judge; ii) team debate, in which multiple models collaborate within each side of the debate; and iii) council debate, where multiple expert models deliberate together to reach a consensus, moderated by a chairperson model. We choose council debate as our primary model as it outperforms others in the development test set. Although our proposed method did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), it ranked first in detecting references to scientific studies.
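
The council structure described above can be illustrated with a deliberately small sketch. It is not the authors' implementation: the experts here are hypothetical keyword stubs standing in for LLM calls, there is a single voting round with no deliberation, and the chairperson rule (simple majority, ties resolved to "no") is an assumption.

```python
from collections import Counter
from typing import Callable, List

# An "expert" is any callable that maps a tweet to a yes/no vote.
# In a real system each expert would be an LLM call; these are hypothetical stubs.
Expert = Callable[[str], bool]


def keyword_expert(keywords: List[str]) -> Expert:
    """Build a stub expert that votes True if any keyword occurs in the tweet."""
    lowered = [k.lower() for k in keywords]
    return lambda tweet: any(k in tweet.lower() for k in lowered)


def chairperson(votes: List[bool]) -> bool:
    """Assumed chairperson rule: simple majority, ties resolved to False."""
    counts = Counter(votes)
    return counts[True] > counts[False]


def council_debate(tweet: str, experts: List[Expert]) -> bool:
    """Collect one vote per expert and let the chairperson decide."""
    votes = [expert(tweet) for expert in experts]
    return chairperson(votes)


council = [
    keyword_expert(["study", "paper"]),
    keyword_expert(["journal", "doi"]),
    keyword_expert(["researchers found", "published"]),
]
tweet = "A new study published in a journal links sleep to memory."
print(council_debate(tweet, council))  # True: all three stub experts vote yes
```

Swapping the stubs for model calls and adding rounds in which experts see each other's arguments would bring the sketch closer to the deliberation the abstract describes.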

[2] Real-time News Story Identification

Tadej Škvorc,Nikola Ivačič,Sebastjan Hribar,Marko Robnik-Šikonja

Main category: cs.CL

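TL;DR: The paper proposes a real-time news story identification approach that combines text representation techniques, clustering algorithms, and online topic modeling to assign news articles to specific stories as they are published online.

Motivation: To improve the reading experience, many news sites organize news into topical collections called stories. Existing text clustering and topic modeling methods do not meet the requirement of grouping articles by particular events, places, and people, so a real-time method for identifying and assigning stories is needed.

Contribution: A real-time story identification approach that combines several text representation methods with online topic-modeling approaches (such as BERTopic, DBStream, and TextClust) to assign articles to specific stories as they appear.

Method: Combines text representation methods to extract the events and named entities needed for story identification, then applies clustering algorithms and adapted online topic-modeling approaches (BERTopic, DBStream, TextClust).

Result: Evaluated on a news dataset from Slovene media covering one month; human evaluators judged the results to be sensible.

Insight: Mixing several text representations with online topic modeling can capture the specific events and entities in news more precisely, which enables real-time story identification.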

Abstract: To improve the reading experience, many news sites organize news into topical collections, called stories. In this work, we present an approach for implementing real-time story identification for a news monitoring system that automatically collects news articles as they appear online and processes them in various ways. Story identification aims to assign each news article to a specific story that the article is covering. The process is similar to text clustering and topic modeling, but requires that articles be grouped based on particular events, places, and people, rather than general text similarity (as in clustering) or general (predefined) topics (as in topic modeling). We present an approach to story identification that is capable of functioning in real time, assigning articles to stories as they are published online. In the proposed approach, we combine text representation techniques, clustering algorithms, and online topic modeling methods. We combine various text representation methods to extract specific events and named entities necessary for story identification, showing that a mixture of online topic-modeling approaches such as BERTopic, DBStream, and TextClust can be adapted for story discovery. We evaluate our approach on a news dataset from Slovene media covering a period of 1 month. We show that our real-time approach produces sensible results as judged by human evaluators.
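
To make the idea of assigning articles to stories as they arrive concrete, here is a minimal sketch. It swaps in a plain threshold-based online clustering over named-entity sets with Jaccard similarity; it is not BERTopic, DBStream, or TextClust, and the `StoryAssigner` class, the 0.3 threshold, and the example entities are all assumptions for illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two entity sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class StoryAssigner:
    """Assigns each incoming article to the most similar story, or opens a new one."""

    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold
        self.stories: List[Set[str]] = []  # one accumulated entity set per story

    def assign(self, entities: Set[str]) -> int:
        best_id, best_sim = -1, 0.0
        for story_id, story_entities in enumerate(self.stories):
            sim = jaccard(entities, story_entities)
            if sim > best_sim:
                best_id, best_sim = story_id, sim
        if best_id >= 0 and best_sim >= self.threshold:
            # Merge the article's entities into the matched story.
            self.stories[best_id] |= entities
            return best_id
        # No story is similar enough: start a new one.
        self.stories.append(set(entities))
        return len(self.stories) - 1


assigner = StoryAssigner()
stream = [
    {"Ljubljana", "flood", "Sava"},
    {"election", "parliament"},
    {"flood", "Sava", "rescue"},
]
story_ids = [assigner.assign(entities) for entities in stream]
print(story_ids)  # [0, 1, 0]
```

Each article is processed once and never revisited, which is the property that makes this kind of scheme usable on a live feed.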

[3] TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning

Kristian Miok,Blaz Škrlj,Daniela Zaharie,Marko Robnik Šikonja

Main category: cs.CL

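TL;DR: TT-XAI is a lightweight framework that improves clinical text classification performance and interpretability through keyword distillation and LLM reasoning.

Motivation: Clinical language models often struggle to give trustworthy predictions and explanations on lengthy, unstructured electronic health records (EHRs).

Contribution: Introduces TT-XAI, which combines domain-aware keyword distillation with LLM reasoning to improve both classification performance and explanation quality.

Method: Distills raw discharge notes into concise keyword representations, uses a focused variant of LIME to improve local explanation fidelity, and steers LLMs with keyword-guided prompts to generate chain-of-thought clinical explanations.

Result: All evaluations (deletion-based fidelity metrics, LLaMA-3 self-assessment, and a blinded human study with domain experts) favor the keyword-augmented method.

Insight: Keyword distillation simplifies the input and makes explanations more trustworthy, offering a scalable path toward auditable AI in clinical decision support.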

Abstract: Clinical language models often struggle to provide trustworthy predictions and explanations when applied to lengthy, unstructured electronic health records (EHRs). This work introduces TT-XAI, a lightweight and effective framework that improves both classification performance and interpretability through domain-aware keyword distillation and reasoning with large language models (LLMs). First, we demonstrate that distilling raw discharge notes into concise keyword representations significantly enhances BERT classifier performance and improves local explanation fidelity via a focused variant of LIME. Second, we generate chain-of-thought clinical explanations using keyword-guided prompts to steer LLMs, producing more concise and clinically relevant reasoning. We evaluate explanation quality using deletion-based fidelity metrics, self-assessment via LLaMA-3 scoring, and a blinded human study with domain experts. All evaluation modalities consistently favor the keyword-augmented method, confirming that distillation enhances both machine and human interpretability. TT-XAI offers a scalable pathway toward trustworthy, auditable AI in clinical decision support.

[4] Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition

Roberto Labadie-Tamayo,Djordje Slijepčević,Xihui Chen,Adrian Jaques Böck,Andreas Babic,Liz Freimann,Christiane Atzmüller,Matthias Zeppelzauer

Main category: cs.CL

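TL;DR: The paper proposes a transparent method, the Speech Concept Bottleneck Model (SCBM), for hate and counter speech recognition. It uses adjectives as interpretable bottleneck concepts and combines LLMs with a lightweight classifier. It outperforms the most recently reported results on four of five datasets while offering high interpretability.

Motivation: The rapid increase in hate speech on social media has a major impact on society and calls for automated detection. Existing black-box models lack interpretability, so the paper proposes a transparent, interpretable alternative.

Contribution: 1. Proposes SCBM, which uses adjectives as bottleneck concepts for transparent, interpretable hate and counter speech recognition; 2. Outperforms the most recently reported results on four of five multilingual, multi-platform datasets; 3. Shows that fusing the adjective representation with transformer embeddings improves performance further.

Method: An LLM maps the input text to an abstract adjective-based representation, which a lightweight classifier then uses for the downstream task. The adjective representation can also be fused with transformer embeddings.

Result: The average macro-F1 across five datasets is 0.69, better than the most recently reported results on four of the five. Fusing with transformer embeddings adds 1.8% on average. The model offers both local and global interpretability.

Insight: Adjectives can serve as a compact, interpretable, and effective representation for hate and counter speech recognition, and with adapted adjectives the method can be applied to other NLP tasks.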

Abstract: The rapid increase in hate speech on social media has had an unprecedented impact on society, making automated methods for detecting such content important. Unlike prior black-box models, we propose a novel transparent method for automated hate and counter speech recognition, i.e., “Speech Concept Bottleneck Model” (SCBM), using adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to map input texts to an abstract adjective-based representation, which is then sent to a light-weight classifier for downstream tasks. Across five benchmark datasets spanning multiple languages and platforms (e.g., Twitter, Reddit, YouTube), SCBM achieves an average macro-F1 score of 0.69, which outperforms the most recently reported results from the literature on four out of five datasets. Aside from high recognition accuracy, SCBM provides a high level of both local and global interpretability. Furthermore, fusing our adjective-based concept representation with transformer embeddings leads to a 1.8% performance increase on average across all datasets, showing that the proposed representation captures complementary information. Our results demonstrate that adjective-based concept representations can serve as compact, interpretable, and effective encodings for hate and counter speech recognition. With adapted adjectives, our method can also be applied to other NLP tasks.
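
The bottleneck idea, where every prediction passes through a small set of named adjective scores, can be sketched as follows. Everything specific here is an assumption for illustration: the three adjectives, the hand-set weights and bias, and the concept scores themselves, which in SCBM would come from an LLM rather than being typed in by hand.

```python
import math
from typing import Dict, Tuple

# Hypothetical adjective concepts with hand-set classifier weights (not learned here).
WEIGHTS: Dict[str, float] = {
    "hostile": 2.0,
    "dehumanizing": 2.5,
    "supportive": -2.0,
}
BIAS = -1.0


def predict(concept_scores: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return the hate-speech probability and each adjective's contribution to the logit."""
    contributions = {
        adjective: weight * concept_scores.get(adjective, 0.0)
        for adjective, weight in WEIGHTS.items()
    }
    logit = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions


# In SCBM these scores would be produced by an LLM; here they are made up.
scores = {"hostile": 0.9, "dehumanizing": 0.8, "supportive": 0.1}
probability, contributions = predict(scores)
print(round(probability, 3))
for adjective, value in contributions.items():
    print(f"{adjective}: {value:+.2f}")
```

Because the classifier is linear over named concepts, the per-adjective contributions double as a local explanation of each prediction, and the weights as a global one.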

[5] MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis

Haiyun Guo,ZhiYan Hou,Yu Chen,Jinghan He,Yandu Sun,Yuzhe Zhou,Shujing Guo,Kuan Zhu,Jinqiao Wang

Main category: cs.CL

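TL;DR: The paper presents MLLM-CTBench, a comprehensive benchmark for continual instruction tuning of multimodal large language models. It combines final-answer accuracy with fine-grained chain-of-thought (CoT) reasoning quality assessment and systematically evaluates algorithms and training paradigms. Key findings: models with stronger general capabilities forget less, reasoning chains degrade more slowly than final answers, and algorithm effectiveness depends on both model capability and task order.

Motivation: Progress on continual instruction tuning of multimodal LLMs (MLLMs) is hindered by the lack of rigorous, systematic benchmarks. MLLM-CTBench is intended to fill that gap.

Contribution: 1. Multidimensional evaluation: final-answer accuracy combined with CoT reasoning quality assessment; 2. Comprehensive evaluation of algorithms and training paradigms: eight continual learning algorithms across four major categories, plus a comparison of reinforcement learning with supervised fine-tuning; 3. Carefully curated tasks: 16 datasets covering six domains.

Method: Uses a specially trained CoT evaluator for fine-grained reasoning quality assessment, systematically compares supervised fine-tuning with reinforcement learning, and benchmarks eight continual learning algorithms.

Result: Models with stronger general capabilities forget less during continual learning; reasoning chains degrade more slowly than final answers; algorithm effectiveness depends strongly on model capability and task order; and KL-divergence constraints help mitigate forgetting in reinforcement learning.

Insight: The slower degradation of reasoning chains supports the hierarchical forgetting hypothesis, and KL constraints matter for policy stability in reinforcement learning. The benchmark offers practical guidance for algorithm design and evaluation.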

Abstract: Multimodal Large Language Models (MLLMs) rely on continual instruction tuning to adapt to the evolving demands of real-world applications. However, progress in this area is hindered by the lack of rigorous and systematic benchmarks. To address this gap, we present MLLM-CTBench, a comprehensive evaluation benchmark with three key contributions: (1) Multidimensional Evaluation: We combine final answer accuracy with fine-grained CoT reasoning quality assessment, enabled by a specially trained CoT evaluator; (2) Comprehensive Evaluation of Algorithms and Training Paradigms: We benchmark eight continual learning algorithms across four major categories and systematically compare reinforcement learning with supervised fine-tuning paradigms; (3) Carefully Curated Tasks: We select and organize 16 datasets from existing work, covering six challenging domains. Our key findings include: (i) Models with stronger general capabilities exhibit greater robustness to forgetting during continual learning; (ii) Reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; (iii) The effectiveness of continual learning algorithms is highly dependent on both model capability and task order; (iv) In reinforcement learning settings, incorporating KL-divergence constraints helps maintain policy stability and plays a crucial role in mitigating forgetting. MLLM-CTBench establishes a rigorous standard for continual instruction tuning of MLLMs and offers practical guidance for algorithm design and evaluation.

[6] Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models

Yassine Jamaa,Badr AlKhamissi,Satrajit Ghosh,Martin Schrimpf

Main category: cs.CL

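TL;DR: The study adapts the contrast localizer method from neuroscience to locate units causally relevant to Theory of Mind (ToM) and mathematical reasoning in LLMs and vision-language models (VLMs), and finds that ablating low-activation units sometimes hurts performance more than ablating highly activated ones.

Motivation: To test whether contrast localizers actually identify the units that are causally relevant to specific tasks such as ToM and mathematical reasoning.

Contribution: Adapts a contrast localizer to identify causally relevant units, evaluates it across 11 LLMs and 5 VLMs on ToM and mathematical benchmarks, and reports results that run contrary to expectations.

Method: Localizes top-activated units with contrastive stimulus sets, assesses their causal role through targeted ablations, and compares against lesioning low-activation and randomly selected units.

Result: Low-activation units sometimes produced larger performance drops than highly activated units, and units from the mathematical localizer often impaired ToM performance more than units from the ToM localizer.

Insight: Contrast-based localizers may not accurately capture task-specific units; broader stimulus sets and more precise methods are needed.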

Abstract: This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets to more accurately capture task-specific units.
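
The localize-then-ablate procedure can be sketched on toy data. This is a simplification under stated assumptions: units are ranked by the difference in mean activation between task and control stimuli (a real localizer may use a statistical test instead), ablation is modeled as zeroing a unit, and all activation values are made up.

```python
from typing import List, Sequence, Tuple


def mean_per_unit(activations: Sequence[Sequence[float]]) -> List[float]:
    """Average activation of each unit over a set of stimuli (rows = stimuli)."""
    n_stimuli = len(activations)
    return [sum(column) / n_stimuli for column in zip(*activations)]


def localize(
    task_acts: Sequence[Sequence[float]],
    control_acts: Sequence[Sequence[float]],
    k: int,
) -> Tuple[List[int], List[int]]:
    """Rank units by task-minus-control mean activation; return top-k and bottom-k indices."""
    task_mean = mean_per_unit(task_acts)
    control_mean = mean_per_unit(control_acts)
    contrast = [t - c for t, c in zip(task_mean, control_mean)]
    ranked = sorted(range(len(contrast)), key=lambda i: contrast[i], reverse=True)
    return ranked[:k], ranked[-k:]


def ablate(activations: Sequence[float], units: Sequence[int]) -> List[float]:
    """Zero out the selected units, leaving the others unchanged."""
    masked = set(units)
    return [0.0 if i in masked else a for i, a in enumerate(activations)]


# Toy activations: 2 stimuli per condition, 4 units.
task = [[0.9, 0.1, 0.5, 0.2], [0.8, 0.2, 0.4, 0.1]]
control = [[0.1, 0.1, 0.5, 0.3], [0.2, 0.2, 0.3, 0.2]]

top_units, bottom_units = localize(task, control, k=1)
print(top_units, bottom_units)  # [0] [3]
print(ablate([1.0, 2.0, 3.0, 4.0], top_units))  # [0.0, 2.0, 3.0, 4.0]
```

The comparison the paper runs corresponds to measuring downstream accuracy after ablating `top_units`, `bottom_units`, and a random set of the same size.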

[7] Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions

Farah Atif,Nursultan Askarbekuly,Kareem Darwish,Monojit Choudhury

Main category: cs.CL

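TL;DR: The paper introduces FiqhQA, a benchmark for evaluating how accurately LLMs generate Islamic rulings and whether they recognize when not to answer, revealing differences across models, languages, and legal schools of thought.

Motivation: The reliability and accuracy of LLMs in religious domains remain underexamined, particularly for rulings specific to individual Islamic schools of thought.

Contribution: 1. Introduces the FiqhQA benchmark, which the authors describe as the first fine-grained evaluation across the four major Sunni schools of thought in both Arabic and English. 2. Presents what the authors describe as the first evaluation of LLM abstention behavior on Islamic jurisprudence queries.

Method: Runs zero-shot and abstention experiments to evaluate several LLMs across languages and schools of thought.

Result: GPT-4o is the most accurate, while Gemini and Fanar show better abstention behavior; all models perform worse in Arabic.

Insight: Religious applications call for task-specific evaluation and cautious deployment of LLMs, with particular attention to abstention behavior as a way to reduce confident incorrect answers.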

Abstract: Despite the increasing usage of Large Language Models (LLMs) in answering questions in a variety of domains, their reliability and accuracy remain unexamined for a plethora of domains including the religious domains. In this paper, we introduce a novel benchmark FiqhQA focused on the LLM generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Unlike prior work, which either overlooks the distinctions between religious school of thought or fails to evaluate abstention behavior, we assess LLMs not only on their accuracy but also on their ability to recognize when not to answer. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. While GPT-4o outperforms all other models in accuracy, Gemini and Fanar demonstrate superior abstention behavior critical for minimizing confident incorrect answers. Notably, all models exhibit a performance drop in Arabic, highlighting the limitations in religious reasoning for languages other than English. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained Islamic school of thought specific ruling generation and to evaluate abstention for Islamic jurisprudence queries. Our findings underscore the need for task-specific evaluation and cautious deployment of LLMs in religious applications.

[8] Putnam-AXIOM: A Functional and Static Benchmark

Aryan Gulati,Brando Miranda,Eric Chen,Emily Xia,Kai Fronsdal,Bruno Dumont,Elyas Obbad,Sanmi Koyejo

Main category: cs.CL

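TL;DR: Putnam-AXIOM is a new mathematical reasoning benchmark built from university-level competition problems. It adds programmatically generated functional variations to resist data contamination and introduces Teacher-Forced Accuracy (TFA) to score reasoning traces.

Motivation: Existing mathematical reasoning benchmarks are approaching saturation (some above 90% accuracy) and are increasingly compromised by training-set contamination, so a dynamic, contamination-resilient benchmark is needed.

Contribution: 1. Introduces Putnam-AXIOM, with 522 university-level competition problems and 100 functional variants; 2. Proposes the TFA metric for scoring reasoning traces; 3. Presents evidence suggesting that current models rely on memorization.

Method: Generates functional variants by programmatically perturbing variables and constants, and complements boxed-answer accuracy with Teacher-Forced Accuracy (TFA).

Result: Across the 19 evaluated models, accuracy falls on the variations. The strongest model, OpenAI's o1-preview, scores 41.9% on the original set and drops by 19.6 percentage points (a 46.8% relative decrease) on the paired variations.

Insight: Dynamic benchmarks and metrics that score reasoning directly are important for assessing real model capability; current models may lean on memorization rather than reasoning.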

Abstract: Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances – yielding a contamination-resilient test bed. On the Original set, OpenAI’s o1-preview – the strongest evaluated model – scores 41.9%, but its accuracy drops by 19.6% (46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement “boxed” accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
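
Two points in the abstract lend themselves to a short worked sketch. The first checks the reported figures: a drop of 19.6 from a base of 41.9 is a 46.8% relative decrease, which only works out if 19.6 is read as percentage points. The second shows what a functional variation can look like. The toy problem template and the `make_variant` helper are hypothetical and far simpler than a Putnam problem; they only illustrate perturbing constants while recomputing the ground-truth answer.

```python
import random
from typing import Dict, Union


def relative_decrease(original_acc: float, drop_points: float) -> float:
    """Relative decrease in percent, given the original accuracy and the drop in points."""
    return 100.0 * drop_points / original_acc


def make_variant(rng: random.Random) -> Dict[str, Union[int, str]]:
    """Generate one variant of a toy problem by sampling its constants (hypothetical template)."""
    a = rng.randint(2, 9)
    b = rng.randint(2, 9)
    return {
        "a": a,
        "b": b,
        "question": f"Compute the sum of the first {a} positive multiples of {b}.",
        # b * (1 + 2 + ... + a) = b * a * (a + 1) / 2; a * (a + 1) is always even.
        "answer": b * a * (a + 1) // 2,
    }


# Figures reported for o1-preview in the abstract.
print(round(relative_decrease(41.9, 19.6), 1))  # 46.8

rng = random.Random(0)
variants = [make_variant(rng) for _ in range(3)]
for variant in variants:
    print(variant["question"], "->", variant["answer"])
```

Because the answer is computed from the sampled constants, every generated instance comes with a verified label, which is what allows an unlimited stream of unseen test items.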

[9] CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation

Shuzhou Yuan,William LaCroix,Hardik Ghoshal,Ercong Nie,Michael Färber

Main category: cs.CL

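TL;DR: CoDAE adapts LLMs for educational use through chain-of-thought (CoT) data augmentation, targeting their tendency to reveal answers too readily, their poor adaptation to student uncertainty, and their vulnerability to emotionally manipulative prompts.

Motivation: Off-the-shelf LLMs often underperform in educational settings: they reveal answers too readily, fail to adapt to student uncertainty, and remain vulnerable to emotionally manipulative prompts. Targeted adaptation is needed.

Contribution: Introduces the CoDAE framework, which augments tutoring dialogues with CoT prompting and adds targeted dialogue cases to improve model behavior in educational settings.

Method: Collects real-world dialogues between students and a ChatGPT-based tutor, enriches them with CoT prompting, designs targeted dialogue cases, and fine-tunes four open-source LLMs.

Result: Models fine-tuned with CoDAE give more pedagogically appropriate guidance, better support reasoning processes, and resist premature answer disclosure.

Insight: Domain-specific data augmentation and fine-tuning can improve the adaptivity and usefulness of LLMs in educational applications.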

Abstract: Large Language Models (LLMs) are increasingly employed as AI tutors due to their scalability and potential for personalized instruction. However, off-the-shelf LLMs often underperform in educational settings: they frequently reveal answers too readily, fail to adapt their responses to student uncertainty, and remain vulnerable to emotionally manipulative prompts. To address these challenges, we introduce CoDAE, a framework that adapts LLMs for educational use through Chain-of-Thought (CoT) data augmentation. We collect real-world dialogues between students and a ChatGPT-based tutor and enrich them using CoT prompting to promote step-by-step reasoning and pedagogically aligned guidance. Furthermore, we design targeted dialogue cases to explicitly mitigate three key limitations: over-compliance, low response adaptivity, and threat vulnerability. We fine-tune four open-source LLMs on different variants of the augmented datasets and evaluate them in simulated educational scenarios using both automatic metrics and LLM-as-a-judge assessments. Our results show that models fine-tuned with CoDAE deliver more pedagogically appropriate guidance, better support reasoning processes, and effectively resist premature answer disclosure.

[10] Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery

Jiatong Li,Weida Wang,Qinggang Zhang,Junxian Li,Di Zhang,Changmeng Zheng,Shufei Zhang,Xiaoyong Wei,Qing Li

Main category: cs.CL

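TL;DR: Mol-R1 is a framework for molecule discovery that aims to improve the explainability and reasoning performance of Explicit Long Chain-of-Thought (CoT) reasoning models. It combines a high-quality reasoning dataset with an iterative training strategy and outperforms existing baselines on text-based molecule generation.

Motivation: Long-CoT reasoning models are limited in knowledge-intensive domains such as molecule discovery because of the complexity of the domain knowledge and the scarcity of high-quality expert annotations. Mol-R1 aims to address both and to improve reasoning performance and explainability.

Contribution: 1. Proposes the Mol-R1 framework to improve the reasoning of Explicit Long-CoT models in molecule generation; 2. Introduces the PRID distillation strategy for building a high-quality reasoning dataset; 3. Develops the MoIA training strategy, which iteratively combines SFT and RPO.

Method: 1. Uses PRID (Prior Regulation via In-context Distillation) to generate reasoning traces guided by prior regulations; 2. Applies MoIA (Molecular Iterative Adaptation), which iterates between Supervised Fine-tuning (SFT) and Reinforced Policy Optimization (RPO); 3. Evaluates Mol-R1 on text-based molecule reasoning generation.

Result: Mol-R1 outperforms existing baselines on the text-based molecule reasoning generation task.

Insight: Combining a domain-specific distillation strategy with iterative training can improve the reasoning performance and explainability of Long-CoT models on complex tasks.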

Abstract: Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.

[11] Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression

Jadie Adams,Brian Hu,Emily Veenhuis,David Joy,Bharadwaj Ravichandran,Aaron Bray,Anthony Hoogs,Arslan Basharat

Main category: cs.CL

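TL;DR: The paper proposes a steerable pluralistic alignment method based on few-shot comparative regression that captures diverse user preferences, going beyond the limits of conventional scalar rewards.

Motivation: Conventional alignment techniques such as RLHF use scalar rewards that reflect user preferences only on average and cannot capture diverse user values. The paper addresses this with pluralistic alignment.

Contribution: 1. A steerable pluralistic model based on few-shot comparative regression; 2. Two new evaluation benchmarks adapted from the Moral Integrity Corpus (MIC) and HelpSteer2 datasets; 3. An approach that is interpretable and compatible with different attributes and LLMs.

Method: Uses in-context learning and reasoning grounded in a set of fine-grained attributes to compare response options and make aligned choices. The approach works few-shot and adapts to different attributes and LLMs.

Result: The method outperforms multiple baseline and state-of-the-art methods, and the two benchmarks demonstrate its applicability to value-aligned decision-making and reward modeling.

Insight: The work opens research directions in pluralistic alignment and supports fairer, more representative use of LLMs.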

Abstract: Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling a more fair and representative use of LLMs and advancing the state-of-the-art in ethical AI.

[12] InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

Peiji Li,Jiasheng Ye,Yongkang Chen,Yichuan Ma,Zijie Yu,Kedi Chen,Ganqu Cui,Haozhan Li,Jiacheng Chen,Chengqi Lyu,Wenwei Zhang,Linyang Li,Qipeng Guo,Dahua Lin,Bowen Zhou,Kai Chen

Main category: cs.CL

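TL;DR: The paper presents InternBootcamp, a framework for improving LLM reasoning that offers more than 1,000 diverse task environments with automated case generation and verification, and it validates the effectiveness of task scaling.

Motivation: Real-world reasoning scenarios are diverse and complex in ways that narrow-domain benchmarks cannot fully capture, so broader training and evaluation environments are needed.

Contribution: Presents the open-source InternBootcamp framework, with a large set of domain-diverse task environments, automated case generation, and integrated verification modules that support RL-based optimization, synthetic data generation, and model evaluation.

Method: Builds task environments through an automated agent workflow supplemented by manual validation protocols, which lets the task scope expand rapidly.

Result: The resulting 32B model achieves state-of-the-art results on Bootcamp-EVAL and performs well on other established benchmarks; performance improves consistently as more training tasks are added.

Insight: Task scaling is a promising route toward more general LLM reasoning.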

Abstract: Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely task scaling, over two orders of magnitude, offering a promising route toward a capable reasoning generalist.

[13] Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents

Zheng Wu,Heyuan Huang,Yanjia Yang,Yuanyi Song,Xingyu Lou,Weiwen Liu,Weinan Zhang,Jun Wang,Zhuosheng Zhang

Main category: cs.CL

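TL;DR: The paper proposes IFRAgent, a framework that analyzes explicit and implicit intention flows in human demonstrations to build personalized mobile-use agents, improving both intention alignment rate and step completion rate.

Motivation: Existing mobile-use agents focus on explicit intention flows (e.g., step sequences) and neglect implicit intention flows (e.g., personal preferences), which makes personalized agents hard to build.

Contribution: 1) Collects the MobileIAR dataset; 2) builds a standard operating procedure (SOP) library from explicit intention flows and a user habit repository from implicit intention flows; 3) combines retrieval-augmented generation with query rewriting to better align agents with human intent.

Method: IFRAgent comprises: 1) explicit intention flow analysis, which builds a query-level SOP vector library; 2) implicit intention flow analysis, which builds a user-level habit repository; 3) an SOP extractor combined with retrieval-augmented generation and a query rewriter, which produce a personalized query and SOP from a raw ambiguous query.

Result: IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in intention alignment rate and by an average of 5.30% (26.34% relative improvement) in step completion rate.

Insight: Implicit intention flows such as personal preferences matter for personalized agents; combining explicit and implicit intention flows improves agent performance.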

Abstract: As multimodal large language models advance rapidly, the automation of mobile tasks has become increasingly feasible through the use of mobile-use agents that mimic human interactions through graphical user interfaces. To further enhance mobile-use agents, previous studies employ demonstration learning to improve mobile-use agents from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the Intention Alignment Rate between mobile-use agents and humans, we first collect MobileIAR, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents’ understanding of human intent. Then we propose IFRAgent, a framework built upon Intention Flow Recognition from human demonstrations. IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOP), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages an SOP extractor combined with retrieval-augmented generation and a query rewriter to generate a personalized query and SOP from a raw ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30% (26.34% relative improvement). The codes are available at https://github.com/MadeAgents/Quick-on-the-Uptake.

[14] LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement

Rajmohan C,Sarthak Harne,Arvind Agarwal

Main category: cs.CL

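TL;DR: The paper proposes an efficient LLM-based text-to-table generation system that improves output quality through task decomposition and iterative refinement.

Motivation: Turning unstructured text into structured tables is a complex task. LLMs often struggle with ambiguous or domain-specific data, table structure, long inputs, and numerical reasoning, so more effective methods are needed.

Contribution: Two strategies: 1) decomposing the text-to-table task into manageable, guided sub-tasks; 2) refining the generated tables through iterative self-feedback.

Method: Applies task decomposition and iterative self-feedback to improve the generated table step by step.

Result: Achieves strong results compared to baselines on two public text-to-table generation datasets.

Insight: Task decomposition and iterative refinement can improve LLM performance on complex tasks, but the gains must be weighed against computational cost.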

Abstract: Transforming unstructured text into structured data is a complex task, requiring semantic understanding, reasoning, and structural comprehension. While Large Language Models (LLMs) offer potential, they often struggle with handling ambiguous or domain-specific data, maintaining table structure, managing long inputs, and addressing numerical reasoning. This paper proposes an efficient system for LLM-driven text-to-table generation that leverages novel prompting techniques. Specifically, the system incorporates two key strategies: breaking down the text-to-table task into manageable, guided sub-tasks and refining the generated tables through iterative self-feedback. We show that this custom task decomposition allows the model to address the problem in a stepwise manner and improves the quality of the generated table. Furthermore, we discuss the benefits and potential risks associated with iterative self-feedback on the generated tables while highlighting the trade-offs between enhanced performance and computational cost. Our methods achieve strong results compared to baselines on two complex text-to-table generation datasets available in the public domain.

[15] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

Junjie Ye,Changhao Jiang,Zhengyin Du,Yufei Xu,Xuesong Yao,Zhiheng Xi,Xiaoran Fan,Qi Zhang,Xuanjing Huang,Jiecao Chen

Main category: cs.CL

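TL;DR: The paper proposes an automated environment construction pipeline and a verifiable reward mechanism that improve LLM tool use without degrading general capabilities.

Motivation: Progress in LLM tool use is limited by the lack of efficient reinforcement learning frameworks; the main challenges are constructing stable training environments and designing verifiable reward mechanisms.

Contribution: An automated environment construction pipeline and a verifiable reward mechanism that together enable feedback-driven training and improve tool-use ability.

Method: Builds high-quality training environments through scenario decomposition, document generation, function integration, complexity scaling, and localized deployment; adds a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.

Result: Experiments on LLMs of varying scales show significantly better tool-use performance without degrading general capabilities, regardless of inference mode or training algorithm.

Insight: The authors' analysis suggests the gains come from improved context understanding and reasoning, driven by updates to lower-layer MLP parameters, which points to a direction for future model optimization.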

Abstract: Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models’ tool-use performance without degrading their general capabilities, regardless of inference modes or training algorithms. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models.
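
A reward that scores both the precision of tool use and the completeness of task execution could look like the sketch below. The abstract does not give the formula, so the definitions here are assumptions: precision is the fraction of issued calls that are valid, completeness is the fraction of required calls that were made, and the two are multiplied. The tool names are hypothetical.

```python
from typing import List, Set


def tool_use_reward(calls: List[str], valid_calls: Set[str], required_calls: Set[str]) -> float:
    """Assumed reward: precision of issued calls times completeness of required calls."""
    if not calls:
        return 0.0
    # Precision: share of issued calls that are valid in this environment.
    precision = sum(call in valid_calls for call in calls) / len(calls)
    # Completeness: share of required calls that actually appear in the trajectory.
    if required_calls:
        completeness = len(required_calls & set(calls)) / len(required_calls)
    else:
        completeness = 1.0
    return precision * completeness


# Hypothetical trajectory: one irrelevant call, one required call missing.
calls = ["search_flights", "book_flight", "get_weather"]
valid = {"search_flights", "book_flight", "send_receipt"}
required = {"search_flights", "book_flight", "send_receipt"}

reward = tool_use_reward(calls, valid, required)
print(round(reward, 3))  # 0.444
```

Since both factors are computed from the trajectory and the environment specification alone, the reward can be checked without a learned judge, which is the sense of "verifiable" used here.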

[16] An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

Yuren Hao,Xiang Wan,Chengxiang Zhai

Main category: cs.CL

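TL;DR: The paper presents a systematic framework that tests the robustness of LLM reasoning on advanced math problems using mathematically equivalent transformations, and finds that models are sensitive to non-mathematical perturbations.

Motivation: Conventional evaluation does not fully capture LLMs' mathematical reasoning ability, especially under linguistic and parametric variation.

Contribution: A new evaluation methodology based on mathematically equivalent transformations, and the PutnamGAP benchmark dataset, which exposes LLM sensitivity to non-mathematical perturbations.

Method: Generates problem variants through mathematically equivalent transformations and evaluates 18 commercial and open-source models on PutnamGAP.

Result: Performance degrades sharply on the variants, most of all on core-step-based variants; OpenAI's O3, for example, drops by 10.5 percentage points there.

Insight: The evaluation methodology exposes robustness gaps in LLMs and suggests directions for improving their mathematical reasoning.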

Abstract: In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs’ mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI’s flagship reasoning model, O3, scores 49% on the originals but drops by 4 percentage points on surface variants, and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.
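
As a small illustration of a mathematically equivalent surface transformation, the sketch below renames a variable in a problem statement and then applies the reported drops to O3's original score. Variable renaming is an assumed example of a surface variant, not necessarily one of the PutnamGAP transformations, and the quadratic problem is made up.

```python
import re


def rename_variable(problem: str, old: str, new: str) -> str:
    """Rename a single-letter variable wherever it is not part of a longer word."""
    # Lookarounds on letters only, so "5x" is renamed but "six" would not be.
    pattern = rf"(?<![A-Za-z]){re.escape(old)}(?![A-Za-z])"
    return re.sub(pattern, new, problem)


def accuracy_after_drop(original_acc: float, drop_points: float) -> float:
    """Accuracy on a variant set given the original accuracy and the drop in points."""
    return original_acc - drop_points


original = "Find all real x such that x^2 - 5x + 6 = 0."
variant = rename_variable(original, "x", "t")
print(variant)  # Find all real t such that t^2 - 5t + 6 = 0.

# Figures reported for O3 in the abstract: 49% on originals, drops of 4 and 10.5 points.
print(accuracy_after_drop(49.0, 4.0))   # 45.0
print(accuracy_after_drop(49.0, 10.5))  # 38.5
```

The renamed problem has exactly the same solution set, so any change in a model's accuracy between `original` and `variant` reflects sensitivity to surface form rather than to the mathematics.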

[17] ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

Keyu Chen,Zhifeng Shen,Daohai Yu,Haoqian Wu,Wei Wen,Jianfeng He,Ruizhi Qiao,Xing Sun

Main category: cs.CL

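TL;DR: The paper proposes Adaptive Serial-Parallel Decoding (ASPD), which exploits the intrinsic parallelism in LLM outputs to speed up inference substantially while preserving generation quality.

Details Motivation: Autoregressive decoding in large language models (LLMs) incurs high latency because tokens are predicted one at a time. Re-examining model outputs shows that some segments can be decoded in parallel, which motivates adaptive serial-parallel decoding.

Contribution: 1. ASPD, which speeds up LLM inference through automated construction of parallelizable data and an efficient parallel decoding mechanism; 2. a Hybrid Decoding Engine that switches seamlessly between serial and parallel modes.

Method: 1. A non-invasive pipeline automatically extracts and validates parallelizable structures from the responses of autoregressive models; 2. the Hybrid Decoding Engine keeps the KV cache reusable across mode switches to maximize computational efficiency.

Result: On Vicuna Bench, ASPD achieves up to 3.19x speedup (1.85x on average) with response quality within 1% of autoregressive models.

Insight: Exploiting a model’s intrinsic parallelism is an effective route to faster LLM inference, and ASPD offers a practical option for latency-sensitive applications such as customer service bots.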

Abstract: The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To enable efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, and Mathematical Reasoning demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.

[18] Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning

Mahmoud Salhab,Shameed Sait,Mohammad Abusheikh,Hasan Abusheikh

Main category: cs.CL

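TL;DR: The paper presents a training pipeline that combines weakly supervised pretraining with continual supervised fine-tuning to build a robust multi-dialectal Arabic automatic speech recognition (ASR) model, achieving state-of-the-art results despite limited labeled data.

Details Motivation: Building accurate ASR systems for low-resource languages such as Arabic is difficult because labeled data is scarce and dialect diversity adds linguistic complexity. Combining weak supervision with fine-tuning is a promising way to address this.

Contribution: 1) A scalable training pipeline that combines weakly supervised pretraining with continual supervised fine-tuning; 2) pretraining on 15,000 hours of weakly labeled speech covering Modern Standard Arabic (MSA) and Dialectal Arabic (DA) variants; 3) continual fine-tuning on a mixture of filtered weakly labeled data and a small, high-quality annotated dataset.

Method: 1) Stage one: pretrain on 15,000 hours of weakly labeled MSA and DA speech; 2) stage two: continual supervised fine-tuning on a mixture of filtered weakly labeled data and a small, high-quality annotated dataset.

Result: The approach ranked first in the multi-dialectal Arabic ASR challenge, showing that weak supervision paired with fine-tuning is effective for low-resource languages.

Insight: Combining weakly labeled data with high-quality annotations eases the shortage of labeled data for low-resource languages and is especially useful for dialect-rich languages.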

Abstract: Automatic speech recognition (ASR) plays a vital role in enabling natural human-machine interaction across applications such as virtual assistants, industrial automation, customer support, and real-time transcription. However, developing accurate ASR systems for low-resource languages like Arabic remains a significant challenge due to limited labeled data and the linguistic complexity introduced by diverse dialects. In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. In the first stage, we pretrain the model on 15,000 hours of weakly labeled speech covering both Modern Standard Arabic (MSA) and various Dialectal Arabic (DA) variants. In the subsequent stage, we perform continual supervised fine-tuning using a mixture of filtered weakly labeled data and a small, high-quality annotated dataset. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge. These findings highlight the effectiveness of weak supervision paired with fine-tuning in overcoming data scarcity and delivering high-quality ASR for low-resource, dialect-rich languages.

[19] Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation

Khondoker Ittehadul Islam,Gabriele Sarti

Main category: cs.CL

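TL;DR: The paper introduces Reveal-Bangla, a manually translated Bangla multi-step reasoning dataset for evaluating cross-lingual multi-step reasoning. Models benefit from reasoning context on non-binary questions but struggle to make effective use of Bangla reasoning steps.

Details Motivation: Evaluation of language models on multi-step reasoning has focused mainly on high-resource languages such as English, with little coverage of lower-resource languages such as Bangla.

Contribution: Reveal-Bangla, a manually translated Bangla multi-step reasoning dataset for cross-lingual evaluation, together with an analysis of how language models’ reasoning behavior differs across languages.

Method: The English Reveal dataset is manually translated into Bangla, and English-centric and Bangla-centric multilingual small language models are compared on the original and translated datasets.

Result: Reasoning context helps on non-binary questions, but models perform poorly at using Bangla reasoning steps.

Insight: Reasoning ability varies across languages, and how to make effective use of non-English reasoning steps needs further study.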

Abstract: Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models’ predictions, highlighting different trends across models and languages.

[20] Train Long, Think Short: Curriculum Learning for Efficient Reasoning

Hasan Abed Al Kader Hammoud,Kumail Alhamoud,Abed Hammoud,Elie Bou-Zeid,Marzyeh Ghassemi,Bernard Ghanem

Main category: cs.CL

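TL;DR: The paper proposes a curriculum learning approach to length-controlled reasoning that gradually tightens the token budget during training, improving efficiency and accuracy over fixed-budget baselines.

Details Motivation: Existing methods train with a fixed length budget, which does not exploit the natural progression from exploration to compression during learning. The authors aim to improve reasoning by adjusting the budget over the course of training.

Contribution: A curriculum learning strategy built on GRPO that gradually tightens the token budget and combines reward signals for task correctness, length efficiency, and formatting adherence, improving both reasoning efficiency and accuracy.

Method: Group Relative Policy Optimization (GRPO) with a reward function covering task correctness, length efficiency, and formatting adherence, trained under a curriculum of progressively tighter budgets.

Result: On GSM8K, MATH500, SVAMP, College Math, and GSM+, curriculum-based training outperforms fixed-budget training at the same final budget, with higher accuracy and markedly better token efficiency.

Insight: Progressive constraint (gradually tightening the budget) is a powerful inductive bias for training efficient reasoning models, and reward weighting and decay schedule design both affect performance.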

Abstract: Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: https://github.com/hammoudhasan/curriculum_grpo.
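
The curriculum described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors’ implementation: the step-wise exponential schedule, the linear over-budget penalty, the `<think>`/`<answer>` tag names, and the reward weights are all hypothetical choices; the abstract only specifies that budgets start generous and tighten, and that the reward balances correctness, length efficiency, and formatting adherence.

```python
import re


def token_budget(step, start=512, final=64, decay=0.5, interval=100):
    """Hypothetical schedule: scale the budget by `decay` every `interval`
    training steps, never going below `final`."""
    return max(final, int(start * decay ** (step // interval)))


def length_reward(n_tokens, budget):
    """1.0 within budget, then falling linearly to 0.0 at twice the budget
    (an assumed shape, chosen for simplicity)."""
    if n_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (n_tokens - budget) / budget)


def format_reward(text):
    """1.0 if the response is wrapped in the assumed structural tags."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0


def total_reward(text, n_tokens, is_correct, step, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of the three signals; the weights are placeholders."""
    w_correct, w_length, w_format = weights
    budget = token_budget(step)
    return (
        w_correct * float(is_correct)
        + w_length * length_reward(n_tokens, budget)
        + w_format * format_reward(text)
    )


response = "<think>2 + 2 = 4</think><answer>4</answer>"
print([token_budget(s) for s in (0, 100, 200, 300, 400)])
print(total_reward(response, n_tokens=100, is_correct=True, step=100))
```

In a GRPO setup, `total_reward` would be evaluated for every sampled response in a group before advantages are computed; the same response earns less length reward late in training than early on, which is what pushes the model toward shorter traces.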

[21] Retrospective Sparse Attention for Efficient Long-Context Generation

Seonghwan Choi,Beomseok Kang,Dongwon Jo,Jae-Joon Kim

Main category: cs.CL

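TL;DR: The paper proposes RetroAttention, a new KV cache update technique that retrospectively revises past attention outputs to improve the efficiency and quality of long-context generation.

Details Motivation: In long-context tasks such as reasoning, code generation, and multi-turn dialogue, the memory footprint and latency of the KV cache become the main bottleneck, and existing KV compression methods do not address the attention errors that accumulate during long decoding.

Contribution: RetroAttention, which maintains a lightweight output cache to revise past attention outputs, giving past queries access to more relevant context and improving generation accuracy.

Method: RetroAttention uses the new KV entries arriving from subsequent decoding steps to update and correct past attention outputs, breaking the fixed-attention-output paradigm.

Result: On long-generation benchmarks, RetroAttention outperforms existing KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%.

Insight: Continually revising past attention outputs mitigates cumulative error in long generation while keeping latency overhead low.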

Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%.

[22] A Survey on Training-free Alignment of Large Language Models

Birong Pan,Yongqi Li,Weiyu Zhang,Wenpeng Lu,Mayi Xu,Shen Zhou,Yuanyuan Zhu,Ming Zhong,Tieyun Qian

Main category: cs.CL

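TL;DR: This is the first systematic survey of training-free alignment methods for large language models (LLMs). It organizes them into pre-decoding, in-decoding, and post-decoding stages, examines their mechanisms and limitations for LLMs and multimodal LLMs (MLLMs), and outlines future research directions.

Details Motivation: Traditional alignment relies on resource-intensive fine-tuning, which can degrade model knowledge and is hard to apply when model access or compute is constrained. Training-free alignment offers an efficient and adaptable alternative.

Contribution: The first systematic categorization and review of training-free alignment methods, covering each stage for both LLMs and MLLMs and offering guidance for practitioners.

Method: Methods are grouped into pre-decoding, in-decoding, and post-decoding stages, and the mechanisms at each stage are examined, including in-context learning, decoding-time adjustments, and post-generation corrections.

Result: A summary of the strengths and limitations of training-free alignment methods, with directions for future research.

Insight: Training-free alignment is particularly advantageous when resources are limited or the model is not accessible, but its effectiveness on complex tasks needs further validation.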

Abstract: The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques, which leverage in-context learning, decoding-time adjustments, and post-generation corrections, offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.

[23] MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions

Zeyu Huang,Juyuan Wang,Longfeng Chen,Boyi Xiao,Leng Cai,Yawen Zeng,Jin Xu

Main category: cs.CL

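TL;DR: MVISU-Bench is a bilingual benchmark of 404 tasks across 137 mobile applications for evaluating mobile agents on Multi-App, Vague, Interactive, Single-App, and Unethical instructions. The proposed Aider module prompts dynamically and improves the overall success rate by 19.55%.

Details Motivation: Existing benchmarks are disconnected from the real world and do not cover users’ diverse and complex needs. Analysis of user questionnaires identified five typical task types, which calls for a new benchmark.

Contribution: 1. MVISU-Bench, covering real-world scenarios such as multi-app, vague, and interactive instructions; 2. Aider, a plug-and-play module that prompts dynamically and raises task success rates.

Method: The five task types (Multi-App, Vague, Interactive, Single-App, and Unethical) are defined from user questionnaires and used to build MVISU-Bench. Aider uses dynamic prompting to clarify user intent and mitigate risks.

Result: Aider improves the overall success rate by 19.55% over the current SOTA on MVISU-Bench, with gains of 53.52% and 29.41% on unethical and interactive instructions, respectively.

Insight: There is a gap between existing mobile agents and real user expectations, and dynamic prompting helps raise task success rates.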

Abstract: Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users’ automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five tasks: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Aider is easy to integrate into several frameworks and has successfully improved overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.

[24] CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization

Xinge Ye,Rui Wang,Yuchuan Wu,Victor Ma,Feiteng Fang,Fei Huang,Yongbin Li

Main category: cs.CL

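TL;DR: Comparative Policy Optimization (CPO) addresses reward ambiguity in role-playing dialogue by moving reward evaluation from sample-wise scoring to comparative group-wise scoring, and is paired with the CharacterArena framework for more robust and fair evaluation.

Details Motivation: Reward modeling based on independent sample-wise scoring faces subjective evaluation criteria and unstable reward signals on subjective tasks such as role-playing dialogue, whereas human evaluation combines explicit criteria with implicit comparative judgments.

Contribution: 1. Comparative Policy Optimization (CPO), which replaces sample-wise scoring with comparative group-wise scoring; 2. the CharacterArena evaluation framework, with two stages: multi-turn role-playing simulation and trajectory-level comparative evaluation.

Method: CPO shifts reward evaluation from sample-wise scoring to group-wise comparison. CharacterArena runs contextualized multi-turn role-playing simulations and compares the resulting trajectories, making subjective scoring more robust.

Result: Experiments on CharacterEval, CharacterBench, and CharacterArena show that CPO mitigates reward ambiguity and substantially improves dialogue quality.

Insight: Operationalizing subjective scoring through objective trajectory comparisons reduces contextual bias and gives fairer, more robust evaluation.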

Abstract: Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring. Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality.
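
One possible reading of the shift from sample-wise to group-wise scoring is sketched below in Python. The abstract does not give CPO’s exact procedure, so everything here is an assumption made for illustration: rewards are taken to be win rates from pairwise comparisons inside a sampled group, advantages are mean-centered, and `prefer` is a hypothetical judge (a toy length-based stand-in in the example, where a real system would call an LLM or human judge).

```python
from itertools import combinations


def groupwise_rewards(responses, prefer):
    """Score each response by its win rate against the others in its group.

    `prefer(a, b)` is a hypothetical judge that returns True when response
    `a` is judged better than response `b`.
    """
    n = len(responses)
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if prefer(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (n - 1) for w in wins]


def centered_advantages(rewards):
    """Subtract the group mean so that advantages sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Toy judge for demonstration only: the longer reply wins.
group = ["ok", "a fuller reply", "fine"]
rewards = groupwise_rewards(group, lambda a, b: len(a) > len(b))
advantages = centered_advantages(rewards)
print(rewards, advantages)
```

The point of the sketch is that no response receives an absolute score: each reward is defined only relative to the other samples in the same group, which is where the robustness to ambiguous absolute criteria would come from.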

[25] Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

Imalsha Puranegedara,Themira Chathumina,Nisal Ranathunga,Nisansa de Silva,Surangika Ranathunga,Mokanarangan Thayaparan

Main category: cs.CL

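TL;DR: The paper proposes an architecture that fuses all intermediate layers of a multilingual encoder, rather than only the final layer, to improve large language model performance on low-resource languages. It uses two strategies, Global Softmax weighting and a Transformer Softmax model, and improves results on several tasks.

Details Motivation: Large language models perform poorly on low-resource languages because their training is English-centric. Existing methods use only the final layer of the multilingual encoder, which limits the information passed to the LLM.

Contribution: A new architecture that fuses all intermediate layers of a multilingual encoder, with two weighting strategies (Global Softmax and Transformer Softmax), yielding clear gains on low-resource languages.

Method: 1. Global Softmax weights the overall importance of each layer; 2. the Transformer Softmax model learns token-specific weights. The fused representations are mapped into the LLM’s embedding space.

Result: The Transformer Softmax model outperforms the LangBridge baseline on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews. Sinhala classification accuracy rises from 71.66% to 75.86%, with clear improvements across Indic languages such as Tamil, Bengali, and Malayalam.

Insight: Training on English data alone, without parallel or multilingual data, improves multilingual capability, offering a scalable and data-efficient route for low-resource languages.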

Abstract: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM’s embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
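
The Global Softmax strategy can be illustrated with a short, dependency-free Python sketch. It assumes one scalar logit per encoder layer, normalized with a softmax and used to take a weighted sum of the layers’ hidden states; the function names, the nested-list tensor layout, and the fixed logits are illustrative assumptions (in training the logits would be learned parameters, and the token-specific Transformer Softmax variant is not shown).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_states, layer_logits):
    """Fuse per-layer hidden states with one global weight per layer.

    layer_states[l][t][d] is the hidden state of layer l, token t, dim d.
    Returns fused[t][d] = sum over l of w[l] * layer_states[l][t][d],
    where w = softmax(layer_logits).
    """
    weights = softmax(layer_logits)
    n_tokens = len(layer_states[0])
    dim = len(layer_states[0][0])
    fused = [[0.0] * dim for _ in range(n_tokens)]
    for w, states in zip(weights, layer_states):
        for t in range(n_tokens):
            for d in range(dim):
                fused[t][d] += w * states[t][d]
    return fused


# Two layers, one token, two dimensions; equal logits give a plain average.
states = [[[1.0, 2.0]], [[3.0, 4.0]]]
fused = fuse_layers(states, [0.0, 0.0])
print(fused)
```

With a very large logit on the last layer the output approaches the final-layer-only behavior the abstract attributes to LangBridge, so the fused variant contains that baseline as a special case.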

Anastasia Zhukova,Thomas Walton,Christian E. Matt,Bela Gipp

Main category: cs.CL

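TL;DR: The paper treats record linking for process-industry event logs as link prediction via cross-document coreference resolution (CDCR), enhanced with natural language inference (NLI) and semantic text similarity (STS), to address fragmented event logs, and substantially improves link prediction performance.

Details Motivation: Event logs in the process industry are recorded in fragmented form in shift books, so related records, such as equipment issues and their solutions, remain disconnected, which hinders recommending and reusing earlier solutions.

Contribution: 1) Frames record linking as a CDCR task and shifts it toward causal inference (CI); 2) proposes a record linking (RL) model, enhanced with NLI and STS, that clearly outperforms NLI- and STS-driven baselines; 3) adapts the model to the process industry’s specific text formats.

Method: A CDCR framework is enhanced with NLI and STS and shifted toward causal inference. The model operates at the passage level and accommodates both unstructured text and structured record attributes.

Result: The RL model outperforms the best NLI- and STS-driven baselines by 28% (11.43 points) and 27% (11.21 points), respectively.

Insight: With domain adaptation and added reasoning capabilities, state-of-the-art CDCR models can be applied effectively to the process industry, improving data quality and connectivity in shift logs. The approach may also inform knowledge management in other domains.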

Abstract: Knowledge management (KM) is vital in the process industry for optimizing operations, ensuring safety, and enabling continuous improvement through effective use of operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records, e.g., entries documenting issues related to equipment or processes and the corresponding solutions, may remain disconnected. This fragmentation hinders the recommendation of previous solutions to the users. To address this problem, we investigate record linking (RL) as link prediction, commonly studied in graph-based machine learning, by framing it as a cross-document coreference resolution (CDCR) task enhanced with natural language inference (NLI) and semantic text similarity (STS) and shifting it toward causal inference (CI). We adapt CDCR, traditionally applied in the news domain, into an RL model to operate at the passage level, similar to NLI and STS, while accommodating the process industry’s specific text formats, which contain unstructured text and structured record attributes. Our RL model outperformed the best versions of NLI- and STS-driven baselines by 28% (11.43 points) and 27% (11.21 points), respectively. Our work demonstrates how domain adaptation of the state-of-the-art CDCR models, enhanced with reasoning capabilities, can be effectively tailored to the process industry, improving data quality and connectivity in shift logs.

[27] AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Jason Chou,Ao Liu,Yuchi Deng,Zhiying Zeng,Tao Zhang,Haotian Zhu,Jianwei Cai,Yue Mao,Chenchen Zhang,Lingyun Tan,Ziyan Xu,Bohui Zhai,Hengyi Liu,Speed Zhu,Wiggin Zhou,Fengzong Lian

Main category: cs.CL

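TL;DR: The paper proposes AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotation, and introduces the AutoCodeBench benchmark for evaluating large language models on difficult multilingual code generation tasks.

Details Motivation: Existing code generation benchmarks rely on manual annotation, scale poorly, and have uneven language distribution, so they fall short for multilingual, high-complexity evaluation. The authors therefore propose an automated approach.

Contribution: 1) AutoCodeGen, an automated method for generating high-quality, multilingual, high-difficulty code generation datasets; 2) AutoCodeBench, with 3,920 problems evenly distributed across 20 programming languages; 3) an evaluation of more than 30 leading LLMs, showing that they struggle with complex multilingual tasks.

Method: Test inputs are generated with LLMs and test outputs are obtained through a multilingual sandbox, while reverse-order problem generation and multiple filtering steps ensure data quality. AutoCodeBench-Complete is additionally designed to assess the few-shot code generation of base models.

Result: Even the most advanced LLMs struggle on AutoCodeBench’s difficult multilingual tasks, highlighting the limitations of current models.

Insight: The paper stresses the importance of automated dataset generation and calls on the community to focus on more challenging and practical multilingual code generation scenarios.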

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. Besides, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.

[28] OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

Weixuan Wang,Dongge Han,Daniel Madrigal Diaz,Jin Xu,Victor Rühle,Saravan Rajmohan

Main category: cs.CL

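TL;DR: The paper presents OdysseyBench, a benchmark for evaluating LLM agents on long-horizon, complex office application workflows, with tasks derived from real-world use cases and newly synthesized tasks, and HomerAgents, an automated framework for generating such tasks.

Details Motivation: Existing benchmarks focus mainly on atomic tasks and fail to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios.

Contribution: 1. OdysseyBench, comprising OdysseyBench+ (300 tasks derived from real-world use cases) and OdysseyBench-Neo (302 newly synthesized tasks); 2. HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks.

Method: HomerAgents generates long-horizon workflow tasks automatically through systematic environment exploration, task generation, and dialogue synthesis.

Result: OdysseyBench effectively challenges state-of-the-art LLM agents and gives a more accurate assessment of their capabilities in complex scenarios than atomic task benchmarks.

Insight: Long-horizon, complex workflow tasks better reflect what agents can do in real productivity settings and can help advance research in this area.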

Abstract: Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires an agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.

[29] Complex Logical Instruction Generation

Mian Zhang,Shujian Liu,Sixun Dong,Ming Yin,Yebowen Hu,Xun Wang,Steven Ma,Song Wang,Sathish Reddy Indurthi,Haoyun Deng,Zhiyu Zoey Chen,Kaiqiang Song

Main category: cs.CL

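TL;DR: The paper proposes LogicIFGen, an automated framework for generating verifiable logic-rich instructions from code functions, and LogicIFEval, a benchmark of 426 complex logical instructions. Experiments show that current state-of-the-art LLMs follow these instructions poorly.

Details Motivation: As tasks grow more complex, the logic structures embedded in natural language instructions become more intricate, yet how well LLMs perform on such logic-rich instructions remains under-explored.

Contribution: 1. LogicIFGen, a framework that automatically generates verifiable logic-rich instructions from code functions; 2. LogicIFEval, a benchmark of 426 complex logical instructions; 3. evidence that current state-of-the-art LLMs fall short when following complex logical instructions.

Method: LogicIFGen generates verifiable natural language instructions from code functions, which naturally express logic such as conditionals, nesting, recursion, and function calls. LogicIFEval is built by applying LogicIFGen to a curated collection of complex code functions and is used to evaluate instruction following.

Result: Current state-of-the-art LLMs perform poorly on LogicIFEval, with most models correctly following fewer than 60% of the instructions.

Insight: LLMs still have significant deficiencies in handling complex logical instructions, and their instruction-following ability needs further improvement.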

Abstract: Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF

[30] Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

Wen Wang,Bozhen Fang,Chenchen Jing,Yongliang Shen,Yangyi Shen,Qiuyu Wang,Hao Ouyang,Hao Chen,Chunhua Shen

Main category: cs.CL

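TL;DR: The paper identifies a temporal oscillation phenomenon in diffusion language models and proposes two methods, Temporal Self-Consistency Voting and Temporal Consistency Reinforcement, that exploit temporal consistency to improve generation quality, with strong experimental gains.

Details Motivation: Current decoding strategies for diffusion language models look only at the final output and discard the rich information in intermediate predictions. Correct answers often appear at intermediate steps but are overwritten by later denoising steps, which calls for a better approach.

Contribution: 1) Identifies the temporal oscillation phenomenon; 2) introduces two methods: Temporal Self-Consistency Voting, a training-free decoding strategy, and Temporal Consistency Reinforcement, a post-training method that uses Temporal Semantic Entropy (TSE) as a reward.

Method: 1) At test time, predictions from across the denoising steps are aggregated by voting; 2) TSE is used as a reward signal in reinforcement learning to encourage temporally stable generations.

Result: Using the negative TSE reward alone gives an average improvement of 24.7% on the Countdown dataset. Combined with the accuracy reward, absolute gains are 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown.

Insight: Temporal dynamics hold untapped potential in diffusion language models, and simple temporal-consistency methods can improve generation quality markedly.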

Abstract: Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
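
A minimal Python sketch of voting across denoising steps follows. The abstract only states that predictions are aggregated across steps to select the most consistent output, so the function name, the use of `None` for steps with no extractable answer, and the optional step-weighting function are assumptions made for illustration rather than the paper’s exact scheme.

```python
from collections import defaultdict


def temporal_vote(step_answers, weight=lambda step, total: 1.0):
    """Return the answer with the largest total weight across denoising steps.

    step_answers[i] is the answer decoded from the intermediate prediction at
    step i, or None if no answer could be extracted at that step.
    `weight(step, total)` is a hypothetical weighting hook; the default
    counts every step equally.
    """
    totals = defaultdict(float)
    n_steps = len(step_answers)
    for step, answer in enumerate(step_answers):
        if answer is not None:
            totals[answer] += weight(step, n_steps)
    if not totals:
        return None
    return max(totals, key=totals.get)


# The correct answer "15" appears mid-process but is overwritten at the end.
trajectory = ["12", "15", "15", "15", "12"]
print(trajectory[-1])             # what final-step decoding would return
print(temporal_vote(trajectory))  # what voting over all steps returns

# Variant that trusts later steps more (linear ramp, an assumed choice).
print(temporal_vote(trajectory, weight=lambda s, n: (s + 1) / n))
```

The toy trajectory mirrors the temporal oscillation the abstract describes: final-step decoding returns the overwritten answer, while aggregating over steps recovers the one that was stable for most of the process.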

cs.CV [Back]

[31] Evaluation of State-of-the-Art Deep Learning Techniques for Plant Disease and Pest Detection

Saptarshi Banerjee,Tausif Mallick,Amlan Chakroborty,Himadri Nath Saha,Nityananda T. Takur

Main category: cs.CV

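TL;DR: The paper reviews recent deep learning techniques for plant disease and pest detection, organizes them into five categories, and shows the advantage of AI-based methods. Vision transformers such as the Hierarchical Vision Transformer (HvT) surpass older approaches in accuracy.

Details Motivation: Plant diseases and pests have a major impact on crop production and cause economic losses. Manual identification is slow and imprecise, so modern AI techniques are needed to improve detection.

Contribution: 1. A five-category taxonomy of plant disease and pest detection methods; 2. evidence that AI-based methods, vision transformers in particular, offer clear advantages in speed and accuracy (for example, HvT with accuracy exceeding 99.3%).

Method: 1. Systematic categorization (hyperspectral imaging, non-visualization techniques, visualization approaches, modified deep learning architectures, and transformer models); 2. comparative analysis of deep learning architectures such as HvT and MobileNetV3.

Result: Modern AI-based methods outperform older image analysis techniques. HvT reaches accuracy exceeding 99.3% in plant disease detection, outperforming architectures such as MobileNetV3.

Insight: Vision transformers stand out in plant disease and pest detection. Future work can focus on system design challenges and further model optimization.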

Abstract: Addressing plant diseases and pests is critical for enhancing crop production and preventing economic losses. Recent advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) have significantly improved the precision and efficiency of detection methods, surpassing the limitations of manual identification. This study reviews modern computer-based techniques for detecting plant diseases and pests from images, including recent AI developments. The methodologies are organized into five categories: hyperspectral imaging, non-visualization techniques, visualization approaches, modified deep learning architectures, and transformer models. This structured taxonomy provides researchers with detailed, actionable insights for selecting advanced state-of-the-art detection methods. A comprehensive survey of recent work and comparative studies demonstrates the consistent superiority of modern AI-based approaches, which often outperform older image analysis methods in speed and accuracy. In particular, vision transformers such as the Hierarchical Vision Transformer (HvT) have shown accuracy exceeding 99.3% in plant disease detection, outperforming architectures like MobileNetV3. The study concludes by discussing system design challenges, proposing solutions, and outlining promising directions for future research.

[32] Designing Object Detection Models for TinyML: Foundations, Comparative Analysis, Challenges, and Emerging Solutions

Christophe EL Zeinaty,Wassim Hamidouche,Glenn Herrou,Daniel Menard

Main category: cs.CV

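TL;DR: The paper surveys optimization techniques for deploying object detection (OD) models on resource-constrained IoT devices, filling a gap left by earlier surveys that overlook the challenges of deployment in TinyML environments.

Details Motivation: With IoT devices proliferating rapidly (predicted to surpass 150 billion by 2030), deploying efficient deep learning OD models on low-power microcontrollers is a key challenge. TinyML can help, but existing surveys pay too little attention to its optimization challenges.

Contribution: A detailed analysis of the key optimization techniques for deploying OD models in TinyML environments, including quantization, pruning, knowledge distillation, and neural architecture search, and a comparison of the key performance indicators (KPIs) of existing implementations.

Method: The paper covers both theoretical approaches and practical implementations of OD model optimization and their deployment on microcontroller devices. It also provides a public repository that tracks developments in the field.

Result: The paper compares the performance of existing OD implementations and summarizes how mature current solutions are in terms of prediction accuracy and efficiency.

Insight: The survey is a useful resource for academia and industry, fills a gap at the intersection of TinyML and OD, and points to directions for future work.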

Abstract: Object detection (OD) has become vital for numerous computer vision applications, but deploying it on resource-constrained IoT devices presents a significant challenge. These devices, often powered by energy-efficient microcontrollers, struggle to handle the computational load of deep learning-based OD models. This issue is compounded by the rapid proliferation of IoT devices, predicted to surpass 150 billion by 2030. TinyML offers a compelling solution by enabling OD on ultra-low-power devices, paving the way for efficient and real-time processing at the edge. Although numerous survey papers have been published on this topic, they often overlook the optimization challenges associated with deploying OD models in TinyML environments. To address this gap, this survey paper provides a detailed analysis of key optimization techniques for deploying OD models on resource-constrained devices. These techniques include quantization, pruning, knowledge distillation, and neural architecture search. Furthermore, we explore both theoretical approaches and practical implementations, bridging the gap between academic research and real-world edge artificial intelligence deployment. Finally, we compare the key performance indicators (KPIs) of existing OD implementations on microcontroller devices, highlighting the achieved maturity level of these solutions in terms of both prediction accuracy and efficiency. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/christophezei/Optimizing-Object-Detection-Models-for-TinyML-A-Comprehensive-Survey.
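
Of the techniques the survey lists, quantization is the easiest to show in a few lines. The sketch below is a generic, dependency-free Python illustration of affine (asymmetric) 8-bit quantization with one scale and zero point per tensor; it is not taken from the survey, and the function names and the per-tensor granularity are assumptions. Real TinyML toolchains add calibration, per-channel scales, and integer-only kernels.

```python
def quantize_int8(values):
    """Affine quantization of a list of floats to int8.

    Returns (quantized ints, scale, zero_point) such that
    value ~= (q - zero_point) * scale.
    """
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 exactly representable
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
print(q, scale, zero_point)
print(restored)
```

Each weight is stored in one byte instead of four, at the cost of a reconstruction error bounded by about one quantization step (`scale`), which is the accuracy versus footprint trade-off that matters on microcontrollers.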

[33] MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

Qian Wang,Ziqi Huang,Ruoxi Jia,Paul Debevec,Ning Yu

Main category: cs.CV

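TL;DR: MAViS is a multi-agent collaborative framework for long-sequence video storytelling that uses staged multi-agent collaboration and the 3E Principle to improve assistive capability, visual quality, and expressiveness.

Details Motivation: Current long-sequence video generation frameworks fall short in assistive capability, visual quality, and expressiveness, and lack multimodal design output.

Contribution: Proposes the MAViS framework, which optimizes the video generation pipeline through multi-agent collaboration and the 3E Principle (Explore, Examine, Enhance), and introduces Script Writing Guidelines to improve compatibility with generative models.

Method: MAViS uses multiple agents across stages (script writing, shot designing, character modeling, and so on); each stage follows the 3E Principle and draws on diverse generative models and tools.

Result: MAViS reaches state-of-the-art performance in assistive capability, visual quality, and video expressiveness, and supports multimodal output (videos with narratives and background music).

Insight: Modular design and multi-agent collaboration can markedly improve the overall capability of long-sequence video generation, and guidelines tailored to generative models can address compatibility issues.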

Abstract: Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle – Explore, Examine, and Enhance – to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output – videos with narratives and background music.

[34] MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization

Ankan Deria,Dwarikanath Mahapatra,Behzad Bozorgtabar,Mohna Chakraborty,Snehashis Chakraborty,Sudipta Roy

Main category: cs.CV

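TL;DR: MuGa-VTON is a diffusion-transformer-based multi-garment virtual try-on framework that jointly models upper and lower garments together with person identity to produce high-fidelity, identity-preserving try-on results.

Details Motivation: Existing virtual try-on methods typically handle upper and lower garments separately, rely on heavy preprocessing, and struggle to preserve person-specific cues such as tattoos, accessories, and body shape.

Contribution: Proposes MuGa-VTON, a unified multi-garment diffusion framework with three key modules: GRM (capturing garment semantics), PRM (encoding identity and pose), and A-DiT (feature fusion).

Method: Uses a diffusion transformer to jointly model garment and person features, and supports customization through text prompts.

Result: Performs strongly on the VITON-HD and DressCode benchmarks, generating high-fidelity, identity-preserving try-on images.

Insight: Jointly modeling garment and person features can improve the flexibility and realism of virtual try-on, making it suitable for practical applications.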

Abstract: Virtual try-on seeks to generate photorealistic images of individuals in desired garments, a task that must simultaneously preserve personal identity and garment fidelity for practical use in fashion retail and personalization. However, existing methods typically handle upper and lower garments separately, rely on heavy preprocessing, and often fail to preserve person-specific cues such as tattoos, accessories, and body shape, resulting in limited realism and flexibility. To this end, we introduce MuGa-VTON, a unified multi-garment diffusion framework that jointly models upper and lower garments together with person identity in a shared latent space. Specifically, we propose three key modules: the Garment Representation Module (GRM) for capturing both garment semantics, the Person Representation Module (PRM) for encoding identity and pose cues, and the A-DiT fusion module, which integrates garment, person, and text-prompt features through a diffusion transformer. This architecture supports prompt-based customization, allowing fine-grained garment modifications with minimal user input. Extensive experiments on the VITON-HD and DressCode benchmarks demonstrate that MuGa-VTON outperforms existing methods in both qualitative and quantitative evaluations, producing high-fidelity, identity-preserving results suitable for real-world virtual try-on applications.

[35] CObL: Toward Zero-Shot Ordinal Layering without User Prompting

Aneel Damaraju,Dean Hazineh,Todd Zickler

Main category: cs.CV

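TL;DR: CObL is a diffusion-based architecture that infers an occlusion-ordered stack of object layers from an image, reconstructs multiple occluded objects without user prompting, and generalizes zero-shot to real-world scenes.

Details Motivation: Vision benefits from grouping pixels into objects and understanding their spatial relationships, including depth and occlusion. Existing methods usually require user prompts or a known number of objects; CObL aims to infer object layers without these priors.

Contribution: Proposes the CObL architecture, which reconstructs multiple occluded objects without user prompting or knowing the number of objects, and uses Stable Diffusion as a prior to support zero-shot generalization to real scenes.

Method: A diffusion model generates a stack of object layers in parallel, using Stable Diffusion as a prior for natural objects and inference-time guidance to ensure the layers composite back to the input image. The model is trained on a synthetic dataset and generalizes to real scenes.

Result: After training on synthetic data, CObL generalizes zero-shot to real-world scenes and reconstructs multiple occluded objects without user intervention.

Insight: Combining diffusion models with an object-layer stack enables multi-object reconstruction without such priors, offering a new approach to scene understanding and generation.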

Abstract: Vision benefits from grouping pixels into objects and understanding their spatial relationships, both laterally and in depth. We capture this with a scene representation comprising an occlusion-ordered stack of “object layers,” each containing an isolated and amodally-completed object. To infer this representation from an image, we introduce a diffusion-based architecture named Concurrent Object Layers (CObL). CObL generates a stack of object layers in parallel, using Stable Diffusion as a prior for natural objects and inference-time guidance to ensure the inferred layers composite back to the input image. We train CObL using a few thousand synthetically-generated images of multi-object tabletop scenes, and we find that it zero-shot generalizes to photographs of real-world tabletops with varying numbers of novel objects. In contrast to recent models for amodal object completion, CObL reconstructs multiple occluded objects without user prompting and without knowing the number of objects beforehand. Unlike previous models for unsupervised object-centric representation learning, CObL is not limited to the world it was trained in.
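
The requirement that inferred layers "composite back to the input image" can be pictured with standard alpha compositing. The sketch below is an illustration only, assuming grayscale pixels stored in flat lists and layers given back to front; `composite_over` and `reconstruction_error` are hypothetical names, and the abstract does not specify how CObL's guidance term is actually computed.

```python
def composite_over(layers, background=0.0):
    """Composite (values, alphas) layers, ordered back to front, with the 'over' operator."""
    num_pixels = len(layers[0][0])
    out = [background] * num_pixels
    for values, alphas in layers:
        # Each nearer layer covers what is behind it in proportion to its alpha.
        out = [a * v + (1.0 - a) * o for v, a, o in zip(values, alphas, out)]
    return out


def reconstruction_error(layers, image):
    """Mean squared error between the composited stack and the observed image."""
    composed = composite_over(layers)
    return sum((c - p) ** 2 for c, p in zip(composed, image)) / len(image)


table = ([0.2, 0.2, 0.2], [1.0, 1.0, 1.0])  # fully opaque background layer
mug = ([0.9, 0.9, 0.9], [0.0, 1.0, 0.0])    # object covering only the middle pixel
error = reconstruction_error([table, mug], [0.2, 0.9, 0.2])
```

A stack that explains the image exactly gives zero error; a guidance signal of this kind would penalize stacks whose composite drifts from the input.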

[36] Re:Verse – Can Your VLM Read a Manga?

Aaditya Baranwal,Madhav Kataria,Naitik Agrawal,Yogesh S Rawat,Shruti Vyas

Main category: cs.CV

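TL;DR: The paper exposes limitations of current vision-language models (VLMs) in manga narrative understanding, particularly in temporal causality and cross-panel cohesion. Using a new evaluation framework, it finds that existing models lack story-level intelligence and struggle with non-linear narratives and long-sequence reasoning.

Details Motivation: Although recent large multimodal models excel at interpreting individual panels, they show clear deficiencies on long narratives that require temporal causality and cross-panel cohesion. The paper aims to systematically evaluate VLMs' story understanding.

Contribution: 1. Proposes a new evaluation framework combining multimodal annotation and cross-modal embedding analysis; 2. Conducts the first systematic study of long-form narrative understanding in VLMs along three core axes: generative storytelling, contextual dialogue grounding, and temporal reasoning.

Method: The methodology includes (i) a rigorous annotation protocol using aligned light novel text; (ii) comprehensive evaluation across multiple reasoning paradigms; and (iii) cross-modal similarity analysis revealing misalignments in VLMs' joint representations. Experiments use 308 annotated panels from the Re:Zero manga.

Result: The results show that current models lack genuine story-level intelligence, performing poorly on non-linear narratives, character consistency, and causal inference over long sequences.

Insight: The study lays a foundation for evaluating narrative intelligence, reveals the limitations of multimodal models in deep understanding of discrete visual narratives, and suggests directions for improvement.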

Abstract: Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs’ joint representations. Applying this framework to Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning. Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and practical methodology for evaluating narrative intelligence, while providing actionable insights into the capability of deep sequential understanding of Discrete Visual Narratives beyond basic recognition in Multimodal Models.

[37] VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models

Mansi Phute,Ravikumar Balakrishnan

Main category: cs.CV

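TL;DR: VISOR proposes output redirection through visual inputs: optimized images provide efficient behavioral control, avoid the limitations of conventional methods, and expose a security vulnerability to visual attacks.

Details Motivation: Current behavioral control methods for vision-language models (VLMs), such as system prompts or activation steering vectors, are easily detectable, invasive, or of limited effectiveness; a more efficient and less perceptible solution is needed.

Contribution: 1. Proposes VISOR, which achieves behavioral control through optimized visual inputs; 2. Shows performance advantages over conventional methods; 3. Reveals the security risk of visual attacks.

Method: VISOR crafts universal steering images that induce target activation patterns in the model, achieving behavioral control with only a 150KB image input.

Result: On refusal, sycophancy, and survival-instinct tasks, VISOR matches or exceeds conventional methods (behavioral shifts of up to 25%) while leaving other tasks unaffected (99.9% of performance retained).

Insight: Visual inputs can be an efficient means of behavioral control, but they also pose new challenges for model safety; defenses against visual steering attacks are needed.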

Abstract: Vision Language Models (VLMs) are increasingly being used in a broad range of applications, bringing their security and behavioral control to the forefront. While existing approaches for behavioral control or output redirection, like system prompting in VLMs, are easily detectable and often ineffective, activation-based steering vectors require invasive runtime access to model internals–incompatible with API-based services and closed-source deployments. We introduce VISOR (Visual Input-based Steering for Output Redirection), a novel method that achieves sophisticated behavioral control through optimized visual inputs alone. By crafting universal steering images that induce target activation patterns, VISOR enables practical deployment across all VLM serving modalities while remaining imperceptible compared to explicit textual instructions. We validate VISOR on LLaVA-1.5-7B across three critical alignment tasks: refusal, sycophancy and survival instinct. A single 150KB steering image matches steering vector performance within 1-2% for positive behavioral shifts while dramatically exceeding it for negative steering–achieving up to 25% shifts from baseline compared to steering vectors’ modest changes. Unlike system prompting (3-4% shifts), VISOR provides robust bidirectional control while maintaining 99.9% performance on 14,000 unrelated MMLU tasks. Beyond eliminating runtime overhead and model access requirements, VISOR exposes a critical security vulnerability: adversaries can achieve sophisticated behavioral manipulation through visual channels alone, bypassing text-based defenses. Our work fundamentally re-imagines multimodal model control and highlights the urgent need for defenses against visual steering attacks.

[38] Calibration Attention: Instance-wise Temperature Scaling for Vision Transformers

Wenhao Liang,Wei Emma Zhang,Lin Yue,Miao Xu,Olaf Maennel,Weitong Chen

Main category: cs.CV

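TL;DR: The paper proposes CalAttn, a module that learns an adaptive per-instance temperature from the Vision Transformer (ViT) CLS token, substantially reducing calibration error with very few additional parameters.

Details Motivation: Probability calibration of Vision Transformers is critical in risk-sensitive applications, while standard global temperature scaling requires a held-out validation set and has limited effect. The authors therefore propose an adaptive, instance-level solution.

Contribution: 1. Proposes the CalAttn module, which learns a per-instance temperature directly from the ViT's CLS token; 2. Substantially reduces calibration error on multiple datasets; 3. Adds very few parameters (under 0.1%) without sacrificing accuracy.

Method: A CalAttn module embedded in the ViT learns an adaptive temperature for each input instance, replacing conventional global temperature scaling.

Result: Compared with conventional methods, CalAttn reduces calibration error by up to 4x on CIFAR-10/100, MNIST, and other datasets, while adding under 0.1% additional parameters.

Insight: Instance-level temperature scaling is more flexible and effective than global scaling, and the learned temperatures cluster tightly around 1.0, which suggests the learned scaling is well behaved.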

Abstract: Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard fix, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set. We introduce Calibration Attention (CalAttn), a drop-in module that learns an adaptive, per-instance temperature directly from the ViT’s CLS token. Across CIFAR-10/100, MNIST, Tiny-ImageNet, and ImageNet-1K, CalAttn reduces calibration error by up to 4x on ViT-224, DeiT, and Swin, while adding under 0.1 percent additional parameters. The learned temperatures cluster tightly around 1.0, in contrast to the large global values used by standard temperature scaling. CalAttn is simple, efficient, and architecture-agnostic, and yields more trustworthy probabilities without sacrificing accuracy. Code: https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-
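
The contrast between a single global temperature and a per-instance one can be sketched as follows. This is an illustration under stated assumptions, not the authors' module: the temperature head here is a hypothetical linear map of the CLS token followed by a softplus (so the temperature stays positive), and the names `instance_temperature` and `calibrated_probs` are invented for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def instance_temperature(cls_token, weights, bias):
    """Hypothetical head: linear map of the CLS token, then softplus so that T > 0."""
    score = sum(w * x for w, x in zip(weights, cls_token)) + bias
    return math.log1p(math.exp(score)) + 1e-6


def calibrated_probs(logits, cls_token, weights, bias):
    """Divide the logits by a temperature predicted for this instance, then apply softmax."""
    temperature = instance_temperature(cls_token, weights, bias)
    return softmax([z / temperature for z in logits])


probs = calibrated_probs([2.0, 1.0, 0.1], [0.5, -0.2], [0.3, 0.1], 0.0)
```

Because every logit is divided by the same positive number, the predicted class never changes; only the confidence does, which is why this family of methods can improve calibration without affecting accuracy.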

[39] Boosting Generic Semi-Supervised Medical Image Segmentation via Diverse Teaching and Label Propagation

Wei Li,Pengcheng Zhou,Linye Ma,Wenyi Zhao,Huihua Yang

Main category: cs.CV

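TL;DR: The paper proposes DTLP-Net, a generic semi-supervised medical image segmentation framework that uses diverse teacher models and label propagation to address domain shift and pseudo-label reliability.

Details Motivation: Limited annotation and domain shift degrade semi-supervised learning in medical image segmentation; conventional methods are task-specific and limited in performance.

Contribution: Proposes the DTLP-Net framework, which generates diverse pseudo-labels with two teacher models and combines label propagation with data augmentation to support the SSMIS, UMDA, and Semi-MDG tasks in a single framework.

Method: Uses a single student model and two diverse teacher models (one with decoupled training, one momentum-updated), together with data augmentation and label propagation to improve pseudo-label quality.

Result: DTLP-Net is evaluated on five benchmark datasets and shows notable improvements over existing methods.

Insight: Diverse teacher models and label propagation are key to improving semi-supervised medical image segmentation, especially under domain shift.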

Abstract: Limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, giving rise to derivative scenarios such as semi-supervised medical image segmentation (SSMIS), semi-supervised medical domain generalization (Semi-MDG), and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation, and error accumulation hinders the effective utilization of unlabeled data and limits further improvements, resulting in suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in generating reliable pseudo-labels for the unlabeled data in the presence of domain shift with respect to the labeled data, and in increasing the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boost generic semi-supervised medical image segmentation. Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouples the training process with labeled and unlabeled data. The second teacher is momentum-updated periodically, thus generating reliable yet diverse pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn global and local knowledge. In addition, to further capture voxel-level correlations, we propose label propagation to enhance model robustness. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.
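
A momentum-updated teacher is commonly implemented as an exponential moving average (EMA) of the student's weights. The sketch below assumes that reading and represents weights as a plain dictionary of floats; the function names, the momentum value, and the update period are hypothetical and are not given in the abstract.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the student weight by an exponential moving average."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def maybe_update_teacher(teacher, student, step, period=10, momentum=0.99):
    """Apply the EMA update only every `period` training steps (hypothetical schedule)."""
    if step % period == 0:
        return ema_update(teacher, student, momentum)
    return teacher


teacher = {"conv.weight": 1.0}
student = {"conv.weight": 0.0}
teacher = maybe_update_teacher(teacher, student, step=10, momentum=0.9)
```

Updating slowly and only periodically keeps the teacher more stable than the student, which is one way to obtain pseudo-labels that differ from those of a teacher trained directly on the data.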

[40] Unlocking the Potential of Diffusion Priors in Blind Face Restoration

Yunqi Miao,Zhiyu Qu,Mingqi Gao,Changrui Chen,Jifei Song,Jungong Han,Jiankang Deng

Main category: cs.CV

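TL;DR: The paper proposes FLIPNET, a unified network that switches between two modes (Restoration and Degradation) to address challenges in blind face restoration (BFR), bridging the gaps between high- and low-quality images and between synthesized and real-world images.

Details Motivation: The vanilla diffusion model cannot be applied directly to BFR because of two gaps: 1) between high-quality (HQ) and low-quality (LQ) images; 2) between synthesized and real-world images. These gaps limit the potential of diffusion models in BFR.

Contribution: The main contribution is FLIPNET, a unified dual-mode network that addresses both gaps. It outperforms existing diffusion-prior-based BFR methods in authenticity and fidelity.

Method: FLIPNET operates in two modes: 1) Restoration mode, which gradually integrates BFR-oriented features and face embeddings from LQ images to achieve authentic and faithful restoration; 2) Degradation mode, which synthesizes real-world-like degraded images based on knowledge learned from real-world degradation datasets.

Result: Experiments show that FLIPNET performs well on benchmark datasets, outperforming previous diffusion-prior-based BFR methods and surpassing the naive degradation model in modeling real-world degradations.

Insight: Combining restoration and degradation modes in one unified network shows promise and offers a new approach to adapting diffusion models to complex real-world applications.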

Abstract: Although the diffusion prior is rising as a powerful solution for blind face restoration (BFR), the inherent gap between the vanilla diffusion model and BFR settings hinders its seamless adaptation. The gap mainly stems from the discrepancy between 1) high-quality (HQ) and low-quality (LQ) images and 2) synthesized and real-world images. The vanilla diffusion model is trained on images with little or no degradation, whereas BFR handles moderately to severely degraded images. Additionally, LQ images used for training are synthesized by a naive degradation model with limited degradation patterns, which fails to simulate complex and unknown degradations in real-world scenarios. In this work, we use a unified network FLIPNET that switches between two modes to resolve specific gaps. In Restoration mode, the model gradually integrates BFR-oriented features and face embeddings from LQ images to achieve authentic and faithful face restoration. In Degradation mode, the model synthesizes real-world-like degraded images based on the knowledge learned from real-world degradation datasets. Extensive evaluations on benchmark datasets show that our model 1) outperforms previous diffusion-prior-based BFR methods in terms of authenticity and fidelity, and 2) outperforms the naive degradation model in modeling the real-world degradations.

[41] Think as Cardiac Sonographers: Marrying SAM with Left Ventricular Indicators Measurements According to Clinical Guidelines

Tuo Liu,Qinghan Yang,Yu Zhang,Rongjun Ge,Yang Chen,Guangquan Zhou

Main category: cs.CV

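TL;DR: This paper proposes AutoSAME, a framework that combines a vision foundation model (SAM) with left ventricular (LV) indicator measurement so that measurements follow clinical echocardiography guidelines. With filtered cross-branch attention (FCBA) and spatial-guided prompt alignment (SGPA), AutoSAME performs well on both segmentation and landmark localization.

Details Motivation: LV indicator measurements that follow clinical guidelines are important for diagnosing cardiovascular disease. Existing automated methods struggle to capture generic visual representations because training datasets are small, and vision foundation models such as SAM are good at segmentation but cannot identify key anatomical points. A solution that combines segmentation with landmark localization is therefore needed.

Contribution: 1. Proposes the AutoSAME framework, which combines SAM's segmentation capability with landmark localization; 2. Introduces FCBA, which enhances heatmap regression from a frequency-domain perspective; 3. Proposes SGPA, which uses spatial prior knowledge to generate prompt embeddings and improve dense prediction accuracy.

Method: AutoSAME combines SAM's segmentation capability with landmark localization, uses FCBA to draw on the comprehensive features of the segmentation branch to enhance heatmap regression, and uses SGPA to automatically generate prompt embeddings from spatial properties of the LV.

Result: Experiments on an echocardiography dataset show that AutoSAME performs well in LV segmentation, landmark localization, and indicator measurement.

Insight: Combining vision foundation models with task-specific designs (such as FCBA and SGPA) can markedly improve performance on medical imaging tasks, especially with small datasets.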

Abstract: Left ventricular (LV) indicator measurements following clinical echocardiography guidelines are important for diagnosing cardiovascular disease. Although existing algorithms have explored automated LV quantification, they can struggle to capture generic visual representations due to the normally small training datasets. Therefore, it is necessary to introduce vision foundational models (VFM) with abundant knowledge. However, VFMs represented by the segment anything model (SAM) are usually suitable for segmentation but incapable of identifying key anatomical points, which are critical in LV indicator measurements. In this paper, we propose a novel framework named AutoSAME, combining the powerful visual understanding of SAM with segmentation and landmark localization tasks simultaneously. Consequently, the framework mimics the operation of cardiac sonographers, achieving LV indicator measurements consistent with clinical guidelines. We further present filtered cross-branch attention (FCBA) in AutoSAME, which leverages relatively comprehensive features in the segmentation to enhance the heatmap regression (HR) of key points from the frequency domain perspective, optimizing the visual representation learned by the latter. Moreover, we propose spatial-guided prompt alignment (SGPA) to automatically generate prompt embeddings guided by spatial properties of LV, thereby improving the accuracy of dense predictions by prior spatial knowledge. The extensive experiments on an echocardiography dataset demonstrate the efficiency of each design and the superiority of our AutoSAME in LV segmentation, landmark localization, and indicator measurements. The code will be available at https://github.com/QC-LIU-1997/AutoSAME.

[42] Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation

Chenruo Liu,Hongjun Liu,Zeyu Lai,Yiqiu Shen,Chen Zhao,Qi Lei

Main category: cs.CV

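TL;DR: The paper proposes using superclass information in class labels to reduce reliance on spurious features, disentangling features via gradient-based attention guided by a pre-trained vision-language model and improving robustness in domain generalization tasks.

Details Motivation: Existing methods for improving group robustness usually require auxiliary annotations or assume identical group structure across source and target domains, assumptions that are unnatural and impractical in real-world settings. The paper aims to overcome these limitations with a method that improves robustness without annotating source samples.

Contribution: The main contribution is a superclass-guided feature disentanglement method that uses a pre-trained vision-language model and gradient-based attention to reduce reliance on spurious features and improve domain generalization performance.

Method: Superclass information guides feature disentanglement: gradient-based attention identifies superclass-relevant and irrelevant features, and the model is encouraged to use all superclass-relevant features for prediction. No source samples need to be annotated.

Result: Experiments show that the method significantly outperforms baselines across multiple datasets, with improvements in quantitative metrics and supporting qualitative visualizations.

Insight: Semantic structure in class labels, such as superclass information, can naturally reduce reliance on spurious features, offering a new research direction for domain generalization.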

Abstract: To enhance group robustness to spurious correlations, prior work often relies on auxiliary annotations for groups or spurious features and assumes identical sets of groups across source and target domains. These two requirements are both unnatural and impractical in real-world settings. To overcome these limitations, we propose a method that leverages the semantic structure inherent in class labels–specifically, superclass information–to naturally reduce reliance on spurious features. Our model employs gradient-based attention guided by a pre-trained vision-language model to disentangle superclass-relevant and irrelevant features. Then, by promoting the use of all superclass-relevant features for prediction, our approach achieves robustness to more complex spurious correlations without the need to annotate any source samples. Experiments across diverse datasets demonstrate that our method significantly outperforms baselines in domain generalization tasks, with clear improvements in both quantitative metrics and qualitative visualizations.

[43] RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space

Jingyun Liang,Jingkai Zhou,Shikai Li,Chenjie Cao,Lei Sun,Yichen Qian,Weihua Chen,Fan Wang

Main category: cs.CV

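TL;DR: This paper proposes a decomposed human motion control and video generation framework that explicitly decouples motion from appearance and subject from background, enabling flexible composition and control of video elements.

Details Motivation: Existing methods lack separate control over four key video elements (foreground subject, background, human trajectory, and action patterns), which limits the flexibility and realism of video generation.

Contribution: 1. Proposes a decomposed human motion control and video generation framework in world space. 2. Achieves highly controllable video elements by editing trajectories and actions in 3D space. 3. Proposes injection methods for subject, background, and motion on top of diffusion transformer models.

Method: 1. Build a ground-aware 3D world coordinate system and edit motion directly in 3D space. 2. Unproject 2D trajectories into 3D using focal-length calibration and coordinate transformation. 3. Supply actions from a motion bank or text-to-motion methods. 4. Inject subject, background, and motion signals into a diffusion transformer model.

Result: Experiments on benchmark datasets and real-world cases show state-of-the-art performance in both element-wise controllability and overall video quality.

Insight: Explicitly decoupling control signals and editing in 3D space enables more flexible, higher-quality video generation, and suggests directions for future text-driven multimodal generation.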

Abstract: Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
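
Unprojecting a 2D trajectory point into 3D can be illustrated with a pinhole camera and a flat ground plane. The sketch below is a simplified example under strong assumptions that the abstract does not state: a level camera with no pitch or roll, a known focal length and principal point in pixels, and a known camera height. The function name `unproject_to_ground` and all parameter names are hypothetical.

```python
def unproject_to_ground(u, v, focal, cx, cy, camera_height):
    """Intersect the viewing ray of pixel (u, v) with the ground plane.

    Camera frame: x right, y down, z forward. The ground is the plane
    y = camera_height, i.e. the camera sits camera_height above it.
    """
    # Direction of the ray through the pixel under the pinhole model.
    dx, dy, dz = (u - cx) / focal, (v - cy) / focal, 1.0
    if dy <= 0.0:
        raise ValueError("pixel is at or above the horizon; its ray never hits the ground")
    # Scale the ray so that its y component reaches the ground plane.
    t = camera_height / dy
    return (t * dx, t * dy, t * dz)


point = unproject_to_ground(640, 660, focal=1000.0, cx=640.0, cy=360.0, camera_height=1.5)
```

Applying this to every point of an edited 2D path yields a 3D ground trajectory; the speed alignment and orientation adjustment mentioned in the abstract would then operate on those 3D points.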

[44] DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

Wenwen Yu,Zhibo Yang,Yuliang Liu,Xiang Bai

Main category: cs.CV

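TL;DR: The paper proposes DocThinker, a rule-based reinforcement learning (RL) framework for dynamic inference-time reasoning in explainable multimodal large language model (MLLM) document understanding. Through policy learning and multi-objective rule-based rewards, it improves adaptability, transparency, and generalization.

Details Motivation: Existing MLLMs perform well on document understanding, but their reasoning is opaque, making it hard to ensure reliability and trustworthiness in high-stakes domains such as law, finance, and medicine. Methods based on fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) suffer from catastrophic forgetting, poor adaptability, and limited generalization.

Contribution: 1. Proposes DocThinker, a dynamic reasoning framework that refines reasoning strategies with rule-based RL. 2. Generates explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. 3. Uses multi-objective rule-based rewards and KL-constrained optimization to mitigate catastrophic forgetting and improve transparency and adaptability.

Method: DocThinker applies rule-based RL, refining reasoning paths dynamically through policy learning and generating explainable intermediate results. Multi-objective rule-based rewards and KL-constrained optimization support adaptability, transparency, and generalization.

Result: On multiple benchmarks, DocThinker significantly improves generalization and produces more explainable, human-understandable reasoning steps. The results support RL's potential for improving the explainability and adaptability of MLLMs.

Insight: RL can be an effective tool for improving the transparency and adaptability of MLLM reasoning, particularly for document understanding in high-stakes domains. Rule-based rewards can balance performance and explainability.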

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more explainable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding. Code will be available at https://github.com/wenwenyu/DocThinker.

[45] QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection

Yuxiao Wang,Wolin Liang,Yu Lei,Weiying Xue,Nan Zhuang,Qi Liu

Main category: cs.CV

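TL;DR: QueryCraft proposes transformer-guided query initialization that combines semantic priors with guided feature learning to improve human-object interaction (HOI) detection.

Details Motivation: In existing DETR-based HOI detection methods, randomly initialized queries lack explicit semantics, leading to suboptimal performance. QueryCraft aims to address this.

Contribution: 1) Proposes ACTOR, a cross-modal Transformer encoder that extracts action-relevant features from visual regions and textual prompts; 2) Designs PDQD, which distills object category awareness from a pre-trained detector to improve query quality.

Method: 1) ACTOR jointly attends to visual and textual modalities to produce semantically meaningful queries; 2) the PDQD module distills object category information for query initialization.

Result: Achieves state-of-the-art performance on the HICO-Det and V-COCO benchmarks and shows strong generalization.

Insight: Query initialization that combines multimodal information with pre-trained knowledge can markedly improve the performance and interpretability of HOI detection.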

Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: Randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is ACTOR (Action-aware Cross-modal TransfORmer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a Perceptual Distilled Query Decoder (PDQD), which distills object category awareness from a pre-trained detector to serve as object query initiation. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.

[46] Yan: Foundational Interactive Video Generation

Yan Team

Main category: cs.CV

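TL;DR: Yan is a foundational framework for interactive video generation. Its three core modules (simulation, generation, and editing) cover the full pipeline from real-time simulation to multi-modal generation and multi-granularity editing.

Details Motivation: Current interactive video generation capabilities are isolated, with no unified framework integrating them. Yan aims to provide a comprehensive system covering the whole pipeline from simulation to editing, opening new possibilities for creative tools and entertainment media.

Contribution: Proposes the Yan framework with three modules: AAA-level Simulation, Multi-Modal Generation, and Multi-Granularity Editing; designs a low-latency 3D-VAE and a KV-cache-based denoising inference process; introduces a hierarchical autoregressive caption method for action-controllable, real-time infinite video generation; and enables multi-granularity editing by disentangling interactive mechanics from visual rendering.

Method: 1. AAA-level Simulation: a highly compressed 3D-VAE with KV-cache-based shift-window denoising inference; 2. Multi-Modal Generation: action-controllable generation built on hierarchical autoregressive captioning and video diffusion models (VDMs); 3. Multi-Granularity Editing: disentangles interactive mechanics from visual rendering and supports text-driven editing.

Result: Yan achieves real-time 1080P/60FPS interactive simulation and infinite video generation, and generalizes across domains in style and mechanics.

Insight: Through modular design and disentanglement, Yan demonstrates the potential of interactive video generation, especially the flexibility of multi-modal generation and multi-granularity editing, and points to directions for future creative tools and entertainment media.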

Abstract: We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs), then transform the VDM into a frame-wise, action-controllable, real-time infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: https://greatx3.github.io/Yan/.

[47] Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

Jihwan Park,Taehoon song,Sanghyeok Lee,Miso Choi,Hyunwoo J. Kim

Main category: cs.CV

TL;DR: The paper proposes TransMiter, a lightweight, model-agnostic adapter for vision-language models (VLMs) that transfers adaptation knowledge from weaker to stronger models without backpropagation, improving efficiency and performance.

Details Motivation: Fine-tuning large VLMs is computationally expensive, and existing adaptation transfer methods are limited in transferability and efficiency. The paper aims to address these issues by developing a reusable and efficient adapter.

Contribution: TransMiter, a novel adapter that captures and transfers knowledge gaps between pre-trained and fine-tuned VLMs without backpropagation, enabling efficient adaptation across models of different sizes and architectures.

Method: TransMiter is trained in an unsupervised manner to learn the knowledge gap between VLMs. It is lightweight, consisting of few layers, and can be supplemented with minimal labeled data for additional performance gains.

Result: TransMiter effectively transfers adaptation knowledge, often surpassing the performance of fine-tuned stronger models, while adding negligible inference cost.

Insight: The study highlights the potential of model-agnostic adapters for efficient knowledge transfer in VLMs, reducing computational costs while maintaining or improving generalization capabilities.

Abstract: Vision-Language Models (VLMs) have been widely used in various visual recognition tasks due to their remarkable generalization capabilities. As these models grow in size and complexity, fine-tuning becomes costly, emphasizing the need to reuse adaptation knowledge from ‘weaker’ models to efficiently enhance ‘stronger’ ones. However, existing adaptation transfer methods exhibit limited transferability across models due to their model-specific design and high computational demands. To tackle this, we propose Transferable Model-agnostic adapter (TransMiter), a light-weight adapter that improves vision-language models ‘without backpropagation’. TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs, in an ‘unsupervised’ manner. Once trained, this knowledge can be seamlessly transferred across different models without the need for backpropagation. Moreover, TransMiter consists of only a few layers, inducing a negligible additional inference cost. Notably, supplementing the process with a few labeled data further yields additional performance gain, often surpassing a fine-tuned stronger model, with a marginal training cost. Experimental results and analyses demonstrate that TransMiter effectively and efficiently transfers adaptation knowledge while preserving generalization abilities across VLMs of different sizes and architectures in visual recognition tasks.

[48] SelfHVD: Self-Supervised Handheld Video Deblurring for Mobile Phones

Honglei Xu,Zhilu Zhang,Junjie Fan,Xiaohe Wu,Wangmeng Zuo

Main category: cs.CV

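TL;DR: The paper proposes SelfHVD, a self-supervised handheld video deblurring method that extracts sharp clues from the video as training labels, introduces SEVD and SCSCM to improve model performance and spatial consistency, and builds new datasets to validate the approach.

Details Motivation: Videos shot on handheld phones often contain blurry frames caused by shake and other instability factors, and existing methods perform poorly on real-world footage because of the blur domain gap between training and test data.

Contribution: 1. Proposes the self-supervised deblurring method SelfHVD; 2. Introduces SEVD to generate higher-quality training data; 3. Proposes SCSCM to maintain spatial consistency; 4. Builds a synthetic and a real-world handheld video dataset.

Method: 1. Extract sharp clues from the video as labels; 2. Use SEVD to generate higher-quality paired data; 3. Use SCSCM to constrain spatial consistency between output and input frames.

Result: Significantly outperforms existing self-supervised methods on both synthetic and real-world datasets.

Insight: Sharp clues can serve as effective labels for self-supervised learning, and a spatial consistency constraint improves deblurring quality.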

Abstract: Shooting video with a handheld mobile phone, the most common photographic device, often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the model’s ability, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct a synthetic and a real-world handheld video dataset for handheld video deblurring. Extensive experiments on these two and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets are publicly available at https://github.com/cshonglei/SelfHVD.

[49] Neural Artistic Style and Color Transfer Using Deep Learning

Justin London

Main category: cs.CV

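TL;DR: The paper proposes a method that combines neural artistic style transfer with color transfer, using KL divergence to quantitatively evaluate different color and luminance histogram matching algorithms for the artistic processing and enhancement of images and video.

Details Motivation: Neural artistic style transfer and color transfer have broad application potential in art, design, and film, but combining the two techniques effectively and evaluating the results quantitatively remains a challenge.

Contribution: Proposes a deep-learning-based method that combines neural artistic style transfer with color transfer, and introduces KL divergence to quantitatively evaluate different color matching algorithms.

Method: Uses KL divergence to evaluate the color and luminance histogram matching of several algorithms, including Reinhard global color transfer and iterative distribution transfer (IDT), and combines them with deep learning to transfer style onto content.

Result: Experiments compare how the different algorithms perform at color transfer and report quantitative results based on KL divergence.

Insight: KL-divergence-based evaluation gives a quantifiable criterion for color and style transfer, which helps when choosing and tuning algorithms in practice.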

Abstract: Neural artistic style transfer blends the content representation of one image with the style of another. This enables artists to create unique innovative visuals and enhances artistic expression in various fields including art, design, and film. Color transfer algorithms are important in digital image processing: they adjust the color information in a target image based on the colors in the source image. Color transfer enhances images and videos in film and photography, and can aid in image correction. We introduce a methodology that combines neural artistic style with color transfer. The method uses the Kullback-Leibler (KL) divergence to quantitatively evaluate color and luminance histogram matching algorithms including Reinhard global color transfer, iterative distribution transfer (IDT), IDT with regrain, Cholesky, and PCA between the original and neural artistic style transferred image using deep learning. We estimate the color channel kernel densities. Various experiments are performed to evaluate the KL of these algorithms and their color histograms for style to content transfer.
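
To make the KL-based evaluation concrete, the following is a minimal sketch of comparing the color distributions of two images channel by channel. It assumes 8-bit RGB arrays and uses smoothed histograms as a stand-in for the kernel density estimates mentioned in the abstract; the function names, the bin count, and the smoothing constant are illustrative choices, not taken from the paper.

```python
import numpy as np

def channel_histogram(channel, bins=32):
    """Normalized histogram of one 8-bit channel, smoothed so no bin is zero."""
    counts, _ = np.histogram(channel, bins=bins, range=(0, 256))
    probs = counts.astype(np.float64) + 1e-8  # assumed smoothing constant
    return probs / probs.sum()

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions defined over the same bins."""
    return float(np.sum(p * np.log(p / q)))

def color_kl(image_a, image_b, bins=32):
    """Per-channel KL divergence between two H x W x 3 uint8 images."""
    return [
        kl_divergence(
            channel_histogram(image_a[..., c], bins),
            channel_histogram(image_b[..., c], bins),
        )
        for c in range(3)
    ]
```

A lower per-channel value indicates that the color-transferred image's histogram sits closer to the reference histogram. Note that KL divergence is asymmetric, so the argument order matters when ranking algorithms.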

[50] Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation

Jiahua Dong,Hui Yin,Wenqi Liang,Hanbin Zhao,Henghui Ding,Nicu Sebe,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CV

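TL;DR: The paper proposes a novel Hierarchical Visual Prompt Learning (HVPL) model that addresses catastrophic forgetting in video instance segmentation (VIS), improving performance through frame-level and video-level prompt learning.

Details Motivation: Existing VIS methods assume that object categories stay fixed, and they suffer catastrophic forgetting of old categories when required to continually learn new ones. The paper aims to resolve this with hierarchical visual prompt learning.

Contribution: 1. Proposes the Hierarchical Visual Prompt Learning (HVPL) model;
2. Designs a frame-level prompt and an orthogonal gradient correction (OGC) module;
3. Develops a video-level prompt and a video context decoder.

Method: 1. Uses a task-specific frame prompt and the OGC module to mitigate forgetting at the frame level;
2. Uses a video prompt and a context decoder to embed structural inter-class relationships across frames.

Result: HVPL outperforms baseline methods when continually learning new categories and effectively reduces catastrophic forgetting.

Insight: Hierarchical visual prompt learning tackles continual learning from both the frame and the video level, offering a new direction for the long-term adaptability of video instance segmentation.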

Abstract: Video instance segmentation (VIS) has gained significant attention for its capability in tracking and segmenting object instances across video frames. However, most of the existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Through rigorous comparisons, our HVPL model proves to be more effective than baseline approaches. The code is available at https://github.com/JiahuaDong/HVPL.
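
The gradient projection described for the OGC module can be illustrated with a generic sketch. It assumes old-class features are available as rows of a matrix and removes from a gradient every component lying in their dominant subspace; the SVD-based basis, the `energy` cutoff, and the function names are assumptions for illustration and may differ from the paper's actual construction.

```python
import numpy as np

def orthogonal_projector(old_features, energy=0.99):
    """Projector onto the orthogonal complement of the old-feature subspace.

    old_features has shape (n_samples, dim). The subspace keeps the leading
    right-singular vectors that explain `energy` of the squared spectrum.
    """
    _, s, vt = np.linalg.svd(old_features, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(ratio, energy) + 1)
    basis = vt[:k].T  # (dim, k) with orthonormal columns
    return np.eye(old_features.shape[1]) - basis @ basis.T

def correct_gradient(grad, projector):
    """Drop the gradient components that would disturb old-class features."""
    return projector @ grad
```

After correction, an update along the gradient leaves responses along the retained old-class directions unchanged to first order, which is the usual motivation for this family of continual-learning methods.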

[51] AME: Aligned Manifold Entropy for Robust Vision-Language Distillation

Guiming Cao,Yuming Ou

Main category: cs.CV

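TL;DR: The paper proposes AME, a method that achieves robust vision-language knowledge distillation through aligned manifold entropy, addressing the difficulty of cross-modal feature representation under low-data conditions. It requires no changes to the backbone architecture and is broadly compatible with existing frameworks.

Details Motivation: Existing vision-language knowledge distillation methods perform poorly when data is scarce, especially on samples with high predictive uncertainty, which limits their use in real-world scenarios.

Contribution: 1. Proposes AME, which achieves robust cross-modal knowledge distillation through manifold alignment and entropy minimization; 2. Provides a theoretical analysis showing a tighter generalization error bound; 3. Validates the method's broad applicability and superior performance.

Method: AME reconfigures a shared manifold in which multi-modal data (image and text) are bridged by a pair of projection functions, and applies entropy minimization over that manifold to structurally compress cross-modal feature representations.

Result: Experiments show that AME delivers superior generalization across different distillation architectures and training settings.

Insight: Combining manifold alignment with entropy minimization can effectively improve the robustness of cross-modal knowledge distillation, particularly in low-data regimes.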

Abstract: Knowledge distillation is a long-established technique for knowledge transfer, and has regained attention in the context of the recent emergence of large vision-language models (VLMs). However, vision-language knowledge distillation often requires sufficient training data to achieve robust generalization on samples with ambiguous or boundary-adjacent representations, which are associated with high predictive uncertainty. Critically, collecting such large-scale, task-specific data for training is often impractical in real-world scenarios. To address this major challenge arising from the entanglement of uncertainty and cross-modal feature representation, we propose Aligned Manifold Entropy for Robust Vision-Language Distillation (AME), aiming to achieve robust generalization under real-world conditions. AME applies entropy minimization over a reconfigured shared manifold, where multi-modal data (i.e., image and text) are bridged through a pair of projection functions, conducive to structural compression for cross-modal feature representations. This enables robust knowledge distillation under low-data regimes, while requiring no architectural modifications to the backbone. As a result, it can serve as a plug-and-play module compatible with a wide range of vision-language distillation frameworks. Notably, our theoretical analysis reveals that integrating knowledge distillation with entropy minimization over the shared manifold leads to a tighter generalization error bound. Extensive experiments across diverse distillation architectures and training settings demonstrate that AME consistently facilitates robust knowledge distillation, resulting in superior generalization performance across a wide spectrum of downstream tasks.

[52] Unified and Semantically Grounded Domain Adaptation for Medical Image Segmentation

Xin Wang,Yin Guo,Jiamin Xia,Kaiyu Zhang,Niranjan Balu,Mahmud Mossa-Basha,Linda Shapiro,Chun Yuan

Main category: cs.CV

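TL;DR: The paper proposes a unified, semantically grounded domain adaptation framework for medical image segmentation that covers both the source-accessible and the source-free setting; adaptability across domains emerges naturally from learning a domain-agnostic probabilistic manifold.

Details Motivation: Existing domain adaptation methods for medical image segmentation are designed differently for the source-accessible and source-free settings and lack an explicit, structured model of anatomical knowledge. The paper aims to close this gap.

Contribution: Proposes a unified, semantically grounded framework that supports both source-accessible and source-free adaptation, achieves adaptability through a domain-agnostic probabilistic manifold, and offers strong interpretability.

Method: The model learns a domain-agnostic probabilistic manifold as a global space of anatomical regularities and disentangles the structural content of each image into a canonical anatomy and a spatial transformation, enabling semantically meaningful prediction with intrinsic adaptability.

Result: Achieves state-of-the-art results on cardiac and abdominal datasets, with source-free performance closely approaching the source-accessible setting, and demonstrates strong interpretability through manifold traversal.

Insight: By modeling anatomical knowledge in a structured way, the paper shows the importance of semantic consistency in domain adaptation and offers a new direction for medical image analysis.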

Abstract: Most prior unsupervised domain adaptation approaches for medical image segmentation are narrowly tailored to either the source-accessible setting, where adaptation is guided by source-target alignment, or the source-free setting, which typically resorts to implicit supervision mechanisms such as pseudo-labeling and model distillation. This substantial divergence in methodological designs between the two settings reveals an inherent flaw: the lack of an explicit, structured construction of anatomical knowledge that naturally generalizes across domains and settings. To bridge this longstanding divide, we introduce a unified, semantically grounded framework that supports both source-accessible and source-free adaptation. Fundamentally distinct from all prior works, our framework’s adaptability emerges naturally as a direct consequence of the model architecture, without the need for any handcrafted adaptation strategies. Specifically, our model learns a domain-agnostic probabilistic manifold as a global space of anatomical regularities, mirroring how humans establish visual understanding. Thus, the structural content in each image can be interpreted as a canonical anatomy retrieved from the manifold and a spatial transformation capturing individual-specific geometry. This disentangled, interpretable formulation enables semantically meaningful prediction with intrinsic adaptability. Extensive experiments on challenging cardiac and abdominal datasets show that our framework achieves state-of-the-art results in both settings, with source-free performance closely approaching its source-accessible counterpart, a level of consistency rarely observed in prior works. Beyond quantitative improvement, we demonstrate strong interpretability of the proposed framework via manifold traversal for smooth shape manipulation.

[53] Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization

Ke Liu,Xuanhan Wang,Qilong Zhang,Lianli Gao,Jingkuan Song

Main category: cs.CV

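TL;DR: The paper proposes HiWL, a hierarchical two-stage optimization method that aims to satisfy invisibility, robustness, and broad applicability for image watermarking at the same time, improving both watermark extraction accuracy and processing efficiency.

Details Motivation: Existing deep image watermarking methods struggle to satisfy invisibility, robustness, and broad applicability simultaneously, which limits their effectiveness in practice.

Contribution: The HiWL framework uses hierarchical two-stage optimization, consisting of distribution alignment learning followed by generalized watermark representation learning, to improve watermark performance and applicability.

Method: The first stage uses distribution alignment learning to establish a common latent space that ensures watermark invisibility and robustness; the second stage uses generalized watermark representation learning to disentangle watermarks from image content.

Result: Experiments show that HiWL achieves 7.6% higher watermark extraction accuracy than existing methods while keeping latency extremely low (100K images processed in 8 seconds).

Insight: A hierarchical optimization strategy effectively resolves the conflict among multiple watermarking objectives and offers a new direction for general-purpose watermarking.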

Abstract: Deep image watermarking, which enables imperceptible watermark embedding in cover images and reliable extraction from them, has been shown to be effective for copyright protection of image assets. However, existing methods face limitations in simultaneously satisfying three essential criteria for generalizable watermarking: 1) invisibility (imperceptible hiding of watermarks), 2) robustness (reliable watermark recovery under diverse conditions), and 3) broad applicability (low latency in the watermarking process). To address these limitations, we propose Hierarchical Watermark Learning (HiWL), a two-stage optimization that enables a watermarking model to achieve all three criteria simultaneously. In the first stage, distribution alignment learning is designed to establish a common latent space with two constraints: 1) visual consistency between watermarked and non-watermarked images, and 2) information invariance across watermark latent representations. In this way, multi-modal inputs including the watermark message (binary codes) and cover images (RGB pixels) can be well represented, thereby ensuring the invisibility of watermarks and the robustness of the watermarking process. The second stage employs generalized watermark representation learning to establish a disentanglement policy for separating watermarks from image content in RGB space. In particular, it strongly penalizes substantial fluctuations in separated RGB watermarks corresponding to identical messages. Consequently, HiWL effectively learns generalizable latent-space watermark representations while maintaining broad applicability. Extensive experiments demonstrate the effectiveness of the proposed method. In particular, it achieves 7.6% higher accuracy in watermark extraction than existing methods, while maintaining extremely low latency (100K images processed in 8s).

[54] MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion

Tao Luo,Weihua Xu

Main category: cs.CV

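TL;DR: The paper proposes MMIF-AMIN, a new method for multimodal medical image fusion that integrates multimodal information through an invertible dense network and a multi-scale complementary feature extraction module, and introduces an adaptive loss function to guide learning. Experiments show that it outperforms existing methods.

Details Motivation: The core challenge of multimodal medical image fusion (MMIF) is capturing both the unique and the complementary information across modalities to produce a more comprehensive diagnostic image. Traditional methods are limited in feature extraction and loss function design, so a more effective solution is needed.

Contribution: 1. Proposes the novel MMIF-AMIN architecture, which combines an Invertible Dense Network (IDN) with a Multi-scale Complementary Feature Extraction Module (MCFEM). 2. Designs an adaptive loss function to guide model learning. 3. Validates the method's superiority experimentally and demonstrates its ability to generalize.

Method: 1. Use the Invertible Dense Network (IDN) for lossless feature extraction. 2. Design the MCFEM, which incorporates a hybrid attention mechanism, convolutional layers of varying sizes, and Transformers. 3. Introduce an adaptive loss function to guide training.

Result: MMIF-AMIN outperforms nine existing methods in both quantitative and qualitative analyses. Ablation experiments confirm the effectiveness of each component, and the method also performs well when extended to other fusion tasks.

Insight: 1. Combining invertible networks with multi-scale feature extraction captures multimodal information effectively. 2. An adaptive loss function can mine data characteristics more flexibly than manually designed losses. 3. The method shows potential both for medical image fusion and for other fusion tasks.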

Abstract: Multimodal medical image fusion (MMIF) aims to integrate images from different modalities to produce a comprehensive image that enhances medical diagnosis by accurately depicting organ structures, tissue textures, and metabolic information. Capturing both the unique and complementary information across multiple modalities simultaneously is a key research challenge in MMIF. To address this challenge, this paper proposes a novel image fusion method, MMIF-AMIN, which features a new architecture that can effectively extract these unique and complementary features. Specifically, an Invertible Dense Network (IDN) is employed for lossless feature extraction from individual modalities. To extract complementary information between modalities, a Multi-scale Complementary Feature Extraction Module (MCFEM) is designed, which incorporates a hybrid attention mechanism, convolutional layers of varying sizes, and Transformers. An adaptive loss function is introduced to guide model learning, addressing the limitations of traditional manually-designed loss functions and enhancing the depth of data mining. Extensive experiments demonstrate that MMIF-AMIN outperforms nine state-of-the-art MMIF methods, delivering superior results in both quantitative and qualitative analyses. Ablation experiments confirm the effectiveness of each component of the proposed method. Additionally, extending MMIF-AMIN to other image fusion tasks also achieves promising performance.

[55] PADReg: Physics-Aware Deformable Registration Guided by Contact Force for Ultrasound Sequences

Yimeng Geng,Mingyang Zhao,Fan Xu,Guanglin Cao,Gaofeng Meng,Hongbin Liu

Main category: cs.CV

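TL;DR: PADReg is a deformable registration framework for ultrasound images based on a physical prior; guided by contact force, it achieves more accurate anatomical alignment and significantly outperforms existing methods.

Details Motivation: Deformable registration of ultrasound images is crucial for capturing biomechanical properties and for disease diagnosis, but low contrast, heavy noise, and large deformations make it highly challenging. Existing methods lack physical interpretability and align anatomy poorly.

Contribution: Proposes PADReg, which uses contact force as a physical prior and achieves better anatomical alignment by constructing a pixel-wise stiffness map and a lightweight physics-aware module.

Method: Constructs a stiffness map from contact force and ultrasound images, then estimates the deformation field with a lightweight physics-aware module inspired by Hooke's law.

Result: Experiments report an HD95 of 12.90, which is 21.34% better than state-of-the-art methods.

Insight: Multi-modal physical priors such as contact force can substantially improve the anatomical plausibility and accuracy of ultrasound image registration.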

Abstract: Ultrasound deformable registration estimates spatial transformations between pairs of deformed ultrasound images, which is crucial for capturing biomechanical properties and enhancing diagnostic accuracy in diseases such as thyroid nodules and breast cancer. However, ultrasound deformable registration remains highly challenging, especially under large deformation. The inherently low contrast, heavy noise and ambiguous tissue boundaries in ultrasound images severely hinder reliable feature extraction and correspondence matching. Existing methods often suffer from poor anatomical alignment and lack physical interpretability. To address the problem, we propose PADReg, a physics-aware deformable registration framework guided by contact force. PADReg leverages synchronized contact force measured by robotic ultrasound systems as a physical prior to constrain the registration. Specifically, instead of directly predicting deformation fields, we first construct a pixel-wise stiffness map utilizing the multi-modal information from contact force and ultrasound images. The stiffness map is then combined with force data to estimate a dense deformation field, through a lightweight physics-aware module inspired by Hooke’s law. This design enables PADReg to achieve physically plausible registration with better anatomical alignment than previous methods relying solely on image similarity. Experiments on in-vivo datasets demonstrate that it attains a HD95 of 12.90, which is 21.34% better than state-of-the-art methods. The source code is available at https://github.com/evelynskip/PADReg.
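
The role of Hooke's law in combining stiffness with force can be sketched in a few lines. This is a deliberately simplified, scalar illustration that assumes a single contact force and a positive per-pixel stiffness value; the function name and the `eps` floor are hypothetical, and the paper's physics-aware module is a learned component rather than this closed-form rule.

```python
import numpy as np

def hooke_displacement(stiffness_map, contact_force, eps=1e-6):
    """Per-pixel displacement magnitude u = F / k, from Hooke's law F = k * u.

    stiffness_map: 2D array of per-pixel stiffness values k.
    contact_force: scalar force F applied by the probe.
    eps: assumed lower bound on stiffness to avoid division by zero.
    """
    stiffness = np.maximum(np.asarray(stiffness_map, dtype=np.float64), eps)
    return contact_force / stiffness
```

Under this rule, soft tissue (low stiffness) moves more than stiff tissue for the same force, which is the physical intuition that lets force measurements constrain the estimated deformation field.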

[56] ROD: RGB-Only Fast and Efficient Off-road Freespace Detection

Tong Sun,Hongliang Ye,Jilin Mei,Liang Chen,Fangzhou Zhao,Leiqiang Zong,Yu Hu

Main category: cs.CV

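TL;DR: ROD is a real-time, efficient off-road freespace detection method that relies only on RGB images; using a pre-trained ViT and a lightweight decoder in place of multi-modal fusion, it surpasses existing methods in both accuracy and speed.

Details Motivation: Existing multi-modal methods based on RGB and LiDAR incur high latency when computing surface normal maps, making it hard to meet real-time requirements, especially in off-road scenarios that demand a high FPS.

Contribution: 1. Proposes ROD, an RGB-only method that needs no LiDAR; 2. Combines a pre-trained ViT with a lightweight decoder to improve both accuracy and speed; 3. Achieves SOTA performance on the ORFD and RELLIS-3D datasets with an inference speed of 50 FPS.

Method: 1. Extract features from RGB images with a pre-trained ViT; 2. Design an efficient lightweight decoder; 3. Remove the dependence on LiDAR data entirely.

Result: ROD achieves the best performance on the ORFD and RELLIS-3D datasets and runs at 50 FPS, significantly outperforming prior methods.

Insight: A lightweight RGB-only design effectively removes the high latency of multi-modal methods while retaining high accuracy, making it suitable for real-time off-road scenarios.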

Abstract: Off-road freespace detection is more challenging than on-road scenarios because of the blurred boundaries of traversable areas. Previous state-of-the-art (SOTA) methods employ multi-modal fusion of RGB images and LiDAR data. However, due to the significant increase in inference time when calculating surface normal maps from LiDAR data, multi-modal methods are not suitable for real-time applications, particularly in real-world scenarios where higher FPS is required compared to slow navigation. This paper presents a novel RGB-only approach for off-road freespace detection, named ROD, eliminating the reliance on LiDAR data and its computational demands. Specifically, we utilize a pre-trained Vision Transformer (ViT) to extract rich features from RGB images. Additionally, we design a lightweight yet efficient decoder; together, these improve both precision and inference speed. ROD establishes a new SOTA on ORFD and RELLIS-3D datasets, as well as an inference speed of 50 FPS, significantly outperforming prior models.

[57] Subjective and Objective Quality Assessment of Banding Artifacts on Compressed Videos

Qi Zheng,Li-Heng Chen,Chenlong He,Neil Berkbeck,Yilin Wang,Balu Adsumilli,Alan C. Bovik,Yibo Fan,Zhengzhong Tu

Main category: cs.CV

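TL;DR: The paper systematically studies banding artifacts in video compression, creates LIVE-YT-Banding, the first open video dataset for this problem, and proposes CBAND, an efficient no-reference video quality assessment method that significantly outperforms existing approaches.

Details Motivation: Banding artifacts introduced when compressing high-definition video severely degrade perceptual quality, but existing datasets are limited to still images and cannot capture temporal dynamics, so a systematic study is needed.

Contribution: 1. Creates LIVE-YT-Banding, the first video dataset of banding artifacts; 2. Proposes CBAND, an efficient no-reference video quality evaluator.

Method: CBAND leverages the learned statistics of natural images expressed in the embeddings of deep neural networks to detect banding and assess its impact on perceived quality.

Result: Experiments show that CBAND significantly outperforms existing methods at perceptual banding prediction and is orders of magnitude faster; it can also serve as a differentiable loss function for optimizing video debanding models.

Insight: Natural image statistics captured in deep network embeddings offer a new angle on video quality assessment, and CBAND's efficiency shows its potential for practical use.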

Abstract: Although there have been notable advancements in video compression technologies in recent years, banding artifacts remain a serious issue affecting the quality of compressed videos, particularly on smooth regions of high-definition videos. Noticeable banding artifacts can severely impact the perceptual quality of videos viewed on a high-end HDTV or high-resolution screen. Hence, there is a pressing need for a systematic investigation of the banding video quality assessment problem for advanced video codecs. Given that the existing publicly available datasets for studying banding artifacts are limited to still picture data only, which cannot account for temporal banding dynamics, we have created a first-of-a-kind open video dataset, dubbed LIVE-YT-Banding, which consists of 160 videos generated by four different compression parameters using the AV1 video codec. A total of 7,200 subjective opinions are collected from a cohort of 45 human subjects. To demonstrate the value of this new resource, we tested and compared a variety of models that detect banding occurrences, and measure their impact on perceived quality. Among these, we introduce an effective and efficient new no-reference (NR) video quality evaluator which we call CBAND. CBAND leverages the properties of the learned statistics of natural images expressed in the embeddings of deep neural networks. Our experimental results show that the perceptual banding prediction performance of CBAND significantly exceeds that of previous state-of-the-art models, and is also orders of magnitude faster. Moreover, CBAND can be employed as a differentiable loss function to optimize video debanding models. The LIVE-YT-Banding database, code, and pre-trained model are all publicly available at https://github.com/uniqzheng/CBAND.

[58] SafeFix: Targeted Model Repair via Controlled Image Generation

Ouyang Xu,Baoming Zhang,Ruiyu Mao,Yunhui Guo

Main category: cs.CV

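TL;DR: SafeFix repairs systematic errors in deep learning models in a targeted way through controlled image generation; it combines a conditional text-to-image model with a large vision-language model to generate and filter semantically correct, distribution-aligned images, significantly improving robustness on rare subpopulations.

Details Motivation: Deep learning models for visual recognition often make systematic errors on rare semantic subpopulations, and existing repair methods rely on manually designed prompts, which are prone to distribution shift and semantic errors. A more automated and precise repair approach is therefore needed.

Contribution: Proposes a model repair module built on interpretable failure attribution, which uses a conditional text-to-image model and an LVLM to generate and filter semantically consistent, distribution-aligned images, significantly reducing recognition errors tied to rare subpopulations.

Method: Combines interpretable failure attribution, a conditional text-to-image model, and a large vision-language model (LVLM) to generate and filter high-quality, semantically consistent, distribution-aligned images that are then used to retrain the model.

Result: Experiments show that the method significantly reduces recognition errors on rare subpopulations and improves robustness without introducing new errors.

Insight: Controlled generation combined with filtering can supplement training data more precisely, addressing systematic errors while avoiding distribution shift and semantic inconsistency.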

Abstract: Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images – an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix

[59] Adaptive Confidence-Wise Loss for Improved Lens Structure Segmentation in AS-OCT

Zunjie Xiao,Xiao Wu,Tianhang Liu,Lingxi Hu,Yinling Zhang,Xiaoqing Zhang,Risa Higashita,Jiang Liu

Main category: cs.CV

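TL;DR: The paper proposes an Adaptive Confidence-Wise (ACW) loss that dynamically adjusts the confidence threshold used to group regions, improving the segmentation of lens structures in anterior segment OCT (AS-OCT) images and yielding significant performance gains.

Details Motivation: Existing deep segmentation networks typically weight all pixels equally under cross-entropy loss, overlooking the inhomogeneity of lens structure sub-regions (for example, poor segmentation calibration in boundary regions) and the varying confidence of expert annotations.

Contribution: 1. Proposes the ACW loss, which groups each region into low- and high-confidence parts and reweights them; 2. Designs an adaptive confidence threshold optimization algorithm; 3. Proposes a new metric, Boundary Expected Calibration Error (BECE).

Method: ACW loss splits each lens structure sub-region into low- and high-confidence groups via a confidence threshold and applies a region-weighted loss to each group; the threshold itself is optimized dynamically.

Result: On lens structure segmentation, ACW significantly outperforms competing segmentation losses across different segmentation networks (e.g., MedSAM); under U-Net it surpasses cross-entropy with a 6.13% IoU gain, a 4.33% DSC increase, and a 4.79% BECE reduction.

Insight: Exploiting the expert annotation confidence prior, with dynamically adjusted loss weights and confidence threshold, effectively improves calibration in boundary regions as well as overall performance.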

Abstract: Precise lens structure segmentation is essential for the design of intraocular lenses (IOLs) in cataract surgery. Existing deep segmentation networks typically weight all pixels equally under cross-entropy (CE) loss, overlooking the fact that sub-regions of lens structures are inhomogeneous (e.g., some regions perform better than others) and that boundary regions often suffer from poor segmentation calibration at the pixel level. Clinically, experts annotate different sub-regions of lens structures with varying confidence levels, considering factors such as sub-region proportions, ambiguous boundaries, and lens structure shapes. Motivated by this observation, we propose an Adaptive Confidence-Wise (ACW) loss to group each lens structure sub-region into different confidence sub-regions via a confidence threshold from the unique region aspect, aiming to exploit the potential of expert annotation confidence prior. Specifically, ACW clusters each target region into low-confidence and high-confidence groups and then applies a region-weighted loss to reweigh each confidence group. Moreover, we design an adaptive confidence threshold optimization algorithm to adjust the confidence threshold of ACW dynamically. Additionally, to better quantify the miscalibration errors in boundary region segmentation, we propose a new metric, termed Boundary Expected Calibration Error (BECE). Extensive experiments on a clinical lens structure AS-OCT dataset and other multi-structure datasets demonstrate that our ACW significantly outperforms competitive segmentation loss methods across different deep segmentation networks (e.g., MedSAM). Notably, our method surpasses CE with 6.13% IoU gain, 4.33% DSC increase, and 4.79% BECE reduction in lens structure segmentation under U-Net. The code of this paper is available at https://github.com/XiaoLing12138/Adaptive-Confidence-Wise-Loss.
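
The grouping-and-reweighting idea behind a confidence-wise loss can be sketched as follows. This is an assumption-laden illustration, not the paper's formulation: it takes the predicted probability of the labeled class as the confidence, uses a fixed threshold, and the names `threshold`, `w_low`, and `w_high` are hypothetical; the adaptive threshold optimization is omitted.

```python
import numpy as np

def confidence_wise_loss(probs, labels, threshold=0.7, w_low=2.0, w_high=1.0):
    """Cross-entropy reweighted by confidence group.

    probs: (N, C) softmax outputs for N pixels; labels: (N,) integer class ids.
    Pixels whose true-class probability is below `threshold` form the
    low-confidence group and receive weight `w_low`; the rest get `w_high`.
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    p_true = probs[np.arange(len(labels)), labels]
    ce = -np.log(np.clip(p_true, 1e-12, 1.0))
    weights = np.where(p_true < threshold, w_low, w_high)
    return float(np.sum(weights * ce) / np.sum(weights))
```

With equal weights the function reduces to plain mean cross-entropy; raising `w_low` shifts the loss toward hard pixels, which in segmentation tend to cluster near boundaries.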

[60] Bridging the Gap: A Framework for Real-World Video Deepfake Detection via Social Network Compression Emulation

Andrea Montibeller,Dasara Shullani,Daniele Baracchi,Alessandro Piva,Giulia Boato

Main category: cs.CV

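TL;DR: The paper proposes a framework that emulates the video compression and resizing applied by social networks, producing degradation patterns close to those of real uploads so that deepfake detectors perform better in real-world settings.

Details Motivation: The proprietary compression applied by social network platforms destroys the low-level forensic cues in deepfake videos, causing lab-trained detectors to fail in real-world scenarios. A way to emulate these compression effects is needed to improve detector generalization.

Contribution: Proposes a first framework that estimates a social network's compression and resizing parameters from a small set of uploaded videos, so that platform-specific degradation can be emulated locally without direct API access.

Method: Estimates the compression and resizing parameters of a social network from a small number of uploaded videos, then builds a local emulator from those parameters to reproduce the degradation on large datasets.

Result: Experiments on FaceForensics++ show that the emulated data closely matches the degradation patterns of real uploads, and detectors fine-tuned on emulated videos perform comparably to those trained on actually shared media.

Insight: Emulating the compression pipelines of social networks can effectively narrow the gap between lab-based training and real-world deployment, offering a practical solution for deepfake detection on compressed video.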

Abstract: The growing presence of AI-generated videos on social networks poses new challenges for deepfake detection, as detectors trained under controlled conditions often fail to generalize to real-world scenarios. A key factor behind this gap is the aggressive, proprietary compression applied by platforms like YouTube and Facebook, which launder low-level forensic cues. However, replicating these transformations at scale is difficult due to API limitations and data-sharing constraints. For these reasons, we propose a first framework that emulates the video sharing pipelines of social networks by estimating compression and resizing parameters from a small set of uploaded videos. These parameters enable a local emulator capable of reproducing platform-specific artifacts on large datasets without direct API access. Experiments on FaceForensics++ videos shared via social networks demonstrate that our emulated data closely matches the degradation patterns of real uploads. Furthermore, detectors fine-tuned on emulated videos achieve comparable performance to those trained on actual shared media. Our approach offers a scalable and practical solution for bridging the gap between lab-based training and real-world deployment of deepfake detectors, particularly in the underexplored domain of compressed video content.

[61] SHREC 2025: Retrieval of Optimal Objects for Multi-modal Enhanced Language and Spatial Assistance (ROOMELSA)

Trong-Thuan Nguyen,Viet-Tham Huynh,Quang-Thuc Nguyen,Hoang-Phuc Nguyen,Long Le Bao,Thai Hoang Minh,Minh Nguyen Anh,Thang Nguyen Tien,Phat Nguyen Thuan,Huy Nguyen Phong,Bao Huynh Thai,Vinh-Tiep Nguyen,Duc-Vu Nguyen,Phu-Hoa Pham,Minh-Huy Le-Hoang,Nguyen-Khang Le,Minh-Chinh Nguyen,Minh-Quan Ho,Ngoc-Long Tran,Hien-Long Le-Hoang,Man-Khoi Tran,Anh-Duong Tran,Kim Nguyen,Quan Nguyen Hung,Dat Phan Thanh,Hoang Tran Van,Tien Huynh Viet,Nhan Nguyen Viet Thien,Dinh-Khoi Vo,Van-Loc Nguyen,Trung-Nghia Le,Tam V. Nguyen,Minh-Triet Tran

Main category: cs.CV

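TL;DR: ROOMELSA is a new 3D retrieval benchmark that evaluates a system's ability to retrieve the matching 3D model from a natural language description in complex real-world scenes; it contains a large number of scenes and queries.

Details Motivation: Current 3D retrieval systems are typically designed for simple scenarios, but real-world scenes are more complex and require recognizing objects from vague, free-form descriptions.

Contribution: Presents the ROOMELSA benchmark, with over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries, bridging the gap between scene-level grounding and fine-grained 3D retrieval.

Method: Evaluates a system's ability to interpret natural language by having it attend to a specific region of a panoramic room image and retrieve the corresponding 3D model from a large database.

Result: Coarse object retrieval is largely solved, but only one top-performing model consistently ranked the correct match first across nearly all test cases. A lightweight CLIP-based model also performed well, although it made errors on subtle variations in materials, part structures, and contextual cues.

Insight: Tight integration of visual and language understanding is essential for making 3D retrieval systems robust in complex scenes.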

Abstract: Recent 3D retrieval systems are typically designed for simple, controlled scenarios, such as identifying an object from a cropped image or a brief description. However, real-world scenarios are more complex, often requiring the recognition of an object in a cluttered scene based on a vague, free-form description. To this end, we present ROOMELSA, a new benchmark designed to evaluate a system’s ability to interpret natural language. Specifically, ROOMELSA attends to a specific region within a panoramic room image and accurately retrieves the corresponding 3D model from a large database. In addition, ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries. Empirically, while coarse object retrieval is largely solved, only one top-performing model consistently ranked the correct match first across nearly all test cases. Notably, a lightweight CLIP-based model also performed well, although it struggled with subtle variations in materials, part structures, and contextual cues, resulting in occasional errors. These findings highlight the importance of tightly integrating visual and language understanding. By bridging the gap between scene-level grounding and fine-grained 3D retrieval, ROOMELSA establishes a new benchmark for advancing robust, real-world 3D recognition systems.

[62] DiffPose-Animal: A Language-Conditioned Diffusion Framework for Animal Pose Estimation

Tianyu Xiong,Dayi Tan,Wei Tian

Main category: cs.CV

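TL;DR: DiffPose-Animal is a diffusion-based framework for animal pose estimation that combines anatomical priors and semantic guidance from language models, progressively refining pose predictions through a denoising process.

Details Motivation: Animal pose estimation is more challenging than human pose estimation because of morphological diversity across species and scarce annotated data. Traditional heatmap regression methods struggle in complex scenes, so a more robust framework that can exploit semantic information is needed.

Contribution: 1) Brings diffusion models to animal pose estimation, refining predictions step by step through denoising; 2) Uses language models to extract global anatomical priors and local keypoint-wise semantics as guidance; 3) Designs a diffusion-based keypoint decoder that improves robustness to occlusion and sparse annotation.

Method: 1) Formulates pose estimation as a denoising process within the diffusion framework; 2) Uses language models to extract textual priors from species-specific prompts and fuses them with image features via cross-attention; 3) Introduces a diffusion-based keypoint decoder that progressively refines predictions.

Result: Experiments on public animal pose datasets show strong performance across diverse species, cluttered backgrounds, and incomplete keypoints.

Insight: Combining a generative model (diffusion) with semantic guidance from language models provides biologically meaningful constraints for pose estimation and improves generalization and robustness.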

Abstract: Animal pose estimation is a fundamental task in computer vision, with growing importance in ecological monitoring, behavioral analysis, and intelligent livestock management. Compared to human pose estimation, animal pose estimation is more challenging due to high interspecies morphological diversity, complex body structures, and limited annotated data. In this work, we introduce DiffPose-Animal, a novel diffusion-based framework for top-down animal pose estimation. Unlike traditional heatmap regression methods, DiffPose-Animal reformulates pose estimation as a denoising process under the generative framework of diffusion models. To enhance semantic guidance during keypoint generation, we leverage large language models (LLMs) to extract both global anatomical priors and local keypoint-wise semantics based on species-specific prompts. These textual priors are encoded and fused with image features via cross-attention modules to provide biologically meaningful constraints throughout the denoising process. Additionally, a diffusion-based keypoint decoder is designed to progressively refine pose predictions, improving robustness to occlusion and annotation sparsity. Extensive experiments on public animal pose datasets demonstrate the effectiveness and generalization capability of our method, especially under challenging scenarios with diverse species, cluttered backgrounds, and incomplete keypoints.
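
Reformulating pose estimation as denoising can be illustrated with the standard DDPM forward process applied to keypoint coordinates. This sketch assumes normalized 2D keypoints and a linear beta schedule; the schedule values and function names are illustrative, and the learned, language-conditioned denoiser that would predict the noise is not shown.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_keypoints(keypoints, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(keypoints.shape)
    noisy = np.sqrt(alpha_bar[t]) * keypoints + np.sqrt(1.0 - alpha_bar[t]) * eps
    return noisy, eps

def recover_keypoints(noisy, predicted_eps, t, alpha_bar):
    """Invert the forward step given a noise estimate from a denoiser."""
    return (noisy - np.sqrt(1.0 - alpha_bar[t]) * predicted_eps) / np.sqrt(alpha_bar[t])
```

During training, a network would be asked to predict `eps` from the noisy keypoints plus image and text features; at inference, the estimate is refined over successive steps starting from pure noise.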

[63] Region-Adaptive Video Sharpening via Rate-Perception Optimization

Yingxue Pang,Shijie Zhao,Mengxi Guo,Junlin Li,Li Zhang

Main category: cs.CV

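TL;DR: The paper proposes RPO-AdaSharp, an end-to-end region-adaptive video sharpening model that uses rate-perception optimization to adjust sharpening strength dynamically, balancing video quality against bitrate overhead.

Details Motivation: Conventional video sharpening applies a uniform strength and ignores texture variation, which degrades quality and increases bitrate. The paper aims to address this and to optimize how the additional bits are allocated.

Contribution: Proposes the RPO-AdaSharp model, which uses the CTU partition mask as prior information to guide and constrain the allocation of the additional bits, balancing perceptual enhancement with bitrate savings.

Method: End-to-end training, with the CTU partition mask guiding bit allocation and a region-adaptive mechanism adjusting sharpening strength dynamically.

Result: On benchmarks, the model performs well in both qualitative and quantitative analyses, improving video quality while reducing bitrate overhead.

Insight: A region-adaptive sharpening strategy can balance perceptual quality and bitrate at a finer granularity, offering a new direction for video enhancement.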

Abstract: Sharpening is a widely adopted video enhancement technique. However, uniform sharpening intensity ignores texture variations, degrading video quality. Sharpening also increases bitrate, and there’s a lack of techniques to optimally allocate these additional bits across diverse regions. Thus, this paper proposes RPO-AdaSharp, an end-to-end region-adaptive video sharpening model for both perceptual enhancement and bitrate savings. We use the coding tree unit (CTU) partition mask as prior information to guide and constrain the allocation of increased bits. Experiments on benchmarks demonstrate the effectiveness of the proposed model qualitatively and quantitatively.
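As a rough illustration of region-adaptive versus uniform sharpening, the sketch below applies classic unsharp masking with a separate strength per image block. It is not the RPO-AdaSharp model: the paper learns the adaptation end to end with the CTU partition mask as a prior, whereas here the per-block strengths (`strength_map`) are simply given and the block size is a hypothetical stand-in for a CTU.

```python
def box_blur(img):
    """3x3 box blur with edge replication on a 2D list of floats."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def region_adaptive_sharpen(img, strength_map, block=2):
    """Unsharp masking where each block x block region has its own strength.

    strength_map[by][bx] is an assumed, externally supplied strength for the
    region covering rows by*block.. and columns bx*block..
    """
    blurred = box_blur(img)
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = strength_map[y // block][x // block]
            # Add back the high-frequency residual, scaled per region.
            out[y][x] = img[y][x] + s * (img[y][x] - blurred[y][x])
    return out
```

Setting a region's strength to zero leaves it untouched, so no extra bits are spent there, which is the intuition behind trading perceptual gain against bitrate per region.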

[64] MonoPartNeRF: Human Reconstruction from Monocular Video via Part-Based Neural Radiance Fields

Yao Lu,Jiawei Li,Ming Jiang

Main category: cs.CV

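TL;DR: MonoPartNeRF is a part-based neural radiance field method for reconstructing dynamic humans from monocular video. With a bidirectional deformation model and a part-based pose embedding mechanism, it substantially improves reconstruction quality under complex poses and occlusion.

Details Motivation: Existing methods perform poorly under complex pose variations and occlusion, in particular producing unnatural transitions at part boundaries and inaccurate reconstruction of occluded regions. MonoPartNeRF aims to solve these problems and improve the smoothness and robustness of dynamic human rendering.

Contribution: 1. A bidirectional deformation model that combines rigid and non-rigid transformations to give a continuous, reversible mapping between observation and canonical spaces; 2. a part-based pose embedding mechanism that decomposes the global pose into local joint embeddings; 3. keyframe pose retrieval combined with dynamic texture modeling to improve reconstruction quality.

Method: 1. Projects sampling points into a parameterized (u, v, t) space to capture non-rigid motion; 2. suppresses deformation artifacts with a consistency loss; 3. guides pose-aware feature sampling with keyframe pose retrieval and interpolation, and models dynamic texture with an attention mechanism.

Result: Experiments on the ZJU-MoCap and MonoCap datasets show that MonoPartNeRF significantly outperforms existing methods under complex pose and occlusion conditions, with better joint alignment, texture fidelity, and structural continuity.

Insight: Part decomposition and local pose embeddings handle complex poses and occlusion more effectively, and dynamic texture modeling further improves the naturalness and continuity of rendering.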

Abstract: In recent years, Neural Radiance Fields (NeRF) have achieved remarkable progress in dynamic human reconstruction and rendering. Part-based rendering paradigms, guided by human segmentation, allow for flexible parameter allocation based on structural complexity, thereby enhancing representational efficiency. However, existing methods still struggle with complex pose variations, often producing unnatural transitions at part boundaries and failing to reconstruct occluded regions accurately in monocular settings. We propose MonoPartNeRF, a novel framework for monocular dynamic human rendering that ensures smooth transitions and robust occlusion recovery. First, we build a bidirectional deformation model that combines rigid and non-rigid transformations to establish a continuous, reversible mapping between observation and canonical spaces. Sampling points are projected into a parameterized surface-time space (u, v, t) to better capture non-rigid motion. A consistency loss further suppresses deformation-induced artifacts and discontinuities. We introduce a part-based pose embedding mechanism that decomposes global pose vectors into local joint embeddings based on body regions. This is combined with keyframe pose retrieval and interpolation, along three orthogonal directions, to guide pose-aware feature sampling. A learnable appearance code is integrated via attention to model dynamic texture changes effectively. Experiments on the ZJU-MoCap and MonoCap datasets demonstrate that our method significantly outperforms prior approaches under complex pose and occlusion conditions, achieving superior joint alignment, texture fidelity, and structural continuity.

[65] Identity-Preserving Aging and De-Aging of Faces in the StyleGAN Latent Space

Luis S. Luevano,Pavel Korshunov,Sebastien Marcel

Main category: cs.CV

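TL;DR: The paper presents a technique for synthesizing age-changed (aged or de-aged) face images in the StyleGAN2 latent space using support vector modeling and feature selection, while keeping identity consistent.

Details Motivation: Existing face aging methods typically rely on conditional GANs or diffusion models, which are complex to train and make identity consistency hard to guarantee. The authors aim for efficient, identity-preserving age transformation through latent-space editing.

Contribution: 1. A simple method for modeling aging/de-aging directions in the StyleGAN2 latent space; 2. a formula for estimating identity-preserving limits on the editing parameters; 3. a public dataset for benchmarking cross-age face recognition.

Method: Learns aging/de-aging directions in the StyleGAN2 latent space with support vector modeling and feature selection, and uses two state-of-the-art face recognition systems to find the identity-preserving subspace.

Result: A public dataset was generated, showing that face age can be changed while preserving identity.

Insight: Latent-space editing combined with support vector modeling can edit a specific attribute such as age more efficiently, avoiding complex training and large data requirements.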

Abstract: Face aging or de-aging with generative AI has gained significant attention for its applications in fields such as forensics, security, and media. However, most state-of-the-art methods rely on conditional Generative Adversarial Networks (GANs), Diffusion-based models, or Visual Language Models (VLMs) to age or de-age faces based on predefined age categories and conditioning via loss functions, fine-tuning, or text prompts. The reliance on such conditioning leads to complex training requirements, increased data needs, and challenges in generating consistent results. Additionally, identity preservation is rarely taken into account, or is evaluated on a single face recognition system without any control or guarantees on whether identity would be preserved in a generated aged/de-aged face. In this paper, we propose to synthesize aged and de-aged faces by editing the latent space of StyleGAN2 using a simple support vector modeling of the aging/de-aging direction and several feature selection approaches. By using two state-of-the-art face recognition systems, we empirically find the identity-preserving subspace within the StyleGAN2 latent space, so that the apparent age of a given face can be changed while preserving the identity. We then propose a simple yet practical formula for estimating the limits on aging/de-aging parameters that ensures identity preservation for a given input face. Using our method and estimated parameters, we have generated a public dataset of synthetic faces at different ages that can be used for benchmarking cross-age face recognition, age assurance systems, or systems for detection of synthetic images. Our code and dataset are available at the project page https://www.idiap.ch/paper/agesynth/
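The latent-editing step described here is commonly implemented by moving a latent code along the unit normal of a linear classifier's separating hyperplane. The sketch below shows that generic operation under stated assumptions: `normal` stands for the normal of a linear SVM separating "older" from "younger" latents, and `alpha_limit` is a placeholder for the identity-preserving bound, whose formula the abstract mentions but does not give. Neither name comes from the paper.

```python
import math


def unit(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return [x / norm for x in v]


def edit_latent(w, normal, alpha, alpha_limit):
    """Shift latent w by alpha along the unit aging direction.

    alpha > 0 ages, alpha < 0 de-ages. alpha is clamped to
    [-alpha_limit, alpha_limit]; the limit is supplied by the caller
    (hypothetical here, estimated per input face in the paper).
    """
    a = max(-alpha_limit, min(alpha, alpha_limit))
    direction = unit(normal)
    return [wi + a * di for wi, di in zip(w, direction)]
```

In practice `w` would be a StyleGAN2 latent code and the edited vector would be passed to the generator; the toy vectors used to try this out can have any dimension.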

[66] Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment

Shi-Chen Zhang,Yunheng Li,Yu-Huan Wu,Qibin Hou,Ming-Ming Cheng

Main category: cs.CV

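TL;DR: The paper proposes a coupled dual-branch offset learning paradigm, and an efficient network built on it called OffSeg, that explicitly learns feature and class offsets to dynamically refine class representations and spatial image features. This addresses the misalignment between classes and features in efficient semantic segmentation, and experiments show consistent gains on multiple datasets.

Details Motivation: Existing efficient semantic segmentation methods suffer from misalignment between class representations and image features caused by the per-pixel classification paradigm, a notable weakness on resource-constrained devices.

Contribution: 1. A dual-branch offset learning paradigm that dynamically refines class representations and spatial features, with OffSeg as the network built on it; 2. the paradigm can be applied to existing methods without additional architectural changes.

Method: A coupled dual-branch structure learns feature offsets and class offsets, dynamically aligning image features with class representations.

Result: On ADE20K and other datasets, the paradigm improves the mIoU of models such as SegFormer-B0 (by 2.7% on ADE20K) with only 0.1-0.2M additional parameters.

Insight: Offset learning effectively addresses feature-class alignment in efficient semantic segmentation and generalizes well across existing methods.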

Abstract: Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: Image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be adopted to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible parameters. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and Mask2Former-Tiny by 2.7%, 1.9%, and 2.6% mIoU, respectively, with only 0.1-0.2M additional parameters required.

[67] 3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs

Noor Ahmed,Cameron Braunstein,Steffen Eger,Eddy Ilg

Main category: cs.CV

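TL;DR: The paper proposes 3DFroMLLM, a framework that uses pretrained multimodal large language models (MLLMs) to generate 3D object prototypes directly, including geometry and part labels, without additional training data or detailed user instructions. Its rendered images perform well for image classification pretraining, and the prototypes substantially improve fine-grained vision-language models.

Details Motivation: Current MLLMs have limited spatial reasoning and struggle to generate 3D object prototypes directly. The work aims to tap these models' potential with an automated method that needs no additional training.

Contribution: 1. The 3DFroMLLM framework, which generates 3D prototypes directly from pretrained MLLMs. 2. No additional training data or detailed user instructions are required. 3. Rendered images outperform previous methods by 15% on image classification pretraining. 4. Fine-tuning CLIP on the rendered, part-labeled prototypes yields a 55% accuracy improvement on part segmentation.

Method: An agentic pipeline with a designer, a coder, and a visual inspector operating in a refinement loop. Building on the joint text-image representations learned by MLLMs, the framework generates geometry and part labels automatically.

Result: Rendered images outperform previous methods by 15% on classification pretraining tasks. After fine-tuning CLIP, part segmentation accuracy improves by 55%.

Insight: The spatial representations of pretrained MLLMs are sufficient to support 3D prototype generation, and the rendered images can improve downstream tasks. The automated pipeline reduces the need for human annotation.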

Abstract: Recent Multi-Modal Large Language Models (MLLMs) have demonstrated strong capabilities in learning joint representations from text and images. However, their spatial reasoning remains limited. We introduce 3DFroMLLM, a novel framework that enables the generation of 3D object prototypes directly from MLLMs, including geometry and part labels. Our pipeline is agentic, comprising a designer, coder, and visual inspector operating in a refinement loop. Notably, our approach requires no additional training data or detailed user instructions. Building on prior work in 2D generation, we demonstrate that rendered images produced by our framework can be effectively used for image classification pretraining tasks and outperform previous methods by 15%. As a compelling real-world use case, we show that the generated prototypes can be leveraged to improve fine-grained vision-language models: using the rendered, part-labeled prototypes to fine-tune CLIP for part segmentation achieves a 55% accuracy improvement without relying on any additional human-labeled data.

[68] Adaptive High-Frequency Preprocessing for Video Coding

Yingxue Pang,Shijie Zhao,Junlin Li,Li Zhang

Main category: cs.CV

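TL;DR: The paper proposes a learning-based adaptive high-frequency preprocessing framework for video coding that optimizes how high-frequency components are handled, balancing subjective quality improvement against bitrate savings.

Details Motivation: High-frequency components are crucial for video clarity but increase coding bitrate, raising bandwidth and storage costs. An adaptive method is therefore needed to reach the best tradeoff between bitrate and quality.

Contribution: Proposes the Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict the optimal high-frequency preprocessing strategy, trained with pseudo-labels derived from rate-distortion (RD) performance.

Method: FFPN predicts the preprocessing strategy, which guides the subsequent filtering operators. Training videos are pseudo-labeled by comparing RD performance across preprocessing types and strengths.

Result: Evaluations on multiple datasets show that the framework improves visual quality while saving bitrate.

Insight: An adaptive high-frequency preprocessing strategy can balance bitrate and quality in video coding more efficiently, offering a practical optimization for real-world use.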

Abstract: High-frequency components are crucial for maintaining video clarity and realism, but they also significantly impact coding bitrate, resulting in increased bandwidth and storage costs. This paper presents an end-to-end learning-based framework for adaptive high-frequency preprocessing to enhance subjective quality and save bitrate in video coding. The framework employs the Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict the optimal high-frequency preprocessing strategy, guiding subsequent filtering operators to achieve the optimal tradeoff between bitrate and quality after compression. For training FFPN, we pseudo-label each training video with the optimal strategy, determined by comparing the rate-distortion (RD) performance across different preprocessing types and strengths. Distortion is measured using the latest quality assessment metric. Comprehensive evaluations on multiple datasets demonstrate the visually appealing enhancement capabilities and bitrate savings achieved by our framework.
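The pseudo-labeling step can be pictured as picking, for each training video, the preprocessing candidate with the best rate-distortion outcome. The sketch below assumes the usual Lagrangian cost J = D + λR as the comparison rule; the abstract only says RD performance is compared, so the cost form, the value of `lam`, and the strategy names are all illustrative assumptions.

```python
def pseudo_label(candidates, lam):
    """Return the strategy name with the lowest cost D + lam * R.

    candidates maps a strategy name to (distortion, bitrate), both measured
    after encoding the video preprocessed with that strategy. Lower
    distortion and lower bitrate are better.
    """
    if not candidates:
        raise ValueError("need at least one candidate strategy")
    return min(
        candidates,
        key=lambda name: candidates[name][0] + lam * candidates[name][1],
    )


# Hypothetical measurements for one video: (distortion, bitrate in kbps).
example = {
    "none": (10.0, 100.0),
    "smooth_weak": (10.5, 80.0),
    "sharpen_strong": (8.0, 140.0),
}
```

With a larger `lam` the label shifts toward bitrate-saving strategies, and with a smaller `lam` toward quality-enhancing ones; the resulting label is what a predictor such as FFPN would be trained to output.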

[69] GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments

Lin Zeng,Boming Zhao,Jiarui Hu,Xujie Shen,Ziqiang Dang,Hujun Bao,Zhaopeng Cui

Main category: cs.CV

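TL;DR: GaussianUpdate is a new method that combines 3D Gaussian representation with continual learning for real-time rendering and updating of changing scenes.

Details Motivation: Existing methods either require extensive retraining to handle scene changes or fail to capture detailed changes, so an efficient and accurate update method is needed.

Contribution: Proposes GaussianUpdate, which updates changing scenes efficiently through a multi-stage update strategy and visibility-aware continual learning.

Method: Combines 3D Gaussian representation with multi-stage updating, and introduces visibility-aware continual learning with generative replay.

Result: Achieves real-time rendering on the benchmark dataset and can visualize scene changes at different points in time.

Insight: Combining Gaussian representation with continual learning gives an efficient and scalable way to model changing scenes.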

Abstract: Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. Experiments on the benchmark dataset demonstrate that our method achieves superior, real-time rendering and can visualize changes over time.

[70] Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos

Chaoyi Wang,Yifan Yang,Jun Pei,Lijie Xia,Jianpo Liu,Xiaobing Yuan,Xinhan Di

Main category: cs.CV

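TL;DR: The paper introduces WB-DH, an open-source multi-modal benchmark dataset for evaluating the generation of whole-body animatable avatars, addressing gaps in existing datasets.

Details Motivation: Existing datasets are limited in capturing subtle expressions, body movements, and dynamic backgrounds, and cannot meet the needs of whole-body avatar generation.

Contribution: 1) Fine-grained multi-modal annotations; 2) a versatile evaluation framework; 3) public release of the dataset and tools.

Method: Builds the multi-modal annotated dataset WB-DH and provides an evaluation framework to support comprehensive evaluation of whole-body avatar generation.

Result: The WB-DH dataset and tools are open-sourced, giving the research community a new benchmark resource.

Insight: Highlights the importance of multi-modal data for whole-body avatar generation, and provides data and methodological guidance for future research.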

Abstract: Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short in addressing these complexities. To bridge this gap, we introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal benchmark designed for evaluating whole-body animatable avatar generation. Key features include: (1) detailed multi-modal annotations for fine-grained guidance, (2) a versatile evaluation framework, and (3) public access to the dataset and tools at https://github.com/deepreasonings/WholeBodyBenchmark.

[71] A Robust Epipolar-Domain Regularization Algorithm for Light Field Depth Estimation

Noor Islam S. Mohammad

Main category: cs.CV

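TL;DR: The paper proposes a lightweight light field depth estimation algorithm that combines light field-based disparity information with a directed random walk refinement algorithm, offering low computational complexity and competitive accuracy.

Details Motivation: Robust depth estimation in light field imaging is a key challenge for applications such as augmented reality, biomedical imaging, and scene reconstruction. Existing deep convolutional network methods are computationally expensive and struggle in noisy environments.

Contribution: A method that needs no training on large-scale datasets, improves depth map consistency via light field disparity and a directed random walk, and reduces computational cost.

Method: Combines light field disparity information with a directed random walk refinement algorithm to improve the robustness and consistency of depth maps.

Result: Tested on the 4D Light Field Benchmark dataset and real-world images, the algorithm keeps computational complexity low while achieving accuracy competitive with state-of-the-art deep learning models.

Insight: The work shows the potential of integrating probabilistic graph models with depth sensing frameworks, and points toward efficient, practical algorithm design for light field depth estimation.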

Abstract: Robust depth estimation in light field imaging remains a critical challenge for pattern recognition applications such as augmented reality, biomedical imaging, and scene reconstruction. While existing approaches often rely heavily on deep convolutional neural networks, they tend to incur high computational costs and struggle in noisy real-world environments. This paper proposes a novel lightweight depth estimation pipeline that integrates light field-based disparity information with a directed random walk refinement algorithm. Unlike traditional CNN-based methods, our approach enhances depth map consistency without requiring extensive training or large-scale datasets. The proposed method was evaluated on the 4D Light Field Benchmark dataset and a diverse set of real-world images. Experimental results indicate that while performance slightly declines under uncontrolled conditions, the algorithm consistently maintains low computational complexity and competitive accuracy compared to state-of-the-art deep learning models. These findings highlight the potential of our method as a robust and efficient alternative for depth estimation and segmentation in light field imaging. The work provides insights into practical algorithm design for light field-based pattern recognition and opens new directions for integrating probabilistic graph models with depth sensing frameworks.

[72] Masked Clustering Prediction for Unsupervised Point Cloud Pre-training

Bin Ren,Xiaoshui Huang,Mengyuan Liu,Hong Liu,Fabio Poiesi,Nicu Sebe,Guofeng Mei

Main category: cs.CV

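TL;DR: The paper proposes MaskClu, an unsupervised pre-training method for vision transformers (ViTs) on 3D point clouds that combines masked point modeling with clustering-based learning. By jointly optimizing dense semantic reconstruction and instance-level contrastive learning, it improves the learning of semantic features.

Details Motivation: Current masked autoencoding pre-training for 3D point clouds falls short in learning dense semantic features. The authors aim to improve ViT performance on point cloud tasks by combining clustering-based and contrastive learning.

Contribution: 1. MaskClu, a novel method that combines masked point modeling with clustering-based learning for 3D point cloud pre-training; 2. a global contrastive learning mechanism that strengthens instance-level feature learning; 3. new competitive results on multiple 3D tasks.

Method: MaskClu reconstructs cluster assignments and cluster centers from masked point clouds to learn dense semantic information, and adds global contrastive learning to refine features.

Result: MaskClu performs well on part segmentation, semantic segmentation, object detection, and classification, setting new competitive results.

Insight: Combining clustering and contrastive learning improves the semantic representations ViTs learn on 3D point clouds, offering a new direction for unsupervised pre-training.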

Abstract: Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance-level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of our method via multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, where MaskClu sets new competitive results. The code and models will be released at: https://github.com/Amazingren/maskclu.

[73] Automatic and standardized surgical reporting for central nervous system tumors

David Bouget,Mathilde Gajda Faanes,Asgeir Store Jakola,Frederik Barkhof,Hilko Ardon,Lorenzo Bello,Mitchel S. Berger,Shawn L. Hervey-Jumper,Julia Furtner,Albert J. S. Idema,Barbara Kiesel,Georg Widhalm,Rishi Nandoe Tewarie,Emmanuel Mandonnet,Pierre A. Robe,Michiel Wagemakers,Timothy R. Smith,Philip C. De Witt Hamer,Ole Solheim,Ingerid Reinertsen

Main category: cs.CV

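TL;DR: The paper presents an automated analysis pipeline for standardized postoperative reporting of central nervous system tumors, using Attention U-Net for tumor segmentation and DenseNet for classification, integrated into an open-source software platform.

Details Motivation: Prior work has focused mainly on preoperative image analysis, while postoperative image analysis lacks standardized, automated solutions, which hampers clinical decision-making.

Contribution: A comprehensive postoperative reporting pipeline that combines deep learning models for segmentation and classification, follows the RANO 2.0 guidelines, and is integrated into the open-source platform Raidionics.

Method: Attention U-Net for tumor segmentation (preoperative non-enhancing tumor core, postoperative contrast-enhancing residual tumor, and resection cavity) and DenseNet for MR sequence classification and tumor type identification. Training used multicentric datasets of 2000 to 7000 patients with 5-fold cross-validation.

Result: The segmentation models reach average voxel-wise Dice scores of 87% (tumor core), 66% (non-enhancing tumor core), 70% (contrast-enhancing residual tumor), and 77% (resection cavity). The classification models reach balanced accuracies of 99.5% for MR sequence classification and 80% for tumor type classification.

Insight: The study fills a gap in postoperative image analysis. Combining a standardized reporting pipeline with open-source tooling can improve the efficiency of clinical decision-making and provides a baseline for future research.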

Abstract: Magnetic resonance (MR) imaging is essential for evaluating central nervous system (CNS) tumors, guiding surgical planning, treatment decisions, and assessing postoperative outcomes and complication risks. While recent work has advanced automated tumor segmentation and report generation, most efforts have focused on preoperative data, with limited attention to postoperative imaging analysis. This study introduces a comprehensive pipeline for standardized postsurgical reporting in CNS tumors. Using the Attention U-Net architecture, segmentation models were trained for the preoperative (non-enhancing) tumor core, postoperative contrast-enhancing residual tumor, and resection cavity. Additionally, MR sequence classification and tumor type identification for contrast-enhancing lesions were explored using the DenseNet architecture. The models were integrated into a reporting pipeline, following the RANO 2.0 guidelines. Training was conducted on multicentric datasets comprising 2000 to 7000 patients, using 5-fold cross-validation. Evaluation included patient-, voxel-, and object-wise metrics, with benchmarking against the latest BraTS challenge results. The segmentation models achieved average voxel-wise Dice scores of 87%, 66%, 70%, and 77% for the tumor core, non-enhancing tumor core, contrast-enhancing residual tumor, and resection cavity, respectively. Classification models reached 99.5% balanced accuracy in MR sequence classification and 80% in tumor type classification. The pipeline presented in this study enables robust, automated segmentation, MR sequence classification, and standardized report generation aligned with RANO 2.0 guidelines, enhancing postoperative evaluation and clinical decision-making. The proposed models and methods were integrated into Raidionics, an open-source software platform for CNS tumor analysis, now including a dedicated module for postsurgical analysis.
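For readers unfamiliar with the voxel-wise Dice score reported here, the following minimal sketch computes it for two binary masks stored as flat sequences of 0/1 values. The convention for two empty masks (score 1.0) is an assumption of this sketch, not something the abstract specifies.

```python
def dice_score(pred, target):
    """Voxel-wise Dice coefficient: 2 * |P ∩ T| / (|P| + |T|).

    pred and target are equally sized flat sequences of 0/1 voxel labels.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction covering two voxels against a ground truth covering one of them gives 2·1 / (2 + 1) ≈ 0.67; a reported 87% average therefore means substantial but imperfect overlap.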

[74] A Pseudo Global Fusion Paradigm-Based Cross-View Network for LiDAR-Based Place Recognition

Jintao Cheng,Jiehao Luo,Xieyuanli Chen,Jin Wu,Rui Fan,Xiaoyu Tang,Wei Zhang

Main category: cs.CV

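TL;DR: The paper proposes a cross-view network based on a pseudo-global fusion paradigm for LiDAR-based place recognition. Coordinated learning across multi-modal branches and a Mahalanobis distance metric built on an SPD matrix improve performance in complex environments.

Details Motivation: Existing LiDAR place recognition methods rely on Euclidean distance metric learning, neglecting the intrinsic structure of the feature space and intra-class variance, which leads to poor performance in complex environments and time-varying scenarios.

Contribution: 1. A pseudo-global information guidance mechanism that unifies the semantic space of the multi-modal branches; 2. a Mahalanobis distance metric based on an SPD matrix that replaces the traditional Euclidean distance and characterizes the data distribution more accurately.

Method: A cross-view network with multi-modal branches, using the Manifold Adaptation and Pairwise Variance-Locality Learning Metric to construct the SPD matrix.

Result: Experiments show competitive performance, particularly in complex environmental conditions.

Insight: Optimizing the feature space from a geometric perspective captures the intrinsic distribution of scenes and inter-class relationships more effectively, which suits dynamically changing real-world scenes.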

Abstract: LiDAR-based Place Recognition (LPR) remains a critical task in Embodied Artificial Intelligence (AI) and Autonomous Driving, primarily addressing localization challenges in GPS-denied environments and supporting loop closure detection. Existing approaches reduce place recognition to a Euclidean distance-based metric learning task, neglecting the feature space’s intrinsic structures and intra-class variances. Such Euclidean-centric formulation inherently limits the model’s capacity to capture nonlinear data distributions, leading to suboptimal performance in complex environments and temporal-varying scenarios. To address these challenges, we propose a novel cross-view network based on an innovative fusion paradigm. Our framework introduces a pseudo-global information guidance mechanism that coordinates multi-modal branches to perform feature learning within a unified semantic space. Concurrently, we propose a Manifold Adaptation and Pairwise Variance-Locality Learning Metric that constructs a Symmetric Positive Definite (SPD) matrix to compute Mahalanobis distance, superseding traditional Euclidean distance metrics. This geometric formulation enables the model to accurately characterize intrinsic data distributions and capture complex inter-class dependencies within the feature space. Experimental results demonstrate that the proposed algorithm achieves competitive performance, particularly excelling in complex environmental conditions.
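The move from Euclidean to Mahalanobis distance can be illustrated with a small sketch. It builds a symmetric positive definite (SPD) matrix as M = L Lᵀ + εI, a common parameterization that is assumed here rather than taken from the paper, and evaluates d(x, y) = sqrt((x − y)ᵀ M (x − y)). How the paper's learning metric actually constructs M is not described in the abstract.

```python
import math


def spd_from_factor(L, eps=1e-6):
    """Build M = L L^T + eps * I, which is symmetric positive definite for eps > 0."""
    n = len(L)
    M = [
        [sum(L[i][k] * L[j][k] for k in range(len(L[0]))) for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        M[i][i] += eps
    return M


def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for an SPD matrix M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(di * mi for di, mi in zip(d, Md)))
```

With M equal to the identity the function reduces to the Euclidean distance; a learned M stretches or shrinks individual feature directions, which is what lets the metric reflect intra-class variance.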

[75] Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions

Miruna-Alexandra Gafencu,Reem Shaban,Yordanka Velikova,Mohammad Farid Azampour,Nassir Navab

Main category: cs.CV

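TL;DR: The paper presents a new system that combines robotic ultrasound with real-time shape completion to enhance spine visualization, and validates its accuracy and practicality experimentally.

Details Motivation: Ultrasound imaging is widely used in spinal procedures because it is real-time and radiation-free, but its effectiveness is limited by shadowing artifacts, and traditional approaches such as CT-to-ultrasound registration are complex. The paper aims to provide a more efficient and reproducible solution.

Contribution: An integrated system combining robotic ultrasound with real-time shape completion that autonomously acquires ultrasound sweeps, reconstructs the complete spinal anatomy, and offers interactive real-time visualization.

Method: A robotic platform autonomously acquires ultrasound data of the lumbar spine, vertebral surfaces are extracted, and a deep learning-based shape completion network reconstructs the complete anatomy. Shape completion accuracy and the visualization are also validated.

Result: Quantitative experiments assess shape completion accuracy, multiple spine acquisition protocols are evaluated on a phantom setup, and qualitative visualization results are shown on a volunteer scan.

Insight: The work offers a more consistent and reproducible solution for spinal procedures. Combining robotics with deep learning improves the practicality and visualization capability of ultrasound imaging.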

Abstract: Ultrasound (US) imaging is increasingly used in spinal procedures due to its real-time, radiation-free capabilities; however, its effectiveness is hindered by shadowing artifacts that obscure deeper tissue structures. Traditional approaches, such as CT-to-US registration, incorporate anatomical information from preoperative CT scans to guide interventions, but they are limited by complex registration requirements, differences in spine curvature, and the need for recent CT imaging. Recent shape completion methods can offer an alternative by reconstructing spinal structures in US data, while being pretrained on a large set of publicly available CT scans. However, these approaches are typically offline and have limited reproducibility. In this work, we introduce a novel integrated system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization. Our robotic platform autonomously acquires US sweeps of the lumbar spine, extracts vertebral surfaces from ultrasound, and reconstructs the complete anatomy using a deep learning-based shape completion network. This framework provides interactive, real-time visualization with the capability to autonomously repeat scans and can enable navigation to target locations. This can contribute to better consistency, reproducibility, and understanding of the underlying anatomy. We validate our approach through quantitative experiments assessing shape completion accuracy and evaluations of multiple spine acquisition protocols on a phantom setup. Additionally, we present qualitative results of the visualization on a volunteer scan.

[76] MADPromptS: Unlocking Zero-Shot Morphing Attack Detection with Multiple Prompt Aggregation

Eduarda Caldeira,Fadi Boutros,Naser Damer

Main category: cs.CV

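TL;DR: The paper proposes MADPromptS, a CLIP-based zero-shot method that aggregates multiple textual prompts to detect face morphing attacks without any fine-tuning.

Details Motivation: In face morphing attack detection (MAD), prior work on multimodal foundation models (FMs) has relied on fine-tuning, overlooking their potential for direct deployment. The paper explores a pure zero-shot approach that exploits CLIP's pretrained knowledge through prompt engineering.

Contribution: 1. Studies CLIP's zero-shot capability on the MAD task; 2. proposes the multi-prompt aggregation method MADPromptS, which substantially improves detection performance; 3. demonstrates the importance of prompt design for the generalization of foundation models.

Method: Designs multiple textual prompts per class and aggregates their embeddings to enrich the representation, using CLIP's zero-shot capability directly to detect morphing attacks without additional fine-tuning or training.

Result: Experiments show that multi-prompt aggregation substantially improves zero-shot detection performance, confirming that prompt engineering is an efficient way to exploit the built-in knowledge of foundation models.

Insight: 1. Pretrained knowledge in foundation models can be used directly for zero-shot tasks; 2. prompt design is key to unlocking model potential; 3. aggregating multiple prompts captures more varied attack cues than a single prompt.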

Abstract: Face Morphing Attack Detection (MAD) is a critical challenge in face recognition security, where attackers can fool systems by interpolating the identity information of two or more individuals into a single face image, resulting in samples that can be verified as belonging to multiple identities by face recognition systems. While multimodal foundation models (FMs) like CLIP offer strong zero-shot capabilities by jointly modeling images and text, most prior works on FMs for biometric recognition have relied on fine-tuning for specific downstream tasks, neglecting their potential for direct, generalizable deployment. This work explores a pure zero-shot approach to MAD by leveraging CLIP without any additional training or fine-tuning, focusing instead on the design and aggregation of multiple textual prompts per class. By aggregating the embeddings of diverse prompts, we better align the model’s internal representations with the MAD task, capturing richer and more varied cues indicative of bona-fide or attack samples. Our results show that prompt aggregation substantially improves zero-shot detection performance, demonstrating the effectiveness of exploiting foundation models’ built-in multimodal knowledge through efficient prompt engineering.
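The aggregation idea can be sketched without loading CLIP itself. The code below assumes text and image embeddings have already been computed by some encoder and aggregates each class's prompt embeddings by averaging their L2-normalized vectors, then classifies by cosine similarity. Mean aggregation is an assumption of this sketch (the abstract does not say which aggregation MADPromptS uses), and the toy vectors in the usage note are made up.

```python
import math


def l2_normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def aggregate(prompt_embeddings):
    """Mean of L2-normalized prompt embeddings, re-normalized to unit length."""
    normed = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def classify(image_embedding, class_prompts):
    """Return (best_label, scores) using cosine similarity to each class prototype.

    class_prompts maps a label to a list of prompt embeddings for that class.
    """
    img = l2_normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, aggregate(embs)))
        for label, embs in class_prompts.items()
    }
    return max(scores, key=scores.get), scores
```

As a usage example, with prompts `{"bona fide": [[1, 0, 0], [0.9, 0.1, 0]], "attack": [[0, 1, 0], [0.1, 0.9, 0]]}` an image embedding of `[0.8, 0.2, 0]` is assigned to "bona fide", since it lies closer to that class's averaged prototype.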

[77] Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation

Ao Ma,Jiasong Feng,Ke Cao,Jing Wang,Yun Wang,Quanwei Zhang,Zhanjie Zhang

Main category: cs.CV

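TL;DR: The paper proposes Lay2Story, a framework based on Diffusion Transformers (DiTs) for layout-togglable story generation. Layout conditions improve the consistency of the generated frame sequence and enable fine-grained control, and the authors introduce the large-scale, high-quality dataset Lay2Story-1M and the benchmark Lay2Story-Bench.

Details Motivation: Current story generation methods struggle to maintain subject consistency and fine-grained control, and lack high-quality data. The paper addresses these issues by introducing layout conditions.

Contribution: 1) The layout-togglable storytelling task; 2) the large-scale dataset Lay2Story-1M and the benchmark Lay2Story-Bench; 3) the DiT-based Lay2Story framework, which outperforms the previous SOTA in consistency and fine-grained control.

Method: Built on the Diffusion Transformers (DiTs) architecture, the framework uses layout conditions (such as subject position and attributes) to strengthen inter-frame interaction and consistency control, and is trained on the new dataset.

Result: Experiments show that Lay2Story outperforms existing methods in consistency, semantic correlation, and aesthetic quality.

Insight: Layout conditions are an effective tool for improving consistency and fine-grained control in generation tasks, and high-quality datasets are essential for performance.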

Abstract: Storytelling tasks involving generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject’s position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject’s position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject’s position, appearance, and other key details. Building on this, we introduce an advanced storytelling task: Layout-Togglable Storytelling, which enables precise subject control by incorporating layout conditions. To address the lack of high-quality datasets with layout annotations for this task, we develop Lay2Story-1M, which contains over 1 million 720p and higher-resolution images, processed from approximately 11,300 hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a benchmark with 3,000 prompts designed to evaluate the performance of different methods on this task. Furthermore, we propose Lay2Story, a robust framework based on the Diffusion Transformers (DiTs) architecture for Layout-Togglable Storytelling tasks. Through both qualitative and quantitative experiments, we find that our method outperforms the previous state-of-the-art (SOTA) techniques, achieving the best results in terms of consistency, semantic correlation, and aesthetic quality.

[78] Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering

Elman Ghazaei,Erchan Aptoula

Main category: cs.CV

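TL;DR: The paper proposes the Text-Conditioned State Space Model (TCSSM) for domain-generalized Change Detection Visual Question Answering (CDVQA), targeting domain shift, and introduces BrightVQA, a new multi-modal, multi-domain dataset.

Details Motivation: Traditional change detection methods require expert knowledge. CDVQA aims to give non-expert users more flexible access to change information, but existing methods assume that training and test data share similar distributions, which does not hold in practice where domain shifts occur.

Contribution: 1. The BrightVQA dataset, supporting domain generalization research in CDVQA; 2. the TCSSM model, which fuses bi-temporal imagery and geo-disaster-related text to extract domain-invariant features.

Method: TCSSM uses bi-temporal images and geo-disaster-related text to dynamically predict its input-dependent parameters, processing visual and textual data in a unified manner to align them.

Result: Experiments show that TCSSM consistently outperforms state-of-the-art models.

Insight: Dynamic fusion of multi-modal data (images plus text) can improve a model's domain generalization.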

Abstract: The Earth’s surface is constantly changing, and detecting these changes provides valuable insights that benefit various aspects of human society. While traditional change detection methods have been employed to detect changes from bi-temporal images, these approaches typically require expert knowledge for accurate interpretation. To enable broader and more flexible access to change information by non-expert users, the task of Change Detection Visual Question Answering (CDVQA) has been introduced. However, existing CDVQA methods have been developed under the assumption that training and testing datasets share similar distributions. This assumption does not hold in real-world applications, where domain shifts often occur. In this paper, the CDVQA task is revisited with a focus on addressing domain shift. To this end, a new multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate domain generalization research in CDVQA. Furthermore, a novel state space model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The TCSSM framework is designed to leverage both bi-temporal imagery and geo-disaster-related textual information in a unified manner to extract domain-invariant features. The input-dependent parameters of TCSSM are dynamically predicted from both the bi-temporal images and the geo-disaster-related descriptions, thereby facilitating the alignment between bi-temporal visual data and the associated textual descriptions. Extensive experiments are conducted to evaluate the proposed method against state-of-the-art models, and superior performance is consistently demonstrated. The code and dataset will be made publicly available upon acceptance at https://github.com/Elman295/TCSSM.

[79] TaoCache: Structure-Maintained Video Generation Acceleration

Zhentao Fan,Zongzuo Wang,Weiwei Zhang

Main category: cs.CV

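TL;DR: TaoCache is a training-free, plug-and-play caching strategy that predicts the model's noise output from a fixed-point perspective, preserving high-resolution structure in video generation and substantially improving visual quality at a given speedup.

Details Motivation: Existing cache-based acceleration methods for video generation typically skip early or mid denoising steps, which causes structural discrepancies and hurts instruction following, degrading generation quality.

Contribution: Proposes TaoCache, which predicts the noise output by calibrating cosine similarities and norm ratios, preserves structural integrity, and is suited to late denoising stages.

Method: Adopts a fixed-point perspective to predict the noise and calibrates the cosine similarities and norm ratios of consecutive noise deltas to obtain an efficient caching strategy.

Result: Across Latte-1, OpenSora-Plan v110, and Wan2.1, TaoCache achieves substantially higher visual quality (LPIPS, SSIM, PSNR) than prior caching methods at the same speedup.

Insight: The fixed-point perspective and noise calibration are key to preserving video structure. TaoCache is orthogonal to other acceleration methods such as PAB and TeaCache and integrates seamlessly into DiT-based frameworks.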

Abstract: Existing cache-based acceleration methods for video diffusion models primarily skip early or mid denoising steps, which often leads to structural discrepancies relative to full-timestep generation and can hinder instruction following and character consistency. We present TaoCache, a training-free, plug-and-play caching strategy that, instead of residual-based caching, adopts a fixed-point perspective to predict the model’s noise output and is specifically effective in late denoising stages. By calibrating cosine similarities and norm ratios of consecutive noise deltas, TaoCache preserves high-resolution structure while enabling aggressive skipping. The approach is orthogonal to complementary accelerations such as Pyramid Attention Broadcast (PAB) and TeaCache, and it integrates seamlessly into DiT-based frameworks. Across Latte-1, OpenSora-Plan v110, and Wan2.1, TaoCache attains substantially higher visual quality (LPIPS, SSIM, PSNR) than prior caching methods under the same speedups.
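
The two statistics named in the abstract, the cosine similarity and the norm ratio of consecutive noise deltas, can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the flat-list representation of noise outputs, and the extrapolation rule in `extrapolate_next` are assumptions made only for illustration.

```python
import math


def delta(a, b):
    """Element-wise difference a - b between two noise outputs."""
    return [x - y for x, y in zip(a, b)]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def delta_statistics(noise_outputs):
    """Return (cosine similarity, norm ratio) for each pair of consecutive
    deltas d_t = e_t - e_{t-1} in a sequence of noise outputs e_0, e_1, ..."""
    deltas = [delta(noise_outputs[t], noise_outputs[t - 1])
              for t in range(1, len(noise_outputs))]
    stats = []
    for prev, cur in zip(deltas, deltas[1:]):
        prev_norm = norm(prev)
        ratio = norm(cur) / prev_norm if prev_norm > 0.0 else float("inf")
        stats.append((cosine(prev, cur), ratio))
    return stats


def extrapolate_next(noise_outputs, ratio):
    """Hypothetical skip rule (an assumption, not from the paper): continue the
    last delta, scaled by a calibrated norm ratio, instead of calling the model."""
    last = delta(noise_outputs[-1], noise_outputs[-2])
    return [e + ratio * d for e, d in zip(noise_outputs[-1], last)]
```

When consecutive deltas point in nearly the same direction (cosine close to 1) and shrink at a steady ratio, a skipped step can plausibly be approximated from cached outputs. How TaoCache actually calibrates and applies these quantities is described in the paper itself.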

[80] ColorGPT: Leveraging Large Language Models for Multimodal Color Recommendation

Ding Xia,Naoto Inoue,Qianru Qiu,Kotaro Kikuchi

Main category: cs.CV

TL;DR: ColorGPT proposes using Large Language Models (LLMs) for color recommendation in design tasks, outperforming traditional methods in accuracy and diversity.

Details Motivation: Traditional color recommendation methods struggle with design complexity and limited data. Leveraging LLMs' commonsense reasoning could improve results.

Contribution: ColorGPT is a robust pipeline using LLMs for color palette completion and full palette generation, validated through systematic testing and prompt engineering.

Method: The approach tests multiple color representations, applies prompt engineering, and leverages LLMs for recommending colors based on partial palettes or textual descriptions.

Result: ColorGPT outperformed existing methods in accuracy for palette completion and achieved better diversity and similarity in full palette generation.

Insight: LLMs can effectively bridge the gap in multimodal color recommendation tasks, showcasing their potential beyond traditional language tasks.

Abstract: Colors play a crucial role in the design of vector graphic documents by enhancing visual appeal, facilitating communication, improving usability, and ensuring accessibility. In this context, color recommendation involves suggesting appropriate colors to complete or refine a design when one or more colors are missing or require alteration. Traditional methods often struggled with these challenges due to the complex nature of color design and the limited data availability. In this study, we explored the use of pretrained Large Language Models (LLMs) and their commonsense reasoning capabilities for color recommendation, raising the question: Can pretrained LLMs serve as superior designers for color recommendation tasks? To investigate this, we developed a robust, rigorously validated pipeline, ColorGPT, that was built by systematically testing multiple color representations and applying effective prompt engineering techniques. Our approach primarily targeted color palette completion by recommending colors based on a set of given colors and accompanying context. Moreover, our method can be extended to full palette generation, producing an entire color palette corresponding to a provided textual description. Experimental results demonstrated that our LLM-based pipeline outperformed existing methods in terms of color suggestion accuracy and the distribution of colors in the color palette completion task. For the full palette generation task, our approach also yielded improvements in color diversity and similarity compared to current techniques.

[81] KFFocus: Highlighting Keyframes for Enhanced Video Understanding

Ming Nie,Chunwei Wang,Hang Xu,Li Zhang

Main category: cs.CV

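TL;DR: KFFocus is a video understanding method that selects keyframes and dynamically compresses intra-frame information, substantially improving the computational efficiency and accuracy of video large language models (Vid-LLMs).

Details Motivation: Existing video LLMs typically compress video frames with uniform sampling or fixed compression strategies, ignoring the uneven temporal distribution of important information and risking the loss of keyframes.

Contribution: 1. Proposes KFFocus, which optimizes frame processing through non-uniform sampling and dynamic compression ratios; 2. Introduces a spatiotemporal modeling module that strengthens the model's understanding of spatial-temporal dynamics.

Method: Drawing on classic video compression principles, KFFocus assigns compression ratios dynamically according to each frame's redundancy and contextual relevance, and uses a spatiotemporal modeling module to capture inter-frame and intra-frame relationships.

Result: On benchmarks covering long-video scenarios, KFFocus significantly outperforms existing methods in both computational efficiency and accuracy.

Insight: Dynamically adjusting per-frame compression ratios and capturing keyframes are key to better video understanding, and spatiotemporal modeling further improves the model's grasp of complex video content.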

Abstract: Recently, with the emergence of large language models, multimodal LLMs have demonstrated exceptional capabilities in image and video modalities. Despite advancements in video comprehension, the substantial computational demands of long video sequences lead current video LLMs (Vid-LLMs) to employ compression strategies at both the inter-frame level (e.g., uniform sampling of video frames) and intra-frame level (e.g., condensing all visual tokens of each frame into a limited number). However, this approach often neglects the uneven temporal distribution of critical information across frames, risking the omission of keyframes that contain essential temporal and semantic details. To tackle these challenges, we propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames. We substitute uniform sampling with a refined approach inspired by classic video compression principles to identify and capture keyframes based on their temporal redundancy. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. Additionally, we introduce a spatiotemporal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame, thus providing Vid-LLMs with a nuanced understanding of spatial-temporal dynamics. Extensive experiments on widely recognized video understanding benchmarks, especially long video scenarios, demonstrate that KFFocus significantly outperforms existing methods, achieving substantial computational efficiency and enhanced accuracy.
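
To make the idea of redundancy-based keyframe selection concrete, here is a minimal sketch. It assumes frames are flat lists of pixel intensities and uses mean absolute difference as the redundancy measure; both choices, and the names `mean_abs_diff` and `select_keyframes`, are illustrative assumptions rather than the selection rule used by KFFocus.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(frame_a, frame_b)) / len(frame_a)


def select_keyframes(frames, threshold):
    """Greedy selection: keep a frame only if it differs enough from the most
    recently kept keyframe, so temporally redundant frames are dropped."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept
```

Unlike uniform sampling, this keeps more frames where the content changes quickly and fewer where it is static, which is the behavior the abstract argues for.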

[82] Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation

Zan Wang,Jingze Zhang,Yixin Chen,Baoxiong Jia,Wei Liang,Siyuan Huang

Main category: cs.CV

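TL;DR: The paper proposes MSQ, a multi-scale spatial-temporal quantization method that yields flexible motion representations, addressing the limitations of existing representations in modeling complex patterns and in generalization, and performs well across multiple tasks.

Details Motivation: Current motion representations are usually discrete frame sequences, which struggle to capture motion patterns from a multi-scale perspective and lack compositional flexibility, limiting complex pattern modeling and generalization across tasks.

Contribution: Proposes MSQ, a multi-scale spatial-temporal quantization method that compresses motion sequences into multi-scale discrete tokens across the spatial and temporal dimensions, supporting motion editing, motion control, and conditional generation.

Method: Uses distinct encoders to capture body parts at different spatial granularities, interpolates the encoded features along the temporal dimension into multiple scales, and then quantizes them. A generative mask modeling model is built on this representation.

Result: Motion tokens can be composed seamlessly without retraining, and the method outperforms existing methods on multiple benchmarks.

Insight: Multi-scale spatial-temporal quantization improves the expressiveness and flexibility of motion representations and offers a new approach to complex motion generation tasks.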

Abstract: Despite significant advancements in human motion generation, current motion representations, typically formulated as discrete frame sequences, still face two critical limitations: (i) they fail to capture motion from a multi-scale perspective, limiting their capability to model complex patterns; (ii) they lack compositional flexibility, which is crucial for a model’s generalization across diverse generation tasks. To address these challenges, we introduce MSQ, a novel quantization method that compresses the motion sequence into multi-scale discrete tokens across spatial and temporal dimensions. MSQ employs distinct encoders to capture body parts at varying spatial granularities and temporally interpolates the encoded features into multiple scales before quantizing them into discrete tokens. Building on this representation, we establish a generative mask modeling model to effectively support motion editing, motion control, and conditional motion generation. Through quantitative and qualitative analysis, we show that our quantization method enables the seamless composition of motion tokens without requiring specialized design or re-training. Furthermore, extensive evaluations demonstrate that our approach outperforms existing baseline methods on various benchmarks.
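
The "interpolate to several temporal scales, then quantize" step can be sketched as follows. This is a simplified illustration, not MSQ itself: it assumes a scalar feature per time step, linear interpolation, and a nearest-neighbor lookup in a fixed codebook, and the names `resample`, `quantize`, and `multi_scale_tokens` are hypothetical.

```python
def resample(seq, n):
    """Linearly interpolate a 1-D feature sequence to n samples.
    For n == 1 the sequence is summarized by its mean."""
    if n == 1:
        return [sum(seq) / len(seq)]
    out = []
    for i in range(n):
        pos = i * (len(seq) - 1) / (n - 1)  # fractional index into seq
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        frac = pos - lo
        out.append(seq[lo] * (1.0 - frac) + seq[hi] * frac)
    return out


def quantize(values, codebook):
    """Map each value to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))
            for v in values]


def multi_scale_tokens(seq, scales, codebook):
    """Return {scale: token indices} for each requested temporal scale."""
    return {n: quantize(resample(seq, n), codebook) for n in scales}
```

Coarse scales summarize the overall motion in a few tokens while fine scales retain detail, which is one way to read the multi-scale perspective described in the abstract.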

[83] UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale

Yuhao Wang,Wei Xi

Main category: cs.CV

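TL;DR: UniConvNet proposes a new paradigm that expands the effective receptive field (ERF) while maintaining its asymptotically Gaussian distribution (AGD) by combining smaller kernels (such as 7×7, 9×9, and 11×11), improving the performance and efficiency of ConvNets.

Details Motivation: ConvNets that enlarge the effective receptive field incur high parameter and compute costs and disrupt the asymptotically Gaussian distribution of the ERF, which limits their performance.

Contribution: Proposes expanding the ERF while maintaining AGD through a combination of smaller kernels, designs the Three-layer Receptive Field Aggregator and the Layer Operator as basic building blocks, and builds UniConvNet, a universal model for ConvNets of any scale.

Method: Stacks Three-layer Receptive Field Aggregator modules and Layer Operators to expand the ERF to the level of large-kernel ConvNets while maintaining AGD.

Result: On ImageNet-1K, COCO2017, and ADE20K, UniConvNet outperforms state-of-the-art CNNs and ViTs across model scales. The lightweight UniConvNet-T reaches 84.2% ImageNet top-1 accuracy with 30M parameters and 5.1G FLOPs.

Insight: Combining smaller kernels expands the receptive field efficiently while keeping the ERF asymptotically Gaussian, offering a new direction for ConvNet design.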

Abstract: Convolutional neural networks (ConvNets) with a large effective receptive field (ERF), still in their early stages, have demonstrated promising effectiveness but are constrained by high parameter and FLOPs costs and by a disrupted asymptotically Gaussian distribution (AGD) of the ERF. This paper proposes an alternative paradigm: rather than merely employing an extremely large ERF, it is more effective and efficient to expand the ERF while maintaining the AGD of the ERF through a proper combination of smaller kernels, such as $7\times{7}$, $9\times{9}$, $11\times{11}$. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of receptive field. The ERF can be expanded to the level of existing large-kernel ConvNets by stacking the proposed modules while maintaining the AGD of the ERF. Using these designs, we propose a universal model for ConvNets of any scale, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks for both lightweight and large-scale models with comparable throughput. Surprisingly, UniConvNet-T achieves $84.2\%$ ImageNet top-1 accuracy with $30M$ parameters and $5.1G$ FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, acquiring $88.4\%$ top-1 accuracy on ImageNet. Code and models are publicly available at https://github.com/ai-paperwithcode/UniConvNet.
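
A rough calculation shows why combining smaller kernels can be cheaper than one large kernel. The sketch below assumes the three kernels are depthwise convolutions applied one after another with stride 1 and no dilation; the paper's Three-layer Receptive Field Aggregator may combine them differently, so treat the numbers as an illustration of the general trade-off only.

```python
def stacked_receptive_field(kernel_sizes):
    """Theoretical receptive field of stride-1, undilated convolutions applied
    in sequence: each k x k layer adds (k - 1) to the field width."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def depthwise_weights_per_channel(kernel_sizes):
    """Number of weights per channel for a set of depthwise k x k kernels."""
    return sum(k * k for k in kernel_sizes)


kernels = (7, 9, 11)
rf = stacked_receptive_field(kernels)                # 25
small = depthwise_weights_per_channel(kernels)       # 49 + 81 + 121 = 251
single = depthwise_weights_per_channel((rf,))        # 25 * 25 = 625
```

Under these assumptions, the stacked 7×7, 9×9, and 11×11 kernels cover the same theoretical 25×25 field as a single 25×25 depthwise kernel with 251 rather than 625 weights per channel. The theoretical receptive field is not the same quantity as the ERF the paper studies.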

[84] Towards Perfection: Building Inter-component Mutual Correction for Retinex-based Low-light Image Enhancement

Luyang Cao,Han Xu,Jian Zhang,Lei Qi,Jiayi Ma,Yinghuan Shi,Yang Gao

Main category: cs.CV

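TL;DR: The paper proposes a new Inter-correction Retinex model (IRetinex) that improves low-light image enhancement by reducing inter-component residuals (ICR) in the decomposition and enhancement stages.

Details Motivation: In low-light image enhancement, Retinex-based methods are interpretable, but their decomposition into illumination and reflectance components is imperfect, and the resulting residuals (ICR) degrade enhancement.

Contribution: 1. Formally defines the ICR problem; 2. Designs the IRetinex model, which improves decomposition and enhancement by reducing ICR; 3. Validates the method's advantage on three low-light datasets.

Method: 1. In the decomposition stage, an inter-component residual reduction module reduces the feature similarity between the illumination and reflectance components; 2. In the enhancement stage, the feature similarity between the components is used to detect and mitigate the impact of ICR.

Result: Experiments show that, by reducing ICR, the method outperforms state-of-the-art approaches both qualitatively and quantitatively.

Insight: ICR is a key factor limiting Retinex-based methods, and a targeted mutual-correction mechanism can effectively improve low-light image enhancement.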

Abstract: In low-light image enhancement, Retinex-based deep learning methods have garnered significant attention due to their exceptional interpretability. These methods decompose images into mutually independent illumination and reflectance components, allowing each component to be enhanced separately. In practice, achieving perfect decomposition of illumination and reflectance components proves to be quite challenging, with some residuals still existing after decomposition. In this paper, we formally name these residuals inter-component residuals (ICR), which have been largely underestimated by previous methods. In our investigation, ICR not only affects the accuracy of the decomposition but also causes enhanced components to deviate from the ideal outcome, ultimately reducing the final synthesized image quality. To address this issue, we propose a novel Inter-correction Retinex model (IRetinex) to alleviate ICR during the decomposition and enhancement stages. In the decomposition stage, we leverage an inter-component residual reduction module to reduce the feature similarity between illumination and reflectance components. In the enhancement stage, we utilize the feature similarity between the two components to detect and mitigate the impact of ICR within each enhancement unit. Extensive experiments on three low-light benchmark datasets demonstrate that by reducing ICR, our method outperforms state-of-the-art approaches both qualitatively and quantitatively.

[85] Uncertainty-aware Cross-training for Semi-supervised Medical Image Segmentation

Kaiwen Huang,Tao Zhou,Huazhu Fu,Yizhe Zhang,Yi Zhou,Xiao-Jun Wu

Main category: cs.CV

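TL;DR: The paper proposes an uncertainty-aware cross-training framework (UC-Seg) for semi-supervised medical image segmentation. It uses two distinct subnets and a Cross-subnet Consistency Preservation (CCP) strategy to mitigate the model's cognitive biases, and an Uncertainty-aware Pseudo-label Generation (UPG) component to improve pseudo-label quality. Experiments show that UC-Seg outperforms existing methods across a range of medical image segmentation tasks.

Details Motivation: Semi-supervised learning reduces reliance on expert annotation in medical image segmentation, but existing mean-teacher (MT) based methods rely heavily on the student model and overlook the model's cognitive biases, while pseudo-label-based co-training methods struggle to generate high-quality pseudo-labels.

Contribution: 1) Proposes the UC-Seg framework, which mitigates cognitive biases through two subnets and the CCP strategy; 2) Designs the UPG component, which uses uncertainty maps and segmentation results to generate high-confidence pseudo-labels.

Method: 1) Builds two heterogeneous subnets and uses the CCP strategy to enforce feature consistency and learn shared semantics; 2) Generates pseudo-labels from uncertainty maps and segmentation results (UPG).

Result: On medical image segmentation tasks across modalities such as MRI and CT, UC-Seg achieves higher segmentation accuracy and better generalization.

Insight: 1) Co-training heterogeneous subnets can effectively mitigate model bias; 2) Uncertainty-aware pseudo-label generation improves semi-supervised learning.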

Abstract: Semi-supervised learning has gained considerable popularity in medical image segmentation tasks due to its capability to reduce reliance on expert-examined annotations. Several mean-teacher (MT) based semi-supervised methods utilize consistency regularization to effectively leverage valuable information from unlabeled data. However, these methods often heavily rely on the student model and overlook the potential impact of cognitive biases within the model. Furthermore, some methods employ co-training using pseudo-labels derived from different inputs, yet generating high-confidence pseudo-labels from perturbed inputs during training remains a significant challenge. In this paper, we propose an Uncertainty-aware Cross-training framework for semi-supervised medical image Segmentation (UC-Seg). Our UC-Seg framework incorporates two distinct subnets to effectively explore and leverage the correlation between them, thereby mitigating cognitive biases within the model. Specifically, we present a Cross-subnet Consistency Preservation (CCP) strategy to enhance feature representation capability and ensure feature consistency across the two subnets. This strategy enables each subnet to correct its own biases and learn shared semantics from both labeled and unlabeled data. Additionally, we propose an Uncertainty-aware Pseudo-label Generation (UPG) component that leverages segmentation results and corresponding uncertainty maps from both subnets to generate high-confidence pseudo-labels. We extensively evaluate the proposed UC-Seg on various medical image segmentation tasks involving different modality images, such as MRI, CT, ultrasound, colonoscopy, and so on. The results demonstrate that our method achieves superior segmentation accuracy and generalization performance compared to other state-of-the-art semi-supervised methods. Our code will be released at https://github.com/taozh2017/UCSeg.
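
One common way to turn two subnets' predictions and an uncertainty estimate into high-confidence pseudo-labels is sketched below. It assumes binary segmentation, averages the two foreground probabilities, and uses binary entropy as the uncertainty measure. These choices and the name `pseudo_labels` are assumptions for illustration; the UPG component in the paper may be defined differently.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def pseudo_labels(probs_a, probs_b, max_entropy):
    """Fuse per-pixel foreground probabilities from two subnets.
    Pixels whose fused prediction is too uncertain get None (ignored)."""
    labels = []
    for pa, pb in zip(probs_a, probs_b):
        p = 0.5 * (pa + pb)                      # fused prediction
        if binary_entropy(p) <= max_entropy:     # confident enough to keep
            labels.append(1 if p >= 0.5 else 0)
        else:
            labels.append(None)                  # excluded from the loss
    return labels
```

Pixels where the subnets disagree end up with a fused probability near 0.5, hence high entropy, and are left unlabeled rather than being allowed to reinforce a possibly wrong prediction.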

[86] When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges

Zhiqiang Yang,Renshuai Tao,Xiaolong Zheng,Guodong Yang,Chunjie Zhang

Main category: cs.CV

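TL;DR: The paper proposes DPGNet, a method that uses unlabeled data to detect highly realistic AI-generated faces, addressing the reliance of existing methods on labeled data that is hard to annotate.

Details Motivation: As AI-generated content becomes more realistic, human annotators struggle to tell real images from fake ones, making labeling slow and unreliable. Solutions that exploit large-scale unlabeled data are needed.

Contribution: 1. Proposes DPGNet, a dual-path guidance network that bridges the domain gap between generation models; 2. Uses text-guided cross-domain alignment and curriculum-driven pseudo-label generation modules to exploit unlabeled data; 3. Outperforms existing methods by 6.3% on 11 popular datasets.

Method: 1. Text-guided cross-domain alignment: learnable prompts unify visual and textual embeddings into a domain-invariant feature space; 2. Curriculum-driven pseudo-label generation: dynamically selects more informative unlabeled samples; 3. Cross-domain knowledge distillation prevents catastrophic forgetting.

Result: Across 11 datasets, DPGNet outperforms state-of-the-art methods by 6.3%, demonstrating its effectiveness in using unlabeled data for deepfake detection.

Insight: 1. Combining visual and textual modalities can mitigate the domain gap; 2. A curriculum learning strategy dynamically improves the use of unlabeled data; 3. Cross-domain knowledge distillation helps maintain performance across domains.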

Abstract: Existing deepfake detection methods heavily depend on labeled training data. However, as AI-generated content becomes increasingly realistic, even human annotators struggle to distinguish between deepfakes and authentic images. This makes the labeling process both time-consuming and less reliable. Specifically, there is a growing demand for approaches that can effectively utilize large-scale unlabeled data from online social networks. Unlike typical unsupervised learning tasks, where categories are distinct, AI-generated faces closely mimic real image distributions and share strong similarities, causing a performance drop in conventional strategies. In this paper, we introduce the Dual-Path Guidance Network (DPGNet) to tackle two key challenges: (1) bridging the domain gap between faces from different generation models, and (2) utilizing unlabeled image samples. The method features two core modules: text-guided cross-domain alignment, which uses learnable prompts to unify visual and textual embeddings into a domain-invariant feature space, and curriculum-driven pseudo label generation, which dynamically exploits more informative unlabeled samples. To prevent catastrophic forgetting, we also facilitate bridging between domains via cross-domain knowledge distillation. Extensive experiments on 11 popular datasets show that DPGNet outperforms SoTA approaches by 6.3%, highlighting its effectiveness in leveraging unlabeled data to address the annotation challenges posed by the increasing realism of deepfakes.

[87] Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding

Maxim A. Patratskiy,Alexey K. Kovalev,Aleksandr I. Panov

Main category: cs.CV

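TL;DR: The paper proposes Spatial Traces, a visual prompting method that strengthens the spatial-temporal understanding of VLA models, and experiments show that it outperforms existing models.

Details Motivation: Recent work on Vision-Language-Action models has improved spatial and temporal understanding independently, without modeling them jointly. The authors aim to capture both kinds of information with a single unified method.

Contribution: Proposes Spatial Traces, which projects visual traces of key points onto depth maps to model spatial and temporal information jointly.

Method: Uses visual prompting to combine key-point traces with depth maps so that the model learns spatial and temporal information simultaneously.

Result: In SimplerEnv experiments, the mean number of tasks successfully solved increases by 4% compared to SpatialVLA and by 19% compared to TraceVLA.

Insight: The improvement can be achieved with little training data, making the method suitable for real-world settings where data is scarce.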

Abstract: Vision-Language-Action models have demonstrated remarkable capabilities in predicting agent movements within virtual environments and real-world scenarios based on visual observations and textual instructions. Although recent research has focused on enhancing spatial and temporal understanding independently, this paper presents a novel approach that integrates both aspects through visual prompting. We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. The experiments in SimplerEnv show that the mean number of tasks successfully solved increased by 4% compared to SpatialVLA and by 19% compared to TraceVLA. Furthermore, we show that this enhancement can be achieved with minimal training data, making it particularly valuable for real-world applications where data collection is challenging. The project page is available at https://ampiromax.github.io/ST-VLA.

[88] ALFred: An Active Learning Framework for Real-world Semi-supervised Anomaly Detection with Adaptive Thresholds

Shanle Yao,Ghazal Alinezhad Noghre,Armin Danesh Pazho,Hamed Tabkhi

Main category: cs.CV

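TL;DR: The paper proposes ALFred, an active-learning-based semi-supervised anomaly detection framework focused on applying video anomaly detection (VAD) in dynamic real-world settings. Combining a human-in-the-loop mechanism with adaptive thresholds improves the model's adaptability to changing environments.

Details Motivation: Traditional VAD methods perform poorly in dynamic scenes, mainly because of static assumptions and fixed thresholds. The paper aims to address these issues so that VAD works more effectively in dynamic real-world environments.

Contribution: 1. Proposes the ALFred framework, which combines active learning with a human-in-the-loop mechanism to adapt to dynamic scenes. 2. Introduces adaptive thresholds to handle changes in the definition of 'normal' behavior across environments. 3. Proposes a new evaluation metric (EBI) and reports results from experiments that simulate real-world conditions.

Method: 1. Uses active learning to select the most informative data points for labeling. 2. Uses a human-in-the-loop mechanism to identify actual normal and anomalous instances among AI-generated pseudo-labels. 3. Adjusts the threshold dynamically to suit different environments.

Result: Experiments show that ALFred achieves an EBI of 68.91 for Q3 in simulated real-world scenarios, demonstrating its practical effectiveness.

Insight: With active learning and adaptive thresholds, VAD can adapt more flexibly to dynamic environments, while the human-in-the-loop mechanism supplies the model with more reliable ground-truth data.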

Abstract: Video Anomaly Detection (VAD) can play a key role in spotting unusual activities in video footage. VAD is difficult to use in real-world settings due to the dynamic nature of human actions, environmental variations, and domain shifts. Traditional evaluation metrics often prove inadequate for such scenarios, as they rely on static assumptions and fall short of identifying a threshold that distinguishes normal from anomalous behavior in dynamic settings. To address this, we introduce an active learning framework tailored for VAD, designed for adapting to the ever-changing real-world conditions. Our approach leverages active learning to continuously select the most informative data points for labeling, thereby enhancing model adaptability. A critical innovation is the incorporation of a human-in-the-loop mechanism, which enables the identification of actual normal and anomalous instances from pseudo-labeling results generated by AI. This collected data allows the framework to define an adaptive threshold tailored to different environments, ensuring that the system remains effective as the definition of ‘normal’ shifts across various settings. Implemented within a lab-based framework that simulates real-world conditions, our approach allows rigorous testing and refinement of VAD algorithms with a new metric. Experimental results show that our method achieves an EBI (Error Balance Index) of 68.91 for Q3 in real-world simulated scenarios, demonstrating its practical effectiveness and significantly enhancing the applicability of VAD in dynamic environments.
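
The idea of deriving an environment-specific threshold from human-verified examples can be sketched as follows. The selection criterion used here, minimizing the average of the false-positive and false-negative rates, is an assumption chosen for simplicity. It is not the paper's EBI metric, and `adaptive_threshold` is a hypothetical name.

```python
def adaptive_threshold(normal_scores, anomalous_scores):
    """Pick the anomaly-score threshold that minimizes the balanced error on
    human-verified samples. A score >= threshold is flagged as anomalous.
    Both input lists must be non-empty."""
    candidates = sorted(set(normal_scores) | set(anomalous_scores))
    best_t, best_err = None, float("inf")
    for t in candidates:
        # False positives: verified-normal clips flagged as anomalous.
        fpr = sum(s >= t for s in normal_scores) / len(normal_scores)
        # False negatives: verified-anomalous clips that slip through.
        fnr = sum(s < t for s in anomalous_scores) / len(anomalous_scores)
        err = 0.5 * (fpr + fnr)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

Running this separately on verified samples from each deployment site gives each environment its own threshold, which reflects the abstract's point that the definition of 'normal' shifts across settings.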

[89] VLM-3D:End-to-End Vision-Language Models for Open-World 3D Perception

Fuhao Chang,Shuxin Li,Yabei Li,Lei He

Main category: cs.CV

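TL;DR: VLM-3D is the first end-to-end framework that enables vision-language models (VLMs) to perform 3D geometric perception in autonomous driving scenarios, improving perception accuracy through Low-Rank Adaptation (LoRA) and a joint semantic-geometric loss.

Details Motivation: Autonomous driving systems must recognize previously unseen object categories in complex traffic environments, and existing approaches mostly use multi-stage pipelines that propagate errors. VLM-3D aims to address this with an end-to-end framework that directly exploits the semantic reasoning capabilities of VLMs.

Contribution: 1) Proposes VLM-3D, the first end-to-end VLM framework for 3D perception; 2) Introduces LoRA to adapt VLMs efficiently; 3) Proposes a joint semantic-geometric loss design.

Method: Adapts VLMs with LoRA and designs a joint loss: a token-level semantic loss stabilizes convergence early in training, and a 3D IoU loss is added later to refine bounding box predictions.

Result: On the nuScenes dataset, the joint semantic-geometric loss improves perception accuracy by 12.8%, validating the method.

Insight: VLMs show strong potential for 3D perception. The end-to-end design avoids error propagation, and a joint loss that balances semantic and geometric learning is key to the performance gain.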

Abstract: Open-set perception in complex traffic environments poses a critical challenge for autonomous driving systems, particularly in identifying previously unseen object categories, which is vital for ensuring safety. Visual Language Models (VLMs), with their rich world knowledge and strong semantic reasoning capabilities, offer new possibilities for addressing this task. However, existing approaches typically leverage VLMs to extract visual features and couple them with traditional object detectors, resulting in multi-stage error propagation that hinders perception accuracy. To overcome this limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead, and introduces a joint semantic-geometric loss design: token-level semantic loss is applied during early training to ensure stable convergence, while 3D IoU loss is introduced in later stages to refine the accuracy of 3D bounding box predictions. Evaluations on the nuScenes dataset demonstrate that the proposed joint semantic-geometric loss in VLM-3D leads to a 12.8% improvement in perception accuracy, fully validating the effectiveness and advancement of our method.
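
The staged loss described above can be illustrated with a small sketch. It simplifies heavily: boxes are axis-aligned rather than rotated as in nuScenes, the switch between stages is a hard step at a hypothetical `warmup_steps`, and the IoU term is computed on plain floats rather than differentiable tensors. None of these details come from the paper.

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo

    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    return inter / (volume(box_a) + volume(box_b) - inter)


def joint_loss(token_loss, pred_box, gt_box, step, warmup_steps, iou_weight=1.0):
    """Token-level loss only during warm-up; afterwards add (1 - IoU)."""
    if step < warmup_steps:
        return token_loss
    return token_loss + iou_weight * (1.0 - iou_3d(pred_box, gt_box))
```

Early in training the predicted boxes are often far from the target, where an IoU term gives little useful signal, so deferring it until the token-level loss has stabilized is consistent with the schedule the abstract describes.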

[90] Addressing Bias in VLMs for Glaucoma Detection Without Protected Attribute Supervision

Ahsan Habib Akash,Greg Murray,Annahita Amireskandari,Joel Palko,Carol Laxson,Binod Bhattarai,Prashnna Gyawali

Main category: cs.CV

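TL;DR: The paper proposes a debiasing method that requires no protected-attribute supervision, using unsupervised clustering and a multimodal contrastive learning framework to improve fairness in glaucoma detection.

Details Motivation: Although vision-language models perform well on multimodal tasks, they can carry implicit biases. Glaucoma screening is a critical case because the disease disproportionately affects underserved populations, so model bias needs to be addressed.

Contribution: Proposes an attribute-agnostic debiasing method that infers subgroups via unsupervised clustering and, within a contrastive learning framework, dynamically reweights the training objective to reduce performance disparities between subgroups.

Method: Building on a contrastive learning framework, the method infers subgroups via unsupervised clustering, computes gradient-similarity weights, and optimizes the model with a weighted objective that focuses on underperforming subgroups.

Result: Evaluated on the Harvard FairVLMed dataset, with metrics such as Equalized Odds Distance used to demonstrate equitable performance across inferred subgroups.

Insight: Unsupervised clustering and dynamic reweighting can reduce model bias, especially when no explicit protected-attribute supervision is available, offering a new direction for fair medical diagnosis.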

Abstract: Vision-Language Models (VLMs) have achieved remarkable success on multimodal tasks such as image-text retrieval and zero-shot classification, yet they can exhibit demographic biases even when explicit protected attributes are absent during training. In this work, we focus on automated glaucoma screening from retinal fundus images, a critical application given that glaucoma is a leading cause of irreversible blindness and disproportionately affects underserved populations. Building on a reweighting-based contrastive learning framework, we introduce an attribute-agnostic debiasing method that (i) infers proxy subgroups via unsupervised clustering of image-image embeddings, (ii) computes gradient-similarity weights between the CLIP-style multimodal loss and a SimCLR-style image-pair contrastive loss, and (iii) applies these weights in a joint, top-$k$ weighted objective to upweight underperforming clusters. This label-free approach adaptively targets the hardest examples, thereby reducing subgroup disparities. We evaluate our method on the Harvard FairVLMed glaucoma subset, reporting Equalized Odds Distance (EOD), Equalized Subgroup AUC (ES AUC), and Groupwise AUC to demonstrate equitable performance across inferred demographic subgroups.

[91] Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices

Ya Zou,Jingfeng Yao,Siyuan Yu,Shuai Zhang,Wenyu Liu,Xinggang Wang

Main category: cs.CV

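TL;DR: The paper proposes Turbo-VAED, a low-cost solution for efficiently deploying video variational autoencoders (VAEs) on mobile devices, sharply reducing parameter count and latency while retaining high reconstruction quality.

Details Motivation: Deploying large generative models such as video VAEs on mobile devices runs into compute bottlenecks and memory limits, so an efficient solution is needed.

Contribution: 1) Reduces parameters with 3D depthwise separable convolutions; 2) Proposes a decoupled 3D pixel shuffle scheme that cuts latency; 3) Proposes an efficient VAE decoder training method that only distills the decoder.

Method: Combines 3D depthwise separable convolutions with a decoupled 3D pixel shuffle to design a mobile-oriented VAE decoder, and uses distillation for fast transfer.

Result: At 720p resolution on GPUs, Turbo-VAED accelerates the original VAEs by up to 84.5x, uses as little as 17.5% of the original parameter count (a reduction of up to 82.5%), and retains 96.9% of the original reconstruction quality. On the iPhone 16 Pro it achieves a 2.9x speedup in FPS over mobile-optimized VAEs with better reconstruction quality.

Insight: Redundant architecture and inefficient upsampling are the main bottlenecks for deploying video VAEs on mobile devices; structural optimization and hardware-aware design can substantially improve performance.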

Abstract: There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) represents one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and get empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it to Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time. This approach is widely applicable to most video VAEs. When integrated into four representative models, with training cost as low as $95, it accelerates original VAEs by up to 84.5x at 720p resolution on GPUs, uses as low as 17.5% of original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro. The code and models will soon be available at https://github.com/hustvl/Turbo-VAED.
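
The parameter savings from 3D depthwise separable convolutions follow from simple counting, sketched below. The channel width of 128 and kernel size of 3 are example values chosen for illustration, not figures from the paper, and bias terms are ignored.

```python
def conv3d_params(c_in, c_out, k):
    """Weights in a standard 3D convolution with a k x k x k kernel."""
    return c_in * c_out * k ** 3


def separable_conv3d_params(c_in, c_out, k):
    """Weights in a depthwise separable 3D convolution: one k x k x k filter
    per input channel, followed by a 1 x 1 x 1 pointwise convolution."""
    depthwise = c_in * k ** 3
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = conv3d_params(128, 128, 3)             # 442368
separable = separable_conv3d_params(128, 128, 3)  # 19840
```

For this single example layer the separable form needs roughly 22 times fewer weights. The overall reduction reported in the abstract is smaller because it covers a whole decoder, not one layer in isolation.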

[92] HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis

Timo Teufel,Pulkit Gera,Xilong Zhou,Umar Iqbal,Pramod Rao,Jan Kautz,Vladislav Golyanik,Christian Theobalt

Main category: cs.CV

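TL;DR: HumanOLAT is the first publicly available large-scale dataset of multi-view One-Light-at-a-Time (OLAT) captures of full-body humans, intended for full-body human relighting and novel-view synthesis and filling the gap in high-quality public datasets.

Details Motivation: Progress in relighting and novel-view synthesis has been limited by the lack of publicly available, high-quality full-body human capture datasets. HumanOLAT aims to fill this gap and advance research in the area.

Contribution: Releases the first publicly available large-scale multi-view OLAT dataset of full-body humans, including HDR data under various illuminations, providing a benchmark for relighting and novel-view synthesis.

Method: Captures multi-view HDR RGB frames under various illuminations, including white light, environment maps, color gradients, and fine-grained OLAT illuminations.

Result: Evaluations show that state-of-the-art relighting and novel-view synthesis methods still struggle to model complex human-centric appearance and lighting interactions, underscoring the dataset's value for future research.

Insight: HumanOLAT provides a rigorous benchmark for relighting and rendering techniques and is expected to advance the field.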

Abstract: Simultaneous relighting and novel-view rendering of digital human representations is an important yet challenging task with numerous applications. Progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset of multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes HDR RGB frames under various illuminations, such as white light, environment maps, color gradients and fine-grained OLAT illuminations. Our evaluations of state-of-the-art relighting and novel-view synthesis methods underscore both the dataset’s value and the significant challenges still present in modeling complex human-centric appearance and lighting interactions. We believe HumanOLAT will significantly facilitate future research, enabling rigorous benchmarking and advancements in both general and human-specific relighting and rendering techniques.

cs.SD

[93] Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization

Chaoqun Cui,Liangbin Huang,Shijing Wang,Zhe Tong,Zhaolong Huang,Xiao Zeng,Xiaofeng Liu

Main category: cs.SD

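TL;DR: The paper proposes Segment Supervised Preference Optimization (SSPO) to address duration alignment in video dubbing, reducing duration mismatches between source and target lines.

Details Motivation: In video dubbing, differences in information density across languages mean that target speech often does not match the duration of the source speech, causing audio-video synchronization issues that hurt the viewing experience. An effective duration alignment method is therefore needed.

Contribution: 1. Formulates duration alignment as a preference optimization problem; 2. Proposes the Segment Supervised Preference Optimization (SSPO) method, which reduces duration mismatches through a segment-wise sampling strategy and a fine-grained loss.

Method: Uses a segment-wise sampling strategy together with a fine-grained loss, within a preference optimization framework, so that the duration of target-language lines aligns with that of the source lines.

Result: Experimental results show that SSPO achieves superior performance in duration alignment tasks.

Insight: 1. A preference optimization framework can address alignment problems in multimodal tasks; 2. The fine-grained loss and segment-wise strategy are key to precise alignment.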

Abstract: Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.

cs.IR

[94] Adaptive Personalized Conversational Information Retrieval

Fengran Mo,Yuchen Hui,Yuxing Tian,Zhaoxuan Tan,Chuan Meng,Zhan Su,Kaiyu Huang,Jian-Yun Nie

Main category: cs.IR

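TL;DR: APCIR is an adaptive personalized conversational information retrieval method that identifies how much personalization each query needs and improves retrieval results through weighted fusion.

Details Motivation: Existing conversational information retrieval systems usually apply one uniform personalization strategy to all queries, ignoring that not every query requires personalization, which leads to sub-optimal results.

Contribution: 1) Proposes an adaptive personalization method that determines each query's required personalization level; 2) Designs a personalization-aware ranking fusion strategy; 3) Validates the method on two TREC iKAT datasets.

Method: 1) Identify the personalization level required by the query; 2) Generate multiple enhanced queries; 3) Assign fusion weights dynamically.

Result: APCIR outperforms state-of-the-art methods on two TREC iKAT datasets.

Insight: Dynamically adjusting personalization weights improves the flexibility and performance of retrieval systems.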

Abstract: Personalized conversational information retrieval (CIR) systems aim to satisfy users’ complex information needs through multi-turn interactions by considering user profiles. However, not all search queries require personalization. The challenge lies in appropriately incorporating personalization elements into search when needed. Most existing studies implicitly incorporate users’ personal information and conversational context using large language models without distinguishing the specific requirements for each query turn. Such a "one-size-fits-all" personalization strategy might lead to sub-optimal results. In this paper, we propose an adaptive personalization method, in which we first identify the required personalization level for a query and integrate personalized queries with other query reformulations to produce various enhanced queries. Then, we design a personalization-aware ranking fusion approach to assign fusion weights dynamically to different reformulated queries, depending on the required personalization level. The proposed adaptive personalized conversational information retrieval framework APCIR is evaluated on two TREC iKAT datasets. The results confirm the effectiveness of adaptive personalization of APCIR by outperforming state-of-the-art methods.
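
A minimal sketch of personalization-aware rank fusion is given below. It substitutes weighted reciprocal rank fusion (RRF) for whatever fusion function the paper uses, and the linear mapping from personalization level to weights in `fusion_weights` is a hypothetical choice. The query names `personalized` and `rewrite` are also illustrative.

```python
def weighted_rrf(rankings, weights, k=60):
    """Weighted reciprocal rank fusion.
    rankings: {query_name: [doc ids, best first]}
    weights:  {query_name: fusion weight}
    Returns doc ids sorted by fused score (ties broken by doc id)."""
    scores = {}
    for name, ranked_docs in rankings.items():
        w = weights[name]
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def fusion_weights(personalization_level):
    """Hypothetical mapping: a level in [0, 1] shifts weight from the generic
    query rewrite towards the personalized query."""
    p = min(max(personalization_level, 0.0), 1.0)
    return {"personalized": p, "rewrite": 1.0 - p}
```

With a high personalization level the fused ranking follows the personalized query's results, and with a low level it follows the generic rewrite, so a query that needs no personalization is not distorted by profile information.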

eess.IV [Back]

[95] Preprocessing Algorithm Leveraging Geometric Modeling for Scale Correction in Hyperspectral Images for Improved Unmixing Performance

Praveen Sumanasekara,Athulya Ratnayake,Buddhi Wijenayake,Keshawa Ratnayake,Roshan Godaliyadda,Parakrama Ekanayake,Vijitha Herath

Main category: eess.IV

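TL;DR: The paper proposes a new preprocessing algorithm that corrects scale-induced spectral variability in hyperspectral images to improve unmixing performance. It describes scale variability in a mathematical framework and validates the algorithm experimentally.

Details Motivation: Spectral variability in hyperspectral images, especially large-scale variation caused by topography, illumination, and shadowing, seriously affects the accuracy and convergence of unmixing algorithms. Existing methods struggle to handle these scale variations.

Contribution: 1. A preprocessing algorithm that corrects scale-induced spectral variability; 2. a rigorous mathematical framework; 3. extensive experiments showing the algorithm's generality and significant impact.

Method: Mathematical modeling isolates and compensates for large-scale multiplicative effects, producing cleaner input so that unmixing algorithms can focus on modeling nonlinear spectral variability and estimating abundances.

Result: On two synthetic and two real datasets, the preprocessing step consistently improves existing unmixing algorithms, with error reductions close to 50% in many cases.

Insight: Scale correction is a complementary step to unmixing that improves existing methods, including those designed to handle spectral variability, and has potential for practical pipelines.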

Abstract: Spectral variability significantly impacts the accuracy and convergence of hyperspectral unmixing algorithms. While many methods address complex spectral variability, large-scale variations in spectral signature scale caused by factors such as topography, illumination, and shadowing remain a major challenge. These variations often degrade unmixing performance and complicate model fitting. In this paper, we propose a novel preprocessing algorithm that corrects scale-induced spectral variability prior to unmixing. By isolating and compensating for these large-scale multiplicative effects, the algorithm provides a cleaner input, enabling unmixing methods to focus more effectively on modeling nonlinear spectral variability and abundance estimation. We present a rigorous mathematical framework to describe scale variability and extensive experimental validation of the proposed algorithm. Furthermore, the algorithm’s impact is evaluated across a broad spectrum of state-of-the-art unmixing algorithms on two synthetic and two real hyperspectral datasets. The proposed preprocessing step consistently improves the performance of these algorithms, including those specifically designed to handle spectral variability, with error reductions close to 50% in many cases. This demonstrates that scale correction acts as a complementary step, facilitating more accurate unmixing by existing methods. The algorithm’s generality and significant impact highlight its potential as a key component in practical hyperspectral unmixing pipelines. The implementation code will be made publicly available upon publication.
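To make the notion of a multiplicative scale effect concrete, here is a minimal sketch. It is not the paper's algorithm, whose details the abstract does not give. It assumes each observed pixel spectrum is a clean spectrum multiplied by a positive per-pixel factor, and removes that factor by normalizing every spectrum to unit Euclidean norm. The function name `normalize_scale` is hypothetical.

```python
import math


def normalize_scale(pixels, eps=1e-12):
    """Divide each pixel spectrum by its L2 norm.

    Under the assumed model y = s * x with s > 0, this maps y and x to the
    same vector, so the per-pixel scale s no longer affects the input.
    """
    corrected = []
    for spectrum in pixels:
        norm = math.sqrt(sum(v * v for v in spectrum))
        corrected.append([v / max(norm, eps) for v in spectrum])
    return corrected


# Two pixels with the same material mix, one lit twice as brightly.
pixels = [[1.0, 2.0, 2.0], [2.0, 4.0, 4.0]]
corrected = normalize_scale(pixels)
```

After normalization the two spectra coincide, which is the property a scale-correction step aims for; the actual method presumably estimates the scale more carefully than a plain norm.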

[96] Frequency-Assisted Adaptive Sharpening Scheme Considering Bitrate and Quality Tradeoff

Yingxue Pang,Shijie Zhao,Haiqiang Wang,Gen Zhan,Junlin Li,Li Zhang

Main category: eess.IV

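TL;DR: The paper proposes a frequency-assisted adaptive sharpening scheme (FreqSP) that combines CNN features with high-frequency components to predict the best sharpening level, optimizing the tradeoff between video quality and bitrate.

Details Motivation: Sharpening can improve video quality, but over-sharpening raises the bitrate and may even degrade quality. A method is needed that improves quality while keeping bandwidth cost under control.

Contribution: A Frequency-assisted Sharpening level Prediction model (FreqSP) that labels each video with its optimal sharpening level and uses high-frequency components in prediction to optimize the bitrate-quality tradeoff.

Method: Taking uncompressed source videos as input, the model combines CNN features and high-frequency components and is trained to predict the optimal sharpening level.

Result: Experiments demonstrate that FreqSP is effective at balancing video quality against bitrate.

Insight: Using high-frequency components helps predict the sharpening level more accurately, avoiding over-sharpening while balancing bandwidth cost.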

Abstract: Sharpening is a widely adopted technique to improve video quality, which can effectively emphasize textures and alleviate blurring. However, increasing the sharpening level comes with a higher video bitrate, resulting in degraded Quality of Service (QoS). Furthermore, the video quality does not necessarily improve with increasing sharpening levels, leading to issues such as over-sharpening. Clearly, it is essential to figure out how to boost video quality with a proper sharpening level while also controlling bandwidth costs effectively. This paper thus proposes a novel Frequency-assisted Sharpening level Prediction model (FreqSP). We first label each video with the sharpening level correlating to the optimal bitrate and quality tradeoff as ground truth. Then taking uncompressed source videos as inputs, the proposed FreqSP leverages intricate CNN features and high-frequency components to estimate the optimal sharpening level. Extensive experiments demonstrate the effectiveness of our method.

[97] A new dataset and comparison for multi-camera frame synthesis

Conall Daly,Anil Kokaram

Main category: eess.IV

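TL;DR: The paper presents a new multi-camera dataset for fair comparison of frame interpolation and view synthesis methods. Deep learning methods do not significantly outperform classical methods on real image data, while 3D Gaussian Splatting performs better on synthetic scenes.

Details Motivation: Existing frame interpolation and view synthesis datasets are each biased toward their own use cases, so there is no basis for direct comparison.

Contribution: A new multi-camera dataset that enables fair comparison between frame interpolation and view synthesis methods.

Method: Data are collected with a custom-built dense linear camera array, and classical and deep learning frame interpolators are evaluated against 3D Gaussian Splatting.

Result: On real image data, deep learning methods do not significantly outperform classical methods; on synthetic scenes, 3D Gaussian Splatting outperforms frame interpolation algorithms.

Insight: Dataset characteristics strongly affect method performance, and the gap between synthetic and real results deserves attention.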

Abstract: Many methods exist for frame synthesis in image sequences but can be broadly categorised into frame interpolation and view synthesis techniques. Fundamentally, both frame interpolation and view synthesis tackle the same task, interpolating a frame given surrounding frames in time or space. However, most frame interpolation datasets focus on temporal aspects with single cameras moving through time and space, while view synthesis datasets are typically biased toward stereoscopic depth estimation use cases. This makes direct comparison between view synthesis and frame interpolation methods challenging. In this paper, we develop a novel multi-camera dataset using a custom-built dense linear camera array to enable fair comparison between these approaches. We evaluate classical and deep learning frame interpolators against a view synthesis method (3D Gaussian Splatting) for the task of view in-betweening. Our results reveal that deep learning methods do not significantly outperform classical methods on real image data, with 3D Gaussian Splatting actually underperforming frame interpolators by as much as 3.5 dB PSNR. However, in synthetic scenes, the situation reverses – 3D Gaussian Splatting outperforms frame interpolation algorithms by almost 5 dB PSNR at a 95% confidence level.

[98] Efficient motion-based metrics for video frame interpolation

Conall Daly,Darren Ramsook,Anil Kokaram

Main category: eess.IV

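TL;DR: The paper studies motion-based metrics for assessing the perceptual quality of video frame interpolation and proposes an efficient metric based on the divergence of motion fields that is cheaper to compute and correlates reasonably with perceptual scores.

Details Motivation: Although video frame interpolation algorithms are advancing quickly, assessing the perceptual quality of interpolated content remains an open problem. Current motion-based metrics such as FloLPIPS are computationally expensive, so a more efficient alternative is needed.

Contribution: An efficient video quality metric based on motion field divergence for evaluating frame interpolation algorithms. It is faster to compute (2.7x speedup over FloLPIPS) and correlates reasonably with perceptual scores (PLCC = 0.51).

Method: Quality metrics are derived by simple processing of motion fields, such as measuring divergence, and are validated on the BVI-VFI dataset, which contains perceptual scores for interpolated sequences.

Result: The proposed metric is more computationally efficient than FloLPIPS and correlates reasonably with perceptual scores. When used to evaluate existing interpolation algorithms, it tends to favour perceptually better results over results with high PSNR or SSIM.

Insight: Simple motion field processing can yield efficient metrics that track perceptual quality, offering a new direction for evaluating video frame interpolation.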

Abstract: Video frame interpolation (VFI) offers a way to generate intermediate frames between consecutive frames of a video sequence. Although the development of advanced frame interpolation algorithms has received increased attention in recent years, assessing the perceptual quality of interpolated content remains an ongoing area of research. In this paper, we investigate simple ways to process motion fields, with the purpose of using them as video quality metrics for evaluating frame interpolation algorithms. We evaluate these quality metrics using the BVI-VFI dataset, which contains perceptual scores measured for interpolated sequences. From our investigation, we propose a motion metric based on measuring the divergence of motion fields. This metric correlates reasonably with these perceptual scores (PLCC=0.51) and is more computationally efficient (2.7x speedup) than FloLPIPS (a well-known motion-based metric). We then use our proposed metrics to evaluate a range of state-of-the-art frame interpolation algorithms and find that our metrics tend to favour more perceptually pleasing interpolated frames that may not score highly in terms of PSNR or SSIM.
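As a rough illustration of what measuring the divergence of a motion field involves, the sketch below computes a finite-difference divergence for a dense flow field and averages its magnitude. This is an assumption-laden sketch rather than the paper's metric: the abstract does not say how divergence is pooled or normalized, and the names `divergence` and `mean_abs_divergence` are hypothetical.

```python
def divergence(u, v):
    """Central-difference divergence du/dx + dv/dy of a flow field.

    u, v: 2D lists (rows = y, columns = x) of horizontal and vertical motion.
    Border pixels are left at 0.0 because central differences need neighbours.
    """
    h, w = len(u), len(u[0])
    div = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            du_dx = (u[y][x + 1] - u[y][x - 1]) / 2.0
            dv_dy = (v[y + 1][x] - v[y - 1][x]) / 2.0
            div[y][x] = du_dx + dv_dy
    return div


def mean_abs_divergence(u, v):
    """Average |divergence| over interior pixels (assumed pooling choice)."""
    div = divergence(u, v)
    interior = [abs(div[y][x])
                for y in range(1, len(u) - 1)
                for x in range(1, len(u[0]) - 1)]
    return sum(interior) / len(interior) if interior else 0.0


# A pure translation has zero divergence; an expanding field does not.
size = 4
translation_u = [[1.0] * size for _ in range(size)]
translation_v = [[0.0] * size for _ in range(size)]
expand_u = [[float(x) for x in range(size)] for _ in range(size)]
expand_v = [[float(y)] * size for y in range(size)]
```

Smooth, coherent motion produces low divergence, whereas tearing or locally inconsistent motion in an interpolated frame tends to raise it, which is the intuition behind using it as a quality signal.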

eess.AS [Back]

[99] MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs

Xiaoxue Gao,Huayun Zhang,Nancy F. Chen

Main category: eess.AS

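TL;DR: MultiAiTutor is a multilingual educational speech generation system built on large language models (LLMs) and designed for children. It supports Singaporean-accent Mandarin, Malay, and Tamil, and aims to support language learning through culturally relevant, age-appropriate speech generation.

Details Motivation: Children's education needs high-quality speech generation adapted across languages and cultures, and low-resource languages pose a particular challenge. MultiAiTutor aims to address this and give children multilingual learning support.

Contribution: 1. A child-friendly multilingual speech generation system built on an LLM architecture. 2. Culturally relevant, age-appropriate generation tasks for three low-resource languages (Singaporean-accent Mandarin, Malay, and Tamil). 3. Validation through objective metrics and subjective evaluations.

Method: 1. Use an LLM architecture for multilingual speech generation. 2. Use image-description tasks to produce culturally relevant, age-appropriate speech content. 3. Target three low-resource languages.

Result: Experiments show that MultiAiTutor outperforms baseline methods on both objective metrics and subjective evaluations.

Insight: Combining LLMs with a multimodal task (image description) can benefit speech generation for low-resource languages, particularly in children's education.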

Abstract: Generative speech models have demonstrated significant potential in personalizing teacher-student interactions, offering valuable real-world applications for language learning in children’s education. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse languages and cultural contexts. In this paper, we propose MultiAiTutor, an educational multilingual generative AI tutor with child-friendly designs, leveraging LLM architecture for speech generation tailored for educational purposes. We propose to integrate age-appropriate multilingual speech generation using LLM architectures, facilitating young children’s language learning through culturally relevant image-description tasks in three low-resource languages: Singaporean-accent Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiAiTutor compared to baseline methods.

cs.DB [Back]

[100] E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence, and Efficiency

Dongjie Xu,Yue Cui,Weijie Shi,Qingzhi Ma,Hanghui Guo,Jiaming Li,Yao Zhao,Ruiyuan Zhang,Shimin Di,Jia Zhu,Kai Zheng,Jiajie Xu

Main category: cs.DB

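TL;DR: The paper proposes E3-Rewrite, a framework that uses large language models (LLMs) to rewrite SQL queries into executable, equivalent, and efficient forms, overcoming the limitations of rule-based methods.

Details Motivation: Traditional rule-based SQL rewriting generalizes poorly and struggles with complex queries. LLMs can capture complex strategies, but applying them directly can produce suboptimal or non-equivalent rewrites.

Contribution: An LLM-based method that combines a context construction module with a reinforcement learning framework to produce executable, equivalent, and efficient SQL rewrites.

Method: 1. The context module uses execution plans and retrieved demonstrations to build prompts; 2. a reward function targets executability, equivalence, and efficiency; 3. a staged curriculum learning strategy is adopted.

Result: Across multiple SQL benchmarks, E3-Rewrite reduces query execution time by up to 25.6% compared with state-of-the-art methods and delivers up to 24.4% more successful rewrites.

Insight: A reinforcement learning framework with execution awareness and semantic grounding can substantially improve LLMs on SQL rewriting, especially for complex queries.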

Abstract: SQL query rewriting aims to reformulate a query into a more efficient form while preserving equivalence. Most existing methods rely on predefined rewrite rules. However, such rule-based approaches face fundamental limitations: (1) fixed rule sets generalize poorly to novel query patterns and struggle with complex queries; (2) a wide range of effective rewriting strategies cannot be fully captured by declarative rules. To overcome these issues, we propose using large language models (LLMs) to generate rewrites. LLMs can capture complex strategies, such as evaluation reordering and CTE rewriting. Despite this potential, directly applying LLMs often results in suboptimal or non-equivalent rewrites due to a lack of execution awareness and semantic grounding. To address these challenges, we present E3-Rewrite, an LLM-based SQL rewriting framework that produces executable, equivalent, and efficient queries. It integrates two core components: a context construction module and a reinforcement learning framework. First, the context module leverages execution plans and retrieved demonstrations to build bottleneck-aware prompts that guide inference-time rewriting. Second, we design a reward function targeting executability, equivalence, and efficiency, evaluated via syntax checks, equivalence verification, and cost estimation. Third, to ensure stable multi-objective learning, we adopt a staged curriculum that first emphasizes executability and equivalence, then gradually incorporates efficiency. Extensive experiments show that E3-Rewrite achieves up to a 25.6% reduction in query execution time compared to state-of-the-art methods across multiple SQL benchmarks. Moreover, it delivers up to 24.4% more successful rewrites, expanding coverage to complex queries that previous systems failed to handle.
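The staged curriculum described above, which emphasizes executability and equivalence first and phases in efficiency later, can be sketched as a reward whose efficiency term is ramped up over training. The abstract does not give the reward's actual form, so the gating, the unit rewards, the linear ramp, and the names `efficiency_weight` and `rewrite_reward` are all assumptions made for illustration.

```python
def efficiency_weight(step, warmup_steps, ramp_steps, max_weight=1.0):
    """Weight of the efficiency term: 0 during warmup, then a linear ramp."""
    if step < warmup_steps:
        return 0.0
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return max_weight * progress


def rewrite_reward(executable, equivalent, speedup, step,
                   warmup_steps=100, ramp_steps=100):
    """Hypothetical staged reward for one candidate rewrite.

    executable: did the rewrite pass a syntax/execution check?
    equivalent: did it pass equivalence verification?
    speedup:    estimated cost ratio original/rewrite (1.0 = no change).
    """
    if not executable:
        return 0.0          # nothing else matters if the query cannot run
    reward = 1.0
    if not equivalent:
        return reward       # runnable but wrong: no efficiency credit
    reward += 1.0
    gain = max(0.0, speedup - 1.0)
    return reward + efficiency_weight(step, warmup_steps, ramp_steps) * gain
```

Early in training an equivalent rewrite earns the same reward regardless of speed, so the policy first learns to stay correct; the speedup term only starts to matter once the ramp begins.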

cs.LG [Back]

[101] Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI

Dong Xue,Ziyao Shao,Zhaoyang Duan,Fangzhou Liu,Bing Li,Zhongheng Zhang

Main category: cs.LG

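TL;DR: The paper introduces Doctor Sun, a large bilingual multimodal generative model specialized in medicine, built to address the difficulty existing medical multimodal models have with intricate medical concepts and text-image relationships.

Details Motivation: Existing multimodal biomedical AI is typically built on foundation LLMs, which struggle to understand intricate medical concepts given limited medical training data. In addition, recent LLaVA-based medical LMMs fail to capture the intricate relationship between text and images effectively.

Contribution: The Doctor Sun model, which integrates a pre-trained vision encoder with a medical LLM and uses two-stage training for feature alignment and instruction tuning. The authors also release SunMed-VL, a bilingual medical multimodal dataset, along with related resources.

Method: A pre-trained vision encoder is combined with a medical LLM and trained in two stages, feature alignment and instruction tuning, on various medical datasets.

Result: The SunMed-VL dataset is released together with the models, code, and resources to support biomedical multimodal research.

Insight: Doctor Sun relies on feature alignment and multimodal training to improve performance on medical multimodal tasks, and the open release of the dataset supports collaboration in the field.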

Abstract: Large multimodal models (LMMs) have demonstrated significant potential in providing innovative solutions for various biomedical tasks, including pathology analysis, radiology report generation, and biomedical assistance. However, the existing multimodal biomedical AI is typically based on foundation LLMs, thus hindering the understanding of intricate medical concepts with limited medical training data. Moreover, recent LLaVA-induced medical LMMs struggle to effectively capture the intricate relationship between the texts and the images. Therefore, we introduce Doctor Sun, a large multimodal generative model specialized in medicine, developed to encode, integrate, and interpret diverse biomedical data modalities such as text and images. In particular, Doctor Sun integrates a pre-trained vision encoder with a medical LLM and conducts two-stage training on various medical datasets, focusing on feature alignment and instruction tuning. Moreover, we release SunMed-VL, a wide-range bilingual medical multimodal dataset, along with all associated models, code, and resources, to freely support the advancement of biomedical multimodal research.

[102] MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time

Peter Phan,Dhruv Agarwal,Kavitha Srinivas,Horst Samulowitz,Pavan Kapanipathi,Andrew McCallum

Main category: cs.LG

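TL;DR: The paper proposes MiGrATe, a method that adapts LLMs at test time with mixed-policy GRPO, needs no external training data, and improves solution quality.

Details Motivation: Methods that guide optimization through in-context learning struggle to balance exploration and exploitation, while test-time training depends on hand-crafted data, which limits feasibility and scalability. MiGrATe aims to address both problems.

Contribution: MiGrATe introduces a mixed-policy group construction procedure that combines on-policy sampling with greedy sampling and neighborhood sampling, and uses GRPO to adapt LLMs online without external data.

Method: Mixed-policy group construction (on-policy sampling plus greedy and neighborhood sampling) combined with GRPO, balancing exploration and exploitation.

Result: On word search, molecule optimization, and ARC tasks, MiGrATe outperforms both inference-only and test-time training baselines.

Insight: MiGrATe shows the potential of online test-time training for complex search tasks without external supervision.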

Abstract: Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains: word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC). It consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision.
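The mixed-policy group construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_group` and `group_advantages` are hypothetical names, the group sizes are arbitrary, and `neighbor_fn` stands in for whatever procedure generates completions structurally similar to high-reward ones. The advantage computation follows the usual GRPO convention of standardizing rewards within a group.

```python
import statistics


def build_group(on_policy, history, neighbor_fn, n_greedy=2, n_neighbors=2):
    """Assemble one training group from three sources.

    on_policy:   fresh completions sampled from the current policy.
    history:     list of (completion, reward) pairs from earlier steps.
    neighbor_fn: hypothetical callable producing a variant of a completion.
    """
    top = sorted(history, key=lambda item: item[1], reverse=True)
    greedy = [completion for completion, _ in top[:n_greedy]]
    neighbors = [neighbor_fn(completion) for completion, _ in top[:n_neighbors]]
    return list(on_policy) + greedy + neighbors


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: rewards standardized within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


history = [("a", 0.1), ("b", 0.9), ("c", 0.5)]
group = build_group(["x"], history, lambda c: c + "'", n_greedy=1, n_neighbors=1)
```

Adding past high-reward completions and their neighbours to the group shifts the gradient toward promising regions, while the on-policy samples keep some exploration in every update.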

[103] $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models

Jiaxin Ju,Yizhen Zheng,Huan Yee Koh,Can Wang,Shirui Pan

Main category: cs.LG

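TL;DR: The paper proposes $\text{M}^{2}$LLM, a multi-view molecular representation learning framework that combines molecular structure, task, and rules views and uses large language models (LLMs) to generate rich molecular representations, improving molecular property prediction.

Details Motivation: Existing molecular representation methods such as fingerprints and GNNs rely mainly on molecular structure and overlook semantic and contextual knowledge. LLMs have shown strong reasoning abilities in scientific domains, so the paper uses them to integrate knowledge from multiple views and improve molecular representations.

Contribution: 1. The $\text{M}^{2}$LLM framework, which dynamically fuses the molecular structure, task, and rules views. 2. Use of LLM encoding and reasoning capabilities to generate molecular embeddings and features. 3. State-of-the-art performance on multiple classification and regression tasks.

Method: 1. Multi-view fusion: introduce the molecular structure, task, and rules views. 2. Dynamic adaptation: fuse the views dynamically according to task requirements. 3. Generate embeddings with the LLM's encoding capability and curate features through its reasoning process.

Result: On multiple benchmarks, $\text{M}^{2}$LLM achieves state-of-the-art performance on both classification and regression tasks, supporting the multi-view framework and the use of LLMs.

Insight: LLMs are useful beyond natural language tasks: guided multi-view reasoning can produce high-quality molecular representations, offering a new direction for chemistry and drug discovery.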

Abstract: Accurate molecular property prediction is a critical challenge with wide-ranging applications in chemistry, materials science, and drug discovery. Molecular representation methods, including fingerprints and graph neural networks (GNNs), achieve state-of-the-art results by effectively deriving features from molecular structures. However, these methods often overlook decades of accumulated semantic and contextual knowledge. Recent advancements in large language models (LLMs) demonstrate remarkable reasoning abilities and prior knowledge across scientific domains, leading us to hypothesize that LLMs can generate rich molecular representations when guided to reason from multiple perspectives. To address these gaps, we propose $\text{M}^{2}$LLM, a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. These views are fused dynamically to adapt to task requirements, and experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks. Moreover, we demonstrate that representations derived from LLMs achieve exceptional performance by leveraging two core functionalities: the generation of molecular embeddings through their encoding capabilities and the curation of molecular features through advanced reasoning processes.

[104] MoSSDA: A Semi-Supervised Domain Adaptation Framework for Multivariate Time-Series Classification using Momentum Encoder

Seonyoung Kim,Dongil Kim

Main category: cs.LG

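TL;DR: The paper proposes MoSSDA, a semi-supervised domain adaptation framework for multivariate time-series classification that uses a momentum encoder and a mixup-enhanced contrastive module to learn robust, domain-invariant representations, achieving state-of-the-art results on several datasets.

Details Motivation: When training and test data follow different distributions (domain shift), deep learning models can degrade, particularly in multivariate time-series classification. Semi-supervised domain adaptation (SSDA) addresses this, but time-series data are highly sensitive to noise, which makes the problem harder.

Contribution: A novel two-step SSDA framework, MoSSDA, that uses a momentum encoder and a mixup-enhanced contrastive module to learn robust, domain-invariant, and class-discriminative representations, validated on multiple datasets.

Method: MoSSDA works in two steps: 1) a domain-invariant encoder learns features from both source and target domains; 2) a mixup-enhanced positive contrastive module containing a momentum encoder refines the representation. The gradient flow between the encoders and the classifier is separated to obtain rich and complex representations.

Result: On six diverse datasets, MoSSDA achieves state-of-the-art performance for three different backbones and various unlabeled ratios in the target domain. An ablation study confirms that each module is effective.

Insight: By separating the gradient flow and using a momentum encoder, MoSSDA learns consistent and discriminative features from limited labeled target-domain data without data augmentation, which matters most under domain shift.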

Abstract: Deep learning has emerged as the most promising approach in various fields; however, when the distributions of training and test data are different (domain shift), the performance of deep learning models can degrade. Semi-supervised domain adaptation (SSDA) is a major approach for addressing this issue, assuming that a fully labeled training set (source domain) is available, but the test set (target domain) provides labels only for a small subset. In this study, we propose MoSSDA, a novel two-step SSDA framework that uses a momentum encoder, for multivariate time-series classification. Time series data are highly sensitive to noise, and sequential dependencies cause domain shifts resulting in critical performance degradation. To obtain a robust, domain-invariant and class-discriminative representation, MoSSDA employs a domain-invariant encoder to learn features from both source and target domains. Subsequently, the learned features are fed to a mixup-enhanced positive contrastive module consisting of an online momentum encoder. The final classifier is trained with learned features that exhibit consistency and discriminability with limited labeled target domain data, without data augmentation. We applied a two-stage process by separating the gradient flow between the encoders and the classifier to obtain rich and complex representations. Through extensive experiments on six diverse datasets, MoSSDA achieved state-of-the-art performance for three different backbones and various unlabeled ratios in the target domain data. The ablation study confirms that each module, including two-stage learning, is effective in improving the performance. Our code is available at https://github.com/seonyoungKimm/MoSSDA
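For readers unfamiliar with momentum encoders, the sketch below shows the exponential-moving-average update commonly used for them in contrastive learning. The abstract does not state MoSSDA's exact update rule, so this is the generic form rather than the authors' code, and the function name `momentum_update` and the coefficient `m=0.99` are assumptions.

```python
def momentum_update(momentum_params, online_params, m=0.99):
    """EMA update: each momentum parameter drifts slowly toward the online one.

    new_momentum = m * momentum + (1 - m) * online
    Parameters are plain floats here; a real model would apply the same
    rule tensor by tensor, without backpropagating through it.
    """
    return [m * k + (1.0 - m) * q
            for k, q in zip(momentum_params, online_params)]


momentum = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(3):
    momentum = momentum_update(momentum, online, m=0.9)
```

Because the momentum encoder changes slowly, it provides stable targets for the contrastive module even while the online encoder is being updated by gradient descent.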

cs.AI [Back]

[105] Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance

Dongwook Choi,Taeyoon Kwon,Dongil Yang,Hyojun Kim,Jinyoung Yeo

Main category: cs.AI

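TL;DR: The paper proposes a memory-augmented framework for augmented reality (AR) agents that combines spatiotemporal reasoning with users' personal experience to address the poor use of interaction history in complex multi-step tasks.

Details Motivation: Current AR agents support immediate tasks but perform poorly in complex multi-step scenarios, mainly because they cannot capture and use users' long-term experiences and preferences.

Contribution: A memory-augmented AR agent framework with four modules (perception, memory, reasoning, and actuation) aimed at personalized task assistance.

Method: The framework consists of four modules: multimodal perception, persistent spatiotemporal memory storage, spatiotemporal reasoning that synthesizes past and present context, and an actuator for AR communication.

Result: The paper presents an implementation roadmap, an evaluation strategy, and potential application scenarios to show the framework's practical applicability.

Insight: The work paves the way for more intelligent AR systems that connect user interaction history with context-aware task assistance.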

Abstract: Augmented Reality (AR) systems are increasingly integrating foundation models, such as Multimodal Large Language Models (MLLMs), to provide more context-aware and adaptive user experiences. This integration has led to the development of AR agents to support intelligent, goal-directed interactions in real-world environments. While current AR agents effectively support immediate tasks, they struggle with complex multi-step scenarios that require understanding and leveraging users’ long-term experiences and preferences. This limitation stems from their inability to capture, retain, and reason over historical user interactions in spatiotemporal contexts. To address these challenges, we propose a conceptual framework for memory-augmented AR agents that can provide personalized task assistance by learning from and adapting to user-specific experiences over time. Our framework consists of four interconnected modules: (1) Perception Module for multimodal sensor processing, (2) Memory Module for persistent spatiotemporal experience storage, (3) Spatiotemporal Reasoning Module for synthesizing past and present contexts, and (4) Actuator Module for effective AR communication. We further present an implementation roadmap, a future evaluation strategy, a potential target application, and use cases to demonstrate the practical applicability of our framework across diverse domains. We aim for this work to motivate future research toward developing more intelligent AR systems that can effectively bridge users’ interaction history with adaptive, context-aware task assistance.

[106] A Dual-Axis Taxonomy of Knowledge Editing for LLMs: From Mechanisms to Functions

Amir Mohammad Salehoof,Ali Ramezani,Yadollah Yaghoobzadeh,Majid Nili Ahmadabadi

Main category: cs.AI

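TL;DR: This survey proposes a new dual-axis taxonomy for knowledge editing in large language models (LLMs) that covers not only editing mechanisms but also the function of the knowledge being edited, filling a gap in existing surveys.

Details Motivation: Knowledge that LLMs acquire from large text corpora can become outdated or inaccurate, and full retraining is expensive, so knowledge editing is an efficient alternative. Existing surveys focus on editing mechanisms and overlook the function of the knowledge, which leaves the picture incomplete.

Contribution: A function-based taxonomy of knowledge editing that complements the mechanism-based view and gives the field a more holistic perspective.

Method: Existing work is organized along two axes (editing mechanism and knowledge function), and the survey analyzes how different mechanisms apply to different knowledge types: factual, temporal, conceptual, commonsense, and social.

Result: The survey maps the current landscape of knowledge editing, outlines the strengths and limitations of existing methods, and provides a formal problem definition along with evaluation tasks and datasets.

Insight: Editing effectiveness depends heavily on the type of target knowledge, so future work needs to consider mechanism and function together to address open problems in knowledge updating.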

Abstract: Large language models (LLMs) acquire vast knowledge from large text corpora, but this information can become outdated or inaccurate. Since retraining is computationally expensive, knowledge editing offers an efficient alternative – modifying internal knowledge without full retraining. These methods aim to update facts precisely while preserving the model’s overall capabilities. While existing surveys focus on the mechanism of editing (e.g., parameter changes vs. external memory), they often overlook the function of the knowledge being edited. This survey introduces a novel, complementary function-based taxonomy to provide a more holistic view. We examine how different mechanisms apply to various knowledge types – factual, temporal, conceptual, commonsense, and social – highlighting how editing effectiveness depends on the nature of the target knowledge. By organizing our review along these two axes, we map the current landscape, outline the strengths and limitations of existing methods, define the problem formally, survey evaluation tasks and datasets, and conclude with open challenges and future directions.

[107] STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

Chen Li,Han Zhang,Zhantao Yang,Fangyi Chen,Zihan Wang,Anudeepsekhar Bolimera,Marios Savvides

Main category: cs.AI

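TL;DR: STELAR-Vision is a self-topology-aware efficient learning framework that uses the TopoAug data synthesis pipeline and Frugal Learning to improve reasoning ability and output efficiency on multimodal tasks.

Details Motivation: Existing vision-language models (VLMs) rely on chain-of-thought (CoT) reasoning for complex multimodal tasks, yet many tasks benefit from tree or graph topologies. STELAR-Vision addresses this with topology-aware reasoning.

Contribution: 1. The STELAR-Vision framework for topology-aware reasoning. 2. TopoAug, a synthetic data pipeline that enriches training data with diverse topologies. 3. Frugal Learning, which shortens outputs with minimal accuracy loss.

Method: 1. Use TopoAug to generate synthetic data with diverse topological structures. 2. Post-train Qwen2VL models with supervised fine-tuning and reinforcement learning, balancing accuracy and efficiency. 3. Apply Frugal Learning to reduce output length.

Result: On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct and LLaMA-3.2-11B-Vision-Instruct.

Insight: Topology-aware reasoning can substantially improve performance on multimodal tasks, especially complex ones. Frugal Learning offers a way to balance output length against accuracy.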

Abstract: Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks. We have released datasets, and code will be available.

[108] Silicon Minds versus Human Hearts: The Wisdom of Crowds Beats the Wisdom of AI in Emotion Recognition

Mustafa Akben,Vinayaka Gude,Haya Ajjan

Main category: cs.AI

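TL;DR: The study compares multimodal large language models (MLLMs) with humans on emotion recognition. MLLMs outperform humans at the individual level, but aggregated human judgments significantly surpass aggregated MLLM predictions. Combining human and MLLM predictions (augmented intelligence) performs best.

Details Motivation: To examine how well AI, and MLLMs in particular, can recognize emotions, and to compare it with human performance and collective intelligence in order to gauge AI's potential for emotional intelligence.

Contribution: 1. Shows that MLLMs outperform humans on individual emotion recognition; 2. shows that collective human intelligence surpasses AI; 3. shows that an augmented-intelligence approach combining human and MLLM predictions performs best.

Method: Using the Reading the Mind in the Eyes Test (RMET) and its multiracial counterpart (MRMET), the study compares the emotion recognition performance of MLLMs and human participants at the individual, group, and collaborative levels.

Result: 1. MLLMs outperform humans on average at the individual level; 2. aggregated human decisions significantly surpass aggregated MLLM predictions; 3. the human-AI augmented-intelligence approach is the most accurate.

Insight: In emotion recognition, collective intelligence and human-AI collaboration reach accuracy that AI alone does not, pointing to a direction for emotionally intelligent AI.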

Abstract: The ability to discern subtle emotional cues is fundamental to human social intelligence. As artificial intelligence (AI) becomes increasingly common, AI’s ability to recognize and respond to human emotions is crucial for effective human-AI interactions. In particular, whether such systems can match or surpass human experts remains to be seen. However, the emotional intelligence of AI, particularly multimodal large language models (MLLMs), remains largely unexplored. This study evaluates the emotion recognition abilities of MLLMs using the Reading the Mind in the Eyes Test (RMET) and its multiracial counterpart (MRMET), and compares their performance against human participants. Results show that, on average, MLLMs outperform humans in accurately identifying emotions across both tests. This trend persists even when comparing performance across low, medium, and expert-level performing groups. Yet when we aggregate independent human decisions to simulate collective intelligence, human groups significantly surpass the performance of aggregated MLLM predictions, highlighting the wisdom of the crowd. Moreover, a collaborative approach (augmented intelligence) that combines human and MLLM predictions achieves greater accuracy than either humans or MLLMs alone. These results suggest that while MLLMs exhibit strong emotion recognition at the individual level, the collective intelligence of humans and the synergistic potential of human-AI collaboration offer the most promising path toward effective emotional AI. We discuss the implications of these findings for the development of emotionally intelligent AI systems and future research directions.
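The aggregation of independent decisions mentioned above can be illustrated with a simple plurality vote. The abstract does not specify the aggregation rule used in the study, so this is a sketch under that assumption; the names `plurality_vote`, `crowd_predictions`, and `accuracy`, the alphabetical tie-break, and the toy data are all hypothetical.

```python
from collections import Counter


def plurality_vote(answers):
    """Return the most common answer; ties go to the alphabetically first."""
    counts = Counter(answers)
    best = max(counts.values())
    return min(a for a, c in counts.items() if c == best)


def crowd_predictions(responses_per_item):
    """One aggregated prediction per test item."""
    return [plurality_vote(responses) for responses in responses_per_item]


def accuracy(predictions, truth):
    """Fraction of items where the prediction matches the ground truth."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Toy data: three raters, three items; each rater errs on a different item.
truth = ["a", "b", "c"]
responses = [["a", "a", "x"], ["b", "y", "b"], ["z", "c", "c"]]
crowd = crowd_predictions(responses)
first_rater = [item[0] for item in responses]
```

In this toy case every rater is wrong on one item, yet the vote is right on all three because the errors do not coincide, which is the mechanism behind the wisdom-of-the-crowd effect.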

[109] OpenCUA: Open Foundations for Computer-Use Agents

Xinyuan Wang,Bowen Wang,Dunjie Lu,Junlin Yang,Tianbao Xie,Junli Wang,Jiaqi Deng,Xiaole Guo,Yiheng Xu,Chen Henry Wu,Zhennan Shen,Zhuokai Li,Ryan Li,Xiaochuan Li,Junda Chen,Boyuan Zheng,Peihang Li,Fangyu Lei,Ruisheng Cao,Yeqiao Fu,Dongchan Shin,Martin Shin,Jiarui Hu,Yuyan Wang,Jixuan Chen,Yuxiao Ye,Danyang Zhang,Dikang Du,Hao Hu,Huarong Chen,Zaida Zhou,Yipu Wang,Heng Wang,Diyi Yang,Victor Zhong,Flood Sung,Y. Charles,Zhilin Yang,Tao Yu

Main category: cs.AI

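TL;DR: OpenCUA is an open-source framework that gives computer-use agent (CUA) research an open foundation, covering data collection, model training, and evaluation tools.

Details Motivation: As the commercial potential of CUAs grows, details of the most capable systems remain closed, and the research community needs open frameworks to study their capabilities, limitations, and risks.

Contribution: The OpenCUA framework, comprising an annotation tool, the large-scale task dataset AgentNet, and a scalable training pipeline, released together with the related tools and models.

Method: Human demonstrations are captured and annotated at scale, then converted into state-action pairs with reflective long Chain-of-Thought reasoning, which are used to train CUA models.

Result: OpenCUA-32B reaches an average success rate of 34.8% on OSWorld-Verified, a new state of the art among open-source models that also surpasses OpenAI CUA (GPT-4o).

Insight: Open frameworks and data are essential for CUA research, and increased test-time computation significantly improves model performance.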

Abstract: Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.

cs.GR [Back]

[110] Spatiotemporally Consistent Indoor Lighting Estimation with Diffusion Priors

Mutian Tong,Rundi Wu,Changxi Zheng

Main category: cs.GR

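TL;DR: The paper proposes a method that uses diffusion priors to estimate an indoor light field, giving spatiotemporally consistent lighting estimates from a single image or video.

Details Motivation: Indoor lighting estimation is highly ill-posed, especially when scene lighting varies over space and time. Existing methods struggle to produce spatiotemporally consistent lighting estimates for in-the-wild scenes.

Contribution: 1. A method that estimates a continuous light field from video by optimizing an MLP-represented light field with 2D diffusion priors; 2. zero-shot generalization through fine-tuning a pre-trained diffusion model that predicts lighting at multiple locations.

Method: 1. Represent the light field as an MLP; 2. optimize the light field with 2D diffusion priors; 3. fine-tune a pre-trained diffusion model to jointly inpaint multiple chrome balls that act as light probes.

Result: Experiments show the method outperforms baselines on indoor lighting estimation from a single image or video, and it demonstrates spatiotemporally consistent lighting estimation on in-the-wild videos, which previous work has rarely shown.

Insight: Diffusion priors can be used to optimize a light field representation and generalize zero-shot to complex scenes, offering a new approach to lighting estimation.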

Abstract: Indoor lighting estimation from a single image or video remains a challenge due to its highly ill-posed nature, especially when the lighting condition of the scene varies spatially and temporally. We propose a method that estimates from an input video a continuous light field describing the spatiotemporally varying lighting of the scene. We leverage 2D diffusion priors for optimizing such a light field represented as an MLP. To enable zero-shot generalization to in-the-wild scenes, we fine-tune a pre-trained image diffusion model to predict lighting at multiple locations by jointly inpainting multiple chrome balls as light probes. We evaluate our method on indoor lighting estimation from a single image or video and show superior performance over compared baselines. Most importantly, we highlight results on spatiotemporally consistent lighting estimation from in-the-wild videos, which is rarely demonstrated in previous works.

[111] Hybrid Long and Short Range Flows for Point Cloud Filtering

Dasith de Silva Edirimuni,Xuequan Lu,Ajmal Saeed Mian,Lei Wei,Gang Li,Scott Schaefer,Ying He

Main category: cs.GR

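TL;DR: The paper proposes HybridPF, a point cloud filtering method built on hybrid long- and short-range flows. By combining short-range scores with long-range flow information, it improves both denoising quality and inference speed.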

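Details Motivation: Point cloud capture often introduces noise, and existing filtering methods tend to cause point clustering or fail to remove the noise completely. The paper therefore proposes a new method that combines long-range flows with short-range scores to improve denoising.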

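Contribution: 1. Proposes HybridPF, which combines short-range scores (∇ₓ log p(xₜ)) and long-range flows for point cloud filtering. 2. Designs two parallel modules, ShortModule and LongModule, that handle short-range and long-range information respectively. 3. Introduces a dynamic graph convolutional decoder that addresses the decoder architecture limitations of existing displacement-based methods.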

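Method: 1. ShortModule (short-range scores) and LongModule (long-range flows) are built as parallel modules, each consisting of an Encoder-Decoder pair. 2. A joint loss function trains both modules simultaneously, end to end. 3. A dynamic graph convolutional decoder improves the inference process.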

Result: Comprehensive experiments show that HybridPF achieves state-of-the-art results while enabling faster inference.

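Insight: Long-range flow information can guide short-range scores to align more closely with the clean points, while the dynamic graph convolutional decoder addresses the decoder limitations of existing methods.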

Abstract: Point cloud capture processes are error-prone and introduce noisy artifacts that necessitate filtering/denoising. Recent filtering methods often suffer from point clustering or noise retaining issues. In this paper, we propose Hybrid Point Cloud Filtering ($\textbf{HybridPF}$) that considers both short-range and long-range filtering trajectories when removing noise. It is well established that short range scores, given by $\nabla_{x}\log p(x_t)$, may provide the necessary displacements to move noisy points to the underlying clean surface. By contrast, long range velocity flows approximate constant displacements directed from a high noise variant patch $x_0$ towards the corresponding clean surface $x_1$. Here, noisy patches $x_t$ are viewed as intermediate states between the high noise variant and the clean patches. Our intuition is that long range information from velocity flow models can guide the short range scores to align more closely with the clean points. In turn, score models generally provide a quicker convergence to the clean surface. Specifically, we devise two parallel modules, the ShortModule and LongModule, each consisting of an Encoder-Decoder pair to respectively account for short-range scores and long-range flows. We find that short-range scores, guided by long-range features, yield filtered point clouds with good point distributions and convergence near the clean surface. We design a joint loss function to simultaneously train the ShortModule and LongModule, in an end-to-end manner. Finally, we identify a key weakness in current displacement based methods, limitations on the decoder architecture, and propose a dynamic graph convolutional decoder to improve the inference process. Comprehensive experiments demonstrate that our HybridPF achieves state-of-the-art results while enabling faster inference speed.
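
The contrast between short-range scores and long-range constant-velocity flows can be illustrated with a toy example. This is a sketch under stated assumptions, not HybridPF itself: in the paper both quantities are predicted by learned modules, whereas here the score is the closed-form gradient of an isotropic Gaussian centered on the clean point, and the intermediate state $x_t$ is assumed to lie on a straight line between $x_0$ and $x_1$. The function names and numeric values are hypothetical.

```python
def gaussian_score(x, mu, sigma):
    """Score of an isotropic Gaussian N(mu, sigma^2 I), i.e. grad_x log p(x)."""
    return [-(xi - mi) / sigma ** 2 for xi, mi in zip(x, mu)]


def interpolate(x0, x1, t):
    """Assumed straight path between the two states: x_t = (1 - t) * x0 + t * x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def constant_velocity(x0, x1):
    """Constant displacement directed from the high-noise point to the clean point."""
    return [b - a for a, b in zip(x0, x1)]


x1 = [0.0, 0.0, 0.0]       # clean surface point (toy value)
x0 = [0.3, -0.2, 0.5]      # high-noise variant of the same point (toy value)
t = 0.6                    # how far x_t has already moved from x0 toward x1
sigma = 0.1                # assumed noise scale for the toy Gaussian

x_t = interpolate(x0, x1, t)

# Short-range update: follow the local score, scaled by sigma^2.
score = gaussian_score(x_t, x1, sigma)
short_step = [xi + sigma ** 2 * si for xi, si in zip(x_t, score)]

# Long-range update: follow the constant velocity for the remaining time 1 - t.
velocity = constant_velocity(x0, x1)
long_step = [xi + (1 - t) * vi for xi, vi in zip(x_t, velocity)]
```

In this idealized setting both updates land on the clean point. The score depends only on the local density around $x_t$, while the velocity encodes the global direction from $x_0$ to $x_1$, which is the kind of complementary information the abstract describes.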

[112] Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Zixin Yin,Xili Dai,Ling-Hao Chen,Deyu Zhou,Jianan Wang,Duomin Wang,Gang Yu,Lionel M. Ni,Heung-Yeung Shum

Main category: cs.GR

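TL;DR: ColorCtrl is a training-free, text-guided color editing method that uses the attention mechanisms of Multi-Modal Diffusion Transformers (MM-DiT) to achieve precise color control while keeping both edited and non-edited regions consistent.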

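Details Motivation: Existing training-free methods struggle with precise color control and often introduce visual inconsistency, so a more effective approach to color editing is needed.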

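Contribution: Proposes ColorCtrl, which uses the attention mechanisms of MM-DiT to disentangle structure and color, enabling precise and consistent text-guided color editing and reaching state-of-the-art edit quality and consistency.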

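Method: Structure and color are disentangled through targeted manipulation of attention maps and value tokens, which enables word-level control of attribute intensity and region-selective editing.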

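Result: On the SD3 and FLUX.1-dev models, ColorCtrl outperforms existing training-free approaches; when extended to video editing, it shows stronger temporal coherence and editing stability.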

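Insight: Attention mechanisms can be used to disentangle image attributes without additional training, and the method can be extended to other instruction-based editing diffusion models.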

Abstract: Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performances in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
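
As a rough intuition for why attention maps and value tokens can be manipulated separately, the toy sketch below computes attention weights from one set of queries and keys and then applies them to a different set of values. This is a generic illustration of scaled dot-product attention, not ColorCtrl's actual procedure: treating the attention map as carrying "structure" and the values as carrying "color" is a simplifying assumption, and all names and numbers here are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    dim = len(keys[0])
    return [
        softmax([sum(q * k for q, k in zip(qr, kr)) / math.sqrt(dim) for kr in keys])
        for qr in queries
    ]


def apply_attention(attn, values):
    """Mix value tokens according to the given attention weights."""
    width = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in attn
    ]


# Toy "source" branch: its attention map decides where information flows.
q_src = [[1.0, 0.0], [0.0, 1.0]]
k_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_src = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Toy "edit" branch: value tokens that carry the new color (pure red here).
v_edit = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

source_attn = attention_map(q_src, k_src)
original = apply_attention(source_attn, v_src)
edited = apply_attention(source_attn, v_edit)
```

The `edited` output reuses the source attention weights unchanged, so the routing of information stays the same while the content being routed changes.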